# Diglossa Genome Assembly #research #manuscripts

## Initial Information

### PacBio

* PB Run 1: Gb: 11.00560636; Longest subread: 19017; Longest Subread N50 (bp): 31327
* PB Run 2a: Gb: 10.84430589; Longest subread: 21410; Longest Subread N50 (bp): 34819
* PB Run 2b: Gb: 10.83405489; Longest subread: 20529; Longest Subread N50 (bp): 33287

- - - -

## Read data location

* /nfs/data1/genomes/faircloth-lab/diglossa/10x/raw-data/H2FJCCCX2
* /nfs/data1/genomes/faircloth-lab/diglossa/pacbio/raw-data

## Processing data location

* /nfs/data1/genomes/faircloth-lab/diglossa

- - - -

## Analysis steps

### 10x genomics

#### De novo assembly

* downloaded the data from novogene and verified MD5
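* the exact check wasn't recorded here; verification was presumably something along these lines (assuming novogene shipped an `MD5.txt` beside the fastqs — the filename is a guess):

```bash
# sketch only: confirm the delivered checksums match the downloaded fastqs
# (MD5.txt is an assumed filename for the checksum manifest)
cd raw-data/H2FJCCCX2
md5sum -c MD5.txt
```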
* get supernova set up on @qb2

```bash
supernova --version
(2.1.1)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
```

* set up the assembly on @qb2 after uploading the data. initially set up assemblies after downsampling the reads, but after the first attempt, that assembly was only so-so and coverage was lower than expected. so, nuked that run and re-ran supernova using all of the data

```bash
#PBS -q bigmem
#PBS -A loni_bf_qb_02
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=24
#PBS -V
#PBS -N diglossa_assembly
#PBS -o diglossa_assembly.out
#PBS -e diglossa_assembly.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova run \
    --id=diglossa-10x-assembly-all-reads \
    --fastqs=fastqs \
    --maxreads='all' \
    --localcores 24 \
    --localmem 745
```

* output all possible versions of the assemblies using 4 qsub scripts

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_megabubbles_assembly_out.out
#PBS -e diglossa_megabubbles_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=megabubbles \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_megabubbles
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_pseudohap2_assembly_out.out
#PBS -e diglossa_pseudohap2assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=pseudohap2 \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_pseudohap2
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_assembly_out.out
#PBS -e diglossa_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=pseudohap \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_pseudohap
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_raw_assembly_out.out
#PBS -e diglossa_raw_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=raw \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_raw
```

* downloaded assemblies and qsub files after those finished

```bash
rsync -avLP brant@qb.loni.org:/home/brant/work/10x-diglossa-all-data/diglossa-10x-assembly-all-reads/outs/output-assemblies ./
rsync -avLP brant@qb.loni.org:/home/brant/work/10x-diglossa-all-data/diglossa-10x-assembly-all-reads/assembly-qsub ./
```

#### Trimmed Reads for Polishing

* we can trim the 10x barcodes from all the reads and then use those as we would any other mate pair data for things like polishing. I found a handy set of scripts online (proc10xG) to do this:

```
$ python ~/git/proc10xG/filter_10xReads.py --version
filter_10xReads.py version: 0.0.1
```

* run `process_10xReads.py` against the 10x reads and output those in a `trimmed-reads` directory

```bash
python ~/git/proc10xG/process_10xReads.py \
    --output trimmed-reads/A10x_Songbird_S3_L006 \
    -1 raw-data/H2FJCCCX2/A10x_Songbird_S3_L006_R1_001.fastq.gz \
    -2 raw-data/H2FJCCCX2/A10x_Songbird_S3_L006_R2_001.fastq.gz
```

### Pacbio

#### Assembly

* Here's the initial data layout of the pacbio runs

```
1_A01/
1_A01/tmp-file-65b03cb8-d740-4fed-917e-5cf6e727036a.txt
1_A01/.m54193_190328_212540.run.metadata.xml
1_A01/.m54193_190328_212540.metadata.xml
1_A01/m54193_190328_212540.subreadset.xml
1_A01/m54193_190328_212540.subreads.bam
1_A01/m54193_190328_212540.subreads.bam.pbi
1_A01/m54193_190328_212540.scraps.bam
1_A01/m54193_190328_212540.scraps.bam.pbi
1_A01/m54193_190328_212540.adapters.fasta
1_A01/m54193_190328_212540.sts.xml
1_A01/m54193_190328_212540.baz2bam_1.log
1_A01/m54193_190328_212540.transferdone
```

* Unzip data, upload those to @qb2
* get canu installed

```bash
○ → canu --version
Canu snapshot v1.8 +255 changes (r9465 fadb4eb6008c017d0c4b685f0a053c181b7af8b2)
```

* On qb2, for canu we need to convert data from BAM format to fastq, so go ahead and do that:

```bash
#!/bin/bash
#PBS -q single
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=4
#PBS -V
#PBS -N bam_to_fastq
#PBS -o bam_to_fastq.out
#PBS -e bam_to_fastq.err

# load the parallel module to run files in parallel (up to 4 cores in single queue)
module load gnuparallel/20170122

# activate our conda env
source activate pacbio

cd $PBS_O_WORKDIR

mkdir pacbio-fastq && cd pacbio-fastq
find ../pacbio-raw/ -name "*.bam" | parallel "bam2fastq -o {/.} {}"
```

* Now that they are in fastq format, go ahead and start the assembly. Because of the way that canu is set up, we'll just be running in local versus grid mode. After lots of back and forth w/ loni staff about this, it seemed like the best solution was to run the job on the qb2 bigmem nodes. During the run, I ran out of wall-time and needed to restart the job, which I did by simply renaming the stdout and stderr files and re-submitting the script below:

```bash
#!/bin/bash
#PBS -q bigmem
#PBS -A loni_bf_qb_02
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=48
#PBS -V
#PBS -N canu_config
#PBS -o canu_config.out
#PBS -e canu_config.err
#PBS -m abe
#PBS -M brant@faircloth-lab.org

module load gcc/6.4.0
module load java/1.8.0

cd $PBS_O_WORKDIR

mkdir -p canu-assembly && cd canu-assembly

canu \
    -p diglossa \
    -d diglossa-pacbio \
    genomeSize=1.1g \
    useGrid=false \
    -pacbio-raw ../pacbio-fastq/*.fastq.gz 1>canu-assembly.stdout 2>canu-assembly.stderr
```
* post-assembly, I downloaded the assembly files to @nfs over rsync, keeping only the most important ones (basically the files in the @qb2 directory and *NOT* the folders):

```bash
rsync -vt 'brant@qb.loni.org:/home/brant/work/pacbio-diglossa/canu-assembly/diglossa-pacbio/*' ./
```

* A description of the various output files can be found in the [Canu Tutorial — canu 1.8 documentation](https://canu.readthedocs.io/en/latest/tutorial.html#outputs)
* I did a quick compare and contrast of the initial 10x and pacbio assemblies using quast on @bbq:

```bash
quast \
    --output-dir diglossa-initial-compare \
    10x-diglossa.supernova.vanilla.fasta \
    pacbio-diglossa.canu.vanilla.fasta \
    --threads 24 --eukaryote --conserved-genes-finding
```

### Pacbio Polishing

* create dir for longranger output

```bash
mkdir longranger-ouput
```

* we need to process the 10x data w/ long ranger before doing anything else. This removes the 10x barcodes from the reads

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_allbirds02
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=16
#PBS -V
#PBS -N longranger_basic
#PBS -o longranger_basic.out
#PBS -e longranger_basic.err

export PATH=/home/brant/project/shared/bin/longranger-2.2.2/:$PATH

cd $PBS_O_WORKDIR

longranger basic \
    --id=10x-diglossa \
    --fastqs=/home/brant/work/10x-diglossa-scaffold/raw-fastq \
    --localcores 16
```

* Map those reads to the pacbio assembly using bwa v.0.7.17-r1188 and samtools 1.9, following the protocol. The easiest thing to do is probably to create a new directory in `pacbio-polished` named `bwa-aligned`. Also symlink in the *.fastq.gz file from above:

```bash
mkdir bwa-aligned && cd $_
ln -s ../longranger-ouput/outs/barcoded.fastq.gz
```
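* the actual mapping command isn't recorded in these notes; a minimal sketch of that step, assuming the interleaved `barcoded.fastq.gz` from `longranger basic` and the canu contigs (`diglossa.contigs.fa`), producing the `diglossa.contigs.barcoded.bam` that pilon uses below:

```bash
# sketch only - not the exact command that was run
# index the canu contigs, map the interleaved (-p) longranger reads,
# then coordinate-sort and index the BAM for pilon
bwa index diglossa.contigs.fa
bwa mem -t 16 -p diglossa.contigs.fa barcoded.fastq.gz \
    | samtools sort -@ 4 -o diglossa.contigs.barcoded.bam -
samtools index diglossa.contigs.barcoded.bam
```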
* After that run completes, run pilon against the assembly using the bigmem queue on @qb2:

```bash
#!/bin/bash
#PBS -q bigmem
#PBS -A
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=48
#PBS -V
#PBS -N pilon
#PBS -o pilon.out
#PBS -e pilon.err

source activate polishing

cd $PBS_O_WORKDIR

# run pilon
pilon -Xmx1400G --genome diglossa.contigs.fa \
    --bam diglossa.contigs.barcoded.bam \
    --changes --diploid --threads 48 \
    --output diglossa.contigs.polished
```

### Pacbio w/ 10x scaffolding

* we don't need to run long ranger against the data because we already did that, above.
* on @supermike, create a directory to hold the scaffolded data and make a copy of the long ranger data. We already parsed the 10x data to try and scaffold against un-polished contigs. However, here we are scaffolding against polished contigs.

```bash
mkdir 10x-diglossa-scaffold-polished
cp -R ../10x-diglossa-scaffold-unpolished/longranger-output/ ./
```

* we also already copied the barcode multiplicities over, so we don't need to run that step again. see protocol for how we did that.
* run arks

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_anna01
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=16
#PBS -V
#PBS -N arks_scaffolding
#PBS -o arks_scaffolding.out
#PBS -e arks_scaffolding.err

module load gcc/6.4.0
module load perl/5.24.0/INTEL-18.0.0
module load boost/1.63.0/INTEL-18.0.0

export PATH=$HOME/project/shared/bin/:$HOME/project/shared/src/links_v1.8.7:/project/brant/home/anaconda/envs/scaffolding/bin/:$PATH

cd $PBS_O_WORKDIR

make -f arks-make.txt arks-tigmint \
    draft=diglossa.contigs.polished \
    reads=barcoded \
    m=50-30000 o=3 time=1
```

* let's get some assembly stats for all of the pacbio assemblies (ran this October 2020):

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=4:00:00
#PBS -q workq
#PBS -N quast

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

#run quast
quast --output-dir diglossa-assembly-stats --threads 20 --min-contig 1000 \
    --no-plots --no-sv --no-gc --no-snps --no-icarus \
    assemblies/diglossa.contigs.fasta \
    assemblies/diglossa.contigs.polished.fasta \
    diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa
```

* run BUSCO against those results:
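* the exact BUSCO invocations for this step weren't captured here; a minimal sketch in the same form as the BUSCO runs recorded further below (one run per assembly; the `--out` names are placeholders):

```bash
# sketch only: BUSCO v4-style genome runs against each pacbio assembly,
# matching the invocation style used later in these notes
busco --in assemblies/diglossa.contigs.polished.fasta --cpu 20 \
    --out diglossa-polished-busco --mode genome --lineage_dataset eukaryota_odb10
busco --in diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa --cpu 20 \
    --out diglossa-arks-busco --mode genome --lineage_dataset eukaryota_odb10
```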
* BUSCO results seem to indicate a somewhat high level of duplication, so we're going to see if we can remove haplotigs
* first, let's get a copy of the contigs and filter scaffolds < 1000 bp in length

```bash
cp ../canu-assembly-polished-arks-links/diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa.gz ./
# (a gunzip is implied here before the rename)
gunzip diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa.gz
mv diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa diglossa.contigs.polished.arks.fasta

faSize diglossa.contigs.polished.arks.fasta
1172156322 bases (28700 Ns 1172127622 real 1172127622 upper 0 lower) in 3146 sequences in 1 files
Total size: mean 372586.2 sd 1925506.4 min 28 (scaffold3146,28,f1909Z28) max 30446970 (scaffold1,30446870,f380Z28916091k17a0.11m100_f2000z1530779) median 54751
N count: mean 9.1 sd 38.6
U count: mean 372577.1 sd 1925494.4
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

# filter scaffolds < 1000 bp
faFilter -minSize=1000 diglossa.contigs.polished.arks.fasta diglossa.contigs.polished.arks.filtered.fasta

faSize diglossa.contigs.polished.arks.filtered.fasta
1172045046 bases (28700 Ns 1172016346 real 1172016346 upper 0 lower) in 2960 sequences in 1 files
Total size: mean 395961.2 sd 1982772.1 min 1005 (scaffold2960,1005,f1927Z1005) max 30446970 (scaffold1,30446870,f380Z28916091k17a0.11m100_f2000z1530779) median 58600
N count: mean 9.7 sd 39.7
U count: mean 395951.5 sd 1982759.7
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
```

* now, get the raw-read pacbio data back up to QBC by rsync
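* the transfer command wasn't recorded; presumably something like the following, pushing the subread BAMs from @nfs into a `bams/` directory on QBC (paths assembled from locations noted elsewhere in these notes, so treat them as approximate):

```bash
# sketch only: copy the raw pacbio subread BAMs from local storage up to QBC
rsync -avLP /nfs/data1/genomes/faircloth-lab/diglossa/pacbio/raw-data/ \
    brant@qbc.loni.org:/home/brant/work/diglossa/bams/
```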
* those come as BAM files, so we need to convert them to fastq (we're on QBC now)

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 3
#SBATCH -t 24:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=untar

module load parallel/20190222/intel-19.0.5

# activate our conda env
source activate pacbio

export WORK_DIR=/home/brant/work/diglossa
cd $WORK_DIR

mkdir pacbio-fastq && cd pacbio-fastq
find ../bams/ -name "*.bam" | parallel "bam2fastq -o {/.} {}"
```

* get those reads into a single file:

```bash
cp m54193_190328_212540.subreads.fastq.gz ALL.subreads.fastq.gz
cat m54193_190329_183459.subreads.fastq.gz >> ALL.subreads.fastq.gz
cat m54193_190330_044854.subreads.fastq.gz >> ALL.subreads.fastq.gz
```

* map those reads to the diglossa assembly:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 20
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=minimap2

module load samtools

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools faidx diglossa.contigs.polished.arks.filtered.fasta

minimap2 -t 18 -ax map-pb diglossa.contigs.polished.arks.filtered.fasta ../pacbio-fastq/ALL.subreads.fastq.gz --secondary=no | samtools sort -m 60G -o aligned.bam -T tmp.ali
```

* once that's done, generate a histogram of read coverage across the assembly to identify peaks

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools index -@ 24 aligned.bam

purge_haplotigs hist -b aligned.bam -g diglossa.contigs.polished.arks.filtered.fasta -t 48
```

* based on the histogram for that assembly, there's a small peak directly adjacent to a larger peak, which may represent a reasonably small number of haplotigs. Use that histogram to set low = 5, mid = 15, high = 190 for coverage cutoffs. Using those values, compute the file of contig coverages

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_squamate01
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

purge_haplotigs contigcov -i aligned.bam.gencov -l 5 -m 15 -h 190
```

* let's go ahead and model some repeats so we can use those in our contig purging. prep the database:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 12:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=build_db

source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

BuildDatabase -name diglossa_brunneiventris diglossa.contigs.polished.arks.filtered.fasta

# exit the job
date
exit 0
```

* let's model some repeats

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 48:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_modeler

source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

RepeatModeler -database diglossa_brunneiventris -pa 48

# exit the job
date
exit 0
```

* the scaffolds we started with are named poorly, so rename the scaffolds:

```python
from Bio import SeqIO

with open("diglossa.contigs.polished.arks.filtered.fasta") as infile:
    with open("diglossa.contigs.polished.arks.filtered.rename.fasta", 'w') as outfile:
        for record in SeqIO.parse(infile, 'fasta'):
            record.name = ""
            record.description = ""
            record.id = record.id.split(",")[0]
            outfile.write(record.format('fasta'))
```

* finally, use repeat masker with the database we created to mask repeats in the renamed sequence:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 12:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_masker

# activate the correct conda env
source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

RepeatMasker -pa 48 -xsmall -gff -lib RM_267691.ThuNov191759492020/consensi.fa.classified diglossa.contigs.polished.arks.filtered.rename.fasta

date
exit
```

* using the repeats identified w/ repeat masker, make a BED file of the repeat information:

```bash
cat ../repeat-models/diglossa.contigs.polished.arks.filtered.rename.fasta.out.gff | awk '{print $1"\t"$4"\t"$5}' > repeats.bed
```

* we need to rename the contigs in coverage_stats.csv to shore up the contig names (remove the extraneous assembly crap) and make them equivalent to the renamed contigs we used for repeat masking

```python
names = {}
for line in open('diglossa.contigs.polished.arks.filtered.sizes'):
    ls = line.strip().split("\t")
    names[ls[0].split(',')[0]] = ls[0]

with open('coverage_stats.csv') as infile:
    with open('coverage_stats.rename.csv', 'w') as outfile:
        for line in infile:
            if line.startswith('#'):
                outfile.write(line)
            else:
                ls = line.strip().split("\t")
                contig = ls[0].split(',')[0]
                # get replacement
                new_line = line.replace(names[contig], contig)
                outfile.write("{}".format(new_line))
```
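* the `*.sizes` file read by that script isn't generated anywhere above; it was presumably made with `faSize -detailed`, the same way the sizes file for the final assembly is made later in these notes:

```bash
# sketch: two-column (name, length) sizes file for the filtered arks assembly
faSize -detailed diglossa.contigs.polished.arks.filtered.fasta > diglossa.contigs.polished.arks.filtered.sizes
```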
* run the actual purge script to iteratively mark and remove putative haplotigs:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 48:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools faidx diglossa.contigs.polished.arks.filtered.rename.fasta

purge_haplotigs purge -t 48 -g diglossa.contigs.polished.arks.filtered.rename.fasta -c coverage_stats.rename.csv -repeats repeats.bed
```

* rename the purged contig file:

```bash
mv curated.fasta diglossa.haplotig-purged.fasta
```

* let's get pre-purging BUSCO scores to see if we've done any good, on @smic

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco_prehap

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in pre-haplotig/diglossa.contigs.polished.arks.filtered.rename.fasta --cpu 20 --out pre-haplotig-busco --mode genome --lineage_dataset eukaryota_odb10
```

* get post-purging BUSCO scores for the curated data set, on @smic

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco_haplotig

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in haplotig/diglossa.haplotig-purged.fasta --cpu 20 --out haplotig-purged-busco --mode genome --lineage_dataset eukaryota_odb10
```

* those results look good, in terms of removing duplication

```text
# PRE-PURGING

***** Results: *****

C:94.5%[S:88.6%,D:5.9%],F:1.2%,M:4.3%,n:255
241    Complete BUSCOs (C)
226    Complete and single-copy BUSCOs (S)
15     Complete and duplicated BUSCOs (D)
3      Fragmented BUSCOs (F)
11     Missing BUSCOs (M)
255    Total BUSCO groups searched

# PURGED #

***** Results: *****

C:94.9%[S:94.1%,D:0.8%],F:1.2%,M:3.9%,n:255
242    Complete BUSCOs (C)
240    Complete and single-copy BUSCOs (S)
2      Complete and duplicated BUSCOs (D)
3      Fragmented BUSCOs (F)
10     Missing BUSCOs (M)
255    Total BUSCO groups searched
```

* let's scaffold diglossa against Camarhynchus, which is another thraupid with a good assembly.
* download reference assembly (refseq) for Camarhynchus:

```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/933/205/GCF_901933205.1_STF_HiC/GCF_901933205.1_STF_HiC_genomic.fna.gz
```

* filter out the unplaced scaffolds from this assembly - keep only chromosomes

```bash
faSomeRecords GCF_901933205.1_STF_HiC_genomic.fna chromosomes.txt GCF_901933205.1_STF_HiC_genomic.chromosomes.fna
```
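* how the `chromosomes.txt` list used just above was built isn't recorded; presumably it was pulled from the fasta headers, along these lines (assumes the chromosome-level sequences are the only headers containing the word "chromosome" in this RefSeq release):

```bash
# sketch only: list the chromosome-level sequence names for faSomeRecords
gunzip GCF_901933205.1_STF_HiC_genomic.fna.gz
grep ">" GCF_901933205.1_STF_HiC_genomic.fna | grep "chromosome" | \
    awk '{print substr($1, 2)}' > chromosomes.txt
```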
* we need to copy over the purged assembly and mask it:

```bash
cp ../haplotig/diglossa.haplotig-purged.fasta ./

faSize diglossa.haplotig-purged.fasta
1079415679 bases (24100 Ns 1079391579 real 1079391579 upper 0 lower) in 1328 sequences in 1 files
Total size: mean 812813.0 sd 2906699.7 min 1005 (scaffold2960) max 30446970 (scaffold1) median 83362
N count: mean 18.1 sd 55.1
U count: mean 812794.9 sd 2906683.3
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

bedtools maskfasta -soft -fi diglossa.haplotig-purged.fasta -fo diglossa.haplotig-purged.masked.fasta -bed ../haplotig-repeat-models/diglossa.contigs.polished.arks.filtered.rename.fasta.out.gff
```

* let's look at info for the masked version and remove the pre-masked version

```bash
faSize diglossa.haplotig-purged.masked.fasta
1079415679 bases (24100 Ns 1079391579 real 944964677 upper 134426902 lower) in 1328 sequences in 1 files
Total size: mean 812813.0 sd 2906699.7 min 1005 (scaffold2960) max 30446970 (scaffold1) median 83362
N count: mean 18.1 sd 55.1
U count: mean 711569.8 sd 2707184.7
L count: mean 101225.1 sd 217197.5
%12.45 masked total, %12.45 masked real

rm diglossa.haplotig-purged.fasta
```

* replace all gaps w/ 50 Ns. This will let us know which gaps were added by arks (50 bp) and which come from ragtag (100 bp; used in a minute):

```python
import re
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with open('diglossa.haplotig-purged.masked.fasta') as input:
    with open('diglossa.haplotig-purged.masked.50-gaps.fasta', 'w') as output:
        for seq in SeqIO.parse(input, 'fasta'):
            seq_str = str(seq.seq)
            gappy_seq = re.sub('N+', "{}".format('N' * 50), seq_str)
            new_record = SeqRecord(Seq(gappy_seq), id=seq.id, name="", description="")
            output.write(new_record.format('fasta'))
```

* check sizes after 50-gaps - we basically just cut all gaps in half:

```bash
faSize diglossa.haplotig-purged.masked.50-gaps.fasta
1079403629 bases (12050 Ns 1079391579 real 944964677 upper 134426902 lower) in 1328 sequences in 1 files
Total size: mean 812803.9 sd 2906691.5 min 1005 (scaffold2960) max 30446920 (scaffold1) median 83362
N count: mean 9.1 sd 27.6
U count: mean 711569.8 sd 2707184.7
L count: mean 101225.1 sd 217197.5
%12.45 masked total, %12.45 masked real
```

* remove the original

```bash
rm diglossa.haplotig-purged.masked.fasta
```

* run ragtag for scaffolding:

```bash
ragtag.py scaffold -t 12 reference/GCF_901933205.1_STF_HiC_genomic.chromosomes.fna diglossa.haplotig-purged.masked.50-gaps.fasta -o diglossa-ragtag-scaffolds
```

* let's rename the ragtag scaffolds with chromosome names from Camarhynchus (converting genbank IDs to scaff_cp_chrN) and also update the AGP file

```python
from Bio import SeqIO

names = {}
with open('ncbi-to-regular-names.txt') as infile:
    for line in infile:
        ls = line.strip().split()
        names[ls[0]] = ls[1]

with open('ragtag.scaffolds.fasta') as infile:
    with open('ragtag.scaffolds.rename.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            name = seq.id.replace("_RagTag", "")
            if name in names.keys():
                seq.id = names[name]
            seq.name = ""
            seq.description = ""
            outfile.write(seq.format('fasta'))

with open('ragtag.scaffolds.agp') as infile:
    with open('ragtag.scaffolds.rename.agp', 'w') as outfile:
        for line in infile:
            if line.startswith('#'):
                outfile.write(line)
            elif "_RagTag" in line:
                ls = line.strip().split("\t")
                name = ls[0].replace("_RagTag", "")
                ls[0] = names[name]
                outfile.write("{}\n".format("\t".join(ls)))
            else:
                outfile.write(line)
```

* get stats:

```bash
faSize ragtag.scaffolds.rename.fasta
1079476329 bases (84750 Ns 1079391579 real 944964677 upper 134426902 lower) in 601 sequences in 1 files
Total size: mean 1796133.7 sd 11313000.1 min 1005 (scaffold2960) max 155774234 (scaff_cp_chr2) median 44887
N count: mean 141.0 sd 1229.4
U count: mean 1572320.6 sd 10192125.9
L count: mean 223672.0 sd 1143808.9
%12.45 masked total, %12.45 masked real
```

* let's uppercase the entire sequence, so we can run it back through repeat masker:

```python
from Bio import SeqIO

with open('ragtag.scaffolds.rename.fasta') as infile:
    with open('ragtag.scaffolds.rename.upper.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            outfile.write(seq.upper().format('fasta'))
```

* check to ensure uppercase:

```bash
faSize ragtag.scaffolds.rename.upper.fasta
1079476329 bases (84750 Ns 1079391579 real 1079391579 upper 0 lower) in 601 sequences in 1 files
Total size: mean 1796133.7 sd 11313000.1 min 1005 (scaffold2960) max 155774234 (scaff_cp_chr2) median 44887
N count: mean 141.0 sd 1229.4
U count: mean 1795992.6 sd 11312310.2
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
```
* upload that to QBC and repeat mask again using our custom library from above, so we'll have good coordinates for the repeats:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 12:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_masker

# activate the correct conda env
source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa
cd $WORK_DIR

RepeatMasker -s -pa 48 -xsmall -gff -lib RM_267691.ThuNov191759492020/consensi.fa.classified ragtag.scaffolds.rename.upper.fasta

date
exit 0
```

* copy that back down to local storage:

```bash
rsync -avLP --exclude ragtag.scaffolds.rename.upper.fasta brant@qbc.loni.org:/home/brant/work/diglossa/repeat-models/* ./
mv ragtag.scaffolds.rename.upper.fasta.masked diglossa.v1.ragtag.masked.fasta
# move the masking stuff into its own directory
mkdir repeat-masking
mv * repeat-masking
mv repeat-masking/diglossa.v1.ragtag.masked.fasta ./
faSize -detailed diglossa.v1.ragtag.masked.fasta > diglossa.v1.ragtag.masked.sizes
```

* get busco results for that scaffolded, masked assembly

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in diglossa.v1.ragtag.masked.fasta --cpu 20 --out ragtag-masked-busco --mode genome --lineage_dataset eukaryota_odb10
```

* and also get quast data:

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=4:00:00
#PBS -q workq
#PBS -N quast

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

#run quast
quast --output-dir quast-masked-stats --threads 20 --min-contig 1000 \
    --no-plots --no-sv --no-gc --no-snps --no-icarus \
    diglossa.v1.ragtag.masked.fasta
```

* quast stats:

```bash
cat quast-masked-stats/report.txt
All statistics are based on contigs of size >= 1000 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    diglossa.v1.ragtag.masked
# contigs (>= 0 bp)         601
# contigs (>= 1000 bp)      601
# contigs (>= 5000 bp)      555
# contigs (>= 10000 bp)     513
# contigs (>= 25000 bp)     426
# contigs (>= 50000 bp)     285
Total length (>= 0 bp)      1079476329
Total length (>= 1000 bp)   1079476329
Total length (>= 5000 bp)   1079318743
Total length (>= 10000 bp)  1079000693
Total length (>= 25000 bp)  1077552903
Total length (>= 50000 bp)  1072385347
# contigs                   601
Largest contig              155774234
Total length                1079476329
N50                         67281049
N75                         22837129
L50                         6
L75                         12
# N's per 100 kbp           7.85
```

* busco stats:

```text
# BUSCO version is: 4.0.6
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of species: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file diglossa.v1.ragtag.masked.fasta
# BUSCO was run in mode: genome

***** Results: *****

C:94.5%[S:93.7%,D:0.8%],F:1.6%,M:3.9%,n:255
241    Complete BUSCOs (C)
239    Complete and single-copy BUSCOs (S)
2      Complete and duplicated BUSCOs (D)
4      Fragmented BUSCOs (F)
10     Missing BUSCOs (M)
255    Total BUSCO groups searched
```
## NCBI

* Prepare genome for submission to NCBI
* Followed normal steps of establishing a BioSample using an older BioProject
* Uploaded the genome used for annotation, above, with separate AGP files for chromosomes and for unplaced scaffolds
* Received notice from NCBI that the following changes needed to be made:

```text
remove  scaffold2902    2544       vector/etc
mask    scaff_cp_chr1A  73428958   23075522..23075696   mitochondrion (174 bp region)
mask    scaff_cp_chr2   155774234  16047190..16047313   mitochondrion (123 bp region)
```

* Remove scaffold 2902

```python
from Bio import SeqIO

contam = ["scaffold2902"]

with open('../diglossa.v1.ragtag.masked.fasta') as infile:
    with open('diglossa.v1.ragtag.masked.filter.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            if seq.id in contam:
                print(seq.id)
            else:
                outfile.write(seq.format('fasta'))
```

* mask out mtDNA similar regions:

```python
import numpy as np
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with open('diglossa.v1.ragtag.masked.filter.fasta') as input:
    with open('diglossa.v1.ragtag.masked.filter.NCBI.fasta', 'w') as outfile:
        for seq in SeqIO.parse(input, 'fasta'):
            if seq.id == 'scaff_cp_chr1A':
                seq_list = np.array(list(seq.seq))
                # mask the 174 bp mitochondrion-like region
                seq_list[23075522:23075696] = 'N'
                # create new seq
                new_record = SeqRecord(Seq(''.join(seq_list)), id=seq.id, name="", description="")
                outfile.write(new_record.format('fasta'))
            elif seq.id == 'scaff_cp_chr2':
                seq_list = np.array(list(seq.seq))
                # mask the 123 bp mitochondrion-like region
                seq_list[16047190:16047313] = 'N'
                # create new seq
                new_record = SeqRecord(Seq(''.join(seq_list)), id=seq.id, name="", description="")
                outfile.write(new_record.format('fasta'))
            else:
                outfile.write(seq.format('fasta'))
```