# Diglossa Genome Assembly #research #manuscripts

## Initial Information

### PacBio

* PB Run 1: Gb: 11.00560636; Longest subread: 19017; Longest Subread N50 (bp): 31327
* PB Run 2a: Gb: 10.84430589; Longest subread: 21410; Longest Subread N50 (bp): 34819
* PB Run 2b: Gb: 10.83405489; Longest subread: 20529; Longest Subread N50 (bp): 33287

- - - -

## Read data location

* /nfs/data1/genomes/faircloth-lab/diglossa/10x/raw-data/H2FJCCCX2
* /nfs/data1/genomes/faircloth-lab/diglossa/pacbio/raw-data

## Processing data location

* /nfs/data1/genomes/faircloth-lab/diglossa

- - - -

## Analysis steps

### 10x genomics

#### De novo assembly

* downloaded the data from novogene and verified MD5
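* the exact check wasn't recorded here; verification was presumably something along these lines (assuming novogene shipped an `MD5.txt` beside the fastqs — the filename is a guess):

```bash
# sketch only: confirm the delivered checksums match the downloaded fastqs
# (MD5.txt is an assumed filename for the checksum manifest)
cd raw-data/H2FJCCCX2
md5sum -c MD5.txt
```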
* get supernova set up on @qb2

```bash
supernova --version
(2.1.1)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
```

* set up the assembly on @qb2 after uploading the data. initially set up assemblies after downsampling the reads, but after the first attempt, that assembly was only so-so and coverage was lower than expected. so, nuked that run and re-ran supernova using all of the data

```bash
#PBS -q bigmem
#PBS -A loni_bf_qb_02
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=24
#PBS -V
#PBS -N diglossa_assembly
#PBS -o diglossa_assembly.out
#PBS -e diglossa_assembly.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova run \
    --id=diglossa-10x-assembly-all-reads \
    --fastqs=fastqs \
    --maxreads='all' \
    --localcores 24 \
    --localmem 745
```

* output all possible versions of the assemblies using 4 qsub scripts

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_megabubbles_assembly_out.out
#PBS -e diglossa_megabubbles_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=megabubbles \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_megabubbles
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_pseudohap2_assembly_out.out
#PBS -e diglossa_pseudohap2assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=pseudohap2 \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_pseudohap2
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_assembly_out.out
#PBS -e diglossa_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=pseudohap \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_pseudohap
```

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=20
#PBS -V
#PBS -N diglossa_assembly_out
#PBS -o diglossa_raw_assembly_out.out
#PBS -e diglossa_raw_assembly_out_out.err

export PATH=$HOME/project/supernova-2.1.1:$PATH

cd $PBS_O_WORKDIR

supernova mkoutput \
    --style=raw \
    --asmdir=./assembly/ \
    --outprefix=diglossa_v1_raw
```

* downloaded assemblies and qsub files after those finished

```bash
rsync -avLP brant@qb.loni.org:/home/brant/work/10x-diglossa-all-data/diglossa-10x-assembly-all-reads/outs/output-assemblies ./
rsync -avLP brant@qb.loni.org:/home/brant/work/10x-diglossa-all-data/diglossa-10x-assembly-all-reads/assembly-qsub ./
```

#### Trimmed Reads for Polishing

* we can trim the 10x barcodes from all the reads and then use those as we would any other mate pair data for things like polishing. I found a handy set of scripts online (proc10xG) to do this:

```
$ python ~/git/proc10xG/filter_10xReads.py --version
filter_10xReads.py version: 0.0.1
```

* run `process_10xReads.py` against the 10x reads and output those in a `trimmed-reads` directory

```bash
python ~/git/proc10xG/process_10xReads.py \
    --output trimmed-reads/A10x_Songbird_S3_L006 \
    -1 raw-data/H2FJCCCX2/A10x_Songbird_S3_L006_R1_001.fastq.gz \
    -2 raw-data/H2FJCCCX2/A10x_Songbird_S3_L006_R2_001.fastq.gz
```

### Pacbio

#### Assembly

* Here's the initial data layout of the pacbio runs

```
1_A01/
1_A01/tmp-file-65b03cb8-d740-4fed-917e-5cf6e727036a.txt
1_A01/.m54193_190328_212540.run.metadata.xml
1_A01/.m54193_190328_212540.metadata.xml
1_A01/m54193_190328_212540.subreadset.xml
1_A01/m54193_190328_212540.subreads.bam
1_A01/m54193_190328_212540.subreads.bam.pbi
1_A01/m54193_190328_212540.scraps.bam
1_A01/m54193_190328_212540.scraps.bam.pbi
1_A01/m54193_190328_212540.adapters.fasta
1_A01/m54193_190328_212540.sts.xml
1_A01/m54193_190328_212540.baz2bam_1.log
1_A01/m54193_190328_212540.transferdone
```

* Unzip data, upload those to @qb2
* get canu installed

```bash
○ → canu --version
Canu snapshot v1.8 +255 changes (r9465 fadb4eb6008c017d0c4b685f0a053c181b7af8b2)
```

* On qb2, for canu we need to convert data from BAM format to fastq, so go ahead and do that:

```bash
#!/bin/bash
#PBS -q single
#PBS -A loni_bf_qb_02
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=4
#PBS -V
#PBS -N bam_to_fastq
#PBS -o bam_to_fastq.out
#PBS -e bam_to_fastq.err

# load the parallel module to run files in parallel (up to 4 cores in single queue)
module load gnuparallel/20170122

# activate our conda env
source activate pacbio

cd $PBS_O_WORKDIR

mkdir pacbio-fastq && cd pacbio-fastq
find ../pacbio-raw/ -name "*.bam" | parallel "bam2fastq -o {/.} {}"
```

* Now that they are in fastq format, go ahead and start the assembly. Because of the way that canu is set up, we'll just be running in local versus grid mode. After lots of back and forth w/ loni staff about this, it seemed like the best solution was to run the job on the qb2 bigmem nodes. During the run, I ran out of wall-time and needed to restart the job, which I did by simply renaming the stdout and stderr files and re-submitting the script below:

```bash
#!/bin/bash
#PBS -q bigmem
#PBS -A loni_bf_qb_02
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=48
#PBS -V
#PBS -N canu_config
#PBS -o canu_config.out
#PBS -e canu_config.err
#PBS -m abe
#PBS -M brant@faircloth-lab.org

module load gcc/6.4.0
module load java/1.8.0

cd $PBS_O_WORKDIR

mkdir -p canu-assembly && cd canu-assembly

canu \
    -p diglossa \
    -d diglossa-pacbio \
    genomeSize=1.1g \
    useGrid=false \
    -pacbio-raw ../pacbio-fastq/*.fastq.gz 1>canu-assembly.stdout 2>canu-assembly.stderr
```
* post-assembly, I downloaded the assembly files to @nfs over rsync, keeping only the most important ones (basically the files in the @qb2 directory and *NOT* the folders):

```bash
rsync -vt 'brant@qb.loni.org:/home/brant/work/pacbio-diglossa/canu-assembly/diglossa-pacbio/*' ./
```

* A description of the various output files can be found in the [Canu Tutorial — canu 1.8 documentation](https://canu.readthedocs.io/en/latest/tutorial.html#outputs)
* I did a quick compare and contrast of the initial 10x and pacbio assemblies using quast on @bbq:

```bash
quast \
    --output-dir diglossa-initial-compare \
    10x-diglossa.supernova.vanilla.fasta \
    pacbio-diglossa.canu.vanilla.fasta \
    --threads 24 --eukaryote --conserved-genes-finding
```

### Pacbio Polishing

* create dir for longranger output

```bash
mkdir longranger-ouput
```

* we need to process the 10x data w/ long ranger before doing anything else. This removes the 10x barcodes from the reads

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_allbirds02
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=16
#PBS -V
#PBS -N longranger_basic
#PBS -o longranger_basic.out
#PBS -e longranger_basic.err

export PATH=/home/brant/project/shared/bin/longranger-2.2.2/:$PATH

cd $PBS_O_WORKDIR

longranger basic \
    --id=10x-diglossa \
    --fastqs=/home/brant/work/10x-diglossa-scaffold/raw-fastq \
    --localcores 16
```

* Map those reads to the pacbio assembly using bwa v.0.7.17-r1188 and samtools 1.9, following the protocol. The easiest thing to do is probably to create a new directory in `pacbio-polished` named `bwa-aligned`. Also symlink in the *.fastq.gz file from above:

```bash
mkdir bwa-aligned && cd $_
ln -s ../longranger-ouput/outs/barcoded.fastq.gz
```
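* the actual mapping command isn't recorded in these notes; a minimal sketch of that step, assuming the interleaved `barcoded.fastq.gz` from `longranger basic` and the canu contigs (`diglossa.contigs.fa`), producing the `diglossa.contigs.barcoded.bam` that pilon uses below:

```bash
# sketch only - not the exact command that was run
# index the canu contigs, map the interleaved (-p) longranger reads,
# then coordinate-sort and index the BAM for pilon
bwa index diglossa.contigs.fa
bwa mem -t 16 -p diglossa.contigs.fa barcoded.fastq.gz \
    | samtools sort -@ 4 -o diglossa.contigs.barcoded.bam -
samtools index diglossa.contigs.barcoded.bam
```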
* After that run completes, run pilon against the assembly using the bigmem queue on @qb2:

```bash
#!/bin/bash
#PBS -q bigmem
#PBS -A
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=48
#PBS -V
#PBS -N pilon
#PBS -o pilon.out
#PBS -e pilon.err

source activate polishing

cd $PBS_O_WORKDIR

# run pilon
pilon -Xmx1400G --genome diglossa.contigs.fa \
    --bam diglossa.contigs.barcoded.bam \
    --changes --diploid --threads 48 \
    --output diglossa.contigs.polished
```

### Pacbio w/ 10x scaffolding

* we don't need to run long ranger against the data because we already did that, above.
* on @supermike, create a directory to hold the scaffolded data and make a copy of the long ranger data. We already parsed the 10x data to try and scaffold against un-polished contigs. However, here we are scaffolding against polished contigs.

```bash
mkdir 10x-diglossa-scaffold-polished
cp -R ../10x-diglossa-scaffold-unpolished/longranger-output/ ./
```

* we also already copied the barcode multiplicities over, so we don't need to run that step again. see protocol for how we did that.
* run arks

```bash
#!/bin/bash
#PBS -q checkpt
#PBS -A hpc_anna01
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=16
#PBS -V
#PBS -N arks_scaffolding
#PBS -o arks_scaffolding.out
#PBS -e arks_scaffolding.err

module load gcc/6.4.0
module load perl/5.24.0/INTEL-18.0.0
module load boost/1.63.0/INTEL-18.0.0

export PATH=$HOME/project/shared/bin/:$HOME/project/shared/src/links_v1.8.7:/project/brant/home/anaconda/envs/scaffolding/bin/:$PATH

cd $PBS_O_WORKDIR

make -f arks-make.txt arks-tigmint \
    draft=diglossa.contigs.polished \
    reads=barcoded \
    m=50-30000 o=3 time=1
```

* let's get some assembly stats for all of the pacbio assemblies (ran this October 2020):

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=4:00:00
#PBS -q workq
#PBS -N quast

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

#run quast
quast --output-dir diglossa-assembly-stats --threads 20 --min-contig 1000 \
    --no-plots --no-sv --no-gc --no-snps --no-icarus \
    assemblies/diglossa.contigs.fasta \
    assemblies/diglossa.contigs.polished.fasta \
    diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa
```

* run BUSCO against those results:
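* the exact BUSCO invocations for this step weren't captured here; a minimal sketch in the same form as the BUSCO runs recorded further below (one run per assembly; the `--out` names are placeholders):

```bash
# sketch only: BUSCO v4-style genome runs against each pacbio assembly,
# matching the invocation style used later in these notes
busco --in assemblies/diglossa.contigs.polished.fasta --cpu 20 \
    --out diglossa-polished-busco --mode genome --lineage_dataset eukaryota_odb10
busco --in diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa --cpu 20 \
    --out diglossa-arks-busco --mode genome --lineage_dataset eukaryota_odb10
```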
* BUSCO results seem to indicate a somewhat high level of duplication, so we're going to see if we can remove haplotigs
* first, let's get a copy of the contigs and filter scaffolds < 1000 bp in length

```bash
cp ../canu-assembly-polished-arks-links/diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa.gz ./
# (a gunzip is implied here before the rename)
gunzip diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa.gz
mv diglossa.contigs.polished.tigmint_c5_m50-30000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa diglossa.contigs.polished.arks.fasta

faSize diglossa.contigs.polished.arks.fasta
1172156322 bases (28700 Ns 1172127622 real 1172127622 upper 0 lower) in 3146 sequences in 1 files
Total size: mean 372586.2 sd 1925506.4 min 28 (scaffold3146,28,f1909Z28) max 30446970 (scaffold1,30446870,f380Z28916091k17a0.11m100_f2000z1530779) median 54751
N count: mean 9.1 sd 38.6
U count: mean 372577.1 sd 1925494.4
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

# filter scaffolds < 1000 bp
faFilter -minSize=1000 diglossa.contigs.polished.arks.fasta diglossa.contigs.polished.arks.filtered.fasta

faSize diglossa.contigs.polished.arks.filtered.fasta
1172045046 bases (28700 Ns 1172016346 real 1172016346 upper 0 lower) in 2960 sequences in 1 files
Total size: mean 395961.2 sd 1982772.1 min 1005 (scaffold2960,1005,f1927Z1005) max 30446970 (scaffold1,30446870,f380Z28916091k17a0.11m100_f2000z1530779) median 58600
N count: mean 9.7 sd 39.7
U count: mean 395951.5 sd 1982759.7
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
```

* now, get the raw-read pacbio data back up to QBC by rsync
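* the transfer command wasn't recorded; presumably something like the following, pushing the subread BAMs from @nfs into a `bams/` directory on QBC (paths assembled from locations noted elsewhere in these notes, so treat them as approximate):

```bash
# sketch only: copy the raw pacbio subread BAMs from local storage up to QBC
rsync -avLP /nfs/data1/genomes/faircloth-lab/diglossa/pacbio/raw-data/ \
    brant@qbc.loni.org:/home/brant/work/diglossa/bams/
```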
* those come as BAM files, so we need to convert them to fastq (we're on QBC now)

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 3
#SBATCH -t 24:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=untar

module load parallel/20190222/intel-19.0.5

# activate our conda env
source activate pacbio

export WORK_DIR=/home/brant/work/diglossa
cd $WORK_DIR

mkdir pacbio-fastq && cd pacbio-fastq
find ../bams/ -name "*.bam" | parallel "bam2fastq -o {/.} {}"
```

* get those reads into a single file:

```bash
cp m54193_190328_212540.subreads.fastq.gz ALL.subreads.fastq.gz
cat m54193_190329_183459.subreads.fastq.gz >> ALL.subreads.fastq.gz
cat m54193_190330_044854.subreads.fastq.gz >> ALL.subreads.fastq.gz
```

* map those reads to the diglossa assembly:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 20
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=minimap2

module load samtools

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools faidx diglossa.contigs.polished.arks.filtered.fasta

minimap2 -t 18 -ax map-pb diglossa.contigs.polished.arks.filtered.fasta ../pacbio-fastq/ALL.subreads.fastq.gz --secondary=no | samtools sort -m 60G -o aligned.bam -T tmp.ali
```

* once that's done, generate a histogram of read coverage across the assembly to identify peaks

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools index -@ 24 aligned.bam

purge_haplotigs hist -b aligned.bam -g diglossa.contigs.polished.arks.filtered.fasta -t 48
```

* based on the histogram for that assembly, there's a small peak directly adjacent to a larger peak, which may represent a reasonably small number of haplotigs. Use that histogram to set low = 5, mid = 15, high = 190 for coverage cutoffs. Using those values, compute the file of contig coverages

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00
#SBATCH -p single
#SBATCH -A loni_squamate01
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

purge_haplotigs contigcov -i aligned.bam.gencov -l 5 -m 15 -h 190
```

* let's go ahead and model some repeats so we can use those in our contig purging. prep the database:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 12:00:00
#SBATCH -p single
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=build_db

source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

BuildDatabase -name diglossa_brunneiventris diglossa.contigs.polished.arks.filtered.fasta

# exit the job
date
exit 0
```

* let's model some repeats

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 48:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_03
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_modeler

source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

RepeatModeler -database diglossa_brunneiventris -pa 48

# exit the job
date
exit 0
```

* the scaffolds we started with are named poorly, so rename the scaffolds:

```python
from Bio import SeqIO

with open("diglossa.contigs.polished.arks.filtered.fasta") as infile:
    with open("diglossa.contigs.polished.arks.filtered.rename.fasta", 'w') as outfile:
        for record in SeqIO.parse(infile, 'fasta'):
            record.name = ""
            record.description = ""
            record.id = record.id.split(",")[0]
            outfile.write(record.format('fasta'))
```

* finally, use repeat masker with the database we created to mask repeats in the renamed sequence:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 12:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_masker

# activate the correct conda env
source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa/repeat-models
cd $WORK_DIR

RepeatMasker -pa 48 -xsmall -gff -lib RM_267691.ThuNov191759492020/consensi.fa.classified diglossa.contigs.polished.arks.filtered.rename.fasta

date
exit
```

* using the repeats identified w/ repeat masker, make a BED file of the repeat information:

```bash
cat ../repeat-models/diglossa.contigs.polished.arks.filtered.rename.fasta.out.gff | awk '{print $1"\t"$4"\t"$5}' > repeats.bed
```

* we need to rename the contigs in coverage_stats.csv to shore up the contig names (remove the extraneous assembly crap) and make them equivalent to the renamed contigs we used for repeat masking

```python
names = {}
for line in open('diglossa.contigs.polished.arks.filtered.sizes'):
    ls = line.strip().split("\t")
    names[ls[0].split(',')[0]] = ls[0]

with open('coverage_stats.csv') as infile:
    with open('coverage_stats.rename.csv', 'w') as outfile:
        for line in infile:
            if line.startswith('#'):
                outfile.write(line)
            else:
                ls = line.strip().split("\t")
                contig = ls[0].split(',')[0]
                # get replacement
                new_line = line.replace(names[contig], contig)
                outfile.write("{}".format(new_line))
```
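* the `*.sizes` file read by that script isn't generated anywhere above; it was presumably made with `faSize -detailed`, the same way the sizes file for the final assembly is made later in these notes:

```bash
# sketch: two-column (name, length) sizes file for the filtered arks assembly
faSize -detailed diglossa.contigs.polished.arks.filtered.fasta > diglossa.contigs.polished.arks.filtered.sizes
```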
* run the actual purge script to iteratively mark and remove putative haplotigs:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 48:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=purge_haplotigs

source activate purge_haplotigs

date

export WORK_DIR=/home/brant/work/diglossa/haplotigs
cd $WORK_DIR
mkdir -p $WORK_DIR/tmp
export TMPDIR=$WORK_DIR/tmp

samtools faidx diglossa.contigs.polished.arks.filtered.rename.fasta

purge_haplotigs purge -t 48 -g diglossa.contigs.polished.arks.filtered.rename.fasta -c coverage_stats.rename.csv -repeats repeats.bed
```

* rename the purged contig file:

```bash
mv curated.fasta diglossa.haplotig-purged.fasta
```

* let's get pre-purging BUSCO scores to see if we've done any good, on @smic

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco_prehap

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in pre-haplotig/diglossa.contigs.polished.arks.filtered.rename.fasta --cpu 20 --out pre-haplotig-busco --mode genome --lineage_dataset eukaryota_odb10
```

* get post-purging BUSCO scores for the curated data set, on @smic

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco_haplotig

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in haplotig/diglossa.haplotig-purged.fasta --cpu 20 --out haplotig-purged-busco --mode genome --lineage_dataset eukaryota_odb10
```

* those results look good, in terms of removing duplication

```text
# PRE-PURGING

***** Results: *****

C:94.5%[S:88.6%,D:5.9%],F:1.2%,M:4.3%,n:255
241    Complete BUSCOs (C)
226    Complete and single-copy BUSCOs (S)
15     Complete and duplicated BUSCOs (D)
3      Fragmented BUSCOs (F)
11     Missing BUSCOs (M)
255    Total BUSCO groups searched

# PURGED #

***** Results: *****

C:94.9%[S:94.1%,D:0.8%],F:1.2%,M:3.9%,n:255
242    Complete BUSCOs (C)
240    Complete and single-copy BUSCOs (S)
2      Complete and duplicated BUSCOs (D)
3      Fragmented BUSCOs (F)
10     Missing BUSCOs (M)
255    Total BUSCO groups searched
```

* let's scaffold diglossa against Camarhynchus, which is another thraupid with a good assembly.
* download reference assembly (refseq) for Camarhynchus:

```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/933/205/GCF_901933205.1_STF_HiC/GCF_901933205.1_STF_HiC_genomic.fna.gz
```

* filter out the unplaced scaffolds from this assembly - keep only chromosomes

```bash
faSomeRecords GCF_901933205.1_STF_HiC_genomic.fna chromosomes.txt GCF_901933205.1_STF_HiC_genomic.chromosomes.fna
```
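* how the `chromosomes.txt` list used just above was built isn't recorded; presumably it was pulled from the fasta headers, along these lines (assumes the chromosome-level sequences are the only headers containing the word "chromosome" in this RefSeq release):

```bash
# sketch only: list the chromosome-level sequence names for faSomeRecords
gunzip GCF_901933205.1_STF_HiC_genomic.fna.gz
grep ">" GCF_901933205.1_STF_HiC_genomic.fna | grep "chromosome" | \
    awk '{print substr($1, 2)}' > chromosomes.txt
```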
* we need to copy over the purged assembly and mask it:

```bash
cp ../haplotig/diglossa.haplotig-purged.fasta ./

faSize diglossa.haplotig-purged.fasta
1079415679 bases (24100 Ns 1079391579 real 1079391579 upper 0 lower) in 1328 sequences in 1 files
Total size: mean 812813.0 sd 2906699.7 min 1005 (scaffold2960) max 30446970 (scaffold1) median 83362
N count: mean 18.1 sd 55.1
U count: mean 812794.9 sd 2906683.3
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

bedtools maskfasta -soft -fi diglossa.haplotig-purged.fasta -fo diglossa.haplotig-purged.masked.fasta -bed ../haplotig-repeat-models/diglossa.contigs.polished.arks.filtered.rename.fasta.out.gff
```

* let's look at info for the masked version and remove the pre-masked version

```bash
faSize diglossa.haplotig-purged.masked.fasta
1079415679 bases (24100 Ns 1079391579 real 944964677 upper 134426902 lower) in 1328 sequences in 1 files
Total size: mean 812813.0 sd 2906699.7 min 1005 (scaffold2960) max 30446970 (scaffold1) median 83362
N count: mean 18.1 sd 55.1
U count: mean 711569.8 sd 2707184.7
L count: mean 101225.1 sd 217197.5
%12.45 masked total, %12.45 masked real

rm diglossa.haplotig-purged.fasta
```

* replace all gaps w/ 50 Ns. This will let us know which gaps were added by arks (50 bp) and which come from ragtag (100 bp; used in a minute):

```python
import re
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with open('diglossa.haplotig-purged.masked.fasta') as input:
    with open('diglossa.haplotig-purged.masked.50-gaps.fasta', 'w') as output:
        for seq in SeqIO.parse(input, 'fasta'):
            seq_str = str(seq.seq)
            gappy_seq = re.sub('N+', "{}".format('N' * 50), seq_str)
            new_record = SeqRecord(Seq(gappy_seq), id=seq.id, name="", description="")
            output.write(new_record.format('fasta'))
```

* check sizes after 50-gaps - we basically just cut all gaps in half:

```bash
faSize diglossa.haplotig-purged.masked.50-gaps.fasta
1079403629 bases (12050 Ns 1079391579 real 944964677 upper 134426902 lower) in 1328 sequences in 1 files
Total size: mean 812803.9 sd 2906691.5 min 1005 (scaffold2960) max 30446920 (scaffold1) median 83362
N count: mean 9.1 sd 27.6
U count: mean 711569.8 sd 2707184.7
L count: mean 101225.1 sd 217197.5
%12.45 masked total, %12.45 masked real
```

* remove the original

```bash
rm diglossa.haplotig-purged.masked.fasta
```

* run ragtag for scaffolding:

```bash
ragtag.py scaffold -t 12 reference/GCF_901933205.1_STF_HiC_genomic.chromosomes.fna diglossa.haplotig-purged.masked.50-gaps.fasta -o diglossa-ragtag-scaffolds
```

* let's rename the ragtag scaffolds with chromosome names from Camarhynchus (converting genbank IDs to scaff_cp_chrN) and also update the AGP file

```python
from Bio import SeqIO

names = {}
with open('ncbi-to-regular-names.txt') as infile:
    for line in infile:
        ls = line.strip().split()
        names[ls[0]] = ls[1]

with open('ragtag.scaffolds.fasta') as infile:
    with open('ragtag.scaffolds.rename.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            name = seq.id.replace("_RagTag", "")
            if name in names.keys():
                seq.id = names[name]
            seq.name = ""
            seq.description = ""
            outfile.write(seq.format('fasta'))

with open('ragtag.scaffolds.agp') as infile:
    with open('ragtag.scaffolds.rename.agp', 'w') as outfile:
        for line in infile:
            if line.startswith('#'):
                outfile.write(line)
            elif "_RagTag" in line:
                ls = line.strip().split("\t")
                name = ls[0].replace("_RagTag", "")
                ls[0] = names[name]
                outfile.write("{}\n".format("\t".join(ls)))
            else:
                outfile.write(line)
```

* get stats:

```bash
faSize ragtag.scaffolds.rename.fasta
1079476329 bases (84750 Ns 1079391579 real 944964677 upper 134426902 lower) in 601 sequences in 1 files
Total size: mean 1796133.7 sd 11313000.1 min 1005 (scaffold2960) max 155774234 (scaff_cp_chr2) median 44887
N count: mean 141.0 sd 1229.4
U count: mean 1572320.6 sd 10192125.9
L count: mean 223672.0 sd 1143808.9
%12.45 masked total, %12.45 masked real
```

* let's uppercase the entire sequence, so we can run it back through repeat masker:

```python
from Bio import SeqIO

with open('ragtag.scaffolds.rename.fasta') as infile:
    with open('ragtag.scaffolds.rename.upper.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            outfile.write(seq.upper().format('fasta'))
```

* check to ensure uppercase:

```bash
faSize ragtag.scaffolds.rename.upper.fasta
1079476329 bases (84750 Ns 1079391579 real 1079391579 upper 0 lower) in 601 sequences in 1 files
Total size: mean 1796133.7 sd 11313000.1 min 1005 (scaffold2960) max 155774234 (scaff_cp_chr2) median 44887
N count: mean 141.0 sd 1229.4
U count: mean 1795992.6 sd 11312310.2
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
```
* upload that to QBC and repeat mask again using our custom library from above, so we'll have good coordinates for the repeats:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 12:00:00
#SBATCH -p workq
#SBATCH -A loni_bf_qb_04
#SBATCH -o slurm-%x-%j.out-%N
#SBATCH -e slurm-%x-%j.err-%N
#SBATCH --job-name=r_masker

# activate the correct conda env
source activate genome-annot-repeat

date

export WORK_DIR=/home/brant/work/diglossa
cd $WORK_DIR

RepeatMasker -s -pa 48 -xsmall -gff -lib RM_267691.ThuNov191759492020/consensi.fa.classified ragtag.scaffolds.rename.upper.fasta

date
exit 0
```

* copy that back down to local storage:

```bash
rsync -avLP --exclude ragtag.scaffolds.rename.upper.fasta brant@qbc.loni.org:/home/brant/work/diglossa/repeat-models/* ./
mv ragtag.scaffolds.rename.upper.fasta.masked diglossa.v1.ragtag.masked.fasta
# move the masking stuff into its own directory
mkdir repeat-masking
mv * repeat-masking
mv repeat-masking/diglossa.v1.ragtag.masked.fasta ./
faSize -detailed diglossa.v1.ragtag.masked.fasta > diglossa.v1.ragtag.masked.sizes
```

* get busco results for that scaffolded, masked assembly

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=06:00:00
#PBS -q checkpt
#PBS -N busco

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

busco --in diglossa.v1.ragtag.masked.fasta --cpu 20 --out ragtag-masked-busco --mode genome --lineage_dataset eukaryota_odb10
```

* and also get quast data:

```bash
#!/bin/bash
#PBS -A hpc_allbirds04
#PBS -l nodes=1:ppn=20
#PBS -l walltime=4:00:00
#PBS -q workq
#PBS -N quast

# activate the correct conda env
source activate genome-eval

# move into the directory containing this script
cd $PBS_O_WORKDIR

#run quast
quast --output-dir quast-masked-stats --threads 20 --min-contig 1000 \
    --no-plots --no-sv --no-gc --no-snps --no-icarus \
    diglossa.v1.ragtag.masked.fasta
```

* quast stats:

```bash
cat quast-masked-stats/report.txt
All statistics are based on contigs of size >= 1000 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    diglossa.v1.ragtag.masked
# contigs (>= 0 bp)         601
# contigs (>= 1000 bp)      601
# contigs (>= 5000 bp)      555
# contigs (>= 10000 bp)     513
# contigs (>= 25000 bp)     426
# contigs (>= 50000 bp)     285
Total length (>= 0 bp)      1079476329
Total length (>= 1000 bp)   1079476329
Total length (>= 5000 bp)   1079318743
Total length (>= 10000 bp)  1079000693
Total length (>= 25000 bp)  1077552903
Total length (>= 50000 bp)  1072385347
# contigs                   601
Largest contig              155774234
Total length                1079476329
N50                         67281049
N75                         22837129
L50                         6
L75                         12
# N's per 100 kbp           7.85
```

* busco stats:

```text
# BUSCO version is: 4.0.6
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of species: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file diglossa.v1.ragtag.masked.fasta
# BUSCO was run in mode: genome

***** Results: *****

C:94.5%[S:93.7%,D:0.8%],F:1.6%,M:3.9%,n:255
241    Complete BUSCOs (C)
239    Complete and single-copy BUSCOs (S)
2      Complete and duplicated BUSCOs (D)
4      Fragmented BUSCOs (F)
10     Missing BUSCOs (M)
255    Total BUSCO groups searched
```
## NCBI

* Prepare genome for submission to NCBI
* Followed normal steps of establishing a BioSample using an older BioProject
* Uploaded the genome used for annotation, above, with separate AGP files for chromosomes and for unplaced scaffolds
* Received notice from NCBI that the following changes needed to be made:

```text
remove  scaffold2902    2544       vector/etc
mask    scaff_cp_chr1A  73428958   23075522..23075696   mitochondrion (174 bp region)
mask    scaff_cp_chr2   155774234  16047190..16047313   mitochondrion (123 bp region)
```

* Remove scaffold 2902

```python
from Bio import SeqIO

contam = ["scaffold2902"]

with open('../diglossa.v1.ragtag.masked.fasta') as infile:
    with open('diglossa.v1.ragtag.masked.filter.fasta', 'w') as outfile:
        for seq in SeqIO.parse(infile, 'fasta'):
            if seq.id in contam:
                print(seq.id)
            else:
                outfile.write(seq.format('fasta'))
```

* mask out mtDNA similar regions:

```python
import numpy as np
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with open('diglossa.v1.ragtag.masked.filter.fasta') as input:
    with open('diglossa.v1.ragtag.masked.filter.NCBI.fasta', 'w') as outfile:
        for seq in SeqIO.parse(input, 'fasta'):
            if seq.id == 'scaff_cp_chr1A':
                seq_list = np.array(list(seq.seq))
                # mask the 174 bp mitochondrion-like region
                seq_list[23075522:23075696] = 'N'
                # create new seq
                new_record = SeqRecord(Seq(''.join(seq_list)), id=seq.id, name="", description="")
                outfile.write(new_record.format('fasta'))
            elif seq.id == 'scaff_cp_chr2':
                seq_list = np.array(list(seq.seq))
                # mask the 123 bp mitochondrion-like region
                seq_list[16047190:16047313] = 'N'
                # create new seq
                new_record = SeqRecord(Seq(''.join(seq_list)), id=seq.id, name="", description="")
                outfile.write(new_record.format('fasta'))
            else:
                outfile.write(seq.format('fasta'))
```