library(tidyverse)

Sequencing and Basecalling
Sequencing was performed on two flow cells, an R9.4.1 and an R10.4.1.
Each cell was loaded and run for 24 hours, then nuclease flushed and
reloaded an additional 24 hours for a total of 72 hours. Post-hoc
basecalling was performed using guppy with the following parameters to
include modified basecalling at CpG sites.
guppy_basecaller --input_path . -r -s sgm-output/ -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg --bam_out --compress_fastq --nested_output_folder --min_qscore 10`
Full Assembly
Flye --asm-coverage 128
Assemblies were compared with the full read set and a subset of
>3400 bp reads. The full read set yielded better assemblies as
measured by BUSCO and was used for all further processing.
Install
Flye 2.9.1-b1780 was installed from github via conda
conda env create -n flye -c bioconda flye=2.9.1
conda activate flye
Run
The --asm-coverage
option limits the minimum required
coverage for the initial disjointig extension state, a memory bottleneck
in the program. Setting --genome-size
also reduces memory
usage.
flye --nano-hq sgm-total.fastq --out-dir flye_results_sgmtotal --genome-size 200m --asm-coverage 128 --threads 23 --no-alt-contigs
The run completed in ~3 hours on a 12 core AMD 5900.
BUSCO
BUSCO scores are used to assess assembly quality without reliance on
inherent non-biologically relevant characteristics of the assembly. It
identifies phylum-unique genes within the assembly. So, this assembly
should contain a high number of the genes unique to Diptera
Install
The command line package busco
v5.4.2 was installed in a
mamba
virtual environment on via
mamba install -c bioconda -c conda-forge busco=5.4.2
.
Note: the mamba documentation says you should still use conda to
activate and deactivate. Full environment specs in
busco_conda_env.txt
.
Run
busco
in genome mode was run in a mamba (conda) virtual
environment, taking approximately 15 minutes.
Diptera lineage
busco -i assembly.fasta -o busco_output_draft -m genome --lineage diptera -c 24
Results via auto-generated short_summary text file:
# BUSCO version is: 5.4.2
# The lineage dataset is: diptera_odb10 (Creation date: 2020-08-05, number of genomes: 56, number of BUSCOs: 3285)
# Summarized benchmarking in BUSCO notation for file /home/cfaulk/Desktop/ont-sequencing_runs/soybean_gall_midge/sgm-analysis/flye_results_sgmtotal/medaka-results/consensus.fasta
# BUSCO was run in mode: euk_genome_met
# Gene predictor used: metaeuk
***** Results: *****
C:87.9%[S:85.1%,D:2.8%],F:1.9%,M:10.2%,n:3285
2886 Complete BUSCOs (C)
2795 Complete and single-copy BUSCOs (S)
91 Complete and duplicated BUSCOs (D)
63 Fragmented BUSCOs (F)
336 Missing BUSCOs (M)
3285 Total BUSCO groups searched
Assembly Statistics:
2613 Number of scaffolds
2613 Number of contigs
211232549 Total length
0.000% Percent gaps
698 KB Scaffold N50
698 KB Contigs N50
Dependencies and versions:
hmmsearch: 3.1
bbtools: 39.01
metaeuk: 5.34c21f2
busco: 5.4.2
Polish
Medaka v1.7.2 was used to polish the assembly.
conda env create -n medaka -c bioconda flye=1.7.2
conda activate medaka
#Note that r941_min_sup_g507 model was used because some of the data came from an r9 flow cell.
medaka_consensus -b 100 -i sgm-total.fasq -d assembly.fasta -o medaka-results -t 24 -m r941_min_sup_g507
BUSCO
/home/cfaulk/miniconda3/bin/busco -i medaka-results/consensus.fasta -o sgm-total-medaka-busco -m genome -l diptera -c 24
Result:
C:87.9%[S:85.1%,D:2.8%],F:1.9%,M:10.2%,n:3285
Purge haplotigs
Skipped purge_haplotigs, since separate peaks were not obvious,
indicating lack of demonstrated haplotig presence.
minimap2 -t 24 -ax map-ont consensus.fasta ../sgm-total.fastq | samtools sort -m 1G -o aligned.bam -T tmp.ali
purge_haplotigs hist -b aligned.bam -g consensus.fasta -t 24
BLAST for contaminants
Blast was performed and provided to Blobtools as input. A local
installation of blast++ and the non-redundant (nr) database were used to
identify contigs.
blastn -query consensus.fasta -task megablast -db ~/Desktop/genomes/nt/nt -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' -culling_limit 10 -num_threads 23 -evalue 1e-3 -out consensus.fasta.vs.nt.cul5.1e3.megablast.out
Megablast results are saved as
consensus.fasta.vs.nt.cul5.1e3.megablast.out
, assembly as
consensus.fasta
, coverage information as
consensus.coverage.txt
, and BUSCO results as
full_table.tsv
. These files are ready for Blobtoolkit
input.
Decontamination
We filter on the following criteria to remove short and low coverage
contigs. These filter settings result in no loss of full length
BUSCOs.
Remove length less than 8000 bp.
Remove coverage less than 20X and greater than 200X.
Filtered contigs are exported in .csv format from blobtools
2613 Original Contigs
1009 Remaining Contigs saved as 1009.clean.contigs.csv
Filter assembly
Use seqkit to filter the consensus assembly to only the 1009 contigs
from decontamination.
seqkit grep -f 1009.contigs.id.txt ../blobtools-out/consensus.fasta -o clean.consensus.fasta
Final BUSCO
busco -i clean.consensus.fasta -o busco-cleanconsensus -m genome -l ~/Desktop/genomes/busco_downloads/lineages/diptera_odb10 -c 24
Result:
***** Results: *****
C:87.8%[S:85.2%,D:2.6%],F:1.9%,M:10.3%,n:3285
2886 Complete BUSCOs (C)
2800 Complete and single-copy BUSCOs (S)
86 Complete and duplicated BUSCOs (D)
63 Fragmented BUSCOs (F)
336 Missing BUSCOs (M)
3285 Total BUSCO groups searched
Methylation
Use the unmapped .bam files generated during the guppy basecalling.
Cat together modbams for each flow cell directory:
samtools cat -o sgm-mod.unmapped.bam bug1/*.bam bug2/*.bam bug3/*.bam
Map Unmapped Bams.
Map, convert and sort all in one command. The -T Mm,Ml
flag carries over the modifications from bam to fastq. In newer bams the
flags are -T MM, ML
. The -y
flag copies input
fastq commands to output.
samtools fastq -T Mm,Ml sgm-mod.unmapped.bam | minimap2 -y -ax map-ont ../final_assembly/clean.consensus.fasta - -t 24 | samtools view -u - | samtools sort -@ 24 -o sgm.modmapped.bam; samtools index sgm.modmapped.bam -@ 24
Convert to bed format. The -d
flag limits depth to save
memory. The aggregate flag adds together bases on either strand and
creates and aggregate.bed file. The -cpg
flag limits to cpg
sites.
modbam2bed -m 5mC -t 23 -e -d 300 -p sgm.modmapped --aggregate --cpg clean.consensus.fasta sgm.modmapped.bam > sgm.modmapped.bed
Assess global DNA methylation as a percentage by adding up all
canonical (column 12) and modified (column 13) cytosine base calls at
CpG sites. Divide the modified by the total and print.
awk '$5>0 {can+=$12; mod+=$13} END{print 100*(mod/(can+mod))}' sgm.mapped.bed
Global Methylation
1.07294 is the global methylation in Soybean Gall Midge
Mosdepth Assembly Stats
Depth statistics generated with mosdepth
Run
Overall depth stats:
mosdepth -n --fast-mode -t 8 sgm.modmapped ../methylation/sgm.modmapped.bam
chrom length bases mean min max
total 206036186 13366775054 64.88 0 43586
For 500 bp windows:
mosdepth -n --fast-mode --by 500 -t 8 500bpwindow.sgm.modmapped ../methylation/sgm.modmapped.bam
chrom length bases mean min max
total_region 206036186 13366775054 64.88 0 43586
Repeats
Repeat Identification
Since annotated repeats in insect genomes are sparse, repeat
identification requires two stages. First de novo repeat
identification was performed with RepeatModler2.
Second, the libraries generated with RepeatModeler2 were used as input
to RepeatMasker to create a
complete genome annotation of repeats with existing names for classes,
families, and subfamilies where known and novel IDs for previously
unknown families.
RepeatModeler was run to identify repeat content. Takes about 24
hours on a 300 Mb insect genome.
- First build database from consensus fasta.
/usr/local/RepeatModeler-2.0.2a/BuildDatabase -name "sgm-genome" ../final-assembly/consensus.clean.fasta
- Run RepeatModeler2.
nohup <RepeatModelerPath>/RepeatModeler -database sgm-genome -pa 14 -LTRStruct >& run.out &
- RepeatMasker was then run to determine genomic repeat content
specifying the custom library created with RepeatModeler.
RepeatMasker -pa 14 -s -xsmall -lib sgm-genome-families.fa ../final-assembly/consensus.clean.fasta
Comparison to other midge genomes was performed using Dfam v3.6 and
by ab initio identification using the same RepeatModeler2 pipeline as
above.
GeMoMa Gene Annotation
Wiki
Install
Install depencency mmseq
:
conda install -c conda-forge -c bioconda mmseqs2
Installed version 1.9 by downloading .zip
file from website
vs. Drosophila melanogaster (dm6) and Swede midge (Contarinia
nasturtii)
Two annotations of diptera were used as a reference based on close
phylogenetic proximity to the soybean gall midge and annotation
availability. FASTA and GFF files were downloaded from NCBI’s Assembly
page. Note: A hardmasked version of the genome (repeats
replaced with N’s) was used for input to eliminate transposon derived
genes.
java -Xmx50g -jar gemoma-java/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=23 outdir=combined_Dmel_swedemidge GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=clean.consensus.hardmasked.fasta s=own i=swede_gall_midge a=GCF_009176525.2_AAFC_CNas_1.1_genomic.gff g=GCF_009176525.2_AAFC_CNas_1.1_genomic.fna s=own i=D_mel g=GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.fna a=GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.gff
BUSCO was used to assess completeness in protein mode:
busco -i predicted_proteins.fasta -o busco -m protein -l diptera -c 24
--------------------------------------------------
|Results from dataset diptera_odb10 |
--------------------------------------------------
|C:89.9%[S:85.5%,D:4.4%],F:0.7%,M:9.4%,n:3285 |
|2952 Complete BUSCOs (C) |
|2808 Complete and single-copy BUSCOs (S) |
|144 Complete and duplicated BUSCOs (D) |
|24 Fragmented BUSCOs (F) |
|309 Missing BUSCOs (M) |
|3285 Total BUSCO groups searched |
--------------------------------------------------
2023-01-05 18:39:09 INFO: BUSCO analysis done. Total running time: 110 seconds
2023-01-05 18:39:09 INFO: Results written in /home/cfaulk/Desktop/ont-sequencing_runs/soybean_gall_midge/sgm-analysis/gemoma-out/combined_Dmel_swedemidge/busco-combined-protein
Mitochondrial genome
Total reads were first blasted by mtblaster (https://github.com/nidafra92/squirrel-project/blob/master/mtblaster.py))
using the Orseolia oryzae mitogenome (KM888183) to select for reads with
high identity to a gall midge mitochondria sequence. We blasted against
reads only from the R10 flowcell to ensure higher accuracy. Next,
resulting reads were filtered by nanofilt (https://github.com/wdecoster/nanofilt))
to keep only reads above 15 kbp in length and average Q-score above 30.
Finally, flye was used to perform mitogenome assembly (Kolmogorov et al.
2020). A single circular contig was recovered with 785X coverage. The
assembly was polished using four rounds of racon (https://github.com/lbcb-sci/racon) followed by medaka
with the same parameters as the nuclear assembly.
Blast against reads from the R10 flow cell:
mtblaster.py sgm-bug3.fastq Orseolia_oryzae-KM888183.1.fasta
Filter reads higher than 15 kb with quality >Q30.
nanofilt -l 15000 -q 30 reads_w_hits.fastq > reads_w_hits15kq30.fastq
Run flye to assemble genome using the asm parameter.
flye --nano-hq reads_w_hits15kq30.fastq --out-dir flye-mitogenome-highasm --genome-size 16k --threads 24 --asm 256
Run 4 rounds of Racon polishing,
minimap2 -ax map-ont assembly.fasta ../reads_w_hits15kq30.fastq > assembly.sam
racon -t 24 ../reads_w_hits15kq30.fastq assembly.sam assembly.fasta > assembly.racon1.fasta
Final polish with medaka.
medaka_consensus -i ../reads_w_hits15kq30.fastq -d assembly.racon4.fasta -o medaka-results/ -t 23 -m r1041_e82_400bps_sup_g615
Save consensus.fasta
as
sgm_mitogenome_consensus.fasta
Wolbachia detection
Collect panel of 9 genomes into a single fasta file and match against
the total SGM read set. Identify any resulting matches against the
kracken2 standard database. The genomes were downloaded from NCBI and
are listed in the manuscript.
Match reads to Wolbachia genomes.
minimap2 -ax map-ont wolbachia-panel/wolbachia-panel.fna ../sgm-total.fastq.gz -t 23 > wolbachia-panel.vs.sgm-total.sam
Sort and convert to bam format.
samtools sort -@24 -O BAM -o sgm-total.vs.wolbachia-panel.sorted.bam wolbachia-panel.vs.sgm-total.sam
Filter alignment file for only mapped reads.
samtools fasta -F 4 sgm-total.vs.wolbachia-panel.sorted.bam > sgm-total.vs.wolbachia-panel.sorted.hits.bam
Remove duplicates.
awk '/^>/{f=!d[$1];d[$1]=1}f' sgm-total.vs.wolbachia-panel.sorted.hits.fasta > sgm-total.vs.wolbachia-panel.sorted.nodups.fasta
Run through Kraken2.
kraken2 --db /mnt/data16Tb/kraken_db/k2_standard_16gb_20220607 --threads 23 --use-names --report sgm.vs.wolbachia-panel.report.txt --output sgm.vs.wolbachia-panel.txt sgm-total.vs.wolbachia-panel.sorted.nodups.fasta
Run Pavian from inside R to generate graphics.
pavian::runApp(port=5000)
NCBI Submission
Link
Used command line to upload folder:
- Navigate to the source folder where the files for submission
are;
- Establish an FTP connection using the credentials below: Address:
ftp-private.ncbi.nlm.nih.gov
Username: subftp
Password: <sent via email>
- Navigate to your account folder: From the command line use the ‘cd’
command:
cd uploads/<folder_sent_by_email>
When using a GUI FTP client (eg: Filezilla, NcFTP, Cyberduck, etc.),
after you’ve connected to the FTP server, paste your account folder
(uploads/<folder_sent_by_email>
) into the “Remote
Site” or “Remote Directory” box on the interface and press “Enter”.
Then you will be able to create the data sub-directory for your
submission. Until you do this, you will see a message stating “550 /:
Permission denied” or “Failed to read the directory listing”. We prevent
directory listing in the default sign in folder for security
reasons.
Create a subfolder (required!) with a meaningful name. Navigate to
the target folder you just created. Copy your files into the target
folder:
<mitogenome>.fa.gz <nuclear_genome>.fa.gz
Press “New submission” in browser.
---
title: "Soybean Gall Midge Genome Notebook"
author: "Chris Faulk"
output: 
  html_notebook:
    toc: true
    toc_float: true
    df_print: tibble
editor_options: 
  chunk_output_type: inline
  markdown: 
    wrap: 72
---

```{r}
library(tidyverse)
```

![Soybean Gall Midge.](images/sgm_image.jpg)

![](images/jones%20gall%20midge%20illustration.jpg)

# Sequencing and Basecalling

Sequencing was performed on two flow cells, an R9.4.1 and an R10.4.1.
Each cell was loaded and run for 24 hours, then nuclease flushed and
reloaded an additional 24 hours for a total of 72 hours. Post-hoc
basecalling was performed using guppy with the following parameters to
include modified basecalling at CpG sites.

``` {#guppy_basecall .bash}
guppy_basecaller --input_path . -r -s sgm-output/ -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg --bam_out --compress_fastq --nested_output_folder --min_qscore 10`
```

# Full Assembly

### Flye `--asm-coverage 128`

Assemblies were compared with the full read set and a subset of \>3400
bp reads. The full read set yielded better assemblies as measured by
BUSCO and was used for all further processing.

### Install

Flye 2.9.1-b1780 was installed from github via conda

``` {#flye_install .bash}
conda env create -n flye -c bioconda flye=2.9.1
conda activate flye
```

### Run

The `--asm-coverage` option limits the minimum required coverage for the
initial disjointig extension state, a memory bottleneck in the program.
Setting `--genome-size` also reduces memory usage.

``` {#flye_asm_run .bash}
flye --nano-hq sgm-total.fastq --out-dir flye_results_sgmtotal --genome-size 200m --asm-coverage 128 --threads 23 --no-alt-contigs
```

The run completed in \~3 hours on a 12 core AMD 5900.

### BUSCO

BUSCO scores are used to assess assembly quality without reliance on
inherent non-biologically relevant characteristics of the assembly. It
identifies phylum-unique genes within the assembly. So, this assembly
should contain a high number of the genes unique to Diptera

#### Install

The command line package `busco` v5.4.2 was installed in a `mamba`
virtual environment on via
`mamba install -c bioconda -c conda-forge busco=5.4.2`. *Note: the mamba
documentation says you should still use conda to activate and
deactivate*. Full environment specs in `busco_conda_env.txt`.

#### Run

`busco` in genome mode was run in a mamba (conda) virtual environment,
taking approximately 15 minutes.

#### Diptera lineage

`busco -i assembly.fasta -o busco_output_draft -m genome --lineage diptera -c 24`

Results via auto-generated short_summary text file:

``` {#sgm_total_busco .bash}
# BUSCO version is: 5.4.2 
# The lineage dataset is: diptera_odb10 (Creation date: 2020-08-05, number of genomes: 56, number of BUSCOs: 3285)
# Summarized benchmarking in BUSCO notation for file /home/cfaulk/Desktop/ont-sequencing_runs/soybean_gall_midge/sgm-analysis/flye_results_sgmtotal/medaka-results/consensus.fasta
# BUSCO was run in mode: euk_genome_met
# Gene predictor used: metaeuk

    ***** Results: *****
    C:87.9%[S:85.1%,D:2.8%],F:1.9%,M:10.2%,n:3285      
    2886    Complete BUSCOs (C)            
    2795    Complete and single-copy BUSCOs (S)    
    91  Complete and duplicated BUSCOs (D)     
    63  Fragmented BUSCOs (F)              
    336 Missing BUSCOs (M)             
    3285    Total BUSCO groups searched        

Assembly Statistics:
    2613    Number of scaffolds
    2613    Number of contigs
    211232549   Total length
    0.000%  Percent gaps
    698 KB  Scaffold N50
    698 KB  Contigs N50

Dependencies and versions:
    hmmsearch: 3.1
    bbtools: 39.01
    metaeuk: 5.34c21f2
    busco: 5.4.2
```

### Polish

Medaka v1.7.2 was used to polish the assembly.

``` {#medaka_install .bash}

conda env create -n medaka -c bioconda flye=1.7.2
conda activate medaka

#Note that r941_min_sup_g507 model was used because some of the data came from an r9 flow cell.
medaka_consensus -b 100 -i sgm-total.fasq -d assembly.fasta -o medaka-results -t 24 -m r941_min_sup_g507
```

##### BUSCO

`/home/cfaulk/miniconda3/bin/busco -i medaka-results/consensus.fasta -o sgm-total-medaka-busco -m genome -l diptera -c 24`

Result: `C:87.9%[S:85.1%,D:2.8%],F:1.9%,M:10.2%,n:3285`

### Purge haplotigs

Skipped purge_haplotigs, since separate peaks were not obvious,
indicating lack of demonstrated haplotig presence.

``` {#purge_haplotigs .bash}
minimap2 -t 24 -ax map-ont consensus.fasta ../sgm-total.fastq | samtools sort -m 1G -o aligned.bam -T tmp.ali
purge_haplotigs  hist  -b aligned.bam  -g consensus.fasta  -t 24
```

## BLAST for contaminants

Blast was performed and provided to Blobtools as input. A local
installation of blast++ and the non-redundant (nr) database were used to
identify contigs.

`blastn -query consensus.fasta -task megablast -db ~/Desktop/genomes/nt/nt -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' -culling_limit 10 -num_threads 23 -evalue 1e-3 -out consensus.fasta.vs.nt.cul5.1e3.megablast.out`

Megablast results are saved as
`consensus.fasta.vs.nt.cul5.1e3.megablast.out`, assembly as
`consensus.fasta`, coverage information as `consensus.coverage.txt`, and
BUSCO results as `full_table.tsv`. These files are ready for Blobtoolkit
input.

## Contaminant searching with Blobtoolkit

### Pip install

[Install instructions](https://blobtoolkit.genomehubs.org/install/)
Installs version 3.2.6

``` {#blob_pip .bash}
pip install blobtoolkit
pip install fastjsonschema
```

### Generate coverage file

``` {#blob_coverage .bash}
minimap2 -ax map-ont -t 24 consensus.fasta sgm-total.fastq | samtools sort -@16 -O BAM -o consensus.sorted.bam -
```

Generate coverage .txt
`samtools coverage consensus.sorted.bam > consensus.coverage.txt`.

### Create blobdir

blobtoolkit v3.2.6

Download taxdump from
[NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz),
unzip, and moved all the resulting `.dmp` files to the directory
`taxdump`.

Create the blobdir and load it with files.

``` {#blobdir .bash}
blobtools create --fasta consensus.fasta --meta data/midge.yaml --taxid 2494512 --taxdump taxdump/ datasets/midge
#Add BUSCO
blobtools add --busco full_table.tsv datasets/midge
#Add coverage
blobtools add --text consensus.coverage.txt --text-header --text-cols '#rname=identifier,meandepth=midge_reads_cov' datasets/midge
#Fix axes
blobtools add --key plot.y=midge_reads_cov datasets/midge
```

### View

`blobtools view --local --interactive --ports 4000-4999 datasets/midge`

## Decontamination

We filter on the following criteria to remove short and low coverage
contigs. These filter settings result in no loss of full length BUSCOs.

Remove length less than 8000 bp.\
Remove coverage less than 20X and greater than 200X.\
Filtered contigs are exported in .csv format from blobtools

2613 Original Contigs\
1009 Remaining Contigs saved as `1009.clean.contigs.csv`

### Filter assembly

Use seqkit to filter the consensus assembly to only the 1009 contigs
from decontamination.

`seqkit grep -f 1009.contigs.id.txt ../blobtools-out/consensus.fasta -o clean.consensus.fasta`

## Final BUSCO

`busco -i clean.consensus.fasta -o busco-cleanconsensus -m genome -l ~/Desktop/genomes/busco_downloads/lineages/diptera_odb10 -c 24`

Result:

``` {#clean_consensus_busco .bash}
    ***** Results: *****

    C:87.8%[S:85.2%,D:2.6%],F:1.9%,M:10.3%,n:3285      
    2886    Complete BUSCOs (C)            
    2800    Complete and single-copy BUSCOs (S)    
    86  Complete and duplicated BUSCOs (D)     
    63  Fragmented BUSCOs (F)              
    336 Missing BUSCOs (M)             
    3285    Total BUSCO groups searched 
```

## Methylation

Use the unmapped .bam files generated during the guppy basecalling. Cat
together modbams for each flow cell directory:
`samtools cat -o sgm-mod.unmapped.bam bug1/*.bam bug2/*.bam bug3/*.bam`

### Map Unmapped Bams.

Map, convert and sort all in one command. The `-T Mm,Ml` flag carries
over the modifications from bam to fastq. In newer bams the flags are
`-T MM, ML`. The `-y` flag copies input fastq commands to output.
`samtools fastq -T Mm,Ml sgm-mod.unmapped.bam | minimap2 -y -ax map-ont ../final_assembly/clean.consensus.fasta - -t 24 | samtools view -u - | samtools sort -@ 24 -o sgm.modmapped.bam; samtools index sgm.modmapped.bam -@ 24`

Convert to bed format. The `-d` flag limits depth to save memory. The
aggregate flag adds together bases on either strand and creates and
aggregate.bed file. The `-cpg` flag limits to cpg sites.
`modbam2bed -m 5mC -t 23 -e -d 300 -p sgm.modmapped --aggregate --cpg clean.consensus.fasta sgm.modmapped.bam > sgm.modmapped.bed`

Assess global DNA methylation as a percentage by adding up all canonical
(column 12) and modified (column 13) cytosine base calls at CpG sites.
Divide the modified by the total and print.

`awk '$5>0 {can+=$12; mod+=$13} END{print 100*(mod/(can+mod))}' sgm.mapped.bed`

### Global Methylation

1.07294 is the global methylation in Soybean Gall Midge

## Mosdepth Assembly Stats

Depth statistics generated with mosdepth

### Install

Downloaded v0.3.3 binary from [Github release
page](https://github.com/brentp/mosdepth/releases).

### Run

1.  Overall depth stats:
    `mosdepth -n --fast-mode -t 8 sgm.modmapped ../methylation/sgm.modmapped.bam`

    `chrom  length  bases   mean    min max`

    `total  206036186   13366775054 64.88   0   43586`

2.  For 500 bp windows:
    `mosdepth -n --fast-mode --by 500 -t 8 500bpwindow.sgm.modmapped ../methylation/sgm.modmapped.bam`

    `chrom  length  bases   mean    min max`

    `total_region   206036186   13366775054 64.88   0   43586`

# Repeats

## Repeat Identification

Since annotated repeats in insect genomes are sparse, repeat
identification requires two stages. First *de novo* repeat
identification was performed with
[RepeatModler2](https://www.repeatmasker.org/RepeatModeler/). Second,
the libraries generated with RepeatModeler2 were used as input to
[RepeatMasker](https://www.repeatmasker.org) to create a complete genome
annotation of repeats with existing names for classes, families, and
subfamilies where known and novel IDs for previously unknown families.

RepeatModeler was run to identify repeat content. Takes about 24 hours
on a 300 Mb insect genome.

1.  First build database from consensus fasta.\
    `/usr/local/RepeatModeler-2.0.2a/BuildDatabase -name "sgm-genome" ../final-assembly/consensus.clean.fasta`
2.  Run RepeatModeler2.\
    `nohup <RepeatModelerPath>/RepeatModeler -database sgm-genome -pa 14 -LTRStruct >& run.out &`
3.  RepeatMasker was then run to determine genomic repeat content
    specifying the custom library created with RepeatModeler.\
    `RepeatMasker -pa 14 -s -xsmall -lib sgm-genome-families.fa ../final-assembly/consensus.clean.fasta`

Comparison to other midge genomes was performed using Dfam v3.6 and by
ab initio identification using the same RepeatModeler2 pipeline as
above.

# GeMoMa Gene Annotation

[Wiki](http://www.jstacs.de/index.php/GeMoMa)

### Install

Install depencency `mmseq`:

`conda install -c conda-forge -c bioconda mmseqs2`

Installed version 1.9 by downloading `.zip` file from
[website](https://www.jstacs.de/index.php/GeMoMa#Requirements)

### vs. Drosophila melanogaster (dm6) and Swede midge ([*Contarinia nasturtii*](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/265458))

Two annotations of diptera were used as a reference based on close
phylogenetic proximity to the soybean gall midge and annotation
availability. FASTA and GFF files were downloaded from NCBI's Assembly
page. **Note:** A hardmasked version of the genome (repeats replaced
with N's) was used for input to eliminate transposon derived genes.

`java -Xmx50g -jar gemoma-java/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=23 outdir=combined_Dmel_swedemidge GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=clean.consensus.hardmasked.fasta s=own i=swede_gall_midge a=GCF_009176525.2_AAFC_CNas_1.1_genomic.gff g=GCF_009176525.2_AAFC_CNas_1.1_genomic.fna s=own i=D_mel g=GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.fna a=GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.gff`

BUSCO was used to assess completeness in protein mode:

`busco -i predicted_proteins.fasta -o busco -m protein -l diptera -c 24`

``` {#sgm_vs_Dmel_swedemidge_prot_busco .bash}
    --------------------------------------------------
    |Results from dataset diptera_odb10               |
    --------------------------------------------------
    |C:89.9%[S:85.5%,D:4.4%],F:0.7%,M:9.4%,n:3285     |
    |2952   Complete BUSCOs (C)                       |
    |2808   Complete and single-copy BUSCOs (S)       |
    |144    Complete and duplicated BUSCOs (D)        |
    |24 Fragmented BUSCOs (F)                     |
    |309    Missing BUSCOs (M)                        |
    |3285   Total BUSCO groups searched               |
    --------------------------------------------------
2023-01-05 18:39:09 INFO:   BUSCO analysis done. Total running time: 110 seconds
2023-01-05 18:39:09 INFO:   Results written in /home/cfaulk/Desktop/ont-sequencing_runs/soybean_gall_midge/sgm-analysis/gemoma-out/combined_Dmel_swedemidge/busco-combined-protein
```

# Mitochondrial genome

Total reads were first blasted by mtblaster
([https://github.com/nidafra92/squirrel-project/blob/master/mtblaster.py)](https://github.com/nidafra92/squirrel-project/blob/master/mtblaster.py))
using the Orseolia oryzae mitogenome (KM888183) to select for reads with
high identity to a gall midge mitochondria sequence. We blasted against
reads only from the R10 flowcell to ensure higher accuracy. Next,
resulting reads were filtered by nanofilt
([https://github.com/wdecoster/nanofilt)](https://github.com/wdecoster/nanofilt))
to keep only reads above 15 kbp in length and average Q-score above 30.
Finally, flye was used to perform mitogenome assembly (Kolmogorov et al.
2020). A single circular contig was recovered with 785X coverage. The
assembly was polished using four rounds of racon
(<https://github.com/lbcb-sci/racon>) followed by medaka with the same
parameters as the nuclear assembly.

1.  Blast against reads from the R10 flow cell:
    `mtblaster.py sgm-bug3.fastq Orseolia_oryzae-KM888183.1.fasta`

2.  Filter reads higher than 15 kb with quality \>Q30.
    `nanofilt -l 15000 -q 30 reads_w_hits.fastq > reads_w_hits15kq30.fastq`

3.  Run flye to assemble genome using the asm parameter.
    `flye --nano-hq reads_w_hits15kq30.fastq --out-dir flye-mitogenome-highasm --genome-size 16k --threads 24 --asm 256`

4.  Run 4 rounds of Racon polishing,
    `minimap2 -ax map-ont assembly.fasta ../reads_w_hits15kq30.fastq > assembly.sam`

    `racon -t 24 ../reads_w_hits15kq30.fastq assembly.sam assembly.fasta > assembly.racon1.fasta`

5.  Final polish with medaka.
    `medaka_consensus -i ../reads_w_hits15kq30.fastq -d assembly.racon4.fasta -o medaka-results/ -t 23 -m r1041_e82_400bps_sup_g615`

6.  Save `consensus.fasta` as `sgm_mitogenome_consensus.fasta`

# Wolbachia detection

Collect panel of 9 genomes into a single fasta file and match against
the total SGM read set. Identify any resulting matches against the
kracken2 standard database. The genomes were downloaded from NCBI and
are listed in the manuscript.

1.  Match reads to Wolbachia genomes.
    `minimap2 -ax map-ont wolbachia-panel/wolbachia-panel.fna ../sgm-total.fastq.gz -t 23 > wolbachia-panel.vs.sgm-total.sam`

2.  Sort and convert to bam format.
    `samtools sort -@24 -O BAM -o sgm-total.vs.wolbachia-panel.sorted.bam wolbachia-panel.vs.sgm-total.sam`

3.  Filter alignment file for only mapped reads.
    `samtools fasta -F 4 sgm-total.vs.wolbachia-panel.sorted.bam > sgm-total.vs.wolbachia-panel.sorted.hits.bam`

4.  Remove duplicates.
    `awk '/^>/{f=!d[$1];d[$1]=1}f' sgm-total.vs.wolbachia-panel.sorted.hits.fasta > sgm-total.vs.wolbachia-panel.sorted.nodups.fasta`

5.  Run through Kraken2.
    `kraken2 --db /mnt/data16Tb/kraken_db/k2_standard_16gb_20220607 --threads 23 --use-names --report sgm.vs.wolbachia-panel.report.txt --output sgm.vs.wolbachia-panel.txt sgm-total.vs.wolbachia-panel.sorted.nodups.fasta`

6.  Run Pavian from inside R to generate graphics.
    `pavian::runApp(port=5000)`

# NCBI Submission

[Link](https://submit.ncbi.nlm.nih.gov/subs/genome/)

Used command line to upload folder:

1.  Navigate to the source folder where the files for submission are;
2.  Establish an FTP connection using the credentials below: Address:
    `ftp-private.ncbi.nlm.nih.gov` Username: `subftp` Password:
    `<sent via email>`
3.  Navigate to your account folder: From the command line use the 'cd'
    command: `cd uploads/<folder_sent_by_email>`

When using a GUI FTP client (eg: Filezilla, NcFTP, Cyberduck, etc.),
after you've connected to the FTP server, paste your account folder
(`uploads/<folder_sent_by_email>`) into the "Remote Site" or "Remote
Directory" box on the interface and press "Enter".

Then you will be able to create the data sub-directory for your
submission. Until you do this, you will see a message stating "550 /:
Permission denied" or "Failed to read the directory listing". We prevent
directory listing in the default sign in folder for security reasons.

Create a subfolder (required!) with a meaningful name. Navigate to the
target folder you just created. Copy your files into the target folder:
`<mitogenome>.fa.gz <nuclear_genome>.fa.gz`

Press "New submission" in browser.
