# Genome and transcriptome assemblies of the kuruma shrimp, *Marsupenaeus japonicus*: bioinformatic methods
This file describes the custom scripts used in "Genome and transcriptome assemblies of the kuruma shrimp, *Marsupenaeus japonicus*".
Note that the scripts presented in this file are "clean copies" of the original scripts: we stripped most of the snippets used for QC or exploratory purposes, streamlined convoluted directory structures, and renamed files and variables whose names were too long or inconsistent. We also removed most of the SLURM job parameters that were hard-coded in the original scripts.
- [ONT library preparation and sequencing](#ont-library-preparation-and-sequencing)
- [Base-calling Nanopore reads (Guppy)](#base-calling-nanopore-reads-guppy)
- [Short read removal (Fastp)](#short-read-removal-fastp)
- [PacBio Iso-Seq sequencing and analysis (IsoSeq3)](#pacbio-iso-seq-sequencing-and-analysis-isoseq3)
- [CCS](#ccs)
- [LIMA](#lima)
- [IsoSeq3 refine](#isoseq3-refine)
- [dataset create --type TranscriptSet](#dataset-create---type-transcriptset)
- [dataset create --type SubreadSet](#dataset-create---type-subreadset)
- [IsoSeq3 Cluster](#isoseq3-cluster)
- [IsoSeq3 Polish](#isoseq3-polish)
- [Genome size estimation and ploidy analysis](#genome-size-estimation-and-ploidy-analysis)
- [Trimming paired-end reads (Fastp)](#trimming-paired-end-reads-fastp)
- [K-mer count (KMC)](#k-mer-count-kmc)
- [Genome size estimation (GenomeScope2)](#genome-size-estimation-genomescope2)
- [Ploidy analysis (SmudgePlot)](#ploidy-analysis-smudgeplot)
- [Error correction of Illumina paired-end reads from muscle library](#error-correction-of-illumina-paired-end-reads-from-muscle-library)
- [Error correction (Tadpole)](#error-correction-tadpole)
- [K-mer count (KMC)](#k-mer-count-kmc-1)
- [K-mer distribution plot (ggplot2)](#k-mer-distribution-plot-ggplot2)
- [De novo assembly of Illumina genomic DNA libraries](#de-novo-assembly-of-illumina-genomic-dna-libraries)
- [Preprocessing Illumina paired-end reads from sperm library (Fastp)](#preprocessing-illumina-paired-end-reads-from-sperm-library-fastp)
- [Preprocessing Illumina mate-pair reads](#preprocessing-illumina-mate-pair-reads)
- [Trimmomatic](#trimmomatic)
- [NextClip](#nextclip)
- [De novo assembly (SPAdes)](#de-novo-assembly-spades)
- [Redundancy removal and scaffolding (Redundans)](#redundancy-removal-and-scaffolding-redundans)
- [Preprocessing Illumina paired-end reads from muscle library (Fastp)](#preprocessing-illumina-paired-end-reads-from-muscle-library-fastp)
- [Run Redundans pipeline](#run-redundans-pipeline)
- [Misassembly detection and assembly break (REAPR)](#misassembly-detection-and-assembly-break-reapr)
- [Reverse-complement mate-pair reads (SeqKit)](#reverse-complement-mate-pair-reads-seqkit)
- [Run REAPR pipeline](#run-reapr-pipeline)
- [Transcriptome shotgun assembly](#transcriptome-shotgun-assembly)
- [Preprocessing Illumina paired-end reads (Fastp)](#preprocessing-illumina-paired-end-reads-fastp)
- [Mapping to the genome (HISAT2)](#mapping-to-the-genome-hisat2)
- [De novo assembly](#de-novo-assembly)
- [Trinity](#trinity)
- [rnaSPAdes](#rnaspades)
- [Trans-ABySS](#trans-abyss)
- [Redundancy removal (TransDecoder, CD-HIT-EST)](#redundancy-removal-transdecoder-cd-hit-est)
- [BUSCO (transcriptome, arthropoda_odb10)](#busco-transcriptome-arthropoda_odb10)
- [Scaffolding of genome assembly using transcriptome data](#scaffolding-of-genome-assembly-using-transcriptome-data)
- [Scaffolding using cDNA alignments (L_RNA_Scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder)
- [Scaffolding using paired-end RNA-seq reads (P_RNA_scaffolder)](#scaffolding-using-paired-end-rna-seq-reads-p_rna_scaffolder)
- [Gap filling (Sealer)](#gap-filling-sealer)
- [Polishing (NtEdit)](#polishing-ntedit)
- [Removing short (<2kb) scaffolds](#removing-short-2kb-scaffolds)
- [Assembly improvement using ONT long reads, Illumina mate-pair reads, and transcriptome data](#assembly-improvement-using-ont-long-reads-illumina-mate-pair-reads-and-transcriptome-data)
- [Mate-pair reads processing (NxTrim, Fastp, BBduk)](#mate-pair-reads-processing-nxtrim-fastp-bbduk)
- [Gap filling, assembly break, and polishing (TGS-GapCloser, NtEdit)](#gap-filling-assembly-break-and-polishing-tgs-gapcloser-ntedit)
- [Scaffolding (1)](#scaffolding-1)
- [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf)
- [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst)
- [Scaffolding using cDNA alignments (L_RNA_scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder-1)
- [Misassembly detection and assembly break (REAPR)](#misassembly-detection-and-assembly-break-reapr-1)
- [Haplotig removal (1) (purge_haplotigs)](#haplotig-removal-1-purge_haplotigs)
- [Scaffolding (2)](#scaffolding-2)
- [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf-1)
- [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst-1)
- [Scaffolding using cDNA alignments (L_RNA_scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder-2)
- [Haplotig removal (2) (purge_haplotigs)](#haplotig-removal-2-purge_haplotigs)
- [Scaffolding (3)](#scaffolding-3)
- [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf-2)
- [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst-2)
- [Assembly polishing (HyPo)](#assembly-polishing-hypo)
- [Assembly curation](#assembly-curation)
- [Mitogenome contamination removal](#mitogenome-contamination-removal)
- [Filter out sequences with abnormal coverage](#filter-out-sequences-with-abnormal-coverage)
- [Screening bacterial contamination (Barrnap)](#screening-bacterial-contamination-barrnap)
- [Mapping Illumina paired-end reads onto the final assembly](#mapping-illumina-paired-end-reads-onto-the-final-assembly)
- [Mitogenome assembly (Flye, --meta)](#mitogenome-assembly-flye---meta)
- [Repeat identification and masking](#repeat-identification-and-masking)
- [Repeat library construction (RepeatScout)](#repeat-library-construction-repeatscout)
- [Taxonomic assignment (RepeatClassifier)](#taxonomic-assingment-repeatclassifier)
- [Repeat masking (RepeatMasker)](#repeat-masking-repeatmasker)
- [Gene prediction and functional annotation](#gene-prediction-and-functional-annotation)
- [Prediction of protein-coding genes](#prediction-of-protein-coding-genes)
- [cDNA alignments](#cdna-alignments)
- [Transcriptome shotgun assembly](#transcriptome-shotgun-assembly-1)
- [Preprocessing paired-end reads (Fastp)](#preprocessing-paired-end-reads-fastp)
- [De novo assembly (Trinity)](#de-novo-assembly-trinity)
- [Running PASA pipeline](#running-pasa-pipeline)
- [Genome-guided transcriptome assembly](#genome-guided-transcriptome-assembly)
- [RNA-seq mapping (HISAT2)](#rna-seq-mapping-hisat2)
- [Transcript reconstruction (StringTie)](#transcript-reconstruction-stringtie)
- [*Ab initio* gene prediction (BRAKER2)](#ab-initio-gene-prediction-braker2)
- [Protein-to-genome alignment (BLAST and Exonerate)](#protein-to-genome-alignment-blast-and-exonerate)
- [Merging gene models (EVidenceModeler)](#merging-gene-models-evidencemodeler)
- [Homology search (MMSeqs2)](#homology-search-mmseqs2)
- [Manual curation](#manual-curation)
- [UTR recovery (PASA updates)](#utr-recovery-pasa-updates)
- [Incorporation of IsoSeq and shotgun and transcriptome assemblies (SQANTI3, TAMA, TransDecoder)](#incorporation-of-isoseq-and-shotgun-and-transcriptome-assemblies-sqanti3-tama-transdecoder)
- [Non-coding RNA prediction](#non-coding-rna-prediction)
- [tRNAScan-SE 2.0 (tRNA)](#trnascan-se-20-trna)
- [Prediction of other non-coding RNA genes (Infernal)](#prediction-of-other-non-coding-rna-genes-infernal)
- [Merge predicted genes and assign locus IDs](#merge-predicted-genes-and-assign-locus-ids)
- [Functional annotation of protein-coding genes (AHRD, KAAS)](#functional-annotation-of-protein-coding-genes-ahrd-kaas)
- [Extracting non-redundant protein sequences](#extracting-non-redundant-protein-sequences)
- [Functional annotation (AHRD)](#functional-annotation-ahrd)
- [Append functional annotation to predicted protein-coding gene models](#append-functional-annotation-to-predicted-protein-coding-gene-models)
- [Structural annotation of IsoSeq cDNA sequences (SQANTI3)](#structural-annotation-of-isoseq-cdna-sequences-sqanti3)
- [Comparative genomics](#comparative-genomics)
- [BUSCO (genome, arthropoda_odb10)](#busco-genome-arthropoda_odb10)
- [Running BUSCO pipeline](#running-busco-pipeline)
- [Generate summary script (generate_plot.py)](#generate-summary-script-generate_plotpy)
- [Plot (R script)](#plot-r-script)
- [Comparison of simple repeat contents in arthropod genomes (RepeatMasker)](#comparison-of-simple-repeat-contents-in-arthropod-genomes-repeatmasker)
- [Comparison of protein-coding gene metrics in three penaeid shrimp genomes](#comparison-of-protein-coding-gene-metrics-in-three-penaeid-shrimp-genomes)
- [Get metrics](#get-metrics)
- [Visualization (ggplot2)](#visualization-ggplot2)
- [Orthologous gene clustering analysis (OrthoFinder)](#orthologous-gene-clustering-analysis-orthofinder)
- [Running OrthoFinder](#running-orthofinder)
- [Visualization (VennDiagram)](#visualization-venndiagram)
## ONT library preparation and sequencing
### Base-calling Nanopore reads (Guppy)
```sh
#!/bin/bash
# >>>>>>
# GuppyBasecaller.sh
# >>>>>>
# Define variables
IN=fast5
OUT=guppy_out
# Run guppy_basecaller
guppy_basecaller -x auto --qscore_filtering --calib_detect --flowcell FLO-MIN106 --kit SQK-LSK109 -i ${IN} -s ${OUT}
```
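Guppy writes base-called reads as per-batch FASTQ files inside the output directory; the single `ONT.fastq` consumed in the next step was obtained by concatenating them. A minimal sketch, assuming Guppy's default `pass/` output layout (the directory and file names here are assumptions, not taken from the original scripts):

```sh
#!/bin/bash
# Concatenate Guppy's pass-filter FASTQ batches into a single file.
# The guppy_out/pass layout is an assumption based on Guppy defaults.
OUT=guppy_out
mkdir -p ${OUT}/pass   # no-op after a real Guppy run; keeps this sketch runnable
find ${OUT}/pass -name '*.fastq' -exec cat {} + > ONT.fastq
```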
### Short read removal (Fastp)
```sh
#!/bin/bash
# >>>>>>
# RemoveShortNanoporeReads.sh
# >>>>>>
cat ONT.fastq|seqkit seq -m 1000 >ONT.up1k.fastq # For gap filling
cat ONT.fastq|seqkit seq -m 5000 >ONT.up5k.fastq # For mtDNA assembly
```
## PacBio Iso-Seq sequencing and analysis (IsoSeq3)
### CCS
```sh
#!/bin/bash -l
#SBATCH --job-name=ccs
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/ccs.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
# Run CCS per size-fractionated library; length cutoffs match each fraction
for lib in hp_m hc_m lv_42 pl_12; do
ccs ${lib}_up5k.subreads.bam ${lib}_up5k.ccs.bam --noPolish --minPasses 1 --minLength=3000 --maxLength=100000 --reportFile=${lib}_up5k.ccs_report.txt
ccs ${lib}_3to5k.subreads.bam ${lib}_3to5k.ccs.bam --noPolish --minPasses 1 --minLength=2000 --maxLength=6000 --reportFile=${lib}_3to5k.ccs_report.txt
ccs ${lib}_2to3k.subreads.bam ${lib}_2to3k.ccs.bam --noPolish --minPasses 1 --minLength=1500 --maxLength=4000 --reportFile=${lib}_2to3k.ccs_report.txt
ccs ${lib}_to2k.subreads.bam ${lib}_to2k.ccs.bam --noPolish --minPasses 1 --minLength=100 --maxLength=3000 --reportFile=${lib}_to2k.ccs_report.txt
done
conda deactivate
```
### LIMA
```sh
#!/bin/bash -l
#SBATCH --job-name=lima
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/lima.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
# Remove primers from each size fraction of each library
for lib in hp_m hc_m lv_42 pl_12; do
for frac in up5k 3to5k 2to3k to2k; do
lima ${lib}_${frac}.ccs.bam primer.fasta ${lib}_${frac}.fl.bam --isoseq --dump-clips
done
done
conda deactivate
```
### IsoSeq3 refine
```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_refine
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_refine.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
# Refine full-length reads into FLNC reads for each size fraction of each library
for lib in hp_m hc_m lv_42 pl_12; do
for frac in up5k 3to5k 2to3k to2k; do
isoseq3 refine ${lib}_${frac}.fl.primer_5p--primer_3p.bam primer.fasta ${lib}_${frac}.flnc.bam
done
done
conda deactivate
```
primer.fasta
```
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>primer_3p
GTACTCTGCGTTGATACCACTGCTT
```
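As a quick consistency check (ours, not part of the original pipeline), `primer_3p` should be the reverse complement of `primer_5p` without its 3'-terminal `ATGGG`, which can be verified with coreutils:

```sh
#!/bin/bash
# Sanity check on primer.fasta: primer_3p is the reverse complement of
# primer_5p minus its trailing ATGGG.
P5=AAGCAGTGGTATCAACGCAGAGTACATGGG
P3=GTACTCTGCGTTGATACCACTGCTT
RC5=$(printf '%s' "${P5%ATGGG}" | rev | tr ACGT TGCA)
[ "$RC5" = "$P3" ] && echo "primers are consistent"
```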
### dataset create --type TranscriptSet
```sh
#!/bin/bash -l
#SBATCH --job-name=dataset_create_transcriptset
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/dataset_create_transcriptset.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
dataset create --type TranscriptSet merged.flnc.xml \
hp_m_up5k.flnc.bam hp_m_3to5k.flnc.bam hp_m_2to3k.flnc.bam hp_m_to2k.flnc.bam \
hc_m_up5k.flnc.bam hc_m_3to5k.flnc.bam hc_m_2to3k.flnc.bam hc_m_to2k.flnc.bam \
lv_42_up5k.flnc.bam lv_42_3to5k.flnc.bam lv_42_2to3k.flnc.bam lv_42_to2k.flnc.bam \
pl_12_up5k.flnc.bam pl_12_3to5k.flnc.bam pl_12_2to3k.flnc.bam pl_12_to2k.flnc.bam
conda deactivate
```
### dataset create --type SubreadSet
```sh
#!/bin/bash -l
#SBATCH --job-name=dataset_create_subreadset
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/dataset_create_subreadset.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
dataset create --type SubreadSet merged.subreadset.xml \
hp_m_up5k.subreads.bam hp_m_3to5k.subreads.bam hp_m_2to3k.subreads.bam hp_m_to2k.subreads.bam \
hc_m_up5k.subreads.bam hc_m_3to5k.subreads.bam hc_m_2to3k.subreads.bam hc_m_to2k.subreads.bam \
lv_42_up5k.subreads.bam lv_42_3to5k.subreads.bam lv_42_2to3k.subreads.bam lv_42_to2k.subreads.bam \
pl_12_up5k.subreads.bam pl_12_3to5k.subreads.bam pl_12_2to3k.subreads.bam pl_12_to2k.subreads.bam
conda deactivate
```
### IsoSeq3 Cluster
```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_cluster
#SBATCH --partition=compute
#SBATCH --mem=120G
#SBATCH --cpus-per-task=12
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
num=$SLURM_ARRAY_TASK_ID
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_cluster.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
isoseq3 cluster merged.flnc.xml unpolished.bam --verbose --split-bam 24
conda deactivate
```
### IsoSeq3 Polish
```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_polish
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --array=1-23
#1-23
bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
num=$SLURM_ARRAY_TASK_ID
cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_polish.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in
conda activate isoseq3
isoseq3 polish unpolished.${num}.bam merged.subreadset.xml polished.${num}.bam
conda deactivate
```
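After all array tasks finish, the per-chunk high-quality transcripts can be merged into one file. A sketch assuming `isoseq3 polish`'s default output naming (`polished.<n>.hq.fasta.gz`), which is not shown in the original scripts:

```sh
#!/bin/bash
# Merge per-chunk high-quality transcript sets from the polish array job.
# File names assume the default isoseq3 polish outputs (polished.<n>.hq.fasta.gz);
# concatenated gzip members form a valid gzip stream.
find . -maxdepth 1 -name 'polished.*.hq.fasta.gz' -exec cat {} + > polished.merged.hq.fasta.gz
```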
## Genome size estimation and ploidy analysis
### Trimming paired-end reads (Fastp)
```sh
#!/bin/bash
# >>>>>>
# TrimPE_GenomeSizeEstimate.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}
# Quality trim with Fastp
for lib in sperm muscle;do
fastp \
-i Mj01${lib}DNA_R1.fastq.gz -I Mj01${lib}DNA_R2.fastq.gz \
-o Mj01${lib}DNA_R1_trim.fq.gz -O Mj01${lib}DNA_R2_trim.fq.gz \
--n_base_limit 0 --length_required 140 --thread 8 \
--html Mj01${lib}DNA_fastp.html
done
```
### K-mer count (KMC)
```sh
#!/bin/bash
# >>>>>>
# KMC_all.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_all_GenomeScope2
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}
# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
TMP_DIR=tmp
RDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq
IN=reads_list
# Create input file list
echo -e "${RDIR}/Mj01muscleDNA_R1_trim.fq.gz\n${RDIR}/Mj01muscleDNA_R2_trim.fq.gz\n${RDIR}/Mj01spermDNA_R1_trim.fq.gz\n${RDIR}/Mj01spermDNA_R2_trim.fq.gz" > reads_list
# Run KMC; iterate over 17- to 35-mers with 2-mer intervals
mkdir -p ${TMP_DIR}
for MER in 17 19 21 23 25 27 29 31 33 35;do
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
/work/kawato/apps/kmc/3.1.1/kmc -k${MER} -t${NUM_THREADS} -m${RAM_G} -ci${COV_MIN} -cs${COV_MAX} @${IN} ${OUT} ./${TMP_DIR}
/work/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} histogram ${OUT}_k${MER}.hist -cx1000000
done
```
### Genome size estimation (GenomeScope2)
```sh
#!/bin/bash
# >>>>>>
# GenomeScope2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_all_GenomeScope2
cd ${WHOME}${WDIR}
# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
IN=reads_list
# Run GenomeScope2; iterate over 17- to 35-mers with 2-mer intervals
conda activate genomescope2
for MER in 17 19 21 23 25 27 29 31 33 35;do
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
genomescope2 -i ${OUT}_k${MER}.hist -o ${OUT}_k${MER}.genomescope2 -p 2 -k ${MER} -m -1
done
```
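GenomeScope2 fits a mixture model to the histogram, but the underlying intuition is simple: total k-mer count divided by the coverage of the homozygous peak approximates the genome size. A toy illustration with a made-up histogram (the numbers and the naive peak-finding rule are ours, for intuition only):

```sh
#!/bin/bash
# Toy k-mer histogram: column 1 = coverage, column 2 = number of distinct k-mers.
# Values are invented for illustration.
cat > demo.hist <<'EOF'
1 1000
2 100
10 50
20 500
21 480
40 200
EOF
# Genome size ~ sum(coverage * count) / peak coverage, ignoring the
# error-k-mer spike at very low coverage (here, coverage <= 5).
EST=$(awk '{tot += $1*$2} $1>5 && $2>max {max=$2; peak=$1} END {print int(tot/peak)}' demo.hist)
echo "${EST}"   # 1489 for this toy histogram
```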
### Ploidy analysis (SmudgePlot)
```sh
#!/bin/bash
# >>>>>>
# SmudgePlot.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=SmudgePlot
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}
# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
TMP_DIR=tmp
RDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq
IN=reads_list
MER=23
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
# Make symlinks to KMC results (23-mer)
ln -s ../KMC_all_GenomeScope2/reads_list reads_list
ln -s ../KMC_all_GenomeScope2/${OUT}_k${MER}.hist ${OUT}_k${MER}.hist
ln -s ../KMC_all_GenomeScope2/${OUT}.kmc_suf ${OUT}.kmc_suf
ln -s ../KMC_all_GenomeScope2/${OUT}.kmc_pre ${OUT}.kmc_pre
conda activate smudgeplot
# Set coverage cutoffs
L=$(smudgeplot.py cutoff ${OUT}_k${MER}.hist L) # 51
U=$(smudgeplot.py cutoff ${OUT}_k${MER}.hist U) # 1300
# Transform 23-mer table
/work/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} -ci"$L" -cx"$U" dump -s kmc_k${MER}_"$L"_"$U".dump
# Run SmudgePlot
smudgeplot.py hetkmers -o kmc_k${MER}_"$L"_"$U" < kmc_k${MER}_"$L"_"$U".dump
smudgeplot.py plot kmc_k${MER}_"$L"_"$U"_coverages.tsv -k ${MER}
```
### Error correction of Illumina paired-end reads from muscle library
#### Error correction (Tadpole)
```sh
#!/bin/bash
# >>>>>>
# Tadpole_muscle.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}
# Tadpole
for lib in muscle sperm;do
tadpole.sh \
in=Mj01${lib}DNA_R1_trim.fq.gz \
in2=Mj01${lib}DNA_R2_trim.fq.gz \
out=Mj01${lib}DNA_R1_trim.tadpole.fq.gz \
out2=Mj01${lib}DNA_R2_trim.tadpole.fq.gz \
mode=correct k=50
done
```
#### K-mer count (KMC)
```sh
#!/bin/bash
# >>>>>>
# KMC_muscle.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_muscle
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}
# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
TMP_DIR=tmp
MER=23
RDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq
# Create input file list
echo -e "/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R1_trim.fq.gz\n/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R2_trim.fq.gz" > Mj01muscleDNA_trim.list
echo -e "/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R1_trim.tadpole.fq.gz\n/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R2_trim.tadpole.fq.gz" > Mj01muscleDNA_trim.tadpole.list
# Run KMC
mkdir -p ${TMP_DIR}
for IN_FILE in Mj01muscleDNA_trim.list Mj01muscleDNA_trim.tadpole.list ;do
OUT=${IN_FILE}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
/home/admin/kawato/apps/kmc/3.1.1/kmc -k${MER} -t${NUM_THREADS} -m${RAM_G} -ci${COV_MIN} -cs${COV_MAX} @${IN_FILE} ${OUT} ./${TMP_DIR}
/home/admin/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} histogram ${OUT}_k${MER}.hist -cx1000000
sed -e 's/\t/ /g' ${OUT}_k${MER}.hist > ${OUT}_k${MER}.hist.mod
done
```
#### K-mer distribution plot (ggplot2)
```R
##################################
#
# plot_kmer_distribution.R
#
##################################
setwd("C:/Users/kawato/study/kuruma_genome2017/analysis/kmer_distribution/")
library(tidyverse)
library(ggplot2)
library(patchwork)
cbp1 <- c("#999999", "#E69F00")
df_raw_before <- read_tsv("Mj01muscleDNA_trim.list_kmc_k23_1-1000000_k23.hist", col_names=c("Coverage", "Occurrence"))
df_raw_after <- read_tsv("Mj01muscleDNA_trim.tadpole.list_kmc_k23_1-1000000_k23.hist", col_names=c("Coverage", "Occurrence"))
df.Fre.before <- data.frame(Lib="Before", Coverage=df_raw_before$Coverage, Fre=df_raw_before$Occurrence)
df.Fre.after <- data.frame(Lib="After", Coverage=df_raw_after$Coverage, Fre=df_raw_after$Occurrence)
df.Cov_Fre.before <- data.frame(Lib="Before", Coverage=df_raw_before$Coverage, Cov_Fre=df_raw_before$Coverage * df_raw_before$Occurrence)
df.Cov_Fre.after <- data.frame(Lib="After", Coverage=df_raw_after$Coverage, Cov_Fre=df_raw_after$Coverage * df_raw_after$Occurrence)
df.Fre_merge <- dplyr::bind_rows(df.Fre.before, df.Fre.after)
df.Fre_merge$Lib <- as.factor(df.Fre_merge$Lib)
df.Fre_merge$Lib <- factor(df.Fre_merge$Lib, levels = c("Before", "After"))
df.Cov_Fre_merge <- dplyr::bind_rows(df.Cov_Fre.before, df.Cov_Fre.after)
df.Cov_Fre_merge$Lib <- as.factor(df.Cov_Fre_merge$Lib)
df.Cov_Fre_merge$Lib <- factor(df.Cov_Fre_merge$Lib, levels = c("Before", "After"))
levels(df.Fre_merge$Lib) <- c("Before", "After")
levels(df.Cov_Fre_merge$Lib) <- c("Before", "After")
p1 <- ggplot(df.Fre_merge) +
geom_line(aes(x=Coverage,y=Fre, colour=Lib),size=2) +
scale_x_continuous(expand = c(0,0), limits = c(0,220)) +
scale_y_continuous(expand = c(0,0), limits = c(0,15500000)) +
theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Coverage",
y = "Frequency",
colour = "Lib") +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18),legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
p2 <- ggplot(df.Cov_Fre_merge) +
geom_line(aes(x=Coverage,y=Cov_Fre, colour=Lib),size=2) +
scale_x_continuous(expand = c(0,0), limits = c(0,220)) +
scale_y_continuous(expand = c(0,0), limits = c(0,1000000000)) +
theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Coverage",
y = "Coverage*Frequency",
colour = "Lib") +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18),legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
p_merge <- p1 + p2 +
plot_annotation(tag_levels = 'A') & theme(plot.tag = element_text(size = 32,face="bold"))
ggsave(file = "p_merge.svg", plot = p_merge, width = 12, height = 6)
ggsave(file = "p_merge.png", plot = p_merge, dpi = 300, width = 12, height = 6)
```
## De novo assembly of Illumina genomic DNA libraries
### Preprocessing Illumina paired-end reads from sperm library (Fastp)
```sh
#!/bin/bash
# >>>>>>
# TrimPE_DeNovoAssembly.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}
# Run Fastp on sperm genomic PE reads
fastp -i Mj01spermDNA_R1.fastq -I Mj01spermDNA_R2.fastq \
-o Mj01spermDNA_R1_fastp01.fastq -O Mj01spermDNA_R2_fastp01.fastq \
--length_required=140 --qualified_quality_phred=20 \
--unqualified_percent_limit=10 --n_base_limit=0 --low_complexity_filter \
--thread=8 --html=Mj01sperm_fastp01.html
```
### Preprocessing Illumina mate-pair reads
#### Trimmomatic
```sh
#!/bin/bash
# >>>>>>
# MP_Trimmomatic.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}
# Define variables
sq=20
minlen=80
NUMTH=4
# Create symlinks
ln -s /work/SatohU/kawato/apps/Trimmomatic/0.36/adapters/TruSeq3-PE-2.fa .
# Trimmomatic
for i in 02k 03k 04k 05k 06k 07k 08k 09k 10k 11k 12k 13k 14k 15k; do
a1=${i}_mp.raw.R1.fq
a2=${i}_mp.raw.R2.fq
java -jar /work/SatohU/kawato/apps/Trimmomatic/0.36/trimmomatic-0.36.jar \
-threads ${NUMTH} -phred33 \
$a1 $a2 \
${a1}.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
${a1}.strict_sldwin4-${sq}_minlen${minlen}_unpair \
${a2}.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
${a2}.strict_sldwin4-${sq}_minlen${minlen}_unpair \
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:${sq} MINLEN:${minlen}
done
```
#### NextClip
```sh
#!/bin/bash
# >>>>>>
# MP_NextClip.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}
# Define variables
sq=20
minlen=80
NUMTH=4
for i in 02k 03k 04k 05k 06k 07k 08k 09k 10k 11k 12k 14k 15k; do
# Run NextClip
/work/SatohU/kawato/apps/nextclip-master/bin/nextclip \
--input_one ${i}_mp.raw.R1.fq.sldwin4-20_minlen80_trimmed \
--input_two ${i}_mp.raw.R2.fq.sldwin4-20_minlen80_trimmed \
--adaptor_sequence CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG \
-o ${i}_adaptor_mod_sldwin4-20_minlen80_trimmed \
-n 40000000 --trim_ends 0 --min_length 36 --remove_duplicates
# Concatenate categories A, B, and C
cat \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_A_R1.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_B_R1.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_C_R1.fastq \
> ${i}_adaptor_mod_sldwin4-20_minlen80_trimmed.nextclip.A-C_cat.R1.fq
cat \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_A_R2.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_B_R2.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_C_R2.fastq \
> ${i}_adaptor_mod_sldwin4-20_minlen80_trimmed.nextclip.A-C_cat.R2.fq
done
```
### De novo assembly (SPAdes)
```sh
#!/bin/bash
# >>>>>>
# SPAdes.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/analysis/
cd ${WDIR}
# Set VARs
NUMTH=24
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq
conda activate spades
# Run SPAdes
/work/SatohU/kawato/apps/spades/3.13.0/bin/spades.py \
-o SPAdes_out \
-1 ${R_DIR}/Mj01spermDNA_R1_fastp01.fastq \
-2 ${R_DIR}/Mj01spermDNA_R2_fastp01.fastq \
--threads=${NUMTH} \
--only-assembler \
-k 21,33,55,77,89,101 \
--memory=500
```
### Redundancy removal and scaffolding (Redundans)
#### Preprocessing Illumina paired-end reads from muscle library (Fastp)
```sh
#!/bin/bash
# >>>>>>
# Mj01muscleDNA_fastp_minq5.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}
# Set VARs
NUMTH=8
# Run Fastp
for lib in Mj01muscleDNA; do
/work/SatohU/kawato/apps/fastp/0.19.6/fastp \
--in1=${lib}_R1.fastq --out1=${lib}_R1.fastq.fastp.minq5.fq \
--in2=${lib}_R2.fastq --out2=${lib}_R2.fastq.fastp.minq5.fq \
--detect_adapter_for_pe --qualified_quality_phred=5 \
--html=${lib}_fastp.minq5.html
done
```
#### Run Redundans pipeline
```sh
#!/bin/bash
# >>>>>>
# Redundans.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/analysis/SPAdes_out/
cd ${WDIR}
# Set VARs
NUMTH=24
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads
python /work/SatohU/kawato/apps/redundans/redundans.py -i \
${R_DIR}/pe_hiseq/Mj01muscleDNA_R1.fastq.fastp.minq5.fq ${R_DIR}/pe_hiseq/Mj01muscleDNA_R2.fastq.fastp.minq5.fq \
${R_DIR}/mp_hiseq/02k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/02k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/03k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/03k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/04k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/04k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/05k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/05k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/06k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/06k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/07k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/07k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/08k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/08k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/10k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/10k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/12k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/12k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/14k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/14k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/15k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/15k.nextclip.A-C_cat.R2.fq \
-f scaffolds.fasta -o ../redundans_out -t ${NUMTH} -m 500
```
### Misassembly detection and assembly break (REAPR)
#### Reverse-complement mate-pair reads (SeqKit)
```sh
#!/bin/bash
# >>>>>>
# ReverseComplementMPReads.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}
# Reverse-complement MP reads so the pairs face inward (REAPR accepts "innies" only)
for n in 04 05 06 07 08 10 12 14 15 ; do
seqkit seq -pr ${n}k.nextclip.A-C_cat.R1.fq > ${n}k.nextclip.A-C_cat.R1.rc.fq
seqkit seq -pr ${n}k.nextclip.A-C_cat.R2.fq > ${n}k.nextclip.A-C_cat.R2.rc.fq
done
```
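`seqkit seq -p -r` writes the reverse complement of each read. For illustration only, the same operation on a single toy sequence using plain shell tools (seqkit additionally reverses the quality string of each FASTQ record):

```sh
# Reverse complement of one sequence with rev + tr (what seqkit does per read)
echo 'ACGTTG' | rev | tr 'ACGTacgt' 'TGCAtgca'   # prints CAACGT
```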
#### Run REAPR pipeline
```sh
#!/bin/bash
# >>>>>>
# REAPR.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis/
WDIR=REAPR
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
export PATH=/work/SatohU/kawato/apps/reapr/1.0.18:/work/SatohU/kawato/apps/reapr/1.0.18/src:/work/SatohU/kawato/apps/reapr/1.0.18/third_party:$PATH
module load R/3.4.2
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq
# Create symlinks
ln -s ../redundans_out/scaffolds.fa scaffolds.fa
# REAPR facheck
reapr facheck scaffolds.fa scaffolds.fa.facheck
# Run REAPR for each MP lib
for n in 04 05 06 07 08 10 12 14 15 ; do
# REAPR accepts only innies, so map the reverse-complemented MP reads
reapr smaltmap scaffolds.fa.facheck.fa ${R_DIR}/${n}k.nextclip.A-C_cat.R1.rc.fq ${R_DIR}/${n}k.nextclip.A-C_cat.R2.rc.fq mapped_${n}k.bam
reapr pipeline scaffolds.fa.facheck.fa mapped_${n}k.bam reapr_${n}k_out
done
```
## Transcriptome shotgun assembly
### Preprocessing Illumina paired-end reads (Fastp)
```sh
#!/bin/bash
# >>>>>>
# TrimPE_RNAseq_1.sh
# >>>>>>
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/reads/
cd ${WDIR}
# Run Fastp
cat lib_list|while read lib; do
/work/SatohU/kawato/apps/fastp/0.19.3/fastp \
-i ${lib}_R1_001.fastq.gz \
-I ${lib}_R2_001.fastq.gz \
-o ${lib}_R1_trimmed.fastq.gz \
-O ${lib}_R2_trimmed.fastq.gz \
-h ${lib}_fastp.html \
-j ${lib}_fastp.json \
--trim_front1=12 \
--trim_tail1=2 \
--length_required=80
done
```
### Mapping to the genome (HISAT2)
```sh
#!/bin/bash
# >>>>>>
# hisat2.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=hisat2
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=16
RDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/reads
REF=reapr_06k
# Create symlinks
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/REAPR/reapr_06k_out/04.break.broken_assembly.fa ${REF}.fa
# build index
hisat2-build -p ${NUMTH} ${REF}.fa ${REF}
cat ${RDIR}/lib_list|while read lib; do
# map reads
hisat2 -x ${REF} -p ${NUMTH} -1 ${RDIR}/${lib}_R1_trimmed.fastq.gz -2 ${RDIR}/${lib}_R2_trimmed.fastq.gz -S ${REF}_${lib}.sam 2>${REF}_${lib}_hisat2.log
# extract paired reads with at least one mapped mate
samtools view -f 1 -G 12 ${REF}_${lib}.sam -b -@ ${NUMTH} |samtools sort -@ ${NUMTH} -O BAM -n -o ${REF}_${lib}_mapped_qsort.bam
rm ${REF}_${lib}.sam
# Extract fastq
bedtools bamtofastq -i ${REF}_${lib}_mapped_qsort.bam -fq ${REF}_${lib}_mapped_R1.fq -fq2 ${REF}_${lib}_mapped_R2.fq
# Compress fastq
gzip -9 ${REF}_${lib}_mapped_R1.fq
gzip -9 ${REF}_${lib}_mapped_R2.fq
done
```
### De novo assembly
#### Trinity
```sh
#!/bin/bash
# >>>>>>
# Trinity.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=hisat2_mapped_trinity
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=24
# load cmake; set TRINITY_HOME, PATH
module load cmake/3.11.4
export TRINITY_HOME=/work/SatohU/kawato/apps/trinity/2.8.6/
export PATH=/work/SatohU/kawato/apps/salmon/0.11.3/bin:/work/SatohU/kawato/apps/bowtie2/2.3.4.3/:$PATH
# Run Trinity
/work/SatohU/kawato/apps/trinity/2.8.6/Trinity \
--seqType fq \
--SS_lib_type RF \
--max_memory 500G \
--CPU ${NUMTH} --samples_file lib_list # lib_list: prepared manually
```
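`lib_list` itself is not reproduced here. Trinity's `--samples_file` expects a tab-delimited table of condition, replicate, and left/right read paths; a hypothetical example following the HISAT2 output naming above (library names invented for illustration):

```
# condition	replicate	left_reads	right_reads  (tab-separated)
libA	libA_rep1	reapr_06k_libA_mapped_R1.fq.gz	reapr_06k_libA_mapped_R2.fq.gz
libB	libB_rep1	reapr_06k_libB_mapped_R1.fq.gz	reapr_06k_libB_mapped_R2.fq.gz
```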
#### rnaSPAdes
```sh
#!/bin/bash
# >>>>>>
# rnaSPAdes.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=rnaspades
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=24
RDIR=../hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq
R2=${RDIR}right.norm.fq
# Run SPAdes
/work/SatohU/kawato/apps/spades/3.13.0/bin/spades.py --rna -o auto -1 ${R1} -2 ${R2} --ss-rf --threads ${NUMTH} --memory 500
```
#### Trans-ABySS
Rename fastq
```sh
#!/bin/bash
# >>>>>>
# rename_fastq.sh
# >>>>>>
# Rename FASTQ for Trans-ABySS
# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
cd ${WDIR}
# Set VARs
R1=left.norm.fq
R2=right.norm.fq
# Rename fastq
cat ${R1}| awk '{print (NR%4 == 1) ? "@" ++i "/1" : $0}' | gzip -c > ${R1}.renamed.gz &
cat ${R2}| awk '{print (NR%4 == 1) ? "@" ++i "/2" : $0}' | gzip -c > ${R2}.renamed.gz
wait
```
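The awk one-liner replaces every FASTQ header with a serial number plus a `/1` or `/2` suffix. Its behavior on a toy record (hypothetical read, not part of the data set):

```sh
# One toy FASTQ record; the real input is left.norm.fq
printf '@SRR000001.1 some description\nACGT\n+\nIIII\n' > toy_R1.fq
# Same renaming as above: header lines (every 4th line, starting at line 1) become @<serial>/1
awk '{print (NR%4 == 1) ? "@" ++i "/1" : $0}' toy_R1.fq   # header becomes "@1/1"; other lines pass through
```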
Trans-ABySS
```sh
#!/bin/bash
# >>>>>>
# Trans-ABySS.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=transabyss
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=24
RDIR=../hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq.renamed.gz
R2=${RDIR}right.norm.fq.renamed.gz
# Run Trans-ABySS
for mer in 31 41 51 61 71 81 91 101 111 121 131 ;do
/work/SatohU/kawato/apps/transabyss-2.0.1/transabyss \
--pe ${R1} ${R2} --SS \
--outdir transabyss_k${mer} --name transabyss_k${mer} \
--cleanup 3 --threads ${NUMTH} --kmer ${mer}
done
```
### Redundancy removal (TransDecoder, CD-HIT-EST)
```sh
#!/bin/bash
# >>>>>>
# NrTranscripts.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=NrTranscripts
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=24
path2transdecoder='/work/SatohU/kawato/apps/transdecoder/5.5.0'
path2cdhit='/work/SatohU/kawato/apps/cd-hit/4.8.1'
path2busco='/work/SatohU/kawato/apps/busco/scripts'
export PATH=${path2transdecoder}:${path2cdhit}:${path2busco}:$PATH
# Create symlinks
ln -s ../hisat2_mapped_trinity/trinity_out_dir/Trinity.fasta Trinity.fa
ln -s ../rnaspades/auto/transcripts.fasta rnaspades.fa
for mer in 31 41 51 61 71 81 91 101 111 121 ;do
ln -s ../transabyss/transabyss_k${mer}/transabyss_k${mer}-final.fa transabyss_k${mer}.fa
done
# rename entries
for ASM in Trinity rnaspades transabyss_k31 transabyss_k41 transabyss_k51 transabyss_k61 transabyss_k71 transabyss_k81 transabyss_k91 transabyss_k101 transabyss_k111 transabyss_k121 ;do
cat ${ASM}.fa | awk -v var="${ASM}" '/^>/{print ">" var "_" ++i; next}{print}' >${ASM}.renamed
cat ${ASM}.renamed >>tr.merged
done
# TransDecoder
TransDecoder.LongOrfs -S -t tr.merged
TransDecoder.Predict --retain_long_orfs_mode dynamic --single_best_only -T 5000 -t tr.merged
cat tr.merged.transdecoder.cds |grep "complete" |sed 's/>//g' | cut -d " " -f 1 >tr.merged.transdecoder.cds.comp.list
cat tr.merged.transdecoder.cds |cut -d " " -f 1 > tr.merged.transdecoder.cds.tmp
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' tr.merged.transdecoder.cds.comp.list tr.merged.transdecoder.cds.tmp > tr.merged.transdecoder.cds.comp
rm tr.merged.transdecoder.cds.tmp
# cd-hit-est
cd-hit-est \
-i tr.merged.transdecoder.cds.comp -o tr.merged.transdecoder.cds.comp.nr \
-c 0.98 -G 0 -aS 0.25 -d 0 -T 0 -M 0
cat tr.merged.transdecoder.cds.comp.nr |grep ">"|sed 's/>//g' | cut -d "." -f 1 >tr.merged.transdecoder.cds.comp.nr.list
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' tr.merged.transdecoder.cds.comp.nr.list tr.merged > tr.merged.nr # final assembly
```
### BUSCO (transcriptome, arthropoda_odb10)
```sh
#!/bin/bash
# >>>>>>
# BUSCO_transcriptome.sh
# >>>>>>
# Run BUSCO
for ASM in ICRK01.1.fsa_nt ICRJ01.1.fsa_nt GGLH01.1.fsa_nt GGUK01.1.fsa_nt; do
busco -i ${ASM} -l arthropoda_odb10 -c 24 -o ${ASM}.transcriptome.busco -m transcriptome -f
done
```
## Scaffolding of genome assembly using transcriptome data
### Scaffolding using cDNA alignments (L_RNA_scaffolder)
```sh
#!/bin/bash
# >>>>>>
# L_RNA_scaffolder.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=L_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
DRAFT=reapr_15k_out
# Make symlink
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/REAPR/reapr_15k_out/04.break.broken_assembly.fa ${DRAFT}.fa
# Merge de novo assembled transcripts (tr.merged.nr) and Iso-Seq cDNA (polished.hq.fasta)
cat \
/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/NrTranscripts/tr.merged.nr \
/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/polished.hq.fasta >tr.merged
# Align transcripts by BLAT
cat tr.merged \
| parallel --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${DRAFT}.tr.merged.psl
# Run L_RNA_scaffolder
/work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \
-d /work/kawato/apps/L_RNA_scaffolder/ \
-i ${DRAFT}.tr.merged.psl \
-j ${DRAFT}.fa \
-o L_RNA_scaffolder_out \
-l 0.90 -p 0.98 \
-n 100
```
### Scaffolding using paired-end RNA-seq reads (P_RNA_scaffolder)
```sh
#!/bin/bash
# >>>>>>
# P_RNA_scaffolder.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=P_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
DRAFT=L_RNA_scaffolder
RDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq.renamed.gz
R2=${RDIR}right.norm.fq.renamed.gz
# Make symlinks
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/L_RNA_scaffolder/L_RNA_scaffolder_out/L_RNA_scaffolder.fasta ${DRAFT}.fasta
# build hisat2 index
hisat2-build -p 24 ${DRAFT}.fasta ${DRAFT}
# Map in silico-normalized RNA-seq reads generated by Trinity
hisat2 -p 24 -x ${DRAFT} -1 ${R2} -2 ${R1} -S ${DRAFT}_hisat2.sam 2>${DRAFT}_hisat2.log
# Run P_RNA_scaffolder.sh
/work/kawato/apps/P_RNA_scaffolder/P_RNA_scaffolder.sh \
-d /work/kawato/apps/P_RNA_scaffolder/ \
-i ${DRAFT}_hisat2.sam \
-j ${DRAFT}.fasta \
-F ${R2} -R ${R1} \
-o P_RNA_scaffolder_out \
-s yes -b yes -p 0.98 -t 24
```
### Gap filling (Sealer)
```sh
#!/bin/bash
# >>>>>>
# Sealer.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=Sealer
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
num_threads=24
b1=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R1_fastp01.fastq
b2=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R2_fastp01.fastq
# Make symlinks
ln -s ../P_RNA_scaffolder/P_RNA_scaffolder_out/P_RNA_scaffold.fasta P_RNA_scaffold.fasta
# Abyss-Sealer
abyss-sealer -b20G --threads=${num_threads} -k51 -k71 -k91 -o sealer_out -S P_RNA_scaffold.fasta ${b1} ${b2}
```
### Polishing (NtEdit)
```sh
#!/bin/bash
# >>>>>>
# NtEdit.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=NtEdit
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
num_threads=24
b1=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R1_fastp01.fastq
b2=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R2_fastp01.fastq
assembly_name=sealer_out
# Create symlinks
# abyss-sealer writes its gap-filled assembly to <prefix>_scaffold.fa
ln -s ../Sealer/sealer_out_scaffold.fa sealer_out.fa
# NtEdit
nthits -c 5 -b 36 -k 47 -t ${num_threads} --outbloom -p Mj01muscleDNA ${b1} ${b2}
ntedit -f ${assembly_name}.fa -r Mj01muscleDNA_k47.bf -k 47 -b ${assembly_name}_ntedit_k47 -t ${num_threads}
```
### Removing short (<2kb) scaffolds
```sh
#!/bin/bash
# >>>>>>
# Remove2kless.sh
# >>>>>>
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=draft_up2k/
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Create symlink
ln -s ../NtEdit/sealer_out_ntedit_k47_edited.fa sealer_out_ntedit_k47_edited.fa
# Remove scaffolds < 2kb
seqkit seq -m 2000 sealer_out_ntedit_k47_edited.fa >Mj.up2k.fa
```
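`seqkit seq -m 2000` drops every record shorter than 2 kb. A minimal awk equivalent, shown on toy data with a 5 bp threshold (illustration only; the pipeline uses seqkit):

```sh
# Toy FASTA: "a" is 4 bp, "b" is 6 bp; a 5 bp cutoff keeps only "b"
printf '>a\nACGT\n>b\nAAAAAA\n' > toy_len.fa
# Buffer each record; print it only if its total sequence length reaches the threshold
awk -v min=5 '
  /^>/ {if (len >= min) printf "%s", rec; rec = $0 ORS; len = 0; next}
       {rec = rec $0 ORS; len += length($0)}
  END  {if (len >= min) printf "%s", rec}' toy_len.fa
```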
## Assembly improvement using ONT long reads, Illumina mate-pair reads, and transcriptome data
### Mate-pair read processing (NxTrim, Fastp, BBDuk)
```sh
#!/bin/bash
# >>>>>>
# TrimMP_2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}
for n in 02 03 04 05 06 07 08 10 12 14 15;do
# NxTrim
nxtrim -1 ${n}k_R1_raw.fastq.gz -2 ${n}k_R2_raw.fastq.gz --separate --justmp --rf -O ${n}k_nxtrim 2>${n}k_nxtrim.log
# Fastp
fastp -i ${n}k_nxtrim_R1.mp.fastq.gz -I ${n}k_nxtrim_R2.mp.fastq.gz -o ${n}k_nxtrim_R1.mp_trim.fq.gz -O ${n}k_nxtrim_R2.mp_trim.fq.gz --trim_front1 2 --n_base_limit 0 --length_required 60
# bbduk
bbduk.sh in1=${n}k_nxtrim_R1.mp_trim.fq.gz in2=${n}k_nxtrim_R2.mp_trim.fq.gz out1=${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz out2=${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz entropy=0.9 entropywindow=50 entropyk=5 2>${n}k_nxtrim.mp_trim.bbduk.log
done
```
### Gap filling, assembly break, and polishing (TGS-GapCloser, NtEdit)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round0.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round0
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
DRAFT=/work/kawato/study/kuruma_genome2017/draft_up2k/Mj.up2k.fa
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round0
NUMTH=56
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff ${DRAFT} \
--reads ${LREADS} \
--output ${ROUND} \
--ne --thread ${NUMTH}
rm done_step*
# Break into contigs
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i ${ROUND}.scaff_seqs -o ${ROUND}.scaff_seqs.contigs
# Polish contigs
conda activate ntedit-1.3.2
ntedit -f ${ROUND}.scaff_seqs.contigs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}.ntedit_k33 -t ${NUMTH}
```
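The four ntedit calls form a descending-k cascade in which each call polishes the previous call's output; the same pattern recurs in every scaffolding round below. As a loop it reads (dry run shown with echo; drop the echo to execute):

```sh
# Descending-k ntedit cascade as a loop (dry run; file naming follows the round0 script above)
ROUND=round0
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
NUMTH=56
IN=${ROUND}.scaff_seqs.contigs
for k in 64 55 47 33; do
  echo ntedit -f ${IN} -r ${BF_DIR}/solids_k${k}.bf -b ${ROUND}.ntedit_k${k} -t ${NUMTH}
  IN=${ROUND}.ntedit_k${k}_edited.fa  # ntedit writes <prefix>_edited.fa; feed it to the next pass
done
```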
### Scaffolding (1)
#### Scaffolding using ONT long reads (LRScaf)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_LRScaf.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_LRScaf
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
REF_DIR=/work/kawato/study/kuruma_genome2017/analysis/scaffolding_round0
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round1
DRAFT=draft
NUMTH=56
# Create symlink to input
ln -s ../scaffolding_round0/round0.ntedit_k33_edited.fa ${DRAFT}.fa
# Map Nanopore reads
minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log
# Scaffolding with LRScaf
java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar --contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56
# Gap filling with TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh --scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} --output ${ROUND}_LRScaf --ne --thread 56
rm done_step*
# Break into contigs
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i ${ROUND}_LRScaf.scaff_seqs -o ${ROUND}_LRScaf.scaff_seqs.contigs
# Polish contigs
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_LRScaf.scaff_seqs.contigs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_LRScaf.contigs.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_LRScaf.contigs.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_LRScaf.contigs.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_LRScaf.contigs.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using Illumina mate-pair reads (BESST)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_BESST.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_BESST
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round1
DRAFT=draft
NUMTH=56
# Make symlink to input
ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.contigs.ntedit_k33_edited.fa ${DRAFT}.fa
# Map mate-pair reads
for n in 02 03 04 05 06 07 08 10 12 14 15;do
R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz
R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz
minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam
samtools index ${n}k.bam
done
# Run BESST
conda activate besst-py2.7
runBESST -g -c ${DRAFT}.fa \
-f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \
-orientation rf rf rf rf rf rf rf rf rf rf rf -q
conda deactivate
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff BESST_output/pass11/Scaffolds_pass11.fa \
--reads ${LREADS} \
--output ${ROUND}_BESST \
--ne --thread ${NUMTH}
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using cDNA alignments (L_RNA_scaffolder)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_L_RNA_Scaffolder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_L_RNA_Scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
ROUND=round1
DRAFT=draft
NUMTH=56
TR=tr.merged
# Make symlink to input
ln -s ../scaffolding_${ROUND}_BESST/${ROUND}_BESST.ntedit_k33_edited.fa ${DRAFT}.fa
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/tr.merged ${TR}
# BLAT
conda activate pasa
cat ${TR} \
| parallel --gnu --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${TR}.psl
# L_RNA_scaffolder
/work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \
-d /work/kawato/apps/L_RNA_scaffolder/ \
-i ${TR}.psl -j ${DRAFT}.fa -o . -l 0.90 -p 0.98 -n 100
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff L_RNA_scaffolder.fasta --reads ${LREADS} \
--output ${ROUND}_L_RNA_scaffolder --ne --thread 56
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_L_RNA_scaffolder.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k33 -t ${NUMTH}
```
### Misassembly detection and assembly break (REAPR)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_REAPR.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_REAPR
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round1
DRAFT=draft
NUMTH=56
n=10
# Make symlink to input
ln -s ../scaffolding_${ROUND}_L_RNA_Scaffolder/${ROUND}_L_RNA_scaffolder.ntedit_k33_edited.fa ${DRAFT}.fa
# REAPR
conda activate reapr
reapr facheck ${DRAFT}.fa ${DRAFT}.facheck
reapr smaltmap \
-n ${NUMTH} ${DRAFT}.facheck.fa \
${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.RC.fastq \
${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.RC.fastq \
mapped_${n}k.bam
reapr pipeline ${DRAFT}.facheck.fa mapped_${n}k.bam reapr_${n}k_out
```
### Haplotig removal (1) (purge_haplotigs)
```sh
#!/bin/bash
# >>>>>>
# Scaffolding_round1_purge_haplotigs.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_purge_haplotigs
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fq
ROUND=round1
DRAFT=draft
NUMTH=56
# Make symlink to input
ln -s ../scaffolding_${ROUND}_REAPR/reapr_10k_out/04.break.broken_assembly.fa ${DRAFT}.fa
# Map long reads
minimap2 -t ${NUMTH} -ax map-ont ${DRAFT}.fa ${LREADS} --secondary=no | samtools sort -@ ${NUMTH} -o aligned.bam -T tmp.ali
# Purge_haplotigs
conda activate purge_haplotigs
purge_haplotigs hist -b aligned.bam -g ${DRAFT}.fa -t ${NUMTH} 2>purge_haplotigs_hist.log
purge_haplotigs cov -i aligned.bam.gencov -l 0 -m 15 -h 40 -o coverage_stats.csv 2>purge_haplotigs_cov.log
purge_haplotigs purge -g ${DRAFT}.fa -c coverage_stats.csv -t ${NUMTH} 2>purge_haplotigs_purge.log
purge_haplotigs clip -p curated.fasta -h curated.haplotigs.fasta -l 500 -o clip_l0.5k 2>purge_haplotigs_clip_l0.5k.log
```
### Scaffolding (2)
#### Scaffolding using ONT long reads (LRScaf)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_LRScaf.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_LRScaf
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round2
DRAFT=draft
NUMTH=56
# Create symlink to input
ln -s ../scaffolding_round1_purge_haplotigs/clip_l0.5k.fasta ${DRAFT}.fa
# Map Nanopore reads
minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log
# Scaffolding with LRScaf
java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar \
--contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56
# Gap filling with TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} \
--output ${ROUND}_LRScaf --ne --thread 56
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_LRScaf.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_LRScaf.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_LRScaf.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_LRScaf.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_LRScaf.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using Illumina mate-pair reads (BESST)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_BESST.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_BESST
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round2
DRAFT=draft
NUMTH=56
# Make symlink to input
ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.ntedit_k33_edited.fa ${DRAFT}.fa
# Map mate-pair reads
for n in 02 03 04 05 06 07 08 10 12 14 15;do
R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz
R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz
minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam
samtools index ${n}k.bam
done
# Run BESST
conda activate besst-py2.7
runBESST -g -c ${DRAFT}.fa \
-f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \
-orientation rf rf rf rf rf rf rf rf rf rf rf -q
conda deactivate
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff BESST_output/pass11/Scaffolds_pass11.fa \
--reads ${LREADS} \
--output ${ROUND}_BESST \
--ne --thread ${NUMTH}
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using cDNA alignments (L_RNA_scaffolder)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_L_RNA_Scaffolder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_L_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
ROUND=round2
DRAFT=draft
NUMTH=56
TR=tr.merged
# Make symlink to input
ln -s ../scaffolding_${ROUND}_BESST/${ROUND}_BESST.ntedit_k33_edited.fa ${DRAFT}.fa
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/tr.merged ${TR}
# BLAT
conda activate pasa
cat ${TR} \
| parallel --gnu --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${TR}.psl
# L_RNA_scaffolder
/work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \
-d /work/kawato/apps/L_RNA_scaffolder/ \
-i ${TR}.psl -j ${DRAFT}.fa -o . -l 0.90 -p 0.98 -n 100
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff L_RNA_scaffolder.fasta \
--reads ${LREADS} \
--output ${ROUND}_L_RNA_scaffolder \
--ne --thread 56
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_L_RNA_scaffolder.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k33 -t ${NUMTH}
```
### Haplotig removal (2) (purge_haplotigs)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_purge_haplotigs.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_purge_haplotigs
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fq
ROUND=round2
DRAFT=draft
NUMTH=56
# Make symlink to input
ln -s ../scaffolding_${ROUND}_L_RNA_scaffolder/${ROUND}_L_RNA_scaffolder.ntedit_k33_edited.fa ${DRAFT}.fa
# Map long reads
minimap2 -t ${NUMTH} -ax map-ont ${DRAFT}.fa ${LREADS} --secondary=no | samtools sort -@ ${NUMTH} -o aligned.bam -T tmp.ali
# Purge_haplotigs
conda activate purge_haplotigs
purge_haplotigs hist -b aligned.bam -g ${DRAFT}.fa -t ${NUMTH} 2>purge_haplotigs_hist.log
purge_haplotigs cov -i aligned.bam.gencov -l 0 -m 15 -h 40 -o coverage_stats.csv 2>purge_haplotigs_cov.log
purge_haplotigs purge -g ${DRAFT}.fa -c coverage_stats.csv -t ${NUMTH} 2>purge_haplotigs_purge.log
purge_haplotigs clip -p curated.fasta -h curated.haplotigs.fasta -l 500 -o clip_l0.5k 2>purge_haplotigs_clip_l0.5k.log
```
### Scaffolding (3)
#### Scaffolding using ONT long reads (LRScaf)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round3_LRScaf.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round3_LRScaf
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round3
DRAFT=draft
NUMTH=56
# Create symlink to input
ln -s ../scaffolding_round2_purge_haplotigs/clip_l0.5k.fasta ${DRAFT}.fa
# Map Nanopore reads
minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log
# Scaffolding with LRScaf
java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar --contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56
# Gap filling with TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} \
--output ${ROUND}_LRScaf --ne --thread 56
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_LRScaf.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_LRScaf.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_LRScaf.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_LRScaf.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_LRScaf.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using Illumina mate-pair reads (BESST)
```sh
#!/bin/bash
# >>>>>>
# scaffolding_round3_BESST.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round3_BESST
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round3
DRAFT=draft
NUMTH=56
# Make symlink to input
ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.ntedit_k33_edited.fa ${DRAFT}.fa
# Map mate-pair reads
for n in 02 03 04 05 06 07 08 10 12 14 15;do
R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz
R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz
minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam
samtools index ${n}k.bam
done
# Run BESST
conda activate besst-py2.7
runBESST -g -c ${DRAFT}.fa \
-f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \
-orientation rf rf rf rf rf rf rf rf rf rf rf -q
conda deactivate
# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff BESST_output/pass11/Scaffolds_pass11.fa \
--reads ${LREADS} \
--output ${ROUND}_BESST \
--ne --thread ${NUMTH}
rm done_step*
# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH}
```
### Assembly polishing (HyPo)
```sh
#!/bin/bash
# >>>>>>
# hypo.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=hypo
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set variables
NUMTH=56
GENOME_SIZE=2g
COV=50
R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R1_trim.tadpole.fq.gz
R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R2_trim.tadpole.fq.gz
# Make symlink
ln -s ../scaffolding_round3_BESST/round3_BESST.ntedit_k33_edited.fa hypo0.fa
# Create input list required by hypo
echo -e "$R1\n$R2" > il_names.txt
conda activate hypo-1.0.3
# hypo, round1
# Map PE reads
minimap2 --secondary=no --MD -ax sr -t $NUMTH hypo0.fa $R1 $R2 | samtools view -Sb - > hypo0.bam
samtools sort -@$NUMTH -o hypo0.sorted.bam hypo0.bam
samtools index hypo0.sorted.bam
rm hypo0.bam
# Hypo
hypo -d hypo0.fa -r @il_names.txt -s ${GENOME_SIZE} -c ${COV} -b hypo0.sorted.bam -t $NUMTH -o hypo1.fa
# hypo, round2
# Map PE reads
minimap2 --secondary=no --MD -ax sr -t $NUMTH hypo1.fa $R1 $R2 | samtools view -Sb - > hypo1.bam
samtools sort -@$NUMTH -o hypo1.sorted.bam hypo1.bam
samtools index hypo1.sorted.bam
rm hypo1.bam
# Hypo
hypo -d hypo1.fa -r @il_names.txt -s ${GENOME_SIZE} -c ${COV} -b hypo1.sorted.bam -t $NUMTH -o hypo2.fa
```
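HyPo's `-c` option expects the approximate short-read coverage of the assembly (here `COV=50`). When no BAM is at hand, the value can be estimated from read counts alone; the read-pair count, read length, and genome size below are illustrative placeholders, not values from the paper:

```shell
# Rough coverage estimate: (read pairs x 2 x read length) / genome size.
# All three numbers are placeholders for illustration only.
awk -v pairs=333333334 -v len=150 -v genome=2000000000 \
    'BEGIN { printf "approx coverage: %.0fx\n", (pairs * 2 * len) / genome }'
```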
### Assembly curation
#### Mitogenome contamination removal
```sh
#!/bin/bash
# >>>>>>
# purge_mtDNA.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=purge_mtDNA
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set variables
QUERY=MjmtDNA
SUBJECT=hypo2
# Create symlinks
ln -s ../hypo/hypo2.fa ${SUBJECT}.fa
ln -s /work/kawato/study/ref/genomes/MjmtDNA.fa ${QUERY}.fa
# BLASTN
blastn -query ${QUERY}.fa -subject ${SUBJECT}.fa -perc_identity 90 -outfmt 7 -out ${QUERY}.${SUBJECT}.ident90.blastn
## Following regions were identified as contamination
# NC_007010.1 Scaffolds_826 99.464 932 5 0 6721 7652 2154 1223 0.0 1694
# NC_007010.1 Scaffolds_23445 97.274 4035 79 14 3169 7201 45436 49441 0.0 6813
# NC_007010.1 Scaffolds_23445 99.719 2850 8 0 92 2941 42437 45286 0.0 5219
echo -e "Scaffolds_826\nScaffolds_23445" >${QUERY}.${SUBJECT}.ident90.blastn.hit.list
seqkit grep -v -nf ${QUERY}.${SUBJECT}.ident90.blastn.hit.list ${SUBJECT}.fa >${SUBJECT}.${QUERY}.nohit.fa
blastn -query ${QUERY}.fa -subject ${SUBJECT}.${QUERY}.nohit.fa -outfmt 7 -out ${SUBJECT}.${QUERY}.nohit.blastn # The regions at issue have been removed
seqkit grep -np "Scaffolds_826" ${SUBJECT}.fa >Scaffolds_826.fa
seqkit grep -np "Scaffolds_23445" ${SUBJECT}.fa >Scaffolds_23445.fa
seqkit stats Scaffolds_826.fa Scaffolds_23445.fa
# seqkit stats Scaffolds_826.fa Scaffolds_23445.fa
# file format type num_seqs sum_len min_len avg_len max_len
# Scaffolds_826.fa FASTA DNA 1 164,807 164,807 164,807 164,807
# Scaffolds_23445.fa FASTA DNA 1 116,046 116,046 116,046 116,046
# -> Trim Scaffolds_826:1-2500
# -> Retain Scaffolds_23445:1-42000 and 50001-116046
seqkit subseq -r 2501:-1 Scaffolds_826.fa > Scaffolds_826_2501-164807.fa # Trim the first 2500 bases
seqkit stats Scaffolds_826.fa Scaffolds_826_2501-164807.fa >Scaffolds_826_2501-164807.stats # Verify lengths
blastn -query Scaffolds_826.fa -subject Scaffolds_826_2501-164807.fa -outfmt 7 -out Scaffolds_826_2501-164807.blastn # Confirm that the first 2500 bases have been removed
seqkit subseq -r 1:42000 Scaffolds_23445.fa > Scaffolds_23445_1-42000.fa # Retain first 42000 bases
seqkit stats Scaffolds_23445.fa Scaffolds_23445_1-42000.fa >Scaffolds_23445_1-42000.stats # Verify lengths
blastn -query Scaffolds_23445.fa -subject Scaffolds_23445_1-42000.fa -outfmt 7 -out Scaffolds_23445_1-42000.blastn # Confirm that the first 42000 bases have been retained
seqkit subseq -r 50001:-1 Scaffolds_23445.fa > Scaffolds_23445_50001-116046.fa # Trim first 50000 bases
seqkit stats Scaffolds_23445.fa Scaffolds_23445_50001-116046.fa >Scaffolds_23445_50001-116046.stats # Verify lengths
blastn -query Scaffolds_23445.fa -subject Scaffolds_23445_50001-116046.fa -outfmt 7 -out Scaffolds_23445_50001-116046.blastn # Confirm that the first 50000 bases have been removed
cat Scaffolds_826_2501-164807.fa Scaffolds_23445_1-42000.fa Scaffolds_23445_50001-116046.fa >${SUBJECT}.${QUERY}.mtDNA_trimmed.fa
blastn -query ${QUERY}.fa -subject ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa -outfmt 7 -out ${SUBJECT}.${QUERY}.mtDNA_trimmed.blastn # no hit
cat ${SUBJECT}.${QUERY}.nohit.fa ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa >${SUBJECT}.mtDNA_purged.fa # Merge originally clean seqs + purged seqs
seqkit stats ${SUBJECT}.fa ${SUBJECT}.${QUERY}.nohit.fa ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa ${SUBJECT}.mtDNA_purged.fa >${SUBJECT}.mtDNA_purged.stats # stats
blastn -query ${QUERY}.fa -subject ${SUBJECT}.mtDNA_purged.fa -outfmt 7 -out ${SUBJECT}.mtDNA_purged.blastn # no hit
seqkit seq -m 2000 ${SUBJECT}.mtDNA_purged.fa >${SUBJECT}.mtDNA_purged.up2k.fa # remove < 2kbp frags
seqkit stats ${SUBJECT}.mtDNA_purged.fa ${SUBJECT}.mtDNA_purged.up2k.fa >${SUBJECT}.mtDNA_purged.up2k.stats # stats
seqkit rename ${SUBJECT}.mtDNA_purged.up2k.fa |seqkit sort --quiet -r -l > ${SUBJECT}.mtDNA_purged.up2k.sorted.fa # Sort seqs. seqkit rename is needed because the two Scaffolds_23445 fragments would otherwise be regarded as duplicates
seqkit fx2tab -l -n -i -H ${SUBJECT}.mtDNA_purged.up2k.sorted.fa >${SUBJECT}.mtDNA_purged.up2k.sorted.tab # Confirm that the seqs have been sorted by length; saved as a table file
awk 'BEGIN{n = 0}; {if(/^>/){printf(">scaffold_" "%05d\n", ++n) } else{print $0}}' ${SUBJECT}.mtDNA_purged.up2k.sorted.fa > ${SUBJECT}.mtDNA_purged.up2k.sorted.renamed.fa # Rename entries
```
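As a sanity check on the Scaffolds_23445 split above: the two retained windows (1-42000 and 50001-116046) should cover the whole 116,046 bp scaffold minus the 42001-50000 excision. The arithmetic can be verified directly:

```shell
# Lengths of the retained windows versus the original scaffold
awk 'BEGIN {
    total = 116046
    keep1 = 42000                     # bases 1..42000
    keep2 = total - 50001 + 1         # bases 50001..116046
    cut   = 50000 - 42001 + 1         # excised window 42001..50000
    print keep1 + keep2, total - cut  # both should print 108046
}'
```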
#### Filter out sequences with abnormal coverage
```sh
#!/bin/bash
# >>>>>>
# filter_lowqual_seq.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=filter_lowqual_seq
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
ASM=mtDNA_purged
NUMTH=56
ln -s ../purge_mtDNA/mtDNA_purged.up2k.sorted.renamed.fa ${ASM}.fa
# Identify sequences with extreme GC content (less than 10% or more than 70%). The thresholds were determined by manual IGV inspection of the annotated genome. Some of the extremely GC-poor sequences appear to be filled with gaps rather than AT's
seqkit fx2tab -n -g -l -B N ${ASM}.fa >${ASM}.len.gc.n.tab
awk '$4 >=60 {print $1}' ${ASM}.len.gc.n.tab >${ASM}.gap60up
sort -nk3 ${ASM}.len.gc.n.tab |awk ' $3 < 10 {print $1}' >${ASM}.gc10minus
sort -nk3 ${ASM}.len.gc.n.tab |awk ' $3 > 70 {print $1}' >${ASM}.gc70up
# Coverage file
for LIB in sperm muscle ;do
R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.tadpole.fq.gz
R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.tadpole.fq.gz
minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${ASM}.fa ${R1} ${R2} | samtools view -Sb - |\
samtools sort -o Mj01${LIB}DNA.sorted.bam
samtools index Mj01${LIB}DNA.sorted.bam
done
samtools coverage Mj01spermDNA.sorted.bam Mj01muscleDNA.sorted.bam >${ASM}.coverage
# Identify low-coverage scaffolds
sort -nk7 ${ASM}.coverage|awk ' $7 < 20 {print $1}' >${ASM}.meandepth20minus
sort -nk7 ${ASM}.coverage|awk ' $6 < 30 {print $1}' >${ASM}.coverage30minus
# Make a non-redundant list of low-quality sequences
cat ${ASM}.gap60up ${ASM}.gc10minus ${ASM}.gc70up ${ASM}.meandepth20minus ${ASM}.coverage30minus|sort|uniq >${ASM}.lowQuality.list
# BLAST to rescue gene-containing scaffolds
seqkit grep -nf ${ASM}.lowQuality.list ${ASM}.fa >${ASM}.lowQuality.fa
blastn -query ${ASM}.lowQuality.fa -subject /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr \
-outfmt 7 -out ${ASM}.lowQuality.tr.merged.nr.blastn
grep -v "#" ${ASM}.lowQuality.tr.merged.nr.blastn |awk '$11 == 0.0 {print $2}'|sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.list
seqkit grep -nf ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.list /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.fa
blastn -query ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.fa -subject ${ASM}.fa \
-max_target_seqs 1 -outfmt 7 -out ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn
grep -v "#" ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn|awk '$11 == 0.0 {print $2}'|sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list
cat ${ASM}.lowQuality.list|while read line;do
cat ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list | awk -v line="${line}" '{if (match($0, line) > 0){print $0}}'
done >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits
grep -f ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn|cut -f 1 |sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.list
# Three transcripts were found and queried against the NCBI nr protein DB with BLASTX. One transcript was found to be a likely shrimp protein and was kept. The other two had no unambiguous hits and therefore were discarded
seqkit grep -nf ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.list /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr > ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.fa
# The scaffold containing the shrimp protein gene was isolated and broken into contigs
seqkit grep -np "scaffold_11228" ${ASM}.fa >scaffold_11228.fa
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i scaffold_11228.fa -o scaffold_11228.fa.contigs
# Filtered sequences
seqkit grep -nvf ${ASM}.lowQuality.list ${ASM}.fa >${ASM}.filtered.fa.tmp
cat ${ASM}.filtered.fa.tmp scaffold_11228.fa.contigs >${ASM}.filtered.fa
# Trim gaps at or near scaffold ends. Takes some time; gsub anchored at string ends is extremely slow, so the seqs are reverse-complemented and trimmed from the start instead
seqkit fx2tab ${ASM}.filtered.fa | awk ' BEGIN {
OFS="\t"
}
{
gsub( /^[ATGCN]{0,100}N{1,}/, "", $2 )
print $1 , $2
}
' |seqkit tab2fx|seqkit seq -pr|seqkit fx2tab| awk ' BEGIN {
OFS="\t"
}
{
gsub( /^[ATGCN]{0,100}N{1,}/, "", $2 )
print $1 , $2
}
' |seqkit tab2fx|seqkit seq -pr >${ASM}.filtered.mod.fa
# Verify that gaps on ends have been eliminated
seqkit subseq -r 1:100 ${ASM}.filtered.mod.fa |seqkit fx2tab > ${ASM}.filtered.mod.end_left.tab
awk '$2 ~/N/ {print $1}' ${ASM}.filtered.mod.end_left.tab > ${ASM}.filtered.mod.end_left_gapAtEnd.list
seqkit subseq -r -100:-1 ${ASM}.filtered.mod.fa |seqkit fx2tab > ${ASM}.filtered.mod.end_right.tab
awk '$2 ~/N/ {print $1}' ${ASM}.filtered.mod.end_right.tab > ${ASM}.filtered.mod.end_right_gapAtEnd.list
# Sort seqs
seqkit sort --quiet -r -l ${ASM}.filtered.mod.fa > ${ASM}.filtered.mod.sorted.fa
# Confirm that the seqs have been sorted by length; saved as a table file
seqkit fx2tab -l -n -i -H ${ASM}.filtered.mod.sorted.fa >${ASM}.filtered.mod.sorted.tab
# Rename entries
awk 'BEGIN{n = 0}; {if(/^>/){printf(">scaffold_" "%05d\n", ++n) } else{print $0}}' ${ASM}.filtered.mod.sorted.fa > Mj_TUMSAT_v1.0.fa # Rename entries
```
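The GC screen above relies on `seqkit fx2tab -g`. A minimal pure-awk equivalent (toy single-line FASTA; real assemblies are multi-line, where seqkit is preferable) illustrates what the GC column contains:

```shell
# Toy single-line FASTA standing in for the real assembly
cat > toy.fa <<'EOF'
>s1
GGGGGGCCCA
>s2
ATATATATAT
EOF
# Print name and GC% per record; gsub() returns the number of G/C bases
awk '/^>/ { name = substr($1, 2); next }
     { gc = gsub(/[GCgc]/, "&"); printf "%s\t%.1f\n", name, 100 * gc / length($0) }' toy.fa
```

Records falling outside the 10-70% window would then be flagged exactly as in the `awk '$3 < 10'` / `awk '$3 > 70'` filters above.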
#### Screening bacterial contamination (Barrnap)
```sh
#!/bin/bash
# >>>>>>
# Barrnap.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Barrnap
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
NUMTH=56
ASM=mtDNA_purged.filtered.mod.sorted
# Symlink
ln -s ../filter_lowqual_seq/mtDNA_purged.filtered.mod.sorted.fa ${ASM}.fa
# Run Barrnap
conda activate barrnap
barrnap --kingdom euk --outseq ${ASM}.barrnap.euk.fa --threads ${NUMTH} ${ASM}.fa > ${ASM}.barrnap.euk.gff
grep -v "partial" ${ASM}.barrnap.euk.gff >${ASM}.barrnap.euk.full.gff
barrnap --kingdom bac --outseq ${ASM}.barrnap.bac.fa --threads ${NUMTH} ${ASM}.fa > ${ASM}.barrnap.bac.gff
# BLASTN search performed against the NCBI nr nucleotide database (accessed 2021-02-12).
# No bacterial sequence detected
```
#### Mapping Illumina paired-end reads onto the final assembly
```sh
#!/bin/bash
# >>>>>>
# map_IlluminaPE_to_genome.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=map_IlluminaPE_to_genome
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
ASM=Mj_TUMSAT_v1.0
NUMTH=56
# Create symlink
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa
DRAFT=${ASM}.fa
for LIB in sperm muscle ;do
R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.fq.gz
R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.fq.gz
minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${DRAFT} ${R1} ${R2} | samtools view -Sb - |samtools sort -o Mj01${LIB}DNA.sorted.bam
samtools index Mj01${LIB}DNA.sorted.bam
samtools flagstat -@ 16 -O tsv Mj01${LIB}DNA.sorted.bam >Mj01${LIB}DNA.sorted.bam.flagstat
done
```
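The `flagstat -O tsv` tables written above reduce to a single mapped-rate figure per library. The mock table below approximates flagstat's three-column TSV layout (QC-passed, QC-failed, category); exact category names vary between samtools versions, so treat the field matching as an assumption:

```shell
# Mock stand-in for a real *.flagstat TSV
printf '1000\t0\ttotal (QC-passed reads + QC-failed reads)\n950\t0\tmapped\n' > mock.flagstat
awk -F'\t' '$3 ~ /^total/ { t = $1 }
            $3 == "mapped" { m = $1 }
            END { printf "mapped: %.1f%%\n", 100 * m / t }' mock.flagstat
```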
## Mitogenome assembly (Flye, --meta)
```sh
#!/bin/bash
# >>>>>>
# flye_meta_MjmtDNA.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=flye_meta_MjmtDNA
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/Mj.up5k.fastq
ln -s /work/kawato/study/ref/genomes/MjmtDNA.fa MjmtDNA.fa
NUMTH=16
# Assemble using Flye
conda activate flye
flye --nano-raw ${LREADS} --out-dir flye_meta_up5k --threads ${NUMTH} --meta
# Identify the mitogenome contig
SUBJECT=flye_meta_up5k/assembly.fasta
QUERY=MjmtDNA.fa
# BLASTN; identified contig_2159
blastn -query ${QUERY} -subject ${SUBJECT} -outfmt 7 -out MjmtDNA_blastn.out
# Extract contig_2159; start position adjusted based on BLASTN alignment
seqkit grep -np "contig_2159" ${SUBJECT} |seqkit restart -i 75 >MjmtDNA_unpolished.fa
# Map PE reads
DRAFT=MjmtDNA_unpolished.fa
LIB=muscle
R1=/work/kawato/study/2020-07-17_Mj_genome/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.tadpole.fq.gz
R2=/work/kawato/study/2020-07-17_Mj_genome/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.tadpole.fq.gz
minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${DRAFT} ${R1} ${R2} |\
samtools view -Sb - -f 3 -q 60 |samtools sort -o Mj01${LIB}DNA.sorted.bam
samtools index Mj01${LIB}DNA.sorted.bam
# Polishing by Pilon
java -jar /work/kawato/apps/pilon/1.23/pilon-1.23.jar \
--genome ${DRAFT} \
--frags Mj01${LIB}DNA.sorted.bam \
--outdir ${DRAFT}.pilon \
--changes --vcf --tracks --threads ${NUMTH}
cp ${DRAFT}.pilon/pilon.fasta MjmtDNA_polished.fa
# Calculate coverage
minimap2 --secondary=no --MD -ax sr -t ${NUMTH} MjmtDNA_polished.fa ${R1} ${R2} | samtools view -Sb - -f 3 -q 60 |samtools sort -o MjmtDNA_polished_Mj01${LIB}DNA.sorted.bam
samtools index MjmtDNA_polished_Mj01${LIB}DNA.sorted.bam
samtools coverage MjmtDNA_polished_Mj01muscleDNA.sorted.bam > MjmtDNA_polished_Mj01muscleDNA.sorted.bam.coverage
minimap2 --secondary=no --MD -ax map-ont -t ${NUMTH} MjmtDNA_polished.fa ${LREADS} |samtools view -bS - |samtools sort -O BAM -o MjmtDNA_polished_ONTup5k.sorted.bam
samtools index MjmtDNA_polished_ONTup5k.sorted.bam
samtools coverage MjmtDNA_polished_ONTup5k.sorted.bam > MjmtDNA_polished_ONTup5k.sorted.bam.coverage
```
## Repeat identification and masking
### Repeat library construction (RepeatScout)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_RepeatScout.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_RepeatScout
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
SP=Marsupenaeus_japonicus
NUMTH=16
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${SP}.fa
# Subsample contigs
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -in ${SP}.fa -out ${SP}.fa.contigs
seqkit seq -m 1000 ${SP}.fa.contigs >${SP}.fa.contigs.up1k
seqkit stats ${SP}.fa.contigs.up1k >${SP}.fa.contigs.up1k.stats
seqkit sample --proportion 0.6 ${SP}.fa.contigs.up1k >${SP}.fa.contigs.up1k.0.6x
seqkit stats ${SP}.fa.contigs.up1k.0.6x >${SP}.fa.contigs.up1k.0.6x.stats
conda activate repeatmodeler
# Build repeat library with RepeatScout
SAMPLE=${SP}.fa.contigs.up1k.0.6x
build_lmer_table -l 14 -sequence ${SAMPLE} -freq ${SAMPLE}.14mer_table
RepeatScout -sequence ${SAMPLE} -output ${SAMPLE}.rs_output.fa -freq ${SAMPLE}.14mer_table -l 14
cat ${SAMPLE}.rs_output.fa| filter-stage-1.prl >${SAMPLE}.rs_output.filtered1.fasta
RepeatMasker -pa 20 -no_is -norna -s -lib ${SAMPLE}.rs_output.filtered1.fasta ${SAMPLE}
cat ${SAMPLE}.rs_output.filtered1.fasta | filter-stage-2.prl --cat=${SAMPLE}.out --thresh=3 > ${SAMPLE}.rs_output.filtered2.fasta
```
### Taxonomic assignment (RepeatClassifier)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_RepeatClassifier.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_RepeatScout
cd ${WHOME}/${WDIR}
# Define variables
SP=Marsupenaeus_japonicus
NUMTH=16
# The conda distribution of RepeatModeler did not include a repeat library for RepeatClassifier, so I turned to the Dfam TETools version
curl -sSLO https://github.com/Dfam-consortium/TETools/raw/master/dfam-tetools.sh
chmod +x dfam-tetools.sh
systemctl start docker
sudo ./dfam-tetools.sh
RepeatClassifier -consensi Marsupenaeus_japonicus.fa.contigs.up1k.0.6x.rs_output.filtered2.fasta &
# Press Ctrl-P then Ctrl-Q to detach without stopping the job. This returns you to the shell, from which you can exit freely
```
### Repeat masking (RepeatMasker)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_RepeatMasker.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_RepeatMasker
cd ${WHOME}/${WDIR}/
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa
ln -s ../Mj_TUMSAT_v1.0_RepeatScout/Marsupenaeus_japonicus.fa.contigs.up1k.0.6x.rs_output.filtered2.fasta.classified RepeatLibrary.classified.fa
conda activate repeatmodeler
# Run RepeatMasker
RepeatMasker -pa 20 -s -no_is -lib RepeatLibrary.classified.fa Mj_TUMSAT_v1.0.fa -a -xsmall -html -gff
```
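`-xsmall` writes repeats in lowercase. The masked fraction is reported in RepeatMasker's `.tbl` file, but it can also be computed directly from the soft-masked FASTA; a sketch on a toy sequence (8 of 16 bases masked):

```shell
# Toy soft-masked FASTA standing in for Mj_TUMSAT_v1.0.fa.masked
cat > masked.fa <<'EOF'
>scaffold_00001
ACGTacgtacgtACGT
EOF
grep -v '^>' masked.fa | tr -d '\n' > seq.txt
total=$(wc -c < seq.txt)
masked=$(tr -dc 'acgtn' < seq.txt | wc -c)   # lowercase = soft-masked
awk -v m="$masked" -v t="$total" 'BEGIN { printf "%.1f%% masked\n", 100 * m / t }'
```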
## Gene prediction and functional annotation
### Prediction of protein-coding genes
#### cDNA alignments
##### Transcriptome shotgun assembly
###### Preprocessing paired-end reads (Fastp)
```sh
#!/bin/bash
# >>>>>>
# TrimPE_RNAseq_2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/
WDIR=reads
cd ${WHOME}/${WDIR}/
# Define variables
NUMTH=56
# Fastp
less lib.list|while read lib;do
fastp \
-i ../../reads/${lib}_R1.fastq.gz -I ../../reads/${lib}_R2.fastq.gz \
-o ${lib}_R1_trim.fq.gz -O ${lib}_R2_trim.fq.gz \
--trim_front1 2 --trim_tail1 2 \
--trim_front2 2 --trim_tail2 2 --n_base_limit 0 \
--length_required 140 --detect_adapter_for_pe \
--html ${lib}_fastp.html --thread 8
done
```
###### De novo assembly (Trinity)
```sh
#!/bin/bash
# >>>>>>
# Trinity.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq
cd ${WHOME}
# Define variables
NUMTH=56
# Run Trinity
conda activate trinity-2.11.0
Trinity --seqType fq --max_memory 1400G --SS_lib_type RF \
--samples_file reads/reads.list --CPU ${NUMTH}
```
##### Running PASA pipeline
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_PASA.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_PASA
PASAHOME=/work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define VARs
NUMTH=24
ASM=Mj_TUMSAT_v1.0
RUN=${ASM}_pasa
MIN_PERCENT_ALIGNED=90
MIN_AVG_PER_ID=95
TRANSCRIPT=tr.fa
TRANSCRIPT_CLEAN=tr.fa.clean
W_DIR=${WHOME}${WDIR}/
# Create symlinks
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/isoseq3/polished.hq.fasta polished.hq.fasta
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/trinity_out_dir/Trinity.fasta Trinity.fasta
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa
conda activate pasa
# Prepare transcripts
cat Trinity.fasta polished.hq.fasta >tr.fa
${PASAHOME}/bin/seqclean tr.fa -L -c 16
# Create config file pasa.alignAssembly.Template.txt
cat /work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1/pasa_conf/pasa.alignAssembly.Template.txt |\
sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g"|sed "s@<__MIN_PERCENT_ALIGNED__>@${MIN_PERCENT_ALIGNED}@g"|sed "s@<__MIN_AVG_PER_ID__>@${MIN_AVG_PER_ID}@g" >${RUN}.alignAssembly.config
## Run PASA Pipeline
${PASAHOME}/Launch_PASA_pipeline.pl -C -R \
--config ${RUN}.alignAssembly.config \
--ALIGNERS blat \
--genome ${ASM}.fa \
-t ${TRANSCRIPT_CLEAN} -T -u ${TRANSCRIPT} \
--TRANSDECODER \
--CPU ${NUMTH} --transcribed_is_aligned_orient
${PASAHOME}/scripts/pasa_asmbls_to_training_set.dbi \
--pasa_transcripts_fasta ${RUN}_db.assemblies.fasta \
--pasa_transcripts_gff3 ${RUN}_db.pasa_assemblies.gff3
```
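The sed pipeline above fills in the `<__...__>` placeholders of the PASA template, using `@` as the sed delimiter so that paths containing `/` pass through safely. A toy template shows the idea:

```shell
# Toy template with the same placeholder style as pasa.alignAssembly.Template.txt
cat > template.txt <<'EOF'
DATABASE=<__DATABASE__>
MIN_PERCENT_ALIGNED=<__MIN_PERCENT_ALIGNED__>
EOF
sed 's@<__DATABASE__>@/tmp/run_db@g; s@<__MIN_PERCENT_ALIGNED__>@90@g' template.txt > run.config
# No placeholder tokens should survive
if grep -q '<__' run.config; then
    echo "unsubstituted placeholders remain"
else
    echo "all placeholders substituted"
fi
```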
#### Genome-guided transcriptome assembly
##### RNA-seq mapping (HISAT2)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_hisat2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_hisat2
RDIR=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/reads/
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define VARs
NUMTH=24
ASM=Mj_TUMSAT_v1.0
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa
conda activate hisat2
# HISAT2-build
hisat2-build -p ${NUMTH} ${ASM}.fa ${ASM}
# Run HISAT2
less ${RDIR}lib.list |while read LIB;do
R1=${RDIR}${LIB}_R1_trim.fq.gz
R2=${RDIR}${LIB}_R2_trim.fq.gz
hisat2 --summary-file ${ASM}_${LIB} --new-summary -p ${NUMTH} -x ${ASM} -1 ${R1} -2 ${R2} |samtools sort -@ ${NUMTH} -O BAM -o ${ASM}_${LIB}.sorted.bam
samtools index ${ASM}_${LIB}.sorted.bam
done
```
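`--summary-file --new-summary` writes one summary per library, so collecting the overall alignment rates into a single table is a one-liner. Mock summaries stand in for real HISAT2 output below, and the library names are illustrative:

```shell
# Mock --new-summary files (real ones contain additional fields)
printf 'Overall alignment rate: 92.10%%\n' > Mj_TUMSAT_v1.0_lib1
printf 'Overall alignment rate: 89.55%%\n' > Mj_TUMSAT_v1.0_lib2
for f in Mj_TUMSAT_v1.0_lib1 Mj_TUMSAT_v1.0_lib2; do
    printf '%s\t%s\n' "$f" "$(awk -F': ' '/Overall alignment rate/ { print $2 }' "$f")"
done
```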
##### Transcript reconstruction (StringTie)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_stringtie.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_stringtie
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define VARs
RDIR=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/reads/
STRINGTIE_DIR=/work/kawato/apps/stringtie/
export PATH=${STRINGTIE_DIR}:${PATH}
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa
less ${RDIR}lib.list |while read LIB;do
ln -s ../Mj_TUMSAT_v1.0_hisat2/${ASM}_${LIB}.sorted.bam ${ASM}_${LIB}.sorted.bam
ln -s ../Mj_TUMSAT_v1.0_hisat2/${ASM}_${LIB}.sorted.bam.bai ${ASM}_${LIB}.sorted.bam.bai
done
# Run stringtie
less ${RDIR}lib.list |while read LIB;do
stringtie --fr -p ${NUMTH} -o ${ASM}_${LIB}.out.gtf ${ASM}_${LIB}.sorted.bam
done
# Merge transcripts
stringtie --merge -o ${ASM}_merged.gtf ${ASM}_*.out.gtf
```
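Before feeding the merged GTF to EVidenceModeler, it is worth counting how many transcripts and distinct genes the merge produced. A toy StringTie-style GTF stands in for the real `${ASM}_merged.gtf` here:

```shell
# Toy GTF: tab-separated, attributes in column 9
printf 'chr1\tStringTie\ttranscript\t1\t100\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n'   >  merged.gtf
printf 'chr1\tStringTie\texon\t1\t100\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n'         >> merged.gtf
printf 'chr1\tStringTie\ttranscript\t200\t400\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.2";\n' >> merged.gtf
printf 'chr2\tStringTie\ttranscript\t1\t50\t.\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n'    >> merged.gtf
awk -F'\t' '$3 == "transcript"' merged.gtf | wc -l       # number of transcripts
grep -o 'gene_id "[^"]*"' merged.gtf | sort -u | wc -l   # number of distinct genes
```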
#### *Ab initio* gene prediction (BRAKER2)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_BRAKER2.sh
# >>>>>>
# This is NOT designed to be run with nohup.
# Manually run the commands inside the COMMENTOUT heredoc blocks.
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_BRAKER2
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Copy files; symlink doesn't work for Docker
cp ../Mj_TUMSAT_v1.0_RepeatMasker/${ASM}.fa.masked ${ASM}.fa.masked
cp ../Mj_TUMSAT_v1.0_hisat2/${ASM}_*.sorted.bam* .
mkdir GeneMark; cp -r /work/kawato/apps/GeneMark/* GeneMark
# Run BRAKER2 (the BRAKER2 invocation itself is not reproduced in this clean copy)
```
#### Protein alignments (Exonerate)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_Exonerate_Local.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_Exonerate_Local
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=12
GENOME=Mj_TUMSAT_v1.0
QUERY=2020-12-17_Mj_IPG
# Make symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa
ln -s /work/kawato/study/ref/proteins/${QUERY}.faa ${QUERY}.faa # Identical Protein Groups downloaded from NCBI on 2020-12-17
# BLAST
makeblastdb -dbtype nucl -in ${GENOME}.fa -title ${GENOME} -out ${GENOME} -parse_seqids -hash_index
tblastn -query ${QUERY}.faa -db ${GENOME} -num_threads ${NUMTH} -outfmt 7 -max_target_seqs 1 -out ${QUERY}.${GENOME}.tblastn
cat ${QUERY}.${GENOME}.tblastn|grep -v "#" |cut -f 1,2 |sort |uniq>${QUERY}.${GENOME}.tblastn.hits
# Separate seqs
mkdir query subject
cut -f 1 ${QUERY}.${GENOME}.tblastn.hits >query.list
cat query.list|while read line;do
seqkit grep -np "${line}" ${QUERY}.faa>query/${line}.faa
done
cut -f 2 ${QUERY}.${GENOME}.tblastn.hits |sort |uniq>subject.list
cat subject.list|while read line;do
seqkit grep -np "${line}" ${GENOME}.fa > subject/${line}.fa
done
# Exonerate
mkdir gff
conda activate exonerate
cat ${QUERY}.${GENOME}.tblastn.hits|while read QUERY SUBJECT;do
exonerate --model protein2genome query/${QUERY}.faa subject/${SUBJECT}.fa --bestn 1 --showalignment no --showvulgar no --showtargetgff yes >gff/${QUERY}.${SUBJECT}.gff
done
# Concatenate all
cat gff/*.gff|grep -v "Command line" |grep -v "Hostname" |grep -v "completed" >query.subject.all.gff
```
#### Merging gene models (EVidenceModeler)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_EVidenceModeler.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
### Do NOT execute this script with nohup; run the steps manually ###
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_EVidenceModeler
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define VARs
EVM_HOME=/work/kawato/apps/EVidenceModeler
PARTITIONS_DIR=partition_EVM_inputs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa
ln -s ../${ASM}_BRAKER2/braker/augustus.hints.gtf augustus.hints.gtf
ln -s ../${ASM}_BRAKER2/braker/GeneMark-ET/genemark.gtf genemark.gtf
ln -s ../${ASM}_stringtie/${ASM}_merged.gtf ${ASM}_merged.gtf
ln -s ../${ASM}_PASA/${ASM}_pasa_db.pasa_assemblies.gff3 ${ASM}_pasa_db.pasa_assemblies.gff3
ln -s ../${ASM}_PASA/${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3
ln -s ../${ASM}_Exonerate_Local/query.subject.all.gff query.subject.all.gff
# Convert all PASA CDS alignments into DNA-protein matches, as suggested by https://github.com/EVidenceModeler/EVidenceModeler/issues/23
grep "CDS" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 |sed 's/transdecoder/transdecoder_cds/g'|sed 's/CDS/nucleotide_to_protein_match/g' >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.nucleotide_to_protein_match.gff3
# Extract complete PASA CDS, as suggested by https://github.com/EVidenceModeler/EVidenceModeler/issues/23
# List "complete" entries
grep "complete" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3|grep -E -o "asmbl_[0-9]+.p[0-9]+"|sort |uniq >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list
# Split the list to 700 lines, or 56 parts
split -l 700 --numeric-suffixes --suffix-length=3 --elide-empty-files ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list_
# grep the entries and redirect
for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055;do
nohup $(cat ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list_${n} |while read line;do
grep "${line}" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 >>${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3.part${n};done) &
done
# "nohup: missing operand" seems to have had no negative impact on the results
# Merge into a single gff file
cat ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3.part* >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3
# Produce .faa for quality assessment
gffread ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3 -g ${ASM}.fa -y ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa
seqkit stats ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa.stats
# Convert gene predictions to EVM-compatible format
perl ${EVM_HOME}/EvmUtils/misc/augustus_GTF_to_EVM_GFF3.pl augustus.hints.gtf >augustus.hints.gff
perl ${EVM_HOME}/EvmUtils/misc/GeneMarkHMM_GTF_to_EVM_GFF3.pl genemark.gtf >genemark.gff
perl ${EVM_HOME}/EvmUtils/misc/cufflinks_gtf_to_alignment_gff3.pl ${ASM}_merged.gtf >${ASM}_merged.gff
perl ${EVM_HOME}/EvmUtils/misc/exonerate_gff_to_alignment_gff3.pl query.subject.all.gff >query.subject.all.gff3
# Verify gene count
# grep "gene" augustus.hints.gff|wc -l
# grep "gene" genemark.gff|wc -l
# cut -f 9 ${ASM}_merged.gff|cut -f 1 -d ";"|grep -v '^$' |sort|uniq|wc -l
# cut -f 9 query.subject.all.gff3|cut -f 1 -d ";"|grep -v '^$' |sort|uniq|wc -l
# Concatenate gene predictions
cat \
${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3 \
genemark.gff \
augustus.hints.gff \
>prediction.merged.gff
cat \
${ASM}_pasa_db.pasa_assemblies.gff3 \
${ASM}_merged.gff \
>transcript.merged.gff
cat \
${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.nucleotide_to_protein_match.gff3 \
query.subject.all.gff3 \
>protein.merged.gff
mkdir ${PARTITIONS_DIR}
cd ${PARTITIONS_DIR}
# Create weight file
echo -e "ABINITIO_PREDICTION\tGeneMark.hmm\t1\nABINITIO_PREDICTION\tAugustus\t3\nPROTEIN\texonerate\t10\nPROTEIN\ttransdecoder_cds\t10\nTRANSCRIPT\tCufflinks\t5\nTRANSCRIPT\tassembler-${ASM}_pasa_db\t10\nOTHER_PREDICTION\ttransdecoder\t10" >weights.txt
# partition_EVM_inputs.pl
${EVM_HOME}/EvmUtils/partition_EVM_inputs.pl --genome ../${ASM}.fa \
--gene_predictions ../prediction.merged.gff \
--protein_alignments ../protein.merged.gff \
--transcript_alignments ../transcript.merged.gff \
--segmentSize 2000000 --overlapSize 100000 --partition_listing partitions_list.out
# write_EVM_commands.pl
${EVM_HOME}/EvmUtils/write_EVM_commands.pl --genome ../${ASM}.fa --weights $(cd .. && pwd)/weights.txt \
--gene_predictions ../prediction.merged.gff \
--protein_alignments ../protein.merged.gff \
--transcript_alignments ../transcript.merged.gff \
--output_file_name evm.out --partitions partitions_list.out > commands.list
# Verify command count
wc -l commands.list # 18210 commands.list
# Split to 400-line chunks
split -l 400 --numeric-suffixes --suffix-length=3 --elide-empty-files commands.list commands.list_ # 46 (000-045) partitions
# execute_EVM_commands.pl
for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 ;do
nohup $EVM_HOME/EvmUtils/execute_EVM_commands.pl commands.list_${n} 2>commands.list_${n}_run.err | tee commands.list_${n}_run.log &
done
# recombine_EVM_partial_outputs.pl
${EVM_HOME}/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
# Split into 450-line chunks
split -l 450 --numeric-suffixes --suffix-length=3 --elide-empty-files partitions_list.out partitions_list.out_
for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 ;do
nohup ${EVM_HOME}/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out_${n} --output evm.out --genome ../${ASM}.fa &
done
# Recombine gff3
find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3
# Translate CDS
perl ${EVM_HOME}/EvmUtils/gff3_file_to_proteins.pl EVM.all.gff3 ../${ASM}.fa >EVM.all.gff3.faa
```
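The one-line `echo -e` above packs the whole EVM weights table into a single string, which is easy to mistype. As a hedged alternative (same evidence classes, sources, and weights as in the script; `ASM` stands in for the assembly prefix), the file can be built one row at a time:

```sh
# Build the EVM weights file row by row; columns are
# evidence class <TAB> source name <TAB> weight.
ASM=Mj_TUMSAT_v1.0
{
  printf '%s\t%s\t%s\n' ABINITIO_PREDICTION GeneMark.hmm 1
  printf '%s\t%s\t%s\n' ABINITIO_PREDICTION Augustus 3
  printf '%s\t%s\t%s\n' PROTEIN exonerate 10
  printf '%s\t%s\t%s\n' PROTEIN transdecoder_cds 10
  printf '%s\t%s\t%s\n' TRANSCRIPT Cufflinks 5
  printf '%s\t%s\t%s\n' TRANSCRIPT "assembler-${ASM}_pasa_db" 10
  printf '%s\t%s\t%s\n' OTHER_PREDICTION transdecoder 10
} > weights.txt
```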
#### Homology search (MMSeqs2)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_MMSeq2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_MMSeq2
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define VARs
NUMTH=56
GENOME=Mj_TUMSAT_v1.0
ANNOTATION=EVM
GENEID=${ANNOTATION}.geneid
GENEID_TRANSCRIPTID_CDSLEN=${ANNOTATION}.geneid.transcriptid.cdslen
REF_DIR=/work/kawato/study/ref/proteins
# Create symlinks
ln -s ../Mj_TUMSAT_v1.0_EVidenceModeler/partition_EVM_inputs/EVM.all.gff3 ${ANNOTATION}.gff3
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa
# Extract .faa with geneid, transcriptid, and cdslen information in the headers
gffread ${ANNOTATION}.gff3 -V -E -S -g ${GENOME}.fa -y ${GENEID_TRANSCRIPTID_CDSLEN}.faa --table=@geneid,@cdslen
# The headers above contain a space between the entry name and the description, which makes them incompatible with seqkit grep; trim the descriptions off
cut -f 1 ${GENEID_TRANSCRIPTID_CDSLEN}.faa >${ANNOTATION}.faa
# Define list names
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2}'|sort|uniq >${GENEID}.list
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${GENEID_TRANSCRIPTID_CDSLEN}.list
# Extract the longest isoform from each locus (takes time)
mkdir ${ANNOTATION}
cat ${GENEID}.list |while read line;do
mkdir ${ANNOTATION}/${line}
awk -v line=${line} '$1 == line {print $0}' ${GENEID_TRANSCRIPTID_CDSLEN}.list |cut -f 2,3 |sort -r -n -k 2 >${ANNOTATION}/${line}/${line}.transcriptid.cdslen.list
ID=`awk 'NR==1 {print $1}' ${ANNOTATION}/${line}/${line}.transcriptid.cdslen.list`
seqkit grep -np "${ID}" ${ANNOTATION}.faa >${ANNOTATION}/${line}/${line}.faa
done
# Merge into a single fasta file
cat ${GENEID}.list|while read line;do
cat ${ANNOTATION}/${line}/${line}.faa >>${ANNOTATION}.nr.faa
done
seqkit stats ${ANNOTATION}.nr.faa >${ANNOTATION}.nr.faa.stats
# MMSeqs2
conda activate mmseqs2
# Create db
DB_UNIREF50=${REF_DIR}/uniref50
mmseqs createdb ${ANNOTATION}.nr.faa ${ANNOTATION}.nr
mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein
mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein
# search against uniref50
mmseqs search ${ANNOTATION}.nr ${DB_UNIREF50} ${ANNOTATION}.nr.uniref50.mmseqs.out tmp_uniref50 -e 1e-10 -a
mmseqs convertalis ${ANNOTATION}.nr ${DB_UNIREF50} ${ANNOTATION}.nr.uniref50.mmseqs.out ${ANNOTATION}.nr.uniref50.mmseqs.out.formatmode0 --format-mode 0
# search against two penaeid shrimp species
mmseqs search ${ANNOTATION}.nr ASM378908v1_protein ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-10 -a
mmseqs convertalis ${ANNOTATION}.nr ASM378908v1_protein ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${ANNOTATION}.nr NSTDA_Pmon_1_protein ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-10 -a
mmseqs convertalis ${ANNOTATION}.nr NSTDA_Pmon_1_protein ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0
# Merge results
cut -f 1 ${ANNOTATION}.nr.uniref50.mmseqs.out.formatmode0 ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |sort|uniq>${ANNOTATION}.nr.mmseqs.out.list
# wc -l ${ANNOTATION}.nr.mmseqs.out.list
# extract proteins with detectable similarity to known proteins
seqkit grep -nf ${ANNOTATION}.nr.mmseqs.out.list ${ANNOTATION}.nr.faa >${ANNOTATION}.nr.mmseqs.hit.faa
# grep ">" ${ANNOTATION}.nr.mmseqs.hit.faa |wc -l
# cut -f 1 ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |sort|uniq|wc -l
# Convert EVM to gtf
gffread ${ANNOTATION}.gff3 -g ${GENOME}.fa -E -T -o ${ANNOTATION}.gtf
# Generate gff
split -l 500 --numeric-suffixes --suffix-length=3 --elide-empty-files ${ANNOTATION}.nr.mmseqs.out.list ${ANNOTATION}.nr.mmseqs.out.list_ # generates 56 split files
for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 ;do
while read line;do
grep "transcript_id \"${line}\"" ${ANNOTATION}.gtf
done >${ANNOTATION}.nr.mmseqs.hit_${N}.gtf < ${ANNOTATION}.nr.mmseqs.out.list_${N} &
done
cat ${ANNOTATION}.nr.mmseqs.hit_*.gtf>>${ANNOTATION}.nr.mmseqs.hit.gtf
rm ${ANNOTATION}.nr.mmseqs.hit_*.gtf ${ANNOTATION}.nr.mmseqs.out.list_*
gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -o ${ANNOTATION}.nr.mmseqs.hit.gff
gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -y ${ANNOTATION}.nr.mmseqs.hit.faa
gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -x ${ANNOTATION}.nr.mmseqs.hit.fna
```
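The per-locus loop above (one `mkdir`, `awk`, `sort`, and `seqkit grep` call per gene) is correct but slow across ~26k loci. A hedged single-pass sketch of the same "longest CDS per gene" selection, on toy data in the three-column (geneid, transcriptid, cdslen) layout of `${GENEID_TRANSCRIPTID_CDSLEN}.list`:

```sh
# Toy input in the geneid <TAB> transcriptid <TAB> cdslen layout
printf 'g1\tt1\t300\ng1\tt2\t450\ng2\tt3\t120\n' > toy.geneid.transcriptid.cdslen.list
# Sort by gene, then by CDS length descending; keep the first transcript per gene
sort -k1,1 -k3,3nr toy.geneid.transcriptid.cdslen.list \
  | awk -F'\t' '!seen[$1]++ {print $2}' > longest_isoform.ids
# The resulting ID list could then feed a single call such as:
#   seqkit grep -nf longest_isoform.ids EVM.faa > EVM.nr.faa
```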
#### Manual curation
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_AddMjPenII.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_MMSeq2
cd ${WHOME}${WDIR}
# Define VARs
NUMTH=56
GENOME=Mj_TUMSAT_v1.0
SCAFFOLD=scaffold_00504
MjPenII_CDS=transcript_21783_170-607
# Extract the scaffold containing the MjPenII gene
seqkit grep -np "${SCAFFOLD}" ${GENOME}.fa >${SCAFFOLD}.fa
# Align MjPenII CDS
conda activate exonerate
exonerate --model est2genome ${MjPenII_CDS}.fna ${SCAFFOLD}.fa --maxintron 2000 --percent 90 --showalignment no --showvulgar no --showtargetgff yes >${MjPenII_CDS}.${SCAFFOLD}.gff
# I manually constructed evm.model.scaffold_00504.16.gff by modifying transcript_21783_170-607.scaffold_00504.gff, based on the corresponding portion in EVM.up10.gff. I made the following modifications:
# - Removed the lines beginning with "#" and the final line "-- completed exonerate analysis"
# - Changed "exonerate:est2genome" to "EVM" ($2)
# - Removed lines containing "utr5", "splice5", "intron", "splice3", and "similarity" in $3
# - Duplicated the "gene" line; the new copy was edited from "gene" to "transcript", and its $9 was changed
#   to:
#   ID=evm.model.scaffold_00504.17;Parent=evm.TU.scaffold_00504.17
# - Duplicated the lines with "exon" in $3; the duplicated copies now have "CDS" in $3
# - $9 on the "exon" and "CDS" lines was changed to "Parent=evm.model.scaffold_00504.17"
# Final annotation containing the MjPenII gene: transcript_21783_170-607.scaffold_00504.mod.gff
cat EVM.nr.mmseqs.hit.gff ${MjPenII_CDS}.${SCAFFOLD}.mod.gff >2021-06-02_Mj_genome_annotation.gff3
gffread 2021-06-02_Mj_genome_annotation.gff3 -V --keep-genes -g ${GENOME}.fa -y 2021-06-02_Mj_genome_annotation.faa
gffread 2021-06-02_Mj_genome_annotation.gff3 -V --keep-genes -g ${GENOME}.fa -x 2021-06-02_Mj_genome_annotation.fna
grep ">" 2021-06-02_Mj_genome_annotation.faa |wc -l
grep ">" 2021-06-02_Mj_genome_annotation.fna |wc -l
# BUSCO to show that the completeness has not been affected
conda activate busco-4.1.4
busco -i 2021-06-02_Mj_genome_annotation.faa -l arthropoda_odb10 -o 2021-06-02_Mj_genome_annotation.busco -c 56 -m prot -f
less 2021-06-02_Mj_genome_annotation.gff3|grep "gene"|cut -f 9|sed 's/ID=//g'|sed 's/TU/model/g'|grep -v "^#" >2021-06-02_Mj_genome_annotation.gff3.gene.list
wc -l 2021-06-02_Mj_genome_annotation.gff3.gene.list
```
transcript_21783_170-607.fna
```
>transcript_21783_170-607
ATGCGTGTCTTGGTTTTCCTGGCGTTCTTGGTTTTGTTAGCCTTACTCTGTCAAGTGTAC
GCCAAGGGCTCCTCAAGTTCGAGTTCGAGTTCTCGATCCTCAAGTTCGAGTTATCGATCC
TCAGGTTCGAGTTATCGATCCCCAGGTTCGAGTTATCGATCCCCAGGTTCGAGTTATCGA
TCATCTGGCTCCTACGGAACTTCAGGTTCTCGACTCCCTGGCATTCGTCCCTCTTCTCGA
TCATATCGCACCGGTTTTCGAACAGCTGGATCCGTCGGACCTGCAACTCGTCCCTTTACT
CGTCCAACTGGGCCCCTTAAGCCCATTAGCCGGCCTCCTTCCCGAGCAGCCTGCTATTCA
TGCTACAGCGCAAGCAGCGCAACGGCAATACAATGCTGCACCCACTATAGTCTTTGTTGC
AATCTGGTGAAAGGATAG
```
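The manual edits listed in the script comments above can also be expressed programmatically. Below is a hedged sketch (toy input; the model/TU IDs are those named in the comments) that applies the same transformations to an exonerate `--showtargetgff` fragment with awk:

```sh
# Toy exonerate --showtargetgff fragment (attribute column abridged)
printf 'scaffold_00504\texonerate:est2genome\tgene\t100\t500\t95\t+\t.\tgene_id 1\n'  > toy.exonerate.gff
printf 'scaffold_00504\texonerate:est2genome\tutr5\t100\t120\t.\t+\t.\t.\n'          >> toy.exonerate.gff
printf 'scaffold_00504\texonerate:est2genome\texon\t100\t500\t.\t+\t.\t.\n'          >> toy.exonerate.gff
MODEL=evm.model.scaffold_00504.17
TU=evm.TU.scaffold_00504.17
awk -F'\t' -v OFS='\t' -v m="$MODEL" -v tu="$TU" '
  /^#/ {next}                                   # drop exonerate header lines
  $3=="utr5"||$3=="splice5"||$3=="intron"||$3=="splice3"||$3=="similarity" {next}
  {$2="EVM"}                                    # rename the source column
  $3=="gene" {$9="ID=" tu; print
              $3="transcript"; $9="ID=" m ";Parent=" tu; print; next}
  $3=="exon" {$9="Parent=" m; print             # keep the exon line...
              $3="CDS"; print}                  # ...and duplicate it as CDS
' toy.exonerate.gff > toy.mod.gff
```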
#### UTR recovery (PASA updates)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_PASA_update.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_PASA_update
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
PASAHOME=/work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1
NUMTH=56
W_DIR=${WHOME}${WDIR}/
GENOME=Mj_TUMSAT_v1.0
ANNOTATION=2021-06-02_Mj_genome_annotation
# Create symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa
ln -s ../Mj_TUMSAT_v1.0_MMSeq2/2021-06-02_Mj_genome_annotation.gff3 ${ANNOTATION}.gff3
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRK01.1.fsa_nt ICRK01.1.fsa_nt
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ICRJ01.1.fsa_nt
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/trinity_out_dir/Trinity.fasta Trinity.fasta
conda activate pasa
# Prepare transcripts
cat ICRK01.1.fsa_nt ICRJ01.1.fsa_nt Trinity.fasta >tr.fa
TRANSCRIPT=tr.fa
${PASAHOME}/bin/seqclean ${TRANSCRIPT} -L -c ${NUMTH}
TRANSCRIPT_CLEAN=${TRANSCRIPT}.clean
# Prepare PASA config file
MIN_PERCENT_ALIGNED=90
MIN_AVG_PER_ID=95
RUN=2021-06-03_pasa
cat /work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1/pasa_conf/pasa.alignAssembly.Template.txt |\
sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g"|sed "s@<__MIN_PERCENT_ALIGNED__>@${MIN_PERCENT_ALIGNED}@g"|sed "s@<__MIN_AVG_PER_ID__>@${MIN_AVG_PER_ID}@g" >${RUN}.alignAssembly.config
# Map transcripts
${PASAHOME}/Launch_PASA_pipeline.pl -C -R \
--config ${RUN}.alignAssembly.config \
--ALIGNERS blat \
--genome ${GENOME}.fa \
-t ${TRANSCRIPT_CLEAN} -T -u ${TRANSCRIPT} \
--TRANSDECODER \
--CPU ${NUMTH} --transcribed_is_aligned_orient
# Create PASA annotCompare config
cat ${PASAHOME}/pasa_conf/pasa.annotationCompare.Template.txt |\
sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g" >${RUN}.annotCompare.config
# load annotation
${PASAHOME}/scripts/Load_Current_Gene_Annotations.dbi \
-c ${RUN}.alignAssembly.config -g ${GENOME}.fa \
-P ${ANNOTATION}.gff3
# PASA update, round 1
$PASAHOME/Launch_PASA_pipeline.pl \
-c ${RUN}.annotCompare.config -A \
-g ${GENOME}.fa \
-t ${TRANSCRIPT_CLEAN} --CPU ${NUMTH}
# Load post-update PASA annotation
${PASAHOME}/scripts/Load_Current_Gene_Annotations.dbi \
-c ${RUN}.alignAssembly.config -g ${GENOME}.fa \
-P 2021-06-03_pasa_db.gene_structures_post_PASA_updates.46092.gff3
# PASA update, round 2
$PASAHOME/Launch_PASA_pipeline.pl \
-c ${RUN}.annotCompare.config -A \
-g ${GENOME}.fa \
-t ${TRANSCRIPT_CLEAN} --CPU ${NUMTH}
```
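The two `sed` chains above fill PASA's config templates by replacing `<__…__>` placeholders. A self-contained sketch of the same substitution on a mock template (the real templates ship with PASA; the keys here are illustrative):

```sh
# Mock template in the placeholder style used by the PASA config templates
printf 'DATABASE=<__DATABASE__>\nMIN_PERCENT_ALIGNED=<__MIN_PERCENT_ALIGNED__>\nMIN_AVG_PER_ID=<__MIN_AVG_PER_ID__>\n' > mock.Template.txt
W_DIR=/tmp/run/                 # stand-ins for the script variables
RUN=2021-06-03_pasa
MIN_PERCENT_ALIGNED=90
MIN_AVG_PER_ID=95
# "@" as the sed delimiter avoids clashing with "/" in the paths
sed -e "s@<__DATABASE__>@${W_DIR}${RUN}_db@g" \
    -e "s@<__MIN_PERCENT_ALIGNED__>@${MIN_PERCENT_ALIGNED}@g" \
    -e "s@<__MIN_AVG_PER_ID__>@${MIN_AVG_PER_ID}@g" \
    mock.Template.txt > mock.alignAssembly.config
```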
#### Incorporation of Iso-Seq and shotgun transcriptome assemblies (SQANTI3, TAMA, TransDecoder)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_sqanti3_TAMA_TransDecoder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_sqanti3_TAMA_TransDecoder
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
conda activate SQANTI3.env
export PYTHONPATH=${PYTHONPATH}:/work/kawato/apps/cDNA_Cupcake/sequence/
GENOME=Mj_TUMSAT_v1.0.fa
CDNA=ISOSEQ.fa
RNASEQ=RNASEQ.fa
NUMTH=56
TAMAHOME=/work/kawato/apps/tama
REF_DIR=/work/kawato/study/ref/proteins
# make symlinks
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ${CDNA}
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRK01.1.fsa_nt ${RNASEQ}
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}
ln -s ../Mj_TUMSAT_v1.0_PASA_update/2021-06-03_pasa_db.gene_structures_post_PASA_updates.39230.gff3 PASA.gff3
# SQANTI3 accepts GTF only
gffread PASA.gff3 -g ${GENOME} -E -T -o PASA.gtf
# SQANTI3, ISOSEQ
minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${GENOME} ${CDNA} > ${CDNA}.sam
sort -k 3,3 -k 4,4n ${CDNA}.sam > ${CDNA}.sorted.sam
collapse_isoforms_by_sam.py --input ${CDNA} -s ${CDNA}.sorted.sam --dun-merge-5-shorter -o ${CDNA}.align -c 0.90 # coverage cutoff 0.90 = 90%; the output .collapsed.gff is in GFF2 format (= GTF)
python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${CDNA}.align.collapsed.gff PASA.gtf ${GENOME}
# SQANTI3, RNASEQ
minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${GENOME} ${RNASEQ} > ${RNASEQ}.sam
sort -k 3,3 -k 4,4n ${RNASEQ}.sam > ${RNASEQ}.sorted.sam
collapse_isoforms_by_sam.py --input ${RNASEQ} -s ${RNASEQ}.sorted.sam --dun-merge-5-shorter -o ${RNASEQ}.align -c 0.90 # gff is in gff2 format = GTF
python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${RNASEQ}.align.collapsed.gff PASA.gtf ${GENOME}
# TAMA
conda activate tama-py27
# Convert to BED
python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py PASA.gtf PASA.bed
python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py ${RNASEQ}.align.collapsed_corrected.gtf ${RNASEQ}.align.collapsed_corrected.bed
python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py ${CDNA}.align.collapsed_corrected.gtf ${CDNA}.align.collapsed_corrected.bed
echo -e "${CDNA}.align.collapsed_corrected.bed\tno_cap\t1,1,1\tCDNA\n${RNASEQ}.align.collapsed_corrected.bed\tno_cap\t2,1,2\tRNASEQ\nPASA.bed\tno_cap\t2,1,2\tPASA" > filelist.txt
python ${TAMAHOME}/tama_merge.py -f filelist.txt -p merged_annos -s PASA -cds PASA
python2 ${TAMAHOME}/tama_go/format_converter/tama_convert_bed_gtf_ensembl_no_cds.py merged_annos.bed merged_annos.gtf
# TransDecoder
conda activate pasa #TransDecoder in PASA
# Extract transcripts
gtf_genome_to_cdna_fasta.pl merged_annos.gtf ${GENOME} >merged_annos.transcripts.fa
gtf_to_alignment_gff3.pl merged_annos.gtf > merged_annos.gff3
TransDecoder.LongOrfs -S -t merged_annos.transcripts.fa
# MMSeqs2
conda activate mmseqs2
# Create db
ln -s merged_annos.transcripts.fa.transdecoder_dir/longest_orfs.pep longest_orfs.pep
mmseqs createdb longest_orfs.pep longest_orfs
mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein
mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein
DB_UNIREF50=${REF_DIR}/uniref50
# search against uniref50
mmseqs search longest_orfs ${DB_UNIREF50} longest_orfs.uniref50.mmseqs.out tmp_uniref50 -e 1e-10 -a
mmseqs convertalis longest_orfs ${DB_UNIREF50} longest_orfs.uniref50.mmseqs.out longest_orfs.uniref50.mmseqs.out.formatmode0 --format-mode 0
# search against two penaeid shrimp species
mmseqs search longest_orfs ASM378908v1_protein longest_orfs.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-10 -a
mmseqs convertalis longest_orfs ASM378908v1_protein longest_orfs.ASM378908v1_protein.mmseqs.out longest_orfs.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search longest_orfs NSTDA_Pmon_1_protein longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-10 -a
mmseqs convertalis longest_orfs NSTDA_Pmon_1_protein longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0
# Concatenate MMSeqs2 results
cat longest_orfs.uniref50.mmseqs.out.formatmode0 longest_orfs.ASM378908v1_protein.mmseqs.out.formatmode0 longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 >longest_orfs.all.mmseqs.out.formatmode0
conda activate pasa
# Predict ORFs with homology to known proteins
TransDecoder.Predict --single_best_only -t merged_annos.transcripts.fa --retain_blastp_hits longest_orfs.all.mmseqs.out.formatmode0
cdna_alignment_orf_to_genome_orf.pl merged_annos.transcripts.fa.transdecoder.gff3 merged_annos.gff3 merged_annos.transcripts.fa > merged_annos.transcripts.fa.transdecoder.genome.gff3
# Extract ORF
gffread merged_annos.transcripts.fa.transdecoder.genome.gff3 -V -E -S -g ${GENOME} -y merged_annos.transcripts.fa.transdecoder.genome.faa
conda activate mmseqs2
# Further filter low-quality models using MMSeqs2
QUERY=TransDecoder_ORF
ln -s merged_annos.transcripts.fa.transdecoder.genome.faa ${QUERY}.faa
# Create db
mmseqs createdb ${QUERY}.faa ${QUERY}
# search against uniref50
mmseqs search ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out tmp_uniref50 -e 1e-03 -a
mmseqs convertalis ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out ${QUERY}.uniref50.mmseqs.out.formatmode0 --format-mode 0
# search against two penaeid shrimp species
mmseqs search ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0
# Merge MMSeqs2 results
cat ${QUERY}.uniref50.mmseqs.out.formatmode0 ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |cut -f 1|sort|uniq >${QUERY}.mmseqs.out.list
wc -l ${QUERY}.mmseqs.out.list # 60281 TransDecoder_ORF.mmseqs.out.list
less TransDecoder_ORF.mmseqs.out.list|cut -f 1 -d "."|sed 's/G//g'|sort -n|uniq |sed 's/^/G/g'>${QUERY}.mmseqs.out.gene.list
wc -l ${QUERY}.mmseqs.out.gene.list # 26387 TransDecoder_ORF.mmseqs.out.gene.list
# Remove lengthy descriptions; split to chunks to make it faster
split -l 500 --numeric-suffixes --suffix-length=3 --elide-empty-files ${QUERY}.mmseqs.out.gene.list ${QUERY}.mmseqs.out.gene.list_ # generates 53 split files
for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052;do
cat ${QUERY}.mmseqs.out.gene.list_${N} |while read line;do
grep -E "${line}[;.]" merged_annos.transcripts.fa.transdecoder.genome.gff3
done > merged_annos.transcripts.fa.transdecoder.genome.selected_${N}.gff3 &
done
# Merge into one file
cat merged_annos.transcripts.fa.transdecoder.genome.selected_*.gff3 >Mj_TUMSAT_v1.0_transdecoder.gff3
rm merged_annos.transcripts.fa.transdecoder.genome.selected_*.gff3 ${QUERY}.mmseqs.out.gene.list_*
awk '$3=="gene" {print $0}' Mj_TUMSAT_v1.0_transdecoder.gff3|wc -l # 26387
# Remove genes overlapping with 18S/28S rRNA genes
grep -v "G1771[23456789][.;]" Mj_TUMSAT_v1.0_transdecoder.gff3 >Mj_TUMSAT_v1.0_transdecoder.noOverlap.gff3
```
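The backgrounded `grep -E "${line}[;.]"` loops above scan the full GFF3 once per gene. As a hedged alternative (toy data; assumes every feature line carries an `ID=` attribute whose stem before the first `.` or `;` is the gene ID, as in the `G<number>.<isoform>` scheme above), the selection can be done in a single awk pass:

```sh
printf 'G1\nG3\n' > keep.genes.list
printf 'chr1\tEVM\tgene\t1\t9\t.\t+\t.\tID=G1;\n'              > toy.gff3
printf 'chr1\tEVM\tmRNA\t1\t9\t.\t+\t.\tID=G1.1;Parent=G1;\n' >> toy.gff3
printf 'chr1\tEVM\tgene\t1\t9\t.\t+\t.\tID=G2;\n'             >> toy.gff3
# First file loads the keep-set; second pass prints matching features
awk -F'\t' 'NR==FNR {keep[$1]; next}
  {
    # Gene stem = ID attribute value truncated at the first "." or ";"
    id=$9; sub(/^ID=/,"",id); sub(/[.;].*$/,"",id)
    if (id in keep) print
  }' keep.genes.list toy.gff3 > selected.gff3
```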
### Non-coding RNA prediction
#### tRNAScan-SE 2.0 (tRNA)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_tRNAscan-SE.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_tRNAscan-SE
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Make symlinks
ln -s ../filter_lowqual_seq/${ASM}.fa ${ASM}.fa
conda activate trnascan-se
tRNAscan-SE ${ASM}.fa -o ${ASM}_trnascan-se_out -f ${ASM}_trnascan-se_struct -m ${ASM}_trnascan-se_stats -b ${ASM}_trnascan-se.bed -a ${ASM}_trnascan-se.fa -l ${ASM}_trnascan-se.log
# List up and eliminate low-quality genes (Undetermined and possible pseudogenes)
grep -e pseudo -e Undet ${ASM}_trnascan-se_out|awk '{print $1".tRNA"$2"-"}' >${ASM}_trnascan-se_out.lowqual
grep -v -f ${ASM}_trnascan-se_out.lowqual ${ASM}_trnascan-se.bed >${ASM}_trnascan-se.filtered.bed
# Convert to GFF3
gffread ${ASM}_trnascan-se.filtered.bed -F -o ${ASM}_trnascan-se.filtered.gff3
# Add gene_biotype=tRNA, parent-child structure, and functional description
cat ${ASM}_trnascan-se.filtered.gff3|awk 'BEGIN{FS = "\t";OFS="\t"};
{if($3=="transcript")
{
{genename = gensub("ID=","",1, $9)};
{print $1,"tRNAscan-SE","gene",$4,$5,$6,$7,$8,"ID="genename";gene_biotype=tRNA"
};
{print $1,"tRNAscan-SE","tRNA",$4,$5,$6,$7,$8,"ID=rna-"genename";Parent="genename";"
};
{print $1,"tRNAscan-SE","exon",$4,$5,$6,$7,$8,"Parent=rna-"genename";"
}
}
}'|awk 'BEGIN{FS = "\t";OFS="\t"}{
if($3=="tRNA"||$3=="exon"){
if($9 ~ /AlaAGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon AGC\);product=tRNA-Ala"};
if($9 ~ /AlaCGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon CGC\);product=tRNA-Ala"};
if($9 ~ /AlaTGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon UGC\);product=tRNA-Ala"};
if($9 ~ /ArgACG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon ACG\);product=tRNA-Arg"};
if($9 ~ /ArgCCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon CCG\);product=tRNA-Arg"};
if($9 ~ /ArgCCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon CCU\);product=tRNA-Arg"};
if($9 ~ /ArgGCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon GCG\);product=tRNA-Arg"};
if($9 ~ /ArgTCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon UCG\);product=tRNA-Arg"};
if($9 ~ /ArgTCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon UCU\);product=tRNA-Arg"};
if($9 ~ /AsnGTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA asparagine \(anticodon GUU\);product=tRNA-Asn"};
if($9 ~ /AspATC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA aspartic acid \(anticodon AUC\);product=tRNA-Asp"};
if($9 ~ /AspGTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA aspartic acid \(anticodon GUC\);product=tRNA-Asp"};
if($9 ~ /CysACA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA cysteine \(anticodon ACA\);product=tRNA-Cys"};
if($9 ~ /CysGCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA cysteine \(anticodon GCA\);product=tRNA-Cys"};
if($9 ~ /GlnCTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamine \(anticodon CUG\);product=tRNA-Gln"};
if($9 ~ /GlnTTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamine \(anticodon UUG\);product=tRNA-Gln"};
if($9 ~ /GluCTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamic acid \(anticodon CUC\);product=tRNA-Glu"};
if($9 ~ /GluTTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamic acid \(anticodon UUC\);product=tRNA-Glu"};
if($9 ~ /GlyCCC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon CCC\);product=tRNA-Gly"};
if($9 ~ /GlyGCC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon GCC\);product=tRNA-Gly"};
if($9 ~ /GlyTCC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon UCC\);product=tRNA-Gly"};
if($9 ~ /HisGTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA histidine \(anticodon GUG\);product=tRNA-His"};
if($9 ~ /IleAAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon AAU\);product=tRNA-Ile"};
if($9 ~ /IleGAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon GAU\);product=tRNA-Ile"};
if($9 ~ /IleTAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon UAU\);product=tRNA-Ile"};
if($9 ~ /LeuAAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon AAG\);product=tRNA-Leu"};
if($9 ~ /LeuCAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon CAA\);product=tRNA-Leu"};
if($9 ~ /LeuCAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon CAG\);product=tRNA-Leu"};
if($9 ~ /LeuGAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon GAG\);product=tRNA-Leu"};
if($9 ~ /LeuTAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon UAA\);product=tRNA-Leu"};
if($9 ~ /LeuTAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon UAG\);product=tRNA-Leu"};
if($9 ~ /LysCTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA lysine \(anticodon CUU\);product=tRNA-Lys"};
if($9 ~ /LysTTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA lysine \(anticodon UUU\);product=tRNA-Lys"};
if($9 ~ /MetCAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA methionine \(anticodon CAU\);product=tRNA-Met"};
if($9 ~ /PheAAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA phenylalanine \(anticodon AAA\);product=tRNA-Phe"};
if($9 ~ /PheGAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA phenylalanine \(anticodon GAA\);product=tRNA-Phe"};
if($9 ~ /ProAGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon AGG\);product=tRNA-Pro"};
if($9 ~ /ProCGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon CGG\);product=tRNA-Pro"};
if($9 ~ /ProTGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon UGG\);product=tRNA-Pro"};
if($9 ~ /SeCTCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA selenocysteine \(anticodon UCA\);product=tRNA-SeC"};
if($9 ~ /SerAGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon AGA\);product=tRNA-Ser"};
if($9 ~ /SerCGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon CGA\);product=tRNA-Ser"};
if($9 ~ /SerGCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon GCU\);product=tRNA-Ser"};
if($9 ~ /SerGGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon GGA\);product=tRNA-Ser"};
if($9 ~ /SerTGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon UGA\);product=tRNA-Ser"};
if($9 ~ /SupCTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA amber suppressor \(anticodon CUA\);product=tRNA-Sup"};
if($9 ~ /SupTCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA opal suppressor \(anticodon UCA\);product=tRNA-Sup"};
if($9 ~ /SupTTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA ochre suppressor \(anticodon UUA\);product=tRNA-Sup"};
if($9 ~ /ThrAGT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon AGU\);product=tRNA-Thr"};
if($9 ~ /ThrCGT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon CGU\);product=tRNA-Thr"};
if($9 ~ /ThrTGT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon UGU\);product=tRNA-Thr"};
if($9 ~ /TrpCCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tryptophan \(anticodon CCA\);product=tRNA-Trp"};
if($9 ~ /TyrATA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tyrosine \(anticodon AUA\);product=tRNA-Tyr"};
if($9 ~ /TyrGTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tyrosine \(anticodon GUA\);product=tRNA-Tyr"};
if($9 ~ /ValAAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon AAC\);product=tRNA-Val"};
if($9 ~ /ValCAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon CAC\);product=tRNA-Val"};
if($9 ~ /ValGAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon GAC\);product=tRNA-Val"};
if($9 ~ /ValTAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon UAC\);product=tRNA-Val"};
if($9 ~ /iMetCAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA initiator methionine \(anticodon CAU\);product=tRNA-iMet"}}else{print}}'\
>tRNA_curated.gff3
```
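The 60-branch `if` chain above can equivalently be driven by a lookup table: the isotype and DNA anticodon are parsed out of the tRNAscan-SE name, T is converted to U, and the amino-acid name comes from the table. A hedged sketch on toy input (table abridged; iMet, SeC, and the suppressor tRNAs would still need the special cases used above):

```sh
# Abridged three-letter-code -> amino-acid-name table
printf 'Ala\talanine\nArg\targinine\nTrp\ttryptophan\n' > aa.tsv
printf 'AlaAGC\nTrpCCA\n' > toy.isotypes
awk -F'\t' 'NR==FNR {name[$1]=$2; next}
  {
    iso=substr($1,1,3); anti=substr($1,4)
    rna=anti; gsub(/T/,"U",rna)          # DNA anticodon -> RNA anticodon
    printf "%s\tNote=transfer RNA %s (anticodon %s);product=tRNA-%s\n",
           $1, name[iso], rna, iso
  }' aa.tsv toy.isotypes > toy.products.tsv
```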
#### Prediction of other non-coding RNA genes (Infernal)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_Infernal.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_Infernal
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Make symlinks
ln -s ../filter_lowqual_seq/${ASM}.fa ${ASM}.fa
# Prepare cm
mkdir cm;cd cm
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz; gunzip Rfam.cm.gz
wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
cmpress Rfam.cm
cd ..
seqkit split ${ASM}.fa --by-part 56
conda activate infernal-1.1.4
for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do
cmscan -Z 3410 --cut_ga --rfam --nohmmonly --oclan --oskip --fmt 2 --tblout ${ASM}.fa.split/${ASM}.part_${N}.Rfam.tbl --clanin cm/Rfam.clanin cm/Rfam.cm ${ASM}.fa.split/${ASM}.part_${N}.fa >${ASM}.fa.split/${ASM}.part_${N}.Rfam.cmscan &
done
# Convert to GFF3
cd ${ASM}.fa.split
for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do
perl /work/kawato/scripts/infernal-tblout2gff.pl --cmscan --fmt2 --desc ${ASM}.part_${N}.Rfam.tbl ${ASM}.part_${N}.Rfam.gff3
done
# Concatenate into a single file
cat ${ASM}.part_*.Rfam.gff3 >${ASM}.Rfam.gff3
# Filter out false positives and tRNA; assign gene number to each entry
less ${ASM}.Rfam.gff3|grep -v "tRNA"|grep -v "mir-"|grep -v "microRNA"|grep -v "Histone"|grep -v "Phe_leader"|grep -v "K_chan"|grep -v "CRISPR"|grep -v "mini-ykkC"|grep -v "MIR"|grep -v "SSU_rRNA_microsporidia" |grep -v "IRE_II"|grep -v "let-7"|grep -v "LSU_rRNA_bacteria"|grep -v "Sphinx_1"|grep -v "RNAI" |grep -v "PrrB_RsmZ"|\
awk -F'\t' -vOFS='\t' 'BEGIN {num=1};{ $3="exon";$9 = "Parent=gene" num ";product=" $9;num+=1 }1' >${ASM}.Rfam.parsed.gff3
# Make fasta file
gffread -g ../${ASM}.fa -F -w ${ASM}.Rfam.parsed.fa ${ASM}.Rfam.parsed.gff3 -v --keep-genes -E --table product
# Remove the product name from the title line (the description makes the header incompatible with seqkit grep)
cut -f 1 ${ASM}.Rfam.parsed.fa > ${ASM}.Rfam.parsed.mod.fa
# Create a list of target products. We decided to exclude Small_nucleolar_RNA_SNORA23 due to ambiguity in homology
echo -e "5S_ribosomal_RNA\nEukaryotic_large_subunit_ribosomal_RNA\n5.8S_ribosomal_RNA\nEukaryotic_small_subunit_ribosomal_RNA\n7SK_RNA\nU1_spliceosomal_RNA\nU2_spliceosomal_RNA\nU4_spliceosomal_RNA\nU5_spliceosomal_RNA\nU6_spliceosomal_RNA\nU11_spliceosomal_RNA\nU12_minor_spliceosomal_RNA\nU6atac_minor_spliceosomal_RNA\nSmall_nucleolar_RNA_SNORD18\nSmall_nucleolar_RNA_SNORD31\nSmall_nucleolar_RNA_SNORD34\nSmall_nucleolar_RNA_SNORA53\nSmall_nucleolar_RNA_U3\nSmall_nucleolar_RNA_snR61/Z1/Z11\nSmall_Cajal_body_specific_RNA_8\nMetazoan_signal_recognition_particle_RNA\nNuclear_RNase_P" >Rfam.target_products.list
# Make lists of entries for each gene
less Rfam.target_products.list|while read line;do
grep ">" ${ASM}.Rfam.parsed.fa|grep "$line" |sed 's/>//g'|cut -f 1 >`echo "${line}"|sed 's@/@_@g'`.list
done
# Create fasta files for each gene
less Rfam.target_products.list|while read line;do
seqkit grep -nf `echo "${line}"|sed 's@/@_@g'`.list ${ASM}.Rfam.parsed.mod.fa >`echo "${line}"|sed 's@/@_@g'`.fa
done
# Cluster using cd-hit-est
conda activate cd-hit
less Rfam.target_products.list|while read line;do
cd-hit-est -i `echo "${line}"|sed 's@/@_@g'`.fa -o `echo "${line}"|sed 's@/@_@g'`.nr -c 0.98 -s 0.95
done
# The files were aligned using MAFFT (https://mafft.cbrc.jp/alignment/server/cgi-bin/mafft5.cgi) online with default settings (Accessed June 2021).
# Clusters of obviously aberrant lengths were filtered out.
# The sequences were additionally queried against the NCBI non-redundant nucleotide database using BLASTN (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) (Accessed June 2021).
# retain_gene_list created manually.
# Parse GFF3 as per retain_gene_list; add appropriate notes and attributes to each entry
cat retain_gene_list|while read line;do
grep "Parent=${line};" ${ASM}.Rfam.parsed.gff3
done |awk 'BEGIN{FS = "\t";OFS="\t"}
{gsub(/_/," ",$9);r = gensub("Parent=","",1, gensub( ";.+", "", 1, $9 ))};
$9 ~ /ribosomal RNA/ {$3="rRNA"};
$9 ~ /7SK RNA|spliceosomal/ {$3="snRNA"};
$9 ~ /Small nucleolar/ {$3="snoRNA"};
$9 ~ /Small Cajal body specific RNA 8/ {$3="guide_RNA"};
$9 ~ /Metazoan signal recognition particle RNA/ {$3="scRNA"};
$9 ~ /Nuclear RNase P/ {$3="RNase_P_RNA"};
{print $1,$2,"gene",$4,$5,$6,$7,$8,"ID="substr($9,8)";"};
{print $1,$2,$3,$4,$5,$6,$7,$8,"ID=rna-"r";Parent="substr($9,8)";"};
{print $1,$2,"exon",$4,$5,$6,$7,$8,"Parent=rna-"substr($9,8)";"}'|\
awk 'BEGIN{FS = "\t";OFS="\t"};{
if ($3 == "gene"){match($9, /;/);
if($9 ~ /ribosomal RNA/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=rRNA"};
if($9 ~ /7SK RNA|spliceosomal/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=snRNA"};
if($9 ~ /Small nucleolar/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=snoRNA"};
if($9 ~ /Small Cajal body specific RNA 8/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=guide_RNA"};
if($9 ~ /Metazoan signal recognition particle RNA/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=scRNA"}
if($9 ~ /Nuclear RNase P/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=RNase_P_RNA"}
} else {print}}' >${ASM}.Rfam.curated.gff3
```
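The awk program above expands each retained Rfam hit into a three-level gene/RNA/exon feature hierarchy. A minimal sketch of that expansion on a toy Infernal-style GFF3 row (hypothetical IDs and coordinates; POSIX `sub()` is used here in place of gawk's `gensub()`):

```sh
# Emit a gene/rRNA/exon triplet from one Infernal-style GFF3 row (toy data)
printf 'scf1\tcmscan\tncRNA\t100\t200\t.\t+\t.\tParent=gene1;product=5S_ribosomal_RNA\n' |
awk 'BEGIN{FS=OFS="\t"}{
  r = $9; sub(/^Parent=/, "", r); sub(/;.*/, "", r)          # bare gene ID
  print $1,$2,"gene",$4,$5,$6,$7,$8,"ID="r";gene_biotype=rRNA"
  print $1,$2,"rRNA",$4,$5,$6,$7,$8,"ID=rna-"r";Parent="r";"
  print $1,$2,"exon",$4,$5,$6,$7,$8,"Parent=rna-"r";"
}'
```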
### Merge predicted genes and assign locus IDs
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_merge_gff3.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_merge_gff3
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Make symlinks
ln -s ../filter_lowqual_seq/${ASM}.fa ${ASM}.fa
ln -s ../${ASM}_Infernal/${ASM}.fa.split/${ASM}.Rfam.curated.gff3 ncRNA.gff3
ln -s ../${ASM}_tRNAscan-SE/tRNA_curated.gff3 tRNA.gff3
ln -s ../${ASM}_sqanti3_TAMA_TransDecoder/${ASM}_transdecoder.noOverlap.gff3 ${ASM}_transdecoder.noOverlap.gff3
# Remove descriptions; add "gene_biotype=protein_coding"
cat ${ASM}_transdecoder.noOverlap.gff3 |sed -E 's/;Name=ORF.+//g'|awk 'BEGIN{FS = "\t";OFS="\t"};{
if ($3 == "gene"){print $1,$2,$3,$4,$5,$6,$7,$8,$9";gene_biotype=protein_coding"} else {print}}'>protein_coding.gff3
# Remove predicted RNA genes overlapping with CDS
awk '$3 == "CDS" {print $0}' protein_coding.gff3 >protein_coding.cds.gff3
bedtools intersect -v -a tRNA.gff3 -b protein_coding.cds.gff3 >tRNA.noOverlap.gff3
bedtools intersect -v -a ncRNA.gff3 -b protein_coding.cds.gff3 >ncRNA.noOverlap.gff3
# Combine all
cat protein_coding.gff3 tRNA.noOverlap.gff3 ncRNA.noOverlap.gff3|sed '1s/^/##gff-version 3\n/' >all.gff3
# Sort GFF3
conda activate genometools
gt gff3 -sort -retainids -addids no all.gff3 | grep -v "^###$" >all.sort.gff3
# Create a table containing old and new gene ids
cat all.sort.gff3 |awk '$3 =="gene"{print $9}'|sed -E 's/;.+//g'|sed 's/ID=//g'>all.sort.gff3.gene.list
NUM=1
cat all.sort.gff3.gene.list|while read line;do
echo -e "${line}\tLOCUS`printf "%07d" $((NUM))`00"
NUM=$((NUM+1))
done >all.sort.gff3.gene.list.mod # run in the foreground so the table is complete before the renaming step below
# Rename genes
cat all.sort.gff3.gene.list.mod|while read BEFORE AFTER;do
grep -E "${BEFORE}[.;]" all.sort.gff3 |sed "s/${BEFORE}/${AFTER}/g"; done >all.sort.mod.gff3
# Rename scaffolds to the NCBI Accession Numbers
cat all.sort.mod.gff3 |sed 's/^scaffold_/BOPN010/g' |sed -E 's/^(BOPN[0-9]+)\t/\1.1\t/g'|awk 'BEGIN{FS = "\t";OFS="\t"};{if($3=="gene") {match($9, /ID=/);print $1,$2,$3,$4,$5,$6,$7,$8,$9";Name="substr($9, RLENGTH+1,14)}else {print}}' >annotation.tmp.gff3
```
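The locus IDs assigned above follow the pattern `LOCUS` + zero-padded serial number + `00`; the trailing `00` leaves room for inserting genes between existing loci later. A self-contained sketch of the numbering scheme on a hypothetical gene list:

```sh
# Hypothetical gene list; IDs follow the LOCUS%07d00 pattern used above
NUM=1
printf 'geneA\ngeneB\ngeneC\n' | while read -r line; do
  printf '%s\tLOCUS%07d00\n' "$line" "$NUM"
  NUM=$((NUM+1))
done
# geneA -> LOCUS000000100, geneB -> LOCUS000000200, geneC -> LOCUS000000300
```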
### Functional annotation of protein-coding genes (AHRD, KAAS)
#### Extracting non-redundant protein sequences
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_extractNrProteins.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_extractNrProteins
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Define VARs
GENOME=Mj_TUMSAT_v1.0
GENEID=${GENOME}.geneid
GENEID_TRANSCRIPTID_CDSLEN=${GENEID}.transcriptid.cdslen
ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${GENOME}.fa
ln -s ../Mj_TUMSAT_v1.0_merge_gff3/annotation.tmp.gff3 ${GENOME}.gff3
# Extract protein sequences (.faa); each description line carries the gene ID and CDS length
gffread ${GENOME}.gff3 -V -E -S -g ${GENOME}.fa -y ${GENEID_TRANSCRIPTID_CDSLEN}.faa --table=@geneid,@cdslen
# The file above has a space between the entry name and the description, which is incompatible with seqkit grep; trim it off
cut -f 1 ${GENEID_TRANSCRIPTID_CDSLEN}.faa >${GENOME}.faa
# Define list names
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2}'|sort|uniq >${GENEID}.list
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${GENEID_TRANSCRIPTID_CDSLEN}.list
# Extract the longest isoform from each loci
mkdir ${GENOME}
cat ${GENEID}.list |while read line;do
mkdir ${GENOME}/${line}
awk -v line=${line} '$1 == line {print $0}' ${GENEID_TRANSCRIPTID_CDSLEN}.list |cut -f 2,3 |sort -r -n -k 2 >${GENOME}/${line}/${line}.transcriptid.cdslen.list
ID=`awk 'NR==1 {print $1}' ${GENOME}/${line}/${line}.transcriptid.cdslen.list`
seqkit grep -np "${ID}" ${GENOME}.faa >${GENOME}/${line}/${line}.faa
done
# Merge into a single fasta file
cat ${GENEID}.list|while read line;do
cat ${GENOME}/${line}/${line}.faa
done >${GENOME}.nr.faa
seqkit stats ${GENOME}.nr.faa >${GENOME}.nr.faa.stats
conda activate busco-4.1.4
busco -i ${GENOME}.nr.faa -l arthropoda_odb10 -o ${GENOME}.nr.busco -c 56 -m prot -f
busco -i ${GENOME}.faa -l arthropoda_odb10 -o ${GENOME}.busco -c 56 -m prot -f
```
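The per-gene loop above picks the longest isoform of each locus by sorting that locus's transcripts on CDS length. Under the same assumed column order (gene ID, transcript ID, CDS length), a sort-based one-pass sketch performs the same selection on a toy table:

```sh
# Toy geneid/transcriptid/cdslen table; keep the longest transcript per gene
printf 'g1\tt1.1\t300\ng1\tt1.2\t450\ng2\tt2.1\t120\n' |
sort -t "$(printf '\t')" -k1,1 -k3,3nr |        # longest CDS first within each gene
awk 'BEGIN{FS="\t"} !seen[$1]++ {print $2}'     # first row per gene = longest isoform
# -> t1.2, t2.1
```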
#### Functional annotation (AHRD)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_MMSeqs2_AHRD.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_MMSeqs2_AHRD
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
### Define VARs
GENOME=Mj_TUMSAT_v1.0
NUMTH=56
QUERY=${GENOME}.nr
REF_DIR=/work/kawato/study/ref/proteins
# Create symlinks
ln -s ../${GENOME}_extractNrProteins/${QUERY}.faa ${QUERY}.faa
ln -s /work/kawato/apps/AHRD/test/resources/filter_descline_sprot.txt filter_descline_sprot.txt
# 802 M. japonicus proteins were downloaded from NCBI on 2021-02-18.
# Search condition: (("Penaeus japonicus"[porgn:__txid27405])) NOT mitochondrion
# /work/kawato/study/ref/proteins/2021-02-18_NCBI_Mj_prot.fasta
# MMSeqs2
conda activate mmseqs2
# Create db
mmseqs createdb ${QUERY}.faa ${QUERY}
mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein
mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein
mmseqs createdb ${REF_DIR}/2021-02-18_NCBI_Mj_prot.fasta 2021-02-18_NCBI_Mj_prot
mmseqs createdb ${REF_DIR}/Swiss-Plot_2020-12-17.faa Swiss-Plot_2020-12-17
DB_UNIREF50=${REF_DIR}/uniref50
# search against uniref50
mmseqs search ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out tmp_uniref50 -e 1e-03 -a
mmseqs convertalis ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out ${QUERY}.uniref50.mmseqs.out.formatmode0 --format-mode 0
# search against Swiss-Prot (file names retain the original "Swiss-Plot" spelling)
mmseqs search ${QUERY} Swiss-Plot_2020-12-17 ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out tmp_Swiss-Plot_2020-12-17 -e 1e-03 -a
mmseqs convertalis ${QUERY} Swiss-Plot_2020-12-17 ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out.formatmode0 --format-mode 0
# search against penaeid shrimp species
mmseqs search ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${QUERY} 2021-02-18_NCBI_Mj_prot ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out tmp_2021-02-18_NCBI_Mj_prot -e 1e-03 -a
mmseqs convertalis ${QUERY} 2021-02-18_NCBI_Mj_prot ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out.formatmode0 --format-mode 0
# AHRD
# Basically followed https://qiita.com/MaedaTaro_Umiushi/items/bd9e797d44c379124144
# create the following files:
# ahrd_input.yml
# blacklist_descline.txt
# filter_descline_aply.txt
# blacklist_token.txt
# Run AHRD
java -Xmx200g -jar \
/work/kawato/apps/AHRD/dist/ahrd.jar \
./ahrd_input.yml
# Rename "Unknown protein" to "hypothetical protein"; low-quality product names (two or more "-" in the three-character quality code, i.e. at most one "*") are also renamed to "hypothetical protein"
grep "^LOCUS" ahrd_output.csv |sed 's/Unknown protein/hypothetical protein/g'|awk 'BEGIN{FS="\t";OFS="\t"}{if(gsub(/-/,"-",$3) >= 2){$4="hypothetical protein"};{print $1,$4}}'|sed -E 's/\.[0-9]+\.p[0-9]+//g'>ahrd_output.products.list
```
ahrd_input.yml
```yaml
proteins_fasta: ./Mj_TUMSAT_v1.0.nr.faa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
output: ./ahrd_output.csv
blast_dbs:
  Swiss-Plot_2020-12-17:
    weight: 653
    description_score_bit_score_weight: 2.717061
    file: ./Mj_TUMSAT_v1.0.nr.Swiss-Plot_2020-12-17.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/Swiss-Plot_2020-12-17.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_sprot.txt
    token_blacklist: ./blacklist_token.txt
  2021-02-18_NCBI_Mj_prot:
    weight: 999
    description_score_bit_score_weight: 2.917405
    file: ./Mj_TUMSAT_v1.0.nr.2021-02-18_NCBI_Mj_prot.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/2021-02-18_NCBI_Mj_prot.fasta
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
  Lvan:
    weight: 800
    description_score_bit_score_weight: 2.590211
    file: ./Mj_TUMSAT_v1.0.nr.ASM378908v1_protein.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/GCF_003789085.1_ASM378908v1_protein.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
  Pmon:
    weight: 800
    description_score_bit_score_weight: 2.590211
    file: ./Mj_TUMSAT_v1.0.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/GCF_015228065.1_NSTDA_Pmon_1_protein.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
```
blacklist_descline.txt
```
(?i)^similar\s+to
(?i)^probable
(?i)^putative
(?i)putative$
(?i)^predicted
(?i)^uncharacterized
(?i)^uncharacterised
(?i)^TSA:
(?i)^unknown
(?i)^hypothetical
(?i)^unnamed
(?i)whole\s+genome\s+shotgun\s+sequence
(?i)^clone
(?i)[0-9][0-9][0-9][0-9][0-9]
(?i)genomic scaffold
(?i)genomic contig
(?i)genome sequencing data
(?i)isoform\s+X[0-9]+
(?i),\s+partial
(?i)LOW QUALITY PROTEIN:
(?i)Tollo
```
filter_descline_aply.txt
```
\sOS=.*$
(?i)OS.*[.].*protein
(?i)^H0.*protein
(?i)contains.*
IPR.*
\w{2,}\d{1,2}(g|G)\d+(\.\d)*\s+
\b\[.*
\b\S+\|\S+\|\S+
\(\s*Fragment\s*\)
^(\s|/|\(|\)|-|\+|\*|,|;|\.|\:|\||\d)+$
PREDICTED:
```
blacklist_token.txt
```
(?i)\bunknown\b
(?i)\bmember\b
(?i)\blike\b
(?i)\bassociated\b
(?i)\bcontaining\b
(?i)\bactivated\b
(?i)\bfamily\b
(?i)\bsubfamily\b
(?i)\binteracting\b
(?i)\bactivity\b
(?i)\bsimilar\b
(?i)\bproduct\b
(?i)\bexpressed\b
(?i)\bpredicted\b
(?i)\bputative\b
(?i)\buncharacterized\b
(?i)\bprobable\b
(?i)\bprotein\b
(?i)\bgene\b
(?i)\btair\b
(?i)\bfragment\b
(?i)\bhomolog\b
(?i)\bcontig\b
(?i)\brelated\b
(?i)\bremark\b
(?i)\b\w?orf(\w?|\d+)\b
(?i)\bisoform X[0-9]+\b
(?i)\b, partial\b
(?i)\bLOW QUALITY PROTEIN:\b
```
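The post-processing step after the AHRD run above demotes weakly supported product names by counting `-` characters in AHRD's three-character quality code; `gsub(/-/,"-",$3)` is used purely as a character counter. A toy illustration (column layout assumed to mirror `ahrd_output.csv`: locus, hit, quality code, product):

```sh
# Rows with >=2 dashes in the quality code are renamed to "hypothetical protein"
printf 'LOCUS1\thit1\t***\tchitinase\nLOCUS2\thit2\t*--\tsome protein\n' |
awk 'BEGIN{FS=OFS="\t"}{ if (gsub(/-/,"-",$3) >= 2) $4 = "hypothetical protein"; print $1,$4 }'
# -> LOCUS1 keeps its product name; LOCUS2 becomes "hypothetical protein"
```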
### Append functional annotation to predicted protein-coding gene models
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
# Create symlinks
ln -s ../${ASM}_MMSeqs2_AHRD/ahrd_output.products.list ahrd_output.products.list
ln -s ../${ASM}_filtering/all.sort.renamed.gff3 all.sort.renamed.gff3
ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${ASM}.fa
## KAAS was run on 2021-06-27
## start (UTC) end (UTC)
## 6/27 15:23 2021 6/27 16:45 2021
## ID : 1624807380
## query type : pep
## program : BLAST
## method : BBH
## GENES data set : hsa, mmu, rno, dre, dme, cel, ath, sce, ago, cal
## spo, ecu, pfa, cho, ehi, eco, nme, hpy, bsu, lla
## mge, mtu, syn, aae, mja, ape
# Download KAAS results
wget https://www.genome.jp/tools/kaas/files/dl/1624807380/query.ko
# Strip transcript/ORF suffixes (.<n>.p<m>) so the IDs collapse to locus IDs
sed -E 's/\.[0-9]+\.p[0-9]+//g' query.ko >query.ko.mod
# Download KO00001
curl "https://www.genome.jp/kegg-bin/download_htext?htext=ko00001.keg&format=htext&filedir=" > ko00001.keg
# Remove redundant texts
grep "^D" ko00001.keg |sed -E 's/\s{2,}/\t/g'|sed -E 's/;/\t/g'|sed -E 's/\[/\t/g'|sed 's/\]//g'| sed 's/[[:space:]]*\t[[:space:]]*/\t/g'|cut -f 2-5 > ko00001.keg.mod
# join ahrd_output.products.list, query.ko.mod, and ko00001.keg
python3 ./join_table.py # out:ahrd_kaas.tsv
# Parse ahrd_kaas.tsv into protein-coding gene attributes
cat ahrd_kaas.tsv|awk 'BEGIN{FS="\t";OFS="\t"};NR!=1 {print $1,"product="$2";Note=[KO:"$3"]%3B[EC_number:"$6"]%3Babbreviation%3D"$4"%3Bsynonyms%3D"$5}'|\
sed 's/\[KO\:NA\]%3B//g'|\
sed 's/\[EC_number\:NA\]%3B//g'|\
sed 's/abbreviation\%3DNA%3B//g'|\
sed 's/synonyms\%3DNA//g'|\
sed 's/;Note=$//g'|\
sed 's/,/%2C/g'>ahrd_kaas.gffAttributes.tsv
# Add attributes in ahrd_kaas.gffAttributes.tsv to GFF3
python3 ./add_product_information_to_protein_coding_genes_in_gff3.py >all.sort.renamed.attributes.gff3
# Add comments to GFF3, generating the final output Mj_TUMSAT_v1.0.gff3
cat all.sort.renamed.attributes.gff3|sed '1s/^/##gff-version 3\n/'|sed '2s/^/# Marsupenaeus japonicus Ginoza2017\n/'|sed "3s/^/# Assembly build: Mj_TUMSAT_v1.0\n/" |sed "4s/^/# GenBank assembly accession: GCA_017312705.1\n/" |sed "5s/^/# Date: $(LANG=en_us_88591;date -u)\n/" >${ASM}.gff3
# Verify gene count
cat ${ASM}.gff3 |awk '$3=="gene"{print $0}'|cut -f 2 -d ";"| sort | uniq -c > ${ASM}.genecount
# Make transcript, CDS, and protein fasta files with their product names on the description line
gffread ${ASM}.gff3 \
-g ${ASM}.fa \
-w ${ASM}.transcripts.fna \
-x ${ASM}.cds.fna \
-y ${ASM}.prot.faa --table=product
# BUSCO
conda activate busco-4.1.4
busco -i ${ASM}.prot.faa -l arthropoda_odb10 -o ${ASM}.prot.busco -c 56 -m prot -f
```
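The `%3B`/`%2C` substitutions above exist because `;`, `=`, and `,` are reserved separators in GFF3 column 9 and must be percent-encoded when they occur inside attribute values. A minimal helper (hypothetical, not part of the original pipeline) showing the encoding:

```sh
# Percent-encode GFF3-reserved characters inside an attribute value
encode_attr() { printf '%s' "$1" | sed -e 's/;/%3B/g' -e 's/=/%3D/g' -e 's/,/%2C/g'; }
encode_attr 'alpha,beta;EC=1.2.3.4'; echo
# -> alpha%2Cbeta%3BEC%3D1.2.3.4
```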
join_table.py
```python
#!/usr/bin/env python3
# coding: utf-8
import pandas as pd
# Read tables
df_ahrd_products = pd.read_table('ahrd_output.products.list', names=["locus", "product"]).sort_values(by=['locus'])
df_query_ko = pd.read_table('query.ko.mod', names=["locus","KO"])
df_ko00001 = pd.read_table('ko00001.keg.mod', names=["KO","Abbreviation","synonym","EC number"]).drop_duplicates()
# Join tables
df_ko_merge = pd.merge(df_query_ko, df_ko00001, on='KO',how='left')
df_join = df_ahrd_products.merge(df_ko_merge, on='locus', how='left')  # left join keeps loci without KO hits (filled with 'NA' below)
df_join = df_join.fillna('NA')
# Write the final table
df_join.to_csv('ahrd_kaas.tsv', sep='\t', index=False)
```
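`join_table.py` above performs what is effectively a left join keyed on the locus ID, filling loci that lack a KO assignment with `NA`. The same operation on toy tables with GNU `join` (illustrative only; the pipeline itself uses pandas):

```sh
# Left-join products with KO assignments; unmatched loci get NA
printf 'LOCUS1\tchitinase\nLOCUS2\thypothetical protein\n' > products.tsv
printf 'LOCUS1\tK01183\n' > ko.tsv
join -t "$(printf '\t')" -a 1 -e NA -o 0,1.2,2.2 products.tsv ko.tsv
# -> LOCUS2 is kept, with NA in the KO column
rm products.tsv ko.tsv
```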
add_product_information_to_protein_coding_genes_in_gff3.py
```python
#!/usr/bin/env python3
# coding: utf-8
import pandas as pd
pd.set_option('display.max_colwidth', None)
attrs = pd.read_table('ahrd_kaas.gffAttributes.tsv', names=["locus", "attributes"])
path = './all.sort.renamed.gff3'
with open(path) as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('#'):
            print(line)
            continue
        s = line.split('\t')
        if 'CDS' in s[2]:
            items = s[8].split(';')
            tags = dict([tmp.split('=') for tmp in items])
            if 'ID' in tags:
                id = tags['ID']
                locus = str(id.split('.')[1])
                attr = attrs[attrs['locus'] == locus].attributes
                attr = attr.to_string(index=False, header=False).strip()
                print(line + ";" + attr)
                continue
        else:
            print(line)
            continue
```
## Structural annotation of IsoSeq cDNA sequences (SQANTI3)
```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_sqanti3.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_sqanti3
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}
# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
CDNA=IsoSeq
conda activate SQANTI3.env
export PYTHONPATH=$PYTHONPATH:/work/kawato/apps/cDNA_Cupcake/sequence/
# Make symlinks
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ${CDNA}.fa
ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${ASM}.fa
ln -s ../Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS/${ASM}.gff3 ${ASM}.gff3
gffread ${ASM}.gff3 -g ${ASM}.fa -E -T -o ${ASM}.gtf #SQANTI3 requires GTF format for reference annotation
# ISOSEQ
minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${ASM}.fa ${CDNA}.fa > ${CDNA}.sam
sort -k 3,3 -k 4,4n ${CDNA}.sam > ${CDNA}.sorted.sam
collapse_isoforms_by_sam.py --input ${CDNA}.fa -s ${CDNA}.sorted.sam --dun-merge-5-shorter -o ${CDNA}.align -c 0.90 # coverage cutoff 0.90 = 90%; the .collapsed.gff output is in GFF2 format (= GTF)
python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${CDNA}.align.collapsed.gff ${ASM}.gtf ${ASM}.fa
```
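The `sort -k 3,3 -k 4,4n` step above orders alignments by reference name (SAM column 3) and leftmost position (column 4), as `collapse_isoforms_by_sam.py` expects; note that any `@` header lines would be interleaved by a plain textual sort. On toy alignment records:

```sh
# Coordinate-sort minimal alignment records (ref name, then position)
printf 'r1\t0\tscf2\t50\nr2\t0\tscf1\t300\nr3\t0\tscf1\t10\n' |
sort -k 3,3 -k 4,4n
# -> r3 (scf1:10), r2 (scf1:300), r1 (scf2:50)
```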
## Comparative genomics
### BUSCO (genome, arthropoda_odb10)
#### Running BUSCO pipeline
```sh
#!/bin/bash
# >>>>>>
# BUSCO_genome.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=BUSCO_arthropods_genome
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
REF_DIR=/work/kawato/study/ref/genomes
NUMTH=24
# create symlinks
ln -s ${REF_DIR}/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna Drosophila_melanogaster.fna
ln -s ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna Litopenaeus_vannamei.fna
ln -s ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna Penaeus_monodon.fna
ln -s ${REF_DIR}/GCF_000764305.1_Hazt_2.0_genomic.fna Hyalella_azteca.fna
ln -s ${REF_DIR}/GCF_003990815.1_ASM399081v1_genomic.fna Daphnia_magna.fna
ln -s ${REF_DIR}/GCA_004104545.1_Arma_vul_BF2787_genomic.fna Armadillidium_vulgare.fna
ln -s ${REF_DIR}/GCA_009176605.1_CNRS_Arma_nasa_1.0_genomic.fna Armadillidium_nasatum.fna
ln -s ${REF_DIR}/GCF_000151625.1_ASM15162v1_genomic.fna Bombyx_mori.fna
ln -s ${REF_DIR}/GCF_003254395.2_Amel_HAv3.1_genomic.fna Apis_mellifera.fna
ln -s ${REF_DIR}/GCA_007210705.1_Tcal_SD_v2.1_genomic.fna Tigriopus_californicus.fna
ln -s ${REF_DIR}/GCF_000591075.1_Eaff_2.0_genomic.fna Eurytemora_affinis.fna
ln -s ${REF_DIR}/GCA_009805615.1_SNU_Aamp_1_genomic.fna Amphibalanus_amphitrite.fna
ln -s ${REF_DIR}/GCA_015104395.1_ASM1510439v1_genomic.fna Macrobrachium_nipponense.fna
ln -s ${REF_DIR}/GCA_009830355.1_DU_Cdes_1.0_genomic.fna Cherax_destructor.fna
ln -s ${REF_DIR}/GCA_013283005.1_ASM1328300v1_genomic.fna Paralithodes_platypus.fna
ln -s ${REF_DIR}/GCA_002838885.1_Pvir0.4_genomic.fna Procambarus_virginalis.fna
ln -s ${REF_DIR}/GCA_009761615.1_DU_Cquad_1.0_genomic.fna Cherax_quadricarinatus.fna
ln -s ${REF_DIR}/GCA_013436485.1_ASM1343648v1_genomic.fna Eriocheir_sinensis.fna
ln -s ${REF_DIR}/GCA_002291165.1_Mjap_WGS_v1_genomic.fna Mjap_WGS_v1.fna
ln -s ${REF_DIR}/GCA_002291185.1_Pmon_WGS_v1_genomic.fna Pmon_WGS_v1.fna
ln -s ${REF_DIR}/GCA_007890405.1_Pmod26D_v1_genomic.fna Pmod26D_v1.fna
ln -s ${REF_DIR}/GCA_016920825.1_ASM1692082v1_genomic.fna Fenneropenaeus_chinensis.fna
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Marsupenaeus_japonicus.fna
conda activate busco-4.1.4
# BUSCO
for SP in Drosophila_melanogaster Litopenaeus_vannamei Penaeus_monodon Fenneropenaeus_chinensis Hyalella_azteca Daphnia_magna Armadillidium_vulgare Armadillidium_nasatum Bombyx_mori Apis_mellifera Tigriopus_californicus Eurytemora_affinis Amphibalanus_amphitrite Macrobrachium_nipponense Cherax_destructor Paralithodes_platypus Procambarus_virginalis Cherax_quadricarinatus Eriocheir_sinensis Mjap_WGS_v1 Pmon_WGS_v1 Pmod26D_v1 Marsupenaeus_japonicus;do
busco -i ${SP}.fna -l arthropoda_odb10 -c ${NUMTH} -o ${SP}.genome.busco -m genome
done
```
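Each run's headline numbers can be pulled from the one-line summary inside `short_summary.*.txt` (line format assumed from BUSCO v4; the example values below are the *M. japonicus* genome results):

```sh
# Extract S/D/F/M percentages from a BUSCO one-line summary (format assumed)
line='C:93.4%[S:90.6%,D:2.8%],F:2.9%,M:3.7%,n:1013'
echo "$line" | sed -E 's/.*\[S:([0-9.]+)%,D:([0-9.]+)%\],F:([0-9.]+)%,M:([0-9.]+)%.*/S=\1 D=\2 F=\3 M=\4/'
# -> S=90.6 D=2.8 F=2.9 M=3.7
```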
#### Generate summary script (generate_plot.py)
https://gitlab.com/ezlab/busco/-/blob/master/scripts/generate_plot.py
```sh
#!/bin/bash
# >>>>>>
# BUSCO_generate_plot.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=BUSCO_arthropods_genome
cd ${WHOME}${WDIR}
# Define variables
mkdir BUSCO_summaries
conda activate busco-4.1.4
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei Fenneropenaeus_chinensis Eriocheir_sinensis Procambarus_virginalis Cherax_destructor Cherax_quadricarinatus Armadillidium_nasatum Armadillidium_vulgare Hyalella_azteca Amphibalanus_amphitrite Eurytemora_affinis Daphnia_magna Apis_mellifera Bombyx_mori Drosophila_melanogaster ;do
cp ${SP}.genome.busco/short_summary.specific.arthropoda_odb10.${SP}.genome.busco.txt BUSCO_summaries/.
done
# generate_plot.py
generate_plot.py -wd BUSCO_summaries
```
#### Plot (R script)
```R
##################################
#
# BUSCO summary figure
# @version 4.0.0
# @since BUSCO 2.0.0
#
# Copyright (c) 2016-2020, Evgeny Zdobnov (ez@ezlab.org)
# Licensed under the MIT license. See LICENSE.md file.
#
##################################
setwd("C:/Users/kawato/Share/2020-09-24_Mj/")
# Load the required libraries
library(ggplot2)
library("grid")
library(forcats) # provides fct_rev(), used below
# !!! CONFIGURE YOUR PLOT HERE !!!
# Output
my_width <- 20
my_height <- 15
my_unit <- "cm"
# Colors
my_colors <- c("#F2E436", "#F0B83D", "#74AFD0", "#999999")
# Bar height ratio
my_bar_height <- 0.75
# Legend
my_title <- "BUSCO Assessment Results"
# Font
my_family <- "sans"
my_size_ratio <- 1
# !!! SEE YOUR DATA HERE !!!
# Your data as generated by python, remove or add more
my_species <- c('Marsupenaeus japonicus', 'Marsupenaeus japonicus', 'Marsupenaeus japonicus', 'Marsupenaeus japonicus', 'Penaeus monodon', 'Penaeus monodon', 'Penaeus monodon', 'Penaeus monodon', 'Litopenaeus vannamei', 'Litopenaeus vannamei', 'Litopenaeus vannamei', 'Litopenaeus vannamei', 'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis', 'Eriocheir sinensis', 'Eriocheir sinensis', 'Eriocheir sinensis', 'Eriocheir sinensis', 'Procambarus virginalis', 'Procambarus virginalis', 'Procambarus virginalis', 'Procambarus virginalis', 'Cherax destructor', 'Cherax destructor', 'Cherax destructor', 'Cherax destructor', 'Cherax quadricarinatus', 'Cherax quadricarinatus', 'Cherax quadricarinatus', 'Cherax quadricarinatus', 'Armadillidium nasatum', 'Armadillidium nasatum', 'Armadillidium nasatum', 'Armadillidium nasatum', 'Armadillidium vulgare', 'Armadillidium vulgare', 'Armadillidium vulgare', 'Armadillidium vulgare', 'Hyalella azteca', 'Hyalella azteca', 'Hyalella azteca', 'Hyalella azteca', 'Amphibalanus amphitrite', 'Amphibalanus amphitrite', 'Amphibalanus amphitrite', 'Amphibalanus amphitrite', 'Eurytemora affinis', 'Eurytemora affinis', 'Eurytemora affinis', 'Eurytemora affinis', 'Daphnia magna', 'Daphnia magna', 'Daphnia magna', 'Daphnia magna', 'Apis mellifera', 'Apis mellifera', 'Apis mellifera', 'Apis mellifera', 'Bombyx mori', 'Bombyx mori', 'Bombyx mori', 'Bombyx mori', 'Drosophila melanogaster', 'Drosophila melanogaster', 'Drosophila melanogaster', 'Drosophila melanogaster')
my_species <- factor(my_species)
my_species <- fct_rev(factor(my_species,levels=c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei", "Fenneropenaeus chinensis", "Eriocheir sinensis", "Procambarus virginalis", "Cherax destructor", "Cherax quadricarinatus", "Armadillidium nasatum", "Armadillidium vulgare", "Hyalella azteca", "Amphibalanus amphitrite", "Eurytemora affinis", "Daphnia magna", "Apis mellifera", "Bombyx mori", "Drosophila melanogaster"))) # reorder your species here just by changing the values in the vector :
my_percentage <- c(90.6, 2.8, 2.9, 3.7, 81.9, 1.8, 4.7, 11.6, 82.4, 5.4, 4.5, 7.7, 81.5, 3.4, 6.0, 9.1, 92.0, 0.6, 3.1, 4.3, 84.6, 0.7, 6.1, 8.6, 79.7, 0.9, 8.9, 10.5, 68.9, 0.4, 15.7, 15.0, 81.6, 4.9, 4.2, 9.3, 77.1, 5.3, 6.2, 11.4, 89.9, 0.5, 3.7, 5.9, 65.4, 26.7, 2.2, 5.7, 72.9, 0.9, 7.9, 18.3, 95.4, 0.9, 1.7, 2.0, 97.9, 0.1, 0.6, 1.4, 93.4, 0.4, 2.3, 3.9, 99.2, 0.6, 0.0, 0.2)
my_values <- c(918, 28, 29, 38, 830, 18, 48, 117, 835, 55, 46, 77, 826, 34, 61, 92, 932, 6, 31, 44, 857, 7, 62, 87, 807, 9, 90, 107, 698, 4, 159, 152, 827, 50, 43, 93, 781, 54, 63, 115, 911, 5, 37, 60, 663, 270, 22, 58, 738, 9, 80, 186, 966, 9, 17, 21, 992, 1, 6, 14, 946, 4, 23, 40, 1005, 6, 0, 2)
##################################
##################################
##################################
# Code to produce the graph
labsize = 1
if (length(levels(my_species)) > 10){
labsize = 0.66
}
print("Plotting the figure ...")
category <- c(rep(c("S","D","F","M"),c(1)))
category <-factor(category)
category = factor(category,levels(category)[c(4,1,2,3)])
df = data.frame(my_species,my_percentage,my_values,category)
figure <- ggplot() +
geom_bar(aes(y = my_percentage, x = my_species, fill = category),colour="black", size=0.2, position = position_stack(reverse = TRUE), data = df, stat="identity", width=my_bar_height) +
coord_flip() +
theme_gray(base_size = 8) +
scale_y_continuous(labels = c("0","20","40","60","80","100"), breaks = c(0,20,40,60,80,100)) +
scale_fill_manual(values = my_colors,labels =c(" Complete (C) and single-copy (S) ",
" Complete (C) and duplicated (D)",
" Fragmented (F) ",
" Missing (M)")) +
ggtitle(my_title) +
xlab("") +
ylab("\n%BUSCOs") +
theme(plot.title = element_text(family=my_family, hjust=0.5, colour = "black", size = rel(2.2)*my_size_ratio, face = "bold")) +
theme(legend.position="top",legend.title = element_blank()) +
theme(legend.text = element_text(family=my_family, size = rel(1.2)*my_size_ratio)) +
theme(panel.background = element_rect(color="#FFFFFF", fill="white")) +
theme(panel.grid.minor = element_blank()) +
theme(panel.grid.major = element_blank()) +
theme(axis.text.y = element_text(family=my_family, colour = "black", size = rel(1.66)*my_size_ratio,face = 'italic')) +
theme(axis.text.x = element_text(family=my_family, colour = "black", size = rel(1.66)*my_size_ratio)) +
theme(axis.line = element_line(size=1*my_size_ratio, colour = "black")) +
theme(axis.ticks.length = unit(.85, "cm")) +
theme(axis.ticks.y = element_line(colour="white", size = 0)) +
theme(axis.ticks.x = element_line(colour="#222222")) +
theme(axis.ticks.length = unit(0.4, "cm")) +
theme(axis.title.x = element_text(family=my_family, size=rel(1.2)*my_size_ratio)) +
guides(fill = guide_legend(override.aes = list(colour = NULL))) +
guides(fill=guide_legend(nrow=2,byrow=TRUE))
for(i in rev(c(1:length(levels(my_species))))){
detailed_values <- my_values[my_species==my_species[my_species==levels(my_species)[i]]]
total_buscos <- sum(detailed_values)
figure <- figure +
annotate("text", label=paste("C:", detailed_values[1] + detailed_values[2], " [S:", detailed_values[1], ", D:", detailed_values[2], "], F:", detailed_values[3], ", M:", detailed_values[4], ", n:", total_buscos, sep=""),
y=3, x = i, size = labsize*4*my_size_ratio, colour = "black", hjust=0, family=my_family)
}
ggsave(figure, file = "Figure_2.tiff", device = "tiff", dpi = 350, width = my_width, height = my_height, unit = my_unit)
print("Done")
```
### Comparison of simple repeat contents in arthropod genomes (RepeatMasker)
```sh
#!/bin/bash
# >>>>>>
# RepeatMasker_SimpleRepeats_Arthropods.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=RepeatMasker_SimpleRepeats_Arthropods
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
REF_DIR=/work/kawato/study/ref/genomes
NUMTH=24
# create symlinks
ln -s ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna Penaeus_monodon.fna
ln -s ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna Litopenaeus_vannamei.fna
ln -s ${REF_DIR}/GCA_016920825.1_ASM1692082v1_genomic.fna Fenneropenaeus_chinensis.fna
ln -s ${REF_DIR}/GCA_013436485.1_ASM1343648v1_genomic.fna Eriocheir_sinensis.fna
ln -s ${REF_DIR}/GCA_002838885.1_Pvir0.4_genomic.fna Procambarus_virginalis.fna
ln -s ${REF_DIR}/GCA_009830355.1_DU_Cdes_1.0_genomic.fna Cherax_destructor.fna
ln -s ${REF_DIR}/GCA_009761615.1_DU_Cquad_1.0_genomic.fna Cherax_quadricarinatus.fna
ln -s ${REF_DIR}/GCA_009176605.1_CNRS_Arma_nasa_1.0_genomic.fna Armadillidium_nasatum.fna
ln -s ${REF_DIR}/GCA_004104545.1_Arma_vul_BF2787_genomic.fna Armadillidium_vulgare.fna
ln -s ${REF_DIR}/GCF_000764305.1_Hazt_2.0_genomic.fna Hyalella_azteca.fna
ln -s ${REF_DIR}/GCA_009805615.1_SNU_Aamp_1_genomic.fna Amphibalanus_amphitrite.fna
ln -s ${REF_DIR}/GCF_000591075.1_Eaff_2.0_genomic.fna Eurytemora_affinis.fna
ln -s ${REF_DIR}/GCF_003990815.1_ASM399081v1_genomic.fna Daphnia_magna.fna
ln -s ${REF_DIR}/GCF_003254395.2_Amel_HAv3.1_genomic.fna Apis_mellifera.fna
ln -s ${REF_DIR}/GCF_000151625.1_ASM15162v1_genomic.fna Bombyx_mori.fna
ln -s ${REF_DIR}/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna Drosophila_melanogaster.fna
# Run RepeatMasker
conda activate repeatmodeler
for SP in Penaeus_monodon Litopenaeus_vannamei Fenneropenaeus_chinensis Eriocheir_sinensis Procambarus_virginalis Cherax_destructor Cherax_quadricarinatus Armadillidium_nasatum Armadillidium_vulgare Hyalella_azteca Amphibalanus_amphitrite Eurytemora_affinis Daphnia_magna Apis_mellifera Bombyx_mori Drosophila_melanogaster;do
RepeatMasker -pa 14 -noint -no_is -norna -gff -xsmall -a -s ${SP}.fna -dir ${SP}.RepeatMasker.SoftMasked
done
```
### Comparison of protein-coding gene metrics in three penaeid shrimp genomes
#### Get metrics
```sh
#!/bin/bash
# >>>>>>
# GetNrProteinsAndGeneMetrics.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=GetNrProteinsAndGeneMetrics
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
REF_DIR=/work/kawato/study/ref/genomes
mkdir Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei
# Make symlinks for the M. japonicus genome and GFF3 inside its own directory. mtDNA already removed
ln -s ${REF_DIR}/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna Marsupenaeus_japonicus/Marsupenaeus_japonicus.fna
ln -s ../../Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS/Mj_TUMSAT_v1.0.gff3 Marsupenaeus_japonicus/Marsupenaeus_japonicus.gff
# Prepare P. monodon and L. vannamei; remove mtDNA records
seqkit grep -v -nr -p "NC_002184.1" ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna > Penaeus_monodon/Penaeus_monodon.fna
cat ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.gff |sed -e '/NC_002184.1/,$d' > Penaeus_monodon/Penaeus_monodon.gff
seqkit grep -v -nr -p "NC_009626.1" ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna > Litopenaeus_vannamei/Litopenaeus_vannamei.fna
cat ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.gff |sed -e '/NC_009626.1/,$d' >Litopenaeus_vannamei/Litopenaeus_vannamei.gff
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do
cd ${SP}
# Extract protein sequences (.faa); each description line carries the gene ID and CDS length
gffread ${SP}.gff -E -S -g ${SP}.fna -y ${SP}.annotation.geneid.transcriptid.cdslen.faa --table=@geneid,@cdslen
# The description fields are incompatible with seqkit grep; keep only the sequence ID
cut -f 1 ${SP}.annotation.geneid.transcriptid.cdslen.faa >${SP}.annotation.faa
# Create tables
grep ">" ${SP}.annotation.geneid.transcriptid.cdslen.faa |sed 's/>//g'|awk '{print $2}'|sort|uniq >${SP}.annotation.geneid.list
grep ">" ${SP}.annotation.geneid.transcriptid.cdslen.faa |sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${SP}.annotation.geneid.transcriptid.cdslen.list
# Extract the longest isoform from each locus
# Split the gene list into 600-line chunks for parallel processing
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.geneid.list ${SP}.annotation.geneid.list_
# The loop processes up to 50 chunks (suffixes 000-049), i.e. up to 30,000 genes
for N in $(seq -f '%03g' 0 49); do
while read line;do
ID=$(awk -v line=${line} '$1 == line {print $0}' ${SP}.annotation.geneid.transcriptid.cdslen.list | sort -r -n -k 3 | awk 'NR==1 {print $2}')
seqkit grep -np "${ID}" ${SP}.annotation.faa
done >${SP}.annotation.nr.${N}.faa < ${SP}.annotation.geneid.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.annotation.nr.faa # Delete preexisting file, if any
cat ${SP}.annotation.nr.*.faa >>${SP}.annotation.nr.faa # Concatenate partitioned fasta
rm ${SP}.annotation.nr.*.faa ${SP}.annotation.geneid.list_* # clean up
# Convert .gff3 to .gtf for subsequent filtering
gffread ${SP}.gff -T --keep-genes -E -o ${SP}.gtf 2> ${SP}.gffread.log # GFF3's nested gene-transcript-exon structure is awkward to filter; flat GTF is easier to grep
# List longest isoforms
grep ">" ${SP}.annotation.nr.faa |sed 's/>//g' >${SP}.annotation.nr.list
# Generate a .gtf file containing the longest isoform of each locus. This GTF is used for gene metrics calculation.
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.nr.list ${SP}.annotation.nr.list_ # split into 600-line chunks
for N in $(seq -f '%03g' 0 49); do
while read line;do
grep "transcript_id \"${line}\"" ${SP}.gtf
done >${SP}.nr_${N}.gtf < ${SP}.annotation.nr.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.nr.gtf # Delete preexisting file, if any
cat ${SP}.nr_*.gtf >> ${SP}.nr.gtf # Concatenate partitioned GTF
rm ${SP}.annotation.nr.list_* ${SP}.nr_*.gtf # clean up
# prepare .gff files for gene metrics calculation. This conversion was necessary to make intron features.
cat ${SP}.nr.gtf |grep $'\tCDS\t' >${SP}.nr.CDSonly.gtf # Create a version without non-CDS (=UTR) exons
gffread ${SP}.nr.CDSonly.gtf --force-exons -F -g ${SP}.fna -o ${SP}.nr.CDSonly.gff # Convert GTF back to GFF, creating exon features (adding gene features with gt gff3 did not work out)
conda activate genometools
gt gff3 -checkids -addintrons -retainids ${SP}.nr.CDSonly.gff >${SP}.nr.CDSonly.addintrons.gff # Creates introns. -retainids option is a must. Without it gt gff3 automatically renames transcripts
gffread ${SP}.nr.CDSonly.addintrons.gff -E -S -g ${SP}.fna -y ${SP}.nr.CDSonly.addintrons.faa # Check the integrity of gff by translating the features into faa
grep ">" ${SP}.nr.CDSonly.addintrons.faa |wc -l >${SP}.nr.CDSonly.addintrons.counts
# Get exon length
awk -v species=$SP '$3 == "exon" {transcript_name = $9 ; sub(/Parent\=/,"", transcript_name); sum = $5 - $4 + 1 ;print species "\t" $1 "\t" transcript_name "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.exonlength.tidy
# Get intron length table
awk -v species=$SP '$3 == "intron" {transcript_name = $9 ; sub(/Parent\=/,"", transcript_name); sum = $5 - $4 + 1 ;print species "\t" $1 "\t" transcript_name "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.intronlength.tidy
# Get gene length table
gawk -v species=$SP '$3 == "transcript" {sum = $5 - $4 + 1 ; print species "\t" $1 "\t" gensub(/(^ID\=)([^;]+)(.+)/, "\\2", "g", $9) "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.genelength.tidy
# Count exons
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.nr.list ${SP}.annotation.nr.list_ # split into 600-line chunks
for N in $(seq -f '%03g' 0 49); do
while read line;do
echo -e "${SP}\t$(grep "ID=${line};" ${SP}.nr.CDSonly.addintrons.gff | cut -f 1 | sort | uniq)\t${line}\t$(grep "Parent=${line}$" ${SP}.nr.CDSonly.addintrons.gff | awk '$3 == "exon"' | wc -l)"
done >${SP}.nr.CDSonly.exoncount_${N} < ${SP}.annotation.nr.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.nr.CDSonly.addintrons.exoncount.tidy # Delete preexisting file, if any
cat ${SP}.nr.CDSonly.exoncount_* >>${SP}.nr.CDSonly.addintrons.exoncount.tidy
rm ${SP}.nr.CDSonly.exoncount_* ${SP}.annotation.nr.list_* # clean up
cd ..
done
# Pool the per-species tables
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei; do cat ${SP}/${SP}.nr.CDSonly.addintrons.intronlength.tidy; done > intronlength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei; do cat ${SP}/${SP}.nr.CDSonly.addintrons.exonlength.tidy; done > exonlength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei; do cat ${SP}/${SP}.nr.CDSonly.addintrons.genelength.tidy; done > genelength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei; do cat ${SP}/${SP}.nr.CDSonly.addintrons.exoncount.tidy; done > exoncount.tidy
```
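The longest-isoform extraction above loops `grep` over every gene ID, split into 50 background chunks. The same selection can be sketched as a single `sort | awk` pass over the geneid/transcriptid/cdslen table. The toy table below is hypothetical, but the logic (sort by gene ID, then by CDS length descending, keep the first transcript per gene) matches the loop's intent:

```sh
# Toy geneid<TAB>transcriptid<TAB>cdslen table (hypothetical data)
printf 'geneA\ttxA1\t300\ngeneA\ttxA2\t450\ngeneB\ttxB1\t120\n' > toy.list
# Sort by gene ID, then CDS length descending; keep the first (= longest) transcript per gene
sort -k1,1 -k3,3nr toy.list | awk -F'\t' '!seen[$1]++ {print $2}' > longest.txt
cat longest.txt   # txA2, then txB1
```

Being a single pass over the table, this avoids re-scanning it once per gene; the chunked-grep approach in the script reaches the same result through brute-force parallelism.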
#### Visualization (ggplot2)
```R
##################################
#
# plot_gene_metrics.R
#
##################################
setwd("C:/Users/kawato/study/kuruma_genome2017/analysis/plot_gene_metrics/")
# Load the required libraries
library(tidyverse)
library(ggplot2)
library(patchwork)
# Color palette
cbp1 <- c("#999999", "#E69F00", "#56B4E9")
# Gene length
geneLen <- read_tsv("genelength.tidy", col_names=c("species", "scaffold", "locus", "type", "start", "end", "length"))
geneLen$species <- as.factor(geneLen$species)
geneLen$species <- factor(geneLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(geneLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p1 <- ggplot(geneLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 100,size=2) +
xlim(0,22000) + theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Gene length (bp, bin=100 bp)",
y = "Count",
colour = "Species"
) + theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
# Exon count
exonCount <- read_tsv("exoncount.tidy", col_names=c("species", "scaffold", "locus", "noOfExons"))
exonCount$species <- as.factor(exonCount$species)
exonCount$species <- factor(exonCount$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(exonCount$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p2 <- ggplot(exonCount) +
geom_freqpoly(aes(x=noOfExons,color=species),binwidth = 1,size=2) +
xlim(0,20) + theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Number of exons",
y = "Count",
colour = "Species"
) + theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
# Exon length
exonLen <- read_tsv("exonlength.tidy", col_names=c("species", "scaffold", "locus", "type", "start", "end", "length"))
exonLen$species <- as.factor(exonLen$species)
exonLen$species <- factor(exonLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(exonLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p3 <- ggplot(exonLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 10,size=2) +
xlim(0,500) + ylim(0,12000) + theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Exon length (bp, bin=10 bp)",
y = "Count",
colour = "Species"
) + theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
# Intron length
intronLen <- read_tsv("intronlength.tidy", col_names=c("species", "scaffold", "transcript", "type", "start", "end", "length"))
intronLen <- subset(intronLen,intronLen$length > 10)
intronLen$species <- as.factor(intronLen$species)
intronLen$species <- factor(intronLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(intronLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p4 <- ggplot(intronLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 10,size=2) +
xlim(0,1000) + theme_classic() +
scale_colour_manual(values=cbp1) +
labs(
x = "Intron length (bp, bin=10 bp)",
y = "Count",
colour = "Species"
) + theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())
p_merge <- (p1 + p2) / (p3 + p4) +
plot_annotation(tag_levels = 'A') & theme(plot.tag.position = c(0,1),plot.tag = element_text(size = 32,face="bold"))
ggsave(file = "Figure_4.tiff", plot = p_merge, dpi = 350, width = 12, height = 12)
```
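Each `.tidy` table read above is long-format: species, scaffold, feature ID, type, start, end, length. Before plotting, a quick `awk` pass can sanity-check the pooled tables (feature count and mean length per species). The two-row fragment below is hypothetical:

```sh
# Hypothetical fragment of genelength.tidy (species, scaffold, locus, type, start, end, length)
printf 'Marsupenaeus_japonicus\tscf1\tgene1\ttranscript\t100\t1099\t1000\nMarsupenaeus_japonicus\tscf1\tgene2\ttranscript\t2000\t2499\t500\n' > toy_genelength.tidy
# Per-species feature count and mean length (columns 1 and 7)
awk -F'\t' '{n[$1]++; sum[$1] += $7}
            END {for (s in n) printf "%s\t%d\t%.1f\n", s, n[s], sum[s]/n[s]}' toy_genelength.tidy > toy_summary.tsv
cat toy_summary.tsv   # Marsupenaeus_japonicus  2  750.0
```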
### Orthologous gene clustering analysis (OrthoFinder)
#### Running OrthoFinder
```sh
#!/bin/bash
# >>>>>>
# OrthoFinder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh
# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=OrthoFinder
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}
# Define variables
NUMTH=56
# create directory to store faa
mkdir faa
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do
ln -s ${WHOME}GetNrProteinsAndGeneMetrics/${SP}/${SP}.annotation.nr.faa faa/${SP}.faa
done
seqkit stats faa/*.faa >faa.stats
conda activate orthofinder
orthofinder -t ${NUMTH} -a ${NUMTH} -n orthofinder_run -f faa
```
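OrthoFinder expects one protein per gene in each input FASTA. As a lightweight stand-in for the `seqkit stats` check, one can count headers and confirm that sequence IDs are unique; the directory and file names below are hypothetical:

```sh
mkdir -p toy_faa
printf '>p1 geneA\nMKVL\n>p2 geneB\nMLLS\n' > toy_faa/SpeciesA.faa
# Count sequences per file and check that IDs (first token after ">") are unique
for f in toy_faa/*.faa; do
    total=$(grep -c '^>' "$f")
    uniq_ids=$(grep '^>' "$f" | awk '{print $1}' | sort -u | wc -l)
    printf '%s\t%d\t%d\n' "$f" "$total" "$uniq_ids"
done > toy_faa.stats
cat toy_faa.stats   # toy_faa/SpeciesA.faa  2  2
```

If the second and third columns differ for any file, that FASTA contains duplicate IDs and should be fixed before the OrthoFinder run.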
#### Visualization (VennDiagram)
```R
##################################
#
# orthofinder_venndiagram.R
#
##################################
setwd("C:/Users/kawato/study/kuruma_genome2017/analysis/orthofinder/Orthogroups/")
# Load the required libraries
library(tidyverse)
library(VennDiagram)
Orthogroups <- read_tsv("Orthogroups.GeneCount.tsv")
x <- list(
'Marsupenaeus japonicus' = unlist(subset(Orthogroups, Orthogroups$Marsupenaeus_japonicus >0, select=c(1)), use.names=FALSE),
'Penaeus monodon' = unlist(subset(Orthogroups, Orthogroups$Penaeus_monodon >0, select=c(1)), use.names=FALSE),
'Litopenaeus vannamei' = unlist(subset(Orthogroups, Orthogroups$Litopenaeus_vannamei >0, select=c(1)), use.names=FALSE)
)
my_plot <- venn.diagram(
x,
category.names = c("Marsupenaeus japonicus" , "Penaeus monodon" , "Litopenaeus vannamei"),
filename = NULL,
output = TRUE ,
imagetype="svg" ,
margin=0.02,
lwd = 4,
col=c("#999999", "#E69F00", "#56B4E9"),
fill = c(alpha("#999999",0.3), alpha('#E69F00',0.3), alpha('#56B4E9',0.3)),
print.mode = c("raw", "percent"),
cex = 1.5,
fontfamily = "sans",
cat.cex = 1.6,
cat.default.pos = "outer",
cat.pos = c(-20, 23, 180),
cat.dist = c(0.055, 0.055, 0.035),
cat.fontfamily = "sans",
cat.fontface = c(3,3,3),
cat.col = c("#4d4d4d", '#E69F00', '#56B4E9'),
rotation = 1
)
ggsave(my_plot, file="Figure_5.tiff", device = "tiff", dpi = 350, width = 7, height = 7)
```
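The Venn membership drawn above can also be tallied without R: each row of `Orthogroups.GeneCount.tsv` is classified by its presence/absence pattern across the three species columns. A sketch on a hypothetical three-orthogroup table (column names abbreviated):

```sh
# Hypothetical Orthogroups.GeneCount.tsv: Orthogroup, per-species gene counts, Total
printf 'Orthogroup\tMja\tPmo\tLva\tTotal\nOG0000000\t2\t1\t0\t3\nOG0000001\t1\t1\t1\t3\nOG0000002\t0\t0\t4\t4\n' > toy_counts.tsv
# Concatenate presence flags (1/0) per species into a pattern such as "110"
awk -F'\t' 'NR > 1 {pattern = ($2 > 0)($3 > 0)($4 > 0); n[pattern]++}
            END {for (p in n) print p "\t" n[p]}' toy_counts.tsv | sort > toy_venn.tsv
cat toy_venn.tsv   # 001 (Lva only), 110 (Mja+Pmo), 111 (all three): 1 orthogroup each
```

The "111" count corresponds to the core orthogroups shared by all three penaeid species in the Venn diagram.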