# Genome and transcriptome assemblies of the kuruma shrimp, *Marsupenaeus japonicus*: bioinformatic methods

This file describes the custom scripts used in "Genome and transcriptome assemblies of the kuruma shrimp, *Marsupenaeus japonicus*". Note that the scripts presented in this file are "clean copies" of the original scripts; we stripped most of the snippets for QC or exploratory purposes, streamlined the convoluted directory structure, and renamed files/variables whose names were too lengthy or inconsistent. We also removed most of the SLURM job parameters that were hard-coded in the original scripts.

- [ONT library preparation and sequencing](#ont-library-preparation-and-sequencing)
  - [Base-calling Nanopore reads (Guppy)](#base-calling-nanopore-reads-guppy)
  - [Short read removal (SeqKit)](#short-read-removal-seqkit)
- [PacBio Iso-Seq sequencing and analysis (IsoSeq3)](#pacbio-iso-seq-sequencing-and-analysis-isoseq3)
  - [CCS](#ccs)
  - [LIMA](#lima)
  - [IsoSeq3 refine](#isoseq3-refine)
  - [dataset create --type TranscriptSet](#dataset-create---type-transcriptset)
  - [dataset create --type SubreadSet](#dataset-create---type-subreadset)
  - [IsoSeq3 Cluster](#isoseq3-cluster)
  - [IsoSeq3 Polish](#isoseq3-polish)
- [Genome size estimation and ploidy analysis](#genome-size-estimation-and-ploidy-analysis)
  - [Trimming paired-end reads (Fastp)](#trimming-paired-end-reads-fastp)
  - [K-mer count (KMC)](#k-mer-count-kmc)
  - [Genome size estimation (GenomeScope2)](#genome-size-estimation-genomescope2)
  - [Ploidy analysis (SmudgePlot)](#ploidy-analysis-smudgeplot)
  - [Error correction of Illumina paired-end reads from muscle library](#error-correction-of-illumina-paired-end-reads-from-muscle-library)
    - [Error correction (Tadpole)](#error-correction-tadpole)
    - [K-mer count (KMC)](#k-mer-count-kmc-1)
    - [K-mer distribution plot (ggplot2)](#k-mer-distribution-plot-ggplot2)
- [De novo assembly of Illumina genomic DNA libraries](#ide-novoi-assembly-of-illumina-genomic-dna-libraries)
  - [Preprocessing Illumina paired-end reads from sperm library (Fastp)](#preprocessing-illumina-paired-end-reads-from-sperm-library-fastp)
  - [Preprocessing Illumina mate-pair reads](#preprocessing-illumina-mate-pair-reads)
    - [Trimmomatic](#trimmomatic)
    - [NextClip](#nextclip)
  - [De novo assembly (SPAdes)](#ide-novoi-assembly-spades)
  - [Redundancy removal and scaffolding (Redundans)](#redundancy-removal-and-scaffolding-redundans)
    - [Preprocessing Illumina paired-end reads from muscle library (Fastp)](#preprocessing-illumina-paired-end-reads-from-muscle-library-fastp)
    - [Run Redundans pipeline](#run-redundans-pipeline)
  - [Misassembly detection and assembly break (REAPR)](#misassembly-detection-and-assembly-break-reapr)
    - [Reverse-complement mate-pair reads (SeqKit)](#reverse-complement-mate-pair-reads-seqkit)
    - [Run REAPR pipeline](#run-reapr-pipeline)
- [Transcriptome shotgun assembly](#transcriptome-shotgun-assembly)
  - [Preprocessing Illumina paired-end reads (Fastp)](#preprocessing-illumina-paired-end-reads-fastp)
  - [Mapping to the genome (HISAT2)](#mapping-to-the-genome-hisat2)
  - [De novo assembly](#ide-novoi-assembly)
    - [Trinity](#trinity)
    - [rnaSPAdes](#rnaspades)
    - [Trans-ABySS](#trans-abyss)
  - [Redundancy removal (TransDecoder, CD-HIT-EST)](#redundancy-removal-transdecoder-cd-hit-est)
  - [BUSCO (transcriptome, arthropoda_odb10)](#busco-transcriptome-arthropoda_odb10)
- [Scaffolding of genome assembly using transcriptome data](#scaffolding-of-genome-assembly-using-transcriptome-data)
  - [Scaffolding using cDNA alignments (L_RNA_Scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder)
  - [Scaffolding using paired-end RNA-seq reads (P_RNA_scaffolder)](#scaffolding-using-paired-end-rna-seq-reads-p_rna_scaffolder)
  - [Gap filling (Sealer)](#gap-filling-sealer)
  - [Polishing (NtEdit)](#polishing-ntedit)
  - [Removing short (<2kb) scaffolds](#removing-short-2kb-scaffolds)
- [Assembly improvement using ONT long reads, Illumina mate-pair reads, and transcriptome data](#assembly-improvement-using-ont-long-reads-illumina-mate-pair-reads-and-transcriptome-data)
  - [Mate-pair reads processing (NxTrim, Fastp, BBduk)](#mate-pair-reads-processing-nxtrim-fastp-bbduk)
  - [Gap filling, assembly break, and polishing (TGS-GapCloser, NtEdit)](#gap-filling-assembly-break-and-polishing-tgs-gapcloser-ntedit)
  - [Scaffolding (1)](#scaffolding-1)
    - [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf)
    - [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst)
    - [Scaffolding using cDNA alignments (L_RNA_scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder-1)
  - [Misassembly detection and assembly break (REAPR)](#misassembly-detection-and-assembly-break-reapr-1)
  - [Haplotig removal (1) (purge_haplotigs)](#haplotig-removal-1-purge_haplotigs)
  - [Scaffolding (2)](#scaffolding-2)
    - [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf-1)
    - [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst-1)
    - [Scaffolding using cDNA alignments (L_RNA_scaffolder)](#scaffolding-using-cdna-alignments-l_rna_scaffolder-2)
  - [Haplotig removal (2) (purge_haplotigs)](#haplotig-removal-2-purge_haplotigs)
  - [Scaffolding (3)](#scaffolding-3)
    - [Scaffolding using ONT long reads (LRScaf)](#scaffolding-using-ont-long-reads-lrscaf-2)
    - [Scaffolding using Illumina mate-pair reads (BESST)](#scaffolding-using-illumina-mate-pair-reads-besst-2)
  - [Assembly polishing (HyPo)](#assembly-polishing-hypo)
- [Assembly curation](#assembly-curation)
  - [Mitogenome contamination removal](#mitogenome-contamination-removal)
  - [Filter out sequences with abnormal coverage](#filter-out-sequences-with-abnormal-coverage)
  - [Screening bacterial contamination (Barrnap)](#screening-bacterial-contamination-barrnap)
  - [Mapping Illumina paired-end reads onto the final assembly](#mapping-illumina-paired-end-reads-onto-the-final-assembly)
  - [Mitogenome assembly (Flye, --meta)](#mitogenome-assembly-flye---meta)
- [Repeat identification and masking](#repeat-identification-and-masking)
  - [Repeat library construction (RepeatScout)](#repeat-library-construction-repeatscout)
  - [Taxonomic assignment (RepeatClassifier)](#taxonomic-assingment-repeatclassifier)
  - [Repeat masking (RepeatMasker)](#repeat-masking-repeatmasker)
- [Gene prediction and functional annotation](#gene-prediction-and-functional-annotation)
  - [Prediction of protein-coding genes](#prediction-of-protein-coding-genes)
    - [cDNA alignments](#cdna-alignments)
      - [Transcriptome shotgun assembly](#transcriptome-shotgun-assembly-1)
        - [preprocessing paired-end reads (Fastp)](#preprocessing-paired-end-reads-fastp)
        - [De novo assembly (Trinity)](#de-novo-assembly-trinity)
      - [Running PASA pipeline](#running-pasa-pipeline)
      - [Genome-guided transcriptome assembly](#genome-guided-transcriptome-assembly)
        - [RNA-seq mapping (HISAT2)](#rna-seq-mapping-hisat2)
        - [Transcript reconstruction (StringTie)](#transcript-reconstruction-stringtie)
    - [*Ab initio* gene prediction (BRAKER2)](#ab-initio-gene-prediction-braker2)
    - [Protein-to-genome alignment (BLAST and Exonerate)](#protein-to-genome-alignment-blast-and-exonerate)
    - [Merging gene models (EVidenceModeler)](#merging-gene-models-evidencemodeler)
    - [Homology search (MMSeqs2)](#homology-search-mmseqs2)
    - [Manual curation](#manual-curation)
    - [UTR recovery (PASA updates)](#utr-recovery-pasa-updates)
    - [Incorporation of IsoSeq and shotgun transcriptome assemblies (SQANTI3, TAMA, TransDecoder)](#incorporation-of-isoseq-and-shotgun-and-transcriptome-assemblies-sqanti3-tama-transdecoder)
  - [Non-coding RNA prediction](#non-coding-rna-prediction)
    - [tRNAScan-SE 2.0 (tRNA)](#trnascan-se-20-trna)
    - [Prediction of other non-coding RNA genes (Infernal)](#prediction-of-other-non-coding-rna-genes-infernal)
  - [Merge predicted genes and assign locus IDs](#merge-predicted-genes-and-assign-locus-ids)
  - [Functional annotation of protein-coding genes (AHRD, KAAS)](#functional-annotation-of-protein-coding-genes-ahrd-kaas)
    - [Extracting non-redundant protein sequences](#extracting-non-redundant-protein-sequences)
    - [Functional annotation (AHRD)](#functional-annotation-ahrd)
    - [Append functional annotation to predicted protein-coding gene models](#append-functional-annotation-to-predicted-protein-coding-gene-models)
  - [Structural annotation of IsoSeq cDNA sequences (SQANTI3)](#structural-annotation-of-isoseq-cdna-sequences-sqanti3)
- [Comparative genomics](#comparative-genomics)
  - [BUSCO (genome, arthropoda_odb10)](#busco-genome-arthropoda_odb10)
    - [Running BUSCO pipeline](#running-busco-pipeline)
    - [Generate summary script (generate_plot.py)](#generate-summary-script-generate_plotpy)
    - [Plot (R script)](#plot-r-script)
  - [Comparison of simple repeat contents in arthropod genomes (RepeatMasker)](#comparison-of-simple-repeat-contents-in-arthropod-genomes-repeatmasker)
  - [Comparison of protein-coding gene metrics in three penaeid shrimp genomes](#comparison-of-protein-coding-gene-metrics-in-three-penaeid-shrimp-genomes)
    - [Get metrics](#get-metrics)
    - [Visualization (ggplot2)](#visualization-ggplot2)
  - [Orthologous gene clustering analysis (OrthoFinder)](#orthologous-gene-clustering-analysis-orthofinder)
    - [Running OrthoFinder](#running-orthofinder)
    - [Visualization (VennDiagram)](#visualization-venndiagram)

## ONT library preparation and sequencing

### Base-calling Nanopore reads (Guppy)

```sh
#!/bin/bash
# >>>>>>
# GuppyBasecaller.sh
# >>>>>>

# Define variables
IN=fast5
OUT=guppy_out

# Run guppy_basecaller
guppy_basecaller -x auto --qscore_filtering --calib_detect --flowcell FLO-MIN106 --kit SQK-LSK109 -i ${IN} -s ${OUT}
```

### Short read removal (SeqKit)

```sh
#!/bin/bash
# >>>>>>
# RemoveShortNanoporeReads.sh
# >>>>>>

cat ONT.fastq|seqkit seq -m 1000 >ONT.up1k.fastq # For gap filling
cat ONT.fastq|seqkit seq -m 5000 >ONT.up5k.fastq # For mtDNA assembly
```
## PacBio Iso-Seq sequencing and analysis (IsoSeq3)

### CCS

```sh
#!/bin/bash -l
#SBATCH --job-name=ccs
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/ccs.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

ccs hp_m_up5k.subreads.bam hp_m_up5k.ccs.bam --noPolish --minPasses 1 --minLength=3000 --maxLength=100000 --reportFile=hp_m_up5k.ccs_report.txt
ccs hp_m_3to5k.subreads.bam hp_m_3to5k.ccs.bam --noPolish --minPasses 1 --minLength=2000 --maxLength=6000 --reportFile=hp_m_3to5k.ccs_report.txt
ccs hp_m_2to3k.subreads.bam hp_m_2to3k.ccs.bam --noPolish --minPasses 1 --minLength=1500 --maxLength=4000 --reportFile=hp_m_2to3k.ccs_report.txt
ccs hp_m_to2k.subreads.bam hp_m_to2k.ccs.bam --noPolish --minPasses 1 --minLength=100 --maxLength=3000 --reportFile=hp_m_to2k.ccs_report.txt
ccs hc_m_up5k.subreads.bam hc_m_up5k.ccs.bam --noPolish --minPasses 1 --minLength=3000 --maxLength=100000 --reportFile=hc_m_up5k.ccs_report.txt
ccs hc_m_3to5k.subreads.bam hc_m_3to5k.ccs.bam --noPolish --minPasses 1 --minLength=2000 --maxLength=6000 --reportFile=hc_m_3to5k.ccs_report.txt
ccs hc_m_2to3k.subreads.bam hc_m_2to3k.ccs.bam --noPolish --minPasses 1 --minLength=1500 --maxLength=4000 --reportFile=hc_m_2to3k.ccs_report.txt
ccs hc_m_to2k.subreads.bam hc_m_to2k.ccs.bam --noPolish --minPasses 1 --minLength=100 --maxLength=3000 --reportFile=hc_m_to2k.ccs_report.txt
ccs lv_42_up5k.subreads.bam lv_42_up5k.ccs.bam --noPolish --minPasses 1 --minLength=3000 --maxLength=100000 --reportFile=lv_42_up5k.ccs_report.txt
ccs lv_42_3to5k.subreads.bam lv_42_3to5k.ccs.bam --noPolish --minPasses 1 --minLength=2000 --maxLength=6000 --reportFile=lv_42_3to5k.ccs_report.txt
ccs lv_42_2to3k.subreads.bam lv_42_2to3k.ccs.bam --noPolish --minPasses 1 --minLength=1500 --maxLength=4000 --reportFile=lv_42_2to3k.ccs_report.txt
ccs lv_42_to2k.subreads.bam lv_42_to2k.ccs.bam --noPolish --minPasses 1 --minLength=100 --maxLength=3000 --reportFile=lv_42_to2k.ccs_report.txt
ccs pl_12_up5k.subreads.bam pl_12_up5k.ccs.bam --noPolish --minPasses 1 --minLength=3000 --maxLength=100000 --reportFile=pl_12_up5k.ccs_report.txt
ccs pl_12_3to5k.subreads.bam pl_12_3to5k.ccs.bam --noPolish --minPasses 1 --minLength=2000 --maxLength=6000 --reportFile=pl_12_3to5k.ccs_report.txt
ccs pl_12_2to3k.subreads.bam pl_12_2to3k.ccs.bam --noPolish --minPasses 1 --minLength=1500 --maxLength=4000 --reportFile=pl_12_2to3k.ccs_report.txt
ccs pl_12_to2k.subreads.bam pl_12_to2k.ccs.bam --noPolish --minPasses 1 --minLength=100 --maxLength=3000 --reportFile=pl_12_to2k.ccs_report.txt

conda deactivate
```
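The sixteen `ccs` calls above differ only in the library prefix (`hp_m`, `hc_m`, `lv_42`, `pl_12`) and size fraction; the same pattern holds for the `lima` and `isoseq3 refine` calls below. An equivalent loop form (a sketch, not the original script) would be:

```sh
# Sketch: loop form of the sixteen ccs commands above
for SAMPLE in hp_m hc_m lv_42 pl_12; do
  for FRACTION in up5k 3to5k 2to3k to2k; do
    # Size-fraction-specific length cutoffs, matching the values used above
    case ${FRACTION} in
      up5k)  MINLEN=3000; MAXLEN=100000 ;;
      3to5k) MINLEN=2000; MAXLEN=6000 ;;
      2to3k) MINLEN=1500; MAXLEN=4000 ;;
      to2k)  MINLEN=100;  MAXLEN=3000 ;;
    esac
    ccs ${SAMPLE}_${FRACTION}.subreads.bam ${SAMPLE}_${FRACTION}.ccs.bam \
      --noPolish --minPasses 1 --minLength=${MINLEN} --maxLength=${MAXLEN} \
      --reportFile=${SAMPLE}_${FRACTION}.ccs_report.txt
  done
done
```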
### LIMA

```sh
#!/bin/bash -l
#SBATCH --job-name=lima
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/lima.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

lima hp_m_up5k.ccs.bam primer.fasta hp_m_up5k.fl.bam --isoseq --dump-clips
lima hp_m_3to5k.ccs.bam primer.fasta hp_m_3to5k.fl.bam --isoseq --dump-clips
lima hp_m_2to3k.ccs.bam primer.fasta hp_m_2to3k.fl.bam --isoseq --dump-clips
lima hp_m_to2k.ccs.bam primer.fasta hp_m_to2k.fl.bam --isoseq --dump-clips
lima hc_m_up5k.ccs.bam primer.fasta hc_m_up5k.fl.bam --isoseq --dump-clips
lima hc_m_3to5k.ccs.bam primer.fasta hc_m_3to5k.fl.bam --isoseq --dump-clips
lima hc_m_2to3k.ccs.bam primer.fasta hc_m_2to3k.fl.bam --isoseq --dump-clips
lima hc_m_to2k.ccs.bam primer.fasta hc_m_to2k.fl.bam --isoseq --dump-clips
lima lv_42_up5k.ccs.bam primer.fasta lv_42_up5k.fl.bam --isoseq --dump-clips
lima lv_42_3to5k.ccs.bam primer.fasta lv_42_3to5k.fl.bam --isoseq --dump-clips
lima lv_42_2to3k.ccs.bam primer.fasta lv_42_2to3k.fl.bam --isoseq --dump-clips
lima lv_42_to2k.ccs.bam primer.fasta lv_42_to2k.fl.bam --isoseq --dump-clips
lima pl_12_up5k.ccs.bam primer.fasta pl_12_up5k.fl.bam --isoseq --dump-clips
lima pl_12_3to5k.ccs.bam primer.fasta pl_12_3to5k.fl.bam --isoseq --dump-clips
lima pl_12_2to3k.ccs.bam primer.fasta pl_12_2to3k.fl.bam --isoseq --dump-clips
lima pl_12_to2k.ccs.bam primer.fasta pl_12_to2k.fl.bam --isoseq --dump-clips

conda deactivate
```

### IsoSeq3 refine

```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_refine
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_refine.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

isoseq3 refine hp_m_up5k.fl.primer_5p--primer_3p.bam primer.fasta hp_m_up5k.flnc.bam
isoseq3 refine hp_m_3to5k.fl.primer_5p--primer_3p.bam primer.fasta hp_m_3to5k.flnc.bam
isoseq3 refine hp_m_2to3k.fl.primer_5p--primer_3p.bam primer.fasta hp_m_2to3k.flnc.bam
isoseq3 refine hp_m_to2k.fl.primer_5p--primer_3p.bam primer.fasta hp_m_to2k.flnc.bam
isoseq3 refine hc_m_up5k.fl.primer_5p--primer_3p.bam primer.fasta hc_m_up5k.flnc.bam
isoseq3 refine hc_m_3to5k.fl.primer_5p--primer_3p.bam primer.fasta hc_m_3to5k.flnc.bam
isoseq3 refine hc_m_2to3k.fl.primer_5p--primer_3p.bam primer.fasta hc_m_2to3k.flnc.bam
isoseq3 refine hc_m_to2k.fl.primer_5p--primer_3p.bam primer.fasta hc_m_to2k.flnc.bam
isoseq3 refine lv_42_up5k.fl.primer_5p--primer_3p.bam primer.fasta lv_42_up5k.flnc.bam
isoseq3 refine lv_42_3to5k.fl.primer_5p--primer_3p.bam primer.fasta lv_42_3to5k.flnc.bam
isoseq3 refine lv_42_2to3k.fl.primer_5p--primer_3p.bam primer.fasta lv_42_2to3k.flnc.bam
isoseq3 refine lv_42_to2k.fl.primer_5p--primer_3p.bam primer.fasta lv_42_to2k.flnc.bam
isoseq3 refine pl_12_up5k.fl.primer_5p--primer_3p.bam primer.fasta pl_12_up5k.flnc.bam
isoseq3 refine pl_12_3to5k.fl.primer_5p--primer_3p.bam primer.fasta pl_12_3to5k.flnc.bam
isoseq3 refine pl_12_2to3k.fl.primer_5p--primer_3p.bam primer.fasta pl_12_2to3k.flnc.bam
isoseq3 refine pl_12_to2k.fl.primer_5p--primer_3p.bam primer.fasta pl_12_to2k.flnc.bam

conda deactivate
```

primer.fasta

```
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>primer_3p
GTACTCTGCGTTGATACCACTGCTT
```
### dataset create --type TranscriptSet

```sh
#!/bin/bash -l
#SBATCH --job-name=dataset_create_transcriptset
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/dataset_create_transcriptset.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

dataset create --type TranscriptSet merged.flnc.xml \
hp_m_up5k.flnc.bam hp_m_3to5k.flnc.bam hp_m_2to3k.flnc.bam hp_m_to2k.flnc.bam \
hc_m_up5k.flnc.bam hc_m_3to5k.flnc.bam hc_m_2to3k.flnc.bam hc_m_to2k.flnc.bam \
lv_42_up5k.flnc.bam lv_42_3to5k.flnc.bam lv_42_2to3k.flnc.bam lv_42_to2k.flnc.bam \
pl_12_up5k.flnc.bam pl_12_3to5k.flnc.bam pl_12_2to3k.flnc.bam pl_12_to2k.flnc.bam

conda deactivate
```

### dataset create --type SubreadSet

```sh
#!/bin/bash -l
#SBATCH --job-name=dataset_create_subreadset
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --exclude=sango405[01-14]

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/dataset_create_subreadset.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

dataset create --type SubreadSet merged.subreadset.xml \
hp_m_up5k.subreads.bam hp_m_3to5k.subreads.bam hp_m_2to3k.subreads.bam hp_m_to2k.subreads.bam \
hc_m_up5k.subreads.bam hc_m_3to5k.subreads.bam hc_m_2to3k.subreads.bam hc_m_to2k.subreads.bam \
lv_42_up5k.subreads.bam lv_42_3to5k.subreads.bam lv_42_2to3k.subreads.bam lv_42_to2k.subreads.bam \
pl_12_up5k.subreads.bam pl_12_3to5k.subreads.bam pl_12_2to3k.subreads.bam pl_12_to2k.subreads.bam

conda deactivate
```

### IsoSeq3 Cluster

```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_cluster
#SBATCH --partition=compute
#SBATCH --mem=120G
#SBATCH --cpus-per-task=12
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
num=$SLURM_ARRAY_TASK_ID

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_cluster.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

isoseq3 cluster merged.flnc.xml unpolished.bam --verbose --split-bam 24

conda deactivate
```

### IsoSeq3 Polish

```sh
#!/bin/bash -l
#SBATCH --job-name=isoseq3_polish
#SBATCH --partition=compute
#SBATCH --mem=30G
#SBATCH --cpus-per-task=1
#SBATCH --time=6-23:59:59
#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL
#SBATCH --workdir=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/
#SBATCH --output=/work/SatohU/kawato/slurm_out/%j_out
#SBATCH --error=/work/SatohU/kawato/slurm_out/%j_error
#SBATCH --array=1-23 #1-23

bin_dir=/work/SatohU/kawato/apps/
num_threads=$SLURM_CPUS_PER_TASK
num=$SLURM_ARRAY_TASK_ID

cp /work/SatohU/kawato/study/kuruma_transcriptome2017/scripts/isoseq3_polish.sh /work/SatohU/kawato/slurm_in/${SLURM_JOB_ID}_in

conda activate isoseq3

isoseq3 polish unpolished.${num}.bam merged.subreadset.xml polished.${num}.bam

conda deactivate
```
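`isoseq3 polish` writes per-chunk high- and low-quality transcript sets (`polished.<num>.hq.fasta.gz` / `.lq.fasta.gz`) alongside each output BAM. Later steps reference a single `polished.hq.fasta`; the merge is not shown in the original scripts, but a sketch consistent with those file names would be:

```sh
# Assumed merge of the per-chunk HQ outputs (not part of the original clean copies)
cat polished.*.hq.fasta.gz | gunzip -c > polished.hq.fasta
```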
## Genome size estimation and ploidy analysis

### Trimming paired-end reads (Fastp)

```sh
#!/bin/bash
# >>>>>>
# TrimPE_GenomeSizeEstimate.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}

# Quality trim with Fastp
for lib in sperm muscle;do
fastp \
-i Mj01${lib}DNA_R1.fastq.gz -I Mj01${lib}DNA_R2.fastq.gz \
-o Mj01${lib}DNA_R1_trim.fq.gz -O Mj01${lib}DNA_R2_trim.fq.gz \
--n_base_limit 0 --length_required 140 --thread 8 \
--html Mj01${lib}DNA_fastp.html
done
```

### K-mer count (KMC)

```sh
#!/bin/bash
# >>>>>>
# KMC_all.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_all_GenomeScope2
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}

# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
TMP_DIR=tmp
RDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq
IN=reads_list
mkdir -p ${TMP_DIR}

# Create input file list
echo -e "${RDIR}/Mj01muscleDNA_R1_trim.fq.gz\n${RDIR}/Mj01muscleDNA_R2_trim.fq.gz\n${RDIR}/Mj01spermDNA_R1_trim.fq.gz\n${RDIR}/Mj01spermDNA_R2_trim.fq.gz" > reads_list

# Run KMC; iterate over 17- to 35-mers with 2-mer intervals
for MER in 17 19 21 23 25 27 29 31 33 35;do
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX} # output name must be set inside the loop, after ${MER}
/work/kawato/apps/kmc/3.1.1/kmc -k${MER} -t${NUM_THREADS} -m${RAM_G} -ci${COV_MIN} -cs${COV_MAX} @${IN} ${OUT} ./${TMP_DIR}
/work/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} histogram ${OUT}_k${MER}.hist -cx1000000
done
```

### Genome size estimation (GenomeScope2)

```sh
#!/bin/bash
# >>>>>>
# GenomeScope2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_all_GenomeScope2
cd ${WHOME}${WDIR}

# Define variables
COV_MIN=1
COV_MAX=1000000
IN=reads_list

# Run GenomeScope2; iterate over 17- to 35-mers with 2-mer intervals
conda activate genomescope2
for MER in 17 19 21 23 25 27 29 31 33 35;do
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
genomescope2 -i ${OUT}_k${MER}.hist -o ${OUT}_k${MER}.genomescope2 -p 2 -k ${MER} -m -1
done
```

### Ploidy analysis (SmudgePlot)

```sh
#!/bin/bash
# >>>>>>
# SmudgePlot.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=SmudgePlot
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}

# Define variables
COV_MIN=1
COV_MAX=1000000
IN=reads_list
MER=23
OUT=${IN}_kmc_k${MER}_${COV_MIN}-${COV_MAX}

# Make symlinks to KMC results (23-mer)
ln -s ../KMC_all_GenomeScope2/reads_list reads_list
ln -s ../KMC_all_GenomeScope2/${OUT}_k${MER}.hist ${OUT}_k${MER}.hist
ln -s ../KMC_all_GenomeScope2/${OUT}.kmc_suf ${OUT}.kmc_suf
ln -s ../KMC_all_GenomeScope2/${OUT}.kmc_pre ${OUT}.kmc_pre

conda activate smudgeplot

# Set coverage cutoffs
L=$(smudgeplot.py cutoff ${OUT}_k${MER}.hist L) # 51
U=$(smudgeplot.py cutoff ${OUT}_k${MER}.hist U) # 1300

# Transform 23-mer table
/work/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} -ci"$L" -cx"$U" dump -s kmc_k${MER}_"$L"_"$U".dump

# Run SmudgePlot
smudgeplot.py hetkmers -o kmc_k${MER}_"$L"_"$U" < kmc_k${MER}_"$L"_"$U".dump
smudgeplot.py plot kmc_k${MER}_"$L"_"$U"_coverages.tsv -k ${MER}
```

### Error correction of Illumina paired-end reads from muscle library

#### Error correction (Tadpole)

```sh
#!/bin/bash
# >>>>>>
# Tadpole_muscle.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}

# Tadpole
for lib in muscle sperm;do
tadpole.sh \
in=Mj01${lib}DNA_R1_trim.fq.gz \
in2=Mj01${lib}DNA_R2_trim.fq.gz \
out=Mj01${lib}DNA_R1_trim.tadpole.fq.gz \
out2=Mj01${lib}DNA_R2_trim.tadpole.fq.gz \
mode=correct k=50
done
```
#### K-mer count (KMC)

```sh
#!/bin/bash
# >>>>>>
# KMC_muscle.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=KMC_muscle
cd ${WHOME}
mkdir ${WDIR}; cd ${WDIR}

# Define variables
NUM_THREADS=50
RAM_G=1200
COV_MIN=1
COV_MAX=1000000
MER=23
RDIR=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq
TMP_DIR=tmp
mkdir -p ${TMP_DIR}

# Create input file lists
echo -e "${RDIR}/Mj01muscleDNA_R1_trim.fq.gz\n${RDIR}/Mj01muscleDNA_R2_trim.fq.gz" > Mj01muscleDNA_trim.list
echo -e "${RDIR}/Mj01muscleDNA_R1_trim.tadpole.fq.gz\n${RDIR}/Mj01muscleDNA_R2_trim.tadpole.fq.gz" > Mj01muscleDNA_trim.tadpole.list

# Run KMC
for IN_FILE in Mj01muscleDNA_trim.list Mj01muscleDNA_trim.tadpole.list ;do
OUT=${IN_FILE}_kmc_k${MER}_${COV_MIN}-${COV_MAX}
/home/admin/kawato/apps/kmc/3.1.1/kmc -k${MER} -t${NUM_THREADS} -m${RAM_G} -ci${COV_MIN} -cs${COV_MAX} @${IN_FILE} ${OUT} ./${TMP_DIR}
/home/admin/kawato/apps/kmc/3.1.1/kmc_tools transform ${OUT} histogram ${OUT}_k${MER}.hist -cx1000000
sed -e 's/\t/ /g' ${OUT}_k${MER}.hist > ${OUT}_k${MER}.hist.mod # convert tabs to spaces
done
```

#### K-mer distribution plot (ggplot2)

```R
##################################
#
# plot_kmer_distribution.R
#
##################################
setwd("C:/Users/kawato/study/kuruma_genome2017/analysis/kmer_distribution/")
library(tidyverse)
library(ggplot2)
library(patchwork)

cbp1 <- c("#999999", "#E69F00")

df_raw_before <- read_tsv("Mj01muscleDNA_trim.list_kmc_k23_1-1000000_k23.hist", col_names=c("Coverage", "Occurrence"))
df_raw_after <- read_tsv("Mj01muscleDNA_trim.tadpole.list_kmc_k23_1-1000000_k23.hist", col_names=c("Coverage", "Occurrence"))

df.Fre.before <- data.frame(Lib="Before", Coverage=df_raw_before$Coverage, Fre=df_raw_before$Occurrence)
df.Fre.after <- data.frame(Lib="After", Coverage=df_raw_after$Coverage, Fre=df_raw_after$Occurrence)
df.Cov_Fre.before <- data.frame(Lib="Before", Coverage=df_raw_before$Coverage, Cov_Fre=df_raw_before$Coverage * df_raw_before$Occurrence)
df.Cov_Fre.after <- data.frame(Lib="After", Coverage=df_raw_after$Coverage, Cov_Fre=df_raw_after$Coverage * df_raw_after$Occurrence)

df.Fre_merge <- dplyr::bind_rows(df.Fre.before, df.Fre.after)
df.Fre_merge$Lib <- as.factor(df.Fre_merge$Lib)
df.Fre_merge$Lib <- factor(df.Fre_merge$Lib, levels = c("Before", "After"))
df.Cov_Fre_merge <- dplyr::bind_rows(df.Cov_Fre.before, df.Cov_Fre.after)
df.Cov_Fre_merge$Lib <- as.factor(df.Cov_Fre_merge$Lib)
df.Cov_Fre_merge$Lib <- factor(df.Cov_Fre_merge$Lib, levels = c("Before", "After"))
levels(df.Fre_merge$Lib) <- c("Before", "After")
levels(df.Cov_Fre_merge$Lib) <- c("Before", "After")

p1 <- ggplot(df.Fre_merge) +
  geom_line(aes(x=Coverage,y=Fre, colour=Lib),size=2) +
  scale_x_continuous(expand = c(0,0), limits = c(0,220)) +
  scale_y_continuous(expand = c(0,0), limits = c(0,15500000)) +
  theme_classic() +
  scale_colour_manual(values=cbp1) +
  labs(x = "Coverage", y = "Frequency", colour = "Lib") +
  theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18), legend.text = element_text(size=18), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1), legend.title=element_blank())

p2 <- ggplot(df.Cov_Fre_merge) +
  geom_line(aes(x=Coverage,y=Cov_Fre, colour=Lib),size=2) +
  scale_x_continuous(expand = c(0,0), limits = c(0,220)) +
  scale_y_continuous(expand = c(0,0), limits = c(0,1000000000)) +
  theme_classic() +
  scale_colour_manual(values=cbp1) +
  labs(x = "Coverage", y = "Coverage*Frequency", colour = "Lib") +
  theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18), legend.text = element_text(size=18), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1), legend.title=element_blank())

p_merge <- p1 + p2 + plot_annotation(tag_levels = 'A') & theme(plot.tag = element_text(size = 32,face="bold"))
ggsave(file = "p_merge.svg", plot = p_merge, width = 12, height = 6)
ggsave(file = "p_merge.png", plot = p_merge, dpi = 300, width = 12, height = 6)
```
## De novo assembly of Illumina genomic DNA libraries

### Preprocessing Illumina paired-end reads from sperm library (Fastp)

```sh
#!/bin/bash
# >>>>>>
# TrimPE_DeNovoAssembly.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}

# Run Fastp on sperm genomic PE reads
fastp -i Mj01spermDNA_R1.fastq -I Mj01spermDNA_R2.fastq \
-o Mj01spermDNA_R1_fastp01.fastq -O Mj01spermDNA_R2_fastp01.fastq \
--length_required=140 --qualified_quality_phred=20 \
--unqualified_percent_limit=10 --n_base_limit=0 --low_complexity_filter \
--thread=8 --html=Mj01sperm_fastp01.html
```

### Preprocessing Illumina mate-pair reads

#### Trimmomatic

```sh
#!/bin/bash
# >>>>>>
# MP_Trimmomatic.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}

# Define variables
sq=20
minlen=80
NUMTH=4

# Create symlinks
ln -s /work/SatohU/kawato/apps/Trimmomatic/0.36/adapters/TruSeq3-PE-2.fa .

# Trimmomatic
for i in 02k 03k 04k 05k 06k 07k 08k 09k 10k 11k 12k 13k 14k 15k; do
a1=${i}_mp.raw.R1.fq
a2=${i}_mp.raw.R2.fq
java -jar /work/SatohU/kawato/apps/Trimmomatic/0.36/trimmomatic-0.36.jar \
-threads ${NUMTH} -phred33 \
$a1 $a2 \
${a1}.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
${a1}.strict_sldwin4-${sq}_minlen${minlen}_unpair \
${a2}.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
${a2}.strict_sldwin4-${sq}_minlen${minlen}_unpair \
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:${sq} MINLEN:${minlen}
done
```

#### NextClip

```sh
#!/bin/bash
# >>>>>>
# MP_NextClip.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}

# Define variables
sq=20
minlen=80
NUMTH=4

for i in 02k 03k 04k 05k 06k 07k 08k 09k 10k 11k 12k 14k 15k; do
# Run NextClip on the Trimmomatic output
/work/SatohU/kawato/apps/nextclip-master/bin/nextclip \
--input_one ${i}_mp.raw.R1.fq.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
--input_two ${i}_mp.raw.R2.fq.strict_sldwin4-${sq}_minlen${minlen}_trimmed \
--adaptor_sequence CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG \
-o ${i}_adaptor_mod_sldwin4-20_minlen80_trimmed \
-n 40000000 --trim_ends 0 --min_length 36 --remove_duplicates
# Concatenate categories A, B, and C
cat \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_A_R1.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_B_R1.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_C_R1.fastq \
> ${i}.nextclip.A-C_cat.R1.fq
cat \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_A_R2.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_B_R2.fastq \
${i}_adaptor_mod_sldwin4-20_minlen80_trimmed_C_R2.fastq \
> ${i}.nextclip.A-C_cat.R2.fq
done
```
### De novo assembly (SPAdes)

```sh
#!/bin/bash
# >>>>>>
# SPAdes.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/analysis/
cd ${WDIR}

# Set VARs
NUMTH=24
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq

conda activate spades

# Run SPAdes
/work/SatohU/kawato/apps/spades/3.13.0/bin/spades.py \
-o SPAdes_out \
-1 ${R_DIR}/Mj01spermDNA_R1_fastp01.fastq \
-2 ${R_DIR}/Mj01spermDNA_R2_fastp01.fastq \
--threads=${NUMTH} \
--only-assembler \
-k 21,33,55,77,89,101 \
--memory=500
```

### Redundancy removal and scaffolding (Redundans)

#### Preprocessing Illumina paired-end reads from muscle library (Fastp)

```sh
#!/bin/bash
# >>>>>>
# Mj01muscleDNA_fastp_minq5.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/pe_hiseq/
cd ${WDIR}

# Set VARs
NUMTH=8

# Run Fastp
for lib in Mj01muscleDNA; do
/work/SatohU/kawato/apps/fastp/0.19.6/fastp \
--in1=${lib}_R1.fastq --out1=${lib}_R1.fastq.fastp.minq5.fq \
--in2=${lib}_R2.fastq --out2=${lib}_R2.fastq.fastp.minq5.fq \
--detect_adapter_for_pe --qualified_quality_phred=5 \
--html=${lib}_fastp.minq5.html
done
```

#### Run Redundans pipeline

```sh
#!/bin/bash
# >>>>>>
# Redundans.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/analysis/SPAdes_out/
cd ${WDIR}

# Set VARs
NUMTH=24
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads

python /work/SatohU/kawato/apps/redundans/redundans.py -i \
${R_DIR}/pe_hiseq/Mj01muscleDNA_R1.fastq.fastp.minq5.fq ${R_DIR}/pe_hiseq/Mj01muscleDNA_R2.fastq.fastp.minq5.fq \
${R_DIR}/mp_hiseq/02k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/02k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/03k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/03k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/04k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/04k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/05k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/05k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/06k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/06k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/07k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/07k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/08k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/08k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/10k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/10k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/12k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/12k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/14k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/14k.nextclip.A-C_cat.R2.fq \
${R_DIR}/mp_hiseq/15k.nextclip.A-C_cat.R1.fq ${R_DIR}/mp_hiseq/15k.nextclip.A-C_cat.R2.fq \
-f scaffolds.fasta -o ../redundans_out -t ${NUMTH} -m 500
```

### Misassembly detection and assembly break (REAPR)

#### Reverse-complement mate-pair reads (SeqKit)

```sh
#!/bin/bash
# >>>>>>
# ReverseComplementMPReads.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}

# Convert MP reads to PE (REAPR accepts "innies" only)
for n in 04 05 06 07 08 10 12 14 15 ; do
seqkit seq -pr ${n}k.nextclip.A-C_cat.R1.fq > ${n}k.nextclip.A-C_cat.R1.rc.fq
seqkit seq -pr ${n}k.nextclip.A-C_cat.R2.fq > ${n}k.nextclip.A-C_cat.R2.rc.fq
done
```
#### Run REAPR pipeline

```sh
#!/bin/bash
# >>>>>>
# REAPR.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis/
WDIR=REAPR
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
export PATH=/work/SatohU/kawato/apps/reapr/1.0.18:/work/SatohU/kawato/apps/reapr/1.0.18/src:/work/SatohU/kawato/apps/reapr/1.0.18/third_party:$PATH
module load R/3.4.2
R_DIR=/work/SatohU/kawato/study/kuruma_genome2017/reads/mp_hiseq

# Create symlinks
ln -s ../redundans_out/scaffolds.fa scaffolds.fa

# REAPR facheck
reapr facheck scaffolds.fa scaffolds.fa.facheck

# Run REAPR for each MP lib
for n in 04 05 06 07 08 10 12 14 15 ; do
# REAPR accepts only "innies", so we use the reverse-complemented NextClip reads prepared above
reapr smaltmap scaffolds.fa.facheck.fa ${R_DIR}/${n}k.nextclip.A-C_cat.R1.rc.fq ${R_DIR}/${n}k.nextclip.A-C_cat.R2.rc.fq mapped_${n}k.bam
reapr pipeline scaffolds.fa.facheck.fa mapped_${n}k.bam reapr_${n}k_out
done
```

## Transcriptome shotgun assembly

### Preprocessing Illumina paired-end reads (Fastp)

```sh
#!/bin/bash
# >>>>>>
# TrimPE_RNAseq_1.sh
# >>>>>>

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/reads/
cd ${WDIR}

# Run Fastp
cat lib_list|while read lib; do
/work/SatohU/kawato/apps/fastp/0.19.3/fastp \
-i ${lib}_R1_001.fastq.gz \
-I ${lib}_R2_001.fastq.gz \
-o ${lib}_R1_trimmed.fastq.gz \
-O ${lib}_R2_trimmed.fastq.gz \
-h ${lib}_fastp.html \
-j ${lib}_fastp.json \
--trim_front1=12 \
--trim_tail1=2 \
--length_required=80
done
```

### Mapping to the genome (HISAT2)

```sh
#!/bin/bash
# >>>>>>
# hisat2.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=hisat2
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=16
RDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/reads
REF=reapr_06k

# Create symlinks
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/REAPR/reapr_06k_out/04.break.broken_assembly.fa ${REF}.fa

# Build index
hisat2-build -p ${NUMTH} ${REF}.fa ${REF}

cat ${RDIR}/lib_list|while read lib; do
# Map reads
hisat2 -x ${REF} -p ${NUMTH} -1 ${RDIR}/${lib}_R1_trimmed.fastq.gz -2 ${RDIR}/${lib}_R2_trimmed.fastq.gz -S ${REF}_${lib}.sam 2>${REF}_${lib}_hisat2.log
# Extract paired reads with at least one mapped mate
samtools view -f 1 -G 12 ${REF}_${lib}.sam -b -@ ${NUMTH} |samtools sort -@ ${NUMTH} -O BAM -n -o ${REF}_${lib}_mapped_qsort.bam
rm ${REF}_${lib}.sam
# Extract fastq
bedtools bamtofastq -i ${REF}_${lib}_mapped_qsort.bam -fq ${REF}_${lib}_mapped_R1.fq -fq2 ${REF}_${lib}_mapped_R2.fq
# Compress fastq
gzip -9 ${REF}_${lib}_mapped_R1.fq
gzip -9 ${REF}_${lib}_mapped_R2.fq
done
```

### De novo assembly

#### Trinity

```sh
#!/bin/bash
# >>>>>>
# Trinity.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=hisat2_mapped_trinity
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=24

# Load cmake; set TRINITY_HOME, PATH
module load cmake/3.11.4
export TRINITY_HOME=/work/SatohU/kawato/apps/trinity/2.8.6/
export PATH=/work/SatohU/kawato/apps/salmon/0.11.3/bin:/work/SatohU/kawato/apps/bowtie2/2.3.4.3/:$PATH

# Run Trinity
/work/SatohU/kawato/apps/trinity/2.8.6/Trinity \
--seqType fq \
--SS_lib_type RF \
--max_memory 500G \
--CPU ${NUMTH} --samples_file lib_list # lib_list: prepared manually
```
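`lib_list` is noted above as prepared manually. Trinity's `--samples_file` expects one tab-separated row per library: condition name, replicate name, and the R1/R2 read paths. A hypothetical two-library example using the file names produced by the HISAT2 step (`libA`/`libB` are placeholders):

```
cond_A	libA	reapr_06k_libA_mapped_R1.fq.gz	reapr_06k_libA_mapped_R2.fq.gz
cond_B	libB	reapr_06k_libB_mapped_R1.fq.gz	reapr_06k_libB_mapped_R2.fq.gz
```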
#### rnaSPAdes

```sh
#!/bin/bash
# >>>>>>
# rnaSPAdes.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=rnaspades
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=24
RDIR=../hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq
R2=${RDIR}right.norm.fq

# Run rnaSPAdes (the output directory is literally named "auto")
/work/SatohU/kawato/apps/spades/3.13.0/bin/spades.py --rna -o auto -1 ${R1} -2 ${R2} --ss-rf --threads ${NUMTH} --memory 500
```

#### Trans-ABySS

Rename fastq

```sh
#!/bin/bash
# >>>>>>
# rename_fastq.sh
# >>>>>>
# Rename FASTQ for Trans-ABySS

# Create/goto WDIR
WDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
cd ${WDIR}

# Set VARs
R1=left.norm.fq
R2=right.norm.fq

# Rename fastq (replace each header with a serial number plus /1 or /2)
cat ${R1}| awk '{print (NR%4 == 1) ? "@" ++i "/1" : $0}' | gzip -c > ${R1}.renamed.gz &
cat ${R2}| awk '{print (NR%4 == 1) ? "@" ++i "/2" : $0}' | gzip -c > ${R2}.renamed.gz
```

Trans-ABySS

```sh
#!/bin/bash
# >>>>>>
# Trans-ABySS.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=transabyss
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=24
RDIR=../hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq.renamed.gz
R2=${RDIR}right.norm.fq.renamed.gz

# Run Trans-ABySS
for mer in 31 41 51 61 71 81 91 101 111 121 131 ;do
/work/SatohU/kawato/apps/transabyss-2.0.1/transabyss \
--pe ${R1} ${R2} --SS \
--outdir transabyss_k${mer} --name transabyss_k${mer} \
--cleanup 3 --threads ${NUMTH} --kmer ${mer}
done
```

### Redundancy removal (TransDecoder, CD-HIT-EST)

```sh
#!/bin/bash
# >>>>>>
# NrTranscripts.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis
WDIR=NrTranscripts
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=24
path2transdecoder='/work/SatohU/kawato/apps/transdecoder/5.5.0'
path2cdhit='/work/SatohU/kawato/apps/cd-hit/4.8.1'
path2busco='/work/SatohU/kawato/apps/busco/scripts'
export PATH=${path2transdecoder}:${path2cdhit}:${path2busco}:$PATH

# Create symlinks
ln -s ../hisat2_mapped_trinity/trinity_out_dir/Trinity.fasta Trinity.fa
ln -s ../rnaspades/auto/transcripts.fasta rnaspades.fa
for mer in 31 41 51 61 71 81 91 101 111 121 ;do
ln -s ../transabyss/transabyss_k${mer}/transabyss_k${mer}-final.fa transabyss_k${mer}.fa
done

# Rename entries
for ASM in Trinity rnaspades transabyss_k31 transabyss_k41 transabyss_k51 transabyss_k61 transabyss_k71 transabyss_k81 transabyss_k91 transabyss_k101 transabyss_k111 transabyss_k121 ;do
less ${ASM}.fa |awk -v var="${ASM}" '/^>/{print ">" var "_" ++i; next}{print}' >${ASM}.renamed
cat ${ASM}.renamed >>tr.merged
done

# TransDecoder; keep transcripts encoding complete ORFs
TransDecoder.LongOrfs -S -t tr.merged
TransDecoder.Predict --retain_long_orfs_mode dynamic --single_best_only -T 5000 -t tr.merged
less tr.merged.transdecoder.cds |grep "complete" |sed 's/>//g' | cut -d " " -f 1 >tr.merged.transdecoder.cds.comp.list
less tr.merged.transdecoder.cds |cut -d " " -f 1 > tr.merged.transdecoder.cds.tmp
# Subset FASTA records whose IDs appear in the list file
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' tr.merged.transdecoder.cds.comp.list tr.merged.transdecoder.cds.tmp > tr.merged.transdecoder.cds.comp
rm tr.merged.transdecoder.cds.tmp

# cd-hit-est
cd-hit-est \
-i tr.merged.transdecoder.cds.comp -o tr.merged.transdecoder.cds.comp.nr \
-c 0.98 -G 0 -aS 0.25 -d 0 -T 0 -M 0
less tr.merged.transdecoder.cds.comp.nr |grep ">"|sed 's/>//g' | cut -d "." -f 1 >tr.merged.transdecoder.cds.comp.nr.list
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' tr.merged.transdecoder.cds.comp.nr.list tr.merged > tr.merged.nr # final assembly
```
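The two perl one-liners above subset a FASTA file by an ID list. An equivalent, more transparent command for the final step (a sketch, assuming SeqKit is available) would be:

```sh
# Keep only the records whose IDs appear in the CD-HIT-EST survivor list
seqkit grep -f tr.merged.transdecoder.cds.comp.nr.list tr.merged > tr.merged.nr
```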
### BUSCO (transcriptome, arthropoda_odb10)

```sh
#!/bin/bash
# >>>>>>
# BUSCO_transcriptome.sh
# >>>>>>

# Run BUSCO
for ASM in ICRK01.1.fsa_nt ICRJ01.1.fsa_nt GGLH01.1.fsa_nt GGUK01.1.fsa_nt; do
busco -i ${ASM} -l arthropoda_odb10 -c 24 -o ${ASM}.transcriptome.busco -m transcriptome -f
done
```

## Scaffolding of genome assembly using transcriptome data

### Scaffolding using cDNA alignments (L_RNA_Scaffolder)

```sh
#!/bin/bash
# >>>>>>
# L_RNA_scaffolder.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=L_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
DRAFT=reapr_15k_out

# Make symlink
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/REAPR/reapr_15k_out/04.break.broken_assembly.fa ${DRAFT}.fa

# Merge de novo assembled transcripts (tr.merged.nr) and IsoSeq cDNA (polished.hq.fasta)
cat \
/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/NrTranscripts/tr.merged.nr \
/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/rna_pacbio/bax_h5/polished.hq.fasta >tr.merged

# Align transcripts by BLAT; GNU parallel feeds FASTA chunks to blat, and each blat job
# emits its PSL to stdout through the >(cat) process substitution
cat tr.merged \
| parallel --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${DRAFT}.tr.merged.psl

# Run L_RNA_scaffolder
/work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \
-d /work/kawato/apps/L_RNA_scaffolder/ \
-i ${DRAFT}.tr.merged.psl \
-j ${DRAFT}.fa \
-o L_RNA_scaffolder_out \
-l 0.90 -p 0.98 \
-n 100
```

### Scaffolding using paired-end RNA-seq reads (P_RNA_scaffolder)

```sh
#!/bin/bash
# >>>>>>
# P_RNA_scaffolder.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=P_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
DRAFT=L_RNA_scaffolder
RDIR=/work/SatohU/kawato/study/kuruma_transcriptome2017/analysis/hisat2_mapped_trinity/trinity_out_dir/insilico_read_normalization/
R1=${RDIR}left.norm.fq.renamed.gz
R2=${RDIR}right.norm.fq.renamed.gz

# Make symlinks
ln -s /work/SatohU/kawato/study/kuruma_genome2017/analysis/L_RNA_scaffolder/L_RNA_scaffolder_out/L_RNA_scaffolder.fasta ${DRAFT}.fasta

# Build hisat2 index
hisat2-build -p 24 ${DRAFT}.fasta ${DRAFT}

# Map in silico-normalized RNA-seq reads generated by Trinity
hisat2 -p 24 -x ${DRAFT} -1 ${R2} -2 ${R1} -S ${DRAFT}_hisat2.sam 2>${DRAFT}_hisat2.log

# Run P_RNA_scaffolder.sh
/work/kawato/apps/P_RNA_scaffolder/P_RNA_scaffolder.sh \
-d /work/kawato/apps/P_RNA_scaffolder/ \
-i ${DRAFT}_hisat2.sam \
-j ${DRAFT}.fasta \
-F ${R2} -R ${R1} \
-o P_RNA_scaffolder_out \
-s yes -b yes -p 0.98 -t 24
```

### Gap filling (Sealer)

```sh
#!/bin/bash
# >>>>>>
# Sealer.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=Sealer
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
num_threads=24
b1=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R1_fastp01.fastq
b2=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R2_fastp01.fastq

# Make symlinks
ln -s ../P_RNA_scaffolder/P_RNA_scaffolder_out/P_RNA_scaffold.fasta P_RNA_scaffold.fasta

# Abyss-Sealer
abyss-sealer -b20G --threads=${num_threads} -k51 -k71 -k91 -o sealer_out -S P_RNA_scaffold.fasta ${b1} ${b2}
```

### Polishing (NtEdit)

```sh
#!/bin/bash
# >>>>>>
# NtEdit.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/SatohU/kawato/study/kuruma_genome2017/analysis
WDIR=NtEdit
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
num_threads=24
b1=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R1_fastp01.fastq
b2=/work/SatohU/kawato/study/kuruma_genome2017/reads/Mj01muscleDNA_R2_fastp01.fastq
assembly_name=sealer_out

# Create symlinks
ln -s ../Sealer/sealer_out.fa sealer_out.fa

# NtEdit
nthits -c 5 -b 36 -k 47 -t ${num_threads} --outbloom -p Mj01muscleDNA ${b1} ${b2}
ntedit -f ${assembly_name}.fa -r Mj01muscleDNA_k47.bf -k 47 -b ${assembly_name}_ntedit_k47 -t ${num_threads}
```
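The polishing steps from `scaffolding_round0` onward read pre-built Bloom filters (`solids_k{33,47,55,64}.bf`) from `analysis/nthits`, but the `nthits` commands that built them are not included in the clean copies. A sketch consistent with the NtEdit step above, assuming the same trimmed muscle reads and nthits's `--solid` mode (which, to our understanding, names its output `solids_k<k>.bf`):

```sh
# Assumed commands: build "solid" k-mer Bloom filters for the four k values used below
b1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R1_trim.fq.gz
b2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R2_trim.fq.gz
for k in 33 47 55 64; do
  nthits --solid -b 36 -k ${k} -t 56 --outbloom ${b1} ${b2}
done
```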
### Removing short (<2kb) scaffolds

```sh
#!/bin/bash
# >>>>>>
# Remove2kless.sh
# >>>>>>

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=draft_up2k/
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Create symlink
ln -s ../NtEdit/sealer_out_ntedit_k47_edited.fa sealer_out_ntedit_k47_edited.fa

# Remove scaffolds < 2kb
seqkit seq -m 2000 sealer_out_ntedit_k47_edited.fa >Mj.up2k.fa
```

## Assembly improvement using ONT long reads, Illumina mate-pair reads, and transcriptome data

### Mate-pair reads processing (NxTrim, Fastp, BBduk)

```sh
#!/bin/bash
# >>>>>>
# TrimMP_2.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WDIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq/
cd ${WDIR}

for n in 02 03 04 05 06 07 08 10 12 14 15;do
# NxTrim
nxtrim -1 ${n}k_R1_raw.fastq.gz -2 ${n}k_R2_raw.fastq.gz --separate --justmp --rf -O ${n}k_nxtrim 2>${n}k_nxtrim.log
# Fastp
fastp -i ${n}k_nxtrim_R1.mp.fastq.gz -I ${n}k_nxtrim_R2.mp.fastq.gz -o ${n}k_nxtrim_R1.mp_trim.fq.gz -O ${n}k_nxtrim_R2.mp_trim.fq.gz --trim_front1 2 --n_base_limit 0 --length_required 60
# BBduk
bbduk.sh in1=${n}k_nxtrim_R1.mp_trim.fq.gz in2=${n}k_nxtrim_R2.mp_trim.fq.gz out1=${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz out2=${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz entropy=0.9 entropywindow=50 entropyk=5 2>${n}k_nxtrim.mp_trim.bbduk.log
done
```

### Gap filling, assembly break, and polishing (TGS-GapCloser, NtEdit)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round0.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round0
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
DRAFT=/work/kawato/study/kuruma_genome2017/analysis/draft_up2k/Mj.up2k.fa
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round0
NUMTH=56

# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff ${DRAFT} \
--reads ${LREADS} \
--output ${ROUND} \
--ne --thread ${NUMTH}
rm done_step*

# Break into contigs
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i ${ROUND}.scaff_seqs -o ${ROUND}.scaff_seqs.contigs

# Polish contigs
conda activate ntedit-1.3.2
ntedit -f ${ROUND}.scaff_seqs.contigs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}.ntedit_k33 -t ${NUMTH}
```
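`split.scaffolds.to.contigs.pl` is a custom script not included here; with `-m 1` it breaks scaffolds into contigs at every run of Ns. An equivalent command using seqtk (an alternative tool, not the original script) would be:

```sh
# Break scaffolds at runs of >=1 N (mirrors split.scaffolds.to.contigs.pl -m 1)
seqtk cutN -n 1 round0.scaff_seqs > round0.scaff_seqs.contigs
```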
### Scaffolding (1)

#### Scaffolding using ONT long reads (LRScaf)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_LRScaf.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_LRScaf
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
REF_DIR=/work/kawato/study/kuruma_genome2017/analysis/scaffolding_round0
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round1
DRAFT=draft
NUMTH=56

# Create symlink to input
ln -s ../scaffolding_round0/round0.ntedit_k33_edited.fa ${DRAFT}.fa

# Map Nanopore reads
minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log

# Scaffolding with LRScaf
java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar --contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56

# Gap filling with TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh --scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} --output ${ROUND}_LRScaf --ne --thread 56
rm done_step*

# Break into contigs
perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i ${ROUND}_LRScaf.scaff_seqs -o ${ROUND}_LRScaf.scaff_seqs.contigs

# Polish contigs
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_LRScaf.scaff_seqs.contigs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}_LRScaf.contigs.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}_LRScaf.contigs.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}_LRScaf.contigs.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.contigs.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}_LRScaf.contigs.ntedit_k33 -t ${NUMTH}
```

#### Scaffolding using Illumina mate-pair reads (BESST)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_BESST.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_BESST
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round1
DRAFT=draft
NUMTH=56

# Make symlink to input
ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.contigs.ntedit_k33_edited.fa ${DRAFT}.fa

# Map mate-pair reads
for n in 02 03 04 05 06 07 08 10 12 14 15;do
R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz
R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz
minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam
samtools index ${n}k.bam
done

# Run BESST
conda activate besst-py2.7
runBESST -g -c ${DRAFT}.fa \
-f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \
-orientation rf rf rf rf rf rf rf rf rf rf rf -q
conda deactivate

# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff BESST_output/pass11/Scaffolds_pass11.fa \
--reads ${LREADS} \
--output ${ROUND}_BESST \
--ne --thread ${NUMTH}
rm done_step*

# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using cDNA alignments (L_RNA_scaffolder)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_L_RNA_Scaffolder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_L_RNA_scaffolder
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
ROUND=round1
DRAFT=draft
NUMTH=56
TR=tr.merged

# Make symlinks to input
ln -s ../scaffolding_${ROUND}_BESST/${ROUND}_BESST.ntedit_k33_edited.fa ${DRAFT}.fa
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/tr.merged ${TR}

# BLAT
conda activate pasa
cat ${TR} \
| parallel --gnu --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${TR}.psl

# L_RNA_scaffolder
/work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \
-d /work/kawato/apps/L_RNA_scaffolder/ \
-i ${TR}.psl -j ${DRAFT}.fa -o . -l 0.90 -p 0.98 -n 100

# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff L_RNA_scaffolder.fasta --reads ${LREADS} \
--output ${ROUND}_L_RNA_scaffolder --ne --thread 56
rm done_step*

# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_L_RNA_scaffolder.scaff_seqs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}_L_RNA_scaffolder.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}_L_RNA_scaffolder.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}_L_RNA_scaffolder.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}_L_RNA_scaffolder.ntedit_k33 -t ${NUMTH}
```

### Misassembly detection and assembly break (REAPR)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_REAPR.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_REAPR
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round1
DRAFT=draft
NUMTH=56
n=10

# Make symlink to input
ln -s ../scaffolding_${ROUND}_L_RNA_scaffolder/${ROUND}_L_RNA_scaffolder.ntedit_k33_edited.fa ${DRAFT}.fa

# Reverse-complement the mate-pair reads (REAPR accepts "innies" only); this RC step is
# implicit in the original clean copy -- cf. ReverseComplementMPReads.sh above
seqkit seq -pr ${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz > ${n}k_nxtrim_R1.mp_trim.bbduk.RC.fastq
seqkit seq -pr ${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz > ${n}k_nxtrim_R2.mp_trim.bbduk.RC.fastq

# REAPR
conda activate reapr
reapr facheck ${DRAFT}.fa ${DRAFT}.facheck
reapr smaltmap \
-n ${NUMTH} ${DRAFT}.facheck.fa \
${n}k_nxtrim_R1.mp_trim.bbduk.RC.fastq \
${n}k_nxtrim_R2.mp_trim.bbduk.RC.fastq \
mapped_${n}k.bam
reapr pipeline ${DRAFT}.facheck.fa mapped_${n}k.bam reapr_${n}k_out
```
### Haplotig removal (1) (purge_haplotigs)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round1_purge_haplotigs.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round1_purge_haplotigs
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fastq
ROUND=round1
DRAFT=draft
NUMTH=56

# Make symlink to input
ln -s ../scaffolding_${ROUND}_REAPR/reapr_10k_out/04.break.broken_assembly.fa ${DRAFT}.fa

# Map long reads
minimap2 -t ${NUMTH} -ax map-ont ${DRAFT}.fa ${LREADS} --secondary=no | samtools sort -@ ${NUMTH} -o aligned.bam -T tmp.ali

# Purge_haplotigs
conda activate purge_haplotigs
purge_haplotigs hist -b aligned.bam -g ${DRAFT}.fa -t ${NUMTH} 2>purge_haplotigs_hist.log
purge_haplotigs cov -i aligned.bam.gencov -l 0 -m 15 -h 40 -o coverage_stats.csv 2>purge_haplotigs_cov.log
purge_haplotigs purge -g ${DRAFT}.fa -c coverage_stats.csv -t ${NUMTH} 2>purge_haplotigs_purge.log
purge_haplotigs clip -p curated.fasta -h curated.haplotigs.fasta -l 500 -o clip_l0.5k 2>purge_haplotigs_clip_l0.5k.log
```

### Scaffolding (2)

#### Scaffolding using ONT long reads (LRScaf)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_LRScaf.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_LRScaf
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
ROUND=round2
DRAFT=draft
NUMTH=56

# Create symlink to input
ln -s ../scaffolding_round1_purge_haplotigs/clip_l0.5k.fasta ${DRAFT}.fa

# Map Nanopore reads
minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log

# Scaffolding with LRScaf
java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar \
--contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56

# Gap filling with TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} \
--output ${ROUND}_LRScaf --ne --thread 56
rm done_step*

# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_LRScaf.scaff_seqs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}_LRScaf.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}_LRScaf.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}_LRScaf.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_LRScaf.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}_LRScaf.ntedit_k33 -t ${NUMTH}
```

#### Scaffolding using Illumina mate-pair reads (BESST)

```sh
#!/bin/bash
# >>>>>>
# scaffolding_round2_BESST.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=scaffolding_round2_BESST
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define variables
BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits
LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa
R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq
ROUND=round2
DRAFT=draft
NUMTH=56

# Make symlink to input
ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.ntedit_k33_edited.fa ${DRAFT}.fa

# Map mate-pair reads
for n in 02 03 04 05 06 07 08 10 12 14 15;do
R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz
R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz
minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam
samtools index ${n}k.bam
done

# Run BESST
conda activate besst-py2.7
runBESST -g -c ${DRAFT}.fa \
-f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \
-orientation rf rf rf rf rf rf rf rf rf rf rf -q
conda deactivate

# TGS-GapCloser
/work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \
--scaff BESST_output/pass11/Scaffolds_pass11.fa \
--reads ${LREADS} \
--output ${ROUND}_BESST \
--ne --thread ${NUMTH}
rm done_step*

# Polish
conda activate ntedit-1.3.2
ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -k 64 -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -k 55 -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -k 47 -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH}
ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -k 33 -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH}
```
#### Scaffolding using cDNA alignments (L_RNA_scaffolder) ```sh #!/bin/bash # >>>>>> # scaffolding_round2_L_RNA_scaffolder.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=scaffolding_round2_L_RNA_scaffolder cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Define variables BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa ROUND=round2 DRAFT=draft NUMTH=56 TR=tr.merged # Make symlink to input ln -s ../scaffolding_${ROUND}_BESST/${ROUND}_BESST.ntedit_k33_edited.fa ${DRAFT}.fa ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/tr.merged ${TR} # BLAT conda activate pasa cat ${TR} \ | parallel --gnu --pipe --recstart ">" "blat -t=dna -q=dna -minIdentity=98 -noHead ${DRAFT}.fa stdin >(cat) >/dev/null" >${TR}.psl # L_RNA_scaffolder /work/kawato/apps/L_RNA_scaffolder/L_RNA_scaffolder.sh \ -d /work/kawato/apps/L_RNA_scaffolder/ \ -i ${TR}.psl -j ${DRAFT}.fa -o . -l 0.90 -p 0.98 -n 100 # TGS-GapCloser /work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \ --scaff L_RNA_scaffolder.fasta \ --reads ${LREADS} \ --output ${ROUND}_L_RNA_scaffolder \ --ne --thread 56 rm done_step* # Polish conda activate ntedit-1.3.2 ntedit -f ${ROUND}_L_RNA_scaffolder.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k64 -t ${NUMTH} ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k55 -t ${NUMTH} ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k47 -t ${NUMTH} ntedit -f ${ROUND}_L_RNA_scaffolder.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_L_RNA_scaffolder.ntedit_k33 -t ${NUMTH} ``` ### Haplotig removal (2) (purge_haplotigs) ```sh #!/bin/bash # >>>>>> # scaffolding_round2_purge_haplotigs.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=scaffolding_round2_purge_haplotigs cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Define variables LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fq ROUND=round2 DRAFT=draft NUMTH=56 # Make symlink to input ln -s ../scaffolding_${ROUND}_L_RNA_scaffolder/${ROUND}_L_RNA_scaffolder.ntedit_k33_edited.fa ${DRAFT}.fa # Map long reads minimap2 -t ${NUMTH} -ax map-ont ${DRAFT}.fa ${LREADS} --secondary=no | samtools sort -@ ${NUMTH} -o aligned.bam -T tmp.ali # Purge_haplotigs conda activate purge_haplotigs purge_haplotigs hist -b aligned.bam -g ${DRAFT}.fa -t ${NUMTH} 2>purge_haplotigs_hist.log purge_haplotigs cov -i aligned.bam.gencov -l 0 -m 15 -h 40 -o coverage_stats.csv 2>purge_haplotigs_cov.log purge_haplotigs purge -g ${DRAFT}.fa -c coverage_stats.csv -t ${NUMTH} 2>purge_haplotigs_purge.log purge_haplotigs clip -p curated.fasta -h curated.haplotigs.fasta -l 500 -o clip_l0.5k 2>purge_haplotigs_clip_l0.5k.log ``` ### Scaffolding (3) #### Scaffolding using ONT long reads (LRScaf) ```sh #!/bin/bash # >>>>>> # scaffolding_round3_LRScaf.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=scaffolding_round3_LRScaf cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Define variables LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits ROUND=round3 DRAFT=draft NUMTH=56 # Create symlink to
input ln -s ../scaffolding_round2_purge_haplotigs/clip_l0.5k.fasta ${DRAFT}.fa # Map Nanopore reads minimap2 -t 56 -x map-ont ${DRAFT}.fa ${LREADS} 1>${DRAFT}.paf 2>${DRAFT}.minimap2.log # Scaffolding with LRScaf java -jar /work/kawato/apps/lrscaf/target/LRScaf-1.1.11.jar --contig ${DRAFT}.fa -a ${DRAFT}.paf -t mm -i 0.1 -o lrscaf_out -p 56 # Gap filling with TGS-GapCloser /work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \ --scaff lrscaf_out/scaffolds.fasta --reads ${LREADS} \ --output ${ROUND}_LRScaf --ne --thread 56 rm done_step* # Polish conda activate ntedit-1.3.2 ntedit -f ${ROUND}_LRScaf.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_LRScaf.ntedit_k64 -t ${NUMTH} ntedit -f ${ROUND}_LRScaf.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_LRScaf.ntedit_k55 -t ${NUMTH} ntedit -f ${ROUND}_LRScaf.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_LRScaf.ntedit_k47 -t ${NUMTH} ntedit -f ${ROUND}_LRScaf.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_LRScaf.ntedit_k33 -t ${NUMTH} ``` #### Scaffolding using Illumina mate-pair reads (BESST) ```sh #!/bin/bash # >>>>>> # scaffolding_round3_BESST.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=scaffolding_round3_BESST cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Set variables BF_DIR=/work/kawato/study/kuruma_genome2017/analysis/nthits LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/ONT.up1k.fa R_DIR=/work/kawato/study/kuruma_genome2017/reads/mp_hiseq ROUND=round3 DRAFT=draft NUMTH=56 # Make symlink to input ln -s ../scaffolding_${ROUND}_LRScaf/${ROUND}_LRScaf.ntedit_k33_edited.fa ${DRAFT}.fa # Map mate-pair reads for n in 02 03 04 05 06 07 08 10 12 14 15;do R1=${R_DIR}/${n}k_nxtrim_R1.mp_trim.bbduk.fq.gz R2=${R_DIR}/${n}k_nxtrim_R2.mp_trim.bbduk.fq.gz minimap2 --secondary=no --MD -ax sr ${DRAFT}.fa $R1 $R2 | samtools view -G 4 -Sb - |samtools sort -O BAM -o ${n}k.bam samtools index ${n}k.bam done # Run BESST conda activate besst_py2.7 runBESST -g -c ${DRAFT}.fa \ -f 02k.bam 03k.bam 04k.bam 05k.bam 06k.bam 07k.bam 08k.bam 10k.bam 12k.bam 14k.bam 15k.bam \ -orientation rf rf rf rf rf rf rf rf rf rf rf -q conda deactivate # TGS-GapCloser /work/kawato/apps/TGS-GapCloser/TGS-GapCloser.sh \ --scaff BESST_output/pass11/Scaffolds_pass11.fa \ --reads ${LREADS} \ --output ${ROUND}_BESST \ --ne --thread ${NUMTH} rm done_step* # Polish conda activate ntedit-1.3.2 ntedit -f ${ROUND}_BESST.scaff_seqs -r ${BF_DIR}/solids_k64.bf -b ${ROUND}_BESST.ntedit_k64 -t ${NUMTH} ntedit -f ${ROUND}_BESST.ntedit_k64_edited.fa -r ${BF_DIR}/solids_k55.bf -b ${ROUND}_BESST.ntedit_k55 -t ${NUMTH} ntedit -f ${ROUND}_BESST.ntedit_k55_edited.fa -r ${BF_DIR}/solids_k47.bf -b ${ROUND}_BESST.ntedit_k47 -t ${NUMTH} ntedit -f ${ROUND}_BESST.ntedit_k47_edited.fa -r ${BF_DIR}/solids_k33.bf -b ${ROUND}_BESST.ntedit_k33 -t ${NUMTH} ```
### Assembly polishing (HyPo) ```sh #!/bin/bash # >>>>>> # hypo.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=hypo cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Set variables NUMTH=56 GENOME_SIZE=2g COV=50 R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R1_trim.tadpole.fq.gz R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01muscleDNA_R2_trim.tadpole.fq.gz # Make symlink ln -s ../scaffolding_round3_BESST/round3_BESST.ntedit_k33_edited.fa hypo0.fa # Create input list required by hypo echo -e "$R1\n$R2" > il_names.txt conda activate hypo-1.0.3 # hypo, round1 # Map PE reads minimap2 --secondary=no --MD -ax sr -t $NUMTH hypo0.fa $R1 $R2 | samtools view -Sb - > hypo0.bam samtools sort -@$NUMTH -o hypo0.sorted.bam hypo0.bam samtools index hypo0.sorted.bam rm hypo0.bam # Hypo hypo -d hypo0.fa -r @il_names.txt -s ${GENOME_SIZE} -c ${COV} -b hypo0.sorted.bam -t $NUMTH -o hypo1.fa # hypo, round2 # Map PE reads minimap2 --secondary=no --MD -ax sr -t $NUMTH hypo1.fa $R1 $R2 | samtools view -Sb - > hypo1.bam samtools sort -@$NUMTH -o hypo1.sorted.bam hypo1.bam samtools index hypo1.sorted.bam rm hypo1.bam # Hypo hypo -d hypo1.fa -r @il_names.txt -s ${GENOME_SIZE} -c ${COV} -b hypo1.sorted.bam -t $NUMTH -o hypo2.fa ```
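Both HyPo rounds follow the same map-then-polish pattern; a minimal sketch of the same two rounds as a loop, using the variables defined in hypo.sh above:

```sh
# Sketch only: hypo0.fa is the input draft; each iteration emits hypo$((i+1)).fa.
for i in 0 1; do
  minimap2 --secondary=no --MD -ax sr -t ${NUMTH} hypo${i}.fa ${R1} ${R2} \
    | samtools sort -@ ${NUMTH} -o hypo${i}.sorted.bam -
  samtools index hypo${i}.sorted.bam
  hypo -d hypo${i}.fa -r @il_names.txt -s ${GENOME_SIZE} -c ${COV} \
    -b hypo${i}.sorted.bam -t ${NUMTH} -o hypo$((i+1)).fa
done
```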
### Assembly curation #### Mitogenome contamination removal ```sh #!/bin/bash # >>>>>> # purge_mtDNA.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=purge_mtDNA cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Set variables QUERY=MjmtDNA SUBJECT=hypo2 # Create symlinks ln -s ../hypo/hypo2.fa ${SUBJECT}.fa ln -s /work/kawato/study/ref/genomes/MjmtDNA.fa ${QUERY}.fa # BLASTN blastn -query ${QUERY}.fa -subject ${SUBJECT}.fa -perc_identity 90 -outfmt 7 -out ${QUERY}.${SUBJECT}.ident90.blastn ## Following regions were identified as contamination # NC_007010.1 Scaffolds_826 99.464 932 5 0 6721 7652 2154 1223 0.0 1694 # NC_007010.1 Scaffolds_23445 97.274 4035 79 14 3169 7201 45436 49441 0.0 6813 # NC_007010.1 Scaffolds_23445 99.719 2850 8 0 92 2941 42437 45286 0.0 5219 echo -e "Scaffolds_826\nScaffolds_23445" >${QUERY}.${SUBJECT}.ident90.blastn.hit.list seqkit grep -v -nf ${QUERY}.${SUBJECT}.ident90.blastn.hit.list ${SUBJECT}.fa >${SUBJECT}.${QUERY}.nohit.fa blastn -query ${QUERY}.fa -subject ${SUBJECT}.${QUERY}.nohit.fa -outfmt 7 -out ${SUBJECT}.${QUERY}.nohit.blastn # The regions at issue have been removed seqkit grep -np "Scaffolds_826" ${SUBJECT}.fa >Scaffolds_826.fa seqkit grep -np "Scaffolds_23445" ${SUBJECT}.fa >Scaffolds_23445.fa seqkit stats Scaffolds_826.fa Scaffolds_23445.fa # seqkit stats Scaffolds_826.fa Scaffolds_23445.fa # file format type num_seqs sum_len min_len avg_len max_len # Scaffolds_826.fa FASTA DNA 1 164,807 164,807 164,807 164,807 # Scaffolds_23445.fa FASTA DNA 1 116,046 116,046 116,046 116,046 # -> Trim Scaffolds_826:1-2500 # -> Retain Scaffolds_23445:1-42000 and 50001-116046 seqkit subseq -r 2501:-1 Scaffolds_826.fa > Scaffolds_826_2501-164807.fa # Trim the first 2500 bases seqkit stats Scaffolds_826.fa Scaffolds_826_2501-164807.fa >Scaffolds_826_2501-164807.stats # Verify lengths blastn -query Scaffolds_826.fa -subject Scaffolds_826_2501-164807.fa -outfmt 7 -out Scaffolds_826_2501-164807.blastn # Confirm that the first 2500 bases have been removed seqkit subseq -r 1:42000 Scaffolds_23445.fa > Scaffolds_23445_1-42000.fa # Retain first 42000 bases seqkit stats Scaffolds_23445.fa Scaffolds_23445_1-42000.fa >Scaffolds_23445_1-42000.stats # Verify lengths blastn -query Scaffolds_23445.fa -subject Scaffolds_23445_1-42000.fa -outfmt 7 -out Scaffolds_23445_1-42000.blastn # Confirm that the first 42000 bases have been retained seqkit subseq -r 50001:-1 Scaffolds_23445.fa > Scaffolds_23445_50001-116046.fa # Trim first 50000 bases seqkit stats Scaffolds_23445.fa Scaffolds_23445_50001-116046.fa >Scaffolds_23445_50001-116046.stats # Verify lengths blastn -query Scaffolds_23445.fa -subject Scaffolds_23445_50001-116046.fa -outfmt 7 -out Scaffolds_23445_50001-116046.blastn # Confirm that the first 50000 bases have been removed cat Scaffolds_826_2501-164807.fa Scaffolds_23445_1-42000.fa Scaffolds_23445_50001-116046.fa >${SUBJECT}.${QUERY}.mtDNA_trimmed.fa blastn -query ${QUERY}.fa -subject ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa -outfmt 7 -out ${SUBJECT}.${QUERY}.mtDNA_trimmed.blastn # no hit cat ${SUBJECT}.${QUERY}.nohit.fa ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa >${SUBJECT}.mtDNA_purged.fa # Merge originally clean seqs + purged seqs seqkit stats ${SUBJECT}.fa ${SUBJECT}.${QUERY}.nohit.fa ${SUBJECT}.${QUERY}.mtDNA_trimmed.fa ${SUBJECT}.mtDNA_purged.fa >${SUBJECT}.mtDNA_purged.stats # stats blastn -query ${QUERY}.fa -subject ${SUBJECT}.mtDNA_purged.fa -outfmt 7 -out ${SUBJECT}.mtDNA_purged.blastn # no hit seqkit seq -m 2000 ${SUBJECT}.mtDNA_purged.fa >${SUBJECT}.mtDNA_purged.up2k.fa # remove < 2kbp frags seqkit stats ${SUBJECT}.mtDNA_purged.fa ${SUBJECT}.mtDNA_purged.up2k.fa >${SUBJECT}.mtDNA_purged.up2k.stats # stats seqkit rename ${SUBJECT}.mtDNA_purged.up2k.fa |seqkit sort --quiet -r -l > ${SUBJECT}.mtDNA_purged.up2k.sorted.fa # Sort seqs. Had to run seqkit rename because the two Scaffolds_23445 fragments were regarded as duplicates seqkit fx2tab -l -n -i -H ${SUBJECT}.mtDNA_purged.up2k.sorted.fa >${SUBJECT}.mtDNA_purged.up2k.sorted.tab # Confirm that the seqs have been sorted by length; saved as a table file awk 'BEGIN{n = 0}; {if(/^>/){printf(">scaffold_" "%05d\n", ++n) } else{print $0}}' ${SUBJECT}.mtDNA_purged.up2k.sorted.fa > ${SUBJECT}.mtDNA_purged.up2k.sorted.renamed.fa # Rename entries ``` #### Filter out sequences with abnormal coverage ```sh #!/bin/bash # >>>>>> # filter_lowqual_seq.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=filter_lowqual_seq cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define variables ASM=mtDNA_purged NUMTH=56 ln -s ../purge_mtDNA/hypo2.mtDNA_purged.up2k.sorted.renamed.fa ${ASM}.fa # Identify sequences with extreme GC contents (less than 10% or more than 70%). The thresholds were determined based on manual IGV inspection of the annotated genome.
Some of the extremely GC-poor sequences seem to be filled with gaps rather than AT's. seqkit fx2tab -n -g -l -B N ${ASM}.fa >${ASM}.len.gc.n.tab awk '$4 >=60 {print $1}' ${ASM}.len.gc.n.tab >${ASM}.gap60up sort -nk3 ${ASM}.len.gc.n.tab |awk ' $3 < 10 {print $1}' >${ASM}.gc10minus sort -nk3 ${ASM}.len.gc.n.tab |awk ' $3 > 70 {print $1}' >${ASM}.gc70up # Coverage file for LIB in sperm muscle ;do R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.tadpole.fq.gz R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.tadpole.fq.gz minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${ASM}.fa ${R1} ${R2} | samtools view -Sb - |\ samtools sort -o Mj01${LIB}DNA.sorted.bam samtools index Mj01${LIB}DNA.sorted.bam done samtools coverage Mj01spermDNA.sorted.bam Mj01muscleDNA.sorted.bam >${ASM}.coverage # Identify low-coverage scaffolds sort -nk7 ${ASM}.coverage|awk ' $7 < 20 {print $1}' >${ASM}.meandepth20minus sort -nk7 ${ASM}.coverage|awk ' $6 < 30 {print $1}' >${ASM}.coverage30minus # Make a non-redundant list of low-quality sequences cat ${ASM}.gap60up ${ASM}.gc10minus ${ASM}.gc70up ${ASM}.meandepth20minus ${ASM}.coverage30minus|sort|uniq >${ASM}.lowQuality.list # BLAST to rescue gene-containing scaffolds seqkit grep -nf ${ASM}.lowQuality.list ${ASM}.fa >${ASM}.lowQuality.fa blastn -query ${ASM}.lowQuality.fa -subject /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr \ -outfmt 7 -out ${ASM}.lowQuality.tr.merged.nr.blastn grep -v "#" ${ASM}.lowQuality.tr.merged.nr.blastn |awk '$11 == 0.0 {print $2}'|sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.list seqkit grep -nf ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.list /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.fa blastn -query ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.fa -subject ${ASM}.fa \ -max_target_seqs 1 -outfmt 7 -out ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn grep -v "#" ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn|awk '$11 == 0.0 {print $2}'|sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list cat ${ASM}.lowQuality.list|while read line;do cat ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list | awk -v line=${line} '{if (match($0, line) > 0){print $0}}' done >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits grep -f ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn|cut -f 1 |sort|uniq >${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.list # Three transcripts were found and queried against the NCBI nr protein DB with BLASTX. One transcript was found to be a likely shrimp protein and was kept. The other two had no unambiguous hits and were therefore discarded seqkit grep -nf ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.list /work/kawato/study/kuruma_transcriptome2017/analysis/asm/tr.merged.nr > ${ASM}.lowQuality.tr.merged.nr.blastn.evalue0.blastn.list.hits.transcripts.fa # The scaffold containing the shrimp protein gene was isolated and broken into contigs seqkit grep -np "scaffold_11228" ${ASM}.fa >scaffold_11228.fa perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -m 1 -i scaffold_11228.fa -o scaffold_11228.fa.contigs # Filtered sequences seqkit grep -nvf ${ASM}.lowQuality.list ${ASM}.fa >${ASM}.filtered.fa.tmp cat ${ASM}.filtered.fa.tmp scaffold_11228.fa.contigs >${ASM}.filtered.fa # Trim gaps on (near) scaffold ends.
This takes some time; gsub targeting string ends is extremely slow, so I turned to reverse-complementing the seqs instead. seqkit fx2tab ${ASM}.filtered.fa | awk ' BEGIN { OFS="\t" } { gsub( /^[ATGCN]{0,100}N{1,}/, "", $2 ) print $1 , $2 } ' |seqkit tab2fx|seqkit seq -pr|seqkit fx2tab| awk ' BEGIN { OFS="\t" } { gsub( /^[ATGCN]{0,100}N{1,}/, "", $2 ) print $1 , $2 } ' |seqkit tab2fx|seqkit seq -pr >${ASM}.filtered.mod.fa # Verify that gaps on ends have been eliminated seqkit subseq -r 1:100 ${ASM}.filtered.mod.fa |seqkit fx2tab > ${ASM}.filtered.mod.end_left.tab awk '$2 ~/N/ {print $1}' ${ASM}.filtered.mod.end_left.tab > ${ASM}.filtered.mod.end_left_gapAtEnd.list seqkit subseq -r -100:-1 ${ASM}.filtered.mod.fa |seqkit fx2tab > ${ASM}.filtered.mod.end_right.tab awk '$2 ~/N/ {print $1}' ${ASM}.filtered.mod.end_right.tab > ${ASM}.filtered.mod.end_right_gapAtEnd.list # Sort seqs seqkit sort --quiet -r -l ${ASM}.filtered.mod.fa > ${ASM}.filtered.mod.sorted.fa # Confirm that the seqs have been sorted by length; saved as a table file seqkit fx2tab -l -n -i -H ${ASM}.filtered.mod.sorted.fa >${ASM}.filtered.mod.sorted.tab # Rename entries awk 'BEGIN{n = 0}; {if(/^>/){printf(">scaffold_" "%05d\n", ++n) } else{print $0}}' ${ASM}.filtered.mod.sorted.fa > Mj_TUMSAT_v1.0.fa ``` #### Screening bacterial contamination (Barrnap) ```sh #!/bin/bash # >>>>>> # Barrnap.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Barrnap cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define variables NUMTH=56 ASM=mtDNA_purged.filtered.mod.sorted # Symlink ln -s ../filter_lowqual_seq/mtDNA_purged.filtered.mod.sorted.fa ${ASM}.fa # Run Barrnap conda activate barrnap barrnap --kingdom euk --outseq ${ASM}.barrnap.euk.fa --threads ${NUMTH} ${ASM}.fa > ${ASM}.barrnap.euk.gff grep -v "partial" ${ASM}.barrnap.euk.gff >${ASM}.barrnap.euk.full.gff barrnap --kingdom bac --outseq ${ASM}.barrnap.bac.fa --threads ${NUMTH} ${ASM}.fa > ${ASM}.barrnap.bac.gff # BLASTN search performed against the NCBI nr nucleotide database (accessed 2021-02-12).
# No bacterial sequence detected ``` #### Mapping Illumina paired-end reads onto the final assembly ```sh #!/bin/bash # >>>>>> # map_IlluminaPE_to_genome.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=map_IlluminaPE_to_genome cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define variables ASM=Mj_TUMSAT_v1.0 NUMTH=56 # Create symlink ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa DRAFT=${ASM}.fa for LIB in sperm muscle ;do R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.fq.gz R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.fq.gz minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${DRAFT} ${R1} ${R2} | samtools view -Sb - |samtools sort -o Mj01${LIB}DNA.sorted.bam samtools index Mj01${LIB}DNA.sorted.bam samtools flagstat -@ 16 -O tsv Mj01${LIB}DNA.sorted.bam >Mj01${LIB}DNA.sorted.bam.flagstat done ``` ## Mitogenome assembly (Flye, --meta) ```sh #!/bin/bash # >>>>>> # flye_meta_MjmtDNA.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=flye_meta_MjmtDNA cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define variables LREADS=/work/kawato/study/kuruma_genome2017/reads/nanopore/Mj.up5k.fastq ln -s /work/kawato/study/ref/genomes/MjmtDNA.fa MjmtDNA.fa NUMTH=16 # Assemble using Flye conda activate flye flye --nano-raw ${LREADS} --out-dir flye_meta_up5k --threads ${NUMTH} --meta # Identify the mitogenome contig SUBJECT=flye_meta_up5k/assembly.fasta QUERY=MjmtDNA.fa # BLASTN; identified contig_2159 blastn -query ${QUERY} -subject ${SUBJECT} -outfmt 7 -out MjmtDNA_blastn.out # Extract contig_2159; start position adjusted based on BLASTN alignment seqkit grep -np "contig_2159" ${SUBJECT} |seqkit restart -i 75 >MjmtDNA_unpolished.fa # Map PE reads DRAFT=MjmtDNA_unpolished.fa LIB=muscle R1=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R1_trim.tadpole.fq.gz R2=/work/kawato/study/kuruma_genome2017/reads/pe_hiseq/Mj01${LIB}DNA_R2_trim.tadpole.fq.gz minimap2 --secondary=no --MD -ax sr -t ${NUMTH} ${DRAFT} ${R1} ${R2} |\ samtools view -Sb - -f 3 -q 60 |samtools sort -o Mj01${LIB}DNA.sorted.bam samtools index Mj01${LIB}DNA.sorted.bam # Polishing by Pilon java -jar /work/kawato/apps/pilon/1.23/pilon-1.23.jar \ --genome ${DRAFT} \ --frags Mj01${LIB}DNA.sorted.bam \ --outdir ${DRAFT}.pilon \ --changes --vcf --tracks --threads ${NUMTH} cp ${DRAFT}.pilon/pilon.fasta MjmtDNA_polished.fa # Calculate coverage minimap2 --secondary=no --MD -ax sr -t ${NUMTH} MjmtDNA_polished.fa ${R1} ${R2} | samtools view -Sb - -f 3 -q 60 |samtools sort -o MjmtDNA_polished_Mj01${LIB}DNA.sorted.bam samtools index MjmtDNA_polished_Mj01${LIB}DNA.sorted.bam samtools coverage MjmtDNA_polished_Mj01muscleDNA.sorted.bam > MjmtDNA_polished_Mj01muscleDNA.sorted.bam.coverage minimap2 --secondary=no --MD -ax map-ont -t ${NUMTH} MjmtDNA_polished.fa ${LREADS} |samtools view -bS - |samtools sort -O BAM -o MjmtDNA_polished_ONTup5k.sorted.bam samtools index MjmtDNA_polished_ONTup5k.sorted.bam samtools coverage MjmtDNA_polished_ONTup5k.sorted.bam > MjmtDNA_polished_ONTup5k.sorted.bam.coverage ``` ## Repeat identification and masking ### Repeat library construction (RepeatScout) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_RepeatScout.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_RepeatScout cd ${WHOME} mkdir ${WDIR} cd ${WDIR} #
Define variables SP=Marsupenaeus_japonicus NUMTH=16 # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${SP}.fa # Subsample contigs perl /work/kawato/scripts/split.scaffolds.to.contigs.pl -in ${SP}.fa -out ${SP}.fa.contigs seqkit seq -m 1000 ${SP}.fa.contigs >${SP}.fa.contigs.up1k seqkit stats ${SP}.fa.contigs.up1k >${SP}.fa.contigs.up1k.stats seqkit sample --proportion 0.6 ${SP}.fa.contigs.up1k >${SP}.fa.contigs.up1k.0.6x seqkit stats ${SP}.fa.contigs.up1k.0.6x >${SP}.fa.contigs.up1k.0.6x.stats conda activate repeatmodeler # Build repeat library with RepeatScout SAMPLE=${SP}.fa.contigs.up1k.0.6x build_lmer_table -l 14 -sequence ${SAMPLE} -freq ${SAMPLE}.14mer_table RepeatScout -sequence ${SAMPLE} -output ${SAMPLE}.rs_output.fa -freq ${SAMPLE}.14mer_table -l 14 cat ${SAMPLE}.rs_output.fa| filter-stage-1.prl >${SAMPLE}.rs_output.filtered1.fasta RepeatMasker -pa 20 -no_is -norna -s -lib ${SAMPLE}.rs_output.filtered1.fasta ${SAMPLE} cat ${SAMPLE}.rs_output.filtered1.fasta | filter-stage-2.prl --cat=${SAMPLE}.out --thresh=3 > ${SAMPLE}.rs_output.filtered2.fasta ``` ### Taxonomic assingment (RepeatClassifier) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_RepeatClassifier.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_RepeatScout cd ${WHOME}/${WDIR} # Define variables SP=Marsupenaeus_japonicus NUMTH=16 # The conda distribution of RepeatModeler did not include a repeat library for RepeatClassifier, so I turned to the Dfam TETools version curl -sSLO https://github.com/Dfam-consortium/TETools/raw/master/dfam-tetools.sh chmod +x dfam-tetools.sh systemctl start docker sudo ./dfam-tetools.sh RepeatClassifier -consensi Marsupenaeus_japonicus.fa.contigs.up1k.0.6x.rs_output.filtered2.fasta & # Press Ctrl + P + Q to detach without stopping the job.
This lets you go back to the shell and exit freely. ``` ### Repeat masking (RepeatMasker) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_RepeatMasker.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_RepeatMasker cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Define variables SAMPLE=Marsupenaeus_japonicus.fa.contigs.up1k.0.6x # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa ln -s ../Mj_TUMSAT_v1.0_RepeatScout/${SAMPLE}.rs_output.filtered2.fasta.classified RepeatLibrary.classified.fa conda activate repeatmodeler # Run RepeatMasker RepeatMasker -pa 20 -s -no_is -lib RepeatLibrary.classified.fa Mj_TUMSAT_v1.0.fa -a -xsmall -html -gff ``` ## Gene prediction and functional annotation ### Prediction of protein-coding genes #### cDNA alignments ##### Transcriptome shotgun assembly ###### preprocessing paired-end reads (Fastp) ```sh #!/bin/bash # >>>>>> # TrimPE_RNAseq_2.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/ WDIR=reads cd ${WHOME}/${WDIR}/ # Define variables NUMTH=56 # Fastp less lib.list|while read lib;do fastp \ -i ../../reads/${lib}_R1.fastq.gz -I ../../reads/${lib}_R2.fastq.gz \ -o ${lib}_R1_trim.fq.gz -O ${lib}_R2_trim.fq.gz \ --trim_front1 2 --trim_tail1 2 \ --trim_front2 2 --trim_tail2 2 --n_base_limit 0 \ --length_required 140 --detect_adapter_for_pe \ --html ${lib}_fastp.html --thread 8 done ``` ###### De novo assembly (Trinity) ```sh #!/bin/bash # >>>>>> # Trinity.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq cd ${WHOME} # Define variables NUMTH=56 # Run Trinity conda activate trinity-2.11.0 Trinity --seqType fq --max_memory 1400G --SS_lib_type RF \ --samples_file reads/reads.list --CPU ${NUMTH} ``` ##### Running PASA pipeline ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_PASA.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_PASA PASAHOME=/work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1 cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define VARs NUMTH=24 ASM=Mj_TUMSAT_v1.0 RUN=${ASM}_pasa MIN_PERCENT_ALIGNED=90 MIN_AVG_PER_ID=95 TRANSCRIPT=tr.fa TRANSCRIPT_CLEAN=tr.fa.clean W_DIR=${WHOME}${WDIR}/ # Create symlinks ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/isoseq3/polished.hq.fasta polished.hq.fasta ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/trinity_out_dir/Trinity.fasta Trinity.fasta ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa conda activate pasa # Prepare transcripts cat Trinity.fasta polished.hq.fasta >tr.fa ${PASAHOME}/bin/seqclean tr.fa -L -c 16 # Create config file pasa.alignAssembly.Template.txt cat /work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1/pasa_conf/pasa.alignAssembly.Template.txt |\ sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g"|sed "s@<__MIN_PERCENT_ALIGNED__>@${MIN_PERCENT_ALIGNED}@g"|sed "s@<__MIN_AVG_PER_ID__>@${MIN_AVG_PER_ID}@g" >${RUN}.alignAssembly.config ## Run PASA Pipeline ${PASAHOME}/Launch_PASA_pipeline.pl -C -R \ --config ${RUN}.alignAssembly.config \ --ALIGNERS blat \ --genome ${ASM}.fa \ -t ${TRANSCRIPT_CLEAN} -T -u ${TRANSCRIPT} \ --TRANSDECODER \ --CPU ${NUMTH} --transcribed_is_aligned_orient ${PASAHOME}/scripts/pasa_asmbls_to_training_set.dbi \ --pasa_transcripts_fasta ${RUN}_db.assemblies.fasta \ --pasa_transcripts_gff3
${RUN}_db.pasa_assemblies.gff3 ``` #### Genome-guided transcriptome assembly ##### RNA-seq mapping (HISAT2) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_hisat2.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_hisat2 RDIR=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/reads/ cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define VARs NUMTH=24 ASM=Mj_TUMSAT_v1.0 # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa conda activate hisat2 # HISAT2-build hisat2-build -p ${NUMTH} ${ASM}.fa ${ASM} # Run HISAT2 less ${RDIR}lib.list |while read LIB;do R1=${RDIR}${LIB}_R1_trim.fq.gz R2=${RDIR}${LIB}_R2_trim.fq.gz hisat2 --summary-file ${ASM}_${LIB} --new-summary -p ${NUMTH} -x ${ASM} -1 ${R1} -2 ${R2} |samtools sort -@ ${NUMTH} -O BAM -o ${ASM}_${LIB}.sorted.bam samtools index ${ASM}_${LIB}.sorted.bam done ``` ##### Transcript reconstruction (StringTie) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_stringtie.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_stringtie cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define VARs RDIR=/work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/reads/ STRINGTIE_DIR=/work/kawato/apps/stringtie/ export PATH=${STRINGTIE_DIR}:${PATH} NUMTH=56 ASM=Mj_TUMSAT_v1.0 # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Mj_TUMSAT_v1.0.fa less ${RDIR}lib.list |while read LIB;do ln -s ../Mj_TUMSAT_v1.0_hisat2/${ASM}_${LIB}.sorted.bam ${ASM}_${LIB}.sorted.bam ln -s ../Mj_TUMSAT_v1.0_hisat2/${ASM}_${LIB}.sorted.bam.bai ${ASM}_${LIB}.sorted.bam.bai done # Run stringtie less ${RDIR}lib.list |while read LIB;do stringtie --fr -p ${NUMTH} -o ${ASM}_${LIB}.out.gtf ${ASM}_${LIB}.sorted.bam done # Merge transcripts stringtie --merge -o ${ASM}_merged.gtf ${ASM}_*.out.gtf ``` #### *Ab initio* gene prediction (BRAKER2) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_BRAKER2.sh # >>>>>> # This is NOT designed to run using nohup. # Manually run the scripts inside COMMENTOUT. # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_BRAKER2 cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Set VARs NUMTH=56 ASM=Mj_TUMSAT_v1.0 # Copy files; symlink doesn't work for Docker cp ../Mj_TUMSAT_v1.0_RepeatMasker/${ASM}.fa.masked ${ASM}.fa.masked cp ../Mj_TUMSAT_v1.0_hisat2/${ASM}_*.sorted.bam* .
mkdir GeneMark; cp -r /work/kawato/apps/GeneMark/* GeneMark # Run BRAKER2 (the BRAKER2 command block, wrapped in a COMMENTOUT heredoc, is executed manually as noted above) ``` #### Protein alignments (Exonerate) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_Exonerate_Local.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_Exonerate_Local cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Set VARs NUMTH=12 GENOME=Mj_TUMSAT_v1.0 QUERY=2020-12-17_Mj_IPG # Make symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa ln -s /work/kawato/study/ref/proteins/${QUERY}.faa ${QUERY}.faa # Identical Protein Groups downloaded from NCBI on 2020-12-17 # BLAST makeblastdb -dbtype nucl -in ${GENOME}.fa -title ${GENOME} -out ${GENOME} -parse_seqids -hash_index tblastn -query ${QUERY}.faa -db ${GENOME} -num_threads ${NUMTH} -outfmt 7 -max_target_seqs 1 -out ${QUERY}.${GENOME}.tblastn cat ${QUERY}.${GENOME}.tblastn|grep -v "#" |cut -f 1,2 |sort |uniq>${QUERY}.${GENOME}.tblastn.hits # Separate seqs mkdir query subject cut -f 1 ${QUERY}.${GENOME}.tblastn.hits >query.list cat query.list|while read line;do seqkit grep -np "${line}" ${QUERY}.faa>query/${line}.faa done cut -f 2 ${QUERY}.${GENOME}.tblastn.hits |sort |uniq>subject.list cat subject.list|while read line;do seqkit grep -np "${line}" ${GENOME}.fa > subject/${line}.fa done # Exonerate mkdir gff conda activate exonerate cat ${QUERY}.${GENOME}.tblastn.hits|while read QUERY SUBJECT;do exonerate --model protein2genome query/${QUERY}.faa subject/${SUBJECT}.fa --bestn 1 --showalignment no --showvulgar no --showtargetgff yes >gff/${QUERY}.${SUBJECT}.gff done # Concatenate all cat gff/*.gff|grep -v "Command line" |grep -v "Hostname" |grep -v "completed" >query.subject.all.gff ``` #### Merging gene models (EVidenceModeler) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_EVidenceModeler.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh ### Do execute this script with nohup ### # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_EVidenceModeler cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define VARs EVM_HOME=/work/kawato/apps/EVidenceModeler PARTITIONS_DIR=partition_EVM_inputs NUMTH=56 ASM=Mj_TUMSAT_v1.0 # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa ln -s ../${ASM}_BRAKER2/braker/augustus.hints.gtf augustus.hints.gtf ln -s ../${ASM}_BRAKER2/braker/GeneMark-ET/genemark.gtf genemark.gtf ln -s ../${ASM}_stringtie/${ASM}_merged.gtf ${ASM}_merged.gtf ln -s ../${ASM}_PASA/${ASM}_pasa_db.pasa_assemblies.gff3 ${ASM}_pasa_db.pasa_assemblies.gff3 ln -s ../${ASM}_PASA/${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 ln -s ../${ASM}_Exonerate_Local/query.subject.all.gff query.subject.all.gff # Convert all PASA CDS alignments into nucleotide_to_protein_match features, as suggested by https://github.com/EVidenceModeler/EVidenceModeler/issues/23 grep "CDS" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 |sed 's/transdecoder/transdecoder_cds/g'|sed 's/CDS/nucleotide_to_protein_match/g' >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.nucleotide_to_protein_match.gff3 # Extract complete PASA CDS, as suggested by https://github.com/EVidenceModeler/EVidenceModeler/issues/23 # List up "complete" entries grep "complete" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3|grep -E -o "asmbl_[0-9]+.p[0-9]+"|sort |uniq >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list # Split the list to 700 lines, or 56 parts split -l 700 --numeric-suffixes --suffix-length=3 --elide-empty-files
${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list_ # grep the entries and redirect for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055;do nohup $(cat ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list_${n} |while read line;do grep "${line}" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 >>${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3.part${n};done) & done # "nohup: missing operand" seems to have had no negative impact on the results # Merge into a single gff file cat ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3.part* >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3
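# Note (sketch only, not the command that produced the results): a single grep -f
# per chunk would do the same fan-out without the per-line while loop and without
# the "nohup: missing operand" quirk noted above, e.g.:
# for f in ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.list_*; do
#   grep -f "${f}" ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.gff3 >"${f}.gff3" &
# done; wait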
# Produce .faa for quality assessment gffread ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3 -g ${ASM}.fa -y ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa seqkit stats ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa >${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.faa.stats # Convert gene predictions to EVM-compatible format perl ${EVM_HOME}/EvmUtils/misc/augustus_GTF_to_EVM_GFF3.pl augustus.hints.gtf >augustus.hints.gff perl ${EVM_HOME}/EvmUtils/misc/GeneMarkHMM_GTF_to_EVM_GFF3.pl genemark.gtf >genemark.gff perl ${EVM_HOME}/EvmUtils/misc/cufflinks_gtf_to_alignment_gff3.pl ${ASM}_merged.gtf >${ASM}_merged.gff perl ${EVM_HOME}/EvmUtils/misc/exonerate_gff_to_alignment_gff3.pl query.subject.all.gff >query.subject.all.gff3 # Verify gene count # grep "gene" augustus.hints.gff|wc -l # grep "gene" genemark.gff|wc -l # cut -f 9 ${ASM}_merged.gff|cut -f 1 -d ";"|grep -v '^$' |sort|uniq|wc -l # cut -f 9 query.subject.all.gff3|cut -f 1 -d ";"|grep -v '^$' |sort|uniq|wc -l # Concatenate gene predictions cat \ ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.complete.gff3 \ genemark.gff \ augustus.hints.gff \ >prediction.merged.gff cat \ ${ASM}_pasa_db.pasa_assemblies.gff3 \ ${ASM}_merged.gff \ >transcript.merged.gff cat \ ${ASM}_pasa_db.assemblies.fasta.transdecoder.genome.nucleotide_to_protein_match.gff3 \ query.subject.all.gff3 \ >protein.merged.gff mkdir ${PARTITIONS_DIR} cd ${PARTITIONS_DIR} # Create weight file echo -e "ABINITIO_PREDICTION\tGeneMark.hmm\t1\nABINITIO_PREDICTION\tAugustus\t3\nPROTEIN\texonerate\t10\nPROTEIN\ttransdecoder_cds\t10\nTRANSCRIPT\tCufflinks\t5\nTRANSCRIPT\tassembler-${ASM}_pasa_db\t10\nOTHER_PREDICTION\ttransdecoder\t10" >weights.txt # partition_EVM_inputs.pl ${EVM_HOME}/EvmUtils/partition_EVM_inputs.pl --genome ../${ASM}.fa \ --gene_predictions ../prediction.merged.gff \ --protein_alignments ../protein.merged.gff \ --transcript_alignments ../transcript.merged.gff \ --segmentSize 2000000 --overlapSize 100000 --partition_listing partitions_list.out # write_EVM_commands.pl ${EVM_HOME}/EvmUtils/write_EVM_commands.pl --genome ../${ASM}.fa --weights $(pwd)/weights.txt \ --gene_predictions ../prediction.merged.gff \ --protein_alignments ../protein.merged.gff \ --transcript_alignments ../transcript.merged.gff \ --output_file_name evm.out --partitions partitions_list.out > commands.list # Verify command count wc -l commands.list # 18210 commands.list # Split to 400-line chunks split -l 400 --numeric-suffixes --suffix-length=3 --elide-empty-files commands.list commands.list_ # 46 (000-045) partitions # execute_EVM_commands.pl for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 ;do nohup $EVM_HOME/EvmUtils/execute_EVM_commands.pl commands.list_${n} | tee commands.list_${n}_run.log 2>commands.list_${n}_run.err & done # recombine_EVM_partial_outputs.pl ${EVM_HOME}/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out # Split into 450-line chunks split -l 450 --numeric-suffixes --suffix-length=3 --elide-empty-files partitions_list.out partitions_list.out_ for n in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 ;do nohup ${EVM_HOME}/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out_${n} --output evm.out --genome ../${ASM}.fa & done # Recombine gff3 find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3 # Translate CDS perl ${EVM_HOME}/EvmUtils/gff3_file_to_proteins.pl EVM.all.gff3 ../${ASM}.fa >EVM.all.gff3.faa ``` #### Homology search (MMSeqs2) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_MMSeq2.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_MMSeq2 cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Define VARs NUMTH=56 GENOME=Mj_TUMSAT_v1.0 ANNOTATION=EVM GENEID=${ANNOTATION}.geneid GENEID_TRANSCRIPTID_CDSLEN=${ANNOTATION}.geneid.transcriptid.cdslen REF_DIR=/work/kawato/study/ref/proteins # Create symlinks ln -s ../Mj_TUMSAT_v1.0_EVidenceModeler/partition_EVM_inputs/EVM.all.gff3 ${ANNOTATION}.gff3 ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa # Extract .faa with gene id, transcript id, and CDS length information on each defline gffread ${ANNOTATION}.gff3 -V -E -S -g ${GENOME}.fa -y ${GENEID_TRANSCRIPTID_CDSLEN}.faa --table=@geneid,@cdslen # The deflines above carry extra fields after the entry name and are incompatible with seqkit grep.
Trim them off. cut -f 1 ${GENEID_TRANSCRIPTID_CDSLEN}.faa >${ANNOTATION}.faa # Define list names grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2}'|sort|uniq >${GENEID}.list grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${GENEID_TRANSCRIPTID_CDSLEN}.list # Extract the longest isoform from each locus (takes time) mkdir ${ANNOTATION} cat ${GENEID}.list |while read line;do mkdir ${ANNOTATION}/${line} awk -v line=${line} '$1 == line {print $0}' ${GENEID_TRANSCRIPTID_CDSLEN}.list |cut -f 2,3 |sort -r -n -k 2 >${ANNOTATION}/${line}/${line}.transcriptid.cdslen.list ID=`awk 'NR==1 {print $1}' ${ANNOTATION}/${line}/${line}.transcriptid.cdslen.list` seqkit grep -np "${ID}" ${ANNOTATION}.faa >${ANNOTATION}/${line}/${line}.faa done # Merge into a single fasta file cat ${GENEID}.list|while read line;do cat ${ANNOTATION}/${line}/${line}.faa >>${ANNOTATION}.nr.faa done seqkit stats ${ANNOTATION}.nr.faa >${ANNOTATION}.nr.faa.stats # MMSeqs2 conda activate mmseqs2 # Create db DB_UNIREF50=${REF_DIR}/uniref50 mmseqs createdb ${ANNOTATION}.nr.faa ${ANNOTATION}.nr mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein # search against uniref50 mmseqs search ${ANNOTATION}.nr ${DB_UNIREF50} ${ANNOTATION}.nr.uniref50.mmseqs.out tmp_uniref50 -e 1e-10 -a mmseqs convertalis ${ANNOTATION}.nr ${DB_UNIREF50} ${ANNOTATION}.nr.uniref50.mmseqs.out ${ANNOTATION}.nr.uniref50.mmseqs.out.formatmode0 --format-mode 0 # search against two penaeid shrimp species mmseqs search ${ANNOTATION}.nr ASM378908v1_protein ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-10 -a mmseqs convertalis ${ANNOTATION}.nr ASM378908v1_protein ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0 mmseqs search ${ANNOTATION}.nr NSTDA_Pmon_1_protein ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-10 -a mmseqs convertalis ${ANNOTATION}.nr NSTDA_Pmon_1_protein ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0 # Merge results cut -f 1 ${ANNOTATION}.nr.uniref50.mmseqs.out.formatmode0 ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |sort|uniq>${ANNOTATION}.nr.mmseqs.out.list # wc -l ${ANNOTATION}.nr.mmseqs.out.list # extract proteins with detectable similarity to known proteins seqkit grep -nf ${ANNOTATION}.nr.mmseqs.out.list ${ANNOTATION}.nr.faa >${ANNOTATION}.nr.mmseqs.hit.faa # grep ">" ${ANNOTATION}.nr.mmseqs.hit.faa |wc -l # cut -f 1 ${ANNOTATION}.nr.ASM378908v1_protein.mmseqs.out.formatmode0 ${ANNOTATION}.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |sort|uniq|wc -l # Convert EVM to gtf gffread ${ANNOTATION}.gff3 -g ${GENOME}.fa -E -T -o ${ANNOTATION}.gtf # Generate gff split -l 500 --numeric-suffixes --suffix-length=3 --elide-empty-files ${ANNOTATION}.nr.mmseqs.out.list ${ANNOTATION}.nr.mmseqs.out.list_ # generates 56 split files for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 ;do while read line;do grep "transcript_id \"${line}\"" ${ANNOTATION}.gtf done >${ANNOTATION}.nr.mmseqs.hit_${N}.gtf <${ANNOTATION}.nr.mmseqs.out.list_${N} & done cat ${ANNOTATION}.nr.mmseqs.hit_*.gtf>>${ANNOTATION}.nr.mmseqs.hit.gtf rm ${ANNOTATION}.nr.mmseqs.hit_*.gtf ${ANNOTATION}.nr.mmseqs.out.list_* gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -o ${ANNOTATION}.nr.mmseqs.hit.gff gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -y ${ANNOTATION}.nr.mmseqs.hit.faa gffread ${ANNOTATION}.nr.mmseqs.hit.gtf -V --keep-genes -g ${GENOME}.fa -x ${ANNOTATION}.nr.mmseqs.hit.fna ```
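The per-locus loop above (one directory per gene) is noted to take time. A minimal single-pass sketch that picks the longest CDS per gene from the same three-column list (gene id, transcript id, CDS length); the output names here are hypothetical, not the files used downstream:

```sh
# Sketch only: sort by gene id, then CDS length descending, and keep the first
# (longest) transcript seen per gene.
sort -t $'\t' -k1,1 -k3,3nr ${GENEID_TRANSCRIPTID_CDSLEN}.list \
  | awk -F'\t' '!seen[$1]++ {print $2}' >longest_per_gene.list
seqkit grep -nf longest_per_gene.list ${ANNOTATION}.faa >${ANNOTATION}.longest.faa
```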
#### Manual curation ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_AddMjPenII.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_MMSeq2 cd ${WHOME}${WDIR} # Define VARs NUMTH=56 GENOME=Mj_TUMSAT_v1.0 SCAFFOLD=scaffold_00504 MjPenII_CDS=transcript_21783_170-607 # Extract the scaffold containing the MjPenII gene seqkit grep -np "${SCAFFOLD}" ${GENOME}.fa >${SCAFFOLD}.fa # Align MjPenII CDS conda activate exonerate exonerate --model est2genome ${MjPenII_CDS}.fna ${SCAFFOLD}.fa --maxintron 2000 --percent 90 --showalignment no --showvulgar no --showtargetgff yes >${MjPenII_CDS}.${SCAFFOLD}.gff # I manually constructed transcript_21783_170-607.scaffold_00504.mod.gff (carrying the new model evm.model.scaffold_00504.17) by modifying transcript_21783_170-607.scaffold_00504.gff, based on the corresponding portion in EVM.up10.gff. I made the following modifications: # Removed the lines beginning with "#" and the final line "-- completed exonerate analysis" # Changed "exonerate:est2genome" to "EVM" ($2) # Removed lines containing "utr5", "splice5", "intron", "splice3", and "similarity" in $3 # Duplicated the "gene" line and edited the new copy from "gene" to "transcript"; its $9 was modified to: # ID=evm.model.scaffold_00504.17;Parent=evm.TU.scaffold_00504.17 # Duplicated the lines with "exon" in $3; the duplicated lines now have "CDS" in $3 # $9 on the "exon" and "CDS" lines was changed to "Parent=evm.model.scaffold_00504.17" # Final annotation containing the MjPenII gene: transcript_21783_170-607.scaffold_00504.mod.gff cat EVM.nr.mmseqs.hit.gff ${MjPenII_CDS}.${SCAFFOLD}.mod.gff >2021-06-02_Mj_genome_annotation.gff3 gffread 2021-06-02_Mj_genome_annotation.gff3 -V --keep-genes -g ${GENOME}.fa -y 2021-06-02_Mj_genome_annotation.faa gffread 2021-06-02_Mj_genome_annotation.gff3 -V --keep-genes -g ${GENOME}.fa -x 2021-06-02_Mj_genome_annotation.fna grep ">" 2021-06-02_Mj_genome_annotation.faa |wc -l grep ">" 2021-06-02_Mj_genome_annotation.fna |wc -l # BUSCO to show that the completeness has not been affected conda activate busco-4.1.4 busco -i 2021-06-02_Mj_genome_annotation.faa -l arthropoda_odb10 -o 2021-06-02_Mj_genome_annotation.busco -c 56 -m prot -f less 2021-06-02_Mj_genome_annotation.gff3|grep "gene"|cut -f 9|sed 's/ID=//g'|sed 's/TU/model/g'|grep -v "^#" >2021-06-02_Mj_genome_annotation.gff3.gene.list wc -l 2021-06-02_Mj_genome_annotation.gff3.gene.list ``` transcript_21783_170-607.fna ``` >transcript_21783_170-607 ATGCGTGTCTTGGTTTTCCTGGCGTTCTTGGTTTTGTTAGCCTTACTCTGTCAAGTGTAC GCCAAGGGCTCCTCAAGTTCGAGTTCGAGTTCTCGATCCTCAAGTTCGAGTTATCGATCC TCAGGTTCGAGTTATCGATCCCCAGGTTCGAGTTATCGATCCCCAGGTTCGAGTTATCGA TCATCTGGCTCCTACGGAACTTCAGGTTCTCGACTCCCTGGCATTCGTCCCTCTTCTCGA TCATATCGCACCGGTTTTCGAACAGCTGGATCCGTCGGACCTGCAACTCGTCCCTTTACT CGTCCAACTGGGCCCCTTAAGCCCATTAGCCGGCCTCCTTCCCGAGCAGCCTGCTATTCA TGCTACAGCGCAAGCAGCGCAACGGCAATACAATGCTGCACCCACTATAGTCTTTGTTGC AATCTGGTGAAAGGATAG ```
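The manual edits described in Mj_TUMSAT_v1.0_AddMjPenII.sh follow a mechanical pattern. A rough sketch of the filtering part only (comment removal, feature filtering, source renaming), assuming the exonerate GFF layout described above; the duplication of gene/exon lines and the ID rewriting were done by hand:

```sh
# Sketch only: strip comment/run-report lines, drop utr5/splice5/intron/splice3/
# similarity features, and rename the source column ($2) to EVM.
grep -v '^#' transcript_21783_170-607.scaffold_00504.gff \
  | grep -v 'completed exonerate analysis' \
  | awk 'BEGIN{FS=OFS="\t"} $3!="utr5" && $3!="splice5" && $3!="intron" \
         && $3!="splice3" && $3!="similarity" {$2="EVM"; print}' \
  >curation_base.gff   # hypothetical intermediate, finished by hand
```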
#### UTR recovery (PASA updates) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_PASA_update.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_PASA_update cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Set VARs PASAHOME=/work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1 NUMTH=56 W_DIR=${WHOME}${WDIR}/ GENOME=Mj_TUMSAT_v1.0 ANNOTATION=2021-06-02_Mj_genome_annotation # Create symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME}.fa ln -s ../Mj_TUMSAT_v1.0_MMSeq2/2021-06-02_Mj_genome_annotation.gff3 ${ANNOTATION}.gff3 ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRK01.1.fsa_nt ICRK01.1.fsa_nt ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ICRJ01.1.fsa_nt ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/2020-10-13_Mj_RNA-seq/trinity_out_dir/Trinity.fasta Trinity.fasta conda activate pasa # Prepare transcripts cat ICRK01.1.fsa_nt ICRJ01.1.fsa_nt Trinity.fasta >tr.fa TRANSCRIPT=tr.fa ${PASAHOME}/bin/seqclean ${TRANSCRIPT} -L -c ${NUMTH} TRANSCRIPT_CLEAN=${TRANSCRIPT}.clean # Prepare PASA config file MIN_PERCENT_ALIGNED=90 MIN_AVG_PER_ID=95 RUN=2021-06-03_pasa cat /work/kawato/anaconda3/envs/pasa/opt/pasa-2.4.1/pasa_conf/pasa.alignAssembly.Template.txt |\ sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g"|sed "s@<__MIN_PERCENT_ALIGNED__>@${MIN_PERCENT_ALIGNED}@g"|sed "s@<__MIN_AVG_PER_ID__>@${MIN_AVG_PER_ID}@g" >${RUN}.alignAssembly.config # Map transcripts ${PASAHOME}/Launch_PASA_pipeline.pl -C -R \ --config ${RUN}.alignAssembly.config \ --ALIGNERS blat \ --genome ${GENOME}.fa \ -t ${TRANSCRIPT_CLEAN} -T -u ${TRANSCRIPT} \ --TRANSDECODER \ --CPU ${NUMTH} --transcribed_is_aligned_orient # Create PASA annotCompare config cat ${PASAHOME}/pasa_conf/pasa.annotationCompare.Template.txt |\ sed "s@<__DATABASE__>@${W_DIR}${RUN}_db@g" >${RUN}.annotCompare.config # load annotation ${PASAHOME}/scripts/Load_Current_Gene_Annotations.dbi \ -c ${RUN}.alignAssembly.config -g ${GENOME}.fa \ -P ${ANNOTATION}.gff3 # PASA update, round 1 $PASAHOME/Launch_PASA_pipeline.pl \ -c ${RUN}.annotCompare.config -A \ -g ${GENOME}.fa \ -t ${TRANSCRIPT_CLEAN} --CPU ${NUMTH} # Load post-update PASA annotation ${PASAHOME}/scripts/Load_Current_Gene_Annotations.dbi \ -c ${RUN}.alignAssembly.config -g ${GENOME}.fa \ -P 2021-06-03_pasa_db.gene_structures_post_PASA_updates.46092.gff3 # PASA update, round 2 $PASAHOME/Launch_PASA_pipeline.pl \ -c ${RUN}.annotCompare.config -A \ -g ${GENOME}.fa \ -t ${TRANSCRIPT_CLEAN} --CPU ${NUMTH} ```
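Each PASA update is a load-then-compare pair, and the post-update GFF3 filename carries a run-specific number (46092 above), so the newest gene_structures_post_PASA_updates file is reloaded for the next round. A minimal sketch of the two rounds as a loop, under the assumption that the newest such file is always the one to reload:

```sh
# Sketch only: POST starts at the pre-update annotation; after each round the
# newest post-update GFF3 becomes the input of the next load step.
POST=${ANNOTATION}.gff3
for round in 1 2; do
  ${PASAHOME}/scripts/Load_Current_Gene_Annotations.dbi \
    -c ${RUN}.alignAssembly.config -g ${GENOME}.fa -P ${POST}
  ${PASAHOME}/Launch_PASA_pipeline.pl -c ${RUN}.annotCompare.config -A \
    -g ${GENOME}.fa -t ${TRANSCRIPT_CLEAN} --CPU ${NUMTH}
  POST=$(ls -t ${RUN}_db.gene_structures_post_PASA_updates.*.gff3 | head -n 1)
done
```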
#### Incorporation of IsoSeq and shotgun transcriptome assemblies (SQANTI3, TAMA, TransDecoder) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_sqanti3_TAMA_TransDecoder.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_sqanti3_TAMA_TransDecoder cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Set VARs conda activate SQANTI3.env export PYTHONPATH=${PYTHONPATH}:/work/kawato/apps/cDNA_Cupcake/sequence/ GENOME=Mj_TUMSAT_v1.0.fa CDNA=ISOSEQ.fa RNASEQ=RNASEQ.fa NUMTH=56 TAMAHOME=/work/kawato/apps/tama REF_DIR=/work/kawato/study/ref/proteins # make symlinks ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ${CDNA} ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRK01.1.fsa_nt ${RNASEQ} ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${GENOME} ln -s ../Mj_TUMSAT_v1.0_PASA_update/2021-06-03_pasa_db.gene_structures_post_PASA_updates.39230.gff3 PASA.gff3 # SQANTI3 accepts GTF only gffread PASA.gff3 -g ${GENOME} -E -T -o PASA.gtf # SQANTI3, ISOSEQ minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${GENOME} ${CDNA} > ${CDNA}.sam sort -k 3,3 -k 4,4n ${CDNA}.sam > ${CDNA}.sorted.sam collapse_isoforms_by_sam.py --input ${CDNA} -s ${CDNA}.sorted.sam --dun-merge-5-shorter -o ${CDNA}.align -c 0.90 # coverage cutoff 0.90 = 90%. ${CDNA}.align.collapsed.gff is in GFF2 format (= GTF) python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${CDNA}.align.collapsed.gff PASA.gtf ${GENOME} # SQANTI3, RNASEQ minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${GENOME} ${RNASEQ} > ${RNASEQ}.sam sort -k 3,3 -k 4,4n ${RNASEQ}.sam > ${RNASEQ}.sorted.sam collapse_isoforms_by_sam.py --input ${RNASEQ} -s ${RNASEQ}.sorted.sam --dun-merge-5-shorter -o ${RNASEQ}.align -c 0.90 # the .collapsed.gff is in GFF2 format (= GTF) python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${RNASEQ}.align.collapsed.gff PASA.gtf ${GENOME} # TAMA conda activate tama-py27 # Convert to BED python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py PASA.gtf PASA.bed python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py ${RNASEQ}.align.collapsed_corrected.gtf ${RNASEQ}.align.collapsed_corrected.bed python ${TAMAHOME}/tama_go/format_converter/tama_format_gtf_to_bed12_ncbi.py ${CDNA}.align.collapsed_corrected.gtf ${CDNA}.align.collapsed_corrected.bed echo -e "${CDNA}.align.collapsed_corrected.bed\tno_cap\t1,1,1\tCDNA\n${RNASEQ}.align.collapsed_corrected.bed\tno_cap\t2,1,2\tRNASEQ\nPASA.bed\tno_cap\t2,1,2\tPASA" > filelist.txt python ${TAMAHOME}/tama_merge.py -f filelist.txt -p merged_annos -s PASA -cds PASA python2 ${TAMAHOME}/tama_go/format_converter/tama_convert_bed_gtf_ensembl_no_cds.py merged_annos.bed merged_annos.gtf # TransDecoder conda activate pasa # TransDecoder bundled with PASA # Extract transcripts gtf_genome_to_cdna_fasta.pl merged_annos.gtf ${GENOME} >merged_annos.transcripts.fa gtf_to_alignment_gff3.pl merged_annos.gtf > merged_annos.gff3 TransDecoder.LongOrfs -S -t merged_annos.transcripts.fa # MMSeqs2 conda activate mmseqs2 # Create db ln -s merged_annos.transcripts.fa.transdecoder_dir/longest_orfs.pep longest_orfs.pep mmseqs createdb longest_orfs.pep longest_orfs mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein DB_UNIREF50=${REF_DIR}/uniref50 # search against uniref50 mmseqs search longest_orfs ${DB_UNIREF50} longest_orfs.uniref50.mmseqs.out tmp_uniref50 -e 1e-10 -a mmseqs convertalis longest_orfs ${DB_UNIREF50} longest_orfs.uniref50.mmseqs.out longest_orfs.uniref50.mmseqs.out.formatmode0 --format-mode 0 # search against two penaeid shrimp species mmseqs search longest_orfs ASM378908v1_protein longest_orfs.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-10 -a mmseqs convertalis longest_orfs ASM378908v1_protein longest_orfs.ASM378908v1_protein.mmseqs.out longest_orfs.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0 mmseqs search longest_orfs NSTDA_Pmon_1_protein longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-10 -a mmseqs convertalis longest_orfs NSTDA_Pmon_1_protein longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0 # Concatenate MMSeqs2 results cat longest_orfs.uniref50.mmseqs.out.formatmode0 longest_orfs.ASM378908v1_protein.mmseqs.out.formatmode0 longest_orfs.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 >longest_orfs.all.mmseqs.out.formatmode0 conda activate pasa # Predict ORFs
with homology to known proteins TransDecoder.Predict --single_best_only -t merged_annos.transcripts.fa --retain_blastp_hits longest_orfs.all.mmseqs.out.formatmode0 cdna_alignment_orf_to_genome_orf.pl merged_annos.transcripts.fa.transdecoder.gff3 merged_annos.gff3 merged_annos.transcripts.fa > merged_annos.transcripts.fa.transdecoder.genome.gff3 # Extract ORF gffread merged_annos.transcripts.fa.transdecoder.genome.gff3 -V -E -S -g ${GENOME} -y merged_annos.transcripts.fa.transdecoder.genome.faa conda activate mmseqs2 # Further filter low-quality models using MMSeqs2 QUERY=TransDecoder_ORF ln -s merged_annos.transcripts.fa.transdecoder.genome.faa ${QUERY}.faa # Create db mmseqs createdb ${QUERY}.faa ${QUERY} # search against uniref50 mmseqs search ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out tmp_uniref50 -e 1e-03 -a mmseqs convertalis ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out ${QUERY}.uniref50.mmseqs.out.formatmode0 --format-mode 0 # search against two penaeid shrimp species mmseqs search ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-03 -a mmseqs convertalis ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0 mmseqs search ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-03 -a mmseqs convertalis ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0 # Merge MMSeqs2 results cat ${QUERY}.uniref50.mmseqs.out.formatmode0 ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 |cut -f 1|sort|uniq >${QUERY}.mmseqs.out.list wc -l ${QUERY}.mmseqs.out.list # 60281 TransDecoder_ORF.mmseqs.out.list less TransDecoder_ORF.mmseqs.out.list|cut -f 1 -d "."|sed 's/G//g'|sort -n|uniq |sed 's/^/G/g'>${QUERY}.mmseqs.out.gene.list wc -l ${QUERY}.mmseqs.out.gene.list # 26387 TransDecoder_ORF.mmseqs.out.gene.list # Remove lengthy descriptions; split to chunks to make it faster split -l 500 --numeric-suffixes --suffix-length=3 --elide-empty-files ${QUERY}.mmseqs.out.gene.list ${QUERY}.mmseqs.out.gene.list_ # generates 53 split files for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052;do cat ${QUERY}.mmseqs.out.gene.list_${N} |while read line;do grep -E "${line}[;.]" merged_annos.transcripts.fa.transdecoder.genome.gff3 done > merged_annos.transcripts.fa.transdecoder.genome.selected_${N}.gff3 & done # Merge into one file cat merged_annos.transcripts.fa.transdecoder.genome.selected_*.gff3 >Mj_TUMSAT_v1.0_transdecoder.gff3 rm merged_annos.transcripts.fa.transdecoder.genome.selected_*.gff3 ${QUERY}.mmseqs.out.gene.list_* awk '$3=="gene" {print $0}' Mj_TUMSAT_v1.0_transdecoder.gff3|wc -l # 26387 # Remove genes overlapping with 18S/28S rRNA genes grep -v "G1771[23456789][.;]" Mj_TUMSAT_v1.0_transdecoder.gff3 >Mj_TUMSAT_v1.0_transdecoder.noOverlap.gff3 ``` ### Non-coding RNA prediction #### tRNAscan-SE 2.0 (tRNA) ```sh #!/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_tRNAscan-SE.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_tRNAscan-SE cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Set VARs NUMTH=56 ASM=Mj_TUMSAT_v1.0 # Make symlinks ln -s
../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa conda activate trnascan-se tRNAscan-SE ${ASM}.fa -o ${ASM}_trnascan-se_out -f ${ASM}_trnascan-se_struct -m ${ASM}_trnascan-se_stats -b ${ASM}_trnascan-se.bed -a ${ASM}_trnascan-se.fa -l ${ASM}_trnascan-se.log # List up and eliminate low-quality genes (Undetermined and possible pseudogenes) grep -e pseudo -e Undet ${ASM}_trnascan-se_out|awk '{print $1".tRNA"$2"-"}' >${ASM}_trnascan-se_out.lowqual grep -v -f ${ASM}_trnascan-se_out.lowqual ${ASM}_trnascan-se.bed >${ASM}_trnascan-se.filtered.bed # Convert to GFF3 gffread ${ASM}_trnascan-se.filtered.bed -F -o ${ASM}_trnascan-se.filtered.gff3 # Add gene_biotype=tRNA, parent-child structure, and functional description cat ${ASM}_trnascan-se.filtered.gff3|awk 'BEGIN{FS = "\t";OFS="\t"}; {if($3=="transcript") { {genename = gensub("ID=","",1, $9)}; {print $1,"tRNAscan-SE","gene",$4,$5,$6,$7,$8,"ID="genename";gene_biotype=tRNA" }; {print $1,"tRNAscan-SE","tRNA",$4,$5,$6,$7,$8,"ID=rna-"genename";Parent="genename";" }; {print $1,"tRNAscan-SE","exon",$4,$5,$6,$7,$8,"Parent=rna-"genename";" } } }'|awk 'BEGIN{FS = "\t";OFS="\t"}{ if($3=="tRNA"||$3=="exon"){ if($9 ~ /AlaAGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon AGC\);product=tRNA-Ala"}; if($9 ~ /AlaCGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon CGC\);product=tRNA-Ala"}; if($9 ~ /AlaTGC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA alanine \(anticodon UGC\);product=tRNA-Ala"}; if($9 ~ /ArgACG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon ACG\);product=tRNA-Arg"}; if($9 ~ /ArgCCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon CCG\);product=tRNA-Arg"}; if($9 ~ /ArgCCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon CCU\);product=tRNA-Arg"}; if($9 ~ /ArgGCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon GCG\);product=tRNA-Arg"}; if($9 ~ /ArgTCG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon UCG\);product=tRNA-Arg"}; if($9 ~ /ArgTCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA arginine \(anticodon UCU\);product=tRNA-Arg"}; if($9 ~ /AsnGTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA asparagine \(anticodon GUU\);product=tRNA-Asn"}; if($9 ~ /AspATC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA aspartic acid \(anticodon AUC\);product=tRNA-Asp"}; if($9 ~ /AspGTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA aspartic acid \(anticodon GUC\);product=tRNA-Asp"}; if($9 ~ /CysACA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA cysteine \(anticodon ACA\);product=tRNA-Cys"}; if($9 ~ /CysGCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA cysteine \(anticodon GCA\);product=tRNA-Cys"}; if($9 ~ /GlnCTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamine \(anticodon CUG\);product=tRNA-Gln"}; if($9 ~ /GlnTTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamine \(anticodon UUG\);product=tRNA-Gln"}; if($9 ~ /GluCTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamic acid \(anticodon CUC\);product=tRNA-Glu"}; if($9 ~ /GluTTC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glutamic acid \(anticodon UUC\);product=tRNA-Glu"}; if($9 ~ /GlyCCC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon CCC\);product=tRNA-Gly"}; if($9 ~ /GlyGCC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon GCC\);product=tRNA-Gly"}; if($9 ~ /GlyTCC/) {print
$1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA glycine \(anticodon UCC\);product=tRNA-Gly"}; if($9 ~ /HisGTG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA histidine \(anticodon GUG\);product=tRNA-His"}; if($9 ~ /IleAAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon AAU\);product=tRNA-Ile"}; if($9 ~ /IleGAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon GAU\);product=tRNA-Ile"}; if($9 ~ /IleTAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA isoleucine \(anticodon UAU\);product=tRNA-Ile"}; if($9 ~ /LeuAAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon AAG\);product=tRNA-Leu"}; if($9 ~ /LeuCAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon CAA\);product=tRNA-Leu"}; if($9 ~ /LeuCAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon CAG\);product=tRNA-Leu"}; if($9 ~ /LeuGAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon GAG\);product=tRNA-Leu"}; if($9 ~ /LeuTAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon UAA\);product=tRNA-Leu"}; if($9 ~ /LeuTAG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA leucine \(anticodon UAG\);product=tRNA-Leu"}; if($9 ~ /LysCTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA lysine \(anticodon CUU\);product=tRNA-Lys"}; if($9 ~ /LysTTT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA lysine \(anticodon UUU\);product=tRNA-Lys"}; if($9 ~ /MetCAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA methionine \(anticodon CAU\);product=tRNA-Met"}; if($9 ~ /PheAAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA phenylalanine \(anticodon AAA\);product=tRNA-Phe"}; if($9 ~ /PheGAA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA phenylalanine \(anticodon GAA\);product=tRNA-Phe"}; if($9 ~ /ProAGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon AGG\);product=tRNA-Pro"}; if($9 ~ /ProCGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon CGG\);product=tRNA-Pro"}; if($9 ~ /ProTGG/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA proline \(anticodon UGG\);product=tRNA-Pro"}; if($9 ~ /SeCTCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA selenocysteine \(anticodon UCA\);product=tRNA-SeC"}; if($9 ~ /SerAGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon AGA\);product=tRNA-Ser"}; if($9 ~ /SerCGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon CGA\);product=tRNA-Ser"}; if($9 ~ /SerGCT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon GCU\);product=tRNA-Ser"}; if($9 ~ /SerGGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon GGA\);product=tRNA-Ser"}; if($9 ~ /SerTGA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA serine \(anticodon UGA\);product=tRNA-Ser"}; if($9 ~ /SupCTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA amber suppressor \(anticodon CTA\);product=tRNA-Sup"}; if($9 ~ /SupTCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA opal suppressor \(anticodon UCA\);product=tRNA-Sup"}; if($9 ~ /SupTTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA ochre suppressor \(anticodon UUA\);product=tRNA-Sup"}; if($9 ~ /ThrAGT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon AGU\);product=tRNA-Thr"}; if($9 ~ /ThrCGT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon CGU\);product=tRNA-Thr"}; if($9 ~ /ThrTGT/) {print 
$1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA threonine \(anticodon UGU\);product=tRNA-Thr"}; if($9 ~ /TrpCCA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tryptophan \(anticodon CCA\);product=tRNA-Trp"}; if($9 ~ /TyrATA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tyrosine \(anticodon AUA\);product=tRNA-Tyr"}; if($9 ~ /TyrGTA/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA tyrosine \(anticodon GUA\);product=tRNA-Tyr"}; if($9 ~ /ValAAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon AAC\);product=tRNA-Val"}; if($9 ~ /ValCAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon CAC\);product=tRNA-Val"}; if($9 ~ /ValGAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon GAC\);product=tRNA-Val"}; if($9 ~ /ValTAC/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA valine \(anticodon UAC\);product=tRNA-Val"}; if($9 ~ /iMetCAT/) {print $1,$2,$3,$4,$5,$6,$7,$8,$9"Note=transfer RNA initiator methionine \(anticodon CAU\);product=tRNA-iMet"}}else{print}}'\ >tRNA_curated.gff3 ``` #### Prediction of other non-coding RNA genes (Infernal) ```sh # !/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_Infernal.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_Infernal cd ${WHOME} mkdir ${WDIR} cd ${WDIR} # Set VARs NUMTH=56 ASM=Mj_TUMSAT_v1.0 # Make symlinks ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa ${ASM}.fa # Prepare cm mkdir cm;cd cm wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz; gunzip Rfam.cm.gz wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin cmpress Rfam.cm cd .. seqkit split ${ASM}.fa --by-part 56 conda activate infernal-1.1.4 for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do cmscan -Z 3410 --cut_ga --rfam --nohmmonly --oclan --oskip --fmt 2 --tblout ${ASM}.fa.split/${ASM}.part_${N}.Rfam.tbl --clanin cm/Rfam.clanin cm/Rfam.cm ${ASM}.fa.split/${ASM}.part_${N}.fa >${ASM}.fa.split/${ASM}.part_${N}.Rfam.cmscan & done # Convert to GFF3 cd ${ASM}.fa.split for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do perl /work/kawato/scripts/infernal-tblout2gff.pl --cmscan --fmt2 --desc ${ASM}.part_${N}.Rfam.tbl ${ASM}.part_${N}.Rfam.gff3 done # Concatenate into a single file cat ${ASM}.part_*.Rfam.gff3 >${ASM}.Rfam.gff3 # Filter out false positives and tRNA; assign gene number to each entry less ${ASM}.Rfam.gff3|grep -v "tRNA"|grep -v "mir-"|grep -v "microRNA"|grep -v "Histone"|grep -v "Phe_leader"|grep -v "K_chan"|grep -v "CRISPR"|grep -v "mini-ykkC"|grep -v "MIR"|grep -v "SSU_rRNA_microsporidia" |grep -v "IRE_II"|grep -v "let-7"|grep -v "LSU_rRNA_bacteria"|grep -v "Sphinx_1"|grep -v "RNAI" |grep -v "PrrB_RsmZ"|\ awk -F'\t' -vOFS='\t' 'BEGIN {num=1};{ $3="exon";$9 = "Parent=gene" num ";product=" $9;num+=1 }1' >${ASM}.Rfam.parsed.gff3 # Make fasta file gffread -g ../${ASM}.fa -F -w ${ASM}.Rfam.parsed.fa ${ASM}.Rfam.parsed.gff3 -v --keep-genes -E --table product # Remove product name on the title line (which makes it incompatible with seqkit grep) cut -f 1 ${ASM}.Rfam.parsed.fa > ${ASM}.Rfam.parsed.mod.fa # Create a list of target 
#### Prediction of other non-coding RNA genes (Infernal)

```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_Infernal.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_Infernal
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0

# Make symlinks
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa ${ASM}.fa

# Prepare cm
mkdir cm;cd cm
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz; gunzip Rfam.cm.gz
wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
cmpress Rfam.cm
cd ..

seqkit split ${ASM}.fa --by-part 56

conda activate infernal-1.1.4
for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do
cmscan -Z 3410 --cut_ga --rfam --nohmmonly --oclan --oskip --fmt 2 --tblout ${ASM}.fa.split/${ASM}.part_${N}.Rfam.tbl --clanin cm/Rfam.clanin cm/Rfam.cm ${ASM}.fa.split/${ASM}.part_${N}.fa >${ASM}.fa.split/${ASM}.part_${N}.Rfam.cmscan &
done

# Convert to GFF3
cd ${ASM}.fa.split
for N in 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056;do
perl /work/kawato/scripts/infernal-tblout2gff.pl --cmscan --fmt2 --desc ${ASM}.part_${N}.Rfam.tbl > ${ASM}.part_${N}.Rfam.gff3
done

# Concatenate into a single file
cat ${ASM}.part_*.Rfam.gff3 >${ASM}.Rfam.gff3

# Filter out false positives and tRNA; assign gene number to each entry
less ${ASM}.Rfam.gff3|grep -v "tRNA"|grep -v "mir-"|grep -v "microRNA"|grep -v "Histone"|grep -v "Phe_leader"|grep -v "K_chan"|grep -v "CRISPR"|grep -v "mini-ykkC"|grep -v "MIR"|grep -v "SSU_rRNA_microsporidia"|grep -v "IRE_II"|grep -v "let-7"|grep -v "LSU_rRNA_bacteria"|grep -v "Sphinx_1"|grep -v "RNAI"|grep -v "PrrB_RsmZ"|\
awk -F'\t' -vOFS='\t' 'BEGIN {num=1};{ $3="exon";$9 = "Parent=gene" num ";product=" $9;num+=1 }1' >${ASM}.Rfam.parsed.gff3

# Make fasta file
gffread -g ../${ASM}.fa -F -w ${ASM}.Rfam.parsed.fa ${ASM}.Rfam.parsed.gff3 -v --keep-genes -E --table product

# Remove product name on the title line (which makes it incompatible with seqkit grep)
cut -f 1 ${ASM}.Rfam.parsed.fa > ${ASM}.Rfam.parsed.mod.fa

# Create a list of target products. We decided to exclude Small_nucleolar_RNA_SNORA23 due to ambiguity in homology
echo -e "5S_ribosomal_RNA\nEukaryotic_large_subunit_ribosomal_RNA\n5.8S_ribosomal_RNA\nEukaryotic_small_subunit_ribosomal_RNA\n7SK_RNA\nU1_spliceosomal_RNA\nU2_spliceosomal_RNA\nU4_spliceosomal_RNA\nU5_spliceosomal_RNA\nU6_spliceosomal_RNA\nU11_spliceosomal_RNA\nU12_minor_spliceosomal_RNA\nU6atac_minor_spliceosomal_RNA\nSmall_nucleolar_RNA_SNORD18\nSmall_nucleolar_RNA_SNORD31\nSmall_nucleolar_RNA_SNORD34\nSmall_nucleolar_RNA_SNORA53\nSmall_nucleolar_RNA_U3\nSmall_nucleolar_RNA_snR61/Z1/Z11\nSmall_Cajal_body_specific_RNA_8\nMetazoan_signal_recognition_particle_RNA\nNuclear_RNase_P" >Rfam.target_products.list

# Make lists of entries for each gene
less Rfam.target_products.list|while read line;do
grep ">" ${ASM}.Rfam.parsed.fa|grep "$line" |sed 's/>//g'|cut -f 1 >`echo "${line}"|sed 's@/@_@g'`.list
done

# Create fasta files for each gene
less Rfam.target_products.list|while read line;do
seqkit grep -nf `echo "${line}"|sed 's@/@_@g'`.list ${ASM}.Rfam.parsed.mod.fa >`echo "${line}"|sed 's@/@_@g'`.fa
done

# Cluster using cd-hit-est
conda activate cd-hit
less Rfam.target_products.list|while read line;do
cd-hit-est -i `echo "${line}"|sed 's@/@_@g'`.fa -o `echo "${line}"|sed 's@/@_@g'`.nr -c 0.98 -s 0.95
done

# The files were aligned using MAFFT (https://mafft.cbrc.jp/alignment/server/cgi-bin/mafft5.cgi) online with default settings (accessed June 2021).
# Clusters of obviously aberrant lengths were filtered out.
# The sequences were additionally queried against the NCBI non-redundant nucleotide database using BLASTN (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) (accessed June 2021).
# retain_gene_list was created manually.

# Parse GFF3 as per retain_gene_list; add appropriate notes and attributes to each entry
cat retain_gene_list|while read line;do
grep "Parent=${line};" ${ASM}.Rfam.parsed.gff3
done |awk 'BEGIN{FS = "\t";OFS="\t"} {gsub(/_/," ",$9);r = gensub("Parent=","",1, gensub( ";.+", "", 1, $9 ))}; $9 ~ /ribosomal RNA/ {$3="rRNA"}; $9 ~ /7SK RNA|spliceosomal/ {$3="snRNA"}; $9 ~ /Small nucleolar/ {$3="snoRNA"}; $9 ~ /Small Cajal body specific RNA 8/ {$3="guide_RNA"}; $9 ~ /Metazoan signal recognition particle RNA/ {$3="scRNA"}; $9 ~ /Nuclear RNase P/ {$3="RNase_P_RNA"}; {print $1,$2,"gene",$4,$5,$6,$7,$8,"ID="substr($9,8)";"}; {print $1,$2,$3,$4,$5,$6,$7,$8,"ID=rna-"r";Parent="substr($9,8)";"}; {print $1,$2,"exon",$4,$5,$6,$7,$8,"Parent=rna-"substr($9,8)";"}'|\
awk 'BEGIN{FS = "\t";OFS="\t"};{ if ($3 == "gene"){match($9, /;/); if($9 ~ /ribosomal RNA/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=rRNA"}; if($9 ~ /7SK RNA|spliceosomal/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=snRNA"}; if($9 ~ /Small nucleolar/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=snoRNA"}; if($9 ~ /Small Cajal body specific RNA 8/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=guide_RNA"}; if($9 ~ /Metazoan signal recognition particle RNA/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=scRNA"}; if($9 ~ /Nuclear RNase P/) {print $1,$2,$3,$4,$5,$6,$7,$8,substr($9, 1, RSTART-1)";gene_biotype=RNase_P_RNA"} } else {print}}' >${ASM}.Rfam.curated.gff3
```
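A note on `cmscan -Z 3410`: Rfam's genome-annotation protocol sets `-Z` to the total search space in Mb, counting both strands (Z = 2 × genome length / 10^6), so a ~1.7 Gb assembly gives ~3,410. A sketch of the calculation (the assembly length used here is illustrative):

```sh
# Hypothetical assembly length in bp; prints the -Z value in Mb.
GENOME_LEN=1705000000
awk -v len=${GENOME_LEN} 'BEGIN{printf "-Z %.0f\n", 2*len/1e6}'
```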
### Merge predicted genes and assign locus IDs

```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_merge_gff3.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_merge_gff3
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0

# Make symlinks
ln -s ../filter_lowqual_seq/${ASM}.fa ${ASM}.fa
ln -s ../${ASM}_Infernal/${ASM}.fa.split/${ASM}.Rfam.curated.gff3 ncRNA.gff3
ln -s ../${ASM}_tRNAscan-SE/tRNA_curated.gff3 tRNA.gff3
ln -s ../${ASM}_sqanti3_TAMA_TransDecoder/${ASM}_transdecoder.noOverlap.gff3 ${ASM}_transdecoder.noOverlap.gff3

# Remove descriptions; add "gene_biotype=protein_coding"
cat ${ASM}_transdecoder.noOverlap.gff3 |sed -E 's/;Name=ORF.+//g'|awk 'BEGIN{FS = "\t";OFS="\t"};{ if ($3 == "gene"){print $1,$2,$3,$4,$5,$6,$7,$8,$9";gene_biotype=protein_coding"} else {print}}'>protein_coding.gff3

# Remove predicted RNA genes overlapping with CDS
awk '$3 == "CDS" {print $0}' protein_coding.gff3 >protein_coding.cds.gff3
bedtools intersect -v -a tRNA.gff3 -b protein_coding.cds.gff3 >tRNA.noOverlap.gff3
bedtools intersect -v -a ncRNA.gff3 -b protein_coding.cds.gff3 >ncRNA.noOverlap.gff3

# Combine all
cat protein_coding.gff3 tRNA.noOverlap.gff3 ncRNA.noOverlap.gff3|sed '1s/^/##gff-version 3\n/' >all.gff3

# Sort GFF3
conda activate genometools
gt gff3 -sort -retainids -addids no all.gff3 | grep -v "^###$" >all.sort.gff3

# Create a table containing old and new gene ids
cat all.sort.gff3 |awk '$3 =="gene"{print $9}'|sed -E 's/;.+//g'|sed 's/ID=//g'>all.sort.gff3.gene.list
NUM=1
cat all.sort.gff3.gene.list|while read line;do
echo -e "${line}\tLOCUS`printf "%07d" $((NUM))`00"
NUM=$((NUM+1))
done >all.sort.gff3.gene.list.mod &

# Rename genes
cat all.sort.gff3.gene.list.mod|while read BEFORE AFTER;do
grep -E "${BEFORE}[.;]" all.sort.gff3 |sed "s/${BEFORE}/${AFTER}/g";
done >all.sort.mod.gff3

# Rename scaffolds to the NCBI accession numbers
cat all.sort.mod.gff3 |sed 's/^scaffold_/BOPN010/g' |sed -E 's/^(BOPN[0-9]+)\t/\1.1\t/g'|awk 'BEGIN{FS = "\t";OFS="\t"};{if($3=="gene") {match($9, /ID=/);print $1,$2,$3,$4,$5,$6,$7,$8,$9";Name="substr($9, RLENGTH+1,14)}else {print}}' >annotation.tmp.gff3
```
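The new locus IDs are built as `LOCUS` + a zero-padded serial + a literal `00` suffix, so consecutive genes are numbered 100 apart, leaving headroom for genes inserted later without renumbering. A quick illustration:

```sh
# Gene serials 1 and 2 map to LOCUS000000100 and LOCUS000000200.
printf 'LOCUS%07d00\n' 1 2
```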
### Functional annotation of protein-coding genes (AHRD, KAAS)

#### Extracting non-redundant protein sequences

```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_extractNrProteins.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_extractNrProteins
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Define VARs
GENOME=Mj_TUMSAT_v1.0
GENEID=${GENOME}.geneid
GENEID_TRANSCRIPTID_CDSLEN=${GENEID}.transcriptid.cdslen
ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${GENOME}.fa
ln -s ../Mj_TUMSAT_v1.0_merge_gff3/annotation.tmp.gff3 ${GENOME}.gff3

# Extract .faa, with geneid and cdslen information appended after the transcript id
gffread ${GENOME}.gff3 -V -E -S -g ${GENOME}.fa -y ${GENEID_TRANSCRIPTID_CDSLEN}.faa --table=@geneid,@cdslen
# The file above has a space between the entry name and the appended information, which is incompatible with seqkit grep; trim it off
cut -f 1 ${GENEID_TRANSCRIPTID_CDSLEN}.faa >${GENOME}.faa

# Create ID lists
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2}'|sort|uniq >${GENEID}.list
grep ">" ${GENEID_TRANSCRIPTID_CDSLEN}.faa|sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${GENEID_TRANSCRIPTID_CDSLEN}.list

# Extract the longest isoform from each locus
mkdir ${GENOME}
cat ${GENEID}.list |while read line;do
mkdir ${GENOME}/${line}
awk -v line=${line} '$1 == line {print $0}' ${GENEID_TRANSCRIPTID_CDSLEN}.list |cut -f 2,3 |sort -r -n -k 2 >${GENOME}/${line}/${line}.transcriptid.cdslen.list
ID=`awk 'NR==1 {print $1}' ${GENOME}/${line}/${line}.transcriptid.cdslen.list`
seqkit grep -np "${ID}" ${GENOME}.faa >${GENOME}/${line}/${line}.faa
done

# Merge into a single fasta file
cat ${GENEID}.list|while read line;do
cat ${GENOME}/${line}/${line}.faa
done >${GENOME}.nr.faa
seqkit stats ${GENOME}.nr.faa >${GENOME}.nr.faa.stats

conda activate busco-4.1.4
busco -i ${GENOME}.nr.faa -l arthropoda_odb10 -o ${GENOME}.nr.busco -c 56 -m prot -f
busco -i ${GENOME}.faa -l arthropoda_odb10 -o ${GENOME}.busco -c 56 -m prot -f
```
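Within each locus directory, the two-column transcriptid/cdslen table is reverse-sorted numerically on the CDS length, so line 1 always holds the longest isoform, which becomes the locus representative. A toy example (hypothetical IDs):

```sh
# The .2 isoform has the longer CDS and is selected.
printf 'LOCUS000000100.1\t900\nLOCUS000000100.2\t1500\n' | sort -r -n -k 2 | awk 'NR==1 {print $1}'
# -> LOCUS000000100.2
```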
#### Functional annotation (AHRD)

```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_MMSeqs2_AHRD.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_MMSeqs2_AHRD
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

### Define VARs
GENOME=Mj_TUMSAT_v1.0
NUMTH=56
QUERY=${GENOME}.nr
REF_DIR=/work/kawato/study/ref/proteins

# Create symlinks
ln -s ../${GENOME}_extractNrProteins/${QUERY}.faa ${QUERY}.faa
ln -s /work/kawato/apps/AHRD/test/resources/filter_descline_sprot.txt filter_descline_sprot.txt

# 802 M. japonicus proteins were downloaded from NCBI on 2021-02-18.
# Search condition: (("Penaeus japonicus"[porgn:__txid27405])) NOT mitochondrion
# /work/kawato/study/ref/proteins/2021-02-18_NCBI_Mj_prot.fasta

# MMSeqs2
conda activate mmseqs2

# Create db
mmseqs createdb ${QUERY}.faa ${QUERY}
mmseqs createdb ${REF_DIR}/GCF_003789085.1_ASM378908v1_protein.faa ASM378908v1_protein
mmseqs createdb ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_protein.faa NSTDA_Pmon_1_protein
mmseqs createdb ${REF_DIR}/2021-02-18_NCBI_Mj_prot.fasta 2021-02-18_NCBI_Mj_prot
mmseqs createdb ${REF_DIR}/Swiss-Plot_2020-12-17.faa Swiss-Plot_2020-12-17
DB_UNIREF50=${REF_DIR}/uniref50

# search against uniref50
mmseqs search ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out tmp_uniref50 -e 1e-03 -a
mmseqs convertalis ${QUERY} ${DB_UNIREF50} ${QUERY}.uniref50.mmseqs.out ${QUERY}.uniref50.mmseqs.out.formatmode0 --format-mode 0

# search against Swiss-Prot (the file names retain the original "Swiss-Plot" spelling)
mmseqs search ${QUERY} Swiss-Plot_2020-12-17 ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out tmp_Swiss-Plot_2020-12-17 -e 1e-03 -a
mmseqs convertalis ${QUERY} Swiss-Plot_2020-12-17 ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out ${QUERY}.Swiss-Plot_2020-12-17.mmseqs.out.formatmode0 --format-mode 0

# search against penaeid shrimp species
mmseqs search ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out tmp_ASM378908v1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} ASM378908v1_protein ${QUERY}.ASM378908v1_protein.mmseqs.out ${QUERY}.ASM378908v1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out tmp_NSTDA_Pmon_1_protein -e 1e-03 -a
mmseqs convertalis ${QUERY} NSTDA_Pmon_1_protein ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out ${QUERY}.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0 --format-mode 0
mmseqs search ${QUERY} 2021-02-18_NCBI_Mj_prot ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out tmp_2021-02-18_NCBI_Mj_prot -e 1e-03 -a
mmseqs convertalis ${QUERY} 2021-02-18_NCBI_Mj_prot ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out ${QUERY}.2021-02-18_NCBI_Mj_prot.mmseqs.out.formatmode0 --format-mode 0

# AHRD
# Basically followed https://qiita.com/MaedaTaro_Umiushi/items/bd9e797d44c379124144
# Create the following files:
# ahrd_input.yml
# blacklist_descline.txt
# filter_descline_aply.txt
# blacklist_token.txt

# Run AHRD
java -Xmx200g -jar \
/work/kawato/apps/AHRD/dist/ahrd.jar \
./ahrd_input.yml

# Rename "Unknown protein" to "hypothetical protein"; product names with low quality (at most one "*" in the three-character quality code, i.e. two or more "-") were also renamed to "hypothetical protein"
grep "^LOCUS" ahrd_output.csv |sed 's/Unknown protein/hypothetical protein/g'|awk 'BEGIN{FS="\t";OFS="\t"}{if(gsub(/-/,"-",$3) >= 2){$4="hypothetical protein"};{print $1,$4}}'|sed -E 's/\.[0-9]+\.p[0-9]+//g'>ahrd_output.products.list
```
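The awk filter above relies on `gsub(/-/,"-",$3)` returning the number of substitutions it performed, which doubles as a count of `-` characters in AHRD's quality code. A quick illustration (hypothetical record):

```sh
# Quality code "*--" contains two dashes, so the product gets demoted.
printf 'LOCUS000000100\tx\t*--\tsome product\n' | awk 'BEGIN{FS="\t"}{print gsub(/-/,"-",$3)}'
# -> 2
```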
ahrd_input.yml

```yaml
proteins_fasta: ./Mj_TUMSAT_v1.0.nr.faa
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
output: ./ahrd_output.csv
blast_dbs:
  Swiss-Plot_2020-12-17:
    weight: 653
    description_score_bit_score_weight: 2.717061
    file: ./Mj_TUMSAT_v1.0.nr.Swiss-Plot_2020-12-17.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/Swiss-Plot_2020-12-17.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_sprot.txt
    token_blacklist: ./blacklist_token.txt
  2021-02-18_NCBI_Mj_prot:
    weight: 999
    description_score_bit_score_weight: 2.917405
    file: ./Mj_TUMSAT_v1.0.nr.2021-02-18_NCBI_Mj_prot.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/2021-02-18_NCBI_Mj_prot.fasta
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
  Lvan:
    weight: 800
    description_score_bit_score_weight: 2.590211
    file: ./Mj_TUMSAT_v1.0.nr.ASM378908v1_protein.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/GCF_003789085.1_ASM378908v1_protein.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
  Pmon:
    weight: 800
    description_score_bit_score_weight: 2.590211
    file: ./Mj_TUMSAT_v1.0.nr.NSTDA_Pmon_1_protein.mmseqs.out.formatmode0
    database: /work/kawato/study/ref/proteins/GCF_015228065.1_NSTDA_Pmon_1_protein.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Penaeus.+)"
```

blacklist_descline.txt

```
(?i)^similar\s+to
(?i)^probable
(?i)^putative
(?i)putative$
(?i)^predicted
(?i)^uncharacterized
(?i)^uncharacterised
(?i)^TSA:
(?i)^unknown
(?i)^hypothetical
(?i)^unnamed
(?i)whole\s+genome\s+shotgun\s+sequence
(?i)^clone
(?i)[0-9][0-9][0-9][0-9][0-9]
(?i)genomic scaffold
(?i)genomic contig
(?i)genome sequencing data
(?i)isoform\s+X[0-9]+
(?i),\s+partial
(?i)LOW QUALITY PROTEIN:
(?i)Tollo
```

filter_descline_aply.txt

```
\sOS=.*$
(?i)OS.*[.].*protein
(?i)^H0.*protein
(?i)contains.*
IPR.*
\w{2,}\d{1,2}(g|G)\d+(\.\d)*\s+
\b\[.*
\b\S+\|\S+\|\S+
\(\s*Fragment\s*\)
^(\s|/|\(|\)|-|\+|\*|,|;|\.|\:|\||\d)+$
PREDICTED:
```

blacklist_token.txt

```
(?i)\bunknown\b
(?i)\bmember\b
(?i)\blike\b
(?i)\bassociated\b
(?i)\bcontaining\b
(?i)\bactivated\b
(?i)\bfamily\b
(?i)\bsubfamily\b
(?i)\binteracting\b
(?i)\bactivity\b
(?i)\bsimilar\b
(?i)\bproduct\b
(?i)\bexpressed\b
(?i)\bpredicted\b
(?i)\bputative\b
(?i)\buncharacterized\b
(?i)\bprobable\b
(?i)\bprotein\b
(?i)\bgene\b
(?i)\btair\b
(?i)\bfragment\b
(?i)\bhomolog\b
(?i)\bcontig\b
(?i)\brelated\b
(?i)\bremark\b
(?i)\b\w?orf(\w?|\d+)\b
(?i)\bisoform X[0-9]+\b
(?i)\b, partial\b
(?i)\bLOW QUALITY PROTEIN:\b
```
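These blacklist/filter files hold one Java-style regular expression per line: descriptions matching any `blacklist_descline.txt` entry are discarded as candidate product names, and `filter_descline_*.txt` patterns are stripped from retained descriptions. The `(?i)` prefix makes a pattern case-insensitive. A rough stand-alone check with PCRE grep (illustrative only; AHRD applies these internally):

```sh
# "PREDICTED: ..." is rejected by the ^predicted blacklist entry.
echo "PREDICTED: uncharacterized protein LOC113812345" | grep -P '(?i)^predicted' && echo "blacklisted"
```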
|sed "4s/^/# GenBank assembly accession: GCA_017312705.1\n/" |sed "5s/^/# Date: $(LANG=en_us_88591;date -u)\n/" >${ASM}.gff3 # Verify gene count cat ${ASM}.gff3 |awk '$3=="gene"{print $0}'|cut -f 2 -d ";"| sort | uniq -c > ${ASM}.genecount # Make transcript, CDS, and protein fasta files with their product names on the description line gffread ${ASM}.gff3 \ -g ${ASM}.fa \ -w ${ASM}.transcripts.fna \ -x ${ASM}.cds.fna \ -y ${ASM}.prot.faa --table=product # BUSCO conda activate busco-4.1.4 busco -i ${ASM}.prot.faa -l arthropoda_odb10 -o ${ASM}.prot.busco -c 56 -m prot -f ``` join_table.py ```python #!/usr/bin/env python3 # coding: utf-8 import pandas as pd # Read tables df_ahrd_products = pd.read_table('ahrd_output.products.list', names=["locus", "product"]).sort_values(by=['locus']) df_query_ko = pd.read_table('query.ko.mod', names=["locus","KO"]) df_ko00001 = pd.read_table('ko00001.keg.mod', names=["KO","Abbreviation","synonym","EC number"]).drop_duplicates() # Join tables df_ko_merge = pd.merge(df_query_ko, df_ko00001, on='KO',how='left') df_join = df_ahrd_products.merge(df_ko_merge) df_join = df_join.fillna('NA') # Write the final table df_join.to_csv('ahrd_kaas.tsv', sep='\t', index=False) ``` add_product_information_to_protein_coding_genes_in_gff3.py ```python #!/usr/bin/env python3 # coding: utf-8 import pandas as pd pd.set_option('display.max_colwidth', None) attrs = pd.read_table('ahrd_kaas.gffAttributes.tsv', names=["locus", "attributes"]) attrs.head() path = './all.sort.renamed.gff3' with open(path) as f: for line in f: line = line.rstrip() if line.startswith('#'): print(line) continue s = line.split('\t') if 'CDS' in s[2]: items = s[8].split(';') tags = dict([tmp.split('=') for tmp in items]) if 'ID' in tags: id = tags['ID'] locus = str(id.split('.')[1]) attr = attrs[attrs['locus'] == locus].attributes attr = attr.to_string(index=False, header=False).strip() print(line + ";" + attr) continue else: print(line) continue ``` ## Structural annotation of IsoSeq cDNA sequences (SQANTI3) ```sh # !/bin/bash # >>>>>> # Mj_TUMSAT_v1.0_sqanti3.sh # >>>>>> source ~/anaconda3/etc/profile.d/conda.sh # Create/goto WDIR WHOME=/work/kawato/study/kuruma_genome2017/analysis/ WDIR=Mj_TUMSAT_v1.0_sqanti3 cd ${WHOME} mkdir ${WDIR};cd ${WDIR} # Set VARs NUMTH=56 ASM=Mj_TUMSAT_v1.0 CDNA=IsoSeq conda activate SQANTI3.env export PYTHONPATH=$PYTHONPATH:/work/kawato/apps/cDNA_Cupcake/sequence/ # Make symlinks ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ${CDNA}.fa ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${ASM}.fa ln -s ../Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS/${ASM}.gff3 ${ASM}.gff3 gffread ${ASM}.gff3 -g ${ASM}.fa -E -T -o ${ASM}.gtf #SQANTI3 requires GTF format for reference annotation # ISOSEQ minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${ASM}.fa ${CDNA}.fa > ${CDNA}.sam sort -k 3,3 -k 4,4n ${CDNA}.sam > ${CDNA}.sorted.sam collapse_isoforms_by_sam.py --input ${CDNA} -s ${CDNA}.sorted.sam --dun-merge-5-shorter -o ${CDNA}.align -c 0.90 # coverage cutoff 0.90 = 90%. 
## Structural annotation of IsoSeq cDNA sequences (SQANTI3)

```sh
#!/bin/bash
# >>>>>>
# Mj_TUMSAT_v1.0_sqanti3.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=Mj_TUMSAT_v1.0_sqanti3
cd ${WHOME}
mkdir ${WDIR};cd ${WDIR}

# Set VARs
NUMTH=56
ASM=Mj_TUMSAT_v1.0
CDNA=IsoSeq

conda activate SQANTI3.env
export PYTHONPATH=$PYTHONPATH:/work/kawato/apps/cDNA_Cupcake/sequence/

# Make symlinks
ln -s /work/kawato/study/kuruma_transcriptome2017/analysis/transcripts/ICRJ01.1.fsa_nt ${CDNA}.fa
ln -s /work/kawato/study/ref/genomes/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna ${ASM}.fa
ln -s ../Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS/${ASM}.gff3 ${ASM}.gff3
gffread ${ASM}.gff3 -g ${ASM}.fa -E -T -o ${ASM}.gtf # SQANTI3 requires GTF format for reference annotation

# ISOSEQ
minimap2 -ax splice -t ${NUMTH} -uf --secondary=no -C5 ${ASM}.fa ${CDNA}.fa > ${CDNA}.sam
sort -k 3,3 -k 4,4n ${CDNA}.sam > ${CDNA}.sorted.sam
collapse_isoforms_by_sam.py --input ${CDNA} -s ${CDNA}.sorted.sam --dun-merge-5-shorter -o ${CDNA}.align -c 0.90 # coverage cutoff 0.90 = 90%; the output ${CDNA}.align.collapsed.gff is in GFF2 format (= GTF)
python /work/kawato/apps/SQANTI3/sqanti3_qc.py --gtf ${CDNA}.align.collapsed.gff ${ASM}.gtf ${ASM}.fa
```

## Comparative genomics

### BUSCO (genome, arthropoda_odb10)

#### Running BUSCO pipeline

```sh
#!/bin/bash
# >>>>>>
# BUSCO_genome.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=BUSCO_arthropods_genome
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Define variables
REF_DIR=/work/kawato/study/ref/genomes
NUMTH=24

# create symlinks
ln -s ${REF_DIR}/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna Drosophila_melanogaster.fna
ln -s ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna Litopenaeus_vannamei.fna
ln -s ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna Penaeus_monodon.fna
ln -s ${REF_DIR}/GCF_000764305.1_Hazt_2.0_genomic.fna Hyalella_azteca.fna
ln -s ${REF_DIR}/GCF_003990815.1_ASM399081v1_genomic.fna Daphnia_magna.fna
ln -s ${REF_DIR}/GCA_004104545.1_Arma_vul_BF2787_genomic.fna Armadillidium_vulgare.fna
ln -s ${REF_DIR}/GCA_009176605.1_CNRS_Arma_nasa_1.0_genomic.fna Armadillidium_nasatum.fna
ln -s ${REF_DIR}/GCF_000151625.1_ASM15162v1_genomic.fna Bombyx_mori.fna
ln -s ${REF_DIR}/GCF_003254395.2_Amel_HAv3.1_genomic.fna Apis_mellifera.fna
ln -s ${REF_DIR}/GCA_007210705.1_Tcal_SD_v2.1_genomic.fna Tigriopus_californicus.fna
ln -s ${REF_DIR}/GCF_000591075.1_Eaff_2.0_genomic.fna Eurytemora_affinis.fna
ln -s ${REF_DIR}/GCA_009805615.1_SNU_Aamp_1_genomic.fna Amphibalanus_amphitrite.fna
ln -s ${REF_DIR}/GCA_015104395.1_ASM1510439v1_genomic.fna Macrobrachium_nipponense.fna
ln -s ${REF_DIR}/GCA_009830355.1_DU_Cdes_1.0_genomic.fna Cherax_destructor.fna
ln -s ${REF_DIR}/GCA_013283005.1_ASM1328300v1_genomic.fna Paralithodes_platypus.fna
ln -s ${REF_DIR}/GCA_002838885.1_Pvir0.4_genomic.fna Procambarus_virginalis.fna
ln -s ${REF_DIR}/GCA_009761615.1_DU_Cquad_1.0_genomic.fna Cherax_quadricarinatus.fna
ln -s ${REF_DIR}/GCA_013436485.1_ASM1343648v1_genomic.fna Eriocheir_sinensis.fna
ln -s ${REF_DIR}/GCA_002291165.1_Mjap_WGS_v1_genomic.fna Mjap_WGS_v1.fna
ln -s ${REF_DIR}/GCA_002291185.1_Pmon_WGS_v1_genomic.fna Pmon_WGS_v1.fna
ln -s ${REF_DIR}/GCA_007890405.1_Pmod26D_v1_genomic.fna Pmod26D_v1.fna
ln -s ${REF_DIR}/GCA_016920825.1_ASM1692082v1_genomic.fna Fenneropenaeus_chinensis.fna
ln -s ../filter_lowqual_seq/Mj_TUMSAT_v1.0.fa Marsupenaeus_japonicus.fna

conda activate busco-4.1.4

# BUSCO
for SP in Drosophila_melanogaster Litopenaeus_vannamei Penaeus_monodon Hyalella_azteca Daphnia_magna Armadillidium_vulgare Armadillidium_nasatum Bombyx_mori Apis_mellifera Tigriopus_californicus Eurytemora_affinis Amphibalanus_amphitrite Macrobrachium_nipponense Cherax_destructor Paralithodes_platypus Procambarus_virginalis Cherax_quadricarinatus Eriocheir_sinensis Fenneropenaeus_chinensis Mjap_WGS_v1 Pmon_WGS_v1 Pmod26D_v1 Marsupenaeus_japonicus;do
busco -i ${SP}.fna -l arthropoda_odb10 -c ${NUMTH} -o ${SP}.genome.busco -m genome
done
```
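For the figure, BUSCO's helper `generate_plot.py` (linked in the next subsection) gathers the `short_summary.*.txt` files from each run directory and, per the BUSCO documentation, writes both the bar chart and an editable R template into the working directory; the R code in the "Plot (R script)" subsection is that template with the species order and values curated by hand. An optional check of the outputs (file names per the BUSCO docs):

```sh
# After generate_plot.py finishes, the figure and its R source
# should both be present.
ls BUSCO_summaries/busco_figure.*
```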
#### Generate summary script (generate_plot.py)

https://gitlab.com/ezlab/busco/-/blob/master/scripts/generate_plot.py

```sh
#!/bin/bash
# >>>>>>
# BUSCO_generate_plot.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=BUSCO_arthropods_genome
cd ${WHOME}${WDIR}

# Collect short summaries
mkdir BUSCO_summaries
conda activate busco-4.1.4
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei Fenneropenaeus_chinensis Eriocheir_sinensis Procambarus_virginalis Cherax_destructor Cherax_quadricarinatus Armadillidium_nasatum Armadillidium_vulgare Hyalella_azteca Amphibalanus_amphitrite Eurytemora_affinis Daphnia_magna Apis_mellifera Bombyx_mori Drosophila_melanogaster ;do
cp ${SP}.genome.busco/short_summary.specific.arthropoda_odb10.${SP}.genome.busco.txt BUSCO_summaries/.
done

# generate_plot.py
generate_plot.py -wd BUSCO_summaries
```

#### Plot (R script)

```R
##################################
#
# BUSCO summary figure
# @version 4.0.0
# @since BUSCO 2.0.0
#
# Copyright (c) 2016-2020, Evgeny Zdobnov (ez@ezlab.org)
# Licensed under the MIT license. See LICENSE.md file.
#
##################################
setwd("C:/Users/kawato/Share/2020-09-24_Mj/")

# Load the required libraries
library(ggplot2)
library("grid")
library(forcats) # provides fct_rev(), used below

# !!! CONFIGURE YOUR PLOT HERE !!!
# Output
my_width <- 20
my_height <- 15
my_unit <- "cm"

# Colors
my_colors <- c("#F2E436", "#F0B83D", "#74AFD0", "#999999")
# Bar height ratio
my_bar_height <- 0.75

# Legend
my_title <- "BUSCO Assessment Results"

# Font
my_family <- "sans"
my_size_ratio <- 1

# !!! SEE YOUR DATA HERE !!!
# Your data as generated by python, remove or add more
my_species <- c('Marsupenaeus japonicus', 'Marsupenaeus japonicus', 'Marsupenaeus japonicus', 'Marsupenaeus japonicus',
'Penaeus monodon', 'Penaeus monodon', 'Penaeus monodon', 'Penaeus monodon',
'Litopenaeus vannamei', 'Litopenaeus vannamei', 'Litopenaeus vannamei', 'Litopenaeus vannamei',
'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis', 'Fenneropenaeus chinensis',
'Eriocheir sinensis', 'Eriocheir sinensis', 'Eriocheir sinensis', 'Eriocheir sinensis',
'Procambarus virginalis', 'Procambarus virginalis', 'Procambarus virginalis', 'Procambarus virginalis',
'Cherax destructor', 'Cherax destructor', 'Cherax destructor', 'Cherax destructor',
'Cherax quadricarinatus', 'Cherax quadricarinatus', 'Cherax quadricarinatus', 'Cherax quadricarinatus',
'Armadillidium nasatum', 'Armadillidium nasatum', 'Armadillidium nasatum', 'Armadillidium nasatum',
'Armadillidium vulgare', 'Armadillidium vulgare', 'Armadillidium vulgare', 'Armadillidium vulgare',
'Hyalella azteca', 'Hyalella azteca', 'Hyalella azteca', 'Hyalella azteca',
'Amphibalanus amphitrite', 'Amphibalanus amphitrite', 'Amphibalanus amphitrite', 'Amphibalanus amphitrite',
'Eurytemora affinis', 'Eurytemora affinis', 'Eurytemora affinis', 'Eurytemora affinis',
'Daphnia magna', 'Daphnia magna', 'Daphnia magna', 'Daphnia magna',
'Apis mellifera', 'Apis mellifera', 'Apis mellifera', 'Apis mellifera',
'Bombyx mori', 'Bombyx mori', 'Bombyx mori', 'Bombyx mori',
'Drosophila melanogaster', 'Drosophila melanogaster', 'Drosophila melanogaster', 'Drosophila melanogaster')
my_species <- factor(my_species)
my_species <- fct_rev(factor(my_species,levels=c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei", "Fenneropenaeus chinensis", "Eriocheir sinensis", "Procambarus virginalis", "Cherax destructor", "Cherax quadricarinatus", "Armadillidium nasatum", "Armadillidium vulgare", "Hyalella azteca", "Amphibalanus amphitrite", "Eurytemora affinis", "Daphnia magna", "Apis mellifera", "Bombyx mori", "Drosophila melanogaster"))) # reorder your species here just by changing the values in the vector
my_percentage <- c(90.6, 2.8, 2.9, 3.7, 81.9, 1.8, 4.7, 11.6, 82.4, 5.4, 4.5, 7.7, 81.5, 3.4, 6.0, 9.1, 92.0, 0.6, 3.1, 4.3, 84.6, 0.7, 6.1, 8.6, 79.7, 0.9, 8.9, 10.5, 68.9, 0.4, 15.7, 15.0, 81.6, 4.9, 4.2, 9.3, 77.1, 5.3, 6.2, 11.4, 89.9, 0.5, 3.7, 5.9, 65.4, 26.7, 2.2, 5.7, 72.9, 0.9, 7.9, 18.3, 95.4, 0.9, 1.7, 2.0, 97.9, 0.1, 0.6, 1.4, 93.4, 0.4, 2.3, 3.9, 99.2, 0.6, 0.0, 0.2)
my_values <- c(918, 28, 29, 38, 830, 18, 48, 117, 835, 55, 46, 77, 826, 34, 61, 92, 932, 6, 31, 44, 857, 7, 62, 87, 807, 9, 90, 107, 698, 4, 159, 152, 827, 50, 43, 93, 781, 54, 63, 115, 911, 5, 37, 60, 663, 270, 22, 58, 738, 9, 80, 186, 966, 9, 17, 21, 992, 1, 6, 14, 946, 4, 23, 40, 1005, 6, 0, 2)

##################################
##################################
##################################
# Code to produce the graph
labsize = 1
if (length(levels(my_species)) > 10){
labsize = 0.66
}
print("Plotting the figure ...")
category <- c(rep(c("S","D","F","M"),c(1)))
category <-factor(category)
category = factor(category,levels(category)[c(4,1,2,3)])
df = data.frame(my_species,my_percentage,my_values,category)

figure <- ggplot() +
geom_bar(aes(y = my_percentage, x = my_species, fill = category),colour="black", size=0.2, position = position_stack(reverse = TRUE), data = df, stat="identity", width=my_bar_height) +
coord_flip() +
theme_gray(base_size = 8) +
scale_y_continuous(labels = c("0","20","40","60","80","100"), breaks = c(0,20,40,60,80,100)) +
scale_fill_manual(values = my_colors,labels =c(" Complete (C) and single-copy (S) ", " Complete (C) and duplicated (D)", " Fragmented (F) ", " Missing (M)")) +
ggtitle(my_title) +
xlab("") +
ylab("\n%BUSCOs") +
theme(plot.title = element_text(family=my_family, hjust=0.5, colour = "black", size = rel(2.2)*my_size_ratio, face = "bold")) +
theme(legend.position="top",legend.title = element_blank()) +
theme(legend.text = element_text(family=my_family, size = rel(1.2)*my_size_ratio)) +
theme(panel.background = element_rect(color="#FFFFFF", fill="white")) +
theme(panel.grid.minor = element_blank()) +
theme(panel.grid.major = element_blank()) +
theme(axis.text.y = element_text(family=my_family, colour = "black", size = rel(1.66)*my_size_ratio,face = 'italic')) +
theme(axis.text.x = element_text(family=my_family, colour = "black", size = rel(1.66)*my_size_ratio)) +
theme(axis.line = element_line(size=1*my_size_ratio, colour = "black")) +
theme(axis.ticks.length = unit(.85, "cm")) +
theme(axis.ticks.y = element_line(colour="white", size = 0)) +
theme(axis.ticks.x = element_line(colour="#222222")) +
theme(axis.ticks.length = unit(0.4, "cm")) +
theme(axis.title.x = element_text(family=my_family, size=rel(1.2)*my_size_ratio)) +
guides(fill = guide_legend(override.aes = list(colour = NULL))) +
guides(fill=guide_legend(nrow=2,byrow=TRUE))

for(i in rev(c(1:length(levels(my_species))))){
detailed_values <- my_values[my_species==my_species[my_species==levels(my_species)[i]]]
total_buscos <- sum(detailed_values)
figure <- figure +
annotate("text", label=paste("C:", detailed_values[1] + detailed_values[2], " [S:", detailed_values[1], ", D:", detailed_values[2], "], F:", detailed_values[3], ", M:", detailed_values[4], ", n:", total_buscos, sep=""), y=3, x = i, size = labsize*4*my_size_ratio, colour = "black", hjust=0, family=my_family)
}
ggsave(figure, file = "Figure_2.tiff", device = "tiff", dpi = 350, width = my_width, height = my_height, unit = my_unit)
print("Done")
```
### Comparison of simple repeat contents in arthropod genomes (RepeatMasker)

```sh
#!/bin/bash
# >>>>>>
# RepeatMasker_SimpleRepeats_Arthropods.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=RepeatMasker_SimpleRepeats_Arthropods
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Define variables
REF_DIR=/work/kawato/study/ref/genomes
NUMTH=24

# create symlinks
ln -s ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna Penaeus_monodon.fna
ln -s ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna Litopenaeus_vannamei.fna
ln -s ${REF_DIR}/GCA_016920825.1_ASM1692082v1_genomic.fna Fenneropenaeus_chinensis.fna
ln -s ${REF_DIR}/GCA_013436485.1_ASM1343648v1_genomic.fna Eriocheir_sinensis.fna
ln -s ${REF_DIR}/GCA_002838885.1_Pvir0.4_genomic.fna Procambarus_virginalis.fna
ln -s ${REF_DIR}/GCA_009830355.1_DU_Cdes_1.0_genomic.fna Cherax_destructor.fna
ln -s ${REF_DIR}/GCA_009761615.1_DU_Cquad_1.0_genomic.fna Cherax_quadricarinatus.fna
ln -s ${REF_DIR}/GCA_009176605.1_CNRS_Arma_nasa_1.0_genomic.fna Armadillidium_nasatum.fna
ln -s ${REF_DIR}/GCA_004104545.1_Arma_vul_BF2787_genomic.fna Armadillidium_vulgare.fna
ln -s ${REF_DIR}/GCF_000764305.1_Hazt_2.0_genomic.fna Hyalella_azteca.fna
ln -s ${REF_DIR}/GCA_009805615.1_SNU_Aamp_1_genomic.fna Amphibalanus_amphitrite.fna
ln -s ${REF_DIR}/GCF_000591075.1_Eaff_2.0_genomic.fna Eurytemora_affinis.fna
ln -s ${REF_DIR}/GCF_003990815.1_ASM399081v1_genomic.fna Daphnia_magna.fna
ln -s ${REF_DIR}/GCF_003254395.2_Amel_HAv3.1_genomic.fna Apis_mellifera.fna
ln -s ${REF_DIR}/GCF_000151625.1_ASM15162v1_genomic.fna Bombyx_mori.fna
ln -s ${REF_DIR}/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna Drosophila_melanogaster.fna

# Run RepeatMasker
conda activate repeatmodeler
for SP in Penaeus_monodon Litopenaeus_vannamei Fenneropenaeus_chinensis Eriocheir_sinensis Procambarus_virginalis Cherax_destructor Cherax_quadricarinatus Armadillidium_nasatum Armadillidium_vulgare Hyalella_azteca Amphibalanus_amphitrite Eurytemora_affinis Daphnia_magna Apis_mellifera Bombyx_mori Drosophila_melanogaster;do
RepeatMasker -pa 14 -noint -no_is -norna -gff -xsmall -a -s ${SP}.fna -dir ${SP}.RepeatMasker.SoftMasked
done
```
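For reference, the switches confine this run to simple repeats: per the RepeatMasker documentation, `-noint` masks only low-complexity DNA and simple repeats (no interspersed elements), `-norna` leaves small RNA genes unmasked, `-no_is` skips the bacterial insertion-element check, `-xsmall` soft-masks in lowercase, and `-a`/`-s` write alignments and use the slow, sensitive search.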
### Comparison of protein-coding gene metrics in three penaeid shrimp genomes

#### Get metrics

```sh
#!/bin/bash
# >>>>>>
# GetNrProteinsAndGeneMetrics.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=GetNrProteinsAndGeneMetrics
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Define variables
REF_DIR=/work/kawato/study/ref/genomes
mkdir Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei

# Make symlinks for the M. japonicus genome and GFF3. mtDNA already removed
ln -s ${REF_DIR}/GCA_017312705.1_Mj_TUMSAT_v1.0_genomic.fna Marsupenaeus_japonicus/Marsupenaeus_japonicus.fna
ln -s ${WHOME}Mj_TUMSAT_v1.0_merge_AHRD_and_KAAS/Mj_TUMSAT_v1.0.gff3 Marsupenaeus_japonicus/Marsupenaeus_japonicus.gff

# Prepare P. monodon and L. vannamei. Remove mtDNA records
seqkit grep -v -nr -p "NC_002184.1*" ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.fna > Penaeus_monodon/Penaeus_monodon.fna
cat ${REF_DIR}/GCF_015228065.1_NSTDA_Pmon_1_genomic.gff |sed -e '/NC_002184.1/,$d' > Penaeus_monodon/Penaeus_monodon.gff
seqkit grep -v -nr -p "NC_009626.1*" ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.fna > Litopenaeus_vannamei/Litopenaeus_vannamei.fna
cat ${REF_DIR}/GCF_003789085.1_ASM378908v1_genomic.gff |sed -e '/NC_009626.1/,$d' >Litopenaeus_vannamei/Litopenaeus_vannamei.gff

for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do
cd ${SP}

# Extract .faa, with geneid and cdslen information appended after the transcript id
gffread ${SP}.gff -E -S -g ${SP}.fna -y ${SP}.annotation.geneid.transcriptid.cdslen.faa --table=@geneid,@cdslen
# seqkit grep didn't like the appended information; trim it off
cut -f 1 ${SP}.annotation.geneid.transcriptid.cdslen.faa >${SP}.annotation.faa

# Create tables
grep ">" ${SP}.annotation.geneid.transcriptid.cdslen.faa |sed 's/>//g'|awk '{print $2}'|sort|uniq >${SP}.annotation.geneid.list
grep ">" ${SP}.annotation.geneid.transcriptid.cdslen.faa |sed 's/>//g'|awk '{print $2 "\t" $1 "\t" $3}' >${SP}.annotation.geneid.transcriptid.cdslen.list

# Extract the longest isoform from each locus
# generates split files per 600 lines
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.geneid.list ${SP}.annotation.geneid.list_ # 50 files; up to 30000 genes can be handled
for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 ;do
while read line;do
ID=`awk -v line=${line} '$1 == line {print $0}' ${SP}.annotation.geneid.transcriptid.cdslen.list |sort -r -n -k 3 |awk 'NR==1 {print $2}'`
seqkit grep -np "${ID}" ${SP}.annotation.faa
done >${SP}.annotation.nr.${N}.faa < ${SP}.annotation.geneid.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.annotation.nr.faa # Delete preexisting file if it exists
cat ${SP}.annotation.nr.*.faa >>${SP}.annotation.nr.faa # Concatenate partitioned fasta
rm ${SP}.annotation.nr.*.faa ${SP}.annotation.geneid.list_* # clean up

# Convert .gff3 to .gtf for subsequent filtering
gffread ${SP}.gff -T --keep-genes -E -o ${SP}.gtf 2> ${SP}.gffread.log # GFF3 has a complex nested structure to specify the gene-transcript-exon relationship, which is detrimental to this analysis

# List longest isoforms
grep ">" ${SP}.annotation.nr.faa |sed 's/>//g' >${SP}.annotation.nr.list

# Generate a .gtf file containing the longest isoform of each locus. This GTF is used for gene metrics calculation.
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.nr.list ${SP}.annotation.nr.list_ # generates 50 split files
for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 ;do
while read line;do
grep "transcript_id \"${line}\"" ${SP}.gtf
done >${SP}.nr_${N}.gtf < ${SP}.annotation.nr.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.nr.gtf # Delete preexisting file if it exists
cat ${SP}.nr_*.gtf>>${SP}.nr.gtf # Concatenate partitioned GTF
rm ${SP}.annotation.nr.list_* ${SP}.nr_*.gtf # clean up

# Prepare .gff files for gene metrics calculation. This conversion was necessary to make intron features.
cat ${SP}.nr.gtf |grep $'\tCDS\t' >${SP}.nr.CDSonly.gtf # Create a version without non-CDS (=UTR) exons
gffread ${SP}.nr.CDSonly.gtf --force-exons -F -g ${SP}.fna -o ${SP}.nr.CDSonly.gff # Convert GTF to GFF; creates exons. Gene features could not be added with gt gff3
conda activate genometools
gt gff3 -checkids -addintrons -retainids ${SP}.nr.CDSonly.gff >${SP}.nr.CDSonly.addintrons.gff # Creates introns. The -retainids option is a must; without it, gt gff3 automatically renames transcripts
gffread ${SP}.nr.CDSonly.addintrons.gff -E -S -g ${SP}.fna -y ${SP}.nr.CDSonly.addintrons.faa # Check the integrity of the gff by translating the features into faa
grep ">" ${SP}.nr.CDSonly.addintrons.faa |wc -l >${SP}.nr.CDSonly.addintrons.counts

# Get exon length table
awk -v species=$SP '$3 == "exon" {transcript_name = $9 ; sub(/Parent\=/,"", transcript_name); sum = $5 - $4 + 1 ;print species "\t" $1 "\t" transcript_name "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.exonlength.tidy

# Get intron length table
awk -v species=$SP '$3 == "intron" {transcript_name = $9 ; sub(/Parent\=/,"", transcript_name); sum = $5 - $4 + 1 ;print species "\t" $1 "\t" transcript_name "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.intronlength.tidy

# Get gene length table
gawk -v species=$SP '$3 == "transcript" {sum = $5 - $4 + 1 ; print species "\t" $1 "\t" gensub(/(^ID\=)([^;]+)(.+)/, "\\2", "g", $9) "\t" $3 "\t" $4 "\t" $5 "\t" sum}' ${SP}.nr.CDSonly.addintrons.gff >${SP}.nr.CDSonly.addintrons.genelength.tidy

# Count exons
split -l 600 --numeric-suffixes --suffix-length=3 --elide-empty-files ${SP}.annotation.nr.list ${SP}.annotation.nr.list_ # generates 50 split files
for N in 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 ;do
while read line;do
echo -e "${SP}\t`grep "ID=${line};" ${SP}.nr.CDSonly.addintrons.gff|cut -f 1|sort|uniq`\t${line}\t`grep "Parent=${line}$" ${SP}.nr.CDSonly.addintrons.gff |awk '$3 == "exon"' |wc -l`"
done >${SP}.nr.CDSonly.exoncount_${N} < ${SP}.annotation.nr.list_${N} &
done
wait # wait until all the background jobs end
rm -f ${SP}.nr.CDSonly.addintrons.exoncount.tidy
cat ${SP}.nr.CDSonly.exoncount_* >>${SP}.nr.CDSonly.addintrons.exoncount.tidy
rm ${SP}.nr.CDSonly.exoncount_* ${SP}.annotation.nr.list_* # clean up

cd ..
done

for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do cat $SP/${SP}.nr.CDSonly.addintrons.intronlength.tidy ; done >intronlength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do cat $SP/${SP}.nr.CDSonly.addintrons.exonlength.tidy ; done >exonlength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do cat $SP/${SP}.nr.CDSonly.addintrons.genelength.tidy; done >genelength.tidy
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do cat $SP/${SP}.nr.CDSonly.addintrons.exoncount.tidy; done >exoncount.tidy
```
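Each `*.tidy` file produced above is a headerless, one-feature-per-row TSV with the columns species, scaffold, transcript, feature type, start, end, and length (the exon-count table has species, scaffold, transcript, and exon count), matching the `col_names` given to `read_tsv()` in the R script below. A hypothetical row:

```sh
# Illustrative only; real IDs and coordinates will differ.
printf 'Marsupenaeus_japonicus\tBOPN01000001.1\tLOCUS000000100.1\texon\t1042\t1360\t319\n'
```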
#### Visualization (ggplot2)

```R
##################################
#
# plot_gene_metrics.R
#
##################################
setwd("C:/Users/kawato/study/kuruma_genome2017/analysis/plot_gene_metrics/")

# Load the required libraries
library(tidyverse)
library(ggplot2)
library(patchwork)

# Color palette
cbp1 <- c("#999999", "#E69F00", "#56B4E9")

# Gene length
geneLen <- read_tsv("genelength.tidy", col_names=c("species", "scaffold", "locus", "type", "start", "end", "length"))
geneLen$species <- as.factor(geneLen$species)
geneLen$species <- factor(geneLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(geneLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p1 <- ggplot(geneLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 100,size=2) +
xlim(0,22000) +
theme_bw()+ theme_classic() +
scale_colour_manual(values=cbp1) +
labs( x = "Gene length (bp, bin=100 bp)", y = "Count", colour = "Species" ) +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())

# Exon count
exonCount <- read_tsv("exoncount.tidy", col_names=c("species", "scaffold", "locus", "noOfExons"))
exonCount$species <- as.factor(exonCount$species)
exonCount$species <- factor(exonCount$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(exonCount$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p2 <- ggplot(exonCount) +
geom_freqpoly(aes(x=noOfExons,color=species),binwidth = 1,size=2) +
xlim(0,20) +
theme_bw()+ theme_classic() +
scale_colour_manual(values=cbp1) +
labs( x = "Number of exons", y = "Count", colour = "Species" ) +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())

# Exon length
exonLen <- read_tsv("exonlength.tidy", col_names=c("species", "scaffold", "locus", "type", "start", "end", "length"))
exonLen$species <- as.factor(exonLen$species)
exonLen$species <- factor(exonLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(exonLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p3 <- ggplot(exonLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 10,size=2) +
xlim(0,500) +
ylim(0,12000) +
theme_bw()+ theme_classic() +
scale_colour_manual(values=cbp1) +
labs( x = "Exon length (bp, bin=10 bp)", y = "Count", colour = "Species" ) +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())

# Intron length
intronLen <- read_tsv("intronlength.tidy", col_names=c("species", "scaffold", "transcript", "type", "start", "end", "length"))
intronLen <- subset(intronLen,intronLen$length > 10)
intronLen$species <- as.factor(intronLen$species)
intronLen$species <- factor(intronLen$species, levels = c("Marsupenaeus_japonicus", "Penaeus_monodon", "Litopenaeus_vannamei"))
levels(intronLen$species) <- c("Marsupenaeus japonicus", "Penaeus monodon", "Litopenaeus vannamei")
p4 <- ggplot(intronLen) +
geom_freqpoly(aes(x=length,color=species),binwidth = 10,size=2) +
xlim(0,1000) +
theme_bw()+ theme_classic() +
scale_colour_manual(values=cbp1) +
labs( x = "Intron length (bp, bin=10 bp)", y = "Count", colour = "Species" ) +
theme(axis.title=element_text(size=18,face="bold"), axis.text=element_text(size=18),legend.text = element_text(size=18, face = 'italic'), legend.key=element_rect(fill="transparent", colour="transparent"), legend.background=element_rect(fill="transparent", colour="transparent"), legend.key.size = unit(1.5, 'lines'), legend.justification=c(1,1), legend.position=c(1,1),legend.title=element_blank())

p_merge <- (p1 + p2) / (p3 + p4) + plot_annotation(tag_levels = 'A') & theme(plot.tag.position = c(0,1),plot.tag = element_text(size = 32,face="bold"))
ggsave(file = "Figure_4.tiff", plot = p_merge, dpi = 350, width = 12, height = 12)
```

### Orthologous gene clustering analysis (OrthoFinder)

#### Running OrthoFinder

```sh
#!/bin/bash
# >>>>>>
# OrthoFinder.sh
# >>>>>>
source ~/anaconda3/etc/profile.d/conda.sh

# Create/goto WDIR
WHOME=/work/kawato/study/kuruma_genome2017/analysis/
WDIR=OrthoFinder
cd ${WHOME}
mkdir ${WDIR}
cd ${WDIR}

# Define variables
NUMTH=56

# create directory to store faa
mkdir faa
for SP in Marsupenaeus_japonicus Penaeus_monodon Litopenaeus_vannamei;do
ln -s ${WHOME}GetNrProteinsAndGeneMetrics/${SP}/${SP}.annotation.nr.faa faa/${SP}.faa
done
seqkit stats faa/*.faa >faa.stats

conda activate orthofinder
orthofinder -t ${NUMTH} -a ${NUMTH} -n orthofinder_run -f faa
```
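`Orthogroups/Orthogroups.GeneCount.tsv`, consumed by the R script below, has one row per orthogroup with a gene-count column for each input proteome plus a Total column (layout per the OrthoFinder documentation; the values shown are hypothetical):

```sh
# First lines of the table (illustrative):
head -n 2 Orthogroups.GeneCount.tsv
# Orthogroup  Litopenaeus_vannamei  Marsupenaeus_japonicus  Penaeus_monodon  Total
# OG0000000   102                   87                      95               284
```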
c(alpha("#999999",0.3), alpha('#E69F00',0.3), alpha('#56B4E9',0.3)), print.mode = c("raw", "percent"), cex = 1.5, fontfamily = "sans", cat.cex = 1.6, cat.default.pos = "outer", cat.pos = c(-20, 23, 180), cat.dist = c(0.055, 0.055, 0.035), cat.fontfamily = "sans", cat.fontface = c(3,3,3), cat.col = c("#4d4d4d", '#E69F00', '#56B4E9'), rotation = 1 ) ggsave(my_plot, file="Figure_5.tiff", device = "tiff", dpi = 350, width = 7, height = 7) ```