Aim
BLAST metagenomics data from fish gut
Data set
- 111470 contigs (FASTA format)
Directories
# Define the file names
DIR_PROJECT="/projet/umr7144/dipo/vaulot/metagenomes/2018_india_fish_gut/"
cd $DIR_PROJECT
# Next file is for testing
# FILE=Sample-100_contig_10
FILE=Sample-100_contig
FASTA=$DIR_PROJECT$FILE".fas"
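Before running the pipeline, it can help to confirm that the FASTA file holds the expected number of contigs. The sketch below counts header lines; `count_fasta_seqs` is a hypothetical helper name, not part of any tool used here.

```shell
# Count the sequences in a FASTA file by counting header lines (lines starting with ">").
# count_fasta_seqs is a hypothetical helper introduced for illustration only.
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Example call, assuming FASTA is defined as above:
# count_fasta_seqs "$FASTA"
```

For the full data set, the count should match the number of contigs listed under Data set.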
Load personal R library
library(dvutils)
Metaxa2
To find 18S and 16S rRNA genes in metagenomic data
# To install the package locally
wget http://microbiology.se/sw/Metaxa2_2.2-beta10.tar.gz
tar -zxvf Metaxa2_2.2-beta10.tar.gz
# Metaxa2 ----------------------------------------------------
# Use my local version of Metaxa2
METAXA="/home/umr7144/dipo/vaulot/bin"
metaxa2 -i $FASTA -o $DIR_PROJECT$FILE --plus T --cpu 32
# The following lines are for use on the server (BUT DO NOT USE - problems)
# METAXA="/usr/local/genome2/metaxa2-2.1.3"
# metaxa2 -i $FASTA -o $DIR_PROJECT$FILE -d $METAXA -p $METAXA --plus T
Prodigal
To find ORFs in metagenomic data
prodigal -i $FASTA -o $FILE".prodigal.genes" -a $FILE".prodigal.proteins.faa" -p meta
Use R to remove short sequences (<200 aa)
fasta_filter("Sample-100_contig.prodigal.proteins.faa", min_length=200, max_length=10000, type="AA", max_ambig=5)
“Number of sequences initially: 284945”
“Number of sequences after filtration: 27895”
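For readers without the dvutils package, a length filter of this kind can be approximated in the shell. This is a minimal sketch, not the dvutils implementation: `filter_fasta_by_length` is a hypothetical helper, it applies only the minimum and maximum length criteria (no ambiguity check), and it assumes that the trailing `*` Prodigal writes for stop codons should not count toward the length.

```shell
# filter_fasta_by_length FILE MIN MAX
# Print the FASTA records whose sequence length lies within [MIN, MAX].
# Hypothetical helper for illustration; it does not reproduce max_ambig filtering.
filter_fasta_by_length() {
    awk -v min="$2" -v max="$3" '
        function flush_record(   n, s) {
            if (header == "") return
            s = seq
            gsub(/\*/, "", s)      # assumption: stop-codon asterisks are not counted
            n = length(s)
            if (n >= min && n <= max) print header "\n" seq
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }
             { seq = seq $0 }      # join wrapped sequence lines
        END  { flush_record() }
    ' "$1"
}

# Example call, assuming FILE is defined as above:
# filter_fasta_by_length "$FILE.prodigal.proteins.faa" 200 10000 > "$FILE.prodigal.proteins.filtered.faa"
```

Each retained sequence is written on a single line, so the output is not wrapped like the Prodigal input.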
BLASTP on AA file
OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"
# Tabular format
blastp -max_target_seqs 100 -evalue 10 -query $FILE".prodigal.proteins.filtered.faa" -out $FILE".prodigal.proteins.filtered.blast.6.txt" -db /db/blast/all/nr -outfmt "$OUT_FMT" -num_threads 32
# Pairwise format
blastp -num_descriptions 25 -num_alignments 25 -evalue 1.00e-10 -query $FILE".prodigal.proteins.filtered.faa" -out $FILE".prodigal.proteins.filtered.blast.0.txt" -db /db/blast/all/nr -outfmt 0 -num_threads 32
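As a quick check on the tabular output, the highest-scoring hit for each query can be extracted in the shell. The sketch assumes the 19-column layout defined in OUT_FMT (qseqid in column 1, bitscore in column 19) and tab-separated fields; `best_hit_per_query` is a hypothetical helper and is not meant to reproduce what blast_summary() does.

```shell
# best_hit_per_query FILE
# Keep, for each query (column 1), the line with the highest bitscore (column 19).
# Queries are printed in order of first appearance; ties keep the first line seen.
# Hypothetical helper; column positions assume the OUT_FMT defined above.
best_hit_per_query() {
    awk -F'\t' '
        {
            score = $19 + 0
            if (!($1 in best)) {
                order[++n] = $1
                best[$1] = score
                line[$1] = $0
            } else if (score > best[$1]) {
                best[$1] = score
                line[$1] = $0
            }
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}

# Example call, assuming FILE is defined as above:
# best_hit_per_query "$FILE.prodigal.proteins.filtered.blast.6.txt" | cut -f1,4,5,9,18,19
```

Splitting on tabs matters here because the stitle and sscinames fields contain spaces.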
Create a summary file using R
blast_summary("C:/Data Biomol/Metagenomes/_Other metagenomes/Subrata Fish Gut India 2018/Sample-100_contig.prodigal.proteins.filtered.blast.6.txt")
Files produced
- Pairwise format : Sample-100_contig.prodigal.proteins.filtered.blast.0.txt.gz
- Tabular format : Sample-100_contig.prodigal.proteins.filtered.blast.6.txt.gz
- Tabular format (summary) : Sample-100_contig.prodigal.proteins.filtered.blast.6.summary.txt.gz