Aim

BLAST metagenomics data from fish gut

Data set

  • 111470 contigs in FASTA format (Sample-100_contig.fas)
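
A quick way to confirm the contig count before starting long jobs is to count the FASTA header lines. This is a minimal sketch; the function name `count_fasta_records` is illustrative and not part of the original workflow.

```shell
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # so the non-zero status is masked with "|| true".
    grep -c '^>' "$1" || true
}
```

Calling `count_fasta_records Sample-100_contig.fas` from the project directory should report the 111470 contigs listed above if the file is complete.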

Directories

# Define the directory and file names
DIR_PROJECT="/projet/umr7144/dipo/vaulot/metagenomes/2018_india_fish_gut/"
cd "$DIR_PROJECT"

# Next file is for testing
# FILE=Sample-100_contig_10
FILE=Sample-100_contig

FASTA="${DIR_PROJECT}${FILE}.fas"
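
Because every later step reads `$FASTA`, it can be worth failing early if the path is wrong. The helper below is a hypothetical addition (the name `require_file` is not from the original notes) that checks that a file exists and is not empty.

```shell
# Return non-zero (with a message on stderr) if the file is missing or empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "Missing or empty input file: $1" >&2
        return 1
    fi
}
```

A typical use would be `require_file "$FASTA" || exit 1` just after the variables are defined.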

Load personal R library

library(dvutils)

Metaxa2

To find 18S and 16S rRNA genes in metagenomic data

# To install the package locally
wget http://microbiology.se/sw/Metaxa2_2.2-beta10.tar.gz
tar -zxvf Metaxa2_2.2-beta10.tar.gz

# Metaxa2  ----------------------------------------------------

# Use my local version of Metaxa2
METAXA="/home/umr7144/dipo/vaulot/bin"

metaxa2 -i $FASTA -o $DIR_PROJECT$FILE --plus T --cpu 32

# The following lines are for use on the server (BUT DO NOT USE - problems)
# METAXA="/usr/local/genome2/metaxa2-2.1.3"
# metaxa2 -i $FASTA -o $DIR_PROJECT$FILE -d $METAXA -p $METAXA --plus T

Prodigal

To find ORFs in metagenomic data

prodigal -i $FASTA -o $FILE".prodigal.genes" -a $FILE".prodigal.proteins.faa" -p meta

Use R to remove short protein sequences (<200 aa)

fasta_filter("Sample-100_contig.prodigal.proteins.faa", min_length=200, max_length=10000, type="AA", max_ambig=5)

“Number of sequences initially: 284945”

“Number of sequences after filtration: 27895”
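
For readers without the `dvutils` package, the length part of this filter can be approximated in the shell. This is only a sketch under stated assumptions: it is not the `fasta_filter` implementation, it ignores the `max_ambig` criterion, and it assumes that the trailing `*` written by Prodigal for stop codons should not count towards the length. The function name `filter_fasta_by_length` is illustrative.

```shell
# filter_fasta_by_length MIN MAX FILE
# Print the records whose sequence length lies within [MIN, MAX].
# Sequences are written on a single line (line wrapping is not preserved).
filter_fasta_by_length() {
    awk -v min="$1" -v max="$2" '
        # Print the record collected so far if its length is in range.
        function emit() {
            if (header != "") {
                n = length(seq)
                if (n >= min && n <= max) print header "\n" seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        # Strip whitespace and "*" (stop codon marker), then append to the sequence.
        { gsub(/[ \t\r*]/, ""); seq = seq $0 }
        END { emit() }
    ' "$3"
}
```

For example, `filter_fasta_by_length 200 10000 "$FILE.prodigal.proteins.faa"` would keep proteins of 200 to 10000 residues; the counts may differ from those reported above because the ambiguity filter is not applied.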

BLASTP on AA file

OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"

# Tabular format

blastp -max_target_seqs 100 -evalue 10 -query $FILE".prodigal.proteins.filtered.faa" -out $FILE".prodigal.proteins.filtered.blast.6.txt" -db /db/blast/all/nr -outfmt "$OUT_FMT" -num_threads 32

# Pairwise format
blastp -num_descriptions 25 -num_alignments 25 -evalue 1.00e-10 -query $FILE".prodigal.proteins.filtered.faa" -out $FILE".prodigal.proteins.filtered.blast.0.txt" -db /db/blast/all/nr -outfmt 0 -num_threads 32
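
With up to 100 targets per query, the tabular file can be large, and a common first step is to keep only the best-scoring hit per query. The sketch below assumes the tab-separated column order given in `OUT_FMT`, where `qseqid` is column 1 and `bitscore` is column 19. The function name `best_hit_per_query` is hypothetical.

```shell
# best_hit_per_query FILE
# Keep, for each query (column 1), the line with the highest bitscore (column 19).
# Queries are printed in the order in which they first appear; ties keep the first hit.
best_hit_per_query() {
    awk -F '\t' '
        # Remember the order of first appearance of each query.
        !($1 in best) { order[++n] = $1 }
        # Store the line if the query is new or the bitscore is higher.
        !($1 in best) || $19 + 0 > best[$1] { best[$1] = $19 + 0; line[$1] = $0 }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}
```

Usage would be along the lines of `best_hit_per_query "$FILE.prodigal.proteins.filtered.blast.6.txt" > best_hits.txt`, where the output file name is only an example.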

Create a summary file using R

blast_summary("C:/Data Biomol/Metagenomes/_Other metagenomes/Subrata Fish Gut India 2018/Sample-100_contig.prodigal.proteins.filtered.blast.6.txt")
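
As a rough cross-check that does not rely on `blast_summary` (whose internals are not described here), hits can be tallied by superkingdom directly from the tabular file. This sketch assumes the `OUT_FMT` column order above, in which `sskingdoms` is column 7. The function name `count_by_kingdom` is illustrative.

```shell
# count_by_kingdom FILE
# Count rows per value of column 7 (sskingdoms) and print "count<TAB>kingdom",
# sorted by decreasing count, then alphabetically.
count_by_kingdom() {
    awk -F '\t' '{ n[$7]++ } END { for (k in n) print n[k] "\t" k }' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Note that this counts every hit row, so a query with many hits contributes many times; applying it to a file reduced to one hit per query gives a per-protein tally instead.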

Files produced

  • Pairwise format : Sample-100_contig.prodigal.proteins.filtered.blast.0.txt.gz
  • Tabular format : Sample-100_contig.prodigal.proteins.filtered.blast.6.txt.gz
  • Tabular format (summary) : Sample-100_contig.prodigal.proteins.filtered.blast.6.summary.txt.gz