BIOSOPE - step 5 - BLAST processing

1 Aim

Processing the BLAST of the dada2 OTUs and eventually correcting taxonomy.

2 Directory structure

  • /cutadapt : fastq files treated by cutadapt
  • /dada2 : dada2 files
  • /mothur : mothur files
  • /blast : BLAST output

3 Analysis

3.1 To do

Nothing =)

3.2 Read BIOSOPE_init.R

This file defines all the necessary libraries and variables

source("BIOSOPE_init.R", echo = FALSE)

3.3 Run BLAST on server

Using a qsub run the following code

DIR_PROJECT="/projet/umr7144/dipo/vaulot/mothur/biosope/blast/"

cd $DIR_PROJECT

FILE=dada2_seq_18S

FASTA=$DIR_PROJECT$FILE".fas"
BLAST_XLM=$DIR_PROJECT$FILE".xml"
BLAST_TSV=$DIR_PROJECT$FILE".tsv"
OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"

# blastall -p blastn -b 100 -v 100 -m 7 -e 1.00e-100 -i $FASTA -o $BLAST_XLM -d /db/blast/all/nt
blastn -max_target_seqs 100 -evalue 1.00e-10 -query $FASTA -out $BLAST_TSV -db /db/blast/all/nt -outfmt "$OUT_FMT"

3.4 Process BLAST output

We call the blast_18S function to create two files

  • a modified BLAST output file which includes additional coolumns
    • kingdom -> species : PR2 taxonomy for those accession numbers that are present in PR2
    • hit_rank : the rank of the hit based on decreasing % identity and decreasing bit scores
    • uncultured : TRUE if the hit corresponds to an uncultured item
    • hit_lineage : GenBank taxonomy of the hit
  • a summary file with one line per query that contains three sets of columns
    • The top hit (column with prefix hit_top_)
    • The top hit for which a PR2 sequence is available (columns starting with hit_pr2_).
    • The top hit corresponding to a culture or an isolate (columns starting with hit_cult_).
blast_file_name <- str_c(dir_blast, "dada2_seq_18S.tsv")
blast_18S_reformat(blast_file_name)
[1] TRUE

Read the modified BLAST file (100 lines per otu) and the BLAST summary file (1 line per otu).

blast_file <- str_c(dir_blast, "dada2_seq_18S_pr2.tsv")
blast_summary_file <- str_c(dir_blast, "dada2_seq_18S_summary.tsv")

blast <- read_tsv(blast_file, guess_max = 10000)
blast_summary <- read_tsv(blast_summary_file, guess_max = 10000)
  • Split the query_id column into the otuxxxx field and the taxonomy.
    • Note | is a metacharacter, so it must be escaped.
  • Save a file with the current taxonomy. This file will be used to correct the taxonomy
dada2_otus <- str_split_fixed(blast_summary$query_id, pattern = "\\|", 8)
colnames(dada2_otus) <- c("OTUNumber", str_c("dada2_", pr2_taxo_levels[2:8]))
blast_summary <- cbind(dada2_otus, blast_summary)
write_tsv(as.data.frame(dada2_otus), str_c(dir_blast, "dada2_taxo.tsv", na = ""))

3.5 Analyze BLAST output

3.5.1 Percent identity for best hit

  • Mean identity of best hit (%) = 98.46
plot_pct <- gg_hist(blast_summary, "hit_top_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.75) + ggtitle("Best hit")

summary(blast_summary$hit_top_pct_identity, digits = 4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     82      98     100      98     100     100 
  • Mean identity of best hit to isolates (%) = 94.81
plot_pct <- gg_hist(blast_summary, "hit_cult_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.25) + ggtitle("Best hit to isolate")

summary(blast_summary$hit_cult_pct_identity, digits = 4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     80      92      96      95      99     100     560 

3.5.2 Average percent identity vs OTU rank

  • The percent identity becomes lower for the less abundant OTUS
ggplot(aes(x = as.numeric(row.names(blast_summary)), y = hit_top_pct_identity), 
    data = blast_summary) + geom_point() + geom_smooth() + ylim(100, 80) + labs(x = "otu number", 
    y = "Top hit identity (%)")

3.5.3 Average percent identity vs for different divisions

ggplot(aes(x = dada2_division, y = hit_top_pct_identity), data = blast_summary) + 
    geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division", y = "Top hit identity (%)") + 
    ggtitle("For all Otus")

ggplot(aes(x = dada2_division, y = hit_top_pct_identity), data = blast_summary[1:100, 
    ]) + geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division", 
    y = "Top hit identity (%)") + ggtitle("For first 100 Otus")

3.6 Compare the dada2_taxonomy and the BLAST annotation at the genus level

blast_disagree_genus <- blast_summary %>% filter(dada2_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(dir_blast, "blast_disgree.tsv", na = ""))

blast_disagree_division <- blast_summary %>% filter(dada2_division != hit_pr2_division)
  • Number of otus with difference in dada2 annotation and blast annotation (taxonomy is based on pr2).

    • Division level : 22

    • Genus level : 385

Remarks

  • Some blastn gave back very short alignements : e.g. some Apusozoa, Chlorophyta.
  • The first BLAST hit can be different from the dada2 assignement but then the following hits correspond to the dada2 assignment (necessary to do some kind of consensus assignement using a threshold value for length of hit and % identity).
  • For example for otu0007 for which all top blasts are 100% matching and only the entry #7 corresponds to dada2 assignement.
otu0007 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
otu0007 Chaunacanthida_X Acanthometron Acanthometron_sp
otu0007 - BLAST assignement - top 10 hits
hit_acc pct_identity bit_score family genus species
KF130582 100 713 Arthracanthida-Symphyacanthida_X Arthracanthida-Symphyacanthida_XX Arthracanthida-Symphyacanthida_XX_sp.
KF129802 100 713 Arthracanthida-Symphyacanthida_X Arthracanthida-Symphyacanthida_XX Arthracanthida-Symphyacanthida_XX_sp.
KJ760709 100 713 Arthracanthida-Symphyacanthida_X Arthracanthida-Symphyacanthida_XX Arthracanthida-Symphyacanthida_XX_sp.
KJ760545 100 713 Arthracanthida-Symphyacanthida_X Arthracanthida-Symphyacanthida_XX Arthracanthida-Symphyacanthida_XX_sp.
KJ760471 100 713 Arthracanthida-Symphyacanthida_X Arthracanthida-Symphyacanthida_XX Arthracanthida-Symphyacanthida_XX_sp.
KF733601 100 713 NA NA NA
JQ697713 100 713 Chaunacanthida_X Acanthometron Acanthometron_sp.
JQ697711 100 713 Chaunacanthida_X Acanthometron Acanthometron_sp.
JQ697710 100 713 Chaunacanthida_X Acanthometron Acanthometron_sp.
  • For some groups V4 is not discrimant like for clade VII A2 vs A3 so the hits are from both (otu0073)
otu0073 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
otu0073 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp
otu0073 - BLAST assignement - top 20 hits
hit_acc pct_identity bit_score family genus species
LC189152 100 723 NA NA NA
LC189149 100 723 NA NA NA
KU743556 100 723 NA NA NA
KU843591 100 723 NA NA NA
KU843590 100 723 NA NA NA
KU843585 100 723 NA NA NA
KU843581 100 723 NA NA NA
KU843576 100 723 NA NA NA
KX014641 100 723 NA NA NA
KT860872 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-2 Prasino-Clade-VII-A-2_sp.
KT860659 100 723 NA NA NA
KF925345 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A_X Prasino-Clade-VII-A_X_sp.
KF031895 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-2 Prasino-Clade-VII-A-2_sp.
KF031774 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp.
KF031687 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A_X Prasino-Clade-VII-A_X_sp.
KF031657 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp.
JX291917 100 723 NA NA NA
AY425302 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp.
U40921 100 723 Prasino-Clade-VII-A Prasino-Clade-VII-A-2 Prasino-Clade-VII-A-2_sp.

3.7 Change made to the taxonomy assigned by dada2

Changes made are recorded in an Excel sheet that will loaded at the next step and the phyloseq file will modified accordingly. This allows a clean update if we see more change in the taxonomy.

  • otu0003 changed to Chrysophyceae Clade I

  • No change for Clade VII A2 and A3 because dada2 assigned all of them to Clade VII A3

dada2_taxo_change <- read_xlsx(str_c(dir_dada2, "dada2_taxo_changes.xlsx"))
knitr::kable(head(filter(dada2_taxo_change, taxo_changed == TRUE), 31), caption = "Taxonomic change") %>% 
    kable_styling(latex_options = "scale_down", font_size = 7)
Taxonomic change
OTUNumber taxo_changed dada2_species dada2_supergroup dada2_division dada2_class dada2_order dada2_family dada2_genus
otu0003 TRUE Chrysophyceae_Clade-I_X_sp. Stramenopiles Ochrophyta Chrysophyceae Chrysophyceae_X Chrysophyceae_Clade-I Chrysophyceae_Clade-I_X

Daniel Vaulot

28 04 2018