BIOSOPE - step 5 - BLAST processing
1 Aim
Processing the BLAST of the dada2 OTUs and eventually correcting taxonomy.
2 Directory structure
- /cutadapt : fastq files treated by cutadapt
- /dada2 : dada2 files
- /mothur : mothur files
- /blast : BLAST output
3 Analysis
3.1 To do
Nothing =)
3.2 Read BIOSOPE_init.R
This file defines all the necessary libraries and variables
source("BIOSOPE_init.R", echo = FALSE)3.3 Run BLAST on server
Using a qsub run the following code
DIR_PROJECT="/projet/umr7144/dipo/vaulot/mothur/biosope/blast/"
cd $DIR_PROJECT
FILE=dada2_seq_18S
FASTA=$DIR_PROJECT$FILE".fas"
BLAST_XLM=$DIR_PROJECT$FILE".xml"
BLAST_TSV=$DIR_PROJECT$FILE".tsv"
OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"
# blastall -p blastn -b 100 -v 100 -m 7 -e 1.00e-100 -i $FASTA -o $BLAST_XLM -d /db/blast/all/nt
blastn -max_target_seqs 100 -evalue 1.00e-10 -query $FASTA -out $BLAST_TSV -db /db/blast/all/nt -outfmt "$OUT_FMT"
3.4 Process BLAST output
We call the blast_18S function to create two files
- a modified BLAST output file which includes additional coolumns
- kingdom -> species : PR2 taxonomy for those accession numbers that are present in PR2
- hit_rank : the rank of the hit based on decreasing % identity and decreasing bit scores
- uncultured : TRUE if the hit corresponds to an uncultured item
- hit_lineage : GenBank taxonomy of the hit
- a summary file with one line per query that contains three sets of columns
- The top hit (column with prefix hit_top_)
- The top hit for which a PR2 sequence is available (columns starting with hit_pr2_).
- The top hit corresponding to a culture or an isolate (columns starting with hit_cult_).
blast_file_name <- str_c(dir_blast, "dada2_seq_18S.tsv")
blast_18S_reformat(blast_file_name)[1] TRUE
Read the modified BLAST file (100 lines per otu) and the BLAST summary file (1 line per otu).
blast_file <- str_c(dir_blast, "dada2_seq_18S_pr2.tsv")
blast_summary_file <- str_c(dir_blast, "dada2_seq_18S_summary.tsv")
blast <- read_tsv(blast_file, guess_max = 10000)
blast_summary <- read_tsv(blast_summary_file, guess_max = 10000)- Split the query_id column into the otuxxxx field and the taxonomy.
- Note | is a metacharacter, so it must be escaped.
- Save a file with the current taxonomy. This file will be used to correct the taxonomy
dada2_otus <- str_split_fixed(blast_summary$query_id, pattern = "\\|", 8)
colnames(dada2_otus) <- c("OTUNumber", str_c("dada2_", pr2_taxo_levels[2:8]))
blast_summary <- cbind(dada2_otus, blast_summary)
write_tsv(as.data.frame(dada2_otus), str_c(dir_blast, "dada2_taxo.tsv", na = ""))3.5 Analyze BLAST output
3.5.1 Percent identity for best hit
- Mean identity of best hit (%) = 98.46
plot_pct <- gg_hist(blast_summary, "hit_top_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.75) + ggtitle("Best hit")summary(blast_summary$hit_top_pct_identity, digits = 4) Min. 1st Qu. Median Mean 3rd Qu. Max.
82 98 100 98 100 100
- Mean identity of best hit to isolates (%) = 94.81
plot_pct <- gg_hist(blast_summary, "hit_cult_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.25) + ggtitle("Best hit to isolate")summary(blast_summary$hit_cult_pct_identity, digits = 4) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
80 92 96 95 99 100 560
3.5.2 Average percent identity vs OTU rank
- The percent identity becomes lower for the less abundant OTUS
ggplot(aes(x = as.numeric(row.names(blast_summary)), y = hit_top_pct_identity),
data = blast_summary) + geom_point() + geom_smooth() + ylim(100, 80) + labs(x = "otu number",
y = "Top hit identity (%)")3.5.3 Average percent identity vs for different divisions
ggplot(aes(x = dada2_division, y = hit_top_pct_identity), data = blast_summary) +
geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division", y = "Top hit identity (%)") +
ggtitle("For all Otus")ggplot(aes(x = dada2_division, y = hit_top_pct_identity), data = blast_summary[1:100,
]) + geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division",
y = "Top hit identity (%)") + ggtitle("For first 100 Otus")3.6 Compare the dada2_taxonomy and the BLAST annotation at the genus level
blast_disagree_genus <- blast_summary %>% filter(dada2_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(dir_blast, "blast_disgree.tsv", na = ""))
blast_disagree_division <- blast_summary %>% filter(dada2_division != hit_pr2_division)Number of otus with difference in dada2 annotation and blast annotation (taxonomy is based on pr2).
Division level : 22
Genus level : 385
Remarks
- Some blastn gave back very short alignements : e.g. some Apusozoa, Chlorophyta.
- The first BLAST hit can be different from the dada2 assignement but then the following hits correspond to the dada2 assignment (necessary to do some kind of consensus assignement using a threshold value for length of hit and % identity).
- For example for otu0007 for which all top blasts are 100% matching and only the entry #7 corresponds to dada2 assignement.
| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| otu0007 | Chaunacanthida_X | Acanthometron | Acanthometron_sp |
| hit_acc | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|
| KF130582 | 100 | 713 | Arthracanthida-Symphyacanthida_X | Arthracanthida-Symphyacanthida_XX | Arthracanthida-Symphyacanthida_XX_sp. |
| KF129802 | 100 | 713 | Arthracanthida-Symphyacanthida_X | Arthracanthida-Symphyacanthida_XX | Arthracanthida-Symphyacanthida_XX_sp. |
| KJ760709 | 100 | 713 | Arthracanthida-Symphyacanthida_X | Arthracanthida-Symphyacanthida_XX | Arthracanthida-Symphyacanthida_XX_sp. |
| KJ760545 | 100 | 713 | Arthracanthida-Symphyacanthida_X | Arthracanthida-Symphyacanthida_XX | Arthracanthida-Symphyacanthida_XX_sp. |
| KJ760471 | 100 | 713 | Arthracanthida-Symphyacanthida_X | Arthracanthida-Symphyacanthida_XX | Arthracanthida-Symphyacanthida_XX_sp. |
| KF733601 | 100 | 713 | NA | NA | NA |
| JQ697713 | 100 | 713 | Chaunacanthida_X | Acanthometron | Acanthometron_sp. |
| JQ697711 | 100 | 713 | Chaunacanthida_X | Acanthometron | Acanthometron_sp. |
| JQ697710 | 100 | 713 | Chaunacanthida_X | Acanthometron | Acanthometron_sp. |
- For some groups V4 is not discrimant like for clade VII A2 vs A3 so the hits are from both (otu0073)
| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| otu0073 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp |
| hit_acc | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|
| LC189152 | 100 | 723 | NA | NA | NA |
| LC189149 | 100 | 723 | NA | NA | NA |
| KU743556 | 100 | 723 | NA | NA | NA |
| KU843591 | 100 | 723 | NA | NA | NA |
| KU843590 | 100 | 723 | NA | NA | NA |
| KU843585 | 100 | 723 | NA | NA | NA |
| KU843581 | 100 | 723 | NA | NA | NA |
| KU843576 | 100 | 723 | NA | NA | NA |
| KX014641 | 100 | 723 | NA | NA | NA |
| KT860872 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-2 | Prasino-Clade-VII-A-2_sp. |
| KT860659 | 100 | 723 | NA | NA | NA |
| KF925345 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A_X | Prasino-Clade-VII-A_X_sp. |
| KF031895 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-2 | Prasino-Clade-VII-A-2_sp. |
| KF031774 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp. |
| KF031687 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A_X | Prasino-Clade-VII-A_X_sp. |
| KF031657 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp. |
| JX291917 | 100 | 723 | NA | NA | NA |
| AY425302 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp. |
| U40921 | 100 | 723 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-2 | Prasino-Clade-VII-A-2_sp. |
3.7 Change made to the taxonomy assigned by dada2
Changes made are recorded in an Excel sheet that will loaded at the next step and the phyloseq file will modified accordingly. This allows a clean update if we see more change in the taxonomy.
otu0003 changed to Chrysophyceae Clade I
No change for Clade VII A2 and A3 because dada2 assigned all of them to Clade VII A3
dada2_taxo_change <- read_xlsx(str_c(dir_dada2, "dada2_taxo_changes.xlsx"))
knitr::kable(head(filter(dada2_taxo_change, taxo_changed == TRUE), 31), caption = "Taxonomic change") %>%
kable_styling(latex_options = "scale_down", font_size = 7)| OTUNumber | taxo_changed | dada2_species | dada2_supergroup | dada2_division | dada2_class | dada2_order | dada2_family | dada2_genus |
|---|---|---|---|---|---|---|---|---|
| otu0003 | TRUE | Chrysophyceae_Clade-I_X_sp. | Stramenopiles | Ochrophyta | Chrysophyceae | Chrysophyceae_X | Chrysophyceae_Clade-I | Chrysophyceae_Clade-I_X |