NZ 2017 - Analysis step 7
BLAST processing
NZ 2017 - Analysis step 7
BLAST processing
Load the variables common to the different scripts and the necessary libraries
source("00-NZ_2017_init.R", echo = FALSE)1 Steps
- BLAST of the mothur OTUs on server
- Process the BLAST output
- Analyse the BLAST data
- Compare mothur and BLAST assignation
- Compare dada2 and BLAST assingation
- Correct dada2 assignation based on BLAST file
2 Directory structure
- /mothur : mothur files
- /blast : BLAST output
3 Files used and generated
- nz_2017_otu_mothur.blast.tsv : result from the BLAST on the server
- nz_2017_otu_mothur.blast_pr2.tsv : result of the BLAST processed with additional columns (see below)
- nz_2017_otu_mothur.blast_summary.tsv : summary of BLAST with one line per otu (see below)
- NZ_2017_BLAST.xls : contains the information from the BLAST file (see below)
- phyloseq_nz_2017_mothur.rds : phyloseq file with mothur taxonomy
- phyloseq_nz_2017_dada2.rds : phyloseq file with dada2 taxonomy
- phyloseq_nz_2017.rds : final phyloseq file with dada2 taxonomy corrected based on BLAST (see below)
4 Analysis
4.1 To be done
- Nothing =)
4.2 BLAST processing
4.2.1 Run BLAST on server
Using a qsub run the following code (takes about 2 days - breaks in the middle because takes too much time)We use the OTU file generated at step 06.
DIR_PROJECT="/projet/sbr/ccebarcodep1408/nz_2017/blast/"
cd $DIR_PROJECT
FILE=nz_2017_otu_mothur
FASTA=$DIR_PROJECT$FILE".fas"
BLAST_XLM=$DIR_PROJECT$FILE".xml"
BLAST_TSV=$DIR_PROJECT$FILE".tsv"
OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"
# blastall -p blastn -b 100 -v 100 -m 7 -e 1.00e-100 -i $FASTA -o $BLAST_XLM -d /db/blast/all/nt
blastn -max_target_seqs 100 -evalue 1.00e-10 -query $FASTA -out $BLAST_TSV -db /db/blast/all/nt -outfmt "$OUT_FMT"
4.2.2 Process BLAST output
We call the blast_18S function to create two files
- a modified BLAST output file which includes additional coolumns
- kingdom -> species : PR2 taxonomy for those accession numbers that are present in PR2
- hit_rank : the rank of the hit based on decreasing % identity and decreasing bit scores
- uncultured : TRUE if the hit corresponds to an uncultured item
- hit_lineage : GenBank taxonomy of the hit
- a summary file with one line per query that contains three sets of columns
- The top hit (column with prefix hit_top_)
- The top hit for which a PR2 sequence is available (columns starting with hit_pr2_).
- The top hit corresponding to a culture or an isolate (columns starting with hit_cult_).
- A “consensus”" for taxonomy at the genus level based on sequences which have > 96% similairty and for which PR2 annotation exist.
- A remark concerning any contradiction at the division level between hits that have > 90% similarity. This may help detect chimera if the balance is 50:50. If this only for a small number of hits, this no problem (bad annotation ?)
blast_file_name <- str_c(path_blast, "nz_2017_otu_mothur.blast.tsv")
blast_18S_reformat(blast_file_name)Read the modified BLAST file (100 lines per otu) and the BLAST summary file (1 line per otu).
blast_file <- str_c(path_blast, "nz_2017_otu_mothur.blast_pr2.tsv")
blast <- read_tsv(blast_file, guess_max = 10000)
colnames(blast) [1] "query_id" "hit_id" "hit_acc"
[4] "hit_title" "hit_sci_names" "hit_tax_ids"
[7] "hit_super_kingdoms" "hit_blast_names" "pct_identity"
[10] "hit_length" "alignment_length" "mismatches"
[13] "gap_opens" "query_start" "query_end"
[16] "hit_start" "hit_end" "evalue"
[19] "bit_score" "hit_rank" "species"
[22] "kingdom" "supergroup" "division"
[25] "class" "order" "family"
[28] "genus" "uncultured" "tax_organism"
[31] "tax_lineage"
blast_summary_file <- str_c(path_blast, "nz_2017_otu_mothur.blast_summary.tsv")
blast_summary <- read_tsv(blast_summary_file, guess_max = 10000)
colnames(blast_summary) [1] "query_id" "hit_number"
[3] "hit_top_hit_id" "hit_top_hit_acc"
[5] "hit_top_hit_title" "hit_top_hit_sci_names"
[7] "hit_top_hit_tax_ids" "hit_top_hit_super_kingdoms"
[9] "hit_top_hit_blast_names" "hit_top_pct_identity"
[11] "hit_top_hit_length" "hit_top_alignment_length"
[13] "hit_top_mismatches" "hit_top_gap_opens"
[15] "hit_top_query_start" "hit_top_query_end"
[17] "hit_top_hit_start" "hit_top_hit_end"
[19] "hit_top_evalue" "hit_top_bit_score"
[21] "hit_top_hit_rank" "hit_top_species"
[23] "hit_top_kingdom" "hit_top_supergroup"
[25] "hit_top_division" "hit_top_class"
[27] "hit_top_order" "hit_top_family"
[29] "hit_top_genus" "hit_top_uncultured"
[31] "hit_top_tax_organism" "hit_top_tax_lineage"
[33] "hit_pr2_hit_id" "hit_pr2_hit_acc"
[35] "hit_pr2_hit_title" "hit_pr2_hit_sci_names"
[37] "hit_pr2_hit_tax_ids" "hit_pr2_hit_super_kingdoms"
[39] "hit_pr2_hit_blast_names" "hit_pr2_pct_identity"
[41] "hit_pr2_hit_length" "hit_pr2_alignment_length"
[43] "hit_pr2_mismatches" "hit_pr2_gap_opens"
[45] "hit_pr2_query_start" "hit_pr2_query_end"
[47] "hit_pr2_hit_start" "hit_pr2_hit_end"
[49] "hit_pr2_evalue" "hit_pr2_bit_score"
[51] "hit_pr2_hit_rank" "hit_pr2_species"
[53] "hit_pr2_kingdom" "hit_pr2_supergroup"
[55] "hit_pr2_division" "hit_pr2_class"
[57] "hit_pr2_order" "hit_pr2_family"
[59] "hit_pr2_genus" "hit_pr2_uncultured"
[61] "hit_pr2_tax_organism" "hit_pr2_tax_lineage"
[63] "hit_cult_hit_id" "hit_cult_hit_acc"
[65] "hit_cult_hit_title" "hit_cult_hit_sci_names"
[67] "hit_cult_hit_tax_ids" "hit_cult_hit_super_kingdoms"
[69] "hit_cult_hit_blast_names" "hit_cult_pct_identity"
[71] "hit_cult_hit_length" "hit_cult_alignment_length"
[73] "hit_cult_mismatches" "hit_cult_gap_opens"
[75] "hit_cult_query_start"
[ reached getOption("max.print") -- omitted 27 entries ]
- Split the query_id column into the otuxxxx field and the taxonomy.
- Note | is a metacharacter, so it must be escaped.
- Save a file with the current taxonomy. This file will be used to correct the taxonomy
mothur_otus <- str_split_fixed(blast_summary$query_id, pattern = "\\|", 8)
colnames(mothur_otus) <- c("OTUNumber", str_c("mothur_", pr2_taxo_levels[2:8]))
blast_summary <- cbind(mothur_otus, blast_summary)
write_tsv(as.data.frame(mothur_otus), str_c(path_mothur, "nz_2017_otu_mothur_taxo.tsv"),
na = "")4.3 Analyze BLAST output
4.3.1 Percent identity for best hit
- Mean identity of best hit (%) = 98.11
plot_pct <- gg_hist(blast_summary, "hit_top_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.75) + ggtitle("Best hit")summary(blast_summary$hit_top_pct_identity, digits = 4) Min. 1st Qu. Median Mean 3rd Qu. Max.
77.66 97.60 99.40 98.11 100.00 100.00
- Mean identity of best hit to isolates (%) = 93.80
plot_pct <- gg_hist(blast_summary, "hit_cult_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.25) + ggtitle("Best hit to isolate")summary(blast_summary$hit_cult_pct_identity, digits = 4) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
74.24 90.42 94.72 93.80 98.15 100.00 1422
4.3.2 Average percent identity vs OTU rank
- The percent identity becomes lower for the less abundant OTUS
ggplot(aes(x = as.numeric(row.names(blast_summary)), y = hit_top_pct_identity),
data = blast_summary) + geom_point() + geom_smooth() + ylim(100, 80) + labs(x = "otu number",
y = "Top hit identity (%)")4.3.3 Average percent identity for different divisions
The idea is to compare the % identity for the different divisions, with the idea that for some divisions we have very few reference sequences and therefore probably new sequences. This done for the first 100 OTUs and for all OTUs
ggplot(aes(x = mothur_division, y = hit_top_pct_identity), data = blast_summary) +
geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division", y = "Top hit identity (%)") +
ggtitle("For all Otus")ggplot(aes(x = mothur_division, y = hit_top_pct_identity), data = blast_summary[1:100,
]) + geom_boxplot() + ylim(100, 96) + coord_flip() + labs(x = "Division",
y = "Top hit identity (%)") + ggtitle("For first 100 Otus")4.4 Compare the mothur taxonomy and the BLAST annotation
blast_disagree_genus <- blast_summary %>% filter(mothur_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(path_blast, "NZ_2017_mothur_blast_disagree.tsv"),
na = "")
blast_disagree_division <- blast_summary %>% filter(mothur_division != hit_pr2_division)Number of otus with difference in mothur annotation and blast annotation (taxonomy is based on pr2).
Division level : 2440
Genus level : 1883
Remarks
- One of the problem of mothur is that if the confidence < 51 % at level ilevel then mothur sets the taxon as “Taxo[ilevel-1]_Unassigned“. This a problem with PR2 because PR2 use”Taxo[ilevel-1]_X" for such case. So it is hard to reconcile the taxo from mothur to the one from PR2
4.5 Compare the dada2 taxonomy and the BLAST annotation
4.5.1 Merge the dada2 assignement
Because of this problem we use the dada2 assignement rather than the mothur assignement to do the comparison with the BLAST.
- First we merge the dada2 assignement (see Step 06)
phyloseq_dada2 <- readRDS(phyloseq, file = "phyloseq_nz_2017_dada2.rds")
taxo_dada2 <- data.frame(tax_table(phyloseq_dada2)) %>% rename_all(funs(str_c("dada2_",
.))) %>% rownames_to_column(var = "OTUNumber")
blast_summary <- left_join(blast_summary, taxo_dada2)4.5.2 Compare the dada2 taxonomy and the BLAST annotation at the genus and division level
- We create a file (NZ_2017_dada2_blast_disagree.tsv) that lists the otus for which there is a disagrement at the genus level between the dada2 taxonomy and the BLAST taxonomy
blast_disagree_genus <- blast_summary %>% filter(dada2_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(path_blast, "NZ_2017_dada2_blast_disagree.tsv"),
na = "")
blast_disagree_division <- blast_summary %>% filter(dada2_division != hit_pr2_division)Number of otus with difference in dada2 annotation and BLAST annotation (taxonomy is based on pr2). Note that this is much better than with the mothur assingation. At the division levels, more than 10 times less disagreement. At the genus level almost twice less.
Division level : 155
Genus level : 1148
4.6 Correct the dada2 taxonomy
All the data can be found in the the workseet NZ_2017_BLAST.xls with 4 individual tables * taxonomy corrected : keep track of all corrections * summary disagreement : the file created at the previous step keeping only otus for which the dada2 and BLAST taxo disagree at the genus level * dada2 bootstrap : the dada2 bootstrap values (a value of 100 means that the assignation is very strong) * hit table (100 lines) for the first 300 otus
We proceed in the following way: 1. We only correct up to otu300 (the following otus are much less important). Note that this could be done in other specific otus 2. For each otu where there is a disagreement between the dada2 assignement and blast we look at the 100 hits 3. Correction is made only when necessary and reported in the “taxonomy corrected sheet”
4.6.1 Examples of disagreements and change
In contrast to mothur, dada2 has a tendency to attribute an otu to a species rather than an uncultured group. Much of the time it is fine except in a few cases where PR2 annotation is not accurate and an environmental sequences is attributed wrongly to a species (often the case with dinos)
- otu0029 : dada2 labelled as Pycnococcacae while it hits Pycnonoccus with 100% similarity
blast_summary_display("Otu0029")| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| Otu0029 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
blast_display("Otu0029")| hit_acc | hit_title | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|---|
| LC189146 | Pycnococcus provasolii gene for 18S ribosomal RNA, partial sequence, strain: NIES-3882 | 100 | 699 | NA | NA | NA |
| KT860878 | Pycnococcus provasolii strain RCC254 18S ribosomal RNA gene, partial sequence | 100 | 699 | NA | NA | NA |
| KT860845 | Pycnococcus provasolii strain RCC3055 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
| KT860844 | Pycnococcus provasolii strain RCC2336 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
| JX188319 | Uncultured eukaryote clone CC02A105.058 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
| FR865764 | Pycnococcus provasolii genomic DNA containing 18S rRNA gene, ITS1, culture collection CCAP 913/8 | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
| HQ427472 | Uncultured eukaryote clone RU12192007A14 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
| FJ431781 | Uncultured marine Pycnococcaceae clone RA070625T.069 18S ribosomal RNA gene, partial sequence | 100 | 699 | NA | NA | NA |
| GQ122364 | Pycnococcus sp. KMMCC P-54 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
| EU143513 | Uncultured marine Pycnococcaceae clone PROSOPE.CD-50m.172 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
| EU143510 | Uncultured marine Pycnococcaceae clone PROSOPE.C9-5m.73 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
| AY425304 | Pseudoscourfieldia marina small subunit ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pseudoscourfieldia | Pseudoscourfieldia_marina |
| AY046704 | Uncultured eukaryote isolate A1_E038 18S ribosomal RNA gene, partial sequence | 100 | 699 | NA | NA | NA |
| AY046686 | Uncultured eukaryote isolate A1_E019 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcaceae_X | Pycnococcaceae_X_sp. |
| AB058359 | Pycnococcus sp. MBIC10519 gene for 18S rRNA, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_sp. |
| AF122889 | Pycnococcus provasolii 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
| AF122888 | Pseudoscourfieldia marina 18S ribosomal RNA gene, partial sequence | 100 | 699 | Pycnococcaceae | Pseudoscourfieldia | Pseudoscourfieldia_marina |
| AJ132619 | Pseudoscourfieldia marina 18S rRNA gene, strain K-0017, partial | 100 | 699 | Pycnococcaceae | Pseudoscourfieldia | Pseudoscourfieldia_marina |
| X91264 | P.provasolii 18S ribosomal RNA gene | 100 | 699 | Pycnococcaceae | Pycnococcus | Pycnococcus_provasolii |
- otu0039 : dada2 labelled as Gephyrocapsa but Emiliania huxleyi has same sequence
blast_summary_display("Otu0039")| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| Otu0039 | Noelaerhabdaceae | Gephyrocapsa | Gephyrocapsa_oceanica |
blast_display("Otu0039")| hit_acc | hit_title | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|---|
| MG022751 | Emiliania huxleyi isolate CCAP 920/9 small subunit ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| KX229688 | Emiliania huxleyi strain UNC1419 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| LC189150 | Gephyrocapsa oceanica gene for 18S ribosomal RNA, partial sequence, strain: NIES-3886 | 100 | 697 | NA | NA | NA |
| KT825602 | Uncultured marine eukaryote clone B21 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| KX056536 | Reticulofenestra parvula isolate RCC4036 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| KX056535 | Reticulofenestra parvula isolate RCC4035 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| KX056534 | Reticulofenestra parvula isolate RCC4034 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| KT861255 | Emiliania huxleyi strain RCC909 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| XM_005792518 | Emiliania huxleyi CCMP1516 hypothetical protein mRNA | 100 | 697 | NA | NA | NA |
| XM_005765611 | Emiliania huxleyi CCMP1516 hypothetical protein mRNA | 100 | 697 | NA | NA | NA |
| XM_005786329 | Emiliania huxleyi CCMP1516 hypothetical protein mRNA | 100 | 697 | NA | NA | NA |
| XM_005783370 | Emiliania huxleyi CCMP1516 hypothetical protein mRNA | 100 | 697 | NA | NA | NA |
| XM_005770834 | Emiliania huxleyi CCMP1516 hypothetical protein mRNA | 100 | 697 | NA | NA | NA |
| JX680408 | Uncultured haptophyte clone Ma135-Pry1-C4 18S ribosomal RNA gene, partial sequence | 100 | 697 | Noelaerhabdaceae | Emiliania_Gephyrocapsa | Emiliania_Gephyrocapsa_sp. |
| JX291918 | Uncultured eukaryote clone 7657BH1668_SP6 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| JX291909 | Uncultured eukaryote clone 7657BH1659_SP6 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| JX291873 | Uncultured eukaryote clone 7657BH1623_SP6 18S ribosomal RNA gene, partial sequence | 100 | 697 | NA | NA | NA |
| HQ877901 | Emiliania huxleyi strain KMMCC H-18 18S ribosomal RNA gene, partial sequence | 100 | 697 | Noelaerhabdaceae | Emiliania | Emiliania_huxleyi |
| AB183665 | Gephyrocapsa oceanica gene for 18S rRNA, partial sequence, strain: MBIC11100 | 100 | 697 | Noelaerhabdaceae | Gephyrocapsa | Gephyrocapsa_oceanica |
- otu0037 : For some groups V4 is not discrimant like for clade VII A2 vs A3 so the hits are from both
blast_summary_display("Otu0037")| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| Otu0037 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp. |
blast_display("Otu0037")| hit_acc | hit_title | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|---|
| KU743497 | Uncultured Chlorophyta clone 08D30P39 small subunit ribosomal RNA gene, partial sequence | 100.00 | 695 | NA | NA | NA |
| KU843588 | Prasinophyte sp. RCC1019 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 100.00 | 695 | NA | NA | NA |
| KU843586 | Prasinophyte sp. RCC996 (clade VIIA) 18S ribosomal RNA gene, partial sequence | 100.00 | 695 | NA | NA | NA |
| LC189152 | Prasinophyceae sp. NIES-3888 gene for 18S ribosomal RNA, partial sequence | 99.73 | 689 | NA | NA | NA |
| LC189149 | Prasinophyceae sp. NIES-3885 gene for 18S ribosomal RNA, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU743556 | Uncultured Chlorophyta clone 08D03P51 small subunit ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU743555 | Uncultured Chlorophyta clone 08D03P46 small subunit ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU843591 | Prasinophyte sp. RCC1043 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU843590 | Prasinophyte sp. RCC1032 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU843585 | Prasinophyte sp. RCC857 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU843581 | Prasinophyte sp. RCC717 (clade VIIA2) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KU843576 | Prasinophyte sp. NIES-3671 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KX014641 | Prasinophyceae sp. RCC4569 (clade 7X-A3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KT860872 | Prasinophyte sp. RCC138 (clade VIIA2) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-2 | Prasino-Clade-VII-A-2_sp. |
| KT860659 | Prasinophyte sp. RCC297 (clade VIIA3) 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | NA | NA | NA |
| KF925345 | Prasinophyceae sp. CCMP1205 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | Prasino-Clade-VII-A | Prasino-Clade-VII-A_X | Prasino-Clade-VII-A_X_sp. |
| KF031895 | Uncultured eukaryote clone CC02A740.048 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-2 | Prasino-Clade-VII-A-2_sp. |
| KF031774 | Uncultured eukaryote clone CC02A105.073 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | Prasino-Clade-VII-A | Prasino-Clade-VII-A-3 | Prasino-Clade-VII-A-3_sp. |
| KF031687 | Uncultured eukaryote clone CC02SE75.038 18S ribosomal RNA gene, partial sequence | 99.73 | 689 | Prasino-Clade-VII-A | Prasino-Clade-VII-A_X | Prasino-Clade-VII-A_X_sp. |
- otu0082 : dada2 assign as Apocalathium but the hit is quite low (90%) so better to assign to Dinophyceae_XXX_sp.
blast_summary_display("Otu0082")| OTUNumber | dada2_family | dada2_genus | dada2_species |
|---|---|---|---|
| Otu0082 | Thoracosphaeraceae | Apocalathium | Apocalathium_baicalense |
blast_display("Otu0082")| hit_acc | hit_title | pct_identity | bit_score | family | genus | species |
|---|---|---|---|---|---|---|
| AY046663 | Uncultured eukaryote isolate CS_E042 18S ribosomal RNA gene, partial sequence | 94.72 | 588 | NA | NA | NA |
| AY046659 | Uncultured eukaryote isolate CS_E035 18S ribosomal RNA gene, partial sequence | 94.46 | 582 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| AY046866 | Uncultured eukaryote isolate C3_E038 18S ribosomal RNA gene, partial sequence | 94.20 | 579 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| AY046773 | Uncultured eukaryote isolate A3_E031 18S ribosomal RNA gene, partial sequence | 94.20 | 577 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| AY046787 | Uncultured eukaryote isolate A3_E046 18S ribosomal RNA gene, partial sequence | 93.67 | 568 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| AY046653 | Uncultured eukaryote isolate CS_E028 18S ribosomal RNA gene, partial sequence | 93.40 | 562 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| AJ402340 | eukaryote clone OLI11027 18S rRNA gene | 93.40 | 560 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| KT813471 | Uncultured eukaryote clone 34c_59722 18S ribosomal RNA gene, partial sequence | 93.05 | 544 | NA | NA | NA |
| DQ371292 | Polykrikos kofoidii isolate 2 18S ribosomal RNA gene, partial sequence | 91.56 | 521 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| AB466294 | Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, strain: 2 | 91.29 | 516 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| AB466292 | Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, isolate: 070216-6 | 91.29 | 516 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| AB466290 | Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, country: Japan:Iwate, Ofunato Bay | 91.29 | 516 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| DQ371291 | Polykrikos kofoidii isolate 1 18S ribosomal RNA gene, partial sequence | 91.29 | 516 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| AB466291 | Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, country: Japan:Nagasaki, Omura Bay | 91.08 | 512 | Gymnodiniaceae | Polykrikos | Polykrikos_kofoidii |
| KJ757664 | Uncultured eukaryote clone SGYY886 18S ribosomal RNA gene, partial sequence | 91.03 | 510 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| KJ759286 | Uncultured eukaryote clone SGYO1066 18S ribosomal RNA gene, partial sequence | 91.01 | 510 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| KC771146 | Uncultured marine eukaryote clone I-9-MC884-OTU-47 18S ribosomal RNA gene, partial sequence | 91.01 | 510 | Peridiniaceae | Peridinium | Peridinium_sp. |
| FN690233 | Uncultured alveolate partial 18S rRNA gene, clone 4-H5 | 91.01 | 510 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
| EF527032 | Uncultured marine eukaryote clone SIF_3C2 18S ribosomal RNA gene, partial sequence | 91.01 | 510 | Dinophyceae_XX | Dinophyceae_XXX | Dinophyceae_XXX_sp. |
4.7 Final list of changes made to the taxonomy assigned by dada2
Changes made are recorded in the Excel sheet (NZ_2017_BLAST.xls) that will loaded at the next step and the phyloseq file will modified accordingly. This allows a clean update if we see more change in the taxonomy.
4.7.1 List of changes made
taxo_corrected <- read_xlsx(str_c(path_files, "NZ_2017_BLAST.xlsx"), sheet = "taxonomy corrected")
taxo_corrected_show <- taxo_corrected %>% filter(!is.na(corrected_species)) %>%
select(OTUNumber, dada2_species, corrected_species, corrected_class:corrected_genus,
corrected_remark)
knitr::kable(head(taxo_corrected_show, 31), caption = "Taxonomic corrections made") %>%
kable_styling(latex_options = "scale_down", font_size = 10)| OTUNumber | dada2_species | corrected_species | corrected_class | corrected_order | corrected_family | corrected_genus | corrected_remark |
|---|---|---|---|---|---|---|---|
| Otu0029 | Pycnococcaceae_X_sp. | Pycnococcus_provasolii | Prasino-Clade-V | Pseudoscourfieldiales | Pycnococcaceae | Pycnococcus | Hits Pycnococcus 100% |
| Otu0039 | Gephyrocapsa_oceanica | Emiliania_huxleyi | Prymnesiophyceae | Isochrysidales | Noelaerhabdaceae | Emiliania | Emiliania and Gephyrocapsa cannot be distinguished |
| Otu0082 | Apocalathium_baicalense | Dinophyceae_XXX_sp. | Dinophyceae | Dinophyceae_X | Dinophyceae_XX | Dinophyceae_XXX | Hit to Apocalathium only 90% |
| Otu0114 | Bispinodinium_angelaceum | Eukaryota_XXXXXX_sp. | Eukaryota_XXX | Eukaryota_XXXX | Eukaryota_XXXXX | Eukaryota_XXXXXX | Very low hit to any group (83% to Fungi) |
| Otu0119 | Gyrodinium_sp. | Heterocapsa_nei/rotundata | Dinophyceae | Peridiniales | Heterocapsaceae | Heterocapsa | Only one sequence of Gyrodinium |
| Otu0120 | Dinophyceae_XXX_sp. | Gyrodinium_sp. | Dinophyceae | Gymnodiniales | Gymnodiniaceae | Gyrodinium | 98% to Gyrodinium |
| Otu0124 | Gymnodinium_sp. | Dinophyceae_XXX_sp. | Dinophyceae | Dinophyceae_X | Dinophyceae_XX | Dinophyceae_XXX | No hit to Gymnodiunium culture |
| Otu0133 | Dino-Group-II-Clade-46_X_sp. | Dino-Group-II_XX_sp. | Syndiniales | Dino-Group-II | Dino-Group-II_X | Dino-Group-II_XX | Hit very low 85% |
| Otu0221 | Blastodinium_pruvoti | Dinophyceae_XXX_sp. | Dinophyceae | Dinophyceae_X | Dinophyceae_XX | Dinophyceae_XXX | Low hit (89%) to cultures |
| Otu0242 | Syndinium_turbo | Hematodinium_sp. | Syndiniales | Dino-Group-IV | Dino-Group-IV-Hematodinium-Group | Hematodinium | Better hit to Hematodinium |
4.7.2 Correct the phyloseq file
The file is saved into phyloseq_nz_2017.rds
taxo_changed <- taxo_corrected %>% filter(!is.na(corrected_species)) %>% select(OTUNumber,
corrected_supergroup:corrected_species) %>% rename_all(funs(str_replace(.,
"corrected_", "")))
taxo_kept <- taxo_corrected %>% filter(is.na(corrected_species)) %>% select(OTUNumber,
dada2_supergroup:dada2_species) %>% rename_all(funs(str_replace(., "dada2_",
"")))
taxo_final <- rbind(taxo_kept, taxo_changed) %>% arrange(OTUNumber) %>% mutate(kingdom = "Eukaryota") %>%
column_to_rownames(var = "OTUNumber")
phyloseq_final <- phyloseq_dada2
tax_table(phyloseq_final) <- as.matrix(taxo_final)
saveRDS(phyloseq_final, file = "phyloseq_nz_2017.rds")