NZ 2017 - Analysis step 7
BLAST processing

Load the variables common to the different scripts and the necessary libraries

source("00-NZ_2017_init.R", echo = FALSE)

1 Steps

  1. BLAST of the mothur OTUs on server
  2. Process the BLAST output
  3. Analyse the BLAST data
  4. Compare mothur and BLAST assignation
  5. Compare dada2 and BLAST assingation
  6. Correct dada2 assignation based on BLAST file

2 Directory structure

  • /mothur : mothur files
  • /blast : BLAST output

3 Files used and generated

  • nz_2017_otu_mothur.blast.tsv : result from the BLAST on the server
  • nz_2017_otu_mothur.blast_pr2.tsv : result of the BLAST processed with additional columns (see below)
  • nz_2017_otu_mothur.blast_summary.tsv : summary of BLAST with one line per otu (see below)
  • NZ_2017_BLAST.xls : contains the information from the BLAST file (see below)
  • phyloseq_nz_2017_mothur.rds : phyloseq file with mothur taxonomy
  • phyloseq_nz_2017_dada2.rds : phyloseq file with dada2 taxonomy
  • phyloseq_nz_2017.rds : final phyloseq file with dada2 taxonomy corrected based on BLAST (see below)

4 Analysis

4.1 To be done

  • Nothing =)

4.2 BLAST processing

4.2.1 Run BLAST on server

Using a qsub run the following code (takes about 2 days - breaks in the middle because takes too much time)We use the OTU file generated at step 06.

DIR_PROJECT="/projet/sbr/ccebarcodep1408/nz_2017/blast/"

cd $DIR_PROJECT

FILE=nz_2017_otu_mothur

FASTA=$DIR_PROJECT$FILE".fas"
BLAST_XLM=$DIR_PROJECT$FILE".xml"
BLAST_TSV=$DIR_PROJECT$FILE".tsv"
OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"

# blastall -p blastn -b 100 -v 100 -m 7 -e 1.00e-100 -i $FASTA -o $BLAST_XLM -d /db/blast/all/nt
blastn -max_target_seqs 100 -evalue 1.00e-10 -query $FASTA -out $BLAST_TSV -db /db/blast/all/nt -outfmt "$OUT_FMT"

4.2.2 Process BLAST output

We call the blast_18S function to create two files

  • a modified BLAST output file which includes additional coolumns
    • kingdom -> species : PR2 taxonomy for those accession numbers that are present in PR2
    • hit_rank : the rank of the hit based on decreasing % identity and decreasing bit scores
    • uncultured : TRUE if the hit corresponds to an uncultured item
    • hit_lineage : GenBank taxonomy of the hit
  • a summary file with one line per query that contains three sets of columns
    • The top hit (column with prefix hit_top_)
    • The top hit for which a PR2 sequence is available (columns starting with hit_pr2_).
    • The top hit corresponding to a culture or an isolate (columns starting with hit_cult_).
    • A “consensus”" for taxonomy at the genus level based on sequences which have > 96% similairty and for which PR2 annotation exist.
    • A remark concerning any contradiction at the division level between hits that have > 90% similarity. This may help detect chimera if the balance is 50:50. If this only for a small number of hits, this no problem (bad annotation ?)
blast_file_name <- str_c(path_blast, "nz_2017_otu_mothur.blast.tsv")
blast_18S_reformat(blast_file_name)

Read the modified BLAST file (100 lines per otu) and the BLAST summary file (1 line per otu).

blast_file <- str_c(path_blast, "nz_2017_otu_mothur.blast_pr2.tsv")
blast <- read_tsv(blast_file, guess_max = 10000)
colnames(blast)
 [1] "query_id"           "hit_id"             "hit_acc"           
 [4] "hit_title"          "hit_sci_names"      "hit_tax_ids"       
 [7] "hit_super_kingdoms" "hit_blast_names"    "pct_identity"      
[10] "hit_length"         "alignment_length"   "mismatches"        
[13] "gap_opens"          "query_start"        "query_end"         
[16] "hit_start"          "hit_end"            "evalue"            
[19] "bit_score"          "hit_rank"           "species"           
[22] "kingdom"            "supergroup"         "division"          
[25] "class"              "order"              "family"            
[28] "genus"              "uncultured"         "tax_organism"      
[31] "tax_lineage"       
blast_summary_file <- str_c(path_blast, "nz_2017_otu_mothur.blast_summary.tsv")
blast_summary <- read_tsv(blast_summary_file, guess_max = 10000)
colnames(blast_summary)
 [1] "query_id"                    "hit_number"                 
 [3] "hit_top_hit_id"              "hit_top_hit_acc"            
 [5] "hit_top_hit_title"           "hit_top_hit_sci_names"      
 [7] "hit_top_hit_tax_ids"         "hit_top_hit_super_kingdoms" 
 [9] "hit_top_hit_blast_names"     "hit_top_pct_identity"       
[11] "hit_top_hit_length"          "hit_top_alignment_length"   
[13] "hit_top_mismatches"          "hit_top_gap_opens"          
[15] "hit_top_query_start"         "hit_top_query_end"          
[17] "hit_top_hit_start"           "hit_top_hit_end"            
[19] "hit_top_evalue"              "hit_top_bit_score"          
[21] "hit_top_hit_rank"            "hit_top_species"            
[23] "hit_top_kingdom"             "hit_top_supergroup"         
[25] "hit_top_division"            "hit_top_class"              
[27] "hit_top_order"               "hit_top_family"             
[29] "hit_top_genus"               "hit_top_uncultured"         
[31] "hit_top_tax_organism"        "hit_top_tax_lineage"        
[33] "hit_pr2_hit_id"              "hit_pr2_hit_acc"            
[35] "hit_pr2_hit_title"           "hit_pr2_hit_sci_names"      
[37] "hit_pr2_hit_tax_ids"         "hit_pr2_hit_super_kingdoms" 
[39] "hit_pr2_hit_blast_names"     "hit_pr2_pct_identity"       
[41] "hit_pr2_hit_length"          "hit_pr2_alignment_length"   
[43] "hit_pr2_mismatches"          "hit_pr2_gap_opens"          
[45] "hit_pr2_query_start"         "hit_pr2_query_end"          
[47] "hit_pr2_hit_start"           "hit_pr2_hit_end"            
[49] "hit_pr2_evalue"              "hit_pr2_bit_score"          
[51] "hit_pr2_hit_rank"            "hit_pr2_species"            
[53] "hit_pr2_kingdom"             "hit_pr2_supergroup"         
[55] "hit_pr2_division"            "hit_pr2_class"              
[57] "hit_pr2_order"               "hit_pr2_family"             
[59] "hit_pr2_genus"               "hit_pr2_uncultured"         
[61] "hit_pr2_tax_organism"        "hit_pr2_tax_lineage"        
[63] "hit_cult_hit_id"             "hit_cult_hit_acc"           
[65] "hit_cult_hit_title"          "hit_cult_hit_sci_names"     
[67] "hit_cult_hit_tax_ids"        "hit_cult_hit_super_kingdoms"
[69] "hit_cult_hit_blast_names"    "hit_cult_pct_identity"      
[71] "hit_cult_hit_length"         "hit_cult_alignment_length"  
[73] "hit_cult_mismatches"         "hit_cult_gap_opens"         
[75] "hit_cult_query_start"       
 [ reached getOption("max.print") -- omitted 27 entries ]
  • Split the query_id column into the otuxxxx field and the taxonomy.
    • Note | is a metacharacter, so it must be escaped.
  • Save a file with the current taxonomy. This file will be used to correct the taxonomy
mothur_otus <- str_split_fixed(blast_summary$query_id, pattern = "\\|", 8)
colnames(mothur_otus) <- c("OTUNumber", str_c("mothur_", pr2_taxo_levels[2:8]))
blast_summary <- cbind(mothur_otus, blast_summary)
write_tsv(as.data.frame(mothur_otus), str_c(path_mothur, "nz_2017_otu_mothur_taxo.tsv"), 
    na = "")

4.3 Analyze BLAST output

4.3.1 Percent identity for best hit

  • Mean identity of best hit (%) = 98.11
plot_pct <- gg_hist(blast_summary, "hit_top_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.75) + ggtitle("Best hit")

summary(blast_summary$hit_top_pct_identity, digits = 4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  77.66   97.60   99.40   98.11  100.00  100.00 
  • Mean identity of best hit to isolates (%) = 93.80
plot_pct <- gg_hist(blast_summary, "hit_cult_pct_identity")
plot_pct + xlim(100, 80) + ylim(0, 0.25) + ggtitle("Best hit to isolate")

summary(blast_summary$hit_cult_pct_identity, digits = 4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  74.24   90.42   94.72   93.80   98.15  100.00    1422 

4.3.2 Average percent identity vs OTU rank

  • The percent identity becomes lower for the less abundant OTUS
ggplot(aes(x = as.numeric(row.names(blast_summary)), y = hit_top_pct_identity), 
    data = blast_summary) + geom_point() + geom_smooth() + ylim(100, 80) + labs(x = "otu number", 
    y = "Top hit identity (%)")

4.3.3 Average percent identity for different divisions

The idea is to compare the % identity for the different divisions, with the idea that for some divisions we have very few reference sequences and therefore probably new sequences. This done for the first 100 OTUs and for all OTUs

ggplot(aes(x = mothur_division, y = hit_top_pct_identity), data = blast_summary) + 
    geom_boxplot() + ylim(100, 80) + coord_flip() + labs(x = "Division", y = "Top hit identity (%)") + 
    ggtitle("For all Otus")

ggplot(aes(x = mothur_division, y = hit_top_pct_identity), data = blast_summary[1:100, 
    ]) + geom_boxplot() + ylim(100, 96) + coord_flip() + labs(x = "Division", 
    y = "Top hit identity (%)") + ggtitle("For first 100 Otus")

4.4 Compare the mothur taxonomy and the BLAST annotation

blast_disagree_genus <- blast_summary %>% filter(mothur_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(path_blast, "NZ_2017_mothur_blast_disagree.tsv"), 
    na = "")

blast_disagree_division <- blast_summary %>% filter(mothur_division != hit_pr2_division)
  • Number of otus with difference in mothur annotation and blast annotation (taxonomy is based on pr2).

    • Division level : 2440

    • Genus level : 1883

Remarks

  • One of the problem of mothur is that if the confidence < 51 % at level ilevel then mothur sets the taxon as “Taxo[ilevel-1]_Unassigned“. This a problem with PR2 because PR2 use”Taxo[ilevel-1]_X" for such case. So it is hard to reconcile the taxo from mothur to the one from PR2

4.5 Compare the dada2 taxonomy and the BLAST annotation

4.5.1 Merge the dada2 assignement

Because of this problem we use the dada2 assignement rather than the mothur assignement to do the comparison with the BLAST.

  • First we merge the dada2 assignement (see Step 06)
phyloseq_dada2 <- readRDS(phyloseq, file = "phyloseq_nz_2017_dada2.rds")

taxo_dada2 <- data.frame(tax_table(phyloseq_dada2)) %>% rename_all(funs(str_c("dada2_", 
    .))) %>% rownames_to_column(var = "OTUNumber")
blast_summary <- left_join(blast_summary, taxo_dada2)

4.5.2 Compare the dada2 taxonomy and the BLAST annotation at the genus and division level

  • We create a file (NZ_2017_dada2_blast_disagree.tsv) that lists the otus for which there is a disagrement at the genus level between the dada2 taxonomy and the BLAST taxonomy
blast_disagree_genus <- blast_summary %>% filter(dada2_genus != hit_pr2_genus)
write_tsv(blast_disagree_genus, str_c(path_blast, "NZ_2017_dada2_blast_disagree.tsv"), 
    na = "")

blast_disagree_division <- blast_summary %>% filter(dada2_division != hit_pr2_division)
  • Number of otus with difference in dada2 annotation and BLAST annotation (taxonomy is based on pr2). Note that this is much better than with the mothur assingation. At the division levels, more than 10 times less disagreement. At the genus level almost twice less.

    • Division level : 155

    • Genus level : 1148

4.6 Correct the dada2 taxonomy

All the data can be found in the the workseet NZ_2017_BLAST.xls with 4 individual tables * taxonomy corrected : keep track of all corrections * summary disagreement : the file created at the previous step keeping only otus for which the dada2 and BLAST taxo disagree at the genus level * dada2 bootstrap : the dada2 bootstrap values (a value of 100 means that the assignation is very strong) * hit table (100 lines) for the first 300 otus

We proceed in the following way: 1. We only correct up to otu300 (the following otus are much less important). Note that this could be done in other specific otus 2. For each otu where there is a disagreement between the dada2 assignement and blast we look at the 100 hits 3. Correction is made only when necessary and reported in the “taxonomy corrected sheet”

4.6.1 Examples of disagreements and change

In contrast to mothur, dada2 has a tendency to attribute an otu to a species rather than an uncultured group. Much of the time it is fine except in a few cases where PR2 annotation is not accurate and an environmental sequences is attributed wrongly to a species (often the case with dinos)

  • otu0029 : dada2 labelled as Pycnococcacae while it hits Pycnonoccus with 100% similarity
blast_summary_display("Otu0029")
Otu0029 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
Otu0029 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
blast_display("Otu0029")
Otu0029 - BLAST results - top 10 hits
hit_acc hit_title pct_identity bit_score family genus species
LC189146 Pycnococcus provasolii gene for 18S ribosomal RNA, partial sequence, strain: NIES-3882 100 699 NA NA NA
KT860878 Pycnococcus provasolii strain RCC254 18S ribosomal RNA gene, partial sequence 100 699 NA NA NA
KT860845 Pycnococcus provasolii strain RCC3055 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
KT860844 Pycnococcus provasolii strain RCC2336 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
JX188319 Uncultured eukaryote clone CC02A105.058 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
FR865764 Pycnococcus provasolii genomic DNA containing 18S rRNA gene, ITS1, culture collection CCAP 913/8 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
HQ427472 Uncultured eukaryote clone RU12192007A14 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
FJ431781 Uncultured marine Pycnococcaceae clone RA070625T.069 18S ribosomal RNA gene, partial sequence 100 699 NA NA NA
GQ122364 Pycnococcus sp. KMMCC P-54 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
EU143513 Uncultured marine Pycnococcaceae clone PROSOPE.CD-50m.172 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
EU143510 Uncultured marine Pycnococcaceae clone PROSOPE.C9-5m.73 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
AY425304 Pseudoscourfieldia marina small subunit ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pseudoscourfieldia Pseudoscourfieldia_marina
AY046704 Uncultured eukaryote isolate A1_E038 18S ribosomal RNA gene, partial sequence 100 699 NA NA NA
AY046686 Uncultured eukaryote isolate A1_E019 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcaceae_X Pycnococcaceae_X_sp.
AB058359 Pycnococcus sp. MBIC10519 gene for 18S rRNA, partial sequence 100 699 Pycnococcaceae Pycnococcus Pycnococcus_sp.
AF122889 Pycnococcus provasolii 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
AF122888 Pseudoscourfieldia marina 18S ribosomal RNA gene, partial sequence 100 699 Pycnococcaceae Pseudoscourfieldia Pseudoscourfieldia_marina
AJ132619 Pseudoscourfieldia marina 18S rRNA gene, strain K-0017, partial 100 699 Pycnococcaceae Pseudoscourfieldia Pseudoscourfieldia_marina
X91264 P.provasolii 18S ribosomal RNA gene 100 699 Pycnococcaceae Pycnococcus Pycnococcus_provasolii
  • otu0039 : dada2 labelled as Gephyrocapsa but Emiliania huxleyi has same sequence
blast_summary_display("Otu0039")
Otu0039 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
Otu0039 Noelaerhabdaceae Gephyrocapsa Gephyrocapsa_oceanica
blast_display("Otu0039")
Otu0039 - BLAST results - top 10 hits
hit_acc hit_title pct_identity bit_score family genus species
MG022751 Emiliania huxleyi isolate CCAP 920/9 small subunit ribosomal RNA gene, partial sequence 100 697 NA NA NA
KX229688 Emiliania huxleyi strain UNC1419 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
LC189150 Gephyrocapsa oceanica gene for 18S ribosomal RNA, partial sequence, strain: NIES-3886 100 697 NA NA NA
KT825602 Uncultured marine eukaryote clone B21 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
KX056536 Reticulofenestra parvula isolate RCC4036 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
KX056535 Reticulofenestra parvula isolate RCC4035 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
KX056534 Reticulofenestra parvula isolate RCC4034 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
KT861255 Emiliania huxleyi strain RCC909 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
XM_005792518 Emiliania huxleyi CCMP1516 hypothetical protein mRNA 100 697 NA NA NA
XM_005765611 Emiliania huxleyi CCMP1516 hypothetical protein mRNA 100 697 NA NA NA
XM_005786329 Emiliania huxleyi CCMP1516 hypothetical protein mRNA 100 697 NA NA NA
XM_005783370 Emiliania huxleyi CCMP1516 hypothetical protein mRNA 100 697 NA NA NA
XM_005770834 Emiliania huxleyi CCMP1516 hypothetical protein mRNA 100 697 NA NA NA
JX680408 Uncultured haptophyte clone Ma135-Pry1-C4 18S ribosomal RNA gene, partial sequence 100 697 Noelaerhabdaceae Emiliania_Gephyrocapsa Emiliania_Gephyrocapsa_sp.
JX291918 Uncultured eukaryote clone 7657BH1668_SP6 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
JX291909 Uncultured eukaryote clone 7657BH1659_SP6 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
JX291873 Uncultured eukaryote clone 7657BH1623_SP6 18S ribosomal RNA gene, partial sequence 100 697 NA NA NA
HQ877901 Emiliania huxleyi strain KMMCC H-18 18S ribosomal RNA gene, partial sequence 100 697 Noelaerhabdaceae Emiliania Emiliania_huxleyi
AB183665 Gephyrocapsa oceanica gene for 18S rRNA, partial sequence, strain: MBIC11100 100 697 Noelaerhabdaceae Gephyrocapsa Gephyrocapsa_oceanica
  • otu0037 : For some groups V4 is not discrimant like for clade VII A2 vs A3 so the hits are from both
blast_summary_display("Otu0037")
Otu0037 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
Otu0037 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp.
blast_display("Otu0037")
Otu0037 - BLAST results - top 10 hits
hit_acc hit_title pct_identity bit_score family genus species
KU743497 Uncultured Chlorophyta clone 08D30P39 small subunit ribosomal RNA gene, partial sequence 100.00 695 NA NA NA
KU843588 Prasinophyte sp. RCC1019 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 100.00 695 NA NA NA
KU843586 Prasinophyte sp. RCC996 (clade VIIA) 18S ribosomal RNA gene, partial sequence 100.00 695 NA NA NA
LC189152 Prasinophyceae sp. NIES-3888 gene for 18S ribosomal RNA, partial sequence 99.73 689 NA NA NA
LC189149 Prasinophyceae sp. NIES-3885 gene for 18S ribosomal RNA, partial sequence 99.73 689 NA NA NA
KU743556 Uncultured Chlorophyta clone 08D03P51 small subunit ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU743555 Uncultured Chlorophyta clone 08D03P46 small subunit ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU843591 Prasinophyte sp. RCC1043 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU843590 Prasinophyte sp. RCC1032 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU843585 Prasinophyte sp. RCC857 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU843581 Prasinophyte sp. RCC717 (clade VIIA2) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KU843576 Prasinophyte sp. NIES-3671 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KX014641 Prasinophyceae sp. RCC4569 (clade 7X-A3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KT860872 Prasinophyte sp. RCC138 (clade VIIA2) 18S ribosomal RNA gene, partial sequence 99.73 689 Prasino-Clade-VII-A Prasino-Clade-VII-A-2 Prasino-Clade-VII-A-2_sp.
KT860659 Prasinophyte sp. RCC297 (clade VIIA3) 18S ribosomal RNA gene, partial sequence 99.73 689 NA NA NA
KF925345 Prasinophyceae sp. CCMP1205 18S ribosomal RNA gene, partial sequence 99.73 689 Prasino-Clade-VII-A Prasino-Clade-VII-A_X Prasino-Clade-VII-A_X_sp.
KF031895 Uncultured eukaryote clone CC02A740.048 18S ribosomal RNA gene, partial sequence 99.73 689 Prasino-Clade-VII-A Prasino-Clade-VII-A-2 Prasino-Clade-VII-A-2_sp.
KF031774 Uncultured eukaryote clone CC02A105.073 18S ribosomal RNA gene, partial sequence 99.73 689 Prasino-Clade-VII-A Prasino-Clade-VII-A-3 Prasino-Clade-VII-A-3_sp.
KF031687 Uncultured eukaryote clone CC02SE75.038 18S ribosomal RNA gene, partial sequence 99.73 689 Prasino-Clade-VII-A Prasino-Clade-VII-A_X Prasino-Clade-VII-A_X_sp.
  • otu0082 : dada2 assign as Apocalathium but the hit is quite low (90%) so better to assign to Dinophyceae_XXX_sp.
blast_summary_display("Otu0082")
Otu0082 - dada2 assignement
OTUNumber dada2_family dada2_genus dada2_species
Otu0082 Thoracosphaeraceae Apocalathium Apocalathium_baicalense
blast_display("Otu0082")
Otu0082 - BLAST results - top 10 hits
hit_acc hit_title pct_identity bit_score family genus species
AY046663 Uncultured eukaryote isolate CS_E042 18S ribosomal RNA gene, partial sequence 94.72 588 NA NA NA
AY046659 Uncultured eukaryote isolate CS_E035 18S ribosomal RNA gene, partial sequence 94.46 582 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
AY046866 Uncultured eukaryote isolate C3_E038 18S ribosomal RNA gene, partial sequence 94.20 579 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
AY046773 Uncultured eukaryote isolate A3_E031 18S ribosomal RNA gene, partial sequence 94.20 577 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
AY046787 Uncultured eukaryote isolate A3_E046 18S ribosomal RNA gene, partial sequence 93.67 568 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
AY046653 Uncultured eukaryote isolate CS_E028 18S ribosomal RNA gene, partial sequence 93.40 562 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
AJ402340 eukaryote clone OLI11027 18S rRNA gene 93.40 560 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
KT813471 Uncultured eukaryote clone 34c_59722 18S ribosomal RNA gene, partial sequence 93.05 544 NA NA NA
DQ371292 Polykrikos kofoidii isolate 2 18S ribosomal RNA gene, partial sequence 91.56 521 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
AB466294 Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, strain: 2 91.29 516 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
AB466292 Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, isolate: 070216-6 91.29 516 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
AB466290 Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, country: Japan:Iwate, Ofunato Bay 91.29 516 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
DQ371291 Polykrikos kofoidii isolate 1 18S ribosomal RNA gene, partial sequence 91.29 516 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
AB466291 Polykrikos kofoidii gene for small subunit ribosomal RNA, complete sequence, country: Japan:Nagasaki, Omura Bay 91.08 512 Gymnodiniaceae Polykrikos Polykrikos_kofoidii
KJ757664 Uncultured eukaryote clone SGYY886 18S ribosomal RNA gene, partial sequence 91.03 510 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
KJ759286 Uncultured eukaryote clone SGYO1066 18S ribosomal RNA gene, partial sequence 91.01 510 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
KC771146 Uncultured marine eukaryote clone I-9-MC884-OTU-47 18S ribosomal RNA gene, partial sequence 91.01 510 Peridiniaceae Peridinium Peridinium_sp.
FN690233 Uncultured alveolate partial 18S rRNA gene, clone 4-H5 91.01 510 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.
EF527032 Uncultured marine eukaryote clone SIF_3C2 18S ribosomal RNA gene, partial sequence 91.01 510 Dinophyceae_XX Dinophyceae_XXX Dinophyceae_XXX_sp.

4.7 Final list of changes made to the taxonomy assigned by dada2

Changes made are recorded in the Excel sheet (NZ_2017_BLAST.xls) that will loaded at the next step and the phyloseq file will modified accordingly. This allows a clean update if we see more change in the taxonomy.

4.7.1 List of changes made

taxo_corrected <- read_xlsx(str_c(path_files, "NZ_2017_BLAST.xlsx"), sheet = "taxonomy corrected")
taxo_corrected_show <- taxo_corrected %>% filter(!is.na(corrected_species)) %>% 
    select(OTUNumber, dada2_species, corrected_species, corrected_class:corrected_genus, 
        corrected_remark)

knitr::kable(head(taxo_corrected_show, 31), caption = "Taxonomic corrections made") %>% 
    kable_styling(latex_options = "scale_down", font_size = 10)
Taxonomic corrections made
OTUNumber dada2_species corrected_species corrected_class corrected_order corrected_family corrected_genus corrected_remark
Otu0029 Pycnococcaceae_X_sp. Pycnococcus_provasolii Prasino-Clade-V Pseudoscourfieldiales Pycnococcaceae Pycnococcus Hits Pycnococcus 100%
Otu0039 Gephyrocapsa_oceanica Emiliania_huxleyi Prymnesiophyceae Isochrysidales Noelaerhabdaceae Emiliania Emiliania and Gephyrocapsa cannot be distinguished
Otu0082 Apocalathium_baicalense Dinophyceae_XXX_sp. Dinophyceae Dinophyceae_X Dinophyceae_XX Dinophyceae_XXX Hit to Apocalathium only 90%
Otu0114 Bispinodinium_angelaceum Eukaryota_XXXXXX_sp. Eukaryota_XXX Eukaryota_XXXX Eukaryota_XXXXX Eukaryota_XXXXXX Very low hit to any group (83% to Fungi)
Otu0119 Gyrodinium_sp. Heterocapsa_nei/rotundata Dinophyceae Peridiniales Heterocapsaceae Heterocapsa Only one sequence of Gyrodinium
Otu0120 Dinophyceae_XXX_sp. Gyrodinium_sp. Dinophyceae Gymnodiniales Gymnodiniaceae Gyrodinium 98% to Gyrodinium
Otu0124 Gymnodinium_sp. Dinophyceae_XXX_sp. Dinophyceae Dinophyceae_X Dinophyceae_XX Dinophyceae_XXX No hit to Gymnodiunium culture
Otu0133 Dino-Group-II-Clade-46_X_sp. Dino-Group-II_XX_sp. Syndiniales Dino-Group-II Dino-Group-II_X Dino-Group-II_XX Hit very low 85%
Otu0221 Blastodinium_pruvoti Dinophyceae_XXX_sp. Dinophyceae Dinophyceae_X Dinophyceae_XX Dinophyceae_XXX Low hit (89%) to cultures
Otu0242 Syndinium_turbo Hematodinium_sp. Syndiniales Dino-Group-IV Dino-Group-IV-Hematodinium-Group Hematodinium Better hit to Hematodinium

4.7.2 Correct the phyloseq file

The file is saved into phyloseq_nz_2017.rds

taxo_changed <- taxo_corrected %>% filter(!is.na(corrected_species)) %>% select(OTUNumber, 
    corrected_supergroup:corrected_species) %>% rename_all(funs(str_replace(., 
    "corrected_", "")))

taxo_kept <- taxo_corrected %>% filter(is.na(corrected_species)) %>% select(OTUNumber, 
    dada2_supergroup:dada2_species) %>% rename_all(funs(str_replace(., "dada2_", 
    "")))

taxo_final <- rbind(taxo_kept, taxo_changed) %>% arrange(OTUNumber) %>% mutate(kingdom = "Eukaryota") %>% 
    column_to_rownames(var = "OTUNumber")

phyloseq_final <- phyloseq_dada2
tax_table(phyloseq_final) <- as.matrix(taxo_final)
saveRDS(phyloseq_final, file = "phyloseq_nz_2017.rds")

Daniel Vaulot

07 05 2018