BIOSOPE - step 4 - Compare dada2 vs. mothur

## Warning: package 'rmdformats' was built under R version 3.4.4

1 Aim

Comparing the community composition with dada2 vs mothur

2 Directory structure

  • /cutadapt : fastq files treated by cutadapt
  • /dada2 : dada2 files
  • /mothur : mothur files

3 Analysis

3.1 To do

  • Move from Species to Genus
  • Color dots by division
  • Do the treemaps for the two methods
  • Do station histograms at genus level for chlorophyta

3.2 Initialize

  • BIOSOPE_init.R defines all the necessary libraries and variables # r, eval=TRUE, message=FALSE, warning=FALSE
source("BIOSOPE_init.R", echo = FALSE)

3.3 Read the phyloseq and database files

ps_dada2 <- readRDS(file = str_c(dir_dada2, "ps_biosope_dada2.rds"))
ps_mothur <- readRDS(file = str_c(dir_mothur, "ps_biosope_mothur.rds"))

3.4 Make Simple tables from the phyloseq

Note: for Dada2 only consider the samples common to mothur and dada2merge and tranform to long format

samples <- data.frame(sample_data(ps_dada2), stringsAsFactors = FALSE)

# Dada2 : only consider the samples common to mothur and dada2
otus_dada2 <- data.frame(otu_table(subset_samples(ps_dada2, mothur_missing == 
    0)))
taxa_dada2 <- data.frame(tax_table(subset_samples(ps_dada2, mothur_missing == 
    0)))

# It is necessary to create a column from the row names
otus_dada2 <- rownames_to_column(otus_dada2, var = "OTUNumber")
taxa_dada2 <- rownames_to_column(taxa_dada2, var = "OTUNumber")

# Mothur
otus_mothur <- data.frame(otu_table(ps_mothur))
taxa_mothur <- data.frame(tax_table(ps_mothur))
otus_mothur <- rownames_to_column(otus_mothur, var = "OTUNumber")
taxa_mothur <- rownames_to_column(taxa_mothur, var = "OTUNumber")

3.5 Merge and convert to long format

Note: when converting to from phyloseq format to dataframe format, an X is added at the start of column name for samples because they are starting with a digit. This X is removed.

dada2_long <- left_join(taxa_dada2, otus_dada2) %>% gather(key = "sample", value = "n_seq", 
    starts_with("X")) %>% mutate(sample = str_replace(sample, "X", ""))
mothur_long <- left_join(taxa_mothur, otus_mothur) %>% gather(key = "sample", 
    value = "n_seq", starts_with("X")) %>% mutate(sample = str_replace(sample, 
    "X", ""))

3.6 Aggregate by Division, Class, Genus, Species

species_dada2 <- dada2_long %>% group_by(Division, Class, Genus, Species) %>% 
    summarize(n_seq = sum(n_seq))
write_tsv(species_dada2, str_c(dir_R_out, "species_dada2.txt"), na = "")

genus_dada2 <- dada2_long %>% group_by(Division, Class, Genus) %>% summarize(n_seq = sum(n_seq))
genus_dada2 <- genus_dada2 %>% ungroup() %>% arrange(desc(n_seq))

knitr::kable(head(species_dada2, 20), caption = "Genera from dada2 - first 20 lines") %>% 
    kable_styling(latex_options = "scale_down", font_size = 7)
Genera from dada2 - first 20 lines
Division Class Genus Species n_seq
Apicomplexa Apicomplexa_X Colpodella Colpodella_pontica 133
Apusomonadidae Apusomonadidae_Group-1 Amastigomonas Amastigomonas_debruynei 660
Apusomonadidae Apusomonadidae_Group-2 Apusomonadidae_Group-2A_XX Apusomonadidae_Group-2A_XX_sp. 619
Apusomonadidae Apusomonadidae_Group-2 Apusomonadidae_Group-2B_XX Apusomonadidae_Group-2B_XX_sp. 8338
Cercozoa Cercozoa_X Cercozoa_XXXX Cercozoa_XXXX_sp. 45
Cercozoa Endomyxa Endo4-lineage_X Endo4-lineage_X_sp. 2777
Cercozoa Filosa Discomonas Discomonas_retusa 3
Cercozoa Filosa-Chlorarachnea Lotharella Lotharella_sp. 4
Cercozoa Filosa-Chlorarachnea Minorisa Minorisa_minuta 249
Cercozoa Filosa-Chlorarachnea Minorisa-lineage_X Minorisa-lineage_X_sp. 27
Cercozoa Filosa-Chlorarachnea NPK2-lineage_X NPK2-lineage_X_sp. 1346
Cercozoa Filosa-Granofilosea Massisteria Massisteria_marina 93
Cercozoa Filosa-Granofilosea Massiteriidae_X Massiteriidae_X_sp. 383
Cercozoa Filosa-Imbricatea Novel-clade-2_X Novel-clade-2_X_sp. 1728
Cercozoa Filosa-Sarcomonadea Helkesimastix Helkesimastix_marina 13
Cercozoa Filosa-Thecofilosea Botuliforma Botuliforma_benthica 7
Cercozoa Filosa-Thecofilosea Protaspa Protaspa_grandis 183
Cercozoa Filosa-Thecofilosea Ventrifissura Ventrifissura_foliiformis 5
Cercozoa Phaeodarea Challengeron Challengeron_tizardi 265
Cercozoa Phaeodarea Entocannula Entocannula_infundibulum 4
species_mothur <- mothur_long %>% group_by(Division, Class, Genus, Species) %>% 
    summarize(n_seq = sum(n_seq))
write_tsv(species_mothur, str_c(dir_R_out, "species_mothur.txt"), na = "")

genus_mothur <- mothur_long %>% group_by(Division, Class, Genus) %>% summarize(n_seq = sum(n_seq))
genus_mothur <- genus_mothur %>% ungroup() %>% arrange(desc(n_seq))

knitr::kable(head(genus_mothur, 20), caption = "Genera from mothur - first 20 lines") %>% 
    kable_styling(latex_options = "scale_down", font_size = 7)
Genera from mothur - first 20 lines
Division Class Genus n_seq
Radiolaria RAD-A RAD-A_XXX 11871
Chlorophyta Prasino-Clade-VII Prasino-Clade-VII-B-1 8517
Dinoflagellata Syndiniales Dino-Group-I-Clade-1_X 7171
Radiolaria RAD-B RAD-B-Group-I_X 6598
Chlorophyta Mamiellophyceae Bathycoccus 5949
Radiolaria Acantharea Acanthometron 4472
Apusomonadidae Apusomonadidae_Group-2 Apusomonadidae_Group-2B_XX 4450
Radiolaria RAD-B RAD-B-Group-IV_X 4223
Chlorophyta Prasino-Clade-9 Prasino-Clade-9-B_X 3827
Dinoflagellata Dinophyceae Prorocentrum 3539
Dinoflagellata Dinophyceae Dinophyceae_unclassified 3356
Eukaryota_unclassified Eukaryota_unclassified Eukaryota_unclassified 2184
Radiolaria RAD-B RAD-B-Group-II_X 2116
Radiolaria Polycystinea Spongotrochus 2036
Chlorophyta Prasino-Clade-VII Prasino-Clade-VII-A-4 2027
Radiolaria RAD-B Sticholonche 2000
Radiolaria Polycystinea Spumellarida-Group-I_X 1990
Dinoflagellata Syndiniales Dino-Group-III_XX 1876
Cercozoa Endomyxa Endo4-lineage_X 1817
Radiolaria Polycystinea Pterocorys 1787

3.7 Compare overall composition

dv_treemap(species_mothur, c("Division", "Class"), "n_seq", "Mothur")
dv_treemap(species_dada2, c("Division", "Class"), "n_seq", "Dada2")

3.8 Compare Chlorophyta composition

dv_treemap(filter(species_mothur, Division == "Chlorophyta"), c("Class", "Genus"), 
    "n_seq", "Mothur")
dv_treemap(filter(species_dada2, Division == "Chlorophyta"), c("Class", "Genus"), 
    "n_seq", "Dada2")

3.9 Compare estimates of taxon abundance

3.9.1 Merge the two lists to compute the relation between mothur and dada2 estimates

both_genus <- rbind(genus_mothur, genus_dada2) %>% group_by(Division, Class, 
    Genus) %>% summarize(n_methods = n()) %>% left_join(genus_dada2) %>% dplyr::rename(dada2 = n_seq) %>% 
    left_join(genus_mothur) %>% dplyr::rename(mothur = n_seq) %>% arrange(desc(mothur))
n_genus <- nrow(both_genus)
both_class <- both_genus %>% group_by(Division, Class) %>% summarise(dada2 = sum(dada2, 
    na.rm = TRUE), mothur = sum(mothur, na.rm = TRUE)) %>% arrange(desc(dada2))

3.9.2 Class level

ggplot(both_class) + geom_point(aes(mothur, dada2, color = Division)) + geom_smooth(aes(mothur, 
    dada2), method = "lm") + ggtitle("Class level count")

summary(lm(both_class$dada2 ~ both_class$mothur))

Call:
lm(formula = both_class$dada2 ~ both_class$mothur)

Residuals:
   Min     1Q Median     3Q    Max 
-10628   -207   -167      1  13462 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       199.5926   373.1680    0.53     0.59    
both_class$mothur   2.3935     0.0897   26.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2950 on 72 degrees of freedom
Multiple R-squared:  0.908, Adjusted R-squared:  0.907 
F-statistic:  712 on 1 and 72 DF,  p-value: <2e-16

3.9.3 Genus level

ggplot(both_genus) + geom_point(aes(mothur, dada2, color = Division)) + geom_smooth(aes(mothur, 
    dada2), method = "lm") + ggtitle("Genus level count")

summary(lm(both_genus$dada2 ~ both_genus$mothur))

Call:
lm(formula = both_genus$dada2 ~ both_genus$mothur)

Residuals:
    Min      1Q  Median      3Q     Max 
-2932.4  -447.0  -340.0    69.9  4902.6 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       455.67600  123.53326   3.689 0.000352 ***
both_genus$mothur   1.95003    0.05936  32.848  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1163 on 110 degrees of freedom
  (225 observations deleted due to missingness)
Multiple R-squared:  0.9075,    Adjusted R-squared:  0.9066 
F-statistic:  1079 on 1 and 110 DF,  p-value: < 2.2e-16
# Idem for Chlorophyta

ggplot(filter(both_genus, Division == "Chlorophyta")) + geom_point(aes(mothur, 
    dada2)) + geom_smooth(aes(mothur, dada2), method = "lm") + ggtitle("Genus level count for Chlorophyta")

3.9.4 Genus only found with one method

There are 9% of the genera that are only found with mothur while 57% of the genera are only found with dada2. So mothur is missing a lot of genera.

genus_mothur_only <- both_genus %>% filter(is.na(dada2)) %>% arrange(desc(mothur))
n_genus_mothur_only = nrow(genus_mothur_only)

knitr::kable(head(genus_mothur_only, 30), caption = "Genera found only with mothur - top 30") %>% 
    kable_styling(latex_options = "scale_down", font_size = 7)
Genera found only with mothur - top 30
Division Class Genus n_methods dada2 mothur
Dinoflagellata Dinophyceae Dinophyceae_unclassified 1 NA 3356
Eukaryota_unclassified Eukaryota_unclassified Eukaryota_unclassified 1 NA 2184
Metazoa Ctenophora Ctenophora_XX_unclassified 1 NA 1463
Radiolaria Polycystinea Spumellarida-Group-I_unclassified 1 NA 1213
Radiolaria Acantharea Acantharea_unclassified 1 NA 1042
Radiolaria Acantharea Arthracanthida-Symphyacanthida_X_unclassified 1 NA 797
Radiolaria Polycystinea Pterocoryothoidea_unclassified 1 NA 617
Radiolaria Acantharea Dorataspis 1 NA 568
Radiolaria Polycystinea Spumellarida_unclassified 1 NA 559
Radiolaria Polycystinea Nassellaria_unclassified 1 NA 402
Apusomonadidae Apusomonadidae_unclassified Apusomonadidae_unclassified 1 NA 358
Radiolaria Polycystinea Sphaerozoidae_unclassified 1 NA 208
Archaeplastida_unclassified Archaeplastida_unclassified Archaeplastida_unclassified 1 NA 123
Haptophyta Prymnesiophyceae Prymnesiophyceae_unclassified 1 NA 118
Radiolaria Acantharea Acantharea_XX_unclassified 1 NA 108
Dinoflagellata Syndiniales Dino-Group-II_unclassified 1 NA 90
Chlorophyta Pyramimonadales Pyramimonadales_XX_unclassified 1 NA 86
Apicomplexa Apicomplexa_X Colpodellidae_X_unclassified 1 NA 83
Radiolaria Acantharea Chaunacanthida_X_unclassified 1 NA 83
Chlorophyta Prasino-Clade-VII Prasino-Clade-VII-A_unclassified 1 NA 75
Radiolaria RAD-B RAD-B_X_unclassified 1 NA 67
Cercozoa Cercozoa_unclassified Cercozoa_unclassified 1 NA 62
Radiolaria Acantharea Acantharea-Group-VI_X_unclassified 1 NA 50
Dinoflagellata Dinophyceae Gymnodiniaceae_unclassified 1 NA 39
Radiolaria Radiolaria_unclassified Radiolaria_unclassified 1 NA 34
Chlorophyta Chlorophyta_unclassified Chlorophyta_unclassified 1 NA 33
Rhizaria_unclassified Rhizaria_unclassified Rhizaria_unclassified 1 NA 32
Radiolaria RAD-B RAD-B-Group-IV_unclassified 1 NA 23
Alveolata_unclassified Alveolata_unclassified Alveolata_unclassified 1 NA 15
Radiolaria Polycystinea Plagiacanthoidea_unclassified 1 NA 14
genus_dada2_only <- both_genus %>% filter(is.na(mothur)) %>% arrange(desc(dada2))
n_genus_dada2_only = nrow(genus_dada2_only)
knitr::kable(head(genus_dada2_only, 30), caption = "Genera found only with dada2 - top 30") %>% 
    kable_styling(latex_options = "scale_down", font_size = 7)
Genera found only with dada2 - top 30
Division Class Genus n_methods dada2 mothur
Dinoflagellata Syndiniales Dino-Group-II-Clade-26_X 1 11300 NA
Ochrophyta Synurophyceae Tessellaria 1 11146 NA
Dinoflagellata Dinophyceae Heterocapsa 1 6179 NA
Radiolaria Acantharea Coleaspis 1 1234 NA
Radiolaria Polycystinea Spumellarida-Group-III_X 1 838 NA
Apusomonadidae Apusomonadidae_Group-1 Amastigomonas 1 660 NA
Dinoflagellata Dinophyceae Azadinium 1 654 NA
Radiolaria Polycystinea Sphaerozoum 1 645 NA
Apusomonadidae Apusomonadidae_Group-2 Apusomonadidae_Group-2A_XX 1 619 NA
Chlorophyta Prasino-Clade-VII Prasino-Clade-VII-A-1 1 580 NA
Dinoflagellata Dinophyceae Lepidodinium 1 499 NA
Radiolaria Acantharea Acanthometra 1 432 NA
Radiolaria Polycystinea Dictyocoryne 1 368 NA
Ochrophyta Chrysophyceae Chrysophyceae_Clade-G_X 1 279 NA
Cercozoa Phaeodarea Challengeron 1 265 NA
Cercozoa Filosa-Chlorarachnea Minorisa 1 249 NA
Radiolaria Polycystinea Spongodiscus 1 244 NA
Stramenopiles_X MAST MAST-1C_X 1 243 NA
Radiolaria Acantharea Phractopelta 1 240 NA
Radiolaria Polycystinea Spongocore 1 231 NA
Radiolaria Acantharea Symphyacanthida 1 204 NA
Chlorophyta Chlorophyceae Scenedesmus 1 186 NA
Cercozoa Filosa-Thecofilosea Protaspa 1 183 NA
Dinoflagellata Syndiniales Dino-Group-II-Clade-1_X 1 176 NA
Choanoflagellida Choanoflagellida_X Choanoflagellida_XX_Clade-3_X 1 169 NA
Dinoflagellata Dinophyceae Blastodinium 1 159 NA
Dinoflagellata Dinophyceae Lessardia 1 145 NA
Dinoflagellata Syndiniales Dino-Group-II-Clade-23_X 1 145 NA
Radiolaria Acantharea Chaunacanthid 1 145 NA
Radiolaria Acantharea Staurolithium 1 145 NA

3.10 Comparaison per station for the whole community

Compute sum per division and per station

division_sample_dada2 <- dada2_long %>% group_by(Division, sample) %>% summarize(n_seq = sum(n_seq)) %>% 
    group_by(sample) %>% mutate(n_seq_pct = n_seq/sum(n_seq))

division_sample_mothur <- mothur_long %>% group_by(Division, sample) %>% summarize(n_seq = sum(n_seq)) %>% 
    group_by(sample) %>% mutate(n_seq_pct = n_seq/sum(n_seq))

Collapse divisions that are less than 5% of the total to “Others”

# dada2

# Compute the percentage of sequences in each division and rename the
# sequences with less 1% to others
division_dada2 <- dada2_long %>% group_by(Division) %>% summarize(n_seq = sum(n_seq)) %>% 
    mutate(n_seq_pct = n_seq/sum(dada2_long$n_seq) * 100) %>% mutate(Division_collapsed = if_else(n_seq_pct < 
    5, "Others", Division))
# Merge the new column with 'Others'
division_sample_dada2 <- left_join(division_sample_dada2, select(division_dada2, 
    Division, Division_collapsed))

# mothur

# Compute the percentage of sequences in each division and rename the
# sequences with less 1% to others
division_mothur <- mothur_long %>% group_by(Division) %>% summarize(n_seq = sum(n_seq)) %>% 
    mutate(n_seq_pct = n_seq/sum(mothur_long$n_seq) * 100) %>% mutate(Division_collapsed = if_else(n_seq_pct < 
    5, "Others", Division))
# Merge the new column with 'Others'
division_sample_mothur <- left_join(division_sample_mothur, select(division_mothur, 
    Division, Division_collapsed))

Doing the graphs for the absolute abundance

ggplot(division_sample_mothur, aes(x = sample, y = n_seq, fill = Division_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + ylim(0, 
    35000) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Mothur - absolute")

ggplot(division_sample_dada2, aes(x = sample, y = n_seq, fill = Division_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + ylim(0, 
    35000) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("dada2 - absolute")

Doing the graphs for the relative abundance

ggplot(division_sample_mothur, aes(x = sample, y = n_seq_pct, fill = Division_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + theme(axis.text.x = element_text(angle = 90, 
    vjust = 0.5)) + ggtitle("Mothur - relative")

ggplot(division_sample_dada2, aes(x = sample, y = n_seq_pct, fill = Division_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + theme(axis.text.x = element_text(angle = 90, 
    vjust = 0.5)) + ggtitle("dada2 - relative")

Notes Some samples have different composition with dada2 than with mothur

  • Sample 074_070 has a large population of Ochrophyta (collapsed into “Others” - visible by changing the threshold for others). They correspond to Chrysophyceae clade I (wrongly assigned by dada2 to Synurophyceae).

  • Sample 135_005 has more Dinoflagellata than in mothur

3.11 Comparaison per station for Chlorophyta

Compute sum per division and per station

chlo_genus_sample_dada2 <- dada2_long %>% filter(Division == "Chlorophyta") %>% 
    group_by(Genus, sample) %>% summarize(n_seq = sum(n_seq)) %>% group_by(sample) %>% 
    mutate(n_seq_pct = n_seq/sum(n_seq))

chlo_genus_sample_mothur <- mothur_long %>% filter(Division == "Chlorophyta") %>% 
    group_by(Genus, sample) %>% summarize(n_seq = sum(n_seq)) %>% group_by(sample) %>% 
    mutate(n_seq_pct = n_seq/sum(n_seq))

Collapse genera that are less than 1% of the total to “Others”

# dada2

# Compute the percentage of sequences in each division and rename the
# sequences with less 1% to others
chlo_dada2_long <- dada2_long %>% filter(Division == "Chlorophyta")
chlo_genus_dada2 <- chlo_dada2_long %>% group_by(Genus) %>% summarize(n_seq = sum(n_seq)) %>% 
    mutate(n_seq_pct = n_seq/sum(chlo_dada2_long$n_seq) * 100) %>% mutate(Genus_collapsed = if_else(n_seq_pct < 
    1, "Others", Genus))
# Merge the new column with 'Others'
chlo_genus_sample_dada2 <- left_join(chlo_genus_sample_dada2, select(chlo_genus_dada2, 
    Genus, Genus_collapsed))

# mothur

# Compute the percentage of sequences in each division and rename the
# sequences with less 1% to others
chlo_mothur_long <- mothur_long %>% filter(Division == "Chlorophyta")
chlo_genus_mothur <- chlo_mothur_long %>% group_by(Genus) %>% summarize(n_seq = sum(n_seq)) %>% 
    mutate(n_seq_pct = n_seq/sum(chlo_mothur_long$n_seq) * 100) %>% mutate(Genus_collapsed = if_else(n_seq_pct < 
    1, "Others", Genus))
# Merge the new column with 'Others'
chlo_genus_sample_mothur <- left_join(chlo_genus_sample_mothur, select(chlo_genus_mothur, 
    Genus, Genus_collapsed))

Doing the graphs for the relative abundance

ggplot(chlo_genus_sample_mothur, aes(x = sample, y = n_seq_pct, fill = Genus_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + theme(axis.text.x = element_text(angle = 90, 
    vjust = 0.5)) + ggtitle("Mothur - Chlorophyta relative")

ggplot(chlo_genus_sample_dada2, aes(x = sample, y = n_seq_pct, fill = Genus_collapsed)) + 
    geom_bar(stat = "identity") + xlab("Stations") + ylab("Sequences") + theme(axis.text.x = element_text(angle = 90, 
    vjust = 0.5)) + ggtitle("dada2 - Chlorophyta relative")

4 Conclusions

The comparison has been made only for stations where both mothur and dada2 were used. There is no overall dramatic differnces in the two methods but since there almost 3 times more sequences for dada2 than for mothur, I propose to use the dada2 data. Moreover with the dada2 sequences we have recovered the stations for which there was a problem with the barcodes.

Mothur

OTUs build at 98%

[1] "Number of sequences with mothur = 120 343"
[1] "Number of Otus with mothur      =  747"
[1] "Number of Species with mothur   =  155"
[1] "Number of Genera with mothur    =  143"
[1] "Number of genera found only with mothur =  31 (9.20 %)"

Dada2

The raw sequences have been processed with cut-adapt

[1] "Number of sequences with dada2 = 302 809"
[1] "Number of Otus with dada2      = 1783"
[1] "Number of Species with dada2   =  358"
[1] "Number of Genera with dada2    =  306"
[1] "Number of genera found only with dada2 = 194 (57.57 %)"

Daniel Vaulot

24 04 2018