Phyloseq tutorial
1 Aim
This document explains the use of the phyloseq R library to analyze metabarcoding data.
2 Phyloseq R library
- Phyloseq web site : https://joey711.github.io/phyloseq/index.html
- See in particular tutorials for
- importing data: https://joey711.github.io/phyloseq/import-data.html
- heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html
3 Data
This tutorial uses a reduced metabarcoding dataset obtained by C. Ribeiro and A. Lopes dos Santos. This dataset originates from the CARBOM cruise in 2013 off Brazil and corresponds to the 18S V4 region amplified on flow cytometry sorted samples (see pptx file for details) and sequenced on an Illumina run 2*250 bp analyzed with mothur.
3.1 References for data
- Gérikas Ribeiro, C., Lopes dos Santos, A., Marie, D., Helena Pellizari, V., Pereira Brandini, F., and Vaulot, D. (2016). Pico and nanoplankton abundance and carbon stocks along the Brazilian Bight. PeerJ 4, e2587. doi:10.7717/peerj.2587.
- Gérikas Ribeiro, C., Marie, D., Lopes dos Santos, A., Pereira Brandini, F., and Vaulot, D. (2016). Estimating microbial populations by flow cytometry: Comparison between instruments. Limnol. Oceanogr. Methods 14, 750–758. doi:10.1002/lom3.10135.
- Gérikas Ribeiro C, Lopes dos Santos A, Marie D, Brandini P, Vaulot D. (2018). Relationships between photosynthetic eukaryotes and nitrogen-fixing cyanobacteria off Brazil. ISME J in press.
4 To be added
- Exercices
5 Prerequisites to be installed
R studio : https://www.rstudio.com/products/rstudio/download/#download
Download this tutorial from GitHub : https://github.com/vaulot/R_tutorials/archive/master.zip
Download and install the following libraries by running under R studio the following lines
install.packages("dplyr") # To manipulate dataframes
install.packages("readxl") # To read Excel files into R
install.packages("ggplot2") # for high quality graphics
source("https://bioconductor.org/biocLite.R")
biocLite("phyloseq")
6 Gettin started
Transfer the files that you downloaded from https://github.com/vaulot/R_tutorials/archive/master.zip to a directory on the computer or server you are using “C:/R_tutorial” or “home/R_tutorial”
Navigate to the /R_tutorial/phyloseq directory. You should see the following files :
Phyloseq_tutorial.Rmd
- Double click on the file and this should open R Studio
- Run the “R chunks”
7 Step by step
7.1 Load necessary libraries
library("phyloseq")
library("ggplot2") # graphics
library("readxl") # necessary to import the data from Excel file
library("dplyr") # filter and reformat data frames
library("tibble") # Needed for converting column to row names
7.2 Read the data and create phyloseq objects
Change your working directory to where the files are located
Three tables are needed
- OTU
- Taxonomy
- Samples
They are read from a single Excel file where each sheet contains one of the tables
<- read_excel("../data/CARBOM data.xlsx", sheet = "OTU matrix")
otu_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "Taxonomy table")
tax_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "Samples") samples_df
Phyloseq objects need to have row.names
- define the row names from the otu column
<- otu_mat %>%
otu_mat ::column_to_rownames("otu") tibble
- Idem for the two other matrixes
<- tax_mat %>%
tax_mat ::column_to_rownames("otu")
tibble
<- samples_df %>%
samples_df ::column_to_rownames("sample") tibble
Transform into matrixes otu and tax tables (sample table can be left as data frame)
<- as.matrix(otu_mat)
otu_mat <- as.matrix(tax_mat) tax_mat
Transform to phyloseq objects
= otu_table(otu_mat, taxa_are_rows = TRUE)
OTU = tax_table(tax_mat)
TAX = sample_data(samples_df)
samples
<- phyloseq(OTU, TAX, samples)
carbom carbom
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 287 taxa and 55 samples ]
## sample_data() Sample Data: [ 55 samples by 27 sample variables ]
## tax_table() Taxonomy Table: [ 287 taxa by 7 taxonomic ranks ]
Visualize data
sample_names(carbom)
## [1] "X10n" "X10p" "X11n" "X11p" "X120n" "X120p" "X121n" "X121p"
## [9] "X122n" "X122p" "X125n" "X125p" "X126n" "X126p" "X127n" "X13n"
## [17] "X13p" "X140n" "X140p" "X141n" "X141p" "X142n" "X142p" "X155n"
## [25] "X155p" "X156n" "X156p" "X157n" "X157p" "X15n" "X15p" "X165n"
## [33] "X165p" "X166n" "X166p" "X167n" "X167p" "X1n" "X1p" "X2n"
## [41] "X2p" "X3n" "X3p" "X5n" "X5p" "X7n" "X7p" "X9n"
## [49] "X9p" "tri01n" "tri01p" "tri02n" "tri02p" "tri03n" "tri03p"
rank_names(carbom)
## [1] "Domain" "Supergroup" "Division" "Class" "Order"
## [6] "Family" "Genus"
sample_variables(carbom)
## [1] "fraction" "Select_18S_nifH" "total_18S"
## [4] "total_16S" "total_nifH" "sample_number"
## [7] "transect" "station" "depth"
## [10] "latitude" "longitude" "picoeuks"
## [13] "nanoeuks" "bottom_depth" "level"
## [16] "transect_distance" "date" "time"
## [19] "phosphates" "silicates" "ammonia"
## [22] "nitrates" "nitrites" "temperature"
## [25] "fluorescence" "salinity" "sample_label"
Keep only samples to be analyzed
<- subset_samples(carbom, Select_18S_nifH =="Yes")
carbom carbom
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 287 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 27 sample variables ]
## tax_table() Taxonomy Table: [ 287 taxa by 7 taxonomic ranks ]
Keep only photosynthetic taxa
<- subset_taxa(carbom, Division %in% c("Chlorophyta", "Dinophyta", "Cryptophyta",
carbom "Haptophyta", "Ochrophyta", "Cercozoa"))
<- subset_taxa(carbom, !(Class %in% c("Syndiniales", "Sarcomonadea")))
carbom carbom
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 205 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 27 sample variables ]
## tax_table() Taxonomy Table: [ 205 taxa by 7 taxonomic ranks ]
Normalize number of reads in each sample using median sequencing depth.
= median(sample_sums(carbom))
total = function(x, t=total) round(t * (x / sum(x)))
standf = transform_sample_counts(carbom, standf) carbom
The number of reads used for normalization is 44903.
7.3 Bar graphs
Basic bar graph based on Division
plot_bar(carbom, fill = "Division")
Make the bargraph nicer by removing OTUs boundaries. This is done by adding ggplot2 modifier.
plot_bar(carbom, fill = "Division") +
geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack")
Regroup together Pico vs Nano samples
<- merge_samples(carbom, "fraction")
carbom_fraction plot_bar(carbom_fraction, fill = "Division") +
geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack")
Keep only Chlorophyta and use color according to genus. Do separate panels Pico vs Nano and Surface vs Deep samples.
<- subset_taxa(carbom, Division %in% c("Chlorophyta"))
carbom_chloro plot_bar(carbom_chloro, x="Genus", fill = "Genus", facet_grid = level~fraction) +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack")
7.4 Heatmaps
A basic heatmap using the default parameters.
plot_heatmap(carbom, method = "NMDS", distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis
It is very very cluttered. It is better to only consider the most abundant OTUs for heatmaps. For example one can only take OTUs that represent at least 20% of reads in at least one sample. Remember we normalized all the sampples to median number of reads (total). We are left with only 33 OTUS which makes the reading much more easy.
<- filter_taxa(carbom, function(x) sum(x > total*0.20) > 0, TRUE)
carbom_abund carbom_abund
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 33 taxa and 54 samples ]
## sample_data() Sample Data: [ 54 samples by 27 sample variables ]
## tax_table() Taxonomy Table: [ 33 taxa by 7 taxonomic ranks ]
otu_table(carbom_abund)[1:8, 1:5]
## OTU Table: [8 taxa and 5 samples]
## taxa are rows
## X10n X10p X11p X120n X120p
## Otu001 13339 7346 3804 12662 3
## Otu002 18 8329 14958 30 36206
## Otu003 9692 10488 20 16537 11
## Otu004 3584 4943 33 7 9
## Otu005 0 6 11 0 5
## Otu006 0 9 0 0 5
## Otu007 4473 605 587 5894 3
## Otu008 1 9 6707 2 17
plot_heatmap(carbom_abund, method = "NMDS", distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis
It is possible to use different distances and different multivaraite methods. For example Jaccard distance and MDS and label OTUs with Class, order by Class. We can also change the Palette (the default palette is a bit ugly…).
plot_heatmap(carbom_abund, method = "MDS", distance = "(A+B-2*J)/(A+B-J)",
taxa.label = "Class", taxa.order = "Class",
trans=NULL, low="beige", high="red", na.value="beige")
Many different built-in distances can be used
<- unlist(distanceMethodList)
dist_methods print(dist_methods)
## UniFrac1 UniFrac2 DPCoA JSD vegdist1 vegdist2
## "unifrac" "wunifrac" "dpcoa" "jsd" "manhattan" "euclidean"
## vegdist3 vegdist4 vegdist5 vegdist6 vegdist7 vegdist8
## "canberra" "bray" "kulczynski" "jaccard" "gower" "altGower"
## vegdist9 vegdist10 vegdist11 vegdist12 vegdist13 vegdist14
## "morisita" "horn" "mountford" "raup" "binomial" "chao"
## vegdist15 betadiver1 betadiver2 betadiver3 betadiver4 betadiver5
## "cao" "w" "-1" "c" "wb" "r"
## betadiver6 betadiver7 betadiver8 betadiver9 betadiver10 betadiver11
## "I" "e" "t" "me" "j" "sor"
## betadiver12 betadiver13 betadiver14 betadiver15 betadiver16 betadiver17
## "m" "-2" "co" "cc" "g" "-3"
## betadiver18 betadiver19 betadiver20 betadiver21 betadiver22 betadiver23
## "l" "19" "hk" "rlb" "sim" "gl"
## betadiver24 dist1 dist2 dist3 designdist
## "z" "maximum" "binary" "minkowski" "ANY"
You can also buid your own distances.
For vectors x and y the “quadratic” terms are J = sum(x*y), A = sum(x^2), B = sum(y^2) and “minimum” terms are J = sum(pmin(x,y)), A = sum(x) and B = sum(y), and “binary” terms are either of these after transforming data into binary form (shared number of species, and number of species for each row). Somes examples :
A+B-2*J “quadratic” squared Euclidean
A+B-2*J “minimum” Manhattan
(A+B-2*J)/(A+B) “minimum” Bray-Curtis
(A+B-2*J)/(A+B) “binary” Sørensen
(A+B-2*J)/(A+B-J) “binary” Jaccard
Another strategy is to do a heatmap for a specific taxonomy group.
For example we can taget the Chlorophyta and then label the OTUs using the Genus.
plot_heatmap(carbom_chloro, method = "NMDS", distance = "bray",
taxa.label = "Genus", taxa.order = "Genus",
low="beige", high="red", na.value="beige")
## Warning: Transformation introduced infinite values in discrete y-axis
7.5 Alpha diversity
Plot Chao1 richness estimator and Shannon diversity estimator.
plot_richness(carbom, measures=c("Chao1", "Shannon"))
Regroup together samples from the same fraction.
plot_richness(carbom, measures=c("Chao1", "Shannon"), x="level", color="fraction")
7.6 Ordination
Do multivariate analysis based on Bray-Curtis distance and NMDS ordination.
<- ordinate(carbom, "NMDS", "bray") carbom.ord
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.2317058
## Run 1 stress 0.244906
## Run 2 stress 0.2503821
## Run 3 stress 0.2406178
## Run 4 stress 0.2500933
## Run 5 stress 0.2406116
## Run 6 stress 0.2536113
## Run 7 stress 0.250753
## Run 8 stress 0.2329672
## Run 9 stress 0.2335592
## Run 10 stress 0.2542002
## Run 11 stress 0.2538693
## Run 12 stress 0.2477015
## Run 13 stress 0.2308669
## ... New best solution
## ... Procrustes: rmse 0.1025932 max resid 0.3862265
## Run 14 stress 0.2661822
## Run 15 stress 0.2335262
## Run 16 stress 0.2398239
## Run 17 stress 0.2587417
## Run 18 stress 0.2494361
## Run 19 stress 0.240805
## Run 20 stress 0.2573136
## *** No convergence -- monoMDS stopping criteria:
## 1: no. of iterations >= maxit
## 17: stress ratio > sratmax
## 2: scale factor of the gradient < sfgrmin
Plot OTUs
plot_ordination(carbom, carbom.ord, type="taxa", color="Class", shape= "Division",
title="OTUs")
A bit confusing, so make it more easy to visualize by breaking according to taxonomic division.
plot_ordination(carbom, carbom.ord, type="taxa", color="Class",
title="OTUs", label="Class") +
facet_wrap(~Division, 3)
Now display samples and enlarge the points to make it more easy to read.
plot_ordination(carbom, carbom.ord, type="samples", color="fraction",
shape="level", title="Samples") + geom_point(size=3)
Display both samples and OTUs but in 2 different panels.
plot_ordination(carbom, carbom.ord, type="split", color="Class",
shape="level", title="biplot", label = "station") +
geom_point(size=3)
7.7 Network analysis
Simple network analysis
plot_net(carbom, distance = "(A+B-2*J)/(A+B)", type = "taxa",
maxdist = 0.7, color="Class", point_label="Genus")
This is quite confusing. Let us make it more simple by using only major OTUs
plot_net(carbom_abund, distance = "(A+B-2*J)/(A+B)", type = "taxa",
maxdist = 0.8, color="Class", point_label="Genus")