Phyloseq tutorial

1 Aim

This document explains the use of the phyloseq R library to analyze metabarcoding data.

2 Phyloseq R library

Phyloseq web site : https://joey711.github.io/phyloseq/index.html
See in particular tutorials for
- importing data: https://joey711.github.io/phyloseq/import-data.html
- heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html

3 Data

Carbom cruise

This tutorial uses a reduced metabarcoding dataset obtained by C. Ribeiro and A. Lopes dos Santos. This dataset originates from the CARBOM cruise in 2013 off Brazil and corresponds to the 18S V4 region amplified on flow cytometry sorted samples (see pptx file for details) and sequenced on an Illumina run 2*250 bp analyzed with mothur.

3.1 References for data

Gérikas Ribeiro, C., Lopes dos Santos, A., Marie, D., Helena Pellizari, V., Pereira Brandini, F., and Vaulot, D. (2016). Pico and nanoplankton abundance and carbon stocks along the Brazilian Bight. PeerJ 4, e2587. doi:10.7717/peerj.2587.
Gérikas Ribeiro, C., Marie, D., Lopes dos Santos, A., Pereira Brandini, F., and Vaulot, D. (2016). Estimating microbial populations by flow cytometry: Comparison between instruments. Limnol. Oceanogr. Methods 14, 750–758. doi:10.1002/lom3.10135.
Gérikas Ribeiro C, Lopes dos Santos A, Marie D, Brandini P, Vaulot D. (2018). Relationships between photosynthetic eukaryotes and nitrogen-fixing cyanobacteria off Brazil. ISME J in press.

Carbom paper in ISME

4 To be added

Exercices

5 Prerequisites to be installed

R : https://pbil.univ-lyon1.fr/CRAN/
R studio : https://www.rstudio.com/products/rstudio/download/#download
Download this tutorial from GitHub : https://github.com/vaulot/R_tutorials/archive/master.zip
Download and install the following libraries by running under R studio the following lines

install.packages("dplyr")     # To manipulate dataframes
install.packages("readxl")    # To read Excel files into R

install.packages("ggplot2")   # for high quality graphics

source("https://bioconductor.org/biocLite.R")
biocLite("phyloseq")

6 Gettin started

Transfer the files that you downloaded from https://github.com/vaulot/R_tutorials/archive/master.zip to a directory on the computer or server you are using “C:/R_tutorial” or “home/R_tutorial”
Navigate to the /R_tutorial/phyloseq directory. You should see the following files :

Phyloseq_tutorial.Rmd

Double click on the file and this should open R Studio

R studio

Run the “R chunks”

7 Step by step

7.1 Load necessary libraries

library("phyloseq")
library("ggplot2")      # graphics
library("readxl")       # necessary to import the data from Excel file
library("dplyr")        # filter and reformat data frames
library("tibble")       # Needed for converting column to row names

7.2 Read the data and create phyloseq objects

Change your working directory to where the files are located

Three tables are needed

OTU
Taxonomy
Samples

They are read from a single Excel file where each sheet contains one of the tables

Table OTU - OTU abundance

Table Taxo - OTU taxonomy

Table Samples

  otu_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "OTU matrix")
  tax_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "Taxonomy table")
  samples_df <- read_excel("../data/CARBOM data.xlsx", sheet = "Samples")

Phyloseq objects need to have row.names

define the row names from the otu column

  otu_mat <- otu_mat %>%
    tibble::column_to_rownames("otu")

Idem for the two other matrixes

  tax_mat <- tax_mat %>% 
    tibble::column_to_rownames("otu")

  samples_df <- samples_df %>% 
    tibble::column_to_rownames("sample")

Transform into matrixes otu and tax tables (sample table can be left as data frame)

  otu_mat <- as.matrix(otu_mat)
  tax_mat <- as.matrix(tax_mat)

Transform to phyloseq objects

  OTU = otu_table(otu_mat, taxa_are_rows = TRUE)
  TAX = tax_table(tax_mat)
  samples = sample_data(samples_df)
  
  carbom <- phyloseq(OTU, TAX, samples)
  carbom

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 55 samples ]
## sample_data() Sample Data:       [ 55 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 7 taxonomic ranks ]

Visualize data

  sample_names(carbom)

##  [1] "X10n"   "X10p"   "X11n"   "X11p"   "X120n"  "X120p"  "X121n"  "X121p" 
##  [9] "X122n"  "X122p"  "X125n"  "X125p"  "X126n"  "X126p"  "X127n"  "X13n"  
## [17] "X13p"   "X140n"  "X140p"  "X141n"  "X141p"  "X142n"  "X142p"  "X155n" 
## [25] "X155p"  "X156n"  "X156p"  "X157n"  "X157p"  "X15n"   "X15p"   "X165n" 
## [33] "X165p"  "X166n"  "X166p"  "X167n"  "X167p"  "X1n"    "X1p"    "X2n"   
## [41] "X2p"    "X3n"    "X3p"    "X5n"    "X5p"    "X7n"    "X7p"    "X9n"   
## [49] "X9p"    "tri01n" "tri01p" "tri02n" "tri02p" "tri03n" "tri03p"

  rank_names(carbom)

## [1] "Domain"     "Supergroup" "Division"   "Class"      "Order"     
## [6] "Family"     "Genus"

  sample_variables(carbom)

##  [1] "fraction"          "Select_18S_nifH"   "total_18S"        
##  [4] "total_16S"         "total_nifH"        "sample_number"    
##  [7] "transect"          "station"           "depth"            
## [10] "latitude"          "longitude"         "picoeuks"         
## [13] "nanoeuks"          "bottom_depth"      "level"            
## [16] "transect_distance" "date"              "time"             
## [19] "phosphates"        "silicates"         "ammonia"          
## [22] "nitrates"          "nitrites"          "temperature"      
## [25] "fluorescence"      "salinity"          "sample_label"

Keep only samples to be analyzed

  carbom <- subset_samples(carbom, Select_18S_nifH =="Yes")
  carbom

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 7 taxonomic ranks ]

Keep only photosynthetic taxa

  carbom <- subset_taxa(carbom, Division %in% c("Chlorophyta", "Dinophyta", "Cryptophyta", 
                                                "Haptophyta", "Ochrophyta", "Cercozoa"))
  carbom <- subset_taxa(carbom, !(Class %in% c("Syndiniales", "Sarcomonadea")))
  carbom

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 205 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 205 taxa by 7 taxonomic ranks ]

Normalize number of reads in each sample using median sequencing depth.

  total = median(sample_sums(carbom))
  standf = function(x, t=total) round(t * (x / sum(x)))
  carbom = transform_sample_counts(carbom, standf)

The number of reads used for normalization is 44903.

7.3 Bar graphs

Basic bar graph based on Division

  plot_bar(carbom, fill = "Division")

Make the bargraph nicer by removing OTUs boundaries. This is done by adding ggplot2 modifier.

  plot_bar(carbom, fill = "Division") + 
  geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack")

Regroup together Pico vs Nano samples

  carbom_fraction <- merge_samples(carbom, "fraction")
  plot_bar(carbom_fraction, fill = "Division") + 
  geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack")

Keep only Chlorophyta and use color according to genus. Do separate panels Pico vs Nano and Surface vs Deep samples.

  carbom_chloro <- subset_taxa(carbom, Division %in% c("Chlorophyta"))
  plot_bar(carbom_chloro, x="Genus", fill = "Genus", facet_grid = level~fraction) +
  geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack")

7.4 Heatmaps

A basic heatmap using the default parameters.

  plot_heatmap(carbom, method = "NMDS", distance = "bray")

## Warning: Transformation introduced infinite values in discrete y-axis

It is very very cluttered. It is better to only consider the most abundant OTUs for heatmaps. For example one can only take OTUs that represent at least 20% of reads in at least one sample. Remember we normalized all the sampples to median number of reads (total). We are left with only 33 OTUS which makes the reading much more easy.

  carbom_abund <- filter_taxa(carbom, function(x) sum(x > total*0.20) > 0, TRUE)
  carbom_abund

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 33 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 33 taxa by 7 taxonomic ranks ]

  otu_table(carbom_abund)[1:8, 1:5]

## OTU Table:          [8 taxa and 5 samples]
##                      taxa are rows
##         X10n  X10p  X11p X120n X120p
## Otu001 13339  7346  3804 12662     3
## Otu002    18  8329 14958    30 36206
## Otu003  9692 10488    20 16537    11
## Otu004  3584  4943    33     7     9
## Otu005     0     6    11     0     5
## Otu006     0     9     0     0     5
## Otu007  4473   605   587  5894     3
## Otu008     1     9  6707     2    17

  plot_heatmap(carbom_abund, method = "NMDS", distance = "bray")

## Warning: Transformation introduced infinite values in discrete y-axis

It is possible to use different distances and different multivaraite methods. For example Jaccard distance and MDS and label OTUs with Class, order by Class. We can also change the Palette (the default palette is a bit ugly…).

  plot_heatmap(carbom_abund, method = "MDS", distance = "(A+B-2*J)/(A+B-J)", 
               taxa.label = "Class", taxa.order = "Class", 
               trans=NULL, low="beige", high="red", na.value="beige")

Many different built-in distances can be used

dist_methods <- unlist(distanceMethodList)
print(dist_methods)

##     UniFrac1     UniFrac2        DPCoA          JSD     vegdist1     vegdist2 
##    "unifrac"   "wunifrac"      "dpcoa"        "jsd"  "manhattan"  "euclidean" 
##     vegdist3     vegdist4     vegdist5     vegdist6     vegdist7     vegdist8 
##   "canberra"       "bray" "kulczynski"    "jaccard"      "gower"   "altGower" 
##     vegdist9    vegdist10    vegdist11    vegdist12    vegdist13    vegdist14 
##   "morisita"       "horn"  "mountford"       "raup"   "binomial"       "chao" 
##    vegdist15   betadiver1   betadiver2   betadiver3   betadiver4   betadiver5 
##        "cao"          "w"         "-1"          "c"         "wb"          "r" 
##   betadiver6   betadiver7   betadiver8   betadiver9  betadiver10  betadiver11 
##          "I"          "e"          "t"         "me"          "j"        "sor" 
##  betadiver12  betadiver13  betadiver14  betadiver15  betadiver16  betadiver17 
##          "m"         "-2"         "co"         "cc"          "g"         "-3" 
##  betadiver18  betadiver19  betadiver20  betadiver21  betadiver22  betadiver23 
##          "l"         "19"         "hk"        "rlb"        "sim"         "gl" 
##  betadiver24        dist1        dist2        dist3   designdist 
##          "z"    "maximum"     "binary"  "minkowski"        "ANY"

You can also buid your own distances.

For vectors x and y the “quadratic” terms are J = sum(x*y), A = sum(x^2), B = sum(y^2) and “minimum” terms are J = sum(pmin(x,y)), A = sum(x) and B = sum(y), and “binary” terms are either of these after transforming data into binary form (shared number of species, and number of species for each row). Somes examples :

A+B-2*J “quadratic” squared Euclidean
A+B-2*J “minimum” Manhattan
(A+B-2*J)/(A+B) “minimum” Bray-Curtis
(A+B-2*J)/(A+B) “binary” Sørensen
(A+B-2*J)/(A+B-J) “binary” Jaccard

Another strategy is to do a heatmap for a specific taxonomy group.

For example we can taget the Chlorophyta and then label the OTUs using the Genus.

  plot_heatmap(carbom_chloro, method = "NMDS", distance = "bray", 
               taxa.label = "Genus", taxa.order = "Genus", 
               low="beige", high="red", na.value="beige")

## Warning: Transformation introduced infinite values in discrete y-axis

7.5 Alpha diversity

Plot Chao1 richness estimator and Shannon diversity estimator.

  plot_richness(carbom, measures=c("Chao1", "Shannon"))

Regroup together samples from the same fraction.

  plot_richness(carbom, measures=c("Chao1", "Shannon"), x="level", color="fraction")

7.6 Ordination

Do multivariate analysis based on Bray-Curtis distance and NMDS ordination.

  carbom.ord <- ordinate(carbom, "NMDS", "bray")

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.2317058 
## Run 1 stress 0.244906 
## Run 2 stress 0.2503821 
## Run 3 stress 0.2406178 
## Run 4 stress 0.2500933 
## Run 5 stress 0.2406116 
## Run 6 stress 0.2536113 
## Run 7 stress 0.250753 
## Run 8 stress 0.2329672 
## Run 9 stress 0.2335592 
## Run 10 stress 0.2542002 
## Run 11 stress 0.2538693 
## Run 12 stress 0.2477015 
## Run 13 stress 0.2308669 
## ... New best solution
## ... Procrustes: rmse 0.1025932  max resid 0.3862265 
## Run 14 stress 0.2661822 
## Run 15 stress 0.2335262 
## Run 16 stress 0.2398239 
## Run 17 stress 0.2587417 
## Run 18 stress 0.2494361 
## Run 19 stress 0.240805 
## Run 20 stress 0.2573136 
## *** No convergence -- monoMDS stopping criteria:
##      1: no. of iterations >= maxit
##     17: stress ratio > sratmax
##      2: scale factor of the gradient < sfgrmin

Plot OTUs

  plot_ordination(carbom, carbom.ord, type="taxa", color="Class", shape= "Division", 
                  title="OTUs")

A bit confusing, so make it more easy to visualize by breaking according to taxonomic division.

  plot_ordination(carbom, carbom.ord, type="taxa", color="Class", 
                  title="OTUs", label="Class") + 
                  facet_wrap(~Division, 3)

Now display samples and enlarge the points to make it more easy to read.

  plot_ordination(carbom, carbom.ord, type="samples", color="fraction", 
                  shape="level", title="Samples") + geom_point(size=3)

Display both samples and OTUs but in 2 different panels.

  plot_ordination(carbom, carbom.ord, type="split", color="Class", 
                  shape="level", title="biplot", label = "station") +  
  geom_point(size=3)

7.7 Network analysis

Simple network analysis

  plot_net(carbom, distance = "(A+B-2*J)/(A+B)", type = "taxa", 
           maxdist = 0.7, color="Class", point_label="Genus")

This is quite confusing. Let us make it more simple by using only major OTUs

  plot_net(carbom_abund, distance = "(A+B-2*J)/(A+B)", type = "taxa", 
           maxdist = 0.8, color="Class", point_label="Genus")