Tutorial - R Dada2 metabarcode analysis

1 Aim

This tutorial explains how to process Illumina data with the Dada2 suite as implemented in R (dada2 is also implemented in Qiime). It is adapted from: https://benjjneb.github.io/dada2/tutorial.html

2 Directory structure

Paths are relative to the main directory of the GitHub repository:

  • ../fastq : fastq files
  • ../fastq_filtered : fastq files after filtration
  • ../qual_pdf : qual pdf files
  • ../dada2 : dada2 processed files
  • ../databases : PR2 database files (contains PR2 database formatted for dada2 - https://github.com/pr2database/pr2database/releases/)
  • ../blast : BLAST files output
  • ../img : Images
  • ../R_dada2 : This tutorial for Illumina files

4 Data used

The samples originate from the CARBOM cruise (2013) off Brazil.

Samples have been sorted by flow cytometry and 3 genes have been PCR amplified:

  • 18S rRNA - V4 region
  • 16S rRNA (plastid)
  • nifH

The PCR products have been sequenced in one Illumina 2 x 250 bp run. The data consist of the picoplankton samples from one transect; the fastq files have been subsampled to 1000 sequences per sample.

4.1 References

  • Gerikas Ribeiro C, Marie D, Lopes dos Santos A, Pereira Brandini F, Vaulot D. (2016). Estimating microbial populations by flow cytometry: Comparison between instruments. Limnol Oceanogr Methods 14:750–758.
  • Gerikas Ribeiro C, Lopes dos Santos A, Marie D, Brandini P, Vaulot D. (2018). Relationships between photosynthetic eukaryotes and nitrogen-fixing cyanobacteria off Brazil. ISME J in press.
  • Gerikas Ribeiro C, Lopes dos Santos A, Marie D, Helena Pellizari V, Pereira Brandini F, Vaulot D. (2016). Pico and nanoplankton abundance and carbon stocks along the Brazilian Bight. PeerJ 4:e2587.

5 Tutorial description

5.5 Examine the fastq files

5.5.2 Compute number of paired reads

n_seq file_name
1000 120p_S39_R1.subsample.fastq
1000 120p_S39_R2.subsample.fastq
1000 121p_S57_R1.subsample.fastq
1000 121p_S57_R2.subsample.fastq
1000 122p_S4_R1.subsample.fastq
1000 122p_S4_R2.subsample.fastq
1000 125p_S22_R1.subsample.fastq
1000 125p_S22_R2.subsample.fastq
1000 126p_S40_R1.subsample.fastq
1000 126p_S40_R2.subsample.fastq
1000 140p_S5_R1.subsample.fastq
1000 140p_S5_R2.subsample.fastq
1000 141p_S23_R1.subsample.fastq
1000 141p_S23_R2.subsample.fastq
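
A table like the one above can be produced with a minimal sketch such as the following (it assumes the subsampled fastq files sit in ../fastq; the column names mirror the output above):

# list all fastq files (forward and reverse)
fastq_files <- sort(list.files("../fastq", pattern = "\\.fastq$", full.names = TRUE))

# each fastq record spans 4 lines, so the number of reads is the line count divided by 4
n_seq <- sapply(fastq_files, function(f) length(readLines(f)) / 4)

data.frame(n_seq = n_seq, file_name = basename(fastq_files), row.names = NULL)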

5.6 Filter and Trim the reads

The dada2 algorithm requires primers to be removed prior to processing.

  • With dada2 there are 2 possibilities
    • Remove the primers by sequence, but dada2 does not allow for ambiguities (degenerate bases)
    • Remove the primers by position, which is not a problem for Illumina sequences but is a problem for 454
  • For complex situations we recommend using cutadapt to remove the primers: http://cutadapt.readthedocs.io/en/stable/guide.html.
    The program is very powerful.

5.6.1 Create names for the filtered files

We create the names of the files that will be generated by the filterAndTrim function in the step below. These names are composed of the path ("../fastq_filtered/"), the sample name, the read number (R1 or R2) and a "_filt" suffix, as in the sketch below.
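
A minimal sketch of this step (the way the sample names are extracted from the file names is an assumption):

# raw fastq files; the sample name is taken as the first part of the file name (e.g. "120p")
fns_R1 <- sort(list.files("../fastq", pattern = "R1", full.names = TRUE))
fns_R2 <- sort(list.files("../fastq", pattern = "R2", full.names = TRUE))
sample_names <- sapply(strsplit(basename(fns_R1), "_"), `[`, 1)

# names of the filtered files that filterAndTrim will create
filtFs <- file.path("../fastq_filtered", paste0(sample_names, "_R1_filt.fastq"))
filtRs <- file.path("../fastq_filtered", paste0(sample_names, "_R2_filt.fastq"))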

5.6.2 Removing the primers by sequence (DO NOT EXECUTE THIS STEP)

  • Go to the next step.

The next piece of code could be used to remove the primers by sequence. The dada2 package does not allow for primer degeneracy. Since our forward primer is degenerate at two positions, all four combinations would need to be tested, and the four fastq files created would then have to be re-assembled (which is not done here). The better strategy is therefore to remove the primers by truncation (see next step).
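
The truncation-based filtering could look like the following sketch (the trimLeft and truncLen values are only illustrative and must be adapted to the actual primer lengths and quality profiles; fns_R1, fns_R2, filtFs and filtRs come from 5.6.1):

library(dada2)

# trimLeft removes the primers by position (forward and reverse primer lengths);
# truncLen cuts the reads where the quality usually drops
out <- filterAndTrim(fns_R1, filtFs, fns_R2, filtRs,
                     trimLeft = c(18, 20),     # illustrative primer lengths
                     truncLen = c(240, 200),   # illustrative truncation lengths
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     rm.phix = TRUE, multithread = TRUE)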

5.7 Dada2 processing

5.7.1 Learn error rates

The error rates are plotted.
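
A minimal sketch of this step (filtFs and filtRs are the filtered files from 5.6.1):

# learn the error rates separately for the forward and reverse reads
err_F <- learnErrors(filtFs, multithread = TRUE)
err_R <- learnErrors(filtRs, multithread = TRUE)

# plot the estimated error rates for the forward reads
plotErrors(err_F, nominalQ = TRUE)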

1581480 total bases in 6876 reads from 14 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ..............
   selfConsist step 2
   selfConsist step 3
Convergence after  3  rounds.

1505844 total bases in 6876 reads from 14 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ..............
   selfConsist step 2
   selfConsist step 3
Convergence after  3  rounds.

5.7.3 Apply the sequence-variant inference algorithm to the dereplicated data

Sample 1 - 256 reads in 93 unique sequences.
Sample 2 - 457 reads in 166 unique sequences.
Sample 3 - 407 reads in 128 unique sequences.
Sample 4 - 553 reads in 220 unique sequences.
Sample 5 - 508 reads in 219 unique sequences.
Sample 6 - 456 reads in 147 unique sequences.
Sample 7 - 473 reads in 180 unique sequences.
Sample 8 - 583 reads in 211 unique sequences.
Sample 9 - 528 reads in 172 unique sequences.
Sample 10 - 530 reads in 211 unique sequences.
Sample 11 - 513 reads in 177 unique sequences.
Sample 12 - 521 reads in 199 unique sequences.
Sample 13 - 519 reads in 172 unique sequences.
Sample 14 - 572 reads in 170 unique sequences.
Sample 1 - 256 reads in 236 unique sequences.
Sample 2 - 457 reads in 391 unique sequences.
Sample 3 - 407 reads in 299 unique sequences.
Sample 4 - 553 reads in 409 unique sequences.
Sample 5 - 508 reads in 451 unique sequences.
Sample 6 - 456 reads in 335 unique sequences.
Sample 7 - 473 reads in 347 unique sequences.
Sample 8 - 583 reads in 458 unique sequences.
Sample 9 - 528 reads in 418 unique sequences.
Sample 10 - 530 reads in 404 unique sequences.
Sample 11 - 513 reads in 384 unique sequences.
Sample 12 - 521 reads in 395 unique sequences.
Sample 13 - 519 reads in 396 unique sequences.
Sample 14 - 572 reads in 400 unique sequences.
dada-class: object describing DADA2 denoising results
5 sequence variants were inferred from 93 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
dada-class: object describing DADA2 denoising results
3 sequence variants were inferred from 236 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
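
The output above corresponds first to the dereplication of the forward and reverse reads and then to the denoising itself. A minimal sketch of these two steps (err_F and err_R come from 5.7.1):

# dereplicate the filtered reads (one entry per sample)
derep_F <- derepFastq(filtFs, verbose = TRUE)
derep_R <- derepFastq(filtRs, verbose = TRUE)

# infer the sequence variants using the learned error rates
dada_F <- dada(derep_F, err = err_F, multithread = TRUE)
dada_R <- dada(derep_R, err = err_R, multithread = TRUE)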

5.7.4 Merge sequences

sequence abundance forward reverse nmatch nmismatch nindel prefer accept
AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAACGCTCGTAGTCGGATTTCGGGGCAGGTTTGCCGGTCTGCCGATGGGTATGCACTGGTAGAACTGTCCTTCCTTCCGGAGACGGCGCCTACTCTTAACTGAGCGGGTGTCGGAGACGGATCGTTTACTTTGAAAAAATCAGAGTGTTTCAAGCAGGCAGCTCGCTTTTGCATGGATTAGCATGGGATAATGAAATAGGACTTTGGTGCTATTTTGTTGGTTTCGAACACCGAAGTAATGATTAACAGGGACAGTCAGGGGCACTCGTATTCCGCCGAGAGAGGTGAAATTCTTAGACCAGCGGAAGACGAACCACTGCGAAAGCATTTGCCAGGGATGTTT 146 1 1 71 0 0 1 TRUE
CACACGTCTAATGTTGCATTGTAAAGCACAAACCACTGTTTTACATTTAGCTGCAGAAAGAGGAACTGTAGAAGATATTGAACTTGATGAAGTAGTAATTCCTGGCTATAACAACGTTTTATGCGTTGAGTCCGGTGGTCCTGAGCCTGGAGTTGGATGTGCTGGTCGTGGTATTATTACTGCTATCAACTTCCTTGAAGAAGAAGGTGCTTACGAAAACCTAGATTTCGTATCTTATGATGTATTAGGAGACGTTGTTTGTGGTGGTTTCGCTATGCCTATCCGTGAAGGAAAAGCACAAGAAATCTACATCGTTAC 64 2 2 131 0 0 1 TRUE
AGCTCCAATAGCGTATACTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTCGAAGCAGCCGCTCTGTTACTAGTTAGCACGCAGCGTGCGCTTCGTCCATTTTTGTAAAGACTATGTCTGTCATTAAGTTGGTGGGCATAGGACTTGCATCATTTACTGTGAGCAAAATAGAGTGTTCAAAGCAGGCTTAGGCCATGAATATCTTAGCATGGAATAATAAGATAAGACCTTGGTCTATTTTGTTGGTTAGCATTCCGAGGTAATGATTAATAGGGATAGTTGGGGGTATTCGTATTCAATTGTCAGAGGTGAAATTCTTGGATTTATTGAAGACGAACTACTGCGAAAGCATTTACCAAGGATGTTT 13 3 3 72 0 0 1 TRUE
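
A minimal sketch of the merging step (the objects come from 5.7.3; the table above corresponds to the first merged sample):

# merge the denoised forward and reverse reads
mergers <- mergePairs(dada_F, derep_F, dada_R, derep_R, verbose = TRUE)
head(mergers[[1]])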

5.7.5 Make sequence table

[1] 14 58

318 351 360 363 367 369 371 372 373 375 376 377 378 379 380 382 383 384 
  3   1   1   3   1   1   2   1   2   1   3   6  12   4   4   2   5   2 
387 388 
  1   3 
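
A minimal sketch (mergers comes from 5.7.4); the first output above is the dimension of the table (14 samples x 58 ASVs) and the second is the distribution of sequence lengths:

seqtab <- makeSequenceTable(mergers)
dim(seqtab)
table(nchar(getSequences(seqtab)))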

5.7.6 Remove chimeras

Note that chimera removal will produce spurious results if the primers have not been removed. The method parameter can be set to "pooled" or "consensus".
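
A minimal sketch using the "consensus" method (seqtab comes from 5.7.5):

seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)

paste0("% of non chimeras : ", round(sum(seqtab_nochim) / sum(seqtab) * 100, 2))
paste0("total number of sequences : ", sum(seqtab_nochim))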

[1] "% of non chimeras : 100"
[1] "total number of sequences : 5869"

In our case no chimeras were found. It is noteworthy that the total number of sequences is almost twice the number recovered with mothur (2573).

5.7.7 Track number of reads at each step

input filtered denoised merged tabled nonchim
120p 1000 256 241 223 223 223
121p 1000 457 446 397 397 397
122p 1000 407 396 357 357 357
125p 1000 553 551 464 464 464
126p 1000 508 493 340 340 340
140p 1000 456 441 427 427 427
141p 1000 473 460 381 381 381
142p 1000 583 567 495 495 495
155p 1000 528 524 445 445 445
156p 1000 530 525 426 426 426
157p 1000 513 507 438 438 438
165p 1000 521 509 442 442 442
166p 1000 519 510 478 478 478
167p 1000 572 563 556 556 556
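
The table above can be assembled as in the following sketch (out, dada_F, mergers, seqtab and seqtab_nochim come from the previous steps):

# number of reads kept in a dada2 object
getN <- function(x) sum(getUniques(x))

track <- cbind(out,
               sapply(dada_F, getN),
               sapply(mergers, getN),
               rowSums(seqtab),
               rowSums(seqtab_nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample_names
track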

5.7.12 Filter for 18S

Remember that we sequenced 3 genes (18S rRNA, 16S rRNA plastid and nifH). We remove the sequences that are not 18S by keeping only those with a bootstrap value for the Supergroup level in excess of 80.
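
A minimal sketch, assuming that taxa is the result of assignTaxonomy run against the PR2 database with outputBootstraps = TRUE (so that taxa$boot holds the bootstrap values for each rank) and that seqtab_nochim comes from 5.7.6:

# keep only the ASVs whose Supergroup assignment has a bootstrap value above 80
is_18S <- taxa$boot[, "Supergroup"] > 80
seqtab_18S <- seqtab_nochim[, is_18S]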

5.7.13 Write FASTA file for BLAST analysis with taxonomy

We use the Biostrings library.
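
A minimal sketch (seqtab_18S comes from 5.7.12; the ASV naming scheme is only illustrative and the taxonomy strings could be appended to the names):

library(Biostrings)

seqs <- getSequences(seqtab_18S)
seq_set <- DNAStringSet(seqs)
names(seq_set) <- paste0("asv_", seq_along(seqs))   # illustrative ASV ids

# write one unwrapped sequence per record
writeXStringSet(seq_set, filepath = "../blast/CARBOM_ASV.fasta", width = 20000)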

This file can be sent to a server and a BLAST analysis can be run using the following qsub script:


#!/bin/bash
# Commands starting with  '#$' are interpreted by SGE
# Shell to be used for the job
#$ -S /bin/bash
# User to be informed
#$ -M vaulot@sb-roscoff.fr
# Export all environment variables
#$ -V
# Send a message by email  at beginning (b), end (e) and abort (a) of job
#$ -m bea
# Standard output.  Can use '-j y' to add stderr with stdout
#$ -o repl
# Run the command from the current directory where the script resides
#$ -cwd
# Define environmental variables

# submitted with
# qsub -q short.q qsub_blast_antar.sh

# Replace the next line by the location of the directory where you have your data
DIR_PROJECT="/projet/sbr/ccebarcodep1408/workshop_nz_2018/blast/"

cd $DIR_PROJECT

FILE="CARBOM_ASV"

FASTA=$DIR_PROJECT$FILE".fasta"
BLAST_TSV=$DIR_PROJECT$FILE".blast.tsv"

OUT_FMT="6 qseqid sseqid sacc stitle sscinames staxids sskingdoms sblastnames pident slen length mismatch gapopen qstart qend sstart send evalue bitscore"

blastn -max_target_seqs 100 -evalue 1.00e-10 -query $FASTA -out $BLAST_TSV -db /db/blast/all/nt -outfmt "$OUT_FMT"

Daniel Vaulot

29 11 2018