Introduction to R for microbial ecologists

1 Aim

This document introduces basic R functions that can be used by microbial ecologists.

3 Ressources

3.1 Books

  • R-intro.pdf : Very good introduction to R, short and clear
  • R_in_a_nutshell.pdf : Many many receipes to solve all your questions
  • R graphics cook book : very good for ggplot2

3.3 Cheat sheets

3.4 On line course

4 Step by step tutorial

4.1 Some important points before starting

  • R is an interpreted language
  • R is case sensitive
  • R works with vectors
  • Types of variables : character, real, logical, factor
  • Special values : TRUE, FALSE, NA
  • Types of structures : vector, matrix, list, data frame
  • Directory names use the linux convention: use / and not

4.2 Start R Studio

  • Go to the tutorial folder
  • Switch to the subdirectory \introduction
  • Launch R Studio
  • Four windows
    • top-left : script files / data tables
    • bottom -left: code
    • top - left : objects
    • bottom - right : help / libraries / files / graphics

4.4 Create simple vectors and data frame

4.4.1 Enter the data

Our aim here to create a small table and then to compute some simple statistics

station fraction diatoms bolidos
4 Nano 17165 2
4 Pico 59786 1293
72 Nano 6951 3
72 Pico 3815 494

4.4.2 Compute new quantities

[1] 17167 61079  6954  4309
[1] 0.9998835 0.9788307 0.9995686 0.8853562

4.4.3 Compute statistics

[1] 0.9659098
[1] 0.05459839
[1] 4
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8854  0.9555  0.9892  0.9659  0.9996  0.9999 

4.4.4 Accessing subsets

[1] 0.9998835
[1] 0.9998835 0.9788307

4.4.5 Data frames

  station fraction diatoms bolidos diatoms_pct
1       4     Nano   17165       2   0.9998835
2       4     Pico   59786    1293   0.9788307
3      72     Nano    6951       3   0.9995686
4      72     Pico    3815     494   0.8853562

4.4.6 Access individual columns

[1] 17165 59786  6951  3815

4.4.7 Access specific lines

[1] 17165 59786

4.4.8 Compute statistics of a specific group

[1] 38475.5

4.4.9 Computing statistics according to a factor

This can be done at least two different ways, but you will see later that it is much easier to do with the dplyr package

      4      72 
38475.5  5383.0 
  station diatoms
1       4 38475.5
2      72  5383.0

4.5 Importing data

A few important points :

  • Your data must be formatted in a clean table form
    • No blank line
    • Each column must contain data of the same type (e.g. dates)
    • Missing data can be represented by empty cells
    • Each line must contain data in ALL columns
  • Column titles (the first line)
    • No space (use _)
    • Always begin by letter (not a number)
  • Only import primary data, all derived data can (and must) be computed with R which makes data changes much more easy

4.5.1 The hard way - exporting from Excel to a tab-delimited file

  • Open Excel file in /data directory : R_Tara.xlsx
  • Copy and Paste into text file using Notepad++
  • Save as R_Tara.txt

Note : you can also export from Excel but then it must be TAB-delimited (tsv file)

Get the name and type of all the columns - Note that strings are of type “factor” Note that empty cells are labelled as NA (not available) which is a R constant

'data.frame':   293 obs. of  28 variables:
 $ Sample          : Factor w/ 293 levels "TV9_1","TV9_10",..: 124 121 141 122 123 120 93 58 101 68 ...
 $ Bacillariophyta : int  17165 6159 59786 4689 6280 1000 12517 64721 8126 13584 ...
 $ Bolidophyceae   : int  12 42 1293 1036 2 188 24 163 2991 2261 ...
 $ Chrysophyceae   : int  26 223 8758 7494 21 670 296 593 10069 25834 ...
 $ Dictyochophyceae: int  155 487 21967 21293 14 1026 265 1658 19440 48876 ...
 $ Pelagophyceae   : int  233 138 73474 4774 13 722 40 31 1687 871 ...
 $ Phaeophyceae    : int  0 12 1835 526 0 11 12 25 382 2738 ...
 $ Pinguiophyceae  : int  11 2 19 40 6 5 50 229 20 32 ...
 $ Raphidophyceae  : int  0 0 0 0 0 0 18 21 48 23 ...
 $ Strameno_all    : int  17602 7063 167132 39852 6336 3622 13222 67441 42763 94219 ...
 $ Photo_all       : int  22708 8817 427846 93006 8976 5392 14299 70406 81891 144725 ...
 $ depth_level     : Factor w/ 2 levels "DCM","SUR": 1 2 1 2 1 2 1 2 1 2 ...
 $ station         : int  4 4 4 4 4 4 7 7 7 7 ...
 $ template        : Factor w/ 2 levels "D","WGA/D": 2 1 1 1 2 1 2 2 1 1 ...
 $ fraction        : Factor w/ 4 levels "0.8-5","180-2000",..: 4 4 1 1 2 2 4 4 1 1 ...
 $ ntags           : int  1796545 2128487 2122955 976685 1857697 3150580 2549282 1606212 1625284 1334743 ...
 $ Month           : Factor w/ 10 levels "apr","aug","dec",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Latitude        : num  36.6 36.6 36.6 36.6 36.6 ...
 $ Longitude       : num  -6.57 -6.57 -6.57 -6.57 -6.57 ...
 $ sampling_depth  : num  40 3 40 3 40 3 42 3 42 3 ...
 $ date            : Factor w/ 77 levels "01-Aug-2011 20:13:34",..: 44 44 44 44 44 44 65 64 65 64 ...
 $ chloro_hplc     : num  NA 0.0984 NA 0.0984 NA ...
 $ tara_NO2        : num  NA NA NA NA NA NA 0.005 0.076 0.005 0.076 ...
 $ tara_PO4        : num  NA NA NA NA NA NA 0.026 0.041 0.026 0.041 ...
 $ NO2NO3          : num  NA NA NA NA NA NA 0.4 0.23 0.4 0.23 ...
 $ tara_SI         : num  NA NA NA NA NA NA 0.652 0.998 0.652 0.998 ...
 $ tara_temp       : num  NA NA NA NA NA ...
 $ tara_salinity   : num  NA NA NA NA NA ...

4.5.2 The easy way - Read directly Excel (readxl library)

Get the name and type of all the columns - Note that strings are now of type “char”, which is better

Classes 'tbl_df', 'tbl' and 'data.frame':   293 obs. of  28 variables:
 $ Sample          : chr  "TV9_237" "TV9_234" "TV9_254" "TV9_235" ...
 $ Bacillariophyta : num  17165 6159 59786 4689 6280 ...
 $ Bolidophyceae   : num  12 42 1293 1036 2 ...
 $ Chrysophyceae   : num  26 223 8758 7494 21 ...
 $ Dictyochophyceae: num  155 487 21967 21293 14 ...
 $ Pelagophyceae   : num  233 138 73474 4774 13 ...
 $ Phaeophyceae    : num  0 12 1835 526 0 ...
 $ Pinguiophyceae  : num  11 2 19 40 6 5 50 229 20 32 ...
 $ Raphidophyceae  : num  0 0 0 0 0 0 18 21 48 23 ...
 $ Strameno_all    : num  17602 7063 167132 39852 6336 ...
 $ Photo_all       : num  22708 8817 427846 93006 8976 ...
 $ depth_level     : chr  "DCM" "SUR" "DCM" "SUR" ...
 $ station         : num  4 4 4 4 4 4 7 7 7 7 ...
 $ template        : chr  "WGA/D" "D" "D" "D" ...
 $ fraction        : chr  "5-20" "5-20" "0.8-5" "0.8-5" ...
 $ ntags           : num  1796545 2128487 2122955 976685 1857697 ...
 $ Month           : chr  "sep" "sep" "sep" "sep" ...
 $ Latitude        : num  36.6 36.6 36.6 36.6 36.6 ...
 $ Longitude       : num  -6.57 -6.57 -6.57 -6.57 -6.57 ...
 $ sampling_depth  : num  40 3 40 3 40 3 42 3 42 3 ...
 $ date            : chr  "15-Sep-2009 16:45:02" "15-Sep-2009 16:45:02" "15-Sep-2009 16:45:02" "15-Sep-2009 16:45:02" ...
 $ chloro_hplc     : num  NA 0.0984 NA 0.0984 NA ...
 $ tara_NO2        : num  NA NA NA NA NA ...
 $ tara_PO4        : num  NA NA NA NA NA ...
 $ NO2NO3          : num  NA NA NA NA NA ...
 $ tara_SI         : num  NA NA NA NA NA ...
 $ tara_temp       : num  NA NA NA NA NA ...
 $ tara_salinity   : num  NA NA NA NA NA ...

4.6 Compute derived quantities and Statistics (using dplyr library)

Compute % of Bacilliarophyta and Pelagophyceae vs Total photosynthetic

Mean and SD as a function of size fraction and depth_level

# A tibble: 8 x 5
# Groups:   fraction [?]
  fraction depth_level Baci_pct_mean Baci_pct_SD     n
  <chr>    <chr>               <dbl>       <dbl> <int>
1 0.8-5    DCM                  14.5        15.6    33
2 0.8-5    SUR                  12.5        15.0    40
3 180-2000 DCM                  59.0        30.2    31
4 180-2000 SUR                  53.7        29.9    45
5 20-180   DCM                  84.7        20.6    28
6 20-180   SUR                  81.7        19.6    42
7 5-20     DCM                  73.8        26.5    33
8 5-20     SUR                  74.6        27.7    41

4.7 Do simple X-Y plots (using ggplot2 library)

Load the ggplot2 library

X vs Y

X vs Y with variation in color of points with size fraction

X vs Y with variation in color of points with size fraction and shape with depth level

X vs Y with variation in color of points with size fraction and shape with depth level

X vs Y with variation sampling_depth for color of points and shape with with size fraction.

Note that sampling_depth is a continuous variable

Categorical data vs y with variation in color of points with depth level

4.8 Other types of plots

Boxplot for the same data

Histogram for all the data

Histogram with different color for each size fraction

Histogram with different graphs (facets) for each size fraction and depth and change bin width

Daniel Vaulot

25 11 2018