+ - 0:00:00
Notes for current slide
Notes for next slide

Fundamental of Data Science for EESS

R session 03 - Data wrangling

Daniel Vaulot

2019-01-31

1 / 45

Outline

  • Concept of tidy data
  • Reading data
  • Manipulating data
    • Columns
    • Rows
    • Joining tables
2 / 45

Installation and Resources

Packages

  • readxl : Reading Excel files
  • readr : Reading and writing Text files

  • dplyr : Filter and reformat data frames

  • tidyr : Make data "tidy"

  • stringr : Manipulating strings

  • lubridate : Manipulate date

Resources

3 / 45

Tidy data

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

4 / 45

Initialize

Load necessary libraries

library("readxl") # Import the data from Excel file
library("readr") # Import the data from Excel file
library("dplyr") # filter and reformat data frames
library("tidyr") # make data tidy
library("stringr") # manipulate strings
library("lubridate") # manipulate date
library("ggplot2") # graphics
5 / 45

Oceanographic data

CARBOM cruise off Brazil

6 / 45

Oceanographic data

CARBOM cruise off Brazil

  • Stations
  • Depth
  • Coordinates
  • Temperature, Salinity
  • Nitrates, Phosphates

6 / 45

Oceanographic data

Microbial populations

7 / 45

Oceanographic data

Microbial populations

  • Flow cytometry :
    • pico-eukaryotes
    • nano-eukaryotes
7 / 45

Read data

Text file - TAB delimited

8 / 45

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
9 / 45

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 13/11/2013 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 13/11/2013 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 18/11/2013 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 18/11/2013 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 18/11/2013 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 18/11/2013 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 18/11/2013 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 18/11/2013 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 13/11/2013 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 18/11/2013 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
9 / 45

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 13/11/2013 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 13/11/2013 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 18/11/2013 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 18/11/2013 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 18/11/2013 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 18/11/2013 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 18/11/2013 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 18/11/2013 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 13/11/2013 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 18/11/2013 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • readr::read_tsv() : read tab delimited files
  • readr::read_csv() : read comma delimited files

  • readr::write_tsv() : write tab delimited files

9 / 45

Read data

Excel sheet

10 / 45

Read data

Read the data - read_excel

samples <- readxl::read_excel("data/CARBOM data.xlsx", sheet = "Samples_boat")
11 / 45

Read data

Read the data - read_excel

samples <- readxl::read_excel("data/CARBOM data.xlsx", sheet = "Samples_boat")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • Can also select a range : e.g. A1:Q26
  • Can skip lines
11 / 45

Read data

Bad data input under Excel


sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • There are missing values in the column station because only recorded when changed
12 / 45

Read data

Filling missing values - fill

samples <- tidyr::fill(samples, station)
13 / 45

Read data

Filling missing values - fill

samples <- tidyr::fill(samples, station)
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • All missing values have been filled in.
13 / 45

Read data

dplyr - Manipulate tables

@allison_horst

14 / 45

Manipulate columns

List columns

colnames(samples)
[1] "sample number" "transect" "station" "date"
[5] "time" "depth" "level" "latitude"
[9] "longitude" "picoeuks" "nanoeuks" "phosphates"
[13] "nitrates" "temperature" "salinity"
15 / 45

Manipulate columns

Summarize columns

summary(samples$depth)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5.0 5.0 50.0 56.6 100.0 140.0 3
16 / 45

Manipulate columns

Select specific columns - select

samples_select <- dplyr::select(samples, transect, `sample number`, station,
depth, latitude, longitude, picoeuks, nanoeuks)
17 / 45

Manipulate columns

Select specific columns - select

samples_select <- dplyr::select(samples, transect, `sample number`, station,
depth, latitude, longitude, picoeuks, nanoeuks)
transect sample number station depth latitude longitude picoeuks nanoeuks
1 10 81 140 -27.42 -44.72 3278 1232
1 11 85 110 -26.80 -45.30 16312 1615
2 120 96 5 -27.39 -47.82 1150 75
2 121 96 30 -27.39 -47.82 1737 218
2 122 96 50 -27.39 -47.82 853 234
2 125 98 5 -27.59 -47.39 3086 1300
2 126 98 50 -27.59 -47.39 1217 782
2 127 98 85 -27.59 -47.39 3420 226
1 13 86 105 -26.33 -45.41 6366 1007
2 140 101 5 -27.79 -46.96 500 366
  • Column names are not "quoted" (in base R you need to "quote" the column names)
  • Better not to put space in column header because then must enclose column name with ` (back-quote)
17 / 45

Manipulate columns

Select a range of columns - select

samples_select <- dplyr::select(samples, transect:nanoeuks)
18 / 45

Manipulate columns

Select a range of columns - select

samples_select <- dplyr::select(samples, transect:nanoeuks)
transect station date time depth level latitude longitude picoeuks nanoeuks
1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232
1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615
2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75
2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218
2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234
2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300
2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782
2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226
1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007
2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366
18 / 45

Manipulate columns

Unselect columns - select

samples_select <- dplyr::select(samples, -nitrates, -phosphates)
19 / 45

Manipulate columns

Unselect columns - select

samples_select <- dplyr::select(samples, -nitrates, -phosphates)
sample number transect station date time depth level latitude longitude picoeuks nanoeuks temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 23.1 33.5
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 22.6 33.7
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 23.1 35.7
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 23.7 37.2
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 23.5 36.5
19 / 45

Manipulate columns

Using the pipe operator - %>%

samples_select <- samples %>% dplyr::select(transect:nanoeuks)
20 / 45

Manipulate columns

Using the pipe operator - %>%

samples_select <- samples %>% dplyr::select(transect:nanoeuks)
transect station date time depth level latitude longitude picoeuks nanoeuks
1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232
1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615
2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75
2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218
2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234
2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300
2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782
2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226
1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007
2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366
  • It is cleaner to write on 2 lines
samples_select <- samples %>%
dplyr::select(transect:nanoeuks)
20 / 45

Manipulate columns

Renaming variables - rename

samples <- samples %>% dplyr::rename(sample_number = `sample number`)
21 / 45

Manipulate columns

Renaming variables - rename

samples <- samples %>% dplyr::rename(sample_number = `sample number`)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
21 / 45

Manipulate columns

Creating new variables - mutate

samples <- samples %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
22 / 45

Manipulate columns

Creating new variables - mutate

samples <- samples %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5 93.87755
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7 88.84910
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9 78.47286
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7 70.36024
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2 60.88044
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0 93.80143
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5 57.73672
  • You can also use transmute() but then it will drop all the other columns.
  • It is much much better to compute new variables in R than in Excel, because you can easily track and correct errors.
22 / 45

Manipulate columns

Using the pipe operator you can chain operations

samples_select <- samples %>%
dplyr::select(sample_number:nanoeuks, level) %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
23 / 45

Manipulate columns

Using the pipe operator you can chain operations

samples_select <- samples %>%
dplyr::select(sample_number:nanoeuks, level) %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks pico_pct
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 72.68293
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 90.99124
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 93.87755
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 88.84910
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 78.47286
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 70.36024
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 60.88044
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 93.80143
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 86.34206
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 57.73672
23 / 45

Manipulate columns

Creating labels with mutate and stringr functions

samples <- samples %>%
dplyr::mutate(sample_label = str_c("TR",transect,"St",station, sep="_"))
24 / 45

Manipulate columns

Creating labels with mutate and stringr functions

samples <- samples %>%
dplyr::mutate(sample_label = str_c("TR",transect,"St",station, sep="_"))
sample_number transect station date time sample_label
10 1 81 2013-11-13 1899-12-31 01:00:00 TR_1_St_81
11 1 85 2013-11-13 1899-12-31 13:30:00 TR_1_St_85
120 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
121 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
122 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
125 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
126 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
127 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
13 1 86 2013-11-13 1899-12-31 17:00:00 TR_1_St_86
140 2 101 2013-11-18 1899-12-31 12:00:00 TR_2_St_101
24 / 45

Manipulate columns

Changing type of some columns - mutate

samples <- samples %>%
dplyr::mutate(time = str_c(lubridate::hour(time),
lubridate::minute(time), sep=":"))
25 / 45

Manipulate columns

Changing type of some columns - mutate

samples <- samples %>%
dplyr::mutate(time = str_c(lubridate::hour(time),
lubridate::minute(time), sep=":"))
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
10 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5 93.87755
121 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7 88.84910
122 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9 78.47286
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7 70.36024
126 2 98 2013-11-18 5:0 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2 60.88044
127 2 98 2013-11-18 5:0 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0 93.80143
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5 57.73672
  • Use the lubridate package to manipulate dates
25 / 45

Manipulating rows

Order rows - arrange

samples <- samples %>% dplyr::arrange(transect, station)
26 / 45

Manipulating rows

Order rows - arrange

samples <- samples %>% dplyr::arrange(transect, station)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
3 0 19 2013-11-02 13:30 5 Surf -25.79 -40.36 1005 898 0.29 0.48 22.7 36.9 52.81135
5 0 21 2013-11-02 0:0 5 Surf -26.23 -40.09 793 660 0.16 0.90 22.8 36.9 54.57674
7 0 26 2013-11-03 19:30 5 Surf -27.31 -39.38 907 856 0.20 0.50 21.2 36.4 51.44640
1 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7651 4845 0.47 1.07 19.7 36.3 61.22759
2 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7343 3258 0.47 1.07 19.7 36.3 69.26705
10 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
9 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3181 1235 0.20 0.26 17.3 35.9 72.03351
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
15 1 87 2013-11-13 19:30 105 Deep -26.22 -45.48 6189 622 0.47 1.51 19.5 36.1 90.86771
  • Station 6 is not ordered numerically. It is because station is a character column.
26 / 45

Manipulating rows

Order rows - transform to numeric

samples <- samples %>%
dplyr::mutate(station = as.numeric(station)) %>%
dplyr::arrange(transect, station)
Warning in evalq(as.numeric(station), <environment>): NAs introduced by
coercion
27 / 45

Manipulating rows

Order rows - transform to numeric

samples <- samples %>%
dplyr::mutate(station = as.numeric(station)) %>%
dplyr::arrange(transect, station)
Warning in evalq(as.numeric(station), <environment>): NAs introduced by
coercion
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
1 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7651 4845 0.47 1.07 19.7 36.3 61.22759
2 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7343 3258 0.47 1.07 19.7 36.3 69.26705
3 0 19 2013-11-02 13:30 5 Surf -25.79 -40.36 1005 898 0.29 0.48 22.7 36.9 52.81135
5 0 21 2013-11-02 0:0 5 Surf -26.23 -40.09 793 660 0.16 0.90 22.8 36.9 54.57674
7 0 26 2013-11-03 19:30 5 Surf -27.31 -39.38 907 856 0.20 0.50 21.2 36.4 51.44640
10 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
9 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3181 1235 0.20 0.26 17.3 35.9 72.03351
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
15 1 87 2013-11-13 19:30 105 Deep -26.22 -45.48 6189 622 0.47 1.51 19.5 36.1 90.86771
  • One station named "Bloom" could not be converted to numerical (-> NA)
27 / 45

Manipulating rows

Summarize rows - group_by / summarize

  • Group by transect and station
  • Compute mean of the percent picoplankton
samples_mean <- samples %>%
dplyr::group_by(transect, station) %>%
dplyr::summarise(n_samples = n(),
mean_pico_percent = mean(pico_pct, na.rm=TRUE))
28 / 45

Manipulating rows

Summarize rows - group_by / summarize

  • Group by transect and station
  • Compute mean of the percent picoplankton
samples_mean <- samples %>%
dplyr::group_by(transect, station) %>%
dplyr::summarise(n_samples = n(),
mean_pico_percent = mean(pico_pct, na.rm=TRUE))
transect station n_samples mean_pico_percent
0 6 2 65.24732
0 19 1 52.81135
0 21 1 54.57674
0 26 1 51.44640
1 81 2 72.35822
1 85 1 90.99124
1 86 1 86.34206
1 87 1 90.86771
2 96 3 87.06651
2 98 3 75.01403
28 / 45

Manipulating rows

Filtering rows - filter

  • Get only the surface samples
samples_surf <- samples %>% dplyr::filter(level == "Surf")
29 / 45

Manipulating rows

Filtering rows - filter

  • Get only the surface samples
samples_surf <- samples %>% dplyr::filter(level == "Surf")
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
3 0 19 2013-11-02 13:30 5 Surf -25.79 -40.36 1005 898 0.29 0.48 22.7 36.9 52.81135
5 0 21 2013-11-02 0:0 5 Surf -26.23 -40.09 793 660 0.16 0.90 22.8 36.9 54.57674
7 0 26 2013-11-03 19:30 5 Surf -27.31 -39.38 907 856 0.20 0.50 21.2 36.4 51.44640
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5 93.87755
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7 70.36024
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5 57.73672
155 2 106 2013-11-19 2:30 5 Surf -28.12 -46.17 355 18 0.25 0.37 23.0 36.9 95.17426
165 2 114 2013-11-19 21:40 5 Surf -28.65 -44.99 728 226 0.29 0.28 22.4 36.4 76.31027
Trichod.1 2 Surf -27.80 -47.10 1002 194 83.77926
Trichod.2 2 Surf -27.80 -47.10 744 206 78.31579
  • ! Use the logical operators == != > >= < <= is.na()
29 / 45

Joining tables

30 / 45

Joining tables

Very often you have tables that contain a common field and that you need to join together.

31 / 45

Joining tables

Very often you have tables that contain a common field and that you need to join together.

A common example in oceanography. After a cruise you have many tables

Tables :

  • Stations - Station #, Longitude, Latitude
  • Cast (CTD) - Station #, Cast #, Depth, Temp, Sal (continuous)
  • Bottles - Cast #, Depth, Bottle #
  • Water samples - Bottle #, Sample #
  • Biological samples - Sample #, Analysis #

You want to know the Longitude and Latitude of a given biological sample.

31 / 45

Joining tables

Very often you have tables that contain a common field and that you need to join together.

A common example in oceanography. After a cruise you have many tables

Tables :

  • Stations - Station #, Longitude, Latitude
  • Cast (CTD) - Station #, Cast #, Depth, Temp, Sal (continuous)
  • Bottles - Cast #, Depth, Bottle #
  • Water samples - Bottle #, Sample #
  • Biological samples - Sample #, Analysis #

You want to know the Longitude and Latitude of a given biological sample.

Analysis # -> Sample # -> Bottle # -> Cast # -> Station # -> Long, Lat

In order to join 2 tables, they must have a common field. It is called the KEY.

For example it can be Bottle #, Station #

31 / 45

Joining tables

Sequence samples (metabarcoding)

  • Each sample has been split into 2 fractions by sorting : pico- and nano
  • These separate samples have then been sequenced to determine the composition of the plankton community
32 / 45

Joining tables

Sequence samples (metabarcoding)

  • Each sample has been split into 2 fractions by sorting : pico- and nano
  • These separate samples have then been sequenced to determine the composition of the plankton community

32 / 45

Joining tables

Reading table with sequence samples

sequences <- readxl::read_excel("data/CARBOM data.xlsx",
sheet = "Samples_sequencing")
sample sample_number fraction n_sequences
X10n 10 Nano 53230
X10p 10 Pico 47390
X11n 11 Nano 24007
X11p 11 Pico 31899
X120n 120 Nano 70455
33 / 45

Joining tables

  • Table sequences

    sample sample_number fraction n_sequences
    X10n 10 Nano 53230
    X10p 10 Pico 47390
    X11n 11 Nano 24007
    X11p 11 Pico 31899
    X120n 120 Nano 70455
  • Table samples

sample_number transect station date time depth level latitude longitude picoeuks nanoeuks
10 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
121 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
122 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
  • The two tables have a common field called sample_number (KEY).
34 / 45

Joining tables

Joining sequence and sample tables.

sequences_join <- left_join(sequences, samples_select)
35 / 45

Joining tables

Joining sequence and sample tables.

sequences_join <- left_join(sequences, samples_select)
sample sample_number fraction n_sequences transect station date time depth level latitude longitude picoeuks nanoeuks
X10n 10 Nano 53230 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232
X10p 10 Pico 47390 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232
X11n 11 Nano 24007 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X11p 11 Pico 31899 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X120n 120 Nano 70455 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
X120p 120 Pico 76182 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
X121n 121 Nano 52401 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
X121p 121 Pico 71785 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
X122n 122 Nano 78740 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
X122p 122 Pico 37364 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
35 / 45

Joining tables

Joining columns with different names

  • If the KEY do not have the same name in the two tables it is possible to specify the name of the two columns used for joining.
sequences <- sequences %>%
rename(sample_code = sample_number)
36 / 45

Joining tables

Joining columns with different names

  • If the KEY do not have the same name in the two tables it is possible to specify the name of the two columns used for joining.
sequences <- sequences %>%
rename(sample_code = sample_number)
sample sample_code fraction n_sequences
X10n 10 Nano 53230
X10p 10 Pico 47390
X11n 11 Nano 24007
X11p 11 Pico 31899
X120n 120 Nano 70455
X120p 120 Pico 76182
X121n 121 Nano 52401
X121p 121 Pico 71785
X122n 122 Nano 78740
X122p 122 Pico 37364
36 / 45

Joining tables

Joining columns with different names

sequences_join <- left_join(sequences, samples_select,
by= c("sample_code" = "sample_number"))
37 / 45

Joining tables

Joining columns with different names

sequences_join <- left_join(sequences, samples_select,
by= c("sample_code" = "sample_number"))
sample sample_code fraction n_sequences transect station date time depth level latitude longitude picoeuks nanoeuks
X10n 10 Nano 53230 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232
X10p 10 Pico 47390 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232
X11n 11 Nano 24007 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X11p 11 Pico 31899 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X120n 120 Nano 70455 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
X120p 120 Pico 76182 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
X121n 121 Nano 52401 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
X121p 121 Pico 71785 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
X122n 122 Nano 78740 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
X122p 122 Pico 37364 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
37 / 45

Joining tables

Joining with missing data

  • Let us remove some samples from the sample table
samples_select <- samples_select %>%
filter(sample_number != "10")
38 / 45

Joining tables

Joining with missing data

  • Let us remove some samples from the sample table
samples_select <- samples_select %>%
filter(sample_number != "10")
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
121 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 1737 218
122 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 853 234
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 3086 1300
126 2 98 2013-11-18 5:0 50 Deep -27.59 -47.39 1217 782
127 2 98 2013-11-18 5:0 85 Deep -27.59 -47.39 3420 226
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 6366 1007
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 500 366
141 2 101 2013-11-18 12:0 60 Deep -27.79 -46.96 1046 485
38 / 45

Joining tables

Joining with missing data

sequences_join <- left_join(sequences, samples_select,
by= c("sample_code" = "sample_number"))
39 / 45

Joining tables

Joining with missing data

sequences_join <- left_join(sequences, samples_select,
by= c("sample_code" = "sample_number"))
sample sample_code fraction n_sequences transect station date time depth level latitude longitude picoeuks nanoeuks
X10n 10 Nano 53230
X10p 10 Pico 47390
X11n 11 Nano 24007 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X11p 11 Pico 31899 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615
X120n 120 Nano 70455 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
X120p 120 Pico 76182 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75
  • What happened ?
39 / 45

Wide vs long tables

@allison_horst

40 / 45

Wide vs long tables

Go from wide to long - gather

  • This is very useful for statistical and plotting purposes
41 / 45

Wide vs long tables

Go from wide to long - gather

samples_long <- samples_select %>%
tidyr::gather(key="population", value="cell_ml", picoeuks, nanoeuks)
42 / 45

Wide vs long tables

Go from wide to long - gather

samples_long <- samples_select %>%
tidyr::gather(key="population", value="cell_ml", picoeuks, nanoeuks)
sample_number transect station date time depth level latitude longitude population cell_ml
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 picoeuks 16312
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 picoeuks 1150
121 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 picoeuks 1737
122 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 picoeuks 853
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 picoeuks 3086
126 2 98 2013-11-18 5:0 50 Deep -27.59 -47.39 picoeuks 1217
127 2 98 2013-11-18 5:0 85 Deep -27.59 -47.39 picoeuks 3420
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 picoeuks 6366
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 picoeuks 500
141 2 101 2013-11-18 12:0 60 Deep -27.79 -46.96 picoeuks 1046
42 / 45

Wide vs long tables

Go from long to wide - spread

43 / 45

Wide vs long tables

Go from long to wide - spread

samples_wide <- samples_long %>%
tidyr::spread(key="population", value="cell_ml")
44 / 45

Wide vs long tables

Go from long to wide - spread

samples_wide <- samples_long %>%
tidyr::spread(key="population", value="cell_ml")
sample_number transect station date time depth level latitude longitude nanoeuks picoeuks
1 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 4845 7651
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 1615 16312
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 75 1150
121 2 96 2013-11-18 23:50 30 Deep -27.39 -47.82 218 1737
122 2 96 2013-11-18 23:50 50 Deep -27.39 -47.82 234 853
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 1300 3086
126 2 98 2013-11-18 5:0 50 Deep -27.59 -47.39 782 1217
127 2 98 2013-11-18 5:0 85 Deep -27.59 -47.39 226 3420
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 1007 6366
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 366 500
44 / 45

Next time: Data visualization (ggplot2)

What you will learn :

  • Understand the "grammar" of graphics
  • Create exploratory graphics
  • Finalize graphics for publications

  • Please install the following packages and their dependencies
    • ggplot2
    • cowplot
  • Download data files (links on slack)
Reading list
45 / 45

Outline

  • Concept of tidy data
  • Reading data
  • Manipulating data
    • Columns
    • Rows
    • Joining tables
2 / 45
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow