layout: true
background-image: url(img/course-logo.png), url(img/logo_SBR.png), url(img//NTU-Logo-full-colour.png)
background-position: right top 30px, right 50px bottom 50px,left 50px bottom 50px, top 350px left 500px
background-size: 45%, 25%, 20%

# Fundamentals of Data Science for EESS

---

<br>
<br>
<br>
<br>

## R session 01 - Introduction to R

.font120[**Daniel Vaulot**]

2020-01-24

<br>
<br>
<br>

---

## R sessions

.font150[

1 - Introduction to R

2 - R markdown

3 - Git

4 - Data wrangling

5 - Data visualisation

6 - Data mapping

]

---
layout: false
class: middle, inverse

# Outline

.font150[
* What is R and why use R ?
* Resources
* Get started
* Fundamentals of R
    - Data objects
    - Vectors
    - Operators
    - Functions
    - Packages
    - Data frames
]

---
layout: true

# Introduction

---
background-image: url(img/R-logo.png)
background-position: middle center
background-size: 25%

- .font150[Who has used R before ?]

--

- .font150[What other programming languages have you used before ?]

--

- .font150[For those who are experts in R]

--

    * please refrain from answering during this session...
    * help your neighbor...

--

.font150[* Two special slide formats]

.student[Your turn...]

--

.warning[Warning]

---
exclude: false
background-image: url(img/computer-languages.png)
background-position: right 20px bottom 20px
background-size: 70%

## Computer languages

---
exclude: false
background-image: url(img/R-logo.png)

## History of R

* **Mid 1970s** - S Language for Statistical Computing conceived by John Chambers, Rick Becker, Trevor Hastie, Allan Wilks and others at Bell Labs
* **Early 1990s** - R first implemented by Robert Gentleman and Ross Ihaka, both faculty members at the University of Auckland
* **1995** - Open Source Project
* **1997** - Managed by the R Core Group
* **2000** - First release of R
* **2011** - First release of RStudio
* [Historical notes - Paper from 1998](

---

## Why use R ?

- **Script vs. Menu driven software (e.g.
Excel)**
    + Can be re-run with new data
    + Reproducible workflow

--

- **Open source**
    + Huge number of libraries
    + Tidy "universe" : tidyverse and ggplot2
    + Very easy to manipulate tables (select columns, create new variables)
    + High quality graphics

--

- **Work environment**
    - RStudio

--

- **Document your data processing**
    - R markdown
    - Create HTML, pdf, presentations

--

- **Share your data and workflow**
    - GitHub

---

## What can you do with R ?

--

.pull-left[
- **Science**
    * Statistics of course...
    * Data processing
    * Graphics
    * Time series analyses
    * Maps
    * Bioinformatics
]

--

.pull-left[
- **But also**
    * Teach
    * Do a presentation
    * Write your CV
    * Build a web site
    * Write a book
    * Much more...
]

--

.center[
<img src="img/web-site-dv.png" width="30%" style="display: block; margin: auto;" />
]

---
layout: true

# Resources

---
background-image: url(img/R_nutshell.png), url(img/R_graphics_cookbook.png)
background-position: right 20px top 50px, right 300px top 250px
background-size: 20%, 18%

## Books and Manuals

* [Applied Statistics with R]( : Quite simple introduction with emphasis on Stats
* [R intro]( : Very good introduction to R, short and clear
* [R in a nutshell]( : Many recipes to solve all your questions
* R graphics cookbook : very good for graphics

---
background-image: url(img/web_quickR.png)
background-position: right 20px top 80px
background-size: 60%

## Online courses and web sites

* [Coursera](
* [Pluralsight]( - Not free
* [Quick-R, very simple](

---
background-image: url(img/R_Studio-cheatsheets-01.png), url(img/R_Studio-cheatsheets-02.png)
background-position: right 600px top 100px, right 20px top 20px
background-size: 30%, 50%

## Cheat sheets

* [R basics](
* [ggplot2](
* [dplyr](

---
background-image: url(img/stackoverflow.png)
background-position: right 20px top 20px
background-size: 40%

## Forum

*
*
*
*

---
layout: true

# Let's get started

---
background-image: url(img/R_Studio_interface.png)
background-position: right 20px top 20px
background-size: 65%

## Setup

* Install [R](
* Install [RStudio](

---
background-image: url(img/R_Studio_interface_numbered.png)
background-position: right 20px top 20px
background-size: 65%

## The RStudio interface

.pull-left[
- **Bottom left**
    - Console
- **Top left**
    - File editor for .R and .Rmd files
    - Data frame visualization
- **Top right**
    - Environment (i.e. R objects)
    - History
- **Bottom right**
    - Files
    - Plots
    - Packages
    - Help
]

---
background-image: url(img/R-new-project.png)
background-position: right 20px top 20px
background-size: 40%

## Create a new project

* Open RStudio
* Create a new project for the course in a new directory
    - e.g. `Experimental design course`

---

## Your first script

```r
print("Hello world")
```

```
[1] "Hello world"
```

### Two ways to proceed

1. Type directly in the console

--

2. Create a new script

<img src="img/R-new-script.png" width="25%" style="display: block; margin: auto;" />

.student[Type in script window, select and execute (CTRL-R)]

---
layout: true

# The R language

---

## Everything in R is an **object**

* Assignment is done with **<-**

```r
> x <- 1
> y <- 2
> x + y
```

```
[1] 3
```

--

```r
> z <- x + y
> z
```

```
[1] 3
```

---

**=** can be used instead of **<-** but refrain from it (not good style)

```r
> z = x + y
```

--

You can view the values of the objects in the RStudio environment window (top-right)

<img src="img/R_studio-environment.png" width="55%" style="display: block; margin: auto;" />

---

## R is **case sensitive**

```r
> Z
```

--

```r
> Z
```

```
Error in eval(expr, envir, enclos): object 'Z' not found
```

---

## Rules for naming objects

* Use
    * letters
    * numbers
    * the dot
    * the underscore (not the minus sign !)
* Always start with a letter
* `Myvariable`, `Myvariable1`, `Myvariable.1`, `Myvariable_01` are OK
* `1Myvariable`, `My-variable`, `Myvariable@` are **not** OK

---

## Use consistent naming

Five conventions

* alllowercase: e.g. adjustcolor
* period.separated: e.g.
* **underscore_separated**: e.g. numeric_version
* lowerCamelCase: e.g. addTaskCallback
* UpperCamelCase: e.g. SignatureMethod

Prefer the third one, which is much easier to read

* Use **nouns** for objects : **last_name**
* Use **verbs** for functions : **build_name**
* Think about the best order
    - e.g. maybe prefer **name_last** because then you can have name_first, name_full...
    - and you can see that all these objects are related to a name...

---
layout: true

# R objects

---

## Data types

* **character**: "Daniel", "This is a course in R", 'Donald'
* **numeric**: 2, 15.5, 10e-3
* **integer**: 2L (the L tells R to store this as an integer)
* **date**: 2018-02-25
* **logical**: TRUE, FALSE
* **complex**: 1+4i (complex numbers with real and imaginary parts)

--

* **No data** "NA"
* **Not a number** "NaN" (e.g. 0/0)

---

## Data structures

* **Vector**
* **List**
* **Matrix**
* **Data frames**
* **Function**

---
layout: true

# Vectors

---

The basic R structure is a vector:

`$$\begin{bmatrix}10 \\ 20 \\ 30 \end{bmatrix}$$`

--

A vector can contain only a single element

`$$\begin{bmatrix}10 \end{bmatrix}$$`

--

## Assign a value to a vector

```r
x <- 10
x
```

```
[1] 10
```

---

## Assign several elements

```r
x <- c(10, 20, 30)
x
```

```
[1] 10 20 30
```

--

## Assign a range

```r
x <- 10:30
x
```

```
[1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
```

---

## Assign characters

```r
PoTU <- c("Donald", "Trump")
PoTU
```

```
[1] "Donald" "Trump"
```

## Assign logical

```r
flags <- c(TRUE, FALSE, TRUE)
flags
```

```
[1]  TRUE FALSE  TRUE
```

---

## Access specific elements of a vector

### First

```r
x[1]
```

```
[1] 10
```

--

### Range

```r
x[1:5]
```

```
[1] 10 11 12 13 14
```

--

### Remove one element

```r
x[-1]
```

```
[1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
```

---

## Determine object properties

Apply functions (we will come back to functions later)

* **typeof()** - what is the object’s data type (low-level)?
* **length()** - how long is it? What about two dimensional objects?

```r
typeof(x)
length(x)
```

--

```
[1] "integer"
```

```
[1] 21
```

--

.student[
What is the type and length of **PoTU** ?
]

---
layout: true

# Operators

---

## Arithmetic Operators

| Operator | Description |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ or ** | exponentiation |
| x %% y | modulus (x mod y) 5%%2 is 1 |
| x %/% y | integer division 5%/%2 is 2 |

---

## Arithmetic Operators

We are performing vector operations !

`$$\begin{bmatrix} 1\\2\\3\\..\end{bmatrix}+\begin{bmatrix}1\\2\\3\\..\end{bmatrix}=\begin{bmatrix}2\\4\\6\\..\end{bmatrix}$$`

---

## Arithmetic Operators

Vector with one element

```r
x <- 1
y <- 2
z <- x + y
z
```

```
[1] 3
```

---

## Arithmetic Operators

Vector with several elements

```r
# Two instructions on the same line
x <- 1:9; y <- 1:9
z <- x + y
z
```

```
[1]  2  4  6  8 10 12 14 16 18
```

--

.warning[
* Several instructions on the same line are separated by **;**
* The hash sign **#** indicates a comment -> Use heavily to document your code
* However, it is even better to use R markdown (see next class)
]

--

.student[
Use the other operators
]

---

## Arithmetic Operators

What happens when the vectors have different numbers of elements ?

```r
x <- 1:9
y <- 1
z <- x + y
z
```

--

```
[1]  2  3  4  5  6  7  8  9 10
```

--

Equivalent to

```r
y <- c(1, 1, 1, 1, 1, 1, 1, 1, 1)
```

The recycling rule...

---

## Can we add logicals ?

```r
x <- TRUE
y <- FALSE
z <- x + y
z
```

--

```
[1] 1
```

---

## Can we add logicals ?

No error but... The resulting variable is converted to a number (here an **integer**)

.student[
How would you show that ?
]

--

```r
typeof(x)
```

```
[1] "logical"
```

```r
typeof(z)
```

```
[1] "integer"
```

---

## Logical Operators

| Operator | Description |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | Not x |
| x \| y | x OR y |
| x & y | x AND y |
| isTRUE(x) | test if x is TRUE |

---

## Logical Operators

```r
x <- TRUE
y <- FALSE
z1 <- x | y
z2 <- x == y
z1
z2
```

--

```
[1] TRUE
```

```
[1] FALSE
```

.warning[
Do not confuse
* == which is a logical (comparison) operator
* = which is assignment
]

---

## Can we add characters ?

```r
first <- "Donald"
last <- "Trump"
full <- first + last
```

--

Generates an error

```
Error in first + last: non-numeric argument to binary operator
```

--

.student[
What can we do ?
]

---
layout: true

# Functions

---

Functions perform specific tasks on objects

* e.g. to concatenate strings we use **paste()**

--

```r
paste(first, last)
```

```
[1] "Donald Trump"
```

--

* Functions take **arguments** and return an object called **result**
* To know the arguments use ?

```r
? paste() # Do not forget the parentheses
```

--

.student[
What happened ?
]

--

* You can also go directly to the Help panel and type the function name

---
background-image: url(img/R-help-paste-01.png)
background-position: right 20px top 20px
background-size: 50%

## Help

---
background-image: url(img/R-help-paste-02.png)
background-position: right 20px top 20px
background-size: 50%

## Help

---

## Getting what you want

Let's apply paste :

```r
paste(first, last)
```

```
[1] "Donald Trump"
```

.student[
* We would like to get "Donald_Trump"
* Can you read the help and suggest a change in the way we call the function ?
]

--

```r
paste(first, last, sep = "_")
```

```
[1] "Donald_Trump"
```

---

## Write your own function

.warning[If you write the same piece of code 3 times, write a function...]
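
A minimal sketch of what this rule of thumb is about, using made-up numbers (the object names `total_1`, `total_2` and `total_3` are hypothetical and not part of the course material): the same addition is typed three times, which is the kind of repetition a function removes.

```r
# Hypothetical example: the same operation copied three times
total_1 <- 10 + 20
total_2 <- 3 + 4
total_3 <- 100 + 200

# Collect the three results in one vector
c(total_1, total_2, total_3)
```

Wrapping the addition in a function means the logic is written once and only the arguments change.
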
```r
my_sum <- function(first_number, second_number) {
    # Store the result under a name that does not mask the base function c()
    total <- first_number + second_number
    return(total)
}
```

* __my_sum__ : function name
* __first_number, second_number__ : arguments
* instructions are enclosed by braces ({})
* return() : the value(s) returned

--

#### More compact way

```r
my_sum <- function(first_number, second_number) {first_number + second_number}
```

---

## Call your function

```r
my_sum(10, 20)
```

```
[1] 30
```

--

* Better: name the arguments

```r
my_sum(first_number = 10, second_number = 20)
```

```
[1] 30
```

---

## Examples of functions

Most of the time you do not have to write functions because someone has already written one for what you want to do...

* Sum

```r
x <- 1:100
sum(x)
```

```
[1] 5050
```

--

* Normal distribution

```r
y <- rnorm(10, mean = 0, sd = 1)
y
```

```
[1]  1.6731915 -0.2498182  1.0774267  0.7086024 -0.3065056  2.2168636
[7]  0.7236404  0.2397608 -0.1206248 -1.2904100
```

---

## Statistics

```r
mean(y)
```

```
[1] 0.4672127
```

```r
sd(y)
```

```
[1] 1.033406
```

--

Sample more points... 10,000 instead of 10

```r
y <- rnorm(10000, mean = 0, sd = 1)
mean(y)
```

```
[1] -0.005369128
```

```r
sd(y)
```

```
[1] 0.9956479
```

---

## Plot

.pull-left[
Histogram

```r
library(graphics)
hist(y)
```
]

.pull-right[
<img src="R-session-01-intro_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />
]

.student[
* What is this "library()" ?
]

---
layout: true

# Packages

---

Packages are sets of functions that have a common goal

They are really the strength of R

<img src="img/R-packages-number.png" width="55%" style="display: block; margin: auto;" />

And these are only the "official" packages.
You can find more on GitHub

---

## Installing a package

Download the package you need to your computer

.center[
<img src="img/R_studio_package_01.png" width="45%" /><img src="img/R_studio_package_02.png" width="35%" />
]

.student[
Install package **stringr** (to manipulate strings of characters)
]

---

## Using a package

To use functions from the package

- use the syntax `package::function`

```r
stringr::str_c(first, last, sep = " ")
```

```
[1] "Donald Trump"
```

--

- load the package with the library function

```r
library(stringr)
str_c(first, last, sep = " ")
```

```
[1] "Donald Trump"
```

--

.warning[Sometimes functions from different packages have the same name]

---
background-image: url(img/R_studio_package_03.png)
background-position: right 20px top 20px
background-size: 50%

## List installed packages

---
layout: false

# Other objects

* List
* Matrix
* Factors
* **Data frames**

---
layout: true

# Data frames

---

## What is it ?

* Table mixing different types of columns (an Excel table...)
* However, within a column all values are of the same type, e.g. numeric, logical, character

--

```r
df <- data.frame(label = letters[1:6],
                 id = 1:6,
                 value = rnorm(6, mean = 0, sd = 1),
                 flag = c(TRUE, FALSE),  # recycling rule
                 stringsAsFactors = FALSE)
df
```

```
  label id      value  flag
1     a  1 -0.5324537  TRUE
2     b  2 -0.2858342 FALSE
3     c  3 -0.7311013  TRUE
4     d  4 -1.2440367 FALSE
5     e  5 -0.8671309  TRUE
6     f  6 -2.4274570 FALSE
```

.warning[
* We will NOT use factors: `stringsAsFactors = FALSE`
]

---

## Useful functions

```r
dim(df) # returns the dimensions of data frame
```

```
[1] 6 4
```

```r
nrow(df) # number of rows
```

```
[1] 6
```

```r
ncol(df) # number of columns
```

```
[1] 4
```

---

## Useful functions

```r
str(df) # structure of data frame - name, type and preview of data in each column
```

```
'data.frame':	6 obs. of  4 variables:
 $ label: chr  "a" "b" "c" "d" ...
 $ id   : int  1 2 3 4 5 6
 $ value: num  -0.532 -0.286 -0.731 -1.244 -0.867 ...
 $ flag : logi  TRUE FALSE TRUE FALSE TRUE FALSE
```

```r
colnames(df) # column names
```

```
[1] "label" "id"    "value" "flag"
```

---

## Access a specific value

* Use the `df[i,j]` notation, first index corresponds to row, second index to column

```r
df[5, 3]
```

```
[1] -0.8671309
```

--

* Specify the name of the column

```r
df[5, "value"]
```

```
[1] -0.8671309
```

.warning[
* The result is a **vector**
]

---

## Access a specific column

* Use the `df[i,j]` notation

```r
df[, 3]
```

```
[1] -0.5324537 -0.2858342 -0.7311013 -1.2440367 -0.8671309 -2.4274570
```

```r
df[, "value"]
```

```
[1] -0.5324537 -0.2858342 -0.7311013 -1.2440367 -0.8671309 -2.4274570
```

.warning[
* The result is a **vector**
]

---

## Access a specific column

* Use the `$` notation

```r
df$value
```

```
[1] -0.5324537 -0.2858342 -0.7311013 -1.2440367 -0.8671309 -2.4274570
```

--

## This can be used to access a specific value

* `$` for the column, `[i]` for the row

```r
df$value[5]
```

```
[1] -0.8671309
```

---

## Access a row

* Use the `df[i,j]` notation

```r
df[1, ]
```

```
  label id      value flag
1     a  1 -0.5324537 TRUE
```

.warning[
* The result is a **data frame**
]

---

## Access specific rows

* Rows for which the value of id <= 3

```r
df[df$id <= 3,]
```

```
  label id      value  flag
1     a  1 -0.5324537  TRUE
2     b  2 -0.2858342 FALSE
3     c  3 -0.7311013  TRUE
```

.student[Select rows for which the label is c]

--

```r
df[df$label == "c", ]
```

```
  label id      value flag
3     c  3 -0.7311013 TRUE
```

.warning[This syntax is complicated - tidyverse packages make it much easier to manipulate and remember]

---
layout: false
class: inverse

# Recap

.font150[
- R is case sensitive: Z != z
- Objects: data types vs data structures
- Vectors: think in vector operations
- Operators: arithmetic vs.
logical
- Functions: try to practice
- Data frames: most useful objects (Excel type)
]

---
exclude: false
layout: false
background-image: url(img/R-markdown-book.png)
background-position: right 20px top 20px
background-size: 25%

# Next class: 02 - Markdown

What you will learn :

* Mix text, R code and R output in a single document
* Produce documents as HTML, pdf or even Word from the same template

.student[
* Please install the following packages and their dependencies
    * rmarkdown (will also install knitr)
    * tinytex (LaTeX)
    * Installation :
* Have a look at the book
]

---
exclude: false
layout: false
background-image: url(img/Git-progit2.png)
background-position: right 20px top 20px
background-size: 30%

# Next class: 03 - Git

## Resources

* Pro Git:
* GitHub guide:
* Happy Git with R:

## Software

* Git:
* Git desktop:
* GitHub account: