Fundamental of Data Science for EESS

R session 01 - Introduction to R

Daniel Vaulot

2019-01-17

1 / 58

Outline

  • What is R and why use R ?
  • Resources
  • Get started
  • Fundamentals of R
    • Data objects
    • Vectors
    • Operators
    • Functions
    • Packages
    • Data frames
2 / 58

Introduction

  • Who has used R before ?

  • What other programming language have you used before ?

  • For those who are experts in R

    • please refrain from answering during this session...
    • help your neighbor...

  • Two special slide formats

Your turn...

Warning

3 / 58

Introduction

Computer languages

4 / 58

Introduction

History of R

  • Mid 1970s - S Language for Statistical Computing conceived by John Chambers, Rick Becker, Trevor Hastie, Allan Wilks and others at Bell Labs

  • Early 1990s - R first implemented by Robert Gentleman and Ross Ihaka, both faculty members at the University of Auckland

  • 1995 - Open Source Project

  • 1997 - Managed by the R Core Group

  • 2000 - First stable release of R (version 1.0.0)

  • 2011 - First release of RStudio

  • Historical notes - Paper from 1998

5 / 58

Introduction

Why use R ?

  • Script vs. menu-driven software (e.g. Excel)
    • Can be rerun with new data
    • Reproducible workflow
  • Open source
    • Huge number of libraries
    • Tidy "universe" : tidyverse and ggplot2
      • Very easy to manipulate tables (select columns, create new variables)
      • High quality graphics
  • Work environment
    • RStudio
  • Document your data processing
    • R Markdown
    • Create HTML, PDF, presentations
  • Share your data and workflow
    • GitHub
6 / 58

Introduction

What can you do with R ?

  • Science
    • Statistics of course...
    • Data processing
    • Graphics
    • Time series analyses
    • Maps
    • Bioinformatics
  • But also
    • Teach
    • Do a presentation
    • Write your CV
    • Build a web site
    • Write a book
    • Much more...

7 / 58

Resources

Books and Manuals

  • R intro : Very good introduction to R, short and clear
  • R in a Nutshell : Many recipes to answer your questions
  • R Graphics Cookbook : very good for graphics
8 / 58

Resources

Online courses and websites

9 / 58

Resources

Cheat sheets

10 / 58

Let's get started

Setup

12 / 58

Let's get started

The RStudio interface

  • Bottom left
    • Console
  • Top left
    • File editor
  • Top right
    • Environment (i.e. R objects)
    • History
  • Bottom right
    • Files
    • Plots
    • Packages
    • Help

13 / 58

Let's get started

Create a new project

  • Open RStudio
  • Create a new project for the course in a new directory
    • e.g. Data Science Class
14 / 58

Let's get started

Your first script

print("Hello world")
[1] "Hello world"

Two ways to proceed

  1. Type directly in the console

  2. Create a new script

Type in the script window, select the lines and execute them (Ctrl+Enter in RStudio, or CTRL-R in the Windows R GUI)

15 / 58

The R language

Everything in R is an object

  • Assignment is done with <-
> x <- 1
> y <- 2
> x + y
[1] 3
> z <- x + y
> z
[1] 3
16 / 58

The R language

= can be used instead of <-, but avoid it (it is not considered good style)

> z = x + y

You can view the values of the objects in the RStudio Environment pane (top right)

17 / 58

The R language

R is case sensitive

> Z
Error in eval(expr, envir, enclos): object 'Z' not found
18 / 58

The R language

Rules for naming objects

  • Use
    • letters
    • numbers
    • the dot
    • the underscore (not the minus sign !)
  • Always start with a letter
    • Myvariable, Myvariable1, Myvariable.1, Myvariable_01 are OK
    • 1Myvariable, My-variable, Myvariable@ are not OK
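
One way to check these rules yourself is to compare each candidate name with what base R's make.names() turns it into: a name that comes back unchanged is syntactically valid. This is only a sketch. The candidate names are illustrative, and make.names() also alters reserved words such as if.

```r
# Candidate object names (illustrative)
candidates <- c("Myvariable", "Myvariable.1", "Myvariable_01",
                "1Myvariable", "My-variable", "Myvariable@")

# make.names() rewrites anything that is not a syntactically valid name
fixed <- make.names(candidates)

# A name is valid if make.names() leaves it unchanged
is_valid <- candidates == fixed
data.frame(candidates, fixed, is_valid)
```

The first three names are reported as valid, the last three are not.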
19 / 58

The R language

Use consistent naming

Five conventions

  • alllowercase: e.g. adjustcolor
  • period.separated: e.g. plot.new
  • underscore_separated: e.g. numeric_version
  • lowerCamelCase: e.g. addTaskCallback
  • UpperCamelCase: e.g. SignatureMethod

Prefer the third one (underscore_separated), which is much easier to read

  • Use nouns for objects : last_name
  • Use verbs for functions : build_name
  • Think about the best order
    • e.g. you may prefer name_last because then you can have name_first, name_full...
    • and you can see that all these objects are related to a name...
20 / 58

R objects

Data types

  • character: "Daniel", "This is a course in R", 'Donald'

  • numeric: 2, 15.5, 10e-3

  • integer: 2L (the L tells R to store this as an integer)

  • date: as.Date("2018-02-25") (stored as a number with the class Date)

  • logical: TRUE, FALSE

  • complex: 1+4i (complex numbers with real and imaginary parts)

  • Missing value: NA

  • Not a number: NaN (e.g. 0/0; note that 1/0 gives Inf)
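
A quick way to explore these types is to ask R directly with typeof() and class(). The values below are the ones listed on this slide; the variable name d is only for illustration.

```r
typeof("Daniel")   # "character"
typeof(15.5)       # "double" (the storage type of numeric values)
typeof(2L)         # "integer"
typeof(TRUE)       # "logical"
typeof(1+4i)       # "complex"

# A date is not a basic type: it is a number carrying the class "Date"
d <- as.Date("2018-02-25")
class(d)           # "Date"
typeof(d)          # "double"

# Missing and undefined values
is.na(NA)          # TRUE
0 / 0              # NaN
1 / 0              # Inf
```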

21 / 58

R objects

Data structures

  • Vector

  • List

  • Matrix

  • Data frames

  • Function

22 / 58

Vectors

The basic R structure is a vector: [10 20 30]

A vector can have a single element only: [10]

Assign a value to a vector

x <- 10
x
[1] 10
23 / 58

Vectors

Assign several elements

x <- c(10, 20, 30)
x
[1] 10 20 30

Assign a range

x <- 10:30
x
[1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
24 / 58

Vectors

Assign characters

PoTU <- c("Donald", "Trump")
PoTU
[1] "Donald" "Trump"

Assign logical

flags <- c(TRUE, FALSE, TRUE)
flags
[1] TRUE FALSE TRUE
25 / 58

Vectors

Access specific elements of a vector

First

x[1]
[1] 10

Range

x[1:5]
[1] 10 11 12 13 14

Remove one element

x[-1]
[1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
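
A few more indexing patterns, as a sketch using the same vector x <- 10:30. Nothing here changes x itself: each expression returns a new vector.

```r
x <- 10:30

x[c(1, 3, 5)]     # several positions at once: 10 12 14
x[-(1:18)]        # drop the first 18 elements: 28 29 30
x[x > 27]         # keep the elements that pass a test: 28 29 30
x[length(x)]      # last element: 30

# Indexing returns a new vector, x itself is unchanged
length(x)         # 21
```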
26 / 58

Vectors

Determine object properties

Apply functions (we will come back to functions later)

  • typeof() - what is the object’s data type (low-level)?
  • length() - how long is it? What about two dimensional objects?
typeof(x)
[1] "integer"
length(x)
[1] 21

What is the type and length of PoTU ?

27 / 58

Operators

Arithmetic Operators

Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y): 5 %% 2 is 1
x %/% y     integer division: 5 %/% 2 is 2
28 / 58

Operators

Arithmetic Operators

We are performing vector operations !

[1 2 3 ...] + [1 2 3 ...] = [2 4 6 ...]

29 / 58

Operators

Arithmetic Operators

Vector with one element

x <- 1
y <- 2
z <- x + y
z
[1] 3
30 / 58

Operators

Arithmetic Operators

Vectors with several elements

# Two instructions on the same line
x <- 1:9; y <- 1:9
z <- x + y
z
[1] 2 4 6 8 10 12 14 16 18
  • Several instructions on the same line are separated by ;
  • The hash sign # indicates a comment -> use comments heavily to document your code

Try the other operators
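
One possible answer, as a sketch with the same x and y. Every operator works element by element.

```r
x <- 1:9; y <- 1:9

x - y    # 0 0 0 0 0 0 0 0 0
x * y    # 1 4 9 16 25 36 49 64 81
x / y    # 1 1 1 1 1 1 1 1 1
x ^ 2    # 1 4 9 16 25 36 49 64 81
x %% 2   # 1 0 1 0 1 0 1 0 1
x %/% 2  # 0 1 1 2 2 3 3 4 4
```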

31 / 58

Operators

Arithmetic Operators

What happens when the vectors have different numbers of elements ?

x <- 1:9
y <- 1
z <- x + y
z
[1] 2 3 4 5 6 7 8 9 10

Equivalent to

y <- c(1, 1, 1, 1, 1, 1, 1, 1, 1)

The recycling rule...
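
Recycling also applies when the shorter vector has more than one element. The sketch below uses illustrative values; suppressWarnings() is there only to keep the output clean in the last case.

```r
x <- 1:6
y <- c(10, 20)

# y is recycled three times: 10 20 10 20 10 20
x + y                      # 11 22 13 24 15 26

# Same result with the recycling written out by hand
x + rep(y, times = 3)      # 11 22 13 24 15 26

# If the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning
z <- suppressWarnings(1:5 + c(10, 20))
z                          # 11 22 13 24 15
```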

32 / 58

Operators

Can we add logical ?

x <- TRUE
y <- FALSE
z <- x + y
z
[1] 1
33 / 58

Operators

Can we add logical ?

It does not give an error but...

The resulting variable is converted to a number (here an integer)

How would you show that ?

typeof(x)
[1] "logical"
typeof(z)
[1] "integer"
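
This conversion is useful in practice: because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE values and mean() gives their proportion. A small sketch with illustrative values:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)    # number of TRUE values: 3
mean(flags)   # proportion of TRUE values: 0.75

x <- 1:9
sum(x > 5)    # how many elements are greater than 5: 4
```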
34 / 58

Operators

Logical Operators

Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          NOT x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
35 / 58

Operators

Logical Operators

x <- TRUE
y <- FALSE
z1 <- x | y
z2 <- x == y
z1
[1] TRUE
z2
[1] FALSE

Do not confuse

  • == which is a comparison operator
  • = which is an assignment
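
Like the arithmetic operators, comparisons and logical operators work element by element on vectors. A minimal sketch with an illustrative vector:

```r
x <- 1:9

x > 3                 # FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
x > 3 & x <= 6        # TRUE only for 4, 5 and 6
x < 2 | x > 8         # TRUE only for 1 and 9
!(x == 5)             # same as x != 5

# isTRUE() expects a single value, so it is FALSE for a longer vector
isTRUE(x > 3)         # FALSE
isTRUE(all(x > 0))    # TRUE
```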
36 / 58

Operators

Can we add characters ?

first <- "Donald"
last <- "Trump"
full <- first + last

Generates an error

Error in first + last: non-numeric argument to binary operator

What can we do ?

37 / 58

Functions

Functions perform specific tasks on objects

  • e.g. to concatenate strings we use paste0()
paste0(first, last)
[1] "DonaldTrump"
  • Functions take arguments and return an object called result

  • To know the arguments use ?

? paste0() # Do not forget the parenthesis

What happened ?

  • You can also go directly to the Help panel and type the function name
38 / 58

Functions

Help

39 / 58

Functions

Help

40 / 58

Functions

Getting what you want

We would like to write "Donald Trump" but we have :

paste0(first, last)
[1] "DonaldTrump"

Can you read the help and suggest a change in the way we call the function ?

paste(first, last)
[1] "Donald Trump"
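
The help page shows that paste() has a sep argument (a space by default) and a collapse argument. A short sketch of what they do, using the same two strings:

```r
first <- "Donald"
last <- "Trump"

paste(first, last)                 # "Donald Trump" (default sep = " ")
paste(first, last, sep = "_")      # "Donald_Trump"
paste0(first, " ", last)           # "Donald Trump"

# collapse joins the elements of one vector into a single string
paste(c(first, last), collapse = " ")   # "Donald Trump"
```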
41 / 58

Functions

Write your own function

my_sum <- function(a, b) {
  # Use a name other than c, which is also the function that builds vectors
  total <- a + b
  return(total)
}
my_sum(10, 20)
[1] 30

If you write the same piece of code three times, write a function...
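
Arguments can have default values. The sketch below defines a hypothetical helper, build_name() (the name follows the naming convention suggested earlier, it is not a built-in function), whose sep argument defaults to a space.

```r
# Hypothetical helper: joins a first and a last name
build_name <- function(first, last, sep = " ") {
  # The value of the last expression is returned
  paste(first, last, sep = sep)
}

build_name("Donald", "Trump")              # "Donald Trump"
build_name("Donald", "Trump", sep = "_")   # "Donald_Trump"

# The function works on vectors because paste() does
build_name(c("Robert", "Ross"), c("Gentleman", "Ihaka"))
# "Robert Gentleman" "Ross Ihaka"
```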

End of lecture one

42 / 58

Functions

Examples of functions

Most of the time you do not have to write functions because someone has already written one for what you want to do...

  • Sum
x <- 1:100
sum(x)
[1] 5050
  • Normal distribution
y <- rnorm(100, mean = 0, sd = 1)
y[1:10]
[1] 0.4885882 -0.6260146 -0.8855401 -1.2341267 0.3726551 0.8956950
[7] 0.9124247 0.1755346 0.4628793 -1.5012981
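
rnorm() draws random numbers, so your values will differ from the ones above. To get the same draws every time, set the random seed first. A minimal sketch; the seed value 42 is an arbitrary choice.

```r
set.seed(42)            # 42 is an arbitrary choice
y1 <- rnorm(5, mean = 0, sd = 1)

set.seed(42)            # same seed again
y2 <- rnorm(5, mean = 0, sd = 1)

identical(y1, y2)       # TRUE: same seed, same draws
```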
43 / 58

Functions

Statistics

mean(y)
[1] 0.01007783
sd(y)
[1] 0.8875528
  • Is the mean close to the expected mean ?
  • What can be done ?
44 / 58

Functions

Sample more points... 10,000 instead of 100

y <- rnorm(10000, mean = 0, sd = 1)
mean(y)
[1] 0.01251073
sd(y)
[1] 1.009886
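
Why do more points help ? The mean of n independent draws varies around the true mean with a standard deviation of sd / sqrt(n). The sketch below compares this with repeated sampling; the seed and the 200 repetitions are arbitrary choices.

```r
# Theoretical spread of the sample mean for sd = 1
n <- c(100, 10000)
1 / sqrt(n)       # 0.10 0.01

# Empirical check: spread of the means of 200 repeated samples
set.seed(1)
means_small <- replicate(200, mean(rnorm(100)))
means_large <- replicate(200, mean(rnorm(10000)))
sd(means_small)   # close to 0.1
sd(means_large)   # close to 0.01
```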
45 / 58

Functions

Plot

Histogram

library(graphics)
hist(y)

  • What is this library() ?
46 / 58

Packages

Packages are sets of functions that share a common goal

They are really the strength of R

And these are only the "official" packages. You can find more on GitHub

47 / 58

Packages

Installing a package

Download the package you need onto your computer

Install package stringr (to manipulate strings of characters)
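
From the console this can be done with install.packages(). The sketch below installs stringr only if it is missing and only in an interactive session; is_installed() is a hypothetical helper written for this example.

```r
# Hypothetical helper: TRUE if the package can be loaded
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

# Download and install from CRAN only when needed
if (interactive() && !is_installed("stringr")) {
  install.packages("stringr")
}
```

A package is installed once per computer, but it must be loaded in every new R session.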

48 / 58

Packages

Using a package

To use functions from the package

  • use the syntax package::function
stringr::str_c(first, last, sep = " ")
[1] "Donald Trump"
  • load the package with the library function
library(stringr)
str_c(first, last, sep = " ")
[1] "Donald Trump"

Sometimes functions from different packages have the same name: the package::function syntax removes the ambiguity

49 / 58

Packages

List installed packages
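
Besides the Packages panel, the list can be obtained from the console. A sketch using installed.packages() from the utils package; the variable name pkgs is only for illustration.

```r
# installed.packages() returns a matrix with one row per package
pkgs <- rownames(installed.packages())

head(pkgs)          # first few package names
length(pkgs)        # number of installed packages
"stats" %in% pkgs   # TRUE: base packages are listed too
```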

50 / 58

Other objects

  • List

  • Matrix

  • Factors

  • Data frames

51 / 58

Data frames

What is it ?

  • Table mixing different types of columns (an Excel table...)
  • However within a column all values have the same type, e.g. numeric, logical, character
df <- data.frame(label = letters[1:6], id = 1:6,
                 value = rnorm(6, mean = 0, sd = 1),
                 flag = c(TRUE, FALSE), stringsAsFactors = FALSE)
df
  label id      value  flag
1     a  1  0.8002749  TRUE
2     b  2 -0.1723698 FALSE
3     c  3  1.0188527  TRUE
4     d  4 -1.4748408 FALSE
5     e  5  0.5381787  TRUE
6     f  6  0.8350807 FALSE
  • We will not cover factors at this time, which is why we set stringsAsFactors = FALSE
  • Did you notice the recycling rule ? flag has only 2 values, repeated to fill the 6 rows
52 / 58

Data frames

Useful functions

dim(df) # returns the dimensions of data frame
[1] 6 4
nrow(df) # number of rows
[1] 6
ncol(df) # number of columns
[1] 4
53 / 58

Data frames

Useful functions

str(df) # structure of data frame - name, type and preview of data in each column
'data.frame': 6 obs. of 4 variables:
$ label: chr "a" "b" "c" "d" ...
$ id : int 1 2 3 4 5 6
$ value: num 0.8 -0.172 1.019 -1.475 0.538 ...
$ flag : logi TRUE FALSE TRUE FALSE TRUE FALSE
colnames(df) # column names
[1] "label" "id" "value" "flag"
54 / 58

Data frames

Access specific column

  • Use the $ notation
df$value
[1] 0.8002749 -0.1723698 1.0188527 -1.4748408 0.5381787 0.8350807
  • Use the df[i,j] notation
df[, 3]
[1] 0.8002749 -0.1723698 1.0188527 -1.4748408 0.5381787 0.8350807
df[, "value"]
[1] 0.8002749 -0.1723698 1.0188527 -1.4748408 0.5381787 0.8350807
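
Two related notations, as a sketch. The data frame is rebuilt here with rounded values copied from the slide so that the example is reproducible. Selecting one column returns a vector unless drop = FALSE is given.

```r
# Same structure as df, with fixed (rounded) values instead of rnorm()
df <- data.frame(label = letters[1:6], id = 1:6,
                 value = c(0.8, -0.17, 1.02, -1.47, 0.54, 0.84),
                 flag = c(TRUE, FALSE), stringsAsFactors = FALSE)

df[["value"]]                        # same vector as df$value
df[, c("label", "value")]            # several columns: a data frame
df[, "value", drop = FALSE]          # one column, kept as a data frame

class(df[, "value"])                 # "numeric"
class(df[, "value", drop = FALSE])   # "data.frame"
```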
55 / 58

Data frames

Access specific value

  • Use the $ notation
df$label[5]
[1] "e"
df$label[1:5]
[1] "a" "b" "c" "d" "e"
  • Use the df[i,j] notation
df[5, 1]
[1] "e"
df[1:5, "value"]
[1] 0.8002749 -0.1723698 1.0188527 -1.4748408 0.5381787
56 / 58

Data frames

Filter the data

df[df$id <= 3, ]
  label id      value  flag
1     a  1  0.8002749  TRUE
2     b  2 -0.1723698 FALSE
3     c  3  1.0188527  TRUE

Select rows for which the label is c

df[df$label == "c", ]
  label id    value flag
3     c  3 1.018853 TRUE
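
Conditions can be combined with the logical operators seen earlier, and a character column must be compared with a quoted string. A sketch, with the data frame rebuilt from rounded slide values so that the result is reproducible:

```r
# Same structure as df, with fixed (rounded) values instead of rnorm()
df <- data.frame(label = letters[1:6], id = 1:6,
                 value = c(0.8, -0.17, 1.02, -1.47, 0.54, 0.84),
                 flag = c(TRUE, FALSE), stringsAsFactors = FALSE)

df[df$value > 0 & df$flag, ]        # rows a, c, e
df[df$label %in% c("a", "f"), ]     # rows a, f
df[df$label == "c", "value"]        # 1.02

# subset() is a more readable alternative for interactive use
subset(df, id <= 3)                 # same rows as df[df$id <= 3, ]
```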
57 / 58

Next time: Markdown

What you will learn :

  • Mix text, R code and R output in a single document
  • Produce documents as HTML, PDF or even Word from the same template

58 / 58
