We will start the class with some tools that will be useful for success throughout the semester. Going forward, we will use these tools for any computing and data analysis tasks.




Adapted from R for Data Science, Wickham & Grolemund (2017)




In STAT400, we will cover each part of the data analysis pipeline using

  1. Tools like R and RStudio

  2. Packages in R

  3. Computational ideas in Statistics (implemented in R)

1 R

R (https://www.r-project.org) is a free, open source software environment for statistical computing and graphics that is available for every major platform.

RStudio (https://rstudio.com) is an integrated development environment (IDE) for R. It is also free, open source, and available for every major platform. It makes data analysis and projects in R go a bit smoother.

1.1 Getting Started

We can use R like an overgrown calculator.

# simple math
5*(10 - 4) + 44
## [1] 74
# integer division
7 %/% 2
## [1] 3
# modulo operator (Remainder)
7 %% 2
## [1] 1
# powers
1.5^3
## [1] 3.375


We can use mathematical functions.

# exponentiation
exp(1)
## [1] 2.718282
# logarithms
log(100)
## [1] 4.60517
log(100, base = 10)
## [1] 2
# trigonometric functions
sin(pi/2)
## [1] 1
cos(pi)
## [1] -1
asin(1)
## [1] 1.570796



We can create variables using the assignment operator <-,

# create some variables
x <- 5
class <- 400
hello <- "world"

and then use those variables in our functions.

# functions of variables
log(x)
## [1] 1.609438
class^2
## [1] 160000

There are some rules for variable naming.

Variable names –

  1. Can’t start with a number.

  2. Are case-sensitive.

  3. Can be the name of a predefined function or constant in R (e.g., c, q, t, C, D, F, T, I). Try not to use these.

  4. Cannot be reserved words in R (e.g., for, in, while, if, else, repeat, break, next).
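Rule 2 is worth a quick illustration: because names are case-sensitive, value and Value below are two entirely separate variables.

```r
# case matters: these are two different variables
value <- 10
Value <- 20
value + Value
## [1] 30
```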

1.2 Vectors

A variable can store more than one value; such a collection of values is called a vector. We can create vectors using the combine (c()) function.

# store a vector
y <- c(1, 2, 6, 10, 17)

When we perform functions on our vector, the result is elementwise.

# elementwise function
y/2
## [1] 0.5 1.0 3.0 5.0 8.5

A vector must contain values that are all of the same type (e.g., numeric, integer, character).

We can also make sequences of numbers using either : or seq().

# sequences
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(1, 5, by = 1)
b
## [1] 1 2 3 4 5

Your Turn

  1. Use the rep() function to construct the following vector: 1 1 2 2 3 3 4 4 5 5

  2. Use rep() to construct this vector: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
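rep() has not appeared in the notes yet; its two main arguments, times and each, behave as follows (demonstrated on a different vector so the exercises above are left to you):

```r
# rep() repeats a vector
rep(c("a", "b"), times = 3)
## [1] "a" "b" "a" "b" "a" "b"
rep(c("a", "b"), each = 3)
## [1] "a" "a" "a" "b" "b" "b"
```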

We can extract values by index.

a[3]
## [1] 3

Indexing is pretty powerful.

# indexing multiple items
a[c(1, 3, 5)]
## [1] 1 3 5
a[1:3]
## [1] 1 2 3

We can even tell R which elements we don’t want.

a[-3]
## [1] 1 2 4 5

And we can index by logical values. R has logicals built in using TRUE and FALSE (T and F also work, but can be overwritten). Logicals can result from a comparison using

  • < : “less than”
  • > : “greater than”
  • <= : “less than or equal to”
  • >= : “greater than or equal to”
  • == : “is equal to”
  • != : “not equal to”
# indexing by vectors of logicals
a[c(TRUE, TRUE, FALSE, FALSE, FALSE)]
## [1] 1 2
# indexing by calculated logicals
a < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE
a[a < 3]
## [1] 1 2

Your Turn

  1. Create a vector of 1300 values evenly spaced between 1 and 100.

  2. How many of these values are greater than 91? (Hint: see sum() as a helpful function.)
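A hint for exercise 1: besides by, seq() accepts a length.out argument that fixes the number of points directly (shown here on a smaller range):

```r
# five evenly spaced values between 0 and 1
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
```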

We can combine elementwise logical vectors in the following way:

  • & : elementwise AND
  • | : elementwise OR
c(TRUE, TRUE, FALSE) | c(FALSE, TRUE, FALSE)
## [1]  TRUE  TRUE FALSE
c(TRUE, TRUE, FALSE) & c(FALSE, TRUE, FALSE)
## [1] FALSE  TRUE FALSE

There are two more useful functions for looking at the start (head) and end (tail) of a vector.

head(a, 2)
## [1] 1 2
tail(a, 2)
## [1] 4 5

We can also modify elements in a vector.

a[1] <- 0
a[c(4, 5)] <- 100
a
## [1]   0   2   3 100 100

Your Turn

Using the vector you created of 1300 values evenly spaced between 1 and 100,

  1. Modify the elements greater than 90 to equal 9999.

  2. View (not modify) the first 10 values in your vector.

  3. View (not modify) the last 10 values in your vector.

As mentioned, elements of a vector must all be the same type. So, changing an element of a vector to a different type will result in all elements being converted to the most general type.

a
## [1]   0   2   3 100 100
a[1] <- ":-("
a
## [1] ":-(" "2"   "3"   "100" "100"

By changing a value to a string, all the other values were also changed.

There are many data types in R; numeric, integer, character (i.e., string), Date, and factor are the most common. We can convert between different types using the as series of functions.

as.character(b)
## [1] "1" "2" "3" "4" "5"

There are a whole variety of useful functions to operate on vectors. A couple of the more common ones are length, which returns the length (number of elements) of a vector, and sum, which adds up all the elements of a vector.

n <- length(b)
n
## [1] 5
sum_b <- sum(b)
sum_b
## [1] 15

We can then create some statistics!

mean_b <- sum_b/n
sd_b <- sqrt(sum((b - mean_b)^2)/(n - 1))

But, we don’t have to.

mean(b)
## [1] 3
sd(b)
## [1] 1.581139
summary(b)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5
quantile(b, c(.25, .75))
## 25% 75% 
##   2   4

1.3 Data Frames

Data frames are the data structure you will (probably) use the most in R. You can think of a data frame as any sort of rectangular data. It is easy to conceptualize as a table, where each column is a vector. Recall, each vector must have the same data type within the vector (column), but columns in a data frame need not be of the same type. Let’s look at an example!

# look at top 6 rows
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# structure of the object
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

This is Anderson’s Iris data set (https://en.wikipedia.org/wiki/Iris_flower_data_set), available by default in R.

Some facts about data frames:

  • Structured by rows and columns and can be indexed
  • Each column is a variable of one type
  • Column names or locations can be used to index a variable
  • Advice for naming variables applies to naming columns
  • Can be specified by grouping vectors of equal length as columns

Data frames are indexed (similarly to vectors) with [ ].

  • df[i, j] will select the element of the data frame in the ith row and the jth column.
  • df[i, ] will select the entire ith row as a data frame
  • df[ , j] will select the entire jth column as a vector

We can use logicals or vectors to index as well.

iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
iris[, 1]
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
iris[1, 1]
## [1] 5.1

We can also select columns by name in two ways.

iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
iris[, "Species"]
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

To add a column, create a new vector that is the same length as the existing columns. We can append the new column to the data frame using the $ operator or the [ ] operator.

# make a copy of iris
my_iris <- iris

# add a column
my_iris$sepal_len_square <- my_iris$Sepal.Length^2  
head(my_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_square
## 1          5.1         3.5          1.4         0.2  setosa            26.01
## 2          4.9         3.0          1.4         0.2  setosa            24.01
## 3          4.7         3.2          1.3         0.2  setosa            22.09
## 4          4.6         3.1          1.5         0.2  setosa            21.16
## 5          5.0         3.6          1.4         0.2  setosa            25.00
## 6          5.4         3.9          1.7         0.4  setosa            29.16

It’s quite easy to subset a data frame.

my_iris[my_iris$sepal_len_square < 20, ]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_square
## 9           4.4         2.9          1.4         0.2  setosa            19.36
## 14          4.3         3.0          1.1         0.1  setosa            18.49
## 39          4.4         3.0          1.3         0.2  setosa            19.36
## 43          4.4         3.2          1.3         0.2  setosa            19.36

We’ll see another way to do this in Section 3.

We can create new data frames using the data.frame() function,

df <- data.frame(NUMS = 1:5, 
                 lets = letters[1:5],
                 cols = c("green", "gold", "gold", "gold", "green"))

and we can change column names using the names() function.

names(df)
## [1] "NUMS" "lets" "cols"
names(df)[1] <- "nums"

df
##   nums lets  cols
## 1    1    a green
## 2    2    b  gold
## 3    3    c  gold
## 4    4    d  gold
## 5    5    e green

Your Turn

  1. Make a data frame with column 1: 1,2,3,4,5,6 and column 2: a,b,a,b,a,b

  2. Select only rows with value “a” in column 2 using logical vector

  3. mtcars is a built-in data set like iris: Extract the 4th row of the mtcars data.

There are other data structures available to you in R, namely lists and matrices. We will not cover these in the notes, but I encourage you to read more about them (https://faculty.nps.edu/sebuttre/home/R/lists.html and https://faculty.nps.edu/sebuttre/home/R/matrices.html).

1.4 Basic Programming

We will cover three basic programming ideas: functions, conditionals, and loops.

1.4.1 Functions

We have already used many functions that are built into R. For example – exp(), log(), sin(), rep(), seq(), head(), tail(), etc.

But what if we want to use a function that doesn’t exist?

We can write it!

Idea: We want to avoid repetitive coding because errors will creep in. Solution: Extract the common core of the code, wrap it in a function, and make it reusable.

The basic structure for writing a function is as follows:

  • Name
  • Input arguments (including names and default values)
  • Body (code)
  • Output values
# we store a function in a named value
# function is itself a function to create functions!
# we specify the inputs that we can use inside the function
# we can specify default values, but it is not necessary
name <- function(input = FALSE) {
  # body code goes here
  
  # return output values
  return(input)
}

Here is a more realistic first example:

my_mean <- function(x) {
  sum(x)/length(x)
}

Let’s test it out.

my_mean(1:15)
## [1] 8
my_mean(c(1:15, NA))
## [1] NA

Some advice for function writing:

  1. Start simple, then extend.
  2. Test out each step of the way.
  3. Don’t try too much at once.

1.4.2 Conditionals

Conditionals control the flow of a program. A conditional checks whether a specified condition is met (or not), then directs subsequent analysis or action depending on the result.

if(condition) {
  # Some code that runs if condition is TRUE
} else {
  # Some code that runs if condition is FALSE
}
  • condition is a length one logical value, i.e. either TRUE or FALSE
  • We can use & and | to combine several conditions
  • ! negates condition

For example, if we wanted to do something with na.rm from our function,

if(na.rm) x <- na.omit(x) # na.omit is a function that removes NA values

might be a good option.
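As a complete toy example, here is an if/else that branches on the sign of a number:

```r
x <- -3
if (x < 0) {
  print("negative")
} else {
  print("non-negative")
}
## [1] "negative"
```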

1.4.3 Loops

Loops (and their cousins, the apply() family of functions) are useful when we want to repeat the same block of code many times. Reducing the amount of typing we do can be nice, and if we have a lot of code that is essentially the same we can take advantage of looping. R offers several loops: for, while, repeat.

For loops will run through a specified index and perform a set of code for each value of the indexing variable.

for(i in index_values) {
  # block of code
  # can print values also
  # code in here will most likely depend on i
}
for(i in 1:3) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
for(species in unique(iris$Species)) {
  subset_iris <- iris[iris$Species == species,]
  avg <- mean(subset_iris$Sepal.Length)
  print(paste(species, avg))
}
## [1] "setosa 5.006"
## [1] "versicolor 5.936"
## [1] "virginica 6.588"

While loops will run until a specified condition is no longer true.

condition <- TRUE
while(condition) {
  # do stuff
  # don't forget to eventually set the condition to false
  # in the toy example below I check if the current seconds is divisible by 5
  time <- Sys.time()
  if(as.numeric(format(time, format = "%S")) %% 5 == 0) condition <- FALSE
}
print(time)
## [1] "2022-10-25 09:05:00 MDT"
# we can also use while loops to iterate
i <- 1
while (i <= 5) {
    print(i)
    i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Your Turn

  1. Alter your my_mean() function to take a second argument (na.rm) with default value FALSE that removes NA values if TRUE.

  2. Add checks to your function to make sure the input data is either numeric or logical. If it is logical convert it to numeric (Hint: look at the stopifnot() function).

  3. The diamonds data set is included in the ggplot2 package (not by default in R). It can be read into your environment with the following function.

    data("diamonds", package = "ggplot2")

    Loop over the columns of the diamonds data set and apply your mean function to all of the numeric columns (Hint: look at the class() function).

1.5 Packages

Commonly used R functions are installed with base R.

R packages containing more specialized R functions can be installed freely from CRAN servers using the function install.packages().

After packages are installed, their functions can be loaded into the current R session using the function library().

Packages are contributed by R users just like you!

We will use some great packages in this class. Feel free to venture out and find your favorites (google R package + what you’re trying to do to find more packages).

1.6 Additional resources

You can get help with R functions within R by using the help() function, or typing ? before a function name.

Stackoverflow can be helpful – if you have a question, maybe somebody else has already asked it (https://stackoverflow.com/questions/tagged/r).

R Reference Card (https://cran.r-project.org/doc/contrib/Short-refcard.pdf)

Useful Cheatsheets (https://www.rstudio.com/resources/cheatsheets/)

R for Data Science (https://r4ds.had.co.nz)

Advanced R (https://adv-r.hadley.nz)

2 ggplot2

We will be using the ggplot2 package for making graphics in this class.

The first time on your machine you’ll need to install the package:

install.packages("ggplot2")

Whenever you first want to plot during an R session, we need to load the library.

library(ggplot2)

2.1 Why visualize?

The sole purpose of visualization is communication. Visualization offers an alternative to communicating numbers through tables alone. Often, we can get more information out of our numbers graphically than with numerical summaries alone. Through the use of exploratory data analysis, we can see what the data can tell us beyond the formal modeling or hypothesis testing task.

For example, let’s look at the following dataset.

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Anscombe’s Quartet comprises 4 datasets that have nearly identical simple statistical properties. Each dataset contains 11 (x, y) points with the same means, standard deviations, and correlation coefficient between x and y.

dataset  mean_x      sd_x    mean_y      sd_y        cor
      1       9  3.316625  7.500909  2.031568  0.8164205
      2       9  3.316625  7.500909  2.031657  0.8162365
      3       9  3.316625  7.500000  2.030424  0.8162867
      4       9  3.316625  7.500909  2.030578  0.8165214
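These summaries are easy to verify in R, since anscombe ships with base R; here is the first dataset (the other three are analogous):

```r
# summary statistics for Anscombe dataset 1
stats1 <- with(anscombe, c(mean_x = mean(x1), sd_x = sd(x1),
                           mean_y = mean(y1), sd_y = sd(y1),
                           cor = cor(x1, y1)))
stats1
```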

But this doesn’t tell the whole story. Let’s look closer at these datasets.

[Scatterplots of the four Anscombe datasets, each with a fitted linear regression line]
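One way to draw the four scatterplots yourself, assuming ggplot2 is installed, is to first stack the quartet into long form. The reshaping below is a sketch (anscombe_long is a name we make up here):

```r
library(ggplot2)

# stack the four (x, y) pairs into one long data frame
anscombe_long <- data.frame(
  set = rep(paste("dataset", 1:4), each = nrow(anscombe)),
  x = with(anscombe, c(x1, x2, x3, x4)),
  y = with(anscombe, c(y1, y2, y3, y4))
)

# one panel per dataset, each with a fitted line
ggplot(anscombe_long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ set)
```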

Visualizations can aid communication and make the data easier to perceive. It can also show us things about our data that numerical summaries won’t necessarily capture.

2.2 A Grammar of Graphics

The grammar of graphics was developed by Leland Wilkinson (https://www.springer.com/gp/book/9780387245447). It is a set of grammatical rules for creating perceivable graphs. Rather than thinking about a limited set of graphs, we can think about graphical forms. This abstraction makes thinking, creating, and communicating graphics easier.

Statistical graphic specifications are expressed using the following components.

  1. data: a set of data operations that create variables from datasets
  2. trans: variable transformations
  3. scale: scale transformations
  4. coord: a coordinate system
  5. element: graphs (points) and their aesthetic attributes (color)
  6. guide: one or more guides (axes, legends, etc.)

ggplot2 is a package written by Hadley Wickham (https://vita.had.co.nz/papers/layered-grammar.html) that implements the ideas in the grammar of graphics to create layered plots.

ggplot2 uses the idea that you can build every graph with graphical components from three sources

  1. the data, represented by geoms
  2. the scales and coordinate system
  3. the plot annotations

This works by mapping values in the data to visual properties of the geom (aesthetics) like size, color, and locations.

Let’s build a graphic. We start with the data. We will use the diamonds dataset, and we want to explore the relationship between carat and price.

head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
ggplot(data = diamonds)

Next we need to specify the aesthetic (variable) mappings.

ggplot(data = diamonds, mapping = aes(carat, price))

Now we choose a geom to display our data.

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point()

And add an aesthetic to our plot.

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut))

We could add another layer.

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut)) +
  geom_smooth(aes(color = cut), method = "lm")
## `geom_smooth()` using formula 'y ~ x'

And finally, we can specify coordinate transformations.

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut)) +
  geom_smooth(aes(color = cut), method = "lm") +
  scale_y_sqrt()
## `geom_smooth()` using formula 'y ~ x'

Notice we can add on to our plot in a layered fashion.

2.3 Graphical Summaries

There are some basic charts we will use in this class that cover a wide range of cases. For univariate data, we can use dotplots, histograms, and barcharts. For two dimensional data, we can look at scatterplots and boxplots.

2.3.1 Scatterplots

Scatterplots are used for investigating relationships between two numeric variables. To demonstrate some of the flexibility of scatterplots in ggplot2, let’s answer the following question.

Do cars with big engines use more fuel than cars with small engines?

We will use the mpg dataset in the ggplot2 package to answer the question. This dataset contains observations collected by the US Environmental Protection Agency on 38 models of car.

dim(mpg)
## [1] 234  11
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Among the variables mpg contains are displ, a car’s engine size in liters, and hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg).

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy))

So we can say, yes, cars with larger engines have worse fuel efficiency. But there is more going on here.

The red points above seem to have higher mpg than they should based on engine size alone (outliers). Maybe there is a confounding variable we’ve missed. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV.

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy, colour = class))

The colors show that many of the unusual points are two-seater cars, probably sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage.

Instead of color, we could also map a categorical variable (like class) to shape, size, and transparency (alpha).
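For example, we could map the drive type to shape instead (drv is used here rather than class because it has only three levels, which fits comfortably in the default shape palette):

```r
library(ggplot2)

# shape, rather than color, distinguishes the groups
ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy, shape = drv))
```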

So far we have mapped aesthetics to variables in our dataset. What happens if we just want to generally change the aesthetics of our plots, without tying that to data? We can specify general aesthetics as parameters of the geom, instead of specifying them as aesthetics (aes).

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy), colour = "darkgreen", size = 2)

When interpreting a scatterplot we can look for big patterns in our data, as well as form, direction, and strength of relationships. Additionally, we can see small patterns and deviations from those patterns (outliers).

Your Turn

  1. Make a scatterplot of cty vs. hwy mpg using the mpg dataset.

  2. Describe the relationship that you see.

  3. Map color and shape to type of drive the car is (see ?mpg for details on the variables.). Do you see any patterns?

  4. Alter your plot from part 3. to make all the points be larger.

2.3.2 Histograms, Barcharts, and Boxplots

We can look at the distribution of continuous variables using histograms and boxplots and the distribution of discrete variables using barcharts.

ggplot(data = mpg) +
  geom_histogram(mapping = aes(hwy), bins = 30) 

## histograms will look very different sometimes with different binwidths

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(drv, hwy)) 

## boxplots allow us to see the distribution of a continuous variable conditional on a discrete one
## we can also show the actual data at the same time
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(drv, hwy)) +
  geom_jitter(mapping = aes(drv, hwy), alpha = .5)

ggplot(data = mpg) +
  geom_bar(mapping = aes(drv)) 

## shows us the distribution of a categorical variable

2.3.3 Facets

So far we’ve looked at

  1. how one (or more) variables are distributed - barchart or histogram
  2. how two variables are related - scatterplot, boxplot
  3. how two variables are related, conditioned on other variables - color

Sometimes color isn’t enough to show conditioning because of crowded plots.

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut))

When this is the case, we can facet to display plots for different subsets. To do this, we specify row variables ~ column variables (or . for none).

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut)) +
  facet_wrap(. ~ cut)

If instead we have two variables we want to facet by, we can use facet_grid().

ggplot(data = diamonds, mapping = aes(carat, price)) +
  geom_point(aes(color = cut)) +
  facet_grid(color ~ cut)

Your Turn

Using the mpg dataset,

  1. Make a histogram of hwy, faceted by drv.

  2. Make a scatterplot that incorporates color, shape, size, and facets.

  3. BONUS - Color your histograms from 1. by cyl. Did this do what you thought it would? (Look at fill and group as options instead).

2.4 Additional resources

Documentation and cheat sheets (https://ggplot2.tidyverse.org)

Book website (http://had.co.nz/ggplot2/)

Ch. 3 of R4DS (https://r4ds.had.co.nz/data-visualisation.html)

3 tidyverse

The tidyverse is a suite of packages released by RStudio that work very well together (“verse”) to make data analysis run smoothly (“tidy”). It’s also a package in R that loads all the packages in the tidyverse at once.

library(tidyverse)

You actually already know one member of the tidyverse – ggplot2! We will highlight three more packages in the tidyverse for data analysis.

Adapted from R for Data Science, Wickham & Grolemund (2017)

3.1 readr

The first step in (almost) any data analysis task is reading data into R. Data can take many formats, but we will focus on text files.

But what about .xlsx??

File extensions .xls and .xlsx are proprietary Excel formats. These are binary files (meaning that if you open one outside of Excel, it will not be human readable). An alternative for rectangular data is a .csv.

.csv is an extension for comma separated value files. They are text files – directly readable – where each column is separated by a comma and each row a new line.

Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen
1,2419,PETROLEUM ENGINEERING,2339,2057,282,Engineering,0.120564344
2,2416,MINING AND MINERAL ENGINEERING,756,679,77,Engineering,0.101851852

.tsv is an extension for tab separated value files. These are also text files, but the columns are separated by tabs instead of commas. Sometimes these will be .txt extension files.

Rank    Major_code    Major    Total    Men    Women    Major_category    ShareWomen
1    2419    PETROLEUM ENGINEERING    2339    2057    282    Engineering    0.120564344
2    2416    MINING AND MINERAL ENGINEERING    756    679    77    Engineering    0.101851852

The package readr provides a fast and friendly way to read rectangular text data into R.

Here is an example csv file from fivethirtyeight.com on how to choose your college major (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

# load readr
library(readr)

# read a csv
recent_grads <- read_csv(file = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")

read_csv() is just one way to read a file using the readr package.

  • read_delim(): the most generic function. Use the delim argument to read a file with any type of delimiter
  • read_tsv(): read tab separated files
  • read_lines(): read a file into a vector that has one element per line of the file
  • read_file(): read a file into a single character element
  • read_table(): read a file separated by space

Your Turn

  1. Read the NFL salaries dataset from https://raw.githubusercontent.com/ada-lovecraft/ProcessingSketches/master/Bits%20and%20Pieces/Football_Stuff/data/nfl-salaries.tsv into R.

  2. What is the highest NFL salary in this dataset? Who is the highest paid player?

  3. Make a histogram and describe the distribution of NFL salaries.

3.2 dplyr

We will almost never read in data and have it in exactly the right form for visualizing and modeling. Often we need to create new variables or summaries.

To facilitate easy transformation of data, we’re going to learn how to use the dplyr package. dplyr uses 6 main verbs, which correspond to some main tasks we may want to perform in an analysis.

We will do this with the recent_grads data from fivethirtyeight.com that we just read into R using readr.

3.2.1 |>

Before we get into the verbs in dplyr, I want to introduce a new paradigm. All of the functions in the tidyverse are structured such that the first argument is a data frame and they also return a data frame. This allows for efficient use of the pipe operator |> (pronounce this as “then”).

a |> b()

This takes the result on the left and passes it as the first argument to the function on the right. It is equivalent to

b(a)

This is useful when we want to chain together many operations in an analysis.
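For example, here is a short chain on a small made-up data frame (the values are arbitrary, just for illustration) — read each |> as "then":

```r
library(dplyr)

# a tiny made-up data frame, just for illustration
grades <- data.frame(
  student = c("a", "b", "c", "d"),
  score   = c(85, 92, 78, 88)
)

# take grades, then filter, then summarise
grades |>
  filter(score > 80) |>
  summarise(mean_score = mean(score))
# the result is a one-row data frame with mean_score of about 88.3
```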

3.2.2 filter()

filter() lets us subset observations based on their values. This is similar to using [] to subset a data frame, but simpler.

The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

Let’s subset the recent_grads data set to focus on Statistics majors.

recent_grads |> filter(Major == "STATISTICS AND DECISION SCIENCE")
## # A tibble: 1 × 21
##    Rank Major_…¹ Major Total   Men Women Major…² Share…³ Sampl…⁴ Emplo…⁵ Full_…⁶
##   <dbl>    <dbl> <chr> <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>
## 1    47     3702 STAT…  6251  2960  3291 Comput…   0.526      37    4247    3190
## # … with 10 more variables: Part_time <dbl>, Full_time_year_round <dbl>,
## #   Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>, P25th <dbl>,
## #   P75th <dbl>, College_jobs <dbl>, Non_college_jobs <dbl>,
## #   Low_wage_jobs <dbl>, and abbreviated variable names ¹​Major_code,
## #   ²​Major_category, ³​ShareWomen, ⁴​Sample_size, ⁵​Employed, ⁶​Full_time

Alternatively, we could look at all Majors in the same category, “Computers & Mathematics”, for comparison.

recent_grads |> filter(Major_category == "Computers & Mathematics")
## # A tibble: 11 × 21
##     Rank Major_code Major      Total   Men Women Major…¹ Share…² Sampl…³ Emplo…⁴
##    <dbl>      <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>
##  1    21       2102 COMPUTER… 128319 99743 28576 Comput…   0.223    1196  102087
##  2    42       3700 MATHEMAT…  72397 39956 32441 Comput…   0.448     541   58118
##  3    43       2100 COMPUTER…  36698 27392  9306 Comput…   0.254     425   28459
##  4    46       2105 INFORMAT…  11913  9005  2908 Comput…   0.244     158    9881
##  5    47       3702 STATISTI…   6251  2960  3291 Comput…   0.526      37    4247
##  6    48       3701 APPLIED …   4939  2794  2145 Comput…   0.434      45    3854
##  7    53       4005 MATHEMAT…    609   500   109 Comput…   0.179       7     559
##  8    54       2101 COMPUTER…   4168  3046  1122 Comput…   0.269      43    3257
##  9    82       2106 COMPUTER…   8066  6607  1459 Comput…   0.181     103    6509
## 10    85       2107 COMPUTER…   7613  5291  2322 Comput…   0.305      97    6144
## 11   106       2001 COMMUNIC…  18035 11431  6604 Comput…   0.366     208   14779
## # … with 11 more variables: Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, and abbreviated variable names
## #   ¹​Major_category, ²​ShareWomen, ³​Sample_size, ⁴​Employed

Notice we are using |> to pass the data frame to the first argument of filter(), and we do not need the recent_grads$ prefix to refer to columns in our data.

dplyr functions never modify their inputs, so if we need to save the result, we have to do it using <-.

math_grads <- recent_grads |> filter(Major_category == "Computers & Mathematics")

Everything we’ve already learned about logicals and comparisons comes in handy here, since the second and subsequent arguments of filter() are logical expressions telling dplyr which rows we care about.
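For example, conditions separated by commas must all hold (they are combined with &), and %in% matches any one of several values. A quick sketch using the starwars data frame that ships with dplyr:

```r
library(dplyr)

# comma-separated conditions are ANDed together
starwars |> filter(species == "Droid", mass < 100)

# %in% matches any one of several values
starwars |> filter(homeworld %in% c("Tatooine", "Naboo"))
```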

3.2.3 arrange()

arrange() works similarly to filter() except that it changes the order of rows rather than subsetting. Again, the first argument is a data frame and the additional arguments are a set of column names to order by.

math_grads |> arrange(ShareWomen)
## # A tibble: 11 × 21
##     Rank Major_code Major      Total   Men Women Major…¹ Share…² Sampl…³ Emplo…⁴
##    <dbl>      <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>
##  1    53       4005 MATHEMAT…    609   500   109 Comput…   0.179       7     559
##  2    82       2106 COMPUTER…   8066  6607  1459 Comput…   0.181     103    6509
##  3    21       2102 COMPUTER… 128319 99743 28576 Comput…   0.223    1196  102087
##  4    46       2105 INFORMAT…  11913  9005  2908 Comput…   0.244     158    9881
##  5    43       2100 COMPUTER…  36698 27392  9306 Comput…   0.254     425   28459
##  6    54       2101 COMPUTER…   4168  3046  1122 Comput…   0.269      43    3257
##  7    85       2107 COMPUTER…   7613  5291  2322 Comput…   0.305      97    6144
##  8   106       2001 COMMUNIC…  18035 11431  6604 Comput…   0.366     208   14779
##  9    48       3701 APPLIED …   4939  2794  2145 Comput…   0.434      45    3854
## 10    42       3700 MATHEMAT…  72397 39956 32441 Comput…   0.448     541   58118
## 11    47       3702 STATISTI…   6251  2960  3291 Comput…   0.526      37    4247
## # … with 11 more variables: Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, and abbreviated variable names
## #   ¹​Major_category, ²​ShareWomen, ³​Sample_size, ⁴​Employed

If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
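For example, with the built-in mtcars data, rows are sorted by cyl first, and mpg breaks ties within each value of cyl (a small sketch):

```r
library(dplyr)

# sort by cyl; within each value of cyl, sort by mpg
mtcars |> arrange(cyl, mpg) |> head()
```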

We can use desc() to re-order by a column in descending order.

math_grads |> arrange(desc(ShareWomen))
## # A tibble: 11 × 21
##     Rank Major_code Major      Total   Men Women Major…¹ Share…² Sampl…³ Emplo…⁴
##    <dbl>      <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>
##  1    47       3702 STATISTI…   6251  2960  3291 Comput…   0.526      37    4247
##  2    42       3700 MATHEMAT…  72397 39956 32441 Comput…   0.448     541   58118
##  3    48       3701 APPLIED …   4939  2794  2145 Comput…   0.434      45    3854
##  4   106       2001 COMMUNIC…  18035 11431  6604 Comput…   0.366     208   14779
##  5    85       2107 COMPUTER…   7613  5291  2322 Comput…   0.305      97    6144
##  6    54       2101 COMPUTER…   4168  3046  1122 Comput…   0.269      43    3257
##  7    43       2100 COMPUTER…  36698 27392  9306 Comput…   0.254     425   28459
##  8    46       2105 INFORMAT…  11913  9005  2908 Comput…   0.244     158    9881
##  9    21       2102 COMPUTER… 128319 99743 28576 Comput…   0.223    1196  102087
## 10    82       2106 COMPUTER…   8066  6607  1459 Comput…   0.181     103    6509
## 11    53       4005 MATHEMAT…    609   500   109 Comput…   0.179       7     559
## # … with 11 more variables: Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, and abbreviated variable names
## #   ¹​Major_category, ²​ShareWomen, ³​Sample_size, ⁴​Employed

3.2.4 select()

Sometimes we have data sets with a ton of variables and often we want to narrow down the ones that we actually care about. select() allows us to do this based on the names of the variables.

math_grads |> select(Major, ShareWomen, Total, Full_time, P75th)
## # A tibble: 11 × 5
##    Major                                           ShareW…¹  Total Full_…² P75th
##    <chr>                                              <dbl>  <dbl>   <dbl> <dbl>
##  1 COMPUTER SCIENCE                                   0.223 128319   91485 70000
##  2 MATHEMATICS                                        0.448  72397   46399 60000
##  3 COMPUTER AND INFORMATION SYSTEMS                   0.254  36698   26348 60000
##  4 INFORMATION SCIENCES                               0.244  11913    9105 58000
##  5 STATISTICS AND DECISION SCIENCE                    0.526   6251    3190 60000
##  6 APPLIED MATHEMATICS                                0.434   4939    3465 63000
##  7 MATHEMATICS AND COMPUTER SCIENCE                   0.179    609     584 78000
##  8 COMPUTER PROGRAMMING AND DATA PROCESSING           0.269   4168    3204 46000
##  9 COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY    0.181   8066    6289 50000
## 10 COMPUTER NETWORKING AND TELECOMMUNICATIONS         0.305   7613    5495 49000
## 11 COMMUNICATION TECHNOLOGIES                         0.366  18035   11981 45000
## # … with abbreviated variable names ¹​ShareWomen, ²​Full_time

We can also use

  • : to select all columns between two columns
  • - to select all columns except those specified
  • starts_with("abc") matches names that begin with “abc”
  • ends_with("xyz") matches names that end with “xyz”
  • contains("ijk") matches names that contain “ijk”
  • everything() matches all columns

math_grads |> select(Major, College_jobs:Low_wage_jobs)
## # A tibble: 11 × 4
##    Major                                           College_jobs Non_co…¹ Low_w…²
##    <chr>                                                  <dbl>    <dbl>   <dbl>
##  1 COMPUTER SCIENCE                                       68622    25667    5144
##  2 MATHEMATICS                                            34800    14829    4569
##  3 COMPUTER AND INFORMATION SYSTEMS                       13344    11783    1672
##  4 INFORMATION SCIENCES                                    4390     4102     608
##  5 STATISTICS AND DECISION SCIENCE                         2298     1200     343
##  6 APPLIED MATHEMATICS                                     2437      803     357
##  7 MATHEMATICS AND COMPUTER SCIENCE                         452       67      25
##  8 COMPUTER PROGRAMMING AND DATA PROCESSING                2024     1033     263
##  9 COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY         2354     3244     308
## 10 COMPUTER NETWORKING AND TELECOMMUNICATIONS              2593     2941     352
## 11 COMMUNICATION TECHNOLOGIES                              4545     8794    2495
## # … with abbreviated variable names ¹​Non_college_jobs, ²​Low_wage_jobs
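As a quick sketch of the select() helpers above, using the built-in iris data frame:

```r
library(dplyr)

# all columns whose names start with "Petal"
iris |> select(starts_with("Petal")) |> head()

# all columns except Species
iris |> select(-Species) |> head()
```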

rename() is a variant of select() that renames the specified columns while keeping all of the columns in the data frame.

math_grads |> rename(Code_major = Major_code)
## # A tibble: 11 × 21
##     Rank Code_major Major      Total   Men Women Major…¹ Share…² Sampl…³ Emplo…⁴
##    <dbl>      <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>
##  1    21       2102 COMPUTER… 128319 99743 28576 Comput…   0.223    1196  102087
##  2    42       3700 MATHEMAT…  72397 39956 32441 Comput…   0.448     541   58118
##  3    43       2100 COMPUTER…  36698 27392  9306 Comput…   0.254     425   28459
##  4    46       2105 INFORMAT…  11913  9005  2908 Comput…   0.244     158    9881
##  5    47       3702 STATISTI…   6251  2960  3291 Comput…   0.526      37    4247
##  6    48       3701 APPLIED …   4939  2794  2145 Comput…   0.434      45    3854
##  7    53       4005 MATHEMAT…    609   500   109 Comput…   0.179       7     559
##  8    54       2101 COMPUTER…   4168  3046  1122 Comput…   0.269      43    3257
##  9    82       2106 COMPUTER…   8066  6607  1459 Comput…   0.181     103    6509
## 10    85       2107 COMPUTER…   7613  5291  2322 Comput…   0.305      97    6144
## 11   106       2001 COMMUNIC…  18035 11431  6604 Comput…   0.366     208   14779
## # … with 11 more variables: Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, and abbreviated variable names
## #   ¹​Major_category, ²​ShareWomen, ³​Sample_size, ⁴​Employed

3.2.5 mutate()

Besides selecting sets of existing columns, we can also add new columns that are functions of existing columns with mutate(). mutate() always adds new columns at the end of the data frame.

math_grads |> mutate(Full_time_rate = Full_time_year_round/Total)
## # A tibble: 11 × 22
##     Rank Major_code Major      Total   Men Women Major…¹ Share…² Sampl…³ Emplo…⁴
##    <dbl>      <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <dbl>   <dbl>   <dbl>
##  1    21       2102 COMPUTER… 128319 99743 28576 Comput…   0.223    1196  102087
##  2    42       3700 MATHEMAT…  72397 39956 32441 Comput…   0.448     541   58118
##  3    43       2100 COMPUTER…  36698 27392  9306 Comput…   0.254     425   28459
##  4    46       2105 INFORMAT…  11913  9005  2908 Comput…   0.244     158    9881
##  5    47       3702 STATISTI…   6251  2960  3291 Comput…   0.526      37    4247
##  6    48       3701 APPLIED …   4939  2794  2145 Comput…   0.434      45    3854
##  7    53       4005 MATHEMAT…    609   500   109 Comput…   0.179       7     559
##  8    54       2101 COMPUTER…   4168  3046  1122 Comput…   0.269      43    3257
##  9    82       2106 COMPUTER…   8066  6607  1459 Comput…   0.181     103    6509
## 10    85       2107 COMPUTER…   7613  5291  2322 Comput…   0.305      97    6144
## 11   106       2001 COMMUNIC…  18035 11431  6604 Comput…   0.366     208   14779
## # … with 12 more variables: Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, Full_time_rate <dbl>, and
## #   abbreviated variable names ¹​Major_category, ²​ShareWomen, ³​Sample_size,
## #   ⁴​Employed
# we can't see everything
math_grads |> 
  mutate(Full_time_rate = Full_time_year_round/Total) |> 
  select(Major, ShareWomen, Full_time_rate)
## # A tibble: 11 × 3
##    Major                                           ShareWomen Full_time_rate
##    <chr>                                                <dbl>          <dbl>
##  1 COMPUTER SCIENCE                                     0.223          0.553
##  2 MATHEMATICS                                          0.448          0.466
##  3 COMPUTER AND INFORMATION SYSTEMS                     0.254          0.576
##  4 INFORMATION SCIENCES                                 0.244          0.619
##  5 STATISTICS AND DECISION SCIENCE                      0.526          0.344
##  6 APPLIED MATHEMATICS                                  0.434          0.525
##  7 MATHEMATICS AND COMPUTER SCIENCE                     0.179          0.642
##  8 COMPUTER PROGRAMMING AND DATA PROCESSING             0.269          0.589
##  9 COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY      0.181          0.612
## 10 COMPUTER NETWORKING AND TELECOMMUNICATIONS           0.305          0.574
## 11 COMMUNICATION TECHNOLOGIES                           0.366          0.504

3.2.6 summarise()

The last major verb is summarise(). It collapses a data frame to a single row based on a summary function.

math_grads |> summarise(mean_major_size = mean(Total))
## # A tibble: 1 × 1
##   mean_major_size
##             <dbl>
## 1          27183.

Useful summary functions include a count (n()) and a count of non-missing values (sum(!is.na(x))).

math_grads |> summarise(mean_major_size = mean(Total), num_majors = n())
## # A tibble: 1 × 2
##   mean_major_size num_majors
##             <dbl>      <int>
## 1          27183.         11

3.2.7 group_by()

summarise() is not super useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when we use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

recent_grads |>
  group_by(Major_category) |>
  summarise(mean_major_size = mean(Total, na.rm = TRUE)) |>
  arrange(desc(mean_major_size))
## # A tibble: 16 × 2
##    Major_category                      mean_major_size
##    <chr>                                         <dbl>
##  1 Business                                    100183.
##  2 Communications & Journalism                  98150.
##  3 Social Science                               58885.
##  4 Psychology & Social Work                     53445.
##  5 Humanities & Liberal Arts                    47565.
##  6 Arts                                         44641.
##  7 Health                                       38602.
##  8 Law & Public Policy                          35821.
##  9 Education                                    34946.
## 10 Industrial Arts & Consumer Services          32827.
## 11 Biology & Life Science                       32419.
## 12 Computers & Mathematics                      27183.
## 13 Physical Sciences                            18548.
## 14 Engineering                                  18537.
## 15 Interdisciplinary                            12296 
## 16 Agriculture & Natural Resources               8402.

We can group by multiple variables, and if we need to remove grouping and return to operations on ungrouped data, we use ungroup().

Grouping is also useful for arrange() and mutate() within groups.
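For example, a grouped mutate() computes within each group. A sketch with the built-in mtcars data, comparing each car’s mpg to the mean for cars with the same number of cylinders:

```r
library(dplyr)

# each car's mpg relative to the mean mpg of cars
# with the same number of cylinders
mtcars |>
  group_by(cyl) |>
  mutate(mpg_vs_group = mpg - mean(mpg)) |>
  ungroup() |>
  select(cyl, mpg, mpg_vs_group)
```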

Your Turn

Using the NFL salaries from https://raw.githubusercontent.com/ada-lovecraft/ProcessingSketches/master/Bits%20and%20Pieces/Football_Stuff/data/nfl-salaries.tsv that you loaded into R in the previous your turn, perform the following.

  1. What is the team with the highest paid roster?

  2. What are the top 5 paid players?

  3. What is the highest paid position on average? the lowest? the most variable?

3.3 tidyr

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Tidy data is an organization strategy for data that makes it easier to work with, analyze, and visualize. tidyr is a package that can help us tidy our data in a less painful way.

The following all contain the same data, but show different levels of “tidiness”.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
# spread across two data frames
table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

While these are all representations of the same underlying data, they are not equally easy to use.

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

In the above example,

table2 isn’t tidy because each variable doesn’t have its own column.

table3 isn’t tidy because each value doesn’t have its own cell.

table4a and table4b aren’t tidy because each observation doesn’t have its own row.

table1 is tidy!

Being tidy with our data is useful because it’s a consistent set of rules to follow for working with data and because it allows R to be efficient.

# Compute rate per 10,000
table1 |> 
  mutate(rate = cases / population * 10000)
## # A tibble: 6 × 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67
# Visualize cases over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country)) + 
  geom_point(aes(colour = country))

3.3.1 Pivoting

Unfortunately, most of the data you will find in the “wild” is not tidy. So, we need tools to help us tidy unruly data.

The main tools in tidyr are pivot_longer() and pivot_wider(). As the names imply, pivot_longer() “lengthens” our data, increasing the number of rows and decreasing the number of columns. pivot_wider() does the opposite, increasing the number of columns and decreasing the number of rows.

These two functions resolve one of two common problems:

  1. One variable might be spread across multiple columns. (pivot_longer())
  2. One observation might be scattered across multiple rows. (pivot_wider())

A common issue with data is when values are used as column names.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

We can fix this using pivot_longer().

table4a |>
  pivot_longer(-country, names_to = "year", values_to = "cases")

Notice we specified which columns we wanted to consolidate by telling the function the column we didn’t want to change (-country). We can use the dplyr::select() syntax here for specifying the columns to pivot.
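For example, an equivalent call names the columns to pivot directly (the backticks are needed because the column names start with digits):

```r
library(tidyr)

table4a |>
  pivot_longer(cols = `1999`:`2000`, names_to = "year", values_to = "cases")
```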

We can do the same thing with table4b and then join the databases together by specifying unique identifying attributes.

table4a |>
  pivot_longer(-country, names_to = "year", values_to = "cases") |>
  left_join(table4b |> pivot_longer(-country, names_to = "year", values_to = "population"))

If, instead, variables don’t have their own column, we can pivot_wider().

table2

table2 |>
  pivot_wider(names_from = type, values_from = count)

3.3.2 Separating and Uniting

So far we have tidied table2 and table4a and table4b, but what about table3?

table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

We need to split the rate column into cases and population columns so that each value has its own cell. The function we will use is separate(). We need to specify the column, the character to split on (“/”), and the names of the new columns.

table3 |>
  separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

By default, separate() will split values wherever it sees a character that isn’t a number or letter.
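Note that cases and population above came out as character columns. Adding convert = TRUE asks separate() to guess better column types for the new columns:

```r
library(tidyr)

# convert = TRUE turns the split-off strings into numbers
table3 |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```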

unite() is the opposite of separate() – it combines multiple columns into a single column.
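For example, the tidyr dataset table5 stores the year split across century and year columns; unite() pastes them back together (sep = "" avoids the default "_" separator):

```r
library(tidyr)

# combine century and year into a single column called new
table5 |>
  unite(new, century, year, sep = "")
```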

Your Turn

  1. Is the NFL salaries from https://raw.githubusercontent.com/ada-lovecraft/ProcessingSketches/master/Bits%20and%20Pieces/Football_Stuff/data/nfl-salaries.tsv that you loaded into R in a previous your turn tidy? Why or why not?

  2. There is a data set in tidyr called world_bank_pop that contains information about population from the World Bank (https://data.worldbank.org/). Why is this data not tidy? You may want to read more about the data to answer (?world_bank_pop).

  3. Use functions in tidyr to turn this into a tidy form.

4 Rmarkdown

Markdown is a lightweight markup language designed to produce formatted documents from plain text.

Markdown is becoming a standard. Many websites will generate HTML from Markdown (e.g. GitHub, Stack Overflow, reddit), and this course website is written in Markdown as well.

Markdown is easy for humans to read and write.

*italic*   
**bold**
# Header 1
## Header 2
### Header 3
* Item 1
* Item 2
    + Item 2a
    + Item 2b

1. Item 1
2. Item 2
3. Item 3
    + Item 3a
    + Item 3b
[linked phrase](http://example.com)

A friend once said:

> It's always better to give 
> than to receive.

Rmarkdown is an authoring format that lets you incorporate the results from R code in your documents.

It combines the core syntax of markdown with embedded R code chunks that are run so their output can be included in the final document.

You no longer have to copy/paste plots into your homework!

Documents built from Rmarkdown are fully reproducible, i.e. they are automatically regenerated whenever embedded R code changes.

To include an R chunk in an Rmarkdown document, you use backticks.
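For example, a minimal chunk looks like this (the chunk name plot-cars and the echo option are arbitrary choices for illustration):

````markdown
```{r plot-cars, echo = TRUE}
summary(cars)
```
````

Everything between the fences is run as R code when the document is compiled, and its output appears in the final document.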

To create a new Rmarkdown document in RStudio, File > New File > R Markdown.

There are many options that affect the aesthetics of the resulting document and the results and appearance of R chunks (e.g. echo, eval, message, fig.height). For a full list of chunk options, see https://yihui.name/knitr/options/.

Your Turn

  1. Create a new Rmarkdown document.

  2. Alter the template to use a ggplot2 figure and specify the size to be a height of 6.

  3. Add a caption to your figure.

  4. Compile your document to pdf.

4.1 Additional resources

Documentation and cheat sheets (https://rmarkdown.rstudio.com)

R Markdown: The Definitive Guide (https://bookdown.org/yihui/rmarkdown/)

5 Git and GitHub

Note: Thanks to http://happygitwithr.com for inspiration and material.

5.1 Definition/background

Git is a version control system that was created to help developers manage collaborative software projects. Git tracks the evolution of a set of files, called a repository or repo.

This helps us

  • resolve merge conflicts that arise from collaboration
  • rollback to previous versions of files as necessary
  • store master versions of files, no more paper_final_final_I_really_mean_it.docx

5.2 Terminology

  • Repository: The most basic element of git; imagine it as a project’s folder. A repository contains all of the project files and stores each file’s revision history. Can be either public or private.
  • Clone: A copy of a repository that lives on your computer instead of on a website’s server somewhere, or the act of making that copy.
  • Pull: Fetching changes from a remote repository and merging them into your local copy.
  • Commit: An individual change to a file (or set of files). Every commit creates a unique ID that keeps a record of what changes were made, when, and by whom.
  • Push: Sending your committed changes to a remote repository such as GitHub.com.
  • Fork: A fork is a personal copy of another user’s repository that lives on your account. Forks allow you to freely make changes to a project without affecting the original.
  • Pull Request: Proposed changes to a repository submitted by a user and accepted or rejected by a repository’s collaborators.
  • Issue: Issues are suggested improvements, tasks or questions related to the repository.
  • Remote: This is the version of something that is hosted on a server, most likely GitHub.com. It can be connected to local clones so that changes can be synced.

From https://help.github.com/articles/github-glossary/.

5.3 GitHub

There are many hosting services for remote repositories (GitHub, Bitbucket, GitLab, etc.). We will use GitHub in this class, but the ideas carry over to the other services.

By default, all materials on GitHub are public. This is good because you are getting your work out there and contributing to the open source community!

If you need private repos, check out GitHub for Education – free private repos for students/postdocs/professors.

5.4 Creating Repos

When creating a repo on GitHub, initialize with a README (yes), a .gitignore (usually the R template), and a license (e.g. GPL-3).

Your Turn

  1. Create a hello-world repo

Cloning a repo –

From scratch:

  1. Create the repo on the GitHub website

  2. Clone the repo

  3. Start working

  4. Add files, commit, push, etc.

From existing folder:

  1. Create the repo on the GitHub website

  2. Clone the repo

  3. Copy existing work into local folder

  4. Add files, commit, push, etc.
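The from-scratch workflow above can be sketched at the command line. This is a minimal, hypothetical example: a local bare repository stands in for the GitHub remote (step 1 would really happen on the website), and the paths and names are made up:

```shell
set -e

# hypothetical scratch location; the bare repo below stands in for GitHub
tmp="${TMPDIR:-/tmp}/hello-world-demo"
rm -rf "$tmp" && mkdir -p "$tmp"

# stand-in for step 1 (really done on the GitHub website)
git init --bare "$tmp/hello-world.git"

# step 2: clone the repo
git clone "$tmp/hello-world.git" "$tmp/hello-world"
cd "$tmp/hello-world"
git config user.email "you@example.com"
git config user.name "Your Name"

# step 3: start working
echo "# hello-world" > README.md

# step 4: add files, commit, push
git add README.md
git commit -m "Add README"
git push origin HEAD
```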

5.5 Pushing and pulling, a tug of war

Important: remember to pull before you start working to get the most up to date changes from your collaborators (or your past self) before making local changes!

5.6 When should I commit?

Think of commits as a checkpoint in a video game. This is a point in time when you want to save your status so that you can come back to it later if need be.

Commits are like voting. I like to do it early and often.

- Me, right now

5.7 Blow it up

Sometimes your local repo gets borked. That’s OK. There are ways that we can work really hard and fix them, but sometimes you just want to stash your files somewhere and re-clone from your centralized repository.

5.8 Git with RStudio

RStudio (and our class RStudio server) allows you to make projects based on GitHub repos.

Local RStudio works much the same way, with the ability to push/pull from a local project to a GitHub repo.

The Git pane in RStudio lets us

  1. Select (stage) files to commit.
  2. Commit.
  3. Push/pull.

5.9 Collaboration

In this class, we will have a collaborative project. It will be hosted on GitHub and part of your grade will be how well you collaborate through the use of GitHub. For this, we will need to have project repos that everyone can push to!

GitHub Repo site (https://github.com/username/reponame) > Settings > Collaborators & Teams > Collaborators > Add collaborator

Your collaborators will have a lot of power in your repo, so choose wisely! They can change anything in the repo and add their own files. They can also delete your files! The only things they can’t do are delete the repo, add collaborators, or change other settings for the repo.

5.10 Installation help

We are not covering installation on your personal computers for this class. If you would like to work through it on your own, here is an excellent guide: http://happygitwithr.com/installation-pain.html

Feel free to come to office hours or set up individual time with us if you need help.

Your Turn

  1. Edit the README file in your hello-world repo

  2. Commit and push changes

  3. Check out your commit history!