Do the ‘R’ Thing

Feb 08, 2018

GC Digital Initiatives

What we do

GCDI offers many kinds of support for digital scholarship: our bi-weekly drop-in Python Users’ Group (PUG), evening workshops, afternoon skillshare lunch conversations, one-on-one consultations for faculty and students, working groups based around common tools or data sources, special events such as our annual Digital Showcase and Sound Series, and online resources.

Whether you have participated in our workshops before or your idea of the perfect software is a paperback edition, there is something for you! Our offerings are open to scholars at all levels of digital experience and at all stages of graduate research. Whether you are digitally driven, curious, or defiant, we are prepared to help.

How to get involved

Join us on the CUNY Academic Commons cuny.is/group-gcdi
Drop in to Python Users’ Group (PUG) in the Digital Scholarship Lab cuny.is/pug
Check our Event Calendar for upcoming events and workshops cuny.is/workshops
Request a one-on-one consultation at cuny.is/gcdfconsults (students) or cuny.is/gcfacultyconsults (faculty)
Join the GIS Working Group cuny.is/group-gis-working-group
Participate in the Humanidades Digitales / DH in Spanish group cuny.is/dhspanish
Apply for a Provost’s Digital Innovation Grant cuny.is/digitalgrants

Keeping in touch

Follow the GCDI on Twitter @cunygcdi
Follow the GC Digital Fellows on Twitter @digital_fellows
Follow the #digitalGC hashtag on Twitter
Follow the Digital Fellows blog, Tagging the Tower digitalfellows.commons.gc.cuny.edu
Contact the Digital Fellows at digitalfellows.commons.gc.cuny.edu/contact-us/
Drop in to the Digital Scholarship Lab Open House, hosted twice per semester, or attend PUG, Events, or Workshops!

Motivation for this workshop

By way of introduction

I am a doctoral student in Urban Education,
began using R for data analysis in 2009,
grew frustrated with R and began using pandas and IPython instead,
and, in 2016, fell in love with R again.

Brought to you by the letter ‘R’

Data analysis, the right way and wrong way

But, what is “data analysis”?

“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.” (John W. Tukey, 1961)

A schematic view of the data analysis process

r4ds.had.co.nz

Why you use R for data analysis

Or should, if you are not already

It is open source and free, unlike Stata, SAS, SPSS, etc.
Good documentation and large online community (e.g., Stack Overflow)
Many available packages from Comprehensive R Archive Network and GitHub

How not to teach R

Garrett Grolemund of RStudio has shared a few principles we will follow:

Do not teach R as if it were a programming language
Do not avoid the lecture
Do not let your workshop become a consulting clinic for installation bugs

Basics for data analysis

Welcome to the tidyverse

library(tidyverse)

Tidy your data

Importing data

Use readr and readxl packages

readr::read_csv('path/to/file.csv')
readr::read_csv('http://host/file.csv')
readxl::read_excel('path/to/file.xlsx')
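
A minimal sketch of a round trip with readr, assuming nothing beyond the readr package itself. The file contents and the name toy are made up for illustration; the explicit col_types argument is optional, but spelling it out avoids relying on readr's type guessing.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("carrier,flight,air_time",
             "UA,1545,227",
             "AA,1141,160"), tmp)

# Read it back, declaring the type of each column explicitly
toy <- read_csv(tmp, col_types = cols(
  carrier  = col_character(),
  flight   = col_integer(),
  air_time = col_double()
))
toy
```

The same call works with a URL in place of the file path, as in the lines above.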

Load up sample data

library(nycflights13)
flights %>% head(5)

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
## 1  2013     1     1      517            515      2.00      830            819      11.0 UA        1545 N14228  EWR    IAH        227     1400  5.00   15.0 2013-01-01 05:00:00
## 2  2013     1     1      533            529      4.00      850            830      20.0 UA        1714 N24211  LGA    IAH        227     1416  5.00   29.0 2013-01-01 05:00:00
## 3  2013     1     1      542            540      2.00      923            850      33.0 AA        1141 N619AA  JFK    MIA        160     1089  5.00   40.0 2013-01-01 05:00:00
## 4  2013     1     1      544            545     -1.00     1004           1022     -18.0 B6         725 N804JB  JFK    BQN        183     1576  5.00   45.0 2013-01-01 05:00:00
## 5  2013     1     1      554            600     -6.00      812            837     -25.0 DL         461 N668DN  LGA    ATL        116      762  6.00    0   2013-01-01 06:00:00

Pipe it like it’s hot

pipes are expressed with the %>% operator
pipes can be combined (chained)
pipes do not modify the data frame you pass in; assign the result if you want to keep it
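
The three points above can be sketched on a made-up, three-row data frame (the names toy, nested, piped, and sorted are hypothetical). The piped form reads left to right, while the nested form reads inside out; both should give the same result.

```r
library(dplyr)
library(tibble)

# Made-up toy data, used only for illustration
toy <- tibble(x = c(3, 1, 2))

# A nested call and a chained pipe that do the same thing
nested <- head(arrange(toy, x), 2)
piped  <- toy %>% arrange(x) %>% head(2)
identical(nested, piped)

# The pipe returns a new data frame; `toy` keeps its original order
toy$x

# To keep a result, assign it to a name
sorted <- toy %>% arrange(x)
sorted$x
```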

Selecting and transforming data

flights %>%
  mutate(date = as.Date(sprintf('%d-%.2d-%.2d', year, month, day)),
         weekday = weekdays(date)) %>%
  select(date, weekday, air_time, distance) %>%
  head(5)

## # A tibble: 5 x 4
##   date       weekday air_time distance
##   <date>     <chr>      <dbl>    <dbl>
## 1 2013-01-01 Tuesday      227     1400
## 2 2013-01-01 Tuesday      227     1416
## 3 2013-01-01 Tuesday      160     1089
## 4 2013-01-01 Tuesday      183     1576
## 5 2013-01-01 Tuesday      116      762

Filtering and ordering data

flights %>%
  filter(day < 8) %>%
  arrange(desc(air_time), distance) %>%
  select(day, carrier, flight, air_time, distance) %>%
  head(5)

## # A tibble: 5 x 5
##     day carrier flight air_time distance
##   <int> <chr>    <int>    <dbl>    <dbl>
## 1     6 HA          51      691     4983
## 2     5 HA          51      679     4983
## 3     3 HA          51      671     4983
## 4     6 UA          15      665     4963
## 5     5 UA          15      664     4963

Let’s get statistical

flights %>%
  summarise(maxairtime = max(air_time, na.rm=TRUE))

## # A tibble: 1 x 1
##   maxairtime
##        <dbl>
## 1        695

# air_time is in minutes and distance in miles,
# so avg_speed below is in miles per minute
flights %>%
  summarise(avg_time = mean(air_time, na.rm=TRUE),
            avg_speed = mean(distance/air_time, na.rm=TRUE))

## # A tibble: 1 x 2
##   avg_time avg_speed
##      <dbl>     <dbl>
## 1      151      6.57
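
A natural next step is to compute the same statistics per group. The sketch below assumes a made-up four-row table (toy_flights is hypothetical, with the same column names as flights) and uses group_by() before summarise(). It also multiplies by 60 to express speed in miles per hour rather than miles per minute.

```r
library(dplyr)
library(tibble)

# Made-up rows that mimic the columns of flights
toy_flights <- tibble(
  origin   = c("EWR", "EWR", "JFK", "JFK"),
  air_time = c(227, NA, 160, 183),      # minutes
  distance = c(1400, 1416, 1089, 1576)  # miles
)

# One summary row per origin airport; na.rm = TRUE skips the missing air_time
by_origin <- toy_flights %>%
  group_by(origin) %>%
  summarise(avg_time = mean(air_time, na.rm = TRUE),
            avg_mph  = mean(distance / air_time * 60, na.rm = TRUE))
by_origin
```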

More than one dataset

weather %>%
  filter(origin == 'EWR') %>%
  head(5)

## # A tibble: 5 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
##   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dttm>             
## 1 EWR     2013  1.00     1     0  37.0  21.9  54.0      230       10.4      11.9      0     1014  10.0 2012-12-31 19:00:00
## 2 EWR     2013  1.00     1     1  37.0  21.9  54.0      230       13.8      15.9      0     1013  10.0 2012-12-31 20:00:00
## 3 EWR     2013  1.00     1     2  37.9  21.9  52.1      230       12.7      14.6      0     1013  10.0 2012-12-31 21:00:00
## 4 EWR     2013  1.00     1     3  37.9  23.0  54.5      230       13.8      15.9      0     1013  10.0 2012-12-31 22:00:00
## 5 EWR     2013  1.00     1     4  37.9  24.1  57.0      240       15.0      17.2      0     1013  10.0 2012-12-31 23:00:00

Get It Together

Combining weather and flights data

flightsweather <- flights %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  select(origin, dep_delay, wind_speed)
flightsweather %>% head(5)

## # A tibble: 5 x 3
##   origin dep_delay wind_speed
##   <chr>      <dbl>      <dbl>
## 1 EWR         2.00       NA  
## 2 LGA         4.00       NA  
## 3 JFK         2.00       NA  
## 4 JFK        -1.00       NA  
## 5 LGA        -6.00       13.8
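
One likely reason for the NA values above is that left_join() keeps every row of the left table and fills in NA wherever the right table has no matching row. A minimal sketch on made-up tables (toy_flights and toy_weather are hypothetical) shows the behavior, and anti_join() lists the rows that found no match.

```r
library(dplyr)
library(tibble)

# Two made-up flights, but a weather record for only one of them
toy_flights <- tibble(origin = c("EWR", "LGA"),
                      hour = c(5, 6),
                      dep_delay = c(2, -6))
toy_weather <- tibble(origin = "LGA", hour = 6, wind_speed = 13.8)

# Every flight is kept; the unmatched one gets NA for wind_speed
joined <- toy_flights %>%
  left_join(toy_weather, by = c("origin", "hour"))
joined

# Flights with no weather record at all
unmatched <- toy_flights %>%
  anti_join(toy_weather, by = c("origin", "hour"))
unmatched
```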

Tidying data

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Example “pivot” table where column names contain years
We want to “unpivot” these column names into values of a new variable, year

table4a %>%
  gather(-country, key = "year", value = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
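
In newer releases of tidyr (1.0.0 and later, which this sketch assumes), pivot_longer() is the recommended way to do the same reshaping. The rows come back grouped by country rather than by year, but the content should match the gather() result above.

```r
library(dplyr)
library(tidyr)

# table4a ships with tidyr; every column except country is unpivoted
long <- table4a %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")
long
```

As with gather(), year comes back as a character column; mutate(year = as.integer(year)) would convert it if numeric years are needed.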

Explore variables with visualizations

flightsweather %>%
  na.omit() %>%
  ggplot(aes(x=dep_delay)) +
    geom_histogram(aes(y=after_stat(density)),
                   binwidth = 5)

More than just pretty graphics

flightsweather %>%
  filter(wind_speed < 250 & dep_delay < 500) %>%
  ggplot(aes(x=wind_speed, y=dep_delay)) +
    geom_point() +
    geom_smooth() +
    facet_grid(. ~ origin)

Let’s hit the gym

First, stretch

Take a five-minute break.

Proper gym equipment

R and RStudio Desktop

Install R and RStudio Desktop
Or, create an account on rstudio.cloud and copy the project for this workshop
Read the RStudio Desktop documentation

Start with free weights

Use the R Console

Familiarize yourself with the Console

version
sessionInfo()
x <- date()
x

Remember gym etiquette

Create an R project

Don’t make a mess! Use Projects
File > New Project > New Directory > New Project

Tracking your progress

Keep an R Notebook

A notebook is an R Markdown document intended for recording your analyses
Rather than commenting your code, you write documentation that contains code
Chunks are executed independently and interactively

What’s your max, bro?

Keeping research reproducible

Notebooks are an example of literate programming
Notebooks are reproducible documents
R Markdown generates publication-quality output
Share project as a git repository: do-the-right-thing

Data Exercise

Data exercise notebook

Before you leave…

Help us help you

Fill out a workshop evaluation so we can improve our programming cuny.is/gcdievals
Need more support with what you just learned? Request follow-up individual consultations at cuny.is/gcdfconsults (students) or cuny.is/gcfacultyconsults (faculty)
Drop in to PUG for support with Python or other programming questions. For current dates, visit cuny.is/pug
Join the GCDI Group on the CUNY Academic Commons for all GCDI-related updates! cuny.is/group-gcdi
Thank you for attending and being involved! The #DigitalGC is you!