airbnb_nyc_analysis.Rmd

---
title: "Beyond Spreadsheets: R for MBAs"
subtitle: "Exploring New York City's Airbnb Listings"
author: "Asif Mehedi | Batten Institute"
date: Oct 3, 2019
output:
  rmdformats::readthedown:
    self_contained: false
    lightbox: true
    gallery: true
    highlight: tango
    code_folding: show
    css: "style.css"
---

# Goals

- Understand how a coding-based analysis workflow is different from one based on spreadsheets
- Get familiar with R (a programming language) and RStudio (a desktop program where you write R code)
- Learn some powerful analytic techniques, especially useful for exploring a large-ish dataset
- Learn a few fundamental programming concepts (variable, function, data type)

We'll accomplish these objectives by taking a close look at an interesting dataset, one that contains detailed data on over 35,000 recent (August 2019) Airbnb listings, all in New York City.

It's okay if you can't start using these techniques right away after the workshop. To be comfortable with a tool like R takes some time and practice. Your exposure today to these tools and techniques will likely show you R's possibilities and motivate you to keep learning. I suggest some learning resources at the end of this document.

# RStudio

RStudio is a free program that makes writing R code much more enjoyable and efficient. Strictly speaking, you don't need it to write R code, but I highly recommend you use RStudio.

It has four main panes. This one (assuming you're reading this in RStudio) is the "code editor" (also known as the "script editor" or "source editor"). This is where we'll spend most of our time. I'll describe the other panes later.

## R Notebook

Before starting our analysis, let's understand how this document works. (I'm assuming you're reading this on RStudio.)

This document is an *R Notebook*. An R Notebook contains commentary interspersed with code chunks. The code chunks can be run (executed) independently and interactively, and the output appears below each chunk. R Notebooks are easy to convert to well-formatted final documents, such as a PDF file, a webpage, or an MS Word file. [More here](http://rmarkdown.rstudio.com/r_notebooks.html).

What you're reading is *commentary* and what you see below is the container for a *code* chunk.

```{r}


```

To run a code chunk, click on the little triangle at the right edge of the code chunk. Or, use the keyboard shortcut **CTRL + SHIFT + ENTER** (Windows) or **CMD + SHIFT + ENTER** (Mac).

🏏 Run the chunk below and see what happens.

```{r}
a <- 5 + 7
print(a)
```

Do you see some output below the code chunk? This way you can immediately see the result of your code as you work on your analysis.

(Do you know what's going on in the code above? We'll talk about variables and assignments later.)

Feel free to add your commentary and code anywhere in this document. You can always download the [unmodified version](https://github.com/asifm/workshops/archive/R-intro-2019.zip). 

To write your own code chunk, look for the **insert** button above and then select **R**. Or, use the keyboard shortcut **CTRL + ALT + I** (Windows) or **CMD + OPTION + I** (Mac). 

🏏 Place the cursor below and create a container for writing code. Now write some code and run it. For example, find the result of `3543 / 562`.


# The Data

The dataset contains almost all the Airbnb listings in NYC as of August 2019. The data came from [Inside Airbnb](http://insideairbnb.com/get-the-data.html).

It's always a good idea to approach a new dataset with a few basic questions. Some examples:

- What kind of data is here? What are the variables? How many observations?
- Who collected these data? How? Why?
- How accurate are these data?
- How complete are these data? Are there too many missing values?
- Do I have the legal rights to use these data? Under what conditions?

[Here's some background information](http://insideairbnb.com/behind.html) that may answer some of these questions.

Let's take a look at the data. Open the Excel file in the `data` folder. Browse around a bit.

🤔 Can you explain what each column is about?

🤔 What's the first thing you'd like to find out from these data?

🤔 What do you think about the quality of the data?


# The Context

Data analysis happens within a broader context. Usually, there's an overarching business, policy, or scientific question that one would like to answer. Often, though, the question is not very clear. Whatever the case, you need to approach the data with curiosity about the larger context.  

The more you understand the context of the data, the better will be your questions and hypotheses guiding your analysis.

For our dataset, we should have a good understanding of the business model of Airbnb as well as the economy and geography of NYC. 

🤔 Is there anything else we should know about?

## Airbnb

🤔 How much do I need to know about Airbnb and its business? Is my personal experience with Airbnb enough? What if I don't have any personal experience?

- [How Airbnb works](https://www.wikiwand.com/en/Airbnb#/How_it_works)
- [Example of an Airbnb listing](https://www.airbnb.com/rooms/2168594?s=frffIXWS)

## New York City

We can start with a map of the city.

![](https://i.imgur.com/se7fTYT.png)

The map gives us an idea about the relative size and location of the five boroughs of NYC. 

🤔 What else do we know about these boroughs? What about the neighborhoods within the boroughs?


# Prepare to Analyze

## Install and load packages

We'll first install a few packages that we'll need at different stages of the analysis. This may take a minute or two. 

Take a look at the code chunk below and note the `#`s at the start of some lines. A `#` at the start of a line means the line is a comment. These lines are not executed when you run a code chunk. You can use comments either to explain your code within a code chunk or to temporarily prevent a line of code from executing. The latter is known as "commenting out" a line of code.

If the lines containing `install.packages` are "commented out" below, please remove the `#`s before running the code chunk.

```{r, message=FALSE}
# For loading data
# install.packages("readr")

# # For data manipulation (we'll spend most time with this one)
# install.packages("dplyr")

# # For visualization
# install.packages("ggplot2")

# # To work with date and time
# install.packages("lubridate")
```

And then we load the packages for our current R session. After these packages are successfully loaded, all the functions in these packages will be available for our use.

```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
```

## Load and prepare data

We'll use the function `read_csv` — which comes from the package `readr` — to load data from a file and convert the data into a dataframe. A dataframe is a tabular data structure, with columns as variables and rows as observations. In this case, the dataframe looks exactly as it does in the CSV file.

🤔 What's a CSV file? 

🤔 What are some other formats used for storing data? 

🤔 What can I do if my data is in a format I'm not familiar with? [hint: Google, package, function] 

```{r}
df <- read_csv("data/airbnb_nyc_data.csv")
```

The first thing to check is whether the data have been successfully loaded and assigned to the variable (here `df`). We don't see any error — which is a good sign. We can also see the variable `df` in the environment pane (usually to the right of this code editor pane – click on "Environment" if hidden).

Next you should check whether `read_csv` was able to correctly infer the data type of each column.
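If `read_csv` guesses a type wrongly, one option is to state the type yourself with its `col_types` argument. Here is a minimal sketch using a tiny made-up CSV written to a temporary file. The file and its columns (`id`, `price`, `since`) are hypothetical and are not our Airbnb data.

```{r}
library(readr)

# A tiny, made-up CSV written to a temporary file (not the Airbnb data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,price,since",
             "1001,120,8/15/2019",
             "1002,85,7/1/2018",
             "1003,300,12/24/2017"),
           toy_path)

# Let read_csv guess every column type
toy_guessed <- read_csv(toy_path)

# Override the guess for id only; the other columns are still guessed
toy_typed <- read_csv(toy_path, col_types = cols(id = col_character()))

class(toy_guessed$id)  # "numeric": read_csv guessed a number
class(toy_typed$id)    # "character": we asked for text
```

An identifier such as `id` is a good candidate for text, because we never do arithmetic on it.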

For a better glimpse at the data, use the function `glimpse`.

🤔 How am I supposed to know what functions are out there?

```{r}
glimpse(df)
```

What you see within the angle brackets `<...>` next to each column name is the data type of that column. 

But what are data types? And why are they important? 

What exactly is a variable in programming? 

What is a function?

Let's take a detour.

# Detour: Programming Concepts

## Variable

In programming, a variable stores some value. Our code can then reference the name of the variable to do something with it. 

We use a left-facing arrow to assign a value to a variable.  The arrow looks like this: `<-` (the less-than sign followed by a hyphen).

For example, in the code below, `foo` and `bar` are variable names. We assigned the values `43` and `"kitten"` to `foo` and `bar` respectively. You can read them as `foo` gets `43` and `bar` gets `"kitten"`.


```{r}
foo <- 43
bar <- "kitten"
```

Let's do something with these variables.

```{r}
# Multiply foo by 10
print(foo * 10)

# Just see the value of bar
print(bar)
```

Important: The word `variable` can be ambiguous. You may be familiar with the term as used in statistics. For example, if you have some data about people, some variables there could be age, sex, income and address. The names of the columns of a dataframe are variables in this sense. Usually, the context would clarify which meaning is intended.

Tip: Give short descriptive names to your variables. This will make the code easy to follow.

🏏 A room is 11 yards long and 7.5 yards wide. Assign these values to `length` and `width` variables and then calculate the `area` of the room. You can, of course, give different names to these variables if you wish.

```{r}

```

## Data types

We also need to know a little about data types. These are the main ones:

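- Integer: whole numbers, such as 2, 543, 90. (In R, a number typed without a decimal point is still stored as a double unless you add the suffix `L`, as in `2L`.)
- Double: numbers other than integers, such as 4.56, 1/33. Doubles are stored with limited precision, so many of them are approximations.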
- Character: a sequence of letters, numbers, punctuation, etc., such as "rainbow", "34265". Character data are always written within quotes, either single or double (we'll always use double for consistency).
- Logical: data that can take on one of only two values — TRUE or FALSE.

🤔 Why is "34265" above of the type character? What's a real-world example of this kind of data?

There are some other important data types:

- `factor`, for categorical variables (here "variable" in the statistical sense). A categorical variable is one that can take on a limited, fixed number of values.
- `date-time` and `date`
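To see these types in action, you can ask R directly with `typeof` and `class`, both of which come with base R. The values below are made up for illustration.

```{r}
typeof(2L)        # "integer": the L suffix asks for an integer
typeof(2)         # "double": plain numbers are doubles by default
typeof(4.56)      # "double"
typeof("34265")   # "character": the quotes make it text, not a number
typeof(TRUE)      # "logical"

# class() is more informative for factors and dates
toy_sizes <- factor(c("small", "large", "small"))
class(toy_sizes)               # "factor"
levels(toy_sizes)              # "large" "small": the allowed values
class(as.Date("2019-08-15"))   # "Date"
```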

## Data structures

`vector`, `list`, `matrix`, and `dataframe` are the main data structures for storing data in R.

Think of a vector as simply a collection of data elements, where each element is of the same data type. 

This is how you create a vector: `c(element1, element2, element3, ...)`

```{r}
vec <- c(3, 5, 10, 20)
print(vec)
```
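The "same data type" rule has a practical consequence: if you mix types inside `c()`, R quietly converts every element to one common type. The short sketch below, with made-up values, contrasts a vector with a list and a dataframe.

```{r}
# Mixing types in c() forces everything to one type (here, character)
toy_mixed <- c(3, "kitten", TRUE)
print(toy_mixed)        # "3" "kitten" "TRUE"
typeof(toy_mixed)       # "character"

# A list keeps each element's own type
toy_list <- list(3, "kitten", TRUE)
typeof(toy_list[[1]])   # "double"

# A dataframe is a set of equal-length vectors, one per column
toy_frame <- data.frame(name = c("Ann", "Bo"), age = c(31, 27))
toy_frame$age           # a numeric vector: 31 27
```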


## Functions

You're likely familiar with Excel functions, such as `sum()` or `average()`. The concept of functions in a programming language is no different. You give some values (called parameters) to a function. The function does something with the values and returns a new value. So `sum(3, 6, 1)` returns 10. Here `sum` is the name of the function, 3, 6, 1 are parameters, and 10 is the returned value.

Most of the time, you'll use functions written by other people. But it's not too difficult to write your own functions. Here's a simple one. All it does is add 10 to a given number and then return the total.

```{r}
function(num){
  new_value <-  num + 10
  new_value
}
```

But we can't really use the function yet. We need to give it a name. Let's go with `add10`.

```{r}
add10 <- function(num){
  new_value <-  num + 10
  new_value
}
```

Now we can use this function whenever we need to add 10 to a number. (This is, of course, a toy example. In practice, a function would do much more work.)

```{r}
# Add 10 to 35
add10(35)
```

We won't write any function today. Rather, we'll use functions that R — and the packages we loaded — make available to us. Here's an example: the `mean` function, which comes preloaded in R.

```{r}
# First let's create a vector
vec2 <- c(23, 53, 11, 34, 87, 100, 5, 12, 66, 9, 87, 110, 20, 33, 54, 43, 76)

# Let's calculate the mean of the vector
calculated_mean <- mean(vec2)

# Then output it with some text description
message("The mean of the vector vec2 is ", calculated_mean)
```

🏏 Calculate the sum, median, and standard deviation (hint: `sd`) of the vector `vec2`.

```{r}

```


🏏 Let's load some new data. You'll see there are two other data files in the `data` folder: `babynames_2018.csv` and `babynames_1880.csv`. They contain the top 1000 names of babies born in 2018 and 1880, respectively. Load the 2018 data to a dataframe (hint: `read_csv`). Assign the dataframe to a variable named `df_babynames` (You can give a variable any name you like — but it's always a good idea to give variables meaningful, consistent names). 

```{r}

```

🏏 Now take a look at the dataframe's data types. (hint: `glimpse`)
```{r}

```

[Quiz Time]

# Back to Analysis

Let's take another look at the data types of our dataframe.

```{r}

glimpse(df)

```

Some of the data types don't look right. Let's correct them.

🤔 Why is it important to have the correct data type?

But, first, let's learn how to reference a column in a dataframe: first the name of the dataframe, then `$` and then the column name. For example: `df$cancellation_policy` or `df_babynames$name`.
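A column referenced with `$` is just a vector, so anything that works on a vector works on a column. Here is a small sketch with a made-up dataframe (`toy_names` and its values are hypothetical).

```{r}
toy_names <- data.frame(name = c("Ann", "Bo", "Cy"),
                        count = c(120, 90, 60))

toy_names$count         # the whole column, as a vector: 120 90 60
mean(toy_names$count)   # 90: vector functions work on a column
toy_names$count[2]      # 90: the second value in the column
```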

To change a data type, we apply the relevant function, such as `as.character` or `as.factor`, to a column and then assign the output of the function (values with the changed data type) to the same column. 

```{r}
df$host_id <- as.character(df$host_id)
df$host_response_rate <- as.double(df$host_response_rate)
df$property_type <- as.factor(df$property_type)
df$host_since <- mdy(df$host_since)
```

🤔 What is the rationale for each of these changes?

Side note: The last function, `mdy`, is different from the others. Normally, working with date-times in R (and in programming in general) is not simple. The `lubridate` package, which we loaded earlier, makes the job much easier. We won't go into the details of date-times. What we did here with `mdy` is convert character data into date data. We used `mdy` because the character data were in the form month-day-year. If they had been in the form year-month-day, for example, we would've used the function `ymd`.
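To make the difference between `mdy` and `ymd` concrete, here is a small sketch that parses the same made-up day written in two orders. Once the text has become a date, date arithmetic works.

```{r}
library(lubridate)

# The same day written in two different orders
toy_us  <- mdy("8/15/2019")    # month-day-year
toy_iso <- ymd("2019-08-15")   # year-month-day

toy_us == toy_iso              # TRUE: both are the same date
class(toy_us)                  # "Date"

# Once the values are dates, date arithmetic works
toy_iso - ymd("2019-08-01")    # a difference of 14 days
year(toy_us)                   # 2019
```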

🏏 Change the data type of any other variables you think necessary.
```{r}

```

## Missing values

Let's inspect our data for missing values using the function `summary`, which will give us plenty of other relevant information as well.

```{r}
summary(df)

```

🤔 Should we worry about missing values?

There are a few options for dealing with missing values.

- We can drop all rows that have one or more missing values.
- Drop rows that have missing values in particular columns.
- Replace the missing values with some other value.
- Do nothing, but do remember to take care of them when running arithmetic operations. (explained later)

In our case, we'll go with the last option.
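For reference, the four options could look like this in code. This is only a sketch on a made-up four-row dataframe (`toy_na` is hypothetical), and replacing missing values with 0 is an arbitrary choice made for illustration.

```{r}
library(dplyr)

toy_na <- data.frame(price  = c(100, NA, 80, 120),
                     rating = c(9, 10, NA, 8))

# 1. Drop every row that has a missing value in any column
toy_complete <- na.omit(toy_na)

# 2. Drop rows with a missing value in one particular column
toy_priced <- filter(toy_na, !is.na(price))

# 3. Replace missing values with another value (0 is an arbitrary choice)
toy_filled <- mutate(toy_na, rating = ifelse(is.na(rating), 0, rating))

# 4. Keep the NAs, and skip them at calculation time
toy_mean_price <- mean(toy_na$price, na.rm = TRUE)

nrow(toy_complete)   # 2
nrow(toy_priced)     # 3
toy_mean_price       # 100
```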

🤔 Is the last option the best for our dataset? What would be a better alternative?

## A Set of Verbs

Finally, we're ready to dive into the data. We'll use six functions — think of them as six verbs — to slice and dice the data in all kinds of ways. These functions come from the `dplyr` package. They are:

- filter
- select
- mutate
- arrange
- summarize
- group_by

We'll first learn these functions one by one and later we'll learn how to combine them for rich analysis. 

For each of these functions, we provide the name of the dataframe as the first parameter. The subsequent parameters inform what the function is to do with the dataframe.

### Verb: filter

You use `filter` when you want a subset of the rows based on one or more conditions. The returned rows will be those that meet the condition(s).

The syntax is: `filter(name_of_dataframe, condition1, condition2, ...)`

Use these *comparison and logical operators* to create the conditions:

    <       less than
    <=      less than or equal to
    >       greater than
    >=      greater than or equal to
    ==      exactly equal to
    !=      not equal to
    !x      Not x
    x | y   x OR y
    x & y   x AND y



Example: Return a dataframe with only those rows where the neighborhood is Chelsea.

```{r}
df_chelsea <- filter(df, neighborhood == "Chelsea")
```

Example: Return a dataframe with only those rows where the price is more than 8000.
```{r}
filter(df, price > 8000)
```

You can combine multiple conditions.

Example: Return a dataframe with rows where the borough is Manhattan and the price is less than 50.
```{r}
filter(df, borough == "Manhattan", price < 50)
```

Example: Return a dataframe with rows where cancellation_policy is flexible *or* accommodates more than 2.
```{r}
filter(df, cancellation_policy == "flexible" | accommodates > 2 )
```

Example: Return a dataframe with rows where number_of_reviews is *not* 0
```{r}
filter(df, number_of_reviews != 0)
```


Note: when multiple conditions are separated by commas, the conditions are assumed to have *and* logical operator between them. Using `&` instead would be the same thing. For *or* and other logical operators, we have to explicitly use the appropriate logical operator (e.g. `|` for *or*).
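The sketch below checks this on a made-up four-row dataframe (`toy_listings` is hypothetical, not our data): the comma and `&` return the same rows, while `|` returns more.

```{r}
library(dplyr)

toy_listings <- data.frame(
  borough = c("Manhattan", "Brooklyn", "Manhattan", "Queens"),
  price   = c(45, 60, 300, 40),
  stringsAsFactors = FALSE
)

# Comma and & give the same rows: both conditions must hold
toy_comma <- filter(toy_listings, borough == "Manhattan", price < 50)
toy_and   <- filter(toy_listings, borough == "Manhattan" & price < 50)

# | keeps a row when at least one condition holds
toy_or    <- filter(toy_listings, borough == "Manhattan" | price < 50)

nrow(toy_comma)   # 1
nrow(toy_and)     # 1
nrow(toy_or)      # 3
```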

🏏 Return a dataframe with rows where number of bedrooms is not 1 and property_type is House

```{r}

```

🏏 Return a dataframe with rows where host_response_time is within an hour or host_response_rate is more than 90% or calendar was updated today

```{r}

```

### Verb: select

Our second verb is `select`, which is used to return a subset of the columns.

The syntax is: `select(name_of_dataframe, name_of_column1, name_of_column2, ...)`

There are some variations to this syntax, which come in handy in certain situations.

- If you want all columns except one: `select(name_of_dataframe, -column_to_exclude)`
- If you want all columns except two: `select(name_of_dataframe, -c(column_to_exclude1, column_to_exclude2))`
- If you want the third through eighth columns: `select(name_of_dataframe, 3:8)`
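Here is how those variations behave on a made-up dataframe with four columns (`toy_cols` is hypothetical). Wrapping each call in `names` shows only the column names that come back.

```{r}
library(dplyr)

toy_cols <- data.frame(id = 1:2, price = c(90, 150),
                       beds = c(1, 2), baths = c(1, 1.5))

names(select(toy_cols, -id))             # "price" "beds" "baths"
names(select(toy_cols, -c(id, baths)))   # "price" "beds"
names(select(toy_cols, 2:3))             # "price" "beds"
```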

Example: Return a dataframe with only the listing_url column.

```{r}
select(df, listing_url)
```

Example: Return a dataframe with columns host_response_time, cancellation_policy, and neighborhood.

```{r}
select(df, host_response_time, cancellation_policy, neighborhood)
```

Example: Return a dataframe with second through fourth columns

```{r}
select(df, 2:4)
```

🏏 Return a dataframe with property_type, room_type, and zip_code columns.

```{r}

```

🏏 Return a dataframe with the last 23 columns (sixth through 28th)

```{r}

```

🏏 What would be another way of getting the same dataframe?

```{r}

```


### Piping

Before I introduce the third verb, let's take another detour. 

You can't accomplish a lot with `filter` and `select` in isolation. Let's combine them. 

Here, we're creating a new dataframe `df_expensive` by using `filter` and then we're selecting some columns of interest from that dataframe.

```{r}
df_expensive <- filter(df, price > 8000)
select(df_expensive, price, neighborhood, listing_url)
```

🤔 More than $8000 per night? What's going on here? How can we find out more? What should we do with this kind of outliers?

The above code could be written more succinctly by using pipe `%>%`.

```{r}
df %>% 
  filter(price > 8000) %>% 
  select(price, neighborhood, listing_url)
```

Let's try to understand how the pipe works.

`filter` and `select` both receive a dataframe as their first parameter (this is true for the other four main verbs as well). Also, you may have noticed that `filter` and `select` return a dataframe (again, this is true for the others as well). Thanks to this, we can use the pipe to string together several verbs to form a complex query.

```{r}
filter(df, price > 8000)
```

is the same as

```{r}
df %>% filter(price > 8000)
```

Similarly,

```{r}
select(df, price, review_scores_rating, listing_url)
```

is the same as

```{r}
df %>% select(price, review_scores_rating, listing_url)
```

The left side of the pipe is used as the first parameter of the function on the right side of the pipe. 

This is how `filter`, `select`, and similar functions work in a pipeline: each one receives a dataframe, does something to it, and returns a new dataframe, which the pipe passes on to the next function.
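The pipe is not limited to dataframes. The sketch below uses a plain made-up vector and two base R functions to show that a piped expression and a nested one give the same result.

```{r}
library(dplyr)  # loading dplyr also makes the pipe %>% available

toy_numbers <- c(1, 4, 9)

# Nested calls are read from the inside out
toy_nested <- sum(sqrt(toy_numbers))

# The piped version reads left to right: take the numbers, then sqrt, then sum
toy_piped <- toy_numbers %>% sqrt() %>% sum()

toy_nested   # 6
toy_piped    # 6

# Extra parameters go after the piped-in first one: same as round(3.14159, 2)
3.14159 %>% round(2)   # 3.14
```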

Here's again the piped expression. Make sure you understand what's going on.

```{r}
df %>% 
  filter(price > 8000) %>% 
  select(price, review_scores_rating, listing_url)
```

### Verb: arrange

`arrange` is our third verb. (Note that I'm using verb and function interchangeably.) `arrange` sorts a dataframe based on the values of one or more columns.

Example: Get the same dataframe as the last code chunk. Sort it by price (ascending order is the default).

```{r}
df %>% 
  filter(price > 8000) %>% 
  select(price, review_scores_rating, listing_url) %>%
  arrange(price)
```

Example: The same dataframe, but now in descending order of price.

```{r}
df %>% 
  filter(price > 8000) %>% 
  select(price, review_scores_rating, listing_url) %>%
  arrange(desc(price))
```

Example: You can add more columns to `arrange`. Now when price is the same, the order will be based on neighborhood.

```{r}
df %>% 
  filter(price > 8000) %>% 
  select(price, neighborhood, listing_url) %>% 
  arrange(desc(price), neighborhood)
```

🏏 Return the whole dataframe `df`, in descending order of number of bedrooms and where number of bedrooms are equal, descending order of review_scores_rating

```{r}

```

### Verb: mutate

`mutate` creates a new column computed from one or more existing columns. It's better explained by examples.

Example:

```{r}
mutate(df, price_per_person = price / accommodates)
```
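Like the other verbs, `mutate` returns a new dataframe and leaves the original untouched. To keep the new column, assign the result to a variable. Here is a sketch with a made-up two-row dataframe (`toy_stays` and `toy_stays2` are hypothetical names).

```{r}
library(dplyr)

toy_stays <- data.frame(price = c(100, 240), accommodates = c(2, 4))

# Without assignment, the result is printed and then discarded
mutate(toy_stays, price_per_person = price / accommodates)
names(toy_stays)              # still "price" "accommodates"

# Assign the result to keep the new column
toy_stays2 <- mutate(toy_stays, price_per_person = price / accommodates)
toy_stays2$price_per_person   # 50 60
```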


🏏 Create a new review_rating column, which is an average of review_scores_accuracy and review_scores_value.

```{r}

```

### Verb: summarize (or summarise)

`summarize`, as you'd expect, gives some kind of summary (such as mean, median, min, max) of the values of a given column.

Example: What is the total number of people that can stay in NYC Airbnbs?

```{r}
summarize(df, total_accommodates = sum(accommodates))
```

Here, total_accommodates is a name I've given to the summary number.

The common summary functions are:

- sum
- n (count)
- mean
- median
- sd (standard deviation)
- var (variance)
- range
- min
- max

Note: The function `n` doesn't need a parameter because it just counts the number of rows in the dataframe. This will become clearer in the examples.

We can also write our own function and use it here. But that is a more advanced topic.

Example: What is the mean rating for accuracy?

```{r}
summarize(df, mean_accuracy = mean(review_scores_accuracy))
```

Hmm, not exactly what we expected. This is because any arithmetic operation involving `NA` (that is, *Not Available* or missing values) results in `NA`. 

```{r}
5 + NA
```

Or,

```{r}
NA / 33
```

The review_scores_accuracy had several `NA`s. The correct way of doing this would be:

```{r}
summarize(df, mean_accuracy = mean(review_scores_accuracy, na.rm = TRUE))
```

`na.rm = TRUE` means "remove all NAs before performing arithmetic operations."

When we summed the accommodates column earlier, we didn't add `na.rm = TRUE` because that column didn't have any NAs. 
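One quick way to check whether a column has any `NA`s is to count them with `is.na`. The sketch below uses a made-up vector of scores and base R functions only.

```{r}
toy_scores <- c(10, NA, 8, 9, NA)

is.na(toy_scores)        # FALSE TRUE FALSE FALSE TRUE
sum(is.na(toy_scores))   # 2: each TRUE counts as 1 when summed
anyNA(toy_scores)        # TRUE: at least one value is missing

sum(toy_scores)                  # NA
sum(toy_scores, na.rm = TRUE)    # 27
```

The same check works on a dataframe column, for example `sum(is.na(df$accommodates))`.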

Example: We can have multiple summaries at the same time.

```{r}
summarize(df, min_score_value = min(review_scores_value, na.rm = TRUE), 
          max_score_value = max(review_scores_value, na.rm = TRUE))
```

Example: What's the total number of listings in this dataset?
```{r}
df %>% 
  summarize(count = n())
```


🏏 The host who's been with Airbnb NYC for the longest time started on which date?

```{r}

```

🏏 Find the standard deviation and mean of the price variable.

```{r}

```

🏏 Count the number of unique zip_codes in the data. Hint: `distinct` and `n`.

```{r}

```

🏏 How many listings have a perfect 100 for review_scores_rating (which is the overall rating shown as stars on the listing's webpage)?

```{r}

```

🏏 How many listings have a perfect 100 of review_scores_rating and cost less than 50?

```{r}

```


### Verb: group_by

`group_by` is one of the most useful functions. It divides the data into different groups and then `summarize` summarizes each of the groups. This way, we can compare the different groups on various attributes.
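To see what grouping changes, compare a summary with and without `group_by` on a made-up five-row dataframe (`toy_rooms` is hypothetical, not our data).

```{r}
library(dplyr)

toy_rooms <- data.frame(
  borough = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  price   = c(50, 70, 80, 100, 150),
  stringsAsFactors = FALSE
)

# Without group_by: one summary row for the whole dataframe
toy_overall <- summarize(toy_rooms, mean_price = mean(price), listings = n())

# With group_by: one summary row per borough
toy_by_borough <- toy_rooms %>%
  group_by(borough) %>%
  summarize(mean_price = mean(price), listings = n())

toy_overall       # mean_price 90, listings 5
toy_by_borough    # Bronx: 60 and 2; Queens: 110 and 3
```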

Example: What's the mean and median price in each of the five boroughs?

```{r}
df %>% 
  group_by(borough) %>% 
  summarize(mean_price = mean(price), median_price = median(price))
```

🏏 Return the same dataframe but sorted by mean price.

```{r}

```

Example: Which are the top 10 neighborhoods by total number of listings?

```{r}
df %>% 
  group_by(neighborhood) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>% 
  arrange(desc(count))
```

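We introduce a new function here, `top_n`, which returns the n rows with the largest values of some variable (or the n rows with the smallest values, when n is negative). In the code chunk above, if we had used `top_n(-10, wt = count)`, we would've got the bottom 10 rows. Note that `top_n` only picks the rows and keeps ties; it doesn't sort them, which is why we still need `arrange`.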
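Here is a small sketch of `top_n` on a made-up table of counts (`toy_counts` is hypothetical). It shows a positive n, a negative n, and why `arrange` is still needed for a ranked table.

```{r}
library(dplyr)

toy_counts <- data.frame(neighborhood = c("A", "B", "C", "D"),
                         count = c(5, 9, 2, 7),
                         stringsAsFactors = FALSE)

# Two largest counts; rows come back in their original order (B, then D)
toy_top <- top_n(toy_counts, 2, wt = count)

# Two smallest counts (A and C)
toy_bottom <- top_n(toy_counts, -2, wt = count)

# top_n does not sort, so add arrange for a ranked table
arrange(toy_top, desc(count))
```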

Example: Which are the top two neighborhoods in each of the boroughs by number of listings?

```{r}
df %>% 
  group_by(borough, neighborhood) %>% 
  summarize(count = n()) %>% 
  top_n(2, wt = count) %>% 
  arrange(desc(count))
```
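Why does `top_n` pick two rows per borough here, rather than two rows overall? After grouping by two columns, `summarize` removes only the last grouping level, so its result is still grouped by borough. The sketch below shows this on a made-up dataframe (`toy_places` is hypothetical). Depending on your version of dplyr, `summarize` may also print a message about the remaining grouping.

```{r}
library(dplyr)

toy_places <- data.frame(
  borough      = c("A", "A", "A", "B", "B"),
  neighborhood = c("x", "x", "y", "z", "z"),
  stringsAsFactors = FALSE
)

toy_grouped <- toy_places %>%
  group_by(borough, neighborhood) %>%
  summarize(count = n())

# summarize() dropped the last grouping level (neighborhood),
# so the result is still grouped by borough
group_vars(toy_grouped)            # "borough"

# ungroup() removes the remaining grouping when you want whole-table results
group_vars(ungroup(toy_grouped))   # character(0)
```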


🏏 Do the same as above but only for listings where price is more than 500 and property_type is Apartment.

```{r}

```

Example: In Manhattan, which neighborhood has the highest mean price?

```{r}
df %>%
  filter(borough == "Manhattan") %>% 
  group_by(neighborhood) %>% 
  summarize(mean_price = mean(price)) %>% 
  arrange(desc(mean_price))
```

🏏 In the Bronx and Queens, what's the mean price for each property_type?

```{r}

```

🏏 For each borough, which neighborhood has the highest variance in price?

```{r}

```


🏏 Which combination of property_type and room_type has the highest median price?

```{r}

```

🏏 Considering only the Brooklyn listings that have review_scores_value 9 or 10, which neighborhood, out of those neighborhoods with at least 100 listings, has the lowest median price?

```{r}

  
```

# Explore Visually

We're going to use the ggplot2 package, which we loaded earlier, for some visual exploration of the data. Actually, this is how we should have begun our data exploration. We couldn't do this because we needed to learn the six verbs first to get the full benefit of ggplot2.

Unfortunately, we won't have time today to learn ggplot2. It may be possible to organize a separate session later on visualization in R.

Example: What's the distribution of price?

```{r}
ggplot(df) + geom_histogram(aes(price))
```

🏏 What's the problem with this histogram? How can we solve this?

```{r}

```

Example: Compare price distribution of different boroughs.

```{r}
df %>% 
  filter(price < 500) %>% 
  ggplot() + geom_histogram(aes(x = price, fill = borough))
```

Example: Do the same, but with boxplots

```{r}
df %>% 
  filter(price < 500) %>% 
  ggplot() + geom_boxplot(aes(x = borough, y = price))
```

Example: Which neighborhoods have the highest number of listings that have review_scores_value less than 5?

```{r}
df %>% 
  filter(review_scores_value < 5) %>% 
  group_by(borough, neighborhood) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  top_n(5, wt = count) %>% 
  ggplot() + geom_col(aes(x = reorder(neighborhood, count), y = count)) +
  coord_flip()

```


# Keep Learning

- If you like `dplyr` and the other tidyverse packages we used in this workshop, [R for Data Science](https://r4ds.had.co.nz) is an excellent resource. Hadley Wickham, the author of the tidyverse packages, wrote that book.

- [Swirl](https://swirlstats.com) teaches R and data science interactively right on the console. Follow the instructions [here](https://swirlstats.com/students.html).

- To learn how to do statistics using R, you can watch [this YouTube video](https://www.youtube.com/watch?v=_V8eKsto3Ug)

- To learn R's programming concepts (which we didn't cover in detail today), you can watch [this YouTube video](https://www.youtube.com/watch?v=fDRa82lxzaU)

- [An Introduction to R](https://cran.r-project.org/doc/manuals/r-release/R-intro.html) is perhaps the most comprehensive introduction. *But it's not the best place to start if you're a complete beginner.*

## Download Cheatsheets

- [RStudio](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)
- [Data transformation with dplyr](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
- [Work with strings](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)
- [Data import](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)
- [R Markdown](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf)
- [Data visualization with ggplot2](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)