Add workflow advice from @jennybc
hadley committed Aug 18, 2016
1 parent 4cf15b5 commit 93179cb
Showing 14 changed files with 353 additions and 22 deletions.
3 changes: 3 additions & 0 deletions _bookdown.yml
@@ -6,8 +6,11 @@ rmd_files: [

"explore.Rmd",
"visualize.Rmd",
"workflow-basics.Rmd",
"transform.Rmd",
"workflow-scripts.Rmd",
"EDA.Rmd",
"workflow-projects.Rmd",

"wrangle.Rmd",
"tibble.Rmd",
46 changes: 25 additions & 21 deletions intro.Rmd
@@ -1,3 +1,20 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE,
out.width = "70%",
fig.align = 'center',
fig.width = 6,
fig.asp = 0.618, # 1 / phi
fig.show = "hold"
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# Introduction

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
@@ -115,27 +132,7 @@ RStudio is an integrated development environment, or IDE, for R programming. The
knitr::include_graphics("diagrams/intro-rstudio.png")
```

You run R code in the __console__ pane. Textual output appears inline, and graphical output appears in the __output__ pane. You write more complex R scripts in the __editor__ pane.

There are three keyboard shortcuts for the RStudio IDE that we strongly encourage that you learn because they'll save you so much time:

* Cmd/Ctrl + Enter: sends the current line (or current selection) from the editor to
the console and runs it.

* Tab: suggest possible completions for the text you've typed.

* Cmd/Ctrl + ↑: in the console, searches all commands you've typed that start with
those characters.

If you want to see a list of all keyboard shortcuts, use the meta shortcut Alt + Shift + K: that's the keyboard shortcut to show all the other keyboard shortcuts!

We strongly recommend making two changes to the default RStudio options:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```

This ensures that every time you restart RStudio you get a completely clean slate. That's good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu option Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
For now, all you need to know is that you type R code in the console pane, and press enter to run it. You'll learn more as we go along!

### R packages

@@ -227,6 +224,10 @@ This book isn't just the product of Hadley and Garrett, but is the result of man

* Jenny Bryan and Lionel Henry for many helpful discussions around working
with lists and list-columns.

* The three chapters on workflow were adapted (with permission) from
<http://stat545.com/block002_hello-r-workspace-wd-project.html> by
Jenny Bryan.

* Genevera Allen for discussions about models, modelling, the statistical
learning perspective, and the difference between hypothesis generation and
@@ -238,6 +239,9 @@ This book isn't just the product of Hadley and Garrett, but is the result of man
* Bill Behrman for his thoughtful reading of the entire book, and for trying
it out with his data science class at Stanford.

* The \#rstats twitter community who reviewed all of the draft chapters
and provided tons of useful feedback.

This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks goes to everyone who contributed via GitHub:

```{r, results = "asis", echo = FALSE, message = FALSE}
Binary file added screenshots/rstudio-diagnostic-tip.png
Binary file added screenshots/rstudio-diagnostic-warn.png
Binary file added screenshots/rstudio-diagnostic.png
Binary file added screenshots/rstudio-env.png
Binary file added screenshots/rstudio-project-1.png
Binary file added screenshots/rstudio-project-2.png
Binary file added screenshots/rstudio-project-3.png
Binary file added screenshots/rstudio-wd.png
3 changes: 2 additions & 1 deletion transform.Rmd
@@ -558,7 +558,8 @@ flights %>%
In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We'll save this dataset so we can reuse it in the next few examples.

```{r}
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
153 changes: 153 additions & 0 deletions workflow-basics.Rmd
@@ -0,0 +1,153 @@
# Workflow: basics

You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about the most helpful RStudio features.

Let's review the basics: you can use R as a calculator:

```{r}
1 / 200 * 30
(59 + 73 + 2) / 3
```

And you can create new objects with `<-`:

```{r}
x <- 3 * 4
```

All R statements where you create objects, __assignment__ statements, have the same form:

```{r eval = FALSE}
object_name <- value
```

When reading that code say "object_name gets value" in your head.

You will make lots of assignments and the operator `<-` is a pain to type. Don't be lazy and use `=`. It will work, but it will sow confusion later. Instead, use RStudio's keyboard shortcut: Alt + - (the minus sign). RStudio offers many handy keyboard shortcuts. To get the full list, use the one keyboard shortcut to rule them all: Alt + Shift + K brings up a keyboard shortcut reference card.

Notice that RStudio automagically surrounds `<-` with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.

Object names must start with a letter, and cannot contain characters like commas or spaces. You want your object names to be descriptive, so it's a good idea to adopt a convention for demarcating words in names. I recommend __snake_case__ where you separate lowercase words with `_`.

```{r, eval = FALSE}
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_HATEconventions
```

We'll come back to code style in [functions].
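
If you're unsure whether a name is allowed, one rough check is to compare it with what base R's `make.names()` would turn it into. This is only a sketch: `is_valid_name()` is a hypothetical helper written for this illustration, not a function used elsewhere in the book.

```{r}
# Hypothetical helper: a name is syntactically valid if make.names()
# leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

is_valid_name("i_use_snake_case")
#> [1] TRUE
is_valid_name("has space")   # spaces are not allowed
#> [1] FALSE
is_valid_name("2fast")       # names can't start with a digit
#> [1] FALSE
```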

You can inspect an object by typing its name:

```{r}
x
```


Make another assignment:

```{r}
this_is_a_really_long_name <- 2.5
```

To inspect this object, try out RStudio's completion facility: type "this", press TAB, add characters until you have a unique prefix, then press return.

Oops, you made a mistake! `this_is_a_really_long_name` should have value 3.5 not 2.5. Use another keyboard shortcut to help you fix it. Type "this" then press Cmd/Ctrl + ↑. That will list all the commands you've typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.

Make yet another assignment:

```{r}
r_rocks <- 2 ^ 3
```

Let's try to inspect it:

```{r, error = TRUE}
r_rock
R_rocks
```

There's an implicit contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters. Improving your touch typing skills will pay off!
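
If you're not sure whether you've typed a name exactly right, base R's `exists()` reports whether an object with precisely that name is defined. A small sketch, reusing `r_rocks` from above and assuming nothing else in your session happens to be called `r_rock` or `R_rocks`:

```{r}
r_rocks <- 2 ^ 3

exists("r_rocks")
#> [1] TRUE
exists("r_rock")    # missing the final "s"
#> [1] FALSE
exists("R_rocks")   # wrong case
#> [1] FALSE
```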

R has a large collection of built-in functions that are called like this:

```{r eval = FALSE}
functionName(arg1 = val1, arg2 = val2, and so on)
```

Let's try using `seq()` which makes regular sequences of numbers and, while we're at it, learn more helpful features of RStudio.

Type `se` and hit TAB. A pop up shows you possible completions. Specify `seq()` by typing more (a "q") to disambiguate or using the up/down arrows to select. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.

Press TAB once more when you've selected the function you want. RStudio will add matching opening (`(`) and closing (`)`) parentheses for you. Type the arguments `1, 10` and hit return.

```{r}
seq(1, 10)
```
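
The tooltip lists more arguments than the two used here. As a brief sketch of what they do, `by` sets the step size and `length.out` sets how many values you get back (both are documented arguments of base R's `seq()`):

```{r}
seq(1, 10, by = 3)          # step of 3 between values
#> [1]  1  4  7 10
seq(0, 1, length.out = 5)   # 5 evenly spaced values
#> [1] 0.00 0.25 0.50 0.75 1.00
```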

Type this code and notice that you get similar assistance with the paired quotation marks:

```{r}
x <- "hello world"
```

Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it's still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character "+":

```
> x <- "hello
+
```

The `+` tells you that R is waiting for more input; it doesn't think you're done yet. Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.

If you make an assignment, you don't get to see the value. You're then tempted to immediately double-check the result by inspecting the object:

```{r}
y <- seq(1, 10, length.out = 5)
y
```

This common action can be shortened by surrounding the assignment with parentheses, which causes assignment and "print to screen" to happen.

```{r}
(y <- seq(1, 10, length.out = 5))
```
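
Why do the parentheses make a difference? An assignment returns its value invisibly, while `(` returns its argument visibly, so the console prints it. A small sketch using base R's `withVisible()` to peek at that visibility flag:

```{r}
# The bare assignment returns its value invisibly, so nothing prints
withVisible(y <- seq(1, 10, length.out = 5))$visible
#> [1] FALSE

# Wrapping it in parentheses makes the value visible, so it prints
withVisible((y <- seq(1, 10, length.out = 5)))$visible
#> [1] TRUE
```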

Now look at your environment in the upper right pane:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rstudio-env.png")
```

The environment is where user-defined objects accumulate.
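
You can query the same information from code. As a minimal sketch, base R's `ls()` lists the names of the objects in the current environment, and `rm()` removes ones you no longer need:

```{r}
x <- "hello world"
y <- seq(1, 10, length.out = 5)

ls()            # names of the objects defined so far
"y" %in% ls()
#> [1] TRUE

rm(y)           # remove y from the environment
"y" %in% ls()
#> [1] FALSE
```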

## Practice

1. Why does this code not work?

```{r, error = TRUE}
my_variable <- 10
my_varıable
```
Look carefully! (This may seem like an exercise in pointlessness, but
training your brain to notice even the tiniest difference will pay off
when programming.)
1. Tweak each of the following R commands so that they run correctly:
```{r, eval = FALSE}
library(ggplot2)
library(dplyr)
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
mpg %>%
fliter(cyl = 8)
diamond %>%
filter(carat > 3)
```
111 changes: 111 additions & 0 deletions workflow-projects.Rmd
@@ -0,0 +1,111 @@
# Workflow: projects

One day you will need to quit R, go do something else and return to your analysis later. One day you will have multiple analyses going that use R and you want to keep them separate. One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.

To handle these real life situations, you need to make two decisions:

1. What about your analysis is "real", i.e. you will save it as your
lasting record of what happened?

1. Where does your analysis "live"?

## What is real?

As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real". However, in the long-run, you'll be much better off if you consider your R scripts as "real". With the input data and the R code you used, you can reproduce _everything_. You can make your analysis fancier. You can get to the bottom of puzzling results and discover and fix bugs in your code. You can reuse the code to conduct similar analyses in new projects. You can remake a figure with a different aspect ratio or save it as a TIFF instead of a PDF. You are ready to take questions. You are ready for the future.

If you regard your environment as "real" (saving and reloading all the time), it's hard to reproduce an analysis after the fact. You'll either need to retype a lot of code (making mistakes all the way) or will have to mine your R history for the commands you used. Rather than [becoming an expert on managing the R history](https://support.rstudio.com/hc/en-us/articles/200526217-Command-History), a better use of your time and psychic energy is to keep your "good" R code in a script for future reuse.

To foster this behaviour, I highly recommend that you tell RStudio not to preserve your workspace between sessions:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```

This ensures that every time you restart RStudio you get a completely clean slate. That's good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.

There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:

1. Press Cmd/Ctrl + Shift + F10 to restart R.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.

I do this probably hundreds of times a day.

## Where does your analysis live?

R has a powerful notion of the __working directory__. This is where R looks, by default, for files that you ask it to load, and where it will put any files that you save to disk. RStudio shows your current working directory at the top of the console:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rstudio-wd.png")
```

And you can print this out in R code by running `getwd()`:

```{r eval = FALSE}
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```

As a beginning R user, it's OK to let your home directory or any other weird directory on your computer be R's working directory. But _very soon_ you should evolve to organising your analytical projects into directories and, when working on project A, set R's working directory to the associated directory.

__Although I do not recommend it__, in case you're curious, you can set R's working directory at the command line like so:

```{r eval = FALSE}
setwd("~/myCoolProject")
```

But there's a better way. A way that also puts you on the path to managing your R work like an expert.

## RStudio projects

Keeping all the files associated with a project organized together -- input data, R scripts, analytical results, figures -- is such a wise and common practice that RStudio has built-in support for this via its _projects_.

[Using Projects](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects)

Let's make one for you to use for the rest of this book. Click File > New Project, then:

```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```

Call your project `r4ds`.

Once this process is complete, you'll get a new RStudio project that is just for this book. Check that the "home" directory for your project is the working directory of your current R process:

```{r eval = FALSE}
getwd()
#> [1] ~/Desktop/r4ds
```

Now, whenever you refer to a file (sans directory), R will look for it in this directory.
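
To make that concrete, here is a small sketch of how a bare file name relates to the working directory. The file name `diamonds.csv` is only an illustration at this point; it doesn't need to exist for the code to run.

```{r}
# A bare file name such as "diamonds.csv" is resolved against getwd(),
# so these two refer to the same location on disk
full_path <- file.path(getwd(), "diamonds.csv")
full_path

# TRUE only once a file with that name has been saved in the working directory
file.exists("diamonds.csv")
```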

Now enter the following commands in the script editor, then save the file, calling it "diamonds.R". Next, run the complete script which will save a pdf and csv file into your project directory. Don't worry about the details --- you'll learn them later in the book.

```{r toy-line, eval = FALSE}
library(ggplot2)
library(readr)
ggplot(diamonds, aes(carat, price)) +
geom_hex()
ggsave("diamonds-hex.pdf")
write_csv(diamonds, "diamonds.csv")
```
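
After running a script like this, it's worth confirming that the files landed where you expected. The sketch below uses only base R and a throwaway directory, with `mtcars` and `demo.csv` standing in for `diamonds` and `diamonds.csv` (both stand-ins are assumptions made for illustration), so it runs without ggplot2 or readr and leaves your project untouched.

```{r}
# Work in a scratch directory so the example doesn't clutter the project
out_dir <- tempfile("r4ds-demo-")
dir.create(out_dir)

# Save a small built-in dataset, then check that the file exists
csv_path <- file.path(out_dir, "demo.csv")
write.csv(head(mtcars), csv_path, row.names = FALSE)

file.exists(csv_path)
#> [1] TRUE
list.files(out_dir)
#> [1] "demo.csv"
```

In your own project you would call `file.exists()` or `list.files()` with the names your script actually writes.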

Quit RStudio. Inspect the folder associated with your project --- notice the `.Rproj` file. You can click on that to re-open the project in the future (using projects even allows you to have multiple instances of RStudio open at the same time). Maybe view the PDF in an external viewer.

Restart RStudio. Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open. You will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.

In your favorite OS-specific way, search your computer for `diamonds-hex.pdf` and presumably you will find the PDF (no surprise) but _also the script that created it_ (`diamonds.R`). This is a huge win! One day you will want to remake a figure or just simply understand where it came from. If you rigorously save figures to file __with R code__ and never with the mouse or the clipboard, you will be able to reproduce old work with ease!

## Overall workflow

RStudio projects give you a solid workflow that will serve you well in the future:

* Create an RStudio project for each data analysis project.
* Keep data files there; we'll talk about importing them a bit later in [import].
* Keep scripts there; edit them, run them in bits or as a whole.
* Save your outputs there.

Everything you need is in one place, and cleanly separated from all the other projects that you are working on.