forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintro.Rmd
147 lines (95 loc) · 8.26 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
layout: default
title: Welcome
output: bookdown::html_chapter
---
# Welcome
## Overview
The goal of "R for Data Science" is to give you a solid foundation into using R to do data science. The goal is not to be exhaustive, but to instead focus on what we think are the critical skills for data science:
* Getting your data into R so you can work with it.
* Wrangling your data into a tidy form, so it's easier to work with. This let's you
spend your time struggling with your questions, not fighting to get data
into the right form for different functions.
* Manipulating your data to add variables and compute basic summaries. We'll
show you the broad tools, and focus on three common types of data: numbers,
strings, and date/times.
* Visualising your data to gain insight. Visualisations are one of the most
important tools of data science because they can surprise you: you can
see something in a visualisation that you did not expect. Visualisations
are also really helpful for helping you refine your questions of the data.
* Modelling your data to scale visualisations to larger datasets, and to
remove strong patterns. Modelling is a very deep topic - we can't possibly
cover all the details, but we'll give you a taste of how you can use it,
and where you can go to learn more.
* Communicating your results to others. It doesn't matter how great your
analysis is unless you can communicate the results to others. We'll show
how you can create static reports with rmarkdown, and interactive apps with
shiny.
## Learning data science
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, however, is not the order you'll encounter them in this book. This is because:
* Starting with data ingest is boring. It's much more interesting to learn
some new visualisation and manipulation tools on data that's already been
imported and cleaned. You'll later learn the skills to apply these new ideas
to your own data.
* Some topics, like modelling, are best explained with other tools, like
visualisation and manipulation. These topics need to come later in the book.
We've honed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like teaching is the art of tricking people to do what's in their own best interests.)
## Talking about data science
Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful.
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
* An _observation_ is a set of measurments you make under similar conditions (usually all at the same time or on the same object). Observations contain values that you measure on different variables.
These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation.
## R and big data
This book also focuses almost exclusively on in-memory datasets.
* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
data is still big! R is great with small data.
* Medium data: data that fits in memory on a powerful server, ~5 TB. It's
possible to use R with this much data, but it's challenging. Dealing
effectively with medium data requires effective use of all cores on a
computer. It's not that hard to do that from R, but it requires some thought,
and many packages do not take advantage of R's tools.
* Big data: data that must be stored on disk or spread across the memory of
multiple machines. Writing code that works efficiently with this sort of data
is a very challenging. Tools for this sort of data will never be written in
R: they'll be written in a language specially designed for high performance
computing like C/C++, Fortran or Scala. But R can still talk to these systems.
The other thing to bear in mind, is that while all your data might be big, typically you don't need all of it to answer a specific question:
* Many questions can be answered with the right small dataset. It's often
possible to find a subset, subsample, or summary that fits in memory and
still allows you to answer the question you're interested in. The challenge
here is finding the right small data, which often requires a lot of iteration.
* Other challenges are because an individual problem might fit in memory,
but you have hundreds of thousands or millions of them. For example, you
might want to fit a model to each person in your dataset. That would be
trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately each problem is independent (sometimes called embarassingly
parallel), so you just need a system (like hadoop) that allows you to
send different datasets to different computers for processing.
## Prerequisites
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are free and easy to install.
### R
To install R, visit [cran.r-project.org](http://cran.r-project.org) and click the link that matches your operating system. What you do next will depend on your operating system.
* Mac users should click the `.pkg` file at the top of the page. This file contains the most current release of R. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.
* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.
* Linux users should select their distribution and then follow the distribution specific instructions to install R. [cran.r-project.org](https://cran.r-project.org/bin/linux/) includes these instructions alongside the files to download.
### RStudio
After you install R, visit [www.rstudio.com/download](http://www.rstudio.com/download) to download the RStudio IDE. Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once RStudio IDE is installed, open it as you would open any other application.
### R Packages
An R _package_ is a collection of functions, data sets, and help files that extends the R language. We will use several packages in this book: `DBI`, `devtools`, `dplyr`, `ggplot2`, `haven`, `knitr`, `lubridate`, `packrat`, `readr`, `rmarkdown`, `rsqlite`, `rvest`, `scales `, `shiny`, `stringr`, and `tidyr`.
To install these packages, open the RStudio IDE and run the command
```{r eval = FALSE}
install.packages(c("DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate", "packrat", "readr", "rmarkdown", "rsqlite", "rvest", "scales", "shiny", "stringr", "tidyr"))
```
R will download the packages from [cran.r-project.org](http://cran.r-project.org) and instll them in your system library. So be sure that you are connected to the internet, and that you have not blocked [cran.r-project.org](http://cran.r-project.org)in your firewall or proxy settings.
After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
```{r eval = FALSE}
library(tidyr)
```
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
### Getting help
* Google
* StackOverflow ([reprex](https://github.com/jennybc/reprex))
* Twitter