Mild import/wrangling reorg
hadley committed Jun 20, 2022
1 parent 23bfba6 commit 8f7748d
Showing 12 changed files with 25 additions and 72 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -14,5 +14,5 @@ libs
_main.*
tmp-pdfcrop-*
figures

/.quarto/
site_libs
2 changes: 1 addition & 1 deletion EDA.qmd
@@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms:
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.

So far, all of the data that you've seen has been tidy.
In real life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
In real life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling.

## Variation

13 changes: 6 additions & 7 deletions _quarto.yml
@@ -65,14 +65,13 @@ book:
- missing-values.qmd
- column-wise.qmd

- part: import.qmd
- part: wrangle.qmd
chapters:
- import-rectangular.qmd
- import-spreadsheets.qmd
- import-databases.qmd
- rectangle.qmd
- import-webscrape.qmd
- import-other.qmd
- parsing.qmd
- spreadsheets.qmd
- databases.qmd
- rectangling.qmd
- webscraping.qmd

- part: program.qmd
chapters:
28 changes: 2 additions & 26 deletions data-import.qmd
@@ -11,8 +11,7 @@ status("polishing")

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.

### Prerequisites

@@ -320,33 +319,10 @@ There are two alternatives:
```
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]); feather currently does not.
RDS supports list-columns (which you'll learn about in @sec-rectangling); feather currently does not.
```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```

## Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:

- **readxl** reads Excel files (both `.xls` and `.xlsx`).
See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.

- **googlesheets4** reads Google Sheets.
Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.

- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
See [Chapter -@sec-import-databases] for more on working with databases.

- **haven** reads SPSS, Stata, and SAS files.

For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
2 changes: 1 addition & 1 deletion data-tidy.qmd
@@ -557,7 +557,7 @@ df <- tribble(
)
```

If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:

```{r}
df |> pivot_wider(
File renamed without changes.
2 changes: 1 addition & 1 deletion import-rectangular.qmd → parsing.qmd
@@ -1,4 +1,4 @@
# Rectangular data {#sec-import-rectangular}
# Parsing {#sec-import-rectangular}

```{r}
#| results: "asis"
6 changes: 3 additions & 3 deletions rectangle.qmd → rectangling.qmd
@@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangle-data}
# Data rectangling {#sec-rectangling}

```{r}
#| results: "asis"
@@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```

As lists get even larger and more complex, `str()` starts to fail, and you'll need to switch to `View()`[^rectangle-1].
As lists get even larger and more complex, `str()` starts to fail, and you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.

[^rectangle-1]: This is an RStudio feature.
[^rectangling-1]: This is an RStudio feature.

```{r}
#| label: fig-view-collapsed
File renamed without changes.
28 changes: 0 additions & 28 deletions tidy.qmd

This file was deleted.

File renamed without changes.
14 changes: 10 additions & 4 deletions import.qmd → wrangle.qmd
@@ -1,4 +1,4 @@
# Wrangle {#sec-import-intro .unnumbered}
# Wrangle {#sec-wrangle .unnumbered}

```{r}
#| results: "asis"
@@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the

This part of the book proceeds as follows:

- In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R.

- In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.

- In @sec-import-databases, you'll learn about getting data into R from databases.

- In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.
- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.

- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.

- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other.
Some other types of data are not covered in this book:

- **haven** reads SPSS, Stata, and SAS files.

- **xml2** for XML.

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
