More minor page count tweaks & fixes
And re-convert with latest htmlbook
hadley committed Jan 26, 2023
1 parent d9afa13 commit aa9d72a
Showing 38 changed files with 839 additions and 1,094 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -52,7 +52,8 @@ devtools::install_github("hadley/r4ds")
To generate book for O'Reilly, build the book then:
```{r}
devtools::load_all("../minibook/"); process_book()
# pak::pak("hadley/htmlbook")
htmlbook::convert_book()
html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
@@ -63,6 +64,8 @@ fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```
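For reference, the definitions of `pngs` and `dest` are folded out of the hunk above; the copy step likely follows this general shape (an editor's sketch only — the actual paths and pattern are assumptions, not part of the commit):

```{r}
# Hypothetical reconstruction of the folded lines: gather generated
# figures and mirror them into the Atlas repo, preserving subdirectories
pngs <- list.files("oreilly", pattern = "[.]png$", recursive = TRUE, full.names = TRUE)
dest <- file.path("../r-for-data-science-2e", pngs)
fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```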
Then commit and push to Atlas.
## Code of Conduct
Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
5 changes: 3 additions & 2 deletions _common.R
@@ -16,8 +16,9 @@ options(
pillar.max_footer_lines = 2,
pillar.min_chars = 15,
stringr.view_n = 6,
# Activate crayon output - temporarily disabled for quarto
# crayon.enabled = TRUE,
# Temporarily deactivate cli output for quarto
cli.num_colors = 0,
cli.hyperlink = FALSE,
pillar.bold = TRUE,
width = 77 # 80 - 3 for #> comment
)
2 changes: 1 addition & 1 deletion base-R.qmd
@@ -210,7 +210,7 @@ This function was the inspiration for much of dplyr's syntax.
2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
Read the documentation for `which()` and do some experiments to figure it out.

## Selecting a single element `$` and `[[` {#sec-subset-one}
## Selecting a single element with `$` and `[[` {#sec-subset-one}

`[`, which selects many elements, is paired with `[[` and `$`, which extract a single element.
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
2 changes: 1 addition & 1 deletion intro.qmd
@@ -365,7 +365,7 @@ knitr::kable(df, format = "markdown")
```

```{r}
#| eval: false
#| include: false
cli:::ruler()
```
26 changes: 12 additions & 14 deletions oreilly/EDA.html
@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="EDA-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
@@ -10,7 +10,7 @@ <h1>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>

<section id="prerequisites" data-type="sect2">
<section id="EDA-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
@@ -137,7 +137,7 @@ <h2>
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
</section>

<section id="exercises" data-type="sect2">
<section id="EDA-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
@@ -198,7 +198,7 @@ <h1>
</div>
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>

<section id="exercises-1" data-type="sect2">
<section id="EDA-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
@@ -217,9 +217,7 @@ <h2>
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; ℹ Please use `linewidth` instead.</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
@@ -235,7 +233,7 @@ <h2>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
@@ -279,7 +277,7 @@ <h2>
</div>
</div>

<section id="exercises-2" data-type="sect3">
<section id="EDA-exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
@@ -291,7 +289,7 @@ <h3>
</ol></section>
</section>

<section id="two-categorical-variables" data-type="sect2">
<section id="EDA-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
@@ -330,7 +328,7 @@ <h2>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>

<section id="exercises-3" data-type="sect3">
<section id="EDA-exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
@@ -340,7 +338,7 @@ <h3>
</ol></section>
</section>

<section id="two-numerical-variables" data-type="sect2">
<section id="EDA-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
@@ -390,7 +388,7 @@ <h2>
</div>
</div>

<section id="exercises-4" data-type="sect3">
<section id="EDA-exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
@@ -464,7 +462,7 @@ <h1>
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>

<section id="summary" data-type="sect1">
<section id="EDA-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>
6 changes: 3 additions & 3 deletions oreilly/arrow.html
@@ -1,13 +1,13 @@
<section data-type="chapter" id="chp-arrow">
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="arrow-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
<p>We’ll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.</p>
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as when the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works best for you.</p>

<section id="prerequisites" data-type="sect2">
<section id="arrow-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.</p>
@@ -272,7 +272,7 @@ <h2>
</section>
</section>

<section id="summary" data-type="sect1">
<section id="arrow-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. While it can work with CSV files, it’s much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>
14 changes: 7 additions & 7 deletions oreilly/base-R.html
@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<section id="base-R-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div class="cell">
@@ -10,7 +10,7 @@ <h2>

<section id="sec-subset-many" data-type="sect1">
<h1>
Selecting multiple elements with<code>[</code>
Selecting multiple elements with [
</h1>
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>

@@ -188,7 +188,7 @@ <h2>
<p>This function was the inspiration for much of dplyr’s syntax.</p>
</section>

<section id="exercises" data-type="sect2">
<section id="base-R-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -203,7 +203,7 @@ <h2>

<section id="sec-subset-one" data-type="sect1">
<h1>
Selecting a single element<code>$</code> and <code>[[</code>
Selecting a single element with $ and [[
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>

@@ -284,7 +284,7 @@ <h2>
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
</section>

<section id="lists" data-type="sect2">
<section id="base-R-lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
@@ -372,7 +372,7 @@ <h2>
</div>
</section>

<section id="exercises-1" data-type="sect2">
<section id="base-R-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
@@ -515,7 +515,7 @@ <h1>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
</section>

<section id="summary" data-type="sect1">
<section id="base-R-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
