More minor page count tweaks & fixes
And re-convert with latest htmlbook
hadley committed Jan 26, 2023
1 parent d9afa13 commit aa9d72a
Showing 38 changed files with 839 additions and 1,094 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -52,7 +52,8 @@ devtools::install_github("hadley/r4ds")
To generate book for O'Reilly, build the book then:
```{r}
devtools::load_all("../minibook/"); process_book()
# pak::pak("hadley/htmlbook")
htmlbook::convert_book()
html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
@@ -63,6 +64,8 @@ fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```
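For reference, the definitions of `pngs` and `dest` are folded out of the hunk above; the copy step likely follows this general shape (an editor's sketch only — the actual paths and pattern are assumptions, not part of the commit):

```{r}
# Hypothetical reconstruction of the folded lines: gather generated
# figures and mirror them into the Atlas repo, preserving subdirectories
pngs <- list.files("oreilly", pattern = "[.]png$", recursive = TRUE, full.names = TRUE)
dest <- file.path("../r-for-data-science-2e", pngs)
fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```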
Then commit and push to Atlas.
## Code of Conduct
Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
5 changes: 3 additions & 2 deletions _common.R
@@ -16,8 +16,9 @@ options(
pillar.max_footer_lines = 2,
pillar.min_chars = 15,
stringr.view_n = 6,
# Activate crayon output - temporarily disabled for quarto
# crayon.enabled = TRUE,
# Temporarily deactivate cli output for quarto
cli.num_colors = 0,
cli.hyperlink = FALSE,
pillar.bold = TRUE,
width = 77 # 80 - 3 for #> comment
)
2 changes: 1 addition & 1 deletion base-R.qmd
@@ -210,7 +210,7 @@ This function was the inspiration for much of dplyr's syntax.
2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
Read the documentation for `which()` and do some experiments to figure it out.

## Selecting a single element `$` and `[[` {#sec-subset-one}
## Selecting a single element with `$` and `[[` {#sec-subset-one}

`[`, which selects many elements, is paired with `[[` and `$`, which extract a single element.
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
2 changes: 1 addition & 1 deletion intro.qmd
@@ -365,7 +365,7 @@ knitr::kable(df, format = "markdown")
```

```{r}
#| eval: false
#| include: false
cli:::ruler()
```
26 changes: 12 additions & 14 deletions oreilly/EDA.html
@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="EDA-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
@@ -10,7 +10,7 @@ <h1>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>

<section id="prerequisites" data-type="sect2">
<section id="EDA-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
@@ -137,7 +137,7 @@ <h2>
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
</section>

<section id="exercises" data-type="sect2">
<section id="EDA-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
@@ -198,7 +198,7 @@ <h1>
</div>
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>

<section id="exercises-1" data-type="sect2">
<section id="EDA-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
@@ -217,9 +217,7 @@ <h2>
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; ℹ Please use `linewidth` instead.</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
@@ -235,7 +233,7 @@ <h2>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
@@ -279,7 +277,7 @@ <h2>
</div>
</div>

<section id="exercises-2" data-type="sect3">
<section id="EDA-exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
@@ -291,7 +289,7 @@ <h3>
</ol></section>
</section>

<section id="two-categorical-variables" data-type="sect2">
<section id="EDA-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
@@ -330,7 +328,7 @@ <h2>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>

<section id="exercises-3" data-type="sect3">
<section id="EDA-exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
@@ -340,7 +338,7 @@ <h3>
</ol></section>
</section>

<section id="two-numerical-variables" data-type="sect2">
<section id="EDA-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
@@ -390,7 +388,7 @@ <h2>
</div>
</div>

<section id="exercises-4" data-type="sect3">
<section id="EDA-exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
@@ -464,7 +462,7 @@ <h1>
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>

<section id="summary" data-type="sect1">
<section id="EDA-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>
6 changes: 3 additions & 3 deletions oreilly/arrow.html
@@ -1,13 +1,13 @@
<section data-type="chapter" id="chp-arrow">
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="arrow-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
<p>We’ll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.</p>
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as when the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works best for you.</p>

<section id="prerequisites" data-type="sect2">
<section id="arrow-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.</p>
@@ -272,7 +272,7 @@ <h2>
</section>
</section>

<section id="summary" data-type="sect1">
<section id="arrow-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. While it can work with CSV files, it’s much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>
14 changes: 7 additions & 7 deletions oreilly/base-R.html
@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<section id="base-R-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div class="cell">
@@ -10,7 +10,7 @@ <h2>

<section id="sec-subset-many" data-type="sect1">
<h1>
Selecting multiple elements with<code>[</code>
Selecting multiple elements with [
</h1>
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>

@@ -188,7 +188,7 @@ <h2>
<p>This function was the inspiration for much of dplyr’s syntax.</p>
</section>

<section id="exercises" data-type="sect2">
<section id="base-R-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -203,7 +203,7 @@ <h2>

<section id="sec-subset-one" data-type="sect1">
<h1>
Selecting a single element<code>$</code> and <code>[[</code>
Selecting a single element with $ and [[
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>

@@ -284,7 +284,7 @@ <h2>
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
</section>

<section id="lists" data-type="sect2">
<section id="base-R-lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
@@ -372,7 +372,7 @@ <h2>
</div>
</section>

<section id="exercises-1" data-type="sect2">
<section id="base-R-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
@@ -515,7 +515,7 @@ <h1>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
</section>

<section id="summary" data-type="sect1">
<section id="base-R-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
