Finish writing the Percival "starter notebook" tutorial (#7)

* Small tweak to aggregate formatting
* Remove old directive-parsing code
* Add architecture document to the README
* Add sub-sections to architecture
* Slight tweak to intro cell
* Add support for CSV and TSV data loading
* Add iowa-electricity test for CSV loading
* Add starter notebook tutorial
* Mention named relations
ekzhang · Dec 11, 2021 · 7666121
1 parent 445eaa1
Showing 4 changed files with 192 additions and 42 deletions.
diff --git a/README.md b/README.md
@@ -47,7 +47,59 @@ following command to start the development server:
 npm run dev
 ```
 
-This should open a Percival notebook in your browser.
+This should open a Percival notebook in your browser, with live reloading.
+
+## Architecture
+
+This section outlines the high-level technical design of Percival.
+
+### User Interface
+
+Percival is a client-side web application running fully in the user's browser.
+The notebook interface is built with [Svelte](https://svelte.dev/) and styled
+with [Tailwind CSS](https://tailwindcss.com/). It relies on numerous other open
+source libraries, including [CodeMirror 6](https://codemirror.net/6/) for live
+code editing and syntax highlighting,
+[Remark](https://github.com/remarkjs/remark) and [KaTeX](https://katex.org/) for
+Markdown rendering, and [Vite](https://vitejs.dev/) for frontend bundling.
+
+The code for the web frontend is located in `src/`, which contains a mix of
+Svelte (in `src/components/`) and TypeScript (in `src/lib/`). These modules are
+bundled into a static website at build time, and there is no dynamic server-side
+rendering.
+
+### JIT Compiler
+
+Users write code cells in a custom dialect of Datalog, which a Rust compiler
+translates to JavaScript. The compiler is itself compiled to WebAssembly using
+[wasm-bindgen](https://github.com/rustwasm/wasm-bindgen). The Percival
+compiler's code is located in the `crates/` folder. For ergonomic parsing with
+human-readable error messages, the compiler relies on
+[chumsky](https://github.com/zesterer/chumsky), a parser combinator library.
+
+After the `percival-wasm` crate is compiled to WebAssembly, it can be used by
+client-side code. The compiler processes code cells, then sends the resulting
+JavaScript to separate
+[web workers](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API)
+that sandbox the code and execute it just-in-time. As the user writes queries,
+their notebook automatically tracks inter-cell dependencies and evaluates cells
+in topological order, spawning and terminating worker threads on demand.
+
+### Data Visualization
+
+Plotting is done using a specialized web worker that runs JavaScript code with
+access to the [Observable Plot](https://observablehq.com/@observablehq/plot)
+library. In order for this library (and D3) to run in a worker context, we patch
+the global document with a lightweight virtual DOM implementation ported from
+[Domino](https://github.com/fgnass/domino).
+
+### Deployment
+
+In production, the `main` branch of this repository is continuously deployed to
+[percival.ink](https://percival.ink/) via [Vercel](https://vercel.com/), which
+hosts the static website. It also runs a serverless function (see
+`api/index.go`) that allows users to share notebooks through the GitHub Gist
+API.
 
 ## Development
 
@@ -80,7 +132,7 @@ and Puppeteer for this, and tests can be run with:
 npm test
 ```
 
-## Acknowledgement
+## Acknowledgements
 
 Created by Eric Zhang ([@ekzhang1](https://twitter.com/ekzhang1)). Licensed
 under the [MIT license](LICENSE).
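
A note on the Architecture section added to the README in this commit: it says the notebook tracks inter-cell dependencies and evaluates cells in topological order. The sketch below shows one way such an ordering could be computed, using Kahn's algorithm. The `Cell` shape and the `evaluationOrder` function are hypothetical names chosen for illustration and are not taken from Percival's source.

```typescript
// Hypothetical cell shape: each cell defines some relations and reads others.
interface Cell {
  id: string;
  defines: string[];
  reads: string[];
}

// Returns cell ids so that every cell comes after the cells that define
// the relations it reads (Kahn's algorithm). Throws on a cycle between cells.
function evaluationOrder(cells: Cell[]): string[] {
  // Map each relation name to the cell that defines it.
  const definedBy = new Map<string, string>();
  for (const cell of cells) {
    for (const name of cell.defines) definedBy.set(name, cell.id);
  }

  const dependents = new Map<string, string[]>(); // cell id -> cells that read from it
  const pending = new Map<string, number>(); // cell id -> unmet dependency count
  for (const cell of cells) {
    dependents.set(cell.id, []);
    pending.set(cell.id, 0);
  }

  for (const cell of cells) {
    const deps: string[] = [];
    for (const name of cell.reads) {
      const dep = definedBy.get(name);
      // Ignore unknown relations, duplicates, and recursion within one cell.
      if (dep !== undefined && dep !== cell.id && deps.indexOf(dep) === -1) {
        deps.push(dep);
      }
    }
    for (const dep of deps) dependents.get(dep)!.push(cell.id);
    pending.set(cell.id, deps.length);
  }

  // Start with cells that have no dependencies, then release their dependents.
  const ready = cells.filter((c) => pending.get(c.id) === 0).map((c) => c.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id)!) {
      const left = pending.get(next)! - 1;
      pending.set(next, left);
      if (left === 0) ready.push(next);
    }
  }
  if (order.length !== cells.length) {
    throw new Error("cyclic dependency between cells");
  }
  return order;
}
```

With cells shaped like the starter notebook's `edge` and `path` examples, a cell that reads `path` would be scheduled after the cell that defines it, regardless of where it appears on the page.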
diff --git a/crates/percival/src/parser.rs b/crates/percival/src/parser.rs
@@ -206,17 +206,6 @@ pub fn parser() -> BoxedParser<'static, char, Program, Simple<char>> {
         .then(string.padded().padded_by(comments))
         .map(|(name, uri)| Import { name, uri });
 
-    // let directive =
-    //     just('@')
-    //         .ignore_then(text::ident())
-    //         .try_map(|directive, span| match &directive[..] {
-    //             "mark_bar" | "mark_point" => Ok(directive),
-    //             _ => Err(Simple::custom(
-    //                 span,
-    //                 format!("Unknown directive \"{}\"", directive),
-    //             )),
-    //         });
-
     enum Entry {
         Rule(Rule),
         Import(Import),
@@ -573,16 +562,6 @@ import football from "gh://vega/vega-datasets@next/data/football.json"
         );
     }
 
-    // #[test]
-    // fn parse_bad_directive() {
-    //     let parser = parser();
-    //     let text = "@bad_syntax 123";
-    //     let (_, errors) = parser.parse_recovery(text);
-    //     assert!(errors.len() == 1);
-    //     let message = format_errors(text, errors);
-    //     assert!(message.contains("Unknown directive \"bad_syntax\""));
-    // }
-
     #[test]
     fn parse_boolean() {
         let parser = parser();

diff --git a/src/lib/runtime.worker.ts b/src/lib/runtime.worker.ts
@@ -1,5 +1,5 @@
 import Immutable from "immutable";
-import { csvParse, tsvParse } from "d3-dsv";
+import { autoType, csvParse, tsvParse } from "d3-dsv";
 
 /** Load data from an external source. */
 async function load(url: string): Promise<object[]> {
@@ -11,12 +11,12 @@ async function load(url: string): Promise<object[]> {
   if (url.endsWith(".json") || contentType?.match(/application\/json/i)) {
     return resp.json();
   } else if (url.endsWith(".csv") || contentType?.match(/text\/csv/i)) {
-    return csvParse(await resp.text());
+    return csvParse(await resp.text(), autoType);
   } else if (
     url.endsWith(".tsv") ||
     contentType?.match(/text\/tab-separated-values/i)
   ) {
-    return tsvParse(await resp.text());
+    return tsvParse(await resp.text(), autoType);
   } else {
     throw new Error(
       `Unknown file format for ${url}. Only JSON, CSV, and TSV are supported.
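
The hunk above passes `autoType` as the row conversion function to `csvParse` and `tsvParse`. Without a row function, d3-dsv leaves every field as a string, so a numeric literal in a query (such as `year: 2013` in the starter notebook) would presumably never match a loaded row. The sketch below illustrates the idea with a deliberately simplified, hand-written coercion. `coerceField` and `coerceRow` are hypothetical helpers, not d3-dsv's implementation, and the real `autoType` handles more cases (dates, for example).

```typescript
type Field = string | number | boolean | null;

// Simplified stand-in for type inference on a single CSV field.
function coerceField(raw: string): Field {
  const text = raw.trim();
  if (text === "") return null; // empty cell
  if (text === "true") return true;
  if (text === "false") return false;
  const num = Number(text);
  return isNaN(num) ? text : num; // keep non-numeric text as a string
}

// Apply the coercion to every column of a parsed row.
function coerceRow(row: Record<string, string>): Record<string, Field> {
  const out: Record<string, Field> = {};
  for (const key of Object.keys(row)) {
    out[key] = coerceField(row[key]);
  }
  return out;
}

// Made-up row for illustration only; the values are not from a real dataset.
const typedRow = coerceRow({ state: "AK", ages: "total", year: "2013", population: "100" });
console.log(typedRow.year === 2013); // true: the field is now a number
```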

diff --git a/src/samples/starter.percival b/src/samples/starter.percival
@@ -3,7 +3,7 @@ This is a Percival notebook (https://percival.ink/).
 ╔═╣ Markdown
 # Welcome to Percival!
 
-Percival is an in-browser interactive notebook for **declarative data analysis** and **visualization**. It combines the power of compiled [Datalog](https://en.wikipedia.org/wiki/Datalog) queries with the flexibility of [modern plotting libraries](https://observablehq.com/@observablehq/plot) for the web.
+Percival is an interactive in-browser notebook for **declarative data analysis** and **visualization**. It combines the power of compiled [Datalog](https://en.wikipedia.org/wiki/Datalog) queries with the flexibility of [modern plotting libraries](https://observablehq.com/@observablehq/plot) for the web.
 
 ![Picture of a landscape](https://upload.wikimedia.org/wikipedia/commons/e/ee/Lake_Geneva_after_storm.jpg)
 
@@ -20,7 +20,7 @@ To get started, let's dive into the basics of the language.
 
 Datalog is a fully-featured database query language, similar to SQL. It originates from logic programming as a subset of Prolog. The basic object in Datalog is called a _relation_, and it is the equivalent of a table in traditional databases.
 
-Let's create a very simple relation that stores edges in a directed graph.
+Let's create a very simple relation that stores edges in a directed graph. This relation has two named fields, `x` and `y`.
 
 ╔═╡ Code
 // Edge relation: each line is a database entry.
@@ -29,7 +29,7 @@ edge(x: 2, y: 3).
 edge(x: 2, y: 4).
 
 ╔═╣ Markdown
-With Datalog, you can compute all paths within this graph by writing the query in the following code cell. This query consists of two _rules_, which use the `:-` notation. When we write this query, the outputs are displayed above the cell.
+With Datalog, you can compute all paths within this graph by writing the query in the following code cell. This query consists of two _rules_, which use the `:-` notation. When we run this query, its outputs are displayed above the cell.
 
 ╔═╡ Code
 // Given an edge x -> y, there is a path x -> y.
@@ -120,35 +120,154 @@ For each year and country of origin in the dataset, we will query for the averag
 average_mpg(country, year: `new Date(year)`, value) :-
   country(name: country),
   cars(Year: year),
-  value = mean[Miles_per_Gallon] { cars(Origin: country, Year: year, Miles_per_Gallon) }.
+  value = mean[Miles_per_Gallon] {
+    cars(Origin: country, Year: year, Miles_per_Gallon)
+  }.
 
 ╔═╣ Markdown
 With support for aggregates, we can now answer a lot of analytical questions about the data. One key tool for exploring datasets is visualization. Percival supports declarative data visualization through _Plot_ cells, which run JavaScript code that generates diagrams using the [Observable Plot](https://github.com/observablehq/plot) library.
 
 ╔═╡ Plot
-average_mpg => Plot.plot({
-  x: { grid: true },
-  y: { grid: true },
+average_mpg => Plot.line(average_mpg, {
+  sort: "year",
+  x: "year",
+  y: "value",
+  stroke: "country",
+}).plot({ grid: true })
+
+╔═╣ Markdown
+Here's another example of a plot on our dataset. This time, we'll make a simple scatter plot on the entire cars dataset, faceted by the country of origin.
+
+╔═╡ Plot
+cars => Plot.plot({
   marks: [
-    Plot.line(average_mpg, {
-      sort: "year",
-      x: "year",
-      y: "value",
-      stroke: "country",
+    Plot.dot(cars, {
+      x: "Horsepower",
+      y: "Miles_per_Gallon",
+      stroke: "Weight_in_lbs",
+      strokeWidth: 1.5,
     }),
+    Plot.ruleX([40]),
+    Plot.ruleY([5]),
   ],
+  facet: {
+    data: cars,
+    y: "Origin",
+  },
+  color: {
+    type: "linear",
+    range: ["steelblue", "orange"],
+    interpolate: "hcl",
+  },
+  grid: true,
 })
 
 ╔═╣ Markdown
-## Integrated Case Study
+## Real-World Case Study
+
+Let's see how all of these pieces fit together to work on a real-world dataset, where you might want to combine data from multiple different sources.
+
+╔═╣ Markdown
+### Initial Exploration
+
+Suppose that you just got access to a collection of data about airports, and you're eager to start exploring it. The dataset is tabular and contains information such as name, geographical location, city, state, and country.
+
+╔═╡ Code
+import airports from "npm://[email protected]/data/airports.csv"
+
+╔═╣ Markdown
+From looking at the rows, it seems like there are airports from multiple different countries in this dataset! Let's figure out what the value counts in the `country` column look like.
 
-Let's see how all of these pieces combine together to work on a real-world dataset, where you join and piece together data from multiple different sources.
+╔═╡ Code
+airports_per_country(country, count) :-
+  airports(country),
+  count = count[1] { airports(country) }.
+
+╔═╣ Markdown
+It turns out that **all but 4 of the airports are in the United States**. To make the rest of our analysis simpler, we're going to filter only those airports that have country equal to `"USA"`. We're also going to reduce our columns to only the necessary ones.
+
+╔═╡ Code
+us_airports(state, iata, name) :-
+  airports(state, iata, name, country: "USA").
+
+╔═╣ Markdown
+Cool, that was really simple! Let's use another aggregate query to see how many airports are in each US state.
+
+╔═╡ Code
+airports_per_state(state, count) :-
+  us_airports(state),
+  count = count[1] { us_airports(state) }.
 
-**TODO: NOT DONE**
+╔═╡ Plot
+airports_per_state => Plot.plot({
+  marks: [
+    Plot.dot(airports_per_state, {
+      x: "count",
+      fill: "steelblue",
+      fillOpacity: 0.6,
+    }),
+  ],
+  grid: true,
+})
+
+╔═╣ Markdown
+It seems like most states have between 0 and 100 airports, with a few outliers having 200 to 300. This makes sense, given that some states are much smaller than others, and even between states of the same size, population density can be very different!
+
+╔═╣ Markdown
+### Loading More Data
+
+We might wonder if states with higher populations have more airports. However, we don't have this information in our current table, so we'll need to find a new dataset for this. [Here's one](https://github.com/jakevdp/data-USstates) that we found, off-the-shelf, on GitHub.
+
+_(I quickly updated some of the column names in these tables to make them compatible with Percival, which is why the latter two tables are imported from Gists.)_
+
+╔═╡ Code
+import state_abbrevs from "gh://jakevdp/data-USstates@b9c5dfa/state-abbrevs.csv"
+import state_areas from "https://gist.githubusercontent.com/ekzhang/a68794f064594cf0ab56a317c3b7d121/raw/state-areas.csv"
+import state_population from "https://gist.githubusercontent.com/ekzhang/a68794f064594cf0ab56a317c3b7d121/raw/state-population.csv"
+
+╔═╣ Markdown
+Since this dataset consists of multiple tables in a slightly different format, we'll need to construct an inner join between these tables and our airports to combine them together. Luckily, this is very simple to do with a Datalog query!
+
+╔═╡ Code
+airports_state_info(state, count, population, area) :-
+  state_abbrevs(state: name, abbreviation: state),
+  airports_per_state(count, state),
+  state_population(state, population, ages: "total", year: 2013),
+  state_areas(state: name, area_sq_mi: area).
+
+╔═╡ Plot
+airports_state_info => Plot.plot({
+  marks: [
+    Plot.dot(airports_state_info, {
+      x: "population",
+      y: "count",
+      r: "area",
+      fill: "steelblue",
+      fillOpacity: 0.8,
+      title: "state",
+    }),
+    Plot.text(airports_state_info, {
+      x: "population",
+      y: "count",
+      textAnchor: "start",
+      dx: "1em",
+      text: "state",
+      fill: "#222",
+      fillOpacity: 0.8,
+      fontSize: d => Math.sqrt(d.area) / 50,
+    }),
+    Plot.ruleY([0]),
+    Plot.ruleX([0]),
+  ],
+  grid: true,
+})
+
+╔═╣ Markdown
+As you can see, there is a clear direct relationship between the size of a state, its population, and the number of airports in that state. The one exception is **Alaska (AK)**, which has over 260 airports despite its very small population. We can also see that **Texas (TX)** and **California (CA)** have the second- and third-largest numbers of airports, respectively.
 
 ╔═╣ Markdown
 ## Closing
 
 Thanks for reading! Percival is at an experimental stage. If you have any comments or feedback, you can reach me at [GitHub Discussions](https://github.com/ekzhang/percival) or on Twitter [@ekzhang1](https://twitter.com/ekzhang1).
 
-If you like the ideas behind Percival, feel free to try it out on your own problems! By the way, if you press the "Share" button at the top of this page, you'll get a permanent link to the current notebook. Unlike Jupyter or R exports, these documents are fully interactive, and you only need a browser to continue exploring where you left off. 😀
+If you like the ideas behind Percival, feel free to try it out on your own problems! By the way, if you press the "Share" button at the top of this page, you'll get a permanent link to the current notebook. Unlike Jupyter or R exports, these documents are fully interactive, and you only need a browser to continue exploring where you left off. ✨
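
For readers more familiar with imperative code, the `airports_per_state` rule in the tutorial above reads roughly like a group-by count. The sketch below is an approximation of that reading in TypeScript, not a description of the JavaScript that Percival's compiler generates. The `Airport` interface and `countPerState` function are hypothetical names, and the sample rows are made up.

```typescript
// Hypothetical row shape, mirroring the us_airports(state, iata, name) relation.
interface Airport {
  state: string;
  iata: string;
  name: string;
}

// Rough equivalent of:
//   airports_per_state(state, count) :-
//     us_airports(state),
//     count = count[1] { us_airports(state) }.
function countPerState(airports: Airport[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const airport of airports) {
    counts.set(airport.state, (counts.get(airport.state) ?? 0) + 1);
  }
  return counts;
}

// Made-up rows for illustration only.
const sample: Airport[] = [
  { state: "AK", iata: "AAA", name: "First Example Field" },
  { state: "AK", iata: "BBB", name: "Second Example Field" },
  { state: "TX", iata: "CCC", name: "Third Example Field" },
];
console.log(countPerState(sample).get("AK")); // 2
```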