Finish writing the Percival "starter notebook" tutorial (#7)
* Small tweak to aggregate formatting

* Remove old directive-parsing code

* Add architecture document to the README

* Add sub-sections to architecture

* Slight tweak to intro cell

* Add support for CSV and TSV data loading

* Add iowa-electricity test for CSV loading

* Add starter notebook tutorial

* Mention named relations
ekzhang authored Dec 11, 2021
1 parent 445eaa1 commit 7666121
Showing 4 changed files with 192 additions and 42 deletions.
56 changes: 54 additions & 2 deletions README.md
@@ -47,7 +47,59 @@ following command to start the development server:
npm run dev
```

This should open a Percival notebook in your browser.
This should open a Percival notebook in your browser, with live reloading.

## Architecture

This section outlines the high-level technical design of Percival.

### User Interface

Percival is a client-side web application running fully in the user's browser.
The notebook interface is built with [Svelte](https://svelte.dev/) and styled
with [Tailwind CSS](https://tailwindcss.com/). It relies on numerous other open
source libraries, including [CodeMirror 6](https://codemirror.net/6/) for live
code editing and syntax highlighting,
[Remark](https://github.com/remarkjs/remark) and [KaTeX](https://katex.org/) for
Markdown rendering, and [Vite](https://vitejs.dev/) for frontend bundling.

The code for the web frontend is located in `src/`, which contains a mix of
Svelte (in `src/components/`) and TypeScript (in `src/lib/`). These modules are
bundled into a static website at build time, and there is no dynamic server-side
rendering.

### JIT Compiler

Users write code cells in a custom dialect of Datalog, which a Rust compiler
translates to JavaScript. The compiler is itself compiled to WebAssembly using
[wasm-bindgen](https://github.com/rustwasm/wasm-bindgen). The Percival
compiler's code is located in the `crates/` folder. For ergonomic parsing with
human-readable error messages, the compiler relies on
[chumsky](https://github.com/zesterer/chumsky), a parser combinator library.

After the `percival-wasm` crate is compiled to WebAssembly, it can be used by
client-side code. The compiler processes code cells, then sends the resulting
JavaScript to separate
[web workers](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API)
that sandbox the code and execute it just-in-time. As the user writes queries,
their notebook automatically tracks inter-cell dependencies and evaluates cells
in topological order, spawning and terminating worker threads on demand.
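
As a rough illustration of the dependency tracking described above, here is a minimal sketch of ordering cells with Kahn's algorithm. The `Cell` shape and the `evaluationOrder` function are hypothetical names chosen for this example and are not taken from the Percival source, which may organize this logic differently.

```typescript
// Hypothetical cell shape: each cell lists the relations it defines
// (outputs) and the relations it reads (inputs).
interface Cell {
  id: string;
  inputs: string[];
  outputs: string[];
}

/** Return cell ids so that every cell comes after the cells it depends on. */
function evaluationOrder(cells: Cell[]): string[] {
  // Map each relation name to the cell that produces it.
  const producer = new Map<string, string>();
  for (const cell of cells) {
    for (const name of cell.outputs) producer.set(name, cell.id);
  }

  // Build dependency edges and count unmet dependencies per cell.
  const dependents = new Map<string, string[]>();
  const indegree = new Map<string, number>();
  for (const cell of cells) {
    dependents.set(cell.id, []);
    indegree.set(cell.id, 0);
  }
  for (const cell of cells) {
    const deps: string[] = [];
    for (const name of cell.inputs) {
      const dep = producer.get(name);
      // Ignore unknown relations, self-references, and duplicates.
      if (dep !== undefined && dep !== cell.id && deps.indexOf(dep) === -1) {
        deps.push(dep);
      }
    }
    for (const dep of deps) {
      dependents.get(dep)!.push(cell.id);
      indegree.set(cell.id, indegree.get(cell.id)! + 1);
    }
  }

  // Repeatedly emit cells whose dependencies have all been emitted.
  const ready = cells.filter((c) => indegree.get(c.id) === 0).map((c) => c.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id)!) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  if (order.length !== cells.length) {
    throw new Error("Cyclic dependency between cells");
  }
  return order;
}

// A cell that reads `edge` is listed first but is scheduled second.
const order = evaluationOrder([
  { id: "paths", inputs: ["edge", "path"], outputs: ["path"] },
  { id: "edges", inputs: [], outputs: ["edge"] },
]);
console.log(order);
```

In this sketch a recursive rule inside a single cell (such as `path` reading `path`) is not treated as a dependency, while a cycle across different cells is reported as an error.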

### Data Visualization

Plotting is done using a specialized web worker that runs JavaScript code with
access to the [Observable Plot](https://observablehq.com/@observablehq/plot)
library. In order for this library (and D3) to run in a worker context, we patch
the global document with a lightweight virtual DOM implementation ported from
[Domino](https://github.com/fgnass/domino).

### Deployment

In production, the `main` branch of this repository is continuously deployed to
[percival.ink](https://percival.ink/) via [Vercel](https://vercel.com/), which
hosts the static website. It also runs a serverless function (see
`api/index.go`) that allows users to share notebooks through the GitHub Gist
API.

## Development

@@ -80,7 +132,7 @@ and Puppeteer for this, and tests can be run with:
npm test
```

## Acknowledgement
## Acknowledgements

Created by Eric Zhang ([@ekzhang1](https://twitter.com/ekzhang1)). Licensed
under the [MIT license](LICENSE).
21 changes: 0 additions & 21 deletions crates/percival/src/parser.rs
@@ -206,17 +206,6 @@ pub fn parser() -> BoxedParser<'static, char, Program, Simple<char>> {
.then(string.padded().padded_by(comments))
.map(|(name, uri)| Import { name, uri });

// let directive =
// just('@')
// .ignore_then(text::ident())
// .try_map(|directive, span| match &directive[..] {
// "mark_bar" | "mark_point" => Ok(directive),
// _ => Err(Simple::custom(
// span,
// format!("Unknown directive \"{}\"", directive),
// )),
// });

enum Entry {
Rule(Rule),
Import(Import),
@@ -573,16 +562,6 @@ import football from "gh://vega/vega-datasets@next/data/football.json"
);
}

// #[test]
// fn parse_bad_directive() {
// let parser = parser();
// let text = "@bad_syntax 123";
// let (_, errors) = parser.parse_recovery(text);
// assert!(errors.len() == 1);
// let message = format_errors(text, errors);
// assert!(message.contains("Unknown directive \"bad_syntax\""));
// }

#[test]
fn parse_boolean() {
let parser = parser();
6 changes: 3 additions & 3 deletions src/lib/runtime.worker.ts
@@ -1,5 +1,5 @@
import Immutable from "immutable";
import { csvParse, tsvParse } from "d3-dsv";
import { autoType, csvParse, tsvParse } from "d3-dsv";

/** Load data from an external source. */
async function load(url: string): Promise<object[]> {
@@ -11,12 +11,12 @@ async function load(url: string): Promise<object[]> {
if (url.endsWith(".json") || contentType?.match(/application\/json/i)) {
return resp.json();
} else if (url.endsWith(".csv") || contentType?.match(/text\/csv/i)) {
return csvParse(await resp.text());
return csvParse(await resp.text(), autoType);
} else if (
url.endsWith(".tsv") ||
contentType?.match(/text\/tab-separated-values/i)
) {
return tsvParse(await resp.text());
return tsvParse(await resp.text(), autoType);
} else {
throw new Error(
`Unknown file format for ${url}. Only JSON, CSV, and TSV are supported.
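
The `autoType` argument added in this hunk makes d3-dsv infer a JavaScript type for each parsed field instead of leaving every value as a string. The function below is a simplified stand-in written only to illustrate the idea: `coerceRow` is a hypothetical name, not part of d3-dsv, and the real `autoType` also recognizes dates and a few other cases.

```typescript
type Row = Record<string, string>;
type TypedRow = Record<string, string | number | boolean | null>;

/** Simplified illustration of per-field type inference on a parsed CSV row. */
function coerceRow(row: Row): TypedRow {
  const typed: TypedRow = {};
  for (const key of Object.keys(row)) {
    const text = row[key].trim();
    if (text === "") {
      typed[key] = null; // empty fields become null
    } else if (text === "true") {
      typed[key] = true;
    } else if (text === "false") {
      typed[key] = false;
    } else if (!Number.isNaN(Number(text))) {
      typed[key] = Number(text); // numeric-looking text becomes a number
    } else {
      typed[key] = row[key]; // everything else stays a string
    }
  }
  return typed;
}

// Without coercion, `year` would remain the string "2013".
console.log(coerceRow({ state: "AK", year: "2013", area: "" }));
```

Typed fields matter for queries that compare against numeric literals, since a string `"2013"` and a number `2013` are different values in JavaScript.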
151 changes: 135 additions & 16 deletions src/samples/starter.percival
@@ -3,7 +3,7 @@ This is a Percival notebook (https://percival.ink/).
╔═╣ Markdown
# Welcome to Percival!

Percival is an in-browser interactive notebook for **declarative data analysis** and **visualization**. It combines the power of compiled [Datalog](https://en.wikipedia.org/wiki/Datalog) queries with the flexibility of [modern plotting libraries](https://observablehq.com/@observablehq/plot) for the web.
Percival is an interactive in-browser notebook for **declarative data analysis** and **visualization**. It combines the power of compiled [Datalog](https://en.wikipedia.org/wiki/Datalog) queries with the flexibility of [modern plotting libraries](https://observablehq.com/@observablehq/plot) for the web.

![Picture of a landscape](https://upload.wikimedia.org/wikipedia/commons/e/ee/Lake_Geneva_after_storm.jpg)

@@ -20,7 +20,7 @@ To get started, let's dive into the basics of the language.

Datalog is a fully-featured database query language, similar to SQL. It originates from logic programming as a subset of Prolog. The basic object in Datalog is called a _relation_, and it is the equivalent of a table in traditional databases.

Let's create a very simple relation that stores edges in a directed graph.
Let's create a very simple relation that stores edges in a directed graph. This relation has two named fields, `x` and `y`.

╔═╡ Code
// Edge relation: each line is a database entry.
@@ -29,7 +29,7 @@ edge(x: 2, y: 3).
edge(x: 2, y: 4).

╔═╣ Markdown
With Datalog, you can compute all paths within this graph by writing the query in the following code cell. This query consists of two _rules_, which use the `:-` notation. When we write this query, the outputs are displayed above the cell.
With Datalog, you can compute all paths within this graph by writing the query in the following code cell. This query consists of two _rules_, which use the `:-` notation. When we run this query, its outputs are displayed above the cell.
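
For readers coming from imperative languages, it may help to see what such a recursive query computes. The sketch below derives all paths by applying two rules until nothing new appears. It assumes the standard transitive-closure formulation (an edge is a path, and an edge followed by a path is a path); the `allPaths` function and the first sample edge `1 -> 2` are assumptions made for this example.

```typescript
type Pair = [number, number];

/** Compute every (x, y) such that y is reachable from x along the edges. */
function allPaths(edges: Pair[]): Pair[] {
  const seen = new Set<string>();
  const paths: Pair[] = [];
  // Record a path once; report whether it was new.
  const add = (x: number, y: number): boolean => {
    const key = `${x},${y}`;
    if (seen.has(key)) return false;
    seen.add(key);
    paths.push([x, y]);
    return true;
  };

  // Rule 1: given an edge x -> y, there is a path x -> y.
  for (const [x, y] of edges) add(x, y);

  // Rule 2: an edge x -> z followed by a path z -> y gives a path x -> y.
  // Repeat until a full pass adds nothing (a fixed point).
  let changed = true;
  while (changed) {
    changed = false;
    for (const [x, z] of edges) {
      for (const [start, y] of paths.slice()) {
        if (start === z && add(x, y)) changed = true;
      }
    }
  }
  return paths;
}

// Sample edges; 1 -> 2 is assumed for illustration.
console.log(allPaths([[1, 2], [2, 3], [2, 4]]));
```

The Datalog version states only the two rules, and the engine takes care of iterating to the fixed point.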

╔═╡ Code
// Given an edge x -> y, there is a path x -> y.
@@ -120,35 +120,154 @@ For each year and country of origin in the dataset, we will query for the averag
average_mpg(country, year: `new Date(year)`, value) :-
country(name: country),
cars(Year: year),
value = mean[Miles_per_Gallon] { cars(Origin: country, Year: year, Miles_per_Gallon) }.
value = mean[Miles_per_Gallon] {
cars(Origin: country, Year: year, Miles_per_Gallon)
}.

╔═╣ Markdown
With support for aggregates, we can now answer a lot of analytical questions about the data. One key tool for exploring datasets is visualization. Percival supports declarative data visualization through _Plot_ cells, which run JavaScript code that generates diagrams using the [Observable Plot](https://github.com/observablehq/plot) library.

╔═╡ Plot
average_mpg => Plot.plot({
x: { grid: true },
y: { grid: true },
average_mpg => Plot.line(average_mpg, {
sort: "year",
x: "year",
y: "value",
stroke: "country",
}).plot({ grid: true })

╔═╣ Markdown
Here's another example of a plot on our dataset. This time, we'll make a simple scatter plot on the entire cars dataset, faceted by the country of origin.

╔═╡ Plot
cars => Plot.plot({
marks: [
Plot.line(average_mpg, {
sort: "year",
x: "year",
y: "value",
stroke: "country",
Plot.dot(cars, {
x: "Horsepower",
y: "Miles_per_Gallon",
stroke: "Weight_in_lbs",
strokeWidth: 1.5,
}),
Plot.ruleX([40]),
Plot.ruleY([5]),
],
facet: {
data: cars,
y: "Origin",
},
color: {
type: "linear",
range: ["steelblue", "orange"],
interpolate: "hcl",
},
grid: true,
})

╔═╣ Markdown
## Integrated Case Study
## Real-World Case Study

Let's see how all of these pieces fit together to work on a real-world dataset, where you might want to combine data from multiple different sources.

╔═╣ Markdown
### Initial Exploration

Suppose that you just got access to a collection of data about airports, and you're eager to start exploring it. The dataset is tabular and contains information such as name, geographical location, city, state, and country.

╔═╡ Code
import airports from "npm://[email protected]/data/airports.csv"

╔═╣ Markdown
From looking at the rows, it seems like there are airports from multiple different countries in this dataset! Let's figure out what the value counts in the `country` column look like.

Let's see how all of these pieces combine together to work on a real-world dataset, where you join and piece together data from multiple different sources.
╔═╡ Code
airports_per_country(country, count) :-
airports(country),
count = count[1] { airports(country) }.

╔═╣ Markdown
It turns out that **all but 4 of the airports are in the United States**. To make the rest of our analysis simpler, we're going to keep only the airports whose country is `"USA"`. We're also going to reduce our columns to only the necessary ones.

╔═╡ Code
us_airports(state, iata, name) :-
airports(state, iata, name, country: "USA").

╔═╣ Markdown
Cool, that was really simple! Let's use another aggregate query to see how many airports are in each US state.
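
If you are more used to imperative code, a `count[1] { ... }` aggregate grouped by state behaves roughly like the loop below. This is only a sketch for comparison: the `Airport` interface, the `countPerState` function, and the three sample rows are written for this example and do not come from the dataset file.

```typescript
interface Airport {
  iata: string;
  state: string;
}

/** Count how many airports fall into each state. */
function countPerState(airports: Airport[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const airport of airports) {
    const previous = counts.get(airport.state);
    counts.set(airport.state, (previous === undefined ? 0 : previous) + 1);
  }
  return counts;
}

// Small hand-written sample, not the real dataset.
const sample: Airport[] = [
  { iata: "ANC", state: "AK" },
  { iata: "FAI", state: "AK" },
  { iata: "LAX", state: "CA" },
];
countPerState(sample).forEach((count, state) => console.log(state, count));
```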

╔═╡ Code
airports_per_state(state, count) :-
us_airports(state),
count = count[1] { us_airports(state) }.

**TODO: NOT DONE**
╔═╡ Plot
airports_per_state => Plot.plot({
marks: [
Plot.dot(airports_per_state, {
x: "count",
fill: "steelblue",
fillOpacity: 0.6,
}),
],
grid: true,
})

╔═╣ Markdown
It seems like most states have between 0 and 100 airports, with a few outliers having 200 to 300 airports. This makes sense, given that some states are much smaller than others, and even between states of the same size, population density can be very different!

╔═╣ Markdown
### Loading More Data

We might wonder if states with higher populations have more airports. However, we don't have this information in our current table, so we'll need to find a new dataset for this. [Here's one](https://github.com/jakevdp/data-USstates) that we found, off-the-shelf, on GitHub.

_(I quickly updated some of the column names in these tables to make them compatible with Percival, which is why the latter two tables are imported from Gists.)_

╔═╡ Code
import state_abbrevs from "gh://jakevdp/data-USstates@b9c5dfa/state-abbrevs.csv"
import state_areas from "https://gist.githubusercontent.com/ekzhang/a68794f064594cf0ab56a317c3b7d121/raw/state-areas.csv"
import state_population from "https://gist.githubusercontent.com/ekzhang/a68794f064594cf0ab56a317c3b7d121/raw/state-population.csv"

╔═╣ Markdown
Since this dataset consists of multiple tables in a slightly different format, we'll need to construct an inner join between these tables and our airports to combine them together. Luckily, this is very simple to do with a Datalog query!
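
To make the idea of an inner join concrete, here is a sketch of the same kind of combination written by hand. It is simplified to three inputs, all names (`joinStateInfo` and the interfaces) are hypothetical, and the sample numbers are placeholders rather than real figures.

```typescript
interface StateAbbrev { state: string; abbreviation: string } // full name + code
interface StateCount { state: string; count: number }         // keyed by code
interface StatePopulation { state: string; population: number } // keyed by code

interface StateInfo {
  state: string; // two-letter code
  name: string;
  count: number;
  population: number;
}

/** Inner join: keep only states that appear in all three inputs. */
function joinStateInfo(
  abbrevs: StateAbbrev[],
  counts: StateCount[],
  populations: StatePopulation[],
): StateInfo[] {
  // Index the two lookup tables by state code for constant-time matching.
  const countByCode = new Map<string, number>();
  counts.forEach((c) => countByCode.set(c.state, c.count));
  const populationByCode = new Map<string, number>();
  populations.forEach((p) => populationByCode.set(p.state, p.population));

  const result: StateInfo[] = [];
  for (const abbrev of abbrevs) {
    const count = countByCode.get(abbrev.abbreviation);
    const population = populationByCode.get(abbrev.abbreviation);
    // Rows without a match on every side are dropped.
    if (count === undefined || population === undefined) continue;
    result.push({ state: abbrev.abbreviation, name: abbrev.state, count, population });
  }
  return result;
}

// Placeholder values for illustration only.
console.log(
  joinStateInfo(
    [{ state: "Alaska", abbreviation: "AK" }, { state: "Texas", abbreviation: "TX" }],
    [{ state: "AK", count: 3 }],
    [{ state: "AK", population: 1000 }, { state: "TX", population: 2000 }],
  ),
);
```

In Datalog the matching is expressed by reusing a variable such as `state` across several relations, so no explicit index or loop is needed.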

╔═╡ Code
airports_state_info(state, count, population, area) :-
state_abbrevs(state: name, abbreviation: state),
airports_per_state(count, state),
state_population(state, population, ages: "total", year: 2013),
state_areas(state: name, area_sq_mi: area).

╔═╡ Plot
airports_state_info => Plot.plot({
marks: [
Plot.dot(airports_state_info, {
x: "population",
y: "count",
r: "area",
fill: "steelblue",
fillOpacity: 0.8,
title: "state",
}),
Plot.text(airports_state_info, {
x: "population",
y: "count",
textAnchor: "start",
dx: "1em",
text: "state",
fillColor: "#222",
fillOpacity: 0.8,
fontSize: d => Math.sqrt(d.area) / 50,
}),
Plot.ruleY([0]),
Plot.ruleX([0]),
],
grid: true,
})

╔═╣ Markdown
As you can see, there is a clear direct relationship between the size of a state, its population, and the number of airports in that state. The one exception to this relationship is **Alaska (AK)**, which has over 260 airports despite its very small population! We're also able to see that **Texas (TX)** and **California (CA)** have the second and third largest number of airports, respectively.

╔═╣ Markdown
## Closing

Thanks for reading! Percival is at an experimental stage. If you have any comments or feedback, you can reach me at [GitHub Discussions](https://github.com/ekzhang/percival) or on Twitter [@ekzhang1](https://twitter.com/ekzhang1).

If you like the ideas behind Percival, feel free to try it out on your own problems! By the way, if you press the "Share" button at the top of this page, you'll get a permanent link to the current notebook. Unlike Jupyter or R exports, these documents are fully interactive, and you only need a browser to continue exploring where you left off. 😀
If you like the ideas behind Percival, feel free to try it out on your own problems! By the way, if you press the "Share" button at the top of this page, you'll get a permanent link to the current notebook. Unlike Jupyter or R exports, these documents are fully interactive, and you only need a browser to continue exploring where you left off.
