vignettes/cdata.Rmd

---
title: "cdata"
author: "John Mount, Win-Vector LLC"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{cdata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The [`cdata`](https://github.com/WinVector/cdata) package is a demonstration of the ["coordinatized data" theory](https://winvector.github.io/FluidData/RowsAndColumns.html) and includes an implementation of the ["fluid data" methodology](https://winvector.github.io/FluidData/FluidData.html).   

Briefly `cdata` supplies data transform operators that:

 * Work on local data or with any `DBI` data source.
 * Are powerful generalizations of the operators commonly called `pivot` and `un-pivot`.
 * Can be specified by drawing an example.

A quick example:

```{r}
library("cdata")

# first few rows of the iris data as an example
d <- wrapr::build_frame(
   "Sepal.Length"  , "Sepal.Width", "Petal.Length", "Petal.Width", "Species" |
     5.1           , 3.5          , 1.4           , 0.2          , "setosa"  |
     4.9           , 3            , 1.4           , 0.2          , "setosa"  |
     4.7           , 3.2          , 1.3           , 0.2          , "setosa"  |
     4.6           , 3.1          , 1.5           , 0.2          , "setosa"  |
     5             , 3.6          , 1.4           , 0.2          , "setosa"  |
     5.4           , 3.9          , 1.7           , 0.4          , "setosa"  )
d$iris_id <- seq_len(nrow(d))

knitr::kable(d)
```

Now suppose we want to take the above "all facts about each iris are in a single row" representation and convert it into a per-iris record block with the following structure.

```{r}
record_example <- wrapr::qchar_frame(
   "plant_part"  , "measurement", "value"      |
     "sepal"     , "width"      , Sepal.Width  |
     "sepal"     , "length"     , Sepal.Length |
     "petal"     , "width"      , Petal.Width  |
     "petal"     , "length"     , Petal.Length )

knitr::kable(record_example)
```

The above sort of transformation may seem exotic, but it is fairly common when we want to plot many aspects of a record at the same time.

To specify our transformation we combine the record example with information about how records are keyed (recordKeys showing which rows go together to form a record, and controlTableKeys specifying the internal structure of a data record).

```{r}
layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))

print(layout)
```

In the above we have used the common useful data organizing trick of specifying a dependent column (Species being a function of iris_id) as an additional key.

This layout then specifies and implements the data transform.  We can transform the data by sending it to the layout.

```{r}
d_transformed <- d %.>% 
  layout

knitr::kable(d_transformed)
```

And it is easy to invert these transforms using the `t()` transpose/adjoint notation.

```{r}
inv_layout <- t(layout)

print(inv_layout)

d_transformed %.>%
  inv_layout %.>%
  knitr::kable(.)
```

The layout specifications themselves are just simple lists with "pretty print methods" (the control table being simply and example record in the form of a data.frame).

```{r}
unclass(layout)
```

Notice that almost all of the time and space in using cdata is spent in specifying how your data is structured and is to be structured.

The main `cdata` interfaces are given by the following set of methods:

  * [`rowrecs_to_blocks_spec()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_spec.html), for specifying how single row records map to general multi-row (or block) records.
  * [`blocks_to_rowrecs_spec()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_spec.html), for specifying how multi-row block records map to single-row records.
  * [`layout_specification()`](https://winvector.github.io/cdata/reference/layout_specification.html), for specifying transforms from multi-row records to other multi-row records.
  * [`layout_by()`](https://winvector.github.io/cdata/reference/layout_by.html) or the [wrapr dot arrow pipe](https://winvector.github.io/wrapr/reference/dot_arrow.html) for applying a layout to re-arrange data.
  * `t()` (transpose/adjoint) to invert or reverse layout specifications.

Some convenience functions include:

  * [`pivot_to_rowrecs()`](https://winvector.github.io/cdata/reference/pivot_to_rowrecs.html), for moving data from multi-row block records with one value per row (a single column of values) to single-row records [`spread` or `dcast`].
  * [`pivot_to_blocks()`/`unpivot_to_blocks()`](https://winvector.github.io/cdata/reference/unpivot_to_blocks.html), for moving data from single-row records to possibly multi row block records with one row per value (a single column of values) [`gather` or `melt`].
  * [`wrapr::qchar_frame()`](https://winvector.github.io/wrapr/reference/qchar_frame.html) a helper function for specifying record control table layout specifications.
 * [`wrapr::build_frame()`](https://winvector.github.io/wrapr/reference/build_frame.html) a helper function for specifying data frames.
  
The package vignettes can be found in the "Articles" tab of [the `cdata` documentation site](https://winvector.github.io/cdata/).

The (older) recommended tutorial is: [Fluid data reshaping with cdata](https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html). We also have an (older) [short free cdata screencast](https://youtu.be/4cYbP3kbc0k) (and another example can be found [here](https://winvector.github.io/FluidData/DataWranglingAtScale.html)).