---
title: "cdata"
author: "John Mount, Win-Vector LLC"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{cdata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The [`cdata`](https://github.com/WinVector/cdata) package is a demonstration of the ["coordinatized data" theory](https://winvector.github.io/FluidData/RowsAndColumns.html) and includes an implementation of the ["fluid data" methodology](https://winvector.github.io/FluidData/FluidData.html).

Briefly, `cdata` supplies data transform operators that:

* Work on local data or with any `DBI` data source.
* Are powerful generalizations of the operators commonly called `pivot` and `un-pivot`.
* Can be specified by drawing an example.

A quick example:

```{r}
library("cdata")
# first few rows of the iris data as an example
d <- wrapr::build_frame(
  "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" |
  5.1           , 3.5          , 1.4           , 0.2          , "setosa"  |
  4.9           , 3            , 1.4           , 0.2          , "setosa"  |
  4.7           , 3.2          , 1.3           , 0.2          , "setosa"  |
  4.6           , 3.1          , 1.5           , 0.2          , "setosa"  |
  5             , 3.6          , 1.4           , 0.2          , "setosa"  |
  5.4           , 3.9          , 1.7           , 0.4          , "setosa"  )
d$iris_id <- seq_len(nrow(d))
knitr::kable(d)
```
Now suppose we want to take the above "all facts about each iris are in a single row" representation and convert it into a per-iris record block with the following structure.
```{r}
record_example <- wrapr::qchar_frame(
  "plant_part", "measurement", "value"      |
  "sepal"     , "width"      , Sepal.Width  |
  "sepal"     , "length"     , Sepal.Length |
  "petal"     , "width"      , Petal.Width  |
  "petal"     , "length"     , Petal.Length )
knitr::kable(record_example)
```
The above sort of transformation may seem exotic, but it is fairly common when we want to plot many aspects of a record at the same time.
To specify our transformation we combine the record example with information about how records are keyed: `recordKeys` identifies which rows go together to form a record, and `controlTableKeys` specifies the internal structure of a data record.
```{r}
layout <- rowrecs_to_blocks_spec(
  record_example,
  controlTableKeys = c("plant_part", "measurement"),
  recordKeys = c("iris_id", "Species"))
print(layout)
```
In the above we have used a common and useful data-organizing trick: specifying a dependent column (`Species` being a function of `iris_id`) as an additional key.
This layout then specifies and implements the data transform. We can transform the data by sending it to the layout.
```{r}
d_transformed <- d %.>%
  layout
knitr::kable(d_transformed)
```
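The same kind of transform can also be applied with an explicit function call instead of the pipe. The following is a minimal sketch using `layout_by()` on a small made-up data frame; the names `d_small`, `control_small`, `layout_small`, and `blocks_small` are hypothetical and are not part of the example above.

```{r}
library("cdata")

# hypothetical data: one row per record, keyed by id
d_small <- data.frame(
  id = c(1, 2),
  x = c(10, 20),
  y = c(30, 40))

# example record: one row per measurement, values taken from columns x and y
control_small <- wrapr::qchar_frame(
  "measurement", "value" |
  "x"          , x       |
  "y"          , y       )

layout_small <- rowrecs_to_blocks_spec(
  control_small,
  recordKeys = "id")

# layout_by(layout, table) applies the layout without using the pipe
blocks_small <- layout_by(layout_small, d_small)
knitr::kable(blocks_small)
```
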
And it is easy to invert these transforms using the `t()` transpose/adjoint notation.
```{r}
inv_layout <- t(layout)
print(inv_layout)
d_transformed %.>%
  inv_layout %.>%
  knitr::kable(.)
```
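One way to convince yourself that a layout and its transpose really are inverses is a round-trip check. The sketch below uses a small made-up data frame (the `rt_` names are hypothetical). Rows and columns are put back into the original order before comparing, since this sketch does not assume the transform preserves either ordering.

```{r}
library("cdata")

# hypothetical data: one row per record, keyed by id
rt_rows <- data.frame(
  id = c(1, 2),
  x = c(10, 20),
  y = c(30, 40))

rt_layout <- rowrecs_to_blocks_spec(
  wrapr::qchar_frame(
    "measurement", "value" |
    "x"          , x       |
    "y"          , y       ),
  recordKeys = "id")

# forward transform, then the inverse obtained with t()
rt_blocks <- layout_by(rt_layout, rt_rows)
rt_back <- layout_by(t(rt_layout), rt_blocks)

# align row and column order with the original before comparing
rt_back <- rt_back[order(rt_back$id), colnames(rt_rows), drop = FALSE]
rownames(rt_back) <- NULL
print(all.equal(rt_rows, rt_back))
```
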
The layout specifications themselves are just simple lists with "pretty print" methods (the control table being simply an example record in the form of a `data.frame`).
```{r}
unclass(layout)
```
Notice that almost all of the time and space in using `cdata` is spent specifying how your data is currently structured and how it is to be structured.

The main `cdata` interfaces are given by the following set of methods:

* [`rowrecs_to_blocks_spec()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_spec.html), for specifying how single-row records map to general multi-row (or block) records.
* [`blocks_to_rowrecs_spec()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_spec.html), for specifying how multi-row block records map to single-row records.
* [`layout_specification()`](https://winvector.github.io/cdata/reference/layout_specification.html), for specifying transforms from multi-row records to other multi-row records.
* [`layout_by()`](https://winvector.github.io/cdata/reference/layout_by.html) or the [wrapr dot arrow pipe](https://winvector.github.io/wrapr/reference/dot_arrow.html) for applying a layout to re-arrange data.
* `t()` (transpose/adjoint) to invert or reverse layout specifications.
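
As an illustration of the second interface in the list, the sketch below builds a block-to-row layout directly with `blocks_to_rowrecs_spec()`, rather than obtaining it by transposing a row-to-block layout. The data and the names `blocks_in`, `spec_in`, and `rows_out` are made up for this example.

```{r}
library("cdata")

# hypothetical block-form data: two rows (x and y) per record id
blocks_in <- data.frame(
  id = c(1, 1, 2, 2),
  measurement = c("x", "y", "x", "y"),
  value = c(10, 30, 20, 40),
  stringsAsFactors = FALSE)

# the control table describes what one incoming block record looks like
spec_in <- blocks_to_rowrecs_spec(
  wrapr::qchar_frame(
    "measurement", "value" |
    "x"          , x       |
    "y"          , y       ),
  recordKeys = "id")

# each block of rows sharing an id becomes a single row with columns x and y
rows_out <- layout_by(spec_in, blocks_in)
knitr::kable(rows_out)
```
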

Some convenience functions include:

* [`pivot_to_rowrecs()`](https://winvector.github.io/cdata/reference/pivot_to_rowrecs.html), for moving data from multi-row block records with one value per row (a single column of values) to single-row records [`spread` or `dcast`].
* [`pivot_to_blocks()`/`unpivot_to_blocks()`](https://winvector.github.io/cdata/reference/unpivot_to_blocks.html), for moving data from single-row records to possibly multi-row block records with one row per value (a single column of values) [`gather` or `melt`].
* [`wrapr::qchar_frame()`](https://winvector.github.io/wrapr/reference/qchar_frame.html), a helper function for specifying record control table layout specifications.
* [`wrapr::build_frame()`](https://winvector.github.io/wrapr/reference/build_frame.html), a helper function for specifying data frames.
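
For the simple case of a single column of values, the convenience functions can be used without writing a control table. The following is a small sketch on made-up data; the frame `wide_small` and the column names `key` and `value` are hypothetical choices for this example.

```{r}
library("cdata")

# hypothetical wide data: one row per record
wide_small <- data.frame(
  id = c(1, 2),
  x = c(10, 20),
  y = c(30, 40))

# un-pivot: one output row per (id, column) pair, similar to gather/melt
long_small <- unpivot_to_blocks(
  wide_small,
  nameForNewKeyColumn = "key",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("x", "y"))
knitr::kable(long_small)

# pivot back: one output row per id, similar to spread/dcast
wide_again <- pivot_to_rowrecs(
  long_small,
  columnToTakeKeysFrom = "key",
  columnToTakeValuesFrom = "value",
  rowKeyColumns = "id")
knitr::kable(wide_again)
```
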
The package vignettes can be found in the "Articles" tab of [the `cdata` documentation site](https://winvector.github.io/cdata/).
The (older) recommended tutorial is: [Fluid data reshaping with cdata](https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html). We also have an (older) [short free cdata screencast](https://youtu.be/4cYbP3kbc0k) (and another example can be found [here](https://winvector.github.io/FluidData/DataWranglingAtScale.html)).