---
title: "Roles in recipes"
output: rmarkdown::html_vignette
description: |
  In recipes, roles provide a way to select variables for different steps.
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Roles in recipes}
  %\VignetteEncoding{UTF-8}
---
```{r ex_setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  digits = 3,
  collapse = TRUE,
  comment = "#>",
  eval = requireNamespace("modeldata", quietly = TRUE)
)
options(digits = 3)
library(recipes)
```
`recipes` can assign one or more roles to each column in the data. The roles are not restricted to a predefined set; they can be anything. In most conventional situations, they are "predictor" and/or "outcome". Additional roles let steps target specific variables or groups of variables.
## The Formula Method
Creating a recipe with the formula interface defines the roles for all columns of the data set: variables on the left-hand side of the formula become outcomes and those on the right-hand side become predictors. `summary()` returns a tibble that lists the role of each variable.
```{r formula-roles}
library(recipes)
recipe(Species ~ ., data = iris) %>% summary()
recipe( ~ Species, data = iris) %>% summary()
recipe(Sepal.Length + Sepal.Width ~ ., data = iris) %>% summary()
```
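Because `summary()` returns an ordinary tibble, its `variable` and `role` columns can be filtered like any other data. A minimal sketch using base subsetting; the object names `rec` and `role_info` are introduced here only for illustration:

```{r formula-roles-filter}
library(recipes)

rec <- recipe(Species ~ ., data = iris)
role_info <- summary(rec)

# Variable names grouped by the role assigned through the formula
predictors <- role_info$variable[role_info$role == "predictor"]
outcomes <- role_info$variable[role_info$role == "outcome"]

predictors
outcomes
```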
These roles can be updated despite this initial assignment. `update_role()` can modify a single existing role:
```{r formula-update}
library(modeldata)
data(biomass)
recipe(HHV ~ ., data = biomass) %>%
  update_role(dataset, new_role = "dataset split variable") %>%
  update_role(sample, new_role = "sample ID") %>%
  summary()
```
When you want to get rid of a role for a column, use `remove_role()`.
```{r formula-rm}
recipe(HHV ~ ., data = biomass) %>%
  remove_role(sample, old_role = "predictor") %>%
  summary()
```
The lack of a role is represented as `NA`, which means that the variable is used in the recipe but does not yet have a declared role. Setting the role manually to `NA` is not allowed:
```{r formula-rm-fail, error=TRUE}
recipe(HHV ~ ., data = biomass) %>%
  update_role(sample, new_role = NA_character_)
```
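One consequence of an `NA` role is that role-based selectors no longer pick the column up. The sketch below uses `iris` so that it runs without `modeldata`. It assumes that a recipe with no steps can be prepped and baked, and `no_role_rec` is a name introduced only for illustration.

```{r formula-rm-selectors}
library(recipes)

no_role_rec <- recipe(Species ~ ., data = iris) %>%
  remove_role(Sepal.Length, old_role = "predictor")

role_info <- summary(no_role_rec)
# The role for Sepal.Length is reported as NA
role_info$role[role_info$variable == "Sepal.Length"]

# all_predictors() only returns columns that still have the predictor role
remaining <- no_role_rec %>%
  prep() %>%
  bake(new_data = NULL, all_predictors()) %>%
  names()

remaining
```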
When a column will be used in more than one context, `add_role()` can create additional roles:
```{r formula-add}
multi_role <- recipe(HHV ~ ., data = biomass) %>%
  update_role(dataset, new_role = "dataset split variable") %>%
  update_role(sample, new_role = "sample ID") %>%
  # Roles below from https://wordcounter.net/random-word-generator
  add_role(sample, new_role = "jellyfish")

multi_role %>%
  summary()
```
If a variable has multiple existing roles and you want to update one of them, the additional `old_role` argument to `update_role()` must be used to resolve any ambiguity.
```{r}
multi_role %>%
  update_role(sample, new_role = "flounder", old_role = "jellyfish") %>%
  summary()
```
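The need for `old_role` can be checked directly. In this sketch (on `iris`, with the illustrative names `two_roles`, `attempt`, and `updated`), the first call omits `old_role` for a column that has two roles and is expected to signal an error, which `tryCatch()` captures. The second call names the role to replace.

```{r update-old-role}
library(recipes)

two_roles <- recipe(Species ~ ., data = iris) %>%
  add_role(Sepal.Length, new_role = "jellyfish")

# Ambiguous request: Sepal.Length has two roles and old_role is not given
attempt <- tryCatch(
  update_role(two_roles, Sepal.Length, new_role = "flounder"),
  error = function(e) e
)
inherits(attempt, "error")

# Unambiguous request: only the "jellyfish" role is replaced
updated <- update_role(
  two_roles, Sepal.Length,
  new_role = "flounder", old_role = "jellyfish"
)
role_info <- summary(updated)
role_info$role[role_info$variable == "Sepal.Length"]
```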
Additional variable roles allow you to use `has_role()` in combination with other selection methods (see `?selections`) to target specific variables in subsequent processing steps. For example, in the following recipe, adding the role `"nocenter"` to the `carbon` predictor lets you use `-has_role("nocenter")` to exclude `carbon` when centering `all_predictors()`.
```{r}
multi_role %>%
  add_role(carbon, new_role = "nocenter") %>%
  step_center(all_predictors(), -has_role("nocenter")) %>%
  prep(training = biomass, retain = TRUE) %>%
  bake(new_data = NULL) %>%
  head()
```
The selector `all_numeric_predictors()` can be used in place of `all_predictors()` here, since it selects the columns that are both numeric and predictors; the `-has_role("nocenter")` exclusion is still needed to skip `carbon`.
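To verify which columns an exclusion such as `-has_role()` affects, the baked values can be compared with the original data. A small sketch on `iris`, where tagging `Petal.Width` with the `"nocenter"` role is an arbitrary choice made for illustration:

```{r has-role-check}
library(recipes)

centered <- recipe(Species ~ ., data = iris) %>%
  add_role(Petal.Width, new_role = "nocenter") %>%
  step_center(all_predictors(), -has_role("nocenter")) %>%
  prep() %>%
  bake(new_data = NULL)

# Centered predictors have a mean of (numerically) zero
abs(mean(centered$Sepal.Length)) < 1e-8

# The excluded column keeps its original values
all.equal(centered$Petal.Width, iris$Petal.Width)
```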
## The Non-Formula Interface
You can start a recipe without any roles:
```{r x-none}
recipe(biomass) %>%
  summary()
```
and roles can be added in bulk as needed:
```{r x-none-updated}
recipe(biomass) %>%
  update_role(contains("gen"), new_role = "lunchroom") %>%
  update_role(sample, HHV, new_role = "snail") %>%
  summary()
```
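Roles can also be supplied when the recipe is created, through the `vars` and `roles` arguments of the data frame method of `recipe()`. A minimal sketch on `iris`, assuming one role per listed variable; `direct_rec` is an illustrative name:

```{r x-vars-roles}
library(recipes)

direct_rec <- recipe(
  iris,
  vars = names(iris),
  roles = c(rep("predictor", 4), "outcome")
)

role_info <- summary(direct_rec)
role_info[, c("variable", "role")]
```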
## Role Inheritance
All recipe steps have a `role` argument that lets you set the role of _new_ columns generated by the step. When a step modifies a column in place, the column's role is not changed. For example, `?step_center` has the documentation:
> `role`: Not used by this step since no new variables are created
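A quick way to see this is to compare the roles before and after prepping a recipe that centers columns in place. The sketch below uses `iris`; `roles_of()` is a small helper defined here for illustration, not a function from recipes.

```{r center-roles}
library(recipes)

# Illustrative helper: named vector of roles, ordered by variable name
roles_of <- function(rec) {
  info <- summary(rec)
  setNames(info$role, info$variable)[sort(info$variable)]
}

base_rec <- recipe(Species ~ ., data = iris)
prepped <- base_rec %>%
  step_center(all_numeric_predictors()) %>%
  prep()

# Centering changes the values; this checks whether the roles are unchanged
identical(roles_of(base_rec), roles_of(prepped))
```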
In other cases, the roles default to a relevant value based on the context. For example, `?step_dummy` has:
> `role`: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the binary dummy variable columns created by the original variables will be used as predictors in a model.
So, by default, they are predictors but don't have to be:
```{r dummy}
recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  prep() %>%
  bake(new_data = NULL, all_predictors()) %>%
  dplyr::select(starts_with("Species")) %>%
  names()

# or something else
recipe(~ ., data = iris) %>%
  step_dummy(Species, role = "trousers") %>%
  prep() %>%
  bake(new_data = NULL, has_role("trousers")) %>%
  names()
```