# Case study: `setNames()`
```{r, include = FALSE}
source("common.R")
```
## What does `setNames()` do?
`stats::setNames()` is a shorthand that allows you to set vector names inline (it's a little surprising that it lives in the stats package). It has a simple definition:
```{r}
setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
```
And is easy to use:
```{r}
# Instead of
x <- 1:3
names(x) <- c("a", "b", "c")
# Can write
x <- setNames(1:3, c("a", "b", "c"))
x
```
This function is short (just two lines of code!) but yields a surprisingly rich analysis.
## How can we improve the names?
Firstly, I prefer snake_case to camelCase, so I'd call the function `set_names()`. Then we need to consider the arguments:
* I think the first argument, `object`, would be better called `x` in order
to emphasise that this function only works with vectors (because only
vectors have names).
* The second argument, `nm`, is rather terse, and I don't see any disadvantage
in calling it `names`. I think you could also argue that it should be
called `y` since its meaning should be obvious from the function name.
This yields:
```{r}
set_names <- function(x = names, names) {
  names(x) <- names
  x
}
```
## What about the default values?
The default values of `setNames()` are a little hard to understand, because the default value of the first argument is the second argument. It was defined this way to make it possible to name a character vector with itself:
```{r}
setNames(nm = c("apple", "banana", "cake"))
```
But that decision leads to a function signature that violates one of the principles of Chapter \@ref(args-data-details): a required argument comes after an optional argument. Fortunately, we can fix this easily and still preserve the useful ability to name a vector with itself:
```{r}
set_names <- function(x, names = x) {
  names(x) <- names
  x
}
set_names(c("apple", "banana", "cake"))
```
This helps to emphasise that `x` is the primary argument.
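Naming a vector with itself is mostly useful as a prelude to iteration, because functions like `lapply()` carry names through to their output. A minimal sketch, using a hypothetical `set_names_self()` that copies the definition above so the chunk stands alone:

```{r}
# Hypothetical copy of the definition above, renamed so that it does not
# overwrite `set_names()` elsewhere in this chapter.
set_names_self <- function(x, names = x) {
  names(x) <- names
  x
}

fruit <- set_names_self(c("apple", "banana", "cake"))

# lapply() preserves names, so each result is labelled by its input
n_chars <- lapply(fruit, nchar)
n_chars$banana
```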
## What about bad inputs?
Now that we've considered how the function works with correct inputs, it's time to consider how it should work with malformed inputs. The current function checks neither the length nor the type:
```{r}
set_names(1:3, "a")
set_names(1:3, list(letters[1:3], letters[4], letters[5:6]))
```
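It helps to see what these calls actually produce. As far as I know, base R's `names<-` pads names that are too short with `NA` and coerces non-character input to character, so neither call errors. The sketch below uses `stats::setNames()`, which has the same body, so that it doesn't depend on the definitions in this chapter:

```{r}
# Too few names: the missing names are filled in with NA
short <- setNames(1:3, "a")
names(short)

# A list of names: each list element is collapsed into a single string
odd <- setNames(1:3, list(letters[1:3], letters[4], letters[5:6]))
names(odd)
```

Both results are technically valid named vectors, which is exactly why the mistake is easy to miss.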
We can resolve this by asserting that the names should always be a character vector, and should have the same length as `x`:
```{r, error = TRUE}
set_names <- function(x, names = x) {
  if (!is.character(names) || length(names) != length(x)) {
    stop("`names` must be a character vector the same length as `x`.", call. = FALSE)
  }
  names(x) <- names
  x
}
set_names(1:3, "a")
set_names(1:3, list(letters[1:3], letters[4], letters[5:6]))
```
You could also frame this test using vctrs assertions:
```{r}
library(vctrs)
set_names <- function(x, names = x) {
  vec_assert(x)
  vec_assert(names, ptype = character(), size = length(x))
  names(x) <- names
  x
}
```
Note that I slipped in an assertion that `x` should be a vector. This slightly improves the error message if you accidentally supply the wrong sort of input to `set_names()`:
```{r, error = TRUE}
setNames(mean, 1:3)
set_names(mean, 1:3)
```
Note that we're simply checking the length of `names` here, rather than recycling it, i.e. the invariant is `vec_size(set_names(x, y))` is `vec_size(x)`, not `vec_size_common(x, y)`. I think this is the correct behaviour because you usually add names to a vector to create a lookup table, and a lookup table is not useful if there are duplicated names. This makes `set_names()` less general in return for better error messages when you do something suspicious (and you can always use an explicit `rep_along()` if you do want this behaviour).
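To make the trade-off concrete, here is a sketch of opting in to recycling explicitly. It uses a hypothetical base-only `set_names_strict()` in place of the vctrs version, and base `rep_len()` standing in for `rep_along()`, so that the chunk has no dependencies:

```{r}
# Hypothetical stand-in for the strict `set_names()` above, using only base R
set_names_strict <- function(x, names = x) {
  if (!is.character(names) || length(names) != length(x)) {
    stop("`names` must be a character vector the same length as `x`.", call. = FALSE)
  }
  names(x) <- names
  x
}

v <- 1:3

# A length-1 `names` is rejected rather than silently recycled
result <- tryCatch(
  set_names_strict(v, "a"),
  error = function(e) conditionMessage(e)
)
result

# Recycling has to be spelled out by the caller
recycled <- set_names_strict(v, rep_len("a", length(v)))
names(recycled)
```

The duplicated names in `recycled` are now visibly the caller's choice, not a side effect of the function.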
## How could we extend this function?
Now that we've modified the function so it doesn't violate the principles in this book, we can think about how we might extend it. Currently the function is only useful for setting names to a constant. Maybe we could extend it to also make it easier to change existing names? One way to do that would be to allow `names` to be a function:
```{r}
set_names <- function(x, names = x) {
  vec_assert(x)
  if (is.function(names)) {
    names <- names(base::names(x))
  }
  vec_assert(names, ptype = character(), size = length(x))
  names(x) <- names
  x
}
x <- c(a = 1, b = 2, c = 3)
set_names(x, toupper)
```
We could also support the anonymous function formula shortcut used in many places in the tidyverse:
```{r}
set_names <- function(x, names = x) {
  vec_assert(x)
  if (is.function(names) || rlang::is_formula(names)) {
    fun <- rlang::as_function(names)
    names <- fun(base::names(x))
  }
  vec_assert(names, ptype = character(), size = length(x))
  names(x) <- names
  x
}
x <- c(a = 1, b = 2, c = 3)
set_names(x, ~ paste0("x-", .))
```
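One consequence worth noting: a function can only transform names that already exist. If `x` is unnamed, `base::names(x)` is `NULL`, and a function like `toupper()` returns a zero-length character vector, which then fails the length check. A base-only sketch, with a hypothetical `set_names_fn()` standing in for the version above (the formula shortcut is omitted):

```{r}
# Hypothetical base-only stand-in for the function-accepting `set_names()`
set_names_fn <- function(x, names = x) {
  if (is.function(names)) {
    names <- names(base::names(x))
  }
  if (!is.character(names) || length(names) != length(x)) {
    stop("`names` must be a character vector the same length as `x`.", call. = FALSE)
  }
  names(x) <- names
  x
}

# Existing names are transformed as expected
named <- set_names_fn(c(a = 1, b = 2, c = 3), toupper)
names(named)

# An unnamed input has nothing to transform, so the length check fails
unnamed <- tryCatch(
  set_names_fn(1:3, toupper),
  error = function(e) conditionMessage(e)
)
unnamed
```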
Now `set_names()` supports overriding and modifying names. What about removing them? It turns out that `setNames()` supported this, but our stricter checks prohibit it:
```{r, error = TRUE}
x <- c(a = 1, b = 2, c = 3)
setNames(x, NULL)
set_names(x, NULL)
```
We can fix this with another clause:
```{r}
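set_names <- function(x, names = x) {
  vec_assert(x)
  if (!is.null(names)) {
    if (is.function(names) || rlang::is_formula(names)) {
      fun <- rlang::as_function(names)
      names <- fun(base::names(x))
    }
    # Keep the type and length check for every non-NULL input
    vec_assert(names, ptype = character(), size = length(x))
  }
  names(x) <- names
  x
}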
x <- c(a = 1, b = 2, c = 3)
set_names(x, NULL)
```
However, I think this has muddied the logic. To resolve it, I think we should pull out the checking code into a separate function. After trying out a [few approaches](https://github.com/tidyverse/principles/issues/79), I ended up with:
```{r}
check_names <- function(names, x) {
  if (is.null(names)) {
    names
  } else if (vec_is(names)) {
    vec_assert(names, ptype = character(), size = length(x))
  } else if (is.function(names)) {
    # Apply the function to the existing names, then check the result
    check_names(names(base::names(x)), x)
  } else if (rlang::is_formula(names)) {
    # Convert the formula to a function and handle it in the branch above
    check_names(rlang::as_function(names), x)
  } else {
    rlang::abort("`names` must be NULL, a function or formula, or a vector")
  }
}
```
This then replaces `vec_assert()` in `set_names()`. I separate the input checking and implementation with a blank line to help visually group the parts of the function.
```{r}
set_names <- function(x, names = x) {
  vec_assert(x)
  names <- check_names(names, x)

  names(x) <- names
  x
}
```
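To check that each branch of `check_names()` behaves as intended, it's useful to run one input of each kind through the final function. The sketch below repeats the two definitions under hypothetical `_demo` names so that it runs alone. It assumes vctrs and rlang are installed, and that `vec_assert()` returns its input invisibly on success:

```{r}
library(vctrs)

# Hypothetical copies of `check_names()` and `set_names()`, renamed so that
# they don't overwrite the chapter's definitions.
check_names_demo <- function(names, x) {
  if (is.null(names)) {
    names
  } else if (vec_is(names)) {
    vec_assert(names, ptype = character(), size = length(x))
  } else if (is.function(names)) {
    check_names_demo(names(base::names(x)), x)
  } else if (rlang::is_formula(names)) {
    check_names_demo(rlang::as_function(names), x)
  } else {
    rlang::abort("`names` must be NULL, a function or formula, or a vector")
  }
}

set_names_demo <- function(x, names = x) {
  vec_assert(x)
  names <- check_names_demo(names, x)

  names(x) <- names
  x
}

v <- c(a = 1, b = 2, c = 3)

names(set_names_demo(v, c("x", "y", "z")))   # character vector
names(set_names_demo(v, toupper))            # function
names(set_names_demo(v, ~ paste0("x-", .)))  # formula
names(set_names_demo(v, NULL))               # NULL removes the names

# Anything else, such as a symbol, should reach the final `abort()`
bad <- tryCatch(
  set_names_demo(v, quote(a)),
  error = function(e) conditionMessage(e)
)
bad
```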
We _could_ simplify the function even further, but I think this is a bad idea because it mingles input validation with implementation:
```{r}
# Don't do this
set_names <- function(x, names = x) {
  vec_assert(x)
  names(x) <- check_names(names, x)
  x
}
# Or even
set_names <- function(x, names = x) {
  `names<-`(vec_assert(x), check_names(names, x))
}
```
## Compared to `rlang::set_names()`
If you're familiar with rlang, you might notice that we've ended up with something rather similar to `rlang::set_names()`. However, the careful analysis in this chapter has led to a few differences. `rlang::set_names()`:
* Calls the second argument `nm`, instead of something more descriptive.
I think this is simply because we never sat down and fully considered the
interface.
* Coerces `nm` to a character vector. This allows `rlang::set_names(1:4)` to
automatically name the vector, but this seems a relatively weak new feature
in return for the cost of not throwing an error message if you provide an
unusual vector type. (Both lists and data frames have `as.character()` methods
so this will work for basically any type of vector, even if completely
inappropriate.)
* Passes `...` on to the function `nm`. I now think that decision was a
mistake: it substantially complicates the interface in return for a
relatively small benefit.
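The last two differences are easy to see in action. A short sketch, assuming a reasonably recent rlang is installed and behaves as described in the list above:

```{r}
# `nm` is coerced to character, so an integer vector can name itself
self_named <- rlang::set_names(1:4)
names(self_named)

# Extra arguments in `...` are passed on to the function supplied as `nm`
v <- c(a = 1, b = 2)
suffixed <- rlang::set_names(v, paste0, "_new")
names(suffixed)
```

With the `set_names()` developed in this chapter, the first call would be an error, and the second would be written with an explicit function such as `~ paste0(., "_new")`.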