library(dplyr)
library(ggplot2)
packageVersion("dplyr")
#> [1] '1.0.9'
A Taste of dplyr 1.0.0
across()
, row-wise operations and context-dependent expressions
This post uses the penguins dataset (Gorman 2014) for most demonstration purposes, modified by Allison Horst (as an alternative to iris).
penguins <- palmerpenguins::penguins
penguins
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#> 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#> 4 Adelie Torgersen NA NA NA NA <NA> 2007
#> 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
#> 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
#> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
#> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
penguins %>%
ggplot(aes(bill_length_mm, bill_depth_mm, color = species, shape = species)) +
geom_point()
Working within columns
The new across()
function supersedes functionalities of _at
, _if
, _all
variants. The first argument, .cols
, selects the columns you want to operate on. It uses tidy selection (like select()
) so you can pick variables by position, name, and type. The second argument, .fns
, is a function or list of functions to apply to each column. This can also be a purrr style formula
penguins_grouped <- penguins %>% group_by(species)
penguins_grouped %>%
summarize(across(starts_with("bill"), mean, na.rm = TRUE),
n = n())
#> # A tibble: 3 × 4
#> species bill_length_mm bill_depth_mm n
#> <fct> <dbl> <dbl> <int>
#> 1 Adelie 38.8 18.3 152
#> 2 Chinstrap 48.8 18.4 68
#> 3 Gentoo 47.5 15.0 124
For conditional selection (previous _if
variants), predicate function should be wrapped in where
.
# double all numeric columns
penguins %>%
mutate(across(where(is.numeric), ~ .x * 2))
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Adelie Torgersen 78.2 37.4 362 7500 male 4014
#> 2 Adelie Torgersen 79 34.8 372 7600 fema… 4014
#> 3 Adelie Torgersen 80.6 36 390 6500 fema… 4014
#> 4 Adelie Torgersen NA NA NA NA <NA> 4014
#> 5 Adelie Torgersen 73.4 38.6 386 6900 fema… 4014
#> 6 Adelie Torgersen 78.6 41.2 380 7300 male 4014
#> 7 Adelie Torgersen 77.8 35.6 362 7250 fema… 4014
#> 8 Adelie Torgersen 78.4 39.2 390 9350 male 4014
#> 9 Adelie Torgersen 68.2 36.2 386 6950 <NA> 4014
#> 10 Adelie Torgersen 84 40.4 380 8500 <NA> 4014
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
# count unique values of all character columns
penguins %>%
summarize(across(where(is.character), ~ length(unique(.x))))
#> # A tibble: 1 × 0
Apply multiple functions using list and use the .names
argument to control column names.
penguins_grouped %>%
summarize(across(matches("mm"),
list(min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE)),
.names = "{fn}_{col}"))
#> # A tibble: 3 × 7
#> species min_bill_length_mm max_bill_length…¹ min_b…² max_b…³ min_f…⁴ max_f…⁵
#> <fct> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 Adelie 32.1 46 15.5 21.5 172 210
#> 2 Chinstrap 40.9 58 16.4 20.8 178 212
#> 3 Gentoo 40.9 59.6 13.1 17.3 203 231
#> # … with abbreviated variable names ¹max_bill_length_mm, ²min_bill_depth_mm,
#> # ³max_bill_depth_mm, ⁴min_flipper_length_mm, ⁵max_flipper_length_mm
Working within rows
Row-wise operations require a special type of grouping where each group consists of a single row. You create this with rowwise()
.
rowwise
doesn’t need any additional arguments unless you have variables that identify the rows, like student_id
here. This can be helpful when you want to keep a row identifier.
Like group_by
, rowwise
doesn’t really do anything itself; it just changes how the other verbs work.
df %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 26.5
#> 2 2 11 21 31 41 26.5
#> 3 3 12 22 32 42 26.5
#> 4 4 13 23 33 43 26.5
df %>% rowwise() %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#> # Rowwise:
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 25
#> 2 2 11 21 31 41 26
#> 3 3 12 22 32 42 27
#> 4 4 13 23 33 43 28
In its essence, rowwise
vectorize / parallelize a function to acheive rowwisee computation. And in this case, the mean()
function is vectorized. If there are alternative ways of computing row-wise summaries that take advantage of built-in vectorisation then we do not need rowwise
at all.
df %>% mutate(s = test1 + test2 + test3)
#> # A tibble: 4 × 6
#> student_id test1 test2 test3 test4 s
#> <int> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40 60
#> 2 2 11 21 31 41 63
#> 3 3 12 22 32 42 66
#> 4 4 13 23 33 43 69
df %>% rowwise() %>% mutate(s = test1 + test2 + test3)
#> # A tibble: 4 × 6
#> # Rowwise:
#> student_id test1 test2 test3 test4 s
#> <int> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40 60
#> 2 2 11 21 31 41 63
#> 3 3 12 22 32 42 66
#> 4 4 13 23 33 43 69
Another family of summary functions have “parallel” extensions where you can provide multiple variables in the arguments:
df %>%
mutate(min = pmin(test1, test2, test3, test4),
max = pmax(test1, test2, test3, test4),
string = paste(test1, test2, test3, test4, sep = "-"))
#> # A tibble: 4 × 8
#> student_id test1 test2 test3 test4 min max string
#> <int> <int> <int> <int> <int> <int> <int> <chr>
#> 1 1 10 20 30 40 10 40 10-20-30-40
#> 2 2 11 21 31 41 11 41 11-21-31-41
#> 3 3 12 22 32 42 12 42 12-22-32-42
#> 4 4 13 23 33 43 13 43 13-23-33-43
Where these functions exist, they’ll usually be faster than rowwise
. The advantage of rowwise
is that it works with any function, not just those that are already vectorised.
However, an advantage of rowwise
even there is other ways is that it’s paired with c_across()
, which works like c()
but uses the same tidyselect syntax as across()
. That makes it easy to operate on multiple variables:
df %>%
rowwise() %>%
mutate(min = min(c_across(starts_with("test"))),
max = max(c_across(starts_with("test"))))
#> # A tibble: 4 × 7
#> # Rowwise:
#> student_id test1 test2 test3 test4 min max
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40 10 40
#> 2 2 11 21 31 41 11 41
#> 3 3 12 22 32 42 12 42
#> 4 4 13 23 33 43 13 43
Plus, a rowwise df will naturally contain exactly the same rows after summarize()
, the same as mutate
List-columns
Because lists can contain anything, you can use list-columns to keep related objects together, regardless of what type of thing they are. List-columns give you a convenient storage mechanism and rowwise
gives you a convenient computation mechanism.
Simulaiton
The basic idea of using rowwise
to perform simulation is to store all your simulation paramters in a data frame, similar to purrr::pmap
.
df <- tribble(
~id, ~ n, ~ min, ~ max,
1, 3, 0, 1,
2, 2, 10, 100,
3, 2, 100, 1000,
)
Then you can either generate a list-column containing the simulated values with mutate
:
Or taking advantage of summarize
’s new features to return multiple rows per group
df %>%
rowwise(everything()) %>%
summarize(sim = runif(n, min, max))
#> # A tibble: 7 × 5
#> # Groups: id, n, min, max [3]
#> id n min max sim
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 0 1 0.644
#> 2 1 3 0 1 0.201
#> 3 1 3 0 1 0.151
#> 4 2 2 10 100 84.5
#> 5 2 2 10 100 80.2
#> 6 3 2 100 1000 311.
#> 7 3 2 100 1000 372.
The previous approach
Group-wise models
The new nest_by()
function works similarly to group_nest()
Now we can use mutate
to fit a model to each data frame:
by_species <- by_species %>%
rowwise(species) %>%
mutate(model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)))
by_species
#> # A tibble: 3 × 3
#> # Rowwise: species
#> species data model
#> <fct> <list<tibble[,7]>> <list>
#> 1 Adelie [152 × 7] <lm>
#> 2 Chinstrap [68 × 7] <lm>
#> 3 Gentoo [124 × 7] <lm>
And then extract model summaries or coefficients with summarize()
and broom
functions (note that by_species
is still a rowwise data frame):
by_species %>%
summarize(broom::glance(model))
#> # A tibble: 3 × 13
#> # Groups: species [3]
#> species r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 0.153 0.148 2.46 27.0 6.67e- 7 1 -349. 704. 713.
#> 2 Chinstrap 0.427 0.418 2.55 49.2 1.53e- 9 1 -159. 324. 331.
#> 3 Gentoo 0.414 0.409 2.37 85.5 1.02e-15 1 -280. 565. 574.
#> # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>, and
#> # abbreviated variable names ¹adj.r.squared, ²statistic
#> # ℹ Use `colnames()` to see all variable names
by_species %>%
summarize(broom::tidy(model))
#> # A tibble: 6 × 6
#> # Groups: species [3]
#> species term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie (Intercept) 23.1 3.03 7.60 3.01e-12
#> 2 Adelie bill_depth_mm 0.857 0.165 5.19 6.67e- 7
#> 3 Chinstrap (Intercept) 13.4 5.06 2.66 9.92e- 3
#> 4 Chinstrap bill_depth_mm 1.92 0.274 7.01 1.53e- 9
#> 5 Gentoo (Intercept) 17.2 3.28 5.25 6.60e- 7
#> 6 Gentoo bill_depth_mm 2.02 0.219 9.24 1.02e-15
An alternative approach
penguins %>%
group_by(species) %>%
group_modify(~ broom::tidy(lm(bill_length_mm ~ bill_depth_mm, data = .x)))
#> # A tibble: 6 × 6
#> # Groups: species [3]
#> species term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie (Intercept) 23.1 3.03 7.60 3.01e-12
#> 2 Adelie bill_depth_mm 0.857 0.165 5.19 6.67e- 7
#> 3 Chinstrap (Intercept) 13.4 5.06 2.66 9.92e- 3
#> 4 Chinstrap bill_depth_mm 1.92 0.274 7.01 1.53e- 9
#> 5 Gentoo (Intercept) 17.2 3.28 5.25 6.60e- 7
#> 6 Gentoo bill_depth_mm 2.02 0.219 9.24 1.02e-15
New summarize
features
Multiple rows and columns
Two big changes make summarize()
much more flexible. A single summary expression can now return:
A vector of any length, creating multiple rows. (so we can use summary that returns multiple values without
list
)A data frame, creating multiple columns.
penguins_grouped %>%
summarize(bill_length_dist = quantile(bill_length_mm,
c(0.25, 0.5, 0.75),
na.rm = TRUE),
q = c(0.25, 0.5, 0.75))
#> # A tibble: 9 × 3
#> # Groups: species [3]
#> species bill_length_dist q
#> <fct> <dbl> <dbl>
#> 1 Adelie 36.8 0.25
#> 2 Adelie 38.8 0.5
#> 3 Adelie 40.8 0.75
#> 4 Chinstrap 46.3 0.25
#> 5 Chinstrap 49.6 0.5
#> 6 Chinstrap 51.1 0.75
#> 7 Gentoo 45.3 0.25
#> 8 Gentoo 47.3 0.5
#> 9 Gentoo 49.6 0.75
Or return multiple columns from a single summary expression:
At the first glance this may seem not so different with supplying multiple name-value pairs. But this can be useful inside functions. For example, in the previous quantile
code it would be nice to be able to reduce the duplication so that we don’t have to type the quantile values twice. We can now write a simple function because summary expressions can now be data frames or tibbles:
quibble <- function(x, q = c(0.25, 0.5, 0.75), na.rm = TRUE) {
tibble(x = quantile(x, q, na.rm = na.rm), q = q)
}
penguins_grouped %>%
summarize(quibble(bill_depth_mm))
#> # A tibble: 9 × 3
#> # Groups: species [3]
#> species x q
#> <fct> <dbl> <dbl>
#> 1 Adelie 17.5 0.25
#> 2 Adelie 18.4 0.5
#> 3 Adelie 19 0.75
#> 4 Chinstrap 17.5 0.25
#> 5 Chinstrap 18.4 0.5
#> 6 Chinstrap 19.4 0.75
#> 7 Gentoo 14.2 0.25
#> 8 Gentoo 15 0.5
#> 9 Gentoo 15.7 0.75
When combining glue syntax and tidy evaluation, it is easy to dynamically name the column names.
quibble <- function(x, q = c(0.25, 0.5, 0.75), na.rm = TRUE) {
tibble("{{ x }}_quantile" := quantile(x, q, na.rm = na.rm),
"{{ x }}_q" := q)
}
penguins_grouped %>%
summarize(quibble(flipper_length_mm))
#> # A tibble: 9 × 3
#> # Groups: species [3]
#> species flipper_length_mm_quantile flipper_length_mm_q
#> <fct> <dbl> <dbl>
#> 1 Adelie 186 0.25
#> 2 Adelie 190 0.5
#> 3 Adelie 195 0.75
#> 4 Chinstrap 191 0.25
#> 5 Chinstrap 196 0.5
#> 6 Chinstrap 201 0.75
#> 7 Gentoo 212 0.25
#> 8 Gentoo 216 0.5
#> 9 Gentoo 221 0.75
As an aside, if we name the tibble expression in summarize()
that part will be packed in the result, which can be solved by tidyr::unpack
. That’s because when we leave the name off, the data frame result is automatically unpacked.
penguins_grouped %>%
summarize(df = quibble(flipper_length_mm))
#> # A tibble: 9 × 2
#> # Groups: species [3]
#> species df$flipper_length_mm_quantile $flipper_length_mm_q
#> <fct> <dbl> <dbl>
#> 1 Adelie 186 0.25
#> 2 Adelie 190 0.5
#> 3 Adelie 195 0.75
#> 4 Chinstrap 191 0.25
#> 5 Chinstrap 196 0.5
#> 6 Chinstrap 201 0.75
#> 7 Gentoo 212 0.25
#> 8 Gentoo 216 0.5
#> 9 Gentoo 221 0.75
non-summary context
In combination with rowwise operations, summarize()
is now sufficiently powerful to replace many workflows that previously required a map()
function.
For example, to read all the all the .csv files in the current directory, you could write:
Move columns around within data frames
New verb relocate
is provided to change column positions with the same syntax as select
. The default behavior is to move selected columns to the left-hand side
penguins %>% relocate(island)
#> # A tibble: 344 × 8
#> island species bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Torgersen Adelie 39.1 18.7 181 3750 male 2007
#> 2 Torgersen Adelie 39.5 17.4 186 3800 fema… 2007
#> 3 Torgersen Adelie 40.3 18 195 3250 fema… 2007
#> 4 Torgersen Adelie NA NA NA NA <NA> 2007
#> 5 Torgersen Adelie 36.7 19.3 193 3450 fema… 2007
#> 6 Torgersen Adelie 39.3 20.6 190 3650 male 2007
#> 7 Torgersen Adelie 38.9 17.8 181 3625 fema… 2007
#> 8 Torgersen Adelie 39.2 19.6 195 4675 male 2007
#> 9 Torgersen Adelie 34.1 18.1 193 3475 <NA> 2007
#> 10 Torgersen Adelie 42 20.2 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
penguins %>% relocate(starts_with("bill"))
#> # A tibble: 344 × 8
#> bill_length_mm bill_depth_mm species island flipper_…¹ body_…² sex year
#> <dbl> <dbl> <fct> <fct> <int> <int> <fct> <int>
#> 1 39.1 18.7 Adelie Torgersen 181 3750 male 2007
#> 2 39.5 17.4 Adelie Torgersen 186 3800 fema… 2007
#> 3 40.3 18 Adelie Torgersen 195 3250 fema… 2007
#> 4 NA NA Adelie Torgersen NA NA <NA> 2007
#> 5 36.7 19.3 Adelie Torgersen 193 3450 fema… 2007
#> 6 39.3 20.6 Adelie Torgersen 190 3650 male 2007
#> 7 38.9 17.8 Adelie Torgersen 181 3625 fema… 2007
#> 8 39.2 19.6 Adelie Torgersen 195 4675 male 2007
#> 9 34.1 18.1 Adelie Torgersen 193 3475 <NA> 2007
#> 10 42 20.2 Adelie Torgersen 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
penguins %>% relocate(sex, body_mass_g, .after = species)
#> # A tibble: 344 × 8
#> species sex body_mass_g island bill_length_mm bill_dept…¹ flipp…² year
#> <fct> <fct> <int> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie male 3750 Torgersen 39.1 18.7 181 2007
#> 2 Adelie female 3800 Torgersen 39.5 17.4 186 2007
#> 3 Adelie female 3250 Torgersen 40.3 18 195 2007
#> 4 Adelie <NA> NA Torgersen NA NA NA 2007
#> 5 Adelie female 3450 Torgersen 36.7 19.3 193 2007
#> 6 Adelie male 3650 Torgersen 39.3 20.6 190 2007
#> 7 Adelie female 3625 Torgersen 38.9 17.8 181 2007
#> 8 Adelie male 4675 Torgersen 39.2 19.6 195 2007
#> 9 Adelie <NA> 3475 Torgersen 34.1 18.1 193 2007
#> 10 Adelie <NA> 4250 Torgersen 42 20.2 190 2007
#> # … with 334 more rows, and abbreviated variable names ¹bill_depth_mm,
#> # ²flipper_length_mm
#> # ℹ Use `print(n = ...)` to see more rows
Similarly, mutate
gains new arguments .after
and .before
to control where new columns should appear.
penguins %>%
mutate(mass_double = body_mass_g * 2, .before = 1)
#> # A tibble: 344 × 9
#> mass_double species island bill_leng…¹ bill_…² flipp…³ body_…⁴ sex year
#> <dbl> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 7500 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 7600 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#> 3 6500 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#> 4 NA Adelie Torgersen NA NA NA NA <NA> 2007
#> 5 6900 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#> 6 7300 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> 7 7250 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
#> 8 9350 Adelie Torgersen 39.2 19.6 195 4675 male 2007
#> 9 6950 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
#> 10 8500 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹bill_length_mm,
#> # ²bill_depth_mm, ³flipper_length_mm, ⁴body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
Row mutation
dplyr has a new experimental family of row mutation functions inspired by SQL’s UPDATE
, INSERT
, UPSERT
, and DELETE
. Like the join functions, they all work with a pair of data frames:
rows_update(x, y)
updates existing rows in x with values in y.rows_patch(x, y)
works like rows_update() but only changesNA
values.rows_insert(x, y)
adds new rows to x from y.rows_upsert(x, y)
updates existing rows in x and adds new rows from y.rows_delete(x, y)
deletes rows in x that match rows in y.
The rows_
functions match x and y using keys. All of them check that the keys of x and y are valid (i.e. unique) before doing anything.
We can use rows_insert()
to add new rows:
new <- tibble(a = c(4, 5), b = c("d", "e"), c = c(3.5, 4.5))
rows_insert(df, new)
#> # A tibble: 5 × 3
#> a b c
#> <int> <chr> <dbl>
#> 1 1 a 0.5
#> 2 2 b 1.5
#> 3 3 <NA> 2.5
#> 4 4 d 3.5
#> 5 5 e 4.5
Note that rows_insert()
will fail if we attempt to insert a row that already exists:
df %>% rows_insert(tibble(a = 3, b = "c"))
#> Error in `rows_insert()`:
#> ! `y` can't contain keys that already exist in `x`.
#> ℹ The following rows in `y` have keys that already exist in `x`: `c(1)`.
#> ℹ Use `conflict = "ignore"` if you want to ignore these `y` rows.
df %>% rows_insert(tibble(a = 3, b = "c"), by = c("a", "b"))
#> # A tibble: 4 × 3
#> a b c
#> <int> <chr> <dbl>
#> 1 1 a 0.5
#> 2 2 b 1.5
#> 3 3 <NA> 2.5
#> 4 3 c NA
If you want to update existing values, use rows_update()
. It will throw an error if one of the rows to update does not exist:
df %>% rows_update(tibble(a = 3, b = "c"))
#> # A tibble: 3 × 3
#> a b c
#> <int> <chr> <dbl>
#> 1 1 a 0.5
#> 2 2 b 1.5
#> 3 3 c 2.5
df %>% rows_update(tibble(a = 4, b = "d"))
#> Error in `rows_update()`:
#> ! `y` must contain keys that already exist in `x`.
#> ℹ The following rows in `y` have keys that don't exist in `x`: `c(1)`.
#> ℹ Use `unmatched = "ignore"` if you want to ignore these `y` rows.
rows_patch()
is a variant of rows_update()
that will only update values in x that are NA
.
df %>% rows_patch(tibble(a = 1:3, b = "patch"))
#> # A tibble: 3 × 3
#> a b c
#> <int> <chr> <dbl>
#> 1 1 a 0.5
#> 2 2 b 1.5
#> 3 3 patch 2.5
row_upsert
update a df or insert new rows.
df %>%
rows_upsert(tibble(a = 3, b = "c")) %>% # update
rows_upsert(tibble(a = 4, b = "d")) # insert
#> # A tibble: 4 × 3
#> a b c
#> <int> <chr> <dbl>
#> 1 1 a 0.5
#> 2 2 b 1.5
#> 3 3 c 2.5
#> 4 4 d NA
Context dependent expressions
n()
is a special function in dplyr which return the number of observations in the current group. Now the new version comes with more such special functions, aka context dependent expressions. These functions return information about the “current” group or “current” variable, so only work inside specific contexts like summarize()
and mutate()
. Specifically, a family of cur_
functions are added:
cur_data()
gives the current data for the current group (exclusing grouping variables,cur_data_all
in developmental version returns grouping variables as well)cur_group()
gives the group keys, a tibble with one row and one column for each grouping variable.cur_group_id()
gives a unique numeric identifier for the current groupcur_column()
gives the name of the current column (inacross()
only).
df <- tibble(
g = sample(rep(letters[1:3], 1:3)),
x = runif(6),
y = runif(6)
)
gf <- df %>% group_by(g)
gf %>% summarize(row = cur_group_rows())
#> # A tibble: 6 × 2
#> # Groups: g [3]
#> g row
#> <chr> <int>
#> 1 a 3
#> 2 b 1
#> 3 b 2
#> 4 c 4
#> 5 c 5
#> 6 c 6
gf %>% summarize(data = list(cur_group()))
#> # A tibble: 3 × 2
#> g data
#> <chr> <list>
#> 1 a <tibble [1 × 1]>
#> 2 b <tibble [1 × 1]>
#> 3 c <tibble [1 × 1]>
gf %>% summarize(data = list(cur_data()))
#> # A tibble: 3 × 2
#> g data
#> <chr> <list>
#> 1 a <tibble [1 × 2]>
#> 2 b <tibble [2 × 2]>
#> 3 c <tibble [3 × 2]>
# has nothing to do with groups
gf %>% mutate(across(everything(), ~ paste(cur_column(), round(.x, 2))))
#> # A tibble: 6 × 3
#> # Groups: g [3]
#> g x y
#> <chr> <chr> <chr>
#> 1 b x 0.02 y 0.17
#> 2 b x 0.65 y 0.53
#> 3 a x 0.84 y 0.49
#> 4 c x 0.32 y 0.62
#> 5 c x 0.35 y 0.6
#> 6 c x 0.57 y 0.08
Superseded functions
top_n()
, sample_n()
, and sample_frac()
have been superseded in favor of a new family of slice helpers: slice_min()
, slice_max()
, slice_head()
, slice_tail()
, slice_random()
.
# select penguins per group on body mass
penguins_grouped %>%
slice_max(body_mass_g, n = 1)
#> # A tibble: 3 × 8
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Biscoe 43.2 19 197 4775 male 2009
#> 2 Chinstrap Dream 52 20.7 210 4800 male 2008
#> 3 Gentoo Biscoe 49.2 15.2 221 6300 male 2007
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
penguins_grouped %>%
slice_min(body_mass_g, n = 1)
#> # A tibble: 4 × 8
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Biscoe 36.5 16.6 181 2850 fema… 2008
#> 2 Adelie Biscoe 36.4 17.1 184 2850 fema… 2008
#> 3 Chinstrap Dream 46.9 16.6 192 2700 fema… 2008
#> 4 Gentoo Biscoe 42.7 13.7 208 3950 fema… 2008
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
# random sampling
penguins %>%
slice_sample(n = 10)
#> # A tibble: 10 × 8
#> species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Chinstrap Dream 48.1 16.4 199 3325 fema… 2009
#> 2 Gentoo Biscoe 42 13.5 210 4150 fema… 2007
#> 3 Gentoo Biscoe 54.3 15.7 231 5650 male 2008
#> 4 Gentoo Biscoe 45.2 14.8 212 5200 fema… 2009
#> 5 Adelie Torgersen 41.5 18.3 195 4300 male 2009
#> 6 Gentoo Biscoe 49.9 16.1 213 5400 male 2009
#> 7 Chinstrap Dream 50.5 19.6 201 4050 male 2007
#> 8 Adelie Biscoe 35.3 18.9 187 3800 fema… 2007
#> 9 Adelie Biscoe 37.8 20 190 4250 male 2009
#> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
penguins %>%
slice_sample(prop = 0.1)
#> # A tibble: 34 × 8
#> species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Dream 44.1 19.7 196 4400 male 2007
#> 2 Gentoo Biscoe 48.6 16 230 5800 male 2008
#> 3 Gentoo Biscoe 49.1 15 228 5500 male 2009
#> 4 Chinstrap Dream 46.2 17.5 187 3650 fema… 2008
#> 5 Gentoo Biscoe 52.2 17.1 228 5400 male 2009
#> 6 Gentoo Biscoe 48.4 14.4 203 4625 fema… 2009
#> 7 Adelie Biscoe 38.8 17.2 180 3800 male 2007
#> 8 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 2007
#> 9 Chinstrap Dream 51.9 19.5 206 3950 male 2009
#> 10 Gentoo Biscoe 43.5 15.2 213 4650 fema… 2009
#> # … with 24 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
summarize()
gains new argument .groups
to control grouping structure of theh result.
.groups = "drop_last"
drops the last grouping level (i.e. the default behaviour)..groups = "drop"
drops all grouping levels and returns a tibble..groups = "keep"
preserves the grouping of the input..groups = "rowwise"
turns each row into its own group.
Other changes
The new rename_with()
makes it easier to rename variables programmatically:
penguins %>%
rename_with(stringr::str_to_upper)
#> # A tibble: 344 × 8
#> SPECIES ISLAND BILL_LENGTH_MM BILL_DEPTH_MM FLIPPER_…¹ BODY_…² SEX YEAR
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#> 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#> 4 Adelie Torgersen NA NA NA NA <NA> 2007
#> 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
#> 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
#> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
#> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹FLIPPER_LENGTH_MM,
#> # ²BODY_MASS_G
#> # ℹ Use `print(n = ...)` to see more rows
You can optionally choose which columns to apply the transformation to with the second argument:
penguins %>%
rename_with(stringr::str_to_upper, starts_with("bill"))
#> # A tibble: 344 × 8
#> species island BILL_LENGTH_MM BILL_DEPTH_MM flipper_…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#> 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#> 4 Adelie Torgersen NA NA NA NA <NA> 2007
#> 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
#> 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
#> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
#> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
#> # ℹ Use `print(n = ...)` to see more rows
mutate()
gains argument .keep
that allows you to control which columns are retained in the output:
penguins %>% mutate(double_mass = body_mass_g * 2,
island_lower = stringr::str_to_lower(island),
.keep = "used")
#> # A tibble: 344 × 4
#> island body_mass_g double_mass island_lower
#> <fct> <int> <dbl> <chr>
#> 1 Torgersen 3750 7500 torgersen
#> 2 Torgersen 3800 7600 torgersen
#> 3 Torgersen 3250 6500 torgersen
#> 4 Torgersen NA NA torgersen
#> 5 Torgersen 3450 6900 torgersen
#> 6 Torgersen 3650 7300 torgersen
#> 7 Torgersen 3625 7250 torgersen
#> 8 Torgersen 4675 9350 torgersen
#> 9 Torgersen 3475 6950 torgersen
#> 10 Torgersen 4250 8500 torgersen
#> # … with 334 more rows
#> # ℹ Use `print(n = ...)` to see more rows
penguins %>% mutate(double_mass = body_mass_g * 2, .keep = "none")
#> # A tibble: 344 × 1
#> double_mass
#> <dbl>
#> 1 7500
#> 2 7600
#> 3 6500
#> 4 NA
#> 5 6900
#> 6 7300
#> 7 7250
#> 8 9350
#> 9 6950
#> 10 8500
#> # … with 334 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Use cases
This in-progress section documents tasks that would otherwise been impossible or laborious with previous version of dplyr.
Replace missing values in many columns
Since tidyr::replace_na
does not support tidy select syntax, replacing NA values in multiple columns could be a drudgery. Now this is made easy with coalesce
and across
penguins %>% summarize(across(starts_with("bill"), ~ sum(is.na(.x))))
#> # A tibble: 1 × 2
#> bill_length_mm bill_depth_mm
#> <int> <int>
#> 1 2 2
penguins %>%
mutate(across(starts_with("bill"), ~ coalesce(.x, 0))) %>%
summarize(across(starts_with("bill"), ~ sum(is.na(.x))))
#> # A tibble: 1 × 2
#> bill_length_mm bill_depth_mm
#> <int> <int>
#> 1 0 0
Rolling regression
cur_data()
is particularly useful for rolling regression, in conjunction with the slider package.
library(slider)
library(lubridate)
# historical stock prices from 2014-2018 for Google, Amazon, Facebook and Apple
stock <- tsibbledata::gafa_stock %>% select(Symbol, Date, Close, Volume)
stock
#> # A tibble: 5,032 × 4
#> Symbol Date Close Volume
#> <chr> <date> <dbl> <dbl>
#> 1 AAPL 2014-01-02 79.0 58671200
#> 2 AAPL 2014-01-03 77.3 98116900
#> 3 AAPL 2014-01-06 77.7 103152700
#> 4 AAPL 2014-01-07 77.1 79302300
#> 5 AAPL 2014-01-08 77.6 64632400
#> 6 AAPL 2014-01-09 76.6 69787200
#> 7 AAPL 2014-01-10 76.1 76244000
#> 8 AAPL 2014-01-13 76.5 94623200
#> 9 AAPL 2014-01-14 78.1 83140400
#> 10 AAPL 2014-01-15 79.6 97909700
#> # … with 5,022 more rows
#> # ℹ Use `print(n = ...)` to see more rows
# Arrange and group by `Symbol` (i.e. Google)
stock <- stock %>%
arrange(Symbol, Date) %>%
group_by(Symbol)
linear_model <- function(df) {
lm(Close ~ Volume, data = df)
}
# 10 day rolling regression per group
stock %>%
mutate(model = slide_index(
cur_data(),
Date,
linear_model,
.before = days(9),
.complete = TRUE
))
#> # A tibble: 5,032 × 5
#> # Groups: Symbol [4]
#> Symbol Date Close Volume model
#> <chr> <date> <dbl> <dbl> <list>
#> 1 AAPL 2014-01-02 79.0 58671200 <NULL>
#> 2 AAPL 2014-01-03 77.3 98116900 <NULL>
#> 3 AAPL 2014-01-06 77.7 103152700 <NULL>
#> 4 AAPL 2014-01-07 77.1 79302300 <NULL>
#> 5 AAPL 2014-01-08 77.6 64632400 <NULL>
#> 6 AAPL 2014-01-09 76.6 69787200 <NULL>
#> 7 AAPL 2014-01-10 76.1 76244000 <NULL>
#> 8 AAPL 2014-01-13 76.5 94623200 <lm>
#> 9 AAPL 2014-01-14 78.1 83140400 <lm>
#> 10 AAPL 2014-01-15 79.6 97909700 <lm>
#> # … with 5,022 more rows
#> # ℹ Use `print(n = ...)` to see more rows