35 Column-wise Operations in dplyr
Throughout the chapters in this book we have learned to do a really vast array of useful data transformations and statistical analyses with the help of the dplyr
package.
So far, however, we’ve always done these transformations and statistical analyses on one column of our data frame at a time. There isn’t anything inherently “wrong” with this approach, but, for reasons we’ve already discussed, there are often advantages to telling R what you want to do one time, and then asking R to do that thing repeatedly across all, or a subset of, the columns in your data frame. That is exactly what dplyr
’s across()
function allows us to do.
There are so many ways we might want to use the across()
function in our R programs. We can’t begin to cover, or even imagine, them all. Instead, the goal of this chapter is just to provide you with an overview of the across()
function and show you some examples of using it with filter()
, mutate()
, and summarise()
to get you thinking about how you might want to use it in your R programs.
Before we discuss further, let’s take a look at a quick example. The first thing we will need to do is load dplyr
.
Then, we will simulate some data. In this case, we are creating a data frame that contains three columns of 10 random numbers:
set.seed(123)
df_xyz <- tibble(
row = 1:10,
x = rnorm(10),
y = rnorm(10),
z = rnorm(10)
) %>%
print()
## # A tibble: 10 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 2 -0.230 0.360 -0.218
## 3 3 1.56 0.401 -1.03
## 4 4 0.0705 0.111 -0.729
## 5 5 0.129 -0.556 -0.625
## 6 6 1.72 1.79 -1.69
## 7 7 0.461 0.498 0.838
## 8 8 -1.27 -1.97 0.153
## 9 9 -0.687 0.701 -1.14
## 10 10 -0.446 -0.473 1.25
Up to this point, if we wanted to find the mean of each column, we would probably have written code like this:
## # A tibble: 1 × 3
## x_mean y_mean z_mean
## <dbl> <dbl> <dbl>
## 1 0.0746 0.209 0.209
With the help of the across()
function, we can now get the mean of each column like this:
## # A tibble: 1 × 3
## x_mean y_mean z_mean
## <dbl> <dbl> <dbl>
## 1 0.0746 0.209 -0.425
Now, you might ask why this is a better approach. Fair question.
In this case, using across()
doesn’t actually reduce the number of lines of code we wrote. In fact, we wrote two additional lines when we used the across()
function. However, imagine if we added 20 additional columns to our data frame. Using the first approach, we would have to write 20 additional lines of code inside the summarise()
function. Using the across()
approach, we wouldn’t have to add any additional code at all. We would simply update the value we pass to the .cols
argument.
Perhaps more importantly, did you notice that we “accidentally” forgot to replace y
with z
when we copied and pasted z_mean = mean(y)
in the code chunk for the first approach? If not, go back and take a look. That mistake is fairly easy to catch and fix in this very simple example. But, in real-world projects, mistakes like this are easy to make, and not always so easy to catch. We are much less likely to make similar mistakes when we use across()
.
35.1 The across() function
The across()
function is part of the dplyr
package. We will always use across()
inside of one of the dplyr
verbs we’ve been learning about. Specifically, mutate()
, and summarise()
. We will not use across()
outside of the dplyr
verbs. Additionally, we will always use across()
within the context of a data frame (as opposed to a vector, matrix, or some other data structure).
To view the help documentation for across()
, you can copy and paste ?dplyr::across
into your R console. If you do, you will see that across()
has four arguments. They are:
1️⃣.cols
. The value we pass to this argument should be columns of the data frame we want to operate on. We can once again use tidy-select argument modifiers here. In the example above, we used c(x:z)
to tell R that we wanted to operate on columns x through z (inclusive). If we had also wanted the mean of the row
column for some reason, we could have used the everything()
tidy-select modifier to tell R that we wanted to operate on all of the columns in the data frame.
2️⃣.fns
. This is where you tell across()
what function, or functions, you want to apply to the columns you selected in .cols
. In the example above, we passed the mean function to the .fns
argument. Notice that we typed mean
without the parentheses (i.e., mean()
).
3️⃣...
. In this case, the ...
argument is where we pass any additional arguments to the function we passed to the .fns
argument. For example, we passed the mean
function to the .fns
argument above. In the data frame above, none of the columns had any missing values. Let’s go ahead and add some missing values so that we can take a look at how ...
works in across()
.
## # A tibble: 10 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 2 NA 0.360 -0.218
## 3 3 1.56 0.401 -1.03
## 4 4 0.0705 NA -0.729
## 5 5 0.129 -0.556 -0.625
## 6 6 1.72 1.79 NA
## 7 7 0.461 0.498 0.838
## 8 8 -1.27 -1.97 0.153
## 9 9 -0.687 0.701 -1.14
## 10 10 -0.446 -0.473 1.25
As we’ve already seen many times, R won’t drop the missing values and carry out a complete case analysis by default:
## # A tibble: 1 × 3
## x_mean y_mean z_mean
## <dbl> <dbl> <dbl>
## 1 NA NA NA
Instead, we have to explicitly tell R to carry out a complete case analysis. We can do so by filtering our rows with missing data (more on this later) or by changing the value of the mean()
function’s na.rm
argument from FALSE
(the default) to TRUE
:
df_xyz %>%
summarise(
x_mean = mean(x, na.rm = TRUE),
y_mean = mean(y, na.rm = TRUE),
z_mean = mean(z, na.rm = TRUE)
)
## # A tibble: 1 × 3
## x_mean y_mean z_mean
## <dbl> <dbl> <dbl>
## 1 0.108 0.220 -0.284
When we use across()
, we will need to pass the na.rm = TRUE
to the mean()
function in across()
’s ...
argument like this:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = mean,
na.rm = TRUE, # Passing na.rm = TRUE to the ... argument
.names = "{col}_mean"
)
)
## # A tibble: 1 × 4
## row_mean x_mean y_mean z_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
Notice that we do not actually type out ... =
or anything like that.
4️⃣.names
. You can use this argument to adjust the column names that will result from the operation you pass to .fns
. In the example above, we used the special {cols}
keyword to use each of the column names that were passed to the .cols
argument as the first part of each of the new columns’ names. Then, we asked R to add a literal underscore and the word “mean” because these are all mean values. That resulted in the new column names you see above. The default value for .names
is just {cols}
. So, if we hadn’t modified the value passed to the .names
argument, our results would have looked like this:
## # A tibble: 1 × 4
## row x y z
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
There is also a special {fn}
keyword that we can use to pass the name of each of the functions we used in .fns
as part of the new column names. However, in order to get {fn}
to work the way we want it to, we have to pass a list of name-function pairs to the .fns
argument. Let me show you what we mean.
First, we will keep the code exactly as it was, but replace “mean” with “{fn}” in the .names
argument:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = mean,
na.rm = TRUE,
.names = "{col}_{fn}"
)
)
## # A tibble: 1 × 4
## row_1 x_1 y_1 z_1
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
This is not the result we wanted. Because, we didn’t name the function that we passed to .fns
, across()
essentially used “function number 1” as its name. In order to get the result we want, we need to pass a list of name-function pairs to the .fns
argument like this:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = list(mean = mean),
na.rm = TRUE,
.names = "{col}_{fn}"
)
)
## # A tibble: 1 × 4
## row_mean x_mean y_mean z_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
Although it may not be self-evident from just looking at the code above, the first mean
in the list(mean = mean)
name-function pair is a name that we are choosing to be passed to the new column names. Theoretically, we could have picked any name. For example:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = list(r4epi = mean),
na.rm = TRUE,
.names = "{col}_{fn}"
)
)
## # A tibble: 1 × 4
## row_r4epi x_r4epi y_r4epi z_r4epi
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
The second mean
in the list(mean = mean)
name-function pair is the name of the actual function we want to apply to the columns in .cols
. This part of the name-function pair must be the name of the function that we actually want to apply to the columns in .cols
. Otherwise, we will get an error:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = list(mean = r4epi),
na.rm = TRUE,
.names = "{col}_{fn}"
)
)
## Error in `summarise()`:
## ℹ In argument: `across(...)`.
## Caused by error:
## ! object 'r4epi' not found
An additional advantage of passing a list of name-function pairs to the .fns
argument is that we can pass multiple functions at once. For example, let’s say that we want the minimum and maximum value of each column in our data frame. Without across()
we might do that analysis like this:
df_xyz %>%
summarise(
x_min = min(x, na.rm = TRUE),
x_max = max(x, na.rm = TRUE),
y_min = min(y, na.rm = TRUE),
y_max = max(y, na.rm = TRUE),
z_min = min(z, na.rm = TRUE),
z_max = max(z, na.rm = TRUE)
)
## # A tibble: 1 × 6
## x_min x_max y_min y_max z_min z_max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.27 1.72 -1.97 1.79 -1.14 1.25
But, we can simply pass min
and max
as a list of name-function pairs if we use across()
:
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = list(min = min, max = max),
na.rm = TRUE,
.names = "{col}_{fn}"
)
)
## # A tibble: 1 × 8
## row_min row_max x_min x_max y_min y_max z_min z_max
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10 -1.27 1.72 -1.97 1.79 -1.14 1.25
How great is that?!?
So, we’ve seen how to pass an individual function to the .fns
argument and we’ve seen how to pass a list containing multiple functions to the .fns
argument. There is actually a third syntax for passing functions to the .fns
argument. The across()
documentation calls it “a purrr-style lambda”. This can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step.
df_xyz %>%
summarise(
across(
.cols = everything(),
.fns = ~ mean(.x, na.rm = TRUE),
.names = "{col}_mean"
)
)
## # A tibble: 1 × 4
## row_mean x_mean y_mean z_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 5.5 0.108 0.220 -0.284
The purrr-style lambda always begins with the tilde symbol (~). Then we type out a function call behind the tilde symbol. We place the special .x
symbol inside the function call where we would normally want to type the name of the column we want the function to operate on. The across()
function will then substitute each column name we passed to the .cols
argument for .x
sequentially. In the example above, there isn’t really any good reason to use this syntax. However, this syntax can be useful at times. We will see some examples below.
35.2 Across with mutate
We’ve already seen a number of examples of manipulating columns of our data frames using the mutate()
function. In this section, we are going to take a look at two examples where using the across()
function inside mutate()
will allow us to apply the same manipulation to multiple columns in our data frame at once.
Let’s go ahead and simulate the same demographics
data frame we simulated for the recoding missing section of the conditional operations chapter. Let’s also add two new columns: a four-category education column and a six-category income column. For all columns except id
and age
, a value of 7
represents “Don’t know” and a value of 9
represents “refused.”
set.seed(123)
demographics <- tibble(
id = 1:10,
age = c(sample(1:30, 9, TRUE), NA),
race = c(1, 2, 1, 4, 7, 1, 2, 9, 1, 3),
hispanic = c(7, 0, 1, 0, 1, 0, 1, 9, 0, 1),
edu_4cat = c(4, 2, 9, 1, 2, 3, 4, 9, 3, 3),
inc_6cat = c(1, 4, 1, 1, 5, 3, 2, 2, 7, 9)
) %>%
print()
## # A tibble: 10 × 6
## id age race hispanic edu_4cat inc_6cat
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 15 1 7 4 1
## 2 2 19 2 0 2 4
## 3 3 14 1 1 9 1
## 4 4 3 4 0 1 1
## 5 5 10 7 1 2 5
## 6 6 18 1 0 3 3
## 7 7 22 2 1 4 2
## 8 8 11 9 9 9 2
## 9 9 5 1 0 3 7
## 10 10 NA 3 1 3 9
When working with data like this, it’s common to want to recode all the 7
’s and 9
’s to NA
’s. We saw how to do that one column at a time already:
demographics %>%
mutate(
race = if_else(race == 7 | race == 9, NA_real_, race),
hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic),
edu_4cat = if_else(edu_4cat == 7 | edu_4cat == 9, NA_real_, edu_4cat)
)
## # A tibble: 10 × 6
## id age race hispanic edu_4cat inc_6cat
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 15 1 7 4 1
## 2 2 19 2 0 2 4
## 3 3 14 1 1 NA 1
## 4 4 3 4 0 1 1
## 5 5 10 NA NA 2 5
## 6 6 18 1 0 3 3
## 7 7 22 2 1 4 2
## 8 8 11 NA NA NA 2
## 9 9 5 1 0 3 7
## 10 10 NA 3 1 3 9
🚩In the code chunk above, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code.
Also, did you notice that we forgot to replace race
with hispanic
in hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic)
? This time, we didn’t write “forgot” in quotes because we really did forget and only noticed it later. In this case, the error caused a value of 1
to be recoded to NA
in the hispanic
column. These typos we’ve been talking about really do happen – even to me!
Here’s how we can use across()
in this situation:
demographics %>%
mutate(
across(
.cols = c(-id, -age),
.fns = ~ if_else(.x == 7 | .x == 9, NA_real_, .x)
)
)
## # A tibble: 10 × 6
## id age race hispanic edu_4cat inc_6cat
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 15 1 NA 4 1
## 2 2 19 2 0 2 4
## 3 3 14 1 1 NA 1
## 4 4 3 4 0 1 1
## 5 5 10 NA 1 2 5
## 6 6 18 1 0 3 3
## 7 7 22 2 1 4 2
## 8 8 11 NA NA NA 2
## 9 9 5 1 0 3 NA
## 10 10 NA 3 1 3 NA
👆Here’s what we did above:
We used a purrr-style lambda to replace
7
’s and9
’s in all columns in our data frame, exceptid
andage
, withNA
.Remember, the special
.x
symbol is just shorthand for each column passed to the.cols
argument.
As another example, let’s say that we are once again working with data from a drug trial that includes a list of side effects for each person:
set.seed(123)
drug_trial <- tibble(
id = 1:10,
se_headache = sample(0:1, 10, TRUE),
se_diarrhea = sample(0:1, 10, TRUE),
se_dry_mouth = sample(0:1, 10, TRUE),
se_nausea = sample(0:1, 10, TRUE)
) %>%
print()
## # A tibble: 10 × 5
## id se_headache se_diarrhea se_dry_mouth se_nausea
## <int> <int> <int> <int> <int>
## 1 1 0 1 0 0
## 2 2 0 1 1 1
## 3 3 0 1 0 0
## 4 4 1 0 0 1
## 5 5 0 1 0 1
## 6 6 1 0 0 0
## 7 7 1 1 1 0
## 8 8 1 0 1 0
## 9 9 0 0 0 0
## 10 10 0 0 1 1
Now, we want to create a factor version of each of the side effect columns. We’ve already learned how to do so one column at a time:
drug_trial %>%
mutate(
se_headache_f = factor(se_headache, 0:1, c("No", "Yes")),
se_diarrhea_f = factor(se_diarrhea, 0:1, c("No", "Yes")),
se_dry_mouth_f = factor(se_dry_mouth, 0:1, c("No", "Yes"))
)
## # A tibble: 10 × 8
## id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f se_diarrhea_f se_dry_mouth_f
## <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 1 0 1 0 0 No Yes No
## 2 2 0 1 1 1 No Yes Yes
## 3 3 0 1 0 0 No Yes No
## 4 4 1 0 0 1 Yes No No
## 5 5 0 1 0 1 No Yes No
## 6 6 1 0 0 0 Yes No No
## 7 7 1 1 1 0 Yes Yes Yes
## 8 8 1 0 1 0 Yes No Yes
## 9 9 0 0 0 0 No No No
## 10 10 0 0 1 1 No No Yes
🚩Once again, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code. Here’s how we can use across()
to do so:
drug_trial %>%
mutate(
across(
.cols = starts_with("se"),
.fns = ~ factor(.x, 0:1, c("No", "Yes")),
.names = "{col}_f"
)
)
## # A tibble: 10 × 9
## id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f se_diarrhea_f se_dry_mouth_f se_nausea_f
## <int> <int> <int> <int> <int> <fct> <fct> <fct> <fct>
## 1 1 0 1 0 0 No Yes No No
## 2 2 0 1 1 1 No Yes Yes Yes
## 3 3 0 1 0 0 No Yes No No
## 4 4 1 0 0 1 Yes No No Yes
## 5 5 0 1 0 1 No Yes No Yes
## 6 6 1 0 0 0 Yes No No No
## 7 7 1 1 1 0 Yes Yes Yes No
## 8 8 1 0 1 0 Yes No Yes No
## 9 9 0 0 0 0 No No No No
## 10 10 0 0 1 1 No No Yes Yes
👆Here’s what we did above:
We used a purrr-style lambda to create a factor version of all the side effect columns in our data frame.
We used the
.names
argument to add an “_f” to the end of the new column names.
35.3 Across with summarise
Let’s return to the ehr
data frame we used in the chapter on working with character strings for our first example of using across()
inside of summarise
.
You may click here to download this file to your computer.
For this example, the only column we will concern ourselves with is the symptoms
column:
## # A tibble: 15 × 1
## symptoms
## <chr>
## 1 "\"Pain\", \"Headache\", \"Nausea\""
## 2 "Pain"
## 3 "Pain"
## 4 "\"Nausea\", \"Headache\""
## 5 "\"Pain\", \"Headache\""
## 6 "\"Nausea\", \"Headache\""
## 7 "Pain"
## 8 <NA>
## 9 "Pain"
## 10 <NA>
## 11 "\"Nausea\", \"Headache\""
## 12 "\"Headache\", \"Pain\", \"Nausea\""
## 13 "Headache"
## 14 "\"Headache\", \"Pain\", \"Nausea\""
## 15 <NA>
You may recall that we created dummy variables for each symptom like this:
symptoms <- symptoms %>%
mutate(
pain = str_detect(symptoms, "Pain"),
headache = str_detect(symptoms, "Headache"),
nausea = str_detect(symptoms, "Nausea")
) %>%
print()
## # A tibble: 15 × 4
## symptoms pain headache nausea
## <chr> <lgl> <lgl> <lgl>
## 1 "\"Pain\", \"Headache\", \"Nausea\"" TRUE TRUE TRUE
## 2 "Pain" TRUE FALSE FALSE
## 3 "Pain" TRUE FALSE FALSE
## 4 "\"Nausea\", \"Headache\"" FALSE TRUE TRUE
## 5 "\"Pain\", \"Headache\"" TRUE TRUE FALSE
## 6 "\"Nausea\", \"Headache\"" FALSE TRUE TRUE
## 7 "Pain" TRUE FALSE FALSE
## 8 <NA> NA NA NA
## 9 "Pain" TRUE FALSE FALSE
## 10 <NA> NA NA NA
## 11 "\"Nausea\", \"Headache\"" FALSE TRUE TRUE
## 12 "\"Headache\", \"Pain\", \"Nausea\"" TRUE TRUE TRUE
## 13 "Headache" FALSE TRUE FALSE
## 14 "\"Headache\", \"Pain\", \"Nausea\"" TRUE TRUE TRUE
## 15 <NA> NA NA NA
🗒Side Note: Some of you may have noticed that we repeated ourselves more than twice in the code chunk above and thought about using across()
to remove it. Unfortunately, across()
won’t solve our problem in this situation. We will need some of the tools that we learn about in later chapters if we want to remove this repetition.
And finally, we used the table()
function to get a count of how many people reported having a headache:
##
## FALSE TRUE
## 4 8
This is where the example stopped in the chapter on working with character strings. However, what if we wanted to know how many people reported the other symptoms as well? Well, we could repeatedly call the table()
function:
##
## FALSE TRUE
## 4 8
##
## FALSE TRUE
## 6 6
But, that would cause us to copy and paste repeatedly. Additionally, wouldn’t it be nice to view these counts in a way that makes them easier to compare? One solution would be to use summarise()
like this:
symptoms %>%
summarise(
had_headache = sum(headache, na.rm = TRUE),
had_pain = sum(pain, na.rm = TRUE),
had_nausea = sum(nausea, na.rm = TRUE)
)
## # A tibble: 1 × 3
## had_headache had_pain had_nausea
## <int> <int> <int>
## 1 8 8 6
This works, but we can do better with across()
:
symptoms %>%
summarise(
across(
.cols = c(headache, pain, nausea),
.fns = ~ sum(.x, na.rm = TRUE)
)
)
## # A tibble: 1 × 3
## headache pain nausea
## <int> <int> <int>
## 1 8 8 6
Great! But, wouldn’t it be nice to know the proportion of people with each symptom as well? You may recall that R treats TRUE
and FALSE
as 1
and 0
when used in a mathematical operation. Additionally, you may already be aware that the mean of a set of 1
’s and 0
’s is equal to the proportion of 1
’s in the set. For example, there are three ones and three zeros in the set (1, 1, 1, 0, 0, 0)
. The proportion of 1
’s in the set is 3 out of 6, which is 0.5. Equivalently, the mean value of the set is (1 + 1 + 1 + 0 + 0 + 0) / 6, which equals 3 / 6, which is 0.5. So, when we have dummy variables like headache
, pain
, and nausea
above, passing them to the mean()
function returns the proportion of TRUE
values. In this case, the proportion of people who had each symptom. We know we can do that calculation like this:
symptoms %>%
summarise(
had_headache = mean(headache, na.rm = TRUE),
had_pain = mean(pain, na.rm = TRUE),
had_nausea = mean(nausea, na.rm = TRUE)
)
## # A tibble: 1 × 3
## had_headache had_pain had_nausea
## <dbl> <dbl> <dbl>
## 1 0.667 0.667 0.5
As before, we can do better with the across()
function like this:
symptoms %>%
summarise(
across(
.cols = c(pain, headache, nausea),
.fns = ~ mean(.x, na.rm = TRUE)
)
)
## # A tibble: 1 × 3
## pain headache nausea
## <dbl> <dbl> <dbl>
## 1 0.667 0.667 0.5
Now, at this point, we might think, “wouldn’t it be nice to see the count and the proportion in the same result?” Well, we can do that by supplying our purrr-style lambdas as functions in a list of name-function pairs like this:
symptom_summary <- symptoms %>%
summarise(
across(
.cols = c(pain, headache, nausea),
.fns = list(
count = ~ sum(.x, na.rm = TRUE),
prop = ~ mean(.x, na.rm = TRUE)
)
)
) %>%
print()
## # A tibble: 1 × 6
## pain_count pain_prop headache_count headache_prop nausea_count nausea_prop
## <int> <dbl> <int> <dbl> <int> <dbl>
## 1 8 0.667 8 0.667 6 0.5
In this case, it’s probably fine to stop here. But, what if we had 20 or 30 symptoms that we were analyzing? It would be really difficult to read and compare them arranged horizontally like this, wouldn’t it?
Do you recall us discussing restructuring our results in the chapter on restructuring data frames? This is a circumstance where we might want to use pivot_longer()
to make our results easier to read and interpret:
symptom_summary %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("symptom", ".value"),
names_sep = "_"
)
## # A tibble: 3 × 3
## symptom count prop
## <chr> <int> <dbl>
## 1 pain 8 0.667
## 2 headache 8 0.667
## 3 nausea 6 0.5
There! Isn’t that result much easier to read?
For our final example of this section, let’s return the first example from the writing functions chapter. We started with some simulated study data:
study <- tibble(
age = c(32, 30, 32, 29, 24, 38, 25, 24, 48, 29, 22, 29, 24, 28, 24, 25,
25, 22, 25, 24, 25, 24, 23, 24, 31, 24, 29, 24, 22, 23, 26, 23,
24, 25, 24, 33, 27, 25, 26, 26, 26, 26, 26, 27, 24, 43, 25, 24,
27, 28, 29, 24, 26, 28, 25, 24, 26, 24, 26, 31, 24, 26, 31, 34,
26, 25, 27, NA),
age_group = c(2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2,
2, 1, 1, 1, NA),
gender = c(2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1,
1, 1, 2, 1, NA),
ht_in = c(70, 63, 62, 67, 67, 58, 64, 69, 65, 68, 63, 68, 69, 66, 67, 65,
64, 75, 67, 63, 60, 67, 64, 73, 62, 69, 67, 62, 68, 66, 66, 62,
64, 68, NA, 68, 70, 68, 68, 66, 71, 61, 62, 64, 64, 63, 67, 66,
69, 76, NA, 63, 64, 65, 65, 71, 66, 65, 65, 71, 64, 71, 60, 62,
61, 69, 66, NA),
wt_lbs = c(216, 106, 145, 195, 143, 125, 138, 140, 158, 167, 145, 297, 146,
125, 111, 125, 130, 182, 170, 121, 98, 150, 132, 250, 137, 124,
186, 148, 134, 155, 122, 142, 110, 132, 188, 176, 188, 166, 136,
147, 178, 125, 102, 140, 139, 60, 147, 147, 141, 232, 186, 212,
110, 110, 115, 154, 140, 150, 130, NA, 171, 156, 92, 122, 102,
163, 141, NA),
bmi = c(30.99, 18.78, 26.52, 30.54, 22.39, 26.12, 23.69, 20.67, 26.29,
25.39, 25.68, 45.15, 21.56, 20.17, 17.38, 20.8, 22.31, 22.75,
26.62, 21.43, 19.14, 23.49, 22.66, 32.98, 25.05, 18.31, 29.13,
27.07, 20.37, 25.01, 19.69, 25.97, 18.88, 20.07, NA, 26.76,
26.97, 25.24, 20.68, 23.72, 24.82, 23.62, 18.65, 24.03, 23.86,
10.63, 23.02, 23.72, 20.82, 28.24, NA, 37.55, 18.88, 18.3,
19.13, 21.48, 22.59, 24.96, 21.63, NA, 29.35, 21.76, 17.97,
22.31, 19.27, 24.07, 22.76, NA),
bmi_3cat = c(3, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1,
1, 1, 3, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, NA, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1, NA, 2, 1,
1, 1, 1, 1, 1, NA)
) %>%
mutate(
age_group = factor(age_group, labels = c("Younger than 30", "30 and Older")),
gender = factor(gender, labels = c("Female", "Male")),
bmi_3cat = factor(bmi_3cat, labels = c("Normal", "Overweight", "Obese"))
) %>%
print()
## # A tibble: 68 × 7
## age age_group gender ht_in wt_lbs bmi bmi_3cat
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <fct>
## 1 32 30 and Older Male 70 216 31.0 Obese
## 2 30 30 and Older Female 63 106 18.8 Normal
## 3 32 30 and Older Female 62 145 26.5 Overweight
## 4 29 Younger than 30 Male 67 195 30.5 Obese
## 5 24 Younger than 30 Female 67 143 22.4 Normal
## 6 38 30 and Older Female 58 125 26.1 Overweight
## 7 25 Younger than 30 Female 64 138 23.7 Normal
## 8 24 Younger than 30 Male 69 140 20.7 Normal
## 9 48 30 and Older Male 65 158 26.3 Overweight
## 10 29 Younger than 30 Male 68 167 25.4 Overweight
## # ℹ 58 more rows
And wrote our own function to calculate the number of missing values, mean, median, min, and max for all of the continuous variables:
continuous_stats <- function(var) {
study %>%
summarise(
n_miss = sum(is.na({{ var }})),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
min = min({{ var }}, na.rm = TRUE),
max = max({{ var }}, na.rm = TRUE)
)
}
We then used that function to calculate our statistics of interest for each continuous variable:
## # A tibble: 1 × 5
## n_miss mean median min max
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 26.9 26 22 48
## # A tibble: 1 × 5
## n_miss mean median min max
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 3 66.0 66 58 76
## # A tibble: 1 × 5
## n_miss mean median min max
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 2 148. 142. 60 297
## # A tibble: 1 × 5
## n_miss mean median min max
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 4 23.6 22.9 10.6 45.2
This is definitely an improvement over all the copying and pasting we were doing before we wrote our own function. However, there is still some unnecessary repetition above. One way we can remove this repetition is to use across()
like this:
summary_stats <- study %>%
summarise(
across(
.cols = c(age, ht_in, wt_lbs, bmi),
.fns = list(
n_miss = ~ sum(is.na(.x)),
mean = ~ mean(.x, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE)
)
)
) %>%
print()
## # A tibble: 1 × 20
## age_n_miss age_mean age_median age_min age_max ht_in_n_miss ht_in_mean ht_in_median ht_in_min ht_in_max wt_lbs_n_miss wt_lbs_mean wt_lbs_median
## <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 26.9 26 22 48 3 66.0 66 58 76 2 148. 142.
## # ℹ 7 more variables: wt_lbs_min <dbl>, wt_lbs_max <dbl>, bmi_n_miss <int>, bmi_mean <dbl>, bmi_median <dbl>, bmi_min <dbl>, bmi_max <dbl>
This method works, but it has the same problem that our symptom summaries had above. Our results are hard to read and interpret because they are arranged horizontally. We can once again pivot this data longer, but it won’t be quite as easy as it was before. Our first attempt might look like this:
summary_stats %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("characteristic", ".value"),
names_sep = "_"
)
## Warning: Expected 2 pieces. Additional pieces discarded in 12 rows [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
## # A tibble: 12 × 8
## characteristic n mean median min max `in` lbs
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 1 26.9 26 22 48 NA NA
## 2 ht NA NA NA NA NA 3 NA
## 3 ht NA NA NA NA NA 66.0 NA
## 4 ht NA NA NA NA NA 66 NA
## 5 ht NA NA NA NA NA 58 NA
## 6 ht NA NA NA NA NA 76 NA
## 7 wt NA NA NA NA NA NA 2
## 8 wt NA NA NA NA NA NA 148.
## 9 wt NA NA NA NA NA NA 142.
## 10 wt NA NA NA NA NA NA 60
## 11 wt NA NA NA NA NA NA 297
## 12 bmi 4 23.6 22.9 10.6 45.2 NA NA
What do you think the problem is here?
Well, we passed an underscore to the names_sep
argument. This tells pivot_longer()
that that character string on the left side of the underscore should make up the values of the new characteristic
column and each unique character string on the right side of the underscore should be used to create a new column name. In the symptoms data, this worked fine because all of the column names followed this pattern (e.g., pain_count
and pain_prop
). But, do the column names in summary_stats
always follow this pattern? What about age_n_miss
and ht_in_n_miss
? All the extra underscores in the column names makes this pattern ineffective.
There are probably many ways we could address this problem. We think the most straightforward way is probably to go back to the code we used to create summary_stats
and use the .names
argument to separate the column name and statistic name with a character other than an underscore. Maybe a hyphen instead:
summary_stats <- study %>%
summarise(
across(
.cols = c(age, ht_in, wt_lbs, bmi),
.fns = list(
n_miss = ~ sum(is.na(.x)),
mean = ~ mean(.x, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE)
),
.names = "{col}-{fn}" # This is the new part of the code
)
) %>%
print()
## # A tibble: 1 × 20
## `age-n_miss` `age-mean` `age-median` `age-min` `age-max` `ht_in-n_miss` `ht_in-mean` `ht_in-median` `ht_in-min` `ht_in-max` `wt_lbs-n_miss` `wt_lbs-mean`
## <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 26.9 26 22 48 3 66.0 66 58 76 2 148.
## # ℹ 8 more variables: `wt_lbs-median` <dbl>, `wt_lbs-min` <dbl>, `wt_lbs-max` <dbl>, `bmi-n_miss` <int>, `bmi-mean` <dbl>, `bmi-median` <dbl>,
## # `bmi-min` <dbl>, `bmi-max` <dbl>
Now, we can simply pass a hyphen to the names_sep
argument to pivot_longer()
:
summary_stats %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("characteristic", ".value"),
names_sep = "-"
)
## # A tibble: 4 × 6
## characteristic n_miss mean median min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 age 1 26.9 26 22 48
## 2 ht_in 3 66.0 66 58 76
## 3 wt_lbs 2 148. 142. 60 297
## 4 bmi 4 23.6 22.9 10.6 45.2
Look at how much easier those results are to read!
35.4 Across with filter
We’ve already discussed complete case analyses multiple times in this book. That is, including only the rows from our data frame that don’t have any missing values in our analysis. Additionally, we’ve already seen how we can use the filter()
function to remove the rows of a single column where the data are missing. For example:
## # A tibble: 9 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 3 1.56 0.401 -1.03
## 3 4 0.0705 NA -0.729
## 4 5 0.129 -0.556 -0.625
## 5 6 1.72 1.79 NA
## 6 7 0.461 0.498 0.838
## 7 8 -1.27 -1.97 0.153
## 8 9 -0.687 0.701 -1.14
## 9 10 -0.446 -0.473 1.25
Notice that row 2 – the row that had a missing value for x
– is no longer in the data frame, and we can now easily calculate the mean value of x
.
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.108
However, we want to remove the rows that have a missing value in any column – not just x
. We could get this result using multiple sequential filter()
functions like this:
## # A tibble: 7 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 3 1.56 0.401 -1.03
## 3 5 0.129 -0.556 -0.625
## 4 7 0.461 0.498 0.838
## 5 8 -1.27 -1.97 0.153
## 6 9 -0.687 0.701 -1.14
## 7 10 -0.446 -0.473 1.25
As you can see, rows 2, 4, and 6 – the rows with a missing value for x
, y
, and z
– were dropped.
🚩Of course, in the code chunk above, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code.
At this point in the book, our first thought might be to use the across()
function, inside the filter()
function, to remove all of the rows rows with missing values from our data frame. However, as of dplyr version 1.0.4
, using the across()
function inside of filter()
is deprecated. That means we shouldn’t use it anymore. Instead, we should use the if_any()
or if_all()
functions, which take the exact same arguments as across()
. In the code chunk below, we will show you how to solve this problem, then we will dissect the solution below.
## # A tibble: 7 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 3 1.56 0.401 -1.03
## 3 5 0.129 -0.556 -0.625
## 4 7 0.461 0.498 0.838
## 5 8 -1.27 -1.97 0.153
## 6 9 -0.687 0.701 -1.14
## 7 10 -0.446 -0.473 1.25
👆Here’s what we did above:
You can type
?dplyr::if_any
or?dplyr::if_all
into your R console to view the help documentation for this function and follow along with the explanation below.We used the
if_all()
function inside of thefilter()
function to keep only the rows in our data frame that had nonmissing values for all of the columnsx
,y
, andz
.We passed the value
c(x:z)
to the.cols
argument. This told R to apply the function passed to the.fns
argument to the columnsx
throughz
inclusive.We used a purrr-style lambda to test whether or not each value of each of the columns passed to
.cols
is NOT missing.Remember, the special
.x
symbol is just shorthand for each column passed to the.cols
argument.
So, how does this work? Well, first let’s remember that the is.na()
function returns TRUE
when the value of the vector passed to it is missing and FALSE
when it is not missing. For example:
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We can then use the !
operator to “flip” those results. In other words, to return TRUE
when the value of the vector passed to it is not missing and FALSE
when it is missing. For example:
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The filter() function then returns the rows from the data frame where the values returned by !is.na()
are TRUE
and drops the rows where they are FALSE
. For example, we can copy and paste the TRUE/FALSE values above to keep only the rows with nonmissing values for x
:
## # A tibble: 9 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 3 1.56 0.401 -1.03
## 3 4 0.0705 NA -0.729
## 4 5 0.129 -0.556 -0.625
## 5 6 1.72 1.79 NA
## 6 7 0.461 0.498 0.838
## 7 8 -1.27 -1.97 0.153
## 8 9 -0.687 0.701 -1.14
## 9 10 -0.446 -0.473 1.25
Now, let’s repeat this process for the columns y
and z
as well.
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
Next, let’s stack these results next to each other to make them even easier to view.
not_missing <- tibble(
row = 1:10,
x = !is.na(df_xyz$x),
y = !is.na(df_xyz$y),
z = !is.na(df_xyz$z)
) %>%
print()
## # A tibble: 10 × 4
## row x y z
## <int> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE TRUE
## 2 2 FALSE TRUE TRUE
## 3 3 TRUE TRUE TRUE
## 4 4 TRUE FALSE TRUE
## 5 5 TRUE TRUE TRUE
## 6 6 TRUE TRUE FALSE
## 7 7 TRUE TRUE TRUE
## 8 8 TRUE TRUE TRUE
## 9 9 TRUE TRUE TRUE
## 10 10 TRUE TRUE TRUE
👆Here’s what we did above:
- We created a data frame that contains the value
TRUE
in each position wheredf_xyz
has a nonmissing value andFALSE
in each position wheredf_xyz
has a missing value. We wouldn’t typically create this for our data analysis. We just created it here for teaching purposes.
You can think of the data frame of TRUE
and FALSE
values above as an intermediate product that if_any()
and if_all()
uses “under the hood” to decide which rows to keep. We think using this data frame as a conceptual model makes it a little easier to understand how if_any()
and if_all()
differ.
if_any()
will keep the rows where any value of x
, y
, or z
are TRUE
. In this case, there is at least one TRUE
value in every row. Therefore, we would expect if_any()
to return all rows in our data frame. And, that’s exactly what happens.
## # A tibble: 10 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 2 NA 0.360 -0.218
## 3 3 1.56 0.401 -1.03
## 4 4 0.0705 NA -0.729
## 5 5 0.129 -0.556 -0.625
## 6 6 1.72 1.79 NA
## 7 7 0.461 0.498 0.838
## 8 8 -1.27 -1.97 0.153
## 9 9 -0.687 0.701 -1.14
## 10 10 -0.446 -0.473 1.25
On the other hand, if_all()
will the keep the rows where all value of x
, y
, and z
are TRUE.
In this case, there is at least one FALSE
value in rows 2, 4, and 6. Therefore, we would expect if_all()
to return all rows in our data frame except rows 2, 4, and 6. That’s exactly what happens, and it’s exaclty the result we want.
## # A tibble: 7 × 4
## row x y z
## <int> <dbl> <dbl> <dbl>
## 1 1 -0.560 1.22 -1.07
## 2 3 1.56 0.401 -1.03
## 3 5 0.129 -0.556 -0.625
## 4 7 0.461 0.498 0.838
## 5 8 -1.27 -1.97 0.153
## 6 9 -0.687 0.701 -1.14
## 7 10 -0.446 -0.473 1.25
Because this is a small, simple example, using if_all()
doesn’t actually reduce the number of lines of code we wrote. But again, try to imagine if we added 20 additional columns to our data frame. We would only need to update the value we pass to the .cols
argument. This makes our code more concise, easier to maintain, and less error-prone.
35.5 Summary
We are big fans of using across()
, if_any()
, and if_all()
in conjunction with the dplyr
verbs. They allows us to remove a lot of the unnecessary repetition from our code in a way that integrates pretty seamlessly with the tools we are already using. Perhaps you will see value in using these functions as well. In the next chapter, we will learn about using for loops to remove unnecessary repetition from our code.