Coalescing joins in dplyr
Filling in missing data by joining
When aggregating data, it is not uncommon to need to combine datasets
containing identical non-key variables in varying states of completeness. There
are various ways to accomplish this task. One possibility an coalescing join, a
join in which missing values in x
are filled with matching values from y
.
Such behavior does not exist in current dplyr joins, though it has been
discussed, and so may someday.
For now, let’s build an coalesce_join
function.
First, some sample data:
library(dplyr)
set.seed(47)
df1 <- data_frame(
key = c('a', 'b', 'c', 'd', 'e', 'f'),
var1 = c( 1, 2, 3, 4, NA, NA),
var2 = c( NA, NA, NA, NA, 5, 6),
var3 = c( 1, 2, 3, 4, 5, 6)
)
df2 <- data_frame(
key = c('c', 'd', 'e', 'f'),
var1 = c( NA, NA, 5, 6),
var2 = c( NA, 4, 5, NA),
var4 = c( 3, 4, 5, 6)
)
df1
#> # A tibble: 6 x 4
#> key var1 var2 var3
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 NA 1
#> 2 b 2 NA 2
#> 3 c 3 NA 3
#> 4 d 4 NA 4
#> 5 e NA 5 5
#> 6 f NA 6 6
df2
#> # A tibble: 4 x 4
#> key var1 var2 var4
#> <chr> <dbl> <dbl> <dbl>
#> 1 c NA NA 3
#> 2 d NA 4 4
#> 3 e 5 5 5
#> 4 f 6 NA 6
The goal, then, is to get to:
#> # A tibble: 6 x 5
#> key var1 var2 var3 var4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 NA 1 NA
#> 2 b 2 NA 2 NA
#> 3 c 3 NA 3 3
#> 4 d 4 4 4 4
#> 5 e 5 5 5 5
#> 6 f 6 6 6 6
This is still a relatively simple case, which has a single, complete key variable and in which the data is identical between variables, i.e. decisions about which data takes precedence are inconsequential. Those more complicated cases can be handled by an extension of the logic necessary here, though.
The core operation of coalesce_join
will be done by dplyr::coalesce
, which
replaces NA
values in a vector with corresponding non-missing values from
another of identical length (or length 1), e.g.
coalesce(
c( 1, 2, NA, NA),
c(NA, 2, 3, NA)
)
#> [1] 1 2 3 NA
The logic we want, then, is:
full_join(df1, df2, by = 'key') %>%
mutate(
var1 = coalesce(var1.x, var1.y),
var2 = coalesce(var2.x, var2.y)
) %>%
select(key, var1, var2, var3, var4)
#> # A tibble: 6 x 5
#> key var1 var2 var3 var4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 NA 1 NA
#> 2 b 2 NA 2 NA
#> 3 c 3 NA 3 3
#> 4 d 4 4 4 4
#> 5 e 5 5 5 5
#> 6 f 6 6 6 6
…but without all the explicit references to specific column names. Thus, written for robustness,
coalesce_join <- function(x, y,
by = NULL, suffix = c(".x", ".y"),
join = dplyr::full_join, ...) {
joined <- join(x, y, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(x), names(y))
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
dplyr::bind_cols(joined, coalesced)[cols]
}
Let’s try it out:
coalesce_join(df1, df2, by = 'key')
#> # A tibble: 6 x 5
#> key var1 var2 var3 var4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 NA 1 NA
#> 2 b 2 NA 2 NA
#> 3 c 3 NA 3 3
#> 4 d 4 4 4 4
#> 5 e 5 5 5 5
#> 6 f 6 6 6 6
Beautiful!
While this code could be used for an updating join, because of its structure,
it will always prioritize the data from x
over y
. More flexible (updating
columns from specified sources) or strict (checking equality of non-missing
data in coalesced columns) versions could be written, but the former would
require a good API for specifying data precedence, and the latter would be
inconsistent with the behavior of coalesce
itself.
Adapt as you like.
Share this post