Anonymous Functions, Not Variables

Use the LHS of a formula to specify variable names in purrr-style lambda functions

March 25, 2018 Edward Visel

I am a very heavy purrr user. The killer feature is clearly map_df (fairly recently rebranded as map_dfr and map_dfc for row and column binding, respectively) to iterate over a list à la lapply and simplify the result to a data frame. Thanks to the power of dplyr::bind_rows, it fixes all the drawbacks of sapply’s simplify2array behavior:

- It returns a data frame, not a matrix or array, so multiple types can be kept. Never see a matrix of lists again.
- Variables are aligned by names, not locations, and thus will not be recycled to locations they do not belong, and NAs will be inserted as appropriate.
- Turning the list names into a column is as simple as passing a name for it to the .id parameter.
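
As a small illustration of those three points, here is a minimal sketch. The records list and the column name "source" are made up for the example, and map_dfr needs dplyr installed. In purrr 1.0.0 and later, map_dfr is superseded in favor of map followed by list_rbind, though it still works.

```r
library(purrr)  # map_dfr() calls dplyr::bind_rows(), so dplyr must be installed

# Hypothetical records: fields appear in different orders,
# and only the second element has a `score`
records <- list(
    first  = list(id = 1L, label = "a"),
    second = list(label = "b", id = 2L, score = 0.5)
)

# Each element becomes a one-row data frame; rows are bound by column name,
# and the list names are stored in a new `source` column
combined <- map_dfr(records, 
                    ~as.data.frame(.x, stringsAsFactors = FALSE), 
                    .id = "source")

str(combined)
```

The id and label columns line up by name despite the differing order, the missing score is filled with NA, and the types of each column are kept.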

Along with tidyr’s gather and spread, map_df is indispensable for me—to do without would mean a lot of do.call(rbind, ...) that can go very wrong without warning.

purrr has lots of other lovely functions, too, like transpose, partial, and cross that are well worth learning. At the core of its functional programming abilities, though, is as_mapper, a function unlikely to be called much directly, but which is called by every map variant to decide what to do with the .f and ... parameters, which may either be parameters for subsetting each element or a function, possibly with additional parameters. In the latter case, purrr:::as_mapper.default is [now] simply a wrapper for rlang::as_closure.
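
To see what as_mapper does with each kind of input, it can be called directly. This is only a sketch of the behavior described above; the variable names are arbitrary.

```r
library(purrr)

# A formula becomes a function of .x
add_one <- as_mapper(~.x + 1)
two_plus_one <- add_one(2)

# A string becomes a function that extracts an element by name
get_b <- as_mapper("b")
b_value <- get_b(list(a = 1, b = 2))

# A number becomes a function that extracts an element by position
get_second <- as_mapper(2)
second_value <- get_second(list("p", "q"))

# An ordinary function can be called as usual after conversion
shout <- as_mapper(toupper)
shouted <- shout("a")
```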

This function enables one of purrr’s far-reaching conveniences: the ability to define anonymous functions with formulas. In its most basic form, e.g. a unary function for map, function(x) as.character(x) can be rewritten as ~as.character(.x). The ~ quotes the function body and, for the unary case, can be read as function(.x).

For unary functions, the input can be referred to as .x, just . (which I generally avoid to eliminate confusion with data piped in), or ..1, since it is the first parameter. For binary functions like those for map2 or reduce, the inputs are typically referred to as .x and .y (though the first behaves the same as that of a unary function, and the second, as such, can also be referred to as ..2). For a polyadic function like that for pmap, the parameters are input as ..., which can be collected, e.g. params = list(...), or accessed directly by number with ..1, ..2, etc. notation as usual.
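
The different ways of referring to inputs can be compared side by side. The inputs here are arbitrary and chosen only to make the correspondence visible.

```r
library(purrr)

# Unary: .x, . and ..1 all refer to the single input
unary_x    <- map_chr(1:3, ~paste0("n", .x))
unary_dot  <- map_chr(1:3, ~paste0("n", .))
unary_dot1 <- map_chr(1:3, ~paste0("n", ..1))

# Binary: .x and .y, or equivalently ..1 and ..2
binary_xy   <- map2_chr(c("a", "b"), 1:2, ~paste0(.x, .y))
binary_dots <- map2_chr(c("a", "b"), 1:2, ~paste0(..1, ..2))
total       <- reduce(1:4, ~.x + .y)

# Polyadic: collect ... into a list, or pick parameters by position
inputs    <- list(c("a", "b"), 1:2, c("x", "y"))
collected <- pmap_chr(inputs, ~paste(list(...), collapse = "-"))
reversed  <- pmap_chr(inputs, ~paste0(..3, ..2, ..1))
```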

While this notation has a small learning curve, the convenience when composing anonymous functions adds up. Keystrokes are saved. Less noticeably but perhaps more importantly, the mental effort of naming parameters is eliminated. (“Did I call something x already?”)

This is all very good. However, there are limitations. In particular, nested calls with this notation will typically not work, because the inner and outer functions both call their input .x, so the inner .x masks the outer one, e.g.

library(purrr)

map(c("a", "b", "c"), 
    ~map_chr(1:3, 
         ~paste0(.x, .x))) %>% 
    str()
#> List of 3
#>  $ : chr [1:3] "11" "22" "33"
#>  $ : chr [1:3] "11" "22" "33"
#>  $ : chr [1:3] "11" "22" "33"

Assuming such a case where double iteration is necessary and cannot be vectorized away¹, there is simply no convenient way to write this code with this notation. If the inner function takes the element being iterated over as its first parameter, an alternative is to pass the function and its other parameters separately, e.g.

map(c("a", "b", "c"), 
    ~map_chr(1:3, 
             paste0, 
             .x)) %>% 
    str()
#> List of 3
#>  $ : chr [1:3] "1a" "2a" "3a"
#>  $ : chr [1:3] "1b" "2b" "3b"
#>  $ : chr [1:3] "1c" "2c" "3c"

but this gets hard to read, and the order cannot be reversed without flipping the loops, which may be impractical or impossible. Thus, the alternative is to fall back to standard anonymous function notation so as to specify different names for each parameter:

map(c("a", "b", "c"), 
    function(x) {
        map_chr(1:3, function(y) paste0(x, y))
    }) %>% 
    str()
#> List of 3
#>  $ : chr [1:3] "a1" "a2" "a3"
#>  $ : chr [1:3] "b1" "b2" "b3"
#>  $ : chr [1:3] "c1" "c2" "c3"

This works, but loses a lot of convenience for the sake of avoiding a name clash. But I think there is a better way.

The functions in question are anonymous functions, i.e. they are not assigned to a name, but are rather raw expressions that define a function. In R, the most common usage of such functions (sometimes called lambda expressions due to their origins in Alonzo Church’s lambda calculus) is when passing a function as a parameter to a function like map. Since their introduction in Lisp, anonymous functions have become a part of most modern programming languages. Syntax differs, of course; for example in Racket, a modern Lisp:

((lambda (x) 
  (* x x)) 
 2)
#> 4

or Python:

(lambda x: print(x * x))(2)
#> 4

or Julia:

((x) -> x * x)(2)
#> 4

or R:

(function(x) x * x)(2)
#> [1] 4

Comparing these syntaxes to that of purrr’s anonymous functions, a disadvantage of the latter becomes apparent: there is no way to specify parameter names, which is why a nested function cannot refer to the input of the function enclosing it. That said, given both a notation like that of Julia above and how models in R use formula notation, it seems plausible that variable names could be specified on the otherwise unused left-hand side (LHS) of a formula function.² Multiple variables cannot be separated by commas (which would terminate the formula), but could be separated by +, like multiple terms in a regression.

To build such capability, let’s use rlang, which has handy utilities for parsing and creating formulas and functions.

1. The variable names can be parsed from the LHS (returned as an expression by f_lhs) with base::all.vars, which takes an expression and returns a character vector of variables it uses.
2. The function can be created with new_function, which requires
    - the parameters, passed as a named list. Since there should be no defaults, the args parameter should be a list with named empty elements. For known inputs, such a list can be created with alist, e.g. alist(x = , y = ), but because of its quoting semantics, it is very difficult to program with. A suitable list can be generated by simply making a list of empty expressions, though.
    - the body, which is simply the right-hand side of the formula, extracted with f_rhs, and
    - the environment. Formulas, like functions, have environments, which can be accessed with f_env.
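
Before assembling these pieces, it may help to look at each one separately. This is a sketch on a made-up formula, x + y ~ x * y. It uses rlang's missing_arg() for the empty elements, which is my substitution for a generic "empty expression"; the intermediate variable names are arbitrary.

```r
library(rlang)

# Hypothetical formula used only for illustration
f <- x + y ~ x * y

# 1. Parameter names come from the variables on the LHS
arg_names <- all.vars(f_lhs(f))

# 2a. A named list of empty elements, i.e. parameters without defaults
#     (missing_arg() is assumed here as the way to spell an empty element)
fn_args <- set_names(rep(list(missing_arg()), length(arg_names)), 
                     arg_names)

# 2b. The body is the RHS of the formula
fn_body <- f_rhs(f)

# 2c. The environment is the one the formula was created in
fn_env <- f_env(f)

multiply <- new_function(args = fn_args, body = fn_body, env = fn_env)
product <- multiply(2, 3)
```

Printing fn_args next to alist(x = , y = ) is a quick way to check that the generated list has the same shape as the handwritten one.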

All together,

library(rlang)

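lambda <- function(x){
    # parameter names are the variables on the LHS of the formula
    arg_names <- all.vars(f_lhs(x))
    # args: a named list of empty expressions, i.e. parameters without defaults
    new_function(args = set_names(rep(list(expr()), 
                                      length(arg_names)), 
                                  arg_names), 
                 body = f_rhs(x),    # the RHS of the formula
                 env = f_env(x))     # the environment the formula was created in
}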

Now we can write formula functions with named parameters, much like ordinary ones (though without default values), e.g.

lambda(x ~ x * x)
#> function (x) 
#> x * x

lambda(x ~ x * x)(2)
#> [1] 4

It works for multiple parameters, too:

lambda(x + na.rm ~ mean(x, na.rm = na.rm))(c(1, NA, 3), na.rm = TRUE)
#> [1] 2

Now, in the original nested map example,

map(c("a", "b", "c"), 
    lambda(x ~ map_chr(1:3, 
                       lambda(y ~ paste0(x, y))))) %>% 
    str()
#> List of 3
#>  $ : chr [1:3] "a1" "a2" "a3"
#>  $ : chr [1:3] "b1" "b2" "b3"
#>  $ : chr [1:3] "c1" "c2" "c3"

Ideally, it would be nice to drop the lambda call altogether, but as far as I have been able to find, the only way to do so would be to redefine purrr:::as_mapper.default to call lambda instead of rlang::as_closure when passed a formula with anything on the left-hand side.
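
Short of redefining anything inside purrr, one workaround is to wrap the map variants so that they convert two-sided formulas before handing them on. The following is only a sketch of that idea, not something purrr or rlang provides: as_mapper_lhs, with_lhs, map_l, and map_chr_l are all hypothetical names, and lambda is repeated (with missing_arg() for the empty elements) so the block stands alone.

```r
library(purrr)
library(rlang)

# Same idea as lambda above: LHS variables become parameters
lambda <- function(x) {
    arg_names <- all.vars(f_lhs(x))
    new_function(args = set_names(rep(list(missing_arg()), 
                                      length(arg_names)), 
                                  arg_names), 
                 body = f_rhs(x), 
                 env = f_env(x))
}

# Hypothetical helper: use lambda for two-sided formulas, 
# and purrr's usual conversion for everything else
as_mapper_lhs <- function(.f) {
    if (is_formula(.f) && !is.null(f_lhs(.f))) {
        lambda(.f)
    } else {
        as_mapper(.f)
    }
}

# Hypothetical adapter: wrap any map variant taking (.x, .f, ...)
with_lhs <- function(map_fn) {
    force(map_fn)
    function(.x, .f, ...) map_fn(.x, as_mapper_lhs(.f), ...)
}

map_l     <- with_lhs(map)
map_chr_l <- with_lhs(map_chr)

nested <- map_l(c("a", "b", "c"), 
                x ~ map_chr_l(1:3, y ~ paste0(x, y)))
str(nested)

# One-sided formulas still go through as_mapper
one_sided <- map_chr_l(1:2, ~paste0("n", .x))
```

This trades the lambda call for a differently named map function, so whether it actually saves keystrokes depends on how many variants would need wrapping.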

For now, then, extra keystrokes abide.³

¹ Here, outer(c("a", "b", "c"), 1:3, paste0), though that does return a matrix instead of a list.
² The fascinating magrittr alternative pipeR implements something of the sort for naming data piped into a function.
³ Maybe not all of them; see Part 2: gsubfn.
