Querying across files with Apache Drill
Globbing, implicit columns, and the power of SQL
When I first used Apache Drill several years ago, it was one of those “holy crap this is amazing” moments. Moreover, every time since that I’ve thought “Oh, Drill could be really useful here” and spun it up, that thought has been quickly followed by “holy crap this is amazing” all over again. It’s just delightful. I keep thinking I should try out alternatives like Presto (which has two branches now) or Apache Impala, but I always start by spinning up Drill for comparison and never quite make it to anything else.
To be fair, it’s not all roses:
- the installation seems very sensitive to your Java version and configuration
- Java errors and tracebacks can get long enough to overflow your buffer and are generally unrelated to what’s actually wrong, anyway
- configuring Drill beyond the basics can get confusing
but once you get it working, none of that matters much.
When it comes to working with Drill in R, much is owed to Bob Rudis, who wrote sergeant—which defines a DBI and dplyr interface to Drill—as well as lots of pieces about the subject, notably Using Apache Drill with R. As there are thus plenty of resources on getting started with Drill, I am here going to focus on one particular workflow where Drill really shines: working with a directory of files.
First, let’s make some sample files to play with using
nycflights13, whose flights
table is
not huge (336,776 rows), but big enough to suggest how Drill can be useful, e.g.
where the dataset is not just one year’s flights for one city, but, say, all
global recorded flights.
Let’s save the data as Parquet files, which are considerably smaller than CSVs and allow individual columns to be read without scanning the whole file, making them a common choice for larger modern datasets. The arrow package makes reading and writing Parquet in R easy.1 Drill also has excellent Parquet support.
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
#> ✓ tibble 3.0.0 ✓ dplyr 0.8.5
#> ✓ tidyr 1.0.2 ✓ stringr 1.4.0
#> ✓ readr 1.3.1 ✓ forcats 0.5.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
flights_dir <- '/tmp/flights'
dir.create(flights_dir)
data('flights', package = 'nycflights13')
# make a directory for each carrier
unique(flights$carrier) %>%
  walk(~dir.create(file.path(flights_dir, .x)))
# turn year and month into filenames and write the rest to parquet files
flights %>%
  group_by(carrier, year, month) %>%
  nest() %>%
  mutate(
    filename = paste0(year, '-', sprintf('%02d', month), '.parquet'),
    filepath = file.path(flights_dir, carrier, filename)
  ) %>%
  with(walk2(data, filepath, arrow::write_parquet))
list.dirs(flights_dir)
#> [1] "/tmp/flights" "/tmp/flights/9E" "/tmp/flights/AA" "/tmp/flights/AS"
#> [5] "/tmp/flights/B6" "/tmp/flights/DL" "/tmp/flights/EV" "/tmp/flights/F9"
#> [9] "/tmp/flights/FL" "/tmp/flights/HA" "/tmp/flights/MQ" "/tmp/flights/OO"
#> [13] "/tmp/flights/UA" "/tmp/flights/US" "/tmp/flights/VX" "/tmp/flights/WN"
#> [17] "/tmp/flights/YV"
list.files(file.path(flights_dir, "AA"))
#> [1] "2013-01.parquet" "2013-02.parquet" "2013-03.parquet" "2013-04.parquet"
#> [5] "2013-05.parquet" "2013-06.parquet" "2013-07.parquet" "2013-08.parquet"
#> [9] "2013-09.parquet" "2013-10.parquet" "2013-11.parquet" "2013-12.parquet"
Cool, now we’ve got some sample data. Reading a single file is easy:
aa_2013_01 <- file.path(flights_dir, "AA", "2013-01.parquet")
arrow::read_parquet(aa_2013_01)
#> # A tibble: 2,794 x 16
#> day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> <int> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 542 540 2 923 850 33
#> 2 1 558 600 -2 753 745 8
#> 3 1 559 600 -1 941 910 31
#> 4 1 606 610 -4 858 910 -12
#> 5 1 623 610 13 920 915 5
#> 6 1 628 630 -2 1137 1140 -3
#> 7 1 629 630 -1 824 810 14
#> 8 1 635 635 0 1028 940 48
#> 9 1 656 700 -4 854 850 4
#> 10 1 656 659 -3 949 959 -10
#> # … with 2,784 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
We can even read in only a subset of columns:
arrow::read_parquet(aa_2013_01, contains('time'))
#> # A tibble: 2,794 x 6
#> dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
#> <int> <int> <int> <int> <dbl> <dttm>
#> 1 542 540 923 850 160 2013-01-01 10:00:00
#> 2 558 600 753 745 138 2013-01-01 11:00:00
#> 3 559 600 941 910 257 2013-01-01 11:00:00
#> 4 606 610 858 910 152 2013-01-01 11:00:00
#> 5 623 610 920 915 153 2013-01-01 11:00:00
#> 6 628 630 1137 1140 192 2013-01-01 11:00:00
#> 7 629 630 824 810 140 2013-01-01 11:00:00
#> 8 635 635 1028 940 248 2013-01-01 11:00:00
#> 9 656 700 854 850 143 2013-01-01 12:00:00
#> 10 656 659 949 959 142 2013-01-01 11:00:00
#> # … with 2,784 more rows
But what if we want to work across files? A common idiom is to iterate across a vector of files, read them in, and rbind them, e.g.
file.path(flights_dir, "AA") %>%
  list.files(full.names = TRUE) %>%
  setNames(., .) %>% # so `.id` will work
  map_dfr(arrow::read_parquet, .id = "filename")
#> # A tibble: 32,729 x 17
#> filename day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <chr> <int> <int> <int> <dbl> <int> <int>
#> 1 /tmp/fl… 1 542 540 2 923 850
#> 2 /tmp/fl… 1 558 600 -2 753 745
#> 3 /tmp/fl… 1 559 600 -1 941 910
#> 4 /tmp/fl… 1 606 610 -4 858 910
#> 5 /tmp/fl… 1 623 610 13 920 915
#> 6 /tmp/fl… 1 628 630 -2 1137 1140
#> 7 /tmp/fl… 1 629 630 -1 824 810
#> 8 /tmp/fl… 1 635 635 0 1028 940
#> 9 /tmp/fl… 1 656 700 -4 854 850
#> 10 /tmp/fl… 1 656 659 -3 949 959
#> # … with 32,719 more rows, and 10 more variables: arr_delay <dbl>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
This works—to a point. The core limitation is memory: if the files in question require more memory to read in than is available, this approach is insufficient. Parquet allows us to read in only some columns, which stretches this a bit further, but ultimately the next stage is processing each file as a batch before combining, e.g.
file.path(flights_dir, "AA") %>%
  list.files(full.names = TRUE) %>%
  setNames(., .) %>%
  map_dfr(
    ~arrow::read_parquet(.x) %>% count(day),
    .id = "filename"
  )
#> # A tibble: 365 x 3
#> filename day n
#> <chr> <int> <int>
#> 1 /tmp/flights/AA/2013-01.parquet 1 94
#> 2 /tmp/flights/AA/2013-01.parquet 2 94
#> 3 /tmp/flights/AA/2013-01.parquet 3 95
#> 4 /tmp/flights/AA/2013-01.parquet 4 95
#> 5 /tmp/flights/AA/2013-01.parquet 5 77
#> 6 /tmp/flights/AA/2013-01.parquet 6 89
#> 7 /tmp/flights/AA/2013-01.parquet 7 95
#> 8 /tmp/flights/AA/2013-01.parquet 8 92
#> 9 /tmp/flights/AA/2013-01.parquet 9 92
#> 10 /tmp/flights/AA/2013-01.parquet 10 93
#> # … with 355 more rows
…but this workflow gets frustrating as each operation requires boilerplate for iterating and combining. This can be abstracted away with tools like disk.frame, but Drill offers a more powerful and flexible approach that handles a lot of issues like memory management and calculating cross-file metrics for us. Let’s spin up Drill:
drill-embedded &
Drill comes with Parquet support and a tmp
workspace in the dfs
file system
source already configured, so we can jump right in. To read one file like the
read_parquet()
call above, then,
drill <- sergeant::src_drill()
drill
#> src: DrillConnection
#> tbls: cp.default, dfs.default, dfs.root, dfs.tmp, information_schema, sys
tbl(drill, 'dfs.tmp.`flights/AA/2013-01.parquet`')
#> # Source: table<dfs.tmp.`flights/AA/2013-01.parquet`> [?? x 16]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 1141 33 1089 N619AA 542 540 JFK
#> 2 301 8 733 N3ALAA 558 600 LGA
#> 3 707 31 1389 N3DUAA 559 600 LGA
#> 4 1895 -12 1085 N633AA 606 610 EWR
#> 5 1837 5 1096 N3EMAA 623 610 LGA
#> 6 413 -3 1598 N3BAAA 628 630 JFK
#> 7 303 14 733 N3CYAA 629 630 LGA
#> 8 711 48 1389 N3GKAA 635 635 LGA
#> 9 305 4 733 N4WNAA 656 700 LGA
#> 10 1815 -10 944 N5FMAA 656 659 JFK
#> # … with more rows, and 9 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dest <chr>, minute <dbl>, hour <dbl>, arr_time <dbl>,
#> # air_time <dbl>, time_hour <dbl>, day <dbl>
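Since sergeant also provides a DBI interface, the same read can be written as raw SQL (a sketch; assumes the embedded Drill instance above is listening on localhost):
con <- sergeant::drill_connection('localhost')
sergeant::drill_query(con, 'SELECT * FROM dfs.tmp.`flights/AA/2013-01.parquet` LIMIT 10')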
Instead of iterating, Drill lets us use globbing:
tbl(drill, "dfs.tmp.`flights/*`")
#> # Source: table<dfs.tmp.`flights/*`> [?? x 17]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 3353 -32 509 N927XJ 745 745 JFK
#> 2 3611 -23 340 N8458A 758 805 JFK
#> 3 3538 -29 1029 N605LR 813 815 JFK
#> 4 3492 -28 213 N924XJ 845 850 JFK
#> 5 3521 10 740 N934XJ 854 830 JFK
#> 6 3902 13 461 N8554A 1044 1050 LGA
#> 7 3614 -12 296 N8794B 1051 1055 LGA
#> 8 3606 -18 184 N8808H 1156 1159 JFK
#> 9 3836 -1 258 N8960A 1209 1210 LGA
#> 10 3303 -21 301 N923XJ 1223 1230 JFK
#> # … with more rows, and 10 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <dbl>, dest <chr>, minute <dbl>, hour <dbl>,
#> # arr_time <dbl>, air_time <dbl>, time_hour <dbl>, day <dbl>
If we collect that into memory (given it’s small enough that we can), we get
flights
back:
flights2 <- tbl(drill, "dfs.tmp.`flights/*`") %>% collect()
flights2
#> # A tibble: 336,776 x 17
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 3496 -4 335 N904XJ 643 645 LGA
#> 2 3769 7 569 N832AY 655 659 EWR
#> 3 3611 -10 340 N8673D 749 759 JFK
#> 4 3507 28 1182 N934XJ 752 800 JFK
#> 5 3353 -11 509 N907XJ 755 800 JFK
#> 6 3528 -12 833 N293PQ 800 805 LGA
#> 7 3538 -18 1029 N928XJ 801 800 JFK
#> 8 3489 -7 1080 N917XJ 805 805 LGA
#> 9 3317 -10 301 N935XJ 807 815 JFK
#> 10 3868 1 427 N8541D 825 830 JFK
#> # … with 336,766 more rows, and 10 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <chr>, dest <chr>, minute <dbl>, hour <dbl>,
#> # arr_time <dbl>, air_time <dbl>, time_hour <dbl>, day <dbl>
dim(flights2)
#> [1] 336776 17
rm(flights2)
…except flights2
is not exactly the same as flights
because of how we
wrote it to disk: the carrier, year, and month are now in the filepaths, but not
in the saved files at all.
But we can get those from Drill! Drill 1.8 introduced four “implicit columns”: filename, filepath, suffix, and fqn (fully qualified name). Working with them in dplyr is a little weird, because they don’t exist in the table schema, so you can’t select them until you mutate them:
tryCatch(
  tbl(drill, 'dfs.tmp.`flights/*`') %>% select(filename),
  error = print
)
#> <error/vctrs_error_subscript_oob>
#> Can't subset columns that don't exist.
#> x The column `filename` doesn't exist.
#> Backtrace:
#> 1. base::local(...)
#> 57. vctrs:::stop_subscript_oob(...)
#> 58. vctrs:::stop_subscript(...)
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  mutate(filename, filepath, suffix, fqn) %>%
  select(filename, filepath, suffix, fqn) %>%
  distinct()
#> # Source: lazy query [?? x 4]
#> # Database: DrillConnection
#> fqn filename filepath suffix
#> <chr> <chr> <chr> <chr>
#> 1 /tmp/flights/AS/2013-01.parquet 2013-01.parquet /tmp/flights/AS parquet
#> 2 /tmp/flights/DL/2013-01.parquet 2013-01.parquet /tmp/flights/DL parquet
#> 3 /tmp/flights/DL/2013-02.parquet 2013-02.parquet /tmp/flights/DL parquet
#> 4 /tmp/flights/HA/2013-01.parquet 2013-01.parquet /tmp/flights/HA parquet
#> 5 /tmp/flights/MQ/2013-01.parquet 2013-01.parquet /tmp/flights/MQ parquet
#> 6 /tmp/flights/MQ/2013-05.parquet 2013-05.parquet /tmp/flights/MQ parquet
#> 7 /tmp/flights/US/2013-12.parquet 2013-12.parquet /tmp/flights/US parquet
#> 8 /tmp/flights/VX/2013-12.parquet 2013-12.parquet /tmp/flights/VX parquet
#> 9 /tmp/flights/WN/2013-04.parquet 2013-04.parquet /tmp/flights/WN parquet
#> 10 /tmp/flights/9E/2013-11.parquet 2013-11.parquet /tmp/flights/9E parquet
#> # … with more rows
Cool! Now let’s go calculate some stuff! Let’s start basic with a count of flights by carrier. But remember, carrier is part of the filepath. As it happens, it’s pretty easy to extract with right(), which will get passed through as a SQL function, but in this case you could also group by filepath directly and then clean it up with regex after collecting.
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(carrier = right(filepath, 2L)) %>%
  count()
#> # Source: lazy query [?? x 2]
#> # Database: DrillConnection
#> # Groups: carrier
#> carrier n
#> <chr> <dbl>
#> 1 AS 714
#> 2 AA 32729
#> 3 EV 54173
#> 4 F9 685
#> 5 VX 5162
#> 6 B6 54635
#> 7 US 20536
#> 8 WN 12275
#> 9 9E 18460
#> 10 YV 601
#> # … with more rows
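For comparison, the group-by-filepath-then-regex route mentioned above might look something like this (a sketch; the stringr pattern is my own):
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(fp = filepath) %>%  # mutate-style naming, since implicit columns can’t be selected outright
  count() %>%
  collect() %>%
  mutate(carrier = stringr::str_extract(fp, '[A-Z0-9]{2}$'))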
We can extract months from the filenames, too:
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(month = left(filename, 7L)) %>%
  count() %>%
  arrange(month)
#> # Source: lazy query [?? x 2]
#> # Database: DrillConnection
#> # Groups: month
#> # Ordered by: month
#> month n
#> <chr> <dbl>
#> 1 2013-01 27004
#> 2 2013-02 24951
#> 3 2013-03 28834
#> 4 2013-04 28330
#> 5 2013-05 28796
#> 6 2013-06 28243
#> 7 2013-07 29425
#> 8 2013-08 29327
#> 9 2013-09 27574
#> 10 2013-10 28889
#> # … with more rows
Moreover, we can use these tricks to reconstruct a view (of sorts) of the entire dataset without collecting it into R:
flights_tbl <- tbl(drill, 'dfs.tmp.`flights/*`') %>%
  mutate(
    carrier = right(filepath, 2L),
    year = left(filename, 4L),
    month = substr(filename, 6L, 2L)
  )
flights_tbl
#> # Source: lazy query [?? x 20]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum year dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 3353 -32 509 N927XJ 2013 745 745 JFK
#> 2 3611 -23 340 N8458A 2013 758 805 JFK
#> 3 3538 -29 1029 N605LR 2013 813 815 JFK
#> 4 3492 -28 213 N924XJ 2013 845 850 JFK
#> 5 3521 10 740 N934XJ 2013 854 830 JFK
#> 6 3902 13 461 N8554A 2013 1044 1050 LGA
#> 7 3614 -12 296 N8794B 2013 1051 1055 LGA
#> 8 3606 -18 184 N8808H 2013 1156 1159 JFK
#> 9 3836 -1 258 N8960A 2013 1209 1210 LGA
#> 10 3303 -21 301 N923XJ 2013 1223 1230 JFK
#> # … with more rows, and 12 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <dbl>, dest <chr>, minute <dbl>, carrier <dbl>,
#> # hour <dbl>, month <chr>, arr_time <dbl>, air_time <dbl>, time_hour <dbl>,
#> # day <dbl>
This tibble is exactly equivalent to the original flights
object, except it
behaves like a database and is stored on-disk. The database interface means we
would have to collect to local memory to do most complicated things, but we can
do rather a lot—anything you can write in ANSI SQL—which is especially helpful
for subsetting and aggregating data before collecting for modeling or other
purposes.
A few things we can do:
Subsetting
flights_tbl %>%
  group_by(month, day) %>%
  filter(
    distance > mean(distance, na.rm = TRUE),
    carrier %in% c("AA", "UA", "DL")
  ) %>%
  ungroup() %>%
  select(carrier, origin, dest, dep_delay, arr_delay)
#> # Source: lazy query [?? x 5]
#> # Database: DrillConnection
#> carrier arr_delay origin dep_delay dest
#> <chr> <dbl> <chr> <dbl> <chr>
#> 1 AA 102 LGA 81 MIA
#> 2 AA 34 JFK 37 LAX
#> 3 AA 0 JFK 17 SFO
#> 4 AA -8 LGA -4 DFW
#> 5 AA -12 EWR -9 DFW
#> 6 AA 10 JFK -3 FLL
#> 7 AA -1 LGA -3 DFW
#> 8 AA -14 JFK -4 SJU
#> 9 AA 8 EWR 11 MIA
#> 10 AA -2 LGA -7 MIA
#> # … with more rows
Calculating summary statistics
flights_tbl %>%
  group_by(origin) %>%
  summarise(
    n_flights = n(),
    n_dest = n_distinct(dest),
    min_air_time = min(air_time, na.rm = TRUE),
    max_arr_delay = max(arr_delay, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    sd_distance = sd(distance)
  )
#> # Source: lazy query [?? x 7]
#> # Database: DrillConnection
#> sd_distance min_air_time n_flights mean_arr_delay origin n_dest max_arr_delay
#> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 896. 21 111279 5.55 JFK 70 1272
#> 2 730. 20 120835 9.11 EWR 86 1109
#> 3 372. 21 104662 5.78 LGA 68 915
Date, time, and datetime handling
flights_tbl %>%
  filter(!is.na(dep_time), !carrier %like% '%9%') %>%
  mutate(
    # restructure `dep_time` as an actual time object instead of a weird int
    dep_time = cast(paste(
      as.character(dep_time / 100L),
      right(dep_time, 2L),
      '00',
      sep = ':'
    ) %as% TIME),
    # reconstruct flight date and time
    flight_date = as.Date(paste(year, month, as.character(day), sep = '-')),
    flight_time = cast(paste(
      lpad_with(as.integer(hour), 2L, '0'),
      lpad_with(as.integer(minute), 2L, '0'),
      '00',
      sep = ':'
    ) %as% TIME),
    # construct a timestamp from a date and time
    flight_timestamp = as.POSIXct(paste(as.character(flight_date), as.character(flight_time))),
    dep_timestamp = as.POSIXct(paste(as.character(flight_date), as.character(dep_time))),
    # recalculate dep_delay in raw SQL
    dep_delay = sql("EXTRACT(MINUTE FROM (dep_time - flight_time))")
  ) %>%
  select(carrier, flight_date, flight_time, flight_timestamp, dep_time, dep_delay)
#> # Source: lazy query [?? x 6]
#> # Database: DrillConnection
#> carrier flight_timestamp flight_date dep_time dep_delay flight_time
#> <chr> <dttm> <date> <time> <dbl> <time>
#> 1 AA 2013-10-01 05:45:00 2013-10-01 05:36 -9 05:45
#> 2 AA 2013-10-01 06:00:00 2013-10-01 05:52 -8 06:00
#> 3 AA 2013-10-01 06:00:00 2013-10-01 05:59 -1 06:00
#> 4 AA 2013-10-01 06:10:00 2013-10-01 06:07 -3 06:10
#> 5 AA 2013-10-01 06:29:00 2013-10-01 06:21 -8 06:29
#> 6 AA 2013-10-01 06:30:00 2013-10-01 06:26 -4 06:30
#> 7 AA 2013-10-01 06:55:00 2013-10-01 06:52 -3 06:55
#> 8 AA 2013-10-01 07:05:00 2013-10-01 06:56 -9 07:05
#> 9 AA 2013-10-01 07:10:00 2013-10-01 07:01 -9 07:10
#> 10 AA 2013-10-01 07:10:00 2013-10-01 07:04 -6 07:10
#> # … with more rows
Grouped aggregations
flights_tbl %>%
  mutate(
    # turn weird int times (745 for 7:45) into numeric representations (7.75)
    dep_time_float = (dep_time / 100L) + (mod(dep_time, 100L) / 60),
    arr_time_float = (arr_time / 100L) + (mod(arr_time, 100L) / 60)
  ) %>%
  group_by(carrier) %>%
  summarise(
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
    arr_delay_sd = sd(arr_delay),
    time_corr = cor(dep_time_float, arr_time_float)
  )
#> # Source: lazy query [?? x 4]
#> # Database: DrillConnection
#> arr_delay_mean carrier time_corr arr_delay_sd
#> <dbl> <chr> <dbl> <dbl>
#> 1 -9.93 AS 0.938 36.5
#> 2 10.8 MQ 0.859 43.2
#> 3 2.13 US 0.969 33.1
#> 4 9.65 WN 0.905 46.9
#> 5 21.9 F9 0.919 61.6
#> 6 20.1 FL 0.845 54.1
#> 7 3.56 UA 0.656 41.0
#> 8 1.76 VX 0.661 50.0
#> 9 7.38 9E 0.827 50.1
#> 10 9.46 B6 0.330 42.8
#> # … with more rows
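Window functions
Since Drill speaks ANSI SQL, window functions generally pass through as well; dbplyr renders ranking functions as OVER clauses. A hedged sketch (not run above, and translation details vary by dbplyr version):
# top three arrival delays per carrier, ranked inside Drill
flights_tbl %>%
  group_by(carrier) %>%
  mutate(delay_rank = min_rank(desc(arr_delay))) %>%
  filter(delay_rank <= 3L) %>%
  select(carrier, month, day, flight, arr_delay, delay_rank)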
Writing SQL via dplyr requires more effort than working on an in-memory data frame because
- Drill does not do implicit type coercion like R, so it requires explicit casting more often. It is also more sensitive to integers vs. doubles/floats.
- Some R functions will be translated to SQL (?sergeant::drill_custom_functions is a useful reference here), but not all. Untranslated functions are passed through, which lets us use SQL functions. If the syntax is too different, strings of raw SQL can be escaped with sql().
- Null handling is a little different, and can require caution.
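To make the first two points concrete, here’s a small sketch against flights_tbl (my own illustration; exact translations depend on your dbplyr and sergeant versions):
flights_tbl %>%
  mutate(
    year = as.integer(year),  # explicit cast: year was parsed out of the filename as a string
    dep_hour = sql('CAST(dep_time / 100 AS INTEGER)')  # raw SQL escape hatch, passed through untouched
  ) %>%
  select(year, dep_hour, dep_time)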
Depending a bit on data size and location (Drill also works very nicely on other data sources like S3), I typically limit usage to what I can write quickly, i.e. basic summary stats and subsetting, which is usually sufficient to reduce the amount of data to something that fits more neatly in memory; in practice, that looks like the sketch below.
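A minimal sketch of that pattern (names are my own; the aggregation runs in Drill and only the small summary is collected):
delays <- flights_tbl %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier, month) %>%
  summarise(n = n(), mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  collect()
But use it how it suits you! Happy drilling!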
1. Arrow is also cool because it defines a common in-memory data structure for rectangular data, which suggests the possibility of sharing in-memory data between R, Python, C++, and more.