Querying across files with Apache Drill
Globbing, implicit columns, and the power of SQL
When I first used Apache Drill several years ago, it was one of those “holy crap this is amazing” moments. Moreover, every time since that I’ve thought “Oh, Drill could be really useful here” and spun it up, that thought has been quickly followed by “holy crap this is amazing” all over again. It’s just delightful. I keep thinking I should try out alternatives like Presto (which has two branches now) or Apache Impala, but I always start by spinning up Drill for comparison and never quite make it to anything else.
To be fair, it’s not all roses:
- the installation seems very sensitive to your Java version and configuration
- Java errors and tracebacks can get long enough to overflow your buffer and are generally unrelated to what’s actually wrong, anyway
- configuring Drill beyond the basics can get confusing
but once you get it working, none of that matters much.
When it comes to working with Drill in R, much is owed to Bob Rudis, who wrote sergeant—which defines a DBI and dplyr interface to Drill—as well as lots of pieces about the subject, notably Using Apache Drill with R. As there are thus plenty of resources on getting started with Drill, I am here going to focus on one particular workflow where Drill really shines: working with a directory of files.
First, let’s make some sample files to play with using
nycflights13, whose flights
table is
not huge (336,776 rows), but big enough to suggest how Drill can be useful, e.g.
where the dataset is not just one year’s flights for one city, but, say, all
global recorded flights.
Let’s save the data as Parquet files, which are considerably smaller than CSVs and allow individual columns to be read without scanning the whole file, making them a common choice for larger modern datasets. The arrow package makes reading and writing Parquet in R easy.1 Drill also has excellent Parquet support.
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
#> ✓ tibble 3.0.0 ✓ dplyr 0.8.5
#> ✓ tidyr 1.0.2 ✓ stringr 1.4.0
#> ✓ readr 1.3.1 ✓ forcats 0.5.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
flights_dir <- '/tmp/flights'
dir.create(flights_dir)
data('flights', package = 'nycflights13')
# make a directory for each carrier
unique(flights$carrier) %>%
  walk(~dir.create(file.path(flights_dir, .x)))
# turn year and month into filenames and write the rest to parquet files
flights %>%
  group_by(carrier, year, month) %>%
  nest() %>%
  mutate(
    filename = paste0(year, '-', sprintf('%02d', month), '.parquet'),
    filepath = file.path(flights_dir, carrier, filename)
  ) %>%
  with(walk2(data, filepath, arrow::write_parquet))
list.dirs(flights_dir)
#> [1] "/tmp/flights" "/tmp/flights/9E" "/tmp/flights/AA" "/tmp/flights/AS"
#> [5] "/tmp/flights/B6" "/tmp/flights/DL" "/tmp/flights/EV" "/tmp/flights/F9"
#> [9] "/tmp/flights/FL" "/tmp/flights/HA" "/tmp/flights/MQ" "/tmp/flights/OO"
#> [13] "/tmp/flights/UA" "/tmp/flights/US" "/tmp/flights/VX" "/tmp/flights/WN"
#> [17] "/tmp/flights/YV"
list.files(file.path(flights_dir, "AA"))
#> [1] "2013-01.parquet" "2013-02.parquet" "2013-03.parquet" "2013-04.parquet"
#> [5] "2013-05.parquet" "2013-06.parquet" "2013-07.parquet" "2013-08.parquet"
#> [9] "2013-09.parquet" "2013-10.parquet" "2013-11.parquet" "2013-12.parquet"
Cool, now we’ve got some sample data. Reading a single file is easy:
aa_2013_01 <- file.path(flights_dir, "AA", "2013-01.parquet")
arrow::read_parquet(aa_2013_01)
#> # A tibble: 2,794 x 16
#> day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> <int> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 542 540 2 923 850 33
#> 2 1 558 600 -2 753 745 8
#> 3 1 559 600 -1 941 910 31
#> 4 1 606 610 -4 858 910 -12
#> 5 1 623 610 13 920 915 5
#> 6 1 628 630 -2 1137 1140 -3
#> 7 1 629 630 -1 824 810 14
#> 8 1 635 635 0 1028 940 48
#> 9 1 656 700 -4 854 850 4
#> 10 1 656 659 -3 949 959 -10
#> # … with 2,784 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
We can even read in only a subset of columns:
arrow::read_parquet(aa_2013_01, contains('time'))
#> # A tibble: 2,794 x 6
#> dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
#> <int> <int> <int> <int> <dbl> <dttm>
#> 1 542 540 923 850 160 2013-01-01 10:00:00
#> 2 558 600 753 745 138 2013-01-01 11:00:00
#> 3 559 600 941 910 257 2013-01-01 11:00:00
#> 4 606 610 858 910 152 2013-01-01 11:00:00
#> 5 623 610 920 915 153 2013-01-01 11:00:00
#> 6 628 630 1137 1140 192 2013-01-01 11:00:00
#> 7 629 630 824 810 140 2013-01-01 11:00:00
#> 8 635 635 1028 940 248 2013-01-01 11:00:00
#> 9 656 700 854 850 143 2013-01-01 12:00:00
#> 10 656 659 949 959 142 2013-01-01 11:00:00
#> # … with 2,784 more rows
But what if we want to work across files? A common idiom is to iterate across a vector of files, read them in, and rbind them, e.g.
file.path(flights_dir, "AA") %>%
  list.files(full.names = TRUE) %>%
  setNames(., .) %>% # so `.id` will work
  map_dfr(arrow::read_parquet, .id = "filename")
#> # A tibble: 32,729 x 17
#> filename day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <chr> <int> <int> <int> <dbl> <int> <int>
#> 1 /tmp/fl… 1 542 540 2 923 850
#> 2 /tmp/fl… 1 558 600 -2 753 745
#> 3 /tmp/fl… 1 559 600 -1 941 910
#> 4 /tmp/fl… 1 606 610 -4 858 910
#> 5 /tmp/fl… 1 623 610 13 920 915
#> 6 /tmp/fl… 1 628 630 -2 1137 1140
#> 7 /tmp/fl… 1 629 630 -1 824 810
#> 8 /tmp/fl… 1 635 635 0 1028 940
#> 9 /tmp/fl… 1 656 700 -4 854 850
#> 10 /tmp/fl… 1 656 659 -3 949 959
#> # … with 32,719 more rows, and 10 more variables: arr_delay <dbl>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
This works—to a point. The core limitation is memory: if the files in question require more memory to read in than is available, this approach is insufficient. Parquet allows us to read in only some columns, which stretches this a bit further, but ultimately the next stage is processing each file as a batch before combining, e.g.
file.path(flights_dir, "AA") %>%
  list.files(full.names = TRUE) %>%
  setNames(., .) %>%
  map_dfr(
    ~arrow::read_parquet(.x) %>% count(day),
    .id = "filename"
  )
#> # A tibble: 365 x 3
#> filename day n
#> <chr> <int> <int>
#> 1 /tmp/flights/AA/2013-01.parquet 1 94
#> 2 /tmp/flights/AA/2013-01.parquet 2 94
#> 3 /tmp/flights/AA/2013-01.parquet 3 95
#> 4 /tmp/flights/AA/2013-01.parquet 4 95
#> 5 /tmp/flights/AA/2013-01.parquet 5 77
#> 6 /tmp/flights/AA/2013-01.parquet 6 89
#> 7 /tmp/flights/AA/2013-01.parquet 7 95
#> 8 /tmp/flights/AA/2013-01.parquet 8 92
#> 9 /tmp/flights/AA/2013-01.parquet 9 92
#> 10 /tmp/flights/AA/2013-01.parquet 10 93
#> # … with 355 more rows
…but this workflow gets frustrating as each operation requires boilerplate for iterating and combining. This can be abstracted away with tools like disk.frame, but Drill offers a more powerful and flexible approach that handles a lot of issues like memory management and calculating cross-file metrics for us. Let’s spin up Drill:
drill-embedded &
Drill comes with Parquet support and a tmp
workspace in the dfs
file system
source already configured, so we can jump right in. To read one file like the
read_parquet()
call above, then,
drill <- sergeant::src_drill()
drill
#> src: DrillConnection
#> tbls: cp.default, dfs.default, dfs.root, dfs.tmp, information_schema, sys
tbl(drill, 'dfs.tmp.`flights/AA/2013-01.parquet`')
#> # Source: table<dfs.tmp.`flights/AA/2013-01.parquet`> [?? x 16]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 1141 33 1089 N619AA 542 540 JFK
#> 2 301 8 733 N3ALAA 558 600 LGA
#> 3 707 31 1389 N3DUAA 559 600 LGA
#> 4 1895 -12 1085 N633AA 606 610 EWR
#> 5 1837 5 1096 N3EMAA 623 610 LGA
#> 6 413 -3 1598 N3BAAA 628 630 JFK
#> 7 303 14 733 N3CYAA 629 630 LGA
#> 8 711 48 1389 N3GKAA 635 635 LGA
#> 9 305 4 733 N4WNAA 656 700 LGA
#> 10 1815 -10 944 N5FMAA 656 659 JFK
#> # … with more rows, and 9 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dest <chr>, minute <dbl>, hour <dbl>, arr_time <dbl>,
#> # air_time <dbl>, time_hour <dbl>, day <dbl>
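Since sergeant also provides a DBI interface, the same read can be written as raw SQL (a sketch; assumes the embedded Drill instance above is listening on localhost):
con <- sergeant::drill_connection('localhost')
sergeant::drill_query(con, 'SELECT * FROM dfs.tmp.`flights/AA/2013-01.parquet` LIMIT 10')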
Instead of iterating, Drill lets us use globbing:
tbl(drill, "dfs.tmp.`flights/*`")
#> # Source: table<dfs.tmp.`flights/*`> [?? x 17]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 3353 -32 509 N927XJ 745 745 JFK
#> 2 3611 -23 340 N8458A 758 805 JFK
#> 3 3538 -29 1029 N605LR 813 815 JFK
#> 4 3492 -28 213 N924XJ 845 850 JFK
#> 5 3521 10 740 N934XJ 854 830 JFK
#> 6 3902 13 461 N8554A 1044 1050 LGA
#> 7 3614 -12 296 N8794B 1051 1055 LGA
#> 8 3606 -18 184 N8808H 1156 1159 JFK
#> 9 3836 -1 258 N8960A 1209 1210 LGA
#> 10 3303 -21 301 N923XJ 1223 1230 JFK
#> # … with more rows, and 10 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <dbl>, dest <chr>, minute <dbl>, hour <dbl>,
#> # arr_time <dbl>, air_time <dbl>, time_hour <dbl>, day <dbl>
If we collect that into memory (given it’s small enough that we can), we get
flights
back:
flights2 <- tbl(drill, "dfs.tmp.`flights/*`") %>% collect()
flights2
#> # A tibble: 336,776 x 17
#> flight arr_delay distance tailnum dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 3496 -4 335 N904XJ 643 645 LGA
#> 2 3769 7 569 N832AY 655 659 EWR
#> 3 3611 -10 340 N8673D 749 759 JFK
#> 4 3507 28 1182 N934XJ 752 800 JFK
#> 5 3353 -11 509 N907XJ 755 800 JFK
#> 6 3528 -12 833 N293PQ 800 805 LGA
#> 7 3538 -18 1029 N928XJ 801 800 JFK
#> 8 3489 -7 1080 N917XJ 805 805 LGA
#> 9 3317 -10 301 N935XJ 807 815 JFK
#> 10 3868 1 427 N8541D 825 830 JFK
#> # … with 336,766 more rows, and 10 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <chr>, dest <chr>, minute <dbl>, hour <dbl>,
#> # arr_time <dbl>, air_time <dbl>, time_hour <dbl>, day <dbl>
dim(flights2)
#> [1] 336776 17
rm(flights2)
…except flights2
is not exactly the same as flights
because of how we
wrote it to disk: the carrier, year, and month are now in the filepaths, but not
in the saved files at all.
But we can get those from Drill! Drill 1.8 introduced four “implicit columns”: filename, filepath, suffix, and fqn (fully qualified name). Working with them in dplyr is a little weird, because they don’t exist in the table schema, so you can’t select them until you mutate them:
tryCatch(
  tbl(drill, 'dfs.tmp.`flights/*`') %>% select(filename),
  error = print
)
#> <error/vctrs_error_subscript_oob>
#> Can't subset columns that don't exist.
#> x The column `filename` doesn't exist.
#> Backtrace:
#> 1. base::local(...)
#> 57. vctrs:::stop_subscript_oob(...)
#> 58. vctrs:::stop_subscript(...)
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  mutate(filename, filepath, suffix, fqn) %>%
  select(filename, filepath, suffix, fqn) %>%
  distinct()
#> # Source: lazy query [?? x 4]
#> # Database: DrillConnection
#> fqn filename filepath suffix
#> <chr> <chr> <chr> <chr>
#> 1 /tmp/flights/AS/2013-01.parquet 2013-01.parquet /tmp/flights/AS parquet
#> 2 /tmp/flights/DL/2013-01.parquet 2013-01.parquet /tmp/flights/DL parquet
#> 3 /tmp/flights/DL/2013-02.parquet 2013-02.parquet /tmp/flights/DL parquet
#> 4 /tmp/flights/HA/2013-01.parquet 2013-01.parquet /tmp/flights/HA parquet
#> 5 /tmp/flights/MQ/2013-01.parquet 2013-01.parquet /tmp/flights/MQ parquet
#> 6 /tmp/flights/MQ/2013-05.parquet 2013-05.parquet /tmp/flights/MQ parquet
#> 7 /tmp/flights/US/2013-12.parquet 2013-12.parquet /tmp/flights/US parquet
#> 8 /tmp/flights/VX/2013-12.parquet 2013-12.parquet /tmp/flights/VX parquet
#> 9 /tmp/flights/WN/2013-04.parquet 2013-04.parquet /tmp/flights/WN parquet
#> 10 /tmp/flights/9E/2013-11.parquet 2013-11.parquet /tmp/flights/9E parquet
#> # … with more rows
Cool! Now let’s go calculate some stuff! Let’s start basic with a count of flights by carrier. But remember, carrier is part of the filepath. As it happens, it’s pretty easy to extract with right(), which will get passed through as a SQL function, but in this case you could also group by filepath directly and then clean it up with regex after collecting.
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(carrier = right(filepath, 2L)) %>%
  count()
#> # Source: lazy query [?? x 2]
#> # Database: DrillConnection
#> # Groups: carrier
#> carrier n
#> <chr> <dbl>
#> 1 AS 714
#> 2 AA 32729
#> 3 EV 54173
#> 4 F9 685
#> 5 VX 5162
#> 6 B6 54635
#> 7 US 20536
#> 8 WN 12275
#> 9 9E 18460
#> 10 YV 601
#> # … with more rows
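For comparison, the group-by-filepath-then-regex route mentioned above might look something like this (a sketch; the stringr pattern is my own):
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(fp = filepath) %>%  # mutate-style naming, since implicit columns can’t be selected outright
  count() %>%
  collect() %>%
  mutate(carrier = stringr::str_extract(fp, '[A-Z0-9]{2}$'))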
We can extract months from the filenames, too:
tbl(drill, 'dfs.tmp.`flights/*`') %>%
  group_by(month = left(filename, 7L)) %>%
  count() %>%
  arrange(month)
#> # Source: lazy query [?? x 2]
#> # Database: DrillConnection
#> # Groups: month
#> # Ordered by: month
#> month n
#> <chr> <dbl>
#> 1 2013-01 27004
#> 2 2013-02 24951
#> 3 2013-03 28834
#> 4 2013-04 28330
#> 5 2013-05 28796
#> 6 2013-06 28243
#> 7 2013-07 29425
#> 8 2013-08 29327
#> 9 2013-09 27574
#> 10 2013-10 28889
#> # … with more rows
Moreover, we can use these tricks to reconstruct a view (of sorts) of the entire dataset without collecting it into R:
flights_tbl <- tbl(drill, 'dfs.tmp.`flights/*`') %>%
  mutate(
    carrier = right(filepath, 2L),
    year = left(filename, 4L),
    month = substr(filename, 6L, 2L)
  )
flights_tbl
#> # Source: lazy query [?? x 20]
#> # Database: DrillConnection
#> flight arr_delay distance tailnum year dep_time sched_dep_time origin
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 3353 -32 509 N927XJ 2013 745 745 JFK
#> 2 3611 -23 340 N8458A 2013 758 805 JFK
#> 3 3538 -29 1029 N605LR 2013 813 815 JFK
#> 4 3492 -28 213 N924XJ 2013 845 850 JFK
#> 5 3521 10 740 N934XJ 2013 854 830 JFK
#> 6 3902 13 461 N8554A 2013 1044 1050 LGA
#> 7 3614 -12 296 N8794B 2013 1051 1055 LGA
#> 8 3606 -18 184 N8808H 2013 1156 1159 JFK
#> 9 3836 -1 258 N8960A 2013 1209 1210 LGA
#> 10 3303 -21 301 N923XJ 2013 1223 1230 JFK
#> # … with more rows, and 12 more variables: sched_arr_time <dbl>,
#> # dep_delay <dbl>, dir0 <dbl>, dest <chr>, minute <dbl>, carrier <dbl>,
#> # hour <dbl>, month <chr>, arr_time <dbl>, air_time <dbl>, time_hour <dbl>,
#> # day <dbl>
This tibble is exactly equivalent to the original flights
object, except it
behaves like a database and is stored on-disk. The database interface means we
would have to collect to local memory to do most complicated things, but we can
do rather a lot—anything you can write in ANSI SQL—which is especially helpful
for subsetting and aggregating data before collecting for modeling or other
purposes.
A few things we can do:
Subsetting
flights_tbl %>%
  group_by(month, day) %>%
  filter(
    distance > mean(distance, na.rm = TRUE),
    carrier %in% c("AA", "UA", "DL")
  ) %>%
  ungroup() %>%
  select(carrier, origin, dest, dep_delay, arr_delay)
#> # Source: lazy query [?? x 5]
#> # Database: DrillConnection
#> carrier arr_delay origin dep_delay dest
#> <chr> <dbl> <chr> <dbl> <chr>
#> 1 AA 102 LGA 81 MIA
#> 2 AA 34 JFK 37 LAX
#> 3 AA 0 JFK 17 SFO
#> 4 AA -8 LGA -4 DFW
#> 5 AA -12 EWR -9 DFW
#> 6 AA 10 JFK -3 FLL
#> 7 AA -1 LGA -3 DFW
#> 8 AA -14 JFK -4 SJU
#> 9 AA 8 EWR 11 MIA
#> 10 AA -2 LGA -7 MIA
#> # … with more rows
Calculating summary statistics
flights_tbl %>%
  group_by(origin) %>%
  summarise(
    n_flights = n(),
    n_dest = n_distinct(dest),
    min_air_time = min(air_time, na.rm = TRUE),
    max_arr_delay = max(arr_delay, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    sd_distance = sd(distance)
  )
#> # Source: lazy query [?? x 7]
#> # Database: DrillConnection
#> sd_distance min_air_time n_flights mean_arr_delay origin n_dest max_arr_delay
#> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 896. 21 111279 5.55 JFK 70 1272
#> 2 730. 20 120835 9.11 EWR 86 1109
#> 3 372. 21 104662 5.78 LGA 68 915
Date, time, and datetime handling
flights_tbl %>%
  filter(!is.na(dep_time), !carrier %like% '%9%') %>%
  mutate(
    # restructure `dep_time` as an actual time object instead of a weird int
    dep_time = cast(paste(
      as.character(dep_time / 100L),
      right(dep_time, 2L),
      '00',
      sep = ':'
    ) %as% TIME),
    # reconstruct flight date and time
    flight_date = as.Date(paste(year, month, as.character(day), sep = '-')),
    flight_time = cast(paste(
      lpad_with(as.integer(hour), 2L, '0'),
      lpad_with(as.integer(minute), 2L, '0'),
      '00',
      sep = ':'
    ) %as% TIME),
    # construct a timestamp from a date and time
    flight_timestamp = as.POSIXct(paste(as.character(flight_date), as.character(flight_time))),
    dep_timestamp = as.POSIXct(paste(as.character(flight_date), as.character(dep_time))),
    # recalculate dep_delay in raw SQL
    dep_delay = sql("EXTRACT(MINUTE FROM (dep_time - flight_time))")
  ) %>%
  select(carrier, flight_date, flight_time, flight_timestamp, dep_time, dep_delay)
#> # Source: lazy query [?? x 6]
#> # Database: DrillConnection
#> carrier flight_timestamp flight_date dep_time dep_delay flight_time
#> <chr> <dttm> <date> <time> <dbl> <time>
#> 1 AA 2013-10-01 05:45:00 2013-10-01 05:36 -9 05:45
#> 2 AA 2013-10-01 06:00:00 2013-10-01 05:52 -8 06:00
#> 3 AA 2013-10-01 06:00:00 2013-10-01 05:59 -1 06:00
#> 4 AA 2013-10-01 06:10:00 2013-10-01 06:07 -3 06:10
#> 5 AA 2013-10-01 06:29:00 2013-10-01 06:21 -8 06:29
#> 6 AA 2013-10-01 06:30:00 2013-10-01 06:26 -4 06:30
#> 7 AA 2013-10-01 06:55:00 2013-10-01 06:52 -3 06:55
#> 8 AA 2013-10-01 07:05:00 2013-10-01 06:56 -9 07:05
#> 9 AA 2013-10-01 07:10:00 2013-10-01 07:01 -9 07:10
#> 10 AA 2013-10-01 07:10:00 2013-10-01 07:04 -6 07:10
#> # … with more rows
Grouped aggregations
flights_tbl %>%
  mutate(
    # turn weird int times (745 for 7:45) into numeric representations (7.75)
    dep_time_float = (dep_time / 100L) + (mod(dep_time, 100L) / 60),
    arr_time_float = (arr_time / 100L) + (mod(arr_time, 100L) / 60)
  ) %>%
  group_by(carrier) %>%
  summarise(
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
    arr_delay_sd = sd(arr_delay),
    time_corr = cor(dep_time_float, arr_time_float)
  )
#> # Source: lazy query [?? x 4]
#> # Database: DrillConnection
#> arr_delay_mean carrier time_corr arr_delay_sd
#> <dbl> <chr> <dbl> <dbl>
#> 1 -9.93 AS 0.938 36.5
#> 2 10.8 MQ 0.859 43.2
#> 3 2.13 US 0.969 33.1
#> 4 9.65 WN 0.905 46.9
#> 5 21.9 F9 0.919 61.6
#> 6 20.1 FL 0.845 54.1
#> 7 3.56 UA 0.656 41.0
#> 8 1.76 VX 0.661 50.0
#> 9 7.38 9E 0.827 50.1
#> 10 9.46 B6 0.330 42.8
#> # … with more rows
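Window functions
Since Drill speaks ANSI SQL, window functions generally pass through as well; dbplyr renders ranking functions as OVER clauses. A hedged sketch (not run above, and translation details vary by dbplyr version):
# top three arrival delays per carrier, ranked inside Drill
flights_tbl %>%
  group_by(carrier) %>%
  mutate(delay_rank = min_rank(desc(arr_delay))) %>%
  filter(delay_rank <= 3L) %>%
  select(carrier, month, day, flight, arr_delay, delay_rank)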
Writing SQL via dplyr requires more effort than working on an in-memory data frame because
- Drill does not do implicit type coercion like R, so it requires explicit casting more often. It is also more sensitive to integers vs. doubles/floats.
- Some R functions will be translated to SQL (?sergeant::drill_custom_functions is a useful reference here), but not all. Untranslated functions are passed through, which lets us use SQL functions. If the syntax is too different, strings of raw SQL can be escaped with sql().
- Null handling is a little different, and can require caution.
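To make the first two points concrete, here’s a small sketch against flights_tbl (my own illustration; exact translations depend on your dbplyr and sergeant versions):
flights_tbl %>%
  mutate(
    year = as.integer(year),  # explicit cast: year was parsed out of the filename as a string
    dep_hour = sql('CAST(dep_time / 100 AS INTEGER)')  # raw SQL escape hatch, passed through untouched
  ) %>%
  select(year, dep_hour, dep_time)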
Depending a bit on data size and location (Drill also works very nicely on other data sources like S3), I typically limit usage to what I can write quickly, i.e. basic summary stats and subsetting, which is usually sufficient to reduce the amount of data to something that fits more neatly in memory; in practice, that looks like the sketch below.
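A minimal sketch of that pattern (names are my own; the aggregation runs in Drill and only the small summary is collected):
delays <- flights_tbl %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier, month) %>%
  summarise(n = n(), mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  collect()
But use it how it suits you! Happy drilling!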
1. Arrow is also cool because it defines a common in-memory data structure for rectangular data, which suggests the possibility of sharing in-memory data between R, Python, C++, and more.