I have a function that needs to merge a large user dataset with standard column names (up to ~10M rows) with an internal dataset (~5k rows). The internal dataset is always the same. We currently use data.table for this. Are there any faster options? #rstats #rlang #DataScience
Also, we have Python users on our team and are more than happy to outsource this particular problem to any other language if it's faster. Speed is the end goal, but merging is slow. I considered writing it in C++ to merge against a hardcoded file. #python #julialang #rcpp #cpp
All responses are welcome! I'd never considered turning it into a database. My example above is oversimplified, but we actually have to merge a few predefined tables onto the user provided one. Any reason why SQL could be so much faster?
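One reason a database can be fast here: the internal tables never change, so they can be loaded and indexed once, and each merge becomes an indexed join against the user-provided table. A minimal sketch using Python's built-in sqlite3 (table and column names are invented for illustration, not from the actual pipeline):

```python
import sqlite3

# Static "internal" table: load and index it once (PRIMARY KEY builds the index).
# A file path instead of ":memory:" would persist it across runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE internal (id INTEGER PRIMARY KEY, label TEXT)")
con.executemany("INSERT INTO internal VALUES (?, ?)", [(1, "x"), (2, "y")])

# Large user-provided table arrives per run
con.execute("CREATE TABLE users (id INTEGER, val TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# LEFT JOIN keeps every user row; unmatched keys get NULL
rows = con.execute(
    "SELECT u.id, u.val, i.label "
    "FROM users u LEFT JOIN internal i ON u.id = i.id "
    "ORDER BY u.id"
).fetchall()
```

The same pattern extends to merging several predefined tables onto the user one in a single multi-join query.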
Per the @Rdatatable benchmarks at h2oai.github.io/db-benchmark… (which include a 'join' task likely close to your merge problem), you are unlikely to beat `data.table` just by switching to #Python or #Rcpp. Maybe profile a little and then discuss with the @Rdatatable team?

May 27, 2021 · 7:31 PM UTC

Looking at those benchmarks, wouldn't Python's Polars be an option? I'm a huge fan of R/data.table personally, but Polars does seem like a strong option performance-wise now (especially for joins) pola-rs.github.io/polars-boo…