Processing Geospatial Big Data with Delta Lake, Sparklyr, and Apache Sedona in R

Author: Rodgers Iradukunda

Published: March 19, 2025

Hi, my name is Rodgers Iradukunda. I am a PhD student at the University of Liverpool and part of the Geographic Data Science Lab. I created this tutorial for R users (R-ddicts?) who, like me, sometimes work with geospatial data that is too large to be processed by conventional geospatial libraries like sf and raster. Enter Sparklyr, Apache Sedona, and Delta Lake—three powerful tools that leverage distributed computing and efficient storage to tackle this issue. While Sparklyr and Apache Sedona enable scalable geospatial processing, Delta Lake provides optimised storage and management for big data workflows. This tutorial serves as an introduction to using these tools for geospatial big data analysis.
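
To give a flavour of how the three tools fit together, here is a minimal sketch of the setup in R. The Delta Lake connector coordinates and version, the local Spark master, and the `data/nyc_taxi_delta` path are illustrative placeholders; the exact configuration depends on the Spark, Sedona, and Delta versions you have installed.

```r
library(sparklyr)
library(apache.sedona)  # loading before spark_connect() lets the connection pick up Sedona's dependencies

# Illustrative config: add the Delta Lake connector and register its SQL extension
config <- spark_config()
config[["sparklyr.shell.packages"]] <- "io.delta:delta-core_2.12:2.4.0"  # version is a placeholder
config[["spark.sql.extensions"]] <- "io.delta.sql.DeltaSparkSessionExtension"
config[["spark.sql.catalog.spark_catalog"]] <- "org.apache.spark.sql.delta.catalog.DeltaCatalog"

sc <- spark_connect(master = "local", config = config)

# Read a trips table stored in Delta format (path and table name are placeholders)
trips <- spark_read_delta(sc, path = "data/nyc_taxi_delta", name = "trips")
```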

In my experience, most geographic big data is related to mobility. The first time I encountered such data was during my master’s dissertation when I worked with smartphone GPS data. For this tutorial, I use the well-known New York City (NYC) yellow cab trip dataset. This dataset contains pickup and drop-off geographic coordinates for each trip, along with other variables. A common goal when analysing this data is to predict trip duration—in other words, given certain independent variables, how accurately can we predict the length of a taxi trip in NYC?

Naturally, a passenger’s pickup and drop-off locations influence trip duration. For instance, all else being equal, a taxi journey between two busy areas will take longer than one between quieter locations. However, it is difficult to determine the characteristics of an area based solely on coordinates. To gain a fuller picture of where a trip started and ended, we can augment the coordinates with additional data. The aim of this tutorial is to demonstrate how to obtain and integrate this extra information using Sparklyr, Apache Sedona, and Delta Lake.
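
Concretely, the kind of enrichment I have in mind is a point-in-polygon join: each pickup (or drop-off) point is matched to the area it falls within, such as a taxi zone or census tract, and that area's attributes become new predictors. A rough sketch of such a join using Sedona's SQL functions through Sparklyr might look like the following; the table and column names (`trips`, `zones`, `geometry_wkt`, and so on) are placeholders, and both tables are assumed to be registered as Spark temporary views.

```r
# Hypothetical enrichment step: attach the zone each pickup falls in.
enriched <- sdf_sql(sc, "
  SELECT t.*, z.zone_name, z.borough
  FROM trips AS t
  JOIN zones AS z
    ON ST_Contains(
         ST_GeomFromWKT(z.geometry_wkt),                      -- zone polygon
         ST_Point(t.pickup_longitude, t.pickup_latitude)      -- pickup point
       )
")
```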

I hope you find this tutorial insightful. If you have any feedback, please feel free to connect with me on LinkedIn. I am also still learning about the technologies covered here, so do let me know if you have suggestions for improving my workflow or if I have stated anything inaccurately.