# Define path to the updated locations Delta table
<- spark_read_delta(
locations_sdf_updated_one
sc,path = file.path(
getwd(),
"data",
"locations_sdf_updated_one"
)|>
) filter(trip_id >= 40000000 & trip_id <= 40000010) %>% # Filter for only ten rows
sdf_register("locations_sdf_updated_one_view") # Register as a temporary view
4 Part One - Processing Raster Data with Apache Sedona and Sparklyr in R
Using WorldPop Data to Determine Population Density Around Pickup and Dropoff Points
4.1 Introduction
In this chapter, we are going to demonstrate how to obtain information from raster files based on geographic coordinates. Let us assume that there is a relationship between the duration of a taxi (dependent variable) ride and the population density of an area. We will, therefore, need to extract population density values at each pickup and dropoff location. Such granular data is typically available in raster format, which is why we use WorldPop population density data with a resolution of 1 km by 1 km for this purpose. You can download the data here to follow along.
The Spark configuration used in this chapter — and the next one — is identical to the one used in Chapter 3, so it is not shown below for brevity’s sake.
Furthermore, because I need to re-run this code multiple times to render this website, I shall filter only a few rows from the 94 million we previously worked with, as I will be constantly updating this site. In the actual analysis I conducted, however, I used the full dataset.
You can find out more about using Apache Sedona for raster manipulation here.
4.2 Loading updated locations data
We start by loading our updated locations data, which now contains household median income by neighbourhood information.
To reiterate, I will filter for a few rows here so that I can render this webpage faster. Sometimes in your analysis, you will find that you have too much data to fit in memory, especially when running complex transformations. In such cases, you can filter for specific rows, perform your analysis on that subset, and then append the results to your delta tables. You can repeat this process for another set of rows until you are done.
For instance, knowing that I have trip ID values ranging from 0 to about 48,000,000, I would:
- First filter for rows between 0 and 16 million,
- Then 16 million to 32 million,
- And finally, anything above 32 million,
- Appending to the same folder each time.
If you have enough RAM and cores, though, feel free to run everything at once — go crazy with it!
print(locations_sdf_updated_one, width=Inf, n=10)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source: table<`locations_sdf_updated_one_view`> [?? x 8]
# Database: spark_connection
trip_id latitude longitude is_pickup BoroName NTA2020
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 40000000 40.8 -74.0 1 Manhattan MN0604
2 40000001 40.7 -74.0 1 Manhattan MN0201
3 40000002 40.7 -74.0 1 Manhattan MN0201
4 40000003 40.8 -74.0 1 Manhattan MN0502
5 40000004 40.7 -74.0 1 Manhattan MN0401
6 40000005 40.8 -74.0 1 Manhattan MN0502
7 40000006 40.8 -74.0 1 Manhattan MN0604
8 40000007 40.8 -74.0 1 Manhattan MN0603
9 40000008 40.8 -74.0 1 Manhattan MN0502
10 40000009 40.7 -74.0 1 Manhattan MN0603
NTAName MdHHIncE
<chr> <int>
1 East Midtown-Turtle Bay 161934
2 SoHo-Little Italy-Hudson Square 133847
3 SoHo-Little Italy-Hudson Square 133847
4 Midtown-Times Square 153871
5 Chelsea-Hudson Yards 118915
6 Midtown-Times Square 153871
7 East Midtown-Turtle Bay 161934
8 Murray Hill-Kips Bay 138337
9 Midtown-Times Square 153871
10 Murray Hill-Kips Bay 138337
# ℹ more rows
4.3 Loading WorldPop Population Density dataset
The difference when using raster data compared to vector data with Apache Sedona is that we do not import raster in its native format directly. Instead, we must first load it as a binary dataframe, and then convert it into its native raster format within Sedona.
Also, bear in mind that Sedona only accepts raster files in the following formats:
- Arc Info ASCII Grid,
- GeoTIFF, and
- NetCDF.
If your data is in any other raster format, you will first need to convert it to one of these supported formats.
I have found GDAL to be particularly useful for converting between different raster formats. For this tutorial, I used GDAL (command line) to compress the original US population density raster file, and then clip it using the NYC boundaries shapefile. I used the Deflate lossless compression algorithm. You want to work with as small a file as possible as the processing can be quite memory intensive.
# Compressing the raster file
gdal_translate -co COMPRESS=DEFLATE usa_pd_2016_1km_UNadj_clipped.tif usa_pd_2016_1km_UNadj_compressed.tif
# Clipping the compressed raster using a shapefile boundary
gdalwarp -cutline nynta2020.shp -crop_to_cutline usa_pd_2016_1km_UNadj_compressed.tif nyc.tif
# Load the raster data for world population (NYC)
<- file.path(
world_pop_raster_filepath getwd(),
"data",
"raster",
"worldpop",
"nyc.tif"
)
# Read the raster data as a binary file
<- spark_read_binary(
world_pop_binary
sc,dir = world_pop_raster_filepath,
name = "worldpop"
)
We obtain raster geometry from our GeoTiff data.
# Register the world population raster as a temporary view
|> sdf_register("worldpop_view") world_pop_binary
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source: table<`worldpop_view`> [?? x 4]
# Database: spark_connection
path modificationTime length
<chr> <dttm> <int>
1 file:/Users/rodgersiradukunda/Library/CloudStorage… 2025-02-27 01:46:58 26274
# ℹ 1 more variable: content <arrw_bnr>
# Extract raster data from the GeoTiff file using Sedona
<- sdf_sql(
worldpop_sdf
sc,"
SELECT RS_FromGeoTiff(content) AS raster FROM worldpop_view
"
)
# Register the raster data as a temporary view
|> sdf_register("worldpop_view") |> compute() worldpop_sdf
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source: table<`dbplyr_O777dTCBfY`> [?? x 1]
# Database: spark_connection
# ℹ 1 more variable: raster <arrw_bnr>
%>% glimpse() worldpop_sdf
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 1
Database: spark_connection
$ raster <arrw_bnr> <00, 10, 00, 00, 00, 67, 65, 6f, 74, 69, 66, 66, 5f, 63, 6…
We can retrieve metadata from our raster file, including:
- The upper left coordinates of the raster (in the raster’s coordinate system units),
- The width and height of the raster (in number of pixels),
- The spatial resolution of each pixel (in units of the raster’s CRS),
- Any skew or rotation of the raster (if present),
- The SRID (spatial reference system identifier) of the raster’s coordinate system,
- The number of bands, and
- Tile width and height.
In our case:
- Upper left X coordinate: -74.25125
- Upper left Y coordinate: 40.90792
(both in degrees as the CRS is WGS84)
- Raster size: 66 x 49
pixels (quite small)
- Pixel resolution: 0.00833 x -0.00833
degrees
- Skew: 0
in both x and y directions (i.e., no skew)
- SRID: 4326
(WGS 84)
- Number of bands: 2
- Tile width: 66
, Tile height: 15
All this information is important when interpreting and working with raster data, especially when performing coordinate-based queries.
# Retrieve and view metadata for the world population raster
<- sdf_sql(
worldpop_sdf_metadata
sc,"
SELECT RS_MetaData(raster) FROM worldpop_view
"
|> collect() )
Warning in arrow_enabled_object.spark_jobj(sdf): Arrow disabled due to columns:
rs_metadata(raster)
options(width = 100)
# Glimpse at the metadata information
|> glimpse() worldpop_sdf_metadata
Rows: 1
Columns: 1
$ `rs_metadata(raster)` <list> [-74.25125, 40.90792, 66, 49, 0.008333333, -0.008333333, 0, 0, 4326…
4.4 Joining point data with raster data
We now conduct the join using Spatial SQL, as it is much easier and more intuitive than using Apache Sedona’s R functions for raster operations in my opinion.
By leveraging Spatial SQL, we can directly query raster values at specific pickup and dropoff coordinates, simplifying what would otherwise be a more complex process if done via function-based syntax.
# Perform a spatial join between the locations and the world population data to calculate population density
<- sdf_sql(
locations_sdf_updated_two
sc,"
SELECT
/*+ BROADCAST(w) */ l.*, RS_Value(w.raster, ST_Point(l.longitude, l.latitude)) AS pop_density
FROM
locations_sdf_updated_one_view l
LEFT JOIN worldpop_view w
ON RS_Intersects(w.raster, ST_POINT(l.longitude, l.latitude))
"
)
We can now take a look at the result of our join below.
# Glimpse at the updated data with population density
%>% glimpse() locations_sdf_updated_two
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer range
Columns: 9
Database: spark_connection
$ trip_id <dbl> 40000000, 40000001, 40000002, 40000003, 40000004, 40000005, 40000006, 40000007…
$ latitude <dbl> 40.75250, 40.72693, 40.72448, 40.76216, 40.74357, 40.75565, 40.75845, 40.75144…
$ longitude <dbl> -73.97816, -74.00348, -74.00199, -73.97910, -73.99427, -73.97939, -73.95974, -…
$ is_pickup <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ BoroName <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", …
$ NTA2020 <chr> "MN0604", "MN0201", "MN0201", "MN0502", "MN0401", "MN0502", "MN0604", "MN0603"…
$ NTAName <chr> "East Midtown-Turtle Bay", "SoHo-Little Italy-Hudson Square", "SoHo-Little Ita…
$ MdHHIncE <int> 161934, 133847, 133847, 153871, 118915, 153871, 161934, 138337, 153871, 138337…
$ pop_density <dbl> 4927.671, 22022.211, 14979.832, 14245.537, 37489.418, 4927.671, 41234.668, 206…
# Print a preview of the resulting dataframe with specific formatting options
print(locations_sdf_updated_two, n=10, width=Inf)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer range
# Source: table<`sparklyr_tmp_73426160_7142_40f6_acc3_cdc5dd708f61`> [?? x 9]
# Database: spark_connection
trip_id latitude longitude is_pickup BoroName NTA2020 NTAName MdHHIncE
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <int>
1 40000000 40.8 -74.0 1 Manhattan MN0604 East Midtown-Turtle Bay 161934
2 40000001 40.7 -74.0 1 Manhattan MN0201 SoHo-Little Italy-Hudson Square 133847
3 40000002 40.7 -74.0 1 Manhattan MN0201 SoHo-Little Italy-Hudson Square 133847
4 40000003 40.8 -74.0 1 Manhattan MN0502 Midtown-Times Square 153871
5 40000004 40.7 -74.0 1 Manhattan MN0401 Chelsea-Hudson Yards 118915
6 40000005 40.8 -74.0 1 Manhattan MN0502 Midtown-Times Square 153871
7 40000006 40.8 -74.0 1 Manhattan MN0604 East Midtown-Turtle Bay 161934
8 40000007 40.8 -74.0 1 Manhattan MN0603 Murray Hill-Kips Bay 138337
9 40000008 40.8 -74.0 1 Manhattan MN0502 Midtown-Times Square 153871
10 40000009 40.7 -74.0 1 Manhattan MN0603 Murray Hill-Kips Bay 138337
pop_density
<dbl>
1 4928.
2 22022.
3 14980.
4 14246.
5 37489.
6 4928.
7 41235.
8 20680.
9 14246.
10 44076.
# ℹ more rows
Perfect! We have successfully obtained approximate population density values for each pickup and dropoff location.
4.5 Saving the data
Writing the updated data to file for further processing.
# Define file path for saving the updated dataframe
<- file.path(
locations_sdf_updated_two_file_path getwd(),
"data",
"locations_sdf_updated_two"
)
# Save the final dataframe as a Delta table
spark_write_delta(
locations_sdf_updated_two,path = locations_sdf_updated_two_file_path,
mode = "append" # Overwrite any existing data at the location
)
# Disconnect from the Spark session
spark_disconnect(sc)
4.6 References
- WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00675