5 Part Two - Processing Raster Data with Apache Sedona and Sparklyr in R
Using Raster Tiles to Determine Local Climate Zones of Pickup and Dropoff Locations
5.1 Introduction
For this chapter, we would like to find out the land cover classification associated with specific pickup and dropoff points. Our assumption is that the type of area where one picks up a taxi or gets dropped off may influence how long the trip takes. Again, do not dwell too much on this assumption — the main objective is to demonstrate a different way of extracting data from raster files.
We shall make use of the Local Climate Zones (LCZ) Map from the World Urban Database and Access Portal Tools (WUDAPT). The US (CONUS-wide) version of this dataset is available on Figshare (see the references at the end of this chapter). The dataset contains 17 land cover classes, ranging from compact high-rise buildings to water bodies.
We downloaded a version that was already in the EPSG:4326 CRS and clipped it to the NYC boundary. In this chapter, we demonstrate how to divide a raster image into tiles before performing analysis. This is particularly useful for large raster files that cannot be processed in their entirety: tiling makes the analysis more manageable and efficient.
Note: Raster processing is computationally intensive, so it is preferable to work with smaller files, especially when operating in local mode. You will know a raster file is too large when you receive an error upon attempting to view its contents.
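If you would rather check in advance than wait for an error, a quick look at the file's size on disk is a reasonable proxy. The sketch below uses base R's file.size(); the 500 MB threshold is purely illustrative, not a documented limit.
# Illustrative pre-flight check: inspect the raster's size on disk before
# asking Spark to load it. The 500 MB cutoff is an arbitrary example.
raster_file <- file.path(
  getwd(), "data", "raster", "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
size_mb <- file.size(raster_file) / 1024^2 # NA if the file does not exist
if (!is.na(size_mb) && size_mb > 500) {
  message("Raster is ", round(size_mb), " MB; consider tiling it before analysis.")
}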
Important: The Spark configuration used in this chapter is the same as that used in Chapter 3, and so it is not repeated here for brevity.
5.2 Loading the data
We start by loading our most updated locations data.
# Read location data stored in Delta format
locations_sdf_updated_two <- spark_read_delta(
  sc,
  path = file.path(getwd(), "data", "locations_sdf_updated_two")
) %>%
  filter(trip_id >= 40000000 & trip_id <= 40000010) %>% # keep only a small sample of trips
  sdf_register("locations_sdf_updated_two_view") # register as a temporary view for SQL
We then load our raster data.
# Define path to raster file (LCZ map)
wudapt_raster_filepath <- file.path(
  getwd(),
  "data",
  "raster",
  "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
# Read raster as binary using Spark
wudapt_binary <- spark_read_binary(
  sc,
  dir = wudapt_raster_filepath,
  name = "wudapt_binary_view"
)
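Before tiling, it can be worth confirming that Sedona can actually parse the GeoTiff. Assuming Sedona's raster functions are registered on this connection (as in the earlier chapters), a metadata query like the sketch below returns the raster's origin, dimensions, and SRID.
# Optional sanity check (assumes Sedona's raster functions are available):
# parse the GeoTiff once and inspect its metadata (origin, width, height, SRID).
sdf_sql(
  sc,
  "
  SELECT RS_Metadata(RS_FromGeoTiff(content)) AS meta
  FROM wudapt_binary_view
  "
) %>%
  glimpse()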
5.3 Creating raster tiles
Here, we assume that our raster data is too large to be processed in its entirety. To address this, we divide it into tiles of 256 by 256 pixels. This makes the data more analysis-friendly: we can first identify the specific tile we need and then operate on that tile alone, instead of the entire raster.
Note: The clipped dataset used here was actually not large, but we proceeded with this method for demonstration purposes to show how one would handle genuinely large raster datasets.
# Explode raster into 256x256 tiles for spatial operations
wudapt_raster_tiles <- sdf_sql(
  sc,
  "
  SELECT RS_TileExplode(RS_FromGeoTiff(content), 256, 256) AS (x, y, tile)
  FROM wudapt_binary_view
  "
) %>%
  sdf_register("wudapt_raster_tiles_view")
# Quick look at raster tiles structure
wudapt_raster_tiles %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 3
Database: spark_connection
$ x <int> 0, 1, 0, 1
$ y <int> 0, 0, 1, 1
$ tile <arrw_bnr> <00, 0f, 00, 00, 00, 67, 65, 6e, 65, 72, 69, 63, 43, 6f, 76, 65, …
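To see where a given tile sits geographically, Sedona's RS_Envelope returns a tile's bounding geometry. The query below is purely illustrative: it prints the footprint of a single tile from the view we just registered.
# Illustrative: inspect one tile's spatial footprint. RS_Envelope returns the
# tile's bounding geometry, which ST_AsText renders as WKT.
sdf_sql(
  sc,
  "
  SELECT x, y, ST_AsText(RS_Envelope(tile)) AS tile_extent
  FROM wudapt_raster_tiles_view
  LIMIT 1
  "
)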
As you can see below, the spatial SQL query is divided into two parts:
- First, we locate the tile that a given coordinate pair (pickup or dropoff point) belongs to.
- Second, we use that specific tile to find the actual land classification value associated with that location.
This two-step approach makes it efficient to work with large raster datasets by limiting the analysis to only relevant tiles, rather than processing the entire raster at once.
# Spatial join: match location points with corresponding LCZ tiles and assign LCZ labels
locations_sdf_updated_three <- sdf_sql(
  sc,
  "
WITH matched_tiles AS (
SELECT l.*, w.tile
FROM wudapt_raster_tiles_view w
JOIN locations_sdf_updated_two_view l
ON RS_Intersects(w.tile, ST_Point(l.longitude, l.latitude))
)
SELECT *,
TRY_CAST(RS_Value(tile, ST_Point(longitude, latitude)) AS INT) AS lcz_class,
CASE
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 1 THEN 'Compact highrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 2 THEN 'Compact midrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 3 THEN 'Compact lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 4 THEN 'Open highrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 5 THEN 'Open midrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 6 THEN 'Open lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 7 THEN 'Lightweight lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 8 THEN 'Large lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 9 THEN 'Sparsely built'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 10 THEN 'Heavy industry'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 11 THEN 'Dense trees'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 12 THEN 'Scattered trees'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 13 THEN 'Bush, scrub'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 14 THEN 'Low plants'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 15 THEN 'Bare rock or paved'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 16 THEN 'Bare soil or sand'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 17 THEN 'Water'
ELSE 'Unknown'
END AS lcz_label
FROM matched_tiles
"
)
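Before moving on, a quick tally of the assigned labels is a useful plausibility check. The snippet below is a minimal sketch: dplyr's count() is translated by sparklyr into a Spark aggregation.
# Quick plausibility check (illustrative): how are the LCZ labels distributed
# across our sampled points? count() becomes a Spark GROUP BY.
locations_sdf_updated_three %>%
  count(lcz_label, sort = TRUE)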
We now remove the tile column from our dataset because Delta cannot serialise raster columns, and because we do not need the tiles for our intended analysis. This keeps the dataset clean and light for storage and further processing.
# Drop RasterUDT (tile) column before saving as Delta (unsupported type in Delta)
locations_sdf_updated_three <- locations_sdf_updated_three %>%
  select(-tile)
The updated data now looks as shown below. As you can see, we went from only having coordinates to obtaining richer information about our locations by leveraging additional geospatial datasets.
Not too bad for using a simple laptop to process tens of millions of rows!
# Preview updated data
locations_sdf_updated_three %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 11
Database: spark_connection
$ trip_id <dbl> 40000002, 40000009, 40000008, 40000005, 40000010, 40000003…
$ latitude <dbl> 40.72448, 40.77977, 40.76196, 40.75565, 40.73432, 40.74904…
$ longitude <dbl> -74.00199, -73.98648, -73.97881, -73.97939, -73.99419, -73…
$ is_pickup <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1…
$ BoroName <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manha…
$ NTA2020 <chr> "MN0201", "MN0701", "MN0502", "MN0502", "MN0202", "MN0501"…
$ NTAName <chr> "SoHo-Little Italy-Hudson Square", "Upper West Side-Lincol…
$ MdHHIncE <int> 133847, 158165, 153871, 153871, 175436, 167458, 138337, 11…
$ pop_density <dbl> 14979.832, 23416.408, 14245.537, 4927.671, 37463.500, 1587…
$ lcz_class <int> 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1…
$ lcz_label <chr> "Compact midrise", "Compact highrise", "Compact highrise",…
# Print formatted preview for review
withr::with_options(
  list(pillar.sigfig = 6),
  print(locations_sdf_updated_three, n = 10, width = Inf)
)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source: SQL [?? x 11]
# Database: spark_connection
trip_id latitude longitude is_pickup BoroName NTA2020
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 40000002 40.7245 -74.0020 1 Manhattan MN0201
2 40000009 40.7798 -73.9865 0 Manhattan MN0701
3 40000008 40.7620 -73.9788 1 Manhattan MN0502
4 40000005 40.7557 -73.9794 1 Manhattan MN0502
5 40000010 40.7343 -73.9942 0 Manhattan MN0202
6 40000003 40.7490 -73.9903 0 Manhattan MN0501
7 40000005 40.7419 -73.9746 0 Manhattan MN0603
8 40000004 40.7436 -73.9943 1 Manhattan MN0401
9 40000007 40.7514 -73.9737 1 Manhattan MN0603
10 40000002 40.7270 -73.9939 0 Manhattan MN0202
NTAName MdHHIncE pop_density lcz_class
<chr> <int> <dbl> <int>
1 SoHo-Little Italy-Hudson Square 133847 14979.8 2
2 Upper West Side-Lincoln Square 158165 23416.4 1
3 Midtown-Times Square 153871 14245.5 1
4 Midtown-Times Square 153871 4927.67 1
5 Greenwich Village 175436 37463.5 2
6 Midtown South-Flatiron-Union Square 167458 15877.1 1
7 Murray Hill-Kips Bay 138337 29922.4 1
8 Chelsea-Hudson Yards 118915 37489.4 2
9 Murray Hill-Kips Bay 138337 20680.5 1
10 Greenwich Village 175436 31733.8 2
lcz_label
<chr>
1 Compact midrise
2 Compact highrise
3 Compact highrise
4 Compact highrise
5 Compact midrise
6 Compact highrise
7 Compact highrise
8 Compact midrise
9 Compact highrise
10 Compact midrise
# ℹ more rows
Finally, we save the data for further processing.
# Define output path for writing final dataset
locations_sdf_updated_three_file_path <- file.path(
  getwd(),
  "data",
  "locations_sdf_updated_three"
)
# Write enriched data back to Delta format (append mode)
spark_write_delta(
  locations_sdf_updated_three,
  path = locations_sdf_updated_three_file_path,
  mode = "append"
)
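As an optional final check, and assuming the write completed without errors, you can read the table straight back, mirroring how we loaded the data at the start of this chapter.
# Optional read-back check (assumes the Delta write above succeeded): confirm
# the table is readable and has the expected columns.
spark_read_delta(sc, path = locations_sdf_updated_three_file_path) %>%
  glimpse()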
And we disconnect from our Spark instance.
# Disconnect Spark session to free resources
spark_disconnect(sc)
5.4 References
Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). Combining expert and crowd-sourced training data to map urban form and functions for the continental US. Scientific Data, 7(1), 264. https://doi.org/10.1038/s41597-020-00605-z
Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). CONUS-wide LCZ map and Training Areas. Figshare. Dataset. https://doi.org/10.6084/m9.figshare.11416950