5  Part Two - Processing Raster Data with Apache Sedona and Sparklyr in R

Using Raster Tiles to Determine Local Climate Zones of Pickup and Dropoff Locations

5.1 Introduction

For this chapter, we would like to find out the land cover classification associated with specific pickup and dropoff points. Our assumption is that the type of area where one picks up a taxi or gets dropped off may influence how long the trip takes. Again, do not dwell too much on this assumption — the main objective is to demonstrate a different way of extracting data from raster files.

We shall make use of the Local Climate Zones (LCZ) map from the World Urban Database and Access Portal Tools (WUDAPT). The US version of this dataset is available on Figshare (Demuzere et al., 2020; see the references at the end of this chapter). It contains 17 land cover classes, ranging from compact high-rise buildings to water bodies.
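
For reference, the 17 classes and the labels we use later in this chapter are listed below as a small R lookup table. This is purely a reading aid — the Spark pipeline below hardcodes the same mapping in SQL — and the lcz_lookup name is our own:

# Reference table of the 17 LCZ classes (codes match the raster cell values)
lcz_lookup <- tibble::tibble(
  lcz_class = 1:17,
  lcz_label = c(
    "Compact highrise", "Compact midrise", "Compact lowrise",
    "Open highrise", "Open midrise", "Open lowrise",
    "Lightweight lowrise", "Large lowrise", "Sparsely built",
    "Heavy industry", "Dense trees", "Scattered trees",
    "Bush, scrub", "Low plants", "Bare rock or paved",
    "Bare soil or sand", "Water"
  )
)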

For this chapter, we demonstrate how to divide a raster image into tiles before performing analysis. This is particularly useful when working with large raster files that cannot be processed in their entirety; dividing the raster into tiles makes the analysis more manageable and efficient. We downloaded a version of the LCZ map that was already in the EPSG:4326 CRS and clipped it to the NYC boundary.
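
If you need to prepare such a file yourself, the reprojection and clipping can be done outside Spark, for instance with the terra package. The sketch below is illustrative only: the input raster and NYC boundary file names are assumptions, and nearest-neighbour resampling is used so that the categorical class codes are preserved.

library(terra)

# Illustrative preprocessing (hypothetical file names): reproject the
# CONUS LCZ raster to EPSG:4326 and crop it to the NYC boundary
lcz_conus <- rast("CONUS_LCZ_map_NLCD_v1.0.tif")
nyc_boundary <- vect("nyc_boundary.geojson")

lcz_wgs84 <- project(lcz_conus, "EPSG:4326", method = "near") # keep class codes intact
lcz_nyc <- crop(lcz_wgs84, nyc_boundary, mask = TRUE)         # crop and mask to boundary

writeRaster(lcz_nyc, "CONUS_LCZ_map_NLCD_v1.0_cropped.tif")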

Note: Raster processing is computationally intensive, so it is preferable to work with smaller files, especially when operating in local mode. A telltale sign that a raster file is too large for your setup is an error when you attempt to view its contents.

Important: The Spark configuration used in this chapter is the same as that used in Chapter 3, and so it is not repeated here for brevity.

5.2 Loading the data

We start by loading our most recently updated locations data.

# Read location data stored in Delta format
locations_sdf_updated_two <- spark_read_delta(
  sc,
  path = file.path(getwd(), "data", "locations_sdf_updated_two")
) %>%
  filter(trip_id >= 40000000 & trip_id <= 40000010) %>% # keep a small sample of trips
  sdf_register("locations_sdf_updated_two_view") # register as a temporary view for SQL

We then load our raster data.

# Define path to raster file (LCZ map)
wudapt_raster_filepath <- file.path(
  getwd(),
  "data",
  "raster",
  "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
# Read raster as binary using Spark
wudapt_binary <- spark_read_binary(
  sc,
  dir = wudapt_raster_filepath,
  name = "wudapt_binary_view"
)
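
Before tiling, it is worth glancing at the file's size on disk — a rough heuristic for judging whether the raster is small enough to handle, not a hard rule:

# Rough heuristic: the raster's size on disk, in megabytes
file.size(wudapt_raster_filepath) / 1024^2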

5.3 Creating raster tiles

Here, we are assuming that our raster data is too large to be processed in its entirety. To address this, we divide it into tiles of 256 by 256 pixels. This makes the data more analysis-friendly: we can first identify the specific tile a point falls in, and then run the analysis on only that tile instead of on the entire raster.

Note: The clipped dataset used here was actually not large, but we proceeded with this method for demonstration purposes to show how one would handle genuinely large raster datasets.

# Explode raster into tiles (256x256) for spatial operations and cache in memory
wudapt_raster_tiles <- sdf_sql(
  sc,
  "
  SELECT RS_TileExplode(RS_FromGeoTiff(content), 256, 256) AS (x, y, tile) FROM wudapt_binary_view
  "
) %>%
  sdf_register("wudapt_raster_tiles_view")

# Quick look at raster tiles structure
wudapt_raster_tiles %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 3
Database: spark_connection
$ x    <int> 0, 1, 0, 1
$ y    <int> 0, 0, 1, 1
$ tile <arrw_bnr> <00, 0f, 00, 00, 00, 67, 65, 6e, 65, 72, 69, 63, 43, 6f, 76, 65, …
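
To confirm the tiling worked as expected, we can inspect each tile's footprint. The query below is a quick sketch using Sedona's RS_Envelope, which returns a tile's bounding geometry:

# Inspect the tiles: one row per tile, with its bounding box as WKT
sdf_sql(
  sc,
  "
  SELECT x, y, ST_AsText(RS_Envelope(tile)) AS tile_extent
  FROM wudapt_raster_tiles_view
  "
)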

As you can see below, the spatial SQL query is divided into two parts:

  1. First, we locate the tile that a given coordinate pair (pickup or dropoff point) belongs to.
  2. Second, we use that specific tile to find the actual land classification value associated with that location.

This two-step approach makes it efficient to work with large raster datasets by limiting the analysis to only relevant tiles, rather than processing the entire raster at once.

# Spatial join: match location points with corresponding LCZ tiles and assign LCZ labels
locations_sdf_updated_three <- sdf_sql(
  sc,
  "
  WITH matched_tiles AS (
    SELECT l.*, w.tile
      FROM wudapt_raster_tiles_view w
      JOIN locations_sdf_updated_two_view l
        ON RS_Intersects(w.tile, ST_Point(l.longitude, l.latitude))
  ),

  classified AS (
    SELECT *,
      TRY_CAST(RS_Value(tile, ST_Point(longitude, latitude)) AS INT) AS lcz_class
    FROM matched_tiles
  )

  SELECT *,
    CASE lcz_class
      WHEN 1 THEN 'Compact highrise'
      WHEN 2 THEN 'Compact midrise'
      WHEN 3 THEN 'Compact lowrise'
      WHEN 4 THEN 'Open highrise'
      WHEN 5 THEN 'Open midrise'
      WHEN 6 THEN 'Open lowrise'
      WHEN 7 THEN 'Lightweight lowrise'
      WHEN 8 THEN 'Large lowrise'
      WHEN 9 THEN 'Sparsely built'
      WHEN 10 THEN 'Heavy industry'
      WHEN 11 THEN 'Dense trees'
      WHEN 12 THEN 'Scattered trees'
      WHEN 13 THEN 'Bush, scrub'
      WHEN 14 THEN 'Low plants'
      WHEN 15 THEN 'Bare rock or paved'
      WHEN 16 THEN 'Bare soil or sand'
      WHEN 17 THEN 'Water'
      ELSE 'Unknown'
    END AS lcz_label
  FROM classified
  "
)

We now remove the tile column from our dataset, because Delta cannot serialise raster columns and because we do not need the tiles for our intended analysis. This keeps the dataset clean and light for storage and further processing.
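
If you want to see the offending column for yourself, sdf_schema() reports each column's Spark type; the tile column shows up as Sedona's raster user-defined type:

# Inspect the Spark schema; the tile column is a raster UDT that Delta cannot store
sdf_schema(locations_sdf_updated_three)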

# Drop RasterUDT (tile) column before saving as Delta (unsupported type in Delta)
locations_sdf_updated_three <- locations_sdf_updated_three %>%
  select(-tile)

The updated data now looks as shown below. As you can see, we went from only having coordinates to obtaining richer information about our locations by leveraging additional geospatial datasets.

Not too bad for using a simple laptop to process tens of millions of rows!

# Preview updated data
locations_sdf_updated_three %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 11
Database: spark_connection
$ trip_id     <dbl> 40000002, 40000009, 40000008, 40000005, 40000010, 40000003…
$ latitude    <dbl> 40.72448, 40.77977, 40.76196, 40.75565, 40.73432, 40.74904…
$ longitude   <dbl> -74.00199, -73.98648, -73.97881, -73.97939, -73.99419, -73…
$ is_pickup   <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1…
$ BoroName    <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manha…
$ NTA2020     <chr> "MN0201", "MN0701", "MN0502", "MN0502", "MN0202", "MN0501"…
$ NTAName     <chr> "SoHo-Little Italy-Hudson Square", "Upper West Side-Lincol…
$ MdHHIncE    <int> 133847, 158165, 153871, 153871, 175436, 167458, 138337, 11…
$ pop_density <dbl> 14979.832, 23416.408, 14245.537, 4927.671, 37463.500, 1587…
$ lcz_class   <int> 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1…
$ lcz_label   <chr> "Compact midrise", "Compact highrise", "Compact highrise",…
# Print formatted preview for review
withr::with_options(
  list(pillar.sigfig = 6),
  print(locations_sdf_updated_three, n = 10, width = Inf)
)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source:   SQL [?? x 11]
# Database: spark_connection
    trip_id latitude longitude is_pickup BoroName  NTA2020
      <dbl>    <dbl>     <dbl>     <dbl> <chr>     <chr>  
 1 40000002  40.7245  -74.0020         1 Manhattan MN0201 
 2 40000009  40.7798  -73.9865         0 Manhattan MN0701 
 3 40000008  40.7620  -73.9788         1 Manhattan MN0502 
 4 40000005  40.7557  -73.9794         1 Manhattan MN0502 
 5 40000010  40.7343  -73.9942         0 Manhattan MN0202 
 6 40000003  40.7490  -73.9903         0 Manhattan MN0501 
 7 40000005  40.7419  -73.9746         0 Manhattan MN0603 
 8 40000004  40.7436  -73.9943         1 Manhattan MN0401 
 9 40000007  40.7514  -73.9737         1 Manhattan MN0603 
10 40000002  40.7270  -73.9939         0 Manhattan MN0202 
   NTAName                             MdHHIncE pop_density lcz_class
   <chr>                                  <int>       <dbl>     <int>
 1 SoHo-Little Italy-Hudson Square       133847    14979.8          2
 2 Upper West Side-Lincoln Square        158165    23416.4          1
 3 Midtown-Times Square                  153871    14245.5          1
 4 Midtown-Times Square                  153871     4927.67         1
 5 Greenwich Village                     175436    37463.5          2
 6 Midtown South-Flatiron-Union Square   167458    15877.1          1
 7 Murray Hill-Kips Bay                  138337    29922.4          1
 8 Chelsea-Hudson Yards                  118915    37489.4          2
 9 Murray Hill-Kips Bay                  138337    20680.5          1
10 Greenwich Village                     175436    31733.8          2
   lcz_label       
   <chr>           
 1 Compact midrise 
 2 Compact highrise
 3 Compact highrise
 4 Compact highrise
 5 Compact midrise 
 6 Compact highrise
 7 Compact highrise
 8 Compact midrise 
 9 Compact highrise
10 Compact midrise 
# ℹ more rows
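
As a quick sanity check on the enrichment, we can tabulate how many points fall in each class — a minimal sketch using dplyr's count(), which sparklyr translates to Spark SQL:

# Sanity check: distribution of points across LCZ classes
locations_sdf_updated_three %>%
  count(lcz_class, lcz_label, sort = TRUE)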

Finally, we save the data for further processing.

# Define output path for writing final dataset
locations_sdf_updated_three_file_path <- file.path(
  getwd(),
  "data",
  "locations_sdf_updated_three"
)

# Write enriched data back to Delta format (append mode)
spark_write_delta(
  locations_sdf_updated_three,
  path = locations_sdf_updated_three_file_path,
  mode = "append"
)

And we disconnect from our Spark instance.

# Disconnect Spark session to free resources
spark_disconnect(sc)

5.4 References

  • Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). Combining expert and crowd-sourced training data to map urban form and functions for the continental US. Scientific Data, 7(1), 264. https://doi.org/10.1038/s41597-020-00605-z

  • Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). CONUS-wide LCZ map and Training Areas [Dataset]. Figshare. https://doi.org/10.6084/m9.figshare.11416950