5 Part Two - Processing Raster Data with Apache Sedona and Sparklyr in R
Using Raster Tiles to Determine Local Climate Zones of Pickup and Dropoff Locations
5.1 Introduction
For this chapter, we would like to find out the land cover classification associated with specific pickup and dropoff points. Our assumption is that the type of area where one picks up a taxi or gets dropped off may influence how long the trip takes. Again, do not dwell too much on this assumption — the main objective is to demonstrate a different way of extracting data from raster files.
We shall make use of the Local Climate Zones (LCZ) Map from the World Urban Database and Access Portal Tools (WUDAPT). The US (CONUS-wide) version of this dataset is available on Figshare (see the references at the end of this chapter). The dataset contains 17 land cover classes, ranging from compact high-rise buildings to water bodies.
We downloaded a version that was already in the EPSG:4326 CRS and clipped it to the NYC boundary. In this chapter, we demonstrate how to divide a raster image into tiles before performing analysis. This is particularly useful for large raster files that cannot be processed in their entirety: tiling makes the analysis more manageable and efficient.
Note: Raster processing is computationally intensive, so it is preferable to work with smaller files, especially when operating in local mode. You will know a raster file is too large when you receive an error upon attempting to view its contents.
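If you would rather check in advance than wait for an error, a quick look at the file's size on disk is a reasonable proxy. The sketch below uses base R's file.size(); the 500 MB threshold is purely illustrative, not a documented limit.
# Illustrative pre-flight check: inspect the raster's size on disk before
# asking Spark to load it. The 500 MB cutoff is an arbitrary example.
raster_file <- file.path(
  getwd(), "data", "raster", "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
size_mb <- file.size(raster_file) / 1024^2 # NA if the file does not exist
if (!is.na(size_mb) && size_mb > 500) {
  message("Raster is ", round(size_mb), " MB; consider tiling it before analysis.")
}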
Important: The Spark configuration used in this chapter is the same as that used in Chapter 3, and so it is not repeated here for brevity.
5.2 Loading the data
We start by loading our most updated locations data.
# Read location data stored in Delta format
locations_sdf_updated_two <- spark_read_delta(
  sc,
  path = file.path(getwd(), "data", "locations_sdf_updated_two")
) %>%
  filter(trip_id >= 40000000 & trip_id <= 40000010) %>% # keep only a small sample of trips
  sdf_register("locations_sdf_updated_two_view") # register as a temporary view for SQL
We then load our raster data.
# Define path to raster file (LCZ map)
wudapt_raster_filepath <- file.path(
  getwd(),
  "data",
  "raster",
  "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
# Read raster as binary using Spark
wudapt_binary <- spark_read_binary(
  sc,
  dir = wudapt_raster_filepath,
  name = "wudapt_binary_view"
)
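Before tiling, it can be worth confirming that Sedona can actually parse the GeoTiff. Assuming Sedona's raster functions are registered on this connection (as in the earlier chapters), a metadata query like the sketch below returns the raster's origin, dimensions, and SRID.
# Optional sanity check (assumes Sedona's raster functions are available):
# parse the GeoTiff once and inspect its metadata (origin, width, height, SRID).
sdf_sql(
  sc,
  "
  SELECT RS_Metadata(RS_FromGeoTiff(content)) AS meta
  FROM wudapt_binary_view
  "
) %>%
  glimpse()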
5.3 Creating raster tiles
Here, we assume that our raster data is too large to be processed in its entirety. To address this, we divide it into tiles of 256 by 256 pixels. This makes the data more analysis-friendly: we can first identify the specific tile we need and then operate on that tile alone, instead of the entire raster.
Note: The clipped dataset used here was actually not large, but we proceeded with this method for demonstration purposes to show how one would handle genuinely large raster datasets.
# Explode raster into 256x256 tiles for spatial operations
wudapt_raster_tiles <- sdf_sql(
  sc,
  "
  SELECT RS_TileExplode(RS_FromGeoTiff(content), 256, 256) AS (x, y, tile)
  FROM wudapt_binary_view
  "
) %>%
  sdf_register("wudapt_raster_tiles_view")
# Quick look at raster tiles structure
wudapt_raster_tiles %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 3
Database: spark_connection
$ x <int> 0, 1, 0, 1
$ y <int> 0, 0, 1, 1
$ tile <arrw_bnr> <00, 0f, 00, 00, 00, 67, 65, 6e, 65, 72, 69, 63, 43, 6f, 76, 65, …
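To see where a given tile sits geographically, Sedona's RS_Envelope returns a tile's bounding geometry. The query below is purely illustrative: it prints the footprint of a single tile from the view we just registered.
# Illustrative: inspect one tile's spatial footprint. RS_Envelope returns the
# tile's bounding geometry, which ST_AsText renders as WKT.
sdf_sql(
  sc,
  "
  SELECT x, y, ST_AsText(RS_Envelope(tile)) AS tile_extent
  FROM wudapt_raster_tiles_view
  LIMIT 1
  "
)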
As you can see below, the spatial SQL query is divided into two parts:
- First, we locate the tile that a given coordinate pair (pickup or dropoff point) belongs to.
- Second, we use that specific tile to find the actual land classification value associated with that location.
This two-step approach makes it efficient to work with large raster datasets by limiting the analysis to only relevant tiles, rather than processing the entire raster at once.
# Spatial join: match location points with corresponding LCZ tiles and assign LCZ labels
locations_sdf_updated_three <- sdf_sql(
  sc,
  "
WITH matched_tiles AS (
SELECT l.*, w.tile
FROM wudapt_raster_tiles_view w
JOIN locations_sdf_updated_two_view l
ON RS_Intersects(w.tile, ST_Point(l.longitude, l.latitude))
)
SELECT *,
TRY_CAST(RS_Value(tile, ST_Point(longitude, latitude)) AS INT) AS lcz_class,
CASE
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 1 THEN 'Compact highrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 2 THEN 'Compact midrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 3 THEN 'Compact lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 4 THEN 'Open highrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 5 THEN 'Open midrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 6 THEN 'Open lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 7 THEN 'Lightweight lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 8 THEN 'Large lowrise'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 9 THEN 'Sparsely built'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 10 THEN 'Heavy industry'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 11 THEN 'Dense trees'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 12 THEN 'Scattered trees'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 13 THEN 'Bush, scrub'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 14 THEN 'Low plants'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 15 THEN 'Bare rock or paved'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 16 THEN 'Bare soil or sand'
WHEN RS_Value(tile, ST_Point(longitude, latitude)) = 17 THEN 'Water'
ELSE 'Unknown'
END AS lcz_label
FROM matched_tiles
"
)
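Before moving on, a quick tally of the assigned labels is a useful plausibility check. The snippet below is a minimal sketch: dplyr's count() is translated by sparklyr into a Spark aggregation.
# Quick plausibility check (illustrative): how are the LCZ labels distributed
# across our sampled points? count() becomes a Spark GROUP BY.
locations_sdf_updated_three %>%
  count(lcz_label, sort = TRUE)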
We now remove the tile column from our dataset because Delta cannot serialise raster columns, and because we do not need the tiles for our intended analysis. This keeps the dataset clean and light for storage and further processing.
# Drop RasterUDT (tile) column before saving as Delta (unsupported type in Delta)
locations_sdf_updated_three <- locations_sdf_updated_three %>%
  select(-tile)
The updated data now looks as shown below. As you can see, we went from only having coordinates to obtaining richer information about our locations by leveraging additional geospatial datasets.
Not too bad for using a simple laptop to process tens of millions of rows!
# Preview updated data
locations_sdf_updated_three %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 11
Database: spark_connection
$ trip_id <dbl> 40000002, 40000009, 40000008, 40000005, 40000010, 40000003…
$ latitude <dbl> 40.72448, 40.77977, 40.76196, 40.75565, 40.73432, 40.74904…
$ longitude <dbl> -74.00199, -73.98648, -73.97881, -73.97939, -73.99419, -73…
$ is_pickup <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1…
$ BoroName <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manha…
$ NTA2020 <chr> "MN0201", "MN0701", "MN0502", "MN0502", "MN0202", "MN0501"…
$ NTAName <chr> "SoHo-Little Italy-Hudson Square", "Upper West Side-Lincol…
$ MdHHIncE <int> 133847, 158165, 153871, 153871, 175436, 167458, 138337, 11…
$ pop_density <dbl> 14979.832, 23416.408, 14245.537, 4927.671, 37463.500, 1587…
$ lcz_class <int> 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1…
$ lcz_label <chr> "Compact midrise", "Compact highrise", "Compact highrise",…
# Print formatted preview for review
withr::with_options(
  list(pillar.sigfig = 6),
  print(locations_sdf_updated_three, n = 10, width = Inf)
)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source: SQL [?? x 11]
# Database: spark_connection
trip_id latitude longitude is_pickup BoroName NTA2020
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 40000002 40.7245 -74.0020 1 Manhattan MN0201
2 40000009 40.7798 -73.9865 0 Manhattan MN0701
3 40000008 40.7620 -73.9788 1 Manhattan MN0502
4 40000005 40.7557 -73.9794 1 Manhattan MN0502
5 40000010 40.7343 -73.9942 0 Manhattan MN0202
6 40000003 40.7490 -73.9903 0 Manhattan MN0501
7 40000005 40.7419 -73.9746 0 Manhattan MN0603
8 40000004 40.7436 -73.9943 1 Manhattan MN0401
9 40000007 40.7514 -73.9737 1 Manhattan MN0603
10 40000002 40.7270 -73.9939 0 Manhattan MN0202
NTAName MdHHIncE pop_density lcz_class
<chr> <int> <dbl> <int>
1 SoHo-Little Italy-Hudson Square 133847 14979.8 2
2 Upper West Side-Lincoln Square 158165 23416.4 1
3 Midtown-Times Square 153871 14245.5 1
4 Midtown-Times Square 153871 4927.67 1
5 Greenwich Village 175436 37463.5 2
6 Midtown South-Flatiron-Union Square 167458 15877.1 1
7 Murray Hill-Kips Bay 138337 29922.4 1
8 Chelsea-Hudson Yards 118915 37489.4 2
9 Murray Hill-Kips Bay 138337 20680.5 1
10 Greenwich Village 175436 31733.8 2
lcz_label
<chr>
1 Compact midrise
2 Compact highrise
3 Compact highrise
4 Compact highrise
5 Compact midrise
6 Compact highrise
7 Compact highrise
8 Compact midrise
9 Compact highrise
10 Compact midrise
# ℹ more rows
Finally, we save the data for further processing.
# Define output path for writing final dataset
locations_sdf_updated_three_file_path <- file.path(
  getwd(),
  "data",
  "locations_sdf_updated_three"
)
# Write enriched data back to Delta format (append mode)
spark_write_delta(
  locations_sdf_updated_three,
  path = locations_sdf_updated_three_file_path,
  mode = "append"
)
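As an optional final check, and assuming the write completed without errors, you can read the table straight back, mirroring how we loaded the data at the start of this chapter.
# Optional read-back check (assumes the Delta write above succeeded): confirm
# the table is readable and has the expected columns.
spark_read_delta(sc, path = locations_sdf_updated_three_file_path) %>%
  glimpse()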
And we disconnect from our Spark instance.
# Disconnect Spark session to free resources
spark_disconnect(sc)
5.4 References
Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). Combining expert and crowd-sourced training data to map urban form and functions for the continental US. Scientific Data, 7(1), 264. https://doi.org/10.1038/s41597-020-00605-z
Demuzere, M., Hankey, S., Mills, G., Zhang, W., Lu, T., & Bechtel, B. (2020). CONUS-wide LCZ map and Training Areas. Figshare. Dataset. https://doi.org/10.6084/m9.figshare.11416950