4  Part One - Processing Raster Data with Apache Sedona and Sparklyr in R

Using WorldPop Data to Determine Population Density Around Pickup and Dropoff Points

4.1 Introduction

In this chapter, we demonstrate how to extract information from raster files based on geographic coordinates. Let us assume that there is a relationship between the duration of a taxi ride (our dependent variable) and the population density of the surrounding area. We therefore need to extract population density values at each pickup and dropoff location. Such granular data is typically distributed in raster format, which is why we use WorldPop population density data at a resolution of 1 km by 1 km. You can download the data here to follow along.

The Spark configuration used in this chapter — and the next one — is identical to the one used in Chapter 3, so it is not shown below for brevity’s sake.

Furthermore, because I need to re-run this code every time I update and render this website, I will filter out just a few rows from the 94 million we previously worked with. In the actual analysis I conducted, however, I used the full dataset.

You can find out more about using Apache Sedona for raster manipulation here.

4.2 Loading updated locations data

We start by loading our updated locations data, which now contains median household income by neighbourhood.

To reiterate, I will filter for a few rows here so that I can render this webpage faster. Sometimes in your analysis, you will find that you have too much data to fit in memory, especially when running complex transformations. In such cases, you can filter for specific rows, perform your analysis on that subset, and then append the results to your delta tables. You can repeat this process for another set of rows until you are done.

For instance, knowing that I have trip ID values ranging from 0 to about 48,000,000, I would:

  • First filter for rows between 0 and 16 million,
  • Then 16 million to 32 million,
  • And finally, anything above 32 million,
  • Appending to the same folder each time.

If you have enough RAM and cores, though, feel free to run everything at once — go crazy with it!
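
To make that batching pattern concrete, here is a minimal sketch. It assumes the same sparklyr/dplyr session set up in Chapter 3; the source table trips_sdf and the output folder trips_processed are hypothetical placeholders.

# A sketch of the chunked filter-and-append pattern (placeholder names)
batch_edges <- c(0, 16e6, 32e6, 48e6 + 1)

for (i in seq_len(length(batch_edges) - 1)) {
  lo <- batch_edges[i]
  hi <- batch_edges[i + 1]

  trips_sdf |>
    filter(trip_id >= lo & trip_id < hi) |>
    # ... heavy transformations would go here ...
    spark_write_delta(
      path = file.path(getwd(), "data", "trips_processed"),
      mode = "append"  # Append each batch to the same Delta folder
    )
}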

# Read the updated locations Delta table
locations_sdf_updated_one <- spark_read_delta(
  sc,
  path = file.path(
    getwd(), 
    "data", 
    "locations_sdf_updated_one"
  )
) |> 
  filter(trip_id >= 40000000 & trip_id <= 40000010) |> # Keep only a small range of trip IDs
  sdf_register("locations_sdf_updated_one_view")  # Register as a temporary view
print(locations_sdf_updated_one, width=Inf, n=10)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source:   table<`locations_sdf_updated_one_view`> [?? x 8]
# Database: spark_connection
    trip_id latitude longitude is_pickup BoroName  NTA2020
      <dbl>    <dbl>     <dbl>     <dbl> <chr>     <chr>  
 1 40000000     40.8     -74.0         1 Manhattan MN0604 
 2 40000001     40.7     -74.0         1 Manhattan MN0201 
 3 40000002     40.7     -74.0         1 Manhattan MN0201 
 4 40000003     40.8     -74.0         1 Manhattan MN0502 
 5 40000004     40.7     -74.0         1 Manhattan MN0401 
 6 40000005     40.8     -74.0         1 Manhattan MN0502 
 7 40000006     40.8     -74.0         1 Manhattan MN0604 
 8 40000007     40.8     -74.0         1 Manhattan MN0603 
 9 40000008     40.8     -74.0         1 Manhattan MN0502 
10 40000009     40.7     -74.0         1 Manhattan MN0603 
   NTAName                         MdHHIncE
   <chr>                              <int>
 1 East Midtown-Turtle Bay           161934
 2 SoHo-Little Italy-Hudson Square   133847
 3 SoHo-Little Italy-Hudson Square   133847
 4 Midtown-Times Square              153871
 5 Chelsea-Hudson Yards              118915
 6 Midtown-Times Square              153871
 7 East Midtown-Turtle Bay           161934
 8 Murray Hill-Kips Bay              138337
 9 Midtown-Times Square              153871
10 Murray Hill-Kips Bay              138337
# ℹ more rows

4.3 Loading the WorldPop Population Density dataset

The difference when using raster data with Apache Sedona, compared to vector data, is that we do not import the raster in its native format directly. Instead, we must first load it as a binary DataFrame and then convert it back into a raster within Sedona.

Also, bear in mind that Sedona only accepts raster files in the following formats:
- Arc Info ASCII Grid,
- GeoTIFF, and
- NetCDF.

If your data is in any other raster format, you will first need to convert it to one of these supported formats.

I have found GDAL to be particularly useful for converting between raster formats. For this tutorial, I used GDAL on the command line to compress the original US population density raster file with the DEFLATE lossless compression algorithm, and then clipped it using the NYC boundaries shapefile. You want to work with as small a file as possible, as the processing can be quite memory intensive.

# Compressing the raster file
gdal_translate -co COMPRESS=DEFLATE usa_pd_2016_1km_UNadj_clipped.tif usa_pd_2016_1km_UNadj_compressed.tif

# Clipping the compressed raster using a shapefile boundary
gdalwarp -cutline nynta2020.shp -crop_to_cutline usa_pd_2016_1km_UNadj_compressed.tif nyc.tif
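
As noted above, if your source raster is in a format Sedona does not accept, GDAL can also convert it to GeoTIFF. Here is a minimal sketch wrapped in R's system2() so it sits alongside the chapter's R code; the input.img file name is purely hypothetical.

# Convert a hypothetical raster file to GeoTIFF with gdal_translate
system2(
  "gdal_translate",
  args = c("-of", "GTiff", "input.img", "output.tif")
)
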
# Load the raster data for world population (NYC)
world_pop_raster_filepath <- file.path(
  getwd(),
  "data",
  "raster",
  "worldpop",
  "nyc.tif"
)
# Read the raster data as a binary file
world_pop_binary <- spark_read_binary(
  sc,
  dir = world_pop_raster_filepath,
  name = "worldpop"
)

Next, we register the binary data as a temporary view and construct the raster from the GeoTIFF content.

# Register the world population raster as a temporary view
world_pop_binary |> sdf_register("worldpop_view")
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source:   table<`worldpop_view`> [?? x 4]
# Database: spark_connection
  path                                                modificationTime    length
  <chr>                                               <dttm>               <int>
1 file:/Users/rodgersiradukunda/Library/CloudStorage… 2025-02-27 01:46:58  26274
# ℹ 1 more variable: content <arrw_bnr>
# Extract raster data from the GeoTiff file using Sedona
worldpop_sdf <- sdf_sql(
  sc,
  "
  SELECT RS_FromGeoTiff(content) AS raster FROM worldpop_view
  "
)

# Register the raster data as a temporary view
worldpop_sdf |> sdf_register("worldpop_view") |> compute()
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
# Source:   table<`dbplyr_O777dTCBfY`> [?? x 1]
# Database: spark_connection
# ℹ 1 more variable: raster <arrw_bnr>
worldpop_sdf %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 1
Database: spark_connection
$ raster <arrw_bnr> <00, 10, 00, 00, 00, 67, 65, 6f, 74, 69, 66, 66, 5f, 63, 6…

We can retrieve metadata from our raster file, including:
- The upper left coordinates of the raster (in the raster’s coordinate system units),
- The width and height of the raster (in number of pixels),
- The spatial resolution of each pixel (in units of the raster’s CRS),
- Any skew or rotation of the raster (if present),
- The SRID (spatial reference system identifier) of the raster’s coordinate system,
- The number of bands, and
- Tile width and height.

In our case:
- Upper left X coordinate: -74.25125
- Upper left Y coordinate: 40.90792 (both in degrees as the CRS is WGS84)
- Raster size: 66 x 49 pixels (quite small)
- Pixel resolution: 0.00833 x -0.00833 degrees
- Skew: 0 in both x and y directions (i.e., no skew)
- SRID: 4326 (WGS 84)
- Number of bands: 2
- Tile width: 66, Tile height: 15

All this information is important when interpreting and working with raster data, especially when performing coordinate-based queries.

# Retrieve and view metadata for the world population raster
worldpop_sdf_metadata <- sdf_sql(
  sc,
  "
  SELECT RS_MetaData(raster) FROM worldpop_view
  "
) |> collect()
Warning in arrow_enabled_object.spark_jobj(sdf): Arrow disabled due to columns:
rs_metadata(raster)
options(width = 100)

# Glimpse at the metadata information
worldpop_sdf_metadata |> glimpse()
Rows: 1
Columns: 1
$ `rs_metadata(raster)` <list> [-74.25125, 40.90792, 66, 49, 0.008333333, -0.008333333, 0, 0, 4326…
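
As a quick sanity check, we can derive the raster's geographic extent by hand from the metadata above: start at the upper-left corner, then step east by the width times the pixel size and south by the height times the pixel size. This is just a back-of-the-envelope sketch using the values printed above.

# Derive the raster's bounding box from the metadata values
ul_x  <- -74.25125     # Upper-left X (degrees)
ul_y  <- 40.90792      # Upper-left Y (degrees)
n_col <- 66            # Width in pixels
n_row <- 49            # Height in pixels
res_x <- 0.008333333   # Pixel width (degrees)
res_y <- -0.008333333  # Pixel height (degrees; negative means north-up)

c(
  xmin = ul_x,
  ymin = ul_y + n_row * res_y,  # roughly 40.50
  xmax = ul_x + n_col * res_x,  # roughly -73.70
  ymax = ul_y
)

These bounds cover the NYC area we clipped to, which is a reassuring check before querying the raster at specific point locations.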

4.4 Joining point data with raster data

We now perform the join using Spatial SQL which, in my opinion, is much easier and more intuitive than using Apache Sedona's R functions for raster operations.

By leveraging Spatial SQL, we can directly query raster values at specific pickup and dropoff coordinates, simplifying what would otherwise be a more complex process if done via function-based syntax.

# Perform a spatial join between the locations and the world population data to calculate population density
locations_sdf_updated_two <- sdf_sql(
  sc,
  "
  SELECT 
    /*+ BROADCAST(w) */ l.*, RS_Value(w.raster, ST_Point(l.longitude, l.latitude)) AS pop_density
  FROM
    locations_sdf_updated_one_view l
  LEFT JOIN worldpop_view w
    ON RS_Intersects(w.raster, ST_Point(l.longitude, l.latitude))
  "
) 

We can now take a look at the result of our join below.

# Glimpse at the updated data with population density
locations_sdf_updated_two %>% glimpse()
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer range
Columns: 9
Database: spark_connection
$ trip_id     <dbl> 40000000, 40000001, 40000002, 40000003, 40000004, 40000005, 40000006, 40000007…
$ latitude    <dbl> 40.75250, 40.72693, 40.72448, 40.76216, 40.74357, 40.75565, 40.75845, 40.75144…
$ longitude   <dbl> -73.97816, -74.00348, -74.00199, -73.97910, -73.99427, -73.97939, -73.95974, -…
$ is_pickup   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ BoroName    <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", …
$ NTA2020     <chr> "MN0604", "MN0201", "MN0201", "MN0502", "MN0401", "MN0502", "MN0604", "MN0603"…
$ NTAName     <chr> "East Midtown-Turtle Bay", "SoHo-Little Italy-Hudson Square", "SoHo-Little Ita…
$ MdHHIncE    <int> 161934, 133847, 133847, 153871, 118915, 153871, 161934, 138337, 153871, 138337…
$ pop_density <dbl> 4927.671, 22022.211, 14979.832, 14245.537, 37489.418, 4927.671, 41234.668, 206…
# Print a preview of the resulting dataframe with specific formatting options
print(locations_sdf_updated_two, n=10, width=Inf)
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer range
# Source:   table<`sparklyr_tmp_73426160_7142_40f6_acc3_cdc5dd708f61`> [?? x 9]
# Database: spark_connection
    trip_id latitude longitude is_pickup BoroName  NTA2020 NTAName                         MdHHIncE
      <dbl>    <dbl>     <dbl>     <dbl> <chr>     <chr>   <chr>                              <int>
 1 40000000     40.8     -74.0         1 Manhattan MN0604  East Midtown-Turtle Bay           161934
 2 40000001     40.7     -74.0         1 Manhattan MN0201  SoHo-Little Italy-Hudson Square   133847
 3 40000002     40.7     -74.0         1 Manhattan MN0201  SoHo-Little Italy-Hudson Square   133847
 4 40000003     40.8     -74.0         1 Manhattan MN0502  Midtown-Times Square              153871
 5 40000004     40.7     -74.0         1 Manhattan MN0401  Chelsea-Hudson Yards              118915
 6 40000005     40.8     -74.0         1 Manhattan MN0502  Midtown-Times Square              153871
 7 40000006     40.8     -74.0         1 Manhattan MN0604  East Midtown-Turtle Bay           161934
 8 40000007     40.8     -74.0         1 Manhattan MN0603  Murray Hill-Kips Bay              138337
 9 40000008     40.8     -74.0         1 Manhattan MN0502  Midtown-Times Square              153871
10 40000009     40.7     -74.0         1 Manhattan MN0603  Murray Hill-Kips Bay              138337
   pop_density
         <dbl>
 1       4928.
 2      22022.
 3      14980.
 4      14246.
 5      37489.
 6       4928.
 7      41235.
 8      20680.
 9      14246.
10      44076.
# ℹ more rows

Perfect! We have successfully obtained approximate population density values for each pickup and dropoff location.
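
Because we used a LEFT JOIN, any point falling outside the raster extent would come back with a missing pop_density. A minimal sanity check along these lines, assuming the same Spark session, confirms whether that happened:

# Count how many rows came back without a population density value
locations_sdf_updated_two |>
  summarise(
    n_rows    = n(),
    n_missing = sum(as.integer(is.na(pop_density)), na.rm = TRUE)
  )

If n_missing is greater than zero, some points fall outside the clipped raster and may need special handling.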

4.5 Saving the data

We now write the updated data to a Delta table for further processing.

# Define file path for saving the updated dataframe
locations_sdf_updated_two_file_path <- file.path(
  getwd(), 
  "data", 
  "locations_sdf_updated_two"
)

# Save the final dataframe as a Delta table
spark_write_delta(
  locations_sdf_updated_two,
  path = locations_sdf_updated_two_file_path,
  mode = "append"  # Overwrite any existing data at the location
)
# Disconnect from the Spark session
spark_disconnect(sc)

4.6 References

  • WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Département de Géographie, Université de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00675