Benson Duong

Data Science major at UC San Diego

Interests in Data Science, Python Programming, Machine Learning.

Geospatial Data Science Project: New York City Traffic

Background

This GIS data science project is an exploration of traffic volume statistics in New York City. It is an independent project that I’ve worked on since Fall 2021. It contains 2 parts:

ArcGIS-Data-Preprocessing

Part 1A: Getting the Street Segment Traffic Data into a GIS-friendly Format

The traffic file has 27 million rows, so saving it as Parquet and using polars rather than pandas is advisable. Opening the traffic CSV file (made from the code before) in ArcMap, you’ll come across the first problem: it’s an ordinary CSV file with no usable geospatial formatting. The closest columns we have are the street names, which we don’t really want to resort to, since string data like this can be wildly inconsistent from one data source to another (East 73rd Street in one data source might be 73 St. E in another).

But there’s no reason to panic. Luckily, the metadata for that traffic CSV file shows that the column “Segment ID” is an identifier for each street segment. After some internet sleuthing, you will eventually find an online data source with a name like “nyc_lion” (the second base dataset), which has extensive data on NYC street segments, including a column with the exact same identifier. Most importantly, it includes shapefiles that will allow us to use geospatial data.

Now we must join the shapefile dataframe and the CSV dataframe on the shared key, Segment ID. But there’s a side issue that must be resolved first. In the traffic CSV layer, Segment ID is a long integer. Meanwhile, in the lion shapefile layer, Segment ID is a 7-character string that’s zero-padded in the front. This formatting difference will ruin the join.

This can be resolved by making a new column with the correctly formatted segment ID.
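The source does not say which tool was used to create the new column, so the following is only a minimal sketch of the padding logic in plain Python. The sample rows, the column name `SegmentID_str`, and the helper `format_segment_id` are hypothetical and exist only for illustration.

```python
def format_segment_id(segment_id: int, width: int = 7) -> str:
    """Convert an integer Segment ID to a zero-padded string of the given width."""
    return str(segment_id).zfill(width)


# Hypothetical sample rows standing in for the traffic CSV.
traffic_rows = [
    {"Segment ID": 12345, "Hour": 0, "Volume": 18},
    {"Segment ID": 9876543, "Hour": 0, "Volume": 42},
]

# Add a new column whose format matches the lion layer's string key.
for row in traffic_rows:
    row["SegmentID_str"] = format_segment_id(row["Segment ID"])

print([row["SegmentID_str"] for row in traffic_rows])
```

Writing the padded value to a new column, rather than overwriting the original, keeps the integer ID available in case anything else depends on it.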

Now we can go ahead with the join. But after running it, you’ll notice a problem: ArcGIS’s built-in join tool treats the join as one-to-one, keeping only the first matching row it finds for each street segment and ignoring the rows after it. This means that each street segment will have just the first hour of the day, rather than all of them. What we need is a one-to-many join, because each street segment has many rows describing its traffic volume at different times of the day. However, we won’t delete this failed join layer. It will be useful later, so rename it as something like “OneToOneJoin” and tuck it away for now.
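To make the difference concrete, here is a small sketch in plain Python, not the ArcGIS tools themselves. It joins a hypothetical segment table to a hypothetical hourly traffic table in both ways. All rows and names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins: one geometry row per segment, many traffic rows per segment.
segments = [
    {"SegmentID": "0012345", "geometry": "LINESTRING A"},
    {"SegmentID": "0067890", "geometry": "LINESTRING B"},
]
traffic = [
    {"SegmentID": "0012345", "Hour": 0, "Volume": 18},
    {"SegmentID": "0012345", "Hour": 1, "Volume": 11},
    {"SegmentID": "0012345", "Hour": 2, "Volume": 7},
    {"SegmentID": "0067890", "Hour": 0, "Volume": 42},
    {"SegmentID": "0067890", "Hour": 1, "Volume": 35},
]

# Index the traffic rows by the join key.
traffic_by_segment = defaultdict(list)
for row in traffic:
    traffic_by_segment[row["SegmentID"]].append(row)

# One-to-one behavior: keep only the first matching traffic row per segment.
one_to_one = [
    {**seg, **traffic_by_segment[seg["SegmentID"]][0]}
    for seg in segments
    if seg["SegmentID"] in traffic_by_segment
]

# One-to-many behavior: emit one joined row per matching traffic row.
one_to_many = [
    {**seg, **t}
    for seg in segments
    for t in traffic_by_segment.get(seg["SegmentID"], [])
]

print(len(one_to_one), len(one_to_many))
```

The one-to-one result has one row per segment (only hour 0 survives), while the one-to-many result keeps every hourly row with the segment's geometry attached.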

According to Esri, ArcToolbox > Layers and Table Views > Make Query Table provides an alternative join tool that handles one-to-many joins. However, this method did not work for me, so I’ll provide another way. According to this forum post: https://gis.stackexchange.com/questions/177506/one-to-many-joins-on-a-feature-class-to-a-table, to do a one-to-many join:

The final, resulting join layer (“gdb_join”) should appear like this (it does not need to be green). All the other layers were hidden and a basemap was added underneath to show the purpose of this join.

In summary, we had to join the 1st base dataset listed at the beginning with the 2nd base dataset, so that its street segment traffic data would be geospatially usable and visible when loaded into ArcMap.

Part 1B: Retrieval of Geospatial Data Surrounding the Street Segments

We could move on to data cleaning and wrap up ArcGIS, but we could also go a bit further. So far, we have data on time (hourly traffic data) and place (street segment), which are very useful for analyzing traffic. But could place be improved? Specifically, could more be done to learn about a street segment’s surroundings and local environment?

Environmental context can imply a lot about the traffic volume. For example, if the street segment’s surrounding area is mostly residential, as in much of Queens or Staten Island, you likely won’t see noisy urban traffic.

This is the goal of Part 1B:

Zooming in, the LandUse shapefile layer (pink) represents the parcels and lots on each city-block. The lion shapefile layer (dark red), and the new gdb_join shapefile layer (bright green) are also shown.

Remember that “failed” OneToOneJoin shapefile layer created back then? This is when it gets used again. Use the Geoprocessing > Buffer Tool with the following parameters:

Run the Buffer tool, and rename the resulting layer “StreetSegmentBlobs”. You should have a result like this: These “blobs” are the areas within 500 ft of each street segment. The reason that the “failed join” layer (the “OneToOneJoin” layer) was used is to avoid having redundant “blobs” stacked onto each other; only 1 is necessary for each street segment.
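For intuition about what a “blob” contains, the sketch below tests whether a point lies within 500 ft of a straight street segment. It is plain Python, not the ArcGIS Buffer tool, and it assumes planar coordinates measured in feet; the function names and the sample street are hypothetical.

```python
import math


def distance_to_segment(px, py, ax, ay, bx, by):
    """Shortest planar distance from point P to the line segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: A and B are the same point.
        return math.hypot(px - ax, py - ay)
    # Project P onto the line through A and B, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)


def in_buffer(point, segment, radius_ft=500.0):
    """Return True if the point falls inside the buffer around the segment."""
    (px, py), ((ax, ay), (bx, by)) = point, segment
    return distance_to_segment(px, py, ax, ay, bx, by) <= radius_ft


# Hypothetical 1000 ft street segment running along the x-axis.
street = ((0.0, 0.0), (1000.0, 0.0))

print(in_buffer((500.0, 300.0), street))  # 300 ft to the side of the street
print(in_buffer((500.0, 600.0), street))  # 600 ft to the side of the street
```

Because the distance is measured to the nearest point on the segment, the region has rounded ends past each endpoint, which is why the buffered segments look like blobs rather than rectangles.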

Next, use the Geoprocessing > Clip tool with the following parameters:

Run the Clip tool, which might take a while. Name the new clipped layer StreetSegment_LandUse_Clipped.

Finally, use the Customize > ArcToolbox > Analysis Tools > Overlay > Spatial Join tool with the following parameters:

Name the resulting layer StreetSegment_LandUse_Subsets, and export its table to a txt file for later use as a CSV.

Data Analysis and Visualizations

Stacked Barcharts showing Landuse Compositions across NYC Boroughs

Lineplots showing Traffic Volume over Weekday-types, Boroughs, and Landuses

3D Scatterplot in terms of Traffic Volume (color), Subway Entrance Proximity, and Building Floor Count

Correlation Heatmap

Multinomial Logistic Regression Prediction Model’s Confusion Matrix

Feature Importances of the Multinomial Logistic Regression Prediction Model