Anthony Scaffeo
- May 16, 2023

Using Data Science to Understand the City of Thunder Bay's Capital Improvement Programs for 2020

When managing city infrastructure, it is key to understand how funds are allocated across different programs. In this blog post, we explore the City of Thunder Bay's capital improvement programs for 2020, using data science techniques to analyze, categorize, and visualize the data. The resulting visualization shows how each category of programs is funded. The dataset “Contains information licensed under the Open Data Licence – The Corporation of the City of Thunder Bay.”

The Data

The dataset contains information about various capital improvement programs, including the type and amount of funds allocated to each program. The funding types include Revenue, Debenture, Reserve, Subsidy, Fees, and Other.

In the chart, each bar represents a category of programs, and the coloured segments within each bar represent the funding types: Revenue (steel blue), Debenture (firebrick), Reserve (dark orange), Subsidy (sea green), Fees (medium purple), and Other (grey).

Initial Visualization

We first visualized the data using a stacked bar chart, with each bar representing a program and the segments of the bar showing the amount of each type of fund. However, due to the large number of programs, the x-axis labels were not clearly visible. To address this issue, we decided to categorize the programs.
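A stacked bar chart of this kind can be sketched as follows. This is a minimal illustration, not the code behind the original chart: the post does not name a plotting library, so matplotlib is an assumption, and the program names and amounts are invented. Only the funding types and colours come from the post.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Funding types and colours as described in this post.
FUND_COLOURS = {
    "Revenue": "steelblue",
    "Debenture": "firebrick",
    "Reserve": "darkorange",
    "Subsidy": "seagreen",
    "Fees": "mediumpurple",
    "Other": "grey",
}

# Hypothetical rows: these names and amounts are made up for illustration.
programs = {
    "Program A": {"Revenue": 100, "Debenture": 50, "Subsidy": 20, "Other": 5},
    "Program B": {"Revenue": 30, "Reserve": 60, "Fees": 10},
    "Program C": {"Debenture": 200, "Subsidy": 80},
}


def plot_stacked(amounts_by_label, path=None):
    """Draw one bar per label, stacking one segment per funding type."""
    labels = list(amounts_by_label)
    bottoms = [0.0] * len(labels)  # running height of each bar so far
    fig, ax = plt.subplots(figsize=(8, 4))
    for fund, colour in FUND_COLOURS.items():
        heights = [amounts_by_label[label].get(fund, 0.0) for label in labels]
        ax.bar(labels, heights, bottom=bottoms, color=colour, label=fund)
        bottoms = [b + h for b, h in zip(bottoms, heights)]
    ax.set_ylabel("Allocated funds")
    ax.legend(title="Funding type")
    ax.tick_params(axis="x", labelrotation=90)  # long labels crowd quickly
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    return fig, ax, bottoms  # bottoms now holds each bar's total


fig, ax, totals = plot_stacked(programs)
```

With only three bars the labels stay readable. With one bar per program, even rotated labels crowd together, which is the problem described above.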

Categorizing the Programs: K-Means Clustering

To categorize the programs, we used a machine learning technique called K-Means clustering. K-Means is an unsupervised learning algorithm that divides data into groups (or clusters) based on their similarity. However, as our data were text descriptions of the programs, we first had to convert these descriptions into a numerical format that the algorithm could process.

This was achieved through a process called TF-IDF vectorization, which quantifies the importance of different words in each description. After vectorizing the descriptions, we applied K-Means clustering to group the programs based on the similarity of their descriptions.
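To make the idea concrete, here is a small sketch of the textbook TF-IDF weighting, written in plain Python. It is an illustration under assumptions rather than the exact computation used for this post: the descriptions are invented, and libraries such as scikit-learn apply a smoothed and normalized variant of the same formula.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical descriptions, invented for illustration.
descriptions = [
    "bridge rehabilitation program",
    "roof replacement program",
    "sewer rehabilitation program",
]
weights = tfidf(descriptions)
```

In this toy example, "program" appears in every description and so receives a weight of zero, while "bridge" appears in only one and receives the highest weight in its description. That is the sense in which TF-IDF measures how important a word is to a given description.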

The algorithm assigned each program to one of 10 clusters, labelled 0 to 9. Programs that share a cluster number have similar descriptions. The numbers themselves are arbitrary identifiers with no inherent meaning, but the most frequent words in each cluster's descriptions give a sense of the types of programs it contains.

Understanding the Categories

By generating a word cloud for each category, we identified the most common words in its descriptions and used them to characterize the category:

Category 0: Street and water-related programs.
Category 1: Infrastructure or facility rehabilitation programs.
Category 2: Road construction or reconstruction programs.
Category 3: Programs related to sewer systems.
Category 4: Building-related programs, possibly construction.
Category 5: Park development or playground programs.
Category 6: Programs related to water systems.
Category 7: Bridge construction or maintenance programs.
Category 8: Rehabilitation programs, possibly related to road surfaces.
Category 9: Building maintenance programs, particularly roof replacements.
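A word cloud sizes each word by how often it occurs, so the counts behind it can be sketched with a simple frequency tally per category. The descriptions, labels, stop-word list, and the function name `top_words_by_category` below are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

# A tiny, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"and", "at", "for", "in", "of", "the", "to"}


def top_words_by_category(descriptions, labels, top_n=2):
    """Return the top_n most frequent words for each category label."""
    counts = defaultdict(Counter)
    for text, label in zip(descriptions, labels):
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        counts[label].update(words)
    return {
        label: [word for word, _ in counter.most_common(top_n)]
        for label, counter in sorted(counts.items())
    }


# Hypothetical descriptions with made-up category labels.
descriptions = [
    "roof replacement at arena",
    "roof replacement at library",
    "bridge deck repair",
    "bridge inspection",
]
labels = [9, 9, 7, 7]
common_words = top_words_by_category(descriptions, labels)
```

Reading the top words for each label is how a numeric cluster identifier can be turned into a description such as those in the list above.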

Revised Visualization

With the programs now categorized, we created a new stacked bar chart, the one shown at the beginning of this post. This time, each bar represents a category of programs rather than a single program. The chart gives a clearer picture of how funds are distributed across different types of programs.
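Moving from one bar per program to one bar per category amounts to summing each funding type within a category. A minimal sketch of that aggregation follows. The record layout, the `category` key, and all amounts are assumptions made for illustration; only the funding types come from the post.

```python
from collections import defaultdict

FUND_TYPES = ["Revenue", "Debenture", "Reserve", "Subsidy", "Fees", "Other"]


def totals_by_category(programs):
    """Sum each funding type over all programs in the same category."""
    totals = defaultdict(lambda: dict.fromkeys(FUND_TYPES, 0.0))
    for program in programs:
        bucket = totals[program["category"]]
        for fund in FUND_TYPES:
            bucket[fund] += program.get(fund, 0.0)
    return dict(sorted(totals.items()))


# Hypothetical programs; names, categories, and amounts are invented.
programs = [
    {"name": "Program A", "category": 7, "Revenue": 100.0, "Debenture": 50.0},
    {"name": "Program B", "category": 7, "Revenue": 25.0, "Subsidy": 10.0},
    {"name": "Program C", "category": 3, "Reserve": 40.0, "Fees": 5.0},
]
category_totals = totals_by_category(programs)
```

Each entry of `category_totals` corresponds to one bar, and each funding type within it to one coloured segment.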

Conclusion

Data science techniques, such as machine learning and data visualization, can provide valuable insights into public data. By categorizing and visualizing the capital improvement programs, we were able to understand the allocation of funds in a more meaningful way.