Group project repository for Advanced Computing
Group members: Emily Chu, Elsie Zhang
This project studies patterns of police misconduct in New York City using two NYC public safety datasets.
We focus on three questions:
- how extreme the highest-complaint officers are relative to the overall distribution,
- which officer groups show the highest complaint burden and substantiation intensity,
- whether precincts with higher crime volume also tend to have more misconduct allegations.
Main app:
https://fuzzy-potato-kmst2vvnvebesjvs2b9kyh.streamlit.app/
This project uses two NYC Open Data datasets.
This dataset provides officer-level information from the CCRB, including officer identifiers, current rank, current command, and complaint-related variables.
We use it for both officer-level and group-level analysis.
Key variables used in this project include:
- current command
- current rank
- total complaints
- total substantiated complaints
- officer demographic and status fields
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/about_data
This dataset contains historical NYPD complaint records related to crime incidents in New York City.
We use it as a precinct-level measure of crime volume and compare it with misconduct allegation patterns.
Key variables used in this project include:
- addr_pct_cd for precinct identification
- offense description and category
- borough
- report date
How extreme are the highest-complaint officers relative to the overall distribution?
We compare the most complained-about officers with the broader complaint distribution across all officers.
This helps show whether the highest-complaint officers are simply above average or whether they are true outliers.
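One way to make "above average" versus "true outlier" concrete is to compare the top officers with the median and with the share of all complaints they account for. The sketch below is illustrative only: the function name `extremeness_summary`, its output keys, and the sample numbers are assumptions made for this example, not the project's actual code or data.

```python
from statistics import mean, median


def extremeness_summary(complaints, top_n=3):
    """Summarize how far the top officers sit from the rest of the distribution.

    `complaints` is a non-empty list of per-officer complaint totals.
    """
    ordered = sorted(complaints, reverse=True)
    top = ordered[:top_n]
    total = sum(ordered)
    mid = median(ordered)
    return {
        "mean": mean(ordered),
        "median": mid,
        "top_values": top,
        # Share of all complaints attributable to the top_n officers.
        "top_share": sum(top) / total if total else 0.0,
        # Ratio of the single highest total to the median officer.
        "max_to_median": ordered[0] / mid if mid else float("inf"),
    }


# Made-up complaint totals for ten hypothetical officers.
sample = [0, 1, 1, 2, 2, 3, 3, 4, 6, 38]
summary = extremeness_summary(sample, top_n=1)
print(summary)
```

A large `max_to_median` together with a large `top_share` suggests the top of the distribution is an outlier rather than just the upper end of a smooth spread.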
Which groups show the highest complaint burden and substantiation intensity?
We group officers by variables such as Current Command or Current Rank and compare:
- Complaint burden = average complaints per officer
- Substantiation intensity = substantiated complaints per 100 complaints
This allows us to identify which groups appear most exposed to complaints and which groups have relatively higher substantiation rates.
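The two definitions above can be sketched in plain Python as follows. The field names `current_rank`, `total_complaints`, and `total_substantiated` are assumed stand-ins for the cleaned columns, and `group_metrics` is a hypothetical helper; the same logic maps onto a grouped aggregation in a dataframe library.

```python
from collections import defaultdict


def group_metrics(officers, group_key):
    """Compute complaint burden and substantiation intensity per group.

    Each officer is a dict holding the grouping field plus
    'total_complaints' and 'total_substantiated' (assumed names).
    """
    totals = defaultdict(lambda: {"officers": 0, "complaints": 0, "substantiated": 0})
    for row in officers:
        group = totals[row[group_key]]
        group["officers"] += 1
        group["complaints"] += row["total_complaints"]
        group["substantiated"] += row["total_substantiated"]

    result = {}
    for name, group in totals.items():
        # Complaint burden = average complaints per officer.
        burden = group["complaints"] / group["officers"]
        # Substantiation intensity = substantiated complaints per 100 complaints.
        # It is undefined for a group with zero complaints, so report None.
        if group["complaints"]:
            intensity = 100 * group["substantiated"] / group["complaints"]
        else:
            intensity = None
        result[name] = {
            "complaint_burden": burden,
            "substantiation_intensity": intensity,
        }
    return result


# Made-up officers for illustration.
officers = [
    {"current_rank": "Police Officer", "total_complaints": 4, "total_substantiated": 1},
    {"current_rank": "Police Officer", "total_complaints": 6, "total_substantiated": 1},
    {"current_rank": "Sergeant", "total_complaints": 10, "total_substantiated": 3},
    {"current_rank": "Captain", "total_complaints": 0, "total_substantiated": 0},
]
metrics = group_metrics(officers, "current_rank")
print(metrics)
```

Returning `None` for groups with no complaints keeps them from appearing as a misleading 0 in a ranked chart of substantiation intensity.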
Is crime volume associated with misconduct allegations across precincts?
We aggregate Dataset 2 into crime counts by precinct and compare those counts with misconduct allegation counts derived from Dataset 1.
This lets us examine whether precincts with more crime also tend to show more misconduct complaints.
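One simple way to quantify that relationship is a Pearson correlation over the precinct-level counts. This is a sketch of one possible measure, not necessarily the one used in the app; `pearson_r` is a hypothetical helper and the counts are made up. A correlation alone does not account for factors such as policing intensity or reporting behavior.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Made-up precinct-level counts, in matching precinct order.
crime_counts = [1200, 3400, 2100, 5600, 800]
misconduct_counts = [15, 42, 30, 51, 9]
r = pearson_r(crime_counts, misconduct_counts)
print(round(r, 3))
```

Values near 1 would indicate that higher-crime precincts also tend to have more allegations, values near 0 little linear relationship.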
Introduces the project motivation, datasets, updated research questions, analytical approach, known unknowns, and anticipated challenges.
Focuses on officer-level complaint distribution.
Main outputs include:
- top officers by total complaints
- summary distribution metrics
- distribution plots showing how extreme high-complaint officers are relative to the full officer population
Focuses on group-level misconduct risk.
Main outputs include:
- a quadrant bubble chart comparing groups on burden and substantiation intensity
- ranked chart of groups by complaint burden
- ranked chart of groups by substantiation intensity
Focuses on the precinct-level relationship between crime volume and misconduct allegations.
Main outputs include:
- crime counts by precinct
- merged precinct-level comparisons between crime volume and misconduct patterns
- Load officer-level complaint data from BigQuery.
- Clean and standardize fields such as current command, current rank, and complaint variables.
- Use the cleaned table for both officer-level and group-level analysis.
- Load the historic NYPD complaint dataset from BigQuery.
- Aggregate records by precinct using addr_pct_cd.
- Construct precinct-level crime count measures.
- Build misconduct counts by precinct from Dataset 1.
- Build crime counts by precinct from Dataset 2.
- Align the precinct identifiers and data types.
- Merge the two aggregated tables for precinct-level comparison.
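The aggregation, alignment, and merge steps above can be sketched in plain Python. Everything here is an assumption for illustration: the helper names are hypothetical, the misconduct rows are simplified to one row per allegation with a `precinct` field, and the merge keeps only precincts present in both tables. Only `addr_pct_cd` comes from Dataset 2 as described.

```python
from collections import Counter


def normalize_precinct(value):
    """Coerce an identifier such as '075', 75.0 or ' 75 ' to int; None if unusable."""
    try:
        return int(float(str(value).strip()))
    except (ValueError, OverflowError):
        return None


def count_by_precinct(rows, key):
    """Count rows per normalized precinct, skipping rows without a usable id."""
    counts = Counter()
    for row in rows:
        precinct = normalize_precinct(row.get(key))
        if precinct is not None:
            counts[precinct] += 1
    return counts


def merge_counts(crime, misconduct):
    """Inner join: keep only precincts that appear in both count tables."""
    shared = sorted(crime.keys() & misconduct.keys())
    return [
        {"precinct": p, "crime_count": crime[p], "misconduct_count": misconduct[p]}
        for p in shared
    ]


# Made-up rows; the identifier types differ on purpose.
crime_rows = [
    {"addr_pct_cd": 75.0},
    {"addr_pct_cd": "75"},
    {"addr_pct_cd": 44},
    {"addr_pct_cd": None},
]
misconduct_rows = [
    {"precinct": "075"},
    {"precinct": "075"},
    {"precinct": "044"},
    {"precinct": "120"},
]

crime = count_by_precinct(crime_rows, "addr_pct_cd")
misconduct = count_by_precinct(misconduct_rows, "precinct")
merged = merge_counts(crime, misconduct)
print(merged)
```

Normalizing both sides to integers before joining avoids silent mismatches such as "075" versus 75.0, which would otherwise drop precincts from the merged table.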
For local ingestion and setup, authentication should be done with a user account following the course guidance for pandas-gbq / Google Cloud local authentication.
For the deployed Streamlit app, BigQuery access uses a service account stored in Streamlit secrets.
Important:
- .streamlit/secrets.toml must not be committed to GitHub
- it should be included in .gitignore
- Dataset 1 supports officer-level and group-level misconduct analysis.
- Dataset 2 supports precinct-level crime aggregation.
- Together, the datasets allow comparison across officer, group, and precinct levels.
- Whether the highest-complaint officers are extreme relative to the full officer distribution.
- Which commands or ranks consistently show the highest burden and highest substantiation intensity.
- Whether the precinct-level relationship between crime and misconduct is strong, weak, or inconsistent.
- How much external factors such as policing intensity, reporting behavior, or neighborhood context affect these results.
- Data cleaning and standardization: Command names, group labels, and complaint-related variables may require cleaning and alignment.
- Interpretation: Complaint burden and substantiation intensity measure different things, so both need to be interpreted carefully.
- Precinct matching: Precinct identifiers differ across datasets and must be aligned before merging.
- Measurement bias: Higher police activity or higher reporting rates may mechanically increase complaint counts.
- Performance: Large datasets can slow down loading, filtering, and visualization if not handled efficiently.
This repository contains the Streamlit app, ingestion scripts, helper functions, and project materials for our Advanced Computing group project.
This project uses Google BigQuery in two contexts:
The scripts ingest_dataset1.py and ingest_dataset2.py upload data to BigQuery using a Google Cloud service account.
For local runs, authentication is provided through:
export GOOGLE_APPLICATION_CREDENTIALS="/Users/elsiezhang/Downloads/fuzzy-potato-491318-52c0aaab0477.json"