November 29, Wednesday

After categorizing offenses into tiers based on their severity, I shifted my focus to tier 3, which covers crimes such as larceny and robbery. To predict these tier 3 crimes, I researched forecasting models and chose ARIMA (AutoRegressive Integrated Moving Average), a widely used time series model that is simple to fit, applies to a broad range of series, and captures temporal patterns such as trend and autocorrelation. I implemented the ARIMA model successfully, but the output was not straightforward to read and needs further interpretation. I also plan to explore other models and compare which one yields the best results.

November 27, Monday

Because the dataset covers a large number of distinct offenses, working with them individually is unwieldy. After discussing this within our group, we decided to categorize the offenses into tiers based on the level of violence associated with the crimes.

Tier List:

Tier 1 Violent Crimes: Murder, Manslaughter, Rape
Related Codes: 111, 123, 121, 244, 241, 243, 251, 261, 252, 253, 271, 254, 242

Tier 2 Serious Offenses: Arson (900-930), Aggravated Assault/Battery (401-433)
Related Codes: 900-930; 401-433; 802, 423, 413, 801

Tier 3 Non-Violent Crimes: Larceny, Robbery (301-380), Burglary, Breaking and Entering (510-547)
Related Codes: 612, 613, 615, 617, 614, 618, 616; 301, 311, 351, 361, 371, 381; 541, 540, 562, 561, 542, 521, 520, 522, 560

Tier Drugs: 1874, 1842, 1841, 1849, 1848, 1858, 1855, 1864, 1863, 1866, 1868, 1843, 3023, 3021, 1875, 3022, 1847, 1840, 1873, 1848, 1850

Tier 4: Property-Related Offenses: Vandalism (1402, 1415)
Vehicular Accidents (3801, 3802, 3803, 3810, 3807, 3805)

With these tiers in place, I plan to forecast the number of crimes in each district.
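One way the tier list could be encoded for use in later analysis is a simple lookup from offense code to tier. This is a sketch: the names `TIER_CODES` and `tier_for` are hypothetical, the sets are taken from the "Related Codes" lines above, and any code not listed falls into an assumed "Other" bucket.

```python
# Offense codes per tier, copied from the tier list above.
TIER_CODES = {
    "Tier 1": {111, 123, 121, 244, 241, 243, 251, 261, 252, 253, 271, 254, 242},
    # Ranges 900-930 and 401-433 are inclusive; 423 and 413 already fall inside.
    "Tier 2": set(range(900, 931)) | set(range(401, 434)) | {801, 802},
    "Tier 3": {612, 613, 615, 617, 614, 618, 616,
               301, 311, 351, 361, 371, 381,
               541, 540, 562, 561, 542, 521, 520, 522, 560},
    "Drugs": {1874, 1842, 1841, 1849, 1848, 1858, 1855, 1864, 1863, 1866, 1868,
              1843, 3023, 3021, 1875, 3022, 1847, 1840, 1873, 1850},
    "Tier 4": {1402, 1415, 3801, 3802, 3803, 3810, 3807, 3805},
}

# Invert to a flat code -> tier dictionary for fast lookups.
CODE_TO_TIER = {code: tier for tier, codes in TIER_CODES.items() for code in codes}


def tier_for(offense_code):
    """Return the tier label for an offense code, or 'Other' if unlisted."""
    return CODE_TO_TIER.get(int(offense_code), "Other")


print(tier_for(613), tier_for(915), tier_for(99999))
```

Applied to a pandas column, the same lookup would be `df["OFFENSE_CODE"].map(tier_for)`, after which incidents can be counted per tier and district.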

November 24, Friday

The initial analysis showed that shooting incidents occurred predominantly on Saturday nights. This prompted me to check whether the hour of the day is related to the nature of the offense, that is, whether certain crimes are more prevalent at specific times. The correlation coefficient between ‘Hour’ and ‘OFFENSE_CODE’ was approximately -0.008032, which is extremely weak and indicates little to no linear relationship between the time of day and the offense code. Two caveats apply: offense codes are category labels rather than quantities, so a linear correlation can miss time-of-day patterns for individual offense types, and correlation does not imply causation.
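The sketch below shows the correlation computed on a tiny made-up sample, followed by a cross-tabulation that may suit a categorical code better. The column names follow this log; the `part_of_day` bins and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up sample rows (NOT taken from the dataset).
df = pd.DataFrame({
    "Hour": [1, 2, 8, 13, 14, 20, 22, 23],
    "OFFENSE_CODE": [3115, 613, 613, 3801, 413, 413, 3115, 613],
})

# Pearson correlation treats the offense code as if it were a quantity.
pearson_r = df["Hour"].corr(df["OFFENSE_CODE"])

# Alternative: bucket the hour and look at the share of each code per bucket.
df["part_of_day"] = pd.cut(
    df["Hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=["night", "morning", "afternoon", "evening"],
).astype(str)
shares = pd.crosstab(df["part_of_day"], df["OFFENSE_CODE"], normalize="index")

print(round(pearson_r, 3))
print(shares)
```

Each row of `shares` sums to 1, so differences between rows show which offense codes are over-represented at a given time of day.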

Subsequently, I delved into identifying streets with the highest reported crimes, and a screenshot of the findings is attached.

Considering this information, I weighed the feasibility of using historical data to build predictive models of crime occurrence. I also plan to explore predicting specific types of crimes or the likelihood of incidents in particular locations.

 

November 22, Wednesday

Today, I analyzed the distribution of crime across districts. The dataset uses district codes rather than district names, so some research was needed to map each code to its district. The distribution analysis showed that districts B2, C11, and D4 (Roxbury, Dorchester, and South End, respectively) have a higher incidence of reported crime. Conversely, district A15 (Charlestown) has the lowest number of reported crimes.

In examining the frequency of shooting incidents and their geographical distribution across districts, it was found that B2, C11, and B3 (Mattapan) have the highest reported number of such incidents. Further analysis unveiled that the majority of shooting incidents occurred on Saturday nights.
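A rough sketch of the district and shooting counts described above, on made-up rows. The district names come from this entry; the column names and the 0/1 encoding of SHOOTING are assumptions, and the real files may encode shootings differently.

```python
import pandas as pd

# District codes mentioned in this entry, mapped to their names.
DISTRICT_NAMES = {
    "B2": "Roxbury", "C11": "Dorchester", "D4": "South End",
    "A15": "Charlestown", "B3": "Mattapan",
}

# Made-up sample rows (NOT taken from the dataset).
df = pd.DataFrame({
    "DISTRICT": ["B2", "B2", "C11", "D4", "A15", "B3", "B2"],
    "SHOOTING": [1, 0, 1, 0, 0, 1, 1],
    "DAY_OF_WEEK": ["Saturday", "Monday", "Saturday", "Friday",
                    "Sunday", "Saturday", "Sunday"],
    "HOUR": [23, 10, 22, 15, 9, 23, 1],
})

# Incidents per district, most frequent first.
incidents_by_district = df["DISTRICT"].map(DISTRICT_NAMES).value_counts()

# Shooting incidents grouped by day of week and hour.
shootings = df[df["SHOOTING"] == 1]
shootings_by_day_hour = (
    shootings.groupby(["DAY_OF_WEEK", "HOUR"]).size().sort_values(ascending=False)
)

print(incidents_by_district)
print(shootings_by_day_hour)
```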

 

November 20, Monday

Following the initial descriptive analysis, today I conducted a correlation analysis on the numerical variables. The results showed a positive correlation of 0.025 between year and offense code, which is close to zero and points to at most a very slight upward drift in offense codes over the years. To look at this more closely, I examined the number of reported crimes in each year. 2017 recorded the highest number of reported crimes, followed by 2016 and 2018. After that, reported crimes decreased gradually, with a slight uptick in 2022.

Upon conducting a more in-depth analysis of the offense code groups, it was revealed that the most commonly reported group is “Motor Vehicle Accident Response,” registering a frequency of 41,064, followed by “Larceny” with a frequency of 29,000. Conversely, the least common offense code group is “HUMAN TRAFFICKING – INVOLUNTARY SERVITUDE,” documented only twice.

November 17, Friday

This dataset comprises multiple files of crime data spanning from 2015 to 2023. I consolidated these files into a single combined dataset. Upon conducting descriptive analysis, it became evident that certain columns (e.g., SHOOTING, UCR_PART, OFFENSE_CODE_GROUP) exhibit a notably high count of missing values, suggesting the need for imputation or alternative strategies for addressing these gaps.
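The consolidation and missing-value check could look roughly like the sketch below. Two in-memory CSVs stand in for the yearly files, and their contents and the INCIDENT_NUMBER column name are assumptions made only so the example runs on its own.

```python
import io

import pandas as pd

# In-memory stand-ins for two of the yearly files (made-up rows).
file_2015 = io.StringIO(
    "INCIDENT_NUMBER,YEAR,SHOOTING,UCR_PART\n"
    "I1,2015,,Part One\n"
    "I2,2015,Y,Part Two\n"
)
file_2023 = io.StringIO(
    "INCIDENT_NUMBER,YEAR,SHOOTING,UCR_PART\n"
    "I3,2023,0,\n"
    "I4,2023,1,\n"
)

# Read each file and stack them into one frame with a fresh index.
frames = [pd.read_csv(f) for f in (file_2015, file_2023)]
combined = pd.concat(frames, ignore_index=True)

# Share of missing values per column, largest first.
missing_share = combined.isna().mean().sort_values(ascending=False)
print(missing_share)
```

With files on disk, the list comprehension would iterate over file paths instead. A column whose missing share jumps between years often signals a change in how the field was recorded rather than truly missing data.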

The temporal details (YEAR, MONTH, DAY_OF_WEEK, HOUR) present in the dataset offer an opportunity to analyze trends and patterns over time. Concurrently, the spatial information (DISTRICT, Lat, Long, STREET, Location) provides the basis for exploring the geographical distribution of incidents.

Further exploration and cleaning of the dataset may be imperative, depending on the specific goals of the analysis. These insights form a foundational understanding for subsequent data exploration and analysis, allowing for the extraction of meaningful information from the dataset.

November 15, Wednesday

For Project 3, we have the flexibility to choose datasets from the Analyze Boston site, and we’ve opted for the CRIME INCIDENT REPORTS (AUGUST 2015 – TO DATE) (SOURCE: NEW SYSTEM) dataset.

These reports, supplied by the Boston Police Department (BPD), serve to document the initial details of incidents to which BPD officers respond. The dataset originates from the new crime incident report system, featuring a streamlined set of fields concentrating on capturing incident types, along with details about when and where they occurred. The records in this system commence from June of 2015.

Key attributes in the dataset include:

  • [incident_num]: Internal BPD report number
  • [offense_code]: Numerical code corresponding to offense description
  • [Offense_Code_Group_Description]: Internal categorization of [offense_description]
  • [Offense_Description]: Primary descriptor of the incident
  • [district]: District where the crime was reported
  • [reporting_area] varchar NULL: RA number associated with where the crime was reported
  • [shooting] char NULL: Indicates whether a shooting took place
  • [occurred_on] datetime2 NULL: Earliest date and time the incident could have taken place
  • [UCR_Part] varchar NULL: Uniform Crime Reporting (UCR) Part number (1, 2, 3)
  • [street] varchar NULL: Street name where the incident took place

November 13, Monday – Project 3 Day 1

For this project I plan to understand the “Economic Indicators” dataset from Analyze Boston, the City of Boston’s open data hub. My goal is to conduct a comprehensive analysis to extract insights that can significantly contribute to informed policy-making decisions, leveraging the economic data tracked by the Boston Planning and Development Authority (BPDA) between January 2013 and December 2019.

November 8, Wednesday

Today, I started work on the project report by consolidating all the analyses conducted thus far on the dataset. The analysis covered the following:

  1. Visualized the distribution of victims’ age, race, and gender.
  2. Identified the top regions with the highest number of incidents.
  3. Determined the top agencies associated with the highest number of incidents.
  4. Explored the distribution of arms in fatal police shootings.
  5. Examined the average age by race.
  6. Provided frequency counts and percentages for various variables, including ‘threat_type,’ ‘flee_status,’ ‘armed_with,’ ‘race,’ and ‘was_mental_illness_related.’
  7. Conducted a temporal analysis to understand if the number of incidents increases or decreases over time.
  8. Utilized clustering to map incidents geographically.
  9. Investigated the correlation between age and armed status.
  10. Explored the distribution of weapons across different age groups.
  11. Performed logistic regression for predicting flee status.

I also had a meeting with my group members to discuss our findings.

November 6, Monday

Today, I conducted a logistic regression analysis to predict flee status, using features such as age, gender (male/female), vehicle presence, armed or unarmed status, and race (black/white). The target variable was ‘flee_status_not,’ where 1 indicates that the person did not flee and 0 indicates that they fled. Here are the key observations from the results:

  • Accuracy: The model demonstrated an accuracy of approximately 70%, signifying its ability to correctly predict ‘flee_status_not’ for 70% of the samples in the test set.
  • Precision: Of the instances predicted as ‘flee_status_not,’ 72% were genuinely ‘flee_status_not.’ In other words, when the model predicts that an individual did not flee, it is correct 72% of the time.
  • Recall: Out of all the actual ‘flee_status_not’ instances, the model correctly identified 94%.
  • F1-score: The F1-score, representing the harmonic mean of precision and recall, is a useful metric for balancing these two measures. With an F1-score of 0.81, the model demonstrates a reasonably good balance between precision and recall for predicting the ‘flee_status_not’ class.
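For reference, the four metrics above can be computed from confusion-matrix counts as in the sketch below. The counts passed to `classification_metrics` are hypothetical and chosen only to keep the arithmetic easy; the last lines recompute F1 from the rounded precision and recall reported above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not the model's actual confusion matrix.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# F1 recomputed from the rounded figures in the list above.
reported_f1 = f1_from(0.72, 0.94)
print(round(reported_f1, 3))
```

The recomputed value is about 0.815, which agrees with the reported 0.81 given that precision and recall are themselves rounded.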

 

November 3, Friday

I conducted a correlation analysis to explore the relationship between age and armed status. The correlation was weak, but it prompted me to check whether specific age groups show a higher prevalence of particular weapons, such as guns. To investigate this, I used seaborn’s barplot function to generate a stacked bar plot in which each age group’s bar is subdivided into segments representing the counts of different weapon categories. I focused on three armed statuses (gun, knife, and unarmed) because they dominate the dataset.

A screenshot of the result is attached below. The age group 19-30 shows a higher incidence of individuals carrying guns or knives than the other age groups. The plot also highlights instances of underage possession of guns.
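The counts behind a plot like this can be sketched as follows. Note that this example uses pandas' own plotting (`DataFrame.plot` with `stacked=True`) rather than seaborn, because it stacks segments directly. The sample rows and the age bins are made up for illustration.

```python
import pandas as pd

# Made-up sample rows (NOT taken from the dataset).
df = pd.DataFrame({
    "age": [16, 22, 25, 29, 34, 41, 55, 27],
    "armed_with": ["gun", "gun", "knife", "gun",
                   "unarmed", "knife", "gun", "unarmed"],
})

# Assumed age bins; pd.cut uses right-closed intervals, e.g. (18, 30].
bins = [0, 18, 30, 45, 60, 120]
labels = ["0-18", "19-30", "31-45", "46-60", "61+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels).astype(str)

# Rows are age groups, columns are weapon categories, cells are counts.
counts = pd.crosstab(df["age_group"], df["armed_with"])
print(counts)


def plot_stacked(counts):
    """Draw one bar per age group, stacked by weapon category (needs matplotlib)."""
    ax = counts.plot(kind="bar", stacked=True)
    ax.set_xlabel("Age group")
    ax.set_ylabel("Number of incidents")
    return ax
```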

November 1, 2023 (Wednesday)

Previously, I transformed categorical data into numeric format for analytical purposes. During this process, I pondered whether there might be a connection between age and armed status. Consequently, I conducted a correlation analysis specifically focusing on the armed statuses: armed with a knife, armed with a gun, and unarmed, as they exhibit high occurrence in the dataset. The obtained correlation coefficients are as follows:

  • Armed with a Knife: -0.011807: This value is very close to zero, indicating an extremely weak or negligible correlation between age and the likelihood of being armed with a knife. The negative sign suggests a slight negative correlation, but its magnitude is minimal.
  • Armed with a Gun: 0.088604: This positive value points to a weak positive correlation between age and the likelihood of being armed with a gun. However, the correlation is relatively modest, implying that age and being armed with a gun are not strongly linearly associated.
  • Unarmed: -0.128284: This negative value suggests a weak negative correlation between age and the likelihood of being unarmed. It is the largest of the three in magnitude, but the correlation is still relatively small.

These values only capture linear relationships. Other factors and nonlinear relationships may also play a role in understanding the data.
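A minimal sketch of how such coefficients can be obtained: each armed status becomes a 0/1 indicator column, and each indicator is correlated with age. The rows below are made up, so the resulting numbers are unrelated to the coefficients listed above.

```python
import pandas as pd

# Made-up sample rows (NOT taken from the dataset).
df = pd.DataFrame({
    "age": [16, 22, 25, 29, 34, 41, 55, 27],
    "armed_with": ["gun", "gun", "knife", "gun",
                   "unarmed", "knife", "gun", "unarmed"],
})

# One 0/1 indicator column per armed status.
indicators = pd.get_dummies(df["armed_with"]).astype(int)

# Pearson correlation of each indicator with age.
correlations = indicators.corrwith(df["age"])
print(correlations.round(3))
```

A Pearson correlation between a numeric variable and a 0/1 indicator is also known as the point-biserial correlation, so it measures how much the mean age differs between the two groups.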