October 30, Monday

I began by converting categorical data into numerical format in order to facilitate clustering techniques. The initial step involved applying DBSCAN clustering based on the ‘latitude’, ‘longitude’, and ‘total_shootings’ columns from the dataset. This enabled the visualization of clusters on a map of the USA using the folium library, making it possible to observe shooting incidents that occurred in close geographic proximity.
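A minimal sketch of this step, assuming the three columns are standardized before clustering. The rows, the `eps` and `min_samples` values, and the helper name `build_cluster_map` are illustrative assumptions, not the values or code used in the actual analysis.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only; these are not values from the real dataset.
points = pd.DataFrame({
    "latitude": [34.05, 34.06, 34.04, 40.71, 40.72, 40.70, 61.22],
    "longitude": [-118.24, -118.25, -118.23, -74.00, -74.01, -73.99, -149.90],
    "total_shootings": [5, 6, 5, 3, 3, 4, 1],
})

# Standardize the columns so degrees and counts are on a comparable scale,
# then let DBSCAN group nearby points. The label -1 marks noise.
scaled = StandardScaler().fit_transform(
    points[["latitude", "longitude", "total_shootings"]]
)
points["cluster"] = DBSCAN(eps=1.0, min_samples=2).fit_predict(scaled)


def build_cluster_map(frame):
    """Draw one circle per row on a folium map, colored by cluster label."""
    import folium  # imported here so the clustering above runs without folium

    palette = ["blue", "red", "green", "purple", "orange"]
    usa_map = folium.Map(location=[39.8, -98.6], zoom_start=4)
    for row in frame.itertuples():
        label = int(row.cluster)
        color = "gray" if label == -1 else palette[label % len(palette)]
        folium.CircleMarker(
            location=[float(row.latitude), float(row.longitude)],
            radius=6,
            color=color,
            fill=True,
        ).add_to(usa_map)
    return usa_map
```

Calling `build_cluster_map(points).save("clusters.html")` would write the map to an HTML file that can be opened in a browser.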

Next, I explored clustering methods that could extract meaningful insights from the dataset. I chose K-Means clustering because it does not require a target variable, which fits the unsupervised nature of the problem. I then split the data into training and testing sets, trained the K-Means model on the training data, and used the fitted model to predict cluster labels for the testing data. The resulting cluster assignments still need further analysis and evaluation.
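The workflow above might look roughly like the following. The data here is synthetic, and the number of clusters, the split ratio, and the silhouette score used as a first evaluation are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic points around two centers; a stand-in for the real features.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Hold out 20% of the rows, fit K-Means on the rest.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

# Assign each held-out row to the nearest learned centroid.
test_labels = model.predict(X_test)

# One possible evaluation: silhouette score on the training assignments
# (closer to 1 means tighter, better separated clusters).
train_silhouette = silhouette_score(X_train, model.labels_)
```

Since K-Means has no target variable, the held-out set here only checks how new rows are assigned; it does not measure accuracy in the supervised sense.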

October 27, Friday

Upon analyzing the data related to fatal police shootings over the span of several years, I noted that in 2015, there were nearly 1,000 incidents, with a dip to 960 in 2016. Subsequently, the numbers exhibited a gradual increase, reaching 980 and 990 in 2017 and 2018, respectively. This upward trend persisted, nearly hitting 1,000 again in 2019. The year 2020 saw the number of shootings exceed 1,000, and it peaked in 2021, surpassing 1,040 fatal police shootings. Nevertheless, in 2022, the count decreased once more to 1,000.

October 25, Wednesday

Earlier, I computed frequency counts for categorical variables such as “threat_type,” “armed_with,” “was_mental_illness_related,” and “flee_status.” Today, I extended my exploration to the gender distribution within the dataset. Although males are more commonly involved, incidents with female victims are also noteworthy in number. Additionally, I discovered supplementary data on the Washington Post site, and I intend to incorporate it for further temporal analysis.

Understanding features (October 23, Monday)

Today, to gain deeper insights into the dataset, I calculated frequency counts for categorical variables, including “threat_type,” “armed_with,” “was_mental_illness_related,” and “flee_status.”

Here are my key observations:

  1. Regarding the “threat_type” category, it is evident that “shoot” and “threat” represent the most prevalent threat types, whereas “point,” “attack,” and “move” are comparatively less common.
  2. When examining whether victims were armed or not, it is apparent that “gun” and “knife” emerge as the predominant weapons used in these incidents. Additionally, a significant number of cases involve unarmed victims.
  3. With respect to flee status, most cases do not involve fleeing (“not”).
  4. An analysis of the “was_mental_illness_related” category reveals that the majority of cases, approximately 79%, do not report any indication of mental illness (“False”). However, a substantial portion, around 20%, does report mental illness (“True”).

These observations might be useful in decision-making, resource allocation, and the development of policies in the relevant fields.
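Frequency counts and percentages like those above can be produced with `value_counts`. This is a small sketch on made-up rows; the column names are assumed to match the dataset, and the dictionary name `frequency_tables` is only illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
records = pd.DataFrame({
    "threat_type": ["shoot", "threat", "shoot", "point", "attack"],
    "armed_with": ["gun", "knife", "gun", "unarmed", "gun"],
    "was_mental_illness_related": [False, False, True, False, False],
    "flee_status": ["not", "not", "car", "foot", "not"],
})

categorical_columns = [
    "threat_type",
    "armed_with",
    "was_mental_illness_related",
    "flee_status",
]

# One small table per column: raw counts next to the share of all rows.
frequency_tables = {
    column: pd.DataFrame({
        "count": records[column].value_counts(),
        "share": records[column].value_counts(normalize=True),
    })
    for column in categorical_columns
}
```

For example, `frequency_tables["armed_with"]` lists each weapon category with its count and its fraction of all incidents.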

October 20, Friday

Today, I delved into additional dataset features. I initiated my analysis by identifying the state with the highest number of incidents. My findings revealed that California had the most incidents, totaling 344, followed by Texas with 192 incidents, and Florida with 126 incidents.

Additionally, I explored which agencies were involved in the most incidents, pinpointing agency ID 38 as having the highest number of incidents. To gain deeper insights into the dataset, I visualized the ‘armed’ feature, focusing on understanding how many victims were armed and the types of weapons they possessed. The visualization unveiled that over 1,200 victims were armed with guns, around 400 with knives, and roughly 200 were unarmed.

Furthermore, I calculated the average ages of all victims across different racial groups. You can find a screenshot of this analysis below:
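For reference, the per-state counts and per-group average ages described in this entry could be computed along these lines. The rows and the single-letter race codes are placeholders, not the real figures.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
victims = pd.DataFrame({
    "state": ["CA", "CA", "TX", "FL", "CA", "TX"],
    "race": ["W", "B", "H", "W", "B", "W"],
    "age": [40, 30, 25, 50, 34, 45],
})

# States ranked by number of incidents, keeping the top three.
top_states = victims["state"].value_counts().head(3)

# Average victim age within each racial group, lowest first.
mean_age_by_race = victims.groupby("race")["age"].mean().sort_values()
```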

October 18, Wednesday

I attempted a temporal analysis of the dataset, which initially pointed to 2017 as the year with the highest number of shootings. However, I noticed a discrepancy between the data and the output of my Python code that calls for further investigation: the code reports ‘0’ shootings for 2023, even though the dataset contains records for that year.
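One way to cross-check yearly counts is to parse the date column explicitly and count rows per year, as sketched below. The column name `date` and the ISO date format are assumptions, the rows are made up, and this is not a diagnosis of the discrepancy itself.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
incidents = pd.DataFrame({
    "date": ["2015-01-02", "2015-03-14", "2017-06-30", "2023-10-07"],
})

# Parse the strings into datetimes; a wrong format raises instead of
# silently producing missing values.
incidents["date"] = pd.to_datetime(incidents["date"], format="%Y-%m-%d")

# Count incidents per calendar year, in chronological order.
shootings_per_year = incidents["date"].dt.year.value_counts().sort_index()
```

Comparing `shootings_per_year` against a direct filter on the raw date strings for a single year is a quick way to see whether the counting code or the data is at fault.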

To gain a better understanding of the dataset’s demographics, I also visualized the gender feature. The visualization revealed a notable disparity in fatalities, with nearly 2,000 male victims as opposed to fewer than 250 female victims. As I continue my analysis, I plan to explore additional features to gain a more comprehensive understanding of the dataset.

October 16, Monday

Upon reviewing the dataset, a few questions came to mind. Firstly, I wondered whether there has been an increase or decrease in the number of fatal police shootings over the years. Secondly, I pondered which geographical areas experience the highest incidence of fatal police shootings, and lastly, I was curious about which racial group exhibits the lowest average age. I intend to share these queries with my team and conduct a temporal analysis to delve deeper into the first question I raised.

Understanding Dataset (October 13, Friday)

The dataset contains information on fatal police shootings. The data is divided into two parts: one with details about the victims and incidents, located in the “/v2/fatal-police-shootings-data.csv” file, and another containing data about police agencies involved in at least one fatal police shooting since 2015, found in the “/v2/fatal-police-shootings-agencies.csv” file. I combined these two CSV files using the “agency_ids” value as a reference and removed any rows with missing data (NaN values).
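A rough sketch of such a merge is shown below on small stand-in tables. It assumes that “agency_ids” can hold several ids separated by semicolons and that the agencies table is keyed by an `id` column; both are assumptions for illustration, as are the other column names and values.

```python
import pandas as pd

# Stand-in tables for illustration only; not the real CSV contents.
shootings = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Person A", "Person B", None],
    "agency_ids": ["38", "38;52", "7"],
})
agencies = pd.DataFrame({
    "id": [38, 52, 7],
    "agency_name": ["Agency X", "Agency Y", "Agency Z"],
})

# Assumption: one incident can list several agencies, so split the ids
# and give each (incident, agency) pair its own row.
pairs = shootings.assign(
    agency_id=shootings["agency_ids"].str.split(";")
).explode("agency_id")
pairs["agency_id"] = pairs["agency_id"].astype(int)

# Attach agency details, then drop rows with any missing value.
merged = pairs.merge(
    agencies,
    left_on="agency_id",
    right_on="id",
    how="left",
    suffixes=("", "_agency"),
)
merged = merged.dropna()
```

Note that dropping every row with a NaN can remove a large share of the data, so it is worth checking the row count before and after.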

After a discussion with one of my group members, I created a histogram to gain a better understanding of the dataset. The histogram revealed that the number of incidents involving white individuals was the highest, followed by Black, Hispanic, Native American, and Asian individuals. This information provides insights into the distribution of fatal police shootings across different racial groups.

Project 2 Day 1 (October 11, Wednesday)

The second project involves analyzing data from the Washington Post data repository on fatal police shootings in the United States. To gain a preliminary understanding of the data and its attributes, I executed basic commands such as “describe()” and “info()”. So far there are 8,770 data points, covering 2015-01-02 to 2023-10-07. I am currently in the process of understanding the dataset and exploring its potential analyses and applications.

Attaching a screenshot of the results:

Project update (October 4, Wednesday)

The issues I have analyzed so far are:

  1. Disparity in food access between rural and urban areas of the USA, and how it affects the percentage of diabetics.
  2. Disparity in physical inactivity between rural and urban areas of the USA, and how it affects the percentage of diabetics.
  3. Disparity in obesity between rural and urban areas of the USA, and how it affects the percentage of diabetics.

As mentioned in the previous blog post, I have run linear regression, multiple regression, and t-tests, used the Monte Carlo method, and implemented cross-validation on the dataset.

I’m currently documenting the results and methodologies in my report. 

October 2, Monday

While working on my report today, I realized that a t-test was needed on the independent variables within the urban-rural dataset. Specifically, I examined the obesity data in the urban-rural context and found that the t-statistic is -13.52903571504057 and the p-value is 1.4137452385540387e-40. The negative t-statistic indicates that the first group (rural) has the lower sample mean, and the very small p-value suggests that the observed difference is unlikely to be due to random variability alone and likely reflects a real difference between the groups.

Similarly, when analyzing the inactivity data, I observed a T-statistic of -7.488472092334628, and the associated p-value is 9.002823630475274e-14. 
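A two-sample t-test of this kind can be run with SciPy, as in the sketch below. The samples are synthetic, and the use of Welch's variant (`equal_var=False`) is an assumption; the entry does not say which variant produced the figures above.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for the rural and urban values.
rng = np.random.default_rng(0)
rural = rng.normal(loc=30.0, scale=2.0, size=200)
urban = rng.normal(loc=32.0, scale=2.0, size=200)

# Welch's t-test: does not assume equal variances in the two groups.
# A negative statistic means the first sample has the lower mean.
t_stat, p_value = stats.ttest_ind(rural, urban, equal_var=False)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3e}")
```

With the first argument holding the rural values, the sign of `t_stat` can be read the same way as in the results above.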

As of now, I have run linear regression, multiple regression, and t-tests, used the Monte Carlo method, and implemented cross-validation on the dataset. The outcomes and any noteworthy challenges encountered during these analyses have been documented in the report.