We have compiled our findings and are in the final stages of preparing our Project 3 report, which we aim to finalize by the end of today.
Today, I commenced work on the project report by consolidating all the analyses conducted thus far on the dataset. The following aspects were addressed during the analysis:
- Examining trends in crime rates across different years.
- Analyzing the geographical distribution of crime in various districts.
- Identifying the specific offense with the highest number of occurrences.
- Locating streets characterized by the highest reported incidents of crime.
- Forecasting Tier 3 crimes for future analysis.
- Investigating the correlation between household incomes and crime rates in various locations in and around Boston.
- Exploring the correlation between poverty rates and crime rates in different locations in and around Boston.
I also had a meeting with my group members to discuss our findings.
In my investigation today, I explored the relationship between poverty rates and crime rates in different areas. The calculated correlation coefficient, 0.782002, points to a strong positive correlation between crime rates and poverty rates.
This positive correlation suggests that as poverty rates rise, so do crime rates, and conversely, a decrease in poverty rates tends to coincide with a decrease in crime rates. The closer the coefficient is to 1, the more pronounced this positive correlation becomes. Visualizing the data reinforces this observation, clearly illustrating that areas with higher poverty rates also tend to exhibit higher crime rates.
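A correlation like this can be computed directly with numpy; the values below are illustrative placeholders, not the project's actual poverty or crime figures:

```python
import numpy as np

# Placeholder values (not real data): poverty rate (%) and
# crime rate per 1,000 residents for five hypothetical areas.
poverty_rate = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
crime_rate = np.array([20.0, 35.0, 50.0, 70.0, 90.0])

# Pearson correlation coefficient between the two series.
r = np.corrcoef(poverty_rate, crime_rate)[0, 1]
```

With real data the same call applies to the two merged columns, e.g. `df["poverty_rate"].corr(df["crime_rate"])` in pandas.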
Opting to center my analysis on household incomes and poverty rates as external factors, I aimed to uncover their correlation with crime rates across various areas. Beginning with household incomes, the dataset columns include district, median income, total households, and percentage distributions across income brackets, spanning from $14,999 and under to $150,000 and above. For our analysis, the focus was on the median income of each area.
To investigate the correlation between median household income and crime rates in various areas, I first had to extract crime rates from the first dataset and manually map district names to district codes so the two datasets could be merged. Subsequently, I calculated the correlation coefficient between median income and crime rate for each area, yielding the following result and interpretation:
- The negative correlation coefficient suggests an inverse relationship between median income and crime rate.
- As median income increases, the crime rate tends to decrease, and conversely, as median income decreases, the crime rate tends to increase.
- The proximity of the coefficient to -1 indicates a stronger negative correlation.
Upon visualizing the trend, a clear pattern emerged, illustrating that areas with higher incomes consistently exhibit lower crime rates.
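The extract-map-merge step described above can be sketched with pandas. The district codes are real BPD codes, but every income and crime figure below is a made-up placeholder:

```python
import pandas as pd

# Hypothetical crime rates per district code (placeholder numbers).
crime = pd.DataFrame({"DISTRICT": ["B2", "C11", "D4", "A15"],
                      "crime_rate": [95.0, 88.0, 76.0, 21.0]})

# Manual mapping of district codes to district names for merging.
code_to_name = {"B2": "Roxbury", "C11": "Dorchester",
                "D4": "South End", "A15": "Charlestown"}
crime["district_name"] = crime["DISTRICT"].map(code_to_name)

# Hypothetical median household incomes by district (placeholders).
income = pd.DataFrame({"district_name": ["Roxbury", "Dorchester",
                                         "South End", "Charlestown"],
                       "median_income": [35000, 45000, 90000, 110000]})

# Merge the two sources and compute the Pearson correlation.
merged = crime.merge(income, on="district_name")
r = merged["median_income"].corr(merged["crime_rate"])
```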
Exploring the dynamics of crime rates in a specific area requires a consideration of diverse influential factors, encompassing socioeconomic conditions, demographics, and more. In our endeavor to attain a thorough comprehension, we scrutinized additional datasets, with a notable focus on the “BOSTON NEIGHBORHOOD DEMOGRAPHICS, 2015-2019.” This dataset, meticulously compiled by the BPDA Research Division, leverages U.S. Census Decennial data to delineate demographic transformations in Boston’s neighborhoods spanning the period from 1950 to 2010, utilizing consistent tract-based geographies.
The dataset’s most recent demographic insights are extrapolated from the 5-year American Community Survey (ACS), furnishing a holistic panorama of Boston’s neighborhoods based on Census-tract approximations. Within this dataset, diverse dimensions such as age, race, nativity, education, household income, and poverty rates are comprehensively covered. For our analysis, I chose to concentrate on household incomes and poverty rates as external factors, aiming to discern their correlation with crime rates across different areas.
After categorizing offenses into different tiers based on their severity, my focus shifted to tier 3, which specifically encompasses crimes like larceny and robbery. To predict these tier 3 crimes, I conducted research to identify a suitable forecasting model. Opting for the ARIMA (AutoRegressive Integrated Moving Average) model, I found it to be a valuable tool in time series forecasting due to its simplicity, versatility, and effectiveness in capturing temporal patterns. The ARIMA model’s capability to handle a broad spectrum of time series data was a key factor in my decision. While I successfully implemented the ARIMA model, the results proved somewhat intricate, requiring further interpretation. Additionally, I plan to explore other models to determine which one yields the best outcomes.
Due to the extensive nature of the dataset encompassing various offenses, managing them poses a challenge. Following discussions within our group, we have opted to categorize the offenses into distinct tiers based on the level of violence associated with the crimes.
Tier 1 Violent Crimes: Murder, Manslaughter, Rape
Related Codes: 111, 123, 121, 244, 241, 243, 251, 261, 252, 253, 271, 254, 242
Tier 2 Serious Offenses: Arson (900-930), Aggravated Assault/Battery (401-433)
Related Codes: 900-930; 401-433; 802, 423, 413, 801
Tier 3 Non-Violent Crimes: Larceny, Robbery (301-380), Burglary, Breaking and Entering (510-547)
Related Codes: 612, 613, 614, 615, 616, 617, 618; 301, 311, 351, 361, 371, 381; 520, 521, 522, 540, 541, 542, 560, 561, 562
Tier Drugs
Related Codes: 1840, 1841, 1842, 1843, 1847, 1848, 1849, 1850, 1855, 1858, 1863, 1864, 1866, 1868, 1873, 1874, 1875, 3021, 3022, 3023
Tier 4 Property-Related Offenses: Vandalism (1402, 1415), Vehicular Accidents (3801, 3802, 3803, 3805, 3807, 3810)
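The tier assignment can be expressed as a small lookup; the sets below mirror the related-code lists above (with the Tier 2 ranges expanded):

```python
# Map each offense code to its tier, using the code lists above.
TIERS = {
    "Tier 1": {111, 123, 121, 244, 241, 243, 251, 261, 252, 253, 271, 254, 242},
    "Tier 2": set(range(900, 931)) | set(range(401, 434)) | {802, 801},
    "Tier 3": {612, 613, 614, 615, 616, 617, 618,
               301, 311, 351, 361, 371, 381,
               520, 521, 522, 540, 541, 542, 560, 561, 562},
    "Tier Drugs": {1840, 1841, 1842, 1843, 1847, 1848, 1849, 1850, 1855,
                   1858, 1863, 1864, 1866, 1868, 1873, 1874, 1875,
                   3021, 3022, 3023},
    "Tier 4": {1402, 1415, 3801, 3802, 3803, 3805, 3807, 3810},
}

def tier_of(offense_code: int) -> str:
    """Return the tier label for an offense code, or 'Other' if unlisted."""
    for tier, codes in TIERS.items():
        if offense_code in codes:
            return tier
    return "Other"
```

Applied over the OFFENSE_CODE column (e.g. `df["tier"] = df["OFFENSE_CODE"].map(tier_of)`), this gives the tier column the forecasting step works from.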
With this categorization in place, I plan to forecast the number of crimes in each district.
The initial analysis brought to light that a significant number of shooting incidents occurred predominantly on Saturday nights. This prompted me to investigate whether a correlation exists between the hour of the day and the nature of offenses, implying that certain crimes might be more prevalent at specific times. The resulting correlation coefficient between ‘Hour’ and ‘OFFENSE_CODE’ was approximately -0.008032, signifying an extremely weak correlation. In practical terms, this suggests a limited to negligible linear relationship between the time of day and the specific offense code. It is crucial to emphasize that correlation does not imply causation, and various factors may contribute to the relationship between these variables.
Subsequently, I delved into identifying streets with the highest reported crimes, and a screenshot of the findings is attached.
Considering this information, I pondered the feasibility of utilizing this data to construct predictive models for crime occurrences based on historical data. Additionally, I plan to explore the possibility of predicting specific types of crimes or the likelihood of incidents in particular locations.
Today, I conducted an analysis of crime distribution across districts. The dataset employs district codes rather than district names, necessitating some research to associate each code with its corresponding district. Upon completion of the distribution analysis, I noted that districts represented by codes B2, C11, and D4—corresponding to Roxbury, Dorchester, and South End, respectively—exhibit a higher incidence of reported crime. Conversely, district code A15, representing Charlestown, has the lowest number of reported crimes.
In examining the frequency of shooting incidents and their geographical distribution across districts, it was found that B2, C11, and B3 (Mattapan) have the highest reported number of such incidents. Further analysis unveiled that the majority of shooting incidents occurred on Saturday nights.
Following the initial descriptive analysis, today I conducted a correlation analysis on numerical variables. The results revealed a very weak positive correlation of 0.025 between year and offense code, hinting at a subtle rise in offense codes over the years. To delve deeper into this connection, I examined crime rates across all years. The findings showed that 2017 recorded the highest number of reported crimes, followed by 2016 and 2018. Subsequently, there was a gradual decrease in reported crimes, with a slight uptick observed in 2022.
Upon conducting a more in-depth analysis of the offense code groups, it was revealed that the most commonly reported group is “Motor Vehicle Accident Response,” registering a frequency of 41,064, followed by “Larceny” with a frequency of 29,000. Conversely, the least common offense code group is “HUMAN TRAFFICKING – INVOLUNTARY SERVITUDE,” documented only twice.
This dataset comprises multiple files of crime data spanning from 2015 to 2023. I consolidated these files into a single combined dataset. Upon conducting descriptive analysis, it became evident that certain columns (e.g., SHOOTING, UCR_PART, OFFENSE_CODE_GROUP) exhibit a notably high count of missing values, suggesting the need for imputation or alternative strategies for addressing these gaps.
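The consolidation and missing-value check can be sketched as follows; to keep the example self-contained, two tiny placeholder CSVs stand in for the real 2015-2023 downloads (the file names are an assumption about how they are stored locally):

```python
import glob
import pandas as pd

# Placeholder CSVs standing in for the yearly crime files.
pd.DataFrame({"YEAR": [2015, 2015], "SHOOTING": [None, "Y"]}).to_csv(
    "crime_2015.csv", index=False)
pd.DataFrame({"YEAR": [2016, 2016], "SHOOTING": [None, None]}).to_csv(
    "crime_2016.csv", index=False)

# Combine all yearly files into a single dataset.
files = sorted(glob.glob("crime_*.csv"))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Count missing values per column to see which fields need imputation.
missing = combined.isna().sum()
```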
The temporal details (YEAR, MONTH, DAY_OF_WEEK, HOUR) present in the dataset offer an opportunity to analyze trends and patterns over time. Concurrently, the spatial information (DISTRICT, Lat, Long, STREET, Location) provides the basis for exploring the geographical distribution of incidents.
Further exploration and cleaning of the dataset may be imperative, depending on the specific goals of the analysis. These insights form a foundational understanding for subsequent data exploration and analysis, allowing for the extraction of meaningful information from the dataset.
For Project 3, we have the flexibility to choose datasets from the Analyze Boston site, and we’ve opted for the CRIME INCIDENT REPORTS (AUGUST 2015 – TO DATE) (SOURCE: NEW SYSTEM).
These reports, supplied by the Boston Police Department (BPD), serve to document the initial details of incidents to which BPD officers respond. The dataset originates from the new crime incident report system, featuring a streamlined set of fields concentrating on capturing incident types, along with details about when and where they occurred. The records in this system commence from June of 2015.
Key attributes in the dataset include:
- [incident_num]: Internal BPD report number
- [offense_code]: Numerical code corresponding to offense description
- [Offense_Code_Group_Description]: Internal categorization of [offense_description]
- [Offense_Description]: Primary descriptor of the incident
- [district]: District where the crime was reported
- [reporting_area] varchar NULL: RA number associated with where the crime was reported
- [shooting] char NULL: Indicates whether a shooting took place
- [occurred_on] datetime2 NULL: Earliest date and time the incident could have taken place
- [UCR_Part] varchar NULL: Uniform Crime Reporting Part number (1, 2, 3)
- [street] varchar NULL: Street name where the incident took place
For this project I plan to understand the “Economic Indicators” dataset from Analyze Boston, the City of Boston’s open data hub. My goal is to conduct a comprehensive analysis to extract insights that can significantly contribute to informed policy-making decisions, leveraging the economic data tracked by the Boston Planning & Development Agency (BPDA) between January 2013 and December 2019.
We have compiled our findings and are in the final stages of preparing our Project 2 report, which we aim to finalize by the end of today.
Today, I commenced work on the project report by consolidating all the analyses conducted thus far on the dataset. The following aspects were addressed during the analysis:
- Visualized the distribution of victims’ age, race, and gender.
- Identified the top regions with the highest number of incidents.
- Determined the top agencies associated with the highest number of incidents.
- Explored the distribution of arms in fatal police shootings.
- Examined the average age by race.
- Provided frequency counts and percentages for various variables, including ‘threat_type,’ ‘flee_status,’ ‘armed_with,’ ‘race,’ and ‘was_mental_illness_related.’
- Conducted a temporal analysis to understand if the number of incidents increases or decreases over time.
- Utilized clustering to map incidents geographically.
- Investigated the correlation between age and armed status.
- Explored the distribution of weapons across different age groups.
- Performed logistic regression for predicting flee status.
I also had a meeting with my group members to discuss our findings.
Today, I conducted a logistic regression analysis to predict flee status, employing features such as age, gender (male/female), vehicle presence, armed or unarmed status, and race (black/white). The target variable was ‘flee_status_not,’ where 1 indicates did not flee, and 0 indicates fleeing. Here are the key observations from the results:
- Accuracy: The model demonstrated an accuracy of approximately 70%, signifying its ability to correctly predict ‘flee_status_not’ for 70% of the samples in the test set.
- Precision: Of the instances predicted as ‘flee_status_not,’ 72% were genuinely ‘flee_status_not.’
- Recall: Out of all the actual ‘flee_status_not’ instances, the model correctly identified 94%.
- F1-score: The F1-score, representing the harmonic mean of precision and recall, is a useful metric for balancing these two measures. With an F1-score of 0.81, the model demonstrates a reasonably good balance between precision and recall for predicting the ‘flee_status_not’ class.
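A skeleton of this logistic regression with scikit-learn is below. The features and labels are synthetic stand-ins generated from a fixed seed, not the Washington Post data, so the metrics here do not reproduce the numbers above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic placeholders for the encoded features:
# age, gender_male, vehicle_present, armed, race_black.
rng = np.random.default_rng(42)
n = 400
X = np.column_stack([
    rng.integers(16, 80, n),   # age
    rng.integers(0, 2, n),     # gender_male
    rng.integers(0, 2, n),     # vehicle_present
    rng.integers(0, 2, n),     # armed
    rng.integers(0, 2, n),     # race_black
])
# Target: 1 = did not flee, 0 = fled (loosely tied to the features).
y = ((X[:, 0] > 35) | (X[:, 2] == 0)).astype(int)

# Hold out a test set, fit the model, and score its accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
```

Precision, recall, and the F1-score come from `sklearn.metrics.classification_report` on the same predictions.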
I conducted a correlation analysis to explore the relationship between age and armed status, and although the correlation outcomes revealed a weak connection, it prompted me to delve further into whether specific age groups exhibit a higher prevalence of particular weapons, such as guns. To investigate this, I employed seaborn’s barplot function to generate a stacked bar plot. In this plot, each age group’s bar is subdivided into segments representing the counts of different weapon categories. I focused on armed statuses involving guns, knives, and being unarmed, as they play a more dominant role in the dataset. Attached below is a screenshot of the result. Notably, the analysis reveals that the age group 19-30 exhibits a higher incidence of individuals carrying guns or knives. Additionally, the visual representation highlights instances of underage possession of guns.
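One common way to assemble such a stacked view is to build a crosstab of age group by weapon first (seaborn's barplot draws grouped rather than stacked bars, so the counts are usually tabulated before plotting). The records below are placeholders:

```python
import pandas as pd

# Placeholder records: age group and armed status per incident.
df = pd.DataFrame({
    "age_group": ["19-30", "19-30", "19-30", "31-45", "31-45", "<19"],
    "armed_with": ["gun", "knife", "gun", "gun", "unarmed", "gun"],
})

# Counts of each weapon category within each age group.
counts = pd.crosstab(df["age_group"], df["armed_with"])

# counts.plot(kind="bar", stacked=True) would render the stacked
# bars (requires matplotlib).
```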
Previously, I transformed categorical data into numeric format for analytical purposes. During this process, I pondered whether there might be a connection between age and armed status. Consequently, I conducted a correlation analysis specifically focusing on the armed statuses: armed with a knife, armed with a gun, and unarmed, as they exhibit high occurrence in the dataset. The obtained correlation coefficients are as follows:
- Armed with a Knife: -0.011807: This value is very close to zero, indicating an extremely weak or negligible correlation between age and the likelihood of being armed with a knife. The negative sign suggests a slight negative correlation, but its magnitude is minimal.
- Armed with a Gun: 0.088604: This positive value points to a weak positive correlation between age and the likelihood of being armed with a gun. However, the correlation is relatively modest, implying that age and being armed with a gun are not strongly linearly associated.
- Unarmed: -0.128284: This negative value suggests a weak negative correlation between age and the likelihood of being unarmed, although the correlation is still relatively small.
These values only capture linear relationships. Other factors and nonlinear relationships may also play a role in understanding the data.
I began by converting categorical data into numerical format in order to facilitate clustering techniques. The initial step involved applying DBSCAN clustering based on the ‘latitude’, ‘longitude’, and ‘total_shootings’ columns from the dataset. This enabled the visualization of clusters on a map of the USA using the folium library, making it possible to observe shooting incidents that occurred in close geographic proximity.
Subsequently, I delved into exploring appropriate clustering methods to extract meaningful insights from the dataset. I opted to employ K-Means clustering, as it operates without requiring a target variable, aligning with the principles of unsupervised learning. Following this decision, I partitioned the data into training and testing sets, and I proceeded to train the K-Means model using the training data. Further, I applied the model to predict cluster labels for the testing dataset. The results obtained from these need further analysis and evaluation.
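The DBSCAN step can be sketched as below; the coordinates are six placeholder points forming two tight groups far apart, not actual incident locations. Note that `eps` here is in raw coordinate degrees, whereas for real geographic data the haversine metric on radians is usually more appropriate:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder coordinates: two tight groups of incidents far apart.
coords = np.array([
    [42.36, -71.06], [42.35, -71.07], [42.37, -71.05],    # group 1
    [34.05, -118.24], [34.06, -118.25], [34.04, -118.23], # group 2
])

# Points within eps of each other (with at least min_samples
# neighbors) form a cluster; -1 would mark noise points.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(coords)
```

Each resulting label can then be drawn on a folium map with a per-cluster color.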
Upon analyzing the data related to fatal police shootings over the span of several years, I noted that in 2015, there were nearly 1,000 incidents, with a dip to 960 in 2016. Subsequently, the numbers exhibited a gradual increase, reaching 980 and 990 in 2017 and 2018, respectively. This upward trend persisted, nearly hitting 1,000 again in 2019. The year 2020 saw the number of shootings exceed 1,000, and it peaked in 2021, surpassing 1,040 fatal police shootings. Nevertheless, in 2022, the count decreased once more to 1,000.
Earlier, I computed frequency counts for categorical variables such as “threat_type,” “armed_with,” “was_mental_illness_related,” and “flee status.” Today, I extended my exploration to delve into the gender distribution within the dataset. Although males are more commonly involved, incidents with female victims are also noteworthy in number. Additionally, I discovered supplementary data on the Washington Post site, and I intend to incorporate it for further temporal analysis.
Today, to gain deeper insights into the dataset, I calculated frequency counts for categorical variables, including “threat_type,” “armed_with,” “was_mental_illness_related,” and “flee status.”
Here are my key observations:
- Regarding the “threat_type” category, it is evident that “shoot” and “threat” represent the most prevalent threat types, whereas “point,” “attack,” and “move” are comparatively less common.
- When examining whether victims were armed or not, it is apparent that “gun” and “knife” emerge as the predominant weapons used in these incidents. Additionally, a significant number of cases involve unarmed victims.
- With respect to flee status, most cases do not involve fleeing (“not”).
- An analysis of the “was_mental_illness_related” category reveals that the majority of cases, approximately 79%, do not report any indication of mental illness (“False”). However, a substantial portion, around 20%, does report mental illness (“True”).
These observations might be useful in decision-making, resource allocation, and the development of policies in the relevant fields.
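Frequency counts and percentages like these come directly from pandas’ value_counts; a tiny self-contained illustration with placeholder data matching the roughly 79/21 split noted above:

```python
import pandas as pd

# Placeholder column standing in for 'was_mental_illness_related'.
s = pd.Series([False] * 79 + [True] * 21)

counts = s.value_counts()                    # raw frequencies
pcts = s.value_counts(normalize=True) * 100  # percentages
```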
Today, I delved into additional dataset features. I initiated my analysis by identifying the state with the highest number of incidents. My findings revealed that California had the most incidents, totaling 344, followed by Texas with 192 incidents, and Florida with 126 incidents.
Additionally, I explored which agencies were involved in the most incidents, pinpointing agency ID 38 as having the highest number of incidents. To gain deeper insights into the dataset, I visualized the ‘armed’ feature, focusing on understanding how many victims were armed and the types of weapons they possessed. The visualization unveiled that over 1,200 victims were armed with guns, around 400 with knives, and roughly 200 were unarmed.
Furthermore, I calculated the average ages of all victims across different racial groups. You can find a screenshot of this analysis below:
I attempted to perform a temporal analysis on the dataset, which initially pointed to 2017 as the year with the highest number of shootings. However, I noticed a discrepancy between the data and my Python code that calls for further investigation. In this case, the dataset records ‘0’ shootings for the year 2023, despite the presence of data for that year in the dataset.
To gain a better understanding of the dataset’s demographics, I also visualized the gender feature. The visualization revealed a notable disparity in fatalities, with nearly 2,000 male victims as opposed to fewer than 250 female victims. As I continue my analysis, I plan to explore additional features to gain a more comprehensive understanding of the dataset.
Upon reviewing the dataset, a few questions came to mind. Firstly, I wondered whether there has been an increase or decrease in the number of fatal police shootings over the years. Secondly, I pondered which geographical areas experience the highest incidence of fatal police shootings, and lastly, I was curious about which racial group exhibits the lowest average age. I intend to share these queries with my team and conduct a temporal analysis to delve deeper into the first question I raised.
Regarding the dataset, it contains information on fatal police shootings. The data is divided into two parts: one with details about the victims and incidents, located in the “/v2/fatal-police-shootings-data.csv” file, and another containing data about police agencies involved in at least one fatal police shooting since 2015, found in the “/v2/fatal-police-shootings-agencies.csv” file. I combined these two CSV files using the “agency_ids” value as a reference and removed any rows with missing data (NaN values).
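The combine-and-clean step can be sketched with pandas. The frames below are placeholders for the two CSVs, and this sketch assumes one agency id per incident row (in the real file an incident can list several ids, which would need splitting first):

```python
import pandas as pd

# Placeholder stand-ins for the two Washington Post files.
shootings = pd.DataFrame({"id": [1, 2, 3],
                          "agency_ids": ["38", "7", None],
                          "age": [25, 31, 40]})
agencies = pd.DataFrame({"id": ["38", "7"],
                         "name": ["Agency A", "Agency B"]})

# Join incidents to agencies on the agency id, then drop rows with
# any remaining missing values, mirroring the cleaning step above.
merged = shootings.merge(agencies, left_on="agency_ids",
                         right_on="id", suffixes=("", "_agency"))
merged = merged.dropna()
```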
After a discussion with one of my group members, to gain a better understanding of the dataset, I created a visualization using a histogram. The histogram revealed that the number of incidents involving white individuals was the highest, followed by Black, Hispanic, Native American, and Asian individuals. This information provides insights into the distribution of fatal police shootings across different racial groups.
The second project involves analyzing data from the Washington Post data repository on fatal police shootings in the United States. In order to gain a preliminary understanding of the data and its attributes, I executed basic commands such as “describe()” and “info()”. There are 8,770 data points so far, spanning 2015-01-02 to 2023-10-07. I am currently in the process of understanding the dataset and exploring its potential analyses and applications.
Attaching a screenshot of the results:
I am in the last phase of documenting the Project 1 report and anticipate completing it today.
The issues I have analyzed so far are:
- Disparity in food access between rural and urban areas of the USA and how it affects the percentage of diagnosed diabetes.
- Disparity in physical inactivity between rural and urban areas of the USA and how it affects the percentage of diagnosed diabetes.
- Disparity in obesity between rural and urban areas of the USA and how it affects the percentage of diagnosed diabetes.
As mentioned in the previous blog, I have conducted linear regression, multiple regression, t-test, utilized the Monte Carlo method, and implemented cross-validation techniques on the dataset.
I’m currently documenting the results and methodologies in my report.
While working on my report today, I came to the realization that it’s necessary to conduct a t-test on the independent variables within the urban-rural dataset. Specifically, I examined the obesity data in the urban-rural context and found that the T-statistic is -13.52903571504057 and the p-value is 1.4137452385540387e-40. This statistical result suggests that there is a considerable difference between these groups, with the first group (rural) having a significantly lower mean than the other. The low p-value suggests that the observed difference is not due to random variability but is instead likely a real and meaningful difference between the groups.
Similarly, when analyzing the inactivity data, I observed a T-statistic of -7.488472092334628, and the associated p-value is 9.002823630475274e-14.
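These tests come from scipy; a minimal sketch with made-up sample values standing in for the rural and urban columns (whether Welch's unequal-variance variant matches the exact test used above is an assumption):

```python
from scipy import stats

# Placeholder samples standing in for rural vs. urban inactivity rates.
rural = [30.1, 32.4, 31.2, 33.0, 32.1, 31.5, 30.8, 33.3]
urban = [36.2, 38.1, 37.4, 39.0, 38.3, 37.1, 36.8, 39.2]

# Two-sample t-test; equal_var=False gives Welch's test, which does
# not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(rural, urban, equal_var=False)
```

A negative t-statistic, as here, means the first group's mean is below the second's.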
As of now, I have conducted linear regression, multiple regression, t-test, utilized the Monte Carlo method, and implemented cross-validation techniques on the dataset. The outcomes and any noteworthy challenges encountered during these analyses have been documented in the report.
In my previous blog post, I forgot to mention that I had also performed quadratic regression on the urban-rural dataset, with diagnosed diabetes as the dependent variable and obesity, inactivity, and food insecurity as independent variables. For the rural dataset, the model predicted an average Diagnosed Diabetes percentage (y_pred) of 8.379, and the R-squared value was 0.58. The R-squared value tells us that about 58% of the variation in Diagnosed Diabetes percentage can be explained by the model, which is a moderate fit.
For the urban dataset, the average predicted Diagnosed Diabetes percentage (y_pred) was 8.971, and the R-squared value was 0.61. This R-squared value means that approximately 61% of the variation in Diagnosed Diabetes percentage in urban areas is explained by the model, indicating a slightly better fit compared to the rural dataset.
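One way to carry out such a quadratic (degree-2) regression is scikit-learn's PolynomialFeatures plus ordinary least squares. The data below is synthetic, generated from a seeded quadratic relationship, so the R-squared it produces is not the project's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Placeholder data: obesity, inactivity, food insecurity percentages.
rng = np.random.default_rng(1)
X = rng.uniform(10, 40, size=(100, 3))

# Synthetic target with a genuine quadratic term plus small noise.
y = (2 + 0.1 * X[:, 0] + 0.05 * X[:, 1] ** 2 + 0.2 * X[:, 2]
     + rng.normal(0, 0.5, 100))

# Expand features to degree-2 terms (squares and interactions),
# then fit ordinary least squares on the expanded design matrix.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
r2 = r2_score(y, model.predict(X_poly))
```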
After that, I moved on to cross-validation, as I explained in my previous blog post. Today, I explored different cross-validation methods to get a comprehensive understanding of my model’s performance. I plan to know more about “bootstrapping” in particular.
Following this exploration, I analyzed all the tests and analyses I’ve done so far to understand the data better. This analysis will be useful when discussing our findings with my project group as we work on the project report.
Previously, I encountered errors while attempting cross-validation. Today, I successfully addressed those issues. I conducted cross-validation on two distinct datasets, one representing rural areas and the other urban areas, using k-fold cross-validation with k=5. In both datasets, the dependent variable was “Diagnosed diabetes,” and the independent variables included “obesity,” “inactivity,” and “food insecurity.”
For the rural dataset, the results were as follows:
- Mean Squared Error: 1.54
- Standard Deviation of MSE: 0.17
Averaged across folds, the model’s mean squared error is 1.54, and the standard deviation of 0.17 suggests some variability in prediction accuracy across the cross-validation folds.
For the urban dataset, the results were as follows:
- Mean Squared Error: 1.12
- Standard Deviation of MSE: 0.10
The model’s predictions have a lower MSE of 1.12 on average compared to the rural dataset. Additionally, the standard deviation of 0.10 indicates relatively consistent prediction performance across different folds.
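This evaluation can be sketched with scikit-learn's cross_val_score; the design matrix below is synthetic placeholder data, so the MSE values will not match the ones above. Note that sklearn reports negated MSE, so the sign is flipped:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder design: three predictors (obesity, inactivity, food
# insecurity) and a roughly linear diabetes percentage with noise.
rng = np.random.default_rng(7)
X = rng.uniform(5, 40, size=(200, 3))
y = (1.5 + 0.12 * X[:, 0] + 0.08 * X[:, 1] + 0.05 * X[:, 2]
     + rng.normal(0, 1.0, 200))

# 5-fold cross-validation; flip the sign of the negated MSE scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = -cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
mse_mean, mse_std = scores.mean(), scores.std()
```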
Today I invested a considerable amount of time in grasping the concept of cross-validation. To summarize my understanding, cross-validation tests how well a model can make predictions on data it hasn’t seen before. It helps us avoid a common problem called overfitting, which occurs when a model learns the training data too well and becomes too specialized, performing poorly on new data. Cross-validation checks whether your model is likely to overfit.
Further, I decided to perform this on my urban-rural dataset for diagnosed diabetes percentage, using k-fold cross-validation to split the data into 10 folds.
I faced issues during my cross-validation process, and to address these problems, I intend to handle the missing data. My next step is to resolve these errors and proceed with the cross-validation.
Today I tried to understand the Monte Carlo method, since it was discussed in a previous class. What I understood is that the Monte Carlo method is a technique for estimating uncertain outcomes through randomness and simulation. I then performed this test on the diagnosed diabetes dataset for urban and rural counties, and the p-value came out to be 0.99999.
This finding confused me a bit because, in an earlier analysis using a t-test on the same datasets, I had obtained a drastically different p-value of 3.832914332163736e-20.
This contrast in p-values prompted me to delve deeper into the discrepancy between the two methods. It became apparent that the Monte Carlo method can yield variable results based on the assumptions made during the simulations. On the other hand, the t-test assumes a normal distribution of data, and if this assumption is not met, the reliability of the results may be compromised.
To address this disparity in p-values and gain a better understanding of the data, further exploration and analysis are necessary. This involves scrutinizing the dataset, investigating potential outliers, and considering alternative statistical tests to clarify the reasons behind the differing results and draw more accurate conclusions from the data.
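A Monte Carlo (permutation) test of the rural-urban difference can be sketched as below, with placeholder samples. One comment worth noting: if the test statistic is compared in the wrong tail direction, a permutation p-value near 1 is exactly what comes out, which is one way such a discrepancy with the t-test can arise:

```python
import numpy as np

# Placeholder diabetes percentages for rural and urban counties.
rural = np.array([7.9, 8.3, 8.1, 8.6, 8.0, 8.4, 8.2, 8.5])
urban = np.array([9.0, 9.4, 9.2, 9.6, 9.1, 9.5, 9.3, 9.7])
observed = urban.mean() - rural.mean()

# Shuffle the pooled values many times and count how often a random
# split produces a difference at least as large as the observed one.
rng = np.random.default_rng(0)
pooled = np.concatenate([rural, urban])
n_sims = 5000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    diff = pooled[8:].mean() - pooled[:8].mean()
    if diff >= observed:
        count += 1
p_value = count / n_sims
```

Testing `rural - urban` instead of `urban - rural` here would flip the tail and push the p-value toward 1.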
With respect to the project, my primary goal today was to assess the relationship between diagnosed diabetes percentage and three key independent variables: obesity, inactivity, and food insecurity. For the rural dataset, the multiple regression analysis gave an R-squared value of 0.549, indicating that about 54.9% of the variation in diagnosed diabetes percentage in rural areas can be explained by obesity, inactivity, and food insecurity. I then replicated the same multiple regression analysis on the urban dataset, which resulted in an R-squared value of 0.602, implying that about 60.2% of the variation in diagnosed diabetes percentage in urban areas is explained by those same factors.
For Rural Dataset:
For Urban Dataset:
Further, I delved into the topic of t-tests, which had been discussed in today’s class. I conducted a t-test on the diabetes data for both urban and rural counties, producing a t-statistic of -9.25558917924443, implying that there is a significant disparity in diabetes rates between these two types of counties, with urban counties having a notably higher diabetes rate than rural counties.
Additionally, the very low p-value (3.832914332163736e-20) confirms strong statistical significance: urban counties indeed have a significantly higher diabetes rate than rural counties, and this result is highly unlikely to be due to random fluctuations in the data.
Today, after class, we engaged in a productive discussion with the TA, after which I decided to introduce a new variable into the model: physical inactivity. In addition to obesity and food insecurity, this variable adds an important dimension to our analysis. Subsequently, I meticulously organized the data, ensuring that it accommodated this new variable for both urban and rural contexts. Following this data preparation, I conducted multiple regression analysis on our updated datasets.
Following are the results:
For the rural dataset: R-squared value: 0.572, Skewness: 0.314, Kurtosis: 3.417, F-statistic: 888.5, Prob(F-statistic): 6.78e-246
For the urban dataset: R-squared value: 0.507, Skewness: 0.061, Kurtosis: 3.104, F-statistic: 927.2, Prob(F-statistic): 1.11e-277
The R-squared values for both datasets indicate a moderate fit. I plan to delve deeper into these findings and what they mean in practice for our goals. After that, I plan to perform cross-validation to check the accuracy of the model and to test for overfitting.
After initially conducting linear regression on the relationship between %Food Insecurity and %Obesity, and narrowing our focus to health disparities between urban and rural populations, I analyzed the data again by performing linear regression specifically within the Urban-Rural indicator subset. The resulting R-squared value was very low, approximately 0.0025, indicating that the model offers very limited explanatory power for %Obesity based on %Food Insecurity within this subset. In my understanding, this implies that %Food Insecurity alone may not be a strong indicator of %Obesity in these areas, and other influential factors also play a significant role.
For a better understanding of the relationship between these variables, I also conducted a Pearson correlation analysis, revealing a correlation coefficient (r) of approximately 0.3538 and a high p-value.
After obtaining these results, the next step was to interpret their implications. After some research, I learned that the positive value of r (0.3538) indicates a positive linear relationship, meaning that as one variable increases, the other tends to increase as well. At the same time, the high p-value means the relationship is not statistically significant, so we fail to reject the null hypothesis of no correlation. And in any case, correlation does not imply causation.
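The Pearson analysis above can be reproduced with `scipy.stats.pearsonr`. This sketch uses fabricated values with a deliberately weak positive relationship built in, since the actual subset data is not reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 30

# Hypothetical %Food Insecurity and %Obesity values for a small subset;
# a weak positive relationship is built in on purpose.
food_insecurity = rng.normal(13, 3, n)
obesity = 0.3 * food_insecurity + rng.normal(30, 2.5, n)

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of zero correlation.
r, p_value = stats.pearsonr(food_insecurity, obesity)
print(f"r = {r:.4f}, p = {p_value:.4f}")
```

A positive r indicates a positive linear trend, but if p exceeds the chosen threshold (commonly 0.05), the data do not justify rejecting the null hypothesis of no correlation.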
With this, I believe that additional factors beyond %Food Insecurity are contributing to health disparities, and that a broader set of variables could influence %Obesity. I plan to explore more variables and perform multiple regression on that dataset.
In our class discussion today, we explored the concept of the p-value. To summarize my understanding, a p-value is the probability of observing a result at least as extreme as the one obtained, under the assumption that the null hypothesis is true. In practical terms, a small p-value provides strong evidence to reject the null hypothesis, while a larger p-value suggests that it is reasonable to retain the null hypothesis for further consideration.
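One way to build intuition for this definition is a small simulation: when the null hypothesis is actually true, p-values are uniformly distributed, so roughly 5% of experiments fall below 0.05 purely by chance. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulate 2000 experiments in which the null hypothesis is TRUE:
# both groups are drawn from the same distribution.
p_values = []
for _ in range(2000):
    a = rng.normal(0, 1, 50)
    b = rng.normal(0, 1, 50)
    p_values.append(stats.ttest_ind(a, b).pvalue)

# Under a true null, p-values are roughly uniform on [0, 1], so about
# 5% of experiments cross the 0.05 threshold by chance alone.
frac = np.mean(np.array(p_values) < 0.05)
print(f"Fraction of p < 0.05 under the null: {frac:.3f}")
```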
Regarding the project, after digging deeper into the data, I had a lot of questions. Within our group, we’ve currently focused our attention on identifying health disparities between rural and urban populations. This led me to consider which factors, aside from the overall SVI (Social Vulnerability Index), are crucial for our analysis.
After a thoughtful discussion with one of my fellow group members, we decided to work with datasets encompassing various social determinants of health with respect to urban and rural status of the counties. Specifically, I embarked on a task to compare and correlate the percentage of food access with obesity data within urban and rural counties.
I started by formatting the datasets so they would be relevant to my goal; this included merging the two datasets containing food-access and obesity data using Python. After performing linear regression, I found an R-squared value of 0.0295, which implies that the independent variable in our model has little to no explanatory power and that the model’s fit is far from ideal.
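The merge-then-regress step described above might look like the following. The file contents and column names here are hypothetical stand-ins for the real food-access and obesity datasets:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the two source datasets; the real analysis
# merged the food-access and obesity data on a shared county identifier.
food_access = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "pct_food_access": [72.0, 65.5, 80.1, 58.3, 69.9],
})
obesity = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "pct_obesity": [31.2, 34.0, 29.5, 36.1, 33.3],
})

# Inner join on the shared key keeps only counties present in both tables.
merged = food_access.merge(obesity, on="county", how="inner")

# Simple linear regression of obesity on food access; score() is R-squared.
X = merged[["pct_food_access"]]
y = merged["pct_obesity"]
model = LinearRegression().fit(X, y)
print("R-squared:", round(model.score(X, y), 4))
```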
As part of my ongoing efforts, I plan to enhance the model’s performance. This will involve the removal of outliers from the dataset and the incorporation of additional factors to consider. The aim is to refine our analysis and achieve more meaningful insights into the health disparities between rural and urban populations.
- Explored the data and tried analyzing patterns – The dataset includes three main variables: inactivity, obesity, and diabetes. I examined the data in detail against different indicators and social determinants, such as economics, food access, healthcare, and the Social Vulnerability Index (SVI), a tool used by the CDC to spatially identify “at-risk” populations. I also attempted to correlate and identify patterns among multiple data points, including factors like SVI.
- In class, we discussed relating diabetes to inactivity, and through this example I learned more statistical terms, such as “kurtosis”. Kurtosis is a measure of the tailedness of a distribution, the tails being its tapering ends. The second important term discussed today was “heteroskedasticity”. In layman’s terms, it can be explained as the “fanning out” of data: the spread of the residuals changes across the range of a predictor. The higher the heteroskedasticity, the less reliable the model is. The test used to detect heteroskedasticity in a linear regression model is the Breusch-Pagan test.
- Group meeting– Today after a discussion with the group about the possibilities with the dataset, we collectively decided to concentrate on examining the prevalence and underlying factors of inactivity, obesity, and diabetes among rural and urban populations.
- I brushed up my Python skills, summarized the data with Python for the urban and rural populations depending on the factors food access and SVI, and plotted graphs for the same.
- I further plan to discuss my findings with my group and instructor.