September, 29 Friday…

 
In my previous blog post, I forgot to mention that I had also performed quadratic regression on the urban-rural dataset, with diagnosed diabetes as the dependent variable and obesity, inactivity and food insecurity as the independent variables. For the rural dataset, the model predicted an average Diagnosed Diabetes percentage (y_pred) of 8.379, and the R-squared value was 0.58. The R-squared value tells us that about 58% of the variation in Diagnosed Diabetes percentage can be explained by the model, which is a moderate fit.

For the urban dataset, the average predicted Diagnosed Diabetes percentage (y_pred) was 8.971, and the R-squared value was 0.61. This R-squared value means that approximately 61% of the variation in Diagnosed Diabetes percentage in urban areas is explained by the model, indicating a slightly better fit compared to the rural dataset. 
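To make the quadratic regression step concrete, here is a minimal sketch using only NumPy. It is not the code behind the numbers above: the data are synthetic stand-ins, and the helper names `quadratic_features` and `fit_quadratic` are hypothetical. It assumes "quadratic" means adding squared and pairwise-product terms of the three predictors.

```python
import numpy as np

def quadratic_features(X):
    """Build columns [1, x_i, x_i * x_j for i <= j] from a 2-D array X."""
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i in range(k) for j in range(i, k)]
    return np.column_stack(cols)

def fit_quadratic(X, y):
    """Least-squares fit; returns coefficients, predictions and R-squared."""
    A = quadratic_features(X)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_pred = A @ coef
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, y_pred, 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # columns: obesity, inactivity, food insecurity
y = (8.0 + X @ np.array([0.5, 0.8, 0.3])
     + 0.2 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=200))

coef, y_pred, r2 = fit_quadratic(X, y)
print(f"mean y_pred = {y_pred.mean():.3f}, R-squared = {r2:.3f}")
```

One thing the sketch makes visible: for a least-squares model with an intercept, the average of `y_pred` on the training data equals the average of `y`, so the mean prediction describes the data rather than the quality of the fit. R-squared is the number that speaks to fit.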

After that, I moved on to cross-validation, as I explained in my previous blog post. Today, I explored different cross-validation methods to get a comprehensive understanding of my model’s performance. I plan to learn more about “bootstrapping” in particular.

Following this exploration, I analyzed all the tests and analyses I’ve done so far to understand the data better. This analysis will be useful when discussing our findings with my project group as we work on the project report. 

September,27 (Wednesday)

Previously, I encountered errors while attempting cross-validation. Today, I successfully addressed those issues. I conducted cross-validation on two distinct datasets: one representing rural areas and the other urban areas with kfold=5. In both datasets, the dependent variable was “Diagnosed diabetes,” and the independent variables included “obesity,” “inactivity,” and “food insecurity.” 

For the rural dataset, the results were as follows: 

    • Mean Squared Error: 1.54 
    • Standard Deviation of MSE: 0.17 

The MSE of 1.54 is the average squared difference between the predicted and actual values; its square root, about 1.24 percentage points, gives the typical size of a prediction error. The standard deviation of 0.17 suggests some variability in prediction accuracy across the different cross-validation folds.

For the urban dataset, the results were as follows: 

    • Mean Squared Error: 1.12 
    • Standard Deviation of MSE: 0.10 

The model’s average MSE of 1.12 (a typical error of about 1.06 percentage points) is lower than for the rural dataset. Additionally, the standard deviation of 0.10 indicates relatively consistent prediction performance across the different folds.
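The cross-validation figures above (mean MSE and its standard deviation over 5 folds) can be reproduced in spirit with a short sketch. This is an illustration under assumptions, not the original code: the data are synthetic, the model is a plain linear regression with an intercept, and `kfold_mse` is a hypothetical helper.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Return the test MSE of a linear regression for each of k folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)  # shuffle row indices
    folds = np.array_split(idx, k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)  # all rows not in this fold
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return np.array(scores)

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # obesity, inactivity, food insecurity
y = 8.0 + X @ np.array([0.5, 0.8, 0.3]) + rng.normal(scale=1.0, size=300)

scores = kfold_mse(X, y, k=5)
print(f"mean MSE = {scores.mean():.2f}, std of MSE = {scores.std():.2f}")
print(f"RMSE = {np.sqrt(scores.mean()):.2f} (same units as y)")
```

Reporting the RMSE next to the MSE is a small design choice: it is in the same units as the diabetes percentage, which makes it easier to judge whether an error of that size matters.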

September, 25 (Monday)

Today I invested a considerable amount of time in grasping the concept of cross-validation. To summarize my understanding, cross-validation tests how well a model can make predictions on data it hasn’t seen before. It helps us avoid a common problem called overfitting, which occurs when a model learns the training data too well and becomes too specialized, performing poorly on new data. Cross-validation checks whether your model is likely to overfit.

Next, I decided to apply this to my urban-rural dataset for diagnosed diabetes percentage, using k-fold cross-validation to split the data into 10 folds.

I faced issues during my cross-validation process, and to address these problems, I intend to handle the missing data. My next step is to resolve these errors and proceed with the cross-validation. 
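As a rough sketch of one way to handle missing data before cross-validation, the snippet below drops every row that contains a missing value and then splits the remaining rows into 10 folds. Dropping rows is only one option (imputation is another), and the array here is synthetic; nothing in it reflects the actual dataset or the fix that was eventually used.

```python
import numpy as np

# Synthetic table: 3 predictor columns + 1 response column (NOT real data).
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4))
# Inject missing values into 5 distinct rows to mimic an incomplete dataset.
data[rng.choice(50, size=5, replace=False), 1] = np.nan

# Keep only rows where every column is observed.
complete = data[~np.isnan(data).any(axis=1)]
X, y = complete[:, :3], complete[:, 3]

# Shuffle the remaining row indices and split them into 10 folds.
folds = np.array_split(rng.permutation(len(y)), 10)
print(f"rows kept: {len(y)} of {len(data)}, fold sizes: {[len(f) for f in folds]}")
```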

Monte Carlo Testing …………..(September,22 Friday)

Today I tried understanding the Monte Carlo method, since it was discussed in a previous class. What I understood is that the Monte Carlo method is a technique used to estimate uncertain outcomes through randomness and simulation. I further performed this test on the diagnosed diabetes dataset for urban and rural counties and the p-value came out to be 0.99999. 

This finding confused me a bit because, in an earlier analysis using a t-test on the same datasets, I had obtained a drastically different p-value of 3.832914332163736e-20. 

This contrast in p-values prompted me to delve deeper into the discrepancy between the two methods. It became apparent that the Monte Carlo method can yield variable results based on the assumptions made during the simulations. On the other hand, the t-test assumes a normal distribution of data, and if this assumption is not met, the reliability of the results may be compromised. 

To address this disparity in p-values and gain a better understanding of the data, further exploration and analysis are necessary. This involves scrutinizing the dataset, investigating potential outliers, and considering alternative statistical tests, in order to clarify the reasons behind the difference in results and draw more accurate conclusions from the data.
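One way to investigate the gap is to write the Monte Carlo test out explicitly as a permutation test and compare one-sided and two-sided p-values. The sketch below uses synthetic data and a hypothetical helper, `permutation_pvalues`. It illustrates one possible, unconfirmed explanation: if the simulation counts differences in the direction opposite to the observed one, the p-value lands near 1 even when a two-sided test is highly significant.

```python
import numpy as np

def permutation_pvalues(a, b, n_perm=2000, seed=0):
    """Permutation test on the difference in means, mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)  # relabel the groups at random
        diffs[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    p_two_sided = np.mean(np.abs(diffs) >= abs(observed))
    p_greater = np.mean(diffs >= observed)  # H1: mean(a) > mean(b)
    p_less = np.mean(diffs <= observed)     # H1: mean(a) < mean(b)
    return observed, p_two_sided, p_greater, p_less

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(42)
rural = rng.normal(8.4, 1.5, 400)
urban = rng.normal(9.0, 1.5, 400)

observed, p_two, p_greater, p_less = permutation_pvalues(rural, urban)
print(f"observed difference (rural - urban) = {observed:.3f}")
print(f"two-sided p = {p_two:.4f}, 'greater' p = {p_greater:.4f}, 'less' p = {p_less:.4f}")
```

With a finite number of shuffles the smallest nonzero p-value is 1 / n_perm, so a permutation test cannot report something like 1e-20; an estimate of 0 only means "smaller than the simulation can resolve".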

September,20 Wednesday

With respect to the project, my primary goal today was to assess the relationship between diagnosed diabetes percentage and three key independent variables: obesity, inactivity, and food insecurity. For the rural dataset, the multiple regression analysis gave an R-squared value of 0.549, indicating that about 54.9% of the variation in diagnosed diabetes percentage in rural areas can be explained by obesity, inactivity, and food insecurity. I then ran the same multiple regression analysis on the urban dataset, which gave an R-squared value of 0.602, implying that about 60.2% of the variation in diagnosed diabetes percentage in urban areas can be explained by the same three factors.

For Rural Dataset: 

For Urban Dataset: 

Further, I delved into the topic of t-tests, which had been discussed in today’s class. I conducted a t-test on the diabetes data for urban and rural counties, which produced a t-statistic of -9.25558917924443. A value this far from zero points to a significant disparity in diabetes rates between the two types of counties, with urban counties having a notably higher diabetes rate than rural counties.

Additionally, the very low p-value of 3.832914332163736e-20 confirms strong statistical significance: urban counties indeed have a significantly higher diabetes rate than rural counties, and the difference is very unlikely to be due to random fluctuations in the data.
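For reference, a two-sample t-test of this kind can be run with `scipy.stats.ttest_ind`. The sketch below uses synthetic data and assumes Welch's version of the test (`equal_var=False`), which may differ from the variant used for the numbers above. It also shows that the sign of the t-statistic depends only on the order in which the two groups are passed.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(2)
rural = rng.normal(8.4, 1.5, 500)
urban = rng.normal(9.0, 1.5, 600)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(rural, urban, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3e}")

# Swapping the argument order flips the sign of t but leaves p unchanged.
t_swapped, p_swapped = stats.ttest_ind(urban, rural, equal_var=False)
print(f"swapped: t = {t_swapped:.3f}, p = {p_swapped:.3e}")
```

A negative t-statistic with rural passed first means the rural sample mean is the lower one, which is how a negative t can go with "urban is higher".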

Project update (September 18,Monday)

Today, after class, we had a productive discussion with the TA, after which I decided to introduce a new variable into the model: physical inactivity. Alongside obesity and food insecurity, this variable adds an important dimension to our analysis. I then organized the data so that it accommodated this new variable for both urban and rural contexts. Following this data preparation, I conducted multiple regression analysis on our updated datasets.

Following are the results: 

For the rural dataset: R-squared value: 0.572, Skewness: 0.314, Kurtosis: 3.417, F-statistic: 888.5, Prob(F-statistic): 6.78e-246 

For the urban dataset: R-squared value: 0.507, Skewness: 0.061, Kurtosis: 3.104, F-statistic: 927.2, Prob(F-statistic): 1.11e-277

The R-squared values for both datasets indicate a moderate model fit, with a little over half of the variation in diagnosed diabetes explained. Next, I plan to dig deeper into these findings to understand what they mean and how they can practically help our goals. After that, I plan to perform cross-validation to check the accuracy of the model and to detect overfitting.
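The quantities reported above (R-squared, F-statistic, Prob(F-statistic), skewness, kurtosis) can all be computed by hand, which helps in understanding what each one measures. The sketch below is an illustration on synthetic data; `ols_summary` is a hypothetical helper, and it assumes the skewness and kurtosis refer to the regression residuals, with kurtosis on the scale where a normal distribution scores 3.

```python
import numpy as np
from scipy import stats

def ols_summary(X, y):
    """Fit y = b0 + X b by least squares and return common fit statistics."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Overall F-test: are all k slopes jointly zero?
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return {
        "r_squared": r2,
        "f_statistic": f_stat,
        "prob_f": stats.f.sf(f_stat, k, n - k - 1),
        "skew": stats.skew(resid),
        "kurtosis": stats.kurtosis(resid, fisher=False),  # normal = 3
    }

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))  # obesity, inactivity, food insecurity
y = 8.0 + X @ np.array([0.5, 0.8, 0.3]) + rng.normal(scale=1.0, size=1000)

summary = ols_summary(X, y)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

A tiny Prob(F-statistic) only says that the predictors jointly explain more than nothing; it can sit alongside a moderate R-squared, especially with many counties in the sample.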

Project Update….(Day 3)(September 15, Friday)

Our initial linear regression looked at the relationship between %Food Insecurity and %Obesity. Since we have narrowed our focus to health disparities between urban and rural populations, I analyzed the data again by performing linear regression specifically within the Urban-Rural indicator subset. The resulting R-squared value was very low, approximately 0.0025, indicating that the model offers very limited explanatory power for %Obesity based on %Food Insecurity within this subset. In my understanding, this implies that %Food Insecurity alone may not be a strong indicator of %Obesity in these areas and that other influential factors also play a significant role.

For a better understanding of the relationship between these variables, I also conducted a Pearson correlation analysis, revealing a correlation coefficient (r) of approximately 0.3538 and a high p-value.

After obtaining these results, the next step was to interpret them. After some research, I learned that a positive value of r (0.3538) indicates a positive linear relationship: as one variable increases, the other tends to increase as well. The p-value, however, measures evidence rather than strength. A high p-value means we fail to reject the null hypothesis of no correlation, so the positive relationship seen in this sample could be due to chance. The strength of the relationship is given by r itself, and 0.3538 is weak to moderate. In any case, correlation does not imply causation.
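A useful cross-check when reading r and R-squared together: in a simple linear regression with a single predictor, R-squared equals the square of the Pearson r computed on the same rows. An r of 0.3538 corresponds to an R-squared of about 0.125, so it is worth confirming whether the correlation and the regression were run on the same subset. The sketch below demonstrates the identity on synthetic data using `scipy.stats.pearsonr`; it does not reproduce the actual analysis.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the real county data).
rng = np.random.default_rng(3)
x = rng.normal(size=500)            # e.g. % food insecurity (standardized)
y = 0.35 * x + rng.normal(size=500) # e.g. % obesity (standardized)

# Pearson correlation and its p-value (null hypothesis: no correlation).
r, p_value = stats.pearsonr(x, y)

# Simple linear regression of y on x, and its R-squared.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, regression R-squared = {r2:.4f}")
print(f"p-value = {p_value:.3e}")
```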

With this, I believe that additional factors beyond %Food Insecurity are contributing to health disparities and that a larger set of variables could influence %Obesity. I plan to explore more variables and perform multiple regression on that dataset.

Here’s what I learned…(September 13, Wednesday)

In our class discussion today, we explored the concept of the p-value. To summarize my understanding, a p-value is the probability of observing a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. In practical terms, a small p-value provides strong evidence against the null hypothesis, while a larger p-value means the data do not give enough evidence to reject it.

Regarding the project, after digging deeper into the data, I had a lot of questions. Within our group, we’ve currently focused our attention on identifying health disparities between rural and urban populations. This led me to contemplate which factors, aside from ‘overall SVI’ (Social Vulnerability Index), are crucial for our analysis.

After a thoughtful discussion with one of my fellow group members, we decided to work with datasets encompassing various social determinants of health with respect to urban and rural status of the counties. Specifically, I embarked on a task to compare and correlate the percentage of food access with obesity data within urban and rural counties.

I started by formatting the datasets to suit my goal, which included merging the two datasets containing food access and obesity data using Python. After performing linear regression, I found the R-squared value to be 0.0295, which implies that the independent variable in our model has little to no explanatory power and that the model’s fit is far from ideal.
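The merge-then-regress workflow described above might look roughly like the sketch below. Everything in it is illustrative: the tiny tables are made up, and the column names (`FIPS`, `pct_food_access`, `pct_obesity`) are assumptions rather than the names in the actual files.

```python
import numpy as np
import pandas as pd

# Two small made-up tables keyed by a county identifier (hypothetical names).
food = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "pct_food_access": [12.1, 9.8, 15.3, 11.0],
})
obesity = pd.DataFrame({
    "FIPS": [1001, 1003, 1007, 1009],
    "pct_obesity": [33.2, 29.5, 31.8, 35.0],
})

# Inner merge keeps only counties present in BOTH tables.
merged = food.merge(obesity, on="FIPS", how="inner")
print(merged)

# Simple linear regression of obesity on food access, with R-squared.
x = merged["pct_food_access"].to_numpy()
y = merged["pct_obesity"].to_numpy()
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R-squared = {r2:.3f}")
```

An inner merge silently drops counties that appear in only one file, so comparing `len(merged)` with the lengths of the inputs is a quick sanity check on how much data was lost.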

As part of my ongoing efforts, I plan to enhance the model’s performance. This will involve the removal of outliers from the dataset and the incorporation of additional factors to consider. The aim is to refine our analysis and achieve more meaningful insights into the health disparities between rural and urban populations.

Monday(September 11)- Manasi Sarvankar

  1. Explored data and tried analyzing patterns: the dataset includes three main variables, namely inactivity, obesity and diabetes. I looked at the data in detail alongside different indicators and social determinants such as economics, food access, healthcare and the Social Vulnerability Index (SVI), a tool used by the CDC to spatially identify “at-risk” populations. I also attempted to correlate and identify patterns among multiple data points, including the SVI.
  2. In class, we discussed relating diabetes to inactivity. Through this example I learned more statistical terms, like “kurtosis”. Kurtosis is a measure of the tailedness of a distribution, the tails being its tapering ends. The second important term discussed today was “heteroskedasticity”. In layman’s terms, it can be described as the “fanning out” of the data: the spread of the residuals changes across the range of the predictor. The stronger the heteroskedasticity, the less reliable the model’s standard errors and significance tests. A test used to detect heteroskedasticity in a linear regression model is the “Breusch-Pagan test”.
  3. Group meeting– Today after a discussion with the group about the possibilities with the dataset, we collectively decided to concentrate on examining the prevalence and underlying factors of inactivity, obesity, and diabetes among rural and urban populations.
  4. I brushed up on my Python skills and used Python to display the data for urban and rural populations by food access and SVI, and plotted graphs of the same.
  5. I further plan to discuss my findings with my group and instructor.
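The two terms from the class discussion, kurtosis and heteroskedasticity, can be illustrated with a short sketch on synthetic data. The helper `breusch_pagan_lm` is hypothetical and implements the studentized (Koenker) form of the Breusch-Pagan test by hand: regress the squared residuals on the predictors and compare n times the auxiliary R-squared with a chi-squared distribution. Library implementations may differ in detail.

```python
import numpy as np
from scipy import stats

def breusch_pagan_lm(X, resid):
    """Studentized Breusch-Pagan LM statistic and p-value (hand-rolled)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    u2 = resid ** 2
    coef, *_ = np.linalg.lstsq(A, u2, rcond=None)
    fitted = A @ coef
    r2_aux = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm = n * r2_aux
    return lm, stats.chi2.sf(lm, k)  # k degrees of freedom

# Synthetic data whose noise "fans out" as x grows (NOT real data).
rng = np.random.default_rng(7)
x = rng.uniform(1.0, 10.0, size=500)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)  # noise sd grows with x

# Fit the main regression and collect its residuals.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

lm, bp_pvalue = breusch_pagan_lm(x.reshape(-1, 1), resid)
print(f"Breusch-Pagan LM = {lm:.2f}, p-value = {bp_pvalue:.3e}")

# Kurtosis on the scale where a normal distribution scores 3.
normal_sample = rng.normal(size=5000)
kurt = stats.kurtosis(normal_sample, fisher=False)
print(f"kurtosis of a normal sample = {kurt:.3f}")
```

A small Breusch-Pagan p-value is evidence that the residual variance depends on the predictors, which is the "fanning out" pattern visible in a residual plot.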