Exploring Chronic Disease Risk Using NHANES Data in the U.S.

Challenge background

The United States faces a high prevalence of chronic diseases, including diabetes, hypertension, cardiovascular disease, and obesity, which are compounded by social, demographic, and environmental factors. Although prevention begins with initiatives such as the NHANES survey program, disparities in prevalence among minority, low-income, and rural populations call for more specific, risk-predictive prevention strategies.

A local chapter project assessing chronic diseases in the U.S. is beneficial in several ways. Comprehensive datasets such as NHANES provide a consistent data source that suits both inexperienced and experienced collaborators. The dataset's manageable structure supports on-time project completion and allows people with limited technical expertise to participate and help advance proactive, effective approaches in public health. Beginner collaborators can work with it easily, and it will introduce them to the data lifecycle. The project will address a real issue while giving collaborators the skills they need to become successful data scientists.

The problem

In the United States, noncommunicable diseases such as diabetes, hypertension, and cardiovascular disorders are serious public health issues. Identifying the causes of these disorders is essential to intervening effectively. The National Health and Nutrition Examination Survey (NHANES) provides vital information about their dietary, demographic, and developmental aspects. Using this data, the project will apply predictive modeling to identify critical components of disease risk and examine the potential impact of demographic variables, including age, gender, race/ethnicity, and income level, on the risk of developing these illnesses.

This initiative aims to reduce health disparities across at-risk populations, promote the creation of personalized health care plans, and suggest policy improvements for the public health agenda. Our research can assist the FDA and HHS in assessing trends in chronic illnesses across different populations, enabling them to prevent these illnesses more effectively and improve the health of the country's residents.

The central issues explored in this chapter include:

How can NHANES data be utilized to forecast the risk of chronic diseases (including diabetes, hypertension, and cardiovascular disease) by analyzing demographic, dietary, and lifestyle information?
Which dietary and lifestyle factors are the most significant contributors to chronic disease risk?
In what ways do demographic variables such as age, gender, race/ethnicity, and income levels influence the likelihood of developing these conditions?
Is it possible to create a predictive model that identifies high-risk individuals early, thereby facilitating preventive health interventions?

Goal of the project

The objective of this project is to utilize NHANES data to inform and shape public health policies in the United States, addressing critical challenges associated with the prevention and management of chronic diseases.

Objectives:

Utilize NHANES data to derive insights that assist organizations such as the U.S. Food and Drug Administration (FDA) and the Department of Health and Human Services (HHS) in developing dietary guidelines and tracking trends in obesity, diabetes, heart disease, and various other health conditions over time.
Track chronic disease: Examine NHANES data to monitor the prevalence and progression of chronic diseases such as diabetes, hypertension, cardiovascular disease, and obesity. By highlighting important factors, this project aims to enhance our understanding of the distribution of these conditions and identify vulnerable populations.
Address Health Disparities: Investigate and assess health disparities among racial, ethnic, and socio-economic groups within the U.S. This project will also offer insight into improving health outcomes for marginalized and high-risk populations.
Facilitate Customized Healthcare: Create predictive models using NHANES data to pinpoint individuals who are at elevated risk of chronic disease. This project also aims to enhance personalized healthcare approaches by customizing prevention strategies and public health guidelines according to distinct demographic, dietary, and lifestyle characteristics.

Project timeline

Week 1: Data Access, Exploration, and Project Scoping

Goal: Familiarize participants with NHANES data, explore relevant variables, and define the research problem.

1. Day 1–2: Introduction and Dataset Overview
- Go over the NHANES dataset and its structure, covering key categories: demographics, physical activity, dietary intake, lab results, and medical history.
- Confirm that participants know how to access and download NHANES data from the CDC NHANES website.
2. Day 3–4: Data Exploration and Understanding
- Guide participants through exploring NHANES datasets (e.g., demographics, dietary intake, lab results, physical exams).
- Conduct basic data exploration using pandas to examine missing values, distributions, and correlations among variables, such as the relationship of age, dietary habits, and physical activity to laboratory measurements like cholesterol or blood glucose levels.
3. Day 5: Problem Definition & Hypothesis
- Facilitate a discussion to define the specific focus area. For example:
  - Prediction of diabetes risk based on demographic, lifestyle, and dietary factors.
  - Exploring risk factors for cardiovascular disease.
  - Identifying determinants of obesity using physical activity, dietary intake, and demographics.
- Formulate research questions and hypotheses, such as "Is there a relationship between higher consumption of processed foods and an elevated risk of cardiovascular disease?"
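
The access and exploration steps for Week 1 could be sketched roughly as follows. This is a minimal illustration, not a prescribed workflow: it assumes the NHANES files have already been downloaded in SAS transport (.XPT) format, the helper names `load_xpt` and `summarize` are hypothetical, and the small synthetic table only stands in for real data. The column names (`RIDAGEYR` for age, `BMXBMI` for BMI, `LBXTC` for total cholesterol) are assumed to follow NHANES naming and should be checked against the codebook for the survey cycle in use.

```python
import numpy as np
import pandas as pd


def load_xpt(path):
    """Read one NHANES SAS transport (.XPT) file into a DataFrame."""
    return pd.read_sas(path, format="xport")


def summarize(df):
    """Return the share of missing values per column and the correlation matrix."""
    missing_share = df.isna().mean().sort_values(ascending=False)
    correlations = df.select_dtypes("number").corr()
    return missing_share, correlations


# Synthetic stand-in for an NHANES extract (all values are made up).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(20, 80, size=n).astype(float)
demo = pd.DataFrame({
    "RIDAGEYR": age,
    "BMXBMI": rng.normal(28, 5, size=n),
    "LBXTC": 150 + 0.8 * age + rng.normal(0, 20, size=n),
})
demo.loc[::10, "LBXTC"] = np.nan  # simulate missing lab results

missing_share, correlations = summarize(demo)
print(missing_share)
print(correlations.round(2))
```

With real files, a call such as `load_xpt("DEMO_J.XPT")` (a hypothetical local path) would replace the synthetic table. Note that NHANES uses a complex survey design, so unweighted summaries like these describe the sample rather than the national population.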
Week 2: Data Cleaning, Feature Engineering, and EDA

Goal: Clean and prepare the data for analysis, perform exploratory data analysis (EDA), and create features.

1. Day 1–2: Data Cleaning
- Handle missing values and outliers in the NHANES dataset by imputing or excluding them, depending on the context.
- Combine various NHANES datasets, such as those related to demographics, dietary intake, and laboratory results, to develop a unified dataset suitable for analysis.
- Standardize categorical variables, including gender and race/ethnicity, while normalizing continuous variables such as cholesterol levels and body mass index (BMI).
2. Day 3: Feature Engineering
- Develop features that may predict chronic diseases, for example:
  - Body Mass Index (BMI), cholesterol measurements, and blood pressure readings derived from examination and laboratory data.
  - Nutritional consumption: overall caloric intake, distribution of macronutrients (carbohydrates, fats, proteins), and consumption of processed foods.
  - Exercise levels: weekly duration of moderate to vigorous physical activity.
  - Demographic information: age, sex, racial or ethnic background, income bracket, and educational attainment.
3. Day 4–5: Exploratory Data Analysis (EDA)
- Visualize relationships between predictors and outcomes using Matplotlib or Seaborn.
- Analyze relationships and interactions between variables.
- Perform statistical tests to identify significant predictors (e.g., using t-tests or ANOVA).
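
The cleaning and feature engineering steps for Week 2 might look something like the sketch below. It is only an illustration under stated assumptions: the three tiny tables are synthetic stand-ins for NHANES component files, `SEQN` is assumed to be the respondent identifier shared across files, and median imputation and z-score normalization are just one simple choice among several.

```python
import numpy as np
import pandas as pd

# Tiny synthetic tables standing in for NHANES component files (values are made up).
demographics = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "RIDAGEYR": [34, 61, 47, 29],
    "RIAGENDR": [1, 2, 2, 1],          # assumed coding: 1 = male, 2 = female
})
body = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "BMXWT": [82.0, 70.5, np.nan],     # weight in kg
    "BMXHT": [178.0, 160.0, 165.0],    # height in cm
})
labs = pd.DataFrame({
    "SEQN": [1, 2, 4],
    "LBXTC": [190.0, 245.0, 170.0],    # total cholesterol, mg/dL
})

# Combine the component files on the respondent identifier.
merged = demographics.merge(body, on="SEQN", how="left").merge(labs, on="SEQN", how="left")

# Standardize a categorical variable into readable labels.
merged["sex"] = merged["RIAGENDR"].map({1: "male", 2: "female"})

# Impute missing continuous values with the column median (one simple option).
for col in ["BMXWT", "BMXHT", "LBXTC"]:
    merged[col] = merged[col].fillna(merged[col].median())

# Engineer BMI from weight (kg) and height (cm).
merged["bmi"] = merged["BMXWT"] / (merged["BMXHT"] / 100) ** 2

# Normalize a continuous variable to zero mean and unit variance.
merged["chol_z"] = (merged["LBXTC"] - merged["LBXTC"].mean()) / merged["LBXTC"].std()

print(merged[["SEQN", "sex", "bmi", "chol_z"]].round(2))
```

A left join keeps every respondent from the demographics table, which makes missing examination or lab records visible as gaps instead of silently dropping those participants.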
Week 3: Model Development and Training

Goal: Build and train predictive models for chronic disease risk using NHANES data.

1. Day 1: Train-Test Split & Model Selection
- Split the data into training and test sets (e.g., 80% training, 20% test).
- Select models based on the problem definition. For example:
  - Logistic regression (for binary classification tasks like predicting diabetes or cardiovascular disease).
  - Random forest, XGBoost, or Support Vector Machines (for more advanced models).
  - Neural networks (optional) for participants with deep learning experience.
2. Day 2–3: Baseline Model Development
- Develop a baseline logistic regression or decision tree model.
- Train the model using the training data and evaluate it using accuracy, precision, recall, and F1-score on the test set.
- Generate a confusion matrix to analyze model performance.
3. Day 4–5: Advanced Model Training & Tuning
- Develop more complex models such as Random Forest or XGBoost and fine-tune hyperparameters using cross-validation.
- Train the models and compare performance metrics such as AUC-ROC (area under the receiver operating characteristic curve) or precision-recall.
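
The Week 3 modeling steps can be sketched with scikit-learn as shown below. This is a hedged example, not the project's definitive pipeline: the features and outcome are synthetic, the hyperparameter grid is deliberately tiny, and a random forest stands in for the more complex model so that the example needs no extra packages such as XGBoost.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for age, BMI, and weekly activity minutes.
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.uniform(20, 80, n),      # age in years
    rng.normal(28, 5, n),        # BMI
    rng.uniform(0, 300, n),      # activity minutes per week
])
# Made-up outcome: risk rises with age and BMI and falls with activity.
logit = 0.05 * (X[:, 0] - 50) + 0.15 * (X[:, 1] - 28) - 0.005 * (X[:, 2] - 150)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# 80/20 split, stratified so both sets keep a similar class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: logistic regression on standardized features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(confusion_matrix(y_test, baseline_pred))
print(classification_report(y_test, baseline_pred))

# More complex model: random forest tuned with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
forest_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"baseline AUC={baseline_auc:.3f}, forest AUC={forest_auc:.3f}")
print("best parameters:", search.best_params_)
```

Tuning happens only on the training data through cross-validation, so the held-out test set is used once, for the final comparison.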
Week 4: Evaluation of Models, Insights Generation, and Presentation

Goal: Assess models, extract insights, and communicate results.

1. Days 1–2: Model Assessment
- Assess the final model utilizing various metrics (e.g., accuracy, AUC-ROC, F1-score).
- Analyze model coefficients or feature importance (e.g., through SHAP values or feature importance plots) to identify the most influential predictors of disease risk.
- Detect potential biases within the model (e.g., by evaluating performance across diverse demographic groups).
2. Day 3: Insights and Ethical Considerations
- Examine significant insights derived from the model. For instance:
  - Which dietary and lifestyle elements are the most significant predictors of diabetes or cardiovascular disease risk?
  - Are there unexpected correlations between demographic factors and health outcomes?
- Investigate the ethical ramifications of employing predictive models in healthcare, focusing on fairness, bias, and transparency.
3. Days 4–5: Final Presentation and Conclusion
- Each team or individual showcases their final models, principal findings, and visual representations.
- Discuss how these findings can guide public health policies or individual health choices.
- Contemplate potential extensions or enhancements for the project.
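
One way to rank predictors during the Week 4 assessment is sketched below. It uses scikit-learn's permutation importance in place of SHAP values, purely to keep the example dependency-free; the data, the feature names, and the built-in `noise` column are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["age", "bmi", "noise"]  # hypothetical feature names

# Synthetic data: age and BMI drive the outcome, "noise" does not.
rng = np.random.default_rng(7)
n = 1000
X = np.column_stack([
    rng.uniform(20, 80, n),
    rng.normal(28, 5, n),
    rng.normal(0, 1, n),
])
logit = 0.08 * (X[:, 0] - 50) + 0.1 * (X[:, 1] - 28)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=7)
model.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=7
)
ranking = sorted(
    zip(feature_names, result.importances_mean), key=lambda pair: pair[1], reverse=True
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature whose shuffling barely changes the score contributes little to the model's predictions, though correlated features can share importance and should be interpreted together.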

What you'll learn

1. Predictive Models for Chronic Disease Risk:

Logistic regression, Random Forest, XGBoost, or other models developed to predict risks of chronic diseases like diabetes, cardiovascular disease, or hypertension.
Models will be evaluated based on metrics such as accuracy, precision, recall, F1-score, and AUC-ROC.

2. Data-Driven Insights:

Identification of significant predictors of chronic disease risk, such as diet (processed foods, caloric intake), physical activity, and demographic factors (age, sex, ethnicity).
Understanding of interactions between lifestyle, dietary habits, and health outcomes.

3. EDA:

Use statistical tests and visualizations to identify correlations between factors such as physical activity and diet and indicators of chronic disease such as BMI and cholesterol. Comparisons across demographic groups will provide insights and point out important risk factors.
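
As a rough illustration of such tests, the sketch below runs a t-test and a one-way ANOVA with SciPy. Everything in it is assumed for demonstration: the data is synthetic, the `active` flag and `income_group` labels are hypothetical, and the BMI difference between groups is simulated on purpose.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "active": rng.integers(0, 2, n),                        # 1 = meets an activity target
    "income_group": rng.choice(["low", "middle", "high"], n),
})
# Made-up effect: active participants have a BMI about 3 units lower.
df["bmi"] = 29 - 3 * df["active"] + rng.normal(0, 4, n)

# Welch's t-test: BMI of active vs. inactive participants.
t_stat, t_p = stats.ttest_ind(
    df.loc[df["active"] == 1, "bmi"],
    df.loc[df["active"] == 0, "bmi"],
    equal_var=False,
)

# One-way ANOVA: BMI across income groups (no effect was simulated here).
groups = [part["bmi"].to_numpy() for _, part in df.groupby("income_group")]
f_stat, f_p = stats.f_oneway(*groups)

print(f"t-test: t={t_stat:.2f}, p={t_p:.4f}")
print(f"ANOVA:  F={f_stat:.2f}, p={f_p:.4f}")
```

A small p-value indicates that a difference this large would be unlikely if the groups had equal means; it does not by itself show that the factor causes the outcome.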

4. Model Interpretation and the Significance of Features:

Identify variables that are predictors of chronic disease.
Assess whether biases exist in the model predictions based on demographic characteristics.
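
A simple bias check compares one metric across demographic groups. The sketch below does this for recall; the results table is entirely hypothetical, the helper name `recall_by_group` is an assumption, and a real assessment would likely look at several metrics and larger groups.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical test-set results: true labels, model predictions, and a demographic group.
results = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 0, 1, 1, 1, 0],
    "y_pred": [1, 1, 0, 1, 1, 0, 0, 0],
})


def recall_by_group(frame):
    """Recall (share of true cases the model catches) for each group."""
    return {
        name: recall_score(part["y_true"], part["y_pred"])
        for name, part in frame.groupby("group")
    }


per_group = recall_by_group(results)
gap = max(per_group.values()) - min(per_group.values())
print(per_group)
print(f"recall gap between groups: {gap:.2f}")
```

In this made-up table the model finds every true case in group A but only one of three in group B, the kind of gap that would call for a closer look at the training data and features.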

5. Ethical Considerations:

A reflection on the ethical implications of using predictive AI models in healthcare.

6. Real World Experience:

Collaborators will gain practical experience through this project, addressing common skill gaps such as familiarity with the data lifecycle and other essential data science techniques. This initiative focuses on closing those gaps by providing a dataset that is manageable for beginners, ensuring that everyone can actively engage and develop critical skills without being overwhelmed by complexity.


This challenge is hosted by

San Jose, USA Chapter
