Report – Multiple Analysis of Variance and Covariance

For this project you will be using the NHANES data set (attached).

Goal of the project:  to find variables that may be risk factors for Diabetes in Adults (18 years or older).


  1. Create a variable for Diabetes using fasting blood glucose criteria.  You must cite the source for the criteria you use.
  2. Find the variables for age, gender, race and smoking.  You will use these four variables.
  3. Choose 4 other variables (using BMI; cholesterol – LDL; physical activity; education) from the data set that you believe may be related to Diabetes.
  4. For each of the variables from part 2 and those selected in part 3, generate an appropriate hypothesis for its relation to Diabetes.
  5. Decide on the appropriate hypothesis tests to use to test the hypotheses from part 4.
  6. Examine the variables in the data set (if one of the variables does not have a sufficient number of observations a different variable will need to be selected)
  7. Perform the appropriate hypothesis tests
  8. Do power calculations for each of the tests
  9. Submit a written summary
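
As an illustration of step 1, the following is a minimal sketch of how a diabetes variable could be derived from fasting plasma glucose. It assumes the commonly used American Diabetes Association cut-offs (126 mg/dL or higher for diabetes, 100 to 125 mg/dL for pre-diabetes), which should be checked against the source actually cited in the report. The field names `LBXGLU` (fasting glucose, mg/dL) and `RIDAGEYR` (age in years) are assumed NHANES variable names, and the function names are hypothetical.

```python
from typing import Optional

# Assumed ADA fasting plasma glucose thresholds, in mg/dL.
DIABETES_MIN_MG_DL = 126.0
PREDIABETES_MIN_MG_DL = 100.0
ADULT_MIN_AGE = 18


def classify_fasting_glucose(glucose_mg_dl: Optional[float]) -> Optional[str]:
    """Return 'normal', 'prediabetes', or 'diabetes'; None if glucose is missing."""
    if glucose_mg_dl is None:
        return None
    if glucose_mg_dl >= DIABETES_MIN_MG_DL:
        return "diabetes"
    if glucose_mg_dl >= PREDIABETES_MIN_MG_DL:
        return "prediabetes"
    return "normal"


def label_adults(records):
    """Keep adults (18+) with a glucose value and attach a diabetes status."""
    labeled = []
    for rec in records:
        age = rec.get("RIDAGEYR")        # assumed NHANES age field
        status = classify_fasting_glucose(rec.get("LBXGLU"))  # assumed glucose field
        if age is None or age < ADULT_MIN_AGE or status is None:
            continue  # exclude minors and participants without a fasting glucose value
        labeled.append({**rec, "diabetes_status": status})
    return labeled
```

Records excluded here (minors, missing glucose) should be counted and reported in the Methods section, since the summary asks which values were excluded and why.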

The written summary should be laid out as follows:

  1. Intro – background on project and clearly stated purpose of the study.   List all hypotheses that will be tested.
  2. Methods – how was the data collected? What data was collected? What type of Quality Control (QC – data cleaning) was performed? How was the data analyzed (be sure to include whether values were excluded and why)? 

demo – demographic data 

bmx – body measurement data

bpx – blood pressure data

cbq – consumer behavior

cdq – cardiovascular health

alq – alcohol use

glu – plasma fasting glucose

hdl – hdl cholesterol

ihgem – blood mercury

paq – physical activity

smq – cigarette use

trigly – cholesterol – LDL and triglycerides

whq – weight history



Today, the healthcare industry is rich in available and accessible data. With statistical analysis techniques, these large volumes of data can be used to draw useful conclusions and to uncover relationships among attributes. In-depth data analysis makes it possible to discover hidden patterns in the data and to use them for prediction.

The healthcare industry generates a large amount of data about patients and their disease diagnoses. Correct diagnosis and effective treatment prescription are among the biggest challenges facing the industry today. A wrong diagnosis can lead to disastrous consequences.

According to the Centers for Disease Control and Prevention (CDC), more than 100 million U.S. adults are now living with diabetes or pre-diabetes. In detail, according to newly released statistics, 30.3 million Americans (9.4 percent of the U.S. population) have diabetes, and another 84.1 million have pre-diabetes. Although the rate of new diabetes diagnoses remains steady, the disease continues to represent a growing health problem.

Diabetes is a serious disease. The aim of this study is therefore to identify variables that may be risk factors for diabetes among adults, using statistical analysis techniques.

The data used for this study were obtained from the diabetes section of the National Health and Nutrition Examination Survey (NHANES). This section provides personal interview data on diabetes, pre-diabetes, use of insulin or oral hypoglycemic medications, and diabetic retinopathy. It also provides self-reported information on awareness of risk factors for diabetes, general knowledge of diabetic complications, and medical or personal care associated with diabetes.

To determine the risk factors for pre-diabetes and diabetes, we will use multinomial logistic regression with diabetes status as the dependent variable and the following independent variables: age, gender, race, smoking, BMI, cholesterol – LDL, physical activity, and education level.
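
For reference, one common way to write the multinomial logistic model is sketched below. The choice of "normal" as the reference category and the symbols ($Y$ for diabetes status, $x$ for the vector of the eight predictors, $\beta_{k0}$ and $\beta_k$ for the category-specific intercept and coefficients) are assumptions made for illustration, not notation taken from the study.

```latex
\log \frac{P(Y = k \mid x)}{P(Y = \text{normal} \mid x)}
  = \beta_{k0} + \beta_k^{\top} x,
  \qquad k \in \{\text{pre-diabetes},\ \text{diabetes}\}

P(Y = k \mid x)
  = \frac{\exp\!\left(\beta_{k0} + \beta_k^{\top} x\right)}
         {1 + \sum_{j \in \{\text{pre-diabetes},\ \text{diabetes}\}}
              \exp\!\left(\beta_{j0} + \beta_j^{\top} x\right)}
```

Under this parameterization, each exponentiated coefficient can be read as the multiplicative change in the odds of category $k$ relative to the reference category for a one-unit increase in the corresponding predictor.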

Multiple analysis of variance and analysis of covariance were performed to determine whether there is any statistically significant difference in the prevalence of pre-diabetes across groups defined by the qualitative variables (BMI, race, smoking, and education level).
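
To make the group comparison concrete, here is a minimal sketch of a one-way analysis of variance F statistic computed from scratch. It is a simplified stand-in for the multi-factor analyses named above, not the procedure used in the study, and the function name `one_way_anova_f` is hypothetical. Each inner list would hold a numeric outcome (for example, fasting glucose) for one level of a grouping variable.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n > k")

    group_means = [fmean(g) for g in groups]
    grand_mean = fmean([v for g in groups for v in g])

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The F statistic would then be compared against the F distribution with the returned degrees of freedom to obtain a p-value; a statistics package is the usual choice for that step and for the covariate adjustment that analysis of covariance requires.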


The sample was drawn using a stratified multistage probability design covering the civilian noninstitutionalized resident population of the United States.

All survey participants aged 1 year and older were eligible. The questions asked varied by age and history of diabetes. Please refer to check items in the diabetes questionnaire and corresponding codebook for question-specific details about the eligible target group.

The survey questions were asked at home by trained interviewers using the Computer-Assisted Personal Interviewing (CAPI) system. Hand cards showing response categories were also used for some questions. When necessary, household interviewers read the hand cards to respondents. Participants 16 years of age and older and emancipated minors were interviewed directly. A proxy provided information for survey participants who were under 16 years of age and for participants who could not answer the questions themselves.

The CAPI system is programmed with built-in consistency checks to reduce data entry errors. CAPI also uses online help screens to assist interviewers in defining key terms used in the questionnaire.

After the data were gathered, the next step was to clean the data set. We first identified missing values and then replaced them using the mean, the median, or a more sophisticated method such as k-nearest neighbors. A descriptive analysis was carried out to characterize the sample across all available variables. In addition, we checked for outliers that might affect the results of our study and removed those extreme values.
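
The cleaning steps described above can be sketched as follows. This is an illustrative example only: it shows median imputation (one of the options mentioned) and uses the 1.5 × IQR rule to flag extreme values, which is an assumed outlier criterion since the text does not say which rule was applied. The function names are hypothetical, and missing values are represented as `None`.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (assumed criterion)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]
```

Whichever rule is used, the number of imputed and removed values per variable should be reported, because both choices can change the outcome of the hypothesis tests.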


[1] James D. Beck, “The prevalence of caries and tooth loss among participants in the Hispanic Community Health Study/Study of Latinos”

[2] Ming-Fong Chen, Chih-Jen Chang, “The Association between Nonalcoholic Fatty Pancreas Disease and Diabetes”

[3] Tabaei BP, Herman WH, “A multivariate logistic regression equation to screen for diabetes: development and validation”

[4] National Diabetes Statistics Report, 2017 Estimates of Diabetes and Its Burden in the United States