Research Question & Data analysis
Aim: This assessment is designed to give students the opportunity to solve a discipline-specific problem based on real data.
Students will need to construct a research question based on the Australian Health Survey data, reformulate the research question as a statistics problem, analyze the data, and communicate the results.
The data description can be found in the file:
Aus_Health_Survey
The data itself is located in the file:
npa2011
The data codebook is located in the file:
npa2011DataItems
A sample report is attached:
Report example
Research question: How do SEIFA (Socio-Economic Indexes for Areas) and equivalised household income affect hypertensive and ischaemic diseases? A possible pathway: diet (how often salt, fruit, and vegetables are consumed) influences blood pressure and cholesterol, which in turn lead to ischaemic or hypertensive disease.
Report layout:
 Executive summary (1 page max).
 Short description of the problem(s).
 What are the main findings?
 Key figure if appropriate.
 Are there shortcomings to the analysis?
 What is the clinical relevance?
 The problem:
 Longer description of the scientific problem(s).
 Translation of the scientific problem into a statistical problem.
 Relevant (to answering the question) data summaries. Data transformations. (Only if necessary).
 Analysis:
 What tools are used and why.
 What are the results?
 Use figures to illustrate the results.
 Interpret the results (to a statistician).
 Interpret the results (to a non-statistician).
Marking rubric:
Criteria | Fail | Pass | Credit | Distinction | High Distinction
Translate the scientific question into a statistical formulation (Question) | Wrong | Slight misunderstanding of the biological question. | Simplistic, or formulated to a limited capacity | Appropriate formulation. | 1) Innovative formulation. 2) Shows good understanding of the biology.
Analysis | Incorrect analysis | Direct application of techniques | No real justification | Accurate and appropriate | 1) Robust analysis (cross-validated, outliers handled). 2) Accurate. 3) Appropriate. 4) Assumptions checked.
Presentation | Figures do not match analysis. Errors in the figures. Figures do not match data. | Not informative figures | 1) Informative figures. 2) Axis labels, headings, and legend. | 1) Informative figures. 2) Axis labels, headings, and legend. 3) Visually pleasing. | 1) Informative figures. 2) Axis labels, headings, and legend. 3) Visually pleasing. 4) Innovative visualization.
Reproducibility | Code does not match figures/analysis. | Too many customized elements for lecturer to easily modify the code to get it to run. | Readable, but not fully reproducible. | Reproducible with minimal changes. | 1) Fully reproducible. 2) Stable code. 3) Runs first time without editing. 4) Self-contained.
Background
Insulin is a hormone which plays a key role in the regulation of blood glucose levels. Insulin resistance is a
pathological condition in which cells fail to respond normally to the hormone insulin. In people with insulin
resistance, the muscles and the liver resist the action of insulin, so the body must produce higher amounts
to keep the blood glucose levels within a normal range. It is more common in people with a family history of
diabetes or people who are overweight, particularly around the stomach area. A person with insulin resistance
has a greater risk of developing Type II diabetes and heart disease. Nowadays, type II diabetes and obesity
are increasingly affecting human populations around the world.
Problem
We investigate a number of scientific questions.
1. The question of which clinical measurements (e.g. weight, BMI, abdomen fat) are most closely associated
with risk of being insulin resistant. It is believed that people who are ‘apple-shaped’ and have a lot
of visceral fat have an elevated risk of developing insulin resistance, which is a precursor to type II
diabetes. We were tasked with examining the data available and determining whether the evidence
at hand supports this idea. In statistical terms, the question is whether the typical values of these
measurements differed between the OIR and OIS groups, and whether these differences can be said to
be statistically significant.
2. The question of whether certain individual genes are expressed differently in the different metabolic
groups. We had reason to believe, based on literature, that certain genes were associated with insulin
sensitivity. The statistical problem here was similar to the last case: the question was whether the
expression levels differed between the groups, and whether the differences were statistically significant.
In this case, there was the added complication of a large-scale multiple testing problem.
3. The question of whether the expression levels of certain individual genes are related to waist circumference.
In statistical terms, this is a regression problem.
4. The question of whether two particular sets of genes, the q-arm and p-arm genes on the sixth chromosome,
are expressed differently in the OIR and OIS groups, as suggested by our NUTM collaborators based
on their literature review. In statistical terms, this involved performing a gene-set test on each of the
two sets of genes.
5. The question of whether we can predict the status of a non-diabetic obese patient as insulin resistant or
insulin sensitive given their genetic data and physical measurements. In statistical terms, we want to
see whether it is feasible to classify obese patients into insulin sensitive and insulin resistant groups
given the data we have available.
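The per-gene comparison in item 2, together with its multiple-testing complication, can be sketched as follows. This is a minimal illustration on simulated data, not the analysis used in the sample report: the matrix `expr`, the group sizes, and the choice of Welch t-tests with a Benjamini-Hochberg adjustment are all assumptions made for the example.

```r
# Simulated expression data (hypothetical): 200 genes, 15 patients per group
set.seed(1)
n_genes <- 200
n_per_group <- 15
group <- factor(rep(c("OIR", "OIS"), each = n_per_group))
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)

# Make the first 10 genes truly different in the OIR group
expr[1:10, group == "OIR"] <- expr[1:10, group == "OIR"] + 2

# One Welch t-test per gene (row), comparing OIR against OIS
p_raw <- apply(expr, 1, function(x) {
  t.test(x[group == "OIR"], x[group == "OIS"])$p.value
})

# Adjust for multiple testing by controlling the false discovery rate
p_adj <- p.adjust(p_raw, method = "BH")

n_raw <- sum(p_raw < 0.05)  # genes flagged before adjustment
n_adj <- sum(p_adj < 0.05)  # genes flagged after adjustment
```

Comparing `n_raw` with `n_adj` shows how many apparent differences remain once the number of tests is taken into account.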
Solution
Introduction
Human health is affected by many factors. In this data set, we focus on how socioeconomic conditions, income, and diet relate to diabetes and cholesterol levels.
This analysis illustrates how machine learning methods can help characterise human health. Based on a person's lifestyle, the risk of diabetes mellitus or elevated cholesterol can be estimated with fairly good accuracy.
In this work, several statistical and machine learning techniques were applied:
 correlation analysis
 linear regression
 decision tree
 random forests
 association rules
I used several approaches in order to look at health problems from different angles. I tried to balance model accuracy against ease of interpretation.
As a result of the analysis, I obtained a fairly accurate classification model and identified relationships among the variables.
Correlation
Our data set contains categorical variables, so correlation coefficients computed on their numeric codes should be read only as a rough guide. We nevertheless compute pairwise correlations to get a first view of the relationships among the variables.
Correlation between variables
We see a positive correlation between BDYMSQ04 and AGEC, DIETRDI and AGEC, DIETRDI and BDYMSQ04, HCHOLBC and DIABBC.
We see a negative correlation between DIETQ14 and AGEC, DIETQ14 and BDYMSQ04, DIABBC and AGEC, HCHOLBC and AGEC.
Analysis of the network correlation structure:
We see that the variables are correlated. This suggests an association between patients' lifestyle and illness, although correlation alone does not establish causation. In the following sections we examine how lifestyle relates to health.
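Because several of the variables are categorical, a measure built on the contingency table can complement the correlation of numeric codes. The sketch below computes Cramér's V in base R. It is an illustration on simulated data, not part of the original analysis: the helper `cramers_v` and the variables `diet`, `status`, and `unrelated` are hypothetical.

```r
# Cramér's V: strength of association between two categorical variables,
# from 0 (no association) to 1 (perfect association)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

# Simulated (hypothetical) survey answers with three categories each
set.seed(2)
diet <- factor(sample(1:3, 500, replace = TRUE))
# status copies diet about 70% of the time, otherwise it is random
status <- factor(ifelse(runif(500) < 0.7,
                        as.integer(diet),
                        sample(1:3, 500, replace = TRUE)))
unrelated <- factor(sample(1:3, 500, replace = TRUE))

v_related <- cramers_v(diet, status)       # expected to be clearly above zero
v_unrelated <- cramers_v(diet, unrelated)  # expected to be close to zero
```

The same function could be applied to pairs such as DIETRDI and HCHOLBC once they are stored as factors.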
Linear Regression
Analyze DIABBC
Linear regression shows that there is a close relationship between diabetes mellitus and the Index of Relative Socio-Economic Disadvantage.
Analyze HCHOLBC
The model shows associations among blood pressure, diabetes, age and diet. From these results we infer that the risk of diabetes increases with age, and that it increases further for people who do not follow a diet. In addition, there is a relationship between diabetes mellitus and blood pressure.
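Since DIABBC and HCHOLBC are categorical outcomes, one common alternative to a linear model is a logistic regression on a binary indicator. The sketch below shows the idea on simulated data. It is not the model fitted in this report: the variables `age`, `seifa`, and `has_condition`, and the coefficients used to simulate them, are assumptions made for the example.

```r
# Simulated (hypothetical) data: age in years, SEIFA quintile, binary outcome
set.seed(3)
n <- 1000
age <- round(runif(n, 18, 85))
seifa <- sample(1:5, n, replace = TRUE)

# Simulated log-odds: risk rises with age and falls in less disadvantaged areas
lp <- -5 + 0.06 * age - 0.2 * seifa
has_condition <- rbinom(n, 1, plogis(lp))

# Logistic regression of the binary outcome on age and SEIFA quintile
fit <- glm(has_condition ~ age + seifa, family = binomial)

# Exponentiated coefficients are odds ratios per unit increase
odds_ratios <- exp(coef(fit))
```

An odds ratio above 1 for `age` would indicate that the odds of the condition increase with each additional year, which is easier to explain to a non-statistician than a slope on a category code.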
Decision tree
Analyze DIABBC
The first split of the tree is on the variable "Equivalised income of household: deciles" (INCDEC). People with INCDEC > 3 have a 73% chance of never having been told that they have diabetes mellitus. People with low income who follow a diet, however, are likely to have diabetes mellitus.
Analyze HCHOLBC
If a person does not report whether or not they adhere to a diet, then they most likely fall into the category "never told has high cholesterol".
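A percentage attached to a tree branch, such as the 73% quoted for the DIABBC tree, is the proportion of one class among the observations that fall into that branch. The sketch below shows the calculation on a tiny made-up data set. The values and the variable names `income_decile` and `never_told` are hypothetical and do not come from the survey.

```r
# Ten made-up respondents: income decile and whether they were never told
# that they have the condition
income_decile <- c(1, 2, 2, 3, 4, 5, 6, 7, 8, 9)
never_told <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# The split used by the tree: income decile above 3 or not
branch <- income_decile > 3

# Class proportion within each branch
prop_high <- mean(never_told[branch])   # among deciles above 3
prop_low <- mean(never_told[!branch])   # among deciles 1 to 3
```

Reporting the number of observations in each branch alongside the proportion helps the reader judge how reliable the percentage is.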
Random forest
Analyze DIABBC
multiclass.aunu acc mmce
0.78495927 0.94897959 0.05102041
The random forest algorithm classifies diabetes status (DIABBC) with about 95% accuracy (acc = 0.949) on the held-out half of the data.
Analyze HCHOLBC
multiclass.aunu acc mmce
0.7292729 0.8841343 0.1158657
The random forest algorithm classifies high cholesterol status (HCHOLBC) with about 88% accuracy (acc = 0.884) on the held-out half of the data.
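The two measures acc and mmce in the output above are complements: accuracy is the share of correct predictions and the mean misclassification error is one minus accuracy (for example, 0.94897959 + 0.05102041 = 1). The sketch below computes both from a confusion matrix in base R, using a small made-up set of labels that is not taken from the survey. It also computes the accuracy of always predicting the most frequent class, which is a useful baseline when one class dominates.

```r
# Made-up true and predicted class labels (three classes, as in DIABBC)
truth <- factor(c(1, 3, 3, 3, 2, 3, 3, 1, 3, 3), levels = 1:3)
pred  <- factor(c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3), levels = 1:3)

# Confusion matrix: rows are true classes, columns are predicted classes
conf_mat <- table(truth, pred)

# Accuracy is the share of observations on the diagonal
acc <- sum(diag(conf_mat)) / sum(conf_mat)

# Mean misclassification error is the complement of accuracy
mmce <- 1 - acc

# Baseline: accuracy obtained by always predicting the most frequent class
majority_acc <- max(table(truth)) / length(truth)
```

If a model's accuracy is only slightly above `majority_acc`, most of its apparent performance comes from class imbalance rather than from the predictors.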
Analyze important variables
Analyze DIABBC
Analyze HCHOLBC
To classify diabetes or cholesterol status, the most important variables are the person's age and the answers to the questions about the person's diet.
Next, let us find rules that can be used to identify health problems.
If a person has problems with diabetes mellitus:
If a person has problems with high cholesterol:
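Association rules are filtered by support and confidence, which is what the `supp` and `conf` parameters in the code control. The sketch below computes these measures, plus lift, by hand for a single rule of the form {does not follow diet} => {high cholesterol}. The eight respondents and the variable names `diet_ok` and `high_chol` are made up for illustration.

```r
# Eight made-up respondents
diet_ok   <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
high_chol <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Left-hand side of the rule: the person does not follow the diet
lhs <- !diet_ok

# Support: share of all respondents for whom both sides of the rule hold
support <- mean(lhs & high_chol)

# Confidence: share of respondents matching the left-hand side
# who also match the right-hand side
confidence <- sum(lhs & high_chol) / sum(lhs)

# Lift: confidence relative to the overall rate of the right-hand side
lift <- confidence / mean(high_chol)
```

A lift above 1 means the right-hand side occurs more often among respondents matching the left-hand side than in the sample as a whole.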
Conclusion:
A person's health is affected by their lifestyle as well as their environment. On the basis of questionnaire responses, it is possible to identify with a high degree of accuracy whether a person has diabetes or high cholesterol.
For example, a person who consulted a specialist about recommendations on eating vegetables and similar matters, but who is not on a diet and does not follow the recommendations, tends to have problems with cholesterol.
Similarly, a person who does not follow the recommendations for a healthy diet and also has high cholesterol most likely has diabetes mellitus.
As we can see, these are simple logical rules that can be interpreted and applied to assess the risk of diabetes mellitus or high cholesterol.
# Set the working directory (machine-specific path; adjust as needed)
setwd("P:/R/AS/6")
library(data.table)
library(dplyr)
library(ggplot2)
library(ClustOfVar)
library(qgraph)
library(corrplot)
df = fread("npa2011.csv", stringsAsFactors = TRUE)
df = df[,.(
AGEC,
SF2SA1QN,
INCDEC,
BDYMSQ04,
DIASTOL,
DIETQ12,
DIETQ14,
DIETQ5,
DIETQ8,
DIETRDI,
DIABBC,
HCHOLBC
)]
sum(is.na(df))
str(df)
summary(df)
# Check correlation between variables
df.correlation = cor(df)
corrplot(df.correlation)
# Build cluster on all data
model1 = hclustvar(df.correlation)
summary(model1)
plot(model1)
qgraph(df.correlation, layout = "spring")
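# Linear models: regress each outcome on all variables except the other outcome
# (the minus sign drops the named column from the data.table)
DIABBC.linearmodel = lm(DIABBC ~., data = df[,-c("HCHOLBC")])
summary(DIABBC.linearmodel)
HCHOLBC.linearmodel = lm(HCHOLBC ~., data = df[,-c("DIABBC")])
summary(HCHOLBC.linearmodel)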
# Create factor variables
df = df[,.(
AGEC =as.numeric(AGEC),
SF2SA1QN = as.factor(SF2SA1QN),
INCDEC = as.numeric(INCDEC),
BDYMSQ04 = as.factor(BDYMSQ04),
DIASTOL = as.numeric(DIASTOL),
DIETQ12 = as.factor(DIETQ12),
DIETQ14 = as.factor(DIETQ14),
DIETQ5 = as.factor(DIETQ5),
DIETQ8 = as.factor(DIETQ8),
DIETRDI = as.factor(DIETRDI),
DIABBC = as.factor(DIABBC),
HCHOLBC = as.factor(HCHOLBC)
)]
str(df)
model2 = hclustvar(scale(select_if(df, is.numeric)), select_if(df, is.factor))
plot(model2)
# Visualisations: distribution of each variable
ggplot(df, aes(SF2SA1QN)) + geom_bar()
ggplot(df, aes(INCDEC)) + geom_density()
ggplot(df, aes(DIASTOL)) + geom_density()
ggplot(df, aes(BDYMSQ04)) + geom_bar()
ggplot(df, aes(DIETQ12)) + geom_bar()
ggplot(df, aes(DIETQ14)) + geom_bar()
ggplot(df, aes(DIETQ5)) + geom_bar()
ggplot(df, aes(DIETQ8)) + geom_bar()
ggplot(df, aes(DIETRDI)) + geom_bar()
ggplot(df, aes(DIABBC)) + geom_bar()
ggplot(df, aes(AGEC)) + geom_density()
# Relationships between pairs of variables
ggplot(df, aes(x = SF2SA1QN, y = DIASTOL)) + geom_boxplot()
ggplot(df, aes(x = scale(INCDEC), y = scale(DIASTOL))) + geom_point()
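# Fit classification trees with rpart
library(rpart)
library(rpart.plot)
# DIABBC tree: exclude HCHOLBC and AGEC from the predictors
DIABBC.rpart = rpart(DIABBC ~., data = df[,-c("HCHOLBC", "AGEC")])
# predict() takes its data through the newdata argument
predict.DIABBC.rpart = predict(DIABBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])
rpart.plot(DIABBC.rpart)
# HCHOLBC tree: exclude DIABBC and AGEC from the predictors
HCHOLBC.rpart = rpart(HCHOLBC ~., data = df[,-c("DIABBC", "AGEC")])
predict.HCHOLBC.rpart = predict(HCHOLBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])
rpart.plot(HCHOLBC.rpart)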
# Random forest classification with mlr
library(mlr)
# Convert to a plain data.frame before building the task
task.DIABBC = makeClassifTask(data = as.data.frame(df[,-c("HCHOLBC")]), target = "DIABBC")
# Rank features by information gain and chi-squared statistic
fv2 = generateFilterValuesData(task.DIABBC, method = c("information.gain", "chi.squared"))
fv2$data
plotFilterValues(fv2)
#Analyse DIABBC
lrn = makeLearner("classif.randomForest", predict.type = "prob")
n = getTaskSize(task.DIABBC)
# Train on odd-numbered rows, evaluate on even-numbered rows
mod = train(lrn, task.DIABBC, subset = seq(1, n, by = 2))
pred = predict(mod, task.DIABBC, subset = seq(2, n, by = 2))
performance(pred, measures = list(multiclass.aunu, acc, mmce))
# Partial dependence of the prediction on diastolic blood pressure
pd = generatePartialDependenceData(mod, task.DIABBC, "DIASTOL")
plt = plotPartialDependence(pd)
head(plt$data)
plt
plot(Probability ~ Value, data = plt$data, type = "b", xlab = plt$data$Feature[1])
#Analyse HCHOLBC
task.HCHOLBC = makeClassifTask(data = as.data.frame(df[,-c("DIABBC")]), target = "HCHOLBC")
# Rank features for the HCHOLBC task (not the DIABBC task)
fv2 = generateFilterValuesData(task.HCHOLBC, method = c("information.gain", "chi.squared"))
fv2$data
plotFilterValues(fv2)
n = getTaskSize(task.HCHOLBC)
# Train on odd-numbered rows, evaluate on even-numbered rows
mod = train(lrn, task.HCHOLBC, subset = seq(1, n, by = 2))
pred = predict(mod, task.HCHOLBC, subset = seq(2, n, by = 2))
performance(pred, measures = list(multiclass.aunu, acc, mmce))
#Define rules
library(arules)
library(arulesViz)
df.rules =df[,.(
AGEC =as.factor(AGEC),
SF2SA1QN = as.factor(SF2SA1QN),
INCDEC = as.factor(INCDEC),
BDYMSQ04 = as.factor(BDYMSQ04),
DIASTOL = as.factor(DIASTOL),
DIETQ12 = as.factor(DIETQ12),
DIETQ14 = as.factor(DIETQ14),
DIETQ5 = as.factor(DIETQ5),
DIETQ8 = as.factor(DIETQ8),
DIETRDI = as.factor(DIETRDI),
DIABBC = as.factor(DIABBC),
HCHOLBC = as.factor(HCHOLBC)
)]
# rules DIABBC
rules <- apriori(df.rules,
parameter = list(minlen=2, supp=0.01, conf=0.05),
appearance = list(rhs=c("DIABBC=1"), default="lhs"),
control = list(verbose=FALSE))
plot(head(rules, n = 10), method="graph", control=list(type="items"))
inspect(rules)
# rules HCHOLBC
rules <- apriori(df.rules,
parameter = list(minlen=2, supp=0.01, conf=0.1),
appearance = list(rhs=c("HCHOLBC=1"), default="lhs"),
control = list(verbose=FALSE))
plot(head(rules, n = 10), method="graph", control=list(type="items"))
inspect(rules)