# Research Question & Data analysis

Aim: This assessment is designed to give students the opportunity to solve a discipline-specific problem based on real data.

Students will need to construct a research question based on the Australian Health Survey data, reformulate the research question as a statistics problem, analyze the data, and communicate the results.

The data description can be found in the file:

Aus_Health_Survey

The data itself is located in the file:

npa2011

The data codebook is located in the file:

npa2011DataItems

A sample report is attached:

Report example

How do SEIFA (Socio-Economic Indexes for Areas) and equivalised household income affect hypertensive and ischaemic diseases? Diet is considered as an intermediate factor: how often a person eats salt, fruit and vegetables affects blood pressure and cholesterol, which in turn lead to ischaemic or hypertensive disease.

Report layout:

• Executive summary (1 page max).
    • Short description of the problem(s).
    • What are the main findings?
    • Key figure if appropriate.
    • Are there shortcomings to the analysis?
    • What is the clinical relevance?
• The problem:
    • Longer description of the scientific problem(s).
    • Translation of the scientific problem into a statistical problem.
    • Data summaries relevant to answering the question. Data transformations (only if necessary).
• Analysis:
    • What tools are used and why.
    • What are the results?
    • Use figures to illustrate the results.
    • Interpret the results (to a statistician).
    • Interpret the results (to a non-statistician).

Marking rubric:

| Criteria | Fail | Pass | Credit | Distinction | High Distinction |
|---|---|---|---|---|---|
| Translate the scientific question into a statistical formulation (Question) | Wrong | Slight misunderstanding of the biological question. | Simplistic, or formulated to a limited capacity | Appropriate formulation. | 1) Innovative formulation. 2) Shows good understanding of the biology. |
| Analysis | Incorrect analysis | Direct application of techniques | No real justification | Accurate and appropriate | 1) Robust analysis (cross-validated, outliers handled). 2) Accurate. 3) Appropriate. 4) Assumptions checked. |
| Presentation | Figures do not match analysis. Errors in the figures. Figures do not match data. | Not informative figures | 1) Informative figures. 2) Axis labels, headings, and legend. | 1) Informative figures. 2) Axis labels, headings, and legend. 3) Visually pleasing. | 1) Informative figures. 2) Axis labels, headings, and legend. 3) Visually pleasing. 4) Innovative visualization. |
| Reproducibility | Code does not match figures/analysis. | Too many customized elements for lecturer to easily modify the code to get it to run. | Readable, but not fully reproducible. | Reproducible with minimal changes. | 1) Fully reproducible. 2) Stable code. 3) Runs first time without editing. 4) Self-contained. |

Background
Insulin is a hormone which plays a key role in the regulation of blood glucose levels. Insulin resistance is a
pathological condition in which cells fail to respond normally to the hormone insulin. In people with insulin
resistance, the muscles and the liver resist the action of insulin, so the body must produce higher amounts
to keep the blood glucose levels within a normal range. It is more common in people with a family history of
diabetes or people who are overweight, particularly around the stomach area. A person with insulin resistance
has a greater risk of developing Type II diabetes and heart disease. Nowadays, type II diabetes and obesity
are increasingly affecting human populations around the world.

Problem
We investigate a number of scientific questions.
1. The question of which clinical measurements (e.g. weight, BMI, abdomen fat) are most closely associated
with risk of being insulin resistant. It is believed that people who are ‘apple-shaped’ and have a lot
of visceral fat have an elevated risk of developing insulin resistance, which is a precursor to type II
diabetes. We were tasked with examining the data available and determining whether the evidence
at hand supports this idea. In statistical terms, the question is whether the typical values of these
measurements differed between the OIR and OIS groups, and whether these differences can be said to
be statistically significant.
2. The question of whether certain individual genes are expressed differently in the different metabolic
groups. We had reason to believe, based on literature, that certain genes were associated with insulin
sensitivity. The statistical problem here was similar to the last case: the question was whether the
expression levels differed between the groups, and whether the differences were statistically significant.
In this case, there was the added complication of a large-scale multiple testing problem.
3. The question of whether the expression levels of certain individual genes are related to waist circumference.
In statistical terms, this is a regression problem.
4. The question of whether two particular sets of genes, the q-arm and p-arm genes on the sixth chromosome,
are expressed differently in the OIR and OIS groups, as suggested by our NUTM collaborators based
on their literature review. In statistical terms, this involved performing a gene-set test on each of the
two sets of genes.
5. The question of whether we can predict the status of a nondiabetic obese patient as insulin resistant or
insulin sensitive given their genetic data and physical measurements. In statistical terms, we want to
see whether it is feasible to classify obese patients into insulin sensitive and insulin resistant groups
given the data we have available.
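
Questions 1 and 2 above both reduce to comparing two groups, with a multiple-testing correction when many genes are tested at once. The sketch below illustrates that idea on simulated data only. The sample sizes, the waist values, the number of genes, and the choice of a Welch t-test with a Benjamini-Hochberg adjustment are all assumptions made for illustration, not the methods or data of the sample report.

```r
# Simulated data only: the group labels (OIR/OIS) follow the text above,
# but every number below is made up for illustration.
set.seed(1)
n_per_group <- 30
group <- factor(rep(c("OIR", "OIS"), each = n_per_group))

# Question 1: one clinical measurement (hypothetical waist values in cm).
waist <- c(rnorm(n_per_group, mean = 110, sd = 10),
           rnorm(n_per_group, mean = 100, sd = 10))
waist_test <- t.test(waist ~ group)  # Welch two-sample t-test
waist_test$p.value

# Question 2: many genes tested at once, so the p-values need a
# multiplicity correction. Here every simulated gene is pure noise.
n_genes <- 200
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
p_raw <- apply(expr, 1, function(g) t.test(g ~ group)$p.value)
p_bh <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg adjustment

# Count how many genes fall below 0.05 before and after adjustment.
c(raw_below_0.05 = sum(p_raw < 0.05), bh_below_0.05 = sum(p_bh < 0.05))
```

Because the adjusted p-values are never smaller than the raw ones, the second count cannot exceed the first.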

Solution

Introduction

Human health is affected by many factors. In our data set, we will focus on how socio-economic conditions, income, and diet affect diabetes and the level of cholesterol.

This task shows how useful machine learning methods can be in assessing human health. Based on a person's lifestyle, the potential risk of diabetes mellitus or elevated cholesterol can be estimated with reasonable accuracy.

In this work, several statistical and machine learning techniques were applied:

• correlation analysis
• linear regression
• decision tree
• random forests
• association rules

I used several approaches in order to look at the health problems from different angles. I tried to balance the accuracy of the models against how simply they can be explained.

As a result of the analysis, I obtained a fairly accurate classification model and identified relationships between the variables.

## Correlation

Our data set contains categorical variables, which are stored as numeric codes. We nevertheless compute correlations between the variables to get a first indication of how they are related.

Correlation between variables

We see a positive correlation between BDYMSQ04 and AGEC, DIETRDI and AGEC, DIETRDI and BDYMSQ04, HCHOLBC and DIABBC.

We see a negative correlation between DIETQ14 and AGEC, DIETQ14 and BDYMSQ04, DIABBC and AGEC, HCHOLBC and AGEC.
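
Since several of these variables are survey codes rather than measurements, a rank-based coefficient is often a safer choice than the default Pearson coefficient, and neither is meaningful for codes that have no natural order. The sketch below compares the two on simulated data. The variables `age` and `ordinal_code` are hypothetical stand-ins and are not columns of npa2011.

```r
# Simulated stand-ins: an age in years and a 1-5 ordinal survey code
# that tends to increase with age. None of this is real survey data.
set.seed(42)
n <- 500
age <- sample(18:85, n, replace = TRUE)
ordinal_code <- cut(age + rnorm(n, sd = 25),
                    breaks = c(-Inf, 30, 45, 60, 75, Inf),
                    labels = FALSE)  # integer codes 1..5

# Pearson treats the codes as equally spaced numbers;
# Spearman only uses their rank order.
pearson <- cor(age, ordinal_code, method = "pearson")
spearman <- cor(age, ordinal_code, method = "spearman")
c(pearson = pearson, spearman = spearman)

# A test of the rank correlation (exact = FALSE because of ties).
cor.test(age, ordinal_code, method = "spearman", exact = FALSE)
```

The same call works on a whole data frame, for example `cor(x, method = "spearman")`, provided every column is numeric and ordered.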

Analysis of the network correlation structure:

We see that correlations exist between the variables, which suggests a relationship between patients' lifestyle and illness. Next, we examine how a patient's lifestyle affects their health.

## Linear regression

Analyze DIABBC

Linear regression shows a close relationship between diabetes mellitus and the Index of Relative Socio-Economic Disadvantage.

Analyze HCHOLBC

The data show a correlation between blood pressure, diabetes, age and diet. From the data we can infer that the risk of diabetes increases with age, and that the risk increases further if people do not adhere to a diet. In addition, there is a relationship between diabetes mellitus and blood pressure.
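
Because DIABBC and HCHOLBC are coded categories rather than continuous measurements, a logistic regression on a yes/no version of the outcome is a common alternative to `lm`. The sketch below is not the model fitted in this report. It uses simulated data, and the column names `has_condition`, `age` and `income_decile` as well as the effect sizes are assumptions chosen for illustration.

```r
# Simulated data: the probability of the condition rises with age and
# falls with income decile. All numbers are made up.
set.seed(7)
n <- 1000
age <- sample(18:85, n, replace = TRUE)
income_decile <- sample(1:10, n, replace = TRUE)
p <- plogis(-4 + 0.05 * age - 0.1 * income_decile)
has_condition <- rbinom(n, size = 1, prob = p)
sim <- data.frame(has_condition, age, income_decile)

# Logistic regression: models the log-odds of the binary outcome.
fit <- glm(has_condition ~ age + income_decile,
           data = sim, family = binomial)
summary(fit)$coefficients

# Exponentiated coefficients are odds ratios per one-unit increase.
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio above 1 for `age` means the odds of the condition increase with each additional year, holding income decile fixed.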

## Decision tree

Analyze DIABBC

The first split of the tree is on the variable Equivalised income of household: deciles. If Equivalised income of household: deciles > 3, then with a 73% chance such people have never had diabetes mellitus. But people who have a low income and adhere to a diet are likely to suffer from diabetes mellitus.

Analyze HCHOLBC

If the question about dietary adherence does not apply to a person, then most likely that person has never been told they have high cholesterol.
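
Reading a tree as a set of if/then rules is easier to follow with a small example. The sketch below fits a classification tree with `rpart` on simulated data and scores it on rows that were not used for fitting, which relies on the `newdata` argument of `predict`. The column names, the 0.7 and 0.15 probabilities, and the odd/even split are assumptions for illustration, not values from the survey.

```r
library(rpart)

# Simulated data: low income (decile <= 3) raises the chance of having
# been told about the condition. All numbers are made up.
set.seed(3)
n <- 600
sim <- data.frame(
  income_decile = sample(1:10, n, replace = TRUE),
  on_diet = factor(sample(c("yes", "no"), n, replace = TRUE))
)
p <- ifelse(sim$income_decile <= 3, 0.7, 0.15)
sim$status <- factor(ifelse(runif(n) < p, "told", "never_told"))

# Fit on odd rows, evaluate on even rows.
train_idx <- seq(1, n, by = 2)
test <- sim[-train_idx, ]
tree <- rpart(status ~ income_decile + on_diet,
              data = sim[train_idx, ], method = "class")

# newdata (not data) is the argument that selects the rows to predict.
pred <- predict(tree, newdata = test, type = "class")
holdout_acc <- mean(pred == test$status)
table(predicted = pred, actual = test$status)
holdout_acc
```

Printing `tree` lists each split and the class proportions in each leaf, which is where statements such as "with a 73% chance" come from.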

## Random forest

Analyze DIABBC

| multiclass.aunu | acc | mmce |
|---|---|---|
| 0.78495927 | 0.94897959 | 0.05102041 |

The random forest classifies diabetes status (DIABBC) with about 95% accuracy on the held-out half of the data.

Analyze HCHOLBC

| multiclass.aunu | acc | mmce |
|---|---|---|
| 0.7292729 | 0.8841343 | 0.1158657 |

The random forest classifies high-cholesterol status (HCHOLBC) with about 88% accuracy on the held-out half of the data.
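
In the reported results, `acc` and `mmce` sum to one for both outcomes (0.94897959 + 0.05102041 and 0.8841343 + 0.1158657), because the mean misclassification error is the complement of accuracy. The sketch below computes both by hand and compares them with a majority-class baseline, which matters when one class dominates. The labels are hypothetical and are not taken from the survey.

```r
# Hypothetical true labels for 20 held-out respondents (made up):
# 17 in the majority class, 3 in the minority class.
truth <- factor(c(rep("never_told", 17), rep("told", 3)))

# A "classifier" that always predicts the majority class.
pred <- factor(rep("never_told", 20), levels = levels(truth))

acc <- mean(pred == truth)   # share of correct predictions
mmce <- mean(pred != truth)  # mean misclassification error = 1 - acc

# Accuracy obtained by always guessing the most frequent class.
baseline <- max(table(truth)) / length(truth)

c(acc = acc, mmce = mmce, baseline = baseline)
```

Here the trivial classifier already reaches 85% accuracy, so a model's accuracy is best read against this baseline rather than against 50%.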

## Analyze important variables

Analyze DIABBC

Analyze HCHOLBC

To classify diabetes or cholesterol status, the most important variables are the person's age and the answers to the questions about the person's diet.
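
One simple way to see why a categorical variable ranks as important is a chi-squared test of association between that variable and the outcome. The sketch below runs `chisq.test` on simulated data. It is only in the spirit of a chi-squared filter and is not claimed to reproduce how the ranking in this report was computed. The names `diet_code`, `noise` and `status` are hypothetical.

```r
# Simulated data: diet_code influences the outcome, noise does not.
# All numbers are made up for illustration.
set.seed(11)
n <- 800
diet_code <- factor(sample(1:3, n, replace = TRUE))
p <- c(0.10, 0.25, 0.40)[as.integer(diet_code)]
status <- factor(ifelse(runif(n) < p, "told", "never_told"))
noise <- factor(sample(1:3, n, replace = TRUE))

# Chi-squared tests on the two contingency tables.
chi_diet <- chisq.test(table(diet_code, status))
chi_noise <- chisq.test(table(noise, status))

# A larger statistic indicates a stronger association with the outcome.
c(diet = unname(chi_diet$statistic), noise = unname(chi_noise$statistic))
c(diet = chi_diet$p.value, noise = chi_noise$p.value)
```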

## Rules for diagnosing problems with human health

If a person has problems with diabetes mellitus:

If a person has problems with high cholesterol:
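
The rule search in the code listing is filtered by support and confidence thresholds (`supp = 0.01`, with `conf = 0.05` for DIABBC and `conf = 0.1` for HCHOLBC). As an illustration of what those measures mean, and not of the rules actually found in this analysis, the sketch below computes support, confidence and lift by hand for a single rule. The ten yes/no answers are hypothetical.

```r
# Hypothetical answers for 10 respondents (made up):
# TRUE means the statement holds for that respondent.
no_diet <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0) == 1
high_chol <- c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0) == 1

# Rule under study: {no_diet} => {high_chol}
support <- mean(no_diet & high_chol)     # share of rows with both: 3/10
confidence <- support / mean(no_diet)    # P(high_chol | no_diet): 0.3/0.6
lift <- confidence / mean(high_chol)     # confidence relative to base rate

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 (here 1.25) means the right-hand side is more common among rows matching the left-hand side than in the data as a whole.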

Conclusion:

A person's health is affected by lifestyle as well as by the environment. On the basis of questionnaire answers, it is possible to identify with a high degree of accuracy whether a person has a problem with diabetes or cholesterol.

For example, it turned out that if a person consulted a specialist about recommendations for eating vegetables and so on, but is not on a diet and does not adhere to the recommendations, then such a person is likely to have problems with cholesterol.

Also, it turned out that if a person does not follow the recommendations for a healthy diet and at the same time has a high level of cholesterol, then this person most likely has diabetes mellitus.

As we can see, these are simple logical rules that can be interpreted and applied to determine the risk of diabetes mellitus or high cholesterol.

setwd("P:/R/AS/6")

library(data.table)

library(dplyr)

library(ggplot2)

library(ClustOfVar)

library(qgraph)

library(corrplot)

df = fread("npa2011.csv", stringsAsFactors = TRUE)

df = df[,.(

AGEC,

SF2SA1QN,

INCDEC,

BDYMSQ04,

DIASTOL,

DIETQ12,

DIETQ14,

DIETQ5,

DIETQ8,

DIETRDI,

DIABBC,

HCHOLBC

)]

sum(is.na(df))

str(df)

summary(df)

# Check correlation between variables

df.correlation = cor(df)

corrplot(df.correlation)

# Build cluster on all data

model1 = hclustvar(df.correlation)

summary(model1)

plot(model1)

qgraph(df.correlation, layout = "spring")

# Linear model

DIABBC.linearmodel = lm(DIABBC ~., data = df[,-c("HCHOLBC")])

summary(DIABBC.linearmodel)

HCHOLBC.linearmodel = lm(HCHOLBC ~., data = df[,-c("DIABBC")])

summary(HCHOLBC.linearmodel)

# Create factor variables

df = df[,.(

AGEC = as.numeric(AGEC),

SF2SA1QN = as.factor(SF2SA1QN),

INCDEC = as.numeric(INCDEC),

BDYMSQ04 = as.factor(BDYMSQ04),

DIASTOL = as.numeric(DIASTOL),

DIETQ12 = as.factor(DIETQ12),

DIETQ14 = as.factor(DIETQ14),

DIETQ5 = as.factor(DIETQ5),

DIETQ8 = as.factor(DIETQ8),

DIETRDI = as.factor(DIETRDI),

DIABBC = as.factor(DIABBC),

HCHOLBC = as.factor(HCHOLBC)

)]

str(df)

model2 = hclustvar(scale(select_if(df, is.numeric)), select_if(df, is.factor))

plot(model2)

# Visualisations

ggplot(df, aes(SF2SA1QN)) + geom_bar()

ggplot(df, aes(INCDEC)) + geom_density()

ggplot(df, aes(DIASTOL)) + geom_density()

ggplot(df, aes(BDYMSQ04)) + geom_bar()

ggplot(df, aes(DIETQ12)) + geom_bar()

ggplot(df, aes(DIETQ14)) + geom_bar()

ggplot(df, aes(DIETQ5)) + geom_bar()

ggplot(df, aes(DIETQ8)) + geom_bar()

ggplot(df, aes(DIETRDI)) + geom_bar()

ggplot(df, aes(DIABBC)) + geom_bar()

ggplot(df, aes(AGEC)) + geom_density()

# Relationships between pairs of variables

ggplot(df, aes(x = SF2SA1QN, y = DIASTOL)) + geom_boxplot()

ggplot(df, aes(x = scale(INCDEC), y = scale(DIASTOL))) + geom_point()

# Fit rpart decision-tree models

library(rpart)

library(rpart.plot)

DIABBC.rpart = rpart(DIABBC ~., data = df[,-c("HCHOLBC", "AGEC")])

# predict() for rpart takes newdata; an argument named data is silently ignored

predict.DIABBC.rpart = predict(DIABBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])

rpart.plot(DIABBC.rpart)

HCHOLBC.rpart = rpart(HCHOLBC ~., data = df[,-c("DIABBC", "AGEC")])

predict.HCHOLBC.rpart = predict(HCHOLBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])

rpart.plot(HCHOLBC.rpart)

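library(mlr)

# The tasks and n are used below but were not defined in the original listing.

# Assumption: each task uses all remaining columns as features and drops the other outcome.

task.DIABBC = makeClassifTask(data = as.data.frame(df[,-c("HCHOLBC")]), target = "DIABBC")

task.HCHOLBC = makeClassifTask(data = as.data.frame(df[,-c("DIABBC")]), target = "HCHOLBC")

n = nrow(df)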

fv2 = generateFilterValuesData(task.DIABBC, method = c("information.gain", "chi.squared"))

fv2$data

plotFilterValues(fv2)

# Analyse DIABBC

lrn = makeLearner("classif.randomForest", predict.type = "prob")

mod = train(lrn, task.DIABBC, subset = seq(1, n, by = 2))

pred = predict(mod, task.DIABBC, subset = seq(2, n, by = 2))

performance(pred, measures = list(multiclass.aunu, acc, mmce))

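# pd was not defined in the original listing; the feature "AGEC" is an assumption

pd = generatePartialDependenceData(mod, task.DIABBC, features = "AGEC")

plt = plotPartialDependence(pd)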

plt

plot(Probability ~ Value, data = plt$data, type = "b", xlab = plt$data$Feature[1])

# Analyse HCHOLBC

fv2 = generateFilterValuesData(task.HCHOLBC, method = c("information.gain", "chi.squared"))

fv2$data

plotFilterValues(fv2)

mod = train(lrn, task.HCHOLBC, subset = seq(1, n, by = 2))

pred = predict(mod, task.HCHOLBC, subset = seq(2, n, by = 2))

performance(pred, measures = list(multiclass.aunu, acc, mmce))

# Define rules

library(arules)

library(arulesViz)

df.rules = df[,.(

AGEC = as.factor(AGEC),

SF2SA1QN = as.factor(SF2SA1QN),

INCDEC = as.factor(INCDEC),

BDYMSQ04 = as.factor(BDYMSQ04),

DIASTOL = as.factor(DIASTOL),

DIETQ12 = as.factor(DIETQ12),

DIETQ14 = as.factor(DIETQ14),

DIETQ5 = as.factor(DIETQ5),

DIETQ8 = as.factor(DIETQ8),

DIETRDI = as.factor(DIETRDI),

DIABBC = as.factor(DIABBC),

HCHOLBC = as.factor(HCHOLBC)

)]

# rules DIABBC

rules<- apriori(df.rules,

parameter = list(minlen=2, supp=0.01, conf=0.05),

appearance = list(rhs=c("DIABBC=1"), default="lhs"),

control = list(verbose=F))

inspect(rules)

# rules HCHOLBC

rules<- apriori(df.rules,

parameter = list(minlen=2, supp=0.01, conf=0.1),

appearance = list(rhs=c("HCHOLBC=1"), default="lhs"),

control = list(verbose=F))