Multivariate Analysis

Q.1

This problem is concerned with the identification of forged bank notes. The file “bnotes.csv” has data from six measurements which roughly quantify the size and the position of the printed image on 1000-franc Swiss bank notes. The file contains data for a sample of 100 genuine 1000-franc bank notes. There is one line of data for each bank note in the sample, with the data arranged in the following order:
X1 Length of the bill (mm)
X2 Width of the bill on the left side (mm)
X3 Width of the bill on the right side (mm)
X4 Width of the margin at the bottom (mm)
X5 Width of the margin at the top (mm)
X6 Diagonal length of the printed image (mm)
It is relatively easy to obtain a sample from the population of genuine notes, but
it would be difficult to sample from the population of forged notes for obvious
reasons. In this case information is available only for the genuine notes, and no
information is available for forged notes. The identification problem consists of
using the data available from the sample of genuine notes to develop a procedure
for deciding whether notes of uncertain origin are genuine or forged. We will first
try to determine if the variation in the six measurements on genuine notes can be
approximately described by a normal distribution. If so, we will use the information
in the sample of 100 genuine notes to determine whether or not a suspect note was
likely to come from the population of genuine notes.
(a) Report the values of the Shapiro-Wilk statistic and associated p-values for
each of the six variables. Also examine corresponding univariate normal probability plots (qqplots). State your conclusions. Keep in mind that the data
are discrete because of the limited accuracy of the measuring instruments.
Consequently, the probability distribution of the six measurements on genuine
notes cannot be exactly multivariate normal. We only want to know if the
multivariate normal distribution is a good approximation.
(b) Examine the chi-square probability plot. What does it indicate about the
suitability of a 6-dimensional normal model?

(c) If you conclude that the distribution of any measurement is not reasonably well
modeled by a normal distribution, examine the possibility of finding transformations to improve the fit of the normal model. List the transformations you think should be used for each measurement. Report ‘none’ if no transformation is required.
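
One common way to search for a normalizing transformation in part (c) is the Box-Cox power family. The sketch below is an illustration only: the helper name boxcox_lambda, the grid of candidate powers, and the simulated stand-in data are assumptions, not part of the assignment. A selected power near 1 would suggest reporting ‘none’.

```r
# Profile log-likelihood for the Box-Cox power family, one variable at a time.
# boxcox_lambda() is a hypothetical helper written for this sketch.
boxcox_lambda <- function(x, grid = seq(-2, 2, by = 0.01)) {
  stopifnot(all(x > 0))                 # power transforms need positive data
  n <- length(x)
  loglik <- sapply(grid, function(lam) {
    y <- if (abs(lam) < 1e-8) log(x) else (x^lam - 1) / lam
    s2 <- mean((y - mean(y))^2)         # ML variance of the transformed data
    -n / 2 * log(s2) + (lam - 1) * sum(log(x))
  })
  grid[which.max(loglik)]               # power with the largest log-likelihood
}

# Stand-in data (assumption): a right-skewed lognormal sample, for which the
# selected power should be close to 0, i.e. a log transformation.
set.seed(1)
x_sim <- exp(rnorm(200))
lambda_hat <- boxcox_lambda(x_sim)
lambda_hat

# With the bank note data loaded as a data frame `x`, the same helper could be
# applied column by column, e.g. sapply(x, boxcox_lambda).
```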

Q.2

A number of patients with bronchus cancer were treated with ascorbate and compared with matched control patients who received no ascorbate (Cameron & Pauling, 1978). The
explanation of variables is as follows:
y1 = patient (with ascorbate): survival time (days) from date of first hospital admission.
x1 = matched control (without ascorbate): survival time (days) from date of first hospital admission.
y2 = patient: survival time (days) from date of untreatability
x2 = matched control: survival time (days) from date of untreatability.
(a)

Compare y1 and y2 with x1 and x2, respectively, using a paired T² test with
α = 0.05. Clearly state your null hypothesis.

(b)

Perform the following analyses as a follow-up to your hypothesis test in part (a):
i. Sketch the 95% confidence region. Confirm it using the built-in R function “Ellipse.R”.
ii. Find 95% univariate t-intervals for each mean difference.
iii. Find 95% simultaneous T² intervals for each mean difference.
iv. Find 95% Bonferroni component intervals.
v. Find the linear combination that gives the largest value of T².

(c)

Using all the information in part (b), would you want to receive the treatment
if you have the same type of cancer? Explain.

Q.3

Disturbing the equilibrium of an organism “in vivo” by administering a treatment,
usually a drug, and monitoring the reaction is a commonly used procedure in biochemistry. In one such study, 7 human subjects were given an alcoholic drink, and
blood samples were taken at 1, 2, 3, and 4 hours after the alcoholic drink was consumed. The blood glucose concentration (mg./10 liters of blood) was determined
for each blood sample. The sample mean vector and sample covariance matrix, for
these data are

(a)

Plot the sample means at the four time points and connect them with straight
line segments. To help determine whether differences in means are large relative to variation in responses for different subjects, insert vertical bars representing one-at-a-time 95% confidence intervals for the individual means. Note how variation decreases as the mean glucose concentration decreases.
(b)

Let µ1, µ2, µ3, µ4 denote the population means for blood glucose concentration at 1, 2, 3, and 4 hours, respectively. Write the null hypothesis that the mean blood glucose concentrations are the same for the four time points, i.e., µ1 = µ2 = µ3 = µ4, in the following form:
H0 : Cµ = 0

Fill in appropriate values for C. What are the degrees of freedom for the F-distribution associated with the T² statistic?

(c)

Use the Bonferroni method to compute simultaneous 95% confidence intervals for µ1 – µ2, µ1 – µ3, µ1 – µ4, µ2 – µ3, µ2 – µ4, and µ3 – µ4. State your conclusions.
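
The calculations in parts (b) and (c) could be sketched in R as follows. The sample mean vector and covariance matrix are not reproduced above, so the sketch simulates placeholder data for 7 subjects; the simulated values, the helper contrast_T2, and the particular contrast matrix Cmat (successive differences, one of several valid choices) are assumptions made for illustration.

```r
# Placeholder data (assumption): 7 subjects by 4 time points. The assignment's
# own mean vector and covariance matrix should replace xbar and S below.
set.seed(42)
n <- 7; q <- 4
subject <- rnorm(n, sd = 5)                        # shared subject effect
Y <- sapply(c(50, 45, 40, 38), function(m) m + subject + rnorm(n, sd = 3))
xbar <- colMeans(Y)
S <- cov(Y)

# H0: C mu = 0 with successive-difference contrasts (each row sums to zero).
Cmat <- rbind(c(1, -1,  0,  0),
              c(0,  1, -1,  0),
              c(0,  0,  1, -1))

# T^2 = n (C xbar)' (C S C')^{-1} (C xbar); hypothetical helper for this sketch.
contrast_T2 <- function(Cmat, xbar, S, n) {
  d <- Cmat %*% xbar
  drop(n * t(d) %*% solve(Cmat %*% S %*% t(Cmat)) %*% d)
}
T2 <- contrast_T2(Cmat, xbar, S, n)

# (n - q + 1) / ((n - 1)(q - 1)) * T2 ~ F with (q - 1, n - q + 1) df under H0.
df_F <- c(q - 1, n - q + 1)
Fstat <- (n - q + 1) / ((n - 1) * (q - 1)) * T2
pval <- pf(Fstat, df_F[1], df_F[2], lower.tail = FALSE)

# Bonferroni intervals for all m = 6 pairwise differences mu_i - mu_k.
idx <- t(combn(q, 2)); m <- nrow(idx); alpha <- 0.05
tcrit <- qt(1 - alpha / (2 * m), n - 1)
bonf <- t(apply(idx, 1, function(ik) {
  a <- numeric(q); a[ik[1]] <- 1; a[ik[2]] <- -1
  est <- sum(a * xbar)
  half <- tcrit * sqrt(drop(t(a) %*% S %*% a) / n)
  c(estimate = est, lower = est - half, upper = est + half)
}))
rownames(bonf) <- paste0("mu", idx[, 1], " - mu", idx[, 2])
bonf
```

The value of T² does not depend on which full-rank contrast matrix is used, so any 3 x 4 matrix whose rows are linearly independent contrasts would give the same test.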

Q.4

Four psychological tests were given to 32 men and 32 women.
Download the data (“PsyTests.txt”) from Blackboard. The variables are:
y1 = pictorial inconsistencies
y2 = paper form board
y3 = tool recognition
y4 = vocabulary
(a)

Assume that the population covariance matrices are equal, Σ1 = Σ2. Obtain the pooled estimate
of the common variance-covariance matrix.


(c)

Using the pooled covariance matrix, compute Hotelling’s T². Test the hypothesis that the mean vectors of men and women are the same using T². State your conclusion.
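
A minimal sketch of the pooled covariance estimate and the two-sample T² test is given below. The layout of “PsyTests.txt” is not shown here, so the sketch uses simulated stand-in scores with 32 rows per group; the helper name hotelling_two_sample and all simulated values are assumptions.

```r
# Two-sample Hotelling T^2 with a pooled covariance matrix.
# hotelling_two_sample() is a hypothetical helper written for this sketch.
hotelling_two_sample <- function(X1, X2) {
  X1 <- as.matrix(X1); X2 <- as.matrix(X2)
  n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)
  # Pooled estimate of the common covariance matrix (assumes Sigma1 = Sigma2).
  S_pooled <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
  diff <- colMeans(X1) - colMeans(X2)
  T2 <- drop((n1 * n2 / (n1 + n2)) * t(diff) %*% solve(S_pooled) %*% diff)
  # (n1 + n2 - p - 1) / (p (n1 + n2 - 2)) * T2 ~ F(p, n1 + n2 - p - 1) under H0.
  df_F <- c(p, n1 + n2 - p - 1)
  Fstat <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
  list(S_pooled = S_pooled, T2 = T2, F = Fstat, df = df_F,
       p.value = pf(Fstat, df_F[1], df_F[2], lower.tail = FALSE))
}

# Stand-in data (assumption): 32 rows per group, 4 test scores each.
set.seed(7)
men   <- matrix(rnorm(32 * 4, mean = 15, sd = 3), ncol = 4)
women <- matrix(rnorm(32 * 4, mean = 13, sd = 3), ncol = 4)
res <- hotelling_two_sample(men, women)
res$S_pooled
res$T2; res$df; res$p.value
```

With 32 men, 32 women, and four tests, the reference distribution is F with 4 and 59 degrees of freedom.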

Solution

 

x = read.csv("bnotes.csv")

attach(x)

Question 1(a)

shapiro.test(X1)

data:  X1

W = 0.98374, p-value = 0.2567

shapiro.test(X2)

data:  X2

W = 0.96431, p-value = 0.008254

shapiro.test(X3)

data:  X3

W = 0.96637, p-value = 0.01174

shapiro.test(X4)

data:  X4

W = 0.97338, p-value = 0.04035

shapiro.test(X5)

data:  X5

W = 0.97456, p-value = 0.04984

shapiro.test(X6)

data:  X6

W = 0.95878, p-value = 0.003295

qqnorm(X1)

qqnorm(X2)

qqnorm(X3)

qqnorm(X4)

qqnorm(X5)

qqnorm(X6)
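
For part (b), a chi-square probability plot compares the ordered squared Mahalanobis distances with chi-square quantiles on 6 degrees of freedom. The following is a minimal sketch under stated assumptions: the helper chisq_plot and the plotting positions (j - 0.5)/n are choices made here, and the demonstration uses simulated data rather than the bank note file.

```r
# Chi-square probability plot: ordered squared Mahalanobis distances against
# chi-square quantiles with p degrees of freedom.
# chisq_plot() is a hypothetical helper written for this sketch.
chisq_plot <- function(X, plot = TRUE) {
  X <- as.matrix(X)
  n <- nrow(X); p <- ncol(X)
  d2 <- sort(mahalanobis(X, colMeans(X), cov(X)))   # squared distances
  q  <- qchisq((seq_len(n) - 0.5) / n, df = p)
  if (plot) {
    plot(q, d2, xlab = "Chi-square quantiles",
         ylab = "Ordered squared distances")
    abline(0, 1)     # points close to this line support multivariate normality
  }
  invisible(data.frame(quantile = q, d2 = d2))
}

# Stand-in data (assumption): 100 simulated 6-variate normal rows.
set.seed(123)
X_sim <- matrix(rnorm(100 * 6), ncol = 6)
pts <- chisq_plot(X_sim, plot = FALSE)
head(pts)

# With the bank note data loaded as `x` above, the call would be chisq_plot(x).
```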

Question 2(a)

library(Hotelling)

a = read.table("BronchusCancer.txt", header = TRUE)

b = a[1:2]

c = a[3:4]

d = hotelling.test(b, c, shrinkage = FALSE, perm = FALSE)

>d$stat

Hotelling test statistic

[1] 16.53195

$m

[1] 0.4833333

$df

[1]  2 29

$nx

[1] 16

$ny

[1] 16

$p

[1] 2

>d$pval

[1] 0.001721335
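
The output above comes from a two-sample test that treats the 16 patients and 16 controls as independent groups. Because the controls are matched, the paired T² test requested in part (a) works with the within-pair differences instead. A minimal sketch is shown below; the helper paired_T2 and the simulated stand-in data are assumptions, and the numbers it prints are not results for the bronchus cancer data.

```r
# Paired Hotelling T^2: work with the differences d_j = y_j - x_j and test
# H0: delta = 0, where delta is the mean vector of the differences.
# paired_T2() is a hypothetical helper written for this sketch.
paired_T2 <- function(Y, X) {
  D <- as.matrix(Y) - as.matrix(X)
  n <- nrow(D); p <- ncol(D)
  dbar <- colMeans(D); Sd <- cov(D)
  T2 <- drop(n * t(dbar) %*% solve(Sd) %*% dbar)
  Fstat <- (n - p) / (p * (n - 1)) * T2      # ~ F(p, n - p) under H0
  list(dbar = dbar, Sd = Sd, T2 = T2, F = Fstat, df = c(p, n - p),
       p.value = pf(Fstat, p, n - p, lower.tail = FALSE))
}

# Stand-in data (assumption): 16 matched pairs, two survival times each.
set.seed(2024)
X_ctrl <- matrix(rexp(16 * 2, rate = 1 / 300), ncol = 2)
Y_trt  <- X_ctrl + matrix(rnorm(16 * 2, mean = 80, sd = 150), ncol = 2)
fit <- paired_T2(Y_trt, X_ctrl)
fit$T2; fit$df; fit$p.value

# For the real data the call would take the (y1, y2) columns as Y and the
# (x1, x2) columns as X; which columns of `a` hold them is not shown here.
```

With 16 pairs and two variables, the paired statistic is referred to an F distribution with 2 and 14 degrees of freedom, rather than the 2 and 29 of the two-sample test.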

Answer 2(b)

> library(pairwiseCI)

> apc <- pairwiseCI(admiss ~ treat, data, method = "Param.diff")

> apc

95 %-confidence intervals

Method:  Difference of means assuming Normal distribution, allowing unequal variances

Confidence interval for the difference of means y1 - x1

estimate  lower upper

1-0     49.5 -28.72 127.7

> apc <- pairwiseCI(untreat ~ treat, data, method = "Param.diff")

>apc

95 %-confidence intervals

Method:  Difference of means assuming Normal distribution, allowing unequal variances

Confidence interval for the difference of means y2 - x2

estimate lower upper

1-0    106.9 35.86 177.9
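
The intervals above allow unequal variances and treat the two groups as unpaired. For the matched design, parts (b)(ii) to (b)(v) can be computed from the within-pair differences. The sketch below shows the three interval types and the maximizing linear combination on simulated stand-in differences; the data and all variable names are assumptions, so the printed numbers are illustrative only.

```r
# Follow-up intervals for a paired design, computed from the differences.
# Stand-in differences (assumption): 16 pairs, 2 variables.
set.seed(99)
D <- matrix(rnorm(16 * 2, mean = c(50, 100), sd = 140), ncol = 2, byrow = TRUE)
n <- nrow(D); p <- ncol(D); alpha <- 0.05
dbar <- colMeans(D); Sd <- cov(D)
se <- sqrt(diag(Sd) / n)

# Critical multipliers for the three interval types.
c_t    <- qt(1 - alpha / 2, n - 1)                               # one-at-a-time t
c_bonf <- qt(1 - alpha / (2 * p), n - 1)                         # Bonferroni
c_T2   <- sqrt(p * (n - 1) / (n - p) * qf(1 - alpha, p, n - p))  # simultaneous T^2

intervals <- function(mult) {
  cbind(lower = dbar - mult * se, upper = dbar + mult * se)
}
ci <- list(t = intervals(c_t), bonferroni = intervals(c_bonf),
           T2 = intervals(c_T2))
ci

# The linear combination a'd with the largest squared t-statistic is
# proportional to Sd^{-1} dbar, and its t^2 equals the paired T^2.
a_max  <- drop(solve(Sd) %*% dbar)
t2_max <- n * sum(a_max * dbar)^2 / drop(t(a_max) %*% Sd %*% a_max)
T2     <- drop(n * t(dbar) %*% solve(Sd) %*% dbar)
c(t2_max = t2_max, T2 = T2)
```

For two components the multipliers are ordered t < Bonferroni < T², so the Bonferroni intervals are wider than the one-at-a-time intervals but narrower than the simultaneous T² intervals.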