# Multivariate Analysis

**Q.1**

This problem is concerned with the identification of forged bank notes. The file “bnotes.csv” has data from six measurements which roughly quantify the size and the position of the printed image on 1000-franc Swiss bank notes. The file contains data for a sample of 100 genuine 1000-franc bank notes. There is one line of data for each bank note in the sample, with the data arranged in the following order:

X1 Length of the bill (mm)

X2 Width of the bill on the left side (mm)

X3 Width of the bill on the right side (mm)

X4 Width of the margin at the bottom (mm)

X5 Width of the margin at the top (mm)

X6 Diagonal length of the printed image (mm)

It is relatively easy to obtain a sample from the population of genuine notes, but it would be difficult to sample from the population of forged notes for obvious reasons. In this case information is available only for the genuine notes, and no information is available for forged notes. The identification problem consists of using the data available from the sample of genuine notes to develop a procedure for deciding whether notes of uncertain origin are genuine or forged. We will first try to determine if the variation in the six measurements on genuine notes can be approximately described by a normal distribution. If so, we will use the information in the sample of 100 genuine notes to determine whether or not a suspect note was likely to come from the population of genuine notes.
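The screening idea described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated stand-ins for the six measurements, the suspect note is hypothetical, and the 5% level is an arbitrary choice. It relies on the standard result that, under multivariate normality, the squared distance of a new observation from the sample mean follows a scaled F distribution.

```r
# Sketch: flag a suspect note by its squared Mahalanobis distance.
# ASSUMPTION: 'genuine' is SIMULATED, not the bnotes.csv measurements.
set.seed(1)
n <- 100   # number of genuine notes
p <- 6     # number of measurements per note
genuine <- matrix(rnorm(n * p, mean = 100, sd = 0.5), nrow = n, ncol = p)
colnames(genuine) <- paste0("X", 1:p)

xbar <- colMeans(genuine)   # sample mean vector
S    <- cov(genuine)        # sample covariance matrix

# Hypothetical suspect note: the first measurement is off by 3 mm
suspect <- xbar
suspect[1] <- suspect[1] + 3

# Squared statistical distance of the suspect note from the genuine mean
d2 <- mahalanobis(suspect, center = xbar, cov = S)

# For a new observation x0 from the same normal population,
# (x0 - xbar)' S^{-1} (x0 - xbar) ~ [(n^2 - 1) p / (n (n - p))] F(p, n - p)
cutoff <- (n^2 - 1) * p / (n * (n - p)) * qf(0.95, p, n - p)

flagged <- d2 > cutoff   # TRUE means "unlikely to be genuine" at the 5% level
```

A note is flagged when `d2` exceeds `cutoff`; this is only trustworthy if the normal model checked in parts (a) to (c) is a reasonable approximation.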

(a) Report the values of the Shapiro-Wilk statistic and associated p-values for each of the six variables. Also examine corresponding univariate normal probability plots (qqplots). State your conclusions. Keep in mind that the data are discrete because of the limited accuracy of the measuring instruments. Consequently, the probability distribution of the six measurements on genuine notes cannot be exactly multivariate normal. We only want to know if the multivariate normal distribution is a good approximation.

(b) Examine the chi-square probability plot. What does it indicate about the suitability of a 6-dimensional normal model?

(c) If you conclude that the distribution of any measurement is not reasonably well modeled by a normal distribution, examine the possibility of finding transformations to improve the fit of the normal model. List the transformations you think should be used for each measurement. Report “none” if no transformation is required.

**Q.2**

A number of patients with bronchus cancer were treated with ascorbate and compared with matched control patients who received no ascorbate (Cameron & Pauling, 1978). The explanation of variables is as follows:

y1 = patient (with ascorbate): survival time (days) from date of first hospital admission.

x1 = matched control (without ascorbate): survival time (days) from date of first hospital admission.

y2 = patient: survival time (days) from date of untreatability.

x2 = matched control: survival time (days) from date of untreatability.

(a) Compare y1 and y2 with x1 and x2, respectively, using a paired $T^2$ test with α = 0.05. Clearly state your null hypothesis.

(b) Perform the following analyses as a follow-up to your hypothesis test in part (a):

i. Sketch the 95% confidence region. Confirm it using the built-in R function “Ellipse.R”.

ii. Find 95% univariate t-intervals for each mean difference.

iii. Find 95% simultaneous $T^2$ intervals for each mean difference.

iv. Find 95% Bonferroni component intervals.

v. Find the linear combination that gives the largest value of $T^2$.

(c) Using all the information in part (b), would you want to receive the treatment if you had the same type of cancer? Explain.

**Q.3**

Disturbing the equilibrium of an organism “in vivo” by administering a treatment, usually a drug, and monitoring the reaction is a commonly used procedure in biochemistry. In one such study, 7 human subjects were given an alcoholic drink, and blood samples were taken at 1, 2, 3, and 4 hours after the alcoholic drink was consumed. The blood glucose concentration (mg./10 liters of blood) was determined for each blood sample. The sample mean vector and sample covariance matrix for these data are

(a) Plot the sample means at the four time points and connect them with straight line segments. To help determine whether differences in means are large relative to variation in responses for different subjects, insert vertical bars representing one-at-a-time 95% confidence intervals for the individual means. Note how variation decreases as the mean glucose concentration decreases.

(b) Let µ1, µ2, µ3, µ4 denote the population means for blood glucose concentration at 1, 2, 3, and 4 hours, respectively. Write the null hypothesis that the mean blood glucose concentrations are the same for the four time points, i.e., µ1 = µ2 = µ3 = µ4, in the following form:

H0 : Cµ = 0

Fill in appropriate values for C. What are the degrees of freedom for the F distribution associated with the $T^2$ statistic?

(c) Use the Bonferroni method to compute simultaneous 95% confidence intervals for µ1 – µ2, µ1 – µ3, µ1 – µ4, µ2 – µ3, µ2 – µ4, and µ3 – µ4. State your conclusions.
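Since the mean vector and covariance matrix are not reproduced here, the following base R sketch uses clearly hypothetical values for both (`xbar` and `S` are made up for illustration). It shows one valid choice of the contrast matrix C, the resulting $T^2$ statistic with its F degrees of freedom, and Bonferroni intervals for the six pairwise differences.

```r
# Sketch of the equal-means test and Bonferroni intervals for Q.3.
# ASSUMPTION: xbar and S below are HYPOTHETICAL, not the study's values.
n <- 7                              # subjects
q <- 4                              # time points
xbar <- c(100, 90, 85, 80)          # made-up mean vector
S <- diag(c(40, 30, 20, 10)) + 5    # made-up positive definite covariance

# One choice of contrast matrix: successive differences (rows sum to zero)
C <- rbind(c(1, -1,  0,  0),
           c(0,  1, -1,  0),
           c(0,  0,  1, -1))

# T^2 = n (C xbar)' (C S C')^{-1} (C xbar)
Cx <- C %*% xbar
T2 <- drop(n * t(Cx) %*% solve(C %*% S %*% t(C)) %*% Cx)

# Under H0, T^2 ~ [(n - 1)(q - 1) / (n - q + 1)] F(q - 1, n - q + 1)
df1 <- q - 1
df2 <- n - q + 1
Fstat <- (n - q + 1) / ((n - 1) * (q - 1)) * T2
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

# Bonferroni intervals for all m = 6 pairwise mean differences
pairs <- combn(q, 2)
m <- ncol(pairs)
tcrit <- qt(1 - 0.05 / (2 * m), df = n - 1)
bonf <- t(apply(pairs, 2, function(ij) {
  a <- numeric(q)
  a[ij[1]] <- 1
  a[ij[2]] <- -1
  est  <- sum(a * xbar)
  half <- tcrit * sqrt(drop(t(a) %*% S %*% a) / n)
  c(estimate = est, lower = est - half, upper = est + half)
}))
rownames(bonf) <- apply(pairs, 2, function(ij) paste0("mu", ij[1], "-mu", ij[2]))
```

With 7 subjects and 4 time points the F distribution has 3 and 4 degrees of freedom. Any other full-rank 3 × 4 contrast matrix gives the same $T^2$ value.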

**Q.4**

Four psychological tests were given to 32 men and 32 women. Download the data (“PsyTests.txt”) from Blackboard. The variables are:

y1 = pictorial inconsistencies

y2 = paper form board

y3 = tool recognition

y4 = vocabulary

(a) Assume that the population covariance matrices are equal, Σ1 = Σ2. Obtain the pooled estimate of the common variance-covariance matrix.

(c) Using the pooled covariance matrix, compute Hotelling’s $T^2$. Test the hypothesis that the mean vectors of men and women are the same using $T^2$. State your conclusion.
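The pooled estimate and the two-sample $T^2$ statistic can be sketched in base R as follows. The scores are simulated stand-ins (the layout of “PsyTests.txt” is not shown here, so no attempt is made to read it), and the names `men`, `women` and `Spooled` are illustrative.

```r
# Sketch of the pooled covariance matrix and two-sample Hotelling T^2.
# ASSUMPTION: 'men' and 'women' are SIMULATED, not the PsyTests.txt scores.
set.seed(4)
n1 <- 32
n2 <- 32
p  <- 4
men   <- matrix(rnorm(n1 * p, mean = 15, sd = 3), nrow = n1, ncol = p)
women <- matrix(rnorm(n2 * p, mean = 14, sd = 3), nrow = n2, ncol = p)

S1 <- cov(men)
S2 <- cov(women)

# Pooled estimate of the common covariance matrix (assumes Sigma1 = Sigma2)
Spooled <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# T^2 = (xbar1 - xbar2)' [ (1/n1 + 1/n2) Spooled ]^{-1} (xbar1 - xbar2)
dmean <- colMeans(men) - colMeans(women)
T2 <- drop(t(dmean) %*% solve((1 / n1 + 1 / n2) * Spooled) %*% dmean)

# Under H0, [(n1 + n2 - p - 1) / ((n1 + n2 - 2) p)] T^2 ~ F(p, n1 + n2 - p - 1)
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
pval  <- pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)
```

The hypothesis of equal mean vectors is rejected at level α when `pval` falls below α.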

**Solution**


x=read.csv("bnotes.csv")

attach(x)

Question 1(a)

shapiro.test(X1)

data: X1

W = 0.98374, p-value = 0.2567

shapiro.test(X2)

data: X2

W = 0.96431, p-value = 0.008254

shapiro.test(X3)

data: X3

W = 0.96637, p-value = 0.01174

shapiro.test(X4)

data: X4

W = 0.97338, p-value = 0.04035

shapiro.test(X5)

data: X5

W = 0.97456, p-value = 0.04984

shapiro.test(X6)

data: X6

W = 0.95878, p-value = 0.003295

qqnorm(X1)

qqnorm(X2)

qqnorm(X3)

qqnorm(X4)

qqnorm(X5)

qqnorm(X6)
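Part (b) asks for a chi-square probability plot, which is not worked above. A minimal base R sketch follows; it uses simulated data in place of the six columns of “bnotes.csv”, so the name `dat` and its contents are assumptions for illustration only.

```r
# Chi-square probability plot sketch for Question 1(b).
# ASSUMPTION: 'dat' is SIMULATED and stands in for columns X1..X6.
set.seed(2)
n <- 100
p <- 6
dat <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Squared Mahalanobis distance of each note from the sample mean
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Chi-square quantiles with p degrees of freedom at (i - 0.5) / n
qchi <- qchisq((seq_len(n) - 0.5) / n, df = p)

# Points close to the 45-degree line support the 6-dimensional normal model
plot(qchi, sort(d2),
     xlab = "Chi-square quantiles (df = 6)",
     ylab = "Ordered squared distances")
abline(0, 1, lty = 2)
```

With the real data, `dat` would be replaced by the six measurement columns, for example `as.matrix(x)` if `x` holds only X1 to X6.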

Question 2(a)

library(Hotelling)

a=read.table("BronchusCancer.txt",header=TRUE)

b=a[1:2]

c=a[3:4]

d=hotelling.test(b,c, shrinkage = FALSE, perm = FALSE)

>d$stat

Hotelling test statistic

[1] 16.53195

$m

[1] 0.4833333

$df

[1] 2 29

$nx

[1] 16

$ny

[1] 16

$p

[1] 2

>d$pval

[1] 0.001721335

Answer 2(b)

>library(pairwiseCI)

>apc<- pairwiseCI(admiss ~ treat, data, method="Param.diff")

>apc

95 %-confidence intervals

Method: Difference of means assuming Normal distribution, allowing unequal variances

confidence interval for y1-x1

estimate lower upper

1-0 49.5 -28.72 127.7

>apc<- pairwiseCI(untreat ~ treat, data, method="Param.diff")

>apc

95 %-confidence intervals

Method: Difference of means assuming Normal distribution, allowing unequal variances

confidence interval for difference of mean y2-x2

estimate lower upper

1-0 106.9 35.86 177.9
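The `hotelling.test(b, c)` call above treats the two groups as independent samples (note the reported degrees of freedom 2 and 29), whereas the question asks for a paired test on the within-pair differences. A base R sketch of the paired analysis is given below. The survival times are simulated, so the numbers it produces are not the answer to the exercise, and names such as `D`, `dbar` and `Sd` are illustrative.

```r
# Sketch of the paired Hotelling T^2 test and follow-up intervals for Q.2.
# ASSUMPTION: the survival times are SIMULATED, not BronchusCancer.txt.
set.seed(3)
n <- 16   # matched pairs
p <- 2    # responses per pair
y1 <- rexp(n, rate = 1 / 300); x1 <- rexp(n, rate = 1 / 250)
y2 <- rexp(n, rate = 1 / 150); x2 <- rexp(n, rate = 1 / 60)

D    <- cbind(d1 = y1 - x1, d2 = y2 - x2)   # within-pair differences
dbar <- colMeans(D)
Sd   <- cov(D)

# H0: mean difference vector is zero. T^2 = n dbar' Sd^{-1} dbar
T2    <- drop(n * t(dbar) %*% solve(Sd) %*% dbar)
Fstat <- (n - p) / ((n - 1) * p) * T2        # ~ F(p, n - p) under H0
pval  <- pf(Fstat, p, n - p, lower.tail = FALSE)

# Critical multipliers for the three kinds of 95% intervals
alpha <- 0.05
se <- sqrt(diag(Sd) / n)
crit_t    <- qt(1 - alpha / 2, df = n - 1)                        # one at a time
crit_bonf <- qt(1 - alpha / (2 * p), df = n - 1)                  # Bonferroni
crit_T2   <- sqrt((n - 1) * p / (n - p) * qf(1 - alpha, p, n - p)) # simultaneous

ci <- function(mult) cbind(lower = dbar - mult * se, upper = dbar + mult * se)
intervals <- list(univariate = ci(crit_t),
                  bonferroni = ci(crit_bonf),
                  simultaneous_T2 = ci(crit_T2))

# Linear combination a'd with the largest t^2: a proportional to Sd^{-1} dbar
a <- solve(Sd, dbar)
max_t2 <- drop(n * sum(a * dbar)^2 / (t(a) %*% Sd %*% a))   # equals T2
```

For a given level the univariate intervals are the narrowest and the simultaneous $T^2$ intervals the widest, with the Bonferroni intervals in between.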