**Simple Linear Regression**

__Instructions:__

You should use SAS (or some other statistical software package) to aid in the solution of these problems.

Your solutions to these problems should be presented in a clear and concise fashion, presented in a MS Word document. Include selected computer output in your write-up, as needed.

__PROBLEM 1: Simple Linear Regression__

A researcher reported the median grain size of sand (in mm) in 59 alluvial aquifers in the Arkansas River Valley. The yield of each aquifer (in gal/day/ft2) was also reported. The data are contained in the file titled **Arkansas River Valley data**. The researcher is interested in predicting yield from grain size. Thus, denote yield by y and grain size by x.

- Generate a scatterplot for these data, with grain size on the horizontal axis, and yield on the vertical axis.

- Find the least squares estimates of β_{0 }and β_{1 }in the model

μ_{y|x}= β_{0 }+ β_{1}x

- Test the hypothesis

H_{0}: β_{1 }= 0

H_{1}: β_{1 }≠ 0

at α = .05,

- Compute the residuals for these data. Do any residuals exceed ± 3s_{ε}?

**Solution**

Problem 1:

- The least squares estimates were found in R using the lm() function (i.e. the least squares fitting technique for linear models):

Coefficients:

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     -9.294      42.255   -0.220     0.827
GrainSize      744.979     109.964    6.775  7.54e-09 ***

So the estimate of β_{0 }is -9.294 and the estimate of β_{1 }is 744.979.
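These estimates can also be checked against the closed-form least squares formulas, β_{1} = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and β_{0} = ȳ − β_{1}x̄. A minimal stdlib sketch (in Python, on made-up toy data rather than the aquifer data):

```python
import statistics

def least_squares(x, y):
    """Closed-form simple linear regression estimates (intercept, slope)."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# toy data lying exactly on the line y = 2 + 3x
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
print(least_squares(x, y))  # -> (2.0, 3.0)
```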

We seek to test

H_{0}: β_{1 }= 0

H_{1}: β_{1 }≠ 0

Under H_{0}, the test statistic is t = 6.775 and the corresponding p-value is 7.54e-09, which is much smaller than 0.05. So we reject H_{0}.
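The reported t value is simply the slope estimate divided by its standard error; a quick arithmetic check using the figures from the R summary above:

```python
# slope estimate and its standard error, taken from the R summary above
b1_hat = 744.979
se_b1 = 109.964
t_stat = b1_hat / se_b1  # t = estimate / standard error
print(round(t_stat, 3))  # -> 6.775
```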

__The 59 residuals are as follows:__

-24.8948554, -27.3647300, -43.6538518, -35.6538518, -61.1036427,
3.8963573, 9.9967755, -21.4530154, -34.9028063, 10.0971937,
-17.3525972, -53.8023882, -101.2521791, 27.7478209, 127.7478209,
-89.7019700, -110.1517609, -87.1517609, -99.6015518, -131.0513427,
-82.5011336, 0.4988664, 3.0490755, -54.4007154, -39.4007154,
-154.8505063, 8.1494937, -123.3002972, -39.3002972, -8.7500881,
-125.1998790, 213.2499119, 115.8001210, -109.0994609, 40.9005391,
-116.3484154, 113.6515846, -68.7982063, -41.2479972, -28.6977881,
81.3022119, 101.4026301, 276.4026301, 248.9528391, 251.5030482,
4.0532573, 69.1536755, 329.1536755, -148.2961154, 224.2540937,
336.8043028, 496.8043028, 279.5553483, -425.6936063, -187.6936063,
-200.0429790, -141.9906790, -140.9864972, -198.4362881.

**Only the 52nd residual, 496.804, exceeds ± 3s_{ε}.**
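The ± 3s_{ε} screen is just "keep residuals whose absolute value exceeds three sample standard deviations". A minimal stdlib sketch (in Python, on made-up toy residuals rather than the 59 values above):

```python
import statistics

def extreme_residuals(res, k=3.0):
    """Return the residuals whose absolute value exceeds k * sample sd."""
    s = statistics.stdev(res)  # sample standard deviation of the residuals
    return [r for r in res if abs(r) > k * s]

# toy residuals: twenty small values plus one gross outlier
res = [0.5, -0.3, 0.8, -0.6, 0.2, -0.4, 0.7, -0.1, 0.3, -0.9,
       0.4, -0.2, 0.6, -0.7, 0.1, -0.5, 0.9, -0.8, 0.2, -0.3, 50.0]
print(extreme_residuals(res))  # -> [50.0]
```

Note that the outlier itself inflates the standard deviation, so with very small samples a single extreme value may not cross the 3s cut-off.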

The R code used (code1.R) is attached here:

library(readxl)
SAS_Arkansas_River_Valley_Data <- read_excel("~/Desktop/Pawel/newquoteneeded/SAS-Arkansas River Valley Data.xlsx")
x <- SAS_Arkansas_River_Valley_Data
plot(x$GrainSize, x$Yield, xlab="grain size in mm", ylab="Yield") # scatterplot
fit <- lm(x$Yield ~ x$GrainSize) # fit the linear model
fit$coefficients # gives beta_0 and beta_1
summary(fit) # look at the t value and p-value
res <- fit$residuals # residuals
RES <- res[abs(res) > 3*sd(res)] # filter the residuals exceeding +/- 3*s_epsilon


__PROBLEM 2: One Way ANOVA__

Using the data from ex8-32:

- Perform an analysis of variance on these data, and test the hypothesis

H0: μA = μB = μC = μD

H1: ~H0

State your conclusions.

- Use Levene’s Test to test the assumption that the variances in the 4 groups are equal. State your conclusions.

- Check to see if the residuals are modeled well by a normal distribution. State your conclusions.

- If you determine that the means are not equal in part (1), use Tukey’s HSD procedure to determine which means are different. State your conclusions.

- To determine whether the distributions are the same in the four groups, perform a Kruskal-Wallis test on the ranks. State your conclusions.

**Solution**

__ANSWERS TO QUESTION 2__

- We seek to test

H0: μA = μB = μC = μD

H1: ~H0

Under H0, the F statistic is 11.05 and the corresponding p-value is 5.85e-05. Such a small p-value leads us to reject H0: the group means are not all equal.
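The F statistic that R reports is MS(between) divided by MS(within). A minimal stdlib sketch (in Python, on made-up toy groups rather than the ex8-32 data):

```python
import statistics

def anova_f(groups):
    """One-way ANOVA F statistic: MS(between) / MS(within)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total sample size
    grand = sum(sum(g) for g in groups) / n  # grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# toy groups with clearly different means -> large F
groups = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
print(anova_f(groups))  # -> 300.0
```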

- We seek to test

H0: the variances in all 4 groups are equal

H1: ~H0

Levene’s Test in R yields an F statistic of 0.5647, with a p-value of 0.6428 >> 0.05. Hence we fail to reject H0, i.e. the variances in the 4 groups can be taken to be equal.
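Levene’s statistic is itself a one-way ANOVA F computed on absolute deviations from each group’s center (R’s leveneTest defaults to the median; the classical form sketched below uses the group mean). A stdlib sketch in Python on toy data:

```python
import statistics

def levene_w(groups):
    """Classical Levene statistic: one-way ANOVA F on |x - group mean|."""
    z = [[abs(v - statistics.mean(g)) for v in g] for g in groups]
    k = len(z)
    n = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n
    ssb = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in z)
    ssw = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in z)
    return (ssb / (k - 1)) / (ssw / (n - k))

# toy groups with identical spreads -> W is essentially zero
groups = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
print(levene_w(groups))
```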

- We use the Shapiro-Wilk test for normality of the residuals. From the output, the p-value is > 0.05, implying that the distribution of the residuals is not significantly different from a normal distribution. In other words, we can assume normality.

- The adjusted p-values for the D-A and C-B comparisons in the TukeyHSD test in R are 0.8854051 and 0.6132192 respectively, so the mean differences for the pairs D-A and C-B are NOT statistically significant.

But the adjusted p-values for the B-A, C-A, D-B and D-C pairs are 0.0294781, 0.0013447, 0.0049860 and 0.0001916 respectively, so the pairwise differences B-A, C-A, D-B and D-C are statistically significant.

- The p-value obtained by the Kruskal-Wallis test is 0.0008698, which is < 0.05. Hence we reject H0, i.e. the distributions are NOT the same in the 4 groups.
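The Kruskal-Wallis test replaces the observations by their pooled ranks before comparing groups; ignoring ties, H = 12/(N(N+1)) · Σ nᵢ(R̄ᵢ − (N+1)/2)². A stdlib sketch in Python on tie-free toy data:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes no tied values
    n = len(pooled)
    dev_sq = sum(
        len(g) * (sum(rank[v] for v in g) / len(g) - (n + 1) / 2) ** 2
        for g in groups
    )
    return 12.0 * dev_sq / (n * (n + 1))

# toy groups that are completely separated -> H near its maximum
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(kruskal_wallis_h(groups))
```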

**code2.R**

library(readxl)
ex8_32 <- read_excel("~/Desktop/Pawel/newquoteneeded/ex8-32.xlsx")
x <- ex8_32
X <- c(x$A, x$B, x$C, x$D) # stack all 32 observations in one vector
Y <- factor(c(rep("A",8), rep("B",8), rep("C",8), rep("D",8))) # group labels
Z <- data.frame(X, Y)
A <- aov(X ~ Y, data = Z)
summary(A) # look at the p-value in the ANOVA table; needed in part 1
library(car) # leveneTest() is in the car package
leveneTest(X ~ Y, data = Z) # look at the p-value to conclude
shapiro.test(A$residuals) # normality of residuals; look at the p-value
TukeyHSD(A) # in base stats; look at the adjusted p-values
kruskal.test(X ~ Y, data = Z) # look at the p-value