Simple Linear Regression

Instructions:

You should use SAS (or some other statistical software package) to aid in the solution of these problems.

Present your solutions clearly and concisely in an MS Word document. Include selected computer output in your write-up as needed.

PROBLEM 1: Simple Linear Regression

A researcher reported the median grain size of sand (in mm) in 59 alluvial aquifers in the Arkansas River Valley. The yield of each aquifer (in gal/day/ft²) was also reported. The data are contained in the file titled Arkansas River Valley data. The researcher is interested in predicting yield from grain size. Thus, denote yield by y and grain size by x.

  1. Generate a scatterplot for these data, with grain size on the horizontal axis and yield on the vertical axis.
  2. Find the least squares estimates of β0 and β1 in the model

μy|x = β0 + β1x

  3. Test the hypothesis

H0: β1 = 0

H1: β1 ≠ 0

at α = 0.05.

  4. Compute the residuals for these data. Do any residuals exceed ±3sε?

 

Solution

Problem 1:

  1. [Figure: scatterplot of yield (vertical axis) versus grain size (horizontal axis).]

  2. The least squares estimates of β0 and β1 are -9.294 and 744.979, respectively.

These were obtained in R with the lm() function, which fits linear models by least squares.

  3. Coefficients:

                 Estimate   Std. Error   t value   Pr(>|t|)
β0 (Intercept)     -9.294       42.255    -0.220   0.827
β1 (GrainSize)    744.979      109.964     6.775   7.54e-09 ***

We seek to test

H0: β1 = 0

H1: β1 ≠ 0

Under H0, the test statistic is t = 6.775 and the corresponding p-value is 7.54e-09, which is much smaller than α = 0.05. So we reject H0: the slope β1 differs significantly from zero.
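As a check on the reported values, the t statistic and p-value can be recomputed from the coefficient table. This is a minimal sketch, assuming n = 59 aquifers and therefore n - 2 = 57 residual degrees of freedom; the variable names are illustrative and not part of the original script.

```r
# Values copied from the coefficient table in the solution
b1    <- 744.979   # estimated slope
se_b1 <- 109.964   # standard error of the slope
n     <- 59        # number of aquifers (from the problem statement)

t_stat  <- b1 / se_b1                         # t = estimate / standard error
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value, t with 57 df

print(round(t_stat, 3))
print(p_value < 0.05)   # TRUE means H0: beta_1 = 0 is rejected at alpha = 0.05
```

The same calculation applied to the intercept row (-9.294 / 42.255) gives a t value close to -0.220, which is why the intercept is not significantly different from zero.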

  4. The 59 residuals are as follows:

-24.8948554, -27.3647300, -43.6538518, -35.6538518, -61.1036427,
3.8963573, 9.9967755, -21.4530154, -34.9028063, 10.0971937,
-17.3525972, -53.8023882, -101.2521791, 27.7478209, 127.7478209,
-89.7019700, -110.1517609, -87.1517609, -99.6015518, -131.0513427,
-82.5011336, 0.4988664, 3.0490755, -54.4007154, -39.4007154,
-154.8505063, 8.1494937, -123.3002972, -39.3002972, -8.7500881,
-125.1998790, 213.2499119, 115.8001210, -109.0994609, 40.9005391,
-116.3484154, 113.6515846, -68.7982063, -41.2479972, -28.6977881,
81.3022119, 101.4026301, 276.4026301, 248.9528391, 251.5030482,
4.0532573, 69.1536755, 329.1536755, -148.2961154, 224.2540937,
336.8043028, 496.8043028, 279.5553483, -425.6936063, -187.6936063,
-200.0429790, -141.9906790, -140.9864972, -198.4362881

Only the 52nd residual, 496.804, exceeds ±3sε in absolute value.
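The ±3sε check can be reproduced directly from the residuals listed in the solution. This is a sketch that assumes sε is the residual standard error with n - 2 degrees of freedom, sqrt(SSE / (n - 2)); the names res, s_eps and flagged are illustrative.

```r
# The 59 residuals, copied from the solution text
res <- c(
  -24.8948554, -27.3647300, -43.6538518, -35.6538518, -61.1036427,
  3.8963573, 9.9967755, -21.4530154, -34.9028063, 10.0971937,
  -17.3525972, -53.8023882, -101.2521791, 27.7478209, 127.7478209,
  -89.7019700, -110.1517609, -87.1517609, -99.6015518, -131.0513427,
  -82.5011336, 0.4988664, 3.0490755, -54.4007154, -39.4007154,
  -154.8505063, 8.1494937, -123.3002972, -39.3002972, -8.7500881,
  -125.1998790, 213.2499119, 115.8001210, -109.0994609, 40.9005391,
  -116.3484154, 113.6515846, -68.7982063, -41.2479972, -28.6977881,
  81.3022119, 101.4026301, 276.4026301, 248.9528391, 251.5030482,
  4.0532573, 69.1536755, 329.1536755, -148.2961154, 224.2540937,
  336.8043028, 496.8043028, 279.5553483, -425.6936063, -187.6936063,
  -200.0429790, -141.9906790, -140.9864972, -198.4362881
)

n     <- length(res)
s_eps <- sqrt(sum(res^2) / (n - 2))    # residual standard error (assumed definition of s_epsilon)
flagged <- which(abs(res) > 3 * s_eps) # positions of residuals beyond +/- 3 * s_epsilon

print(3 * s_eps)
print(flagged)
print(res[flagged])
```

Note that sd(res) divides by n - 1 rather than n - 2, so it gives a slightly smaller threshold than the residual standard error; for these data the two versions flag the same residual.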

The R code used (code1.R) is given below:

library(readxl)
# Load the data (path as used on the author's machine)
SAS_Arkansas_River_Valley_Data <- read_excel("~/Desktop/Pawel/newquoteneeded/SAS-Arkansas River Valley Data.xlsx")
x <- SAS_Arkansas_River_Valley_Data
plot(x$GrainSize, x$Yield, xlab = "Grain size (mm)", ylab = "Yield") # scatterplot
fit <- lm(Yield ~ GrainSize, data = x) # fit the linear model
coef(fit) # estimates of beta_0 and beta_1
summary(fit) # t statistics and p-values
res <- residuals(fit) # residuals
s_eps <- sigma(fit) # residual standard error s_epsilon (n - 2 degrees of freedom)
RES <- res[abs(res) > 3 * s_eps] # residuals exceeding +/- 3 * s_epsilon
RES

PROBLEM 2: One Way ANOVA

Using the data from ex8-32:

  1. Perform an analysis of variance on these data, and test the hypothesis

H0: μA = μB = μC = μD

H1: not H0

State your conclusions.

  2. Use Levene’s Test to test the assumption that the variances in the 4 groups are equal. State your conclusions.
  3. Check to see if the residuals are modeled well by a normal distribution. State your conclusions.
  4. If you determine that the means are not equal in part (1), use Tukey’s HSD procedure to determine which means are different. State your conclusions.
  5. To determine whether the distributions are the same in the four groups, perform a Kruskal-Wallis test on the ranks. State your conclusions.

Solution

Problem 2:

  1. We seek to test

H0: μA = μB = μC = μD

H1: not H0

Under H0, the F statistic is 11.05 and the corresponding p-value is 5.85e-05. Such a small p-value leads us to reject H0: the four group means are not all equal.
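The reported p-value can be checked against the F distribution. This is a minimal sketch, assuming 4 groups of 8 observations each (32 in total, as in code2.R), which gives 3 and 28 degrees of freedom; the variable names are illustrative.

```r
f_stat  <- 11.05   # F statistic reported in the ANOVA table
k       <- 4       # number of groups (A, B, C, D)
n_total <- 32      # total observations (assumed: 8 per group)

# Upper-tail probability of F with (k - 1, n_total - k) degrees of freedom
p_value <- pf(f_stat, df1 = k - 1, df2 = n_total - k, lower.tail = FALSE)

print(p_value)
print(p_value < 0.05)   # TRUE means H0 of equal means is rejected
```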

  2. We seek to test

H0: The variances in all 4 groups are equal

H1: not H0

Levene’s Test in R yields an F statistic of 0.5647 with a p-value of 0.6428, which is well above 0.05.

Hence we fail to reject H0; the data are consistent with equal variances in the 4 groups.
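Levene’s Test is, in effect, a one-way ANOVA on the absolute deviations of each observation from its group center. The sketch below illustrates this in base R using the group medians as centers (the default in car::leveneTest(), to the best of my knowledge). The data are simulated and purely hypothetical; they are not the ex8-32 data.

```r
set.seed(1)
# Hypothetical data: 4 groups of 8 observations with a common variance
dat <- data.frame(
  value = c(rnorm(8, 10, 2), rnorm(8, 14, 2), rnorm(8, 15, 2), rnorm(8, 9, 2)),
  group = factor(rep(c("A", "B", "C", "D"), each = 8))
)

# Absolute deviation of each observation from its own group median
dat$absdev <- abs(dat$value - ave(dat$value, dat$group, FUN = median))

# One-way ANOVA on the absolute deviations gives the Levene F statistic
lev   <- anova(lm(absdev ~ group, data = dat))
lev_F <- lev[["F value"]][1]
lev_p <- lev[["Pr(>F)"]][1]

print(lev)
```

A large p-value here, as in the solution, means there is no evidence against equal variances; it does not prove that the variances are equal.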

  3. We use the Shapiro-Wilk test to check the normality of the residuals.

From the output, the p-value is greater than 0.05, so the distribution of the residuals is not significantly different from a normal distribution. In other words, the normality assumption is reasonable.

  4. The adjusted p-values for the D-A and C-B comparisons in the Tukey HSD test in R are 0.8854051 and 0.6132192, respectively.

So the mean differences for the pairs D-A and C-B are NOT statistically significant.

The adjusted p-values for the B-A, C-A, D-B and D-C pairs are 0.0294781, 0.0013447, 0.0049860 and 0.0001916, respectively, so these pairwise differences are statistically significant.
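Reading the significant pairs off the Tukey HSD output can be automated. The sketch below uses simulated, hypothetical data (not the ex8-32 data) to show how the "p adj" column of the TukeyHSD() result can be filtered at the 0.05 level; the names dat, tk and significant are illustrative.

```r
set.seed(42)
# Hypothetical data: 4 groups of 8 observations
dat <- data.frame(
  value = c(rnorm(8, 10, 2), rnorm(8, 14, 2), rnorm(8, 15, 2), rnorm(8, 9, 2)),
  group = factor(rep(c("A", "B", "C", "D"), each = 8))
)

fit <- aov(value ~ group, data = dat)

# TukeyHSD() returns one matrix per factor; rows are pairs such as "B-A"
tk <- TukeyHSD(fit, conf.level = 0.95)$group

# Pairs whose adjusted p-value is below 0.05
significant <- rownames(tk)[tk[, "p adj"] < 0.05]

print(round(tk, 4))
print(significant)
```

With 4 groups there are 6 pairwise comparisons, which matches the six pairs reported in the solution.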

  5. The p-value from the Kruskal-Wallis test is 0.0008698, which is below 0.05. Hence we reject H0; the distributions are NOT the same in the 4 groups.
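To make "a test on the ranks" concrete, the Kruskal-Wallis statistic can be computed by hand and compared with kruskal.test(). This is a sketch on simulated, hypothetical data (not the ex8-32 data) and it assumes there are no tied observations, so no tie correction is needed.

```r
set.seed(7)
# Hypothetical data: 4 groups of 8 observations (continuous, so no ties)
dat <- data.frame(
  value = c(rnorm(8, 10, 2), rnorm(8, 14, 2), rnorm(8, 15, 2), rnorm(8, 9, 2)),
  group = factor(rep(c("A", "B", "C", "D"), each = 8))
)

N         <- nrow(dat)
r         <- rank(dat$value)              # ranks of the pooled sample
rank_sums <- tapply(r, dat$group, sum)    # sum of ranks in each group
n_i       <- tapply(r, dat$group, length) # group sizes

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
H       <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)
p_value <- pchisq(H, df = nlevels(dat$group) - 1, lower.tail = FALSE)

kw <- kruskal.test(value ~ group, data = dat)
print(c(by_hand = H, kruskal_test = unname(kw$statistic)))
print(c(by_hand = p_value, kruskal_test = kw$p.value))
```

The p-value comes from a chi-square approximation with (number of groups - 1) = 3 degrees of freedom.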

code2.R 

library(readxl)
library(car) # provides leveneTest()
# Load the data (path as used on the author's machine)
ex8_32 <- read_excel("~/Desktop/Pawel/newquoteneeded/ex8-32.xlsx")
x <- ex8_32
X <- c(x$A, x$B, x$C, x$D) # all 32 observations in one vector
Y <- factor(c(rep("A", 8), rep("B", 8), rep("C", 8), rep("D", 8))) # group labels
Z <- data.frame(X, Y)
A <- aov(X ~ Y, data = Z)
summary(A) # part 1: ANOVA table, look at the p-value
leveneTest(X ~ Y, data = Z) # part 2: look at the p-value to conclude
shapiro.test(residuals(A)) # part 3: normality of the residuals
TukeyHSD(A) # part 4: look at the p adj values (TukeyHSD() is in the stats package)
kruskal.test(X ~ Y, data = Z) # part 5: look at the p-value