The R Programming homework focuses on analyzing a dataset to build and evaluate regression models for predicting house prices in different scenarios. It involves data transformations, regression coefficients, residual diagnostics, model comparison, and statistical inference. The goal is to understand the relationships between various variables and make predictions about house prices.
For this model to be better, a transformation of log-log works better, the resulting scatter plot is as shown in the figure below.
After transformation of Price and Size variables, the computed coefficients for the variables used while fitting a linear regression equation to predict price as the model 1 is as follows;
b0 (intercept) = 4.8840
b1 (size) = 0.9021
b2 (Beds) = -0.0712
Residual diagnostic plots:
The Shapiro-Wilk test for normality for the residuals in this model are as shown in the figure below; The
P value is 0.0091
From the QQ-plot, residual plot, residual vs fitted plot, the top 3 observations that would impact the conclusions include;
From residual plots; 2
From residual and fitted plots; 7
From normal QQ-plot; 7
When we run a second model without the outliers/extreme values, the computed regression coefficient values are as shown in the figure below;
From the computed model above, removing the outliers makes the model better since we now have an r squared value of 0.6276 compared to our previous model 0.5337 showing that 62.8% of the variation in the price of houses in the model is explained by the variables. Removing outliers improves the model.
From Model 1 = 0.2 * 0.9021 = 0.1804, which is 18.04%.
The calculated effect above means that for an increase in 20% in home size, this would lead to an increase by 18.04% of the house price.
The difference between CA and NJ = 0.1304.
The effect computed above shows than on average, the house price in California is 13.04% higher than the house price in New Jersey.
The difference in multiple R squared = 0.5543 (in model 3) – 0.5337 (in model 2) = 0.0206.
From the scatterplot and the coefficient estimate, there is a positive non-linear relationship between the covariate and response
From the residual’s vs fitted values plot, there appears to be a problem.
Yes, there is a linear relationship hence a linear model is appropriate.
b0 = 0.9651
b1 = 7.010
Multiple R squared = 0.9569
P-value of the b1 coefficient = 2.95e – 10
The Linear model does not seem appropriate since it has too many outliers. The normality assumption appears to be violated from the QQ-plot.
For 5% change in log Dose = 0.05 * 7.010 = 0.3505 = 35.05% change in the response.
The 95% confidence interval for the slope parameter in the model above is calculated as follows;
7.010+1.960* 0.417= (6.1927, 7.8273)
Yes, we reject the null hypothesis since at 5% significance level, the computed P-value is lower than 0.05.
X bar = 20
Y bar = 58
Sx = 6.758
Sy = 8.775
Syy = 462
Standard error of the intercept;
(t=b/SEb) = (6.075=35.1533/SEb) = 5.7866
t-value of x variable = 1.1423/0.2761 = 4.1373
the regression equation is as follows;
Y= 35.1533 + 1.1423X
The residuals sum of squares = (-0.99635)2+ (-1.71898)2+(5.9964)2+ (-0.5730)2+ (-6.8577)2+(0.00365)2+ (4.146)2= 104.450
SSR = SST- SSE = 462 – 104.450 = 357.55
SST = Syy = 462
Multiple R squares = 1-(SSE/SST) = 1 – (104.450/462) = 0.7749
Adjusted R squared = 0.7299
P-value of the F-test = 009018.