# A Solution to Choose the Best Data Analysis and Visualization Technique to Create a Linear Model in R Studio

## Assignment instructions

Produce plots to explore the relationship, or lack thereof, between your response variable, and each of these potential predictors, choosing appropriate plots for the types of variables analyzed. Describe any relationships that arise.

Model selection

Build a linear model to predict evaporation (in mm) on a given day in Melbourne, using all of the predictors listed in the Bivariate summaries paragraph above. Also consider an interaction term between month and 9 am relative humidity, to determine whether humidity has a different effect in different months. Your model should include all predictors with a significant effect on evaporation, and no predictors that are not significant. In order to do this, produce your model according to the following process:

1. Fit a model containing all the possible predictors.

2. Determine the p-value for inclusion of each predictor:

   i. P-values for quantitative variables can be determined using the linear model summary.

   ii. P-values for categorical variables, or interactions containing categorical variables, can be determined using an ANOVA.

3. Remove the predictor with the highest p-value for inclusion, unless all remaining predictors are significant at the 5% level.

4. Update your model to include only the remaining predictors.

5. Repeat Steps 2-4 until only significant predictors remain.

Interpret the coefficients of your model in context. This includes the intercept, and the coefficients relating to each predictor. Your interpretation should be done in a manner that can be interpreted by your client.

For categorical predictors, you do not need to explain all coefficients but provide an overview of how the model operates in relation to these terms, with an example of one of the coefficients.

Model diagnostics

Test all of the assumptions of your linear model.

Prediction

MWC is interested both in the general application of your model, and in some particular extreme scenarios that it envisages. They thus seek your predictions for the amount of evaporation, in mm, for days of the following character:

• February 29, 2020, if this day has a minimum temperature of 13.8 degrees and reaches a maximum of 23.2 degrees, and has 74% humidity at 9am.

• December 25, 2020, if this day has a minimum temperature of 16.4 degrees and reaches a maximum of 31.9 degrees, and has 57% humidity at 9am.

• January 13, 2020, if this day has a minimum temperature of 26.5 degrees and reaches a maximum of 44.3 degrees, and has 35% humidity at 9am.

• July 6, 2020, if this day has a minimum temperature of 6.8 degrees and reaches a maximum of 10.6 degrees, and has 76% humidity at 9am.

Provide a table containing appropriate intervals for making forecasts on these particular days, and explain the intervals in context. Compare and contrast potential amounts of evaporation on different days.

If there is more than 10mm of evaporation at MWC’s Cardinia Reservoir, the corporation takes temporary measures to ensure a continuous supply of water, including transferring water from its Silvan Reservoir upstream:

• For which of the predicted days can we say with 95% confidence that this will occur?

• For which of the predicted days can we say with 95% confidence that this will not occur?

Assignment Solution

Assignment 2

Executive Summary

The analysis of Melbourne’s weather observations examines how evaporation on a given day relates to the day of the month, the month, the minimum and maximum temperature in the 24 hours to 9 am, and relative humidity at 9 am. Minimum temperature, month, relative humidity, and the interaction between relative humidity and month were found to affect daily evaporation significantly. The statistical model built on these attributes was tested and found to be effective for predicting evaporation. The day of the month, as expected, showed no relationship with evaporation. Evaporation is high in summer and low in winter, peaking around January and reaching its lowest levels around May-June.

Methods

The analysis relates evaporation on a given day to the day of the month, the month, the minimum and maximum temperature in the 24 hours to 9 am, and relative humidity at 9 am, using Melbourne’s weather observations data. R was used as the statistical software for all computation in this work. Incomplete data points, which had NAs in evaporation, were removed at the start so that only complete cases entered the analysis.

Bivariate Summaries

The first step in analyzing the data is exploratory analysis. In this section, the relationship between the response variable, evaporation, and each predictor variable is explored individually to get a sense of the direction of the analysis.

Month

The month variable represents the calendar month of the date on which the observation was made. The month variable is expected to have a significant impact on evaporation, as different months can have different weather, leading to different levels of evaporation. A boxplot was used to look at the difference in evaporation by month.

Figure 1: Boxplot of evaporation by month. It confirms that there are big differences in evaporation across different months.

The plot matches the expectation that evaporation levels differ considerably between months. From the plot, it can be noted that the May-Aug period shows less evaporation and Dec-Mar shows higher evaporation levels.
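Month names stored as plain text are sorted alphabetically by R, which scatters the seasons across a boxplot. The sketch below uses simulated data (not the Melbourne observations) and base R's `boxplot()` to show one way of putting the months in calendar order with `factor(levels = month.name)`. The data frame `sim` and its seasonal pattern are invented for illustration, and `month.name` assumes English month names.

```r
# Simulated daily records with a southern-hemisphere seasonal pattern (illustrative only)
set.seed(3)
sim <- data.frame(month = sample(month.name, 365, replace = TRUE))
season <- cos(2 * pi * (match(sim$month, month.name) - 1) / 12)  # peaks in January
sim$evaporation <- 6 + 3 * season + rnorm(365)

# Order the months by calendar rather than alphabetically
sim$month <- factor(sim$month, levels = month.name)

# One box per month, now in calendar order
boxplot(evaporation ~ month, data = sim, las = 2,
        xlab = "", ylab = "Evaporation (mm)")
```

Note that ordering the factor this way also makes January, rather than April, the reference level if the same column is later passed to `lm()`.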

Day

The Day variable represents the day of the month on which the observation was made. Day is not expected to make any big difference to the evaporation level, because a single day is too short an interval to exhibit any strong pattern. A scatterplot was used to study the relationship, with evaporation on the x-axis.

Figure 2: As expected, no significant pattern was observed between the day variable and evaporation levels.

The scatterplot does not show any visible pattern, which matches the original expectation.

Minimum Temperature (9:00 AM)

The minimum temperature variable represents the minimum temperature observed in the 24 hours to 9 am on the day. This variable is expected to have a strong impact on evaporation levels, as temperature should affect evaporation directly. A higher minimum temperature is expected to lead to higher evaporation. A scatterplot was used to study the relationship, with evaporation on the x-axis.

Figure 4: Scatterplot between evaporation and maximum temperature. The plot shows an increasing relationship between evaporation and maximum temperature.

As expected, the scatterplot shows a strong relationship between evaporation and maximum temperature.

Relative humidity (9:00 AM)

The relative humidity variable records the relative humidity at 9 am. It is also expected to have a strong effect on evaporation: higher humidity should lead to lower evaporation, because the air can only hold water up to a saturation point. A scatterplot was used to study the relationship, with evaporation on the x-axis.

Figure 5: Scatterplot between evaporation and relative humidity. It shows a negative relationship between the two.

As expected, there is a strong, visibly decreasing relationship between evaporation and relative humidity.

Model Selection

• The first model was built with evaporation as the dependent variable and day, month, minimum temperature, maximum temperature, relative humidity, and the interaction between month and relative humidity as predictors. The model had a multiple R² of 0.6383. In terms of significance, day (p=0.74) and maximum temperature (p=0.70) were found to be statistically insignificant. At this step the day variable, which had the highest p-value, was removed.

• The model was fitted again with all the previous predictors except day. This time, only maximum temperature (p=0.71) was found to be statistically insignificant. Hence, maximum temperature was dropped in the next model fit.

• The third model was fitted with evaporation as the dependent variable and month, minimum temperature, relative humidity, and the interaction between month and relative humidity as predictors. In this model, all variables were found to be statistically significant.

The final model, in which all variables were statistically significant, was model 3, with month (p<0.001), minimum temperature (p<0.001), relative humidity (p<0.001), and the interaction between month and relative humidity (p<0.001).
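The elimination steps described above can also be automated. The following is a minimal sketch on simulated data, not the Melbourne data set. It swaps in `drop1(model, test = "F")` in place of reading `summary()` and `anova()` separately: for a single quantitative term the F-test p-value equals the t-test p-value from the model summary, and for categorical terms it tests the whole factor at once. The helper `backward_eliminate()` and the variables `group`, `x1`, `x2`, `x3` are hypothetical names introduced for illustration.

```r
# Simulated data: y depends on group and x1 only; x2 and x3 are pure noise
set.seed(1)
n <- 200
sim <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
group_effect <- unname(c(a = 0, b = 1, c = -1)[as.character(sim$group)])
sim$y <- 2 + 1.5 * sim$x1 + group_effect + rnorm(n)

# Hypothetical helper: drop the least significant term until all remaining
# terms are significant at level alpha
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    tests <- drop1(model, test = "F")       # F-test for removing each droppable term
    pvals <- tests[["Pr(>F)"]][-1]          # first row is "<none>"
    names(pvals) <- rownames(tests)[-1]
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- names(pvals)[which.max(pvals)] # highest p-value goes first
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

full <- lm(y ~ group + x1 + x2 + x3, data = sim)
final <- backward_eliminate(full)
formula(final)
```

When a model contains an interaction, `drop1()` only offers terms that can be removed without breaking marginality, so a main effect is never tested for removal while its interaction is still in the model.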

Expectation vs reality?

These terms differ from what was expected in the bivariate summaries, where maximum temperature appeared to be impactful; in the model it turned out not to be statistically significant. This is likely due to multicollinearity. Minimum and maximum temperature are expected to be highly correlated, since hot days tend to have both a high minimum and a high maximum and vice versa, so the information in maximum temperature is already captured by minimum temperature. The correlation between minimum and maximum temperature was 0.70, which is quite high. Hence, even though maximum temperature would have been significant on its own, it was not significant in the presence of minimum temperature.

However, all the other terms that were expected to be significant in the bivariate summaries are indeed significant.
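The effect described above, where a predictor is significant on its own but adds little once a correlated predictor is present, can be reproduced on simulated data. The sketch below is illustrative only: the variables are generated rather than taken from the Melbourne data, and `vif_manual()` is a hypothetical helper that computes a variance inflation factor as 1 / (1 - R²) from regressing one predictor on the others.

```r
# Simulated predictors: max_temp is built from min_temp, so the two are correlated
set.seed(42)
n <- 300
min_temp <- rnorm(n, mean = 12, sd = 4)
max_temp <- min_temp + 8 + rnorm(n, sd = 4)
humidity <- rnorm(n, mean = 65, sd = 10)
# Response depends on min_temp and humidity only, not directly on max_temp
evap <- 1 + 0.4 * min_temp - 0.05 * humidity + rnorm(n)

cor(min_temp, max_temp)   # pairwise correlation of the two temperatures

# Hypothetical helper: VIF = 1 / (1 - R^2) of target regressed on the other predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}
vif_max <- vif_manual(max_temp, data.frame(min_temp, humidity))
vif_max

# max_temp alone versus max_temp alongside min_temp
alone <- summary(lm(evap ~ max_temp))$coefficients
joint <- summary(lm(evap ~ min_temp + max_temp + humidity))$coefficients
alone
joint
```

Comparing the `max_temp` rows of `alone` and `joint` shows how much of its apparent effect is absorbed once `min_temp` enters the model.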

Model Diagnostics

The linear model implemented in this work comes with a few assumptions, which are outlined below and tested in the appendix of the report.

1. Homoscedasticity: The variance of the residuals is the same for any value of X. This was tested through a scatterplot of residuals against fitted values.

2. Independence: Observations are independent of each other. This assumption was tested using the ACF and PACF of the residuals.

3. Normality: Residuals are normally distributed. This was tested through a normal Q-Q plot.

4. No multicollinearity: The independent variables should not be highly correlated with each other. This was tested using a correlation matrix.

For assessment, readers are referred to the Appendix section of this report.
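As a rough illustration of how these checks can be run, the sketch below fits a simple model to simulated data and produces the same kinds of plots. It also adds two formal tests from base R, `shapiro.test()` for normality and `Box.test()` (Ljung-Box) for autocorrelation. These tests are a supplement assumed here for illustration and were not part of the assessment in this report.

```r
# Simulated data that satisfies the linear model assumptions (illustrative only)
set.seed(7)
n <- 250
x <- runif(n, 0, 10)
y <- 3 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)
res <- residuals(fit)

# Homoscedasticity: residuals vs fitted should show no funnel or curve
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should follow the reference line
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)   # formal normality test

# Independence: no large spikes beyond lag 0
acf(res)
pacf(res)
lb <- Box.test(res, lag = 10, type = "Ljung-Box")   # formal autocorrelation test

c(shapiro_p = sw$p.value, ljung_box_p = lb$p.value)
```

Small p-values from either test would point to a violated assumption; with large samples, though, these tests can flag departures too small to matter, so the plots remain the primary evidence.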

Result

Model Interpretations

The final model selected for prediction in this work had evaporation as the dependent variable and month, minimum temperature, relative humidity, and the interaction between month and relative humidity as predictors. In R, the first category in alphabetical order is selected as the reference level for Month, hence April is the reference month for both the month term and the interaction term. The slopes and intercept of the model can be interpreted as below:

• Intercept (10.56) - The intercept represents the average evaporation when the numeric variables are set to 0 and the categorical variable is set to its reference category. Here, if minimum temperature is 0 degrees, relative humidity is 0%, and the month is April, then the average evaporation is estimated to be 10.56mm.

• Relative Humidity (-0.147) - The slope of relative humidity indicates that, if all the other variables are kept the same, evaporation changes by -0.147mm for each one-point increase in relative humidity. Because the model includes an interaction with month, this slope applies to April, the reference month.

• Minimum Temperature (0.369) - The slope of minimum temperature indicates that, if all the other variables are kept the same, evaporation changes by 0.369mm for each one-degree increase in minimum temperature.

• Month - The coefficients for the different categories of month represent the average estimated difference in evaporation compared to April at 0% relative humidity, if all the other variables are kept the same. Hence, the coefficient of June (-10.348) indicates that, at 0% relative humidity and with all the other variables kept the same, evaporation is on average 10.348mm lower in June than in April.

• Interaction between Month and Relative Humidity - The coefficients in these categories represent how much the effect of a one-point increase in relative humidity differs from its effect in April, if all the other variables are kept the same. Hence, the coefficient of Aug in the interaction terms (0.136) shows that the humidity slope in August is 0.136 higher than in April, so humidity reduces evaporation far less in August than in April.
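The interaction coefficient is easiest to read as an adjustment to the humidity slope. The short calculation below uses only the two coefficients quoted above (-0.147 for relative humidity and 0.136 for the August interaction); the variable names are illustrative.

```r
# Coefficients quoted in the interpretation above (April is the reference month)
b_humidity_april <- -0.147    # humidity slope in April
b_interaction_aug <- 0.136    # adjustment to the humidity slope in August

# Humidity slope in August = April slope + August interaction coefficient
slope_aug <- b_humidity_april + b_interaction_aug
slope_aug                     # -0.011 mm per one-point increase in humidity

# Estimated change in evaporation for a 10-point rise in relative humidity
change_april <- 10 * b_humidity_april   # -1.47 mm
change_aug <- 10 * slope_aug            # -0.11 mm
c(April = change_april, August = change_aug)
```

Under these coefficients, a 10-point rise in humidity lowers estimated evaporation by about 1.47mm in April but only about 0.11mm in August.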

The ANOVA table of the model shows the contribution of each variable towards explaining the variance of the data. The model had an R² of 0.638, which means 63.8% of the variance in the data was explained by the predictors in the final model.

There were a few outliers in the residuals; apart from these, the residuals appeared to be approximately normally distributed.

Discussion

Prediction

Predictions were made for the four given dates using the provided attributes and are given in the table below:

Table 1: Predictions for the four mentioned dates

| Date | Fitted (mm) | 95% Prediction Interval, Lower (mm) | 95% Prediction Interval, Upper (mm) |
| --- | --- | --- | --- |
| February 29, 2020 | 5.506 | 1.089 | 9.923 |
| December 25, 2020 | 8.606 | 4.209 | 13.003 |
| January 13, 2020 | 14.872 | 10.105 | 19.640 |
| July 6, 2020 | 2.265 | -2.111 | 6.642 |

The 95% prediction interval gives a forecast range that has 95% confidence of capturing the future value. So, one can have 95% confidence that the evaporation on Feb 29, 2020 will be in the range [1.089, 9.923]. Similarly, one can have 95% confidence that the evaporation on Dec 25, 2020 will be in the range [4.209, 13.003]. Since evaporation cannot be negative, the lower bound for July 6 is effectively 0mm. The January and December forecasts, which fall in summer, are high, whereas the July forecast, which falls in winter, is very low.

For January 13, the whole prediction interval lies above 10mm, so more than 10mm of evaporation is predicted with 95% confidence. For Feb 29 and July 6, the whole interval lies below 10mm, so less than 10mm is predicted with 95% confidence. For Dec 25, the interval contains 10mm, so the model cannot say with 95% confidence whether evaporation will be above or below 10mm.
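The decision rule used here compares each interval with the 10mm threshold: if the lower bound is above 10 the day is flagged as exceeding it, if the upper bound is below 10 the day is flagged as not exceeding it, and otherwise the outcome is uncertain. A small sketch of that rule, using the values from Table 1 (the data frame and column names are illustrative):

```r
# Fitted values and 95% prediction intervals copied from Table 1
intervals <- data.frame(
  date = c("February 29, 2020", "December 25, 2020",
           "January 13, 2020", "July 6, 2020"),
  fit = c(5.506, 8.606, 14.872, 2.265),
  lower = c(1.089, 4.209, 10.105, -2.111),
  upper = c(9.923, 13.003, 19.640, 6.642)
)

threshold <- 10  # mm of evaporation that triggers temporary measures

# Whole interval above the threshold, whole interval below it, or straddling it
intervals$decision <- ifelse(
  intervals$lower > threshold, "above 10 mm",
  ifelse(intervals$upper < threshold, "below 10 mm", "uncertain")
)

intervals[, c("date", "decision")]
```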

Conclusion

The analysis of Melbourne’s weather data reveals some insights into evaporation levels throughout the year and their possible dependence on weather attributes. Evaporation was found to be very low around mid-winter and very high during the mid-summer months, but it does not show any pattern with respect to the day of the month. The minimum temperature for the day affects evaporation strongly, as does relative humidity, but maximum temperature was not required once minimum temperature was available, essentially because the two variables carry overlapping information. The model was able to explain 63.8% of the variability in the data, which is quite encouraging. Among the four dates for which predictions were made, one day (Jan 13, 2020) was identified as having >10mm evaporation with 95% confidence; on that day the corporation may need to take temporary measures to ensure a continuous supply of water, including transferring water from its Silvan Reservoir upstream.

Appendix

Code

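rm(list = ls())
options(warn = -1)
# The raw weather observations are assumed to be loaded into `melbourne` at this
# point, e.g. with read.csv(); the import call is not shown.
# Keep only the columns needed and give them shorter names
melbourne <- melbourne[, c("Date", "Minimum.temperature..Deg.C.", "Maximum.Temperature..Deg.C.", "X9am.relative.humidity....", "Evaporation..mm.")]
names(melbourne) <- c("Date", "Minimum_Temperature_9AM", "Maximum_Temperature_9AM", "Relative_humidity_9AM", "Evaporation")
# The format must be passed by name; the second positional argument of as.POSIXct() is tz
melbourne$Date <- as.POSIXct(melbourne$Date, format = "%Y-%m-%d")
# Derive month name and day of month from the date
month <- months(melbourne$Date)
day <- as.numeric(format(melbourne$Date, "%d"))
melbourne <- cbind(melbourne, "Month" = month, "Day" = day)
attach(melbourne)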
# Exploration -------------------------------------------------------------
library(tidyverse)
ggplot(melbourne, aes(Evaporation, Month )) + geom_boxplot() + labs(title = "Boxplot", subtitle = "Evaporation by month")
ggplot(melbourne, aes( Evaporation, Day )) + geom_point() + labs(title = "Scatterplot", subtitle = "Evaporation by day of the month")
ggplot(melbourne, aes(Evaporation, Minimum_Temperature_9AM )) + geom_point() + labs(title = "Scatterplot", subtitle = "Evaporation vs Minimum Temperature") + ylab("Minimum Temperature (9:00 AM)")
ggplot(melbourne, aes(Evaporation, Maximum_Temperature_9AM )) + geom_point() + labs(title = "Scatterplot", subtitle = "Evaporation vs Maximum Temperature") + ylab("Maximum Temperature (9:00 AM)")
ggplot(melbourne, aes(Evaporation, Relative_humidity_9AM )) + geom_point() + labs(title = "Scatterplot", subtitle = "Evaporation vs Relative humidity") + ylab("Relative Humidity (9:00 AM)")

# Model Selection ---------------------------------------------------------
library(MASS)
melbourne<- melbourne[complete.cases(melbourne),]
model1 <- lm(Evaporation ~ Day + Month + Minimum_Temperature_9AM + Maximum_Temperature_9AM + Relative_humidity_9AM + Month * Relative_humidity_9AM,
melbourne)
summary(model1)
anova(model1)
model2 <- lm(Evaporation ~ Month + Minimum_Temperature_9AM + Maximum_Temperature_9AM + Relative_humidity_9AM + Month * Relative_humidity_9AM,
melbourne)
summary(model2)
anova(model2)
model3 <- lm(Evaporation ~ Month + Relative_humidity_9AM + Minimum_Temperature_9AM + Month * Relative_humidity_9AM,
melbourne)
summary(model3)
anova(model3)
plot(model3)
qqnorm(residuals(model3))
acf(residuals(model3))
pacf(residuals(model3))
library(corrplot)  # provides corrplot(), which was not loaded above
corrplot(cor(melbourne[,2:5]), method = "number")
# Prediction --------------------------------------------------------------
data_pred<- data.frame("Month" = c("February", "December", "January","July"),
"Day" = c(29,25,13,6),
"Relative_humidity_9AM" = c(74, 57, 35, 76),
"Maximum_Temperature_9AM" = c(23.2,31.9,44.3,10.6),
"Minimum_Temperature_9AM" = c(13.8,16.4,26.5,6.8)
)
# predict.lm() takes the interval coverage as `level`, not `conf.level`
predict(model3, data_pred, interval = "prediction", level = 0.95)

## Model Diagnostics Assessment

1. Homoscedasticity: The model appears to be homoscedastic, as the variance of the residuals does not change across different ranges of fitted values.

2. Independence: The observations in the model appear to be independent, as there are no large spikes in the ACF or PACF plots.

3. Normality: The residuals follow a roughly straight line in the normal Q-Q plot, which indicates that the normality assumption is not violated.

4. Multicollinearity: There does not seem to be high correlation between minimum temperature and relative humidity. Hence, no multicollinearity was observed.