Simple & Multiple Linear Regression


Instructions

Please submit your solutions to Canvas as an R Markdown (.Rmd) file. Please also knit your R Markdown file and submit the resulting HTML file as well.

Part 1: Pricing Monet’s Paintings

Claude Monet (1840-1926) was one of the founders of French Impressionist painting, and his artwork remains highly prized today. Monet’s paintings have sold for record amounts; for example, ‘Grainstack’ was sold at auction for $81.4 million in 2016.

The goal of this problem is for you to develop a regression model that predicts the price (dependent variable) of a Monet painting based on its attributes (independent variables).

Load the data in R, and examine its features. WIDTH and HEIGHT represent the width and height of the painting. The SIGNED feature indicates whether or not Monet’s signature appears on the painting; 1 means a painting is signed, and 0, unsigned. PICTURE is an ID for the painting, and HOUSE indicates the auction house at which the painting was sold.

Using the pairs and cor functions, explore the relationship between the various features in the data set. Which features exhibit the highest correlation with one another? And which features are most highly correlated with PRICE?
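For example, a minimal sketch, assuming the data have been loaded into a data frame named monet:

```
monet <- read.csv("monet.csv")  # file name is an assumption

pairs(monet)  # scatterplot matrix of every pair of features
cor(monet)    # correlation matrix; scan the PRICE row for the strongest predictors
```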

Simple Linear Regression

Recall that one of the assumptions underlying multiple regression is that all explanatory variables be independent of one another. Sometimes, when two explanatory variables are correlated, it makes sense to replace them with a single variable that represents their interaction. Replace HEIGHT and WIDTH with a single variable that accounts for both these features. Mutate the data to incorporate this new feature.
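For instance, a minimal sketch with dplyr, assuming the data frame is named monet and the new feature (illustratively named AREA) is the painting's area:

```
library(dplyr)

# AREA combines HEIGHT and WIDTH into a single size feature
monet <- monet %>%
  mutate(AREA = HEIGHT * WIDTH) %>%
  select(-HEIGHT, -WIDTH)
```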

Find the correlation between PRICE and your new feature. Transform the variables to see if there might be a stronger correlation in log space. Try all combinations of log transformations.
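One way to try all four combinations, assuming the size feature is named AREA:

```
cor(monet$PRICE,      monet$AREA)       # neither logged
cor(log(monet$PRICE), monet$AREA)       # PRICE logged
cor(monet$PRICE,      log(monet$AREA))  # AREA logged
cor(log(monet$PRICE), log(monet$AREA))  # both logged
```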

Pick one of the transformations to run with, and plot PRICE vs. your size feature in log space. Then use lm to create a simple linear regression model to price Monet’s paintings. Summarize the model, and add an abline to your plot. Then plot the residuals. Comment on both the model (i.e., the residual standard error, the $R^2$ value, etc.) and the plot.
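A minimal sketch, assuming the log-log transformation was chosen:

```
fit <- lm(log(PRICE) ~ log(AREA), data = monet)
summary(fit)  # residual standard error, R-squared, coefficients

plot(log(PRICE) ~ log(AREA), data = monet)
abline(fit)           # fitted regression line
plot(residuals(fit))  # residuals vs. observation index
```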

Note: A standard error is the standard deviation of a sampling distribution, where the sampling distribution is the distribution of estimates. Here, the relevant quantity is the residual standard error, which is the estimated standard deviation of the residuals, i.e., the typical distance between the observed values and the fitted (predicted) values.
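In symbols, for $n$ observations with residuals $e_i = y_i - \hat{y}_i$ and $p$ estimated coefficients ($p = 2$ in simple linear regression), the residual standard error is

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - p}}$$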

What can you do to improve the model? Go ahead and make the necessary change, remaining within the realm of simple linear regression (i.e., do not add additional variables, yet).

Multiple Linear Regression

Intuitively, there is another variable beyond the size of a painting that seems as if it should influence its price. Which variable is it? If you guessed SIGNED, you are correct! What is the average price of signed vs. unsigned paintings? (Answer this question for small paintings only, and for all paintings in the data set.)
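For example, using tapply to compute group means, assuming the subset of small paintings is named small:

```
tapply(monet$PRICE, monet$SIGNED, mean)  # all paintings, by SIGNED
tapply(small$PRICE, small$SIGNED, mean)  # small paintings only
```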

Plot PRICE vs. your size feature for small paintings only, and color the points in your plot based on the SIGNED feature. Then, continuing with your preferred transformation, create the corresponding plot in log space. Next, use lm to create a multiple linear regression model for the transformed data.
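A minimal sketch in base R, again assuming the log-log transformation and a subset of small paintings named small:

```
plot(PRICE ~ AREA, data = small,
     col = ifelse(small$SIGNED == 1, "blue", "red"))
plot(log(PRICE) ~ log(AREA), data = small,
     col = ifelse(small$SIGNED == 1, "blue", "red"))
model <- lm(log(PRICE) ~ log(AREA) + SIGNED, data = small)
```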

As above, summarize the model, and add an abline to your plot (in log space) for signed paintings, and a second one for unsigned paintings.

Here is how to extract the relevant coefficients from your model, which you can then use to define these values:

```
a <- model$coefficients[1]
b <- model$coefficients[2]
dummy <- model$coefficients[3]
```

Next, plot the residuals, and comment on both the model and the plot.

The final step in this analysis is to plot PRICE vs. your size variable in non-log space, so that you can visualize the transformation of your linear model as it pertains to the observed (non-transformed) data, to confirm that it is accurate.

But before doing so, use R’s built-in smooth.spline function to fit a curve to these data, and use the lines function to display this curve on your plot. In addition to the x and y data to fit, smooth.spline takes as input a parameter df, which stands for degrees of freedom and controls how flexible the fitted curve is; larger values allow a wigglier fit. (Play around with various settings of this parameter.)

```
lines(smooth.spline(data$SIZE, data$PRICE, df = 5), col = "green")
```

Finally, to add to this plot the two curves that represent your linear model (one for the signed paintings, and a second for the unsigned ones), you should use the curve function (twice). The curve function takes as input an expression in x (e.g., x ** 2) describing a curve to add to a plot. You should give as input to the curve function the inverse of the log transformation you applied: i.e., where you took a log, you should exponentiate.

For example, if you took only the log of PRICE, so that your linear model is $\log Y = a + b\,x + \text{dummy} \cdot \text{SIGNED}$, where $Y$ is PRICE and $x$ is your size feature, then you can plot your curves as follows:

```
curve(exp(a + b * x + dummy), col = "blue", add = TRUE)  # signed
curve(exp(a + b * x), col = "red", add = TRUE)           # unsigned
```

Alternatively, if you took the log of both PRICE and your size variable, so that your linear model is $\log Y = a + b \log x + \text{dummy} \cdot \text{SIGNED}$, then you can plot your curves as follows:

```
curve(exp(a + dummy) * (x ** b), col = "blue", add = TRUE)  # signed
curve(exp(a) * (x ** b), col = "red", add = TRUE)           # unsigned
```

If you are unhappy with the fit you obtained (i.e., if your curves do not seem to fit the data), repeat some of the earlier steps (e.g., try a different log transformation) until you are satisfied. But don’t forget to change the inputs to the curve function based on the log transformation you apply, so that you can see how the curves vary with the transformations.

Solution

---
title: "Report"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
data = read.csv('./monet.csv')
cor(data)
```

We can see that the highest correlation is between HEIGHT and WIDTH. This is intuitive, as most paintings conform to a particular aspect ratio. The next highest correlations are between HEIGHT and PRICE and between WIDTH and PRICE, indicating that the price of a painting increases with its size.

```{r}
pairs(data)
```

## Simple Linear Regression

Creating a new variable AREA as the product of HEIGHT and WIDTH:

```{r}
data$AREA = data$HEIGHT * data$WIDTH
data$HEIGHT = NULL
data$WIDTH = NULL
cor(data)
```

Adding log transformations of the non-categorical variables (PRICE, AREA):

```{r}
data$Log_PRICE = log(data$PRICE)
data$Log_AREA = log(data$AREA)
cor(data)
```

From the correlation output, we can see that the correlation is highest between Log_AREA and Log_PRICE. This is an improvement over the correlation of 0.347 between PRICE and AREA. Therefore, log-transforming PRICE and AREA does strengthen the correlation between the two variables.

```{r}
plot(PRICE ~ Log_AREA, data = data)
plot(Log_PRICE ~ Log_AREA, data = data)
```

A much clearer relationship can be seen between Log_PRICE and Log_AREA than between PRICE and Log_AREA. This further confirms the correlations we observed earlier.

Training a linear regression model to predict PRICE using Log_AREA:

```{r}
model = lm(PRICE ~ Log_AREA, data = data)
plot(PRICE ~ Log_AREA, data = data)
abline(model)
```

Plotting the residuals of the fitted linear model:

```{r}
plot(residuals(model))
```

```{r}
summary(model)
```

From the summary of the model, we can see that the residual standard error is 3.99 and the adjusted R-squared value is 0.142.

#### REMOVING OUTLIERS

```{r}
boxplot(data$PRICE)
```

We can see the outliers detected by the boxplot. We remove these observations in order to eliminate their adverse effect on the fit.

```{r}
outliers = boxplot(data$PRICE)$out
minValue = min(outliers)
data = data[data$PRICE < minValue,]  # keep only prices strictly below the smallest outlier
model = lm(PRICE ~ Log_AREA, data = data)
plot(PRICE ~ Log_AREA, data = data)
abline(model)
```

Checking the summary of the new model

```{r}
summary(model)
```

We see that the removal of outliers has reduced the residual standard error and has also increased the adjusted R-squared value.

#### CLUSTERING BASED ON AREA

```{r}
plot(PRICE ~ AREA, data = data)
```

We can see that there exist two clusters based on AREA, which can be separated at a threshold area of 2500. We use this threshold to split the data, and then fit a linear regression model to the small-paintings category.

```{r}
smallPainting = data[data$AREA <= 2500,]
largePainting = data[data$AREA > 2500,]
model = lm(PRICE ~ Log_AREA, data = smallPainting)
plot(PRICE ~ Log_AREA, data = smallPainting)
abline(model)
```

```{r}
summary(model)
```

## Multiple Linear Regression

Looking at the effect of the SIGNED variable on PRICE:

```{r}
mean(data[data$SIGNED == 1,]$PRICE)                    # all signed paintings
mean(data[data$SIGNED == 0,]$PRICE)                    # all unsigned paintings
mean(smallPainting[smallPainting$SIGNED == 1,]$PRICE)  # small signed paintings
mean(smallPainting[smallPainting$SIGNED == 0,]$PRICE)  # small unsigned paintings
```

Going by the average value, SIGNED has a strong effect on the PRICE variable.

We now plot PRICE against AREA and color-code the signed and unsigned paintings. We then make another plot taking the log of both PRICE and AREA, since we saw earlier that this combination had the best correlation.

```{r}
library(ggplot2)

# factor(SIGNED) gives two discrete colors instead of a continuous scale
ggplot(data = smallPainting) +
  geom_point(aes(x = AREA, y = PRICE, color = factor(SIGNED)))
ggplot(data = smallPainting) +
  geom_point(aes(x = log(AREA), y = log(PRICE), color = factor(SIGNED)))
```

We now create a multiple regression model that predicts PRICE using Log_AREA and the SIGNED variable.

```{r}
model = lm(PRICE ~ Log_AREA + SIGNED, data = smallPainting)

a <- model$coefficients[1]      # intercept
b <- model$coefficients[2]      # Log_AREA coefficient
dummy <- model$coefficients[3]  # SIGNED coefficient
```

```{r}
# Plot for signed paintings
plot(PRICE ~ Log_AREA, data = smallPainting[smallPainting$SIGNED == 1,])
abline(a + dummy, b)
```

```{r}
# Plot for unsigned paintings
plot(PRICE ~ Log_AREA, data = smallPainting[smallPainting$SIGNED == 0,])
abline(a, b)
```

Plotting the residuals from this model

```{r}
plot(residuals(model))
```

```{r}
summary(model)
```

From the summary of the model, we see that adding the SIGNED variable has helped the model: it has brought down the residual standard error, while R-squared remains more or less constant.

From the plot, it can also be seen that the fit has been pulled upward by a few outlier observations with high prices.

#### Plotting the models in non-log space

```{r}
plot(PRICE ~ AREA, data = smallPainting)

# The model predicts PRICE (not log PRICE) from Log_AREA, so only the
# size variable needs to be inverted: substitute log(x) for Log_AREA.
curve(a + dummy + b * log(x), add = TRUE, col = "red")  # signed paintings
curve(a + b * log(x), add = TRUE, col = "blue")         # unsigned paintings
```