## Written Response Questions

Please provide your answers to the questions below in MS Word. There are 8 questions, worth 7.5 points each.

1. What are the training set and test set used for, respectively? If a dataset is split by assigning 75% to one set and 25% to the other, should the 75% or the 25% go to the training set?

2. Removing predictor(s) is generally known as a data reduction technique. Explain under what conditions we should consider removing predictors.

3. What are the differences between simple random sampling and stratified random sampling?

4. Why is model tuning necessary for predictive modeling?

5. Use your own words to describe the process of building predictive models, considering data splitting and data resampling (refer to the graph below).

6. List three linear regression models we learned in class. What metrics can be used to compare the predictive performance of linear models?

7. What are the two tuning parameters associated with the Multivariate Adaptive Regression Splines (MARS) model? How do you determine the optimal values for the tuning parameters?

8. Define the K-Nearest Neighbors (KNN) regression method and indicate whether pre-processing predictors is needed before performing KNN.

## R Programming Scenario Questions

Please complete your work in an R script (both code and comments are required).

1. Create mat as a matrix (3 rows, 4 columns) in RStudio (see below). mat <- matrix(1:12, nrow=3, ncol=4)

(a) Assign ‘sample1’, ‘sample2’, ‘sample3’ as the names of the matrix rows, and ‘variable1’, ‘variable2’, ‘variable3’, ‘variable4’ as the names of the matrix columns. (3 points)

(b) To practice subsetting a matrix, run the two lines of code below in RStudio. What are their return values, respectively? List the return values in comments starting with “#”. (3 points)

mat[,4]

mat[3,]

(c) After the practice above, answer the two questions below. (4 points)

(1) Write R code to subset the matrix mat so that it returns the elements in the 2nd column.

(2) Write R code to subset the matrix mat so that it returns the elements in the 1st row.

2. Create df as a data frame in RStudio (see below).

df <- data.frame(names=c('Queen','Cleo','Rose','Bill','Flora'), score=c(23,32,27,40,45))

(a) Run the R code below to practice subsetting a data frame. What does it return? List the return values in comments starting with “#”. (2 points)

df$names

(b) After the practice in (a), write R code to subset the data frame df so that it returns the score column. (4 points)

(c) The subset() function is one of the functions we introduced in the week 1 R practice. Use subset() to return only the names with a score over 25. (4 points)

3. The Excel dataset (carEconomy.xlsx) contains 623 vehicles that were used to develop regression models for predicting miles per gallon (mpg) based on certain aspects of automobile design and performance. The dataset consists of 10 predictors related to fuel economy, such as the number of cylinders (cyl), displacement (disp), and gross horsepower (hp). The outcome is miles per gallon (mpg).

To start,

a. Make sure the packages below are installed in RStudio, and then load them with library().

• AppliedPredictiveModeling

• caret

• lattice

• corrplot

• earth

b. Import carEconomy.xlsx to RStudio.

c. Convert the imported data to a data frame.

d. Use the str() function to check the data structure. There are 11 columns in total; the 11th column is the outcome (mpg), and the rest are predictors.

Then, proceed to answer the questions below.

(1) Write R code to check missing data and perform data pre-processing.

• Check whether there are any missing values in the carEconomy dataset. (2 points)

• Check if there are any near-zero variance predictors. (2 points)

Note: If you find none, all predictors are useful; do not remove any.

• Check if there are any highly correlated predictors (using a threshold value of 0.9). (2 points)

(2) Split data into the training set and test set (say assign 75% to the training set). (3 points)

(3) Train three regression models: simple linear model (LM), MARS model, and KNN model, respectively. (8 points)

*Note 1: use “degree = 1:2, nprune = 2:38” when tuning the MARS model.

*Note 2: use “k = 1:50” when tuning the KNN model.

(4) Compare the predictive performance of the three models above. Which one would you choose? Use your chosen model to predict on the test data. (3 points)

## Written Response Questions Solution

1. What are the training set and test set used for, respectively? If a dataset is split by assigning 75% to one set and 25% to the other, should the 75% or the 25% go to the training set?

Ans: The training set is used to fit the model so that it can learn its parameters. The test set is used to assess the model on out-of-sample data that were not used during training, giving an estimate of its real-world performance. The 75% portion should go to the training set so that the model has enough data to estimate its parameters reliably.

2. Removing predictor(s) is generally known as a data reduction technique. Explain under what conditions we should consider removing predictors.

Ans: Predictors can be removed under conditions such as:

a) The predictor adds no value to the problem in a logical sense (e.g., a name or serial number).

b) The predictor duplicates information already carried by another predictor (e.g., the two are highly correlated).

c) The predictor has many missing values, which may lead to a poor fit.

3. What are the differences between simple random sampling and stratified random sampling?

Ans: Simple random sampling selects k of n objects at random; under this scheme, every possible sample of size k has the same probability of being selected.

In stratified random sampling, the data are divided into well-defined groups (strata), and simple random sampling is performed within each stratum. This usually represents the underlying population better, especially when there is class imbalance.
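The two schemes can be sketched in a few lines of base R; the data frame `d` and its imbalanced grouping variable `grp` below are hypothetical toy data, not part of the assignment.

```r
set.seed(1)
# Hypothetical data: 90 rows in group "A", only 10 in group "B"
d <- data.frame(grp = rep(c("A", "B"), times = c(90, 10)),
                x   = rnorm(100))

# Simple random sampling: every row has the same chance of selection,
# so the rare group "B" may be under- or over-represented by chance
srs_idx <- sample(nrow(d), 20)

# Stratified sampling: sample within each stratum separately so the
# group proportions in the sample match those in the data
strat_idx <- unlist(lapply(split(seq_len(nrow(d)), d$grp),
                           function(i) sample(i, ceiling(0.2 * length(i)))))

table(d$grp[strat_idx])  # preserves the 9:1 ratio (18 "A", 2 "B")
```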

4. Why is model tuning necessary for predictive modeling?

Ans: Hyperparameters control the overall behaviour of a machine learning model, and they cannot be estimated directly from the training data. The goal of tuning is to find the combination of hyperparameter values that minimizes a predefined loss function, typically by evaluating candidate values with resampling. Without tuning, a model may underfit or overfit, so tuning is necessary to obtain the best model for the problem at hand.

5. Use your own words to describe the process of building predictive models, considering data splitting and data resampling (refer to the graph below).

Ans: The steps of model building are outlined below:

Step 1: Select/Get Data

Step 2: Data cleaning/Data pre-processing

Step 3: Data splitting: Into training and test sets

Step 4: Split training set into Training and Validation set

Step 5: Model Selection and Develop Models (Training)

Step 6: Parameter tuning (Validation set), Optimize

Step 7: Testing and model performance evaluation
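As a sketch, the splitting and resampling steps above can be wired together with caret; the data frame `dat` and outcome `y` here are made-up stand-ins, not the course dataset.

```r
library(caret)

set.seed(1)
# Hypothetical data: two predictors and a noisy linear outcome
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(100)

# Step 3: hold out a test set (75% training / 25% testing)
in_train <- createDataPartition(dat$y, p = 0.75, list = FALSE)
training <- dat[in_train, ]
testing  <- dat[-in_train, ]

# Steps 4-6: 10-fold cross-validation on the training set plays the
# role of the validation split; caret refits the model in each fold
fit <- train(y ~ ., data = training, method = "lm",
             trControl = trainControl(method = "cv", number = 10))

# Step 7: final performance check on the untouched test set
postResample(predict(fit, testing), testing$y)
```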

6. List three linear regression models we learned in class. What metrics can be used to compare the predictive performance of linear models?

Ans: The regression models are ordinary least squares regression, kernel regression, k-NN regression, and the MARS model. Their predictive performance can be compared with metrics such as RMSE, R², and MAE.
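These comparison metrics can be computed directly in base R; the observed and predicted values below are invented purely for illustration.

```r
# Hypothetical observed and predicted mpg values
obs  <- c(21.0, 22.8, 18.7, 24.4, 19.2)
pred <- c(20.5, 23.1, 19.0, 23.8, 18.9)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error; penalizes large errors
mae  <- mean(abs(obs - pred))       # mean absolute error
r2   <- cor(obs, pred)^2            # proportion of variance explained

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

The model with the lowest RMSE/MAE and the highest R² on held-out data is generally preferred.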

7. What are the two tuning parameters associated with Multivariate Adaptive Regression Splines (MARS) model? How to determine the optimal values for the tuning parameters?

Ans: The two tuning parameters are the degree of interaction (degree) and the number of retained terms (nprune). Their optimal values are determined by evaluating candidate combinations on a validation set, or via cross-validation, and choosing the combination with the best performance (e.g., the lowest RMSE).

8. Define the K-Nearest Neighbors (KNN) regression method and indicate whether pre-processing predictors is needed before performing KNN.

Ans: KNN regression is a non-parametric method that predicts the continuous outcome for a new sample by averaging the outcomes of the K most similar (nearest) training observations. The neighbourhood size K is set by the analyst or chosen via cross-validation to minimize the mean squared error. Pre-processing is needed: because KNN relies on distances between samples, the predictors should be numeric and on comparable scales, so we typically center and scale the data.
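A minimal base-R sketch of why centering and scaling matters for KNN: without it, distances are dominated by the predictor with the largest units. The toy predictors `hp` and `wt` below are hypothetical.

```r
# Hypothetical predictors on very different scales
x <- data.frame(hp = c(110, 250, 95),      # large scale
                wt = c(2.62, 3.44, 1.84))  # small scale

# Raw Euclidean distances are driven almost entirely by hp
dist(x)

# Center and scale each column (mean 0, sd 1) before computing distances
x_std <- scale(x)
dist(x_std)  # both predictors now contribute comparably
```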

## R Programming Scenario Solution

# R scenario questions (40 points)

# Q1 ----------------------------------------------------------------------

#### Part A

mat <- matrix(1:12, nrow = 3, ncol = 4)

colnames(mat) <- c("variable1", "variable2", "variable3", "variable4")

rownames(mat) <- c("sample1", "sample2", "sample3")

#### Part B

mat[,4]

# Ans: sample1 sample2 sample3
#           10      11      12

mat[3,]

# Ans: variable1 variable2 variable3 variable4
#              3         6         9        12

#### Part C

mat[,2]

mat[1,]

# Q2 ----------------------------------------------------------------------

df <- data.frame(names=c('Queen','Cleo','Rose','Bill','Flora'), score=c(23,32,27,40,45))

#### Part A

df$names

# Ans: "Queen" "Cleo" "Rose" "Bill" "Flora"

#### Part B

df$score

#### Part C

subset(df, score>25)

# Q3 ----------------------------------------------------------------------

library(readxl)

library(AppliedPredictiveModeling)

library(caret)

library(lattice)

library(corrplot)

library(earth)

carEconomy <- read_excel("carEconomy.xlsx")

carEconomy <- as.data.frame(carEconomy)

str(carEconomy)

# Part 1 ------------------------------------------------------------------

sum(is.na(carEconomy))

## There are no missing values

nearZeroVar(carEconomy)

## There are no near-zero variance predictors, so none are removed

corrplot(cor(carEconomy[, 1:10]), method = "number")

findCorrelation(cor(carEconomy[, 1:10]), cutoff = 0.9)

## No pairwise correlations exceed the 0.9 threshold, so no predictors are removed

# Part 2 ------------------------------------------------------------------

set.seed(100)

tr_samp <- sample(nrow(carEconomy), floor(0.75 * nrow(carEconomy)), replace = FALSE)

training <- carEconomy[tr_samp,]

testing <- carEconomy[-tr_samp,]

# Part 3 ------------------------------------------------------------------

model1 <- lm(mpg ~ ., data = training)

summary(model1)

plot(model1)

hyper_grid <- expand.grid(degree = 1:2, nprune = 2:38)

model2 <- train(x = training[, 1:10], y = training$mpg, method = "earth",
                metric = "RMSE", tuneGrid = hyper_grid,
                trControl = trainControl(method = "cv", number = 10))

model2$bestTune

ggplot(model2)

k_parms <- expand.grid(k = 1:50)

model3 <- train(x = training[, 1:10], y = training$mpg, method = "knn",
                metric = "RMSE", tuneGrid = k_parms,
                trControl = trainControl(method = "cv", number = 10))

model3$bestTune

ggplot(model3)

# Part 4 ------------------------------------------------------------------

pred_lm <- predict(model1, testing)

pred_mars <- predict(model2, testing)

pred_knn <- predict(model3, testing)

RMSE(pred_lm, testing$mpg)

RMSE(pred_mars, testing$mpg)

RMSE(pred_knn, testing$mpg)

## The lowest RMSE on the test data is achieved by the simple linear regression model.
## We would choose simple linear regression due to its lower complexity and better RMSE on the test data.