# Linear Discriminant Analysis

**Solution**

Q1.

On fitting Linear Discriminant Analysis (LDA) to the training data, the confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 6 | 48 | 54 |
| Total | 39 | 61 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 112 | 39 | 151 |
| Actual: YES | 41 | 108 | 149 |
| Total | 153 | 147 | 300 |

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.2667 for the testing data.
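As a quick arithmetic check (sketched in Python, although the analysis itself is in R), each error rate is just the sum of the off-diagonal counts divided by the total:

```python
def misclassification_rate(tn, fp, fn, tp):
    """Fraction of misclassified cases: off-diagonal counts over the total."""
    return (fp + fn) / (tn + fp + fn + tp)

# LDA confusion matrices above: rows = actual (NO, YES), cols = predicted (NO, YES)
train_rate = misclassification_rate(33, 13, 6, 48)    # (13 + 6) / 100
test_rate = misclassification_rate(112, 39, 41, 108)  # (39 + 41) / 300

print(train_rate)           # 0.19
print(round(test_rate, 4))  # 0.2667
```

The same computation reproduces the training and testing error rates reported for every model below.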

Q2.

On fitting Quadratic Discriminant Analysis (QDA) to the training data, the confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 6 | 48 | 54 |
| Total | 39 | 61 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 113 | 38 | 151 |
| Actual: YES | 43 | 106 | 149 |
| Total | 156 | 144 | 300 |

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q3.

On fitting logistic regression to the training data, the confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 9 | 45 | 54 |
| Total | 42 | 58 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 104 | 47 | 151 |
| Actual: YES | 37 | 112 | 149 |
| Total | 141 | 159 | 300 |

The misclassification error rate is 0.22 for the training data.

The misclassification error rate is 0.28 for the testing data.

Q4.

On using K-nearest neighbor (KNN) classification with k = 3, the confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 40 | 6 | 46 |
| Actual: YES | 3 | 51 | 54 |
| Total | 43 | 57 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 105 | 46 | 151 |
| Actual: YES | 33 | 116 | 149 |
| Total | 138 | 162 | 300 |

The misclassification error rate is 0.09 for the training data.

The misclassification error rate is 0.2633 for the testing data.

Q5.

On using K-nearest neighbor (KNN) classification with k = 9, the confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 40 | 6 | 46 |
| Actual: YES | 11 | 43 | 54 |
| Total | 51 | 49 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
| --- | --- | --- | --- |
| Actual: NO | 115 | 36 | 151 |
| Actual: YES | 45 | 104 | 149 |
| Total | 160 | 140 | 300 |

The misclassification error rate is 0.17 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q6.

If the main objective of this study is to develop a data-driven model for better prediction of the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the testing data.

The misclassification error rates on the testing data are 26.67%, 27%, 28%, 26.33%, and 27% for the five models (LDA, QDA, logistic regression, KNN with k = 3, and KNN with k = 9, respectively).

So the KNN model with k = 3 would be the most appropriate model, as it has the lowest misclassification error rate on the testing data among these five models. The logistic regression model would be the least appropriate, as it has the highest.
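This comparison can be verified mechanically (again a small Python sketch; the model names are just labels for the five fits reported above):

```python
# Test-set misclassification rates, computed from the confusion matrices above.
test_error = {
    "LDA": (39 + 41) / 300,
    "QDA": (38 + 43) / 300,
    "Logistic regression": (47 + 37) / 300,
    "KNN (k=3)": (46 + 33) / 300,
    "KNN (k=9)": (36 + 45) / 300,
}

best = min(test_error, key=test_error.get)   # lowest test error
worst = max(test_error, key=test_error.get)  # highest test error
print(best, round(test_error[best], 4))      # KNN (k=3) 0.2633
print(worst, round(test_error[worst], 4))    # Logistic regression 0.28
```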

Q7.

If the main objective of this study is to develop a data-driven model for better understanding and interpretation of different factors' influence on the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the training data.

The misclassification error rates on the training data are 19%, 19%, 22%, 9%, and 17% for the five models (LDA, QDA, logistic regression, KNN with k = 3, and KNN with k = 9, respectively).

So the KNN model with k = 3 would be the most appropriate model, as it has the lowest misclassification error rate on the training data among these five models. The logistic regression model would be the least appropriate, as it has the highest.

Based on this model, the probability of product failure is higher if:

- The temperature condition is high.
- The voltage condition is high.
- The material type is A.

APPENDIX - R CODE (Classification.R)

```r
# Note: before running this code,
# the training and testing data sets must be saved as .csv files
# in which the failure attribute is re-coded as:
#   1 if the value of attribute failure is N
#   2 if the value of attribute failure is Y
# and these files must be in the working directory of your R session.

require("MASS")   # lda(), qda()
require("class")  # knn()

set.seed(10)

train_data <- read.csv("Mid-train.csv")
test_data <- read.csv("Mid-test.csv")

# Drop the first (ID) column
train_data <- train_data[, -1]
test_data <- test_data[, -1]

# Linear discriminant analysis
lda_fit <- lda(failure ~ ., data = train_data)
lda_train <- predict(lda_fit, train_data)$class
lda_test <- predict(lda_fit, test_data)$class

# Quadratic discriminant analysis
qda_fit <- qda(failure ~ ., data = train_data)
qda_train <- predict(qda_fit, train_data)$class
qda_test <- predict(qda_fit, test_data)$class

# Logistic regression (glm needs a 0/1 response, so subtract 1 from the 1/2 coding)
train_dat <- train_data
train_dat$failure <- train_dat$failure - 1
logit <- glm(failure ~ temp + stress + material, data = train_dat,
             family = "binomial")
prob_train <- predict.glm(logit, newdata = train_data, type = "response")
prob_test <- predict.glm(logit, newdata = test_data, type = "response")

# Convert predicted probabilities back to the 1/2 class coding (cutoff 0.5)
logistic_train <- ifelse(prob_train > 0.5, 2, 1)
logistic_test <- ifelse(prob_test > 0.5, 2, 1)

# K-nearest neighbor classification
c1 <- factor(train_data$failure)
knn3_train <- knn(train_data, train_data, c1, k = 3)
knn3_test <- knn(train_data, test_data, c1, k = 3)
knn9_train <- knn(train_data, train_data, c1, k = 9)
knn9_test <- knn(train_data, test_data, c1, k = 9)

# Confusion matrix (rows = actual, cols = predicted) and misclassification rate
result <- function(data, prediction) {
  a <- b <- c <- d <- 0
  for (i in 1:nrow(data)) {
    if (data$failure[i] == 1 && prediction[i] == 1) a <- a + 1
    if (data$failure[i] == 2 && prediction[i] == 1) b <- b + 1
    if (data$failure[i] == 1 && prediction[i] == 2) c <- c + 1
    if (data$failure[i] == 2 && prediction[i] == 2) d <- d + 1
  }
  confusion_matrix <- matrix(c(a, b, c, d), nrow = 2, ncol = 2)
  misclassification_error_rate <- (b + c) / (a + b + c + d)
  cat("The confusion matrix is", "\n")
  print(confusion_matrix)
  cat("The misclassification error rate is", misclassification_error_rate, "\n")
}

cat("\n", "For the LDA in training data:", "\n")
result(train_data, lda_train)
cat("\n", "For the LDA in testing data:", "\n")
result(test_data, lda_test)
cat("\n", "For the QDA in training data:", "\n")
result(train_data, qda_train)
cat("\n", "For the QDA in testing data:", "\n")
result(test_data, qda_test)
cat("\n", "For the logistic regression in training data:", "\n")
result(train_data, logistic_train)
cat("\n", "For the logistic regression in testing data:", "\n")
result(test_data, logistic_test)
cat("\n", "For the KNN with k=3 in training data:", "\n")
result(train_data, knn3_train)
cat("\n", "For the KNN with k=3 in testing data:", "\n")
result(test_data, knn3_test)
cat("\n", "For the KNN with k=9 in training data:", "\n")
result(train_data, knn9_train)
cat("\n", "For the KNN with k=9 in testing data:", "\n")
result(test_data, knn9_test)

# Compare predictor summaries within the two predicted classes (KNN, k = 3)
cat("\nBased on the KNN with k=3 model:\n")
data <- cbind(train_data, knn3_train)
data1 <- subset(data, data$knn3_train == 1)  # predicted class 1 = no failure
data2 <- subset(data, data$knn3_train == 2)  # predicted class 2 = failure
cat("\nIf the prediction is no product failure:\n")
print(summary(data1))
cat("\nIf the prediction is product failure:\n")
print(summary(data2))
```
