# Linear Discriminant Analysis Solution

Q1.

Fitting Linear Discriminant Analysis (LDA) to the training data gives the following results.

The confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 6 | 48 | 54 |
| Total | 39 | 61 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 112 | 39 | 151 |
| Actual: YES | 41 | 108 | 149 |
| Total | 153 | 147 | 300 |

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.26667 for the testing data.
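The confusion matrix and error rate can also be tabulated compactly with `table()` rather than counting cells by hand. The sketch below uses hypothetical stand-in data (the real training set is not reproduced here); only the `lda()`/`table()` pattern is the point.

```r
library(MASS)  # provides lda()

set.seed(10)
# Hypothetical stand-in data; the real train_data has predictors such as
# temp, stress, material and a failure class coded 1 (N) / 2 (Y)
train_data <- data.frame(temp = rnorm(100), stress = rnorm(100))
train_data$failure <- factor(ifelse(train_data$temp + train_data$stress > 0, 2, 1))

lda_fit  <- lda(failure ~ ., data = train_data)
lda_pred <- predict(lda_fit, train_data)$class

# Rows = actual class, columns = predicted class
cm <- table(Actual = train_data$failure, Predicted = lda_pred)
print(cm)

# Off-diagonal proportion = misclassification error rate
err <- 1 - sum(diag(cm)) / sum(cm)
cat("Training error rate:", err, "\n")
```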

Q2.

Fitting Quadratic Discriminant Analysis (QDA) to the training data gives the following results.

The confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 6 | 48 | 54 |
| Total | 39 | 61 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 113 | 38 | 151 |
| Actual: YES | 43 | 106 | 149 |
| Total | 156 | 144 | 300 |

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q3.

Fitting a logistic regression model to the training data gives the following results.

The confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 33 | 13 | 46 |
| Actual: YES | 9 | 45 | 54 |
| Total | 42 | 58 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 104 | 47 | 151 |
| Actual: YES | 37 | 112 | 149 |
| Total | 141 | 159 | 300 |

The misclassification error rate is 0.22 for the training data.

The misclassification error rate is 0.28 for the testing data.
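For logistic regression, the fitted model returns probabilities, which must then be thresholded into class labels. That step can be written in one line with `ifelse()`; the sketch below uses made-up probabilities, not the fitted model.

```r
# Hypothetical predicted probabilities from glm(..., family = "binomial")
prob_test <- c(0.12, 0.81, 0.47, 0.93, 0.55)

# 0.5 threshold: 2 = predicted failure (Y), 1 = predicted no failure (N)
pred_class <- ifelse(prob_test > 0.5, 2, 1)
print(pred_class)  # 1 2 1 2 2
```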

Q4.

Using K-nearest neighbor (KNN) classification with k = 3 gives the following results.

The confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 40 | 6 | 46 |
| Actual: YES | 3 | 51 | 54 |
| Total | 43 | 57 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 105 | 46 | 151 |
| Actual: YES | 33 | 116 | 149 |
| Total | 138 | 162 | 300 |

The misclassification error rate is 0.09 for the training data.

The misclassification error rate is 0.2633 for the testing data.
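`class::knn()` takes the training features, the query features, and the training labels as separate arguments, and predicts by majority vote among the k closest training points. A minimal sketch on hypothetical data (the real data set is not reproduced here):

```r
library(class)  # provides knn()

set.seed(10)
# Hypothetical two-feature data with the class coded 1 / 2
train_x <- data.frame(temp = rnorm(100), stress = rnorm(100))
train_y <- factor(ifelse(train_x$temp > 0, 2, 1))
test_x  <- data.frame(temp = rnorm(20), stress = rnorm(20))

# Each test point gets the majority class among its 3 nearest neighbors
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
table(pred)
```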

Q5.

Using K-nearest neighbor (KNN) classification with k = 9 gives the following results.

The confusion matrix for the training data is:

| n = 100 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 40 | 6 | 46 |
| Actual: YES | 11 | 43 | 54 |
| Total | 51 | 49 | 100 |

The confusion matrix for the testing data is:

| n = 300 | Predicted: NO | Predicted: YES | Total |
|---|---|---|---|
| Actual: NO | 115 | 36 | 151 |
| Actual: YES | 45 | 104 | 149 |
| Total | 160 | 140 | 300 |

The misclassification error rate is 0.17 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q6.

If the main objective of this study is to develop a data-driven model for better prediction of the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the testing data.

The misclassification error rates on the testing data are 26.67%, 27%, 28%, 26.33%, and 27% for the five models: LDA, QDA, logistic regression, KNN with k = 3, and KNN with k = 9, respectively.

So the KNN model with k = 3 would be the most appropriate, as it has the lowest misclassification error rate on the testing data among these five models. Conversely, the logistic regression model would be the least appropriate, as it has the highest.
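Picking the winner programmatically from the test error rates reported above:

```r
# Test-set misclassification error rates from Q1-Q5
test_err <- c(LDA = 0.2667, QDA = 0.27, Logistic = 0.28,
              KNN3 = 0.2633, KNN9 = 0.27)

best  <- names(which.min(test_err))  # lowest test error
worst <- names(which.max(test_err))  # highest test error
cat("Best:", best, "Worst:", worst, "\n")
```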

Q7.

If the main objective of this study is to develop a data-driven model for better understanding and interpretation of the different factors' influence on the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the training data.

The misclassification error rates on the training data are 19%, 19%, 22%, 9%, and 17% for the five models: LDA, QDA, logistic regression, KNN with k = 3, and KNN with k = 9, respectively.

So the KNN model with k = 3 would be the most appropriate, as it has the lowest misclassification error rate on the training data among these five models. Conversely, the logistic regression model would be the least appropriate, as it has the highest.

Based on this model, the probability of product failure is higher when:

• The temperature condition is high.
• The voltage condition is high.
• The material type is A.

## Appendix – R Code

```r
# Note: before running this code,
#  - the training and testing data sets are saved as .csv files,
#  - the failure attribute is re-coded as 1 if failure is "N" and 2 if it is "Y",
#  - and the files are placed in the working directory of your R session.

require("MASS")   # lda(), qda()
require("class")  # knn()

set.seed(10)

# File names are placeholders; adjust them to your saved .csv files
train_data <- read.csv("train.csv")
test_data  <- read.csv("test.csv")

# Drop the first (ID) column
train_data <- train_data[, -1]
test_data  <- test_data[, -1]

# Linear discriminant analysis
lda_fit   <- lda(failure ~ ., data = train_data)
lda_train <- predict(lda_fit, train_data)$class
lda_test  <- predict(lda_fit, test_data)$class

# Quadratic discriminant analysis
qda_fit   <- qda(failure ~ ., data = train_data)
qda_train <- predict(qda_fit, train_data)$class
qda_test  <- predict(qda_fit, test_data)$class

# Logistic regression (failure recoded from 1/2 to 0/1)
train_dat <- train_data
train_dat$failure <- train_dat$failure - 1
logit <- glm(failure ~ temp + stress + material,
             data = train_dat, family = "binomial")
prob_train <- predict.glm(logit, newdata = train_data, type = "response")
prob_test  <- predict.glm(logit, newdata = test_data,  type = "response")

# 0.5 threshold: 2 = predicted failure (Y), 1 = predicted no failure (N)
logistic_train <- ifelse(prob_train > 0.5, 2, 1)
logistic_test  <- ifelse(prob_test  > 0.5, 2, 1)

# K-nearest neighbors (note: as in the original run, every column of
# train_data, including failure itself, is passed as a feature)
c1 <- factor(train_data$failure)
knn3_train <- knn(train_data, train_data, c1, k = 3)
knn3_test  <- knn(train_data, test_data,  c1, k = 3)
knn9_train <- knn(train_data, train_data, c1, k = 9)
knn9_test  <- knn(train_data, test_data,  c1, k = 9)

# Confusion matrix (rows = actual, columns = predicted) and error rate
result <- function(data, prediction) {
  a <- b <- c <- d <- 0
  for (i in 1:nrow(data)) {
    if (data$failure[i] == 1 && prediction[i] == 1) a <- a + 1
    if (data$failure[i] == 2 && prediction[i] == 1) b <- b + 1
    if (data$failure[i] == 1 && prediction[i] == 2) c <- c + 1
    if (data$failure[i] == 2 && prediction[i] == 2) d <- d + 1
  }
  confusion_matrix <- matrix(c(a, b, c, d), nrow = 2, ncol = 2)
  misclassification_error_rate <- (b + c) / (a + b + c + d)
  cat("The confusion matrix is", "\n")
  print(confusion_matrix)
  cat("The misclassification error rate is", misclassification_error_rate, "\n")
}

cat("\n", "For the LDA in training data:", "\n")
result(train_data, lda_train)
cat("\n", "For the LDA in testing data:", "\n")
result(test_data, lda_test)

cat("\n", "For the QDA in training data:", "\n")
result(train_data, qda_train)
cat("\n", "For the QDA in testing data:", "\n")
result(test_data, qda_test)

cat("\n", "For the logistic regression in training data:", "\n")
result(train_data, logistic_train)
cat("\n", "For the logistic regression in testing data:", "\n")
result(test_data, logistic_test)

cat("\n", "For the KNN with k = 3 in training data:", "\n")
result(train_data, knn3_train)
cat("\n", "For the KNN with k = 3 in testing data:", "\n")
result(test_data, knn3_test)

cat("\n", "For the KNN with k = 9 in training data:", "\n")
result(train_data, knn9_train)
cat("\n", "For the KNN with k = 9 in testing data:", "\n")
result(test_data, knn9_test)

cat("\nBased on the KNN with k = 3 model:\n")
data  <- cbind(train_data, knn3_train)
data1 <- subset(data, data$knn3_train == 1)  # predicted class 1 = no failure
data2 <- subset(data, data$knn3_train == 2)  # predicted class 2 = failure
cat("\nIf the prediction is not product failure:\n")
print(summary(data1))
cat("\nIf the prediction is product failure:\n")
print(summary(data2))
```
