Linear Discriminant Analysis

Solution

Q1.

On fitting Linear Discriminant Analysis (LDA) to the training data:

The confusion matrix for the training data is:

n = 100        Predicted: NO   Predicted: YES   Total
Actual: NO          33              13            46
Actual: YES          6              48            54
Total               39              61           100

The confusion matrix for the testing data is:

n = 300        Predicted: NO   Predicted: YES   Total
Actual: NO         112              39           151
Actual: YES         41             108           149
Total              153             147           300

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.2667 for the testing data.
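These error rates follow directly from the confusion matrices: the misclassification rate is the off-diagonal count divided by the total. A minimal R sketch, using the LDA testing-data counts from the table above:

```r
# LDA confusion matrix on the testing data (rows = actual, cols = predicted)
conf <- matrix(c(112, 41, 39, 108), nrow = 2,
               dimnames = list(c("Actual: NO", "Actual: YES"),
                               c("Predicted: NO", "Predicted: YES")))

# misclassification rate = off-diagonal count / total
err <- (sum(conf) - sum(diag(conf))) / sum(conf)
err  # 80/300, approximately 0.2667
```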

Q2.

On fitting Quadratic Discriminant Analysis (QDA) to the training data:

The confusion matrix for the training data is:

n = 100        Predicted: NO   Predicted: YES   Total
Actual: NO          33              13            46
Actual: YES          6              48            54
Total               39              61           100

The confusion matrix for the testing data is:

n = 300        Predicted: NO   Predicted: YES   Total
Actual: NO         113              38           151
Actual: YES         43             106           149
Total              156             144           300

The misclassification error rate is 0.19 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q3.

On fitting a logistic regression model to the training data:

The confusion matrix for the training data is:

n = 100        Predicted: NO   Predicted: YES   Total
Actual: NO          33              13            46
Actual: YES          9              45            54
Total               42              58           100

The confusion matrix for the testing data is:

n = 300        Predicted: NO   Predicted: YES   Total
Actual: NO         104              47           151
Actual: YES         37             112           149
Total              141             159           300

The misclassification error rate is 0.22 for the training data.

The misclassification error rate is 0.28 for the testing data.

Q4.

On using k-nearest neighbor (KNN) classification with k = 3:

The confusion matrix for the training data is:

n = 100        Predicted: NO   Predicted: YES   Total
Actual: NO          40               6            46
Actual: YES          3              51            54
Total               43              57           100

The confusion matrix for the testing data is:

n = 300        Predicted: NO   Predicted: YES   Total
Actual: NO         105              46           151
Actual: YES         33             116           149
Total              138             162           300

The misclassification error rate is 0.09 for the training data.

The misclassification error rate is 0.2633 for the testing data.

Q5.

On using k-nearest neighbor (KNN) classification with k = 9:

The confusion matrix for the training data is:

n = 100        Predicted: NO   Predicted: YES   Total
Actual: NO          40               6            46
Actual: YES         11              43            54
Total               51              49           100

The confusion matrix for the testing data is:

n = 300        Predicted: NO   Predicted: YES   Total
Actual: NO         115              36           151
Actual: YES         45             104           149
Total              160             140           300

The misclassification error rate is 0.17 for the training data.

The misclassification error rate is 0.27 for the testing data.

Q6.

If the main objective of this study is to develop a data-driven model for better prediction of the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the testing data.

The testing misclassification error rates are 26.67%, 27%, 28%, 26.33%, and 27% for the five models (LDA, QDA, logistic regression, KNN with k=3, and KNN with k=9, respectively).

So the KNN model with k=3 is the most appropriate, as it has the lowest testing error rate among these five models. The logistic regression model is the least appropriate, as it has the highest.
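This comparison can be checked mechanically. A short R sketch, using the testing error rates reported above as exact fractions of 300:

```r
# testing misclassification error rates reported above
test_err <- c(LDA = 80/300, QDA = 81/300, Logistic = 84/300,
              "KNN k=3" = 79/300, "KNN k=9" = 81/300)

names(which.min(test_err))  # model with the lowest testing error
names(which.max(test_err))  # model with the highest testing error
```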

Q7.

If the main objective of this study is to develop a data-driven model for better understanding and interpretation of the different factors' influence on the occurrence of product failure, we should choose the model with the lowest misclassification error rate on the training data.

The training misclassification error rates are 19%, 19%, 22%, 9%, and 17% for the five models (LDA, QDA, logistic regression, KNN with k=3, and KNN with k=9, respectively).

So the KNN model with k=3 is the most appropriate, as it has the lowest training error rate among these five models. The logistic regression model is the least appropriate, as it has the highest.

Based on this model, the probability of occurrence of product failure is higher if:

  • The temperature condition is high.
  • The voltage condition is high.
  • The material type is A.

APPENDIX – R CODE

# Note: before using this code,
# the training and testing data sets must be saved as .csv files
# in which the failure attribute is re-coded as:
#   1 if the value of attribute failure is N
#   2 if the value of attribute failure is Y
# and these files must be in the working directory of your R session.

require("MASS")

require("class")

set.seed(10)

train_data <- read.csv("Mid-train.csv")

test_data <- read.csv("Mid-test.csv")

train_data<- train_data[,-1]

test_data<- test_data[,-1]

lda<- lda(failure ~ ., data = train_data)

lda_train<- predict(lda, train_data)$class

lda_test<- predict(lda, test_data)$class

qda<- qda(failure ~ ., data = train_data)

qda_train<- predict(qda, train_data)$class

qda_test<- predict(qda, test_data)$class

train_dat<- train_data

train_dat$failure <- train_dat$failure - 1

logit <- glm(failure ~ temp + stress + material, data = train_dat, family = "binomial")

prob_train <- predict.glm(logit, newdata = train_data, type = "response")

prob_test <- predict.glm(logit, newdata = test_data, type = "response")

logistic_train<- c()

for(i in 1:length(prob_train)) {

  if(prob_train[i] > 0.5)

    logistic_train[i] <- 2

  else

    logistic_train[i] <- 1

}

logistic_test<- c()

for(i in 1:length(prob_test)) {

  if(prob_test[i] > 0.5)

    logistic_test[i] <- 2

  else

    logistic_test[i] <- 1

}

c1<- factor(train_data$failure)

knn3_train <- knn(train_data, train_data, c1, k=3)

knn3_test <- knn(train_data, test_data, c1, k=3)

knn9_train <- knn(train_data, train_data, c1, k=9)

knn9_test <- knn(train_data, test_data, c1, k=9)

result<- function(data, prediction)

{

a=b=c=d=0

for(i in 1:nrow(data))

{

if(data$failure[i]==1 && prediction[i]==1)

a=a+1

if(data$failure[i]==2 && prediction[i]==1)

b=b+1

if(data$failure[i]==1 && prediction[i]==2)

c=c+1

if(data$failure[i]==2 && prediction[i]==2)

d=d+1

}

confusion_matrix <- matrix(c(a, b, c, d), nrow = 2, ncol = 2,
                           dimnames = list(c("Actual: NO", "Actual: YES"),
                                           c("Predicted: NO", "Predicted: YES")))

misclassification_error_rate<- (b+c)/(a+b+c+d)

cat("The confusion matrix is", "\n")

print(confusion_matrix)

cat("The misclassification error rate is", misclassification_error_rate, "\n")

}

cat("\n", "For the LDA in training data:", "\n")

result(train_data, lda_train)

cat("\n", "For the LDA in testing data:", "\n")

result(test_data, lda_test)

cat("\n", "For the QDA in training data:", "\n")

result(train_data, qda_train)

cat("\n", "For the QDA in testing data:", "\n")

result(test_data, qda_test)

cat("\n", "For the logistic regression in training data:", "\n")

result(train_data, logistic_train)

cat("\n", "For the logistic regression in testing data:", "\n")

result(test_data, logistic_test)

cat("\n", "For the KNN with k=3 in training data:", "\n")

result(train_data, knn3_train)

cat("\n", "For the KNN with k=3 in testing data:", "\n")

result(test_data, knn3_test)

cat("\n", "For the KNN with k=9 in training data:", "\n")

result(train_data, knn9_train)

cat("\n", "For the KNN with k=9 in testing data:", "\n")

result(test_data, knn9_test)

cat("\nBased on the KNN with k=3 model:\n")

data<- cbind(train_data, knn3_train)

data1 <- subset(data, data$knn3_train == 1)

data2 <- subset(data, data$knn3_train == 2)

cat("\nIf the prediction is no product failure (failure = N):\n")

print(summary(data1))

cat("\nIf the prediction is product failure (failure = Y):\n")

print(summary(data2))
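As an aside, the two thresholding loops above can each be replaced by a single vectorized `ifelse()` call, which is the idiomatic R way to recode probabilities into class labels. A sketch with illustrative probability values standing in for the fitted `prob_train`:

```r
# illustrative fitted probabilities (stand-ins for predict.glm output)
prob_train <- c(0.12, 0.73, 0.50, 0.91)

# recode into the 1/2 class labels used above (2 = YES iff p > 0.5)
logistic_train <- ifelse(prob_train > 0.5, 2, 1)
logistic_train  # 1 2 1 2
```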
