Machine Learning Project - II: Learning Curves Plot for Classification Algorithms: (KNN & Naive Bayes)
by Yifei Zhou
1. Introduction
In machine learning, once we have built several models for a dataset, we often want to evaluate and compare their performance. Performance is mainly measured by the accuracy of the learning system, and it can be improved in several ways, such as changing the number of training examples or optimizing the model parameters (Sammut et al. 2011). Learning curves are one approach that helps us compare the performance of different algorithms and study their convergence (Meek et al. 2002). (Note: the shaded blocks are code, lines beginning with # are code comments, and parentheses contain references.)
library(lattice) #The package for data visualization (Lattice: trellis graphics for R n.d.)
library(ggplot2) #The package for geometric graph (Hadley, W 2018)
library(caTools) #The package for statistic functions (Jarek, T 2018)
library(caret) #The package for model evaluation and training (Max, K 2018)
library(gmodels) #Tools for model fitting (Gregory, R 2018)
library(e1071) #Algorithm of Naive Bayes (David, M 2018)
library(class) #Algorithm of KNN (Brian, R 2015)
wbdc <- data.frame(readxl::read_excel("E:\\MasterStudy\\CT475\\dataset.xlsx")) #read the dataset into a data frame (wbdc)
wbdc$Autoimmune_Disease <- as.factor(wbdc$Autoimmune_Disease) #make the prediction column a factor (nominal)
2. Accuracy Statistics
In this section, I create two functions that evaluate the accuracy achieved by KNN and naïve Bayes, so that learning curves can be plotted for both algorithms. The process has four steps. First, choose a split point X% between 0% and 100%, randomly sample X% of the dataset for training, and use the rest for validation. Second, repeat this several times and record the average accuracy at X%. Third, increase X% by a step size (t), i.e. [X%, X%+t, X%+2t, …, 1], and repeat the previous step at each point. Finally, collect the results for every X% and plot them together (Michael 2018). In this assignment X = 10% and the step t = 10%, and each X% is repeated 10 times; the number of repetitions can be chosen to suit the user's needs.
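The four steps above can be sketched as a single generic routine. This is an illustrative sketch only: `learning_curve` and `train_eval` are hypothetical names introduced here, not part of the assignment code. The two functions in Sections 2.1 and 2.2 are concrete instances of this pattern, each plugging in its own classifier.

```r
# Generic learning-curve sketch (hypothetical helper, for illustration only).
# train_eval is any function(training_data, testing_data) returning an accuracy.
learning_curve <- function(data_set, per_vec, n, train_eval) {
  sapply(per_vec, function(m) {            # steps 3-4: loop over each X% in per_vec
    mean(sapply(1:n, function(i) {         # step 2: repeat n times and average
      idx <- sample(nrow(data_set), floor(nrow(data_set) * m))  # step 1: random X% split
      train_eval(data_set[idx, , drop = FALSE], data_set[-idx, , drop = FALSE])
    }))
  })
}
```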
2.1 Naïve Bayes
get_Average_accuracy_Naive <- function(data_set,per_vec,n) {
  #Naive Bayes accuracy evaluation; data_set: original data set, per_vec: threshold vector, n: repeat times
  final_result <- sapply(per_vec, function(m){
    #for each X% in the threshold vector, compute the average accuracy
    result_vec <- sapply(c(1:n), function(x){
      #repeat n times for each X%
      row_num <- nrow(data_set)
      row_set <- sample(1:row_num, row_num*m, replace = F, prob = NULL) #sample X% of the row indices at random
      training_data <- data_set[row_set,] #X% of the data set for training
      testing_data <- data_set[-row_set,] #the rest (1-X%) for testing
      model <- naiveBayes(training_data[,-5], training_data[,5]) #construct a Bayes model from the training data
      pred_set <- predict(model, testing_data) #predict the outcomes for the test data
      e <- confusionMatrix(pred_set, testing_data[,5]) #construct the confusion matrix (ConfusionMatrix function | R Documentation n.d.)
      e$overall['Accuracy'] #fetch the accuracy of this repetition
    })
    mean(result_vec) #average over the n repetitions as the result for X%
  })
  final_result #return the vector of average accuracies, one per X%
}
2.2 KNN
get_Average_accuracy_KNNs <- function(data_set,per_vec,n,k_value) {
  #KNN accuracy evaluation; data_set: original data set, per_vec: threshold vector, n: repeat times, k_value: number of nearest neighbours
  fin_result <- sapply(per_vec, function(m){
    #for each X% in the threshold vector, compute the average accuracy
    res_vec <- sapply(c(1:n), function(x){
      #repeat n times for each X%
      row_num <- nrow(data_set)
      row_set <- sample(1:row_num, row_num*m, replace = F, prob = NULL) #sample X% of the row indices at random
      training_data <- data_set[row_set,] #X% of the data set for training
      testing_data <- data_set[-row_set,] #the rest (1-X%) for testing
      pred_set <- knn(training_data[,-5], testing_data[,-5], training_data[,"Autoimmune_Disease"], k = k_value) #predict with the k nearest neighbours
      e <- confusionMatrix(pred_set, testing_data[,"Autoimmune_Disease"]) #construct the confusion matrix
      e$overall['Accuracy'] #fetch the accuracy of this repetition
    })
    mean(res_vec) #average over the n repetitions as the result for X%
  })
  fin_result #return the vector of average accuracies, one per X%
}
3. Function Invoking & Results
per_vec = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9) #threshold vector for splitting the dataset
bayes_result <- get_Average_accuracy_Naive(wbdc,per_vec,10) #Bayes accuracy, repeated 10 times per split
bayes_result_df <- data.frame(training_size=ceiling(nrow(wbdc)*per_vec),Accuracy=bayes_result,Classifier="Naive Bayes") #put the Bayes results into a data frame for plotting
data_norm <- function(x) {((x-min(x))/(max(x)-min(x)))} #min-max normalization function for KNN
wbdc_norm <- wbdc
wbdc_temp <- as.data.frame(lapply(wbdc[,-5],data_norm)) #normalize all predictor columns
wbdc_norm[,-5] <- wbdc_temp
knn_result <- get_Average_accuracy_KNNs(wbdc_norm,per_vec,10,10) #KNN accuracy with k=10
knn_result1 <- get_Average_accuracy_KNNs(wbdc_norm,per_vec,10,6) #KNN accuracy with k=6
knn_result2 <- get_Average_accuracy_KNNs(wbdc_norm,per_vec,10,1) #KNN accuracy with k=1
knn_result_df <- data.frame(training_size=ceiling(nrow(wbdc)*per_vec),Accuracy=knn_result,Classifier="KNN (K=10)")
knn_result1_df <- data.frame(training_size=ceiling(nrow(wbdc)*per_vec),Accuracy=knn_result1,Classifier="KNN (K=6)")
knn_result2_df <- data.frame(training_size=ceiling(nrow(wbdc)*per_vec),Accuracy=knn_result2,Classifier="KNN (K=1)") #three data frames holding the KNN results for plotting
final_result_df <- rbind(bayes_result_df,knn_result_df,knn_result1_df,knn_result2_df)
break_points <- c(40,80,120,160,200,240,280,320,360) #break points displayed on the graph
re_df <- data.frame(b=bayes_result,k1=knn_result,k2=knn_result1,k3=knn_result2)
re_df <- t(re_df)
rownames(re_df) <- c("Naive Bayes","KNN (k=10)","KNN (k=6)","KNN (k=1)")
colnames(re_df) <- break_points
re_df #THE RESULT TABLE of KNN and Naive Bayes for training-set sizes [10%-90%] (n=376)
10% 20% 30% 40% 50% 60% 70% 80% 90%
Naive Bayes 0.7492625 0.7564784 0.7488636 0.7597345 0.7601064 0.7529801 0.7513274 0.7355263 0.7657895
KNN (k=10) 0.7176991 0.7272425 0.7465909 0.7500000 0.7627660 0.7483444 0.8053097 0.7789474 0.7631579
KNN (k=6) 0.7088496 0.7438538 0.7473485 0.7402655 0.7553191 0.7304636 0.7495575 0.7618421 0.7157895
KNN (k=1) 0.6929204 0.7073090 0.7045455 0.7057522 0.7196809 0.7105960 0.7017699 0.7052632 0.7078947
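The averages behind observation 6 in the conclusion can be checked directly from the table above. The snippet below copies the Naive Bayes and KNN (k=10) rows from the printed re_df and compares their mean accuracy across all nine training-set sizes (the vector names nb and k10 are introduced here just for this check).

```r
# Average accuracy across all training-set sizes, using the values
# from the result table above (rows "Naive Bayes" and "KNN (k=10)"):
nb  <- c(0.7492625,0.7564784,0.7488636,0.7597345,0.7601064,
         0.7529801,0.7513274,0.7355263,0.7657895)
k10 <- c(0.7176991,0.7272425,0.7465909,0.7500000,0.7627660,
         0.7483444,0.8053097,0.7789474,0.7631579)
round(c(NaiveBayes = mean(nb), KNN_k10 = mean(k10)), 4)
# NaiveBayes 0.7533, KNN_k10 0.7556
```

So over the whole range KNN (k=10) is on average slightly more accurate than Naive Bayes, though the margin is small.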
4. Plot Learning Curves
ggplot(data=final_result_df, mapping=aes(x=training_size, y=Accuracy, colour=Classifier)) +
  geom_line() +
  expand_limits(y=c(0.6,1)) +
  geom_point(size=2) +
  ggtitle("The Learning Curves of Naive Bayes and KNN") +
  scale_x_continuous(breaks = c(0,40,80,120,160,200,240,280,320,360))
ggplot provides a tool for plotting the learning curves. As the graph below shows, the x axis gives the training-set size at each break point and the y axis gives the average accuracy at each threshold. I applied naïve Bayes and KNN (k = 1, 6 and 10) to predict whether the attribute Autoimmune_Disease is positive or negative, and all results are recorded in the table in Section 3.
5. Conclusion - observations
1) As shown by the green, blue and purple lines in the graph below, KNN performed better as K (the number of nearest neighbours) increased. 2) The performance of naïve Bayes seems stable, with accuracy fluctuating around 75%. Compared with naïve Bayes, KNN (k=10) reached the highest point on the graph, 80.5%, but it does not seem as stable. 3) For smaller training sets, from 10% (40 rows) to 60% (220 rows), naïve Bayes and KNN (k=6 and 10) performed similarly, while for training sets above 60% KNN looks better than naïve Bayes. 4) When the training set is very small (<40 rows), none of these algorithms performed well. 5) There may be a split point around 70%, where the training set contains 260 rows; at this point all the algorithms perform well, but this does not mean that a larger training set always yields higher accuracy (e.g. KNN declines after 80%). 6) Overall, although naïve Bayes is more stable than KNN, the average accuracy of KNN is higher, so KNN (K=10) appears to be the best solution.
References
- Sammut, C. & Webb, G. eds. (2011) Encyclopedia of Machine Learning, Sydney: University of New South Wales.
- Meek, C., Thiesson, B. and Heckerman, D. (2002) ‘The Learning Curve Sampling Method Applied to Model-Based Clustering’, Journal of Machine Learning Research, 2(3), 397.
- Lattice: trellis graphics for R (n.d.) available: http://lattice.r-forge.r-project.org/ [accessed 30 Oct 2018].
- Hadley, W. (2018) ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics, CRAN – Contributed Packages, available: https://cran.r-project.org/web/packages/ggplot2/index.html [accessed 25 Oct 2018].
- Jarek, T. (2018) caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc, CRAN – Contributed Packages, available: https://cran.r-project.org/web/packages/caTools/ [accessed 20 Jul 2018].
- Gregory, R. (2018) gmodels: Various R Programming Tools for Model Fitting, CRAN – Contributed Packages, available: https://cran.r-project.org/web/packages/gmodels/index.html [accessed 25 Jun 2018].
- Michael, M. (2018) ‘Evaluating Classifier Performance; Practical Advice; Some Data Mining Tools’, CT475: Machine Learning & Data Mining, National University Ireland Galway, unpublished.
- David, M. (2018) CRAN – Package e1071, CRAN – Contributed Packages, available: https://cran.r-project.org/web/packages/e1071/index.html [accessed 28 Jul 2018].
- Max, K. (2018) The caret Package, available: http://topepo.github.io/caret/index.html [accessed 26 May 2018].
- ConfusionMatrix function | R Documentation (n.d.) available: https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/confusionMatrix [accessed 30 Oct 2018].
- Brian, R. (2015) class: Functions for Classification, CRAN – Package class, available: https://cran.r-project.org/web/packages/class/index.html [accessed 30 Aug 2015].