Machine Learning Project - I: Classification Algorithms
by Yifei Zhou
1. Introduction
For this assignment, I researched the open-source packages available for implementing these two algorithms. R provides a good platform for this work; more importantly, it is integrated with flexible packages that implement machine-learning algorithms such as KNN, SVM, ID3 and Naive Bayes for classification. In addition, some packages provide functions to visualize the model and the training result. All operations were carried out in RStudio. The code segment below loads the ML packages into R. (Note: the shaded blocks are implemented code, and lines prefixed with ## are running results.)
library(lattice) # load the add-on visualization package [Rf.2]
library(ggplot2) # supporting visualization package
library(MASS, quietly = TRUE) # a third-party package for data mining [Rf.4]
library(caTools) # contains some basic statistical functions [Rf.5]
library(caret) # a set of functions that streamline the process of creating predictive models [Rf.6]
library(gmodels) # to analyse the prediction and test outcome sets [Rf.7]
2. Preparation of Data Set
f <- file.choose() # select the target data set [Autoimmune.csv]
wbdc <- read.csv(f, TRUE, ",") # import the data set
bd_set <- wbdc # keep an unmodified copy of the data set for later use
The code above loads the data set and saves it into an object of type data frame. The loading function read.csv() takes three parameters here: the first parameter (f) is the CSV source file, the second sets whether the data set has a table header, and the third states that the data are separated by commas. (wbdc) is the final data-set object.
str(wbdc) #view the structure of the imported data set
## 'data.frame': 376 obs. of 10 variables:
## $ Age : int 30 22 21 23 25 25 35 22 23 23 ...
## $ Blood_Pressure : int 64 74 70 64 76 62 84 78 68 86 ...
## $ BMI : num 35.1 30 30.8 34.9 53.2 25.1 35 34.6 29.7 45.5 ...
## $ Plasma_level : num 61 40 50 59.5 81 45 68 43.5 37 51 ...
## $ Autoimmune_Disease: Factor w/ 2 levels "negative","positive": 2 1 1 1 2 1 2 1 1 2 ...
## $ Adverse_events : int 1 1 0 0 0 1 5 1 3 2 ...
## $ Drug_in_serum : int 156 60 50 92 100 59 88 32 45 120 ...
## $ Liver_function : num 0.692 0.527 0.597 0.725 0.759 ...
## $ Activity_test : int 32 11 26 18 56 18 41 27 28 36 ...
## $ Secondary_test : num 12.7 0 22.6 1.8 3.6 16.8 34.1 31.4 1.8 30.6 ...
Now we can check the structure of the data set using the str() function provided by R. For this assignment, my task is to predict whether a person will develop Autoimmune_Disease from the other attributes.
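Before building a model, it can also help to check how balanced the two classes of the target attribute are. The short sketch below is not part of the original workflow; it simply inspects the class distribution with base R.
table(wbdc$Autoimmune_Disease) # counts of "negative" and "positive" cases
prop.table(table(wbdc$Autoimmune_Disease)) # the same counts expressed as proportions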
3. Model Explanation
3.1 KNN (Rectangular weighted)
The reason why I selected this model is that most attributes of the data set are numeric. KNN is one of the instance-based learning (IBL) algorithms with supervised learning, and the prediction is based on the k nearest neighbours [Rf.8]: by computing the distance between each case of the training set and the actual query, the k nearest neighbours are found.
Sometimes we need to make sure that all attributes are on the same scale before this model is applied, so we need to normalize them. There are two common normalization approaches, Z-normalization and 0-1 normalization [Rf.9]. In my assignment, I applied 0-1 normalization; the implementation is in section 4.2.
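As a minimal sketch of the two approaches (the 0-1 version actually used is implemented in section 4.2), the functions below assume a numeric vector x; the names minmax_norm and z_norm are illustrative only.
minmax_norm <- function(x) (x - min(x)) / (max(x) - min(x)) # 0-1 normalization: rescales x into [0, 1]
z_norm <- function(x) (x - mean(x)) / sd(x) # Z-normalization: centres on the mean, scales by the standard deviation
age_minmax <- minmax_norm(wbdc$Age) # example on one attribute (illustrative only)
age_z <- z_norm(wbdc$Age)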
In fact, the distance between the actual query and a case of the training set represents the similarity between them: the smaller the distance, the more likely the prediction outcome belongs to the class of that case. Several distance metrics are commonly used for KNN. The actual query and a training case can be represented as two vectors of attribute values:
\[A = (a_1, a_2, a_3, \dots), \qquad Q = (q_1, q_2, q_3, \dots)\]
Euclidean distance:
\[d(Q, A) = \sqrt{(q_1 - a_1)^2 + (q_2 - a_2)^2 + \cdots}\]
Manhattan distance:
\[d(Q, A) = |q_1 - a_1| + |q_2 - a_2| + \cdots\]
Hamming distance: mainly for discrete attributes; the per-attribute contribution is 0 when the query has the same value as the case, and 1 when the values differ [Rf.10].
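As a small illustration (not from the original code) of the first two metrics, the sketch below computes the distances between a hypothetical normalized query and one training case:
euclidean_dist <- function(q, a) sqrt(sum((q - a)^2)) # Euclidean distance between two vectors
manhattan_dist <- function(q, a) sum(abs(q - a)) # Manhattan distance between two vectors
q <- c(0.2, 0.5, 0.1) # hypothetical normalized query
a <- c(0.3, 0.4, 0.2) # hypothetical normalized training case
euclidean_dist(q, a)
manhattan_dist(q, a)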
3.2 Naive Bayes
Another supervised learning mechanism is Naïve Bayes. It is mainly based on the Bayes rule: the prediction outcome is obtained by calculating conditional probabilities [Rf.11]. In practice, this model is widely used for spam email filtering. It makes the big assumption that all of the attributes are independent of each other. Basic Bayes rule: \(P(B|A) = \frac{P(A|B)P(B)}{P(A)}\)
For discrete values:
\[P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}\]
Here \(C_i\) denotes the i-th class and \(X\) the attribute values of the case to be predicted; \(P(C_i|X)\) is the probability of class \(C_i\) given the evidence \(X\).
For continuous values:
\[g(x,\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\]
Here \(\mu\) is the mean and \(\sigma\) the standard deviation, and \(g\) is assumed to be a normal (Gaussian) density with parameters \(\mu\) and \(\sigma\).
In this assignment, I mainly focus on continuous values.
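To illustrate how this density feeds into the Bayes rule, the sketch below scores one numeric attribute value against two classes; the means, standard deviations and priors are made-up values for the example, not taken from the data set.
gauss <- function(x, mu, sigma) exp(-(x - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma) # g(x, mu, sigma) from above
x <- 35 # hypothetical BMI value of a query case
lik_pos <- gauss(x, mu = 36, sigma = 6) # P(x | positive) with assumed parameters
lik_neg <- gauss(x, mu = 30, sigma = 5) # P(x | negative) with assumed parameters
prior_pos <- 0.35; prior_neg <- 0.65 # assumed class priors
post_pos <- lik_pos * prior_pos / (lik_pos * prior_pos + lik_neg * prior_neg) # posterior of the positive class
post_pos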
4. Model Implementation & Evaluation
4.1 Data Split
I split the data into a training set and a testing set using k-fold cross-validation [Rf.12]: the whole data set is divided into k roughly equal parts at random, each part in turn is used as the testing set, and the remaining k-1 parts form the training set. This is repeated for each fold, and the average accuracy over the k runs is taken as the final estimated result, so every case is used k-1 times for training and once for testing.
folds <- createFolds(y=wbdc$Autoimmune_Disease,k=10) # apply 10-fold cross-validation to split the training and testing sets; folds contains the case indices
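For reference, createFolds() returns a named list of k integer vectors; the quick check below (not part of the original report) shows how that structure could be inspected.
length(folds) # number of folds, here 10
sapply(folds, length) # size of each fold, roughly 376/10 cases each
head(folds[[1]]) # row indices forming the first test fold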
4.2 KNN
The code segment below illustrates the KNN implementation using the knn() function from the class package. The first step is to normalize the data to ensure that the attributes are on the same scale.
data_norm <- function(x) {((x-min(x))/(max(x)-min(x)))} # declare a 0-1 normalization function
wbdc_norm <- as.data.frame(lapply(wbdc[,-5],data_norm)) # apply the normalization to all attributes except the target attribute (column 5)
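A quick sanity check, not in the original report, confirms that every normalized column now lies in the range [0, 1]:
sapply(wbdc_norm, range) # each column should now range from 0 to 1
summary(wbdc_norm$Age) # e.g. the rescaled Age attribute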
After that, I need to train the model on the data [Rf.15].
library(class) # provides the knn() function used below
folds <- createFolds(y=wbdc$Autoimmune_Disease,k=10) # folds only contains indices of the data [Rf.17]
re <- c() # a container to hold the accuracy of each iteration
for(i in 1:10)
{
  train_data <- wbdc_norm[-folds[[i]],] # generate the training set
  test_data <- wbdc_norm[folds[[i]],]   # generate the testing set
  wbdc_pred <- knn(train_data,test_data,wbdc[-folds[[i]],5],k=8) # apply the knn model [Rf.18]
  e <- confusionMatrix(wbdc_pred,wbdc[folds[[i]],5]) # summarise the prediction result
  print(e)
  re <- c(re,e$overall["Accuracy"]) # store the accuracy of each of the 10 tests
} # in this case, the target column Autoimmune_Disease is column 5
Now the data set has been divided into 10 parts; the KNN model is applied with the number of nearest neighbours k set to 8. Each iteration shows its prediction result and its mistakes, and the result of the 1st iteration is shown in the table. This is repeated 10 times, recording the accuracy of each iteration. The accuracy is (TP+TN)/(TP+TN+FP+FN) [Rf.14], here (24+5)/(25+5+7+2).
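To make the accuracy formula concrete, the sketch below recomputes it from the confusion matrix object e produced in the loop above; this check is not in the original code.
cm <- e$table # 2x2 table of predicted vs. actual classes
sum(diag(cm)) / sum(cm) # (TP + TN) / (TP + TN + FP + FN), computed by hand
e$overall["Accuracy"] # should match the value reported by caret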
Finally, the average of these accuracies is taken as the final accuracy of this model.
print(mean(re))
## [1] 0.7841631
4.3 Naive Bayes
The other model I used is naive Bayes. This model is mainly based on the Bayes rule and probability. Generally speaking, it is best suited to nominal attributes, but for numeric attributes it applies the Gaussian density function with the mean and standard deviation of that predictor. As with KNN, the code below is the implementation of naive Bayes. Library e1071 is a package that provides Naïve Bayes [Rf.13].
library(e1071) #An integrated R package for naïve Bayes model
As with KNN, the data was split and the model applied to each fold.
folds1 <- createFolds(y=bd_set$Autoimmune_Disease,k=10) # folds only contains indices of the data
re <- c() # a container to hold the accuracy of each iteration
for(i in 1:10)
{
  train_data1 <- bd_set[-folds1[[i]],] # split the data set into the training set
  test_data1 <- bd_set[folds1[[i]],]   # split the data set into the testing set
  model <- naiveBayes(train_data1[,-5],train_data1[,5]) # apply the naïve Bayes model [Rf.15]
  if(i==1)
    print(model) # show the fitted model for the first fold only
  result <- predict(model,test_data1) # predict using the testing set
  e <- confusionMatrix(result,test_data1[,5]) # generate the confusion matrix by comparing results [Rf.16]
  print(e)
  re <- c(re,e$overall["Accuracy"]) # save the accuracy of each iteration
}
Running Result:
The continuous density function from section 3.2, parameterised by each attribute's mean and standard deviation, is applied to calculate the class-conditional probability of each attribute, and the Bayes formula combines these probabilities into the final result. This is repeated 10 times and the average accuracy is taken.
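For a numeric predictor, the fitted e1071 model stores the per-class mean and standard deviation that this density uses. The hedged sketch below shows how they could be inspected for the model fitted in the loop above (attribute BMI is just one example).
model$tables$BMI # per-class mean (column 1) and standard deviation (column 2) of BMI
model$apriori # class counts used to form the prior probabilities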
5. Conclusion
Comparing the final accuracy of these two algorithms, I found that KNN (78%) performed better than Naive Bayes (75.8%), so based on the accuracy of these two models, KNN might be the better solution for this case. Regarding the models themselves, KNN is well suited to numeric attributes and is simple to operate, but its computational cost is large. Naïve Bayes has stable classification efficiency and can work with both numeric and nominal attributes, but it makes the big assumption that all of the attributes are independent of each other, which is rarely true in practice. Therefore, each model has its own advantages and drawbacks, and the choice of model should be based on the actual data set.
References
[1] cran.r-project.org (2018). Available at: https://cran.r-project.org/web/views/MachineLearning.html [Accessed 4 Oct. 2018]
[2] cran.r-project.org (2017). Available at: https://cran.r-project.org/web/packages/lattice/index.html [Accessed 4 Oct. 2018]
[3] statmethods.net (2018). Available at: https://www.statmethods.net/advgraphs/ggplot2.html [Accessed 4 Oct. 2018]
[4] cran.r-project.org (2018). Available at: https://cran.r-project.org/web/packages/MASS/index.html [Accessed 4 Oct. 2018]
[5] cran.r-project.org (2018). Available at: https://cran.r-project.org/web/packages/caTools [Accessed 4 Oct. 2018]
[6] cran.r-project.org (2018). Available at: http://caret.r-forge.r-project.org/ [Accessed 4 Oct. 2018]
[7] cran.r-project.org (2018). Available at: https://cran.r-project.org/web/packages/gmodels/index.html [Accessed 4 Oct. 2018]
[8] Michael Madden (2018). Available at: https://nuigalway.blackboard.com/bbcswebdav/pid-1559028-dt-content-rid-11877344_1/xid-11877344_1, p. 7
[9] Michael Madden (2018). Available at: https://nuigalway.blackboard.com/bbcswebdav/pid-1559028-dt-content-rid-11877344_1/xid-11877344_1, p. 15
[10] Michael Madden (2018). Available at: https://nuigalway.blackboard.com/bbcswebdav/pid-1559028-dt-content-rid-11877344_1/xid-11877344_1, p. 11
[11] Wikipedia (2018). Available at: https://en.wikipedia.org/wiki/Naive_Bayes_classifier [Accessed 7 Oct. 2018]
[12] Michael Madden (2018). Available at: https://nuigalway.blackboard.com/bbcswebdav/pid-1584557-dt-content-rid-11960992_1/xid-11960992_1, pp. 12-13
[13] cran.r-project.org (2018). Available at: https://cran.r-project.org/web/packages/e1071/index.html [Accessed 6 Oct. 2018]
[14] Michael Madden (2018). Available at: https://nuigalway.blackboard.com/bbcswebdav/pid-1584557-dt-content-rid-11960992_1/xid-11960992_1, p. 4
[15] rdocumentation.org (2018). Available at: https://www.rdocumentation.org/packages/e1071/versions/1.7-0/topics/naiveBayes [Accessed 7 Oct. 2018]
[16] rdocumentation.org (2018). Available at: https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/confusionMatrix [Accessed 7 Oct. 2018]
[17] rdocumentation.org (2018). Available at: https://www.rdocumentation.org/packages/DrugClust/versions/0.2/topics/CreateFolds [Accessed 7 Oct. 2018]
[18] rdocumentation.org (2018). Available at: [Accessed 7 Oct. 2018]