1. Introduction

Machine learning (ML) is a broad field. One well-known definition is that a program learns if its performance on a task improves with experience (Mitchell 1997). The aim is to extract useful knowledge from large, raw, or unstructured datasets. Depending on the goal, learning methods can be classified as supervised, unsupervised, semi-supervised, and reward-based learning. In this assignment, I mainly focus on the classification algorithms of supervised learning.

2. Algorithm Choice – Logistic Regression

2.1 Model Introduction

There are several kinds of classification algorithms. Some are probability-based, such as Naïve Bayes and Logistic Regression; some are instance-based, such as KNN; and some, such as Linear Regression, are designed for numeric prediction. Many of them are already available as packages in R, Python, or C++, so the user only needs to import them. However, off-the-shelf implementations can have potential drawbacks and limitations. In this assignment, I implemented Logistic Regression from scratch in R, addressed some of those drawbacks, and also created a user-friendly interface.

Binary Logistic Regression: The basic idea of logistic regression is to start from a linear classifier that produces a predicted result, \(f_{θ}(x)=θ_{0}x_{0}+θ_{1}x_{1}+\dots+θ_{n}x_{n}=\boldsymbol{θ*X}\) with \(x_{0}=1\). In that formula, X is the set of features and θ is the set of model parameters.

The formula above is the general form of a linear classifier. With a hard threshold, the output \(T(f_{θ}(x))\) has only two possible values, 0 and 1. Logistic regression, despite its name, is a classification method rather than a regression method: it applies a soft, differentiable threshold that maps the result to the range [0, 1] via a sigmoidal function. As with the linear classifier, the assumption is that the classes are linearly separable, and both approaches look for an optimal separating hyperplane. Therefore, we get the normal form of logistic regression, \(g(f_{θ}(x))=g(\boldsymbol{θ*X})=\frac{1}{1+e^{-\boldsymbol{θ*X}}}\) (see Table 2 below).

2.2 Standard Gradient Descent

After that, the final aim is to find the optimal θ set to construct the model. One possible approach is to minimize the cost function, here defined as the averaged sum of squared errors between the predictions \(g_{θ}(x^{(i)})\) and the actual outcomes \(y^{(i)}\) over all N observations. The two most common ways to do this are a closed-form solution based on partial derivatives and Gradient Descent. In this assignment, I use the latter.

The standard Gradient Descent workflow:

Initialize θ: any set of valid values (in this assignment, all are initialized to 0)
Repeat until convergence (the update is smaller than the tolerance level or θ no longer changes):
Simultaneously, for each \(θ_{j}\): \(θ_{j}= θ_{j}-α \frac{∂}{∂θ_{j}}J(θ)\)   (α is a fixed learning rate)
Table 1: Gradient Descent workflow
(2.1) Linear classifier (hypothesis form): \(f_{θ}(x)=θ_{0}x_{0}+θ_{1}x_{1}+θ_{2}x_{2}+\dots+θ_{n}x_{n}=\boldsymbol{θ*X}\), with \(x_{0}=1\)
(2.2) Sigmoidal function: \(g(z)=\frac{1}{1+e^{-z}}\)
(2.3) Logistic regression: \(g(f_{θ}(x))=g(\boldsymbol{θ*X})\)
(2.4) Cost function: \(J(θ)=\frac{1}{2N}\sum_{i=1}^{N}\left(g(f_{θ}(x^{(i)}))-y^{(i)}\right)^{2}\)
(2.5) Partial derivative for \(θ_{j}\): \(\frac{∂}{∂θ_{j}}J(θ)=\frac{1}{N}\sum_{i=1}^{N}\left(g(f_{θ}(x^{(i)}))-y^{(i)}\right)x_{j}^{(i)}\)
Table 2: Formulas used for logistic regression and Gradient Descent
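To make the update rule concrete, the sketch below performs a single vectorized Gradient Descent step on a tiny made-up data set; all values and variable names here are illustrative and are not taken from the assignment data.

sigmoid <- function(z) 1 / (1 + exp(-z))                            #Table 2.2

X <- cbind(x0 = 1, x1 = c(0.2, 1.5, 3.1), x2 = c(1.0, 0.4, 2.2))    #Toy N x (n+1) design matrix, x0 = 1
y <- c(0, 0, 1)                                                     #Toy binary class labels
theta <- rep(0, ncol(X))                                            #Initialize θ to 0
alpha <- 0.08                                                       #Learning rate α

grad  <- t(X) %*% (sigmoid(X %*% theta) - y) / nrow(X)              #Table 2.5: (1/N) Σ (g(f_θ(x)) - y) x_j
theta <- theta - alpha * as.vector(grad)                            #Simultaneous update of every θ_j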

3. Algorithm Extension – Multi-Class Logistic Regression

Following the Gradient Descent steps above, we obtain an optimal theta set for binary logistic regression and can predict from the final output probability. The approach can also be extended to multiple classes. The two most common schemes are One-to-One (one-vs-one) and One-to-Many (one-vs-rest); in this assignment I use the One-to-Many approach. The basic idea is to select one target label as the positive class, treat all remaining labels as the negative class, and build a binary classifier for that case. Doing this for every target label and combining the resulting classifiers gives the final multi-class classifier, so multi-class logistic regression can be viewed as a set of binary logistic regressions. This approach is not limited to logistic regression; it can turn any binary classifier into a multi-class one. A minimal prediction-time sketch is shown below.
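The sketch below illustrates only the combination step at prediction time; the helper names binary_prob and predict_one_vs_rest are hypothetical, and it assumes one fitted theta vector per class label.

sigmoid <- function(z) 1 / (1 + exp(-z))

binary_prob <- function(theta, x) sigmoid(sum(theta * x))   #Probability from one fitted binary classifier

predict_one_vs_rest <- function(theta_by_class, x) {
  #theta_by_class: a named list with one fitted theta vector per target label
  probs <- sapply(theta_by_class, binary_prob, x = x)       #Run every binary classifier on x
  names(which.max(probs))                                   #The most confident classifier wins
}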

4. From-Scratch Algorithm Implementation

4.1 Dataset Preprocessing

First of all, the input data set has to be preprocessed. For any raw data set, a new column called para, representing \(x_{0}\), is added and initialized to 1 for every row. The predicted attribute (the response) is placed as the last column of the data set and renamed to type. (Note: the response column must be renamed to type before the data set is used as input.)

library(dplyr)                                                            #Provides %>%, mutate(), select(), arrange()
m_dataset <- m_dataset %>% mutate(para = 1)                               #Add a new column para as x0
m_dataset <- m_dataset %>% select(para, everything()) %>% arrange(type)   #Arrange the data set
str(m_dataset)                                                            #Show the result

4.2 From-Scratch Algorithm

The code in this section does not depend on any ML libraries. The code segment below shows how the binary logistic regression model is constructed and how Gradient Descent proceeds.

Find_Theta_Custom_Logistic <- function(data_set, ite_num, target_label)
{ #Find the optimal theta set for the target attribute.
  #data_set: the input training set. ite_num: the maximum number of iterations.
  #target_label: the class label to predict (e.g. Long).
  data_set <- data_set %>% mutate(type = case_when(
    type == target_label ~ 1,          #Set the target label class to 1 (e.g. Long ~ 1)
    TRUE                 ~ 0           #Set all non-target classes to 0 (e.g. Snowy ~ 0 and Barn ~ 0)
  ))                                   #The problem is now a binary classification.
  theta_set <- rep(0, ncol(data_set) - 1)  #Initialize one theta per predictor column (θ0, θ1, θ2, ...)
  learning_rate <- 0.08                    #Set the fixed learning rate (α) to 0.08
  tol <- 1e-8                              #Set the tolerance for stopping the iteration
  sapply(c(1:ite_num), function(x){        #Gradient Descent runs inside sapply() for at most ite_num iterations.
    result1 <- rep(0, nrow(data_set))      #Container for the linear combination f_θ(x)
    for (i in c(1:length(theta_set)))
      result1 <- result1 + theta_set[i] * data_set[, i]     #f_θ(x) = θ0*1(x0) + θ1x1 + ... + θnxn (Table 2.1)
    result1 <- sigod(result1) - data_set$type               #g(f_θ(x)) - yi
    result <- result1 * data_set[, -ncol(data_set)]         #(g(f_θ(x)) - yi) * xj
    result <- apply(result, 2, sum) * 1.0 / nrow(data_set)  #Averaged partial derivative for each theta (Table 2.5)
    result <- result * learning_rate                        #learning_rate * partial derivative of each theta
    theta_set <<- theta_set - result                        #Simultaneously update the whole theta_set
  })
  theta_set                   #Return the optimal theta_set found for the prediction of the target label
}
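A usage sketch follows; the iteration count of 1000 is an arbitrary illustrative choice, and the data frame is assumed to be preprocessed as in section 4.1 (para first, type last).

theta_long <- Find_Theta_Custom_Logistic(training_set, ite_num = 1000,
                                         target_label = "LongEaredOwl")
theta_long     #One theta value per predictor column (para, then the remaining features)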

The code segment below shows how to construct a multi-class logistic regression model (One-to-Many).

Find_Theta_List_Logistic <- function(training_set, ite_num)   #Find the theta list for multi-class prediction
{ #training_set: the input training set. ite_num: the maximum number of iterations.
  m <- sapply(categories_list, function(x){    #categories_list holds all possible target labels (e.g. Long, Snowy & Barn)
    fin_theta <- Find_Theta_Custom_Logistic(training_set, ite_num, x)  #Use Gradient Descent to find the
    fin_theta                                                          #optimal theta set for each target label
  })
  theta_list <- matrix(m, nrow = ncol(training_set) - 1, ncol = length(categories_list))
                                               #Store the theta sets as a matrix: rows are the attributes,
                                               #columns are the target labels
  colnames(theta_list) <- categories_list      #Name the theta list columns with the target labels
  theta_list <- t(theta_list)                  #Transpose so the columns are the attributes, which simplifies
                                               #the matrix multiplication later
  theta_list                                   #Return the optimal theta_list
}

The sample result after running once (columns 1-5 correspond to the selected features):
[,1]       [,2]        [,3]       [,4]       [,5]                                       
BarnOwl       0.06324485 -0.6732138  0.04550235  0.2861187 -0.1342547                                   
LongEaredOwl  0.20102864  1.0540039  0.31026545 -1.6643000 -0.7620899                                   
SnowyOwl     -0.36642242 -0.8851152 -0.84151912  1.3608909  0.9122416
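The theta_list matrix can then be used to classify a new observation. The sketch below is illustrative only: predict_label() is a hypothetical helper, sigod() is the sigmoid function defined further below, and the feature order is assumed to match the preprocessed columns (para first, then the four predictors).

predict_label <- function(theta_list, x)
{
  probs <- sigod(theta_list %*% as.numeric(x))   #One probability per class label (rows of theta_list)
  rownames(theta_list)[which.max(probs)]         #The most confident binary classifier wins
}
predict_label(theta_list, c(1, 2.6, 5.4, 4.1, 0.8))   #The unknown observation from section 5.2, with para = 1 first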

The sigmoid (logistic) function, implemented as sigod():

sigod <- function(x)
{
  1.0 / (1.0 + exp(-x))              #The sigmoid function g(x) = 1/(1+e^(-x)) (Table 2.2)
}
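A quick sanity check of the mapping to [0, 1]:

sigod(c(-10, 0, 10))    #≈ 0.0000454, 0.5, 0.9999546 – large negative inputs go towards 0, large positive towards 1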

5. Testing & User Interface

5.1 Data Splitting

In this assignment, I split the data into training and testing sets using \(\textbf{stratified sampling}\), which keeps the class proportions the same in both sets. For example, the three classes Long, Snowy & Barn appear in a 1:1:1 proportion in the original data set. If 2/3 of the data is used for training and the remaining 1/3 for testing, the three categories should make up 2/9 : 2/9 : 2/9 of the full data in the training set and 1/9 : 1/9 : 1/9 in the testing set. The benefit of stratified sampling is that the class distribution stays uniform across the split.

library(sampling)                                      #Provides the strata() function for stratified sampling
after_sample <- strata(m_dataset, stratanames = c("type"), size = sam_pro$count, method = "srswor")
training_set <- m_dataset[after_sample$ID_unit, ]      #Split into training set
testing_set  <- m_dataset[-after_sample$ID_unit, ]     #Split into testing set
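sam_pro$count (the number of rows to draw per class) is not shown above; the following is only a sketch of how it could be built, assuming a 2/3 training fraction and that m_dataset is already sorted by type as in section 4.1.

sam_pro <- m_dataset %>%
  count(type, name = "count") %>%          #Number of observations per class, in the same order as the sorted data
  mutate(count = round(count * 2/3))       #Take 2/3 of each class for training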

5.2 Testing & User Interface

For the user interface, I mainly use R Shiny, which makes it easy to build a simple web application. Figure 5.1 shows the accuracy over 10 runs using all predictors (attributes), with 66% of the data for training and 34% for testing.

Figure 5.3 shows the accuracy over 10 runs using only two of the predictors with the same training and testing sets, and figure 5.2 shows how an unknown observation is predicted with this model. For example, with the input \(\textbf{body.length=2.6, wing.length=5.4, body.width=4.1, wing.width=0.8}\), the user clicks the 'Predict Now' button and a prediction summary is shown; in this case, the observation appears to belong to BarnOwl (52.58%). The user can then click the 'Go Visualization' button to show the result; as figure 5.4 shows, this observation falls within the BarnOwl cluster. Figure 5.5 shows the result for an 80% training / 20% testing split, and figure 5.6 illustrates the exported testing data with actual and predicted labels when the user clicks 'Export to File'.
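As a sketch of how each run's accuracy could be computed on the testing set, reusing the illustrative predict_label() helper from section 4.2:

predicted <- apply(testing_set[, -ncol(testing_set)], 1,
                   function(row) predict_label(theta_list, row))
accuracy <- mean(predicted == testing_set$type)    #Fraction of testing observations classified correctly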

Figure 5.1: The accuracy after running 10 times
Figure 5.2: The prediction interface for an unknown observation
Figure 5.3: The accuracy after running 10 times with 2 attributes
Figure 5.4: The visualization of an unknown result
Figure 5.5: The accuracy after running 10 times (80% training)
Figure 5.6: Output result to file for each run

6. Conclusion

For the given test data set, this algorithm performed well on LongEaredOwl, while SnowyOwl and BarnOwl were sometimes misclassified. Another interesting observation is that using all four attributes together performed better (average accuracy 90%) than using only some of them (average accuracy 77%). In addition, a higher number of Gradient Descent iterations generally gave better performance, and the accuracy line graph became more stable. The learning rate may also influence performance; I set the initial learning rate to 0.001. Accuracy was better with 66% of the data for training and 34% for testing. However, some potential drawbacks are unavoidable. One significant factor is computational complexity; matrix multiplication might be considered to reduce it. Another factor is the handling of nominal attributes. Overall, this extended algorithm brought a lot of convenience, but I hope to find solutions to overcome its drawbacks.