Assignment 1 – Group: 21
Tutors: Yang Lin
Group members: Yuhong Gong (ygon9301), Bowen Guan (bgua0071), Gunbilegt Byambadorj (gbya2617)
Abstract
Image classification, the task of assigning an image to its corresponding category, is one of the most widely studied problems in machine learning. Machine learning models surpassed human-level performance in the ImageNet competition in 2015. In this assignment we compare the performance of several machine learning methods, namely K-Nearest Neighbours (KNN), a Support Vector Machine with a linear kernel, Logistic Regression, and Naive Bayes, on classifying the Fashion-MNIST dataset.
Table of Contents
1. Introduction
1.1 The aim of the study
1.2 The Importance of This Study
2. Pre-processing
2.1 Min-Max Normalization
2.2 Principal Component Analysis
2.3 Singular Value Decomposition
2.3.1 Introduction
2.3.2 Implementation
2.3.3 Improvement
3. Classifier
3.1 K-Nearest Neighbours Classifier
3.1.1 Introduction
3.1.2 Implementation
3.1.3 Influence of data distribution
3.1.4 Improvement
3.2 SVM
3.2.1 Introduction
3.2.2 Theory
3.2.3 Implementation
3.2.4 Improvement
3.3 Multi-class Logistic Regression
3.3.1 Introduction
3.3.2 Implementation
3.3.3 Pros and Cons
3.3.4 Improvement
3.4 Naive Bayes Classifier
3.4.1 Introduction
3.4.2 Implementation
3.4.3 Pros and Cons
3.4.4 Improvement
4. Results
4.1 KNN
4.2 Naive Bayes
4.3 Multi-class Logistic Regression
4.4 Support Vector Machine
5. Conclusions
6. Hardware and Software
6.1 Hardware
6.2 Software
References
1. Introduction
In this study, we evaluate the performance, accuracy, and implementation of four supervised machine learning algorithms: K-Nearest Neighbours, SVM, Multi-Class Logistic Regression, and Naive Bayes. We build models to classify images from the Fashion-MNIST database, with the goal of finding the most suitable algorithm: one that is computationally inexpensive and achieves the best result.
The input dataset consists of 30,000 labelled grayscale images with 28×28 resolution, i.e. 784 dimensions. Images are classified into 10 groups labelled from 0 to 9. Classifiers are trained on 24,000 randomly selected images and validated on the remaining 6,000. An additional dataset with 10,000 records, of which 2,000 are labelled, is provided for testing. The output of this study is the set of labels predicted for the latter dataset by our nominated algorithm, multi-class logistic regression.
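The random split described above can be sketched as follows; this is a minimal illustration, assuming NumPy and a fixed seed for reproducibility (the seed and variable names are ours, not from the original code):

```python
import numpy as np

# Hypothetical sketch of the split described above: from 30,000 labelled
# images, hold out 6,000 for validation and train on the remaining 24,000.
rng = np.random.default_rng(42)
n_total, n_train = 30_000, 24_000

indices = rng.permutation(n_total)          # shuffle all example indices
train_idx = indices[:n_train]               # 24,000 randomly chosen for training
val_idx = indices[n_train:]                 # the remaining 6,000 for validation

print(len(train_idx), len(val_idx))  # 24000 6000
```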
We tested our algorithms on several reduced dimensionalities, obtained using principal component analysis. In the final code revision, we picked the most appropriate dimensionality for each algorithm.
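A minimal sketch of this dimensionality reduction step, assuming scikit-learn's PCA implementation (the component count of 50 and the random stand-in data are illustrative, not the values used in the study):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 100 flattened 28x28 grayscale images (784 features each).
rng = np.random.default_rng(0)
X = rng.random((100, 784))

# Project the 784-dimensional images onto the top 50 principal components.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (100, 50)
```

In practice the PCA is fit on the training set only, and the same fitted transform is then applied to the validation and test images.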
In our findings, K-Nearest Neighbours achieved the highest accuracy, followed by multi-class logistic regression. However, the computational cost of K-Nearest Neighbours was far higher, so we nominated logistic regression as the best model for this task.
Other deep learning methods were eliminated for the same reason. We built a convolutional neural network and evaluated its forward propagation, but it was so computationally expensive that it might not finish within the required duration of 10 minutes. Its source code is included in the appendix.
1.1 The aim of the study
We aim to evaluate the advantages and disadvantages of different algorithms in order to find the most accurate algorithm with the least computational expense. The latter requirement mattered enough that we did not pick KNN, even though it achieved the highest accuracy.
1.2 The Importance of This Study
The study tests our understanding of the basic principles of machine learning. We apply our knowledge to build a real classifier and to examine and select the best model for a specific problem. Many factors need to be considered, including data pre-processing and data distribution, and understanding the advantages and disadvantages of each algorithm is also important for making decisions.
2. Pre-processing
2.1 Min-Max Normalization
The given data comprise a training set of 30,000 examples and a test set of 10,000 examples. Min-max normalization (Equation 1) scales every value into the range [0, 1] without distorting the differences in the range of the original values. This lets us process the data more easily and quickly: it reduces the sensitivity of training to the scale of the features, which makes it easier to solve for the coefficients, and the resulting smaller variance improves convergence and hence the performance of each model.
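As a minimal sketch, assuming the standard min-max formula x' = (x − min) / (max − min) applied over the whole array (for 8-bit grayscale pixels with minimum 0 and maximum 255, this reduces to dividing by 255):

```python
import numpy as np

def min_max_normalize(X):
    """Scale every value of X into [0, 1] using the global min and max."""
    x_min, x_max = X.min(), X.max()
    return (X - x_min) / (x_max - x_min)

# Toy example: a few pixel intensities in the 0-255 range.
images = np.array([[0.0, 128.0, 255.0],
                   [64.0, 32.0, 192.0]])
scaled = min_max_normalize(images)

print(scaled.min(), scaled.max())  # 0.0 1.0
```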