An Investigation on Classification by Support Vector Machines
Danial Esmaeili Aliabadi
Abstract
In this report, the advantages of different approaches to classification on large-scale problems are studied. The selected case study concerns the response of different functions of yeast genes under different conditions: 2467 genes were measured under 79 different conditions and labeled by 6 different functions. On these data, linear and nonlinear classification models based on the Support Vector Machine (SVM) are implemented, and different approaches for solving the models are considered. Finally, we compare classification error as well as CPU time across the different approaches.
Binary Linear Classifier
In this section, a simple linear SVM classifier is used as a base for our next extensions. As we can expect, a linear classifier is not flexible enough to cover the whole data set, especially when the data are highly nonlinear (e.g., the XOR problem). Although the linear SVM is weak as a classifier, it is easy to implement and has a faster convergence rate. There is always a tradeoff between the simplicity and the effectiveness of algorithms; therefore, we implement the simple version first. In order to solve the SVM, we use three methods:
1) Single Pass over Data
2) Stochastic Gradient Descent (SGD)
3) Minibatch approach
In the Single Pass over Data, I covered the whole set, but only once per iteration. The second method picks a random sample in each iteration. Finally, the Minibatch method is something between the single pass and SGD: it selects a small subset of samples and adds the average of the gradients over that subset to the weight vector. In the Minibatch method, I choose batches of 20 samples in each iteration. A common issue in all of them is how to pick α_t from iteration to iteration. Generally, two conditions need to be satisfied for a suitable α_t: first, α_t should go to zero, and second, the series Σ_t α_t should diverge. But it remains an open question which series gives the most effective step. Another issue is how to classify our samples: shall we use binary classification for each criterion or not? As an answer, we can start with binary classification on one function. The general SVM paradigm is depicted in Algorithm 1.
Algorithm 1. Classifying linear SVM with the SGD approach
For k = 1, 2, ...
    Choose a sample i randomly from N
    Calculate g_h in ∂l(θ) by the piecewise Eq. 1
    θ_{k+1} = θ_k - α_k (g_h + λ θ_k)
    c = (θ_{k+1}ᵀ x_i - y_i)
    α_k ∝ k^(-1/2)
end

Algorithm 2. Classifying linear SVM with the Minibatch approach
For k = 1, 2, ...
    Choose a set S_k of prespecified size randomly from N
    Calculate g_h as the average of ∂l(θ) from Eq. 1 over S_k
    θ_{k+1} = θ_k - α_k (g_h + λ θ_k)
    c = (1/|S_k|) Σ_{i ∈ S_k} (θ_{k+1}ᵀ x_i - y_i)
    α_k ∝ k^(-1/2)
end
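To make the updates in Algorithms 1 and 2 concrete, the following Python sketch implements the SGD variant with the hinge-loss subgradient of Eq. 1 shown below; the function name, the regularization weight lam, the step-size constant alpha0, and the iteration count are illustrative assumptions, not values from this report.

import numpy as np

def sgd_linear_svm(X, y, lam=1e-3, alpha0=1.0, iters=100000, seed=None):
    # One possible SGD loop for the L2-regularized hinge loss (sketch).
    # X: (n, d) data matrix, y: labels in {-1, +1}.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, iters + 1):
        i = rng.integers(n)                                  # pick one sample at random
        margin = y[i] * X[i].dot(theta)
        g_h = -y[i] * X[i] if margin < 1 else np.zeros(d)    # subgradient of Eq. 1
        alpha_k = alpha0 / np.sqrt(k)                        # alpha_k -> 0 while its series diverges
        theta = theta - alpha_k * (g_h + lam * theta)        # update step of Algorithm 1
    return theta

For the minibatch variant of Algorithm 2, the single index i is replaced by a random subset S_k and g_h becomes the average of the subgradients over that subset.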
g_h ∈ ∂l(θ) = -y_i x_i                    if y_i θᵀ x_i < 1
            = 0                           if y_i θᵀ x_i > 1
            = -t y_i x_i,  t ∈ [0, 1]     if y_i θᵀ x_i = 1        (1)

The results of the three methods are shown in the rows of Table 1. The first column of the table shows the classification error over the whole dataset, the second column gives the computation time in seconds, and the last column reports the offset of the hyperplane from the origin along the normal vector (b/‖w‖).
Table 1. Linear SVM results over the first function of yeast genes
Method                         Error (%)   CPU Time (s)   Offset
Single Pass over Data          0.689096    50.743661      69.875047
Stochastic Gradient Descent    0.689096     0.417968      18.546455
Minibatch method               0.689096     1.254667      44.676771
Lower Bound of Error
Before starting multiclass classification, we can discuss an important topic: how much error should we expect from a multiclass classifier? Based on the results of the binary SVM classification, we can find a lower bound for the aggregated classifiers: the error of the multiclass classifier should be greater than the maximum error of the binary classifiers over the individual features. For example, in the current case study we have six different features, and the maximum error of the SGD method over any single feature is around 1%; therefore, we can expect the linear multiclass classification to be inaccurate by at least 1%.
Multiclass Linear SVM Classifier
In the previous section we investigated the power of a binary classifier over the samples. But our case study contains 6 different functions, which gives 2^6 = 64 different combinations when treated as binary classifiers. How we can classify our samples is discussed in the current section. Due to the nonlinearity of the combination of labels, we should not expect the same level of precision as in binary classification. Therefore, we can compensate for the lack of precision by increasing the number of iterations or by using the more sophisticated methods that come later. Table 2 shows a summary of the classes and the number of samples in each class. In order to do multiclass classification, we employ the one-vs-all approach: each chosen sample is analyzed against every class, receiving the label +1 for the proper class and -1 for the rest of the classes. Then, we update the W vector of each class based on the new information, in addition to updating the bias. Running the stochastic gradient descent algorithm for one million iterations takes 1558 seconds and yields a 1.297% classification error; a minimal sketch of this one-vs-all scheme follows Table 2.
Table 2. Summary of the SGD classifier over the multiclass sample data
Class Number         16   32   48    56   60   62   63    64   Total
Number of Samples     3   14   27   121   35   11   16  2240    2467
Number of Errors      0    3    3     0    4    1   10    11      32
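As a rough, self-contained illustration of the one-vs-all scheme described above, the sketch below relabels the data for each class, trains one linear model per class with any binary trainer (for instance the SGD routine sketched earlier), and predicts by taking the class with the largest score; all names are illustrative assumptions, not code from this report.

import numpy as np

def train_one_vs_all(X, labels, classes, train_binary):
    # Train one binary classifier per class: label +1 for the class, -1 for the rest.
    models = {}
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)   # relabel for the current class
        models[c] = train_binary(X, y)         # returns a weight vector for class c
    return models

def predict_one_vs_all(X, models):
    # For each sample, pick the class whose linear score is largest (the "vote" winner).
    classes = list(models)
    scores = np.column_stack([X.dot(models[c]) for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]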
For the second method we can use the dual SVM formulation with a linear kernel; therefore, we separate the data into two sections, a learning set and a test set. To enlarge the hyperplanes' margins, the C value is chosen to be small (1e8). The algorithm responds acceptably on the learning set but shows weak performance on the test set. To evaluate the test set, we collect the votes from all classes and pick the maximum as the chosen cluster. Figure 1 depicts the voting system of the multiclass SVM.

Figure 1. Voting system between classes.

In order to calculate the slope of the hyperplane from the dual results, we use the Karush-Kuhn-Tucker conditions, and for the intercepts we can check the α values. The α's with positive value indicate constraints that are satisfied with equality, so we can compute the intercept by averaging the intercepts calculated from those constraints. For different sizes of the learning dictionary, the linear classifier yields optimization and modeling errors on the learning set, in addition to statistical error on the test set.
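A minimal sketch of the intercept recovery just described, assuming a solved dual with multipliers alpha; it follows the standard KKT relations (w built from the dual variables, b averaged over the support vectors with strictly positive alpha) rather than code from this report.

import numpy as np

def primal_from_dual(alpha, X, y, tol=1e-8):
    # KKT stationarity: w = sum_i alpha_i * y_i * x_i.
    w = (alpha * y) @ X
    # Constraints with alpha_i > 0 hold with equality: y_i * (w.x_i + b) = 1,
    # so each support vector gives b = y_i - w.x_i; average them for stability.
    sv = alpha > tol
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b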
Table 3 shows the errors of both sets for different sizes of the learning set.
Table 3. Performance of separate binary classifiers (C = 10^8)
Learning set size D    Dictionary Error (%)    Data Set Error (%)
  50                    0.0                     24.807458
 100                    0.0                     22.334820
 200                    0.0                     21.078230
 400                    2.0                     14.876368
 500                    0.6                     11.957844
 800                    6.0                     10.620182
1000                   11.3                     12.606405
2000                    8.70                     8.9987840
What is the problem and how could we solve it?
In order to find a solution for this weak classification, we first need to find its reason. The reason lies in training the binary classifiers separately; we need to aggregate the results over the dictionary (learning set). We know that in every case f_{y_i} should be bigger than f_y for y ≠ y_i, but we can find samples for which this does not hold. To prevent this, we can adjust the bias value with the following formula.
f_ŷ[x_i] = max_{y ≠ y_i} { f_y[x_i] },    w_{y_i}ᵀ x_i + b_{y_i} - f_ŷ[x_i] > 0  →  b_{y_i} > f_ŷ[x_i] - w_{y_i}ᵀ x_i
Basically, the bias of the correct label should be made slightly bigger than the effect of the incorrect labels (by a margin of about 10^-3). As I checked, this method is fast but not as accurate as I expected, because it only adjusts the biases and not the weights (can we achieve the 4% error mentioned above?). However, it did show some positive effect.
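A minimal sketch of the bias adjustment described above, assuming one weight row and one bias per class in W and b; the small margin eps plays the role of the roughly 10^-3 gap mentioned in the text, and all names are illustrative assumptions.

import numpy as np

def adjust_bias(W, b, x_i, y_i, eps=1e-3):
    # Raise the bias of the correct class y_i just above the best competing score.
    scores = W @ x_i + b                     # f_y[x_i] for every class y
    correct = scores[y_i]
    scores[y_i] = -np.inf                    # exclude the correct class
    best_wrong = scores.max()                # f_yhat[x_i], the strongest competitor
    if correct <= best_wrong:                # the correct class does not win yet
        b[y_i] = best_wrong - W[y_i] @ x_i + eps
    return b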
Table 4. Performance of the classifier with a flexible bias value (C = ½)
Learning set size D    Dictionary Error (%)    Data Set Error (%)
 100                    0.0                     16.173490
 200                    0.0                     11.390353
 300                    5.666667                 9.566275
 500                    0                        6.566680
 800                    2.125000                 4.783137

In order to use the full flexibility of changing the weights, the Crammer and Singer (CS) method [1] can be used. The algorithm is shown in Figure 2. For the sake of simplicity, we set its parameter value to 1.
Figure 2. Skeleton of the CS algorithm for learning a multiclass support vector machine [1].

It is worth mentioning that the key point in the CS method is visiting each sample of the dictionary more than once. If we visit each sample just once, the method is exactly as effective as the separate binary classifiers; but when we visit the samples more than once, the effect of B_p from the previous iteration carries over into the next run and increases the performance of the classifier. To enhance the speed and performance of the subproblem's solver, we could employ Newton's method to achieve second-order convergence; just note that A_p could be zero, so a switching approach between gradient descent and Newton's method is preferred. The termination condition can be a fixed number of runs, or the algorithm meeting a predefined criterion.
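The dual decomposition of [1] is more involved than can be shown here; as a simplified, hedged illustration of the idea only, the sketch below applies a primal subgradient step to the Crammer-Singer multiclass hinge loss and revisits the dictionary over several passes, so that updates from earlier passes carry over to later ones. It is not the algorithm of Figure 2, and all parameter values are illustrative assumptions.

import numpy as np

def cs_style_subgradient(X, labels, n_classes, passes=10, eta=0.1, lam=1e-3, seed=None):
    # Primal subgradient sketch of the Crammer-Singer multiclass hinge loss.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for _ in range(passes):                   # visit the dictionary several times
        for i in rng.permutation(n):
            yi = labels[i]
            margins = W @ X[i] + 1.0          # scores plus the margin term
            margins[yi] -= 1.0                # no margin for the correct class
            r = int(np.argmax(margins))       # most violating competitor
            if r != yi and margins[r] > margins[yi]:
                W[yi] += eta * X[i]           # pull the correct class up
                W[r]  -= eta * X[i]           # push the violator down
            W *= (1.0 - eta * lam)            # shrink weights for regularization
    return W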
Table 5. Performance of the linear Crammer and Singer method when each sample of the dictionary is visited a different number of times
Learning set size D    Data Set Error, 1 visit (%)    10 visits (%)    50 visits (%)
  50                    26                             21.56            21.24
 200                    17                             13.70            12.02
 400                    10.98                          12.60            10.94
1000                     9.36                          10.33             7.09
2000                     8.43                           7.05             4.82
Multiclass Nonlinear SVM Classifier
For nonlinear classification, we need to change the kernel function from the simple inner product of the input vectors to a Gaussian function (also called the RBF kernel), shown in the following formula.

K(x_i, x_j) = exp(-γ ‖x_i - x_j‖²)
)