An Investigation on Classification by Support Vector Machines

 Screenplays & Play

of 8
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
An Investigation on Classification by Support Vector Machines
  Optimization for Big Data  –   Final Project Page 1   An Investigation on Classification by Support Vector Machines Danial Esmaeili Aliabadi  Abstract In this report, advantages of different approaches in classification over large size problems are studied. Selected case study is about response of different functions of yeast gene over different conditions. 79 different conditions over 2467 genes studied and measured by 6 different functions. Over aforementioned data, linear and non-linear classification models based on Support Vector Machine (SVM) are implemented. Moreover, different approaches for solving models are considered. Finally, we compared classification error as well as CPU time over different approaches. Binary Linear Classifier In this section, a simple linear classifier of SVM is used to create a base for our next extensions. As we can expect, linear classifier is not so flexible to cover whole data set especially when data is highly nonlinear (e.g. XOR problem). Although, linear SVM is feeble in classification but it is easy to implement and has faster convergence rate. There is always a tradeoff between simplicity and effectiveness of algorithms; therefore, we are imposed to implement simple version firstly. In order to solve SVM, we will use three methods, 1)   Single Pass over Data 2)   Stochastic Gradient Descent (SGD) 3)   Mini-batch approach In the Single Pass over Data, I covered whole set but just once in every iteration. The second method picks a random sample in each iteration. Finally, Mini-batch method is something between single  pass and SGD. Mini-batch selects a small subset of samples and will add average of gradients for whole subset to the weight vector. In mini-batch method, I choose batches of 20-sample in each iteration. A common issue in all of them is how to pick α t  iteration by iteration. Generally, two conditions for a suitable α t  need to be satisfied . First of all, α t should go to zero and secondly, α t  series should diverge. But again there is an open question which series gives the most effective step. Another issue is how to classify our samples. Shall we use binary classification for each criterion or not? As answer we can start from binary classification from one. General SVM paradigm is depicted in Algorithm.1.  Optimization for Big Data  –   Final Project Page 2   Algorithm 1. Classifying Linear SVM with SGD approach For k = 1, 2, …  Choose a sample randomly from N Calculate g h  in ∂ l( θ ) by pricewise Eq. 1 θ k+1  = θ k   –   α k (g h + λθ k ) c = (θ k+1 x i   –  y i ) α k =   * k - ½   end Algorithm 2. Classifying Linear SVM with Mini-batch approach For k = 1, 2, …  Choose set S k  with pre-specified size randomly from N Calculate g h  by getting average ∂ l( θ ) by Eq. 1 over S k   θ k+1  = θ k   –   α k (g h + λθ k ) c = (1/|S k |) Σ(θ k+1 x i   –  y i ) α k =   * k - ½   end =           <10       >1           =1  (1) Results of three methods are shown in different rows of Table.1. First column of table shows classification error over whole dataset. Second column corresponds to computation time per second and the last column reports offset of the hyperplane from the srcin along the normal vector (b/||w||). Table1. Linear SVM Result over first function of yeast genes Method Error (%) CPU Time (s) Offset Single Pass over Data 0.689096 50.743661 69.875047 Stochastic Gradient Descend 0.689096 0.417968 18.546455 Mini-batch method 0.689096 1.254667 44.676771 Lower-bound of Error Before starting multiclass classification, we can discuss about an important topic. How much error should we expect from multiclass classifier? Based on result of binary SVM classification, we can find a lower bound over aggregated classifiers. So, we can expect that error of multi-class classifier become greater than maximum error of  binary classifiers over each feature. For example in the current case study, we have six different features and maximum error of SGD method over each feature is around 1%, therefore, we can expect that linear multi-class classification be inaccurate at least 1%.  Optimization for Big Data  –   Final Project Page 3   Multi-class linear SVM classifier In the previous section we investigated power of binary classifier over samples. But our case study contains 6 different functions and 2 6  = 64 different combinations as binary classifier. The way that we can classify our samples is going to be discussed in current section. Due to nonlinearity of combination of labels, we should not expect same level of precision as  binary classification. Therefore, we can fill lack of precision by increasing number of iterations or using more sophisticated methods which will come later. Table.2 shows summary of classes and number of samples in each class. In order to do multi-class classification, we employ one.vs.all   approach which means chosen sample firstly analyzed to for each class and for the proper class we will give +1 as label and for the rest of classes -1. Then, we will update W vector for each class based on new information in addition to updating bias. Running stochastic-gradient descent algorithm for one million iterations in 1558 seconds yields 1.297% classification error. Table 2. Summary of SGD classifier over multi-class sample data Class Number 16 32 48 56 60 62 63 64 Total Number of Samples 3 14 27 121 35 11 16 2240 2467 Number of errors 0 3 3 0 4 1 10 11 32 For the second method we can use dual SVM method with linear classifier, therefore, we separate data to two different sections, learning set and test set. To enlarge hyper-  planes’ margins, C value is decided to be small (1e-8). The algorithm responding acceptable in learning set but showing weak  performance in test set. For evaluating test set, we collect all votes from different classes and pick the maximum as chosen cluster. Figure 1 depicts voting system of multi-class SVM. Figure 1. Voting system between classes. In order to calculate slope of hyperplane from dual results, we will use Karush  –  Kuhn  –  Tucker condition and for intercepts we can check α values. Those α’s with positive value are showing corresponding constraints satisfying by equality so we can calculate intercept by taking average of calculated intercepts. For different sizes of learning dictionary, linear classifier yields optimization +  Optimization for Big Data  –   Final Project Page 4   modeling errors in the learning set, beside statistical error for the test set. Table 3 shows error of both sets  by considering different sizes of learning set simultaneously. Table 3. C = 10 -8    –   Performance of separate binary classifiers Learning set size |D| Dictionary Error (%) Data Set Error (%) 50 0.0 24.807458 100 0.0 22.334820 200 0.0 21.078230 400 2.0 14.876368 500 0.6 11.957844 800 6.0 10.620182 1000 11.3 12.606405 2000 8.70 8.9987840 What is the problem and how could we solve?  In order to find a solution for this weak classification, we need first to find its reason. The reason is laid in the finding binary classifiers separately. We need to aggregate results over dictionary (learning set). We know that in any situation f  yi  should be bigger than f  y  but sometimes we can find some samples that this does not hold. To prevent this we can adjust bias value by corresponding formula.    =max  {      }   [     ]  +     >0→   >  [     ]   Basically, bias of correct label could be slightly bigger than incorrect label effect (10 -3 ). But this method as I checked was fast but not as accurate as I expected because it is just playing with biases not with weights (Can we achieve 4% error as aforementioned?). However, it disclosed some positive effect. Table 4. C = ½ - Performance of classifier with flexible bias value Learning set size |D| Dictionary Error (%) Data Set Error (%) 100 0.0 16.173490 200 0.0 11.390353 300 5.666667 9.566275 500 0 6.566680 800 2.125000 4.783137 In order to use whole flexibility of changing weights, Crammer and Singer (CS) method [1] could  be used. The algorithm is mentioned in Figure 2. For the sake of simplicity, we considered   value as 1.  Optimization for Big Data  –   Final Project Page 5   Figure 2. Skeleton of the CS algorithm for learning multiclass support vector machine [1]. It is worth mentioning, the key point in the CS method is visiting each sample in dictionary more than once. If visit them just once it would be exactly as efficient as separated binary classifiers but when we visit them more than once the effect of  B  p  in previous iteration conveys to the next run and increases  performance of classifier. To enhance speed and performance of sub-  problem’s solver, we could employ  Newton method to achieve second order convergence just consider that  A  p could be zero therefore a switching approach between gradient descent and newton is preferred. Termination condition could be a fixed number of run or algorithm meet predefined criterion. Table 5. Performance of linear Crammer and Singer method when each sample of dictionary visited different times Learning set size |D| Data Set Error 1 time visit (%) Data Set Error 10 times (%) Data Set Error 50 times (%) 50 26 21.56 21.24 200 17 13.70 12.02 400 10.98 12.60 10.94 1000 9.36 10.33 7.09 2000 8.43 7.05 4.82 Multi-class Non-linear SVM classifier For nonlinear classification, we need to change kernel function from simple inner products of input vectors to a Gaussian function (it is also called RBF) which is shown as following formula. ,= −(‖−‖  2  )  
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks