Weighted structural support vector machine


Journal of Computer Science and Cybernetics, V.37, N.1 (2021), 43–56
DOI 10.15625/1813-9663/37/1/15396

WEIGHTED STRUCTURAL SUPPORT VECTOR MACHINE

NGUYEN THE CUONG (1), HUYNH THE PHUNG (2,*)

(1) Faculty of Basic, Telecommunications University, Khanh Hoa, Vietnam
(2) Department of Mathematics, College of Sciences, Hue University, Vietnam

*Corresponding author. E-mail addresses: nckcbnckcb@gmail.com (N.T.Cuong); huynhthephung@gmail.com (H.T.Phung).
© 2021 Vietnam Academy of Science & Technology

Abstract. In binary classification problems, the two classes of data differ from each other, and the situation becomes more complicated when the clusters within each class also follow different trends. Traditional algorithms such as the Support Vector Machine (SVM) or the Twin Support Vector Machine (TWSVM) cannot sufficiently exploit the structural information of the data at the cluster granularity, which limits their capability to simulate data trends. The Structural Twin Support Vector Machine (S-TWSVM) exploits structural information at the cluster granularity when learning a representative hyperplane, so its data simulation capability is better than that of TWSVM. However, for datasets in which each class consists of clusters with different trends, the data simulation capability of S-TWSVM is still restricted. Besides, the training time of S-TWSVM is not improved compared to TWSVM. This paper proposes a new Weighted Structural Support Vector Machine (called WS-SVM) for binary classification problems with a class-vs-clusters strategy. Experimental results show that WS-SVM can describe the distribution trends of the cluster information. Furthermore, both theory and experiment show that the training time of WS-SVM is significantly improved compared to S-TWSVM.

Keywords. Support vector machine; Twin support vector machine; Structural twin support vector machine; Weighted structural support vector machine.

1. INTRODUCTION

In the early years of the 21st century, the Support Vector Machine (SVM) [5, 17] became a popular binary classification algorithm applied to many different fields in practice [1, 3, 11, 15, 16]. SVM seeks a hyperplane separating two classes so that the margin between them is largest. Actual data is often distributed with different structures and tendencies, but SVM does not fully exploit the structural information of the data, so its ability to simulate data is limited. Nowadays, datasets are growing rapidly in size and diversifying in structure. This fact requires classification algorithms to guarantee accuracy while improving speed and the ability to simulate the data distribution. Many variants of SVM have recently been proposed to improve the speed and other aspects of the standard SVM [3, 10, 12, 16, 18]. Two typical innovations are the Twin Support Vector Machine (TWSVM) [7] and the Structural Twin Support Vector Machine (S-TWSVM) [13]. The main idea of TWSVM is to seek two hyperplanes such that each hyperplane is closer to one class and far away from the other, by solving two Quadratic Programming Problems (QPPs) whose sizes are smaller than that of the QPP in SVM. Despite having to solve two QPPs, TWSVM is approximately four times faster than the standard SVM.
S-TWSVM has the same strategy as TWSVM; in addition, it fully exploits structural information at the cluster granularity when learning the model, in order to build a more reasonable classifier.

There is a reason for handling SVM with structure. In some practical binary classification problems, each class consists of more than one cluster. For example, consider the problem of classifying fruits with data covering five categories: Mango, Jackfruit, Pineapple, Apples, and Grapes, where the fruits are only classified according to the criteria "smooth skin" or "rough skin". Obviously, data in the "smooth skin" class will form three clusters, corresponding to Mango, Apples, and Grapes, while data in the "rough skin" class will distribute into two clusters, corresponding to Jackfruit and Pineapple. S-TWSVM proved to be quite effective in the simple case, where each class consists of clusters with a similar distribution trend. However, for more complex data types, where each class contains clusters of different trends, S-TWSVM is not efficient at simulating the data trends.

Based on the strategy of TWSVM and S-TWSVM, we propose a new binary classification model: the Weighted Structural Support Vector Machine (called WS-SVM), with a class-vs-clusters strategy. Instead of solving two QPPs as in S-TWSVM, WS-SVM solves (l + k) QPPs, where k and l are the numbers of clusters in class {+} and class {−}, respectively. This method allows WS-SVM to effectively simulate the distribution trends of complex data types while also improving the computation speed.

The paper is organized as follows. Section 2 briefly introduces the background of SVM, TWSVM and S-TWSVM; Section 3 is devoted to a detailed description of WS-SVM along with the algorithms and discussions; all experimental results are presented in Section 4, together with a comparative evaluation; the conclusion is given in Section 5. All algorithms are implemented in version 3.8.3 of the Python programming language.

2. BACKGROUND

2.1. Structural granularity

Consider a binary classification problem with a dataset, denoted by the matrix C, consisting of m points (each point is a row of C) x_j ∈ ℝ^n, j = 1, ..., m. We also write x_j ∈ C to indicate that x_j is a row of C. Suppose that y_j ∈ {−1, 1} is the label of the j-th data point. Class {+} consists of m_A points collected in a matrix A ∈ ℝ^{m_A×n}, and class {−} consists of m_B points collected in a matrix B ∈ ℝ^{m_B×n}. There are k clusters in class {+}, whose i-th cluster consists of m_{A_i} points and is denoted by the matrix A_i ∈ ℝ^{m_{A_i}×n}. Also, there are l clusters in class {−}, whose i-th cluster consists of m_{B_i} points and is denoted by the matrix B_i ∈ ℝ^{m_{B_i}×n}. A, B, A_i, B_i are called structural granularities [19]. We are interested in the following quantities of structural granularity.

Class granularity:
$$\Sigma_A = \frac{1}{m_A}\sum_{x_j\in A}(x_j-\mu_A)(x_j-\mu_A)^T, \qquad \Sigma_B = \frac{1}{m_B}\sum_{x_j\in B}(x_j-\mu_B)(x_j-\mu_B)^T.$$

Cluster granularity:
$$\Sigma_{i+} = \frac{1}{m_{A_i}}\sum_{x_j\in A_i}(x_j-\mu_{A_i})(x_j-\mu_{A_i})^T, \qquad \Sigma_{i-} = \frac{1}{m_{B_i}}\sum_{x_j\in B_i}(x_j-\mu_{B_i})(x_j-\mu_{B_i})^T,$$

where μ_X denotes the average vector of the dataset X.
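To make these granularity quantities concrete, the following is a minimal NumPy sketch (ours, not part of the paper) that computes a class covariance Σ_A and the cluster covariances Σ_{i+} from a class matrix and a given cluster assignment; the function names and the synthetic data are illustrative only.

```python
import numpy as np

def class_covariance(X):
    """Sigma_X = (1/m) * sum_j (x_j - mu)(x_j - mu)^T over the rows of X."""
    mu = X.mean(axis=0)
    Xc = X - mu                      # center the data points
    return Xc.T @ Xc / X.shape[0]    # (n x n) covariance with 1/m normalization

def cluster_covariances(X, labels):
    """One covariance matrix (Sigma_{i+}) per cluster label found in `labels`."""
    return {c: class_covariance(X[labels == c]) for c in np.unique(labels)}

# Illustrative example: a class {+} made of two clusters with different trends.
rng = np.random.default_rng(0)
A1 = rng.normal([0, 0], [2.0, 0.3], size=(40, 2))    # cluster elongated horizontally
A2 = rng.normal([5, 5], [0.3, 2.0], size=(40, 2))    # cluster elongated vertically
A = np.vstack([A1, A2])
labels = np.array([1] * 40 + [2] * 40)

Sigma_A = class_covariance(A)                 # class granularity
Sigma_plus = cluster_covariances(A, labels)   # cluster granularity {1: Sigma_1+, 2: Sigma_2+}
print(Sigma_A.shape, Sigma_plus[1].shape)     # (2, 2) (2, 2)
```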
2.2. Support vector machine

The standard SVM [17] seeks a hyperplane w^T x + b = 0 separating class {+} and class {−} such that the margin 2/‖w‖₂ between the two classes is largest. However, the standard SVM is only applicable when the data is linearly separable. When the data is not linearly separable, the soft-margin SVM [17] is recommended, with looser constraints:

$$\begin{aligned} \min_{w,b,\xi}\;& c e^T\xi + \frac{1}{2}\|w\|_2^2,\\ \text{s.t. }& D(Cw + eb) + \xi \ge e,\quad \xi \ge 0, \end{aligned} \tag{1}$$

where D ∈ ℝ^{m×m} is the diagonal matrix with D_{jj} = y_j for all j, ξ ∈ ℝ^m is the vector of slack variables, c ∈ ℝ is the penalty coefficient, appropriately selected to adjust the roles of the terms in the objective function, and e ∈ ℝ^m is the vector of ones. A new data point x is classified into class {+} if sgn(f(x)) > 0, where f(x) = w^T x + b, and into class {−} if sgn(f(x)) < 0. SVM does not effectively exploit the structural information of the data, so it simulates the distribution structures of the two classes as if they were the same (see Figure 1.a).

2.3. Twin support vector machine

Based on the strategy of multisurface proximal support vector classification via generalized eigenvalues (GEPSVM) [9], the main idea of TWSVM [7] for the binary classification problem is to seek two hyperplanes:

• f_+(x) = w_+^T x + b_+ = 0, which is closer to class {+} and far away from class {−},
• f_−(x) = w_−^T x + b_− = 0, which is closer to class {−} and far away from class {+},

by solving the two QPPs

$$(\text{TWSVM1})\quad \begin{aligned} \min_{w_+,b_+,\xi}\;& \frac{1}{2}\|Aw_+ + e_+b_+\|_2^2 + c_1 e_-^T\xi,\\ \text{s.t. }& -(Bw_+ + e_-b_+) + \xi \ge e_-,\quad \xi \ge 0, \end{aligned} \tag{2}$$

$$(\text{TWSVM2})\quad \begin{aligned} \min_{w_-,b_-,\eta}\;& \frac{1}{2}\|Bw_- + e_-b_-\|_2^2 + c_2 e_+^T\eta,\\ \text{s.t. }& (Aw_- + e_+b_-) + \eta \ge e_+,\quad \eta \ge 0, \end{aligned} \tag{3}$$

where c_1, c_2 are penalty coefficients that adjust the roles of the terms in the objective functions, e_+ ∈ ℝ^{m_A}, e_− ∈ ℝ^{m_B} are vectors of ones, and ξ ∈ ℝ^{m_B}, η ∈ ℝ^{m_A} are vectors of slack variables. A new data point x is classified into class {+} or class {−} depending on whether it is closer to the hyperplane f_+(x) = 0 or to the hyperplane f_−(x) = 0. TWSVM simulates the structural distributions of the two classes independently, but it does not actually simulate the distribution trend of the data within each class (see Figure 1.b). It has been shown in [7] that the training time of TWSVM is approximately four times faster than that of the standard SVM (see Table 1).
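As an illustration of how a QPP such as (1) can be handed to an off-the-shelf solver, here is a minimal sketch using cvxopt (the QP solver used in the experiments of Section 4). The variable ordering z = (w, b, ξ), the small ridge that keeps the quadratic term strictly positive definite, and the toy data are our own choices, not something prescribed by the paper.

```python
import numpy as np
from cvxopt import matrix, solvers
solvers.options['show_progress'] = False

def soft_svm(C_data, y, c=1.0, ridge=1e-8):
    """Solve problem (1): min_{w,b,xi} c*e^T xi + 0.5*||w||^2
       s.t. D(Cw + e b) + xi >= e, xi >= 0, as one QP in z = (w, b, xi)."""
    m, n = C_data.shape
    D = np.diag(y.astype(float))
    # Quadratic term: only w is penalized; the ridge keeps P strictly positive definite.
    P = np.zeros((n + 1 + m, n + 1 + m))
    P[:n, :n] = np.eye(n)
    P += ridge * np.eye(n + 1 + m)
    q = np.hstack([np.zeros(n + 1), c * np.ones(m)])
    # Inequalities rewritten as G z <= h.
    G_top = np.hstack([-D @ C_data, -(D @ np.ones((m, 1))), -np.eye(m)])  # -(D(Cw+eb)+xi) <= -e
    G_bot = np.hstack([np.zeros((m, n + 1)), -np.eye(m)])                 # -xi <= 0
    G = np.vstack([G_top, G_bot])
    h = np.hstack([-np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]                                                    # (w, b)

# Toy usage with two linearly separable blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
w, b = soft_svm(X, y, c=10.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```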
2.4. Structural twin support vector machine

S-TWSVM [13] has two steps. The first step is to extract the structural information within the classes by Ward's linkage clustering method [8, 19]; the second step is the model learning. Suppose that there are k clusters A_1, ..., A_k in class A, each cluster A_i consisting of m_{A_i} data points, and l clusters B_1, ..., B_l in class B, each cluster B_i consisting of m_{B_i} data points. S-TWSVM determines two hyperplanes

$$f_+(x) = w_+^T x + b_+ = 0, \qquad f_-(x) = w_-^T x + b_- = 0, \tag{4}$$

by solving the two QPPs

$$\begin{aligned} \min_{w_+,b_+,\xi}\;& \frac{1}{2}\|Aw_+ + e_+b_+\|_2^2 + c_1 e_-^T\xi + \frac{1}{2}c_2(\|w_+\|_2^2 + b_+^2) + \frac{1}{2}c_3 w_+^T\Sigma_+ w_+,\\ \text{s.t. }& -(Bw_+ + e_-b_+) + \xi \ge e_-,\quad \xi \ge 0, \end{aligned} \tag{5}$$

$$\begin{aligned} \min_{w_-,b_-,\eta}\;& \frac{1}{2}\|Bw_- + e_-b_-\|_2^2 + c_4 e_+^T\eta + \frac{1}{2}c_5(\|w_-\|_2^2 + b_-^2) + \frac{1}{2}c_6 w_-^T\Sigma_- w_-,\\ \text{s.t. }& (Aw_- + e_+b_-) + \eta \ge e_+,\quad \eta \ge 0, \end{aligned} \tag{6}$$

where c_1, ..., c_6 ≥ 0 are penalty coefficients that adjust the roles of the terms in the objective functions, ξ ∈ ℝ^{m_B}, η ∈ ℝ^{m_A} are vectors of slack variables, Σ_+ = Σ_{1+} + Σ_{2+} + ··· + Σ_{k+}, Σ_− = Σ_{1−} + Σ_{2−} + ··· + Σ_{l−}, with Σ_{i+} and Σ_{i−} being the covariance matrices corresponding to the clusters A_i and B_i, respectively, and e_+ ∈ ℝ^{m_A}, e_− ∈ ℝ^{m_B} are vectors of ones. A new data point is assigned to class {+} or {−} in the same manner as in TWSVM.

In problem (5), (1/2)‖Aw_+ + e_+b_+‖² is the sum of the squares of the distances from the data points in class {+} to the hyperplane {f_+(x) = 0}; c_1 e_−^T ξ is the sum of errors; (1/2)c_2(‖w_+‖² + b_+²) is the regularization term; and (1/2)c_3 w_+^T Σ_+ w_+ is the sum of the covariance matrices at the cluster granularity of class {+} projected onto the vector w_+. The constraints of (5) are defined by the points of class {−}. Problem (6) is established similarly for class {−}, with the constraints defined by class {+}. Thus, S-TWSVM exploits the structural information at the cluster granularity of one class in each problem. Therefore, the data simulation capability of S-TWSVM is more accurate than that of TWSVM (see Figure 1.c). However, as the data becomes more complex, this ability of S-TWSVM remains limited (see Figure 2.c). By using full information about the cluster granularity of the classes, S-TWSVM finds hyperplanes that represent the classes better than TWSVM. However, because all the covariance matrices Σ_{i+} and Σ_{i−} of both classes must be calculated, the training speed of S-TWSVM is not improved compared to that of TWSVM (see Table 1).
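The clustering step that S-TWSVM (and, later, WS-SVM) relies on can be reproduced with SciPy's hierarchical clustering, which is also the library used in Section 4. The sketch below assumes Ward's linkage with a cluster count fixed in advance (e.g. by the elbow method); the helper name and the synthetic data are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_clusters(X, n_clusters):
    """Split the rows of X into `n_clusters` groups with Ward's linkage.
       Returns a list of sub-matrices, one per cluster (the A_i or B_i)."""
    Z = linkage(X, method='ward')                          # hierarchical merge tree
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return [X[labels == c] for c in range(1, n_clusters + 1)]

# Illustrative class {+} made of two clusters with different trends.
rng = np.random.default_rng(2)
A = np.vstack([rng.normal([0, 0], [2.0, 0.3], (50, 2)),
               rng.normal([6, 6], [0.3, 2.0], (50, 2))])
A_clusters = ward_clusters(A, n_clusters=2)
print([Ai.shape for Ai in A_clusters])                     # e.g. [(50, 2), (50, 2)]
```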
Figure 1. For datasets in which the clusters in each class follow the same trend, SVM (a) simulates the distribution trends of the two classes as if they were the same; TWSVM (b) simulates the distribution trends of the two classes as different, but does not actually simulate the distribution trend of the data within each class; S-TWSVM (c) simulates the distribution trend within each class quite accurately.

Figure 2. For datasets in which the clusters in each class are constructed according to different distribution trends, SVM (a), TWSVM (b), and S-TWSVM (c) all have difficulty simulating the distribution trend of the data.

Table 1. Training time (seconds), with the dataset shown in Figure 2

Algorithms   1200 data points   1600 data points   2000 data points   2400 data points
SVM          2.143              5.308              6.754              11.512
TWSVM        0.351              0.700              1.193              1.805
S-TWSVM      0.388              0.853              1.434              2.158

Figure 3. (WS-SVM) f_{1+} (thin solid line) shows that cluster B_1 (small squares) tends to deviate vertically; f_{2+} (bold solid line) shows that cluster B_2 (large squares) tends to deviate horizontally; f_{1−} (thin dashed line) shows that cluster A_1 (small circles) tends to deviate horizontally; f_{2−} (bold dashed line) shows that cluster A_2 (large circles) tends to deviate vertically.
3. WEIGHTED STRUCTURAL SUPPORT VECTOR MACHINE

In this section, we describe a new classification algorithm: the Weighted Structural Support Vector Machine (called WS-SVM). Both theoretically and experimentally, we show that WS-SVM outperforms S-TWSVM in data simulation and training speed.

Similar to S-TWSVM, WS-SVM has two steps. The first step is to extract the structural information within the classes by Ward's linkage clustering method [8, 13, 19]; the second step is the model learning. Suppose that there are k clusters A_1, ..., A_k in class {+} and l clusters B_1, ..., B_l in class {−}. WS-SVM uses a class-vs-clusters strategy to determine (l + k) hyperplanes, each of which is closer to one class and far away from one cluster of the other class. Specifically, the method needs to find l hyperplanes such that the i-th hyperplane, f_{i+}(x) = w_{i+}^T x + b_{i+} = 0, is closer to class {+} and far away from cluster B_i of class {−}; it also needs to find k hyperplanes such that the i-th hyperplane, f_{i−}(x) = w_{i−}^T x + b_{i−} = 0, is closer to class {−} and far away from cluster A_i of class {+} (see Figure 3). Here w_{i+}, w_{i−} ∈ ℝ^n and b_{i+}, b_{i−} ∈ ℝ. The classifier is now selected as

$$f(x) = \underset{+,\,-}{\operatorname{argmin}}\,\big(f_+(x), f_-(x)\big), \tag{7}$$

with

$$f_+(x) = \sum_{i=1}^{l}\frac{m_{B_i}}{m_B}\,|f_{i+}(x)|, \qquad f_-(x) = \sum_{i=1}^{k}\frac{m_{A_i}}{m_A}\,|f_{i-}(x)|. \tag{8}$$

From this definition, we see that f_+(x) is the weighted average of the distances from x to the hyperplanes {f_{i+}(x) = 0}; the weight of the i-th hyperplane is proportional to m_{B_i}, the number of data points in cluster B_i. Similarly, f_−(x) is the weighted average of the distances from x to the hyperplanes {f_{i−}(x) = 0}. By virtue of (7), a new data point x is classified into class {+} or class {−} depending on whether f_+(x) is less than or greater than f_−(x).

3.1. The linear case

WS-SVM determines the (l + k) hyperplanes by solving the following (l + k) QPPs:

$$(\text{WS-SVM}_i^+)\quad \begin{aligned} \min_{w_{i+},b_{i+},\xi_i}\;& \frac{1}{2}\|Aw_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|w_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}w_{i+}^T\Sigma_+ w_{i+},\\ \text{s.t. }& -(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i \ge e_{m_{B_i}},\quad \xi_i \ge 0, \end{aligned}$$

for i = 1, ..., l, and

$$(\text{WS-SVM}_i^-)\quad \begin{aligned} \min_{w_{i-},b_{i-},\eta_i}\;& \frac{1}{2}\|Bw_{i-} + e_{m_B}b_{i-}\|_2^2 + c_- e_{m_{A_i}}^T\eta_i + \frac{\mu_-}{2}(\|w_{i-}\|_2^2 + b_{i-}^2) + \frac{\lambda_-}{2}w_{i-}^T\Sigma_- w_{i-},\\ \text{s.t. }& (A_i w_{i-} + e_{m_{A_i}}b_{i-}) + \eta_i \ge e_{m_{A_i}},\quad \eta_i \ge 0, \end{aligned}$$

for i = 1, ..., k. Here, e_{m_A} ∈ ℝ^{m_A×1}, e_{m_B} ∈ ℝ^{m_B×1}, e_{m_{A_i}} ∈ ℝ^{m_{A_i}×1}, e_{m_{B_i}} ∈ ℝ^{m_{B_i}×1} are vectors of ones; η_i ∈ ℝ^{m_{A_i}×1}, ξ_i ∈ ℝ^{m_{B_i}×1} are vectors of slack variables; Σ_+ = Σ_{1+} + ··· + Σ_{k+} and Σ_− = Σ_{1−} + ··· + Σ_{l−}, where Σ_{i+} and Σ_{i−} are the covariance matrices corresponding to the clusters A_i and B_i; c_+, c_−, λ_+, λ_−, μ_+, μ_− are penalty coefficients that adjust the roles of the terms in the objective functions; w_{i+}^T Σ_+ w_{i+} is the sum of the covariance matrices at the cluster granularity of class {+} projected onto w_{i+}, and w_{i−}^T Σ_− w_{i−} is the sum of the covariance matrices at the cluster granularity of class {−} projected onto w_{i−}. The constraints of problem (WS-SVM_i^+) are defined by the points of cluster B_i, and the constraints of problem (WS-SVM_i^-) are defined by cluster A_i.
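Equations (7)–(8) translate directly into a few lines of code: once the hyperplanes are learned, classification is a comparison of two weighted averages. The following sketch assumes the hyperplanes are already available as arrays; the names (W_plus, sizes_B, ...) and the example numbers are ours.

```python
import numpy as np

def ws_svm_predict(x, W_plus, b_plus, sizes_B, W_minus, b_minus, sizes_A):
    """Implements (7)-(8): W_plus is an (l x n) array whose i-th row is w_{i+},
       b_plus the matching intercepts, sizes_B the cluster sizes m_{B_i};
       analogously for the '-' hyperplanes and the clusters A_i."""
    m_B, m_A = np.sum(sizes_B), np.sum(sizes_A)
    f_plus = np.sum(sizes_B / m_B * np.abs(W_plus @ x + b_plus))     # weighted average of |f_{i+}(x)|
    f_minus = np.sum(sizes_A / m_A * np.abs(W_minus @ x + b_minus))  # weighted average of |f_{i-}(x)|
    return +1 if f_plus < f_minus else -1                            # argmin in (7)

# Tiny usage example with made-up hyperplanes in R^2 (two per side).
W_plus, b_plus, sizes_B = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.0, -1.0]), np.array([30, 70])
W_minus, b_minus, sizes_A = np.array([[1.0, 1.0], [1.0, -1.0]]), np.array([-2.0, 0.5]), np.array([50, 50])
print(ws_svm_predict(np.array([0.2, 0.1]), W_plus, b_plus, sizes_B, W_minus, b_minus, sizes_A))
```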
The Lagrangian of (WS-SVM_i^+) is given by

$$\begin{aligned} L_i(w_{i+}, b_{i+}, \xi_i, \alpha_i, \beta_i) =\;& \frac{1}{2}\|Aw_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|w_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}w_{i+}^T\Sigma_+ w_{i+}\\ & - \alpha_i^T\big(-(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i - e_{m_{B_i}}\big) - \beta_i^T\xi_i. \end{aligned}$$

Therefore, the KKT conditions of problem (WS-SVM_i^+) are

$$A^T(Aw_{i+} + e_{m_A}b_{i+}) + \mu_+ w_{i+} + \lambda_+\Sigma_+ w_{i+} + B_i^T\alpha_i = 0, \tag{9}$$
$$e_{m_A}^T(Aw_{i+} + e_{m_A}b_{i+}) + \mu_+ b_{i+} + e_{m_{B_i}}^T\alpha_i = 0, \tag{10}$$
$$c_+ e_{m_{B_i}} - \alpha_i - \beta_i = 0, \tag{11}$$
$$\alpha_i^T\big(-(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i - e_{m_{B_i}}\big) = 0, \tag{12}$$
$$\beta_i^T\xi_i = 0. \tag{13}$$

By defining $H = [A\;\; e_{m_A}]$, $u_{i+} = \begin{bmatrix} w_{i+}\\ b_{i+}\end{bmatrix}$, $F_+ = \begin{bmatrix}\Sigma_+ & 0\\ 0 & 0\end{bmatrix}$, $G_i = [B_i\;\; e_{m_{B_i}}]$, and I being the identity matrix of order (n + 1), it follows from (9) and (10) that

$$[H^T H + \mu_+ I + \lambda_+ F_+]\,u_{i+} + G_i^T\alpha_i = 0,$$

which implies

$$u_{i+} = -[H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i, \qquad i = 1,\dots,l. \tag{14}$$

Substituting (14) into the Lagrangian and combining with the conditions (11), (12), and (13), we obtain the dual problem of (WS-SVM_i^+) as follows

$$(\text{DWS-SVM}_i^+)\quad \begin{aligned} \max_{\alpha_i}\;& e_{m_{B_i}}^T\alpha_i - \frac{1}{2}\alpha_i^T G_i [H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i,\\ \text{s.t. }& 0 \le \alpha_i \le c_+ e_{m_{B_i}}. \end{aligned} \tag{15}$$

In the same way, we also obtain the dual problem of (WS-SVM_i^-)

$$(\text{DWS-SVM}_i^-)\quad \begin{aligned} \max_{\gamma_i}\;& e_{m_{A_i}}^T\gamma_i - \frac{1}{2}\gamma_i^T H_i (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i,\\ \text{s.t. }& 0 \le \gamma_i \le c_- e_{m_{A_i}}, \end{aligned} \tag{16}$$

where $G = [B\;\; e_{m_B}]$, $F_- = \begin{bmatrix}\Sigma_- & 0\\ 0 & 0\end{bmatrix}$, $H_i = [A_i\;\; e_{m_{A_i}}]$. The augmented vectors $u_{i-} = \begin{bmatrix} w_{i-}\\ b_{i-}\end{bmatrix}$ are given by

$$u_{i-} = [G^T G + \mu_- I + \lambda_- F_-]^{-1} H_i^T\gamma_i, \qquad i = 1,\dots,k. \tag{17}$$
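To make the derivation concrete, the following sketch (ours, using the paper's notation) solves the box-constrained dual (15) for a single cluster B_i with cvxopt and recovers (w_{i+}, b_{i+}) through (14). It assumes Σ_+ has already been formed as in Section 2.1 and does not use the SMW speed-up mentioned later in Remark 4.

```python
import numpy as np
from cvxopt import matrix, solvers
solvers.options['show_progress'] = False

def solve_dws_plus(A, B_i, Sigma_plus, c_plus, mu_plus, lam_plus):
    """Solve (15) for one cluster B_i and return (w_{i+}, b_{i+}) via (14)."""
    mA, n = A.shape
    mBi = B_i.shape[0]
    H = np.hstack([A, np.ones((mA, 1))])             # H   = [A    e_{m_A}]
    G_i = np.hstack([B_i, np.ones((mBi, 1))])        # G_i = [B_i  e_{m_{B_i}}]
    F_plus = np.zeros((n + 1, n + 1))
    F_plus[:n, :n] = Sigma_plus                      # F_+ = [[Sigma_+, 0], [0, 0]]
    M = H.T @ H + mu_plus * np.eye(n + 1) + lam_plus * F_plus
    Q = G_i @ np.linalg.solve(M, G_i.T)              # G_i M^{-1} G_i^T
    # (15) as a minimization: min 0.5*a^T Q a - e^T a, 0 <= a <= c_+;
    # a tiny ridge keeps the QP solver happy when Q is singular.
    P, q = matrix(Q + 1e-10 * np.eye(mBi)), matrix(-np.ones(mBi))
    G = matrix(np.vstack([-np.eye(mBi), np.eye(mBi)]))
    h = matrix(np.hstack([np.zeros(mBi), c_plus * np.ones(mBi)]))
    alpha = np.array(solvers.qp(P, q, G, h)['x']).ravel()
    u = -np.linalg.solve(M, G_i.T @ alpha)           # equation (14)
    return u[:n], u[n]                               # (w_{i+}, b_{i+})
```

The '−' problems are handled symmetrically, with G = [B e_{m_B}] and H_i = [A_i e_{m_{A_i}}] in (16)–(17).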
Algorithm 1 (Linear WS-SVM). Given m data points in ℝ^n represented by an m × n matrix C, where class {+} includes m_A points represented by an m_A × n matrix A and class {−} includes m_B points represented by an m_B × n matrix B, we generate the linear classifier f(x) as follows:

(i) Cluster the dataset using Ward's linkage [13]. Suppose that there are k clusters in class {+}, each cluster consisting of m_{A_i} points and represented by a matrix A_i ∈ ℝ^{m_{A_i}×n}, and l clusters in class {−}, each cluster consisting of m_{B_i} points and represented by a matrix B_i ∈ ℝ^{m_{B_i}×n}.

(ii) Solve the problems (15), (16) to obtain α_1, ..., α_l; γ_1, ..., γ_k.

(iii) Determine (w_{1+}, b_{1+}), ..., (w_{l+}, b_{l+}) and (w_{1−}, b_{1−}), ..., (w_{k−}, b_{k−}) via (14), (17).

(iv) Classify a new data point x by using (7) and (8).

Remark 1. In linear WS-SVM, we have to solve (l + k) problems of the form (15) or (16). These are QPPs whose numbers of decision variables are m_{B_i} or m_{A_i}. If the number of data samples in each cluster is approximately equal to m/(k + l), then the complexity of the algorithm is

$$O\Big((k+l)\cdot\frac{m^3}{(k+l)^3}\Big) = O\Big(\frac{m^3}{(k+l)^2}\Big),$$

while the complexities of SVM and TWSVM are O(m³) and O(m³/4), respectively. From this formula, it is easy to see that the more clusters there are in the dataset, the shorter the runtime of WS-SVM. This is also clearly demonstrated in the experiments.

Remark 2. Many different clustering methods can be used to implement Step (i) of Algorithm 1. Here we choose Ward's clustering algorithm (with k and l chosen independently via the elbow method) for convenience of comparison with S-TWSVM, because the authors of [13] also used this technique in their experiments. When processing actual data, one can choose another, more efficient clustering algorithm.

3.2. The nonlinear case

Let Φ : ℝ^n → H be a nonlinear mapping, where H is a Hilbert space whose dimension is not less than n (and may be infinite). Since S = span(Φ(C^T)) is a subspace of H whose dimension does not exceed m, we can consider S as a Euclidean space and Φ : ℝ^n → S. Suppose that after the clustering step on the space S we obtain k clusters Φ(A_1), ..., Φ(A_k) in the class Φ(A), each cluster Φ(A_i) consisting of m_{A_i} data points, and l clusters Φ(B_1), ..., Φ(B_l) in the class Φ(B), each cluster Φ(B_i) consisting of m_{B_i} data points. In the space S, a hyperplane Φ(x^T)h + b = 0 (with h ∈ S being the normal vector) can be rewritten as Φ(x^T)Φ(C^T)u + b = 0 for some vector u ∈ ℝ^m. Therefore, by defining Φ(x^T)Φ(C^T) = K(x^T, C^T), the hyperplane has the form K(x^T, C^T)u + b = 0, where K is a predefined kernel [14].
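For reference, the kernel blocks K(·, C^T) appearing below can be computed as in this short sketch for the Gaussian (RBF) kernel, the kernel used in the experiments; the bandwidth gamma and the toy matrices are assumptions of ours.

```python
import numpy as np

def rbf_kernel_matrix(X, C, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X_i - C_j||^2), i.e. the block K(X, C^T)
       evaluated row-wise; X is (p x n), C is (m x n), the result is (p x m)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2 * X @ C.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negatives from round-off

# Example: kernel blocks of the kind needed in (18), built from the full data matrix C.
rng = np.random.default_rng(3)
C_data = rng.normal(size=(12, 2))          # all training points
A, B_1 = C_data[:7], C_data[7:10]          # a class {+} block and one cluster of class {-}
K_A = rbf_kernel_matrix(A, C_data)         # K(A, C^T),   shape (7, 12)
K_B1 = rbf_kernel_matrix(B_1, C_data)      # K(B_1, C^T), shape (3, 12)
print(K_A.shape, K_B1.shape)
```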
WS-SVM determines l hyperplanes such that the i-th one, K(x^T, C^T)u_{i+} + b_{i+} = 0, is closer to class Φ(A) and far away from cluster Φ(B_i). It also determines k hyperplanes such that the i-th one, K(x^T, C^T)u_{i−} + b_{i−} = 0, is closer to class Φ(B) and far away from cluster Φ(A_i). Specifically, we have the following (l + k) QPPs:

$$\begin{aligned} \min_{u_{i+},b_{i+},\xi_i}\;& \frac{1}{2}\|K(A, C^T)u_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|u_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}u_{i+}^T\Phi(C)\Sigma_+^{\Phi}\Phi(C)^T u_{i+},\\ \text{s.t. }& -(K(B_i, C^T)u_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i \ge e_{m_{B_i}},\quad \xi_i \ge 0, \end{aligned} \tag{18}$$

for i = 1, ..., l, and

$$\begin{aligned} \min_{u_{i-},b_{i-},\eta_i}\;& \frac{1}{2}\|K(B, C^T)u_{i-} + e_{m_B}b_{i-}\|_2^2 + c_- e_{m_{A_i}}^T\eta_i + \frac{\mu_-}{2}(\|u_{i-}\|_2^2 + b_{i-}^2) + \frac{\lambda_-}{2}u_{i-}^T\Phi(C)\Sigma_-^{\Phi}\Phi(C)^T u_{i-},\\ \text{s.t. }& (K(A_i, C^T)u_{i-} + e_{m_{A_i}}b_{i-}) + \eta_i \ge e_{m_{A_i}},\quad \eta_i \ge 0, \end{aligned} \tag{19}$$

for i = 1, ..., k. Here, Σ_+^Φ = Σ_{1+}^Φ + ··· + Σ_{k+}^Φ and Σ_−^Φ = Σ_{1−}^Φ + ··· + Σ_{l−}^Φ, where Σ_{i+}^Φ and Σ_{i−}^Φ are the covariance matrices corresponding to the clusters Φ(A_i) and Φ(B_i), respectively. The classification function in the nonlinear case is selected as

$$f(x) = \underset{+,\,-}{\operatorname{argmin}}\,\big(f_+(x), f_-(x)\big), \tag{20}$$

with

$$f_+(x) = \sum_{i=1}^{l}\frac{m_{B_i}}{m_B}\,|K(x^T, C^T)u_{i+} + b_{i+}|, \qquad f_-(x) = \sum_{i=1}^{k}\frac{m_{A_i}}{m_A}\,|K(x^T, C^T)u_{i-} + b_{i-}|. \tag{21}$$

The dual problem of (18) is

$$\begin{aligned} \max_{\alpha_i}\;& e_{m_{B_i}}^T\alpha_i - \frac{1}{2}\alpha_i^T G_i [H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i,\\ \text{s.t. }& 0 \le \alpha_i \le c_+ e_{m_{B_i}}, \end{aligned} \tag{22}$$

where $H = [K(A, C^T)\;\; e_{m_A}]$, $F_+ = \begin{bmatrix}\Phi(C)\Sigma_+^{\Phi}\Phi(C)^T & 0\\ 0 & 0\end{bmatrix}$, $G_i = [K(B_i, C^T)\;\; e_{m_{B_i}}]$, I is the identity matrix of order (m + 1), and the augmented vectors are

$$u_{i+} = \begin{bmatrix} u_{i+}\\ b_{i+}\end{bmatrix} = -[H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i. \tag{23}$$

The dual problem of (19) is

$$\begin{aligned} \max_{\gamma_i}\;& e_{m_{A_i}}^T\gamma_i - \frac{1}{2}\gamma_i^T H_i (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i,\\ \text{s.t. }& 0 \le \gamma_i \le c_- e_{m_{A_i}}, \end{aligned} \tag{24}$$

where $G = [K(B, C^T)\;\; e_{m_B}]$, $F_- = \begin{bmatrix}\Phi(C)\Sigma_-^{\Phi}\Phi(C)^T & 0\\ 0 & 0\end{bmatrix}$, $H_i = [K(A_i, C^T)\;\; e_{m_{A_i}}]$, and

$$u_{i-} = \begin{bmatrix} u_{i-}\\ b_{i-}\end{bmatrix} = (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i. \tag{25}$$
Remark 3. The matrix F_+ can be calculated as follows (a code sketch of this computation is given at the end of this section). For each A_i ∈ ℝ^{m_{A_i}×n}, denote by M_{A_i} ∈ ℝ^{m_{A_i}×n} the average matrix of A_i (i.e., all rows of M_{A_i} are the same and equal to the average vector of A_i). Then

$$\Sigma_{i+}^{\Phi} = \frac{1}{m_{A_i}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)^T\big(\Phi(A_i) - \Phi(M_{A_i})\big),$$

and therefore

$$\begin{aligned} \Phi(C)\Sigma_{i+}^{\Phi}\Phi(C)^T &= \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)\Phi(C)^T\Big]^T \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)\Phi(C)^T\Big]\\ &= \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(K(A_i, C^T) - K(M_{A_i}, C^T)\big)\Big]^T \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(K(A_i, C^T) - K(M_{A_i}, C^T)\big)\Big]. \end{aligned}$$

Similarly, the matrix F_− can be calculated from

$$\Phi(C)\Sigma_{i-}^{\Phi}\Phi(C)^T = \Big[\tfrac{1}{\sqrt{m_{B_i}}}\big(K(B_i, C^T) - K(M_{B_i}, C^T)\big)\Big]^T \Big[\tfrac{1}{\sqrt{m_{B_i}}}\big(K(B_i, C^T) - K(M_{B_i}, C^T)\big)\Big],$$

where M_{B_i} ∈ ℝ^{m_{B_i}×n} is the average matrix of B_i.

Remark 4. In (22) and (24) we need to compute the inverses of square matrices of order (m + 1). This becomes difficult when m is large, so it is necessary to reduce the size of those matrices. This problem can be solved by using the Sherman-Morrison-Woodbury (SMW) formula [6], as in [7].

Algorithm 2 (Nonlinear WS-SVM). Given m data points in ℝ^n represented by the m × n matrix C, where class {+} includes m_A points represented by the m_A × n matrix A and class {−} includes m_B points represented by the m_B × n matrix B, we generate the nonlinear classifier f(x) as follows:

(i) Choose a kernel function K(x^T, C^T), typically the Gaussian kernel [14].

(ii) Cluster the dataset using Ward's linkage [13]. Suppose that there are k clusters in class {+}, each cluster consisting of m_{A_i} points and represented by the matrix Φ(A_i), and l clusters in class {−}, each cluster consisting of m_{B_i} points and represented by the matrix Φ(B_i).

(iii) Solve the problems (22) and (24) to obtain α_1, ..., α_l; γ_1, ..., γ_k.

(iv) Determine (u_{1+}, b_{1+}), ..., (u_{l+}, b_{l+}) and (u_{1−}, b_{1−}), ..., (u_{k−}, b_{k−}) via (23), (25).

(v) Classify a new data point x by using (20) and (21).

Remark 5. As usual, the linear algorithm is applied when the training data is almost linearly separable. If the classes overlap too severely, the linear algorithm will not work effectively and the accuracy will not be high. In that situation, the nonlinear algorithm should be used for better accuracy.
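Before moving on to the experiments, here is a small sketch (ours) of the computation described in Remark 3: the kernel-space structural matrix is assembled from ordinary kernel evaluations. A plain linear kernel is used in the demo purely to keep the snippet self-contained; any kernel function with the same (X, C) → K(X, C^T) signature, such as the RBF sketch above, can be substituted.

```python
import numpy as np

def kernel_structural_block(clusters, C_data, kernel):
    """Compute Phi(C) Sigma_+^Phi Phi(C)^T via Remark 3:
       sum over clusters of (1/m_i) * (K(A_i,C^T) - K(M_i,C^T))^T (K(A_i,C^T) - K(M_i,C^T)),
       then embed it into F_+ = [[S, 0], [0, 0]] of order (m + 1) as used in (22)."""
    m = C_data.shape[0]
    S = np.zeros((m, m))
    for A_i in clusters:
        m_i = A_i.shape[0]
        M_i = np.tile(A_i.mean(axis=0), (m_i, 1))        # average matrix M_{A_i}
        Dk = kernel(A_i, C_data) - kernel(M_i, C_data)   # K(A_i,C^T) - K(M_{A_i},C^T)
        S += Dk.T @ Dk / m_i
    F = np.zeros((m + 1, m + 1))
    F[:m, :m] = S
    return F

# Runnable demo with a linear kernel; swap in rbf_kernel_matrix for the Gaussian case.
rng = np.random.default_rng(4)
C_data = rng.normal(size=(10, 2))
clusters = [C_data[:4], C_data[4:7]]                     # stand-in clusters of class {+}
F_plus = kernel_structural_block(clusters, C_data, kernel=lambda X, C: X @ C.T)
print(F_plus.shape)                                      # (11, 11)
```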
4. EXPERIMENTS

In this section, we compare WS-SVM against S-TWSVM [13] and TWSVM [7] on various datasets. All algorithms (our model, S-TWSVM as in [13], and TWSVM as in [7]) are implemented in version 3.8.3 of the Python programming language and run on a laptop with an AMD Ryzen 5 CPU and 8 GB of RAM. We use the following libraries: scipy.cluster.hierarchy to cluster the data, cvxopt to solve the QPPs, matplotlib to display figures, pandas and numpy to process the data, and sklearn to evaluate the models and adjust their hyperparameters. For simplicity, we let c_+ = c_−, μ_+ = μ_−, λ_+ = λ_− in WS-SVM and c_1 = c_4, c_2 = c_5, c_3 = c_6 in S-TWSVM [13]. All hyperparameters belong to the set {0.0001, 0.001, 0.1, 1, 10, 100, 1000} and are obtained by using the grid-search technique. All settings are uploaded to [4].

4.1. Toy data

We first experiment with 2-D toy data. Note that the training time of WS-SVM and S-TWSVM includes the clustering time. The 2-D toy data consists of 800 points belonging to class {+} and 800 points belonging to class {−}, randomly generated according to two Gaussian distributions in each class. The dataset is split 90/10, which means 90% of the data is used for training and the rest for testing. We use standard 10-fold cross validation (CV) to evaluate the testing accuracy of all models. Implementing SVM, TWSVM, and S-TWSVM, we obtained the results shown in Figure 4; for WS-SVM, the results are shown in Figure 5.

Figure 4. SVM (a): run-time 10.677 (s); CV accuracy 98.516 +/- 0.649. TWSVM (b): run-time 0.510 (s); CV accuracy 98.438 +/- 0.988. S-TWSVM (c): run-time 0.562 (s); CV accuracy 98.594 +/- 0.911.

Figure 5. WS-SVM: run-time 0.142 (s); CV accuracy 98.359 +/- 0.547.

4.2. UCI datasets

Next, we run these algorithms on the UCI datasets [2], which have also been used in the experiments of [13] and [7]. We randomly select 90% of each extracted dataset for training and 10% for testing. We also use 10-fold cross validation to evaluate the accuracy of all algorithms. All hyperparameters belong to the set {0.0001, 0.001, 0.1, 1, 10, 100, 1000} and are obtained by using the grid-search technique. The results are shown in Table 3 and Table 4 (by applying Algorithm 1) and in Table 5 (by applying Algorithm 2). Table 3 shows that the training time of WS-SVM is better than that of S-TWSVM and TWSVM, while Table 4 shows that the classification accuracy does not differ much between the methods. The training time of WS-SVM is even better when the data size is large and the data is organized in many clusters, as shown in Table 2.
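The tuning protocol described above (the seven candidate values, grid search, 10-fold CV, 90/10 split) can be sketched as follows. Since the WS-SVM implementation itself lives in [4], sklearn's SVC is used here purely as a stand-in estimator to illustrate the protocol; the data generated by make_blobs only mimics the toy setting of Section 4.1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 2-D toy data in the spirit of Section 4.1 (two Gaussian clusters per class).
X, y = make_blobs(n_samples=1600, centers=4, cluster_std=1.0, random_state=0)
y = np.where(y < 2, 1, -1)                        # merge the blobs into two classes

# 90/10 split, then 10-fold CV grid search over the paper's candidate values.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.9, random_state=0)
grid = {'C': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
        'gamma': [0.0001, 0.001, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=10, scoring='accuracy')
search.fit(X_tr, y_tr)
print('best params:', search.best_params_)
print('CV accuracy: %.3f +/- %.3f' % (search.best_score_,
      search.cv_results_['std_test_score'][search.best_index_]))
print('test accuracy:', search.score(X_te, y_te))
```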
Table 2. Training time (s) as the size of the dataset increases. The computational complexity of WS-SVM is clearly less than that of S-TWSVM and TWSVM.

Algorithms   4000 data points   6000 data points   8000 data points   10000 data points
TWSVM        7.267              21.872             58.101             104.355
S-TWSVM      5.614              16.244             44.397             75.549
WS-SVM       1.33               6.251              8.675              16.863

Table 3. Test training time (s) with a linear kernel (Algorithm 1)

Dataset                    Number of clusters   WS-SVM   S-TWSVM   TWSVM
Hepatitis (155 x 19)       (5 × 3)              0.001    0.005     0.004
Australian (690 x 14)      (4 × 5)              0.033    0.089     0.043
BUPA-liver (345 x 6)       (4 × 2)              0.015    0.020     0.014
CMC (844 x 9)              (3 × 5)              0.047    0.099     0.083
Credit (690 x 19)          (2 × 3)              0.037    0.055     0.050
Diabetes (768 x 8)         (5 × 2)              0.099    0.166     0.117
Flare-solar (1066 x 10)    (2 × 2)              0.117    0.330     0.278
German (1000 x 20)         (3 × 3)              0.065    0.180     0.166
Heart-statlog (270 x 13)   (3 × 4)              0.011    0.011     0.011
Image (2310 x 18)          (3 × 2)              0.356    1.279     1.444
Ionosphere (350 x 34)      (3 × 2)              0.012    0.017     0.014
Spect (265 x 22)           (3 × 3)              0.012    0.016     0.012
Sonar (208 x 60)           (6 × 3)              0.008    0.009     0.008
Heart-c (303 x 13)         (8 × 2)              0.013    0.019     0.010

Table 4. Test set accuracy (%) with a linear kernel (Algorithm 1)

Dataset                    WS-SVM              S-TWSVM             TWSVM
Hepatitis (155 x 19)       83.516 +/- 10.078   87.088 +/- 8.888    84.176 +/- 8.910
Australian (690 x 14)      86.452 +/- 4.903    86.613 +/- 4.507    85.968 +/- 4.565
BUPA-liver (345 x 6)       68.065 +/- 7.133    67.097 +/- 11.613   65.806 +/- 5.983
CMC (844 x 9)              65.223 +/- 3.637    65.221 +/- 5.010    64.821 +/- 5.204
Credit (690 x 19)          86.157 +/- 3.127    86.964 +/- 4.444    85.192 +/- 4.223
Diabetes (768 x 8)         75.118 +/- 4.993    77.000 +/- 5.759    77.205 +/- 4.605
Flare-solar (1066 x 10)    81.760 +/- 4.127    81.969 +/- 4.200    81.761 +/- 4.306
German (1000 x 24)         69.375 +/- 5.356    71.667 +/- 5.449    71.042 +/- 7.184
Heart-statlog (270 x 13)   84.850 +/- 7.223    86.483 +/- 5.367    85.650 +/- 5.776
Image (2310 x 18)          86.196 +/- 1.619    84.415 +/- 1.244    85.089 +/- 1.745
Ionosphere (350 x 34)      92.056 +/- 6.574    91.744 +/- 4.982    90.504 +/- 6.104
Spect (265 x 22)           81.449 +/- 6.503    84.801 +/- 4.992    83.134 +/- 6.120
Sonar (208 x 60)           80.263 +/- 11.966   78.099 +/- 8.244    78.129 +/- 10.286
Heart-c (303 x 13)         84.577 +/- 5.116    83.836 +/- 4.668    83.823 +/- 4.435
Table 5. Test set accuracy (%) with an RBF kernel (Algorithm 2)

Dataset (m x n)          WS-SVM             S-TWSVM             TWSVM
Hepatitis (155 x 19)     84.835 +/- 9.939   83.407 +/- 12.052   83.462 +/- 11.958
BUPA-liver (345 x 6)     70.000 +/- 6.297   69.032 +/- 3.290    72.258 +/- 3.592
Votes (303 x 13)         95.154 +/- 2.626   95.667 +/- 2.519    95.917 +/- 2.321
WPBC (198 x 35)          81.669 +/- 8.879   79.412 +/- 9.727    80.588 +/- 10.156

5. CONCLUSIONS

This paper has proposed a new Weighted Structural Support Vector Machine (known as WS-SVM) for classification problems with a class-vs-clusters strategy. The algorithm is performed in two steps: the first step is to extract the structural information of the data using Ward's linkage clustering method; the second step is to incorporate the structural information at the cluster granularity into the learning model. The classifier is based on the weighted average distances from a data point to the hyperplanes representing the classes. Both theory and experiment show that the training time of WS-SVM is better than those of S-TWSVM and TWSVM in most cases. Besides, when the data is large and there are many clusters with very different distribution trends, the WS-SVM algorithm effectively simulates the distribution trends of the clusters and thus improves the classification accuracy. When k = l = 1, WS-SVM is exactly S-TWSVM, and if in addition λ_+ = λ_− = 0, it reduces to TWSVM.

The WS-SVM algorithm generally only achieves high accuracy when the clustering within each class is clear. In the case of ambiguous clustering, the algorithm needs to be improved, for example by flexibly combining it with a cluster-vs-class strategy. The WS-SVM algorithm may not be really suitable for multi-class problems. However, it seems to be useful for classification problems with unbalanced data, and this is an interesting research direction for the future.

ACKNOWLEDGMENT

The authors would like to thank the referees for their helpful comments and valuable suggestions.

REFERENCES

[1] M. Adancon and M. Cheriet, "Model selection for the LS-SVM. Application to handwriting recognition," Pattern Recognition, vol. 42, pp. 3264–3270, 2009.

[2] UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems at the University of California, Irvine. [Online]. Available: machine-learning-databases/

[3] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua, and A. Lopez, "A comprehensive survey on support vector machine classification: Applications, challenges and trends," Neurocomputing, vol. 408, pp. 189–215, 2020.

[4] N. Cuong, Python code. [Online]. Available:
[5] G. Fung and O. L. Mangasarian, "Proximal support vector machine," in KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California. New York, NY, USA: Association for Computing Machinery, 2001, pp. 77–86. [Online]. Available:

[6] G. Golub and C. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996.

[7] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 905–910, 2007.

[8] D. Jeung, D. Wang, W. Wing, E. Tsang, and X. Wang, "Structured large margin machines: sensitive to data distributions," Machine Learning, vol. 68, pp. 171–200, 2007.

[9] O. Mangasarian and E. Wild, "Multisurface proximal support vector classification via generalized eigenvalues," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 69–74, 2006.

[10] B. Mei and Y. Xu, "Multi-task least squares twin support vector machine for classification," Neurocomputing, vol. 338, pp. 26–33, 2019.

[11] W. Noble, Support Vector Machine Applications in Computational Biology. MIT Press, 2004.

[12] X. Pan, Y. Luo, and Y. Xu, "K-nearest neighbor based structural twin support vector machine," Knowledge-Based Systems, vol. 88, pp. 34–44, 2015.

[13] Z. Qi, Y. Tian, and Y. Shi, "Structural twin support vector machine for classification," Knowledge-Based Systems, vol. 43, pp. 74–81, 2013.

[14] B. Schoelkopf and A. Smola, Learning with Kernels. MIT Press, 2002.

[15] Y. Tian, Y. Shi, and X. Liu, "Recent advances on support vector machines research," Technological and Economic Development of Economy, vol. 18, pp. 5–33, 2012.

[16] D. Tomar and S. Agarwal, "Twin support vector machine: A review from 2007 to 2014," Egyptian Informatics Journal, vol. 16, pp. 55–69, 2015.

[17] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, 1995.

[18] X. Xie and S. Sun, "Multitask centroid twin support vector machines," Neurocomputing, vol. 149, pp. 1085–1091, 2015.

[19] H. Xue, S. Chen, and Q. Yang, "Structural regularized support vector machine: A framework for structural large margin classifier," IEEE Transactions on Neural Networks, vol. 22, pp. 573–587, 2011.

Received 20 August 2020
Accepted 19 February 2021