Weighted structural support vector machine


Journal of Computer Science and Cybernetics, V.37, N.1 (2021), 43–56
DOI 10.15625/1813-9663/37/1/15396

WEIGHTED STRUCTURAL SUPPORT VECTOR MACHINE

NGUYEN THE CUONG (1), HUYNH THE PHUNG (2,*)

(1) Faculty of Basic, Telecommunications University, Khanh Hoa, Vietnam
(2) Department of Mathematics, College of Sciences, Hue University, Vietnam

*Corresponding author. E-mail addresses: nckcbnckcb@gmail.com (N.T.Cuong); huynhthephung@gmail.com (H.T.Phung).
© 2021 Vietnam Academy of Science & Technology

Abstract. In binary classification problems, the two classes of data differ from each other, and the situation becomes more complicated when the clusters within each class also follow different trends. Traditional algorithms such as the Support Vector Machine (SVM) or the Twin Support Vector Machine (TWSVM) cannot sufficiently exploit the structural information of the data at the cluster granularity, which limits their capability to simulate data trends. The Structural Twin Support Vector Machine (S-TWSVM) exploits structural information at the cluster granularity when learning a representative hyperplane, so its data simulation capability is better than that of TWSVM. However, for datasets in which each class consists of clusters with different trends, the data simulation capability of S-TWSVM is still restricted. Besides, the training time of S-TWSVM is not improved compared to TWSVM. This paper proposes a new Weighted Structural Support Vector Machine (called WS-SVM) for binary classification problems with a class-vs-clusters strategy. Experimental results show that WS-SVM can describe the distribution trends of the cluster information. Furthermore, both theory and experiment show that the training time of WS-SVM is significantly improved compared to S-TWSVM.

Keywords. Support vector machine; Twin support vector machine; Structural twin support vector machine; Weighted structural support vector machine.

1. INTRODUCTION

In the early years of the 21st century, the Support Vector Machine (SVM) [5, 17] became a popular binary classification algorithm applied to many different fields in practice [1, 3, 11, 15, 16]. SVM seeks a hyperplane separating two classes so that the margin between them is largest. Actual data is often distributed with different structures and tendencies, but SVM does not fully exploit the structural information of the data, so its ability to simulate data is limited. Nowadays, datasets are growing rapidly in size and diversifying in structure. This fact requires classification algorithms to guarantee accuracy while improving speed and the ability to simulate the data distribution. Many variants of SVM have recently been proposed to improve the speed and other aspects of the standard SVM [3, 10, 12, 16, 18]. Two typical innovations are the Twin Support Vector Machine (TWSVM) [7] and the Structural Twin Support Vector Machine (S-TWSVM) [13]. The main idea of TWSVM is to seek two hyperplanes such that each hyperplane is closer to one class and far away from the other, by solving two Quadratic Programming Problems (QPPs) whose sizes are smaller than that of the QPP in SVM. Despite having to solve two QPPs, TWSVM is approximately four times faster than the standard SVM.
S-TWSVM has the same strategy as TWSVM; in addition, it fully exploits structural information at the cluster granularity when learning the model, in order to build a more reasonable classifier.

There is a reason for handling SVM with structure. In some practical binary classification problems, each class consists of more than one cluster. For example, consider the problem of classifying fruits with data covering five categories: Mango, Jackfruit, Pineapple, Apples, and Grapes, where the fruits are only classified according to the criteria "smooth skin" or "rough skin". Obviously, data in the "smooth skin" class will form three clusters, corresponding to Mango, Apples, and Grapes, while data in the "rough skin" class will distribute into two clusters, corresponding to Jackfruit and Pineapple. S-TWSVM proved to be quite effective in the simple case, where each class consists of clusters with a similar distribution trend. However, for more complex data types, where each class contains clusters of different trends, S-TWSVM is not efficient at simulating the data trends.

Based on the strategy of TWSVM and S-TWSVM, we propose a new binary classification model: the Weighted Structural Support Vector Machine (called WS-SVM), with a class-vs-clusters strategy. Instead of solving two QPPs as in S-TWSVM, WS-SVM solves (l + k) QPPs, where k and l are the numbers of clusters in class {+} and class {−}, respectively. This method allows WS-SVM to effectively simulate the distribution trends of complex data types while also improving the computation speed.

The paper is organized as follows. Section 2 briefly introduces the background of SVM, TWSVM and S-TWSVM; Section 3 is devoted to a detailed description of WS-SVM along with the algorithms and discussions; all experimental results are presented in Section 4, together with a comparative evaluation; the conclusion is given in Section 5. All algorithms are implemented in version 3.8.3 of the Python programming language.

2. BACKGROUND

2.1. Structural granularity

Consider a binary classification problem with a dataset, denoted by the matrix C, consisting of m points (each point is a row of C) x_j ∈ ℝ^n, j = 1, ..., m. We also write x_j ∈ C to indicate that x_j is a row of C. Suppose that y_j ∈ {−1, 1} is the label of the j-th data point. Class {+} consists of m_A points collected in a matrix A ∈ ℝ^{m_A×n}, and class {−} consists of m_B points collected in a matrix B ∈ ℝ^{m_B×n}. There are k clusters in class {+}, whose i-th cluster consists of m_{A_i} points and is denoted by the matrix A_i ∈ ℝ^{m_{A_i}×n}. Also, there are l clusters in class {−}, whose i-th cluster consists of m_{B_i} points and is denoted by the matrix B_i ∈ ℝ^{m_{B_i}×n}. A, B, A_i, B_i are called structural granularities [19]. We are interested in the following quantities of structural granularity.

Class granularity:
$$\Sigma_A = \frac{1}{m_A}\sum_{x_j\in A}(x_j-\mu_A)(x_j-\mu_A)^T, \qquad \Sigma_B = \frac{1}{m_B}\sum_{x_j\in B}(x_j-\mu_B)(x_j-\mu_B)^T.$$

Cluster granularity:
$$\Sigma_{i+} = \frac{1}{m_{A_i}}\sum_{x_j\in A_i}(x_j-\mu_{A_i})(x_j-\mu_{A_i})^T, \qquad \Sigma_{i-} = \frac{1}{m_{B_i}}\sum_{x_j\in B_i}(x_j-\mu_{B_i})(x_j-\mu_{B_i})^T,$$

where μ_X denotes the average vector of the dataset X.
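To make these granularity quantities concrete, the following is a minimal NumPy sketch (ours, not part of the paper) that computes a class covariance Σ_A and the cluster covariances Σ_{i+} from a class matrix and a given cluster assignment; the function names and the synthetic data are illustrative only.

```python
import numpy as np

def class_covariance(X):
    """Sigma_X = (1/m) * sum_j (x_j - mu)(x_j - mu)^T over the rows of X."""
    mu = X.mean(axis=0)
    Xc = X - mu                      # center the data points
    return Xc.T @ Xc / X.shape[0]    # (n x n) covariance with 1/m normalization

def cluster_covariances(X, labels):
    """One covariance matrix (Sigma_{i+}) per cluster label found in `labels`."""
    return {c: class_covariance(X[labels == c]) for c in np.unique(labels)}

# Illustrative example: a class {+} made of two clusters with different trends.
rng = np.random.default_rng(0)
A1 = rng.normal([0, 0], [2.0, 0.3], size=(40, 2))    # cluster elongated horizontally
A2 = rng.normal([5, 5], [0.3, 2.0], size=(40, 2))    # cluster elongated vertically
A = np.vstack([A1, A2])
labels = np.array([1] * 40 + [2] * 40)

Sigma_A = class_covariance(A)                 # class granularity
Sigma_plus = cluster_covariances(A, labels)   # cluster granularity {1: Sigma_1+, 2: Sigma_2+}
print(Sigma_A.shape, Sigma_plus[1].shape)     # (2, 2) (2, 2)
```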
2.2. Support vector machine

The standard SVM [17] seeks a hyperplane w^T x + b = 0 separating class {+} and class {−} such that the margin 2/‖w‖₂ between the two classes is largest. However, the standard SVM is only applicable when the data is linearly separable. When the data is not linearly separable, the soft-margin SVM [17] is recommended, with looser constraints:

$$\begin{aligned} \min_{w,b,\xi}\;& c e^T\xi + \frac{1}{2}\|w\|_2^2,\\ \text{s.t. }& D(Cw + eb) + \xi \ge e,\quad \xi \ge 0, \end{aligned} \tag{1}$$

where D ∈ ℝ^{m×m} is the diagonal matrix with D_{jj} = y_j for all j, ξ ∈ ℝ^m is the vector of slack variables, c ∈ ℝ is the penalty coefficient, appropriately selected to adjust the roles of the terms in the objective function, and e ∈ ℝ^m is the vector of ones. A new data point x is classified into class {+} if sgn(f(x)) > 0, where f(x) = w^T x + b, and into class {−} if sgn(f(x)) < 0. SVM does not effectively exploit the structural information of the data, so it simulates the distribution structures of the two classes as if they were the same (see Figure 1.a).

2.3. Twin support vector machine

Based on the strategy of multisurface proximal support vector classification via generalized eigenvalues (GEPSVM) [9], the main idea of TWSVM [7] for the binary classification problem is to seek two hyperplanes:

• f_+(x) = w_+^T x + b_+ = 0, which is closer to class {+} and far away from class {−},
• f_−(x) = w_−^T x + b_− = 0, which is closer to class {−} and far away from class {+},

by solving the two QPPs

$$(\text{TWSVM1})\quad \begin{aligned} \min_{w_+,b_+,\xi}\;& \frac{1}{2}\|Aw_+ + e_+b_+\|_2^2 + c_1 e_-^T\xi,\\ \text{s.t. }& -(Bw_+ + e_-b_+) + \xi \ge e_-,\quad \xi \ge 0, \end{aligned} \tag{2}$$

$$(\text{TWSVM2})\quad \begin{aligned} \min_{w_-,b_-,\eta}\;& \frac{1}{2}\|Bw_- + e_-b_-\|_2^2 + c_2 e_+^T\eta,\\ \text{s.t. }& (Aw_- + e_+b_-) + \eta \ge e_+,\quad \eta \ge 0, \end{aligned} \tag{3}$$

where c_1, c_2 are penalty coefficients that adjust the roles of the terms in the objective functions, e_+ ∈ ℝ^{m_A}, e_− ∈ ℝ^{m_B} are vectors of ones, and ξ ∈ ℝ^{m_B}, η ∈ ℝ^{m_A} are vectors of slack variables. A new data point x is classified into class {+} or class {−} depending on whether it is closer to the hyperplane f_+(x) = 0 or to the hyperplane f_−(x) = 0. TWSVM simulates the structural distributions of the two classes independently, but it does not actually simulate the distribution trend of the data within each class (see Figure 1.b). It has been shown in [7] that the training time of TWSVM is approximately four times faster than that of the standard SVM (see Table 1).
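As an illustration of how a QPP such as (1) can be handed to an off-the-shelf solver, here is a minimal sketch using cvxopt (the QP solver used in the experiments of Section 4). The variable ordering z = (w, b, ξ), the small ridge that keeps the quadratic term strictly positive definite, and the toy data are our own choices, not something prescribed by the paper.

```python
import numpy as np
from cvxopt import matrix, solvers
solvers.options['show_progress'] = False

def soft_svm(C_data, y, c=1.0, ridge=1e-8):
    """Solve problem (1): min_{w,b,xi} c*e^T xi + 0.5*||w||^2
       s.t. D(Cw + e b) + xi >= e, xi >= 0, as one QP in z = (w, b, xi)."""
    m, n = C_data.shape
    D = np.diag(y.astype(float))
    # Quadratic term: only w is penalized; the ridge keeps P strictly positive definite.
    P = np.zeros((n + 1 + m, n + 1 + m))
    P[:n, :n] = np.eye(n)
    P += ridge * np.eye(n + 1 + m)
    q = np.hstack([np.zeros(n + 1), c * np.ones(m)])
    # Inequalities rewritten as G z <= h.
    G_top = np.hstack([-D @ C_data, -(D @ np.ones((m, 1))), -np.eye(m)])  # -(D(Cw+eb)+xi) <= -e
    G_bot = np.hstack([np.zeros((m, n + 1)), -np.eye(m)])                 # -xi <= 0
    G = np.vstack([G_top, G_bot])
    h = np.hstack([-np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]                                                    # (w, b)

# Toy usage with two linearly separable blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
w, b = soft_svm(X, y, c=10.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```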
2.4. Structural twin support vector machine

S-TWSVM [13] has two steps. The first step is to extract the structural information within the classes by Ward's linkage clustering method [8, 19]; the second step is the model learning. Suppose that there are k clusters A_1, ..., A_k in class A, each cluster A_i consisting of m_{A_i} data points, and l clusters B_1, ..., B_l in class B, each cluster B_i consisting of m_{B_i} data points. S-TWSVM determines two hyperplanes

$$f_+(x) = w_+^T x + b_+ = 0, \qquad f_-(x) = w_-^T x + b_- = 0, \tag{4}$$

by solving the two QPPs

$$\begin{aligned} \min_{w_+,b_+,\xi}\;& \frac{1}{2}\|Aw_+ + e_+b_+\|_2^2 + c_1 e_-^T\xi + \frac{1}{2}c_2(\|w_+\|_2^2 + b_+^2) + \frac{1}{2}c_3 w_+^T\Sigma_+ w_+,\\ \text{s.t. }& -(Bw_+ + e_-b_+) + \xi \ge e_-,\quad \xi \ge 0, \end{aligned} \tag{5}$$

$$\begin{aligned} \min_{w_-,b_-,\eta}\;& \frac{1}{2}\|Bw_- + e_-b_-\|_2^2 + c_4 e_+^T\eta + \frac{1}{2}c_5(\|w_-\|_2^2 + b_-^2) + \frac{1}{2}c_6 w_-^T\Sigma_- w_-,\\ \text{s.t. }& (Aw_- + e_+b_-) + \eta \ge e_+,\quad \eta \ge 0, \end{aligned} \tag{6}$$

where c_1, ..., c_6 ≥ 0 are penalty coefficients that adjust the roles of the terms in the objective functions, ξ ∈ ℝ^{m_B}, η ∈ ℝ^{m_A} are vectors of slack variables, Σ_+ = Σ_{1+} + Σ_{2+} + ··· + Σ_{k+}, Σ_− = Σ_{1−} + Σ_{2−} + ··· + Σ_{l−}, with Σ_{i+} and Σ_{i−} being the covariance matrices corresponding to the clusters A_i and B_i, respectively, and e_+ ∈ ℝ^{m_A}, e_− ∈ ℝ^{m_B} are vectors of ones. A new data point is assigned to class {+} or {−} in the same manner as in TWSVM.

In problem (5), (1/2)‖Aw_+ + e_+b_+‖² is the sum of the squares of the distances from the data points in class {+} to the hyperplane {f_+(x) = 0}; c_1 e_−^T ξ is the sum of errors; (1/2)c_2(‖w_+‖² + b_+²) is the regularization term; and (1/2)c_3 w_+^T Σ_+ w_+ is the sum of the covariance matrices at the cluster granularity of class {+} projected onto the vector w_+. The constraints of (5) are defined by the points of class {−}. Problem (6) is established similarly for class {−}, with the constraints defined by class {+}. Thus, S-TWSVM exploits the structural information at the cluster granularity of one class in each problem. Therefore, the data simulation capability of S-TWSVM is more accurate than that of TWSVM (see Figure 1.c). However, as the data becomes more complex, this ability of S-TWSVM remains limited (see Figure 2.c). By using full information about the cluster granularity of the classes, S-TWSVM finds hyperplanes that represent the classes better than TWSVM. However, because all the covariance matrices Σ_{i+} and Σ_{i−} of both classes must be calculated, the training speed of S-TWSVM is not improved compared to that of TWSVM (see Table 1).
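The clustering step that S-TWSVM (and, later, WS-SVM) relies on can be reproduced with SciPy's hierarchical clustering, which is also the library used in Section 4. The sketch below assumes Ward's linkage with a cluster count fixed in advance (e.g. by the elbow method); the helper name and the synthetic data are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_clusters(X, n_clusters):
    """Split the rows of X into `n_clusters` groups with Ward's linkage.
       Returns a list of sub-matrices, one per cluster (the A_i or B_i)."""
    Z = linkage(X, method='ward')                          # hierarchical merge tree
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return [X[labels == c] for c in range(1, n_clusters + 1)]

# Illustrative class {+} made of two clusters with different trends.
rng = np.random.default_rng(2)
A = np.vstack([rng.normal([0, 0], [2.0, 0.3], (50, 2)),
               rng.normal([6, 6], [0.3, 2.0], (50, 2))])
A_clusters = ward_clusters(A, n_clusters=2)
print([Ai.shape for Ai in A_clusters])                     # e.g. [(50, 2), (50, 2)]
```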
Figure 1. For datasets in which the clusters in each class follow the same trend, SVM (a) simulates the distribution trends of the two classes as if they were the same; TWSVM (b) simulates the distribution trends of the two classes as different, but does not actually simulate the distribution trend of the data within each class; S-TWSVM (c) simulates the distribution trend within each class quite accurately.

Figure 2. For datasets in which the clusters in each class are constructed according to different distribution trends, SVM (a), TWSVM (b), and S-TWSVM (c) all have difficulty simulating the distribution trend of the data.

Table 1. Training time (seconds), with the dataset shown in Figure 2

Algorithms   1200 data points   1600 data points   2000 data points   2400 data points
SVM          2.143              5.308              6.754              11.512
TWSVM        0.351              0.700              1.193              1.805
S-TWSVM      0.388              0.853              1.434              2.158

Figure 3. (WS-SVM) f_{1+} (thin solid line) shows that cluster B_1 (small squares) tends to deviate vertically; f_{2+} (bold solid line) shows that cluster B_2 (large squares) tends to deviate horizontally; f_{1−} (thin dashed line) shows that cluster A_1 (small circles) tends to deviate horizontally; f_{2−} (bold dashed line) shows that cluster A_2 (large circles) tends to deviate vertically.
3. WEIGHTED STRUCTURAL SUPPORT VECTOR MACHINE

In this section, we describe a new classification algorithm: the Weighted Structural Support Vector Machine (called WS-SVM). Both theoretically and experimentally, we show that WS-SVM outperforms S-TWSVM in data simulation and training speed.

Similar to S-TWSVM, WS-SVM has two steps. The first step is to extract the structural information within the classes by Ward's linkage clustering method [8, 13, 19]; the second step is the model learning. Suppose that there are k clusters A_1, ..., A_k in class {+} and l clusters B_1, ..., B_l in class {−}. WS-SVM uses a class-vs-clusters strategy to determine (l + k) hyperplanes, each of which is closer to one class and far away from one cluster of the other class. Specifically, the method needs to find l hyperplanes such that the i-th hyperplane, f_{i+}(x) = w_{i+}^T x + b_{i+} = 0, is closer to class {+} and far away from cluster B_i of class {−}; it also needs to find k hyperplanes such that the i-th hyperplane, f_{i−}(x) = w_{i−}^T x + b_{i−} = 0, is closer to class {−} and far away from cluster A_i of class {+} (see Figure 3). Here w_{i+}, w_{i−} ∈ ℝ^n and b_{i+}, b_{i−} ∈ ℝ. The classifier is now selected as

$$f(x) = \underset{+,\,-}{\operatorname{argmin}}\,\big(f_+(x), f_-(x)\big), \tag{7}$$

with

$$f_+(x) = \sum_{i=1}^{l}\frac{m_{B_i}}{m_B}\,|f_{i+}(x)|, \qquad f_-(x) = \sum_{i=1}^{k}\frac{m_{A_i}}{m_A}\,|f_{i-}(x)|. \tag{8}$$

From this definition, we see that f_+(x) is the weighted average of the distances from x to the hyperplanes {f_{i+}(x) = 0}; the weight of the i-th hyperplane is proportional to m_{B_i}, the number of data points in cluster B_i. Similarly, f_−(x) is the weighted average of the distances from x to the hyperplanes {f_{i−}(x) = 0}. By virtue of (7), a new data point x is classified into class {+} or class {−} depending on whether f_+(x) is less than or greater than f_−(x).

3.1. The linear case

WS-SVM determines the (l + k) hyperplanes by solving the following (l + k) QPPs:

$$(\text{WS-SVM}_i^+)\quad \begin{aligned} \min_{w_{i+},b_{i+},\xi_i}\;& \frac{1}{2}\|Aw_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|w_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}w_{i+}^T\Sigma_+ w_{i+},\\ \text{s.t. }& -(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i \ge e_{m_{B_i}},\quad \xi_i \ge 0, \end{aligned}$$

for i = 1, ..., l, and

$$(\text{WS-SVM}_i^-)\quad \begin{aligned} \min_{w_{i-},b_{i-},\eta_i}\;& \frac{1}{2}\|Bw_{i-} + e_{m_B}b_{i-}\|_2^2 + c_- e_{m_{A_i}}^T\eta_i + \frac{\mu_-}{2}(\|w_{i-}\|_2^2 + b_{i-}^2) + \frac{\lambda_-}{2}w_{i-}^T\Sigma_- w_{i-},\\ \text{s.t. }& (A_i w_{i-} + e_{m_{A_i}}b_{i-}) + \eta_i \ge e_{m_{A_i}},\quad \eta_i \ge 0, \end{aligned}$$

for i = 1, ..., k. Here, e_{m_A} ∈ ℝ^{m_A×1}, e_{m_B} ∈ ℝ^{m_B×1}, e_{m_{A_i}} ∈ ℝ^{m_{A_i}×1}, e_{m_{B_i}} ∈ ℝ^{m_{B_i}×1} are vectors of ones; η_i ∈ ℝ^{m_{A_i}×1}, ξ_i ∈ ℝ^{m_{B_i}×1} are vectors of slack variables; Σ_+ = Σ_{1+} + ··· + Σ_{k+} and Σ_− = Σ_{1−} + ··· + Σ_{l−}, where Σ_{i+} and Σ_{i−} are the covariance matrices corresponding to the clusters A_i and B_i; c_+, c_−, λ_+, λ_−, μ_+, μ_− are penalty coefficients that adjust the roles of the terms in the objective functions; w_{i+}^T Σ_+ w_{i+} is the sum of the covariance matrices at the cluster granularity of class {+} projected onto w_{i+}, and w_{i−}^T Σ_− w_{i−} is the sum of the covariance matrices at the cluster granularity of class {−} projected onto w_{i−}. The constraints of problem (WS-SVM_i^+) are defined by the points of cluster B_i, and the constraints of problem (WS-SVM_i^-) are defined by cluster A_i.
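Equations (7)–(8) translate directly into a few lines of code: once the hyperplanes are learned, classification is a comparison of two weighted averages. The following sketch assumes the hyperplanes are already available as arrays; the names (W_plus, sizes_B, ...) and the example numbers are ours.

```python
import numpy as np

def ws_svm_predict(x, W_plus, b_plus, sizes_B, W_minus, b_minus, sizes_A):
    """Implements (7)-(8): W_plus is an (l x n) array whose i-th row is w_{i+},
       b_plus the matching intercepts, sizes_B the cluster sizes m_{B_i};
       analogously for the '-' hyperplanes and the clusters A_i."""
    m_B, m_A = np.sum(sizes_B), np.sum(sizes_A)
    f_plus = np.sum(sizes_B / m_B * np.abs(W_plus @ x + b_plus))     # weighted average of |f_{i+}(x)|
    f_minus = np.sum(sizes_A / m_A * np.abs(W_minus @ x + b_minus))  # weighted average of |f_{i-}(x)|
    return +1 if f_plus < f_minus else -1                            # argmin in (7)

# Tiny usage example with made-up hyperplanes in R^2 (two per side).
W_plus, b_plus, sizes_B = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.0, -1.0]), np.array([30, 70])
W_minus, b_minus, sizes_A = np.array([[1.0, 1.0], [1.0, -1.0]]), np.array([-2.0, 0.5]), np.array([50, 50])
print(ws_svm_predict(np.array([0.2, 0.1]), W_plus, b_plus, sizes_B, W_minus, b_minus, sizes_A))
```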
The Lagrangian of (WS-SVM_i^+) is given by

$$\begin{aligned} L_i(w_{i+}, b_{i+}, \xi_i, \alpha_i, \beta_i) =\;& \frac{1}{2}\|Aw_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|w_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}w_{i+}^T\Sigma_+ w_{i+}\\ & - \alpha_i^T\big(-(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i - e_{m_{B_i}}\big) - \beta_i^T\xi_i. \end{aligned}$$

Therefore, the KKT conditions of problem (WS-SVM_i^+) are

$$A^T(Aw_{i+} + e_{m_A}b_{i+}) + \mu_+ w_{i+} + \lambda_+\Sigma_+ w_{i+} + B_i^T\alpha_i = 0, \tag{9}$$
$$e_{m_A}^T(Aw_{i+} + e_{m_A}b_{i+}) + \mu_+ b_{i+} + e_{m_{B_i}}^T\alpha_i = 0, \tag{10}$$
$$c_+ e_{m_{B_i}} - \alpha_i - \beta_i = 0, \tag{11}$$
$$\alpha_i^T\big(-(B_i w_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i - e_{m_{B_i}}\big) = 0, \tag{12}$$
$$\beta_i^T\xi_i = 0. \tag{13}$$

By defining $H = [A\;\; e_{m_A}]$, $u_{i+} = \begin{bmatrix} w_{i+}\\ b_{i+}\end{bmatrix}$, $F_+ = \begin{bmatrix}\Sigma_+ & 0\\ 0 & 0\end{bmatrix}$, $G_i = [B_i\;\; e_{m_{B_i}}]$, and I being the identity matrix of order (n + 1), it follows from (9) and (10) that

$$[H^T H + \mu_+ I + \lambda_+ F_+]\,u_{i+} + G_i^T\alpha_i = 0,$$

which implies

$$u_{i+} = -[H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i, \qquad i = 1,\dots,l. \tag{14}$$

Substituting (14) into the Lagrangian and combining with the conditions (11), (12), and (13), we obtain the dual problem of (WS-SVM_i^+) as follows

$$(\text{DWS-SVM}_i^+)\quad \begin{aligned} \max_{\alpha_i}\;& e_{m_{B_i}}^T\alpha_i - \frac{1}{2}\alpha_i^T G_i [H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i,\\ \text{s.t. }& 0 \le \alpha_i \le c_+ e_{m_{B_i}}. \end{aligned} \tag{15}$$

In the same way, we also obtain the dual problem of (WS-SVM_i^-)

$$(\text{DWS-SVM}_i^-)\quad \begin{aligned} \max_{\gamma_i}\;& e_{m_{A_i}}^T\gamma_i - \frac{1}{2}\gamma_i^T H_i (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i,\\ \text{s.t. }& 0 \le \gamma_i \le c_- e_{m_{A_i}}, \end{aligned} \tag{16}$$

where $G = [B\;\; e_{m_B}]$, $F_- = \begin{bmatrix}\Sigma_- & 0\\ 0 & 0\end{bmatrix}$, $H_i = [A_i\;\; e_{m_{A_i}}]$. The augmented vectors $u_{i-} = \begin{bmatrix} w_{i-}\\ b_{i-}\end{bmatrix}$ are given by

$$u_{i-} = [G^T G + \mu_- I + \lambda_- F_-]^{-1} H_i^T\gamma_i, \qquad i = 1,\dots,k. \tag{17}$$
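To make the derivation concrete, the following sketch (ours, using the paper's notation) solves the box-constrained dual (15) for a single cluster B_i with cvxopt and recovers (w_{i+}, b_{i+}) through (14). It assumes Σ_+ has already been formed as in Section 2.1 and does not use the SMW speed-up mentioned later in Remark 4.

```python
import numpy as np
from cvxopt import matrix, solvers
solvers.options['show_progress'] = False

def solve_dws_plus(A, B_i, Sigma_plus, c_plus, mu_plus, lam_plus):
    """Solve (15) for one cluster B_i and return (w_{i+}, b_{i+}) via (14)."""
    mA, n = A.shape
    mBi = B_i.shape[0]
    H = np.hstack([A, np.ones((mA, 1))])             # H   = [A    e_{m_A}]
    G_i = np.hstack([B_i, np.ones((mBi, 1))])        # G_i = [B_i  e_{m_{B_i}}]
    F_plus = np.zeros((n + 1, n + 1))
    F_plus[:n, :n] = Sigma_plus                      # F_+ = [[Sigma_+, 0], [0, 0]]
    M = H.T @ H + mu_plus * np.eye(n + 1) + lam_plus * F_plus
    Q = G_i @ np.linalg.solve(M, G_i.T)              # G_i M^{-1} G_i^T
    # (15) as a minimization: min 0.5*a^T Q a - e^T a, 0 <= a <= c_+;
    # a tiny ridge keeps the QP solver happy when Q is singular.
    P, q = matrix(Q + 1e-10 * np.eye(mBi)), matrix(-np.ones(mBi))
    G = matrix(np.vstack([-np.eye(mBi), np.eye(mBi)]))
    h = matrix(np.hstack([np.zeros(mBi), c_plus * np.ones(mBi)]))
    alpha = np.array(solvers.qp(P, q, G, h)['x']).ravel()
    u = -np.linalg.solve(M, G_i.T @ alpha)           # equation (14)
    return u[:n], u[n]                               # (w_{i+}, b_{i+})
```

The '−' problems are handled symmetrically, with G = [B e_{m_B}] and H_i = [A_i e_{m_{A_i}}] in (16)–(17).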
Algorithm 1 (Linear WS-SVM). Given m data points in ℝ^n represented by an m × n matrix C, where class {+} includes m_A points represented by an m_A × n matrix A and class {−} includes m_B points represented by an m_B × n matrix B, we generate the linear classifier f(x) as follows:

(i) Cluster the dataset using Ward's linkage [13]. Suppose that there are k clusters in class {+}, each cluster consisting of m_{A_i} points and represented by a matrix A_i ∈ ℝ^{m_{A_i}×n}, and l clusters in class {−}, each cluster consisting of m_{B_i} points and represented by a matrix B_i ∈ ℝ^{m_{B_i}×n}.

(ii) Solve the problems (15), (16) to obtain α_1, ..., α_l; γ_1, ..., γ_k.

(iii) Determine (w_{1+}, b_{1+}), ..., (w_{l+}, b_{l+}) and (w_{1−}, b_{1−}), ..., (w_{k−}, b_{k−}) via (14), (17).

(iv) Classify a new data point x by using (7) and (8).

Remark 1. In linear WS-SVM, we have to solve (l + k) problems of the form (15) or (16). These are QPPs whose numbers of decision variables are m_{B_i} or m_{A_i}. If the number of data samples in each cluster is approximately equal to m/(k + l), then the complexity of the algorithm is

$$O\Big((k+l)\cdot\frac{m^3}{(k+l)^3}\Big) = O\Big(\frac{m^3}{(k+l)^2}\Big),$$

while the complexities of SVM and TWSVM are O(m³) and O(m³/4), respectively. From this formula, it is easy to see that the more clusters there are in the dataset, the shorter the runtime of WS-SVM. This is also clearly demonstrated in the experiments.

Remark 2. Many different clustering methods can be used to implement Step (i) of Algorithm 1. Here we choose Ward's clustering algorithm (with k and l chosen independently via the elbow method) for convenience of comparison with S-TWSVM, because the authors of [13] also used this technique in their experiments. When processing actual data, one can choose another, more efficient clustering algorithm.

3.2. The nonlinear case

Let Φ : ℝ^n → H be a nonlinear mapping, where H is a Hilbert space whose dimension is not less than n (and may be infinite). Since S = span(Φ(C^T)) is a subspace of H whose dimension does not exceed m, we can consider S as a Euclidean space and Φ : ℝ^n → S. Suppose that after the clustering step on the space S we obtain k clusters Φ(A_1), ..., Φ(A_k) in the class Φ(A), each cluster Φ(A_i) consisting of m_{A_i} data points, and l clusters Φ(B_1), ..., Φ(B_l) in the class Φ(B), each cluster Φ(B_i) consisting of m_{B_i} data points. In the space S, a hyperplane Φ(x^T)h + b = 0 (with h ∈ S being the normal vector) can be rewritten as Φ(x^T)Φ(C^T)u + b = 0 for some vector u ∈ ℝ^m. Therefore, by defining Φ(x^T)Φ(C^T) = K(x^T, C^T), the hyperplane has the form K(x^T, C^T)u + b = 0, where K is a predefined kernel [14].
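For reference, the kernel blocks K(·, C^T) appearing below can be computed as in this short sketch for the Gaussian (RBF) kernel, the kernel used in the experiments; the bandwidth gamma and the toy matrices are assumptions of ours.

```python
import numpy as np

def rbf_kernel_matrix(X, C, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X_i - C_j||^2), i.e. the block K(X, C^T)
       evaluated row-wise; X is (p x n), C is (m x n), the result is (p x m)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2 * X @ C.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negatives from round-off

# Example: kernel blocks of the kind needed in (18), built from the full data matrix C.
rng = np.random.default_rng(3)
C_data = rng.normal(size=(12, 2))          # all training points
A, B_1 = C_data[:7], C_data[7:10]          # a class {+} block and one cluster of class {-}
K_A = rbf_kernel_matrix(A, C_data)         # K(A, C^T),   shape (7, 12)
K_B1 = rbf_kernel_matrix(B_1, C_data)      # K(B_1, C^T), shape (3, 12)
print(K_A.shape, K_B1.shape)
```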
WS-SVM determines l hyperplanes such that the i-th one, K(x^T, C^T)u_{i+} + b_{i+} = 0, is closer to class Φ(A) and far away from cluster Φ(B_i). It also determines k hyperplanes such that the i-th one, K(x^T, C^T)u_{i−} + b_{i−} = 0, is closer to class Φ(B) and far away from cluster Φ(A_i). Specifically, we have the following (l + k) QPPs:

$$\begin{aligned} \min_{u_{i+},b_{i+},\xi_i}\;& \frac{1}{2}\|K(A, C^T)u_{i+} + e_{m_A}b_{i+}\|_2^2 + c_+ e_{m_{B_i}}^T\xi_i + \frac{\mu_+}{2}(\|u_{i+}\|_2^2 + b_{i+}^2) + \frac{\lambda_+}{2}u_{i+}^T\Phi(C)\Sigma_+^{\Phi}\Phi(C)^T u_{i+},\\ \text{s.t. }& -(K(B_i, C^T)u_{i+} + e_{m_{B_i}}b_{i+}) + \xi_i \ge e_{m_{B_i}},\quad \xi_i \ge 0, \end{aligned} \tag{18}$$

for i = 1, ..., l, and

$$\begin{aligned} \min_{u_{i-},b_{i-},\eta_i}\;& \frac{1}{2}\|K(B, C^T)u_{i-} + e_{m_B}b_{i-}\|_2^2 + c_- e_{m_{A_i}}^T\eta_i + \frac{\mu_-}{2}(\|u_{i-}\|_2^2 + b_{i-}^2) + \frac{\lambda_-}{2}u_{i-}^T\Phi(C)\Sigma_-^{\Phi}\Phi(C)^T u_{i-},\\ \text{s.t. }& (K(A_i, C^T)u_{i-} + e_{m_{A_i}}b_{i-}) + \eta_i \ge e_{m_{A_i}},\quad \eta_i \ge 0, \end{aligned} \tag{19}$$

for i = 1, ..., k. Here, Σ_+^Φ = Σ_{1+}^Φ + ··· + Σ_{k+}^Φ and Σ_−^Φ = Σ_{1−}^Φ + ··· + Σ_{l−}^Φ, where Σ_{i+}^Φ and Σ_{i−}^Φ are the covariance matrices corresponding to the clusters Φ(A_i) and Φ(B_i), respectively. The classification function in the nonlinear case is selected as

$$f(x) = \underset{+,\,-}{\operatorname{argmin}}\,\big(f_+(x), f_-(x)\big), \tag{20}$$

with

$$f_+(x) = \sum_{i=1}^{l}\frac{m_{B_i}}{m_B}\,|K(x^T, C^T)u_{i+} + b_{i+}|, \qquad f_-(x) = \sum_{i=1}^{k}\frac{m_{A_i}}{m_A}\,|K(x^T, C^T)u_{i-} + b_{i-}|. \tag{21}$$

The dual problem of (18) is

$$\begin{aligned} \max_{\alpha_i}\;& e_{m_{B_i}}^T\alpha_i - \frac{1}{2}\alpha_i^T G_i [H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i,\\ \text{s.t. }& 0 \le \alpha_i \le c_+ e_{m_{B_i}}, \end{aligned} \tag{22}$$

where $H = [K(A, C^T)\;\; e_{m_A}]$, $F_+ = \begin{bmatrix}\Phi(C)\Sigma_+^{\Phi}\Phi(C)^T & 0\\ 0 & 0\end{bmatrix}$, $G_i = [K(B_i, C^T)\;\; e_{m_{B_i}}]$, I is the identity matrix of order (m + 1), and the augmented vectors are

$$u_{i+} = \begin{bmatrix} u_{i+}\\ b_{i+}\end{bmatrix} = -[H^T H + \mu_+ I + \lambda_+ F_+]^{-1} G_i^T\alpha_i. \tag{23}$$

The dual problem of (19) is

$$\begin{aligned} \max_{\gamma_i}\;& e_{m_{A_i}}^T\gamma_i - \frac{1}{2}\gamma_i^T H_i (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i,\\ \text{s.t. }& 0 \le \gamma_i \le c_- e_{m_{A_i}}, \end{aligned} \tag{24}$$

where $G = [K(B, C^T)\;\; e_{m_B}]$, $F_- = \begin{bmatrix}\Phi(C)\Sigma_-^{\Phi}\Phi(C)^T & 0\\ 0 & 0\end{bmatrix}$, $H_i = [K(A_i, C^T)\;\; e_{m_{A_i}}]$, and

$$u_{i-} = \begin{bmatrix} u_{i-}\\ b_{i-}\end{bmatrix} = (G^T G + \mu_- I + \lambda_- F_-)^{-1} H_i^T\gamma_i. \tag{25}$$
Remark 3. The matrix F_+ can be calculated as follows (a code sketch of this computation is given at the end of this section). For each A_i ∈ ℝ^{m_{A_i}×n}, denote by M_{A_i} ∈ ℝ^{m_{A_i}×n} the average matrix of A_i (i.e., all rows of M_{A_i} are the same and equal to the average vector of A_i). Then

$$\Sigma_{i+}^{\Phi} = \frac{1}{m_{A_i}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)^T\big(\Phi(A_i) - \Phi(M_{A_i})\big),$$

and therefore

$$\begin{aligned} \Phi(C)\Sigma_{i+}^{\Phi}\Phi(C)^T &= \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)\Phi(C)^T\Big]^T \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(\Phi(A_i) - \Phi(M_{A_i})\big)\Phi(C)^T\Big]\\ &= \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(K(A_i, C^T) - K(M_{A_i}, C^T)\big)\Big]^T \Big[\tfrac{1}{\sqrt{m_{A_i}}}\big(K(A_i, C^T) - K(M_{A_i}, C^T)\big)\Big]. \end{aligned}$$

Similarly, the matrix F_− can be calculated from

$$\Phi(C)\Sigma_{i-}^{\Phi}\Phi(C)^T = \Big[\tfrac{1}{\sqrt{m_{B_i}}}\big(K(B_i, C^T) - K(M_{B_i}, C^T)\big)\Big]^T \Big[\tfrac{1}{\sqrt{m_{B_i}}}\big(K(B_i, C^T) - K(M_{B_i}, C^T)\big)\Big],$$

where M_{B_i} ∈ ℝ^{m_{B_i}×n} is the average matrix of B_i.

Remark 4. In (22) and (24) we need to compute the inverses of square matrices of order (m + 1). This becomes difficult when m is large, so it is necessary to reduce the size of those matrices. This problem can be solved by using the Sherman-Morrison-Woodbury (SMW) formula [6], as in [7].

Algorithm 2 (Nonlinear WS-SVM). Given m data points in ℝ^n represented by the m × n matrix C, where class {+} includes m_A points represented by the m_A × n matrix A and class {−} includes m_B points represented by the m_B × n matrix B, we generate the nonlinear classifier f(x) as follows:

(i) Choose a kernel function K(x^T, C^T), typically the Gaussian kernel [14].

(ii) Cluster the dataset using Ward's linkage [13]. Suppose that there are k clusters in class {+}, each cluster consisting of m_{A_i} points and represented by the matrix Φ(A_i), and l clusters in class {−}, each cluster consisting of m_{B_i} points and represented by the matrix Φ(B_i).

(iii) Solve the problems (22) and (24) to obtain α_1, ..., α_l; γ_1, ..., γ_k.

(iv) Determine (u_{1+}, b_{1+}), ..., (u_{l+}, b_{l+}) and (u_{1−}, b_{1−}), ..., (u_{k−}, b_{k−}) via (23), (25).

(v) Classify a new data point x by using (20) and (21).

Remark 5. As usual, the linear algorithm is applied when the training data is almost linearly separable. If the classes overlap too severely, the linear algorithm will not work effectively and the accuracy will not be high. In that situation, the nonlinear algorithm should be used for better accuracy.
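Before moving on to the experiments, here is a small sketch (ours) of the computation described in Remark 3: the kernel-space structural matrix is assembled from ordinary kernel evaluations. A plain linear kernel is used in the demo purely to keep the snippet self-contained; any kernel function with the same (X, C) → K(X, C^T) signature, such as the RBF sketch above, can be substituted.

```python
import numpy as np

def kernel_structural_block(clusters, C_data, kernel):
    """Compute Phi(C) Sigma_+^Phi Phi(C)^T via Remark 3:
       sum over clusters of (1/m_i) * (K(A_i,C^T) - K(M_i,C^T))^T (K(A_i,C^T) - K(M_i,C^T)),
       then embed it into F_+ = [[S, 0], [0, 0]] of order (m + 1) as used in (22)."""
    m = C_data.shape[0]
    S = np.zeros((m, m))
    for A_i in clusters:
        m_i = A_i.shape[0]
        M_i = np.tile(A_i.mean(axis=0), (m_i, 1))        # average matrix M_{A_i}
        Dk = kernel(A_i, C_data) - kernel(M_i, C_data)   # K(A_i,C^T) - K(M_{A_i},C^T)
        S += Dk.T @ Dk / m_i
    F = np.zeros((m + 1, m + 1))
    F[:m, :m] = S
    return F

# Runnable demo with a linear kernel; swap in rbf_kernel_matrix for the Gaussian case.
rng = np.random.default_rng(4)
C_data = rng.normal(size=(10, 2))
clusters = [C_data[:4], C_data[4:7]]                     # stand-in clusters of class {+}
F_plus = kernel_structural_block(clusters, C_data, kernel=lambda X, C: X @ C.T)
print(F_plus.shape)                                      # (11, 11)
```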
4. EXPERIMENTS

In this section, we compare WS-SVM against S-TWSVM [13] and TWSVM [7] on various datasets. All algorithms (our model, S-TWSVM as in [13], and TWSVM as in [7]) are implemented in version 3.8.3 of the Python programming language and run on a laptop with an AMD Ryzen 5 CPU and 8 GB of RAM. We use the following libraries: scipy.cluster.hierarchy to cluster the data, cvxopt to solve the QPPs, matplotlib to display figures, pandas and numpy to process the data, and sklearn to evaluate the models and adjust their hyperparameters. For simplicity, we let c_+ = c_−, μ_+ = μ_−, λ_+ = λ_− in WS-SVM and c_1 = c_4, c_2 = c_5, c_3 = c_6 in S-TWSVM [13]. All hyperparameters belong to the set {0.0001, 0.001, 0.1, 1, 10, 100, 1000} and are obtained by using the grid-search technique. All settings are uploaded to [4].

4.1. Toy data

We first experiment with 2-D toy data. Note that the training time of WS-SVM and S-TWSVM includes the clustering time. The 2-D toy data consists of 800 points belonging to class {+} and 800 points belonging to class {−}, randomly generated according to two Gaussian distributions in each class. The dataset is split 90/10, which means 90% of the data is used for training and the rest for testing. We use standard 10-fold cross validation (CV) to evaluate the testing accuracy of all models. Implementing SVM, TWSVM, and S-TWSVM, we obtained the results shown in Figure 4; for WS-SVM, the results are shown in Figure 5.

Figure 4. SVM (a): run-time 10.677 (s); CV accuracy 98.516 +/- 0.649. TWSVM (b): run-time 0.510 (s); CV accuracy 98.438 +/- 0.988. S-TWSVM (c): run-time 0.562 (s); CV accuracy 98.594 +/- 0.911.

Figure 5. WS-SVM: run-time 0.142 (s); CV accuracy 98.359 +/- 0.547.

4.2. UCI datasets

Next, we run these algorithms on the UCI datasets [2], which have also been used in the experiments of [13] and [7]. We randomly select 90% of each extracted dataset for training and 10% for testing. We also use 10-fold cross validation to evaluate the accuracy of all algorithms. All hyperparameters belong to the set {0.0001, 0.001, 0.1, 1, 10, 100, 1000} and are obtained by using the grid-search technique. The results are shown in Table 3 and Table 4 (by applying Algorithm 1) and in Table 5 (by applying Algorithm 2). Table 3 shows that the training time of WS-SVM is better than that of S-TWSVM and TWSVM, while Table 4 shows that the classification accuracy does not differ much between the methods. The training time of WS-SVM is even better when the data size is large and the data is organized in many clusters, as shown in Table 2.
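The tuning protocol described above (the seven candidate values, grid search, 10-fold CV, 90/10 split) can be sketched as follows. Since the WS-SVM implementation itself lives in [4], sklearn's SVC is used here purely as a stand-in estimator to illustrate the protocol; the data generated by make_blobs only mimics the toy setting of Section 4.1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 2-D toy data in the spirit of Section 4.1 (two Gaussian clusters per class).
X, y = make_blobs(n_samples=1600, centers=4, cluster_std=1.0, random_state=0)
y = np.where(y < 2, 1, -1)                        # merge the blobs into two classes

# 90/10 split, then 10-fold CV grid search over the paper's candidate values.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.9, random_state=0)
grid = {'C': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
        'gamma': [0.0001, 0.001, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=10, scoring='accuracy')
search.fit(X_tr, y_tr)
print('best params:', search.best_params_)
print('CV accuracy: %.3f +/- %.3f' % (search.best_score_,
      search.cv_results_['std_test_score'][search.best_index_]))
print('test accuracy:', search.score(X_te, y_te))
```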
Table 2. Training time (s) as the size of the dataset increases. The computational complexity of WS-SVM is clearly less than that of S-TWSVM and TWSVM.

Algorithms   4000 data points   6000 data points   8000 data points   10000 data points
TWSVM        7.267              21.872             58.101             104.355
S-TWSVM      5.614              16.244             44.397             75.549
WS-SVM       1.33               6.251              8.675              16.863

Table 3. Test training time (s) with a linear kernel (Algorithm 1)

Dataset                    Number of clusters   WS-SVM   S-TWSVM   TWSVM
Hepatitis (155 x 19)       (5 × 3)              0.001    0.005     0.004
Australian (690 x 14)      (4 × 5)              0.033    0.089     0.043
BUPA-liver (345 x 6)       (4 × 2)              0.015    0.020     0.014
CMC (844 x 9)              (3 × 5)              0.047    0.099     0.083
Credit (690 x 19)          (2 × 3)              0.037    0.055     0.050
Diabetes (768 x 8)         (5 × 2)              0.099    0.166     0.117
Flare-solar (1066 x 10)    (2 × 2)              0.117    0.330     0.278
German (1000 x 20)         (3 × 3)              0.065    0.180     0.166
Heart-statlog (270 x 13)   (3 × 4)              0.011    0.011     0.011
Image (2310 x 18)          (3 × 2)              0.356    1.279     1.444
Ionosphere (350 x 34)      (3 × 2)              0.012    0.017     0.014
Spect (265 x 22)           (3 × 3)              0.012    0.016     0.012
Sonar (208 x 60)           (6 × 3)              0.008    0.009     0.008
Heart-c (303 x 13)         (8 × 2)              0.013    0.019     0.010

Table 4. Test set accuracy (%) with a linear kernel (Algorithm 1)

Dataset                    WS-SVM              S-TWSVM             TWSVM
Hepatitis (155 x 19)       83.516 +/- 10.078   87.088 +/- 8.888    84.176 +/- 8.910
Australian (690 x 14)      86.452 +/- 4.903    86.613 +/- 4.507    85.968 +/- 4.565
BUPA-liver (345 x 6)       68.065 +/- 7.133    67.097 +/- 11.613   65.806 +/- 5.983
CMC (844 x 9)              65.223 +/- 3.637    65.221 +/- 5.010    64.821 +/- 5.204
Credit (690 x 19)          86.157 +/- 3.127    86.964 +/- 4.444    85.192 +/- 4.223
Diabetes (768 x 8)         75.118 +/- 4.993    77.000 +/- 5.759    77.205 +/- 4.605
Flare-solar (1066 x 10)    81.760 +/- 4.127    81.969 +/- 4.200    81.761 +/- 4.306
German (1000 x 24)         69.375 +/- 5.356    71.667 +/- 5.449    71.042 +/- 7.184
Heart-statlog (270 x 13)   84.850 +/- 7.223    86.483 +/- 5.367    85.650 +/- 5.776
Image (2310 x 18)          86.196 +/- 1.619    84.415 +/- 1.244    85.089 +/- 1.745
Ionosphere (350 x 34)      92.056 +/- 6.574    91.744 +/- 4.982    90.504 +/- 6.104
Spect (265 x 22)           81.449 +/- 6.503    84.801 +/- 4.992    83.134 +/- 6.120
Sonar (208 x 60)           80.263 +/- 11.966   78.099 +/- 8.244    78.129 +/- 10.286
Heart-c (303 x 13)         84.577 +/- 5.116    83.836 +/- 4.668    83.823 +/- 4.435
Table 5. Test set accuracy (%) with an RBF kernel (Algorithm 2)

Dataset (m x n)          WS-SVM             S-TWSVM             TWSVM
Hepatitis (155 x 19)     84.835 +/- 9.939   83.407 +/- 12.052   83.462 +/- 11.958
BUPA-liver (345 x 6)     70.000 +/- 6.297   69.032 +/- 3.290    72.258 +/- 3.592
Votes (303 x 13)         95.154 +/- 2.626   95.667 +/- 2.519    95.917 +/- 2.321
WPBC (198 x 35)          81.669 +/- 8.879   79.412 +/- 9.727    80.588 +/- 10.156

5. CONCLUSIONS

This paper has proposed a new Weighted Structural Support Vector Machine (known as WS-SVM) for classification problems with a class-vs-clusters strategy. The algorithm is performed in two steps: the first step is to extract the structural information of the data using Ward's linkage clustering method; the second step is to incorporate the structural information at the cluster granularity into the learning model. The classifier is based on the weighted average distances from a data point to the hyperplanes representing the classes. Both theory and experiment show that the training time of WS-SVM is better than those of S-TWSVM and TWSVM in most cases. Besides, when the data is large and there are many clusters with very different distribution trends, the WS-SVM algorithm effectively simulates the distribution trends of the clusters and thus improves the classification accuracy. When k = l = 1, WS-SVM is exactly S-TWSVM, and if in addition λ_+ = λ_− = 0, it reduces to TWSVM.

The WS-SVM algorithm generally only achieves high accuracy when the clustering within each class is clear. In the case of ambiguous clustering, the algorithm needs to be improved, for example by flexibly combining it with a cluster-vs-class strategy. The WS-SVM algorithm may not be really suitable for multi-class problems. However, it seems to be useful for classification problems with unbalanced data, and this is an interesting research direction for the future.

ACKNOWLEDGMENT

The authors would like to thank the referees for their helpful comments and valuable suggestions.

REFERENCES

[1] M. Adancon and M. Cheriet, "Model selection for the LS-SVM. Application to handwriting recognition," Pattern Recognition, vol. 42, pp. 3264–3270, 2009.

[2] UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems at the University of California, Irvine. [Online]. Available: machine-learning-databases/

[3] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua, and A. Lopez, "A comprehensive survey on support vector machine classification: Applications, challenges and trends," Neurocomputing, vol. 408, pp. 189–215, 2020.

[4] N. Cuong, Python code. [Online]. Available:
[5] G. Fung and O. L. Mangasarian, "Proximal support vector machine," in KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California. New York, NY, USA: Association for Computing Machinery, 2001, pp. 77–86. [Online]. Available:

[6] G. Golub and C. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996.

[7] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 905–910, 2007.

[8] D. Jeung, D. Wang, W. Wing, E. Tsang, and X. Wang, "Structured large margin machines: sensitive to data distributions," Machine Learning, vol. 68, pp. 171–200, 2007.

[9] O. Mangasarian and E. Wild, "Multisurface proximal support vector classification via generalized eigenvalues," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 69–74, 2006.

[10] B. Mei and Y. Xu, "Multi-task least squares twin support vector machine for classification," Neurocomputing, vol. 338, pp. 26–33, 2019.

[11] W. Noble, Support Vector Machine Applications in Computational Biology. MIT Press, 2004.

[12] X. Pan, Y. Luo, and Y. Xu, "K-nearest neighbor based structural twin support vector machine," Knowledge-Based Systems, vol. 88, pp. 34–44, 2015.

[13] Z. Qi, Y. Tian, and Y. Shi, "Structural twin support vector machine for classification," Knowledge-Based Systems, vol. 43, pp. 74–81, 2013.

[14] B. Schoelkopf and A. Smola, Learning with Kernels. MIT Press, 2002.

[15] Y. Tian, Y. Shi, and X. Liu, "Recent advances on support vector machines research," Technological and Economic Development of Economy, vol. 18, pp. 5–33, 2012.

[16] D. Tomar and S. Agarwal, "Twin support vector machine: A review from 2007 to 2014," Egyptian Informatics Journal, vol. 16, pp. 55–69, 2015.

[17] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, 1995.

[18] X. Xie and S. Sun, "Multitask centroid twin support vector machines," Neurocomputing, vol. 149, pp. 1085–1091, 2015.

[19] H. Xue, S. Chen, and Q. Yang, "Structural regularized support vector machine: A framework for structural large margin classifier," IEEE Transactions on Neural Networks, vol. 22, pp. 573–587, 2011.

Received 20 August 2020
Accepted 19 February 2021