Some new results on automatic identification of Vietnamese folk songs Cheo and Quanho


Journal of Computer Science and Cybernetics, V.36, N.4 (2020), 325–345
DOI 10.15625/1813-9663/36/4/15609

SOME NEW RESULTS ON AUTOMATIC IDENTIFICATION OF VIETNAMESE FOLK SONGS CHEO AND QUANHO

CHU BA THANH 1,2,*, TRINH VAN LOAN 1,2, NGUYEN HONG QUANG 2
1 Faculty of Information Technology, Hung Yen University of Technology and Education
2 School of Information and Communication Technology, Hanoi University of Science and Technology

Abstract. Vietnamese folk songs are very rich in genre and content. Identifying Vietnamese folk tunes will contribute to storing and searching for information about these tunes automatically. This paper first presents an overview of music genre classification work carried out in Vietnam and abroad. For two very popular types of Vietnamese folk songs, Cheo and Quanho, the paper then describes the dataset and the Gaussian Mixture Model (GMM) used to perform identification experiments on some of these folk songs. The GMM was used in experiments with four parameter sets containing Mel Frequency Cepstral Coefficients (MFCC), energy, the first and second derivatives of MFCC and energy, tempo, intensity, and fundamental frequency. The results showed that the parameters added to the MFCCs contributed significantly to improving the identification accuracy with appropriate values of the Gaussian component number M. Our experiments also showed that when the excerpts were, on average, only 29.63% of the whole song for Cheo and 38.1% for Quanho, the identification rate was only 3.1% and 2.33% lower than for the whole song, respectively. The identification of Cheo and Quanho was also tested with i-vectors.

Keywords. Identification; Folk songs; Vietnamese; Cheo; Quanho; GMM; MFCC; Excerpt; Tempo; F0; i-vectors.

1. INTRODUCTION

Research related to music data mining is very diverse and has been going on for many years in various directions: genre classification, artist/singer identification, emotion/mood detection, instrument recognition, music similarity search, and so on. Among these, music genre classification is the most complex and difficult problem to solve. Many studies have addressed music genre classification with various approaches, such as the Naïve Bayes Classifier (NBC) [1,2], Decision Tree Classifier (DTC) [3], K Nearest Neighbor (KNN) [4,5], Hidden Markov Model (HMM) [6–8], GMM [9–11], Support Vector Machine (SVM) [12,13] and Artificial Neural Network (ANN) [14–16]. The general architecture of a music genre classification system is shown in Figure 1 [17]. In general, research in this field can be divided into two steps. The first step is to extract the features from the music signal, and the second step is to use machine learning algorithms for training and testing.

*Corresponding author.
E-mail addresses: thanhcb.fit@utehy.edu.vn (C.B.Thanh); loantv@soict.hust.edu.vn (T.V.Loan); quangnh@soict.hust.edu.vn (N.H.Quang).
© 2020 Vietnam Academy of Science & Technology
Figure 1. Diagram of the music genre classification system [17]

The accuracy rate depends on the variants and parameters of the algorithm and on the number of features used.

Vietnam is a multi-ethnic country with a long history of culture, so Vietnamese folk songs are multifarious. Vietnamese folk songs exist in many regions with different genres: in the North, there are Quanho Bac Ninh, Cheo, Xoan singing, Vi singing, Trong Quan singing, Do singing, and others; in the Middle, there are Vi Dam, Ho Hue, Ly Hue, and Sac bua; in the South, there are Ly, Ho, and poetry speaking; in the northern mountainous region there are folk songs of the Thai, H'Mong, and Muong ethnic groups; in the Highlands, there are folk songs of the Gia Rai, Ede, Ba Na, and Xo Dang, each with its own identity. There are many types of folk songs in Vietnam, but Cheo and Quanho are two very popular types, and their number of songs is richer than the others. According to the statistics in [18], there are 213 Quanho songs, and from composer Mai Thien's statistics [19–23] and his notes, there are 190 Cheo songs.

In our published paper [24], we performed an identification experiment with 10 Quanho folk songs on a dataset of 100 files using the WEKA toolkit with the SMO (an implementation of John Platt's sequential minimal optimization algorithm for training a support vector classifier), multilayer perceptron, and multiclass classifier (a metaclassifier for handling multi-class datasets with 2-class classifiers) algorithms. The average accuracy rates for these algorithms were 89%, 86%, and 71%, respectively. On the same dataset, we also tested the GMM model [25], and the highest average accuracy rate was 79%. In our most recently published paper [26], we used the GMM model for classification and identification on a dataset of 1000 files of the two folk song types Cheo and Quanho, with 500 files of 25 folk songs for each type. The classification accuracy between the two types Cheo and Quanho is 91.0%, and the highest identification rates are 81.6% for Cheo folk songs and 85.6% for Quanho folk songs.

In this paper, we present the results of Cheo and Quanho identification using different parameter sets, for the full length of the audio files and for short excerpts with variable lengths. The paper is organized as follows: Section 2 is an overview of music genre classification, Section 3 describes the dataset and the GMM used for our experiments, and Section 4 gives the experimental results. Finally, conclusions are drawn in Section 5.
2. AN OVERVIEW OF MUSIC GENRE CLASSIFICATION

Although research on music data mining has been going on for a long time, its scale is still limited. Based on ongoing research from around the world, the International Symposium on Music Information Retrieval (ISMIR) was officially launched on October 23-25, 2000, in Massachusetts, USA. Since then the symposium has been held annually and is the world's leading research forum on the processing, searching, organization, and extraction of information directly related to music.

One of the most cited papers in music genre classification is that of Tzanetakis et al. [27]. In this paper, the authors conducted a classification experiment on a dataset of six genres (Classical, Country, Disco, Hip Hop, Jazz, Rock) with nine features (Mean-Rolloff, Mean-Flux, Mean-Zero Crossings, Std-Centroid, Std-Rolloff, Std-Flux, Std-Zero Crossings, Low Energy). The experiment achieved a highest accuracy rate of 58.67%. The same experiment was conducted in 2003 by Tao Li et al., who proposed a new feature extraction method called DWCH (Daubechies Wavelet Coefficient Histograms). This method improved the accuracy rate to 78.5%.

In 2002, George Tzanetakis completed the dataset named GTZAN [28]. It consists of 10 genres (Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Pop, Reggae, and Rock); each genre has 100 excerpts, and the length of each excerpt is 30 seconds. Tzanetakis & Cook [29] experimented with classification on this dataset with sets of features including a 30-dimensional feature vector (19 FFT, 10 MFCC), rhythmic content features (6 dimensions), and pitch content features (5 dimensions). The result improved the average accuracy rate to 61%. This dataset has been used by many authors for music genre classification experiments with different features.

West & Cox [30] conducted an assessment of the factors affecting the automatic classification of musical signals. Firstly, they described and evaluated the classification performance of two different evaluation methods based on spectral shape characteristics, Mel frequency filters, and spectral contrast characteristics. Secondly, they studied temporal models of selected features from the music and finally reduced the number of dimensions of the feature vector. These were then used to train and evaluate the performance of different classifiers, and the results showed an increase in accuracy rate.

Mohd et al. [31] used the Marsyas software to extract the same features as Tzanetakis et al. in 2001, using the J48 classifier in the WEKA software (a tool for preprocessing, classification, clustering, feature selection, and modeling) [32] on a dataset of Malaysian music. The results show that the factors affecting classification results include the features, the classifiers, the size of the dataset, the length of the excerpts, the location of the excerpt in the original, and the parameters of the classifiers.

Bergstra et al. [33] suggested that it would be better to classify features synthesized from a set of audio signals using a generic vector for each song than to classify individual feature vectors. They focused on synthesizing frame-level features into appropriate segments and on classifying and experimenting with the segments themselves. Their software won first prize in the MIREX 2005 competition for music information retrieval. They used two datasets, with about 1500 original songs.
Features are extracted in frames of 1024 samples and m frames are synthesized. With m = 300, i.e., segments of 13.9 s, the accuracy rate was 82.3%. When they conducted the experiment on the GTZAN dataset with the same parameter set, the accuracy rate reached 82.5%.
Table 1. A summary of the experimental results on the GTZAN dataset

Authors | Year | Accuracy Rate
Tzanetakis, Cook | 2002 | 61.00%
Li et al. | 2003 | 78.50%
Bergstra et al. | 2006 | 82.50%
Lidy et al. | 2007 | 76.80%
Panagakis, Benetos, and Kotropoulos | 2008 | 78.20%
Panagakis et al. | 2009 | 91.00%
Panagakis, Kotropoulos | 2010 | 93.70%
Jin S. Seo | 2011 | 84.09%
Shin-Chelon Lim et al. | 2012 | 87.40%
Baniya et al. | 2016 | 87.90%
Christine Senac, Thomas Pellegrini et al. | 2017 | 91.00%
Chun Pui Tang, Ka Long Chui et al. | 2018 | 52.975%

The classification results of Panagakis et al. in 2009 and 2010 on the GTZAN dataset were 91% and 93.7%. In 2009, they used the SRC (Sparse Representation-based Classifier) technique to reduce the dimensionality of the represented information. By 2010, the authors used TNPMF (Topology Preserving Non-Negative Matrix Factorization) instead of SRC to reduce dimensions. Nowadays, the GTZAN dataset is still widely used by researchers in music genre classification. Table 1 summarizes the experimental results on the GTZAN dataset from 2002 to 2018 (sorted by ascending year).

The first research on folk songs was conducted by Wei Chai and Barry Vercoe [34] in the Multimedia Laboratory of the Massachusetts Institute of Technology in 2001. The dataset includes 187 Irish folk songs, 200 German folk songs, and 104 Austrian folk songs, collected from (1) Helmut Schaffrath's Essen folk collection (Germany) and (2) an Irish music collection by Donncha Ó Maidín. The authors used an HMM tool with 70% of the data assigned to training and 30% to testing. The highest accuracy rates for binary classification between pairs of the three music genres, Irish-German, Irish-Austrian, and German-Austrian, were 75%, 77%, and 66%, respectively. Classification over all three music genres achieved a highest accuracy rate of 63.0%.

In 2015, Nikoletta Bassiou and colleagues [35] experimented with the classification of Greek folk songs into two genres by using the CCA (Canonical Correlation Analysis) technique between the lyrics and the sound. The dataset for the experiment includes 98 songs from Pontus and 94 songs from Asia Minor, with 75% of the data for training and 25% for testing. Using a cross-evaluation method, the reported accuracy is an average over 5 test runs and reached 97.02%. In 2016, Betsy Rajesh and D. G. Bhalke [36] conducted the classification of Tamil folk songs (Southern India) on a dataset of 216 songs (103 traditional songs + 113 folk songs) with a 30-second duration for each song. The training data for each type is 70 songs, and the testing data is 33 songs and 43 songs for the two types, respectively. With the KNN classifier, the accuracy rate is 66.23%, and with the SVM classifier, the accuracy rate is 84.21%.

For Chinese folk songs, in 2017 Juan Li, Jianhang Ding, and Xinyu Yang proposed the GMM-CRF (Conditional Random Field) model [37–39] and used it for music classification by region on a dataset of 344 Chinese folk songs (109 from Shaanxi, 101 from Jiangsu, and 134 from Hunan). On average, the highest accuracy rate reached 83.72% [40].
Most recently, in 2018, Juan Li et al. [41] overcame the limitations of the GMM-CRF model (improving accuracy when the number of Gaussian components is restricted) by proposing the GMM-RBM (Restricted Boltzmann Machine) model [42,43] for a classification experiment by region on a Chinese folk song dataset including 297 folk songs from Northern Shaanxi, 278 from Jiangsu, and 262 from Hunan. The experimental results showed that the GMM-RBM model gives better results (84.71%) than the GMM-CRF model (83.72%) [40].

In Vietnam, Phan Anh Cang and Phan Thuong Cang [44] conducted a music genre classification experiment on the GTZAN dataset. They used the discrete wavelet transform to extract 19 timbral features, 6 beat features, and 5 pitch features. The experiment used a k-NN classifier (with k = 4), and the highest accuracy result is 83.5%. The Zalo AI Challenge [45] was first held in Vietnam in 2018. In this competition, the group of Dung Nguyen Ba used a CNN model for music genre classification on a Vietnamese music dataset of ten classes with 867 files, and won first prize in the competition.

The following sections describe the dataset, the GMM used for our experiments on Cheo and Quanho classification, and the classification results.

3. DATASET AND GAUSSIAN MIXTURE MODEL

3.1. Vietnamese folk songs dataset

Our dataset has a total of 1000 audio files equally distributed between the two types of folk songs Cheo and Quanho; each file is 45-60 seconds long, with a sample rate of 16 kHz and 16 bits per sample. The Cheo dataset is extracted from 25 songs, each song having 20 audio files. The Quanho dataset is also extracted from 25 songs with 20 audio files for each song. The average full length of Cheo songs is 54 seconds, and this value is 43 seconds for Quanho songs.

3.2. Gaussian mixture model

In image and speech processing and some other areas, neural networks with deep learning techniques have enabled significant results. However, traditional models and classifiers are still used in the pattern recognition field. The GMM has been used in studies related to music data processing and music genre classification [46,51]. Over the last few years, GMM has continued to be used for music genre recognition and for indexing and retrieval of music [52–60]. This is because the GMM is characterized by parameters related to the means and variances of the data, which allow modeling of the data distribution with arbitrary precision. In the same way, GMM has proved appropriate for recognition problems such as speaker recognition, dialect recognition, language identification, emotion recognition, and music genre identification [61–67]. On the other hand, in terms of model implementation, GMM allows training in a much shorter time than ANN, making complex and expensive hardware configurations including GPUs unnecessary. Therefore, for our research, in addition to our work on the ANN model [68], GMM has been selected as one of the models or classifiers to identify music genres.
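As an illustration only, the sketch below groups the dataset files by song title, which is the form needed when one GMM is trained per song. The paper specifies only the counts and durations above, so the directory layout and the `collect_files` helper are assumptions, not part of the authors' setup.

```python
# Hypothetical helper: the paper does not describe an on-disk layout, so
#   dataset/<genre>/<song_title>/<take>.wav   (25 songs x 20 files per genre)
# is assumed here purely for illustration. It groups file paths by song title.
from pathlib import Path
from collections import defaultdict

def collect_files(root="dataset", genre="cheo"):
    files_by_song = defaultdict(list)
    for wav in sorted(Path(root, genre).glob("*/*.wav")):
        files_by_song[wav.parent.name].append(str(wav))   # key: song title
    return dict(files_by_song)

# Example: cheo_files = collect_files(genre="cheo")  ->  {"song title": [20 paths], ...}
```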
The GMM with a mixture of Gaussian distributions can be considered as a linear superposition of Gaussian distributions of the form [10]

p(x) = \sum_{k=1}^{M} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k).   (1)

When using the GMM for classifying Vietnamese folk songs, x in (1) is the data vector that belongs to the set of feature vectors of each folk song, in which each element of the set has D dimensions. The \pi_k (k = 1..M) are the weights of the mixture, satisfying the condition \sum_{k=1}^{M} \pi_k = 1. Each Gaussian density function is a component of the mixture with mean \mu_k and covariance \Sigma_k

\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right).   (2)

The complete GMM is described by a set of three parameters \lambda = \{\pi_k, \mu_k, \Sigma_k\}, k = 1..M. To identify a folk song that has been modeled by \lambda, it is necessary to determine the likelihood p(X \mid \lambda)

p(X \mid \lambda) = \prod_{n=1}^{N} p(x_n \mid \lambda),   (3)

where N is the number of feature vectors and also the number of segments of the audio file for each folk song. In practice, \lambda is a statistical model, so we use the Expectation-Maximization (EM) algorithm [10] to determine \lambda such that \log p(X \mid \lambda) is maximized.

4. EXPERIMENT RESULTS

4.1. The test results with GMM

Our experiments used the Spro [69], Praat [70], and Matlab [71,72] tools to extract a set of 63 parameters: 60 parameters related to MFCC and energy (19 MFCCs + energy = 20 parameters, plus the first and second derivatives of these 20 parameters), together with tempo, intensity, and fundamental frequency (F0). In musical terminology, the tempo is the speed or pace of a given piece, often given in units of beats per minute (BPM). The beat is often defined as the rhythm listeners would tap their toes to when listening to a piece of music [73]. The 63 parameters are divided into 4 sets of parameters as shown in Table 2.

Table 2. Four sets of parameters

Sets of Parameters | Content
S1 | 60 parameters
S2 | S1 + tempo
S3 | S1 + F0 + intensity
S4 | S3 + tempo

The ALIZE toolkits [74,75] were used to implement the GMM for classification, with the Gaussian component number M varied as a power of 2, from 16 to 4096.
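For orientation, the following is a minimal sketch of this setup. It is not the authors' toolchain (they used Spro, Praat, Matlab, and ALIZE); librosa and scikit-learn are used here as stand-ins, the feature settings only approximate the 63-parameter S4 set described above, and `files_by_song` is a hypothetical mapping from song title to training file paths, so the numbers in the paper should not be expected to reproduce.

```python
# Hedged sketch: per-frame S4-like features (19 MFCCs + log energy, their first and
# second derivatives, plus F0, intensity, and a per-file tempo) and one GMM per song.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SR, HOP, NFFT = 16000, 320, 640        # 16 kHz audio, 20 ms hop as in the paper

def extract_s4_features(path):
    """Return a (frames x 63) matrix of S4-like features for one audio file."""
    y, sr = librosa.load(path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19, n_fft=NFFT, hop_length=HOP)
    rms = librosa.feature.rms(y=y, frame_length=NFFT, hop_length=HOP)
    base = np.vstack([mfcc, np.log(rms + 1e-10)])                # 19 MFCCs + log energy
    feats60 = np.vstack([base,
                         librosa.feature.delta(base),            # first derivatives
                         librosa.feature.delta(base, order=2)])  # second derivatives
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=1000.0, sr=sr, hop_length=HOP)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr, hop_length=HOP)
    tempo = float(np.atleast_1d(tempo)[0])
    n = min(feats60.shape[1], len(f0))
    extra = np.vstack([np.nan_to_num(f0[:n]),                    # F0 (0 when unvoiced)
                       20 * np.log10(rms[0, :n] + 1e-10),        # intensity in dB
                       np.full(n, tempo)])                       # tempo, constant per file
    return np.vstack([feats60[:, :n], extra]).T

def train_song_models(files_by_song, M=512):
    """Fit one diagonal-covariance GMM per folk song on its training files."""
    models = {}
    for song, paths in files_by_song.items():
        X = np.vstack([extract_s4_features(p) for p in paths])
        models[song] = GaussianMixture(n_components=M, covariance_type="diag",
                                       max_iter=100, reg_covar=1e-3).fit(X)
    return models

def identify(path, models):
    """Return the song whose GMM gives the highest average log-likelihood."""
    X = extract_s4_features(path)
    return max(models, key=lambda song: models[song].score(X))
```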
The experiments were conducted in two cases. In the first case (Figure 2), a cross-evaluation was performed on the Cheo and Quanho datasets using the full length of the songs; 80% of the dataset is used for training and 20% for testing. The purpose of this experiment is to consider the effect of the tempo, intensity, and F0 parameters on the identification results.

Figure 2. Diagram of the classification of Cheo and Quanho for the full length of audio files

Figure 3 shows the results of testing on the Quanho dataset in the first case with the 4 sets of parameters. In general, the addition of parameters increases the identification rate. The average accuracy rate of identification for S1 is 96.62%, while for S2, S3, and S4 these rates are 96.67%, 96.76%, and 96.69%, respectively, all higher than the average identification rate of S1 (Figure 4). The results of the experiment on the Cheo dataset, shown in Figure 5, lead to a similar conclusion. The average accuracy rate of identification for the parameter set S1 is 93.91%; for S2, S3, and S4 these rates are 94.00%, 94.20%, and 94.18%, respectively (Figure 6). The experimental results on the Quanho dataset for the full length of the songs are higher (by about 2 to 3%) than on the Cheo dataset, and this is also true for the corresponding excerpts, as we can see below.

In the second case, the data for training use the full length of the audio files, but the data for testing use only short excerpts extracted from the dataset (Figure 7). These excerpts vary in length from 4, 6, 8, ... up to 16 seconds and were extracted at random positions. The purpose of this experiment is to determine how the accuracy rate changes with the length of the excerpts. For the experimental results in the second case, within the scope of this paper, we present only the results corresponding to the three values M = 512, 1024, and 2048. These values of M show more clearly the effect of parameters such as tempo, intensity, and F0 on the identification results for both the Cheo and Quanho datasets.

Figure 8 shows the results for the Cheo excerpts with the three values of M mentioned above. It can be seen that when the excerpts are short, parameters such as tempo, intensity, and F0 have no significant influence on the identification rate. With M = 512 (Figure 8a), the effect of these additional parameters becomes more noticeable when the length of the excerpts is 14 seconds or longer.
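A hedged sketch of this second test case is given below. It reuses the feature extractor and per-song GMMs from the sketch at the start of Section 4.1; the random excerpt positions and the frames-per-second value follow from the assumptions made there, not from the paper's exact protocol.

```python
# Sketch of excerpt-based testing: models trained on full-length files, identification
# run on one randomly located excerpt per test file. Requires extract_s4_features()
# and the `models` dict from the previous sketch (20 ms hop -> 50 frames per second).
import numpy as np

FRAMES_PER_SEC = 50

def identify_excerpt(path, models, excerpt_sec, rng):
    """Score one random excerpt of `excerpt_sec` seconds against every song GMM."""
    X = extract_s4_features(path)                    # features for the whole test file
    n = excerpt_sec * FRAMES_PER_SEC
    start = rng.integers(0, max(1, len(X) - n))      # random position in the song
    return max(models, key=lambda s: models[s].score(X[start:start + n]))

def accuracy_by_length(test_files, models, lengths=(4, 6, 8, 10, 12, 14, 16)):
    """test_files: (path, true_song) pairs; returns accuracy for each excerpt length."""
    rng = np.random.default_rng(0)
    return {L: float(np.mean([identify_excerpt(p, models, L, rng) == s
                              for p, s in test_files]))
            for L in lengths}
```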
Figure 3. The identification rates corresponding to the 4 sets of parameters with different values of M on the Quanho dataset

Figure 4. The average identification rates with the 4 sets of parameters on the Quanho dataset
Figure 5. The identification rates corresponding to the 4 sets of parameters with different values of M on the Cheo dataset

Figure 6. The average identification rates with the 4 sets of parameters on the Cheo dataset
Figure 7. Diagram of the classification of Cheo and Quanho for the short excerpts of audio files

With M = 1024 (Figure 8b), this length is 10 seconds. With M = 2048 (Figure 8c), the impact of the additional parameters tends to decrease significantly. It can be seen that, on average, the S2, S3, and S4 parameter sets had a higher identification rate than the S1 parameter set, which means that the additional parameters had a positive influence on the identification results. In particular, as the length of the excerpt increases, the influence of the additional parameters becomes even more evident.

a) M = 512

The experimental results on the Quanho excerpts with the three values M = 512, 1024, and 2048 are shown in Figures 9a), 9b), and 9c), respectively. In this case, the additional parameters also have a positive effect on the identification results, as in the experiment on the Cheo excerpts. When M = 2048, this effect becomes more obvious (Figure 9c). The results show that with a 16-second excerpt length, on average, the identification rate reaches 91.09%, compared to 94.18% when using the entire length of the Cheo songs. With a 16-second excerpt length for the Quanho songs, this identification rate reaches 94.44%, compared to 96.89% for the full length of the audio files.
b) M = 1024

c) M = 2048

Figure 8. The identification rate based on the lengths of the Cheo excerpts with M = 512, 1024, and 2048
a) M = 512

b) M = 1024
c) M = 2048

Figure 9. The identification rate based on the lengths of the Quanho excerpts with M = 512, 1024, and 2048

4.2. The test results with i-vectors

The i-vectors and x-vectors are both capable of representing the feature parameters of a speech signal in a compact form (as a vector of fixed size, regardless of the length of the utterance). These vectors have been suggested to be suitable for speaker recognition [76,77]. The x-vector concept is newer, and these vectors are used with neural networks to recognize a speaker. The i-vectors have been used with the GMM for speaker recognition, and the following is a brief description of the i-vector and of the experimental results using i-vectors together with the GMM to classify the Vietnamese folk songs Cheo and Quanho.

An important problem posed to a speaker recognition system is how to model the inter-speaker variability and to compensate for channel/session variability in the context of GMM. In Joint Factor Analysis (JFA) [78–80], the speaker's utterance is represented by a supervector M that includes additional components in the speaker and channel/session subspaces. In particular, the speaker-dependent supervector M is defined as follows [76]

M = m + Vy + Ux + Dz.   (4)

Here, m is the session- and speaker-independent supervector (generally taken from the Universal Background Model (UBM)), V and D define the speaker subspace (they are the eigenvoice matrix and the diagonal residual, respectively), and U defines the session subspace (it is the eigenchannel matrix). The vectors x, y, and z are session- and speaker-dependent factors in their respective subspaces, and each is assumed to be a random variable with normal distribution N(0, I).
In [81] the author proposed a new space that is a single space instead of two separate spaces. This new space is called the "total variability space" and contains the channel and speaker variabilities simultaneously. With this new space, equation (4) is rewritten as follows

M = m + Tw,   (5)

where m is the session- and speaker-independent supervector, which can be taken from the UBM, T is a rectangular matrix of low rank, and w is a random vector with standard normal distribution N(0, I). The components of the vector w are the total factors, and these vectors are the identity vectors, or i-vectors. ALIZE provides tools to compute i-vectors [74,75]; in our case they were computed from the parameter set S1 and used to classify the Vietnamese folk songs Cheo and Quanho.

Figure 10. The average identification rates for PLDA and SphNormPLDA using i-vectors, and for GMM using the parameter set S1, on the Cheo dataset

Figures 10 and 11 show the identification rates for PLDA (Probabilistic Linear Discriminant Analysis) [76,82–86] and SphNormPLDA (Spherical Normalization PLDA) [82,87–89] using i-vectors, and for GMM using the parameter set S1, on the Cheo and Quanho datasets. In general, the accuracy obtained with i-vectors is lower. These results can be interpreted as follows. As mentioned above, the i-vectors are in a compact form with a fixed size, regardless of the length of the utterance, and characterize the speaker very well. However, for music genre classification, the rhythmic factors that change over time are very important. Because the compact nature of the i-vector does not take time-varying factors into account, unlike frame-by-frame processing, the result is of lower accuracy.
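To make equation (5) concrete, the toy sketch below estimates the total-factor vector w from a supervector under strong simplifying assumptions (identity residual covariance, a single accumulated supervector per recording); with a standard-normal prior on w, the MAP estimate then reduces to ridge regression. This is only a numerical illustration of the model, not the ALIZE i-vector extraction actually used in the experiments, which works from zero- and first-order Baum-Welch statistics; all dimensions in the example are made up.

```python
# Toy illustration of Eq. (5): M = m + T w, with w ~ N(0, I).
import numpy as np

def estimate_w(M_sv, m, T):
    """w_hat = (T^T T + I)^{-1} T^T (M - m): an i-vector-like total-factor estimate."""
    R = T.shape[1]
    return np.linalg.solve(T.T @ T + np.eye(R), T.T @ (M_sv - m))

# Made-up toy dimensions: a D-dimensional supervector (number of Gaussians x feature
# dimension in a real system) compressed to R total factors.
rng = np.random.default_rng(0)
D, R = 1260, 40                        # e.g. 20 Gaussians x 63 features -> 40 factors
m = rng.normal(size=D)                 # UBM mean supervector
T = rng.normal(scale=0.1, size=(D, R))
M_sv = m + T @ rng.normal(size=R) + 0.01 * rng.normal(size=D)   # synthetic recording
w = estimate_w(M_sv, m, T)             # fixed-size representation of the recording
print(w.shape)                         # (40,)
```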
Figure 11. The average identification rates for PLDA and SphNormPLDA using i-vectors, and for GMM using the parameter set S1, on the Quanho dataset

5. CONCLUSIONS

The paper presents experimental results for the identification of some Vietnamese folk songs of the Cheo and Quanho types using GMM, for which the length of the excerpts used for identification is a multiple of 2 s, from 4 s to 16 s, compared to the full length of the audio files, and the number of Gaussian components M changes as a power of 2 from 16 to 4096. With appropriate M values, the identification results showed the important effects of tempo, intensity, and fundamental frequency on the increase of the identification accuracy rate. On average, when the excerpt length is 16 s (29.63% of the full length for Cheo, 37.2% of the full length for Quanho), the identification rate was only 3.1% and 2.33% lower than for the whole song for Cheo and Quanho, respectively. For music genre identification, rhythm is an important feature, so the use of i-vectors proved not to be as efficient as it is for speaker recognition.

Our forthcoming research direction is to identify Vietnamese folk songs using other models or classifiers, including various artificial neural network models.

ACKNOWLEDGMENT

We would like to thank the School of Information and Communication Technology - Hanoi University of Science and Technology, and the Faculty of Information Technology - Hung Yen University of Technology and Education for helping us complete this paper.

REFERENCES

[1] X. Hu, J.S. Downie, K. West, and A. Ehmann, "Mining music reviews: Promising preliminary results," In Proceedings of the International Conference on Music Information Retrieval, pages 536–539, 2005.
[2] C. DeCoro, Z. Barutcuoglu, and R. Fiebrink, "Bayesian aggregation for hierarchical genre classification," In Proceedings of the International Conference on Music Information Retrieval, pages 77–80, 2007.
[3] A. Anglade, Q. Mary, R. Ramirez, and S. Dixon, "Genre classification using harmony rules induced from automatic chord transcriptions," In Proceedings of the International Conference on Music Information Retrieval, pages 669–674, 2009.
[4] P. Cunningham and S. J. Delany, "k-Nearest neighbor classifiers," arXiv preprint arXiv:2004.04523, 2020.
[5] Y. Sazaki, "Rock genre classification using k-nearest neighbor," ICON-CSE, vol. 1, no. 1, pp. 81–84, 2014.
[6] Z. Ghahramani, "An introduction to hidden Markov models and Bayesian networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9–42, 2001.
[7] Xi Shao, Changsheng Xu, and M. S. Kankanhalli, "Unsupervised classification of music genre using hidden Markov model," 2004 IEEE International Conference on Multimedia and Expo (ICME), Taipei, 2004, pp. 2023–2026, vol. 3. Doi: 10.1109/ICME.2004.1394661.
[8] J. Reed and C.H. Lee, "A study on music genre classification based on universal acoustic models," In Proceedings of the International Conference on Music Information Retrieval, pages 89–94, 2006.
[9] U. Bağcı and E. Erzin, "Boosting classifiers for music genre classification," International Symposium on Computer and Information Sciences. Springer Berlin Heidelberg, 2005.
[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2013.
[11] K. Markov and T. Matsui, "Music genre and emotion recognition using Gaussian processes," IEEE Access, vol. 2, pp. 688–697, 2014.
[12] A. Meng and J. Shawe-Taylor, "An investigation of feature models for music genre classification using the support vector classifier," In Proceedings of the International Conference on Music Information Retrieval, pages 604–609, 2005.
[13] M. Li and R. Sleep, "Genre classification via an LZ78-based string kernel," In Proceedings of the International Conference on Music Information Retrieval, pages 252–259, 2005.
[14] A.S. Lampropoulos, P.S. Lampropoulou, and G.A. Tsihrintzis, "Musical genre classification enhanced by improved source separation techniques," In Proceedings of the International Conference on Music Information Retrieval, pages 576–581, 2005.
[15] C. McKay and I. Fujinaga, "Automatic genre classification using large high-level musical feature sets," In Proceedings of the International Conference on Music Information Retrieval, pages 525–530, 2004.
[16] A. Meng, P. Ahrendt, and J. Larsen, "Improving music genre classification by short-time feature integration," In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 497–500, 2005.
[17] Jinsong Zheng, M. Oussalah, Automatic System for Music Genre Classification, 2006, ISBN 1-9025-6013-9, PGNet.
[18] Le Danh Khiem, Hoac Cong Huynh, Le Thi Chung, Quanho's cultural space. Publisher of Bac Ninh Provincial Culture and Sports Center, 2006 (Vietnamese).
[19] Hoang Kieu, Learn the ancient Cheo folk songs. Publisher of the stage - Vietnam Cheo theatre, 2001 (Vietnamese).
[20] Bui Duc Hanh, 50 ancient Cheo folk songs. Publishing House of National Culture, 2006 (Vietnamese).
[21] Hoang Kieu, Ha Hoa, Selected ancient Cheo folk songs. Publishing House of Information Culture, 2007 (Vietnamese).
[22] Nguyen Thi Tuyet, Cheo singing syllabus. Publishing House of Hanoi Academy of Theatre and Cinema, 2000 (Vietnamese).
[23] Nguyen Thi Tuyet, Ancient Cheo melodies. Publishing House of Hanoi Academy of Theatre and Cinema, 2007 (Vietnamese).
[24] Chu Ba Thanh, Trinh Van Loan, Nguyen Hong Quang, "Automatic identification of some Vietnamese folk songs," In Proceedings of the 19th National Symposium of Selected ICT Problems, Ha Noi, 2016, pages 92–97, ISBN: 978-604-67-0781-3.
[25] Chu Ba Thanh, Trinh Van Loan, Nguyen Hong Quang, "GMM for automatic identification of some Quanho Bac Ninh folk songs," In Proceedings of Fundamental and Applied IT Research (FAIR), Da Nang, 2017, pages 416–421, ISBN: 978-604-913-165-3.
[26] Chu Ba Thanh, Trinh Van Loan, Nguyen Hong Quang, "Classification and identification of Cheo and Quanho Bac Ninh folk songs," In Proceedings of Fundamental and Applied IT Research (FAIR), Ha Noi, 2018, pages 395–403, ISBN: 978-604-913-165-3.
[27] G. Tzanetakis, G. Essl, and P. Cook, "Automatic musical genre classification of audio signals," Proceedings of the 2nd International Symposium on Music Information Retrieval, Indiana, 2001.
[28] [Online]. Available:
[29] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[30] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," ISMIR 2004, 5th International Conference on Music Information Retrieval, Barcelona, Spain, October 10-14, 2004.
[31] N. Mohd, S. Doraisamy, and R. Wirza, "Factors affecting automatic genre classification: An investigation incorporating non-western musical forms," In Proceedings of the Fifth International Conference on Music Information Retrieval, 2005.
[32] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473–484, 2006.
[34] W. Chai and B. Vercoe, "Folk music classification using hidden Markov models," Proceedings of the International Conference on Artificial Intelligence, vol. 6, no. 6.4, 2001.
[35] N. Bassiou, C. Kotropoulos, and A. Papazoglou-Chalikias, "Greek folk music classification into two genres using lyrics and audio via canonical correlation analysis," 2015 9th International Symposium on Image and Signal Processing and Analysis (ISPA), Zagreb, 2015, pp. 238–243. Doi: 10.1109/ISPA.2015.7306065.
[36] B. Rajesh and D. G. Bhalke, "Automatic genre classification of Indian Tamil and western music using fractional MFCC," International Journal of Speech Technology, vol. 19, no. 3, pp. 551–563, 2016.
[37] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, June 2001, pages 282–289.
[38] A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 2846–2849.
[39] I. Heintz, E. Fosler-Lussier, and C. Brew, "Discriminative input stream combination for conditional random field phone recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 8, pp. 1533–1546, Nov. 2009. Doi: 10.1109/TASL.2009.2022204.
[40] J. Li, J. Ding, and X. Yang, "The regional style classification of Chinese folk songs based on GMM-CRF model," ICCAE '17: Proceedings of the 9th International Conference on Computer and Automation Engineering, February 2017, pages 66–72.
[41] J. Li et al., "Regional classification of Chinese folk songs based on CRF model," Multimedia Tools and Applications, vol. 78, pp. 11563–11584, 2019.
[42] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," In: Montavon G., Orr G.B., Müller K.R. (eds), Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700. Springer, Berlin, Heidelberg.
[43] J. Martel, T. Nakashika, C. Garcia, and K. Idrissi, "A combination of hand-crafted and hierarchical high-level learnt feature extraction for music genre classification," In: Mladenov V., Koprinkova-Hristova P., Palm G., Villa A.E.P., Appollini B., Kasabov N. (eds), Artificial Neural Networks and Machine Learning – ICANN 2013. Lecture Notes in Computer Science, vol. 8131. Springer, Berlin, Heidelberg.
[44] Phan Anh Cang, Phan Thuong Cang, "Music classification by genre using discrete wavelet transform," In Proceedings of Fundamental and Applied IT Research (FAIR), ISBN: 978-604-913-165-3, Can Tho, 2016, pages 395–403.
[45] [Online]. Available:
[46] K. Markov and T. Matsui, "Music genre and emotion recognition using Gaussian processes," IEEE Access, vol. 2, pp. 688–697, 2014. Doi: 10.1109/ACCESS.2014.2333095.
[47] J. Eggink and G. J. Brown, "A missing feature approach to instrument identification in polyphonic music," 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, 2003, pp. V-553. Doi: 10.1109/ICASSP.2003.1200029.
[48] T. Heittola and A. Klapuri, "Locating segments with drums in music signals," In Proceedings of the 3rd International Conference on Music Information Retrieval, 2002, pp. 271–272.
[49] M. Marolt, "Gaussian mixture models for extraction of melodic lines from audio recordings," In Proceedings of the International Conference on Music Information Retrieval, 2004.
[50] G. Fuchs, "A robust speech/music discriminator for switched audio coding," 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, 2015, pp. 569–573. Doi: 10.1109/EUSIPCO.2015.7362447.
[51] G. Sell and P. Clark, "Music tonality features for speech/music discrimination," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014, pp. 2489–2493. Doi: 10.1109/ICASSP.2014.6854048.
[52] R. Thiruvengatanadhan and P. Dhanalakshmi, "Indexing and retrieval of music using Gaussian mixture model techniques," International Journal of Computer Applications, vol. 148, no. 3, 2016.
[53] M. Jakubec and M. Chmulik, "Automatic music genre recognition for in-car infotainment," Transportation Research Procedia, vol. 40, pp. 1364–1371, 2019.
[54] S. Evstifeev and I. Shanin, "Music genre classification based on signal processing," In DAMDID/RCDL, pp. 157–161, 2018.
[55] M. Bhattacharjee, S. R. M. Prasanna, and P. Guha, "Time-frequency audio features for speech-music classification," arXiv preprint arXiv:1811.01222.
[56] M. Baelde, C. Biernacki, and R. Greff, "Real-time monophonic and polyphonic audio classification from power spectra," Pattern Recognition, vol. 92, pp. 82–92, 2019.
[57] R. Thiruvengatanadhan, "Music genre classification using GMM," International Research Journal of Engineering and Technology (IRJET), vol. 5, no. 10, Oct 2018.
[58] C. Kaur and R. Kumar, "Study and analysis of feature based automatic music genre classification using Gaussian mixture model," 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, 2017, pp. 465–468. Doi: 10.1109/ICICI.2017.8365395.
[59] B. K. Khonglah and S. M. Prasanna, "Speech/music classification using speech-specific features," Digital Signal Processing, vol. 48, pp. 71–83, 2016.
[60] H. Zhang, X.K. Yang, W.Q. Zhang, W.L. Zhang, and J. Liu, "Application of i-vector in speech and music classification," 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Limassol, 2016, pp. 1–5. Doi: 10.1109/ISSPIT.2016.7885999.
[61] M. N. Stuttle, "A Gaussian mixture model spectral representation for speech recognition," Diss. University of Cambridge, 2003.
[62] S. G. Bagul and R. K. Shastri, "Text independent speaker recognition system using GMM," 2013 International Conference on Human Computer Interactions (ICHCI), Chennai, 2013, pp. 1–5. Doi: 10.1109/ICHCI-IEEE.2013.6887781.
[63] G. Suvarna Kumar et al., "Speaker recognition using GMM," International Journal of Engineering Science and Technology, vol. 2, no. 6, pp. 2428–2436, 2010.
[64] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[65] A. Dustor and P. Szwarc, "Application of GMM models to spoken language recognition," 2009 MIXDES - 16th International Conference Mixed Design of Integrated Circuits & Systems, Lodz, 2009, pp. 603–606.
[66] K. Sarmah and U. Bhattacharjee, "GMM based language identification using MFCC and SDC features," International Journal of Computer Applications, vol. 85, no. 5, 2014.
[67] Pham Ngoc Hung, "Automatic recognition of continuous speech for Vietnamese main dialects through pronunciation modality," Doctoral Thesis, Hanoi University of Science and Technology, 2017.
[68] Quang H. Nguyen, Trang T. T. Do, Thanh B. Chu, Loan V. Trinh, Dung H. Nguyen, Cuong V. Phan, Tuan A. Phan, Dung V. Doan, Hung N. Pham, Binh P. Nguyen, and Matthew C. H. Chua, "Music genre classification using residual attention network," 2019 International Conference on System Science and Engineering (ICSSE), Dong Hoi, Viet Nam, 2019, pp. 115–119. Doi: 10.1109/ICSSE.2019.8823100.
[69] [Online]. Available:
[70] [Online]. Available: win.html
[71] [Online]. Available:
[72] [Online]. Available:
[73] [Online]. Available:
[74] J. Bonastre, F. Wils, and S. Meignier, "ALIZE, a free toolkit for speaker recognition," Proceedings (ICASSP '05), IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, 2005, pp. I/737–I/740, vol. 1. Doi: 10.1109/ICASSP.2005.1415219.
[75] T. Gannert, "A speaker verification system under the scope: Alize," Stockholm, Sweden: School of Computer Science and Engineering, 2007.
[76] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011. Doi: 10.1109/TASL.2010.2064307.
[77] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 5329–5333. Doi: 10.1109/ICASSP.2018.8461375.
[78] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1435–1447, May 2007.
[79] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1448–1460, May 2007.
[80] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of interspeaker variability in speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 5, pp. 980–988, July 2008.
[81] N. Dehak, "Discriminative and Generative Approaches for Long- and Short-Term Speaker Characteristics Modeling: Application to Speaker Verification," Ph.D. thesis, École de Technologie Supérieure, Montréal, 2009.
[82] P.-M. Bousquet et al., "Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis," Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.
[83] T. Stafylakis et al., "I-Vector/PLDA variants for text-dependent speaker recognition," Preprint submitted to Computer, Speech and Language, 2013.
[84] A. Kanagasundaram, "Speaker verification using i-vector features," Diss. Queensland University of Technology, 2014.
[85] E. Khoury, L. El Shafey, M. Ferras, and S. Marcel, "Hierarchical speaker clustering methods for the NIST i-vector challenge," in Odyssey: The Speaker and Language Recognition Workshop, no. EPFL-CONF-198439, 2014.
[86] S. Novoselov, T. Pekhovsky, and K. Simonchik, "STC speaker recognition system for the NIST i-vector challenge," in Odyssey: The Speaker and Language Recognition Workshop, 2014, pp. 231–240.
[87] A. Larcher, J.-F. Bonastre, B. Fauve, K. A. Lee, C. Levy, H. Li, J. S. Mason, and J.-Y. Parfait, "ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition," in Interspeech, Lyon, France, 2013.
[88] I. Salmun, I. Opher, and I. Lapidot, "Improvements to PLDA i-vector scoring for short segments clustering," 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE), Eilat, 2016, pp. 1–4. Doi: 10.1109/ICSEE.2016.7806108.
[89] C. Fredouille and D. Charlet, "Analysis of i-vector framework for speaker identification in TV-shows," INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014.

Received on September 20, 2019
Revised on October 26, 2020