Improving Hand Posture Recognition Performance Using Multi-Modalities (Nâng cao hiệu quả nhận dạng hình trạng bàn tay kết hợp nhiều luồng dữ liệu)


TẠP CHÍ KHOA HỌC VÀ CÔNG NGHỆ NĂNG LƯỢNG - TRƯỜNG ĐẠI HỌC ĐIỆN LỰC (ISSN: 1859-4557)

IMPROVING HAND POSTURE RECOGNITION PERFORMANCE USING MULTI-MODALITIES
(Nâng cao hiệu quả nhận dạng hình trạng bàn tay kết hợp nhiều luồng dữ liệu)

Doan Huong Giang
Electric Power University

Received: 26/02/2021; Accepted: 16/03/2021; Reviewer: Assoc. Prof. Dr. Ngô Quốc Tạo

Abstract: Hand gesture recognition has been researched for a long time. However, the performance of such methods in practical applications still faces many challenges due to variations in hand pose and hand shape, viewpoints, complex backgrounds, illumination, and subject style. In this work, we investigate in depth how the hand is represented by various feature extractors applied to independent data streams (RGB images and Depth images). We then concatenate features from the different modalities to obtain very competitive accuracy. To evaluate the robustness of the method, two datasets are used: the first is a self-captured dataset of six hand gestures recorded indoors against complex backgrounds; the second is a published dataset of ten hand gestures. Experiments with RGB and/or Depth images on the two datasets show that combining information flows has a strong impact on recognition results. In addition, the performance of the CNN method is mostly increased by multi-feature combination, and its results are compared with those of hand-crafted feature extractors. The proposed method suggests a feasible and robust solution to the technical issues in developing HCI applications based on hand posture recognition.

Keywords: Electronic home appliances, deep learning, machine learning, hand posture/gesture recognition, human-machine interaction, multi-modalities, late fusion, early fusion.

Tóm tắt: Hand gesture recognition has been studied for some time. However, this research area still faces many challenges before it can be deployed in practice: many different hand postures exist, the shape of a given posture varies, as do viewpoints, background complexity, lighting conditions, and the way each person performs a gesture. This paper studies hand representations using different classifiers on separate information streams (RGB color images and Depth images). The features are then combined to improve recognition performance. Experiments are carried out on different datasets, including a self-collected dataset and a dataset published online for the research community. In addition, the author also uses artificial neural networks and compares them with solutions based on hand-designed feature extractors. The results show that the neural-network-based solution performs better, and that the proposed combination of information streams improves every classifier compared with using each stream separately. These results indicate that the proposed solution is feasible for deployment in human-device interaction using hand gestures.

Từ khóa: Home electronic appliances, deep learning, machine learning, hand gesture recognition, human-machine interaction, multi-modality, late fusion, early fusion.

1. INTRODUCTION

Hand gesture recognition has become an attractive field in computer vision [5][11][17][19][20] because of the huge range of its applications, such as Human-Machine Interaction (HCI) [15], entertainment, virtual reality [18][21], autonomous vehicles [15], and so on. However, system performance (accuracy, time cost, etc.) still faces many challenges due to the varied appearance of hand poses, non-rigid objects, different scales, many degrees of freedom, illumination, and complex backgrounds. Thanks to the development of new, low-cost depth sensors, new opportunities for posture recognition have emerged. Microsoft Xbox-Kinect is a successful commercial product that provides both RGB and Depth information for recognizing hand gestures to control game consoles [12]. Combining these data streams can be expected to improve recognition results. In particular, Convolutional Neural Networks (CNNs) [14][16] have emerged as a promising technique to resolve many issues of posture/gesture recognition. Although utilizing CNNs has obtained impressive results [13][15], many challenges still have to be carefully addressed before applying it in practice.

The remainder of this paper is organized as follows: Section 2 describes our proposed approach. The experiments and results are analyzed in Section 3. Section 4 concludes the paper and recommends some future work.

2. PROPOSED METHOD

The main workflow for hand pose recognition from RGB and Depth modalities consists of a series of cascaded steps, as shown in Fig. 1. Given RGB-Depth images, hand regions are detected, extracted, and recognized. The steps are presented in detail in the following sections.

2.1. Pre-processing data

RGB images (I_RGB) and Depth images (I_D) are captured by the Kinect camera version 1 (640×480 pixels). Because the pixel coordinates are reflected between the two sensors, the images must be calibrated, as presented in detail in our previous research [7].
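The calibration of Sec. 2.1 can be sketched as a coordinate re-mapping of the Depth image into the RGB frame. The actual Kinect v1 calibration from [7] is not given in the paper, so a generic 3×3 homography H stands in for it here; all names are illustrative, not the paper's code.

```python
import numpy as np

def register_depth_to_rgb(depth, H):
    """Warp every Depth pixel coordinate into the RGB frame using a 3x3
    homography H (a stand-in for the Kinect v1 calibration of [7])."""
    h, w = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous coords
    mapped = H @ pts
    mx = np.round(mapped[0] / mapped[2]).astype(int)
    my = np.round(mapped[1] / mapped[2]).astype(int)
    ok = (mx >= 0) & (mx < w) & (my >= 0) & (my < h)
    out[my[ok], mx[ok]] = depth[ys.ravel()[ok], xs.ravel()[ok]]
    return out
```

For example, the "reflected" coordinates mentioned above correspond to a mirroring homography such as H = [[-1, 0, w-1], [0, 1, 0], [0, 0, 1]], which flips the Depth image horizontally into alignment with the RGB image.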
Fig. 1. Proposed framework for hand posture recognition

2.2. Hand detection

This step aims to obtain the coordinates of the hand region in the image. A Haar-like cascade, an object detection algorithm from machine learning proposed by [1], is used. It is composed of four stages: Haar feature selection, integral image creation, AdaBoost training, and cascade classification. In this paper, we use pre-trained Haar cascade models that went through all those stages on a large hand dataset; the XML files of the pretrained models are published at [2]. We use three hand-part models: Palm.xml, Wrist.xml and Hand.xml.

Given an input RGB image I_RGB, it is passed through the three Haar-like cascade models to detect the RGB hand region I_Hand^RGB (Fig. 2(c,d)), as presented in equation (1):

I_Hand^RGB = F_crop(F_union(F_Palm(I_RGB), F_Wrist(I_RGB), F_Hand(I_RGB)))   (1)

Next, the coordinates of the RGB hand region are marked on the Depth image I_Depth, giving the Depth hand region I_Hand^Depth (Fig. 2(e)), as in equation (2):

I_Hand^Depth = F_crop(F_mark(I_Hand^RGB, I_Depth))   (2)

Fig. 2. Hand region detection

2.3. Hand posture representation

In this section, the cropped hand images I_Hand^RGB and I_Hand^Depth are represented both by hand-crafted methods (Sec. 2.3.1) and by a deep-learning method (Sec. 2.3.2). The resulting representations are then classified as presented in Sec. 2.4.

2.3.1. Handcraft-based method

In this part, some state-of-the-art descriptors are used to extract features for the hand pose: SIFT [7], SURF [8], HOG [9] and KDES [10]. Using each descriptor, the N_i most important key-points of the hand are detected.
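Equations (1)-(2) can be sketched as follows, assuming OpenCV's CascadeClassifier and the three pretrained XML models from [2] available on disk. The union_boxes helper implements F_union over the per-model detections; cropping the same box from the calibrated Depth image plays the role of F_mark and F_crop in (2).

```python
def union_boxes(boxes):
    """F_union in equation (1): the bounding box enclosing every
    (x, y, w, h) detection returned by the cascades."""
    if len(boxes) == 0:
        return None
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)

def detect_hand(rgb, depth, model_paths=("Palm.xml", "Wrist.xml", "Hand.xml")):
    """Equations (1)-(2): detect the hand box on the RGB image with the three
    pretrained cascades from [2], crop it, and crop the same (marked) box on
    the already calibrated Depth image."""
    import cv2  # imported lazily so union_boxes stays usable without OpenCV
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    boxes = []
    for path in model_paths:
        boxes.extend(cv2.CascadeClassifier(path).detectMultiScale(gray))
    box = union_boxes(boxes)
    if box is None:
        return None, None
    x, y, w, h = box
    return rgb[y:y + h, x:x + w], depth[y:y + h, x:x + w]
```

This is a sketch under the assumption that a single enclosing box is the intended union; the paper does not spell out how overlapping Palm/Wrist/Hand detections are merged.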
The hand gesture is then represented by the feature vectors F^(1) = F_SIFT, F^(2) = F_SURF, F^(3) = F_HOG and F^(4) = F_KDES, presented in detail in equations (3)-(6) (where N1 = 256 and N2 = N3 = N4 = 1024):

F^(1)_{RGB/Depth} = F_SIFT = [K^(1)_1 ... K^(1)_{N1}]^T   (3)
F^(2)_{RGB/Depth} = F_SURF = [K^(2)_1 ... K^(2)_{N2}]^T   (4)
F^(3)_{RGB/Depth} = F_HOG = [K^(3)_1 ... K^(3)_{N3}]^T   (5)
F^(4)_{RGB/Depth} = F_KDES = [K^(4)_1 ... K^(4)_{N4}]^T   (6)

2.3.2. Deep learning method

Recently, deep learning has been widely used in computer vision for various tasks such as feature extraction, recognition, and identification. In this research, the Resnet50 model is utilized to extract features of the human hand. This convolutional neural network is composed of five convolutional stages and an FC layer. The architecture of the pretrained Resnet50 network is illustrated in Fig. 3.

Fig. 3. The Resnet50 architecture

The cropped images I_Hand^RGB and I_Hand^Depth have different sizes, so they are resized to 224×224 pixels. These same-size hand images are fed into the Resnet50 convolutional neural network, which is used as a feature extractor. The feature is taken at the last FC layer of the network, and its size is 1×N5 with N5 = 1000, as in equation (7):

F^(5)_{RGB/Depth} = F_Resnet50 = [K^(5)_1 ... K^(5)_{N5}]^T   (7)

2.4. Hand gesture classification

The features F^(1), ..., F^(5) (extracted from RGB and Depth information) are utilized as inputs to the classification strategies, late fusion and early fusion, presented in the following sections.

2.4.1. Early fusion strategy

In this part, the modalities of the hand posture representation are extracted from the RGB and Depth hand images (as presented in the previous sections) by the five descriptors.
The RGB and Depth features of the same extractor are normalized and concatenated, as in equation (8):

F^(i)_multi = || F^(i)_RGB , F^(i)_Depth || ,   i ∈ {1, 2, 3, 4, 5}   (8)

Next, the F_multi features are the inputs of the SVM classifier [6], used as a multi-class SVM. The output of the multi-class SVM is one value in {0, 1, 2, ..., N}, where N is the number of gesture IDs in the dataset.

Fig. 4. Early fusion strategy

2.4.2. Late fusion strategy

Differing from the early fusion method, in this case the features derived from the multimodal data are used independently as inputs to separate SVM classifiers [6], as illustrated in Fig. 5. Then, at the decision layer, the output scores of the classifiers form decision vectors (D_RGB and D_Depth) that are combined to obtain the final result. This method requires one more classifier and a longer time cost than early fusion. On the other hand, it is more flexible and easier to extend, and it allows using the classifier best suited to each modality. The results of the fusion strategies are presented in detail in Sec. 3.

Fig. 5. Late fusion strategy

3. EXPERIMENTAL RESULTS

The proposed framework is wrapped in a Python program on a PC with a Core i6 4.2 GHz CPU, 8 GB RAM, and an NVIDIA 8 GB GPU. We evaluate the performance of hand gesture recognition on the EPUHandPose2 dataset and the KinectLeap [19] dataset. In all evaluations, we follow the Leave-p-out cross-validation method presented in detail in [11], with p equal to 1. This means that the gestures of one subject are used for testing and the remaining subjects are used for training.
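The early fusion of equation (8) plus the multi-class SVM can be sketched as follows. L2 normalization and scikit-learn's SVC are assumptions; the paper only says the features "are normalized" and cites a generic SVM [6].

```python
import numpy as np
from sklearn.svm import SVC

def early_fuse(f_rgb, f_depth):
    """Equation (8): L2-normalize each modality's feature vectors (one row
    per sample) and concatenate them. The L2 choice is an assumption."""
    f_rgb = f_rgb / (np.linalg.norm(f_rgb, axis=1, keepdims=True) + 1e-12)
    f_depth = f_depth / (np.linalg.norm(f_depth, axis=1, keepdims=True) + 1e-12)
    return np.hstack([f_rgb, f_depth])

# Toy usage: two gesture classes with synthetic 4-D features per modality.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
f_rgb = centers[labels] + 0.1 * rng.normal(size=(40, 4))
f_depth = centers[labels] + 0.1 * rng.normal(size=(40, 4))
clf = SVC(kernel="rbf").fit(early_fuse(f_rgb, f_depth), labels)
pred = clf.predict(early_fuse(f_rgb, f_depth))  # gesture IDs in {0, 1}
```

The fused vector simply doubles the per-descriptor dimension, so the same SVM kernels compared in Sec. 3.2 apply unchanged.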
Three evaluations are considered: (1) which performs better, hand-crafted or deep-learning features; (2) a comparison of recognition rates across the kernels of the SVM classifier; (3) a comparison of the accuracy of the fusion strategies. The detailed evaluations are presented in the following sub-sections.

3.1. Evaluation of hand recognition rate on different feature representations

In this evaluation, we test the accuracy of the various feature representations with an SVM classifier. The EPUHandPose2 dataset is used, and accuracy is evaluated on the RGB and Depth modalities independently.

Fig. 6. Accuracy with the different feature representations

Looking at Fig. 6, it is apparent that the Resnet-based descriptor obtains the best percentage on both flows: 95.3% for RGB and 90.5% for Depth. The SIFT accuracy is the lowest overall, standing at 34.5% and 28.3%, respectively. Among the hand-crafted extractors, the KDES feature obtains the best results (90.3% and 78.2%), dramatically higher than the remaining extractors. Moreover, the accuracies of the hand-crafted methods (SIFT, SURF, HOG, and KDES) are far below the deep-learning-based approach. Therefore, the Resnet model and the KDES model are used in the next experiments.

3.2. Comparison between kernels of SVM classifiers

In this section, two evaluations are performed on four kernels of the SVM classifier: Linear, Sigmoid, RBF and Poly. Two datasets are used: the EPUHandPose2 dataset and the KinectLeap [19] dataset.

In the first case, the best hand-crafted extractor, KDES, is used. A glance at Table 1 makes clear that the RBF kernel obtains the best accuracy on all modalities in both datasets: 90.3% and 78.2% for RGB and Depth on the EPU dataset, and 78.4% and 60.2% respectively on the KinectLeap dataset.

Table 1. Hand-crafted feature (KDES) on various kernels of SVM

Kernel   | EPU dataset (RGB / Depth) | KinectLeap (RGB / Depth)
Linear   | 85.1 / 72.2               | 73.4 / 58.6
Sigmoid  | 15.6 / 11.2               | 13.6 / 10.8
RBF      | 90.3 / 78.2               | 78.4 / 60.2
Poly     | 86.9 / 78.3               | 71.3 / 60.1

In the second case, we evaluate the deep-learning-based feature. Table 2 shows a comparison of the recognition accuracy of the four kernels on the two datasets. It is evident that the RBF kernel has by far the highest percentage in almost all cases (97.2% and 92.6% for the EPU dataset; 93.2% and 90.6% for the KinectLeap dataset).

Table 2. Deep-learning-based feature (Resnet50) on various kernels of SVM

Kernel   | EPU dataset (RGB / Depth) | KinectLeap (RGB / Depth)
Linear   | 95.3 / 90.5               | 91.6 / 89.7
Sigmoid  | 25.7 / 23.4               | 27.8 / 15.4
RBF      | 97.2 / 92.6               | 93.2 / 90.6
Poly     | 86.9 / 78.3               | 88.9 / 79.3

Moreover, the results of the Linear kernel are only slightly lower than the RBF kernel (95.3% and 90.5% for the EPU dataset; 91.6% and 89.7% for KinectLeap). This evaluation illustrates that the deep-learning method is more efficient than the hand-crafted methods: even the best hand-crafted pipeline (KDES-SVM) falls far below the CNN method. Also noteworthy is the fact that the Resnet50 model is the more likely candidate for deployment in a real application.

3.3. Hand gesture recognition using various fusion strategies

Fig. 7 shows the hand pose recognition accuracy of the fusion methods (late fusion and early fusion) of the two cues (RGB and Depth flows). It can be seen that combining both RGB and Depth information obtains higher accuracy than either independent cue. Additionally, the early fusion method obtains the best hand gesture recognition accuracy, with the highest values at 99.1% on the EPUHandPose2 dataset and 94.3% on the KinectLeap dataset, while the late fusion approach reaches 96.3% and 91.6% on the two datasets, respectively.

Fig. 7. Accuracy with the fusion strategies
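The late-fusion scheme evaluated above, together with the leave-one-subject-out protocol (p = 1), can be sketched as follows. Averaging the decision vectors D_RGB and D_Depth is an assumption; the paper only says they are combined at the decision layer. scikit-learn names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut

def late_fusion_predict(f_rgb, f_depth, y, train, test):
    """Train one RBF-SVM per modality, then sum their per-class decision
    scores (D_RGB + D_Depth) and take the arg-max class."""
    scores, classes = 0, None
    for f in (f_rgb, f_depth):
        clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(f[train], y[train])
        scores = scores + clf.decision_function(f[test])
        classes = clf.classes_
    return classes[np.argmax(scores, axis=1)]

# Toy leave-one-subject-out run: 3 gestures x 3 subjects x 5 samples each.
rng = np.random.default_rng(1)
y = np.tile(np.repeat([0, 1, 2], 5), 3)
groups = np.repeat([0, 1, 2], 15)          # subject IDs: test one, train the rest
centers = np.eye(3)
f_rgb = centers[y] + 0.1 * rng.normal(size=(45, 3))
f_depth = centers[y] + 0.1 * rng.normal(size=(45, 3))
accs = []
for train, test in LeaveOneGroupOut().split(f_rgb, y, groups):
    pred = late_fusion_predict(f_rgb, f_depth, y, train, test)
    accs.append(np.mean(pred == y[test]))
```

LeaveOneGroupOut mirrors the protocol of [11]: each fold holds out all samples of one subject, so reported accuracy reflects generalization to unseen performers.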
4. DISCUSSION AND CONCLUSION

In this research, an approach for a hand pose recognition system that combines multi-modalities (RGB and Depth images) has been presented. The paper has investigated in depth the results of several state-of-the-art feature extraction methods (SIFT, SURF, KDES, HOG) and a deep-learning method, and has experimented with various kernels of the SVM in order to choose the most suitable model for the different features. Experiments are conducted on our captured dataset and on a published dataset. The evaluations lead to the following conclusions: i) for both hand-crafted and CNN features, the proposed fusion obtained the highest performance on both datasets; it is a simple approach and yields a high-accuracy system, so one recommendation is to combine it with other features, such as optical flow and/or hand texture, and to create a larger training dataset to obtain even higher hand posture recognition accuracy; ii) the proposed method will be evaluated on other published datasets.

5. ACKNOWLEDGMENT

This research is funded by Electric Power University under the project "Control home appliances using Computer Vision and Artificial Intelligent".

REFERENCES

[1] P.A. Viola and M.J. Jones, "Rapid object detection using a boosted cascade of simple features", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I-I, 2001.

[2]

[3] Caltech dataset:

[4] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting", in Proceedings of ECCV, pp. 512-528, 2014.

[5] Huong-Giang Doan and Van-Toi Nguyen, "Improving Dynamic Hand Gesture Recognition on Multi-views with Multi-modalities", International Journal of Machine Learning and Computing, Vol. 9, No. 6, pp. 795-800, 2019.

[6] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Vol. 43, pp. 1-43, 1997.
[7] David G. Lowe, "Method and apparatus for identifying scale invariant features in an image and use of same for locating an object in an image", US patent for the SIFT algorithm, 2004.

[8] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features", Proceedings of the 9th European Conference on Computer Vision (ECCV), pp. 404-417, 2006.

[9] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", International Conference on Computer Vision & Pattern Recognition (CVPR '05), United States, pp. 886-893, 2005.
[10] L. Bo, X. Ren, and D. Fox, "Kernel descriptors for visual recognition", in Advances in Neural Information Processing Systems, pp. 244-252, 2010.

[11] Dang-Manh Truong, Huong-Giang Doan, Thanh-Hai Tran, Hai Vu, and Thi-Lan Le, "Robustness Analysis of 3D Convolutional Neural Network for Human Hand Gesture Recognition", International Journal of Machine Learning and Computing (IJMLC), Vol. 9, No. 2, pp. 135-142, April 2019.

[12]

[13] F. Zhan, "Hand Gesture Recognition with Convolution Neural Networks", 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, pp. 295-298, 2019.

[14] A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classification with deep convolutional neural networks", in Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, Curran Associates Inc., USA, pp. 1097-1105, 2012.

[15] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, "Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4207-4215, 2016.

[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition", CoRR, abs/1512.03385, 2015.

[17] Huong-Giang Doan, V.T. Nguyen, H. Vu, and T.H. Tran, "A combination of user-guide scheme and kernel descriptor on RGB-D data for robust and realtime hand posture recognition", Engineering Applications of Artificial Intelligence, Vol. 49, pp. 103-113, Mar. 2016.

[18] Y. Li, J. Huang, F. Tian, H.-A. Wang, and G.-Z. Dai, "Gesture interaction in virtual reality", Virtual Reality & Intelligent Hardware, Vol. 1, No. 1, pp. 84-112, 2019.

[19] G. Marin, F. Dominio, and P. Zanuttigh, "Hand gesture recognition with Leap Motion and Kinect devices", IEEE International Conference on Image Processing (ICIP), Paris, France, 2014.

[20] A. Sharma, A. Mittal, S. Singh, and V. Awatramani, "Hand Gesture Recognition using Image Processing and Feature Extraction Techniques", Procedia Computer Science, Vol. 173, pp. 181-190, 2020.

[21] M.S. Del Rio Guerra, J. Martin-Gutierrez, R. Acevedo, and S. Salinas, "Hand Gestures in Virtual and Augmented 3D Environments for Down Syndrome Users", Applied Sciences, Vol. 9, pp. 1-16, 2019.

About the author:

Doan Huong Giang received her B.E. degree in Instrumentation and Industrial Informatics in 2003, M.E. in Instrumentation and Automatic Control Systems in 2006, and Ph.D. in Control Engineering and Automation in 2017, all from Hanoi University of Science and Technology, Vietnam. She is a lecturer at the Control and Automation Faculty, Electric Power University, Ha Noi, Viet Nam. Her current research centers on human-machine interaction using image information, action recognition, manifold space representation for human action, and computer vision.