Please use this identifier to cite or link to this item:
http://localhost:8081/jspui/handle/123456789/20464

| Title: | Multimodal Approach to Hand Gesture Recognition and Synthesis |
| Authors: | Nayan, Navneet |
| Issue Date: | Jul-2024 |
| Publisher: | IIT Roorkee |
| Abstract: | Hand gesture recognition and synthesis have an important place in the field of computer vision. Both involve close interaction between humans and computers, which gives them a prominent position and a crucial role in Human-Computer Interaction (HCI). The addition of multi-modality allows researchers to explore new avenues and opens doors for noteworthy improvements in performance. However, hand gesture recognition and synthesis suffer from many issues. In this thesis, we focus on finding efficient solutions to some of these issues in hand gesture recognition. We also explore multi-modality and observe its effect on hand gesture recognition performance. In addition, we develop an animation-based Indian Sign Language (ISL) e-learning app for the education of the hearing and speech impaired community in India. In this thesis, we take into account both isolated and continuous hand gesture recognition covering static as well as dynamic signs. Hand gesture recognition faces several challenges, such as Movement Epenthesis (ME), co-articulation, occlusion, misclassification of similar gestures, and signer-dependent recognition. Each of these issues degrades recognition performance to some extent, but the major one is the presence of ME segments in continuous sign language videos. ME segments pose a serious challenge to sign language recognition accuracy and also increase the computational load. So, in this work, we first focus on the following two research problems. 1. Developing an algorithm for detection and removal of ME segments in continuous gestures. 2. Developing a multimodal framework for isolated and continuous hand gesture recognition utilizing ME removal. For ME detection and removal, we have proposed a generic approach that detects ME in continuous sign gesture videos.
For this, we extract features by computing the singular value decomposition of the difference frame obtained as the absolute difference between every current video frame and a designated reference frame in the input gesture video. The feature values so obtained for all the frames in the gesture video are clustered into two groups. This separates the frames of the gesture video into insignificant movement epenthesis frames and meaningful valid sign frames. To cluster the gesture video frames, we use three clustering algorithms, namely K-means, Gaussian Mixture Models using Expectation Maximization, and Spectral clustering. Among these three, K-means proved to be the best suited clustering algorithm, giving superior performance to the other two. The proposed approach was tested on continuous gesture videos of ISL fingerspelling, ISL words, and the ChaLearn LAP ConGD dataset. The gesture recognition accuracy and computational load, in terms of the number of frames processed during gesture recognition, were compared before and after removal of movement epenthesis frames. We find a significant improvement in gesture recognition accuracy and a noteworthy reduction in the number of frames processed during the recognition stage due to removal of ME frames. The efficiency of a hand gesture recognition framework primarily depends on two factors. One is how cleanly the gestures are fed to the framework, or in other words, how well the constituent gestures are segmented from one another. The second factor relates to the extraction, selection, and type of features. In gesture videos, spatial and temporal features play crucial roles, and several 2D and 3D deep neural networks are used to extract them. In addition, multi-modality has shown its importance in several computer vision applications, hand gesture recognition being one of them.
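The ME detection step described above can be illustrated with a small numpy sketch. This is not the thesis implementation: the feature dimensionality, the toy "video", and the simple two-centroid K-means below are illustrative assumptions. It shows the pipeline of singular-value features from absolute difference frames, followed by two-way clustering that separates near-static (ME-like) frames from high-motion sign frames.

```python
import numpy as np

def me_features(frames, ref):
    """Per-frame feature: leading singular values of the absolute
    difference between each frame and a designated reference frame."""
    feats = []
    for f in frames:
        diff = np.abs(f.astype(float) - ref.astype(float))
        s = np.linalg.svd(diff, compute_uv=False)
        feats.append(s[:5])  # keep the few largest singular values
    return np.array(feats)

def kmeans_2(x, iters=50):
    """Tiny 2-cluster K-means; cluster 0 (seeded at x[0]) collects the
    low-motion frames, treated here as movement-epenthesis frames."""
    # seed centroids at the first point and the point farthest from it
    far = int(np.argmax(np.linalg.norm(x - x[0], axis=1)))
    c = np.stack([x[0], x[far]]).astype(float)
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - c[None], axis=2)
        lab = d.argmin(axis=1)
        for k in range(2):
            if (lab == k).any():
                c[k] = x[lab == k].mean(axis=0)
    return lab

# toy clip: 4 near-static (ME-like) frames, then 4 high-motion frames
rng = np.random.default_rng(1)
ref = rng.random((32, 32))
frames = [ref + rng.normal(0, 0.01, ref.shape) for _ in range(4)] + \
         [rng.random((32, 32)) for _ in range(4)]
labels = kmeans_2(me_features(frames, ref))
# frames labelled 0 would be dropped before recognition
```

Because the singular values of a near-zero difference frame are tiny while those of a genuine motion frame are large, the two groups separate cleanly even with this minimal clustering.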
In our proposed research on isolated and continuous hand gesture recognition, we have employed multi-modality, utilized ME detection and removal, and proposed a novel modality based on the temporal difference that extracts hand regions, removes gesture-irrelevant factors, and provides the temporal information contained in hand gesture videos. We have also utilized the different modalities available in various publicly available large-scale isolated and continuous hand gesture datasets. Several combinations of these modalities, in addition to our proposed modality, have been used in the multimodal framework. For feature extraction from these modalities, the GoogLeNet CAFFE model has been used. A set of discriminative features is derived by fusing the features acquired from these modalities into a feature vector representing the query sign or hand gesture. A Bidirectional Long Short-Term Memory network (Bi-LSTM) is designed for classification. The proposed multimodal framework is tested on various publicly available continuous and isolated hand gesture datasets such as ChaLearn LAP IsoGD, ChaLearn LAP ConGD, IPN Hand, and NVGesture. Our experiments show that the proposed framework performs exceptionally well on individual modalities as well as on combinations of modalities from these datasets. Moreover, the combined effect of the proposed modality and ME frame removal leads to a significant improvement in gesture recognition accuracy and a substantial reduction in computational burden. The obtained results show that our proposed multimodal framework performs on par with state-of-the-art methods. Hand gesture synthesis and animation is another facet of communication between a normal hearing person and a person with hearing and speech impairment. It is equally important for providing education to the hearing and speech impaired community.
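The feature-level fusion step in the framework above can be sketched as simple concatenation of per-frame modality descriptors. The feature dimensions and the random stand-in features below are assumptions for illustration; in the thesis, the per-modality features come from the GoogLeNet CAFFE model and the fused sequence is classified by a Bi-LSTM.

```python
import numpy as np

def fuse_modalities(feature_maps):
    """Concatenate per-frame feature vectors from several modalities
    (e.g. RGB, depth, and the proposed temporal-difference modality)
    into one fused descriptor per frame (feature-level fusion)."""
    # each entry: (num_frames, feat_dim) array for one modality
    return np.concatenate(feature_maps, axis=1)

# random stand-ins for CNN features of a 10-frame gesture clip;
# 1024 is an illustrative dimension, not the thesis value
rgb   = np.random.rand(10, 1024)
depth = np.random.rand(10, 1024)
tdiff = np.random.rand(10, 1024)  # proposed temporal-difference modality

fused = fuse_modalities([rgb, depth, tdiff])
# fused has shape (10, 3072); this sequence would be fed to a Bi-LSTM
```

Concatenation keeps each modality's contribution intact and lets the downstream sequence classifier learn which dimensions are discriminative for a given sign.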
It can also be used to train normal hearing persons in sign language so as to make them capable of communicating with the community suffering from hearing and speech impairment. Animated videos are, in many contexts, better than live-action videos: they are more creative and consistent in terms of visuals and design, and their flexibility in editing and design makes changes easier and more cost-effective than in live-action videos. When it comes to conveying a message to the differently-abled community, animated videos provide more inclusiveness and leave a longer-lasting impression than live-action videos due to their creativity and design. To the best of our knowledge, an animation-based ISL fingerspelling e-learning app is still missing in the literature reported so far. So, in our work on hand gesture animation, an animation-based ISL e-learning app for the education of the hearing and speech impaired community has been developed. This app can play animations of ISL fingerspelling (alphabets, digits, or combinations of these) and some ISL words as per the given input. For this, tasks like hand region extraction, key frame identification, hand parameter extraction, and hand gesture synthesis and animation using image metamorphosis are performed sequentially. With the help of MATLAB App Designer, a standalone MATLAB application for ISL fingerspelling (including digits, alphabets, and combinations of these two) and some ISL words of daily usage is developed. This app can be used by the community suffering from hearing and speech impairment for learning Indian Sign Language even while seated at home, or for training signers and teachers at special schools. Synthesis and generation of sentence-level sign language is a more challenging task than synthesizing alphabets, digits, and words.
This challenge arises from the need to synthesize smooth transition segments between two consecutive signs in a gesture sequence. In this work, we propose a multimodal approach for the synthesis of smooth transition segments, thereby producing an effective sentence-level sign language synthesis system. Our proposed system learns from edge features and trajectory features to synthesize these smooth transitions. |
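The idea of generating in-between frames for gesture animation can be sketched in miniature. True image metamorphosis, as used in the thesis, combines feature-based warping with blending; the linear cross-dissolve below is a deliberately simplified stand-in that only illustrates how a sequence of intermediate frames is produced between two key frames.

```python
import numpy as np

def cross_dissolve(key_a, key_b, n_steps):
    """Generate n_steps frames between two key frames by linear
    blending -- a simplified stand-in for image metamorphosis
    (which additionally warps geometry before blending)."""
    frames = []
    for t in np.linspace(0.0, 1.0, n_steps):
        frames.append((1 - t) * key_a + t * key_b)
    return frames

# toy key frames standing in for two extracted hand poses
a = np.zeros((64, 64))
b = np.ones((64, 64))
seq = cross_dissolve(a, b, 5)
# seq starts at key frame a, ends at key frame b, midpoint is the 50/50 blend
```

In a real morph, corresponding hand landmarks in the two key frames drive a geometric warp at each step, so shapes move rather than merely fade into each other.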
| URI: | http://localhost:8081/jspui/handle/123456789/20464 |
| Research Supervisor/ Guide: | Ghosh, Debashis and Pradhan, Pyari Mohan |
| metadata.dc.type: | Thesis |
| Appears in Collections: | DOCTORAL THESES (E & C) |
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| 17915014_NAVNEET NAYAN.pdf | | 10.59 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
