Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/19575
Title: HUMAN ACTIVITY RECOGNITION USING DEEP LEARNING TECHNIQUES
Authors: Imran, Javed
Issue Date: Oct-2020
Publisher: IIT Roorkee
Abstract: Human activity recognition is an important topic in the area of computer vision research, as it has a wide variety of applications in surveillance, video indexing and retrieval, human-computer interaction, health monitoring, intelligent systems, and similar domains. Though a lot of progress has been made in the field of action recognition, it remains a challenging task due to variation in appearance, motion patterns, lighting conditions, camera viewpoint, change in scale, background clutter, and partial occlusions. This thesis presents deep learning frameworks for automatic recognition of human actions at four different levels of complexity: gestures, actions, interactions and group activities. Throughout this research, the aim is not only to develop robust and accurate models but also to reduce the space and time complexity, making the entire system suitable for real-world applications. The work presented in the thesis is divided into seven chapters. To begin with, Chapter 1 presents an introduction to the topic of human action recognition, while an in-depth literature survey is conducted in Chapter 2 to evaluate and compare the current state-of-the-art techniques. These techniques cover both handcrafted and deeply learned feature-based methods for all four levels of human action recognition, i.e., gestures, actions, interactions and group activities. In Chapter 3, sign language recognition (SLR) and egocentric activity recognition (EAR) frameworks are proposed using convolutional neural networks (ConvNets). For SLR, three types of motion templates are used to fine-tune three ConvNets, whose outputs are combined using a fusion technique based on a kernel-based extreme learning machine (KELM). The proposed approach is validated on several publicly available sign language datasets, and state-of-the-art results are achieved.
Similarly for EAR, RGB videos and sensor data are utilized to build a robust recognition framework that can provide real-time performance on mobile and embedded devices. Chapter 4 presents a multi-modal sensor fusion framework to recognize human actions using RGB, skeleton and inertial sensor data. We also present a series of guidelines related to data augmentation, network architecture, and fusion strategies, so as to exploit the complementary information from different modalities in the best possible manner. In Chapter 5, we move towards the relatively less explored area of infrared video-based human action recognition. Our primary contribution in this chapter is a new infrared dataset with 27 action classes, comprising single as well as multiple persons or objects. Further, the dataset is recorded at a variety of locations, with varying viewpoints, so as to match real-world scenarios as closely as possible. We also propose a local and global spatio-temporal feature aggregation network using state-of-the-art deep neural networks. In the latter half of this chapter, we deal with the highest level of human action recognition, i.e., group activities, by solving the problem of violent activity recognition in videos. Our proposed method is not only accurate but also preserves the privacy of all the subjects using an encryption function based on randomization of pixel values. We conducted experiments on three different violent activity datasets and achieved the best results without compromising the efficiency of the recognition system in terms of space and time complexity. In Chapter 6, a three-stream spatio-temporal network with an attention mechanism is designed to solve the problem of first-person action and interaction recognition. We utilize 3D ConvNets to extract low-level features from RGB frame difference, optical flow and magnitude-orientation streams. These features are fused together and fed to a recurrent network to model high-level temporal feature sequences.
An attention mechanism is also proposed to focus on important snippets, thereby improving the results by a significant margin compared to previous methods. Finally, Chapter 7 concludes the thesis with important discussions and directions for future research.
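The abstract does not specify the exact encryption function used for privacy preservation in Chapter 5, only that it is based on randomization of pixel values. As a minimal illustrative sketch (not the thesis's actual method), one such scheme applies a keyed random permutation of intensity values to every pixel, which obscures appearance while remaining exactly invertible with the key:

```python
import random

def make_lut(key, levels=256):
    # Keyed random permutation of intensity values (hypothetical scheme;
    # the thesis's exact encryption function is not given in the abstract).
    rng = random.Random(key)
    lut = list(range(levels))
    rng.shuffle(lut)
    return lut

def encrypt_frame(frame, lut):
    # Replace every pixel intensity with its permuted counterpart.
    return [[lut[p] for p in row] for row in frame]

def decrypt_frame(frame, lut):
    # Invert the permutation to recover the original frame.
    inv = [0] * len(lut)
    for src, dst in enumerate(lut):
        inv[dst] = src
    return [[inv[p] for p in row] for row in frame]

frame = [[0, 128, 255], [37, 37, 200]]   # toy grayscale frame
lut = make_lut(key=42)
enc = encrypt_frame(frame, lut)
assert decrypt_frame(enc, lut) == frame  # round-trips with the key
```

A recognition model trained directly on such permuted frames can classify activity without any party ever viewing the raw pixels, which is the privacy property the abstract describes.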
URI: http://localhost:8081/jspui/handle/123456789/19575
Research Supervisor/ Guide: Raman, Balasubramanian
metadata.dc.type: Thesis
Appears in Collections:DOCTORAL THESES (CSE)

Files in This Item:
File: JAVED IMRAN 16911003.pdf — 17.2 MB, Adobe PDF (View/Open)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.