Abstract:
Because human beings play a dominant role in preserving and propagating important
information, the human body is a subject of intense study. Human nature remains a very
complex topic, and human behavior is difficult to describe in an arbitrary situation. This thesis aims to develop
an intelligent human action analysis model by applying deep learning methods to videos.
The actions performed by human beings are the outcome of very complex phenomena of
perceptual vision and neurology. We consider only the issues that arise in vision-based systems
for human action analysis. The main phases of human action analysis are
tracking the human body, localizing it in space and time, and recognizing the actions
being performed in a given video sequence. Each of the research sub-problems (localization,
tracking, and recognition of action instances in a given video sequence under
uncontrolled conditions) has a wide range of applications of its own. The research
problem was posed before the 18th century in the form of a biological understanding
of human nature, which was not very successful because little information was available from the human
brain. From the vision perspective, plenty of information can be obtained to gain a thorough
understanding of human nature. In terms of computer vision research, the stated problem
is part of video understanding. Developing a system that can automatically retrieve
the desired information from any video is a challenging computer vision problem.
Handling large-scale data analytics while training a network is a further overhead
in processing video samples from unconstrained media. Advances in deep neural
networks have yielded satisfactory solutions in several research domains that deal with
large-scale data analytics.
The thesis, entitled ‘Human Action Localization, Tracking and Recognition using
Deep Learning’, is organized into six chapters. Chapter 1, ‘Introduction’, presents a general
introduction to video understanding for human action analysis, along with the motivations
and challenges of the stated research problem. The required experimental
setup is also described, together with the research objectives and the author's contributions. Chapter
2, ‘Preliminaries’, describes the basics of deep network architectures, which must be
understood before developing a deep network for human action analysis. Chapter
3, ‘Weakly Supervised Deep Network Model for Spatiotemporal Human
Action Localization with CNN and LSTM’, presents a human action localization model in the
spatiotemporal domain. The architecture of the model uses a CNN to capture spatial
information, whereas temporal information is retrieved by an LSTM. A CNN is a widely used
deep network model based on convolution operations applied at large scale. An LSTM is a specific type of recurrent
neural network that can recover information from a video sequence over both long
and short durations. In Chapter 4, ‘A Cascaded CNN Model for Multiple Human
Tracking and Re-localization in Complex Video Sequences’, we present a human body
tracking and re-localization system. We use three CNNs in a cascaded fashion to
improve the overall performance of the human tracking system. Chapter 5,
‘Spatiotemporal Attention based Deep Network Model for Human Action Recognition
with CNN and RNN’, presents a deep network model for human action recognition. The
proposed human action recognition model uses information based on spatiotemporal
attention and localizes the action in the local neighborhood of the initial frames. The model
uses a deep neural network framework that receives information from 3D-CNN, LSTM,
and spatiotemporal attention-based blocks in a cascaded fashion. In addition, a Long Short-Term
Memory (LSTM) model is used to capture more extensive temporal information.
Comparative studies against existing state-of-the-art methods show that the proposed model improves
the performance of the human action recognition system.
A summary of the thesis, its experimental results, and the future scope are presented in Chapter
6, ‘Conclusions and Future Scope’. The thesis may help guide the
development of an intelligent system that can simultaneously localize, track, and recognize
objects or actions in real time.