
Please use this identifier to cite or link to this item: http://142.54.178.187:9060/xmlui/handle/123456789/5217
Title: Human Action Recognition and Localization in Videos
Authors: Ullah, Javid
Keywords: Computer Science
Issue Date: 2019
Publisher: National University of Computer and Emerging Sciences, Islamabad
Abstract: Human action localization and recognition in videos is one of the most studied and active research areas in computer vision. In this thesis we address two main questions: first, when and where is the action performed in the video, and second, what type of action is performed. The "when and where" question localizes the action spatially and temporally in the visual data, while the "what" question determines the action category/class. The output of action localization is a sub-volume consisting of the action of interest. Action localization is more challenging than action classification, because it requires extracting a specific part of the spatio-temporal volume of the visual data. We address the problem of automatically extracting foreground objects in videos and then determining the category of action performed in the localized region. Action localization and recognition thus deal with understanding when, where and what happens in a video sequence. In the last decade, some of the proposed methods have addressed the problem of simultaneous recognition and localization of actions. Action recognition addresses the question "What type of action is performed in the video?", while action localization aims to answer the question "Where in the video does it occur?"; such methods are termed "action detection" or "action localization and recognition". Human action recognition and localization is greatly motivated by a wide range of applications in various fields of computer vision, such as human perceptual segmentation, tracking humans in a video sequence, recovering body structure, medical diagnosis, monitoring human activities in security-sensitive areas like airports, buildings (universities, hospitals, schools) and border crossings, and recognizing the daily activities of the elderly (related to elderly health issues). It is one of the hardest problems due to enormous variations in visual data: appearance of actors, motion patterns, changes in camera viewpoint, illumination variations, moving and cluttered backgrounds, occlusions of actors, intra- and inter-class variations, noise, moving cameras and the sheer amount of available visual data.
Local-feature-based action recognition methods have been extensively studied in the last two decades. These systems have numerous limitations and are still far from real-time operation. Every phase of such a system matters for the next: the success and accuracy of local-feature-based methods depend on accurate encoding of the visual data (the feature extraction method), dimensionality reduction of the extracted features into a compact representation, localization of the action, and training of a learning model (classifier) for classifying the action sequences. The main parts of the system are therefore: (1) feature extraction, (2) feature representation, (3) localization of the region of interest, and (4) classification of the action video. We first study, evaluate and compare the well-known, prominent state-of-the-art approaches proposed for action recognition and localization; the existing methods proposed for action localization are often complex and computationally expensive. We propose a novel saliency map computation based on local and global features that fills the gap between the two types of features and thereby provides promising results very efficiently for salient object detection. The motion features are then fused with the detected salient object to extract the moving object in each frame of the sequence; an illustrative sketch of this kind of fusion is given below.
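As a rough illustration of the local/global saliency and motion fusion idea described above (not the thesis's actual algorithm), the following Python/OpenCV sketch combines a global spectral-residual saliency cue, a local contrast cue and Farneback optical-flow magnitude into a per-frame moving-object mask. All operator choices, parameter values and function names here are assumptions made for demonstration.

```python
# Minimal sketch, assuming standard OpenCV/NumPy building blocks; the specific
# cues (spectral-residual saliency, Laplacian contrast, Farneback flow) are
# stand-ins for the thesis's local/global saliency and motion features.
import cv2
import numpy as np

def global_saliency(gray):
    """Spectral-residual saliency as a stand-in global cue."""
    f = np.fft.fft2(gray.astype(np.float32))
    log_amp = np.log1p(np.abs(f))
    phase = np.angle(f)
    residual = log_amp - cv2.blur(log_amp, (3, 3))
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = cv2.GaussianBlur(sal, (9, 9), 2.5)
    return cv2.normalize(sal, None, 0, 1, cv2.NORM_MINMAX)

def local_saliency(gray):
    """Local contrast (Laplacian magnitude) as a stand-in local cue."""
    lap = np.abs(cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F, ksize=3))
    return cv2.normalize(lap, None, 0, 1, cv2.NORM_MINMAX)

def moving_object_mask(prev_gray, gray, alpha=0.5):
    """Fuse local+global saliency with optical-flow magnitude, then threshold."""
    sal = alpha * global_saliency(gray) + (1.0 - alpha) * local_saliency(gray)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = cv2.normalize(np.linalg.norm(flow, axis=2), None, 0, 1,
                           cv2.NORM_MINMAX)
    fused = sal * motion  # keep regions that are both salient and moving
    _, mask = cv2.threshold((fused * 255).astype(np.uint8), 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask
```

The multiplicative fusion is one simple way to require agreement between appearance saliency and motion; the thesis's actual fusion strategy may differ.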
Object proposal algorithms normally rely on computationally expensive segmentation methodologies to extract different non-overlapping objects/regions in a frame. Our proposed methods instead exploit a very limited spatio-temporal neighborhood to extract a compact action region based on the compensated motion information. Finally, a classifier is trained on the local features to recognize/label the action sequence. We have evaluated two types of learning models: the extreme learning machine (ELM) and deep neural networks (DNNs). ELM is fast, while computationally intensive classifiers such as DNNs produce comparatively better action recognition accuracy (a bare-bones ELM is sketched below). The experimental evaluation reveals that our local-feature-based human action recognition and localization system improves on existing systems in several aspects, such as computational complexity and performance. We conclude that the proposed algorithms obtain better or very similar action recognition and localization accuracy compared to state-of-the-art approaches on realistic, unconstrained and challenging human action recognition and localization datasets such as KTH, MSR-II, JHMDB21 and UCF Sports. In addition, to evaluate the localization effectiveness of the proposed algorithms, a number of segmentation datasets have been used, such as MOViCs, I2R, SegTrack v1 & v2, ObMiC and Wallflowers. Although the approaches proposed in this thesis obtain promising results compared to prominent state-of-the-art methods, further research is required to achieve comparable or better results on the more challenging, realistic videos encountered in practice. Future directions are discussed in the conclusions and future work section of the thesis.
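For concreteness, the sketch below shows a bare-bones extreme learning machine of the kind mentioned above: a single hidden layer with random, fixed input weights and output weights solved in closed form by least squares, which is why ELM training is fast compared to DNNs. The class name, dimensions and dummy data are illustrative assumptions, not the thesis's implementation.

```python
# Minimal ELM sketch, assuming the local action features have already been
# extracted and encoded as fixed-length vectors elsewhere in the pipeline.
import numpy as np

class ELMClassifier:
    def __init__(self, n_hidden=512, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random, fixed input weights with a sigmoid activation.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # One-hot targets; output weights solved by least squares (no backprop).
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        H = self._hidden(X)
        self.beta, *_ = np.linalg.lstsq(H, T, rcond=None)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]

# Usage with dummy data standing in for encoded local action features:
X_train = np.random.rand(200, 128)            # 200 clips, 128-D descriptors
y_train = np.random.randint(0, 6, size=200)   # 6 hypothetical action classes
elm = ELMClassifier(n_hidden=256).fit(X_train, y_train)
print(elm.predict(np.random.rand(5, 128)))
```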
Gov't Doc #: 17923
URI: http://142.54.178.187:9060/xmlui/handle/123456789/5217
Appears in Collections:Thesis

Files in This Item:
File: 10741.htm    Size: 121 B    Format: HTML

