Activity Recognition in Videos Using Deep Learning
Shanbhag, Mahesh Ramaray
Automatically recognizing activities in a video is a long-standing goal of computer vision and artificial intelligence. Recently, breakthroughs in deep learning have revolutionized the field of computer vision, and today deep models can solve low-level tasks such as image classification and object detection with accuracy that rivals or exceeds that of highly trained human experts. However, inferring high-level activities from low-level information such as objects in a video is a difficult task, because the objects interacting with humans can be too small, or similar activities may be captured at different spatial locations or viewing angles. In this thesis, we propose an effective and efficient supervised learning model for solving this difficult task by leveraging advanced deep learning architectures. Our key idea is to formulate activity recognition as a multi-label classification problem in which the input is a set of frames (a video) and the output is an assignment, at each frame, of the most probable label to each of the four elements that make up an activity: action, tool, object, and source/target. We begin with a network pre-trained on objects appearing in a large image classification dataset and then modify it with an additional layer that helps us solve the much harder multi-label classification problem. We then fine-tune this new network on our video data by presenting each labeled frame in the video as input to the network. We train, evaluate, and benchmark the model using a popular cooking activities dataset, and we also interpret the learned model by visualizing the network at various levels of its hierarchy.
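The per-frame decoding step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the label vocabularies, group names, and logit values below are hypothetical, and each of the four activity elements is assumed to be scored independently by the network's final layer.

```python
import numpy as np

# Hypothetical label vocabularies for the four activity elements
# (illustrative only; the actual dataset labels differ).
LABEL_GROUPS = {
    "action": ["cut", "stir", "pour"],
    "tool": ["knife", "spoon", "hand"],
    "object": ["carrot", "bowl", "egg"],
    "source_target": ["cutting-board", "pan", "counter"],
}

def decode_frame(logits_per_group):
    """Map raw per-group network logits for one frame to the most
    probable label in each group (action, tool, object, source/target),
    treating each group as an independent classification problem."""
    prediction = {}
    for group, logits in logits_per_group.items():
        # Numerically stable softmax within the group.
        probs = np.exp(logits - np.max(logits))
        probs /= probs.sum()
        prediction[group] = LABEL_GROUPS[group][int(np.argmax(probs))]
    return prediction

# Example logits for a single frame (made up for illustration).
frame_logits = {
    "action": np.array([2.0, 0.1, -1.0]),
    "tool": np.array([1.5, 0.2, 0.0]),
    "object": np.array([0.3, -0.5, 2.2]),
    "source_target": np.array([1.9, 0.4, 0.1]),
}
print(decode_frame(frame_logits))
```

Running the decoder over every frame of a video yields a per-frame sequence of (action, tool, object, source/target) tuples, which is the multi-label output the abstract describes.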