Action Recognition in Continuous Data Streams Using Fusion of Depth and Inertial Sensing
Human action or gesture recognition has been extensively studied in the literature, spanning a wide variety of human-computer interaction applications including gaming, surveillance, healthcare monitoring, and assistive living. Sensors used for action or gesture recognition are primarily either vision-based sensors or inertial sensors. In contrast to the great majority of previous works, in which a single sensing modality is used for action or gesture recognition, this dissertation considers the simultaneous use of a depth camera and a wearable inertial sensor. Furthermore, whereas the great majority of previous works assume that actions have already been segmented, this dissertation addresses a more realistic and practical scenario in which actions of interest occur continuously and randomly amongst arbitrary actions of non-interest. In this dissertation, computationally efficient solutions are presented to recognize actions of interest from continuous data streams captured simultaneously by a depth camera and a wearable inertial sensor. These solutions comprise three main steps: segmentation, detection, and classification. In the segmentation step, all motion segments are extracted from continuous action streams. In the detection step, the segmented actions are separated into actions of interest and actions of non-interest. In the classification step, the detected actions of interest are classified. The features considered include skeleton joint positions, depth motion maps, and statistical attributes of acceleration and angular velocity inertial signals. The classifiers considered include the maximum entropy Markov model, support vector data description, the collaborative representation classifier, the convolutional neural network, and the long short-term memory network. These solutions are applied to two applications: smart TV hand gestures and transition movements for home healthcare monitoring.
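As a rough illustration of the segmentation step and of statistical inertial features, the following minimal sketch extracts motion segments from a one-dimensional signal by simple thresholding and then computes a few statistics per segment. The thresholding rule, the function names segment_motion and statistical_features, the parameter values, and the sample stream are all assumptions made for illustration; they are not the method developed in the dissertation.

```python
from statistics import mean, pstdev


def segment_motion(signal, threshold, min_len=3):
    """Return (start, end) index pairs, end exclusive, of runs where
    |signal| stays above threshold for at least min_len samples.

    Hypothetical thresholding rule, used only for illustration.
    """
    segments = []
    start = None
    for i, value in enumerate(signal):
        active = abs(value) > threshold
        if active and start is None:
            start = i  # a motion run begins
        elif not active and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the run ended (kept or discarded)
    # Close a run that is still open at the end of the stream.
    if start is not None and len(signal) - start >= min_len:
        segments.append((start, len(signal)))
    return segments


def statistical_features(window):
    """Mean, population standard deviation, minimum, and maximum of a window."""
    return [mean(window), pstdev(window), min(window), max(window)]


# Made-up acceleration-like stream: one sustained motion and one short spike.
stream = [0.0, 0.1, 0.0, 1.2, 1.5, 1.1, 0.9, 0.0, 0.1, 0.0, 2.0, 0.0]
segments = segment_motion(stream, threshold=0.5)
features = [statistical_features(stream[s:e]) for s, e in segments]
```

In this sketch the single-sample spike near the end is discarded by the minimum-length check, so only the sustained motion is passed on to the later detection and classification steps.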
The results obtained indicate the effectiveness of the developed solutions in detecting and recognizing actions of interest in continuous data streams. It is shown that higher recognition rates are achieved when the decisions from the two sensing modalities are fused than when either sensing modality is used individually. The results also indicate that the deep learning-based solution provides the best outcome among the solutions developed.
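Decision-level fusion of two modalities can be sketched as follows. The example combines per-class scores from a depth-based classifier and an inertial-based classifier with a weighted sum and selects the highest fused score. The weighted-sum rule, the function name fuse_decisions, the weight, the gesture labels, and the score values are all hypothetical; the fusion rule actually used in the dissertation is not stated in this abstract.

```python
def fuse_decisions(depth_scores, inertial_scores, depth_weight=0.5):
    """Fuse per-class scores from two classifiers by a weighted sum.

    Both arguments map class labels to scores (for example, posterior
    probabilities). Returns the winning label and the fused scores.
    Hypothetical fusion rule, used only for illustration.
    """
    if depth_scores.keys() != inertial_scores.keys():
        raise ValueError("both classifiers must score the same classes")
    w = depth_weight
    fused = {
        label: w * depth_scores[label] + (1.0 - w) * inertial_scores[label]
        for label in depth_scores
    }
    best = max(fused, key=fused.get)
    return best, fused


# Made-up scores for three hypothetical hand gestures.
depth_scores = {"wave": 0.5, "swipe": 0.4, "click": 0.1}
inertial_scores = {"wave": 0.2, "swipe": 0.6, "click": 0.2}
label, fused = fuse_decisions(depth_scores, inertial_scores)
```

With these made-up scores, the depth classifier alone would pick "wave", while the fused decision picks "swipe" because the inertial classifier supports it strongly. This shows how the two modalities can correct each other; it is not evidence for the recognition rates reported in the dissertation.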