Together with Miriam Buonamente, Zahra Gharaee, Haris Dindo and Peter Gärdenfors, I have developed a hierarchical neural network system that uses a hierarchy of Self-Organizing Maps (SOMs) to recognize actions. We have done extensive research on making this work well in numerous offline systems that categorize both videos of agents performing actions and sequences of skeleton models extracted from Kinect sensors. We have also obtained promising results on getting the system to internally simulate the likely continuation of partly observed actions by employing Associative SOMs.
At present we are using the insights from our extensive research on internal simulation of actions, carried out to a large extent by Miriam Buonamente, to enable our online action recognition system to internally simulate the likely continuation of partly observed actions. Ultimately, we aim for a system that can not only recognize actions but also anticipate the likely intentions of an observed agent through internal simulation (mental imagery).
Each layer in the architecture represents increasingly complex features of human activity.

The first layer receives suitably preprocessed input. The kind of preprocessing varies with the input source, but when we use sets of joint positions extracted from the stream of depth images produced by a 3D camera, we mainly apply a coordinate transformation into egocentric coordinates together with scaling, to provide invariance to the capturing angle and to the size of, and distance to, the acting agent. The layer consists of a SOM that performs dimensionality reduction and represents individual postures (or other features, such as the first- or second-order dynamics of postures, or a combination of these). An action thus unfolds as an activity trajectory over key postures or dynamics. Since sufficiently similar postures (what counts as sufficiently similar depends on the number of neurons) are represented by the same neuron, a particular movement carried out at various speeds elicits the same activity trajectory in the SOM (i.e. the same sequence of activated neurons), which yields time invariance. Since similar postures are represented close to each other in the similarity-ordered SOM, similar movements by the acting agent are represented as similar trajectories (i.e. similar sequences of activated neurons).

The present activity, together with a suitably long sequence of previous unique activity (the suitable length depends on, and is optimized for, the set of actions the system is trained to recognize), is transformed into an ordered vector representation before entering a second-layer SOM, which develops an ordered spatial representation of sequences that uniquely corresponds to different actions, thereby also solving the action segmentation problem. The third layer of the hierarchy consists of a neural network that learns to label the activations in the second-layer SOM with their corresponding action labels.
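The first-layer mechanism described above can be illustrated with a minimal sketch. This is not our actual implementation: the names `egocentric`, `MiniSOM` and `trajectory` are hypothetical, the SOM is a toy Kohonen map on 2-D "postures", and the preprocessing is a simplified stand-in for the egocentric transformation and scaling. It only shows the core ideas: quantizing postures with a SOM and collapsing repeated winners so that the same movement at different speeds yields the same activity trajectory.

```python
import numpy as np

def egocentric(joints, root=0, ref=1):
    # Simplified stand-in for the preprocessing: translate the skeleton so
    # the root joint is at the origin, then scale by the distance to a
    # reference joint, giving invariance to position and size/distance.
    centered = joints - joints[root]
    scale = np.linalg.norm(centered[ref]) or 1.0
    return centered / scale

class MiniSOM:
    """Tiny Self-Organizing Map: a grid of weight vectors trained with the
    classic Kohonen update (best-matching unit + Gaussian neighborhood)."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.random((rows * cols, dim))
        self.grid = np.array(
            [(r, c) for r in range(rows) for c in range(cols)], dtype=float
        )

    def bmu(self, x):
        # Index of the neuron whose weight vector is closest to the input.
        return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

    def train(self, data, epochs=30, lr=0.5, sigma=1.5):
        for t in range(epochs):
            a = lr * (1 - t / epochs)            # decaying learning rate
            s = sigma * (1 - t / epochs) + 0.1   # shrinking neighborhood
            for x in data:
                b = self.bmu(x)
                d = np.linalg.norm(self.grid - self.grid[b], axis=1)
                h = np.exp(-(d ** 2) / (2 * s ** 2))
                self.w += a * h[:, None] * (x - self.w)

def trajectory(som, frames):
    # Map each frame to its winning neuron and collapse consecutive repeats:
    # the same movement at different speeds gives the same trajectory.
    bmus = [som.bmu(f) for f in frames]
    return [b for i, b in enumerate(bmus) if i == 0 or b != bmus[i - 1]]

# Demo: a "slow" replay (each frame held three times) elicits the same
# trajectory of activated neurons as the original sequence.
rng = np.random.default_rng(1)
postures = rng.random((100, 2))
som = MiniSOM(4, 4, 2)
som.train(postures)
seq = postures[:6]
slow = np.repeat(seq, 3, axis=0)
same = trajectory(som, seq) == trajectory(som, slow)
```

In a full system the second-layer SOM would receive a fixed-length vector built from such a trajectory, but that step is omitted here for brevity.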
A video demonstrating the recognition of actions by an earlier online version of the system, which required manual segmentation, is available here. A video demonstrating an extended version of the system that can also determine which object the agent acts upon can be found here. The most recent implementation sidesteps the action segmentation problem without explicitly defining the start and end of actions: it continuously guesses the action based on the ongoing movement. A video is available here. Another video, demonstrating how our automatically segmenting action recognition system converges towards correct guesses of actions from the publicly available MSR repository, is available here. Notice how the system makes reasonable guesses based on the agent's movements even before the actions are completed, despite rather noisy data. In conclusion, we have a real-time online action recognition system based on hierarchical SOMs that solves the action segmentation problem. The system can learn and handle a much larger set of actions than shown in the videos, which serve only as demonstrations. These online versions of the action recognition system have been implemented by Zahra Gharaee.