Together with Miriam Buonamente and Haris Dindo from RoboticsLab at the University of Palermo in Italy, and Zahra Gharaee and Peter Gärdenfors from Lund University, I have developed a hierarchical neural network architecture that uses a hierarchy of Self-Organizing Maps (SOMs) to recognize actions. We have done extensive research on how to get this to work well in numerous variants and implementations, categorizing both 2D movies of agents performing actions and sequences of sets of joint positions obtained from 3D cameras. Together with Miriam Buonamente and Haris Dindo I have also done extensive research, with promising results, on versions of the architecture that internally simulate the likely continuation of partly seen actions by employing Associative SOMs (A-SOMs). This is a biologically inspired way of achieving sequence completion, i.e. the completion of missing parts of patterns extended in time. The use of A-SOMs also enables the elicitation of expectations across different modalities, although this has not yet been tested in practice with this action recognition architecture. I have concluded my work on action recognition to pursue other interests, but a short description of the architecture, together with some videos captured by Zahra Gharaee that demonstrate action recognition by her implementations of the architecture, is kept below.
The basic architecture is composed of hierarchical layers of neural networks that self-organize into topologically ordered representations of increasingly complex human activity features, together with a suitable preprocessing mechanism (which depends on the variant of the architecture) and a mechanism that transforms activity patterns unfolding over time into a spatial representation (which also depends on the variant of the architecture). The implementations have employed two such layers: one that develops a representation of the key postures (i.e. postures distinguishable at the resolution given by the number of neurons involved) and one that develops a topologically ordered representation of movements (with a length less than or equal to that of an action, depending on the method employed to transform activity patterns unfolding over time into a spatial representation). By choosing a time length for the spatial transformation of the movements that suits the particular set of actions the architecture is supposed to recognize, and by 'smoothing' the output, it is possible to obtain continuous guessing of the ongoing action without segmentation.
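The 'smoothing' of the output is not specified further here. As one possible illustration only, the sketch below applies a sliding-window majority vote to a stream of per-frame action guesses. The function name `smooth_guesses`, the `window` parameter and the action names are hypothetical and are not taken from the implementations described on this page.

```python
from collections import Counter, deque


def smooth_guesses(raw_guesses, window=5):
    """Replace each raw guess by the most frequent guess among the last
    `window` guesses (sliding-window majority vote, an assumed method)."""
    recent = deque(maxlen=window)  # keeps only the newest `window` guesses
    smoothed = []
    for guess in raw_guesses:
        recent.append(guess)
        # most_common(1) returns [(label, count)] for the dominant label
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# A single spurious 'punch' inside a run of 'wave' is voted away.
raw = ["wave", "wave", "punch", "wave", "wave"]
print(smooth_guesses(raw, window=3))
```

A longer window gives steadier output but reacts more slowly when the action really changes, so in a sketch like this the window would have to be tuned to the set of actions.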
The kind of preprocessing varies depending on the input source and the particular version of the architecture. When the input consists of sets of joint positions extracted from the stream of depth images obtained by a 3D camera, a straightforward coordinate transformation into egocentric coordinates, together with scaling, is an effective way to provide invariance to capturing angle and to size/distance.
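A minimal sketch of such a transformation follows, under stated assumptions: each posture is a list of (x, y, z) joint positions in camera coordinates, y is the vertical axis, and the joint indices `root`, `left` and `right` (for example a torso joint and the two shoulders) are hypothetical. The exact transformation used in the implementations may differ.

```python
import math


def to_egocentric(joints, root=0, left=1, right=2):
    """Translate, rotate and scale one posture into body-centred coordinates.

    joints: list of (x, y, z) tuples; root/left/right are assumed indices.
    """
    rx, ry, rz = joints[root]
    # 1. Translate so that the root joint becomes the origin.
    centred = [(x - rx, y - ry, z - rz) for x, y, z in joints]
    # 2. Rotate about the vertical (y) axis so that the left-to-right
    #    line points along +x, which removes the capturing angle.
    dx = centred[right][0] - centred[left][0]
    dz = centred[right][2] - centred[left][2]
    angle = math.atan2(dz, dx)
    c, s = math.cos(angle), math.sin(angle)
    rotated = [(c * x + s * z, y, -s * x + c * z) for x, y, z in centred]
    # 3. Scale by the left-right distance, which removes size/distance.
    scale = math.dist(rotated[left], rotated[right]) or 1.0
    return [(x / scale, y / scale, z / scale) for x, y, z in rotated]


pose = [(0.0, 1.0, 2.0), (-0.2, 1.4, 2.0), (0.2, 1.4, 2.0), (0.1, 0.2, 2.3)]
print(to_egocentric(pose))
```

In this sketch the same posture seen from a different angle around the vertical axis, at a different distance, or performed by a larger or smaller agent maps to the same output vector.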
The activity feature layers themselves are implemented with SOMs (or recurrently connected A-SOMs if the ability for internal simulation is needed). Using SOMs (or A-SOMs) makes it possible to obtain feature representations by unsupervised learning, which reduces the demands on the data available for training.
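For readers unfamiliar with SOMs, the sketch below shows a generic textbook training loop for a small 2D map: find the best-matching neuron for each input and pull it and its grid neighbours towards that input. It is not the code used in the implementations; the function name `train_som`, the linear decay schedules and all parameter values are assumptions chosen for brevity, and the A-SOM extension is not covered.

```python
import math
import random


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length input vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2  # initial neighbourhood radius (assumed)
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = t / steps
            lr = lr0 * (1 - frac)                  # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3     # shrinking neighbourhood
            # Best-matching unit: neuron with the closest weight vector.
            bmu = min(range(len(weights)),
                      key=lambda i: sum((w - v) ** 2
                                        for w, v in zip(weights[i], x)))
            br, bc = divmod(bmu, cols)
            for i, w in enumerate(weights):
                r, c = divmod(i, cols)
                # Gaussian neighbourhood function on the map grid.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
            t += 1
    return weights


rng = random.Random(1)
postures = [[rng.random(), rng.random()] for _ in range(50)]  # toy 2D inputs
som = train_som(postures, rows=3, cols=3)
print(len(som), "neurons trained")
```

No labels are used anywhere in this loop; the topological ordering emerges from the neighbourhood update alone.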
The first-layer SOM performs dimensionality reduction and forms a representation of individual postures (or other features at the lower end, like the first or second order dynamics of postures or their combination) in a topologically ordered way (which means that similar postures are represented close to each other). When the system receives input, a trajectory of activity representing key postures or dynamics unfolds during an action. Since sufficiently similar postures (what counts as sufficiently similar depends on the number of neurons) are represented by the same neuron, a particular movement carried out at various speeds will elicit the same activity trajectory in the SOM (i.e. the same sequence of activated neurons). Thus time invariance is achieved. Since similar postures are represented close to each other in a similarity-ordered way in the SOM, similar movements carried out by the acting agent are represented as similar trajectories (i.e. similar sequences of activated neurons) in the SOM.
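The time invariance argument can be illustrated with a small sketch. It assumes an already trained first-layer SOM, represented here by a hypothetical toy list of three one-dimensional weight vectors, and records the winning neuron per frame while collapsing consecutive repeats. The function names are illustrative only.

```python
def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def activity_trajectory(weights, frames):
    """Sequence of winning neurons, with consecutive repeats collapsed."""
    trajectory = []
    for frame in frames:
        bmu = best_matching_unit(weights, frame)
        if not trajectory or trajectory[-1] != bmu:
            trajectory.append(bmu)
    return trajectory


# Toy stand-in for a trained SOM: three neurons along one feature axis.
weights = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.9], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(activity_trajectory(weights, slow))  # [0, 1, 2]
print(activity_trajectory(weights, fast))  # [0, 1, 2]
```

The slow and the fast version of the movement differ in the number of frames but produce the same sequence of activated neurons.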
The present activity, together with a suitably long sequence of previous unique activity (the suitable length depends on, and is optimized for, the set of actions the system is trained to recognize), is transformed into an ordered vector representation before entering a second-layer SOM, which develops an ordered spatial representation of sequences that uniquely correspond to different actions.
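How the sequence is turned into a vector depends on the variant of the architecture. As one hedged illustration, the sketch below concatenates the grid coordinates of the last `length` unique winning neurons of the first-layer SOM into a fixed-length vector. The function name `sequence_vector`, the choice of grid coordinates as the encoding and the front padding are all assumptions made for this example.

```python
def sequence_vector(trajectory, cols, length):
    """Fixed-length input for a second-layer SOM.

    trajectory: winning-neuron indices with consecutive repeats removed.
    cols: number of columns of the first-layer map (to recover row, col).
    length: how many of the most recent activations to include.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one activation")
    recent = trajectory[-length:]
    # Pad short sequences by repeating the oldest activation (assumption).
    recent = [recent[0]] * (length - len(recent)) + recent
    vector = []
    for index in recent:
        row, col = divmod(index, cols)
        vector.extend((float(row), float(col)))
    return vector


# Neurons 0, 4 and 8 lie on the diagonal of a 3 x 3 first-layer map.
print(sequence_vector([0, 4, 8], cols=3, length=3))
# [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

Because the first-layer map is topologically ordered, similar movements give nearby coordinates and therefore nearby vectors, which is what a second-layer SOM needs in order to group them.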
The third layer in the hierarchy consists of a neural network that learns to label the activations in the second-layer SOM with their corresponding action labels.
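The sketch below illustrates the labelling step with a deliberately simpler stand-in for the neural network: a lookup table that assigns to each second-layer winning neuron the action label it co-occurred with most often during training. This majority-vote table is a substitute chosen for brevity, not the network used in the implementations, and the names `fit_labels` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_labels(winners, labels):
    """Map each second-layer winner index to its most frequent action label."""
    votes = defaultdict(Counter)
    for winner, label in zip(winners, labels):
        votes[winner][label] += 1
    return {winner: count.most_common(1)[0][0]
            for winner, count in votes.items()}


def predict(label_map, winner, default=None):
    """Look up the label for a winner; `default` if the neuron was never seen."""
    return label_map.get(winner, default)


# Toy training data: second-layer winners and the labelled actions.
label_map = fit_labels([0, 0, 0, 5, 5],
                       ["wave", "wave", "punch", "kick", "kick"])
print(predict(label_map, 0))  # wave
print(predict(label_map, 5))  # kick
```

This is the only stage that needs labelled examples; the two SOM layers below it are trained without labels.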
Demo videos of Zahra Gharaee's implementations of the action recognition architecture:
Movie1. Action recognition with manual segmentation.
Movie2. Action recognition with determination of the object acted upon. This is a hybrid system composed of an implementation of the hierarchical SOM architecture and a non-neural system, not described here, for the determination of the object acted upon.
Movie3. Continuous guessing of the action based on the ongoing movement.
Movie4. Continuous guessing of the action based on the ongoing movement applied to the publicly available MSR repository. Notice how the system also makes reasonable guesses based on the movements of the agent even before the actions are completed.