Description of the MSRC-12 Gesture Dataset

Contributors to this Document
Pushmeet Kohli, Sebastian Nowozin and Helena Mentis (Microsoft Research Cambridge)
Simon Fothergill (University of Cambridge)

Introduction

The Microsoft Research Cambridge-12 (MSRC-12) gesture dataset consists of sequences of human skeletal body part movements (represented as body part locations) and the associated gesture meaning that needs to be recognized by the system. The dataset was collected at MSR Cambridge as part of a research project described in:

Simon Fothergill, Helena Mentis, Pushmeet Kohli, Sebastian Nowozin, "Instructing People for Training Gestural Interactive Systems", ACM SIGCHI Conference on Human Factors in Computing Systems (CHI 2012)

If you use this dataset, you should cite the above paper in any resulting publication. The dataset is accompanied by a set of MATLAB files (written by Sebastian Nowozin) which can be used to verify and visualize the sequences contained in the dataset.

Description of the Dataset

The dataset comprises 594 sequences and 719,359 frames (approx. 6 hours 40 minutes) collected from 30 people performing 12 gestures. In total, there are 6,244 gesture instances. The motion files contain tracks of 20 joints estimated using the Kinect pose estimation pipeline. The body pose is captured at a sample rate of 30 Hz, with an accuracy of about 2 cm in joint positions. (A minimal loading sketch is given below, after the Instruction method section.)

Participants

The participants were recruited at MSR Cambridge and the Computer Laboratory at the University of Cambridge. Although some of the participants were familiar with machine learning and computer vision, none of them were privy to the workings of the machine learning algorithm used in the study we were conducting. The participants were 60% male and 93% right-handed; they were 5'0"-6'6" tall (average 5'8") and 22-65 years old (average 31).

Gesture Types

The gestures fall into two abstract categories: iconic gestures, which imbue a correspondence between the gesture and the reference, and metaphoric gestures, which represent an abstract concept. A list of all gestures, along with the number of instances of each, is given below.

A. Iconic Gestures

ID   Description                                      Tag name                    Instances
2    Crouch or hide                                   G2 Duck                     500
4    Put on night vision goggles                      G4 Goggles                  508
6    Shoot a pistol                                   G6 Shoot                    511
8    Throw an object                                  G8 Throw                    515
10   Change weapon                                    G10 Change weapon           498
12   Kick                                             G12 Kick                    502

B. Metaphoric Gestures

ID   Description                                      Tag name                    Instances
1    Start system/music / raise volume                G1 lift outstretched arms   508
3    Navigate to next menu / move arm right           G3 Push right               522
5    Wind up the music                                G5 Wind it up               649
7    Take a bow to end music session                  G7 Bow                      507
9    Protest the music                                G9 Had enough               508
11   Move up the tempo of the song / beat both arms   G11 Beat both               516

Instruction method

To study how the instruction modality affected the movements of the subjects, we collected data by giving different types of instructions to our participants. We chose to provide participants with three familiar, easy-to-prepare instruction modalities, and combinations of them, that did not require the participant to have any sophisticated knowledge. The three were (1) descriptive text breaking down the performance kinematics, (2) an ordered series of static images of a person performing the gesture, with arrows annotating as appropriate, and (3) video (dynamic images) of a person performing the gesture. We wanted the media to be transparent so that they fulfil their primary function of conveying the kinematics. More concretely, for each gesture we roughly have data collected using: text (10 people), images (10 people), video (10 people), video with text (10 people), and images with text (10 people).
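To make the dataset description above concrete, the following is a minimal MATLAB sketch for loading a single pose sequence. Only the facts stated above (20 joints per frame, 30 Hz sample rate) come from this documentation; the assumption that each row of a .csv file holds one frame, with a timestamp in the first column followed by the joint values, should be checked against the supplied LOAD_FILE script, and the file name used here is only a placeholder.

    % Minimal loading sketch. Documented facts: 20 joints per frame, 30 Hz
    % sample rate. ASSUMPTION: each row of the .csv is one frame, with a
    % timestamp in the first column followed by the joint values (check the
    % supplied LOAD_FILE script for the authoritative parsing).
    seqFile   = 'P1_1_g1_s10.csv';        % placeholder file name
    data      = readmatrix(seqFile);      % one row per frame (assumed)
    numFrames = size(data, 1);
    fprintf('%d frames, approx. %.1f s at 30 Hz\n', numFrames, numFrames / 30);

    % As noted in the F.A.Q.s below, a few sequences contain all-zero frames
    % where the Kinect pipeline did not track a skeleton; flag them here.
    jointValues = data(:, 2:end);                  % drop the assumed timestamp column
    zeroFrames  = find(all(jointValues == 0, 2));
    fprintf('%d frames with no tracked skeleton\n', numel(zeroFrames));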
File Naming Convention

There are 594 pose sequences (.csv files), along with 594 associated "tagstream" files which contain information about when a particular gesture from the list above should be detected. The format of the file names is:

P[#A]_[#B]_g[#C][A]_s[#D].[csv|tagstream]

[#A] and [#B] encode the instruction that was given to the subject (see the paper).
The presence of the letter "A" after [#C] indicates that the movements were generated when two instructions were given (e.g. text + images, or text + video).
[#C] denotes the identifier of the gesture being performed in the file. This is redundant, since the same information is also present in the tagstream file.
[#D] denotes the identifier of the human subject whose movements are captured in the file.

(A file name parsing sketch is given after the F.A.Q.s below.)

MATLAB files

The dataset comes with a set of MATLAB scripts (written by Sebastian Nowozin) which allow you to verify the content and to visualize the data. The basic description of the files is as follows:

1. LOAD_FILE -- Load a gesture recognition sequence.
2. SKEL_VIS -- Visualize a skeleton in 3D coordinates.
3. Demo 1: Visualize a sequence.
4. Verify the integrity of the dataset.

F.A.Q.s

Q: Was the data collected with a Kinect sensor?
A: Yes.

Q: Please can I have the RGB and depth map data?
A: We are not able to share this data for legal reasons, as it would make it too easy to identify the participants.

Q: Where is the file P2_1_2A_p12.tagstream?
A: The initial release of the dataset included a typographic error - this file was named p2_1_2A_p12.tagstream, with a lowercase 'p'.

Q: Why does P1_1_12A_p19.csv contain all 0s for every frame, even though the corresponding tagstream file says there should be 10 'kick' gestures?
A: On a couple of occasions we did not realise that the Kinect capture system had not recognised (and therefore not recorded) the skeleton, but we still annotated the RGB footage. Unfortunately this is the case for P1_1_12A_p19.csv - the performer was quite tall...

Q: Why does the letter "A" occur after #C, not #B as in the initial release of the documentation?
A: There was a mistake in the initial documentation. The letter "A" should be after #C, as it refers to the instruction for the performance of the gesture, not the group of performers. It means that the instruction used a compound semiotic modality, such as video + text or images + text.

Q: Why are the descriptions of the gestures not consistent with the tag names in the tagstream files?
A: This is for historical reasons; a mapping is provided in later versions of the documentation. The tag names focus more on the kinematics of the performances, which could be assigned any domain-specific meaning, whereas the documentation describes the gestures in the context of specific example applications, such as music-playing systems.

Q: Why are there no 'g' and 's' in the names of the files?
A: The initial release of the dataset was updated with this clearer naming convention after 11/07/2012.

Q: In the table in the initial release of the documentation, does the "+" sign mean AND or OR?
A: The "+" sign means the UNION of the two sets of performers.
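As referenced in the File Naming Convention section above, the following is a minimal MATLAB sketch for splitting a file name into its fields. The file name used is a made-up example; making the 'g' marker optional and accepting 'p' as well as 's' before [#D] is an assumption intended to also cover the pre-11/07/2012 names (such as P1_1_12A_p19.csv) mentioned in the F.A.Q.s.

    % Minimal file name parsing sketch for P[#A]_[#B]_g[#C][A]_s[#D].[csv|tagstream].
    % The optional 'g' and the [sp] alternative are assumptions meant to also
    % match older names such as P1_1_12A_p19.csv.
    name   = 'P1_1_g12A_s19.csv';   % made-up example file name
    tokens = regexp(name, ...
        '^P(\d+)_(\d+)_g?(\d+)(A?)_[sp](\d+)\.(csv|tagstream)$', 'tokens', 'once');
    if isempty(tokens)
        error('Unexpected file name: %s', name);
    end
    instructionA = str2double(tokens{1});  % [#A]: instruction code (see the paper)
    instructionB = str2double(tokens{2});  % [#B]: instruction code (see the paper)
    gestureId    = str2double(tokens{3});  % [#C]: gesture identifier (1-12)
    compound     = ~isempty(tokens{4});    % 'A' => compound instruction (e.g. video + text)
    subjectId    = str2double(tokens{5});  % [#D]: subject identifier
    fileKind     = tokens{6};              % 'csv' (pose sequence) or 'tagstream' (annotations)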
Community

The dataset was advertised in various RSS feeds, at CVPR 2012, and at cvpapers.com. Interest has already been shown by people collecting large and diverse quantities of training data and carrying out gesture recognition research at institutions including the University of Central Florida.

Acknowledgements

We would like to thank the staff, students and interns of Microsoft Research Cambridge and the University of Cambridge for participating in our user study. We would also like to thank Olivia Nicell for help with data gathering and tabulation.