NEST Contract No. 5010

CONTACT
Learning and Development of Contextual Action

Instrument: Specific Targeted Research Project (STREP)
Thematic Priority: New and Emerging Science and Technology (NEST), Adventure Activities

Artifact Speech and Manipulation Production-Perception Setup

Due date: 01/01/2006
Submission date: 15/10/2006
Start date of project: 01/09/2005
Duration: 36 months
Organisation name of lead contractor for this deliverable: University of Genova (UGDIST)
Revision: 1

Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006)
Dissemination Level: PU (Public)

Contents

1 Introduction
2 Embodied artifact
  2.1 Artifact for manipulation
  2.2 Artifact for speech
  2.3 Integration plan
3 Articulatory synthesis
4 The “linguometer”
  4.1 Articulograph (Carstens AG500)
  4.2 Ultrasound System (Toshiba Aplio Ultrasound machine)
  4.3 Integration

1 Introduction

For CONTACT, we are taking a modular approach to the “embodied artifact” that will be used for this project. We expect to integrate all our algorithmic work on a single platform capable of manipulation and speech. This platform will be a humanoid robot. The modules we have identified are as follows:

- Embodied artifact: UGDIST is developing a robot with a mechanically sophisticated arm/hand and setting it up for CONTACT. IST (in collaboration with US) is developing a robot head with a sophisticated ear.

- Articulatory synthesis: The US, IST, and UGDIST groups are working on adapting speech generation systems that explicitly model the process of articulation. This is a key technology for our artifact to be able to treat speech production as a motor act, without being constrained to a specific language and a pre-selected set of phonemes.

- The “linguometer”: We are interested in measuring speech-related phonoarticulatory activity, but tools for this are relatively poorly developed, and the tongue is relatively inaccessible compared with the human arm and hand. UNIFE and UGDIST have begun an activity to develop a “linguometer”, a set of instrumentation for measuring articulator activity. This is key to training algorithms for machine perception/production of speech.

2 Embodied artifact

The “BabyBot” robot described in the technical annex is still available for experimentation if needed, but through joint work with other projects we now have access to other robots (see the following subsections). In particular, we are collaborating with the RobotCub project (IST-004370). The goal of the RobotCub project is (among other things) to produce a fully open-source robot platform. Cooperation on software and hardware is in the interests of both projects: CONTACT can give the RobotCub robot iCub a voice and models for exploring sensorimotor space, and iCub will be an excellent embodiment for the CONTACT project.
2.1 Artifact for manipulation

UGDIST has been developing a robot platform called “James” with improved dexterity, both in terms of mechanical degrees of freedom and sensors (see Figure 1).

Figure 1: Left: James, a 23-degree-of-freedom robot with a dextrous hand.

Figure 2: Left: the robot head. Center: the current design for an artificial pinna for the robot head, for better sound localization. Right: the pinna helps with locating a sound source. Determining the angle along the horizontal is relatively straightforward; the pinna creates a cue that in turn helps identify the elevation of the sound source.

2.2 Artifact for speech

IST is investigating sound localization and tracking on a robot head, collaborating with US on ear design (see Figure 2). Without ears, a pair of microphones can only localize sounds in the horizontal plane; with correctly designed ears this limit no longer holds. IST has developed an “artificial pinna” for sound source localization. The form of the pinna is important for sound localization since it gives spectral cues on the elevation of the sound source. The human pinna can be simulated by a spiral. The pinna produces a spectral notch when the distance to the microphone is equal to a quarter of the wavelength of the sound source, plus any multiple of half the wavelength; that is, at frequencies f = (2m+1)c/(4d) for m = 0, 1, 2, ..., where c is the speed of sound and d is this distance. The cues used for localization are ITD (Interaural Time Difference), ILD (Interaural Level Difference) and ISD (Interaural Spectral Difference). IST, with US, has carried out a first experiment in which the system is trained with white noise and tested with speech in an echo-free chamber. A second experiment involves learning the audio-motor maps using vision in an office environment.

2.3 Integration plan

UGDIST has been examining how to bring the speech and manipulation systems into alignment. These systems have important practical differences:

- Can the motor space be explored safely through random actions? For speech, yes: the worst that can happen is minor irritation to nearby humans. For manipulation, the answer is no; certain motor states will break a robot, through self-collision, ripping cables, or a host of similar woes.

- How stable is the mapping from motor to perceptual space? Both hearing speech and viewing manipulation are subject to all sorts of environmental distortions. The visual appearance of manipulation is perhaps subject to more radical transformations than the auditory “appearance” of speech, since vision is subject to harsher geometric effects than sound (hence the difficulty of sound localization).

- How direct is the effect of motor action on the world? Speech (apart from its own sound) has primarily socially mediated effects, while manipulation (apart from its own appearance) has direct physical effects.

These are all significant differences to abstract across. The latter two points are questions of degree, but the first point, safety, is critical. We could conceivably explore the motor space of an artificial articulator automatically from tabula rasa, but this is not true of the motor space of today’s robot hands and arms. Given a motor space that is potentially dangerous to explore, we need to perform some degree of calibration and provide the analogue of protective “reflexes” and built-in limits. We take motor spaces augmented with such measures as our starting point, giving an “explorable” motor space that is safe.
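To make the notion of an “explorable” motor space concrete, the sketch below (in Python) wraps a raw motor interface with built-in joint limits and simple protective “reflexes” that veto dangerous commands. All names used here (the motor-interface methods read_positions, predicts_self_collision, send_positions, and the limit parameters) are hypothetical placeholders chosen for illustration; this is a minimal sketch of the idea, not the actual James control software.

    # Minimal sketch (illustration only) of an "explorable" motor space: a raw
    # motor interface wrapped with built-in limits and protective "reflexes",
    # so that random exploration cannot drive the robot into a damaging state.
    # The motor-interface methods used below are hypothetical placeholders.

    import numpy as np

    class ExplorableMotorSpace:
        def __init__(self, motors, lower, upper, max_step):
            self.motors = motors               # raw motor interface (hypothetical)
            self.lower = np.asarray(lower)     # per-joint lower limits
            self.upper = np.asarray(upper)     # per-joint upper limits
            self.max_step = max_step           # largest allowed change per command

        def random_target(self, rng):
            # Draw an in-range motor configuration for exploration ("babbling").
            return rng.uniform(self.lower, self.upper)

        def explore(self, target):
            # Issue a command only if it passes the built-in safety checks.
            target = np.clip(np.asarray(target), self.lower, self.upper)
            current = np.asarray(self.motors.read_positions())

            # Reflex 1: refuse commands that move too far in a single step.
            if np.any(np.abs(target - current) > self.max_step):
                return False

            # Reflex 2: refuse configurations predicted to self-collide,
            # using a manually calibrated model of the arm/hand.
            if self.motors.predicts_self_collision(target):
                return False

            self.motors.send_positions(target)
            return True

For speech, the same interface can be used with the reflexes reduced to almost nothing, so that exploration algorithms do not need to distinguish between the two motor systems.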
We consider taking as our basic abstractions the following:

- An explorable motor space (a motor space augmented with whatever measures are required to make it safe for exploration).

- A “proximal” sensor space, comprising a set of sensors that relate closely to motor action, for example motor encoders, strain gauges, articulator tube resonance model setpoints, etc.

- A “distal” sensor space, comprising a set of sensors that relate to motor action indirectly, for example via microphones or cameras.

The mapping from “proximal” sensor space to motor space, in the case of manipulation, needs to be at least partially determined manually for our current manipulator platform (the James robot), since it is needed to achieve safety. For speech, this mapping is a candidate for automated learning. So, for now, this mapping is not a point of contact between speech and manipulation, at least for the embodied artifact. The mapping from “distal” sensor space to motor space, on the other hand, is a clear point of contact between speech and manipulation. This is so both for the embodied artifact and for human studies, so this level of abstraction seems appropriate for integration.

3 Articulatory synthesis

Current approaches to speech production and speech perception by machine are highly divergent both from each other and from any notion of being a “motor system”. We are investigating the most reasonable approach to speech production for our embodied artifact in order to meet our goal of integrating perception and production for speech and manipulation.

The dominant model of automatic speech generation by computer takes text, converts it into a symbolic phonetic representation aligned with prosody, and then converts that into actual sounds. This process is very much divorced from the perception of speech, and indeed from the mechanical process of articulation. At the opposite extreme, there are a small number of robots that produce speech sounds by physical manipulation of a tube, but the technical challenges are huge. A simulated approach seems more reasonable, and sufficient for our goals. In fact there is a family of articulatory synthesizers, an approach to speech generation that involves explicit modeling of the mechanics of articulation to a greater or lesser degree of abstraction. With such systems, we can throw away all language-specific parameters and work with a continuous control space of physically meaningful parameters of a model of the human articulatory system, without any specific language bias built in. And we can treat speech as just another motor system, like an arm or hand, because the control input is continuous in time.

So far, we have taken an open-source system (GNUSpeech) for articulatory synthesis and converted it into a continuously running real-time system with motor inputs analogous to a hand/arm (see Figure 3). One of our partners has received access to the source code for a more advanced system created by Shinji Maeda, which deals better (among other things) with the extra frication sound sources created by tube constriction. We expect that this is the system we will use in the end, if it can be configured to run in real time. IST has been using VTDemo, an implementation of Shinji Maeda’s articulatory synthesizer by Mark Huckvale, University College London (see Figure 4). Via US, we now also have access to Maeda’s own synthesizer.
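To illustrate what “motor inputs analogous to a hand/arm” can mean in practice, the sketch below streams a slowly varying articulatory parameter vector to a real-time synthesizer process at a fixed control rate. The server address, packet layout, control rate and parameter count are assumptions made purely for illustration; they are not the actual interface of our GNUSpeech server, of VTDemo, or of Maeda’s synthesizer.

    # Sketch of driving an articulatory synthesizer as a continuous motor system:
    # a fixed-rate loop that streams a parameter vector (tube section radii,
    # velum opening, glottal source settings, ...) over UDP. The address, packet
    # layout and parameter count are illustrative assumptions, not the real API.

    import socket
    import struct
    import time

    import numpy as np

    SERVER = ("127.0.0.1", 9000)    # assumed address of the real-time synthesizer
    N_PARAMS = 16                   # assumed size of the articulatory control vector
    CONTROL_RATE_HZ = 100           # control updates per second (assumed)

    def stream_trajectory(trajectory):
        # Send each control frame in `trajectory` (shape: frames x N_PARAMS).
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        period = 1.0 / CONTROL_RATE_HZ
        for frame in trajectory:
            packet = struct.pack("<%df" % N_PARAMS, *frame)  # little-endian floats
            sock.sendto(packet, SERVER)
            time.sleep(period)      # the synthesizer itself runs continuously

    if __name__ == "__main__":
        # Example: a smooth random walk through the control space, the kind of
        # "babbling" trajectory an exploration algorithm might generate.
        rng = np.random.default_rng(0)
        steps = rng.normal(scale=0.01, size=(500, N_PARAMS))
        trajectory = np.clip(np.cumsum(steps, axis=0) + 0.5, 0.0, 1.0)
        stream_trajectory(trajectory)

The important point is that nothing in such a loop is phoneme- or language-specific: an exploration algorithm sees only a continuous control space and the sound that comes back.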
IST is considering the following questions:

- Data acquisition: which is more appropriate, articulatory measurements or speech synthesis?

- Motor information: which is better, articulatory or orosensory parameters?

- Other modalities: how can visual information help the segmentation process?

As shown in Figure 4, IST is working on adapting the DIVA model as a starting point for exploring the space of articulation.

Figure 3: Tube resonance model of GNUSpeech (left). The articulator is approximated as a tube whose width can be controlled dynamically along its length. We have successfully converted GNUSpeech into a real-time server, receiving numerical inputs that immediately change its configuration, just like any other robotic actuator. A simple interface for testing is shown on the right. Phonemes for English are shown, but they correspond only to particular sets of parameter settings, rather than to discrete symbols.

Figure 4: Left: the VTDemo articulatory synthesizer, adapted at IST. Right: the proposed architecture for exploring speech sounds.

4 The “linguometer”

Figure 5 shows some of the instrumentation. We are interested in measuring speech-related phonoarticulatory activity, but tools for this are relatively poorly developed, and the tongue is relatively inaccessible compared with the human arm and hand. UNIFE and UGDIST have begun an activity to develop a “linguometer”, a set of instrumentation for measuring articulator activity. We are interested in determining whether knowledge of motor activity during speech can aid learning to perceive speech. In a previous successful project involving a subset of the partners (the MIRROR project, IST-2000-28159), an analogous result was demonstrated for grasping. To carry out this experiment for speech, we need a way to measure speech-related phonoarticulatory activity. We informally call this the “linguometer”, but in practice it will be a constellation of instruments and software. We have begun a collaboration with CRIL in Lecce to work towards integrating such a “linguometer”.

Figure 5: Left: the tongue, viewed in real time via an ultrasound machine. Right: an articulograph, which recovers the 3D pose of sensors placed on the tongue and face.

Figure 6: Initial studies at CRIL show that some tongues are easier to sense than others (compare the left image, where the tongue profile is visible as a clear white line, to the right image, where the profile is much less distinct). For good subjects, the real-time imaging of the tongue is of good quality, although unfortunately only for a 2D slice rather than a 3D volume.

The future integration of the linguometer will rely on two main instruments.

4.1 Articulograph (Carstens AG500)

The AG500 articulograph locates the 3D position and orientation of 12 coils within a “cube” that generates a high-frequency electromagnetic field. Its accuracy is high, and the output is very easy to interpret since it is geometric in nature. It is limited by the physical dimensions of the cube, by a long and cumbersome setup procedure (including gluing coils to the tongue), and by somewhat unstable software. Also, interaction with other devices containing metallic components requires careful handling.
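As a sketch of the kind of first-pass processing the articulograph recordings will receive, the code below loads coil trajectories and computes per-coil speeds as a simple quality check. It assumes the recording has already been exported to a plain-text table with three columns (x, y, z, in millimetres) per coil and a 200 Hz sampling rate; these are illustrative assumptions, and the native AG500 file format differs and would need the manufacturer’s tools or a dedicated reader.

    # Sketch of a first-pass quality check on articulograph data: load coil
    # trajectories and compute per-coil speed to spot dropouts and glitches.
    # The export format and sampling rate below are assumptions for illustration.

    import numpy as np

    SAMPLE_RATE_HZ = 200.0   # assumed articulograph sampling rate
    N_COILS = 12

    def load_trajectories(path):
        # Return an array of shape (samples, N_COILS, 3) from the exported table.
        data = np.loadtxt(path)
        return data.reshape(len(data), N_COILS, 3)

    def coil_speeds(positions):
        # Per-sample speed of each coil, in mm/s.
        velocity = np.diff(positions, axis=0) * SAMPLE_RATE_HZ
        return np.linalg.norm(velocity, axis=2)

    def flag_glitches(speeds, threshold_mm_s=1000.0):
        # Samples where any coil moves implausibly fast (likely tracking errors).
        return np.where(np.any(speeds > threshold_mm_s, axis=1))[0]

Checks of this kind matter because coil tracking can be disturbed when other metallic equipment, such as the ultrasound probe, comes close to the measurement cube.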
4.2 Ultrasound System (Toshiba Aplio Ultrasound machine)

The ultrasound machine gives high-frame-rate (around 30 frames per second) sensing of 2D slices. Direct real-time access to data from the sensor appears difficult, and we may need to make do with output that has passed through rescaling and other filters that we would ideally bypass. Raw data is available, but it will not be used, since the software running on the Toshiba Aplio system has constraints that limit the extraction of such data. The 2D images provided by the machine give a good view of the tongue profile, although quality varies from subject to subject. For our purposes, at least in the early stages, it is sufficient to find “good” subjects whose tongue can be clearly imaged; we do not need to be able to image arbitrary subjects. An open question is how well, or whether, infants can be imaged in practice.

Figure 7: The current proposed architecture for the “linguometer”, with synchronization via shared audio.

4.3 Integration

The main idea is to record data from the articulograph and the ultrasound system at the same time. This task is quite complicated, since the articulograph detects the position of each sensor using an electromagnetic field, and the ultrasound probe interferes with this field. We are currently testing the results of synchronous recordings and building physical devices that allow the speaker to speak naturally while still letting us record data with the proper accuracy. While testing whether the two instruments can be made to cooperate, we are developing the software that will allow us to process, store and share the recorded data. As soon as the constellation of hardware and software is ready, we will run a few test experiments to validate the setup before following the definitive protocol.

Either device individually will provide excellent data for CONTACT. Integrating both will be even better, data-wise, but has a high cost in terms of time and complexity. We are evaluating this further. Our general approach for synchronizing these devices, and other devices not mentioned here, is to relate all data to the sound signal; most devices support sound recording (especially devices designed for speech research).
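As a minimal sketch of this audio-based synchronization, assuming each instrument records the same audio channel to a WAV file at the same sample rate (the file names below are placeholders), the time offset between two devices can be estimated from the peak of the cross-correlation of their audio tracks and then applied to the corresponding data streams:

    # Sketch of synchronizing two instruments via their shared audio track:
    # estimate the time offset between the audio recorded by each device from
    # the peak of their cross-correlation. File names are placeholders; both
    # files are assumed to contain the same audio at the same sample rate.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import correlate

    def audio_offset_seconds(wav_a, wav_b):
        # Return the delay of recording B relative to recording A, in seconds.
        rate_a, a = wavfile.read(wav_a)
        rate_b, b = wavfile.read(wav_b)
        assert rate_a == rate_b, "recordings must share a sample rate"

        # Convert to float and mix stereo recordings down to mono.
        a = a.astype(np.float64).mean(axis=1) if a.ndim > 1 else a.astype(np.float64)
        b = b.astype(np.float64).mean(axis=1) if b.ndim > 1 else b.astype(np.float64)

        xcorr = correlate(a, b, mode="full")
        # Peak position relative to zero lag; positive means B starts later than A.
        delay_samples = (len(b) - 1) - int(np.argmax(xcorr))
        return delay_samples / float(rate_a)

    if __name__ == "__main__":
        offset = audio_offset_seconds("articulograph_audio.wav", "ultrasound_audio.wav")
        print("ultrasound stream starts %.3f s after the articulograph stream" % offset)

In practice differing sample rates and clock drift would also need to be handled, but a shared audio channel already gives every instrument a common time base without custom synchronization hardware.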