Towards a solution for the sharing of phonological data

Yvan Rose, Memorial University of Newfoundland
Brian MacWhinney, Carnegie Mellon University

Map of presentation
• Context: no specialized tool to facilitate research in phonological development
• A preliminary attempt: ChildPhon
• A more promising solution: Phon
• Potential
• Current state of the Phon project
• Developments in the foreseeable future
• Publicly-available cross-linguistic database
• Proposal

Context (until recently)
• CHILDES tools (focus on CLAN)
  • Number of tools for multimedia data storage and analysis
  • Mostly deal with morphological and syntactic aspects of development
  • Not easily extensible
• What about phonology?
  • No CHILDES tool adapted for phonology
  • Data sharing and broad-based investigations are challenging

A first attempt
• ChildPhon (Rose 2003)
  • Analytical (relational) database for child language data
  • Designed within FileMaker Pro
• Main features
  • Interface for double-blind transcriptions
  • Automatic functions based on phonetic transcriptions:
    • Syllabification of transcribed forms
    • Detection of common processes observed in child language (e.g. onset cluster reduction)

Problems with ChildPhon
• No support for Unicode fonts
  • No cross-platform compatibility (Macintosh-only)
• Not compatible with CHILDES / TalkBank
  • No data exchange functions
• Automatic parses limited, not customizable
• Multimedia capabilities are minimal (at best)
• Requires use of proprietary software and font
• Algorithms are ‘destructive’
• Statistical functions are minimal
• No web implementation
In sum: Good idea -- Bad implementation

Phon: a more promising solution
• Interdisciplinary project (first of its kind between Linguistics and Computer Science at Memorial University of Newfoundland)
• Software designers and programmers: Rodrigue Byrne, Gregory Hedlund, Philip O'Brien, Yvan Rose, Harold Wareham
• Financial support:
  • Faculty of Arts, Memorial University
  • Social Sciences and Humanities Research Council of Canada (SSHRC)
  • Canada Foundation for Innovation (CFI)
  • National Science Foundation (NSF)

Phon: Overview
• Software underpinnings:
  • Programmed in Java, Unicode font encoding
  • XML data storage structure
  • Cross-platform compatible (Mac, Windows, …)
  • Compatible with TalkBank schema
• User management system
• Extended multimedia capabilities
• More flexible automatic algorithms
• Specialized query language
• Offers a complete solution for data sharing

Phon: usability
• Intuitive graphical user interface
  • Helpful wizards (e.g. project creation, queries)
  • Record navigator
  • Custom selection of data fields
    • General / record-by-record
• Intuitive query language
  • Standard terminology
  • Built-in queries (modifiable by user)
  • Query memorization and saving

Phon: main functions
• User management
• Media segmentation
• Phonetic transcription
• Transcription merging (selection of ‘final’ transcriptions for analysis)
• Phrase segmentation and alignment (further segmentation according to research needs)
• Syllable alignment (alignment of syllables of target and actual forms)
• Database query

User management
• Secure login
• User tasks / privileges management

Media segmentation
• Generally similar to CLAN
• Default segment length user-defined
• Hit the space bar to define a speech segment
• Useful for working on small speech segments
• Segment editing:
  • Change numerical value
  • ‘Stretch’ the time segment by sliding pointer
• Play
• Export sound clip
Note (Yvan Rose): Replace yellow line in segment “timebar” by waveform.

Transcription: general interface
• Media window
• Session info (drawer)
• Media controls
• Transcription window

Transcription
• Built-in IPA character map
  • Symbol ‘categories’
• Access to sound segment
• Interface for double-blind transcriptions
  • Tied with user management functions
Note (Yvan Rose): Link adult transcription to an electronic IPA dictionary. Need to develop a transcription system for sounds that can’t be transcribed easily.
• Ability to assign a feature set to a dummy character

Transcription merging
• Comparison of ‘competing’ transcriptions
• Direct access to media segment
• Selection of most accurate transcription
• Further refinement of selected transcription
Note (Yvan Rose): People requested an algorithm that would enable a comparison of transcriptions based on specific parameters (e.g. voicing). This algorithm could build on the feature sets associated with each segment transcribed.
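The feature-based comparison of transcriptions mentioned in the last note could look roughly like the sketch below. It is not Phon's algorithm: the feature table, the function name compare_on_feature, and the sample transcriptions are all hypothetical, and the sketch assumes that both transcriptions have the same number of single-character segments.

```python
# Hypothetical, deliberately tiny feature table (an assumption for this
# sketch; it does not reproduce Phon's actual feature sets).
FEATURES = {
    "p": {"voiced": False, "place": "labial"},
    "b": {"voiced": True, "place": "labial"},
    "t": {"voiced": False, "place": "coronal"},
    "d": {"voiced": True, "place": "coronal"},
    "k": {"voiced": False, "place": "dorsal"},
    "g": {"voiced": True, "place": "dorsal"},
    "a": {"voiced": True, "place": "vowel"},
    "i": {"voiced": True, "place": "vowel"},
}


def compare_on_feature(transcription_a, transcription_b, feature):
    """Return (index, seg_a, seg_b) for every position where the two
    transcriptions disagree on the given feature.

    Assumes equal-length transcriptions whose segments are single
    characters present in FEATURES.
    """
    if len(transcription_a) != len(transcription_b):
        raise ValueError("transcriptions must have the same number of segments")
    mismatches = []
    for index, (seg_a, seg_b) in enumerate(zip(transcription_a, transcription_b)):
        if FEATURES[seg_a][feature] != FEATURES[seg_b][feature]:
            mismatches.append((index, seg_a, seg_b))
    return mismatches


# Two made-up 'competing' transcriptions of the same segment of speech.
voicing_diffs = compare_on_feature("daki", "tagi", "voiced")
place_diffs = compare_on_feature("daki", "tagi", "place")
print(voicing_diffs)
print(place_diffs)
```

In this toy case the two transcribers disagree only on voicing, not on place, which is the kind of parameter-specific summary the note asks for. Real transcriptions would also need an alignment step for forms of unequal length.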
Phrase alignment
• Further segmentation of the utterances
• Useful for research on phonological domains
• A simple mouse click sets and resets the domain boundaries
Note (Yvan Rose): Several people requested different levels of segmentation. This includes morpho-syntactic levels of segmentation, as well as various levels of the prosodic hierarchy.

Syllabification algorithm
[Figure: syllabification of ‘constraints’, with each transcribed segment labeled by its syllabic position (O, N, R)]
• Refined labeling of each syllabic position
• Each label is a valid object for query

Syllabification algorithm
• Parameters of syllabification are user-definable
  • Timing tier
  • Syllable constituents
Note (Yvan Rose): The parameters will be revised thoroughly. To add (among others): word-final codas, list of exceptional clusters. Also add, to complement stress attraction, an option of ambisyllabic syllabification of intervocalic

Syllable alignment
• Automatic alignment of syllables
• Manual modifications

Query language
• Quick and accurate queries on large amounts of data
• Language features
  • Uses terms familiar to phonologists to compose queries
    • Syllable constituents: onset, nucleus, …
    • Stressed vs. unstressed syllables
  • Custom predicates
  • History of recent queries
  • Ability to save queries

Query language components
• Selectors (e.g. Onset(Syllable x))
• Predicates (e.g. Branching(Onset(Syllable x)))
• Boolean connectives
Example:
    let corpusName = "TestCorpus",
    let corpus = Corpus(corpusName),
    let records = Records(corpus)
    foreach r in records
      foreach p in Phrases(r)
        foreach s in Syllables(p)
          Branching(Onset(TargetSyllable(s))) AND NOT Branching(Onset(ActualSyllable(s)))

Query tree structure
Branching onset reduction in 2nd syllable
    branching(onset(pos(TargetPhrase, 2))) AND NOT branching(onset(pos(ActualPhrase, 2)))
[Figure: query tree for one record. TargetPhrase and ActualPhrase are each parsed into syllables with onset, rhyme, nucleus, and coda nodes. In the second syllable the target onset is D R (branching: TRUE) and the actual onset is D (branching: FALSE), so the AND NOT query returns MATCH.]

Query results
• View in application
• Use to generate textual reports
  • Recording session (e.g. to exemplify a given process)
  • Time slice (e.g. to exemplify a stage of acquisition)
  • Entire database (to exemplify a learning curve)
• Export
  • As Unicode file
  • As ASCII file (modulo font conversion limitations)

Enhancements (short term)
• Improvement of syllable alignment algorithm (building on Kondrak’s 2003 algorithm)
• Import function
  • ChildPhon files (including font translator -- almost done!)
  • CHAT files
• Incorporation of user-defined fields
• Incorporation of statistical functions
• Chart report generator
  • Ability to select various chart formats
    • Bar graphs (for proportions within and across sessions)
    • Line graphs (for learning curves)

Enhancements (longer term)
• Interoperability with Praat
  • Export to Praat (similar to CLAN function)
  • Interface to accommodate acoustic measurement data
• Web-based interface
  • Data sharing at a distance
  • Easy query of corpora on CHILDES database
• Further automation
  • Automatic detection of pre-identified processes
Note (Yvan Rose): Include function to extract phonetic inventories per session/stage/… Get examples of ‘canned’ analyses in literature on clinical phonology.
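As an illustration of what automatic detection of a pre-identified process involves, the branching-onset query from the query language slides can be restated in ordinary Python. This is a minimal sketch under stated assumptions, not Phon's implementation: the Syllable class, the branching and onset_reductions functions, and the toy data (a second-syllable onset D R in the target, D in the actual form, loosely following the query tree slide) are all hypothetical, and target and actual syllables are assumed to be already aligned one to one.

```python
from dataclasses import dataclass, field


@dataclass
class Syllable:
    """Hypothetical syllable record: each constituent is a list of segments."""
    onset: list = field(default_factory=list)
    nucleus: list = field(default_factory=list)
    coda: list = field(default_factory=list)


def branching(constituent):
    """A constituent branches when it contains more than one segment."""
    return len(constituent) > 1


def onset_reductions(aligned_pairs):
    """Return the 1-based positions of aligned (target, actual) syllable
    pairs where the target onset branches and the actual onset does not."""
    hits = []
    for position, (target, actual) in enumerate(aligned_pairs, start=1):
        if branching(target.onset) and not branching(actual.onset):
            hits.append(position)
    return hits


# Toy data: target T U N . D R A S, actual T U N . D A S.
target = [Syllable(["t"], ["u"], ["n"]), Syllable(["d", "r"], ["a"], ["s"])]
actual = [Syllable(["t"], ["u"], ["n"]), Syllable(["d"], ["a"], ["s"])]

reduced = onset_reductions(list(zip(target, actual)))
print(reduced)
```

The predicate mirrors the query's AND NOT structure, so an actual syllable whose onset was deleted entirely also counts as a match. Separating reduction from deletion would need one more condition on the actual onset.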
Development timeline
• End of fall of 2004
  • Completion of current development phase
  • Release of testing (Beta) version
• Winter of 2005
  • Bug fixes
  • Improvement of functionality and user interface (including short-term enhancements)
  • Website creation (http://www.phon.ca/)
  • Completion of technical documentation
    • Notes to programmers
    • User guide
• Summer of 2005
  • Release of Phon 1.0 as open-source freeware

Potential
• Standard for data sharing
  • Large-scale investigations
  • Cross-linguistic investigations
• Enhancement to CHILDES
  • Elaboration of a database fulfilling the needs of acquisitionists focussing on phonology and related issues
  • Investigation of interface issues (e.g. between morpho-syntax and phonology)

How to realize this potential
• Team of researchers specializing in:
  • Early acquisition (including babbling)
  • Segmental development
  • Prosodic development
  • Phonological disorders
  • Second language acquisition
  • …
• Feedback on software development project
• Data contribution
  • Existing corpora in digital format
  • Conversion of printed corpora
    • Identification of corpora (printed, with or without audio files)
    • Setting of conventions for data conversion

Our proposal
• Constitution of a research team to develop a phonological component of CHILDES
  • Database
  • Supporting software
• Elaboration, with the research team, of a grant application to support:
  • Database elaboration
  • Software development
  • Periodic meetings
  • Workshops
  • …

Concretely
• Feedback on software project
  • Software needs for various types of research
  • Implementation
  • Let us know how you want it to work
• Contribution to grant application
  • Let us know what you need
  • Kinds of research the new database would enable
  • Let us know what you would like to do
  • Impacts of this research (e.g. theoretical, clinical, …)
  • Supporting letters
• Contribution to the public database
  • Sharing of existing / future corpora
  • Establishment of conventions to format older corpora

Special thanks
• The ‘Phon’ team at Memorial: Rodrigue Byrne, Harold Wareham, Gregory Hedlund, Philip O’Brien
• For his great help with the TalkBank XML schema: Franklin Chen (Carnegie Mellon University)
• For their useful feedback on an early version of this software: Heather Goad (McGill), Paula Fikkert (Nijmegen), Clara Levelt (Leiden), Katherine Demuth (Brown), Mark Johnson (Brown), Carrie Dyck (Memorial), Phil Branigan (Memorial), Brian MacWhinney (Carnegie Mellon), Bryan Gick (UBC), Sophie Wauquier-Gravelines (Nantes), Sharon Inkelas (UC Berkeley), Conxita Lleó, Sonia Frota (Lisbon), Maria João Freitas (Lisbon), Ronald Sprouse (UC Berkeley), Joe Pater (UMass, Amherst), John Archibald (Calgary), Éliane Lebel (Memorial)
Hoping that no one was forgotten…