Austin Butts

Hearing Loss as a Health Issue
- 53 million people with severe (61+ dB) or greater hearing loss (Mathers et al. 2000)
- 14 million with profound (81+ dB) loss (Mathers et al. 2000)
- About $300,000 in lifetime costs per person with severe loss in the US (Mohr et al. 2000)

What are Cochlear Implants (CIs)?
- "The most successful neural prosthesis" (Zeng et al. 2008)
- Restore hearing by converting sounds into electrical pulses
- Stimulate the nerves of the inner ear

Milestones and Trends in CIs (Zeng et al. 2008)
- Preliminary work: 1950s-70s
- First single-channel electrode approved by the FDA in 1984
- Conversational speech understanding beginning in the 1990s

Indexical Properties
- Qualities of the speaker, not linguistic in nature
- Ex: identifying speakers, discriminating voice gender
- Deficits here also lead to problems in group contexts

Other Challenges for CI Users
- Telephone conversations: high variability of usage; users more commonly struggle with unfamiliar talkers and topics
- Speech intonation/prosody: e.g., statement vs. question
- Expense and risk: $40k-$100k and invasive surgery (ASHA 2015, AAO-HNS 2015)
- Variability in outcomes

What is Sensory Substitution?
- Converting information from one sense to another
- Accomplished by a machine interface
- Non-invasive: application of the system is external

Prior Work
- TVSS: camera to height map of a pin array (Bach-y-Rita 2004)
- Tongue electrotactile: camera to electrical stimulation of the tongue (Bach-y-Rita 2004)
- vOICe: image, treated as an audio spectrogram, converted into sound output (Auvray et al. 2007)
- Other examples: tactile-to-tactile substitution, balance aids (Kaczmarek et al. 1991)

General Considerations (Lenay et al. 2003)
- Issues in fundamental usefulness and widespread use
- These devices replace sensation, not perception: users require time to form associations
- "Substitution" is really addition: the stimulated physical areas still retain their old functions and contextual perceptions
- The typical formulation leaves out the crucial role of motor integration (Bach-y-Rita and Kercel 2003, Bach-y-Rita 2004); this is more obvious in visual systems

Applications for Cochlear Implants
- Same* perception framework as traditional applications
- Psychophysics theory implies increased sensitivity for multimodal systems
- Also applicable for 'addition'
- Sensorimotor regimes are more contentious: motor components are not crucial to linguistic and indexical speech perception, and space is not represented topographically in the human auditory system (Kandel et al. 2013)

Aspects of Multisensory Integration
- This application clearly involves it: cues from audition and another modality must be integrated to arrive at a single abstraction
- Speech is natively multisensory: information is not specific to a single mode and is often present, at least in part, in multiple channels (Rosenblum 2005, McGurk and MacDonald 1976, Hopyan-Misakyan et al. 2009)

Neural Mechanisms
- Deaf vs. normal hearing (NH): plasticity to tactile cues (Levänen et al. 1998, Auer et al. 2007)
- Small amounts of activation to vibrotactile cues in NH listeners (Schürmann et al. 2006)
- Lesion studies: voice quality and identity/familiarity (Van Lancker 1982, Kreiman 1997, Belin et al. 2000)
- Synesthesia following stroke: audio eliciting tactile sensations (Ro et al. 2007, Beauchamp and Ro 2008, Naumer and van den Bosch 2009)

Direct Spectral Mapping
- Principle: energy of spectral bands, mimicking the tonotopic nature of the cochlea/CIs (a sketch follows below)
- (De Filippo and Scott 1978, Sparks et al. 1979, Wada et al. 1996, Galvin et al. 1999)

Fundamental and Formant Isolation
- Principle: source/filter properties
- The source is the fundamental frequency (glottal pulse rate); formants are the peak frequencies of the filter
- (Rothenberg and Molitor 1979, Boothroyd 1986, Franklin et al. 1991)
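To make the direct spectral mapping principle concrete, here is a minimal sketch, assuming a Python/SciPy implementation: short-time energy in a few log-spaced bands becomes drive levels for one actuator per band. The band edges, frame length, and normalization are illustrative assumptions, not parameters of the cited systems.

```python
import numpy as np
from scipy.signal import stft

def band_energy_drive(x, fs, n_bands=5, lo=80.0, hi=6000.0):
    """Direct spectral mapping sketch: per-frame energy in log-spaced
    frequency bands, normalized to [0, 1] as drive levels for one
    actuator per band (mimicking the tonotopic layout of the cochlea)."""
    f, t, Z = stft(x, fs=fs, nperseg=512)
    power = np.abs(Z) ** 2
    edges = np.geomspace(lo, hi, n_bands + 1)
    drive = np.zeros((n_bands, power.shape[1]))
    for k in range(n_bands):
        rows = (f >= edges[k]) & (f < edges[k + 1])
        drive[k] = power[rows].sum(axis=0)
    return t, drive / (drive.max() + 1e-12)   # frame times, actuator levels
```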
Contemporary Work: VEST
- Conference demos, but no journal publications to date (Eagleman 2014, Novich and Eagleman 2014, Eagleman 2015)

Discussion of Prior Work
- These systems do confer some information (Osberger et al. 1991)
- Hardly any tests came out merely at chance level
- Performance only comparable to single-channel CIs; multichannel technology clearly has performance advantages after initial development (Pickett and McFarland 1985, Osberger et al. 1991)
- Prospects for multiple aids (i.e., combined with CIs) in linguistic applications appear dim: it is more difficult to demonstrate effectiveness when the baseline is already above chance
- The literature largely ignores indexical properties

Can it be done? Auditory-Tactile Cue Mapping
- Continuum-of-dimensions approach, contrasting with pattern-based approaches (Tan et al. 2003)
- Pattern-based mapping: we do not have the knowledge to design patterns for the particular results the user wants, and it is infeasible for the end goal (speaker ID) due to the sheer number of patterns required
- Information extraction might be a prerequisite to clarify salient cues

Cues in Speech and Hearing Science
- Rhythm, speaking rate, breathiness, nasality, pitch and intonation, formants, and dynamic articulatory cues (Sambur 1975, Cleary et al. 2005, Vongpaisal et al. 2010)
- Methods (Kreiman et al. 2005): manipulate samples to select or drop certain cues
- Additional (abstract) frameworks: Factor Analysis (FA) and Multidimensional Scaling (MDS)

Cues in Computer Science
- Mathematical approach based on the signal, not concerned with the speech apparatus
- Features: cepstral coefficients (Furui 1981, Gowdy and Tufekci 2000, Zheng et al. 2001); see the extraction sketch below
- Frameworks: Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial Neural Network (ANN), i-vectors

Reconciling the Two Approaches
- Both provide accurate descriptions; utility depends on the application
- Abstraction of features is weighed against complexity of algorithms
- Sensory substitution systems with human-machine interfaces need to be mindful of both: what is salient and what can be categorized

Neural and Psychophysical Descriptions
- The skin is described as having four types of tactile receptors (Johansson and Vallbo 1979, Johansson and Vallbo 1980), each defined by the size of its receptive field and the speed of neural adaptation (Johansson and Vallbo 1983)
- Inquiry here has more to do with perceptual dimensions: the contributions of specific receptors are complex (Gescheider et al. 2002, Sherrick et al. 1990) and can't be elicited specifically in any device implementation

Potential Dimensions
- Static materials: roughness, hardness, possibly compressional elasticity (Hollins et al. 1993)
- Vibrotactile: mechanical pitch and loudness, temporal envelope (Melara and Day 1992, Park and Choi 2011)
- Specific systems: spatial location, frequency, intensity, waveform, duration, rhythm, and roughness or temporal envelope (Jones et al. 2009, Brown et al. 2006, Cholewiak et al. 2001)
- Difficult to infer the discriminability of two arbitrary stimuli within the current framework

Mapping Cues to Dimensions
- Assign each auditory dimension to a dimension of the device
- Non-trivial for more than one dimension
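To make the "Cues in Computer Science" slide concrete, here is a minimal sketch of cepstral-feature extraction, assuming Python with the librosa library; the frame settings and the use of 13 coefficients are common defaults, not parameters taken from the cited work.

```python
import librosa
import numpy as np

def mean_mfcc(path, n_mfcc=13):
    """Load a speech segment and return its mean MFCC vector, the kind
    of per-segment feature used in Experiment 3 below."""
    y, sr = librosa.load(path, sr=16000)              # TIMIT audio is 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # average over frames
```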
Potential Issues (Kreiman et al. 2005)
- Findings are specific to the experimental context
- Individual variations in salience

Experiments 1 & 2: Trial and Full Study
- Test fundamental frequency mapped to the vertical (height) spatial dimension
- Task: identify the gender of the speaker

Experiment 3: Computational
- Simulate identification of the speaker using a linear discrimination procedure

Initial Work: Experiment 1

Equipment
- Chair with vibrotactile (haptic) array
- Arduino controller
- Computer (interface)

[Figure: Schematic of the chair with vibrotactile array, with design dimensions (units in mm) and the frequency range (Hz) assigned to each actuator row: (80, 94.57], (94.57, 132.1], (132.1, 184.6], (184.6, 258.0], [258.0, 305)]

Stimuli Processing
- Audio: speech sentences from the TIMIT database (Fisher et al. 1986); CI simulations made by processing TIMIT files through an 8-channel noise vocoder (AngelSim) (Emily Shannon Fu Foundation 2014)
- Vibrotactile: fundamental frequency (extracted with Praat) converted to vibration patterns; large (500 ms) and small (50 ms) windows displayed on the left and right halves, respectively
- (Sketches of the F0-to-row mapping and a generic noise vocoder follow the result slides below)

[Figure: F0 ranges of a representative male and a representative female speaker against the five actuator frequency bands]

Session Files
- 3 blocks of 52 trials each; speaker gender balanced within blocks
- Block types: normal audio, CI simulation, CI simulation + haptics
- Order of blocks originally randomized and counterbalanced among subjects; not completely balanced due to ending the experiment before completion
- No feedback on correct responses

Participants
- 12 participants recruited (10 M, 2 F)
- Compensated for a 1-hour maximum session

Session Procedure
- Load a spreadsheet (CSV) of file directories in the specified order
- Web interface: play a segment, then prompt the user, "Please select the gender of the speaker" (Male or Female)
- Pause and prompt to continue
- Likert-scale survey at the end

Factors
- A: order of stimulus blocks
- B(A): subject (nested within order)
- C: type of stimulus

Response Variables
- Raw data: correct trials and time to respond
- Final metrics: accuracy, response time, and bias (Donaldson 1992)

Transformation Techniques
- Principle: stabilize the variance
- Accuracy: arcsine square root (Vollset 1993), derived from the variance being a function of the mean
- Response time: inverse (1/RT), derived from a Box-Cox test (Montgomery 2012)
- Bias: none; an appropriate transformation is difficult to determine
- (A code sketch of these transforms follows the result slides below)

Transformation for accuracy: $y^{*} = \arcsin\sqrt{y}$, with back-transformed confidence interval $c = \sin^{2}\left(y^{*} \pm z_{\alpha/2}/(2\sqrt{n})\right)$

ANOVA F-Tests
- Model: restricted nested mixed effects
- F-tests for each factor reflect the model

Post-hoc Tests
- Single fixed factors: Tukey's test
- Contrasts (interactions): Scheffé's method for comparing all contrasts
- Correlations of subject-wise factors, per the underlying prediction; a sign test when dealing with potentially non-normal data

Accuracy
- B(A) [subject] and C [stimulus type] significant
- Normal audio at ceiling, above both CI and CI + haptics; no difference between CI and CI + haptics
- AC [stimulus order × type]: would reflect learning of the CI simulation between blocks and absolute order effects; no significant and meaningful post-hoc results

Response Time
- AC [stimulus order × type]: significant F statistic, but again no meaningful post-hoc contrasts
- B(A) [subject] and B(A)C [subject × type] significant; no significant correlations between B(A)C and B(A)

Bias
- Overall bias towards "male"
- No factors are significant
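As referenced in the Stimuli Processing slide, here is a minimal sketch of how an F0 track could be quantized onto the five actuator rows, assuming Python/NumPy. The band edges come from the device schematic; the function name and the unvoiced-frame handling are illustrative.

```python
import numpy as np

# Row frequency bands (Hz) from the device schematic, lowest row first.
BAND_EDGES = np.array([80.0, 94.57, 132.1, 184.6, 258.0, 305.0])

def f0_to_row(f0_hz):
    """Map one F0 estimate (e.g., from Praat) to an actuator row index
    0-4, or None for unvoiced or out-of-range frames."""
    if f0_hz <= 0:                       # Praat marks unvoiced frames as 0
        return None
    row = int(np.searchsorted(BAND_EDGES, f0_hz) - 1)
    return row if 0 <= row <= 4 else None

# A typical male F0 (~120 Hz) falls in the second row from the bottom;
# a typical female F0 (~210 Hz) falls in the fourth.
print(f0_to_row(120.0), f0_to_row(210.0))   # -> 1 3
```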
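The CI-simulation step can be sketched in the same spirit. Below is a generic 8-channel noise vocoder in Python/SciPy; it is not a reimplementation of AngelSim, and the corner frequencies, filter orders, and envelope cutoff are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocoder(x, fs, n_channels=8, lo=80.0, hi=6000.0):
    """Generic n-channel noise vocoder: band-pass analysis, envelope
    extraction, and envelope-modulated band-limited noise carriers."""
    edges = np.geomspace(lo, hi, n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for k in range(n_channels):
        sos = butter(4, [edges[k], edges[k + 1]], btype="bandpass",
                     fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                   # amplitude envelope
        sos_lp = butter(2, 160.0, btype="lowpass", fs=fs, output="sos")
        env = sosfilt(sos_lp, env)                    # smooth the envelope
        carrier = sosfilt(sos, rng.standard_normal(len(x)))
        out += env * carrier                          # band-limited noise
    return out / (np.max(np.abs(out)) + 1e-12)        # normalize
```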
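Finally, a sketch of the two variance-stabilizing transforms from the Transformation Techniques slide, assuming Python/SciPy. The confidence interval follows the formula given above; the synthetic response-time data exist only to make the Box-Cox step runnable.

```python
import numpy as np
from scipy import stats

def arcsine_sqrt_ci(p_hat, n, alpha=0.05):
    """Accuracy transform y* = arcsin(sqrt(p)) and the back-transformed
    confidence interval c = sin^2(y* +/- z_{alpha/2} / (2 sqrt(n)))."""
    y = np.arcsin(np.sqrt(p_hat))
    half = stats.norm.ppf(1 - alpha / 2) / (2.0 * np.sqrt(n))
    return np.sin(y - half) ** 2, np.sin(y + half) ** 2

print(arcsine_sqrt_ci(0.75, 52))        # CI for 75% correct over 52 trials

# Box-Cox check for response times: data whose reciprocal is normal
# should yield a fitted lambda near -1, supporting the 1/RT transform.
rng = np.random.default_rng(0)
rt = 1.0 / rng.normal(2.0, 0.3, size=500)
_, lam = stats.boxcox(rt)
print(lam)
```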
Which of the following interpretations best accounts for the two modes seen in the data?
- By-Subject Model
- By-Training Model

Performance Statistics
- Lack of a fundamental difference between modalities; it appears the chair does not contribute meaningful information
- Subjects vary in overall performance, and within modalities for response time

Why Utilizing the Chair Is Difficult
- Speculation: lack of instruction for most participants, combined with two data streams and no direct feedback
- Bias in answer choice
- Participants might have made flawed associations

Full Design: Experiment 2

Equipment
- Comparable to the first experiment
- Different laptop

Stimuli Processing
- In addition to the CI-simulation segments, a matching set with AMR file compression applied before simulation, to mimic a phone network
- The two streams of vibrotactile patterns (large vs. small time windows) separated into two different stimulus sets

[Figure: F0 ranges of a representative male and a representative female speaker against the five actuator frequency bands]

Session Files
- 3 blocks of 80 trials each (16 designated training segments, 64 normal); speaker gender balanced within blocks
- Block types: CI simulation, haptics alone, CI simulation + haptics
- Order of blocks fully randomized and counterbalanced among subjects
- Feedback given on training segments

Participants
- 18 different participants recruited (10 M, 8 F), compensated for a 1-hour maximum session
- All informed of the mapping

Session Procedure
- Similar to the Experiment 1 session
- Choice layout randomized
- Training segments now have the correct answer displayed afterwards on the continue screen

Factors and Attributes
- A: order of stimuli
- B(A): subject (nested within order)
- C: type of stimulus
- D: type of auditory stimulus
- E: type of haptic stimulus
- H: block halves
- Adds two within-block factors (D and E) and considers which half of the block a trial falls in (H)
- Duration and distance from the center F0 of the files considered in a separate analysis

Response Variables
- Accuracy and response time
- Bias: choice and layout
- Likert scores also considered

Transform Techniques
- Same techniques as Experiment 1

ANOVA Stages
- Not all factors can be crossed with the others (nonsense combinations)
- Separate ANOVAs are completed, each with all of its factors crossed

ANOVA F-Values and Tests
- Model: restricted nested mixed effects, a different variety now with additional factors and invalid terms

Post-hoc Tests
- Same kinds of tests as Experiment 1
- Linear models for fitting an outcome based on predictors

Accuracy
- Fixed: C [stimulus type]: combined stimuli show a greater effect than either modality alone; haptics trends higher than CI (not significant)
- Random: B(A), B(A)C, B(A)D [subject alone, subject × type, subject × audio]: no correlations found within or between random factors
- Note: D [audio] and A [order] are marginal; D tends towards compression having a negative effect

Response Time
- Fixed: C [stimulus type]: haptics alone slower than both CI and combined; no significant difference between CI and combined
- Fixed: AD and ADE [order × within-block factors]: nothing significant found relating to block order
- Random: B(A) and B(A)C [subject and subject × type]: no significant correlations within or between random factors

Biases
- No significant bias for speaker gender (choice) or left/right (layout)

File Parameters on Accuracy
- Accuracy modeled against (i) the duration of the segment and (ii) cross-modal distance from the center
- Coefficients for both variables are significant; distance has the larger effect (a fitting sketch follows below)

Linear Model for Accuracy vs. File Parameters
  Source       Estimate   Std. Error   t-statistic   p-value
  (Intercept)  0.94088    0.041263     22.802        1.0843E-61
  Distance     0.26559    0.02958      8.9789        8.6729E-17
  Duration     0.021475   0.0087065    2.4665        0.014352
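A sketch of how such a model can be fit, assuming plain ordinary least squares in Python/NumPy; the arrays here are synthetic placeholders for the per-file accuracy, cross-modal distance, and duration values.

```python
import numpy as np

def fit_accuracy_model(acc, distance, duration):
    """OLS fit of accuracy ~ intercept + distance + duration,
    returning the coefficient estimates in that order."""
    X = np.column_stack([np.ones_like(distance), distance, duration])
    beta, *_ = np.linalg.lstsq(X, acc, rcond=None)
    return beta

# Synthetic illustration only; real inputs are per-file statistics.
rng = np.random.default_rng(0)
dist = rng.uniform(0, 1, 100)
dur = rng.uniform(1, 5, 100)
acc = 0.9 + 0.25 * dist + 0.02 * dur + rng.normal(0, 0.05, 100)
print(fit_accuracy_model(acc, dist, dur))
```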
ANOVA on Likert Scores
- Only the effect of C is significant
- Haptics and combined conditions rated perceptually easier than CI alone, but not significantly different from each other (trend)

Predicting Likert Scores from Performance
- Neither predictor is significant
- Both trended as expected, and the effect for accuracy was marginal

Linear Model for Likert vs. Objective Performance
  Source       Estimate   Std. Error   t-statistic   p-value
  (Intercept)  8.3753     2.2305       3.7549        0.00044586
  Accuracy     -3.0754    1.5776       -1.9494       0.056754
  RespTime     -1.2722    1.8331       -0.69402      0.49082

Splitting Trials in Half
- No effects on accuracy
- Effects of H, B(A)H, and B(A)CH on response time

Multimodal Enhancement
- Alternative model: subjects simply utilize whichever single modality works for them (no multisensory regime)
- Is it typical for subjects to utilize both modalities?
- Method 1: accuracy for one modality above chance, and combined accuracy above that single modality (two ways to step); both ways of stepping are significant
- Method 2: combined accuracy above both the CI-alone and haptics-alone scores; marginal, not quite significant

Performance Metrics
- Type of stimulus (C) has significant effects, with interesting interplay: combined modalities yield higher accuracy than CI alone without sacrificing reaction time (as happens with haptics alone)
- Variability of random factors is a constant theme: not all subjects react to stimuli the same way
- D alone just barely misses significance for accuracy, but there is variability in how subjects react to compressed audio (B(A)D)
- Fixed factors related to learning, when they show up as significant in the ANOVA, are not significant for meaningful contrasts
- Failed to show any correlations in random factors; further demonstration is needed to confirm strong independent relations

Biases
- The revised experimental setup appears to fix the bias issue

File Parameters on Accuracy
- Longer files help accuracy, but not nearly as much as having distinct stimuli (large cross-modal distance)

Likert Scores
- Similar trends across C (with different significant post-hoc results)
- More difficult to show how much influence accuracy and response time have on the scores

Splitting Trials in Half
- Overall increase in speed, with significant variation between and within subjects

Multimodal Enhancement
- How typical multimodal usage is remains contentious; the second method may be more susceptible to error
- Indicative, but requires further testing with this as the primary hypothesis
- Can still show the existence of some multimodal subjects and average effects

Experiment 3
- Want to see how Mel-frequency cepstral coefficient (MFCC) features correspond to the fundamental frequency
- Male and female speakers are separated well by a hyperplane in MFCC feature space (129 of 130 correct in both groups)

[Figure: Linear discriminant representation for classifying voice gender]

- Broad categorization and correlation hold, but are suspect within groups: much lower variance explained

Linear Models for Hyperplane Distance vs. Mean log F0
  Model         Slope Estimate   Slope Std. Error   t-statistic   p-value       Adjusted R²
  Both Genders  1.6882           0.04354            38.773        1.3038E-109   0.853
  Male Only     0.54248          0.12418            4.3686        2.5581E-05    0.123
  Female Only   1.5207           0.12636            12.035        8.8334E-23    0.527

Classifier Simulation
- Goal: see whether a linear classifier can identify the speaker from the mean MFCC vectors of speech segments
- Reduce the dimensionality of the MFCCs so as to maximize the variance between the speaker means (the presumed device operation); a sketch of one simulated trial follows below
- Parameters: number of speakers 2, 3, 5, 7, 10, 15, 20 (n of 260); dimensionality of the space, integers 1-12; duration, up to 5 seconds in 0.25-second increments
- 1000 trials per parameter combination (maximum error ±3%)
- Train with the 3 SI segments for each speaker; test with random selections of SX sentences until the required duration is reached
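A compact sketch of one simulated trial under the presumed pipeline, in Python/NumPy: project per-segment mean MFCC vectors onto the leading principal directions of the speaker means (maximizing between-mean variance), then classify by nearest speaker prototype. The function names, the nearest-mean rule, and the tiny synthetic check are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def simulate_trial(train_means, test_vecs, test_labels, n_dims):
    """One simulated identification trial.

    train_means: dict of speaker -> mean MFCC vector (from SI segments).
    test_vecs / test_labels: mean MFCC vectors of test selections and
    their true speakers. Returns accuracy over the test set.
    """
    speakers = sorted(train_means)
    M = np.stack([train_means[s] for s in speakers])
    center = M.mean(axis=0)
    # Keep the n_dims axes with the most between-speaker-mean variance.
    _, _, Vt = np.linalg.svd(M - center, full_matrices=False)
    W = Vt[:n_dims].T
    proto = (M - center) @ W                  # reduced speaker prototypes
    hits = 0
    for vec, label in zip(test_vecs, test_labels):
        z = (vec - center) @ W
        pred = speakers[int(np.argmin(np.linalg.norm(proto - z, axis=1)))]
        hits += (pred == label)
    return hits / len(test_labels)

# Tiny synthetic check: three "speakers" in a 13-D feature space.
rng = np.random.default_rng(0)
means = {s: rng.normal(size=13) for s in "abc"}
tests = [(means[s] + 0.1 * rng.normal(size=13), s)
         for s in "abc" for _ in range(10)]
vecs, labels = zip(*tests)
print(simulate_trial(means, np.array(vecs), list(labels), n_dims=2))
```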
Duration Plots
- Varying the number of dimensions (speakers = 10): increasing dimensions raises accuracy
- Varying the number of speakers (dimensions = 3): increasing speakers lowers accuracy
- Accuracy plateaus quickly with duration

[Figure: raw accuracy and scaled performance level vs. duration]

  Number of Dimensions   Number of Speakers   Estimate   95% CI
  1                      2                    0.7980     0.7720 - 0.8217
  1                      20                   0.0800     0.0647 - 0.0985
  12                     2                    0.9460     0.9302 - 0.9584
  12                     20                   0.5840     0.5532 - 0.6142

- Best performance with full-dimensional representations
- Dimensionality reduction leads to substantial problems, especially for moderate to large numbers of speakers
- Some information is conveyed, but not enough for a usable implementation; a different mathematical approach is needed

Conclusions
- Sensory substitution devices can support perception of indexical qualities of speech, even in subjects already aided by simulations of CIs
- Mapping and procedure make all the difference
- A recurring theme of variation among subjects
- Existence, and possible prevalence, of utilizing information in a multimodal fashion
- More sophisticated models are needed to convey speaker ID in reduced dimensions

Limitations of CI Simulation
- CI simulation is really no true substitute for real patients, though the scores observed are not too different
- Stepping subjects through familiarization with the vocoder (Fu et al. 2004, Fu et al. 2005, Gonzalez and Oliver 2005)
- A more rigorous procedure is needed to acclimate subjects

Device Components
- Robustness of conclusions to different actuators and implementations
- Microphone and sensorimotor integration

Mapping Algorithms
- Test against a categorical approach
- Different mathematical framework and possibly different features

User Study Tasks
- Logistics of building a speaker-ID experiment (database and procedure)
- Validate the task itself in normal-hearing people
- Simultaneous task (intelligibility)

End of Presentation