Austin Butts

Hearing Loss as a Health Issue
- 53 million people with severe (61+ dB) or greater hearing loss (Mathers et al. 2000)
- 14 million with profound (81+ dB) loss (Mathers et al. 2000)
- About $300,000 in lifetime costs per person with severe loss in the US (Mohr et al. 2000)

What are Cochlear Implants (CIs)?
- "The most successful neural prosthesis" (Zeng et al. 2008)
- Restore hearing by converting sounds into electrical pulses
- Stimulate the nerves of the inner ear

Milestones and Trends in CIs (Zeng et al. 2008)
- Preliminary work: 1950s-70s
- First single-channel electrode approved by the FDA in 1984
- Conversational speech understanding beginning in the 1990s

Indexical Properties
- Qualities of the speaker, not linguistic in nature
- Ex: identifying speakers, discriminating voice gender
- Deficits here also lead to problems in group contexts

Other Challenges for CI Users
- Telephone conversations: high variability of usage; users more commonly struggle with unfamiliar talkers and topics
- Speech intonation/prosody: e.g., statement vs. question
- Expense and risk: $40k-$100k and invasive surgery (ASHA 2015, AAO-HNS 2015)
- Variability in outcomes

What is Sensory Substitution?
- Converting information from one sense to another
- Accomplished by a machine interface
- Non-invasive: application of the system is external

Prior Work
- TVSS: camera to height map of a pin array (Bach-y-Rita 2004)
- Tongue electrotactile: camera to electrical stimulation of the tongue (Bach-y-Rita 2004)
- vOICe: image, treated as an audio spectrogram, converted into sound output (Auvray et al. 2007)
- Other examples: tactile-to-tactile substitution, balance aids (Kaczmarek et al. 1991)

General Considerations (Lenay et al. 2003)
- Issues in fundamental usefulness and widespread use
- These devices replace sensation, not perception: users require time to form associations
- "Substitution" is really addition: the stimulated physical areas still retain their old functions and contextual perceptions
- The typical formulation leaves out the crucial role of motor integration (Bach-y-Rita and Kercel 2003, Bach-y-Rita 2004); this is more obvious in visual systems

Applications for Cochlear Implants
- Same* perception framework as traditional applications
- Psychophysics theory implies increased sensitivity for multimodal systems
- Also applicable for 'addition'
- Sensorimotor regimes are more contentious: motor components are not crucial to linguistic and indexical speech perception, and space is not represented topographically in the human auditory system (Kandel et al. 2013)

Aspects of Multisensory Integration
- This application clearly involves it: cues from audition and another modality must be integrated to arrive at a single abstraction
- Speech is natively multisensory: information is not specific to a single mode and is often present, at least in part, in multiple channels (Rosenblum 2005, McGurk and MacDonald 1976, Hopyan-Misakyan et al. 2009)

Neural Mechanisms
- Deaf vs. normal hearing (NH): plasticity to tactile cues (Levänen et al. 1998, Auer et al. 2007)
- Small amounts of activation to vibrotactile cues in NH listeners (Schürmann et al. 2006)
- Lesion studies: voice quality and identity/familiarity (Van Lancker 1982, Kreiman 1997, Belin et al. 2000)
- Synesthesia following stroke: audio eliciting tactile sensations (Ro et al. 2007, Beauchamp and Ro 2008, Naumer and van den Bosch 2009)

Direct Spectral Mapping
- Principle: energy of spectral bands, mimicking the tonotopic nature of the cochlea/CIs (a sketch follows below)
- (De Filippo and Scott 1978, Sparks et al. 1979, Wada et al. 1996, Galvin et al. 1999)

Fundamental and Formant Isolation
- Principle: source/filter properties
- The source is the fundamental frequency (glottal pulse rate); formants are the peak frequencies of the filter
- (Rothenberg and Molitor 1979, Boothroyd 1986, Franklin et al. 1991)
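To make the direct spectral mapping principle concrete, here is a minimal sketch, assuming a Python/SciPy implementation: short-time energy in a few log-spaced bands becomes drive levels for one actuator per band. The band edges, frame length, and normalization are illustrative assumptions, not parameters of the cited systems.

```python
import numpy as np
from scipy.signal import stft

def band_energy_drive(x, fs, n_bands=5, lo=80.0, hi=6000.0):
    """Direct spectral mapping sketch: per-frame energy in log-spaced
    frequency bands, normalized to [0, 1] as drive levels for one
    actuator per band (mimicking the tonotopic layout of the cochlea)."""
    f, t, Z = stft(x, fs=fs, nperseg=512)
    power = np.abs(Z) ** 2
    edges = np.geomspace(lo, hi, n_bands + 1)
    drive = np.zeros((n_bands, power.shape[1]))
    for k in range(n_bands):
        rows = (f >= edges[k]) & (f < edges[k + 1])
        drive[k] = power[rows].sum(axis=0)
    return t, drive / (drive.max() + 1e-12)   # frame times, actuator levels
```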
Contemporary Work: VEST
- Conference demos, but no journal publications to date (Eagleman 2014, Novich and Eagleman 2014, Eagleman 2015)

Discussion of Prior Work
- These systems do confer some information (Osberger et al. 1991)
- Hardly any tests came out merely at chance level
- Performance only comparable to single-channel CIs; multichannel technology clearly has performance advantages after initial development (Pickett and McFarland 1985, Osberger et al. 1991)
- Prospects for multiple aids (i.e., combined with CIs) in linguistic applications appear dim: it is more difficult to demonstrate effectiveness when the baseline is already above chance
- The literature largely ignores indexical properties

Can it be done? Auditory-Tactile Cue Mapping
- Continuum-of-dimensions approach, contrasting with pattern-based approaches (Tan et al. 2003)
- Pattern-based mapping: we do not have the knowledge to design patterns for the particular results the user wants, and it is infeasible for the end goal (speaker ID) due to the sheer number of patterns required
- Information extraction might be a prerequisite to clarify salient cues

Cues in Speech and Hearing Science
- Rhythm, speaking rate, breathiness, nasality, pitch and intonation, formants, and dynamic articulatory cues (Sambur 1975, Cleary et al. 2005, Vongpaisal et al. 2010)
- Methods (Kreiman et al. 2005): manipulate samples to select or drop certain cues
- Additional (abstract) frameworks: Factor Analysis (FA) and Multidimensional Scaling (MDS)

Cues in Computer Science
- Mathematical approach based on the signal, not concerned with the speech apparatus
- Features: cepstral coefficients (Furui 1981, Gowdy and Tufekci 2000, Zheng et al. 2001); see the extraction sketch below
- Frameworks: Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial Neural Network (ANN), i-vectors

Reconciling the Two Approaches
- Both provide accurate descriptions; utility depends on the application
- Abstraction of features is weighed against complexity of algorithms
- Sensory substitution systems with human-machine interfaces need to be mindful of both: what is salient and what can be categorized

Neural and Psychophysical Descriptions
- The skin is described as having four types of tactile receptors (Johansson and Vallbo 1979, Johansson and Vallbo 1980), each defined by the size of its receptive field and the speed of neural adaptation (Johansson and Vallbo 1983)
- Inquiry here has more to do with perceptual dimensions: the contributions of specific receptors are complex (Gescheider et al. 2002, Sherrick et al. 1990) and can't be elicited specifically in any device implementation

Potential Dimensions
- Static materials: roughness, hardness, possibly compressional elasticity (Hollins et al. 1993)
- Vibrotactile: mechanical pitch and loudness, temporal envelope (Melara and Day 1992, Park and Choi 2011)
- Specific systems: spatial location, frequency, intensity, waveform, duration, rhythm, and roughness or temporal envelope (Jones et al. 2009, Brown et al. 2006, Cholewiak et al. 2001)
- Difficult to infer the discriminability of two arbitrary stimuli within the current framework

Mapping Cues to Dimensions
- Assign each auditory dimension to a dimension of the device
- Non-trivial for more than one dimension
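To make the "Cues in Computer Science" slide concrete, here is a minimal sketch of cepstral-feature extraction, assuming Python with the librosa library; the frame settings and the use of 13 coefficients are common defaults, not parameters taken from the cited work.

```python
import librosa
import numpy as np

def mean_mfcc(path, n_mfcc=13):
    """Load a speech segment and return its mean MFCC vector, the kind
    of per-segment feature used in Experiment 3 below."""
    y, sr = librosa.load(path, sr=16000)              # TIMIT audio is 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # average over frames
```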
Potential Issues (Kreiman et al. 2005)
- Findings are specific to the experimental context
- Individual variations in salience

Experiments 1 & 2: Trial and Full Study
- Test fundamental frequency mapped to the vertical (height) spatial dimension
- Task: identify the gender of the speaker

Experiment 3: Computational
- Simulate identification of the speaker using a linear discrimination procedure

Initial Work: Experiment 1

Equipment
- Chair with vibrotactile (haptic) array
- Arduino controller
- Computer (interface)

[Figure: Schematic of the chair with vibrotactile array, with design dimensions (units in mm) and the frequency range (Hz) assigned to each actuator row: (80, 94.57], (94.57, 132.1], (132.1, 184.6], (184.6, 258.0], [258.0, 305)]

Stimuli Processing
- Audio: speech sentences from the TIMIT database (Fisher et al. 1986); CI simulations made by processing TIMIT files through an 8-channel noise vocoder (AngelSim) (Emily Shannon Fu Foundation 2014)
- Vibrotactile: fundamental frequency (extracted with Praat) converted to vibration patterns; large (500 ms) and small (50 ms) windows displayed on the left and right halves, respectively
- (Sketches of the F0-to-row mapping and a generic noise vocoder follow the result slides below)

[Figure: F0 ranges of a representative male and a representative female speaker against the five actuator frequency bands]

Session Files
- 3 blocks of 52 trials each; speaker gender balanced within blocks
- Block types: normal audio, CI simulation, CI simulation + haptics
- Order of blocks originally randomized and counterbalanced among subjects; not completely balanced due to ending the experiment before completion
- No feedback on correct responses

Participants
- 12 participants recruited (10 M, 2 F)
- Compensated for a 1-hour maximum session

Session Procedure
- Load a spreadsheet (CSV) of file directories in the specified order
- Web interface: play a segment, then prompt the user, "Please select the gender of the speaker" (Male or Female)
- Pause and prompt to continue
- Likert-scale survey at the end

Factors
- A: order of stimulus blocks
- B(A): subject (nested within order)
- C: type of stimulus

Response Variables
- Raw data: correct trials and time to respond
- Final metrics: accuracy, response time, and bias (Donaldson 1992)

Transformation Techniques
- Principle: stabilize the variance
- Accuracy: arcsine square root (Vollset 1993), derived from the variance being a function of the mean
- Response time: inverse (1/RT), derived from a Box-Cox test (Montgomery 2012)
- Bias: none; an appropriate transformation is difficult to determine
- (A code sketch of these transforms follows the result slides below)

Transformation for accuracy: $y^{*} = \arcsin\sqrt{y}$, with back-transformed confidence interval $c = \sin^{2}\left(y^{*} \pm z_{\alpha/2}/(2\sqrt{n})\right)$

ANOVA F-Tests
- Model: restricted nested mixed effects
- F-tests for each factor reflect the model

Post-hoc Tests
- Single fixed factors: Tukey's test
- Contrasts (interactions): Scheffé's method for comparing all contrasts
- Correlations of subject-wise factors, per the underlying prediction; a sign test when dealing with potentially non-normal data

Accuracy
- B(A) [subject] and C [stimulus type] significant
- Normal audio at ceiling, above both CI and CI + haptics; no difference between CI and CI + haptics
- AC [stimulus order × type]: would reflect learning of the CI simulation between blocks and absolute order effects; no significant and meaningful post-hoc results

Response Time
- AC [stimulus order × type]: significant F statistic, but again no meaningful post-hoc contrasts
- B(A) [subject] and B(A)C [subject × type] significant; no significant correlations between B(A)C and B(A)

Bias
- Overall bias towards "male"
- No factors are significant
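As referenced in the Stimuli Processing slide, here is a minimal sketch of how an F0 track could be quantized onto the five actuator rows, assuming Python/NumPy. The band edges come from the device schematic; the function name and the unvoiced-frame handling are illustrative.

```python
import numpy as np

# Row frequency bands (Hz) from the device schematic, lowest row first.
BAND_EDGES = np.array([80.0, 94.57, 132.1, 184.6, 258.0, 305.0])

def f0_to_row(f0_hz):
    """Map one F0 estimate (e.g., from Praat) to an actuator row index
    0-4, or None for unvoiced or out-of-range frames."""
    if f0_hz <= 0:                       # Praat marks unvoiced frames as 0
        return None
    row = int(np.searchsorted(BAND_EDGES, f0_hz) - 1)
    return row if 0 <= row <= 4 else None

# A typical male F0 (~120 Hz) falls in the second row from the bottom;
# a typical female F0 (~210 Hz) falls in the fourth.
print(f0_to_row(120.0), f0_to_row(210.0))   # -> 1 3
```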
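The CI-simulation step can be sketched in the same spirit. Below is a generic 8-channel noise vocoder in Python/SciPy; it is not a reimplementation of AngelSim, and the corner frequencies, filter orders, and envelope cutoff are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocoder(x, fs, n_channels=8, lo=80.0, hi=6000.0):
    """Generic n-channel noise vocoder: band-pass analysis, envelope
    extraction, and envelope-modulated band-limited noise carriers."""
    edges = np.geomspace(lo, hi, n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for k in range(n_channels):
        sos = butter(4, [edges[k], edges[k + 1]], btype="bandpass",
                     fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                   # amplitude envelope
        sos_lp = butter(2, 160.0, btype="lowpass", fs=fs, output="sos")
        env = sosfilt(sos_lp, env)                    # smooth the envelope
        carrier = sosfilt(sos, rng.standard_normal(len(x)))
        out += env * carrier                          # band-limited noise
    return out / (np.max(np.abs(out)) + 1e-12)        # normalize
```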
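Finally, a sketch of the two variance-stabilizing transforms from the Transformation Techniques slide, assuming Python/SciPy. The confidence interval follows the formula given above; the synthetic response-time data exist only to make the Box-Cox step runnable.

```python
import numpy as np
from scipy import stats

def arcsine_sqrt_ci(p_hat, n, alpha=0.05):
    """Accuracy transform y* = arcsin(sqrt(p)) and the back-transformed
    confidence interval c = sin^2(y* +/- z_{alpha/2} / (2 sqrt(n)))."""
    y = np.arcsin(np.sqrt(p_hat))
    half = stats.norm.ppf(1 - alpha / 2) / (2.0 * np.sqrt(n))
    return np.sin(y - half) ** 2, np.sin(y + half) ** 2

print(arcsine_sqrt_ci(0.75, 52))        # CI for 75% correct over 52 trials

# Box-Cox check for response times: data whose reciprocal is normal
# should yield a fitted lambda near -1, supporting the 1/RT transform.
rng = np.random.default_rng(0)
rt = 1.0 / rng.normal(2.0, 0.3, size=500)
_, lam = stats.boxcox(rt)
print(lam)
```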
Which of the following interpretations best accounts for the two modes seen in the data?
- By-Subject Model
- By-Training Model

Performance Statistics
- Lack of a fundamental difference between modalities; it appears the chair does not contribute meaningful information
- Subjects vary in overall performance, and within modalities for response time

Why Utilizing the Chair Is Difficult
- Speculation: lack of instruction for most participants, combined with two data streams and no direct feedback
- Bias in answer choice
- Participants might have made flawed associations

Full Design: Experiment 2

Equipment
- Comparable to the first experiment
- Different laptop

Stimuli Processing
- In addition to the CI-simulation segments, a matching set with AMR file compression applied before simulation, to mimic a phone network
- The two streams of vibrotactile patterns (large vs. small time windows) separated into two different stimulus sets

[Figure: F0 ranges of a representative male and a representative female speaker against the five actuator frequency bands]

Session Files
- 3 blocks of 80 trials each (16 designated training segments, 64 normal); speaker gender balanced within blocks
- Block types: CI simulation, haptics alone, CI simulation + haptics
- Order of blocks fully randomized and counterbalanced among subjects
- Feedback given on training segments

Participants
- 18 different participants recruited (10 M, 8 F), compensated for a 1-hour maximum session
- All informed of the mapping

Session Procedure
- Similar to the Experiment 1 session
- Choice layout randomized
- Training segments now have the correct answer displayed afterwards on the continue screen

Factors and Attributes
- A: order of stimuli
- B(A): subject (nested within order)
- C: type of stimulus
- D: type of auditory stimulus
- E: type of haptic stimulus
- H: block halves
- Adds two within-block factors (D and E) and considers which half of the block a trial falls in (H)
- Duration and distance from the center F0 of the files considered in a separate analysis

Response Variables
- Accuracy and response time
- Bias: choice and layout
- Likert scores also considered

Transform Techniques
- Same techniques as Experiment 1

ANOVA Stages
- Not all factors can be crossed with the others (nonsense combinations)
- Separate ANOVAs are completed, each with all of its factors crossed

ANOVA F-Values and Tests
- Model: restricted nested mixed effects, a different variety now with additional factors and invalid terms

Post-hoc Tests
- Same kinds of tests as Experiment 1
- Linear models for fitting an outcome based on predictors

Accuracy
- Fixed: C [stimulus type]: combined stimuli show a greater effect than either modality alone; haptics trends higher than CI (not significant)
- Random: B(A), B(A)C, B(A)D [subject alone, subject × type, subject × audio]: no correlations found within or between random factors
- Note: D [audio] and A [order] are marginal; D tends towards compression having a negative effect

Response Time
- Fixed: C [stimulus type]: haptics alone slower than both CI and combined; no significant difference between CI and combined
- Fixed: AD and ADE [order × within-block factors]: nothing significant found relating to block order
- Random: B(A) and B(A)C [subject and subject × type]: no significant correlations within or between random factors

Biases
- No significant bias for speaker gender (choice) or left/right (layout)

File Parameters on Accuracy
- Accuracy modeled against (i) the duration of the segment and (ii) cross-modal distance from the center
- Coefficients for both variables are significant; distance has the larger effect (a fitting sketch follows below)

Linear Model for Accuracy vs. File Parameters
  Source       Estimate   Std. Error   t-statistic   p-value
  (Intercept)  0.94088    0.041263     22.802        1.0843E-61
  Distance     0.26559    0.02958      8.9789        8.6729E-17
  Duration     0.021475   0.0087065    2.4665        0.014352
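A sketch of how such a model can be fit, assuming plain ordinary least squares in Python/NumPy; the arrays here are synthetic placeholders for the per-file accuracy, cross-modal distance, and duration values.

```python
import numpy as np

def fit_accuracy_model(acc, distance, duration):
    """OLS fit of accuracy ~ intercept + distance + duration,
    returning the coefficient estimates in that order."""
    X = np.column_stack([np.ones_like(distance), distance, duration])
    beta, *_ = np.linalg.lstsq(X, acc, rcond=None)
    return beta

# Synthetic illustration only; real inputs are per-file statistics.
rng = np.random.default_rng(0)
dist = rng.uniform(0, 1, 100)
dur = rng.uniform(1, 5, 100)
acc = 0.9 + 0.25 * dist + 0.02 * dur + rng.normal(0, 0.05, 100)
print(fit_accuracy_model(acc, dist, dur))
```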
ANOVA on Likert Scores
- Only the effect of C is significant
- Haptics and combined conditions rated perceptually easier than CI alone, but not significantly different from each other (trend)

Predicting Likert Scores from Performance
- Neither predictor is significant
- Both trended as expected, and the effect for accuracy was marginal

Linear Model for Likert vs. Objective Performance
  Source       Estimate   Std. Error   t-statistic   p-value
  (Intercept)  8.3753     2.2305       3.7549        0.00044586
  Accuracy     -3.0754    1.5776       -1.9494       0.056754
  RespTime     -1.2722    1.8331       -0.69402      0.49082

Splitting Trials in Half
- No effects on accuracy
- Effects of H, B(A)H, and B(A)CH on response time

Multimodal Enhancement
- Alternative model: subjects simply utilize whichever single modality works for them (no multisensory regime)
- Is it typical for subjects to utilize both modalities?
- Method 1: accuracy for one modality above chance, and combined accuracy above that single modality (two ways to step); both ways of stepping are significant
- Method 2: combined accuracy above both the CI-alone and haptics-alone scores; marginal, not quite significant

Performance Metrics
- Type of stimulus (C) has significant effects, with interesting interplay: combined modalities yield higher accuracy than CI alone without sacrificing reaction time (as happens with haptics alone)
- Variability of random factors is a constant theme: not all subjects react to stimuli the same way
- D alone just barely misses significance for accuracy, but there is variability in how subjects react to compressed audio (B(A)D)
- Fixed factors related to learning, when they show up as significant in the ANOVA, are not significant for meaningful contrasts
- Failed to show any correlations in random factors; further demonstration is needed to confirm strong independent relations

Biases
- The revised experimental setup appears to fix the bias issue

File Parameters on Accuracy
- Longer files help accuracy, but not nearly as much as having distinct stimuli (large cross-modal distance)

Likert Scores
- Similar trends across C (with different significant post-hoc results)
- More difficult to show how much influence accuracy and response time have on the scores

Splitting Trials in Half
- Overall increase in speed, with significant variation between and within subjects

Multimodal Enhancement
- How typical multimodal usage is remains contentious; the second method may be more susceptible to error
- Indicative, but requires further testing with this as the primary hypothesis
- Can still show the existence of some multimodal subjects and average effects

Experiment 3
- Want to see how Mel-frequency cepstral coefficient (MFCC) features correspond to the fundamental frequency
- Male and female speakers are separated well by a hyperplane in MFCC feature space (129 of 130 correct in both groups)

[Figure: Linear discriminant representation for classifying voice gender]

- Broad categorization and correlation hold, but are suspect within groups: much lower variance explained

Linear Models for Hyperplane Distance vs. Mean log F0
  Model         Slope Estimate   Slope Std. Error   t-statistic   p-value       Adjusted R²
  Both Genders  1.6882           0.04354            38.773        1.3038E-109   0.853
  Male Only     0.54248          0.12418            4.3686        2.5581E-05    0.123
  Female Only   1.5207           0.12636            12.035        8.8334E-23    0.527

Classifier Simulation
- Goal: see whether a linear classifier can identify the speaker from the mean MFCC vectors of speech segments
- Reduce the dimensionality of the MFCCs so as to maximize the variance between the speaker means (the presumed device operation); a sketch of one simulated trial follows below
- Parameters: number of speakers 2, 3, 5, 7, 10, 15, 20 (n of 260); dimensionality of the space, integers 1-12; duration, up to 5 seconds in 0.25-second increments
- 1000 trials per parameter combination (maximum error ±3%)
- Train with the 3 SI segments for each speaker; test with random selections of SX sentences until the required duration is reached
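A compact sketch of one simulated trial under the presumed pipeline, in Python/NumPy: project per-segment mean MFCC vectors onto the leading principal directions of the speaker means (maximizing between-mean variance), then classify by nearest speaker prototype. The function names, the nearest-mean rule, and the tiny synthetic check are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def simulate_trial(train_means, test_vecs, test_labels, n_dims):
    """One simulated identification trial.

    train_means: dict of speaker -> mean MFCC vector (from SI segments).
    test_vecs / test_labels: mean MFCC vectors of test selections and
    their true speakers. Returns accuracy over the test set.
    """
    speakers = sorted(train_means)
    M = np.stack([train_means[s] for s in speakers])
    center = M.mean(axis=0)
    # Keep the n_dims axes with the most between-speaker-mean variance.
    _, _, Vt = np.linalg.svd(M - center, full_matrices=False)
    W = Vt[:n_dims].T
    proto = (M - center) @ W                  # reduced speaker prototypes
    hits = 0
    for vec, label in zip(test_vecs, test_labels):
        z = (vec - center) @ W
        pred = speakers[int(np.argmin(np.linalg.norm(proto - z, axis=1)))]
        hits += (pred == label)
    return hits / len(test_labels)

# Tiny synthetic check: three "speakers" in a 13-D feature space.
rng = np.random.default_rng(0)
means = {s: rng.normal(size=13) for s in "abc"}
tests = [(means[s] + 0.1 * rng.normal(size=13), s)
         for s in "abc" for _ in range(10)]
vecs, labels = zip(*tests)
print(simulate_trial(means, np.array(vecs), list(labels), n_dims=2))
```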
Duration Plots
- Varying the number of dimensions (speakers = 10): increasing dimensions raises accuracy
- Varying the number of speakers (dimensions = 3): increasing speakers lowers accuracy
- Accuracy plateaus quickly with duration

[Figure: raw accuracy and scaled performance level vs. duration]

  Number of Dimensions   Number of Speakers   Estimate   95% CI
  1                      2                    0.7980     0.7720 - 0.8217
  1                      20                   0.0800     0.0647 - 0.0985
  12                     2                    0.9460     0.9302 - 0.9584
  12                     20                   0.5840     0.5532 - 0.6142

- Best performance with full-dimensional representations
- Dimensionality reduction leads to substantial problems, especially for moderate to large numbers of speakers
- Some information is conveyed, but not enough for a usable implementation; a different mathematical approach is needed

Conclusions
- Sensory substitution devices can support perception of indexical qualities of speech, even in subjects already aided by simulations of CIs
- Mapping and procedure make all the difference
- A recurring theme of variation among subjects
- Existence, and possible prevalence, of utilizing information in a multimodal fashion
- More sophisticated models are needed to convey speaker ID in reduced dimensions

Limitations of CI Simulation
- CI simulation is really no true substitute for real patients, though the scores observed are not too different
- Stepping subjects through familiarization with the vocoder (Fu et al. 2004, Fu et al. 2005, Gonzalez and Oliver 2005)
- A more rigorous procedure is needed to acclimate subjects

Device Components
- Robustness of conclusions to different actuators and implementations
- Microphone and sensorimotor integration

Mapping Algorithms
- Test against a categorical approach
- Different mathematical framework and possibly different features

User Study Tasks
- Logistics of building a speaker-ID experiment (database and procedure)
- Validate the task itself in normal-hearing people
- Simultaneous task (intelligibility)

End of Presentation