Thesis Presentation Final 2

Austin Butts

Hearing Loss as a Health Issue
 53 million with severe (61+ dB) or greater (Mathers et al. 2000)
 14 million profound (81+ dB) loss (Mathers et al. 2000)
 Lifetime costs of about $300,000 per person with severe loss in the US (Mohr et al. 2000)

What are Cochlear Implants (CIs)
 "the most successful neural prosthesis" (Zeng et al. 2008)
 Restore hearing by converting sounds into electrical pulses
 Stimulate the nerves of the inner ear

Milestones and Trends in CIs (Zeng et al. 2008)
 Preliminary work: 1950s - 70s
 First single channel electrode approved by FDA in 1984
 Conversational speech beginning in 1990s

Indexical Properties
 Qualities of the speaker, not linguistic in nature
 Ex: Identifying speakers, discriminating voice gender
 Difficulty perceiving them also leads to problems in group contexts

Others
 Telephone conversations: High variability of usage; difficulty is more common with unfamiliar talkers and topics
 Speech intonation/Prosody: Statement vs. question
 Expense and Risk: $40k - $100k, invasive
(ASHA 2015, AAO-HNS 2015)

Variability in Outcomes

What is Sensory Substitution?
 Convert information about one sense to another
 Accomplished by a machine interface
 Non-invasive, application of system is external

Prior Work
 TVSS: camera to height map of pin array (Bach-y-Rita 2004)
 Tongue electrotactile: camera to electrically stimulate
tongue (Bach-y-Rita 2004)
 vOICe: image, treated as audio spectrogram, into sound
output (Auvray et al. 2007)
 Other examples: tactile-tactile, balance aids (Kaczmarek et al. 1991)
General Considerations (Lenay et al. 2003)
 Issues in fundamental usefulness and widespread use
 Replace sensation with perception
 Require time to form associations

Sensory 'substitution' is actually sensory addition
 Physical areas still retain old functions and contextual perceptions

Typical formulation leaves out crucial role of motor integration
(Bach-y-Rita and Kercel 2003, Bach-y-Rita 2004)
 More obvious in visual systems
Applications for Cochlear Implants
 Same* perception framework as traditional applications
 Psychophysics theory implies an increased sensitivity for multimodal systems


Also applicable for 'addition'
Sensorimotor regimes more contentious
 Motor components not crucial to linguistic and indexical speech perception
 Space not represented topographically in human system (Kandel et al. 2013)
Aspects of Multisensory Integration
 Multisensory integration clearly applies in this application
 Requires integrating cues from audition and another modality to
arrive at a single abstraction

Speech is natively multisensory
 Information is not specific to a single mode, often present in multiple
channels at least in part (Rosenblum 2005, McGurk and MacDonald 1976, Hopyan-Misakyan et al. 2009)
Neural Mechanisms
 Deaf vs. NH: Plasticity to tactile cues (Levänen et al. 1998, Auer et al. 2007)
 Small amounts of activation to vibrotactile cues in NH (Schürmann et al. 2006)
 Lesion studies: voice quality and ID/familiarity (Van Lancker 1982, Kreiman 1997, Belin et al. 2000)
 Synesthesia following stroke: audio eliciting tactile sensations (Ro et al. 2007, Beauchamp and Ro 2008, Naumer and van den Bosch 2009)
Direct Spectral Mapping
 Principle: Energy of spectral bands
 Mimic tonotopic nature of cochleae/CIs
(De Filippo and Scott 1978, Sparks et al. 1979, Wada et al. 1996, Galvin et al. 1999)
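A minimal sketch of the direct spectral mapping principle, assuming NumPy/SciPy; the band edges, frame length, and actuator count are illustrative, not those of the cited devices:

```python
# Sketch: split the spectrum into log-spaced bands (mimicking tonotopy) and drive one
# actuator per band with that band's energy. Band edges and counts are illustrative.
import numpy as np
from scipy.signal import stft

def band_energies(signal, fs, band_edges_hz):
    """Per-frame energy in each spectral band, one band per actuator."""
    f, t, Z = stft(signal, fs=fs, nperseg=256)
    power = np.abs(Z) ** 2                      # power spectrogram (freq x time)
    energies = np.empty((len(band_edges_hz) - 1, power.shape[1]))
    for i in range(len(band_edges_hz) - 1):
        mask = (f >= band_edges_hz[i]) & (f < band_edges_hz[i + 1])
        energies[i] = power[mask].sum(axis=0)   # energy of band i in every frame
    return energies

fs = 16000
speech = np.random.randn(fs)                    # stand-in for one second of speech
edges = np.geomspace(100, 8000, num=9)          # 8 log-spaced bands
drive = band_energies(speech, fs, edges)
drive /= drive.max()                            # normalize to actuator intensity range [0, 1]
```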
Fundamental and Formant Isolation
 Principle: Source/filter properties
 Source is fundamental frequency/glottal pulse rate
 Formants are peak frequencies of filter
(Rothenberg and Molitor 1979, Boothroyd 1986, Franklin et al. 1991)
Contemporary Work
 VEST
 Conference demos, but no journal publications to date
(Eagleman 2014, Novich and Eagleman 2014, Eagleman 2015)
Discussion
 Prior tactile aids confer some information (Osberger et al. 1991)
 Hardly any tests were merely at chance level

 Only comparable to single-electrode CIs
 Multichannel technology clearly has performance advantages after initial development (Pickett and McFarland 1985, Osberger et al. 1991)
 Prospects for multiple aids (i.e., with CIs) in linguistic applications appear dim
 More difficult to demonstrate effectiveness when baseline is already above chance
 The literature largely ignores indexical properties



Can it be done?
Auditory-Tactile cue mapping
Continuum of dimensions approach
 Contrasts with pattern-based approaches (Tan et al. 2003)
 Pattern-based designs require knowing the particular results the user wants, which we do not have
 Infeasible for the end goal (speaker ID) given the sheer number of patterns required

Information extraction might be a
prerequisite to clarify salient cues
Cues in Speech and Hearing Science
 Rhythm, speaking rate, breathiness, nasality, pitch and intonation, formants, and dynamic articulatory cues (Sambur 1975, Cleary et al. 2005, Vongpaisal et al. 2010)
 Methods (Kreiman et al. 2005)
 Manipulate samples to select or drop certain cues
 Additional (abstract) frameworks: Factor Analysis (FA) and Multidimensional Scaling (MDS)
Cues in Computer Science
 Mathematical approach based on the signal, not concerned with the
speech apparatus
 Features: Cepstral coefficients (Furui 1981, Gowdy and Tufekci 2000, Zheng et al. 2001)
 Frameworks:
 Gaussian Mixture Model (GMM)
 Hidden Markov Model (HMM)
 Support Vector Machine (SVM)
 Artificial Neural Network (ANN)
 I-vectors
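A minimal sketch of the feature-plus-framework idea, assuming librosa and scikit-learn are available; the file name, coefficient count, and mixture size are placeholders rather than values from the cited work:

```python
# Sketch: extract MFCCs from a speech file and fit one of the listed frameworks (a GMM)
# to characterize a speaker. "speech.wav" and all parameters are placeholders.
import librosa
from sklearn.mixture import GaussianMixture

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # 13 coefficients x frames

# One GMM per speaker is a common framework: fit on that speaker's frames,
# then score new segments against every speaker's model and pick the best.
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(mfcc.T)
avg_log_likelihood = gmm.score(mfcc.T)                    # higher = better match
```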
Reconciling the Two Approaches
 Both provide accurate descriptions, utility
depends on the application
 Abstraction of features against complexity of
algorithms
 Sensory substitution systems with human-machine interfaces need to be mindful of both
 What is salient and what can be categorized
Neural and Psychophysical Descriptions
 Skin described as having four types of tactile
receptors
(Johansson and Vallbo 1979, Johansson and Vallbo 1980)
 Each defined by size of receptor field and speed of
neural adaptation
(Johansson and Vallbo 1983)
Inquiry here has more to do with perceptual
dimensions
 Contributions of specific receptors are complex

(Gescheider et al. 2002, Sherrick et al. 1990)
 Can't be elicited specifically in any device
implementation
Potential Dimensions
 Static materials: Roughness, hardness, possibly
compressional elasticity
(Hollins et al. 1993)

Vibrotactile: Mechanical pitch and loudness, temporal
envelope
(Melara and Day 1992, Park and Choi 2011)

Specific systems: Spatial location, frequency,
intensity, waveform, duration, rhythm, and roughness
or temporal envelope
(Jones et al. 2009, Brown et al. 2006, Cholewiak et al. 2001)

Difficult to infer discriminability of two different
arbitrary stimuli with current framework

Assign each auditory dimension to a
dimension of the device
 Non-trivial for more than one dimension

Potential issues (Kreiman et al. 2005)
Specific to the experimental context
 Individual variations in salience

Experiments 1&2: Trial and Full Study
 Test fundamental frequency mapped to the
height spatial dimension
 Task: Identify the gender of the speaker

Experiment 3: Computational
 Simulate identification of speaker using linear
discrimination procedure
Initial Work: Experiment 1
Equipment
 Chair
 Vibrotactile (Haptic)
Array
 Arduino Controller
 Computer (interface)
Chair with Vibrotactile Array

[Schematic of Device Design Dimensions (units in mm); actuator rows cover Frequency Ranges (Hz): (80, 94.57], [94.57, 132.1], [132.1, 184.6], [184.6, 258.0], [258.0, 305)]
Stimuli Processing
 Audio
 Speech sentences from TIMIT database (Fisher et al. 1986)
 CI simulations: process TIMIT files through 8 channel
noise vocoder (AngelSim)
(Emily Shannon Fu Foundation 2014)

 Vibrotactile
 Fundamental frequency (Praat) converted to patterns (mapping sketched below the figure)
 Large (500 ms) and small (50 ms) windows displayed on the left and right respectively
[Figure: Frequency Ranges (Hz) of the actuator rows: (80, 94.57], [94.57, 132.1], [132.1, 184.6], [184.6, 258.0], [258.0, 305); with the ranges of a representative male and female speaker marked]
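A minimal sketch of the F0-to-row mapping, using the band edges from the schematic; the F0 track is a placeholder array rather than actual Praat output:

```python
# Sketch: assign each voiced F0 frame to one of the five actuator rows whose
# frequency ranges are shown above. The F0 track here is a placeholder.
import numpy as np

ROW_EDGES_HZ = [80.0, 94.57, 132.1, 184.6, 258.0, 305.0]   # 5 rows, low to high

def f0_to_row(f0_hz):
    """Row index (0 = lowest) for a voiced F0 value, or None if out of range."""
    if not (ROW_EDGES_HZ[0] < f0_hz <= ROW_EDGES_HZ[-1]):
        return None
    return int(np.searchsorted(ROW_EDGES_HZ, f0_hz) - 1)

# Placeholder F0 track sampled every 10 ms (NaN = unvoiced frame).
f0_track = np.array([np.nan, 110.0, 112.0, 115.0, 220.0, 230.0, np.nan])
rows = [f0_to_row(f) for f in f0_track if not np.isnan(f)]
# rows -> [1, 1, 1, 3, 3]: second row for the ~110 Hz frames, fourth row for ~225 Hz
```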
Session Files
 3 blocks of 52 trials each
 Balanced speaker gender within blocks
 Normal audio, CI simulation, CI sim + haptics
 Order of blocks originally randomized and
counterbalanced among subjects
 Not completely balanced due to ending
experiment before completion

 No feedback given on the correctness of any responses
Participants
 12 participants recruited (10 M and 2 F)
 Compensated for a 1-hour max. session
Session Procedure
 Load spreadsheet (CSV) with file directories in
specified order
 Web interface
 Play segment, prompt the user
 “Please select the gender of the speaker”; Male or
Female
 Pause and prompt to continue
 Likert scale survey at the end
Factors
 Order of Stimulus: A
 Subject: B(A)
 Type of Stimulus: C
Response Variables
 Raw data: correct trials and time to respond
 Final metrics
 Accuracy
 Response time
 Bias (Donaldson 1992)
Transformation Techniques
 Principle: stabilize variance
 Accuracy: arcsine square root (Vollset 1993)
 Derived from variance as a function of the mean
 Response Time: inverse
 Derived from Box-Cox test (Montgomery 2012)
 Bias: none
 A suitable transformation is difficult to determine
Transformation for Accuracy

$$y^{*} = \arcsin\sqrt{y}$$

Confidence Interval

$$c = \sin^{2}\!\left(y^{*} \pm \frac{z_{\alpha/2}}{2\sqrt{n}}\right)$$
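A minimal sketch of applying the transform and back-transforming the interval, assuming a 95% critical value (z = 1.96), which the slide does not state:

```python
# Sketch: arcsine square-root transform of a proportion and a back-transformed
# confidence interval. The z value (95%) is an assumption for illustration.
import numpy as np

def arcsine_ci(accuracy, n_trials, z=1.96):
    """Transform, build a symmetric CI on the transformed scale, back-transform."""
    y_star = np.arcsin(np.sqrt(accuracy))          # y* = arcsin(sqrt(y))
    half_width = z / (2.0 * np.sqrt(n_trials))     # variance of y* is roughly 1/(4n)
    lo = np.sin(np.clip(y_star - half_width, 0.0, np.pi / 2)) ** 2
    hi = np.sin(np.clip(y_star + half_width, 0.0, np.pi / 2)) ** 2
    return y_star, (lo, hi)

y_star, (lo, hi) = arcsine_ci(accuracy=0.75, n_trials=52)   # e.g. one 52-trial block
```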
ANOVA F-Tests
 Model: restricted nested mixed effects
 F tests for each factor reflect the model
Post-hoc Tests
 Single fixed factors: Tukey’s test
 Contrasts (interactions): Scheffé's Method for
comparing all contrasts
 Correlations of subject-wise factors: underlying
prediction
 Sign test when dealing with potentially non-normal data
Accuracy
 B(A) [subject]
 C [stim type]
 Normal at ceiling, above CI and
CI+Haptics
 No difference between CI and
CI+Haptics

AC [stim order*type]
 Involve learning CI sim between
blocks and absolute order learning
 No significant and meaningful
post-hoc results
Response Time
 AC [stim order*type]
 Significant F stat, but again no
meaningful post-hoc


 B(A) [subject]
 B(A)C [subject*type]
 No sig. correlations between B(A)C and B(A)
Bias
 Overall towards male
 No factors are significant
Which of the following interpretations best accounts for
two modes seen in the data?
By-Subject Model
By-Training Model

 Performance statistics
 Lack of fundamental difference between modalities; it appears the chair does not contribute meaningful information
 Subjects vary in overall performance, and within modalities for response time
 Why utilizing the chair is difficult
 We speculate a lack of instruction for most participants, combined with two data streams and no direct feedback
 Bias in answer choice
 Subjects might have made flawed associations
Full Design: Experiment 2
Equipment
 Comparable to first experiment
 Different laptop
Stimuli Processing
 In addition to CI simulation segments, also made a matching set with AMR file compression applied before simulation
 Mimics a phone network
 Separated the two streams of vibrotactile patterns (time window size) into two different sets
[Figure: Frequency Ranges (Hz) of the actuator rows: (80, 94.57], [94.57, 132.1], [132.1, 184.6], [184.6, 258.0], [258.0, 305); with the ranges of a representative male and female speaker marked]
Session Files
 3 blocks of 80 trials each (16 specific training segments, 64 normal)
 Balanced speaker gender within blocks
 CI simulation, haptics alone, CI sim + haptics
 Order of blocks fully randomized and counterbalanced among
subjects
 Feedback on training segments
Participants
 18 different participants recruited (10 M and 8 F), compensated for
a max 1 hour session
 All informed of mapping
Session Procedure
 Similar session
 Choice layout randomized
 Now training segments have correct answers displayed afterwards
on continue screen
Code   Factor
A      Order of Stimuli
B(A)   Subject
C      Type of Stimulus
D      Type of Auditory Stimulus
E      Type of Haptic Stimulus
H      Block Halves
Factors and Attributes
 Add two within-block factors
 Type of audio stimulus: D
 Type of haptic stimulus: E


 Consider which half of the block a trial falls in: H
 Consider duration and distance from center F0 of files in a separate analysis
Response Variables
 Accuracy
 Response time
 Bias: choice and layout
 Also consider the Likert scores

Transform Techniques
 Same techniques

ANOVA Stages
 Not all the factors can be crossed with others; some combinations are nonsensical
 Separate ANOVAs are completed that have all factors
crossed

ANOVA F-values and Tests
 Model: Restricted nested mixed effects, different variety
now with additional factors and invalid terms

Post-hoc Tests
 Same kinds of tests
 Linear models for fitting an outcome based on predictors
Accuracy
 Fixed: C [stim type]
 Combined stimuli have greater effect
than either modality alone
 Haptic trends higher than CI (not sig.)

 Random: B(A), B(A)C, B(A)D [subject alone, subject*type, and subject*audio]
 No correlations within or between
random factors found

Note: D [audio] and A [order] are
marginal
 D tends towards compression having
negative effect
Response Time
 Fixed: C [stim type]
 Haptic alone slower than both CI and
combined
 No significant difference between CI
and combined

Fixed: AD and ADE [order*within]
 Nothing significant found relating to
block order

Random: B(A), B(A)C [subject and
subject*type]
 No significant correlations within and
between random factors
Biases
 No significant bias for
speaker gender (choice) or
L/R (layout)
File Parameters on Accuracy
 Against (i) the duration of the segment and (ii) cross-modal distance from the center
 Coefficients from both variables significant
 Distance having the larger effect

Linear Model for Accuracy vs. File Parameters
Source        Estimate    St. Error    t-statistic   p-value
(Intercept)   0.94088     0.041263     22.802        1.0843E-61
Distance      0.26559     0.02958      8.9789        8.6729E-17
Duration      0.021475    0.0087065    2.4665        0.014352
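A minimal sketch of fitting this kind of per-file linear model with statsmodels; the predictor and accuracy arrays are random placeholders, not the experimental data behind the table:

```python
# Sketch: ordinary least squares of per-file accuracy on cross-modal distance from the
# center and segment duration. All arrays below are random placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
distance = rng.random(240)            # cross-modal distance from the center F0
duration = rng.random(240) * 5.0      # segment duration in seconds
accuracy = rng.random(240)            # per-file proportion correct

X = sm.add_constant(np.column_stack([distance, duration]))
model = sm.OLS(accuracy, X).fit()
print(model.summary())                # coefficients, standard errors, t-statistics, p-values
```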
ANOVA on Likert Scores
 Only effect from C is
significant
 Haptics and combined
conditions perceptually
easier than CI alone
 But not significant between
themselves (trend)
Predict Likert from
Performance
 Neither significant
 Both trended as expected
and effect for accuracy was
marginal
Linear Model for Likert vs. Objective Performance
Source        Estimate    St. Error   t-statistic   p-value
(Intercept)   8.3753      2.2305      3.7549        0.00044586
Accuracy      -3.0754     1.5776      -1.9494       0.056754
RespTime      -1.2722     1.8331      -0.69402      0.49082
Splitting Trials in Half
 No effects on accuracy
 Effect of H, B(A)H, and B(A)CH on response time
Multimodal Enhancement
 Alternative model: subjects just utilize the one which
works for them (no multisensory regime)
 Is it typical for subjects to utilize both modalities?
 Method 1: Accuracy for a modality above chance, and
combined above that single modality (two ways to step)
 Both ways of stepping are significant

Method 2: Accuracy for combined above both CI and
haptic alone scores
 Marginal, not quite significant
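A minimal sketch of one way to evaluate the stepping criteria with sign/binomial tests (SciPy); the per-subject accuracies are placeholders and the thesis procedure may differ in detail:

```python
# Sketch: sign tests for (1) a single modality beating chance and (2) the combined
# condition beating that modality, across subjects. Accuracy values are placeholders.
import numpy as np
from scipy.stats import binomtest

single = np.array([0.55, 0.60, 0.52, 0.70, 0.65, 0.58])     # per-subject, one modality
combined = np.array([0.62, 0.66, 0.59, 0.75, 0.70, 0.61])   # per-subject, combined

# Step 1: is the single modality above chance (0.5) for most subjects?
above_chance = int(np.sum(single > 0.5))
p_step1 = binomtest(above_chance, n=len(single), p=0.5, alternative="greater").pvalue

# Step 2: is the combined condition above that modality for most subjects?
improved = int(np.sum(combined > single))
p_step2 = binomtest(improved, n=len(single), p=0.5, alternative="greater").pvalue
```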
Performance Metrics
 Type of stimulus (C) has significant effects, with
interesting interplay
 Combined modalities result in higher accuracy above CI alone
without having to sacrifice reaction time (as with haptics alone)

Variability of random factors a constant theme
 Not all subjects react to stimuli the same
 D alone is just barely not significant in accuracy, but we do see variability in how subjects react to compressed audio (B(A)D)
 Fixed factors related to learning, when they show up
significant in ANOVA, are not significant for meaningful
contrasts
 Fail to show any correlations in random factors, need
further demonstration to confirm strong independent
relations

Biases
 Experimental setup appears to fix the issue with bias
File Parameters on Accuracy
 Longer files help with accuracy, but not nearly as much as having distinct stimuli for cross-modal distance
Likert Scores
 See similar trends across C (different significant post-hoc results)
 More difficult to show significance in how much
influence accuracy and response time have on scores
Splitting Trials in Half
 Overall increase in speed, and also significant
variation between and within subjects
Multimodal Enhancement
 It remains contentious how much typical multimodal usage can be confirmed
 Second method may be more susceptible to error
 Indicative, but requires further testing with this being
the primary hypothesis
 Can still show existence of some multimodal subjects
and average effects
Experiment 3


 Want to see how Mel-frequency Cepstral Coefficient (MFCC) features correspond to the fundamental frequency
 Male and female speakers separated well by a hyperplane in MFCC feature space (129 out of 130 in both groups)

Linear Discriminant Representation for Classifying Voice Gender
Broad categorization and correlation
 Correlation is suspect within gender groups
 Much lower variance explained
Linear Models for Hyperplane Distance to Mean log F0
Model          Slope Estimate   Slope Std. Error   t-statistic   p-value       Adjusted R2
Both Genders   1.6882           0.04354            38.773        1.3038E-109   0.853
Male Only      0.54248          0.12418            4.3686        2.5581E-05    0.123
Female Only    1.5207           0.12636            12.035        8.8334E-23    0.527
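A minimal sketch of the analysis summarized in the table, assuming LDA (scikit-learn) and OLS (statsmodels) stand in for the discriminant and regression used; the data here are random placeholders:

```python
# Sketch: separate genders with a linear discriminant in mean-MFCC space, take each
# speaker's signed distance from the hyperplane, and regress it on mean log F0.
import numpy as np
import statsmodels.api as sm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 260
mfcc_means = rng.normal(size=(n, 13))             # mean MFCC vector per speaker (placeholder)
gender = np.repeat([0, 1], n // 2)                # 0 = male, 1 = female
log_f0 = rng.normal(size=n)                       # mean log F0 per speaker (placeholder)

lda = LinearDiscriminantAnalysis().fit(mfcc_means, gender)
distance = lda.decision_function(mfcc_means)      # signed distance from the hyperplane

fit = sm.OLS(distance, sm.add_constant(log_f0)).fit()
print(fit.params, fit.rsquared_adj)               # slope/intercept and adjusted R^2
```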

 Goal: See if a linear classifier can succeed in identifying the speaker based on the mean MFCC vectors of speech segments
 Reduce dimensions of MFCCs to maximize the variance between means (presumed device operation)
Parameters
 Number of speakers: 2, 3, 5, 7, 10, 15, 20; n of 260
 Dimensionality of space: integers 1-12
 Duration: up to 5 seconds in 0.25 second increments
 1000 trials per parameter combination (max error +/- 3%)
 Train with the 3 SI segments for each speaker
 Random selections of SX sentences until the required duration is reached for testing
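A minimal sketch of a single trial of this simulation, assuming LDA is an acceptable stand-in for the reduction that maximizes between-speaker variance; synthetic placeholder vectors replace the TIMIT MFCC means:

```python
# Sketch: reduce mean-MFCC vectors with LDA and classify each test segment by the
# nearest speaker mean in the reduced space. Data are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_speakers, n_dims, mfcc_dim = 10, 3, 13

# Training: mean MFCC vector for each of the 3 SI sentences per speaker.
centroids = rng.normal(size=(n_speakers, mfcc_dim))
train_X = np.repeat(centroids, 3, axis=0) + 0.3 * rng.normal(size=(n_speakers * 3, mfcc_dim))
train_y = np.repeat(np.arange(n_speakers), 3)
# Testing: one mean vector per speaker built from SX material (placeholder).
test_X = centroids + 0.3 * rng.normal(size=(n_speakers, mfcc_dim))

# Reduce dimensions so between-speaker variance is maximized, then use nearest means.
lda = LinearDiscriminantAnalysis(n_components=n_dims).fit(train_X, train_y)
train_proj, test_proj = lda.transform(train_X), lda.transform(test_X)
means = np.array([train_proj[train_y == s].mean(axis=0) for s in range(n_speakers)])
pred = np.argmin(np.linalg.norm(test_proj[:, None, :] - means[None, :, :], axis=2), axis=1)
accuracy = float(np.mean(pred == np.arange(n_speakers)))
```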
Duration Plots
 Varying number of dimensions (speakers = 10)
 Increase in dimensions makes accuracy rise
 Varying number of speakers (dimensions = 3)
 Increasing speakers makes accuracy fall
 Plateaus quickly for duration
[Figure: Duration plots of Accuracy (Raw) and Scaled Performance Level]

Number of Dimensions   Number of Speakers   Estimate   95% CI
1                      2                    0.7980     0.7720 - 0.8217
1                      20                   0.0800     0.0647 - 0.0985
12                     2                    0.9460     0.9302 - 0.9584
12                     20                   0.5840     0.5532 - 0.6142




 Best performance with full dimensional representations
 Reduction leads to substantial problems, especially for moderate to large numbers of speakers
 Some information conveyed, but not passable for a usable implementation
 Different mathematical approach needed

Sensory substitution devices can support
perception of indexical qualities of speech
 Even in subjects that are already aided by simulations
of CIs




 Mapping and procedure make all the difference
 Theme of variation among subjects
 Existence and possible prevalence of utilizing information in a multimodal fashion
 Sophisticated models needed to convey speaker ID in reduced dimensions

 CI simulation is really no true substitute for real patients
 Scores observed not too different
 Stepping through to familiarize with the vocoder (Fu et al. 2004, Fu et al. 2005, Gonzalez and Oliver 2005)
 Needed for a more rigorous procedure to acclimate subjects

Device Components
 Robustness of conclusions to different actuators and
implementations
 Microphone/sensorimotor integration

Mapping Algorithms
 Test against categorical approach
 Different mathematical framework and possibly features

User Study Tasks
 Logistics of building speaker ID experiment together
(database and procedure)
 Validate task itself in normal hearing people
 Simultaneous task (intelligibility)
End of Presentation