The ERP Boot Camp
Plotting, Measurement, & Statistics

All slides © S. J. Luck, except as indicated in the notes sections of individual slides.
Slides may be used for nonprofit educational purposes if this copyright notice is included, except as noted.
Permission must be obtained from the copyright holder(s) for any other use.

Plotting - The Right Way

[Figure: example ERP plot annotated with the following features]
• To-be-compared waveforms overlaid
• Legend in figure
• Time zero
• Time ticks on baseline for every waveform
• Electrode site
• Voltage calibration aligned with waveform
• Calibration size and polarity
• Baseline shows 0 µV

Plotting - Basic Principles

• You must show the waveforms (SPR rule)
  - You need to show enough sites so that experts can figure out the underlying component structure
  - I often show just one site for a cognitive audience when the component can be isolated (N2pc or LRP)
  - In most cases, don’t show more than 6-8 sites (use a topo map instead)
• A prestimulus baseline must be shown
  - Usually 200 ms (minimum of 100 ms for most experiments)
  - If you don’t see a baseline, the study is probably C.R.A.P. (Carelessly Reviewed Awful Publication)
• Overlay the key waveforms
• In most cases, show both original waveforms and difference waves

Measuring ERP Amplitudes

• Basic options
  - Peak amplitude
    • Or average around peak
    • Or local peak amplitude
  - Mean/area amplitude

Why Mean is Better than Peak

• Rule #1: “Peaks and components are not the same thing.
  There is nothing special about the point at which the voltage reaches a local maximum.”
  - Mean amplitude better characterizes a component as being extended over time
  - Peak amplitude encourages misleading view of components
• Peak may find rising edge of adjacent component
  - Can be solved by local peak measure
• Peak is sensitive to high-frequency noise
  - Can be mitigated by low-pass filter or “mean around peak”
• Time of peak depends on overlapping components
  - The peak may be nowhere near the center of the experimental effect

Why Mean is Better than Peak

• Peak amplitude is biased by the noise level
  - More noise means greater peak amplitude
  - Mean amplitude is unbiased by noise level
• Example
  - Do 1000 simulation runs at two noise levels
  - Take mean amplitude and peak amplitude on each run
  - Average of 1000 mean amplitudes will be approximately the same for high-noise and low-noise data
  - Average of 1000 peak amplitudes will be greater for high-noise data than for low-noise data

Peak Amplitude and Noise

[Figure: clean waveform vs. waveform + 60-Hz noise]

Why Mean is Better than Peak

• Peak at different time points for different electrodes
  - A real effect cannot do this
• A narrower measurement window can be used for mean amplitude
• Mean amplitude is linear; peak amplitude is not
  - Mean of peak amplitudes ≠ peak amplitude of grand average
  - Mean of mean amplitudes = mean amplitude of grand average
  - Same applies to single-trial data vs. averaged waveform

Shortcomings of Mean Amplitude

• You will still pick up overlapping components
  - A narrower window reduces this, but increases noise level
• Different measurement windows might be appropriate for different subjects
  - This could be a source of measurement noise
  - Patients and controls might have different latencies, leading to a systematic distortion of the results
    • This is a case where peak might be better
• How do you pick the measurement window?
  - Using the time course of an effect biases you to find a significant effect
  - Reality: People often look at the data first
  - Alternative 1: Select window based on prior results
  - Alternative 2: “Functional localizer” condition to find “ROI”
  - Alternative 3: Resampling/randomization approaches

The Baseline (reminder)

• Baseline correction is equivalent to subtracting baseline voltage from your amplitude measures
  - Any noise in baseline contributes to amplitude measure
    • Short baselines are noisy
    • Usual recommendation: 200 ms
    • Need to look at 200+ ms to evaluate overlap and preparatory activity
• Baseline can be significant confound
  - Baselines may differ across conditions due to overlap or preparatory activity, and this activity may fade over time
  - A poststimulus amplitude measure may therefore vary across conditions due to differential baselines
• Fading prestimulus differences can also distort scalp distributions
  - Distribution of prestimulus period contributes to distribution

Measuring Midpoint Latency

• Basic options
  - Peak latency
    • Or local peak latency
  - 50% area latency

Better Example of 50% Area

[Figure: rare-minus-frequent difference wave]

Shortcomings of Peak Latency

• Peak may find rising edge of adjacent component
  - Can be solved by local peak measure
• Peak is sensitive to high-frequency noise
  - Can be mitigated by low-pass filter
• Time of peak depends on overlapping components
• Terrible for broad components with no real peak
• Biased by the noise level
  - More noise => nearer to center of measurement window
• Not linear
• Difficult to relate to reaction time

50% Area Latency

• Uses entire waveform in determining latency
• Robust to noise
• Not biased by the noise level
• Works fine for broad waveforms with no real peak
• Linear
• Easier to relate to RT
  - Almost the same as median
• Shortcomings
  - Measurement window must include entire component
  - Strongly influenced by overlapping components
  - Requires monophasic waveforms
  - Works best on big components and/or difference waves

Relating Midpoint Latency to RT

[Figure: Probability Distribution of RT. x-axis: Time (-200 to 1000); y-axis: Probability of Reaction Time. Annotations: 7% of RTs at 300 ms, 17% of RTs at 350 ms, 25% of RTs at 400 ms]

Relating Midpoint Latency to RT

• Peak latency is related to mode of RT distribution, not mean or median

[Figure: ERP amplitude as a function of time (-200 to 1000)]

Relating Midpoint Latency to RT

[Figure: typical RT probability distributions across different conditions]

• P3 peak latency usually differs less across conditions than mean RT

50% Area Latency Example

[Two figure slides: Luck & Hillyard (1990)]

Measuring Onset Latency

• Basic options for onset of component
  - 20% area latency
  - 50% peak latency
  - Statistical threshold
    • First of N consecutive p < .05 points

[Figure: waveform marked with peak amplitude, 50% of peak amplitude, and latency at 50% of peak amplitude]

Jackknife Approach

• Miller, Patterson, & Ulrich (1998)
  - Hard to measure onset latency (and other nonlinear parameters) from noisy single-subject waveforms
  - Much easier to measure from grand average
• Measure from grand average of N-1 subjects N times (once excluding each subject)
• Variance will be artificially low but can be corrected
  - $F_{\text{corrected}} = F_{\text{uncorrected}} \div (N-1)^2$ [N per condition]
  - Between, within, main effects, interactions
  - Jackknife can also be used with Pearson r
• So precise that you may need to use interpolation to measure latencies between sample points

Jackknife Approach

[Figure: 50% fractional peak latency measured from single-subject waveforms (Subject 1, Subject 2, Subject 3) and from the corresponding leave-one-out grand averages (Grand w/o Subject 1, Grand w/o Subject 2, Grand w/o Subject 3)]

Jackknife Approach

• Conventional ANOVA on LRP onset latency
  - F(1, 20) = 1.315, p = 0.258
• Jackknife ANOVA on LRP onset latency
  - F(1, 20) = 5221.625, Fc = 13.05, p = .0017
• Limitations
  - Doesn’t help with linear measures
  - Easier to have equal Ns for between-subjects ANOVAs
  - Is sometimes worse than conventional approach
  - Testing a slightly different null hypothesis

Jackknife Approach

• Conventional null hypothesis
  - If you measure from every individual in the population, the average of these measures does not differ across conditions
• Jackknife null hypothesis
  - If you make grand averages across every individual in the population, and measure from these grand averages, these measures do not differ across conditions
• Making a grand average leads to the same problems as averaging across trials
  - Greater latency variability across subjects in one group will lead to lower peak amplitude in this group’s grand average
  - The onset time in the grand average will reflect the onset times of the subjects with the earliest onset times
• Think about it, and make sure you get the same general pattern with conventional statistics

Jackknife Approach

[Figure: onset times of Sub1 to Sub4 relative to stimulus onset (Stim) in Condition A and Condition B, with the mean of single-subject values marked for each condition]

Jackknife Approach

[Figure: same layout as the previous slide, with the value from the grand average marked alongside the mean of single-subject values for each condition]

A difference in timing variability is misconstrued as a difference in mean onset time

Statistical Analysis

• Replication is the best statistic
  - The .05 threshold is arbitrary
    • What would happen if we decided the threshold should be .06?
  - We regularly violate the assumptions of statistical tests, so the computed p-values are not correct estimates of probability of a Type I error
  - The real question is whether the effects are real or noise
  - If they are real (and large enough), they will be replicable
• General advice
  - Collect clean data with big effects
  - Run follow-up experiments that contain replications
  - Use a vanilla statistical approach (with jackknife approach for nonlinear measures, when appropriate) or
  - Find a really good statistician who can do the most appropriate statistical tests

Standard Approach

• First, collapse across irrelevant factors
  - If target and standard are counterbalanced, collapse to avoid physical stimulus differences
  - This reduces number of ANOVA factors
    • Fewer p-values
    • Fewer spurious interactions
    • Smaller experimentwise error
• Do a separate ANOVA for each component
  - Don’t use component as a repeated-measures factor
  - Separate ANOVAs for amplitude and latency
  - You could do a gigantic MANOVA, but it would have a zillion p-values

Standard Approach

• Use electrodes at which component is present
  - Otherwise your effect may get swamped by noise at other electrodes
  - Interaction with electrode site has low power
• Electrode site is usually two factors
  - Anterior-posterior
  - Left-middle-right
  - Or clusters (averages across nearby electrodes)
• Usually bad to do a separate ANOVA for each site
  - More p-values means greater chance of Type I error
  - Less power means greater chance of Type II error
• Overall advice: Use stats in a way that most directly tests your main hypotheses

Choosing Electrode Sites

• Imagine you are comparing Condition A and Condition B at 128 electrode sites, and the conditions do not actually differ (zero difference with infinite power)
  - If the noise is independent at each site, you would expect p < .05 for 6-7 sites (.05 × 128 = 6.4)
  - If noise is correlated among nearby sites, you would expect p < .05 for at least one cluster of sites
• Therefore, if you choose which sites to measure by seeing which sites (or clusters) show a difference, you will have many false positives (actual p >> .05)
• Solution 1: All sites in an omnibus ANOVA (low power)
• Solution 2: Bonferroni correction (even lower power)
• Solution 3: Use false discovery rate correction (not quite as bad)
• Solution 4: Use a priori region of interest
• Solution 5: Use “functional localizer” condition
• Solution 6: Use resampling/randomization approaches

Example: Fishing for N2ac

2 simultaneous stimuli on each trial, selected from:
  A) Pure sine wave
  B) FM sweep
  C) White noise burst
  D) Click train
Duration = 750, SOA = 1500 ± 150
One stimulus defined as target for each trial block (e.g., FM sweep)
Task: Press one button for target-present, another for target-absent
Each stimulus equally likely to be combined with each other stimulus
Locations are randomized from trial to trial
Target is present on 25% of trials
Look at contra vs. ipsi with respect to target

Example: Fishing for N2ac

Separate ANOVAs for anterior and posterior electrode clusters
Factors: Contra/Ipsi, Hemisphere, Within-Hemisphere Site, Time

Example: Fishing for N2ac

Key Effects
  Contra/Ipsi: Significant
  Contra/Ipsi × Time: Significant
  Contra/Ipsi × Electrode: ns
  Contra/Ipsi × Hemisphere: ns

Key Effects
  Contra/Ipsi: ns
  Contra/Ipsi × Time: Significant
  Contra/Ipsi × Electrode: ns
  Contra/Ipsi × Hemisphere: Significant

Example: Fishing for N2ac

Contra/Ipsi @ Each Time Interval
  200-300: Significant
  300-400: Significant
  400-500: Significant
  500-600: ns

Contra/Ipsi @ Each Time Interval
  200-300: ns
  300-400: ns
  400-500: Significant
  500-600: Significant

Example: Fishing for N2ac

Follow-Up Experiment:
  Same basic paradigm to demonstrate replicability
  Slightly different stimuli to demonstrate generality
  Additional anterior electrode sites to better map scalp distribution
  Also included unilateral stimuli to determine whether the N2ac requires competition between simultaneous stimuli
Replicated basic anterior and posterior patterns
These effects were not present for unilateral stimuli

Electrode Interactions

• Amplitudes are multiplicative across electrodes
  - Fz amplitude might go from 1.0 µV to 1.5 µV, and Pz amplitude might go from 2 µV to 3 µV

[Figure: amplitude (0 to 3) at Fz, Cz, and Pz for Condition A, Condition B, and Condition C; panels labeled Multiplicative and Additive]

• This produces a condition × electrode site interaction
  - Even without a change in neural generators

Electrode Interactions

• McCarthy & Wood (1985): Normalize the Data
  - Divide by vector length

[Figure: amplitude at Fz, Cz, and Pz for Condition A, Condition B, and Condition C, plotted on a 0 to 3 scale and on a 0 to 1 scale]

• Now the conditions have the same overall amplitude
  - Main effects are eliminated; they are assessed prior to normalization

Electrode Interactions

• Technical problem: Urbach & Kutas (2002) demonstrated that this does not actually work under many realistic conditions
  - Many of these problems disappear if you measure from difference waves
• Conceptual problem: The conclusions that can be drawn from an electrode site interaction are extremely weak
  - Could be same generators, but change in relative amplitudes
  - Could be same generators, but a change in relative latencies
• General advice: Don’t worry about electrode interactions
  - You can’t draw very strong conclusions from them, so just report them

Heterogeneity of Covariance

• Within-subjects ANOVA assumes homogeneity of variance and covariance (sphericity)
  - Modest heterogeneity of variance not a big problem
  - Heterogeneity of covariance inflates Type I error rate
• What is homogeneity of covariance?
  - 3 or more levels of a within-subjects factor
  - Each level must be equally correlated with the other levels

Heterogeneity of Covariance

Within-subjects ANOVA assumes:
Covariance(A, B) = Covariance(B, C) = Covariance(A, C)

[Figure: data layout with rows Subject 1, Subject 2, Subject 3 and columns Cond A, Cond B, Cond C]

Heterogeneity of Covariance

• Why is this a special problem for ERPs?
  - Covariance is lower for more distant electrode pairs than for nearby electrode pairs
  - Whenever 3 or more electrodes are used, heterogeneity of covariance is likely
• SPR mandates that papers deal with this problem
• Greenhouse-Geisser epsilon adjustment
  - Degree of nonsphericity is computed
  - An adjustment factor, epsilon, is computed
  - New df computed by multiplying epsilon by original df
  - New df used for computing p-values
• Greenhouse-Geisser epsilon is overly conservative
  - Can use Huynh-Feldt epsilon instead
• Everyone should use epsilon adjustment for all studies, not just ERP studies
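
The Greenhouse-Geisser steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: it assumes a one-way within-subjects design (one value per subject per level, e.g., one amplitude per electrode site), uses the standard textbook estimator based on the double-centered covariance matrix, and the function names and example amplitudes are hypothetical. Looking up the p-value with the adjusted df is left to a statistics package.

```python
def greenhouse_geisser_epsilon(data):
    """Estimate Greenhouse-Geisser epsilon for a one-way within-subjects factor.

    `data` holds one row per subject; each row has that subject's value
    at every level of the factor (e.g., one amplitude per electrode site).
    """
    n_subjects = len(data)
    n_levels = len(data[0])
    if n_subjects < 2 or n_levels < 2:
        raise ValueError("need at least 2 subjects and 2 levels")
    if any(len(row) != n_levels for row in data):
        raise ValueError("every subject needs a value at every level")

    # Sample covariance matrix of the levels, computed across subjects.
    means = [sum(row[j] for row in data) / n_subjects for j in range(n_levels)]
    cov = [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
            / (n_subjects - 1)
            for j in range(n_levels)
        ]
        for i in range(n_levels)
    ]

    # Double-center the covariance matrix (remove row, column, and grand means).
    row_means = [sum(cov[i]) / n_levels for i in range(n_levels)]
    grand_mean = sum(row_means) / n_levels
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean
         for j in range(n_levels)]
        for i in range(n_levels)
    ]

    trace = sum(centered[i][i] for i in range(n_levels))
    sum_of_squares = sum(value * value for row in centered for value in row)
    if sum_of_squares == 0:
        raise ValueError("no variability across levels; epsilon is undefined")

    # epsilon = trace^2 / ((k - 1) * sum of squared entries).
    return trace ** 2 / ((n_levels - 1) * sum_of_squares)


def adjusted_degrees_of_freedom(epsilon, n_subjects, n_levels):
    """Multiply the original one-way repeated-measures df by epsilon."""
    df_effect = epsilon * (n_levels - 1)
    df_error = epsilon * (n_levels - 1) * (n_subjects - 1)
    return df_effect, df_error


if __name__ == "__main__":
    # Hypothetical amplitudes (µV): 4 subjects x 3 electrode sites.
    amplitudes = [
        [1.0, 2.1, 3.2],
        [1.4, 2.0, 2.9],
        [0.8, 2.6, 3.9],
        [1.1, 1.7, 3.0],
    ]
    eps = greenhouse_geisser_epsilon(amplitudes)
    print("epsilon:", round(eps, 3))
    print("adjusted df:", adjusted_degrees_of_freedom(eps, 4, 3))
```

With k levels, this estimate ranges from 1/(k-1) (maximal nonsphericity) to 1 (no adjustment needed), so the adjusted df can only shrink, which makes the resulting p-value more conservative.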