Task 6: Statistical Approaches
Scope of Work

Bob Youngs
NGA Workshop #5
March 25, 2003


Working Group 6
• Norm Abrahamson
• David Brillinger
• Brian Chiou
• Bob Youngs


Primary Objectives
• Identify regression techniques that address uncertain/missing predictor variables, multiple levels of overlapping correlation in the residuals, and censoring/truncation of the response
• Assess the significance of these issues in developing ground motion models
• Provide statistical tools to the NGA developers to assist them in addressing these issues


Progress to Date
• Treatment of data censoring/truncation
  – Have identified an approach and begun implementation
• Treatment of correlations due to cross-classification of data (earthquake terms and site terms)
  – Have identified one method for analysis, but this may not be an important issue in NGA


Progress to Date (cont’d)
• Treatment of other correlations (spatial within a given earthquake, and between frequencies)
  – Have not determined the extent of the need for treatment in NGA
• Treatment of missing/uncertain predictor variables
  – Identifying potential approaches to be explored


Treatment of Censored/Truncated Response Data


Standard Statistical Model

$$\ln(y_i) = \mu(x_i, \beta) + \varepsilon_i$$

Likelihood of observed data ($f_N$ is the normal density):

$$L = \prod_{i\,\in\,\text{recorded}} f_N(y_i \mid x_i, \beta)$$

Solved by maximizing the log-likelihood

$$\ln(L) = \sum_{i\,\in\,\text{recorded}} \left[ -\frac{\ln(2\pi\sigma^2)}{2} - \frac{\left[\ln(y_i) - \mu(x_i, \beta)\right]^2}{2\sigma^2} \right]$$

or by minimizing the sum of squared differences

$$SS = \sum_{i\,\in\,\text{recorded}} \left[\ln(y_i) - \mu(x_i, \beta)\right]^2$$


Censored Data
• Known number of recordings where the value of $y_i < Z_{censor}$ and the value of $x_i$ is known

[Figure: PGA versus distance (log-log axes) with the censoring level Z_censor marked]

(McLaughlin, 1991)


Censored Data Statistical Model

Likelihood of observed data ($F_N$ is the normal cumulative distribution):

$$L = \prod_{i\,\in\,\text{recorded}} f_N(y_i \mid x_i, \beta) \times \prod_{j\,\in\,\text{censored}} F_N(Z_{censor} \mid x_j, \beta)$$

Solved by maximizing the log-likelihood

$$\ln(L) = \sum_{i\,\in\,\text{recorded}} \left[ -\frac{\ln(2\pi\sigma^2)}{2} - \frac{\left[\ln(y_i) - \mu(x_i, \beta)\right]^2}{2\sigma^2} \right] + \sum_{j\,\in\,\text{censored}} \ln F_N(Z_{censor} \mid x_j, \beta)$$


Truncated Data
• Unknown number of recordings where the value of $y_i < Z_{trunc}$; the value of $x_i$ is unknown

[Figure: PGA versus distance (log-log axes) with the truncation level Z_trunc marked]

(Toro, 1981)


Truncated Data Statistical Model

Likelihood of observed data:

$$L = \prod_{i\,\in\,\text{recorded}} \frac{f_N(y_i \mid x_i, \beta)}{1 - F_N(Z_{trunc} \mid x_i, \beta)}$$

Solved by maximizing the log-likelihood

$$\ln(L) = \sum_{i\,\in\,\text{recorded}} \left[ -\frac{\ln(2\pi\sigma^2)}{2} - \frac{\left[\ln(y_i) - \mu(x_i, \beta)\right]^2}{2\sigma^2} \right] - \sum_{i\,\in\,\text{recorded}} \ln\left[1 - F_N(Z_{trunc} \mid x_i, \beta)\right]$$


Example Large Synthetic Data Set (1000)

$$\ln(y) = \beta_1 + \beta_2 \ln(r + \beta_3) + \beta_4 r$$

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, < 0.03g, generating function, fit to all data]


Fit to Censored/Truncated Data Ignoring Effect (large data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, generating function, fit to all data, fit to data > 0.03g]


Fit Using Censored Data Model (large data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, < 0.03g, generating function, fit to all data, censored fit, censored x's]


Fit Using Truncated Data Model (large data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, generating function, fit to all data, truncated fit]


Example Small Synthetic Data Set (20)

$$\ln(y) = \beta_1 + \beta_2 \ln(r + \beta_3) + \beta_4 r$$

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, < 0.03g, generating function, fit to all data]


Fit to Censored/Truncated Data Ignoring Effect (small data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, generating function, fit to all data, fit to data > 0.03g]


Fit Using Censored Data Model (small data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, < 0.03g, generating function, fit to all data, censored fit, censored x's]


Fit Using Truncated Data Model (small data set)

[Figure: acceleration versus distance (log-log axes). Legend: > 0.03g, generating function, fit to all data, truncated fit]


Example Model Parameters

Case                   Number of Records   β1      β2       β3     β4          σ
Model (generating)                         4.5     -1.6     20     -5.00E-03   0.5

Large synthetic data set (1000)
Fit all data           1000                4.328   -1.549   20.1   -5.74E-03   0.502
Fit to data > 0.03g    858                 4.057   -1.547   16.8   0           0.500
Censored fit           858 + 142c          2.311   -1.012   13.5   -1.25E-02   0.507
Truncated fit          858                 4.000   -1.470   18.9   -6.40E-03   0.511

Small synthetic data set (20)
Fit all data           20                  0.889   -0.598   7.1    -1.59E-02   0.395
Fit to data > 0.03g    16                  2.391   -1.120   10.5   0           0.327
Censored fit           16 + 4c             0.268   -0.427   2.8    -1.68E-02   0.374
Truncated fit          16                  0.486   -0.553   2.9    -9.07E-03   0.349

(c denotes censored records)


Minimum PGA versus Date of Earthquake in NGA Data Set

[Figure: four panels of minimum PGA (log axis) versus magnitude (4 to 8), one per period: 1938-1970, 1971-1980, 1981-1990, 1991-2002]


Minimum PGA versus Number of Records/Earthquake in NGA Data Set

[Figure: four panels of minimum PGA (log axis) versus magnitude (4 to 8), one per number of records per earthquake: 1 to 5, 6 to 10, 11 to 50, >50]


Additional Work to Be Done
• Incorporate into random effects model
• Investigate stability of estimation algorithms
  – Maximum likelihood appears to be the primary approach
• Evaluate sensitivity to selection of truncation level
  – Treat as uncertain?


Treatment of Correlations in Response Data (Peak Motions)


Source and Site Data Correlations

$$y_{ij} = \mu(x_{ij}, \beta) + (\text{earthquake effect})_i + (\text{site effect})_j$$

• Earthquake effect
  – Correlation in peak motions from the ith earthquake
  – Presently incorporated by random effects and two-stage regression approaches
• Site effect
  – Correlation in peak motions recorded at the jth site
  – This effect is cross-classified with the earthquake effect
  – Cross-classification eliminates the block-diagonal structure of the variance matrix, requiring “tricks”


Potential Data Correlations from Earthquake and Site Classifications

Number of Earthquakes   Number of Recordings per Earthquake
56                      1
21                      2
16                      3
5                       4
9                       5
27                      6-10
11                      11-21
20                      22-83
6                       118-420

Number of Stations      Number of Recordings per Station
648                     1
235                     2
149                     3
119                     4
95                      5
145                     6
17                      7-10


Tentative Conclusions
• Earthquake effect already addressed by developers
• Cross-classification by site effect term is not a significant issue because of the limited number of sites with many recordings
  – Need to do some testing with simulated data sets to confirm this conclusion


Additional Correlations
• Spatial correlation of adjacent sites
  – Readily handled as nested classifications, provided one has the correlation model
  – Need to investigate the potential extent in NGA data
• Correlation between adjacent spectral frequencies in a “global” regression
  – Is this of interest to the developers?


Treatment of Missing or Uncertain Predictor Variables


Missing Predictor Variables
• Site classification variables
  – VS30, NEHRP categories, other site categories
  – Depth to VS of 1.0 and 2.5 km/sec
• Rupture geometry variables
  – Directivity variables
  – Hanging wall/footwall determinations
  – Confined to smaller events/distant recordings where the effect is believed to be minimal?


Possible Approaches
• Estimation of the variable by an external model
  – Example: correlation of VS30 with surficial geology
• Correlations with other variables in the NGA data set
  – Technique used in multivariate normal models


Treatment of Uncertainty in Predictor Variables
• Magnitude uncertainty
  – Partition of the earthquake random effect into a magnitude error term and an event term (Rhoades, 1997)
• Propagation of variable uncertainty into resulting model parameter uncertainty
  – Formal errors-in-variables methods
  – Simulation methods
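

Illustrative Sketch: Censored and Truncated Log-Likelihoods

The log-likelihoods on the "Censored Data Statistical Model" and "Truncated Data Statistical Model" slides can be sketched in a few lines of Python. This is only a minimal illustration, not the working group's implementation. Everything named here is an assumption made for the example: the two-parameter mean model ln(y) = b0 + b1 ln(r) (simpler than the four-parameter generating function used in the synthetic examples), the function names, the synthetic data, and the one-dimensional grid search over the intercept, which stands in for a full maximum-likelihood fit.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def log_norm_pdf(resid, sigma):
    """ln f_N for a residual in ln units with standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - resid ** 2 / (2.0 * sigma ** 2)

def log_norm_cdf(z):
    """ln of the standard normal CDF, floored to avoid log(0)."""
    return math.log(max(0.5 * math.erfc(-z / SQRT2), 1e-300))

def mu(r, beta):
    """Hypothetical mean model: ln(y) = beta[0] + beta[1] * ln(r)."""
    return beta[0] + beta[1] * math.log(r)

def loglik_recorded(recorded, beta, sigma):
    """Standard model: sum of ln f_N over recorded (r, ln_y) pairs."""
    return sum(log_norm_pdf(ln_y - mu(r, beta), sigma) for r, ln_y in recorded)

def loglik_censored(recorded, censored_r, ln_z, beta, sigma):
    """Adds ln F_N(Z_censor | x_j) for each censored point with known x_j."""
    extra = sum(log_norm_cdf((ln_z - mu(r, beta)) / sigma) for r in censored_r)
    return loglik_recorded(recorded, beta, sigma) + extra

def loglik_truncated(recorded, ln_z, beta, sigma):
    """Subtracts ln[1 - F_N(Z_trunc | x_i)] for each recorded point."""
    # By symmetry of the normal distribution, 1 - F(z) = F(-z).
    penalty = sum(log_norm_cdf(-(ln_z - mu(r, beta)) / sigma) for r, _ in recorded)
    return loglik_recorded(recorded, beta, sigma) - penalty

# Synthetic data (assumed values, chosen only for illustration).
random.seed(0)
TRUE_BETA, SIGMA, LN_Z = (1.0, -1.0), 0.5, math.log(0.03)
recorded, censored_r = [], []
for _ in range(500):
    r = 10.0 ** random.uniform(0.0, 2.5)
    ln_y = mu(r, TRUE_BETA) + random.gauss(0.0, SIGMA)
    if ln_y >= LN_Z:
        recorded.append((r, ln_y))   # observed above the threshold
    else:
        censored_r.append(r)         # only the distance is retained

# Grid search over the intercept only; slope and sigma are held at their true values.
grid = [0.5 + 0.01 * k for k in range(151)]
slope = TRUE_BETA[1]
b_naive = max(grid, key=lambda b: loglik_recorded(recorded, (b, slope), SIGMA))
b_cens = max(grid, key=lambda b: loglik_censored(recorded, censored_r, LN_Z, (b, slope), SIGMA))
b_trunc = max(grid, key=lambda b: loglik_truncated(recorded, LN_Z, (b, slope), SIGMA))
print("naive:", b_naive, "censored:", b_cens, "truncated:", b_trunc)
```

Both correction terms decrease as the intercept increases, so the censored and truncated estimates can never exceed the estimate that ignores the threshold. This is the same direction of bias that the "Fit to Censored/Truncated Data Ignoring Effect" slides illustrate.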