Linear regression analysis (線性迴歸分析)
Wen, shuhui
shwen@mail.tcu.edu.tw
2010.12

Example
Postmenopausal women tend to have low bone mineral density (BMD), which may make fractures more likely. (They may also be older and heavier.)
People on a high-fat diet tend to have higher LDL cholesterol, which may increase the risk of cardiovascular disease. (They might also be smokers and overweight.)

Multi-predictor analysis
Y = α + β1X1 + β2X2 + ... + βkXk + error
Potentially complex relationships in an observational study:
  A continuous outcome (Y, e.g. BMD, LDL) is related to a risk factor (X1, e.g. menopause, high-fat diet).
  But the risk factor of interest might be related to other factors (X2, e.g. age, BMI, smoking) which also predict the outcome.
Similarly, for experiments (e.g. clinical trials):
  If randomization is implemented, confounding might not be an issue.
  For multi-center trials, the analysis needs to adjust for clinical center.
  Adjustment is also needed when baseline differences are apparent between the case and control groups.

Example from the literature (Åkesson et al. 2006)
Aim: investigate the effect of cadmium exposure on bone.
Bone damage (Y): loss of calcium and phosphate, together with inhibition of vitamin D hydroxylation caused by kidney damage, leads to osteoporosis and osteomalacia.
Cadmium exposure (X) and body burden: blood cadmium reflects recent exposure; urinary cadmium reflects body burden.
There may be other influencing factors (X2, X3, ..., Xk).
Method: multiple linear regression.

Statistical analyses
Data from two independent groups of subjects were compared by the Mann-Whitney U-test.
We used Spearman rank correlation (rs) or Kendall's tau to assess univariate associations (p ≤ 0.1).
In multiple linear regression models, each bone-related variable was evaluated in relation to cadmium, potential confounders (factors associated with both cadmium and bone) and effect modifiers (factors associated with bone).
We explored possible interactions in the model.

Statistical analyses (continued)
Because the season of sampling correlated with blood and urinary cadmium, BMD, PTH, U-DPD, and urinary calcium, it was included in the models.
Residual and goodness-of-fit analyses indicated no deviation from a linear pattern in the regression models.
The final regression model included, apart from cadmium, only statistically significant variables (p ≤ 0.05).
All tests were two sided, and statistical evaluation was performed using SPSS (version 12.01; SPSS Inc., Chicago, IL, USA).
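
To make the multi-predictor model Y = α + β1X1 + β2X2 + error more concrete, here is a minimal sketch of why adjusting for a second factor matters. Everything in it is an assumption for illustration: the toy data are invented (not taken from Åkesson et al. or from the slides), the helper names solve and ols are hypothetical, and the fit uses plain normal equations rather than SPSS. The outcome is built to depend on age only, while the risk factor is more common at older ages.

```python
# Hypothetical toy data (not from the slides): the outcome depends on age only.
age = [50, 52, 54, 56, 58, 60, 62, 64]      # X2, the confounder
exposed = [0, 0, 0, 1, 0, 1, 1, 1]          # X1, more common at older ages
bmd = [1.2 - 0.01 * a for a in age]         # Y, with no true effect of X1


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk; returns [a, b1, ..., bk]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)                           # normal equations (X'X) b = X'y


crude = ols([exposed], bmd)           # Y = a + b1*X1
adjusted = ols([exposed, age], bmd)   # Y = a + b1*X1 + b2*X2

print("crude    b1 = %.3f" % crude[1])
print("adjusted b1 = %.3f, b2 = %.3f" % (adjusted[1], adjusted[2]))
```

In this constructed example the crude coefficient of the risk factor is clearly negative, while the coefficient adjusted for age is essentially zero: the apparent effect came entirely from the age difference between the groups. Real data are noisier, so the contrast is rarely this clean.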
Outline
Correlation
Multiple linear regression
Predictor selection
Interaction
Other extended cases

Example: FEV data
Forced expiratory volume in one second (FEV)
How is FEV related to smoking?
Other related factors, e.g. age, gender

FEV data

Analysis steps
Step1: Present descriptive statistics of the clinical features for FEV and the other influencing factors.
Step2: Explore the correlation between FEV and X1...Xk.
Step3: Build up the multiple linear regression model and check for model adequacy.
Step4: Model revision or selection.
Step5: Interpret the result (model).

Step2: Explore the correlations
SPSS: Analyze > Correlate > Bivariate
There are three options under Correlation Coefficients:
1. Pearson: for continuous Xs.
2. Kendall's tau: for ordinal Xs.
3. Spearman: used here for nominal Xs coded 0/1 (sex, smoke).

Correlation matrix (recall Table 2 in the 2006 paper)
The output table can be edited directly into the table below (select it, then double-click it).
Alternatively, put the p-values in the lower-left triangle of the matrix.

Add the scatter plot
Graph > Scatter plot > Matrix plot

Matrix scatter plot
Read the scatter plots together with the correlation matrix.
The correlation coefficients show positive, statistically significant correlations.
The scatter plots show whether the relationships are linear.

For nominal variables coded 0/1, Spearman rs is more suitable.
Look at the correlation of FEV and gender (or smoke).

Spearman rs
FEV is related to sex (0=female, 1=male): males have larger FEV.
FEV is related to smoking (0=No, 1=Yes): smokers have larger FEV.

FEV vs. smoke
Do smokers really have larger FEV?
A possible reason is that the smokers are mostly male or older (and so have larger bodies).
Confounder?

Summary for bivariate correlation
For a continuous outcome (Y):
  If the factors (Xs) are continuous, we show the Pearson correlation coefficient.
  If the factors (Xs) are categorical (coded 0/1), we list the Spearman correlation coefficient.
Also, provide plots where possible.

Summary of correlation analysis
How is FEV related to smoking? Other related factors, e.g. age, gender.
FEV is positively correlated with both height and age, and the correlations are statistically significant (p<0.05).
FEV is related to sex (p<0.05): males have larger FEV.
FEV is related to smoking (p<0.05): smokers have larger FEV, but this may be caused by confounders such as sex, age, and height, which have not yet been taken into account.

Step3: Build up the multiple linear regression model
Now, we want to build the model as
FEV = α + β1age + β2sex + β3Hgt + β4smoke

Multiple linear regression

Check for model adequacy.
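Click "Plots" and select the normal probability plot (to check whether the data meet the normality assumption).
Draw the residual plot (Y axis: residuals, X axis: FEV values) to assess the homogeneity (constant variance) assumption.
If these two assumptions do not hold, the subsequent tests of the regression coefficients may not be valid.

Results-1: Pearson correlation matrix
Besides the correlations between FEV and the factors (Xs), some Xs are also significantly correlated with each other, e.g. age vs. Hgt.

Results-2: Adjusted R-square
The proportion of the variation in FEV jointly explained by all factors in the model is 0.774. In other words, the remaining 22.6% is error; other factors that influence FEV may not have been considered.

Result-3: Collinearity diagnosis
Collinearity: the Xs are highly correlated with each other, which affects the estimates of β; if so, the model must be revised.
The diagnostic index is the VIF. If VIF>10, the variable is highly correlated with the other variables and can be considered for removal.

Result-4-1: Normality
If the points in the plot lie close to the 45-degree line, the normality assumption holds.
If the sample size is large enough, there is usually no need to worry much about non-normality.
If normality does not hold, Y is usually transformed to log(Y) and the regression is redone.

Result-4-2: Homogeneity
A normal residual plot should look random, with no pattern.
The plot on the right looks somewhat fan-shaped, which may indicate a violation of homogeneity.
In addition, points whose standardized residuals (Y axis) fall outside (-3, 3) are outliers.

Outliers
The table below lists the outliers. They can usually be removed and the regression redone. (Do you know how to do it?)

Influential point
A high-leverage point could be an x-outlier.
Influential point: a point for which one or more β-hat would change by a large amount if it were removed.

Criterion                  Bound
Leverage, h                >2/n
Studentized residual, r    >3
DFFIT                      >2
Cook's distance            >1

Influential point (2)
Reference: page 122 of Vittinghoff et al. 2005

Outliers or influential points?
There are outliers.
There are no influential points (max Cook's distance < 1).

Step4: Model revision or selection.
According to the preliminary analysis:
  The proportion of the variation in FEV explained by age, gender, smoke, and height is 77.4%.
  Normality holds; homogeneity does not hold well, but n is large enough.
  There is no collinearity problem and no influential point; there are 5 outliers.
Model revision: try removing the outliers and fitting the model again.

First save the standardized residuals, then use the case-selection function to remove the outliers.
After running the regression, go to Data > Select Cases.

Delete outliers and do regression again
The condition is abs(ZRE_1) <= 3

Interpretation of regression analysis
The results of the new regression can be checked by following the same steps as before (model building through the outlier and influential-point checks, pages 23-33 of the slides).
N=649 (originally 654 records)

Adjusted R-square (new)
R-square is 78.7%, a little larger than the previous one.

Normality, Collinearity, Homogeneity
Normality: holds; the normal probability plot is close to the 45-degree line.
Collinearity: all VIFs are below 10, so there is no collinearity.
Homogeneity: the residual plot looks the same as before.
Outliers: a few remain but they are mild (very close to 3), so they are not excluded again.

Interpretation of regression analysis
Regression model
FEV = -4.521 + 0.057Age + 0.131Sex - 0.067Smoke + 0.042Hgt
1.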
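Removing the outliers changes the regression model very little.
2. The variables significantly related to FEV are still Age, Sex, and Height.
With outliers

Table for a paper (for reference)
Table: Multiple linear regression analysis between FEV and factors.

                                     95% CI
Factors        Coefficient   Lower bound   Upper bound   p-value
Age (yr)          0.057         0.039         0.075      <0.001*
Sex               0.131         0.069         0.194      <0.001*
Smoke            -0.067        -0.177         0.044       0.236
Height (cm)       0.042         0.038         0.046      <0.001*
Sex: 0=female, 1=male. Smoke: 0=no, 1=yes. *: statistically significant

Solutions if Normality failed
Transform Y (especially in small samples), e.g. log(Y).
Model is log(Y) = α + βX
Interpretation of β: for each one-unit increase in X, Y increases by _____ %.
Drawback: after transformation the data are harder to interpret.
How to do it?
  First use Compute to obtain the transformed Y.
  Then carry out the analysis with steps 2-4 learned above.

Solutions if Homogeneity failed
1. A transformation can also be used (especially in small samples), e.g. log(Y), 1/Y.
2. Use weighted least squares (please consult statisticians).

Solutions if Collinearity exists
Model selection
Use a model selection procedure that enters the more significant variables, to avoid high correlation among the Xs.
Forward, Backward, Stepwise regression
Stepwise is the more commonly used.

Stepwise regression

Results

Selected model
Model is FEV = -4.449 + 0.041Hgt + 0.061Age + 0.161Sex
(this is for all data; please use the data without outliers)

Interaction
If Z and X interact in their effect on Y, the relationship between X and Y changes with the value of Z.
From a statistical viewpoint, one can draw a mean plot of Y for each X*Z group.
To include the interaction effect in the model, add the product term X*Z and test whether its regression coefficient is 0; if it is significant, the interaction between X and Z exists.

Sex vs. Smoke?

Check for mean FEV
Note that the effects of Age and Height have not been considered here; the relationship will change again once the confounders are added! (Multiple regression)
From the descriptive statistics, the difference in FEV between males and females differs by smoking status.
An interaction may exist (from a statistical viewpoint).
[Figure: two line plots of mean FEV (y-axis from 2 to 4), one with sex on the x-axis and separate lines for nonsmokers and smokers, the other with smoking status on the x-axis and separate lines for females and males.]

Add interaction effects
Test the interaction between smoking and sex.
1. First create the product term (name it "interaction").

Build up the model
Select interaction into the list of independent variables.

Results (this is for all data; please use the data without outliers)
Regression model
The interaction between smoking and sex exists, and the main effect of smoke is now also present.

Which one is the final model?
Add the interaction.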
(this is for all data)
Mean FEV = -4.422 + 0.066age + 0.135Sex - 0.183Smoke + 0.041Hgt + 0.234Interaction

Interpretation
Mean FEV = -4.422 + 0.066age + 0.135Sex - 0.183Smoke + 0.041Hgt + 0.234Interaction

Sex          Smoke      Interaction   Estimated FEV (adjusted for age, height)
female (0)   No (0)     0             baseline
female       Yes (1)    0             -0.183
male (1)     No         0             0.135
male         Yes        1             0.186

Among females, smokers have an FEV 0.183 L lower than nonsmokers; among males, smokers have an FEV 0.051 L higher than nonsmokers.
What might be the reason?

Could it be an effect of height?

Further issues
What if Y is not continuous?
  If Y is binary, say disease vs. healthy, logistic regression is suggested (next class by Prof. Hsieh).
What if Y is a repeated measure, say pre/post Y?
  For 2 time points, one might use post-Y as the response variable and adjust for pre-Y and the Xs.
  For several time points, "repeated-measures" ANOVA is suggested (please consult statisticians).

References
1. Pagano M., Gauvreau K. Principles of Biostatistics (2nd ed). Australia; Pacific Grove, CA: Duxbury, 2000. (distributed by 歐亞書局)
2. Rosner B. Fundamentals of Biostatistics (6th ed). Belmont, CA: Thomson-Brooks/Cole, 2006. (distributed by 歐亞)
3. Vittinghoff E., Glidden D.V., Shiboski S.C., McCulloch C.E. Regression Methods in Biostatistics. Springer, 2005.
4. 史麗珠 (2005). 進階應用生物統計學. 學富文化, 台北.
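
Worked check of the Interpretation table: the four sex-by-smoking entries follow directly from the sex, smoke, and interaction coefficients of the fitted model (age and height held fixed). The sketch below only redoes that arithmetic. The coefficient values are the ones reported on the slides; the function names adjusted_difference and smoking_effect are hypothetical and chosen for illustration.

```python
# Coefficients reported on the slides for the model with the interaction term.
B_SEX = 0.135           # sex: 0=female, 1=male
B_SMOKE = -0.183        # smoke: 0=no, 1=yes
B_INTERACTION = 0.234   # product term sex*smoke


def adjusted_difference(sex, smoke):
    """Estimated FEV relative to a female nonsmoker of the same age and height."""
    return B_SEX * sex + B_SMOKE * smoke + B_INTERACTION * sex * smoke


def smoking_effect(sex):
    """Smoker minus nonsmoker difference in estimated FEV within one sex."""
    return adjusted_difference(sex, 1) - adjusted_difference(sex, 0)


for sex, sex_label in ((0, "female"), (1, "male")):
    for smoke, smoke_label in ((0, "no"), (1, "yes")):
        value = adjusted_difference(sex, smoke)
        print("%-6s smoke=%-3s %+.3f" % (sex_label, smoke_label, value))

print("smoking effect, female: %+.3f" % smoking_effect(0))
print("smoking effect, male:   %+.3f" % smoking_effect(1))
```

Within each sex the smoking effect is the smoke coefficient plus the interaction coefficient times sex, which is why the sign of the effect differs between females and males in the table.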