Linear regression analysis 線性迴歸分析

Wen, shuhui (shwen@mail.tcu.edu.tw)
2010.12

Example
- Postmenopausal women tend to have low bone mineral density (BMD), which may make fractures more likely.
  - They also tend to be older and heavier.
- People on a high-fat diet tend to have higher LDL cholesterol, which may raise the risk of cardiovascular disease.
  - They might also be smokers and overweight.

Multi-predictor analysis
Y = α + β1X1 + β2X2 + ... + βkXk + error
- Potentially complex relationships in observational studies:
  - A continuous outcome (Y, e.g. BMD, LDL) is related to a risk factor (X1, e.g. menopause, high-fat diet).
  - But the risk factor of interest might be related to other factors (X2, e.g. age, BMI, smoking) which also predict the outcome.
- Similarly, for experiments (e.g. clinical trials):
  - If randomization is implemented, confounding might not be an issue.
  - For multi-center trials, we need to adjust for clinical center.
  - Adjustment is also needed when baseline differences are apparent between the case and control groups.

Example from the literature (Åkesson et al. 2006)
- The study examines the effect of cadmium exposure on bone.
  - Bone damage (Y): loss of calcium and phosphate, together with inhibited hydroxylation of vitamin D caused by kidney damage, leads to osteoporosis and osteomalacia.
  - When assessing cadmium exposure (X) and body burden, blood cadmium reflects recent exposure and urinary cadmium reflects body burden.
  - Other influencing factors (X2, X3, ..., Xk) may also be present.
- The analysis uses multiple linear regression.

Statistical analyses (from the paper)
- Data from two independent groups of subjects were compared by the Mann-Whitney U-test.
- We used Spearman rank correlation (rs) or Kendall's tau to assess univariate associations (p ≤ 0.1).
- In multiple linear regression models, each bone-related variable was evaluated in relation to cadmium, potential confounders (factors associated with both cadmium and bone) and effect modifiers (factors associated with bone).
- We explored possible interactions in the model.
- Because the season of sampling correlated with blood and urinary cadmium, BMD, PTH, U-DPD, and urinary calcium, it was included in the models.
- Residual and goodness-of-fit analyses indicated no deviation from a linear pattern in the regression models.
- The final regression model included, apart from cadmium, only statistically significant variables (p ≤ 0.05).
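- All tests were two sided, and statistical evaluation was performed using SPSS (version 12.01; SPSS Inc., Chicago, IL, USA).

Outline
- Correlation
- Multiple linear regression
- Predictor selection
- Interaction
- Other extended cases

Example: FEV data
- Forced expiratory volume in one second (FEV)
- What is the relationship between FEV and smoking?
  - Other related factors, e.g. age, gender

Analysis steps
- Step 1: Present descriptive statistics of the clinical features for FEV and the other influencing factors.
- Step 2: Explore the correlation between FEV and X1...Xk.
- Step 3: Build the multiple linear regression model and check for model adequacy.
- Step 4: Model revision or selection.
- Step 5: Interpret the result (model).

Step 2: Explore the correlations
- SPSS: Analyze → Correlation → Pairwise
- The correlation coefficient area offers three options:
  1. Correlation coefficient (Pearson): for continuous Xs.
  2. Kendall's tau: for ordinal Xs.
  3. Spearman: used in these slides for nominal Xs (here binary, coded 0/1).

Correlation matrix (recall Table 2 in the 2006 paper)
- The output table can be edited directly (select it, then double-click it) into the table below, or the p-values can be placed in the lower-left triangle of the matrix.

Add the scatter plot
- Graph → Scatter plot → Matrix plot

Matrix scatter plot
- Read the scatter plots together with the correlation matrix.
- The correlation coefficients show positive, statistically significant correlations.
- The scatter plots show whether the relationships are linear.

For nominal variables, Spearman rs is more suitable.
- Look at the correlation of FEV with gender (or smoke).

Spearman rs
- FEV is associated with gender (0=female, 1=male): males have larger FEV.
- FEV is associated with smoking (0=No, 1=Yes): smokers have larger FEV.

FEV vs. smoke
- Do smokers really have larger FEV?
- A possible reason is that smokers are mostly male or older (and therefore larger in body size).
- Confounder?

Summary for bivariate correlation
- For a continuous outcome (Y):
  - If the factors (Xs) are continuous, we show the Pearson correlation coefficient.
  - If the factors (Xs) are categorical, we list the Spearman correlation coefficient.
- Also, provide plots where possible.

Summary of correlation analysis
- What is the relationship between FEV and smoking?
  - Other related factors, e.g. age, gender
- FEV is positively correlated with both height and age, and the correlations are statistically significant (p<0.05).
- FEV is associated with gender (p<0.05): males have larger FEV.
- FEV is associated with smoking (p<0.05): smokers have larger FEV. This may be caused by confounders such as gender, age and height, which have not yet been taken into account.

Step 3: Build the multiple linear regression model
- Now we want to build the model
  FEV = α + β1 age + β2 sex + β3 Hgt + β4 smoke

Check for model adequacy
- Click "Plots" and select the normal probability plot (to check whether the data satisfy the normality assumption).
- Draw the residual plot (Y axis: residuals, X axis: FEV values) to assess the homogeneity (equal variance) assumption.
- If these two assumptions do not hold, the subsequent tests of the regression coefficients may not be valid.

Results-1: Pearson correlation matrix
- Besides the correlations between FEV and the factors (Xs), some Xs are also significantly correlated with each other, e.g. age vs. Hgt.

Results-2: Adjusted R-square
- The proportion of the variation in FEV jointly explained by all factors in the model is 0.774. In other words, the remaining 22.6% is error; other factors that influence FEV may not have been considered.

Results-3: Collinearity diagnosis
- Collinearity means that the Xs are highly correlated with each other, which affects the estimation of the β values. In that case the model needs to be revised.
- The diagnostic index is the VIF. If VIF>10, the variable is highly correlated with the other variables and could be removed.

Results-4-1: Normality
- If the points in the plot lie close to the 45-degree line, the normality assumption holds.
- If the sample size is large enough, a violation of normality is usually not a major concern.
- If normality does not hold, Y is usually transformed to log(Y) and the regression is run again.

Results-4-2: Homogeneity
- A well-behaved plot should look random, with no pattern.
- The plot on the right looks somewhat fan-shaped, which may indicate a violation of homogeneity.
- In addition, points whose standardized residual (Y axis) falls outside (-3, 3) are outliers.

Outliers
- The table below lists the outliers.
- They can be removed and the regression run again. (Do you know how to do it?)

Influential point
- A high-leverage point could be an x-outlier.
- An influential point is one whose removal would change one or more β-hat by a large amount.

  Criterion                  Bound
  Leverage, h                > 2/n
  Studentized residual, r    > 3
  DFFIT                      > 2
  Cook's distance            > 1

- Reference: page 122 of Vittinghoff et al. 2005.

Outliers or influential points?
- There are outliers.
- There are no influential points (max Cook's distance < 1).

Step 4: Model revision or selection
- Based on the preliminary results:
  - Age, gender, smoke and height explain 77.4% of the variation in FEV.
  - Normality holds. Homogeneity is not fully satisfied, but n is large.
  - There is no collinearity problem and no influential point, but there are 5 outliers.
- Model revision:
  - Try removing the outliers and run the regression again.
  - First save the standardized residuals, then use the selection function to remove the outliers.
  - After running the regression, go to Data → Select Cases.

Delete outliers and do regression again
- The condition is abs(ZRE_1) <= 3.

Interpretation of regression analysis
- After re-running the regression, review the statistical results following the same Step 3 procedure (slides 23-33 in the original deck).
- N=649 (originally 654 records).

Adjusted R-square (new)
- R-square is 78.7%, a little larger than the previous one.

Normality, collinearity, homogeneity
- Normality: satisfied. The normal probability plot is close to the 45-degree line.
- Collinearity: all VIFs are below 10, so there is no collinearity.
- Homogeneity: the residual plot looks the same as before.
- Outliers: some remain, but they are mild (very close to 3), so no further exclusion is done.

Interpretation of regression analysis
- Regression model:
  FEV = -4.521 + 0.057 Age + 0.131 Sex - 0.067 Smoke + 0.042 Hgt
- 1. Removing the outliers has little effect on the regression model.
- 2. The variables significantly associated with FEV are still Age, Sex and Height.
- [Output table: model fitted with the outliers included]

Arranged as a table for a paper (for reference)
Table: Multiple linear regression analysis between FEV and factors.

  Factors       Coefficient   95% CI lower bound   95% CI upper bound   p-value
  Age (yr)       0.057         0.039                0.075               <0.001*
  Sex            0.131         0.069                0.194               <0.001*
  Smoke         -0.067        -0.177                0.044                0.236
  Height (cm)    0.042         0.038                0.046               <0.001*

  Sex: 0=female, 1=male. Smoke: 0=no, 1=yes. *: statistical significance

Solutions if normality failed
- Transform Y (especially with small samples), e.g. log(Y).
  - The model is log(Y) = α + βX.
  - Interpretation of β: for each one-unit increase in X, Y increases by _____ %.
  - Drawback: the results are harder to interpret after the data are transformed.
- How to do it?
  - First use Compute to obtain the transformed Y.
  - Then carry out the analysis with Steps 2-4 learned above.

Solutions if homogeneity failed
1. A transformation can also be used (especially with small samples), e.g. log(Y), 1/Y.
2. Use weighted least squares (please consult a statistician).

Solutions if collinearity exists
- Model selection:
  - Use a model selection procedure to enter the more significant variables, so that high correlation among the Xs is avoided.
  - Forward, backward and stepwise regression; stepwise is the most commonly used.

Stepwise regression: selected model
- The model is
  FEV = -4.449 + 0.041 Hgt + 0.061 Age + 0.161 Sex
  (this is for all data; please use the data without outliers)

Interaction
- If Z and X interact in their effect on Y, the relationship between X and Y changes with the value of Z.
- From a statistical viewpoint, one can draw a mean plot of Y for each X*Z group.
- To include an interaction effect in the model, add the product term X*Z and test whether its regression coefficient is 0. If it is significant, an interaction between X and Z exists.

Sex vs. Smoke? Check for mean FEV
- Age and Height are not yet taken into account here. The relationship may change once confounders are added (multiple regression).
- Judging from the descriptive statistics, the difference in FEV between males and females depends on smoking status, so an interaction may exist (from a statistical viewpoint).
- [Figure: two mean plots of Mean FEV (y axis, 2 to 4): one across nonsmoker/smoker with separate lines for Female and Male, one across Female/Male with separate lines for nonsmoker and smoker]

Add interaction effects
- Test the interaction between smoking and gender.
  1. First create the product term (name it "interaction").

Build up the model
- Enter "interaction" into the list of independent variables.

Results (this is for all data; please use the data without outliers)
- Regression model: the interaction between smoking and gender is significant, and the main effect of smoke is now also significant.

Which one is the final model?
- Add the interaction (this is for all data):
  Mean FEV = -4.422 + 0.066 age + 0.135 Sex - 0.183 Smoke + 0.041 Hgt + 0.234 Interaction

Interpretation
Mean FEV = -4.422 + 0.066 age + 0.135 Sex - 0.183 Smoke + 0.041 Hgt + 0.234 Interaction

  Sex         Smoke     Interaction   Estimated FEV (adjusted for age, height)
  female (0)  No (0)    0             baseline
  female      Yes (1)   0             -0.183
  male (1)    No        0              0.135
  male        Yes       1              0.186

- Among females, smokers have an FEV 0.183 (l) lower than nonsmokers. Among males, smokers have an FEV 0.051 (l) higher than nonsmokers. What might be the reason?
- Could it be an effect of height?

Further issues
- What if Y is not continuous?
  - If Y is binary, say disease vs. healthy, we suggest using logistic regression (next class by Prof. Hsieh).
- What if Y is a repeated measure, say pre/post Y?
  - One might use post-Y as the response variable and adjust for pre-Y and the Xs (for 2 time points).
  - For several time points, we suggest "repeated-measures" ANOVA.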
    (Please consult a statistician.)

References
1. Pagano M., Gauvreau K. Principles of Biostatistics (2nd ed.). Australia; Pacific Grove, CA: Duxbury, 2000. (distributed by 歐亞書局)
2. Rosner B. Fundamentals of Biostatistics (6th ed.). Belmont, CA: Thomson-Brooks/Cole, 2006. (distributed by 歐亞)
3. Vittinghoff E., Glidden D.V., Shiboski S.C., McCulloch C.E. Regression Methods in Biostatistics. Springer, 2005.
4. 史麗珠. 進階應用生物統計學 [Advanced Applied Biostatistics]. 學富文化, Taipei, 2005.
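
Supplementary sketch for Steps 3 and 4 and the interaction model. The slides carry out every step in SPSS. For readers who want to see the arithmetic behind the adjusted R-square, the VIF, the abs(ZRE_1) <= 3 filter and the Sex*Smoke product term, the following is a minimal Python sketch using only NumPy. It is not the author's procedure. The data are simulated with arbitrary coefficients (they are not the FEV data set), and the helper names fit_ols and vif are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals, adjusted R^2)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    n, p = A.shape                                 # p counts the intercept too
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    adj_r2 = 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))
    return beta, resid, adj_r2

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    out = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        A = np.column_stack([np.ones(len(xj)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, xj, rcond=None)
        r = xj - A @ coef
        r2 = 1.0 - (r @ r) / ((xj - xj.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Simulated data: an assumption for illustration only, NOT the FEV data set.
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(3, 19, n)
sex = rng.integers(0, 2, n).astype(float)          # 0=female, 1=male
smoke = rng.integers(0, 2, n).astype(float)        # 0=no, 1=yes
hgt = 100 + 4 * age + rng.normal(0, 10, n)         # height correlated with age
fev = (-4.5 + 0.06 * age + 0.13 * sex - 0.07 * smoke + 0.04 * hgt
       + rng.normal(0, 0.4, n))                    # arbitrary "true" coefficients

# Step 3: fit the model and check collinearity.
X = np.column_stack([age, sex, smoke, hgt])
beta, resid, adj_r2 = fit_ols(X, fev)
vifs = vif(X)

# Step 4: standardized residuals (residual / sqrt(MSE)), keep |z| <= 3, refit.
z = resid / np.sqrt(resid @ resid / (n - X.shape[1] - 1))
keep = np.abs(z) <= 3
beta_refit, _, adj_r2_refit = fit_ols(X[keep], fev[keep])

# Interaction: add the product term sex*smoke and refit.
X_int = np.column_stack([X[keep], sex[keep] * smoke[keep]])
beta_int, _, _ = fit_ols(X_int, fev[keep])

print("coefficients (intercept, age, sex, smoke, hgt):", np.round(beta, 3))
print("adjusted R^2:", round(adj_r2, 3), "after refit:", round(adj_r2_refit, 3))
print("VIF (age, sex, smoke, hgt):", np.round(vifs, 2))
print("cases kept:", int(keep.sum()), "of", n)
print("interaction coefficient:", round(float(beta_int[-1]), 3))
```

The sketch does not compute p-values, confidence intervals, leverage or Cook's distance; those would still come from SPSS or a statistics library.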

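Supplementary note on interpreting β after a log transformation. The slide on failed normality leaves the percentage change as an exercise. The short derivation below shows how that percentage follows from the model log(Y) = α + βX. It assumes the natural logarithm (for log base 10, replace e by 10), and the example value β = 0.05 is an arbitrary illustration, not an estimate from the FEV data.

```latex
\begin{aligned}
\log Y &= \alpha + \beta X
  \;\Longrightarrow\; Y = e^{\alpha + \beta X},\\[4pt]
\frac{Y(X+1)}{Y(X)} &= \frac{e^{\alpha + \beta (X+1)}}{e^{\alpha + \beta X}} = e^{\beta},\\[4pt]
\text{percent change in } Y \text{ per unit of } X &= 100\,\bigl(e^{\beta} - 1\bigr)\,\%
  \;\approx\; 100\,\beta\,\% \quad \text{for small } |\beta|.\\[4pt]
\text{Example: } \beta = 0.05 &\;\Longrightarrow\; 100\,\bigl(e^{0.05} - 1\bigr) \approx 5.1\,\%.
\end{aligned}
```

A one-unit increase in X therefore multiplies Y by a constant factor, which is why the effect is reported as a percentage and not as an absolute difference.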