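STATISTICS
Regression & Correlation

Outline
- X, Y & Regression Models
- Simple linear regression (SLR)
- The logic of SLR: SST = SSR + SSE
- SLR: ANOVA table & R-square
- Comparison of SLR, ANOVA, and the two-sample t test
- Multiple Linear Regression
- Pearson’s correlation coefficient (r)
- Relationships among R², r, and b
- Relationships among Z, t, F, and χ²

X and Y
- X: predictor variables; predictors; covariates; explanatory variables; independent variables.
- Y: outcome; response; dependent variables.

Univariate analysis: 1X1Y

| X | Y | Comparison | Method |
|---|---|---|---|
| Binary | Num. (normal) | 2 indep. means | Two-sample t test* |
| Categorical | Num. (normal) | >= 2 indep. means | One-way ANOVA* |
| Binary | Num. (non-normal) | 2 indep. medians | Wilcoxon rank sum |
| Categorical | Num. (non-normal) | >= 2 indep. medians | Kruskal-Wallis |
| Num. (normal) | Num. (normal) | X related to Y | Regression* |
| Categorical (binary) | Num. (normal) | 2 related means | Paired t |
| Categorical (binary) | Num. (non-normal) | 2 related medians | Wilcoxon signed rank |
| Categorical | Categorical | 2 indep. prop. | Pearson's Chi-sq |
| Categorical (binary) | Categorical (binary) | 2 related prop. | McNemar Chi-sq |
| Categorical (binary) | Categorical (binary) | 2 indep. prop. | Pearson's Chi-sq; two-sample Z |

Note: methods marked with * require the following assumptions: normality, independence, etc.
Abbreviations: Cat.: categorical; Num.: numerical.

Multivariate analysis: Xs1Y

| Xs | Y | Methods |
|---|---|---|
| Cat. | Cat. | Log-linear |
| Cat.+Num. | Cat. (binary) | Logistic regression |
| Cat.+Num. | Cat. (>=3) | Logistic regression; Discriminant analysis* |
| Cat. | Num. | ANOVA*; MANOVA* |
| Num. | Num. | Multiple regression* |
| Cat.+Num. | Num. (censored) | Cox proportional hazards model |
| Confounding factors | Num. | ANCOVA*; MANOVA*; GEE* |
| Confounding factors | Cat. | Mantel-Haenszel |

Other methods listed on the slide: cluster analysis; propensity scores; CART; factor analysis.
Note: methods marked with * require the following assumptions: multivariate normality, independence, etc.
Abbreviations: Cat.: categorical; Num.: numerical; CART: classification and regression tree; ANOVA: analysis of variance; ANCOVA: analysis of covariance; MANOVA: multivariate analysis of variance; GEE: generalized estimating equations.

Regression Models
- Definition: mathematical models that describe the relationship between Y and X.
- Uses of a regression model: adjustment; prediction; finding important factors for Y.

Simple linear regression (SLR)
Model:
$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
$$E(Y) = \mu_Y = \beta_0 + \beta_1 X$$

SLR Example
Is there a linear relationship between age and cholesterol?

| ID | AGE | CHOL |
|---|---|---|
| 1 | 34 | 141.4 |
| 2 | 39 | 180.5 |
| 3 | 44 | 178.4 |
| 4 | 46 | 212 |
| 5 | 48 | 203.2 |
| 6 | 51 | 224.1 |
| 7 | 53 | 186 |
| 8 | 60 | 350 |
| 9 | 61 | 286.3 |
| 10 | 65 | 287.6 |
| 11 | 66 | 330.3 |
| 12 | 67 | 311.3 |

SLR: parameter estimation
The least squares method:
$$\min_{\beta_0,\beta_1} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$$
Point estimates: $\hat\beta_0$: estimated intercept; $\hat\beta_1$: estimated slope.

The logic of SLR: SST = SSR + SSE
[Figure: scatter plot with the fitted line $\hat Y = \hat\beta_0 + \hat\beta_1 X$ and the horizontal line $\bar Y$, showing the decomposition at $X_i$]
- Total amount unexplained at $X_i$: $Y_1 - \bar Y$
- Amount at $X_i$ unexplained by regression: $Y_1 - \hat Y_1$
- Amount at $X_i$ explained by regression: $\hat Y_1 - \bar Y$
$$\sum (Y - \bar Y)^2 = \sum (Y - \hat Y + \hat Y - \bar Y)^2 = \sum (Y - \hat Y)^2 + \sum (\hat Y - \bar Y)^2$$
$$SST = SSE + SSR$$
(The cross-product term is zero for the least-squares fit.)

SLR: parameter estimation
The least squares method minimizes SSE:
$$S = \sum (Y_i - \hat Y_i)^2 = \sum \varepsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$$
Point estimates: setting the partial derivatives of S with respect to the intercept and the slope to zero gives the intercept and the slope.
Intercept:
$$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_0 = \bar Y - b_1 \bar X$$
Slope:
$$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_1 = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sum (X_i - \bar X)^2}$$

SLR example: Regression line
[Figure: CHOL vs Age scatter plot with the fitted regression line; x-axis: Age, y-axis: CHOL]
Estimated model: CHOL = (-57.5964988786446) + (5.65024919013205) * (Age)

SLR: ANOVA table & R-square

| Source | DF | SS | MSS | F | p | Power(5%) |
|---|---|---|---|---|---|---|
| Intercept | 1 | 696538.3 | 696538.3 | | | |
| Slope | 1 | 42705.43 | 42705.43 | 45.4538 | 0.0001 | 1.0000 |
| Error | 10 | 9395.352 | 939.5352 | | | |
| Adj. Total | 11 | 52100.78 | 4736.435 | | | |
| Total | 12 | 748639.1 | | | | |

R² = 0.82, p = 0.0001

SLR: qualitative covariate
- Example: X = treatment, 1 or 0; Y = SBP.
- Hypothesis: H0: β1 = 0; H1: β1 ≠ 0.
- Compared with the test of means: H0: μ1 = μ0; H1: μ1 ≠ μ0.
- Note: β1 = μ1 - μ0.

Comparison of SLR, ANOVA, and the two-sample t test
- 2-s t → ANOVA
- 2-s t → SLR: H0: μ1 = μ0 → H0: β1 = 0
- Dummy variables: K groups need K-1 dummy variables.

| ID | Y | X | → | ID | Y | X |
|---|---|---|---|---|---|---|
| 1 | 140 | A | | 1 | 140 | 0 |
| 2 | 135 | B | | 2 | 135 | 1 |

- ANOVA → SLR: H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0

| ID | Y | X | → | ID | Y | X1 | X2 |
|---|---|---|---|---|---|---|---|
| 1 | 140 | A | | 1 | 140 | 0 | 0 |
| 2 | 135 | B | | 2 | 135 | 0 | 1 |
| 3 | 130 | C | | 3 | 130 | 1 | 0 |

Multiple Linear Regression
Model:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$$
$$E(Y) = \mu_Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
$$\hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p$$
Example: Is Age a predictor for SBP adjusting for Sex?
$$\hat Y = \hat\beta_0 + \hat\beta_1\,\mathrm{AGE} + \hat\beta_2\,\mathrm{SEX}$$

MLR: example
[Figure: SBP plotted against Age with two parallel fitted lines; the intercepts $\hat\beta_0^*$ and $\hat\beta_0$ are marked on the SBP axis]
- Male: $\hat Y = \hat\beta_0^* + \hat\beta_1\,\mathrm{AGE}$
- Female: $\hat Y = \hat\beta_0 + \hat\beta_1\,\mathrm{AGE}$

Pearson’s correlation coefficient (r)
Relationship between X and Y:
$$r = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \sum (Y_i - \bar Y)^2}}$$
Properties of Pearson’s r:
- Range: $-1 \le r \le 1$
- Unitless
- Good for normally distributed X and Y
- r can be viewed as the cosine of the angle between two vectors in multidimensional space.

Spearman’s Rho: rank correlation
Relationship between X and Y:
$$r_s = \frac{\sum (R_X - \bar R_X)(R_Y - \bar R_Y)}{\sqrt{\sum (R_X - \bar R_X)^2 \sum (R_Y - \bar R_Y)^2}}, \qquad t = r_s \sqrt{\frac{n-2}{1 - r_s^2}}$$
Spearman’s correlation coefficient:
- Pearson’s r for ranked X and Y
- Good for non-normally distributed X and Y

Assumptions in Regression
- Linear
- Independent
- Normal distribution
- Equal variance
Note: for all values of x, the errors ε are independent and normally distributed, with mean μ = 0 and the same SD σ = σ(ε).
[Figure: Weight plotted against Height with the line y = α + βx]
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
α and β are the unknown parameters; ε is the random error fluctuation.

Relationships among R², r, and b
$$r^2 = \frac{SSR}{\sum (Y - \bar Y)^2} = 1 - \frac{SSE}{SST}$$
$$r = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \sum (Y_i - \bar Y)^2}}, \qquad b_1 = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sum (X_i - \bar X)^2}$$
$$r^2 = b_1^2 \, \frac{\sum (x - \bar x)^2}{\sum (y - \bar y)^2}, \qquad r = b_1 \, \frac{SD_X}{SD_Y}$$
r²: coefficient of determination, the proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.

Relationship between r and b
- r and b always have the same sign.
- r can be large while b is small, and r can be small while b is large.

Standard deviations associated with the regression line (1)

| | (1) SE of estimate | (2) SE of RL | (3) SE of prediction |
|---|---|---|---|
| Name | Standard error of the estimate | Standard error of the regression line (SD of the sampling distribution of Ŷ) | Standard error of prediction |
| Term used by 楊志良 | Standard deviation of the regression line | Standard error of the regression line | Standard error of the estimate (note: this term is easily confused) |
| Meaning | The spread of the vertical distances between the observations Y and the regression line; the standard deviation computed with the regression line in place of the mean. | The standard error of Ŷ over repeated samples taken at the same X values, i.e. the SD of the second-level normal distribution of Ŷ. Used for the CI of a single E(y). | The standard error of predicting Y from one X, i.e. the SD of the first-level normal distribution of Y at a given X value. |

Standard deviations associated with the regression line (2)
The standard error of the estimate:
$$S_{Y.X}^2 = V(Y) = \frac{\sum (Y - \hat Y)^2}{n-2} = \frac{\sum (Y - \bar Y)^2 - b_1^2 \sum (X - \bar X)^2}{n-2} = \frac{\sum (Y - \bar Y)^2 (1 - r^2)}{n-2}$$
SE of RL:
$$S_{\hat Y}^2 = V(\hat Y) = V(b_0 + b_1 X) = V[\bar Y + b_1 (X - \bar X)] = V(\bar Y) + V[b_1 (X - \bar X)] + 2 (X - \bar X)\,COV(\bar Y, b_1)$$
$$= \frac{\sigma^2}{n} + \frac{\sigma^2 (X - \bar X)^2}{\sum (X_i - \bar X)^2} \quad \text{(from Note (a), with } COV(\bar Y, b_1) = 0\text{)}$$
SE of prediction:
$$S^2 = V(Y - \hat Y) = V(Y) + V(\hat Y) - 2\,COV(Y, \hat Y) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar X)^2}{\sum (X_i - \bar X)^2} \right] \quad \text{(from the SE of RL above)}$$

Standard deviations associated with the regression line (3)
Note (a): variance of b1
$$V(b_1) = V\left[ \frac{\sum (X - \bar X)(Y - \bar Y)}{\sum (X - \bar X)^2} \right] = V\left[ \frac{\sum (X - \bar X) Y}{\sum (X - \bar X)^2} \right] = \frac{\sum (X - \bar X)^2 \, V(Y)}{\left[ \sum (X - \bar X)^2 \right]^2} = \frac{\sigma^2}{\sum (X - \bar X)^2}$$
Note (b): variance of b0
$$V(b_0) = V(\bar Y - b_1 \bar X) = V(\bar Y) + \bar X^2 V(b_1) - 2 \bar X \, COV(\bar Y, b_1) = \frac{\sigma^2}{n} + \frac{\sigma^2 \bar X^2}{\sum (X - \bar X)^2} = \sigma^2 \left( \frac{1}{n} + \frac{\bar X^2}{\sum (X - \bar X)^2} \right) \quad \text{(from Note (a))}$$

Example problem
Ten men aged 30-39 had their blood cholesterol measured at baseline (X) and again 10 years later (Y); the two measurements are compared (source: 彭游生物統計學, year 89, p. 374). Questions:
- What is the regression coefficient? What is the intercept?
- What is the correlation coefficient r?
- Is the correlation coefficient statistically significant? Given: F0.05(1,8) = 5.32.
- How much of the variation in cholesterol after 10 years is attributable to the variation in cholesterol 10 years earlier?
- Is the sample regression coefficient statistically significant?
- A man currently has a cholesterol of 350. Predict his cholesterol in 10 years and its 95% CI.
- A group of men has a mean cholesterol of 350. What is their cholesterol in 10 years and its 95% CI?
Partial answers:

Logistic Regression
- Topic: prediction when Y is a categorical variable (predicting a nominal or categorical outcome), e.g. diseased or not; died or not.
- Odds ratio (Chinese terms: 勝算比; 危險對比值)
- Study designs:
  - Cross-sectional study
  - Cohort study (follow-up study)
  - Case-control study
  - Clinical trial

Odds ratio

| X \ Y | Diseased (+) | Not diseased (-) | Total |
|---|---|---|---|
| Exposed (+) | A | B | A+B |
| Unexposed (-) | C | D | C+D |
| Total | A+C | B+D | A+B+C+D |

Odds are another way of expressing a probability:
$$\text{odds} = \frac{p(x)}{1 - p(x)}$$
Odds are the same idea as betting odds.
Odds ratio:
- Disease rate in the exposed group: p1 = A / (A+B)
- Disease rate in the unexposed (control) group: p0 = C / (C+D)
$$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{\dfrac{A/(A+B)}{B/(A+B)}}{\dfrac{C/(C+D)}{D/(C+D)}} = \frac{AD}{BC}$$
Exercise: in World Cup betting, Brazil pays 1 for 1 and China pays 100 for 1. What is the odds ratio of Brazil to China?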
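The odds ratio formula above can be checked numerically. The following is a minimal sketch: the cell counts `a, b, c, d` are hypothetical values chosen only for illustration (they do not come from the slides), and the function names are likewise assumptions. It computes the OR from the two group probabilities and confirms that it equals the cross-product AD/BC.

```python
def odds(p: float) -> float:
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """Odds ratio for a 2x2 table laid out as on the slide:
    exposed row = (a diseased, b not diseased),
    unexposed row = (c diseased, d not diseased)."""
    p1 = a / (a + b)  # disease rate in the exposed group
    p0 = c / (c + d)  # disease rate in the unexposed group
    return odds(p1) / odds(p0)


# Hypothetical counts, for illustration only.
a, b, c, d = 30, 70, 10, 90

or_from_probs = odds_ratio(a, b, c, d)
or_cross_product = (a * d) / (b * c)

print(f"OR from p1 and p0: {or_from_probs:.4f}")
print(f"OR from AD/BC:     {or_cross_product:.4f}")
```

Both routes give the same number because the group totals A+B and C+D cancel in the ratio.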
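Returning to the age and cholesterol example from the SLR slides, the least-squares formulas for b0 and b1 and the SST = SSR + SSE decomposition can be reproduced directly from the 12 listed observations. This is a sketch in plain Python (variable names are illustrative, not from the slides); it should give the same slope, intercept, sums of squares, F, and R² as the slides report.

```python
# Data from the SLR example slide (ID 1-12).
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n

# Least-squares estimates: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar.
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sums of squares: total, explained by regression, and residual.
fitted = [b0 + b1 * x for x in age]
sst = sum((y - y_bar) ** 2 for y in chol)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(chol, fitted))

r_squared = ssr / sst
f_stat = (ssr / 1) / (sse / (n - 2))  # slope MS over error MS, df = (1, n - 2)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.2f}, F = {f_stat:.4f}")
```

The "Adj. Total" row of the ANOVA table corresponds to `sst` here, and the "Slope" and "Error" rows to `ssr` and `sse`.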
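Study designs in epidemiology
- Cross-sectional study
- Cohort study (follow-up study)
- Case-control study
- Clinical trial

Bias in epidemiology
- Selection bias
- Information bias
- Misclassification
- Confounding

Cross-sectional study
- Purpose: prevalence surveys; health administration needs.
- Key point: the study subjects must be representative (random sampling).
- Limitation: no temporal sequence, so causality cannot be established.

Case-control study
- Purpose: causal analysis; compare the exposure rates of the case group and the control group.
- Key point: selection of controls. The control group must represent the exposure experience of the source population from which the cases arise.
- Limitations: temporal sequence; recall bias.

Cohort study (follow-up study)
- Purpose: causal analysis; compare the disease incidence of the exposed and unexposed groups.
- Key point: follow-up.
- Limitation: loss to follow-up.

Confounding factors
Definition of a confounder:
- It is independently associated with the disease; it is itself a risk factor.
- It is associated with the risk factor under study.
- It must not be an intermediate variable: X1 → X2 → Y.
[Figure: diagram linking Obesity, Cholesterol, and MI]

Clinical trial
- Purpose: evaluate the effect of an intervention (e.g. drug treatment, health education).
- Key points: randomization, which controls confounders; the placebo effect.
- Limitation: ethical issues.

Relationships among study designs
- Case-control study
- Matched case-control study
- Cohort study
- Matched cohort study
- Complete matched cohort study
- Randomized clinical trial
- Causality and correlation
- Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + …: covariate, confounder

Logistic regression
Simple linear regression:
$$E(Y \mid x) = \beta_0 + \beta_1 x \in (-\infty, \infty)$$
Logistic regression: Y is a binary categorical variable. How can Y be mapped from (0, 1) to (-∞, ∞)?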
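One standard answer to the question above is the log-odds (logit) transform, whose inverse maps any real number back into (0, 1). The sketch below is only an illustration with hypothetical function names; it shows that probabilities near 0 and 1 are sent toward large negative and positive values, and that the two functions undo each other.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1); the result can be any real number."""
    return math.log(p / (1.0 - p))


def inv_logit(g: float) -> float:
    """Inverse of logit: maps any real number g back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-g))


for p in (0.01, 0.2, 0.5, 0.8, 0.99):
    g = logit(p)
    print(f"p = {p:.2f} -> logit = {g:+.4f} -> back to p = {inv_logit(g):.2f}")
```

Because the logit is unbounded, a linear predictor β0 + β1x can be modelled on that scale without ever producing a probability outside (0, 1).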
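Logistic transformation
$$p(x) = E(Y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \in (0, 1)$$
$$g(x) = \ln \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x \in (-\infty, \infty)$$

Logistic regression coefficients and the OR
- OR: exp(beta)
- If X is a categorical variable with three or more groups, exp(beta) is the OR compared with the reference group.
- If X is a continuous variable, exp(beta) is the OR for each one-unit increase in X.
- If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant".
- Example: X is sex (male x = 1, female x = 0); Y is suicide (yes or no).
$$g(1) = \beta_0 + \beta_1; \qquad g(0) = \beta_0; \qquad g(1) - g(0) = \beta_1 = \ln(OR)$$

Textbook example: LR
- Men with unintentional injury (Soderstrom, 1997), Table 10-5, p. 247.
- Conclusion: white men who arrive at the emergency room on weekend nights have a higher probability of an elevated blood alcohol concentration (BAC > 50 mg/dL); age shows no statistically significant difference.

Relationships among Z, t, F, and χ²
Z² and chi-square (population mean known):
Definition:
$$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2} \sim \chi^2_{(n)}$$
Conclusion:
$$Z^2 = \frac{(\bar x - \mu)^2}{\sigma^2 / n} \sim \chi^2_{(1)} \qquad \text{or} \qquad \sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(\bar x_i - \mu)^2}{\sigma^2 / n} \sim \chi^2_{(n)}$$

F and chi-square (population mean unknown):
Definition:
$$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \bar x)^2}{\sigma^2} = \frac{(n-1) s^2}{\sigma^2} \sim \chi^2_{(n-1)}, \qquad s^2 = \frac{\sigma^2 \, \chi^2_{(n-1)}}{n-1}$$
Conclusion:
$$F_{df_1, df_2} = \frac{s_1^2}{s_2^2} = \frac{\chi^2_{df_1} / df_1}{\chi^2_{df_2} / df_2}, \qquad F_{df_1, \infty} = \frac{s_1^2}{\sigma^2} = \frac{\chi^2_{df_1}}{df_1}$$

Summary of the relationships, starting from $F_{\alpha,(df_1, df_2)}$:
$$F_{\alpha,(1, df_2)} = t^2_{1-\alpha/2,(df_2)}, \qquad F_{\alpha,(df_1, \infty)} = \frac{\chi^2_{\alpha, df_1}}{df_1}, \qquad F_{\alpha,(1, \infty)} = z^2_{1-\alpha/2}$$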
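The last identity, together with Z² ~ χ²(1), can be checked numerically using only the Python standard library. The sketch below rests on one fact not taken from the slides: the χ² distribution with 1 degree of freedom has CDF P(Z² ≤ x) = erf(√(x/2)). It squares the two-sided normal critical value at α = 0.05 and confirms that the upper-tail χ²(1) probability beyond that point is α.

```python
import math
from statistics import NormalDist

alpha = 0.05

# Two-sided standard normal critical value z_{1 - alpha/2} and its square.
z = NormalDist().inv_cdf(1 - alpha / 2)
z_squared = z ** 2


def chi2_1_cdf(x: float) -> float:
    """CDF of chi-square with 1 df: P(Z^2 <= x) = P(|Z| <= sqrt(x)) = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(x / 2.0))


# If z^2 is the upper-alpha critical value of chi-square(1),
# the probability above it must equal alpha.
upper_tail = 1.0 - chi2_1_cdf(z_squared)

print(f"z = {z:.4f}, z^2 = {z_squared:.4f}")
print(f"P(chi2_1 > z^2) = {upper_tail:.4f}")
```

The same squared value is what F tables list for (1, ∞) degrees of freedom at the 5% level, which is the F = z² relationship in the summary.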