STATISTICS
Regression & Correlation

Outline
- X, Y & Regression Models
- Simple linear regression (SLR)
- The logic of SLR: SST = SSR + SSE
- SLR: ANOVA table & R-square
- Comparison of SLR, ANOVA, and the two-sample t test
- Multiple Linear Regression
- Pearson's correlation coefficient (r)
- Relationships among $R^2$, r, and b
- Relationships among Z, t, F, and $\chi^2$

X and Y
X: Predictor variables; Predictors; Covariates; Explanatory variables; Independent variables.
Y: Outcome; Response; Dependent variables.

Univariate analysis: 1 X → 1 Y
X | Y | Comparison | Method
Binary | Num. (normal) | 2 indep. means | Two-sample t test*
Categorical | Num. (normal) | >= 2 indep. means | One-way ANOVA*
Binary | Num. (non-normal) | 2 indep. medians | Wilcoxon rank sum
Categorical | Num. (non-normal) | >= 2 indep. medians | Kruskal-Wallis
Num. (normal) | Num. (normal) | - | Regression*
- | Num. (normal) | 2 related means | Paired t
- | Num. (non-normal) | 2 related medians | Wilcoxon signed rank
Categorical | Categorical | X related to Y | Pearson's Chi-sq
Categorical (binary) | Categorical (binary) | 2 related prop. | McNemar Chi-sq
Categorical (binary) | Categorical (binary) | 2 indep. prop. | Pearson's Chi-sq
Categorical (binary) | - | 2 indep. prop. | 2-Z
Note: methods marked with * require the following assumptions: normality, independence, ..
Abbreviations: Cat.: categorical; Num.: numerical.

Multivariate analysis: Xs → 1 Y
Note: methods marked with * require the following assumptions: multivariate normality, independence, ..
Abbreviations: Cat.: categorical; Num.: numerical; CART: classification and regression tree; ANOVA: analysis of variance; ANCOVA: analysis of covariance; MANOVA: multivariate analysis of variance; GEE: generalized estimating equations.
Xs | Y | Methods
Categorical | Cat. | Log-linear
Cat.+Num. | Cat. (binary) | Logistic regression
Cat.+Num. | Cat. (>=3) | Logistic regression; Discriminant analysis*; Cluster analysis; Propensity scores; CART
Cat. | Num. | ANOVA*; MANOVA*
Num. | Num. | Multiple regression*
Cat.+Num. | Num. (censored) | Cox proportional hazard model
Confounding factors | Num. | ANCOVA*; MANOVA*; GEE*
Confounding factors, Num. | Cat. |
Mantel-Haenszel; Factor analysis

Regression Models
Mathematical models to describe the relationship between Y and X.
The use of regression models:
- Adjustment
- Prediction
- Finding important factors for Y

Regression Models
Definition:
- Mathematical models to describe the relationship between Y and X
Purpose (the use of regression models):
- Find important factors for Y, and/or
- Prediction

Simple linear regression (SLR)
Model:
$Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
$E(Y) = \beta_0 + \beta_1 X$, i.e. $\mu_Y = \beta_0 + \beta_1 X$

SLR Example
Is there a linear relationship between age and cholesterol?
ID   AGE  CHOL
1    34   141.4
2    39   180.5
3    44   178.4
4    46   212
5    48   203.2
6    51   224.1
7    53   186
8    60   350
9    61   286.3
10   65   287.6
11   66   330.3
12   67   311.3

SLR: parameter estimation
The least squares method:
$\min \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$
Point estimates:
- $\hat{\beta}_0$: estimated intercept
- $\hat{\beta}_1$: estimated slope

The logic of SLR: SST = SSR + SSE
Figure: the fitted line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ together with the horizontal line at $\bar{Y}$. At a given $X_i$:
- total amount unexplained at $X_i$: $Y_i - \bar{Y}$
- amount at $X_i$ explained by regression: $\hat{Y}_i - \bar{Y}$
- amount at $X_i$ unexplained by regression: $Y_i - \hat{Y}_i$
$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$
SST = SSE + SSR
(The cross-product term $2\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})$ is zero for the least-squares fit.)

SLR: parameter estimation
The least squares method minimizes SSE:
$S = \sum (Y_i - \hat{Y}_i)^2 = \sum \varepsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$
Point estimates: setting the partial derivatives of S with respect to the intercept and the slope to zero gives the intercept and the slope.
Intercept:
$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X}$
Slope:
$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

SLR example: Regression line
Figure: CHOL (y-axis, 100.0 to 350.0) vs Age (x-axis, 30.0 to 70.0), with the fitted line.
Estimated Model: CHOL = (-57.5964988786446) + (5.65024919013205) * (Age)

SLR: ANOVA table & R-square
Source      DF   SS         MSS
Intercept   1    696538.3   696538.3
Slope       1    42705.43   42705.43
Error       10   9395.352   939.5352
Adj.
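Total       11   52100.78   4736.435
Total       12   748639.1
F = 45.4538, p = 0.0001, Power (5%) = 1.0000
$R^2 = 0.82$, p = 0.0001

SLR: qualitative covariate
Example:
- X = treatment, 1 or 0
- Y = SBP
Hypothesis:
- $H_0: \beta_1 = 0$
- $H_1: \beta_1 \neq 0$
Comparison with the test of means:
- $H_0: \mu_1 = \mu_0$
- $H_1: \mu_1 \neq \mu_0$
Note: $\beta_1 = \mu_1 - \mu_0$

Comparison of SLR, ANOVA, and the two-sample t test
- Two-sample t → ANOVA
- Two-sample t → SLR: $H_0: \mu_1 = \mu_0$ → $H_0: \beta_1 = 0$
- Dummy variables: K groups need K-1 dummy variables
Original coding      Dummy coding
ID  Y    X           ID  Y    X
1   140  A           1   140  0
2   135  B           2   135  1
-   -    -           -   -    -
- ANOVA → SLR: $H_0: \mu_1 = \mu_2 = \mu_3$ → $H_0: \beta_1 = \beta_2 = 0$
Original coding      Dummy coding
ID  Y    X           ID  Y    X1  X2
1   140  A           1   140  0   0
2   135  B           2   135  0   1
3   130  C           3   130  1   0
-   -    -           -   -    -   -

Multiple Linear Regression
Model:
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
$E(Y) = \mu_Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p$
Example: Is Age a predictor for SBP, adjusting for Sex?
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{AGE} + \hat{\beta}_2\,\mathrm{SEX}$

MLR: example
Figure: SBP (y-axis) vs Age (x-axis), with two fitted lines that share the slope $\hat{\beta}_1$:
- male: $\hat{Y} = \hat{\beta}_0^{*} + \hat{\beta}_1\,\mathrm{AGE}$
- female: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{AGE}$
- vertical gap between the lines: $\hat{\beta}_0^{*} - \hat{\beta}_0$

Pearson's correlation coefficient (r)
Relationship between X and Y:
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
Properties of Pearson's r:
- Range: $-1 \le r \le 1$
- Unitless
- Good for normally distributed X and Y
- r can be viewed as the cosine of the angle between two vectors in multidimensional space
Spearman's correlation coefficient:
- Pearson's r for ranked X and Y
- Good for non-normally distributed X and Y

Spearman's Rho: rank correlation
Relationship between X and Y:
$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$
$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$
Spearman's correlation coefficient:
- Pearson's r for ranked X and Y
- Good for non-normally distributed X and Y

Assumptions in Regression
- Linear
- Independent
- Normal distribution
- Equal variance
Note: for all values of x, the errors $\varepsilon$ are independent, normally distributed, have the same SD $\sigma = \sigma(\varepsilon)$, and have mean $\mu = 0$.
Figure: Weight (y-axis) vs Height (x-axis) with the line $y = \alpha + \beta x$.
$Y_i = \alpha + \beta X_i + \varepsilon_i$
$\alpha$ and $\beta$ are the unknown parameters; $\varepsilon$ = random error fluctuations.

Relationships among $R^2$, r, and b
r and b:
$R^2 = \frac{\mathrm{SSR}}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{SSE}}{\sum (Y_i - \bar{Y})^2} = r^2$
$r^2 = \frac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2} = b_1^2 \, \frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}$
$r = b \cdot \frac{SD_X}{SD_Y}$
$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
$r^2$: Coefficient of Determination:
- The proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.

Relationship between r and b: same sign
Figure panels:
- large r, small b
- small r, large b

Standard deviations of the regression line 1: terminology
(1) SE of estimate
- Term used by 楊志良: standard deviation of the regression line
- Meaning: the variability of the vertical distances between any observed Y and the regression line; the standard deviation computed with the regression line in place of the mean.
(2) SE of RL (the standard deviation of the sampling distribution of $\hat{Y}$)
- Term used by 楊志良: standard error of the regression line
- Meaning: the standard error of Y computed from repeated samples at the same X value, i.e. the standard deviation of the second-level normal distribution of $\hat{Y}$; used for the CI of a single $E(Y)$.
(3) SE of prediction
- Term used by 楊志良: standard error of estimate (** this term is easily confused)
- Meaning: the standard error of predicting Y from a single X, i.e. the standard deviation of the first-level normal distribution of Y at a given X value.

Standard deviations of the regression line 2: formulas
The Standard Error of the Estimate:
$S_{Y \cdot X}^2 = \hat{V}(Y) = \hat{\sigma}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum (Y_i - \bar{Y})^2 - b_1^2 \sum (X_i - \bar{X})^2}{n - 2} = \frac{\sum (Y_i - \bar{Y})^2 (1 - r^2)}{n - 2}$
SE of RL:
$S_{\hat{Y}}^2 = V(\hat{Y}) = V(b_0 + b_1 X) = V[\bar{Y} + b_1 (X - \bar{X})] = V(\bar{Y}) + V[b_1 (X - \bar{X})] + 2 (X - \bar{X})\,\mathrm{Cov}(\bar{Y}, b_1) = \frac{\sigma^2}{n} + \frac{\sigma^2 (X - \bar{X})^2}{\sum (X_i - \bar{X})^2}$ (from Note (a), with $\mathrm{Cov}(\bar{Y}, b_1) = 0$)
SE of prediction:
$\hat{S}_{Y}^2 = V(Y - \hat{Y}) = V(Y) + V(\hat{Y}) - 2\,\mathrm{Cov}(Y, \hat{Y}) = \sigma^2 \left[1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$ (from (2) above)

Standard deviations of the regression line 3: notes
Note (a): variance of $b_1$
$V(b_1) = V\left[\frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\right] = V\left[\frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2}\right] = \frac{\sum (X_i - \bar{X})^2}{\left[\sum (X_i - \bar{X})^2\right]^2} V(Y) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$
Note (b): variance of $b_0$
$V(b_0) = V(\bar{Y} - b_1 \bar{X}) = V(\bar{Y}) + \bar{X}^2 V(b_1) - 2 \bar{X}\,\mathrm{Cov}(\bar{Y}, b_1) = \frac{\sigma^2}{n} + \frac{\bar{X}^2 \sigma^2}{\sum (X_i - \bar{X})^2}$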
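(from Note (a)), so that $V(b_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right)$.

Example
Blood cholesterol was measured in 10 men aged 30-39 at baseline (X) and again 10 years later (Y); the two measurements are compared below (source: 彭游, 生物統計學, year 89, p. 374). Questions:
- What is the regression coefficient?
- What is the intercept?
- What is the correlation coefficient r?
- Is the correlation coefficient statistically significant? Given: $F_{0.05}(1, 8) = 5.32$.
- How much of the variation in cholesterol 10 years later is attributable to the variation in cholesterol 10 years earlier?
- Is the sample regression coefficient statistically significant?
- A man currently has a cholesterol level of 350. Predict his cholesterol 10 years later and its 95% CI.
- A group of men has a mean cholesterol level of 350. What is their cholesterol 10 years later and its 95% CI?
Partial solution:

Example: partial solution (continued)

Logistic Regression
Topic: prediction when Y is a categorical variable
- Predicting a nominal or categorical outcome
- Diseased or not; died or not
- Odds Ratio (OR)
Study designs:
- Cross-sectional study
- Cohort study (follow-up study)
- Case-control study
- Clinical trial

Odds ratio
X \ Y            Diseased (+)   Not diseased (-)   Total
Exposed (+)      A              B                  A+B
Unexposed (-)    C              D                  C+D
Total            A+C            B+D                A+B+C+D
- Odds are another way of expressing a probability: $\mathrm{odds} = \frac{p(x)}{1 - p(x)}$
- Odds are the same notion as betting odds.
- Odds ratio (OR):
  - Incidence in the exposed group: $p_1 = A / (A + B)$
  - Incidence in the control group: $p_0 = C / (C + D)$
$\mathrm{OR} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{\frac{A/(A+B)}{B/(A+B)}}{\frac{C/(C+D)}{D/(C+D)}} = \frac{AD}{BC}$
- In World Cup betting, Brazil pays 1 for every 1 staked and China pays 100 for every 1 staked.
- What is the odds ratio of Brazil to China?

Study designs in epidemiology
- Cross-sectional study
- Cohort study (follow-up study)
- Case-control study
- Clinical trial

Bias in epidemiology
- Selection bias
- Information bias
  - Misclassification
- Confounding

Cross-sectional study
Purpose:
- Prevalence surveys
- Health administration needs
Key point:
- The study subjects must be representative: random sampling
Limitation:
- No temporal sequence, so causality cannot be established

Case-control study
Purpose:
- Causal analysis
- Compare exposure rates between the case group and the control group
Key point:
- Selection of the control group
- The control group must represent the exposure experience of the source population from which the cases arose
Limitations:
- Temporal sequence
- Recall bias

Cohort study (follow-up study)
Purpose:
- Causal analysis
- Compare disease incidence between the exposed group and the unexposed group
Key point:
- Follow-up
Limitation:
- Loss to follow-up

Confounding factors
Definition of a confounding factor:
- It is associated with the disease on its own; it is itself a risk factor
- It is associated with the risk factor under study
- A confounder must not be an intermediate variable: X1 → X2 → Y
Figure labels: Obesity, Cholesterol, MI

Clinical trial
Purpose: evaluate the effect of an intervention
- Intervention: drug treatment, health education
Key points:
- Randomization: controls for confounding factors
- Placebo effect
Limitation:
- Ethical issues

Relationships among study designs
- Case-control study → Matched case-control study
- Cohort study → Matched cohort study
- Randomization clinical trial → Complete matched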
cohort study Causality and correlation  Y=a+b1X1+b2X2+b3X3+b4X4+b5X5… covariate, confounder 38 STATISTICS Logistic regression: Simple linear regression: E(Y | x)   0  1 x ~ (, ) Logistic regression: Y為二分類別變項  如何使Y從(0,1)到(- ∞, ∞)?  Logistic transformation p( x)  E (Y | x)  exp( 0  1 x)  p ( x)     0  1 x ~ (, ) ~ (0,1)  g ( x)  ln 1  exp( 0  1 x)  1  p ( x)  39 STATISTICS Logistic regression係數與OR  OR：exp(beta)  若該X變項是三組以上的類別變項，表示與參考組比較的OR  若該X變項是連續變項，表示每增加一單位的X，會增加多少OR  若model有多個X變項，解讀相同，但要加上「其他X變項保持不變下」的條件  舉例:  X代表性別，男性x=1，女性x=0；Y代表自殺的有無 g (1)   0  1 ; g (0)   0  g (1)  g (0)  1  ln(OR) 40 STATISTICS 課本例子：LR men with unintentional injury  Soderstrom, 1997 Table 10-5,p247 結論：  週末的晚上到急診室的白人，有較高的機率血中酒精濃度過高(BAC>50mg/Dl)；  年紀則沒有統計差異。 41 STATISTICS Z, t, F, 2 之間的關係  Z2 , chi-square  母群體平均值已知： 定義： n 2 2 Z    i ( n)  2 ( x   )  i 2  i 1 2 結論： Z 2   2  ( x   ) 1 (1) 2 /n n 或 2 2 Z    i ( n)  i 1 2 ( x   )  i 2 / n 42 STATISTICS Z, t, F, 2 之間的關係 F ,chi-square  母群體平均值未知： 2 2  s ( x  x ) ( n  1 ) ( n  1 ) s  2 定義：  Zi2   2( n1)   i 2  2 n 1    i 1 2 s 2  ( n1) s12 s12 結論： Fdf 1,df 2  2  df 2  , Fdf 1,  2 Fdf 1,  2  s2   n 1 n 2 2 43 STATISTICS Z, t, F, 2 之間的關係 F ,(df 1,df 2) F ,(1,df 2)  t F ,(df 1, )  2 1 / 2,(df 2 ) 2 df 1 F ,(1,)  z12 / 2 44

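
The identities OR = AD/BC and $g(1) - g(0) = \beta_1 = \ln(\mathrm{OR})$ from the odds ratio and logistic regression slides can be illustrated numerically. The sketch below is a hedged example in Python: the 2×2 counts are hypothetical values chosen only for illustration, and the names `odds`, `odds_ratio`, and `p` are not from the slides. For a single binary X, choosing $\beta_0 = \ln(C/D)$ and $\beta_1 = \ln(\mathrm{OR})$ reproduces the two group proportions.

```python
import math


def odds(prob):
    """Convert a probability into odds p / (1 - p)."""
    return prob / (1 - prob)


def odds_ratio(a, b, c, d):
    """Odds ratio of disease, exposed (a, b) versus unexposed (c, d)."""
    p1 = a / (a + b)  # incidence in the exposed group
    p0 = c / (c + d)  # incidence in the unexposed group
    return odds(p1) / odds(p0)


# Hypothetical 2x2 counts (assumed for illustration, not from the slides):
# exposed: 30 diseased, 70 not diseased; unexposed: 10 diseased, 90 not.
a, b, c, d = 30, 70, 10, 90

or_from_probs = odds_ratio(a, b, c, d)
or_cross_product = (a * d) / (b * c)  # the AD / BC shortcut

# Logistic model with one binary X: g(x) = beta0 + beta1 * x.
beta0 = math.log(c / d)             # log odds in the unexposed group
beta1 = math.log(or_cross_product)  # log odds ratio


def p(x):
    """Logistic transformation: probability of disease at X = x."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))


print("OR from probabilities:", or_from_probs)
print("OR from AD/BC:", or_cross_product)
print("exp(beta1):", math.exp(beta1))
print("p(1), p(0):", p(1), p(0))
```

With these assumed counts, both routes to the OR give the same value, and `p(1)` and `p(0)` return the exposed and unexposed proportions A/(A+B) and C/(C+D).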