An Introduction to Things to Come!

Administrivia
• The details on the class can be found here: www.stanford.edu/class/hrp259
• Class participation plus lots of little assignments and 2 exams determine your final grade.
• If you send me a computer virus or other malicious code (even unintentionally) you will fail the course.
  – Visit ess.stanford.edu to get a virus scanner and keep it up to date by clicking LiveUpdate….
• Dozens of new viruses are unleashed on the world every day. Please update your virus definitions and scan everything before you send it to me.

How to Predict The Future
• Either the world is driven completely by random chance events (and your best bet for predicting the future is using Tarot cards or a Magic 8 Ball™) or there are detectable patterns in the world.
• If you talk to a preschool teacher or a PhD in math, they will tell you that math is all about pattern detection.

Deterministic Modeling
• To predict the future using a deterministic model, you would say that if event A happens then B will happen. For example, in a deterministic world, if you know that a baby gestates for exactly 280 days (40 weeks), its weight will be exactly 7 ½ lbs (3.4 kg). We know that in reality it is typically in the range of 5.5–10 pounds (2.5–4.5 kg).
• For better or for worse, we do not live in a deterministic (Orwellian 1984ish) world and we cannot usually make EXACT predictions in medicine.

What We Want….
• You want to be able to fill in a table like this:

  Gestation (weeks)    Weight at birth (lbs)
  29                   ??
  …                    …
  38                   ??
  39                   ??
  40                   ??

  …and express it with a simple formula like this:
  lbs = weeks * something

The Process
• The process of going from a single predictor or a set of predictors to a predicted outcome is called statistical modeling.
• People get far too excited about figuring out which statistic (with accompanying p-value anxiety) to use for the factors that are used in models.
• Today I want to talk about the process of building and using the most common models that you will see in medicine.

What is a Model and Why Care?
• The predictors and the outcomes can be on a continuous scale (time in days) or categorical factors (mom smoked, yes or no).
• Generally we try to use all the information available when we make a prediction about the future.
  – The amount of blood ejected each time the heart beats (continuous scale) as opposed to whether or not the heart is beating
  – The number of cancer cells seen on a slide (or the presence or absence of malignant cells)
• The models we build are remarkably similar regardless of whether we have categorical or continuous outcomes.

The Structure of a Model
• All the models I learned in school were formulated at their core like this:
  Outcome = baseline + predictor + predictor
  Baby’s weight = some number + impact of time + impact of being a smoker
  where the impact of time is weeks * a number, and the impact of being a smoker is a number.
• The math can get ugly very quickly depending on the properties of the outcome (continuous, count, categories) but the core idea is that these models are all using additive contributions from some predictors!

How many predictors?
• You will be faced with the HARD question of how many predictors to include.
  – Stop and think about the LEVELS of your predictors. If you have 100 births in a dataset and 10 variables, many of which are categorical, you can quickly find yourself making predictions based on one or two births.
• Male babies (50%) from smoking mothers (20%) where this is the mom’s 2nd or later birth (35%) = SMALL numbers.
• From 100 in the sample, you end up with about 3.5 births in this level (100 * 0.50 * 0.20 * 0.35, if the factors are independent).
• Do you want to make generalizations to the WORLD based on 3 or 4 children?
• Are the factors correlated?
• Are some cheap to measure?
• Can you use some as proxies for others?

Models
• One of my favorite statistics books is Michael J. Crawley’s Statistical Computing. He says:
  1. All models are wrong.
  2. Some models are better than others.
  3. The correct model can never be known with certainty.
  4. The simpler the model, the better it is.

What Makes a Bad Model
• Predicts some outcomes poorly
• Is strongly influenced by a small number of data points
• Shows systematic patterns in how it fails to predict

Types of Statistical Modeling
• Think about the extremes of what you could use when making a model:
• Null model: use the mean of everyone
• A theoretically minimally adequate model
• Current model you have specified
• A theoretical maximal model with every predictor
• Saturated model
  – Every predictor and every interaction
  – Have everyone predict themselves

What is an interaction?
• Sometimes factors in a model behave strangely together. A mad scientist invented a love potion and she looked at happiness in the two genders and in people on and off her drug. There was a huge increase in happiness if you were male and on the drug but minimal impact of gender or drug by themselves.

Interaction: Gender by Drug
[Figure: histograms of happiness (percent of each group, happiness from 50 to 140) for females and males, drugged versus normal. Reported means (standard deviations): females drugged 99.88 (13.78), females normal 103.7 (16.70), males drugged 120.1 (12.73), males normal 101.8 (11.18).]

Interactions Again
• Imagine that there are two factors that impact a man’s risk of prostate cancer: being Japanese and living in Japan when you are a pre-teen. If you are Japanese, your risk of prostate cancer is lower than for men of any other race. If you were raised in Japan, your risk is lower than for men raised in the USA.
• The interaction term would measure if there is extra protection for being both Japanese and having been raised in Japan.

Categorical Interaction
• IsJapanese = 0 or 1 (1 = yes and 0 = no)
• IsRaisedJapan = 0 or 1
• IsJapanese * IsRaisedJapan = ??
[Figure: annual probability of prostate cancer (0 to 0.1) for Not Japanese versus Japanese men, grouped by Born USA versus Born Japan.]

You can ALWAYS do better.
• You can keep adding more and more predictors to a model but the price is the loss of generalizability. Will you get another child who is exactly that weight, whose mother smoked that much, of that ethnicity, that gender, whose father weighed that much, mother weighed that much, etc….?
• Modeling methods compare the models and frequently use criteria that penalize for extra factors (for example AIC, Akaike’s information criterion).

Goals
• I see modeling as having two goals.
• Estimate parameters.
  – How much weight gain occurs each week as a baby is developing?
• Estimate how well the model describes your data.
  – How far off will my guess be when I predict the next child?
  – Are there regions where my guesses are far off, like premature or late deliveries?
  – Is there a lot of variability at one point and not at others?
  – Can I see any problems when I fit the model to THIS data?

Looking for Errors
• Statisticians use the word “error” differently than everyone else.
  – You know that you will not have perfect prediction. Instead, you will be off. That is error. It does not mean somebody made a mistake! It just means you can’t make a perfect prediction.
  – Specifying how far you will be off is the fun and interesting part of statistics. The rest is just math.

Looking at Errors
  Outcome = baseline + predictor + predictor + error
  Baby weight = some number + impact of time + impact of being a smoker + error
  where the impact of time is weeks * a number, the impact of being a smoker is a number, and the error is a number drawn from a bell-shaped distribution.

Looking for Errors
• Hopefully you will see that, given any specific predictor value, your guessed values for the outcome will be close to the values you actually observe. Observed values that stray far from your guess should be rare.
• That pattern of how far off your guesses are from your observed data can frequently be described by a bell-shaped (“normal”) histogram.
So, if you measure errors between your predictions and the observed outcomes, the distribution should be “normal.”

Guesses and Errors
[Figure, top panel: histogram of actual weights at 40-week births, running from about 5.5 lbs to 9.5 lbs, with the model’s guess marked at 7.5 lbs.]
[Figure, bottom panel: histogram of errors at 40 weeks. The error is 0 if the child was 7.5 lbs, most errors are off by just a bit, and guesses that are way too high or way too low are rare.]

Looking at Errors
• There are some kinds of errors that you will be unwilling to accept.
• If I want to predict the number of times an evil lackey proposes marriage to a mad scientist, I will not accept a negative number!
• If I am predicting the chance of someone developing cancer, I will not accept a number less than 0% or greater than 100%.
• Specifying the type of errors is a critical part of building a model.

More on Errors
• In addition to specifying the range of legal values, another critical component is specifying the variability in the errors.
• If your data are constrained to lie between 0 and 1, what is the variability like?
  – If the average is about .5, then you can have scores that are both above and below that, and the variability drawn as a histogram may be well described by a bell-shaped curve. What is the variability like if the average is about .95? Now the whole right side of your curve is hanging off the side of your page!
• If you have count data (e.g., number of cancer cells), your variability increases with the average count.
[Figure: bell curves drawn on a scale from 0 to 1, one centered near .5 and one near the upper limit. Caption: If you pretend your variability is normally distributed but your outcome has a limited range, you clearly have problems.]
[Figure: variance plotted against mean, both from 0 to 20000. Caption: In theory, the variance of count data increases with the mean.]

Ordinary Least Squares
• Perhaps the easiest models to draw and understand are ones where you have a continuous outcome like weight and a continuous predictor like time.
• The model is just a line….
• Y = mX + b
  Weight = estimated weight gain each week after conception * number of weeks + weight at 0 weeks

Maximum Likelihood Visual
[Figure: FETAL_WGT_ (1000 to 5000) plotted against GWKS_DEL (30 to 40).]

Bad Models
• All models are wrong.
• Your data is sacred (after you remove the pregnant men) and you fit models to the data. You do not fit data to a model. That difference is not a minor semantic detail.

Poor Predictions
• Sometimes you have data points that are not well fit by the model. Go to extreme measures to document those points. If a point is not a true error, then run the analysis with it and without it. Include the point(s) in all your plots with a special symbol and, if one person changes your inferences, consider excluding them.
  – You may have different subgroups that you have not identified yet.

A True Outlier
[Figure: FETAL_WGT_ (1000 to 5000) plotted against GWKS_DEL (30 to 40), with one point annotated “Induced because of HUGE size”.]

Looking at Residuals
• A critical step in examining the quality of a model is graphically looking at the residuals.
• Residuals are the differences between the estimated values and the observed values for each person/critter/observation.
• Look for curves, changing variability across the range of values, or changes over time.

Patterns in Residuals
[Figure from Crawley, Statistical Computing.]

Partial Residuals
• When you have multiple predictors in a model, you can ask for residuals where you have “controlled for” or “removed the effect of” the other factors.
• Evaluate software packages on their ability to produce these graphics. They are completely missing from Excel, they are not built into SAS (but can be done with minor work), and they are trivially easy with R or S-Plus.

Curve Fitting
• Linear models can model curves
  – The math is not too bad….
• You can use explicit mathematical formulas.
If you see curves in your residuals, you can use things like:
  – Polynomials or inverse polynomials
  – Exponentials
  – Power functions

Keywords in Parametric Modeling and What They Look Like
[Figure from Crawley, Statistical Computing: example curve shapes, including an S-shaped curve.]

Nonlinear Regression
• Often the formulas to describe your data are extraordinarily complicated and you want to use non-linear or non-parametric modeling instead.
• Key words you will see include:
  – Non-parametric smoothing
    • Lowess regression
    • Spline regression
  – GAM (generalized additive models)
  – Tree models

A Bad Fit
• What happens when you fit a straight linear model to curvilinear data?
[Figure: two panels against X = Age (0 to 50). Left: Y = Size with a fitted straight line, annotated “Is this better than a flat line at the mean?” Right: Y = Size residual.]

Is it good?
• A tiny p-value does not mean a good model!
• Where on the output does it tell you that this is a good or a poor model?

Residuals?
[Figure: Y = Size against X = Age (0 to 50). Caption: Flatten the line, then look up and down to see if you are systematically off.]

Curve Fitting!
• You can build a model that has a curve using a polynomial… the degree of the polynomial determines how many “bends” appear in a curve. So a 2nd degree polynomial would use x and x^2 while a 3rd degree polynomial would use x, x^2 and x^3. These squared or cubed values don’t do anything especially complicated. They are just like adding new variables.

Polynomials
[Figure: two panels of Y = Size against X = Age (0 to 50), one with a fitted 2nd degree curve and one with a fitted 3rd degree curve.]
  size = intercept + X * something + X^2 * something else
  poly2 = lm(y ~ poly(x, 2))
  size = intercept + X * something + X^2 * something else + X^3 * another thing
  poly3 = lm(y ~ poly(x, 3))

What is a good fit?
• Choosing where to stop adding terms to a model is as much an art as a science. You can compare two models and ask whether the difference between them is statistically significant.
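The model comparison just described can be sketched in R, building on the `lm()` and `poly()` calls from the Polynomials slide. This is only an illustration under stated assumptions: the data are simulated, and the names `age`, `size`, `line` and `comparison`, the seed, and the coefficients are all made up for the example rather than taken from the class data.

```r
# Simulated, illustrative data (an assumption, not the class dataset):
# size rises with age and then levels off, so a straight line should fit poorly.
set.seed(259)
age  <- seq(0, 50, length.out = 60)
size <- 20 + 5 * age - 0.06 * age^2 + rnorm(length(age), sd = 8)

line  <- lm(size ~ age)            # straight line
poly2 <- lm(size ~ poly(age, 2))   # adds a squared term
poly3 <- lm(size ~ poly(age, 3))   # adds a cubed term

# F-tests for nested models: does each added term reduce the leftover
# (residual) variation by more than chance alone would?
comparison <- anova(line, poly2, poly3)
print(comparison)

# AIC penalizes each extra term; smaller values are better.
print(AIC(line, poly2, poly3))
```

A small p-value on the second row of the `anova()` table suggests the squared term earns its place. It says nothing about whether the residuals still show a pattern, so the residual plots are still worth drawing.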
• There are systems, like AIC, for penalizing your model as you add more and more factors to it.

Generalized Linear Models
• You will eventually move out of the realm of predicting continuous outcomes with normal error. When you do, you will move into the realm of Generalized Linear Models (GLM).
• You want to have a linear model predicting an outcome where you restrict the possible outcome values (e.g., only allow values between 0 and 1) and deal with errors not being consistently normal across the entire range.
• You can change (transform) your outcome and model this with just another linear model similar to what I have shown.

GLM in English
• If you are predicting the number of bacteria you see in a Petri dish, you cannot possibly see a negative number of bacteria. A GLM can be written so that your predicted values cannot be negative.
• Contrast this with the baby weight example where, with a bit of bad data for your predictor value, you could have the formula spit out a negative weight or a baby weighing a ton.

GLM
• Instead of modeling like this:
  Outcome = baseline + predictor + predictor + error (normal/bell-shaped)
• You can model with GLM like this:
  Tweaked outcome = baseline + predictor + predictor + error (not normal)

Ordinary Regression
• So, the ordinary least squares regression models that I have shown are really just a special case of GLM. In these cases I specify that the tweak to the outcome leaves the outcome identical to what it was originally and that the error is normal.
• The tweak to the outcome is called the link, and in this case the link is called identity.

Link Functions
• The tweaks to the outcome are called links:
• Identity link = predicting a continuous outcome (baby weight)
• Log link = if you can’t have negative values
• Logit link = if you have to restrict the range to between 0 and 1
• There are other links.

Error Structure
• Why bother to specify an error structure other than normal?
  – Strongly skewed errors, kurtosis, bounded errors, counts that cannot be negative
• The shape of the error distribution is not a bell-shaped curve. Rather than worrying about the math to describe those curves, you simply need to know that different types of data have different error structures.
  – Normal errors: continuous outcomes
  – Poisson errors: counts
  – Binomial errors: proportions
  – Gamma errors: data with a constant coefficient of variation

Binary Response
• If you are not dealing with a continuous outcome or count data, you will likely have a binary (yes/no scored as 1 or 0) outcome.
• Clearly you need to do some major tweaking to the outcome because linear models, as we have seen, can predict very large and very small numbers.
• Also, the variability of a binary outcome is very different from that of a continuous variable.

Logistic Regression
• The solution is to specify a link that limits values to be between 0 and 1 (think of the changed outcome as being the probability of being scored 1) and use an error term that behaves well with binary outcomes.
• This is a GLM with a logit link and binomial errors.
• This kind of analysis is so popular most people don’t know it is a GLM. Rather, they know it only as logistic regression.

Before Next Time
• The first assignment is on the class website and is due before the start of class Wednesday.
• Print the slides as close as possible to class time.
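For anyone who wants to try the logistic regression idea from these slides before class, here is a minimal sketch in R. It is an illustration under stated assumptions, not the course’s analysis: the data are simulated, and the names `x`, `y`, `logistic`, `ols` and `extreme`, the seed, and the coefficients are all invented for the example. The calls `glm()`, `binomial()`, `lm()` and `predict()` are standard R.

```r
# Simulated, illustrative data (an assumption, not the class dataset):
# the chance of a "yes" rises with the predictor x.
set.seed(259)
x <- runif(200, min = 0, max = 10)
p <- plogis(-3 + 0.6 * x)             # true probability, always between 0 and 1
y <- rbinom(200, size = 1, prob = p)  # binary outcome scored 1 or 0

# Logistic regression: a GLM with a logit link and binomial errors.
logistic <- glm(y ~ x, family = binomial(link = "logit"))

# Ordinary least squares on the same 0/1 outcome, for contrast.
ols <- lm(y ~ x)

# Predict well outside the observed range of x.
extreme <- data.frame(x = c(-20, 30))
p_logistic <- predict(logistic, newdata = extreme, type = "response")
p_ols      <- predict(ols, newdata = extreme)

print(p_logistic)  # predicted probabilities stay within 0 and 1
print(p_ols)       # a straight line is free to leave the 0-to-1 range
```

Asking for `type = "response"` returns predictions on the probability scale. Without it, `predict()` reports values on the scale of the link (the log-odds), which is the “tweaked outcome” in the GLM slides.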