Introduction: Why statistics?
Petter Mostad
2005.08.29

Statistics is…
• …a way to summarize and describe information: not very interesting in itself
• …an important tool for research in my field, and something I look forward to learning more about
• …an important tool for research in my field, but I only learn what I must learn about this
• …boring
What best describes your attitude towards statistics?

How much do you already know?
• Definition of mean value, median, standard deviation?
• Bayes' formula?
• t-tests?
• p-values?
• Computing the probability of getting dealt a flush in a game of poker?

Why a course in statistics?

What is research?
• A distinguishing feature of scientific research is that its conclusions are reproducible by other scientists
• Thus, research must
  – contain information about exactly what has been done
  – somehow convince the reader that if she repeats what has been done, she will reach the same conclusions

A goal of science: To study causality
• Ultimately, much of science is concerned with establishing statements like “If A happens, then B will follow”
• In other words, one wants to show that B is reproduced every time A happens.

Example: Studying causality through intervention
• Retrospective studies can show covariation between variables, but not causality.
• Intervention can be used to argue that changing a certain variable causes another variable to change.
• To study the effect of an intervention, a control group is needed

Example: Reproducibility through randomization
Assume an experiment is done with two groups receiving different “treatments”:
• Differences in the result could be caused by differences in the treatments, or by differences between the groups from the start.
• Randomizing the division into groups makes it unlikely that the groups are systematically different from the start

Example: blind, or double-blind studies
• Differences between the two groups could be caused by people’s knowledge that they are in one group or the other.
• Differences could also be caused by the experimenters’ (doctors’) knowledge of who is in which group.
• Removing the first kind of knowledge gives a blind study; removing the second as well gives a double-blind study.

Quantitative and qualitative research
• Quantitative: Focus on things that can be measured or counted
• Qualitative: Focus on descriptions and examples.
• Two different scientific traditions. Health economics and administration has elements from both.
• Both have advantages and disadvantages (which?)

Quantitative research
• For quantitative research, we have many good tools to ensure reproducibility of conclusions
• Statistics is one very important such tool
• Statistics used in this way can be called inferential statistics

Example: Reproducibility through statistics
• If you repeat a quantitative investigation (a questionnaire, an observation of a social phenomenon, a measurement) you are unlikely to get exactly the same numbers.
• Statistics can help you to estimate how different the results are likely to be.
• This can tell you which conclusions are likely to be reproducible in a potential repetition of the investigation.

Descriptive vs. inferential statistics
• Descriptive statistics: To sum up, present, and visualize data.
• Inferential statistics: A tool to handle uncertain information and to draw (“infer”) reproducible conclusions from it.

Descriptive statistics
• Goal: To reduce the amount of data, while extracting the “most important information”
• Can be done with single numbers (“summary statistics”), tables, or graphical figures.
• My next lecture will look at descriptive statistics

Can descriptive statistics be “objective”?
• A person makes choices about:
  – What to measure
  – How to measure (for example what questions to ask or what scale to use)
  – How to present the result
• Thus: A presentation or publication should always contain information about exactly how results have been obtained

Inferential statistics: Hypothesis test example
• You throw a die ten times, and get a 1 in seven of these ten throws. You conclude that this is not a fair die. Is the conclusion reproducible?
• You need to compute what observations are to be expected if the die is a fair one.

Example: probability calculations
• The disease X has a 1% prevalence in the population. There is a test for X, and
  – If you are sick, the test is positive in 90% of cases.
  – If you are not sick, the test is positive in 10% of cases.
• You have a positive test: What is the probability that you are sick?

Example: decisions based on uncertain information
• An oil company wants to produce the maximum amount from an oil field.
• Available information:
  – Measurements (seismics) describing approximately the geometry of the rock layers
  – Information from a couple of test drills
  – Information from geologists
• Where should they place the wells, and how should they produce?

The concept of a MODEL
• What separates inferential statistics from descriptive statistics is the use of a model.
• A model is a (mathematical) description of the connections between the variables you are interested in.
• It is a simplification of reality, and so never “correct” or “wrong”, but it can be more or less useful.

Statistical (or stochastic) models
• In statistical models, the variables are predicted with some variation or uncertainty:
  – The model for force moving a mass, F=ma, is exact.
  – The model for what a fair die will show contains probabilities
• We can use the observed data to choose between possible models.
• The word “stochastic” is often used when we are focusing more on the model than on the data.
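
The die example and the disease-test example above can both be worked through numerically. The following is a minimal sketch and not part of the original lecture: it reads the die example as "the face 1 shows in seven of ten throws", treats the throws as independent, and uses function names chosen here purely for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def posterior_sick(prevalence: float, sensitivity: float,
                   false_positive_rate: float) -> float:
    """Bayes' formula: P(sick | positive test)."""
    # Total probability of a positive test, over sick and not-sick people.
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_positive


# Die example: chance that a fair die shows a 1 at least 7 times in 10 throws.
p_seven_or_more = binomial_tail(n=10, k=7, p=1 / 6)

# Disease X: 1% prevalence, 90% positive if sick, 10% positive if not sick.
p_sick = posterior_sick(prevalence=0.01, sensitivity=0.90, false_positive_rate=0.10)

print(f"P(at least seven 1s in ten throws of a fair die) = {p_seven_or_more:.6f}")
print(f"P(sick | positive test) = {p_sick:.4f}")
```

Under the fair-die model the observed result has a probability of roughly 0.0003, so the doubt about fairness is a conclusion one would expect to see reproduced. For disease X, a positive test still leaves only about an 8% probability of being sick, because the disease is rare and most positive tests come from healthy people.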
Example
• Assume a certain proportion of the population carries a specific gene, and you want to know how large that proportion is
• The model is simply the unknown proportion p
• You select and measure a number of individuals, and use the information to select the right model, i.e., the right p

Example
• You want to know the height distribution among 30-year-old Norwegian women.
• You assume, using experience, that a good model is a normal distribution with some expectation and some variance
• You use data from a number of women to select a model (i.e. an expectation and variance), or a range of likely such expectations and variances

Sampling
• Often, the model can be a simplifying description of the population we want to study.
• We investigate the model by sampling from the population.
• When each individual is selected independently and randomly from the population, we call it (simple) random sampling
• Simple random sampling makes it easier to compute what we can conclude about the model from the data

Using the results
• Selecting some models over others means that you increase your understanding of each variable, and of the relationships between variables
• Once a model has been selected, it can be used to forecast or predict the future
• Being able to predict the likely results of different decisions can be used to improve decision making

The goals of this course
• To enable you to understand, use, and criticise research results produced by others, and in particular to understand and view critically the statistical arguments
• To enable you to produce your own valid research results, using statistical tools.

Overview of statistics topics we will look at
• Descriptive statistics
• Probability theory
• Sampling and estimation
• Regression
• Non-parametrics
• Analysis of variance
• Decision theory
• Some more advanced topics
• Much information is and will be available at course web page
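
The gene-proportion example from earlier in this lecture, combined with simple random sampling, can be illustrated by a small simulation. This is only a sketch under stated assumptions: the true proportion (0.3) and the sample size (500) are hypothetical values chosen here, not numbers from the lecture, and the range of likely values is the usual normal-approximation 95% interval for a proportion.

```python
import random
from math import sqrt


def estimate_proportion(sample: list[bool]) -> tuple[float, float, float]:
    """Return the sample proportion and an approximate 95% interval
    (normal approximation), clipped to the range [0, 1]."""
    n = len(sample)
    p_hat = sum(sample) / n
    half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


rng = random.Random(2005)  # fixed seed so the run is repeatable
TRUE_P = 0.3               # hypothetical; in practice this is the unknown p
SAMPLE_SIZE = 500          # hypothetical

# Simple random sampling: each individual is drawn independently and
# carries the gene with probability TRUE_P.
sample = [rng.random() < TRUE_P for _ in range(SAMPLE_SIZE)]

p_hat, low, high = estimate_proportion(sample)
print(f"estimate of p: {p_hat:.3f}, range of likely values: [{low:.3f}, {high:.3f}]")
```

Changing the seed gives slightly different estimates, which is the point of the reproducibility discussion: the interval indicates how much the estimate is likely to move if the investigation is repeated.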