MD 5108 Biostatistics for Basic Research

Lecturer: Dr K. Mukherjee
Office: S16-06-100
Tel: 874 2764
Email: stamk@nus.edu.sg

Objectives
To train practitioners of the biomedical sciences in the use and interpretation of statistical data analysis.
• explore and present data using tables, charts and graphs
• do simple statistical calculations with a calculator
• carry out data analysis using a statistical package such as SPSS
• pick the right procedure for analysing a set of data
• interpret results correctly and report findings
• avoid misuse and abuse of statistics
• understand statistical contents of papers in medical journals
• judge claims and statements critically
• discuss and communicate ideas in a quantitative manner

Teaching approach
• nonmathematical introduction
• explanation of concepts rather than proofs
• emphasis on methodology and procedures
• emphasis on use of a statistical package rather than manual calculation
• emphasis on choosing the right procedure
• emphasis on correct interpretation of results
• examples from clinical research literature

Topic 1: What is statistics?
“A branch of mathematics dealing with the analysis and interpretation of masses of numerical data”
Merriam-Webster Dictionary
“The field of study that involves the collection and analysis of numerical facts or data of any kind”
Oxford Dictionary
“The study of how information should be employed to reflect on, and give guidance for action, in a practical situation involving uncertainty”
Vic Barnett

Biostatistics: Application of statistical methods to the biological, medical and health sciences

Why the need for Statistics in Biomedicine?
Two main reasons:
• Variation – attributes differ not only among individuals but also within the same individual over time
• Sampling – biomedical research projects are mostly carried out on small numbers of study subjects, so it is a challenging problem to project results from small-sample studies to individuals at large

Biological Variation
Necessitates the use of statistical methods in biomedicine to put numerical data into a context by which we can better judge their meaning

From sample to population
Statistical methods are used to produce statistical inferences about a population based on information from a sample derived from that population
[Diagram: population and sample, linked by inductive statistical methods]

Altman (1991) Practical Statistics for Medical Research, Chapman and Hall.
Bailar & Mosteller (1986) Medical Uses of Statistics, NEJM Books.

Many studies have been done on misuse of statistics in medicine
From Altman (1991)
Schor and Karten (1966, J. Am. Med. Assoc.):
• 149 papers classed as “analytical studies” in 3 issues of 11 most frequently read medical journals
• assessment criteria: validity with respect to:
    • Design of experiment?
    • Type of analysis performed?
    • Applicability of statistical test used?
Findings of Schor and Karten:
• 28% of papers acceptable
• 68% deficient but acceptable if reviewed
• 4% unsalvageable
Lesson: CARE must be exercised when reading scientific papers in biomedical journals! Knowledge of basic biostatistics is required

“There are three kinds of lies: lies, damned lies and statistics”
Benjamin Disraeli
“It is easy to lie with statistics, but it is easier to lie without them”
Frederick Mosteller
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
H.G. Wells

Types of statistical methods
1. Descriptive statistical methods
• data collection and organization
• summarizing data and describing its characteristics
• presentation and publication
2. Exploratory data analysis
• play around and get a feel of the data
• preliminary analysis, often graphical
• looking for patterns and possible relationships
• are assumptions satisfied?
• which model and procedure to use?
3. Inductive (inferential) statistical methods
Statistical inferences about a population based on information from a sample derived from that population
• estimation, confidence intervals
• hypothesis testing
• prediction, forecasting
• classification

Topic 2: Types of data
Sources of data, the raw materials of statistics
• Routinely kept records, e.g., hospital medical records
• Surveys
• Experiments
• Clinical trials
• Databases
• Published reports

Any characteristic that can be measured or classified into categories is called a variable

Types of variables
(1) Qualitative variables
• cannot be measured numerically
• categorical in nature, e.g., gender
• categories must not overlap and must cover all possibilities
• Nominal variables (no inherent ordering of categories)
    • M/F, Yes/No
    • Blood group (A, B, AB, O)
    • Ethnic group (Chinese, Malay, Indian, Others)
• Ordinal variables (categories are ordered in some sense)
    • response to treatment: unimproved, improved, much improved
    • pain severity: no pain, slight pain, moderate pain, severe pain
(2) Quantitative variables
• can be measured numerically, e.g., weight, height, concentration
• can be continuous or discrete
• a continuous variable can take on any value (subject to precision of measuring instrument) within some range or interval, e.g., weight, height, blood pressure, cholesterol level
• a discrete variable is usually a count of something and hence takes on integer values only, e.g., number of admissions to NUH

Variable types and measurement types
• have implications for how data should be displayed or summarized
• determine the kind of statistical procedures that should be used

SUMMARY: Types of variables and measurement scales
Variable
    Qualitative or categorical
        Nominal (not ordered), e.g. ethnic group
        Ordinal (ordered), e.g. response to treatment
    Quantitative or measurement
        Discrete (count data), e.g. number of admissions
        Continuous (real-valued), e.g. height

Topic 3: Presenting data graphically
Advantages of graphical data display
• Let data speak for itself
• Get a good feel of the data before formal analysis
• Graphs and plots are easier to understand and interpret
• Reveal patterns in data which may shed light on the appropriate model/analysis to use, e.g.,
    • Skewed or symmetric distribution
    • Multiple peaks / modes
    • Are there any outliers?
    • Relationship between variables

Graphs for categorical data
[Figure: Bar chart for world pharmaceutical spending, 1997. y-axis: % of world spending; x-axis: Region]
[Figure: Pie chart for world pharmaceutical spending, 1997]
[Figure: Segmented bar chart for world pharmaceutical spending, 1997. y-axis: % of world spending]
Share of world spending shown in the pie chart (%): Africa 1, Australasia 1, Canada 2, Europe 29, Japan 16, Latin America 8, Middle East 2, SE Asia & China 7, USA 34

Comparison of methods
• Bar charts can be read more accurately and offer better distinction between close-together values
• Pie charts are especially useful for showing percentage distribution
• Pie charts can display large and small % simultaneously without a scale break
• A single bar chart is preferable to a single segmented bar chart
• A series of segmented bar charts is easier to read than a series of pie charts or ordinary bar charts

[Figure: Bar chart for number of health professionals. y-axis: Number of workers; x-axis: Profession (Dentists, Doctors, Nurses, Pharmacists)]

Variation of the basic bar chart
[Figure: Stacked bar chart for number of health professionals (Private, Public). y-axis: Number of workers; x-axis: Profession]
[Figure: Clustered bar chart for number of health professionals (Private, Public). y-axis: Number of workers; x-axis: Profession]
[Figure: Segmented bar charts by profession (Private, Public). y-axis: Percent by sector; x-axis: Profession]

Plotting by sector rather than by profession
• Look at the data from a different angle
• Highlight different aspects of the data
[Figure: Clustered bar charts of number of health professionals (Dentists, Doctors, Nurses, Pharmacists). y-axis: Number of workers; x-axis: Sector (Private, Public)]
[Figure: Stacked bar charts by sector. y-axis: Number of workers; x-axis: Sector]
[Figure: Percentage bar charts by sector. y-axis: Percent within sector; x-axis: Sector]
[Figure: Segmented bar charts by sector. y-axis: Percent within sector; x-axis: Sector]

A back to back bar chart
[Figure: back to back bar chart. Source: JAMA, 1978, vol 239, no 21]

Comparison of methods
• A stacked bar chart is also a bar chart for the combined data
• Some of the bars in a stacked bar chart are not aligned
• Bars in clustered bar charts are aligned, but it is harder to visualize how the component bars would stack up
• Back to back bar charts are applicable when there are 2 groups only; the aggregated bars are not aligned
• Series of stacked or segmented bar charts are useful in showing time trend

Time Trend
Starting the vertical axis at 8 rather than 0 visually exaggerates the increase in the number of prescriptions written per person
[Figure: Stacked bar chart of yearly mortality rate per 1000 births]
Pagano & Gauvreau (1999) Principles of Biostatistics, Duxbury.
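The Time Trend remark about starting the vertical axis at 8 rather than 0 can be checked with a little arithmetic. The sketch below is only an illustration: the function name `apparent_ratio` and the prescription figures 9 and 11 are hypothetical, because the values plotted on the slide are not reproduced in the text. It compares the drawn heights of two bars for a given axis baseline.

```python
def apparent_ratio(first, last, baseline=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at `baseline`."""
    if first <= baseline:
        raise ValueError("first value must lie above the axis baseline")
    return (last - baseline) / (first - baseline)


# Hypothetical prescriptions per person in the first and last year
# (made-up numbers, not taken from the slide).
first_year, last_year = 9.0, 11.0

true_ratio = apparent_ratio(first_year, last_year, baseline=0.0)
truncated_ratio = apparent_ratio(first_year, last_year, baseline=8.0)

print(f"axis from 0: last bar is {true_ratio:.2f} times the first")
print(f"axis from 8: last bar is {truncated_ratio:.2f} times the first")
```

With these made-up numbers, an increase of about 22% is drawn as a tripling of bar height once the axis starts at 8.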
Response under two treatments

                 Response to treatment
Treatment    None    Partial    Complete    Total
A            3       15         9           27
B            2       22         30          54

A misleading bar chart
[Figure: clustered bar chart for treatments A and B. y-axis: Frequency; x-axis: Response to treatment (None, Partial, Complete)]
By design, there are twice as many patients receiving treatment B

Within treatment percentage
Can compare the response type percentages for the two treatments
[Figure: within treatment percentage by response to treatment (None, Partial, Complete). x-axis: Treatment (A, B)]

Within treatment percentage
Stacked bar charts for percentage figures
[Figure: stacked bar charts of response to treatment (None, Partial, Complete). x-axis: Treatment (A, B)]

Graphs for quantitative data
• Histogram
• Frequency polygon
• Box plot

Histogram
• Divide the range of the data into a suitably chosen number of intervals/bins, all of the same width
• The number of observations that fall within each interval is plotted
Relative frequency histogram
• Plot the proportions of observations that fall within the class intervals
Wild & Seber (2000) Chance Encounters, Wiley.
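The histogram and relative frequency histogram described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular package: the function names, the sample values, and the convention that each bin includes its left edge and excludes its right edge are all assumptions made for the example.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width bins beginning at `start`.

    Assumed convention: bin i covers [start + i*width, start + (i+1)*width).
    Values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def relative_frequencies(counts):
    """Convert bin counts to proportions of the total count."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Hypothetical measurements (made up for illustration).
sample = [45, 62, 68, 75, 81, 88, 93, 104, 118, 152]

counts = histogram_counts(sample, start=40, width=20, n_bins=6)
proportions = relative_frequencies(counts)

for i, (c, p) in enumerate(zip(counts, proportions)):
    low = 40 + 20 * i
    print(f"{low:>3} to {low + 20:<3}  count={c}  proportion={p:.2f}")
```

Plotting `counts` against the bins gives the histogram; plotting `proportions` gives the relative frequency histogram, which has the same shape but a vertical scale that does not depend on sample size.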
[Figure: Histogram of End-Systolic Volume for 45 Male Heart Attack Patients. y-axis: Frequency; x-axis: SysVol]
[Figure: Relative frequency polygon for SysVol. y-axis: Percent; x-axis: SysVol]

Comparison of methods
Histogram
• good at revealing distributional shape such as symmetry, skewness, number of peaks etc
• difficult to superimpose or draw side by side
Frequency polygons
• can be superimposed for easy comparison
[Figure: Wild & Seber (2000, p.59)]
Can be superimposed
[Figures: Pagano & Gauvreau (1999); Wild & Seber (2000)]

Median and quartiles
• Sort the data in increasing order
• The median is the middle value (if n is odd) or the average of the two middle values (if n is even); it is a measure of the “center” of the data
• Quartiles: dividing the set of ordered values into 4 equal parts
• Q2 = second quartile = median
• IQR = Interquartile range = Q3 − Q1
[Diagram: first 25% | Q1 | second 25% | Q2 | third 25% | Q3 | fourth 25%]

Box plot
• Draw a box from the lower quartile to the upper quartile and a line to mark the position of the median
• Extend lines from both edges of the box by 1.5 IQR, then pull back the lines until they hit an observation
• Observations more than 1.5 IQR away from the lower or upper quartile are marked out as outside values for further investigation and checking
[Figure: How a boxplot is constructed (Wild & Seber, 2000, p.73)]

5-Number Summary: min, lower quartile, median, upper quartile, max
SysVol = End-systolic volume, a measure of the size of the heart
[Figure: Dotplot for SysVol]
[Figure: Boxplot for SysVol]

Advantages of box plot
• quick visual summary of a data set
• capture prominent features like location, spread, skewness and outliers
• can easily draw a series of box plots side by side; not so for histograms

Dataset Hotdogs

Brand name                     Type     Taste     $/oz   $/lbProt  Cal   Sod   Prot/Fat
Happy Hill Supers              Beef     Bland     0.11   14.23     186   495   1
Georgies Skinless Beef         Beef     Bland     0.17   21.7      181   477   2
Special Market's Premium Beef  Beef     Bland     0.11   14.49     176   425   1
Spike's Beef                   Beef     Medium    0.15   20.49     149   322   1
Hungry Hugh's Jumbo Beef       Beef     Medium    0.1    14.47     184   482   1
Great Dinner Beef              Beef     Medium    0.11   15.45     190   587   1
RJB Kosher Beef                Beef     Medium    0.21   25.25     158   370   2
Wonder Kosher Skinless Beef    Beef     Medium    0.2    24.02     139   322   2
Happy Fats Jumbo Beef          Beef     Medium    0.14   18.86     175   479   1
Midwest Beef                   Beef     Medium    0.14   18.86     148   375   1
General Kosher Beef            Beef     Medium    0.23   30.65     152   330   1
Wall's Kosher Beef Lower Fat   Beef     Medium    0.25   25.62     111   300   3
Hickory Natural Smoke          Beef     Medium    0.07   8.12      141   386   2
Smith Beef                     Beef     Medium    0.09   12.74     153   401   1
Premium Beef                   Beef     Medium    0.1    14.21     190   645   1
Family Store Skinless Beef     Beef     Medium    0.1    13.39     157   440   1
Sam's Kosher Beef              Beef     Medium    0.19   22.31     131   317   2
Hammer Beef                    Beef     Medium    0.11   19.95     149   319   1
Athens Beef                    Beef     Medium    0.19   22.9      135   298   2
Regents Kosher Beef            Beef     Scrumpt.  0.17   19.78     132   253   2
Really Big                     Meat     Bland     0.12   14.86     173   458   2
Biggest Jumbo                  Meat     Bland     0.12   17.32     191   506   1
Home Made                      Meat     Bland     0.12   15.2      182   473   1
Martha's Jumbo Dinner          Meat     Bland     0.1    14.01     190   545   1
Hammer Premium                 Meat     Bland     0.11   13.92     172   496   2
Willie's Wieners               Meat     Bland     0.13   18.24     147   360   1
Premium Hot Dogs               Meat     Medium    0.1    14.12     146   387   1
Airport Wieners                Meat     Medium    0.09   11.83     139   386   2
Judy's Favorite Jumbos         Meat     Medium    0.11   15.41     175   507   1
Stick Lean Supreme Jumbo       Meat     Medium    0.15   17.4      136   393   3
Stick Jumbo                    Meat     Medium    0.13   17.32     179   405   1
Fat Jack Jumbo                 Meat     Medium    0.1    15.61     153   372   1
Thin Jack Veal                 Meat     Medium    0.18   20.4      107   144   3
Top Grade Hot Dogs             Meat     Medium    0.09   12.65     195   511   1
Blended w/Chicken&Beef         Meat     Scrumpt.  0.07   11.17     135   405   1
Heaven Made                    Meat     Scrumpt.  0.08   11.75     140   428   1
Baked and Smoked               Meat     Scrumpt.  0.06   9.49      138   339   1
Smart Person Chicken           Poultry  Bland     0.08   10.21     129   430   2
Woods Park Chicken             Poultry  Medium    0.05   6.37      132   375   2
Tony Turkey                    Poultry  Medium    0.07   8.42      102   396   3
Rose Garden Turkey             Poultry  Medium    0.08   9.37      106   383   3
Low Fat Turkey                 Poultry  Medium    0.08   9         94    387   4
Special Market's Turkey        Poultry  Medium    0.07   8.07      102   542   5
Caloryless Turkey              Poultry  Medium    0.09   9.39      90    359   5
Heaven Made Lower Fat          Poultry  Medium    0.06   6.59      99    357   4
McDowell's Jumbo Chicken       Poultry  Medium    0.07   8.43      107   528   2

Graphical Analysis of the “Hotdogs” data.
Parallel Box plots Can Be Quite Revealing
[Figure: parallel box plots over time, 1969 to 1972]
Rice (1995) Mathematical Statistics & Data Analysis, Duxbury Press.
• Reduction in concentration through time
• Higher during winter months
• Skewed toward higher values
• Spread increases with level
(Parallel histograms much harder to visualise)
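The median, quartile and 1.5 IQR rules from the box plot slides can be sketched as follows. The slides do not say which quartile convention is used, so this example assumes one common choice: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when n is odd. The function names and the sample values are hypothetical.

```python
def median(sorted_values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def boxplot_summary(values):
    """Five-number summary, whisker ends and outside values for a box plot.

    Assumed quartile convention: medians of the lower and upper halves,
    excluding the middle observation when n is odd.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(data[:half])
    q2 = median(data)
    q3 = median(data[n - half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outside = [x for x in data if x < low_fence or x > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        # Whiskers are pulled back to the most extreme observations inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outside": outside,
    }


# Hypothetical measurements with one unusually large value.
sample = [52, 61, 64, 70, 75, 78, 83, 90, 96, 160]
summary = boxplot_summary(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Calling `boxplot_summary` once per group and comparing the results side by side gives the same information that parallel box plots display graphically.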