Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 2: Types of Variable, measures of central tendency, and scatter
Marshall University Genomics Core Facility

Types of Variable
• Understanding the type of variable with which you are working is important.
  – Type of variable determines which arithmetic operations make sense
  – Helps determine which tests are appropriate for hypothesis testing

Determining the type of variable
• To determine the type of variable we are using, we ask the following questions:
  – Is there an ordering for values of the variable?
  – If there is an ordering, is there a scale?
    • i.e. Does an increase in one unit always mean the same thing?
  – If there is an ordering and a scale, does the value zero have a specific meaning?
• Additionally, we ask if the variable is continuous or discrete.
  – Continuous means there’s always a value lying strictly between any two distinct values
    • So it must be able to take on fractional values
  – Discrete means it takes on only specific, disjoint, values.

Nominal variables
• Nominal variables are those whose values have no ordering.
  – Just qualitative categories.
  – Cannot be continuous.
  – Examples:
    • Gender
      – Values are "Male", "Female"
    • Race
      – Values are "Black", "White", "Asian", "Native American", etc…

Ordinal values
• Ordinal variables are variables with qualitative categories which have an ordering, but no scale.
  – Example: Economic status
    • Values are typically stated as "Low", "Medium", or "High", which are computed using a number of factors (income, education level, occupation, wealth).
    • These are ordered because there is a natural ordering low → medium → high.
    • They have no scale because the difference between low and medium is not necessarily the same as the difference between medium and high.
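
The three questions from the "Determining the type of variable" slide form a small decision procedure, which can be sketched in Python. This is only an illustration: the function name `classify_variable` and its parameter names are hypothetical, and the interval and ratio types it returns are the ones the lecture goes on to discuss. The example variables are taken from the slides.

```python
def classify_variable(has_ordering: bool,
                      has_scale: bool = False,
                      has_meaningful_zero: bool = False) -> str:
    """Map the answers to the three questions onto a variable type.

    Hypothetical helper for illustration; each question only matters
    if the previous one was answered "yes".
    """
    if not has_ordering:
        return "Nominal"
    if not has_scale:
        return "Ordinal"
    if not has_meaningful_zero:
        return "Interval"
    return "Ratio"


# Examples taken from the lecture slides.
examples = {
    "Gender": classify_variable(False),
    "Economic status": classify_variable(True, False),
    "Temperature in Celsius": classify_variable(True, True, False),
    "Blood pressure": classify_variable(True, True, True),
}

for name, kind in examples.items():
    print(f"{name}: {kind}")
```

Whether a variable is continuous or discrete is a separate question and is not captured by this sketch.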

Interval Variables
• Interval variables are variables with ordering and scale, but with no meaningful zero
• Examples: Temperature in Celsius or Fahrenheit
  – There is a scale, because a difference of one degree means the same thing, no matter what the starting temperature is.
  – However, the choice of a zero value is essentially arbitrary.

Operations on interval variables
• Computing differences of values of interval variables makes sense.
  – For example, computing a change in temperature (difference between two temperatures) makes sense, since a change of one unit (one degree) makes sense.
• Computing ratios of values of interval variables does not make sense, because there is no meaningful zero value.
  – Ratios of values are dimensionless
    • Have no units
    • Should be the same no matter what units we start in.
  – 100°C is not double 50°C
    • These values are equal to 212°F and 122°F respectively, and 212 is not double 122.

Operations on Ratio Variables
• Ratio variables have an ordering, a scale, and a meaningful zero.
• It makes sense to compute differences and ratios of ratio variables.
  – A blood pressure of 120 is double the blood pressure of 60.
• Note that the difference of values of an interval variable is always a ratio variable
  – For example, elapsed time (essentially the difference between two dates) is a ratio variable

Examples
• For each of the following, determine the type of the variable (Nominal, Ordinal, Interval, Ratio). Also determine whether it is continuous or discrete.

Variable                                 Type (N/O/I/R)   Continuous/Discrete
Tumor grade
Heart rate
# Heart attacks in a patient’s lifetime
Color
Weight (mass)
Disease status
Pain scale
Age
Genotype
CT values from RT-qPCR

Ambiguity in variable types
• Determining the type of variable can depend on context, and/or on the measurement techniques used.
  – In a psychological experiment, patients are exposed to flash cards of various colors and activity in specific parts of the brain is measured.
    • Color here is (most likely) a nominal variable.
  – In a cosmological experiment, the colors of stars are observed and used (along with other data) to determine their relative speeds.
    • Color here is measured by wavelength of light, and is a ratio variable.
• Is Age a continuous or discrete variable?
  – Age is really a continuous (ratio) variable: it's the amount of time elapsed since birth. However, it is often collected as a discrete variable, by rounding down to a whole number of years. The imprecision in this rounding is usually insignificant, since effects of age tend to be more noisy than this loss of precision anyway. However, it is usually better to collect data on a subject's date of birth and subsequent dates of important events in the study: this way ages can be calculated to the number of days if required.
  – In statistical analysis, it is usually fine to treat age as a continuous variable, even when the measurement is rounded to a whole number of years. All continuous data is measured to a finite degree of precision, and the loss of precision becomes part of the noise. This is no different with age.

Summarizing Data
• The next sections of the course will focus on continuous data.
  – Or data that may be treated as continuous
• Often, experiments will collect more data than can reasonably be presented in a poster, presentation, or manuscript.
  – If this is not the case, then present all the data!
• Typically, we collect data points in the range of dozens upwards (to trillions, in the case of sequencing experiments)
  – Data must be summarized for presentation and interpretation.

Aims of Summarizing Data
• Summarized data may be presented textually (in a table) or graphically
• A good summary should:
  – Demonstrate what a "typical" value looks like.
  – Demonstrate the extent to which values deviate from the "typical" value.
  – Provide as much detail as is realistically possible.
  – Clearly state how the summary was made.

Measures of Central Tendency
• "Typical" values in a data set are identified by a measure of central tendency
  – Choosing the right measure is important
    • Mean
    • Median
    • Mode
  – All these are kinds of "Average"

Mean
• The mean is the measure of central tendency most commonly understood by the word "average".
  – Sum of all the values divided by the number of values.
  – Since values are summed, mean only makes sense for interval and ratio data.
  – The mean can be dramatically affected by extreme outliers.

Median
• The median is the "middle" value.
• Computed by ordering all values and taking the middle one.
  – Mean of the middle two if there are an even number of values.
• Not affected by a small number of outliers, no matter how extreme.
• A good measure for ordinal data.

Mode
• The mode is the most common value.
  – The French word mode means fashion.
  – Value that occurs most often.
• Makes no sense for continuous data
  – If measured with enough precision, no value could occur more than once.
• The best measure of central tendency for nominal data
• Does not always measure the "center" of the data

Averages do not tell the story
• Merely stating an average can be extremely misleading.
  – The average human being has one breast and one testicle.
• Example (simulated). Two patients have blood pressure measured every two hours from 6 a.m. to 10 p.m.

Patient                        A       B
Mean systolic blood pressure   115.3   119.6

• Both patients appear healthy…

Example
• However, examine all the data:

Time      Patient A systolic b.p.   Patient B systolic b.p.
6 a.m.    144                       115
8 a.m.    108                       130
10 a.m.   92                        121
Noon      122                       118
2 p.m.    67                        122
4 p.m.    142                       120
6 p.m.    131                       113
8 p.m.    99                        122
10 p.m.   133                       115

• Patient A no longer appears healthy…

Measures of Variability
• Range
  – Just the minimum and maximum values in the data
• Interquartile range
  – The range of the "middle half" of the data
• Variance and/or standard deviation
  – A measure of the average deviation from the mean
• Coefficient of variation
  – The standard deviation relative to the mean.

Range
• Range is the simplest measure of variability.
  – Just the minimum and maximum values.
  – For our simulated blood pressure data, already gives a good clue as to what is happening.

Systolic blood pressure   Patient A   Patient B
Mean                      115.3       119.6
Range                     67-144      113-130

• Very susceptible to outliers
  – One bad reading can completely change the range

Interquartile Range
• Similar philosophy to the median
  – Order the values in the data set
  – Find the 25th percentile and the 75th percentile
    • The values ¼ and ¾ the way along the ordering
  – The difference is the interquartile range
• The interquartile ranges for the patients in our blood pressure example are 34 and 7 (Patients A and B respectively)
  – Verify this!

Standard Deviation
• Standard Deviation is the most commonly used measure of variability
• Intuitively, it measures the average difference between each data point and the mean.
  – Gives a sense of the average spread of the data

Computing the Standard Deviation
• The formula for the standard deviation is given by

$$\mathrm{SD} = \sqrt{\frac{\sum_{i} (Y_i - \bar{Y})^2}{n - 1}}$$

• $Y_i$ represents each data point
• $\bar{Y}$ is the mean
• $n$ is the number of data points.
• Motulsky (p 73) has a good discussion of why n-1 is used instead of n.
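
The range, interquartile range, and standard deviation described above can be checked against the simulated blood pressure data using Python's standard `statistics` module. The following is a minimal sketch, not the method used to prepare the slides: `summarize` is a hypothetical helper name, and the "inclusive" quartile convention is an assumption, chosen because it reproduces the stated interquartile ranges of 34 and 7.

```python
import statistics

# Simulated systolic blood pressure readings, 6 a.m. to 10 p.m.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def summarize(values):
    """Return the mean, range, interquartile range and standard deviation."""
    # Assumption: the "inclusive" convention treats the data as covering
    # the full range, so the quartiles fall on observed values here.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "range": (min(values), max(values)),
        "iqr": q3 - q1,
        # statistics.stdev divides by n - 1, matching the SD formula above.
        "sd": statistics.stdev(values),
    }


for label, readings in (("A", patient_a), ("B", patient_b)):
    s = summarize(readings)
    print(f"Patient {label}: mean={s['mean']:.1f}, range={s['range']}, "
          f"IQR={s['iqr']}, SD={s['sd']:.1f}")
```

Quartile conventions differ between software packages. With only nine readings, the module's default "exclusive" method gives different quartiles, so it is worth stating which method was used when reporting an interquartile range.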

Variance
• Variance is just the square of the standard deviation
  – Useful quantity for performing some statistical tests we’ll see later
  – Interpretation less intuitive than standard deviation
    • Units of standard deviation are the same as the units of the measurement
    • Units of variance are the square of the units of the measurement

Coefficient of Variation
• The coefficient of variation (CV) is simply the standard deviation divided by the mean
  – Only makes sense for ratio variables (why?)
• CV has no units
  – Often presented as a percentage
• Occasionally useful for comparing scatter in variables in unrelated units

Graphing Data
• We’ll look at four ways of graphing our blood pressure data:
  – Column Scatter Plot
  – Box and Whisker Plot
  – Column or Bar Chart
  – Line Chart
• In all these, it’s important to show both a measure of central tendency (average) and a measure of variability

Column Scatter Plot
• A column scatter plot plots all the data as individual points in a column
  – Rarely used
  – But very useful, for up to 100 data points
  – Not much software support
    • GraphPad Prism, for which Marshall SOM has a license, can do this

Column Scatter Plot Example
[Figure: column scatter plot of the blood pressure data]

Box and Whisker Plot
• A box and whisker plot shows the range, interquartile range, and median of the data set
  – A good choice when the median and interquartile range are good measures of central tendency and variation for your data
  – The median is marked with a horizontal line
  – The interquartile range is marked with a box
  – "Whiskers" extend to the full range of the data
    • A variation is for the whiskers to extend to most of the range, and outliers to be marked individually as points

Box and Whisker Plot Example
[Figure: box and whisker plot of the blood pressure data]

Bar Chart
• Bar charts use horizontal or vertical bars to demonstrate the mean of the data set
• "Error bars" are used to show a measure of variability
• Some important considerations for bar charts:
  – It is natural to look at the relative size of the bars in order to compare the relative values of the means.
  – Therefore, bar charts should only be used with ratio data and should have the base of the bar at zero
  – There are various ways the error bars can be drawn (we will see later), so always clearly state what the error bars represent

Bar chart example
[Figure: bar chart of the blood pressure data]

Line Chart
• A line chart is useful if the data points are ordered, and the ordering is important
  – For example, if we want to track the data over time
• Like a column scatter plot, a line chart plots all the data

Line Chart Example
[Figure: line chart of the blood pressure data]
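
Before drawing a box and whisker plot or a bar chart, it can help to compute the numbers each chart displays. The sketch below does this for the simulated blood pressure data using only the Python standard library. The function names are hypothetical, the "inclusive" quartile convention is an assumption, and the standard deviation is only one of the possible choices for an error bar.

```python
import statistics

# Simulated systolic blood pressure readings, 6 a.m. to 10 p.m.
patient_a = [144, 108, 92, 122, 67, 142, 131, 99, 133]
patient_b = [115, 130, 121, 118, 122, 120, 113, 122, 115]


def box_plot_values(values):
    """Values drawn by a box and whisker plot whose whiskers span the full range."""
    # Assumption: "inclusive" quartiles; other conventions shift the box edges.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def bar_chart_values(values):
    """Bar height (the mean) and one possible error bar (the standard deviation)."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


for label, readings in (("A", patient_a), ("B", patient_b)):
    print(f"Patient {label} box plot:", box_plot_values(readings))
    bar = bar_chart_values(readings)
    print(f"Patient {label} bar chart: mean={bar['mean']:.1f}, "
          f"error bar (SD)={bar['sd']:.1f}")
```

For Patient A the box runs from 99 to 133 with the median at 122, while the whiskers reach 67 and 144. The much wider box and whiskers show variability that the two similar means hide.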