Descrip Stat

Probability and Random Variables
The math, the computation, and examples.

Prof. Dr. Asad Ali
Department of Applied Mathematics and Statistics
Institute of Space Technology
Islamabad, Pakistan

Chapter 2: Descriptive Statistics: Presentation of Data

Descriptive Statistics

Researchers can measure many physical processes, such as pressure, strength, survival time, and amount. Often hundreds or thousands of measurements are made, and procedures have been developed to organize, summarize, and make sense of these measurements. These procedures, referred to as descriptive statistics, are used to condense and summarize numerical observations, to extract the initial (meaningful) information, and to make the data ready for further manipulation.

In the univariate case, descriptive statistics mainly covers the following tasks of data analysis:
- Presentation of data using
  - Tabulation methods (frequency distributions)
  - Graphical methods (diagrams and graphs)
- Measures of central tendency (averages and quantiles)
- Measures of dispersion (ranges, deviations, variations)

In the multivariate case, descriptive statistics also covers the analysis of the relationships (covariance, correlation, regression, etc.) between different variables.

Tabulation methods

Frequency distribution: The frequency (f) of a particular observation is the number of times that observation occurs in the data. A frequency distribution is a table that lists the observations along with their respective frequencies.

Frequency distribution with no grouping: For discrete data with a small range (or a small number of distinct values), the frequency table is constructed by arranging the distinct data values in ascending order of magnitude and listing each with its frequency.
Frequency distribution with grouping: When the range of values is very broad, or the data are continuous, the data are divided into non-overlapping groups or classes, and the number of observations falling in each class is recorded.

A frequency distribution condenses bulky data into a small table, which tells us about the pattern and shape of the distribution of values of the underlying variable or population.

A very simple example (without grouping)

Example 1. The marks awarded for an assignment set for a BE (MS&E) class of 20 students were as follows:

6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8

Present this information in a frequency table.

Solution: To construct a frequency table, proceed as follows:
- Draw a three-column table with the headings "Marks", "Tally", and "Frequency".
- Put all the distinct values, without repetition, in the first column in ascending (or descending) order.
- The first data value is 6, so put a tally bar against 6; the second is 7, so put a tally bar against 7. Continue until every value has a tally.
- When a value reaches its fifth tally, draw the fifth bar as a slash across the previous four so that the bars form a bundle of five (written here as ||||/).
- Count the bars for each data value; that count is the frequency.

Marks   Tally    Frequency
4       ||       2
5       ||       2
6       ||||     4
7       ||||/    5
8       ||||     4
9       ||       2
10      |        1

So we now have the data in a meaningful form. We can now answer questions such as:
- Where is the data concentration (peak) point?
- How do the frequencies decline away from the peak?
- Is this a normal distribution of marks, or is something wrong with the class performance?
- Do we need further investigation?

The "how to" of a frequency distribution with grouping.
When the data contain many values that are widely spread out, it is difficult to set up a frequency table with one row per data value, because the table would have too many rows. Before proceeding, we need a few terms and rules for constructing a frequency distribution with grouping (classes).

Class-limits: The numbers that describe a class or group. The two limits are called the lower class-limit and the upper class-limit. The class-limits (CL) should be inclusive and should not cause any overlap between adjacent classes, e.g. age in years can be classified as 10-14, 15-19, 20-24 or as 10.0-14.9, 15.0-19.9, 20.0-24.9, etc.

Class-boundaries: The class-boundaries (CB) are precise numbers that separate one class from its immediate neighbours. A CB is the midpoint between the upper limit of one class and the lower limit of the next class. For example, for the first two classes 10-14 and 15-19, the boundary between them is $(14 + 15)/2 = 14.5$. Thus, for 10-14, 15-19, 20-24, the CBs are 9.5-14.5, 14.5-19.5, 19.5-24.5, so CBs are one decimal place more precise than the class-limits. The upper class-boundary of one class coincides with the lower class-boundary of the next class, leaving no gap.

Class marks: Class marks are simply the midpoints of the classes. For example, the class mark of class 10-14 is $(10 + 14)/2 = 12$.

Class interval or class width: The class interval, traditionally denoted by "h", is the difference between the two class-boundaries of the same class, or equivalently the difference between the lower (or upper) limits of two consecutive classes. In the above case the class interval is 5. Ideally, all classes should have equal intervals. Unequal intervals are possible, but they make interpretation harder and should be avoided unless required.

Class frequency: The frequency of a particular class is the number of observations that fall within the limits of that class.
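The definitions above can be checked numerically. The following is a minimal sketch, assuming equally spaced inclusive class-limits; the helper names `class_boundaries` and `class_marks` are made up for this illustration and are not part of any library.

```python
# Hypothetical helpers illustrating class-boundaries, class marks and class width.
# Assumption: class-limits are inclusive and equally spaced, as in 10-14, 15-19, 20-24.

def class_boundaries(limits):
    """Return (lower, upper) class-boundaries for a list of (lower, upper) class-limits."""
    gap = limits[1][0] - limits[0][1]   # distance between adjacent classes, e.g. 15 - 14 = 1
    half = gap / 2                      # each boundary sits halfway across that gap
    return [(lo - half, hi + half) for lo, hi in limits]

def class_marks(limits):
    """Return the midpoint of each class."""
    return [(lo + hi) / 2 for lo, hi in limits]

limits = [(10, 14), (15, 19), (20, 24)]
boundaries = class_boundaries(limits)
marks = class_marks(limits)
width = boundaries[0][1] - boundaries[0][0]   # class interval h

print(boundaries)  # [(9.5, 14.5), (14.5, 19.5), (19.5, 24.5)]
print(marks)       # [12.0, 17.0, 22.0]
print(width)       # 5.0
```

Note that the width comes out as 5, not 4: subtracting the class-limits of one class (14 - 10) would understate the interval, which is why the boundaries are used.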
A typical frequency distribution with grouping looks like the following table.

Classes   Class-boundaries   Tally bars   Class-marks        Frequency
10-14     9.5-14.5           ···          (10+14)/2 = 12     ···
15-19     14.5-19.5          ···          17                 ···
20-24     19.5-24.5          ···          22                 ···
···       ···                ···          ···                ···
···       ···                ···          ···                ···

The columns of class-boundaries and class-marks help in the calculation of statistical quantities such as the mean, median, and quantiles, as we will see in the next chapter.

A few rules

- How many classes? There is no hard rule for deciding how many classes to make. Both too few and too many classes defeat the purpose of constructing the frequency distribution: too few classes lose a lot of information, and too many classes defeat the purpose of condensation. As a rule of thumb, a number between 5 and 15 gives reasonable results. (I think 15 is still too large; I would not take a number larger than 10 unless I am using a computer.)
- Find the range, that is, the difference between the maximum and the minimum values in the data.
- Calculate the class width/interval "h" by dividing the range of the data by the number of classes. If the division results in a decimal number, take the next higher whole number. Avoid fractional intervals; they only cause headaches. Taking a multiple of 5 or 10 eases the work and also improves the readability of the table. The resulting classes should cover the whole of the data. Note: you can also choose a convenient interval first and then calculate the number of classes, provided the whole data is covered in a reasonable number of classes.
- Where to start the first class from? Usually the lower class-limit of the first class is put at or below the smallest data value. Remember, the lower class-limit of the first class should never be larger than the smallest value of the data; otherwise the values at the lower end of the data will be lost.
Starting from a multiple of 5 or 10 would not hurt.
- Find the upper class-limit by counting from the lower class-limit to the end of the interval. Note that adding the interval directly to the lower class-limit is erroneous, because the classes are inclusive. Adding the interval to the lower class-limit of a class gives the lower class-limit of the next class, not the upper limit of the same class. (Most students forget this, so be careful.)
- Find the rest of the classes by adding the interval to the lower and upper class-limits of each class to get the lower and upper class-limits of the next class.
- Now the hard part: scanning the data (the mouse hunt) and putting the values in the appropriate classes, placing tally marks and counting frequencies.
- Determine the sum of the frequencies to check that all the values were included.

An example of frequency distribution with grouping

Example 2. Thirty energy saver light bulbs were tested to determine how long they usually last. The results, to the nearest day, were recorded as follows:

423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390

Construct a frequency distribution for these values.

Solution: First we need to find the range:

Range = Largest - Smallest = 431 - 363 = 68

Let there be 8 classes; the class interval is then

$h = \frac{\text{Range}}{\text{Number of classes}} = \frac{68}{8} = 8.5 \approx 10$

We take h = 10 (rather than 9, the next higher whole number) because a multiple of 10 eases the data scanning process.

Now let's make the table and set the classes. The smallest value is 363, so we start from 360 and set the first class as 360-369, the second as 370-379, and so on. Now start scanning the data, allocate the values to their corresponding classes, and put tallies for them accordingly. When a data value is allocated to a class, cross out that value in the data set to indicate that it has been counted and to avoid recounting.
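The construction steps for Example 2 (range, class interval, class-limits, scanning and counting) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rounding of h up to a multiple of 10 and the starting point at a multiple of 10 follow the choices made in the solution, and the variable names are invented for this sketch.

```python
import math

# Bulb lifetimes (days) from Example 2.
data = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
        401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
        410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

data_range = max(data) - min(data)      # 431 - 363 = 68
raw_width = data_range / 8              # 8 classes requested -> 8.5
h = math.ceil(raw_width / 10) * 10      # assumption: round up to a multiple of 10 -> 10
start = (min(data) // 10) * 10          # assumption: start at a multiple of 10 -> 360

# Inclusive class-limits: the upper limit is lower + h - 1, NOT lower + h.
n_classes = math.ceil((max(data) - start + 1) / h)
classes = [(start + i * h, start + i * h + h - 1) for i in range(n_classes)]

# "Scanning": count how many observations fall within each pair of limits.
freq = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi}  {f}")
print("Total", sum(freq))               # should equal n = 30
```

The final total is the same check recommended in the rules: if it differs from the number of observations, some value was missed or counted twice.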
Go on scanning, crossing out, and counting, and put the tallies accordingly. Fill in the rest of the columns.

423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390

Classes    TBs          Frequency (f)
360-369    ||           2
370-379    |||          3
380-389    ||||/        5
390-399    ||||/ ||     7
400-409    ||||/        5
410-419    ||||         4
420-429    |||          3
430-439    |            1
Total                   Σf = n = 30

(TBs = tally bars; ||||/ denotes a bundle of five.)

Sum up the frequencies to check that all the data values have been picked up.

Looking at this frequency distribution, we can quickly see that the largest group of bulbs has a life between 390 and 399 days, as this class has the largest frequency (7). Thus, this class can be regarded as a representative group of this data. We can also see how the frequencies decrease toward the tails of the distribution, and that the distribution looks fairly symmetric.

Relative frequency and percentage frequency

While studying these data we may want to know not only how long the bulbs last, but also what proportion of the bulbs falls into each class of bulb life. This is called the relative frequency (RF) of a particular observation or class and is found by dividing its frequency (f) by the total number of observations n:

$RF = \frac{f}{n}$

A clearer measure is the percentage relative frequency (PRF), which is found by multiplying each relative frequency by 100:

$PRF = RF \times 100$

The PRF tells us what percent of the observations fall in a particular class. This gives a slightly clearer picture than the RF.

Example 3. Let's calculate the RF and PRF for Example 2.
Classes    f              RF = f/n         PRF
360-369    2              2/30 = 0.07      (2/30) × 100 = 7
370-379    3              3/30 = 0.10      (3/30) × 100 = 10
380-389    5              0.17             17
390-399    7              0.23             23
400-409    5              0.17             17
410-419    4              0.13             13
420-429    3              0.10             10
430-439    1              0.03             3
Total      Σf = n = 30    1.0              100

Looking at this table, taking the class 390-399 as an example, we can now say that:
- The chance of a randomly selected bulb having a life in this class is approximately 0.23.
- 23% of the bulbs have a life from 390 days up to, but less than, 400 days.

Cumulative frequency distribution

A cumulative frequency distribution table is the same as a frequency distribution table with additional columns that give the cumulative frequency (CF) and the cumulative percentage (CP) of the data. The cumulative frequency distribution tells us how many observations fall below or above a given value. It also tells us the number of observations that lie between two given values.

The CFs are obtained by adding the frequency of each class, in succession, to the cumulative total of the previous frequencies, that is, by taking the running total of the frequency column. The accumulation can start either from the top class (or value), in which case the CF is called the "less than" type CF, or from the bottom class (or value), which gives the "more than" type CF. In grouped data, the upper class-boundaries are used for the "less than" type CF and the lower class-boundaries for the "more than" type.

Example 4. We calculate a "less than" type CF and CP for the data in Example 2.

Upper Class Boundaries   f         CF             CP = (CF/n) × 100
<369.5                   2         2              (2/30) × 100 = 7
<379.5                   3         2+3=5          (5/30) × 100 = 17
<389.5                   5         5+5=10         33
<399.5                   7         10+7=17        57
<409.5                   5         17+5=22        73
<419.5                   4         22+4=26        87
<429.5                   3         26+3=29        97
<439.5                   1         29+1=30        100
Total                    n = 30

Suppose we have been asked how many, or what percent, of the observations lie below 399.5.
From the table we quickly learn that there are 17 observations below the given value, which is 57% of the entire data.

Note: We use the upper class-boundaries for a "less than" (<) type CF distribution.

Example 5. Now let's calculate a "more than" type CF and CP for the data in Example 2.

Lower Class Boundaries   f         CF             CP = (CF/n) × 100
>359.5                   2         28+2=30        (30/30) × 100 = 100
>369.5                   3         25+3=28        (28/30) × 100 = 93
>379.5                   5         20+5=25        83
>389.5                   7         13+7=20        67
>399.5                   5         8+5=13         43
>409.5                   4         4+4=8          27
>419.5                   3         1+3=4          13
>429.5                   1         1              3
Total                    n = 30

Suppose now we are asked how many, or what percent, of the observations lie above 399.5. From the table we quickly learn that there are 13 observations above the given value, which is 43% of the entire data.

Note: We use the lower class-boundaries for a "more than" (>) type CF distribution.

Graphical Methods

We now introduce the graphic displays most widely used for data presentation in all sciences. Most of the time we want a visual presentation of data in order to see patterns clearly. Patterns in data are commonly described in terms of center, spread, shape, and unusual features. Some common distributions have special descriptive labels, such as symmetric, bell-shaped, skewed, etc.

We often need answers to questions like:
- Where are the data located (center)?
- How spread out are the data?
- Are the data symmetric or skewed?
- Are there outliers in the data?

Histogram

A histogram is a visual version of a frequency table. Its main purpose is to enhance the presentation of data. You can present the same information in a table; however, the graphic format usually makes it easier to see the nature of the distribution. A histogram consists of vertical bars, usually called "bins" or "frequency bins", that represent the classes of a frequency table. Usually, there is no space between adjacent bars. The height of each bar indicates the frequency of its class.
A histogram can typically help you answer the following questions:
- What is the most frequent observation?
- What distribution (center, variation, and shape) do the data have?
- Does the distribution of the data look symmetric, or is it skewed towards the left or right?

Example 6. Let's construct a histogram and a relative frequency histogram for the energy saver bulbs data given in Example 2. We have already constructed the frequency and relative frequency table (Examples 2 and 3). Let's now depict it.

[Figure: two panels, "Histogram of Data" and "Relative Frequency Histogram of Data"; horizontal axis: Data values (360 to 440); vertical axis: Frequency.]

One can also construct a percentage relative frequency histogram by multiplying the relative frequencies by 100.

Some of the key features that we usually look for in a histogram:

Center: Graphically, the center of a distribution is located at the median of the distribution. The median is the point in a graphic display where about half of the observations lie on either side. In the chart to the right, the height of each column indicates the frequency of observations. Here, the observations are centered over 4.

Spread: The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

Shape: The shape of a distribution is described by the following characteristics.

Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal.

Symmetry. When it is graphed, a unimodal symmetric distribution can be divided at the center so that each half is a mirror image of the other. A single-peaked symmetric distribution is referred to as a bell-shaped distribution.

Skewness.
When displayed graphically, some unimodal distributions have many more observations on one side of the graph than on the other. Distributions with most of their observations on the left (toward lower values), and hence a long tail to the right, are said to be skewed right; distributions with most of their observations on the right (toward higher values), and hence a long tail to the left, are said to be skewed left.

Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peak(s).

Gaps. Gaps refer to areas of a distribution where there are no observations. The second-last panel of the figure that follows has a gap; there are no observations in that part of the distribution.

Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers.

[Figure: Different shapes of histogram. Panels: A uniform distribution; A skewed distribution; A normal distribution; A bi-modal distribution; A distribution with outliers; Clip-like distribution. Horizontal axis: x_i; vertical axis: f(x_i).]

Cumulative Histogram

Just as the histogram is paired with the frequency table, the cumulative histogram is a visual version of the cumulative frequency table. It shows how many (or what percentage) of the observations have accumulated up to each bin (or interval). It makes it easier to find the percentage or proportion of observations falling within a given interval. An ordinary and a cumulative histogram of the same data are given in the following figures.
[Figure: two panels, "Histogram of Data" (vertical axis: Frequency) and "Cumulative Histogram of Data" (vertical axis: Cumulative Frequency), with bar heights labelled; horizontal axis: Data values (360 to 440).]

The cumulative histogram embodies the concept that most probability distributions use to calculate the probabilities associated with different events. Learning about it and understanding it is therefore a must.

Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values, especially discrete values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Example 7. The study included 33 students whose first-grade IQ scores are given here:

The following figure shows a dotplot for the above data. A representative IQ value is around 110, and the data are fairly symmetric about the center.

Stem and Leaf Displays

A stem-and-leaf plot (also known as a stemplot) of a quantitative variable is a textual graph that classifies data items according to their most significant digits. It is generally used for small data sets (50 or fewer observations). A stem-and-leaf display is similar to a histogram, since it shows how many values in a set fall within a certain interval. It carries even more information, because it also shows the actual values within each interval.

A stem consists of the leading digit(s) of an observation, whereas the remaining digits are the leaves. For example, the observation 327 can be split as stem = 3 and leaf = 27, or as stem = 32 and leaf = 7. The stemplot is drawn with two columns separated by a vertical line, with the stems listed to the left of the vertical line.
Each stem is listed only once and no stems are skipped, even if a stem has no leaves. The leaves are listed in increasing order in a row to the right of each stem. When a number is repeated in the data (such as two 72s), the plot must show each occurrence (e.g. the plot of 72 72 75 76 would look like 7 | 2 2 5 6).

Example 8. The stem-and-leaf plot of the energy saver bulb data is constructed as below.

Stem | Leaves
36   | 39
37   | 127
38   | 12679
39   | 0123469
40   | 01589
41   | 0159
42   | 238
43   | 1

(Key: 40|8 = 408)

In this example we could also use a single-digit stem, but then there would be only two stems, 3 and 4, resulting in a much less informative plot.

In the case of values with decimal points (continuous data), the decimal part of each number is taken as the leaf. Rounding may be used to suppress some decimal places so that all data values have the same number of decimal places.

Further reading and exercises: Have a look at the introduction and Section 1.2 of Devore's book and the examples therein. Then solve questions 10, 11, 12, 13, 14, 15, 16.a, 16.b, 17, 20, 24, 25, 29 in exercise 1.2.
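Returning to Example 8, the stem-and-leaf display can be reproduced with a short Python sketch. The function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical, chosen for this illustration; the sketch assumes non-negative integer data with a single-digit leaf, as in the bulb data.

```python
from collections import defaultdict

# Hypothetical helper; assumes non-negative integers and one-digit leaves.
def stem_and_leaf(values, leaf_unit=10):
    """Return the rows of a stem-and-leaf display as strings 'stem | leaves'."""
    stems = defaultdict(list)
    for v in sorted(values):                 # sorting puts the leaves in increasing order
        stems[v // leaf_unit].append(v % leaf_unit)
    rows = []
    # Walk every stem from smallest to largest so that no stem is skipped,
    # even one that has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

# Bulb lifetimes (days) from Example 2.
data = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
        401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
        410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

rows = stem_and_leaf(data)
for row in rows:
    print(row)
```

Repeated values appear once per occurrence: `stem_and_leaf([72, 72, 75, 76])` gives the single row `7 | 2256`, matching the rule about repeated numbers stated above (with the leaves written without spaces, as in Example 8).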