FOM 11 – Chapter 5: Statistics

Statistics is a mathematical discipline concerned with the collection, organization, display, analysis, interpretation and presentation of data. Data is collected to study a population, which can represent a certain group of people, animals, trees, etc. Data points are often represented as $X$. A population includes all measurements of interest. When census data cannot be collected, a portion of the population is carefully selected to form a sample.

After data is collected, statisticians study its behavior using two kinds of statistical measures:

1. Measures of central tendency, which describe the middle of the data:
• Mean: the sum of the data entries divided by the number of entries. The mean of a population is represented by the Greek letter $\mu$ (mu). The mean of a sample is represented as $\bar{x}$.
• Median: the data entry that has half of the data below it and half of the data above it. When the data entries are written in numerical order, the median is the middle data value if the number of entries is odd, and the mean of the two middle data values if the number of entries is even.
• Mode: the data entry that occurs most often.

2. Measures of dispersion, which describe how the data is spread out:
• Max: the greatest value in the data set.
• Min: the smallest value in the data set.
• Range: the difference between the maximum and minimum values in the data set.
• Variance and standard deviation.

Notes:
• It is always helpful to arrange the data in ascending order before finding the max, min, median, mode and so on.
• Dispersion has a value of zero if all the data points in a set are identical. It increases in value as the data becomes more spread out.
• An outlier is a value in the data set that is very different from the other data values.

5.1 Exploring Data:

Example 1: Determine the mean, median, mode, maximum, minimum and range for the data set below.
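The measures defined above can also be checked by computer. The following is a minimal Python sketch using the standard `statistics` module. The helper name `describe` and the data values are made up for illustration and are not the Example 1 data set.

```python
import statistics

def describe(data):
    """Return the measures of central tendency and dispersion from this section."""
    ordered = sorted(data)  # ascending order, as the notes recommend
    return {
        "mean": statistics.mean(ordered),      # sum of entries / number of entries
        "median": statistics.median(ordered),  # middle value, or mean of the two middle values
        "mode": statistics.mode(ordered),      # most frequent entry
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],     # max minus min
    }

# Hypothetical data set, chosen only for illustration
sample_data = [12, 7, 4, 15, 7, 9]
summary = describe(sample_data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because this data set has an even number of entries, the median is the mean of the two middle values (7 and 9) once the data is sorted.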
5.2 Frequency Tables, Histograms and Frequency Polygons

Sometimes data values are repeated frequently. A frequency table tells us how often a data value or an event occurs. For example, the table below organizes students based on the number of siblings they have.

| Number of Siblings | Number of Students |
| --- | --- |
| 0 | |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5+ | |

A frequency distribution is a set of intervals (a table or a graph) into which data is organized. For example, in the table above, each row is an interval of the frequency distribution. Intervals are also referred to as bins.

A histogram is a bar graph that shows a frequency distribution. The horizontal axis represents the intervals and the vertical axis represents the frequency.

A frequency polygon is the graph of a frequency distribution produced by joining the midpoints of the intervals using straight lines.

• When dealing with a lot of data, it is often easier to choose a range of values for each interval instead of creating an individual interval for each data value, which becomes unmanageable when the data set is very large.

Example 2: Create a frequency table and histogram for the data given below.

Range: ______ to ______

| Flow Rate | Tally | Freq. |
| --- | --- | --- |

• We often use histograms to determine the distribution (shape) of data. When data is normally distributed, its frequency distribution has a bell shape: the intervals in the middle have a higher frequency than the intervals on each end. Many types of data naturally have this type of distribution. Height, weight and temperature of living organisms are examples of data that are normally distributed.

Although data is often normally distributed, sometimes data is skewed in one direction or does not have a distinct peak. Many other shapes are possible, but only those below have names.

• Skewed Left (less data on the left)
• Skewed Right (less data on the right)
• Uniform (no peak)
• Bimodal (2 peaks)

Example 3: The magnitude of an earthquake is measured using the Richter scale. The higher the magnitude, the more severe the earthquake is. Based on the histograms shown below, in what years did the most damage from earthquakes occur?

5.3 Measures of Dispersion:

So far we know one measure of dispersion, the range. The range does not describe how the data is distributed inside that interval, so the standard deviation is used for this purpose.

Deviation is the distance between a data value and the mean of the data set, denoted by $(x - \mu)$, where $x$ is the data value and $\mu$ is the mean.

Standard deviation is a measure of the dispersion, or scatter, of data values in relation to the mean. A low standard deviation shows that most data is close to the mean, so the data is more consistent. A high standard deviation shows that the data is scattered farther from the mean, so the data is less consistent.

Standard deviation for a population is represented using the Greek letter $\sigma$. The symbols used are:

• $\sigma$ = standard deviation of a population
• $\mu$ = mean of the population
• $\Sigma$ = "the sum of"
• $x$ = each data entry
• $n$ = number of entries

Steps for calculating the standard deviation:
1. Calculate the deviation from the mean for each value.
2. Square each of the numbers obtained above (the deviations).
3. Find the sum of all the squared deviations.
4. Divide the sum by the number of data entries.
5. Take the square root of the result.

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$

Example 4: The values 170, 182, 192, 193, 212 represent the heights of players on a basketball team. Calculate the standard deviation for the team's heights.

$\mu =$

| $x$ | $(x - \mu)$ | $(x - \mu)^2$ |
| --- | --- | --- |
| 170 | | |
| 182 | | |
| 192 | | |
| 196 | | |
| 212 | | |

$\sum (x - \mu)^2 =$

$\sigma =$

Example 5: The heights of players on a different basketball team are 152, 154, 174, 180, and 220. Calculate the standard deviation for this team's heights.

$\mu =$

| $x$ | $(x - \mu)$ | $(x - \mu)^2$ |
| --- | --- | --- |
| 152 | | |
| 154 | | |
| 174 | | |
| 180 | | |
| 220 | | |

$\sum (x - \mu)^2 =$

$\sigma =$

Q: How do the heights of the two teams compare, based on the standard deviation (consistency)?

Steps for Calculating Standard Deviation from a Frequency Table:
1. To calculate the mean, multiply each data value by its frequency to get the sum of all the data entries that have the same value.
2. Add all these sums, then divide by the sum of the frequencies, which is the number of elements in the data set.
3. To calculate the standard deviation, calculate the square of the deviation for each data value, $(x - \mu)^2$, then multiply it by its frequency.
4. Find the sum of all the $f \times (x - \mu)^2$ values, then divide by the sum of the frequencies.
5. Take the square root of the number from step 4.
6. These steps can be performed using a table.

Example 6: Calculate the standard deviation from the frequency table below.

| Number of hits ($x$) | Frequency ($f$) | $x \times f$ | $(x - \mu)$ | $(x - \mu)^2$ | $f \times (x - \mu)^2$ |
| --- | --- | --- | --- | --- | --- |
| 0 | 5 | | | | |
| 1 | 10 | | | | |
| 2 | 4 | | | | |
| 3 | 3 | | | | |
| 4 | 1 | | | | |

• It is also possible to calculate the standard deviation from a frequency table whose intervals are given as a range rather than as specific data values. Because it is not possible to see each individual data point, the midpoint of each interval is treated as the data value, and the standard deviation is then calculated as above for a frequency table.

Example 7: Angelo conducted a survey to determine the number of hours per week that grade 11 females in her school play videogames. She obtained the following set of data. Calculate the mean and the standard deviation for the data table given below.

| Hours per week | Frequency ($f$) | Midpoint ($x$) | $x \times f$ | $(x - \mu)$ | $(x - \mu)^2$ | $f \times (x - \mu)^2$ |
| --- | --- | --- | --- | --- | --- | --- |
| 3-5 | 7 | | | | | |
| 5-7 | 11 | | | | | |
| 7-9 | 16 | | | | | |
| 9-11 | 19 | | | | | |
| 11-13 | 12 | | | | | |

Janice conducted the same survey for males and found that the mean was $\mu = 12.84$ and the standard deviation was $\sigma = 2.16$. Which group has a greater mean? Which group has a greater standard deviation?

Homework: #1, 2, page 211; #2, 3, 7, 11, page 221; #1ac, 2, 4, 6, 9a, 13, page 233
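As a way to check hand calculations, the steps for finding the standard deviation from a frequency table can be sketched in Python. This is a minimal sketch, not part of the original notes: the function name `frequency_table_stats` is made up for illustration, and the data is the Example 6 table (number of hits and their frequencies).

```python
import math

def frequency_table_stats(values, frequencies):
    """Return (mean, population standard deviation) for a frequency table."""
    n = sum(frequencies)  # number of elements in the data set
    # Steps 1-2: sum of x * f, divided by the sum of the frequencies
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    # Steps 3-4: sum of f * (x - mean)^2, divided by the sum of the frequencies
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies)) / n
    # Step 5: square root of the result
    return mean, math.sqrt(variance)

# Example 6 data: number of hits and how often each occurred
hits = [0, 1, 2, 3, 4]
freq = [5, 10, 4, 3, 1]

mean, sigma = frequency_table_stats(hits, freq)
print(f"mean = {mean:.2f}, standard deviation = {sigma:.2f}")
# prints: mean = 1.35, standard deviation = 1.09
```

For a table whose intervals are given as ranges, the same function can be used by passing the midpoint of each interval as `values`.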