Course Notes: Probability and Statistics – Mrs. Leahy
Unit 1: Organizing Data

Day 1: Collecting & Organizing Data; Frequency Distributions; Histograms & Relative Frequency Histograms

A. Collecting & Organizing Data

______________________ is the study of how to collect, organize, analyze, and interpret numerical information from data.

This data can represent information that is “Qualitative”/ __________________________ or “Quantitative”/ ______________________________.

Example A1: A survey is conducted at a local library collecting the following data from patrons:

1. Age
2. Gender
3. Distance you live from library
4. Marital Status
5. Number of Children in Household
6. Favorite Book Genre

BOX in data that is “Qualitative.” STAR beside data that is “Quantitative.”

In this unit, we will be dealing with a lot of “Quantitative” data: how to organize it and how to display it in a useful manner.

B. Frequency Tables

Have a large set of quantitative data? Organize it into smaller intervals called ________________. How many should we use?

Example B1: Characteristics and Important Values in a Frequency Distribution

An irate customer called the Dollar Day Mail Order Company 40 times during the last two weeks to see why his order had not arrived. Each time he called, he recorded the length of time he was put “on hold” before being allowed to talk to a customer service representative.

Time on Hold, in minutes

1 5 5 6 7 4 8 5 6 7
6 6 5 8 7 8 11 2 4 6
5 3 7 8 8 9 9 10 7 9
12 8 6 9 13 9 5 10 6 9

We are going to use five classes to organize our data. (The number of classes will be given to you for homework.) We need to determine how big each interval should be. This is called the “class width.”

Step 1: Determine Class Width

1. Compute: $\dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$
   In our example:
2.
ROUND UP to the next whole number.

Step 2: Determine the Data Range for each class: The Class Limits
Start with Lower Class Limits (LL) (the lowest value in the data class).
Lowest data value = Lowest Class Limit. Add Class Width to get the next lower limit, etc.
Fill in Upper Class Limits (UL) (the highest value that fits in the class).

Step 3: Determine the Class Boundaries
Upper Class Boundary = Upper limit + 0.5
Lower Class Boundary = Lower limit – 0.5

Step 4: Determine the Frequency of each class
Class Frequency = # of data values in class (count)

Step 5: Find the Class Midpoint
Class Midpoint = Average of Lower and Upper Limits

Class Limits    Class Boundaries    Tally (optional)    Frequency    Midpoints

A table (like the one we just made) that shows the classes and corresponding frequencies is called a ______________________________________ or _____________________________________

Example B2: Relative Frequency Table

The relative frequency of a class is the proportion of all data values in that class. It helps us compare the amount of data in each class.

Step 1: Fill in your class limits and frequencies (from our last example)
Step 2: Compute the Relative Frequency
1. Find the total frequency (sum)
2. $\text{Rel. Frequency} = \dfrac{f}{n} = \dfrac{\text{class freq.}}{\text{total freq.}}$

***NOTATION*** ∑ =

Class Limits    Frequency    Relative Frequency
1 – 3           3
4 – 6           15
7 – 9           17
10 – 12         4
13 – 15         1
Total:          ∑ freq =     ∑ rf =

Example B3: Given a data set of 10 numbers {1, 7, 8, 4, 4, 5, 6, 3, 8, 7} and using four classes
a) Find the class width
b) Make a frequency table showing class limits, class boundaries, midpoints, frequencies, and relative frequencies

C: Histograms and Relative Frequency Histograms

Properties of Histograms & Frequency Distributions

A histogram uses ______________ to show
1. Bars touch ________________________________.
A histogram where the ________________ of the bar = class frequency is a frequency histogram. If we use relative frequency for the height, then we have a relative frequency histogram.
2. Sometimes class limits or midpoints may be used instead of boundaries.
3. Class widths must be equal.
4. Class limits cannot overlap.
5. Classes cannot be open-ended.
6. Use 4 to 15 classes. (Some sources recommend 5 to 15.)

Example C1: Construct a histogram and a relative frequency histogram for the data in Example B3. (You may want a ruler.)

Class Boundaries    Frequency    Relative Frequency
0.5 – 2.5           1            0.1
2.5 – 4.5           3            0.3
4.5 – 6.5           2            0.2
6.5 – 8.5           4            0.4

Basic Construction: [blank graph: TITLE at top; vertical axis labeled Freq/Rel Freq./Etc.; horizontal axis labeled Class Boundaries]

Don’t forget to use a SCALE on your axis. Ticks should be equally spaced!

Day 2: Distribution Shapes; Dot Plots; Frequency Polygons & Ogives

Recall a __________________________ can be used to represent a Frequency Distribution.

A: Distribution Shapes

[Figure: example histograms labeled Symmetric, Uniform, Skewed Left, Skewed Right, and Bimodal. Copyright © Cengage Learning. All rights reserved.]

Mound/Bell Shaped: two sides are symmetrical with respect to a vertical line that goes through the middle of the graph
Uniform: every class has the same frequency
Skewed Left: “tail” on the left is longer than on the right
Skewed Right: “tail” on the right is longer than on the left
Bimodal: histogram shows _______ peaks
Unimodal: histogram shows ______ peak

Often a ______________________ distribution is caused by collecting data from a group of individuals that could have been classified better into two separate groups for that particular data. Example: heights from a mixed group of men and women.

Gaps between bars at the left or right can be caused by _______________________.
These are values that are significantly higher or lower than the rest of your data. Example: salaries of employees at a major corporation where the CEO makes three times as much as the rest of the workers.

B: Dot Plots (Similar to a histogram)

Example B1: Consider the following data set

12, 15, 16, 16, 14, 12, 14, 18, 19, 14, 15, 18, 16, 13, 15, 16, 13, 10, 18, 16

How many numbers? ______
Lowest? _______ Start with: _______
Highest? _______ End with: _______
Mode? (most frequent number) ______ How many times? ______

C: Frequency Polygons and Ogives

A Frequency Distribution can be used to produce a histogram.

Class Boundaries    Frequency
11.5-13.5           7
13.5-15.5           9
15.5-17.5           10
17.5-19.5           4

Another type of graph is a Frequency Polygon, which is used to show trends in data.

Frequency Polygon
1. Horizontal axis = midpoints of each class
2. Vertical axis = frequency
3. Use a dot at the frequency, then use line segments to connect dots.

Class Boundaries    Midpoints    Frequency
11.5-13.5           12.5         7
13.5-15.5           14.5         9
15.5-17.5           16.5         10
17.5-19.5           18.5         4

Ogive
A third type of graph is an Ogive, which is used to show the cumulative frequencies for classes in a frequency distribution. Cumulative frequencies help us determine how many data values are above or below a certain upper boundary.

Class Boundaries    Frequency    Cumulative Frequency    Cumulative Relative Frequency
11.5-13.5           7            7                       7/30 = 0.23
13.5-15.5           9            16                      16/30 = 0.53
15.5-17.5           10           26                      26/30 = 0.87
17.5-19.5           4            30                      30/30 = 1.00

Example C1: What percent of data values are below 17.5?

Day 3: Bar Graphs, Pareto Charts, Pie Charts, Time Series

A: Bar Graph

Features of a Bar Graph
• Bars can be vertical or horizontal.
• Bars are of uniform width and uniformly spaced.
• Lengths represent values of variables being displayed, the frequency of occurrence, or the percentage of occurrence. The same measurement scale is used for the length of each bar.
• The graph is well annotated with title, labels for each bar, and vertical scale or actual value for the length of each bar.

Clustered Bar Graph: two or more bars for each value on the horizontal axis; clusters are uniformly spaced

Pareto Chart: bars arranged by frequency, highest to lowest

B: Pie Charts/Circle Graphs

Wedges visually display proportional parts of the total population as a percentage or as a portion of 360°.
Good for qualitative/categorical data.
The graph should have a title, and wedges should be well labeled or have a key/legend.

C: Time Series

Data are plotted in order of occurrence at regular intervals over time. Dots are connected using line segments.

D: EXAMPLES

The chart shows the enrollment (in hundreds) of male and female students at a college for the years 1995, 2000, and 2005.

Year      1995    2000    2005
Male      30      34      32
Female    28      35      33

Example D1. Make a clustered bar graph for this data.

Keep in mind:
1. Bars = same width
2. Clusters = equally spaced
3. Bar length = % based on category or actual frequency
4. Include Title, Key, Labels, and Scale

Example D2. Make a Pareto chart for male student enrollment during the three years.

Keep in mind:
1. Arrange data highest to lowest
2. Bars = same width and equally spaced
3. Bar length = frequency
4. Include Title, Bar labels

Example D3: Use the data to make a circle graph to display the distribution between male and female students in 1995.
$\% = \dfrac{f}{n} = \dfrac{\text{category frequency}}{\text{total frequency}}$

Degrees = % x 360°

Year      1995    Fractional Part    Percentage    Degrees
Male      30
Female    28
Total

Example D4: Use the data to make a time series graph for the female student enrollment in those three years.

Keep in mind:
1. Vertical Axis = frequency
2. Horizontal Axis = Time
3. Data points connected by line segments
4. Include Title, Labels, Scale

Year      1995    2000    2005
Female    28      35      33

Example D5: Determine whether the statement is true or false.

A. In a bar graph, the bars do not have to be of uniform width.
B. The bars in a bar graph can be vertical or horizontal.
C. The lengths of the bars in a bar graph stand for certain values of the variable being displayed.
D. When two or more variables are displayed together, the bar graph is called a clustered bar graph (or a comparative bar graph).
E. In a Pareto chart, the bars are arranged from left to right according to increasing height.
F. A circle graph is also called a pie chart.
G. Circle graphs are usually used to display percentages.
H. A time series contains the values of a variable taken at regular intervals over a certain time period.

THINK ABOUT IT.

Bar Graph: Good for comparing Qualitative/Categorical Data (frequency) or Quantitative/Numerical Data (actual measurements)
Pareto Chart: Good for comparing Qualitative/Categorical Data in decreasing order
Circle Graphs: Good for comparing how a total is dispersed among several categories (compare percentages)
Time-Series Graph: Good for displaying how data changes over time
Histogram: Good for displaying the “shape” of the distribution of data

ALL GRAPHS: Provide a title, label the axes, and identify units of measure.

Day 4: Stem-and-Leaf Displays

A.
Exploratory Data Analysis (EDA)

Exploratory Data Analysis techniques are used to explore a data set, to detect patterns and extreme data values, to raise new questions, or to pursue leads in many directions.

Useful when data has been gathered for ______________________________.

B. Stem-and-Leaf Display

Used for _________________________ data. Best with small to medium size sets.

A stem-and-leaf display is used to ___________ order and arrange data into groups.

The _____________ are aligned vertically from smallest to largest. A vertical line is drawn to the right of the stems.

The ____________ with the same stem are placed in the same row as the stem, arranged in ___________________ order.

A label (Key) is used to indicate the magnitude of the numbers in the display.

For example: [stem-and-leaf display titled “Ages of Applicants of Graduate Programs”; Key: 1 | 2 = 12]

Example B1: A study on peanut butter reported the following optimal consumption temperatures for various brands:

56 44 62 36 39 53 50 65 45 40

Make a stem-and-leaf display for this data.

Step 1: Identify appropriate stem values. List smallest to largest. No omissions!
Step 2: List leaves with corresponding stems, in numeric order smallest to largest!
Step 3: Include Key and Title

Example B2: For the following data, use the first two digits as the stem to make a stem-and-leaf display.

106 94 112 96 89 113 90 85 85 100

Example B3: Look at the distributions in Examples B1 and B2. Are they unimodal/bimodal? Are they symmetrical or skewed?

C: Splitting the Stems

Consider: 0 0 1 2 3 3 4 5 5 7 7 8 9 9 9

Using only one stem “0” would give us an overcrowded graph. Instead of using an interval of 0-9, maybe we could use TWO intervals.

Example C1: Make a stem-and-leaf display using
a) Two intervals: 0-4, 5-9
b) Five intervals: 0-1, 2-3, 4-5, 6-7, 8-9

Example C2.
Britney is a swimmer training for a competition. The number of 50-meter laps she swam each day for 30 days are as follows:

a) Prepare a stem-and-leaf plot.
b) Redraw the stem-and-leaf plot using two unit intervals.
c) Make a comment on what these plots show.

D: Back-To-Back Stem-and-Leaf Plots

If you are comparing two sets of data, you can use a back-to-back stem-and-leaf plot.

Example D1: The following class sizes were reported in Economics 101 and Math 151:

Econ 101: 20, 34, 27, 15, 24, 35, 38, 28
Math 151: 14, 18, 21, 34, 29, 13, 32, 23

Make a back-to-back stem-and-leaf plot for the data.
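
One way to check a hand-drawn answer to Example D1 is with a short Python sketch. This is only an illustration, not part of the original notes. The function name back_to_back_rows is made up for this example. It assumes the first list of class sizes belongs to Econ 101 and the second to Math 151, and that every value is a non-negative whole number, so the tens part is the stem and the ones digit is the leaf.

```python
from collections import defaultdict


def back_to_back_rows(left_data, right_data):
    """Build the rows of a back-to-back stem-and-leaf plot.

    Hypothetical helper for illustration. Assumes non-negative whole
    numbers: the stem is the tens part (value // 10) and the leaf is
    the ones digit (value % 10).
    Returns a list of (left_leaves, stem, right_leaves) tuples.
    """
    left_leaves = defaultdict(list)
    right_leaves = defaultdict(list)
    for value in left_data:
        left_leaves[value // 10].append(value % 10)
    for value in right_data:
        right_leaves[value // 10].append(value % 10)

    # List every stem from smallest to largest, with no omissions.
    combined = list(left_data) + list(right_data)
    first_stem = min(combined) // 10
    last_stem = max(combined) // 10

    rows = []
    for stem in range(first_stem, last_stem + 1):
        # Left-side leaves are read outward from the stem,
        # so the largest leaf is written first.
        left = sorted(left_leaves[stem], reverse=True)
        right = sorted(right_leaves[stem])
        rows.append((left, stem, right))
    return rows


# Data from Example D1 (assumed assignment of lists to courses).
econ_101 = [20, 34, 27, 15, 24, 35, 38, 28]
math_151 = [14, 18, 21, 34, 29, 13, 32, 23]

rows = back_to_back_rows(econ_101, math_151)

# Right-justify the left leaves so they line up against the stems.
left_width = max(len(" ".join(map(str, left))) for left, _, _ in rows)
print("Class Sizes: Econ 101 (left) vs. Math 151 (right)")
for left, stem, right in rows:
    left_text = " ".join(map(str, left)).rjust(left_width)
    right_text = " ".join(map(str, right))
    print(f"{left_text} | {stem} | {right_text}")
print("Key: 5 | 1 | 3 means 15 for Econ 101 and 13 for Math 151")
```

The stems sit in the middle column, and each course's leaves grow outward from them. The title and key are still required, just as in Step 3 of a regular stem-and-leaf display.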