Probability & Statistics
Basics of Probability & Statistics
Chapter 1
Imran Ali

Text book
Probability & Statistics for Engineers & Scientists, 8th Edition, by Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, Keying Ye
NOTE: The book has a wider scope than this course, so use the book with the course guideline in mind

What is meant by Probability?
The term probability refers to the study of randomness and uncertainty
A 40% chance of showers does not mean it will or will not rain
Used when circumstances permit multiple outcomes
  It may or may not rain: two possible outcomes
  Students may have a percentage >80%, >75%, 70% etc: multiple outcomes possible

What is meant by Statistics?
Deals with collection and interpretation of data
Involves collecting, classifying, summarising, organising, analysing and finally interpreting data
Statistics for runs scored by a batsman or wickets taken by a bowler
  How would the definition apply here?
Statistics for the number of students in MUET who passed all subjects
Two branches:
  Descriptive Statistics
  Inferential Statistics

Descriptive Statistics
Collection of data leading to presenting its summary and describing its important features
For example, collection of data from students of 10BM on whether they would prefer practical or theory classes in the morning. The results are displayed in the form of a graph
The techniques used to describe the data fall under the category of descriptive statistics

Inferential Statistics
Now that the data is available, inferential statistics helps to translate the output of descriptive statistics into a conclusion (an inference) about the population
For example, if the students of 10BM prefer theory classes in the morning, and assuming that most students feel the same, we can conclude that all students studying BME prefer theory classes in the morning
Is this a good enough assumption?

Consider the figure, which shows the statistics for the data collected during a (fictitious) survey to find the most popular sport in Pakistan
[Figure: bar chart of the number of respondents favoring Football, Cricket and Hockey]
Descriptive statistics informs us that 13 people favored cricket, 6 favored football and 3 favored hockey. 22 people were involved in the survey
Inferential statistics informs us that cricket is by far the most popular sport!

Why Probability?
Without the use of probability, the true meaning of a statistical inference cannot be extracted
Probability theory translates the outputs of descriptive statistics into inferential statistics
[Diagram: Population and Sample, linked by Probability and Inferential Statistics]

Why Statistics?
Engineering systems are designed after a thorough understanding of the system requirements
Statistical concepts and methods provide ways of gaining new insights into the behavior of many phenomena that are encountered in every field of engineering & science

The field of statistics makes possible intelligent judgment and informed decision making
Collecting data from hospitals about their requirements for medical appliances can tell you which products they would like to have
Once supported by statistics, Bio-Medical Engineers can work on the design and development of the required products

Measures of Central Location

Measures of Central Location
Methods to identify quantitatively the central position, the central value, in a data set
We use these measures in our daily lives
Examples:
  Average marks obtained by the class in a subject
  Average distance covered by a vehicle per litre of petrol
  Statements like, 8 out of 10 recommend this...

Experiment
An activity or process whose outcome is subject to uncertainty
Like an experiment you may perform in a chemistry lab, experiments for gathering data have to be planned properly too
Tossing a coin, conducting a survey, a teacher's evaluation etc

Population
An experiment will involve a well-defined collection of objects constituting a population of interest
The desired information is available from the population
For example,
  Students who graduated with BEng Bio-Medical: feedback about the course
  People who suffer from malaria in Hyderabad: do they live near water bodies?

When the desired information is collected from all objects in the population, we have a census
Due to time, money and other constraints, usually only a subset of the population, i.e. a sample, is studied

Sample
A sample is a set of observations taken from a population
  Subset of the data of interest
  Subset of the population
For example, if we want to find out why some students underperform, then instead of every student we can use students with a 1st year academic percentage between 50 and 60

Sampling
Sampling refers to collecting data from only a part of the entire population
While checking for faulty products at a manufacturing plant, we check only samples drawn randomly
For example, you take a bite of a piece of cake and judge how the rest of it might taste
  You do not need to eat it all to know
Can you think of more examples?
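
Drawing a sample at random, as in the manufacturing-plant example, can be sketched in a few lines of Python. This is only an illustration: the population of 500 numbered items, the sample size of 20 and the seed are assumptions made for the example, not values from the course.

```python
import random

# Hypothetical population: serial numbers of 500 items produced in one shift.
population = list(range(1, 501))

# A fixed seed makes the run repeatable; drop it to get a fresh sample each time.
rng = random.Random(2024)

# Simple random sample without replacement: 20 distinct items,
# every item having the same chance of being picked.
sample = rng.sample(population, 20)

print(sorted(sample))
```

Only the 20 sampled items would then be inspected, and the result is used to judge the whole batch.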

Representative Sampling
One that accurately reflects its population's characteristics
It is neutral
For example, you run a survey among students at MUET and ask them which flavour of ice cream they like

Biased Sampling
Not neutral
For example, you run a survey among students at MUET and ask them which is the best department at the university
  ES has a greater chance of being selected because it has more students, while CRP does not

Sample mean
The sample mean is the numerical average of the values in a data set
Suppose the set X has n elements, then the sample mean is
$$\bar{x} = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

For example,
  Average marks of students in a class
  Average petrol consumption per kilometer for a certain vehicle
More examples where this kind of information is used?

Median
The median is a value which lies at the centre of a data set
An equal number of elements have values greater than the median and values smaller than the median

In order to find the median of a data set, arrange the elements in ascending order of value
The median is the middle value
Consider A = {5, 1, 2, 7, 3}
  Sorted: A1 = {1, 2, 3, 5, 7}
  The median is 3
The median remains unaffected by the extreme values of a data set
  How?

Consider B = {5, 1, 2, 7, 3, 6}
  Sorted: B1 = {1, 2, 3, 5, 6, 7}
With an even number of elements, the median is equal to one half of the sum of the two centre values
  (3+5)/2 = 4
Verification: three elements have a value less than 4 while three have a value greater than 4
So we note that
  The median may not actually be an element of the data set
  An equal split above and below the median may not always be possible. How?
  Violations can also occur.
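
The sample mean formula and the median procedure above can be checked with a short Python sketch. The function names `sample_mean` and `sample_median` are illustrative choices; the data sets A and B are the ones used in the slides, and `A_extreme` is a made-up variant used only to show the effect of an extreme value.

```python
def sample_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle value of the sorted data; mean of the two centre values if n is even."""
    ordered = sorted(values)          # ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: a single middle element exists
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


A = [5, 1, 2, 7, 3]
B = [5, 1, 2, 7, 3, 6]
print(sample_mean(A))      # 3.6
print(sample_median(A))    # 3
print(sample_median(B))    # 4.0

# Replace the largest value of A with an extreme one:
# the mean moves a lot, the median does not move at all.
A_extreme = [5, 1, 2, 700, 3]
print(sample_mean(A_extreme))     # 142.2
print(sample_median(A_extreme))   # 3
```

The last two lines illustrate why the median is described as unaffected by extreme values.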
Example of a violation: a set of all ones, where no element is smaller or larger than the median

Mode
The simplest measure of central tendency
Least used in practice
This measure provides information about the most frequently occurring value in a data set
For example, the mode of the following data set A is 3
  A = {1, 2, 3, 4, 5, 6, 3}
It is also possible to have more than one mode
  C = {1, 2, 3, 4, 5, 6, 3, 4, 9, 9, 10}
  How many modes are there? What are they?
A data set with two modes is called bimodal; one with more than two modes is called multimodal

Trimmed Mean
A method of averaging that removes a small percentage of the largest and smallest values before calculating the mean
The mean is quite sensitive to extreme values, while the median does not always take the extreme values into account
The trimmed mean is a compromise between the two

Suppose we have the following data set, A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, and we want to calculate the 10% trimmed mean
10% trimming = remove 10% of the total elements of the data set from each extreme
  i.e. remove 0.1*10 = 1 element from each end
  B = {2, 3, 4, 5, 6, 7, 8, 9}
Now calculate the mean

Representing Data using a Dot-plot
This is a statistical chart/plot used as part of descriptive statistics
Illustrates the location of elements in a data set on a simple scale
For example, marks secured by students in the exam can be illustrated as
[Figure: dot plot of student marks on a scale from 0 to 100]

Examples
Using the following data sets, find the mean, median, mode and the 5%, 10%, 20%, 40% trimmed mean
  X = {2, 1, 9, 3, 8, 4, 6, 7, 0, 1, 3, 6, 5, 9, 1}
  Y = {5, 0.6, 6, 0.1, 0.5}

More Examples-1
Twenty adult males between the ages of 30 and 40 were involved in a study to evaluate the effect of a specific health regimen involving diet and exercise on blood cholesterol. Ten were randomly selected to be a control group and ten others were assigned to take part in the regimen as the treatment group for a period of 6 months.
The following data show the reduction in cholesterol so far
  Control Group:     7   3  -4  14   2   5  22  -7   9   5
  Treatment Group:  -6   5   9   4   4  12  37   5   3   3
Compute the mean, median, the mode and the 10% trimmed mean
Provide an explanation for the information provided by the statistics

More Examples-2
Blood pressure values are often reported to the nearest 5 mmHg (100, 105, 110 etc). Suppose the actual blood pressure values for nine randomly selected individuals are
  118.6, 127.4, 138.4, 130.0, 113.7, 122.0, 108.3, 131.5, 133.2

What is the median of the reported blood pressure values?
Suppose the blood pressure of the second individual is 127.6 rather than 127.4. How does this affect the median of the reported values?
What if the median was calculated with rounded-off data? How would this affect the results?

Measures of Variability

Introduction
The measures of central location we discussed in the previous lecture provide only partial information about a data set
It is possible for multiple data sets to have identical measures of centre and yet differ from one another

For example, consider the marks achieved by students in two examinations
  Exam 1:  70  65  75   80  70   Mean = 72
  Exam 2:  40  35  95  100  90   Mean = 72
[Figure: dot plot of Exam 1 (x) and Exam 2 (o) marks on a scale from 0 to 100, with the mean marked]

Statistics for measuring variability
Three key statistics
  Range
  Sample Variance and Population Variance
  Sample Standard Deviation and Population Standard Deviation

Range
It is the difference between the largest and smallest value in a data set
For example,
  The range of marks for Exam 1 is 80 - 65 = 15
  The range of marks for Exam 2 is 100 - 35 = 65

Range is however a poor measure of the variation of values in a data set, because it is based on the two extreme values and disregards the position of the remaining samples
  Exam 3:  20  74  75  75  75  75  76  100   Mean = 71.25
[Figure: dot plot of Exam 3 marks on a scale from 0 to 100]

Deviation from the Mean value
In this statistic we find the amount by which the value of each sample in the data set deviates from the mean value of the data set
For a data set with values x1, x2, x3, ..., xn the deviations from the mean can be calculated as
$$x_1 - \bar{x},\quad x_2 - \bar{x},\quad \ldots,\quad x_n - \bar{x}$$

The deviation will be positive if the sample has a value greater than the mean and negative when the sample has a value smaller than the mean
Calculate the deviations from the mean for the samples in the data set.
  Exam 3:  20  74  75  75  75  75  76  100

If all deviations are of small magnitude, then all the samples in the data set will be close to the mean
  Small variability in the data set
  The converse is also true
Small/no variability is important
  Soft drink manufacturers
  Medicine manufacturers
  Electronic equipment

Variance
Comparing the variability of data sets with few samples is straightforward; however, with larger data sets it becomes more cumbersome
Variance is a statistic which combines the deviations of the individual samples within a data set
Two types:
  Population variance
  Sample variance

Population Variance
The population variance can be calculated as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Where,
  $\sigma^2$ = population variance
  $\mu$ = population mean
  $N$ = size of the population

Sample Variance
The population variance is not always known
  It is difficult to know the variance in a data set involving 200 million Pakistanis and their preference on a given matter
So we estimate the population variance from the sample variance (inference)
As the name suggests, the sample variance is the variance in a data set which is a subset of the population

Sample variance can be calculated as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Where,
  $s^2$ = sample variance
  $\bar{x}$ = sample mean
Thus variance is the average squared deviation
If the unit of the sample is cm, then the unit of the variance is cm^2

Population Standard Deviation
It is the positive square root of the variance
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
The unit of the standard deviation is the same as that of the samples in the data set

Sample Standard Deviation
It is the positive square root of the variance
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The unit of the standard deviation is the same as that of the samples in the data set

Examples
Using the following data sets, find the variance and the standard deviation.
  X = {2, 1, 9, 3, 8, 4}
  Y = {5, 0.6, 6, 0.1, 0.5}

Let us find the population mean and variance.
  X = {2, 1, 9, 3, 8, 4}
We can assume that the data set is a population because we have not been explicitly told that the given data set represents a much larger data set
Size of data set: N = 6
Population mean:
$$\mu = \frac{2 + 1 + 9 + 3 + 8 + 4}{6} = 4.5$$

  i   x_i   x_i - µ             (x_i - µ)^2
  1   2     2 - 4.5 = -2.5      6.25
  2   1     1 - 4.5 = -3.5      12.25
  3   9     9 - 4.5 =  4.5      20.25
  4   3     3 - 4.5 = -1.5      2.25
  5   8     8 - 4.5 =  3.5      12.25
  6   4     4 - 4.5 = -0.5      0.25

$$\sum_{i=1}^{N} (x_i - \mu)^2 = 6.25 + 12.25 + 20.25 + 2.25 + 12.25 + 0.25 = 53.5$$
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{53.5}{6} = 8.9167, \qquad \sigma = \sqrt{8.9167} = 2.9861$$
[Figure: dot plot of X on a scale from 0 to 10, with µ = 4.5 and σ = 2.9861 marked]

Chebyshev's theorem & z-scores

Chebyshev's Theorem
So far we have studied some important statistics for the description of a data set, including the mean, the variance and the standard deviation
We observed that when the standard deviation was large, there was greater variability in the data set, and vice versa

P. L. Chebyshev discovered that
  The fraction (percentage) of measurements falling between any two values symmetric about the mean is related to the standard deviation
  At least the fraction $1 - 1/k^2$ of the measurements of any set of data must lie within k standard deviations of the mean
  This is a lower bound: more than $1 - 1/k^2$ of the measurements may lie within that interval

Examples,
With k = 2, $1 - 1/2^2 = 3/4$, i.e. 75% or more of the values must lie within 2 standard deviations on either side of the mean
  Population: $\mu - 2\sigma,\ \mu + 2\sigma$
  Sample: $\bar{x} - 2s,\ \bar{x} + 2s$

With k = 3, $1 - 1/3^2 = 8/9$, i.e. 88.9% or more of the values must lie within 3 standard deviations on either side of the mean
  Population: $\mu - 3\sigma,\ \mu + 3\sigma$
  Sample: $\bar{x} - 3s,\ \bar{x} + 3s$
The theorem is not so helpful for k = 1
  For k = 1, $1 - 1/1^2 = 0$, i.e. zero or more values must lie within 1 standard deviation on either side of the mean
  Why is this not helpful?

[Figure: dot plot of X = {2, 1, 9, 3, 8, 4} on a scale from -2 to 11, with µ = 4.5, σ = 2.9861 and the intervals µ±σ and µ±2σ marked]
µ±σ contains 3/6 = 50% of the values. µ±2σ contains 6/6 = 100% of the values.

Example:
Use Chebyshev's theorem to find the percentage of values that fall between 20 and 30 for a data set with sample mean 21 and a standard deviation of 2.
Solution:
The lower limit is 20, so
$$20 = \bar{x} - ks \;\Rightarrow\; ks = \bar{x} - 20 \;\Rightarrow\; k = \frac{\bar{x} - 20}{s} = \frac{21 - 20}{2} = 0.5$$

Chebyshev's Theorem
If the IQs of a random sample of 1080 students at a large university have a mean score of 120 and a standard deviation of 8, use Chebyshev's theorem to determine the interval containing at least 810 of the IQs in the sample.

Solution:
According to the question, we need to determine the interval containing at least 810 of the IQs in the sample.
So, we need to find the value of k for the interval from μ - kσ to μ + kσ
We know that,
$$1 - \frac{1}{k^2} = \frac{810}{1080} = \frac{3}{4}$$
$$\frac{1}{k^2} = 1 - \frac{3}{4} = \frac{1}{4} \;\Rightarrow\; k^2 = 4 \;\Rightarrow\; k = 2$$
Now, the interval can be calculated as,
  Lower limit: μ - kσ = 120 - 2*8 = 120 - 16 = 104
  Upper limit: μ + kσ = 120 + 2*8 = 120 + 16 = 136

Concluding Remarks
Chebyshev's theorem holds for any distribution
The value given by the theorem is a lower bound only
The actual value can be much greater than the lower bound

Z-scores
Motivation
Suppose I want to compare the marks you score in two subjects
  Applied Calculus (AC)
  English
Suppose that a student scores 80 marks in AC and 90 marks in English
Does it mean that the student is better at English than Applied Calculus?

We cannot say with certainty that the student is better at English than Applied Calculus
Rather, it would make more sense to compare the student's performance in these two subjects relative to the performance of all other students in the class
Why compare relative performance?
  It is quite possible that the Applied Calculus examination is more difficult than English

It is quite possible that the mean grade in English was 86 with a standard deviation of 5, while the mean in AC was 70 with a standard deviation of 8
So, the objective is to compare two observations from two different populations in order to determine their relative rank
One way is to convert the observations into standard units known as z-scores or z-values

Z-score:
An observation x from a population with mean µ and standard deviation σ has a z-score defined by
$$z = \frac{x - \mu}{\sigma}$$
Note that the units in the numerator and the denominator cancel, so the z-score is a unitless quantity
  This permits comparison of even two distinct observations

Compare the z-scores for the two exams
  AC: z = (80 - 70)/8 = 1.25
  English: z = (90 - 86)/5 = 0.8
  Relative to the rest of the class, the student performed better in Applied Calculus

Example:
Different typing skills are required for secretaries depending on whether one is working in a law office, an accounting firm or for a mathematical research group at a university.
The data is gathered from three distinct testing methods. Determine the candidate who has the fastest typing speed.
  Sample        Applicant's Score
  Law           141 seconds
  Accounting    7 minutes
  Scientific    33 minutes

Frequency distribution

Frequency distribution
The frequency of a particular observation is the number of times the observation occurs in the data
Frequency distributions can be portrayed as
  Frequency tables
  Histograms

Important characteristics of a large data set can be easily assessed by first grouping the data into different classes and then determining the number of observations that fall in each of the classes
In tabular form we call this a frequency distribution

Data presented in the form of a frequency distribution is called grouped data
Data can be grouped according to intervals/classes (as in classification)
Table: Frequency distribution for percentage secured by students of BME
  Percentage    Number of students
  0-49          1
  50-59         2
  60-69         15
  70-79         20
  80-89         4
  90-100        1

Grouping the data provides a better overall picture of the unknown population
  What can you infer from the table?
However, grouped data loses the identity of the individual observations
  How?
Note that the lower limit of the interval is called the lower class limit and the upper limit is called the upper class limit
  For the class 70-79%, the lower class limit is 70% and the upper class limit is 79%

Moreover, 79.5% is the upper class boundary and 69.5% is the lower class boundary for the class 70-79%
The number of observations falling in a particular class is called the class frequency
  Denoted by "f"
The numerical difference between the upper and lower class boundaries of a class interval is defined to be the class width

The midpoint between the upper and the lower class boundaries is called the class mark or class midpoint.

Generating a Freq. Distribution
Consider the following data
  2.2  4.1  3.5  4.5  2
  3.4  1.6  3.1  3.3  3.8
  2.5  4.3  3.4  3.6  2.9
  3.3  3.1  3.7  4.4  3.2
  4.9  3.8  3.2  2.6  3.9

STEP 1: Decide on the number of class intervals required
  The number must be smaller than the number of observations, otherwise we gain nothing from grouping
  Too few classes will make the outcome too generalised
  Look at the data and decide
  Typically between 5 & 20
  I'll take 5 (can change later)

STEP 2: Determine the range
  From the data we find that the range is 4.9 - 1.6 = 3.3

STEP 3: Divide the range by the number of classes in order to estimate the approximate width of the interval
  3.3/5 = 0.66
  3.3/7 = 0.47, approx. 0.5

STEP 4: List the lower class limit of the bottom interval and then the lower class boundary
  Add the class width to the lower class boundary to obtain the upper class boundary
  Write down the upper class limit and complete the table
  Class Interval    Class Boundary
  1.5 – 1.9         1.45 – 1.95
  2.0 – 2.4         1.95 – 2.45
  2.5 – 2.9         2.45 – 2.95
  3.0 – 3.4         2.95 – 3.45
  3.5 – 3.9         3.45 – 3.95
  4.0 – 4.4         3.95 – 4.45
  4.5 – 4.9         4.45 – 4.95

STEP 5: Determine the class marks by averaging the class limits
  What are they?

STEP 6: Write the frequencies
  Class Interval    Class Boundary    Frequency
  1.5 – 1.9         1.45 – 1.95       1
  2.0 – 2.4         1.95 – 2.45       2
  2.5 – 2.9         2.45 – 2.95       3
  3.0 – 3.4         2.95 – 3.45       8
  3.5 – 3.9         3.45 – 3.95       6
  4.0 – 4.4         3.95 – 4.45       3
  4.5 – 4.9         4.45 – 4.95       2

Graphical representations

How do we Graphically summarise data?
We can summarise data in numerical and graphical forms
A summary of data in numerical form is referred to as the statistics of the data
  Mean, median, mode
  Range, variance, standard deviation

Before blindly going for statistical analysis, it is always good to look at the raw data, usually in graphical form
  Helps in summarising the data into an easily interpretable format
The types of graphical display most frequently used by biomedical engineers include
  Dot plot
  Time series
  Histograms
  Stem-and-Leaf
  Boxplots

Time Series
A time series is used to plot the changes in a variable as a function of time
The variable is usually a physiological measure, such as electrical activation in the brain or hormone concentration in the blood stream, that changes with time

Histogram
The histogram is a graphical representation of the frequency distribution
  On the x-axis, we have the sample value
  On the y-axis, we have the number of occurrences of samples, i.e. the frequency
[Figure: histogram layout with Frequency of Occurrence on the y-axis, and the lower class limit and class mark of each class on the x-axis]

  Class Interval    Class Boundary    Class mark    Frequency
  1.5 – 1.9         1.45 – 1.95       1.7           1
  2.0 – 2.4         1.95 – 2.45       2.2           2
  2.5 – 2.9         2.45 – 2.95       2.7           3
  3.0 – 3.4         2.95 – 3.45       3.2           8
  3.5 – 3.9         3.45 – 3.95       3.7           6
  4.0 – 4.4         3.95 – 4.45       4.2           3
  4.5 – 4.9         4.45 – 4.95       4.7           2

Shapes for histograms
A histogram can take a wide range of shapes
If the histogram has a single peak then it is called a unimodal histogram
A bimodal histogram has two distinct peaks
A histogram is said to be symmetric if the left half is a mirror image of the right half
A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail
A unimodal histogram is negatively skewed if the left or lower tail is stretched out compared with the right or upper tail

Stem and Leaf plot
A stem-and-leaf plot is a graphical method for showing the frequency with which certain classes of values occur
One can use a frequency distribution table or a histogram for the values, or one can use a stem-and-leaf plot
The frequency distribution and the histogram do not show the exact values of the elements of a sample or population

Example: Consider the following data,
  {12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41}
Step 1: Draw a table where the first column is the stem and the second column is the leaf
  stem   leaf

Step 2: Select and list one or more leading digits for the stem values. The trailing digits become the leaves
  stem   leaf
  1
  2
  3
  4

Step 3: Record the leaf for each observation beside the corresponding stem value
  stem   leaf
  1      2 3
  2      1 7
  3      3 4 5 7
  4      0 0 1

Step 4: Indicate units for the stem and the leaves
  stem   leaf
  1      2 3
  2      1 7
  3      3 4 5 7
  4      0 0 1
  Key: stem = tens, leaf = units

Interpreting Stemplots
Example: Determine the range and the median for the data shown in the stem-and-leaf plot
  stem   leaf
  1      2 3
  2      1 7
  3      3 4 5 7
  4      0 0 1

Example 2: Consider the following data
  2.2  4.1  3.5  4.5  2
  3.4  1.6  3.1  3.3  3.8
  2.5  4.3  3.4  3.6  2.9
  3.3  3.1  3.7  4.4  3.2
  4.9  3.8  3.2  2.6  3.9
Draw the dot plot
Draw the frequency distribution table
Draw the histogram
Draw the stem-and-leaf plot

Quantiles, Boxplots

Introduction
We have so far studied measures of central location and variation
  Mean, median, mode
  Range, variance, standard deviation
Apart from these statistics there are several other measures of location that describe or locate the position of certain non-central pieces of data, relative to the data set

These measures are referred to as fractiles or quantiles
  These are values below which a specific fraction or percentage of the observations in a given set must fall
  Percentage of elements of a data set with a value less than some pre-defined value

Percentile
Percentiles are values that divide a set of observations into 100 equal parts
These values, denoted by P1, P2, ..., P99, are such that 1% of the data falls below P1, 2% of the data falls below P2, ..., and 99% of the data falls below P99

Recall that the median divides a data set into the lower 50% of values and the higher 50% of values
Percentiles divide the data set into 100 parts
  There are 99 percentiles
  The 70th percentile means that 70% of the values lie below P70 while 30% of the values lie above P70
Percentage and Percentile?

Calculating the kth Percentile
Step 1: Arrange the data in ascending order
Step 2: Compute the locator, L, using
$$L = \frac{k}{100}\, n$$
  where n = number of values in the data set, k = percentile of the data

Step 3:
If L is an integer, the kth percentile, Pk, can be found by
  Pk = (Lth value + next value) / 2
If L is not an integer, then we need to round it up to the next largest integer.
Then the value of Pk is the Lth value counting from the lowest

Example: Consider the data set
  {1, 2, 2, 4, 4, 8, 9, 9, 9, 10}
Let us calculate P85.
Since we have 10 elements in the data set, we seek the value below which L = (85/100)*10 = 8.5 observations fall
L is not an integer, so we round it up to 9
Thus P85 is the 9th value counting from the lowest, i.e. P85 = 9

Quartiles
Divide the data set into four equal parts using Q1, Q2, Q3
Quartiles can be related to percentiles as
  Q1 = P25, Q2 = P50, Q3 = P75

Interquartile Range
The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile scores in a distribution

Deciles
Divide the data set into ten equal parts using D1, D2, ..., D9
Deciles can be related to percentiles as
  D1 = P10, D2 = P20, ..., D9 = P90

Data set:
  2.2  4.1  3.5  4.5  2
  3.4  1.6  3.1  3.3  3.8
  2.5  4.3  3.4  3.6  2.9
  3.3  3.1  3.7  4.4  3.2
  4.9  3.8  3.2  2.6  3.9

Boxplot
Also known as a Box and Whisker plot
A graphic representation of the distribution of scores on a variable that includes the range, the median and the interquartile range
Provides information about 5 statistics of the data
  Minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3) and maximum value
[Figure: boxplot with Value on the y-axis and experiment number/name on the x-axis, marking the minimum value, the 25th, 50th and 75th percentiles, and the maximum value]

Boxplot
The thin line in the middle of the box represents the median of the distribution of scores
The top line of the box represents the 75th percentile of the distribution
The bottom line represents the 25th percentile
In other words, 50% of the scores on this variable in this distribution are contained within the upper and lower lines of this box

Drawing the boxplot
Step 1: Find the minimum and the maximum value in the data set
Step 2: Calculate P25, P50 (i.e. the median) and P75
Step 3: Draw a box from P25 to P75
Step 4: Split the box with a line at the median
Step 5: Draw a line (whisker) from P75 up to the maximum value
Step 6: Draw another line from P25 down to the minimum value
[Figure: boxplot of the data set above, with minimum value 1.6 and maximum value 4.9]

Draw the box plot for the following data set
  Marks = {10, 20, 40, 50, 70, 75, 80, 80, 85, 90, 100}
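
The percentile rule and the first two boxplot steps above can be sketched in Python. This is a minimal sketch that assumes the locator rule from these slides, L = (k/100)·n, with k a whole number from 1 to 99; other textbooks and software packages use different interpolation conventions, so their results may differ slightly. The function names `percentile` and `five_number_summary` are illustrative, not part of any library.

```python
def percentile(values, k):
    """k-th percentile (whole number k, 1..99) using the locator L = (k / 100) * n."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)                  # Step 1: ascending order
    n = len(ordered)
    # Step 2: L = k * n / 100, kept as quotient and remainder so that the
    # "is L an integer?" check is exact (no floating-point rounding).
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # Step 3, L is an integer: average the L-th value and the next one.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Step 3, L is not an integer: round up to position whole + 1 (1-based),
    # which is index `whole` in Python's 0-based list.
    return ordered[whole]


def five_number_summary(values):
    """Minimum, Q1 = P25, median = P50, Q3 = P75 and maximum."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


example = [1, 2, 2, 4, 4, 8, 9, 9, 9, 10]
print(percentile(example, 85))                # 9, as in the worked example

marks = [10, 20, 40, 50, 70, 75, 80, 80, 85, 90, 100]
low, q1, med, q3, high = five_number_summary(marks)
print(low, q1, med, q3, high)                 # 10 40 75 85 100
print("IQR =", q3 - q1)                       # IQR = 45
```

The five values returned for the Marks data are the ones needed for Steps 3 to 6: the box runs from Q1 to Q3, the line inside the box sits at the median, and the whiskers extend to the minimum and maximum.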