LEARNING OUTCOMES
▪ Understand the concepts of statistics
▪ Differentiate between population and sample
▪ Know how to represent, visualize, and display data
▪ Use Microsoft Excel or Scilab/Matlab to analyze, summarize, and display data

QUESTIONS
▪ What are the differences between a population and a sample?
▪ Identify the graphical representations of data.
▪ Explain the differences between a bar chart and a histogram.
▪ What is the difference between the median and the average?

POPULATION VS SAMPLE
▪ Population: the entire group of individuals about which we want to gather information
▪ Sample: the part of the population that we actually examine in order to obtain information

                     POPULATION                                              SAMPLE
Size                 $N$                                                     $n$
Mean                 $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$                    $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Variance             $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$     $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation   $\sigma = \sqrt{\sigma^2}$                              $s = \sqrt{s^2}$

IN CLASS ACTIVITY
▪ Collect the BMI data of all students in this class and enter it in the Google Sheet.
▪ Based on the collected data, categorize the data into the levels underweight (<18.5), normal (18.5-22.9), pre-obese (23.0-27.4), obese I (27.5-34.9), obese II (35.0-39.9), and obese III (>=40).
▪ Plot appropriate representations of the data.
▪ Summarize the data as follows:
Mean   Std. dev   n   Median   Range   Min   Max   Q1   Q2
▪ Take a sample of size 30 and compute its statistics.

CONCEPT OF STATISTICS
▪ Most real-world problems require statistics to draw conclusions.
▪ Steps involved in statistical analysis:
❑ Data Collection
▪ What is the size of the population (N)?
▪ How many samples?
▪ What is the size of the sample (n)?
▪ What type of data (discrete or continuous)?
❑ Data Organization/Representation
▪ How do you keep your data?
▪ How do you visualize the data?
❑ Data Analysis
▪ Methods and techniques via descriptive or inferential statistics
▪ What statistical methods/techniques do you use?
❑ Data Interpretation
▪ What can you draw from the analysis result?

WHAT IS DATA?
▪ Data is a collection of facts, such as numbers, words, measurements, observations, or even just descriptions of things.
▪ Types:
❑ Qualitative data is descriptive information (it describes something)
❑ Quantitative data is numerical information (numbers)

DATA COLLECTION
▪ Data can be collected in many ways.
▪ The simplest way is direct observation.
❑ Example: Counting Cars
▪ You want to find how many cars pass by a certain point on a road in a 10-minute interval.
▪ So: stand near that road, and count the cars that pass by in 10 minutes.
▪ You might want to count many 10-minute intervals at different times during the day, and on different days too!
▪ Experimental data collection
▪ You can also gather data through a survey.
❑ Example:
▪ You can survey people (through questionnaires, opinion polls, etc.) or things (like pollution levels in a river, or traffic flow)

HOW DO YOU REPRESENT DATA?
▪ Suppose you are to present the following data on sales for the month of February. What method would you choose?
"I can use graphs or charts or plots …"
• Do you know there are around 30 different choices of graphs?
• Yet, not all graphs are appropriate for presentation purposes.
• Thus, you need to choose the most suitable graph for your data.

HOW DO YOU REPRESENT DATA? (CONT.)
▪ Bar Graphs
▪ Pie Charts
▪ Dot Plots
▪ Line Graphs
▪ Scatter (x,y) Plots
▪ Pictographs
▪ Histograms
▪ Frequency Distribution and Grouped Frequency Distribution
▪ Stem and Leaf Plots
▪ Cumulative Tables and Graphs
▪ Graph Paper Maker
▪ a lot more can be added to this list

SOMETHING TO PONDER
▪ So far, we have learned several techniques to describe data using graphs / charts.
▪ Is it effective?
❑ Graphs / charts are effective at giving the overall view of a situation
▪ HOWEVER
❑ Graphs / charts cannot give precise information for inferential purposes (note: infer == to make conclusions)
▪ THUS: you need to add numerical representations

BASIC NUMERICAL REPRESENTATIONS

BASIC NUMERICAL REPRESENTATION (CONT.)
Draw the frequency histogram.
Calculate the mean, median, and mode for the number of quarts of milk purchased by the following 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
Mean? Median? Mode?

MEASURES OF CENTER
▪ A measure along the horizontal axis of the data distribution that locates the center of the distribution.
▪ What do you use as a measure of center? (a) Mean? (b) Median? (c) Mode?
• Not all three are suitable to describe a distribution in ALL cases

EXTREME VALUES
▪ The mean is more easily affected by extremely large or small values than the median.
▪ The median is often used as a measure of center when the distribution is skewed.

SKEWNESS
▪ A measure of asymmetry in a statistical distribution
▪ 0 indicates perfect symmetry
▪ Negative indicates a long left tail: more values lie above the mean
▪ Positive indicates a long right tail: more values lie below the mean

KURTOSIS
▪ A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution, which has a kurtosis of 3
▪ Positive excess kurtosis (kurtosis > 3) indicates that the distribution has heavier tails than the normal distribution
▪ Negative excess kurtosis (kurtosis < 3) indicates that the distribution has lighter tails than the normal distribution

SKEWED RIGHT (POSITIVELY SKEWED)
▪ Skewed Right: long tail to the right
▪ A few high numbers pull the mean above the median
The set:
Num.   Frequency
1      3
2      5
3      3
4      1
Mean = [1(3) + 2(5) + 3(3) + 4(1)] / 12 = 2.17
Median = 2
Mean > Median

SKEWED LEFT (NEGATIVELY SKEWED)
▪ Skewed Left: long tail to the left
▪ A few low numbers pull the mean below the median
The set:
Num.   Frequency
1      1
2      3
3      5
4      3
Mean = [1(1) + 2(3) + 3(5) + 4(3)] / 12 = 2.83
Median = 3
Mean < Median

QUARTILES
▪ Quartiles are the values that divide a sorted list of numbers into quarters
▪ Q1 is the value below which 25% of the measurements in the dataset lie
▪ Q2 = Median
▪ Interquartile range = Q3 - Q1
▪ Calculate all quartiles for the following numbers: 10, 2, 4, 7, 8, 5, 11, 3, 12
Note: the formula does not give you the value of the quartile; it gives you its place in the sorted list.

BOX AND WHISKER PLOT
▪ A convenient way of visually displaying the data distribution through its quartiles
https://support.microsoft.com/en-us/office/create-a-boxplot-10204530-8cdf-40fe-a711-2eb9785e510f

MEASURES OF CENTRE VS. VARIABILITY
"I was told that the average height of plants here is only 1 foot. But this tree is 10 feet high!!!"
Often, a measure of centre alone does not give the true picture. We also need to know the variability around the centre.

MEASURES OF VARIABILITY
▪ A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.

THE RANGE
▪ The range, R, is the difference between the largest and smallest measurements.
▪ Example: A botanist records the number of petals on 5 flowers: 5, 12, 6, 8, 14
▪ The range is R = 14 - 5 = 9

THE VARIANCE
▪ The variance is a measure of variability that uses all the measurements (as opposed to the range R, which uses only 2 measurements, the maximum and the minimum).
▪ It measures the average squared deviation of the measurements from their mean.
▪ Flower petals: 5, 12, 6, 8, 14
Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points (for a sample, divide by n - 1 instead).

THE VARIANCE (CONT.)
▪ The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean $\mu$: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
▪ The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n - 1): $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

THE STANDARD DEVIATION
▪ In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements.
▪ To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.

2 WAYS TO CALCULATE THE SAMPLE VARIANCE
Use the Definition Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
       $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
       5       -4                16
       12      3                 9
       6       -3                9
       8       -1                1
       14      5                 25
Sum    45      0                 60
With $\bar{x} = 45/5 = 9$, this gives $s^2 = 60/4 = 15$.

2 WAYS TO CALCULATE THE SAMPLE VARIANCE (CONT.)
Use the Calculation Formula: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
       $x_i$   $x_i^2$
       5       25
       12      144
       6       36
       8       64
       14      196
Sum    45      465
This gives $s^2 = [465 - (45^2/5)] / 4 = 60/4 = 15$, the same value as the definition formula.

SOME NOTES
• The value of s is never negative; it is zero only when all measurements are equal.
• The larger the value of $s^2$ or s, the larger the variability of the data set.
• Why divide by n - 1?
• The sample standard deviation s is often used to estimate the population standard deviation $\sigma$. Dividing by n - 1 gives us a better estimate of $\sigma$.

EXERCISE 1
1. Question: Find the mean, median and mode of: 5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25
Solution: First, arrange the data: 3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 25
median = 6; mean = 80/11 = 7.27; modes = 5 and 6
2. Question: Eliminate the last observation x = 25 and then find the mean, median and mode. How do these values compare with those found using the full data set?
Solution: median = 5.5; mean = 55/10 = 5.5; modes = 5 and 6. The mean is much smaller (5.5 versus 7.27), the median changes only slightly (5.5 versus 6), and the modes are unchanged.
3. Question: How do possible outliers (such as 25) affect these values?
Solution: The mean is strongly affected by the outlier, while the median and mode are barely affected.

EXERCISE 2
Given the observations 7, 9, 10, 6, 8, 7, 8, 9, 8 calculate:
1. the range
Solution: R = 10 - 6 = 4
2. the mean
Solution: Mean = 72 / 9 = 8
3. the variance
Solution: Variance = [588 - (72^2/9)] / 8 = 12 / 8 = 1.5
4. the standard deviation
Solution: Standard Deviation = √1.5 = 1.225

ACTIVITY
▪ Kindly put your full name and student ID, and upload your 30-second introductory video. The video should contain an introduction about yourself (where you live, what your hobbies are, etc.) and what you expect from this course.
▪ https://www.csusm.edu/qc/facultydocuments/biofolder/bio353.pdf
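
The worked answers in the flower-petal example, Exercise 1, and Exercise 2 can be cross-checked with a short script. The sketch below uses Python purely for illustration (the course itself names Excel and Scilab/Matlab), and every function and variable name in it is made up for this example. It computes the sample variance both ways, with the definition formula and with the calculation formula, so the two can be compared on the same data.

```python
import math
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def data_range(values):
    """Range R: largest measurement minus smallest measurement."""
    return max(values) - min(values)


def sample_variance_definition(values):
    """Definition formula: sum of squared deviations from the mean, divided by n - 1."""
    xbar = mean(values)
    return sum((x - xbar) ** 2 for x in values) / (len(values) - 1)


def sample_variance_calculation(values):
    """Calculation formula: (sum of x^2 - (sum of x)^2 / n), divided by n - 1."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


# Data sets taken from the slides.
petals = [5, 12, 6, 8, 14]
exercise_1 = [5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25]
exercise_2 = [7, 9, 10, 6, 8, 7, 8, 9, 8]

print("Petals: range =", data_range(petals))
print("Petals: s^2 (definition)  =", sample_variance_definition(petals))
print("Petals: s^2 (calculation) =", sample_variance_calculation(petals))

# Exercise 1, with and without the outlier 25.
without_outlier = [x for x in exercise_1 if x != 25]
for label, data in (("full data", exercise_1), ("without 25", without_outlier)):
    print(label, "-> mean:", round(mean(data), 2),
          "median:", median(data), "modes:", modes(data))

# Exercise 2: range, mean, variance, standard deviation.
variance_2 = sample_variance_calculation(exercise_2)
print("Exercise 2: range =", data_range(exercise_2), "mean =", mean(exercise_2))
print("Exercise 2: s^2 =", variance_2, "s =", round(math.sqrt(variance_2), 3))
```

Both variance functions divide by n - 1, so they compute the sample variance, not the population variance. Python's standard library draws the same distinction: `statistics.variance` divides by n - 1, while `statistics.pvariance` divides by N.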