Math 116 Chapter 12

Topics:
- Graphical display: histogram.
- Using numbers: measures of center and spread.
- Sampling and the Law of Large Numbers.

Definitions:
- An observation is a single number. It could be a measurement, a monetary amount, etc. It can also be considered one particular outcome of some random trial.
- Raw data is a collection of observations.
- Frequency is the number of observations in each bin.

Graphical Display of Data:
There are two types of data:
1. Quantitative, e.g. closing prices, ratios
2. Categorical, e.g. gender, political affiliation
There are different ways to display data visually. Histograms are commonly used to display quantitative data. Pie charts are one way to display categorical data.

[Figure: Histogram of the percentages of weekly ratios of Disney stock. Vertical axis: relative frequencies, from 0 to 0.5. Horizontal axis: weekly ratios, from 0.73 to 1.16.]

More definitions:
- Relative frequency is the proportion (percentage) of observations in each bar.
- A frequency distribution is a chart which shows the bins and their frequencies.

Things to look for in a histogram:
- the overall pattern or shape of the distribution
- major peaks
- rough symmetry or clear skewness (skewed to the right or to the left)
- the center and spread of the data
- any striking deviations from the pattern

Another way of describing our data is with numbers:
- Measures of center or central tendency: mean, median, mode
- Measures of spread or variation: range, variance, standard deviation

How to find the mean or average:
The "average" or mean is the sum of all observations divided by the number of observations:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

Another measure of center or central tendency:
The median is the middle value of the data set.
How to find the median (M):
- Arrange the data in ascending or descending order.
- If the number of observations n is odd, M is the value in position (n+1)/2.
- If n is even, M is the average of the two middle values, in positions n/2 and n/2 + 1.

Examples:
- 40, 75, 80, 80, 96, 100: mean = 78.5, median = 80
- 40, 75, 80, 96, 100: mean = 78.2, median = 80

Mean versus Median:
The mean is the more popular way to measure center, but it is more sensitive to extreme values than the median. E.g. which of the two measures better reflects the average price of a home?
- If the distribution is symmetric, the mean and median are the same.
- If the distribution is skewed, the mean is farther out in the long tail than the median.

Example: In 1993, the mean and median salaries paid to major league baseball players were $490,000 and $1,160,000. Which one is the mean? Which is the median? Explain.

Example: Measures of center are not enough.
Final exam in math class section 1: 80, 80, 80, 80, 80
Final exam in math class section 2: 30, 80, 90, 100, 100
Note: the mean is 80 in both sections, but the datasets are different. (Measuring center is not enough to describe the data; we need measures of spread.)

Measures of Spread:
Range: the difference between the largest and the smallest observation.
E.g. let us look at the 3 datasets below:
A: 195, 200, 205, 215, 219, 225, 226, 235
B: 195, 210, 213, 214, 216, 218, 219, 235
C: 208, 209, 210, 210, 211, 211, 213, 248
The range for each dataset is 40, but the datasets are different from each other.
The range is strongly influenced by extreme values and takes into account only two observations in the whole dataset.

Standard Deviation s (the most common and popular measure of spread): measures how far the observations are from the mean; it is the square root of the variance.
Formula for the variance and standard deviation:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2, \qquad s = \sqrt{s^2}$$
Let us find the variance and standard deviation by hand one time only. Use Excel at other times.
Example:
Math test scores: 30, 80, 90, 100, 100

Back to the samples:
A: 195, 200, 205, 215, 219, 225, 226, 235
B: 195, 210, 213, 214, 216, 218, 219, 235
C: 208, 209, 210, 210, 211, 211, 213, 248
Range for all three sets: 40
Mean for all three sets: 215
SD for set A = 13.95
SD for set B = 11.06
SD for set C = 13.42

Interpretations:
- Variance is the average of the squared deviations of the observations from the mean.
- Standard deviation is the square root of the variance (so that it has the same units as the observations). Hence, it is a single value that measures the dispersion of the data about the mean.
- A larger standard deviation indicates a more spread-out set of data points.
- We divide by n-1 rather than n to get the average (to be more conservative with our estimate).

Open the Excel file data.xls. In the second column, generate new data = old data + constant. In the third column, generate new data = old data multiplied by a constant. Find the mean, variance and standard deviation for each column. What do you notice?

Adding a number to each observation:
If a number b is added (or subtracted):
- The mean increases (or decreases) by b.
- The variance does not change.
- The standard deviation does not change.

Multiplying each observation by a number:
If each observation is multiplied by a number a:
- The mean is multiplied by a.
- The variance is multiplied by a^2.
- The standard deviation is multiplied by |a| (which is a when a is positive).

Sampling:
Some definitions:
- Population: the entire group of individuals or objects that we want information about.
- Sample: the part of the population that we actually analyze in order to gather information.
- Parameter: a number that describes a population, e.g. population mean, population standard deviation.
- Statistic: a number that describes a sample, e.g. sample mean, sample standard deviation.

Reasons for sampling:
- It may be impossible to take measurements of the whole population.
- Samples are quicker, easier, cheaper.
- If done properly, a sample is enough to give us the needed information about the population.
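
The summary statistics quoted earlier for samples A, B and C, and the effect of adding or multiplying by a constant, can be checked with a short Python sketch. Python is used here only as an illustration in place of Excel; the standard library `statistics` module divides by n-1 in `variance` and `stdev`, matching the formula in these notes. The shift b = 5 and the scale a = 2 are arbitrary values chosen for illustration and are not from the notes.

```python
import statistics

# The three samples from the notes.
datasets = {
    "A": [195, 200, 205, 215, 219, 225, 226, 235],
    "B": [195, 210, 213, 214, 216, 218, 219, 235],
    "C": [208, 209, 210, 210, 211, 211, 213, 248],
}


def summarize(data):
    """Return the measures of center and spread discussed in the notes."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance, divides by n - 1
        "sd": statistics.stdev(data),           # square root of the sample variance
    }


# Same mean and range for all three samples, different standard deviations.
for name, data in datasets.items():
    s = summarize(data)
    print(name, "mean:", s["mean"], "range:", s["range"], "sd:", round(s["sd"], 2))

# Math test scores from the by-hand example.
scores = [30, 80, 90, 100, 100]
shifted = [x + 5 for x in scores]  # add b = 5 (illustrative value)
scaled = [2 * x for x in scores]   # multiply by a = 2 (illustrative value)

for label, data in [("original", scores), ("shifted", shifted), ("scaled", scaled)]:
    s = summarize(data)
    print(label, "mean:", s["mean"], "variance:", s["variance"], "sd:", round(s["sd"], 2))
```

Comparing the three printed rows for the test scores shows the pattern stated above: the shift changes only the mean, while the scale changes the mean, the variance and the standard deviation.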
Random Sampling:
Simple random sampling: everyone in the population has an equal chance of being selected for the sample.
Types of random sampling:
- Simple random sampling: draw names from a hat, balls from a basket, etc., or use computer software to generate random numbers, or a table of random digits.
- Stratified random sampling: e.g. for the Seattle population, the strata could be economic status, race, gender, marital status, etc.
- Systematic random sampling: e.g. every 10th observation is chosen.

Law of Large Numbers:
With random sampling and a large sample, we can use the statistic of a sample to estimate the parameter of a population.

Volatility:
Volatility is a measurement of how much the value of a stock fluctuates. A common way of measuring the volatility of a stock is to find the annualized standard deviation of the ratios of its closing prices (weekly ratios, in our project). There are other types of volatility, but this is the one we are going to use in our project.
To annualize the standard deviation:
- For monthly ratios, multiply the standard deviation by the square root of 12.
- For weekly ratios (which we use), multiply by the square root of 52 (52 weeks in a year).
- For daily ratios, multiply by the square root of 252 (252 business days in a year).

Focus on the Project:
Suppose our mean weekly ratio is 1.001894. Let's call it Rm, for the "mean of the ratios". From chapter 11, our computed weekly risk-free ratio is approximately 1.0007695. Let's call it Rrf, for the risk-free rate. Note that Rm is too large: it exceeds Rrf.
This means that, on average, each of our weekly ratios is too large. Specifically, each ratio is in excess by (Rm-Rrf). In the example above, 1.001894 - 1.0007695 = 0.0011245.
To adjust our weekly ratios so that their mean equals the weekly risk-free ratio, we reduce each ratio by (Rm-Rrf). We call this normalizing each ratio. Hence,
$$R_{norm} = R_i - (R_m - R_{rf})$$
where R_norm is the normalized ratio, R_i is the weekly ratio, and (R_m - R_rf) is the ratio excess.
By normalizing our ratios, our new mean will match the weekly risk-free rate.
In our example above, New Mean = Old mean – (Rm-Rrf) = 1.001894 – 0.0011245 = 1.0007695
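
The normalization step and the annualized volatility described above can be sketched in Python as follows. This is a minimal illustration, not the project's Excel workflow: the function names `normalize` and `annualized_volatility` are made up for this sketch, and the list `weekly_ratios` contains hypothetical values, not the project data. Only the risk-free ratio 1.0007695 and the factor of 52 weeks come from the notes.

```python
import math
import statistics

WEEKS_PER_YEAR = 52  # weekly ratios are annualized with sqrt(52)


def normalize(ratios, risk_free_ratio):
    """Subtract the excess (Rm - Rrf) from every ratio so the new mean equals Rrf."""
    excess = statistics.mean(ratios) - risk_free_ratio
    return [r - excess for r in ratios]


def annualized_volatility(ratios, periods_per_year=WEEKS_PER_YEAR):
    """Sample standard deviation of the ratios times sqrt(periods per year)."""
    return statistics.stdev(ratios) * math.sqrt(periods_per_year)


# Hypothetical weekly ratios, for illustration only (NOT the project data).
weekly_ratios = [1.012, 0.987, 1.004, 0.995, 1.021, 0.998]
r_rf = 1.0007695  # weekly risk-free ratio quoted in the notes

normalized = normalize(weekly_ratios, r_rf)

print("old mean:", statistics.mean(weekly_ratios))
print("new mean:", statistics.mean(normalized))
print("volatility before:", annualized_volatility(weekly_ratios))
print("volatility after: ", annualized_volatility(normalized))
```

Because normalizing subtracts the same constant from every ratio, the mean moves to the risk-free ratio while the standard deviation, and therefore the volatility, stays the same (up to floating-point rounding). This is the "adding a number to each observation" rule from earlier in the chapter.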