Welcome to CSC323
Data Analysis and Statistical Software I
Winter 2004

Instructor: Raffaella Settimi
Office:
  Loop campus: CST 716 - Tel. (312) 362-5556
  O'Hare campus:
Email: rsettimi@cs.depaul.edu
Contact hours:
  Monday 4:00 - 5:30 pm at the O'Hare campus
  Tuesday 3:30 - 5:00 pm at the Loop campus

Course web page: http://facweb.cs.depaul.edu/rsettimi/323
Check the web page regularly for news and announcements. Course documents and homework assignments will be posted on the course homepage.
Lectures can be seen online on the DL website: http://dlweb.cs.depaul.edu
The lectures will be available in the morning following the class.

Course topics
The course will discuss simple statistical methods and basic concepts of probability theory. The topics of the course are:
1. Descriptive statistics and representing data using graphs.
2. Linear regression models.
3. Sampling and experimental design.
4. An introduction to statistical inference:
   1. confidence intervals and
   2. hypothesis testing.

We will use the statistical package SAS.
The statistical software SAS runs on
• UNIX (accounts on Hawk are available to students) and
• PCs (available in the computer labs).
Check the course web page at http://facweb.cs.depaul.edu/rsettimi/323/sasinstructions.htm for more information on SAS availability.

Required Texts:
Introduction to the Practice of Statistics, Fourth Edition, by D.S. Moore and G.P. McCabe (2003). ISBN: 0-7167-9657-0
Recommended SAS manual:
SAS Manual for Moore and McCabe's Introduction to the Practice of Statistics. Michael Evans, Freeman. Third edition, 1999. ISBN: 0-7167-3657-8

The course syllabus provides more detailed course information. The syllabus is posted on http://facweb.cs.depaul.edu/rsettimi/323/csc_323.htm

Grading
Homework and programming assignments (35%). No homework this week!! Homework is due on Monday in class, or it can be submitted online at http://dlweb.cs.depaul.edu.
Late assignments will be accepted no later than three days after the due date (typically by the following Thursday). Note that a 10% penalty will be applied for each day after the deadline.
Quizzes (15%). There will be two short tests, tentatively scheduled for week 3 and week 8. Students are allowed to bring a single page of notes and a calculator. There will be no make-up quizzes.
Midterm (30%) on Feb 9th, 2004 at 6:15 - 7:45 pm. It is a closed-book exam; students are allowed to bring a single page of notes and a calculator.
Final (35%) on March 15th at 6:15 - 8:30 pm. It is a closed-book exam; students are allowed to bring two pages of notes and a calculator.

Homework submission
• Homework assignments will be posted on Tuesdays.
• Homework solutions will be due in class on Monday. (Only legible, organized homework will be graded. Include your name, section number, date, and homework number on the first page of your assignment. Staple pages together.)
• Alternatively, homework can be submitted online at the dlweb site http://dlweb.cs.depaul.edu. The online submission application will let you submit only one document, so use a word processor to collate your solutions in a single file.
• Duplication of homework solutions and computer output prepared in whole or in part by someone else is not permitted.

Lecture 1 Outline
• Exploratory data analysis (Sec. 1.1, 1.2): discovering information from the data through graphs and numbers.
• Introduction to the statistical package SAS

Exploratory Data Analysis
The goal of statistics is to gain information from the data.
Data come from several sources:
1. Available data: Census data, Federal agencies, Governmental Statistical Offices (www.fedstats.gov), General Social Survey at the University of Chicago's NORC (http://www.icpsr.umich.edu/GSS/). Several databases are available on the Internet or at the DePaul library!!
2. New data:
   • Sampling from the population of interest: observational studies.
   • Conducting statistical experiments: medical trials, controlled experiments. When well designed, these provide the most reliable source of information!!

What's the next step after the data collection?
Long listings of data are of little value. Statistical methods come to help us.
Exploratory data analysis: a set of methods to display and summarize the data.
Data on just one variable: the distribution of the observations is analyzed by
I. Displaying the data in a graph that shows overall patterns and unusual observations (histogram, box plot, density curve).
II. Computing descriptive statistics that summarize specific aspects of the data (center and spread).

Random variables
Data contain information about a group of individuals / subjects.
A variable is a characteristic of an observed individual which takes different values for different individuals:
• A quantitative variable (continuous) takes numerical values. Ex.: Height, Weight, Age, Income, Measurements.
• A qualitative/categorical variable classifies an individual into categories or groups. Ex.: Sex, Religion, Occupation, Age (in classes, e.g. 10-20, 20-30, 30-40).
The distribution of a variable tells us what values it takes and how often it takes those values.
Different statistical methods are used to analyze quantitative or categorical variables.

Graphs for categorical variables
The values of a categorical variable are labels. The distribution of a categorical variable lists the count or percentage of individuals in each category.

Wireless surfers by age (a sample of 400 wireless internet users)
Age            Count   Percentage
18-34           212       53%
35-54           168       42%
55 and over      20        5%
[Figures: bar chart and pie chart of wireless surfers by age]

Wireless surfers by gender
Wireless internet users   Count   Percentage
Male                       288       72%
Female                     112       28%
Total                      400      100%
[Figure: bar chart of wireless surfers by gender]

Example: On the morning of April 10, 1912 the Titanic sailed from the port of Southampton (UK) bound for New York.
Altogether there were 2,201 passengers and crew members on board. The table below counts the survivors and victims of the famous tragic accident.

                 Survived            Dead
                 Male   Female       Male   Female
First class        62     141         118       4
Second class       25      93         154      13
Third class        88      90         422     106
Crew members      192      20         670       3

Define the categorical variables.
[Figure: bar chart representing the data in the table above (in percentages). Groups: male survivors, female survivors, male victims, female victims. Bars within each group: first class, second class, third class, crew.]

Graphs for quantitative variables: Stemplots
Stemplot ~ stem-and-leaf plot
To make a stemplot:
1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Example: Babe Ruth home runs
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

Stem and leaf plot
2 | 2 5
3 | 4 5
4 | 1 1 6 6 6 7 9
5 | 4 4 9
6 | 0
Key: 3|5 means 35 home runs. Stems = 10's, Leaves = 1's.

Stemplots (cont.)
• Back-to-back stemplot
• How do stemplots deal with large data sets? Splitting stems: one stem with leaves between 0 and 4, one stem with leaves between 5 and 9.
• How do stemplots deal with observations having many digits? Rounding.

Stemplots (cont.)
Advantages of stemplots:
• They describe the shape of a distribution for small data sets.
Disadvantages:
• They don't work well with large data sets, since they display the individual values of the variable.
• They divide the observations into groups (stems) determined by the number system rather than by judgment.

Graphs for quantitative variables: the histogram
Example: CEO salaries
Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment.
The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms. (Data at http://lib.stat.cmu.edu/DASL/DataArchive.html )

Salary of chief executive officer (including bonuses), in $thousands
 145  621  262  208  362  424  339  736  291   58
 498  643  390  332  750  368  659  234  396  300
 343  536  543  217  298 1103  406  254  862  204
 206  250   21  298  350  800  726  370  536  291
 808  543  149  350  242  198  213  296  317  482
 155  802  200  282  573  388  250  396  572

Drawing a histogram
1. Construct a distribution table:
   i. Define class intervals or bins (choose intervals of equal width!).
   ii. Count the percentage of observations in each interval.
   iii. End-point convention: the left endpoint of the interval is included and the right endpoint is excluded, i.e. [a, b).
2. Draw the horizontal axis.
3. Construct the blocks: height of block = percentage! The total area under a histogram must be 100%.

Class interval           Frequency   Percentage =
(left endpoint included)             (frequency/total) x 100
0-100                        2       2/59 x 100 = 3.39
100-200                      4       4/59 x 100 = 6.78
200-300                     18       30.50
300-400                     14       23.73
400-500                      4       6.78
500-600                      6       10.18
600-700                      3       5.08
700-800                      3       5.08
800-900                      4       6.78
900-1000                     0       0
1000-1100                    0       0
1100-1200                    1       1.70
Total                       59       100%

[Figure: histogram of the CEO salaries. Block labels: 3.39% (0-100), 30.50% (200-300), 23.73% (300-400), 1.70% (last block). The area of each block represents the percentage of cases in the corresponding class interval (or bin).]

Remarks
• A histogram represents percent by area. The area of each block represents the percentage of cases in the corresponding class interval.
• The total area under a histogram is 100%.
• There is no fixed choice for the number of classes in a histogram: if class intervals are too small, the histogram will have spikes; if class intervals are too large, some information will be missed. Use your judgment!
• Typically statistical software will choose the class intervals for you, but you can modify them.
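As a rough illustration of steps i to iii, the sketch below builds the distribution table for the CEO salaries with equal-width bins and the left-endpoint convention. It is written in Python purely for illustration (it is not the course's SAS software), and the function name `distribution_table` is a hypothetical helper, not part of any package. It assumes integer observations and an integer bin width. Note that under the left-endpoint convention the salary 1103 falls in the interval 1100-1200.

```python
# CEO salaries in $thousands, copied from the listing above (59 values).
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58,
    498, 643, 390, 332, 750, 368, 659, 234, 396, 300,
    343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291,
    808, 543, 149, 350, 242, 198, 213, 296, 317, 482,
    155, 802, 200, 282, 573, 388, 250, 396, 572,
]


def distribution_table(data, width, start=0):
    """Hypothetical helper: rows (left, right, count, percent) for bins [left, right).

    Assumes integer data and an integer bin width.
    """
    n = len(data)
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        # Integer division puts x == left endpoint into the bin that starts
        # there, so the left endpoint is included and the right one excluded.
        counts[(x - start) // width] += 1
    return [
        (start + i * width, start + (i + 1) * width, c, 100.0 * c / n)
        for i, c in enumerate(counts)
    ]


table = distribution_table(salaries, width=100)
for left, right, count, percent in table:
    print(f"{left:>5}-{right:<5} {count:>3} {percent:6.2f}%")
```

Because all bins have the same width, the percentages can be used directly as block heights; with unequal widths the height would have to be percentage divided by width so that area, not height, represents the percentage.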
SMOKING
In a Public Health Service study, a histogram was plotted showing the number of cigarettes smoked per day by each subject (male current smokers). The density is marked in parentheses. The class intervals include the left endpoint, but not the right.
[Figure: histogram of the number of cigarettes smoked per day, with the density of each block in parentheses]
(There are 20 cigarettes in a pack.)
1. The percentage who smoked less than two packs a day but at least a pack is around: 1.5% / 15% / 30% / 50%
2. The percentage who smoked at least a pack a day is around: 1.5% / 15% / 30% / 50%
3. The percentage who smoked at least 3 packs a day is around: 0.25 of 1% / 0.5 of 1% / 10%
4. The percentage who smoked 20 cigarettes a day is around: 0.35 of 1% / 0.5 of 1% / 1.5% / 3.5% / 10%

Answers:
1. It is given by the area of the third block: 1.5 x (40 - 20) = 1.5 x 20 = 30%.
2. It is given by the area of the third and fourth blocks: 30 + 0.5 x 40 = 50%.
3. It is the area of the block for a number of cigarettes greater than or equal to 60. This is half of the fourth block: 10%.
4. We use the left-endpoint convention, so 20 belongs to the third block. The answer is 1.5%.

Using histograms for comparisons
Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, pg 38)
[Figures: histograms of city consumption and highway consumption]

Describing distributions with numbers
A distribution can be described through measures of its center and of its spread.

Measuring the center
The most common measures are the mean (or average) and the median.

1. The mean or average $\bar{x}$
To calculate the average $\bar{x}$ of a set of observations, add their values and divide by the number of observations:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Data: number of home runs hit by Babe Ruth as a Yankee
54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
The mean number of home runs hit in a year is:
$\bar{x} = \frac{54 + 59 + 35 + 41 + 46 + \dots + 41 + 34 + 22}{15} = \frac{659}{15} \approx 43.9$

2. The median
The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger.
To find the median:
1. Sort all the observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list, i.e. M is the (n+1)/2-th observation.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.

Example 1: Ordered list of home runs hit by Babe Ruth:
22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
n = 15, Median = 46 (the 8th observation)

Example 2: Ordered list of home runs hit by Roger Maris in each of 10 seasons (the 61 is from 1961):
8 13 14 16 23 26 28 33 39 61
n = 10, Median = (23 + 26)/2 = 24.5

Mean versus Median
1. The mean and median of a symmetric distribution are close together.
2. In skewed distributions, the mean is farther out in the long tail than is the median. The mean is more sensitive to extreme values.
[Figures: symmetric distribution (mean and median together at the 50% point); right-skewed distribution (median to the left of the mean); left-skewed distribution (mean to the left of the median)]

Mean or median?
• The mean is a good measure for the center of a symmetric distribution.
• The median is a resistant measure and should be used for skewed distributions. Its value is only slightly affected by the presence of extreme observations, no matter how large these observations are.

Example: Shopping in a supermarket
A marketing consultant observed 50 consecutive shoppers at a supermarket. The histogram shows how much each shopper spent in the store.
[Figure: histogram of the amount spent by each shopper]
Summary statistics: Mean = $34.70, Median = $27.855
The mean does not say much…
The median says that about 50% of the shoppers spent less than 28 dollars.
What else would you like to know?

Spread of a Distribution
Two measures of spread:
1. The quartiles:
The first quartile Q1 = the value such that 25% of the observations fall at or below it (Q1 is often called the 25th percentile).
The third quartile Q3 = the value such that 75% of the observations fall at or below it (Q3 is often called the 75th percentile).
Quartiles are typically used if the distribution of the observations is skewed.
The inter-quartile range IQR is defined as the distance between the two quartiles: IQR = Q3 - Q1
[Figure: distribution with Q1, M and Q3 marked; the IQR spans from Q1 to Q3]

Example: Shopping in a supermarket (cont.)
Summary statistics:
Mean = $34.70
Median = $27.855
Q1 = $19.27
Q3 = $45.40
IQR = 45.40 - 19.27 = 26.13
About 50% of the shoppers spent less than 28 dollars, 25% spent less than 20 dollars, and 25% of the customers of the store spent more than 45 dollars. Moreover, 50% of the customers spent between 20 and 45 dollars!
Extreme values: purchases greater than Q3 + 1.5 x IQR = 84.59
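
The numerical summaries used in this lecture can be reproduced with a few lines of code. The sketch below is in Python for illustration only (it is not SAS output), and the function names `mean`, `median` and `upper_fence` are hypothetical helpers. The median follows the odd/even rule stated earlier, and the cutoff for extreme values is Q3 + 1.5 x IQR. The quartiles are taken as given from the supermarket example, since the raw shopping data are not listed; with those rounded quartiles the cutoff comes out at about 84.6, in line with the 84.59 quoted above.

```python
def mean(values):
    """Add the observations and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Center observation if n is odd, mean of the two center ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]          # the (n+1)/2-th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


def upper_fence(q1, q3):
    """Cutoff for extreme values: Q3 + 1.5 x IQR, with IQR = Q3 - Q1."""
    iqr = q3 - q1
    return q3 + 1.5 * iqr


# Home run data from the examples above.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print("Ruth mean:", round(mean(ruth), 1))
print("Ruth median:", median(ruth))
print("Maris median:", median(maris))

# Quartiles as reported in the supermarket example (assumed given, not computed).
print("Upper cutoff for extreme purchases:", upper_fence(19.27, 45.40))
```

Replacing Maris's 61 with a much larger number changes the mean but leaves the median at 24.5, which is the resistance property described under "Mean or median?".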