C. Pang’s Elementary Statistics Notes & Worksheets (v.090728b)
ISBN-5-6987458-1-1 © Chi-yin Pang, 2009 No Refund/No Buy-back Value

Contents
1. Basics
 a. Basic Terms
 b. Critical Thinking
2. Organizing Data
 a. Graphs of Quantitative Data
 b. Statdisk Instruction
 c. Project 1 Assignment and Example
 d. Graphs of Categorical Data
3. Center and Variation
 a. Measures of Center
 b. Standard Deviation
4. Probability
 a. Contingency Table (“Titanic”)
 b. Stacked Bar Graph (Assessing Dependency)
 c. Binomial Distribution
  i. Binomial Motivation Example (Juror Selection Process)
  ii. Binomial Computation
  iii. Binomial Application
  iv. Binomial Mean and Standard Deviation
 d. Normal Distribution
  i. Height Distributions of 2-Year-Olds & 20-Year-Olds
  ii. Normal CDF Computation
  iii. Inverse Normal CDF Computation
  iv. SAT vs. ACT Example (“Donald & Micky”)
  v. Central Limit Theorem (CLT)
  vi. Normal Distribution Application (Setup)
  vii. Central Limit Theorem (CLT) Application (Setup)
5. Inferential Statistics
 a. Confidence Interval (CI)
  i. “CI Facts” (Summary)
  ii. Interpretation of CI
 b. Hypothesis Test (HT)
  i. Types of Errors & Power of HT
  ii. Defining H0 and H1
 c. Project 2: CI or HT Assignment & Examples
 d. 2-Sample Mean HT (or CI) Data Collection (for Independent Samples)
 e. 2-Sample Proportion HT (or CI) Data Collection
 f. Correlation & Regression (Car Weight vs. MPG)
 g. Chi-Square Goodness-of-Fit Test (Distribution of Categories)
 h. Test of Dependency (Gender & Politics; “Titanic”)
6. CI & HT Flowcharts; HT Setup and Decision Methods

Elementary Statistics Basic Terms Worksheet Ref. [Tri 2008] NAME: _______________________ (Ver.090117) Subject (The individuals that provide the data.) Question to or about the subject A survey of 35 students at Laysie College has an average of 4.3 units per student. NY City has 3250 walk buttons. 77% of them (2502 buttons) are broken. Sample Statistic or Parameter p.5 Stat Parameter Stat Parameter A sample of 877 executives was surveyed. 45% of them (395) would not hire someone with a typo on their job application. p.10 #5 Stat Parameter Stat Parameter p.10 #6 Stat Parameter p.10 #7 Stat Parameter p.10 #8 Stat Parameter p.11 #21 Stat Parameter p.11 #22 Stat Parameter p.11 #23 Stat Parameter p.11 #24 © Chi-yin Pang, 2009 Population (All the subjects of interest.) Stat Parameter Page 1 of 2 Elementary Statistics Basic Terms Worksheet NAME: _______________________ (Ver.090117) Quantitative or Categorical p.6 Discrete or Continuous (no gaps) p.6 Nominal or Ordinal or Interval or Ratio Breed of dog. (Bull, retriever, etc.) Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Cell phone’s area code. Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Size of coffee (small, medium, large) Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Outdoor noontime temperature in degrees F. Number of people in a country. Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Weight of a candy. 
Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #10 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #11 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #12 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #13 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #14 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #15 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #16 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #17 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.10 #18 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.11 #19 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio p.11 #20 Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio Quant Categorical Discrete Continuous Nominal Ordinal Interval Ratio © Chi-yin Pang, 2009 pp.7-9 Page 2 of 2 Elementary Statistics Critical Thinking Examples (v.090207+) Sampling Problems: Some samples do not represent the population, therefore, the sample statistic is useless for making inference about the population. Situation More background Problem From a randomly picked sample, The statistics is from Don’t make conclusion with small sample. [Tri08, p.13] (We will I have concluded, at SJCC, a sample of 3 cars: 66.7% of people drive Honda and 2 Hondas and learn how to 33.3% of people drive Toyota. 1 Toyota. determine the minimal sample size later.) Another e.g.: [Huff54] www.RateMyProfessors.com The July 27, 2001 Orange County Register reported: “All but 2% of [home] buyers have at least one computer at home. …” [BVD07, p.282] Conduct U.S. Census by telephone. [Tri08, p.16] Conduct opinion poll over the phone. [Tri08, p.16] Only the students who felt strongly took the time to write. The survey were conducted via the Internet. Voluntary sampling misses the “middle spectrum” of the population. Don’t use voluntary sampling! Don’t trust the result. [Tri08, p.12] This Convenience sampling the home buyers who do not use a computer. It does not represent the population. Do not believe the result! [Tri08, p.28] Homeless and poor A segment of the population is missing. Beware may not have phone. of Missing Data. [Tri08, p.16] Some won’t answer to Those are afraid of telemarketers are not avoid “selling under represented. Beware of Non-response. [Tri08, p.16] the guise” of a poll. Survey Questions Problems: Some survey question can be “loaded” to encourage a certain answer. Even the order of question may have an effect to the answer. (See [Tri08, p.16] for examples.) Representation and Interpretations Problems: The following could be from ignorance (fooling self and others). They can also be intentional tricks to lead people to a wrong conclusions. DON’T BE FOOLED! Situation More background Problem(s) Wage statistics: (~1954) [Huff54] Actually, Don’t use Rotundia: $30/week Pictographs. The USA: $60/week 2-D, 3-D pictures falsely exaggerate the relative size. [Tri08, p.14] Rotundia vs. E.g., the Carter $ worth only 44% of the Ike $. But, the area of the Carter bill is only 19% of the Ike bill. [Tufte83] USA Page 1 of 2 Elementary Statistics Critical Thinking Examples (v.090207+) Gov’t PayRolls Stable! Gov’t Pay Rolls Shoot Up! 30M$ [Huff54] 20M$ Beware of graphs that does not have the horizontal axis at y=0. 
[Tri08, p.13] Include y = 0 for the correct perspective. Another misrepresentation e.g.: 10M$ 0M$ “17.6264% of the bird/plane collisions strike the engine.” 10,916 struck the engine out of 61,930 collisions, between 1990 – 2007. 10916/61930 = 0.176263523… “We need an emergency pay cut of 50%, after the difficult quarter we will restore the pay with a raise of 50%.” [Huff54] Pay before cut = $100 Pay after cut = $50 50% of new pay=$50x.5=$25 Pay after “restoration” (50% increase) = $50+$25 = $75! Reporting many digits of an estimate gives a false impression of a very accurate and precise investigation. It is less misleading to report “About 17.6% ...” or “About 18% …” See Precise Numbers in [Tri08, p.17]. Be ware of how Percentage are calculated. Also see [Tri08, p.14]. Another e.g.: Using % to manipulate conclusion. [Huff54] “Honey, statistic shows that BMW owners have a longer life expectancy. For the sake of our lives, let’s buy a BMW.” BMW owners tend to be wealthy. Wealthy people have better health care. In a semester, your instructor taught Statistics in the morning as well as evening. The evening students were older. The evening class also did better, can you conclude that the older students were better students? The instructor repeated the morning lectures in the evening (with improvements). Ethical Problems: “Correlation does not imply Causality.” [Tri08, p.17] E.g., When ice cream sales goes up, more people get drown. Let’s ban ice cream to save life! Was the higher achievement due to “age” or “better lectures”? The effect of “student age” and the effect of “lecture improvement” were Confounded. When planning an experiment, you must carefully think through all the factors that might affect the result. [Tri08, p.23] (See [Tri08, p.17].) Beware of: • Self-Interest Studies (e.g., oil co. sponsors study to prove their gas is better) • Partial Pictures (What are they hiding?) • Deliberate Distortions (claims that base on nothing) Page 2 of 2 Elementary Statistics Graphs for Quantitative Data G2: Stemplot Graphs for Quantitative Data (v.090721) (Stem-and-Leaf) [Tri08, p.59] Goal: These graphs summarize the “distribution” of all the data. “Distribution” means what values are common (concentrated) and what values are rare. These are tools to assess whether the data is from a “normally distributed” (bell-shaped Example: In a Statistic class (in Fall 2008), scores for Test 1 are as follows: G1: Dotplot [Tri08, p.58] ) population. 94 87 81 100 76 62 96 81 87 79 86 60 82 73 62 83 99 98 75 94 Step 1: Label the x-axis • Find the minimum and maximum values from the data set. • Decide the best x-scale to cover all the values. • Label the x-axis. Step 2: Plot the dots. For each value, put a dot over the appropriate x. When there is a repeated value, just stack them. 42 82 93 (See [Tri08, p.59] for more detail. Also see Wikipedia.) Step 1: Label the “Stem”. • Find the minimum and maximum values from the data set. • Decide the range the stem values. • Label stem heading, stem unit, stem values • Label the leaf heading, and leaf unit. • CAUTION: Do not skip any stem value. Step 2: Record each “leaf” value. Step 3: Sort the leave values for each “stem row”. • CAUTION: Some stems might not have leaves. You must not delete that “stem row”. Observations: Where are the values concentrated? Identify any outliers, values that are far from the rest. Observations: Where are the values concentrated? Are the values normally distributed (i.e., bell-shaped)? 
Identify any outliers, values that are far from the rest. Are the values normally distributed (i.e., bell-shaped)? © 2009 Chi-yin Pang Note: Stemplot and boxplot evolved from Arthur Bowley's work. Bowley (1869-1957) wrote the first English-language statistics text-book. Page 1 of 3 Elementary Statistics G3: Histogram [Tri08, pp.51-] Enter data into L1: [STAT]>Edit>1:Edit>[ENTER] Use [^] to go up to L1>[CLEAR]>[ENTER] Enter the data to L1. Quit list editor: [2nd][MODE](QUIT) Make Histogram: Graphs for Quantitative Data (v.090721) G4: Boxplot (Box-and-Whisker) [Tri08, pp.121] Make the plot: [2nd][Y=](STAT PLOT)>1:Plot1… Set up Plot1 as shown in screen shot. [ZOOM]>9:ZoomStat Copy the boxplot to scale. [2nd][Y=](STAT PLOT)>1:Plot1… Set up Plot1 as shown in screen shot. [ZOOM]>9:ZoomStat Copy the histogram to scale. Label the boxplot: Use [TRACE] and [<] [>]. Label the histogram: Use [TRACE] and [<] [>]. Observations: Where are the values concentrated? Identify any outliers, values that are far from the rest. Are the values normally distributed (i.e., bell-shaped)? Interpretation: • A quarter of data are between minx and Q1. • A quarter of data are between Q1 and Med. • A quarter of data are between Med and Q3. • A quarter of data are between Q3 and maxX. • If there are any Outliers, they would be indicated with mark(s) separated from the “whisker”. Observations: Where are the values concentrated? Identify any outliers, values that are far from the rest. Are the values normally distributed (i.e., bell-shaped)? © 2009 Chi-yin Pang Page 2 of 3 Elementary Statistics G5: Normal Quantile [Tri08, pp.53-54] Graphs for Quantitative Data (v.090721) Plot G3, G4, G5 with Statdisk’s “Explore Data” Function Install and Launch Statdisk: See the Statdisk Instruction handout. Enter data into column 1: See the Statdisk Instruction handout. Use the “Explore Data” function to do “Exploratory Data Analysis”: Make the plot: [2nd][Y=](STAT PLOT)>1:Plot1… Set up Plot1 as shown in screen shot. [ZOOM]>9:ZoomStat Copy the normal quantile plot to scale. Data>Explore Data>choose column 1>[Evaluate] Interpretation: • The normal quartile is mainly used to assess whether a data set is normally distributed. The points lines up straight, if and only if the data set is normally distributed. • Outliers, if any, would be located on the lower left or upper right far from the straight line formed by the other points. Observations: Are the values normally distributed (i.e., bell-shaped)? Identify any outliers. © 2009 Chi-yin Pang Page 3 of 3 Compare the Statdisk result with the TI results. Note: • The histograms are different, because the TI and Statdisk used different classes. • Statdisk’s Boxplot does not show outliers. Elementary Statistics Scatterplot [Tri08, p.60] Example: Are horsepower & fuel economy correlated? Here are advertised horsepower ratings and expected gas mileage for several 2001 vehicles. [BVD07, p.164] Audi A4 Chevy Prizm Ford Excursion Honda Civic Lexus 300 Olds Alero VW Beetle x: Horsepower 170 hp 125 310 127 215 140 115 y: Gas Mileage 22 mpg 31 10 29 21 23 29 Time-Series Graph Example: Trend of Marriage age. Here are the average age of American women (at first marriage) in different years. [BVD07, p.214] x: Year 1900 1910 1920 1930 1940 y: Age 21.9 21.6 21.2 21.3 21.5 x: Year 1950 1960 1970 1980 1990 Plot the data by hand. Plot the data by hand. Plot the data with the TI. Enter data into L1 & L2: [STAT]>Edit>1:Edit>[ENTER] etc. Make the Scatterplot: [2nd][Y=](STAT PLOT)>1:Plot1… Plot the data with the TI. 
Enter data into L1 & L2: [STAT]>Edit>1:Edit>[ENTER] etc. Make the Time series plot: [2nd][Y=](STAT PLOT)>1:Plot1… Set up Plot1 as shown in screen shot. Set up Plot1 as shown in screen shot. [ZOOM]>9:ZoomStat [ZOOM]>9:ZoomStat Observation: What trend do you see? Observation: Does the graph show a correlation? © 2009 Chi-yin Pang Graphs for X-Y Data (v.090120+) [Tri08, p.61] Page 1 of 1 y: Age 20.3 20.2 20.8 22.0 23.9 Elementary Statistics Triola’s Statdisk Instruction (v.090301+) Triola’s Statdisk Instruction The text book is Mario F. Triola, Essentials of Statistics (3rd Ed.), Pearson, 2008. Steps Examples Installing Triola’s Statdisk From CDROM 1.)Insert the STATDISK CD in the CD-ROM Drive. 2.)Double click the "My Computer" icon on your desktop (Windows) or open a Finder window (Macintosh). 3.)Double click on your CD-ROM Drive icon. This will open a window containing the contents of this CD. 4.)Double click on the "Software" Folder. This will display the contents of that folder. 5.)Double click the "STATDISK" Folder. This will display the contents of that folder. 6.)Double click on the file that corresponds to your operating system. Drag the Statdisk executable to your hard drive to install it. Installing Triola’s Statdisk From the Web http://wps.aw.com/aw_triola_stats_series Click Essentials of Statistics http://wps.aw.com/aw_triola_stats_series Click [STATDISK] http://media.pearsoncmg.com/aw/aw_triola_e lemstats_9/software/statdisk.htm (Assume you have a PC.) Click “STATDISK 10.4.0 for Windows”. Download and save the file to My Document/Download/TiolaStat/ sd_10_4_0_win2kXP.zip Go to the folder “My Document/Download/ TiolaStat /sd_10_4_0_win2kXP.zip” and double click Statdisk.exe, then click [Extract All]. Now you have the application Statdisk.exe with the Histogram icon. © Chi-yin Pang, 2009 Page 1 of 2 Elementary Statistics A pictorial guide to draw a comparative boxplot Triola’s Statdisk Instruction (v.090301+) You can follow this example to get a feel of how easy to make a comparative boxplot. In your project, you would enter your data into columns 1 and 2. Launch . The loaded the data from p.631’s Data Set 8: Forecast and Actual Temperatures. We will plot columns 1 and 2. NOTE: For the comparative boxplots to be effective, you MUST plot both sets of data at the same same time, and not plot them one at a time. The two plots must be together and on the same scale. Paste boxplot into Word You can use [Alt]+[PrtSc] (Alternate, Print Screen) keys to “take a picture” of the StatDisk window. Then in your Word document (your report), use the Edit>Paste to paste the StatDisk window into your report. Cropping the boxplot picture in Word To trim off the unnecessary borders of the pasted boxplot picture, you need to use the If you do not see that tool, you need to add the “Picture Tool Bar” by picture crop tool. Click the boxplot picture, then select the crop tool. The cursor would become the crop symbol. Use the crop symbol to drag the corners of the picture to trim the borders. Position of the picture You can also use Format>Picture>Layout>Square to experiment repositioning the placement of the boxplot on your page. Then [Close]. © Chi-yin Pang, 2009 Page 2 of 2 Elementary Statistics Project 1 Assignment (v.090728) Project 1: Comparative Boxplots or Dotplots Assignment Learning Objective The objectives are to: • Gain experience for a “field” data collection. • Use Comparative Boxplots or Dotplots to compare data. • Exercise your written presentation skill. 
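NOTE: If you prefer to draw the comparative boxplots with Python rather than Statdisk or the TI, the following minimal sketch does the same job. It assumes the matplotlib package is installed, and the two short data lists are hypothetical placeholders; your real Project 1 samples should each have at least 31 values.

# A minimal sketch of a comparative boxplot in Python (assuming matplotlib is installed).
# Replace sample_a and sample_b with your own Project 1 data.
import matplotlib.pyplot as plt

sample_a = [22, 25, 19, 30, 28, 24, 27, 31, 23, 26]   # hypothetical data set 1
sample_b = [29, 33, 27, 35, 31, 30, 36, 28, 34, 32]   # hypothetical data set 2

# Both samples go into ONE plot call so they share the same scale,
# just as the Statdisk instructions above require.
plt.boxplot([sample_a, sample_b])
plt.xticks([1, 2], ["Sample A", "Sample B"])
plt.ylabel("Value")
plt.title("Comparative Boxplots")
plt.show()

As with Statdisk, the key design point is that both samples appear in a single figure; two separate boxplots drawn on different scales are much harder to compare.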
You will use the numerical methods to analyze the same data in Project 2. Due Date Proposal due _______________ (send me a 1-line e-mail, chi-yin_pang@alumni.hmc.edu) Report due _________________ Topic You are to collect two sets of QUANTITATIVE data, of at least 31 samples each. (Use measurement data, and not categorical or Yes/No data.) Compare the samples comparative boxplots or dotplots to highlight the difference (or similarity) of their distributions, and make conclusion about your investigation. (If you want to use other graphic displays, talk with me.) See the end of this assignment for examples. Make you own “fun” topic; especially, the “fun” topic that you have ready access to the data. “Fun” makes work easy. If you are not interested in the topic, it is like pulling your own teeth. Your report must include: Project Report’s • An introduction of you topic. (It would be good to discuss your original Content prediction of the result.) Requirements • Describe how you collect the data, the source of the data, and present the raw data. • Describe your analysis and present the comparative graphs boxplots. • Draw conclusion(s). If appropriate, state what you would do differently for further investigation. Editing Requirement The report must be word processed: • You may draw the graphs software, for example, StatDisk. You may use graph paper and sketch the graph neatly and to scale. You may also use “Courier New” font (which is an “equal-width” font) and hand typed character graphics, to scale. • In case you use formulas, typed them with Equation Editor. x x + ... + xn E.g., μ = ∑ = 1 . n n For Word, use Insert>Object …>Object type: Microsoft Equation 3.0> [OK]. © 2009 Chi-yin Pang Page 1 of 2 Elementary Statistics Caution Project 1 Assignment (v.090728) There are many potential “time-eaters” along the way, such as: • Collecting data. • Installing/configuring software for graphing, equation editing etc. • Using equation editor. • Wanting to collect more data to investigating further. • Thinking and writing up the conclusion. Start the project early to find out what might surprise you. Project Ideas: (Real projects from Fall 2008) Price Comparison Merchandise prices: Paper back vs. Hardcover Wal-Mart vs. Target Starbucks vs. Peets NVIDIA 260 vs NVIDIA 9800 Rent price: SF vs. SJ; SJ vs. NY East SJ vs. West SJ Car prices: Charge vs. Magnum Prius vs. Civic BMW M5: SF vs. LA Saturn Ions: 2 vs. 4 door Office rental price: Downtown SJ vs. not Downtown SJ Women’s boot prices: Aldo vs. Bakers Shoe prices: Manolo Blanik vs. Jimmy Choo Home price: Zip 95123 vs. Zip 95051 Purse prices: Coach vs. Dooney&Burke Gucci vs. Burberry # pets in 1000 households: dogs vs. cats # text messages: In box vs. Out box Past due traffic tickets (in $): Male vs. Female Unemployment rates: 8/2007 vs. 9/2008 Household incomes: 3-person families vs. 4-person families Pairs of jeans owned: Male vs. Female College tuitions: Private vs. Public Nursing schools passing rates: 2006 vs. 2007 # teachers in elementary schools: Male vs. Female Word lengths: King James Version vs. New King James version # words in a verse: English vs. Spanish # letters in the title in Harry Potter books: Book 6 vs. Book 7 Age of Target workers: Female vs. Male Income from arcade game: Redemption vs. Non-Redemption Bagels wasting (per day): 8/2008 vs. 9/2008 Movie gross (Million $): Highest grossing of the year vs. Best picture of the year Rice export (tons): 1997 vs. 1998 Daily stock prices: High vs. Low Radiology scan time: CT scan vs. 
MRI scan Blood glucose level: Before treatment vs. After treatment # days of hospital stay for major joint replacement: With complications vs. Without complications Number of park visitors: Regular price vs. Discount price Temperature in cities: West Coast vs. East Coast Fuel economy of car: Weight vs. MPG. Foreclosure price: Santa Clara County vs. San Benito County Sociology Education # units taken: Spring vs. Fall Literature Word lengths (# letters): Broken Spears vs. To Kill a Mocking Bird Business Spending at checkouts: Credit card vs. Cash Time spent in a furniture store: Saturdays vs. Sundays Medical Cases of pneumonia: 2006 vs. 2007 Miscellaneous Off Highway Vehicle park visits: ATV vs. Motorcycle NBA draft ages: 2007 vs. 2008 © 2009 Chi-yin Pang Crime rate: Page 2 of 2 Project 1 Example (ver.090222) Project 1 Example B Did the 1st Century Doctor Use Longer Words? Chi-yin Pang February 22, 2009 Introduction In the study of Koine Greek, beginner’s text books like to use examples from fisherman John’s writing rather than Dr. Luke’s writing, because John’s writings are easier to read and Dr. Luke’s writing are more refined. I thought that medical school must have taught Dr. Luke to write longer unintelligible words. John was just an uneducated fisherman, before he became “St. John”; therefore, he must have use simpler words. I will just take a sample of their writings, and count the length of words and see what the data say. Method and Raw Data At first, I have decided to compare two passages that describe the same event, Jesus fed five thousand people. Here are the comparative passages in Koine Greek (from Byzantine/Majority Text 2000), and the number of letters of the first 15 words are shown as a sample of the data columns. 13 Dr. Luke’s Luke 9:13-17 ειπεν δε προς αυτους δοτε αυτοις υμεις φαγειν οι δε ειπον ουκ εισιν ημιν πλειον η πεντε αρτοι και ιχθυες δυο ει μητι πορευθεντες ημεις αγορασωμεν εις παντα τον λαον τουτον βρωματα 14ησαν γαρ ωσει ανδρες πεντακισχιλιοι ειπεν δε προς τους μαθητας αυτου κατακλινατε αυτους κλισιας ανα πεντηκοντα 15και εποιησαν ουτως και ανεκλιναν απαντας 16λαβων δε τους πεντε αρτους και τους δυο ιχθυας αναβλεψας εις τον ουρανον ευλογησεν αυτους και κατεκλασεν και εδιδου τοις μαθηταις παρατιθεναι τω οχλω 17και εφαγον και εχορτασθησαν παντες και ηρθη το περισσευσαν αυτοις κλασματων κοφινοι δωδεκα 8 Fisherman John’s John 6:8-13 λεγει αυτω εις εκ των μαθητων αυτου ανδρεας ο αδελφος σιμωνος πετρου 9εστιν παιδαριον εν ωδε ο εχει πεντε αρτους κριθινους και δυο οψαρια αλλα ταυτα τι εστιν εις τοσουτους 10ειπεν δε ο ιησους ποιησατε τους ανθρωπους αναπεσειν ην δε χορτος πολυς εν τω τοπω ανεπεσον ουν οι ανδρες τον αριθμον ωσει πεντακισχιλιοι 11ελαβεν δε τους αρτους ο ιησους και ευχαριστησας διεδωκεν τοις μαθηταις οι δε μαθηται τοις ανακειμενοις ομοιως και εκ των οψαριων οσον ηθελον 12ως δε ενεπλησθησαν λεγει τοις μαθηταις αυτου συναγαγετε τα περισσευσαντα κλασματα ινα μη τι αποληται 13συνηγαγον ουν και εγεμισαν δωδεκα κοφινους κλασματων εκ των πεντε αρτων των κριθινων α επερισσευσεν τοις βεβρωκοσιν Page 1 of 2 Word # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 etc. Luke 5 2 4 6 4 6 5 6 2 2 5 3 5 4 6 etc n=91 John 5 4 3 2 3 7 5 7 1 7 6 5 9 2 3 etc n=101 Project 1 Example (ver.090222) Graphs of the Distributions I entered the raw word length counts into Statdisk. Luke 9:13-17’s counts at column 1, and John 6:8-13’s counts at column2. Then I used “Boxplot” to draw Box-and-Whisker plots. Dr. 
Luke’s Luke 0:13-17 Fisherman John’s John 6:8-13 Conclusions To my surprise, the Boxplots show that the word length distribution of both Luke and John are remarkably similar. Furthermore, the Boxplots show that 25% of Luke’s words are 6 to 14 letters, and 25% of John’s words are 7 to 14 words. This fact even suggests that John use longer words more often. What have I learned and what would I do differently? Since Boxplots only provide the 5-number summary and sacrifice details of the raw data, perhaps Dotplots would show me more of the subtle differences. Nevertheless, the Boxplots was very effective in showing that Dr. Luke does not seem to use longer words that fisherman John. In a more fundamental level, the whole premise of “difficulties are associated with word length” might be shaky. For example, grammatical construction might have more influence to reading difficulty, but difficult grammatical construction might not lead to long words. Page 2 of 2 Elementary Statistics Graphs for Categorical Data (Ver.090118) G3: Segmented bar chart (not in [Tri08]) Step A: Calculate sample size, n: Go to list editor: [STAT]>Edit>1:Edit>[ENTER] Clear L1: Use [^] to go up to L1>[CLEAR]>[ENTER] Enter frequencies into L1. Quit list editor: [2nd][MODE](QUIT) Calculate n, by sum up L1: Graphs for Categorical Data Example: Cause of death [BVD04 p.28] In 1999, a sample of death cause data shows: Cause Frequency (#Cases) Cancer 230 Circulatory diseases 84 Heart disease 303 Respiratory diseases 79 L1 Frequency Category xi L2 Relative Freq. ri = xi / n L2 = L1 / n Cumm. L3 % for Central Angle Segmented θ = r ∗ i i 360° bar chart L3 = L2 *360 [2nd][STAT](LIST)>MATH>5:SUM(>[ENTER]> [2nd][1](L1)>[ ) ]>[ENTER] Step B: Calculate the relative frequency: Go to list editor. Go up to L2. Type L2=L1/(the value of n) then [ENTER] Step C: Calculate the cumulative percentage. (Add from bottom to top.) Step D: Draw the Segmented bar chart Sample size: n = ∑xi = sum(L1) = G1: Bar chart (not in [Tri08]) Decide on the scale. Draw the bars for each category. G4: Pie chart Step A: Calculate the central angles. Go to list editor. Go up to L3. Type L3=L2*360 then [ENTER] G2: Pareto* chart (p.59) Sort the bars from tallest to the shortest. Step B: “Cut” the pie according to central angles. Label each slice. * Vilfredo Pareto (Italian economist, 1848-1923) observed the “Pareto Principle” (“80-20 Rule”, 80% of the wealth is owned by 20% of the people). © 2009 Chi-yin Pang Page 1 of 1 Measures of Center (Ver.080205+) What? How? What for? When? Characteristics Sensitive to every sample in the dataset. Outliers would pull it way off. Mean (Average) Arthematic average. Add the n values and devide by n. The "center of gravity" of the (x1+x2+…+xn) / n data. Used for most measured data: Height, weight, duration, grade point average, age … Median (50th Percentile) The value where 50% of the 1. Line up all the values from the samples are above and 50% smallest to the largest. are below. 2. Count from both ends going towards the middle. 3. If there is one middle point, that's the median. If there are 2 middle points, add them up and divide by 2. Real estate: "Median house Insensitive to outliers. price." Mode The category (or the number) that occurs the most often. Election. The candidate who The only "center" that gets the most vote wins. works for categorical data. Midrange The mid point between the (Minimum + Maximum) / 2 minimum and the maximum. (c) Chi-yin Pang, 2008 Find the category (or the number) that occurs the most often. 
Academic achivement (50th percentile). Baby weight, baby length. (Rarely used.) Only depends on the minimum and the maximum values. Elementary Statistics Standard Deviation (v.090728) Standard Deviation Exercise You will get a problem like this in a quiz and probably in the next test. (During the test you will be given just an EMPTY table grid and you have to supply all the headings and all the formulae.) ∑x = x Sample Mean: x= Sample Variance: v = s2 = + ... + xn n 1 n ∑(x − x) ∑ (x − x) Sample Standard Deviation: s = s = x x−x L1 L2 = L1 − ___ = n −1 2 ( x1 − x ) 2 + ... + ( xn − x ) 2 n −1 2 2 n −1 ( x − x )2 L3 = L2 ^ 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ∑ x =sum( L ) ∑ (x − x ) = = 1 x x=∑ = n 2 s 2 =sum( L3 ) (x − x) =∑ n −1 Computed sample standard deviation: s = s = 2 = ∑ (x − x) Compute the sample standard deviation again using “1-Var Stat” ([STAT]>CALC>1:1-Var Stats) “Sx”= Explain the difference, if any, of the results: © 2009 Chi-yin Pang 2 p.1 of 2 n −1 2 = Elementary Statistics Standard Deviation (v.090728) Mean and Standard Deviation for Data from a Frequency Distribution Sometimes data is note presented in one list of numerical values, but a list of values and the number of times it was observed. E.g., for [Tri08, p.108, #27], a sample of fifty speeding tickets gives the following set of data. (The data record the speed of the drivers traveling through a 30 mph speed zone, in the town of Poughkeepsie, New York. Speed (mph) Frequency 42-45 25 46-49 14 50-53 7 54-57 3 58-61 1 [Tri08, pp.83 and 108] has formulae to compute the sample mean and standard deviation “by hand”: x= ∑ ( f ⋅ x) ∑f s= n ⎣⎡ ∑ ( f ⋅ x 2 ) ⎦⎤ − ⎡⎣∑ ( f ⋅ x ) ⎤⎦ 2 Isn’t it cool? n ( n − 1) Long way to compute the Mean and Standard Deviation with the TI: We can calculate the mean and standard deviation with TI’s 1-VarStats by entering data into L1 as follows: enter 43.5 (the mid point between 42 and 45) 25 times, enter 47.5 (the mid point between 46 and 49) 14 times, enter 51.5 (the mid point between 50 and 53) 7 times, enter 55.5 (the mid point between 54 and 57) 3 times, enter 59.5 (the mid point between 58 and 61) 1 times. You probably have something better to do with your time than playing this video game. Nifty way to compute the Mean and Standard Deviation with the TI: Enter the data into L1, L2 as follows: Speed (mph) Mid-Point of the Speed Frequency L1 L2 42-45 46-49 50-53 54-57 58-61 25 14 7 3 1 Now use TI’s 1-Var Stats [STAT]>CALC>1:1-Var Stats>1-Var Stats L1, L2 For L1, L2, type “[2nd][1][,][2nd][2]”. Mean = = Sample Standard Deviation = © 2009 Chi-yin Pang = p.2 of 2 Elementary Statistics, Probability Contingency Table (ver.090222) 3. Compute “The probability of a passenger dead given that the passenger is a 1st class passenger” and draw lines to link the numbers to the locations of the contingency table. # of 1stClassDead P( Dead | 1st Class ) = = total1stClass Probability Contingency Table Exercise A D Given the frequency contingency table of Titanic passenger survival data: Alive 1st Class 202 Dead 123 (total) 325 2nd Class 3rd Class 118 178 167 285 528 706 Crew 212 (total) 673 1491 885 4. 
P ( Dead | Crew) = 710 A D Alive Crew A D 710 1491 325 285 706 885 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 (total) 202 ≈ .0918 2201 7. P ( Alive | 3rd Class ) = Dead (total) Crew 212 673 6. P ( Dead | 2nd Class ) = 1. Fill-in the rest of the following probability contingency table. 2nd Class 3rd Class 3rd 178 528 5. P ( Alive | Crew) = We can compute the probability contingency table, by computing the relative frequency for each cell x x x with = = n GrandTotal 2201 1st Class 2nd 118 167 # ofCrewDead = totalCrew A D 2201 1st 202 123 A D 325 ≈ .1477 2201 8. P (1st Class | Alive) = A D 2. Conditional Probability: Often we talk about probabilities under a certain condition. For example, “The probability of a passenger alive given (the condition) that the passenger is a 1st class passenger.” For short, we write P(Alive given 1st Class) or P(Alive | 1st Class). The probability is computed as follows: # AlivesAmong1stCls 202 P ( Alive | 1st Cls ) = = total1stClass 325 Draw lines to link the numbers to the locations of the contingency table below. A D #1st ClaAmongAlives 202 = totalAlives 710 9. P(3rd Cla | Alive) = 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Page 1 of 3 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 #3rd ClaAmongAlives = totalAlives A D 1st 202 123 1st 202 123 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Elementary Statistics, Probability Contingency Table 10. P(3rd Cla | Dead ) = #3rd ClaAmongDeads = totalDeads A D 11. P(3rd Class ) = 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 #3rd ClassPassengers = totalPassengers A D 12. P (2nd Class ) = 1st 202 123 totalPassengers A D 706 2201 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 13. P(Crew) = A D 18. To compute the “probability of a random passenger who is 2nd Class AND who survived (alive)” we use the number of the cell that satisfies both criteria. The important word AND usually makes the number smaller. 118 P (2nd Class AND Alive) = 2201 A D 1st 202 123 = (ver.090222) 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 19. To compute the “probability of a random passenger who is 2nd Class OR who died (dead)” we use the number in the cell that satisfies both criteria. The important word OR usually makes the number bigger. P ( Dead OR 2 nd Class ) 123 + 167 + 528 + 673 + 118 = = 2201 2201 A D 1st 202 123 1st 202 123 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 20. P (2nd Class OR Alive) 14. P ( Alive) = = A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 = A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 21. 15. P (Dead ) = P (Crew AND Alive) A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 = 16. P(not Dead ) = = A D A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 22. P (1st Class OR 2nd Class ) = 17. 
P (not 1st Class ) = A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Page 2 of 3 = A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Elementary Statistics, Probability Contingency Table 23. P (1st Class AND 2nd Class ) = A D st nd 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 (ver.090222) Independency: Events A and B are independent if probability of A does not depend on whether B happens or not. That is, P( A given B) = P( A) = P( A given (not B)) P( A | B) = P( A) = P( A | (not B)) rd P (1 Class OR 2 Class OR 3 Class ) 24. = Dependency: Events A and B are dependent if they are not independent. = A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 28. Is “Alive” and being “1st Class” dependent or independent? 1st Class ) = = )= = P ( Alive | not1st Class ) = = P ( Alive | 25. Here is an easier way to compute P (1st Class OR 2nd Class OR 3rd Class ) using the Law of Complement : P (not A) = 1 − P( A) or P( B) = 1 − P(not B) P ( Alive P(1st Class OR 2nd Class OR 3rd Class ) A D = P(not Crew) = 1 − P(Crew) = 1 − = 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Conclusion: ________________________ A D 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 29. Is “Dead” and being “3rd Class” dependent or independent? P( P(Crew OR 3rd Class OR 2nd Class ) = 26. P(not ) = 1 − P( ) =1− A D = P( 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 P( = A D | )= = )= = )= = A D 27. Two events are Mutually Exclusive, if they cannot both happen at the same time. Here is an example: What’s the “probability that a passenger is 1st Class AND also 3rd Class.” P (1st Class AND 3rd Class ) = | 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Conclusion: ________________________ 30. Write a sociological conclusion about your conclusion of #27 and #28. 1st 202 123 2nd 118 167 3rd 178 528 Crew 212 673 710 1491 325 285 706 885 2201 Fact: If A and B are mutually exclusive, then P ( A AND B) = 0. (Meaning they can never happen at the same time.) Page 3 of 3 Elementary Statistics Contingency Table: Conditional Probability Stacked Bar Graph Worksheet (Ver.090226) STEP 1: Form a question of interest. STEP 1: Form a question of interest. Does ____________ depend on ____________? Does ____________ depend on ____________? STEP 2: Enter the variable names and the observed frequencies. STEP 2: Enter the variable names and the observed frequencies. The outcomes you are curious about. This is the RESPONSE VARIABLE. The conditions that are easier to assess. This is the EXPLANATORY VARIABLE. The outcomes you are curious about. This is the RESPONSE VARIABLE. R= S= T= U= Total Freq = Freq = Freq = Freq = P(R | E)= P(R | F)= P(R | G)= P(R | H)= Freq = Freq = Freq = Freq = P(S | E)= P(S | F)= P(S | G)= P(S | H)= Freq = Freq = Freq = Freq = P(T | E)= P(T | F)= P(T | G)= P(T | H)= Freq = Freq = Freq = Freq = P(U | E)= P(U | F)= P(U | G)= P(U | H)= Σ Freq = Σ Freq = Σ Freq = Σ Freq = Explanatory Variable (conditions, possible explanation for different outcomes) E= F= G= H= Response Variable (outcomes) Response Variable (outcomes) Explanatory Variable (conditions, possible explanation for different outcomes) E= F= G= H= The conditions that are easier to assess. This is the EXPLANATORY VARIABLE. 
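NOTE: As a cross-check on the Titanic exercise above and on STEPs 3-5 of this worksheet, here is a minimal Python sketch (standard library only) that computes the conditional probabilities P(Alive | group) straight from the frequency table. The frequencies are the ones given in the exercise; nothing else is assumed.

# Conditional probabilities from the Titanic contingency table.
counts = {                                    # frequencies from the exercise above
    "1st":  {"Alive": 202, "Dead": 123},
    "2nd":  {"Alive": 118, "Dead": 167},
    "3rd":  {"Alive": 178, "Dead": 528},
    "Crew": {"Alive": 212, "Dead": 673},
}

grand_total = sum(row["Alive"] + row["Dead"] for row in counts.values())   # 2201
total_alive = sum(row["Alive"] for row in counts.values())                 # 710

print("P(Alive) =", round(total_alive / grand_total, 4))                   # about 0.3226
for group, row in counts.items():
    group_total = row["Alive"] + row["Dead"]
    print(f"P(Alive | {group}) = {row['Alive'] / group_total:.4f}")

The output shows P(Alive | 1st) ≈ 0.62 but P(Alive | Crew) ≈ 0.24, while the overall P(Alive) ≈ 0.32. Because the conditional probabilities are this uneven, surviving and passenger group are dependent, which is exactly the kind of conclusion STEP 5 asks you to draw.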
R= S= T= U= Total Freq = Freq = Freq = Freq = P(R | E)= P(R | F)= P(R | G)= P(R | H)= Freq = Freq = Freq = Freq = P(S | E)= P(S | F)= P(S | G)= P(S | H)= Freq = Freq = Freq = Freq = P(T | E)= P(T | F)= P(T | G)= P(T | H)= Freq = Freq = Freq = Freq = P(U | E)= P(U | F)= P(U | G)= P(U | H)= Σ Freq = Σ Freq = Σ Freq = Σ Freq = STEP 3: Compute the totals, & the conditional probabilities, P(R | E) etc. STEP 3: Compute the totals, & the conditional probabilities, P(R | E) etc. STEP 4: Construct the stacked bar graphs with the conditional probabilities. STEP 4: Construct the stacked bar graphs with the conditional probabilities. STEP 5: Make conclusion (The response and explanatory variables are dependent, STEP 5: Make conclusion (The response and explanatory variables are dependent, if and only if the graphs are uneven.) if and only if the graphs are uneven.) ©2009 by Chi-yin Pang Page 1 of 1 Elementary Statistics Binomial Distribution Motivation Example (v.090702++) Tools for answering Question 4: The basic information are Binomial Distribution Motivation Example In 1972, Rodrigo Partida was convicted of burglary with the intent to commit rape. The trial was held in Texas’ Hidalgo County, which had 181,525 people eligible for jury duty, and 80% of them were MexicanAmerican. There were 12 jurors, and 7 of them were MexicanAmerican. (See more detail in [Tri08, p.193].) To compute the z-score of 7 MAs, we need the mean and standard deviation. The mean, or the expected value is just 80% of the 12 jurors. μ = np = Question: Based on 7 Mexican-American out of 12 jurors, can we conclude that the jury selection process discriminated against Mexican-American? A Simple Minded Assessment: Compute the ratio of MexicanAmericans (MAs) among the Jurors. Ratio of MA in the jury = #MAs = #People in jury = = n = Number of random picks (sample size) = Number of jurors = p = Probability of randomly picking a MA = q = Probability of randomly picking a non-MA = 1 − p = By “magic” that we will explain later, the standard deviation is: % Ratio of MA in the population = σ = npq = Solution to Question 4: Since we are interested in the z-score of 7 MAs x= Your conclusion: z= = Formula Rephrase the question (Question 2): Is 7 MAs out of 12 jurors unusually low, if the jury process was random? = Plug in numbers Answer Question 4: More rephrase (Question 3): Is 7 MAs more than 2 standard deviations below the expected value of MAs selected? More rephrase (Question 4) Is the z-score, z = than − 2? x−μ σ , of 7 MAs less Answer the original question: Are Questions 2, 3, 4 and the original question equivalent? © 2009 Chi-yin Pang Page 1 of 2 Final answer Elementary Statistics Binomial Distribution Motivation Example (v.090702++) How come σ = npq ? Next, we setup the lists for the distribution for “how many MA are selected”: We will not prove this mathematically, but we will demonstrate it by using the TI calculator: Step 1: We will “store” all 13 probability values into L2. Use the binompdf(12,0.8) function without the “x” parameter. The “arrow” is the [STO ►] key. • To compute the probability of getting exactly 0 MAs, the probability of getting exactly 1 MAs, …, the probability of getting exactly 12 MAs • To compute population mean and standard deviation, μ and σ , with “1-Var Stats L1, L2” • Then compare the previous answers computed by the formulae. Computing the probabilities of 0 MAs, 1 MA, 2 MAs etc: Step 2: Type 0, 1, 2, …, 12 into L1, to complete the description of the discrete probability table. 
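NOTE: The same demonstration can be run outside the TI. The following minimal Python sketch (standard library only) builds the entire distribution of X = number of Mexican-Americans among the n = 12 randomly picked jurors with p = 0.8, then checks that the mean and standard deviation of that distribution agree with the formulas μ = np and σ = √(npq).

# Verify mu = n*p and sigma = sqrt(n*p*q) by building the whole binomial distribution.
from math import comb, sqrt

n, p = 12, 0.8
q = 1 - p

xs = range(n + 1)                                      # 0, 1, ..., 12 (like the TI's L1)
probs = [comb(n, x) * p**x * q**(n - x) for x in xs]   # binompdf(12, 0.8, x) (like the TI's L2)

mu = sum(x * P for x, P in zip(xs, probs))             # mean of the distribution
sigma = sqrt(sum((x - mu) ** 2 * P for x, P in zip(xs, probs)))

print("binompdf(12, 0.8, 10) =", round(probs[10], 4))  # probability of exactly 10 MAs
print("mu from the distribution    =", round(mu, 4), "   n*p =", n * p)
print("sigma from the distribution =", round(sigma, 4), "   sqrt(n*p*q) =", round(sqrt(n * p * q), 4))

This mirrors the 1-Var Stats L1, L2 step on this page: the calculator performs the same weighted mean and weighted standard deviation computation over the 13 probabilities.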
Let X = the number of MAs selected, then we are interested in computing the following probabilities: P ( X = 0) P ( X = 1) ... P ( X = 12) Again, by “magic” that we will explain later, for any given x, Step 3: Use 1-Var Stats L1, L2 to compute population mean and standard deviation, μ and σ . The prob. of getting exactly x MAs n! p x q n− x x !(n − x)! = binompdf (n, p, x) = P ( X = x) = where binompdf(n,p,x). is the TI function “Binomial probability density function”. The key sequence is [2nd][VARS]>DISTR>0:binompdf( For example, for x=10 binompdf(12,0.8,10) = © 2009 Chi-yin Pang μ=x= σ =σx = Step 4: Compare to the results given by the formulae. μ = np = σ = npq = Page 2 of 2 San Jose City College Math 63, Elementary Statistics NAME:_______________________ Binomial Probability Computation (v.090721) Binomial Probability Computation Homework These exercises cover most of the different situations for binomial probabilities. These are standard tricks, and some of them will appear in the test. They use the formula for complementary event, P (not A) = 1 − P( A) , over and over again. These tricks will be used again with the normal distribution. The following are all the formulae that you need: P( successes = x) = P(exactly x successes) n! = p x q ( n− x ) = n C x p x q ( n− x ) x! (n − x)! = TI ' s binompdf (n, p, x) 4. Compute P(no success) “by hand”. From here on, just use TI’s binompdf() and binomcdf(). (1) Write the TI function you use and then (2) the numerical answer. 5. P (exactly 3 successes ) Hint: binompdf(15, 0.1, 3). 6. P(exactly 10 successes) 7. P(no success) P( successes ≤ x) = P(at most x successes) = P(0) + ... + P( x) = TI ' s binomcdf (n, p, x) (Get to binompdf() or binomcdf() by [2nd][VARS](DISTR)>DISTR> … 8. P(5) (This means P(x=5).) -------------------------------------------------Assume n=15, p=0.1. 9. P(at most 3 successes) Hint: binomcdf(15, 0.1, 3). 1. What is q? 2. Compute P(exactly 3 successes) “by hand”. Hint: use TI’s 15C3*.1^3*.9^12. (For nCx type n first then [MATH]>PRB>3:nCr.) 10. P(at most 7 successes) 11. P(at least 13 successes) Hint: 1 − P( successes ≤ 12) 3. Compute P (exactly 10 successes) “by hand”. © 2009 Chi-yin Pang Page 1 of 2 San Jose City College Math 63, Elementary Statistics NAME:_______________________ Binomial Probability Computation (v.090721) 21. (Cont.) You can easily get the whole list of P ( MAs ≤ 0), P ( MAs ≤ 1),..., P ( MAs ≤ 12) with one TI formula 12. P(at least 5 successes) binomcdf (12,0.8) ►L1 where “►” is the [STO►] (store) key. 13. P(at least 10 successes) About how many MA jurors is the cutoff point of being “unusually low”? 14. P(not all are successful) Hint: use either 1 − P( successes = 15) = 1 − binompdf (15, 0.1,15) or P ( successes ≤ 14) = binomcdf (15, 0.1,14) 22. (Cont.) What is the probability of having 10 or more MA jurors? 15. P(all failed) 23. (Cont.) What is the probability of having 11 or more MA jurors? 16. P(not all failed) 17. P(at least one failure) 24. (Cont.) What is the probability of having all MA jurors? 18. Texas’ Hidalgo County has 80% MexicanAmericans (MA’s). If the jury selection of 12 jurors is random, what’s the probability of having 7 or fewer MA jurors? 25. Gadgetco is selling a Gizzmo that has a defect rate of 1 in a thousand. Last year, they sold 3800 Gizzmo. What is the expected returned Gizzmos? = 80% = 0.8; = 12; =7 19. (Cont.) What is the probability of having 6 or fewer MA jurors? 26. (Cont.) What is the probability of 0 Gizzmo returned? Is this unusual? 27. (Cont.) 
What is the probability of at most 4 Gizzmo returned? Is this unusual? 20. (Cont.) What is the probability of having 8 or fewer MA jurors? © 2009 Chi-yin Pang Page 2 of 2 Elementary Statistics NAME:_______________________ Binomial Applications (v.090721) Binomial Applications Homework Description of Situation Crux = n (1) At the post office closest to Graceland, one in ten letters that arrive are addressed to Elvis. In a p= random sample of seven letters arriving at this post office, … [FrMa97p.83] q= Success = X= (2) Would most wives marry the same man again if given the chance? According to a poll of 608 married women conducted by Ladies Home Journal (June 1988), 80% would, in fact, marry their current husbands. If you randomly sample 20 people, … n= p= q= [FrMa97p.87] Question … what is the probability that exactly three are addressed to Elvis? x= What is the probability that exactly 60% or them would marry their current husband? X= , ) P( X ) = binom ____( = , , ) x= P( X ) = binom ____( = , , ) In a random sample of 20 female fence lizards, what is the probability that at least 15 will be resting? x = P( X ) = binom ____( = , , ) P( X ) = binom ____( = , , ) In a random sample of In a random sample of 200 female fence lizards, would you expect to observe fewer than 190 at rest? x = P( X ) = binom ____( = , , ) What is the probability that last Friday’s production will be shipped? NEED MORE INFO P( X ) = binom ____( = , , ) x= [FrMa97p.87] Success = X= (4) A factory manufactures ball bearings. Each production day the quality control department randomly selects 10 ball bearings and checks them for defects. Suppose that last Friday the machines were not calibrated correctly and consequently 60% of that day’s production of ball bearings were defective. [FrMa97p.83] n= p= q= In a random sample of 20 female fence lizards, what is the probability that fewer than 10 will be resting? x = n= p= q= ) , x= What is the probability that at least 60% of them would marry their current husband? (3) Zoologists have discovered that animals spend a great deal of time resting. For example, a female fence lizard will be resting at any given time is 0.97. , P( X ) = binom ____( = What is the probability that at most 60% of them would marry their current husband? Success = Solution P( X ) = binom ____( , = x= Success = X= Page 1 of 2 Elementary Statistics NAME:_______________________ Binomial Applications (v.090721) n= (5) An airline, believing that 5% of passengers fails to show up for flights, overbooks (sells more tickets p= than there are seats). Suppose a plane will hold 265 passengers, and the airline sells 275 seats. [BVD07.400.26] q = Success = What is the probability the airline will not have enough seats so someone gets bumped? P( X ) = binom ____( = , , ) What is the probability one of these classes would not have enough lefty arm tablet? P( X ) = binom ____( = , , ) Should he suspect he was misled about the true success rate? (Hint: Compute the probability of making up to 10 sales.) P( X ) = binom ____( = , , ) What is the probability that he will answer at least 80% of the questions correctly and get at least a B grade on the quiz? P( X ) = binom ____( = , , ) x= X= (6) A lecture hall has 200 seats with folding arm tablets, 30 of which are designed for left-handers. The average size of classes that meet there is 188, and we can assume that about 13% of students are left-handed. 
[BVD07.400.25] n= p= q= x= Success = X= (7) A newly hired telemarketer is told he will probably n = make a sale on about 12% of his phone calls. The p= first week he called 200 people, but only made 10 sales. [BVD07.400.27] q= Success = X= (8) A student who did not study for 20-question true/false quiz in his biology class must randomly guess the answer to each question. [FrMa97p.83] Success = X= x= n= p= q= x= Hints for selected problems: (1) n = 7, p = 0.1, q = 0.9, P( x = 3) = binomPDF (7, 0.1, 3) n = 20, p = 0.8, q = 0.2 (2a) P ( x = n × 60%) = P ( x = 12) = binomPDF ( 20, 0.8, 12) (2b) P ( x ≤ 12) = binomCDF ( 20, 0.8, 12) (2c) 1 − binomCDF ( 20, 0.8, 11) (2) (3a) (3b) (3c) P( x ≥ 15) = 1 − binomCDF (20, 0.97, 14) P( x < 10) = P( x ≤ 9) = binomCDF (20, 0.97, 9) P( x < 190) = P( x ≤ 189) = binomCDF (200, 0.97, 189) (4) For example, you may want to assume that the acceptance criterion is “no defect among any of the 10 samples.” In this case, you would compute P(x=0) =… (8) n = 20, p = 0.5, q = 0.5, P( x ≥ n × 80%) = P( x ≥ 16) = 1 − binomCDF (20, 0.5, 15) Page 2 of 2 Elementary Statistics NAME:_______________________ Mean & Std. Dev. of Binomial Dist. HW (v.090305) Mean & Standard Deviation of Binomial Distribution These exercises drill you on: (1) reading the word problem, and (2) computing the mean, standard deviation, & usual range of the binomial distribution. Ref. Triola 2008) p.222 #12a,b p.222 #11ab Basic Info For the distribution of number of successes Mean = μ = np Range Rule of Thumb for usual number of successes Maximum usual value: μ +2σ Minimum usual value: μ − 2σ Std.Dev. = σ = npq n = 100 Max usual=μ +2σ = 14 + 2 × 3.47 = 20.94 Success=”the candy is yellow” X=”# of yellow candies” p=0.14 q=1−p =1–0.14=0.86 μ = np = (100)(.14) = 14 Min usual=μ − 2σ = 14 − 2 × 3.47 = 7.06 σ = npq 8 is within the usual range, therefore not unusual. Therefore, the result of 8 does not cast doubt on the claim of 14% yellow. Success = n= X= μ= p= σ= = 100(.14)(.86) ≈ 3.47 Max usual= Min usual= q= p.222 #13ab Success = n= X= μ= p= σ= Max usual= Min usual= q= Page 1 of 2 Elementary Statistics p.223 #15ab Success = n= X= μ= p= σ= NAME:_______________________ Mean & Std. Dev. of Binomial Dist. HW (v.090305) Max usual= Min usual= q= p.223 #17ab Success = n= X= μ= p= σ= Max usual= Min usual= q= p.223 #18ab Success = n= X= μ= p= σ= Max usual= Min usual= q= p.224 #20ab Success = n= X= μ= p= σ= Max usual= Min usual= q= Page 2 of 2 Height Distribution .1 0.08 dnorm( x , 85.1 , 4.5) dnorm( x , 86.7 , 4.6) 0.06 dnorm( x , 163.3 , 8.3) dnorm( x , 176.9 , 9.2) 0.04 0.02 0 0 60 80 50 100 120 x Height in cm 2 yr Girls 2 yr Boys 20 yr Women 20 yr Men 140 160 180 200 210 Elementary Statistics NAME:_______________________ Standard Normal’s CDF Homework (v.090305) Standard Normal’s CDF Homework The Standard Normal Distribution has mean = 0 and standard deviation = 1 ( μ = 0, σ = 1 ). Notations: • It is sometimes denoted by N (0,1) . ( N ( μ , σ ) denotes the normal distribution with mean, μ, and standard deviation, σ.) • N (0,1) is also called the Z-distribution. • The values on the horizontal axis are called z-scores. The Z-distribution is our most beloved ♥ distribution that helps us to do most of our statistical inferences. (3) Label Z = –1, Z = –2, Z = –3 on the following Zdist. (4) Label Z = –2.5, Z = 0.3, Z =1.6 on the following Z-dist. All the following problems are for the Z-distribution. (1) Z means how many Std. Dev. in relation to the mean. z-score English description Z=1 One std. dev. ABOVE the mean. 
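NOTE: Alongside the TI key sequences used in this homework, the same areas and percentiles can be computed in Python. The sketch below assumes the SciPy package is installed; scipy.stats.norm.cdf plays the role of normalcdf( and scipy.stats.norm.ppf plays the role of invNorm(.

# Standard normal areas and percentiles (assuming SciPy is installed).
from scipy.stats import norm

# normalcdf(-2, 2): area between z = -2 and z = 2
print(norm.cdf(2) - norm.cdf(-2))     # about 0.9545, the "95" of the 68-95-99.7 rule

# normalcdf(-E99, -2): the left tail below z = -2
print(norm.cdf(-2))                   # about 0.0228

# invNorm(0.9): the z-score at the 90th percentile
print(norm.ppf(0.9))                  # about 1.2816

There is no need for the E99 trick here: norm.cdf(z) already gives the whole area to the left of z, so a "between" area is simply the difference of two cdf values.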
The area under the standard normal curve is exactly 1 which correspond to the sum of the probability is exactly 1. (For students who know calculus: The standard normal Probability z Density Function (PDF) is f ( z ) = 1 e − 2 . The Cumulative Distribution 2 Z=2 2π Function (CDF) is CDF ( z0 ) = ∫ Z = 1.6 z = z0 z = −∞ f ( z )dz . The area under the entire curve is CDF (∞) = 1 .) Z= Three std. dev. above the mean. Z = –3 Three std. dev. BELOW the mean. (5a) Shade the area under the curve between z1 = 0.6 and z2 = 2.0. Z = –0.4 Z= One std. dev. below the mean. Z=0 The mean. (Not above and not below.) (5b) Find the shaded area, P(0.3 ≤ Z ≤ 1.6) , by [2nd][VARS](DISTR) > DISTR > 2:normalcdf( (2) Label Z =0, Z =1, Z =2, Z =3 on the following Z dist. ( P (0.6 ≤ Z ≤ 2.0) = normalcdf (0.6 , 2 ) = NOTE: We will never use “normalPdf(z)”. (5c) What percentage (%) of the population are between 0.6 and 2 standard deviations above the mean? Page 1 of 3 Elementary Statistics NAME:_______________________ Standard Normal’s CDF Homework (v.090305) (6a) Shade the area under the curve between z1 = –1 and z2 = 1. For the “left tail” area, use z1 = –∞. For the “right tail” area, use z2 = ∞. TI’s biggest number is E 99 = 10 99 . Use E99 as ∞. The key sequence is [2nd][ , ](EE) > 99. (9a) Shade the area to the LEFT of z2 =0. (6b) Find the shaded area. P( Z ) = normal ( , )= (7a) Shade the area under the curve between z1 = –2 and z2 = 2. (9b) Find the above shaded area by P( Z ) = normal ( , )= (10a) Shade the area to the LEFT of z1 = –2. (7b) Find the above shaded area by P( Z ) = normal ( , )= (7c) The “Range Rule of Thumb” (p.98) says that the “usual values” are between 2 standard deviations below and above the mean. What percentage of population are considered “usual”? (10b) Find the above shaded area by P( Z ) = normal ( , )= (11a) Shade the area to the RIGHT of z1 = –2. (8a) Shade the area under the curve between z1 = –3 and z2 = 3. (11b) Find the above shaded area by P( (8b) Find the above shaded area by P( Z ) = normal ( , )= (8c) Do the results from (6b, 7b, 8b) agree with the “68-95-99.7%” Rule (see p.100)? (8d) What percentage of population are more than 3 standard deviation from the mean? P ( not (−3 ≤ Z ≤ 3) ) = 1 − P (−3 ≤ Z ≤ 3) = Z ) = normal ( , )= (12a) Compute the sum of the answers from #10b and #11b above. That is the portion of the population that are more than 2 standard deviations about the mean. (12b) What is the percentage? Page 2 of 3 Elementary Statistics NAME:_______________________ Standard Normal’s CDF Homework (v.090305) (13) Compute the following: P( Z ≤ −1.6) = normal P(1.6 ≤ Z P( ( )= , ) = normal ( , )= Z ≤ 1.6) = normal ( , )= P(−1.6 ≤ Z ) = normal ( NOTE: For the probability of both tails (both the left tail and right tail) you can add the area of the two tails two tails or simply multiply the area of one tail by 2. CAUTION: Do this only if the tails do not overlap, otherwise you will get a non-sense probability greater than 1! )= , Notice the similarities and differences of the above results. (14) Compute the following: (16a) Shade the both the area of the LEFT tail of z1 = –2, and the RIGHT tail of z2 = 2 P(−1 ≤ Z ≤ 2.5) = normal ( , )= (16b) Find the following probabilities: Left tail: P(−2.5 ≤ Z ≤ 1) = normal ( , )= P( P(1 ≤ Z ≤ 2.5) = normal ( P(−2.5 ≤ Z ≤ −1) = normal ( , P( )= Notice the similarities and differences of the above results. P (−0.3 ≤ Z ≤ 0) = normal P (−0.3 ≤ Z ≤ 0.3) = normal ( , )= Z ) = normal ( , )= (16c) The tails do not overlap. 
Compute sum of the tails. (CAUTION: We can only do this add, if the tails do not overlap.) P((Z ≤ −2) or (2 ≤ Z )) = P( Z ≤ −2) + P(2 ≤ Z ) = (15) Compute the following: P (0 ≤ Z ≤ 0.3) = normal ) = normal Right tail: )= , Z ( )= , ( )= , ( , )= Notice the similarities and differences of the above results. Page 3 of 3 Elementary Statistics NAME:_______________________ Inverse of Normal CDF Homework (v.090305) Inverse of Normal CDF Homework z score = invNorm( TI’s normalcdf(*,*) function goes from given z-scores to probability. I.e., it takes two z-scores as inputs, z1 < z 2 , and it outputs the probability that the random variable would occurs between those z-scores. normalcdf ( z1 , z2 ) outputs (2a) Compute the z-score that is the 75th percentile. )= (2b) Label that z-score and shade the left tail. p = P( z1 ≤ Z ≤ z2 ) (3a) Compute the z-score that is the 90th percentile. TI’s invNorm(*) does the “opposite.” It goes from a given percentage (a probability) to the percentile (a z-score). z score = invNorm( )= (3b) Label that z-score and shade the left tail. invNorm( p) outputs z such that P(−∞ ≤ Z ≤ z ) = p For example, given a percentage of 90% = 0.9, we can get to invNorm(*) by [2nd][VARS](DISTR) > DISTR > 3:invNorm( (4a) Compute the z-score that is the 95th percentile. z score = invNorm( meaning that P ( Z ≤ 1.281551567) = 0.9 . In fact, if we plug the output back into normalcdf(-E99, *) we get 0.9 back (with a high precision). )= (4b) Label that z-score and shade the left tail. (For students who know calculus: invNorm(a) = CDF −1 (a) , the inverse of CDF, the Cumulative Distributive Function.) (5a) Compute the z-score that is the 40th percentile. (1a) Compute the z-score that is the 50th percentile. z score = invNorm( z score = invNorm( (5b) Label that z-score and shade the left tail. )= )= (1b) Label that z-score and shade the left tail. (6a) Compute the z-score that is the 25th percentile. z score = invNorm( Page 1 of 2 )= Elementary Statistics NAME:_______________________ Inverse of Normal CDF Homework (v.090305) z1 = invNorm(0.1) = (6b) Label that z-score and shade the left tail. z 2 = invNorm(0.9) = (10b) Label that z-scores and shade the middle and label the percentage. (7a) Compute the z-score that is the 10th percentile. z score = invNorm( )= (7b) Label that z-score and shade the left tail. (11a) Find the z-scores that “trap” the middle 50% of the population. z1 = invNorm( )= z 2 = invNorm( (8a) Compute the z-score that is the 5th percentile. z score = invNorm( )= (11b) Label that z-scores and shade the middle and label the percentage. )= (8b) Label that z-score and shade the left tail. (9a) Compute the z-score that is the 99th percentile. z score = invNorm( )= (12a) Find the z-scores that “trap” the middle 95% of the population. z1 = invNorm( )= z 2 = invNorm( )= (9b) Label that z-score and shade the left tail. (12b) Label that z-scores and shade the middle and label the percentage. Often we want to know z-scores that contains the middle percentages, e.g., (10a) Find the z-scores that “trap” the middle 80% of the population. (The middle 80% includes all the zscores from the 10th percentile to the 90th percentile.) Page 2 of 2 [BB95, p.382] SAT (Scolastic Aptitute Test) Mean=500 Std.Dev.=100 ACT (American College Testing) Mean=18 Std.Dev.=6 X-Land: SAT Distribution X-Land: ACT Distribution 0.07 0.008 0.06 0.05 0.04 0.006 0.004 0.03 0.02 0.01 0.002 0 500 SAT Scores 1000 0 6 12 18 24 30 36 42 ACT Scores (1) Donald Pato took SAT and got 666 points. 
Micky Raton took ACT and got 29 points. Who has a higher achievement? (I.e., who achieved a higher percentile?) Z-Land: N(0,1) (1) Univ. of PRB (People's Republic of Beserkeley) accepts only students above the 85th percentile with their SAT or ACT. Find the minimum acceptance scores. minimum SAT score = minimum ACT score = Z-Land: N(0,1) Z-Land: N(0,1) San Jose City College, Spring 2007 Math 63 Statistics Ver. 071022 p.1 of 1 Central Limit Theorem (CLT) The theorem that proves: “The ____ justifies the mean.” Situation: 1. Given ANY distribution with unknown mean = μ and unknown standard deviation = σ, 2. You want to use a sample of size n to estimate the mean (μ). (I.e., take n samples, x1 , x2 ,..., xn , and use x= x1 + x2 + ... + xn to estimate μ.) n CONDITION: When n is large (e.g., n ≥ 30 ), RESULT: The sample mean, x , is approximately normally distributed with mean = μ x = μ std .dev. = σ x = σ n In other word, “ x ~ N ⎛⎜ μ , σ ⎞⎟ ”. ⎝ n⎠ Exceptions to the n ≥ 30 condition • If the original distribution IS NORMALly distributed, then the result is true for any n. • If the original distribution is already symmetric and uni-modal, n does not have to very large for the distribution of x to be approximately normal. Paraphrases of the CLT result & Notes • The sample mean, x , distribution “gets more and more normal” as the sample size (n) increases. • Even if the original distribution is highly skewed, and many peaks, the sample mean, x , distribution will still get more and more symmetric, uni-model, and bell-shaped (i.e., normal) as the sample size (n) increases. • The mean (center point) never changes for the sample mean distribution. • As the sample size (n) increases, the sample mean distribution has smaller and smaller spread, σ x = σ . n Elementary Statistics NAME:_______________________ Normal Dist. App.: Problem Setup HW (ver.081014) Normal Dist. Applications: Problem Setup Homework These exercises drill you on: (1) Getting the important information from the word problem, and (2) Forming the right strategy to solve the problem. Ref. Triola 2008) p.254 #14 “Fish out the Basic Info” X = the woman’s height μ = 63.6" σ = 2.5" p.255 #15 X= p.255 #16 X= Express the question in math terms Prob. of meeting the requirement: P (58" ≤ X ≤ 80") = ? Solution strategy Case P=?. Step 1: Translate the given x to z (z-score) Step 2: Compute the probability using normalcdf( * , * ) Case x=?. Step 1: Turn the probability to percentile (in decimal, say p). Step 2: Find the z-score: z = invNorm( p ) Step 3: Translate the z back to x. Let x1=58” and x2=80”. Find the z-scores for x1 and x2. Then use normalcdf( z1 , z2 ) normalcdf( -E99 , z1 ) normalcdf( z2 , E99) to find the probabilities. Prob. of too short: P ( X ≤ 58") = ? Prob. of too tall: P (80 " ≤ X ) = ? Page 1 of 2 Elementary Statistics p.255 #18 X= p.255 #21a X= p.256 #21b X= p.256 #22 X= NAME:_______________________ Normal Dist. App.: Problem Setup HW (ver.081014) Page 2 of 2 Elementary Statistics Central Limit Theorem Application Setup (v.090328) Central Limit Theorem (CLT) Application Setup These exercises drill you on setting up word problems. Mean & S.D. for the Ref. Assign Correct Does CLT Triola Symbols Apply? Sample Means σ 2008) “Yes” if n ≥ 30 or μX = μ σ X = n if X-dist. is normal p.276 CLT applies X=Random X =R.V. of sample #9ab because Variable of man mean weight weights are weight μX = normally n = 12 distributed. p.276 #10 μ = 172 lb σ = 29 lb σX = X=R.V. of X = $182 = $105 = 35 = $50 p.276 #11 =R.V. of μX = σX = X =R.V. 
of μX = σX = p.276 #12ab Page 1 of 2 Rewrite question with math symbols 12a. P(X>167)=? 12b. P( X >167)=? Elementary Statistics Central Limit Theorem Application Setup (v.090328) p.277 #14ab p.277 #15ab p.277 #16ab p.278 #19ab p.278 #20ab Page 2 of 2 Elementary Statistics Confidence Interval Facts (v.090721) Confidence Interval Facts This document summarizes terms and facts about confidence interval problems. The notation follows Triola’s Essentials of Statistics. Measurement Proportion Problems Problems The goal is to Population mean Population proportion μ estimate … Triola Term: p The given Case 1: the raw data, n = sample size sampled data x1 ,..., xn . x = # successes Case 2: the sample mean, x , (# failures = n − x) and sample std. dev., s. In either case, the population standard deviation, σ , might be also given. Check Assumptions Proceed only if n > 30 or 1. the distribution is normal 2. the sample is random Point estimate Sample mean x + ... + xn x= 1 n s ⎛ σ ⎞ ⎜≈ ⎟ n ⎝ n⎠ where s is the sample standard deviation. Standard Error (of the sample statistic) Confidence Level α /2= 1 − CL 2 Critical Value (for a given CL) E.g., CL = 90% = .90 CL = 95% = .95 CL = 99% = .99 E.g., CL = .90 → α / 2 = .05 CL = .95 → α / 2 = .025 CL = .99 → α / 2 = .005 zα /2 = −invNorm(α / 2) Some times the problem gives you the sample proportion, p̂ , and you need to compute x = npˆ . Proceed only if #successes ≥ 5 and 1. #failures ≥ 5 2. the sample is random, and independent to each other. Sample proportion x pˆ = n pˆ (1 − pˆ ) n ⎛ p (1 − p ) ⎞ ⎜≈ ⎟ ⎜ ⎟ n ⎝ ⎠ E.g., CL = 90% = .90 CL = 95% = .95 CL = 99% = .99 E.g., CL = .90 → α / 2 = .05 CL = .95 → α / 2 = .025 CL = .99 → α / 2 = .005 zα / 2 = −invNorm(α / 2) tα /2 = −invT (α / 2, n − 1) © 2009 Chi-yin Pang Page 1 of 2 Comments This step is for making sure that the sample size is large enough for the Central Limit Theorem. This step is also for documenting assumptions. This is our best 1-number guess. (It might be close, but it is almost never right on the dot.) This is an approximation of σ x (the standard deviation of the sample means), and σ p̂ (the standard deviation of the sample proportion). The “center area” (center probability). The “tail area” (tail probability) of one tail. The number of standard deviations from the center to the edge of the “center area.” Elementary Statistics Margin of Error Confidence Interval TI Function Confidence Interval Facts (v.090721) E = tα / 2 ( std .err.) E = zα / 2 ( std .err.) ⎛ s ⎞ = tα / 2 ⎜ ⎟ ⎝ n⎠ = zα /2 x±E pˆ ± E ( x − E, x + E ) ( pˆ − E , pˆ + E ) If σ is not available: (the usual case) [STAT]>TESTS>A:1-PropZInt… [STAT]>TESTS>8:TInterval… If σ is available: [STAT]>TESTS>7:ZInterval… Given margin of error, E, to determine n. ⎛z σ ⎞ ⎛z s⎞ n ≥ ⎜ α /2 ⎟ ≈ ⎜ α /2 ⎟ ⎝ E ⎠ ⎝ E ⎠ 2 2 The length from the center to the edge. pˆ (1 − pˆ ) n The input for number of successes, x, MUST be an integer (no decimal).* If estimate of p (π) is [z ] n ≥ α /2 We hope to capture the population parameter within this interval. * If the number of successes, x, is computed by x = npˆ , always round to the nearest integer. pˆ (1 − pˆ ) E2 2 available: If estimate of p (π) is not available: ⎛ z × 0.5 ⎞ [ zα /2 ] 0.25 n ≥ ⎜ α /2 ⎟ = E E2 ⎝ ⎠ 2 If the result is a decimal, ALWAYS round up. If we increase n… E would be smaller, therefore ( x − E , x + E ) would be narrower. 2 If the result is a decimal, ALWAYS round UP. E would be smaller, therefore ( x − E , x + E ) would be narrower. 
Because ⎛ s ⎞ E = tα / 2 ⎜ ⎟ ⎝ n⎠ E = zα /2 If we increase CL (confidence level) … E would be bigger, therefore ( x − E , x + E ) would be E would be bigger, therefore ( x − E , x + E ) would be wider. wider. If you want smaller E (higher precision) For larger σ or s… Either increase n or decrease CL. Either increase n or decrease CL. E would be bigger, therefore ( x − E , x + E ) would be (Not applicable.) wider. For p (π) closer to 0.5 … (Not applicable.) © 2009 Chi-yin Pang E would be bigger, therefore ( x − E , x + E ) would be wider. Page 2 of 2 pˆ (1 − pˆ ) n Multiple n by 4 would cut E in half. San Jose City College, Spring 2007 Math 63 Statistics Interpretation of Confident Interval What does it mean by the following sentence? “The 95% confident interval of weekly income is ($371, $509).” CORRECT: • Standard acceptable answer: “We are 95% confident that the population’s mean weekly income, μ, is between $371 and $509.” Note: Although a little ambiguous, this is the standard acceptable answer. The following is what “95% confident” really mean. • Precise interpretation: “When we compute the confident interval this way, on average, 95 out of 100 of these intervals would capture the weekly income’s true mean (the population mean, μ). This time we have calculated the interval to be ($371, $509).” INCORRECT: • Each worker makes between $371 to $509 per week. Note: This is just plain wrong. • 95% of workers make between $371 to $509 per week. Note: This is just plain wrong. • Any sample mean x would have a 95% chance to be within ($371, $509). Note: This is wrong. The confident interval tries to capture μ and not x ’s. • If we compute the mean earning, μ, from the population census, 95% of the time it would be within ($371, $509). Note: This is nonsense. If we compute μ from the population, there is only one number. That number does not change from time to time. • There is a 95% chance that the mean earning, μ, is within ($371, $509). Note: μ is a fixed number, it is either in the interval (therefore probability equals 1) or not in the interval (therefore probability equals 0). The 95% probability refers to this type of interval, and not this particular interval or any particular interval. Ver. 071022 p.1 of 1 Elementary Statistics (See [Triola 2008, pp.378-380].) Decision: Reject H0 E.g., 1. Patient tested positive 2. Fire alarm set off 3. Failed the EPA standard 4. The suspect was “found” guilty Hypothesis Testing: Errors & Power of a Test (Ver. 071211) Page 1 of 2 H0 is Actually True 1. Patient is healthy 2. No fire in the building 3. The factory’s discharge has no downstream environment effects 4. The suspect is innocent Type I Error (with probability α) α = P(rejecting H 0 | H 0 is true) DAMAGE: 1. Scared the patient (small cost) 2. Annoyance from the buzz 3. Waste money to comply with an unreasonable standard 4. Innocent got executed The The distribution of the test statistics if H0 is true. (Dotted line.) The distribution of the test statistics of the population that we sampled. (The solid line. In this case, the dotted line and the solid lies are the same.) H0 is Actually False 1. The patient has the disease 2. There is a fire in the building 3. The factory’s discharge has downstream environmental effects 4. The suspect committed the murder Correct decision 1 − β = " The Power of a test" = P(rejecting H 0 | H 0 is false) = The probability of supportinga true H 1 . The The distribution of the test statistics if H0 is true. (Dotted line.) 
The distribution of the test statistics of the population that we sampled. (The solid line.) P(Type-I Error) = P(reject H0 | H0) = α = level of significance α = level of significance P value < α. Reject H0, a WRONG decision. Critical Value = – invNorm(α) Decision: Fail to reject H0 E.g., 1. Patient tested negative 2. Fire alarm did not set off 3. Met EPA standard 4. The suspect was “found” not guilty Critical Value = – invNorm(α) The sample’s test statistics. P value < α. Reject H0, a correct decision. The sample’s test statistics. Type II error (with probability β) Correct decision β = P (not rejecting H 0 | H 0 is false) The The distribution of the test statistics if H0 is true. (Dotted line.) The distribution of the test statistics of the population that we sampled. (The solid line. In this case, the dotted line and the solid lies are the same.) DAMAGE: 1. Lost chance for treatment (may be life) 2. Miss escaping for fire 3. Continue to pollute the environment 4. The murderer got unpunished The The distribution of the test statistics if H0 is true. (Dotted line.) The distribution of the test statistics of the population that we sampled. (The solid line.) P value > α. Fail to Reject H0, a correct decision. α = level of significance P(Type-II Error) = P(failed to reject H0 | ~H0) = β (this is not assessable) P value > α. Fail to reject H0, a WRONG decision. α = level of significance The sample’s test statistics. Critical Value = – invNorm(α) The sample’s test statistics. Critical Value = – invNorm(α) Elementary Statistics NAME: _______________________ (v.090401) Hypothesis Testing: “Defining H0, H1” Homework These exercises drill you on setting up the correct Null and Alternative hypotheses for “Test of Significance.” For each of the referenced word problems, read the context of that problem and fill-in the blanks. (Just use the context of the problems and you DO NOT need to solve what the problems ask.) Ref. Mean or The English meaning for μ or p “Claim in Math Null & Shade Tail & [Tri Proportion Notation” Alternative State “2-Tail” 2008] Hypothesis “Left”, or “Right” p.394 Mean Prop p = the true proportion of p = .25 H0: p = .25 #5 the green flowered pea H1: p ≠ .25 2-Tail p.403 #10 Mean Prop μ = the population’s mean body temperature μ < 98.6˚F H0: μ = 98.6˚F H1: μ < 98.6˚F Left p.426 #1a Mean Prop p.426 #1b Mean Prop p.426 #1c Mean Prop p.426 #2 Mean Prop p.426 #3 Mean Prop © Chi-yin Pang, 2008 Page 1 of 2 Elementary Statistics p.426 #4 Mean Prop p.426 #5 Mean Prop p.427 #6 Mean Prop p.427 #8 Mean Prop p.427 #9 Mean Prop p.427 #11 Mean Prop p.427 #12 Mean Prop © Chi-yin Pang, 2008 NAME: _______________________ (v.090401) Page 2 of 2 Elementary Statistics Project 2 Assignment (v.09728) Elementary Statistics Project 2: Conf. Int. or Hypo. Test Assignment Learning Objective The objectives are to: • Gain experience for a real “field” data collection. • Apply Confident Interval and/or Hypothesis Testing to analyze data and make statistical inference. • Exercise your written presentation skill. Due Dates Proposal due _______________ (send me a 1-line e-mail, chi-yin_pang@alumni.hmc.edu) Report due _________________ Topic Collect two sets of quantitative data. (If appropriate, you may use the data you have collected for project 1.) Then use the 2-sample Confidence Interval or Hypothesis Testing to analyze and make inference from the raw data. Make you own “fun” topic; especially, a topic that you have ready access to the data. “Fun” makes work easy. 
If you are not interested in the topic, it is like pulling your own teeth. Your report must include the following 4 sections: Report’s 1. Introduction: Introduce your topic. (It would be good to discuss your Content original prediction of the result.) Requirements 2. Raw Data: Describe how you collect the data, the source of the data, and present the raw data. 3. Analysis: Describe your analysis technique. 4. Conclusion: Make your conclusion(s). If appropriate, state what you would do differently for further investigation. Editing Requirement The requirements are: • It has to be totally word processed. • Caution Formulae are typed with Equation Editor. E.g., μ = ∑x = x n There are may potential “time-eaters” along the way, such as: • Collecting data. • Wanting to collect more data to investigating further. • Thinking and writing up the conclusion. Start the project early to find out what might surprise you. © 2009 Chi-yin Pang Page 1 of 1 1 + ... + xn n Elementary Statistics Project 2 “Independent Samples” Example (v.090421) Project 2 “Independent Samples” Example (v.090421) Project 2 “Independent Samples” Example Analysis When a Car Ages, Does the MPG Change? I set up the problem as a 2-sample confidence interval problem. I am trying to estimate the difference of the means, (P1íP2). The data are independent. (The data did not come in pairs. In fact, the sample size for the “new car” MPG is not even the same as the sample size of the “old car” MPG. Chi-yin Pang April 16, 2008 Introduction When a car is new, everything should work perfectly and it’s performance should be at its peak. However, after the engine had some wear, it would have less internal friction and therefore may contribute to a higher mileage. This report investigates whether the MPG changes as a car ages. Method I use the data from the mileage record from my 1995 Toyota Previa. The records included the mileage between gas fill ups and the number of gallons for the fill-up. The first page of the record book is shown on the right. The MPG is computed by Number of miles since the last fill-up . I Number of gallons for this fill-up use a set of MPG data when the car was new (from 0 miles to 10,000 miles), and another set of data when the car was older (from 100,000 miles to 110,000 miles). Let P1 be the true MPG when the car was new and P2 be the true mean MPG when the car had 100,000 miles. I will construct the confidence interval of (P1íP2), and see whether it is totally positive, totally negative, or contains zero. MPG Elementary Statistics Step 1. Checks: The sample sizes are both greater than 30, therefore, there is no need to check for normality of the distributions. We assume that the sampling is random in the sense that the diving condition are different, the sample is an unbiased mix of various situations. Just for curiosity I entered the data into TI’s L1 and L2 and made a comparative box plot (shown on the right). The plots shows that the distributions are quite symmetrical and more importantly the median seems to be closed together. Therefore, I really do not expect a big difference betweenP1 and P2. Step 2. Computation: Since V1 and V2 are unknown, I used the T-distribution for the analysis. I used TI-83’s 2-SamTInt with a 95% confidence level. (See screen shots on the right.) The input and the resulted confidence interval are shown below. miles per gallon Raw Data The number of miles driven and the gallons for the fill-ups were entered in to a spreadsheet and the MPGs for each fill-up were computed. 
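NOTE (optional software check): The 2-SampTInt step in this example can be reproduced outside the calculator. The Python sketch below is an added illustration with made-up MPG numbers (not the Previa record); it computes the same style of confidence interval for the difference of two means from independent samples, using a Welch-type standard error and degrees of freedom, in the spirit of TI's 2-SampTInt with "Pooled: No".

```python
# Sketch: a Welch-style 2-sample t confidence interval for mu1 - mu2.
# The data arrays below are made up for illustration only.
import numpy as np
from scipy.stats import t

new_mpg = np.array([21.5, 22.3, 20.8, 23.1, 21.9, 22.7, 20.5, 22.0])        # hypothetical
old_mpg = np.array([21.0, 22.8, 21.7, 20.9, 22.4, 21.3, 22.1, 21.6, 20.7])  # hypothetical

def welch_ci(x, y, conf=0.95):
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    se = np.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    tcrit = t.ppf(1 - (1 - conf) / 2, df)
    diff = x.mean() - y.mean()
    return diff - tcrit * se, diff + tcrit * se

print(welch_ci(new_mpg, old_mpg))
```

As in the example, the question is then whether the resulting interval is totally positive, totally negative, or contains zero.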
The table on the right lists the resulted MPG data. Data Column 1 has the MPG’s for the first 10,000 miles and Column 2 has the MPG’s for 100,000 to 110,000 miles. (The sample sizes are 35 and 42 respectively.) Page 1 of 2 Conclusion We are 95% confident that difference of the mean MPG’s of the new and old Toyota Previa is between í0.65 MPG and 1.01 MPG. Since the confidence interval is neither totally positive nor totally negative, (P1íP2) could be 0. In other words, the MPG did not change significantly when the Previa aged. We did not see the MPG got better or get worse. Before the analysis, I mentioned that a wore engine with less friction might result in higher MPG. Apparently, that is not the case, but perhaps, the modern engines have less wear and a longer life. It would be interesting to compare the MPG again when the Previa get to be 200,000 miles. Page 2 of 2 Elementary Statistics Project 2, Matched Pairs Example (ver.090421) Elementary Statistics Project 2, Matched Pairs Example (ver.090421) Raw Data Project 2 “Matched Pairs” Example Does Chinese or English Use Fewer Syllables to Say the Same Thing? Chi-yin Pang I decided to sample 20 commonly accessible passages. I took a random sample of 20 verses from the Book of Daniel, which has a mix of narrative and prophetic prose. Daniel has 12 chapters with a total of 357 verses. I used Microsoft Excel’s RANDBETWEEN(1,357) function to generate 20 random verses as sample passages. (Note: The sample is “with replacement.”) The Chinese translation used is the Chinese Union Version (CUV, 1919) and the English translation used is the American Standard Version (ASV, 1901). Random verse Chapter Verse CUV Chinese Union Ver. ASV American Std. Ver. x= Ch/Am 1 168 5 31 17 22 0.773 2 102 4 2 21 26 0.808 3 243 8 9 28 35 0.800 4 87 3 17 37 38 0.974 5 65 2 44 46 66 0.697 6 334 11 35 38 42 0.905 7 158 5 21 55 81 0.679 8 117 4 17 52 70 0.743 9 10 1 10 64 63 1.016 10 185 6 17 36 45 0.800 11 11 1 11 29 36 0.806 12 98 3 28 60 69 0.870 13 98 3 28 60 69 0.870 14 319 11 20 47 45 1.044 15 149 5 12 60 74 0.811 16 122 4 22 29 38 0.763 17 267 9 16 64 77 0.831 18 153 5 16 44 73 0.603 19 145 5 8 27 29 0.931 20 109 4 9 55 59 0.932 November 15, 2008 Introduction Whenever I read something, I often feel that Chinese can say it more concisely than English. Is that really true or just an unfounded feeling. I have used confident level technique to prove that on average Chinese needs fewer syllables than English to say the same thing. Method I set out to compare parallel Chinese/English passages. I count up the number of syllables in Number of Syllables in Chinese each language and compute the ratio x . If Chinese uses more Number of Syllables in English syllables, then x > 1. If Chinese uses fewer syllables, then x < 1. For exampled, with the following parallel passage, Daniel 5:31: ጆזԳՕܓڣքԼԲᄣΔ࠷Ա૫೬ࢍഏΖ Syllables = 17 (Note: Each Chinese character has one syllable.) x Number of Syllables in Chinese Number of Syllables in English And Darius the Mede received the kingdom, being about threescore and two years old. Syllables = 22 17 | 0.773 22 I planned to take a sample of n parallel passages and use the sample ratios ^x1 , x2 ,..., xn ` to compute the 95% confident interval (CI) of the mean ratio. Then, basing on the CI’s position with the 1 (the neutral ratio) I can draw conclusions: x If CI is totally less than 1, then P 1 . Therefore, Chinese use fewer syllables on average. x If CI contains 1, then P could be 1. 
Therefore, the evidence supports neither “Chinese use fewer syllables on average” nor “Chinese use more syllables on average.” x If CI is totally greater than 1, then P ! 1 . Therefore, Chinese use more syllables on average. Page 1 of 3 The last column, the “Chinese/English” ratio is used as the sample data. Page 2 of 3 Elementary Statistics Project 2, Matched Pairs Example (ver.090421) Analysis I choose to use Statdisk for performing the analysis, because of the ease of copy-and-paste of the computer results. Step 1. I copy the xi’s from Excel to Statdisk’s data column 3. Step 2. Because of my small sample, n = 20, I use “Data > Explore Data” to verify that the data is approximately normal. The histogram looks approximately normal and the data’s sample mean and sample standard deviation are also computed and ready for use as input for the confident interval computation. x 0.8326911 s 0.1126609 Step 3. I use the Mean One-Sample function. (“Analysis > Confidence Intervals > Mean One-Sample”). The 95% confident interval of P (the mean Chinese/ English ratio) is (0.78, 0.89). Step 4. Implication: Since the entire confident interval is less than 1, we are 95% confident that the mean Chinese/English syllable ratio is between 0.78 to 0.89. Conclusion Since I have only taken a convenience sample from the Book of Daniel, we must be careful to make generalization. Nevertheless, for the type of literature that is similar to the Book of Daniel we expect that on average Chinese to use fewer syllables that English. It fact, we expected it to take only about 78% to 89% of the syllables in English. The statistical analysis confirmed my suspicion that Chinese is more concise. However, the ratio is not as low as I expect. I was expecting something like 2/3. My analysis proved my “guesstimate” wrong. Page 3 of 3 Elementary Statistics 2-μ Independent Sample, Data Collection NAME: _______________________ (v.090721) Hypothesis Testing (& Confidence Interval): 2-μ Independent Sample, Data Collection Worksheet This template guides you on: (1) setting up the correct Null and Alternative hypotheses for 2sample measurement problem, and (2) collecting the necessary information for the computation. Problem: Triola p.459 #9 Problem: Triola p.459 #10 “Claim in Math”: μechin ≠ μ plac “Claim in Math”: μ1 = μechin = pop. mean of # days of fever" μ1 = μ = when treated with echinacea. μ2 = μ plac = pop. mean of # days of fever" μ2 = μ = when "treated" with placebo. H 0 : μ1 = μ2 Hypotheses: Hypotheses: H1 : μ1 ≠ μ2 Population 1 Desc. of pop. Echinacea Sample Mean x1 = 0.81 days Sample StdDev s1 = 1.50 days H0 : H1 : Population 2 Placebo x2 = 0.64 days Desc. of pop. Sample Mean s2 = 1.16 days Sample StdDev Sample Size n1 = 337 n2 = 370 Pop. StdDev σ 1 = not given σ 2 = not given Population 1 Sample Size Problem: Triola p.459 #11 Pop. StdDev (σ) Problem: Triola p.460 #13 “Claim in Math”: “Claim in Math”: μ1 = μ = μ1 = μ = μ2 = μ = μ2 = μ = Hypotheses: H0 : Hypotheses: H1 : Population 1 H0 : H1 : Population 2 Population 1 Desc. of pop. Sample Mean Desc. of pop. Sample Mean Sample StdDev Sample StdDev Sample Size Sample Size Pop. StdDev (σ) Pop. StdDev (σ) © 2009 Chi-yin Pang Population 2 Page 1 of 2 Population 2 Elementary Statistics 2-μ Independent Sample, Data Collection NAME: _______________________ (v.090721) Problem: Triola p.460 #16 Problem: Triola p.460 #17 “Claim in Math”: “Claim in Math”: μ1 = μ = μ1 = μ = μ2 = μ = μ2 = μ = Hypotheses: H0 : Hypotheses: H1 : Population 1 H0 : H1 : Population 2 Population 1 Desc. 
of pop. Sample Mean Desc. of pop. Sample Mean Sample StdDev Sample StdDev Sample Size Sample Size Pop. StdDev (σ) Problem: Triola p.461 #22 Pop. StdDev (σ) Problem: Triola p.477 #5b “Claim in Math”: “Claim in Math”: μ1 = μ = μ1 = μ = μ2 = μ = μ2 = μ = Hypotheses: H0 : Hypotheses: H1 : Population 1 H0 : H1 : Population 2 Population 1 Desc. of pop. Sample Mean Desc. of pop. Sample Mean Sample StdDev Sample StdDev Sample Size Sample Size Pop. StdDev (σ) Pop. StdDev (σ) © 2009 Chi-yin Pang Page 2 of 2 Population 2 Population 2 Elementary Statistics 2-Prop Data Collection NAME: _______________________ (v.090728) Hypothesis Testing (& Confidence Interval): “2-Prop Data Collection” Worksheet (for [Tri08] Sect.9-2) Problem: Triola08 p.444, #11 (Home Field Advantage) Problem: Triola08 p.444, #12 (Gloves) Success means: The home team won “Claim in Math”: pbb = p fb Success means: The glove leaks virus “Claim in Math”: pvin > plat p1 = pbb = pop. prop. of home team win in basketball p1 = pvin = pop. prop. of vinyl gloves that leaks virus p2 = p fb = pop. prop. of home team win in football p2 = p fb = Hypotheses: H 0 : p1 = p2 Hypotheses: H1 : p1 ≠ p2 Description of populations # “Successes” Population 1 Population 2 Basketball games Football games H 0 : p1 = p2 H1 : Population 1 x1 = 127 x2 = Description of populations # “Successes” Sample Size n1 = 198 n2 = Sample Size Sample Prop. pˆ1 not pˆ 2 Sample Prop. x given; or x = npˆ x given; or x = npˆ Population 2 Vinyl gloves x1 = n1 pˆ1 x2 = = 240 × .63 = n1 = 240 pˆ1 =63%= n2 = pˆ 2 explicitely given Problem: Triola08 p.445, #14 Problem: Triola08 p.445, #15 Success means: “Claim in Math”: Success means: “Claim in Math”: p1 = p = pop. proportion of p1 = p = pop. proportion of p2 = p = p2 = p = Hypotheses: H 0 : p1 = p2 Hypotheses: H1 : Population 1 Description of populations # “Successes” H 0 : p1 = p2 H1 : Population 2 Population 1 x1 = x2 = Description of populations # “Successes” Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ © 2009 Chi-yin Pang Population 2 x1 = x2 = Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ Page 1 of 2 Elementary Statistics 2-Prop Data Collection NAME: _______________________ (v.090728) Problem: Triola08 p.445, #17 Problem: Triola08 p.445, #19 Success means: “Claim in Math”: Success means: “Claim in Math”: p1 = p = pop. proportion of p1 = p = pop. proportion of p2 = p = p2 = p = Hypotheses: H 0 : p1 = p2 Hypotheses: H1 : Population 1 Description of populations # “Successes” H 0 : p1 = p2 H1 : Population 2 Population 1 x1 = x2 = Description of populations # “Successes” Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ x1 = x2 = Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ Problem: Triola08 p.446, #20 Problem: Triola08 p.446, #21 Success means: “Claim in Math”: Success means: “Claim in Math”: p1 = p = pop. proportion of p1 = p = pop. proportion of p2 = p = p2 = p = Hypotheses: H 0 : p1 = p2 Hypotheses: H1 : Population 1 Description of populations # “Successes” H 0 : p1 = p2 H1 : Population 2 Population 1 x1 = x2 = Description of populations # “Successes” Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ © 2009 Chi-yin Pang Population 2 Population 2 x1 = x2 = Sample Size n1 = n2 = Sample Prop. pˆ1 = pˆ 2 x given; or x = npˆ Page 2 of 2 Elementary Statistics Correlation & Regression Example (v.090724) Correlation & Regression Example We use the example from [Tri08, p.504 #16]. The data has vehicle weight (lb.) 
and fuel efficiency (mpg, “miles per gallon gas mileage”) for 7 cars. Step 4a: Interpret the Result of the Test. t= the test statistic, t-score = df= the degree of freedom = p= the p-value = Conclusion of this test: (Reject H0 if p-value≤α) Step 1, 2: Assessment; Define H0, H1. Explanatory Variable (x): Response Variable (y): Assess correlation with the scatterplot. r= the sample Correlation Coefficient = r2= the Proportion of the Variation of y explained by the regression line = • If there is an inferential outlier, see whether it is appropriate to delete it. • If the scatterplot shows a curve, DON’T do linear regression. • Otherwise, proceed to define H1 as: ρ ≠ 0 or ρ < 0 or ρ > 0 Formulate the hypotheses: H 0 : ρ = 0 (i.e., there is no correlation) Step 4b: Graph the Regression Line. “a=” and “b=” defines the regression line: Y1= ŷ = a + bx = Use ZoomStat to get the original data and the regression line, Y1. H1 : Note: “r” represents the sample coefficient of correlation, and the Greek “ρ” (pronounced “roe”) represents the population coefficient of correlation. Step 3: Compute the Regression Line & p-value. Use [STAT]>TEST>*:LinRegTTest . Step 5: Predict ŷ for a specific x. • If we failed to reject H0, DON’T predict, because there is no correlation. • If x is outside the range of the xi’s, DON’T predict. It may give invalid prediction. • If x is within the range of the xi’s, then predict the ŷ value with ŷ = a + bx . Then use [Y=] . LinRegTTest does the following: • Fit a linear regression line, ŷ = a + bx to the data points, (x,y)’s; and put the equation into Y1. Note: This is called the “regression line”. This line has the “least square” property that minimizes the sum of all the square of the ydistances between the data points and the line. See [Tri08, pp. 488, 510]. • Compute r, and compute p-value for testing the hypotheses. Result: Use [Y=] to see the equation Y1. © 2009 Chi-yin Pang Step 4c: Interpret the Slope, b, of the Line. y's unit b = −0.00797 = x's unit Therefore, for every additional _____ of _________, the _________ is ______ (increased/decreased) by ______ ______ on average. Equivalently: “For every additional 100 lb. of vehicle weight, the fuel efficiency is decreased by 0.797 mpg on average.” E.g. 1: Geo Metro weights 2623 lb. Predict the mpg. The weight x=2623 lb. is within the range of xi’s [2290 lb, 3985 lb], therefore we can use the equation to predict: yˆ = a + bx = ______ − .00797 * ______ ≈ 33.8mpg E.g. 2: Smart Car weights 1808 lb. Predict the mpg. (Ans.: Don’t. Why?) Page 1 of 1 Chi-square (χ2) Goodness of Fit Test (Ver.081209) STEP 1: Define problem & Collection Information Elementary Statistic Chi-square (χ2) Goodness of Fit (GOF) Test Worksheet H0: H1: Given: 1. A discrete distribution of k categories, P ( Categoryi ) = pi and p1 + ... + pk = 1 . 11 9 8 Was the sampling random? 5 2. A sample of size n was drawn and the observed frequencies for the categories are: O1, O2, …, Ok To Test: How good the observation fit the distribution. H0: The population has the given distribution. H1: The population has a different distribution. Category i Item Name L1: Expected Prob. P (Cati ) = pi L2: Observed Freq. Oi L3: Expected Freq. L4 = (L2–L3)^2/L3 (Oi − Ei ) 2 Ei = npi STEP 2: Check assumptions 7 Ei 1 Check ∑p i =1? Check the expected frequencies Ei = npi . Is Ei ≥ 5 for each i? 
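NOTE (optional software check): The χ² statistic and P value that Steps 3 and 4 below produce on the TI can also be checked in Python. The sketch is an added aside with made-up category counts and equal expected probabilities; it assumes SciPy's chisquare function.

```python
# Sketch: chi-square goodness-of-fit test, mirroring Steps 3-4 of this worksheet.
# The observed counts and expected probabilities below are made-up illustrations.
import numpy as np
from scipy.stats import chisquare

observed   = np.array([11, 9, 8, 5, 7])               # O_i, with n = sum(O_i) = 40
p_expected = np.array([0.2, 0.2, 0.2, 0.2, 0.2])      # p_i, must sum to 1
expected   = observed.sum() * p_expected              # E_i = n * p_i (check each E_i >= 5)

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)   # chi2 = sum((O-E)^2 / E) and its right-tail P value
# degrees of freedom = k - 1 = 4; reject H0 if p_value <= alpha
```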
STEP 3: Compute the Chi-square Statistics, & P value χ =∑ 2 ( Oi − Ei ) Ei 2 = P value = right tail probability of the Chi-square distribution with (k – 1) degrees of freedom 2 = χ cdf( χ 2 , ∞, k – 1) 2 = χ cdf( ,E99, )= Use: [2nd][VARS](DISTR)>DISTR>*:χ2cdf(… TI-84 also has: [STATS]>TESTS>D:χ2GOF-Test… 2 STEP 4: Make conclusion 3 4 5 6 ∑p i = n = ∑ Oi = sum( L2 ) = χ2 = ∑ (Oi − Ei )2 = sum( L4 ) = Ei Elementary Statistics Test of Dependency Worksheet (v.090501) STEP 2: Check the assumptions. Test of Dependency Worksheet Given: A frequency contingency table Female Male E.g., from a 1992 poll conducted by Democrat 48 36 Univ. of Montana. [BVD07.631] Independent 16 24 Test the claim: Republican 33 45 The Response variable is Dependent on the Explanatory variable. E.g., “Political affiliation” depends on “gender” in 1992. Explanatory Variable 2b. Assuming that H0 is true, and P ( Explantory Cat. j ) = ∑ All i ' s Oij / n compute & check the expected frequencies, Eij’s. * Enter the observed frequencies into matrix [A]. [2nd][x-1](MATRIX)>EDIT>1:[A] * Use χ2-Test to compute the expected frequencies, Expected frequencies will be written to [B]. * Copy [B] to the worksheet. * Are all the expected frequencies ≥ 5? STEP 3: Compute the Test Statistics & P value. Use the χ2 worksheet OR USE TI’s χ2-Test results. Test Statistics = χ = ∑ 2 O11 O12 O13 O14 E11 E12 E13 E14 O21 O22 O23 O24 E21 E22 E23 E24 O31 O32 O33 O34 E31 E32 E33 E34 O41 O42 O43 O44 E41 E42 E43 E44 ©2009 by Chi-yin Pang and P ( Response Cat. i ) = ∑ All j ' s Oij / n [STAT]>TESTS>*: χ2-Test… Use Observed: [A] and Expected: [B]. Enter the category names and enter the observed frequencies into the table. V a r where n = ∑ Oij = sum of all observed frequencies 2a. Is the sample random? STEP 1: Define Problem & Collect Information H0: Response and Explanatory are Independent H1: Response and Explanatory are Dependent R e s p o n s e Eij = n ⋅ P ( Response Cat. i ) ⋅ P ( Explantory Cat. j ) (O ij − Eij ) Eij 2 = d . f . = degrees of freedom = ( Rows − 1)(Columns − 1) = P value = right tail prob. of the χ2 distribution with d.f. degrees of freedom = χ2cdf( χ 2 , ∞ , d.f. ) = χ2cdf( , E99, STEP 4: Make conclusion. Page 1 of 1 )= Elementary Statistics Test of Dependency Worksheet (v.090501) Test of Dependency Worksheet 2nd 118 167 3rd 178 528 Crew 212 673 Test the claim: The Response variable is Dependent on the Explanatory variable. E.g., “Survival” depends on “Passenger Type”. Eij = n ⋅ P ( Response Cat. i ) ⋅ P ( Explantory Cat. j ) where n = ∑ Oij = sum of all observed frequencies 2a. Is the sample random? Given: A frequency contingency table E.g., 1st Alive 202 Dead 123 STEP 2: Check the assumptions. and P ( Response Cat. i ) = ∑ All j ' s Oij / n 2b. Assuming that H0 is true, and P ( Explantory Cat. j ) = ∑ All i ' s Oij / n compute & check the expected frequencies, Eij’s. * Enter the observed frequencies into matrix [A]. [2nd][x-1](MATRIX)>EDIT>1:[A] * Use χ2-Test to compute the expected frequencies, [STAT]>TESTS>*: χ2-Test… Use Observed: [A] and Expected: [B]. STEP 1: Define Problem & Collect Information H0: Response and Explanatory are Independent H1: Response and Explanatory are Dependent Expected frequencies will be written to [B]. * Copy [B] to the worksheet. * Are all the expected frequencies ≥ 5? Enter the category names and enter the observed frequencies into the table. STEP 3: Compute the Test Statistics & P value. Use the χ2 worksheet OR USE TI’s χ2-Test results. 
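NOTE (optional software check): The χ²-Test used in this step can be reproduced in Python as well. The sketch below is an added aside; it applies SciPy's chi2_contingency to the gender-and-politics table from the first worksheet above and returns the test statistic, P value, degrees of freedom, and the expected counts (the analogue of TI's matrix [B]).

```python
# Sketch: chi-square test of independence (test of dependency), using the
# gender vs. political-affiliation table from the worksheet above.
import numpy as np
from scipy.stats import chi2_contingency

#                    Female  Male
observed = np.array([[48, 36],    # Democrat
                     [16, 24],    # Independent
                     [33, 45]])   # Republican

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)   # dof = (rows - 1)(columns - 1) = 2
print(expected)                  # E_ij, the analogue of TI's matrix [B]
# Reject H0 ("the variables are independent") if p_value <= alpha.
```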
Explanatory Variable Test Statistics = χ = ∑ 2 (O ij − Eij ) Eij 2 = d . f . = degrees of freedom = ( Rows − 1)(Columns − 1) = R e s p o n s e V a r O11 O12 O13 O14 E11 E12 E13 E14 O21 O22 O23 O24 E21 E22 E23 E24 O31 O32 O33 O34 E31 E32 E33 E34 O41 O42 O43 O44 E41 E42 E43 E44 ©2009 by Chi-yin Pang P value = right tail prob. of the χ2 distribution with d.f. degrees of freedom 2 = χ cdf( χ 2 , ∞ , d.f. ) = χ2cdf( , E99, STEP 4: Make conclusion. Page 1 of 1 )= Page 6 Grand Road Map References are from Triola (2008) STATISTICAL INFERENCE 2-SAMPLE DATA ANALYSIS Hypothesis Testing Confidence Interval §8-2 Page 6, 7 Page 2 Linear Regression Prediction Page 4 1-Sample 2-Sample Difference Page 3 1-Sample p σ §7-2 §7-5 2-Sample Difference §10-4 Page 8 μ Page 5 Page 8 μ p μ §9-2 p σ §8-3 §8-6 Matched Pairs Independent Samples Page 8 μ p §9-2 Matched Pairs Independent Samples §9-4 ChiSquare (χ2) Tests Page 9 ANOVA §11-4 σ σ ‘s σ’s σ σ σ’s σ’s Known §7-3 Unknown §7-4 Known §9-3 Unknown §9-3 Known §8-4 Unknown §8-5 Known §9-3 Unknown §9-3 Page 6, 7 Linear Correlation & Regression’s ρ, β (coeff. of correl. & slope) §10-2 #39, 10-3 §9-4 σ Ver.080430 © 2008 by Chi-yin Pang Other Tests Goodness -of-fit Independence §11-2 §11-3 σ §7-5 §8-6 1 CONFIDENCE INTERVAL Mean (μ) ⎛z σ ⎞ ⎛z s⎞ n ≥ ⎜ α /2 ⎟ ≈ ⎜ α /2 ⎟ ⎝ E ⎠ ⎝ E ⎠ No info. for the dist. and n≤30 Use nonparametric or bootstrapping, p.360. (Don’t touch!) 1. Random Sample? 2. n>30 or Normally dist. Always round up. (p.323) σ Known (§7-3) 1 − α (comput α = 1 − C.L.) Critical Value: zα / 2 = invNorm(α / 2) Point Estimate: ⎛ σ ⎞ E = zα / 2σ x = zα / 2 ⎜ ⎟ ⎝ n⎠ x Confidence Interval: ( x − E, x + E ) Margin of Error: [STAT] > TESTS > 7:ZInterval If given raw data, use “Input: Data” & L1. If given statistics, use “Input: Stats”. v.090710+ © 2009 by Chi-yin Pang If estimate of p is available, [ zα /2 ] 2 n≥ E Check: 1. Random Sample? ˆˆ pq 2 If estimate for p is not avail., . [zα / 2 ] ⋅ 0.25 E2 Always round up. (p.308) n≥ x , qˆ = 1 − pˆ n #successes ≥ 5? # successes = x = npˆ # failures ≥ 5? # failures = n − x = nqˆ 2. n big enough? 2 1 − α (comput α = 1 − C.L.) Degrees of Freedom: d . f . = n − 1 Critical Value: tα / 2 has P (−tα / 2 < t < tα / 2 ) = α / 2 with the "student's-t" distribution with (n − 1) d.f. For TI-84 tα / 2 = invT (α / 2, n − 1) Point Estimate: ⎛ s ⎞ E = tα / 2 ⎜ ⎟ ⎝ n⎠ x Confidence Interval: ( x − E, x + E ) Margin of Error: (See p.304, 305, 320.) Or Use TI’s ZInterval Success="1" This involves the ratio (x/n) of successes (x) out of the total (n). (See p.331, Tbl.A-3.) Or Use TI’s TInterval [STAT] > TESTS > 8:TInterval If given raw data, use “Input: Data” & L1. If given statistics, use “Input: Stats”. Conclusion: (for “mean” example ONLY) “We are 95% confident that the population’s mean weekly income is between $371 and $509.” pˆ = (Note: Diffferent books use 5 to 15.) σ Not Known (§7-4) Confident Level: Confident Level: 0.2 Failure="0" Det. Min. n (§7-2) Check: 2 0.8 Proportion (p) (§7-2) This involves average of measurements, e.g., weight. Det. Min. n (§7-3) 2 for “trapping” population mean, μ or population proportion, p Given a confidence level, 1-α (e.g., C.L.=0.95) Prob Triola (2008) References ZInterval: p.338 TInerval: p.338 1-PropZInt: p.312 Critical values: t-table: A-3 z-table: A-2 Binom. Dist. 1 0.8 0.6 0.4 0.2 0 Confident Level: 1 − α (comput α = 1 − C.L.) 
Critical Value: zα / 2 = invNorm(α / 2) Margin of Error: E = zα / 2σ = zα / 2 pˆ = x / n Confidence Interval: ( pˆ − E , p + E ) pq ≈ zα / 2 n ˆˆ pq n Point Estimate: (See p.304-306.) Or Use TI’s 1-PropZInt [STAT] > TESTS > A:1-PropZInt Note: The input x MUST be an integer. If x is not given, compute x = n ⋅ pˆ Precise interpretation: e.g., “When we compute the confident interval this way, on average, 95 out of 100 of these intervals would capture the weekly income’s true mean (the population mean, μ). This time we have calculated the interval to be ($371, $509).” 2 HYPOTHESIS TESTS (§8-1) Mean (μ) Check: 1. Random Sample? 2. n>30 or Normally dist. σ Known (§8-4) Test Statistic : z = x − μ0 H 0 : μ = μ0 H 1 : μ ≠ μ o ⇒ P value = 2 tails prob. = 2 × normalcdf ( z , E 99) H 1 : μ < μ o ⇒ P value = Left tail prob. = normalcdf (− E 99, z ) H 1 : μ > μ o ⇒ P value = Right tail prob. = normalcdf ( z , E 99) Or Use TI’s Z-Test (pp.402, 404) [STAT] > TESTS > 1:Z-Test If given raw data, use “Input: Data” & L1. If given statistics, use “Input: Stats”. Test statistic is reported as “z =…” P-value is reported as “p =…” Conclusion: v.090710 © 2009 by Chi-yin Pang Given level of significance: α (e.g.,α=0.05) 1 0.8 0.6 0.4 0.2 0 0.8 0.2 Failure="0" No info. for the dist. and n≤30 Use nonparametric or bootstrapping, p.360. (Don’t touch!) 2. n big enough? x , qˆ = 1 − pˆ n #successes ≥ 5? # successes = x = npˆ # failures ≥ 5? # failures = n − x = nqˆ H1 : μ ≠ μ o ⇒ P value = 2 tails prob. = 2 × tcdf( t , E 99 , d.f.) H1 : μ < μ o ⇒ P value = Left tail prob. = tcdf( − E 99 , t, d.f.) pˆ = (Note: Diffferent books use 5 to 15.) σ Not Known (§8-5) x − μ0 s/ n Use t - distribution, d . f . = n − 1 H 0 : μ = μ0 Success="1" Proportion (p) (§8-3) Check: 1. Random Sample? . Test Statistic : t = σ/ n Binom. Dist. Prob Triola (2008) References H0 & H1: pp.3693 types of H1’s: pp.374-375 Reject/Failed to Reject H0: pp.376-377 Type of errors: pp.378-380 Critical Region/value, α: pp.272-373 P-value: p.374 Critical values: t-table: A-3 z-table: A-2 Test Statistic : z = pˆ − p 0 p0 q0 / n H 0 : p = p0 H 1 : p ≠ p o ⇒ P value = 2 tails prob. = 2 × normalcdf ( z , E 99) H 1 : p < p o ⇒ P value = Left tail prob. = normalcdf ( − E 99, z ) H 1 : p > p o ⇒ P value = Right tail prob. = normalcdf ( z , E 99) H1 : μ > μ o ⇒ P value = Right tail prob. = tcdf(t, E 99 , d.f.) Or Use TI’s T-Test (p.411) [STAT] > TESTS > 2:T-Test If given raw data, use “Input: Data” & L1. If given statistics, use “Input: Stats”. Test statistic is reported as “t =…” P-value is reported as “p =…” α = level of significance (assume 0.05, if not given) (pp.376-377) If P value ≤ α then reject H0. “The observed sample statistics has a P-value of __≤__ [α]. We reject Ho with a significance level of α =___. The evidence is statistically significant to support _______ [English of H1].” If P-value > α, then fail to reject H0. “The observed sample statistics has a P-value of __>__ [α]. We failed to reject Ho with a significance level of α =___. The evidence is not statistically significant to support ____________ [English of H1].” Or Use TI’s 1-PropZTest (pp.393-394) [STAT] > TESTS > 5:1-PropZTest Note: The input x MUST be an integer. If x is not given, compute x = n ⋅ pˆ Test statistic is reported as “z =…” P-value is reported as “p =…” (§8-2, p.374) P-Value is the probability for observing a result that is at least as extreme as the given sampled data, if H0 is true. Therefore, a small P-values is a strong evidence for H0 being false. 
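NOTE (optional software check): The test statistics and P values on this page can be verified without the calculator. The Python sketch below is an added aside with made-up sample numbers; it mirrors the T-Test and 1-PropZTest formulas above using SciPy's t and standard normal distributions.

```python
# Sketch: one-sample hypothesis tests, mirroring the formulas on this page.
import math
from scipy.stats import norm, t

# T-Test (sigma unknown):  t = (xbar - mu0) / (s / sqrt(n)).  Numbers are made up.
xbar, s, n, mu0 = 4.1, 1.2, 36, 4.5
t_stat = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1
p_left  = t.cdf(t_stat, df)            # H1: mu < mu0
p_right = t.sf(t_stat, df)             # H1: mu > mu0
p_two   = 2 * min(p_left, p_right)     # H1: mu != mu0
print(t_stat, p_left, p_two)

# 1-PropZTest:  z = (phat - p0) / sqrt(p0*q0/n).  Numbers are made up.
x, n2, p0 = 60, 100, 0.5
phat = x / n2
z = (phat - p0) / math.sqrt(p0 * (1 - p0) / n2)
print(z, 2 * norm.sf(abs(z)))          # two-tailed P value
```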
(§8-2, p.372) Significant Level, α, is the risk you are willing to take to reject a true null hypothesis, H0. 3 Matched Pairs (§9-4) Given a confidence level, 1-α (e.g., C.L.=0.95) Check: Ratio : 1. Random Sample? 2. (n1>30 & n2>30) or Both Normal ri = xi / y i Relative Diff. : rd i = xi − y i yi 0.2 0.6 0.6 0.4 0.4 0.2 0 Success="1" Failure="0" Success="1" Proportion (p1–p2) (§9-2) Check: 1. Random Sample? Difference : d i = xi − y i as a 1-sample Conf.Int problem. Caution: It might be more appropriate to analyze the Binom. Dist. 0.8 0.8 Failure="0" Independent Samples Given related data pairs, (xi,yi). Analyze the Binom. Dist. 1 0.8 0.6 0.4 0.2 0 Prob Mean (μ1−μ2) CONFIDENCE INTERVAL 2-SAMPLE (comparing 2 groups) Prob References are from Triola (2008) 2. n big enough? No info. for the dist. & (n1≤30 or n2≤30) Don’t Touch! . x1 , qˆ1 = 1 − pˆ 1 , n1 x2 , qˆ 2 = 1 − pˆ 2 n2 # successes1 ≥ 5 ? # successes1 = x1 = n1 pˆ 1 pˆ 1 = pˆ 2 = # failures1 ≥ 5 ? # failures1 = n1 − x1 = n1qˆ1 # successes2 ≥ 5 ? # successes2 = x2 = n2 pˆ 2 # failures2 ≥ 5 ? # failures2 = n2 − x2 = n2 qˆ 2 p.455 Both σ1 & σ2 Known (§9-3) Confident Level: 1 − α (compute α = 1 − C.L.) Critical Value: zα / 2 = invNorm(α / 2) Margin of Error: E = zα / 2 Point Estimate: x1 − x2 Confidence Interval: σ 12 n1 + σ 22 n2 ( ( x1 − x2 ) − E , ( x1 − x2 ) + E ) Or Use TI’s 2-SampZInt [STAT] > TESTS > 9:2-SampZInt If given raw data, use “Input: Data” & L1, L2. If given statistics, use “Input: Stats”. σ1 or σ2 Unknown (§9-3) p.438 If σ1=σ2, then see p.457 (pooled variance); otherwise, see p.450. Confident Level: 1 − α (compute α = 1 − C.L.) Critical Value: zα / 2 = invNorm(α / 2) Or Use TI’s 2-SampTInt Margin of Error: E = zα / 2 Point Estimate: pˆ1 − pˆ 2 [STAT] > TESTS > 0:2-SampTInt If given raw data, use “Input: Data” & L1, L2. If given statistics, use “Input: Stats”. If σ1 = σ2, use “Pooled: Yes”; otherwise, use “Pooled: No”. Confidence Interval: pˆ1qˆ1 pˆ 2 qˆ2 + n1 n2 ( ( pˆ1 − pˆ 2 ) − E , ( pˆ1 − pˆ 2 ) + E ) Or Use TI’s 2-PropZInt [STAT] > TESTS > B:2-PropZInt Note: The input x1, x2 MUST be integers. Conclusion: v.090710 © 2009 by Chi-yin Pang “We are __% confident that ___ [English of μ1-μ2 or p1-p2] is between _ and _ [with units] .” THEN make inference between the two groups: If the interval is totally positive, then “We are __% confident that __ [English of μ1 or p1] is greater than ___ [English of μ2 or p2] by at least ___ [the lower limit, with unit].” If the interval is totally negative, then “We are __% confident that __ [English of μ2 or p2] is greater than ___ [English of μ1 or p1] by at least ___ [the absolute value of the upper limit, with unit].” (If the interval contains 0, the two means could be the same & we can’t make the above conclusions.) 4 HYPOTHESIS TESTS Matched Pairs (§9-4) p.455 Check: 1. Random Sample? xi − y i yi 2. (n1>30 & n2>30) or Both Normal x1 − x2 σ 12 n1 + σ 22 0.4 0.6 0.4 0.2 0 Failure="0" Success="1" Failure="0" Success="1" x2 , qˆ 2 = 1 − pˆ 2 n2 # successes1 ≥ 5 ? # successes1 = x1 = n1 pˆ 1 pˆ 1 = No info. for the dist. & (n1≤30 or n2≤30) Don’t Touch! H 0 :μ1 = μ2 H1 : μ1 ≠ μ2 ⇒ P value = 2 tails prob. = 2 × normalcdf ( z , E 99) H1 : μ1 < μ2 ⇒ P value = Left tail prob. = normalcdf (− E 99, z ) H1 : μ1 > μ2 ⇒ P value = Right tail prob. = normalcdf ( z, E 99) Or Use TI’s 2-SampZTest [STAT] > TESTS > 3:2-SampZTest If given raw data, use “Input: Data” & L1, L2. If given statistics, use “Input: Stats”. 
Test statistic is reported as “z =…” P-value is reported as “p =…” Conclusion: x1 , qˆ1 = 1 − pˆ 1 , n1 pˆ 2 = # failures1 ≥ 5 ? # failures1 = n1 − x1 = n1qˆ1 # successes2 ≥ 5 ? # successes2 = x2 = n2 pˆ 2 . # failures2 ≥ 5 ? # failures2 = n2 − x2 = n2 qˆ 2 p.437 σ1 or σ2 Unknown (§9-3) If σ1=σ2, then see p.457 (pooled variance); otherwise, see p.450. n2 v.090710 © 2009 by Chi-yin Pang 0.2 0.6 2. n’s big enough? Both σ1 & σ2 Known (§9-3) Test Statistic: z = Binom. Dist. 0.8 0.8 Proportion (p1, p2) (§9-2) Check: 1. Indep. Random Sample? Independent Samples Given related data pairs, (xi,yi). Analyze the Difference : d i = xi − y i as a 1-sample Hypo.Test problem. Caution: It might be more appropriate to ri = xi / y i analyze the Ratio : Relative Diff. : rd i = 2-SAMPLE (comparing 2 groups) 1 0.8 0.6 0.4 0.2 0 Prob Mean (μ1, μ2) Binom. Dist. Prob References are from Triola (2008) Or Use TI’s 2-SampTTest [STAT] > TESTS > 4:2-SampTTest If given raw data, use “Input: Data” & L1, L2. If given statistics, use “Input: Stats”. If σ1 = σ2, use “Pooled: Yes”; otherwise, use “Pooled: No”. Test statistic is reported as “t =…” P-value is reported as “p =…” Pooled Sample Prop. : p = Test Statistics : z = x1 + x2 , q =1− p n1 + n2 pˆ1 − pˆ 2 x /n −x /n = 1 1 2 2 pq pq pq pq + + n1 n2 n1 n2 H 0 : p1 = p2 H1 : p1 ≠ p2 ⇒ P value = 2 tails prob. = 2 × normalcdf ( z , E 99) H1 : p1 < p2 ⇒ P value = Left tail prob. = normalcdf (− E 99, z ) H1 : p1 > p2 ⇒ P value = Right tail prob. = normalcdf ( z , E 99) Or Use TI’s 2-PropZTest [STAT] > TESTS > 6:2-PropZTest Note: The input x1, x2 MUST be integers. Test statistic is reported as “z =…” P-value is reported as “p =…” α = level of significance (e.g., assume α = 0.05 if not given) If P value ≤ α then reject H0. “The observed sample statistics has a P value of __≤__ [α]. We reject Ho with a significance level of α =___. The evidence is statistically significant to support ____ [English of H1].” If P value > α, then fail to reject H0. “The observed sample statistics has a P value of __>__ [α]. We failed to reject Ho with a significance level of α =___. The evidence is not statistically significant to support ____ [English of H1] .” 5 2-Sample Data, Correlation & Regression Road Map 2-SAMPLE DATA Mean (μ) Matched Pairs For Studying the Relationship between the Explanatory Variable (x) & the Response Variable (y) Analyze with Correlation & Regression. These techniques try to find the association between the variables by fitting a function the observed (xi,yi) points. They also assess the strength of the association. Independent Sample Proportion (p) 2-SampZInt 2-SampZTest 2-SampTInt 2-SampTTest 2-PropZInt 2-PropZTest For Comparison of Same Kind of Data For more information about the non-correlation type of analysis, see the 1-sample and 2-sample Confidence Interval and Hypo. Test flow charts. Take the difference and treat it as a 1-sample problem. ZInterval Z-Test TInterval T-Test Fit a LINEAR Function (y=ax+b) to the data. (“Linear Regression”) Fit a NON-LINEAR Function to the data (“Non-Linear Regression”) Use [STAT]>TESTS>LinRegTTest Use QuadReg for y = ax 2 + bx + c Other related functions are: [STAT]>TESTS>LinRegTInt Use CubicReg for y = ax3 + bx 2 + cx + d Or turn-on Diagnostics first [2nd]>[0](CATALOG)>DiagnosticsOn [ENTER}>[ENTER] Then use [STAT]>CALC>LinReg(ax+b) L1,L2,Y1 [STAT]>CALC>LinReg(a+bx) [STAT]>CALC>Med-Med [STAT]>CALC>Manual-Fit (See the “Linear Regression” flow chart for more detail.) 
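NOTE (optional software check): As one example of reproducing the two-sample procedures charted above, the Python sketch below carries out the pooled two-proportion z test from the 2-PropZTest box. The basketball counts (x1 = 127 home-team wins out of n1 = 198 games) come from the earlier 2-prop worksheet; the football counts are made-up placeholders, since the worksheet leaves them blank.

```python
# Sketch: two-proportion z test with a pooled proportion, as in the
# 2-PropZTest box above.  x1, n1 are from the worksheet; x2, n2 are made up.
import math
from scipy.stats import norm

x1, n1 = 127, 198      # home-team wins in basketball (from the worksheet)
x2, n2 = 57, 150       # hypothetical football counts, for illustration only

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                 # pooled sample proportion
q_pool = 1 - p_pool
z = (p1_hat - p2_hat) / math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))

p_two_tail = 2 * norm.sf(abs(z))               # H1: p1 != p2
print(z, p_two_tail)                           # reject H0 if the P value <= alpha
```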
Ver.090428 © 2009 by Chi-yin Pang Use QuartReg for y = ax 4 + bx 3 + cx 2 + dx + e Use LnReg for y = a + b ln x Use ExpReg for y = a ⋅ b x Use PwrReg for y = ax b ( ) Use Logistic for y = c / 1 + ae − bx Use SinReg for y = a sin(bx + c) + d 6 LINEAR REGRESSION References are from Triola (2008) Given n pairs of (xi,yi)’s. Regression was done (e.g., with LinReg(a+bx)) & the regression line is y=a+bx with correlation coefficient r. (The symbols for the corresponding population parameters are α, β, ρ.) Check: (p.488) Prediction Intervals (§10-4) Testing ρ, the population correlation coeff. (§10-2) Confident Level: c (a given probability value) Degrees of Freedom: d . f . = n − 2 Critical Value: tc has P( −tc < t < tc ) = c with the t distribution with (n − 2) d.f. 2 ⎞ ⎛ 1 (x − x) ⎟ E = tc ⎜ S e 1 + + ⎜ n SS x ⎟ ⎝ ⎠ (For Se and SS x , see the formulas for "Testing for correlation coefficient ρ".) Error Tolerance: Point Estimate: y p = a + bxs where xs is the specific given x value (y p r n−2 1− r2 H0 : β & ρ = 0 with a given level of significance α (e.g., α=0.05). (Unfortunately, α is used in 2 difference senses.) These are tests for whether x, y are correlated. Test β, the population slope. (∑ x) SS x = ∑ x 2 − slope b = n SS xy SS x 2 , SS y = ∑ y 2 − (∑ y) n 2 , SS xy = ∑ xy − , the Standard Error of the Estimate is Se = ( ∑ x )( ∑ y ) , n SS y − b SS xy n−2 b b = t= , d. f . = n − 2 . Standard Error for b Se / SS x , d. f . = n − 2 . H 0 :β = 0 H0 : ρ = 0 H1 : ρ ≠ 0 ⇒ P value = 2 × tcdf( t , E99, d.f.) H1 : β ≠ 0 ⇒ P value = 2 × tcdf( t , E99, d.f.) H1 : ρ < 0 ⇒ P value = tcdf( − E99, H1 : β < 0 ⇒ P value = tcdf( − E99, t , d.f.) H1 : ρ > 0 ⇒ P value = tcdf( and y p is the predicted value according to the regression line, y = a + bx. Confidence Interval: Compute t = Test the hypotheses H1 : β & ρ ≠ 0 (or < 0; or > 0 ) 1. Random Sample from all the possible (x,y)’s 2. Scatterplot has approximate a straight line? 3. Remove any erroneous outliers. 4. At each x value, y is normally distributed and with the same σ. Given a specific xs, we want to get a conf. int. to trap the population mean of y corresponding to xs, with confidence level c (e.g., c=0.95). Testing ρ & β t , d.f.) t , E99, d.f.) H1 : β > 0 ⇒ P value = tcdf( t , E99, d.f.) (For conf. int. for ρ, see p,508 #39.) − E , y p + E ) (See p.529, p.534 #26.) (CAUTION: TI-84’s LinRegTInt is for getting the confident interval for the slope β, and not the confident interval for the predicted y’s mean. There are no “TI tests” for this case.) Or Use TI’s LinRegTTest Conclusion: “We are 95% confident that the true mean of copper sulfate that will dissolve in 100 g of water at 45°C is between 26.5g and 39.5g.” Enter (x, y)’s into L1, L2. Enter the observation counts to matrix [A]. [STAT] > TESTS > E:LinRegTTest Test statistics is reported as “t =…” P-value is reported as “p = …” Or find the conf. int. of the slope β for conf. level c=(1-α) by using TI-84’s LinRegTInt: Enter (x, y)’s into L1, L2. [STAT] > TESTS > G:LinRegTInt… The output indicates the conf. interval for the slope β. Use that to make the appropriate conclusion. (For CI for α and CI for β, see p.533 #25.) Conclusion: v.090710 © 2009 by Chi-yin Pang α = level of significance If P value ≤ α then reject H0. “The observed sample statistics has a P value of __≤__(α). We reject Ho with a significance level of α =___. The evidence is statistically significant to support ____ [English of H1].” If P value > α, then fail to reject H0. “The observed sample statistics has a P value of __>__(α). 
We failed to reject Ho with a significance level of α =___. The evidence is not statistically significant to support ____ [English of H1] .” 7 References are from Triola (2008) 1st 2nd 3rd Crew Alive 202 118 178 212 Dead 123 167 528 673 Test of Independency (§11-3) Given a observed frequencies in a contingency table with R rows and C columns. We want to test whether the variables in the rows and columns are independent. Check: 1. Random Sample? 2. Each expected, Eij ≥ 5? (NOTE: Perform TI’s “χ2-Test” (see below) then check the output expected count matrix [B].) CHI-SQUARE (χ2) TESTS Tests that use the χ2 distributions. Goodness of Fit (§11-2) Given an assumed distribution of k categories and expected probabilities p1, p2, … , pk. An observation was made and the observed frequencies for the categories are: O1, O2, …, Ok Test how good the observation fit the distribution. Check: 1. Random Sample? Conf.Int. for σ (§7-5) for a conf. level=1−α (e.g., 0.95) Check: 1. Random Sample? 2. Population is NORMAL? (even for large n) d. f . = n −1 χ L2 = χ 2CDF −1 (α / 2, d . f .) χ R2 = χ 2CDF −1 (1 − α / 2, d . f .) CI : (n − 1) s 2 χ R2 Test of Std.Dev., σ (§8-6) Given a “status quo” σ0 and a sample standard deviation, s (from a sample of size n). Test the true σ comparing to the “status quo” σ0. 2. Each expected count, Ei = npi ≥ 5? H 0 : The two variables are Independnt H 0 : The population fits the assumed distribution. Check: 1.Random Sample? H1 : The population has a different distribution. 2. The population NORMAL? (even for large n) Compute the expected frequency (assuming H 0 ) : (Row total )(Column total) Eij = Sample size Compute the statistics χ = ∑ 2 (O ij (Oi − Ei ) Ei 2 Use chi-square (χ 2 ) distribution with d . f . = k − 1 − Eij ) 2 Eij Use chi - square ( χ 2 ) dist. with d . f . = ( R − 1)(C − 1) P value = the right tail prob. = χ 2cdf( χ 2 , E99, d.f.) Or Use TI’s χ2-Test [2nd][MATRX] > EDIT Enter the observation freq., Oij, to matrix [A]. [STAT] > TESTS > C:χ2-Test Test statistics is reported as “χ2 =…” P-value is reported as “p= …”, the computed expected counts, Eij’s, are stored in matrix [B]. v.090710 © 2009 by Chi-yin Pang Compute the statistics χ 2 = ∑ P value = the right tail prob. = χ 2 cdf(χ 2 ,E99,d.f.) [2nd][VARS](DISTR)>DISTR>*:c2cdf(LtLim,RtLim,df) Or Use TI-84’s χ2GOF-Test Enter the observed values in L1 & the expected values in L2. [STAT] > TESTS > D:χ2GOF-Test… Test statistic is reported as “χ2 =…” P-value is reported as “p= …” Conclusion: ( n − 1) s 2 χ L2 TI has no “inverseχ2CDF”. Use CATALOG>solve(χ2cdf(0, x ,d.f.)-.025,x,0) for χ2L CATALOG>solve(χ2cdf(x,E99,d.f.)-.025,x,0) for χ2R H1 : The two variables are NOT Independnt Let Oij = the observed frequency <σ < χ 2 = (n − 1) s2 σ 02 with d . f . = n − 1 H 0 : σ = σ 0 (CAUTION: don't forget to square σ 0 ) H1 : σ < σ 0 ⇒ P value = χ 2 cdf(0,χ 2 ,d.f.) H1 : σ > σ 0 ⇒ P value = χ 2 cdf(χ 2 ,E99,d.f.) H1 : σ ≠ σ 0 ⇒ P value = 2 × min( χ 2 cdf(0,χ 2 ,d.f.) , χ 2 cdf(χ 2 ,E99,d.f.)) No TI “test” available. Use [2nd][VARS](DISTR)>DISTR>*:χ2cdf(LtLim,RtLim,df) α = level of significance If P value ≤ α then reject H0. “The observed sample statistics has a P value of __≤__(α). We reject Ho with a significance level of α =___. The evidence is statistically significant to support ____ [English of H1].” If P value > α, then fail to reject H0. “The observed sample statistics has a P value of __>__(α). We failed to reject Ho with a significance level of α =___. 
The evidence is not statistically significant to support ____ [English of H1] .” 8 References are from Triola (2008) ANOVA (ANalysis Of VAriance) (§11-4) Given measured samples from k populations, test the hypotheses: H0: μ1=μ2=…=μk H1: not (μ1=μ2=…=μk) Check: (p.585) 1. Random Samples? 2. No two sets of data are matched pairs? 3. Distributions are all approximately normal? (Actually, “not very far from normal” would still give good results.) 4. σ1=σ2=…=σk Let:k = number of populations being compared ni = number of values in the i th sample xi = sample mean of the i th sample si = sample standard deviation of the i th sample N = ∑ ni = number of values in all samples combined x = means of all sample values combined ∑ ( n ( x − x ) ) ( k − 1) ∑ ( ( n − 1) s ) ∑ ( n − 1) 2 Test Statistic: F = variance between samples = variance within smaples i i 2 i i i The test statistics has an "F distribution" with : numerator degrees of freedom = "ndf" = k − 1 denominator degrees of freedom = "ddf" = N − 1 P value = Right tail probability = TI's Fcdf ( F , E 99, ndf , ddf ) Keys: [2nd][VARS](DISTR)>DISTR>*:Fcdf( Or Use TI’s ANOVA(L1,…,Lk) [STAT] > TESTS > *:ANOVA( Test statistics is reported as “F =…” P-value is reported as “p = …” Conclusion: v.090710 © 2009 by Chi-yin Pang α = level of significance If P value ≤ α then reject H0. “The observed sample statistics has a P value of __≤__(α). We reject Ho with a significance level of α =___. The evidence is statistically significant to support ____ [English of H1].” If P value > α, then fail to reject H0. “The observed sample statistics has a P value of __>__(α). We failed to reject Ho with a significance level of α =___. The evidence is not statistically significant to support ____ [English of H1] .” 9 San Jose City College Math 63 Statistics Ver. 071118 Hypothesis Testing: Setup and Decision Methods If claim has If claim “straight has “=” inequality” imbedded μ ≠ μ0 p ≠ p0 μ < μ0 p < p0 μ > μ0 p > p0 μ = μ0 Then Setup the Null & Alternative Hypotheses As: Two-tailed test H 0 : μ = μ0 H1 : μ ≠ μ 0 p = p0 H 0 : p = p0 H1 : p ≠ p0 μ ≥ μ0 Left-tailed test H 0 : μ = μ0 H1 : μ < μ0 p ≥ p0 H 0 : p = p0 H 1 : p < p0 μ ≤ μ0 Right-tailed test H 0 : μ = μ0 H1 : μ > μ 0 p ≤ p0 © Chi-yin Pang, 2007 H 0 : p = p0 H1 : p > p0 P-value Method Traditional Method (Compare the P-value (Probability) to α (the Significance Level)) (Compare the Test Statistic to the Critical Value (the t-score or z-score of α)) Test Statistics: x−μ x−μ pˆ − p z= or t = or z = σ/ n s/ n pq / n ( ) ( ) P-value = Left tail Probability = 2 × tdcf ( t , E 99, n − 1) or 2 × normalcdf ( z , E 99) Test Statistics: z= x−μ x−μ pˆ − p or t = or z = σ/ n s/ n pq / n ( ) ( ) Critical Value = invT (α / 2, n − 1) or invNorm(α / 2) Test Statistics: Reject H 0 if |Test Stat| ≥ Critical Value. Test Statistics: z= z= Reject H 0 if P-value ≤ α. x−μ x−μ pˆ − p or t = or z = σ/ n s/ n pq / n ( ) ( ) P-value = Left tail Probability = tdcf (− E 99, t , n − 1) or x−μ x−μ pˆ − p or t = or z = σ/ n s/ n pq / n ( ) ( ) Critical Value = invT (α , n − 1) or invNorm(α ) normalcdf (− E 99, z ) Test Statistics: Reject H 0 if Test Statistics ≤ – Critical Value. Test Statistics: z= z= Reject H 0 if P-value ≤ α. x−μ x−μ pˆ − p or t = or z = σ/ n s/ n pq / n ( ) ( ) P-value = Left tail Probability = tdcf (t , E 99, n − 1) or x−μ x−μ pˆ − p or t = or z = σ/ n s/ n pq / n ( ) ( ) Critical Value = invT (α , n − 1) or invNorm(α ) normalcdf ( z, E 99) Reject H 0 if P-value ≤ α. Reject H 0 if Test Stat ≥ Critical Value.
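NOTE (optional software check): The two decision methods in the table above always agree, and that is easy to see in software. The Python sketch below is an added illustration with made-up numbers; it runs a right-tailed one-sample t test and applies both the P-value method and the traditional (critical value) method.

```python
# Sketch: P-value method vs. traditional (critical value) method for a
# right-tailed one-sample t test, H0: mu = mu0 vs H1: mu > mu0.
import math
from scipy.stats import t

xbar, s, n, mu0, alpha = 52.3, 6.0, 25, 50.0, 0.05   # made-up numbers
t_stat = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1

# P-value method: reject H0 if the P value <= alpha.
p_value = t.sf(t_stat, df)                   # right-tail probability
reject_pvalue = p_value <= alpha

# Traditional method: reject H0 if the test statistic >= critical value.
t_crit = t.ppf(1 - alpha, df)                # the t-score that cuts off a right tail of alpha
reject_traditional = t_stat >= t_crit

print(t_stat, p_value, t_crit)
print(reject_pvalue, reject_traditional)     # the two methods reach the same decision
```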