STAT 344: Probability and Statistics for Engineers and Scientists

Instructor: Prof. M.K. Habib, Sci. & Tech. II, Rm. 143
URL: http://mason.gmu.edu/~mhabib
E-mail: mhabib@gmu.edu
Office hours: M & W 4:30-6:00 pm, Tues. 2:00-5:00
Text: Probability and Statistics for Engineering and the Sciences, by Jay Devore. Thomson.
TA: Jin Zang, Central Module Rm. 18 or 33. E-mail: jzang5@gmu.edu. Office hours: T & R 2:00-4:00

This is an introductory undergraduate course in probability and statistics with applications to computer science, engineering, operations research, and information technology. The focus will be on basic concepts of probability, discrete and continuous random variables, expectations, bivariate distributions, sums of independent random variables, correlations, limit theorems, sampling distributions, parameter estimation, and hypothesis testing.

Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.

COURSE CONTENT
1. Introduction to statistics.
2. Probability, conditional probability, and independence.
3. Discrete random variables and probability distributions.
4. Continuous random variables and distributions.
5. Joint probability distributions and random samples.
MIDTERM EXAM
6. Estimation Theory – Point Estimation.
7. Confidence (Statistical) Intervals.
8. Tests of hypotheses.

GRADING: Homework 30%; Midterm 30%; Final exam 40%.

STAT 344: Probability

1. Introduction
• Probability is a branch of statistics (mathematics) that is concerned with developing and analyzing mathematical models of random (or statistical) experiments.
• Definition: A statistical (or random) experiment is an experiment whose outcomes are not certain.
Examples
(1) Flipping one or more coins.
(2) Tossing one or more dice.
(3) Examining a manufactured item to determine whether it is defective or not.
(4) Measuring a patient's blood pressure: (120 / 80) or (120, 80).
(5) A laboratory blood test is 95% effective in detecting a certain disease when it is, in fact, present (sensitivity). However, the test also yields a "false positive" result for 1% of the healthy persons tested. (That is, if a healthy person is tested, then, with probability .01, the test result will be positive.) If 0.5% of the population actually has the disease (prevalence), what is the probability that a person has the disease given that the test result is positive?

[Figure: Cross-Correlation Surface of Multiple Spike Trains. Labels: trans-membrane potential, dendrites, soma; voltage marks at -30 mV and -70 mV.]

Stochastic Counting Processes
$N(t) = \sum_{i \ge 1} I[T_i \le t], \quad t \ge 0$,
where $I(A)$ is the indicator of the set $A$.
• $\lambda(t) = \lambda$ (const.): homogeneous Poisson process
• $\lambda(t) = \lambda_t$ (det.): non-homogeneous Poisson process
• $\lambda(t) = \lambda(t \mid H_0)$: doubly-stochastic Poisson process
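The blood-test question in example (5) can be answered with Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive). The following is a minimal sketch in Python using only the three numbers given in the example; the function and parameter names are illustrative assumptions, not part of the course materials.

```python
def posterior_given_positive(sensitivity, false_positive_rate, prevalence):
    """Return P(disease | positive test) using Bayes' rule.

    sensitivity         = P(positive | disease)
    false_positive_rate = P(positive | healthy)
    prevalence          = P(disease)
    """
    # Joint probability of being diseased and testing positive.
    true_positive = sensitivity * prevalence
    # Joint probability of being healthy and testing positive.
    false_positive = false_positive_rate * (1.0 - prevalence)
    # Divide by the total probability of a positive result.
    return true_positive / (true_positive + false_positive)


# Numbers from example (5): 95% sensitivity, 1% false positives, 0.5% prevalence.
p = posterior_given_positive(0.95, 0.01, 0.005)
print(f"P(disease | positive) = {p:.4f}")  # prints 0.3231
```

Even with a fairly accurate test, only about one positive result in three comes from a person who has the disease, because the disease is rare in the population.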
A Computational Machine for Optimal Parsimonious Hybrid Models of E-Text Classification

Error $E_{1/2}$ by classifier and data set:

Classifier          HP       DNA      Ling
1 – NB-χ²           0.1897   0.0420   0.0605
2 – SVM-Entropy     0.5169   0.0559   0.0384
3 – K-nn-Entropy    0.0501   0.0350   0.0051
4 – K-nn-χ²         0.1966   0.0265   0.0190
5 – K-nn-MI         0.1441   0.0615   0.0436
6 – NB-Entropy      0.1282   0.0500   0.0414
7 – NN-MI           0.1282   0.0420   0.0542

Method        Recall
Yahoo         60%
Mozilla       85%
New Machine   98%

Processing steps:
- Data Pre-processing
- Data Transformation
- Data Normalization
- Features Selection
Calculate: FPR, Recall, Precision, and the measure of error $E_\lambda$ (Harmonic Error):
$E_\lambda = 1 - \dfrac{1}{\lambda \frac{1}{R} + (1 - \lambda) \frac{1}{P}}, \quad 0 \le \lambda \le 1$
Classified Output

[Figure: error rate (vertical axis, 0 to 0.6) for the single classifiers NN, NB, K-nn, and SVM and for the hybrid models (horizontal axis, labeled by digit strings such as 12, 23, 157), with the parsimonious optimal hybrid model marked. Data: LING, HP, and DNA. Source: CAIDA Internet Map.]

Khaled Alduhaiman, Doctoral Dissertation, Spring 2004. Muhammad Habib, Dissertation Director.

Populations and Samples
A population is a well-defined collection of objects. When information is available for the entire population, we have a census. A subset of the population is a sample.

Data and Observations
Univariate data consists of observations on a single variable (multivariate data: observations on more than one variable).

Branches of Statistics
Descriptive Statistics – summary and description of collected data.
Inferential Statistics – generalizing from a sample to a population.

Relationship Between Probability and Inferential Statistics
Probability: Population -> Sample
Inferential Statistics: Sample -> Population
1.2 Pictorial and Tabular Methods in Descriptive Statistics

Stem-and-Leaf Displays
1. Select one or more leading digits for the stem values. The trailing digits become the leaves.
2. List stem values in a vertical column.
3. Record the leaf for every observation.
4. Indicate the units for the stem and leaf on the display.

Stem-and-Leaf Example
Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11
0 | 99
1 | 01556
2 | 24
Stem: tens digit
Leaf: units digit

Stem-and-Leaf Displays
• Identify typical value
• Extent of spread about a value
• Presence of gaps
• Extent of symmetry
• Number and location of peaks
• Presence of outlying values

Dotplots
Represent data with dots.
Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11
[Dotplot of the observed values on a horizontal axis marked 5, 10, 15, 20, 25.]

Types of Variables
A variable is discrete if its set of possible values constitutes a finite set or an infinite sequence. A variable is continuous if its set of possible values consists of an entire interval on a number line.

Histograms
Discrete Data
Determine the frequency and relative frequency for each value of x. Then mark possible x values on a horizontal scale. Above each value, draw a rectangle whose height is the relative frequency of that value.

Ex. Students from a small college were asked how many charge cards they carry. x is the variable representing the number of cards, and the results are below.

Frequency Distribution
x    #people   Rel. Freq.
0    12        0.08
1    42        0.28
2    57        0.38
3    24        0.16
4    9         0.06
5    4         0.03
6    2         0.01

Histograms
Credit card results:
[Histogram: relative frequency (vertical axis, 0 to 0.4) against number of cards x (horizontal axis, 0 to 6), with bar heights given by the Rel. Freq. column above.]

Histograms
Continuous Data: Equal Class Widths
Determine the frequency and relative frequency for each class. Then mark the class boundaries on a horizontal measurement axis. Above each class interval, draw a rectangle whose height is the relative frequency.

Histograms
Continuous Data: Unequal Widths
After determining frequencies and relative frequencies, calculate the height of each rectangle using:
$\text{rectangle height} = \dfrac{\text{relative frequency of the class}}{\text{class width}}$
The resulting heights are called densities and the vertical scale is the density scale.

Histogram Shapes
• symmetric unimodal
• bimodal
• positively skewed
• negatively skewed

1.3 Measures of Location

The Mean
The average (mean) of the n numbers $x_1, x_2, \ldots, x_n$ is $\bar{x}$, where
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu$ [$= E(X)$]

Sample Variance
Variance is a measure of the spread of the data. The sample variance of the sample $x_1, x_2, \ldots, x_n$ of n values of X is given by
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \dfrac{S_{xx}}{n - 1}$
We refer to $s^2$ as being based on n – 1 degrees of freedom.
The population variance: $\sigma^2 = E[(X - \mu)^2]$

Median
The sample median, $\tilde{x}$, is the middle value in a set of data that is arranged in ascending order. For an even number of data points the median is the average of the middle two.
Population median: $\tilde{\mu}$

Three Different Shapes for a Population Distribution
Skewness $= E[(X - \mu)^3] / \sigma^3$.
• symmetric
• negative skew
• positive skew
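The measures defined in Section 1.3 can be checked numerically on the nine observed values from the stem-and-leaf example. The following is a minimal sketch in Python; the function names are illustrative, and the sample skewness shown here (third central moment over the 3/2 power of the second, both with divisor n) is one common sample analog of $E[(X - \mu)^3] / \sigma^3$ chosen as an assumption, since the slides give only the population formula.

```python
def sample_mean(xs):
    """x-bar: the sum of the observations divided by n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """s^2 = S_xx / (n - 1), based on n - 1 degrees of freedom."""
    xbar = sample_mean(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxx / (len(xs) - 1)


def sample_median(xs):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def moment_skewness(xs):
    """Assumed sample analog of E[(X - mu)^3] / sigma^3, using divisor n."""
    n = len(xs)
    xbar = sample_mean(xs)
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Observed values from the stem-and-leaf example.
values = [9, 10, 15, 22, 9, 15, 16, 24, 11]

print(f"mean     = {sample_mean(values):.3f}")      # 14.556
print(f"median   = {sample_median(values)}")        # 15
print(f"variance = {sample_variance(values):.3f}")  # 30.278
print(f"skewness = {moment_skewness(values):.3f}")  # positive: longer right tail
```

For these values the mean (about 14.56) lies slightly below the median (15), while the skewness estimate is positive because the two largest observations, 22 and 24, lie farther from the mean than any of the small ones.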