STAT 344: Probability and Statistics for Engineers and Scientists
Instructor: Prof. M.K. Habib. Sci. & Tech. II, Rm. 143
URL: http://mason.gmu.edu/~mhabib
E-mail: mhabib@gmu.edu
Office hours: M & W: 4:30 - 6:00 pm, Tues. 2:00-5:00
Text: Probability and Statistics for Engineering and the Sciences, by
Jay Devore. Thomson.
TA: Jin Zang. Central Module Rm 18 or 33.
E-mail: jzang5@gmu.edu. Office hours: T & R 2:00-4:00
This is an introductory undergraduate course in probability and
statistics with applications to computer science, engineering,
operations research, and information technology. The focus will be on
basic concepts of probability, discrete and continuous random
variables, expectations, bivariate distributions, sums of
independent random variables, correlations, limit theorems,
sampling distributions, parameter estimation, and hypothesis
testing.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.
COURSE CONTENT
1. Introduction to statistics.
2. Probability, conditional probability, and independence.
3. Discrete random variables and probability distributions.
4. Continuous random variables and distributions.
5. Joint probability distributions and random samples.
MIDTERM EXAM
6. Estimation Theory – Point Estimation.
7. Confidence (Statistical) Intervals.
8. Tests of hypotheses.
GRADING: Homework 30%; Midterm 30%; Final exam 40%.
STAT 344: Probability
• 1. Introduction.
• Probability is a branch of statistics
(mathematics) that is concerned with developing
and analyzing mathematical models of random
(or statistical) experiments.
• Definition: A statistical (or random) experiment
is an experiment whose outcomes are not
certain.
Examples
(1) Flipping one or more coins,
(2) Tossing one or more dice,
(3) Examining a manufactured item to determine
whether it is defective or not,
(4) Measuring a patient's blood pressure:
(120 / 80) or (120, 80)
(5) A laboratory blood test is 95% effective in
detecting a certain disease when it is, in fact,
present (sensitivity). However, the test also
yields a "false positive" result for 1% of the
healthy persons tested. (That is, if a healthy
person is tested, then, with probability 0.01, the
test result will be positive.) If 0.5% of the
population actually has the disease (prevalence), what
is the probability that a person has the disease given
that the test result is positive?
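The question in example (5) can be worked out with Bayes' rule. The following is a minimal sketch using only the three numbers given in the example; the variable names are illustrative.

```python
# Bayes' rule for the blood-test example.
sensitivity = 0.95      # P(positive | disease)
false_positive = 0.01   # P(positive | healthy)
prevalence = 0.005      # P(disease)

# Law of total probability: overall chance of a positive result.
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Posterior probability of disease given a positive test.
p_disease_given_positive = sensitivity * prevalence / p_positive

print(round(p_disease_given_positive, 4))  # 0.3231
```

Even with a positive result, the chance of disease is only about 32%, because healthy persons greatly outnumber diseased ones.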
Cross-Correlation Surface of Multiple Spike Trains
[Figure: neuron with dendrites and soma labeled; trans-membrane potential axis marked at -30 mV and -70 mV]
Stochastic Counting Processes
$N(t) = \sum_{i \ge 1} I(T_i \le t), \quad t \ge 0$,
where $I(A)$ is the indicator of the set $A$.
$\lambda(t) = \lambda$ (const.): homogeneous Poisson process
$\lambda(t) = \lambda_t$ (det.): non-homogeneous Poisson process
$\lambda(t) = \lambda(t \mid H_0)$: doubly-stochastic Poisson process
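The counting process on the slide above can be illustrated for the homogeneous (constant-rate) case. This is a rough sketch, not the method used in the research shown: it assumes the T_i are event times generated with exponential gaps, and the rate, horizon, seed, and function names are all hypothetical choices.

```python
import random

def simulate_event_times(rate, horizon, seed=0):
    """Event times T_1 < T_2 < ... up to `horizon` for a constant-rate Poisson process."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)  # exponential gaps between events
        if t > horizon:
            return times
        times.append(t)

def counting_process(event_times, t):
    """N(t): the number of events with T_i <= t (a sum of indicators)."""
    return sum(1 for T in event_times if T <= t)

event_times = simulate_event_times(rate=2.0, horizon=10.0)
counts = [counting_process(event_times, t) for t in range(0, 11)]
print(counts)  # a non-decreasing step sequence starting at 0
```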
A Computational Machine for Optimal Parsimonious Hybrid
Models of E-Text Classification
Classifier          HP (E½)   DNA (E½)   Ling (E½)
1 – NB-χ2           0.1897    0.0420     0.0605
2 – SVM-Entropy     0.5169    0.0559     0.0384
3 – K-nn-Entropy    0.0501    0.0350     0.0051
4 – K-nn-χ2         0.1966    0.0265     0.0190
5 – K-nn-MI         0.1441    0.0615     0.0436
6 – NB-Entropy      0.1282    0.0500     0.0414
7 – NN-MI           0.1282    0.0420     0.0542
Method        Recall
Yahoo         60%
Mozilla       85%
New Machine   98%
[Figure: system diagram. Labels: Data (LING, HP, and DNA); Data Pre-processing, Data Transformation, Data Normalization, Feature Selection; classifiers NN, NB, K-nn, SVM; Calculate FPR, Recall, Precision; harmonic measure of error $E_\lambda$; Classified Output. Source: CAIDA Internet Map.]
Harmonic error:
$E_\lambda = 1 - \dfrac{1}{\lambda \frac{1}{R} + (1 - \lambda) \frac{1}{P}}, \quad 0 \le \lambda \le 1$
[Figure: bar chart of Error Rate (vertical axis, 0 to 0.6) by Hybrid Models (horizontal axis, each model labeled by the numbers of the classifiers it combines, e.g. 12, 257, 12345), with the Optimal Hybrid Model and the Parsimonious model marked.]
Khaled Alduhaiman, Doctoral Dissertation
Spring 2004
Muhammad Habib, Dissertation Director
Populations and Samples
A population is a well-defined collection
of objects.
When information is available for the
entire population we have a census. A
subset of the population is a sample.
Data and Observations
Univariate data consists of observations
on a single variable (bivariate: two variables;
multivariate: more than one variable).
Branches of Statistics
Descriptive Statistics – summary and
description of collected data.
Inferential Statistics – generalizing
from a sample to a population.
Relationship Between Probability
and Inferential Statistics
[Figure: probability reasons from the population to the sample; inferential statistics reasons from the sample to the population]
1.2: Pictorial and Tabular Methods in Descriptive Statistics
Stem-and-Leaf Displays
1. Select one or more leading digits for
the stem values. The trailing digits
become the leaves.
2. List stem values in a vertical column.
3. Record the leaf for every observation.
4. Indicate the units for the stem and
leaf on the display.
Stem-and-Leaf Example
Observed values:
9, 10, 15, 22, 9, 15, 16, 24, 11
0 | 99
1 | 01556
2 | 24
Stem: tens digit
Leaf: units digit
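The construction steps can be sketched in a few lines. This is an illustrative helper (the function name is hypothetical) that assumes non-negative integers, uses the tens digit as the stem and the units digit as the leaf, and sorts the leaves within each stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the tens digit as stem and the units digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Keep empty stems so that gaps in the data stay visible.
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]
display = stem_and_leaf(observed)
print("\n".join(display))
```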
Stem-and-Leaf Displays
• Identify typical value
• Extent of spread about a value
• Presence of gaps
• Extent of symmetry
• Number and location of peaks
• Presence of outlying values
Dotplots
Represent data with dots.
Observed values:
9, 10, 15, 22, 9, 15, 16, 24, 11
[Figure: dotplot of the observed values on a horizontal axis marked 5, 10, 15, 20, 25]
Types of Variables
A variable is discrete if its set of possible
values is finite or can be listed in an infinite
sequence. A variable is continuous if its
set of possible values consists of an entire
interval on the number line.
Histograms
Discrete Data
Determine the frequency and relative
frequency for each value of x. Then
mark possible x values on a horizontal
scale. Above each value, draw a
rectangle whose height is the relative
frequency of that value.
Ex. Students from a small college were asked how
many charge cards they carry. x is the variable
representing the number of cards, and the results are
below.
Frequency Distribution
x    # people   Rel. Freq.
0    12         0.08
1    42         0.28
2    57         0.38
3    24         0.16
4    9          0.06
5    4          0.03
6    2          0.01
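The relative frequencies in this example are each count divided by the total number of students. A short sketch using the counts from the example (the variable names are illustrative):

```python
# Number of students carrying x charge cards, from the example.
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(counts.values())  # total number of students surveyed
rel_freq = {x: f / n for x, f in counts.items()}

for x, f in counts.items():
    print(x, f, round(rel_freq[x], 2))
```

The rounded values match the table; the unrounded relative frequencies sum to 1.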
Histograms
Credit card results:
x    Rel. Freq.
0    0.08
1    0.28
2    0.38
3    0.16
4    0.06
5    0.03
6    0.01
[Figure: histogram of Relative Frequency (vertical axis, 0 to 0.4) against Number of Cards (horizontal axis, 0 to 6)]
Histograms
Continuous Data:
Equal Class Widths
Determine the frequency and relative
frequency for each class. Then mark
the class boundaries on a horizontal
measurement axis. Above each class
interval, draw a rectangle whose height
is the relative frequency.
Histograms
Continuous Data: Unequal Widths
After determining frequencies and
relative frequencies, calculate the height
of each rectangle using:
rectangle height = (relative frequency of the class) / (class width)
The resulting heights are called densities
and the vertical scale is the density scale.
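A small sketch of the density calculation follows. The class boundaries and frequencies are hypothetical values chosen for illustration; they do not come from the slides.

```python
# Hypothetical class boundaries and frequencies (not from the slides).
boundaries = [0, 2, 4, 6, 10, 20]
frequencies = [8, 14, 10, 12, 6]

n = sum(frequencies)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

# Density = relative frequency of the class divided by its width.
densities = [(f / n) / w for f, w in zip(frequencies, widths)]

# With densities as heights, the rectangle areas add up to 1.
total_area = sum(d * w for d, w in zip(densities, widths))
print(densities, total_area)
```

On the density scale, the area of each rectangle equals the relative frequency of its class, which is why wide classes are not visually exaggerated.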
Histogram Shapes
symmetric unimodal
positively skewed
bimodal
negatively skewed
1.3: Measures of Location
The Mean
The average (mean) of the n numbers
$x_1, x_2, \ldots, x_n$ is $\bar{x}$, where
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu$ [$= E(X)$]
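As a quick check of the definition, the sample mean can be computed for the nine observed values used in the earlier stem-and-leaf and dotplot examples (reusing that data here is only for illustration):

```python
observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]

n = len(observed)
sample_mean = sum(observed) / n  # (x_1 + ... + x_n) / n

print(round(sample_mean, 4))  # 14.5556
```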
Sample Variance
Variance is a measure of the spread of the
data.
The sample variance of the sample $x_1, x_2, \ldots, x_n$
of n values of X is given by
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \dfrac{S_{xx}}{n - 1}$
We refer to $s^2$ as being based on n – 1 degrees of
freedom. The population variance: $\sigma^2 = E(X - \mu)^2$
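The sample variance can be computed directly from its definition. The sketch below reuses the nine observed values from the earlier examples for illustration and cross-checks the result against Python's standard `statistics.variance`, which also divides by n - 1.

```python
import statistics

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]

n = len(observed)
x_bar = sum(observed) / n

# S_xx: sum of squared deviations from the sample mean.
s_xx = sum((x - x_bar) ** 2 for x in observed)
sample_variance = s_xx / (n - 1)  # based on n - 1 degrees of freedom

print(round(sample_variance, 4))             # 30.2778
print(round(statistics.variance(observed), 4))  # same value from the standard library
```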
Median
The sample median, $\tilde{x}$, is the middle
value in a set of data that is arranged in
ascending order. For an even number
of data points the median is the average
of the middle two.
Population median: $\tilde{\mu}$
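The odd and even cases of the sample median can be sketched as follows. The function name is illustrative, and the first data set reuses the nine observed values from the earlier examples.

```python
def sample_median(values):
    """Middle value of the sorted data; average of the middle two when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([9, 10, 15, 22, 9, 15, 16, 24, 11]))  # 15 (odd n: middle value)
print(sample_median([9, 10, 15, 22]))                     # 12.5 (even n: average of 10 and 15)
```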
Three Different Shapes for a Population Distribution
Skewness $= E(X - \mu)^3 / \sigma^3$.
[Figure: three population density curves: symmetric, negative skew, positive skew]