Uploaded by mikiyasbayle

Chapter 1 SE 2015

CHAPTER ONE
INTRODUCTION
1-2
DEFINITION OF STATISTICS
• In the plural sense, statistics refers to an aggregate of numerical facts (figures).
  For instance: "The statistics show that in 2022, IT spending on enterprise software increased significantly compared with the previous year." Here, "statistics" refers to the specific numbers and figures that have been collected and analyzed.
• In the singular sense, statistics refers to a subject concerned with methods of collecting, organizing, summarizing, presenting, and analyzing data in order to derive valid conclusions and make reasonable decisions.
  For example, a researcher might say, "I used statistics to analyze the results of my experiment." Here, "statistics" refers to the methodology or techniques used to analyze the data.
1-3
CLASSIFICATION OF STATISTICS
i. Descriptive Statistics: the area of statistics that involves summarizing and describing data in an informative way.
• It consists of summary calculations, graphs, charts and tables.
• It does not involve making any inferences or predictions about the data beyond what is already observed.
ii. Inferential Statistics: the area of statistics that involves making inferences or predictions about a population based on a sample of data. This includes:
• performing estimations and conducting hypothesis tests,
• generalizing from sample to population,
• determining relationships among variables.
1-4
EXERCISE 1
1. Classify the following statements as belonging to the area of descriptive statistics or inferential statistics.
a. As a result of recent cutbacks by oil-producing nations, we can expect the price of gasoline to double in the next year.
b. Of all patients who received this particular type of drug at clinic Y, 75% later developed significant side effects.
1-5
BASIC STATISTICAL TERMS
• Study population: the group of all individuals, systems or objects under study that possess certain common characteristics.
• Census: complete enumeration of every member of a study population.
• Sample: a portion (part) of the population of interest.
• Sample survey: a method of collecting data from a subset of the population (a sample).
Advantages of a sample survey over a census:
- It saves cost and time.
- It can give greater accuracy, because a smaller volume of data is easier to collect and check carefully.
1-6
Examples of a study population and a sample:
a) Study population: Users of a particular software product.
   Sample: A randomly selected group of users who have used the software in the past 6 months.
b) Study population: Open source software projects on GitHub.
   Sample: A randomly selected subset of open source projects that are active and have at least 100 commits.
c) Study population: Bug reports in a software system.
   Sample: A randomly selected subset of bug reports that have been submitted in the past year and have at least 10 comments.
1-7
BASIC STATISTICAL TERMS CONT’D….
• Parameter: a numerical value that describes/summarizes a characteristic of a population.
  Examples: population mean (μ), population standard deviation (σ), population proportion (p)
• Statistic: a numerical value that describes/summarizes a characteristic of a sample.
  Examples: sample mean (X̄), sample standard deviation (S), sample proportion (p̂)
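The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population below (response times of 1,000 requests) is made up, and the names `population`, `sample`, `mu`, `sigma`, `x_bar` and `s` are chosen for this sketch.

```python
import random
import statistics

# Hypothetical population: response times (ms) of 1,000 requests (made-up values).
random.seed(42)
population = [random.gauss(200, 25) for _ in range(1000)]

# Parameters describe the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population standard deviation

# A simple random sample of 50 members, drawn without replacement.
sample = random.sample(population, 50)

# Statistics describe the sample and serve as estimates of the parameters.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)            # sample standard deviation (n - 1 divisor)

print(f"parameter mu    = {mu:.1f}   statistic x_bar = {x_bar:.1f}")
print(f"parameter sigma = {sigma:.1f}   statistic s     = {s:.1f}")
```

The sample values are close to, but not equal to, the population values; a different sample would give slightly different statistics, while the parameters stay fixed.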
1-8
BASIC STATISTICAL TERMS CONTD.
• Variable: a characteristic or attribute that can take on different values or levels.
• Data: a collection of information or observations. More precisely, data are the values (measurements or observations) that the variables assume.
• Two types of variables: qualitative and quantitative.
i. Qualitative variables: the characteristic being studied is non-numeric.
   Examples: gender, marital status, blood type, nationality, etc.
ii. Quantitative variables: a measurable or countable characteristic that takes numeric values.
   Examples: temperature, salary, height, age, crop yield, GPA.
1-9
BASIC STATISTICAL TERMS CONTD.
• Quantitative variables can be classified as either discrete or continuous.
- Continuous variables: can take any value within a specified interval.
  E.g. age, time, height, income, price, temperature, etc.
- Discrete variables: can assume only whole-number (countable) values.
  E.g. number of hits on a certain website, number of female students in a classroom, number of traffic accidents, etc.
1-10
SOURCES OF DATA
• Primary data: measurements observed and recorded as part of an original study.
• Secondary data: data that have already been collected by other researchers or organizations. They can be obtained from journals, reports, etc.
1-11
MEASUREMENT SCALES
• Measurement scales (MS) refer to the ways in which variables are defined and categorized, based on the properties of order, distance and an absolute zero point.
• The four scales of measurement are nominal, ordinal, interval, and ratio.
i. Nominal Scale:
• It has no order or ranking.
• Arithmetic operations and order comparisons (<, >) are not applicable; categories can only be checked for equality.
Examples:
• Political party preference (Republican, Democrat, or Other)
• Sex (Male or Female)
• Marital status (married, single, widowed, divorced)
• Country code
1-12
MEASUREMENT SCALES
ii. Ordinal Scale:
• It has order, but the differences between ranks are not meaningful.
• No arithmetic operation is applicable, but relational operations are applicable.
Examples:
• Letter grades (A, B, C, D, F)
• Rating scales (Excellent, Very good, Good, Fair, Poor)
1-13
MEASUREMENT SCALES
iii. Interval Scale:
• It has order, and the differences between values are meaningful.
• It has no absolute zero point.
• Relational operations are applicable.
• Addition and subtraction are applicable, but division is not: ratios of values are not meaningful because the zero point is arbitrary.
Examples: IQ, temperature (in °C or °F)
iv. Ratio Scale:
• It has order, and the differences between values are meaningful.
• It has an absolute zero point.
• All arithmetic and relational operations are applicable.
Examples: weight, height, number of students, age
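Why division is not meaningful on an interval scale can be checked with a small calculation. The sketch below uses made-up temperature and weight values; it shows that a ratio of Celsius readings changes when the unit changes, while a ratio of weights does not.

```python
# Temperatures in degrees Celsius (interval scale: the zero point is arbitrary).
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = c_high - c_low                       # a 10-degree difference

# Ratios depend on the unit, so "20 °C is twice as hot as 10 °C" is not valid.
ratio_c = c_high / c_low                      # 2.0 in Celsius
f_low = c_low * 9 / 5 + 32                    # 50 °F
f_high = c_high * 9 / 5 + 32                  # 68 °F
ratio_f = f_high / f_low                      # 1.36 in Fahrenheit

# Weight in kilograms (ratio scale: zero means "no weight").
KG_TO_LB = 2.20462                            # approximate conversion factor
kg_a, kg_b = 40.0, 80.0
ratio_kg = kg_b / kg_a                        # 2.0 in kilograms
ratio_lb = (kg_b * KG_TO_LB) / (kg_a * KG_TO_LB)  # still 2 in pounds

print(ratio_c, ratio_f, ratio_kg, ratio_lb)
```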
1-14
APPLICATION OF PROBABILITY AND STATISTICS IN SOFTWARE ENGINEERING
Probability theory is used in various ways in software engineering to analyze and manage uncertainty, evaluate system reliability, and optimize system performance. For example:
• Software reliability engineering: Probability theory is used to evaluate system reliability and predict the probability of system failure, by computing metrics such as Mean Time Between Failures (MTBF) and Mean Time To Failure (MTTF).
• Risk analysis: Probability theory is used to assess risks associated with software development, such as the probability of a security breach, data loss, or system failure. It is used to calculate risk metrics such as Expected Monetary Value (EMV) to determine the potential impact of a risk event.
• Optimization: Probability theory is used to optimize system performance. It is used to model the behavior of complex systems and identify optimal system configurations based on probabilistic methods.
• Machine learning: Probability theory is used extensively in machine learning algorithms such as Bayesian networks and Markov models. It is used to model the relationships between data points and make predictions based on probability distributions.
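Two of the metrics named above can be computed with one line each. The sketch below assumes the usual textbook definitions (MTBF as total operating time divided by the number of failures, EMV as probability times monetary impact); the slides do not state these formulas, and all numbers are hypothetical.

```python
def mtbf(total_operating_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating time per observed failure."""
    return total_operating_hours / failures


def emv(probability: float, impact: float) -> float:
    """Expected Monetary Value of a single risk event."""
    return probability * impact


# Hypothetical figures: 1,000 hours of operation with 4 failures,
# and a 10% chance of a data-loss event that would cost 50,000.
system_mtbf = mtbf(1000, 4)          # 250.0 hours
data_loss_emv = emv(0.10, 50_000)    # about 5,000

print(f"MTBF = {system_mtbf} hours, EMV = {data_loss_emv:.0f}")
```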
1-15
APPLICATION OF PROBABILITY AND STATISTICS IN SOFTWARE ENGINEERING
Statistics can be applied in various ways in software engineering to support decision-making, improve software quality, and identify trends and patterns. Here are some examples:
• Performance testing: Statistical techniques such as regression analysis can be used to analyze the relationship between software performance and factors such as system load, network latency, and CPU usage.
• Defect analysis: Statistical techniques can be used to analyze defects in software, identify common patterns, and prioritize fixes.
• Predictive modeling: Statistical modeling techniques such as regression analysis and time series analysis can be used to predict software performance, estimate resource requirements, and forecast future demand.
• User experience analysis: Statistical analysis of usability tests and user surveys can be used to evaluate user experience and identify areas for improvement.
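As an illustration of the performance-testing use case, the sketch below fits a straight line (simple least-squares regression) to response time against system load. The measurements are invented for the example, and `fit_line` is a helper written for this sketch, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: concurrent users vs. response time in ms.
load = [10, 20, 30, 40, 50]
latency = [120, 138, 162, 179, 201]

slope, intercept = fit_line(load, latency)

# Use the fitted line to predict the response time at a load of 60 users.
predicted = intercept + slope * 60
print(f"latency ~ {intercept:.1f} + {slope:.2f} * load; at 60 users: {predicted:.1f} ms")
```

The slope estimates how many milliseconds each additional user adds to the response time, under the assumption that the relationship is roughly linear over the measured range.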
METHODS OF DATA PRESENTATION
1-17
2. METHODS OF DATA PRESENTATION
Data can be presented using either frequency distributions or diagrams and graphs.
1. Frequency distributions
i. Categorical frequency distributions
ii. Ungrouped frequency distributions
iii. Grouped frequency distributions
2. Diagrams and graphs
i. Diagrams: pie chart, bar chart
ii. Graphs: histograms, frequency polygons and ogives
1-18
1. FREQUENCY DISTRIBUTIONS
i. Categorical Frequency Distribution: used with nominal or ordinal level data.
Example 1.1: The blood types of 25 individuals are given below. Present the data using a frequency distribution (FD) table.
A   B   B   AB  O
O   O   B   AB  B
B   O   A   O   B
O   O   O   AB  A
A   O   B   A   AB
Solution:
Blood type   Frequency
A            5
B            7
O            9
AB           4
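The tally in Example 1.1 can be reproduced with `collections.Counter` from the Python standard library. This is a minimal sketch; the list holds the 25 observations of the example, and their order does not affect the counts.

```python
from collections import Counter

# The 25 blood types from Example 1.1.
blood_types = [
    "A", "B", "B", "AB", "O",
    "O", "O", "B", "AB", "B",
    "B", "O", "A", "O", "B",
    "O", "O", "O", "AB", "A",
    "A", "O", "B", "A", "AB",
]

# Count how often each category occurs.
freq = Counter(blood_types)

print("Blood type  Frequency")
for category in ["A", "B", "O", "AB"]:
    print(f"{category:<12}{freq[category]}")
print(f"Total       {sum(freq.values())}")
```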
1-19
FREQUENCY DISTRIBUTIONS CONT..
ii. Ungrouped Frequency Distribution: appropriate when the range of the data is small.
Example 1.2: The following data represent the number of days of sick leave taken by each of 50 workers of a company over the last 6 weeks. Construct an ungrouped frequency distribution.
2  1  7  0  7  0  7  0  2  5
0  1  1  5  1  5  5  7  1  1
8  4  2  3  0  3  0  5  0  2
4  4  5  2  1  0  4  4  0  1
3  5  0  8  3  0  7  9  0  5
1-20
FREQUENCY DISTRIBUTIONS CONT..
Ungrouped Frequency Distribution cont..
Number of days of sick leave   No. of employees
0                              12
1                              8
2                              5
3                              4
4                              5
5                              8
7                              5
8                              2
9                              1
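For numeric data with a small range, the same counting approach works, and looping over every value from the minimum to the maximum also reveals values that never occur. The sketch below uses the 50 observations of Example 1.2; listing the value 6 with frequency 0 is a choice made here, since the table above simply omits it.

```python
from collections import Counter

# Days of sick leave for the 50 workers in Example 1.2.
sick_days = [
    2, 1, 7, 0, 7, 0, 7, 0, 2, 5,
    0, 1, 1, 5, 1, 5, 5, 7, 1, 1,
    8, 4, 2, 3, 0, 3, 0, 5, 0, 2,
    4, 4, 5, 2, 1, 0, 4, 4, 0, 1,
    3, 5, 0, 8, 3, 0, 7, 9, 0, 5,
]

freq = Counter(sick_days)

# A Counter returns 0 for values that were never observed (here: 6 days).
print("Days  Employees")
for days in range(min(sick_days), max(sick_days) + 1):
    print(f"{days:<6}{freq[days]}")
```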
1-21
FREQUENCY DISTRIBUTIONS CONT..
iii. Grouped Frequency Distribution: used when the range of the data is large and the data are continuous.
Important terms
• Class limits: the smallest and largest values that can fall in a class.
  - Lower class limit (LCL)
  - Upper class limit (UCL)
• Class boundaries: numbers that separate two consecutive classes without a gap.
  - Lower class boundary (LCB): LCBi = LCLi - U/2
  - Upper class boundary (UCB): UCBi = UCLi + U/2
  where U is the unit of measurement: the distance between two possible consecutive measures. It is usually taken as 1, 0.1, 0.01, 0.001, etc.
1-22
GROUPED FREQUENCY DISTRIBUTION CONT..
• Class width (W): the difference between the upper class boundary and the lower class boundary of a class.
• Class mark (M): the average of the lower and upper class limits, or equivalently the average of the lower and upper class boundaries.
• More than type cumulative frequency (MCF): the total frequency of all values greater than or equal to the LCB of a given class.
• Less than type cumulative frequency (LCF): the total frequency of all values less than or equal to the UCB of a given class.
• Relative frequency (RF): the frequency of a given class divided by the total frequency:
  Relative frequency = class frequency / sum of all frequencies
1-24
EXAMPLE: GROUPED FREQUENCY DISTRIBUTION
Example 1.3: Construct a grouped frequency distribution for the following dataset.
11  29  6   33  14  21  18  17  22  38
31  22  27  19  22  23  26  39  34  27
Solution:
Step 1: Find the range (R): R = Max - Min = 39 - 6 = 33
Step 2: Determine the number of classes (k) desired using Sturges' formula:
k = 1 + 3.32 log10(n) = 1 + 3.32 log10(20) = 5.32, which is rounded up to k = 6
Step 3: Find the class width: W = R/k = 33/6 = 5.5, which is rounded up to W = 6
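Steps 1 to 3 can be checked with a short calculation. The sketch below follows the slide's convention of rounding both k and W up to the next whole number; the variable names are chosen for this illustration.

```python
import math

# Dataset from Example 1.3.
data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]

# Step 1: range = largest value - smallest value.
data_range = max(data) - min(data)               # 39 - 6 = 33

# Step 2: Sturges' formula, rounded up to a whole number of classes.
k_exact = 1 + 3.32 * math.log10(len(data))       # about 5.32
k = math.ceil(k_exact)                           # 6

# Step 3: class width, rounded up.
width = math.ceil(data_range / k)                # 33 / 6 = 5.5, rounded up to 6

print(f"R = {data_range}, k = {k_exact:.2f} -> {k}, W = {width}")
```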
1-25
CONT…..
Step 4: Find the class limits.
• Lower class limits (LCLs)
  - The LCL of the first class is the smallest observation in the given dataset, which is 6.
  - To obtain the LCLs of the remaining classes, keep adding the class width, W = 6, as follows:
    6  12  18  24  30  36
• Upper class limits (UCLs)
  - To find the UCL of the first class, use UCL1 = LCL2 - U = 12 - 1 = 11.
  - To obtain the UCLs of the remaining classes, keep adding the class width, W = 6, as follows:
    11  17  23  29  35  41
Then, the lower and upper class limits will be:
Class limit
6 - 11
12 - 17
18 - 23
24 - 29
30 - 35
36 - 41
1-26
CONT…..
Step 5: Find the class boundaries.
• Lower class boundaries (LCBs)
  - LCB of the first class: LCB1 = LCL1 - U/2 = 6 - 1/2 = 5.5
  - To obtain the LCBs of the remaining classes, keep adding the class width, W = 6, as follows:
    5.5  11.5  17.5  23.5  29.5  35.5
• Upper class boundaries (UCBs)
  - UCB of the first class: UCB1 = UCL1 + U/2 = 11 + 1/2 = 11.5
  - To obtain the UCBs of the remaining classes, keep adding the class width, W = 6, as follows:
    11.5  17.5  23.5  29.5  35.5  41.5
1-27
Step 6: Find the frequencies, class marks, cumulative frequencies and relative frequencies.

Class limit  Class boundary  Freq.  Class mark (M)  LCF                 MCF                 RF
6 - 11       5.5 - 11.5      2      8.5             2                   2+2+7+4+3+2 = 20    0.10
12 - 17      11.5 - 17.5     2      14.5            2+2 = 4             2+7+4+3+2 = 18      0.10
18 - 23      17.5 - 23.5     7      20.5            2+2+7 = 11          7+4+3+2 = 16        0.35
24 - 29      23.5 - 29.5     4      26.5            2+2+7+4 = 15        4+3+2 = 9           0.20
30 - 35      29.5 - 35.5     3      32.5            2+2+7+4+3 = 18      3+2 = 5             0.15
36 - 41      35.5 - 41.5     2      38.5            2+2+7+4+3+2 = 20    2                   0.10

LCF = less than type cumulative frequency; MCF = more than type cumulative frequency; RF = relative frequency.
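Steps 4 to 6 can be reproduced programmatically. The following sketch takes k = 6, W = 6 and U = 1 from the worked example and builds each row of the table; the tuple layout of `rows` is an arbitrary choice made for this illustration.

```python
# Dataset and settings from Example 1.3 (k = 6 classes, width W = 6, unit U = 1).
data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]
k, width, unit = 6, 6, 1
n = len(data)

rows = []
lcf = 0  # running "less than" cumulative frequency
for i in range(k):
    lcl = min(data) + i * width               # lower class limit
    ucl = lcl + width - unit                  # upper class limit
    lcb, ucb = lcl - unit / 2, ucl + unit / 2  # class boundaries
    freq = sum(1 for x in data if lcl <= x <= ucl)
    mark = (lcl + ucl) / 2                    # class mark
    mcf = n - lcf                             # values at or above this class's LCB
    lcf += freq                               # values at or below this class's UCB
    rows.append((lcl, ucl, lcb, ucb, freq, mark, lcf, mcf, freq / n))

print("Limits   Boundaries    f   M     LCF  MCF  RF")
for lcl, ucl, lcb, ucb, freq, mark, c_less, c_more, rf in rows:
    print(f"{lcl:>2}-{ucl:<5} {lcb:>4}-{ucb:<7} {freq:<3} {mark:<5} "
          f"{c_less:<4} {c_more:<4} {rf:.2f}")
```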
1-28
2. DIAGRAMS AND GRAPHS
Diagrams and graphs are the most effective and interesting ways of presenting statistical data.
i. Graphs: histograms, frequency polygons and ogives
ii. Diagrams: pie chart, bar chart
1-29
I.GRAPHS
• Histogram
• Frequency polygon
• Ogive (pronounced "oh-jive")
1-30
HISTOGRAM
Used to present continuous data. The heights of the bars correspond to the class frequencies, and the bars are drawn adjacent to each other (without gaps).
• X-axis: class boundaries; Y-axis: class frequencies.
Example: Take the data in the table below and draw the histogram.

Class limit  Class boundary  Class mark  Freq.
6 - 11       5.5 - 11.5      8.5         2
12 - 17      11.5 - 17.5     14.5        2
18 - 23      17.5 - 23.5     20.5        7
24 - 29      23.5 - 29.5     26.5        4
30 - 35      29.5 - 35.5     32.5        3
36 - 41      35.5 - 41.5     38.5        2

[Figure: histogram of the data above. X-axis: class boundaries (5.5 to 41.5); Y-axis: frequency (0 to 7).]
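Without a plotting library, the shape of the histogram can still be previewed as text. The sketch below is only a rough stand-in for the real chart: it prints one row per class, with the class boundaries as labels and a bar whose length equals the class frequency.

```python
# Class boundaries and frequencies from the grouped frequency distribution.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [2, 2, 7, 4, 3, 2]

# Each class runs from one boundary to the next, so the bars have no gaps.
lines = []
for lcb, ucb, freq in zip(boundaries, boundaries[1:], frequencies):
    lines.append(f"{lcb:>4} - {ucb:<4} | {'#' * freq} {freq}")

print("\n".join(lines))
```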
1-31
FREQUENCY POLYGON
• A frequency polygon uses line segments connecting points located directly above the class marks, at heights equal to the class frequencies.
• X-axis: class marks; Y-axis: class frequencies.
Example: Take the data in the table below and draw the frequency polygon.

Class limit  Class boundary  Class mark  Freq.  LCF  MCF
6 - 11       5.5 - 11.5      8.5         2      2    20
12 - 17      11.5 - 17.5     14.5        2      4    18
18 - 23      17.5 - 23.5     20.5        7      11   16
24 - 29      23.5 - 29.5     26.5        4      15   9
30 - 35      29.5 - 35.5     32.5        3      18   5
36 - 41      35.5 - 41.5     38.5        2      20   2

[Figure: frequency polygon of the data above. X-axis: class marks (2.5 to 44.5); Y-axis: frequency (0 to 7).]
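The points that a frequency polygon connects can be listed directly. In the sketch below the polygon is closed by adding one extra class mark with zero frequency at each end (2.5 and 44.5), which matches the range of the class-mark axis in the figure; the variable names are chosen for this illustration.

```python
# Class marks and frequencies from the grouped frequency distribution.
class_marks = [8.5, 14.5, 20.5, 26.5, 32.5, 38.5]
frequencies = [2, 2, 7, 4, 3, 2]
width = 6

# Extend by one class width on each side, with zero frequency, to close the polygon.
xs = [class_marks[0] - width] + class_marks + [class_marks[-1] + width]
ys = [0] + frequencies + [0]

# Consecutive points are joined by straight line segments when plotted.
points = list(zip(xs, ys))
print(points)
```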
1-32
OGIVE
• An ogive (pronounced "oh-jive") is a line graph of cumulative frequencies.
• Less than ogive: X-axis: UCB; Y-axis: LCF.
• More than ogive: X-axis: LCB; Y-axis: MCF.
Example: Take the data in the table below and draw the less than and more than ogives.

Class limit  Class boundary  Class mark  Freq.  LCF  MCF
6 - 11       5.5 - 11.5      8.5         2      2    20
12 - 17      11.5 - 17.5     14.5        2      4    18
18 - 23      17.5 - 23.5     20.5        7      11   16
24 - 29      23.5 - 29.5     26.5        4      15   9
30 - 35      29.5 - 35.5     32.5        3      18   5
36 - 41      35.5 - 41.5     38.5        2      20   2

[Figure: less than ogive and more than ogive of the data above. X-axis: class boundaries (5.5 to 41.5); Y-axis: cumulative frequency (0 to 20).]
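The coordinates of both ogives follow from the class boundaries and frequencies. The sketch below pairs each UCB with its LCF and each LCB with its MCF; anchoring the curves at zero (at 5.5 for the less than ogive and at 41.5 for the more than ogive) is a common convention assumed here, not something the slide states.

```python
from itertools import accumulate

# Class boundaries and frequencies from the grouped frequency distribution.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [2, 2, 7, 4, 3, 2]

# Less than type cumulative frequencies: running totals from the first class.
lcf = list(accumulate(frequencies))
total = lcf[-1]

# More than type cumulative frequencies: total minus everything below the class.
mcf = [total - below for below in [0] + lcf[:-1]]

# Less than ogive: (UCB, LCF) pairs, starting from zero at the first LCB.
less_than = [(boundaries[0], 0)] + list(zip(boundaries[1:], lcf))

# More than ogive: (LCB, MCF) pairs, ending at zero at the last UCB.
more_than = list(zip(boundaries[:-1], mcf)) + [(boundaries[-1], 0)]

print("less than:", less_than)
print("more than:", more_than)
```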
1-33
II. DIAGRAMS (PIE CHART, AND BAR CHART)
Pie chart: a pie chart can be used to show the relation between the whole and its components.
Example 1.4: The following table gives the details of the monthly budget of a family. Represent these figures by a pie chart.
1-34
BAR CHART
Bar chart: uses vertical or horizontal bars to represent the frequencies/quantities of a distribution.
Example 1.5: Draw a simple bar diagram to represent the profits of a bank for 5 years.