CHAPTER ONE
INTRODUCTION

DEFINITION OF STATISTICS
• In the plural sense, statistics refers to an aggregate of numerical facts (figures).
  For instance: "The statistics show that in 2022, IT spending on enterprise software increased significantly compared with the previous year." In this case, "statistics" refers to the specific numbers and figures that have been collected and analyzed.
• In the singular sense, statistics refers to a subject concerned with methods of collecting, organizing, summarizing, presenting, and analyzing data to derive valid conclusions and make reasonable decisions.
  For example, a researcher might say, "I used statistics to analyze the results of my experiment." In this case, "statistics" refers to the methodology or techniques used to analyze the data.

CLASSIFICATION OF STATISTICS
i. Descriptive Statistics: the area of statistics that involves summarizing and describing data in an informative way. It consists of summary calculations, graphs, charts and tables. It does not involve making any inferences or predictions about the data beyond what is already observed.
ii. Inferential Statistics: the area of statistics that involves making inferences or predictions about a population based on a sample of data. This includes performing estimations, conducting hypothesis tests, generalizing from sample to population, and determining relationships among variables.

EXERCISE 1
1. Classify the following statements as belonging to the area of descriptive statistics or inferential statistics.
a. As a result of recent cutbacks by oil-producing nations, we can expect the price of gasoline to double in the next year.
b. Of all patients who received this particular type of drug at clinic Y, 75% later developed significant side effects.

BASIC STATISTICAL TERMS
• Study population: the group of all individuals, systems or objects under study that possess certain common characteristics.
• Census: complete enumeration of every member of a study population.
• Sample: a portion (part) of the population of interest.
• Sample survey: a method of collecting data from a subset of the population (a sample).
Advantages of a sample survey over a census:
- It saves cost and time.
- It can give greater accuracy, because a smaller data collection effort is easier to supervise and control.

Examples of a study population and a sample:
a) Study population: users of a particular software product.
   Sample: a randomly selected group of users who have used the software in the past 6 months.
b) Study population: open source software projects on GitHub.
   Sample: a randomly selected subset of open source projects that are active and have at least 100 commits.
c) Study population: bug reports in a software system.
   Sample: a randomly selected subset of bug reports that have been submitted in the past year and have at least 10 comments.

• Parameter: a numerical value that describes/summarizes a characteristic of a population.
  Examples: population mean ($\mu$), population standard deviation ($\sigma$), population proportion ($p$).
• Statistic: a numerical value that describes/summarizes a characteristic of a sample.
  Examples: sample mean ($\bar{X}$), sample standard deviation ($S$), sample proportion ($\hat{p}$).
• Variable: a characteristic or attribute that can take on different values or levels.
• Data: the collection of information or observations, that is, the values (measurements or observations) that the variables assume.

There are two types of variables: qualitative and quantitative.
i. Qualitative variables: the characteristic being studied is nonnumeric.
   Examples: gender, marital status, blood type, nationality.
ii. Quantitative variables: the characteristic is measurable or countable and takes numeric values.
   Examples: temperature, salary, height, age, crop yield, GPA.

Quantitative variables can be classified as either discrete or continuous.
- Continuous variables: can take any value within a specified interval. E.g. age, time, height, income, price, temperature.
- Discrete variables: can only assume whole-number (discrete) values. E.g. number of hits on a certain website, number of female students in a classroom, number of accidents on a road.

SOURCES OF DATA
• Primary data: measurements observed and recorded as part of an original study.
• Secondary data: data that have already been collected by other researchers or organizations. They can be obtained from journals, reports, etc.

MEASUREMENT SCALES
• A measurement scale (MS) refers to the way in which a variable is defined and categorized, based on the properties of order, distance and an absolute zero point.
• The four scales of measurement are nominal, ordinal, interval, and ratio.

i. Nominal scale
- It has no order or ranking.
- Arithmetic and relational (<, >) operations are not applicable.
Examples: political party preference (Republican, Democrat, or Other), sex (male or female), marital status (married, single, widowed, divorced), country code.

ii. Ordinal scale
- It has order, but the differences between ranks are not meaningful.
- No arithmetic operation is applicable, but relational operations are applicable.
Examples: letter grades (A, B, C, D, F), rating scales (excellent, very good, good, fair, poor).

iii. Interval scale
- It has order, and the differences between ranks are meaningful.
- It has no absolute zero point.
- Relational operations are applicable.
- Addition and subtraction are applicable, but division is not: ratios are not meaningful because the zero point is arbitrary.
Examples: IQ, temperature (in °C or °F).

iv. Ratio scale
- It has order, and the differences between ranks are meaningful.
- It has an absolute zero point.
- All arithmetic and relational operations are applicable.
Examples: weight, height, number of students, age.

APPLICATION OF PROBABILITY AND STATISTICS IN SOFTWARE ENGINEERING
Probability theory is used in various ways in software engineering to analyze and manage uncertainty, evaluate system reliability, and optimize system performance. For example:
• Software reliability engineering: probability theory is used to evaluate system reliability and predict the probability of system failure, by computing metrics such as Mean Time Between Failures (MTBF) and Mean Time To Failure (MTTF).
• Risk analysis: probability theory is used to assess risks associated with software development, such as the probability of a security breach, data loss, or system failure. It is used to calculate risk metrics such as Expected Monetary Value (EMV) to determine the potential impact of a risk event.
• Optimization: probability theory is used to optimize system performance. It is used to model the behavior of complex systems and identify optimal system configurations based on probabilistic methods.
• Machine learning: probability theory is used extensively in machine learning models such as Bayesian networks and Markov models. It is used to model the relationships between data points and make predictions based on probability distributions.

Statistics can be applied in various ways to software engineering to help in decision-making, improving software quality, and identifying trends and patterns. Here are some examples:
• Performance testing: statistical techniques such as regression analysis can be used to analyze the relationship between software performance and various factors such as system load, network latency, and CPU usage.
• Defect analysis: statistical techniques can be used to analyze defects in software, identify common patterns, and prioritize fixes.
• Predictive modeling: statistical modeling techniques such as regression analysis and time series analysis can be used to predict software performance, estimate resource requirements, and forecast future demand.
• User experience analysis: statistical analysis of usability tests and user surveys can be used to evaluate user experience and identify areas for improvement.

2. METHODS OF DATA PRESENTATION
Data can be presented using either frequency distributions or diagrams and graphs.
1. Frequency distributions
   i. Categorical frequency distributions
   ii. Ungrouped frequency distributions
   iii. Grouped frequency distributions
2. Diagrams and graphs
   i. Diagrams (pie chart, bar chart)
   ii. Graphs (histograms, frequency polygons and ogives)

1. FREQUENCY DISTRIBUTIONS
i. Categorical frequency distribution: used with nominal or ordinal level data.
Example 1.1: The blood types of 25 individuals are given below. Present the data using a frequency distribution (FD) table.

A   B   B   AB  O
O   O   B   AB  B
B   O   A   O   B
O   O   O   AB  A
A   O   B   A   AB

Blood type   Frequency
A            5
B            7
O            9
AB           4

ii. Ungrouped frequency distribution: appropriate when the range of the data is small.
Example 1.2: The following data represent the number of days of sick leave taken by each of 50 workers of a company over the last 6 weeks. Construct an ungrouped frequency distribution.

2 1 7 0 7 0 7 0 2 5
0 1 1 5 1 5 5 7 1 1
8 4 2 3 0 3 0 5 0 2
4 4 5 2 1 0 4 4 0 1
3 5 0 8 3 0 7 9 0 5

Number of days of sick leave   No. of employees
0                              12
1                              8
2                              5
3                              4
4                              5
5                              8
7                              5
8                              2
9                              1

iii. Grouped frequency distribution: used when the range of the data is large and the data are continuous.
Important terms
• Class limits: the values that mark the two ends of a class, the Lower Class Limit (LCL) and the Upper Class Limit (UCL).
• Class boundaries: numbers that separate two consecutive classes without a gap.
  The lower class boundary (LCB): $LCB_i = LCL_i - U/2$
  The upper class boundary (UCB): $UCB_i = UCL_i + U/2$
  where $U$ is the unit of measurement: the distance between two possible consecutive measures. It is usually taken as 1, 0.1, 0.01, 0.001, etc.
• Class width (W): the difference between the upper class boundary and the lower class boundary of a class.
• Class mark (M): the average of the lower and upper class limits, or equivalently the average of the lower and upper class boundaries.
• More than type cumulative frequency (MCF): the total frequency of all values greater than or equal to the LCB of a given class.
• Less than type cumulative frequency (LCF): the total frequency of all values less than or equal to the UCB of a given class.
• Relative frequency (RF): the frequency of a given class divided by the total frequency.

$$\text{Relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$

EXAMPLE: GROUPED FREQUENCY DISTRIBUTION
Example 1.3: Construct a grouped frequency distribution for the following dataset.

11 29 6 33 14 21 18 17 22 38
31 22 27 19 22 23 26 39 34 27

Solution:
Step 1: Find the range (R): $R = \text{Max} - \text{Min} = 39 - 6 = 33$
Step 2: Determine the number of classes (k) using Sturges' formula: $k = 1 + 3.32 \log_{10} n = 1 + 3.32 \log_{10}(20) = 5.32 \approx 6$ (rounding up)
Step 3: Find the class width: $W = R/k = 33/6 = 5.5 \approx 6$ (rounding up)
Step 4: Find the class limits.
Lower class limits (LCLs):
The LCL of the first class is the smallest observation in the given dataset, which is 6. To obtain the LCLs of the remaining classes, keep adding the class width, W = 6:
6  12  18  24  30  36
Upper class limits (UCLs):
To find the UCL of the first class, use $UCL_1 = LCL_2 - U = 12 - 1 = 11$. To obtain the UCLs of the remaining classes, keep adding the class width, W = 6:
11  17  23  29  35  41
The class limits are then:
6 – 11, 12 – 17, 18 – 23, 24 – 29, 30 – 35, 36 – 41

Step 5: Find the class boundaries.
Lower class boundaries (LCBs):
LCB of the first class: $LCB_1 = LCL_1 - U/2 = 6 - 1/2 = 5.5$. To obtain the LCBs of the remaining classes, keep adding the class width, W = 6:
5.5  11.5  17.5  23.5  29.5  35.5
Upper class boundaries (UCBs):
UCB of the first class: $UCB_1 = UCL_1 + U/2 = 11 + 1/2 = 11.5$. To obtain the UCBs of the remaining classes, keep adding the class width, W = 6:
11.5  17.5  23.5  29.5  35.5  41.5

Step 6: Find the frequencies, class marks, cumulative frequencies and relative frequencies.

Class limit   Class boundary   Freq.   Class mark (M)   LCF   MCF   RF
6 – 11        5.5 – 11.5       2       8.5              2     20    0.10
12 – 17       11.5 – 17.5      2       14.5             4     18    0.10
18 – 23       17.5 – 23.5      7       20.5             11    16    0.35
24 – 29       23.5 – 29.5      4       26.5             15    9     0.20
30 – 35       29.5 – 35.5      3       32.5             18    5     0.15
36 – 41       35.5 – 41.5      2       38.5             20    2     0.10

The LCF accumulates frequencies from the first class downward (for the third class, 2 + 2 + 7 = 11). The MCF accumulates frequencies from the last class upward (for the third class, 7 + 4 + 3 + 2 = 16).

2. DIAGRAMS AND GRAPHS
Diagrams and graphs are among the most effective and interesting ways of presenting statistical data.
i. Graphs (histograms, frequency polygons and ogives)
ii. Diagrams (pie chart, bar chart)

I. GRAPHS
• Histogram
• Frequency polygon
• Ogive (pronounced "oh-jive")

HISTOGRAM
A histogram is used to present continuous data. The height of each bar corresponds to the frequency of its class, and the bars are drawn adjacent to each other (without gaps). X-axis: class boundaries; Y-axis: class frequencies.
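The grouped frequency table of Example 1.3, which also supplies the bar heights of the histogram, can be reproduced with a short script. This is only a minimal sketch: the function name `grouped_frequency_table` and the row keys are hypothetical, and it assumes whole-number data with unit of measurement U = 1, as in the example.

```python
import math

# Data from Example 1.3; U is the unit of measurement (whole numbers here).
data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]
U = 1


def grouped_frequency_table(values, unit=1):
    """Build the rows of a grouped frequency distribution (hypothetical helper)."""
    n = len(values)
    data_range = max(values) - min(values)
    k = math.ceil(1 + 3.32 * math.log10(n))  # Sturges' formula, rounded up
    width = math.ceil(data_range / k)        # class width, rounded up (assumes unit = 1)

    rows = []
    cumulative = 0
    for i in range(k):
        lcl = min(values) + i * width        # lower class limit
        ucl = lcl + width - unit             # upper class limit
        freq = sum(1 for x in values if lcl <= x <= ucl)
        cumulative += freq
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - unit / 2, ucl + unit / 2),
            "freq": freq,
            "mark": (lcl + ucl) / 2,
            "lcf": cumulative,                # less than type cumulative frequency
            "mcf": n - (cumulative - freq),   # more than type cumulative frequency
            "rf": freq / n,                   # relative frequency
        })
    return rows


table = grouped_frequency_table(data, U)
for row in table:
    print(row)
```

Each printed row corresponds to one line of the Step 6 table, so the script can be used to check a hand-built table class by class.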
Example: Take the data in the table below and draw the histogram.

Class limit   Class boundary   Class mark   Freq.   LCF   MCF
6 – 11        5.5 – 11.5       8.5          2       2     20
12 – 17       11.5 – 17.5      14.5         2       4     18
18 – 23       17.5 – 23.5      20.5         7       11    16
24 – 29       23.5 – 29.5      26.5         4       15    9
30 – 35       29.5 – 35.5      32.5         3       18    5
36 – 41       35.5 – 41.5      38.5         2       20    2

[Figure: histogram. X-axis: class boundaries (5.5 to 41.5); Y-axis: frequency (0 to 7).]

FREQUENCY POLYGON
A frequency polygon uses line segments connecting points located directly above the class marks. X-axis: class marks; Y-axis: class frequencies.
Example: Take the data in the table above and draw the frequency polygon.

[Figure: frequency polygon. X-axis: class marks (2.5 to 44.5); Y-axis: frequency (0 to 7).]

OGIVE
An ogive (pronounced "oh-jive") is a graph of cumulative frequencies.
• Less than ogive: X-axis: UCB; Y-axis: LCF.
• More than ogive: X-axis: LCB; Y-axis: MCF.
Example: Take the data in the table above and draw the less than and more than ogives.

[Figure: less than ogive and more than ogive. X-axis: class boundaries (5.5 to 41.5); Y-axis: cumulative frequency (0 to 20).]

II. DIAGRAMS (PIE CHART AND BAR CHART)
Pie chart: a pie chart can be used to show the relation between the whole and its components.
Example 1.4: The following table gives the details of the monthly budget of a family. Represent these figures by a pie chart.

BAR CHART
Bar chart: uses vertical or horizontal bars to represent the frequencies/quantities of a distribution.
Example 1.5: Draw a simple bar diagram to represent the profits of a bank for 5 years.
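The points plotted in the frequency polygon and ogive examples earlier in this section can be computed directly from the class boundaries and frequencies of Example 1.3. The sketch below is illustrative only: the variable names are hypothetical, and the zero-frequency closing points of the polygon are assumed to lie one class width beyond the first and last class marks, matching the 2.5 and 44.5 ticks on the original axis.

```python
# Class boundaries and frequencies from the grouped table of Example 1.3.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
freqs = [2, 2, 7, 4, 3, 2]
total = sum(freqs)
width = boundaries[1] - boundaries[0]

# Frequency polygon: (class mark, frequency), closed with zero-frequency
# points one class width before the first mark and after the last mark.
marks = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
polygon = [(marks[0] - width, 0)] + list(zip(marks, freqs)) + [(marks[-1] + width, 0)]

# Less than ogive: (UCB, LCF), where LCF is the running total of frequencies.
lcf = []
running = 0
for f in freqs:
    running += f
    lcf.append(running)
less_than_ogive = list(zip(boundaries[1:], lcf))

# More than ogive: (LCB, MCF), where MCF counts this class and all later ones.
mcf = [total - (c - f) for c, f in zip(lcf, freqs)]
more_than_ogive = list(zip(boundaries[:-1], mcf))

print("polygon:", polygon)
print("less than ogive:", less_than_ogive)
print("more than ogive:", more_than_ogive)
```

The three lists can be passed to any plotting tool; joining the points of each list with straight line segments gives the corresponding graph.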