Higher Institute of Engineering & Technology, El-Boheira
Computer Engineering Department

Research paper submitted in fulfillment of Mathematics 4 (BA221)

Statistical analysis of (ungrouped and grouped data)
Expectation on the future data by trend line

Name: Mohamed Yosry Mohamed El-Zarka
Code: 19100
Year: first level
Doctor: Abdel Fattah Abu Hashem

Introduction

Data analysis is a method in which data is collected and organized so that one can derive helpful information from it. In other words, the main purpose of data analysis is to look at what the data is trying to tell us. For example, what does the data show, and what does it not show? Data analysis is the act of trying to learn something from a dataset. It is not an end in itself; it is used in service of optimizing or improving other activities.

Data processing: Data initially obtained must be processed or organized for analysis. For instance, this may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software. [2]

Grouped data vs. ungrouped data

- In statistics, the term data refers to information that has been collected and recorded for the purpose of a specific project; it can be either qualitative or quantitative.
- Both grouped and ungrouped data are types of data. However, grouped data has been classified into categories based on similar characteristics, whereas ungrouped data is raw data.
- Both types of data can be represented by frequency tables. For ungrouped data there are no class limits, so tally marks are used. Grouped data in a frequency table has limits for each class: the lower class limit and the upper class limit.
- Both types of data can be used to calculate the mean, mode and median of samples of a population; therefore, both are useful. [1]

Differences      Grouped                                                       Ungrouped
Classification   Organized into classes                                        No form of organization
Preference       Preferred when analyzing data                                 Preferred when collecting data
Accuracy         Has higher accuracy levels when calculating mean and median   Less accurate in determining mean and median
Presentation     Frequency tables are mostly used                              Lists are used in this data type
Summary          Summarized in frequency distribution                          No form of summarization

Ungrouped data

Ungrouped data is a collection of statistical data that is classified, but is otherwise uncategorized, unfiltered and unsorted. In other words, the data is described generally but has not been subdivided into groups or categories, and it consists of all the data collected with none of it omitted. It is also presented in the original order in which it was collected. [9]

Some of the advantages of ungrouped data are as follows:
1. Most people can easily interpret it.
2. When the sample size is small, it is easy to calculate the mean, mode and median.
3. It does not require technical expertise to analyze it. [7]

Arithmetic mean value

The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count. [3]

$$\text{Arithmetic mean value} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Geometric mean value

The geometric mean, sometimes referred to as the geometric average of a set of numerical values, is, like the arithmetic mean, a type of average and a measure of central tendency. It is a special type of average in which we multiply the numbers together and then take a square root (for two numbers), a cube root (for three numbers), or in general the nth root (for n numbers). Because of the formula used to calculate it, all values in the dataset must have the same sign, that is, they must be all positive or all negative.
In addition, if the data set contains a zero, the geometric mean will always be zero. [4]

$$\text{Geometric mean value} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$

Deviation

Deviation is a measure of the difference between the observed value of a variable and some other value, often that variable's mean. The sign of the deviation reports the direction of that difference (the deviation is positive when the observed value exceeds the reference value). The magnitude of the value indicates the size of the difference. [9]

$$\text{Deviation} = x_i - \bar{x}$$

Variance

The variance is defined as the average of the squared differences from the mean. Variance is non-negative because squares are positive or zero, and the variance of a constant is zero. The variance of a distribution is the square of its standard deviation. It is seldom reported on its own, but it is a step in calculating the standard deviation. It is also useful when creating statistical models, since low variance can be a sign that you are over-fitting your data. [3]

$$\text{Variance} = \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Standard deviation

Standard deviation is defined as the square root of the average of the squared deviations of the values from their average, or simply the square root of the variance. It is a measure of how spread out the numbers are. A smaller standard deviation means that the data is more concentrated around the mean; a larger standard deviation means that the data tends to be more spread out from the mean.

Why do we use the standard deviation when we have the variance? Because it keeps the calculations in the same units as the data. Suppose the mean is in m/s: the variance is then in m²/s², whereas the standard deviation is in m/s, so the standard deviation is used most. Unlike the variance, the standard deviation is much more intuitive and closer to the values of the original data set. Therefore, it is used more often for demographic analysis or in sample surveys to get an idea of what is normal in the population.
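As a minimal sketch of the two definitions above, the following Python snippet computes the variance and the standard deviation by dividing by n, as this paper does. The function names and the speed values are hypothetical and chosen only for illustration; they are not taken from the examples in this paper.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance, in the units of the data."""
    return math.sqrt(population_variance(values))


# Hypothetical speeds in m/s (illustrative data only).
speeds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(population_variance(speeds))  # 4.0, in m^2/s^2
print(population_std(speeds))       # 2.0, in m/s
```

The standard deviation comes out in m/s, the same unit as the data, while the variance is in m²/s².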
[4]

$$\text{Standard deviation} = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mode

The mode is simply the most common value. Although there is only one mean and one median in a set of data, it is possible to have more than one mode. A set of data with two modes is considered "bimodal", one with three "trimodal", and so on. A big advantage of the statistical mode is that it is not restricted to numbers alone. For example, among all the letters of the English alphabet, the mode is the letter 'E', which is the most frequently encountered letter. However, we cannot define the median or mean letter, since these can only be defined for numbers. This makes the scope of the mode quite broad. [6]

Median

The median is one measure of central tendency. To find it, arrange all the values in order from lowest to highest and select the middle one. The median is simply the point where 50% of the data is above and 50% is below. It is a good, intuitive metric of centrality that represents a "typical" or "middle" value well. If there is an even number of elements, it is the mean of the two middle numbers. [6]

Odd n:

$$\text{Median} = X_{\frac{n+1}{2}}$$

Even n:

$$\text{Median} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}$$

Range

Range is defined simply as the difference between the maximum and minimum observations. The reason for defining it this way is intuitive: the range should suggest how widely spread out the values are, and the difference between the maximum and minimum values gives an estimate of the spread of the data. The range can be misleading when there are extremely high or low values. This limitation is to be expected, because the range is computed from only two data points. Thus, it cannot give a very good estimate of how the overall data behaves.
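The three measures just described can be cross-checked with a short Python sketch. It uses the standard `statistics` module and the ten values of Example 1 below; the variable names are illustrative only.

```python
import statistics

# The ten observations of Example 1.
data = [100, 120, 120, 60, 100, 90, 140, 120, 80, 150]

# With an even number of values, median() averages the two middle ones.
median = statistics.median(data)

# multimode() returns every most-frequent value, so a bimodal set gives two entries.
modes = statistics.multimode(data)

# The range uses only the two extreme observations.
value_range = max(data) - min(data)

print(median, modes, value_range)  # 110.0 [120] 90
```

Because `multimode` returns a list, the same call also covers data sets with more than one mode.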
[5]

$$\text{Range} = x_{max} - x_{min}$$

Example 1 (ungrouped data):

For the following data: {100, 120, 120, 60, 100, 90, 140, 120, 80, 150}, evaluate the arithmetic mean, the geometric mean, the standard deviation, the median, the mode and the range.

x_i     x_i - x̄    (x_i - x̄)²
100       -8           64
120       12          144
120       12          144
 60      -48         2304
100       -8           64
 90      -18          324
140       32         1024
120       12          144
 80      -28          784
150       42         1764
∑ x_i = 1080        ∑ (x_i - x̄)² = 6760

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1080}{10} = 108$$

$$\text{Geometric mean value} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[10]{100 \cdot 120 \cdot 120 \cdot 60 \cdot 100 \cdot 90 \cdot 140 \cdot 120 \cdot 80 \cdot 150} = 104.6$$

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{6760}{10} = 676$$

$$\sigma = \sqrt{676} = 26$$

To get the median, the elements must first be sorted in ascending or descending order:

{60, 80, 90, 100, 100, 120, 120, 120, 140, 150}

$$\text{Median} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} = \frac{X_5 + X_6}{2} = \frac{100 + 120}{2} = 110$$

$$\text{Mode} = 120$$

$$\text{Range} = x_{max} - x_{min} = 150 - 60 = 90$$

Grouped data

Grouped data is data that is classified into groups after collection. The raw data is categorized into various groups and a table is created. The primary purpose of the table is to show the number of data points occurring in each group. For instance, when a test is done, the results are the data, and there are many ways to group them. For example, the number of students who scored within each interval of 20 marks can be recorded. Alternatively, the grades can be used, from A (90-100) all the way to F (0-59), with each category showing how many students fall in it. Histograms and frequency tables are best used to show and interpret grouped data.

Grouping of data has the following advantages:
- It helps in improving the efficiency of estimations.
- It allows for greater balancing of the statistical power of tests of the differences between strata by analyzing an equal number from each stratum.
- Irrelevant subpopulations are ignored while the significant ones are focused on. [1]

Class limits

The (integer) lower and upper limits are the lowest and highest values that can belong to each class.
Grouped data is data that has been organized into groups known as classes. Grouped data has been 'classified', so some level of data analysis has already taken place, which means that the data is no longer raw. A data class is a group of data related by some user-defined property. For example, if you were collecting the ages of the people you met as you walked down the street, you could group them into classes as those in their teens, twenties, thirties, forties and so on. Each of those groups is called a class. Each class has a certain width, which is referred to as the class interval or class size. The class interval is very important when it comes to drawing histograms and frequency diagrams. All the classes may have the same class size, or they may have different class sizes, depending on how you group your data. The class interval is always a whole number.

Note: The lower value of a class interval is called the lower limit and the upper value of that class interval is called the upper limit. Thus, each class interval has lower and upper limits. [7]

Frequency

Frequency is how often something occurs. By counting frequencies, we can make a frequency distribution table. [3]

Midpoints

The midpoints of the intervals are computed by adding the two apparent limits together and dividing by two. The midpoint for the interval 33 to 35 would thus be (33 + 35)/2, or 34. The midpoint for the second interval (36-38) would be 37. [7]

Weight

The weight of each interval is calculated by multiplying the midpoint of the interval by its frequency:

Weight = midpoint * frequency

Arithmetic mean value

Find the midpoint x_i of each class and multiply it by the class frequency f_i to get f_i * x_i. Add up these products and divide by the total frequency to get the mean value for the grouped data.

$$\text{mean value} = \bar{x} = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} x_i \cdot f_i$$

Deviation

Deviation in grouped data is a measure of the difference between the midpoint of an interval and the mean.
The sign of the deviation reports the direction of that difference (the deviation is positive when the observed value exceeds the reference value). The magnitude of the value indicates the size of the difference.

Standard deviation

The standard deviation is calculated as for ungrouped data, except that the squared deviation of every class must be multiplied by its frequency in order to account for the weight of every class. [3]

$$\text{Variance} = \sigma^2 = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2$$

$$\text{Standard deviation} = \sigma = \sqrt{\sigma^2}$$

Mode

First, find the interval with the highest frequency (f_max). S_1 and S_2 are the lower and upper limits of that interval, and x_min = S_1.
f_1: the frequency of the interval before the one with f_max.
f_2: the frequency of the interval after the one with f_max.
f_1 and f_2 are treated as torques at the edges of the interval; the balance point is the mode. Note that a data set can have more than one mode or no mode at all.

$$h = S_2 - S_1$$

$$f_1 x = f_2 (h - x)$$

$$\text{Mode} = x_{min} + x$$

Cumulative frequency

Calculating the cumulative frequency gives the sum (or running total) of all the frequencies up to a certain point in a data set. In other words, it is the total of a frequency and all the frequencies before it in a frequency distribution. [3]

Quartile

In statistics, quartiles, a type of quantile, are the three points that divide a sorted data set into four equal groups (by count of numbers), each representing a fourth of the sampled population. There are three quartiles: the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).
- The first quartile (lower quartile, QL) is equal to the 25th percentile of the data. It splits off the lowest 25% of the data from the highest 75%.
- The second (middle) quartile, or median, of a data set is equal to the 50th percentile of the data. It cuts the data in half.
- The third quartile (upper quartile, QU) is equal to the 75th percentile of the data. It splits off the lowest 75% of the data from the highest 25%. [7]

How do we calculate quartiles?

We sort the set of data with n items (numbers) and pick the (n/4)-th item as Q1, the (n/2)-th item as Q2 and the (3n/4)-th item as Q3.
If the indexes n/4, n/2 or 3n/4 are not integers, we interpolate between the nearest items. For example, for n = 100 items, the first quartile Q1 is the 25th item of the ordered data, Q2 is the 50th item and Q3 is the 75th item. The zero quartile Q0 would be the minimum item and the fourth quartile Q4 would be the maximum item of the data, but these extreme quartiles are simply called the minimum and the maximum of the set. [7]

In the three formulas below, N = ∑f is the total frequency, L is the lower limit of the set that contains the required position, cf is the cumulative frequency of all sets before that set, f is the frequency of that set, and h = S_2 - S_1 is its width.

Median

First, find the set in which the position ∑f/2 is located; L is the lower limit of that set.

$$h = S_2 - S_1$$

$$\text{Median} = L + \left[ \frac{\frac{N}{2} - cf}{f} \right] \cdot h$$

First quartile

First, find the set in which the position ∑f/4 is located; L is the lower limit of that set.

$$Q_1 = L + \left[ \frac{\frac{N}{4} - cf}{f} \right] \cdot h$$

Third quartile

First, find the set in which the position 3∑f/4 is located; L is the lower limit of that set.

$$Q_3 = L + \left[ \frac{\frac{3N}{4} - cf}{f} \right] \cdot h$$

Example 2 (grouped data):

For the following grouped data, evaluate the arithmetic mean, the standard deviation, the mode, the median, and the first and third quartiles.

Set         0-12   12-24   24-36   36-48   48-60   60-72   72-84   84-96   96-108   108-120
Frequency     8      12      15      18      24      16      12       6       8        6

Set        f_i    x_i    f_i x_i    x_i - x̄    (x_i - x̄)²    f_i (x_i - x̄)²     cf
0-12         8      6       48       -48.4       2342.56        18740.48           8
12-24       12     18      216       -36.4       1324.96        15899.52          20
24-36       15     30      450       -24.4        595.36         8930.40          35
36-48       18     42      756       -12.4        153.76         2767.68          53
48-60       24     54     1296        -0.4          0.16            3.84          77
60-72       16     66     1056        11.6        134.56         2152.96          93
72-84       12     78      936        23.6        556.96         6683.52         105
84-96        6     90      540        35.6       1267.36         7604.16         111
96-108       8    102      816        47.6       2265.76        18126.08         119
108-120      6    114      684        59.6       3552.16        21312.96         125
∑          125            6798                                  102221.60

[Figure: frequency polygon and curve of the distribution]

$$\text{mean value} = \bar{x} = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} x_i \cdot f_i = \frac{6798}{125} = 54.4$$

$$\sigma^2 = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2 = \frac{102221.60}{125} = 817.8$$

$$\text{Standard deviation} = \sigma = \sqrt{817.8} = 28.6$$

The mode is in the set [48-60], which has the highest frequency (24), with f_1 = 18 and f_2 = 16.

$$h = S_2 - S_1 = 60 - 48 = 12$$

$$f_1 x = f_2 (h - x)$$

$$18x = 16(12 - x)$$

$$x = 5.65$$

$$\text{Mode} = x_{min} + x = 48 + 5.65 = 53.65$$

∑f/2 = 125/2 = 62.5. The cumulative frequency first reaches 62.5 in the set [48-60] (cf rises from 53 to 77), so the median is in the set [48-60], with L = 48, cf = 53 and f = 24.

$$\text{Median} = L + \left[ \frac{\frac{N}{2} - cf}{f} \right] \cdot h = 48 + \left[ \frac{62.5 - 53}{24} \right] \cdot 12 = 52.75$$

∑f/4 = 125/4 = 31.25. The cumulative frequency first reaches 31.25 in the set [24-36] (cf rises from 20 to 35), so the first quartile is in the set [24-36], with L = 24, cf = 20 and f = 15.

$$Q_1 = L + \left[ \frac{\frac{N}{4} - cf}{f} \right] \cdot h = 24 + \left[ \frac{31.25 - 20}{15} \right] \cdot 12 = 33$$

3∑f/4 = (3 * 125)/4 = 93.75. The cumulative frequency first reaches 93.75 in the set [72-84] (cf rises from 93 to 105), so the third quartile is in the set [72-84], with L = 72, cf = 93 and f = 12.

$$Q_3 = L + \left[ \frac{\frac{3N}{4} - cf}{f} \right] \cdot h = 72 + \left[ \frac{93.75 - 93}{12} \right] \cdot 12 = 72.75$$

Expectation on the future data by trend line

A trend line is a mathematical equation that describes the relationship between two variables. It is produced from raw data obtained by measurement or testing. The simplest and most common trend line equations are linear, that is, straight lines. Once you know the trend line equation for the relationship between two variables, you can easily predict what the value of one variable will be for any given value of the other variable. You should already have a trend line based on a data set you have taken or gathered, with the line representing the general trend of that data. Then you can move on to predictions. [8]

Predicting a value

Examine your trend line equation to ensure it is in the proper form. The equation for a linear relationship should look like this: y = mx + b. "x" is the independent variable and is usually the one you have control over. "y" is the dependent variable, which changes in response to x.

Uses for a trend line: trend lines and predictions

A trend line is most often used to display data that increases or decreases at a specific and steady rate (at least within a specific timeline). That makes a trend line a great tool for predicting what value something will have in the future; trend lines and predictions go hand in hand. Some examples are predicting population size, predicting the amount of a certain molecule in a solution over time, or creating an equation that can then be used in the future to predict similar information with other data sets.
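As a hedged sketch of how such a linear trend line can be obtained and used, the Python snippet below solves the two least-squares normal equations in closed form and evaluates the fitted line at future x values. The function name `fit_trend_line` is hypothetical; the data are the eight monthly income values of Example 3 below, with the months numbered 1 to 8 (September = 9, October = 10).

```python
def fit_trend_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c.

    Closed-form solution of the normal equations:
        sum(y)  = m * sum(x)   + n * c
        sum(xy) = m * sum(x^2) + c * sum(x)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c


months = list(range(1, 9))                          # Jan..Aug numbered 1..8
income = [400, 450, 420, 500, 550, 600, 580, 700]   # income in pounds

m, c = fit_trend_line(months, income)
print(round(m, 2), round(c, 2))  # 39.76 346.07

for x in (9, 10):  # September and October
    print(x, round(m * x + c, 2))
```

With unrounded coefficients the forecasts come out near 703.93 and 743.69. Example 3 reports 703.91 and 743.67 because it rounds m and c to two decimals before substituting.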
[8]

Example 3 (trend line):

The income of a family (in pounds) in 8 successive months is shown in the following table. Estimate the forecasted income in September and October.

Month (x)   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug
Income      400   450   420   500   550   600   580   700

x      y       xy      x²
1     400      400      1
2     450      900      4
3     420     1260      9
4     500     2000     16
5     550     2750     25
6     600     3600     36
7     580     4060     49
8     700     5600     64
∑ 36  4200   20570    204

[Figure: chart of monthly income (y) against month number (x) with the trend line y = 39.76x + 346.07]

$$\sum y = m \sum x + N c$$

$$\sum xy = m \sum x^2 + c \sum x$$

$$4200 = m(36) + 8c \quad\Rightarrow\quad 9m + 2c = 1050$$

$$20570 = m(204) + c(36) \quad\Rightarrow\quad 102m + 18c = 10285$$

$$m = 39.76, \qquad c = 346.07$$

$$y = 39.76x + 346.07$$

$$\text{Prediction in September} = 39.76(9) + 346.07 = 703.91$$

$$\text{Prediction in October} = 39.76(10) + 346.07 = 743.67$$

References

1. http://www.differencebetween.net/language/words-language/differencebetween-grouped-data-and-ungrouped-data/
2. https://en.wikipedia.org/wiki/Data_analysis
3. https://www.mathsisfun.com/
4. https://www.khanacademy.org/
5. https://explorable.com/range-in-statistics
6. https://explorable.com/statistical-mode
7. https://www.wyzant.com/resources/lessons/math/statistics_and_probability/introduction/data
8. https://sciencing.com/use-line-equation-predicted-value-7985744.html
9. Jeffery T. Walker, Statistics in criminology and criminal justice: analysis and interpretation, 1999