Social Statistics

Week 1

Meaning and Origin of Statistics

The term "statistics" is ultimately derived from the Neo-Latin statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall in 1749, originally described the analysis of data about the state, the "science of state" (then known as political arithmetic in English). In the early 19th century, it came to mean the gathering and organization of data generally. Sir John Sinclair introduced the term into English in 1791 with the release of the first of 21 volumes titled Statistical Account of Scotland. [1]

Put another way, the word statistics comes from the Latin word "Status", the Italian word "Statista", the German word "Statistik" or the French word "Statistique", each referring to a political state. It originally meant information useful to the state, such as information about the sizes of populations (human, animal, products, etc.) and armed forces.

According to the pioneer statistician Yule, the word statistics first occurred in the book "The Elements of Universal Erudition" by Baron von Bielfeld (1770). In 1787 a wider definition was used by E.A.W. Zimmermann in "A Political Survey of the Present State of Europe". The word appeared in the Encyclopaedia Britannica in 1797 and was used by Sir John Sinclair in Britain in a series of volumes published between 1791 and 1799 giving a statistical account of Scotland. In the 19th century, the word statistics acquired a wider meaning covering numerical data on almost any subject, as well as the interpretation of data through appropriate analysis.

That is the short history of statistics. Today the word is used in several different senses:

■ Statistics refers to "numerical facts that are arranged systematically in the form of tables or charts, etc." In this sense it is always used as a plural, i.e. a set of numerical information.
For instance, statistics of prices, road accidents, crimes, births, educational institutions, etc.

■ The word statistics is also defined as a discipline that includes the procedures and techniques used to collect, process, and analyze numerical data in order to make inferences and reach appropriate decisions in situations of uncertainty (uncertainty refers to incompleteness; it does not imply ignorance). In this sense the word statistics is used in the singular. It denotes the science of basing decisions on numerical data.

■ Statistics are also numerical quantities calculated from sample observations; a single quantity calculated from sample observations, such as the mean, is called a statistic. Here the word statistics is the plural of statistic.

"We compute statistics from statistics by statistics." In the first place, statistics is the plural of statistic; in the second place, it is the plural sense meaning data; and in the third place, it is the singular sense meaning methods.

Social statistics is the use of statistical measurement systems to study human behavior in a social environment.

1. Primary Data Collection: Primary data collection involves the collection of original data directly from the source or through direct interaction with the respondents. This method allows researchers to obtain firsthand information specifically tailored to their research objectives. There are various techniques for primary data collection, including:
● Quantitative Data Collection Methods
● Qualitative Data Collection Methods

Quantitative Data Collection Methods

These are based on mathematical calculations and use formats such as closed-ended questions, correlation and regression methods, and measures such as the mean, median or mode. This method is cheaper than qualitative data collection methods and can be applied in a short time.

Qualitative Data Collection Methods

These do not involve any mathematical calculations. This method is closely associated with elements that are not quantifiable.
This qualitative data collection method includes interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data. They are:

a. Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect data from individuals or groups. These can be conducted through face-to-face interviews, telephone calls, mail, or online platforms. In the mail method, the set of questions is mailed to the respondent, who reads, answers and then returns the questionnaire. The questions are printed in a definite order on the form. A good survey should have the following features:
● Short and simple
● Should follow a logical sequence
● Provide adequate space for answers
● Avoid technical terms
● Should have a good physical appearance, such as the colour and quality of the paper, to attract the attention of the respondent

b. Interviews: Interviews involve direct interaction between the researcher and the respondent. They can be conducted in person, over the phone, or through video conferencing. Interviews can be structured (with predefined questions), semi-structured (allowing flexibility), or unstructured (more conversational). This is a method of collecting data as verbal responses. It is achieved in two ways:
● Personal Interview – In this method, a person known as an interviewer asks questions face to face to the other person. The personal interview can be structured or unstructured, a direct investigation, a focused conversation, etc.
● Telephonic Interview – In this method, an interviewer obtains information by contacting people on the telephone and asking for their answers or views verbally.

c. Observations: Researchers observe and record behaviors, actions, or events in their natural setting. This method is useful for gathering data on human behavior, interactions, or phenomena without direct intervention. The observation method is used when the study relates to behavioural science.
This method is planned systematically. It is subject to many controls and checks. The different types of observations are:
● Structured and unstructured observation
● Controlled and uncontrolled observation
● Participant, non-participant and disguised observation

d. Experiments: Experimental studies involve the manipulation of variables to observe their impact on the outcome. Researchers control the conditions and collect data to draw conclusions about cause-and-effect relationships.

e. Focus Groups: Focus groups bring together a small group of individuals who discuss specific topics in a moderated setting. This method helps in understanding opinions, perceptions, and experiences shared by the participants.

2. Secondary Data Collection: Secondary data collection involves using existing data that someone else collected for a different purpose. Researchers analyze and interpret this data to extract relevant information. Secondary data can be obtained from various sources, including:

a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers, government reports, and other published materials that contain relevant data.

b. Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research articles, statistical information, economic data, and social surveys.

c. Government and Institutional Records: Government agencies, research institutions, and organizations often maintain databases or records that can be used for research purposes.

d. Publicly Available Data: Data shared by individuals, organizations, or communities on public platforms, websites, or social media can be accessed and utilized for research.

e. Past Research Studies: Previous research studies and their findings can serve as valuable secondary data sources. Researchers can review and analyze the data to gain insights or build upon existing knowledge.
Uses of Statistics
● Conduct research
● Evaluate outcomes
● Develop critical thinking
● Make informed decisions

Week 2

Variables

What is a variable?

A variable is any kind of attribute or characteristic that you are trying to measure, manipulate and control in statistics and research. All studies analyze a variable, which can describe a person, place, thing or idea. A variable's value can change between groups or over time. For example, if the variable in an experiment is a person's eye color, its value can change from brown to blue to green from person to person.

● Continuous variable: a variable with an infinite number of possible values, like "time" or "weight".
● Dependent variable: the outcome of an experiment. As you change the independent variable, you watch what happens to the dependent variable.
● Discrete variable: a variable that can only take on a certain number of values. For example, "number of cars in a parking lot" is discrete because a car park can only hold so many cars.
● Independent variable: a variable that is not changed by the other variables in the study. Usually plotted on the x-axis.
● Qualitative variable: a broad category for any variable that can't be counted (i.e. has no numerical value). Nominal and ordinal variables fall under this umbrella term.
● Quantitative variable: a broad category that includes any variable that can be counted, or has a numerical value associated with it. Examples of variables that fall into this category include discrete variables and ratio variables.

Independent vs. dependent variables

Independent variables
● Definition: A variable that stands alone and isn't changed by the other variables or factors that are measured.
● Example: Age. Other variables such as where someone lives, what they eat or how much they exercise are not going to change their age.

Dependent variables
● Definition: A variable that relies on and can be changed by other factors that are measured.
● Example: A grade someone gets on an exam depends on factors such as how much sleep they got and how long they studied.

In studies, researchers often try to find out whether an independent variable causes other variables to change and in what way. When analyzing relationships between study objects, researchers often try to determine what makes the dependent variable change and how. Independent variables can influence dependent variables, but dependent variables cannot influence independent variables.

Quantitative vs. qualitative variables

Quantitative variables
● Definition: Any data sets that involve numbers or amounts.
● Examples: Height, distance or number of items.
● Types: Discrete and continuous.

Qualitative variables
● Definition: Non-numerical values or groupings.
● Examples: Eye color or dog breed.
● Types: Binary, nominal and ordinal.

Extraneous vs. confounding variables

● An extraneous variable is anything that could influence the dependent variable. These unwanted variables can unintentionally change a study's results or how a researcher interprets those results.
● A confounding variable influences the dependent variable, and also correlates with or causally affects the independent variable. Confounding variables can invalidate your experiment results by making them biased or suggesting a relationship between variables exists when it does not.

Extraneous variables
● Definition: Factors that affect the dependent variable but that the researcher did not originally consider when designing the experiment.
● Example: Parental support, prior knowledge of a foreign language or socioeconomic status are extraneous variables that could influence a study assessing whether private tutoring or online courses are more effective at improving students' Spanish test scores.

Confounding variables
● Definition: Extra variables that the researcher did not account for that can disguise another variable's effects and show false correlations.
● Example: In a study of whether a particular genre of movie affects how much candy kids eat, with experiments held at 9 a.m., noon and 3 p.m., time could be a confounding variable, as the group in the noon study might be hungrier and therefore eat more candy because lunchtime is typically at noon.

What Is a Constant?

A "constant" simply means a fixed value or a value that does not change. A constant has a known value. If you measure the height of a wall or bookshelf at home, it will be a constant number. It won't change. However, if you measure the height of a plant in a pot, it will keep changing as it grows. It's not constant. Take a look at the following sentence to understand this:
■ There are 7 days in a week. Here, 7 is a constant.

What is a Sample?

A sample is a smaller and more manageable representation of a larger group: a subset of a larger population that contains the characteristics of that population. A sample is used in statistical testing when the population size is too large for all members or observations to be included in the test. Ideally, the sample is an unbiased subset of the population that best represents the whole data.

To overcome the constraints of a population, you can sometimes collect data from a subset of your population and then treat it as the general norm. You collect the subset information from the groups who have taken part in the study, making the data reliable. The results obtained for the different groups who took part in the study can be extrapolated to generalize to the population.

Figure: Sample

The process of collecting data from a small subsection of the population and then using it to generalize over the entire set is called sampling.
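As a rough illustration of the idea of sampling described above, the sketch below draws a random sample from a made-up population and compares the sample mean with the population mean. The population values, the fixed seed and the sample size of 50 are all assumptions chosen for illustration; they do not come from these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable (arbitrary choice)

# Hypothetical population: ages of 10,000 people (made-up data).
population = [random.randint(18, 90) for _ in range(10_000)]

# Draw a sample of 50 members without replacement.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f}")
print(f"Sample mean:     {sample_mean:.1f}")
```

The two means will usually be close but not identical; the gap between them is the sampling error that a well-chosen sample tries to keep small.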
Samples are used when:
● The population is too large to collect data from.
● Data collected from the whole population would not be reliable.
● The population is hypothetical and is unlimited in size.

Take the example of a study that documents the results of a new medical procedure. It is unknown how the procedure will affect people across the globe, so a test group is used to find out how people react to it.

A sample should generally:
● Cover all the different variations present in the population and follow a well-defined selection criterion.
● Be unbiased with respect to the properties of the objects being selected.
● Be random, so that the objects of study are chosen fairly.

Say you are looking for a job in the IT sector, so you search online for IT jobs. The first search result would be for jobs all around the world. But you want to work in India, so you search for IT jobs in India. This would be your population. It would be impossible to go through and apply for all positions in the listing. So you consider the top 30 jobs you are qualified for and satisfied with and apply for those. This is your sample.

However, it's not that simple. When you do statistics, your sample size has to be ideal: not too large or too small. Then, once you've decided on a sample size, you must use a sound technique to collect the sample from the population:
● Probability sampling uses randomization to select sample members. You know the probability of each potential member's inclusion in the sample, for example 1/100. However, it isn't necessary for the odds to be equal. Some members might have a 1/100 chance of being chosen, others might have 1/50.
● Non-probability sampling uses non-random techniques (i.e. the judgment of the researcher). You can't calculate the odds of any particular item, person or thing being included in your sample.

Common Types

The most common techniques you'll likely meet in elementary statistics or AP statistics include taking a sample with and without replacement.
Specific techniques include:
● Bernoulli samples use independent Bernoulli trials on population elements. The trials decide whether each element becomes part of the sample. All population elements have an equal chance of being included in the sample. The sample sizes in Bernoulli samples follow a binomial distribution.
● Poisson samples (less common): an independent Bernoulli trial decides if each population element makes it into the sample, and the inclusion probability may differ from element to element.
● Cluster samples divide the population into groups (clusters). Then a random sample of clusters is chosen. It's used when researchers don't know the individuals in a population but do know the population subsets or groups.
● In systematic sampling, you select sample elements from an ordered frame. A sampling frame is just a list of participants that you want to get a sample from. For example, in the equal-probability method, choose an element from a list and then choose every kth element using the equation k = N/n. Small "n" denotes the sample size and capital "N" equals the size of the population.
● Simple random sampling (SRS): select items completely randomly, so that each element has the same probability of being chosen as any other element. Each subset of k elements has the same probability of being chosen as any other subset of k elements.
● In stratified sampling, sample each subpopulation independently. First, divide the population into homogeneous (very similar) subgroups before getting the sample. Each population member belongs to only one group. Then apply a simple random or systematic method within each group to choose the sample.
● Stratified randomization: a sub-type of stratified sampling used in clinical trials. First, divide patients into strata, then randomize with permuted block randomization.

What is Population?

In statistics, a population is the entire set of items from which you draw data for a statistical study. It can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area at a specific time. But in statistics, population refers to data on your study of interest. It can be a group of individuals, objects, events, organizations, etc. You use populations to draw conclusions.

Figure: Population

An example of a population would be the entire student body at a school. It would contain all the students who study in that school at the time of data collection. Depending on the problem statement, data from each of these students is collected. An example is the students who speak Hindi among the students of a school.

For the above situation, it is easy to collect data. The population is small, willing to provide data and can be contacted. The data collected will be complete and reliable.

If you had to collect the same data from a larger population, say the entire country of India, it would be impossible to draw reliable conclusions because of geographical and accessibility constraints, not to mention time and resource constraints. A lot of data would be missing or might be unreliable. Furthermore, due to accessibility issues, marginalized tribes or villages might not provide data at all, making the data biased towards certain regions or groups.

In statistics, hypothesis testing is used to determine whether the variation between groups of data reflects true variation. Sample data are drawn from the population, and assumptions about a population parameter are tested against them. Hypotheses can be classified into various types. This section covers the definition of a hypothesis, the various types of hypothesis and the significance of hypothesis testing.

Hypothesis Definition in Statistics

In statistics, a hypothesis is defined as a formal statement that explains the relationship between two or more variables of the specified population.
It helps the researcher to translate the given problem into a clear explanation for the outcome of the study. It clearly explains and predicts the expected outcome. It indicates the type of experimental design and directs the research process.

Types of Hypothesis

Hypotheses can be broadly classified into different types. They are:

Simple Hypothesis
A simple hypothesis states that a relationship exists between two variables. One is called the dependent variable, and the other is called the independent variable.

Complex Hypothesis
A complex hypothesis is used when the relationship involves more than two variables, that is, more than one dependent or independent variable.

Null Hypothesis
The null hypothesis states that there is no significant difference between the populations specified in the experiment; any observed difference is attributed to experimental or sampling error. The null hypothesis is denoted by H0.

Alternative Hypothesis
The alternative hypothesis states that the observations are influenced by some non-random cause, that is, by a real effect rather than chance. It is denoted by Ha or H1.

Empirical Hypothesis
An empirical hypothesis is formed through experiments and is based on evidence.

Statistical Hypothesis
A statistical hypothesis is a statement, whether or not it seems logical, that can be verified statistically.

Apart from these types, other hypotheses include directional and non-directional hypotheses, associative hypotheses and causal hypotheses.

Characteristics of Hypothesis

The important characteristics of a hypothesis are:
● The hypothesis should be short and precise
● It should be specific
● A hypothesis must be related to the existing body of knowledge
● It should be capable of verification

In statistics, variables or numbers are defined and categorised using different scales of measurement. Each level of measurement has specific properties that determine which statistical analyses can be used.
This section covers the four types of scales: nominal, ordinal, interval and ratio.

What is the Scale?

A scale is a device or an object used to measure or quantify any event or another object.

Levels of Measurements

There are four different scales of measurement. Any data can be classified as belonging to one of the four scales. The four types of scales are:
● Nominal Scale
● Ordinal Scale
● Interval Scale
● Ratio Scale

Nominal Scale

A nominal scale is the 1st level of measurement, in which the numbers serve as "tags" or "labels" to classify or identify the objects. A nominal scale usually deals with non-numeric variables or with numbers that do not have any value. The nominal scale simply categorizes variables according to qualitative labels (or names). These labels and groupings don't have any order or hierarchy to them, nor do they convey any numerical value. For example, the variable "hair color" could be measured on a nominal scale according to the following categories: blonde hair, brown hair, gray hair, and so on.

Characteristics of Nominal Scale
● A nominal scale variable is classified into two or more categories. In this measurement mechanism, the answer should fall into one of the classes.
● It is qualitative. The numbers are used here to identify the objects.
● The numbers don't define the object characteristics. The only permissible operation on numbers in the nominal scale is "counting."

Example: An example of a nominal scale measurement is given below:
What is your gender?
M - Male
F - Female
Here, the letters are used as tags, and the answer to this question should be either M or F.

Ordinal Scale

The ordinal scale is the 2nd level of measurement. It reports the ordering and ranking of data without establishing the degree of variation between them. Ordinal represents the "order." Ordinal data is known as qualitative data or categorical data. It can be grouped, named and also ranked.
The ordinal scale also categorizes variables into labeled groups, and these categories have an order or hierarchy to them. For example, you could measure the variable "income" on an ordinal scale as follows: low income, medium income, high income. Another example could be level of education, classified as follows: high school, master's degree, doctorate. These are still qualitative labels (as with the nominal scale), but you can see that they follow a hierarchical order.

Characteristics of the Ordinal Scale
● The ordinal scale shows the relative ranking of the variables
● It identifies and describes the magnitude of a variable
● Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables
● The interval properties are not known
● The surveyors can quickly analyse the degree of agreement concerning the identified order of variables

Example:
● Ranking of school students – 1st, 2nd, 3rd, etc.
● Ratings in restaurants
● Evaluating the frequency of occurrences: Very often, Often, Not often, Not at all
● Assessing the degree of agreement: Totally agree, Agree, Neutral, Disagree, Totally disagree

Interval Scale

The interval scale is the 3rd level of measurement. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the variables are measured in exact units rather than relative ranks, but the position of zero is arbitrary. The interval scale is a numerical scale which labels and orders variables, with a known, evenly spaced interval between each of the values. An oft-cited example of interval data is temperature in Fahrenheit, where the difference between 10 and 20 degrees Fahrenheit is exactly the same as the difference between, say, 50 and 60 degrees Fahrenheit.
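The Fahrenheit example can be checked with a few lines of arithmetic. The sketch below is only an illustration: it confirms that equal differences stay equal on an interval scale, and it shows why ratios are not meaningful there, because the ratio of two temperatures changes when the same temperatures are converted to Celsius. The helper name fahrenheit_to_celsius is introduced here for illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Differences are meaningful on an interval scale:
diff_low = 20 - 10    # from 10 F to 20 F
diff_high = 60 - 50   # from 50 F to 60 F
print("Equal differences:", diff_low == diff_high)

# Ratios are not: "20 F is twice 10 F" does not survive a change of units,
# because the zero point of each temperature scale is arbitrary.
ratio_fahrenheit = 20 / 10
ratio_celsius = fahrenheit_to_celsius(20) / fahrenheit_to_celsius(10)
print(f"Ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"Ratio in Celsius:    {ratio_celsius:.2f}")
```

On a ratio scale such as weight, the same check would give the same ratio in kilograms and in grams, because both units share a true zero.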
Characteristics of Interval Scale:
● The interval scale is quantitative as it can quantify the difference between the values
● It allows calculating the mean and median of the variables
● To understand the difference between the variables, you can subtract one value from another
● The interval scale is the preferred scale in statistics as it helps to assign numerical values to arbitrary assessments such as feelings, calendar types, etc.

Example:
● Likert Scale
● Net Promoter Score (NPS)
● Bipolar Matrix Table

Ratio Scale

The ratio scale is the 4th level of measurement, and it is quantitative. It allows researchers to compare differences or intervals. Its unique feature is that it has a true origin, or zero point. The ratio scale is exactly the same as the interval scale, with one key difference: the ratio scale has what's known as a "true zero." A good example of ratio data is weight in kilograms. If something weighs zero kilograms, it truly weighs nothing. Compare this with temperature (interval data), where a value of zero degrees doesn't mean there is "no temperature"; it simply means it's extremely cold!

Characteristics of Ratio Scale:
● The ratio scale has a feature of absolute zero
● It doesn't have negative numbers, because of its zero-point feature
● It affords unique opportunities for statistical analysis. The variables can be meaningfully added, subtracted, multiplied and divided. Mean, median, and mode can be calculated using the ratio scale.
● The ratio scale has unique and useful properties. One such feature is that it allows unit conversions like kilogram – calories, gram – calories, etc.

Example: An example of a ratio scale is:
What is your weight in kgs?
● Less than 55 kgs
● 55 – 75 kgs
● 76 – 85 kgs
● 86 – 95 kgs
● More than 95 kgs

Week 3

Frequency Distribution

What is a frequency distribution?

The frequency of a value is the number of times it occurs in a dataset.
A frequency distribution is the pattern of frequencies of a variable. It's the number of times each possible value of a variable occurs in a dataset. The frequency of a particular value is denoted f. The distribution of a variable is the pattern of frequencies, meaning the set of all possible values and the frequencies associated with these values. Frequency distributions are portrayed as frequency tables or charts.

Types of frequency distributions

There are four types of frequency distributions:
● Ungrouped frequency distributions: The number of observations of each value of a variable.
○ You can use this type of frequency distribution for categorical variables.
● Grouped frequency distributions: The number of observations of each class interval of a variable. Class intervals are ordered groupings of a variable's values.
○ You can use this type of frequency distribution for quantitative variables.
● Relative frequency distributions: The proportion of observations of each value or class interval of a variable.
○ You can use this type of frequency distribution for any type of variable when you're more interested in comparing frequencies than the actual number of observations.
● Cumulative frequency distributions: The sum of the frequencies less than or equal to each value or class interval of a variable.
○ You can use this type of frequency distribution for ordinal or quantitative variables when you want to understand how often observations fall below certain values.

Week 4

Frequency Table

How to make a frequency table

Frequency distributions are often displayed using frequency tables. Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals, which will be explained shortly. A frequency table is an effective way to summarize or organize a dataset.
It's usually composed of two columns:
● The values or class intervals
● Their frequencies

The method for making a frequency table differs between the four types of frequency distributions. You can follow the guides below or use software such as Excel, SPSS, or R to make a frequency table.

How to make an ungrouped frequency table

1. Create a table with two columns and as many rows as there are values of the variable. Label the first column using the variable name and label the second column "Frequency." Enter the values in the first column.
○ For ordinal variables, the values should be ordered from smallest to largest in the table rows.
○ For nominal variables, the values can be in any order in the table. You may wish to order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs. Enter the frequencies in the second column of the table beside their corresponding values.
○ Especially if your dataset is large, it may help to count the frequencies by tallying. Add a third column called "Tally." As you read the observations, make a tick mark in the appropriate row of the tally column for each observation. Count the tally marks to determine the frequency.

Example: Making an ungrouped frequency table
A gardener set up a bird feeder in their backyard. To help them decide how much and what type of birdseed to buy, they decide to record the bird species that visit their feeder. Over the course of one morning, the following birds visit their feeder:

How to make a grouped frequency table

1. Divide the variable into class intervals. Below is one method to divide a variable into class intervals. Different methods will give different answers, but there's no agreement on the best method to calculate class intervals.
○ Calculate the range. Subtract the lowest value in the dataset from the highest.
○ Decide the class interval width.
There are no firm rules on how to choose the width, but the following formula is a rule of thumb:
class interval width = range / √(sample size)
○ You can round this value to a whole number or a number that's convenient to add (such as a multiple of 10).
○ Calculate the class intervals. Each interval is defined by a lower limit and upper limit. Observations in a class interval are greater than or equal to the lower limit and less than the upper limit:
lower limit ≤ x < upper limit
○ The lower limit of the first interval is the lowest value in the dataset. Add the class interval width to find the upper limit of the first interval and the lower limit of the second interval. Keep adding the interval width to calculate more class intervals until you exceed the highest value.
2. Create a table with two columns and as many rows as there are class intervals. Label the first column using the variable name and label the second column "Frequency." Enter the class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in each class interval. You can count by tallying if you find it helpful. Enter the frequencies in the second column of the table beside their corresponding class intervals.

Example: Grouped frequency distribution
A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of the ages of the survey respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
The range is 63 − 19 = 44, and the rule of thumb gives 44 / √20 ≈ 9.84. Round the class interval width to 10. The class intervals are 19 ≤ a < 29, 29 ≤ a < 39, 39 ≤ a < 49, 49 ≤ a < 59, and 59 ≤ a < 69.

How to make a relative frequency table

1. Create an ungrouped or grouped frequency table.
2. Add a third column to the table for the relative frequencies. To calculate the relative frequencies, divide each frequency by the sample size. The sample size is the sum of the frequencies.
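The grouped and relative frequency steps above can be sketched in a few lines, using the sociologist's 20 ages. This is a minimal sketch, not a definitive method: the square-root rule of thumb for the width (range divided by the square root of the sample size) is one common choice and is treated here as an assumption, and the variable names are introduced only for illustration.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

sample_size = len(ages)
value_range = max(ages) - min(ages)               # highest minus lowest value
raw_width = value_range / math.sqrt(sample_size)  # assumed rule of thumb
width = 10                                        # rounded to a convenient number

# Build the class intervals and count the observations in each one.
table = []
lower = min(ages)
while lower <= max(ages):
    upper = lower + width
    frequency = sum(1 for a in ages if lower <= a < upper)
    relative = frequency / sample_size
    table.append((lower, upper, frequency, relative))
    lower = upper

print(f"Range = {value_range}, raw width = {raw_width:.2f}, width used = {width}")
for lower, upper, frequency, relative in table:
    print(f"{lower} <= a < {upper}: frequency = {frequency}, relative = {relative:.2f}")
```

Each row holds the interval limits, the frequency and the relative frequency, so the frequencies add up to the sample size and the relative frequencies add up to 1.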
Example: Relative frequency distribution
From this table, the gardener can make observations, such as that 19% of the bird feeder visits were from chickadees and 25% were from finches.
How to make a cumulative frequency table
1. Create an ungrouped or grouped frequency table for an ordinal or quantitative variable. Cumulative frequencies don’t make sense for nominal variables because the values have no order: one value isn’t more than or less than another value.
2. Add a third column to the table for the cumulative frequencies. The cumulative frequency is the number of observations less than or equal to a certain value or class interval. To calculate the cumulative frequencies, add each frequency to the frequencies in the previous rows.
3. Optional: If you want to calculate the cumulative relative frequency, add another column and divide each cumulative frequency by the sample size.
Example: Cumulative frequency distribution
From this table, the sociologist can make observations such as 13 respondents (65%) were under 39 years old, and 16 respondents (80%) were under 49 years old.
How to graph a frequency distribution
Pie charts, bar charts, and histograms are all ways of graphing frequency distributions. The best choice depends on the type of variable and what you’re trying to communicate.
Pie chart
A pie chart is a graph that shows the relative frequency distribution of a nominal variable. A pie chart is a circle that’s divided into one slice for each value. The size of the slices shows their relative frequency. This type of graph can be a good choice when you want to emphasize that one value is especially frequent or infrequent, or you want to present the overall composition of a variable. A disadvantage of pie charts is that it’s difficult to see small differences between frequencies. As a result, it’s also not a good option if you want to compare the frequencies of different values.
Bar chart
A bar chart is a graph that shows the frequency or relative frequency distribution of a categorical variable (nominal or ordinal). The y-axis shows the frequencies or relative frequencies, and the x-axis shows the values. Each value is represented by a bar, and the length or height of the bar shows the frequency of the value. A bar chart is a good choice when you want to compare the frequencies of different values. It’s much easier to compare the heights of bars than the angles of pie chart slices.
Histogram
A histogram is a graph that shows the frequency or relative frequency distribution of a quantitative variable. It looks similar to a bar chart. The continuous variable is grouped into interval classes, just like a grouped frequency table. The y-axis shows the frequencies or relative frequencies, and the x-axis shows the interval classes. Each interval class is represented by a bar, and the height of the bar shows the frequency or relative frequency of the interval class.
Although bar charts and histograms are similar, there are important differences:
Feature | Bar chart | Histogram
Type of variable | Categorical | Quantitative
Value grouping | Ungrouped (values) | Grouped (interval classes)
Bar spacing | Can be a space between bars | Never a space between bars
Bar order | Can be in any order | Can only be ordered from lowest to highest
A histogram is an effective visual summary of several important characteristics of a variable. At a glance, you can see a variable’s central tendency and variability, as well as what probability distribution it appears to follow, such as a normal, Poisson, or uniform distribution.
WEEK 5
Definition of Mean in Statistics
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total number of numbers.
Mean = (Sum of all the observations / Total number of observations)
Example: What is the mean of 2, 4, 6, 8 and 10?
Solution: First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations). Mean = 30/5 = 6
In the case of a discrete probability distribution of a random variable X, the mean is equal to the sum over every possible value weighted by the probability of that value; that is, it is computed by taking the product of each possible value x of X and its probability P(x) and then adding all these products together.
Mean Symbol (X Bar)
The mean is usually denoted by the symbol ‘x̄’. The bar above the letter x indicates the mean of the x values.
x̄ = (Sum of values ÷ Number of values)
x̄ = (x1 + x2 + x3 + … + xn)/n
Mean Formula
The mean is calculated from the given data set, and each term in the data set is considered while evaluating it. The general formula for the mean is the ratio of the sum of all the terms to the total number of terms. Hence, we can say:
Mean = Sum of the Given Data / Total number of Data
To calculate the arithmetic mean of a set of data we must first add up (sum) all of the data values (x) and then divide the result by the number of values (n). Since ∑ is the symbol used to indicate that values are to be summed, we obtain the following formula for the mean (x̄):
x̄ = ∑x / n
How to Find Mean?
Data can be grouped or ungrouped, so to find the mean of given data we first need to check whether the data is grouped or ungrouped. The formulas to find the mean for ungrouped data and grouped data are different. In this section, you will learn the method of finding the mean for both of these instances.
Mean for Ungrouped Data
The example given below will help you in understanding how to find the mean of ungrouped data.
Example: In a class there are 20 students and they have secured a percentage of 88, 82, 88, 85, 84, 80, 81, 82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83. Find the mean percentage obtained by the class.
Solution: Mean = Total of percentage obtained by 20 students in class / Total number of students = [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 + 82 + 83]/20 = 1660/20 = 83
Hence, the mean percentage obtained by the class is 83%.
Mean for Grouped Data
For grouped data, we can find the mean using any of the following formulas.
Direct method: x̄ = ∑fixi / ∑fi
Assumed mean method: x̄ = a + ∑fidi / ∑fi, where a is the assumed mean and di = xi − a
Step-deviation method: x̄ = a + h(∑fiui / ∑fi), where h is the class width and ui = (xi − a) / h
Go through the example given below to understand how to calculate the mean for grouped data.
Example: Find the mean for the following distribution.
xi | 11 | 14 | 17 | 20
fi | 3 | 6 | 8 | 7
Solution: For the given data, we can find the mean using the direct method.
xi | fi | fixi
11 | 3 | 33
14 | 6 | 84
17 | 8 | 136
20 | 7 | 140
Total | ∑fi = 24 | ∑fixi = 393
Mean = ∑fixi / ∑fi = 393/24 ≈ 16.4
Types of Mean
There are four main types of mean that you will be studying in statistics.
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
4. Quadratic Mean
Arithmetic Mean
When you add up all the values and divide by the number of values, the result is called the arithmetic mean. To calculate it, just add up all the given numbers, then divide by how many numbers are given.
Example: What is the mean of 3, 5, 9, 5, 7, 2?
Add up all the given numbers: 3 + 5 + 9 + 5 + 7 + 2 = 31
Now divide by how many numbers are provided in the sequence: 31/6 ≈ 5.17
5.17 is the answer.
Geometric Mean
The geometric mean of two numbers x and y is √(xy). If you have three numbers x, y, and z, their geometric mean is ∛(xyz).
Example: Find the geometric mean of 4 and 3.
Solution: √(4 × 3) = √12 ≈ 3.46
How to Find the Geometric Mean (Examples)
Example 1: What is the geometric mean of 2, 3, and 6? First, multiply the numbers together and then take the cube root (because there are three numbers) = (2 × 3 × 6)^(1/3) = 36^(1/3) ≈ 3.30
Note: The power of (1/3) is the same as the cube root ∛. To convert an nth root to this notation, just change the denominator in the fraction to whatever “n” you have.
So:
● 5th root = to the (1/5) power
● 12th root = to the (1/12) power
● 99th root = to the (1/99) power.
Example 2: What is the geometric mean of 4, 8, 3, 9 and 17? First, multiply the numbers together and then take the 5th root (because there are 5 numbers) = (4 × 8 × 3 × 9 × 17)^(1/5) ≈ 6.81
Example 3: What is the geometric mean of 1/2, 1/4, 1/5, 9/72 and 7/4? First, multiply the numbers together and then take the 5th root: (1/2 × 1/4 × 1/5 × 9/72 × 7/4)^(1/5) ≈ 0.35.
Example 4: The average person’s monthly salary in a certain town jumped from $2,500 to $5,000 over the course of ten years. Using the geometric mean, what is the average yearly increase?
Solution:
Step 1: Find the geometric mean. (2500 × 5000)^(1/2) ≈ 3535.53.
Step 2: Divide by 10 (to get the average increase over ten years). 3535.53 / 10 ≈ 353.55.
The average increase (according to the GM) is $353.55. This figure is a dollar amount based on the geometric mean of the two salaries; the average yearly growth rate, by contrast, is (5000/2500)^(1/10) − 1 ≈ 7.2%.
Harmonic Mean
The harmonic mean is a very specific type of average. It’s generally used when averaging rates and ratios, such as speeds. The formula is:
H = n / (1/x1 + 1/x2 + … + 1/xn)
If the formula above looks daunting, all you need to do to solve it is:
1. Add the reciprocals of the numbers in the set.
2. Divide the number of items in the set by your answer to Step 1.
The harmonic mean is a numerical average calculated by dividing the number of observations, or entries in the series, by the sum of the reciprocals of the numbers in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. For example, to calculate the harmonic mean of 1, 4, and 4, you would divide the number of observations by the sum of the reciprocals of the numbers, as follows:
3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2
The harmonic mean has uses in finance and technical analysis of markets, among others.
Example of the Harmonic Mean
As an example, take two firms. One has a market capitalization of $100 billion and earnings of $4 billion (P/E of 25), and the other has a market capitalization of $1 billion and earnings of $4 million (P/E of 250).
In an index made of the two stocks, with 10% invested in the first and 90% invested in the second, the P/E ratio of the index, calculated as a weighted harmonic mean, is:
P/E = (0.1 + 0.9) / (0.1/25 + 0.9/250) = 1 / 0.0076 ≈ 131.6
The Bottom Line
The harmonic mean is calculated by dividing the number of entries in a series by the sum of the reciprocals of the numbers in the series. The harmonic mean stands out from the other types of Pythagorean mean (the arithmetic mean and the geometric mean) by using reciprocals and giving greater weight to smaller values. The harmonic mean is best used for fractions such as rates, and in finance, it is useful for averaging data like price multiples and identifying patterns such as Fibonacci sequences.
What is the Quadratic Mean / Root Mean Square?
The quadratic mean (also called the root mean square*) is a type of average. It measures the absolute magnitude of a set of numbers, and is calculated by:
● Squaring each number,
● Finding the mean of these squares,
● Taking the square root of that average.
If you label each element of your set as xi, where i is an index number running from 1 to n, the RMS can be described as:
x_rms = √((x1² + x2² + … + xn²) / n)
RMS gives a greater weight to larger items in a set and is always equal to or greater than the “regular” arithmetic mean (average).
Sometimes the quadratic mean is referred to as being “the same as” the standard deviation. This isn’t strictly true: standard deviation is actually equal to the quadratic mean of the deviations from the mean of the data set. For example, quadratic mean is used in the physical sciences as a synonym for standard deviation when referencing the “square root of the mean squared deviation of a signal from a given baseline or fit” (Wolfram). The quadratic mean is also called the root mean square because it is the square root of the mean of the squares of the numbers in the set.
*Note: This is different from the root mean square error (RMSE), which is a value used in regression analysis to describe how spread out data is around a regression line.
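The four types of mean discussed in this section can be compared side by side with a short Python sketch. The function names below are hypothetical names chosen for this illustration, and the sketch assumes every value is a positive number, since the geometric and harmonic means are not defined for all inputs otherwise.

```python
import math


def arithmetic_mean(xs):
    # Sum of the values divided by the number of values.
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # nth root of the product of n positive values.
    return math.prod(xs) ** (1 / len(xs))


def harmonic_mean(xs):
    # Number of values divided by the sum of their reciprocals.
    return len(xs) / sum(1 / x for x in xs)


def quadratic_mean(xs):
    # Square each value, average the squares, take the square root.
    return math.sqrt(sum(x * x for x in xs) / len(xs))


values = [1, 4, 4]  # illustrative data
for name, func in [("harmonic", harmonic_mean),
                   ("geometric", geometric_mean),
                   ("arithmetic", arithmetic_mean),
                   ("quadratic", quadratic_mean)]:
    print(f"{name:>10}: {func(values):.3f}")
```

For positive data the four results come out in the order printed, from smallest (harmonic) to largest (quadratic), which matches the remarks that the harmonic mean gives greater weight to smaller values and the quadratic mean gives greater weight to larger ones.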
Formula
The quadratic mean is equal to the square root of the mean of the squared values. The formula is:
x_rms = √((x1² + x2² + … + xn²) / n)
An equivalent formula has a summation sign (summation means “to add up”, so it’s telling you here to add all of the squared x-values up):
x_rms = √((1/n) ∑ xi²)
Examples of the Root Mean Square (RMS)
To find the root mean square of the set {1, 3, 4}:
1. Square each of the numbers: 1, 9, 16.
2. Find the mean of Step 1: (1 + 9 + 16) / 3 = 26/3 ≈ 8.67.
3. Find the square root of Step 2: √8.67 ≈ 2.94.
Worked Example
Find the Root Mean Square of 2, 4, 9, 10, and 12.
Step 1: Count the number of items. N = 5. Set this number aside for a moment.
Step 2: Square all of the numbers. 2², 4², 9², 10², 12² = 4, 16, 81, 100, 144.
Step 3: Add the numbers from Step 2 up: 4 + 16 + 81 + 100 + 144 = 345.
Step 4: Divide Step 3 (the sum) by Step 1 (number of items in the set): 345/5 = 69.
Step 5: Find the square root of Step 4. √69 ≈ 8.31.
That’s it!
The RMS of any series of positive identical numbers will be that same number, just as the average of a series of identical numbers is the number itself. The RMS of a series of negative identical numbers will be the absolute value of that number. For positive values, the RMS is either the same or a bit larger than the average.
WEEK 6
What are quantiles?
A quartile is a type of quantile. Quantiles are values that split sorted data or a probability distribution into equal parts. In general terms, the q-quantiles divide sorted data into q parts. The most commonly used quantiles have special names:
● Quartiles (4-quantiles): Three quartiles split the data into four parts.
● Deciles (10-quantiles): Nine deciles split the data into 10 parts.
● Percentiles (100-quantiles): 99 percentiles split the data into 100 parts.
There is always one fewer quantile than there are parts created by the quantiles.
How to find quantiles
To find a q-quantile, you can follow a similar method to that used for quartiles (described below), except in steps 3–5, multiply n by multiples of 1/q instead of 1/4. For example, to find the third 5-quantile:
1.
Calculate n * (3 / 5).
2. If n * (3 / 5) is an integer, then the third 5-quantile is the mean of the numbers at positions n * (3 / 5) and n * (3 / 5) + 1.
3. If n * (3 / 5) is not an integer, then round it up. The number at this position is the third 5-quantile.
Quartiles
Quartiles are values that divide your data into quarters. However, quartiles aren’t shaped like pizza slices; instead, they divide your data into four segments according to where the numbers fall on the number line. The four quarters of a data set are:
1. The lowest 25% of numbers.
2. The next lowest 25% of numbers (up to the median).
3. The second highest 25% of numbers (above the median).
4. The highest 25% of numbers.
Quartiles are three values that split sorted data into four parts, each with an equal number of observations. Quartiles are a type of quantile.
● First quartile: Also known as Q1, or the lower quartile. This is the number halfway between the lowest number and the middle number.
● Second quartile: Also known as Q2, or the median. This is the middle number halfway between the lowest number and the highest number.
● Third quartile: Also known as Q3, or the upper quartile. This is the number halfway between the middle number and the highest number.
Quartiles can also split probability distributions into four parts, each with an equal probability.
Find Quartiles: Examples
Example: Divide the following data set into quartiles: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12.
Step 1: Put the numbers in order: 2, 5, 6, 7, 10, 12, 13, 14, 16, 22, 45, 65.
Step 2: Count how many numbers there are in your set and then divide by 4 to cut the list of numbers into quarters. There are 12 numbers in this set, so you would have 3 numbers in each quarter.
2, 5, 6 | 7, 10, 12 | 13, 14, 16 | 22, 45, 65
If you have an uneven set of numbers, it’s OK to slice a number down the middle.
This can get a little tricky (imagine trying to divide 10, 13, 17, 19, 21 into quarters!), so you may want to use an online interquartile range calculator to figure those quartiles out for you. The calculator gives you the 25th percentile, which is the end of the first quarter, the 50th percentile, which is the end of the second quarter (or the median), and the 75th percentile, which is the end of the third quarter. For 10, 13, 17, 19 and 21 the results are:
25th Percentile: 11.5
50th Percentile: 17
75th Percentile: 20
Interquartile Range: 8.5
Why do we need quartiles in statistics? The main reason is to perform further calculations, like the interquartile range, which is a measure of how spread out the middle 50% of the data is.
Quartiles are a type of percentile. A percentile is a value with a certain percentage of the data falling below it. In general terms, k% of the data falls below the kth percentile.
● The first quartile (Q1, or the lower quartile) is the 25th percentile, meaning that 25% of the data falls below the first quartile.
● The second quartile (Q2, or the median) is the 50th percentile, meaning that 50% of the data falls below the second quartile.
● The third quartile (Q3, or the upper quartile) is the 75th percentile, meaning that 75% of the data falls below the third quartile.
By splitting the data at the 25th, 50th, and 75th percentiles, the quartiles divide the data into four equal parts.
● In a sample or dataset, the quartiles divide the data into four groups with equal numbers of observations.
● In a probability distribution, the quartiles divide the distribution’s range into four intervals with equal probability.
How to find quartiles
To find the quartiles of a dataset or sample, follow the step-by-step guide below.
1. Count the number of observations in the dataset (n).
2. Sort the observations from smallest to largest.
3. Find the first quartile:
○ Calculate n * (1 / 4).
○ If n * (1 / 4) is an integer, then the first quartile is the mean of the numbers at positions n * (1 / 4) and n * (1 / 4) + 1.
○ If n * (1 / 4) is not an integer, then round it up. The number at this position is the first quartile.
Tip: An integer is a whole number: it can be written without any digits after the decimal point.
4. Find the second quartile:
○ Calculate n * (2 / 4).
○ If n * (2 / 4) is an integer, the second quartile is the mean of the numbers at positions n * (2 / 4) and n * (2 / 4) + 1.
○ If n * (2 / 4) is not an integer, then round it up. The number at this position is the second quartile.
5. Find the third quartile:
○ Calculate n * (3 / 4).
○ If n * (3 / 4) is an integer, then the third quartile is the mean of the numbers at positions n * (3 / 4) and n * (3 / 4) + 1.
○ If n * (3 / 4) is not an integer, then round it up. The number at this position is the third quartile.
There are multiple methods to calculate the first and third quartiles, and they don’t always give the same answers. There’s no universal agreement on the best way to calculate quartiles.
Step-by-step example
Imagine you conducted a small study on language development in children 1–6 years old. You’re writing a paper about the study and you want to report the quartiles of the children’s ages.
Age (years) | 1 | 2 | 3 | 4 | 5 | 6
Frequency | 2 | 3 | 4 | 1 | 2 | 2
Step 1: Count the number of observations in the dataset
n = 2 + 3 + 4 + 1 + 2 + 2 = 14
Step 2: Sort the observations in increasing order
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Step 3: Find the first quartile
n * (1 / 4) = 14 * (1 / 4) = 3.5
3.5 is not an integer, so Q1 is the number at position 4.
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Q1 = 2 years
Step 4: Find the second quartile
n * (2 / 4) = 14 * (2 / 4) = 7
7 is an integer, so Q2 is the mean of the numbers at positions 7 and 8.
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Q2 = (3 + 3) / 2
Q2 = 3 years
Step 5: Find the third quartile
n * (3 / 4) = 14 * (3 / 4) = 10.5
10.5 is not an integer, so Q3 is the number at position 11.
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Q3 = 5 years
Interpreting quartiles
Quartiles can give you useful information about an observation or a dataset.
Comparing observations
Quartiles are helpful for understanding an observation in the context of the rest of a sample or population. By comparing the observation to the quartiles, you can determine whether the observation is in the bottom 25%, middle 50%, or top 25%.
Median
The second quartile, better known as the median, is a measure of central tendency. This middle number is a good measure of the average or most central value of the data, especially for skewed distributions or distributions with outliers.
Interquartile range
The distance between the first and third quartiles (the interquartile range, or IQR) is a measure of variability. It indicates the spread of the middle 50% of the data.
IQR = Q3 − Q1
The IQR is an especially good measure of variability for skewed distributions or distributions with outliers. IQR only includes the middle 50% of the data, so, unlike the range, the IQR isn’t affected by extreme values.
Skewness
The distance between quartiles can give you a hint about whether a distribution is skewed or symmetrical. It’s easiest to use a boxplot to look at the distances between quartiles.
What is an Upper Quartile?
The upper quartile (sometimes called Q3) is the number dividing the third and fourth quarters of the data. The upper quartile can also be thought of as the median of the upper half of the numbers. The upper quartile is also called the 75th percentile; it splits the lowest 75% of data from the highest 25%.
[Figure: A set of numbers (−3, −2, −1, 0, 1, 2, 3) divided into four quarters.]
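The step-by-step quartile method described earlier (multiply n by k/4, then either average two neighbouring values or round the position up) can be sketched in Python as follows. The function name quantile and its parameters k and q are hypothetical names for this illustration; this sketch follows that particular method only, and other quartile methods can give different answers.

```python
def quantile(values, k, q=4):
    """Return the k-th q-quantile using the n * (k / q) position rule.

    If n * k / q is an integer, average the values at that position
    and the next one; otherwise round the position up.
    Positions are 1-based, as in the step-by-step guide.
    """
    if not 0 < k < q:
        raise ValueError("k must satisfy 0 < k < q")
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("values must not be empty")
    product = n * k
    if product % q == 0:
        position = product // q
        # Mean of the numbers at positions `position` and `position + 1`.
        return (data[position - 1] + data[position]) / 2
    position = -(-product // q)  # integer ceiling of n * k / q
    return data[position - 1]


# Ages of the 14 children from the step-by-step example.
ages = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6]
q1, q2, q3 = (quantile(ages, k) for k in (1, 2, 3))
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Passing a different q (for example q=5 or q=10) applies the same rule to other quantiles, as described in the section on how to find quantiles.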
Calculating the Upper Quartile
You can find the upper quartile by placing a set of numbers in order and working out Q3 by hand, or you can use the upper quartile formula. If you have a small set of numbers (under about 20), by hand is usually the easiest option. However, the formula works for all sets of numbers, from very small to very large. You may also want to use the formula if you are uncomfortable with finding the median for sets of data with an odd or even number of values.
Example question: Find the upper quartile for the following set of numbers: 27, 19, 5, 7, 6, 9, 15, 12, 18, 2, 1.
By Hand
Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Find the median: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27. The median is 9, the 6th of the 11 numbers.
Step 3: Place parentheses around the numbers above the median: 1, 2, 5, 6, 7, 9, (12, 15, 18, 19, 27).
Step 4: Find the median of the upper set of numbers. This is the upper quartile: 1, 2, 5, 6, 7, 9, (12, 15, 18, 19, 27). The median of the upper set is 18, so the upper quartile is 18.
Using the Formula
The upper quartile formula is: Q3 = ¾(n + 1)th term. The formula doesn’t give you the value for the upper quartile; it gives you the place, for example the 5th place or the 76th place.
Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Work the formula. There are 11 numbers in the set, so:
Q3 = ¾(n + 1)th term
Q3 = ¾(11 + 1)th term
Q3 = ¾(12)th term
Q3 = 9th term
In this set of numbers (1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27), the upper quartile (18) is the 9th term, or the 9th place from the left.
Difference between a quarter and a quartile
There’s a slight difference between a quarter and a quartile. A quarter is the whole slice of pizza, but a quartile is the mark the pizza cutter makes at the end of the slice: it marks the end of one quarter and the beginning of the next.
What Is a Decile?
Deciles break up a set of data into tenths. They are similar to quartiles.
But while quartiles sort data into four quarters, deciles sort data into ten equal parts: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles. A decile rank assigns a number to each tenth:
Decile Rank | Percentile
1 | 10th
2 | 20th
3 | 30th
4 | 40th
5 | 50th
6 | 60th
7 | 70th
8 | 80th
9 | 90th
The higher your place in the above rankings, the higher your overall ranking. For example, if you were in the 99th percentile for a particular test, that would put you in a ranking of 10. However, if you scored very low (say, the 5th percentile), then you would have a rank of 1.
[Figure: A chart showing decile rankings for discharged stroke patients. Image: SUNY Buffalo]
Why are decile ranks used instead of percentiles or quartiles?
Basically, ranks are just another way to categorize data, and which system you use is usually a judgment call. For example, if you wanted to display class rankings on a pie chart, using deciles would make more sense than percentiles. That’s because a pie chart with 10 categories would be much easier to read than a pie chart with 99 categories.
What is a Decile used for in Real Life?
Deciles are used significantly more often in real life than in the classroom. For example, Australia [1] uses decile ranks to report drought data. Ranks of 1-2 represent the lowest 20% (“much below normal”). That means conditions that are “much below normal” don’t occur more than 20% of the time. They are also commonly used for college admissions and high school rankings. For example, a chart from Roanoke College shows the high school rankings for the student body.
A decile is a quantitative method of splitting up a set of ranked data into 10 equally large subsections. This type of data ranking is performed as part of many academic and statistical studies in the finance and economics fields. The data may be ranked from largest to smallest values, or vice versa.
A decile, which has 10 categorical buckets, may be contrasted with percentiles that have 100, quartiles that have four, or quintiles that have five.
Understanding a Decile
In descriptive statistics, a decile is used to categorize large data sets from the highest to lowest values, or vice versa. Like the quartile and the percentile, a decile is a form of a quantile that divides a set of observations into samples that are easier to analyze and measure. While quartiles are three data points that divide a set of observations into four equal groups or quarters, deciles are nine data points that divide a data set into 10 equal parts. When an analyst or statistician ranks data and then splits them into deciles, they do so in an attempt to discover the largest and smallest values by a given metric. For example, by splitting the entire S&P 500 Index into deciles (50 firms in each decile) using the P/E multiple, the analyst will discover the companies with the highest and lowest P/E valuations in the index.
A decile is usually used to assign decile ranks to a data set. A decile rank arranges the data in order from lowest to highest and is done on a scale of one to 10 where each successive number corresponds to an increase of 10 percentage points. In other words, there are nine decile points. The 1st decile, or D1, is the point that has 10% of the observations below it, D2 has 20% of the observations below it, D3 has 30% of the observations falling below it, and so on.
How to Calculate a Decile
There is no one way of calculating a decile; however, it is important that you are consistent with whatever formula you decide to use. One simple calculation of a decile is:
Dk = value of the [k(n + 1) / 10]th data point, where k = 1, 2, …, 9 and n is the number of observations
From this formula, it follows that the 5th decile is the median, since 5(n + 1) / 10 is the data point that represents the halfway point of the distribution.
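A minimal Python sketch of the k(n + 1) / 10 position rule follows. The function name decile is a hypothetical name for this illustration, and the linear interpolation between neighbouring values when the position is fractional is an assumption of this sketch, as is the clamping to the smallest or largest value when the position falls outside the data.

```python
def decile(values, k):
    """Return the k-th decile using the k * (n + 1) / 10 position rule.

    Positions are 1-based. A fractional position is resolved by linear
    interpolation between the two neighbouring sorted values.
    """
    if not 1 <= k <= 9:
        raise ValueError("k must be between 1 and 9")
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("values must not be empty")
    position = k * (n + 1) / 10
    lower = int(position)          # whole part of the position
    fraction = position - lower    # how far to move toward the next value
    if lower < 1:
        return data[0]             # position falls before the first value
    if lower >= n:
        return data[-1]            # position falls at or after the last value
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # illustrative data, n = 9
for k in (1, 5, 9):
    print(f"D{k} =", decile(sample, k))
```

With nine values, n + 1 = 10, so the k-th decile sits exactly on the k-th value and the 5th decile coincides with the median.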
Example of a Decile
The table below shows the ungrouped scores (out of 100) for 30 exam takers:
48 52 55 57 58 60 61 64 65 66
69 72 73 75 76 78 81 82 84 87
88 90 91 92 93 94 95 96 97 99
Using the information presented in the table, the 1st decile can be calculated as:
● D1 = Value of the [(30 + 1) / 10]th data point
● D1 = Value of the 3.1st position, which is 0.1 of the way between scores 55 and 57
● Thus, D1 = 55 + 2 (0.1) = 55.2
● 10% of the data set falls below 55.2.
Let’s calculate the 3rd decile:
● D3 = Value of the [3 (30 + 1) / 10]th data point
● D3 = Value of the 9.3rd position, which is 0.3 of the way between the scores of 65 and 66
● Thus, D3 = 65 + 1 (0.3) = 65.3
● 30% of the 30 scores fall below 65.3.
What would we get if we were to calculate the 5th decile?
● D5 = Value of the [5 (30 + 1) / 10]th data point
● D5 = Value of the 15.5th position, halfway between scores 76 and 78
● Thus, D5 = 76 + 2 (0.5) = 77
● 50% of the scores fall below 77.
Also, notice how the 5th decile is also the median of the observations. Looking at the data set in the table, the median, which is the middle data point of any given set of numbers, can be calculated as (76 + 78) / 2 = 77 = median = D5. Half of the scores lie above this point and half lie below it.
Percentile
“Percentile” is in everyday use, but there is no universal definition for it. The most common definition of a percentile is a number where a certain percentage of scores fall below that number. You might know that you scored 67 out of 90 on a test. But that figure has no real meaning unless you know what percentile you fall into. If you know that your score is in the 90th percentile, that means you scored better than 90% of people who took the test.
Percentiles are commonly used to report scores in tests, like the SAT, GRE and LSAT. For example, the 70th percentile on the 2013 GRE was 156. That means if you scored 156 on the exam, your score was better than 70 percent of test takers.
The 25th percentile is also called the first quartile.
The 50th percentile is generally the median (if you’re using the third definition, see below). The 75th percentile is also called the third quartile. The difference between the third and first quartiles is the interquartile range.
Percentile Rank
The word “percentile” is used informally in the above definition. In common use, the percentile usually indicates that a certain percentage falls below that percentile. For example, if you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank. In statistics, it can get a little more complicated as there are actually three definitions of “percentile.” Here are the first two (see below for definition 3), based on an arbitrary “25th percentile”:
Definition 1: The nth percentile is the lowest score that is greater than a certain percentage (“n”) of the scores. In this example, our n is 25, so we’re looking for the lowest score that is greater than 25% of the scores.
Definition 2: The nth percentile is the smallest score that is greater than or equal to a certain percentage of the scores. To rephrase this, it’s the percentage of data that falls at or below a certain observation. This is the definition used in AP statistics. In this example, the 25th percentile is the score that’s greater than or equal to 25% of the scores.
They may seem very similar, but they can lead to big differences in results, although they are both the 25th percentile rank. Take the following list of test scores, ordered by rank:
Score | Rank
30 | 1
33 | 2
43 | 3
53 | 4
56 | 5
67 | 6
68 | 7
72 | 8
How to Find a Percentile
Example question: Find out where the 25th percentile is in the above list.
Step 1: Calculate what rank is at the 25th percentile. Use the following formula:
Rank = Percentile / 100 * (number of items + 1)
Rank = 25 / 100 * (8 + 1) = 0.25 * 9 = 2.25.
A rank of 2.25 is at the 25th percentile. However, there isn’t a rank of 2.25 (ever heard of a high school rank of 2.25?
I haven’t!), so you must either round up, or round down. As 2.25 is closer to 2 than 3, I’m going to round down to a rank of 2.
Step 2: Choose either definition 1 or 2:
Definition 1: The lowest score that is greater than 25% of the scores. That equals a score of 43 on this list (a rank of 3).
Definition 2: The smallest score that is greater than or equal to 25% of the scores. That equals a score of 33 on this list (a rank of 2).
Depending on which definition you use, the 25th percentile could be reported at 33 or 43! A third definition attempts to correct this possible misinterpretation:
Definition 3: A weighted mean of the percentiles from the first two definitions.
In the above example, here’s how the percentile would be worked out using the weighted mean:
1. Multiply the difference between the scores by 0.25 (the fraction of the rank we calculated above). The scores were 43 and 33, giving us a difference of 10: (0.25)(43 – 33) = 2.5
2. Add the result to the lower score. 2.5 + 33 = 35.5
In this case, the 25th percentile score is 35.5, which makes more sense as it lies between 33 and 43.
In most cases, “percentile” refers to definition 1. However, it would be wise to double check that any statistics about percentiles are created using that first definition.
How to Calculate Percentile?
You can calculate percentiles in statistics using the following formula:
For example: Imagine you have the marks of 20 students. Now, try to calculate the 90th percentile.
Step 1: Arrange the scores in ascending order.
Step 2: Plug the values in the formula to find n.
P90 = 94 means that 90% of students got less than 94 and 10% of students got more than 94.
Percentile Range
A percentile range is the difference between two specified percentiles. These could theoretically be any two percentiles, but the 10-90 percentile range is the most common. To find the 10-90 percentile range:
1. Calculate the 10th percentile using the above steps.
2.
Calculate the 90th percentile using the above steps.
3. Subtract Step 1 (the 10th percentile) from Step 2 (the 90th percentile).

What is the Midrange?
The midrange is a type of average: the value halfway between the lowest and the highest value in a data set. For example, “midrange” electronic gadgets are in the middle price bracket: not cheap, but not expensive, either. The formula to find the midrange is (high + low) / 2.
Example problem: Current cell phone prices in a mobile phone store range from $40 (the cheapest) to $550 (the most expensive). Find the midrange.
● Step 1: Add the lowest value to the highest: $550 + $40 = $590.
● Step 2: Divide Step 1 by two: $590 / 2 = $295.
The mid-priced phones would be priced at around $295.

Difference Between a Midrange and a Range
The range is a measure of spread. In the cell phone example, the range would be: $550 – $40 = $510. The range can also mean the entire spread of numbers; for example, it could be written as $40 to $550. The midrange, by contrast, adds the two extremes and divides the sum by two to find a type of average.

Difference Between a Midrange and the Interquartile Range
Don’t confuse the midrange with the interquartile range (IQR), sometimes called the “middle fifty”. They mean very different things. The midrange is a type of average, while the interquartile range describes the spread of the middle half of a data set. For example, when the weather service reports that a “mean daily temperature” is 72.5 degrees, it is talking about the midrange. It gets that number by taking the sum of the high daily temperature and the low daily temperature and dividing by 2. Let’s say the recorded daily temperatures were: 55, 65, 67, 69, 70, 80, 81, 87, 90
High = 90
Low = 55
Mid = (90 + 55) / 2 = 145 / 2 = 72.5.
The IQR for this data set is the 25th percentile subtracted from the 75th percentile:
25th Percentile: 66
75th Percentile: 84
Interquartile Range: 84 – 66 = 18

What is the Mode?
The mode, or modal value, is the most common number in a data set.
It’s useful in statistics because it can tell you what the most popular item in your set is. For example, you might have results from a customer survey where your company is rated from 1 to 5. If the most popular answer is 2, then you know you need to make some improvements in customer service! Other popular measures of central tendency include the mean, or the average of a set, and the median, the middle value in a set. A data set can have no mode, one, or many:
● None: 1, 2, 3, 4, 6, 8, 9.
● One mode: unimodal: 1, 2, 3, 3, 4, 5.
● Two: bimodal: 1, 1, 2, 3, 4, 4, 5.
● Three: trimodal: 1, 1, 2, 3, 3, 4, 5, 5.
● More than one (two, three or more) = multimodal.

How to find the mode by hand
The mode in statistics is the most common number in a data set. For example, in this set it’s 2, because it is the number that occurs most often: 1, 2, 2, 5, 6. Data sets in statistics tend to be much larger, so the solution is easier to spot if you put the numbers in order.
Sample question: Find the mode for the following data set: 56, 57, 56, 58, 59, 90, 98, 98, 65, 45, 34, 34, 23, 23, 24, 33, 56, 67, 78, 87, 87, 56.
Step 1: Put the numbers in order: 23 23 24 33 34 34 45 56 56 56 56 57 58 59 65 67 78 87 87 90 98 98
Step 2: Count how many times each number appears. This may be easier if you group equal numbers together, like this:
23 23 / 24 / 33 / 34 34 / 45 / 56 56 56 56 / 57 / 58 / 59 / 65 / 67 / 78 / 87 87 / 90 / 98 98
The most common number is 56 in this data set (it appears 4 times).
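The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `modes` is hypothetical, and returning an empty list when no value repeats is an assumption that follows the "no mode" convention used in these notes.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] if no value repeats."""
    counts = Counter(values)          # value -> number of occurrences
    if not counts:
        return []                     # empty data set: nothing to report
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value appears once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


scores = [56, 57, 56, 58, 59, 90, 98, 98, 65, 45, 34, 34, 23, 23,
          24, 33, 56, 67, 78, 87, 87, 56]

print(modes(scores))                   # [56]
print(modes([1, 1, 2, 3, 4, 4, 5]))    # [1, 4]  (bimodal)
print(modes([1, 2, 3, 4, 6, 8, 9]))    # []      (no mode)
```

Returning a list rather than a single number keeps the unimodal, bimodal and multimodal cases in one code path.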
Examples of the Mode
In the following list of numbers, 16 is the mode since it appears more times in the set than any other number:
● 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48
A set of numbers can have more than one mode if several numbers occur with equal frequency, and more often than the others in the set (with two modes, the set is called bimodal):
● 3, 3, 3, 9, 16, 16, 16, 27, 37, 48
In the above example, both 3 and 16 are modes, as each occurs three times and no other number occurs more often. If no number in a set of numbers occurs more than once, that set has no mode:
● 3, 6, 9, 16, 27, 37, 48
A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal, and any set of numbers with more than one mode is multimodal.

What is the Median?
The median, in statistics, is the middle value of a list of data when the data are arranged in order. The observations can be arranged in either ascending or descending order. Example: The median of 2, 3, 4 is 3. In maths, the median is also a type of average, which is used to find the centre value. Therefore, it is also called a measure of central tendency. Apart from the median, the other two measures of central tendency are the mean and the mode. The mean is the ratio of the sum of all observations to the total number of observations. The mode is the value that occurs most often in the data set. In geometry, the word has a different meaning: a median of a triangle is the line segment joining a vertex to the midpoint of the opposite side. Therefore, a median bisects a side of the triangle.

Median in Statistics
The median of a set of data is the middlemost number or centre value in the set. The median is also the number that is halfway into the set. To find the median, the data should first be arranged in order, from least to greatest or from greatest to least.
The median is the number that separates the higher half of a data sample, a population or a probability distribution from the lower half. The median is different for different types of distribution. For example, the median of 3, 3, 5, 9, 11 is 5. If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values, so the median of 3, 5, 7, 9 is (5 + 7) / 2 = 6. The median tells you where the middle of a data set is. It’s used in many real-life situations, like bankruptcy law, where you can only claim bankruptcy if you are below the median income in your state. The median formula is {(n + 1) ÷ 2}th, where “n” is the number of items in the set and “th” means that the result is a position in the ordered list (for example, the 4th number), not the median itself. To find the median, first order the numbers from smallest to largest. Then find the middle number. For example, the middle for this set of numbers is 5, because 5 is right in the middle: 1, 2, 3, 5, 6, 7, 9. You get the same result with the formula. There are 7 numbers in the set, so n = 7:
● {(7 + 1) ÷ 2}th
● = {(8) ÷ 2}th
● = {4}th
The 4th number in 1, 2, 3, 5, 6, 7, 9 is 5. A caution with using the median formula: the steps differ slightly depending on whether you have an even or odd number of items in your data set.

Find the median for an odd set of numbers
Example question: Find the median for the following data set: 102, 56, 34, 99, 89, 101, 10.
Step 1: Sort your data from the smallest number to the highest number. For this example data set, the order is: 10, 34, 56, 89, 99, 101, 102.
Step 2: Find the number in the middle (where there are an equal number of data points above and below the number): 10, 34, 56, 89, 99, 101, 102. The median is 89.
Tip: If you have a large data set, divide the number of items in the set by 2. That tells you how many numbers should be above and how many numbers should be below. For example, 101 / 2 = 50.5. Ignore the decimal; 50 numbers should be above and 50 below.
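The sort-then-pick-the-middle procedure can be written as a short Python sketch. The function name `median` is chosen here for illustration only. Note that the {(n + 1) ÷ 2}th position in the text is counted from 1, while Python lists are indexed from 0, so the middle item of an odd-sized list sits at index n // 2.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2                       # 0-based index of the (upper) middle item
    if n % 2 == 1:
        return ordered[mid]            # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair


print(median([102, 56, 34, 99, 89, 101, 10]))  # 89
print(median([3, 5, 7, 9]))                    # 6.0
```

Python's standard library offers the same calculation as `statistics.median`; the hand-written version is shown only to mirror the steps in the text.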
Find the median for an even set of numbers
Example question: Find the median for the following data set: 102, 56, 34, 99, 89, 101, 10, 54.
Step 1: Place the data in ascending order (smallest to highest): 10, 34, 54, 56, 89, 99, 101, 102.
Step 2: Find the TWO numbers in the middle (where there are an equal number of data points above and below the two middle numbers): 10, 34, 54, 56, 89, 99, 101, 102. The two middle numbers are 56 and 89.
Step 3: Add the two middle numbers and then divide by two, to get the average:
● 56 + 89 = 145
● 145 / 2 = 72.5.
The median is 72.5.
Tip: For large data sets, divide the number of items by 2, then subtract 1 to find how many numbers should be above and how many should be below the middle pair. For example, 100 / 2 = 50. 50 – 1 = 49. The middle two numbers will have 49 items above and 49 below. That’s it!

Average vs. Median
The median is very useful for describing things like salaries, where large figures can throw off the mean. The median salary in the U.S. as of 2012 was $51,017. If an average was used, those American billionaires could skew that figure upwards. Let’s say you wanted to work for a small law firm that paid an average salary of over $73,000 to its 11 employees. You might think there’s a good chance you’ll land a great paying job. But take a closer look at how the average is calculated for those eleven employees:
Employee    Salary
Samuel      $28,000
Candice     $17,400
Thomas      $22,000
Ted         $300,000
Carly       $300,000
Shawanna    $20,500
Chan        $18,500
Janine      $27,000
Barbara     $21,000
Anna        $29,000
Jim         $20,000
Average (Mean) = ($28,000 + $17,400 + $22,000 + $300,000 + $300,000 + $20,500 + $18,500 + $27,000 + $21,000 + $29,000 + $20,000) / 11 = $803,400 / 11 ≈ $73,036
The two partners in the firm, Ted and Carly, have increased the average way beyond most of the salaries paid in the firm. See how the “average” can be misleading? A better way to describe income is to figure out the median, or the middle wage.
If you took that same list of incomes and found the median, you would get a more realistic representation of income. The median is the middle number, so if you placed all of the incomes in a list (from smallest to largest) you would get: $17,400, $18,500, $20,000, $20,500, $21,000, $22,000, $27,000, $28,000, $29,000, $300,000, $300,000. The middle (sixth) value is $22,000. It’s a more accurate representation of what people are actually being paid.

Calculation for a Grouped Frequency Distribution
An easy way to ballpark the median (MD) for a grouped frequency distribution is to use the midpoint of the interval. If you need something more precise, use the formula: MD = lower value + (B ÷ D) x C.
Step 1: Use (n + 1) / 2 to find out which interval has the MD. For example, if you have 11 intervals, then the MD is in the sixth interval: (11 + 1) / 2 = 12 / 2 = 6. This interval is called the MD group. This shortcut only works when the observations are spread evenly across the intervals; in general, the MD group is the interval in which the cumulative percentage first reaches 50%.
Step 2: Calculate “A”: the cumulative percentage for the interval immediately before the median group.
Step 3: Calculate “B”: subtract your Step 2 value from 50%. For example, if the cumulative percentage is 45%, then B is 50% – 45% = 5%.
Step 4: Find “C”: the width of the median interval (how many numbers it spans).
Step 5: Find “D”: the percentage for the median interval.
Step 6: Find the median: Median = lower value + (B ÷ D) x C, where the lower value is the lower limit of the median interval.
That’s it!

Median Formula
The formula to calculate the median of a finite data set is given here. The median formula is different for even and odd numbers of observations. Therefore, it is necessary to first recognise whether the data set has an odd or an even number of values. The formula to calculate the median of the data set is given as follows.
Odd Number of Observations
If the total number of observations is odd, then the formula to calculate the median is:
Median = {(n + 1) ÷ 2}th term
where n is the number of observations

Even Number of Observations
If the total number of observations is even, then the median formula is:
Median = [(n ÷ 2)th term + {(n ÷ 2) + 1}th term] ÷ 2
where n is the number of observations

How to Calculate the Median?
To find the median, place all the numbers in ascending order and find the middle.
Example 1: Find the median of 14, 63 and 55.
Solution: Put them in ascending order: 14, 55, 63. The middle number is 55, so the median is 55.
Example 2: Find the median of the following: 4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution: When we put those numbers in order, we have: 4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92. There are fifteen numbers, so the middle is the eighth number. The median value of this set of numbers is 24.
Example 3: Rahul’s family drove through 7 states on summer vacation. The price of gasoline differs from state to state. Calculate the median gasoline cost. 1.79, 1.61, 2.09, 1.84, 1.96, 2.11, 1.75
Solution: By organizing the data from smallest to greatest, we get: 1.61, 1.75, 1.79, 1.84, 1.96, 2.09, 2.11. Hence, the median gasoline cost is 1.84. There are three states with higher gasoline costs and three with lower ones.

WEEK 8
What is Dispersion?
Dispersion in statistics is a way of describing how spread out a set of data is. When a data set has a large dispersion, the values in the set are widely scattered; when it is small, the items in the set are tightly clustered. Very basically, this set of data has a small dispersion: 1, 2, 2, 3, 3, 4 …and this set has a wider one: 0, 1, 20, 30, 40, 100
The spread of a data set can be described by a range of descriptive statistics including variance, standard deviation, and interquartile range. Spread can also be shown in graphs: dot plots, boxplots, and stem and leaf plots span a greater distance for samples that have a larger dispersion, and vice versa.
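To make the contrast between the two small data sets concrete, the sketch below computes three spread statistics for each with Python's standard `statistics` module. The helper name `spread_summary` and the choice of the population (divide-by-N) formulas are assumptions made for this illustration; a sample formula would give slightly larger values.

```python
import statistics


def spread_summary(data):
    """Return (range, population variance, population standard deviation)."""
    return (max(data) - min(data),
            statistics.pvariance(data),
            statistics.pstdev(data))


tight = [1, 2, 2, 3, 3, 4]          # tightly clustered set from the text
wide = [0, 1, 20, 30, 40, 100]      # widely scattered set from the text

for name, data in (("tight", tight), ("wide", wide)):
    value_range, variance, std_dev = spread_summary(data)
    print(f"{name}: range={value_range}, variance={variance:.2f}, sd={std_dev:.2f}")
```

Every statistic comes out larger for the second set, which is what "more dispersion" means numerically.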
[Figure: boxplots. The larger the box, the more dispersion in a set of data. Image: Seton Hall University]

Measures of Dispersion
● Coefficient of dispersion: A “catch-all” term for a variety of formulas, including distance between quartiles.
● Standard deviation: probably the most common measure. It tells you how spread out numbers are from the mean.
● Index of dispersion: a measure of dispersion commonly used with nominal variables.
● Interquartile range (IQR): describes where the bulk of the data lies (the “middle fifty” percent).
● Interdecile range: the difference between the first decile (10%) and the last decile (90%).
● Range: the difference between the smallest and largest number in a set of data.
● Mean difference or difference in means: measures the absolute difference between the mean value in two different groups in clinical trials.
● Median absolute deviation (MAD): the median of the absolute deviations from a data set’s median.
● Quartiles: Numbers that split the data into four quarters (first, second, third, and fourth quartiles).
In some processes, like manufacturing or measurement, low dispersion is associated with high precision. High dispersion is associated with low precision.

Measures of Dispersion: Example
Let’s say you were asked to compare measures of dispersion for two data sets. Data set A has the items 97, 98, 99, 100, 101, 102, 103 and data set B has items 70, 80, 90, 100, 110, 120, 130. By looking at the data sets you can probably tell that the means and medians, which are technically called “measures of central tendency” in statistics, are the same (100). However, the range (which gives you an idea of how spread out the entire set of data is) is much larger for data set B (60) when compared to data set A (6). In fact, nearly all measures of dispersion would be ten times greater for data set B, which makes sense as the range is ten times larger. For example, take a look at the standard deviations for the two data sets:
Standard deviation for A: 2.160246899469287.
Standard deviation for B: 21.602468994692867.
The figure for data set B is exactly ten times that of A.
Warning: When using a calculator (or a formula), check to make sure you are using the correct setting (or formula) for your data. Many measures of dispersion (like the variance) have two different formulas, one for a population and one for a sample.

Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how homogeneous or heterogeneous the data is. In simple terms, they show how squeezed or scattered the variable is. Measures of spread (also called measures of dispersion) tell you something about how wide the set of data is. There are several basic measures of spread used in statistics. The most common are:
1. The range (including the interquartile range and the interdecile range),
2. The standard deviation,
3. The variance,
4. Quartiles.

1. The Range
The range is a basic statistic that tells you how far apart the smallest and largest values are. For example, if your minimum value is $10 and the maximum value is $100 then the range is $90 ($100 – $10). A similar statistic is the interquartile range, which tells you the range in the middle fifty percent of a set of data; in other words, it’s where the bulk of data tends to lie. Another, less common measure is the semi interquartile range, which is one half of the interquartile range.

2. Standard Deviation
Simply put, the standard deviation is a measure of how spread out data is around the center of the distribution (the mean). It also gives you an idea of where, percentage-wise, a certain value falls. For example, let’s say you took a test and the scores were normally distributed (shaped like a bell). You score one standard deviation above the mean. That tells you that you scored higher than about 84% of test takers, which puts you in the top 16%.

3.
The Variance
The variance is a very simple statistic that gives you an extremely rough idea of how spread out a data set is. As a measure of spread, it’s actually pretty weak, because it is expressed in squared units. A large variance of 22,000, for example, doesn’t tell you much about the spread of data, other than it’s big! The most important reason the variance exists is to give you a way to find the standard deviation: the standard deviation is the square root of the variance.

4. Quartiles
Quartiles divide your data set into quarters according to where the numbers fall on the number line. Like the variance, the quartile isn’t very useful on its own. Instead, it’s used to find more useful values like the interquartile range.

Types of Measures of Dispersion
There are two main types of dispersion methods in statistics:
● Absolute Measure of Dispersion
● Relative Measure of Dispersion

Absolute Measure of Dispersion
An absolute measure of dispersion has the same unit as the original data set. The absolute dispersion method expresses the variations in terms of the average of deviations of observations, like the standard deviation or mean deviation. It includes the range, standard deviation, quartile deviation, etc. The types of absolute measures of dispersion are:
1. Range: It is simply the difference between the maximum value and the minimum value in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 – 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the squares, and finally divide the sum by the total number of values in the data set. Variance (σ²) = ∑(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D. = σ = √(σ²).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into quarters. The quartile deviation is half of the distance between the third and the first quartile.
5.
Mean and Mean Deviation: The average of the numbers is known as the mean, and the arithmetic mean of the absolute deviations of the observations from a measure of central tendency is known as the mean deviation (also called the mean absolute deviation).

Relative Measure of Dispersion
The relative measures of dispersion are used to compare the distribution of two or more data sets. These measures compare values without units. Common relative dispersion methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation

Co-efficient of Dispersion
The coefficients of dispersion are calculated (along with the measure of dispersion) when two series that differ widely in their averages are compared. The dispersion coefficient is also used when two series with different measurement units are compared. It is denoted as C.D. The common coefficients of dispersion are:
C.D. in terms of Range: C.D. = (Xmax – Xmin) ⁄ (Xmax + Xmin)
C.D. in terms of Quartile Deviation: C.D. = (Q3 – Q1) ⁄ (Q3 + Q1)
C.D. in terms of Standard Deviation (S.D.): C.D. = S.D. ⁄ Mean
C.D. in terms of Mean Deviation: C.D. = Mean deviation ⁄ Average

Solved Examples
Example 1: Find the variance and standard deviation of the following numbers: 1, 3, 5, 5, 6, 7, 9, 10.
Solution: The mean = (1 + 3 + 5 + 5 + 6 + 7 + 9 + 10) / 8 = 46 / 8 = 5.75
Step 1: Subtract the mean from each individual value: (1 – 5.75), (3 – 5.75), (5 – 5.75), (5 – 5.75), (6 – 5.75), (7 – 5.75), (9 – 5.75), (10 – 5.75) = -4.75, -2.75, -0.75, -0.75, 0.25, 1.25, 3.25, 4.25
Step 2: Squaring the above values we get 22.5625, 7.5625, 0.5625, 0.5625, 0.0625, 1.5625, 10.5625, 18.0625
Step 3: 22.5625 + 7.5625 + 0.5625 + 0.5625 + 0.0625 + 1.5625 + 10.5625 + 18.0625 = 61.5
Step 4: n = 8, therefore variance (σ²) = 61.5 / 8 = 7.6875 ≈ 7.69
Now, standard deviation (σ) = √7.6875 ≈ 2.77
Example 2: Calculate the range and coefficient of range for the following data values.
45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Solution: Let the Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Here, Maximum value (Xmax) = 84
Minimum or least value (Xmin) = 45
Range = Maximum value – Minimum value = 84 – 45 = 39
Coefficient of range = (Xmax – Xmin) / (Xmax + Xmin) = (84 – 45) / (84 + 45) = 39 / 129 = 0.302 (approx)

Practice Problems
1. Find the coefficient of standard deviation for the data set: 32, 35, 37, 30, 33, 36, 35 and 37
2. The mean and variance of seven observations are 8 and 16, respectively. If five of these are 2, 4, 10, 12 and 14, find the remaining two observations.
3. In a town, 25% of the persons earned more than Rs 45,000 whereas 75% earned more than Rs 18,000. Compute the absolute and relative values of dispersion.

The standard deviation formula is used to find how widely the values of a data set are dispersed. In simple words, the standard deviation measures how far the values typically deviate from their mean. A lower standard deviation indicates that the values are very close to their average, whereas a higher standard deviation means the values are far from the mean. It should be noted that the standard deviation can never be negative. Standard deviation is of two types:
1. Population Standard Deviation
2. Sample Standard Deviation

Formulas for Standard Deviation
Population Standard Deviation Formula: σ = √( ∑(xi − μ)² / N )
Sample Standard Deviation Formula: s = √( ∑(xi − x̄)² / (n − 1) )
Notations for Standard Deviation
● σ = Population standard deviation
● s = Sample standard deviation
● xi = Terms given in the data
● μ = Population mean
● x̄ = Sample mean
● N = Total number of terms in the population
● n = Total number of terms in the sample

Standard Deviation Formula Based on Discrete Frequency Distribution
For a discrete frequency distribution of the type:
x: x1, x2, x3, … xn and
f: f1, f2, f3, … fn
the formula for standard deviation becomes:
σ = √( (1/N) ∑ fi (xi − x̄)² )
Here, N is given as: N = ∑ fi (summing over i = 1 to n)

Standard Deviation Formula for Grouped Data
There is another standard deviation formula which is derived from the variance.
This formula is given as:
σ = (1/N) √( N ∑ fi xi² − (∑ fi xi)² )

Example Question based on Standard Deviation Formula
Question: During a survey, 6 students were asked how many hours per day they study on average. Their answers were as follows: 2, 6, 5, 3, 2, 3. Evaluate the standard deviation.
Solution:
Step 1: Find the mean of the data: (2 + 6 + 5 + 3 + 2 + 3) / 6 = 21 / 6 = 3.5
Step 2: Construct the table:
xi    xi − x̄    (xi − x̄)²
2     -1.5      2.25
6     2.5       6.25
5     1.5       2.25
3     -0.5      0.25
2     -1.5      2.25
3     -0.5      0.25
Sum of (xi − x̄)² = 13.5
Step 3: Now, use the standard deviation formula:
Sample Standard Deviation = √(13.5 / [6 − 1]) = √2.7 = 1.643

In probability theory and statistics, the variance measures how far a set of numbers is spread out. It is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary considerably from the group mean, the variance is big, and vice versa. A variance of zero indicates that all the values are identical. Variance is always non-negative: a small variance indicates that the data points tend to be very close to the mean and hence to each other, while a high variance indicates that the data points are very spread out around the mean and from each other.

Variance Formulas
Variance can be calculated for either grouped or ungrouped data. To recall, a variance can be of two types:
● Variance of a population
● Variance of a sample
The variance of a population is denoted by σ² and the variance of a sample by s².
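The difference between the population and sample versions can be seen by running both on the study-hours data from the worked example. The following is a minimal sketch: the function name `variance` and its `sample` flag are hypothetical names introduced here, not notation from the notes.

```python
import math


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("a sample variance needs at least two values")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


hours = [2, 6, 5, 3, 2, 3]                      # study hours from the example

sample_var = variance(hours)                    # 13.5 / 5 = 2.7
population_var = variance(hours, sample=False)  # 13.5 / 6 = 2.25

print(round(math.sqrt(sample_var), 3))          # 1.643
print(math.sqrt(population_var))                # 1.5
```

The sample figure (1.643) matches the worked example; treating the same six students as a whole population would give the smaller value 1.5, which is why it matters to choose the right formula.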
Variance Formulas for Ungrouped Data
Population variance: σ² = ∑ (xi − μ)² / N
Here,
σ² = Population variance
xi = ith observation of the given data
μ = Population mean
N = Total number of observations (population size)
Sample variance: s² = ∑ (xi − x̄)² / (n − 1)
Here,
s² = Sample variance
xi = ith observation of the given data
x̄ = Sample mean
n = Sample size (number of data values in the sample)

Variance Formulas for Grouped Data
Formula for Population Variance
The variance of a population for grouped data is:
● σ² = ∑ f (m − μ)² / N
Formula for Sample Variance
The variance of a sample for grouped data is:
● s² = ∑ f (m − x̄)² / (n − 1)
Where, f = frequency of the class and m = midpoint of the class; N (or n) is the total of the frequencies.
These two formulas can also be written as:
Population variance: σ² = (1/N) [ ∑ fi xi² − (∑ fi xi)² / N ]
Sample variance: s² = [1/(n − 1)] [ ∑ fi xi² − (∑ fi xi)² / n ]
Here,
xi = Midvalue of the ith class
fi = Frequency of the ith class
N = Total number of observations (population size)
n = Sample size (number of data values in the sample)

Summary:
Population variance formula, ungrouped data: σ² = ∑ (x − μ)² / N
Population variance formula, grouped data: σ² = ∑ f (m − μ)² / N
Sample variance formula, ungrouped data: s² = ∑ (x − x̄)² / (n − 1)
Sample variance formula, grouped data: s² = ∑ f (m − x̄)² / (n − 1)

Variance Formula Example Question
Question: Find the variance for the following set of data representing tree heights in feet: 3, 21, 98, 203, 17, 9
Solution:
Step 1: Add up the numbers in your given data set. 3 + 21 + 98 + 203 + 17 + 9 = 351
Step 2: Square your answer: 351 × 351 = 123201 …and divide by the number of items. We have 6 items in our example so: 123201 / 6 = 20533.5
Step 3: Take your set of original numbers from Step 1, and square them individually this time: 3 × 3 + 21 × 21 + 98 × 98 + 203 × 203 + 17 × 17 + 9 × 9. Add the squares together: 9 + 441 + 9604 + 41209 + 289 + 81 = 51,633
Step 4: Subtract the amount in Step 2 from the amount in Step 3. 51633 – 20533.5 = 31,099.5. Set this number aside for a moment.
Step 5: Subtract 1 from the number of items in your data set.
For our example: 6 – 1 = 5
Step 6: Divide the number in Step 4 by the number in Step 5. This gives you the variance: 31099.5 / 5 = 6219.9
Step 7: Take the square root of your answer from Step 6. This gives you the standard deviation: √6219.9 = 78.86634
The variance is 6219.9, and the standard deviation is about 78.87.

Question 2: Calculate the variance for the following data:
Class intervals    Frequency
200 – 201          13
201 – 202          27
202 – 203          18
203 – 204          10
204 – 205          1
205 – 206          1
Solution:
CI           fi    xi       fixi      fixi²
200 – 201    13    200.5    2606.5    522603.25
201 – 202    27    201.5    5440.5    1096260.75
202 – 203    18    202.5    3645      738112.5
203 – 204    10    203.5    2035      414122.5
204 – 205    1     204.5    204.5     41820.25
205 – 206    1     205.5    205.5     42230.25
∑fi = 70, ∑fixi = 14137, ∑fixi² = 2855149.5
s² = [1/(70 – 1)] [2855149.5 – (1/70)(14137)²] = 1.179

What is the Semi Interquartile Range?
The semi interquartile range (SIR) (also called the quartile deviation) is a measure of spread. It tells you something about how data is dispersed around a central point (usually the mean). The SIR is half of the interquartile range.

How to Calculate the Semi Interquartile Range / Quartile Deviation
As the SIR is half of the interquartile range, all you need to do is find the IQR and then divide your answer by 2. Another way is to use the quartile deviation formula:
QD = (Q3 – Q1) / 2
Note: You might see the formula QD = 1/2(Q3 – Q1). Algebraically they are the same. Breaking down the above formula:
Step 1: Find the first quartile, Q1. If you’re given Q1 in the question, great. If not, you’ve got several options, including:
1. Use a calculator that reports percentiles. Q1 is equal to the 25th percentile listed in the results.
2. Find the quartiles by hand, as part of the process of finding the interquartile range.
Step 2: Find the third quartile, Q3. If you’re given Q3 in the question, great. If not, use one of the options listed in Step 1. If you choose to use the calculator, Q3 is equal to the 75th percentile.
Step 3: Subtract Step 1 from Step 2.
Step 4: Divide by 2.
Example Question: Find the quartile deviation for the following set of data: {490, 540, 590, 600, 620, 650, 680, 770, 830, 840, 890, 900}
Step 1: Find the first quartile, Q1. This is the median of the lower half of the set {490, 540, 590, 600, 620, 650}. Q1 = (590 + 600) / 2 = 595.
Step 2: Find the third quartile, Q3. This is the median of the upper half of the set {680, 770, 830, 840, 890, 900}. Q3 = (830 + 840) / 2 = 835.
Step 3: Subtract Step 1 from Step 2. 835 – 595 = 240.
Step 4: Divide by 2. 240 / 2 = 120
The quartile deviation for this set of data is 120.

Coefficient of Quartile Deviation
The coefficient of quartile deviation (sometimes called the quartile coefficient of dispersion) allows you to compare dispersion for two or more sets of data. The formula is:
Coefficient of quartile deviation = (Q3 – Q1) / (Q3 + Q1)
If one set of data has a larger coefficient of quartile deviation than another set, then that data set’s interquartile dispersion is greater.

Mean Deviation Definition
The mean deviation is a statistical measure of the average absolute deviation of the data values from the mean of the data set. It can be calculated using the procedure below.
Step 1: Find the mean of the given data values.
Step 2: Subtract the mean from each of the data values (Note: ignore the minus sign).
Step 3: Find the mean of the values obtained in Step 2.

Mean Deviation Formula
The formula to calculate the mean deviation for a data set is given below.
Mean Deviation = [Σ |X – µ|] / N
Here,
Σ represents the addition of values
X represents each value in the data set
µ represents the mean of the data set
N represents the number of data values
| | represents the absolute value, which ignores the “-” symbol

Mean Deviation for Frequency Distribution
To present the data in a more compressed form, we group it and mention the frequency distribution of each such group.
These groups are known as class intervals. Grouping of data is possible in two ways:
1. Discrete Frequency Distribution
2. Continuous Frequency Distribution
The discussion below covers the mean absolute deviation in a discrete frequency distribution. Let us first see what is meant by a discrete frequency distribution.

Mean Deviation for Discrete Frequency Distribution
As the name itself suggests, by discrete we mean distinct or non-continuous. In such a distribution the frequency (number of observations) given in the set of data is discrete in nature. If the data set consists of values x1, x2, x3, …, xn, occurring with frequencies f1, f2, …, fn respectively, then such a representation of data is known as a discrete frequency distribution. To calculate the mean deviation for grouped data, and particularly for discrete distribution data, the following steps are followed:
Step I: Calculate the measure of central tendency about which the mean deviation is to be found. Let this measure be a. If this measure is the mean, then it is calculated as
x̄ = (1/N) ∑ xi fi, where N = ∑ fi (summing over i = 1 to n).
If the measure is the median, then the data are arranged in ascending order and the cumulative frequency is calculated; the observation whose cumulative frequency is equal to or just greater than N/2 is taken as the median of the distribution, and this value lies in the middle of the frequency distribution.
Step II: Calculate the absolute deviation of each observation from the measure of central tendency calculated in Step I.
Step III: The mean absolute deviation around the measure of central tendency is then calculated by using the formula
M.A.D.(a) = (1/N) ∑ fi |xi − a|
If the measure of central tendency is the mean, then
M.A.D.(x̄) = (1/N) ∑ fi |xi − x̄|
In the case of the median,
M.A.D.(M) = (1/N) ∑ fi |xi − M|
Let us look at the following examples for a better understanding.

Mean Deviation Examples
Example 1: Determine the mean deviation for the data values 5, 3, 7, 8, 4, 9.
Solution: Given data values are 5, 3, 7, 8, 4, 9.
We know the procedure to calculate the mean deviation. First, find the mean of the given data:
Mean, µ = (5 + 3 + 7 + 8 + 4 + 9)/6 = 36/6 = 6
Therefore, the mean value is 6.
Now, subtract the mean from each data value and take the absolute value (ignore the “-” sign, if any):
|5 – 6| = 1
|3 – 6| = 3
|7 – 6| = 1
|8 – 6| = 2
|4 – 6| = 2
|9 – 6| = 3
The obtained data set is 1, 3, 1, 2, 2, 3.
Finally, find the mean of the obtained data set:
Mean deviation = (1 + 3 + 1 + 2 + 2 + 3)/6 = 12/6 = 2
Hence, the mean deviation for 5, 3, 7, 8, 4, 9 is 2.

Example 2: In a foreign language class, there are 4 languages, and the number of students learning each language and the frequency of lectures per week are given as:

Language                     Sanskrit   Spanish   French   English
No. of students (xi)         6          5         9        12
Frequency of lectures (fi)   5          7         4        9

Calculate the mean deviation about the mean for the given data.
Solution: The following table gives a tabular representation of the data and the calculations:

xi      fi    fi xi   |xi – µ|   fi |xi – µ|
6       5     30      2.36       11.80
5       7     35      3.36       23.52
9       4     36      0.64       2.56
12      9     108     3.64       32.76
Total   25    209                70.64

Mean, µ = [Σ fi xi]/N = 209/25 = 8.36
Mean deviation about the mean = [Σ fi |xi – µ|]/N = 70.64/25 = 2.8256

10-90 Percentile Range
The 10-90 percentile range is the difference between the 90th and 10th percentiles. See the trimmed mean for another instance where the data between the 10th and 90th percentiles are used.
Procedure for finding the 10-90 percentile range:
1. Find the 10th percentile using the instructions above.
2. Find the 90th percentile using the instructions above.
3. Subtract the 10th percentile from the 90th percentile.
Formula: 10-90 percentile range = P90 – P10

What Is Skewness?
Skewness is a measure of the asymmetry of a data set, that is, of how far its distribution departs from a symmetrical one. On a bell curve, skewness appears when the data points are not distributed symmetrically to the left and right of the median. If the bell curve is shifted to the left or the right, it is said to be skewed. Skewness can be quantified as the extent to which a given distribution varies from a normal distribution. A normal distribution has zero skew, while a lognormal distribution, for example, exhibits some right skew. If one tail is longer than another, the distribution is skewed.
These distributions are sometimes called asymmetric or asymmetrical distributions, as they don’t show any kind of symmetry. Symmetry means that one half of the distribution is a mirror image of the other half. For example, the normal distribution is a symmetric distribution with no skew. The tails are exactly the same.
(Figure: A normal curve.)
A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively skewed distributions. That’s because there is a long tail in the negative direction on the number line. The mean is also to the left of the peak.
A right-skewed distribution has a long right tail. Right-skewed distributions are also called positively skewed distributions. That’s because there is a long tail in the positive direction on the number line. The mean is also to the right of the peak.
The normal distribution is the most common distribution you’ll come across. Next, you’ll see a fair number of negatively skewed distributions. For example, one NY Times chart of household income in the U.S. shows a negatively skewed shape with a very long left tail.
(Figure: Income in the U.S. Image: NY Times.)
Interestingly, you can take the same data and make it a right-skewed distribution: plotting the number of households in each income bracket gives a positively skewed graph. Income distributions are conventionally described as right-skewed, with a long tail of high incomes.

Mean and Median in Skewed Distributions
In a normal distribution, the mean and the median are the same number, while in a skewed distribution they become different numbers:
A left-skewed (negative) distribution will have the mean to the left of the median.
A right-skewed (positive) distribution will have the mean to the right of the median.

Types of Skewness
As noted above, skewness measures asymmetry in a data set and is usually shown on a bell curve. Normal distributions have zero skewness. This means that the distribution is symmetrical around the mean. Having said that, there are instances where a distribution isn't symmetrical. In these cases, it can be either positive or negative.
Below, we highlight what each type of skewness means.

Positive Skewness
A distribution is positively skewed when its tail is more pronounced on the right side than it is on the left. Because the skew is positive, the skewness value is positive. As such, most of the values end up being left of the mean. This means that the most extreme values are on the right side. As an investor, you may find that you have some small losses with a positive skew. But you may also end up realizing large gains, albeit fewer of them.

Negative Skewness
Negative skewness, on the other hand, occurs when the tail is more pronounced on the left rather than the right side. Contrary to the positive skew, most of the values are found on the right side of the mean. As such, the most extreme values are found further to the left. As an investor, a negative skew may indicate that you can expect frequent small gains, but also a few large losses.

Skewed Left (Negative Skew)
A left-skewed distribution is sometimes called a negatively skewed distribution because its long tail is in the negative direction on a number line. A common misconception is that the position of the peak of the distribution is what defines skewness; in other words, that a distribution whose peak tends to the left is left skewed. This is incorrect. There are three main things that make a distribution skewed left:
1. The mean is to the left of the peak. This is the main definition behind “skewness”, which is technically a measure of the distribution of values around the mean.
2. The tail is longer on the left.
3. In most cases, the mean is to the left of the median. This isn’t a reliable test for skewness though, as some distributions (e.g. many multimodal distributions) violate this rule. You should think of this as a “general idea” kind of rule, and not a set-in-stone one.
In a left-skewed distribution, the mean is to the left of the peak.
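As a quick numerical check of the three points above, the following Python sketch compares the mean, median, and mode of a small left-skewed sample. The exam scores are hypothetical values chosen only for illustration (they do not come from the text), and the mode is used here as a stand-in for the peak of the distribution.

```python
from statistics import mean, median, mode

# Hypothetical exam scores (illustrative only): most values are high,
# with one very low score forming a long left tail.
scores = [20, 70, 80, 85, 90, 90, 95]

m = mean(scores)      # pulled toward the long left tail
med = median(scores)  # middle value of the sorted data
peak = mode(scores)   # most frequent value, used here as the "peak"

print(f"mean = {m:.2f}, median = {med}, mode = {peak}")
print("mean left of median:", m < med)
print("mean left of peak:  ", m < peak)
```

For this sample the mean (about 75.7) falls below the median (85), which in turn falls below the mode (90), matching the general pattern described for left-skewed data. As noted in point 3, this ordering is a rule of thumb and not a guarantee.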
Left Skewed and Numerical Values
Skewness can be shown with a list of numbers as well as on a graph. For example, take the numbers 1, 2, and 3. They are evenly spaced, with 2 as the mean ((1 + 2 + 3) / 3 = 6 / 3 = 2). If you add a number to the far left (think in terms of adding a value to the number line), the distribution becomes left skewed: -10, 1, 2, 3. Similarly, if you add a value to the far right, the set of numbers becomes right skewed: 1, 2, 3, 10.
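The effect of adding a far-left or far-right value can be checked numerically. The sketch below computes the moment coefficient of skewness, m3 / m2^(3/2), where m2 and m3 are the second and third central moments. This coefficient is one common way to quantify skewness and is an assumption here, since the text does not name a specific formula; the function name `skewness` is likewise only illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (population form)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# The three lists of numbers discussed in the text.
symmetric = [1, 2, 3]
left_skewed = [-10, 1, 2, 3]
right_skewed = [1, 2, 3, 10]

for name, data in [("symmetric", symmetric),
                   ("left", left_skewed),
                   ("right", right_skewed)]:
    print(f"{name:>9}: skewness = {skewness(data):+.3f}")
```

The symmetric list gives a skewness of zero, the list with -10 added gives a negative value (about -1.09), and the list with 10 added gives a positive value (about +1.02), in line with the left-skewed and right-skewed labels above.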