Uploaded by adejorieniola

The Social Statistics

Week 1
Meaning and Origin of statistics
The term "statistics" is ultimately derived from the Neo-Latin statisticum collegium
("council of state") and the Italian word statista ("statesman" or "politician").
The German Statistik, first introduced by Gottfried Achenwall in 1749, originally
designated the analysis of data about the state, signifying the "science of state"
(then known as political arithmetic in English). In the early 19th century, it
came to mean the gathering and organization of data generally. Sir John Sinclair
introduced the term into English in 1791 with the release of the first of 21
volumes titled Statistical Account of Scotland. [1]
The word statistics comes from the Latin word “status”, the Italian word “statista”,
the German word “Statistik” or the French word “statistique”, each referring to a
political state. It originally meant information useful to the state, such as
information about the sizes of populations (human, animal, products, etc.) and
armed forces. According to the pioneer statistician Yule, the word statistics first
occurred in the book “The Elements of Universal Erudition” by Baron J. F. von
Bielfeld (1770). In 1787 a wider definition was used by E.A.W. Zimmermann in “A
Political Survey of the Present State of Europe”. The word appeared in the
Encyclopaedia Britannica in 1797 and was used by Sir John Sinclair in Britain in a
series of volumes published between 1791 and 1799 giving a statistical account of
Scotland. In the 19th century, the word statistics acquired a wider meaning,
covering numerical data on almost any subject and also the interpretation of data
through appropriate analysis.
Today the word statistics is used with three different meanings:
■ Statistics refers to “numerical facts that are arranged systematically in the
form of tables or charts”. In this sense it is always used in the plural, i.e. as a
set of numerical information. Examples are statistics of prices, road accidents,
crimes, births, educational institutions, etc.
■ Statistics is also defined as a discipline that includes the procedures and
techniques used to collect, process, and analyze numerical data in order to make
inferences and reach appropriate decisions in situations of uncertainty
(uncertainty refers to incompleteness; it does not imply ignorance). In this
sense the word statistics is singular. It denotes the science of
basing decisions on numerical data.
■ Statistics are also numerical quantities calculated from sample
observations; a single quantity calculated from sample observations, such as
the mean, is called a statistic. Here the word statistics is the plural of statistic.
“We compute statistics from statistics by statistics.”
The first “statistics” is the plural of statistic (computed quantities); the second
is data in the plural sense; and the third is methods in the singular sense.
Social statistics is the use of statistical measurement systems to study
human behavior in a social environment.
1. Primary Data Collection:
Primary data collection involves the collection of original data directly from the
source or through direct interaction with the respondents. This method allows
researchers to obtain firsthand information specifically tailored to their research
objectives. There are various techniques for primary data collection, including:
● Quantitative Data Collection Methods
● Qualitative Data Collection Methods
Quantitative Data Collection Methods
These methods are based on mathematical calculations and use formats such as
closed-ended questions, correlation and regression methods, and measures such as
the mean, median or mode. They are generally cheaper than qualitative data
collection methods and can be applied in a short time.
Qualitative Data Collection Methods
These methods do not involve mathematical calculations and are closely
associated with elements that are not quantifiable. They include interviews,
questionnaires, observations, case studies, etc. The main techniques for
collecting primary data are:
a. Surveys and Questionnaires: Researchers design structured questionnaires or
surveys to collect data from individuals or groups. These can be conducted
through face-to-face interviews, telephone calls, mail, or online platforms. In the
mailed-questionnaire method, a set of questions is sent to the respondents, who
read them, reply, and return the questionnaire. The questions are printed in a
definite order on the form. A good survey should have the following features:
● Short and simple
● Follows a logical sequence
● Provides adequate space for answers
● Avoids technical terms
● Has a good physical appearance, such as the colour and quality of the paper,
to attract the attention of the respondent
b. Interviews: Interviews involve direct interaction between the researcher and
the respondent. They can be conducted in person, over the phone, or through
video conferencing. Interviews can be structured (with predefined questions),
semi-structured (allowing flexibility), or unstructured (more conversational).
Data are collected in the form of verbal responses, in two main ways:
● Personal Interview – An interviewer asks the other person questions face to
face. A personal interview can be structured or unstructured, and can take the
form of a direct investigation, a focused conversation, etc.
● Telephonic Interview – An interviewer obtains information by contacting
people on the telephone and asking for their answers or views verbally.
c. Observations: Researchers observe and record behaviors, actions, or events in
their natural setting. This method is useful for gathering data on human behavior,
interactions, or phenomena without direct intervention. The observation method is
used when the study relates to behavioural science. It is planned
systematically and is subject to many controls and checks. The different types of
observations are:
● Structured and unstructured observation
● Controlled and uncontrolled observation
● Participant, non-participant and disguised observation
d. Experiments: Experimental studies involve the manipulation of variables to
observe their impact on the outcome. Researchers control the conditions and
collect data to draw conclusions about cause-and-effect relationships.
e. Focus Groups: Focus groups bring together a small group of individuals who
discuss specific topics in a moderated setting. This method helps in understanding
opinions, perceptions, and experiences shared by the participants.
2. Secondary Data Collection:
Secondary data collection involves using existing data collected by someone else
for a purpose different from the original intent. Researchers analyze and interpret
this data to extract relevant information. Secondary data can be obtained from
various sources, including:
a. Published Sources: Researchers refer to books, academic journals, magazines,
newspapers, government reports, and other published materials that contain
relevant data.
b. Online Databases: Numerous online databases provide access to a wide range
of secondary data, such as research articles, statistical information, economic
data, and social surveys.
c. Government and Institutional Records: Government agencies, research
institutions, and organizations often maintain databases or records that can be
used for research purposes.
d. Publicly Available Data: Data shared by individuals, organizations, or
communities on public platforms, websites, or social media can be accessed and
utilized for research.
e. Past Research Studies: Previous research studies and their findings can serve as
valuable secondary data sources. Researchers can review and analyze the data to
gain insights or build upon existing knowledge.
Uses of Statistics
Statistics is used to:
- conduct research
- evaluate outcomes
- develop critical thinking
- make informed decisions
Week 2
Variables
What is a variable?
A variable is any kind of attribute or characteristic that you are trying to measure,
manipulate and control in statistics and research. All studies analyze a variable,
which can describe a person, place, thing or idea. A variable's value can change
between groups or over time.
For example, if the variable in an experiment is a person's eye color, its value can
change from brown to blue to green from person to person.
Common types of variables include:
● Continuous variable: a variable with an infinite number of possible values,
like “time” or “weight”.
● Dependent variable: the outcome of an experiment. As you change the
independent variable, you watch what happens to the dependent variable.
● Discrete variable: a variable that can only take on a certain number of
values. For example, “number of cars in a parking lot” is discrete because a
car park can only hold so many cars.
● Independent variable: a variable that is not affected by anything that you,
the researcher, do. Usually plotted on the x-axis.
● Qualitative variable: a broad category for any variable that can’t be
counted (i.e. has no numerical value). Nominal and ordinal variables fall
under this umbrella term.
● Quantitative variable: a broad category that includes any variable that
can be counted, or has a numerical value associated with it. Examples of
variables that fall into this category include discrete variables and ratio
variables.
Independent vs. dependent variables
Independent variables
Definition: A variable that stands alone and isn't changed by the other variables
or factors that are measured.
Example: Age. Other variables such as where someone lives, what they eat or
how much they exercise are not going to change their age.
Dependent variables
Definition: A variable that relies on and can be changed by other factors
that are measured.
Example: A grade someone gets on an exam depends on factors such as how much
sleep they got and how long they studied.
In studies, researchers often try to find out whether an independent variable
causes other variables to change and in what way. When analyzing relationships
between study objects, researchers often try to determine what makes the
dependent variable change and how. Independent variables can influence
dependent variables, but dependent variables cannot influence independent
variables.
Quantitative vs. qualitative variables
Quantitative variables
Definition: Any data sets that involve numbers or amounts
Examples: Height, distance or number of items
Types: Discrete and continuous
Qualitative variables
Definition: Non-numerical values or groupings
Examples: Eye color or dog breed
Types: Binary, nominal and ordinal
● An extraneous variable is anything that could influence the dependent
variable. These unwanted variables can unintentionally change a study's
results or how a researcher interprets those results.
Extraneous vs. confounding variables
Extraneous variables
Definition: Factors that affect the dependent variable but that the researcher
did not originally consider when designing the experiment.
Example: Parental support, prior knowledge of a foreign language or
socioeconomic status are extraneous variables that could influence a study
assessing whether private tutoring or online courses are more effective at
improving students' Spanish test scores.
Confounding variables
Definition: Extra variables that the researcher did not account for that can
disguise another variable's effects and show false correlations.
Example: In a study of whether a particular genre of movie affects how much
candy kids eat, experiments are held at 9 a.m., noon and 3 p.m. Time could be a
confounding variable, as the group in the noon study might be hungrier and
therefore eat more candy because lunchtime is typically at noon.
● A confounding variable influences the dependent variable, and also
correlates with or causally affects the independent variable. Confounding
variables can invalidate your experiment results by making them biased or
suggesting a relationship between variables exists when it does not.
What Is a Constant?
A “constant” simply means a fixed value or a value that does not
change. A constant has a known value.
If you measure the height of a wall or bookshelf at home, it will be a constant
number. It won’t change. However, if you measure the height of a plant in a pot, it
will keep changing as it grows. It’s not constant.
The following sentence illustrates this:
■ There are 7 days in a week. Here, 7 is a constant.
What is a Sample?
A sample is a smaller and more manageable representation of a larger
group: a subset of a larger population that contains the characteristics of that
population. A sample is used in statistical testing when the population size is too
large for all members or observations to be included in the test.
Ideally, the sample is an unbiased subset of the population that best represents
the whole data.
To overcome the constraints of studying a whole population, you can collect data
from a subset of your population and then treat it as representative of the general
norm. You collect the subset information from the groups who have taken part in
the study, which keeps the data reliable. The results obtained for the groups who
took part in the study can then be extrapolated to generalize to the population.
Figure: Sample
The process of collecting data from a small subsection of the population and then using
it to generalize over the entire set is called Sampling.
Samples are used when:
● The population is too large to collect data from every member.
● Data collected from the whole population would not be reliable.
● The population is hypothetical and is unlimited in size. Take the example of a
study that documents the results of a new medical procedure. It is unknown
how the procedure will affect people across the globe, so a test group is
used to find out how people react to it.
A sample should generally:
● Reflect all the different variations present in the population and follow a
well-defined selection criterion.
● Be unbiased with respect to the properties of the objects being selected.
● Be chosen at random so that the objects of study are selected fairly.
Say you are looking for a job in the IT sector, so you search online for IT jobs. The
first search result would be for jobs all around the world. But you want to work in
India, so you search for IT jobs in India. This would be your population. It would
be impossible to go through and apply for all positions in the listing. So you
consider the top 30 jobs you are qualified for and satisfied with and apply for
those. This is your sample.
However, it’s not that simple. In statistics, your sample size has to be
appropriate: not too large and not too small. Once you’ve decided on a sample
size, you must use a sound technique to collect the sample from the population:
● Probability sampling uses randomization to select sample members. You
know the probability of each potential member’s inclusion in the sample, for
example 1/100. However, the odds don’t have to be equal: some
members might have a 1/100 chance of being chosen, others might have
1/50.
● Non-probability sampling uses non-random techniques (i.e. the judgment of
the researcher). You can’t calculate the odds of any particular item, person
or thing being included in your sample.
Common Types
The most common techniques you’ll likely meet in elementary statistics or AP
statistics include taking a sample with and without replacement. Specific
techniques include:
● Bernoulli samples use independent Bernoulli trials on the population
elements. Each trial decides whether that element becomes part of the
sample. All population elements have an equal chance of being included, and
the sample sizes in Bernoulli samples follow a binomial distribution.
● Poisson samples (less common): an independent Bernoulli trial decides if each
population element makes it into the sample, but the inclusion probability can
differ from element to element.
● Cluster samples divide the population into groups (clusters). Then a
random sample of clusters is chosen. It’s used when researchers don’t
know the individuals in a population but do know the population subsets or
groups.
● In systematic sampling, you select sample elements from an ordered
sampling frame. A sampling frame is just a list of the participants that you
want to get a sample from. For example, in the equal-probability method, you
choose a starting element from the list and then choose every kth element,
where k = N/n. Small “n” denotes the sample size and capital “N” the size of
the population.
● Simple random sampling (SRS): select items completely at random, so that each
element has the same probability of being chosen as any other element, and
each subset of k elements has the same probability of being chosen as any
other subset of k elements.
● In stratified sampling, you sample each subpopulation independently. First,
divide the population into homogeneous (very similar) subgroups. Each
population member belongs to only one group. Then apply simple random or
systematic sampling within each group to choose the sample.
● Stratified randomization: a sub-type of stratified sampling used
in clinical trials. First, divide patients into strata, then randomize within
each stratum using permuted block randomization.
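Three of the techniques listed above (simple random, systematic and stratified sampling) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 100 numbered units, the "odd"/"even" strata, the fixed seed and the function names are all hypothetical and are not part of these notes.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS without replacement: every subset of size n is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start among the first k units, then take every k-th unit (k = N // n)."""
    k = len(population) // n
    start = rng.randrange(k)
    return population[start::k][:n]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw an independent simple random sample of fixed size from each stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}


rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical frame with N = 100 units

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)

# Hypothetical strata: every unit belongs to exactly one group.
strata = {"odd": population[0::2], "even": population[1::2]}
stratified = stratified_sample(strata, 5, rng)

print(srs)
print(systematic)
print(stratified)
```

With N = 100 and n = 10, the systematic sample steps through the frame with k = 10, so consecutive selected units are always ten positions apart.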
What is Population?
In statistics, population is the entire set of items from which you draw data for a
statistical study. It can be a group of individuals, a set of items, etc. It makes up
the data pool for a study.
Generally, population refers to the people who live in a particular area at a specific
time. But in statistics, population refers to data on your study of interest. It can be
a group of individuals, objects, events, organizations, etc. You use populations to
draw conclusions.
Figure: Population
An example of a population would be the entire student body at a school. It would
contain all the students who study in that school at the time of data collection.
Depending on the problem statement, data from each of these students is
collected. An example is the students who speak Hindi among the students of a
school.
For the above situation, it is easy to collect data. The population is small and
willing to provide data and can be contacted. The data collected will be complete
and reliable.
If you had to collect the same data from a larger population, say the entire
country of India, it would be impossible to draw reliable conclusions because of
geographical and accessibility constraints, not to mention time and resource
constraints. A lot of data would be missing or might be unreliable. Furthermore,
due to accessibility issues, marginalized tribes or villages might not provide data
at all, making the data biased towards certain regions or groups.
In statistics, hypothesis testing is used to determine whether the variation
between groups of data reflects true variation rather than chance. Sample data are
drawn from the population and used to test assumptions about the population
parameters. Hypotheses can be classified into various types. This section covers
the definition of a hypothesis, the various types of hypothesis and the
characteristics of a good hypothesis.
Hypothesis Definition in Statistics
In Statistics, a hypothesis is defined as a formal statement, which gives the
explanation about the relationship between the two or more variables of the
specified population. It helps the researcher to translate the given problem to a
clear explanation for the outcome of the study. It clearly explains and predicts the
expected outcome. It indicates the types of experimental design and directs the
study of the research process.
Types of Hypothesis
The hypothesis can be broadly classified into different types. They are:
Simple Hypothesis
A simple hypothesis is a hypothesis that there exists a relationship between two
variables. One is called a dependent variable, and the other is called an
independent variable.
Complex Hypothesis
A complex hypothesis states a relationship among more than two variables:
there is more than one dependent variable, more than one independent variable,
or both.
Null Hypothesis
The null hypothesis states that there is no significant difference between the
populations specified in the experiment; any observed difference is attributed to
experimental or sampling error. The null hypothesis is denoted by H0.
Alternative Hypothesis
The alternative hypothesis states that the observations are influenced by
some non-random cause. It is denoted by Ha or H1.
Empirical Hypothesis
An empirical hypothesis is formed through experiments and is based on evidence.
Statistical Hypothesis
A statistical hypothesis is a statement, whether logical or illogical, that
can be verified statistically.
Apart from these types, there are also directional and non-directional
hypotheses, associative hypotheses, and causal hypotheses.
Characteristics of Hypothesis
The important characteristics of the hypothesis are:
● The hypothesis should be short and precise
● It should be specific
● A hypothesis must be related to the existing body of knowledge
● It should be capable of verification
In statistics, variables or numbers are defined and categorised using different
scales of measurement. Each level of measurement has specific properties
that determine which statistical analyses are appropriate. This section covers
the four types of scale: nominal, ordinal, interval and ratio.
What is the Scale?
A scale is a device or an object used to measure or quantify any event or another
object.
Levels of Measurements
There are four different scales of measurement. The data can be defined as being
one of the four scales. The four types of scales are:
● Nominal Scale
● Ordinal Scale
● Interval Scale
● Ratio Scale
Nominal Scale
A nominal scale is the 1st level of measurement scale, in which numbers serve
as “tags” or “labels” to classify or identify objects. A nominal scale usually
deals with non-numeric variables or with numbers that have no quantitative
value. The nominal scale simply categorizes variables according to qualitative
labels (or names). These labels and groupings don’t have any order or hierarchy to
them, nor do they convey any numerical value. For example, the variable “hair
color” could be measured on a nominal scale according to the following
categories: blonde hair, brown hair, gray hair, and so on.
Characteristics of Nominal Scale
● A nominal scale variable is classified into two or more categories. In this
measurement mechanism, the answer should fall into one of the classes.
● It is qualitative. The numbers are used here to identify the objects.
● The numbers don’t define the object characteristics. The only permissible
aspect of numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the letters are used as tags, and the answer to this question should be
either M or F.
Ordinal Scale
The ordinal scale is the 2nd level of measurement. It reports the ordering and
ranking of data without establishing the degree of variation between them.
“Ordinal” refers to “order.” Ordinal data is qualitative (categorical) data that
can be grouped, named and also ranked. The ordinal scale
also categorizes variables into labeled groups, and these categories have an order
or hierarchy to them. For example, you could measure the variable “income” on an
ordinal scale as follows: low income, medium income, high income. Another
example could be level of education, classified as follows: high school, master’s
degree, doctorate. These are still qualitative labels (as with the nominal scale), but
you can see that they follow a hierarchical order.
Characteristics of the Ordinal Scale
● The ordinal scale shows the relative ranking of the variables
● It identifies and describes the magnitude of a variable
● Along with the information provided by the nominal scale, ordinal scales
give the rankings of those variables
● The interval properties are not known
● The surveyors can quickly analyse the degree of agreement concerning the
identified order of variables
Example:
● Ranking of school students – 1st, 2nd, 3rd, etc.
● Ratings in restaurants
● Evaluating the frequency of occurrences: Very often, Often, Not often,
Not at all
● Assessing the degree of agreement: Totally agree, Agree, Neutral, Disagree,
Totally disagree
Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a
quantitative measurement scale in which the difference between two values
is meaningful. In other words, the variables are measured in exact rather than
relative terms, although the zero point is arbitrary. The interval scale is a
numerical scale which labels and orders variables, with a known, evenly spaced
interval between each of the values. An oft-cited example of interval data is
temperature in Fahrenheit, where the difference between 10 and 20 degrees
Fahrenheit is exactly the same as the difference between, say, 50 and 60 degrees
Fahrenheit.
Characteristics of Interval Scale:
● The interval scale is quantitative as it can quantify the difference between
the values
● It allows calculating the mean and median of the variables
● To understand the difference between the variables, you can subtract the
values between the variables
● The interval scale is the preferred scale in Statistics as it helps to assign any
numerical values to arbitrary assessment such as feelings, calendar types,
etc.
Example:
● Likert Scale
● Net Promoter Score (NPS)
● Bipolar Matrix Table
Ratio Scale
The ratio scale is the 4th level of measurement scale, and it is quantitative. It
allows researchers to compare differences or intervals, and its distinguishing
feature is that it possesses a true origin or zero point. The ratio scale is
otherwise the same as the interval scale, with one key difference: it has what’s
known as a “true zero.” A good example of ratio data is weight in kilograms. If
something weighs zero kilograms, it truly weighs nothing. Compare this with
temperature in degrees Fahrenheit or Celsius (interval data), where a value of
zero degrees doesn’t mean there is “no temperature”; it simply means it’s very
cold.
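A short worked calculation may make the difference between interval and ratio data concrete. The sketch below is only illustrative: the temperatures, the weights and the rounded conversion factor of 2.20462 pounds per kilogram are assumptions chosen for the example.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Interval scale: differences are meaningful, ratios are not.
f_low, f_high = 10.0, 20.0
ratio_in_fahrenheit = f_high / f_low
ratio_in_celsius = fahrenheit_to_celsius(f_high) / fahrenheit_to_celsius(f_low)
print(ratio_in_fahrenheit, ratio_in_celsius)  # the "ratio" changes with the unit

# Ratio scale: a true zero means ratios survive a change of unit.
KG_TO_LB = 2.20462  # approximate conversion factor
kg_low, kg_high = 10.0, 20.0
ratio_in_kg = kg_high / kg_low
ratio_in_lb = (kg_high * KG_TO_LB) / (kg_low * KG_TO_LB)
print(ratio_in_kg, ratio_in_lb)  # the ratio is the same in both units
```

Because the zero of the Fahrenheit and Celsius scales is arbitrary, "20 degrees is twice as hot as 10 degrees" holds in one unit and fails in the other, whereas "20 kg is twice as heavy as 10 kg" holds in any unit of weight.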
Characteristics of Ratio Scale:
● Ratio scale has a feature of absolute zero
● It doesn’t have negative numbers, because of its zero-point feature
● It affords unique opportunities for statistical analysis. The variables can be
orderly added, subtracted, multiplied, divided. Mean, median, and mode can
be calculated using the ratio scale.
● Ratio scale has unique and useful properties. One such feature is that it
allows unit conversions like kilogram – calories, gram – calories, etc.
Example:
An example of a ratio scale is:
What is your weight in Kgs?
● Less than 55 kgs
● 55 – 75 kgs
● 76 – 85 kgs
● 86 – 95 kgs
● More than 95 kgs
Week 3
Frequency Distribution
What is a frequency distribution?
The frequency (f) of a value is the number of times it occurs in a dataset. A
frequency distribution is the pattern of frequencies of a variable: the set of all
possible values and the number of times each of them occurs in the dataset.
Frequency distributions are portrayed as frequency tables or charts.
Types of frequency distributions
There are four types of frequency distributions:
● Ungrouped frequency distributions: The number of observations of each
value of a variable.
○ You can use this type of frequency distribution for categorical
variables.
● Grouped frequency distributions: The number of observations of each class
interval of a variable. Class intervals are ordered groupings of a variable’s
values.
○ You can use this type of frequency distribution for quantitative
variables.
● Relative frequency distributions: The proportion of observations of each
value or class interval of a variable.
○ You can use this type of frequency distribution for any type of
variable when you’re more interested in comparing frequencies than
the actual number of observations.
● Cumulative frequency distributions: The sum of the frequencies less than or
equal to each value or class interval of a variable.
○ You can use this type of frequency distribution for ordinal or
quantitative variables when you want to understand how often
observations fall below certain values.
Week 4
Frequency Table
How to make a frequency table
Frequency distributions are often displayed using frequency tables. Frequency
distribution tables can be used for both categorical and numeric variables.
Continuous variables should only be used with class intervals, which will be
explained shortly. A frequency table is an effective way to summarize or organize
a dataset. It’s usually composed of two columns:
● The values or class intervals
● Their frequencies
The method for making a frequency table differs between the four types of
frequency distributions. You can follow the guides below or use software such as
Excel, SPSS, or R to make a frequency table.
How to make an ungrouped frequency table
1. Create a table with two columns and as many rows as there are values of
the variable. Label the first column using the variable name and label the
second column “Frequency.” Enter the values in the first column.
○ For ordinal variables, the values should be ordered from smallest to
largest in the table rows.
○ For nominal variables, the values can be in any order in the table. You
may wish to order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value
occurs. Enter the frequencies in the second column of the table beside their
corresponding values.
○ Especially if your dataset is large, it may help to count the
frequencies by tallying. Add a third column called “Tally.” As you read
the observations, make a tick mark in the appropriate row of the tally
column for each observation. Count the tally marks to determine the
frequency.
Example: Making an ungrouped frequency table. A gardener set up a bird feeder in
their backyard. To help them decide how much and what type of birdseed to buy,
they decide to record the bird species that visit their feeder. Over the course of one
morning, the following birds visit their feeder:
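The gardener's actual list of observations is not reproduced in these notes, so the following is only a minimal sketch of the counting step with hypothetical data. The species other than chickadees and finches, and all of the counts, are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical observations: one entry per bird visit (not the original data).
observations = [
    "finch", "sparrow", "chickadee", "finch", "starling", "sparrow",
    "grackle", "finch", "chickadee", "sparrow", "dove", "finch",
    "sparrow", "grackle", "chickadee", "starling",
]

# Count how many times each value occurs.
frequency = Counter(observations)

# Nominal values can be listed in any order; alphabetical is used here.
print(f"{'Species':<10} Frequency")
for species in sorted(frequency):
    print(f"{species:<10} {frequency[species]}")
```

`Counter` does the same job as the tally column described above: each observation adds one to the row for its value.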
How to make a grouped frequency table
1. Divide the variable into class intervals. Below is one method to divide a
variable into class intervals. Different methods will give different answers,
but there’s no agreement on the best method to calculate class intervals.
○ Calculate the range. Subtract the lowest value in the dataset from the
highest.
○ Decide the class interval width. There are no firm rules on how to
choose the width, but the following formula is a rule of thumb:
width = range / √(sample size)
You can round this value to a whole number or a number that’s
convenient to add (such as a multiple of 10).
○ Calculate the class intervals. Each interval is defined by a lower limit
and upper limit. Observations in a class interval are greater than or
equal to the lower limit and less than the upper limit:
lower limit ≤ x < upper limit
The lower limit of the first interval is the lowest value in the dataset.
Add the class interval width to find the upper limit of the first interval
and the lower limit of the second interval. Keep adding the interval
width to calculate more class intervals until you exceed the highest
value.
2. Create a table with two columns and as many rows as there are class
intervals. Label the first column using the variable name and label the
second column “Frequency.” Enter the class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in
each class interval. You can count by tallying if you find it helpful. Enter the
frequencies in the second column of the table beside their corresponding
class intervals.
Example: Grouped frequency distribution A sociologist conducted a survey of 20
adults. She wants to report the frequency distribution of the ages of the survey
respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33,
49, 37
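The range is 63 − 19 = 44 and the sample size is 20, so the rule of thumb gives a
width of 44 / √20 ≈ 9.84. Round the class interval width to 10.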
The class intervals are 19 ≤ a < 29, 29 ≤ a < 39, 39 ≤ a < 49, 49 ≤ a < 59, and 59 ≤ a <
69.
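The steps above can be sketched in Python for the sociologist's ages. This is a minimal illustration, not a prescribed method: the function name grouped_frequencies is hypothetical, and the square-root rule of thumb used when no width is supplied is an assumption about which rule the text intends.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

def grouped_frequencies(values, width=None):
    """Return a list of ((lower, upper), frequency) pairs.

    Each class interval includes its lower limit and excludes its upper limit.
    """
    lowest, highest = min(values), max(values)
    if width is None:
        # Assumed rule of thumb: range divided by the square root of the sample size.
        width = (highest - lowest) / math.sqrt(len(values))
    table = []
    lower = lowest
    # Keep adding intervals until the lower limit exceeds the highest value.
    while lower <= highest:
        upper = lower + width
        count = sum(1 for v in values if lower <= v < upper)
        table.append(((lower, upper), count))
        lower = upper
    return table

# Use the rounded width of 10 from the example.
age_table = grouped_frequencies(ages, width=10)
for (lower, upper), count in age_table:
    print(f"{lower} <= a < {upper}: {count}")
```

With a width of 10 the five intervals match the ones listed in the example, and the counts add up to the 20 respondents.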
How to make a relative frequency table
1. Create an ungrouped or grouped frequency table.
2. Add a third column to the table for the relative frequencies. To calculate the
relative frequencies, divide each frequency by the sample size. The sample
size is the sum of the frequencies.
Example: Relative frequency distribution
From this table, the gardener can make observations, such as that 19% of the bird
feeder visits were from chickadees and 25% were from finches.
How to make a cumulative frequency table
1. Create an ungrouped or grouped frequency table for an ordinal or
quantitative variable. Cumulative frequencies don’t make sense for nominal
variables because the values have no order—one value isn’t more than or
less than another value.
2. Add a third column to the table for the cumulative frequencies. The
cumulative frequency is the number of observations less than or equal to a
certain value or class interval. To calculate the cumulative frequencies, add
each frequency to the frequencies in the previous rows.
3. Optional: If you want to calculate the cumulative relative frequency, add
another column and divide each cumulative frequency by the sample size.
Example: Cumulative frequency distribution
From this table, the sociologist can make observations such as 13 respondents
(65%) were under 39 years old, and 16 respondents (80%) were under 49 years old.
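As a rough sketch, the relative, cumulative, and cumulative relative columns can be computed as follows. The frequencies are the counts obtained by tallying the 20 ages from the grouped example into the five class intervals; the variable names are illustrative only.

```python
from itertools import accumulate

intervals = ["19 <= a < 29", "29 <= a < 39", "39 <= a < 49",
             "49 <= a < 59", "59 <= a < 69"]
frequencies = [4, 9, 3, 3, 1]  # tallied from the 20 ages in the example

sample_size = sum(frequencies)

# Relative frequency: each frequency divided by the sample size.
relative = [f / sample_size for f in frequencies]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: running total divided by the sample size.
cumulative_relative = [c / sample_size for c in cumulative]

for row in zip(intervals, frequencies, relative, cumulative, cumulative_relative):
    print("{:<14} {:>3} {:>6.0%} {:>4} {:>6.0%}".format(*row))
```

The second and third rows reproduce the figures quoted above: 13 respondents (65%) under 39 and 16 respondents (80%) under 49.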
How to graph a frequency distribution
Pie charts, bar charts, and histograms are all ways of graphing frequency
distributions. The best choice depends on the type of variable and what you’re
trying to communicate.
Pie chart
A pie chart is a graph that shows the relative frequency distribution of a nominal
variable.
A pie chart is a circle that’s divided into one slice for each value. The size of the
slices shows their relative frequency.
This type of graph can be a good choice when you want to emphasize that one
value is especially frequent or infrequent, or you want to present the overall
composition of a variable.
A disadvantage of pie charts is that it’s difficult to see small differences between
frequencies. As a result, it’s also not a good option if you want to compare the
frequencies of different values.
Bar chart
A bar chart is a graph that shows the frequency or relative frequency distribution
of a categorical variable (nominal or ordinal).
The y-axis of the bars shows the frequencies or relative frequencies, and the x-axis
shows the values. Each value is represented by a bar, and the length or height of
the bar shows the frequency of the value.
A bar chart is a good choice when you want to compare the frequencies of
different values. It’s much easier to compare the heights of bars than the angles
of pie chart slices.
Histogram
A histogram is a graph that shows the frequency or relative frequency distribution
of a quantitative variable. It looks similar to a bar chart.
The continuous variable is grouped into interval classes, just like a grouped
frequency table. The y-axis of the bars shows the frequencies or relative
frequencies, and the x-axis shows the interval classes. Each interval class is
represented by a bar, and the height of the bar shows the frequency or relative
frequency of the interval class.
Although bar charts and histograms are similar, there are important differences:
Feature | Bar chart | Histogram
Type of variable | Categorical | Quantitative
Value grouping | Ungrouped (values) | Grouped (interval classes)
Bar spacing | Can be a space between bars | Never a space between bars
Bar order | Can be in any order | Can only be ordered from lowest to highest
A histogram is an effective visual summary of several important characteristics of
a variable. At a glance, you can see a variable’s central tendency and variability,
as well as what probability distribution it appears to follow, such as a normal,
Poisson, or uniform distribution.
WEEK 5
Definition of Mean in Statistics
Mean is the average of the given numbers and is calculated by dividing the sum of
given numbers by the total number of numbers.
Mean = (Sum of all the observations/Total number of observations)
Example:
What is the mean of 2, 4, 6, 8 and 10?
Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6
In the case of a discrete probability distribution of a random variable X, the mean
is equal to the sum over every possible value weighted by the probability of that
value; that is, it is computed by taking the product of each possible value x of X
and its probability P(x) and then adding all these products together.
Mean Symbol (X Bar)
The mean is usually denoted by the symbol ‘x̄’. The bar above the letter x
indicates the mean of the x values.
X̄ = (Sum of values ÷ Number of values)
X̄ = (x1 + x2 + x3 +….+xn)/n
Mean Formula
The basic formula to calculate the mean is calculated based on the given data set.
Each term in the data set is considered while evaluating the mean. The general
formula for mean is given by the ratio of the sum of all the terms and the total
number of terms. Hence, we can say;
Mean = Sum of the Given Data/Total number of Data
To calculate the arithmetic mean of a set of data we must first add up (sum) all of
the data values (x) and then divide the result by the number of values (n). Since ∑
is the symbol used to indicate that values are to be summed (see Sigma Notation)
we obtain the following formula for the mean (x̄):
x̄ = ∑x / n
How to Find Mean?
As we know, data can be grouped or ungrouped, so to find the mean of
given data we need to check whether the data is grouped or ungrouped. The formulas
to find the mean for ungrouped data and grouped data are different. In this
section, you will learn the method of finding the mean for both of these instances.
Mean for Ungrouped Data
The example given below will help you in understanding how to find the mean of
ungrouped data.
Example:
In a class there are 20 students and they have secured a percentage of 88, 82, 88,
85, 84, 80, 81, 82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83.
Find the mean percentage obtained by the class.
Solution:
Mean = Total of percentage obtained by 20 students in class/Total number of
students
= [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89
+ 80 + 82 + 83]/20
= 1660/20
= 83
Hence, the mean percentage obtained by the class is 83%.
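The same calculation can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
from statistics import mean

percentages = [88, 82, 88, 85, 84, 80, 81, 82, 83, 85,
               84, 74, 75, 76, 89, 90, 89, 80, 82, 83]

# Sum of all observations divided by the number of observations.
total = sum(percentages)
class_mean = total / len(percentages)
print(total, class_mean)

# The standard library's mean() performs the same calculation.
library_mean = mean(percentages)
print(library_mean)
```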
Mean for Grouped Data
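For grouped data, we can find the mean using any of the following formulas.
Direct method: x̄ = ∑fixi / ∑fi
Assumed mean method: x̄ = a + ∑fidi / ∑fi, where a is the assumed mean and di = xi − a
Step-deviation method: x̄ = a + h(∑fiui / ∑fi), where h is the class width and ui = (xi − a) / h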
Go through the example given below to understand how to calculate the mean for
grouped data.
Example:
Find the mean for the following distribution.
xi | 11 | 14 | 17 | 20
fi | 3 | 6 | 8 | 7
Solution:
For the given data, we can find the mean using the direct method.
xi | fi | fixi
11 | 3 | 33
14 | 6 | 84
17 | 8 | 136
20 | 7 | 140
Total | ∑fi = 24 | ∑fixi = 393
Mean = ∑fixi/∑fi = 393/24 ≈ 16.4
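A short sketch can confirm that the three methods agree on this distribution. The direct method follows the worked solution; the assumed mean and step-deviation formulas used here are the usual textbook forms and are an assumption, as are the choices a = 17 and h = 3.

```python
x = [11, 14, 17, 20]   # values x_i
f = [3, 6, 8, 7]       # frequencies f_i
n = sum(f)             # total frequency, 24

# Direct method: sum(f_i * x_i) / sum(f_i)
direct = sum(fi * xi for fi, xi in zip(f, x)) / n

# Assumed mean method: a + sum(f_i * d_i) / sum(f_i), with d_i = x_i - a.
a = 17  # assumed mean (any convenient value works)
assumed = a + sum(fi * (xi - a) for fi, xi in zip(f, x)) / n

# Step-deviation method: a + h * sum(f_i * u_i) / sum(f_i), with u_i = (x_i - a) / h.
h = 3   # common spacing between the x values
step = a + h * sum(fi * ((xi - a) / h) for fi, xi in zip(f, x)) / n

print(direct, assumed, step)
```

All three give 16.375, which rounds to the 16.4 reported above. The assumed mean and step-deviation methods exist mainly to keep hand calculations small.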
Types of Mean
There are four main types of mean that you will be studying in
statistics.
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
4. Quadratic Mean
Arithmetic Mean
When you add up all the values and divide by the number of values it is called
Arithmetic Mean. To calculate, just add up all the given numbers then divide by
how many numbers are given.
Example: What is the mean of 3, 5, 9, 5, 7, 2?
Now add up all the given numbers:
3 + 5 + 9 + 5 + 7 + 2 = 31
Now divide by how many numbers are provided in the sequence:
31 / 6 ≈ 5.17
The answer is approximately 5.17.
Geometric Mean
The geometric mean of two numbers x and y is √(xy). If you have three numbers x, y,
and z, their geometric mean is ∛(xyz).
Example: Find the geometric mean of 4 and 3.
√(4 × 3) = √12 ≈ 3.46
How to Find the Geometric Mean (Examples)
Example 1: What is the geometric mean of 2, 3, and 6?
First, multiply the numbers together and then take the cube root (because there
are three numbers) = (2 * 3 * 6)^(1/3) = 36^(1/3) ≈ 3.30
Note: The power of (1/3) is the same as the cube root ∛. To convert an nth root to
this notation, just change the denominator in the fraction to whatever “n” you
have. So:
● 5th root = to the (1/5) power
● 12th root = to the (1/12) power
● 99th root = to the (1/99) power.
Example 2: What is the geometric mean of 4, 8, 3, 9 and 17?
First, multiply the numbers together and then take the 5th root (because there are
5 numbers) = (4 * 8 * 3 * 9 * 17)^(1/5) ≈ 6.81
Example 3: What is the geometric mean of 1/2, 1/4, 1/5, 9/72 and 7/4?
First, multiply the numbers together and then take the 5th root:
(1/2 * 1/4 * 1/5 * 9/72 * 7/4)^(1/5) ≈ 0.35.
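The three examples can be reproduced with a small helper. This is a minimal sketch that assumes all inputs are positive; the function name geometric_mean is chosen for illustration.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

example_1 = geometric_mean([2, 3, 6])
example_2 = geometric_mean([4, 8, 3, 9, 17])
example_3 = geometric_mean([1/2, 1/4, 1/5, 9/72, 7/4])

print(round(example_1, 2), round(example_2, 2), round(example_3, 2))
```

For long lists, multiplying everything first can overflow or underflow; averaging the logarithms and exponentiating is a common alternative.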
Example 4: The average person’s monthly salary in a certain town jumped from
$2,500 to $5,000 over the course of ten years. Using the geometric mean, what is
the average yearly increase?
Solution:
Step 1: Find the overall growth factor.
5000 / 2500 = 2.
Step 2: Take the 10th root (because the growth is spread over ten years). This is
the geometric mean of the ten yearly growth factors.
2^(1/10) ≈ 1.0718.
The average yearly increase (according to the GM) is about 7.18%.
Harmonic Mean
The harmonic mean is a very specific type of average. It’s generally used when
dealing with averages of units, like speed or other rates and ratios.
The formula is:
H = n / (1/x1 + 1/x2 + … + 1/xn)
If the formula above looks daunting, all you need to do to solve it is:
● Step 1: Add the reciprocals of the numbers in the set.
● Step 2: Divide the number of items in the set by your answer to Step 1.
The harmonic mean is a numerical average calculated by dividing the number of
observations, or entries in the series, by the sum of the reciprocals of the numbers
in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals.
For example, to calculate the harmonic mean of 1, 4, and 4, you would divide the
number of observations by the sum of the reciprocals, as follows:
3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2
The harmonic mean has uses in finance and technical analysis of markets, among
others.
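The two steps (add the reciprocals, then divide the count by that sum) can be sketched as below for the numbers 1, 4, and 4. The helper name harmonic_mean_by_hand is hypothetical; statistics.harmonic_mean is the standard-library function and is shown only as a cross-check.

```python
from statistics import harmonic_mean

def harmonic_mean_by_hand(values):
    # Step 1: add the reciprocals of the numbers in the set.
    reciprocal_sum = sum(1 / v for v in values)
    # Step 2: divide the number of items by the result of Step 1.
    return len(values) / reciprocal_sum

by_hand = harmonic_mean_by_hand([1, 4, 4])
library = harmonic_mean([1, 4, 4])
print(by_hand, library)
```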
Example of the Harmonic Mean
As an example, take two firms. One has a market capitalization of $100 billion and
earnings of $4 billion (P/E of 25), and the other has a market capitalization of $1
billion and earnings of $4 million (P/E of 250). In an index made of the two stocks,
with 10% invested in the first and 90% invested in the second, the P/E ratio of the
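index is:
Using the weighted arithmetic mean: P/E = 0.1 × 25 + 0.9 × 250 = 227.5
Using the weighted harmonic mean: P/E = (0.1 + 0.9) / (0.1/25 + 0.9/250) ≈ 131.6
The weighted harmonic mean is the appropriate average for ratios such as P/E.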
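As a hedged illustration, the snippet below computes the index P/E from the weights and ratios stated in the example, once as a weighted arithmetic mean and once as a weighted harmonic mean. The variable names are illustrative; only the weights (10% and 90%) and the P/E values (25 and 250) come from the text.

```python
weights = [0.10, 0.90]   # share of the index invested in each firm
pe_ratios = [25, 250]    # price-to-earnings ratio of each firm

# Weighted arithmetic mean: sum of weight * value.
weighted_arithmetic = sum(w * pe for w, pe in zip(weights, pe_ratios))

# Weighted harmonic mean: sum of weights divided by sum of weight / value.
weighted_harmonic = sum(weights) / sum(w / pe for w, pe in zip(weights, pe_ratios))

print(round(weighted_arithmetic, 1), round(weighted_harmonic, 1))
```

The harmonic version gives the smaller P/E of 25 more influence, which is why the two results differ so much.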
The Bottom Line
The harmonic mean is calculated by dividing the number of entries in a series by
the sum of the reciprocals of the numbers in the series. The harmonic mean stands out from
the other types of Pythagorean mean—the arithmetic mean and geometrical
mean—by using reciprocals and giving greater weight to smaller values. The
harmonic mean is best used for fractions such as rates, and in finance, it is useful
for averaging data like price multiples and identifying patterns such as Fibonacci
sequences.
What is the Quadratic Mean / Root Mean Square?
The quadratic mean (also called the root mean square*) is a type of average. It
measures the absolute magnitude of a set of numbers, and is calculated by:
● Squaring each number,
● Finding the mean of these squares,
● Taking the square root of that average.
If you label each element of your set as xi, where i is an index number numbering
from 1 to n, the RMS can be described as:
RMS = √((x1² + x2² + … + xn²) / n)
RMS gives a greater weight to larger items in a set and is always equal to or
greater than the “regular” arithmetic mean (average).
Sometimes the quadratic mean is referred to as being “the same as” the standard
deviation. This isn’t strictly true: the standard deviation is the quadratic mean
of the deviations from the mean of the data set, not of the values themselves. In
the physical sciences, however, quadratic mean is sometimes used as a synonym for
standard deviation when referencing the “square root of the mean squared deviation
of a signal from a given baseline or fit” (Wolfram).
The quadratic mean is also called the root mean square because it is the square
root of the mean of the squares of the numbers in the set.
*Note: This is different from the root mean square error (RMSE), which is a value
used in regression analysis to describe how spread out data is around a regression
line.
Formula
The quadratic mean is equal to the square root of the mean of the squared values.
The formula is:
RMS = √((x1² + x2² + … + xn²) / n)
An equivalent formula has a summation sign (summation means “to add up”, so
it’s telling you here to add all of the squared x-values up):
RMS = √((1/n) ∑ xi²)
Examples of the Root Mean Square (RMS)
To find the root mean square of the set {1, 3, 4}:
1. Square each of the numbers: 1² = 1, 3² = 9, 4² = 16.
2. Find the mean of Step 1: (1 + 9 + 16) / 3 = 26 / 3 ≈ 8.67.
3. Find the square root of Step 2: √8.67 ≈ 2.94.
Worked Example
Find the Root Mean Square of 2, 4, 9, 10, and 12.
Step 1: Count the number of items.
N = 5.
Set this number aside for a moment.
Step 2: Square all of the numbers. 2², 4², 9², 10², 12² = 4, 16, 81, 100, 144.
Step 3: Add the numbers from Step 2 up: 4 + 16 + 81 + 100 + 144 = 345.
Step 4: Divide Step 3 (the sum) by Step 1 (number of items in the set):
345/5 = 69.
Step 5: Find square root of Step 4. √(69) = 8.31.
That’s it!
The RMS of any series of positive identical numbers will be that same number, just as the
average of a series of identical numbers is the number itself. The RMS of a series of negative
identical numbers will be the absolute value of that number. For positive values, the RMS is
either the same or a bit larger than the average.
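The three steps (square, average, square root) translate directly into code. The following is a minimal sketch with an illustrative function name; it also checks the remark about identical negative numbers.

```python
import math

def root_mean_square(values):
    # Square each number.
    squares = [v ** 2 for v in values]
    # Find the mean of the squares.
    mean_of_squares = sum(squares) / len(squares)
    # Take the square root of that mean.
    return math.sqrt(mean_of_squares)

worked_example = root_mean_square([2, 4, 9, 10, 12])
identical_negatives = root_mean_square([-3, -3, -3])

print(round(worked_example, 2), identical_negatives)
```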
WEEK 6
What are quantiles?
A quartile is a type of quantile.
Quantiles are values that split sorted data or a probability distribution into equal
parts. In general terms, a q-quantile divides sorted data into q parts. The most
commonly used quantiles have special names:
● Quartiles (4-quantiles): Three quartiles split the data into four parts.
● Deciles (10-quantiles): Nine deciles split the data into 10 parts.
● Percentiles (100-quantiles): 99 percentiles split the data into 100 parts.
There is always one fewer quantile than there are parts created by the quantiles.
How to find quantiles
To find a q-quantile, you can follow a similar method to that used for quartiles,
except in steps 3–5, multiply n by multiples of 1/q instead of 1/4.
For example, to find the third 5-quantile:
1. Calculate n * (3 / 5).
2. If n * (3 / 5) is an integer, then the third 5-quantile is the mean of the
numbers at positions n * (3 / 5) and n * (3 / 5) + 1.
3. If n * (3 / 5) is not an integer, then round it up. The number at this position is
the third 5-quantile.
Quartiles
Quartiles are values that divide your data into quarters. However, quartiles aren’t
shaped like pizza slices; instead, they divide your data into four segments
according to where the numbers fall on the number line. The four quarters that
divide a data set into quartiles are:
1. The lowest 25% of numbers.
2. The next lowest 25% of numbers (up to the median).
3. The second highest 25% of numbers (above the median).
4. The highest 25% of numbers.
Quartiles are three values that split sorted data into four parts, each with an equal
number of observations. Quartiles are a type of quantile.
● First quartile: Also known as Q1, or the lower quartile. This is the number
halfway between the lowest number and the middle number.
● Second quartile: Also known as Q2, or the median. This is the middle number
halfway between the lowest number and the highest number.
● Third quartile: Also known as Q3, or the upper quartile. This is the number
halfway between the middle number and the highest number.
Quartiles can also split probability distributions into four parts, each with an equal
probability.
Find Quartiles: Examples
Example: Divide the following data set into quartiles: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65,
45, 12.
Step 1: Put the numbers in order: 2, 5, 6, 7, 10, 12, 13, 14, 16, 22, 45, 65.
Step 2: Count how many numbers there are in your set and then divide by 4 to cut
the list of numbers into quarters. There are 12 numbers in this set, so you would
have 3 numbers in each quartile.
2, 5, 6, | 7, 10, 12 | 13, 14, 16, | 22, 45, 65
If you have an uneven set of numbers, it’s OK to slice a number down the middle.
This can get a little tricky (imagine trying to divide 10, 13, 17, 19, 21 into quarters!),
so you may want to use an online interquartile range calculator to figure those
quartiles out for you. The calculator gives you the 25th Percentile, which is the end
of the first quartile, the 50th Percentile which is the end of the second quartile (or
the median) and the 75th Percentile, which is the end of the third quartile. For 10,
13, 17, 19 and 21 the results are:
25th Percentile: 11.5
50th Percentile: 17
75th Percentile: 20
Interquartile Range: 8.5.
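The text does not say which method the online calculator uses. As a sketch, Python's statistics.quantiles with method="exclusive" (an assumption about the method, chosen because it reproduces the figures above) gives the same three percentiles and interquartile range for 10, 13, 17, 19, 21.

```python
from statistics import quantiles

data = [10, 13, 17, 19, 21]

# n=4 asks for the three cut points that split the data into quarters.
# The "exclusive" method places the k-th quartile at position k * (n + 1) / 4
# and interpolates between neighbouring values.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Other methods, such as method="inclusive", can return different quartiles for the same data, so it is worth stating which one was used.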
Why do we need quartiles in statistics? The main reason is to perform further
calculations, like the interquartile range, which is a measure of how the data is
spread out around the median.
Quartiles are a type of percentile. A percentile is a value with a certain percentage
of the data falling below it. In general terms, k% of the data falls below the kth
percentile.
● The first quartile (Q1, or the lower quartile) is the 25th percentile, meaning
that 25% of the data falls below the first quartile.
● The second quartile (Q2, or the median) is the 50th percentile, meaning that
50% of the data falls below the second quartile.
● The third quartile (Q3, or the upper quartile) is the 75th percentile, meaning
that 75% of the data falls below the third quartile.
By splitting the data at the 25th, 50th, and 75th percentiles, the quartiles divide
the data into four equal parts.
● In a sample or dataset, the quartiles divide the data into four groups with
equal numbers of observations.
● In a probability distribution, the quartiles divide the distribution’s range into
four intervals with equal probability.
How to find quartiles
To find the quartiles of a dataset or sample, follow the step-by-step guide below.
1. Count the number of observations in the dataset (n).
2. Sort the observations from smallest to largest.
3. Find the first quartile:
○ Calculate n * (1 / 4).
○ If n * (1 / 4) is an integer, then the first quartile is the mean of the
numbers at positions n * (1 / 4) and n * (1 / 4) + 1.
○ If n * (1 / 4) is not an integer, then round it up. The number at this
position is the first quartile.
Tip: An integer is a whole number—it can be written without any numbers after the
decimal place.
4. Find the second quartile:
○ Calculate n * (2 / 4).
○ If n * (2 / 4) is an integer, the second quartile is the mean of the
numbers at positions n * (2 / 4) and n * (2 / 4) + 1.
○ If n * (2 / 4) is not an integer, then round it up. The number at this
position is the second quartile.
5. Find the third quartile:
○ Calculate n * (3 / 4).
○ If n * (3 / 4) is an integer, then the third quartile is the mean of the
numbers at positions n * (3 / 4) and n * (3 / 4) + 1.
○ If n * (3 / 4) is not an integer, then round it up. The number at this
position is the third quartile.
There are multiple methods to calculate the first and third quartiles, and they don’t
always give the same answers. There’s no universal agreement on the best way to
calculate quartiles.
Step-by-step example
Imagine you conducted a small study on language development in children 1–6
years old. You’re writing a paper about the study and you want to report the
quartiles of the children’s ages.
Age (years) | 1 | 2 | 3 | 4 | 5 | 6
Frequency | 2 | 3 | 4 | 1 | 2 | 2
Step 1: Count the number of observations in the dataset
n = 2 + 3 + 4 + 1 + 2 + 2 = 14
Step 2: Sort the observations in increasing order
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Step 3: Find the first quartile
n * (1 / 4) = 14 * (1 / 4) = 3.5
3.5 is not an integer, so Q1 is the number at position 4.
1, 1, 2, [2], 2, 3, 3, 3, 3, 4, 5, 5, 6, 6
Q1 = 2 years
Step 4: Find the second quartile
n * (2 / 4) = 14 * (2 / 4) = 7
7 is an integer, so Q2 is the mean of the numbers at positions 7 and 8.
1, 1, 2, 2, 2, 3, [3, 3], 3, 4, 5, 5, 6, 6
Q2 = (3 + 3) / 2
Q2 = 3 years
Step 5: Find the third quartile
n * (3 / 4) = 14 * (3 / 4) = 10.5
10.5 is not an integer, so Q3 is the number at position 11.
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, [5], 5, 6, 6
Q3 = 5 years
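The position rule used in this example can be sketched as a small function. The name quantile and its parameters k and q are illustrative; with q = 4 it follows the quartile steps, and other values of q give other q-quantiles under the same rule.

```python
import math

ages = [1, 2, 3, 4, 5, 6]
frequencies = [2, 3, 4, 1, 2, 2]

# Expand the frequency table into a sorted list of 14 observations.
observations = [age for age, count in zip(ages, frequencies) for _ in range(count)]

def quantile(sorted_values, k, q=4):
    """k-th q-quantile (k = 1 .. q - 1) using the position rule from the text."""
    n = len(sorted_values)
    if (n * k) % q == 0:
        # n * k / q is an integer: average the values at that position
        # and the next one (positions are 1-based in the text).
        position = n * k // q
        return (sorted_values[position - 1] + sorted_values[position]) / 2
    # Otherwise round the position up and take the value found there.
    position = math.ceil(n * k / q)
    return sorted_values[position - 1]

q1, q2, q3 = (quantile(observations, k) for k in (1, 2, 3))
print(q1, q2, q3)
```

As the text notes, other quartile methods exist, so library functions may return slightly different values for the same data.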
Interpreting quartiles
Quartiles can give you useful information about an observation or a dataset.
Comparing observations
Quartiles are helpful for understanding an observation in the context of the rest of
a sample or population. By comparing the observation to the quartiles, you can
determine whether the observation is in the bottom 25%, middle 50%, or top 25%.
Median
The second quartile, better known as the median, is a measure of central
tendency. This middle number is a good measure of the average or most central
value of the data, especially for skewed distributions or distributions with outliers.
Interquartile range
The distance between the first and third quartiles—the interquartile range
(IQR)—is a measure of variability. It indicates the spread of the middle 50% of the
data.
IQR = Q3 − Q1
The IQR is an especially good measure of variability for skewed distributions or
distributions with outliers. IQR only includes the middle 50% of the data, so, unlike
the range, the IQR isn’t affected by extreme values.
Skewness
The distance between quartiles can give you a hint about whether a distribution is
skewed or symmetrical. It’s easiest to use a boxplot to look at the distances
between quartiles:
What is an Upper Quartile?
The upper quartile (sometimes called Q3) is the number dividing the third and
fourth quarters of the data. The upper quartile can also be thought of as the median of the
upper half of the numbers. The upper quartile is also called the 75th percentile; it
splits the lowest 75% of data from the highest 25%.
A set of numbers (-3,-2,-1,0,1,2,3) divided into four quartiles.
Calculating the Upper Quartile
You can find the upper quartile by placing a set of numbers in order and working
out Q3 by hand, or you can use the upper quartile formula. If you have a small set
of numbers (under about 20), by hand is usually the easiest option. However, the
formula works for all sets of numbers, from very small to very large. You may also
want to use the formula if you are uncomfortable with finding the median for sets
of data with odd or even numbers.
Example question: Find the upper quartile for the following set of numbers:
27, 19, 5, 7, 6, 9, 15, 12, 18, 2, 1.
By Hand
Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27
Step 2: Find the median: 1, 2, 5, 6, 7, [9], 12, 15, 18, 19, 27. The median is 9.
Step 3: Place parentheses around the numbers above the median.
1, 2, 5, 6, 7, 9, (12, 15, 18, 19, 27).
Step 4: Find the median of the upper set of numbers. This is the upper quartile:
1, 2, 5, 6, 7, 9, (12, 15, [18], 19, 27). The upper quartile is 18.
Using the Formula
The upper quartile formula is:
Q3 = ¾(n + 1)th Term.
The formula doesn’t give you the value of the upper quartile; it gives you the
place, for example the 5th place or the 76th place.
Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Work the formula. There are 11 numbers in the set, so:
Q3 = ¾(n + 1)th Term.
Q3 = ¾(11 + 1)th Term.
Q3 = ¾(12)th Term.
Q3 = 9th Term.
In this set of numbers (1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27), the upper quartile (18) is the
9th term, or the 9th place from the left.
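A minimal sketch of the formula approach follows. The function name upper_quartile is illustrative. The text only covers the case where ¾(n + 1) is a whole number; the linear interpolation used for fractional places, and the clamping at the last value, are assumptions.

```python
def upper_quartile(values):
    ordered = sorted(values)
    n = len(ordered)
    place = 3 * (n + 1) / 4          # 1-based place of Q3 in the ordered list
    whole = int(place)
    fraction = place - whole
    if fraction == 0 or whole >= n:
        # Whole-number place (or a place beyond the last value): read it off.
        return ordered[min(whole, n) - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

numbers = [27, 19, 5, 7, 6, 9, 15, 12, 18, 2, 1]
q3 = upper_quartile(numbers)
print(q3)
```

For these 11 numbers the place is ¾ × 12 = 9, so the function returns the 9th value in sorted order.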
Difference between a quarter and a quartile
There’s a slight difference between a quarter and quartile. A quarter is the whole
slice of pizza, but a quartile is the mark the pizza cutter makes at the end of the
slice.
[Figure: A quarter of the pizza is the whole slice; a quartile marks the end of the first quarter and the beginning of the second.]
What Is a Decile?
Deciles break up a set of data into tenths. They are similar to quartiles. But while
quartiles sort data into four quarters, deciles sort data into ten equal parts,
bounded by the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.
A decile rank assigns a number to each tenth:
Decile Rank | Percentile
1 | 10th
2 | 20th
3 | 30th
4 | 40th
5 | 50th
6 | 60th
7 | 70th
8 | 80th
9 | 90th
The higher your place in the above rankings, the higher your overall ranking. For
example, if you were in the 99th percentile for a particular test, that would put you
in a ranking of 10. However, if you scored very low (say, the 5th percentile), then
you would have a rank of 1.
[Figure: A chart showing decile rankings for discharged stroke patients. Image: SUNY Buffalo]
Why are Decile ranks used instead of percentiles or quartiles?
Basically, ranks are just another way to categorize data and which system you use
is usually a judgment call. For example, if you wanted to display class rankings on
a pie chart, using deciles would make more sense than percentiles. That’s because
a pie chart with 10-categories would be much easier to read than a pie chart with
99 categories.
What is a Decile used for in Real Life?
They are used significantly more often in real life than in the classroom. For
example, Australia [1] uses decile ranks to report drought data. Ranks of 1-2
represent the lowest 20% (“much below normal”). That means droughts that are
“much below normal” don’t occur more than 20% of the time.
They are also commonly used for college admissions and high school rankings. For
example, this chart from Roanoke College shows the high school rankings for the
student body.
A decile is a quantitative method of splitting up a set of ranked data into 10
equally large subsections. This type of data ranking is performed as part of many
academic and statistical studies in the finance and economics fields. The data
may be ranked from largest to smallest values, or vice versa.
A decile, which has 10 categorical buckets, may be contrasted with percentiles, which
have 100, quartiles, which have four, or quintiles, which have five.
Understanding a Decile
In descriptive statistics, a decile is used to categorize large data sets from the
highest to lowest values, or vice versa. Like the quartile and the percentile, a decile
is a form of a quantile that divides a set of observations into samples that are
easier to analyze and measure.
While quartiles are three data points that divide a set of observations into four equal
groups or quarters, a decile consists of nine data points that divide a data set into
10 equal parts. When an analyst or statistician ranks data and then splits them
into deciles, they do so in an attempt to discover the largest and smallest values
by a given metric.
For example, by splitting the entire S&P 500 Index into deciles (50 firms in each
decile) using the P/E multiple, the analyst will discover the companies with the
highest and lowest P/E valuations in the index.
A decile is usually used to assign decile ranks to a data set. A decile rank arranges
the data in order from lowest to highest and is done on a scale of one to 10 where
each successive number corresponds to an increase of 10 percentage points. In
other words, there are nine decile points. The 1st decile, or D1, is the point that has
10% of the observations below it, D2 has 20% of the observations below it, D3 has
30% of the observations falling below it, and so on.
How to Calculate a Decile
There is no one way of calculating a decile; however, it is important that you are
consistent with whatever formula you decide to use to calculate a decile. One
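simple calculation of a decile is:
Dk = Value of the [k (n + 1) / 10]th data point, where k = 1, 2, …, 9 and n is the
number of observations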
From this formula, it is given that the 5th decile is the median since 5 (n+1) / 10 is
the data point that represents the halfway point of the distribution.
Example of a Decile
The table below shows the ungrouped scores (out of 100) for 30 exam takers:
48 | 52 | 55 | 57 | 58 | 60 | 61 | 64 | 65 | 66
69 | 72 | 73 | 75 | 76 | 78 | 81 | 82 | 84 | 87
88 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 99
Using the information presented in the table, the 1st decile can be calculated as:
● D1 = Value of [(30 + 1) / 10]th data
● D1 = Value of 3.1st data, which is 0.1 of the way between scores 55 and 57
● Thus, D1 = 55 + 2 (0.1) = 55.2
● D1 means that 10% of the data set falls below 55.2.
Let’s calculate the 3rd decile:
● D3 = Value of 3 (30 + 1) / 10
● D3 = Value of 9.3rd position, which is 0.3 between the scores of 65 and 66
● Thus, D3 = 65 + 1 (0.3) = 65.3
● 30% of the 30 scores in the observation fall below 65.3.
What would we get if we were to calculate the 5th decile?
● D5 = Value of 5 (30 + 1) / 10
● D5 = Value of 15.5th position, halfway between scores 76 and 78
● Thus, D5 = 76 + 2 (0.5) = 77
● 50% of the scores fall below 77.
Also, notice how the 5th decile is also the median of the observation. Looking at
the data set in the table, the median, which is the middle data point of any given
set of numbers, can be calculated as (76 + 78) / 2 = 77 = median = D5. Half of
the scores lie above this point and half lie below it.
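The decile calculations above can be sketched in code using the k (n + 1) / 10 position with linear interpolation between neighbouring scores. The function name decile is illustrative, and the clamping at the first and last score is an added assumption for positions that fall outside the data.

```python
scores = [48, 52, 55, 57, 58, 60, 61, 64, 65, 66,
          69, 72, 73, 75, 76, 78, 81, 82, 84, 87,
          88, 90, 91, 92, 93, 94, 95, 96, 97, 99]

def decile(sorted_values, k):
    """k-th decile: value at position k * (n + 1) / 10, interpolating linearly."""
    n = len(sorted_values)
    position = k * (n + 1) / 10      # 1-based position, possibly fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:                    # assumption: clamp below the first value
        return sorted_values[0]
    if whole >= n:                   # assumption: clamp above the last value
        return sorted_values[-1]
    lower = sorted_values[whole - 1]
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)

d1, d3, d5 = decile(scores, 1), decile(scores, 3), decile(scores, 5)
print(round(d1, 1), round(d3, 1), round(d5, 1))
```

Because there is no single agreed formula, a library routine may give slightly different deciles unless it uses the same position rule.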
Percentile
“Percentile” is in everyday use, but there is no universal definition for it. The most
common definition of a percentile is a number where a certain percentage of
scores fall below that number. You might know that you scored 67 out of 90 on a
test. But that figure has no real meaning unless you know what percentile you fall
into. If you know that your score is in the 90th percentile, that means you scored
better than 90% of people who took the test.
Percentiles are commonly used to report scores in tests, like the SAT, GRE and
LSAT. For example, the 70th percentile on the 2013 GRE was 156. That means if you
scored 156 on the exam, your score was better than 70 percent of test takers.
The 25th percentile is also called the first quartile.
The 50th percentile is generally the median (if you’re using the third definition—see
below).
The 75th percentile is also called the third quartile.
The difference between the third and first quartiles is the interquartile range.
Percentile Rank
The word “percentile” is used informally in the above definition. In common use,
the percentile usually indicates that a certain percentage falls below that
percentile. For example, if you score in the 25th percentile, then 25% of test takers
are below your score. The “25” is called the percentile rank. In statistics, it can get
a little more complicated as there are actually three definitions of “percentile.”
Here are the first two (see below for definition 3), based on an arbitrary “25th
percentile”:
Definition 1: The nth percentile is the lowest score that is greater than a certain
percentage (“n”) of the scores. In this example, our n is 25, so we’re looking for the
lowest score that is greater than 25% of the scores.
Definition 2: The nth percentile is the smallest score that is greater than or equal
to a certain percentage of the scores. To rephrase this, it’s the percentage of data
that falls at or below a certain observation. This is the definition used in AP
statistics. In this example, the 25th percentile is the score that’s greater or equal to
25% of the scores.
They may seem very similar, but they can lead to big differences in results,
although they are both the 25th percentile rank. Take the following list of test
scores, ordered by rank:
Score | Rank
30 | 1
33 | 2
43 | 3
53 | 4
56 | 5
67 | 6
68 | 7
72 | 8
How to Find a Percentile
Example question: Find out where the 25th percentile is in the above list.
Step 1: Calculate what rank is at the 25th percentile. Use the following formula:
Rank = Percentile / 100 * (number of items + 1)
Rank = 25 / 100 * (8 + 1) = 0.25 * 9 = 2.25.
A rank of 2.25 is at the 25th percentile. However, there isn’t a rank of 2.25 (ever
heard of a high school rank of 2.25? I haven’t!), so you must either round up, or
round down. As 2.25 is closer to 2 than 3, I’m going to round down to a rank of 2.
Step 2: Choose either definition 1 or 2:
Definition 1: The lowest score that is greater than 25% of the scores. That equals a
score of 43 on this list (a rank of 3).
Definition 2: The smallest score that is greater than or equal to 25% of the scores.
That equals a score of 33 on this list (a rank of 2).
Depending on which definition you use, the 25th percentile could be reported at 33
or 43! A third definition attempts to correct this possible misinterpretation:
Definition 3: A weighted mean of the percentiles from the first two definitions.
In the above example, here’s how the percentile would be worked out using the
weighted mean:
1. Multiply the difference between the scores by 0.25 (the fraction of the rank
we calculated above). The scores were 43 and 33, giving us a difference of 10:
(0.25)(43 – 33) = 2.5
2. Add the result to the lower score. 2.5 + 33 = 35.5
In this case, the 25th percentile score is 35.5, which makes more sense as it lies
between 33 and 43.
In most cases, the percentile is usually definition #1. However, it would be wise to
double check that any statistics about percentiles are created using that first
definition.
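The worked example above can be sketched in Python. This is a minimal illustration, not a standard library routine: the function name percentile_three_ways is made up for this note, and it assumes the fractional rank is always rounded down (the text rounds 2.25 down to 2 because it is closer to 2).

```python
def percentile_three_ways(scores, p):
    """Return (definition 1, definition 2, definition 3) for the p-th percentile,
    following the worked example in the text."""
    data = sorted(scores)
    n = len(data)
    rank = p / 100 * (n + 1)        # e.g. 25 / 100 * (8 + 1) = 2.25
    lower_rank = int(rank)          # whole part of the rank (assumption: round down)
    fraction = rank - lower_rank    # fractional part of the rank
    if lower_rank < 1 or lower_rank >= n:
        raise ValueError("rank falls outside the list; not handled in this sketch")
    def2 = data[lower_rank - 1]     # score at the lower rank
    def1 = data[lower_rank]         # score at the next rank up
    def3 = def2 + fraction * (def1 - def2)  # weighted mean of the two scores
    return def1, def2, def3


scores = [30, 33, 43, 53, 56, 67, 68, 72]
print(percentile_three_ways(scores, 25))  # (43, 33, 35.5)
```

Software packages use several different interpolation rules, so a calculator may report a slightly different 25th percentile for the same list.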
How to Calculate Percentile?
You can calculate percentiles in statistics using the following formula:
For example:
Imagine you have the marks of 20 students. Now, try to calculate the 90th
percentile.
Step 1: Arrange the score in ascending order.
Step 2: Plug the values in the formula to find n.
P90 = 94 means that 90% of students got less than 94 and 10% of students got more
than 94
Percentile Range
A percentile range is the difference between two specified percentiles. These could
theoretically be any two percentiles, but the 10-90 percentile range is the most
common. To find the 10-90 percentile range:
1. Calculate the 10th percentile using the above steps.
2. Calculate the 90th percentile using the above steps.
3. Subtract Step 1 (the 10th percentile) from Step 2 (the 90th percentile).
What is the Midrange?
The midrange is a type of average, or mean. For example, “midrange” electronic
gadgets are in the middle-price bracket: not cheap, but not expensive, either.
The formula to find the midrange = (high + low) / 2.
Example problem: Current cell phone prices in a mobile phone store range from
$40 (the cheapest) to $550 (the most expensive). Find the midrange.
● Step 1: Add the lowest value to the highest: $550 + $40 = $590.
● Step 2: Divide Step 1 by two: $590 / 2 = $295.
The mid priced phones would be priced at around $295.
Difference Between a Midrange and a Range.
The range is a measure of spread. In the cell phone example, the range would be:
$550 – $40 = $510. The range can also mean the entire spread of numbers—for
example, it could be written as $40 to $550. The midrange is a different quantity:
it adds the highest and lowest values and divides the sum by two to find a type of
average.
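A short Python sketch of the two quantities, using the cell phone prices from the example. The function names midrange and value_range are chosen here for illustration only.

```python
def midrange(values):
    """(high + low) / 2: a type of average."""
    return (max(values) + min(values)) / 2


def value_range(values):
    """high - low: a measure of spread."""
    return max(values) - min(values)


prices = [40, 550]  # cheapest and most expensive phone, in dollars
print(midrange(prices))     # 295.0
print(value_range(prices))  # 510
```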
Difference Between a Midrange and the Interquartile
Range.
Don’t confuse the midrange with the interquartile range (IQR), sometimes called
the “middle fifty“. They actually mean very different things. The mid-range is a
type of mean, while the interquartile range is talking about a chunk of data in the
middle of a data set.
For example, when the weather service reports that a “mean daily temperature” is
72.5 degrees, they are talking about the mid-range. They got that number by taking
the sum of the high daily temperature and the low daily temperature and dividing
by 2. Let’s say the recorded daily temperatures were:
55, 65, 67, 69, 70, 80, 81, 87, 90
High = 90
Low = 55
Mid = (90 + 55) / 2 = 145 / 2 = 72.5.
The IQR for this data set is the 25th percentile subtracted from the 75th percentile:
25th Percentile: 66
75th Percentile: 84
Interquartile Range: 84 – 66 = 18
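The quartiles quoted above can be reproduced with a small Python sketch. It assumes the "median of each half" method (the overall median is left out of both halves when the number of values is odd); other quartile conventions exist and can give slightly different answers. The helper name quartiles_by_halves is made up for this note.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the middle
    upper = data[len(data) - half:]   # values above the middle
    return median(lower), median(upper)


temperatures = [55, 65, 67, 69, 70, 80, 81, 87, 90]
q1, q3 = quartiles_by_halves(temperatures)
print(q1, q3)   # 66.0 84.0
print(q3 - q1)  # 18.0
```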
What is the Mode?
The mode, or modal value, is the most common number in a data set. It’s useful in
statistics because it can tell you what the most popular item in your set is. For
example, you might have results from a customer survey where your company is
rated from 1 to 5. If the most popular answer is 2, then you know you need to
make some improvements in customer service!
The mode is the value that appears most frequently in a data set. A set of data
may have one mode, more than one mode, or no mode at all. Other popular
measures of central tendency include the mean, or the average of a set, and the
median, the middle value in a set.
A data set can have no mode, one, or many:
● None: 1, 2, 3, 4, 6, 8, 9.
● One mode: unimodal: 1, 2, 3, 3, 4, 5.
● Two: bimodal: 1, 1, 2, 3, 4, 4, 5.
● Three: trimodal: 1, 1, 2, 3, 3, 4, 5, 5.
● More than one (two, three or more) = multimodal.
How to find the mode by hand
The mode in statistics is the most common number in a data set. For example, in
this set it’s 2, because it is the number that occurs most often: 1, 2, 2, 5, 6. Data
sets in statistics tend to be much larger, so the solution is easier to spot if you put
the numbers in order.
Steps
Sample question: Find the mode for the following data set:
56, 57, 56, 58, 59, 90, 98, 98, 65, 45, 34, 34, 23, 23, 24, 33, 56, 67, 78, 87, 87, 56.
Step 1: Put the numbers in order:
23 23 24 33 34 34 45 56 56 56 56 57 58 59 65 67 78 87 87 90 98 98
Step 2: Count how many times each number appears. This may be easier if you
put the numbers in a column/row format like this:
23 23
24
33
34 34
45
56 56 56 56
57
58
59
65
67
78
87 87
90
98 98
The most common number is 56 in this data set (it appears 4 times).
Examples of the Mode
For example, in the following list of numbers, 16 is the mode since it appears more
times in the set than any other number:
● 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48
A set of numbers can have more than one mode (this is known as bimodal if there
are two modes) if there are multiple numbers that occur with equal frequency, and
more times than the others in the set.
● 3, 3, 3, 9, 16, 16, 16, 27, 37, 48
In the above example, both the number 3 and the number 16 are modes as they
each occur three times and no other number occurs more often.
If no number in a set of numbers occurs more than once, that set has no mode:
● 3, 6, 9, 16, 27, 37, 48
A set of numbers with two modes is bimodal, a set of numbers with three
modes is trimodal, and any set of numbers with more than one mode is
multimodal.
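The counting procedure can be sketched in Python as follows. The function name modes is chosen for this note, and it follows the convention used above: if no value occurs more than once, the set has no mode and an empty list is returned.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)     # how many times each value appears
    top = max(counts.values())   # highest frequency
    if top == 1:
        return []                # nothing repeats: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48]))  # [16]
print(modes([3, 3, 3, 9, 16, 16, 16, 27, 37, 48]))      # [3, 16]
print(modes([3, 6, 9, 16, 27, 37, 48]))                 # []
```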
What is the Median?
Median, in statistics, is the middle value of the given list of data when arranged in
an order. The arrangement of data or observations can be made either in
ascending order or descending order.
Example: The median of 2,3,4 is 3.
In Maths, the median is also a type of average, which is used to find the centre
value. Therefore, it is also called measure of central tendency.
Apart from the median, the other two central tendencies are mean and mode.
Mean is the ratio of the sum of all observations and total number of observations.
Mode is the value in the given data-set, repeated most of the time.
In geometry, the word has a different meaning. The median of a triangle is the
line segment joining a vertex of the triangle to the midpoint of the opposite
side. Therefore, each median bisects the side it meets.
Median in Statistics
The median of a set of data is the middlemost number or centre value in the set.
The median is also the number that is halfway into the set.
To find the median, the data should be arranged first in order of least to greatest
or greatest to the least value. The median is the number that separates the
higher half of a data sample, a population or a probability distribution from the
lower half. The median is different for different types of distribution.
For example, the median of 3, 3, 5, 9, 11 is 5. If there is an even number of
observations, then there is no single middle value; the median is then usually
defined to be the mean of the two middle values: so the median of 3, 5, 7, 9 is
(5+7)/2 = 6.
The median tells you where the middle of a data set is. It’s used for many real-life
situations, like Bankruptcy law, where you can only claim bankruptcy if you are
below the median income in your state.
The median formula is {(n + 1) ÷ 2}th, where “n” is the number of items in the set
and “th” means the formula gives a position in the ordered list (for example, the
4th number), not the median value itself.
To find the median, first order the numbers from smallest to largest. Then find the
middle number. For example, the middle for this set of numbers is 5, because 5 is
right in the middle:
1, 2, 3, 5, 6, 7, 9.
You get the same result with the formula. There are 7 numbers in the set, so n = 7:
● {(7 + 1) ÷ 2}th
● = {(8) ÷ 2}th
● = {4}th
The 4th number in 1, 2, 3, 5, 6, 7, 9 is 5.
A caution with using the median formula: The steps differ slightly depending on
whether you have an even or odd amount of numbers in your data set.
Find the median for an odd set of numbers
Example question: Find the median for the following data set:
102, 56, 34, 99, 89, 101, 10.
Step 1: Sort your data from the smallest number to the highest number. For this
example data set, the order is:
10, 34, 56, 89, 99, 101, 102.
Step 2: Find the number in the middle (where there are an equal number of data
points above and below the number):
10, 34, 56, 89, 99, 101, 102.
The median is 89.
Tip: If you have a large data set, divide the number in the set by 2. That tells you
how many numbers should be above and how many numbers should be below. For
example, 101/2 = 50.5. Ignore the decimal; 50 numbers should be above and 50
below.
Find the median for an even set of numbers
Example question: Find the median for the following data set:
102, 56, 34, 99, 89, 101, 10, 54.
Step 1: Place the data in ascending order (smallest to highest).
10, 34, 54, 56, 89, 99, 101, 102.
Step 2: Find the TWO numbers in the middle (where there are an equal number of
data points above and below the two middle numbers).
10, 34, 54, 56, 89, 99, 101, 102
Step 3: Add the two middle numbers and then divide by two, to get the average:
● 56 + 89 = 145
● 145 / 2 = 72.5.
The median is 72.5.
Tip: For large data sets, divide the number of items by 2, then subtract 1 to find
the number that should be above and the number that should be below. For
example, 100/2 = 50. 50 – 1 = 49. The middle two numbers will have 49 items above
and 49 below.
That’s it!
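Both cases (odd and even) can be sketched in one short Python function. The name find_median is made up for this note; Python's standard library also offers statistics.median, which gives the same results.

```python
def find_median(values):
    """Median by position: middle value (odd count) or mean of the two middle values (even count)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                    # odd count: single middle value
    return (data[mid - 1] + data[mid]) / 2  # even count: average the two middle values


print(find_median([102, 56, 34, 99, 89, 101, 10]))      # 89
print(find_median([102, 56, 34, 99, 89, 101, 10, 54]))  # 72.5
```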
Average vs. Median
The median is very useful for describing things like salaries, where large figures
can throw off the mean. The median household income in the U.S. as of 2012 was
$51,017. If an average was used, American billionaires could skew that figure
upwards.
Let’s say you wanted to work for a small law firm that paid an average salary of
over $73,000 to its 11 employees. You might think there’s a good chance you’ll land
a great paying job. But take a closer look at how the average is calculated for
those eleven employees:
Employee    Salary
Samuel      $28,000
Candice     $17,400
Thomas      $22,000
Ted         $300,000
Carly       $300,000
Shawanna    $20,500
Chan        $18,500
Janine      $27,000
Barbara     $21,000
Anna        $29,000
Jim         $20,000
Average (Mean) =
($28,000 + $17,400 + $22,000 + $300,000 + $300,000 + $20,500 + $18,500 +
$27,000 + $21,000 + $29,000 + $20,000) / 11 = $803,400 / 11 ≈ $73,036
The two partners in the firm, Ted and Carly, have increased the average way
beyond most of the salaries paid in the firm.
See how the “average” can be misleading?
A better way to describe income is to figure out the median, or the middle wage.
If you took that same list of incomes and found the median, you would get a more
realistic representation of income. The median is the middle number, so if you
placed all of the incomes in a list (from smallest to largest) you would get:
$17,400, $18,500, $20,000, $20,500, $21,000, $22,000, $27,000, $28,000, $29,000,
$300,000, $300,000
The median is the sixth of the eleven values: $22,000. It’s a more accurate
representation of what people are actually being paid.
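The comparison can be checked with Python's statistics module. The salaries are the eleven figures from the table; the variable name salaries is chosen here for illustration.

```python
from statistics import mean, median

salaries = [28000, 17400, 22000, 300000, 300000, 20500,
            18500, 27000, 21000, 29000, 20000]

print(round(mean(salaries)))  # 73036  (pulled upwards by the two partners)
print(median(salaries))       # 22000  (the middle of the eleven salaries)
```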
Calculation for a Grouped Frequency Distribution
An easy way to ballpark the median(MD) for a grouped frequency distribution is to
use the midpoint of the interval. If you need something more precise, use the
formula:
MD = lower value + (B ÷ D) x C.
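Step 1: Find out which interval has the MD: it is the first interval whose
cumulative percentage reaches 50%. As a rough shortcut for evenly spread data, use
(n + 1) / 2. For example, if you have 11 intervals, then the MD is in the sixth
interval: (11 + 1) / 2 = 12 / 2 = 6. This interval is called the MD group.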
Step 2: Calculate “A”: the cumulative percentage for the interval immediately
before the median group.
Step 3: Calculate “B”: subtract your step 2 value from 50%. For example, if the
cumulative percentage is 45%, then B is 50% – 45% = 5%.
Step 4: Find “C”: the range (how many numbers are in the interval).
Step 5: Find “D”: the percentage for the median interval.
Step 6: Find the median: Median = lower value + (B ÷ D) x C.
That’s it!
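The formula can be sketched in Python. The function name grouped_median and the numbers in the usage example (an interval starting at 20, 10 units wide, holding 20% of the data, with 45% of the data before it) are hypothetical values chosen only to illustrate the arithmetic.

```python
def grouped_median(lower_value, cumulative_pct_before, pct_in_group, interval_width):
    """MD = lower value + (B / D) x C, with percentages given on a 0-100 scale."""
    b = 50 - cumulative_pct_before  # B: how far short of 50% we are before the MD group
    d = pct_in_group                # D: percentage of the data in the MD group
    c = interval_width              # C: width of the MD group
    return lower_value + (b / d) * c


# Hypothetical MD group: starts at 20, width 10, holds 20% of the data, 45% before it
print(grouped_median(20, 45, 20, 10))  # 22.5
```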
Median Formula
The formula to calculate the median of a finite data set is given here.
The median formula is different for even and odd numbers of observations.
Therefore, it is necessary to recognise first whether we have an odd or an even
number of values in a given data set.
Odd Number of Observations
If the total number of observations given is odd, then the formula to calculate the
median is:
Median = {(n + 1) ÷ 2}th term
where n is the number of observations
Even Number of Observations
If the total number of observations is even, then the median formula is:
Median = [(n ÷ 2)th term + {(n ÷ 2) + 1}th term] ÷ 2
where n is the number of observations
How to Calculate the Median?
To find the median, place all the numbers in ascending order and find the middle.
Example 1:
Find the Median of 14, 63 and 55
solution:
Put them in ascending order: 14, 55, 63
The middle number is 55, so the median is 55.
Example 2:
Find the median of the following:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
When we put those numbers in the order, we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
There are fifteen numbers. Our middle is the eighth number, which is 24.
The median value of this set of numbers is 24.
Example 3:
Rahul’s family drove through 7 states on summer vacation. The prices of Gasoline
differ from state to state. Calculate the median of gasoline cost.
1.79, 1.61, 2.09, 1.84, 1.96, 2.11, 1.75
Solution:
By organizing the data from smallest to greatest, we get:
1.61, 1.75, 1.79, 1.84 , 1.96, 2.09, 2.11
Hence, the median of gasoline cost is 1.84. There are three states with greater
gasoline costs and 3 with smaller prices.
WEEK 8
What is Dispersion?
Dispersion in statistics is a way of describing how spread out a set of data is.
When a data set has a large dispersion, the values in the set are widely scattered;
when it is small, the items in the set are tightly clustered. Very basically, this
set of data has a small dispersion:
1, 2, 2, 3, 3, 4
…and this set has a wider one:
0, 1, 20, 30, 40, 100
The spread of a data set can be described by a range of descriptive statistics
including variance, standard deviation, and interquartile range. Spread can also be
shown in graphs: in dot plots, boxplots, and stem and leaf plots, samples with a
larger dispersion span a greater distance, and vice versa.
[Figure: boxplots. The larger the box, the more dispersion in a set of data. Image:
Seton Hall University]
Measures of Dispersion.
● Coefficient of dispersion: A “catch-all” term for a variety of formulas,
including distance between quartiles.
● Standard deviation: probably the most common measure. It tells you how
spread out numbers are from the mean.
● Index of Dispersion: a measure of dispersion commonly used with nominal
variables.
● Interquartile range (IQR): describes where the bulk of the data lies (the
“middle fifty” percent).
● Interdecile range: the difference between the first decile (10%) and the last
decile (90%).
● Range: the difference between the smallest and largest number in a set of
data.
● Mean difference or difference in means: measures the absolute difference
between the mean value in two different groups in clinical trials.
● Median absolute deviation (MAD): the median of the absolute deviations
from a data set’s median.
● Quartiles: Numbers that split the data into four quarters (first, second, third,
and fourth quartiles).
In some processes, like manufacturing or measurement, low dispersion is
associated with high precision. High dispersion is associated with low precision.
Measures of Dispersion: Example
Let’s say you were asked to compare measures of dispersion for two data sets.
Data set A has the items 97,98,99,100,101,102,103 and data set B has items
70,80,90,100,110,120,130. By looking at the data sets you can probably tell that the
means and medians are the same (100). The mean and the median are technically
called “measures of central tendency” in statistics.
However, the range (which gives you an idea of how spread out the entire set of
data is) is much larger for data set B (60) when compared to data set A (6). In
fact, nearly all measures of dispersion would be ten times greater for data set B,
which makes sense as the range is ten times larger. For example, take a look at the
standard deviations for the two data sets:
Standard deviation for A: 2.160246899469287.
Standard deviation for B: 21.602468994692867.
The figure for data set B is exactly ten times that of A.
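These figures can be reproduced with Python's statistics module. Note that the values quoted above are sample standard deviations (statistics.stdev); the population version (statistics.pstdev) gives a smaller number for the same data. The variable names set_a and set_b are chosen for this note.

```python
from statistics import mean, median, stdev, pstdev

set_a = [97, 98, 99, 100, 101, 102, 103]
set_b = [70, 80, 90, 100, 110, 120, 130]

print(mean(set_a), median(set_a), mean(set_b), median(set_b))  # 100 100 100 100
print(max(set_a) - min(set_a), max(set_b) - min(set_b))        # 6 60
print(stdev(set_a))   # sample standard deviation, about 2.1602
print(stdev(set_b))   # about 21.602, ten times the figure for A
print(pstdev(set_a))  # population standard deviation: 2.0
```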
Warning: When using a calculator (or a formula), check to make sure you are
using the correct setting (or formula) for your data. Many measures of dispersion
(like the variance) have two different formulas, one for a population and one for a
sample.
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data i.e.
to know how much homogenous or heterogeneous the data is. In simple terms, it
shows how squeezed or scattered the variable is.
Measures of spread (also called measures of dispersion) tell you something about
how wide the set of data is. There are several basic measures of spread used in
statistics. The most common are:
1. The range (including the interquartile range and the interdecile range),
2. The standard deviation,
3. The variance,
4. Quartiles.
1. The Range
The range is a basic statistic that tells you how far apart the smallest and
largest values in a data set are. For example, if your
minimum value is $10 and the maximum value is $100 then the range is $90 ($100
– $10). A similar statistic is the interquartile range, which tells you the range in the
middle fifty percent of a set of data; in other words, it’s where the bulk of data
tends to lie.
Another, less common measure is the Semi Interquartile Range, which is one half
of the interquartile range.
2. Standard Deviation
Simply put, the standard deviation is a measure of how spread out data is
around the center of the distribution (the mean). It also gives you an idea of
where, percentage wise, a certain value falls. For example, let’s say you
took a test and the scores were normally distributed (shaped like a bell).
You score one standard deviation above the mean. That tells you your score
is higher than about 84% of test takers, which puts you in roughly the top 16%.
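For a normal distribution, the share of scores that fall below one standard deviation above the mean can be checked with statistics.NormalDist (available in Python 3.8 and later). This is only a sketch; it assumes the test scores follow an exact normal curve.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
share_below = NormalDist(mu=0, sigma=1).cdf(1)

print(round(share_below, 4))      # 0.8413 -> about 84% of scores are lower
print(round(1 - share_below, 4))  # 0.1587 -> about 16% of scores are higher
```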
3. The Variance
The variance is a very simple statistic that gives you an extremely rough idea of how spread out a
data set is. As a measure of spread, it’s actually pretty weak. A large variance of 22,000, for
example, doesn’t tell you much about the spread of data — other than it’s big! The most important
reason the variance exists is to give you a way to find the standard deviation: the standard deviation
is the square root of variance.
4. Quartiles
Quartiles divide your data set into quarters according to where those
numbers fall on the number line. Like the variance, the quartile isn’t very
useful on its own. Instead, it’s used to find more useful values like the interquartile range.
Types of Measures of Dispersion
There are two main types of dispersion methods in statistics which are:
● Absolute Measure of Dispersion
● Relative Measure of Dispersion
Absolute Measure of Dispersion
An absolute measure of dispersion is expressed in the same unit as the original data
set. The absolute dispersion method expresses the variations in terms of the average
of deviations of observations, like the standard deviation or the mean deviation. It
includes range, standard deviation, quartile deviation, etc.
The types of absolute measures of dispersion are:
1. Range: It is simply the difference between the maximum value and the minimum value given
in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 – 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares and finally divide the sum by the total number of values in the data set to get the
variance. Variance (σ²) = ∑(X − μ)²/N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
S.D. (σ) = √(σ²).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called mean absolute deviation).
Relative Measure of Dispersion
The relative measures of dispersion are used to compare the distribution of two or more
data sets. This measure compares values without units. Common relative dispersion
methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation
Co-efficient of Dispersion
The coefficients of dispersion are calculated (along with the measure of dispersion)
when two series are compared, that differ widely in their averages. The dispersion
coefficient is also used when two series with different measurement units are compared.
It is denoted as C.D.
The common coefficients of dispersion are:
C.D. in terms of             Coefficient of dispersion
Range                        C.D. = (Xmax – Xmin) ⁄ (Xmax + Xmin)
Quartile Deviation           C.D. = (Q3 – Q1) ⁄ (Q3 + Q1)
Standard Deviation (S.D.)    C.D. = S.D. ⁄ Mean
Mean Deviation               C.D. = Mean deviation ⁄ Average
Solved Examples
Example 1: Find the Variance and Standard Deviation of the Following Numbers:
1, 3, 5, 5, 6, 7, 9, 10.
Solution:
The mean = (1+ 3+ 5+ 5+ 6+ 7+ 9+ 10)/8 = 46/ 8 = 5.75
Step 1: Subtract the mean value from individual value
(1 – 5.75), (3 – 5.75), (5 – 5.75), (5 – 5.75), (6 – 5.75), (7 – 5.75), (9 – 5.75), (10 – 5.75)
= -4.75, -2.75, -0.75, -0.75, 0.25, 1.25, 3.25, 4.25
Step 2: Squaring the above values we get, 22.5625, 7.5625, 0.5625, 0.5625, 0.0625,
1.5625, 10.5625, 18.0625
Step 3: 22.5625 + 7.5625 + 0.5625 + 0.5625 + 0.0625 + 1.5625 + 10.5625 + 18.0625
= 61.5
Step 4: n = 8, therefore variance (σ²) = 61.5/8 = 7.6875 ≈ 7.69
Now, Standard deviation (σ) = √7.6875 ≈ 2.77
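Example 1 divides by n, so it uses the population formulas. A quick check with Python's statistics module (pvariance and pstdev are the population versions):

```python
from statistics import mean, pvariance, pstdev

data = [1, 3, 5, 5, 6, 7, 9, 10]

print(mean(data))              # 5.75
print(pvariance(data))         # 7.6875
print(round(pstdev(data), 2))  # 2.77
```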
Example 2: Calculate the range and coefficient of range for the following data
values.
45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Solution:
Let Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Here,
Maximum value (Xmax) = 84
Minimum or Least value (Xmin) = 45
Range = Maximum value – Minimum value
= 84 – 45
= 39
Coefficient of range = (Xmax – Xmin)/(Xmax + Xmin)
= (84 – 45)/(84 + 45)
= 39/129
= 0.302 (approx)
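The same calculation as a short Python sketch; the function name coefficient_of_range is chosen for this note.

```python
def coefficient_of_range(values):
    """(Xmax - Xmin) / (Xmax + Xmin): a unit-free measure of spread."""
    x_max, x_min = max(values), min(values)
    return (x_max - x_min) / (x_max + x_min)


data = [45, 55, 63, 76, 67, 84, 75, 48, 62, 65]
print(max(data) - min(data))                 # 39
print(round(coefficient_of_range(data), 3))  # 0.302
```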
Practice Problems
1. Find the coefficient of standard deviation for the data set: 32, 35, 37, 30, 33, 36, 35 and 37
2. The mean and variance of seven observations are 8 and 16, respectively. If five of these are
2, 4, 10, 12 and 14, find the remaining two observations.
3. In a town, 25% of the persons earned more than Rs 45,000 whereas 75% earned more than
18,000. Compute the absolute and relative values of dispersion.
The standard deviation formula is used to measure how dispersed the values in a data
set are. In simple words, the standard deviation is defined as the typical deviation of
the values from their mean. A lower standard deviation indicates that the values are
very close to their average, whereas a higher standard deviation means the values are
far from the mean value. It should be noted that the standard deviation value can never
be negative.
Standard Deviation is of two types:
1. Population Standard Deviation
2. Sample Standard Deviation
Formulas for Standard Deviation
Population Standard Deviation Formula: σ = √[ ∑ (xi − x̄)² / n ]
Sample Standard Deviation Formula: s = √[ ∑ (xi − x̄)² / (n − 1) ]
Notations for Standard Deviation
● σ = Population Standard Deviation
● s = Sample Standard Deviation
● xi = Terms Given in the Data
● x̄ = Mean
● n = Total number of Terms
Standard Deviation Formula Based on Discrete Frequency
Distribution
For discrete frequency distribution of the type:
x: x1, x2, x3, … xn and
f: f1, f2, f3, … fn
The formula for standard deviation becomes:
σ = √[ (1/N) ∑ fi (xi − x̄)² ]
Here, N is given as:
N = ∑ fi (summed over i = 1 to n)
Standard Deviation Formula for Grouped Data
There is another standard deviation formula which is derived from the variance. This
formula is given as:
σ = (1/N) √[ N ∑ fi xi² − (∑ fi xi)² ]
Example Question based on Standard Deviation Formula
Question: During a survey, 6 students were asked how many hours per day they study
on an average? Their answers were as follows: 2, 6, 5, 3, 2, 3. Evaluate the standard
deviation.
Solution:
Step 1: Find the mean of the data:
x̄ = (2 + 6 + 5 + 3 + 2 + 3)/6 = 21/6
= 3.5
Step 2: Construct the table:
xi     xi − x̄     (xi − x̄)²
2      -1.5       2.25
6      2.5        6.25
5      1.5        2.25
3      -0.5       0.25
2      -1.5       2.25
3      -0.5       0.25
∑ (xi − x̄)² = 13.5
Step 3: Now, use the Standard Deviation formula
Sample Standard Deviation = √[ ∑ (xi − x̄)² / (n − 1) ]
= √(13.5/[6 − 1])
= √[2.7]
= 1.643
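The result can be checked with Python's statistics.stdev, which uses the sample formula (division by n − 1). The variable name hours is chosen for this note.

```python
from statistics import mean, stdev

hours = [2, 6, 5, 3, 2, 3]  # study hours reported by the 6 students

x_bar = mean(hours)
squared_deviations = [(x - x_bar) ** 2 for x in hours]

print(x_bar)                    # 3.5
print(sum(squared_deviations))  # 13.5
print(round(stdev(hours), 3))   # 1.643
```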
In probability theory and statistics, the variance formula measures how far a set of
numbers are spread out. It is a numerical value and is used to indicate how widely
individuals in a group vary. If individual observations vary considerably from the group
mean, the variance is big and vice versa.
A variance of zero indicates that all the values are identical. It should be noted that
variance is always non-negative. A small variance indicates that the data points tend to
be very close to the mean and hence to each other, while a high variance indicates that
the data points are very spread out around the mean and from each other.
Variance Formulas
Variance can be calculated for either grouped or ungrouped data. To recall, a variance
can be of two types, which are:
● Variance of a population
● Variance of a sample
The variance of a population is denoted by σ² and the variance of a sample by s².
Variance Formulas for Ungrouped Data
Population variance: σ² = ∑ (xi − μ)² / N
Here,
σ² = Variance
xi = ith observation of given data
μ = Population mean
N = Total number of observations (Population size)
Sample variance: s² = ∑ (xi − x̄)² / (n − 1)
Here,
s² = Sample variance
xi = ith observation of given data
x̄ = Sample mean
n = Sample size (or Number of data values in sample)
Variance Formulas for Grouped Data
Formula for Population Variance
The variance of a population for grouped data is:
● σ² = ∑ f (m − x̄)² / n
Formula for Sample Variance
The variance of a sample for grouped data is:
● s² = ∑ f (m − x̄)² / (n − 1)
Where,
f = frequency of the class
m = midpoint of the class
n = total number of observations (the sum of the frequencies)
These two formulas can also be written as:
Population variance: σ² = (1/N) [ ∑ fi xi² − (1/N)(∑ fi xi)² ]
Here,
σ² = Variance
xi = Midvalue of ith class
fi = Frequency of ith class
N = Total number of observations (Population size)
Sample variance: s² = [1/(n − 1)] [ ∑ fi xi² − (1/n)(∑ fi xi)² ]
Here,
s² = Sample variance
xi = Midvalue of ith class
fi = Frequency of ith class
n = Sample size (or Number of data values in sample)
Summary:
Variance Type                  For Ungrouped Data            For Grouped Data
Population Variance Formula    σ² = ∑ (x − x̄)² / n           σ² = ∑ f (m − x̄)² / n
Sample Variance Formula        s² = ∑ (x − x̄)² / (n − 1)     s² = ∑ f (m − x̄)² / (n − 1)
Variance Formula Example Question
Question: Find the variance for the following set of data representing tree heights in
feet: 3, 21, 98, 203, 17, 9
Solution:
Step 1: Add up the numbers in your given data set.
3 + 21 + 98 + 203 + 17 + 9 = 351
Step 2: Square your answer:
351 × 351 = 123201
…and divide by the number of items. We have 6 items in our example so:
123201/6 = 20533.5
Step 3: Take your set of original numbers from Step 1, and square them individually this
time:
3 × 3 + 21 × 21 + 98 × 98 + 203 × 203 + 17 × 17 + 9 × 9
Add the squares together:
9 + 441 + 9604 + 41209 + 289 + 81 = 51,633
Step 4: Subtract the amount in Step 2 from the amount in Step 3.
51633 – 20533.5 = 31,099.5
Set this number aside for a moment.
Step 5: Subtract 1 from the number of items in your data set. For our example:
6–1=5
Step 6: Divide the number in Step 4 by the number in Step 5. This gives you the
variance:
31099.5/5 = 6219.9
Step 7: Take the square root of your answer from Step 6. This gives you the standard
deviation:
√6219.9 = 78.86634
The variance is 6219.9 and the standard deviation is about 78.87.
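The seven steps amount to the shortcut formula for the sample variance: (sum of squares − (sum)² / n) / (n − 1). A Python sketch of those steps, checked against statistics.variance, is below. The function name shortcut_sample_variance is made up for this note.

```python
from math import isclose, sqrt
from statistics import variance


def shortcut_sample_variance(values):
    """Steps 1 to 6: (sum of squares - (sum squared) / n) / (n - 1)."""
    n = len(values)
    total = sum(values)                          # Step 1
    squared_total_over_n = total ** 2 / n        # Step 2
    sum_of_squares = sum(x * x for x in values)  # Step 3
    numerator = sum_of_squares - squared_total_over_n  # Step 4
    return numerator / (n - 1)                   # Steps 5 and 6


heights = [3, 21, 98, 203, 17, 9]
var = shortcut_sample_variance(heights)
print(round(var, 1))                    # 6219.9
print(round(sqrt(var), 2))              # 78.87 (Step 7: standard deviation)
print(isclose(var, variance(heights)))  # True
```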
Question 2:
Calculate the variance for the following data:
Class intervals    Frequency
200 – 201          13
201 – 202          27
202 – 203          18
203 – 204          10
204 – 205          1
205 – 206          1
Solution:
CI           fi    xi       fixi      fixi²
200 – 201    13    200.5    2606.5    522603.25
201 – 202    27    201.5    5440.5    1096260.75
202 – 203    18    202.5    3645      738112.5
203 – 204    10    203.5    2035      414122.5
204 – 205    1     204.5    204.5     41820.25
205 – 206    1     205.5    205.5     42230.25
∑fi = 70
∑fixi = 14137
∑fixi² = 2855149.5
Variance = [1/(70 – 1)] [2855149.5 – (1/70)(14137)²]
= 1.179
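The table calculation can be sketched in Python. Each class is given as (lower limit, upper limit, frequency); the function name grouped_sample_variance is chosen for this note, and the class midpoint stands in for every value in the class, as in the solution above.

```python
def grouped_sample_variance(classes):
    """s^2 = [1/(n - 1)] * [sum(f*x^2) - (1/n) * (sum(f*x))^2], x = class midpoint."""
    n = sum(f for _, _, f in classes)                                     # sum of fi
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)              # sum of fi*xi
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)      # sum of fi*xi^2
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


classes = [(200, 201, 13), (201, 202, 27), (202, 203, 18),
           (203, 204, 10), (204, 205, 1), (205, 206, 1)]
print(round(grouped_sample_variance(classes), 3))  # 1.179
```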
What is the Semi Interquartile Range?
The semi interquartile range (SIR) (also called the quartile deviation) is a measure of spread. It
tells you something about how data is dispersed around a central point (usually the mean).
The SIR is half of the interquartile range.
How to Calculate the Semi Interquartile Range / Quartile Deviation
As the SIR is half of the Interquartile Range, all you need to do is find the IQR and then divide your
answer by 2.
Another way is to use the quartile deviation formula:
QD = (Q3 – Q1) / 2
Note: You might see the formula QD = 1/2(Q3 – Q1). Algebraically they are the same.
Breaking down the above formula:
Step 1: Find the first quartile, Q1. If you’re given Q1 in the question, great. If not, you’ve got several
options, including:
1. Use a percentile calculator. Q1 is equal to the 25th percentile listed in the results.
2. Find the quartiles by hand, as you would when finding the interquartile range (Q1 is
the median of the lower half of the data).
Step 2: Find the third quartile, Q3. If you’re given Q3 in the question, great. If not, use one of the
options listed in Step 1. If you choose to use the calculator, Q3 is equal to the 75th percentile.
Step 3: Subtract Step 1 from Step 2.
Step 4: Divide by 2.
Example
Question: Find the Quartile Deviation for the following set of data:
{490, 540, 590, 600, 620, 650, 680, 770, 830, 840, 890, 900}
Step 1: Find the first quartile, Q1.
This is the median of the lower half of the set {490, 540, 590, 600, 620, 650}.
Q1 = (590 + 600) / 2 = 595.
Step 2: Find the third quartile, Q3.
This is the median of the upper half of the set {680, 770, 830, 840, 890, 900}.
Q3 = (830 + 840) / 2 = 835.
Step 3: Subtract Step 1 from Step 2.
835 – 595 = 240.
Step 4: Divide by 2. 240 / 2 = 120
The quartile deviation for this set of data is 120.
Coefficient of Quartile Deviation
The coefficient of quartile deviation (sometimes called the quartile coefficient of dispersion) allows
you to compare dispersion for two or more sets of data. The formula is:
Coefficient of quartile deviation = (Q3 – Q1) / (Q3 + Q1)
If one set of data has a larger coefficient of quartile deviation than another set, then that data set’s
interquartile dispersion is greater.
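A Python sketch of the quartile deviation and its coefficient for the data set used in the example. It assumes quartiles are found as the medians of the lower and upper halves, as in the worked steps; the function name quartile_deviation is chosen for this note.

```python
from statistics import median


def quartile_deviation(values):
    """Return (quartile deviation, coefficient of quartile deviation)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])              # median of the lower half
    q3 = median(data[len(data) - half:])  # median of the upper half
    return (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


data = [490, 540, 590, 600, 620, 650, 680, 770, 830, 840, 890, 900]
qd, coefficient = quartile_deviation(data)
print(qd)                     # 120.0
print(round(coefficient, 3))  # 0.168
```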
Mean Deviation Definition
The mean deviation is defined as a statistical measure that is used to calculate the
average deviation from the mean value of the given data set. The mean deviation of the
data values can be easily calculated using the below procedure.
Step 1: Find the mean value for the given data values
Step 2: Now, subtract the mean value from each of the data values given (Note: Ignore
the minus symbol)
Step 3: Now, find the mean of those values obtained in step 2.
Mean Deviation Formula
The formula to calculate the mean deviation for the given data set is given below.
Mean Deviation = [Σ |X – µ|]/N
Here,
Σ represents the addition of values
X represents each value in the data set
µ represents the mean of the data set
N represents the number of data values
| | represents the absolute value, which ignores the “-” symbol
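The formula can be sketched in a few lines of Python. The function name mean_deviation is chosen for this note; the data values 5, 3, 7, 8, 4, 9 are the ones used in the examples of this section.

```python
def mean_deviation(values):
    """Mean Deviation = sum(|X - mu|) / N, taken about the mean."""
    n = len(values)
    mu = sum(values) / n                         # Step 1: the mean
    deviations = [abs(x - mu) for x in values]   # Step 2: ignore the minus sign
    return sum(deviations) / n                   # Step 3: mean of the deviations


print(mean_deviation([5, 3, 7, 8, 4, 9]))  # 2.0
```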
Mean Deviation for Frequency Distribution
To present the data in the more compressed form we group it and mention the
frequency distribution of each such group. These groups are known as class intervals.
Grouping of data is possible in two ways:
1. Discrete Frequency Distribution
2. Continuous Frequency Distribution
In the upcoming discussion, we will be discussing mean absolute deviation in a discrete
frequency distribution.
Let us first see what is meant by a discrete frequency distribution.
Mean Deviation for Discrete Frequency Distribution
As the name suggests, discrete means distinct or non-continuous. In such a
distribution, each distinct value in the data set is listed together with its
frequency (the number of times it is observed).
If the data set consists of values x1, x2, x3, …, xn occurring with frequencies
f1, f2, …, fn respectively, then such a representation of the data is known as a
discrete frequency distribution.
To calculate the mean deviation for grouped data and particularly for discrete
distribution data the following steps are followed:
Step I: The measure of central tendency about which mean deviation is to be found out
is calculated. Let this measure be a.
If this measure is the mean, it is calculated as
µ = [Σ fi xi]/N,
where N = Σ fi is the total number of observations.
If the measure is the median, arrange the data in ascending order and calculate
the cumulative frequencies. The observation whose cumulative frequency is equal
to or just greater than N/2 is taken as the median of the discrete frequency
distribution; this value lies in the middle of the distribution.
Step II: Calculate the absolute deviation of each observation from the measure of
central tendency calculated in step (I)
Step III: The mean absolute deviation about the measure of central tendency is
then calculated using the formula:
If the measure of central tendency is the mean, µ, then
M.D. (mean) = [Σ fi |xi – µ|]/N
In the case of the median, M,
M.D. (median) = [Σ fi |xi – M|]/N
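Steps I to III can be sketched in Python for the median case. This is an illustrative sketch only: the function names and the small data set are hypothetical, and the median is picked with the cumulative-frequency rule described in Step I.

```python
def frequency_median(values, freqs):
    """Value whose cumulative frequency is equal to or just greater than N/2."""
    total = sum(freqs)
    cumulative = 0
    for x, f in sorted(zip(values, freqs)):   # ascending order of x
        cumulative += f
        if cumulative >= total / 2:
            return x

def mean_deviation_about(values, freqs, centre):
    """Sum of f_i * |x_i - centre| divided by N, where N is the sum of f_i."""
    total = sum(freqs)
    return sum(f * abs(x - centre) for x, f in zip(values, freqs)) / total

# Hypothetical discrete frequency distribution.
x = [2, 4, 6, 8]
f = [3, 5, 4, 2]

m = frequency_median(x, f)            # cumulative frequencies: 3, 8, 12, 14
md_median = mean_deviation_about(x, f, m)
print(m, round(md_median, 4))
```

Passing the mean instead of the median as `centre` gives the mean deviation about the mean with the same function.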
Let us look into the following examples for a better understanding.
Mean Deviation Examples
Example 1:
Determine the mean deviation for the data values 5, 3, 7, 8, 4, 9.
Solution:
Given data values are 5, 3, 7, 8, 4, 9.
We follow the procedure for calculating the mean deviation.
First, find the mean for the given data:
Mean, µ = ( 5+3+7+8+4+9)/6
µ = 36/6
µ=6
Therefore, the mean value is 6.
Now, subtract the mean from each data value and ignore the minus sign, if any
(that is, take the absolute value):
|5 – 6| = 1
|3 – 6| = 3
|7 – 6| = 1
|8 – 6| = 2
|4 – 6| = 2
|9 – 6| = 3
Now, the obtained data set is 1, 3, 1, 2, 2, 3.
Finally, find the mean value for the obtained data set
Therefore, the mean deviation is
= (1+3 + 1+ 2+ 2+3) /6
= 12/6
=2
Hence, the mean deviation for 5, 3, 7, 8, 4, 9 is 2.
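The same calculation can be checked with a few lines of Python. This is a minimal sketch of the formula Mean Deviation = [Σ |X – µ|]/N; the function name is hypothetical.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their mean."""
    n = len(values)
    mu = sum(values) / n                           # Step 1: mean
    deviations = [abs(x - mu) for x in values]     # Step 2: |X - mu|
    return sum(deviations) / n                     # Step 3: mean of deviations

data = [5, 3, 7, 8, 4, 9]
print(mean_deviation(data))   # 2.0
```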
Example 2:
In a foreign language class, there are 4 languages, and the frequencies of students
learning the language and the frequency of lectures per week are given as:
Language                       Sanskrit   Spanish   French   English
No. of students (xi)           6          5         9        12
Frequency of lectures (fi)     5          7         4        9
Calculate the mean deviation about the mean for the given data.
Solution: The following table gives us a tabular representation of data and the
calculations
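As a hedged sketch, the calculation for this example can be laid out in Python, assuming the first row of figures gives the values xi (6, 5, 9, 12) and the second row gives the frequencies fi (5, 7, 4, 9). The variable names are illustrative.

```python
languages = ["Sanskrit", "Spanish", "French", "English"]
students = [6, 5, 9, 12]   # x_i (assumed reading of the data)
lectures = [5, 7, 4, 9]    # f_i (assumed reading of the data)

n = sum(lectures)                                            # N = sum of f_i
mean = sum(f * x for x, f in zip(students, lectures)) / n    # sum of f_i*x_i over N

print(f"{'Language':<10}{'xi':>4}{'fi':>4}{'fi*xi':>7}{'|xi-mean|':>11}{'fi*|xi-mean|':>14}")
total_weighted_dev = 0.0
for lang, x, f in zip(languages, students, lectures):
    dev = abs(x - mean)              # absolute deviation from the mean
    total_weighted_dev += f * dev    # weighted by frequency
    print(f"{lang:<10}{x:>4}{f:>4}{f * x:>7}{dev:>11.2f}{f * dev:>14.2f}")

mean_dev = total_weighted_dev / n
print(f"N = {n}, mean = {mean:.2f}, mean deviation = {mean_dev:.4f}")
```

Under this reading, N = 25, the mean is 209 / 25 = 8.36, and the mean deviation about the mean is 70.64 / 25, or roughly 2.83.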
The 10-90 percentile range is the difference between the 90th and 10th percentiles. See
the trimmed mean for another instance of where the data between the 10th and 90th
percentiles are used.
Procedure for finding the 10-90 percentile range
1. Find the 10th percentile using the instructions above
2. Find the 90th percentile using the instructions above
3. Subtract the 10th percentile from the 90th percentile
Formula
10-90 percentile range = P90 – P10, where P10 and P90 are the 10th and 90th
percentiles.
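The three-step procedure can be sketched in Python with the standard library. Percentiles can be computed by several interpolation rules; this sketch assumes the default "exclusive" method of `statistics.quantiles`, which may differ slightly from the method taught elsewhere in these notes. The function name and sample data are hypothetical.

```python
from statistics import quantiles

def percentile_range_10_90(values):
    """Difference between the 90th and 10th percentiles."""
    deciles = quantiles(values, n=10)   # nine cut points: 10th, 20th, ..., 90th
    p10 = deciles[0]                    # Step 1: 10th percentile
    p90 = deciles[-1]                   # Step 2: 90th percentile
    return p90 - p10                    # Step 3: subtract

sample = list(range(1, 100))            # hypothetical data: 1, 2, ..., 99
print(percentile_range_10_90(sample))   # 80.0
```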
What Is Skewness?
Skewness is a measure of the asymmetry of a data set, that is, of how far its
distribution departs from a symmetrical shape. On a bell curve, skewness shows
up when data points are not distributed symmetrically to the left and right of
the median. If the curve is shifted to the left or the right, it is said to be
skewed.
Skewness can be quantified as a representation of the extent to which a given
distribution varies from a normal distribution. A normal distribution has a zero
skew, while a lognormal distribution, for example, would exhibit some right skew.
If one tail is longer than another, the distribution is skewed. These distributions are
sometimes called asymmetric or asymmetrical distributions as they don’t show
any kind of symmetry. Symmetry means that one half of the distribution is a
mirror image of the other half. For example, the normal distribution is a symmetric
distribution with no skew. The tails are exactly the same.
A normal curve.
A left-skewed distribution has a long left tail. Left-skewed distributions are also
called negatively-skewed distributions. That’s because there is a long tail in the
negative direction on the number line. The mean is also to the left of the peak.
A right-skewed distribution has a long right tail. Right-skewed distributions are
also called positive-skew distributions. That’s because there is a long tail in the
positive direction on the number line. The mean is also to the right of the peak.
The normal distribution is the most common distribution you’ll come across. Next,
you’ll see a fair amount of negatively skewed distributions. For example, as
plotted in the chart below, household income in the U.S. appears negatively
skewed with a very long left tail.
Income in the U.S. Image: NY Times.
Interestingly, you can take the same data and make it a right-skewed distribution. This
positively-skewed graph plots the number of households in each income bracket:
Mean and Median in Skewed Distributions
In a normal distribution, the mean and the median are the same number while the
mean and median in a skewed distribution become different numbers:
A left-skewed, negative distribution will have the mean to the left of the median.
A right-skewed distribution will have the mean to the right of the median.
Types of Skewness
As noted above, skewness measures asymmetry in a data set and is usually shown
on a bell curve. Normal distributions have zero skewness. This means that the
distribution is symmetrical around the mean. Having said that, there are
instances where a distribution isn't symmetrical. In these cases, the skewness
can be either positive or negative. Below, we highlight what each type of
skewness means.
Positive Skewness
A distribution is positively skewed when its tail is more pronounced on the right
side than it is on the left. Since the distribution is positively skewed, its
skewness value is positive. As such, most of the values end up being left of the
mean. This means that the most extreme values are on the right side. As an
investor, you may find that you have some small losses with a positive skew. But
you may also end up realizing a few large gains.
Negative Skewness
Negative skewness, on the other hand, occurs when the tail is more pronounced on
the left rather than the right side. Contrary to the positive skew, most of the values
are found on the right side of the mean when it comes to negative skewness. As
such, the most extreme values are found further to the left. Having a negative
skew may indicate that you can expect some small gains here and there. But you
can generally expect to see a few large losses here and there as an investor.
Skewed Left (Negative Skew)
A left skewed distribution is sometimes called a negatively skewed distribution
because its long tail is in the negative direction on a number line.
A common misconception is that the position of the peak is what defines
skewness; in other words, that a distribution whose peak sits to the left is a
left skewed distribution. This is incorrect. Two main things make a distribution
skewed left, and a third usually accompanies them:
1. The mean is to the left of the peak. This is the main definition behind
“skewness”, which is technically a measure of the distribution of values
around the mean.
2. The tail is longer on the left.
3. In most cases, the mean is to the left of the median. This isn’t a reliable test
for skewness though, as some distributions (e.g. many multimodal
distributions) violate this rule. You should think of this as a “general idea”
kind of rule, and not a set-in-stone one.
In a left skewed distribution, the mean is to the left of the peak.
Left Skewed and Numerical Values
Skewness can be shown with a list of numbers as well as on a graph. For example,
take the numbers 1, 2, and 3. They are evenly spaced, with 2 as the mean
((1 + 2 + 3) / 3 = 6 / 3 = 2). If you add a number to the far left (think in terms
of adding a value to the number line), the distribution becomes left skewed:
-10, 1, 2, 3.
Similarly, if you add a value to the far right, the set of numbers becomes right
skewed:
1, 2, 3, 10.
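These three lists can be checked numerically. The sketch below compares the mean with the median and also computes a skewness coefficient. The notes do not give a formula for skewness, so the moment-based coefficient used here (third central moment divided by the 1.5 power of the second) is an assumption chosen for illustration, and the function name is hypothetical.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

for label, data in [("symmetric", [1, 2, 3]),
                    ("left skewed", [-10, 1, 2, 3]),
                    ("right skewed", [1, 2, 3, 10])]:
    print(f"{label:<13} mean={mean(data):>5} median={median(data):>4} "
          f"skewness={skewness(data):+.3f}")
```

The sign of the coefficient matches the direction of the long tail: zero for 1, 2, 3, negative once -10 is added, and positive once 10 is added. In the two skewed lists the mean is also pulled past the median towards the added value.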