Data Analyses Skills
(ID6020 Module)
Rahul R. Marathe
Department of Management Studies
Introduction: Why?
Numbers everywhere!
-- Last year, ID6020 had 386 students registered. This year the
number is 405.
-- Average time required to complete a typical catalysis
experiment under laboratory conditions is 34.7.
• Successful professionals are those who can make sense of
these numbers.
• In today’s world, it is more the case of information overload –
too much data! It is our job to make this data tell us a story!
• Sort out what is important and what is not!
Introduction: Why?
• Whether you will be audited by the income tax authorities
depends a lot on the sampling techniques used by the income tax
department, and also on whether you hit certain numerical signals.
• Urban traffic planning is done using data collected
from various locations in a city.
• Market research firms use statistical techniques on point-of-sale data to understand buyer behavior.
• Suitability of a drug is decided by analyzing the field data
collected from trials conducted.
• That’s why every professional should know these techniques.
Introduction: Why?
• Data analysis was traditionally done through “Statistical
techniques”; in recent times, we call this “Data Analytics”.
• Today, data analytics encompasses areas like: Statistics (uni- and multivariate), Probability theory, Stochastic processes,
Computational methods, Optimization techniques, Data
mining, Artificial Intelligence, Econometrics, Numerical
techniques, Simulation, and more.
• Data analysis – Understanding the story told by the numbers!
Introduction: Why?
• Very likely, your research will involve data collection and
analysis.
• Data could be experimental (most engineering applications),
or secondary data (from surveys – humanities and
management).
• Data collection and analysis require a deep understanding of
the theory and techniques of data analytics.
• Your research area itself could be data analytics.
• You certainly require a good understanding of the theory and
techniques!
Introduction: Data
• Data: Any related observations.
• A collection of data is a data set, and a single observation is
a data point.
• Data can be collected by:
1. Observation of incidents as they occur (direct recording)
2. Surveys (and sampling)
3. Conducting experiments etc.
• Data collection is the most important step: if the
collected data are not correct, the analyses and conclusions will be
incorrect and misleading!
Data collection
Before relying on any data, test the data by asking:
• Where did the data come from? Is the source biased?
• Do the data support or contradict other evidence we have?
• Is any evidence missing that might cause us to come to a
different conclusion?
• How many observations do we have? Do they represent all
the groups we wish to study?
• Are the conclusions logical? Have we made conclusions that
are not supported by data?
Example of misleading data
• Trucking company advertises
“75% of everything you use travels by truck.”
• What do you conclude?
Before the data analyses….
Identify: Samples and population
• A population is a collection of all the elements one wants to
study and about which one is trying to draw conclusions.
• A sample is a collection of some, but not all, of the elements
of a population.
Consider a beauty soap targeted at middle-class
women customers aged between 18 and 45 years.
The population is the entire set of middle-class women aged
between 18 and 45. But you need to be careful about the definition
of “middle-class”. Clearly, a schoolgirl is not a member of the
population.
A sample is any subset of the above set.
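The population/sample distinction above can be sketched in a few lines of Python. This is only an illustration: the customer records, the `is_middle_class` flag, and the sample size of 50 are all hypothetical, and in practice the definition of “middle-class” would need far more care than a boolean flag.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical records of women customers: (customer_id, age, is_middle_class)
customers = [
    (i, random.randint(10, 70), random.random() < 0.5)
    for i in range(1000)
]

# Population: every element we want to draw conclusions about
population = [c for c in customers if 18 <= c[1] <= 45 and c[2]]

# Sample: some, but not all, of the population, drawn without replacement
sample = random.sample(population, k=min(50, len(population)))

print(len(customers), len(population), len(sample))
```

Note that a 15-year-old in `customers` never enters `population`, so she can never appear in `sample` either.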
Before the data analyses….
• Identify and classify variables
Type of scale | Data type    | Description                           | Example
Nominal       | Qualitative  | Data arranged in unordered categories | Gender {Male, female}; Software {Code A, Code B}
Ordinal       | Qualitative  | Ordered categories                    | Quality of chemical {poor, average, good}
Interval      | Quantitative | Rank and distance from arbitrary zero | Temperature (difference works, ratio doesn’t!)
Ratio         | Quantitative | Interval + ratio with a meaning       | Weight (an object weighing 20 kg is twice as heavy as an object weighing 10 kg)
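A small sketch of why “difference works, ratio doesn’t” for temperature. The two readings (10 and 20 degrees Celsius) are made-up values chosen for illustration; the only fact used is the standard Celsius-to-Kelvin offset of 273.15.

```python
def celsius_to_kelvin(c):
    """Shift from the arbitrary Celsius zero to the absolute Kelvin zero."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Differences agree on both scales: an interval scale supports subtraction
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios do not agree: only the scale with a true zero gives a meaningful ratio
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(round(diff_c, 6), round(diff_k, 6))
print(round(ratio_c, 4), round(ratio_k, 4))
```

The Celsius ratio of 2 is an artifact of where zero was placed; 20 degrees is not “twice as hot” as 10 degrees.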
Quick check
• Can variables with nominal scale be quantitative? Yes or No.
No – Nominal scale has categories. Categories are for qualitative
data.
• Can variables with ordinal scale be qualitative? Yes or No.
Could be qualitative; could be quantitative. So yes!
• Can nominal or ordinal scale be continuous? Yes or No.
No! Nominal or ordinal scale is for categorical data. Categorical
variables are discrete.
• Can interval scale be continuous and/or discrete? Yes or No.
It can be either continuous or discrete.
Before the data analyses….
• Check and question the assumptions made:
A. Linearity
B. Normality
C. Symmetry
D. Effect of uncommon observation
Example
Pressure | Current
12.1     | 4
12.5     | 3.9
12.9     | 4.11
13.4     | 4.4
14.9     | 2.01
[Scatter plot: Current (axis 0 to 5) against Pressure (axis 11 to 15.5)]
Example (cont.)
Pressure | Current
12.1     | 4
12.5     | 3.9
12.9     | 4.11
13.4     | 4.4
14.9     | 2.01
14       | 3.7
14.8     | 2.75
11.8     | 3.45
14.65    | 2.68
14.2     | 2.9
[Scatter plot: Current (axis 1.5 to 5) against Pressure (axis 11 to 15.5)]
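One way to quantify the “effect of uncommon observation” in the two pressure/current examples is to compute a correlation coefficient on growing subsets of the data. The sketch below uses the ten (Pressure, Current) pairs from the slides; the choice of Pearson's r is an assumption made here for illustration, not something the slides prescribe.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The ten observations from the slides, in the order listed
pressure = [12.1, 12.5, 12.9, 13.4, 14.9, 14.0, 14.8, 11.8, 14.65, 14.2]
current = [4.0, 3.9, 4.11, 4.4, 2.01, 3.7, 2.75, 3.45, 2.68, 2.9]

r_first4 = pearson_r(pressure[:4], current[:4])  # without the point (14.9, 2.01)
r_first5 = pearson_r(pressure[:5], current[:5])  # the first example as given
r_all10 = pearson_r(pressure, current)           # the extended example

print(f"first 4: {r_first4:+.2f}, first 5: {r_first5:+.2f}, all 10: {r_all10:+.2f}")
```

With four points the relationship looks increasing; adding the single point (14.9, 2.01) flips the sign. Whether that point is an outlier or part of the real pattern only becomes clearer once more data are collected.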
Before the data analyses….
• Understand the purpose: Data analysis is done to identify and
understand patterns in data and to use this information to make
better decisions.
DATA = STRUCTURE + NON-STRUCTURE
DATA = EXPLAINED BEHAVIOR + WHITE NOISE
Steps in data analysis
• Once data is collected, we need to clean the data, and then
summarize, interpret and make sense.
• Three categories:
1. Descriptive: How can the data be summarized?
2. Inferential: How can we draw inferences from the data?
3. Predictive: How can we build predictive models using the
data available?
Summary of data
• Describe the data in a graphical or statistical way:
Some commonly used graphical tools – Frequency distribution
tables; Line charts; Histograms; Higher dimensional plots;
Scatter plots
Use of summary statistics –
• Measures of central tendency (measures of location)
Examples?
• Measures of dispersion (extent of scatter) Examples?
• Measure of symmetry (skewness)
• Etc.
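As a hedged illustration of the summary statistics listed above, the sketch below computes one measure of each kind for the ten Current readings from the earlier pressure/current example. The particular choices (mean and median for location, sample standard deviation for dispersion, the moment-based coefficient for skewness) are assumptions; other measures would serve equally well.

```python
import statistics

# Current readings from the earlier pressure/current example
current = [4.0, 3.9, 4.11, 4.4, 2.01, 3.7, 2.75, 3.45, 2.68, 2.9]

# Measures of central tendency (location)
mean_current = statistics.mean(current)
median_current = statistics.median(current)

# Measure of dispersion: sample standard deviation (n - 1 in the denominator)
stdev_current = statistics.stdev(current)

# Measure of symmetry: moment-based skewness m3 / m2**1.5
n = len(current)
m2 = sum((y - mean_current) ** 2 for y in current) / n
m3 = sum((y - mean_current) ** 3 for y in current) / n
skewness = m3 / m2 ** 1.5

print(f"mean={mean_current:.3f} median={median_current:.3f}")
print(f"stdev={stdev_current:.3f} skewness={skewness:.3f}")
```

A mean below the median together with a negative skewness suggests a few low readings are pulling the average down.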
Interpretation and prediction
Should depend on:
• Data (variable) type;
• Amount of data;
• Expected type of conclusions.
• Data type:
Independent variable X | Dependent variable Y: Quantitative | Dependent variable Y: Qualitative
Quantitative           | Correlation, Regression            | Convert X into qualitative
Qualitative            | ANOVA                              | Crosstabulation (e.g. Pivot)
Example: Bridge failure
Material  | Design Load | Corridor  | Support  | Status
Concrete  | 100 tons    | Bangalore | Central  | Failed
Tar       | 75 tons     | Ahmedabad | Multiple | Failed
Tar       | 150 tons    | Mumbai    | Multiple | Still there!
Concrete  | 125 tons    | Bareily   | Beams    | Failed
Synthetic | 200 tons    | Gangtok   | Central  | Still there!
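Since Material and Status in the bridge data are both qualitative, a crosstabulation is one natural first look. The sketch below counts (Material, Status) pairs with the standard library; it is a minimal illustration only, and five rows are far too few to support any conclusion about causes of failure.

```python
from collections import Counter

# The five bridge records from the slide:
# (material, design_load_tons, corridor, support, status)
bridges = [
    ("Concrete", 100, "Bangalore", "Central", "Failed"),
    ("Tar", 75, "Ahmedabad", "Multiple", "Failed"),
    ("Tar", 150, "Mumbai", "Multiple", "Still there!"),
    ("Concrete", 125, "Bareily", "Beams", "Failed"),
    ("Synthetic", 200, "Gangtok", "Central", "Still there!"),
]

# Crosstabulation: count of bridges for each (material, status) pair
crosstab = Counter((material, status) for material, _, _, _, status in bridges)

materials = sorted({b[0] for b in bridges})
statuses = sorted({b[4] for b in bridges})

# Print as a small pivot-style table (Counter returns 0 for missing pairs)
print("Material".ljust(10) + "".join(s.ljust(14) for s in statuses))
for m in materials:
    print(m.ljust(10) + "".join(str(crosstab[(m, s)]).ljust(14) for s in statuses))
```

The same counting could be repeated with Support or Corridor in place of Material.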
Questions to ask
• Want to know: Reasons for failure
• Also: factors that may contribute to failure
• Is the data valid?
• Is the data sufficient?
• Can the conclusions be extrapolated?
• Possible methodology: Clustering algorithms.
• Interpretation depends on whether you look at this problem
as a civil engineer, management researcher, or a computer
scientist!
Example: Chemical reaction
• Time required to complete a chemical reaction in a set of
experiments:
24.2, 20.15, 17.11, 14.83, …
Do you see a trend?
Can we be more specific?
Solution methodology: Forecasting
What if the data has uncertainty?
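A first step toward “being more specific” about the trend is to look at successive differences and their ratios. The sketch below uses only the four values listed above and makes no assumption about the elided rest of the series; treating the ratio of differences as the quantity of interest is a choice made here for illustration.

```python
# The four reaction times listed on the slide
times = [24.2, 20.15, 17.11, 14.83]

# Successive differences: how much the time drops from one experiment to the next
diffs = [later - earlier for earlier, later in zip(times, times[1:])]

# Ratios of successive differences: is the drop itself shrinking at a steady rate?
ratios = [d2 / d1 for d1, d2 in zip(diffs, diffs[1:])]

print([round(d, 2) for d in diffs])
print([round(r, 3) for r in ratios])
```

If the ratios stay roughly constant, the drops shrink geometrically, which hints at a decaying (nonlinear) trend rather than a straight line. With real, noisy data this pattern would be much less clean, which is where forecasting methods come in.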
Example: Regression
Example: Nonlinear relationships
What should you be asking?
“Average time required to complete a typical catalysis
experiment under laboratory conditions is 34.7.”
• What do you mean by “typical”?
• What do you mean by “laboratory conditions”?
• What were the other sample values? Was average value
affected by extreme values?
• What are the units?
Courses related to data analyses
Every department has some course(s) on analyses of data and
modeling using data.
• Computational aerodynamics (AS5330)
• Analytical methods in transportation engineering (CE5390)
• Mathematical methods in thermal engg (ME6170)
• Modeling and simulation in manufacturing (ME7240)
• Mathematical methods in materials engg (MM5590)
• Probability and Statistics courses offered by Mathematics, MS
and many other departments.
Courses related to data analyses
• Stochastic processes (multiple courses offered by EE,
Mathematics, MS)
• Multiple courses offered by CSE (on data mining, AI, Data
structures, Big Data)
• Optimization courses offered by CH, Mathematics, MS etc.
• Econometrics courses offered by HS, MS.
• These courses will probably not teach you how to draw a 3D
plot using the data you have, or how to interpret the same.
• But these courses will help you understand the numbers and
analysis in your research!
Tools for data analyses
Institute license, available on super-computing server:
• Abaqus
• Ansys
• LAMMPS
• Matlab
• Mathematica
• Many more!
• SPSS: many departments have licenses. R is freely available
over the internet.
• Old friend: MS Excel
What should you be reading?
• Start from basic Data Analysis textbooks – understand the
basics first.
• Read the advanced texts and research articles – need-based
learning (see what you require, understand the prerequisites
and then master the technique).
• General reading should never stop!!!
e.g. “Freakonomics”: To understand what fun one can have
simply by playing with data!!
Data analyses
Do’s:
• Apply the correct analysis technique
• Understand the assumptions of the method
• Enter the data in the selected technique correctly
• Use the correct equations/software
• Be very careful about the conclusions you draw.
Don’ts:
• Try each and every technique to decide which “looks” good.
• Get fooled by jazzy graphs and colors.
• Extrapolate results and conclusions.
Final word
• Data analyses skills are extremely important and useful.
• Every researcher is going to require these skills at some point
or the other.
• Equip yourself with these techniques and you are better
prepared for the battle of logic.
• These weapons in your armory have to be used carefully, and
after knowing their capabilities (and limitations).
• Don’t make the mistake of beating everything with the same
stick – different demons require different tools!
Best wishes!!
Questions? Comments?
rrmarathe_at_iitm.ac.in