Uploaded by Prakhar Sharma

Part 1 - BigData

Predictive Analytics (CS 3142)
Academic year: 2022-23
Introduction
This course is offered by the Dept. of Computer Science & Engineering as a program elective course,
targeting students who wish to pursue research & development in industry or higher studies in the field of
Data Analytics.
The objective of the course is to demonstrate the various steps involved in the analysis of data and the design
of predictive models for that data.
The students will learn various principles and techniques of classification, regression, and time series
analysis. The students will apply these for practical automation applications like forecasting, prediction,
research, and analytics.
Course Outcomes
At the end of the course, you will be able to:
• Understand the principles and techniques of exploratory data analysis.
• Apply predictive classification and regression model algorithms to
enhance your technical skills.
• Analyse the performance of predictive models and apply the suitable
model for given data.
• Use analysis of data for practical real-world prediction applications
and hence enhance employability.
Program Outcomes:
• PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems
• PO2: Problem analysis: Identify, formulate, research literature, and analyse complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering
sciences
• PO3: Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations
• PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions
• PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modelling to complex engineering activities with an understanding of the
limitations
• PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal, and cultural issues and the consequent responsibilities relevant to the professional
engineering practice
• PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development
• PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practices
• PO9: Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings
• PO10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions
• PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments
• PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent
and life-long learning in the broadest context of technological change
Program Specific Outcomes
• PSO1: Will be able to design, develop and implement efficient software for a given real life problem.
• PSO2: Will be able to apply knowledge of AI, Machine Learning and Data Mining in analysing big data for
extracting useful information from it and for performing predictive analysis.
• PSO3: Will be able to design, manage and secure wired/wireless computer networks for transfer and sharing of
information.
Syllabus
Introduction: Business analytics: types, applications, Models - predictive models, descriptive models,
decision models, applications, analytical techniques;
Understanding Data: Data types and associated techniques, complexities of data, data preparation, preprocessing, exploratory data analysis; Principles and Techniques
Predictive modelling: Propensity models, cluster models, collaborative filtering, applications and
limitations - Statistical analysis: Univariate Statistical analysis, Multivariate Statistical analysis;
Model Selection: Preparing to model the data: supervised versus unsupervised methods, statistical and data
mining methodology, cross-validation, overfitting, bias-variance trade-off, balancing the training dataset,
establishing baseline performance
Regression Models: Measuring Performance in Regression Models, Linear Regression and Its Cousins,
Non-Linear Regression Models, Regression Trees and Rule-Based Models and relevant Case Studies
Classification Models: Measuring Performance in Classification Models, Discriminant Analysis and Other
Linear Classification Models, Non-Linear Classification Models, Classification Trees and Rule-Based
Models, Model Evaluation Techniques
Time Series Analysis: ARMA, ARIMA, ARFIMA - Temporal mining - Box-Jenkins method, temporal
reasoning, temporal constraint networks.
Reference Book(s):
1. Dinov, ID., Data Science and Predictive Analytics: Biomedical and Health Applications using R,
Springer, 2018.
2. A. Bari, M. Chaouchi, T. Jung, Predictive analytics for dummies, (2e), Wiley, 2016.
3. Jeffrey Strickland, Predictive analytics using R, Simulation educators, Colorado Springs, 2015
4. Daniel T. Larose, Chantal D. Larose, Data Mining and Predictive analytics, (2e), Wiley, 2015.
5. Max Kuhn and Kjell Johnson, Applied Predictive Modelling, (1e), Springer, 2013
Assessment Plan:
Criteria, description, and maximum marks:
Internal Assessment (Summative)
• Sessional Exam I (Closed Book Exam): 20
• Sessional Exam II (Closed Book Exam): 20
• Quiz (10), Class Assignments & Project (10): 20
End Term Exam (Summative)
• End Term Exam: 40
Total: 100
Attendance (Formative)
• A minimum of 75% attendance is required to be maintained by a student to
be qualified for taking up the End Semester examination. The allowance of
25% includes all types of leaves including medical leaves.
What is Big Data?
• “Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making” (Gartner).
• Big Data refers to complex and large data sets that
have to be processed and analyzed to uncover
valuable information that can benefit businesses and
organizations.
• A few basic tenets of Big Data make it simpler to
answer the question of what Big Data is:
– It refers to a massive amount of data that keeps on growing
exponentially with time.
– It is so voluminous that it cannot be processed or analyzed using
conventional data processing techniques.
– It includes data mining, data storage, data analysis, data sharing,
and data visualization.
– The term is an all-comprehensive one including data, data
frameworks, along with the tools and techniques used to process
and analyze the data.
Benefits of Big Data and Data Analytics
• Big data makes it possible for you to gain more complete answers because you have
more information.
• More complete answers mean more confidence in the data—which means a
completely different approach to tackling problems.
Types of Big Data
Structured:
• Structured data is data that can be processed, stored, and retrieved in a
fixed format.
• It refers to highly organized information that can be readily and
seamlessly stored in and accessed from a database by simple search
algorithms.
• For example, the employee table in a company database is structured:
the employee details, their job positions, their salaries, etc., are
present in an organized manner.
Unstructured:
• Unstructured data refers to the data that lacks any specific form or
structure whatsoever.
• This makes it very difficult and time-consuming to process and analyze
unstructured data.
• Email is an example of unstructured data.
Semi-structured:
• Semi-structured data is the third type of big data.
• Semi-structured data contains both of the formats mentioned above,
that is, structured and unstructured data.
• To be precise, it refers to data that has not been classified under a
particular repository (database) but still contains vital information or
tags that segregate individual elements within the data.
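As a rough illustration of the three types, the sketch below holds one hypothetical employee record in each form. The field names and values are invented for this example and are not taken from any real system.

```python
import json

# Structured: fixed columns in a fixed order, like a row of a database table.
# Assumed column order: (employee id, name, position, salary).
structured_row = ("E101", "Asha", "Analyst", 52000)

# Semi-structured: no fixed table schema, but tags (keys) label each element.
semi_structured = '{"id": "E101", "name": "Asha", "skills": ["SQL", "R"]}'
record = json.loads(semi_structured)

# Unstructured: free text (for example, an email body) with no tags at all.
unstructured = "Hi team, Asha from analytics will join the call at 3 pm."

print(structured_row[2])        # access by position in the fixed format
print(record["skills"])         # access by tag
print("Asha" in unstructured)   # only a crude text search is possible
```

The point of the sketch is the access pattern: position for structured data, tags for semi-structured data, and text search for unstructured data.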
Why is Big Data Important?
• The importance of big data does not revolve around how much
data a company has but how a company utilizes the collected
data.
• Every company uses data in its own way; the more efficiently a
company uses its data, the more potential it has to grow.
• The company can take data from any source and analyze it to
find answers which will enable:
– Cost Savings
– Time Reductions
– Understand the market conditions
– Control online reputation
– Using Big Data Analytics to Solve Advertisers Problem and Offer
Marketing Insights
– Big Data Analytics As a Driver of Innovations and Product
Development
1. Cost Savings: Big data tools such as Hadoop and cloud-based analytics can bring cost advantages to a
business when large amounts of data are to be stored, and these tools also help in identifying more
efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately
and make quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions.
For example, by analyzing customers’ purchasing behaviors,
a company can find out which products sell the most and
produce products according to this trend.
In this way, it can get ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment analysis, so you can get feedback about
who is saying what about your company. If you want to monitor and improve the online presence of your
business, big data tools can help.
5. Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights:
Big data analytics can help change all business operations. This includes the ability to match customer
expectations, change the company’s product line, and ensure that marketing campaigns are
powerful.
6. Big Data Analytics As a Driver of Innovations and Product Development:
Another major advantage of big data is its ability to help
companies innovate and redevelop their products.
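The purchasing-behavior example under "Understand the market conditions" can be sketched in a few lines. This is a minimal illustration on a hypothetical purchase log; a real analysis would run over far larger data with dedicated tools.

```python
from collections import Counter

# Hypothetical purchase log: one product name per sale (invented data).
purchases = ["pen", "notebook", "pen", "bag", "pen", "notebook"]

# Count how often each product was sold.
sales_by_product = Counter(purchases)

# The two best-selling products, most frequent first.
top_products = sales_by_product.most_common(2)
print(top_products)  # [('pen', 3), ('notebook', 2)]
```

A company could use such a ranking to decide which products to produce more of.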
Business Analytics
• Business analytics is the discovery and communication of meaningful patterns in
data and their application to business-related problems.
• Business analytics is the scientific process of transforming data into insight for making
better decisions. (INFORMS)
• Business analytics is the extensive use of data, statistical and quantitative tools,
explanatory and predictive models, and fact-based management to drive
decisions.
Business analytics is the use of:
Data,
information technology,
statistical analysis,
quantitative methods, and
mathematical or computer-based models
to help managers to gain improved insight about their business operations and make
better, fact-based decisions.
Benefits of Business Analytics:
• Enables data-driven decision making that has the potential to
increase profits and improve efficiency
• With predictive analytics, allows businesses to plan for the
future in ways that were previously impossible
• Helps a company make informed business decisions
• Minimizes guesswork by modeling outcomes and understanding
the past
• Presents meaningful, clear data to support decision making
and convince stakeholders
Goals of business analytics
Business analytics encompasses all the key informational and decisional attributes of any business, and
it is vitally important that business analytics features in the overall strategic vision of all businesses.
The major goals of business analytics include:
• Providing real-time, actionable information aimed at superior business decision making.
• Providing tools at all levels of an organization to help decision making around customer goals and
profits while comparing performance.
• Providing analysis that helps the business forecast the future with greater objectivity and accuracy.
• Providing the insight and understanding to support informed decisions and confident actions and
providing the feedback that is needed to create a learning organization.
Characteristics of business analytics
The following characteristics of business analytics identify its uniqueness:
1. Purposive: Business analytics must be purposeful: we need to know why the analysis is performed and
delivered. The understanding derived from analysis must align with business functions (finance,
marketing, sales, etc.) and with the issues and objectives of management (performance, growth,
compliance, risk, profitability, etc.).
2. Intuitive: Business analytics is insightful; it helps uncover new facts or information and helps
managers become aware of previously hidden patterns.
3. Expedient: An expedient output or action plan makes an application practical, which means
a business manager should be able to act upon the recommendations of an insight.
Domains of business analytics
Domains refer to the variety of activities within a business. Business analytics is built around (but not
limited to) the following analytical domains:
• Human resource analytics
• Supply chain analytics
• Customer analytics
• Business processes analytics
• Financial analytics
Human resource analytics:
• Human resource analytics is defined as the analysis of human resources (employees), which embodies
the entire life cycle of an employee, such as recruitment, managing performance, incentives, and
employee engagement.
• Using the right metrics can improve policies and procedures, increase team members’ satisfaction and
retention, focus employee training and support, improve morale, reduce costs, and increase
productivity.
• The activities impacted by human capital involve recruitment, training, employee relationships,
employee satisfaction, and turnover.
Supply chain analytics:
• Supply chain analytics refers to the analysis of a firm’s delivery processes, which includes acquisition
of vendors, the sourcing of factors, inventory analytics, transportation and customer delivery network
efficiency, vendor management, and sourcing efficiency.
• The baseline for strategic sourcing initiatives serves as an enabler for process improvement. Further,
supply chain analytics is a measurement device for cost-reduction programs, providing comprehensive
spend visibility of both direct and indirect expenses on commodities and services, significant
cost-saving opportunities through supplier and commodity consolidation, and enhanced compliance through
effective spend and supplier monitoring.
Customer analytics:
• Customer analytics is an understanding of customers, the customer life cycle, their product needs, and
customer satisfaction.
• Customer analytics is the systematic interpretation of a business’s customer information to retain
profitable customers and proactively build relationships with them.
• Customer behavioural analysis seeks to identify and weigh the relative importance of the factors
customers use to choose one product over another.
• Customer profiling is a tool that helps businesses better understand customers so they can increase sales
and grow their business. Customer profiles can also help develop targeted marketing plans and ensure
that products meet the needs of their intended audience.
• By understanding the variables that influence individual decisions, businesses are better able to
influence the outcome. Customer decision making will rely heavily on considerations using individuals
as the unit of analysis.
Financial analytics:
• Financial analytics is defined as the analysis of the financial impact of business analytics.
• One aspect of financial analytics is the opportunity of working with net (final) figures, which are
derived after taxes, duties, and penalties, or capital charges are charged to the business.
• It embodies the versatility of the risks of doing business, and also translates such risks to net turnout.
• Financial analytics enables businesses to maintain cash flow, spread, and liquidity; manage pricing value
acquisitions; control investments in new products and working capital; and plan funds and directed
investments. Improving financial performance and expense control, through organized monitoring of
expenses, drives profitability across business units, geographic locations, products, or channels.
Methods of Business Analysis
There are four primary methods of business
analysis:
Descriptive
Diagnostic
Predictive
Prescriptive
Descriptive: The interpretation of historical data to identify trends and patterns.
Diagnostic: A form of advanced analytics that examines data or content to answer the
questions “Why did it happen?” and “How did it happen?”
Predictive:
• Explores relationships in data that may not be directly revealed by descriptive or
diagnostic analysis.
• Analyzes past performance.
• Predicts the future based on probability and statistical models.
• It provides answers to questions such as:
– What is happening (standard reporting)
– How many, how often, where (ad hoc reporting)
– What exactly the problem is (drill down)
Prescriptive:
The application of testing and other techniques to determine which outcome will yield the best
result in a given scenario.
Deciding which method to employ is dependent on the business situation at hand.
Uses optimization techniques (determining new ways to evaluate and target business objectives
while balancing possible constraints).
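To make the distinction between the methods concrete, the sketch below applies descriptive, predictive, and prescriptive steps to the same small series. All numbers, the straight-line trend model, the candidate stock levels, and the cost weights are assumptions chosen for illustration only.

```python
from statistics import mean

# Hypothetical monthly sales (units) for six months.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 125, 130, 145, 150]

# Descriptive: summarize what has already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares straight line and extrapolate to month 7.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx
forecast = intercept + slope * 7

# Prescriptive: choose the stock level with the lowest cost for that forecast.
# Assumed costs: 1 per unit of overstock, 3 per unit of unmet demand.
def expected_cost(stock, demand, over_cost=1.0, under_cost=3.0):
    return over_cost * max(0.0, stock - demand) + under_cost * max(0.0, demand - stock)

candidate_stock = [150, 160, 170]
best_stock = min(candidate_stock, key=lambda s: expected_cost(s, forecast))

print(round(average_sales, 2), round(forecast, 2), best_stock)
```

The descriptive step only reports the past, the predictive step estimates an unknown future value, and the prescriptive step uses that estimate to recommend an action under stated constraints.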
Framework of business analytics
Business Analytics Process
What is a dataset?
• A single row of data is called an instance.
• Datasets are a collection of instances that
all share a common attribute.
• Predictive models will generally contain a
few different datasets, each used to fulfil
various roles in the system.
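A minimal sketch of these ideas follows, assuming a hypothetical housing dataset and a 60/20/20 split into training, validation, and test sets. The split ratio and the attribute names are illustrative choices, not something prescribed by the course.

```python
import random

# Hypothetical dataset: each dict is one instance (one row of data).
dataset = [{"area_sqft": 600 + 50 * i, "price": 100 + 8 * i} for i in range(10)]

# Shuffle a copy with a fixed seed so the split is repeatable.
rng = random.Random(0)
shuffled = dataset[:]
rng.shuffle(shuffled)

# Each subset plays a different role when building a predictive model.
train_set = shuffled[:6]        # used to fit the model
validation_set = shuffled[6:8]  # used to tune and compare models
test_set = shuffled[8:]         # held back for the final performance check

print(len(train_set), len(validation_set), len(test_set))  # 6 2 2
```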
What type of data do predictive models need?
• Data can come in many forms, but predictive models rely on
four primary data types. These include numerical data,
categorical data, time series data, and text data.
Types of Data
• There are different types of data in Statistics, that are collected, analysed,
interpreted and presented.
• The data are the individual pieces of factual information recorded, and it is used for
the purpose of the analysis process.
• The two processes of data analysis are interpretation and presentation.
• Statistics are the result of data analysis.
Numerical Data
• Numerical data is any data where data points are exact
numbers.
• Statisticians may also call numerical data quantitative data.
• This data has meaning as a measurement, such as house prices,
or as a count, such as the number of residential properties in Los
Angeles or how many houses were sold in the past year.
• Numerical data can be characterized as continuous or
discrete data. Continuous data can assume any value within a
range, whereas discrete data has distinct values.
• For example, the number of students taking a Python
class would be a discrete data set.
• You can only have discrete whole-number values like
10, 25, or 33.
• A class cannot have 12.75 students enrolled. A student
either joins a class or does not.
• On the other hand, continuous data are numbers that
can fall anywhere within a range. Like a student could
have an average score of 88.25 which falls between 0
and 100.
• The takeaway here is that numerical data is not ordered
in time. They are just numbers that we have collected.
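The discrete versus continuous distinction can be sketched as below. The `is_discrete` helper is a hypothetical heuristic written for this example: it only checks for whole numbers, so continuous measurements that happen to be whole would be misclassified.

```python
from statistics import mean

enrolments = [10, 25, 33]       # discrete: counts of students per class
scores = [88.25, 91.5, 76.0]    # continuous: any value between 0 and 100

def is_discrete(values):
    """Heuristic: treat the data as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)

print(is_discrete(enrolments))  # True
print(is_discrete(scores))      # False
print(mean(scores))             # 85.25
```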
Categorical Data
• As the name suggests, this encompasses data that
can be represented through words.
• It usually defines groups or categories & is
therefore known as categorical data.
• Some examples are the names of all items in a
supermarket, movie ratings (good, average, bad),
country of birth of individuals and so on.
• Categorical data can take numerical values.
• For example, maybe we would use 1 for the colour
red and 2 for blue. But these numbers don’t have a
mathematical meaning. That is, we can’t add them
together or take the average.
• In the context of classification,
categorical data would be the
class label, for example whether
a person is a man or a woman, or
whether a property is
residential or commercial.
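Because numeric codes such as 1 for red and 2 for blue carry no mathematical meaning, a common alternative is one-hot encoding, in which each category gets its own 0/1 column. The sketch below is a minimal hand-written version on invented colour data; it is one possible encoding, not the only one.

```python
colours = ["red", "blue", "red", "green"]

# Fix a column order for the categories.
categories = sorted(set(colours))  # ['blue', 'green', 'red']

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(c, categories) for c in colours]
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

No column is "larger" than another here, so a model cannot mistake the encoding for a ranking or an amount.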
Nominal:
This type of data has categories that
don’t have any particular order or
ranking associated with them. The
total number of categories is usually
finite in this type of data as well.
Examples will be the country of birth
of individuals, all items in a
supermarket, educational degrees of
individuals, and so on.
Ordinal:
• This type of data has an inherent ordering present
within the categories.
• For instance, if you consider movie ratings with good,
average & bad as the different categories, good has a
higher ranking than average which is higher than bad.
• This needs to be taken into account while converting
this type of data into numbers so that the models can
learn this ranking as well.
• There is a fixed, finite number of categories/groups.
• Examples include movie ratings, student grades,
employee performance, and so on.
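One simple way to keep the ranking when converting ordinal data into numbers is to map each category to its position in the order. The sketch below uses the movie-rating example; the specific integer codes are an assumption, and only their order matters.

```python
# Categories listed from lowest to highest rank.
rating_order = ["bad", "average", "good"]
rank = {label: position for position, label in enumerate(rating_order)}

ratings = ["good", "bad", "average", "good"]
encoded = [rank[r] for r in ratings]
print(encoded)  # [2, 0, 1, 2]
```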
Unique:
• This type of data has a unique value for each sample
and the number of categories is usually large.
• Sometimes it is so large that it cannot be called
categorical data, but it still consists of letters and
digits.
• Examples are product IDs of all items in a store, student
numbers of all individuals in a college, postal codes of
individuals’ birthplaces, and so on.
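A rough way to spot this kind of column is to compare the number of distinct values with the number of rows. The sketch below uses invented product rows; `unique_ratio` is a hypothetical helper, and any cut-off applied to its result would be a judgment call.

```python
# Hypothetical store records (invented for illustration).
rows = [
    {"product_id": "P001", "category": "stationery"},
    {"product_id": "P002", "category": "stationery"},
    {"product_id": "P003", "category": "bags"},
]

def unique_ratio(rows, column):
    """Fraction of rows that hold a distinct value in `column`."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)

print(unique_ratio(rows, "product_id"))  # 1.0, every row has its own value
print(unique_ratio(rows, "category"))    # below 1.0, values repeat
```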
Time Series Data
• Time series data is a sequence of
numbers collected at regular
intervals over some period of
time. It is very important,
especially in particular fields like
finance.
• Time series data has a temporal
value attached to it, such as a
date or a timestamp, so that you
can look for trends over
time.
• For example, we might measure the
average number of home sales for
many years. The difference between
time series data and numerical data is
that rather than having a bunch of
numerical values that don’t have
any time ordering, time-series data
does have some implied ordering.
There is a first data point collected
and a last data point collected.
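The role of ordering can be sketched as follows: the observations are first sorted by their timestamp, and only then does an order-dependent calculation such as a moving average make sense. The yearly home-sales figures and the window size are invented for this example.

```python
from datetime import date

# Hypothetical yearly home-sales observations, deliberately out of order.
observations = [
    (date(2021, 1, 1), 120),
    (date(2019, 1, 1), 100),
    (date(2020, 1, 1), 90),
    (date(2022, 1, 1), 150),
]

# Restore the time ordering: tuples sort by their first element, the date.
series = sorted(observations)
values = [value for _, value in series]  # [100, 90, 120, 150]

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

print(moving_average(values, 2))  # [95.0, 105.0, 135.0]
```

Running the same moving average on the unsorted values would give a different, meaningless result, which is the practical difference from plain numerical data.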
Text
• Text data is essentially just words.
Often the first step with text is
to turn it into numbers using a
technique such as the
bag-of-words representation
(used in Natural Language Processing
and Information Retrieval).
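A minimal bag-of-words sketch is shown below. It assumes whitespace tokenisation and lower-casing on two invented sentences; real pipelines usually add steps such as punctuation handling and stop-word removal.

```python
from collections import Counter

documents = ["big data needs big tools", "data tools process data"]

# Split each document into lower-case word tokens.
tokenised = [doc.lower().split() for doc in documents]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for tokens in tokenised for word in tokens})

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenised]
print(vocabulary)  # ['big', 'data', 'needs', 'process', 'tools']
print(vectors)     # [[2, 1, 1, 0, 1], [0, 2, 0, 1, 1]]
```

Each document becomes a numeric vector of word counts; word order is discarded, which is why the representation is called a "bag".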