Predictive Analytics (CS 3142)
Academic year: 2022-23

Introduction
This course is offered by the Dept. of Computer Science & Engineering as a program elective, targeting students who wish to pursue research and development in industry or higher studies in the field of Data Analytics. The objective of the course is to demonstrate the steps involved in analysing data and designing predictive models for it. Students will learn the principles and techniques of classification, regression, and time series analysis, and will apply them to practical automation applications such as forecasting, prediction, research, and analytics.

Course Outcomes
At the end of the course, you will be able to:
• Understand the principles and techniques of exploratory data analysis.
• Apply predictive classification and regression algorithms to enhance your technical skills.
• Analyse the performance of predictive models and apply a suitable model to the given data.
• Use data analysis for practical real-world prediction applications, and hence enhance employability.

Program Outcomes:
• PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
• PO2: Problem analysis: Identify, formulate, research literature, and analyse complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
• PO3: Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
• PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
• PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modelling to complex engineering activities with an understanding of the limitations.
• PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal, and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
• PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
• PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practices.
• PO9: Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
• PO10: Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
• PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
• PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.

Program Specific Outcomes
• PSO1: Will be able to design, develop and implement efficient software for a given real-life problem.
• PSO2: Will be able to apply knowledge of AI, Machine Learning and Data Mining in analysing big data for extracting useful information from it and for performing predictive analysis.
• PSO3: Will be able to design, manage and secure wired/wireless computer networks for transfer and sharing of information.

Syllabus
Introduction: Business analytics: types, applications; Models: predictive models, descriptive models, decision models, applications, analytical techniques.
Understanding Data: Data types and associated techniques, complexities of data, data preparation, preprocessing, exploratory data analysis.
Principles and Techniques: Predictive modelling (propensity models, cluster models, collaborative filtering, applications and limitations); Statistical analysis (univariate statistical analysis, multivariate statistical analysis).
Model Selection: Preparing to model the data: supervised versus unsupervised methods, statistical and data mining methodology, cross-validation, overfitting, bias-variance trade-off, balancing the training dataset, establishing baseline performance.
Regression Models: Measuring Performance in Regression Models, Linear Regression and Its Cousins, Non-Linear Regression Models, Regression Trees and Rule-Based Models, and relevant Case Studies.
Classification Models: Measuring Performance in Classification Models, Discriminant Analysis and Other Linear Classification Models, Non-Linear Classification Models, Classification Trees and Rule-Based Models, Model Evaluation Techniques.
Time Series Analysis: ARMA, ARIMA, ARFIMA; temporal mining; Box-Jenkins method, temporal reasoning, temporal constraint networks.

Reference Book(s):
1. I. D. Dinov, Data Science and Predictive Analytics: Biomedical and Health Applications using R, Springer, 2018.
2. A. Bari, M. Chaouchi, T. Jung, Predictive Analytics For Dummies, (2e), Wiley, 2016.
3. Jeffrey Strickland, Predictive Analytics using R, Simulation Educators, Colorado Springs, 2015.
4. Daniel T. Larose, Chantal D. Larose, Data Mining and Predictive Analytics, (2e), Wiley, 2015.
5. Max Kuhn and Kjell Johnson, Applied Predictive Modeling, (1e), Springer, 2013.

Assessment Plan:
Criteria                          Description                                    Maximum Marks
Internal Assessment (Summative)   Sessional Exam I (Closed Book Exam)            20
                                  Sessional Exam II (Closed Book Exam)           20
                                  Quiz (10), Class Assignments & Project (10)    20
End Term Exam (Summative)         End Term Exam                                  40
                                  Total                                          100
Attendance (Formative): A minimum of 75% attendance is required for a student to qualify for the End Semester examination. The 25% allowance covers all types of leave, including medical leave.

What is Big Data?
• “Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (Gartner).
• Big Data refers to large, complex data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.
• A few basic tenets make it simpler to answer the question of what Big Data is:
– It refers to a massive amount of data that keeps growing exponentially with time.
– It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
– It includes data mining, data storage, data analysis, data sharing, and data visualization.
– The term is all-encompassing: it covers the data, the data frameworks, and the tools and techniques used to process and analyze the data.

Benefits of Big Data and Data Analytics
• Big data makes it possible to gain more complete answers because you have more information.
• More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.

Types of Big Data
Structured:
• Structured data is one of the types of big data. It is data that can be processed, stored, and retrieved in a fixed format.
• It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search engine algorithms.
• For example, the employee table in a company database is structured: employee details, job positions, salaries, and so on are present in an organized manner.

Unstructured:
• Unstructured data refers to data that lacks any specific form or structure whatsoever.
• This makes it very difficult and time-consuming to process and analyze.
• Email is an example of unstructured data.

Semi-structured:
• Semi-structured data is the third type of big data.
• It combines features of both formats mentioned above, that is, structured and unstructured data.
• To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data. JSON and XML files are typical examples.

Why is Big Data Important?
• The importance of big data does not revolve around how much data a company has but around how the company utilizes the collected data.
• Every company uses data in its own way; the more efficiently a company uses its data, the more potential it has to grow.
• A company can take data from any source and analyze it to find answers that enable:
– Cost savings
– Time reductions
– Understanding the market conditions
– Controlling online reputation
– Using big data analytics to solve advertisers' problems and offer marketing insights
– Big data analytics as a driver of innovation and product development

1. Cost Savings: Tools such as Hadoop and cloud-based analytics can bring cost advantages when large amounts of data must be stored, and they also help identify more efficient ways of doing business.
2. Time Reductions: The high speed of tools such as Hadoop and in-memory analytics makes it easy to identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better understanding of current market conditions. For example, by analyzing customers’ purchasing behavior, a company can find out which products sell the most and produce products according to this trend. In this way it can get ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment analysis, so you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help.
5. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights: Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company’s product line, and ensure that marketing campaigns are powerful.
6. Big Data Analytics as a Driver of Innovation and Product Development: Another major advantage of big data is its ability to help companies innovate and redevelop their products.

Business Analytics
• Business analytics is the discovery and communication of meaningful patterns in data, applied to business-related problems.
• Business analytics is the scientific process of transforming data into insight for making better decisions. (INFORMS)
• Business analytics is the extensive use of data, statistical and quantitative tools, explanatory and predictive models, and fact-based management to drive decisions.
• Business analytics is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight into their business operations and make better, fact-based decisions.

Benefits of Business Analytics:
• Enables data-driven decision making that has the potential to increase profits and improve efficiency.
• With predictive analytics, allows businesses to plan for the future in ways that were previously impossible.
• Helps a company make informed business decisions.
• Minimizes guesswork by modeling outcomes and understanding the past.
• Presents meaningful, clear data to support decision making and convince stakeholders.

Goals of business analytics
Business analytics encompasses all the key informational and decisional attributes of a business, and it is vitally important that it features in the overall strategic vision of every business. The major goals of business analytics include:
• Providing real-time, actionable information aimed at superior business decision making.
• Providing tools at all levels of an organization to help decision making around customer goals and profits while comparing performance.
• Providing analysis that helps the business forecast the future with greater objectivity and accuracy.
• Providing the insight and understanding to support informed decisions and confident actions, and providing the feedback needed to create a learning organization.

Characteristics of business analytics
The following characteristics of business analytics identify its uniqueness:
1. Purposive: Business analytics must be clear about why the analysis is performed and what it delivers. The understanding derived from analysis must align with business functions (finance, marketing, sales, etc.) and with the issues and objectives of management (performance, growth, compliance, risk, profitability, etc.).
2. Intuitive: Business analytics is insightful; it helps uncover new facts or information and makes managers aware of previously hidden patterns.
3. Expedient: An expedient output or action plan makes an application feasible, which means a business manager should be able to act upon the recommendations of an insight.

Domains of business analytics
Domains refer to the variety of activities within a business. Business analytics is built around (but not limited to) the following analytical domains:
• Human resource analytics
• Supply chain analytics
• Customer analytics
• Business processes analytics
• Financial analytics

Human resource analytics:
• Human resource analytics is the analysis of human resources (employees). It covers the entire life cycle of an employee, including recruitment, performance management, incentives, and employee engagement.
• Using the right metrics can improve policies and procedures, increase team members’ satisfaction and retention, focus employee training and support, improve morale, reduce costs, and increase productivity.
• The activities impacted by human capital involve recruitment, training, employee relationships, employee satisfaction, and turnover.

Supply chain analytics:
• Supply chain analytics refers to the analysis of a firm’s delivery processes, which includes acquisition of vendors, the sourcing of factors, inventory analytics, transportation and customer delivery network efficiency, vendor management, and sourcing efficiency.
• Supply chain analytics provides the baseline for strategic sourcing initiatives and acts as an enabler for process improvement. It is also a measurement device for cost-reduction programs: it provides comprehensive spend visibility of both direct and indirect expenses on commodities and services, reveals significant cost-saving opportunities through supplier and commodity consolidation, and enhances compliance through effective spend and supplier monitoring.

Customer analytics:
• Customer analytics is an understanding of customers, the customer life cycle, their product needs, and customer satisfaction.
• Customer analytics is the systematic interpretation of a business’s customer information to retain profitable customers and proactively build relationships with them.
• Customer behavioural analysis seeks to identify and weigh the relative importance of the factors customers use to choose one product over another.
• Customer profiling is a tool that helps businesses better understand their customers so they can increase sales and grow. Customer profiles can also help develop targeted marketing plans and ensure that products meet the needs of their intended audience.
• By understanding the variables that influence individual decisions, businesses are better able to influence the outcome. Analysis of customer decision making relies heavily on using individuals as the unit of analysis.

Financial analytics:
• Financial analytics is defined as the analysis of the financial impact of business analytics.
• One aspect of financial analytics is the opportunity of working with net (final) figures, which are derived after taxes, duties, penalties, or capital charges have been charged to the business.
• It captures the range of risks of doing business and translates those risks into net turnout.
• Financial analytics enables a business to maintain cash flow, spread, and liquidity; manage pricing and value acquisitions; control investments in new products and working capital; and plan funds and directed investments. Improving financial performance and expense control through organized monitoring of expenses drives profitability across business units, geographic locations, products, or channels.

Methods of Business Analysis
There are four primary methods of business analysis:
• Descriptive
• Diagnostic
• Predictive
• Prescriptive

Descriptive: The interpretation of historical data to identify trends and patterns.
Diagnostic: A form of advanced analytics that examines data or content to answer the questions “Why did it happen?” and “How did it happen?”

Predictive:
• Explores relationships in data that may not be revealed directly by descriptive or diagnostic analysis.
• Analyzes past performance.
• Predicts the future based on probability and statistical models.
• It builds on the reporting questions that descriptive analysis answers:
– What is happening (standard reporting)
– How many, how often, where (ad hoc reporting)
– What exactly the problem is (drill down)
and goes on to ask what is likely to happen next.

Prescriptive: The application of testing and other techniques to determine which course of action will yield the best result in a given scenario. It uses optimization techniques, that is, determining new ways to operate and targeting business objectives while balancing the applicable constraints.

Deciding which method to employ depends on the business situation at hand.

[Figure: Framework of business analytics]
[Figure: Business analytics process]

What is a dataset?
• A single row of data is called an instance.
• A dataset is a collection of instances that all share a common set of attributes.
• Predictive models generally work with a few different datasets, each fulfilling a different role in the system.

What type of data do predictive models need?
• Data can come in many forms, but predictive models rely on four primary data types: numerical data, categorical data, time series data, and text data.

Types of Data
• There are different types of data in statistics that are collected, analysed, interpreted, and presented.
• Data are the individual pieces of factual information recorded and used for the purpose of analysis.
• The two processes of data analysis are interpretation and presentation.
• Statistics are the result of data analysis.

Numerical Data
• Numerical data is any data where the data points are exact numbers.
• Statisticians may also call numerical data quantitative data.
• This data has meaning as a measurement, such as house prices, or as a count, such as the number of residential properties in Los Angeles or the number of houses sold in the past year.
• Numerical data can be continuous or discrete. Continuous data can assume any value within a range, whereas discrete data takes distinct values.
• For example, the number of students taking a Python class is a discrete data set.
• You can only have discrete whole-number values like 10, 25, or 33.
• A class cannot have 12.75 students enrolled. A student either joins the class or does not.
• Continuous data, on the other hand, are numbers that can fall anywhere within a range. For instance, a student could have an average score of 88.25, which falls between 0 and 100.
• The takeaway here is that numerical data is not ordered in time. It is simply a set of numbers that we have collected.

Categorical Data
• As the name suggests, this encompasses data that can be represented through words.
• It usually defines groups or categories and is therefore known as categorical data.
• Some examples are the names of all items in a supermarket, movie ratings (good, average, bad), the country of birth of individuals, and so on.
• Categorical data can take numerical values.
• For example, we might use 1 for the colour red and 2 for blue. But these numbers do not have a mathematical meaning: we cannot add them together or take their average.
• In the context of classification, categorical data would be the class label, for example whether a person is a man or a woman, or whether a property is residential or commercial.

Nominal:
• This type of data has categories that do not have any particular order or ranking associated with them.
• The total number of categories is usually finite.
• Examples are the country of birth of individuals, all items in a supermarket, educational degrees of individuals, and so on.

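The point that category codes such as 1 for red and 2 for blue carry no arithmetic meaning can be made concrete. The sketch below is only an illustration and not part of the course material: the list of colour labels and the helper name one_hot are hypothetical, and one-hot encoding is just one common way to turn unordered categories into numbers without implying a ranking.

```python
# Hypothetical nominal feature: colour labels with no inherent order.
colours = ["red", "blue", "green", "blue", "red"]


def one_hot(values):
    """Return (categories, rows): one 0/1 indicator column per distinct category."""
    # Alphabetical order is used only to keep the column order reproducible.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, encoded = one_hot(colours)
print(categories)
for label, row in zip(colours, encoded):
    print(label, row)
```

Each row contains exactly one 1, so no colour is treated as larger or smaller than another, which is what a single integer code would wrongly suggest.
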
Ordinal:
• This type of data has an inherent ordering among its categories.
• For instance, if you consider movie ratings with good, average, and bad as the categories, good ranks higher than average, which ranks higher than bad.
• This ordering needs to be taken into account when converting the data into numbers, so that models can learn the ranking as well.
• There is a fixed, finite number of categories or groups.
• Examples are movie ratings, student grades, employee performance, and so on.

Unique:
• This type of data has a unique value for each sample, and the number of categories is usually large.
• Sometimes the number is so large that the data can hardly be called categorical, but it still consists of letters and digits.
• Examples are the product IDs of all items in a store, the student numbers of all individuals in a college, the postal codes of individuals' birthplaces, and so on.

Time Series Data
• Time series data is a sequence of numbers collected at regular intervals over some period of time. It is very important, especially in fields like finance.
• Time series data has a temporal value attached to it, such as a date or a timestamp, so you can look for trends over time.
• For example, we might measure the average number of home sales over many years. The difference between time series data and numerical data is that, rather than being a collection of numerical values with no time ordering, time series data has an implied ordering: there is a first data point collected and a last data point collected.

Text
• Text data is basically just words. Often the first thing you do with text is turn it into numbers, for example using the bag-of-words formulation. This is common in Natural Language Processing and Information Retrieval.
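
The bag-of-words idea mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production approach: the two example sentences and the helper name bag_of_words are invented here, and tokenisation is a naive lower-case split on whitespace.

```python
from collections import Counter

# Hypothetical two-document corpus, invented for illustration.
documents = [
    "the house sold fast",
    "the house price rose and the market rose",
]


def bag_of_words(docs):
    """Return (vocabulary, vectors): one word-count vector per document."""
    tokenised = [doc.lower().split() for doc in docs]  # naive whitespace tokeniser
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length vector of word counts over the shared vocabulary. Word order is discarded, which is why the representation is called a "bag".
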