Big Data Issues & Introduction to Business Intelligence Dr. Chang Liu My Story … Chair and Professor of MIS at the OM&IS Department, College of Business, Northern Illinois University. • Taught Database Management, Web Computing, Business Information Technologies, and Business Intelligence Applications courses • Received MIS Undergraduate Teaching Award in 2005. • Received MIS Graduate Teaching Awards in 2001, 2002, and 2006. • Recognized as top 20 researchers by Information & Management in 2007 • http://www.niutoday.info/2013/02/15/omis-chair-earns-2012-citation-of-excellence/ Born and raised in Beijing, China PH.D. from Mississippi State University in 1997: Doctor of Business Administration in Management Information Systems (MIS) Taught at Beijing Institute of Business from 1988 to 1992 Served as a Project Manager for Motorola Inc. Beijing office from 1992 to 1994 Interest: • Ping Pong, Badminton Department of Operations Management & Information Systems OM & IS http://cob.niu.edu/omis + Information Technology Business Processes OM&IS Programs • B.S. in Operations and Information Management • M.S. in Management Information Systems • Certificate Programs • • • • • • Certificate of Graduate Study in Business Analytics Using SAP Software Certificate of Undergraduate Study in Business Analytics Using SAP Software Certificate of Graduate Study in Management Information Systems Certificate of Undergraduate Study in Information Systems Certificate of Undergraduate Study in Service Management SAP Student Recognition Award OM&IS Department 11 tenured & tenure-track faculty 7 full-time/part-time instructors 1 program advisor & internship coordinator 1 department secretary 200 Undergraduate OM&IS majors 95 MIS master students Top Ranked Programs In 2013, Businessweek Ranked NIU College of Business’s Information Systems Program #34 and Operations Management Program #60 nationwide. Introductions Course Syllabus Name Background Hobbies Big Data Issue Big data is a term that describes the exponential growth and availability of data. It is estimated that by the year of 2020 the digital universe will have grown 44 times its size in 2009. Businesses must employ a data management, analysis, and security plan in order to stay compliant and competitive. Big Data Big Opportunities Business & Technology Priorities in 2013 Top 10 Business Priorities Rank Top 10 Technology Priorities Rank Increasing enterprise growth 1 Business intelligence and analytics 1 Delivering operational results 2 Mobile technologies 2 Reducing enterprise costs 3 Cloud computing 3 Attracting and retaining new customers 4 Collaboration technologies 4 Improving IT applications & infrastructure 5 Legacy modernization 5 Creating new products & services (innovation) 6 IT management 6 Improving efficiency 7 CRM 7 Attracting and retaining the workforce 8 Virtualization 8 Implementing analytics and big data 9 Security 9 Improving business processes 10 ERP Applications 10 Source: Gartner EXP (January 2013) Worldwide BI, Analytics and Performance Management Revenue Estimates for 2011 (Millions of U.S Dollars) 20102011 2010 2011 2011 Market 2010 Market Growth Company Revenue Share (%) Revenue Share (%) (%) SAP 2,883.5 23.6 2,413.1 23.0 19.5 Oracle 1,913.5 15.6 1,645.8 15.7 16.3 Institute 1,542.8 12.6 1,386.5 13.2 11.3 IBM 1,477.6 12.1 1,222.0 11.6 20.9 913.7 8.7 16.0 SAS Microsoft 1,059.9 8.7 Source: Gartner(March2012), http://www.gartner.com/it/page.jsp?id=1971516 Others 3,363.8 27.5 2,931.1 27.9 14.8 Total 12,241.0 100.0 10,512.2 100.0 16.4 How Come It Takes Me So Long to Get Answers to Simple Questions About My Business? Technologies for Business Intelligence Why BI / What’s the problem? Businesses (people, really) can’t get answers efficiently. 32%!!! An example – Let’s start small Two spreadsheets. One has student name, znumber (student id) and major, the other has student name, znumber and quiz score. ?? What’s “the answer”? Database approach Centralized, any-time, any-place data in a data warehouse. And the biggest? The “HITECH Act” The purpose of the HITECH Act is to promote the use of health information technology with a goal of utilization of an electronic health record for each person in the United States by 2014. What can YOU do? (technology-wise) Step 1: Get familiar with Microsoft Access (or any Relational Database Management System, RDBMS). Step 2: Make Access centrally accessible to your employees. Step 3: Building a company-wide data warehouse. What can YOU do? (process-wise) Step 1: Identify where your corporate information comes from. Step 2: For things spreadsheetbased, consider moving that data out of spreadsheets. Some Definitions -1 “Data” is characters, fields, and files that are stored “Information” is data with meaning and context. It is an organizational asset. A database is a collection of related data. A relational database is data stored in a table format. somewhere. The most common use of a database is an “ad hoc” query. For example, “How many cases of bottled water did we sell to college students in September vs. August?” Relationships of Users, Database Applications, DBMS, and Database Application #1 Application #2 Application #3 DBMS Database containing centralized shared data Hierarchy of Database Elements Fields Records Tables (files) Database + Metadata Class Exercise – Database Structure You have been hired as the Personnel Director for a medium-sized firm (500 employees) and are expected to implement a database system to track employee compensation. You want to be able to calculate the age of every employee as well as length of service. You want to know each employee’s most recent performance evaluation. You want to be able to calculate the amount of the most recent salary increase, both in dollars and as a percentage of the previous salary. You also want to know how long the employee had to wait for that increase --that is, how much time elapsed between the present and previous salary. Design a database table capable of providing this information. Q. How can you make it work? A. Centralized database, allowing for BI and mining. Server - responds to client requests DBMS - the program. Manages interaction with databases. request response Client - makes requests of the DBMS server database - the collection of data. Created and defined to meet the needs of the organization. Database Management System (DBMS) • a program for creating & managing databases; ex. Oracle, MS-Access, SQL Server, Sybase. • Basically synonymous with “database” at this point. What’s the File Processing The “old” way of doing things; still often used in practice. Separate information stored on separate files. File-Processing Systems Problems with File-Processing Systems Data are separated and isolated Data are often duplicated Application program dependent Incompatible data files Difficult to understand Create problems with data integrity because data is: Duplicated & Inconsistent File System: An Example Duplicate Data Benefits of Database Systems Data Data Data Data Data Data is integrated duplication is reduced is consistence is program independent is easy to understand is: Shared & Integrated Class Exercise – Database Relationship How to create two tables in a database, one for Customer, the other for Order information? How to build a relationship between the two tables in a database? Some Definitions - 2 A relational database has numerous tables (like spreadsheets) which are tied together by common fields. Primary key: A field that uniquely identifies each record in a table. Common Field (Foreign Key): A field that appears as a non primary key (field) in one table and as a primary key in another table. Database Management? Ugh, just give me a spreadsheet There are a number of ‘versions’ of the spreadsheet system, all with common features … • • • • Lack Lack Lack Lack of of of of data definition/documentation ownership support change control But Excel can be a good front end tool: Pivot Tables & PowerPivot Pivot Tables can be used for data analysis and presentation as a front end tool if all of your data in one spreadsheet. Demo Exercise Assignments Business Intelligence (BI) Definition BI is a set of technologies for tuning raw data into actionable information • Leads to better decisions that are in line with business goals and objectives • Helps organizations operate more efficiently • Can lead to the discovery of new opportunities • The BI market is growing rapidly. BI gives decision makers and operational staff access to ORGANIZATIONAL data • Allows them to interact with the data • Analyzes it • Uses the data to perform forecasts, etc. Data Warehouse A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes • Subject-oriented: e.g. customers, patients, students, products • Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources • Time-variant: can study trends and changes • Non-updatable: read-only, periodically refreshed Need for Data Warehousing Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance) • Operational system – a system that is used to run a business in real time, based on current data; also called a system of record • Informational system – a system designed to support decision making based on historical point-intime and prediction data for complex queries or data-mining applications Data Warehouse Architectures Independent Data Mart Dependent Data Mart and Operational Data Store Logical Data Mart and Near RealTime Data Warehouse Three-Layer architecture All involve some form of extract, transform and load (ETL) Data Mart: A data warehouse that is limited in scope to support a single business function or process. Figure 1: Independent data mart data warehousing architecture Data marts: Mini-warehouses, limited in scope L T E Separate ETL for each independent data mart Data access complexity due to multiple data marts Figure 2: Dependent data mart with operational data store: a three-level architecture ODS provides option for obtaining current data L E T Simpler data access Single ETL for enterprise data warehouse (EDW) Dependent data marts loaded from EDW Figure 3:Three-layer data architecture for a data warehouse Derived Data Objectives • • • • • Ease of use for decision support applications Fast response to predefined user queries Customized data for particular target audiences Ad-hoc query support Data mining capabilities Star Schema – A simple database design in which dimensional data are separated from fact data Most common data model = star schema (also called “dimensional model”) How the dimensional model can solve the problem of analyzing data? A retailer, John Doe, sells products to customers over a period of time Here are some questions John needs to ask to analyze his business: • • • • • • How many units of products did I sell altogether? Which products sold the greatest number of units? How did sales in Week 1 compare with sales in Week 2? How did sales perform by street? Who are my top two customers? How do sales split by customer gender and by street? To provide a solution for John, most people think SQL is the answer – write different SQL for each question • Slow, more joins, more loads on server Dimensional Model (Cube) to the Rescue John needs to have BI tool (such as SQL Server Analysis Services or SAP Business Information Warehouse) to create a cube for him • The cube stores the numeric answers for all combinations of product, store, customer, and time • The numbers can easily be analyzed by customer, by store, and by product. Customer, Time, Store, and Product are the cube’s dimensions • John does not need to run database queries as the answers are pre-calculated • All he needs is a nice client interface; e.g., Microsoft Excel Figure 4: Components of a star schema Fact tables contain factual or quantitative data 1:N relationship between dimension tables and fact tables Dimension tables are denormalized to maximize performance Dimension tables contain descriptions about the subjects of the business Figure 5: Star schema example Fact table provides statistics for sales broken down by product, period and store dimensions Figure 6: Star schema with sample data Cubes are intuitive, which makes them easy for non-technical users Different Names, Same Ideas • Data Mining - uncovers important patterns in existing data to support decision making. • Online Analytical Processing (OLAP) – User-driven discovery with multidimensional views of their data • Data Warehousing Example: Sales Data in DB Question 1: How a star schema can be designed to represent data in a cube? Question 2: Which salespeople do better with certain customers? Class Exercise – Draw The Star Schema? Case Study – Dimensional Modeling