Business Intelligence and Analytics: Overview and Examples
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
hchen@eller.arizona.edu
http://ai.arizona.edu

BI & Analytics: The Field
• The Data Deluge (The Economist, March 2010); internet traffic of 667 exabytes by 2013 (Cisco); total amount of information in 2010: 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB)
• BIG DATA, BIG COMPUTATION, BIG ANALYTICS, BIG (SOCIETAL) IMPACT
• $3B BI revenue in 2009 (Gartner, 2006); $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester)
• IBM spent $14B on BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians; IBM acquired I2/COPLINK in 2011

BI & Analytics: Definition and Components
• BI and Analytics refers to (1) the technologies, systems, practices, and applications that (2) analyze critical business data to (3) help an enterprise better understand its business and market.
• Core technologies: data warehousing; Extraction, Transformation, and Load (ETL); Business Performance Management (BPM); visual dashboards; enterprise text and multimedia search; data and text mining; social network analysis
• BI 2.0 research: web analytics, web 2.0, social media analytics, opinion mining; in-memory and real-time BI; cloud computing, data/web services; Hadoop, MapReduce; stream and mobile data mining

BI Industry and Capabilities (Gartner Report, 2011)
Magic Quadrant for BI Platforms (13 Capabilities)
• Integration (e.g., Microsoft, Oracle, SAP): BI (shared) infrastructure; metadata management; development tools; collaboration
• Information Delivery (e.g., SAP, Microsoft, IBM/Cognos): reporting; dashboards; ad hoc query; Microsoft Office integration; search-based BI (structured and unstructured)
• Analysis (e.g., IBM/SPSS, SAS): OLAP; interactive visualization; predictive modeling and data mining; scorecards

[Figure: Magic Quadrant for Business Intelligence Platforms]
[Figure: Hype Cycle for Business Intelligence, 2011]

BI Hype Cycle (Gartner Report, 2011)
• On the Rise: collaborative decision making; information semantic services; search-based data discovery tools; natural language question answering
• At the Peak: enterprise metadata repositories; BI SaaS; visualization-based data discovery tools; mobile BI; in-memory DBMS
• Sliding into the Trough: real-time decisioning; analytics, content analytics, in-memory analytics, text analytics; open-source BI tools; interactive visualization

BI Hype Cycle (Cont'd)
• Climbing the Slope: BI consulting and system integration; business activity monitoring; column-based DBMS; dashboards, data quality tools; predictive analytics; Excel as a BI front end
• Entering the Plateau: BI platforms; data-mining workbenches

Sample BI Applications (AI Lab)
• Security informatics: securing cyber space, cyber security, predicting Arab Spring; information and system security, enterprise risk management
• Market intelligence: data/text/web mining, web 2.0, social media analytics; big data (volume/variety/velocity/mobility), Hadoop, cloud apps
• Healthcare informatics: healthcare IT integration and solutions, decision support; EHR data/text mining, patient empowerment and social media

(1) BI for Security: COPLINK

COPLINK Identity Resolution and Criminal Network Analysis
Cross-jurisdictional Information Sharing/Collaboration
[Figure: COPLINK overview. Tools: Arizona IDMatcher, CAN Visualizer. Data: law-enforcement data and border crossing data (AZ, CA, TX) on people and vehicles. Panels: Identity Resolution (match and similarity on first, middle, and last name, DOB, address, and ID); Criminal Network Analysis (narcotics network); High-risk Vehicle Identification (mutual information between Vehicle A and Vehicle B; frequent crossers at night, plotted as time of day by crossing date); Criminal Link Prediction; Suspect Traffic Burst Detection]
• Detect false and deceptive identities across jurisdictions using a probabilistic naïve Bayes-based resolution system.
• Identify high-risk vehicles by applying association techniques such as mutual information to border crossing and law-enforcement data.
• Predict interactions between individuals and vehicles using link prediction techniques to identify high-risk border crossers.
• Detect real-time anomalies and threats in border traffic using Markov switching and other models.
* Only the grayed datasets are available to the AI Lab

(2) BI for Market Intelligence (AZ BizIntel)
• Mass media, social media contents
• Text & social media analytics techniques
• Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara)
• NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu)
• Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams)
• Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis
• Time series models, spike detection, decaying function, trading windows, targeted sentiment
• Econometrics/regression models (R-sqr, p-value), 10-fold validation (F, accuracy), simulated trading (cost, frequency, exit)

[Diagram labels]
• Data Collection: Predefined Data Sources; Dynamic Data Sources; Company Keywords
• Sources shown: SEC/Edgar, NYSE.com, NASDAQ.com, Finance.Yahoo.com, Yahoo Finance Forums, Company Websites, Twitter, Stock Exchange, WSJ, Search Engines, 10K Report, Blogs, News
• Company Information Database: Ticker, CIK, CUSIP, Company Name, PERMNO
• Data Processing: Transformation/Integration
• Analysis: Finance/Econ models and metrics; Topics & Sentiments; Time Series / Burst; SNA; Risk Model
• Interactive Applications: Static Figures/Dashboards; Basic Information
• Data Sources for US Public Companies
• Analytic Approaches: Single Media Analysis, Cross Media Analysis, Simulated Trading,
Predicting Markets

[Figure: AZ BizIntel System Design (Visualization)]

(3) BI for Healthcare: AZ Smart Health
• Research, Development, Commercialization
• Data: NTU Hospital EHR; National Health Insurance Database; Health Cloud Infrastructure
• Healthcare Business Intelligence: cost, performance, benchmarking, research & practical implications
• Health Informatics System Development: artificial intelligence, data mining, decision support, visualization
• Market Development: targeted data subscriptions; software, data, analytics as a service; on-demand health analytics services; healthcare business consultations; patient social media platform

AZ Smart Health Research
• Healthcare Decision Support
  - Symptom-Disease-Treatment Extraction for Medical Knowledge Re-use
  - Scenario-based Association Rule Mining and Result Validation for Effective Healthcare Outcome Assessment and Medication Compliance to Signify Quality of Care
  - Temporal Episodes and Disease Progression Modeling for Better Patient Condition Assessment
  - Patients-Like-You-and-Me EHR Search Interface to Accelerate Clinical Decision Making
• Patient-centered Smart Health
  - Personalized Healthcare for Chronic and Family Diseases Management
  - Long Term Medication Effects to Improve New Drug Development
  - Public Health Modeling and Monitoring for Government Agencies
  - Patient Social Media to Empower Patients and Improve Self Care at Home
• Healthcare Business Analytics
  - Cost Modeling and Containment
  - Improving Rate Calculation for the National Health Insurance
  - Competency and Performance Benchmarking
  - Quality-based Insurance Reimbursement
  - Workflow Planning and Coordination for Inter- and Intra-Hospital Process

ARM in Medicine: Symptoms, Diseases, and Treatments
[Figure: association-rule network linking symptoms to diseases and diseases to treatments. Each edge is labeled with its confidence and drawn in one of three bands: 0.05 < Confidence <= 0.2, 0.2 < Confidence <= 0.5, 0.5 < Confidence]
• Symptoms: Hemoptysis (786.3); Other dyspnea and respiratory abnormalities (786.09)
• Diseases: Unspecified pulmonary tuberculosis, confirmation unspecified (011.90); Pneumonia (486); Malignant neoplasm of bronchus and lung, unspecified (162.9)
• Treatments: Terbutaline sulphate 5mg/2ml/vial (ETERBUS); Thoracentesis (34.91); Chest PA view (320011); Pyridoxine HCl Tablets 50mg (OVTB6); Computerized axial tomography of thorax (87.41); Injection or infusion of cancer chemotherapeutic substance (99.25); Direct smear by Gram stain (13007); Aerobic culture (130062)

Patient Statistics: Breast Cancer
[Figure: three bar charts]
• Patient Genders (M, F): bar values 1092 and 0
• Patient Age Groups (15 to 24, 25 to 44, 45 to 64, > 65): bar values 618, 318, 150, 6
• Frequent Cooccurred Diagnosis (bar values 335, 179, 169, 152, 146, 146, 125, 103, 99, 86): Secondary malignant neoplasm of bone and…; Malignant neoplasm of female breast, upper-…; Diabetes mellitus without mention of…; Secondary malignant neoplasm of lung; Secondary malignant neoplasm of liver; Malignant neoplasm of other specified sites of…; Essential hypertension, unspecified; Malignant neoplasm of female breast, upper-…; Secondary and unspecified malignant neoplasm…; Benign neoplasm of breast

Consistency of Top Treatment Orders
Top 20 treatments from aggregated population:
1. Exemestane (Aromasin) (諾曼癌素)
2. Her-2/neu fluorescence in situ hybridization (Her-2/neu FISH) (螢光原位雜交法)
3. Trastuzumab (Herceptin) (賀癌平)
4. Anastrozole (Anazo) (安納柔)
5. Zoledronic acid (Zometa) (卓古祂)
6. Pegylated liposomal doxorubicin (Caelyx) (康利斯微脂利)
7. Radical mastectomy-unilateral (乳癌根除術-單側)
8. Tamoxifen citrate (得適)
9. Docetaxel (Taxotere) (剋癌易)
10. Cyclophosphamide (Endoxan-Asta) (癌得星)
11. Vinorelbine (Navelbine) (溫諾平)
12. Docetaxel (Taxotere) (剋癌易)
13. Epirubicin HCl (Pharmorubicin RD) (泛艾黴素)
14. Epirubicin (Pharmorubicin) ("速溶"泛艾黴素)
15. CA-153 tumor marker (CA-153 腫瘤標記)
16. Epirubicin (Pharmorubicin) ("速溶"泛艾黴素)
17. Methotrexate sodium inj (Amethopterin) (滅殺除癌)
18. Dissection of axillary lymphatics (腋窩淋巴腺清除術)
19. Breast tumor biopsy (乳房腫瘤組織檢查切片術)
20. Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射4-8小時)
[Table: V marks for each of the top 20 treatments under the columns Physician (M1130, M1529, M1540, M1585), Department (03, BD), Age Group (4, 5, 6, 7), and Cooccurred Diagnosis (196.3, 198.5)]
• Department 03: General Surgery; Department BD: Gastrointestinal surgery
• Age group 4: 15 to 24; Age group 5: 25 to 44; Age group 6: 45 to 64; Age group 7: > 65
• Cooccurred Diagnosis 196.3: Secondary and unspecified malignant neoplasm of lymph nodes; Cooccurred Diagnosis 198.5: Secondary malignant neoplasm of bone and bone marrow

Treatment Comparison Among Different Physicians
[Table: top 20 treatment orders, ranked 1 to 20, in three columns: DOCTOR_NO=M1130, DOCTOR_NO=M1529, DOCTOR_NO=M1540]
Entries across the three columns: Caelyx 20mg/10ml/vial (康利斯微脂利); Aromasin S.C. Tablets 25mg (諾曼癌素); Zometa Powder For Solution For Infusion 4mg/vial (卓古祂); Anazo F.C. Tablets (安納柔); Methotrexate Inj 50mg/2ml (滅殺除癌); Gemzar 200mg/vial (健擇); Taxotere 20mg/0.5ml/vial and 80mg/2ml/vial (剋癌易); Navelbine 10mg/1ml/vial (溫諾平); FORMOXOL 30mg/5ml/vial (伏摩素); Herceptin 440mg/20ml/vial (賀癌平); CA-153 tumor marker (CA-153 腫瘤標記); Endoxan-Asta Injection 200mg/vial (癌得星); Abitrexate 50mg/2ml/vial (必除癌); Pharmorubicin Rapid Dissolution 10mg and Pharmorubicin 10mg/vial ("速溶"泛艾黴素); Pharmorubicin RD 50mg/vial (泛艾黴素); Granocyte 100ug/vial (顆球諾得); Rasitol Tablets 40mg (Furosemide) (來喜妥); Neurotin Tablets 600mg (鎮頑癲); Emetrol Tablets 10mg (Domperidone) (愈吐寧); Sodium chloride injection (氯化鈉注射液); Intravenous chemotherapy <1 hour, 1-4 hours, and 4-8 hours (靜脈化學藥物注射); Radical mastectomy-unilateral (乳癌根除術-單側); Sentinel lymphadenectomy (腋窩淋巴腺清除術); Breast tumor biopsy examination (乳房腫瘤組織檢查切片術); Whole body bone scan (全身骨骼掃描); Simulation procedure (模擬定位攝影); Vascular exploration (血管探查); Fixed mold-large (固定模具之設計及製作-大)

Treatment Comparison Among Different Patient Age Groups
[Table: top 20 treatment orders, ranked 1 to 20, in three columns: Age group=5, Age group=6, Age group=7]
Entries across the three columns: Caelyx 20mg/10ml/vial (康利斯微脂利); Anazo F.C. Tablets (安納柔); Her-2/neu FISH (螢光原位雜交法); Herceptin 440mg/20ml/vial (賀癌平); Aromasin S.C. Tablets 25mg (諾曼癌素); Abitrexate 50mg/2ml/vial (必除癌); Zometa Powder For Solution For Infusion 4mg/vial (卓古祂); Taxotere 20mg/0.5ml/vial and 80mg/2ml/vial (剋癌易); Navelbine 10mg/1ml/vial (溫諾平); Tadex 10mg/tab (得適); Endoxan-Asta Injection 200mg/vial (癌得星); Xeloda Tablets 500mg (結瘤達); Pharmorubicin Rapid Dissolution 10mg and Pharmorubicin 10mg/vial ("速溶"泛艾黴素); Pharmorubicin RD 50mg/vial (泛艾黴素); Granocyte 100ug/vial (顆球諾得); Methotrexate Inj 50mg/2ml (滅殺除癌); Gemzar 200mg/vial (健擇); FORMOXOL 30mg/5ml/vial (伏摩素); CA-153 tumor marker (CA-153 腫瘤標記); Intravenous chemotherapy 1-4 hours and 4-8 hours (靜脈化學藥物注射); Radical mastectomy-unilateral (乳癌根除術-單側); Partial mastectomy-unilateral (部份乳癌根除術-單側); Sentinel lymphadenectomy (腋窩淋巴腺清除術); Breast tumor biopsy (乳房腫瘤組織檢查切片術); Compensator design and production (補償器之設計及製作)
• Anazo F.C. Tablets (安納柔) is a treatment for advanced breast cancer in postmenopausal women (advanced age).
• Abitrexate (必除癌) falls in an FDA pregnancy risk category for drugs shown to cause fetal risks and abnormalities. It is therefore less likely to be prescribed for patients in the younger age group 5 (i.e., ages 25 to 44).

Cancer Community Mapping: Text Mining & Visualization for Documents and Patient Forums
[Figure callouts: Red Blood Cell and Lymph Nodes subtopics; Meningeal Neoplasms and Brain Diseases subtopics; a Brain Neoplasms article about toddlers; breast cancer patient forum messages]

BI & Analytics Research Opportunities and Challenges
• Opportunities: BIG DATA, BIG COMPUTATION, BIG ANALYTICS, BIG (SOCIETAL) IMPACTS (NAE Grand Challenges: security, healthcare)
• Challenges: data deluge (TB/PB); data variety (numbers, text, multilingual, multimedia); data velocity (mobile, streaming); data organization & access (DBMS, Hadoop, IR, image, mobile); data analytics (statistical analysis, data/text/web mining)

Training the New "Data Scientists": Core Knowledge
• B-School (Management Information Systems): economics/finance/accounting/marketing, statistical analysis/modeling, organizational/behavioral. Core: business knowledge; statistics
• C-School (Computer Science): programming language, data structure & algorithm, database management system, artificial intelligence, networking, data mining, web computing & mining. Core: computational techniques
• I-School (Information/Library Science): information organization, information retrieval, information visualization, NLP, text mining, HCI. Core: information processing
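
Worked example (association-rule confidence): The "ARM in Medicine" slide labels each symptom-disease and disease-treatment link with a confidence value and groups links into three bands. As a minimal sketch of how such a value is computed, the Python below uses the standard definition confidence(A -> C) = count(records containing A and C) / count(records containing A). The visit records are made up for illustration and are not from the NTU Hospital data; the function names `rule_confidence` and `confidence_band` are hypothetical, and the label for values at or below 0.05 is an assumption, since the slide does not say how such links are treated.

```python
def rule_confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over a list of code sets."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    if n_antecedent == 0:
        return 0.0  # the rule never fires, so report zero confidence
    n_both = sum(1 for r in records if both <= r)
    return n_both / n_antecedent


def confidence_band(conf):
    """Map a confidence value to the three bands named in the slide legend."""
    if conf > 0.5:
        return "0.5 < Confidence"
    if conf > 0.2:
        return "0.2 < Confidence <= 0.5"
    if conf > 0.05:
        return "0.05 < Confidence <= 0.2"
    return "Confidence <= 0.05"  # assumed label; not part of the slide legend


# Hypothetical visits, each a set of codes that appear on the slide:
# 786.3 hemoptysis, 786.09 other dyspnea, 486 pneumonia,
# 162.9 lung neoplasm, 87.41 CT of thorax.
visits = [
    {"786.3", "486"},
    {"786.3", "162.9", "87.41"},
    {"786.3", "486", "87.41"},
    {"786.09", "486"},
]

if __name__ == "__main__":
    for a, c in [({"786.3"}, {"486"}), ({"486"}, {"87.41"})]:
        conf = rule_confidence(visits, a, c)
        print(sorted(a), "->", sorted(c), round(conf, 4), confidence_band(conf))
```

A production pipeline would usually add a minimum-support filter before ranking rules by confidence, since a rule observed in only a handful of records can show high confidence by chance.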