What is Watson, really? Session Number 1888 Keyur Dalal, IBM Vladimir Stemkovski, IBM Jeff Sumner, IBM Please note: IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. 1 Acknowledgements and disclaimers: Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. 
IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. © Copyright IBM Corporation 2011. All rights reserved. – U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. IBM, the IBM logo, ibm.com, and jStart are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others. 2 What we’ll be discussing today... • Introducing the jStart® Team • What is IBM Watson? 
  – How Watson Got Started
  – Where We Stand Today
• Watson: A Deep Dive
  – Watson’s Architecture
  – What components make up Watson
  – What Watson can do, today
  – The Future of Watson
• How do we get started?
  – Existing IBM technologies
  – Examples of our work
• Q&A

Introducing IBM jStart®: Who we are & what we do
jStart is IBM’s emerging technologies client engagement team. Focused on creating solutions built on the latest technologies, the team has been working with companies since 1997 to address their business challenges while providing business value today. jStart’s latest initiative is the commercialization of IBM’s Watson system.

Your presenters for today:
• Keyur Dalal, IT Architect, Managing Consultant
• Vladimir Stemkovski, IT Architect, Managing Consultant
• Jeff Sumner, Product Marketing, IBM Content Analytics

What we’ll be discussing today...
• Introducing the jStart® Team
• What is IBM Watson?
  – How Watson Got Started
  – Where We Stand Today
• Watson: A Deep Dive
  – Watson’s Architecture
  – What components make up Watson
  – What Watson can do, today
  – The Future of Watson
• How do we get started?
  – Existing IBM technologies
  – Examples of our work
• Q&A

IBM’s Grand Challenges
• IBM has a long history of “Grand Challenges”; in fact, it has been doing grand challenges for the past century.
• Why do them? To push science (and the company) in ways that weren’t thought possible before.
  “We look at areas where there's an enormous gap in current capability and use that as a challenge. We call them Grand Challenges.” - Dr. John E. Kelly III, Senior Vice President and Director of IBM
• What prompted IBM to consider the grand challenge that led to Watson?
IBM was asking itself: “can a system be designed that applies advanced data management and analytics to natural language in order to uncover a single, reliable insight, in a fraction of a second?”

The Challenge: Automatic Open-Domain Question Answering
A Long-Standing Challenge in Artificial Intelligence to emulate human expertise
• Given
  – Rich Natural Language Questions
  – Over a Broad Domain of Knowledge
• Deliver
  – Precise Answers: Determine what is being asked & give precise response
  – Accurate Confidences: Determine likelihood answer is correct
  – Consumable Justifications: Explain why the answer is right
  – Fast Response Time: Precision & Confidence in <3 seconds

You may have heard of IBM’s Watson…
A. What is the computer system that played against human opponents on “Jeopardy”… and won.
Why Jeopardy? The game of Jeopardy! makes great demands on its players, from the range of topical knowledge covered to the nuances in language employed in the clues. The question IBM had for itself was “is it possible to build a computer system that could process big data and come up with sensible answers in seconds, so well that it could compete with human opponents?”

What exactly makes up Watson?
• A workload optimized system
• 90 x IBM Power 750 servers (see note)
• 2880 POWER7 cores
• POWER7 3.55 GHz chip
• 500 GB per sec on-chip bandwidth
• 10 Gb Ethernet network
• 15 Terabytes of memory
• 20 Terabytes of disk, clustered
• Can operate at 80 Teraflops
• Runs IBM DeepQA software
• Scales out with and searches vast amounts of unstructured information with UIMA & Hadoop open source components
• Linux provides a scalable, open platform, optimized to exploit POWER7 performance
• 10 racks include servers, networking, shared disk system, cluster controllers
Note: the Power 750 featuring POWER7 is a commercially available server that runs AIX, IBM i and Linux and has been in market since Feb 2010.

What exactly makes up Watson?
• This means Watson…
• Operates at 80 teraflops.
  The human brain is estimated to have a processing power of 100 teraflops (100 trillion operations per second).
• Has as much memory (RAM) as the volume of books and media the Library of Congress adds over a 4-month period.
• Can process 200 million times more instructions per second than the Space Shuttle’s computers.
• Parses within 3 seconds the equivalent of the number of books on a 700-yard-long bookshelf, picks out the relevant information, and creates an answer.

The Difference Between Search & DeepQA
Search:
• Decision Maker: Has Question; Distills to 2-3 Keywords; Reads Documents, Finds Answers; Finds & Analyzes Evidence
• Search Engine: Finds Documents containing Keywords; Delivers Documents based on Popularity
DeepQA:
• Decision Maker: Asks NL Question; Considers Answer & Evidence
• Expert: Understands Question; Produces Possible Answers & Evidence; Analyzes Evidence, Computes Confidence; Delivers Response, Evidence & Confidence

What we’ll be discussing today...
• Introducing the jStart® Team
• What is IBM Watson?
  – How Watson Got Started
  – Where We Stand Today
• Watson: A Deep Dive
  – Watson’s Architecture
  – What components make up Watson
  – What Watson can do, today
  – The Future of Watson
• How do we get started?
  – Existing IBM technologies
  – Examples of our work
• Q&A

DeepQA: the technology & architecture behind Watson
[Diagram: the DeepQA pipeline. Initial Question → Question & Topic Analysis → Question Decomposition → Hypothesis Generation (Primary Search and Candidate Answer Generation, drawing on Answer Sources) → Hypothesis & Evidence Scoring (Answer Scoring, Evidence Retrieval and Deep Evidence Scoring, drawing on Evidence Sources) → Synthesis → Final Confidence Merging & Ranking (Learned Models help combine and weigh the Evidence) → Answer & Confidence]

DeepQA: the technology & architecture behind Watson
1. Initial Question Formulated: “The name of this monetary unit comes from the word for "round"; earlier coins were often oval”
2. Watson performs question analysis, determines what is being asked.
3. It decides whether the question needs to be subdivided.
DeepQA: the technology & architecture behind Watson
[Diagram: the DeepQA pipeline, with the stages for each step below highlighted]
4. Watson then starts to generate hypotheses based on decomposition and initial analysis, as many hypotheses as may be relevant to the initial question.
5. In creating the hypotheses it will use, Watson consults numerous sources for potential answers.
6. Watson then uses algorithms to “score” each potential answer and assign a confidence to that answer.
7. Watson uses Evidence Sources to validate its hypotheses and help score the potential answers.
8. If the question was decomposed, Watson brings together hypotheses from sub-parts.
9. Using models on the merged hypotheses, Watson can weigh evidence based on prior “experiences”. Learned Models help combine and weigh the Evidence.
10. Once Watson has ranked its answers, it then provides its answers as well as the confidence it has in each answer.
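Steps 9 and 10 can be pictured with a minimal sketch. It assumes each candidate answer arrives with a few numeric evidence scores and that a logistic model with already-learned weights turns them into a confidence. The feature names, weights, bias, and scores below are all invented for illustration; they are not Watson's actual features or values.

```python
import math

# Hypothetical learned weights: the deck does not list real feature names or values.
WEIGHTS = {"passage_support": 2.0, "type_match": 1.5, "popularity": 0.5}
BIAS = -2.0


def confidence(scores: dict) -> float:
    """Combine evidence scores into a 0..1 confidence with a logistic function."""
    z = BIAS + sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates: dict) -> list:
    """Return (answer, confidence) pairs sorted from most to least confident."""
    scored = [(answer, confidence(scores)) for answer, scores in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented evidence scores for three candidate answers to the monetary-unit clue.
candidates = {
    "franc": {"passage_support": 0.4, "type_match": 0.9, "popularity": 0.6},
    "yen": {"passage_support": 0.9, "type_match": 0.9, "popularity": 0.5},
    "circle": {"passage_support": 0.7, "type_match": 0.1, "popularity": 0.3},
}

ranked = rank(candidates)
for answer, conf in ranked:
    print(f"{answer}: {conf:.2f}")
```

The point of the sketch is the shape of the step: many per-candidate scores go in, one confidence per candidate comes out, and the ranked list with confidences is what gets reported.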
DeepQA: the technology & architecture behind Watson
[Diagram: the full DeepQA pipeline repeated, from Initial Question through Final Confidence Merging & Ranking to Answer & Confidence]

Where did it acquire knowledge?
Three types of knowledge
• Domain Data (articles, books, documents)
• Training and test question sets w/answer keys
• NLP Resources (vocabularies, taxonomies, ontologies)

Domain Data sources:
Source                      Size
Wikipedia                   17 GB
Time, Inc.                  2.0 GB
New York Times              7.4 GB
Encarta                     0.3 GB
Oxford University           0.11 GB
Internet Movie Database     0.1 GB
IBM Dictionary              0.01 GB
...
J! Archive/YAGO/dbPedia…    XXX
Total Raw Content           70 GB
Preprocessed Content        500 GB

How we convert data into knowledge for Watson’s use
Three types of knowledge
• Domain Data (articles, books, documents)
  – Converted to Indices for search/passage lookup
  – Redirects extracted for disambiguation
  – Frame cuts generated with frequencies to determine likely context
  – Pseudo docs extracted for Candidate answer generation
• Training and test question sets w/answer keys
  – Used to create logistic regression model that Watson uses for merging scores
• NLP Resources (vocabularies, taxonomies, ontologies)
  – Named entity detection, relationship detection algorithms
  – Custom slot grammar parsers, Prolog rules for semantic analysis

Machine learning
• One of the core components of the system
  – Multiple models
  – 14000+ training questions
• Every candidate answer gets hundreds of features/scores associated with it.
  These features/scores are passed through a previously trained ML model for candidate answer scoring.
• It’s not just one model. In fact there is a chain of models; each subsequent one uses the scores produced by the models run before it.
• Machine learning is also used in other parts of the system, such as LAT confidence analysis.

NLP
• Used in many places (Question Analysis, Evidence Analysis, Content Pre-processing)
• Combines both rule-based and statistical approaches
• Full NLP stack (used in QA)
  – Tokenization
  – Named Entity Recognition
  – Deep Parsing and Predicate Argument Structure creation
  – Lexical Answer Type (LAT) and Focus detection
  – Anaphora resolution
  – Semantic relationship extraction
• Various technologies and techniques are used (English Slot Grammar parser, R2 NED, machine learning for LAT confidence analysis, custom annotators written in Prolog and Java)

NLP Examples
• LAT and Focus
  – It's the Peter Benchley novel about a killer giant squid that menaces the coast of Bermuda
• Named Entity Recognition
  – It's the {Person::Peter Benchley} novel about a killer giant {Animal::squid} that menaces the {Location::coast of Bermuda}
• Anaphora Resolution
  – Columbus embarked on his first voyage to this continent in 1492. In the next two decades he led three more expeditions there.

NLP in evidence analysis and content pre-processing
• Why do NLP on evidence passages and ingested content?
• NLP in Evidence Analysis allows:
  – LAT-based scoring
  – Named-entity-alignment-based scoring
• NLP in Content Pre-processing
  – Extracting and accumulating “knowledge” frames from the content
• For instance
  – SVO frame cuts will contain frequencies of Subject-Verb-Object occurrences in the content that Watson has ingested.
  – e.g. squid menaces coast 809
  – These “knowledge” frames are then used to generate candidate answers

DeepQA: the technology & architecture behind Watson
[Diagram: the full DeepQA pipeline repeated, from Initial Question through Final Confidence Merging & Ranking to Answer & Confidence]

What is Watson good at?
[Diagram: Watson at the center of four factors: 1 Questions Asked, 2 Answers, 3 Data Corpus, 4 Training Data]

1. Questions Asked
• Range: Very broad range / Wide range of questions / Small set of FAQs
• Type: What? Who? How? Yes/No? When? How many? Why?
• Predictive vs Fact Finding: Fact finding / Simple synthesis / Predictive/knowledge creation
• Language Used: English / Other

2. Answers
• Need understanding of confidence: High / Medium / Low
• Response time for concurrent use: Low (mins/hours) / High (seconds)
• Multiple correct answers: One/few sufficient / All must be returned
• Interactive dialog: No / Yes

3. Data Corpus
• Variety of data: Varied/largely unstructured / Mostly structured
• Size and redundancy of corpus: Large / Small
• Rate of change of underlying data: Batch / Runtime

4. Training Data
• Large set of questions with known answers / Little or no training data
• Easy to generate / Hard to generate

Watson enabled patient centered healthcare solutions
Care Consideration Analysis Treatment Protocol Analysis What’s New?
Consumer Portal Coding Automation Patient Inquiry Patient Workup Treatment Authorization Second Opinion Longitudinal Patient Electronic Health Information Differential Diagnosis Treatment Options Specialty Diagnosis & Treatment Options Caregiver Education Patient 34 Population Analysis & Care Mgmt Lay Caregiver…PA… Nurse Practitioner On-going Treatment Specialty Research Genomicbased Analysis Physician What we’ll be discussing today... • Introducing the jStart® Team • What is IBM Watson? – How Watson Got Started – Where We Stand Today – The Future of Watson • Watson: A Deep Dive – Watson’s Architecture – What components make up Watson – What Watson can do, today. • How do we get started? – Existing IBM technologies – Examples of our work • Q&A 35 IBM Content Analytics is a platform to derive rapid insight • Transform raw information into business insight quickly without building models or deploying complex systems. • Derive insight in hours or days … not weeks or months. • Easy to use for all knowledge workers to search and explore content. • Flexible and extensible for deeper insights. 
Rapidly Derived Insight Search and Explore Analyze and Visualize Aggregate and Extract External and Internal Content (and Data) Sources including Social Media and More 36 IBM Content Analytics – highlights • Dynamically search, analyze and explore content for new business insight • Integrate analytics results into other systems and applications • Powerful content modeling with support for advanced classification delivering business specific deeper insight • Quickly generate Cognos BI reports IBM Content Analytics adds value to … 38 Healthcare Analytics Customer Care • Analyzing: E-Medical records, hospital reports • For: Clinical analysis; treatment protocol optimization • Benefits: Better management of chronic diseases; optimized drug formularies; improved patient outcomes • Analyzing: Call center logs, emails, online media • For: Buyer Behavior, Churn prediction • Benefits: Improve Customer satisfaction and retention, marketing campaigns, find new revenue opportunities Crime Analytics Insurance Fraud • Analyzing: Case files, police records, 911 calls… • For: Rapid crime solving & crime trend analysis • Benefits: Safer communities & optimized force deployment • Analyzing: Insurance claims • For: Detecting Fraudulent activity & patterns • Benefits: Reduced losses, faster detection, more efficient claims processes Automotive Quality Insight Social Media for Marketing • Analyzing: Tech notes, call logs, online media • For: Warranty Analysis, Quality Assurance • Benefits: Reduce warranty costs, improve customer satisfaction, marketing campaigns • Analyzing: Call center notes, SharePoint, multiple content repositories • For: churn prediction, product/brand quality • Benefits: Improve consumer satisfaction, marketing campaigns, find new revenue opportunities or product/brand quality issues IBM SPSS Portfolio and Key Focus Areas Real world ICA customer use cases • Seton Healthcare: reducing CHF re-admission with advanced content analytics and predictive modeling. 
• Veteris: Clinical Trials Researcher Profile Automation POC • NC State University: discovering emerging trends and patterns • USC – Annenberg School of Journalism: leveraging social media for predicting performance 40 Smarter is: reducing CHF re-admission with advanced content analytics and predictive modeling Seton Healthcare The Need: Seton Healthcare strives to reduce the occurrence of high cost CHF readmissions by proactively identifying patients likely to be readmitted on an emergent basis. The Solution: IBM will assist Seton to apply content and predictive analytics capabilities to better target and understand high-risk CHF patients for care management programs by: •Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes •Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data (relative to null and prior cost models) •Providing an interface through which knowledge workers can intuitively navigate, interpret and take action on this connected, actionable patient data which previously spanned disparate systems 41 IBM content and predictive analytics for healthcare IBM Content and Predictive Insights for Healthcare Raw Information Predictive Analytics Content Analytics Unstructured Data (Nurses Notes, Discharge Notes, etc.) Dynamic Multimode Interaction Natural Language Processing Predictive Scoring and Probability Analysis Medical Terminology Translation Search and Explore (Mine) Analyzed Information Structured Data (Billing Data, EMR, etc.) 
* Future optional capability ** Current optional capability 42 Trend, Pattern, Anomaly, Deviation Detection and Analysis Watson for Healthcare* Deep Question Answer from Knowledge Sources Health Integration Framework Data Warehouse** Master Data Mgt** Advanced Case Mgt** Business Intelligence** Other Smart is: Identifying Clinical Researcher Clinical Trials Researcher Profile Automation POC Utilize IBM’s text analytics technology to create and maintain research profiles from unstructured public data sources. Use Cases 1. Create profile for researchers based on name search in public databases from unstructured data. 2. Search for new or related names and profiles based on specific profile descriptors, i.e. oncology + next gen sequencing 43 Smart is: discovering emerging trends and patterns NC State University Research Patents Suitor Identification Utilize IBM’s text analytics technology to help NC State’s Office of Technology Transfer identify companies that would be good candidates to license NCSU innovations •Discovery of individuals in those companies •Discovery of contact information of those individuals Use Cases 1. Smart Inhaler technology Crawl a series of pharma and pharma oriented web sites to find potential licensees of a smart inhaler NCSU has developed where targeted drug deposition is achieved by injecting the drug aerosols from an optimal release position in the mouth inlet cross section by means of a controllable nozzle Look for evidence of failed inhaler oriented clinical trials, mine company names and contact info 2. 
New Husbandry Vaccine
Crawl a series of pharma and pharma-oriented web sites to find potential licensees for a new strain of Salmonella enterica serovar Typhimurium that NCSU has developed. Look for evidence of vaccine-related R&D involving animals (not humans); mine company names and contact info.

Smart is: leveraging social media for predicting performance
USC Annenberg School for Communication and Journalism
Leveraging the “new water cooler”: Social Media Sentiment Analysis with IBM BigSheets and IBM Content Analytics.
Use Cases
1. Film forecaster: A Twitter Box Office Predictor
Captures tweets, analyzes the volume of tweets mentioning upcoming movies, and gauges their sentiment.
• An analysis of more than 1 million movie-related tweets shows the premiere of "Harry Potter" as the star of Twitter movie chatter. It was the subject of nearly 53 percent of the tweets that expressed sentiment.
• In May, the tool correctly predicted a clamor for "Hangover 2" that resulted in a $100 million opening over Memorial Day weekend.
2. Work Underway
• Summers Arab Revolutions Analysis
• Republican Candidate Nomination

What we’ll be discussing today...
• Introducing the jStart® Team
• What is IBM Watson?
  – How Watson Got Started
  – Where We Stand Today
  – The Future of Watson
• Watson: A Deep Dive
  – Watson’s Architecture
  – What components make up Watson
  – What Watson can do, today
• How do we get started?
  – Existing IBM technologies
  – Examples of our work
• Q&A

Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  – Find the community that interests you…
    • Information Management: ibm.com/software/data/community
    • Business Analytics: ibm.com/software/analytics/community
    • Enterprise Content Management: ibm.com/software/data/contentmanagement/usernet.html
• IBM Champions
  – Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities
    • ibm.com/champion

Thank You! Your Feedback is Important to Us
• Access your personal session survey list and complete via SmartSite
  – Your smart phone or web browser at: iodsmartsite.com
  – Any SmartSite kiosk onsite
  – Each completed session survey increases your chance to win an Apple iPod Touch with daily drawing sponsored by Alliance Tech

Backup Slides

Sample questions
• This fish was thought to be extinct millions of years ago until one was found off South Africa in 1938.
  Category: ENDS IN "TH"
  Answer:
• When hit by electrons, a phosphor gives off electromagnetic energy in this form.
  Category: General Science
  Answer:
• Secretary Chase just submitted this to me for the third time--guess what, pal. This time I'm accepting it.
  Category: Lincoln Blogs
  Answer: His resignation

Glossary
• UIMA – Unstructured Information Management Architecture
• LAT – Lexical Answer Type, e.g. fish, thing, person
• NLP – Natural Language Processing
• ESG – English Slot Grammar
• NED/NER – Named Entity Detection/Named Entity Recognition
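The LAT glossary entry and the sample questions can be tied together with a toy sketch. It guesses a lexical answer type by taking the word that follows the first "this" or "these" in a clue, and falls back to "thing" when that word is a function word. This heuristic and its stop list are assumptions made purely for illustration; the deck describes Watson's LAT detection as using a full parser and machine learning, which this does not attempt to reproduce.

```python
import re

# Hypothetical stop list: words suggesting "this" is used as a bare pronoun.
FUNCTION_WORDS = {"to", "is", "was", "in", "of", "for", "and", "that", "it", "i"}


def guess_lat(clue: str) -> str:
    """Toy heuristic: the word after the first 'this'/'these', else 'thing'."""
    tokens = re.findall(r"[a-z']+", clue.lower())
    for i, token in enumerate(tokens):
        if token in ("this", "these"):
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following is None or following in FUNCTION_WORDS:
                return "thing"
            return following
    return "thing"


sample_clues = [
    "This fish was thought to be extinct millions of years ago "
    "until one was found off South Africa in 1938.",
    "When hit by electrons, a phosphor gives off electromagnetic energy in this form.",
    "Secretary Chase just submitted this to me for the third time--guess what, pal. "
    "This time I'm accepting it.",
]

for clue in sample_clues:
    print(guess_lat(clue), "<-", clue[:40])
```

A heuristic this simple breaks quickly, for example on clues where an adjective follows "this", which is one reason a real system would rely on deep parsing rather than word position.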