NSF-Relevant Challenges in Computational Intelligence
Jaime Carbonell (jgc@cs.cmu.edu) & Tom Mitchell, Guy Blelloch, Randy Bryant, et al.
School of Computer Science, Carnegie Mellon University
26-April-2007
I) Major Computational Intelligence Research Areas
II) Next-Generation Infrastructure (DISC)
Carnegie Mellon School of Computer Science

Computational Intelligence
• Machine Learning
    Inductive learning algorithms, active learning
    Data mining & novel pattern detection
• Language Technologies
    Multilingual & next-generation search engines
    Machine translation (e.g. Arabic → English)
• Perception
    Computer vision, tactile sensing (e.g., in robotics)
• Planning & optimizing
    Reasoning & planning under uncertainty
    Non-linear optimization (beyond O.R.) w/uncertainty
• Key scientific applications
    Proteomics, genomics, computational biology
    Modeling human brain functions

Machine Learning
[Diagram: Machine Learning linked to Speech Recognition, Object recognition, Data Mining, Extracting facts from text, Automated Control learning]
• Reinforcement learning
• Predictive modeling
• Pattern discovery
• Hidden Markov models
• Convex optimization
• Explanation-based learning
• ....

Leveraging Existing Data Collecting Systems
1999 Influenza outbreak [Moore, 2002]
[Figure: time series by Week (1999-2000) for the following data sources]
    Influenza cultures
    Sentinel physicians
    WebMD queries about ‘cough’ etc.
    School absenteeism
    Sales of cough and cold meds
    Sales of cough syrup
    ER respiratory complaints
    ER ‘viral’ complaints
    Influenza-related deaths

Cluster Evolution and Density Change Detection: d^2 F(r(t)) / dt^2
[Figure panels: Constant Event; New Obfuscated Event; New Unobfuscated Event; Growing Event]

Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
MLR threshold function: locally linear, globally non-linear

Info-Age Bill of Rights
• Get the right information: Search Engines
• To the right people: Personalization
• At the right time: Anticipatory Analysis
• On the right medium: Speech Recognition
• In the right language: Machine Translation
• With the right level of detail: Summarization

MMR vs Current Search Engines
[Figure: documents and query, comparing MMR with IR ranking; λ controls spiral curl]

Types of Machine Translation
[Diagram: translation pyramid from Source (Arabic) to Target (English)]
    Interlingua
    Semantic Analysis | Sentence Planning
    Syntactic Parsing | Text Generation
    Transfer Rules
    Direct: SMT, EBMT
Requires Massive Data Resources

2005 NIST Arabic-English MT
[Figure: BLEU score (0.0 to 0.7) per system. Systems: Google, ISI, IBM + CMU, UMD, JHU-CU, Edinburgh, Systran, Mitre, FSC. Quality bands: Expert human translator; Usable translation; Human-editable translation; Topic identification; Useless region]
• Interlingual MT
    Grammars, semantics
    Best for focused domains
• Corpus-Based MT
    Pre-translated text (10200M words)
    Best for general MT
• Context-Based MT
    Target language text (100M – 1 Trillion words)
    Improved variant of corpus-based MT
    Perfect client for DISC

Arabic Statistical-MT Output
بكين 17 يناير / شينخوا / حث مسئولون صينيون وروس جميع الاطراف المعنية علي " التزام الهدوء وممارسة ضبط النفس " بشان القضية النووية الخاصة بجمهورية كوريا الديمقراطية الشعبية . وقد التقي نائب وزير الخارجية الصيني يانغ ون تشانغ ونائب وزير الخارجية الروسي الكسندر لوسيوكوف علي مادبة غداء حيث دعيا الاطراف المعنية الي مواصلة السعي من اجل الحل السلمي من خلال الحوار في ظل الوضع المعقد الحالي .
Beijing January 17 / Shinhua / the Chinese and Russian officials urged all parties concerned to " remain calm and exercise restraint " over the nuclear issue of the Democratic People's Republic of Korea. He met with vice Chinese foreign minister Yang Chang won the deputy of the Russian foreign minister Alexander Losyukov at a lunch with invited interested parties to continue the search for a peaceful solution through dialogue under the current complicated situation.
BLEU = .64

What About Minor Languages or Dialects without Massive Data?

PROTEINS (Borrowed from: Judith Klein-Seetharaman)
Sequence → Structure → Function
Primary Sequence:
    MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT
    PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC
    KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV
    HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG
    SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
Primary Sequence → Folding → 3D Structure → Complex function within network of proteins → Normal

PROTEINS
Sequence → Structure → Function
Primary Sequence → Folding → 3D Structure → Complex function within network of proteins → Disease

Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
• The gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
Therefore we need to predict structures in silico

Linked Segmentation CRF
• Node: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as
    P(y_1, \ldots, y_R \mid x_1, \ldots, x_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big( \sum_k \lambda_k f_k(x_i, y_{i,j}) \Big) \prod_{\langle y_{i,j}, y_{a,b} \rangle \in E_G} \exp\Big( \sum_l \mu_l g_l(x_i, x_a, y_{i,j}, y_{a,b}) \Big)
    [Annotation: Joint Labels]

Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation

fMRI to observe human brain activity
Machine learning to discover patterns in complex data
Data → New discoveries about human brain function
Our algorithms have learned to distinguish whether a human subject is reading a word (e.g. ‘tools’ or ‘buildings’) with 90% accuracy

Requisite Infrastructure
• Data Intensive SuperComputing (DISC) for tera-scale and peta-scale data repositories
• Advanced algorithms research
    Massively-parallel decomposition
    Scalability in analytics & learning
    Extracting compact models for run-time
    Planning, reasoning, learning w/uncertainty
    Active Learning (maximally reducing uncertainty)
• Domain expertise (e.g. proteomics, neural sciences, astronomy, network security, …)

System Comparison: Data
DISC
    System collects and maintains data
    • Shared, active data set
    Computation colocated with storage
    • Faster access
Conventional Supercomputers
    Data stored in separate repository
    • No support for collection or management
    Brought into system for computation
    • Time consuming
    • Limits interactivity

Program Model Comparison
DISC
    [Stack: Application Programs / Machine-Independent Programming Model / Runtime System / Hardware]
    Application programs written in terms of high-level operations on data
    Runtime system controls scheduling, load balancing, …
Conventional Supercomputers
    [Stack: Application Programs / Software Packages / Machine-Dependent Programming Model / Hardware]
    Programs described at very low level
    • Specify detailed control of processing & communications
    Rely on small # of software packages
    • Written by specialists
    • Limits classes of problems & solution methods

Final Thoughts
• Opportunities in Computational Intelligence
    Machine learning for tough problems: relevant novelty detection, structural learning, active learning
    Scientific applications: Computational X (X = biology, linguistics, astrophysics, chemistry, …)
• Next generation computational infrastructure
    DISC principle (beyond HPC, beyond grid, …)
    Algorithmic fundamentals
• International programs (on common problems)
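
Supplementary sketch for the "MMR vs Current Search Engines" slide: the slide only names MMR and its λ parameter, so the following is a minimal illustration of greedy Maximal Marginal Relevance selection under stated assumptions, not the presenters' implementation. The function names (mmr_rank, jaccard), the word-overlap similarity, and the default values of lam and k are hypothetical choices made for this example. With lam = 1.0 the ranking reduces to plain relevance ordering; lower values penalize documents that resemble those already selected.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): word-set overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_rank(
    query: str,
    candidates: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    lam: float = 0.7,
    k: int = 3,
) -> List[str]:
    """Greedy MMR: repeatedly pick the candidate maximizing
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected)."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(doc: str) -> float:
            relevance = sim(doc, query)
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((sim(doc, s) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = ["civil war battles", "civil war battles history", "war economy"]
relevance_only = mmr_rank("civil war", docs, lam=1.0, k=2)
diversified = mmr_rank("civil war", docs, lam=0.3, k=2)
```

In this toy run, relevance_only keeps the two near-duplicate documents, while diversified swaps the second one for the less similar "war economy" document.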
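
Supplementary sketch for the "Cluster Evolution and Density Change Detection" slide: the slide gives only the quantity d^2 F(r(t)) / dt^2, so this example shows one way such a second time derivative could be estimated from sampled values, using a central finite difference. The uniform sampling interval dt, the threshold, and the function names are assumptions introduced for illustration; the slide does not specify how the detection is actually performed.

```python
from typing import List, Sequence


def second_difference(values: Sequence[float], dt: float = 1.0) -> List[float]:
    """Central finite-difference estimate of d^2F/dt^2 at interior samples.

    Assumes `values` holds F sampled at uniform time steps of size dt.
    """
    return [
        (values[i + 1] - 2.0 * values[i] + values[i - 1]) / (dt * dt)
        for i in range(1, len(values) - 1)
    ]


def flag_changes(
    values: Sequence[float], dt: float = 1.0, threshold: float = 1.0
) -> List[int]:
    """Indices of samples whose estimated |d^2F/dt^2| exceeds the threshold."""
    return [
        i + 1  # offset by one: the estimate starts at the second sample
        for i, acc in enumerate(second_difference(values, dt))
        if abs(acc) > threshold
    ]


constant = second_difference([5, 5, 5, 5])
quadratic = second_difference([0, 1, 4, 9, 16])
onset = flag_changes([1, 1, 1, 1, 5, 9])
```

A constant series yields zero everywhere, a quadratic series yields a constant second difference, and a series that suddenly starts growing is flagged at the sample where the growth begins.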