NSF-Relevant Challenges in Computational Intelligence Jaime Carbonell et al

NSF-Relevant Challenges
in Computational Intelligence
Jaime Carbonell (jgc@cs.cmu.edu)
& Tom Mitchell, Guy Blelloch, Randy Bryant, et al.
School of Computer Science
Carnegie Mellon University
26-April-2007
I) Major Computational Intelligence Research Areas
II) Next-Generation Infrastructure (DISC)
Computational Intelligence
• Machine Learning
 Inductive learning algorithms, active learning
 Data mining & novel pattern detection
• Language Technologies
 Multilingual & next-generation search engines
 Machine translation (e.g. Arabic → English)
• Perception
 Computer vision, tactile sensing (e.g., in robotics)
• Planning & optimizing
 Reasoning & planning under uncertainty
 Non-linear optimization (beyond O. R.) w/uncertainty
• Key scientific applications
 Proteomics, genomics, computational biology
 Modeling human brain functions
Machine Learning
Speech Recognition
Object recognition
Data Mining
Extracting facts from text
Automated Control learning
• Reinforcement learning
• Predictive modeling
• Pattern discovery
• Hidden Markov models
• Convex optimization
• Explanation-based learning
• ....
Leveraging Existing Data Collecting Systems
1999 Influenza outbreak
Influenza cultures
Sentinel physicians
WebMD queries about ‘cough’ etc.
School absenteeism
Sales of cough and cold meds
Sales of cough syrup
ER respiratory complaints
ER ‘viral’ complaints
Influenza-related deaths
[Moore, 2002]
Week (1999-2000)
Cluster Evolution and Density
Change Detection: $d^2 F(r(t)) / dt^2$
Constant Event
New Obfuscated Event
New Unobfuscated Event
Growing Event
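The change-detection criterion on this slide is the second time derivative of a cluster statistic F(r(t)). The sketch below is only an illustration under stated assumptions: the slide does not define F or r, so `density` is a hypothetical, equally spaced series of cluster densities, and the threshold value is arbitrary. It estimates the second derivative with a central finite difference and reports the time steps where its magnitude is large.

```python
def second_derivative(values, dt=1.0):
    """Central finite-difference estimate of d^2F/dt^2 at interior points."""
    return [
        (values[i - 1] - 2.0 * values[i] + values[i + 1]) / (dt * dt)
        for i in range(1, len(values) - 1)
    ]


def detect_changes(values, threshold, dt=1.0):
    """Return indices into `values` where |d^2F/dt^2| exceeds `threshold`."""
    accel = second_derivative(values, dt)
    # accel[k] is the estimate at values[k + 1], hence the index shift.
    return [k + 1 for k, a in enumerate(accel) if abs(a) > threshold]


# Hypothetical density series: constant at first, then growing.
density = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 6.0, 10.0]
changes = detect_changes(density, threshold=0.5)
print(changes)
```

A constant or linearly growing series has zero second derivative, so only accelerating growth (or a sudden onset) is flagged.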
Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
MLR threshold function:
locally linear, globally non-linear
Info-Age Bill of Rights
• Get the right information: Search Engines
• To the right people: Personalization
• At the right time: Anticipatory Analysis
• On the right medium: Speech Recognition
• In the right language: Machine Translation
• With the right level of detail: Summarization
MMR vs Current Search Engines
documents
query
MMR
IR
λ controls spiral curl
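Maximal Marginal Relevance (MMR) picks documents one at a time, trading relevance to the query against redundancy with what has already been selected:

$$\mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \Big[ \lambda\, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big]$$

A minimal sketch of the greedy selection follows. It assumes documents and the query are already dense vectors and uses cosine similarity for both Sim_1 and Sim_2; the function names and toy vectors are hypothetical, not from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, lam=0.7, k=None):
    """Greedy MMR: return document indices in selection order."""
    k = len(docs) if k is None else min(k, len(docs))
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(docs[i], query)
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
relevance_only = mmr_rank(query, docs, lam=1.0)
diversified = mmr_rank(query, docs, lam=0.5)
```

With λ = 1 the ranking reduces to plain relevance ordering; lowering λ promotes the dissimilar document ahead of the near-duplicate.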
Types of Machine Translation
[Figure: translation paths from Source (Arabic) to Target (English)]
• Direct: SMT, EBMT (requires massive data resources)
• Transfer Rules
• Interlingua
• Analysis steps: Syntactic Parsing, Semantic Analysis
• Generation steps: Sentence Planning, Text Generation
2005 NIST Arabic-English MT
[Figure: BLEU scores (axis 0.0 to 0.7) for the 2005 NIST Arabic-English MT evaluation. Systems shown: Google, ISI, IBM + CMU, UMD, JHU-CU, Edinburgh, Systran, Mitre, FSC. Quality bands: Expert Human translator, Usable translation, Human Editable translation, Topic Identification, Useless Region]
• Interlingual MT
 Grammars, semantics
 Best for focused domains
• Corpus-Based MT
 Pre-translated text (10-200M words)
 Target language text (100M – 1 Trillion words)
 Best for general MT
• Context-Based MT
 Improved variant of corpus-based MT
 Perfect client for DISC
Arabic Statistical-MT Output
‫ حث مسئولون صينيون وروس جميع االطراف المعنية علي‬/ ‫ شينخوا‬/ ‫ يناير‬17 ‫بكين‬
‫" التزام الهدوء وممارسة ضبط النفس " بشان القضية النووية الخاصة بجمهورية كوريا‬
. ‫الديمقراطية الشعبية‬
‫وقد التقي نائب وزير الخارجية الصيني يانغ ون تشانغ ونائب وزير الخارجية الروسي الكسندر‬
‫لوسيوكوف علي مادبة غداء حيث دعيا االطراف المعنية الي مواصلة السعي من اجل الحل السلمي‬
. ‫من خالل الحوار في ظل الوضع المعقد الحالي‬
Beijing January 17 / Shinhua / the Chinese and Russian officials urged all
parties concerned to " remain calm and exercise restraint " over the
nuclear issue of the Democratic People's Republic of Korea.
He met with vice Chinese foreign minister Yang Chang won the deputy of
the Russian foreign minister Alexander Losyukov at a lunch with invited
interested parties to continue the search for a peaceful solution through
dialogue under the current complicated situation.
BLEU = .64
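For readers unfamiliar with the metric, BLEU combines clipped n-gram precisions p_n (usually n = 1..4) with a brevity penalty: BLEU = BP · exp((1/N) Σ log p_n), where BP = 1 if the candidate is longer than the reference and exp(1 − r/c) otherwise. The sketch below is a simplified illustration only: it scores one tokenized sentence against a single reference with no smoothing, whereas evaluations such as NIST's are computed over a whole test corpus with multiple references. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one tokenized candidate against one reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # the geometric mean collapses without smoothing
        log_precisions.append(math.log(overlap / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "officials urged all parties to remain calm".split()
candidate = "officials urged all parties to stay calm".split()
score = bleu(candidate, reference)
```

An exact match scores 1.0, and a candidate sharing no 4-gram with the reference scores 0.0 under this unsmoothed variant.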
What About Minor Languages or
Dialects without Massive Data?
PROTEINS
(Borrowed from: Judith
Klein-Seetharaman)
Sequence → Structure → Function
Primary Sequence
MNGTEGPNFY
PLNYILLNLA
KPMSNFRFGE
HFIIPLIVIF
SDFGPIFMTI
VPFSNKTGVV
VADLFMVFGG
NHAIMGVAFT
FCYGQLVFTV
PAFFAKTSAV
RSPFEAPQYY
FTTTLYTSLH
WVMALACAAP
KEAAAQQQES
YNPVIYIMMN
LAEPWQFSML
GYFVFGPTGC
PLVGWSRYIP
ATTQKAEKEV
KQFRNCMVTT
AAYMFLLIML
NLEGFFATLG
EGMQCSCGID
TRMVIIMVIA
LCCGKNPLGD
GFPINFLTLY
GEIALWSLVV
YYTPHEETNN
FLICWLPYAG
DEASTTVSKT
VTVQHKKLRT
LAIERYVVVC
ESFVIYMFVV
VAFYIFTHQG
ETSQVAPA
Folding
3D Structure
Complex function within
network of proteins
Normal
PROTEINS
Sequence → Structure → Function
Folding
3D Structure
Complex function within
network of proteins
Disease
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is
very expensive; NMR can only resolve very small proteins
• The gap between the known protein sequences and structures:
 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
 Therefore we need to predict structures in silico
Linked Segmentation CRF
• Nodes: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of y given x is defined as
$$P(y_1,\ldots,y_R \mid x_1,\ldots,x_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big(\sum_k \lambda_k f_k(x_i, y_{i,j})\Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big(\sum_l \mu_l\, g_l(x_i, x_a, y_{i,j}, y_{a,b})\Big)$$
Joint Labels
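The L-SCRF definition above is a log-linear model: each node contributes a weighted sum of node features, each edge a weighted sum of edge features, and Z normalizes over all joint labelings. The sketch below illustrates only that structure on a toy graph. Everything in it is an assumption for illustration: one label per node rather than segmentations, made-up feature functions and weights, and a partition function computed by brute-force enumeration, which is feasible only for tiny graphs.

```python
import itertools
import math


def log_score(labels, obs, edges, node_feats, edge_feats, theta_node, theta_edge):
    """Unnormalized log-probability: weighted node features plus weighted edge features."""
    total = 0.0
    for v, y in enumerate(labels):
        total += sum(w * f(obs[v], y) for w, f in zip(theta_node, node_feats))
    for u, v in edges:
        total += sum(
            w * g(obs[u], obs[v], labels[u], labels[v])
            for w, g in zip(theta_edge, edge_feats)
        )
    return total


def conditional_prob(labels, obs, edges, label_set, node_feats, edge_feats,
                     theta_node, theta_edge):
    """P(y | x), with Z computed by enumerating every labeling (toy graphs only)."""
    args = (obs, edges, node_feats, edge_feats, theta_node, theta_edge)
    z = sum(
        math.exp(log_score(list(assignment), *args))
        for assignment in itertools.product(label_set, repeat=len(obs))
    )
    return math.exp(log_score(labels, *args)) / z


# Hypothetical toy problem: three nodes, labels "H" or "S".
obs = ["h", "h", "s"]
edges = [(0, 1), (1, 2)]
node_feats = [lambda x, y: 1.0 if x == y.lower() else 0.0]    # label agrees with observation
edge_feats = [lambda xu, xv, yu, yv: 1.0 if yu == yv else 0.0]  # linked nodes share a label
p = conditional_prob(["H", "H", "S"], obs, edges, ["H", "S"],
                     node_feats, edge_feats, [2.0], [0.5])
```

Realistic protein graphs need approximate inference, because the number of joint labelings grows exponentially with the number of nodes.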
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
fMRI to observe human
brain activity
Machine learning to discover
patterns in complex data
Data
New discoveries about human brain function
Our algorithms have learned to distinguish, with 90% accuracy,
whether a human subject is reading a word about,
e.g., ‘tools’ or ‘buildings’
Requisite Infrastructure
• Data Intensive SuperComputing (DISC) for
tera-scale and peta-scale data repositories
• Advanced algorithms research
 Massively-parallel decomposition
 Scalability in analytics & learning
 Extracting compact models for run-time
 Planning, reasoning, learning w/uncertainty
 Active Learning (maximally reducing uncertainty)
• Domain expertise (e.g. proteomics, neural
sciences, astronomy, network security, …)
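As one concrete reading of the active learning item in the list above, the sketch below uses uncertainty sampling: query the label of the unlabeled example whose predicted class distribution has the highest entropy. This is a common heuristic swapped in for illustration, not necessarily the method the slide has in mind, and the prediction values are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_query(predictions):
    """Index of the unlabeled example with the most uncertain prediction."""
    return max(range(len(predictions)), key=lambda i: entropy(predictions[i]))


# Hypothetical class-probability predictions for three unlabeled examples.
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
next_to_label = select_query(predictions)
```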
System Comparison: Data
DISC
 System collects and maintains data
• Shared, active data set
 Computation colocated with storage
• Faster access
Conventional Supercomputers
 Data stored in separate repository
• No support for collection or management
 Brought into system for computation
• Time consuming
• Limits interactivity
Program Model Comparison
DISC
[Figure: software stack, top to bottom: Application Programs, Machine-Independent Programming Model, Runtime System, Hardware]
 Application programs written in terms of high-level operations on data
 Runtime system controls scheduling, load balancing, …
Conventional Supercomputers
[Figure: software stack, top to bottom: Application Programs, Software Packages, Machine-Dependent Programming Model, Hardware]
 Programs described at very low level
• Specify detailed control of processing & communications
 Rely on small # of software packages
• Written by specialists
• Limits classes of problems & solution methods
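To make the DISC side of this comparison concrete, the sketch below separates an application written as high-level operations on data from a runtime that decides how work is partitioned and scheduled. The map/reduce style, the `run_job` function, and the word-count application are all assumptions chosen for illustration; the slides do not prescribe a particular programming interface.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(records, mapper, reducer, workers=4):
    """Toy runtime: partitions the data, schedules map tasks, then reduces by key."""
    chunks = [records[i::workers] for i in range(workers)]

    def map_chunk(chunk):
        return [pair for record in chunk for pair in mapper(record)]

    # The runtime, not the application, decides how chunks are scheduled.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))

    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Application code: only high-level operations, no scheduling or communication.
def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    return sum(counts)


word_counts = run_job(["a b a", "B c"], word_mapper, count_reducer)
```

The application functions never mention threads, partitions, or machines, so the same code could in principle run on a different runtime without change.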
Final Thoughts
• Opportunities in Computational Intelligence
 Machine learning for tough problems: relevant novelty
detection, structural learning, active learning
 Scientific applications: Computational X (X=biology,
linguistics, astrophysics, chemistry, …)
• Next generation computational infrastructure
 DISC principle (beyond HPC, beyond grid, …)
 Algorithmic fundamentals
• International programs (on common problems)