Introduction to Chemoinformatics & Computer

Hot Topics in Chemoinformatics
in the Pharmaceutical Industry
David J. Wild, Ph.D.
Scientific Computing Consultant, and
Adjunct Professor of Pharmaceutical Engineering at the
University of Michigan
david@wild-ideas.org
www.WildIdeasConsulting.com
About me
B.Sc Computer Science
Ph.D. Chemoinformatics (Willett Lab)
Worked for 5 years in Scientific
Computing leadership at Pfizer,
responsible for the development of
computational tools for scientists
Now run a consulting firm based in
Ann Arbor, Mich., and am also an
Adjunct Professor at the
University of Michigan.
doing some research
Wild Ideas Consulting
www.WildIdeasConsulting.com
University of Michigan
www-personal.engin.umich.edu/~wildd
What we’ll cover today
• Overview of early-stage drug discovery and the big
industry concerns
• Using information and technology together to improve
the chances of finding a new drug
• Example – High Throughput Screening
• Some other examples of “hot” areas
– Genomics & Proteomics Information Handling
– Virtual Screening
– Combinatorial Chemistry
– Design of scientific software
Characteristics of the pharmaceutical
industry
• Very segmented market – largest company (Pfizer) only
has an 11% market share
• High risk, long term – takes 10-20 years to develop a
drug, and most drugs fail to get to market
• Highly regulated (by FDA)
• High profit margins for drugs which do make it
• Investors traditionally expect high return on investment
• Four main phases: discovery, development, clinical trials
and marketing
R&D spending up, new drugs down
Taken from http://www.newscientistjobs.com/biotech/ernstyoung/blues.jsp
Drug Discovery & Development
• Identify disease
• Isolate protein involved in disease (2-5 years)
• Find a drug effective against disease protein (2-5 years)
• Preclinical testing (1-3 years)
• Human clinical trials (2-10 years)
• Formulation & Scale-up
• FDA approval (2-3 years)
Impact of new technology on drug discovery
• The last few years have seen a number of
“revolutionary” new technologies:
– Gene chips, genomics and HGP
– Bioinformatics & Molecular biology
– More protein structures
– High-throughput screening & assays
– Virtual screening and library design
– Docking
– Combinatorial chemistry
– In-vitro ADME testing
– Other computational methods
• How do we make it all work for us?
[Figure: new technologies placed along the discovery stages Identify disease, Isolate protein, Find drug, Preclinical testing]
• GENOMICS, PROTEOMICS & BIOPHARM.: potentially producing many more targets and “personalized” targets
• HIGH THROUGHPUT SCREENING: screening up to 100,000 compounds a day for activity against a target protein
• VIRTUAL SCREENING: using a computer to predict activity
• COMBINATORIAL CHEMISTRY: rapidly producing vast numbers of compounds
• MOLECULAR MODELING: computer graphics & models help improve activity
• IN VITRO & IN SILICO ADME MODELS: tissue and computer models begin to replace animal testing
There is little “hard data” on using the new
technologies
• In a sense, the drug design process is becoming a big
experiment
• Do we continue as before, and carefully introduce new
technologies as we deem appropriate, or do we radically
change the way things are done?
• Lots of pressure for the new technologies to yield results
quickly
• How do we measure the results?
Some questions being asked
• Is our increasing spending on R&D and new
technologies really going to pay off? Or was it a red
herring?
• Is the paucity of drugs in the pipeline because we’re not
doing things right, or are we just hitting limits on the
number of major diseases with potential treatments still
to be found? (“all the low-hanging fruit has gone”)
• Should we be looking in new areas (e.g. “life
enhancement” rather than “life saving” or “quality of
life”)?
What’s being done
• Trying to get the right attrition (= drugs dropping out of
the pipeline). Aim to increase early-stage attrition and
reduce late-stage attrition
• Risk analysis – look ideally for low-risk, high-payoff
drugs
• Using metrics to monitor successes and failures
Analyzing risk
[Figure: 2 x 2 grid of risk against payoff]
High risk, low payoff    |  High risk, high payoff
Low risk, low payoff     |  Low risk, high payoff
Using metrics to monitor improvement
• Split the discovery process into
discrete units, with key
decisions at the end of each
unit.
• Come up with measurable
properties that can be used to
gauge success
• Look for good and bad
decisions and why they were
made
Stage                 |  Decision Point
Target exploration    |  Go with this target?
HTS                   |  Was the screen successful?
HTS Analysis          |  Follow up these 5-10 series
Series Followup       |  Produce 2-3 lead compounds
ADME study            |  Are compounds safe?
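The idea of attaching measurable properties to each stage can be sketched in a few lines of Python. This is only an illustration: the stage names come from the table above, but the counts of candidates entering and passing each stage are invented placeholders, and `attrition_rate` is a hypothetical helper rather than an established industry metric.

```python
# Hypothetical counts of items entering and passing each stage.
# These numbers are placeholders for illustration, not real data.
stage_counts = [
    ("Target exploration", 20, 8),
    ("HTS", 8, 6),
    ("HTS Analysis", 6, 3),
    ("Series Followup", 3, 2),
    ("ADME study", 2, 1),
]


def attrition_rate(entered: int, passed: int) -> float:
    """Fraction of items that dropped out at a stage."""
    if entered <= 0:
        raise ValueError("entered must be positive")
    return (entered - passed) / entered


# One metric per stage, rounded for reporting.
attrition_by_stage = {
    stage: round(attrition_rate(entered, passed), 2)
    for stage, entered, passed in stage_counts
}

for stage, rate in attrition_by_stage.items():
    print(f"{stage:20s} attrition = {rate:.2f}")
```

Tracked over several projects, a table like this would show whether attrition really is moving towards the early stages.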
Summary
• The pharmaceutical industry is a high-risk industry with
very long development times and short product
lifespans
• There has been a lot of investment in new technologies
for early stage drug discovery, but so far these are not
resulting in more drug candidates (or profits)
• Companies are looking at ways to address this problem
including managing attrition, risk analysis and metrics.
How Chemoinformatics can help out
• Producing and managing information for metrics
• In-silico analysis to reduce risk, e.g.
– Virtual screening
– Library design
– Docking
– Cost/benefit analyses
• Making information available at the right time and the
right place
• Needs to be integrated into processes
An example: High-Throughput
Screening
Screening perhaps millions of compounds in a corporate collection
to see if any show activity against a certain disease protein
High-Throughput Screening
• Drug companies now have millions of samples of chemical
compounds
• High-throughput screening can test 100,000 compounds a day for
activity against a protein target
• Maybe tens of thousands of these compounds will show some
activity for the protein
• The chemist needs to intelligently select the 2 - 3 classes of
compounds that show the most promise for being drugs to follow-up
Informatics Implications
• Need to be able to store chemical structure and biological data for
millions of datapoints
– Computational representation of 2D structure
• Need to be able to organize thousands of active compounds into
meaningful groups
– Use cluster analysis or machine learning methods to group similar structures
together and relate to activity
• Need to learn as much information as possible from the data (data
mining)
– Apply statistical methods to the structures and related information
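As a rough illustration of grouping similar structures, the sketch below represents each compound as a set of integer "fingerprint" features, compares compounds with the Tanimoto coefficient, and groups them with a simple leader-style clustering. The fingerprints, compound names and the 0.5 threshold are all made up for the example; a real system would derive fingerprints from the 2D structures with a chemistry toolkit and would probably use a more robust clustering method.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared features / total distinct features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_cluster(fps: dict, threshold: float = 0.5) -> list:
    """Assign each compound to the first cluster whose leader is similar
    enough; otherwise the compound starts a new cluster as its leader."""
    clusters = []  # list of (leader_id, [member_ids])
    for cid, fp in fps.items():
        for leader, members in clusters:
            if tanimoto(fp, fps[leader]) >= threshold:
                members.append(cid)
                break
        else:
            clusters.append((cid, [cid]))
    return clusters


# Hypothetical fingerprints: each integer stands for a structural feature.
fingerprints = {
    "cmpd-1": frozenset({1, 2, 3, 4}),
    "cmpd-2": frozenset({1, 2, 3, 5}),
    "cmpd-3": frozenset({7, 8, 9}),
    "cmpd-4": frozenset({7, 8, 9, 10}),
}

clusters = leader_cluster(fingerprints)
for leader, members in clusters:
    print(leader, members)
```

Leader clustering depends on the order in which compounds are presented, which is one reason production tools use other algorithms; it is shown here only because it is short.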
HTS Tools – Tripos SAR Navigator
SAR Navigator is © Tripos, inc., www.tripos.com
BioReason ClassPharmer
• Clusters actives into groups representing series
• Attempts to find a scaffold using MCS algorithm
• Recovers inactives back into series
• Presents series as rows in a “spreadsheet” view
• Gives other statistics on series, such as activity
distribution
• http://www.bioreason.com
BioReason Classpharmer
www.bioreason.com
Strategy for “HTS Triage”
• Run HTS
• Decide which compounds are “active” and which are
“inactive”
• Cluster the actives to put them into series
• Visualize clusters of actives (showing 2D structures) and
pick series of interest
• Identify “scaffold” for each series
• Use similarity or substructure search on inactives to find
inactives related to these series
• Use SAR techniques to discover differences between
actives and inactives in a series
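The triage steps above could be sketched roughly as follows. This is a simplification under stated assumptions: each screening result is assumed to already carry a precomputed `scaffold` label (in practice that would come from clustering and scaffold detection), the activity cutoff of 50.0 is arbitrary, and the compound data are invented for the example.

```python
from collections import defaultdict


def triage(results: list, active_cutoff: float) -> dict:
    """Split results into actives and inactives, group actives into series
    by scaffold, then pull related inactives back into those series."""
    actives = [r for r in results if r["activity"] >= active_cutoff]
    inactives = [r for r in results if r["activity"] < active_cutoff]

    series = defaultdict(lambda: {"actives": [], "inactives": []})
    for r in actives:
        series[r["scaffold"]]["actives"].append(r["id"])

    # Only recover inactives whose scaffold matches a series of actives.
    for r in inactives:
        if r["scaffold"] in series:
            series[r["scaffold"]]["inactives"].append(r["id"])
    return dict(series)


# Hypothetical HTS results (percent inhibition, invented values).
hts_results = [
    {"id": "A1", "scaffold": "indole", "activity": 82.0},
    {"id": "A2", "scaffold": "indole", "activity": 75.0},
    {"id": "A3", "scaffold": "indole", "activity": 12.0},
    {"id": "B1", "scaffold": "quinoline", "activity": 64.0},
    {"id": "C1", "scaffold": "pyridine", "activity": 5.0},
]

series = triage(hts_results, active_cutoff=50.0)
for scaffold, members in series.items():
    print(scaffold, members)
```

Each resulting series holds both actives and related inactives, which is the starting point for asking what structural differences separate the two.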
Information generated at different points in
the Drug Design process
Gene chip experiments
Protein structures
Project selection decisions
Assay protocols
HTS results
Series selection decisions
SAR studies
Combinatorial Expts.
Pharmacophores
ADME studies
Lead cmpd decisions
Toxicology studies
Scaleup reactions
Clinical Trials data
Doctor/patient studies
Marketing, surveys, etc
Information generated at different
sites
Distributed goals model
Shared goals model
Information storage breakdowns
• Large amounts of information generated:
– Some is not kept at all
– Some is kept but loses its meaning
• Often data is kept, but not semantics or decisions
– e.g. keep “the HVX2 assay result for this compound was 3.2”,
but not what the assay protocol was, whether the compound was
considered ‘active’, nor whether it was followed up on.
• “Bigger picture” or derivative information is usually not
stored
– E.g. “all the compounds with a tri-methyl group seemed to have
much lower activity for this project”
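One way to picture the difference between keeping a bare number and keeping its meaning is a record type that has explicit slots for the context. The sketch below is hypothetical: the class `AssayResult`, its field names and the compound identifier are invented for illustration, and only the "HVX2 result of 3.2" comes from the example above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class AssayResult:
    """An assay value together with the context needed to interpret it."""
    compound_id: str
    assay_name: str
    value: float
    units: Optional[str] = None
    protocol_ref: Optional[str] = None       # where the protocol is documented
    considered_active: Optional[bool] = None  # the decision that was made
    followed_up: Optional[bool] = None        # what happened next
    notes: list = field(default_factory=list)  # "bigger picture" observations

    def missing_context(self) -> list:
        """Names of context fields that were never recorded."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# The bare number from the slide: value kept, semantics lost.
bare = AssayResult(compound_id="cmpd-001", assay_name="HVX2", value=3.2)
print(bare.missing_context())
```

A check like `missing_context()` could be run when data is registered, so that gaps are noticed while the people who know the answers are still on the project.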
Information access breakdowns
• Some information is only available in one physical
location
• Some information is only available within one part of the
discovery process
• Often information is not “contextualized” for use
outside a particular domain
• Someone knows exactly which piece of information they
need, and that piece of information exists, but they don’t
know how to access it.
– E.g. what system to use, what Oracle table it’s in, or even
whether that piece of information exists at all!
“Missed opportunities”
• Not a specific breakdown, but if the right piece of
information had been available at the right time, better
decisions could have been made
• E.g.
– A group of compounds is being followed up as potential drugs,
but a rival company just applied for a patent on the compounds
– A large amount of money is being spent developing an HTS
assay for a target, but marketing research shows any drug is
unlikely to be a success
– A group of compounds is selected from an HTS as good
candidates for follow up, but 20 years ago they were followed up
for a similar project and had severe solubility problems
Information use breakdowns
• The meaning of data is incorrectly interpreted
• A single piece of information is used, whilst using a
wider range of information would lead to different
conclusions
• Lessons learned from one project are incorrectly applied
to another
• “Fuzzy” information is taken as concrete information
What do we do?
• No large company has really solved the problem
• But ongoing attempts include:
– Defining information produced and needed at each stage of the
discovery process
– Improving processes to be more consistent, especially across
different sites
– Improving information flow between departments and sites
– Harmonizing terminology across disciplines and sites
– Defining needed “management information” as well as raw data
– Looking for “quick win” opportunities
• This will presumably impact what is stored in databases
and what software is used
– Oracle Chemistry Cartridges help
Some Other Examples
Genomics & Proteomics Information Handling
Virtual Screening
Combinatorial Chemistry
Design of scientific software
Genomics & Proteomics Information Handling
Understanding the link between diseases,
genetic makeup and expression of proteins
Genomics
• Genomics is fast-forwarding our understanding of how DNA, genes, proteins
and protein function are related, in both normal and disease conditions
• Human genome project has mapped the genes in human DNA
• Hope is that this understanding will provide many more potential protein
targets
• Allows potential “personalization” of therapies
[Figure: paired DNA strands ATACGGAT / TATGCCTA, linked to functions]
Gene Chips
• “Gene chips” allow us to look for changes in protein
expression for different people with a variety of
conditions, and to see if the presence of drugs changes
that expression
• Makes possible the design of drugs to target different
phenotypes
[Figure: expression profile (screen for 35,000 genes) against people / conditions (e.g. obese, cancer, caucasian) and compounds administered]
“Chemogenomics” from Vertex
Video: http://www.vrtx.com/Chemogenonone.html
Virtual Screening
• Build a computational model of activity for a particular target
• Use model to score compounds from “virtual” or real libraries
• Use scores to decide which to make, or pass through a real screen
Computational Models of Activity
• Machine Learning Methods
– E.g. Neural nets, Bayesian nets, SVMs, Kohonen nets
– Train with compounds of known activity
– Predict activity of “unknown” compounds
• Scoring methods
– Profile compounds based on properties related to target
• Fast Docking
– Rapidly “dock” 3D representations of molecules into 3D representations of
proteins, and score according to how well they bind
Present molecules to model
• We may want to virtually screen
– All of a company’s in-house compounds, to see which to screen first
– A compound collection that could be purchased
– A potential combinatorial chemistry library, to see if it is worth making,
and if so which to make
• Model will come out with either a prediction of how well each
molecule will bind, or a score for each molecule
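A minimal sketch of the "score, then decide" step might look like this. It uses a very crude profile score that counts how many properties fall inside target ranges. The property names, the ranges and the three virtual compounds are all assumptions made up for the example; a real model would be a trained machine learning model or a docking score.

```python
def profile_score(props: dict, target_ranges: dict) -> int:
    """Count how many properties fall inside their target range."""
    return sum(
        1
        for name, (low, high) in target_ranges.items()
        if name in props and low <= props[name] <= high
    )


def virtual_screen(library: dict, target_ranges: dict, top_n: int) -> list:
    """Score every compound and return the ids of the top_n best scorers.
    Ties are broken alphabetically by compound id."""
    scored = [(profile_score(p, target_ranges), cid) for cid, p in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cid for _, cid in scored[:top_n]]


# Hypothetical target profile and virtual library (invented values).
target_ranges = {"mol_weight": (200.0, 500.0), "logp": (0.0, 5.0)}
virtual_library = {
    "v1": {"mol_weight": 350.0, "logp": 2.1},
    "v2": {"mol_weight": 620.0, "logp": 6.3},
    "v3": {"mol_weight": 410.0, "logp": 5.8},
}

selected = virtual_screen(virtual_library, target_ranges, top_n=2)
print(selected)
```

The same ranking step applies whether the library is in-house, purchasable or purely virtual; only the source of the compounds and the scoring model change.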
Combinatorial Chemistry
• By combining molecular “building blocks”, we can create very large
numbers of different molecules very quickly.
• Usually involves a “scaffold” molecule, and sets of compounds
which can be reacted with the scaffold to place different structures
on “attachment points”.
Example Combinatorial Library
[Figure: scaffold containing an NH group, with three attachment points R1, R2 and R3]
“R”-groups
R1 = OH, OCH3, NH2, Cl, COOH
R2 = phenyl, OH, NH2, Br, F, CN
R3 = CF3, NO2, OCH3, OH, phenoxy
For this small library, the number
of possible compounds is
5 x 6 x 5 = 150
[Figure: example compounds built from the scaffold]
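The count for the example library can be checked by enumerating every combination of R-groups. The sketch below uses the R1, R2 and R3 lists from the example library and treats each compound simply as a tuple of substituent labels; it does not build real chemical structures.

```python
from itertools import product

# R-group lists for the three attachment points of the example library.
r1_groups = ["OH", "OCH3", "NH2", "Cl", "COOH"]
r2_groups = ["phenyl", "OH", "NH2", "Br", "F", "CN"]
r3_groups = ["CF3", "NO2", "OCH3", "OH", "phenoxy"]

# Every compound is one choice of R1, one of R2 and one of R3.
virtual_library = list(product(r1_groups, r2_groups, r3_groups))

print(len(virtual_library))   # 5 x 6 x 5 = 150
print(virtual_library[0])     # ('OH', 'phenyl', 'CF3')
```

Adding one more R-group to any list multiplies the library size, which is why the choice of R-groups has to be made carefully.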
Combinatorial Chemistry Issues
• Which R-groups to choose
• Which libraries to make
– “Fill out” existing compound collection?
– Targeted to a particular protein?
– As many compounds as possible?
• Computational profiling of libraries can help
– “Virtual libraries” can be assessed on computer
Design of Scientific Software
Problems with scientific software tend to occur because
of deficiencies in one of three areas:
Software Relevance
Software Usability
Software Management
Software Relevance
• To be able to make software relevant requires the
software designer to understand:
– the science, i.e. the domain
– the scientific computing techniques that are used in the domain
– the possibilities and limitations of software development.
• Even with this, it’s hard to match the things we can do
with the things that people want or need to do
• Techniques like personas and contextual inquiry simply
help us understand the people who use the software,
their goals, and tasks they want to do
Software relevance:
Bridge between computation & science
[Figure: chemoinformatics on one side and science on the other, with a “?” between them]
• chemoinformatics tools: clustering, sim. searching, activity models, scaffold detection, docking, logp calculation
• chemoinformatics tasks: “doing a cluster analysis”, “identifying activity-related fragments”
• science goals: e.g. produce compounds that have high biological activity
• science tasks: work out a chemical synthesis, choose good reagents, try and document some reactions
Software Usability
• We tend to focus on the method and the science, but not
how easy it is for people to get their job done using the
software
• Programmers tend to make software intuitive for them,
but not necessarily the people it is designed for
• A usability lab and other techniques can make a HUGE
difference to the satisfaction of users and programmers
alike!
Software Management
• Disparate set of tools & platforms
• Disparate programming styles, languages
• A variety of people tend to be writing software
– Trained software developers
– Enthusiastic scientists
– Scientific computing specialists
• Focus on the science tends to mean software management is
neglected
• Everyone hates traditional software management “rules”
• But there are ways of making everything work better and having
more fun doing it!
• Have a recommended basic setup that should help a lot
Foundation reading
• “The Inmates Are Running the Asylum” by Alan Cooper
• “Contextual Design” by Hugh Beyer and Karen
Holtzblatt
• “Usability Engineering” by Jakob Nielsen
• “The Visual Display of Quantitative Information” by
Edward Tufte
• “Don’t Make Me Think!” by Steve Krug
• See also, www.WildIdeasConsulting.com
Summary
• R&D in the pharmaceutical industry is undergoing a lot
of technological changes, and there is pressure to make
the investment pay off
• There is a big need to sensibly use the large amounts of
chemical and biological-related information produced in
the process
• Thoughtful use of chemoinformatics methods and
software is becoming crucial to the success of drug
discovery