BIOINFORMATICS

REMINDERS
• 2nd Exam on Nov. 17
• Coverage:
  – Central Dogma of DNA
    • Replication
    • Transcription
    • Translation
  – Cell structure and function
  – Recombinant DNA technology and molecular biology
  – Protein analysis

BIOINFORMATICS
• Study of the structure of biological information and biological systems
• Integrates theories and tools of mathematics/statistics, computer science and information technology
• Involves the use of hardware and software to study vast amounts of biological data

What is Bioinformatics?
• The field of science in which biology, computer science, and information technology merge to form a single discipline
• Application of information technology to the storage, management and analysis of biological information
• Facilitated by the use of computers

FUNCTIONS
• Data Management
  – Storage
  – Retrieval
• Data Analysis
*Literature/Bibliography, Sequence, Structure, Taxonomy, Expression, etc.

BIOLOGICAL DATABASES
• Systematic data storage/retrieval
• Maintained on a regular basis
• Can contain various types of data (integration)
  – Sequence
  – Structure
  – Other pertinent information
• Nucleotides and proteins are most common

DATABASES
• A large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system
• Biological databases usually consist of the nucleic acid sequences of the genetic material of various organisms, as well as protein sequences and structures

DATABASES
• e.g. a nucleotide sequence database typically contains information such as
  – contact name
  – the input sequence with a description of the type of molecule
  – the scientific name of the source organism from which it was isolated
• Additional requirements
  – easy access to the information
  – a method for extracting only that information needed to answer a specific biological question

DATABASES
• Sequence
  – GenBank, European Nucleotide Archive (ENA) and DNA Data Bank of Japan (DDBJ); managed by the International Nucleotide Sequence Database Collaboration (INSDC)
  – UniGene
  – Saccharomyces Genome Database (SGD)
  – UniProtKB (UniProtKB/Swiss-Prot or UniProtKB/TrEMBL)
  – ExPASy

DATABASES
• Structure
  – Nucleic Acid Database (NDB)
  – Protein Data Bank (PDB)
  – Worldwide Protein Data Bank (wwPDB)
  – ExPASy

DATA MINING
• Process by which testable hypotheses are created regarding the function/structure of a gene/protein of interest through identifying similar sequences in “more established” organisms
• Tools:
  – Text-term search
  – Sequence similarity search

Machine Learning
• Studies methods and the design of computer programs based on past experience
• Why?
  – New methods are being introduced
  – Old ones should be improved

“Units” of Information
• DNA (genome)
• RNA (transcriptome)
• Protein (proteome)

What is Being Analyzed?
• Sequence
• Structure
• Interactions
• Pathways
• Mutations/Evolution

Why?
• Increasing amount of biological information entails
  – Organization
  – Archiving
  – Global unification/harmonization
• More biological discoveries
  – Functional/Structural similarities
  – Phylogenetic/Evolutionary patterns

Applications
• Medicine
• Pharmaceuticals
• Biotechnology
• Agriculture

STRUCTURE DATABASES

Molecular Data
• When you draw a molecule,
  – You start with atoms
  – Then proceed with the structure
  – And the three-dimensional data
• What can be stored?
  – Coordinates
  – Sequences
  – Chemical graphs
    • Atoms and bonds

Databases
• Protein Data Bank (PDB)
• Molecular Modeling Database (MMDB)

Techniques in the Laboratory
• X-ray Crystallography
• Nuclear Magnetic Resonance

Formats
• PDB
• mmCIF
• MMDB

Structure Viewers
• Cn3D
• RasMol
• WebMol
• Mage
• VRML
• CAD
• Swiss PDB Viewer

Promises of bioinformatics
• Medicine
  – Knowledge of protein structure facilitates drug design
  – Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up
  – Genome analysis allows the targeting of genetic diseases
  – The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
• The same techniques can be applied to biotechnology, crop and livestock improvement, etc.

Challenges in bioinformatics
• Explosion of information
  – Need for faster, automated analysis to process large amounts of data
  – Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels, etc.)
  – Need for “smarter” software to identify interesting relationships in very large data sets
• Lack of “bioinformaticians”
  – Software needs to be easier to access, use and understand
  – Biologists need to learn about the software, its limitations, and how to interpret its results

SEQUENCE ALIGNMENT

Two or More Sequences
• Measure similarity
• Determine correspondences between residues
• Find patterns of conservation
• Derive evolutionary relationships

Alignment
• Correspondences of nucleotides/amino acids in two or more sequences are assigned
• An assignment of correspondences that preserves the order of the residues within the sequences is an alignment
  – Gaps are used to achieve this
• Sequence alignment refers to the identification of residue-residue correspondences

Uses
• Homology
  – Similarities
  – “Ancestry”
• Genome annotation
  – Assigning structure and function to genes
• Database queries
  – For newly-discovered/unknown sequences

Tools
• Dot Plots
  – Diagonal lines of dots showing similarities between two sequences
• Scoring Matrices
  – Score reflects quality of each possible alignment; best possible score is identified
  – Scoring scheme is crucial
  – PAM (Point Accepted Mutation) and BLOSUM (BLOCKS Substitution Matrix)
• Dynamic Programming
  – Algorithmic technique that reuses previous computations

Scoring
• Penalties/Scores
  – Match (e.g. A – A)
  – Mismatch (e.g. A – C)
  – Gap (e.g. A – _)
• Linear Gap Penalty: Uniform
• Affine Gap Penalty: Gap Existence vs. Gap Extension

Local vs. Global Alignments
• Global Alignment
  – Similarities between majority of two sequences
• Local Alignment
  – Similarities between specific parts of two sequences

Programs
• Pairwise Sequence Alignment
  – BLAST
  – VAST
  – FASTA
• Multiple Sequence Alignment
  – MAFFT

Needleman-Wunsch Algorithm
• Can be used for global alignments
• Maximum-value function
• A simple scoring scheme is assumed
• Three steps
  – Initialization
  – Matrix fill (scoring)
  – Traceback (alignment)
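
The three Needleman-Wunsch steps listed above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example, since the slides only say that a simple scoring scheme is assumed.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b (illustrative scoring values)."""
    n, m = len(a), len(b)

    # Step 1: initialization. The first row and column hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 2: matrix fill. Each cell takes the maximum of its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Step 3: traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
    print(total)
    print(top)
    print(bottom)
```

Several alignments can share the same optimal score; which one is returned depends on the order in which the traceback checks the three moves (diagonal first in this sketch).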
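
The difference between the linear and affine gap penalties listed under Scoring can be made concrete with a short sketch. The function names and the default costs below are hypothetical, and conventions vary between tools: here the affine penalty charges the opening cost for the first gap position and the extension cost for each additional position, which is one common convention but not the only one.

```python
def linear_gap_penalty(length, per_position=1):
    """Uniform cost: every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Opening a gap is expensive; extending an existing gap is cheap.

    Assumed convention: the first position costs gap_open and each further
    position costs gap_extend.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for length in (1, 2, 5):
        print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

Under the affine scheme one long gap costs less than several short gaps of the same total length, so alignments tend to group missing residues into fewer, longer gaps.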
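
The dot plots listed under Tools can be sketched in a few lines as well. In this minimal version (the function name `dot_plot` and the `*`/`.` characters are choices made for the example) a cell is marked whenever the two residues are identical; practical dot-plot programs usually add a sliding window and a threshold to suppress noise, which is omitted here.

```python
def dot_plot(a, b):
    """Return one text row per residue of a; '*' marks a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    # Similar regions show up as diagonal runs of '*'.
    for row in dot_plot("GATTACA", "GATTTACA"):
        print(row)
```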
