FW4089: Bioinformatics (3 credits)
FW5089: Tools of Bioinformatics (4 credits)

Time: Every Tuesday and Thursday, 9:35 am to 10:55 am (3 hours)
Place: Forestry, Room No. 139

Note: Presentation of the class paper will be arranged for sometime in early April 2006. The final exam will be held sometime in the week of April 24 to 28, 2006.

Instructor: Shekhar Joshi (C. P. Joshi), Associate Professor of Plant Molecular Genetics, SFRES
Room 168, Forestry, Phone: 487-3480 (cpjoshi@mtu.edu)
Office hours: 9 am to 6 pm except when I teach this class!
Teaching assistants: Shiv T. and Frank Xu (FMGB graduate students)

Course Description

The main purpose of this course is to provide extensive hands-on experience in using a variety of Bioinformatics tools. In the future you can extend that knowledge to other fields of biology such as genomics, molecular phylogenetics, and biotechnology. You will not write Bioinformatics programs but will use the available ones for extensive sequence analysis.

Why was this course proposed?

A number of sequence analysis packages and databases are currently available from commercial sources as well as public web sites. In our day-to-day molecular biology research, we use some of these programs and databases to analyze the significance of the new genetic information that we obtain. But it is not always easy to choose the correct approach or the appropriate tool. Databases are growing at a very fast pace and new questions are constantly popping up. Moreover, genomics is a new and exciting field of biotechnology that has recently witnessed many conceptual and technical advances. The ability to make sense of this information explosion will make our students more competitive in the current job markets in academia and industry. There is no doubt that this knowledge will be extremely valuable for living in this century.
FW4089/5089 Tools of Bioinformatics

GENERAL TEXTBOOKS (Optional reading material)

1) Genes VII, Benjamin Lewin, 2000, Oxford University Press
2) Molecular Biology, Robert F. Weaver, 1999, McGraw-Hill Press
3) Bioinformatics, David W. Mount, 2001, CSH Press

All these books provide only supplemental material for the course and may be available at the MTU Book Store or in the library. Reading materials for the topics being covered in the class will be provided. Although there is no specific prerequisite for this class, it is advisable to have taken at least one of the following and to have some background in genomics and bioinformatics:

BL4030: Molecular Biology
FW4087/5087: Plant Molecular Genetics

THIS COURSE WILL NOT TEACH YOU HOW TO WRITE PROGRAMS.

Bioinformatics Reference Books available in the MTU Library

Guide to Human Genome Computing (Second Edition) by M. J. Bishop. Call No. QH445.3 .G85 1998
Bioinformatics: The Machine Learning Approach by P. Baldi and S. Brunak. Call No. QH506 .B35 1998
Sequence Analysis in Molecular Biology by G. von Heijne. Call No. QP551 .H43 1987
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Call No. QP620 .B576 1998
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology by Dan Gusfield. Call No. QA76.9 .A43 G87 1997
Introduction to Computational Biology by Michael S. Waterman. Call No. QH438.4 .M33W38 1995
Calculating the Secrets of Life by Eric Lander and Michael Waterman. Call No. QH438.4 .M3 C35 1995
Some internet addresses where Bioinformatics information is available:

National Center for Biotechnology Information (GenBank): http://www.ncbi.nlm.nih.gov/
Genetics Computer Group: http://www.GCG.com
Protein analysis: http://www.expasy.ch
Celera Genomics: http://www.celera.com

GRADING SYSTEM

Grade Scale
100 - 95  =  A   Excellent
 94 - 90  =  AB  Very Good
 89 - 85  =  B   Good
 84 - 80  =  BC  Above Average
 79 - 75  =  C   Average
 74 - 70  =  CD  Below Average
 69 - 60  =  D   Inferior
Below 60  =  F   Failure

Course Points
Homework, quizzes etc.  = 30%
Mid-term Exam 1         = 30%
Final Exam              = 30%
Class Participation     = 10%

Exams: The midterm and the cumulative final will be worth 100 points.
Class Paper = One credit for FW5089

Jobs! Jobs! Jobs!

Current job trends: http://www.sloan.org/programs/scitech_page1.html
Jobs in Genomics: http://www.genomejobs.com
See also Science and Nature for job ads.

Bioinformatics is a young science, but the information explosion has created demand for more people in academia and industry. It is easy to find either a molecular biologist or a computer scientist, but the job of a bioinformatician needs both. A biologist who can compute and a computer scientist who can make sense of biological data are hot commodities. Supply and demand! This is what I heard, but do not quote me anywhere!

MS in Bioinformatics: 60-100K
Ph.D. in Bioinformatics: 80-100K or higher

Not all CS people find the money that attractive! But those who are interested in the topic do very well in this field. Biologists face new challenges and questions every day, and CS is providing the answers. True collaboration! Having this course listed on your CV will help your job prospects.

Plagiarism - What It Is and How to Avoid It!
Adapted from notes prepared by Ron Gratz
http://www.bio.mtu.edu/campbell/bl4820/intro/plagiarism.htm

Scientists do not work in isolation from each other.
Attendance at scientific meetings exposes us to the work of our colleagues and allows for the free exchange of ideas. Reading the published literature in our fields is vital for all scientists, who must keep themselves current with what is being done in other laboratories. Scientists continually refer to the work of their colleagues, and most scientific research is based at least in part on ideas derived from others. Review articles and textbooks are often wholly based on already published work. It is thus necessary for you as developing scientists to learn how to properly use previously reported knowledge.

While a free flow of ideas and information is vital to scientific progress, it also presents avenues for fraud, particularly plagiarism. Plagiarism can be defined as "Taking the ideas from another and passing them off as one's own" (Webster's New World Dictionary) and is unacceptable under any circumstances. Despite this universal disapproval, it is one of the more common faults in student papers. In some cases it is downright dishonesty brought on by laziness, but more often it is a lack of experience in how to properly use material taken from another source.

To avoid plagiarism you must not only properly attribute the ideas of another but must also either paraphrase what the original author said or wrote, or enclose that person's exact words in quotation marks. To use another's exact words with attribution but without quotation marks implies that the ideas belong to the original source but that the words are your own.

Besides being dishonest, copying another’s work defeats the purpose of your education. Writing about the subject you are studying is a great way to learn. Ideas become more firmly implanted in your memory if you have to think about them and then write a coherent statement using them. Copying another’s work prevents you from learning, which is the whole purpose of your education.
Whenever the words or ideas of another individual are used, proper attribution must be given. In other words, you must give credit for those ideas and words to their originator. Not to do so is a clear case of plagiarism. Plagiarism in classwork may result in a failing grade or even expulsion from the university. Plagiarism in professional work may result in dismissal from an academic position, being barred from publishing in a particular journal or from receiving funds from a particular granting agency, or even a lawsuit and criminal prosecution.

In a review article, the author attempts to summarize all of the pertinent work done in a particular field of study. The goal is generally twofold: (1) to report what has been done and what has been learned; and (2) to use this knowledge to generate general conclusions based on these previous works. The author of a review article must be able to present the cited work accurately and be able to synthesize new ideas from this work. In order to accurately represent the work of others and at the same time avoid plagiarism, the author of a review will often paraphrase the statements made in the cited work. The problem for many students, and some professional scientists, is that they do not know how to properly paraphrase another's words.

Several general rules for paraphrasing that are relevant for students learning to master this skill are:

1. You should change both the sentence structure and the non-technical terms in order to avoid plagiarism.
2. You can also avoid plagiarism by altering the sequence of subject matter within and between sentences.
3. Don't paraphrase technical terms unless you are certain of their exact meaning and can provide an exact equivalent.
4. Accredit the original author within the group of sentences using his/her work.
FW4089 and FW5089: Bioinformatics questionnaire

Your name:
ID number:
Department:
Graduate student/Undergraduate:
Name of advisor if graduate student:
Motivation for taking this course:
Previous experience with Unix, GCG or other sequence analysis packages:
What do you expect to get out of this course?
Have you understood the problems of plagiarism?  Yes  No
Do you know what my office hours are?  Yes  No
Are you clear about grading policy?  Yes  No

First QUIZ of Plant Bioinformatics
Date: January 10, 2006

Write one-line answers to as many questions as possible in the next 45 minutes. Feel free to refer to books, the web, etc. This will not be counted towards your grade. I just want to know where you stand with your molecular biology background:

1. DNA stands for
2. RNA stands for
3. DNA is made up of
4. RNA is made up of
5. What is the difference between deoxyribose sugar and ribose sugar?
6. What are the different types of nitrogen bases in DNA?
7. What are the different types of nitrogen bases in RNA?
8. What is the difference between purines and pyrimidines?
9. Name two purines and three pyrimidines.
10. Which purine pairs with which pyrimidine? State the number of H bonds between each pair.
11. What are the differences between DNA and RNA?
12. What are transcription and translation?
13. What is the central dogma in molecular biology?
14. What is reverse transcription?
15. What is a prokaryote?
16. What is a eukaryote?
17. What are the differences between prokaryotes and eukaryotes?
18. What is a genome?
19. What is genomics?
20. How many genomes are present in viruses, prokaryotes, plants and animals? Where?
21. What is bioinformatics?
22. What is the biological (binomial) name for humans?
23. How big is the human genome?
24. How many chromosomes are there in a human diploid and haploid cell?
25. How are human genes arranged in the genome?
26. How many human genes are there?
27. What proportion of the human genome is made up of genes?
28. What is a gene?
29. Why are eukaryotic genes said to be split?
30. How does DNA replicate? Conservatively or semi-conservatively? What is the difference?
31. How does DNA make RNA?
32. How many types of RNA are produced in a cell?
33. How many of these RNAs are said to be protein coding?
34. What is pre-mRNA? Is it present in bacteria?
35. What are the three main steps in pre-mRNA processing?
36. What are the 5' leader and 3' trailer sequences in pre-mRNA?
37. What is the difference between exons and introns?
38. How are introns spliced out?
39. Why are introns there?
40. How is the transcription process regulated in prokaryotes?
41. How is the transcription process regulated in eukaryotes?
42. What are the TATA box and the AATAAA box?
43. What is a transcription factor?
44. Why is TFIID said to be a commitment factor?
45. What is a transcription start site?
46. What is polyadenylation? Why is it an important biological process? Is it present in bacteria?
47. Describe the process of polyadenylation.
48. Define “protein”. In what alternative forms are proteins present in a cell?
49. How many types of amino acids are typically present? Name five amino acids. What are their 3-letter and 1-letter codes?
50. How is the code present in DNA used to make proteins?
51. Do you believe that the genome is life’s instruction book? Why?
52. If you have a disease gene (what does that mean?), do you always get the disease?
53. What is a mutation? Name a few types of mutations.
54. What are the translation start and stop sites?
55. What is tRNA?
56. What is rRNA?
57. What is a ribosome?
58. What is the genetic code? Who discovered it? (Bonus)
59. Is the genetic code universal? What does it tell us about our evolution?
60. Why is the code said to be made up of triplets?
61. What is codon bias?
62. What is the wobble hypothesis?
63. Who discovered the structure of DNA?
64. What is reverse transcription? Who discovered it?
65. Do you believe that viruses are the most evolved organisms? If yes, why? If not, why not?
66. What are mitosis and meiosis?
67. What are the main steps in mitosis? How many cells are produced at the end of one cycle of mitosis?
68. What are the main steps in meiosis? How many cells are produced at the end of one cycle of meiosis?
69. What is recombination?
70. Do bacteria recombine?
71. What is DNA sequencing? Who discovered it?
72. What are dideoxynucleotides? Why are they important in sequencing?
73. How can you sequence a gene?
74. Why is a DNA sequence written on only one line when it is double stranded?
75. Which DNA strand is always denoted when writing a gene sequence?
76. How can you derive which protein a gene encodes by just looking at a gene sequence? (BONUS)

Bioinformatics and The Human Genome

The human genome is the biggest gift of science to humanity. In 2001 we achieved something that we had only dreamed of for many years. The human genome is just the beginning of our exciting and sometimes fearful journey. Fear of the unknown lurks, but the promise of tomorrow is also bright and vivid.

Sequenced organisms (from Science 291, Feb 2001, p. 1178)

Organism                      Genome size   Year completed   No. of genes
H. influenzae                 1.8 Mb        1995             1,740
S. cerevisiae (yeast)         12.1 Mb       1996             6,034
C. elegans (worm)             97 Mb         1998             19,099
A. thaliana (thale cress)     100 Mb        2000             25,000
D. melanogaster (fruit fly)   180 Mb        2000             13,061
H. sapiens (human)            3000 Mb       2001             35,000-45,000

Rice, poplar, mouse... more than 200 genomes have been sequenced and the list is ever-increasing.

The human genome was a dream for which thousands of scientists worked for over 15 years. Celera and the HGP provided two books for the price of one. Celera achieved it in 3 years but depended heavily on public data. How did we do what we set out to do? That is what is now written in the Science and Nature articles. What it means is still unknown. They say that pages equivalent to 200 New York telephone books would be needed to print the 3 billion bp of genome per cell. But the Internet allows this easily.
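The "Sequenced organisms" table can be turned into a rough gene density (genes per Mb of genome), which shows how sparsely genes are spread in the larger genomes. The following is a minimal sketch, assuming a POSIX sh with awk available; the function name gene_density is hypothetical, and for human the lower bound of the 35,000-45,000 range is used as an assumption.

```shell
#!/bin/sh
# Sketch: gene density (genes per Mb) from the "Sequenced organisms" table.
# Input columns (tab-separated): organism, genome size in Mb, number of genes.
# gene_density is a hypothetical helper name, not a GCG or UNIX command.
gene_density() {
    awk -F'\t' '{ printf "%s\t%.1f\n", $1, $3 / $2 }'
}

# printf reuses its format for every group of three arguments,
# so each organism becomes one tab-separated line.
# The human gene count is the lower bound of the table's range (assumption).
printf '%s\t%s\t%s\n' \
    "H. influenzae"   1.8  1740  \
    "S. cerevisiae"   12.1 6034  \
    "C. elegans"      97   19099 \
    "A. thaliana"     100  25000 \
    "D. melanogaster" 180  13061 \
    "H. sapiens"      3000 35000 | gene_density
```

The bacterium comes out at several hundred genes per Mb and human at roughly a dozen, which is one way to read the drop in coding fraction from small to large genomes.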
Humans were supposed to have 100,000 genes, but it seems that only about 32,000 are possible. Does that make humans less powerful or inadequate in any way? No. “The purpose of science is to find meaningful simplicity in the midst of complexity” (Herbert Simon, Nature 409, 771, 2001). DNA structure and PCR are the best examples. One gene works harder, at many places and many times, so less is better in that crammed nuclear space. Alternative splicing is one way a single gene does more. Human proteins have the same domains as worm proteins, but the way these domains come together is unique.

We will know one day what makes up a human. We are all unique! In sexually reproducing organisms, the entire ensemble of genes in one organism occurs only once: one genotype occurs only once.

There are also some surprises in the human genome!

SNPs accumulate with a specific pattern
Regulatory CpG islands occur more in gene-rich regions than in gene-less regions
TEs (transposable elements) are found in gene-poor regions
Only 1.1-1.5% of the genome is coding, not even the 3% widely estimated earlier
Parts of chromosome 12 in men and chromosome 16 in women are recombination prone
Repetitive DNA is only 40-45%
Humans share 223 genes with bacteria that are absent in the worm, fly and yeast genomes
Did the genome duplicate early on, similar to plants?

We will know how humans develop from the zygote: ontogeny
We will know our phylogeny by looking at ontogeny: molecular archeology
One day we will be able to trace our evolution using the genome information. Genealogy of the human race!

CLASS PAPER (1 credit worth of extra work)

Each of you will select a different gene family from the human genome and write an essay on "How to build a better human?" You will also present your research findings to the class. You may select either a human disease or a trait that you are interested in studying further. Collect all necessary background information and collect the genes associated with your topic. Find the counterparts of your gene of interest in other organisms and develop a phylogenetic tree.
You are expected to use as many of the bioinformatics programs that you learnt in this class as possible to create a comprehensive database of the genes that you have selected.

Important: Provide me with a list of all reference work (printed materials and web site addresses) that you used. Write in your own words. I plan to put your essays and databases on the web, so make sure that you cannot be accused of plagiarism. See the handout for more information on plagiarism.

For FW5089: You have to do one more extra project to earn the fourth credit. I will discuss this separately with you all.

FW4089: How to use GCG in the GIS lab?

Sit at any computer and shake the mouse to wake the computer up. Press Control-Alt-Delete and then enter your username and password (first initial of your first name and first 7 numbers of your ID). Your userids may be the MTU ones.

You will follow this procedure every time you come to class (unless things change in the next few days due to the arrival of a new GCG on the Mango server):

Go to telnet and connect to oak by typing
telnet oak.ffr.mtu.edu
You will get a login window: type your login name and enter your password. You should then see the prompt
oak%
Type
source /gcg/gcgstartup
Then type
gcg
and hit return. You should see the GCG logo! Start using GCG programs!

For GCG manuals go to: http://forestry.mtu.edu/manuals/gcg/index.htm

Tutorial on using Unix: Useful Unix Commands

GCG is unfriendly!! It is not Mac or PC based. Not for distribution. For personal use only.

Login: connect (telnet) to oak, the server where GCG is loaded! Type the password correctly and press Enter. You should see oak%

Logout: Do not forget to log out at the end of the session. Nothing saved will be lost.

Important note: Do not give your username or password to anyone. If someone wants to use it for GCG, ask him or her to contact his or her supervisor and then me. Any unauthorized use will cost you your GCG privileges.
UNIX Commands

UNIX commands are entered at the prompt> and delivered to the system with the <RETURN> key. UNIX commands have a syntax, just like any language; there is a correct order for the words in a command, and MANY incorrect orders. Mix up the order, and UNIX is unlikely to be clever enough to understand what you want it to do! It is a dumb computer! The most general form of UNIX command syntax is

prompt> command -flag(s) argument(s)

(On our server the prompt is oak%)

The command is WHAT you want to do, the -flags help refine the command, saying HOW you want it done, and the arguments tell the OBJECT of the command - the things to be acted upon. UNIX expects all of its commands to be lower-case, though flags and arguments may be a mixture of cases. Remember, UNIX is case-sensitive!

As a trivial example, suppose you wanted to translate the following English request "Would you please quickly shovel the snow in the driveway today?" into UNIX. The translation might look something like

prompt> shovel -quickly -today snow

In fact, given the absence of vowels and longer words from most UNIX commands and flags, the actual command is more likely to be

prompt> sw -f -n snow

where sw is short for shovel, -f is short for fast (=quickly), and -n is short for now (=today).

For a genuine example of a UNIX command, consider

mango% ls -la Dirname

Here, ls is short for list, -l is short for long (=all details), and -a is short for all (=all files, even the hidden ones). Dirname is the name of the directory of files for which you want the listing.

Finally, when using GCG commands in UNIX, there is one important "feature" for the arguments: the case you use for the names of database entries is unimportant, but all filenames must be in lower case and typed or copied and pasted correctly.

Text files

Data on computers (text, programmes, sequences etc.) is held in blocks of information called 'files'.
Different files have different names and/or different locations - and there is a convention that filenames end with a three-letter extension that indicates the type of data held in the file, e.g., .txt for text, .seq for sequences, .pep for peptides, .dat for generic data, etc. Files can be created, deleted, altered, overwritten, moved around, copied, renamed, printed out to a screen or a printer, searched, compared, sorted, counted and transferred over the network to computers on other sites.

Some UNIX commands for file management:

touch filename - create a file [ holding no information! ]
pico filename - edit the file using the pico editor [ use <CTRL> X to exit ]
cp filename newfilename - copy a file to a new file [ retains the old file ]
mv filename newfilename - move (rename) a file to a new file [ the old filename no longer exists ]
cat filename - concatenate (print) a file's contents to the screen
more filename - print a file's contents to the screen, one page at a time [ use <SPACE> to see the next page ]
cat filename1 filename2 > filename3 - concatenate (print) the contents of the first two files into the third
rm filename - remove (delete) the file [ dangerous to use with the wildcard * ]

Exercise DNA Analysis - UNIX 1: create and manage files

Create a file named easyunix.txt
prompt> touch easyunix.txt

(NB: you may use any UNIX text editor you like - pico is probably the simplest but we will use vi today)
prompt> vi easyunix.txt

Edit the file: press i to start inserting text and enter "UNIX is EASY!". Then press <ESC> and type :x (lower case) to save the changes and exit.

Print easyunix.txt to the screen.
prompt> more easyunix.txt

Copy easyunix.txt to the file opinion.txt (How would you do this with cat? Hint!)
prompt> cp easyunix.txt opinion.txt

Rename easyunix.txt to unixcmds.txt
prompt> mv easyunix.txt unixcmds.txt

Edit the file unixcmds.txt with the vi editor. Move down the screen with the arrow cursor keys and type what you now know about UNIX. Exit and save the new changes.
prompt> vi unixcmds.txt

Print unixcmds.txt to the screen to see how clever you have become.
prompt> more unixcmds.txt

Delete opinion.txt.
prompt> rm opinion.txt

Directories

A directory is a group of files or other directories. A directory within another is often called a sub-directory, to reflect this hierarchical organization. Directories can be created, copied, deleted, renamed, searched and transferred over the network to computers on other sites. Files can be moved between or copied among specified directories.

You work in one directory at a time. This is known as the present working directory; the command pwd (print working directory) shows which one it is. The directory you begin with when you log in is your home directory. You can easily return to your home directory from any other directory by giving the UNIX command "cd" with no argument.

Some UNIX commands for directory management:

cd dirname - change to the directory named dirname
cd .. - change to the directory above the present one [ ".." = up ]
cd - change to your home directory [ the default argument for cd is your home directory ]
pwd - print the present working directory
ls - list the files in the present working directory
ls -l - a file list that is longer, more detailed
mkdir subdirname - make (create) a new sub-directory in the present directory
rmdir subdirname - remove (delete) a sub-directory in the present directory [ it must be empty ]
mv filename dirname - move a file into a sub-directory

Exercise: create and manage directories

Create a sub-directory named Unixinfo
prompt> mkdir Unixinfo

Switch your present working directory to the new sub-directory
prompt> cd Unixinfo

Check to see you are there
prompt> pwd

Copy a file from the directory above into your new present working directory (".." is a short form for the directory above, and "." is a short form for the present directory)
prompt> cp ../unixcmds.txt .

Has the file been copied? Because cp copies rather than moves, it should appear in both lists (";" separates the two list commands)
prompt> ls -l .. ; ls -l

Get back to your home directory
prompt> cd
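The file and directory exercises above can be replayed in one go as a minimal sketch. It assumes a POSIX sh with mktemp available. The variable name scratch is hypothetical, echo stands in for the interactive vi session, and the script ends in the scratch directory rather than your home directory so that nothing outside it is touched.

```shell
#!/bin/sh
# Sketch of the two exercises, run inside a throwaway directory.
# "scratch" is a hypothetical name, not part of the course setup.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

touch easyunix.txt                     # create an empty file
echo "UNIX is EASY!" > easyunix.txt    # stand-in for typing the text in vi
cat easyunix.txt > opinion.txt         # the cat answer to the hint: same result as cp
mv easyunix.txt unixcmds.txt           # rename: easyunix.txt no longer exists
more unixcmds.txt                      # print the file to the screen
rm opinion.txt                         # delete the copy

mkdir Unixinfo                         # make a sub-directory
cd Unixinfo || exit 1                  # switch into it
pwd                                    # check where you are
cp ../unixcmds.txt .                   # copy the file from the directory above
ls -l .. ; ls -l                       # unixcmds.txt shows up in both listings
cd "$scratch" || exit 1                # the handout uses plain "cd" to go home
```

Working in a scratch directory means a mistyped rm cannot remove any of your real files; delete the directory when you are done.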