FW4089 and FW5089: Bioinformatics questionnaire

FW4089: Bioinformatics (3 credits)
FW5089: Tools of Bioinformatics (4 credits)
Time: Every Tuesday and Thursday, 9.35 am to 10.55 am (3 hours)
Place: Forestry, Room No. 139
Note: Presentation of class paper will be arranged sometime in early April 2006.
Final exam will be held sometime in the week of April 24 to 28, 2006.
Instructor:
Shekhar Joshi (C. P. Joshi),
Associate Professor of Plant Molecular Genetics, SFRES
Room 168, Forestry, Phone: 487-3480 (cpjoshi@mtu.edu)
Office hours: 9 am to 6 pm except when I teach this class!
Teaching assistants: Shiv T. and Frank Xu (FMGB Graduate students)
Course Description
The main purpose of this course is to provide extensive hands-on experience in
using a variety of Bioinformatics tools, so that in the future you can apply that
knowledge to other fields of biology such as genomics, molecular phylogenetics, and
biotechnology. You will not write Bioinformatics programs but will use the available
ones for extensive sequence analysis.
Why was this course proposed?
A number of sequence analysis packages and databases are currently available from
commercial sources as well as public web sites. In our day-to-day molecular
biology research, we use some of these programs and databases to analyze the
significance of the new genetic information that we obtain. But it is not always easy
to choose the correct approach or the appropriate tool. Databases are growing at a very
fast pace and new questions are constantly popping up. Moreover, genomics is a
new and exciting field of biotechnology that has recently witnessed many conceptual
and technical advances. The ability to make sense of this information explosion will
make our students more competitive in the current job markets in
academia and industry. There is no doubt that this knowledge will be extremely
valuable for living in this century.
FW4089/5089 Tools of Bioinformatics
GENERAL TEXTBOOKS (Optional Reading material)
1) Genes VII
Benjamin Lewin, 2000, Oxford University Press
2) Molecular Biology
Robert F. Weaver, 1999, McGraw-Hill Press
3) Bioinformatics
David W. Mount, 2001, CSH Press
All these books will provide only supplemental material for the
course and may be available at the MTU Book Store or in the
library.
Reading materials for the topics being covered in the class will
be provided.
Although there is no specific prerequisite for this class, it is
advisable to have taken at least one of the following and have
some background in genomics and bioinformatics:
BL4030: Molecular Biology
FW4087/5087: Plant Molecular Genetics
THIS COURSE WILL NOT TEACH YOU HOW TO WRITE PROGRAMS.
Bioinformatics Reference Books available in the MTU Library
• Guide to Human Genome Computing (Second Edition) by M. J. Bishop
Call No. QH445.3 .G85 1998
• Bioinformatics: The Machine Learning Approach by P. Baldi and S. Brunak
Call No. QH506 .B35 1998
• Sequence Analysis in Molecular Biology by G. von Heijne
Call No. QP551 .H43 1987
• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
by R. Durbin, S. Eddy, A. Krogh, G. Mitchison
Call No. QP620 .B576 1998
• Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology by Dan Gusfield
Call No. QA76.9 .A43 G87 1997
• Introduction to Computational Biology by Michael S. Waterman
Call No. QH438.4 .M33W38 1995
• Calculating the Secrets of Life by Eric Lander and Michael Waterman
Call No. QH438.4 .M3 C35 1995
Some internet addresses where Bioinformatics information is available:
National Center for Biotechnology Information (GenBank):
http://www.ncbi.nlm.nih.gov/
Genetics Computer Group: http://www.GCG.com
Protein analysis: http://www.expasy.ch
Celera Genomics: http://www.celera.com
GRADING SYSTEM
Grade Scale
100 - 95 = A   Excellent
94 - 90  = AB  Very Good
89 - 85  = B   Good
84 - 80  = BC  Above Average
79 - 75  = C   Average
74 - 70  = CD  Below Average
69 - 60  = D   Inferior
Below 60 = F   Failure
Course Points
Homework, quizzes, etc. = 30%
Mid-term Exam 1 = 30%
Final Exam = 30%
Class Participation = 10%
Exams: The midterm and the cumulative final will be worth 100 points.
Class Paper = One Credit for FW5089
Jobs! Jobs! Jobs!
Current Job trends: http://www.sloan.org/programs/scitech_page1.html
Jobs in Genomics: http://www.genomejobs.com
See also Science and Nature for Job ads.
Bioinformatics is a young science, but the information explosion has
created demand for more people in academia and industry. It is easy to find either a
molecular biologist or a computer scientist, but the job of a bioinformatician
needs both. A biologist who can compute and a computer scientist who can
make sense of biological data are hot commodities.
Supply and demand!
This is what I heard but do not quote me anywhere!
MS in Bioinformatics: 60-100K
Ph.D. in Bioinformatics: 80-100K or higher
Not all CS people find that money attractive! But those who are
interested in the topic do very well in this field. Biologists face new challenges and
questions every day, and CS is providing the answers.
True collaboration!
Having this course listed in your CVs will help in your job prospects.
http://www.bio.mtu.edu/campbell/bl4820/intro/plagiarism.htm
Plagiarism - What It Is and How to Avoid It!
Adapted from Notes prepared by Ron Gratz
Scientists do not work in isolation from each other. Attendance at scientific meetings
exposes us to the work of our colleagues and allows for the free exchange of ideas.
Reading the published literature in our fields is vital for all scientists, who must keep
themselves current with what is being done in other laboratories. Scientists continually
refer to the work of their colleagues and most scientific research is based at least in part
on ideas derived from others. Review articles and textbooks are often wholly based on
already published work. It is thus necessary for you as developing scientists to learn how
to properly use previously reported knowledge.
While a free flow of ideas and information is vital to scientific progress, it also presents
avenues for fraud, particularly plagiarism. Plagiarism can be defined as "Taking the ideas
from another and passing them off as one's own" (Webster's New World Dictionary) and
is unacceptable under any circumstances. Despite this universal disapproval, it is one of
the more common faults with student papers. In some cases, it is a case of downright
dishonesty brought on by laziness, but more often it is a lack of experience in how to
properly use material taken from another source.
To avoid plagiarism you must not only properly attribute the ideas of another but must
also either paraphrase what the original author said or wrote or you must enclose that
person's exact words in quotation marks. To use another's exact words with attribution
but without quotation marks implies that the ideas belong to the original source but that
the words are your own. Besides being dishonest, copying another’s work defeats the
purpose of your education. Writing about the subject you are studying is a great way to
learn. Ideas become more firmly implanted in your memory if you have to think about
them and then write a coherent statement using them. Copying another’s work prevents
you from learning, which is the whole purpose of your education.
Whenever the words or ideas of another individual are used, proper attribution must be
given. In other words, you must give credit for those ideas and words to their originator.
Not to do so is a clear case of plagiarism. Plagiarism in classwork may result in a failing
grade or even expulsion from the university. Plagiarism in professional work may result
in dismissal from an academic position, being barred from publishing in a particular
journal or from receiving funds from a particular granting agency, or even a lawsuit and
criminal prosecution.
In a review article, the author attempts to summarize all of the pertinent work done in a
particular field of study. The goal is generally twofold: (1) to report what has been done
and what has been learned; and (2) to use this knowledge to generate general conclusions
based on these previous works. The author of a review article must be able to present the
cited work accurately and be able to synthesize new ideas from this work. In order to
accurately represent the work of others and at the same time avoid plagiarism, the author
of a review will often paraphrase the statements made in the cited work.
The problem for many students, and some professional scientists, is that they do not know
how to properly paraphrase another's words. Several general rules for paraphrasing that
are relevant for students learning to master this skill are:
1. You should change both the sentence structure and the non-technical terms in order to
avoid plagiarism.
2. You can also avoid plagiarism by altering the sequence of subject matter within and
between sentences.
3. Don't paraphrase technical terms unless you are certain of their exact meaning and can
provide an exact equivalent.
4. Credit the original author within the group of sentences that uses his/her work.
FW4089 and FW5089: Bioinformatics questionnaire
Your name:
ID number:
Department:
Graduate student/Undergraduate:
Name of Advisor if Graduate student:
Motivation for taking this course:
Previous experience with Unix, GCG or other sequence analysis packages
What do you expect to get out of this course?
Have you understood the problems of plagiarism? Yes No
Do you know what my office hours are? Yes No
Are you clear about grading policy? Yes No
First QUIZ of Plant Bioinformatics
Date: January 10, 2006
Write one-line answers to as many questions as possible in the next 45 minutes. Feel free to
refer to books, the web, etc. This will not be counted towards your grade. I just want to know
where you stand with your molecular biology background:
1. DNA stands for
2. RNA stands for
3. DNA is made up of
4. RNA is made up of
5. What is the difference between Deoxyribose sugar and ribose sugar?
6. What are the different types of nitrogen bases in DNA?
7. What are the different types of nitrogen bases in RNA?
8. What is the difference between purines and pyrimidines?
9. Name two purines and three pyrimidines.
10. Which purine pairs with which pyrimidine? State the number of H bonds
between each pair.
11. What are the differences between DNA and RNA?
12. What are transcription and translation?
13. What is the central dogma of molecular biology?
14. What is reverse transcription?
15. What is a prokaryote?
16. What is a Eukaryote?
17. What are the differences between prokaryotes and eukaryotes?
18. What is a genome?
19. What is genomics?
20. How many genomes are present in viruses, prokaryotes, plants and animals?
Where?
21. What is bioinformatics?
22. What is the biological name for humans (binomial)?
23. How big is the human genome?
24. How many chromosomes are there in a human diploid and haploid cell?
25. How are human genes arranged in the genome?
26. How many human genes are there?
27. What proportion of the human genome is made up of genes?
28. What is a gene?
29. Why are eukaryotic genes said to be split?
30. How does DNA replicate? Conservatively or semi-conservatively? What is the
difference?
31. How does DNA make RNA?
32. How many types of RNA are produced in a cell?
33. How many of these RNAs are said to be protein coding?
34. What is pre-mRNA? Is it present in bacteria?
35. What are the main three steps in pre-mRNA processing?
36. What are the 5’ leader and 3’ trailer sequences in pre-mRNA?
37. What is the difference between exons and introns?
38. How are introns spliced off?
39. Why are introns there?
40. How is the transcription process regulated in prokaryotes?
41. How is the transcription process regulated in eukaryotes?
42. What is a TATA box and AATAAA box?
43. What is a transcription factor?
44. Why is TFIID said to be a commitment factor?
45. What is a transcription start site?
46. What is polyadenylation? Why is it an important biological process? Is it present
in bacteria?
47. Describe the process of polyadenylation.
48. Define “protein”. In what alternative forms are proteins present in a cell?
49. How many types of amino acids are typically present? Name five amino acids.
What are their 3-letter and 1-letter codes?
50. How is the code present in DNA used to make proteins?
51. Do you believe that the genome is life’s instruction book? Why?
52. If you have a disease gene (what does that mean), do you always get the disease?
53. What is a mutation? Name a few types of mutations.
54. What are the translation start and stop sites?
55. What is tRNA?
56. What is rRNA?
57. What is a ribosome?
58. What is the genetic code? Who discovered it? (Bonus)
59. Is the genetic code universal? What does it tell us about our evolution?
60. Why is the code said to be made up of triplets?
61. What is codon bias?
62. What is the wobble hypothesis?
63. Who discovered the structure of DNA?
64. What is reverse transcription? Who discovered it?
65. Do you believe that viruses are the most evolved organisms? If yes, why? If not, why
not?
66. What are mitosis and meiosis?
67. What are the main steps in mitosis? How many cells are produced at the end of
one cycle of mitosis?
68. What are the main steps in meiosis? How many cells are produced at the end of
one cycle of meiosis?
69. What is recombination?
70. Do bacteria recombine?
71. What is DNA sequencing? Who discovered it?
72. What are dideoxynucleotides? Why are they important in sequencing?
73. How can you sequence a gene?
74. Why is a DNA sequence written as only one line when DNA is double stranded?
75. Which DNA strand is always denoted when writing a gene sequence?
76. How can you derive which protein a gene encodes by just looking at a gene
sequence? (BONUS).
Bioinformatics and The Human Genome
The human genome is the biggest gift of science to humanity.
In 2001 we achieved something that we had only dreamed of for many years.
The human genome is just the beginning of an exciting and sometimes fearful journey. Fear
of the unknown lurks, but the promise of tomorrow is also bright and vivid.
Sequenced organisms (from Science 291, Feb 2001, p. 1178)
Organism                      Genome size   Year completed   No. of genes
H. influenzae                 1.8 Mb        1995             1,740
S. cerevisiae (yeast)         12.1 Mb       1996             6,034
C. elegans (worm)             97 Mb         1998             19,099
A. thaliana (thale cress)     100 Mb        2000             25,000
D. melanogaster (fruit fly)   180 Mb        2000             13,061
H. sapiens (human)            3000 Mb       2001             35,000-45,000
Rice… poplar… mouse… More than 200 genomes have been sequenced and the list is ever-increasing.
The human genome was a dream for which thousands of scientists worked for over 15 years.
Celera and the HGP provided two books for the price of one. Celera achieved it in 3 years but
depended heavily on public data. How did we do what we set out to do? That is what is now
written in the Science and Nature articles.
What it means is still unknown.
They say that the equivalent of 200 New York telephone books would be needed to print
the 3 billion bp of genome per cell. But the Internet allows this easily.
Humans were supposed to have 100,000 genes, but it seems that only about 32,000 are likely.
Does that make humans less powerful or inadequate in any way?
No. “The purpose of science is to find meaningful simplicity in the midst of complexity”
(Herbert Simon; Nature 409, 771, 2001). DNA structure and PCR are the best examples.
• One gene works harder, at many places and many times, so fewer genes work better in that
crammed nuclear space: alternative splicing.
• Human proteins have the same domains as worm proteins, but the way these domains
come together is unique.
• We will know one day what makes up a human.
• We are all unique! In sexually reproducing organisms, the entire ensemble of
genes comes together in one organism only once. One genotype occurs only once.
There are also some surprises in the human genome!
• SNPs accumulate with a specific pattern.
• Regulatory CpG islands occur more in gene-rich regions than in gene-poor regions.
• Transposable elements (TEs) are found in gene-poor regions.
• Only 1.1-1.5% of the genome is coding, not even the 3% widely estimated earlier.
• Parts of chromosome 12 in men and chromosome 16 in women are recombination
prone.
• Repetitive DNA is only 40-45%.
• Humans share 223 genes with bacteria that are absent from the worm, fly and yeast
genomes.
• Did the genome duplicate early on, similar to plants?
• We will know how humans develop from a zygote: ontogeny.
• We will know our phylogeny by looking at ontogeny: molecular archeology.
• One day we will be able to trace our evolution using the genome information.
• Genealogy of the human race!
CLASS PAPER (1 credit worth of extra work)
Each of you will select a different gene family from the human genome and write an essay on
How to build a better human?
You will also present your research findings to the class. You may select either a human
disease or a trait that you are interested in studying further. Collect all necessary
background information and collect genes associated with your topic. Find the
counterparts of your gene of interest in other organisms and develop a phylogenetic tree.
You are expected to use as many of the bioinformatics programs that you learnt in
this class as possible to create a comprehensive database of the genes that you have selected.
Important: Provide me with a list of all reference works (printed materials and web site
addresses) that you used. Write in your own words. I plan to put your essays and
databases on the web, so watch out that you are not accused of plagiarism. See the handout for
more information on plagiarism.
For FW5089: You have to do one more project to earn the fourth credit. I will
discuss this separately with you all.
FW4089: How to use GCG in the GIS lab?
Sit at any computer and shake the mouse to wake the computer up. Press
Ctrl-Alt-Delete and then
enter your username and password (first initial of your first name and first 7 numbers of
your id).
Your userids may be the MTU ones.
You will follow this procedure every time you come to class (unless things
change in the next few days with the arrival of GCG on the mango server):
Go to telnet and connect with oak by typing
telnet oak.ffr.mtu.edu
You will get a login window: type your login name and enter your password. You should then see oak%
Type source /gcg/gcgstartup
Then type gcg
and hit return.
You should see the GCG logo!
Start using GCG programs!
For GCG manuals go to:
http://forestry.mtu.edu/manuals/gcg/index.htm
Tutorial on using Unix:
Useful Unix Commands: GCG is unfriendly!! It is not Mac or PC based.
Not for distribution. For personal use only.
Login: connect (telnet) to oak, the server where GCG is loaded!
Type the password correctly and press Enter.
You should see oak%
Logout: Do not forget to log out at the end of the session. Nothing saved will be lost.
Important note: Do not give your username or password to anyone. If someone wants to
use it for GCG, ask him or her to contact his or her supervisor and then me. Any
unauthorized use will cost you the loss of GCG privileges.
UNIX Commands
UNIX commands are entered at the prompt> and delivered to the system with the
<RETURN> key.
UNIX commands have a syntax, just like any language; there is a correct order for the
words in a command, and MANY incorrect orders. Mix up the order, and UNIX is
unlikely to be clever enough to understand what you want it to do! It is a dumb
computer!
The most general form of UNIX command syntax is
prompt> command -flag(s) argument(s)
On oak, the prompt is oak%
The command is WHAT you want to do, the -flags help refine the command, saying
HOW you want it done, and the arguments tell the OBJECT of the command - the things
to be acted upon.
UNIX expects all of its commands to be lower-case, though flags and arguments may be a
mixture of cases. Remember, UNIX is case-sensitive!
As a trivial example, suppose you wanted to translate the following English request
"Would you please quickly shovel the snow in the driveway today?"
into UNIX. The translation might look something like
prompt> shovel -quickly -today snow
In fact, given the absence of vowels and longer words from most UNIX commands and
flags, the actual command is more likely to be
prompt> sw -f -n snow
where sw is short for shovel, -f is short for fast (=quickly), and -n is short for now
(=today).
For a genuine example of a UNIX command, consider
mango% ls -la Dirname
Here, ls is short for list, -l is short for long (=all details), and -a is short for all (=all files,
even the hidden ones). Dirname is the name of the directory of files for which you want
the listing.
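To see what the flags actually change, here is a minimal sketch you can try in a throwaway directory. The file names are made up for illustration, and it assumes the mktemp command is available on your system (it is on most current UNIX and Linux machines, but that is an assumption, not something specific to oak or mango).

```shell
# Make a scratch directory so none of your own files are involved
# (assumes mktemp is installed; the file names below are examples only)
demo=$(mktemp -d)
touch "$demo/visible.txt" "$demo/.hidden.txt"

# Plain ls skips names that begin with a dot
plain=$(ls "$demo")
echo "$plain"

# -a adds the hidden entries, including "." and ".."
all=$(ls -a "$demo")
echo "$all"

# Flags can be combined: -la means the same as -l -a
ls -la "$demo"

# Clean up the scratch directory
rm -r "$demo"
```

Note that the flags come before the argument, exactly as in the general form "command -flag(s) argument(s)".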
Finally, when using GCG commands in UNIX, there is one important "feature" for
the arguments; the case you use for the names of database entries is unimportant,
but all filenames must be in lower case and typed or copied and pasted correctly.
Text files
Data on computers (text, programmes, sequences etc.) is held in blocks of information
called 'files'.
Different files have different names and/or different locations - and there is a convention
that filenames end with a three-letter extension that indicates the type of data
held in the file, e.g., .txt for text, .seq for sequences, .pep for peptides, .dat for generic
data, etc.
Files can be created, deleted, altered, overwritten, moved around, copied, renamed,
printed out to a screen or a printer, searched, compared, sorted, counted and transferred
over the network to computers on other sites.
Some UNIX commands for file management:
touch filename - create a file [ holding no information! ]
pico filename - edit the file using the pico editor [ use <CTRL> X to exit ]
cp filename newfilename - copy a file to a new file [ retains the old file ]
mv filename newfilename - move (rename) a file to a new file [ deletes the old file ]
cat filename - concatenate (print) a file's contents to the screen
more filename - print a file's contents to the screen, one page at a time [ use
<SPACE> to see the next page ]
cat filename1 filename2 > filename3 - concatenate (print) the contents of the first two
files into the third
rm filename - remove (delete) the file [ dangerous to use with the wildcard * ]
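The file commands above can be tried together in a short session. This is only a sketch: the file names and the short sequences in them are invented, and it again assumes mktemp is available so that everything happens in a scratch directory.

```shell
# Work in a scratch directory (assumes mktemp is installed)
work=$(mktemp -d)
cd "$work"

# Two small example files (hypothetical names and contents)
echo "ATGGCC" > part1.seq
echo "TTAGGA" > part2.seq

# cat with ">" writes both files, in order, into a third file
cat part1.seq part2.seq > joined.seq
cat joined.seq

# cp keeps the old file; mv does not
cp joined.seq backup.seq
mv joined.seq final.seq

# Preview what a wildcard matches BEFORE giving it to rm
ls part*.seq
rm part*.seq

# Only backup.seq and final.seq should be left
remaining=$(ls)
echo "$remaining"
```

Listing a wildcard with ls first is a cheap habit that shows exactly which files rm is about to delete.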
Exercise DNA Analysis - UNIX 1: create and manage files
Create a file named easyunix.txt
prompt> touch easyunix.txt
(NB: you may use any UNIX text editor you like - pico is
probably the simplest but we will use vi today)
prompt> vi easyunix.txt
Edit the file: press i to start inserting text and enter "UNIX is EASY!". Press <Esc>, then type :x to save the changes and exit.
Print easyunix.txt to the screen.
prompt> more easyunix.txt
Copy easyunix.txt to the file opinion.txt (How would you do this with cat? Hint!)
prompt> cp easyunix.txt opinion.txt
Rename easyunix.txt to unixcmds.txt
prompt> mv easyunix.txt unixcmds.txt
Edit the file unixcmds.txt with vi editor. Move down the screen with the arrow cursor
keys and type what you now know about UNIX. Exit and save the new changes.
prompt> vi unixcmds.txt
Print unixcmds.txt to the screen to see how clever you have become.
prompt> more unixcmds.txt
Delete opinion.txt.
prompt> rm opinion.txt
Directories
A directory is a group of files or other directories. A directory within another is often
called a sub-directory, to reflect this hierarchical organization.
Directories can be created, copied, deleted, renamed, searched and transferred over the
network to computers on other sites. Files can be moved between or copied among
specified directories.
You work in one directory at a time. This is known as the present working directory. The
directory you begin with when you login is your home directory.
pwd - print the present working directory
You can easily return to your home directory from any other directory by giving the
UNIX command "cd" with no argument.
Some UNIX commands for directory management:
cd dirname - change to the directory named dirname
cd .. - change to the directory above the present one [ ".." = up ]
cd - change to your home directory [ the default argument for cd is your home
directory ]
ls - list the files in the present working directory
ls -l - a file list that is longer, more detailed
mkdir subdirname - make (create) a new sub-directory in the present directory
rmdir subdirname - remove (delete) a sub-directory in the present directory
mv filename dirname - move a file into a sub-directory
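Here is a minimal sketch of these directory commands working together. The names Results and notes.txt are made up, and mktemp is assumed to be available. It also shows one detail worth knowing: rmdir only removes a directory that is empty.

```shell
# Scratch area so your home directory is not touched (assumes mktemp)
top=$(mktemp -d)
cd "$top"

touch notes.txt
mkdir Results            # make a new sub-directory
mv notes.txt Results     # second argument is a directory, so the file moves into it

cd Results
here=$(pwd)              # full path of the present working directory
echo "$here"
ls -l

cd ..                    # back up one level

# rmdir refuses to remove a directory that still holds files
rmdir Results || echo "Results is not empty yet"
rm Results/notes.txt
rmdir Results            # now it succeeds
```

Emptying the directory first and then using rmdir is slower than a recursive delete, but it makes it hard to throw away files by accident.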
Exercise: create and manage directories
Create a sub-directory named Unixinfo
prompt> mkdir Unixinfo
Switch your present working directory to the new sub-directory
prompt> cd Unixinfo
Check to see you are there
prompt> pwd
Move a file from the directory above into your new present working directory (".." is a
short form for the directory above, and "." is a short form
for the present directory)
prompt> mv ../unixcmds.txt .
Has the file moved? It should occur only in the second list (";" separates the two list
commands)
prompt> ls -l .. ; ls -l
Get back to your home directory
prompt> cd