Amsterdam, September 5th, 2007
Practicals course ‘Genomics’
Important:
This individual exercise is compulsory and must be carried out independently of others.
Put everything in a Word document with ‘Assignment 2’, the date, your name and student number at the top. Save your document as: as2-student number. Send your results, including the figures, as an email attachment to rob.van.spanning@falw.vu.nl before the 14th of September 2007.
Dear Genomics students of the Free University in Amsterdam.
An unknown disease has caused many casualties. Our researchers were able to isolate some of the DNA of the organism that causes the disease. Start with this sequence and continue with the exercise. At the end you will know what caused the disease and, more importantly, you will have improved your skills in using the BLAST algorithm and learned about the key terminology and parameters of the search algorithm.
You can paste your answers to the questions in this form.
Now here is the sequence:
CATCATCTATTTCCTTTTCGATCAAAACGGTAAAAATACCGGTAACCGAAC
CTTGTATGGAATTACCACAGCTTTCCACCAAACGAGAAAGAGCCAACGAT
ACTCCTATCGGAATCCCCCCTTTAATTGGAGTCCCATCCAACATGGCCTGA
GCACGAGCAAATCGTGTTAAAGAGTCAAAATAAAGTACTACTTTCTTACC
CTGTTCCATATAGTAACGAGCTATTGAAACAGCAACCAAACCAGATCGAA
CTTTTTCTAACGGGTTAGCTTCTGATGTTGAAACTATAGTGATTGATTTTCT
GATTACC
Go to BLAST (http://www.ncbi.nlm.nih.gov:80/BLAST/) and run BLASTX (and format the request) with your DNA sequence, so that its translations are used to retrieve protein sequences from the protein databases that resemble the one encoded by your DNA.
Why should you use the BlastX program?
BLASTX compares a nucleotide query sequence, translated in all six reading frames, against a protein sequence database. Protein sequences are more conserved than nucleotide sequences, so searching at the protein level is more sensitive.
Paste here the top 4 sequences of significant alignments:
NP_800848, ZP_01487160, ZP_01947878, YP_001154606
Which is the most likely sequence, what do the E and Score values tell you?
Most likely: putative ATPase YscN [Vibrio parahaemolyticus], NP_800848.
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences, so a higher score and a lower E value indicate a more significant match.
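The exponential relation between score and E value can be made concrete with the commonly used bit-score form E = m * n * 2^(-S'), where S' is the bit score and m and n are the (effective) lengths of the query and the database. The sketch below is only an illustration of that formula; the function name and the lengths used in the example are hypothetical, not values from this search.

```python
def expect_value(bit_score, query_length, database_length):
    """E value from a bit score: E = m * n * 2**(-S').

    query_length (m) and database_length (n) stand for the effective
    lengths of the search space; the values used below are assumptions.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search space: 100-residue query, 1e9-residue database.
    for score in (20, 40, 60):
        print(score, expect_value(score, 100, 1_000_000_000))
```

Every additional bit of score halves the E value, and for a fixed score a larger database gives a proportionally larger E value.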
Go to the alignment of the best hit by clicking on the ‘score’ number.
There you can move to the full protein record by clicking on the corresponding 4-letter protein code.
Write down the accession number, protein name (full and 4-letter code), protein sequence, organism and the locus tag (information at CDS, bottom of the page).
Click on the accession number (NP_800848). Organism: Vibrio parahaemolyticus, locus tag: VPA1338.
Now use BLAST to look for similar sequences in the database, this time with the full protein sequence. Which BLAST program do you use now, and why?
Format your BLAST request.
Since we now have the full protein sequence, we use protein BLAST (blastp). This program compares an amino acid query sequence against a protein sequence database.
Copy and paste the 4 best hits.
NP_800848, ZP_01487160, ZP_01947878, YP_001050343
Obviously the best hit came from your sequence.
You might see a blue marked G at the right of the best hit. Where does it link to?
It links to the gene information, which is much more detailed (e.g. genomic region, genomic context and bibliography).
Does the information at the ‘G’-link still refer to your original information?
Here you can find links to more information about your gene. For instance, have a look at the general gene information, and study its role and function via this link. The protein is part of a larger complex. What is the name and function of the complex?
The protein (a flagellum-specific ATPase) is part of the Type III secretion system (a protein complex), which transports components across the cell membrane (functional category: intracellular trafficking and secretion).
Have a look at the genomic context and write down the chromosome number and locus tags of the adjacent genes.
Location is Chromosome II, VPA1337 & VPA1339.
Also look at the PubMed link, which describes the genome sequence of your organism in the context of its virulence.
What is the title of this paper? Go to the abstract, try to understand the contents and paste it in your report.
Genome sequence of Vibrio parahaemolyticus: a pathogenic mechanism distinct from that of V cholerae.
Look in PubMed for related articles and describe the role of your protein in the infection process.
The type III secretion system (TTSS) is an apparatus used by several Gram-negative pathogenic bacteria to secrete and translocate virulence factor proteins into the cytosol of eukaryotic cells. The ATPase is thought to be the energizer of the secretion machine.
Any idea now how infection took place?
Most people become infected by eating raw or undercooked shellfish, particularly oysters. Less commonly, this organism can cause an infection in the skin when an open wound is exposed to warm seawater.
How can you treat the disease?
Treatment of gastroenteritis with oral rehydration is usually sufficient because the illness is usually mild and self-limited.
In more severe cases: antimicrobial therapy or antibiotics.
(from: Centers for Disease Control and Prevention)
With your locus tag name you can search in the genome sequence of your organism (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome), then link to Bacteria/chromosomes and then to the chromosome of the organism.
On the page with the genomic region of interest, click for the sequence viewer presentation.
In the protein coding genes options, there is a link to the ID of your gene. What information do you get when you follow this link?
What is the COG number?
The locus tag was VPA1338. The corresponding GeneID is 1192034; clicking on this link gives a detailed report of the gene, including transcripts, products, etc.
What does COG mean and what kind of information can you get at the COG page?
Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
Now go back again to the BLAST homepage. Get your protein sequence again, and do a BLAST search, but now limit your search to the mammalian sequences (advanced options).
What are the 4 best hits? What do you conclude from this?
The protein used is still NP_800848. When limiting to mammalian sequences, among the best 4 hits we find (F1) ATPase/ATP synthase beta subunits (nucleotide-binding domain).
Are your query protein and database proteins homologues, orthologues, paralogues, analogues, similar or identical, and discuss your choice.
This result matches quite well with those from bacteria. It seems that these proteins are homologues (and more precisely orthologues, since they have similar functional roles in different species).
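Of the terms in this question, identity and similarity are quantities that can be measured on an alignment, whereas homology is an inference about common ancestry. As an illustration of the measurable part, here is a minimal Python sketch that computes percent identity for two sequences that are already aligned (gaps written as "-"), counting identical columns over the full alignment length, similar to the "Identities" line in a BLAST alignment. The function name and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue.

    Both inputs must be pre-aligned strings of equal length, with '-' for gaps.
    Gap columns count toward the alignment length but never as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Made-up aligned fragments, not taken from the actual BLAST output.
    print(percent_identity("ACDE", "ACDF"))
```

A high percent identity supports, but does not by itself prove, that two proteins are homologous; whether they are orthologues or paralogues depends on their evolutionary history.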
Now look at the protein record of the first hit to your sequence. Write down species and protein information.
The first hit is the ‘ATP synthase beta subunit’ from human, with accession number AAA51808.
What is the function of the mammalian protein?
And what was the function of your query protein?
The function of the mammalian protein is ATP synthesis; the function of the query protein is ATP hydrolysis (ATPase activity).
Go to PubMed and try to find information about the common ancestor and evolution of the molecular machines that share these two types of protein.
Discuss your findings in the context of recent public discussions about intelligent design versus Darwinian evolution.
Rob van Spanning and colleagues