Lecturer's notes

This document illustrates a number of bioinformatic computer programs and what the
user interface might look like.
I keep a bookmark list of sites I have used at
http://biochem.uthscsa.edu/~hs_lab/frames/bkmk.html
NCBI Entrez.
The National Center for Biotechnology Information (NCBI) is the main repository for DNA and protein
sequence analysis in the US (http://www.ncbi.nlm.nih.gov). Its main interface for looking up
sequences (and many other things) is called Entrez.
A list of the databases searchable by Entrez at NCBI
A search of the Protein section for human beta globin:
Suppose one wants to find a sequence for the human beta globin protein. Naively typing
in "human beta globin" retrieves 557 proteins, many of which are obviously not human
from the summary view, and many of which are obviously not complete proteins from
the length. Part of the problem is that NCBI's protein database is a conglomerate of many
different databases. One section, called "GenBank", contains all protein sequences
exactly as their authors submitted them. This will include partial sequences, and mutant
sequences. Another division is called RefSeq. RefSeq is a database curated by NCBI
staff where they have attempted to filter out variants, mutants, and bad sequences, leaving
just one copy to represent this protein from humans. The link in the upper right will filter
out just the RefSeq entries, but still will produce 94 of them. You can also notice which
entries are RefSeq entries because their names always have an underscore in them (see
entry number 1). The other problem is revealed by the expanded "Search details" box on
the lower left. The phrase "human beta globin" is ambiguous. Did you mean you only
want proteins that are named exactly "human beta globin", or only proteins from humans
that are named exactly "beta globin" or all proteins from humans containing either the
word "beta" or "globin", etc. Entrez decided you meant all proteins that are either from
humans or mentioned "human" somewhere in the documentation, AND which also had
the phrase "beta globin" somewhere in the documentation.
If you know the query language, or use the "Advanced" tab to help you build a query,
you could tighten this query up to human[organism] AND "beta globin"[Protein Name].
That retrieves 177 entries, none of which, unfortunately, are from RefSeq. You can solve
the remaining problem by noticing the protein name that was assigned to the beta globin
entries that were from RefSeq in the original list above. NCBI's curators decided not to
call this protein "beta globin"; they decided to call it "hemoglobin subunit beta".
If the search is conducted as human[orgn] AND "hemoglobin subunit beta"[Protein
Name], then only two entries are found. One is the RefSeq entry. The other is the entry
from another curated database known as Uniprot (previously Swiss Prot).
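The fielded query above can also be assembled programmatically. The following is a minimal sketch, assuming that NCBI's E-utilities esearch service accepts the same query syntax as the Entrez search box; the helper names entrez_query and esearch_url are made up for illustration, and the code only builds the request URL without contacting NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (programmatic counterpart of the Entrez search box).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_query(organism: str, protein_name: str) -> str:
    """Build a fielded query like the one typed into the Entrez search box."""
    return f'{organism}[orgn] AND "{protein_name}"[Protein Name]'


def esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Return the URL of an esearch request; nothing is fetched here."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = entrez_query("human", "hemoglobin subunit beta")
print(term)
print(esearch_url(term))
```

Keeping the field tags ([orgn], [Protein Name]) in one helper makes it harder to fall back to the ambiguous free-text form of the query by accident.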
The RefSeq entry may have links to related 3D structures, and many other resources,
particularly for human genes. Beware that sometimes NCBI's links go to an NCBI
maintained clone of a resource, rather than the actual resource. In that case, the actual
resource will often have more or better information than NCBI.
For example, the structure links go to NCBI's structure system which has fairly limited
information. This is one particular structure link:
They have their own structure viewer, Cn3D, which you have to download and install.
By comparison, putting the accession 3SZK at the PDB site gives a much bigger
selection of information. Partially this is because NCBI considers the sequence entry to
be the top page, and has linked most of the information to the sequence entry, not the
structure entry. PDB considers the structure description page to be the top page, and has
most of its information linked directly to that. However, it's often true that you'll find
more, or better quality, information if you look through a broader range of sites.
As a case in point, suppose we were interested in ontological information about beta
globin. From the RefSeq entry for beta globin, it's not obvious that there's any path to
ontology information. There is a third party link to an ontology resource that doesn't
seem to have any ontology information. You can get to ontology information through the
3rd party link to Ensembl. However, NCBI Entrez does have links into some of the
underlying pathway resources that GO draws upon, and some of the links are not easy to
find through GO itself.
If you click the Uniprot entry, you get NCBI's clone of the Uniprot entry. In the left
panel, the Uniprot -> NCBI conversion did capture linked ontology information in the
form of xrefs. But the panel on the right is not updated to indicate that author-provided
information exists in the entry.
If you retrieve the Uniprot accession number (P68871) and put it in the actual Uniprot
site (http://www.uniprot.org/), you'll get a much better presentation of this information.
A more generic entrance to Gene ontology is through http://www.geneontology.org/.
Use the "set filters" button to narrow down the results.
Then look for the "associations" with the gene of interest.
One can then see the list of genes for a particular function, or see a diagram if the source
was one that provides diagrams, or else some other indication of how the genes were
grouped into this function.
The Reactome database has diagrams of gene products collaborating on functions
organized by the Uniprot curators. If the overall process referenced is too large for a
single diagram, a selection of diagrams will be shown. To search for the sought-after
protein in this collection of diagrams, expand all the details in the section on the left, and
then use the find function to identify what diagram includes the sought-after term.
Note that the NCBI may have pathways resources linked other than those found at the
GO consortium, such as wikigenes, or the KEGG database.
Online Mendelian Inheritance in Man
One excellent resource searched through NCBI is Online Mendelian Inheritance in Man
(OMIM). This resource gives a summary of genetic studies on each human gene
particularly including phenotypes caused by defects in the gene and a summary of animal
knockout experiments.
Here's an example of its use. I was preparing a lecture to a class about collagen, which
had the usual litany of facts about how procollagen is hydroxylated on lysine and proline,
then secreted, processed and crosslinked. The standard story goes that in vitamin C
deficiency, the hydroxylation does not occur and this results in a disease called scurvy.
Since failure to hydroxylate the proline leads to failure to fold or secrete the procollagen,
that would seem to explain scurvy. But what's the hydroxylation on the lysine for? The
lysines are involved in subsequent crosslinking. Various textbooks show the lysines
undergoing crosslinking as hydroxylated or not, and tend to be silent about how
important it is for the lysines to be hydroxylated. Textbooks all state that collagen lysine
hydroxylation is required for glycosylation, but tend to be silent on what that's for.
Searching OMIM reveals that humans have not one but three genes for this enzyme.
Inherited genetic disorders have been ascribed to defects for two of them. An animal
knockout of the third one is an embryonic lethal with basement membrane disintegration.
The variety of symptoms described all seem to track to the hypothesis that without the
hydroxylation the lysines can crosslink, but can not mature to an irreversible crosslink.
Reading through the case studies gives a much different impression of why we need to
hydroxylate collagen lysines than you'll get by looking at a reaction pathway.
Public Reaction pathway databases include KEGG, Wikipathways, and Reactome.
A relatively new BlastP interface at NCBI lists resources
attached to each sequence matched.
Some resources related to DNA sequencing.
Individual sequence reads retrieved from a sequencing facility for sequence confirmation
purposes should come with the chromatogram. There is a need to visualize the quality of
the chromatograms to discover if apparent discrepancies are real or sequence ambiguities.
The most common format is an ab1 file.
Here is a view into an ab1 file with a free downloadable utility called Chromas:
For assembling Sanger sequence reads of a new sequence, one should have an assembler
that's sensitive to quality values. Here's an example of the phredPhrap system.
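For orientation, the quality values that such an assembler relies on are Phred scores, defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only illustrates that definition; the function names are chosen here and are not part of phred or phrap.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Expected number of miscalled bases in a read with these qualities."""
    return sum(phred_to_error_prob(q) for q in quals)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A discrepancy at a Q40 base (1 error in 10,000) deserves much more attention than one at a Q10 base (1 error in 10), which is why the chromatogram quality matters when judging apparent sequence differences.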
Using phredPhrap at UTHSCSA is similar to a number of other high powered systems.
The program has been installed on the bioinformatics computer known as alamo, and is
maintained by basically one user, in this case Steve Hardies. So, to use it, you'd have to
get an 'account' on alamo. [An 'account' means a username and password. At this time,
there is no fee to UTHSCSA personnel associated with using the UTHSCSA computing
facilities.]
Although the phredPhrap programs are installed on alamo, you'd have to edit some files
in your home directory system to be able to use it. For that, you'd want to consult with a
person currently using the program.
A short list of things you might need to do:
- Install a program on your PC called PuTTY allowing you to log into your alamo
  account.
- Edit the program path into a login file.
- Create a set of directories and subdirectories according to program specifications
  for your sequencing project.
- Copy the chromatograms from the sequence supplier to the specified program
  directory.
- Obtain the cloning vector sequence and place in your system as a fasta file.
- Create an environment variable informing the program of the location of the
  vector file.
- Edit a script copied from a program directory that says how to discover paired end
  information from the filenames of your reads.
- Obtain a file of repetitive sequences corresponding to the organism from which
  your sequence was obtained.
- Create an environment variable informing the program of the location of that file.
- Install a graphics interface program on your PC called vnc viewer.
- Start a vnc server instance in your account.
- Log in with vncviewer.
- Run the phredPhrap script [calls bases, assigns quality values, overlaps reads].
- Run consed [an assembly viewer] to evaluate if more reads are needed, or to
  output the consensus of the finished sequence.
Accessing consed (the phredPhrap assembly viewer) through vnc to examine the quality
value for given bases in the consensus, and the underlying chromatograms.
For next generation sequencing, if contracted to a commercial provider, they may do
much of the requested bioinformatics for you. If done with equipment at UTHSCSA, you
will need to do your own processing. Again, this involves using a sequence of programs
installed on the UTHSCSA bioinformatics computer, and getting in contact with users
that know how to use that software.
There is a user group that one can join and query by e-mail to find an existing user that
can help:
Dear Colleagues:
Thank you all again very much for coming to yesterday's meeting, which
was very productive thanks to everyone's input. I have followed through
with the first suggestion, which was to set up the NGS User group, and
installed an initial mailing list for the NGS users at UTHSCSA. The address
to post to this list is ngs@biochem.uthscsa.edu. We can use this list to
pick up the discussion regarding NGS implementation on our computational
and storage resources.
I am attaching below the list of users subscribed to this
list. Please check if I accidentally left anyone out that you
think should also participate. If so, either send me their name
and e-mail address, or ask them to sign up for themselves at:
http://biochem.uthscsa.edu/mailman/listinfo/ngs
You can also manage your own subscription at this website. All messages
sent to this list are also archived on the web, which you can access with
a password for your subscription that you can retrieve from this page by
clicking on "Unsubscribe or edit Options" at the bottom of this page.
Please keep in mind that any message sent to ngs@biochem.uthscsa.edu
will be broadcast to everyone who is subscribed, even when replying.
Also: Please keep in mind that your posting may be rejected if you do not
post from the subscribed address (see below). If you have multiple e-mail
addresses from which you post, please send me the alternatives so I can
add them to the allowed lists to posts and so you will not get any bounces.
I will soon send a couple of messages to this list with valuable feedback
from Patricia Dahia and Yidong Chen to make sure everyone has a chance
to respond.
Thanks again for your participation!
Regards, -Borries
--Borries Demeler, Ph.D.
For RNA-SEQ type experiments, the first step is to match the reads against a reference
sequence. We are currently using bowtie for that purpose.
Bowtie output:
40927621 9 chr1 3035061 3 75M * 0 0 CTGGGTATGCCTCGTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0
40927621 25 chr1 3035074 1 75M * 0 0 GTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGCAGGGCTATATTAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0
24795718 9 chr1 3144396 255 75M * 0 0 CTCGGTTTGTGTTTTTTCATGAGATGAAGATGGAGCGCGGTGGCTGCCAGAGAGATTAATTCGTCAGATGAACAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1
26498456 17 chr1 3523315 255 75M = 137315932 0 GTCAGAGTTGTATTGAGACTGGATCTCACTGTGTCGCTCTGACTGGTCTGGAATTCTCTACATAGCCCAAACTGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:3
13520385 9 chr1 4570806 3 75M * 0 0 GTACACTGTAGCCGTCCTCAGATACTCCAGAAGAGGGCATCAGATTTCGTTACAGATGGTTGTGAGCCACCATGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1
30550076 1 chr1 4678002 255 75M = 135503084 0 TGCACAGCTACAACTGAATCTCACGGTAGGCCCGCTTCTCCACCAGCTTCACTTTGTATTTGCGCTTGAACTTGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:4
30550076 49 chr1 4678153 255 75M = 135506758 0 CAGTGTCCTGAGTGAGCCAGACAACTCAGAGCTGAGCTGCACATCAATGTCAGGAGCCTGGTACTTGAGCCGTCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:3
4767454 1 chr1 4678733 255 75M = 136319553 0 GTTGTAGGAGGCTCCTGCAGGAATCACCTCCACTGCAGGCACCTGGGAAGGCTTGATGTGGAGGCGTTGTGGCCG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0
This output gives the coordinates in the reference sequence where each read matches.
This output will then be reprocessed depending on the purpose of the experiment.
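The alignment records shown above look like SAM records (read name, flag, reference name, position, mapping quality, CIGAR, mate reference, mate position, template length, sequence, qualities, optional tags). The following is a minimal sketch of how such a line could be split into named fields and how the flag bits could be read, assuming the SAM layout; the SamRecord type and the function names are illustrative only, and the quality string is regenerated as a run of "I" characters for brevity.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost reference coordinate of the match
    mapq: int
    cigar: str
    rnext: str  # "=" means the mate maps to the same reference sequence
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (whitespace or tab separated) into fields."""
    f = line.split()
    tags = {}
    for tag in f[11:]:
        name, typ, value = tag.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tags)


def describe_flag(flag: int) -> dict:
    """Decode three of the SAM flag bits."""
    return {
        "paired": bool(flag & 0x1),
        "mate_unmapped": bool(flag & 0x8),
        "reverse_strand": bool(flag & 0x10),
    }


line = ("26498456 17 chr1 3523315 255 75M = 137315932 0 "
        "GTCAGAGTTGTATTGAGACTGGATCTCACTGTGTCGCTCTGACTGGTCTGGAATTCTCTACATAGCCCAAACTGG "
        + "I" * 75 + " NM:i:3")
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, rec.tags, describe_flag(rec.flag))
```

For this record the flag 17 decodes as a paired read aligned to the reverse strand, and the NM tag reports 3 mismatches over the 75 matched positions.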
There are a number of programs with advanced capabilities that use bowtie as their search
engine. For example, tophat processes bowtie output to detect reads that jump across
introns.
tophat output vs. a segment of human chromosome 1.
chr1 137635773 137635825 0
chr1 137635825 137635900 1
chr1 137635900 137635944 0
chr1 137635944 137635946 1
chr1 137635946 137635949 2
chr1 137635949 137635952 1
chr1 137635952 137635956 2
chr1 137635956 137635964 1
chr1 137635964 137635968 2
chr1 137635968 137635969 1
chr1 137635969 137635970 2
chr1 137635970 137635971 8
chr1 137635971 137635972 31
chr1 137635972 137635973 62
chr1 137635973 137635974 112
chr1 137635974 137635975 135
chr1 137635975 137635976 141
chr1 137635976 137635977 1146
chr1 137635977 137635978 1148
chr1 137635978 137635979 1165
chr1 137635979 137635980 1296
chr1 137635980 137635981 1317
chr1 137635981 137635982 1326
chr1 137635982 137635983 1341
chr1 137635983 137635984 1342
chr1 137635984 137635985 1346
chr1 137635985 137635986 1371
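The coverage listing above appears to follow a bedGraph-like layout (chromosome, start, end, read depth). As a rough sketch of how such rows might be summarized, assuming half-open intervals and using four rows copied from the listing, the code below computes a length-weighted mean depth and locates the largest jump in depth between adjacent intervals. The function names are chosen here for illustration.

```python
def parse_bedgraph(text):
    """Parse 'chrom start end depth' rows into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, depth = line.split()
        rows.append((chrom, int(start), int(end), int(depth)))
    return rows


def mean_depth(rows):
    """Length-weighted mean depth over all intervals."""
    bases = sum(end - start for _, start, end, _ in rows)
    weighted = sum((end - start) * depth for _, start, end, depth in rows)
    return weighted / bases


def largest_step(rows):
    """Return (position, increase) for the biggest rise in depth between adjacent rows."""
    before, after = max(zip(rows, rows[1:]), key=lambda pair: pair[1][3] - pair[0][3])
    return after[1], after[3] - before[3]


text = """\
chr1 137635974 137635975 135
chr1 137635975 137635976 141
chr1 137635976 137635977 1146
chr1 137635977 137635978 1148
"""
rows = parse_bedgraph(text)
print(mean_depth(rows), largest_step(rows))
```

An abrupt rise such as the one from 141 to 1146 reads may mark the edge of a transcribed feature, although that interpretation would need to be checked against the annotation.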
Tophat differs from bowtie in that it can detect when reads jump across an intron:
chr1 135001650 135002590 JUNC00000058 2 + 135001650 135002590 255,0,0 2 42,41 0,899
chr1 135030133 135033309 JUNC00000059 68 + 135030133 135033309 255,0,0 2 59,70 0,3106
chr1 135169235 135170476 JUNC00000060 1 + 135169235 135170476 255,0,0 2 33,42 0,1199
chr1 135176956 135178406 JUNC00000061 1 + 135176956 135178406 255,0,0 2 28,47 0,1403
chr1 135280461 135281068 JUNC00000062 1 - 135280461 135281068 255,0,0 2 40,35 0,572
chr1 135503085 135505538 JUNC00000063 188 - 135503085 135505538 255,0,0 2 71,65 0,2388
chr1 135505469 135506365 JUNC00000064 75 - 135505469 135506365 255,0,0 2 67,71 0,825
chr1 135506291 135506812 JUNC00000065 72 - 135506291 135506812 255,0,0 2 30,72 0,449
chr1 135518312 135518761 JUNC00000066 51 - 135518312 135518761 255,0,0 2 66,68 0,381
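The junction listing above has the shape of a 12-column BED file in which each junction is drawn as two blocks (the read overhangs on either side) joined across the intron. As a hedged sketch, assuming that layout and assuming the score column counts the reads spanning the junction, the intron coordinates could be recovered as follows; parse_junction is an illustrative name, not a tophat function.

```python
def parse_junction(line: str) -> dict:
    """Extract the intron implied by a two-block BED junction record."""
    f = line.split()
    start = int(f[1])                                     # 0-based start of the left overhang
    sizes = [int(x) for x in f[10].split(",") if x]       # block sizes
    offsets = [int(x) for x in f[11].split(",") if x]     # block starts relative to start
    intron_start = start + sizes[0]    # first intron base (0-based)
    intron_end = start + offsets[1]    # first base of the right overhang (exclusive end)
    return {
        "chrom": f[0],
        "name": f[3],
        "reads": int(f[4]),
        "strand": f[5],
        "intron_start": intron_start,
        "intron_end": intron_end,
        "intron_length": intron_end - intron_start,
    }


line = ("chr1 135001650 135002590 JUNC00000058 2 + "
        "135001650 135002590 255,0,0 2 42,41 0,899")
junction = parse_junction(line)
print(junction)
```

For JUNC00000058 this gives an intron of 857 bases flanked by overhangs of 42 and 41 bases, which is consistent with the 940-base span between the start and end columns.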
You might have thought that deep sequencing just to quantitate the level of different
transcripts would have consisted of comparing reads to a preconceived set of spliced
cDNA sequences. But most packages instead prefer to search the entire genome, and
then apportion the reads into genes based on an annotation file specifying the coordinates
of named genes in the genome. This presumably aids in dealing with alternative splicing
by keeping the exon boundaries in sight until the last steps. This presumably also allows
for discovery of previously unknown transcripts, or previously unknown exons.
The file format for known gene annotation is called gtf. .gtf files for popular model
organisms can be downloaded from sites like Ensembl or the UCSC Genome Browser.
Illumina has a site with a number of gtf files. For a novel organism, you might have to
find a converter to construct your gtf file from some other format. Typically the process
would go GenBank->gff->gtf. The format is:
Structure is as GFF, so the fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
Here is a simple example with 3 translated exons. Order of rows is not important.
381 Twinscan CDS         380 401 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS         501 650 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS         700 707 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan stop_codon  708 710 . + 0 gene_id "001"; transcript_id "001.1";
From http://mblab.wustl.edu/GTF22.html.
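To make the field layout concrete, here is a small sketch that parses the five example rows and adds up the coding length of the transcript. It assumes whitespace-separated columns and the 'key "value";' attribute style of the example; the function name parse_gtf_line is illustrative.

```python
import re


def parse_gtf_line(line: str) -> dict:
    """Split a GTF row into its eight fixed fields plus an attribute dictionary."""
    f = line.split(None, 8)  # the ninth field keeps its internal spaces
    attributes = dict(re.findall(r'(\S+)\s+"([^"]*)";', f[8]))
    return {
        "seqname": f[0], "source": f[1], "feature": f[2],
        "start": int(f[3]), "end": int(f[4]),  # 1-based, inclusive
        "score": f[5], "strand": f[6], "frame": f[7],
        "attributes": attributes,
    }


gtf_text = """\
381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";
"""
records = [parse_gtf_line(row) for row in gtf_text.splitlines()]
cds_length = sum(r["end"] - r["start"] + 1 for r in records if r["feature"] == "CDS")
print(cds_length, cds_length // 3)
```

The three CDS rows add up to 180 bases, or 60 codons; in this example the stop codon is listed as its own feature outside the CDS rows.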
As with most advanced bioinformatics processes, the data will be processed through a
succession of computer programs, possibly derived from different sources, and chosen to
progressively condense the data towards some particular goal.
Here is an example where the deep sequencing data was processed through several steps and then
fed to an assembly viewer called Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
From the Artemis web site.
For RNA-SEQ data, one software package to proceed towards tabular differential gene
expression data is called cufflinks.
From https://pods.iplantcollaborative.org/wiki/display/eot/RNA-Seq_tutorial#RNASeq_tutorial-CuffDiffOutputs
Sample data to work through an RNA-SEQ data processing exercise can be retrieved
from the NCBI Sequence Read Archive. The output above was derived from accession
SRP003951 in that archive.
PBS execution system.
Alamo is a supercomputer. It has about 100 CPUs available at any given time for carrying
out computations. Eight of them are referred to as the "head node". The head node
carries out the initial interactions with users, and supports the interactive graphics vnc
interface and other housekeeping operations. The other CPUs are organized into
"compute nodes". The object is to distribute all the intensive computing operations to the
compute nodes. The job execution system is called PBS. Jobs to be executed under PBS
are embedded in a submission file of the following format. Then the job is submitted
with the command qsub.
#!/bin/sh
#PBS -r n
#PBS -V
#PBS -l nice=19
#PBS -j oe
#PBS -l nodes=1:ppn=8
cd /home/hardies/E25P/RNAP
hostname
blastpgp -i T7RNAP.fa -d /mnt/glusterfs/hardies/db9/nr -j3 -b0 -v 2000 -o T7RNAP.psiblast -a8
The lines beginning with #PBS are directives to PBS. They indicate that 8 CPUs are
requested, and give information about how the job is to be conducted. The commands
starting with cd /home... are the actual commands that would be typed to execute this job.
This is a 3-round psiblast search of the local copy of the NCBI nr database. If the
computer is busy, the job will be queued until the requested CPUs are available, and then
executed.
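When many similar jobs are submitted, it can be convenient to generate the submission file rather than edit it by hand. The following is only a sketch that reproduces the directives of the example above from a template; the function name pbs_script is made up here, and the point of the design is simply that the number of CPUs requested from PBS (ppn) and the number of threads passed to blastpgp (-a) stay in agreement.

```python
def pbs_script(workdir: str, command: str, ppn: int = 8, nodes: int = 1) -> str:
    """Render a PBS submission file with the same directives as the example."""
    lines = [
        "#!/bin/sh",
        "#PBS -r n",
        "#PBS -V",
        "#PBS -l nice=19",
        "#PBS -j oe",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"cd {workdir}",
        "hostname",
        command,
    ]
    return "\n".join(lines) + "\n"


ppn = 8
blast = ("blastpgp -i T7RNAP.fa -d /mnt/glusterfs/hardies/db9/nr "
         f"-j3 -b0 -v 2000 -o T7RNAP.psiblast -a{ppn}")
script = pbs_script("/home/hardies/E25P/RNAP", blast, ppn=ppn)
print(script)
```

The resulting text would be saved to a file and submitted with qsub followed by the file name, exactly as for a hand-written submission file.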
Revisiting PCR primer software.
I asked you on the midterm to think about how various changes in the PCR procedure
might influence the performance of the following oligo:
Proposed primer:
GACTCAGCGCTATTGCGCATGATC
4*GC + 2*AT
Tm = 74C
By inspection, hairpin:
GACTCAGCGCTATTGCGCATGATC
primer dimer
GACTCAGCGCTATTGCGCATGATC 3'
                    ||||
                  3'CTAGTACGCGTTATCGCGACTCAG
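The "by inspection" numbers above are easy to reproduce. The sketch below applies the 4*GC + 2*AT rule to the proposed primer and checks whether its last four bases are self-complementary, which is what lets two copies of the primer pair at their 3' ends. The function names are chosen here for illustration, and this is only the quick rule of thumb, not a nearest neighbor calculation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rule_of_thumb_tm(seq: str) -> int:
    """Tm estimate in degrees C: 4 per G or C plus 2 per A or T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_self_complementary(seq: str, n: int = 4) -> bool:
    """True if the last n bases can pair with the same n bases of a second copy."""
    tail = seq.upper()[-n:]
    return tail == reverse_complement(tail)


primer = "GACTCAGCGCTATTGCGCATGATC"
print(rule_of_thumb_tm(primer), three_prime_self_complementary(primer))
```

The primer has 13 G or C and 11 A or T, giving 52 + 22 = 74, and its 3' GATC is its own reverse complement, which produces the four base pairs drawn in the primer dimer.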
Let's look at how different software packages agree or disagree on the properties of this
primer.
By Oligo Version 6 (There is a version 7 now).
By default the program started with usual PCR conditions, except with primer
concentrations of 1 nM. No one does PCR with 1 nM primers. The Tm at 1 nM was 67.7C.
Changing that to 0.1 uM gave 74C. Maybe someone dialed in the 1 nM primer to cause
the program to automatically discount the Tm by 5C ???
Oligo explicitly accounts for the Mg concentration. If using programs that do not, one
would need to claim that the Na concentration was 148 to implicitly compensate.
The nonpriming duplexes are only relevant if the Mass Action function
[duplex]/[nonduplex]^2 starts to approach 1. Since the primer concentrations are in uM,
and ΔG° is computed assuming 1 M standard concentrations, the primer will still be
1,000,000:1 nonduplex at ΔG° = 0 [because 10^-12/(10^-6 * 10^-6) = 1]. As a rule of
thumb, one has to have an interaction well under -9 kcal/mol for the primer to be more
than 50% duplexed. The simplest way to check out a particular self associated structure
would be to use a program that reports the Tm.
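The rule of thumb can be checked numerically. The following is a sketch under stated assumptions: a simple two-state self-association with K = [duplex]/[nonduplex]^2 = e^(-ΔG°/RT), R = 1.98 x 10^-3 kcal/(mol K) as used later in these notes, and a total primer concentration equal to [nonduplex] + 2[duplex]. The function names are illustrative.

```python
import math

R_KCAL = 1.98e-3  # kcal/(mol*K)


def association_constant(delta_g0: float, temp_c: float) -> float:
    """K = [duplex]/[nonduplex]^2 for a standard free energy in kcal/mol."""
    return math.exp(-delta_g0 / (R_KCAL * (273 + temp_c)))


def free_primer(delta_g0: float, temp_c: float, total: float) -> float:
    """Solve total = s + 2*K*s^2 for the nonduplex concentration s."""
    k = association_constant(delta_g0, temp_c)
    return 2 * total / (1 + math.sqrt(1 + 8 * k * total))


def duplex_to_nonduplex_ratio(delta_g0: float, temp_c: float, total: float) -> float:
    """[duplex]/[nonduplex], which equals K times the nonduplex concentration."""
    return association_constant(delta_g0, temp_c) * free_primer(delta_g0, temp_c, total)


def fraction_of_strands_in_duplex(delta_g0: float, temp_c: float, total: float) -> float:
    return 1 - free_primer(delta_g0, temp_c, total) / total


for dg in (0.0, -3.0, -6.0, -9.0, -12.0):
    print(dg, fraction_of_strands_in_duplex(dg, 55, 1e-6))
```

At ΔG° = 0 and 1 uM primer the duplex to nonduplex ratio comes out near 10^-6, and about half of the strands are paired only when ΔG° reaches roughly -9 kcal/mol at 55C.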
The Mfold server
(http://mfold.rit.albany.edu/?q=DINAMelt/Two-state-melting) will do that:
Mfold was more impressed with a different structure than was Oligo. If I raised the salt
as high as it would go, I got it to ΔG° = -9.2, but it was still not expected to be a problem,
even at 1 uM primer and 55C annealing. Changing the primer concentration to 0.1 uM
only made about a 5C difference.
On the other hand, the claim that the productive complex has a ΔG° of -2.1 is
problematical. ΔG = 0 = ΔG° + RT ln([duplex]/[nonduplex]^2).
At 55C, T = 273+55 = 328K.
[duplex]/[nonduplex]^2 = e^(2.1/(328 * 1.98 x 10^-3)) = 25.4
Take x as the fraction duplexed. [nonduplex] = (1 - x)[initial amount], which is approximately
the initial amount because x is small. [duplex] = x[initial amount].
If initial primer concentration (I) was 1 uM:
25 = x*I/(I*I)
25 x 10^-6 = x.
Only 2.5 x 10^-5 of primer, or 1/40,000, will be duplexed.
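The arithmetic above can be reproduced in a few lines. This sketch uses the same numbers (ΔG° = -2.1 kcal/mol, 55C, 1 uM primer, R = 1.98 x 10^-3 kcal/(mol K)) and the same small-x approximation. It also reports how many two-fold amplification cycles it would take to grow from the duplexed fraction to the full amount of primer; the variable names are chosen here for illustration.

```python
import math

R = 1.98e-3        # kcal/(mol*K)
T = 273 + 55       # annealing temperature in K
delta_g0 = -2.1    # kcal/mol
initial = 1e-6     # primer concentration in M

# At equilibrium, delta_g0 = -R*T*ln(K) with K = [duplex]/[nonduplex]^2.
K = math.exp(-delta_g0 / (R * T))

# Small-x approximation: [nonduplex] ~ initial and [duplex] = x * initial,
# so K = x * initial / initial**2 and x = K * initial.
x = K * initial

# Number of doublings needed to go from fraction x to all of the primer.
doublings = math.log2(1 / x)

print(round(K, 1), x, round(1 / x), round(doublings, 1))
```

The roughly 15 doublings obtained this way correspond to the number of cycles after which an efficiently primed dimer product would exhaust the primer.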
BUT, if it primes efficiently to make a primer dimer product, then it will amplify two fold
per cycle. It will exhaust the primer after about 15 cycles. Mfold gave ΔG° of -0.4 to -1.4
depending on whether a tail was left on the GATC (e.g. AGATC vs. GATC). This will
only make a difference of a few cycles. However, if I make overlaps of AAA/TTT or
AA/TT, the Mfold server continues to indicate the primer dimer formation would be a
problem. In practice, very short complementary regions are unavoidable, and do not
create an observable primer dimer problem. Hence the predictive quality of the programs
is breaking down as very short duplexes are considered. This could reflect a problem
with properly considering end effects in the thermodynamic calculations, but could also
just reflect that these very short duplexes are not very much like the substrate the
polymerase is designed to bind. In any case, the primer dimer problem in PCR is mostly
handled by a rule of thumb to not leave duplexes beyond a certain length on the 3' end.
Oligo Calc (http://www.basic.northwestern.edu/biotools/oligocalc.html)
With default parameters, predicts Tm of 62C, and fails to notice hairpin or primer dimer.
With more realistic parameters, the Tm comes up over 70C.
Oligo calc claims to use the Mfold server to determine hairpins, but if you take the
sequence to the actual Mfold server and fill in the relevant salt conditions, Mfold says
that there is a very stable hairpin:
IDT OligoAnalyzer 3.1 under default conditions says that the Tm is 60C
(https://www.idtdna.com/analyzer/Applications/OligoAnalyzer/)
The method of calculation is nearest neighbor. They claim to have better corrections
for Mg than other programs.
If the conditions are made more realistic, the Tm comes up to 64.2C.
It suggests a hairpin with Tm of 67C under its default conditions, but 71.5C under the
actual reaction conditions.
It suggested a substantial problem with primer dimer, although it was buried in a long list
of irrelevant self matches because the 3' end wasn't paired.
Promega's calculator said 66C with realistic reaction conditions.
Primer3Plus (http://primer3plus.com/cgi-bin/dev/primer3plus.cgi):
With the thermodynamic parameters common to Oligo, gave a Tm of 78.3C and declared
a hairpin problem:
This program gave options for which set of thermodynamic values to use in the nearest
neighbor calculation. These parameters are determined by melting a large number of
double stranded test oligonucleotides.
Primer3plus didn't explicitly notice the primer dimer problem.
With the newer values, it predicted a Tm of 64C.