Bioinformatics Workshop #2
Computational Methods for Rational
Oligonucleotide PCR Primer Design and
Analysis:
Two Scenarios Using GCG¥’s SeqLab.
‘Not your ordinary primer design.’
The two scenarios:
1) A complicated case where the target DNA is unknown and the sequences are ‘difficult’ to
align — the “guessmer” — useful for discovering genes in organisms where they have
not yet been identified when the gene’s encoded protein sequence is known in several
other, related organisms. Here the example is the prion gene in primates; and
2) A case that you can do on your own where all the DNA sequences are known and ‘easily’
aligned — the Human Papilloma Virus major capsid protein L1 — type and strain
differentiation.
Fall 2006; a GCG¥ Wisconsin Package™ SeqLab® tutorial for Florida
State University sponsored by the School of Computational
Science (SCS).
Author and Instructor: Steven M. Thompson
Steve Thompson
BioInfo 4U
2538 Winnwood Circle
Valdosta, GA, USA 31601-7953
stevet@bio.fsu.edu
229-249-9751
¥GCG is the Genetics Computer Group, part of Accelrys Inc.,
producer of the Wisconsin Package for sequence analysis.
© 2006 BioInfo 4U
Introduction
The Polymerase Chain Reaction, PCR, developed at Cetus Corporation by Kary Mullis in the mid-1980s (Saiki
et al., 1988), for which he won the Nobel Prize, and patented by Hoffmann-La Roche and Perkin-Elmer
Corporation, has revolutionized modern molecular biology. From Jurassic Park scenarios in popular novels, to
everyday research in countless laboratories across the world, to cutting-edge forensic pathology techniques,
PCR is being used to analyze smaller quantities of DNA than was ever before imagined possible. PCR allows
the investigator to analyze any stretch of DNA in any organism where at least some sequence information is
known, either in that organism or in related organisms. It can isolate, and amplify up to around a million-fold,
just a few molecules of DNA from complex environmental mixtures, even where the DNA is significantly
degraded — the ramifications are incredibly far-reaching. It has been employed, among many examples, to
analyze DNA in Egyptian mummies, preserved prehistoric insects in amber, ancient fossilized leaves, and
both ice-age frozen and tar-pit preserved mastodons and other animals from the ‘great age of mammals.’
Claims were even made of dinosaur DNA recovery from specimens recovered in a Utah coalmine, though the
results were later proven to be contamination. The practical applications are extensive in medicine, especially
in the field of prenatal genetics and, in particular with HIV, immediately postnatal diagnosis. Other pathologies
such as Lyme disease are also extremely amenable to PCR diagnosis. Furthermore, molecular evolutionists
now have a tremendous tool for inferring the phylogeny of any organism, whether it can be cultured or not.
Forensics, too, has been completely transformed. Investigators can now isolate DNA from
incredibly obscure bits of physical evidence, à la CSI, to positively exclude suspects based on distinct patterns,
or fingerprints, within their DNA. Using it to ‘prove’ guilt is more difficult because of the population genetics
statistics involved; however, even these probabilities can be estimated to within several orders of magnitude.
PCR has truly changed the face of molecular biology.
PCR is a modified primer extension reaction using a thermostable DNA polymerase that allows for the heat
dissociation of newly formed complementary DNA and subsequent hybridization of oligonucleotide probes to
the target regions for further rounds of amplification. The scope and methods of PCR are huge and
varied, and well beyond the aim of this workshop; I will not attempt to teach anything of the actual
procedure. Refer to any good, modern text in molecular biology for details (for some good, early reviews of
PCR methodology see Mullis [1990], White et al. [1989], and Cherfas [1990]). What I will attempt to teach is a
rational method for inferring appropriate oligonucleotide probes, often known as primers, for PCR or
hybridization screening analysis. These oligonucleotides are usually about 20 or more bases in length and
target the beginning and ending locations of the PCR amplification process.
Coupled with PCR techniques and/or ultra sensitive hybridization screenings, oligonucleotide primers have
allowed the ‘fishing out’ of thousands of genes from complex genomes that would have previously been
extremely difficult to ever even find, let alone sequence. Present-day economical, automated synthesis and the
ready availability of nucleotides have made primers commonplace. (This has also facilitated the development
of reliable methods for the introduction of site-specific mutations into known sequences.) Because of the high
specificity and adjustable stringency of oligonucleotide hybridization, the sequence knowledge of a relatively
short stretch of unique DNA is sufficient to rapidly isolate and/or amplify, clone if desired, and sequence the
corresponding gene. However, whatever technique one may use, primers are essential ingredients.
PCR and hybridization screening both require the design of appropriate primers. This can be a ‘hit-or-miss’
affair or you can use computational methods to greatly assist the efficiency of the process. Several strategies
can be imagined for the design of oligonucleotide primers. If an exact nucleotide sequence is known, then a
single oligonucleotide probe for hybridization or a pair of primers for PCR of a defined sequence can simply be
selected, tested, and synthesized. In the absence of a defined DNA sequence, sometimes a group of similar
DNA sequences can be aligned and a consensus sequence created from which primers can be designed.
However, this is often not possible because DNA can be very, very difficult to align. In some cases one may
even be forced to work off of either a small portion of a protein sequence from an Edman degradation reaction
or, as will be illustrated in this exercise, a consensus pattern from a group of related proteins — the luxury of
using DNA directly is often not available.
When nucleotide data is lacking or problematic, amino acid sequences can be back translated to provide the
necessary primers. In the absence of exact protein sequence data, a consensus pattern from a group of
related proteins can often be used. Back translation is not a trivial chore, though, because of the degeneracy
of the genetic code: there are 64 possible codons, 61 of which encode the 20 amino acids. Because of this,
many different back translation probe techniques have been employed. Two are using large pools of short
oligonucleotides whose sequences are highly degenerate, or using small pools, or even just one pair, of longer
oligonucleotides of lesser or no
degeneracy. All organisms have preferential biases in codon usage and this information can be used to
advantage in deciding which codons to synthesize out of all of the possible choices. This strategy of choosing
the longest defined stretches of unambiguous peptide and back translating them to their most probable
oligonucleotides, is known as designing “guessmers.”
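To make the degeneracy problem concrete, here is a minimal Python sketch that counts how many distinct DNA sequences could encode a given peptide under the standard genetic code. It is an illustration only: the names CODONS_FOR and degeneracy are hypothetical helpers, not part of GCG, and the example peptide is a ten-residue stretch that appears in the prion alignment shown later in this tutorial.

```python
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to the list of codons that encode it.
CODONS_FOR = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append(codon)


def degeneracy(peptide):
    """Return the number of distinct DNA sequences that translate to `peptide`."""
    return prod(len(CODONS_FOR[aa]) for aa in peptide.upper())


if __name__ == "__main__":
    print(degeneracy("WNTGGSRYPG"))
```

For the ten residues WNTGGSRYPG this comes to 147,456 possible coding sequences, which is why a fully degenerate pool quickly becomes impractical and a single best-guess sequence becomes attractive.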
Guessmers contain the combination of codons most likely to match the authentic gene. Guessmers work
because the decrease in hybridization stability caused by mismatched bases is offset by an increase in
stability from using longer sequences. In most cases, mismatches will occur in only the third position of
incorrect codon choices and, therefore, at least two of the three bases will still be matched. Naturally, the
biggest constraint on utilizing this type of strategy is that relatively long stretches of amino acid sequence are
required. Because of this, guessmers are particularly appropriate when strong and sufficiently long consensus
elements can be discovered in a protein family. They should be at least 30 nucleotides in length, in order to
ensure sufficient hybridization despite potential mismatches, though PCR primers are seldom designed as long
as hybridization probes. It’s also not worth the extra effort and bother to synthesize them longer than about 70
bases. For some early, very good descriptions of the factors involved in guessmer design and analysis,
and references to the primary literature, see Sambrook et al. (1990) and Wood (1987).
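The claim that a wrong codon guess usually costs only the third base can be checked directly against the standard genetic code. The sketch below, a small illustration with hypothetical helper names, enumerates every pair of synonymous codons and counts how many pairs differ at the third position alone.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_FOR = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append(codon)


def third_position_only(a, b):
    """True if codons a and b share their first two bases and differ at the third."""
    return a[:2] == b[:2] and a[2] != b[2]


def synonymous_pair_counts():
    """Return (pairs differing only at position 3, all synonymous codon pairs)."""
    third_only = total = 0
    for aa, codons in CODONS_FOR.items():
        if aa == "*":  # stop codons are not part of a coding guessmer
            continue
        for a, b in combinations(codons, 2):
            total += 1
            third_only += third_position_only(a, b)
    return third_only, total


if __name__ == "__main__":
    print(synonymous_pair_counts())
```

Of the 87 synonymous codon pairs, 63 differ only at the third position. The remaining pairs all belong to the three six-codon amino acids (leucine, serine, and arginine), whose codon families also differ at the first or second base, so stretches rich in those residues are riskier places to guess.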
The first portion of today’s tutorial will explore guessmer design. In order to discover possible consensus
patterns within a known protein family for the design of a guessmer, the individual members must be
maximally aligned and then a consensus must be created.
Alignment is usually achieved through an
automated progressive, pairwise alignment procedure, here the GCG program PileUp, which inserts gaps to
align the full length of its members.
Other automated alignment methods are also available such as
Thompson and Higgins’ ClustalW (1994), Smith and Smith’s PIMA (1995), and Gupta et al.’s MSA (1995), as
are several different manual alignment editors.
Consensus sequences can then be created from the
alignment. Many methods merely rely on the positional frequency of individual symbols; however, some utilize
much more information.
Profile analysis (Gribskov et al., 1989) is one of these.
Profile analysis takes
advantage of the BLOSUM (Henikoff and Henikoff, 1992) Dayhoff style scoring matrices (Schwartz and
Dayhoff, 1979) that utilize the relative conservation of various amino acid substitutions within the alignment.
Therefore, the resultant consensus residues are the most evolutionarily conserved rather than just statistically
the most frequent. This can mean much more to us than an ordinary consensus and is especially appropriate
in the design of the type of guessmer that we will be simulating — that is, a situation in which much sequence
information for the protein of interest is known in other organisms but not in the one we are studying.
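For contrast with profile analysis, the simplest kind of consensus mentioned above, one based purely on positional frequency, can be sketched in a few lines of Python. This is not what GCG's profile programs do (it ignores scoring matrices entirely); the function name frequency_consensus and the gap characters it skips are assumptions made for the illustration, and the example rows are ten-residue segments taken from the prion alignment shown later.

```python
from collections import Counter


def frequency_consensus(aligned, gap_chars="~.-"):
    """Return the most frequent non-gap symbol in each column ('-' if all gaps)."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(ch for ch in column if ch not in gap_chars)
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


if __name__ == "__main__":
    rows = ["LFVATWSDLG", "VFVATWSDLG", "LFVATWSNLG", "~~~~~~~~~~"]
    print(frequency_consensus(rows))
```

With these four rows the consensus is LFVATWSDLG: the single V in the first column and the single N in the eighth are simply outvoted. A profile-based consensus would instead weigh how conservative each substitution is, which is the extra information the frequency method throws away.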
I will illustrate the design of guessmers using the prion protein as an example.
The prion molecule is
responsible for a debilitating disease in animals and yet is encoded by the organism’s own DNA; the gene is
expressed in both normal and afflicted cells. Large amounts of proteinaceous plaques aggregate and are
deposited in the brains of afflicted animals. The prion protein has an unknown natural function but is found in
very high quantities in the brain of animals infected with the degenerative neurological diseases scrapie and
Bovine Spongiform Encephalopathy, in wild stock, and kuru, Creutzfeldt-Jakob Disease, or Gerstmann-Sträussler Syndrome in humans. It is also involved in Fatal Familial Insomnia and gained notoriety as the
harbinger of “Mad-Cow Disease.” In humans the gene maps to position 20p12-pter and the disease can be
inherited in an autosomal dominant fashion. Seventeen pathologic allelic variants are listed in OMIM (1995).
One of the most peculiar aspects of the prion is that no infective nucleic acid entity has ever been found, yet the
protein particle itself is highly infectious. Somehow the infectious protein particle induces a posttranslational,
pathological change in the host’s normal protein to convert it to the aberrant isoform. The primary amino acid
sequence is not changed; only the structural conformation of the protein is different. Stanley B. Prusiner of the
University of California, San Francisco, won the 1997 Nobel Prize in Physiology or Medicine, awarded by
Stockholm’s Karolinska Institute, for his work on this system. For further information, see Prusiner’s article in
Science, available on the World Wide Web at: http://www.sciencemag.org/feature/data/prusiner/245.shl.
The second scenario utilizes a human papillomavirus (HPV) dataset. HPV is known to be associated with
many varieties of human genital cancers. The DNA from certain types of HPV, in particular types 16 and 18,
has been found integrated into various sites on human chromosomes, especially 12q13, and is often
associated with the cis-activation of cellular oncogenes and/or the establishment of heritable fragile sites
(OMIM).
HPV exists in a dizzying number of genetic types — there are almost 2000 HPV nucleotide
sequences including around 50 complete HPV genomes in GenBank (Bilofsky, et al. 1986)! Some types
appear relatively benign while others have powerful etiologic roles.
The ability to easily discriminate between HPV types is obviously a valuable diagnostics tool. PCR provides a
proven methodology for achieving just this. The HPV major capsid protein, or L1 gene as it is known, has
proven to be a reliable locus for this technique. The HPV viral coat is largely built from this protein, and,
therefore, represents the first and major antigen presented to the host. Hence, the selective pressure is quite
intense on the molecule: It evolves quickly enough to provide sufficient variation between types for screening
purposes and yet has strongly conserved areas to provide for ‘universal’ primers. One paired set, the so-called MY09/11 consensus, has been extensively used for this purpose. See, for other historic examples, the
articles by Tenti, Nagano, Stewart and their collaborators (all 1996).
I have already prepared a multiply aligned DNA sequence dataset of the L1 region from about 50 different
HPV sequences most similar to type 16 for the second scenario. This dataset will not require the design of
guessmers, as these sequences have quite a high degree of similarity, enough to make this region quite easy
to align at the DNA level. From the multiple sequence alignment provided, you will be able to design your own
‘universal’ and type/strain specific primers. Furthermore, using the GCG primer design software, you can test
the efficiency of the commercial MY09/11 universal set and compare it to your newly designed primers.
Finally, you can review the results of a database search that I completed using the MY09/11 primers to see
just how specific and/or universal they are for HPV L1 genes.
The Tutorial: A ‘Real-Life’ Project Oriented Approach
I write these tutorials from a ‘lowest-common-denominator’ biologist’s perspective. That is, I only assume that
you have fundamental molecular biology knowledge, but are relatively inexperienced regarding computers. As
a consequence of this they are written quite explicitly. Therefore, if you do exactly what is written, it will work.
However, this requires two things: 1) you must read very carefully and not skim over vital steps, and 2) you
mustn’t take offense if you already know what I’m discussing. I’m not insulting your intelligence. This also
makes the tutorials longer than otherwise necessary. Sorry.
I use bold type in the tutorial for those commands and keystrokes that you are to type in at your console or for
buttons that you are to click in SeqLab. I also use bold type for section headings. Screen traces are shown
in a “typewriter” style Courier font, and “////////////” indicates abridged data. The arrow symbol,
“>”, indicates the system prompt and should not be typed as a part of commands. Really important statements
may be underlined.
Specialized “X-server” graphics communications software is required to use GCG’s SeqLab interface. This
needs to be installed separately on personal style ‘Wintel’ or Macintosh machines but comes standard with
most UNIX operating systems. The details of X and of connecting to the GCG server on campus will not be
covered in this exercise. If you are unsure of these procedures ask for assistance in the computer laboratory.
I am also available for individualized personal help in your own laboratories if you are having difficulties
connecting to the GCG server from there, just contact me at stevet@bio.fsu.edu. A couple of tips at this point
should be mentioned though. Rather than holding mouse buttons down, to activate items, just click on them;
and do not close windows with the X-server software’s close icon in the upper right- or left-hand window
corner, rather, always use GCG’s “Close” or “Cancel” or “OK” button.
Standard operating procedure first step in much of molecular biology research
Probe genomic digests, shotgun clones, or cDNA libraries, or PCR methods toward the same end.
But, how do you design the oligonucleotide(s)? One way — defined DNA:
Based on known DNA sequences define and test probes/primers to any level of specificity using a
multiple sequence alignment of those sequences and primer design and analysis software, such as
GCG’s Prime. This is covered in the second portion of the tutorial.
Another way — the guessmer — ‘universal’ primers based on protein homology:
start from known protein sequences and find strong consensus elements within them;
BackTranslate the consensus elements to yield consensus DNA sequences;
use Prime to locate candidate primers within the conserved DNA regions;
test candidate primers’ suitability with FindPatterns and Prime.
Get started — SeqLab and primer design
Use the powerful X-based Graphical User Interface (GUI) sequence editor SeqLab to fully appreciate multiple
sequence alignments and, especially, to manipulate them.
SeqLab is a part of the Accelrys Genetics
Computer Group’s (GCG) Wisconsin Package. This comprehensive package of sequence analysis programs
is used worldwide and is one of my primary support responsibilities on campus. The package should initialize
automatically as soon as you log onto the GCG server. This process activates all of the programs within the
package and displays the current version of both the software and all of its accompanying databases.
Log on to the campus GCG server Mendel with an X tunneled ssh terminal connection (that’s a capital X!):
> ssh -X user@mendel.scs.fsu.edu
I placed a file in a publicly accessible GCG directory to make the last part of this section doable in ‘real time.’
Therefore, after logging in to the GCG server, issue the following command to copy this file into your account.
> fetch primer-tutorial.prion.finds
List your directory (ls) using the long form option (-l) on the new file to see how big it is:
> ls -l primer-tutorial.prion.finds
-rw-r--r--  1 stevet  gcg  49285 Jun 22 20:14 primer-tutorial.prion.finds
Next, issue the command “seqlab &” (without the quotes) in your terminal window to fire up the SeqLab
interface. The ampersand, “&,” is not necessary but it really helps out by launching SeqLab as a background
process so that you can retain control of your initial terminal window:
> seqlab &
The command should produce two new windows, the first an introduction with an “OK” box; click “OK.” You
should now be in SeqLab’s “List” mode.
Before beginning the analyses, go to the “Options” menu and select “Preferences . . .” We should check a
few options there to ensure that SeqLab runs in its most intuitive manner. If you were involved in last month’s
workshop, there is no need to repeat this section on SeqLab’s preferences. It ‘remembers’ your settings.
First notice that there are three different “Preferences” settings that can be changed: “General”, “Output,”
and “Fonts;” start with “General.” The “Working Dir . . .” setting will be the directory from which SeqLab was
initially launched. This is where all SeqLab’s working files will be stored; it can be changed in your accounts if
desired, however, leave it as is for now. Be sure that the “Start SeqLab in:” choice has “Main List” selected
(buttons are pushed in and shaded when they are turned on) and that “Close the window” is selected under
the “After I push the “Run” button:” choice. Next select the “Output” Preference. Be sure “Automatically
display new output” is selected. Finally, take a look at the “Fonts” menu. We’ll leave these choices as is,
but if you’re dealing with really big alignments, then picking a smaller Editor font point size may help to see
more of your alignment on the screen at once. Click “OK” to accept any changes.
1) The first case — the guessmer, from proteins to primers
The scenario
You are given a particular protein to investigate, here the prion protein. It is unknown in the particular
organism that your boss wants you to work with (let’s say, for the purpose of the tutorial, the strange lemur-like
critter the aye-aye); however, you are certain that the same protein has been worked with in other
related organisms. You want to use PCR methods to isolate the gene, so you’ll need to come up with
some primers. There are many ways to approach this design problem. I will present one that is useful when the
protein’s sequence is known in several representative cases and (let’s assume, for the purpose of the
exercise) the DNA is too divergent to align directly. The first step is to look for the protein in the protein
databases. We are going to use GCG’s database browser program LookUp to do this.
a) LookUp the UniProt protein database
We need to know proper database identity names or accession codes to find entries of interest in sequence
databases. Database text searching programs are often the easiest way to do this. There are several
methods; the NCBI Entrez program is one of the more powerful, EMBL/EBI’s SRS is another. Here we’ll use
GCG’s LookUp program because it creates an output file that can be used as an input list file to other GCG
programs. Ensure that your “SeqLab Main Window” shows “Mode: Main List.”
Launch “LookUp” through the “Functions” “Database Reference Searching” menu.
In the “LookUp”
window be sure that “Search the chosen sequence libraries” is checked and that “UniProt” is the only
library selected. Under the main query section of the window, type the word “prion” following the category
“Definition” and the word “primate” in the “Organism” category. The “Organism” category supports any
proper taxonomic name, making it a great way to restrict your searches. Press the “Run” button. This should
find most of the prion proteins from primates in the UniProt database; since aye-ayes are primates, this is a
logical approach. The program will next display the results of the search; scroll through your output and then
“Close” the window. The very top portion of my LookUp output file follows below:
!!SEQUENCE_LIST 1.0
LOOKUP in: uniprot of: "([SQ-DEF: prion*] & [SQ-ORG: primate*])"
71 entries  October 9, 2006 19:45 ..
UNIPROT_SPROT:PRIO_AOTTR ! ID: 02f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_ATEGE ! ID: 03f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_ATEPA ! ID: 04f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CALJA ! ID: 0bf50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CALMO ! ID: 0cf50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen) (Fragment).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CEBAP ! ID: 10f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CERAE ! ID: 11f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
! DE antigen).
! GN Name=PRNP; Synonyms=PRP;
UNIPROT_SPROT:PRIO_CERAT ! ID: 12f50101
! DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (CD230
////////////////////////////////////////////////////////////////////////
Be careful that all of the proteins included in the output from any text-searching program are appropriate. In
this case, upon a quick perusal, I see at least one of the entries is not a true prion, it’s a prion-like protein:
UNIPROT_SPROT:PRND_HUMAN ! ID: 56f60101
! DE Prion-like protein doppel precursor (PrPLP) (Prion protein 2).
This entry should either be edited out of the list file, or it can be removed after loading the list into the SeqLab
editor display. An option, if you use an editor, is to comment out the undesired sequences by placing an
exclamation point, “!,” in front of the unwanted lines. GCG uses exclamation points as remark delineators.
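If you prefer to script that edit rather than do it by hand, something like the following Python sketch would work. It assumes the list file looks like the LookUp output above, with each entry specification (for example UNIPROT_SPROT:PRND_HUMAN) as the first field on its own line; the function name comment_out_entries is hypothetical and not part of GCG.

```python
def comment_out_entries(list_text, unwanted):
    """Prefix '!' to list-file lines whose entry name is in `unwanted`.

    Lines that are already remarks (starting with '!') are left untouched.
    """
    unwanted = {name.upper() for name in unwanted}
    out = []
    for line in list_text.splitlines():
        fields = line.split()
        # The entry specification is the first field, e.g. UNIPROT_SPROT:PRND_HUMAN;
        # the entry name is whatever follows the last colon.
        entry = fields[0].split(":")[-1].upper() if fields else ""
        if not line.lstrip().startswith("!") and entry in unwanted:
            line = "!" + line
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "UNIPROT_SPROT:PRIO_HUMAN ! ID: example\n"
        "UNIPROT_SPROT:PRND_HUMAN ! ID: 56f60101\n"
        "! DE Prion-like protein doppel precursor (PrPLP) (Prion protein 2)."
    )
    print(comment_out_entries(sample, ["PRND_HUMAN"]))
```

Only the line naming the entry needs the extra exclamation point, because the annotation lines beneath it already begin with one. Work on a copy of the list file so the original LookUp output stays intact.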
Select the LookUp output file in the “SeqLab Output Manager” and press the “Add to Main List” button;
close the window afterwards. Next, be sure that the LookUp output file is selected in the “SeqLab Main
Window” and switch “Mode:” to “Editor.” This will load the file into the SeqLab editor and allow us to align the
entries and perform further analyses. ‘Grab and drag’ the lower-right corner of the display to expand it to a
more convenient size. The display should look similar to the graphic at the top of the following page below:
Select the prion-like entry, “PRND_HUMAN.” Press the “CUT” button to remove it. Explore the dataset; use
the horizontal scroll bar to move along the length of the sequences, and the vertical scroll bar to see the rest of
the entries. The “1:1” slider on top allows you to ‘zoom’ in and out on the dataset; move it to “2:1” so that you
can see most of the length at once. Double-click on various entries’ names to see their database annotations
(or single click the “INFO” icon with the sequence entry name selected).
Entries can be analyzed and
databases searched through the “Functions” menu, but not now — we’ve got too much to cover tonight.
Change the “Display:” box from “Residue Coloring” to “Graphic Features.” Now the display shows a
schematic of the database feature information from each entry. Double-click on various colored regions of the
alignment (or use the “Features” choice under the “Windows” menu); a “Sequence Features” window will
describe the features within the region of the sequence that you selected. Select the feature to show more
details. I selected one of the alpha helices in the human prion and my display looks like the graphic below:
“Close” the “Sequence Feature” window. Switch the “Display:” back to “Residue Coloring” after checking
out the “Graphic Features” representation. Also use the “File” menu “Save As . . .” button to save the
dataset as an RSF file.
Give it a filename that makes sense such as “prion,” but leave the “.rsf”
extension so that you’ll recognize the type of file that it is in your directory. RSF files contain sequence data,
names, and annotation — the acronym stands for “rich sequence format.”
b) PileUp the hits and evaluate the results
Now we need to align all of these proteins to determine the most conserved areas, those areas most suitable
in which to locate primers. Therefore, select all of the prion sequence entries in the editor window either by
dragging the mouse through them all (if they were to all fit in the window), by using <shift> click on the top and
bottom-most entries, or by selecting “Select All” from the “Edit” menu. Now go to the “Functions” “Multiple
Comparison” menu and choose “PileUp.” ClustalW+ is also available there for situations too complicated for
PileUp, but this dataset readily aligns with PileUp.
You may want to see all the options that are available, although we don’t need to use any in this example. To
do so, click on the “Options” button and scroll through the window; “Close” it when finished. Depending on the
level of divergence in a dataset, better multiple sequence alignments can often be generated by using
alternate scoring matrices (the –Matrix= option, with the BLOSUM30 matrix being the most suitable for the
most diverged datasets, Henikoff and Henikoff, 1992) and/or different gap penalties. Gap penalties can be
adjusted as desired but the defaults usually work quite well. Furthermore, GCG’s –InSitu option can be
incredibly effective at realigning regions within an alignment (see Workshop #1). However, these sequences
are all similar enough that we can just run PileUp using the GCG defaults, therefore, just press “Run” in the
“PileUp” window and the program will launch.
PileUp will first compare every sequence with every other one; this is the pairwise nature of the program. It
will then progressively merge them into an alignment in order of determined similarity, from most to least.
The window will go away and then, after a few moments, depending on the complexity of the alignment and
the load on the server, new output windows will automatically display. The top window will be the Multiple
Sequence Format (MSF) output from your PileUp run. Notice the BLOSUM62 matrix and gap introduction and
extension penalties used by default. Scroll through your alignment to check it out and then “Close” the
window afterwards. A greatly abridged version of my primate prion MSF file follows below:
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @/home/thompson/.seqlab-mendel/pileup_1.list
Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 1102
GapWeight: 8
GapLengthWeight: 2
pileup_1.msf  MSF: 664  Type: P  October 11, 2006 15:38  Check: 4298 ..
Name: q7kyz4_human    Len:   664  Check:  282  Weight:  1.00
Name: q7kyy8_human    Len:   664  Check:  963  Weight:  1.00
Name: o75942_human    Len:   664  Check: 7681  Weight:  1.00
Name: q6ses1_human    Len:   664  Check: 6122  Weight:  1.00
Name: prio_cerae      Len:   664  Check: 2703  Weight:  1.00
Name: prio_cerdi      Len:   664  Check: 2703  Weight:  1.00
Name: prio_cerat      Len:   664  Check: 4010  Weight:  1.00
Name: prio_macsy      Len:   664  Check: 4010  Weight:  1.00
Name: prio_thege      Len:   664  Check: 4214  Weight:  1.00
Name: prio_atege      Len:   664  Check: 4488  Weight:  1.00
Name: q5ub85_atepa    Len:   664  Check: 5143  Weight:  1.00
Name: q9tu20_varvv    Len:   664  Check: 6846  Weight:  1.00
Name: q86xr1_human    Len:   664  Check: 3331  Weight:  1.00
Name: prio_human      Len:   664  Check: 5841  Weight:  1.00
Name: q5qpb4_human    Len:   664  Check: 4002  Weight:  1.00
Name: q53yk7_human    Len:   664  Check: 5841  Weight:  1.00
Name: prio_gorgo      Len:   664  Check: 6237  Weight:  1.00
Name: q6fgn5_human    Len:   664  Check: 6291  Weight:  1.00
Name: q27h91_human    Len:   664  Check: 5263  Weight:  1.00
Name: prio_hylla      Len:   664  Check: 6422  Weight:  1.00
Name: prio_hylsy      Len:   664  Check: 6422  Weight:  1.00
Name: prio_pantr      Len:   664  Check: 6422  Weight:  1.00
Name: q5u0k3_human    Len:   664  Check: 6324  Weight:  1.00
/////////////////////////////////////////////////////////////
//
              1                                                   50
q7kyz4_human  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
q7kyy8_human  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
o75942_human  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
q6ses1_human  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_cerae    MANLGCWMLV VFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_cerdi    MANLGCWMLV VFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_cerat    ~~~~~~~MLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_macsy    ~~~~~~~MLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_thege    ~~~~~~~MLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_atege    ~~~~~~~MLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
q5ub85_atepa  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~GG WNTGGSRYPG QGSPGGNRYP
q9tu20_varvv  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
q86xr1_human  ~~~~~~~MLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_human    MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
q5qpb4_human  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
q53yk7_human  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
prio_gorgo
q6fgn5_human
q27h91_human
prio_hylla
prio_hylsy
prio_pantr
q5u0k3_human
q540c4_human
q5ub98_chisa
q5ub97_cacca
prio_ponpy
q5ub99_pitir
prio_colgu
q6jl99_macmu
prio_prefr
prio_macar
prio_macfa
prio_macfu
prio_macmu
prio_macne
prio_papha
prio_cermo
prio_mansp
prio_cerne
prio_erypa
prio_certo
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
~~~~~~~MLV
~~~~~~~~~~
~~~~~~~~~~
MANLGCWMLV
~~~~~~~~~~
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
MANLGCWMLV
~~~~~~~MLV
~~~~~~~MLV
~~~~~~~MLV
~~~~~~~MLV
~~~~~~~MLV
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
~~~~~~~~~~
~~~~~~~~~~
LFVATWSNLG
~~~~~~~~~~
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
LFVATWSDLG
VFVATWSDLG
LFVATWSDLG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
~~~~~~~~GG
~~~~~~~~GG
LCKKRPKPGG
~~~~~~~~GG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
LCKKRPKPGG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
WNTGGSRYPG
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
QGSPGGNRYP
////////////////////////////////////////////////////////////////////
q5ub92_cebap  ~~~~~~~~~~ ~~~~
prio_atepa    ~~~~~~~~~~ ~~~~
q5ub94_lagla  ~~~~~~~~~~ ~~~~
prio_aottr    ~~~~~~~~~~ ~~~~
q5ub96_alobe  ~~~~~~~~~~ ~~~~
prio_calmo    ~~~~~~~~~~ ~~~~
q5uba0_calmo  ~~~~~~~~~~ ~~~~
q5ub87_calja  ~~~~~~~~~~ ~~~~
q5ub88_calgo  ~~~~~~~~~~ ~~~~
prio_calja    ~~~~~~~~~~ ~~~~
q5ub90_9prim  ~~~~~~~~~~ ~~~~
q5ub95_braar  ~~~~~~~~~~ ~~~~
q5ub93_saisc  ~~~~~~~~~~ ~~~~
q5ub91_aotle  ~~~~~~~~~~ ~~~~
q5ub89_leoro  ~~~~~~~~~~ ~~~~
q5ub86_cebpy  ~~~~~~~~~~ ~~~~
prio_saisc    ~~~~~~~~~~ ~~~~
q1l6p5_micmu  ~~~~~~~~~~ ~~~~
q27h88_human  ~~~~~~~~~~ ~~~~
q5tg42_human  AYRGFIFKQT SKPF
q5t2t6_human  ~~~~~~~~~~ ~~~~
q5t2t5_human  ~~~~~~~~~~ ~~~~
q5tg43_human  ~~~~~~~~~~ ~~~~
q5tg34_human  AYRGFIFKQT SKPF
q5t2t8_human  AYRGFIFKQT SKPF
q5tg35_human  AYRGFIFKQT SKPF
q15196_human  ~~~~~~~~~~ ~~~~
After scrolling through your alignment and then “Close”ing its window, the next window visible will be the
“SeqLab Output Manager.” This very important window will contain all of the output from your current
SeqLab session. Files may be displayed, printed, saved with other names and/or in other locations, and
deleted from this window. We need to use an extremely important function at this point; press the “Add to
Editor” button and specify “Overwrite old with new” in the next window when prompted, to take your MSF
output and merge it with the RSF file in the open editor. This will keep all feature information intact, yet
renumber all of its reference locations. “Close” the “Output Manager” after loading your new alignment. The
next window will contain PileUp’s cluster dendrogram; in the primate prion case, the graphic below:
This similarity dendrogram can be very helpful for determining whether the sequences used are all
appropriate. The length of the vertical lines is proportional to the difference between the sequences. In this
case I think that we should exclude all of the human outlier sequences seen at far left in the dendrogram, just
keeping the main central cluster. However, realize that this tree is not an evolutionary tree. No phylogenetic
inference algorithms, such as maximum likelihood or parsimony, nor any ‘multiple-hit’ correction models, such
as Jukes-Cantor or Kimura, are used in its construction. PileUp’s dendrogram merely indicates the relative
similarity of the sequences and, therefore, the clustering order in which the alignment was built. After loading
our new alignment, my SeqLab Editor display looks like the following screen dump at a “4:1” zoom ratio:
Notice the nice columns of color representing columns of aligned residues. However, also notice that the
alignment appears in two different sections, presumably true prions and those outliers seen in the dendrogram.
This is further evidence that we are trying to align ‘apples and oranges.’ Double-click on the entry names at
the very bottom of the alignment; it turns out that they are prion “interacting” proteins, not prions
themselves. Select all of these non-prion outlier sequences and “CUT” them from the alignment.
It also turns out that the sequence just above this group, Q27H88_HUMAN, isn’t a real prion either; it’s another
prion-like protein. There were nine outliers in the dendrogram. I didn’t do a very good job of spotting all the
non-prions on my perusal of the LookUp output. Oh well; they are easy to get rid of at this stage. Select and
“CUT” it from the alignment as well. To ensure that no columns of gaps are left in an alignment after cutting
sequences out of it, it is always a good idea to use the “Edit” “Select All,” and then “Edit,” “Remove Gaps . . . ,”
“Columns of gaps” functions. Do so at this time, and then use the “File” menu to “Save As . . .” the
alignment. Use the same name as before and “Overwrite” the file. Return your display to a “1:1” zoom ratio.
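For readers who want to see what the “Columns of gaps” cleanup amounts to, here is a minimal Python sketch. It is not SeqLab code; the function name, the toy rows, and the choice of gap characters ("~", "." and "-") are my own assumptions for illustration.

```python
def remove_gap_columns(alignment, gap_chars="~.-"):
    """Drop every column that holds only gap characters in all rows."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same length")
    # Keep a column if at least one row has a real residue there.
    keep = [i for i in range(width)
            if any(row[i] not in gap_chars for row in alignment)]
    return ["".join(row[i] for i in keep) for row in alignment]


# Toy rows (hypothetical): two columns became all-gap after a sequence was cut.
rows = [
    "MANLG~~CWMLV",
    "MANLG~~CWMLV",
    "~~~~~~~~~MLV",
]
cleaned = remove_gap_columns(rows)
for row in cleaned:
    print(row)
```

Columns that still contain a residue in any remaining sequence are kept, so end gaps in fragments survive untouched.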
c) Determine areas of maximal conservation
To identify regions of the alignment most appropriate for designing universal primers or probes, we need to
decide what regions are most highly conserved. To design a hybridization probe, the single most highly
conserved section is chosen; to design paired PCR primers, two flanking, highly conserved areas are chosen. A good
way of doing this is to calculate the running average similarity using a sliding window approach. The GCG
graphics program PlotSimilarity does this so that we can easily visualize the positional conservation of a
multiple sequence alignment. The program uses a sliding window along with a similarity matrix, such as
BLOSUM62, to indicate which portions are most conserved and which are most variable. The program can
also produce a color mask that corresponds to the plot by representing peaks with dark grays. This can be
overlaid on the alignment in the SeqLab editor to see exactly where the similarity rises and falls.
An advantage of running PlotSimilarity on a protein alignment rather than a DNA alignment is that the peaks on
the plot not only represent the most conserved regions of the alignment, but also those areas most resistant to
evolutionary change due to the algorithm’s use of the BLOSUM matrix in its calculations.
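The sliding window idea can be sketched in a few lines of Python. This is only an illustration of a running average similarity, not PlotSimilarity’s actual implementation: the function names are hypothetical, and I have swapped in a simple identity score (1 for a match, 0 for a mismatch) where the program would use a matrix such as BLOSUM62.

```python
from itertools import combinations

GAPS = "~.-"  # assumed gap characters


def identity_score(a, b):
    """Stand-in scoring function: 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


def column_similarity(column, score):
    """Average pairwise score among the non-gap residues of one column."""
    residues = [c for c in column if c not in GAPS]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def running_similarity(alignment, window=10, score=identity_score):
    """Average the per-column similarity over each full window."""
    per_column = [column_similarity(col, score) for col in zip(*alignment)]
    return [sum(per_column[i:i + window]) / window
            for i in range(len(per_column) - window + 1)]


# Three ten-residue rows taken from the prion alignment above.
toy = ["LFVATWSDLG", "VFVATWSDLG", "LFVATWSNLG"]
profile = running_similarity(toy, window=5)
print([round(v, 3) for v in profile])
```

Passing a different `score` function (for example, a lookup into a substitution matrix) changes the ordinate scale but not the windowing logic.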
Ensure that all of the sequence entries are still selected. Next go to the SeqLab “Functions” menu; select
“Multiple Comparison” and then “PlotSimilarity.” You may get a “Which selection” box if you have
previously selected a region of the alignment; if you do, specify “Selected sequences” not “Selected region.”
This will produce a PlotSimilarity dialog box. We need to change some of the program defaults there, so
choose “Options . . . .” Check “Save SeqLab colormask to” and “Scale the plot between:” the “minimum
and maximum values calculated from the alignment.” The first option’s output file will be used in the next
step and the second specification launches the program’s –Expand option. This blows up the plot, scaling it
between the maximum and minimum similarity values observed so that the entire graph is used rather than
just the portion of the Y-axis that the alignment happens to occupy. The Y-axis of the resulting plot will use the
similarity values from the default amino acid scoring matrix or you can specify an alternative. “Close” the
“PlotSimilarity Options” window; notice that the “Command Line:” box in the program window now reflects
your updated options. Click the “Run” box to launch the program. The output will quickly return. “Close” the
plotsimilarity.cmask display and the “Output Manager” and then take a look at the similarity plot. My
example follows below:
This example shows a great deal of sequence similarity. Strong peaks can be seen centered about positions
40 and 90, and throughout 125 to 270 or thereabout. The ordinate scale here is dependent on the scoring
matrix used by the program, by default the BLOSUM62 table in which amino acid identities vary from 4 to 11.
The dashed line across the middle shows the average similarity value for the entire alignment, here about 3.8.
“Close” the PlotSimilarity window after noting where appropriate sections of high conservation within the
alignment occur.
Next, go to the SeqLab “File” menu; select “Open Color Mask Files.” Select the file displayed in the dialog
box, “plotsimilarity.cmask;” click “Add” and then “Close.” Notice that the display is now represented in
various gray-tones — the intensity of color is proportional to the level of similarity in the alignment at that point,
averaged over the default window of 10 amino acids. Notice the correspondence between the original plot’s
peaks and valleys and the color mask’s dark and light areas. My screen dump is shown here:
The point of these similarity visualization techniques is to identify those regions of the alignment that will be
most appropriate for designing universal primers — areas of high conservation, obviously. Try to identify
stretches that correspond to around 100 bases, i.e. around 30 to 40 amino acids. Decide whether you want to
design a single hybridization probe (the central repeat region here looks great) or paired PCR primers based
on the observed similarity. Either case will do for the exercise. I will illustrate paired PCR guessmers by
choosing the furthest separated, most highly conserved regions I can find.
If designing a single hybridization probe, choose the single, longest, least ambiguous sequence you can find
based on all the information you have. If designing PCR primers, choose two highly conserved stretches that
bracket the longest portion of the alignment possible. This is obviously a subjective decision and depends on
how much of the sequence you will be trying to amplify. Regardless, choose the longest regions possible, as I
stated above, at least 30 to 40 amino acids long, in order to get target regions at least 100 base pairs apiece.
We will isolate the best primers within these stretches. Decide which exact sequence regions to use; write
down your selections. I selected residues 21 through 51 for my upstream primer, and 210 through 260 for my
downstream primer.
d) Use ProfileMake to create a consensus
We need to generate a consensus of the sequence alignment next. We could use the “Consensus” tool under
SeqLab’s “Edit” menu; however, the most powerful protein sequence consensus method I am aware of is the
Profile algorithm. This algorithm uses all of the data of an alignment, its conservation and its variability, as
well as the BLOSUM matrix to create a new alignment specific similarity matrix. Certainly, in this case,
because of the high similarity of all the sequences, the difference would be trivial, but sometimes it can make
a big difference. A profile, and its inherent consensus, is created with the program ProfileMake. Be sure that
all of your sequences are selected and then go to the “Functions” “Multiple Comparison” menu and launch
“ProfileMake.” Punch the “Options” button, select “Write the consensus into a sequence file,” and supply
an appropriate filename. This will launch the program’s –SeqOut option to generate a normal GCG sequence
file of the consensus in addition to the profile. Leave the other options as they are and “Close” the “Options”
window. Press “Run” in the “ProfileMake” program window and check out the results. Take a look at the
consensus sequence. The abridged primate prion profile consensus sequence follows:
!!AA_SEQUENCE 1.0
(Consensus) (Peptide) PROFILEMAKE v4.50 of: @/home/thompson/.seqlab-mendel/profilemake_12.list
Length: 287  Sequences: 61  MaxScore: 1062.85  October 13, 2006 16:05
Gap: 1.00  Len: 1.00
GapRatio: 0.33  LenRatio: 0.10
input_12.rsf{Q7KYZ4_HUMAN}  From: 1  To: 143  Weight: 1.00
input_12.rsf{Q7KYY8_HUMAN}  From: 1  To: 135  Weight: 1.00
input_12.rsf{O75942_HUMAN}  From: 1  To: 287  Weight: 1.00
input_12.rsf{Q6SES1_HUMAN}  From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_CERAE}    From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_CERDI}    From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_CERAT}    From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_MACSY}    From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_THEGE}    From: 1  To: 287  Weight: 1.00
input_12.rsf{PRIO_ATEGE}    From: 1  To: 282  Weight: 1.00
input_12.rsf{Q5UB85_ATEPA}  From: 1  To: 273  Weight: 1.00
input_12.rsf{Q9TU20_VARVV}  From: 1  To: 225  Weight: 1.00
0
////////////////////////////////////////////////////////////////////////////////
Symbol comparison table: GenRunData:blosum62.cmp
Relaxed treatment of non-observed characters
Exponential weighting of characters
Length: 287 October 13, 2006 16:05 Type: P
FileCheck: 982
Check: 7501
  1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
 51  PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QPHGGGWGQP
101  HGGGWGQPHG GGWGQPHGGG WGQPHGGGTH NQWNKPSKPK TNMKHMAGAA
151  AAGAVVGGLG GYMLGSAMSR PLIHFGNDYE DRYYRENMYR YPNQVYYRPV
201  DQYSNQNNFV HDCVNITIKQ HTVTTTTKGE NFTETDVKMM ERVVEQMCIT
251  QYEKESQAYY QRGSSMVLFS SPPVILLISF LIFLIVG
..
You may want to look at your resultant “.prf” file. It’s a big table of numbers that doesn’t make a whole lot of
sense on first inspection; however, it is a tremendously powerful tool in subsequent analysis steps. Other
programs can read and interpret all of those numbers to perform very sensitive database searches and
alignments by utilizing the information within it which penalizes misalignments in phylogenetically conserved
areas more than in variable regions.
Load your new profile consensus sequence into the editor. Do this with the “Windows” menu “Output
Manager” window. Select your new consensus sequence file name there and press the “Add to Editor”
button. “Close” the “Output Manager” window after loading the consensus sequence.
e) Select and use BackTranslate on the consensus sequence
In an actual lab situation your peptide probe region(s) may not be as long as my examples. I was fortunate
to find such strong consensus elements in the prion protein. Regardless of what length regions you come up
with though, they are still peptide sequences and oligonucleotide probes are necessary for both hybridization
and PCR methodology. Backtranslation is not trivial because of the degeneracy of the genetic code. GCG
has addressed this problem with their program BackTranslate. Alternate codons are indicated in the output
along with their order of preference, based on the codon usage table that you specify, for each amino acid of
the sequence. You can choose from them; the program generates either the most probable or the most
ambiguous sequence.
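To make the idea of a “most probable” back-translation concrete, here is a small Python sketch. It is not the BackTranslate program; the function names are hypothetical, and the table holds only the handful of amino acids needed for the first ten residues of the prion consensus, with frequencies copied from the human_high.cod values that appear in the output reproduced later in this section.

```python
# Partial codon preference table (amino acid -> {codon: relative frequency}).
HUMAN_HIGH_SUBSET = {
    "M": {"ATG": 1.00},
    "A": {"GCC": 0.53, "GCG": 0.17, "GCT": 0.17, "GCA": 0.13},
    "N": {"AAC": 0.78, "AAT": 0.22},
    "L": {"CTG": 0.58, "CTC": 0.26, "TTG": 0.06,
          "CTT": 0.05, "CTA": 0.03, "TTA": 0.02},
    "G": {"GGC": 0.50, "GGG": 0.24, "GGA": 0.14, "GGT": 0.12},
    "C": {"TGC": 0.68, "TGT": 0.32},
    "W": {"TGG": 1.00},
    "V": {"GTG": 0.64, "GTC": 0.25, "GTT": 0.07, "GTA": 0.05},
}


def ranked_codons(amino_acid, table):
    """Return (codon, frequency) pairs, most preferred first."""
    return sorted(table[amino_acid].items(), key=lambda kv: kv[1], reverse=True)


def most_probable_backtranslation(peptide, table):
    """Pick the single most frequent codon for every residue."""
    return "".join(ranked_codons(aa, table)[0][0] for aa in peptide)


dna = most_probable_backtranslation("MANLGCWMLV", HUMAN_HIGH_SUBSET)
print(dna)
```

The result can be compared by eye with the first thirty bases of the back-translated sequence file; swapping in a different codon table changes the answer, which is why the choice of table matters.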
To use BackTranslate you must decide which codon usage table you want the program to utilize. By default
BackTranslate will use a frequency table designed from highly expressed E. coli genes. Therefore, if you’re
working with an E. coli gene, the program’s default is appropriate. However, if your protein comes from
anything else, you will want to use an alternate table. GCG provides a few alternate data files in a public data
library with the GCG logical name GenMoreData. The available tables, in addition to the default codon usage
table, ecohigh.cod, are: celegans_high.cod, celegans_low.cod, drosophila_high.cod, human_high.cod,
maize_high.cod, and yeast_high.cod. Even more tables are available at various molecular biology data
servers such as IUBIO (http://iubio.bio.indiana.edu/soft/molbio/codon/). The TRANSTERM database at the
European Bioinformatics Institute (ftp://ftp.ebi.ac.uk/pub/databases/transterm/) also contains several, and an
especially good selection derived from a recent GenBank version comes from the CUTG database
(http://www.kazusa.or.jp/codon/) available in GCG format through various SRS servers (e.g. see
http://srs.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-lib+CUTG). Furthermore, if you are not
satisfied with any of the available options, GCG has a program, CodonFrequency, that enables you to create
your own codon frequency table from known coding sequences.
Select your profile consensus sequence entry (only). Now go to the “Functions” “Translation”
“BackTranslate . . .” menu; specify “Selected Sequence,” if asked. In the BackTranslate program window
change the type of sequence produced from the “Would you like to see:” most ambiguous default to “table
of back-translations and the most probable sequence.” You also need to change the “Codon Frequency
Table . . .” from the default “ecohigh.cod” to something more reasonable, so press the button and choose
“human_high.cod” from the “Chooser for Codon Frequency Table” window that pops up. Press the “OK”
button in the “Chooser” window after selecting the human table, and then press “Run” in the program window.
Display the output file and notice how each codon is listed. An abridged version of my prion backtranslation
sequence data file is shown below:
!!NA_SEQUENCE 1.0
BACKTRANSLATE of: input_14.rsf{prion}  check: 7501  from: 1  to: 287

Description: (Consensus) (Peptide) PROFILEMAKE v4.50 of:
@/home/thompson/.seqlab-mendel/profi
Accession/ID:
====================General comments====================
(Consensus) (Peptide) PROFILEMAKE v4.50 of:
@/home/thompson/.seqlab-mendel/profilemake_12.list
Length: 287  Sequences: 61  MaxScore: 1062.85  October 13, 2006 16:05

Using codon frequencies from: /usr/local/gcg/share/codon/human_high.cod
CheckFile: 1528

CODONFREQUENCY January 24, 1991 16:55
From an existing codon frequency file: Humprb4l_217_1054.Cod FileCheck: 8577
From an existing codon frequency file: Humprb4m_217_928.Cod FileCheck: 7623
From an existing codon frequency file: Humprb1s_51_763.Cod FileCheck: 8371
From an existing codon frequency file: Humprb1_51_946.Cod FileCheck: 119
From an existing codon frequency file: Humptaa_156_488.Cod FileCheck: 9052 . . .

[Table of back-translations for residues 1 through 28, one residue per row, codons in order of preference]

 1  Met  ATG 1.00
 2  Ala  GCC 0.53  GCG 0.17  GCT 0.17  GCA 0.13
 3  Asn  AAC 0.78  AAT 0.22
 4  Leu  CTG 0.58  CTC 0.26  TTG 0.06  CTT 0.05  CTA 0.03  TTA 0.02
 5  Gly  GGC 0.50  GGG 0.24  GGA 0.14  GGT 0.12
 6  Cys  TGC 0.68  TGT 0.32
 7  Trp  TGG 1.00
 8  Met  ATG 1.00
 9  Leu  CTG 0.58  CTC 0.26  TTG 0.06  CTT 0.05  CTA 0.03  TTA 0.02
10  Val  GTG 0.64  GTC 0.25  GTT 0.07  GTA 0.05
11  Leu  CTG 0.58  CTC 0.26  TTG 0.06  CTT 0.05  CTA 0.03  TTA 0.02
12  Phe  TTC 0.80  TTT 0.20
13  Val  GTG 0.64  GTC 0.25  GTT 0.07  GTA 0.05
14  Ala  GCC 0.53  GCG 0.17  GCT 0.17  GCA 0.13
15  Thr  ACC 0.57  ACG 0.15  ACA 0.14  ACT 0.14
16  Trp  TGG 1.00
17  Ser  AGC 0.34  TCC 0.28  TCT 0.13  AGT 0.10  TCG 0.09  TCA 0.05
18  Asp  GAC 0.75  GAT 0.25
19  Leu  CTG 0.58  CTC 0.26  TTG 0.06  CTT 0.05  CTA 0.03  TTA 0.02
20  Gly  GGC 0.50  GGG 0.24  GGA 0.14  GGT 0.12
21  Leu  CTG 0.58  CTC 0.26  TTG 0.06  CTT 0.05  CTA 0.03  TTA 0.02
22  Cys  TGC 0.68  TGT 0.32
23  Lys  AAG 0.82  AAA 0.18
24  Lys  AAG 0.82  AAA 0.18
25  Arg  CGC 0.37  CGG 0.21  AGG 0.18  AGA 0.10  CGT 0.07  CGA 0.06
26  Pro  CCC 0.48  CCT 0.19  CCG 0.17  CCA 0.16
27  Lys  AAG 0.82  AAA 0.18
28  Pro  CCC 0.48  CCT 0.19  CCG 0.17  CCA 0.16
///////////////////////////////////////////////////////////////////////////
prion_14.seq  Length: 861  October 13, 2006 16:35  Type: N  Check: 1187  ..

  1  ATGGCCAACC TGGGCTGCTG GATGCTGGTG CTGTTCGTGG CCACCTGGAG
 51  CGACCTGGGC CTGTGCAAGA AGCGCCCCAA GCCCGGCGGC TGGAACACCG
101  GCGGCAGCCG CTACCCCGGC CAGGGCAGCC CCGGCGGCAA CCGCTACCCC
151  CCCCAGGGCG GCGGCGGCTG GGGCCAGCCC CACGGCGGCG GCTGGGGCCA
201  GCCCCACGGC GGCGGCTGGG GCCAGCCCCA CGGCGGCGGC TGGGGCCAGC
251  CCCACGGCGG CGGCTGGGGC CAGCCCCACG GCGGCGGCTG GGGCCAGCCC
301  CACGGCGGCG GCTGGGGCCA GCCCCACGGC GGCGGCTGGG GCCAGCCCCA
351  CGGCGGCGGC TGGGGCCAGC CCCACGGCGG CGGCACCCAC AACCAGTGGA
401  ACAAGCCCAG CAAGCCCAAG ACCAACATGA AGCACATGGC CGGCGCCGCC
451  GCCGCCGGCG CCGTGGTGGG CGGCCTGGGC GGCTACATGC TGGGCAGCGC
/////////////////////////////////////////////////////////////////
The final, resultant nucleotide sequence is the most likely coding sequence for the consensus polypeptide we
specified using the codon frequency chart we chose. A recommended enhancement, within those codons in
potential primers, is to prepare a mixture of oligos containing the various codons for those positions that are
particularly ambiguous, such as the serine at position 17 in the example above. AGC is used 34% of the time,
but TCC is also used 28% of the time. Several more analyses are necessary before synthesizing your new
probes, however. We need to discover which portions of the consensus elements that we have identified
make the best primers. And of those portions, we need to determine if they have significant internal
complementation such that strong ‘hairpin’ structures would be formed, and we should also check for self- and
primer-dimer complementation. The GCG program Prime can be used for all these tests. We also need to
run a DNA database search to make sure that only the type of genes that we are interested in are ‘found.’ The
GCG program FindPatterns is probably best for this type of search because it does not allow gapping.
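The oligo mixture suggestion can be illustrated with a short Python sketch that enumerates the variants you would need to synthesize. This is my own illustration, not a GCG function: the names are hypothetical, and the 0.26 frequency cutoff is an arbitrary assumption chosen so that both AGC and TCC are kept for serine. The example covers residues 16 through 18 (Trp, Ser, Asp) of the prion consensus.

```python
from itertools import product

# Codon frequencies for three residues, copied from the human_high.cod table.
CODONS = {
    "W": {"TGG": 1.00},
    "S": {"AGC": 0.34, "TCC": 0.28, "TCT": 0.13,
          "AGT": 0.10, "TCG": 0.09, "TCA": 0.05},
    "D": {"GAC": 0.75, "GAT": 0.25},
}


def codon_choices(peptide, table, min_frequency):
    """For each residue, keep the codons at or above the frequency cutoff."""
    return [[codon for codon, freq in table[aa].items() if freq >= min_frequency]
            for aa in peptide]


def oligo_mixture(choices):
    """Enumerate every oligo obtained by picking one codon per residue."""
    return ["".join(codons) for codons in product(*choices)]


choices = codon_choices("WSD", CODONS, min_frequency=0.26)  # assumed cutoff
mixture = oligo_mixture(choices)
print(len(mixture), mixture)
```

The size of the mixture is the product of the number of codons kept at each position, so it grows quickly if the cutoff is lowered.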
f) Use Prime to locate ‘good’ primers within your candidate regions
The GCG program Prime can locate acceptable primers within a DNA template; Prime+ will work with
genomic length sequences. The programs are quite powerful and contain many, many options to maximize
flexibility. We will use Prime here to find the best forward and reverse primers within the defined 5’ and 3’
sequence regions identified above based on sequence similarity. We’ll use Prime to localize the best primers
within our defined stretches of DNA, eventually locating nucleotide hybridization guessmers of 30 to 50 bases,
corresponding to a peptide of 10 to 17 residues, or PCR primers around 20 to 30 bases in length,
corresponding to a peptide 7 to 10 amino acid residues long. (Although, in ‘real life,’ PCR primers may need
to be even longer to maximize annealing potential.)
But before we run Prime, we need to load the new backtranslated sequence into our Editor display so it is
available for analysis. Therefore, use “Add to Editor” from the SeqLab “Output Manager” to load the new
DNA sequence. The sequence will not load aligned to its respective protein coding region. In fact, this would
be impossible because it is three times longer than the protein sequence, which would need to be spaced out
to leave two gaps between every amino acid to reconcile the two. It will load starting at position one in the
Editor display. Just realize that it is no longer aligned to the alignment above.
Specify the upstream and downstream backtranslated sequence regions for Prime to search. We’ll use
Prime’s –Begin1, –End1, –Begin2, –End2 and –Include options to restrict our primers to the predefined target
regions. However, remember that there is a three to one numbering discrepancy between the DNA and
protein sequence. Be sure that your backtranslated sequence is loaded and that it is the only one selected in
the SeqLab editor, and then select the overall range within it for your target product with the “Edit” “Select
Range” function. In other words, choose that range delineated by your 5’ and 3’ most target locations
identified in your PlotSimilarity notes, times three to compensate for the numbering discrepancy. In my
example’s case that’s from base number “63” through “780.” Launch “Prime” through the “Functions” or
“Windows” menu. Specify “Selected region” rather than “Selected sequence” when prompted by the
“Which selection” box. You may want to specify a slightly longer “Primer Length” than the 18 through 22
default, though it isn’t necessary. I changed this parameter to a “Minimum” of “20” and a “Maximum” of “50”
to help take into account potential mismatches introduced in the backtranslation step. Also, set “Maximum
product length” to the length of the region selected on your backtranslated sequence. This will be the
maximum value displayed, in my case “718.” Save an RSF file to add annotation to your existing prion dataset
by checking “Save results as features in file.”
Choose “Options” next; note that lots and lots of options are available. This makes the program very flexible
and very powerful. Check “Specify PCR target range.” This activates the –Begin2 and –End2 command line
options. The “PCR target starting” and “ending position” parameters identify those positions. Therefore,
specify the 3’ end of your upstream candidate region and the 5’ end of your downstream candidate region
respectively, again considering the three to one numbering discrepancy. The “starting position” is “153” and
the “ending position” is “630” in my example. Also be sure that “Minimum % of specified PCR target
range to be included in product” is “100.0.” This effectively brackets your desired product forcing all primer
searching to be performed within the desired primer binding regions. Just below that section in the “Prime
Options” window, check “Save primers found to a pattern file,” the –FoundPrimers option, and designate
an appropriate output data file name that makes sense to you (e.g. “prion.prime.dat”). This option saves
the primers in a special GCG pattern data format file, which can be read by Prime and other programs, as well
as in the standard text output. If you are looking for only the one best hybridization probe rather than paired
primers, the –ForwardPrimers option is obviously necessary. Accept the rest of the program defaults; “Close”
the “Prime Options” window and press “Run” in the Prime program window. Your first pass may not find
anything. Mine didn’t. You’ll be able to easily tell because your “.rsf” and “.dat” files will be empty. This is
because the default experimental conditions set by Prime are very restrictive. Often you will have to change
many of these parameters in subsequent program runs to find any primers at all.
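The coordinate arithmetic behind my example’s settings can be checked with a few lines of Python. The variable and function names are hypothetical; the residue ranges are the ones I wrote down after PlotSimilarity. Note that this follows the simple “times three” rule used in this tutorial; strictly speaking, residue n occupies bases 3n-2 through 3n.

```python
def residue_to_base(residue_number):
    """Apply the tutorial's simple 'times three' rule to a residue number."""
    return residue_number * 3


upstream = (21, 51)      # residues chosen for the upstream primer region
downstream = (210, 260)  # residues chosen for the downstream primer region

# Overall range selected in the editor ("Select Range").
select_begin = residue_to_base(upstream[0])
select_end = residue_to_base(downstream[1])

# PCR target range: 3' end of upstream region to 5' end of downstream region.
target_begin = residue_to_base(upstream[1])
target_end = residue_to_base(downstream[0])

max_product_length = select_end - select_begin + 1
min_product_length = target_end - target_begin + 1

print("selected range:", select_begin, "-", select_end)
print("PCR target range:", target_begin, "-", target_end)
print("product length:", min_product_length, "-", max_product_length)
```

The two product lengths are the bounds any acceptable product must fall between when 100 percent of the target range has to be included.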
As is often the case, Prime did not find any primers on my first pass. Prime can sometimes be quite
frustrating to run because of these stringent parameters. However, it’s best to start with the very stringent
default conditions and slowly relax them, rather than going the other way round, though Prime does have an
option, “Ignore . . . constraints,” that does allow you to do that, if you become too frustrated. Use the “Output
Manager” to select and “Display” the file that ends with the “.prime” extension. This text file describes the
conditions used in the run and lists acceptable primers, if there were any, with their corresponding melting
temperatures. The “.prime” file also points out exactly which parameters prevented success. Therefore, if
no primers were discovered, either repeat the Prime run with different, more permissive, parameters, or
choose different and/or longer target sections on the consensus sequence. You will have to experiment with
changing these parameters to discover the combination that works. You may be forced to rerun the program
a number of times adjusting parameters and the regions searched until you are successful. This can be
frustrating — just persevere. Use the same data file output name in subsequent runs so that you end up with
only the one successful set of universal primers.
Based on our “.prime” report we can see that the parameter that most prevented success in this case is GC
content. Therefore, repeat the run using a less stringent GC content (or whatever parameters you are having
troubles with). Sometimes it will take many passes through the program adjusting different parameters each
time in order to finally get something acceptable. Play with the options to find the best primers in the
backtranslated sequence. The “Windows” menu contains a ‘shortcut’ listing of all programs used in the
current session; you can launch any of them from there as well as from the “Functions” menu. Relaunch
“Prime” through the “Windows” menu and make the suggested parameter changes. From the “.prime” file
and the “Prime Options” window we can see that default GC content is required to be between 40 and 50
percent whereas our backtranslated sequence appears to be quite a bit higher than that. The GCG program
Composition can give you an exact count of nucleotide content if you need it. Therefore, I will increase my
–GCMaxPrimer parameter by increasing “Primer % G+C” “Maximum” to “75” and see what happens. It makes
sense to allow the same GC content in the product, so change the “Product % G+C” “Maximum,”
–GCMaxProduct, to “75” as well. “Run” the program again after you get all your settings specified. As
mentioned above, an alternative option is to turn most of the constraints off by selecting the button next to
“Ignore most of the constraints set by default . . .” and working your way toward more restrictive conditions
rather than the other way around.
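If you don’t have Composition handy, a rough GC check is easy to sketch in Python. This is an illustration only, with hypothetical function names; the test sequence is the first 50 bases of my backtranslated prion sequence, and the two windows are the 40 to 50 percent range quoted above and the relaxed 40 to 75 percent range.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def within(value, minimum, maximum):
    """True if value lies inside the inclusive [minimum, maximum] window."""
    return minimum <= value <= maximum


first_50 = "ATGGCCAACCTGGGCTGCTGGATGCTGGTGCTGTTCGTGGCCACCTGGAG"
gc = gc_percent(first_50)
print(f"GC content: {gc:.1f}%")
print("passes 40-50% window:", within(gc, 40.0, 50.0))
print("passes 40-75% window:", within(gc, 40.0, 75.0))
```

A sequence built from the most probable human codons is GC rich almost by construction, which is why the stricter window rejects so many candidate primers here.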
As frustrating as Prime can be, it certainly can point out the exact conditions that must be altered from
standard PCR reactions in order to have any success in the wet lab. In my case, rerunning the program with
“GCMaxPrimer” and “GCMaxProduct” set to “75” did the trick. Whether this is a totally impossible PCR
condition is not indicated by the program, so do not blindly accept the results! This may all seem like a
genuine pain just to get a couple of primers for PCR; however, realize that successful primers found in this
manner will most likely work with all similar organisms for this particular gene. You will not have to repeat the
experience until you are given a totally different system on which to work.
When the program successfully finishes and the output is displayed, check out the various files. The “.dat”
data file lists the primers in a special ‘pattern’ format that can be used in subsequent Prime runs and in other
GCG programs such as FindPatterns. The primer locations are noted as comments in the data file; look it
over and then “Close” its window. My example follows below:
!!PATTERNS 1.0
This file contains possible primers for the template sequence:
/home/thompson/.seqlab-mendel/input_17.rsf{prion.backtranslated}
..
forward1   1 GCAACCGCTACCCCCCCCAG        ! 75 -> 94
forward2   1 CCAAGCCCGGCGGCTGGAAC        ! 15 -> 34
forward3   1 AAGCCCGGCGGCTGGAACAC        ! 17 -> 36
forward4   1 TGCAAGAAGCGCCCCAAGCC        ! 2 -> 21
forward5   1 GCAAGAAGCGCCCCAAGCCC        ! 3 -> 22
forward6   1 CAACCGCTACCCCCCCCAGG        ! 76 -> 95
reverse1   1 GCTCCACCACGCGCTCCATC        ! 674 -> 655
reverse2   1 CTCCACCACGCGCTCCATCATC      ! 673 -> 652
reverse3   1 CCACCACGCGCTCCATCATCTTC     ! 671 -> 649
reverse4   1 ACCACGCGCTCCATCATCTTCAC     ! 669 -> 647
reverse5   1 ACGCGCTCCATCATCTTCACGTC     ! 666 -> 644
reverse6   1 CCACCACGCGCTCCATCATCTTCAC   ! 671 -> 647
reverse7   1 CACCACGCGCTCCATCATCTTCAC    ! 670 -> 647
reverse8   1 CGCTCCATCATCTTCACGTCGGTCTC  ! 663 -> 638
reverse9   1 TCTGCTCCACCACGCGCTCC        ! 677 -> 658
reverse10  1 TCCACCACGCGCTCCATCATC       ! 672 -> 652
reverse11  1 GCTCCATCATCTTCACGTCGGTCTC   ! 662 -> 638
reverse12  1 TGCTCCACCACGCGCTCCATC       ! 675 -> 655
reverse13  1 GCTCCACCACGCGCTCCATCATC     ! 674 -> 652
reverse14  1 ACATCTGCTCCACCACGCGCTC      ! 680 -> 659
reverse15  1 CATCTGCTCCACCACGCGCTC       ! 679 -> 659
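Notice that the positions in the pattern file comments are counted from the start of the selected region (base 63 in my case), whereas the “.prime” report numbers the same primers along the whole backtranslated sequence. A small Python sketch of the conversion follows; the names are hypothetical, and the template fragment is bases 101 through 200 of my backtranslated sequence as listed earlier.

```python
REGION_BEGIN = 63  # first base of the region selected for the Prime run


def to_template(position, region_begin=REGION_BEGIN):
    """Convert a region-relative position to a whole-sequence position."""
    return position + region_begin - 1


# Bases 101-200 of the backtranslated prion sequence.
FRAGMENT_START = 101
fragment = ("GCGGCAGCCGCTACCCCGGCCAGGGCAGCCCCGGCGGCAACCGCTACCCC"
            "CCCCAGGGCGGCGGCGGCTGGGGCCAGCCCCACGGCGGCGGCTGGGGCCA")

forward1 = "GCAACCGCTACCCCCCCCAG"
begin, end = to_template(75), to_template(94)

# Slice the fragment at the converted coordinates (1-based, inclusive).
site = fragment[begin - FRAGMENT_START:end - FRAGMENT_START + 1]
print(begin, end, site == forward1)

# The same offset applies to reverse primers, whose positions run high to low.
print(to_template(674), to_template(655))
```

The converted forward1 and reverse1 endpoints can be compared with the positions reported for Product 1 in the “.prime” file.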
“Close” the RSF file display window; we’ll be using that file just below, but it’s not much to read. The abridged
“.prime” results from my successful run showing my set of primate prion primers (say that six times, real
fast) follow below. The primers are ranked in terms of an annealing score with smaller numbers being better
and the best primers at the top. The first three pairs shown here are all equally good. Read the Prime
program “Help” upon a subsequent program run, if you are interested in how this function is calculated. I’ve
indicated those parameters changed from their defaults with bold type in the following abridged screen trace:
PRIME of: input_17.rsf{prion.backtranslated} ck: 8405 from: 63 to: 780
October 13, 2006 22:44

INPUT SUMMARY
-------------
Input sequence: /home/thompson/.seqlab-mendel/input_17.rsf{prion.backtranslated}

Primer constraints:
  primer size: 20 - 50
  primer 3' clamp: S (although this is often turned off in real experiments!)
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 75.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified

Product constraints:
  product length: 478 - 718
  product GC content: 40.0 - 75.0%
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 153 - 630
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
PRIMER SUMMARY
--------------
                                    forward   reverse
Number of primers considered:          2643      1384

Number of primers rejected for . . .
  primer 3' clamp:                       19        78
  primer sequence ambiguity:              0         0
  primer GC content:                   2478         0
  primer Tm:                            120       767
  non-unique binding sites:               0         0
  primer self-annealing:                  7        65
  primer-template annealing:              0         0
  primer-repeat annealing:                0         0

Number of primers accepted:              19       474

PRODUCT SUMMARY
---------------
Number of products considered:         9006

Number of products rejected for. . .
  product length:                        61
  product GC content:                   283
  product Tm:                             0
  product position:                    1915
  duplicate primer endpoints:          2015
  difference in primer Tm:             3419
  primer-primer annealing:             1130

Number of products accepted:            183
Number of products saved:                25

Maximum overlap between products:    718 bp
THE FOLLOWING PRODUCTS ARE SORTED BY THEIR ANNEALING SCORE
-------------------------------------------------------------------------------
Product: 1
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                              5'                        3'
forward primer (20-mer):  137 GCAACCGCTACCCCCCCCAG 156
reverse primer (20-mer):  736 GCTCCACCACGCGCTCCATC 717

                               forward   reverse
primer %GC:                       75.0      70.0
primer Tm (degrees Celsius):      61.5      60.5

PRODUCT
-------
product length: 600
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.1 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.3 degrees Celsius
-------------------------------------------------------------------------------
Product: 2
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                              5'                          3'
forward primer (20-mer):  137 GCAACCGCTACCCCCCCCAG 156
reverse primer (22-mer):  735 CTCCACCACGCGCTCCATCATC 714

                               forward   reverse
primer %GC:                       75.0      63.6
primer Tm (degrees Celsius):      61.5      59.9

PRODUCT
-------
product length: 599
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.6 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.2 degrees Celsius
-------------------------------------------------------------------------------
Product: 3
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                              5'                           3'
forward primer (20-mer):  137 GCAACCGCTACCCCCCCCAG 156
reverse primer (23-mer):  733 CCACCACGCGCTCCATCATCTTC 711

                               forward   reverse
primer %GC:                       75.0      60.9
primer Tm (degrees Celsius):      61.5      60.2

PRODUCT
-------
product length: 597
product %GC: 73.0
product Tm: 88.7 degrees Celsius
difference in primer Tm: 1.3 degrees Celsius
annealing score: 53
optimal annealing temperature: 65.3 degrees Celsius
/////////////////////////////////////////////////////////////////////////////
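Two of the numbers in each product block are easy to check by hand, which is a useful sanity test when you retype primers for an order form. The following is a minimal Python sketch (not part of GCG; the function names are my own, hypothetical ones) that recomputes primer %GC and the product length from the primer 5' positions. It does not attempt to reproduce Prime's Tm or annealing score calculations.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in a primer, rounded to one decimal place."""
    primer = primer.upper()
    gc = sum(base in "GC" for base in primer)
    return round(100.0 * gc / len(primer), 1)


def product_length(forward_5prime: int, reverse_5prime: int) -> int:
    """Inclusive distance between the 5' ends of the forward and reverse primers."""
    return reverse_5prime - forward_5prime + 1


# Product 1 from the prion run shown above.
forward = "GCAACCGCTACCCCCCCCAG"  # template positions 137 -> 156
reverse = "GCTCCACCACGCGCTCCATC"  # template positions 736 -> 717

print(gc_percent(forward), gc_percent(reverse))  # compare with "primer %GC"
print(product_length(137, 736))                  # compare with "product length"
```

For Product 1 this gives 75.0 and 70.0 for the primer %GC values and 600 for the product length, in agreement with the Prime report.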
A graphics “.figure” window shows where the primers anneal to your sequence schematically. Blue tick
marks indicate forward primers, red reverse ones. The graphic from the above run is shown below on the
next page:
“Close” the graphics window. Be sure to “Add to Editor” the “prime.rsf” file displayed in the “Output
Manager.” Choose “Overwrite old with new” in the “Reloading Same Sequence” window that pops up.
This will merge the new feature annotation that locates the successful primers onto your existing RSF file.
“Close” the “Output Manager” window after loading your new feature annotation.
Take a look at this new feature information by changing your “Display:” to the “Graphic Features” cartoon
representation, and zoom out to “4:1” so that most of the entire sequence can be seen at once. The products
appear as orange arches, and the primers appear as upstream green, and downstream red, diamonds.
Double-click on one of the new features and then select that entry in the “Feature” window to see a
description. It should look something like the graphic displayed below, where an upstream prion primer is
described against a backdrop of the editor window:
An alternative primer design approach is to individually isolate forward and reverse primers separately. Use
the “Select: forward” and “reverse primers, only” options to do this. You may be able to ‘zero-in’ on regions a
bit more specifically with this approach. If you do design primers with this alternative forward only and reverse
only method, you’ll need to test the pairs together with each other in a subsequent program run to be sure that
they won’t anneal with each other so badly as to interfere with the reaction. The Prime program also allows
you to do this, and to test any other primers desired, by specifying an input data file of primers at run-time.
The output “.dat” file that we wrote out above is one of these data files. That way, rather than discovering
the best primers within a specified template, the program tests and ranks all the primers fed to it against the
specified template. Another GCG program, PrimePair, can test a data file of input primers against themselves
in the absence of a template. One restriction of both programs is that they will not tolerate mismatches or
ambiguities in their primers or in those sites where they anneal, so all ambiguities must be taken out of the
primers and template annealing regions to be tested.
g) Will your primers only ‘find’ the correct genes?
Candidate primers need to pass one more test before being synthesized. You should check your primers’
specificity to ensure that they will not hybridize to a completely wrong type of sequence by checking
them against the DNA database. This step can also point out, and allow you to correct if necessary, errors in
your primer sequence created in the backtranslation step, if enough DNA sequences are available in the
database to allow a comparison. The GCG program FindPatterns is probably best for this. It can be used to
screen your candidate primers against the entire DNA database or any other GCG sequences desired. There
are several advantages to using FindPatterns over standard similarity searching software such as BLAST or
FastA: 1) you can test more than one primer at a time against as many sequences as you want; 2) the
algorithm will not allow any gapping of your primers to the template, which would represent loop structures in
the hybrid and should not be allowed; 3) similarities don’t count — identities are required but mismatches are
allowed by option; again, just what you want in primer analysis; and 4) word size parameters are not relevant
since the algorithm doesn’t use them, therefore, they don’t have to be messed with (which you would need to
do if you were using heuristic style similarity searches such as BLAST since they are not designed to find short
regions of DNA similarity). For these reasons, I do not recommend using BLAST or FastA style searches for
testing primer specificity. The easiest way to run FindPatterns is to provide it with your primers as an input file
rather than typing them in interactively. To do this FindPatterns needs its input file of patterns to be in exactly
the same pattern data format as Prime can produce with its “Save primers found to a pattern file” –
FoundPrimers option. This makes it relatively easy to test them against the database.
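To make the contrast with BLAST or FastA concrete, here is a minimal Python sketch of the kind of search described above: ungapped, identity-based, with a fixed number of mismatches allowed, run against both strands. It is an illustration only, not GCG's FindPatterns code; the function names are hypothetical and it handles unambiguous A, C, G, T patterns only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_with_mismatches(pattern, target, max_mismatches):
    """Yield (1-based start, mismatch count) for every ungapped placement of
    pattern on target that has at most max_mismatches non-identical bases."""
    pattern, target = pattern.upper(), target.upper()
    for start in range(len(target) - len(pattern) + 1):
        window = target[start:start + len(pattern)]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatches:
            yield start + 1, mismatches


def search_both_strands(primer, target, max_mismatches):
    """Search the top strand with the primer and with its reverse complement."""
    hits = [("forward", pos, mis)
            for pos, mis in find_with_mismatches(primer, target, max_mismatches)]
    rc = reverse_complement(primer)
    hits += [("reverse", pos, mis)
             for pos, mis in find_with_mismatches(rc, target, max_mismatches)]
    return hits


# A 25-base pattern against a short stretch of target, allowing 2 mismatches.
pattern = "CAGTACAGCAACCAGAACAACTTCG"
target = "TGGATcagtacaacaaccagaacaactttgTGCAC"
print(list(find_with_mismatches(pattern, target, 2)))
```

Because every placement is scored base by base, no gaps can be introduced and there is no word size to tune; the price is that the running time grows with the pattern length times the database length.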
However, running a full-blown FindPatterns GenBank search would require too much time for you to see the
results in the time constraints of our tutorial. Therefore, I have already run this search and am providing that
output file for you; it’s the one that you initially fetched when you began the exercise. If you would like to run
this type of analysis for your own research, it is very important to use appropriate parameters! You need to
specify the correct pattern data file, a realistic mismatch level, and the –Batch option. Give a mismatch level
of slightly less than 20% of the length of your shortest primer sequence. The less than 20% mismatch cut-off
level is a ‘rule-of-thumb’ because that is the number of expected mismatches if all codon choices were made
on a completely random basis. In the example that I am providing I used a mismatch level of about 10%. If
running FindPatterns from the command line rather than from SeqLab, the program will ask you which
sequences you want to find your pattern in. These are not your primer sequences; these are the sequences
you want to search your primer patterns against. Therefore, answer with either all of GenBank or the
appropriate subdivision of GenBank. Since I am trying to find prions in aye-ayes, the primate portion of
GenBank is most relevant and I used gb_pr:*, which means that I want to search all of the sequences in the
primate subdivision of GenBank. (See Workshop #1 and the GenHelp User’s Guide chapter Using Sequences,
topic Using Database Sequences, subtopic Nucleic Acid Database tables, if this still confuses you.) Do not
run FindPatterns today against GenBank — just scroll through and note the types of sequences that were
found by my example run. Temporarily switch to the terminal window behind the SeqLab window and use the
UNIX “more” utility to do this. Press the <space bar>, not the return key, to go from one page to the next.
The abridged output file, “primer-tutorial.prion.finds” follows below on the next few pages:
> more primer-tutorial.prion.finds
! FINDPATTERNS on gb_pr:* allowing 2 mismatches
! Using patterns from: primer-tutorial.prion.dat    October 14, 2006 16:57 ..
APU15164 ck: 8218 len: 759 ! U15164 Ateles paniscus x Ateles fusciceps major prion protein precursor ge
PrPrR1 /Rev
CAGTACAGCAACCAGAACAACTTCG
499: TGGAT cagtacaacaaccagaacaactttg TGCAC mis=2
PrPrR2 /Rev
CAGTACAGCAACCAGAACAACTTCGT
499: TGGAT cagtacaacaaccagaacaactttgt GCACG mis=2
PrPrR3 /Rev
CAGTACAGCAACCAGAACAACTTCGTG
499: TGGAT cagtacaacaaccagaacaactttgtg CACGA mis=2
PrPrR4 /Rev
CAGTACAGCAACCAGAACAACTTCGTGC
499: TGGAT cagtacaacaaccagaacaactttgtgc ACGAC mis=2
PrPrR5 /Rev
CAGTACAGCAACCAGAACAACTTCGTGCA
499: TGGAT cagtacaacaaccagaacaactttgtgca CGACT mis=2
PrPrR10 /Rev
GAGAACATGTACCGCTACCCCAACCA
451: ATCGT gaaaacatgtaccgttaccccaacca AGTAT mis=2
GGU15166 ck: 5814 len: 762 ! U15166 Gorilla gorilla major prion protein precursor gene, complete cds. 6
PrPrF1
GCCTGTGCAAGAAGCGCCCCAAGCC
59: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2
PrPrF4
GGCCTGTGCAAGAAGCGCCCCAAGC
58: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrF5
GGGCCTGTGCAAGAAGCGCCCCAAG
57: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2
PrPrF8
GGGCCTGTGCAAGAAGCGCCCCAAGC
57: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrR1 /Rev
CAGTACAGCAACCAGAACAACTTCG
502: TGGAT cagtacagcaaccagaacaactttg TGCAC mis=1
PrPrR2 /Rev
CAGTACAGCAACCAGAACAACTTCGT
502: TGGAT cagtacagcaaccagaacaactttgt GCACG mis=1
PrPrR3 /Rev
CAGTACAGCAACCAGAACAACTTCGTG
502: TGGAT cagtacagcaaccagaacaactttgtg CACGA mis=1
PrPrR4 /Rev
CAGTACAGCAACCAGAACAACTTCGTGC
502: TGGAT cagtacagcaaccagaacaactttgtgc ACGAC mis=1
PrPrR5 /Rev
CAGTACAGCAACCAGAACAACTTCGTGCA
502: TGGAT cagtacagcaaccagaacaactttgtgca CGACT mis=1
HSPRP2 ck: 7852 len: 2,301 ! X83416 H.sapiens PrP gene, exon 2. 1/96
PrPrF1
GCCTGTGCAAGAAGCGCCCCAAGCC
77: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2
PrPrF4
GGCCTGTGCAAGAAGCGCCCCAAGC
76: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrF5
GGGCCTGTGCAAGAAGCGCCCCAAG
75: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2
PrPrF8
GGGCCTGTGCAAGAAGCGCCCCAAGC
75: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrR1 /Rev
CAGTACAGCAACCAGAACAACTTCG
496: TGGAT gagtacagcaaccagaacaactttg TGCAC mis=2
PrPrR2 /Rev
CAGTACAGCAACCAGAACAACTTCGT
496: TGGAT gagtacagcaaccagaacaactttgt GCACG mis=2
PrPrR3 /Rev
CAGTACAGCAACCAGAACAACTTCGTG
496: TGGAT gagtacagcaaccagaacaactttgtg CACGA mis=2
PrPrR4 /Rev
CAGTACAGCAACCAGAACAACTTCGTGC
496: TGGAT gagtacagcaaccagaacaactttgtgc ACGAC mis=2
PrPrR5 /Rev
CAGTACAGCAACCAGAACAACTTCGTGCA
496: TGGAT gagtacagcaaccagaacaactttgtgca CGACT mis=2
////////////////////////////////////////////////////////////////////////////////
SSU08308 ck: 4959 len: 762 ! U08308 Symphalangus syndactylus prion protein gene, complete cds. 2/95
PrPrF1
GCCTGTGCAAGAAGCGCCCCAAGCC
59: CCTGG gcctctgcaagaagcgcccgaagcc TGGAG mis=2
PrPrF4
GGCCTGTGCAAGAAGCGCCCCAAGC
58: ACCTG ggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrF5
GGGCCTGTGCAAGAAGCGCCCCAAG
57: GACCT gggcctctgcaagaagcgcccgaag CCTGG mis=2
PrPrF8
GGGCCTGTGCAAGAAGCGCCCCAAGC
57: GACCT gggcctctgcaagaagcgcccgaagc CTGGA mis=2
PrPrR1 /Rev
CAGTACAGCAACCAGAACAACTTCG
502: TGGAT cagtacagcagccagaacaactttg TGCAC mis=2
PrPrR2 /Rev
CAGTACAGCAACCAGAACAACTTCGT
502: TGGAT cagtacagcagccagaacaactttgt GCACG mis=2
PrPrR3 /Rev
CAGTACAGCAACCAGAACAACTTCGTG
502: TGGAT cagtacagcagccagaacaactttgtg CACGA mis=2
PrPrR4 /Rev
CAGTACAGCAACCAGAACAACTTCGTGC
502: TGGAT cagtacagcagccagaacaactttgtgc ACGAC mis=2
PrPrR5 /Rev
CAGTACAGCAACCAGAACAACTTCGTGCA
502: TGGAT cagtacagcagccagaacaactttgtgca CGACT mis=2
PrPrR10 /Rev
GAGAACATGTACCGCTACCCCAACCA
454: ATCGT gaaaacatgcaccgctaccccaacca AGTGT mis=2
SSU08310 ck: 136 len: 783 ! U08310 Saimiri sciureus prion protein gene, complete cds. 2/95
PrPrR1 /Rev
CAGTACAGCAACCAGAACAACTTCG
523: TGGAT cagtacagcaaccagaacaactttg TGCAC mis=1
PrPrR2 /Rev
CAGTACAGCAACCAGAACAACTTCGT
523: TGGAT cagtacagcaaccagaacaactttgt GCACG mis=1
PrPrR3 /Rev
CAGTACAGCAACCAGAACAACTTCGTG
523: TGGAT cagtacagcaaccagaacaactttgtg CACGA mis=1
PrPrR4 /Rev
CAGTACAGCAACCAGAACAACTTCGTGC
523: TGGAT cagtacagcaaccagaacaactttgtgc ACGAC mis=1
PrPrR5 /Rev
CAGTACAGCAACCAGAACAACTTCGTGCA
523: TGGAT cagtacagcaaccagaacaactttgtgca CGACT mis=1
Databases searched:
GenBank, Release 153.0, Released on 14Dec2006, Formatted on 15Dec2006
Total finds: 395
Total length: 320,531,468
Total sequences: 91,570
CPU time: 3:16:52.39
Only prion sequences were found by our new primers — excellent. An example FindPatterns command line
run is shown below. Notice the GCG –check command line ‘super-option’ that lists all of the available options
within a program and gives you a chance to use any of them. Remember, do not run FindPatterns on
GenBank here today:
> findpatterns -check
FindPatterns identifies sequences that contain short patterns like
GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow
mismatches. You can provide the patterns in a file or simply type them
in from the terminal.
Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default

Prompted Parameters:

-PATterns=GAATTC,RGGAY         patterns to be found
[-OUTfile=]findpatterns.find   the output file name

Local Data Files:

-DATa=pattern.dat    a file with a set of patterns

Optional Parameters:

-MISmatch=1       allows mismatches in the search for your subsequence
-NAMes            makes an output file in "file of filenames" format
-ONEstrand        searches only the top strand of nucleotide sequences
-SIXbase          searches only for patterns with six or more symbols
-CIRcular         searches all sequences as if they were circular

Press q to quit or <Return> for more: <rtn>

-ALL              does an "overlapping-set" search in nucleotide sequences
-PERFect          looks only for perfect matches
-APPend           appends the pattern data file to the output file
-SHOw             shows every file searched even if there are no finds
-TERminal         writes output to the terminal screen instead of a file
-NOMONitor        suppresses the screen trace showing each file
-ONCe             limits finds to patterns found a maximum of 1 time
-MINCuts=1        limits finds to patterns found a minimum of 1 time
-MAXCuts=3        limits finds to patterns found a maximum of 3 times
-EXCLude=n1,n2    excludes patterns found between positions n1 and n2
-SINce=6.90       limits search to sequences dated on or after June 1990
-BATch            Submits the program to run in the batch queue

Add what to the command line ? -data=primer.dat -mismatch=2 -batch

FINDPATTERNS in what sequence(s) ? gb_pr:*

What should I call the output file (* findpatterns.find *) ? prion.finds

** findpatterns will run as a batch or at job.
** findpatterns was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848528595.b at Wed Nov 20 14:23:15 1996
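The value given to the mismatch option follows from the rule of thumb stated earlier: slightly less than 20% of the length of the shortest primer. As a small Python sketch, here is one way to turn that rule into an upper bound. Reading "slightly less than" as "the largest whole number strictly below" is my assumption, and the function name is hypothetical; a lower value, such as the roughly 10% used in this example run, is a stricter choice under the same ceiling.

```python
def max_mismatches(primers, percent=20):
    """Largest whole number of mismatches strictly below `percent` percent
    of the shortest primer's length (integer arithmetic, never negative)."""
    shortest = min(len(p) for p in primers)
    return max((shortest * percent - 1) // 100, 0)


# Two of the prion primers from the Prime run: a 20-mer and a 23-mer.
primers = ["GCAACCGCTACCCCCCCCAG", "CCACCACGCGCTCCATCATCTTC"]
print(max_mismatches(primers))  # upper bound for the mismatch option
```

For a shortest primer of 20 bases the bound is 3 mismatches, so the 2 mismatches allowed in the example run sit comfortably below it.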
This is the conclusion of the main part of today’s computer laboratory. You can either log out or leave the
computers as they are with SeqLab and the terminal active. If you don’t log out, I’ll go around and log you out
and clean up any loose ends. I hope that working through today’s tutorial has shown you what a tremendous
help computational technology can be in this area. Obviously the same general
ideas taught here can be tailored to any particular system for the design of primers to any level of specificity.
For more help in this area and for personal sequence analysis consultation, contact me for more information:
Steve Thompson, stevet@bio.fsu.edu.
2) More practice: universal and strain specific primer design from a pre-built DNA alignment
I doubt whether you’ll have time to do this section of the tutorial during our allotted two hours, but I encourage
you to log back on to the GCG server at some point in the future and work through the following example
yourself. As always, I am available for any personalized help you may need.
As with the first portion of the tutorial, I have placed some files in a publicly accessible GCG directory to make
this portion of the tutorial less tedious. After logging in to the GCG server and launching a terminal window,
issue the following command to copy those files into your account.
> fetch primer-tutorial.L1.*
Next list your directory (ls) using the long form option (-l) on the new files to see what and how big they are:
> ls -l primer-tutorial.*
-rw-rw-rw-  1 stevet  gcg    2113 Jun 11 18:48 primer-tutorial.L1.dat
-rw-r--r--  1 stevet  gcg  170858 Jun 17 10:45 primer-tutorial.L1.finds
-rw-r--r--  1 stevet  gcg  623104 Jun 13 17:47 primer-tutorial.L1.rsf
-rw-r--r--  1 stevet  gcg   49285 Jun 22 20:14 primer-tutorial.prion.finds
Next launch SeqLab with the standard command:
> seqlab &
First I want you to see what the HPV L1 alignment looks like. Be sure the “Mode:” “Main List” choice is
selected in your main window and then go to the “File” menu. Pick “Add sequences from” and select
“Sequence Files.” (GCG format compatible sequences or list files are accessible through this route. Use
SeqLab’s Editor “Import” function to directly load GenBank or ABI/SCF trace format sequences without the
need to reformat.) This will produce an “Add Sequences” window from which you can select sequences to
add to your working.list. The “Filter” box is very important here! You can control which files are displayed by
choosing an appropriate text string. Since we want to use the alignment I already prepared and you fetched in
the previous step, put the extension “.rsf” in the “Filter” box (including the period); be sure to leave the “*”
wildcard. Press the “Filter” button to display all of the RSF files in your working directory. Select the file entitled
“primer-tutorial.L1.rsf” from the “Files” box, and then click the “Add” and then the “Close” buttons at the
bottom of the window, to put the file in your “working.list.” It will appear in the SeqLab “Main List”
window. Be sure it is selected and switch to “Editor” “Mode:” to load the RSF file into the SeqLab editor.
Notice that all of the sequences now appear in the editor window with the bases and residues color-coded.
Any portion of or all of the sequences loaded are now available for analysis by any of the GCG programs.
Expand the window full-screen. The display will look something like the following graphic:
Each DNA sequence is shown together with its proper amino acid translation directly below. The nucleotide
sequences are listed by their official GenBank entry name (LOCUS identifier). The protein sequences are not
annotated because they did not come from the database; they were translated from their corresponding DNA
sequences.
Select every DNA sequence in the alignment by <Ctrl> clicking each entry name that does not have the word
frame within it; also do not select the last consensus sequence, cons25pct. Normally this <Ctrl> click is done
with the left mouse button; however, a bug in the Linux version of SeqLab switched this function to the right
mouse button. Be sure to scroll up and down through the entire alignment. Release the <Ctrl> key to scroll,
and then press <Ctrl> again to select the new screen of DNA entries as you go. The SeqLab main window will
now look similar to the graphic on the following page, with every other sequence selected:
Now that all of the DNA entries are highlighted (and only the DNA entries, but not the consensus sequence at
the bottom), run PlotSimilarity on this DNA sequence alignment in the same manner as you did on the prion
protein dataset. That is, go to the SeqLab “Functions” menu; select “Multiple Comparison” and then
“PlotSimilarity.” Then choose “Options . . .” and check “Save SeqLab colormask to” and “Scale the plot
between:” the “minimum and maximum values calculated from the alignment.” “Close” the
“PlotSimilarity Options” window. Click the “Run” box to launch the program. The output will quickly return.
“Close” the plotsimilarity.cmask display and the “Output Manager” and then take a look at the similarity
plot. My example follows below:
“Close” the PlotSimilarity graphics window after you’ve checked it out. As before, go up to the SeqLab “File”
menu; select “Open Color Mask Files.” Select the file displayed in the dialog box, “plotsimilarity.cmask;”
click “Add” and then “Close.” As with the prion dataset, identify those regions of the alignment that will be
most appropriate for designing universal primers — areas of high conservation — but now we’re dealing
directly with DNA.
Also note areas of low conservation; these are candidate regions for strain specific
primers. Take some notes of the general areas that appear promising to you. For the purpose of this
exercise try to identify two of the more highly conserved regions that flank the longest stretch of L1 possible.
Also note where the deepest and furthest separated valleys are. Try to get regions that are at least 100 or so
bases long. I’ll show a screen dump graphic of the first upstream conserved region from around column 130
to 230 that I happened to pick below:
a) Design ‘universal’ primers
Go to the very bottom of the alignment. Select the sequence labeled “cons25pct” (only). This is a consensus
sequence generated from the DNA alignment above it at the 25 percent agreement level that I produced
through the “Consensus” tool of the “Edit” menu.
The other DNA sequences should now no longer be
selected. Scroll through the consensus sequence until you find the upstream highly conserved area that you
noted above. Determine the exact base that you want your selection to begin and end at by selecting the
candidate bases and noting their “col:” positions in the lower left-hand corner of the display. Remember, we
want stretches about 100 bases long.
Write these numbers down. Repeat this procedure with your 3’
candidate region. Now go up to the “Edit” menu and click on “Select Range.” This will produce a dialog box
in which you can type the desired range; enter the overall length that you want to deal with, just like before,
that is, the 5’ most base through the 3’ most base noted above. After typing in the numbers punch “Select”
and then “Close” the box. Notice that the range specified is now highlighted on the display.
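If you would like a numerical cross-check on the conserved regions you picked by eye from the similarity plot, the idea can be sketched in a few lines of Python. This is a simplified stand-in, not PlotSimilarity's actual scoring: it assumes the alignment is available as equal-length strings, scores each column by the fraction of sequences sharing the most common symbol (gaps count as a symbol), and averages that score over a sliding window. All names are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common symbol in each column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores.append(counts.most_common(1)[0][1] / n)
    return scores


def window_averages(scores, window):
    """Average conservation for every window start (0-based)."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def best_window(alignment, window):
    """Return (first column, last column, average score), 1-based, for the
    most conserved window; ties go to the leftmost window."""
    averages = window_averages(column_conservation(alignment), window)
    start = max(range(len(averages)), key=averages.__getitem__)
    return start + 1, start + window, averages[start]


# Toy alignment; a real run would use the aligned L1 DNA sequences.
toy = ["ACGTAC", "ACGTTT", "ACGAGG"]
print(best_window(toy, 3))
```

With a window of about 100 columns, the highest-scoring windows should correspond to the peaks you noted on the plot, and the lowest-scoring ones to the valleys that are candidates for strain specific primers.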
We’ll use Prime to again locate the best forward and reverse primers within the defined 5’ and 3’ sequence
candidate regions identified. As before, the primer target regions within this delineation are specified with the
–Begin2=, –End2=, and –Include= options to force primer discovery within the conservation peaks identified.
As before, launch Prime from the “Functions” menu; choose “Primer Selection” and then “Prime.” This
should produce a dialog box asking whether you want to analyze the selected sequences or the selected
region; choose “Selected region.” A Prime program box will display. If you haven’t logged out since doing
the prion exercise, press the “GCG Defaults” button in the “Prime Program Window” to reset all the
parameters from what you ended up with then. Regardless, adjust the “Maximum PCR Product Length” up
to the maximum allowed, the full length of your selected region. Next, check “Save results as features in file
prime.rsf” to add any primer locations found to your current RSF file’s annotation. Choose “Options” next
and scroll down through the extensive options list in the “Prime Options” window. Check “Specify PCR
target range.” This activates the –Begin2 and –End2 options, as before with the prion dataset. The “PCR
target starting” and “ending position” parameters identify those positions and need to be changed to restrict
the target range accordingly. Therefore, as before, specify the 3’ end of your upstream candidate region and
the 5’ end of your downstream candidate region respectively; also be sure that “Minimum % of specified
PCR target range to be included in product” is set to “100.0” to force all primer searching within the desired
primer binding regions. Also check the “Save primers found to a pattern file” option and give it an
appropriate name like “HPV.all.dat,” to create a pattern data file for this dataset. “Close” the options
window. Press “Run” in the program window after making your selections. Prime will now search for the best
primers within the restricted areas specified. The output will quickly display; as with the prion data, your
“.rsf” and “.dat” files may be empty because of the strict default parameters. In my first run against
cons25pct I couldn’t find any primers. Upon perusing my “.prime” output the worst offender again seemed to
be GC content, so I lowered “product GC minimum” from the default 40% to “25%” in a subsequent run.
You’ll probably have to do the same, since the consensus sequence is so AT rich.
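It can help to think of the product constraints as a simple filter applied to every candidate primer pair. The Python sketch below restates three of them (the target range must be fully included, the product must not exceed the maximum length, and the two primer Tm values must be close) using my own hypothetical names; it is a simplified reading of the constraints, not Prime's implementation, and the Tm values are taken from a Prime report rather than computed.

```python
from dataclasses import dataclass


@dataclass
class PrimerPair:
    forward_5prime: int  # template position of the forward primer's 5' end
    reverse_5prime: int  # template position of the reverse primer's 5' end
    forward_tm: float    # degrees Celsius, as reported by Prime
    reverse_tm: float


def passes_product_constraints(pair, target_begin, target_end,
                               max_length, max_tm_difference=2.0):
    """True if the product spans the whole target range, is not longer than
    max_length, and the primer Tm values differ by less than the limit."""
    length = pair.reverse_5prime - pair.forward_5prime + 1
    includes_target = (pair.forward_5prime <= target_begin
                       and pair.reverse_5prime >= target_end)
    tm_ok = abs(pair.forward_tm - pair.reverse_tm) < max_tm_difference
    return includes_target and length <= max_length and tm_ok


# Product 1 of the earlier prion run: target range 153 - 630, maximum length 718.
prion_pair = PrimerPair(137, 736, 61.5, 60.5)
print(passes_product_constraints(prion_pair, 153, 630, 718))
```

Seen this way, it is clear why 100% inclusion of the target range confines the forward primer to the region upstream of the target and the reverse primer to the region downstream of it.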
Repeat the run with the new parameters. Remember the “Windows” menu has a ‘shortcut’ listing to all
programs used in the current session so it can be used to relaunch “Prime.” Change the appropriate
parameters, and then press “Run” in the “Prime” main window. Note the “Command Line” parameters in
the setup for my successful run, shown in the screen dump graphic to the right here:
When the program successfully finishes “Close” the RSF file display window, but look over the
“HPV.all.dat” data file and then “Close” its window. My example follows below:
!!PATTERNS 1.0
This file contains possible primers for the template sequence:
/users/thompson/.seqlab-mendel/input_61.rsf{cons25pct}
..
forward1   1 GACTACTTGCTGTTGGACATC  ! 77 -> 97
forward2   1 GAATATGTGACACGCACAAAC  ! 31 -> 51
forward3   1 CTAGACTACTTGCTGTTGGAC  ! 74 -> 94
forward4   1 CCCTGTATCTAAGGTTGTAAGC ! 3 -> 24
forward5   1 GTATCTAAGGTTGTAAGCACGG ! 7 -> 28
forward6   1 AAGCACGGATGAATATGTGAC  ! 21 -> 41
forward7   1 TGTATCTAAGGTTGTAAGCACG ! 6 -> 27
forward8   1 GCACGGATGAATATGTGACAC  ! 23 -> 43
forward9   1 TGCAGGCAGTTCTAGACTAC   ! 63 -> 82
forward10  1 GGATGAATATGTGACACGCAC  ! 27 -> 47
forward11  1 CTACTTGCTGTTGGACATCC   ! 79 -> 98
forward12  1 CGGATGAATATGTGACACGC   ! 26 -> 45
forward13  1 ACTACTTGCTGTTGGACATCC  ! 78 -> 98
reverse1   1 TGCGTCCCAAAGGAAACTG    ! 1387 -> 1369
reverse2   1 CTGGCCTTAAATCCTGCTTG   ! 1418 -> 1399
reverse3   1 TTGCGTCCCAAAGGAAAC     ! 1388 -> 1371
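If you want to reuse the primers in such a data file outside of GCG, the layout is simple enough to read with a short script. Here is a minimal Python sketch, assuming one entry per line in the order name, offset, pattern, with an optional trailing field and an optional comment after “!”, and a header that ends at the line carrying the two periods. The function name is hypothetical and this is not a GCG utility.

```python
def parse_pattern_file(text):
    """Return (name, offset, pattern, comment) tuples from pattern data text."""
    entries = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not in_data:
            # Everything up to and including the ".." line is header.
            in_data = line.endswith("..")
            continue
        if not line or line.startswith("!"):
            continue  # blank line or comment-only line
        fields, _, comment = line.partition("!")
        parts = fields.split()
        if len(parts) < 3:
            continue  # not a complete entry
        entries.append((parts[0], int(parts[1]), parts[2], comment.strip()))
    return entries


sample = """!!PATTERNS 1.0
This file contains possible primers for the template sequence:
/users/thompson/.seqlab-mendel/input_61.rsf{cons25pct}
..
forward1   1 GACTACTTGCTGTTGGACATC ! 77 -> 97
reverse1   1 TGCGTCCCAAAGGAAACTG   ! 1387 -> 1369
"""

for name, offset, pattern, comment in parse_pattern_file(sample):
    print(name, len(pattern), comment)
```

The comment field written by Prime holds the primer's position on the template, which is handy when you later compare primers from different runs.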
Display and scroll through your successful “.prime” file and then “Close” its window. Remember, the best
are at the top of the file. The beginning of my successful universal primer search “.prime” file is shown
below over the next couple of pages; changed parameters are highlighted in bold:
PRIME of: input_62.rsf{cons25pct} ck: 2375 from: 1 to: 1646
October 17, 2003 16:09

INPUT SUMMARY
-------------
Input sequence: /users/thompson/.seqlab-mendel/input_62.rsf{cons25pct}

Primer constraints:
  primer size: 18 - 22
  primer 3' clamp: S (although this is often turned off in real experiments!)
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 55.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified

Product constraints:
  product length: 1217 - 1419
  product GC content: 25.0 - 55.0%
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 234 - 1450
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
PRIMER SUMMARY
--------------
                                    forward   reverse
Number of primers considered:           500       482

Number of primers rejected for . . .
  primer 3' clamp:                      108       124
  primer sequence ambiguity:             18        18
  primer GC content:                    214       202
  primer Tm:                            104        90
  non-unique binding sites:               0         0
  primer self-annealing:                 11        16
  primer-template annealing:              0         0
  primer-repeat annealing:                0         0

Number of primers accepted:              45        32

PRODUCT SUMMARY
---------------
Number of products considered:         1440

Number of products rejected for. . .
  product length:                       700
  product GC content:                     0
  product Tm:                             0
  product position:                       0
  duplicate primer endpoints:           305
  difference in primer Tm:               71
  primer-primer annealing:              320

Number of products accepted:             44
Number of products saved:                25

Maximum overlap between products:   1419 bp
THE FOLLOWING PRODUCTS ARE SORTED BY THEIR ANNEALING SCORE
-------------------------------------------------------------------------------
Product: 1
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                        3'
forward primer (19-mer):   193 CATGCAGGCAGTTCTAGAC 211
reverse primer (20-mer):  1605 AGAGGTAGATGAGGTGGTGG 1586

                               forward   reverse
primer %GC:                       52.6      55.0
primer Tm (degrees Celsius):      50.1      51.9

PRODUCT
-------
product length: 1413
product %GC: 34.7
product Tm: 73.7 degrees Celsius
difference in primer Tm: 1.8 degrees Celsius
annealing score: 53
optimal annealing temperature: 51.7 degrees Celsius
-------------------------------------------------------------------------------
Product: 2
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                        3'
forward primer (20-mer):   192 TCATGCAGGCAGTTCTAGAC 211
reverse primer (20-mer):  1599 AGATGAGGTGGTGGGTGTAG 1580

                               forward   reverse
primer %GC:                       50.0      55.0
primer Tm (degrees Celsius):      51.5      52.5

PRODUCT
-------
product length: 1408
product %GC: 34.7
product Tm: 73.6 degrees Celsius
difference in primer Tm: 1.0 degrees Celsius
annealing score: 57
optimal annealing temperature: 52.1 degrees Celsius
-------------------------------------------------------------------------------
Product: 3
[DNA] = 50.000 nM    [salt] = 50.000 mM

PRIMERS
-------
                               5'                         3'
forward primer (21-mer):   209 GACTACTTGCTGTTGGACATC 229
reverse primer (19-mer):  1519 TGCGTCCCAAAGGAAACTG 1501

                               forward   reverse
primer %GC:                       47.6      52.6
primer Tm (degrees Celsius):      50.9      52.5

PRODUCT
-------
product length: 1311
product %GC: 34.2
product Tm: 73.4 degrees Celsius
difference in primer Tm: 1.6 degrees Celsius
annealing score: 57
optimal annealing temperature: 51.8 degrees Celsius
//////////////////////////////////////////////////////////////////
My HPV L1 universal primer example location schematic is shown to the right here in a screen dump graphic:
“Close” the graphics window. “Add to Editor” the “prime.rsf” file displayed in the “Output Manager” and
“Overwrite old with new” in the “Reloading Same Sequence” window to merge the new primer feature
annotation onto your existing RSF file. “Close” the “Output Manager” window afterwards.
Look at this new feature information by changing your “Display:” to “Graphic Features” and zoom out to
“16:1” so that the entire length can be seen at once. PCR products are again coded as orange arches,
upstream primers as green diamonds, and downstream primers as red diamonds. Double-click on the new
feature and then select that entry in the “Feature” window to see a description. It should look similar to the
graphic shown below:
“Close” the “Feature” window to return to SeqLab’s main window.
The sample pattern data file that you ‘fetched’ at the beginning of this section contains the My09/11 published
primer set mentioned in the Introduction. That data file, “primer-tutorial.L1.dat,” follows below:
The primers used in the L1 primer design computer laboratory ..
! The published My09/11 set:
!
MY11     1   GCMCAGGGWCATAAYAATGG   0   ! no ambiguity allowed
MY11a    1   GCaCAGGGaCATAAcAATGG   0
MY11b    1   GCcCAGGGtCATAAcAATGG   0
MY11c    1   GCaCAGGGtCATAAcAATGG   0
MY11d    1   GCcCAGGGaCATAAcAATGG   0
MY11e    1   GCaCAGGGaCATAAtAATGG   0
MY11f    1   GCcCAGGGtCATAAtAATGG   0
MY11g    1   GCaCAGGGtCATAAtAATGG   0
MY11h    1   GCcCAGGGaCATAAtAATGG   0
!
MY09     1   CGTCCMARRGGAWACTGATC   0   ! no ambiguity allowed
MY09a    1   CGTCCaAaaGGAaACTGATC   0
MY09b    1   CGTCCcAagGGAaACTGATC   0
MY09c    1   CGTCCaAgaGGAaACTGATC   0
MY09d    1   CGTCCcAggGGAaACTGATC   0
MY09e    1   CGTCCaAagGGAaACTGATC   0
MY09f    1   CGTCCcAaaGGAaACTGATC   0
MY09g    1   CGTCCaAggGGAaACTGATC   0
MY09h    1   CGTCCcAgaGGAaACTGATC   0
MY09i    1   CGTCCaAaaGGAtACTGATC   0
MY09j    1   CGTCCcAagGGAtACTGATC   0
MY09k    1   CGTCCaAgaGGAtACTGATC   0
MY09l    1   CGTCCcAggGGAtACTGATC   0
MY09m    1   CGTCCaAagGGAtACTGATC   0
MY09n    1   CGTCCcAaaGGAtACTGATC   0
MY09o    1   CGTCCaAggGGAtACTGATC   0
MY09p    1   CGTCCcAgaGGAtACTGATC   0
As mentioned in the prion exercise, these data files are in a special GCG format known as a pattern data file.
You’ll have at least three primer pattern data files in your working directory now, one from the prion exercise,
one sample data file, “primer-tutorial.L1.dat,” and your new HPV all data file.
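The lettered entries in the sample file (MY11a through MY11h, MY09a through MY09p) are the explicit, unambiguous versions of the two degenerate primers, written out because the primer analysis programs do not accept ambiguity codes. The following is a minimal Python sketch of that expansion, not part of GCG; the function name expand_degenerate is my own, and the lower-case convention for the expanded positions simply mimics the sample file.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every unambiguous sequence encoded by a degenerate primer.

    Bases chosen at ambiguous positions are written in lower case, as in
    the sample file; fixed positions stay in upper case.
    """
    choices = []
    for base in primer.upper():
        options = IUPAC[base]
        choices.append(options.lower() if len(options) > 1 else options)
    return ["".join(combo) for combo in product(*choices)]


my11 = expand_degenerate("GCMCAGGGWCATAAYAATGG")  # M, W, Y: 2 x 2 x 2 sequences
my09 = expand_degenerate("CGTCCMARRGGAWACTGATC")  # M, R, R, W: 2 x 2 x 2 x 2 sequences
for name, seqs in (("MY11", my11), ("MY09", my09)):
    print(name, len(seqs))
    for seq in seqs:
        print("   ", seq)
```

The order in which the expansions come out differs from the lettering in the sample file, but the sets of sequences are the same.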
I’d like to discuss this format in more detail here. It is structured after GCG’s restriction enzyme data files and
begins with some helpful documentation or an appropriate explanation at the top of the file. Two periods, “..”,
are essential in all GCG data files and separate the header information from the data below. Each pattern
must sit entirely on its own line, without any gaps (ambiguity syntax is supported in most GCG programs that
read these data, but not in the primer analysis ones). Each pattern must be preceded by a name and the offset
number 1, and may be followed by an optional overhang number 0. The exact column in which the various fields
appear is not important, but the order of the fields is vital. Comments can be embedded anywhere by placing an
exclamation point before them.
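The format rules just described can be sketched as a small reader. This is only an illustration written in Python under the rules stated above (header ends at the line that finishes with “..”, fields are name, offset, pattern, optional overhang, and “!” starts a comment); it is not GCG code, and the function name parse_pattern_file and the two-entry sample text are hypothetical.

```python
def parse_pattern_file(text):
    """Parse pattern data text into (name, offset, pattern, overhang) tuples."""
    entries = []
    in_data = False
    for raw in text.splitlines():
        if not in_data:
            # Everything up to and including the line ending in ".." is header.
            if raw.rstrip().endswith(".."):
                in_data = True
            continue
        line = raw.split("!", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()  # column positions do not matter, only order
        if len(fields) < 3:
            raise ValueError(f"expected name, offset, pattern: {raw!r}")
        name, offset, pattern = fields[0], int(fields[1]), fields[2]
        overhang = int(fields[3]) if len(fields) > 3 else 0
        entries.append((name, offset, pattern, overhang))
    return entries


# Hypothetical two-entry sample in the same layout as the tutorial file.
sample = """Two primers for illustration ..
! one forward and one reverse primer
MY11a   1   GCaCAGGGaCATAAcAATGG   0
MY09a      1 CGTCCaAaaGGAaACTGATC 0 ! a trailing comment
"""

for entry in parse_pattern_file(sample):
    print(entry)
```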
See if you can figure out how to submit this data file to Prime in order to test how well the commercial primers
stack up against your custom designed ones with the universal consensus sequence. You’ll have to use the
“Select forward and reverse primers from one file” option. Use the “Primers” button there to produce a
dialog box, “Primer Chooser for Prime;” click on its “Primer Data File. . .” button, and use the “File
Chooser” to pick my sample “primer-tutorial.L1.dat” and check “OK.” “Close” the “Primer Chooser”
window. The “Prime Options” window should now show that you are using the appropriate data file; “Close”
the “Prime Options” window and “Run” the program.
b) Design strain specific primers
To find suitable strain specific primers, first check out the phylogenetic tree shown on the following page
inferred from the alignment that we have been working with. This particular tree was estimated using a
distance-based method with the Kimura correction model and the Fitch-Margoliash least-squares fit algorithm.
It is representative of the trees that this alignment will yield with almost all inference algorithms. I have arbitrarily
placed the HPV type 16 assemblage at the top of the tree and imply no directionality by this. Horizontal
branch lengths are directly proportional to evolutionary divergence in units of substitutions per site; vertical
branch length has no meaning other than to separate clades.
The Human Papillomaviruses most closely related to Type 16 are illustrated below in a phylogenetic tree:
Pick your favorite clade from the tree (a clade is a group of organisms all related to a common ancestor not
shared by some other group). Go for any level of specificity — from just a couple of individual sequences
composing a clade up to a whole bunch of related sequences in a larger clade. Now go back to the SeqLab
editor. Deselect the consensus sequence by <Ctrl> clicking it (but remember that on Linux SeqLab you need
to use the right mouse button). Next, select that group of DNA sequences that you wish to design strain
specific primers for. Remember to press the <Ctrl> key to select (right-click in Linux) more than one nonadjacent entry. For my example I chose the clade that includes the type 6 and 11 strains. After selecting
those sequences that you want to design strain specific primers for, go back to the “Functions” menu and run
PlotSimilarity on just these chosen sequences. Your plot should look something like my example shown
below, but with your particular choices selected:
Areas of high conservation in this plot that correspond to areas of divergence in the previous plot are ideal for
designing strain specific primers. In fact, you can even produce printouts on clear plastic transparencies and
overlay the two to exactly localize the regions (or use a light box with paper printouts). Another rationale is to
selectively deselect particular sequences and note the shift in valleys rather than peaks. Both methods can be
very powerful and can tremendously help in the design of primers to any level of phylogenetic specificity
desired. This really does work — I’ve had many clients successfully use the technique in real experiments!
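The overlay idea above, conserved in the subset but divergent overall, can also be sketched numerically. The Python below is a rough illustration and not PlotSimilarity's algorithm: it scores each alignment column by the fraction of sequences sharing the most common character, averages over a sliding window, and reports windows that are high for the subset and low for the full set. The function names, the thresholds, and the toy alignment are all assumptions of mine.

```python
from collections import Counter


def column_conservation(seqs):
    """Fraction of sequences sharing the most common character in each column."""
    scores = []
    for column in zip(*seqs):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def window_means(scores, window):
    """Mean score of every full-length window along the alignment."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def specific_windows(all_seqs, subset_seqs, window=4, high=0.9, low=0.7):
    """0-based window starts where the subset is conserved but the full set is not."""
    all_w = window_means(column_conservation(all_seqs), window)
    sub_w = window_means(column_conservation(subset_seqs), window)
    return [i for i, (a, s) in enumerate(zip(all_w, sub_w))
            if s >= high and a <= low]


# Toy aligned sequences; the first two play the role of the chosen clade.
toy_all = ["ACGTACGTAA", "ACGTACGTAA", "ACGTTTTTAA", "ACGTGGGGAA"]
toy_subset = toy_all[:2]
print(specific_windows(toy_all, toy_subset))
```

With real data the window would be closer to primer length, and gap characters would need more thought than this sketch gives them.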
The actual primer design phase of this step proceeds exactly the same as before. Close the windows that
overlay the SeqLab editor. With your group of desired sequences still chosen, go to the “Edit” menu and
select “Consensus.” Pick a desired agreement level (25% through 50% seem to work well depending on the
level of divergence in your subset of the data) and create a consensus sequence of your desired subset. A
new consensus sequence will appear below the alignment. Select your new consensus sequence (only) and
go through the same procedure with it as with the overall alignment consensus to discover the best forward
and reverse primers in it. Don’t forget to save the new RSF file and to use the “Save primers found to a
pattern file” option for saving the primers in GCG pattern data format. Name this data file with a name that
you’ll recognize (e.g. “HPV.subset.dat”). You’ll be selecting those ranges within the subset consensus that
correspond to the furthest separated, most highly conserved regions, that do not line up with highly conserved
regions of the overall alignment. Got it? This way you should be choosing sections of the L1 gene that will
discriminate between, in my case, the type 6 and 11 variants from all the other closely related type 16
sequences. Pretty cool, huh?
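To see why the agreement level matters, here is a minimal Python sketch of a threshold consensus. It is an assumption about how such a consensus could be computed, not a description of SeqLab's “Consensus” function: a column gets its most common base when that base reaches the agreement level, and an “N” otherwise. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def threshold_consensus(seqs, agreement=0.5, unknown="N"):
    """Most common base per column if it reaches the agreement level, else unknown."""
    consensus = []
    for column in zip(*seqs):
        bases = [b.upper() for b in column if b not in "-."]  # ignore gap symbols
        if not bases:
            consensus.append(unknown)
            continue
        base, count = Counter(bases).most_common(1)[0]
        # Agreement is measured against all sequences, gapped ones included.
        consensus.append(base if count / len(column) >= agreement else unknown)
    return "".join(consensus)


toy = ["ACGT", "ACGA", "ACTA", "AGCC"]
print(threshold_consensus(toy, agreement=0.50))
print(threshold_consensus(toy, agreement=0.75))
```

A lower agreement level leaves fewer undetermined positions, which matters here because Prime was run with primer sequence ambiguity not allowed.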
I discovered four acceptable type 6/11 strain specific primer pairs with my alignment subgroup consensus.
The abridged result of my analysis follows. The parameters I had to change are again highlighted in bold in
the output below:
PRIME of: L1.rsf{cons50pct}   ck: 3391   from: 1   to: 1638
March 4, 1997 15:33
INPUT SUMMARY
-------------
Input sequence: /usr/thompson/seqlab/L1.rsf{cons50pct}
Primer constraints:
  primer size: 18 - 22
  primer 3' clamp: S
  primer sequence ambiguity: NOT ALLOWED
  primer GC content: 40.0 - 55.0%
  primer Tm: 50.0 - 65.0 degrees Celsius
  primer self-annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
  unique primer binding sites: required
  primer-template and primer-repeat annealing. . .
    3' end: ignored
    total: ignored
  repeated sequences screened: none specified
Product constraints:
  product length: 740 - 942
  product GC content: 40.0 - 55.0
  product Tm: 70.0 - 95.0 degrees Celsius
  product must include the region from 611 - 1350
  duplicate primer endpoints: NOT ALLOWED
  difference in primer Tm: < 2.0 degrees Celsius
  primer-primer annealing. . .
    3' end: < 8 (weight: 2.0)
    total: < 14 (weight: 1.0)
PRIMER SUMMARY
--------------
                                     forward   reverse
Number of primers considered:           6         6
Number of primers rejected for . . .
  primer 3' clamp:                      0         0
  primer sequence ambiguity:            0         0
  primer GC content:                    0         0
  primer Tm:                            0         0
  non-unique binding sites:             0         0
  primer self-annealing:                0         1
  primer-template annealing:            0         0
  primer-repeat annealing:              0         0
Number of primers accepted:             6         5
PRODUCT SUMMARY
---------------
Number of products considered:         30
Number of products rejected for. . .
  product length:                       0
  product GC content:                   2
  product Tm:                           0
  product position:                     0
  duplicate primer endpoints:           9
  difference in primer Tm:             13
  primer-primer annealing:              0
Number of products accepted:            6
Number of products saved:               6
-----------------------------------------------------------------------------
Product: 1
[DNA] = 50.000 nM
[salt] = 50.000 mM
PRIMERS
-------
forward primer: T611fb
reverse primer: T611rc
                              5'                      3'
forward primer (21-mer):   525 GGATAACAGGGTTAATGTAGG  545
reverse primer (18-mer):  1408 GCTTTTGACAGGTAATGG     1391
                                forward   reverse
primer %GC:                     42.9      44.4
primer Tm (degrees Celsius):    53.0      51.4
PRODUCT
-------
product length: 884
product %GC: 40.0
product Tm: 76.3 degrees Celsius
difference in primer Tm: 1.6 degrees Celsius
annealing score: 53
optimal annealing temperature: 53.9 degrees Celsius
-----------------------------------------------------------------------------
Product: 2
[DNA] = 50.000 nM
[salt] = 50.000 mM
PRIMERS
-------
forward primer: T611fe
reverse primer: T611rc
                              5'                      3'
forward primer (20-mer):   522 ACAGGATAACAGGGTTAATG   541
reverse primer (18-mer):  1408 GCTTTTGACAGGTAATGG     1391
                                forward   reverse
primer %GC:                     40.0      44.4
primer Tm (degrees Celsius):    51.7      51.4
PRODUCT
-------
product length: 887
product %GC: 40.0
product Tm: 76.3 degrees Celsius
difference in primer Tm: 0.3 degrees Celsius
annealing score: 55
optimal annealing temperature: 53.9 degrees Celsius
-----------------------------------------------------------------------------
Product: 3
[DNA] = 50.000 nM
[salt] = 50.000 mM
PRIMERS
-------
forward primer: T611fc
reverse primer: T611rc
                              5'                      3'
forward primer (18-mer):   514 AACCCTGGACAGGATAAC     531
reverse primer (18-mer):  1408 GCTTTTGACAGGTAATGG     1391
                                forward   reverse
primer %GC:                     50.0      44.4
primer Tm (degrees Celsius):    52.0      51.4
PRODUCT
-------
product length: 895
product %GC: 40.2
product Tm: 76.4 degrees Celsius
difference in primer Tm: 0.6 degrees Celsius
annealing score: 55
optimal annealing temperature: 54.0 degrees Celsius
///////////////////////////////////////////////////////////////
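Two of the numbers in each product report are easy to check by hand: the primer %GC and the product length, which follows from the 5' coordinates of the two primers. The Python below is a quick sanity check using the three type 6/11 primer pairs listed above. It does not attempt the Tm or annealing values, since the formulas Prime uses are not given here; the helper names are my own.

```python
def gc_percent(seq):
    """Percent G+C of a primer, rounded to one decimal as in the Prime report."""
    seq = seq.upper()
    return round(100.0 * (seq.count("G") + seq.count("C")) / len(seq), 1)


def product_length(forward_start, reverse_start):
    """Product length from the 5' coordinates of the forward and reverse primers."""
    return reverse_start - forward_start + 1


# (label, forward 5' position, forward primer, reverse 5' position, reverse primer)
pairs = [
    ("Product 1", 525, "GGATAACAGGGTTAATGTAGG", 1408, "GCTTTTGACAGGTAATGG"),
    ("Product 2", 522, "ACAGGATAACAGGGTTAATG", 1408, "GCTTTTGACAGGTAATGG"),
    ("Product 3", 514, "AACCCTGGACAGGATAAC", 1408, "GCTTTTGACAGGTAATGG"),
]

for label, f_start, f_seq, r_start, r_seq in pairs:
    print(label,
          "forward %GC:", gc_percent(f_seq),
          "reverse %GC:", gc_percent(r_seq),
          "product length:", product_length(f_start, r_start))
```

The values agree with the report: 42.9, 40.0, and 50.0 %GC for the three forward primers, 44.4 for the shared reverse primer, and product lengths of 884, 887, and 895.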
Be sure to again “Add to Editor” the “prime.rsf” file displayed in the “Output Manager.” Choose
“Overwrite old with new” in the “Reloading Same Sequence” window that pops up. As before, this will
merge the new feature annotation that locates the successful strain specific primers onto your existing RSF
file. “Close” the “Output Manager” window after loading your new feature annotation.
c) Test for specificity
First we’ll see how specific our primers are for the sequences in our dataset. Therefore, we’ll run FindPatterns
in SeqLab against all the sequences in your alignment. Go to the “Functions” menu, choose “Database
Sequence Searching” and “FindPatterns.” Next, specify both the “Search Set” and the “Patterns” to be
used by the program. This gets a bit interesting, as you have to navigate through several file chooser boxes to
designate your desired input file and pattern data file. First click the “Search Set” button to get a dialog box
entitled “Build FindPattern’s Search Set.” Click on “Add Main List Selection” to produce the “List
Chooser;” there select and then “Add” the alignment file we’ve been working on, “primer-tutorial.L1.rsf.”
“Close” the chooser boxes to return to the FindPatterns program box. Now punch
“Patterns” to get the “Pattern Chooser.” Here click on “Pattern Data File. . .” to get the “File Chooser;” your
newly created data files from above should be displayed. Select your final universal primer data file and click
“OK” and then press the “Add All” button; then click the “Pattern Data File. . .” button again. This time add
your final subset specific primer data file by selecting it and pressing “OK” in the “File Chooser” window and
then using the “Add All” button in the “Pattern Chooser.” Finally repeat the procedure with the My09/11
commercial data set. You should end up with a combined pattern file of your universal primer set, your subset
specific primer set, and the commercial set; they will now be displayed in the “Chosen Patterns:” window.
You may want to “Save Chosen. . .” to create a combined primer data pattern file in your account.
“Close” the “Pattern Chooser” window. “Using selected Patterns.” should now be displayed after the
“Patterns. . .” button. Check “Save matches as features in “findpatterns.rsf”” and then press the
“Run” button after setting up the specified search and pattern sets. The program box will go away and the
output will display relatively soon. Scroll through the file, noticing the specificity of each primer for particular
sequences. “Close” the windows when done. I found it interesting in my test runs that the My09/11 primers
matched so few sequences in our dataset.
Be sure to “Add to Editor” the “findpatterns.rsf” file displayed in the “Output Manager.” Choose
“Overwrite old with new” in the “Reloading Same Sequence” window that pops up. As always, this will
merge the new feature annotation that locates all the primer location data onto your existing RSF file. “Close”
the “Output Manager” window after loading your new feature annotation. Check out the new features in
“Graphic Features” mode to quickly see which sequences anneal to your new primers.
We still need to see if our primers are specific to HPV though. We did this sort of test in the prion exercise
with FindPatterns, for the reasons already discussed. Because the DNA databases are so big, and getting
bigger all the time, if you ever do this type of search from the command line, be sure to do it in batch mode.
GCG makes this easy by providing a –Batch option to many of their cpu intensive programs. Searching all of
GenBank in this fashion takes quite a while to run so I am providing you with that output file. I ran the
FindPatterns search with our candidate primers data file against all of GenBank allowing for one mismatch to
occur between the primer and the sequence. For those interested, I used the following command line for this
search; however, do not repeat this search at this point:
> findpatterns -data=combined.L1.dat -mismatch=1 -batch gb:* primer-tutorial.genbank.finds
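For readers wondering what “allowing for one mismatch” amounts to, here is a minimal Python sketch of the idea. It is a naive sliding-window comparison written only for illustration, not FindPatterns' implementation; the function names are my own, and the reverse-complement helper is included on the assumption that a reverse primer, listed 5' to 3', has to be compared against the opposite strand.

```python
def find_with_mismatches(pattern, sequence, max_mismatches=1):
    """Return (0-based start, mismatch count) for every window within the limit."""
    pattern, sequence = pattern.upper(), sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy target: one exact copy of the pattern and one copy with a single change.
toy_target = "TTACGTTTACCTTT"
print(find_with_mismatches("ACGT", toy_target, max_mismatches=1))
print(find_with_mismatches(reverse_complement("AACG"), "GGCGTTGG", max_mismatches=0))
```

A scan like this over every sequence in a database the size of GenBank is exactly why the real search is best left to a batch job.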
Temporarily switch to the terminal window that’s been hanging around behind SeqLab. Use the UNIX
command “more” to page through the file “primer-tutorial.L1.finds” in the following manner:
> more primer-tutorial.L1.finds
The file is huge, almost 5000 lines, with 1,357 finds in 326 different sequences. Each individual pattern found
is listed along with its location on each sequence. Notice the types of sequences being found by the program.
Largely they are HPV, the proper sequences to be found by the candidate primers, but some notable
exceptions appear. In particular, all of the U.S. patent sequences are interesting; they are likely commercial
kits for HPV diagnosis. Other exceptions include a few E. coli sequences, a potato cDNA, and a cow kinase
— not likely to cause much of a PCR contamination problem in genital tissue swabs — and some human
“genomic” sequences. These human sequences may cause some problems and should probably be checked
out. One of them turns out to be an HPV integrated site in a human carcinoma (Yabe, et al., 1991) and the
other a genomic clone that encodes the PAX6 protein (van Heyningen and Little, 1995). The HPV site is
expected. The PAX6 site is interesting but probably won’t be a problem since the primers that found it are
from two different series, MY11 and T 6/11, but its contaminant potential should be kept in mind. For those
interested, PAX6 turns out to be a ‘paired box’ type homeobox protein involved in vertebrate eye development.
Supplement
Back in the wet lab you would have synthesized oligo’s (and labeled them, if doing hybridization), performed the PCR
reaction or hybridization screen, and isolated the products with plaque/colony purification or direct PCR purification, as
appropriate.
After you have found a candidate sequence, what next? Often it’s restriction mapping.
The unknown stretch of DNA is restriction digested with various enzymes and agarose gel electrophoresed; the resultant
fragment sizes are extrapolated from migration distances. From this information a tentative restriction map can be
hypothesized.
This type of restriction mapping, i.e. reconstructing a physical map based on overlaps without having an actual
sequence, is computationally very difficult. Few automated solutions exist.
Alternative strategies include subcloning the pieces into a manageable vector and then sequencing those fragments or
direct PCR product sequencing.
After generating some sequence data, the other type of restriction mapping, that where you do know the sequence and
you merely want to know where all the various restriction enzymes may cut, can be very helpful. The GCG programs,
Map, MapPlot, MapSort, and PlasmidMap can all assist in guiding and illustrating this process. Once all cut sites have
been mapped, SeqLab, or the stand-alone sequence editor SeqEd, can be used to actually perform the subcloning
operation on the computer before doing it in the wet lab.
References Cited
Bilofsky, H.S., Burks, C., Fickett, J.W., Goad, W.B., Lewitter, F.I., Rindone, W.P., Swindell, C.D., and Tung, C.S. (1986)
The GenBank(TM) Genetic Sequence Data Bank. Nucleic Acids Research 14: 1-4.
Cherfas, J. (1990). Genes Unlimited. New Scientist 14: 29-33.
Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.2,
Madison, Wisconsin, USA 53711.
Gribskov, M., Luethy, R., and Eisenberg, D. (1989). Profile Analysis. Methods in Enzymology, 183: 146-159, Academic
Press, San Diego, California, U.S.A.
Gupta, S.K., Kececioglu, J., and Schaffer, A.A. (1995) Making the Shortest-Paths Approach to Sum-of-Pairs Multiple
Sequence Alignment More Space Efficient in Practice, Proc. 6th Annual Combinatorial Pattern Matching
conference (CPM ‘95).
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National
Academy of Sciences U.S.A. 89: 10915-10919.
Mullis, K.B. (1990). The Unusual Origin of the Polymerase Chain Reaction. Scientific American April: 56-65.
Nagano, H., Yoshikawa, H., Kawana, T., Yokota, H., Taketani, Y., Igarashi, H., Yoshikura, H., and Iwamoto, A. (1996)
Association of multiple human papillomavirus types with vulvar neoplasias. J. Obstet. Gynaecol. 22: 1-8.
Online Mendelian Inheritance in Man, OMIM (TM). (1996) Center for Medical Genetics, Johns Hopkins University
(Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda,
MD). World Wide Web URL: http://www3.ncbi.nlm.nih.gov/omim/
Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R., Horn, G.T., Mullis, K.B., and Erlich, H.A. (1988).
Primer-Directed Enzymatic Amplification of DNA with a Thermostable DNA Polymerase. Science 239: 487-491.
Sambrook, J., Fritsch, E.F., and Maniatis, T. (1989). Synthetic Oligonucleotide Probes. In Molecular Cloning A Laboratory
Manual, 2nd ed. (pp 11.2-11.53), Cold Spring Harbor Laboratory Press, New York, New York, USA.
Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences
and Structure, 5, Suppl. 3, (pp. 353-358), National Biomedical Research Foundation, Washington, D.C., U.S.A.
Smith, R.F. and Smith, T.F. (1992). Pattern-Induced Multi-sequence Alignment (PIMA) algorithm employing secondary
structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5: 35-41.
Stewart A.C., Eriksson, A.M., Manos, M.M., Munoz, N., Bosch, F.X., Peto, J., and Wheeler, C.M. (1996) Intratype
variation in 12 human papillomavirus types: a worldwide perspective. J. Virol. 70: 3127-3136.
Tenti, P., Romagnoli, S., Silini, E., Zappatore, R., Spinillo, A., Giunta, P., Cappellini, A., Vesentini, N., Zara, C., and
Carnevali, L. (1996) Human papillomavirus types 16 and 18 infection in infiltrating adenocarcinoma of the cervix:
PCR analysis of 138 cases and correlation with histologic type and grade. Am. J. Clin. Pathol. 106: 52-56.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice.
Nucleic Acids Research 22: 4673-4680.
van Heyningen, V. and Little, P.F. (1995) Report of the fourth international workshop on human chromosome 11 mapping
1994. Cytogenet. Cell Genet. 69: 127-158.
White, T.J., Arnheim, N., and Erlich, H.A. (1989). The Polymerase Chain Reaction. Trends in Genetics 5: 185-189.
Wood, W.I. (1987). Gene Cloning Based on Long Oligonucleotide Probes. Methods in Enzymology 152: 443-447,
Academic Press, San Diego, California, USA.
Yabe, Y., Sakai, A., Hitsumoto, T., Kato, H., and Ogura, H. (1991) A subtype of human papillomavirus 5 (HPV-5b) and
its subgenomic segment amplified in a carcinoma: nucleotide sequences and genomic organizations. Virology
183: 793-798.