PowerPoint Presentation - Sequence analysis workshop

Sequencher project
1. If one of the two strands can be called clearly and the other is ambiguous, leave
the ambiguous call as it is. Do not use the clear strand to resolve the ambiguous
call; the consensus will be determined from the two calls. (example)
2. When you export sequences, export them 5'-3', all in Pearson/FASTA format.
(example)
export as fasta format
format example:
>r100705ec2, 667 bases, 983E3ABA checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGT
TAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT
>r100705ecj, 667 bases, 634529C5 checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGT
TAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT
>r196807ecl, 667 bases, 79866EC6 checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGT
TAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT
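A minimal sketch of how exported files could be sanity-checked before the database search. It assumes every header follows the layout of the example above (name, "N bases", checksum); the function names `read_fasta` and `check_record` are hypothetical helpers, not part of Sequencher.

```python
import re

# Header layout assumed from the example above:
#   >name, N bases, CHECKSUM checksum.
HEADER = re.compile(
    r"^>(\S+?),\s*(\d+)\s+bases,\s*([0-9A-Fa-f]+)\s+checksum\.?\s*$"
)


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_record(header, sequence):
    """Return (name, declared_length, actual_length, lengths_match)."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError("unexpected header layout: " + header)
    declared = int(match.group(2))
    return match.group(1), declared, len(sequence), declared == len(sequence)
```

A record whose declared length differs from the number of bases actually present was probably truncated or edited after export.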
database search
to find out whether there is any cross-patient or lab-strain contamination
• On Huey, put all sequences in one directory, one sequence per file, and nothing else (use r1pol as
the demo example, r1env as the demo results).
• Command line: fasterfasta 10 % UVELF &
A: GenBank Last Full Release + Updates
G: GenBank Last Full Release
U: GenBank Updates
V: All Viral sequences from GenBank Last Full Release
Y: All Synthetic (=vector) sequences from GenBank Last Full Release
E: Mullins Lab Sequences
F: Frenkel Lab Sequences
L: LANL HIV Nucleotide Database (May 96)
O: Other (non-human) retroviral sequences (from LANL) only
• Output: fas-sum, in the same directory
• The result is sent by e-mail.
Log on to Huey from outside
• Log on to Valis
• ssh username@blaze.csi.washington.edu
• ssh username@huey.csi.washington.edu
Guide for database search results
• If this is the first time you search for this patient, make sure the sequences are not
closely related to those from other patients already in the database.
r1 env fasta results as example
• If sequences from this patient are already in the database, make sure the new
sequences are closely related to them.
h2 env fasta results as example
m2 env, pol as contamination example
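The checks above can be sketched in code. This is a rough illustration that assumes the summary lists each hit as a ">>" line followed, a few lines later, by a "% identity in N nt overlap" line, as in the result slides. The 98% cut-off is a hypothetical value chosen only for illustration; the workshop does not state a threshold.

```python
import re

IDENTITY = re.compile(r"^([\d.]+)%\s+identity\s+in\s+(\d+)\s+nt\s+overlap")


def parse_hits(text):
    """Return (hit_id, percent_identity, overlap_nt) for each hit in the text."""
    hits = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">>"):
            parts = line[2:].split()
            current = parts[0] if parts else None
            continue
        match = IDENTITY.match(line)
        if match and current is not None:
            hits.append((current, float(match.group(1)), int(match.group(2))))
            current = None
    return hits


def close_hits(hits, min_identity=98.0):
    """Keep hits at or above min_identity (hypothetical cut-off)."""
    return [hit for hit in hits if hit[1] >= min_identity]
```

For a first search on a patient, any close hit deserves a second look. For a patient already in the database, the close hits should all belong to that patient.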
r100705ec1ps
vs LANL HIV Nucleotide Database (May 96) library
>>HIV1U36092
Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
>>HIV1U36094
Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
>>HIV1U36096
Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
……………..
r100705ec2
vs LANL HIV Nucleotide Database (May 96) library
>>HIV1U23138
Human immunodeficiency virus type 1 isola (1446 nt)
initn: 1573 init1: 1573 opt: 2308 Z-score: 1817.6 expect() 6.2e-94
86.452% identity in 620 nt overlap (1-614:723-1336)
>>HIV49957
1 Human immunodeficiency virus type 1 isola (1446 nt)
initn: 1573 init1: 1573 opt: 2308 Z-score: 1817.6 expect() 6.2e-94
86.452% identity in 620 nt overlap (1-614:723-1336)
>>HIVJFL
Human immunodeficiency virus type 1 provi (2553 nt)
initn: 1500 init1: 1353 opt: 2302 Z-score: 1809.9 expect() 9.4e-94
86.356% identity in 623 nt overlap (1-614:789-1405)
……………….
r100705ec3
vs LANL HIV Nucleotide Database (May 96) library
>>377_V09_26B 607 bp DNA 5-JAN-19 (607 nt)
initn: 1549 init1: 1482 opt: 2326 Z-score: 1678.7 expect() 8e-86
87.622% identity in 614 nt overlap (1-604:1-607)
>>HIVU96503
HIV-1 patient 064 clone 064P02 from USA, (664 nt)
initn: 1777 init1: 1046 opt: 2323 Z-score: 1676.1 expect() 1e-85
86.741% identity in 626 nt overlap (1-614:24-643)
>>377V09_26B 606 bp DNA 5-JAN-199 (606 nt)
initn: 1544 init1: 1477 opt: 2321 Z-score: 1675.1 expect() 1.3e-85
87.602% identity in 613 nt overlap (2-604:1-606)
………………….
R1 env database search results (there are no R1 env seqs in the database yet)
Database search results 1
r100705ec4
vs LANL HIV Nucleotide Database (May 96) library
>>HIVSFAAA
Human immunodeficiency virus type 1 genom (3954 nt)
initn: 2246 init1: 1645 opt: 2395 Z-score: 1877.2 expect() 1.1e-97
88.651% identity in 608 nt overlap (1-605:1273-1877)
>>HIVSF162
Human immunodeficiency virus type 1 (HIV- (3954 nt)
initn: 2246 init1: 1645 opt: 2395 Z-score: 1877.2 expect() 1.1e-97
88.651% identity in 608 nt overlap (1-605:1273-1877)
>>HIV1U36091
Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2358 init1: 1596 opt: 2401 Z-score: 1890.9 expect() 1.1e-97
88.760% identity in 605 nt overlap (1-605:14-615)
……………….
r100705eca
vs LANL HIV Nucleotide Database (May 96) library
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1664 init1: 1664 opt: 2393 Z-score: 1968.8 expect() 3.8e-103
88.468% identity in 607 nt overlap (1-604:6330-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1664 init1: 1664 opt: 2393 Z-score: 1968.8 expect() 3.8e-103
88.468% identity in 607 nt overlap (1-604:6330-6936)
>>HIV1U36096
Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2379 init1: 1590 opt: 2396 Z-score: 1984.3 expect() 7.1e-103
88.742% identity in 604 nt overlap (1-604:15-615)
………………….
r100705ecb
vs LANL HIV Nucleotide Database (May 96) library
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1678 init1: 1678 opt: 2389 Z-score: 1938.4 expect() 1.9e-101
88.322% identity in 608 nt overlap (1-605:6329-6936)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1678 init1: 1678 opt: 2389 Z-score: 1938.4 expect() 1.9e-101
88.322% identity in 608 nt overlap (1-605:6329-6936)
>>HIVSFAAA
Human immunodeficiency virus type 1 genom (3954 nt)
initn: 2247 init1: 1627 opt: 2386 Z-score: 1939.9 expect() 3.4e-101
88.487% identity in 608 nt overlap (1-605:1273-1877)
…………………….
Database search results 2
h200926ec1
vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523
HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
…………….
h200926ec2
vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3055 init1: 3055 opt: 3055 Z-score: 2519.9 expect() 1.1e-132
100.000% identity in 611 nt overlap (1-611:1-611)
>>AF105523
HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2441 init1: 2281 opt: 2431 Z-score: 2003.7 expect() 4.9e-104
88.871% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2415 init1: 2415 opt: 2425 Z-score: 2000.0 expect() 1e-103
88.762% identity in 614 nt overlap (1-611:1-611)
…………..
h200926ec3
vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523
HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
……………...
H2 env database search results (there are h2 env seqs in the database already)
Database search results 3
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1785 init1: 1660 opt: 2399 Z-score: 1965.5 expect() 5.7e-103
88.799% identity in 616 nt overlap (1-611:6329-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1785 init1: 1660 opt: 2399 Z-score: 1965.5 expect() 5.7e-103
88.799% identity in 616 nt overlap (1-611:6329-6936)
>>AF204455
HIV-1 clone p2p061-14 country USA envelop (600 nt)
initn: 2406 init1: 2406 opt: 2406 Z-score: 1984.4 expect() 7.6e-103
89.000% identity in 600 nt overlap (2-601:1-600)
………………...
h200926ec4
vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523
HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
………………….
h200926ec5
vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2521.8 expect() 8.6e-133
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523
HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 2003.7 expect() 4.9e-104
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1999.9 expect() 1e-103
88.599% identity in 614 nt overlap (1-611:1-611)
…………………...
Database search results 4
m200712ec2
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2974 init1: 2974 opt: 2974 Z-score: 2235.4 expect() 7.9e-117
99.336% identity in 602 nt overlap (1-602:1-602)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1859.7 expect() 4.5e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1859.7 expect() 4.5e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
……………….
m200712ec4
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2962 init1: 2962 opt: 2962 Z-score: 2212.4 expect() 1.5e-115
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1806 init1: 1711 opt: 2479 Z-score: 1839.1 expect() 6.3e-96
90.296% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1806 init1: 1711 opt: 2479 Z-score: 1839.1 expect() 6.3e-96
90.296% identity in 608 nt overlap (1-602:6329-6936)
………………..
m200712ec5
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2965 init1: 2965 opt: 2965 Z-score: 2252.2 expect() 9.1e-118
99.169% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2482 Z-score: 1872.7 expect() 8.5e-98
90.461% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2482 Z-score: 1872.7 expect() 8.5e-98
90.461% identity in 608 nt overlap (1-602:6329-6936)
………………….
M2 env database search results (they show evidence of cross-patient contamination with H1)
Database search results 5
m200712ec6
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2974 init1: 2974 opt: 2974 Z-score: 2261.3 expect() 2.8e-118
99.336% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1881.4 expect() 2.8e-98
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1881.4 expect() 2.8e-98
90.625% identity in 608 nt overlap (1-602:6329-6936)
……………….
m200712ec7
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2956 init1: 2956 opt: 2956 Z-score: 2275.0 expect() 4.9e-119
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1836 init1: 1741 opt: 2509 Z-score: 1918.2 expect() 2.5e-100
90.954% identity in 608 nt overlap (1-602:6329-6936)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1836 init1: 1741 opt: 2509 Z-score: 1918.2 expect() 2.5e-100
90.954% identity in 608 nt overlap (1-602:6329-6936)
……………….
m200712ec8
vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2956 init1: 2956 opt: 2956 Z-score: 2216.9 expect() 8.4e-116
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632
HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 2453 init1: 1723 opt: 2491 Z-score: 1855.6 expect() 7.6e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL
Human immunodeficiency virus type 1, isol (8896 nt)
initn: 2453 init1: 1723 opt: 2491 Z-score: 1855.6 expect() 7.6e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
………………….
Database search results 6
Outgroup
to find several sequences that can serve as the outgroup for the sequences that
you are interested in
1. Pick a couple of your sequences.
2. BLAST them against NCBI GenBank and choose 6 to 12 sequences as your outgroup.
3. I keep my outgroup in fasta format in one file (show example).
Outgroup examples
>HIVU63632 618 bp DNA 14-JUN-1999, 618 bases, 5E9DEF6C checksum.
agaagaagaggtagtaattagatctgacaatttcacgaacaatgctaaaa
ccataatagtacagctgaaagaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccagggagagc
……..
>HIVJRCSF 618 bp DNA 14-JUN-1999, 618 bases, 89D0C733 checksum.
agaagaaaaggttgtaattagatctgacaattttacggacaatgctaaaa
ccataatagtacagctgaatgaatctgtaaaaattaattgtacaaggccc
agcaacaatacaagaaaaagtatacatata------ggaccagggagagc
……...
>HIVU95410 618 bp DNA 14-JUN-1999, 618 bases, C4E18A19 checksum.
agaagaagaggtagtaattagatccgacaatttcacggacaatgctaaaa
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
……….
>HIVU95413 618 bp DNA 14-JUN-1999, 618 bases, E4689196 checksum.
agaagaagaggtagtaattagatccgacaatttcacggacaatgctaaaa
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
……...
>HIVBAL1A 618 bp DNA 14-JUN-1999, 618 bases, 4184AF81 checksum.
agaagaagaggtagtaattagatccgccaatttcgcggacaatgctaaag
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
…………..
clustalw alignment
1. On Unix, put all sequences in one directory: individual sequence files only, including the
outgroup sequences, and nothing else (on valis, r1pol is the demo example and r1env the demo
results).
2. Command line:
cat * > all
clustalw all &
(default output files are all.aln and all.dnd)
(the clustalw program is available on valis, huey, watson, crick....)
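The concatenation step can also be done with a short script that checks the "individual sequences only" rule as it goes. This is only a sketch: `concatenate_sequences` is a hypothetical helper, and it assumes every input file is FASTA text with exactly one ">" header.

```python
from pathlib import Path


def concatenate_sequences(directory, out_name="all"):
    """Join every single-record FASTA file in `directory` into `out_name`.

    Returns the input file names in the order they were written.
    """
    directory = Path(directory)
    inputs = sorted(
        p for p in directory.iterdir() if p.is_file() and p.name != out_name
    )
    parts = []
    for path in inputs:
        text = path.read_text()
        n_records = sum(1 for line in text.splitlines() if line.startswith(">"))
        if n_records != 1:
            # Stray files (old .aln/.dnd output, notes) are caught here.
            raise ValueError(
                f"{path.name}: expected 1 FASTA record, found {n_records}"
            )
        parts.append(text if text.endswith("\n") else text + "\n")
    (directory / out_name).write_text("".join(parts))
    return [p.name for p in inputs]
```

Files are written in name order, so the combined file is already sorted by sequence name, which matters later when the output order is set to input order.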
Steps that are not necessary for preliminary quality checking
3. Change the .aln format to .gde format in clustalw.
4. Open GDE on valis or sage to adjust the alignment, generate a consensus of the first timepoint,
put the consensus at the top of the alignment, and translate to an amino acid alignment. Then
export into .ig format.
5. Pretty dot picture:
On the valis command line: pcgdots filename.ig
pcgdots -l 120 -o outfilename infilename.ig
Play with sequence format
Two methods are introduced here:
1. Using clustalw to convert all.aln into GCG/MSF, Phylip, and GDE formats
2. Using the readseq command to convert sequences into fasta format
(for the hypermutation test later in this example)
ClustalW
command line: %clustalw
**************************************************************
******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice: 1
Sequences should all be in 1 file.
7 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF,
RSF.
Enter the name of the sequence file: allresult.aln
Clustal example 2
Sequence format is Clustal
Sequences assumed to be DNA
Sequence 1: r100705ec2_ 667 bp
Sequence 2: r100705ecj_ 667 bp
Sequence 3: r196807ecl_ 667 bp
Sequence 4: r198610ecb_ 667 bp
Sequence 5: r198610ecd_ 667 bp
Sequence 6: r196807eca_ 667 bp
Sequence 7: r196807ecj_ 667 bp
Sequence 8: r198610ec2_ 667 bp
Sequence 9: r196807ech_ 667 bp
Sequence 10: r100705ece_ 667 bp
Sequence 11: r199d22ec5_ 667 bp
.
.
Clustal example 3
**************************************************************
******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice: 2
Clustal example 4
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps before alignment? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 9
Clustal example 5
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = OFF
5. Toggle GDE format output = OFF
6. Toggle GDE output case = LOWER
7. Toggle CLUSTALW sequence numbers = OFF
8. Toggle output order = ALIGNED (attention here)
9. Create alignment output file(s) now?
0. Toggle parameter output = OFF
H. HELP
Enter number (or [RETURN] to exit): 5
Clustal example 6
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = OFF
5. Toggle GDE format output = ON
6. Toggle GDE output case = LOWER
7. Toggle CLUSTALW sequence numbers = OFF
8. Toggle output order = ALIGNED
9. Create alignment output file(s) now?
0. Toggle parameter output = OFF
H. HELP
Enter number (or [RETURN] to exit): 4
Clustal example 7
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = ON
5. Toggle GDE format output = ON
6. Toggle GDE output case = LOWER
7. Toggle CLUSTALW sequence numbers = OFF
8. Toggle output order = ALIGNED
9. Create alignment output file(s) now?
0. Toggle parameter output = OFF
H. HELP
Enter number (or [RETURN] to exit): 9
Clustal example 8
WARNING: Output file name is the same as input file.
Enter new name to avoid overwriting
[allresult.aln]: all.aln
Enter a name for the PHYLIP output file
[allresult.phy]: all.phy
Enter a name for the GDE output file
[allresult.gde]: all.gde
Consensus length = 667
CLUSTAL-Alignment file created [all.aln]
WARNING: Truncating sequence names to 10 characters for PHYLIP output.
PHYLIP-Alignment file created [all.phy]
GDE-Alignment file created [all.gde]
Clustal example 9
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = ON
5. Toggle GDE format output = ON
6. Toggle GDE output case = LOWER
7. Toggle CLUSTALW sequence numbers = OFF
8. Toggle output order = ALIGNED
9. Create alignment output file(s) now?
0. Toggle parameter output = OFF
H. HELP
Enter number (or [RETURN] to exit): (press Return)
Clustal example 10
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps before alignment? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: (return)
Clustal example 11
**************************************************************
******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice: x
readseq:
to get fasta format
•command line: %readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
r1e.fasta
1. IG/Stanford          10. Olsen (in-only)
2. GenBank/GB           11. Phylip3.2
3. NBRF                 12. Phylip
4. EMBL                 13. Plain/Raw
5. GCG                  14. PIR/CODATA
6. DNAStrider           15. MSF
7. Fitch                16. ASN.1
8. Pearson/Fasta        17. PAUP/NEXUS
9. Zuker (in-only)      18. Pretty (out-only)
Choose an output format (name or #):
8
Name an input sequence or -option:
all.phy
Sequences in all.phy (format is 12. Phylip)
1) r100705ec2
2) r100705ecj
3) r196807ecl
4) r198610ecb
5) r198610ecd
6) r196807eca
7) r196807ecj
8) r198610ec2
9) r196807ech
10) r100705ece
.
.
.
.
.
Choose a sequence (# or All):
all
Readseq 2
Name an input sequence or -option:
(Return)
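For readers without readseq, the same conversion can be sketched in a few lines. This is not how readseq works internally. It assumes a strict interleaved PHYLIP file: a first line with the sequence count and alignment length, names in the first 10 columns of the first block, and later blocks holding sequence data only. The function name is hypothetical.

```python
def phylip_interleaved_to_fasta(text, width=60):
    """Convert strict interleaved PHYLIP text to FASTA text (gaps are kept)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seqs:
            # First block: 10-column name field, then sequence data.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks: sequence data only, in the same order.
            seqs[i % n_seqs] += line.replace(" ", "")
    if len(names) != n_seqs:
        raise ValueError("fewer sequences than the header line declares")
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    out = []
    for name, seq in zip(names, seqs):
        out.append(">" + name)
        out.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Because gap characters are preserved, the output is still an alignment, which is what the hypermutation step expects.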
Hypermutation
Definition of Simon Wain-Hobson
1. Monotony of G->A transitions with respect to the viral plus strand, maybe
occasionally accompanied by a few (<5%) other substitutions.
2. All parts of retroviral genome are vulnerable.
3. Overall the number of G->A transitions per sequence should be >5 while transition
frequency should be >5% of number of Gs. Up to 60% of Gs can be substituted.
4. The distribution of substitutions may be confined to a very small region, say 50 bp.
Equally, they may be distributed in an erratic manner throughout the genome.
5. G->A transitions are associated with dinucleotide context declining in the order
GpA>GpG>GpT>GpC. Occasionally a few examples have GpG>GpA.
6. May be accompanied by small deletions of 1-5 bases. Larger deletions and small
insertions (1-3) bases are rarer.
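Criteria 3 and 5 lend themselves to a quick count. The sketch below is not the Hypermut program. It compares one aligned query with one reference (for example the first-timepoint consensus), and it makes two assumptions: both sequences are plus strand and equal in length, and the dinucleotide context is read from the base that follows the G in the reference. Only the thresholds in criterion 3 are applied; the function name and dictionary keys are hypothetical.

```python
def g_to_a_summary(reference, query):
    """Count G->A changes between two aligned plus-strand sequences."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    ref, qry = reference.upper(), query.upper()
    bases = set("ACGT")
    n_g = g_to_a = other = 0
    context = {"GA": 0, "GG": 0, "GT": 0, "GC": 0}
    for i, (r, q) in enumerate(zip(ref, qry)):
        if r not in bases or q not in bases:
            continue  # skip gaps and ambiguity codes
        if r == "G":
            n_g += 1
        if r == q:
            continue
        if r == "G" and q == "A":
            g_to_a += 1
            # Next ungapped reference base defines the GpN context.
            nxt = next((b for b in ref[i + 1:] if b in bases), None)
            if nxt is not None:
                context["G" + nxt] += 1
        else:
            other += 1
    fraction = g_to_a / n_g if n_g else 0.0
    return {
        "reference_g": n_g,
        "g_to_a": g_to_a,
        "fraction_of_g": fraction,
        "other_substitutions": other,
        "context": context,
        # Criterion 3: more than 5 transitions and more than 5% of Gs.
        "meets_criterion_3": g_to_a > 5 and fraction > 0.05,
    }
```

A sequence that passes this count is only a candidate; the context order and the share of other substitutions still need to be looked at.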
Hypermut program of Bette Korber (LANL)
http://hiv-web.lanl.gov/HYPERMUT/hypermut.html
Copy the file "r1e.fasta" (a fasta-format alignment). I have found that .phy format sometimes does
not work well, even though this site accepts it.
H2 env as example
A couple of RT regions as examples (h1pro, h2rt)
R1env. hyperdemo file as example
distance matrix: to generate a distance matrix for tree making and diversity calculation
We will use the phylip package on Valis (Unix) today. Phylip is also available for Mac and PC. It takes
.phy format.
On a Mac, start with an alignment file, for example all.phy, to make a distance matrix:
put this file in the same directory as the program "dnadist" and double-click the program; it will ask for
the infile name. When the program is done, the result is a file called outfile. Rename it if you want to
keep it (otherwise it will be overwritten).
On UNIX:
yang /export/home/mullab/yang/Sequences/PhylogeneticTraining/r1env
%dnadist
dnadist: can't read infile
Please enter a new filename>all.phy
Distance matrix 2
Nucleic acid sequence Distance Matrix program, version 3.572c
Settings for this run:
D Distance (Kimura, Jin/Nei, ML, J-C)? Kimura 2-parameter
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, VT52, ANSI)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Are these settings correct? (type Y or letter for one to change)
d
Distance matrix 3
Nucleic acid sequence Distance Matrix program, version 3.572c
Settings for this run:
D Distance (Kimura, Jin/Nei, ML, J-C)? Jin and Nei
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, VT52, ANSI)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Are these settings correct? (type Y or letter for one to change)
d
Distance matrix 4
Nucleic acid sequence Distance Matrix program, version 3.572c
Settings for this run:
D Distance (Kimura, Jin/Nei, ML, J-C)? Maximum Likelihood
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, VT52, ANSI)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Are these settings correct? (type Y or letter for one to change)
l
Distance matrix 5
Nucleic acid sequence Distance Matrix program, version 3.572c
Settings for this run:
D Distance (Kimura, Jin/Nei, ML, J-C)? Maximum Likelihood
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
F Use empirical base frequencies? Yes
L Form of distance matrix? Lower-triangular
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, VT52, ANSI)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Are these settings correct? (type Y or letter for one to change)
y
Distances calculated for species
r100705ec2
r100705ecj
r196807ecl
r198610ecb
r198610ecd
r196807eca
r196807ecj
r198610ec2
r196807ech
r100705ece
r199d22ec5
r100705ech
r198610ece
r198610ec8
r199d22ec3
r100705eci
r199d22eca
r196807ecd
r196807eck
r199d22ech
r198610ecf
r196807ece
r196807eci
r199d22ecd
r199d22ecf
r196807ecb
r199d22ecc
r199d22ec4
r199d22ecb
r100705ec1
r100705ecb
r198610ec3
r198610ec7
r199d22ecg
r195301eca
r195301ece
r196807ecc
r100705ecc
r198610ec1
(on screen, each name above is followed by a row of progress dots; each row is one dot shorter than the row before)
Distance matrix 6
Distances written to file (default name: outfile)
To rename the outfile: mv outfile r1edist
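The lower-triangular file can also be read directly by a script. The sketch below assumes the layout dnadist produced in this run: the number of sequences first, then for each sequence its name followed by the distances to all earlier sequences, with long rows allowed to wrap onto continuation lines. It further assumes names contain no spaces and are separated from the first number by whitespace. `read_lower_triangular` is a hypothetical helper.

```python
def read_lower_triangular(text):
    """Return (names, full symmetric matrix) from a lower-triangular file."""
    tokens = text.split()
    n = int(tokens[0])
    pos = 1
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        names.append(tokens[pos])
        pos += 1
        for j in range(i):
            distance = float(tokens[pos])
            pos += 1
            matrix[i][j] = matrix[j][i] = distance
    if pos != len(tokens):
        raise ValueError("unexpected extra values at the end of the matrix")
    return names, matrix
```

Reading by whitespace-separated tokens makes the wrapped continuation lines harmless, so no manual joining of lines is needed.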
making tree using phylip package on valis
yang /export/home/mullab/yang/Sequences/PhylogeneticTraining/r1env
%neighbor3.6
neighbor3.6: can't find input file "infile"
Please enter a new file name> r1edist
Neighbor-Joining/UPGMA method version 3.5
Settings for this run:
N Neighbor-joining or UPGMA tree? Neighbor-joining
O Outgroup root? No, use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplicates? No
J Randomize input order of species? No. Use input order
M Analyze multiple data sets? No
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
l
Neighbor-joining 2
Neighbor-Joining/UPGMA method version 3.5
Settings for this run:
N Neighbor-joining or UPGMA tree? Neighbor-joining
O Outgroup root? No, use as outgroup species 1
L Lower-triangular data matrix? Yes
R Upper-triangular data matrix? No
S Subreplicates? No
J Randomize input order of species? No. Use input order
M Analyze multiple data sets? No
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
y
Cycle 66: OTU 64 ( -0.00020) joins OTU 65 ( 0.00190)
Cycle 65: node 64 ( 0.01700) joins OTU 66 ( 0.02905)
Cycle 64: node 64 ( 0.01058) joins OTU 67 ( 0.01462)
Cycle 63: node 64 ( 0.00750) joins OTU 68 ( 0.04174)
.
.
.
.
.
.
.
last cycle:
node 1 ( 0.00001) joins node 30 ( 0.00007) joins OTU 34 ( 0.00163)
Neighbor-joining 3
Output written on output file (there is one output file: outfile)
Tree written on tree file (default output treefile name : outtree)
Done.
• To change outtree file to new name: mv outtree r1enjtree
• Use treeView program to open treefile and root with outgroup
TreeView Program to open treefile
•Tree->define outgroup
•Tree->root with outgroup
•Tree->order
•Select ladderise left (right), click OK
•Save as graphic file (pic), open and play with Canvas
calculate intra-timepoint diversity
using distance matrix
1. Open r1edist in Word first.
2. Replace every "paragraph mark + space + space" with "space + space", and save
as a text file.
3. Open it in Excel (select fixed width).
4. Calculate the average distance within each timepoint using Excel
tools.
If you do not use GDE to sort the alignment after the clustalw alignment, you have to
run clustalw in the foreground instead of the background. When it runs in the foreground, you
can change the output order to "Input order" instead of "ALIGNED", so that
the sequence output is sorted by name. You need a name-sorted
distance matrix to calculate
intra-timepoint population diversity.
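As an alternative to the Word and Excel route, the averaging can be sketched in code. The function takes sequence names and a full symmetric distance matrix. It assumes, purely for illustration, that the first seven characters of a name (for example r100705) identify the patient and timepoint; pass a different `timepoint_of` function if your names are built differently. Outgroup sequences end up in groups of their own and can be ignored.

```python
from itertools import combinations


def intra_timepoint_diversity(names, matrix, timepoint_of=lambda name: name[:7]):
    """Return {timepoint: mean pairwise distance among its sequences}.

    A timepoint with a single sequence has no pairs and maps to None.
    """
    groups = {}
    for index, name in enumerate(names):
        groups.setdefault(timepoint_of(name), []).append(index)
    diversity = {}
    for timepoint, members in groups.items():
        pairs = list(combinations(members, 2))
        if pairs:
            total = sum(matrix[i][j] for i, j in pairs)
            diversity[timepoint] = total / len(pairs)
        else:
            diversity[timepoint] = None
    return diversity
```

Because sequences are grouped by name rather than by position, this version does not depend on the matrix being sorted by name.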
What do we have here?
From a group of individual nucleotide sequences, we:
• Get rid of contamination (if any)
• Get rid of hypermutated sequences (if any)
• Find a group of sequences as outgroup
• Make a nucleotide alignment using clustalw
• Change sequence formats as needed for later steps
• Make an F84 neighbor-joining tree and root it with the outgroup
• Calculate nucleotide intra-timepoint diversity
• Plot the diversity change over time