Bioinformatics Using Python_001 - bio-bio-1

Problem No 1
Determination of GC Content
Among the four nucleotides {A, T, C, G}, the proportion of C and G in a DNA sequence carries
important biological signals. This proportion is measured as the “GC-content %” using the following formula.
GC-content % = ((n(G) + n(C)) / len(DNA)) * 100%
Where
n(G) = number of G in the sequence
n(C) = number of C in the sequence
len(DNA) = length of the DNA sequence in base pairs (bp)
Write a Python program that performs as described below.
Input
1. DNA sequence as a string
Output
1. Length of the sequence
2. GC-content %
Example
Input
ATCG
Output
Length of the sequence = 4 bp
GC-content % = 50%
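One possible way to approach this problem is sketched below. The function name gc_content is an illustrative choice, not part of the problem statement, and the sketch assumes the input contains only the letters A, T, C and G (in either case).

```python
def gc_content(dna):
    """Return (length in bp, GC-content as a percentage) for a DNA string."""
    seq = dna.strip().upper()          # tolerate lowercase input and stray whitespace
    if not seq:
        raise ValueError("empty DNA sequence")
    gc = seq.count("G") + seq.count("C")
    return len(seq), gc / len(seq) * 100


length, gc = gc_content("ATCG")
print(f"Length of the sequence = {length} bp")
print(f"GC-content % = {gc:g}%")
```

The empty-sequence check is there because the formula divides by len(DNA), which would otherwise be zero.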
Problem No 2
Complement DNA strand
DNA forms a double-helix structure made of two strands. When we work with a DNA
sequence, we usually talk about a single DNA sequence (a single strand), but in the chromosome DNA
remains in double-stranded form. The two strands are called complements of each other. One is
named 5’-3’ (forward strand) and the other is named 3’-5’ (reverse strand). When the strand type
(or direction) is not explicitly mentioned, the DNA sequence is assumed to be the 5’-3’, or
forward, strand.
5’---ACCGTA---3’
     ||||||
3’---TGGCAT---5’
In a complement DNA strand, each base of the original DNA sequence is replaced according to the following interchange rule:
A is replaced by T and vice versa
C is replaced by G and vice versa
This is because, in the double-helix structure, an A on one strand is connected by hydrogen bonds to a T
on the other strand, and likewise for C and G.
Write a Python program that performs as described below.
Input
1. DNA sequence as a string
Output
1. Complement of input DNA sequence
Example
Input
ACCGTA
Output
Complement DNA Sequence = TGGCAT
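The interchange rule can be sketched with Python's built-in str.maketrans and str.translate. The names COMPLEMENT and complement are illustrative assumptions, and the sketch leaves any character other than A, T, C, G (upper or lower case) unchanged.

```python
# Translation table implementing A<->T and C<->G for both cases.
COMPLEMENT = str.maketrans("ATCGatcg", "TAGCtagc")


def complement(dna):
    """Return the complement strand, base by base, in the same order as the input."""
    return dna.translate(COMPLEMENT)


print("Complement DNA Sequence =", complement("ACCGTA"))
```

A dictionary lookup inside a loop would work just as well; the translation table simply avoids writing the loop by hand.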
Problem No 3
Reverse Complement of a DNA Sequence
This problem can be thought of as an extension of Problem No 2 (read Problem No 2 first). In
bioinformatics analysis, the concept of a reverse-complement DNA sequence is encountered very often. If
the complement of a DNA sequence is reversed, the result is called the reverse complement of the original DNA
sequence.
5’---ACCGTA---3’
     ||||||
3’---TGGCAT---5’
Here the complement of (5’---ACCGTA---3’) is (3’---TGGCAT---5’), and the reverse of (3’---TGGCAT---5’) is
TACGGT, so the reverse complement of ACCGTA is TACGGT.
Write a Python program that performs as described below.
Input
1. DNA sequence as a string
Output
1. Reverse Complement of input DNA sequence
Example
Input
ACCGTA
Output
Reverse Complement DNA Sequence = TACGGT
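A minimal sketch of the two steps (complement, then reverse) follows. The function name reverse_complement is an illustrative choice, and the input is assumed to contain only A, T, C and G; it is converted to upper case first.

```python
# Translation table implementing A<->T and C<->G.
_COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(dna):
    """Complement every base, then reverse the resulting string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


print("Reverse Complement DNA Sequence =", reverse_complement("ACCGTA"))
```

Reversing first and complementing second gives the same result, since the two operations do not interfere with each other.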
Problem No 4
Codon List from a DNA sequence
Triplets of nucleotides (for example ATT, TCG, CCC) are called codons. Through the processes of
transcription and translation, each codon of a DNA sequence specifies one
amino acid (or a stop signal), and the resulting chain of amino acids builds a protein. There are 64 (4 x 4 x 4)
possible codons.
Consider the DNA sequence ATTTCGAGGT. If we parse codons from left to right, the
codons are ATT, TCG, AGG (ignore the right-most remaining part with length < 3 bp, in this case T).
Write a Python function/program that returns the list of codons for a DNA sequence. The program
should return/print the codons as a Python “list” data structure.
Input
1. DNA sequence as a string
Output
1. List of Codons
Example
Input
ATTTCGAGGT
Output
Codon-List = ['ATT', 'TCG', 'AGG']
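The left-to-right parsing described above can be sketched with string slicing in steps of three. The function name codon_list is an illustrative assumption; the sketch does not validate the characters of the input.

```python
def codon_list(dna):
    """Split a DNA string into consecutive 3-base codons, dropping any trailing 1-2 bases."""
    usable = len(dna) - len(dna) % 3   # length of the part that divides evenly into codons
    return [dna[i:i + 3] for i in range(0, usable, 3)]


print("Codon-List =", codon_list("ATTTCGAGGT"))
```

For sequences shorter than 3 bp the function returns an empty list.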
Problem No 5
Translate a DNA Sequence
Each codon represents an amino acid (skim through Problem No 4). The standard codon-to-amino-acid
mapping table is called the “Standard Genetic Code Table” or “Codon-Table”. It is built for codons
derived from RNA (the details are discussed separately, beyond this problem); as a result you will find
U instead of T. For simplicity, in this specific problem you should use the customized
(for DNA) genetic code table.
Standard Genetic Code
(rows: first base; columns: second base; right-hand column: third base)

        U              C              A              G
U   UUU Phe (F)    UCU Ser (S)    UAU Tyr (Y)    UGU Cys (C)    U
    UUC Phe (F)    UCC Ser (S)    UAC Tyr (Y)    UGC Cys (C)    C
    UUA Leu (L)    UCA Ser (S)    UAA Stop       UGA Stop       A
    UUG Leu (L)    UCG Ser (S)    UAG Stop       UGG Trp (W)    G
C   CUU Leu (L)    CCU Pro (P)    CAU His (H)    CGU Arg (R)    U
    CUC Leu (L)    CCC Pro (P)    CAC His (H)    CGC Arg (R)    C
    CUA Leu (L)    CCA Pro (P)    CAA Gln (Q)    CGA Arg (R)    A
    CUG Leu (L)    CCG Pro (P)    CAG Gln (Q)    CGG Arg (R)    G
A   AUU Ile (I)    ACU Thr (T)    AAU Asn (N)    AGU Ser (S)    U
    AUC Ile (I)    ACC Thr (T)    AAC Asn (N)    AGC Ser (S)    C
    AUA Ile (I)    ACA Thr (T)    AAA Lys (K)    AGA Arg (R)    A
    AUG Met (M)    ACG Thr (T)    AAG Lys (K)    AGG Arg (R)    G
G   GUU Val (V)    GCU Ala (A)    GAU Asp (D)    GGU Gly (G)    U
    GUC Val (V)    GCC Ala (A)    GAC Asp (D)    GGC Gly (G)    C
    GUA Val (V)    GCA Ala (A)    GAA Glu (E)    GGA Gly (G)    A
    GUG Val (V)    GCG Ala (A)    GAG Glu (E)    GGG Gly (G)    G
Customized (for DNA) Genetic Code
ttt: F
ttc: F
tta: L
ttg: L
tct: S
tcc: S
tca: S
tcg: S
tat: Y
tac: Y
taa: *
tag: *
ctt: L
ctc: L
cta: L
ctg: L
cct: P
ccc: P
cca: P
ccg: P
cat: H
cac: H
caa: Q
cag: Q
tgt: C
tgc: C
tga: *
tgg: W
cgt: R
cgc: R
cga: R
cgg: R
att: I
act: T
aat: N
agt: S
atc: I
acc: T
aac: N
agc: S
ata: I
aca: T
aaa: K
aga: R
atg: M
acg: T
aag: K
agg: R
gtt: V
gtc: V
gta: V
gtg: V
gct: A
gcc: A
gca: A
gcg: A
gat: D
gac: D
gaa: E
gag: E
ggt: G
ggc: G
gga: G
ggg: G
There are 20 different amino acids. The detailed table is given below.
20 Amino Acids and Their Codes
No.   1-Letter Code   3-Letter Code   Name
1     A               Ala             Alanine
2     R               Arg             Arginine
3     N               Asn             Asparagine
4     D               Asp             Aspartic acid
5     C               Cys             Cysteine
6     Q               Gln             Glutamine
7     E               Glu             Glutamic acid
8     G               Gly             Glycine
9     H               His             Histidine
10    I               Ile             Isoleucine
11    L               Leu             Leucine
12    K               Lys             Lysine
13    M               Met             Methionine
14    F               Phe             Phenylalanine
15    P               Pro             Proline
16    S               Ser             Serine
17    T               Thr             Threonine
18    W               Trp             Tryptophan
19    Y               Tyr             Tyrosine
20    V               Val             Valine
Write a function/program that takes a DNA sequence and returns/prints the translated protein
sequence (using the customized codon table, and representing amino acids by their 1-letter codes). Ignore
the right-most incomplete codon of length < 3 bp, as explained in Problem No 4.
Input
1. DNA sequence as a string
Output
1. Amino Acids Sequence of Protein
Example
Input
TTTCCTAATC
Output
Protein Sequence = FPN
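A minimal sketch of one way to translate a sequence is shown below. The names BASES, AMINO_ACIDS, CODON_TABLE and translate are illustrative assumptions. Instead of typing all 64 entries, the sketch builds the DNA codon table from a 64-letter string laid out in t, c, a, g order (tga is taken as a stop codon and tgg as W, as in the standard genetic code). It also assumes stop codons are written as * and translation continues past them, since the problem statement does not say otherwise.

```python
from itertools import product

BASES = "tcag"
# One amino-acid letter per codon, in the order ttt, ttc, tta, ttg, tct, ... ggg.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing incomplete codon."""
    seq = dna.lower()                  # the customized table uses lowercase codons
    usable = len(seq) - len(seq) % 3
    # A codon with a character outside a, t, c, g raises KeyError.
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


print("Protein Sequence =", translate("TTTCCTAATC"))
```

Writing the table out as an explicit dictionary literal, entry by entry, is equally valid and may be easier to check against the customized table.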