PROGRAMMING LAB ASSIGNMENTS

advertisement
PROGRAMMING LAB ASSIGNMENTS - BIOINFORMATICS
Using basic genomics sequence analysis in Java
Background:
The data file (sequence.txt) will be used in this lab assignment. The text file will be in a FASTA
format that will have nucleotide sequences that will follow. A description of the FASTA format is
below taken from Wikipedia.
The FASTA format is a text-based format for representing either nucleotide sequences or
peptide sequences, in which base pairs or amino acids are represented using single-letter
codes. The format also allows for sequence names and comments to precede the
sequences. The format originates from the FASTA software package, but has now
become a standard in the field of bioinformatics.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a greaterthan (">") symbol in the first column. The word following the ">" symbol is the
identifier of the sequence, and the rest of the line is the description (both are optional).
There should be no space between the ">" and the first letter of the identifier. It is
recommended that all lines of text be shorter than 80 characters. The sequence ends if
another line starting with a ">" appears; this indicates the start of another sequence. A
simple example of one sequence in FASTA format is below:
>ENSG00000142166|ENST00000270139
ATGATGGTCGTCCTCCTGGGCGCGACGACCCTAGTGCTCGTCGCCGTGGCGCCATGGGTG
TTGTCCGCAGCCGCAGGTGGAAAAAATCTAAAATCTCCTCAAAAAGTAGAGGTCGACATC
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT
TTTTCATTCGATTATCAAAAAACTGGGATGGATAATTGGATAAAATTGTCTGGGTGTCAG
AATATTACTAGTACCAAATGCAACTTTTCTTCACTCAAGCTGAATGTTTATGAAGAAATT
AAATTGCGTATAAGAGCAGAAAAAGAAAACACTTCTTCATGGTATGAGGTTGACTCATTT
ACACCATTTCGCAAAGCTCAGATTGGTCCTCCAGAAGTACATTTAGAAGCTGAAGATAAG
GCAATAGTGATACACATCTCTCCTGGAACAAAAGATAGTGTTATGTGGGCTTTGGATGGT
TTAAGCTTTACATATAGCTTAGTTATCTGGAAAAACTCTTCAGGTGTAGAAGAAAGGATT
GAAAATATTTATTCCAGACATAAAATTTATAAACTCTCACCAGAGACTACTTATTGTCTA
AAAGTTAAAGCAGCACTACTTACGTCATGGAAAATTGGTGTCTATAGTCCAGTACATTGT
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT
TTTTCATTCGATTATCAAAAAACTGGGATGGATAATTGGATAAAATTGTCTGGGTGTCAG
AATATTACTAGTACCAAATGCAACTTTTCTTCACTCAAGCTGAATGTTTATGAAGAAATT
TTGTCCGCAGCCGCAGGTGGAAAAAATCTAAAATCTCCTCAAAAAGTAGAGGTCGACATC
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT
1
Assignment:
1. Write a simple program to input a DNA "sequence" as input to a Java program and
make the program print it out. The input will be done through the text file that is in
FASTA format. The output will be done in two ways: console and then write it to a
file.
2. Write a program to convert a given strand of a DNA sequence into its corresponding
reverse complemented strand sequence. Use regular expressions and pattern matching
to achieve this goal. Use the same output file as in number one. Write your output to a
file. Regular Expressions are are implemented using finite automata which are state
machines. Try thinking about it like this state machine:
Here, q0 is the initial/start state and the q3 is the "final/accepting" state.
So if at any point of time the machine reaches state q3 it means it has
found a pattern "cow" in the main text.
3. Write a program to convert a DNA sequence into its corresponding RNA
sequence. The main operation needed is to substitute every "T" with a "U". You
will be using string translation for this exercise.
4. Write a program to convert "translate" a coding sequence of a particular RNA into
its corresponding amino acid sequence. Use the RNA conversion text file you
created in exercise 3. A hash table will be used to store the genetic code table. ook
at the last page for a codon translation table. Remember to check to see if your
test data is divisible by 3 for it to be viable and that it can only include the letters
A,U, C, G. Please use a hash table for this exercise. Read over the hashing lesson
before we discuss it in class.
2
Extension: If you have completed ALL your programming assignments, and I have
checked them, please complete the following exercise for extra credit. Please read the
handout on SNP’s before attempting this exercise. You will have two text files: seq1.txt
and seq2.txt.
In this exercise, you will compare two DNA sequences (of the same length) and identify
sites of point mutations or Single Nucleotide Polymorphisms (SNPs).
Genetic Code Chart
http://www.he-astro.physics.utah.edu/~cassiday/p1080/lec06.html
3
Download