PROGRAMMING LAB ASSIGNMENTS

PROGRAMMING LAB ASSIGNMENTS - BIOINFORMATICS
Using basic genomics sequence analysis in Java

Background:

The data file (sequence.txt) will be used in this lab assignment. It is a text file in FASTA format containing nucleotide sequences. A description of the FASTA format, taken from Wikipedia, follows.

The FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
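The format rules above can be sketched as a small Java reader. This is only one possible approach, not a required solution: the class name FastaSketch, the method parse, and the output file name output.txt are assumptions made for illustration; only sequence.txt comes from the handout. The sketch assumes description lines are unique and ignores any sequence data that appears before the first ">" line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: the names here are assumptions, not part of the handout.
final class FastaSketch {

    // Maps each description line (without the leading '>') to its full sequence.
    // Sequence lines are trimmed, upper-cased, and joined in file order.
    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                          // skip blank lines
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();     // a new record starts here
                building.put(line.substring(1), current);
            } else if (current != null) {
                current.append(line.toUpperCase());
            }
        }
        Map<String, String> records = new LinkedHashMap<>();
        building.forEach((header, seq) -> records.put(header, seq.toString()));
        return records;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> records =
            parse(Files.readAllLines(Paths.get("sequence.txt")));
        StringBuilder out = new StringBuilder();
        records.forEach((header, seq) ->
            out.append('>').append(header).append('\n').append(seq).append('\n'));
        System.out.print(out);                     // console output
        Files.write(Paths.get("output.txt"),       // assumed output file name
                    out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the parsing in a method that takes a list of lines, separate from the file handling in main, makes it easy to try the reader on a few hand-written lines before running it on the real data file.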
A simple example of one sequence in FASTA format is below:

>ENSG00000142166|ENST00000270139
ATGATGGTCGTCCTCCTGGGCGCGACGACCCTAGTGCTCGTCGCCGTGGCGCCATGGGTG
TTGTCCGCAGCCGCAGGTGGAAAAAATCTAAAATCTCCTCAAAAAGTAGAGGTCGACATC
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT
TTTTCATTCGATTATCAAAAAACTGGGATGGATAATTGGATAAAATTGTCTGGGTGTCAG
AATATTACTAGTACCAAATGCAACTTTTCTTCACTCAAGCTGAATGTTTATGAAGAAATT
AAATTGCGTATAAGAGCAGAAAAAGAAAACACTTCTTCATGGTATGAGGTTGACTCATTT
ACACCATTTCGCAAAGCTCAGATTGGTCCTCCAGAAGTACATTTAGAAGCTGAAGATAAG
GCAATAGTGATACACATCTCTCCTGGAACAAAAGATAGTGTTATGTGGGCTTTGGATGGT
TTAAGCTTTACATATAGCTTAGTTATCTGGAAAAACTCTTCAGGTGTAGAAGAAAGGATT
GAAAATATTTATTCCAGACATAAAATTTATAAACTCTCACCAGAGACTACTTATTGTCTA
AAAGTTAAAGCAGCACTACTTACGTCATGGAAAATTGGTGTCTATAGTCCAGTACATTGT
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT
TTTTCATTCGATTATCAAAAAACTGGGATGGATAATTGGATAAAATTGTCTGGGTGTCAG
AATATTACTAGTACCAAATGCAACTTTTCTTCACTCAAGCTGAATGTTTATGAAGAAATT
TTGTCCGCAGCCGCAGGTGGAAAAAATCTAAAATCTCCTCAAAAAGTAGAGGTCGACATC
ATAGATGACAACTTTATCCTGAGGTGGAACAGGAGCGATGAGTCTGTCGGGAATGTGACT

Assignment:

1. Write a simple Java program that reads a DNA sequence as input and prints it out. The input comes from the text file in FASTA format. Produce the output in two ways: print it to the console, then write it to a file.

2. Write a program to convert a given strand of a DNA sequence into its reverse-complemented strand. Use regular expressions and pattern matching to achieve this goal. Write your output to the same output file as in number one.

   Regular expressions are implemented using finite automata, which are state machines. Try thinking about it like this state machine:

   [Figure: state machine diagram]

   Here, q0 is the initial/start state and q3 is the final/accepting state. If at any point the machine reaches state q3, it has found the pattern "cow" in the main text.

3. Write a program to convert a DNA sequence into its corresponding RNA sequence. The main operation needed is to substitute every "T" with a "U". You will be using string translation for this exercise.

4. Write a program to convert ("translate") a coding sequence of a particular RNA into its corresponding amino acid sequence. Use the RNA conversion text file you created in exercise 3. Store the genetic code table in a hash table; the codon translation table is on the last page. Remember to check that the length of your test data is divisible by 3, so that it is viable, and that it includes only the letters A, U, C, and G. Read over the hashing lesson before we discuss it in class.

Extension:

If you have completed ALL your programming assignments, and I have checked them, please complete the following exercise for extra credit. Please read the handout on SNPs before attempting this exercise. You will have two text files: seq1.txt and seq2.txt. In this exercise, you will compare two DNA sequences (of the same length) and identify sites of point mutations, or Single Nucleotide Polymorphisms (SNPs).

Genetic Code Chart: http://www.he-astro.physics.utah.edu/~cassiday/p1080/lec06.html
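As a rough illustration of the operations behind exercises 2 through 4 and the extension, here is a minimal Java sketch. It is one possible approach, not the expected solution. The class and method names (SequenceOpsSketch, reverseComplement, toRna, translate, snpSites) are hypothetical. The sketch assumes upper-case input, builds the hash table from the standard genetic code with "*" standing for a stop codon, translates every codon rather than stopping at the first stop, and reports SNP sites as 0-based positions. File input and output are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: every name below is an illustrative assumption.
final class SequenceOpsSketch {
    private static final Pattern BASE = Pattern.compile("[ACGT]");
    private static final Map<String, Character> CODON_TABLE = new HashMap<>();

    static {
        // Standard genetic code with codons enumerated in U, C, A, G order;
        // '*' marks a stop codon.
        String bases = "UCAG";
        String aminoAcids =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
        int i = 0;
        for (char first : bases.toCharArray()) {
            for (char second : bases.toCharArray()) {
                for (char third : bases.toCharArray()) {
                    CODON_TABLE.put("" + first + second + third, aminoAcids.charAt(i++));
                }
            }
        }
    }

    // Exercise 2: match each base with a regex, append its complement, then reverse.
    static String reverseComplement(String dna) {
        if (!dna.matches("[ACGT]*")) {
            throw new IllegalArgumentException("expected only A, C, G, T");
        }
        StringBuilder complement = new StringBuilder();
        Matcher m = BASE.matcher(dna);
        while (m.find()) {
            switch (m.group().charAt(0)) {
                case 'A': complement.append('T'); break;
                case 'T': complement.append('A'); break;
                case 'C': complement.append('G'); break;
                default:  complement.append('C'); break; // only 'G' remains
            }
        }
        return complement.reverse().toString();
    }

    // Exercise 3: substitute every T with a U.
    static String toRna(String dna) {
        return dna.replace('T', 'U');
    }

    // Exercise 4: look up each codon in the hash table after validating the input.
    static String translate(String rna) {
        if (rna.length() % 3 != 0 || !rna.matches("[ACGU]*")) {
            throw new IllegalArgumentException("length must be divisible by 3, letters A, C, G, U only");
        }
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i < rna.length(); i += 3) {
            protein.append(CODON_TABLE.get(rna.substring(i, i + 3)));
        }
        return protein.toString();
    }

    // Extension: 0-based positions where two equal-length sequences differ.
    static List<Integer> snpSites(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have the same length");
        }
        List<Integer> sites = new ArrayList<>();
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                sites.add(i);
            }
        }
        return sites;
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGC"));   // prints GCAT
        System.out.println(toRna("ATGTTT"));             // prints AUGUUU
        System.out.println(translate("AUGUUUUAA"));      // prints MF*
        System.out.println(snpSites("ACGT", "AGGT"));    // prints [1]
    }
}
```

Building the table from two short strings avoids typing all 64 codons by hand. Check the result against the genetic code chart before relying on it.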
