David Goldberg
CS 1950
Final Paper
My directed study task is programming for the biology department. Specifically, I am
writing a program that parses RNA sequences and searches them for patterns. RNA
sequences are among the building blocks of life. Every cell of every living thing on this
planet has RNA. From the leaf of the tallest tree to the smallest bacterium, all need RNA
to function. Much like lines of code, RNA is a set of instructions for making the most
fundamental parts of living things. RNA is read by ribosomes in cells, which use it to
create a functional product. These products are the proteins from which all life is built.
Needless to say, RNA is an important topic in biology, and working out which sequences
correspond to which attributes is a very significant quest.
I encountered several problems while working on this project. First off, the current
program is written in Perl, as is my updated version of it. The problem with this is that,
at the beginning of the project, I did not have any knowledge of Perl or any other
scripting language. The program also relies on somewhat complex regular expressions,
and I did not have any knowledge of regular expressions when I started this assignment
either. Due to my lack of knowledge in biology, I also had a little trouble figuring out
what needed to be done. As I have not taken a biology class since I was a freshman in
high school, I did not know many, if not most, of the words used to describe RNA, what it
does, and what I needed to do. At a very high level, without using any
of the biological terms, RNA is a long string of letters. These letters can be an ‘A’, ‘C’,
‘G’, or ‘T’. A ‘Y’ can be either a ‘C’ or a ‘T’. An ‘R’ can be either an ‘A’ or a ‘G’. An ‘N’
corresponds to an ‘A’, ‘C’, ‘G’, or ‘T’. The string of letters is broken up into 3 parts,
called the up intron, exon, and down intron, respectively. The goal of the program I am
writing is to find 2 user-specified patterns within one of these segments.
For my research project, I improved a program that searches for RNA sequences containing
2 patterns that the scientist has deemed important. This is done to help understand more
about RNA: for example, why some attributes are expressed in a person while at other
times an attribute, despite appearing in the RNA, is skipped because of something in the
code preceding it. The RNA sequences and patterns that I dealt with use the characters
‘A’, ‘C’, ‘G’, ‘T’, ‘Y’, ‘R’, and ‘N’. In the RNA code itself, only ‘A’, ‘C’, ‘G’, and ‘T’
exist. In the RNA sequence they represent both the amino acids that make up what the RNA
is trying to build and procedural information, with commands such as ‘start transcribing
here’ and ‘end transcribing here’. ‘Y’, ‘R’, and ‘N’ are used only in the pattern that is
entered by the user. A ‘Y’ stands for either a ‘C’ or a ‘T’. An ‘R’ corresponds to either
an ‘A’ or a ‘G’. An ‘N’ corresponds to any of them, so ‘A’, ‘C’, ‘G’, or ‘T’ will be an
acceptable character.
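The pattern letters described above can be turned into a regular expression one letter at a
time. The following is a minimal sketch in Python, not the Perl used by the actual program.
The names AMBIGUITY_CODES and pattern_to_regex are hypothetical, and the mapping of ‘R’ to
A or G follows the standard IUPAC nucleotide codes, which is an assumption about how the
program defines it.

```python
import re

# Assumed mapping from pattern letters to regex character classes.
# 'R' is taken as A or G (standard IUPAC code); adjust if the program differs.
AMBIGUITY_CODES = {
    "Y": "[CT]",
    "R": "[AG]",
    "N": "[ACGT]",
}


def pattern_to_regex(pattern: str) -> str:
    """Translate a user pattern such as 'GYN' into a regex string."""
    parts = []
    for letter in pattern.upper():
        if letter in "ACGT":
            parts.append(letter)  # plain letters match themselves
        elif letter in AMBIGUITY_CODES:
            parts.append(AMBIGUITY_CODES[letter])
        else:
            raise ValueError(f"unsupported pattern letter: {letter!r}")
    return "".join(parts)


regex = pattern_to_regex("GYN")
print(regex)                              # G[CT][ACGT]
print(bool(re.search(regex, "AAGTCAA")))  # True: 'GTC' matches
print(bool(re.search(regex, "AAGAAAA")))  # False: no C or T follows the G
```

Rejecting unknown letters early keeps a mistyped pattern from silently matching nothing.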
The current program reads from a flat file containing all of the RNA sequences.
The file is broken up into several parts. There are 2 RNA sequences that correspond to
one another. One is the sequence in human RNA; the other is the sequence in mouse
RNA. Both sequences have their own identification number, which is separated from
the sequence by a tab. The 2 sequences are separated from the next pair of
sequences by a new line character. The RNA sequences are broken up into 3 different
parts, each separated by an ‘X’. The 3 parts are called the up intron, the exon, and the
down intron, in that order.
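As an illustration of this file layout, the sketch below parses one line into a human and a
mouse sequence, each split into its three parts. It is written in Python rather than Perl.
The exact field order on a line is not spelled out above, so the layout used here (human id,
human sequence, mouse id, mouse sequence, all tab-separated) is an assumption, and the
names Sequence, parse_sequence, and parse_line are hypothetical.

```python
from typing import NamedTuple, Tuple


class Sequence(NamedTuple):
    seq_id: str
    up_intron: str
    exon: str
    down_intron: str


def parse_sequence(seq_id: str, raw: str) -> Sequence:
    """Split one raw sequence into its three 'X'-separated parts."""
    up_intron, exon, down_intron = raw.split("X")
    return Sequence(seq_id, up_intron, exon, down_intron)


def parse_line(line: str) -> Tuple[Sequence, Sequence]:
    """Parse one line into a (human, mouse) pair.

    Assumed layout: human_id <TAB> human_seq <TAB> mouse_id <TAB> mouse_seq
    """
    human_id, human_raw, mouse_id, mouse_raw = line.rstrip("\n").split("\t")
    return parse_sequence(human_id, human_raw), parse_sequence(mouse_id, mouse_raw)


# Made-up example line; real identifiers and sequences will differ.
line = "H1\tAAGTXGTCCAXTTAG\tM1\tAAGCXGTCGAXTTGG\n"
human, mouse = parse_line(line)
print(human.exon)         # GTCCA
print(mouse.down_intron)  # TTGG
```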
For this research project, the code that I improved was very minimal in its
functionality. The existing code has no user interface. The 2 patterns being looked for
are entered as command line arguments. It does not accept ‘Y’, ‘R’, or ‘N’ characters. To
get the functionality that those characters provide, the user needs to construct their
definitions manually. For example, for an ‘N’, which can be any of the 4 characters ‘A’,
‘C’, ‘G’, and ‘T’, the user would need to put the character class [ACGT] in the pattern.
The file containing all of the RNA sequences contains both human RNA sequences and the
corresponding sequences in mouse RNA. However, the original program only
checks for the patterns in the human RNA sequence. Also, it only checks the down
intron for the patterns and completely ignores both the up intron and the exon. The 2
patterns are checked for in the last 75 characters of the down intron. This distance is
fixed. Also, the patterns are searched for separately. As a result, the matches
found can overlap, which is undesirable. Lastly, the sequences in which both
patterns are found are stored in a results file. The human RNA sequence's identification
number is put into the file, followed by a tab, followed by the 75-character string that
was searched.
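To show why searching for the two patterns separately allows overlapping matches, here is a
small Python sketch of that behavior. It is an approximation for illustration, not the
original Perl code. The function names and the example sequence are made up.

```python
import re

WINDOW = 75  # fixed length searched by the original program


def separate_search(down_intron: str, pattern1: str, pattern2: str):
    """Search the last WINDOW characters for each pattern independently."""
    window = down_intron[-WINDOW:]
    first = re.search(pattern1, window)
    second = re.search(pattern2, window)
    if first is None or second is None:
        return None
    return window, first.span(), second.span()


def spans_overlap(a, b) -> bool:
    """True when two (start, end) spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


# Made-up down intron: 'GTCA' and 'CAT' share the letters 'CA'.
window, span1, span2 = separate_search("TTTTGTCATTTT", "GTCA", "CAT")
print(span1, span2)                 # (4, 8) (6, 9)
print(spans_overlap(span1, span2))  # True
```

Because each search knows nothing about the other, both can claim the same letters.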
Over the course of the semester I improved the existing code in several ways.
My program provides a better user interface than bare command line arguments.
It adds support for the ‘Y’, ‘R’, and ‘N’ characters. It gives
the user the ability to control what part of the RNA sequence they are searching in: it
can search any of the three parts of the RNA sequence, the up intron, the exon,
or the down intron. My program also checks the mouse RNA for the matching patterns and
records in the results file whether the match was found in just the human, just the
mouse, or both. My program also greatly increases the options available to the user in
how they define their search. Instead of using a fixed length for the section of the
sequence that it checks, it lets the user specify this length. The user can also
specify whether this section is taken from the beginning or the end of the part of the
RNA that is being checked, as well as the minimum and maximum distance between the 2
patterns. For example, the user can specify a length of 50 from the beginning of the up
intron, with a minimum distance of 5 and a maximum distance of 25. Finally, the
readability of the results file is improved, because my program makes the matches
apparent. It prints out the sequence that was
searched as well as the first pattern it finds, the sequence in between the first pattern
and second pattern, and the second pattern.
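The choice of section described above (a given length, taken from the beginning or the end
of one part) comes down to a slice. A minimal Python sketch follows. The function name
select_window, its parameters, and the example sequences are hypothetical.

```python
def select_window(part: str, length: int, from_end: bool) -> str:
    """Return `length` characters from the start or the end of `part`."""
    if length <= 0:
        return ""
    if length >= len(part):
        return part  # the part is shorter than the requested section
    return part[-length:] if from_end else part[:length]


# Made-up parts of one sequence.
parts = {"up_intron": "AAGTCCGT", "exon": "GTCCA", "down_intron": "TTAGGC"}
print(select_window(parts["up_intron"], 5, from_end=False))  # AAGTC
print(select_window(parts["up_intron"], 5, from_end=True))   # TCCGT
```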
My improved program works in the following way. It uses one regular expression
to see whether both of the patterns exist in the sequence at the same time. The regular
expression it tries to match can be broken down into 3 parts. The first part is the first
pattern. The second part matches any character sequence whose minimum length is the
minimum distance between the two patterns and whose maximum length is the maximum
distance between them, both as specified by the user. This part matches non-greedily so
that the two matching patterns are found as close together as possible. The
third part is the second pattern.
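A regular expression with these three parts can be sketched as follows. This is a Python
approximation under stated assumptions, not the program's Perl source. The function name
build_pair_regex and the example inputs are hypothetical, and the two patterns are assumed
to be already expanded into regex form without capturing groups of their own.

```python
import re


def build_pair_regex(regex1: str, regex2: str, min_gap: int, max_gap: int):
    """Compile: first pattern, lazy gap of min_gap..max_gap chars, second pattern."""
    return re.compile(f"({regex1})(.{{{min_gap},{max_gap}}}?)({regex2})")


sequence = "AAGTAACTACCAA"  # made-up example

# Non-greedy gap: stops at the nearest second pattern after the first one.
pair = build_pair_regex("GT", "C[CT]", min_gap=2, max_gap=5)
print(pair.search(sequence).groups())    # ('GT', 'AA', 'CT')

# Greedy gap, for contrast: stretches to the farthest allowed second pattern.
greedy = re.compile("(GT)(.{2,5})(C[CT])")
print(greedy.search(sequence).groups())  # ('GT', 'AACTA', 'CC')
```

Because both patterns sit in one expression, the two matches cannot overlap. Note that the
non-greedy gap is minimized only for the leftmost place where the first pattern matches, so
a closer pair starting further right in the sequence would not be reported.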
In addition to improving the existing code, I was asked to create a new program
that searches the end of the up intron and the beginning of the exon, with one pattern in
each, and finds the 2 patterns as close together as possible. At a
high level, I needed to find 2 patterns, one at the end of one segment and the
other at the beginning of the next segment, such that the sum of two distances, from the
first pattern to the end of the first segment and from the beginning of the second segment
to the beginning of the second pattern, is no more than a user-specified value. I called
this program the ‘exon intron program’.
Like the improved program, the exon intron program accepts all of the
characters ‘A’, ‘C’, ‘G’, ‘T’, ‘Y’, ‘R’, and ‘N’. Also, as most exons start with a GT pattern,
a GT at the beginning of the exon is ignored, but if it is not there the sequence
is still searched. The inputs from the user are the path of the database to use, the 2
patterns, and the maximum distance between the 2. The program's output is complex
but gives lots of useful information. The output file shows the following information,
with fields separated by tabs and entries separated by newlines:

the id number of the human RNA sequence in which the pattern match was found

part of the up intron before the first pattern

the first pattern

the rest of the up intron to its end

the GT at the beginning of the exon, if it exists

the beginning of the exon up to the second pattern

the second pattern

the part of the exon after the second pattern

the length between the end of the first pattern and the end of the up intron

the length between the beginning of the exon and the second pattern
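The fields listed above can be modeled as one record per match. The following Python sketch
is only an illustration. The class name, field names, and example values are hypothetical,
and it assumes the second length is counted from just after a leading GT.

```python
from typing import NamedTuple


class ExonIntronRecord(NamedTuple):
    """One output entry; field order follows the list above."""
    human_id: str
    before_first: str    # up intron before the first pattern
    first_pattern: str
    after_first: str     # rest of the up intron to its end
    leading_gt: str      # 'GT', or '' when the exon does not start with it
    before_second: str   # exon up to the second pattern
    second_pattern: str
    after_second: str    # exon after the second pattern

    def to_line(self) -> str:
        """Join the fields and the two lengths with tabs, ending in a newline."""
        lengths = [str(len(self.after_first)), str(len(self.before_second))]
        return "\t".join(list(self) + lengths) + "\n"


# Made-up values for illustration.
record = ExonIntronRecord("H1", "AAC", "GT", "TTAG", "GT", "CC", "AT", "GGA")
print(repr(record.to_line()))
# 'H1\tAAC\tGT\tTTAG\tGT\tCC\tAT\tGGA\t4\t2\n'
```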
The exon intron program works in the following way. It uses two regular expressions to
match both patterns. The first regular expression is broken up into 3 parts: the
sequence before the first pattern, which in the regular expression can be any
characters; the first pattern; and the sequence after the first pattern to the end of the
up intron, which in the regular expression can also be any characters. The first part is
matched greedily, so it at first consumes all of the sequence and
then backs off one character at a time until the first pattern can be matched. Then in the
second part the first pattern is matched. In the third part the rest of the string is
consumed, which represents the sequence in between the first pattern and the end of
the up intron. If this match is found, the program moves on to the exon and tries to match
that. For this it uses a regular expression that is broken up into 4 parts: the GT at the
beginning, the sequence from the beginning of the exon to the second pattern, the
second pattern, and the sequence after the second pattern. This regular expression
is matched as follows. It checks whether the GT is at the beginning. If it is, it moves to
the next part and tries to match that starting at the third character in the string. If
the GT is not there, it starts to match the next part at the first character. This next
part is searched non-greedily, so it starts at length 0 and grows until it finds the
second pattern or becomes
longer than the maximum distance between the two patterns minus the length of the
third part of the first regular expression, which represents the sequence in between the
first pattern and the end of the up intron. This ensures that it will not match anything
where the second pattern is found further than the user-specified maximum distance from the
first pattern. Then the fourth part consumes the rest of the sequence it searches, which
represents the part of the exon after the second pattern.
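The two-step match described above can be sketched as follows. This is a Python
approximation under stated assumptions, not the Perl source of the exon intron program. The
function name find_across_boundary and the example sequences are hypothetical, and both
patterns are assumed to be already expanded into regex form without capturing groups.

```python
import re
from typing import Optional, Tuple


def find_across_boundary(
    up_intron: str, exon: str, regex1: str, regex2: str, max_distance: int
) -> Optional[Tuple[str, ...]]:
    """Return the seven sequence pieces of a match, or None if there is none."""
    # First regex: greedy prefix, first pattern, remainder of the up intron.
    # The greedy prefix selects the occurrence closest to the intron's end.
    first = re.fullmatch(f"(.*)({regex1})(.*)", up_intron)
    if first is None:
        return None
    before_first, pattern1, after_first = first.groups()

    # Distance budget left for the exon side.
    remaining = max_distance - len(after_first)
    if remaining < 0:
        return None

    # Second regex: optional GT, lazy bounded gap, second pattern, remainder.
    # If nothing matches after a leading GT, regex backtracking retries from
    # the first character with the GT left unconsumed.
    second = re.match(f"(GT)?(.{{0,{remaining}}}?)({regex2})(.*)", exon)
    if second is None:
        return None
    leading_gt, before_second, pattern2, after_second = second.groups()
    return (before_first, pattern1, after_first,
            leading_gt or "", before_second, pattern2, after_second)


# Made-up sequences for illustration.
print(find_across_boundary("AACGTTTGTAG", "GTCCATGGA", "GT", "AT", 6))
# ('AACGTTT', 'GT', 'AG', 'GT', 'CC', 'AT', 'GGA')
print(find_across_boundary("AACGTTTGTAG", "GTCCATGGA", "GT", "AT", 3))
# None
```

In the first call, 2 characters follow the first pattern in the up intron, which leaves a
budget of 4 in the exon, and the second pattern is found after a gap of 2. In the second
call the budget is only 1, so no match is reported.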