David Goldberg
CS 1950
Final Paper

For my directed study I am programming for the biology department. Specifically, I am writing a program that parses RNA sequences and searches them for patterns. RNA sequences are among the building blocks of life. Every cell of every living thing on this planet has RNA. From the leaf of the tallest tree to the smallest bacterium, all need RNA to function. Much like lines of code, RNA carries instructions for making the most fundamental parts of living things. RNA is read by ribosomes in cells, which use it to create a functional product: the proteins from which all life is built. Needless to say, RNA is an important topic in biology, and working out which sequences correspond to which attributes is a significant quest.

I encountered several problems while working on this project. First, the current program is written in Perl, as is my updated version of it. At the beginning of the project I had no knowledge of Perl or of any other scripting language. The program also relies on somewhat complex regular expressions, which I had likewise never used before starting this assignment. My lack of background in biology also made it hard to figure out what needed to be done. I have not taken a biology class since my freshman year of high school, so I did not know many, if not most, of the words used to describe RNA, what it does, and what I needed to do.

On a very high level, without using any of the biological terms, RNA is a long string of letters. Each letter can be an ‘A’, ‘C’, ‘G’, or ‘T’. A ‘Y’ can be either a ‘C’ or a ‘T’. An ‘R’ can be either an ‘A’ or a ‘G’. An ‘N’ corresponds to an ‘A’, ‘C’, ‘G’, or ‘T’. The string of letters is broken up into 3 parts, called the up intron, the exon, and the down intron, respectively.
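The letter codes above can be turned into regular expression character classes mechanically. The following is a minimal sketch in Python rather than the Perl used in the project, and it is not the project's actual code. The names `AMBIGUITY` and `pattern_to_regex` are made up for illustration, and ‘R’ is taken in its standard IUPAC sense of ‘A’ or ‘G’.

```python
import re

# Assumed mapping: standard IUPAC meanings of the three ambiguity letters.
AMBIGUITY = {"Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def pattern_to_regex(pattern: str) -> str:
    """Translate a pattern made of A, C, G, T, Y, R, N into regex source."""
    parts = []
    for letter in pattern.upper():
        if letter in "ACGT":
            parts.append(letter)             # plain letters match themselves
        elif letter in AMBIGUITY:
            parts.append(AMBIGUITY[letter])  # ambiguity letters become classes
        else:
            raise ValueError(f"unsupported pattern letter: {letter!r}")
    return "".join(parts)


if __name__ == "__main__":
    regex = pattern_to_regex("GYRN")
    print(regex)                               # G[CT][AG][ACGT]
    print(bool(re.search(regex, "AAGCATT")))   # True, because GCAT matches
```

Rejecting unknown letters up front keeps a mistyped pattern from silently becoming a different regular expression. The same translation applies whichever of the three segments is being searched.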
The goal of the program I am writing is to find 2 user-specified patterns within one of these segments.

For my research project, I improved a program that searches RNA for 2 patterns that the scientist has deemed important. This is done to help understand more about RNA. For example, it can help explain why an attribute is sometimes expressed in a person and at other times, despite appearing in the RNA, is skipped because of something in the code preceding it.

The RNA sequences and patterns that I dealt with can include the characters ‘A’, ‘C’, ‘G’, ‘T’, ‘Y’, ‘R’, and ‘N’. In the RNA code itself, only ‘A’, ‘C’, ‘G’, and ‘T’ exist. In the RNA sequence they represent both the amino acids that make up what the RNA is trying to build and procedural information, with commands such as ‘start transcribing here’ and ‘end transcribing here’. ‘Y’, ‘R’, and ‘N’ are used only in the patterns entered by the user. A ‘Y’ stands for either a ‘C’ or a ‘T’. An ‘R’ corresponds to either an ‘A’ or a ‘G’. An ‘N’ corresponds to any of them, so ‘A’, ‘C’, ‘G’, or ‘T’ is an acceptable character.

The current program reads from a flat file containing all of the RNA sequences. The file is broken up into several parts. The sequences come in pairs that correspond to one another: one is the sequence in human RNA, and the other is the sequence in mouse RNA. Each sequence has its own identification number, which is separated from the sequence by a tab. Each pair of sequences is separated from the next pair by a new line character. Every RNA sequence is broken up into 3 parts, each separated by an ‘X’. The 3 parts are the up intron, the exon, and the down intron, in that order.

The code that I improved for this research project was very minimal in its functionality. The existing code has no user interface. The 2 patterns being looked for are entered as command line arguments. It does not accept ‘Y’, ‘R’, or ‘N’ characters.
In order for the user to have the functionality that those characters provide, they need to construct the definitions manually. For example, for an ‘N’, which can be any of the 4 characters ‘A’, ‘C’, ‘G’, and ‘T’, the user would need to put [ACGT] in the pattern. The file containing all of the RNA sequences holds both the human RNA sequences and the corresponding mouse RNA sequences. However, the original program only checks for the patterns in the human RNA sequence. It also only checks the down intron for the patterns and completely ignores both the up intron and the exon. The 2 patterns are checked for in the last 75 characters of the down intron. This distance is fixed. In addition, the patterns are searched for separately, so the matches found can overlap. This overlap is undesirable. Lastly, the sequences in which both patterns are found are stored in a results file. The human RNA sequence's identification number is put into the file, followed by a tab, followed by the 75-character string that was searched.

Over the course of the semester I improved the existing code in several ways. My program provides a better user interface than bare command line arguments. It adds the functionality of the ‘Y’, ‘R’, and ‘N’ characters. It gives the user control over which part of the RNA sequence is searched: the up intron, the exon, or the down intron. My program also checks the mouse RNA for the matching pattern, and it records in the results file whether the match was in just the human sequence, just the mouse sequence, or both. My program also greatly increases the options available to the user in how they define their search. Instead of having a fixed value for the section of the sequence that it is checking, the user can specify this length.
My program also allows the user to specify whether this section is taken from the beginning or the end of the part of the RNA that is being checked. It also allows the user to specify the minimum and maximum distance between the 2 patterns. (For example, the user can specify a length of 50, from the beginning of the up intron, with a minimum distance of 5 and a maximum distance of 25.) Finally, the results file is more readable. My program makes the matches apparent: it prints out the sequence that was searched as well as the first pattern it finds, the sequence in between the first pattern and the second pattern, and the second pattern.

My improved program works in the following way. It uses one regular expression to see if both of the patterns exist in the sequence at the same time. The regular expression it tries to match can be broken down into 3 parts. The first part is the first pattern. The second part is the gap between the patterns: it matches any sequence of characters whose length is at least the minimum distance and at most the maximum distance specified by the user. It matches non-greedily, so that the two patterns are found as close together as possible. The third part is the second pattern.

In addition to improving the existing code, I was asked to create a new program that searches the end of the up intron and the beginning of the exon, with one pattern in each, and finds the 2 patterns as close together as possible. On a high level, I needed to find 2 patterns, one at the end of one segment and the other at the beginning of the next segment. The sum of two distances, from the first pattern to the end of the first segment and from the beginning of the second segment to the beginning of the second pattern, has to be less than a user-specified value.
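The improved program's single regular expression (first pattern, bounded non-greedy gap, second pattern) can be sketched as follows. This is an illustration in Python rather than the project's Perl, not the actual code. The function names and parameters (`find_pair`, `length`, `from_end`, `min_gap`, `max_gap`) are hypothetical, and the ambiguity letters are assumed to have their standard IUPAC meanings.

```python
import re

# Assumed IUPAC meanings of the ambiguity letters allowed in patterns.
AMBIGUITY = {"Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def to_regex(pattern):
    """Expand Y, R, N into character classes; other letters stay literal."""
    return "".join(AMBIGUITY.get(ch, re.escape(ch)) for ch in pattern.upper())


def find_pair(segment, first, second, length, from_end, min_gap, max_gap):
    """Search a window of one segment for first, a bounded gap, then second.

    Returns (first match, gap, second match), or None if nothing matches.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # Take the window from the end or the beginning of the segment.
    window = segment[-length:] if from_end else segment[:length]
    regex = re.compile(
        "(" + to_regex(first) + ")"
        + "(.{%d,%d}?)" % (min_gap, max_gap)   # lazy gap: shortest fit first
        + "(" + to_regex(second) + ")"
    )
    m = regex.search(window)
    return m.groups() if m else None


if __name__ == "__main__":
    print(find_pair("AACGTTTTGCAAA", "ACG", "GCA", 13, False, 0, 25))
    # ('ACG', 'TTTT', 'GCA')
    print(find_pair("AACGTTTTGCAAA", "ACG", "GCA", 13, False, 5, 25))
    # None, since the only gap available is 4 characters long
```

Because the two patterns are consumed by one expression, their matches cannot overlap. One caveat is that a regular expression search reports the leftmost position where the first pattern can start, so the lazy gap gives the nearest second pattern for that start rather than the globally closest pair. This one-expression approach also assumes both patterns sit in the same segment, which is not the case for the second program introduced above.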
I called this program the ‘exon intron program’. Like the improved program, the exon intron program accepts all of the characters ‘A’, ‘C’, ‘G’, ‘T’, ‘Y’, ‘R’, and ‘N’. Also, as most exons start with a GT pattern, a GT at the beginning of the exon is skipped, but if it is not there the sequence is still searched. The inputs from the user are the path of the database to use, the 2 patterns, and the maximum distance between the 2. The program's output is complex, but it gives lots of useful information. The output file shows the following information, with fields separated by tabs and entries separated by new lines:

1. the ID number of the human RNA sequence in which the pattern match was found
2. the part of the up intron before the first pattern
3. the first pattern
4. the rest of the up intron to its end
5. the GT at the beginning of the exon, if it exists
6. the beginning of the exon up to the second pattern
7. the second pattern
8. the part of the exon after the second pattern
9. the length between the end of the first pattern and the end of the up intron
10. the length between the beginning of the exon and the second pattern

The exon intron program works in the following way. It uses two regular expressions to match the two patterns. The first regular expression is broken up into 3 parts: the sequence before the first pattern, which in the regular expression can be any characters; the first pattern; and the sequence from the first pattern to the end of the up intron, which in the regular expression can also be any characters. The first part is matched greedily, so it at first consumes the whole sequence and then backs off one character at a time until the first pattern can match. The second part then matches the first pattern. The third part consumes the rest of the string, which represents the sequence in between the first pattern and the end of the up intron.
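The effect of the greedy first part is that the match lands on the last occurrence of the first pattern, the one closest to the end of the up intron. A minimal sketch of that behavior follows, written in Python rather than the project's Perl. The function name `split_at_last_match` is hypothetical, and `pattern_regex` is assumed to be regex source with any ‘Y’, ‘R’, or ‘N’ already expanded.

```python
import re


def split_at_last_match(up_intron, pattern_regex):
    """Split the up intron around the LAST occurrence of the pattern.

    Returns (before, match, after), or None if the pattern never occurs.
    """
    # The first (.*) is greedy: it grabs everything, then backs off one
    # character at a time until the pattern fits. The final (.*) takes the rest.
    m = re.fullmatch("(.*)(" + pattern_regex + ")(.*)", up_intron)
    return m.groups() if m else None


if __name__ == "__main__":
    before, match, after = split_at_last_match("CAGTTCAGAA", "CAG")
    print(before, match, after)   # CAGTT CAG AA
    print(len(after))             # 2
```

A successful match yields the three pieces, and the length of the last one is the first pattern's distance from the end of the up intron.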
If this match is found, the program moves on to the exon and tries to match the second pattern there. For this it uses a regular expression that is broken up into 4 parts: the GT at the beginning, the sequence from the beginning of the exon to the second pattern, the second pattern, and the sequence after the second pattern. This regular expression is matched as follows. First it checks whether the GT is at the beginning. If it is, matching of the next part starts at the third character of the string. If the GT is not there, matching of the next part starts at the first character. This next part is matched non-greedily, so it starts at length 0 and grows until the second pattern is found or until it becomes longer than the maximum distance between the two patterns minus the length of the third part of the first regular expression, which represents the sequence in between the first pattern and the end of the up intron. This limit ensures that nothing is matched where the second pattern lies further than the user-specified maximum distance from the first pattern. Finally, the fourth part consumes the rest of the sequence being searched, which represents the part of the exon after the second pattern.
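The exon step described above can be sketched as follows, again in Python rather than the project's Perl and not as the actual implementation. The name `match_exon` and its parameters are hypothetical, and `pattern_regex` is assumed to be regex source with ‘Y’, ‘R’, and ‘N’ already expanded. Two further assumptions are labeled in the comments: the leading GT is peeled off with ordinary string code instead of inside the regular expression, and it is not counted toward the distance.

```python
import re


def match_exon(exon, pattern_regex, max_distance, intron_tail_len):
    """Find the second pattern near the start of the exon.

    intron_tail_len is the distance from the first pattern to the end of
    the up intron. Returns (gt, before, match, after), or None.
    """
    # Whatever the intron side has used up is no longer available here.
    budget = max_distance - intron_tail_len
    if budget < 0:
        return None
    # Assumption: skip a leading GT in code, and do not count it as distance.
    gt = "GT" if exon.startswith("GT") else ""
    rest = exon[len(gt):]
    # Lazy gap of 0..budget characters, then the pattern, then everything else.
    m = re.fullmatch("(.{0,%d}?)(%s)(.*)" % (budget, pattern_regex), rest)
    if m is None:
        return None
    before, match, after = m.groups()
    return gt, before, match, after


if __name__ == "__main__":
    print(match_exon("GTAACCGTT", "CCG", 5, 2))   # ('GT', 'AA', 'CCG', 'TT')
    print(match_exon("GTAACCGTT", "CCG", 5, 4))   # None, only 1 character left
```

Handling the GT outside the regular expression is a design choice made for this sketch: with an optional `(GT)?` group inside the expression, backtracking could let the second pattern overlap the GT, which the description above does not appear to intend.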