Weizmann 2012 - Introduction to Matlab & Data Analysis
Exercise #6
File handling and regular expression
Tutor in charge of this HW: Anat Tzimmer
e-mail for Questions: anat.tzimmer@weizmann.ac.il
HW instructions:

You must submit this assignment in pairs.

Please name your script "hw4_<ID number 1>_<ID number 2>.m".

The script must contain a program description at the beginning of the
program.

Submit only your script (.m) file; please do not attach any
other files.

Please do not use the “clc”, “clear” or “close” functions in your script.

A program that crashes (for any reason) will earn its authors a
failing grade.

Avoid “magic numbers” and give your variables meaningful names.

Write your code in a readable way: keep descriptions and remarks short,
avoid excessive spaces (indent your code before submitting it), and do not
write overly long code lines.
In this exercise you will read a text file that contains a list of genes and their GO
terms. The Gene Ontology web site lets the user upload a list of genes and
receive their Gene Ontology terms (in a text file like the one you are about to read).
GO terms are gene annotations describing functions, related pathways, cell localization and
more. Each gene can have zero or more GO terms, depending on the information
that is known about the gene. The purpose of this exercise is to find all genes that are
related to a given cell function using their GO terms. Several GO terms may
imply the same function of a gene, for instance:
The gene FBL has the following GO terms: GO:0006364~rRNA processing,
GO:0006396~RNA processing, GO:0008033~tRNA processing.
Each GO term has an accession number (starting with ‘GO:’) and a GO term name
(description). We would like to find all the relevant GO terms in all the genes
according to a given list of key words.
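
To make the accession/name structure concrete, here is a minimal sketch that splits the FBL example above into its parts. The variable names and the pattern are illustrative assumptions, and the sketch assumes that GO terms are separated by a comma followed by the next 'GO:' accession; the real input file may be laid out differently.

```matlab
% GO terms of the gene FBL, as quoted in the example above.
go_terms = ['GO:0006364~rRNA processing, GO:0006396~RNA processing, ' ...
            'GO:0008033~tRNA processing'];

% (GO:\d+)  captures the accession number.
% (.*?)     captures the term name lazily, stopping where the lookahead
%           sees the next ", GO:" or the end of the string.
% The separator convention is an assumption, not taken from the real file.
term_pattern = '(GO:\d+)~(.*?)(?=,\s*GO:|$)';
tokens = regexp(go_terms, term_pattern, 'tokens');

% Each element of tokens is a 1x2 cell: {accession, name}.
for k = 1:numel(tokens)
    accession = tokens{k}{1};
    term_name = strtrim(tokens{k}{2});
    fprintf('%s\t%s\n', accession, term_name);
end
```

The lookahead is used instead of a simple "everything up to the next comma" rule because a term name could itself contain a comma.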
1. You should write a function that receives 2 inputs:
a. prot_goTerms_fname - the name of the file to read; the file contains the
list of genes and their GO terms
b. key_word - a cell array specifying the key words for the GO terms search
I’ll use the following key words cell array to test your work:
key_words={'cytoskele', 'movement', 'migration', 'microtubul'};
2. Your function should open a file for writing the results. The name of the file
should be in the format [the first key word]_ID1_ID2.txt, for example:
“cytoskele_00000_11111.txt”.
3. Your function should open the file whose name is given in prot_goTerms_fname and read it.
Each line consists of the gene name followed by all of its GO terms.
4. You should look for every key word from the list in the gene's GO terms using
regular expressions.
5. If one or more key words are found, you should print the results into the
output file that you already opened. The format should be:
a. One blank line before the gene name
b. The gene name
c. Below the gene name, on a new line (without a blank line), the GO term
accession number, a tab delimiter and the GO term name.
d. If several GO terms were found, you should print each GO term on a
new line in the format specified above.
6. A results file is available in the Supplementary Material on the web. Your output
file should be identical to the results file. Submitting code that generates a
file with a different set of results or in a different format will earn its
authors a failing grade. Use any program to find differences between your
output file and the given file, and make sure there are none.
7. Don’t forget to close the files (input and output) in your function.
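
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the sample input, the gene GENE_X with its placeholder GO term, the tab between the gene name and its GO terms, the case-sensitive search, and all variable names are hypothetical. The sketch writes to temporary files so that it runs on its own; the exact format required is defined by the supplied results file.

```matlab
key_words = {'cytoskele', 'movement', 'migration', 'microtubul'};

% Output name in the format [first key word]_ID1_ID2.txt (placeholder IDs).
% It is placed in the temporary folder only to keep this sketch tidy.
out_fname = fullfile(tempdir, ...
    sprintf('%s_%s_%s.txt', key_words{1}, '00000', '11111'));

% Build a tiny made-up input file: gene name, tab, GO terms (assumed layout).
in_fname = [tempname '.txt'];
fid = fopen(in_fname, 'w');
fprintf(fid, ['FBL\tGO:0006364~rRNA processing, ' ...
              'GO:0006396~RNA processing, GO:0008033~tRNA processing\n']);
fprintf(fid, 'GENE_X\tGO:0000000~cell migration\n');  % placeholder gene/term
fclose(fid);

% One alternation pattern for all key words: 'cytoskele|movement|...'.
kw_pattern   = strjoin(key_words, '|');
term_pattern = '(GO:\d+)~(.*?)(?=,\s*GO:|$)';
TAB = char(9);

fin  = fopen(in_fname, 'r');
fout = fopen(out_fname, 'w');

tline = fgetl(fin);
while ischar(tline)                     % fgetl returns -1 at end of file
    [gene_name, rest] = strtok(tline, TAB);
    terms = regexp(rest, term_pattern, 'tokens');
    gene_printed = false;
    for k = 1:numel(terms)
        accession = terms{k}{1};
        term_name = strtrim(terms{k}{2});
        if ~isempty(regexp(term_name, kw_pattern, 'once'))
            if ~gene_printed
                % Blank line, then the gene name, once per matching gene.
                fprintf(fout, '\n%s\n', gene_name);
                gene_printed = true;
            end
            fprintf(fout, '%s\t%s\n', accession, term_name);
        end
    end
    tline = fgetl(fin);
end

% Close both files.
fclose(fin);
fclose(fout);
```

With this sample input, FBL has no matching GO term and is skipped, while GENE_X is written with its single matching term. Comparing the output of your own function against the supplied results file remains the real check.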