Abstract_Graf - Bioinformatics Vienna

Building a Bioinformatics Pipeline for the Design of
Pichia pastoris Whole Genome Microarrays
Alexandra Graf
FH main supervisor:
Prof. Dr. Diethard Mattanovich
1 Introduction
DNA microarray analysis is a widely used technology for generating genome-wide expression profiles. It is also the
only application that allows the efficient use of information from whole-genome sequencing projects in answering
biologically relevant questions. The genome of the yeast Pichia pastoris was sequenced only recently, making it
possible to use microarray techniques to address problems in the excretion of complex heterologous proteins produced in this organism. This work deals with the selection of candidate ORFs and the design of oligonucleotides for
a P. pastoris whole-genome microarray.
At the time of this work no DNA microarray for Pichia pastoris existed. The aim of this work was to establish a pipeline for the design of whole-genome chips for P. pastoris. Starting from an only partly annotated genome of
P. pastoris, a set of P. pastoris-specific oligomers was designed and then uploaded to Agilent for microarray production. In the process of designing these oligos, appropriate software and parameters were selected and
connected via Perl scripts. The microarray type was a two-color, 60-mer oligonucleotide, custom expression array
from Agilent in the 4x44k format.
2 Results
De novo predictions of P. pastoris genes were created with the program GeneMark, followed by a homology
search with BLAST using S. cerevisiae as the model organism. The data were parsed and filtered with Perl scripts developed
by the author of this work. The resulting ORFs were merged with data predicted by Integrated Genomics (IG), and the
program cd-hi was used to cluster the sequences according to their similarity. The ORFs as well as the information
from BLAST and cd-hi were then used to select candidates for probe design. Oligo design was done with OligoArray
2.1. The number of different oligomers on each array was 17,161, so that 2 replicates of each oligomer could be
used per array. Of these oligos, 10,134 were specific; for the other 7,027 there was a risk of cross-hybridization.
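The reported oligo counts can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative.

```python
# Oligo counts as reported in the text.
specific_oligos = 10_134
cross_hybridization_risk_oligos = 7_027
replicates_per_oligo = 2

# Distinct oligomers on one array.
distinct_oligos = specific_oligos + cross_hybridization_risk_oligos

# Features occupied on one array once every oligomer is spotted twice.
replicated_features = distinct_oligos * replicates_per_oligo

print(distinct_oligos)      # 17161
print(replicated_features)  # 34322
```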
Hybridization of the first set of arrays was conducted according to the IAM SOP for RNA extraction and Agilent's
hybridization protocols. A variety of P. pastoris samples were pooled to obtain a high proportion of active
genes. The experimental part of the work was conducted at the IAM RNA lab. Twelve arrays (3 slides) were hybridized in a same-same manner, scanned and quantified. The slides were scanned with a GenePix 4000B scanner, and
Agilent's Feature Extraction software was used to convert the images into intensity values. The purpose of the first batch of arrays
is to determine the number of genes that hybridize to P. pastoris targets and to refine the next batch
accordingly. The results of the analysis of the extracted data, together with the prediction data, will determine the oligo
content of the next generation of microarrays.
3 Discussion
Selecting a good gene finder for yeast was surprisingly hard, given that the Saccharomyces cerevisiae genome has
been completely sequenced since 1996. The reason is that current gene prediction software focuses either on prokaryotes, which have no introns, or on higher eukaryotes, which have a large number of rather long introns. In
lower eukaryotes like yeast only a small proportion of genes contain introns, and most of these have very few and
short introns. Additionally, the yeast genome includes fewer repeats and is overall more compact than the genomes
of complex eukaryotes. For example, the yeast genome contains about 70% protein-coding regions, whereas in the human
genome this number is only 2% [Kellis et al., 2003]. To see which type of gene finder is best suited for yeast genomes, a test using three de novo gene finders (GeneMark, Glimmer3 and GlimmerHMM) was carried out. S. cerevisiae was the logical choice for the test since it is the most closely related organism to P. pastoris for which genome and
ORF data are available. The results confirmed that a gene finder written for eukaryotes (GlimmerHMM) was not
adequate for yeast since it introduced far too many introns into the predicted genes. The prokaryotic
gene finders performed much better, with GeneMark predicting fewer false negatives but more false positives than
Glimmer3. For this work, a higher rate of false positives was preferable to a higher rate of false negatives;
therefore GeneMark was used for the gene prediction of P. pastoris.
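The comparison of gene finders against known S. cerevisiae ORFs can be illustrated with a minimal sketch. It assumes that a prediction counts as correct only if its coordinates match a reference ORF exactly, which is a simplification and not necessarily the criterion used in this work. The Orf type, the finder names and all coordinates are hypothetical toy data, not results.

```python
from typing import Dict, NamedTuple, Set


class Orf(NamedTuple):
    seq_id: str   # chromosome or contig name
    start: int
    end: int
    strand: str


def compare_predictions(predicted: Set[Orf], reference: Set[Orf]) -> Dict[str, int]:
    """Count exact-match true positives, false positives and false negatives."""
    return {
        "true_positives": len(predicted & reference),
        "false_positives": len(predicted - reference),
        "false_negatives": len(reference - predicted),
    }


# Toy reference annotation (hypothetical coordinates).
reference = {
    Orf("chrI", 100, 400, "+"),
    Orf("chrI", 900, 1500, "-"),
    Orf("chrII", 50, 350, "+"),
}

# Hypothetical finder A: finds every reference ORF but adds two spurious ones.
finder_a = reference | {Orf("chrI", 2000, 2200, "+"), Orf("chrII", 700, 820, "-")}

# Hypothetical finder B: adds nothing spurious but misses one reference ORF.
finder_b = {Orf("chrI", 100, 400, "+"), Orf("chrII", 50, 350, "+")}

result_a = compare_predictions(finder_a, reference)
result_b = compare_predictions(finder_b, reference)
print(result_a)
print(result_b)
```

In this toy case finder A shows the trade-off described above: no missed genes at the cost of extra false positives.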
The lowest threshold that GeneMark allowed for the de novo predictions was used to obtain the largest possible set of
ORFs. The resulting sequences, as well as the ORFs predicted by IG, were then blasted against the coding
sequences of S. cerevisiae. Sequences with an HSP (high-scoring segment pair) that had a total length greater than 100 and an identity greater than 50
were considered similar, and of these only the HSP with the best e-value per sequence was recorded. This
information only indicates conserved regions between the two yeast organisms, not the actual length of the
coding regions. Moreover, the similarity between S. cerevisiae and S. bayanus, both members of
the Saccharomyces group, is about the same as that between human and mouse (62% and 66% respectively in
orthologous regions) [Kellis et al., 2003], so a notable number of P. pastoris genes can be expected to have no
orthologs in S. cerevisiae. For these reasons, BLAST scores were taken into account but served only as an ancillary
criterion in the selection of oligo candidates.
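The HSP filter described above can be sketched as follows. The original filtering was done with Perl scripts that are not shown here, so this Python version is only an illustration. It assumes that the HSPs have already been parsed into records, and it reads the e-value criterion as "keep the most significant HSP", i.e. the one with the smallest e-value, which is an assumption. The Hsp type and all IDs and values are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Hsp(NamedTuple):
    query_id: str     # predicted P. pastoris ORF
    subject_id: str   # S. cerevisiae coding sequence
    identity: float   # identity value as reported by BLAST
    length: int       # total HSP length
    evalue: float


def select_hsps(hsps: Iterable[Hsp],
                min_length: int = 100,
                min_identity: float = 50.0) -> Dict[str, Hsp]:
    """Keep HSPs above both cutoffs and record one HSP per query sequence."""
    best: Dict[str, Hsp] = {}
    for hsp in hsps:
        # Both cutoffs are strict ("greater than"), as in the text.
        if hsp.length <= min_length or hsp.identity <= min_identity:
            continue
        current = best.get(hsp.query_id)
        # Assumption: the recorded HSP is the one with the smallest e-value.
        if current is None or hsp.evalue < current.evalue:
            best[hsp.query_id] = hsp
    return best


# Toy input with hypothetical IDs.
hsps = [
    Hsp("orf_1", "sc_cds_7", 72.0, 350, 1e-40),
    Hsp("orf_1", "sc_cds_9", 65.0, 210, 1e-12),
    Hsp("orf_2", "sc_cds_3", 48.0, 500, 1e-20),  # fails the identity cutoff
    Hsp("orf_3", "sc_cds_5", 90.0, 80, 1e-15),   # fails the length cutoff
]

selected = select_hsps(hsps)
for query_id, hsp in selected.items():
    print(query_id, hsp.subject_id, hsp.evalue)
```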
To avoid cross-hybridization the sequences were run through cd-hi, a program that clusters sequences according to
their similarity and outputs a file containing the longest sequence of each cluster. A similarity threshold of 90%
was used in our case. The more recent version (cd-hit) was deliberately not used since it allows for a certain
amount of redundancy. Both versions (cd-hi, cd-hit) were developed and used to build non-redundant protein
databases. Since the program was written for protein sequences and involves alignments between these sequences,
it might introduce a bias when used with nucleotide sequences. The program was used under the assumption that
the bias, if there is one, is negligible. Still, a nucleotide version of cd-hit (cd-hit-est) is now available, and it would
be interesting to compare its results with those of the version used in this work.
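The idea of clustering by similarity and keeping the longest sequence of each cluster can be illustrated with a generic greedy scheme. This is a sketch only and not the implementation of cd-hi: the similarity measure is the ratio from Python's difflib, used here as a stand-in for an alignment-based identity, and the function names and sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_longest_first(seqs: Dict[str, str],
                          threshold: float = 0.9) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative ID to its member IDs.

    Sequences are visited from longest to shortest, so the representative
    of every cluster is its longest member.
    """
    clusters: Dict[str, List[str]] = {}
    for seq_id in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep_id in clusters:
            if similarity(seqs[rep_id], seqs[seq_id]) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


# Toy sequences (hypothetical).
seqs = {
    "orf1": "ATGC" * 5,
    "orf2": ("ATGC" * 5)[:19],   # near-identical, one base shorter
    "orf3": "T" * 10 + "G" * 10,
}

clusters = cluster_longest_first(seqs, threshold=0.9)
print(clusters)
```

Only the representatives (the keys of the result) would go on to probe design; the other members of a cluster are candidates for cross-hybridization with them.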
After the QC and normalization steps it will be decided which probes are deemed true and will be placed on the next chip.
Probes that are not marked as outliers and that have an intensity value higher than the average of the negative probes plus their standard deviation are considered positive. Since the array contains specific as well as unspecific probes, it will also be necessary to check for cross-hybridization. This is done by comparing the distributions
of the intensity values of the two groups as well as the total number of positive hybridizations for each
group. Additionally, supervised learning techniques can be used to see whether there is a significant difference
between the two classes of probes. The positive probes, all probes that have a high prediction value, and
probes from complex clusters (which were not used on this microarray chip due to time limitations) will make up the
next round of arrays. For the complex clusters a multiple sequence alignment has to be prepared before deciding which sequences to include in the next design. The sequences that were present on the first
array can easily be selected by their IDs alone, but as new sequences are introduced into the design a new run of
OligoArray 2.1 is required to check again for cross-hybridization in the set. After that the probes can be uploaded
to Agilent for the production of a new chip. Depending on the performance of the new array, my estimate is
that at least one more iteration will be necessary to produce a chip that contains mainly true probes.
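The rule for calling a probe positive, and the per-group comparison of positive calls, can be sketched as follows. The sketch assumes that intensities are already normalized, that outlier flags come from the QC step, and that the sample standard deviation is meant; the Probe type and all values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, NamedTuple


class Probe(NamedTuple):
    probe_id: str
    intensity: float
    is_outlier: bool
    specific: bool   # True for specific probes, False for unspecific ones


def positive_threshold(negative_intensities: List[float]) -> float:
    """Mean of the negative probes plus their (sample) standard deviation."""
    return mean(negative_intensities) + stdev(negative_intensities)


def is_positive(probe: Probe, threshold: float) -> bool:
    """A probe is positive if it is no outlier and exceeds the threshold."""
    return (not probe.is_outlier) and probe.intensity > threshold


def positive_fraction_by_group(probes: List[Probe], threshold: float) -> Dict[str, float]:
    """Fraction of positive probes among specific and unspecific probes."""
    fractions: Dict[str, float] = {}
    for label, flag in (("specific", True), ("unspecific", False)):
        group = [p for p in probes if p.specific == flag]
        if group:
            fractions[label] = sum(is_positive(p, threshold) for p in group) / len(group)
    return fractions


# Toy data (hypothetical intensities).
negatives = [10.0, 12.0, 14.0]
probes = [
    Probe("p1", 20.0, False, True),
    Probe("p2", 14.0, False, True),    # not above the threshold
    Probe("p3", 50.0, True, False),    # flagged as outlier
    Probe("p4", 30.0, False, False),
]

threshold = positive_threshold(negatives)
positives = [p.probe_id for p in probes if is_positive(p, threshold)]
fractions = positive_fraction_by_group(probes, threshold)
print(threshold, positives, fractions)
```

A clearly higher positive fraction in the unspecific group than in the specific group would be one hint of cross-hybridization; the full comparison of the intensity distributions is not covered by this sketch.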
4 Literature
Sauer M., Branduardi P., Gasser B., Valli M., Maurer M., Porro D., Mattanovich D., Differential gene expression in
recombinant Pichia pastoris analysed by heterologous DNA microarray hybridisation, Microb Cell Fact, 2004,
Volume 3, Number 1
Stekel D., Microarray Bioinformatics, Cambridge University Press, 2003
Kreil D.P., Russell R.R., Russell S., Microarray oligonucleotide probes, Methods in Enzymology, 2006, Volume 410
Kellis M., Patterson N., Endrizzi M., Birren B., Lander E. S., Sequencing and comparison of yeast species to identify
genes and regulatory elements, Nature, 2003, Volume 423