ProposalDraftFull-9-29 - Electrical and Computer Engineering

Principal Investigator/Program Director (Last, first, middle) :

Nicholas, Hugh B. Jr.

______________________________________________________________________________

1. SPECIFIC AIMS

The primary aim of this project, to increase minority participation in biomedical research through a broadly based program of assisting minority institutions in developing all aspects of bioinformatics programs on their campuses, remains unchanged. The most intensive effort is to assist these institutions in developing the training they offer by providing training and well-developed, tested training materials for bioinformatics courses to minority institutions, their faculty members, and their graduate students. The program will directly assist bioinformatics programs at two minority institutions each year and provide significant bioinformatics training at a number of other minority institutions. The program concentrates on the sequence-based aspects of bioinformatics that allow scientists to make effective use of the vast amount of information being produced by various genome sequencing, gene expression, and macromolecular sequence determination projects around the world. This basic area will be supplemented with information on the newer techniques designed to facilitate large-scale genomics and proteomics investigations as well as structural modeling and analysis techniques.

The program has been significantly expanded beyond the initial program, which was focused around an introductory bioinformatics course, by adding course design and development components for one introductory and two advanced bioinformatics courses in each of the three component areas of bioinformatics: the biological sciences, the mathematical sciences, and computer science. This expanded program provides both immediate and long-term increases in the research opportunities available to scientists at minority institutions. The program will assist in developing bioinformatics as a strong part of the curriculum in multiple departments at selected institutions, and will integrate bioinformatics procedures into the repertoire of research tools used in the research groups at the selected institutions. The initial three components, the continuing core of the program, are:

1. An intense two week summer workshop in bioinformatics for multidisciplinary faculty teams from selected minority institutions;

2. Strengthening the bioinformatics curriculum at two minority campuses every year by teaching a bioinformatics course in collaboration with members of the above faculty teams; and

3. A five-week research internship at the Pittsburgh Supercomputing Center (PSC) for graduate students who have completed the bioinformatics course on their campus.

This yearlong program covers major aspects of developing a strong, multidepartmental bioinformatics program at minority institutions. It recruits and trains a multidisciplinary team to staff the new program. It embeds the new program institutionally by integrating it into the teaching curriculum and by providing two years of part-time support for an on-campus liaison.

Importantly, it incorporates bioinformatics into local research efforts. The new course development components, including introductory and advanced courses for biologists, mathematicians, and computer scientists, will assist minority institutions in training students from mathematical and computational backgrounds for bioinformatics careers. They will assist minority institutions in providing training leading to specialization in bioinformatics to students from these diverse academic backgrounds and allow those students to apply their expertise to important biological and medical research problems. Scientists recruited from minority institutions especially for this purpose will do the course development. This coordinated program will solidify institutional support for bioinformatics by yielding tangible results in both teaching and research within the first year and by creating a solid basis for continuing results. A second year of PSC consulting, support, and a student internship is also provided. We will couple this program to a strong evaluation protocol both to identify successes and to identify and remedy shortcomings.

PHS 398 (Rev. 5/95) Page

_ 1

Number pages consecutively at the bottom throughout the application. Do not use suffixes such as 3a, 3b.


2. BACKGROUND AND SIGNIFICANCE

The biomedical sciences have progressively become more molecular and quantitative with a large fraction of current research aimed at describing both normal cellular processes and disease states at the molecular level. Macromolecular sequence data is such a powerful aid to this process that a programmatic effort is underway to acquire a systematic collection of such data.

The collection will include the complete genomic sequences for a large, diverse set of organisms, including humans and human pathogens. Sequence data are the successful outcomes of a set of random mutagenesis experiments conducted by nature and are a rich source of hypotheses about relationships between organism phenotypes and genotypes that can be tested experimentally.

This explosion of quantitative data continues to expand rapidly with the development of new techniques for sequencing whole genomes quickly at reasonable cost and for carrying out large-scale investigations of gene and protein expression with microarray and proteomics techniques.

This has led mathematically oriented biologists, as well as mathematicians, statisticians, and computer scientists to devise a growing array of analytical techniques, bioinformatics analyses, to mine this data to uncover hypotheses to efficiently drive further experimental work in productive directions as well as statistically evaluate and test hypotheses. The ability to gather and exploit this data has revolutionized molecular biology in such a short time that few biologists have received the mathematical and computer training needed to perform these analyses and exploit the resulting bioinformatics information in their experimental investigations. This has created a major demand for researchers trained in these techniques in both industry and academia.

These and related opportunities were what drove the discussions at the NIGMS meeting “Visions of the Future” in September of 2002 (http://www.nigms.nih.gov/news/meetings/visions.html). Two major themes that resulted from this meeting are in the two following paragraphs.

"Mathematization" of Biology. A fusion of biology with the physical and engineering sciences is needed.

Developing mechanisms that encourage and reward mutual, cooperative interactions among mathematicians, physicists, computer scientists, engineers, and biologists will be important for achieving highly significant future advances. … A further hindrance to progress is the observation that the quantitative components of undergraduate and graduate training in biology are currently inadequate.

Interdisciplinary Science. Ongoing, daily collaboration between scientists from different disciplines is vital to the success of future biomedical research.

Assembling "critical masses" of personnel representing sufficient breadth and depth of varied scientific expertise will be an important solution to tackling complicated problems in biology and facilitating the fusion of information from disparate sources. … While investigators should be encouraged to work together, they should concentrate on retaining their specialist foci.

Because this expansion in demand is both large and sudden, and because trained faculty and established programs are lacking, many well-regarded universities are only beginning to train their students in this important area, as “Visions of the Future” suggests.

This lack of trained faculty and established programs afflicts minority institution efforts to establish bioinformatics curriculum not only in biological sciences departments but in mathematics, statistics, and computer science departments as well. Scientists from all of these areas are needed in a complete bioinformatics program that will be broad enough to attract students from the even wider range of academic disciplines that have expertise that needs to be effectively used in biological research.

Even before the biomedical community has absorbed the first round of the bioinformatics revolution at the sequence level, we are undergoing a parallel expansion of macromolecular structure data as well as a second round of the bioinformatics revolution in terms of high throughput genomics and proteomics techniques. This is seen in the recent NIH initiatives to systematically acquire these types of data.

From the above discussion it can be seen that bringing a bioinformatics program into any university differs drastically from, say, adding another enzymologist to teach metabolic pathways in an already established biochemistry program. Establishing a bioinformatics program is a major undertaking that crosses traditional academic field boundaries. People with skills in various aspects of biology, such as molecular biology, cell biology, structural biology, and biochemistry, are obvious requirements for such a program. Other natural sciences such as chemistry, physics, and engineering have much to offer, both in terms of scientific content or understanding and in terms of their more mathematical model building and computational approach to scientific investigation. Mathematics, statistics, and computer science are essential contributors to a bioinformatics program as the primary source of mathematical analysis and modeling and of computational processing and organizing of data. Ideally, each person will have overlapping competencies that will form a coherent whole within the program. Thus establishing a bioinformatics program requires bringing in a wider, more diverse set of skills than adding a new specialty within a single academic field or department, since it requires expertise found in multiple academic fields.

Given the highly competitive market for bioinformatics expertise and its limited availability, it is unlikely that institutions with limited expansion budgets can import an entire bioinformatics program by hiring new faculty. Instead, the multidisciplinary program we propose here seeks to assist developing bioinformatics programs at minority institutions through a different mechanism. We will recruit multimember faculty teams whose members already possess a strong academic background in one of the foundation fields of bioinformatics and assist these teams in implementing bioinformatics programs on their campuses. We work from the perspective that bioinformatics is a set of core problems whose solution involves combining expertise from the biological sciences, the mathematical sciences, and computer science. The problem set we will focus on in our initial training, and on which the course development aspects of this program will build, is the problem of how to gain the maximum amount of information from the vast amount of data being generated by the large-scale projects in genome sequencing, expression, and macromolecular structure determination.

The skills to solve these problems will be a valuable research ability and open the door to many research opportunities. This is true whether the scientist comes from a mathematical or computer science background and is developing new methods to analyze these data, or is an experimental biologist applying bioinformatics techniques to a specific research agenda. Indeed, we believe that an increasing amount of biological research will be carried out by teams in which biological scientists work with mathematical scientists and computer scientists, using integrated approaches to complicated problems involving large and complicated datasets.


3. PRELIMINARY STUDIES

This section will be divided into two parts: an overall description of the Biomedical Initiative at the Pittsburgh Supercomputing Center followed by a description of the first three years of the current project: “Developing Bioinformatics Effort at MARC Schools.”

3.A.1 The Biomedical Initiative at the Pittsburgh Supercomputing Center has been a leading program in high performance computing for over 15 years. The Initiative's mission is to develop and apply new computing and scientific solutions in important biomedical areas such as structural biology and bioinformatics, cellular microphysiology, neural modeling, the Visible Human Project, and pathology. The group has an active research, training and service program.

The core funding for the Biomedical Initiative is as a Research Resource (P41 RR06009), with additional support from other institutes at the NIH, NSF, DARPA and others.

The Bioinformatics projects include investigating new algorithms for sensitive database searches, evaluating different representations of the information contained within a multiple sequence alignment, and the identification of residues that differentiate between different sequence subfamilies. We recently completed a detailed analysis of enzyme families (aldehyde dehydrogenase and glutathione S-transferase) that identified conserved motifs, key residues for specificity and catalytic activity and provided predictions used in laboratory research. PSC maintains a large service facility, which includes a large suite of programs for database searching, multiple sequence alignment, pattern identification and searching, and phylogenetic analysis; all major sequence databases and a large number of the completed genomes are maintained online.

At least one workshop each year is focused primarily on sequence-based bioinformatics. We also are working directly with a number of universities to help establish competitive bioinformatics efforts.

The Structural Biology effort uses a variety of computational chemistry approaches to obtain insights into structure, function and specificity. All projects employ bioinformatics to help define questions, direct projects, and interpret results. The group works directly with other experimental groups to test the predictions derived from computations. The current projects include quantum chemical models for divalent metal ion binding sites, QM and QM/MM computations to investigate enzyme mechanisms, and MD simulations to study the dynamical behavior around active and binding sites. The service effort supports all major quantum mechanics programs (e.g., GAMESS, Gaussian, NWChem), molecular mechanics (e.g., CHARMM, AMBER, NAMD) and some QM/MM programs (Dynamo). Usually there is a structural biology workshop each year, alternating between molecular dynamical simulation of biopolymers and structure determination (e.g., X-Ray, NMR and cryo-EM).

3.A.2 The Training Program includes three types of workshops. A Research Workshop brings in top scientists within a research domain to discuss the current state of the art and ways that this compute-bound field might be able to successfully use facilities such as the PSC. An Application Workshop uses leading scientists within a research domain to teach the use of the current state-of-the-art programs within a specific computational biology area. A Technology Workshop teaches researchers the best ways to utilize technology in their research. These workshops have been well received by the community. We have offered 9 Research Workshops, 28 Applications Workshops (not including Roadshows), and 9 Technology Workshops, and have had over 3000 total participants. Table I lists by type the various workshops that have been offered by the Biomedical Initiative at the PSC. These workshops have been funded as a part of PSC’s Research Resource unless otherwise noted.


The workshop most directly relevant to this application is the “Nucleic Acid and Protein Sequence Analysis workshop,” which has been offered continuously at the PSC since 1989. This week-long workshop was initially offered under the auspices of PSC’s research resource, but has since been funded by a two-year NHGRI grant that is currently up for its sixth competitive renewal. We have trained 344 researchers in the fifteen offerings. This week-long workshop has been continually over-subscribed, which led to the development of a two day version that we have taken on the road to 35 universities and presented to almost 1500 participants, often with these presentations included as requirements in a number of courses. Included among the “Roadshow Stops” are: Howard University, Clark-Atlanta University, New Mexico State University, the University of Puerto Rico Medical Sciences and Cayey campuses, and the University of Texas at San Antonio and El Paso. Due to the demand and interest, we have also offered an advanced version of this workshop in Pittsburgh.

The University of California at Davis engaged the group to help organize and present a three-credit course (MCB/NPB 298) in bioinformatics during the fall quarter of 1999. Drs. Deerfield and Nicholas each made three trips to Davis during the quarter to present lectures, conduct hands-on computer laboratory sessions, and consult with faculty and students on the term projects that are the basis for a grade in the course. A number of other internationally recognized bioinformatics experts (who were selected in consultation with the PSC staff) made a single trip to University of California at Davis to present lectures and consult with students. This course is one of the first steps by the University of California at Davis to establish an extensive, multidisciplinary program in bioinformatics on that campus. A non-credit graduate level course was taught at the University of Pittsburgh and Carnegie Mellon University under the auspices of a Keck grant. A grant to the University of Pittsburgh from the Howard Hughes foundation has funded the PSC to present this material in an undergraduate course at the University of Pittsburgh each spring term since 2000, and the PSC continues to present it in subsequent spring terms.

Table I. Workshops offered by PSC’s Biomedical Initiative since 1987.

• Research Workshop (2 day workshops)
    • Epidemiological Modeling (12/91)
    • Fast Processes in Protein Folding (Fund: Biology at NSF)
    • High Performance Software for Computational Neuroscience (12/92)
    • 3rd International Workshop on Human Chromosome 16 Mapping 1994 (Fund: DoE, 5/94)
    • Application of High Performance Computing in Bioengineering (10/94)
    • Gene Map Integration Research Workshop (12/95)
    • Biomedical Image Analysis and Visualization (7/98)
    • Statistical Analysis of Neuronal Data (5/02)
    • From Structure to Function: Frontiers of Biological Ion Channels (5/03)

• Application Specific Workshop (3.5 to 5.5 day workshops)
    • Nucleic Acid and Protein Sequence Analysis (Fund: NHGRI and NCRR)
    • Academic course: Nucleic Acid and Protein Sequence Analysis (UC-Davis and Pitt)
    • Advanced Nucleic Acid and Protein Sequence Analysis
    • Molecular Mechanics and Dynamics (AMBER, CHARMM, NAMD)
    • Structure Determination with either cryo-EM (1x), NMR (2x) or X-Ray (3x)
    • Computational Neural Model (Mcell, Neosim)
    • Image Restoration (Microscopy, Pathology)
    • Biofluid Dynamics with Flexible Boundary Conditions

• Technology Workshop (3.5 day workshops)
    • Supercomputing Techniques for Biomedical Researchers
    • Building and Using PC Clusters in Biomedical Research

This extensive background in training in a broad range of computational biology techniques including bioinformatics, both in Pittsburgh and on university campuses around the country, gives the PSC staff abundant experience in all of the components of this program.

3.A.3 Bioinformatics Software Developed by PSC’s Biomedical Initiative has two major software development paths: 1.) development or implementation of new algorithms as original code development, or 2.) parallelization, optimization or “hardening” of existing codes.

The original software development has been primarily in developing and implementing the dynamic programming algorithm on high performance computers. Ropelewski et al. (1997, 2000) have developed a general implementation of the Smith-Waterman variant with the Waterman-Eggert (MaxSegs) extension for vector, massively parallel and single processor machines. Versions of this code will also support the Needleman-Wunsch and Sellers variants of the dynamic programming algorithm. A special variant of this code, FShift, has been developed for the alignment of coding regions to coding regions in nucleic acids using the coding information. This is actually a three-dimensional dynamic programming algorithm on nine 2-D planes (3 codons x 3 codons with frame shifts). A second variant, SeqStruct, allows for the sequence-structure alignment (similar to threading) of the primary sequence to tertiary structure information. Currently, we are developing a highly portable profile search code that will allow one to search a library of protein profiles (Position Specific Scoring Matrices) with a single sequence, either protein or nucleic acid. The SAF offers a large diversity of programs, each with full documentation. However, the creation of job files to run jobs proved to be a stumbling block to the use of this resource. We developed MakSeq, an interrogative user interface program, to assist individuals in creating effective and correct scripts to run any of the programs available on the SAF. This allows the computer-naive workshop participant to concentrate on the scientific aspects of the analysis rather than on the mechanics of the computer systems.
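For readers unfamiliar with the dynamic programming algorithm named above, the following is a minimal teaching sketch of Smith-Waterman local alignment in Python. It is not the PSC code: the function name `smith_waterman`, the simple match/mismatch scores, and the linear gap penalty are all assumptions chosen for illustration, and it omits the Waterman-Eggert extension, affine gaps, and substitution matrices that a production implementation would use.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Illustrative scoring only: fixed match/mismatch values and a linear
    gap penalty (assumptions, not the scoring used by any PSC program).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Replacing the `0` floor with gap-penalized boundary conditions and tracing back from the final cell would turn this recurrence into the Needleman-Wunsch global variant mentioned in the text.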

The parallelization, optimization or “hardening” of existing codes has been performed on a number of relevant codes, and usually involves taking a code that has been developed for a single problem (or a small set of problems) and expanding the code to be used for a general set of problems. The “hardening” of codes often requires removing “undocumented features”, providing effective user manuals, documenting more obscure uses, and generalizing the input/output options (e.g., reading all sequence formats). For example, we optimized early versions of Mike Gribskov’s ProfileSearch and ProfileScan programs to run effectively on high performance computers, rewrote them in a portable manner, enabled them to find multiple matches, and introduced greater flexibility in choices. Subsequently, GCG introduced a number of these changes in their distributed codes. The MSA program provides a rigorous global alignment of an arbitrary number of either protein or nucleic acid sequences (Lipman et al., 1989); we have added new options that allow users to do a complete analysis and have decreased the computational time by nearly a factor of ten. We assisted in the parallelization and optimization of the MEME program, developed by Bailey and Elkan (1994) at the University of California at San Diego. This program is the most powerful and most automated of the stochastic sampling or search programs for finding sequence motifs.

3.B. “Developing Bioinformatics Efforts at MARC Institutions.”

The PSC, under the auspices of a grant from the NIH (T36-GM08789), has carried out a three part program aimed at assisting the development of bioinformatics efforts at minority institutions. The three parts were: 1.) a Summer Institute, a two week intensive course dedicated to introducing multidisciplinary faculty from minority institutions to the theory and practice behind bioinformatics; 2.) a partnership with at least one school a year to assist in teaching a bioinformatics course; and 3.) work with summer interns (graduate and undergraduate students) from the various institutions.

3.B.1 The Summer Institute, a two week intensive workshop, is designed to bring a multidisciplinary group of scientists together to discuss the current state of the art, theory, and implementation in bioinformatics, providing the foundation for current teaching needs and future research projects. Three of the greatest hindrances to multidisciplinary research are the lack of a common vocabulary, of an appreciation of the other’s research, and of a common view of the research area. This course is designed to deal with all three issues. The first two issues are dealt with by introducing non-biologists to the basics of biological macromolecular sequences and molecular biology, while the biologists are introduced to the basics of algorithms and programming. The main part of the workshop is focused around a well-established core component of bioinformatics: macromolecular sequence analysis techniques, genomics, and proteomics. See the appendix for a copy of the agenda, the list of participants, and more detailed evaluation results.

Given the differences in background and research goals, we teach the course with two different sets of goals. The goals for the biologists are: 1.) to prepare them to incorporate these core techniques into their teaching activities as part of a bioinformatics course, and 2.) to make them proficient in applying the core sequence analysis techniques so that these techniques become part of their research repertoire. For non-biological scientists, the goals are: 1.) to provide an introduction to basic molecular biology, basic protein biochemistry and evolution as well as current problems in genome assembly and automatic gene and genome annotation; 2.) to provide an introduction to some open statistical and modeling problems in bioinformatics such as maximum likelihood alignments and modeling Markov processes from small data sets such as individual protein families (i.e., a Dayhoff PAM table customized to a single protein family); and 3.) to provide an introduction to open modeling projects, such as alternative splicing mechanisms for both splice site prediction and in modeling the gene and its observed products. For both groups, the primary goal is to describe the actual implementation of bioinformatics programs at minority institutions.
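To make the small-data Markov modeling problem mentioned above more concrete, here is a hedged Python sketch of estimating residue substitution probabilities from a single small alignment. It is a deliberate simplification, not the Dayhoff PAM procedure (which uses phylogenetic trees of closely related sequences and mutability normalization): the function name `substitution_probabilities`, the all-pairs counting scheme, and the additive pseudocount are assumptions chosen for illustration. The pseudocount shows one common way of coping with the sparse counts that a single protein family provides.

```python
from collections import Counter
from itertools import combinations


def substitution_probabilities(alignment, alphabet, pseudocount=1.0):
    """Estimate P(y | x) for aligned residues from a small alignment.

    alignment: equal-length aligned sequences (strings); characters that
    are not in `alphabet` (e.g. the gap symbol '-') are ignored.
    pseudocount: added to every (x, y) count so that substitutions never
    observed in the small sample still receive a nonzero probability.
    """
    counts = Counter()
    # Count every unordered pair of residues sharing a column, symmetrically.
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x in alphabet and y in alphabet:
                counts[(x, y)] += 1
                counts[(y, x)] += 1

    probs = {}
    for x in alphabet:
        # Normalize each row so the probabilities out of x sum to 1.
        total = sum(counts[(x, y)] + pseudocount for y in alphabet)
        probs[x] = {y: (counts[(x, y)] + pseudocount) / total for y in alphabet}
    return probs


if __name__ == "__main__":
    toy_family = ["AC", "AC", "AG"]  # hypothetical three-sequence alignment
    for residue, row in substitution_probabilities(toy_family, "ACG").items():
        print(residue, row)
```

With so few sequences the estimates are dominated by the pseudocount, which is exactly the statistical difficulty the text describes for models customized to a single protein family.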

The core of the course is on sequence-based bioinformatics and structural genomics. This includes such diverse topics as single sequence analysis, pairwise comparisons, multiple sequence alignment, pattern identification and representation, phylogenetic analysis, homology modeling and simulation of macromolecules. This is taught by the staff included on this proposal. In addition, we bring in outside speakers to describe new and emerging fields. For example, last year we featured Dr. Simon Lin, the program Chair of Duke University’s Critical Assessment of Microarray Data Analysis (CAMDA) competition, to present information on DNA microarray analysis. Arthur Wetzel of the PSC presented his work on analyzing high throughput mass spectroscopy proteomics data for cancer markers. Dr. Dannie Durand from Carnegie Mellon’s departments of Biological Sciences and Computer Sciences presented a talk on emerging techniques and problems in bioinformatics that focused on genome annotation and identification of regulatory sites, and on pathway and network analysis. Dr. Ricardo Gonzalez, the liaison for the Medical Sciences Campus of the University of Puerto Rico, described his experiences in preparing and team-teaching with PSC staff a course in sequence-based bioinformatics. He also presented a lecture on phylogenetic analysis that he had developed for the course along with a computer laboratory session that he developed to go with the lecture.

Ms. Roxana Cintron, a student in Dr. Gonzalez’s course and our summer intern this year, presented her term paper from Dr. Gonzalez’s course, the largest single item in determining the grade for the course. This provides a benchmark for the participants about how effective a course built around the PSC supplied materials can be for their students.

Table II – Summer Institute Evaluation where: 1=Strongly Disagreed and 5=Strongly Agreed

Course Content

1. I was aware of the prerequisites for this course.

2. I had the prerequisite knowledge and skills for this course.

3. I was well informed about the objectives of this course

4. The prerequisite readings were appropriate for this course.

5. This course (workshop) lived up to my expectations.

6. The content is relevant to my job.

Course Design

7. The course objectives are clear to me.

8. The course activities stimulated my learning.

9. Interactive multimedia was essential in the course.

10. The activities in this course gave me sufficient practice and feedback.

FY01 (July 8-19) Average    FY02 (July 7-18) Average

3.8

3.2

4.0

4.1

4.2

4.3

4.1

4.5

4.2

3.9

4.1

3.4

4.1

4.0

4.2

4.1

4.2

4.4

4.3

4.1

4.0

4.2

3.9

4.3

4.1

4.1

4.1

3.8

11. The exercises in this course were appropriate and useful.

12. The difficulty level of this course is appropriate.

13. The pace of this course is appropriate.

14. Having a multidisciplinary group of participants was a positive aspect of the workshop.

Course Instructor (Facilitator)

15. The instructor(s) was(were) well prepared.

16. The instructor(s) was(were) helpful.

Course Environment

17. The training facility at this site was comfortable.

18. The training facility at this site provided everything I needed to learn.

Course Results

19. I accomplished the objectives of this course.

20. I will be able to use what I learned in this course.

Self-Paced Delivery

21. Computer Lab was a good way for me to learn this content.

4.5

4.8

4.6

4.6

4.2

4.4

4.5

4.3

4.5

4.4

4.5

4.1

4.2

4.4


22. Computer Lab is an important aspect of this course. 4.7 4.5

3.B.1.a Leveraging PSC’s other grant support, we taught a two day roadshow at both the University of Texas at San Antonio and El Paso in March of 2003. Furthermore, we were able to provide one slot in the NCRR-sponsored “Modeling from protein sequence to structure: Computational tools for structure prediction” and seven slots in our NHGRI-sponsored “Nucleic Acid and Protein Sequence Analysis Workshop” for faculty from five different minority institutions. Thus, this community is also benefiting from the additional workshops that are taught annually at the PSC.

3.B.2 Partner schools included Howard University, the Medical Sciences Campus at the University of Puerto Rico, North Carolina Central University and Morgan State University. The grant provides partial support for two years for a liaison at the partner school, which is intended to provide release time so that the liaison can lead the bioinformatics effort. The liaison is chosen by the local institution, usually by an appropriate Dean or Provost, in consultation with the PSC staff. We request that the school send 3-6 faculty and staff, including the liaison, to its first Summer Institute as a partner school. Thus, a group of them will have a common experience and background for continuing the local efforts. Working through the liaison, the PSC staff identify appropriate times for research or general seminars, identify appropriate courses for PSC staff to team-teach in, and look for other ways that the PSC staff can work with the local institution.

3.B.2.a Howard University was the partner during the first fiscal year. In November of 2002 Drs. Deerfield and Nicholas presented 12 hours of lectures as part of the Molecular Genetics course in the Biology Department of Howard covering three major topics: 1.) pairwise alignments and database searches, 2.) multiple sequence alignment, and 3.) whole genome analysis and annotation. Tutorials and lecture notes were provided to students in advance.

Additionally, each of them presented a talk on bioinformatics as part of the Biology departmental research seminar. The first seminar was an overview of different areas of active research in computational biology and how they complement laboratory experiments in the same field, while the second seminar focused on an extensive analysis of the Aldehyde Dehydrogenase enzyme family using a wide variety of computational biology techniques from bioinformatics as well as molecular and quantum mechanics. Emphasis was placed on how these computational techniques yielded detailed hypotheses that were then testable through laboratory experiments and how this integration of computational and laboratory approaches rapidly advanced our understanding of an enzyme family and developed a new understanding of its catalytic mechanism.

Drs. Deerfield and Nicholas continue to work with Dr. Wayne Patterson, Director, Program Evaluation, Research, and International Programs in the Graduate School at Howard University, to help formulate a multidisciplinary certificate program in Bioinformatics.

3.B.2.b Medical Sciences Campus of the University of Puerto Rico was the partner during the second fiscal year, where Deerfield and Nicholas team-taught Biochemistry 8526 – Introduction of Applied Bioinformatics in conjunction with Professor Ricardo Gonzalez Mendez, who was one of the participants in the July 2002 summer workshop and the UPR liaison.

Deerfield and Nicholas presented nine hours of lectures in the course on the San Juan Medical School campus and a workshop of two half-days (ten contact hours) at UPR-Cayey that was aimed at undergraduates (host: Michael Rubin, who attended the 2002 Summer Institute). Both have also individually presented research talks on the Mayaguez campus in the Computer Science department, where the University of Puerto Rico is in the process of establishing a multidisciplinary bioinformatics program (host: Jaime Segal, who attended the 2002 Summer Institute).

3.B.2.c North Carolina Central University is the current year partner. Both Deerfield and Nicholas have given seminars on campus, both have taught in a bioinformatics course (BIOG5410) being taught by the biology chairman, Dr. Amal Abu-Shakra, who was a participant in PSC’s workshop program, and both are scheduled to return to campus at least one more time during the fall.

3.B.2.d Morgan State University has been added in a supplemental to be a current year partner. We have an active liaison, have led a number of discussions on bioinformatics at the local campus, and are currently pursuing both a seminar and a teaching schedule for the spring term.

3.B.3 Interns. We recruited four interns the first year. Three were graduate students from the Biochemistry and Molecular Biology department at Howard University and the fourth was an undergraduate Art student with a strong biology background from North Carolina Central University. In the second year we had one intern, a graduate student who had taken the on-campus bioinformatics course at the University of Puerto Rico Medical Sciences Campus. In the first year LaCoya Martin, an undergraduate art major from NCCU, worked with Dr. Deerfield on animations presenting the results of integrated bioinformatics and molecular mechanics investigations of Glutathione S-Transferase proteins, a protein family important in metabolizing many xenobiotics and in developing custom organisms to use in bioremediation of environmental problems. Ms. Aleisha Dobbins, a graduate student with Dr. M. George (see letter in appendix) at Howard, began working on characterizing an SP6 RNA polymerase termination site. We arranged for her to discuss this problem with Drs. Graham Hatfull and Roger Hendrix, co-directors of the Pittsburgh Bacteriophage Institute. This led to a collaboration to determine the complete sequence of the SP6 genome and a complete computational analysis of termination sites in that sequence and in related bacteriophage genomes. This work has been submitted for publication in the Journal of Bacteriology. The second-year intern, Ms. Roxana Cintron, a graduate student with Dr. A. Serrano in the department of Microbiology and Medical Zoology at the UPR Medical Sciences Campus, has built a custom library of Position Specific Scoring Matrices from high quality multiple sequence alignments of all known ABC Transporter proteins from species closely related to the malaria parasite. This custom library is being used to probe the complete genomic sequences of malaria parasites, first to identify all of the ABC transporter genes in these genomes and then to use phylogenetic studies to classify the identified genes into ABC transporter subfamilies. ABC Transporter proteins are implicated in developing drug resistance in malaria parasites.
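As a rough illustration of the kind of scan described above, the following Python sketch builds a log-odds Position Specific Scoring Matrix from a small gap-free alignment and slides it along a sequence to find the best-scoring window. It is not the intern's actual pipeline: the function names, the toy alignment, the uniform background frequencies, and the pseudocount are all assumptions made for the example, and a real analysis would use established tools and handle gaps and realistic background models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free aligned sequences.

    Assumption for illustration: a uniform background frequency and a
    simple additive pseudocount per residue.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        column = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def best_window(pssm, sequence):
    """Return (score, start) of the highest-scoring window, or None if
    the sequence is shorter than the matrix."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][res] for i, res in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

if __name__ == "__main__":
    # Hypothetical four-column motif; not taken from any real ABC transporter.
    toy_alignment = ["GKST", "GKSS", "GKTT"]
    matrix = build_pssm(toy_alignment)
    print(best_window(matrix, "MAAGKSTLLA"))
```

In this toy case the best window starts where the query contains the consensus of the alignment. Scanning a genome with a library of such matrices amounts to repeating the window search for each matrix and keeping hits above a chosen score threshold.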

4. EXPERIMENTAL DESIGN AND METHODS

We will assist the development of a strong, competitive, multidisciplinary bioinformatics program at minority institutions. Each year, two new minority schools will be chosen to participate in a two-year program that provides both training and support for the multiple components necessary for a successful bioinformatics program. The program will focus on incorporating bioinformatics into both the academic and research programs. PSC's NIH Research Resource capabilities, which include additional workshops in computational biology as well as access to a world-class computational facility, will complement this primary effort.

The bioinformatics program will involve four interconnected parts. The first part is an intensive two-week Summer Institute in bioinformatics techniques to be taught by the PSC staff. The participants will be a multidisciplinary team of faculty from any minority institution. The second part is a partnership between the local multidisciplinary faculty team and the PSC to strengthen the local bioinformatics effort, with activities including teaching an introductory bioinformatics course, presenting research or overview seminars, and other activities as appropriate. The third part is a bioinformatics research internship for selected students who have completed an on-campus bioinformatics course. The fourth part is course development to provide templates for schools to use in curriculum design and identification of essential topics.

This extensive training program will yield a cadre of trained faculty able to maintain and extend a program of bioinformatics education at the minority institution and to guide their students and other faculty in utilizing bioinformatics techniques in their research problems. The local campus liaison will be responsible for working directly with the PSC staff to closely monitor the progress of each aspect of the program. During the first year the PSC will play a highly active role in training faculty and students at the minority institutions through the Summer Institute and team-teaching. During the second year the PSC’s role is mostly to consult with and support the faculty at the minority institution as they assume the full role of teaching the on-campus bioinformatics course. During this second year the PSC’s active training role is reduced to training a second intern. The overlap of elements of the program over a two-year period is shown in the diagram below. In the diagram "SI" indicates Summer Institute.

[Diagram: two-year timeline of program elements. PROGRAM YEAR: Summer Institute (SI), which the liaison takes (travel from SI budget); fall course; spring course; then Summer Institute and summer interns, with the liaison visiting the SI and an intern at PSC (travel from liaison budget). NEXT PROGRAM YEAR: fall course; spring course; summer interns, with the liaison visiting the SI and an intern at PSC (travel from liaison budget).]

This multiphase program is designed not only to train individual faculty and graduate students in various aspects of bioinformatics but also to create a bioinformatics community at each minority institution. Such a community is a critical aspect of establishing a long-lived effort, providing the individual faculty members an intellectual environment of shared expertise and insights into their common problems and concerns. Further, the PSC program provides opportunities for building a multi-institutional bioinformatics community among the participating minority institutions.

In the following sections we describe in detail the four-part program, followed by the management and evaluation plan.

4.A SUMMER INSTITUTE

We will present a two-week-long Summer Institute using PSC’s training facility to train at least fifteen faculty and staff from any department at a minority institution. The course covers the computational techniques of bioinformatics, particularly the analysis of nucleic acid and protein sequences and their structures, with emphasis on extracting the maximum amount of information from known macromolecular sequences arising from the participants’ experimental activities and from sequence and structure databases. The scholarly goals of the Summer Institute are to teach:

How bioinformatics can contribute to and strengthen research programs in molecular biology, biochemistry and allied biomedical sciences.

What information is obtained from specific analyses, what tools are available to perform the analyses, and how to use these tools.

How the mathematical methods underlying the analyses and the parameters controlling the analyses are related to fundamental properties of the biological system being analyzed.

How each of the core competencies of the participating faculty can contribute to important biomedical research problems.

In addition, the participants will analyze, when possible, a specific problem from the research on their local campus. The programmatic goals of the workshop are that it provides:

A basis for developing an on-campus course that will have direct effects on students' research projects.

Reading, lecture materials, and computer laboratory exercises that can be transferred to an on-campus course.

A concrete plan of action for developing and presenting on-campus courses.

The two-week Summer Institute will be evaluated on how well these goals are met.

The multidisciplinary faculty from the minority institutions will include people with the diverse set of core competencies required for an effective bioinformatics program. These skill sets include those within the Biology-domain (e.g., biology, genetics, chemistry), Computational-domain (e.g., computer science, information science) and Modeling-domain (e.g., mathematics, statistics, engineering). The creation of an effective bioinformatics program requires each of the members of the team, with their very diverse skills, to apply their own skills to a common set of core bioinformatics problems while learning to recognize the applicability of, and draw on, the skills of the other team members to contribute to an overall success.

The training will focus on how to gain the maximum amount of information from the vast amount of data being generated by the large-scale projects in genome sequencing and the national initiative to determine a representative structure for each protein fold. The lecture sessions will emphasize the conceptual bases of this issue and of the current state-of-the-art solutions to individual problems. This will include explaining the fundamental biological problem, how the relevant biological principles are expressed mathematically, and the computational foundations of proposed solutions. Thus the lectures will vividly show the integration of multiple academic disciplines into an overall solution.

Table III. Proposed agenda for Summer Institute:

Monday

Morning

Overview of Bioinformatics

Afternoon

Dual Sessions: CS and Biology

Tuesday Introduction to sequence-analysis

Wednesday Pairwise alignments/Similarity

Matrix

Dual Sessions: CS and Biology

Dual Sessions: CS and Biology

Thursday

Friday

Monday

Tuesday

Multiple Sequence Alignment Pattern identification/representation

Overview of Computational Biology Dual Sessions: CS and Biology

Student Project Presentation

Interpretation of Results

Isozyme analysis

Computational Structural Biology

Wednesday Statistical Considerations

Thursday Phylogenetic Analysis

Proteomics and Genomics

Open Problems

Friday Local Bioinformatics Support Evaluation

The basic computational techniques begin with sequence database searching and progress to include a variety of methods for both global and local multiple sequence alignments. Techniques for identifying informative patterns from these alignments will be presented. The course will teach the more sophisticated techniques for discovering informative patterns in groups of sequences that have not been aligned and which may be unrelated and unalignable outside of the pattern. The course will also teach how to use such patterns to identify further macromolecular sequences related to those under investigation and how these patterns can be used to guide laboratory experiments by providing insight into the functionally and structurally important parts of the sequences. The course will teach how to integrate the information derived from the comparative analysis of macromolecular sequences with three-dimensional structure information by using the techniques of homology modeling and molecular mechanics and dynamics. Finally, examples will be provided of how to combine these analyses to derive testable hypotheses that can be investigated in laboratory experiments as well as how these techniques can be used to statistically evaluate hypotheses.

"Hands-on" computer laboratory sessions teach the participants to master a variety of programs to attack the different problems they will encounter in experimental projects. We emphasize the relevance of these techniques to the participants' research needs by encouraging them to bring to the course sequence data from the research being performed on their campus.

Faculty participants who are from non-biological sciences departments such as mathematics will benefit from collaborating with their biologist colleagues on these projects by gaining detailed insights into biological research and by having a practical opportunity to see how their non-biological skills can contribute to solving these problems. At the same time these collaborative projects will provide the biologist faculty members with a concrete experience in the benefits of taking advantage of the different skills of other academic disciplines. This didactic technique of "hands-on" computer laboratories strongly reinforces the lecture sessions, where considerable emphasis is placed on the conceptual underpinnings of the computational techniques and how they are related to fundamental concepts of molecular biology. A further aim of the proposed workshop is to teach the participants to select computational techniques appropriate to particular research problems. The PSC-developed MakSeq program will hide the details of the different computers from the participants so that they can concentrate on the science.

4.B PARTNER SCHOOLS

We will establish partnerships with two new minority institutions each year, with these partnerships formally lasting two years. Our past experience is that our presence on campus will help provide visibility and potentially credibility to the local bioinformatics effort. Thus, the liaison will serve as a nucleation point for the local bioinformatics effort and as a dissemination point for the activities of PSC’s Biomedical Initiative. After the two years of support for the liaison, the bioinformatics initiative should be well integrated into the normal flow of campus teaching and research activities. The expectation is that the ties established during the program will continue after the liaison's term, with continued communication and potentially collaborative research projects with the PSC. The specific aspects of the liaison are:

1.) Support from this grant will be provided to the local university to obtain release time for a local liaison to direct the local bioinformatics effort and work with PSC’s staff.

2.) Depending on the size of the faculty, ideally 3-6 faculty and staff members from the local institution would attend the Summer Institute in the first year of the partnership, and as many additional faculty and staff members as possible in subsequent years.

3.) PSC will team-teach in a bioinformatics course at the local institution, with Nicholas and Deerfield each visiting each campus twice.

4.) Nicholas and/or Deerfield will present at least one seminar prior to the start of the program, and each will give at least one research seminar during the program.

5.) In collaboration with the local liaison, we will attempt to identify one or more student interns to work for a five-week period at the PSC.

The PSC has identified partner schools for the first two years of this proposal (see Figure 1). We are currently in negotiations with faculty at the University of the District of Columbia as a possible FY03 school. We will continue discussions with a number of both large and small minority institutions to continue finding excellent partners. The primary selection criterion is a motivated bioinformatics effort that has wide support among both faculty and the administration. The four identified schools all have this in common.

They represent both a continuation with schools of the type with which we have already had success in this program and a branching out to schools that represent different environments and new challenges. The University of Texas at San Antonio (UT-SA) is a large, state-supported university with an active life sciences research program, making UT-SA a logical extension to what we have done before with Howard University and the Medical Sciences Campus at the University of Puerto Rico. In addition, UT-SA has strong computer sciences and mathematical science departments that are interested in participating in a bioinformatics program. We feel that this makes UT-SA a strong partner.

The University of Puerto Rico at Mayaguez (UPR-M) is a small, state-supported university that is primarily an engineering campus. Thus, while the biological sciences are represented on campus, the computer sciences and mathematical sciences departments are the strengths of this campus and the locus for our interaction. We feel that experience in this type of environment will strengthen our ability to assist primarily engineering schools.

The University of Texas at El Paso is a medium-sized, state-supported university that, in recent years, has been vigorously expanding its graduate programs and degree offerings. This offers a unique opportunity to help incorporate bioinformatics into a young and growing research environment where it can then have an important role in defining future directions for the growth of the institution and its programs.

Johnson C. Smith University (JCSU) in Charlotte, NC is a small liberal arts undergraduate college that has been in existence for over 135 years. Its strength is a strong commitment to innovation and to making computer technology a central feature of its educational program. In 2000 it became the first historically black university to join IBM’s program of equipping every student with a laptop computer. Given the drive that JCSU has made towards moving forward, we feel that bioinformatics should be an excellent fit into this learning environment.

Figure 1. The partner universities in the original grant (grey boxes) and this proposal, where UPR-RCM is the Medical Sciences Campus at the University of Puerto Rico, NC Central U is North Carolina Central University, UT-SA is the University of Texas at San Antonio, UPR-M is the University of Puerto Rico at Mayaguez, UT-EP is the University of Texas at El Paso, and JCSU is Johnson C. Smith University. The remaining partners have yet to be determined.

DATES            Partner #1A    Partner #1B    Partner #2A    Partner #2B
FY01 (AY01-02)   Howard Univ.
FY02 (AY02-03)                                 UPR-RCM
FY03 (AY03-04)   NC Central U   Morgan State
FY01 (AY04-05)                                 UT-SA          UPR-M
FY02 (AY05-06)   UT-EP          JCSU
FY03 (AY06-07)                                 3-1            3-2
FY04 (AY07-08)   4-1            4-2
FY05 (AY08-09)                                 5-1            5-2

4.B.1 Liaison will serve as the clear point of contact for PSC staff contacting the local institution. The liaison will be selected by the local administration, but should be a faculty member heavily involved in the local bioinformatics effort. Support from this grant will be provided to the local university to obtain release time for a local liaison to direct the local bioinformatics effort and work with PSC’s staff.

The liaison will attend the two-week Summer Institute prior to the school year in which the bioinformatics course will be offered. The liaison will then work with the PSC staff and evaluator during the actual academic course, including handling local arrangements and presumably giving a major part of the lectures delivered by local instructors. During the course the liaison will work directly with the PSC staff to ensure the continuity of the course. The following summer the liaison will select a graduate student to participate in the student intern program. In addition, the liaison will attend one week of the second-year Summer Institute to describe the issues associated with the program on their campus, hold extensive discussions with the evaluator, and offer suggestions on how to improve the program (Summer Institute, academic course and intern program) for subsequent years.

4.B.2 Ideally 3-6 faculty and staff members from the local institution would attend the Summer Institute in the first year of the partnership, and as many additional faculty and staff members as possible in subsequent years.

4.B.3 PSC will team-teach in a bioinformatics course at the local institution, with Nicholas and Deerfield each visiting each campus twice. During the academic year, PSC staff will travel to the partner institutions to participate in team-teaching an introductory bioinformatics course. This bioinformatics course is expected to become a permanent part of the curriculum at that institution. The teaching team will include PSC staff and members of the faculty team that attended the Pittsburgh workshop. In negotiation with the local liaison, the PSC staff will make a minimum of three three-day trips to the selected institutions to deliver lectures, assist in hands-on computer laboratories, and consult with faculty and students on term projects for the course.

During the second year of liaison support the course will be taught with minimal involvement by the PSC staff, although the PSC will provide lecture notes, class exercises, and consulting. In separate visits, the evaluator will visit the school to assess the impact of this program.

[Diagram labels: INSTRUCTION PORTION OF COURSE GRADING; SEMINAR; EVALUATOR]

This on-campus bioinformatics course has the same scholarly goals as the two-week Summer Institute except that they apply to the students rather than to the faculty teams. We strongly suggest that this course be a project course, with one example project being the complete analysis of a sequence family associated with the research program at the local institution. This analysis will be modeled on analyses that have proven useful in a variety of research projects (Perozich et al., 1999; Nicholas et al., 1999; Resing et al., 2000). Programmatically, this analysis is the first step in introducing bioinformatics techniques into the research repertoires of the laboratories represented by the students, which is a major programmatic goal of this phase of the program, along with providing the students with individual skills in the use of bioinformatics analysis that will expand the research opportunities available to them as their careers progress. These goals provide the basis for evaluation of this stage of the program.

There are several reasons for believing that this collaborative team-teaching approach will assist our program in establishing bioinformatics as a viable, on-going part of the research environment and curriculum at the selected minority institutions. Perhaps the most powerful of these reasons is the critical role that similar collaborative team-teaching efforts have played in building the PSC's highly regarded workshop program in computational biology, especially the weeklong NIH-NHGRI sponsored workshop in sequence analysis. As this workshop has developed and incorporated new material over the years, we have invited internationally recognized scientists, in many cases the original developers of the material we wished to add to the workshop, to join us and participate in teaching the workshop. This allowed us to observe the presentation of the material by someone very familiar with the technique and see what was essential and what was less important. It is also an effective way of increasing our depth of understanding of the material. The team-teaching will keep the local faculty and PSC staff functioning together in a joint problem-solving situation, which will contribute to the coherence of the newly established bioinformatics program. The course will provide recognition at the institutional level that together they constitute a new and valuable resource - an emergent bioinformatics program.

The PSC will provide teaching materials both for lectures and for hands-on computer laboratory sessions in this course. This should significantly lower the barrier to establishing a new course on campus by greatly reducing the amount of new material that must be prepared. We will also travel to the campus to present lectures, assist in hands-on computer laboratory sessions, and consult with faculty and students. Again this should lower the barrier to establishing a new course. It also provides an opportunity for face-to-face consulting on problems that the minority institution faculty team may encounter in presenting the material for the first time and thus further consolidate their mastery of the new material.

4.B.4 Seminars will be presented by Nicholas and/or Deerfield at least once prior to the start of the program, and each will give at least one research seminar during the program. Deerfield has developed a well-received presentation using state-of-the-art scientific visualizations. These presentations will help maintain institutional support and awareness of the developing bioinformatics program. The visibility will be further enhanced by having the PSC staff present research talks outside of the course, perhaps as part of a regular seminar series for faculty and graduate students. More importantly, both Deerfield and Nicholas have given seminars on the same material to math, biology and CS departments. Thus, they can help bring out the point that a multidisciplinary approach is needed to work on bioinformatics issues.

4.B.5 Student interns will be identified by the local liaison in discussions with PSC staff. This will keep the new bioinformatics effort active by supporting the development of preliminary results with these interns. Thus, after the internship, we would expect that the intern’s advisor will have sufficient results to start writing a competitive research grant.

4.C INTERN PROGRAM

We propose to bring two to four students to Pittsburgh each summer for a five-week research internship in bioinformatics. At least two students will come from one of the four current partner institutions. The research project will be negotiated between the PSC staff and either the intern's thesis advisor or the institution’s liaison. One example would be the student carrying out an exhaustive and rigorous analysis of a family of sequences, one or more of which are the focus of a research project in which the student is involved at the home institution. This analysis will go beyond the scope of the term projects done for the on-campus bioinformatics course and make extensive use of the PSC's high performance sequence analysis facilities.

Further goals that should be accomplished while focusing on the primary goal include a greater mastery of the bioinformatics concepts and tools introduced in the on-campus bioinformatics course as well as the mastery of additional bioinformatics techniques and computational tools.

This type of internship project will be focused toward producing analyses and papers similar to those that have proven useful in a variety of research projects (Perozich et al., 1999; Nicholas et al., 1999; Resing et al., 2000).

The primary evaluation of the internship will be based on the completion of the extensive analysis and on the writing, submission, and publication of a refereed research paper in which the analysis is a central feature. At the end of the summer internship a reasonable first draft of the paper should be written. The paper should be completed, submitted, and accepted within a year. Additional evaluation will be based on the degree to which the interns have increased their mastery of tools and analyses.

4.D COURSE DEVELOPMENT

During the first three years of this grant, it has become clear that institutions are developing bioinformatics efforts with only minimal release time. Thus, faculty have examined whether existing courses can be used in a bioinformatics curriculum. From our experience, however, an appropriate course in bioinformatics consists of a number of small sections taken from four or five courses. We have also watched a number of institutions go through the design and development of bioinformatics degrees and departments, and the politics that accompanies these efforts. At most universities a department, either current or to be created, must be the home of a degree. But bioinformatics, as we understand it, requires a multidisciplinary approach involving at least three specialties: Biology-specific (e.g., biology, molecular biology), Computational-specific (e.g., CS, IS), and Modeling-specific (e.g., mathematics, statistics). Thus, if the degree is housed in a single department, a number of other departments that should be performing bioinformatics are excluded from offering it. We have adopted a model first proposed to us by Howard University, where degrees would be conferred by originating departments but, upon completion of a core set of courses, a certificate is awarded acknowledging this competency.

We propose to create courses to serve as the core of a certificate program in bioinformatics, but we feel that these courses should be easily modified for schools that have either created a bioinformatics department or offer the degree from an existing department. The concept is that a student will major in a traditional department (e.g., math, CS, biology), but will wish to have demonstrated a core competency in bioinformatics. Thus, for each of the specialties (Domain vs. Computational vs. Modeling), we propose to develop two area-specific courses plus one survey course that would be taken by students in the other two specialties. Each student would therefore take four bioinformatics core courses: two survey courses in the other specialty areas and two specialty courses within their major. Our primary design goal for the courses is that students not only gain a conceptual and intellectual understanding of the material in each course, but also are able to apply the course material to a bioinformatics research project. To this end, the courses will be designed with the computer laboratory sessions and the completion of a student project as a major part of the course activity and integral to the learning experience.
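The course-selection rule described above can be illustrated with a minimal sketch. The specialty labels and the function name core_courses are hypothetical, chosen only for illustration; the sketch assumes exactly the rule stated in the text (two survey courses from the other two specialties plus two specialty courses in the student's own area) and is not a description of any registration system.

```python
# Hypothetical labels for the three specialty areas named in the text.
SPECIALTIES = ("Biology", "Computation", "Modeling")


def core_courses(major_specialty):
    """Return the four certificate core courses for a student in one specialty.

    Assumed rule (from the text): the survey courses of the two OTHER
    specialties, plus the two specialty courses of the student's own area.
    """
    if major_specialty not in SPECIALTIES:
        raise ValueError(f"unknown specialty: {major_specialty}")
    surveys = [f"{area} survey" for area in SPECIALTIES if area != major_specialty]
    specialty = [f"{major_specialty} specialty course #{n}" for n in (1, 2)]
    return surveys + specialty


if __name__ == "__main__":
    for area in SPECIALTIES:
        print(area, "->", core_courses(area))
```

Under this rule, the nine courses (three per specialty) are shared across all students, while each individual student completes only four of them.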

We have chosen three different institutions to work with in the course development: one is a medical sciences campus (UPR-RCM), one is a strong engineering campus (UPR-M), and the third is a liberal arts campus that is strong in the sciences (NCCU). Thus, these institutions provide a representative sample of the types of institutions that are being targeted by this development effort. The first and last schools were also partner schools in the first three years of this project, while the second school (UPR-M) is a partner school in the first year of this grant. In fact, all course developers have attended a Summer Institute, and both Deerfield and Nicholas have individually (plus together) visited every campus and presented bioinformatics seminars.

Dr. Ricardo Gonzalez, at the University of Puerto Rico-Medical Sciences campus (UPR-RCM), has been involved with PSC’s Biomedical Initiative for about a decade. He has attended a number of our Research Resource-sponsored workshops, was our host and sponsor when we presented a two-day roadshow at UPR-RCM, was our liaison during the prior grant period, and team-taught a UPR-RCM bioinformatics course with PSC. Dr. Gonzalez will develop the Biology-specific courses. Dr. Jaime Seguel and Dr. Bienvenido Vélez, at the University of Puerto Rico at Mayaguez campus (UPR-M) Department of Electrical and Computer Engineering, will develop the Computation-specific courses. Dr. Alade Tokuta, chairman of Mathematics and Computer Sciences at North Carolina Central University (NCCU), will develop the Modeling-specific courses. Each course developer has been teamed with a PSC staffer. Dr. Wymore, a structural biologist, will team with Dr. Gonzalez. Mr. Ropelewski, lead bioinformatics programmer and instructor, will work with Drs. Seguel, Vélez, and Tokuta. All individuals know each other well and are comfortable working together. This level of interaction has produced a strong, well-integrated team highly motivated not only to carry out the development of high quality courses but also to ensure that the courses fit well together.

The courses will be modular, with an entire semester course divided into four to six modules. Each module will be composed of a coherent, self-contained set of related concepts that are logically presented together. This will make it easy for course adopters to modify each course to fit the specific needs or conditions of their institution. For instance, a module could be expanded to take advantage of the presence of faculty who are particularly strong in the area of that module. Other modules could be shortened or left out entirely to make a place for a new module developed by the adopting institution. This avoids the mistake of presenting a “one size fits all” course that is almost impossible to adapt to the differing conditions of various institutions.

A brief description for eight of these courses (one survey and two domain-specific courses for the three specialty areas) is given in Table IV, with six representative syllabi and course descriptions included in the Appendix. Each specialty area offers its own potential problems in defining the specific topics to be included in each of the courses. For example, the mathematics survey and specialty courses may need to be further divided into two different series of courses: one for students interested in continuous (dynamical) systems versus one for students who are interested in discrete (combinatorics or Markov models) systems. We will develop this as a single series and then, in the third year, evaluate the performance of the classes to determine what topics work best together in the course developers’ home institutions. We will re-evaluate our distribution of topics based upon the evaluation results and determine if more courses are needed. After distribution, we will then evaluate how well the curriculum works at other universities.

Each module will contain a complete set of annotated visual instructional materials, generally a PowerPoint presentation with user notes describing the points to be made for each slide. There will also be appropriate textbook and primary literature references to provide explanatory material related to each concept, plus fully worked-out and tested laboratory exercises to go with the module. This might include instructions on where and how to obtain appropriate freely available software along with example datasets and the annotated results to be expected from analyzing these datasets. Of course, the main component for each laboratory exercise is a well-tested set of instructions to direct the student in performing the exercise. Finally, there will be guidelines for the student project as it relates to that module. This would include describing what features a project must have if it is to engage a student in applying the material from a specific module. The guidelines will include suggestions on how to ensure that the project is sufficiently challenging to increase the student’s mastery of the module but not so challenging as to be impossible for the student to complete.

Table IV. Topics for the survey and two specialty courses.

Biology-specific
    Developer: Gonzalez, UPR-RCM
    PSC staff contact: Wymore, PSC
    Survey: Central dogma; DNA, RNA, and protein sequence and structure; Genome structure; Sequence evolution; Basic genomics
    Specialty Course #1 (Sequence based): Multidisciplinary project teams; Information theory; Bioinformatics as biological domain, math model, and computational method
    Specialty Course #2 (Structure based): Fundamental properties and structure formation; Classification libraries; Prediction; Phylogenetics from structure; Integrate with sequence analysis

Computation-specific
    Developer: Seguel, Vélez, UPR-M
    PSC staff contact: Ropelewski, PSC
    Survey: Models of computing; Algorithms; Data structures; Computer languages; Software development
    Specialty Course #1 (Algorithm development): Deterministic versus probabilistic; Strings; Trees; Grammars
    Specialty Course #2 (Data management): Relational model; Query models; Feature extraction; Data models; Document ranking

Modeling-specific
    Developer: Tokuta, NCCU
    PSC staff contact: Ropelewski, PSC
    Survey (Introduction to): Discrete math; Graph theory; Probability; Distributions; Bayes theorem; Matrix operations; Eigenvectors
    Specialty Course #1 (Model sequence families): Distribution theory; Combinatorics; Graphs and trees

The budget has resources for both a course consultant and an advisory committee (see below). The course consultant will be paid to work with the course developers on a specific project. For example, in the first year, we will be hiring Dr. Michael Feig (Assistant Professor at Michigan State University) to assist Gonzalez in developing the structural bioinformatics course. Dr. Feig is a developer of MMTSB (Multiscale Modeling Tools in Structural Biology), which will serve as the primary software tool in the course. This is an excellent opportunity to involve a toolset developer in the development of an academic course that will use his toolset. In subsequent years, we will examine the needs and opportunities of the overall project to decide which additional consultants are required.

The material developed under the auspices of this grant will be freely available to academic faculty and staff. The copyright will remain with the developer’s institution, with all rights reserved. The PSC will maintain a distribution site (under license from the developing institutions) where individuals will register and then download all pertinent course material. The license accompanying the material will not allow the individual to redistribute the material or use it commercially. The individual will, however, have the right to use and reproduce the material for teaching or related activities. The individual will be asked to acknowledge the grant and will be requested to return any comments, feedback, or additional material to be included in future releases (IP to remain with the contributor).

The most mature of the proposed courses are the Biology-specific courses, where both Dr. Gonzalez and the PSC have taught the first proposed biology-specific course at the graduate and undergraduate levels. As a part of this effort, PSC (both Wymore and Deerfield) will work closely with Gonzalez to develop the structural bioinformatics course, and PSC staff will present lectures in the subsequent UPR-RCM course using VTC technology (UPR-RCM has a number of classrooms wired for VTC and the PSC maintains two VTC facilities). Torres will evaluate the quality of the remote presentations relative to those presented locally. Thus, we will begin developing the expertise required to extend the course material from what has to be presented locally to be effective to material that can be presented remotely. PSC will also be presenting a computational neuroscience course over VTC during the ’04-’05 academic year. This work, sponsored by the NSF, is evaluating the use of remote classes for EPSCoR states. Both efforts are important since we anticipate that not all schools will be able to staff an entire bioinformatics effort and, if effective, remote presentations could provide a solution to this problem. In addition, the PSC will continue to develop on-line tutorials on various topics in sequence analysis: Pairwise alignment and database searching (Nicholas et al., 2000, 2003), Multiple sequence alignment (Nicholas et al., 2002), and Pattern and motif identification and techniques (projected second quarter 2004).

4.E MANAGEMENT PLAN

The PIs, Nicholas and Deerfield, will work to assure that this project remains successful. Nicholas will act as the technical lead for this project, assuring that the course remains at the cutting edge of bioinformatics. Deerfield will act as the managerial lead, negotiating, establishing, maintaining, and administering the numerous subcontracts that are included with this project. This distribution of labor is consistent with each individual’s strengths.

The curriculum development group includes Nicholas and Deerfield as the PIs of the project; Gonzalez (UPR-RCM) and Wymore (PSC) as the Biology course pair; and Tokuta (NCCU), Seguel (UPR-M), Vélez (UPR-M), and Ropelewski (PSC) as the Math and CS course pairs. The curriculum development group, plus the evaluator, will review the progress of the course development and discuss future developments quarterly. The first meeting will be at the Summer Institute, where the curriculum development group plus evaluator will have focused discussions with the liaisons about how the curriculum could be included in their local environment and, when appropriate, how the implementation has worked out. The second meeting will be an annual face-to-face meeting near one of the course development partners’ institutions and will include the course development group, evaluator, course development consultants, and an external advisory committee. The third and fourth meetings, which will be held using VTC technology, will include the course development group, the evaluator, and (as appropriate) the current campus liaisons.

Publicity. We will announce the availability of this program for assisting bioinformatics programs at minority institutions through seminars at appropriate conferences (e.g., the MORE Program Directors meeting that was held in Reno, NV last summer), mailings to appropriate deans and departmental chairpersons, and personal contacts through email and phone calls. At present we have a mailing list of over 100 such institutions and will perform an extensive search to ensure the list is complete. Furthermore, we will work with the liaisons to augment the dissemination of information about the workshops, both those described here and those offered by the PSC under different auspices.

Selection procedure.

For the Summer Institute, we have established a web site that includes the tentative agenda, a description of the course, and the ability to apply online. If the workshop is over-subscribed (greater than 15 applicants), then the faculty and staff at current partner institutions will be given first priority for the workshop positions. If there is a large number of applicants from a partner school, we will work with the liaison to prioritize the list. To ensure that the class is multidisciplinary, we will use the originating department as one of the selection criteria. Ultimately, we will encourage faculty and staff from all minority institutions to apply to the Summer Institute. For the partnership program, we will look for highly motivated faculty who have a supportive administration. The key to success for the local school is the liaison. We will talk with our current liaisons, course developers, and NIH program staff when looking for new partners. For the Intern program, we will accept applications from any student at a minority institution. Preference will be given to students who have gone through partnership universities’ bioinformatics courses, but all students will be considered.
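The Summer Institute priority rule can be sketched as follows. This is only an illustration under stated assumptions: the Applicant record, the function name select_participants, and the capacity of 15 seats (inferred from the over-subscription threshold in the text) are hypothetical, and the sketch deliberately omits the liaison's prioritization and the department-balance criterion, which the text leaves to judgment.

```python
from dataclasses import dataclass

# Assumption: 15 seats, inferred from "over-subscribed (greater than 15 applicants)".
WORKSHOP_CAPACITY = 15


@dataclass
class Applicant:
    name: str
    institution: str
    department: str


def select_participants(applicants, partner_institutions, capacity=WORKSHOP_CAPACITY):
    """Give partner-institution applicants first priority when over-subscribed."""
    if len(applicants) <= capacity:
        # Not over-subscribed: everyone who applied is accepted.
        return list(applicants)
    # sorted() is stable, so application order is kept within each group;
    # the key is False (sorts first) for partner-institution applicants.
    ranked = sorted(applicants, key=lambda a: a.institution not in partner_institutions)
    return ranked[:capacity]


if __name__ == "__main__":
    pool = [Applicant(f"applicant{i}", "Other University", "Biology") for i in range(15)]
    pool.append(Applicant("late-partner", "Partner University", "Mathematics"))
    chosen = select_participants(pool, {"Partner University"})
    print(len(chosen), chosen[0].name)
```

In practice the final list would also be adjusted by hand so that the class remains multidisciplinary across originating departments.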

4.F EVALUATION OF OVERALL PROGRAM

We will have a professional, experienced evaluator carry out a rigorous evaluation of each of the three components of our proposed program every year. The evaluator, based at the University of Puerto Rico and experienced in evaluating minority training and research programs, will travel to each Summer Institute to initially meet with the entire faculty team for both primary institutions for the current year as well as the other workshop participants and summer interns. This initial meeting will ensure that evaluation is an integral part of the plan for the year and is included from the start. The evaluator will travel to each local campus bioinformatics course twice. The first trip will be early in the course, coinciding with the first PSC visit to the local campus, and the second trip will coincide with the last PSC visit to the local campus for grading and evaluation of term projects from the course.

In response to a request by NIH-NHGRI several years ago we developed a questionnaire to evaluate the effectiveness of the bioinformatics (sequence analysis) workshop we teach under a T-15 grant. We have found this questionnaire very helpful for improving the workshop and keeping it up to date. The NHGRI reviews of this workshop generally comment positively on the information contained in this questionnaire that we submit, both the summarized and the complete responses, with the annual progress report. The questionnaire asks about the content of the lectures, the lecture handouts, and readings that accompany the workshop; whether the material was useful, related to the needs of the participants, and well presented. The participants are asked what material was most useful to them, what was least useful, what factors led to this judgement, and what additional topics they would like to see covered. The questionnaire is included in the appendix. With the assistance of the professional evaluator we will adapt this questionnaire to each of the three components of the proposed program: the Summer Institute, the local campus bioinformatics course, and the internships. The evaluation process will be conducted not only to identify successes, but also to identify and correct shortcomings in our initial approach.

The current questionnaire focuses on whether the workshop provided the participant with the background needed to incorporate the bioinformatics techniques presented into the active research program in their laboratories. For the proposed workshop the questionnaire will need the additional, primary focus on whether the workshop is effective in providing a basis for creating a bioinformatics course on the participants' local campuses. Adaptation of the questionnaire for use with the local campus bioinformatics course will again focus on how well the course prepared the students to apply the newly learned bioinformatics techniques to current and future research projects. Adaptation of the questionnaire to the summer internship program will focus on how well bioinformatics techniques have been integrated with the regular experimental protocols and procedures of the laboratory of the interns and their faculty advisors as well as whether the internship provided effective guidance in preparing a research paper.

Additional continuing evaluation will be done in connection with the local campus bioinformatics course. Important indicators of success will be whether the course continues to be offered every year and how many students take the course compared to the number who were eligible to take it. We will also request a syllabus for subsequent offerings of the course to see whether coverage expands and new material is incorporated. We will work with the evaluator to find suitable measures of how well the course has conveyed essential concepts.

We will also follow up in each of the laboratories from which summer interns and their mentors were selected to see that bioinformatics techniques continue to be regularly used in the laboratory and that new students entering the laboratory are introduced to these tools. We will also obtain the number of papers, theses, and dissertations from these laboratories that have made use of these techniques. We will also inquire whether the laboratory has applied for and obtained funding for research projects that incorporate bioinformatics techniques in the research methods used to solve the problems.

Ultimately, the measure of success for this program is the extent to which the participating institutions and faculty continue to provide effective training in bioinformatics to their students and the extent to which the PSC program fostered that training.

4.G LEVERAGE OF PSC’s RESOURCES

The use of the PSC computing facilities for the two-week Summer Institute, the local campus bioinformatics course and associated term projects and research projects, the summer internship program, and the course development will have a number of advantages.

The experience of faculty and support personnel who attend the Summer Institute will be directly applicable to teaching the computer laboratory sessions during the local bioinformatics course the following academic year.

It eliminates the very large task of immediately, and in a short time, creating a local facility with sufficient computer hardware, software, and data resources to teach the course.

It allows the time otherwise necessary to learn the use of a second facility to be directed to the challenge of creating and teaching the course.

Summer interns will already be familiar with PSC systems when they arrive in Pittsburgh, thus no time will be lost to learning a new system rather than new science.

It makes the PSC staff more effective in assisting with computer laboratory sessions and with students’ homework and term projects.

PSC computing facilities will be used in teaching the course to ensure that the course has adequate facilities for teaching. They will also provide an example for local computing centers to follow in developing their own support program.

4.H SUMMARY

We have designed a program to assist bioinformatics programs at minority institutions through a technology transfer operation that provides a two-year series of training activities in bioinformatics. During the first year the PSC plays a highly visible and active role in training faculty and students at the minority institutions. During the second year the PSC’s role is mostly consulting with and supporting the faculty as they assume the task of teaching the on-campus bioinformatics course. The proposed technology transfer program involves three parts designed to provide both immediate and long-term increases in the research opportunities available to minority scientists. The aims for each of the three components are to: create a multidisciplinary core group of faculty with knowledge and interest in bioinformatics; establish bioinformatics as part of the curriculum at partner institutions and create a group of students knowledgeable in bioinformatics techniques; and integrate bioinformatics procedures into the repertoire of research tools used at the institution. An important part of the evaluation will be to measure how much each targeted group has increased their bioinformatics knowledge and skills during each component of the program.

Each of these parts builds on proven strengths of the PSC to carry out bioinformatics training and research. The PSC staff has an extensive and successful background in carrying out almost identical programs in the past. The program will enable minority scientists to make effective use of the vast amount of information being produced by various genome sequencing, gene expression, and macromolecular structure determination initiatives around the world.

E. HUMAN SUBJECTS: none

F. VERTEBRATE ANIMALS: none

G. REFERENCES

Bailey, T.L. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California.

Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. 1989. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412-4415.

McClain, W.H. and Nicholas, H.B. Jr. 1987. Discrimination between transfer RNA molecules. Journal of Molecular Biology, 194:635-642.

Mirny, L.A. and Shakhnovich, E.I. 1999. Universally Conserved Positions in Protein Folds: Reading Evolutionary Signals about Stability, Folding Kinetics and Function. J. Molecular Biology, 291:177-196.

Nicholas, H.B. Jr., Chan, S.S., and Rosenquist, G.L. 1999. A Re-Evaluation of the Determinants of Tyrosine Sulfation. Endocrine, 11:285-292.

Nicholas, H.B. Jr., Deerfield, D.W. II, and Ropelewski, A.J. 2000. Strategies for Searching Sequence Databases. Biotechniques, 28:2-14.

Nicholas, H.B. Jr., Deerfield, D.W. II, and Ropelewski, A.J. 2003. Strategies for Searching Sequence Databases. In: "Biocomputing: Computer Tools for Biologists", S.M. Brown (Ed), Eaton Publishing, pp. 209-231.

Perozich, J., Nicholas, H.B. Jr., Wang, B-C., Lindahl, R., and Hempel, J. 1999. Relationships Within the Aldehyde Dehydrogenase Extended Family. Protein Science, 8:137-146.

Ptitsyn, O.B. and Ting, K-L.H. 1999. Non-functional Conserved Residues in Globins and their Possible Role as a Folding Nucleus. J. Molecular Biology, 291:671-682.

Resing, K.A., Pearton, D.J., Nicholas, H.B. Jr., Yeh, J., Hoofnagle, A.N., and Dale, A.D. 2000. Function and evolution of the S-100 domain of rat profilaggrin. Biochemistry (submitted).

Ropelewski, A.J., Nicholas, H.B. Jr., and Deerfield, D.W. II. 1997. Implementation of Genetic Sequence Alignment Programs on Supercomputers. J. Supercomputing, 11:237-253.

Ropelewski, A.J., Nicholas, H.B. Jr., and Deerfield, D.W. II. 2000. Selective and Sensitive Comparison of Genetic Sequence Data on High Performance Computers. In: Parallel Applications Technology Program, Koniges, A. (Ed), Morgan Kaufmann, pp. 453-479.

Hempel, J., Lindahl, R., Perozich, J., Wang, B-C., Kuo, I., and Nicholas, H. 2001. Beyond the catalytic core of ALDH: a web of important residues begins to emerge. Chem-Biol Interact, 130:39-46.

Hempel, J., Kuo, I., Perozich, J., Wang, B-C., Lindahl, R., and Nicholas, H. 2001. Aldehyde dehydrogenase: Maintaining critical active site geometry at motif 8 in the class 3 enzyme. European Journal of Biochemistry, 268:722-726.

Hempel, J., Perozich, J., Wymore, T., and Nicholas, H.B. 2003. An Algorithm for the Identification and Ranking of Family-Specific Residues Applied to the ALDH3 Family. Chemico-Biological Interactions, 143-144:23-28.

Nicholas, H.B. Jr., Arst, H.N. Jr., and Caddick, M.X. 2001. Evaluating low level sequence identity: are Aspergillus Quta and Arom homologous? European Journal of Biochemistry, 268:414-419.

Wymore, T., Nicholas, H.B., and Hempel, J. 2001. Molecular dynamics simulation of class 3 aldehyde dehydrogenase. Chem-Biol Interact, 130:201-207.

Wymore, T., Deerfield, D.W. II, Field, M.J., Nicholas, H.B., and Hempel, J. 2003. Initial Events in class 3 Aldehyde Dehydrogenase: MM and QM/MM simulations. Chemico-Biological Interactions, 143-144:75-84.
