Software Infrastructure for Research
Charisse Cotton, Sarah Keefe, Professor Judith Stafford
Tufts University Computer Science Department

Introduction
Researchers in all fields use software tools to collect data, analyze results, create graphs and charts, and look up information. However, the vast number of options available leads to inconsistencies in areas such as file types, storage, and output. There is currently no unified way to handle information from a large research project.

We interviewed researchers currently working on biomedical projects about how they use software in their research. The findings informed a proposal that Professor Stafford submitted to the National Institutes of Health. Professor Stafford hopes to obtain funding to create a software infrastructure that helps biomedical researchers reach their goals more quickly and efficiently.

Example Interview Questions for Researchers
- What kind of software do you use in your research?
- How do you obtain this software?
- Is there programming involved? Who does the programming?
- If you write your own software for your project, how is it tested?
- Do you integrate the software or run programs separately?
- What are your chief quality concerns with software you currently use?
- How do you evaluate software use to be sure your goals will be met?
- What problems do you frequently come across in your software use?

Left: Screenshot of SAS/STAT used in Dr. Castaldi’s projects
Top Right: SAM and HMMER logos, and a Hidden Markov model used in protein function prediction
Bottom: Screenshots from the Haploview program used in Dr. Castaldi’s project

Researcher Interview: Dr. Peter Castaldi
Dr. Castaldi is currently involved in three active research projects.

Locating the Genes that Cause Emphysema
Emphysema is a form of chronic obstructive pulmonary disease (COPD) that is often caused by exposure to toxic chemicals or tobacco smoke. Dr.
Castaldi is working on locating the genes that cause this disease. He uses Haploview, a menu-driven program, to input data points and find where on the genome certain variations are located.
More on Haploview: http://www.broad.mit.edu/mpg/haploview/index.php

Tracking Medication Skipping Habits of the Elderly
Dr. Castaldi's second project tracks the medication skipping habits of people over 75 who have chronic respiratory disease. The data is a database of survey responses. SAS/STAT software is used for statistical analysis of the results.
More on SAS/STAT: http://www.sas.com/technologies/analytics/statistics/

Building a Predictive Model for Alpha-1 Antitrypsin Deficiency
Alpha-1 antitrypsin deficiency is a genetic disorder caused by defective production of the protein alpha-1 antitrypsin. It leads to symptoms such as shortness of breath and wheezing. Dr. Castaldi is building a model to predict which people will develop the disease. This project also uses SAS/STAT.

Researcher Interview: Anoop Kumar

Prediction of Protein Functions
Proteins are made up of chains of molecules called amino acids, of which there are 20 standard types. The function of a protein in the human body depends on its sequence of amino acids. Anoop Kumar is working on the prediction of protein functions. It is easy to find the sequence of amino acids, but very difficult to accurately predict what the protein does based on that sequence.

Predicting the functions of proteins involves a probabilistic mathematical model. The idea is that the sequence of amino acids that makes up a protein could be passed through an accurate model to tell whether that protein functions in a certain way. The models used are called Hidden Markov models: statistical models in which the underlying states are hidden and must be inferred from what is observed.
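To make the idea of passing a sequence through a model concrete, here is a minimal sketch of the forward algorithm, which computes how likely a sequence is under a Hidden Markov model. Everything in it is an assumption made for illustration: the two hidden states ("helix" and "coil"), the three-letter amino acid alphabet, and all of the probabilities are invented. It is not how SAM or HMMER are implemented, and it is not the model used in the research described here.

```python
# Toy hidden Markov model. All states and numbers are hypothetical.
STATES = ("helix", "coil")

# Probability of starting in each hidden state.
START = {"helix": 0.5, "coil": 0.5}

# TRANSITION[a][b] = probability of moving from hidden state a to state b.
TRANSITION = {
    "helix": {"helix": 0.8, "coil": 0.2},
    "coil": {"helix": 0.3, "coil": 0.7},
}

# EMISSION[s][r] = probability that state s emits residue r.
# A real model would cover all 20 standard amino acids.
EMISSION = {
    "helix": {"A": 0.6, "L": 0.3, "G": 0.1},
    "coil": {"A": 0.2, "L": 0.2, "G": 0.6},
}


def forward_likelihood(sequence):
    """Return P(sequence | model), summed over every hidden state path."""
    if not sequence:
        return 1.0
    # alpha[s] = probability of the residues seen so far, ending in state s.
    alpha = {s: START[s] * EMISSION[s][sequence[0]] for s in STATES}
    for residue in sequence[1:]:
        alpha = {
            s: EMISSION[s][residue]
            * sum(alpha[prev] * TRANSITION[prev][s] for prev in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    for seq in ("AAL", "GGG"):
        print(seq, forward_likelihood(seq))
```

In practice, a score like this would be compared across models (for example, one model per protein family) to decide which one best explains a new sequence. Real tools also work with log probabilities to avoid numerical underflow on long sequences.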
Anoop Kumar uses a variety of programs in his research.
- BLAST, the Basic Local Alignment Search Tool, is used to search for sequences of amino acids and find similarities between sequences.
- The Sequence Alignment and Modeling system (SAM) is a utility to build and evaluate protein models and download protein profiles.
- HMMER is a software package used for sequence analysis with Hidden Markov models.
More on BLAST: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
More on SAM: http://www.soe.ucsc.edu/research/compbio/sam.html
More on HMMER: http://hmmer.janelia.org/

Typical Problems in Research Software

Time
Because most researchers are not programmers themselves, they often lack the time to learn the programming skills needed to adapt software. They sometimes rely on a colleague with programming experience or on outside help.

Compatibility
Researchers receive data from many sources. Accepting all kinds of data, from Excel spreadsheets to databases, graphs, and text documents, is critical to putting all the pieces of a research project together.

Output
A research software program should output data in forms that are easily adapted to the presentation of results: graphs and charts should be neat and readable, and data from projects like surveys should be in a presentable format.

Efficiency and Consistency
A program must compute results quickly and efficiently, and it must produce the same results each time it is run on the same data set.

Simplicity
Researchers use many different software programs in their projects, picking up new utilities as they need them. An all-encompassing software utility for researchers might instead use a library of software plugins that are added as needed.

Conclusion
Biomedical researchers currently employ a variety of programs and software-use strategies to reach their goals.
There is no unified infrastructure that can integrate the results of different programs or combine data into a single, useful output. Programming must be done by the researchers themselves or by outside help they hire, and the various programs are generally run independently of one another. This can slow down any work that involves software, because of issues such as data incompatibility. Biomedical research projects are multifaceted, and any software utility built to handle all aspects of a research project must provide the elements that researchers require. Speed, ease of use, compatibility, and efficiency are all essential if the end product is to be a reliable research tool.