Community Shelter Board Research/Archival Database Project Final Report

FINAL REPORT
Archival/Research Database for CSB

Overview

The Community Shelter Board (CSB) of Columbus, Ohio established a management information system for homeless service tracking in 1989. Since that time Columbus’ homeless management information system (HMIS) has evolved to include client-level data about recipients of prevention, emergency shelter, transitional housing, permanent supportive housing, and transitional financial assistance. HMIS coverage includes nearly all homeless assistance providers and programs in the Columbus metro area.

The HMIS has consistently captured a core set of data elements for each client served, including demographics, services provided/received, and outcome. Additional client information and project-specific data elements have been collected and added to the HMIS at various times over the past fifteen years to facilitate ongoing program evaluation, program monitoring, and research initiatives.

In 2001 CSB upgraded its HMIS from a locally managed, DOS-based system with dial-up access to a web-based system in ServicePoint software managed off site by Bowman Internet Systems. Data from 1989 through 2001 in the DOS-based system was not migrated to the ServicePoint product currently in use. Consequently CSB now has two different systems of HMIS data, although the data sets contained in each are based on the same data elements and client response categories.

Because Columbus is one of the few communities in the United States with high quality, historical homeless system data, tremendous opportunities exist for sophisticated data analysis and research. However, the two existing datasets are not currently managed in a single, uniform, and consistent system. This report evaluates commonly accepted organization standards for research databases and makes recommendations for database technical specifications and management protocols.
Key Informants Interviewed

Steve Poulin, University of Pennsylvania, Center for Mental Health Policy & Services Research
Marti Burt, Urban Institute
Brian Sokol, University of Massachusetts – Boston, McCormack Center for Social Policy Research
Nancy Smith and Derek Coursen, Vera Institute of Justice
Michael Banish and Marcus Mattson, Community Research Partners
Eric Grumdahl, Hearth Connection
Ellen Bassuk and Nick Huntington, National Center on Family Homelessness
Ed Malecki and Hazel Morrow-Jones, The Ohio State University, Center for Urban and Regional Analysis
Matt Kaliner and Copeland Young, Harvard University, Radcliffe Institute for Advanced Study – Murray Research Center

Shared by Columbus OH on HMIS.Info/Database Research

Database Structure and Format

Data format: relational database

The preferred format for an archival database is one in which the integrity of the original data structure is maintained. Since the original data sources are both organized in relational tables, the archival database should also maintain the data relationships and structure in a relational database. Key informants also recommended that CSB maintain a detailed and up-to-date codebook that includes an “entity relationship diagram” providing clear descriptions of the relationships among data elements within and among tables.

Back end environment: SQL

The recommended software for storing data archived from a relational database is a SQL database. [1] SQL is universally regarded as the recommended approach due to the compatibility, reliability, and flexibility it provides.

Compatible – multiple SQL products are widely supported, and SQL is understood by the largest proportion of database administrators. SQL also allows the archived data to be stored in a relational table format that maintains the integrity of the original data source.
Reliable – provides a fast, secure, and robust tool for integrating new data during regular updates to the archival database from the live, interactive ServicePoint site.

Flexible – export scripts can easily be written to transfer data in various formats (ASCII, tab delimited, flat file, relational tables) and subsets (subpopulations, program types, profiles, etc.). SQL also provides the ability to integrate data into multiple formats and software products depending on research needs and staff preference. Data held in SQL can be transferred to text files in ASCII format, Microsoft Access, Microsoft Excel, Crystal Reports, FileMaker, dBase, SPSS, SAS, and many others.

[1] Structured Query Language (SQL) is a standard interactive and programming language for sending queries to, getting information from, and updating a relational database. Although SQL is a standard used in database administration, many database products support SQL with proprietary extensions to the standard language. Examples of database products that support SQL include dBase, Oracle, and SQL Server. SQL is the industry standard for storage, management, and administration of relational databases. As the industry standard, SQL provides the best performance, scalability, and reliability for management of relational databases.

Front end software: SQL, Access, SPSS

The recommendation for software to manage archival data is less definitive. Respondents were split in their answers to this question depending on their particular preference and expertise. Database administrators preferred SQL. Academic researchers preferred SPSS. Data analysts and general policy analysts preferred Access.

SQL may seem to be the best option if the back end is built in SQL, but only if CSB can support the database with staff appropriately trained and experienced in SQL database programming and administration.
SQL certified staff are highly skilled and tend to command the highest salaries of all database administrators.

Other options for the front end interface of data management include basic database software packages such as Access. Access is highly flexible and easily transferable. Although data analysis is possible with Access, other products provide more sophisticated and robust reporting and analysis functionality.

SPSS is the optimal choice for advanced data analysis. SPSS is constructed in a flat environment (client identifiers and demographics are repeated for each individual record or client contact). A flat environment can lead to cumbersome and unwieldy databases for very large data sets. SPSS is not as common as SQL or Access, and it can be more difficult to find staff with appropriate training and expertise in SPSS.

If the back end architecture is built in SQL, the choice of front end software package is less critical. SQL is the industry standard for storage, management, and administration of relational databases, affording enough flexibility, reliability, and scalability that the choice of front end system becomes a matter of CSB staff preference. Additionally, many of the standard queries and reports that CSB will use on a regular basis can be preprogrammed in SQL, requiring only the input of basic date ranges to extract the data that CSB requires.

Contents of Research Database

The required data elements for the research database should comply with the recently developed US Department of Housing and Urban Development (HUD) Data and Technical Standards for HMIS. These standards address the specific data elements, data collection and reporting protocols, sharing of protected personal information (PPI), covered homeless organizations (CHO) in the HMIS, privacy and security standards, allowable uses of HMIS data, application and system security, and electronic transmission requirements.
The archival dataset of CSB’s HMIS is a covered entity under the new Data and Technical Standards for HMIS. All applicable standards and requirements for HMIS are equally relevant to the archival database.

The collection of specific data elements may vary according to program type. For instance, programs that serve families in a transitional housing setting may collect sets of data that are different from those collected by programs that serve single adults in an emergency shelter setting. The contents of the Research Database will be defined according to program type, rather than as a single data set for every client encounter. In general, however, a standard set of universal data elements will include basic demographics and service utilization.

Data Elements: client identifiers, demographics, related household data, disability status, program data (including entry and exit dates), residential history, client outcomes, and exit destination.

Key informants generally recommended that individual case planning data beyond the HUD-required universal data elements and program-specific data elements should not be included in the archival database. Client-level case management information excluded from the database would include case notes, status reports, treatment plan monitoring, and any clinical information relating to the planning, monitoring, and documentation of service provision.
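As a rough illustration of how the data elements listed above might be held in relational tables, the following sketch uses Python's built-in sqlite3 module. The table and column names (client, program_stay, and so on) are hypothetical and are not taken from ServicePoint or the DOS-based system; an actual archive would follow the HUD standards and CSB's own codebook.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE client (
    client_id   TEXT PRIMARY KEY,    -- unique client identifier
    birth_year  INTEGER,
    gender      TEXT,
    disability  TEXT
);
CREATE TABLE program_stay (
    stay_id          INTEGER PRIMARY KEY,
    client_id        TEXT NOT NULL REFERENCES client(client_id),
    household_id     TEXT,
    program_type     TEXT NOT NULL,
    entry_date       TEXT NOT NULL,  -- ISO 8601 date string
    exit_date        TEXT,           -- NULL while the stay is still open
    exit_destination TEXT
);
"""


def build_archive(conn):
    """Create the two illustrative tables on an open connection."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def stays_per_client(conn):
    """Count program stays for each client, including clients with none."""
    rows = conn.execute(
        "SELECT c.client_id, COUNT(s.stay_id) "
        "FROM client c LEFT JOIN program_stay s ON s.client_id = c.client_id "
        "GROUP BY c.client_id ORDER BY c.client_id"
    )
    return dict(rows.fetchall())
```

Keeping demographics in one table and stays in another means client details are stored once rather than repeated on every contact record, which is the main difference from a flat file layout.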
Database Management

Frequency of data dumps from live site to archival site: semi-annual

CSB intends to use the live, interactive ServicePoint database to manage many of its regular reporting functions, which include funder-required summary reports, program-specific summary activity reports in aggregate by various time periods, population-specific summary activity in aggregate by various time periods, monthly trend analysis by program and subpopulation, and additional ad hoc demographic, utilization, and accounting reports. CSB is heavily dependent on a constant stream of quality data to monitor program utilization and client outcomes.

Reports from the archival/research database will be extracted only for purposes of research, large-scale evaluation, rigorous data analysis, and monitoring of long-term trends. These types of reports are run more efficiently from a research database specifically designed for these purposes.

After the initial communication protocols and importing scripts are developed to allow data to be archived from the ServicePoint site to a CSB-managed research database, the frequency of subsequent data dumps is contingent on CSB’s ability to conduct comprehensive data quality testing and cleansing. Data monitoring, cleansing, and auditing are labor intensive, so these data quality processes can be conducted no more often than quarterly. Therefore, it is recommended that CSB conduct semi-annual data integration (‘dumps’) from the live ServicePoint site to the archival research database, allowing adequate time for data quality issues to be addressed. Data dumps more frequent than semi-annual do not allow for the rigorous and comprehensive quality assurance testing and data cleansing that CSB currently conducts.
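The periodic integration described above could be sketched roughly as follows, again using Python's sqlite3 module. This is only an illustration under stated assumptions: the stay table, its columns, the quality checks, and the dictionary shape of the incoming rows are all hypothetical and do not reflect any actual ServicePoint export format. Problem records are flagged rather than dropped, so that the archive stays complete.

```python
import sqlite3


def init_archive(conn):
    """Create the illustrative archive table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stay ("
        " stay_id INTEGER PRIMARY KEY,"
        " client_id TEXT, entry_date TEXT, exit_date TEXT,"
        " quality_flag TEXT, loaded_on TEXT)"
    )


def quality_flag(row):
    """Return a short description of the first problem found, or None."""
    if not row["client_id"]:
        return "missing client_id"
    if not row["entry_date"]:
        return "missing entry_date"
    # ISO 8601 date strings sort in calendar order, so string comparison works.
    if row["exit_date"] and row["exit_date"] < row["entry_date"]:
        return "exit before entry"
    return None


def integrate(conn, live_rows, loaded_on):
    """Add new stays and overwrite changed ones; return the number flagged."""
    flagged = 0
    with conn:  # single transaction: commit on success, roll back on error
        for row in live_rows:
            flag = quality_flag(row)
            flagged += flag is not None
            conn.execute(
                "INSERT OR REPLACE INTO stay VALUES (?, ?, ?, ?, ?, ?)",
                (row["stay_id"], row["client_id"], row["entry_date"],
                 row["exit_date"], flag, loaded_on),
            )
    return flagged
```

Running the whole batch inside one transaction means a failed load leaves the archive exactly as it was before the dump began.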
Frequency of data purging from the live site: annually

The live ServicePoint site will benefit from the deletion of records that are determined to be “inactive”. Homeless assistance providers will find ServicePoint easier to navigate and faster to use once client records that are no longer accessed for case planning and client management have been purged. The definition of “inactive” will need input and discussion from all interested parties to strike a balance between usability and completeness. Ultimately, the “inactive” designation should be determined based on a rigorous review of current client utilization patterns. Any purge of data from the live site should produce a comprehensive audit trail that indicates what specifically was purged, who did it, and the criteria used to make the purge determination.

Backups and Storage

Key informants recommend that CSB follow standard practices for backup and storage of data off site as an extra but necessary precautionary measure. Additionally, any SQL code that describes database structure, relationship codes, programming, reports, etc. should also be saved on backup disks and stored off site. The processes employed for the backup and storage of the archival/research database can be incorporated into existing CSB server and database backup procedures.

Projected staffing and management activities:
Phase 1 (start-up, 4 to 6 months): 206 consulting hours
Phase 2 (annual, ongoing): 0.5 FTE

The specific tasks associated with data modeling, designing the research database, writing the specifications for data transfers, conducting the initial integration of all data into the new research database, and validity testing to assure accuracy all require expertise in database development. These activities are distinct, one-time functions and are characterized as Phase 1.
The skill sets required to maintain the database during Phase 2 (following initial start-up) are quite different and are more aligned with database administration. These activities include the following:

database management – security maintenance, backups, managing tape, cartridge, and assorted media
data processing – report generation, copying, editing, and logging of pre-programmed reports
quality management – maintain quality controls for ensuring accuracy and integrity of data files
data analysis – develop preliminary analyses for initial data exploration involving basic statistical analysis (e.g. frequency distributions, correlations, chi-square tests, t-tests, etc.)

Research Request Process for Data Access

Throughout the history of CSB many researchers and universities have approached CSB staff with requests to access the MIS for analysis, research studies, evaluation work, data matches with mainstream administrative datasets, and general data mining. CSB has reviewed these various requests on an ad hoc basis and made determinations to pursue partnerships based on the merits of the research design and the resources necessary to participate. Requests for collaboration and participation in research will increase as CSB’s data sets are organized and designed for this purpose.

Managing research requests must be approached intentionally to ensure cost-effective, timely, secure, and appropriate partnerships. CSB must articulate clear research objectives and make determinations about potential research partnerships based on the quality of the research design, the ability of the researcher to complete the research in a timely and cost-effective manner, and consistency with local research objectives. The following recommendations provide a basic structure for managing research requests and facilitating productive partnerships.
All requestors of data from CSB’s archival/research database will be required to complete a pre-proposal concept paper. CSB staff will review pre-proposal concept papers to make one of the following three (3) determinations:

1. The requestor’s pre-proposal has no merit and no access to the archival/research database will be allowed.
2. The requestor’s pre-proposal has merit and only de-identified client data (aggregate summary data) is required for the research project. CSB may provide de-identified aggregate reports to the researcher.
3. The requestor’s pre-proposal has merit and identifiable client data is necessary for the research project. The researcher is invited to complete a full proposal. Access to data will be determined after successful submission of a full proposal and approval of the Ad Hoc Committee for Review of Research Requests (described below).

The research request pre-proposal (a request for access to CSB’s administrative database of client-level records) must be initiated by a written proposal that describes the following:

Name, affiliation, and credentials of the principal researcher
Research design that describes scope, scale, objectives, and priorities for research
Description of data (aggregate or client-level) required for research (population or sub-population type; program type; date ranges; specific flags, filters, cross-tabs, etc.)

The research request full proposal must describe the following information in addition to the pre-proposal information:

Name, affiliation, and credentials of any research assistants with access to data
Funding source underwriting the research project and/or the researcher’s time
Estimated timeframe for completion of research or data analysis
Estimated amount of CSB staff time associated with providing explanations, context, assistance, and technical assistance (TA) with data extraction and/or analysis
Description of the researcher’s plan to assure privacy and confidentiality of client data, compliance with HIPAA guidelines, and plan for management, storage, and eventual destruction of client data

Because researchers are using CSB HMIS data as a secondary data resource, no IRB process is required. However, rigorous controls on allowing data access need to be established.

Following the successful and complete submission of a written proposal, CSB will manage the approval process by calling a meeting of an Ad Hoc Committee for Review of Research Requests. This Ad Hoc Committee will be charged with meeting whenever necessary to review research request proposals and making determinations about providing access to data based on the merits of the proposal. Membership of the Ad Hoc Committee may include the following:

CSB staff (Executive Director, Program Director, Database Manager)
Consumer of homeless assistance program
Homeless assistance provider
Funder of homeless assistance and/or housing
Member of the Rebuilding Lives Funder Collaborative
Individual(s) with academic credentials representing research interests (OSU)

Any researcher granted access to CSB’s HMIS data will be required to sign a Memorandum of Understanding (MOU) that describes the terms of the partnership, any liabilities, a plan for dissemination of research findings, and any other conditions placed on data access.

Approved data format: ASCII file with requested data in tab delimited format

Researchers hold differing views on the optimal structure, data format, and software products for research and analysis. Opinions differ based on the degree of training and familiarity with different statistical analysis products, the size and scope of data, and the level of analysis.
Key informants generally recommended that CSB provide data to researchers in the simplest format possible and then require each independent researcher to convert the data to the format of his or her choosing. Following approval of a research request, CSB will create a copy of the requested data set on a CD as an ASCII file in tab delimited format. CSB will also consider providing data in different formats depending on the timing and resource requirements at CSB. CSB will also provide a detailed code book that defines each data element within a field, how cases are organized, and the algorithm for establishing the unique client identifier. Each case within a data set should represent a separate shelter visit.

Security and Confidentiality

The privacy protection of identifiable data within a data set is a major concern when making client-level data available for research purposes. Although current client consent and release protocols in place at homeless assistance programs in Columbus are compliant with industry standards and make adequate provision for research and general management operations, researchers must comply with the following additional security and confidentiality measures:

Researchers and their affiliated institutions must comply with all applicable privacy and security mandates such as HIPAA
Researchers must sign a confidentiality agreement
Anonymity of client-level data within any released data set must be protected and maintained throughout the research and analysis process, and in any final reports, studies, or published accounts of findings
All data sets (original and derivative) must be destroyed upon completion of the research
Only researchers specifically approved by the Ad Hoc Committee will have access to data

Preferred method of data sharing: de-identified client level data

Researchers must be able to match CSB’s client code (unique identifier) with the researcher’s client code. As a general practice all client identifiable data will be masked or stripped from released data sets. In extenuating circumstances, when valid data matches (linking) are not possible without the presence of individual identifiers, researchers may have limited access to client-level data with identifiers.

Data Quality

As a general rule survey respondents recommended that all data, even somewhat problematic data, should be included in the research/archival database. Tolerance for various levels of data quality will differ based on the specific research question, scope, and analysis methodologies. The data quality standards applied to data within the live ServicePoint site should also be carried over to the archival database.

Creation of Research Database

Phase 1 of the project is anticipated to take a total of 4 to 6 months and require the following activities:

Provenance – define the origin of each data element within the database (DOS-based FirstLink system vs. ServicePoint system); document the location, relationship, and any issues or problems impacting future data interpretation or analysis. Much of the documentation for this activity already exists. Data provenance is expected to take 2 to 4 weeks.

Data modeling – define data variables, relationships, and logic of database organization; conceptualize database structure; construct an entity relationship diagram for the new research/archival database. Data modeling is expected to take 4 to 6 weeks.

Database development – write the SQL program schema for data migration to the archival database from the live site and from the DOS-based system.
Both data sources should be integrated into the new research/archival database. Testing of the migration process should be conducted and the process verified before final integration is accepted. Database development is expected to take 4 to 6 weeks.

Analysis of problematic data – client data from the period 10-1-01 through 6-30-02 is of questionable quality, completeness, and consistency. This period represents the initial transition from use of the DOS-based system for client management to the ServicePoint site. Data from this time period will need special attention to determine their “fitness” for conversion into the research/archival database. This analysis is expected to take 4 to 6 weeks.

Data migration – run the SQL procedures that join, filter, and convert data. This process is expected to take less than 1 week.

Build import/export reports – write SQL protocols for periodic updating of the research/archival database from the live ServicePoint site. Write an export module to extract data in uniform data formats for regular research, evaluation, and reporting functionality. Report building is expected to take 4 to 6 weeks.

Testing – ensure that data elements are not represented by more than one field (look for a large number of missing values when frequencies are run for duplicate fields); run basic queries from the new research/archival database and match the results with approved ServicePoint queries to assure accuracy and consistency. Testing is expected to take 2 to 4 weeks.

Due to the technical nature of the initial database development and the intensive staff time required to conduct the design, integration, and testing, survey respondents recommended that CSB consider contracting with a database development consultant skilled in these efforts. Local examples of database development consulting teams include the following:

Avenscia Inc.
CompuWare
Microman
Resultdata
Sarcom

Costs

Phase 1 – Database design, development, and testing: Costs associated with hiring consulting services for this time-limited activity are expected to be in the range of $15,000 to $25,000. This estimate is based on key informant interviews with database developers and/or consultants engaged in projects of similar scale and scope. Costs include database server hardware, software, peripherals such as tape backup, cables, and firewalls, and database set up and building of the interface.

Phase 2 – On-going maintenance and support: Costs associated with hiring staff to support the administration of the new research/archival database and to perform related technology maintenance functions at the Community Shelter Board on a 0.5 FTE basis are expected to be in the range of $25,000 to $35,000. The total Phase 2 figure also includes a small annual contract for consultant services associated with maintenance and support of the database.

Resource Comparison Matrix

The following chart highlights the resource requirements based on high end estimates for CSB staff time, consulting services, hardware purchase, and annual support contracts. The total first-year cost of the project is anticipated to be in the range of $72,500 to $77,500.
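The high end figures can be reproduced from the hours, hourly rate, and salary stated in this report, as in the short Python sketch below. Rounding the consulting amounts to the nearest $5,000 is an assumption made here to match the rounded figures the report presents; the constant and function names are illustrative.

```python
HOURLY_RATE = 120               # consulting rate in dollars per hour
PHASE1_CONSULTING_HOURS = 206   # Phase 1 consulting hours
SUPPORT_HOURS_PER_YEAR = 80     # annual support contract hours
FULL_TIME_SALARY = 70_000       # high end of the stated salary range
FTE = 0.5                       # half-time database manager
HARDWARE = 7_500                # hardware, software, peripherals, set up


def round_to(amount, step):
    """Round a dollar amount to the nearest multiple of step (assumed rounding)."""
    return int(round(amount / step) * step)


def first_year_costs():
    """Return the high end first-year cost split used in the matrix."""
    consulting = round_to(PHASE1_CONSULTING_HOURS * HOURLY_RATE, 5_000)  # 24,720 -> 25,000
    support = round_to(SUPPORT_HOURS_PER_YEAR * HOURLY_RATE, 5_000)      # 9,600 -> 10,000
    staff = int(FULL_TIME_SALARY * FTE)                                  # 35,000
    other = consulting + HARDWARE + support                              # 42,500
    return {"csb_staff": staff, "all_other": other, "total": staff + other}
```

Under these assumptions the split is $35,000 in CSB staff costs and $42,500 in all other costs, for a high end total of $77,500.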
Resource Requirements – Time

                                        CSB Staff Resources   Consulting Resources
                                        (hours)               (hours)
Phase 1 (4 to 6 months):
  Provenance                                                   10
  Data modeling                                                40
  Database development                                         60
  Analysis of problematic data                                 20
  Data migration                                               16
  Build import/export reports                                  40
  Testing                                                      20
  Total resources for Phase 1           112                    206
Phase 2 (ongoing):
  Total resources for Phase 2           0.5 FTE                80

CSB staff hours for the Phase 1 tasks are listed as 20, 20, 16, 20, 20, and 16 (112 in total). Six values are given for the seven tasks, and the assignment of each value to a task is not recoverable here.

Resource Requirements – Dollars

                                                 CSB Staff Resources   All Other Resources
                                                 (hard costs)          (hard costs)
Phase 1 (4 to 6 months):
  Consulting services for database development                         $25,000 [2]
  Hardware, software, peripherals, set up                              $7,500 [3]
Phase 2 (ongoing):
  Database manager (0.5 FTE)                     $35,000 [4]
  Database maintenance & support (annual)                              $10,000 [5]
Total First-Year Cost                            $35,000               $42,500

[2] Consulting services are estimated at a rate of $120 per hour.
[3] Total hardware costs include purchase of database server, required software, and peripherals.
[4] Based on key stakeholder interviews, a database manager commands an annual salary in the range of $50,000 to $70,000. The position described within this report requires one half-time person (0.5 FTE) at an estimated annual 1.0 FTE salary of $70,000.
[5] The annual database support contract is estimated at 80 hours per year at a rate of $120 per hour.

Future partnerships and research activities

What types of research questions would drive potential partnerships?

Spatial analysis – mapping of client location, service locations, employment centers, transportation networks, etc.
Impact of welfare reform – analysis of clients who have experienced homelessness and the impact of limitations on public assistance benefits
Trends in homeless population counts and profiles, including counts and profiles of special populations, e.g. families, children, veterans, mentally ill, substance abusers, chronically homeless
Patterns of shelter use
Integrated database research (intersection of homelessness and government systems such as justice, Medicaid, mental health, etc.)
Costs of service use

Reference Resources
(Included in hard copy format with final report)

Database Structure Descriptions:
Community Research Partners
Hearth Connection
Vera Institute of Justice

Standard Request for Database Access:
University of Massachusetts – Boston, McCormack Center for Social Policy Research
Harvard University, Radcliffe Institute for Advanced Study – Murray Research Center
The Ohio State University, Center for Urban and Regional Analysis

Security Standards & Consent Forms:
The Ohio State University, Center for Urban and Regional Analysis
University of Massachusetts – Boston, McCormack Center for Social Policy Research
University of Pennsylvania, Center for Mental Health Policy & Services Research
Harvard University, Radcliffe Institute for Advanced Study – Murray Research Center

Data Integration Standards:
University of Massachusetts – Boston, McCormack Center for Social Policy Research

Entity Relationship Diagram:
Community Research Partners
Hearth Connection