CENTRE COUNTY SIMULATION (CentreSIM) Data Collection

Konstadinos G. Goulias
March 2006
UCSB Draft Notes for GEOG 211A-B-C

PREFACE

Assessing air quality of a region requires travel demand models that can produce reliable hour-by-hour estimates of mobile source emissions. The widely available regional simulation models are not precise enough and necessity motivates their misuse by practitioners. Regional simulation models aiming at improved air quality assessments and more detailed transportation alternatives evaluations are currently created and tested in Australia, Europe, Japan, and the United States using a variety of theories, decision-making formalisms, and operational implementation methods. On one hand, these relatively new conceptualizations and models of transport systems have substantially improved the realism of computerized decision support tools and have the potential of improving quantification of environmental impacts and transport management/control strategies. On the other hand, these systems require detailed data and understanding about behavior that very often are not readily available. This motivates the research reported here.

In this project, a comparative overview of conceptual designs, data requirements, and models used in computer simulation of regional transport systems was first completed in 1997 to aid the creation of Access Management Impact Simulation, a regional simulation approach using Geographic Information Systems to predict traffic impacts of individual business establishments. Then, the basic ingredients of a larger model system were defined and a first pilot study was completed with the University Park campus as the feasibility test site. These two experiences motivated the creation of a framework for model development called Longitudinal Integrated Forecasting Environment (LIFE) that contains a demographic simulator and a daily time allocation and travel scheduling system.
The computational platform for the system is a Geographic Information System in which statistical models of travel behavior are embedded. Figure 1 contains a summary of the different contributions to LIFE. One component in this framework, called CentreSIM, emphasizes the spatial and temporal dimensions of travel demand and produces hour-by-hour maps of activity participation and travel in Centre County, Pennsylvania. The predicted traffic volumes are then validated using observed traffic data.

Model development for CentreSIM is designed in cycles of incremental model improvement (each cycle contains a sequence of versions). The first cycle, which started in 1997 and ended in May 2002, produced a first version used by Penn State for its Master plan activities and the currently implemented changes in parking and campus circulation at University Park. The second version, used by JoNette Kuhnau for her MS thesis, expanded the campus circulation model to the entire Centre County and identified critical model design deficiencies and the improvements required to expand and validate the model for the entire county.

The second cycle of this model development was funded by the Pennsylvania Department of Transportation through the South Central Centre County Transportation Study (SCCCTS), managed by McCormick Taylor Associates (MTA), and by the US Department of Transportation through its Mid-Atlantic Universities Transportation Center. It uses data from a large survey in Centre County (the survey started in November 2002 and was expected to continue for four to five months) and produced a first version of an improved model in the Spring of 2003 that was used to study the SCCCTS area alternatives. This produced CentreSIM cycle 2 version 1.
In Spring 2004 a second version that advances the state-of-the-art in modeling and simulation, incorporating synthetic schedule generation for a day and for each person in the County, was created by Pribyl in his Ph.D. dissertation. Work on model building ended in 2004 and research switched to data analysis for active living and the health impacts of transportation (Shaunna Burbidge used the data for her MA thesis at UCSB) and to more theoretical aspects on altruism and travel behavior.

[Figure 1. Overview of LIFE and CentreSIM data collection and modeling work at PSU. The diagram lists the contributing projects, theses, and dissertations, from the PennDOT and FHWA projects through CentreSIM versions 1.0, 2.0, and 3.0 to the 2005-2006 work on active living and altruism.]

Within CentreSIM a household survey that is part of the six municipality (College, Harris, Spring, Benner and Potter Townships and Centre Hall Borough) study known as the South Central Centre County Transportation Study (SCCCTS) area study was conducted by the Pennsylvania Transportation Institute. The household survey covered the entire Centre County and also included residents that work in Centre County and reside elsewhere. The data from this survey filled a critical gap in knowledge about the County and provided the necessary information to develop models that can be used in regional simulation software necessary for forecasting and for alternatives exploration.

Each participating household was asked to provide, on a voluntary basis, information about household composition and facilities available to the household members. In addition, each household member also provided personal information such as employment, driving ability, education and so forth. Activity and travel data were also collected from each person in the household using a two-day complete record of the activities in which each person engaged and the different transportation options taken. The survey included a few questions about opinions and perceptions regarding the Centre County transportation system.

To maximize participation in this survey a new method was created by staff at the Transportation Survey Center, which was a unit within the Center for Intelligent Transportation Systems at Penn State University until 2004.
This method is based on recent positive experience with surveys for the Pennsylvania Turnpike and the Pennsylvania Department of Transportation and experience in state-of-the-art survey design techniques used in Europe, Australia, and the US. A typical procedure followed in the survey contains a first stage of contact and introduction to the study, mailing of survey material, and a series of reminders and clarifications. The entire process culminates with post-survey thank you letters, summary of the findings to the respondents, and a gift certificate reward structure. The next section describes the survey process. A second section describes an automated way to clean the data.

Patten and Goulias 1

INTEGRATED SURVEY DESIGN FOR A HOUSEHOLD ACTIVITY-TRAVEL SURVEY IN CENTRE COUNTY, PENNSYLVANIA

Michael L. Patten & Konstadinos G. Goulias, Ph.D.

ABSTRACT: In this paper we outline an integrated method for conducting a household activity-travel survey that was used successfully in a household survey of Centre County, Pennsylvania. This method incorporates elements of the Dillman-tailored method and the Socialdata-KONTIV design with concepts developed by the Penn State study team. The procedure followed in the survey consists of a first stage of contact and introduction to the study, mailing of survey materials, and a series of reminders and clarifications. The entire process culminates with post-survey thank you letters, summary of the findings to the respondents, and a gift certificate reward structure. Our experiments with the Dillman-tailored method and the Socialdata-KONTIV design and survey administration are extremely encouraging and a testimony to the good design advocated by both traditions. The survey experienced a response rate of 38.8 percent for the household questionnaire portion of the survey and 28.5 percent for the diaries.
Within each of these two response rates, however, we find specific segments and a sequence of actions that yield a response rate as high as 68.8%, and a series of actions during survey administration that aid in achieving this rate. This confirms the suggestions offered by researchers in Europe and Australia that the requirements for surveys in the U.S. should change.

INTRODUCTION

Collecting data via surveys, especially by mail, is a complex and expensive (in both time and money) undertaking. In order to maximize the return on the limited resources usually available to researchers, highly detailed systems have been developed to encourage participation in surveys. For example, Dillman (1) has developed the Tailored Design Method (originally called the Total Design Method) based on concepts of social exchange theory. Surveys utilizing Dillman’s process should be designed around the following three key elements:

1. Establish trust with the respondent (e.g., provide a token of appreciation in advance, make the task appear important, show sponsorship by legitimate authority);
2. Increase the respondents’ expectation of receiving a reward from participation (e.g., show positive regard, say thank you, give tangible rewards, make questionnaires interesting); and
3. Reduce the social costs to the respondent (e.g., avoid subordinating language, make the questionnaire short and easy, and minimize requests for personal information).

In the realm of transportation-related surveys, several design concepts have been developed incorporating ideas similar to Dillman’s as well as the experience of the researchers involved.
Brög, in his Socialdata-KONTIV Design (2), begins with the premise that in good survey design “the researchers must adjust to the respondents, not the respondents to the researchers.” The Socialdata-KONTIV design uses questionnaires and activity diaries that are simple in design and layout, minimize involved definitions and instructions, and stress the collection of “complete information instead of formally correct” information. The Socialdata-KONTIV design also incorporates a high level of contact with the respondents through multiple mail and telephone contacts. This design, however, contains a travel diary instead of the more recent activity diaries that we need for the application in our survey. Activity surveys require a somewhat different design because of the additional respondent burden and the sensitivity of the questions.

Additional recent research has found that the day-planner booklet format is extremely useful for time-use and activity diaries (3,4,5). The reasons reported for this include flexibility, simplicity of completion, greater detail of the answers provided, and user-friendliness.

To summarize, a good design for a household activity-travel survey should incorporate the following concepts:

• The researcher should design the survey for the respondents;
• Survey instruments should be written in simple language and be easy to understand and complete;
• The researcher should establish trust with the respondents;
• Respondents should receive a reward from participation;
• The costs to the respondent should be minimized; and
• The day-planner booklet format is the most useful design for activity diaries.

In this paper we discuss a new method to maximize participation in travel-related surveys created by a team at the Transportation Survey Center, which is a unit within the Mid-Atlantic Universities Transportation Center at Penn State University.
This method is based on the concepts discussed above and recent positive experiences with surveys for the Pennsylvania Turnpike Commission and the Pennsylvania Department of Transportation. The procedure followed in the survey consists of a first stage of contact and introduction to the study, mailing of survey materials, and a series of reminders and clarifications. The entire process culminates with post-survey thank you letters, summary of the findings to the respondents, and a gift certificate reward structure.

THE STUDY

Between November 23, 2002 and May 30, 2003, the Penn State study team conducted a survey of Centre County, Pennsylvania residents to collect data about the county’s households and the activities of the household members. The survey is designed to meet the data needs of a multiyear model building research effort called CentreSIM (6), provide data for long range planning by local agencies, and to aid a Pennsylvania Department of Transportation (PennDOT) study known as the South Central Centre County Transportation Study (SCCCTS) area study.

The household activity-travel survey covered the entire Centre County and also included residents that work in Centre County and reside elsewhere. The data from this survey will fill a critical gap in knowledge about Centre County and will provide the necessary information to develop models that can be used in regional simulation software necessary for forecasting and for alternatives exploration.

Each participating household was asked to provide, on a volunteer basis, information about household composition and facilities available to the household members. In addition, each household member was also asked to provide personal information such as employment status, educational level, and typical mode of travel to work or school.
Activity and travel data were also collected from each person in the household using a two-day complete record of the activities in which each person engaged and the different transportation options taken. The survey also included a few questions about opinions and perceptions regarding the Centre County transportation system.

The Study Area

The survey was conducted in Centre County, Pennsylvania. This county, with an estimated 2001 population of 135,940 (7), is located in the geographic center of Pennsylvania. Centre County is predominantly rural although State College borough and its adjacent areas have experienced a significant amount of urbanization. The University Park Campus of The Pennsylvania State University with more than 41,000 students and 11,000 faculty and staff (8) is also located in Centre County. The primary mode of travel in the county is motor vehicle although State College and the Penn State campus are well served by public transportation. The area immediately surrounding the Penn State campus also experiences a high level of foot and bicycle traffic.

SURVEY MATERIALS

The materials for the survey were divided into two components. The first component included those related to the household information survey (household questionnaire) and the second those related to the activity-travel diaries. Each group of materials is described below.

Household Survey Materials

The materials included in the household survey were:

• A cover letter describing the project and the purpose of the survey. It also provided a point of contact for additional information about the survey.
• A “Project Synopsis and Informed Consent Form,” as required by federal regulations, providing a point of contact for questions about the survey, describing the purpose of the survey, explaining any risks and/or benefits of participation, describing the confidentiality procedures, and indicating that the survey is voluntary.
• The questionnaire (survey instrument) used to collect the data necessary for the study. It is in booklet form and designed to minimize the effort required on the part of the respondent. The majority of questions were “close-ended.” Figure 1 displays example pages from the questionnaire.
• A business reply envelope, included in the packet to facilitate return of the completed survey forms.
• A contest flyer describing the lottery (see below) and a contest entry card.

Figure 1 about here.

The questionnaire design paid strict attention to providing a clear, easy to read format that minimized the effort to complete it. For example, we used a large type font (14 point) and incorporated much empty space. We also utilized a vertical flow for the question layout. The questionnaire also requested the first names, ages, and occupations of each household member. This information was used to personalize the activity-travel diaries for each member.

Activity-Travel Diaries

The materials included with the activity-travel diaries were:

• A cover letter describing the project and the purpose of the survey. It also provided a point of contact for additional information about the survey.
• A “Project Synopsis and Informed Consent Form,” as required by federal regulations, providing a point of contact for questions about the survey, describing the purpose of the survey, explaining any risks and/or benefits of participation, describing the confidentiality procedures, and indicating that the survey is voluntary.
• Two personalized activity-travel diaries for each household member. The diaries were for two consecutive days. Figure 2 displays an example of the diary format.
• A business reply envelope, included in the packet to facilitate return of the completed survey forms.
• A contest flyer describing the lottery (see below) and a contest entry card.

Figure 2 about here.

As shown in figure 2, the activity diaries utilized a modified day-planner format.
The respondents were free to provide whatever level of detail they felt necessary.

A key component of our method is the personalization of the diaries for each household member. The personalization consisted of two items. First, each diary has the name of the appropriate household member on the cover (figure 3). Additionally, we used four different diaries depending on the reported employment status of the members as follows:

• Employed (full- or part-time) outside the home: White cover
• Not employed outside the home: Peach cover
• Children age 18 and under: Blue cover
• University students: Blue cover

The four diaries were identical except for the color of the cover and an example diary that would be relevant to the subject’s demographic group. For example, the “employed” diary example shows an individual traveling to work, working throughout the day with a lunch break, traveling home, and engaging in some other activities in the evening. The “university student” diary example, on the other hand, shows attendance at classes, study time, and work at a part-time job, and the “child” diary shows time at school and participation in an athletic activity. Figure 3 displays an example of a diary cover with personalization and figure 4 displays a portion of the “employed” diary example (Note: This is also fairly representative of the information contained in the returned completed diaries.).

Figure 3 about here.

Figure 4 about here.

THE LOTTERIES

As noted in the introduction, we incorporated a reward structure to encourage responses. We utilized two different reward structures, each of which provided total prizes of $1,000 in gift certificates redeemable at the local shopping mall. During the remainder of this paper Lottery One refers to the period November 23, 2002 to March 6, 2003 and Lottery Two refers to the period March 7 to May 31, 2003.
In both lottery pools a household was eligible to win a prize if it returned a completed household questionnaire, completed diaries, and a contest entry card. The winners were randomly drawn from all the eligible households for that pool.

Lottery One: 2002

The first lottery was used during the first survey period of November 23, 2002 to March 6, 2003. A total of $1,000 in gift certificates was awarded as follows:

• 1 Grand Prize of $500 in gift certificates
• 2 Second Prizes each for $150 in gift certificates
• 4 Third Prizes each for $50 in gift certificates

Lottery Two: 2003

The second lottery was used during the second survey period of March 7 to May 31, 2003. A total of $1,000 in gift certificates was awarded as follows:

• 4 First Prizes each for $150 in gift certificates
• 8 Second Prizes each for $50 in gift certificates

SAMPLE SELECTION

The sample for the survey was drawn from several pools. The first was a database of 46,448 household addresses in Centre County purchased from a commercial mailing list vendor in early October 2002. This list provided the name of the current resident, the complete mailing address and in many cases the telephone number. This list has, however, two weaknesses. First, it does not include households that have made formal requests to be removed from mailing lists. This weakness had to be accepted since there is no other legal way to gather these addresses. The second weakness results from the highly transient nature of the 40,000 students attending the University Park campus of Penn State University. The study team was able to alleviate this problem by using student address lists available through Penn State. The following three address lists of students were acquired: students residing in on-campus housing, students living off-campus, and students living in Penn State operated family housing.

In addition to the above mailing lists, a fifth one was obtained from Penn State.
This list contained University Park Campus employees of Penn State who reside outside of Centre County. It was important to include members of this group in the sample since they commute longer distances to work.

We randomly selected a sample from each pool. There was not enough information to ensure that the sample units selected were representative of the Centre County residents. Table 1 shows the size of each pool and the sample selected from each.

Table 1. Centre County Activity-Travel Survey: Subject selection.

Subject Pool                                         Size of Pool  Number Selected  Percent of Sample
Purchased Database                                         46,448            6,700              68.8%
Penn State Students Residing On-Campus                     12,714            1,200              12.3%
Penn State Students Residing Off-Campus                    17,942            1,200              12.3%
Penn State Students Residing in PSU Family Housing            402              140               1.4%
Penn State Employees Residing Outside Centre County         1,464              507               5.2%
Totals                                                     78,970            9,747             100.0%

Subject Pool for Lottery One: 2002

The subject pool for Lottery One comprised the 6,700 households selected from the purchased database. Initially, as described below, the study team mailed survey materials to the 1,478 households for which no telephone number was available. The remaining 5,222 were to be contacted by telephone. Of these, 821 were contacted by telephone. The remaining 4,401 households were not contacted during the Lottery One portion of the survey.

Subject Pool for Lottery Two: 2003

The subject pool for Lottery Two included the 4,401 households remaining from Lottery One plus the two groups of Penn State students, the households residing in the Penn State family housing, and the Penn State employees. In total, 7,447 households were included in this phase.

THE SURVEY PROCESS

The study team recruited households via two mechanisms: telephone and mail. The processes used for each are outlined below.

Phone Recruiting

The study team called households for which there was a telephone number available.
If the call was answered, the team member asked to speak with the head-of-the-house or another responsible adult. The purpose of the study and survey was explained and participation requested. If the household declined to participate they were removed from the respondent pool. If they agreed to participate, they were asked to provide the names, ages, and employment status of all members of the household. This information was used to produce the personalized diaries. The respondent was informed of the days on which they would participate and their mailing address was verified. The appropriate diary materials were produced and mailed the next day.

Those numbers with no answer were rescheduled to be called at a later date. When an answering machine was reached, an appropriate message was left noting the reason for the call and that the study team would attempt to contact them at a later date. If a number was called three times without actually talking to a household member it was dropped from the pool.

Mail Recruiting

Mail recruiting in both lotteries was done as follows:

• Advance notice of the survey was sent by mail one week prior to the survey;
• Questionnaire packet was sent by mail (main mailing);
• A reminder letter was sent to the entire sample one week after the main mailing; and
• A reminder letter including a complete survey packet was mailed to all nonrespondents. For Lottery One it was mailed eight weeks after the main mailing and for Lottery Two four weeks after.

When a household returned their questionnaire they were added to the pool to receive activity diaries. During both survey periods activity-travel diaries were mailed to each household as they returned completed household questionnaires. The survey dates for the diaries were seven days after they were mailed.
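The mail contact sequence described above can be sketched as simple date arithmetic. This is only an illustrative sketch: the function names, the example dates, and two readings of the text are assumptions (that "one week prior to the survey" means one week before the main mailing, and that the first of the two consecutive diary days falls seven days after the diaries are mailed).

```python
from datetime import date, timedelta

# Follow-up lag in weeks after the main mailing, as reported for each lottery.
FOLLOW_UP_WEEKS = {"lottery_one": 8, "lottery_two": 4}


def mail_schedule(main_mailing: date, lottery: str) -> dict:
    """Return the dates of the four mail contacts for one household.

    Assumption: the advance notice goes out one week before the main mailing.
    """
    return {
        "advance_notice": main_mailing - timedelta(weeks=1),
        "main_mailing": main_mailing,
        "reminder_letter": main_mailing + timedelta(weeks=1),
        "follow_up_packet": main_mailing + timedelta(weeks=FOLLOW_UP_WEEKS[lottery]),
    }


def diary_days(diaries_mailed: date) -> tuple:
    """Two consecutive diary days.

    Assumption: the first diary day is seven days after the diaries are mailed.
    """
    first = diaries_mailed + timedelta(days=7)
    return first, first + timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical mailing dates, chosen only for illustration.
    for name, when in mail_schedule(date(2003, 3, 14), "lottery_two").items():
        print(f"{name:17s} {when.isoformat()}")
    print(diary_days(date(2003, 4, 1)))
```

Keeping the follow-up lag in a small lookup table makes the one difference between the two lotteries (eight weeks versus four) explicit.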
On a typical day, diaries were mailed to approximately 40 households, broken down as follows: 25 to the purchased group, 5 each to the on- and off-campus students, 1 to the Penn State housing, and 9 to the Penn State employees. No follow-ups were made for the diaries.

DATA ENTRY

Activity surveys generate an extremely large amount of data. Because of this, it is important to minimize the burden on the personnel responsible for data entry. It is also important to provide a data entry process that will minimize the number of data entry errors. For this study, the Survey Center developed an integrated database with an interface that closely resembled the questionnaire and diary formats. A comparison of figures 5 (questionnaire) and 6 (diary) to figures 1 and 2 shows that this was very successful. The incorporation of “pull-downs” for many of the fields allowed for quick and accurate entry of repeated data such as home addresses and activities such as sleeping and eating.

Figure 5 about here.

Figure 6 about here.

RESPONSE RATES

The two lotteries experienced very different levels of participation. The telephone recruiting did not prove to be very successful, although, as can be expected, households that agreed by telephone to participate responded at a much higher rate. Lottery One had a much higher response rate than Lottery Two.

Lottery One: 2002

As noted earlier, during the first survey lottery (November 2002 to March 2003) subjects were recruited both by telephone and mail.

Telephone Recruiting

Study team members contacted 821 households via telephone to request participation in the study. Of these, only 190 agreed to participate. The other 631 households refused participation, did not have in-service telephone numbers, or were called three times without speaking to a resident. The contact rate (participants/contacts) was 23.1 percent. Telephone recruiting was stopped on January 3, 2003.
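The three-call rule from the phone recruiting procedure and the contact rate reported above can be illustrated with a small sketch. The outcome labels, function names, and the toy call log are hypothetical; the source reports only the aggregate counts (821 contacted, 190 agreed) and gives no breakdown of the other 631 households. Out-of-service numbers are not modeled here.

```python
from collections import Counter

MAX_ATTEMPTS = 3  # a number is dropped after three calls without reaching a person


def disposition(call_outcomes):
    """Classify one household from its ordered list of call outcomes.

    Outcome labels are hypothetical: "no_answer", "machine", "declined", "agreed".
    """
    unanswered = 0
    for outcome in call_outcomes:
        if outcome in ("agreed", "declined"):
            return outcome  # a household member was reached
        unanswered += 1  # no answer or answering machine: call again later
        if unanswered >= MAX_ATTEMPTS:
            return "dropped"
    return "pending"  # fewer than three attempts so far


def recruitment_rate(agreed, contacted):
    """Share of contacted households that agreed to participate, in percent."""
    return round(100.0 * agreed / contacted, 1)


# Toy call log, invented for illustration only.
log = {
    "hh1": ["no_answer", "agreed"],
    "hh2": ["machine", "no_answer", "machine"],
    "hh3": ["declined"],
    "hh4": ["no_answer"],
}
print(Counter(disposition(calls) for calls in log.values()))

# Aggregate counts as reported for Lottery One telephone recruiting.
print(recruitment_rate(190, 821))
```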
All of the 190 households agreeing to participate were mailed survey packets as outlined above. Of these 80 households (43.5%) returned completed, usable questionnaires and diaries. The overall response rate for the telephone recruiting is 10.0 percent (returns/contacts).

Mail Recruiting

In Lottery One 1,478 households were contacted by mail. Of these, 422 households were dropped as undeliverable yielding a net mailing of 1,056. A total of 647 household questionnaires were returned yielding a 61.3 percent response rate.

Of the 647 households returning household questionnaires 568 were mailed activity-travel diaries. Time constraints prevented inclusion of all responding households. Sixty-six were dropped as undeliverable (these packets were returned by the U.S. Post Office without a valid forwarding address in Centre County) yielding a net mailing of 502. A total of 208 households returned completed diaries yielding a 41.4 percent diary response rate. With 203 households returning usable diaries the overall response rate for the Lottery One mail recruiting is 13.7 percent.

Lottery Two: 2003

Lottery Two was done completely by mail. The response rates for this phase of the study are reported below with details for each sub-set of the sample.

Household Questionnaire

Overall, questionnaires were mailed to 7,447 households. Of these, 617 households were dropped as undeliverable yielding a net mailing of 6,830. A total of 2,414 household questionnaires were returned yielding a 35.3 percent response rate. Table 2 displays the mailing and response rates by sub-group.

Table 2. Lottery Two: Household questionnaire return rates.

Subject Pool                             Mailed  Dropped  Net Mailed  Number Returned  Response Rate
Purchased Database                        4,401      432       3,969            1,603          40.4%
PSU Students Residing On-Campus           1,200       27       1,173              258          22.0%
PSU Students Residing Off-Campus          1,199      129       1,070              304          28.4%
PSU Students Residing in Family Housing     140        4         136               50          36.8%
PSU Employees from Outside Centre Co.       507       25         482              199          41.3%
Totals                                    7,447      617       6,830            2,414          35.3%

Activity-Travel Diaries

Of the 2,414 households returning household questionnaires 1,969 were mailed activity-travel diaries. Six were dropped as undeliverable yielding a net mailing of 1,963. A total of 494 households returned completed diaries yielding a 25.2 percent diary response rate. Table 3 displays the mailing and response rates by sub-group. With 494 households returning usable diaries, the overall response rate for Lottery Two is 8.9 percent.

Table 3. Lottery Two: Activity-travel diary return rate.

Subject Pool                             Mailed  Dropped  Net Mailed  Number Returned  Response Rate
Purchased Database                        1,348        5       1,343              401          29.8%
PSU Students Residing On-Campus             131        1         130               16          12.3%
PSU Students Residing Off-Campus            247        0         247               34          13.8%
PSU Students Residing in Family Housing      60        0          60               11          18.3%
PSU Employees from Outside Centre Co.       183        0         183               32          17.5%
Totals                                    1,969        6       1,963              494          25.2%

Comparison of Response Rates for the Two Lotteries

While a significantly larger amount of data was collected during Lottery Two, Lottery One experienced a much higher response rate both for the household questionnaire and the activity diaries. This difference may be a result of the different incentive amounts offered for each. The respective rates for the household questionnaire were 61.3 percent returned for Lottery One and 35.3 percent for Lottery Two, yielding an overall response rate of 38.8 percent for the household questionnaire (Table 4). For the diaries, the response rates were 41.4 percent returned for Lottery One and 25.2 percent for Lottery Two, yielding an overall response rate of 28.5 percent for the diaries (Table 5). The overall response rate for the survey is 11.1 percent.

Table 4. Household questionnaire response rate via mail.
| Lottery | Total Mailed (A) | Dropped (1): Number (B) | Dropped (1): Percent (B/A) | Net Mailed (C=A-B) | Total Returned (D) | Return Rate (E=D/C) |
| Lottery One (2002) | 1,478 | 422 | 28.6% | 1,056 | 647 | 61.3% |
| Lottery Two (2003) | 7,447 | 617 | 8.3% | 6,830 | 2,414 | 35.3% |
| Totals | 8,925 | 1,039 | 11.6% | 7,886 | 3,061 | 38.8% |

(1) Respondents dropped from the survey (e.g., non-deliverable, deceased, under age, etc.)

Table 5. Diary response rate via mail.

| Lottery | Total Mailed (A) | Dropped (1): Number (B) | Dropped (1): Percent (B/A) | Net Mailed (C=A-B) | Total Returned (D) | Return Rate (E=D/C) |
| Lottery One (2002) | 568 | 66 | 11.6% | 502 | 208 | 41.4% |
| Lottery Two (2003) | 1,969 | 6 | 0.3% | 1,963 | 494 | 25.2% |
| Totals | 2,537 | 72 | 2.8% | 2,465 | 702 | 28.5% |

(1) Respondents dropped from the survey (e.g., non-deliverable, deceased, under age, etc.)

A review of the dates on which the questionnaires were returned seems to indicate that the follow-up mailings do have an impact on the overall response rate. Figure 7 shows the percent of questionnaires returned by week for each of the lotteries. In both cases, there is a spike in the number of questionnaires returned approximately two weeks after the follow-up mailing of the complete survey packet.

Figure 7 about here.

SYNTHESIS

In this paper we outlined an integrated method for conducting a household activity-travel survey. This method has been used successfully in a household survey of Centre County, Pennsylvania. The survey experienced a response rate of 38.8 percent for the household questionnaire portion of the survey and 28.5 percent for the diaries. Within each of these two response rates, however, we find specific segments that yield a response rate as high as 68.8 percent, and a sequence of actions during survey administration that helps achieve this rate. Our experiments with the Dillman-tailored method and the Socialdata-KONTIV design and survey administration are extremely encouraging and a testimony to the good design advocated by both traditions. However, many aspects are still obscure.
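The pooled rates quoted above follow directly from the counts in Tables 4 and 5 and can be checked with a few lines of Python. The sketch below is illustrative only: the function and variable names are ours, and the last step assumes that the overall survey rate is the product of the two stage rates. That interpretation reproduces the 11.1 percent figure reported for the survey, but the text does not state the formula explicitly.

```python
# Counts taken from Tables 4 and 5: (total mailed, dropped, returned).
QUESTIONNAIRE = {
    "Lottery One": (1478, 422, 647),
    "Lottery Two": (7447, 617, 2414),
}
DIARY = {
    "Lottery One": (568, 66, 208),
    "Lottery Two": (1969, 6, 494),
}


def return_rate(mailed, dropped, returned):
    """Return rate E = D / C, where C = A - B is the net mailing."""
    return returned / (mailed - dropped)


def pooled(table):
    """Sum the counts over both lotteries (the 'Totals' row)."""
    return tuple(sum(column) for column in zip(*table.values()))


q_rate = return_rate(*pooled(QUESTIONNAIRE))
d_rate = return_rate(*pooled(DIARY))
# Assumption: the overall rate compounds the two stages.
overall = q_rate * d_rate

print(f"questionnaire {q_rate:.1%}, diary {d_rate:.1%}, overall {overall:.1%}")
```

Applying `return_rate` to a single lottery's counts gives the per-lottery rows of the two tables in the same way.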
Two preliminary multivariate analyses were also carried out when writing this paper to shed light on the very different response rates. The first examined response to the household questionnaire to understand which households were more likely to respond. This was done using a person-based non-linear regression model of the probability of returning the questionnaire (person-based because a person is the mail or telephone recipient). The model indicates that older recipients (50+), home owners, married persons, and those with medium ($30K to $60K per year) or higher ($60K or more per year) household income are more likely to return their household questionnaires. Mail recruitment was also (as discussed above) significantly more successful in terms of household questionnaire response. Similar indications were given by a second regression model for activity diary response. In fact, recruiting by telephone leads to a lower household questionnaire response rate, but given that a household was recruited by telephone and returned the household questionnaire, it has a higher probability of returning at least one activity diary. In addition, the presence of children in the household functioned as an inhibitor of travel diary response.

ISSUES FOR FURTHER RESEARCH

Several issues arose during this study that warrant further exploration. The telephone recruiting was not very successful. The study team could not determine whether this was a result of the recruiting method used or related to larger societal factors. Also, the large difference in the response rates for the two lotteries is somewhat of a puzzle. The difference in the amount of the reward offered in each lottery most likely had a significant impact on the response rate, although this study did not collect data to determine what effect the reward levels had. Initial multivariate analysis experiments showed that we need to explore the response rate further, but also that we need to study the completion and missing data patterns.
The effect of children on response rate also requires further study. This is particularly important for travel behavior analysis because of the effect children have on activity and travel patterns and the risk of losing, through non-response, the behaviors that are most interesting from a research viewpoint. All this is left as a future study that has already begun.

ACKNOWLEDGEMENTS/DISCLAIMERS

Funding for this paper was provided by the federally funded Mid-Atlantic Universities Transportation Center (MAUTC) and the Center for Intelligent Transportation Systems (CITRANS) at the Pennsylvania State University. Partial funding for the CentreSIM survey is provided by the Pennsylvania Department of Transportation (PennDOT) through a contract with McCormick Taylor and Associates (MTA). The survey was conducted at the Transportation Survey Research Center at the Pennsylvania Transportation Institute (PTI) by a team that includes Mark Hallinan, James Lee, Devani Perera, Aviroop Mukherjee, Julie Whitt, and Brian Hoffheins. Their dedication is gratefully acknowledged. Li Guan at PTI programmed the databases for the CentreSIM survey and Tae-Gyu Kim at PTI estimated the response rate models. Jean-Robert Micaeli and Ondrej Pribyl have also worked on data cleaning algorithms documented elsewhere.

The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the data presented herein. The contents do not necessarily reflect the official views or policies of the Commonwealth of Pennsylvania at the time of publication. This paper does not constitute a standard, specification, or regulation of the Pennsylvania Department of Transportation.

REFERENCES

1. Dillman, Don A. Mail and Internet Surveys: The Tailored Design Method. 2nd edition. John Wiley and Sons, Inc., New York, 2000.

2. Brög, Werner. The New KONTIV Design: A Total Design for Surveys on Mobility Behavior.
Presented at the International Conference on Establishment Surveys (II), Buffalo, New York, July 17-21, 2000.

3. Stopher, P. E. and C. G. Wilmot. Some New Approaches to Designing Household Travel Surveys–Time-Use Diaries and GPS. Presented at the 79th Annual Meeting of the Transportation Research Board, Washington, D.C., 1999.

4. Stopher, P. E. and C. G. Wilmot. Development of a Prototype Time-Use Diary and Application in Baton Rouge, Louisiana. Presented at the 80th Annual Meeting of the Transportation Research Board, Washington, D.C., 2000.

5. Arentze, T., M. Dijst, et al. A New Activity Diary Format: Design and Limited Empirical Evidence. Presented at the 81st Annual Meeting of the Transportation Research Board, Washington, D.C., 2001.

6. Kuhnau, J. and K.G. Goulias (2003) CentreSIM: First-generation Model Design, Pragmatic Implementation, and Scenarios, Chapter 15 in Transportation Systems Planning: Methods and Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 16-1 to 16-14.

7. U.S. Census Bureau. Centre County Quick Facts from the U.S. Census Bureau. May 7, 2003. http://quickfacts.census.gov/gfd/states/42/42027.html. Accessed May 29, 2003.

8. The Pennsylvania State University. Penn State Fact Book. http://www.budget.psu.edu/factbook/default.asp. Accessed July 5, 2003.

List of Figures

Figure 1. Example pages from the household questionnaire.
Figure 2. Example pages from the activity-travel diaries.
Figure 3. Example of a personalized cover from an activity-travel diary.
Figure 4. Example of a “completed” activity-travel diary.
Figure 5. Weekly survey return rate.
Figure 6. Example of a household questionnaire data entry screen.
Figure 7. Example of a diary data entry screen.

[Figure 1. Example pages from the household questionnaire.]

[Figure 1. Example pages from the household questionnaire (continued).]

[Figure 2. Example pages from the activity-travel diaries.]

[Figure 3. Example of a personalized cover from an activity-travel diary.]

[Figure 4. Example of a “completed” activity-travel diary.]

[Figure 5. Weekly survey return rate. Line chart of percent of respondents (0% to 40%) against week since initial mailing (weeks 1 to 14), with one series each for Lottery 1 and Lottery 2. Annotations: "Week 8: Lottery One follow-up mailed" and "Week 4: Lottery Two follow-up mailed".]

[Figure 6. Example of a household questionnaire data entry screen.]

[Figure 7. Example of a diary data entry screen.]

CHIRAC: A COMPREHENSIVE HOUSEHOLD INTEGRATED RECTIFIER FOR ACTIVITY DIARIES

Ondrej Pribyl, Jean-Robert Micaelli, Konstadinos G. Goulias, and Michael L. Patten

Abstract

Some of the more advanced activity-based models place higher requirements on the data needed for model estimation and calibration. For example, information about all activities as well as trips conducted by individuals in the study area during one or more days must be provided by the respondents. This information is often incomplete or inconsistent, since the demands on the respondents are higher than for a simple questionnaire. In this paper, we provide a general framework for management of the data in household activity diary surveys. Its main focus is on the entry and verification of the completed diaries.
Lessons learned are provided from our experience working on a survey conducted in Centre County, Pennsylvania (USA) in fall 2002 and spring 2003. A short summary of the survey background is provided as well. This paper is practical in nature and we demonstrate the most important issues through real-life examples. The paper concludes with a series of recommendations to help future data collection efforts.

INTRODUCTION

Activity-based approaches, with roots traced back to the 1960s and 1970s, are offered as the next best alternative to typical regional modeling and simulation for transportation decisions. Chapin’s research (1974) is the most likely study to have started a different way of thinking about travel demand, and particularly the idea of derived demand as envisioned in Manheim (1980). At about the same time, Becker (1976) developed his theory of time allocation from a household production viewpoint. This foundation is completed by a fourth study, Hagerstrand's seminal work on time-space geography (1970), which contributes a systematic approach to the constraints humans face, by chance and by design, on their action in space and time. Cullen and Dobson, in two papers in the mid-1970s as reviewed by Arentze and Timmermans (2000) and Golledge and Stimson (1997), appear to be the first researchers attempting to bridge the gap between the motivational (Chapin) approach to activity participation and the constraints (Hagerstrand) approach by creating a model that depicts a routine and deliberated approach to activity analysis. Most subsequent contributions to the activity-based approach emerge in one way or another from these initial frameworks, with important operational improvements as discussed in the reviews by Kitamura, 1988, Bhat and Koppelman, 1999, Arentze and Timmermans, 2000, and McNally, 2000.
The basic ingredients of an activity-based approach for travel demand analysis (Jones, Koppelman, and Orfeuil, 1990 and Arentze and Timmermans, 2000) are:

a) explicit treatment of travel as derived demand (Manheim, 1980), i.e., participation in activities such as work, shopping, and leisure motivates travel, but travel can also be an activity in itself (e.g., taking a drive). These activities are viewed as episodes (starting time, duration, and ending time) and they are arranged in a sequence forming a pattern of behavior that can be distinguished from other patterns (a sequence of activities in a chain of episodes). In addition, these events are not independent and their interdependency is accounted for in the theoretical framework;

b) the household is considered to be the fundamental social unit (decision making unit). The interactions among household members are explicitly modeled to capture task allocation and roles within the household, as well as relationships and the change in these relationships as households move along their life cycle stages and the individual’s commitments and constraints change; and

c) explicit consideration is given to constraints imposed by the spatial, temporal, and social dimensions of the environment. These constraints can be explicit models of time-space prisms (Pendyala, 2003) or reflections of these constraints in the form of model parameters and/or rules in a production system format (Arentze and Timmermans, 2000).

The inputs to these models are similar to the typical regional transportation model data on social, economic, and demographic information of potential travelers and land use information. In addition to the information that typical travel demand models need, they require a more detailed mapping of time allocation to create the schedules followed by people in their everyday life.
The outputs are detailed lists of activities pursued, times spent in each activity, and travel information from activity to activity (including travel time, mode used, and so forth) linked to individual and household characteristics. This output is very much like a “day-timer” for each person in a given region.

Collecting this type of detailed information, however, introduces additional problems. These diaries tend to contain many inconsistencies and incomplete information, with many similarities to the travel diaries that focus on trip-by-trip information (for a complete example of rectification needs in travel diaries see Goulias and Kim, 2003). This issue, with emphasis on activity diaries, was discussed in more detail, including examples from previous research, in Arentze et al. (1999). For this reason, before we can use the collected activity diaries in an activity-based model we must ensure that the data set used for its calibration is rectified to at least be free of internal contradictions, i.e., that it contains a complete schedule for each respondent without major gaps.

In support of such a task, Arentze et al. (1999) developed a SYstem for the Logical Verification and Inference of Activity diaries, called SYLVIA. The main purpose of the SYLVIA system is to support the calibration of the ALBATROSS model, developed by Arentze et al. (2000), but also to provide a generic data checking algorithm that can be used for verification of other activity diaries. It aims to obtain a clean and consistent data set that can be used for the purpose of validation of activity-based models. To accomplish this, the Dutch team uses a set of logical rules that finds and replaces mistakes and inconsistencies in a data set. Building on the SYLVIA method, in this paper we describe a new approach that addresses a similar general problem. It is designed to support another model system, called CentreSIM, and it expands the SYLVIA method to include a few additional issues.
The new data verification and rectification system is named CHIRAC: A Comprehensive Household Integrated Rectifier for ACtivity diaries. It is based on explicit recognition of the hierarchical nature of the data, contains many automated procedures, and allows for expert human intervention. It aims to provide a framework for checking, verifying, and rectifying data after collecting activity diaries. The remainder of the paper is organized as follows. First, a brief review of major inconsistencies and the need for data rectification is presented. This is followed by a description of the specific data collection project, called CentreSIM. Then, a general framework for data rectification is offered and a description of the application is provided. The paper concludes with a summary and brief discussion of findings.

MOTIVATION AND BACKGROUND

In this section we review a sample of some major sources of inconsistencies and problems in activity diaries. Previous studies in transportation surveys (Richardson, Ampt, and Meyburg, 1995) provide more general reviews of the process, review diary typologies (Axhausen, 1995), and instrument design (Stecher, Bricka, and Goldenberg, 1996). In the past few years an attempt was also made to define survey performance measures and to create minimum standards for practice (TRB, 2000, Stopher and Jones, 2003). Household activity diaries that aim to collect information from all household members are believed to be better than travel diaries targeting a subset of the household members (e.g., driving age members) because they collect information about the complete decision making unit and enable us to study joint household member patterns of time allocation. By using activity/time use diaries, together with definitions and questionnaire topics and themes that are more “natural” for the respondents (e.g., what did you do today?), we also expect to see better reporting by the respondents.
Moreover, using activities we may also trigger the respondent’s memory to provide a more accurate report of her/his behavior. Finally, the data thus derived contain some of the information needed by the most recent activity-based travel behavior models. However, activity surveys of the format described here require more reporting by the respondents and they may offer more opportunities for mistakes unless the instrument design avoids that. They also require significantly more work from the survey designers before, during, and after survey execution. In this section we focus on some of the key factors that motivate the creation of automated procedures for data handling after the respondents have returned their activity diaries. An attempt is made here to develop a typology for some of the most common sources of inconsistencies as we experienced them during the work on this project.

The majority of these inconsistencies are due to the survey instrument design (alternate instrument designs were pilot tested before the final version was used; however, even the final version did not eliminate all inconsistencies). For example, respondents did not clearly understand the format of the activity diaries, which led to incorrectly filled forms. The most common example is merging an activity and one or more trips into one record (this was also found by Arentze et al. 1999). We could observe this problem most often for shopping activities, such as the typical example depicted in Figure 8. This specific record should have been entered as three separate records (episodes): a trip from home to the store, the shopping activity, and a trip back home from the store. Unfortunately, this is not the only way in which the same activity/travel pattern was reported by other respondents. Some respondents would enter the first trip and the activity in one line and create a separate record for the return trip.
Others would report the trip going to the store in one line and combine the activity and the return trip in another. These inconsistent patterns (we name this occurrence episode misreporting) complicate any automated cleaning procedures. At the same time, finding and repairing these erroneous records is essential. Without rectification, the reported diaries would be biased and would systematically underestimate the number of trips and activities in the dataset.

Figure 8: An example of activity and trips reported together (episode misreporting). The stated starting and ending point is the home location of this respondent. Note: The addresses provided in all examples throughout this paper were modified in order to ensure confidentiality of the provided information.

Another set of issues arises when respondents refuse to provide some of the information requested (called item refusals herein). The reason behind this refusal varies. Some respondents had personal reasons for not providing information. In some cases the respondents assumed we did not need information about their activity participation, since we are interested in transportation (the material provided to the respondents clearly had a transportation policy flavor). One respondent actually provided a written comment on this matter and reported only her/his trips.

A third group of issues regards the incorrect level of detail provided by respondents. On one hand, some respondents reported only their major activities during the day (home, trip, work, trip, home). On the other hand, some would split the home activity into several records, reporting every detail (getting up, brushing teeth, taking shower, feeding dogs, preparing breakfast, eating, washing dishes, getting ready for work, preparing snack for kids, and so forth), so the resulting diary was rather long and detailed. Neither of the two is desirable for the survey.
However, the more detailed one allows for aggregation of activities, while the sketchy one reduces to a travel diary.

Other important sources of problems are inconsistencies that are implicitly embedded in the data (we will call them implicit inconsistencies throughout this paper). For example, respondents would refer to the same events or locations differently. Some people would describe shopping in the GIANT grocery store simply as shopping; some would provide the name of the store, or even more details. Some people used the term “Building” in the addresses, some would use the abbreviated form “Bldg” or “Bldg.”, and so forth. In addition, the same respondent would refer to the same location differently in different fields of the diary.

The respondents were also asked to provide information about the duration of each activity or trip, and the distance traveled. These are estimates of distance and time. Previous studies (Golledge and Stimson, 1997) suggest that the perceptions of time and distance are not precise, but time may be more accurately perceived than distance because some persons carry watches and clocks are posted at many places. In the diary we asked individuals to tell us the distance of each trip and the beginning and ending time of each trip. On one hand, the distance information gives us the opportunity to study their perception. On the other hand, however, caution is required in using this information when data corrections take place.

The number of errors and inconsistencies that are recorded by the respondents can be reduced by providing instructions and/or examples. In the instructions we can explain what the activity diary is and how it should be filled in. On one hand, the instructions should not be too long or too complicated, because the respondents do not want to spend too much time and effort reading them and trying to understand their meaning.
On the other hand, if the instructions are too short, the survey’s purpose and format are often not explained clearly. In our survey we provided a combination of instructions and examples. The examples were targeted at different population groups. For example, the case provided to students consisted of many on-campus activities followed by going out in the evening. The example for older individuals consisted of some community activities, going to church, and a more relaxed evening at home. We wanted to provide useful and relevant examples for each target group that would make filling in the diaries easy. However, as our experience shows, even these easy and targeted examples did not ensure that all respondents understood the format of the diary correctly and made no mistakes.

A very important issue is the motivation of the respondents. They need to feel that their responses are important and that they can actually make changes in the transportation system, especially since the demand of filling in the activity diaries is high. All correspondence with the respondents is personalized. The package sent to the respondents included an introductory letter, which started with the name of each particular respondent. In this letter, we let the respondent know the main purpose of the project and what the possible outcome for each person can be. We let the respondents know how important her/his cooperation is to us and that her/his reply can make a change. This increases the motivation of the respondent to pay attention to the survey, which in turn improves the quality of the answers as well as the response rate.

Errors in the data set can also occur after the completed diaries are received, mostly during the data input phase. It should be noted, however, that the majority of these problems are caused by misspelling, typographical errors, or inconsistent use of addresses (implicit inconsistencies).
For example, the recorders would enter the wrong ZIP code for a given address (this can easily happen since the ZIP codes in Centre County often differ only by the last digit), or swap two letters in the name of a street.

All the examples of inconsistencies and errors discussed in the previous paragraphs are meant to introduce some of the problems we face when collecting activity diaries. The framework developed in this paper aims to find, and eliminate or reduce, these inconsistencies, and finally obtain a dataset that can be used for transportation and activity-based models. This framework is approached from a practitioner’s point of view, so that it can be used in future collection of activity diaries.

Since we are dealing with human behavior and human answers, the authors believe that the use of fully automatic checking and repair rules is highly problematic. Two problems that have the same effect are often present in the data set for different reasons, and their treatment must be different as well. The following example demonstrates this argument. Clearly, in a sequence of two episodes, the second one must start at the time when the first ends. In the following sections of this document we describe automated rules that find episodes for which this assumption is violated. However, the data suggested that there were different causes for this problem. The difference can be caused simply by a typographical error made while filling in the diary or while entering the diary into the database. In some cases it is a sign of a missing activity or trip. In other cases the problem was simply that a.m. hours were recorded instead of p.m. hours. Certainly, there can be even more explanations for such a simple problem. Often a whole sequence of records had to be adjusted to fix the given problem. Determining how to fix the record (or series of records) often requires a more detailed study of the whole pattern, and possibly also of the patterns of other members of the family.
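As an illustration of the kind of automated rule referred to above, the following Python sketch flags consecutive episodes whose times do not connect. It is a hypothetical example, not the CHIRAC implementation: the `Episode` record, the minutes-after-midnight encoding, and the diagnostic labels are assumptions made for illustration. In keeping with the argument above, the rule only flags a pair and suggests a likely cause; the repair itself is left to a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    label: str
    start: int  # minutes after midnight (assumed encoding)
    end: int    # minutes after midnight (assumed encoding)


def check_continuity(episodes):
    """Flag each pair of consecutive episodes where the second
    does not start when the first ends. Nothing is repaired."""
    flags = []
    for i in range(len(episodes) - 1):
        prev, nxt = episodes[i], episodes[i + 1]
        gap = nxt.start - prev.end
        if gap == 0:
            continue
        if abs(gap) == 12 * 60:
            cause = "possible am/pm confusion"
        elif gap > 0:
            cause = "possible missing activity or trip"
        else:
            cause = "overlap, possible typographical error"
        flags.append((i, i + 1, gap, cause))
    return flags


# Hypothetical one-day diary with an unreported trip home.
diary = [
    Episode("home", 0, 450),
    Episode("trip to work", 450, 470),
    Episode("work", 470, 1020),
    Episode("home", 1050, 1440),
]
flags = check_continuity(diary)
```

Here the only flagged pair is the third and fourth episode, separated by 30 unreported minutes. A reviewer would then decide, possibly after looking at other household members' diaries, whether a trip home is missing.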
A logical rule would repair such records correctly in the majority of cases; however, a small percentage of records would be fixed incorrectly. In contrast to the SYLVIA system, we want to eliminate even this small number of errors, at the price, of course, of a higher demand for manual repairs.

THE CENTRESIM DATA

The CentreSIM model and data collection is a five-year initiative to improve modeling and simulation for transportation decisions in Centre County, Pennsylvania. CentreSIM is a regional simulation of Centre County, a region of approximately 136,000 persons that includes the Pennsylvania State University with more than 40,000 students and more than 12,000 faculty and staff. The unique characteristics of the Penn State population in terms of time and space choices need to be accounted for in the simulation of the county population. To accomplish this, a household survey is organized, and the spatial foundation of the CentreSIM model is assembled together with a complete inventory of businesses containing the address, business type classified by standard industry code (SIC), and number of employees of each business. The businesses are first matched to a digital map of the simulation area that includes the roadways and intersections of the transportation system, as well as the traffic analysis zones (TAZs) of the study area. The business type and size are used to estimate the capacity, in terms of the number of persons that can be served, of each establishment for the activities of work, shopping, recreation, and other. The activity capacities serve as a measure of attractiveness for each activity, and can be aggregated to the TAZ level to obtain the attractiveness of each TAZ and perform traditional four-step modeling. Population data from the 2000 U.S. Census are then used to estimate the capacity of each TAZ for the home activity.
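The aggregation of establishment capacities to the TAZ level can be sketched as follows. The establishment records, zone identifiers, and capacity values are invented for illustration, and normalizing each activity's capacities into regional shares is only one plausible way to express attractiveness; the exact formula used in CentreSIM is not given here.

```python
from collections import defaultdict

ACTIVITIES = ("work", "shopping", "recreation", "other")

# Hypothetical establishments: (TAZ id, capacity in persons by activity).
establishments = [
    (101, {"work": 40, "shopping": 120}),
    (101, {"work": 10, "recreation": 60}),
    (102, {"work": 150}),
    (102, {"shopping": 80, "other": 20}),
]


def taz_capacity(records):
    """Sum establishment capacities within each TAZ, by activity."""
    totals = defaultdict(lambda: dict.fromkeys(ACTIVITIES, 0))
    for taz, capacity in records:
        for activity, persons in capacity.items():
            totals[taz][activity] += persons
    return dict(totals)


def attractiveness(capacity_by_taz):
    """Express each TAZ's capacity as a share of the regional total
    (an assumed normalization)."""
    shares = {}
    for activity in ACTIVITIES:
        total = sum(c[activity] for c in capacity_by_taz.values())
        for taz, c in capacity_by_taz.items():
            shares.setdefault(taz, {})[activity] = (
                c[activity] / total if total else 0.0
            )
    return shares


capacity = taz_capacity(establishments)
shares = attractiveness(capacity)
```

With these made-up inputs, zone 101 holds a quarter of the regional work capacity and all of the recreation capacity.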
The spatial distribution of the four activity types and stay home forms the basis of a 24-hour accounting system of the population (named the zone presence model), their activities, and their activity locations. The spatial distribution of activity locations combined with the temporal activity patterns of six designated population segments (Penn State students, Penn State faculty, Penn State staff, unemployed, professionals, workers) are the basic elements in the development of this activity-based travel demand model. The activity patterns for each population segment will be calculated from the CentreSIM survey. The activity distributions multiplied by the number of persons in each population segment give the number of persons engaged in each activity in each hour. This output, combined with the spatial attractiveness of each TAZ for each activity, results in a zone presence model consisting of the number of persons in each zone and their activities by time of day. An example of an early application with data from a 1996 small sample survey can be found in Kuhnau and Goulias, 2003.

The CentreSIM household survey covers the entire Centre County and it also includes persons who work in Centre County and reside elsewhere. The data from this survey fill a critical gap in knowledge about this County and provide the necessary information to develop models that can be used in regional simulation software currently developed for alternatives exploration. This survey was administered between November 23, 2002 and May 30, 2003, and included two components, a household questionnaire and a two-day activity-travel diary, both of which are described briefly below. The survey was conducted during two different time periods during which different recruiting methods were used. We refer to these periods as Lottery I, conducted in fall 2002, and Lottery II, conducted in spring 2003.
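Returning to the zone presence model described at the start of this section, its core calculation can be sketched in a few lines of Python. All numbers below are made up, only two of the six segments and two activities are shown for a single hour, and allocating persons to zones in proportion to attractiveness shares is an assumption about how the two inputs are combined.

```python
# Hypothetical inputs for one hour of the day.
segment_size = {"students": 40000, "professionals": 20000}

# Share of each segment engaged in each activity during this hour.
activity_share = {
    "students": {"work": 0.10, "shopping": 0.05},
    "professionals": {"work": 0.80, "shopping": 0.02},
}

# Attractiveness of each zone for each activity (shares sum to 1).
zone_share = {
    "work": {101: 0.25, 102: 0.75},
    "shopping": {101: 0.60, 102: 0.40},
}


def zone_presence(sizes, shares, zones):
    """Persons present in each zone, by activity, for one hour."""
    presence = {}
    for activity, by_zone in zones.items():
        # Persons engaged in this activity, summed over segments.
        persons = sum(
            sizes[s] * shares[s].get(activity, 0.0) for s in sizes
        )
        for zone, share in by_zone.items():
            presence.setdefault(zone, {})[activity] = persons * share
    return presence


presence = zone_presence(segment_size, activity_share, zone_share)
```

Repeating the calculation for each of the 24 hours, with hour-specific activity shares estimated from the diaries, would give the number of persons in each zone by activity and time of day.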
The lotteries, in which prizes of $1,000 in gift certificates were distributed, were used to encourage people to participate in the survey. Identical questionnaires and activity-travel diaries were used throughout the entire survey. In the questionnaire, each participating household is asked to voluntarily provide information about household composition and the facilities available to household members. In addition, each household member provides personal information such as employment, driving ability, education, and so forth. The survey also includes a few questions about opinions and perceptions regarding the Centre County transportation system.

Activity and travel data are also collected from each person in the household using a complete two-day record of the activities in which each person engaged and the transportation options used. An example of the activity diary is provided in Figure 10. The respondents were asked to record the beginning and ending time of each episode and its purpose, and to answer the questions “With whom did you do the activity” and “For whom did you do the activity”. For trips, the respondents were asked to report the travel mode used, whether they drove, the starting and ending points of the trip, and an estimate of its distance. All these questions are included in order to capture the decision-making aspects of each household in activity and travel scheduling.

During the first survey lottery, subjects were recruited both by telephone and by mail. Our study team contacted 821 households by telephone to request participation in the study. Of these, only 190 agreed to participate. The contact rate (participants/contacts) was 23.1 percent. All of the 190 households agreeing to participate were mailed survey packets as outlined above. Of these, 80 households (43.5%) returned completed, usable questionnaires and diaries. The overall response rate for the telephone recruiting is 10.0 percent (returns/contacts).
In Lottery I, 1,478 households were contacted by mail. A total of 647 household questionnaires were returned, yielding a 61.3 percent response rate. Of the 647 households returning household questionnaires, 568 were mailed activity-travel diaries; time constraints prevented inclusion of all responding households. A total of 208 households returned completed diaries, yielding a 41.4 percent diary response rate. With 208 households returning usable diaries, the overall response rate for Lottery I is 13.7 percent.

Lottery II was done completely by mail. Overall, questionnaires were mailed to 7,447 households. A total of 2,414 household questionnaires were returned, yielding a 35.3 percent response rate. Of the 2,414 households returning household questionnaires, 1,969 were mailed activity-travel diaries. A total of 494 households returned completed diaries, yielding a 25.2 percent diary response rate. With 494 households returning usable diaries, the overall response rate for Lottery II is 8.9 percent. The second lottery, however, is heavily skewed in its composition by the presence of Penn State student participants, who have a dramatically lower response rate than any other segment of the population.

GENERAL FRAMEWORK

Collecting activity diaries requires a rather thorough process. It starts with developing the overall framework of the survey and continues with designing the questionnaire, choosing the right survey medium, determining a representative sample of the population in the study area, organizing the distribution of the survey instrument, and building a plan of action and contingency plans for all events that are expected to create problems. This paper focuses on the steps that follow the receipt of the completed surveys. A simplified flowchart of the process described in this paper is depicted in Figure 9. After the completed surveys are received, they are entered into a data set (data input module).
These data often carry many inconsistencies, as discussed above. Automated rules are applied to the data set in order to identify records that are potentially inconsistent. Such records are flagged and then fixed (data cleaning module). Finally, before we can deliver the data to customers, the data need to be converted to the desired final format. Our paper addresses all steps in this process.

Figure 9: Flowchart of the process (Receipt of completed surveys -> Data Input Module -> Data Cleaning Module -> Final Formatting -> Deliver data to customers).

The main objective of this paper is to develop a framework for the data entry process that minimizes the discrepancies between the observed activity/travel patterns and the patterns that are delivered to the customer. The final data set should contain consistent records that can be used directly for calibration of activity-based models.

Data Input Module

After the completed surveys are received, the data must be entered into a data set. As discussed above, we do not want to introduce any additional error into the data in this phase. Therefore, in this step we make no further changes to the data and no assumptions that could distort them. All additional changes were left to the second module, data cleaning. Our reasoning is explained in the following paragraphs.

The more people make assumptions and repair infeasible records, the more inconsistency is introduced. The same type of error would be repaired in many different ways, which is not desirable. Because of the large amount of data that must be entered into the database, several recorders worked on data entry. Clearly, their level of understanding of the purpose of collecting the activity diaries differs. For that reason, they were asked to record exactly what was written in the collected surveys (with the exception of obvious grammatical mistakes and errors).
All major changes and repairs are done in the second phase (data cleaning module) by a limited number of experts. These experts spent time studying the data set, trying to understand possible problems and their causes. They share their experience and discuss problematic cases with the entire group. Then, they develop a unified solution to the problematic cases. These steps helped to eliminate inconsistencies in their decisions.

One important issue that is nevertheless addressed in the data input module is fixing some implicit inconsistencies. The automated rules in the remaining steps of the data cleaning process often compare the entire strings characterizing behavior. Typically, we want to compare the locations of two different activities in order to determine whether they were conducted at the same location. All characters in these strings, including punctuation and blank spaces, need to be exactly the same in order to match the fields successfully. In the introductory part of this paper we described some implicit inconsistencies that often appear in the address of an activity, such as different phrases for the same location, abbreviations, and others. We tracked empirically the most common cases of these problems in the fields describing location and replaced them with the standardized abbreviations used by post offices in the USA. Some examples are provided in the following list (this list is not complete):

Item:        Abbreviation:
Avenue       Ave
Circle       Cir
Drive        Dr
Lane         Ln
Apartment    Apt
Building     Bldg
Room         Rm

All recorders were instructed to use these abbreviations in all cases, instead of the original answer. In addition, after the data were entered, a simple find-replace procedure was used to ensure that the entries are really standardized. For example, we would find all occurrences of the words “Avenue” and “Ave.” and replace them with “Ave”.
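The find-replace step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation pairs come from the list above, while the function name `standardize_address`, the use of regular expressions, and the whitespace normalization are choices made for this sketch, not the procedure actually used in the project.

```python
import re

# Abbreviation pairs taken from the list above (the list is not complete).
ABBREVIATIONS = {
    "avenue": "Ave", "circle": "Cir", "drive": "Dr", "lane": "Ln",
    "apartment": "Apt", "building": "Bldg", "room": "Rm",
}


def standardize_address(raw):
    """Return the address with standardized abbreviations.

    Hypothetical helper: collapses repeated blanks, then replaces each full
    word (or its abbreviation with a trailing period) by the standard form,
    so that exact string comparison of two locations can succeed.
    """
    text = " ".join(raw.split())
    for word, abbr in ABBREVIATIONS.items():
        pattern = rf"\b(?:{word}|{abbr})\b\.?"
        text = re.sub(pattern, abbr, text, flags=re.IGNORECASE)
    return text


print(standardize_address("123  Park Avenue"))
print(standardize_address("123 Park Ave."))
print(standardize_address("5 Elm drive, apartment 2"))
```

The first two calls yield the same string, which is the property the later matching rules rely on.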
The user environment also plays an important role in data entry and can reduce the number of mistakes that occur during this phase. User-friendly forms were developed for entering and editing the activity diaries. An example of a data editing form is shown in Figure 10. The form resembles the activity diaries as filled in by the respondents. This feature helps to keep visual control of the entered data.

Figure 10: An example of the data entry and data editing form in MS Access with a fraction of an activity pattern. Also provided is an example of the assignment of trip purposes (primary Q1a and secondary Q1b) for two of the records. It is the example of going to another travel mode as described in section 0.

Usually, all in-home activities of the entire family are located at the same address, so the same string repeats in multiple rows of the data set. This means that the recorder has to write the same address several times. For this reason we used a feature of MS Office that, while a field is being entered, automatically suggests the ending of the phrase based on previous entries (auto-completion). An example of the auto-completion is depicted in Figure 10. This feature significantly reduces the probability of typographical errors during this phase. The recorders had to be careful when writing the address for the first time. For the rest of the household, the field would appear automatically after the first few letters were typed.

Data Cleaning Module

The general flowchart of the data cleaning process is presented in the following figure.

Figure 11: General flowchart of the data cleaning routines (Input: rough data set -> Data checking: logical rules -> Assignment: home location, trip purposes, activity types -> Output: clean data set).

The rough data set, with only some of the implicit inconsistencies fixed, is the input in this phase.
Our major goal is to eliminate all remaining inconsistencies in the data set. The major part of the data cleaning procedure consists of applying a set of logical rules that find inconsistencies in the data. The problematic records are then marked and listed in a separate table. Then they have to be checked by an expert and fixed.

The respondents described in their own words the activity they did. In order to generate tables that can be used in modeling systems, we have to assign to each trip a value from a finite set of trip purposes and, similarly, assign a type to each activity. The routines that were used to do this assignment require knowledge of which activities were conducted at home and which out of home. Our solution to this problem is also described in this section.

Automatic data checking

The rules presented in this section aim to find records that are potentially inconsistent. The rules are deliberately designed to flag too many records, including some with no inconsistencies, rather than to miss records that need to be fixed. Contrary to the algorithms in SYLVIA, the rules do not aim to fix the problems (see reasoning in chapter 0). Repair remains the task of the experts who use these rules.

We propose five major groups of data checking rules. Their hierarchical structure is shown in Figure 12. The rules operating on the lowest level are called basic rules in this project. They focus on properties of individual records, for example that the ending time of an activity should not be before the start time of the same activity. The rules at the second level focus on consecutive records. These rules look for problems in the chaining of trips and activities. Once every pair of consecutive records is consistent, the rules called daily patterns are used to check for inconsistencies in each daily pattern, for example in the start time or end time of diaries.
The next stage aims to find problematic records at the transition between the two survey days for each person. The highest level in this scheme is the household level. Here we try to find inconsistencies in the reported number of persons in different age groups, possession of a driving license, and others.

Figure 12: Hierarchical structure of the data checking rules (from highest to lowest level: Household Inconsistencies (H), Day to day transition (D), Daily Patterns (P), Trip chaining (T), Basic Rules (B)).

A more detailed description of particular rules is provided in the following sections. Their summary is provided in Table 8. This table also contains the number of flags raised by each rule when applied to the CentreSIM survey data. More details about this topic are provided in section 0.

Basic rules for particular records

Rule B1 - Activity hours too long

It is quite unlikely that a single activity would have a long duration. Even a work activity is usually interrupted by a lunch break. This rule finds activities that are longer than 6 hours, because this is often an indication of a problem with the record (an activity and a trip mixed together). Only when the activity type is school, work, or sleep is the limit set to 10 hours. These numbers were empirically derived from the data.

Rule B2 - Trip time is longer than 1 hour and its distance shorter than 15 miles

Often, a trip of long duration in the data set is an indication that an activity is mixed together with a trip. For example, people would fill in a trip described as shopping that in reality should be split into three records: the trip from home to the store, the shopping activity, and the return trip home. This rule finds trips that are longer than 1 hour and in which the distance traveled did not exceed 15 miles. As with the previous rule, these values were derived empirically.
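Rules B1 and B2 reduce to simple threshold comparisons, as the following Python sketch shows. The thresholds are the ones quoted above; the record layout (a dictionary with the keys `is_trip`, `type`, `start`, `end`, and `miles`) is a hypothetical stand-in for the survey's actual fields, and "did not exceed 15 miles" is read here as a less-than-or-equal test.

```python
from datetime import datetime

# Thresholds quoted in the text for rules B1 and B2.
MAX_ACTIVITY_HOURS = 6
MAX_LONG_ACTIVITY_HOURS = 10            # applies to school, work, and sleep
LONG_ACTIVITY_TYPES = {"school", "work", "sleep"}
MAX_TRIP_HOURS = 1
SHORT_TRIP_MILES = 15


def hours(record):
    """Duration of a record in hours."""
    return (record["end"] - record["start"]).total_seconds() / 3600


def basic_flags(record):
    """Return the basic rules (B1, B2) that flag this record.

    The rules only flag; any repair is left to an expert.
    """
    flags = []
    if not record["is_trip"]:
        limit = (MAX_LONG_ACTIVITY_HOURS
                 if record["type"] in LONG_ACTIVITY_TYPES
                 else MAX_ACTIVITY_HOURS)
        if hours(record) > limit:
            flags.append("B1")
    elif hours(record) > MAX_TRIP_HOURS and record["miles"] <= SHORT_TRIP_MILES:
        flags.append("B2")
    return flags


day = datetime(2003, 2, 3)
shopping = {"is_trip": False, "type": "shopping",
            "start": day.replace(hour=9), "end": day.replace(hour=16)}
work = {"is_trip": False, "type": "work",
        "start": day.replace(hour=8), "end": day.replace(hour=16)}
slow_trip = {"is_trip": True, "type": "shopping", "miles": 3,
             "start": day.replace(hour=17),
             "end": day.replace(hour=18, minute=30)}

print(basic_flags(shopping))   # 7-hour shopping activity
print(basic_flags(work))       # 8-hour work activity, under the 10-hour limit
print(basic_flags(slow_trip))  # 90-minute trip covering only 3 miles
```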
Rule B3 - End time of an activity is before beginning time of the same activity

Clearly, the ending time of each activity must be later than its beginning time. This rule helps to find records that do not satisfy this condition.

Rule B4 - Missing travel information

This rule finds records that are classified as trips but lack some or all travel information. The only exception is the travel mode walk, in which case there is no answer about traveling as a driver or a passenger. This rule helps to repair cases in which an activity was accidentally classified as a trip.

Rule B5 - Travel speed limit

The respondents were asked to provide the beginning and ending time of each trip, as well as the travel mode and the perceived distance of the trip. This rule helps to find records with unrealistic speeds. Limits of expected speed for each travel mode were empirically derived from the data. Clearly, the respondents’ perception of distance is not perfect, which implies a wider range of expected speeds. The travel modes used in the survey, together with the derived speed limits and the distance limits for some travel modes, are provided in Table 6. If the average speed (reported distance divided by travel time) does not fall within the range of minimum and maximum speed for the particular travel mode, the record is flagged.

Rule B6 - Travel distance limit

Similarly to the previous rule, this rule finds records that suggest an unrealistic distance. The limits for distances, derived empirically from the data, are also provided in Table 6. This rule concerns only trips made on foot or by bicycle. The maximum distance for bicycle and for walking was set to 5 miles and 1.5 miles, respectively.

Rule B7 - Missing location of an activity

This rule finds records in which the location of an activity is not specified.

Table 6: The travel modes considered in the survey together with speed and distance limits empirically derived from the data.
ID    Travel Mode          Speed Limit [mph]    Distance Limit [miles]
                           Min      Max         Min      Max
1     Car, truck or van    7        65
2     Bus                  4        45
3     Taxicab              5        15
4     Motorcycle           5        65
5     Bicycle              1.5      10                   5
6     Walked               0.3      6                    1.5
7     Other
888   Skip
999   No ans.

Problems of trip chaining

Rule T1 - Change location without travel - two consecutive activities are conducted at different locations (Activity-Activity)

If there are two consecutive activities (i.e., no trip between them) and the locations of these activities are different, there is most likely a missing trip. This rule aims to find these records.

Rule T2 - The location of an activity does not equal the ending point of the previous trip (Trip-Activity)

This rule also deals with inconsistencies in location. It finds records in which an activity follows a trip but the location of the activity is not the same as the ending point of the preceding trip.

Rule T3 - The starting point of a trip does not equal the ending point of the preceding trip (Trip-Trip)

This rule is very similar to the previous rule; the difference is that it deals with two consecutive trips. In this case, the starting point of the second trip must be equal to the ending point of the first trip.

Rule T4 - The starting point of a trip does not equal the location of the preceding activity (Activity-Trip)

This rule also deals with inconsistencies in location, in this case when a trip follows an activity. The starting point of the trip must be equal to the location of the previous activity.

Rule T5 - Ending point of a trip equals the starting point of the same trip

This rule finds cases in which the starting point of a trip equals the ending point of the same trip. This often means that there is a missing activity.
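Rules T1 to T5 all compare where one record ends with where the next one begins, so they can be expressed in a single pass over a person's diary. The sketch below is a minimal Python illustration under assumed names: the record keys (`is_trip`, `location`, `origin`, `destination`) and the helper functions are hypothetical, and locations are compared as exact strings, as the text describes for standardized addresses.

```python
def activity(location):
    """Hypothetical constructor for an activity record."""
    return {"is_trip": False, "location": location}


def trip(origin, destination):
    """Hypothetical constructor for a trip record."""
    return {"is_trip": True, "origin": origin, "destination": destination}


# Rule name by (previous record is a trip, current record is a trip).
PAIR_RULES = {
    (False, False): "T1",   # Activity-Activity
    (True, False): "T2",    # Trip-Activity
    (True, True): "T3",     # Trip-Trip
    (False, True): "T4",    # Activity-Trip
}


def chaining_flags(records):
    """Return (record index, rule) pairs for one person's ordered diary."""
    flags = []
    for i, rec in enumerate(records):
        # T5: a trip that starts and ends at the same place.
        if rec["is_trip"] and rec["origin"] == rec["destination"]:
            flags.append((i, "T5"))
        if i == 0:
            continue
        prev = records[i - 1]
        prev_end = prev["destination"] if prev["is_trip"] else prev["location"]
        cur_start = rec["origin"] if rec["is_trip"] else rec["location"]
        if prev_end != cur_start:
            flags.append((i, PAIR_RULES[(prev["is_trip"], rec["is_trip"])]))
    return flags


diary = [
    activity("12 Elm St"),
    trip("12 Elm St", "200 College Ave"),
    activity("200 College Ave"),
    activity("50 Main St"),              # no trip before it
    trip("50 Main St", "50 Main St"),    # same start and end point
]
print(chaining_flags(diary))
```

Because the comparison is exact, the address standardization done during data entry directly determines how many false flags these rules raise.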
Rule T6 - Beginning time of an activity does not equal the end time of the previous activity

This rule finds cases in which the beginning time of one record (an activity or a trip) does not equal the ending time of the previous record. There are several possible causes for this problem. For example, the individual could simply have recorded a wrong time (in this case it could be changed automatically), or s/he forgot to record another whole trip or activity.

Daily patterns

Rule P1 - The end time of the last activity in the day does not equal 11:59:00PM

The daily pattern in our project was defined to start at midnight (12:00 AM) and end at 11:59 PM the same day. This rule finds diaries that do not end at 11:59 PM.

Rule P2 - First activity does not start at 12:00:00AM

Similarly, this rule finds schedules that do not begin at midnight (12:00 AM).

Rule P3 - There are fewer than five activities in the diary

Even a simple activity/travel pattern has to consist of several records (sleep, eat breakfast, travel to work, work, return home, go to bed). This rule finds diaries that do not contain more than 5 records.

Transition from one day to another

Rule D1 - The location of the first activity in day 2 does not equal the location of the last activity in day 1

Even when the daily patterns of an individual are consistent, we must ensure consistency of the transition from one day to the other. This rule finds records in which the location at the beginning of the second day does not correspond to the location at the end of the first day. In such a case, a record is probably missing.

Consistency on the household level

Rule H1 - Total number of persons in the household (Q23) does not equal the sum of persons stated in different age groups (Q24)

In the household form, we asked several questions about the structure of the household.
This rule finds records in which the number of persons declared by age group does not equal the total number of persons in the household. (Q23: “how many people live permanently in your household”, and Q24: “Of the people in your household, how many are: a) younger than 5 years old; b) 5 to 11 years old; c) 12 to 15 years old; d) 16 to 18 years old; e) 18 years of age or older”.)

Rule H2 - Total number of persons in the household (Q23) does not equal the number of people stated in the household questionnaire

In the household questionnaire we also asked about the names and other information of particular respondents. The number of names provided should equal the total number of persons stated for the household.

Rule H3 - Number of people differs between household form and the diaries

This rule compares the number of persons stated for a specific household to the number of travel diaries stored in the data set. Clearly, there should be two diaries for each individual (two days). This rule finds surveys that do not satisfy this expectation.

Rule H4 - Person has a driver's license with age less than 16

An individual who is younger than 16 years is not eligible to have a driver’s license. This rule finds individuals who stated that they were younger than 16 years of age but also stated that they have a driver’s license. One of the fields must have been filled in incorrectly.

Rule H5 - Person stated in the diary form that s/he drove, but s/he is not a licensed driver in the household form

One question in the activity/travel diaries asks whether the individual was the driver or a passenger for a particular trip. This rule finds diaries in which a person stated that s/he drove a car but stated in the household form that s/he has no driving license.
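The household-level rules are cross-field comparisons. The following Python sketch illustrates rules H1, H4, and H5 only; the household dictionary and its keys (`total_persons`, `age_group_counts`, `persons`, `has_license`, `drove_in_diary`) are hypothetical names for this illustration and do not reflect the actual layout of the survey tables.

```python
def household_flags(household):
    """Return (rule, person name) pairs; the name is None for H1."""
    flags = []
    # H1: Q23 total must equal the sum of the Q24 age-group counts.
    if household["total_persons"] != sum(household["age_group_counts"]):
        flags.append(("H1", None))
    for person in household["persons"]:
        # H4: a driver's license reported for someone younger than 16.
        if person["has_license"] and person["age"] < 16:
            flags.append(("H4", person["name"]))
        # H5: drove in the diary but is unlicensed in the household form.
        if person["drove_in_diary"] and not person["has_license"]:
            flags.append(("H5", person["name"]))
    return flags


example = {
    "total_persons": 3,
    "age_group_counts": [0, 1, 0, 0, 1],   # sums to 2, not 3
    "persons": [
        {"name": "A", "age": 41, "has_license": True, "drove_in_diary": True},
        {"name": "B", "age": 14, "has_license": True, "drove_in_diary": False},
        {"name": "C", "age": 17, "has_license": False, "drove_in_diary": True},
    ],
}
print(household_flags(example))
```

As with the other rules, a flag says only that two answers disagree; deciding which answer is wrong is left to the expert.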
Rule H6 - Person does not have a driver’s license but stated in the household form that s/he sometimes drives a car

Another question in the household form asks how often an individual drives a car. This rule finds forms in which a person stated that s/he does not have a driving license but answered in the household form that s/he sometimes drives a car.

Assign Home Location

The respondents provided information about their home addresses in a separate field of the survey, but not in their activity diaries. Only an address was entered in the field “location of your activity”, so an algorithm had to be developed to find all home-based activities. This information is essential for other automatic routines, such as the assignment of trip purposes and activity types. This section describes a semi-automated procedure to identify all in-home activities.

The algorithm consists of two steps. In the first step we compare all addresses stated by a particular household in the activity diary to the address to which we originally mailed the survey. The mailing addresses were stored in a separate data set, which can nevertheless be matched to the travel diary using a unique identifier, the form number. If the fields in both data sets match, the activity was clearly undertaken at home. We do not overwrite the original information. Instead, we introduce a new dummy variable that equals one if the activity was conducted at home, and zero otherwise (an activity out of home or a trip).

This first step finds the majority of the activities located at home; however, there are some exceptions. One reason for not finding an in-home activity can be an inconsistency between the stated and mailing addresses. A simple misspelling, an extra space inserted somewhere in the string, or other similar reasons can be the cause. For this reason a second step is introduced. The second step is based on the assumption that the majority of persons start their daily patterns at home.
This is clearly not true in all cases (for example, respondents could work after midnight, still be at a party or in a bar, be visiting friends, or be elsewhere for other reasons), so we cannot create a fully automatic procedure based on this assumption. In this second step we find all diaries in which the first activity was not marked as “at home” during the first step. The addresses in the listed diaries are then manually checked and fixed if necessary, and the “at home” field is also set for these activities. This step requires some additional non-automated work; on the other hand, it increases the reliability of the final data set, which was one of our major objectives. The load of manual work was not unreasonably high in this case.

We demonstrate the workload with an example from our application. There were 284 cases in which the first record was not identified as at home by the first step of the algorithm. These records had to be checked manually by an expert. However, only about 15% of them were actually not conducted at home (working at night, sleeping out of home, out of town, engaged in active nightlife). We did not make any corrections in these cases. The rest of the 284 cases were caused by problems in comparing addresses: discrepancies between the address stated in the activity diary and the mailing address, a residence address that differs from the mailing address (e.g., a P.O. box number, which obviously is not a residence location), and others. We corrected the address and marked these activities as activities conducted at home.

Assign Trip Purposes

As mentioned in the previous section, many activity and travel behavior models need information about the purpose of each trip. In our activity-travel diaries, we asked people to describe in their own words what activity they conducted and, in the case of trips, what their purpose was.
We did not provide a list of possible answers to the respondents because we did not want to affect and bias their answers. This provides the opportunity to understand respondents’ perception of their trip purpose, but the variety of answers was very large, since two people usually refer to the same event in different ways due to linguistic habits.

For the purpose of modeling, we need to use only a finite number of well-defined trip purposes and activity types. Different models require different trip purposes, and there is no single universally accepted standard that could be used. In this phase of our analysis we decided to use a rather small number of trip purposes that would be easy to classify. It should be noted, however, that we do not overwrite the original answers (marked Q1 in the data set); rather, we introduce new variables to store the trip purposes. This enables us or other data users to define a different set of trip purposes and assign them in later phases.

We suggest 14 categories of trip purposes, which are summarized in Table 7. The first column of this table shows the identification numbers that we used in the data set. The second column describes the trip purposes. We introduce two new variables: Q1a (primary trip purpose) and Q1b (secondary trip purpose). The secondary trip purpose is used only for a trip going to another travel mode (denoted 81 in the table). Typically this is the case when there is a sequence of two or more trips. For example, the individual first walked to a bus station and then took a bus to school. In this case the primary purpose of the first trip is denoted “going to another travel mode” (81). Its secondary purpose is travel to school (12). The second trip is assigned only the primary trip purpose, commute to school (12). A similar example from the data set is provided in Figure 10.

This section describes the process of assigning these trip purposes to particular trips.
This is not as simple a task as it might appear. The field that plays the most important role is the answer to question Q1: “What activity did you do?”, which in the case of trips corresponds to the description of the trip. However, there are also other features that have to be taken into consideration. For example, in some cases we cannot decide on the right trip purpose without considering the personal characteristics of the respondent, her/his travel company (Q3), for whom the respondent undertook the trip (Q4), or implications suggested by the entire travel pattern and the travel patterns of the rest of her/his family. This again limits the possibility of using an automated algorithm, as discussed in the sections below.

The routine starts by finding trips that should be classified as “return home” (23). We can use simple algorithmic logic since we already have the information about home-located activities. In this algorithm we find all trips that precede a home-located activity and set those trips to “return home”. Pseudo-code for this procedure is listed in Table 7, where it is marked by a star.

Determining the remaining trip purposes is more difficult. Since the variety of respondents’ descriptions of the trip purpose (Q1) is large and the problem is rather complicated, only a semi-automated algorithm can be used. In a preliminary study we identified several words or sets of words that are unique to one particular trip purpose. The words we used to assign trip purposes are listed in the third column of Table 7. We compare the stated purpose of the trip (Q1) to each of these words and, in case of a positive match, we assign the corresponding identification number. Note that there are two different cases described in the table. We either look for the whole phrase (“to wal-mart”), or we look for two separate words (“to” ”bank”).
If we looked only for the joined phrase “to bank”, the algorithm would miss cases such as “to PNC bank”, “to the bank”, and similar. It would find only complete matches. Our preliminary analysis helped us to determine which of these two cases is more suitable for each sequence of words. In order to make the procedure more versatile and user friendly, we designed a form in the MS Access environment that can be used as an interface for this step. New words or phrases to look for can easily be added to this table. We should always keep in mind that the words added should be truly unique to each category.

Another procedure has to be used for a trip going to another travel mode (81). The algorithm we propose is given in Table 7, where it is marked by two stars. First we find a sequence of two or more trips. The primary purpose of the last trip is already assigned or has to be assigned manually. The primary purpose of the preceding trips is set to 81 (going to another travel mode) and their secondary purpose is set to the primary purpose of the last trip. However, this procedure did not work as expected. Often we found cases in which the respondent omitted an activity between two trips, mostly because the omitted activity was of short duration (picking up lunch, a delivery, and others). Rather than assigning a trip purpose, we had to add the missing activity. For this reason all trips assigned number 81 (going to another travel mode) should be manually checked during the second phase of this procedure. The additional effort, however, brought another improvement of the database through the addition of missing activities. The automated rule makes it easy to find these records and decreases the time that must be dedicated to fixing them.

Because of the large variety of answers, not all trips could be classified by our automatic routines. The rest remains for manual trip purpose assignment.
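The difference between matching a whole phrase and matching separate words can be illustrated with a short Python sketch. The search terms are a small subset of those reported in Table 7; everything else is an assumption made for this illustration, including the names `PURPOSE_TERMS`, `matches`, and `classify`, the tokenization, and the decision to return no purpose when terms from more than one category match, so that the trip falls through to manual assignment.

```python
import re

# Subset of the Table 7 search terms. A one-element tuple is a whole phrase;
# a longer tuple lists separate words that must all occur somewhere.
PURPOSE_TERMS = [
    (11, ("to work",)),
    (12, ("to", "school")),
    (31, ("to wal-mart",)),
    (31, ("shopping",)),
    (52, ("to post office",)),
    (52, ("to", "bank")),
]


def matches(description, terms):
    """True if the free-text answer to Q1 contains the search terms."""
    text = " ".join(description.lower().split())
    if len(terms) == 1:
        return terms[0] in text                      # whole-phrase match
    words = set(re.findall(r"[a-z0-9'-]+", text))
    return all(term in words for term in terms)     # separate-word match


def classify(description):
    """Return a trip purpose ID, or None to leave the trip for manual work."""
    found = {pid for pid, terms in PURPOSE_TERMS if matches(description, terms)}
    return found.pop() if len(found) == 1 else None


print(classify("to PNC bank"))
print(classify("To Wal-Mart"))
print(classify("to work then to the bank"))
```

With the separate-word rule, “to PNC bank” and “to the bank” both receive purpose 52, which a search for the joined phrase “to bank” would miss. The last call matches two categories and is therefore left unclassified.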
Manual assignment is rather time consuming, because it is important to look at the whole activity-travel pattern. Doing so helps to understand all interdependencies among trips and activities and reduces the danger of misclassification.

Table 7: Description of trip purposes used for our analysis together with the phrases that are used for our semi-automated procedure.

ID   Trip Purpose                                                      Look for
11   Commute to work                                                   "to work", "to office"
12   Commute to school                                                 "to" "school", "to" "class"
13   Other work related travel (professional drivers, etc.)
21   Return home                                                       *
31   Shopping                                                          "to weis", "to wal-mart", "to wegmans", "to target", "to store", "to lowes", "to giant", "shopping", "shop"
32   Dining
33   Refreshment (coffee, drink, snack, etc.)                          "coffee"
41   Doctor Apt. / Other medical rel.                                  "to doctor", "to doctors", "to dentist"
42   Other Appointment / Meeting
51   Escort (Picking Up / Dropping off others)
52   Errands (Banking, delivery, etc.)                                 "to post office", "to" "bank"
61   Recreation and leisure / Exercise / Lessons / Personal Business   "to church", "walk dog"
71   Visiting friends/family                                           "friend"
81   Going to another travel mode                                      **

*    if Q5(i)=1 and Q5(i+1)<>1 and AtHome(i+1)=1 then set Q1_a=23
**   if Q5(i)=0 and Q5(i-1)=1 and Q5(i-k)=1 then set Q1_a(i-k)=81, set Q1_b(i-k)=Q1_a(i-1). Note: look for k=2,3,4,..., until Q5(i-k)=0

Q1    What activity did you do?
Q1a   Primary trip purpose
Q1b   Secondary trip purpose (only for trip purpose 81)
Q5    Did you travel? 1 = YES, 0 = NO

Assign Activity Types

Almost all common transportation models require knowledge of trip purposes, so this was our first priority. However, the great majority of activity-based models also require knowledge of activity types. This means that each activity will be assigned one type from a limited list of types. The level of detail required by particular models varies.
Within this preliminary study we consider only five activity types, those required by an activity-based model that is currently being used at the Pennsylvania State University (Kuhnau and Goulias, 2003). These activity types are Home (all in-home activities), Work (mandatory activities that are usually conducted repeatedly and in the same time range, for example work or school), Shopping (various shopping activities and errands such as banking), Recreation (for example leisure activities, visiting friends, dining out, and others), and Others (doctor's or other appointments, escort, and others). Our task is now to assign each activity in the travel diaries to one of these five activity types.

Since we have already determined the purpose of each trip, we can use that information to assign the activity types. For example, an activity that follows a trip with purpose shopping (denoted 31 in Table 7) is clearly a shopping activity. Similarly, an activity following a return-home trip belongs to the type Home. The data set already records which activities are conducted at home, so we can directly assign them the type Home. For the remaining activities in the diary we determine the type from the purpose of the preceding trip. When two activities follow each other without an intervening trip, we assume that they are of the same type. Alternatively, since there are not many such sequences of out-of-home activities, we could check all of them manually.

Final formatting

Before delivering a final database to the customer, we have to modify the format of the data set to fit the expectations of particular transportation models. These changes are specific to each data set and to each customer (based on model requirements), so we describe this step only briefly using a few examples.
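As a brief illustration of the activity-type assignment described in the preceding section, the following Python sketch propagates trip purposes to the activities that follow them. The mapping `PURPOSE_TO_TYPE` is our reading of the type descriptions and is an assumption (in particular, placing purpose 33 under Recreation); purposes 13 and 81 are not covered by those descriptions and are left unresolved. The record keys `is_trip`, `purpose`, and `at_home` are hypothetical names.

```python
# Hypothetical mapping from Table 7 trip purposes to the five activity types.
# Purposes 13 and 81 are deliberately absent: the text does not say how to
# classify them, so activities following such trips stay unresolved (None).
PURPOSE_TO_TYPE = {
    11: "Work", 12: "Work",
    21: "Home",
    31: "Shopping", 52: "Shopping",
    32: "Recreation", 33: "Recreation", 61: "Recreation", 71: "Recreation",
    41: "Others", 42: "Others", 51: "Others",
}


def assign_activity_types(diary):
    """Return one entry per record: an activity type, or None.

    Trips always get None. Activities flagged as at home get "Home". Other
    activities take the type implied by the preceding trip; consecutive
    activities without a trip in between share the same type.
    """
    types = []
    last = None  # type implied by the most recent trip or in-home activity
    for rec in diary:
        if rec["is_trip"]:
            last = PURPOSE_TO_TYPE.get(rec.get("purpose"))
            types.append(None)
        elif rec.get("at_home"):
            last = "Home"
            types.append("Home")
        else:
            # An out-of-home activity cannot be of type Home; leave it for
            # manual review instead of guessing.
            types.append(last if last != "Home" else None)
    return types
```

Entries left as None for activities mark the records that would go to manual checking, in line with the suggestion to inspect the few sequences of out-of-home activities by hand.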
Treatment of missing values

In the data set, the default value of every variable was set to 999, which corresponds to missing data. However, in some cases people did not answer a question because it was not applicable. In this step we find these variables (and/or records) and change their values from 999 to the desired final value, usually 0. For example, in the household survey we asked whether the household has an Internet connection at home and, if so, about its type and speed. Automated procedures were used to set the values of the latter two answers when the household does not have an Internet connection at all. There were other similar cases in the data set.

Format of the household surveys

Every household in the data set has exactly one record. Information about particular household members is stored as different variables. Most transportation models require the data set to be at the person level, with each individual corresponding to one record. A table in person-by-person format was therefore produced for use in transportation models. Summaries of the total number of trips by trip purpose and travel mode, as well as the total time spent traveling by trip purpose and travel mode, are also provided in this table.

Format of the travel diaries

Many transportation models require information only about trips and not about activities. For this reason we derived a data set that contains only trips. Records of diaries with no trips also had to be included, with an indication that no trips were made.

APPLICATION OF CHIRAC TO CENTRESIM

In this section we provide some preliminary results of the application of the logical rules to the CentreSIM survey data set. Only the data entered by February 2003 were used for this study. By this date, the data set consisted of 718 diaries and 10,602 records (the sum of all activities and trips).
During the cleaning process we had to eliminate 23 respondents (46 diaries) from the data set because their activity diaries did not include some essential information. The most common error in the data set was that respondents combined activities and trips in their diaries (episode misreporting). These records were split into several records, which led to an increase in the number of records. After cleaning, the final data set consisted of 672 diaries and 10,702 records.

The set of logical rules was applied to the described data set. The number of records flagged by each rule is provided in Table 8, where it is denoted Before (before cleaning). A team of experts repaired the problematic records. At the end of the cleaning process, the logical rules were applied again to the final data set to see how much improvement could be observed. These values are denoted After (after cleaning) in Table 8. The table shows that the number of flagged records was drastically reduced during data cleaning, but not eliminated in all cases. This is because a flag does not necessarily mean an error or inconsistency in the data set. The rules were designed to find all potential problems: rather than miss records that are actually wrong, in some cases they flag even records that are correct. The cases still flagged at the end of the cleaning process do not represent errors, and the share of correct cases among all flags varies between the rules.

A typical example is rule B1, which finds activities longer than 6 hours (10 hours in the case of sleep, work, or study activities). People usually do not conduct such long activities, so a flag can be a sign of an omitted activity. This rule aims to find such activities. In some cases the rule helped us to find other problems, such as records in which an am and a pm hour were confused, or a clearly missing activity.
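Rule B1 as described above can be sketched in Python as follows. The record layout (keys `is_trip`, `kind`, `start`, `end`) and the labels in `LONG_KINDS` are hypothetical; only the thresholds of 6 and 10 hours come from the text.

```python
from datetime import datetime, timedelta

# Activity kinds that are allowed 10 hours instead of 6 (labels are assumed).
LONG_KINDS = {"sleep", "work", "study"}


def flag_b1(records):
    """Return the indices of activities whose duration exceeds the B1 limit."""
    flagged = []
    for idx, rec in enumerate(records):
        if rec["is_trip"]:
            continue  # B1 applies to activities only
        hours = 10 if rec["kind"] in LONG_KINDS else 6
        if rec["end"] - rec["start"] > timedelta(hours=hours):
            flagged.append(idx)
    return flagged


day = datetime(2003, 2, 3)
records = [
    # 9 hours of work: within the 10-hour limit, not flagged
    {"is_trip": False, "kind": "work",
     "start": day.replace(hour=8), "end": day.replace(hour=17)},
    # 7.5 hours of shopping: flagged, possibly an omitted activity or trip
    {"is_trip": False, "kind": "shopping",
     "start": day.replace(hour=13), "end": day.replace(hour=20, minute=30)},
    # 12 hours of work: flagged, although it may be genuine overtime
    {"is_trip": False, "kind": "work",
     "start": day.replace(hour=7), "end": day.replace(hour=19)},
]
flags = flag_b1(records)
```

The rule only marks candidates. Whether a flagged record is an error (for example a confused am/pm hour) or a genuinely long activity is still decided by the person reviewing the diary.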
After data cleaning, however, 113 records were still flagged by rule B1. In these cases the activity really was that long, for example working overtime without a lunch break, so the resulting flags do not indicate problems in the diaries. For this rule we were able to fix only a small portion of the flagged records, and the rest are considered correct.

Rule B2, which finds trips longer than one hour, is an example in which the majority of the flagged records really needed some kind of rectification. Such trips are unlikely and were mostly caused by episode misreporting, for which a remedial solution was found. In the remaining five cases people reported hunting, an activity very popular in Pennsylvania during hunting season. We did not modify these records since we did not have any additional information. We describe several other examples below.

Rule P3 finds schedules that contain fewer than five records. In the majority of cases people skipped information about trips, mixed activities and trips into one record, or made other mistakes that were fixed. The remaining 18 records do not contain any errors. In most of these cases respondents were traveling outside the study area, so their only record was the long-distance trip. We kept these diaries in the data set.

Table 8: Summary of cleaning rules together with the number of marked records on the original (before) and final (after) data set.
Rule  Before  After  Description

Basic rules
B1       138    113  Activity hours too long
B2       121      5  Trip time is longer than 1 hour and its distance shorter than 15 miles
B3         9      0  End time of an activity is before beginning time of the same activity
B4       352    262  Missing travel information
B5       474    179  Travel speed limit
B6        10    230  Travel distance limit
B7       333     30  Travel speed limit and travel time is longer than 20 minutes
B8        74     25  No location

Inconsistencies in trip chaining
T1       320      0  Change location without travel: two consecutive activities are conducted at different locations (Activity-Activity)
T2       686      0  The location of an activity does not equal the ending point of the previous trip (Trip-Activity)
T3       710    299  The starting point of a trip does not equal the ending point of the preceding trip (Trip-Trip)
T4       509    241  The starting point of a trip does not equal the location of the preceding activity (Activity-Trip)
T5       208     85  Ending point of a trip equals the starting point of the same trip
T6       522      0  Beginning time of an activity does not equal the end time of the previous activity

Daily patterns
P1        79      0  The end time of the last activity in the day does not equal 11:59:00PM
P2         6      0  First activity does not start at 12:00:00AM
P3        40     18  There are fewer than five activities in the diary

Day to day transition
D1        57      0  The location of the first activity in day 2 does not equal the location of the last activity in day 1

Household inconsistencies
H1        25      0  Total number of persons in the household (Q23) does not equal the sum of persons stated in different age groups (Q24)
H2        74      0  Total number of persons in the household (Q23) does not equal the number of people stated in the household questionnaire
H3        21      2  Number of people differs between household form and the diaries
H4        11      0  Person has driver's license with age less than 16
H5       445      0  Person stated in the diary form that s/he drove, but s/he is not a licensed driver in the household form
H6       440      0  Person does not have a driver's license but in the household form stated that s/he sometimes drives a car

Rule T5 finds trips that end at the same location where they started. This is usually a sign of episode misreporting (see example in Figure 8). Such records were fixed accordingly. The remaining 85 cases correspond to people who went for a walk or took a dog for a walk. There were no additional activities on their way, so these records are considered correct and remain unchanged.

Rules B8 and B4 found records with missing location or travel information. In some cases we succeeded in filling in the missing information based on similar records in the rest of the diary, in the diary of the other day of the same respondent, or in the diaries of the rest of her/his family.

Rule H3 finds households for which the number of received diaries does not correspond to the number of people stated in the questionnaire. We do not have complete information about the entire household in these cases. We kept these households in the data set, even though their exclusion can be considered for some activity-based models.

The flags in rules H2, H5, and H6 were caused mostly by missing data. Often, people did not indicate whether or not they have a driver's license. These records were fixed since we had the information that these people actually drove.

The data rectification for rules T3 and T4 was still ongoing at the writing of this paper. All the remaining flags correspond to typographical errors, which cause a problem when comparing two strings but do not imply a missing trip or activity. In our analysis we aimed to fix first the records that actually correspond to a problem in the activity patterns. The rectification for these rules will be completed by the time of the TRB meeting.

CONCLUSIONS AND RECOMMENDATIONS

In this paper, we aimed to provide general guidance for the collection and cleaning of activity-based surveys.
It is based on our practical experience. We emphasized issues that anybody working on a similar project is likely to face. The process starts with receiving completed surveys from the respondents. These surveys contain many inconsistencies and errors. The framework describes the data entry process as well as the process that eliminates these inconsistencies. We described the entire process, the problems we faced, and suggested solutions. We succeeded in finding inconsistencies in the data and significantly reducing their number in the resulting database. The final data set is ready to be used in transportation planning models, including those that are activity-based. The major objective of this process, minimizing the discrepancies between the observed patterns and the patterns recorded in the final data set, was also reached. All changes made to the data set were based on expert knowledge, so no automated routine biased the modifications of the data set.

However, we identified several improvements to the proposed framework during our work on the project. These improvements have been implemented in the current stage of our data entry process, Lottery II, and are strongly recommended for future applications of similar systems in practice. They are meant to decrease the number of problems in the received surveys and to reduce the time required for the entire process of data entry and data cleaning. In addition to the many details of the procedure followed here (e.g., thinking of data in a hierarchical format, development of explicit rules and recording of all modifications, attention to and comparison of records within a person's time use pattern, and comparisons with other household members' time use patterns), some additional major changes and recommendations are listed below.

1. Preprocessing of the received diaries by an expert

The rather large number of flags in Table 8 suggests that the amount of time spent on cleaning the data was large. The biggest problem was the treatment of episode misreporting, in which case new records had to be entered and usually a large portion of the diary modified. For this reason we suggest introducing a preprocessing phase. After the diaries are received, they should be inspected by an expert, who would clearly mark any necessary changes before the diaries are entered into the data set. It takes considerably less time to check the diaries before entry than to look for these problems and try to fix them afterwards. Preprocessing also helps to find additional problems that can be treated before data entry. Looking for these problems on the paper form is easier, since one can readily observe the entire time use pattern of a person and of the household. The proposed automated rules should still be used for verification and as a tool to find any additional problems.

2. An improved example of activity diary sent to respondents

In the motivational part of this paper we emphasized the importance of an introductory letter as well as a good example. Although we focused on this problem, further improvement can be introduced. For this reason, in Lottery II we insert an extra sheet with an example in the survey materials that are mailed to the respondents. In this example we additionally provide a graphical representation of a daily pattern in order to improve respondents' understanding of the required format. Each activity is represented as a node and each trip as a link between the nodes. The correctly filled diary that corresponds to this pattern is provided as well. Since this example is provided on a separate sheet of paper inserted in the package, it can easily be used when the respondents are filling in their diaries.
They can have this sheet in front of them the entire time and do not have to browse through the form to find this particular example.

3. Enter home and work location (and other known places) into the data set prior to the entry of diaries

In the diaries, we asked the respondents to record their home, work, school, or other often repeated locations. Having this information in the data set reduces the number of typographical errors, increases the speed of data entry, and also eliminates the step of assigning home locations. For this reason, prior to entering the diaries for each household, the frequently repeated locations will be entered in the data set. Consequently, while entering particular activities in the diaries, the coder can directly choose one of the previously entered locations, which avoids retyping the address and so reduces the probability of typographical errors.

To conclude, the final data set is rather consistent if we take into consideration the complexity of activity-travel diaries. The proposed framework ensures that the modifications of the originally reported diaries are minimal. In most cases the original answers are kept together with newly recoded variables. As an example we can mention the types of activities assigned to particular records. Only a limited number of activity types is needed for calibration of the CentreSIM model. However, since we keep the original answers, the activity types can easily be modified to fit the needs of any other activity-based model. This implies that the final data set is versatile and consistent at the same time, which was one of the objectives of the developed process. We succeeded in finding episode misreporting (as described in the introduction to this paper), which is essential since it affects the number of reported activities and trips and their sequencing, which are important elements in activity-based approaches to travel demand modeling.
ACKNOWLEDGEMENTS/DISCLAIMERS

Funding for this paper was provided by the federally funded Mid-Atlantic Universities Transportation Center (MAUTC) and the Center for Intelligent Transportation Systems (CITRANS) at the Pennsylvania State University. Partial funding for the CentreSIM survey is provided by the Pennsylvania Department of Transportation (PennDOT) through a contract with McCormick Taylor and Associates (MTA). The survey was conducted at the Transportation Survey Research Center at the Pennsylvania Transportation Institute (PTI) by a team that includes Mark Hallinan, James Lee, Devani Perera, Aviroop Mukherjee, Julie Whitt, and Brian Hoffheins. Their dedication is gratefully acknowledged. Li Guan at PTI programmed the database entry software for the CentreSIM survey. The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the data presented herein. The contents do not necessarily reflect the official views or policies of the Commonwealth of Pennsylvania at the time of publication. This paper does not constitute a standard, specification, or regulation of the Pennsylvania Department of Transportation.

REFERENCES

Arentze T.A., F. Hofman, N. Kalfs, and H.J.P. Timmermans (1999) (SYLVIA) System for Logical Verification and Inference of Activity Diaries. Transportation Research Record, 1660, pp. 156-163.

Arentze T. and H. Timmermans (2000) Albatross: A Learning Based Transportation Oriented Simulation System. European Institute of Retailing and Services Studies, Technical University of Eindhoven, Eindhoven, NL.

Axhausen K.W. (1995) Draft - Travel Diaries: An Annotated Catalogue (2nd edition). http://www.fhwa.dot.gov/ohim/trb/reports.htm accessed June 2003.

Becker G.S. (1976) The Economic Approach to Human Behavior. The University of Chicago Press, Chicago, IL.

Bhat C. and F.S. Koppelman (1999) Activity-based modeling of travel demand. In Handbook of Transportation Science (ed. R.W. Hall), Kluwer, Boston, MA, pp. 35-61.

Chapin F.S. Jr. (1974) Human Activity Patterns in the City: Things People Do in Time and Space. Wiley, New York, NY.

Golledge R.G. and R.J. Stimson (1997) Spatial Behavior: A Geographic Perspective. The Guilford Press, New York, NY.

Goulias K.G. and T. Kim (2003) Analysis of the Puget Sound Transportation Panel Survey Database in Waves 1-9. Draft Final Report. Submitted to the Puget Sound Regional Council, Seattle, WA.

Hagerstrand T. (1970) What about people in regional science? Papers of the Regional Science Association, 10, pp. 7-21.

Jones P., F. Koppelman, and J. Orfeuil (1990) Activity analysis: State-of-the-art and future directions. In Developments in Dynamic and Activity-Based Approaches to Travel Analysis. A compendium of papers from the 1989 Oxford Conference (ed. P. Jones). Avebury, UK, pp. 34-55.

Kitamura R. (1988) An evaluation of activity-based travel analysis. Transportation, 15, pp. 9-34.

Kuhnau J. and K.G. Goulias (2003) Centre SIM: First-generation Model Design, Pragmatic Implementation, and Scenarios. Chapter 15 in Transportation Systems Planning: Methods and Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 16-1 to 16-14.

Manheim M.L. (1980) Fundamentals of Transport System Analysis, Vol. 1: Basic Concepts. MIT Press, Boston, Massachusetts.

McNally M.G. (2000) The Activity-based Approach. In Handbook of Transport Modelling (eds. D.A. Hensher and K.J. Button). Pergamon, Amsterdam, NL, pp. 113-128.

Patten M.L. and K.G. Goulias (2003) Integrated Survey Design for a Household Activity-Travel Survey in Centre County, Pennsylvania. Paper submitted for presentation at the 83rd Annual Transportation Research Board Meeting, and publication in the Transportation Research Record, Washington, D.C., January 11-15, 2004.

Pendyala R. (2003) Time use and travel behavior in space and time. Chapter 2 in Transportation Systems Planning: Methods and Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 2-1 to 2-37.

Richardson A.J., E.S. Ampt, and A.H. Meyburg (1995) Survey Methods for Transport Planning. Eucalyptus Press, Melbourne.

Stecher C., S. Bricka, and L. Goldenberg (1996) Travel Behavior Survey Data Collection Instruments. In Conference Proceedings 10: Conference on Household Travel Surveys: New Concepts and Research Needs, TRB, National Research Council, Washington, D.C., pp. 154-169.

Stopher P. and P. Jones, eds. (2003) Transport Survey Quality and Innovation. Pergamon, Amsterdam, NL.

Transportation Research Board (2000) Transport Surveys: Raising the Standard. Proceedings of an International Conference on Transport Survey Quality and Innovation, May 24-30, 1997, Grainau, Germany. Transportation Research Circular, Number E-C008, TRB, National Research Council, Washington, D.C.