CentreSIM - Department of Geography

CENTRE COUNTY SIMULATION (CentreSIM)
Data Collection
Konstadinos G. Goulias
March 2006
UCSB
Draft Notes for GEOG 211A-B-C
PREFACE
Assessing air quality of a region requires travel demand models that can produce reliable
hour-by-hour estimates of mobile source emissions. The widely available regional
simulation models are not precise enough, and necessity motivates their misuse by practitioners.
Regional simulation models aiming at improved air quality assessments and more detailed
transportation alternatives evaluations are currently created and tested in Australia, Europe,
Japan, and the United States using a variety of theories, decision-making formalisms, and
operational implementation methods. On one hand, these relatively new conceptualizations and
models of transport systems have improved in a substantial way the realism of computerized
decision support tools and have the potential of improving quantification of environmental
impacts and transport management/control strategies. On the other hand, however, these systems
require detailed data and understanding about behavior that very often are not readily available.
This motivates the research reported here.
In this project first a comparative overview of conceptual designs, data requirements, and models
used in computer simulation of regional transport systems was completed in 1997 to aid the
creation of Access Management Impact Simulation – a regional simulation approach using
Geographic Information Systems to predict traffic impacts of individual business establishments.
Then, the basic ingredients of a larger model system were defined and a first pilot study was
completed with the University Park campus as the feasibility test site. These two experiences
motivated the creation of a framework for model development called Longitudinal Integrated
Forecasting Environment (LIFE) that contains a demographic simulator, a daily time allocation
and travel scheduling system. The computational platform for the system is a Geographic
Information System in which statistical models of travel behavior are embedded. Figure 1
contains a summary of the different contributions to LIFE.
One component in this framework called CentreSIM emphasizes the spatial and temporal
dimensions of travel demand and produces hour-by-hour maps of activity participation and travel
in Centre County, Pennsylvania. The predicted traffic volumes are then validated using observed
traffic data. Model development for CentreSIM is organized in cycles of incremental model
improvement (each cycle contains a sequence of versions). The first cycle, which
started in 1997 and ended in May 2002, produced a first version used by Penn State for its Master
Plan activities and the currently implemented changes in parking and campus circulation at
University Park. The second version used by JoNette Kuhnau for her MS thesis expanded the
campus circulation model to the entire Centre County and identified critical model design
deficiencies and required improvements to expand and validate the model to the entire county.
The second cycle of model development was funded by the Pennsylvania Department of
Transportation, through the South Central Centre County Transportation Study (SCCCTS)
managed by McCormick Taylor Associates (MTA), and by the US Department of Transportation
through its Mid-Atlantic Universities Transportation Center. It uses data from a large survey in
Centre County (the survey started in November 2002 and ended in May 2003) and
produced a first version of an improved model in the Spring of 2003 that was used to study the
SCCCTS area alternatives. This produced
CentreSIM cycle 2 version 1. In Spring 2004 a second version that advances the state-of-the-art
in modeling and simulation incorporating synthetic schedule generation for a day and for each
person in the County was created by Pribyl in his Ph.D. dissertation.
Work on model building ended in 2004 and research switched to data analysis for active living
and the health impacts of transportation (Shaunna Burbidge used the data for her MA thesis at
UCSB) and to more theoretical aspects on altruism and travel behavior.
[Figure 1 is a diagram; the text of its boxes follows, one box per line.]
Work in 2005-2006: Active Living; Altruism
CentreSIM – version 3.0 – Activity-based microsimulator predicting demand at hour-by-hour, parcel-by-parcel, and household-by-household resolution levels
CENTRESIM PROJECT
Synthetic schedule generation using CentreSIM activity data to create models reflecting intra-household interactions and producing travel demand estimates by time of day
Telecommunication-travel interaction equations to use in CentreSIM
Tae-Gyu Kim – CentreSIM project
ONDREJ PRIBYL PHD DISSERTATION
Combine new network from Eom’s model with a new Kuhnau model estimate using CentreSIM survey data – create scenarios for PennDOT SCCCTS study
MICHAEL ZEKKOS – MS THESIS
Activity survey and large-scale data collection to support new model generation. New network compatible with other applications by consultants in Centre County
Develop a combined passenger/truck traffic prediction model, validate, and study the effect of different network precision levels
Formulate ideas and test models of travel–telecommunications interaction in Centre County
Tae-Gyu Kim and Ondrej Pribyl – variety of projects
Systems of equations modeling the travel-telecommunications interactions and study the effects of information provision on travel behavior in PSTP – lessons learned for Centre County
JINKI EOM – MS THESIS
CENTRESIM PROJECT
TAE-GYU KIM PH.D. DISSERTATION
CentreSIM – version 2.0 – Expand the method to entire Centre County, identify and rectify problems with network and employment data, demonstrate the use of the model for policy impact assessment
First version of models that reflect interaction between telecommunications and travel – guidance for CentreSIM data collection
JONETTE KUHNAU MS THESIS
TAE-GYU KIM RESEARCH
CentreSIM – version 1.0. Update of network and buildings information, more detailed roadways for on-campus circulation. Identification of issues for expanding method to entire county. Creation of scenarios and demonstration of application – evidence of practical use.
Second round of data consolidation and econometric model definition of the Puget Sound Transportation Panel – lessons learned for an activity-based survey for Centre County
PROJECT FOR OPP AND GRADUATE COURSE PLANNING & OPERATIONS
Feasibility of truck traffic forecasting and experiments with network resolution
John Marker, Jr MS Thesis
Demographic microsimulation DEMOS version 1.0 object oriented program in C++
Gopal Mandava MS Thesis
PUGET SOUND REGIONAL COUNCIL PROJ.
DEMOS 2000 and lessons learned for a Centre County application – new program in C++
ASHOK SUNDARARAJAN MS THESIS
Second generation Centre County application using Windows based TRANSCAD with emphasis on evacuation of PSU campus – first activity-based survey to collect data by email – first version of the time of day model for activity and travel
Econometric approach to activity scheduling and data needs definition using as pilot the Puget Sound Transportation panel –
SAJJAD ALAM – MS THESIS
JUNE MA PH.D. DISSERTATION
Centre County AMIS – first network version, feasibility, and identification of issues –
Demographic simulator for Centre County using Fortran and lessons learned for next version
Centre County first complete network – demonstration of GIS accessibility capabilities
Ming-Sheng Lee MS thesis
JIN CHUNG PH.D. DISSERTATION
Data assembly and initial consolidation of information needed to develop an activity-based approach to travel demand forecasting using travel surveys
FHWA PROJECT
Access Management Impact Simulation – Erie County – Experimentation with GIS TRANSCAD and interface design with TRAF NETSIM - DOS software
PENNDOT PROJECTS
Figure 1 Overview of LIFE and CentreSIM data collection and modeling work at PSU
Within CentreSIM a household survey that is part of the six municipality (College, Harris,
Spring, Benner and Potter Townships and Centre Hall Borough) study known as the South
Central Centre County Transportation Study (SCCCTS) area study was conducted by the
Pennsylvania Transportation Institute. The household survey covered the entire Centre County
and also included residents that work in Centre County and reside elsewhere. The data from
this survey filled a critical gap in knowledge about the County and provided the necessary
information to develop models that can be used in regional simulation software necessary for
forecasting and for alternatives exploration. Each participating household was asked to provide,
on a voluntary basis, information about household composition and facilities available to the
household members. In addition, each household member was asked to provide personal information
such as employment, driving ability, education and so forth. Activity and travel data were also collected
from each person in the household using a two-day complete record of the activities in which
each person engaged and the different transportation options taken. The survey also included a few
questions about opinions and perceptions regarding the Centre County transportation system.
To maximize participation in this survey a new method has been created by staff at the
Transportation Survey Center, which was a unit within the Center for Intelligent Transportation
Systems at Penn State University until 2004. This method is based on recent positive experience
with surveys for the Pennsylvania Turnpike and the Pennsylvania Department of Transportation
and experience in state-of-the-art survey design techniques used in Europe, Australia, and the
US. A typical procedure followed in the survey contains a first stage of contact and introduction
to the study, mailing of survey material, and a series of reminders and clarifications. The entire
process culminates with post-survey thank you letters, summary of the findings to the
respondents, and a gift certificate reward structure.
The next section describes the survey process. A second section describes an automated way to
clean the data.
Patten and Goulias
1
INTEGRATED SURVEY DESIGN FOR A HOUSEHOLD ACTIVITY-TRAVEL
SURVEY IN CENTRE COUNTY, PENNSYLVANIA
Michael L. Patten & Konstadinos G. Goulias, Ph.D.
ABSTRACT:
In this paper we outline an integrated method for conducting a household activity-travel survey
that was used successfully in a household survey of Centre County, Pennsylvania. This method
incorporates elements of the Dillman-tailored method and the Socialdata-KONTIV design with
concepts developed by the Penn State study team. The procedure followed in the survey consists
of a first stage of contact and introduction to the study, mailing of survey materials, and a series
of reminders and clarifications. The entire process culminates with post-survey thank you
letters, summary of the findings to the respondents, and a gift certificate reward structure. Our
experiments with the Dillman-tailored method and the Socialdata-KONTIV design and survey
administration are extremely encouraging and a testimony to the good design advocated by both
traditions. The survey experienced a response rate of 38.8 percent for the household
questionnaire portion of the survey and 28.5 percent for the diaries. Within each of these two
response rates, however, we find specific segments and sequences of actions during survey
administration that yield response rates as high as 68.8 percent. This confirms the suggestions
offered by researchers in Europe and Australia that the requirements for surveys in the U.S.
should change.
INTRODUCTION
Collecting data via surveys, especially by mail, is a complex and expensive (in both time and
money) undertaking. In order to maximize the return on the limited resources usually available
to researchers, highly detailed systems have been developed to encourage participation in
surveys. For example, Dillman (1) has developed the Tailored Design Method (originally called
the Total Design Method) based on concepts of social exchange theory. Surveys utilizing
Dillman’s process should be designed around the following three key elements:
1. Establish trust with the respondent (e.g., provide a token of appreciation in advance,
make the task appear important, show sponsorship by legitimate authority);
2. Increase the respondents’ expectation of receiving a reward from participation (e.g.,
show positive regard, say thank you, give tangible rewards, make questionnaires
interesting); and
3. Reduce the social costs to the respondent (e.g., avoid subordinating language, make
the questionnaire short and easy, and minimize requests for personal information).
In the realm of transportation-related surveys, several design concepts have been
developed incorporating ideas similar to Dillman’s as well as the experience of the researchers
involved. Brög, in his Socialdata-KONTIV Design (2), begins with the premise that in good
survey design “the researchers must adjust to the respondents, not the respondents to the
researchers.” The Socialdata-KONTIV design uses questionnaires and activity diaries that are
simple in design and layout, minimize involved definitions and instructions, and stress the
collection of “complete information instead of formally correct” information. The Socialdata-KONTIV
design also incorporates a high level of contact with the respondents through multiple
mail and telephone contacts. This design, however, contains a travel diary instead of the more
recent activity diaries that we need for the application in our survey. Activity surveys require a
somewhat different design because of the additional respondent burden and the sensitivity of the
questions.
Additional recent research has found that the day-planner booklet format is extremely
useful for time-use and activity diaries (3,4,5). The reasons reported for this include: flexibility,
simplicity of completion, greater detail of the answers provided, and user-friendliness.
To summarize, a good design for a household activity-travel survey should incorporate
the following concepts:
• The researcher should design the survey for the respondents;
• Survey instruments should be written in simple language and be easy to understand
and complete;
• The research should establish trust with the respondents;
• Respondents should receive a reward from participation;
• The costs to the respondent should be minimized; and
• The day-planner booklet format is the most useful design for activity diaries.
In this paper we discuss a new method to maximize participation in travel-related surveys
created by a team at the Transportation Survey Center, which is a unit within the Mid-Atlantic
Universities Transportation Center at Penn State University. This method is based on the
concepts discussed above and recent positive experiences with surveys for the Pennsylvania
Turnpike Commission and the Pennsylvania Department of Transportation.
The procedure followed in the survey consists of a first stage of contact and introduction
to the study, mailing of survey materials, and a series of reminders and clarifications. The entire
process culminates with post-survey thank you letters, summary of the findings to the
respondents, and a gift certificate reward structure.
THE STUDY
Between November 23, 2002 and May 30, 2003, the Penn State study team conducted a
survey of Centre County, Pennsylvania residents to collect data about the county’s households
and the activities of the household members. The survey was designed to meet the data needs of a
multiyear model building research effort called CentreSIM (6), to provide data for long range
planning by local agencies, and to aid a Pennsylvania Department of Transportation (PennDOT)
study known as the South Central Centre County Transportation Study (SCCCTS) area study.
The household activity-travel survey covered the entire Centre County and also included
residents that work in Centre County and reside elsewhere. The data from this survey will fill a
critical gap in knowledge about Centre County and will provide the necessary information to
develop models that can be used in regional simulation software necessary for forecasting and
for alternatives exploration.
Each participating household was asked to provide, on a volunteer basis, information
about household composition and facilities available to the household members. In addition,
each household member was also asked to provide personal information such as employment
status, educational level, and typical mode of travel to work or school. Activity and travel data
were also collected from each person in the household using a two-day complete record of the
activities in which each person engaged and the different transportation options taken. The
survey also included a few questions about opinions and perceptions regarding the Centre
County transportation system.
The Study Area
The survey was conducted in Centre County, Pennsylvania. This county, with an estimated 2001
population of 135,940 (7), is located in the geographic center of Pennsylvania. Centre County is
predominately rural although State College borough and its adjacent areas have experienced a
significant amount of urbanization. The University Park Campus of The Pennsylvania State
University with more than 41,000 students and 11,000 faculty and staff (8) is also located in
Centre County. The primary mode of travel in the county is motor vehicle although State
College and the Penn State campus are well served by public transportation. The area
immediately surrounding the Penn State campus also experiences a high level of foot and bicycle
traffic.
SURVEY MATERIALS
The materials for the survey were divided into two components. The first component included
those related to the household information survey (household questionnaire) and the second
those related to the activity-travel diaries. Each group of materials is described below.
Household Survey Materials
The materials included in the household survey were:
• A cover letter describing the project and the purpose of the survey. It also provided a
point of contact for additional information about the survey.
• A “Project Synopsis and Informed Consent Form,” as required by federal regulations,
providing a point of contact for questions about the survey, describing the purpose of
the survey, explaining any risks and/or benefits of participation, describing the
confidentiality procedures, and indicating that the survey is voluntary.
• The questionnaire (survey instrument) used to collect the data necessary for the
study. It was in booklet form and designed to minimize the effort required on the part
of the respondent. The majority of questions were “close-ended.” Figure 1 displays
example pages from the questionnaire.
• A business reply envelope to facilitate return of the completed survey forms.
• A contest flyer describing the lottery (see below) and a contest entry card.
Figure 1 about here.
The questionnaire design paid strict attention to providing a clear, easy to read format
that minimized the effort to complete it. For example, we used a large type font (14 point) and
incorporated much empty space. We also utilized a vertical flow for the question layout. The
questionnaire also requested the first names, ages, and occupations of each household member.
This information was used to personalize the activity-travel diaries for each member.
Activity-Travel Diaries
The materials included with the activity-travel diaries were:
• A cover letter describing the project and the purpose of the survey. It also provided a
point of contact for additional information about the survey.
• A “Project Synopsis and Informed Consent Form,” as required by federal regulations,
providing a point of contact for questions about the survey, describing the purpose of
the survey, explaining any risks and/or benefits of participation, describing the
confidentiality procedures, and indicating that the survey is voluntary.
• Two personalized activity-travel diaries for each household member. The diaries
were for two consecutive days. Figure 2 displays an example of the diary format.
• A business reply envelope to facilitate return of the completed survey forms.
• A contest flyer describing the lottery (see below) and a contest entry card.
Figure 2 about here.
As shown in figure 2, the activity diaries utilized a modified day-planner format. The
respondents were free to provide whatever level of detail they felt necessary. A key component of
our method is the personalization of the diaries for each household member. The
personalization consisted of two items. First, each diary has the name of the appropriate
household member on the cover (figure 3). Additionally, we used four different diaries
depending on the reported employment status of the members as follows:
• Employed (full- or part-time) outside the home–White cover
• Not employed outside the home–Peach cover
• Children age 18 and under–Blue cover
• University students–Blue cover
The four diaries were identical except for the color of the cover and an example diary that
would be relevant to the subject’s demographic group. For example, the “employed” diary
example shows an individual traveling to work, working throughout the day with a lunch break,
traveling home, and engaging in some other activities in the evening. The “university student”
diary example, on the other hand, shows attendance at classes, study time, and work at a part-time
job, and the “child” diary shows time at school and participation in an athletic activity. Figure 3
displays an example of a diary cover with personalization and figure 4 a portion of the
“employed” diary example (Note: This is also fairly representative of the information contained
in the returned completed diaries.).
Figure 3 about here.
Figure 4 about here.
THE LOTTERIES
As noted in the introduction, we incorporated a reward structure to encourage responses. We
utilized two different reward structures each of which provided total prizes of $1,000 in gift
certificates redeemable at the local shopping mall. During the remainder of this paper Lottery
One refers to the period November 23, 2002 to March 6, 2003 and Lottery Two refers to the
period March 7 to May 31, 2003. In both lottery pools a household was eligible to win a prize if
it returned a completed household questionnaire, completed diaries, and a contest entry card.
The winners were randomly drawn from all the eligible households for that pool.
Lottery One: 2002
The first lottery was used during the first survey period of November 23, 2002 to March 6, 2003.
A total of $1,000 in gift certificates was awarded as follows:
• 1 Grand Prize of $500 in gift certificates
• 2 Second Prizes each for $150 in gift certificates
• 4 Third Prizes each for $50 in gift certificates
Lottery Two: 2003
The second lottery was used during the second survey period of March 7 to May 31, 2003. A
total of $1,000 in gift certificates was awarded as follows:
• 4 First Prizes each for $150 in gift certificates
• 8 Second Prizes each for $50 in gift certificates
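The two prize structures and the random draw can be sketched in a few lines of Python. This is a minimal illustration, not the study team's procedure: the household identifiers are hypothetical, and the assumption that no household can win more than one prize is ours, since the text does not say either way.

```python
import random

# Prize structures as stated in the text: (number of prizes, value in dollars).
LOTTERY_ONE = [(1, 500), (2, 150), (4, 50)]
LOTTERY_TWO = [(4, 150), (8, 50)]


def total_value(structure):
    """Sum of all gift certificate values in a prize structure."""
    return sum(count * value for count, value in structure)


def draw_winners(eligible_households, structure, seed=None):
    """Randomly assign prizes to distinct eligible households.

    eligible_households is a hypothetical list of IDs for households that
    returned a questionnaire, completed diaries, and a contest entry card.
    Assumption: a household can win at most one prize.
    """
    rng = random.Random(seed)
    prizes = [value for count, value in structure for _ in range(count)]
    winners = rng.sample(eligible_households, len(prizes))
    return list(zip(winners, prizes))
```

Both structures sum to the stated $1,000; Lottery One concentrates it in 7 prizes, while Lottery Two spreads it over 12 smaller ones.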
SAMPLE SELECTION
The sample for the survey was drawn from several pools. The first was a database of 46,448
household addresses in Centre County purchased from a commercial mailing list vendor in early
October 2002. This list provided the name of the current resident, the complete mailing address
and in many cases the telephone number. This list has, however, two weaknesses. First, it does
not include households that have made formal requests to be removed from mailing lists. This
weakness had to be accepted since there is no other legal way to gather these addresses.
The second weakness results from the highly transient nature of the 40,000 students
attending the University Park campus of Penn State University. The study team was able to
alleviate this problem by using student address lists available through Penn State. The following
three address lists of students were acquired: students residing in on-campus housing, students
living off-campus, and students living in Penn State operated family housing.
In addition to the above mailing lists, a fifth one was obtained from Penn State. This list
contained University Park Campus employees of Penn State who reside outside of Centre
County. It was important to include members of this group in the sample since they commute
longer distances to work.
We randomly selected a sample from each pool. There was not enough information to
ensure that the sample units selected were representative of the Centre County residents. Table 1
shows the size of each pool and the sample selected from each.
Table 1. Centre County Activity-Travel Survey: Subject selection.
Subject Pool                                          Size of Pool  Number Selected  Percent of Sample
Purchased Database                                          46,448            6,700              68.8%
Penn State Students Residing On-Campus                      12,714            1,200              12.3%
Penn State Students Residing Off-Campus                     17,942            1,200              12.3%
Penn State Students Residing in PSU Family Housing             402              140               1.4%
Penn State Employees Residing Outside Centre County          1,464              507               5.2%
Totals                                                      78,970            9,747             100.0%
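As a rough illustration of drawing an independent random sample from each pool, the following Python sketch uses the pool sizes and selection counts from Table 1. The integer record indices stand in for the real address records, and the function names are hypothetical; the study team's actual selection tool is not described in the text.

```python
import random

# (size of pool, number selected), taken from Table 1.
POOLS = {
    "Purchased Database": (46448, 6700),
    "Penn State Students Residing On-Campus": (12714, 1200),
    "Penn State Students Residing Off-Campus": (17942, 1200),
    "Penn State Students Residing in PSU Family Housing": (402, 140),
    "Penn State Employees Residing Outside Centre County": (1464, 507),
}


def draw_sample(pools, seed=None):
    """Simple random sample without replacement, drawn separately per pool."""
    rng = random.Random(seed)
    sample = {}
    for name, (pool_size, n_selected) in pools.items():
        # Indices 0..pool_size-1 are placeholders for address records.
        sample[name] = rng.sample(range(pool_size), n_selected)
    return sample


def sampling_fractions(pools):
    """Share of each pool that was selected (selected / pool size)."""
    return {name: n / size for name, (size, n) in pools.items()}
```

The sampling fractions differ widely across pools (for example, roughly a third of the family housing and employee pools versus under a tenth of the on-campus students), so any pooled analysis would need weights.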
Subject Pool for Lottery One: 2002
The subject pool for Lottery One comprised the 6,700 households selected from the
purchased database. Initially, as described below, the study team mailed survey materials to the
1,478 households for which no telephone number was available. The remaining 5,222 were to be
contacted by telephone. Of these, 821 were contacted by telephone. The
remaining 4,401 households were not contacted during the Lottery One portion of the survey.
Subject Pool for Lottery Two: 2003
The subject pool for Lottery Two included the 4,401 households remaining from Lottery One
plus the two groups of Penn State students, the households residing in the Penn State family
housing, and the Penn State employees. In total, 7,447 households were included in this phase.
THE SURVEY PROCESS
The study team recruited households via two mechanisms: telephone and mail. The processes
used for each are outlined below.
Phone Recruiting
The study team called households for which there was a telephone number available. If the call
was answered the team member asked to speak with the head-of-the-house or another
responsible adult. The purpose of the study and survey was explained and participation
requested. If the household declined to participate they were removed from the respondent pool.
If they agreed to participate, they were asked to provide the names, ages, and employment status
of all members of the household. This information was used to produce the personalized diaries.
The respondent was informed of the days on which they would participate and their mailing
address verified. The appropriate diary materials were produced and mailed the next day.
Those numbers with no answer were rescheduled to be called at a later date. When an
answering machine was reached, an appropriate message was left noting the reason for the call
and that the study team would attempt to contact them at a later date. If a number was called
three times without actually talking to a household member it was dropped from the pool.
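The call-handling rules above amount to a small decision procedure, sketched below in Python. The outcome labels and return values are hypothetical names chosen for illustration; only the three-call limit and the agree/decline handling come from the text.

```python
MAX_ATTEMPTS = 3  # stated in the text: dropped after three calls without contact

NO_CONTACT = {"no_answer", "answering_machine"}


def recruit_by_phone(call_outcomes):
    """Return the disposition of one household given its successive call outcomes.

    Each outcome is one of the hypothetical labels "agreed", "declined",
    "no_answer", or "answering_machine".
    """
    attempts = 0
    for outcome in call_outcomes:
        attempts += 1
        if outcome == "agreed":
            return "participant"   # collect names, ages, employment status
        if outcome == "declined":
            return "removed"       # removed from the respondent pool
        if outcome not in NO_CONTACT:
            raise ValueError(f"unknown call outcome: {outcome!r}")
        if attempts >= MAX_ATTEMPTS:
            return "dropped"       # three calls without talking to a member
    return "pending"               # reschedule another call
```

For example, a household reached on the third call after two unanswered attempts is still recruited, while three unanswered attempts in a row drop it from the pool.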
Mail Recruiting
Mail recruiting in both lotteries was done as follows:
• Advance notice of the survey was sent by mail one week prior to the survey;
• Questionnaire packet was sent by mail (main mailing);
• A reminder letter was sent to the entire sample one week after the main mailing; and
• A reminder letter including a complete survey packet was mailed to all nonrespondents. For Lottery One it was mailed eight weeks after the main mailing and
for Lottery Two four weeks after.
When a household returned its questionnaire, it was added to the pool to receive
activity diaries. During both survey periods activity-travel diaries were mailed to each
household as it returned a completed household questionnaire. The survey dates for the diaries
were seven days after they were mailed. On a typical day, diaries were mailed to approximately
40 households, broken down as follows: 25 to the purchased group, 5 each to the on- and
off-campus students, 1 to the Penn State housing, and 9 to the Penn State employees. No
follow-ups were made for the diaries.
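The mailing sequence can be expressed as date offsets from the main mailing. The sketch below is an illustration under two assumptions that the text does not confirm: the advance notice is dated one week before the main mailing, and the first of the two consecutive diary days falls seven days after the diary packet is mailed. All function and key names are hypothetical.

```python
from datetime import date, timedelta

# Weeks between the main mailing and the second full packet, per the text.
FOLLOW_UP_WEEKS = {"Lottery One": 8, "Lottery Two": 4}


def mailing_schedule(main_mailing, lottery):
    """Dates of the four mail contacts for one household."""
    return {
        # Assumption: "one week prior" is measured from the main mailing.
        "advance_notice": main_mailing - timedelta(weeks=1),
        "main_mailing": main_mailing,
        "reminder_letter": main_mailing + timedelta(weeks=1),
        "replacement_packet": main_mailing
        + timedelta(weeks=FOLLOW_UP_WEEKS[lottery]),
    }


def diary_days(diary_mailed):
    """Two consecutive diary days; the first is seven days after mailing."""
    first = diary_mailed + timedelta(days=7)
    return first, first + timedelta(days=1)
```

Calling `mailing_schedule(date(2003, 3, 7), "Lottery Two")` with a hypothetical mailing date places the reminder on March 14 and the replacement packet on April 4.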
DATA ENTRY
Activity surveys generate an extremely large amount of data. Because of this, it is important
to minimize the burden on the personnel responsible for data entry. It is also important to
provide a data entry process that will minimize the number of data entry errors. For this study,
the Survey Center developed an integrated database with an interface that closely resembled the
questionnaire and diary formats. A comparison of figures 5 (questionnaire) and 6 (diary) to
figures 1 and 2 shows that this was very successful. The incorporation of “pull-downs” for many
of the fields allowed for quick and accurate entry of repeated data such as home addresses and
activities such as sleeping and eating.
Figure 5 about here.
Figure 6 about here.
RESPONSE RATES
The two lotteries experienced very different levels of participation. The telephone recruiting did
not prove to be very successful, although, as can be expected, households that agreed by
telephone to participate responded at a much higher rate. Lottery One had a much higher
response rate than Lottery Two.
Lottery One: 2002
As noted earlier, during the first survey lottery (November 2002 to March 2003) subjects were
recruited both by telephone and mail.
Telephone Recruiting
Study team members contacted 821 households via telephone to request participation in the
study. Of these, only 190 agreed to participate. The other 631 households refused participation,
did not have in-service telephone numbers, or were called three times without speaking to a
resident. The contact rate (participants/contacts) was 23.1 percent. Telephone recruiting was
stopped on January 3, 2003.
All of the 190 households agreeing to participate were mailed survey packets as outlined
above. Of these 80 households (43.5%) returned completed, usable questionnaires and diaries.
The overall response rate for the telephone recruiting is 10.0 percent (returns/contacts).
Mail Recruiting
In Lottery One 1,478 households were contacted by mail. Of these, 422 households were dropped
as undeliverable yielding a net mailing of 1,056. A total of 647 household questionnaires were
returned yielding a 61.3 percent response rate.
Of the 647 households returning household questionnaires 568 were mailed activity-travel
diaries. Time constraints prevented inclusion of all responding households. Sixty-six
were dropped as undeliverable (these packets were returned by the U.S. Post Office without a
valid forwarding address in Centre County) yielding a net mailing of 502. A total of 208
households returned completed diaries yielding a 41.4 percent diary response rate.
With 203 households returning usable diaries the overall response rate for the Lottery One mail
recruiting is 13.7 percent.
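The Lottery One mail figures can be reproduced with a short calculation. In this sketch the 13.7 percent overall rate is obtained by dividing usable diary returns by the gross mailing of 1,478; that choice of denominator is our inference from the numbers, not something the text states.

```python
def rate(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Household questionnaire stage.
mailed = 1478
undeliverable = 422
net_mailing = mailed - undeliverable          # 1,056
questionnaire_rate = rate(647, net_mailing)   # 61.3

# Activity-travel diary stage.
diaries_mailed = 568
diaries_undeliverable = 66
net_diaries = diaries_mailed - diaries_undeliverable  # 502
diary_rate = rate(208, net_diaries)                   # 41.4

# Overall: usable diaries over the gross mailing (inferred denominator).
overall_rate = rate(203, mailed)                      # 13.7
```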
Lottery Two: 2003
Lottery Two was conducted entirely by mail. The response rates for this phase of the study are
reported below with details for each sub-set of the sample.
Household Questionnaire
Overall, questionnaires were mailed to 7,447 households. Of these, 617 households were
dropped as undeliverable yielding a net mailing of 6,830. A total of 2,414 household
questionnaires were returned yielding a 35.3 percent response rate. Table 2 displays the mailing
and response rates by sub-group.
Table 2. Lottery Two–Household questionnaire return rates.
Subject Pool                               Mailed  Dropped  Net Mailed  Number Returned  Response Rate
Purchased Database                          4,401      432       3,969            1,603          40.4%
PSU Students Residing On-Campus             1,200       27       1,173              258          22.0%
PSU Students Residing Off-Campus            1,199      129       1,070              304          28.4%
PSU Students Residing in Family Housing       140        4         136               50          36.8%
PSU Employees from Outside Centre Co.         507       25         482              199          41.3%
Totals                                      7,447      617       6,830            2,414          35.3%
Activity-Travel Diaries
Of the 2,414 households returning household questionnaires, 1,969 were mailed activity-travel
diaries. Six were dropped as undeliverable, yielding a net mailing of 1,963. A total of 494
households returned completed diaries, yielding a 25.2 percent diary response rate. Table 3
displays the mailing and response rates by sub-group. With 494 households returning usable
diaries, the overall response rate for Lottery Two is 8.9 percent.
Table 3. Lottery Two–Activity-travel diary return rate.
Subject Pool                                Mailed   Dropped   Net Mailed   Number Returned   Response Rate
Purchased Database                           1,348         5        1,343               401           29.8%
PSU Students Residing On-Campus                131         1          130                16           12.3%
PSU Students Residing Off-Campus               247         0          247                34           13.8%
PSU Students Residing in Family Housing         60         0           60                11           18.3%
PSU Employees from Outside Centre Co.          183         0          183                32           17.5%
Totals                                       1,969         6        1,963               494           25.2%
Comparison of Response Rates for the Two Lotteries
While a significantly larger amount of data was collected during Lottery Two, Lottery One
experienced a much higher response rate both for the household questionnaire and the activity
diaries. This difference may be a result of the different incentive amounts offered for each. The
respective rates for the household questionnaire were 61.3 percent returned for Lottery One and
35.3 percent for Lottery Two, yielding an overall response rate of 38.8 percent for the household
questionnaire (Table 4). For the diaries, the response rates were 41.4 percent returned for Lottery One and
25.2 percent for Lottery Two, yielding an overall response rate of 28.5 percent for the diaries (Table 5).
The overall response rate for the survey is 11.1 percent.
Table 4. Household questionnaire response rate via mail.
Lottery               Total      Dropped¹   Dropped¹   Net        Total      Return
                      Mailed     Number     Percent    Mailed     Returned   Rate
                      (A)        (B)        (B/A)      (C=A-B)    (D)        (E=D/C)
Lottery One (2002)    1,478      422        28.6%      1,056      647        61.3%
Lottery Two (2003)    7,447      617         8.3%      6,830      2,414      35.3%
Totals                8,925      1,039      11.6%      7,886      3,061      38.8%
1. Respondents dropped from the survey (e.g., Non-Deliverable, deceased, under age, etc.)
Table 5. Diary response rate via mail.
Lottery               Total      Dropped¹   Dropped¹   Net        Total      Return
                      Mailed     Number     Percent    Mailed     Returned   Rate
                      (A)        (B)        (B/A)      (C=A-B)    (D)        (E=D/C)
Lottery One (2002)    568        66         11.6%      502        208        41.4%
Lottery Two (2003)    1,969      6           0.3%      1,963      494        25.2%
Totals                2,537      72          2.8%      2,465      702        28.5%
1. Respondents dropped from the survey (e.g., Non-Deliverable, deceased, under age, etc.)
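The return rates in Tables 4 and 5 follow from the column definitions (C = A - B and E = D / C). As a check, the short Python sketch below recomputes the pooled rates from the reported counts. The function names and the dictionary layout are illustrative only and are not part of the survey software. Treating the overall survey rate as the product of the questionnaire and diary rates is an assumption, although it reproduces the 11.1 percent reported in the text.

```python
def return_rate(mailed, dropped, returned):
    """Return rate E = D / C, where C = A - B is the net mailing."""
    net_mailed = mailed - dropped
    return returned / net_mailed

# Counts (A, B, D) as reported in Tables 4 and 5.
questionnaire = {"Lottery One": (1478, 422, 647), "Lottery Two": (7447, 617, 2414)}
diary = {"Lottery One": (568, 66, 208), "Lottery Two": (1969, 6, 494)}

def pooled(table):
    """Sum the counts over both lotteries to obtain the 'Totals' row."""
    return tuple(sum(column) for column in zip(*table.values()))

q_rate = return_rate(*pooled(questionnaire))
d_rate = return_rate(*pooled(diary))
# Assumption: the overall rate is the product of the two stage rates.
overall = q_rate * d_rate

for name, rate in [("questionnaire", q_rate), ("diary", d_rate), ("overall", overall)]:
    print(f"{name}: {rate:.1%}")
```

The same product applied to the Lottery Two rates alone (35.3 percent and 25.2 percent) gives the 8.9 percent reported for that lottery.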
A review of the dates on which the questionnaires were returned seems to indicate that the follow-up mailings
do have an impact on the overall response rate. Figure 5 shows the percent of questionnaires
returned by week for each of the lotteries. In both cases, there is a spike in the number of
questionnaires returned approximately two weeks after the follow-up mailing of the complete
survey packet.
Figure 5 about here.
SYNTHESIS
In this paper we outlined an integrated method for conducting a household activity-travel survey.
This method has been used successfully in a household survey of Centre County, Pennsylvania.
The survey experienced a response rate of 38.8 percent for the household questionnaire portion
of the survey and 28.5 percent for the diaries. Within each of these two response rates, however,
we find specific segments, and sequences of actions during survey administration, that yield a
response rate as high as 68.8 percent. Our experiments
with the Dillman tailored design method and the Socialdata-KONTIV design and survey administration
are extremely encouraging and a testimony to the good design advocated by both traditions.
However, many aspects are still obscure. Two preliminary multivariate analyses were
also carried out while writing this paper to shed light on the widely differing response rates. The
first analysis examined response to the household questionnaire to understand which
households were more likely to respond. It used a non-linear regression model of the probability of
returning the questionnaire, specified at the person level because a person
is the mail or telephone recipient. The model indicates that recipients who are older (50+), home owners, married, and of medium ($30K to
$60K per year) or higher ($60K or more per year) household income are more likely to return
their household questionnaires. Mail recruitment was also (as discussed above) significantly
more successful in the household questionnaire response. Similar indications were given by
a second regression model for activity diary response. Recruiting by telephone
leads to a lower household questionnaire response rate, but a household that was recruited by
telephone and returned the household questionnaire has a higher probability of returning at
least one activity diary. In addition, the presence of children in the household functioned as an
inhibitor of travel diary response.
ISSUES FOR FURTHER RESEARCH
Several issues arose during this study that warrant further exploration. The telephone recruiting
was not very successful. The study team could not determine whether this was a result of the recruiting
method used or related to larger societal factors. Also, the large difference in the response rates
for the two lotteries is somewhat of a puzzle. The difference in the amount of the reward offered
in each lottery most likely had a significant impact on the response rate, although this study did
not collect data to determine what effect the reward levels had. Initial multivariate analysis
experiments showed that we need to explore the response rate further, but we also need to study the
completion and missing data patterns. The effect of children on response rate also requires
further study. This is particularly important for travel behavior analysis because of the effect
children have on activity and travel patterns and the risk of losing, through non-response, the
behaviors that are more interesting from a research viewpoint. All this is left for a future study that has
already begun.
ACKNOWLEDGEMENTS/DISCLAIMERS
Funding for this paper was provided by the federally funded Mid-Atlantic Universities
Transportation Center (MAUTC) and the Center for Intelligent Transportation Systems
(CITRANS) at the Pennsylvania State University. Partial funding for the CentreSIM survey is
provided by the Pennsylvania Department of Transportation (PennDOT) through a contract with
McCormick Taylor and Associates (MTA). The survey was conducted at the Transportation
Survey Research Center at the Pennsylvania Transportation Institute (PTI) by a team
that includes Mark Hallinan, James Lee, Devani Perera, Aviroop Mukherjee, Julie Whitt, and
Brian Hoffheins. Their dedication is gratefully acknowledged. Li Guan at PTI programmed the
databases for the CentreSIM survey and Tae-Gyu Kim at PTI estimated the response rate
models. Jean-Robert Micaelli and Ondrej Pribyl have also worked on data cleaning algorithms
documented elsewhere.
The contents of this paper reflect the views of the authors, who are responsible for the
facts and the accuracy of the data presented herein. The contents do not necessarily reflect the
official views or policies of the Commonwealth of Pennsylvania at the time of publication. This
paper does not constitute a standard, specification, or regulation of the Pennsylvania Department
of Transportation.
REFERENCES
1. Dillman, Don A. Mail and Internet Surveys: The Tailored Design Method. 2nd edition. John
Wiley and Sons, Inc., New York, 2000.
2. Brög, Werner. The New KONTIV Design: A Total Design for Surveys on Mobility
Behavior. Presented at the International Conference on Establishment Surveys (II), Buffalo,
New York, July 17-21, 2000.
3. Stopher, P. E. and C. G. Wilmot. Some New Approaches to Designing Household Travel
Surveys–Time-Use Diaries and GPS. Presented at the 79th Annual Meeting of the
Transportation Research Board, Washington, D.C., 1999.
4. Stopher, P. E. and C. G. Wilmot. Development of a Prototype Time-Use Diary and
Application in Baton Rouge, Louisiana. Presented at the 80th Annual Meeting of the
Transportation Research Board, Washington, D.C., 2000.
5. Arentze, T., M. Dijst, et al. A New Activity Diary Format: Design and Limited Empirical
Evidence. Presented at the 81st Annual Meeting of the Transportation Research Board,
Washington, D.C., 2001.
6. Kuhnau J. and K.G. Goulias (2003) CentreSIM: First-generation Model Design, Pragmatic
Implementation, and Scenarios, Chapter 15 in Transportation Systems Planning: Methods
and Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 16-1 to 16-14.
7. U.S. Census Bureau. Centre County Quick Facts from the U.S. Census Bureau. May 7,
2003. http://quickfacts.census.gov/gfd/states/42/42027.html. Accessed May 29, 2003.
8. The Pennsylvania State University. Penn State Fact Book.
http://www.budget.psu.edu/factbook/default.asp. Accessed July 5, 2003.
List of Figures
Figure 1. Example pages from the household questionnaire.
Figure 2. Example pages from the activity-travel diaries.
Figure 3. Example of a personalized cover from an activity-travel diary.
Figure 4. Example of a “completed” activity-travel diary.
Figure 5. Weekly survey return rate.
Figure 6. Example of a household questionnaire data entry screen.
Figure 7. Example of a diary data entry screen.
Figure 1. Example pages from the household questionnaire.
Figure 2. Example pages from the activity-travel diaries.
Figure 3. Example of a personalized cover from an activity-travel diary.
Figure 4. Example of a “completed” activity-travel diary.
Figure 5. Weekly survey return rate.
[Plot: percent of respondents (0% to 40%) by week since initial mailing (weeks 1 to 14), with one series each for Lottery 1 and Lottery 2. Annotations: Week 4, Lottery Two follow-up mailed; Week 8, Lottery One follow-up mailed.]
Figure 6. Example of a household questionnaire data entry screen.
Figure 7. Example of a diary data entry screen.
CHIRAC:
A COMPREHENSIVE HOUSEHOLD INTEGRATED RECTIFIER FOR ACTIVITY DIARIES
Ondrej Pribyl, Jean-Robert Micaelli, Konstadinos G. Goulias, and Michael L. Patten
Abstract
In some of the more advanced activity-based models we also find higher requirements for the data needed for model estimation and
calibration. For example, information about all activities as well as trips conducted by individuals in the study area during one or more
days must be provided by the respondents. This information is often incomplete or inconsistent, since the demands on the respondents
are higher than a simple questionnaire. In this paper, we provide a general framework for management of the data in household
activity diary surveys. Its main focus is on the phase of entering and verification of the completed diaries. Lessons learned are
provided from our experience working on a survey conducted in Centre County, Pennsylvania (USA) in fall 2002 and spring 2003. A
short summary of the survey background is provided as well. This paper is practical in nature, and we demonstrate the most
important issues through real-life examples. The paper concludes with a series of recommendations that can help future data collection
efforts.
INTRODUCTION
Activity-based approaches, with roots traced back to the 1960s and 1970s, are offered as the next best alternative to typical regional
modeling and simulation for transportation decisions. Chapin’s research (1974) is the most likely study to have started a different
way of thinking in travel demand, particularly the idea of derived demand as envisioned in Manheim (1980). At about the same time,
Becker (1976) developed his theory of time allocation from a household production viewpoint. This foundation is completed by
a fourth study, Hagerstrand’s seminal work on time-space geography (1970), which contributes a systematic approach to the constraints
humans face, by chance and by design, on their action in space and time. Cullen and Dobson, in two papers in the mid-1970s as reviewed
by Arentze and Timmermans (2000) and Golledge and Stimson (1997), appear to be the first researchers
attempting to bridge the gap between the motivational (Chapin) approach to activity participation and the constraints (Hagerstrand)
approach by creating a model that depicts a routine and deliberated approach to activity analysis. Most subsequent contributions to
the activity-based approach emerge in one way or another from these initial frameworks with important operational improvements, as
discussed in the reviews by Kitamura (1988), Bhat and Koppelman (1999), Arentze and Timmermans (2000), and McNally (2000).
The basic ingredients of an activity based approach for travel demand analysis (Jones, Koppelman, and Orfeuil, 1990 and Arentze and
Timmermans, 2000) are:
a) explicit treatment of travel as derived demand (Manheim, 1980), i.e., participation in activities such as work, shop, and
leisure motivate travel but travel could also be an activity as well (e.g., taking a drive). These activities are viewed as episodes
(starting time, duration, and ending time) and they are arranged in a sequence forming a pattern of behavior that can be
distinguished from other patterns (a sequence of activities in a chain of episodes). In addition, these events are not independent
and their interdependency is accounted for in the theoretical framework;
b) the household is considered to be the fundamental social unit (decision making unit) and the interactions among
household members are explicitly modeled to capture task allocation and roles within the household, relationships and change
in these relationships as households move along their life cycle stages and the individual’s commitments and constraints
change and these are depicted in the activity-based model; and
c) explicit consideration of constraints by the spatial, temporal, and social dimensions of the environment is given. These
constraints can be explicit models of time-space prisms (Pendyala, 2003) or reflections of these constraints in the form of
model parameters and/or rules in a production system format (Arentze and Timmermans, 2000).
The inputs to these models are similar to the typical regional transportation model data on social, economic, and demographic
information of potential travelers and land use information. In addition, they require the same amount of information that typical
travel demand models need and a more detailed mapping of time allocation to create schedules followed by people in their everyday
life. The outputs are detailed lists of activities pursued, times spent in each activity, and travel information from activity to activity
(including travel time, mode used, and so forth) linked to individual and household characteristics. This output is very much like a
“day-timer” for each person in a given region.
Collecting this type of detailed information, however, introduces additional problems. These diaries tend to contain many
inconsistencies and incomplete information, with many similarities to the travel diaries that focus on trip-by-trip information (for a
complete example of rectification needs in travel diaries see Goulias and Kim, 2003). This issue, with emphasis on activity diaries, was
discussed in more detail, including examples from previous research, in Arentze et al. (1999). For this reason, before we can use the
collected activity diaries in an activity-based model we must ensure that the data set used for its calibration is rectified to at least be
free of internal contradictions, i.e., that it contains a complete schedule for each respondent without major gaps.
In support of such a task, Arentze et al. (1999) developed a SYstem for the Logical Verification and Inference of Activity diaries,
called SYLVIA. The main purpose of the SYLVIA system is to support the calibration of the ALBATROSS model, developed by
Arentze et al. (2000), but also to provide a generic data checking algorithm that can be used for verification of other activity diaries. It
aims to obtain a clean and consistent data set that can be used for the purpose of validation of activity-based models. To accomplish
this the Dutch team is using a set of logical rules that finds and replaces mistakes and inconsistencies in a data set.
Building on the SYLVIA method, in this paper we describe a new approach that addresses a similar general problem. It is designed to
support another model system called CentreSIM, and it expands the SYLVIA method to include a few additional issues. The new data
verification and rectification system is named CHIRAC: A Comprehensive Household Integrated Rectifier for ACtivity diaries. It is based on explicit
recognition of the hierarchical nature of the data, contains many automated procedures, and allows for expert human intervention. It
aims to provide a framework for checking, verifying, and rectifying data after collecting activity diaries.
The remainder of the paper is organized as follows. First a brief review of major inconsistencies and the need for data rectification is
presented. This is followed by the specific data collection project called CentreSIM. Then, a general framework for data rectification
is offered and a description of the application is provided. The paper concludes with a summary and brief discussion of findings.
MOTIVATION AND BACKGROUND
In this section we review a sample of some major sources of inconsistencies and problems in activity diaries. Previous studies in
transportation surveys (Richardson, Ampt, and Meyburg, 1995) provide more general reviews of the process, review diary typologies
(Axhausen, 1995), and instrument design (Stecher, Bricka, and Goldenberg, 1996). In the past few years an attempt was also made to
define survey performance measures and to create minimum standards for practice (TRB, 2000, Stopher and Jones, 2003).
Household activity diaries that aim to collect information from all household members are believed to be better than travel diaries
targeting a subset of the household members (e.g., driving age members) because they collect information about the complete decision
making unit and enable us to study joint household member patterns of time allocation. Using activity/time use diaries and definitions
and questionnaire topics and themes that are more “natural” for the respondents (e.g., what did you do today?) we also expect to see
better reporting by the respondents. Moreover, using activities we may also trigger the respondent’s memory to provide a more
accurate report of her/his behavior. Finally, the data thus derived contain some of the necessary information in the most recent
activity-based travel behavior models. However, activity surveys of the format described here require more reporting by the
respondents and they may offer more opportunities for mistakes unless the instrument design avoids that. They also require
significantly more work for the survey designers before, during, and after survey execution.
In this section we focus on some of the key factors that motivate the creation of automated procedures in data handling after the
respondents have returned their activity diaries. An attempt is made here to develop a typology for some of the most common sources of
inconsistencies as we experienced them during the work on this project. The majority of these inconsistencies are due to the survey
instrument design (alternate instrument designs were pilot tested before the final version was used; however, even the final version did
not eliminate all inconsistencies). For example, respondents did not clearly understand the format of the activity diaries, which led to
incorrectly filled forms. The most common example is merging an activity and one or more trips into one record (this was
also found by Arentze et al., 1999). We observed this problem most often for shopping activities, such as the typical example
depicted in Figure 8. This specific record should have been entered as three separate records (episodes): a trip from home to the store,
the shopping activity, and a trip back home from the store. Unfortunately, this is not the only way in which the same activity/travel
pattern was reported by other respondents. Some respondents would enter the first trip and the activity in one line and create a
separate record for the return trip. Others would report the trip going to the store in one line and combine the activity and the return
trip in another. These inconsistent patterns (we name this occurrence episode misreporting) complicate any automated cleaning
procedure. At the same time, finding and repairing these erroneous records is essential. Without rectification, the reported
diaries would be biased and would systematically underestimate the number of trips and activities in the dataset.
Figure 8: An example of an activity and trips reported together (episode misreporting). The stated starting and ending point is
the home location of this respondent.
Note: The addresses provided in all examples throughout this paper were modified in order to ensure confidentiality of the
provided information.
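A record of the kind shown in Figure 8 can be screened for automatically before an analyst decides how to split it into separate episodes. The sketch below is one hypothetical screening rule and is not the CHIRAC implementation: the `Episode` fields, the set of activity purposes, and the made-up addresses are all assumptions introduced for illustration. It flags an activity record that also reports a travelled distance, and uses the start and end locations to suggest whether one trip or a round trip was merged into it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    purpose: str            # e.g. "shopping", "trip", "work"
    start_location: str
    end_location: str
    distance_miles: float   # reported distance; 0 for a pure activity

# Hypothetical set of purposes that describe activities rather than trips.
ACTIVITY_PURPOSES = {"shopping", "work", "leisure", "eating"}

def merged_pattern(ep: Episode) -> Optional[str]:
    """Suggest how an activity record may have absorbed one or more trips."""
    is_activity = ep.purpose.lower() in ACTIVITY_PURPOSES
    if not is_activity or ep.distance_miles <= 0:
        return None  # nothing suspicious under this rule
    if ep.start_location == ep.end_location:
        return "activity merged with round trip"
    return "activity merged with one trip"

# Made-up example: shopping reported as starting and ending at home.
record = Episode("Shopping", "12 Oak St", "12 Oak St", 6.0)
print(merged_pattern(record))
```

A flag of this kind only points an analyst to the record. How the reported time is divided among the trip and activity episodes still has to be decided from the rest of the diary.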
Another set of issues arises when respondents refuse to provide some of the information requested (called item refusals
herein). The reasons behind these refusals vary. Some respondents had personal reasons for not providing information. In some cases
the respondents assumed we did not need information about their activity participation, since we are interested in transportation (the
material provided to the respondents had a clear transportation policy flavor). One respondent actually provided a written
comment to this effect and reported only her/his trips.
A third group of issues concerns the incorrect level of detail provided by respondents. On one hand, some respondents reported only
their major activities during the day (home, trip, work, trip, home). On the other hand, some would split the home activity into several
records, reporting every detail (getting up, brushing teeth, taking shower, feeding dogs, preparing breakfast, eating, washing dishes,
getting ready for work, preparing snack for kids, and so forth), so the resulting diary was rather long and detailed. Neither of the two
is desirable for the survey. However, the more detailed one allows for aggregation of activities, while the sketchy one reduces to a
travel diary.
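The asymmetry between the two cases, where an overly detailed diary can be aggregated but a sketchy one cannot be expanded, can be illustrated with a short sketch. It collapses consecutive records that share a location and a broad activity category into a single episode. The record layout (times in minutes after midnight) and the category names are hypothetical, and the records are assumed to be sorted by starting time.

```python
from itertools import groupby

# Each record: (start_minute, end_minute, location, broad_category, description)
def aggregate(records):
    """Collapse runs of consecutive records with the same location and category."""
    merged = []
    for (location, category), run in groupby(records, key=lambda r: (r[2], r[3])):
        run = list(run)
        # Keep the start of the first record and the end of the last one.
        merged.append((run[0][0], run[-1][1], location, category))
    return merged

detailed = [
    (360, 370, "home", "home activity", "getting up"),
    (370, 385, "home", "home activity", "taking shower"),
    (385, 410, "home", "home activity", "preparing breakfast"),
    (410, 430, "road", "trip", "drive to work"),
    (430, 960, "office", "work", "work"),
]
print(aggregate(detailed))
```

In this made-up diary the three morning records at home become one home episode, leaving the sequence home, trip, work.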
Other important sources of problems are inconsistencies that are implicitly embedded in the data (we will call them implicit
inconsistencies throughout this paper). For example, respondents would refer to the same events or locations differently:
some people would describe shopping in the GIANT grocery store simply as shopping, while others would provide the name of the store or
even more details. Some people used the term “Building” in the addresses, while others used the abbreviated forms “Bldg” or “Bldg.”, and
so forth. In addition, the same respondent would sometimes refer to the same location differently in different fields of the diary.
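Differences of the “Building” versus “Bldg.” kind are a natural target for automated normalization before addresses are compared. The following is a minimal sketch under stated assumptions: the abbreviation table and the sample address are hypothetical, and a real table would have to be built from the collected data and checked for ambiguous abbreviations.

```python
import re

# Hypothetical abbreviation table; entries are illustrative only.
ABBREVIATIONS = {"bldg": "building", "ave": "avenue", "rd": "road"}

def normalize_address(text):
    """Lower-case the text, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

# Three spellings of the same made-up address normalize to one string.
variants = ["201 Research Building", "201 Research Bldg", "201 research bldg."]
print({normalize_address(v) for v in variants})
```

Normalization of this kind removes only spelling variation. Deciding that “shopping” and a named store refer to the same location still requires the surrounding context of the diary.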
The respondents were also asked to provide information about the duration of each activity or trip, and the distance traveled. These are
estimates of distance and time. Previous studies (Golledge and Stimson, 1997) suggest that perceptions of time and distance are not
precise, but time may be more accurately perceived than distance because some persons carry watches and clocks are posted in many
places. In the diary we asked individuals to tell us the distance of each trip and the beginning and ending time of each trip. On one hand the
distance information gives us the opportunity to study their perception. On the other hand, however, caution is required in using this
information when data corrections take place.
The number of errors and inconsistencies that are recorded by the respondents can be reduced by providing instructions and/or
examples. In the instructions we can explain what the activity diary is and how it should be filled in. On one hand, the instructions
should not be too long or too complicated, because the respondents do not want to spend too much time and effort reading them and
trying to understand their meaning. On the other hand, if the instructions are too short, the survey’s purpose and format are often not
explained clearly. In our survey we provided a combination of instructions and examples. The examples were targeted to address
different population groups. For example, the case provided to students consisted of many on-campus activities followed by going out
in the evening. The example for older individuals consisted of some community activities, going to church, and a more relaxed evening
at home. We wanted to provide useful and relevant examples for each target group that would make filling in the diaries easy.
However, as our experience shows, even these easy and targeted examples did not ensure that all respondents understood the format of
the diary correctly and made no mistakes.
A very important issue is the motivation of the respondents. They need to feel that their responses are important and that they can
actually make changes in the transportation system, especially since the burden of filling in the activity diaries is high. All
correspondence with the respondents was personalized. The package sent to the respondents included an introductory letter that
started with the name of each particular respondent. In this letter, we let the respondent know what the main purpose of the project is
and what the possible outcome for each person can be. We let the respondents know how important their cooperation is to us and
that their reply can make a difference. This increases the motivation of the respondent to pay increased attention to the survey, which in
turn improves the quality of the answers as well as the response rate.
Errors in the data set can also occur after the completed diaries are received, mostly during the data input phase. The majority of
these problems are caused by misspelling, typographical errors, or inconsistent use of addresses (implicit
inconsistencies). For example, the recorders would enter the wrong ZIP code for a given address (this can easily happen since the ZIP codes
in Centre County often differ only by the last digit), or swap two letters in the name of a street.
All the examples of inconsistencies and errors discussed in the previous paragraphs are meant to introduce some of the problems we
face while collecting activity diaries. The framework developed in this paper aims to find and eliminate or reduce these
inconsistencies, and finally obtain a dataset that can be used for transportation and activity-based models. This framework is
approached from a practitioner’s point of view, so that it can be used in future collection of activity diaries.
Since we are dealing with human behavior and human answers, the authors believe that the use of fully automatic checking and repair
rules is highly problematic. Two problems that have the same effect are often present in the data set for different reasons, and their
treatment must be different as well. The following example demonstrates this argument. Clearly, in a sequence of two episodes,
the second one must start at the time when the first ends. In the following sections of this document we describe automated rules that
find episodes for which this assumption is violated. However, the data suggested that there were different causes for this problem. The
difference can be caused simply by a typographical error made while filling in the diary or while entering the diary into the database. In
some cases it is a sign of a missing activity or trip. In other cases the problem was simply that am hours were recorded instead of pm
hours. Certainly, there can be even more explanations for such a simple problem. Often a whole sequence of records had to be
adjusted to fix the given problem. Determining how to fix the record (or series of records) often requires a more detailed study of the
whole pattern, and possibly also of the patterns of other members of the family.
A logical rule would repair such records correctly in the majority of cases; however, a small percentage of records would be fixed
incorrectly. In contrast to the SYLVIA system, we want to eliminate even this small number of errors, at the price, of course, of a higher
demand for manual repairs.
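The continuity rule discussed above (each episode must start when the previous one ends) lends itself to an automated check that flags violations for manual review rather than repairing them. The sketch below is a hypothetical illustration of that idea and not the CHIRAC rule set: the function name, the notes it attaches, and the sample diary are assumptions. A difference of exactly twelve hours is treated as a hint of an am/pm mix-up, which is only one of the possible causes named in the text.

```python
from datetime import datetime, timedelta

def find_time_breaks(episodes):
    """Flag consecutive episodes whose times do not meet.

    episodes: list of (start, end) datetime pairs in reported order.
    Returns a list of (index, note) pairs for an analyst to review.
    """
    flags = []
    for i in range(1, len(episodes)):
        gap = episodes[i][0] - episodes[i - 1][1]
        if gap == timedelta(0):
            continue  # episode starts exactly when the previous one ends
        if abs(gap) == timedelta(hours=12):
            note = "possible am/pm mix-up"
        elif gap > timedelta(0):
            note = "gap: possible missing activity or trip"
        else:
            note = "overlap: possible typographical error"
        flags.append((i, note))
    return flags

def at(hour, minute=0):
    """Build a time on an arbitrary, made-up diary day."""
    return datetime(2003, 3, 4, hour, minute)

# Made-up diary: a 30-minute gap, then 1:00 entered instead of 13:00.
diary = [(at(7), at(8)), (at(8), at(12)), (at(12, 30), at(13)), (at(1), at(5))]
print(find_time_breaks(diary))
```

Returning notes instead of corrected times keeps the final decision with the analyst, who can inspect the whole pattern and the diaries of other household members before changing a record.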
THE CENTRESIM DATA
The CentreSIM model and data collection is a five-year initiative to improve modeling and simulation for transportation decisions in
Centre County, Pennsylvania. CentreSIM is a regional simulation of Centre County, a region of approximately 136,000 persons that
includes the Pennsylvania State University with more than 40,000 students and more than 12,000 faculty and staff. The unique
characteristics of the Penn State population in terms of time and space choices need to be accounted for in the simulation of the county
population. To accomplish this, a household survey is organized, and the spatial foundation of the CentreSIM model is assembled together with a
complete inventory of businesses containing the address, business type classified by Standard Industrial Classification (SIC) code, and number of
employees of each business. The businesses are first matched to a digital map of the simulation area that includes the
roadways and intersections of the transportation system, as well as the traffic analysis zones (TAZs) of the study area.
The business type and size are used to estimate the capacity, in terms of the number of persons that can be served, of each
establishment for the activities of work, shopping, recreation, and other. The activity capacities serve as a measure of attractiveness
for each activity, and can be aggregated to the TAZ level to obtain the attractiveness of each TAZ and perform traditional four-step
modeling. Population data from the 2000 U.S. Census are then used to estimate the capacity of each TAZ for the home activity. The
spatial distribution of the four activity types and stay home forms the basis of a 24-hour accounting system of the population (named
the zone presence model), their activities, and their activity locations.
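The aggregation of establishment capacities to the TAZ level can be sketched in a few lines. The establishment records below are made-up numbers, and the tuple layout and function name are assumptions for illustration; the sketch does not show how capacity is estimated from business type and size.

```python
from collections import defaultdict

# Hypothetical establishment records: (taz_id, activity, capacity in persons).
establishments = [
    (101, "work", 120),
    (101, "shopping", 300),
    (102, "work", 40),
    (102, "recreation", 80),
    (101, "work", 60),
]

def taz_attractiveness(records):
    """Sum establishment capacities by TAZ and activity type."""
    totals = defaultdict(lambda: defaultdict(int))
    for taz, activity, capacity in records:
        totals[taz][activity] += capacity
    # Convert nested defaultdicts to plain dictionaries for readability.
    return {taz: dict(by_activity) for taz, by_activity in totals.items()}

print(taz_attractiveness(establishments))
```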
The spatial distribution of activity locations combined with the temporal activity patterns of six designated population segments (Penn
State students, Penn State faculty, Penn State staff, unemployed, professionals, workers) are the basic elements in the development of
this activity-based travel demand model. The activity patterns for each population segment will be calculated from the CentreSIM
survey. The activity distributions multiplied by the number of persons in each population segment yield the number of persons
engaged in each activity in each hour. This output, combined with the spatial attractiveness of each TAZ for each activity, results in a
zone presence model consisting of the number of persons in each zone and their activities by time of day. An example of an early
application with data from a 1996 small-sample survey can be found in Kuhnau and Goulias (2003).
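The accounting described above can be illustrated with a minimal sketch. It assumes that persons engaged in an activity are allocated to zones in proportion to each zone's capacity for that activity; this proportional rule, the function names, and all numbers are illustrative assumptions and are not taken from the CentreSIM implementation.

```python
# Sketch of the zone presence accounting. All names and numbers are hypothetical.

def persons_by_activity(segment_size, activity_shares):
    """Persons of one population segment engaged in each activity during one hour."""
    return {activity: segment_size * share for activity, share in activity_shares.items()}


def zone_presence(persons, attractiveness):
    """Allocate persons to zones in proportion to zone attractiveness (an assumption).

    persons:        {activity: number of persons}
    attractiveness: {activity: {zone: capacity}}
    Returns {zone: {activity: persons present}}.
    """
    presence = {}
    for activity, count in persons.items():
        capacities = attractiveness[activity]
        total = sum(capacities.values())
        for zone, capacity in capacities.items():
            presence.setdefault(zone, {})[activity] = count * capacity / total
    return presence


# Made-up activity shares at 10 a.m. for a segment of 1,000 persons.
shares_10am = {"work": 0.6, "shopping": 0.1, "home": 0.3}
capacity = {
    "work": {"TAZ1": 300, "TAZ2": 100},
    "shopping": {"TAZ1": 50, "TAZ2": 150},
    "home": {"TAZ1": 200, "TAZ2": 200},
}
persons = persons_by_activity(1000, shares_10am)
presence = zone_presence(persons, capacity)
```

Repeating the calculation for each hour and each population segment, and summing over segments, would give the 24-hour table of persons by zone and activity.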
The CentreSIM household survey covers all of Centre County and also includes persons who work in Centre County but reside
elsewhere. The data from this survey fill a critical gap in knowledge about the county and provide the necessary information to
develop models that can be used in the regional simulation software currently being developed for the exploration of alternatives. This survey has
- 31 -
Patten and Goulias
32
been administered between November 23, 2002 and May 30, 2003, and included two components, a household questionnaire and a two-day activity-travel diary, both of which are described briefly below. The survey was conducted during two different time periods
during which different recruiting methods were used. We refer to these periods as Lottery I, conducted in fall 2002, and Lottery II,
conducted in spring 2003. The lotteries, in which prizes of $1,000 in gift certificates were distributed, were used to encourage
people to participate in the survey. Identical questionnaires and activity-travel diaries were used throughout the entire survey.
In the questionnaire, each participating household is asked to voluntarily provide information about household composition and
facilities available to the household members. In addition, each household member also provides personal information such as
employment, driving ability, education and so forth. The survey also includes a few questions about opinions and perceptions
regarding the Centre County transportation system.
Activity and travel data are also collected from each person in the household using a complete two-day record of the activities in
which each person engaged and the transportation options used. An example of the activity diary is provided in Figure 10.
The respondents were asked to record the beginning and ending time of each episode and its purpose, and to answer the questions “With whom did you
do the activity” and “For whom did you do the activity”. For trips, the respondents were asked to report the travel mode used,
whether they drove, the starting and ending points of the trip, and an estimate of its distance. All these questions are
included in order to capture the decision-making aspects of activity and travel scheduling within each household.
During the first survey lottery, subjects were recruited both by telephone and by mail. Our study team contacted 821 households via
telephone to request participation in the study. Of these, only 190 agreed to participate. The contact rate (participants/contacts) was
23.1 percent. All of the 190 households agreeing to participate were mailed survey packets as outlined above. Of these, 80 households
(43.5%) returned completed, usable questionnaires and diaries. The overall response rate for the telephone recruiting was 10.0 percent
(returns/contacts).
In Lottery I, 1,478 households were contacted by mail. A total of 647 household questionnaires were returned yielding a 61.3 percent
response rate. Of the 647 households returning household questionnaires 568 were mailed activity-travel diaries. Time constraints
prevented inclusion of all responding households. A total of 208 households returned completed diaries yielding a 41.4 percent diary
response rate. With 208 households returning usable diaries, the overall response rate for Lottery I is 13.7 percent.
Lottery II was done completely by mail. Overall, questionnaires were mailed to 7,447 households. A total of 2,414 household
questionnaires were returned yielding a 35.3 percent response rate. Of the 2,414 households returning household questionnaires 1,969
were mailed activity-travel diaries. A total of 494 households returned completed diaries yielding a 25.2 percent diary response rate.
With 494 households returning usable diaries, the overall response rate for Lottery II is 8.9 percent. The second lottery, however, is
heavily skewed in its composition by the presence of Penn State student participants, who have a dramatically lower response rate than
any other segment of the population.
GENERAL FRAMEWORK
Collecting activity diaries requires a thorough process. It starts with developing the overall framework of the survey,
designing the questionnaire, choosing the right type of survey medium, determining a representative sample of the population in the
study area, organizing the distribution of the survey instrument, and building a plan of action together with contingency plans for the events that are
expected to create problems.
This paper focuses on the steps that follow the receipt of the completed surveys. A simplified flowchart of the process described in
this paper is depicted in Figure 9. After the completed surveys are received, they are entered into a data set (data input module). These data
often carry many inconsistencies, as discussed above. Automated rules are applied to the data set in order to identify records that
are potentially inconsistent. Such records are flagged and then fixed (data cleaning module). Finally, before the data can be delivered to
customers, they need to be converted to the desired final format. Our paper addresses all steps in this process.
Figure 9: Flowchart of the process (Receipt of completed surveys -> Data Input Module -> Data Cleaning Module -> Final Formatting -> Deliver data to customers)
The main objective of this paper is to develop a framework for the data entry process that minimizes the discrepancies between the
observed activity/travel pattern and the patterns that are delivered to the customer. The final dataset should contain consistent records
that can be directly used for calibration of activity-based models.
Data Input Module
After receiving the completed surveys, the data must be entered into a dataset. As discussed above, we do not want to introduce any
additional error to the data in this phase. Therefore, in this step we do not make any additional changes to the data and we do not make
any assumptions that could potentially distort the data. All additional changes are left to the second module, data cleaning. Our
reasoning is explained in the following paragraphs.
The more people make assumptions and repair infeasible records, the more inconsistency is introduced, because
the same type of error would be repaired in many different ways, which is not desirable. Because of the large amount of data that
had to be entered into the database, several recorders worked on data entry. Clearly, their level of understanding of the
purpose of collecting the activity diaries differed. For that reason, they were asked to record exactly what was written in the collected
surveys (with the exception of obvious grammatical mistakes and errors). All major changes and repairs were done in the second phase
(data cleaning module) by a limited number of experts. These experts spent time studying the data set, trying to understand possible
problems and their causes. They shared their experience, discussed problematic cases with the entire group, and then developed a
unified solution to the problematic cases. These steps helped to eliminate inconsistencies in their decisions.
One important issue that is nevertheless addressed in the data input module is the fixing of some implicit inconsistencies. The automated rules
in the remaining steps of the data cleaning process often compare entire strings characterizing behavior. Typically, we want to
compare the locations of two different activities in order to determine whether they were conducted at the same location. All characters in these
strings, including punctuation and blank spaces, need to be exactly the same in order for the fields to match
successfully.
In the introductory part of this paper we described some implicit inconsistencies that often appear in the address of an activity, such as
the use of different phrases for the same location, abbreviations, and others. We tracked empirically the most common cases of these
problems in the fields describing location and replaced them with the standardized abbreviations used by post offices in the USA. Some
examples are provided in the following list (this list is not complete):
Item          Abbreviation
Avenue        Ave
Circle        Cir
Drive         Dr
Lane          Ln
Apartment     Apt
Building      Bldg
Room          Rm
All recorders were instructed to use these abbreviations in all cases, instead of the original answer. In addition, after the data were entered, a
simple find-replace procedure was used to ensure that the entries were really standardized. For example, we would find all occurrences
of the words “Avenue” and “Ave.” and replace them with “Ave”.
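The find-replace step can be sketched as follows. The abbreviation list is the one given above; the use of a regular expression, the collapsing of repeated blanks, and the function name are assumptions made for illustration and do not describe the procedure actually run in MS Access.

```python
import re

# Abbreviations from the list above (the list in the text is not complete).
ABBREVIATIONS = {
    "avenue": "Ave", "circle": "Cir", "drive": "Dr", "lane": "Ln",
    "apartment": "Apt", "building": "Bldg", "room": "Rm",
}
# Map both the full word and the abbreviation itself to one canonical spelling.
CANONICAL = dict(ABBREVIATIONS)
CANONICAL.update({short.lower(): short for short in ABBREVIATIONS.values()})


def _replace(match):
    canonical = CANONICAL.get(match.group(1).lower())
    # Unknown words are returned unchanged, including any trailing period.
    return canonical if canonical else match.group(0)


def standardize_address(raw):
    """Return an address normalized for exact string comparison (hypothetical helper)."""
    collapsed = " ".join(raw.split())  # trim and collapse repeated blanks
    return re.sub(r"\b([A-Za-z]+)\b\.?", _replace, collapsed)


print(standardize_address("123  College Avenue, Apartment 4"))
print(standardize_address("123 College Ave. Apt 4"))
```

Because later rules compare whole strings character by character, any normalization of this kind has to be applied identically to every location field before the comparison.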
The user environment also plays an important role in data entry and can reduce the number of mistakes that occur during
this phase. User-friendly forms were developed for entering and editing the activity diaries. An example of a data
editing form is shown in Figure 10. The form resembles the activity diaries as filled in by the respondents. This feature helps the recorder to keep
visual control of the entered data.
Figure 10: An example of the data entry and data editing form in MS Access with a fraction of an activity pattern. Also provided
is an example of the assignment of trip purposes (primary Q1a and secondary Q1b) for two of the records. It is an example of
going to another travel mode, as described in the section on assigning trip purposes.
Usually, all in-home activities of the entire family are located at the same address, so the same string repeats in multiple rows of the
data set. This means that the recorder has to type the same address several times. For this reason we used a feature of MS Office that,
while a field is being entered, automatically suggests the rest of the entry based on previous entries (auto-completion). An example of auto-completion is depicted in Figure 10. This feature significantly reduces the probability of
typographical errors during this phase. The recorders had to be careful only when typing the address for the first time. For the rest of the
household, the address would appear automatically after the first few letters were typed.
Data Cleaning Module
The general flowchart of the process of data cleaning is presented in the following figure.
Figure 11: General flowchart of the data cleaning routines (Input: rough data set -> Data checking: logical rules -> Assignment: home location, trip purposes, activity types -> Output: clean data set)
The rough data set, with only some of the implicit inconsistencies fixed, is the input in this phase. Our major goal is to eliminate all
remaining inconsistencies in the data set. The major part of the data cleaning procedure consists of applying a set of logical rules that
find inconsistencies in the data. The problematic records are then marked and listed in a separate table. Then they have to be checked
by an expert and fixed.
The respondents described in their own words the activities they did. In order to generate tables that can be used in modeling
systems, we have to assign to each trip a value from a finite set of trip purposes and, similarly, assign a type to each activity. The
routines that were used to do this assignment require knowledge of which activities were conducted at home and which away from
home. Our solution to this problem is also described in this section.
Automatic data checking
The rules presented in this section aim to find records that are potentially inconsistent. The rules are deliberately conservative: they are meant to flag
too many records, including some with no inconsistencies, rather than to miss records that need to be fixed. Contrary to the algorithms in SYLVIA,
the rules do not aim to fix the problems (see reasoning in chapter 0). That task remains with the experts who use these rules.
We proposed five major groups of data checking rules. Their hierarchical structure is shown in Figure 12. The rules operating
on the lowest level are called basic rules in this project. They focus on properties of individual records, for example that the ending time of
an activity should not be before the start time of the same activity. The rules at the second level focus on consecutive
records. These rules look for problems in the chaining of trips and activities. Once every pair of consecutive records is consistent, the rules
called daily patterns are used to check for inconsistencies in each daily pattern, for example in the start time or end time of diaries. The next
stage aims to find problematic records at the transition between the two survey days for each person. The highest level in this scheme is
the household level. Here we try to find inconsistencies in the reported number of persons in different age groups, possession of a driver's
license, and others.
Figure 12: Hierarchical structure of the data checking rules. From the highest level to the lowest: Household Inconsistencies (H), Day to day transition (D), Daily Patterns (P), Trip chaining (T), Basic Rules (B).
A more detailed description of particular rules is provided in the following sections. Their summary is provided in Table 8. This table
also contains the number of flags raised by each rule when applied to the CentreSIM survey data. More details about this topic are
provided in section 0.
Basic rules for particular records
Rule B1 - Activity hours too long
It is quite unlikely that a single activity would have a very long duration. Even a work activity is usually interrupted by a lunch break.
This rule finds activities that are longer than 6 hours, because such a duration is often an indication of a problem with the record (an activity
and a trip mixed together). Only when the activity type is school, work, or sleep is the limit set to 10 hours. These numbers were
empirically derived from the data.
Rule B2 - Trip time is longer than 1 hour and its distance shorter than 15 miles
Often, if there is a trip of a long duration in the data set, it is an indication that an activity is mixed together with a trip. For
example, people would fill in a trip described as shopping that in reality should be split into three records: the trip from home to the
store, the shopping activity, and the return trip home. This rule finds trips that are longer than 1 hour and whose distance did
not exceed 15 miles. As with the previous rule, these values were derived empirically.
Rule B3 - End time of an activity is before beginning time of the same activity
Clearly, the ending time of each activity must be later than its beginning time. This rule helps to find records that do not satisfy this
condition.
Rule B4 - Missing travel information
This rule finds records that are classified as trips but do not provide some or all of the travel information. The only
exception is for the travel mode walk, in which case there is no answer about traveling as a driver or passenger. This rule helps to repair
cases in which an activity was accidentally classified as a trip.
Rule B5 - Travel speed limit
The respondents were asked to provide beginning and ending time for each trip, as well as travel mode and the perceived distance
of the trip. This rule helps to find records with unrealistic speeds. Limits of expected speed for each travel mode were empirically
derived from the data. Clearly, the respondents’ perception of distance is not perfect, which implies a wider range of expected speeds.
The travel modes used in the survey together with the derived speed limits and distance limits for some travel modes are provided
in Table 6.
If the average speed (reported distance divided by the travel time) does not fit into the range between the minimum and maximum
speed for the particular travel mode, the record is flagged.
Rule B6 - Travel distance limit
Like the previous rule, this rule finds records that suggest an unrealistic distance. The limits for distances, derived
empirically from the data, are also provided in Table 6. This rule concerns only trips conducted on foot or by bicycle. The
maximum distance for bicycle and for walking was set to 5 miles and 1.5 miles, respectively.
Rule B7 - Missing location of an activity
This rule finds records in which the location of an activity is not specified.
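The basic rules can be expressed compactly as checks on a single record. The following is a minimal sketch under stated assumptions: the record layout (a dictionary with times stored as minutes after midnight) and the function name are hypothetical, Rule B4 is reduced to a check of travel mode and distance, and the limits are those reported in Table 6 and in the rule descriptions. The rules only flag records; nothing is corrected automatically.

```python
# Speed limits [mph] and maximum distances [miles] by travel mode ID, as reported in Table 6.
SPEED_LIMITS_MPH = {1: (7, 65), 2: (4, 45), 3: (5, 15), 4: (5, 65), 5: (1.5, 10), 6: (0.3, 6)}
MAX_DISTANCE_MILES = {5: 5.0, 6: 1.5}
LONG_ACTIVITY_TYPES = {"school", "work", "sleep"}


def basic_flags(rec):
    """Return the basic rules (B1-B7) that flag one diary record (hypothetical layout)."""
    flags = []
    duration = rec["end_min"] - rec["start_min"]  # minutes
    if duration < 0:                              # B3: end before beginning
        flags.append("B3")
    if not rec["is_trip"]:
        limit = 600 if rec.get("activity_type") in LONG_ACTIVITY_TYPES else 360
        if duration > limit:                      # B1: activity too long
            flags.append("B1")
        if not rec.get("location"):               # B7: missing location
            flags.append("B7")
        return flags
    mode, miles = rec.get("mode"), rec.get("miles")
    if mode is None or miles is None:             # B4 (simplified): missing travel information
        flags.append("B4")
        return flags
    if duration > 60 and miles < 15:              # B2: long trip over a short distance
        flags.append("B2")
    if mode in SPEED_LIMITS_MPH and duration > 0:
        low, high = SPEED_LIMITS_MPH[mode]
        speed = miles / (duration / 60)           # average speed in mph
        if not low <= speed <= high:              # B5: unrealistic speed
            flags.append("B5")
    if mode in MAX_DISTANCE_MILES and miles > MAX_DISTANCE_MILES[mode]:
        flags.append("B6")                        # B6: unrealistic distance
    return flags


# A 30-minute walk reported as 4 miles violates both the speed and the distance limit.
print(basic_flags({"is_trip": True, "start_min": 600, "end_min": 630, "mode": 6, "miles": 4}))
```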
Table 6: The travel modes considered in the survey together with speed and distance limits empirically derived from the data.

ID    Travel Mode          Speed Limit [mph]    Distance Limit [miles]
                           Min      Max         Min      Max
1     Car, truck or van    7        65
2     Bus                  4        45
3     Taxicab              5        15
4     Motorcycle           5        65
5     Bicycle              1.5      10                   5
6     Walked               0.3      6                    1.5
7     Other
888   Skip
999   No ans.

Problems of trip chaining
Rule T1 - Change of location without travel - two consecutive activities are conducted at different locations (Activity-Activity)
If there are two consecutive activities (i.e., no trip between them) and the locations of these activities are different, there is most
likely a missing trip. This rule aims to find these records.
Rule T2 - The location of an activity does not equal the ending point of the previous trip (Trip-Activity)
This rule also deals with inconsistencies in location. It finds records in which an activity follows a trip but the stated location of the
activity differs from the ending point of the preceding trip.
Rule T3 - The starting point of a trip does not equal the ending point of the preceding trip (Trip-Trip)
This rule is very similar to the previous rule; the difference is that it deals with two consecutive trips. In this case, the starting point
of the second trip must be equal to the ending point of the first trip.
Rule T4 - The starting point of a trip does not equal the location of the preceding activity (Activity-Trip)
This rule also deals with inconsistencies in location, in this case when a trip follows an activity. The starting point of
the trip must be equal to the location of the previous activity.
Rule T5 - Ending point of a trip equals the starting point of the same trip
This rule finds cases in which the starting point of a trip equals the ending point of the same trip. This often means that there is a
missing activity.
Rule T6 - Beginning time of a record does not equal the end time of the previous record
This rule finds cases in which the beginning time of one record (an activity or travel) does not equal the ending time of the
previous record. There are several possible causes for this problem. For example, the individual could simply have recorded a wrong time
(in this case it could be changed automatically), or s/he forgot to record another whole trip or activity.
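Rules T1 to T4 all compare the place where the previous record ends with the place where the current record begins, so they can be written as one comparison whose label depends on the pair of record types. The sketch below assumes a hypothetical record layout (activities carry a location, trips carry an origin and a destination, times are minutes after midnight) and assumes that location strings have already been standardized.

```python
RULE_BY_PAIR = {
    (False, False): "T1",  # Activity-Activity
    (True, False): "T2",   # Trip-Activity
    (True, True): "T3",    # Trip-Trip
    (False, True): "T4",   # Activity-Trip
}


def chaining_flags(prev, cur):
    """Return the trip chaining rules that flag the current record given the previous one."""
    flags = []
    prev_end = prev["destination"] if prev["is_trip"] else prev["location"]
    cur_start = cur["origin"] if cur["is_trip"] else cur["location"]
    if prev_end != cur_start:                       # exact string comparison
        flags.append(RULE_BY_PAIR[(prev["is_trip"], cur["is_trip"])])
    if cur["is_trip"] and cur["origin"] == cur["destination"]:
        flags.append("T5")                          # trip that starts and ends at the same point
    if cur["start_min"] != prev["end_min"]:
        flags.append("T6")                          # gap or overlap in time
    return flags


home = {"is_trip": False, "location": "Home", "start_min": 0, "end_min": 480}
office = {"is_trip": False, "location": "Office", "start_min": 480, "end_min": 1020}
print(chaining_flags(home, office))  # two activities at different places with no trip between
```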
Daily patterns
Rule P1 - The end time of the last activity in the day does not equal 11:59:00PM
The daily pattern in our project was defined to start at midnight (12:00am) and end at 11:59pm the same day. This rule finds records
that do not end at 11:59pm.
Rule P2 - First activity does not start at 12:00:00AM
Similarly, this rule finds schedules that do not begin at midnight (12:00am).
Rule P3 - There are fewer than five activities in the diary
Even a simple activity/travel pattern has to consist of several records (sleep, eat breakfast, travel to work, work, return home, go to
bed). This rule finds diaries that do not contain more than 5 records.
Transition from one day to another
Rule D1 - The location of the first activity in day 2 does not equal the location of the last activity in day 1
Even when the daily patterns of an individual are consistent, we must ensure consistency of transition from one day to the other.
This rule finds records in which the location at the beginning of the second day does not correspond to the location at the end of
the first day. In such a case, a record is probably missing.
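The daily pattern rules and the day-to-day transition rule operate on whole diaries rather than on pairs of records. A minimal sketch follows, assuming that each day is a non-empty, time-ordered list of records with times stored as minutes after midnight; the names are hypothetical. The threshold for Rule P3 is left as a parameter because the rule title refers to fewer than five records while its description refers to no more than five.

```python
MIDNIGHT = 0                # 12:00 AM, in minutes after midnight
END_OF_DAY = 23 * 60 + 59   # 11:59 PM


def daily_flags(day, min_records=5):
    """Return the daily pattern rules (P1-P3) that flag one day of one person's diary."""
    flags = []
    if day[-1]["end_min"] != END_OF_DAY:
        flags.append("P1")
    if day[0]["start_min"] != MIDNIGHT:
        flags.append("P2")
    if len(day) < min_records:
        flags.append("P3")
    return flags


def transition_flag(day1, day2):
    """Rule D1: True when day 2 starts at a different location than day 1 ended."""
    return day1[-1].get("location") != day2[0].get("location")


short_day = [
    {"start_min": 0, "end_min": 480, "location": "Home"},
    {"start_min": 480, "end_min": 1439, "location": "Home"},
]
print(daily_flags(short_day))  # starts and ends correctly but has too few records
```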
Consistency on the household level
Rule H1 - Total number of persons in the household (Q23) does not equal the sum of persons stated in different age groups (Q24)
In the household form, we asked several questions about the structure of the household. This rule finds records in which the
number of persons declared by particular age groups does not equal the total number of persons in the household. (Q23: “how
many people live permanently in your household”, and Q24: “Of the people in your household, how many are: a) younger than 5
years old; b) 5 to 11 years old; c) 12 to 15 years old; d) 16 to 18 years old; e) 18 years of age or older”)
Rule H2 - Total number of persons in the household (Q23) does not equal the number of people listed in the household
questionnaire
In the household questionnaire we also asked about names and other information of particular respondents. The number of
provided names should equal the total number of persons stated for the household.
Rule H3 - Number of people differs between household form and the diaries
This rule compares the number of persons stated for a specific household to the number of travel diaries stored in the data set.
Clearly, there should be two diaries for each individual (two days). This rule finds surveys that do not satisfy this expectation.
Rule H4 - Person younger than 16 has a driver's license
An individual who is younger than 16 years is not eligible to have a driver’s license. This rule finds those individuals who stated
they were younger than 16 years of age but also stated that they have a driver’s license. One of the fields must have been filled
in incorrectly.
Rule H5 - Person stated in the diary form that s/he drove, but s/he is not a licensed driver in the household form
One question in the activity/travel diaries asks whether the individual was the driver or a passenger for a particular trip. This rule finds
those diaries in which a person stated that s/he drove a car but stated in the household form that s/he has no driver's license.
Rule H6 - Person does not have a driver’s license but stated in the household form that s/he sometimes drives a car
Another question in the household form asks how often an individual drives a car. This rule finds forms in which a person
stated that s/he does not have a driver's license but answered that s/he sometimes drives a car.
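The household-level rules cross-check answers given in different parts of the survey. The sketch below assumes a hypothetical household record in which the relevant answers have already been collected into one dictionary; the field names are illustrative and do not correspond to variable names in the CentreSIM data set.

```python
def household_flags(hh):
    """Return the household rules (H1-H6) that flag one household (hypothetical layout)."""
    flags = set()
    persons = hh["persons"]
    if hh["total_persons"] != sum(hh["persons_by_age_group"]):  # Q23 vs. Q24
        flags.add("H1")
    if hh["total_persons"] != len(persons):                     # Q23 vs. listed persons
        flags.add("H2")
    if hh["diary_count"] != 2 * hh["total_persons"]:            # two diary days per person
        flags.add("H3")
    for person in persons:
        if person["licensed"] and person["age"] < 16:
            flags.add("H4")
        if not person["licensed"] and person["drove_in_diary"]:
            flags.add("H5")
        if not person["licensed"] and person["drives_sometimes"]:
            flags.add("H6")
    return sorted(flags)


household = {
    "total_persons": 3,
    "persons_by_age_group": [0, 1, 0, 0, 2],
    "diary_count": 5,
    "persons": [
        {"age": 41, "licensed": True, "drove_in_diary": True, "drives_sometimes": True},
        {"age": 39, "licensed": True, "drove_in_diary": False, "drives_sometimes": True},
        {"age": 10, "licensed": True, "drove_in_diary": False, "drives_sometimes": False},
    ],
}
print(household_flags(household))  # one diary is missing and a 10-year-old is marked as licensed
```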
Assign Home Location
The respondents provided information about their home addresses in a separate field of the survey, but not in their activity
diaries. Only an address was entered in the field “location of your activity”, so an algorithm had to be developed to find all home-based
activities. This information is essential for other automatic routines, such as the assignment of trip purposes and activity types.
This section describes a semi-automated procedure to identify all in-home activities.
The algorithm consists of two steps. In the first step we compare all addresses stated by a particular household in the activity diary to
the address to which we originally mailed the survey. The mailing addresses were stored in a separate data set, which can nevertheless be matched to
the travel diary using a unique identifier, the form number. If the fields in both data sets match, the activity is clearly undertaken at
home. We do not overwrite the original information. Instead, we introduced a new dummy variable that equals one if the
activity was conducted at home, and zero otherwise (an activity out of home or a trip).
This first step finds the majority of the activities located at home; however, there are some exceptions. One reason for not finding an in-home activity can be an inconsistency between the stated and mailing addresses. A simple misspelling, an extra space inserted at some
position in the string, or other similar reasons can be the cause. For this reason a second step is introduced.
The second step is based on the assumption that the majority of persons start their daily patterns at home. This is clearly not true in all
cases (for example, respondents could work after midnight, or could still be at a party or in a bar, visiting friends, or
elsewhere for other reasons), so we cannot create a fully automatic procedure based on this assumption.
In this second step we find all diaries in which the first activity was not marked as “at home” during the first step. The addresses in the
listed diaries are then manually checked and fixed if necessary, and the “at home” field is also set for these activities.
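The two steps can be sketched as follows. The first function adds the at-home dummy by exact comparison with the mailing address and leaves the stated location untouched; the second lists the diaries whose first record was not marked, which are the ones passed to an expert. The record layout and function names are hypothetical.

```python
def mark_at_home(records, mailing_address):
    """Step 1: set at_home to 1 for activities located at the mailing address, else 0."""
    for rec in records:
        rec["at_home"] = int(not rec["is_trip"] and rec.get("location") == mailing_address)
    return records


def diaries_for_manual_check(diaries):
    """Step 2: return IDs of diaries whose first record was not marked as at home."""
    return [diary_id for diary_id, records in diaries.items()
            if records and records[0]["at_home"] == 0]


# Hypothetical diaries of one household; the second one spells the address differently.
diaries = {
    "A-day1": [{"is_trip": False, "location": "12 Oak Ln"},
               {"is_trip": True, "location": None}],
    "B-day1": [{"is_trip": False, "location": "12 Oak Lane"}],
}
for records in diaries.values():
    mark_at_home(records, "12 Oak Ln")
print(diaries_for_manual_check(diaries))  # only the diary with the unmatched address
```

The second function only produces a list for review; as the text notes, a first record away from home can be perfectly valid, so no automatic correction is attempted.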
This step requires some additional non-automated work; on the other hand, it increases the reliability of the final data set, which was one
of our major objectives. The load of manual work was not unreasonably high in this case, as an
example from our application demonstrates. There were 284 cases in which the first record was not identified as at home by the first step of the
algorithm. These records had to be checked manually by an expert. However, only about 15% of them were actually not conducted at
home (working at night, sleeping away from home, out of town, engaged in active nightlife). We did not make any corrections in these
cases. The rest of the 284 cases were caused by problems in comparing addresses: discrepancies between the address stated in the
activity diary and the mailing address, a difference between where respondents live and their mailing address (e.g., using a P.O. box
number that obviously is not a residence location), and others. We corrected the address and marked these activities as activities
conducted at home.
Assign Trip Purposes
As mentioned in the previous section, many activity and travel behavior models need information about the purpose of
each trip. In our activity-travel diaries, we asked people to describe in their own words what activity they conducted and, in the case
of trips, what their purpose was. We did not provide a list of possible answers to the respondents because we did not want to influence and
bias their answers. This provides the opportunity to understand respondents’ perception of their trip purpose, but the variety of
answers was very large, since two people usually referred to the same event in different ways due to linguistic habits.
For the purpose of modeling, we need to use only a finite number of well-defined trip purposes and activity types. Different models
require different trip purposes and there is no single universally accepted standard that could be used. In this phase of our analysis we
decided to use a rather small number of trip purposes that would be easy to classify. It should be noted, however, that we do not
overwrite the original answers (marked Q1 in the data set); rather, we introduce new variables to store the trip purposes. This
enables us or other data users to define a different set of trip purposes and assign them in later phases.
We suggested 14 categories of trip purposes that are summarized in Table 7. The second column of this table describes particular trip
purposes. The first column shows identification numbers that we used in the data set.
We introduce two new variables: Q1a (primary trip purpose) and Q1b (secondary trip purpose). The secondary trip purpose is
used only for a trip whose purpose is going to another travel mode (denoted 81 in the table). Typically this is the case when there is a sequence of
two or more trips. For example, the individual first walked to a bus station and then took a bus to school. In this case the primary
purpose of the first trip is denoted “going to another travel mode” (81). Its secondary purpose is travel to school (12). The second trip
is assigned only the primary trip purpose, commute to school (12). A similar example from the data set is provided in Figure 10.
This section describes the process of assigning these trip purposes to particular trips. This is not as simple a task as it may appear. The
field that plays the most important role is the answer to question Q1, “What activity did you do?”, which in the case of trips corresponds to the
description of the particular trip. However, there are also other features that have to be taken into consideration. For example, in some
cases we cannot decide on the right trip purpose without considering the personal characteristics of the respondent, her/his travel company
(Q3), for whom the respondent undertook the trip (Q4), or implications suggested by the entire travel pattern and the travel patterns of
the rest of her/his family. This again limited the possibility of using an automated algorithm, as will be discussed in the sections below.
The routine starts with finding trips that should be classified as “return home” (23). We can use simple logic here, since we
already have the information about home-located activities. In this algorithm we find all trips that precede a home-located activity and
set those trips as “return home”. A pseudo-code for this procedure is listed in Table 7 and it is marked by a star.
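The starred pseudo-code in Table 7 translates almost line by line into the following sketch. It assumes that each record stores the travel indicator Q5 and the at-home dummy described earlier, and it writes the result into Q1a. The purpose code is passed as a parameter because the text and the pseudo-code use 23 while the ID column of Table 7 lists 21 for “Return home”.

```python
def assign_return_home(records, return_home_code=23):
    """Set Q1a to the return-home code for every trip directly followed by an at-home activity."""
    for cur, nxt in zip(records, records[1:]):
        if cur["Q5"] == 1 and nxt["Q5"] != 1 and nxt["at_home"] == 1:
            cur["Q1a"] = return_home_code
    return records


# Hypothetical sequence: trip, activity at home, trip, activity away from home.
records = [
    {"Q5": 1, "at_home": 0},
    {"Q5": 0, "at_home": 1},
    {"Q5": 1, "at_home": 0},
    {"Q5": 0, "at_home": 0},
]
assign_return_home(records)
print(records[0].get("Q1a"), records[2].get("Q1a"))  # only the first trip leads home
```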
Determining the remaining trip purposes is more difficult. Since the variety of respondents’ descriptions of the trip purpose (Q1) is
large and the problem is rather complicated, only a semi-automated algorithm can be used. In a preliminary study we identified several
words or sets of words that are unique to one particular trip purpose. The words we used to assign trip purposes are listed in the
third column of Table 7. We compare the stated purpose of the trip (Q1) to each of these words and, in case of a positive match, we
assign the particular identification number.
Note that there are two different cases described in the table. We either look for the whole phrase (“to wal-mart”), or we look
for two separate words (“to” ”bank”). If we looked only for the contiguous phrase “to bank”, the algorithm would miss cases such
as “to PNC bank”, “to the bank” and similar. It would find only complete matches. Our preliminary analysis helped us to determine
which of these two cases is more suitable for each sequence of words.
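The difference between the two matching cases can be shown with a small sketch. A rule with a single term is matched as a contiguous phrase, while a rule with several terms requires every term to occur as a separate word anywhere in the answer. The rule list below contains only a few of the entries from Table 7, and the data structure and function name are assumptions made for illustration.

```python
import re

# (purpose ID, terms): one term means a whole phrase, several terms mean separate words.
RULES = [
    (11, ["to work"]), (11, ["to office"]),
    (12, ["to", "school"]), (12, ["to", "class"]),
    (31, ["to wal-mart"]), (31, ["shopping"]),
    (52, ["to post office"]), (52, ["to", "bank"]),
]


def match_purpose(description, rules=RULES):
    """Return the first matching purpose ID, or None if the trip needs manual assignment."""
    text = description.lower()
    words = re.findall(r"[a-z0-9'-]+", text)
    for code, terms in rules:
        if len(terms) == 1:
            hit = terms[0] in text                    # contiguous phrase
        else:
            hit = all(term in words for term in terms)  # separate words, any position
        if hit:
            return code
    return None


print(match_purpose("to PNC bank"))   # matched through the separate words "to" and "bank"
print(match_purpose("walk around"))   # no rule applies, left for manual assignment
```

Returning no value for unmatched answers keeps the procedure semi-automated: such trips remain unassigned until an expert reviews them.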
In order to make the procedure more versatile and user friendly, we designed a form in the MS Access environment that can be used
as an interface for this step. New words or phrases to look for can be easily added to this table. We should always keep in mind that
the words added should be really unique for each category.
Another procedure has to be used for a trip whose purpose is going to another travel mode (81). The algorithm we proposed is outlined in Table
7 and it is marked by two stars. First we find a sequence of two or more trips. The primary purpose of the last trip is already assigned
or it has to be assigned manually. The primary purpose of the preceding trips is set to 81 (going to another travel mode) and their
secondary purpose is set to the primary purpose of the last trip.
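The double-starred pseudo-code in Table 7 can be sketched in the same style. The sketch assumes the same hypothetical record layout as before, with Q5 as the travel indicator. For every activity that directly follows a trip, it walks backwards over the uninterrupted run of earlier trips and copies the primary purpose of the last trip into their secondary purpose; if the last trip has no purpose yet, the secondary purpose stays empty until it is assigned manually.

```python
def assign_mode_change(records, mode_change_code=81):
    """Mark all but the last trip in a chain of consecutive trips as going to another travel mode."""
    for i, rec in enumerate(records):
        if i == 0 or rec["Q5"] != 0 or records[i - 1]["Q5"] != 1:
            continue  # only activities directly preceded by a trip close a chain
        last_trip = records[i - 1]
        k = 2
        while i - k >= 0 and records[i - k]["Q5"] == 1:
            records[i - k]["Q1a"] = mode_change_code
            records[i - k]["Q1b"] = last_trip.get("Q1a")  # None if not yet assigned
            k += 1
    return records


# Hypothetical chain: at home, walk to the bus stop, bus to school (12), at school.
records = [
    {"Q5": 0},
    {"Q5": 1},
    {"Q5": 1, "Q1a": 12},
    {"Q5": 0},
]
assign_mode_change(records)
print(records[1]["Q1a"], records[1]["Q1b"])  # walk trip: primary 81, secondary 12
```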
However, this procedure did not work as expected. We often found cases in which the respondent omitted an activity between two
trips, mostly because the omitted activity was of short duration (picking up lunch, a delivery, and others). Rather than
assigning a trip purpose, we had to add the missing activity. For this reason all trips assigned number 81 (going to another travel mode)
should be manually checked during the second phase of this procedure. The additional effort, however, further improved
the database by adding missing activities. The automated rule makes it easy to find these records and decreases the time that must be
dedicated to fixing them.
Because of the large variety of answers, not all trips could be classified by our automatic routines. The rest remain for
manual trip purpose assignment. This step is rather time consuming, because it is important to look at the whole activity-travel
pattern. Doing so helps to understand the interdependencies among trips and activities and reduces the danger of misclassification.
Table 7: Description of trip purposes used for our analysis together with the phrases that are used for our semi-automated procedure.

ID | Trip Purpose | Look for
11 | Commute to work | "to work", "to office"
12 | Commute to school | "to" "school", "to" "class"
13 | Other work related travel (professional drivers, etc.) |
21 | Return home | *
31 | Shopping | "to weis", "to wal-mart", "to wegmans", "to target", "to store", "to lowes", "to giant", "shopping", "shop"
32 | Dining |
33 | Refreshment (coffee, drink, snack, etc.) | "coffee"
41 | Doctor Apt. / Other medical rel. | "to doctor", "to doctors", "to dentist"
42 | Other Appointment / Meeting |
51 | Escort (Picking Up / Dropping off others) |
52 | Errands (Banking, delivery, etc.) | "to post office", "to" "bank"
61 | Recreation and leisure / Exercise / Lessons / Personal Business | "to church", "walk dog"
71 | Visiting friends/family | "friend"
81 | Going to another travel mode | **

*  if Q5(i)=1 and Q5(i+1)<>1 and AtHome(i+1)=1 then set Q1_a=23
** if Q5(i)=0 and Q5(i-1)=1 and Q5(i-k)=1 then set Q1_a(i-k)=81 and set Q1_b(i-k)=Q1_a(i-1).
   Note: look for k=2,3,4,... until Q5(i-k)=0

Q1: What activity did you do?
Q1a: Primary trip purpose
Q1b: Secondary trip purpose (only for trip purpose 81)
Q5: Did you travel? 1 = YES, 0 = NO
Assign Activity Types
Almost all common transportation models require trip purposes, so assigning them was our first priority. However, the vast
majority of activity-based models also require particular activity types. This means that each activity will be
assigned one type from a limited list of types. The level of detail required by particular models varies. Within this preliminary study
we consider only five different activity types that were required by an activity based model that is currently being used at the
Pennsylvania State University (Kuhnau and Goulias, 2003). These activity types are Home (all in-home activities), Work (mandatory
activities that are usually conducted repeatedly and in the same time range, for example work or school activity), Shopping activities
(various shopping activities and errands such as banking), Recreational activities (for example leisure activities, visiting friends,
dining out, and others), and a category Others (doctor’s or other appointment, escort, and others).
Our task is now to assign each activity in the travel diaries to one of these five activity types. Since we have already determined the
purpose of each trip, we can use that information to assign the activity types. Consider the following example. An
activity that follows a trip with purpose shopping (denoted 31 in Table 7) is clearly a shopping activity. Similarly, an activity
following a return home trip belongs to the type Home.
The data set already records which activities are conducted at home, so we can directly assign them the type Home. For the
remaining activities in the diary we determine the type from the purpose of the preceding trip. When two activities follow each
other, we can assume that they are of the same type. Alternatively, since there are not many sequences of out-of-home activities,
all of them could be checked manually.
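
The assignment logic above can be sketched in Python. The grouping of purpose codes into the five types follows the description in the text, but the placement of codes 13 and 33 is an assumption, code 81 is deliberately left unmapped, and the record keys `Q5`, `Q1a`, and `at_home` as well as the function name are hypothetical.

```python
# Trip purpose code -> activity type. Codes 13 and 33 are placed by
# assumption; code 81 is left unmapped so that it yields None.
PURPOSE_TO_TYPE = {
    11: "Work", 12: "Work", 13: "Work",
    21: "Home",
    31: "Shopping", 52: "Shopping",
    32: "Recreation", 33: "Recreation", 61: "Recreation", 71: "Recreation",
    41: "Other", 42: "Other", 51: "Other",
}


def assign_activity_types(records):
    """Return one activity type per record (None for trips)."""
    types = []
    current = None  # type implied by the most recent trip
    for rec in records:
        if rec["Q5"] == 1:
            # Trip: remember the type implied by its purpose.
            current = PURPOSE_TO_TYPE.get(rec["Q1a"])
            types.append(None)
        elif rec["at_home"]:
            types.append("Home")
            current = None  # a later out-of-home activity needs its own trip
        else:
            # Out-of-home activity: type of the preceding trip; consecutive
            # activities inherit the same type.
            types.append(current)
    return types
```

An out-of-home activity that ends up with the type None has no usable preceding trip purpose and is a candidate for the manual check mentioned above.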
Final formatting
Before delivering a final database to the customer, we have to modify the format of the data set to fit the expectations of
particular transportation models. These changes are rather specific to each data set and to each customer (based on model
requirements), so we describe this step only briefly using a few examples.
Treatment of missing values
In the data set, the default value of every variable was set to 999, which corresponds to missing data. However, in some cases
people did not answer a question because it was not applicable. In this step we find these variables (and/or records) and change their
values from 999 to the desired final value, usually 0. For example, in the household survey we asked whether the household has an
Internet connection at home and, if so, about its type and speed. Automated procedures set the values of the latter two answers
when the household does not have an Internet connection at all. There were several similar cases in the data set.
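
As an illustration of this recoding step, the Internet example could be handled as in the following Python sketch. The variable names `has_internet`, `internet_type`, and `internet_speed` are hypothetical, since the actual names in the data set are not given here.

```python
MISSING = 999  # default value that marks missing data


def recode_not_applicable(household):
    """Return a copy in which skipped follow-up answers are set to 0."""
    fixed = dict(household)
    # The follow-up questions apply only when the household has a connection.
    if fixed.get("has_internet") == 0:
        for follow_up in ("internet_type", "internet_speed"):
            if fixed.get(follow_up) == MISSING:
                fixed[follow_up] = 0
    return fixed
```

Values of 999 for households that do have a connection are left untouched, because in that case the answer is genuinely missing.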
Format of the household surveys
Every household in the data set has exactly one record. Information about particular household members is stored as different
variables. Most transportation models require the data set at the person level, with each individual corresponding to one record. A
table in person-by-person format was produced for use in transportation models. This table also provides summaries of the total
number of trips and the total time spent traveling, each by trip purpose and travel mode.
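
The conversion from one record per household to one record per person can be sketched as follows. The column naming scheme (`age_1`, `licensed_1`, `age_2`, ...) and the limit of eight members are assumptions made for illustration only.

```python
def household_to_persons(household, max_members=8):
    """Split one household record into one record per household member."""
    persons = []
    for n in range(1, max_members + 1):
        age = household.get(f"age_{n}")
        if age is None:
            continue  # no member stored under this number
        persons.append({
            "household_id": household["household_id"],
            "person_no": n,
            "age": age,
            "licensed": household.get(f"licensed_{n}"),
        })
    return persons
```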
Format of the travel diaries
Many transportation models require information only about trips and not about activities. For this reason we derived a data set that
contains only trips. Diaries with no trips were included as well, with an indication that no trips were made.
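
A minimal sketch of this derivation, assuming the diaries are held as a dictionary from a person identifier to a list of records and that `Q5` marks trips as in Table 7; the `no_trips` indicator and the function name are hypothetical.

```python
def trips_only(diaries):
    """Return trip rows; diaries without trips get one placeholder row."""
    rows = []
    for person_id, records in diaries.items():
        trips = [r for r in records if r["Q5"] == 1]
        if trips:
            rows.extend({"person_id": person_id, "no_trips": 0, **t} for t in trips)
        else:
            # Keep the person in the data set with an explicit indicator.
            rows.append({"person_id": person_id, "no_trips": 1})
    return rows
```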
APPLICATION OF CHIRAC TO CENTRESIM
In this section we provide some preliminary results of applying the logical rules to the CentreSIM survey data set. Only the
data entered by February 2003 were used for this study. By this date, the data set consisted of 718 diaries and 10,602 records (the sum
of all activities and trips). During the cleaning process we had to eliminate 23 respondents (46 diaries) from the data set because their
activity diaries did not include some essential information. The most common error in the data set was that respondents combined
activities and trips in a single diary entry (episode misreporting). These records were split into several records, which led to an
increase in the number of records. After cleaning, the final data set consisted of 672 diaries and 10,702 records.
The set of logical rules was applied to the described data set. The number of records flagged by particular rules is provided in Table 8
and it is denoted as Before (before cleaning). A team of experts repaired the problematic records. At the end of the cleaning process,
the logical rules were applied again to the final data set to see how much improvement can be observed in the data set. These values
are denoted After (after cleaning) in Table 8.
The table shows that the number of flagged records was drastically reduced during data cleaning, but not eliminated in all cases.
This is because a flag does not necessarily mean an error or inconsistency in the data set. The rules were designed to
find all potential problems: rather than miss records that are actually wrong, they also flag some records that are correct.
The cases still flagged at the end of the cleaning process do not contain errors, and the share of correct cases among all flags
varies between the rules.
A typical example is rule B1, which finds activities longer than 6 hours (10 hours in the case of sleep, work, or study activities).
People usually do not conduct such long activities, so a flag can be a sign of an omitted activity. In some cases the rule also
helped us find other problems, such as records in which a.m. and p.m. hours were confused, or an activity that was clearly missing.
However, 113 records were still flagged after data cleaning. In these cases the activity really was that long, for example working
overtime without a lunch break, and the resulting flags do not indicate problems in the diaries. For this rule we were able to fix
only a small portion of the flagged records, and the rest are considered correct.
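
Rule B1 can be expressed as a short Python check. The thresholds (6 hours, or 10 hours for sleep, work, and study) come from the text; the record keys `activity`, `start`, and `end`, the 24-hour time format, and the function name are assumptions of this sketch.

```python
from datetime import datetime

# Activities with the 10-hour limit; all others use the 6-hour limit.
LONG_ACTIVITIES = {"sleep", "work", "study"}


def flag_b1(record):
    """Return True when an activity record is longer than rule B1 allows."""
    start = datetime.strptime(record["start"], "%H:%M")
    end = datetime.strptime(record["end"], "%H:%M")
    hours = (end - start).total_seconds() / 3600
    limit = 10 if record["activity"] in LONG_ACTIVITIES else 6
    # Only activities (Q5 == 0) are checked; trips are covered by other rules.
    return record["Q5"] == 0 and hours > limit
```

A flag raised by this check is only a prompt for review: as noted above, a long shift without a lunch break is flagged even though the record is correct.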
Rule B2, which finds trips longer than one hour, is an example in which the majority of the flagged records really needed some kind
of rectification. Such trips are unlikely and were mostly caused by episode misreporting, for which a remedial solution was found.
In the remaining five cases people reported hunting, an activity very popular in Pennsylvania during hunting season. We did not
modify these records since we did not have any additional information.
We describe several other examples below. Rule P3 finds schedules that contain fewer than five records. In the majority of cases
people skipped information about trips, mixed activities and trips into one record, or made other mistakes that were fixed. The
remaining 18 records do not contain any errors. In most of these cases respondents were traveling outside the study area, so their
only record would be the long distance trip. We kept these diaries in the data set.
Table 8: Summary of cleaning rules together with the number of marked records on the original (Before) and final (After) data set.

Basic rules
B1 | Activity hours too long
B2 | Trip time is longer than 1 hour and its distance shorter than 15 miles
B3 | End time of an activity is before beginning time of the same activity
B4 | Missing travel information
B5 | Travel speed limit
B6 | Travel distance limit
B7 | Travel speed limit and travel time is longer than 20 minutes
B8 | No location

Inconsistencies in trip chaining
T1 | Change location without travel: two consecutive activities are conducted at different locations (Activity-Activity)
T2 | The location of an activity does not equal the ending point of the previous trip (Trip-Activity)
T3 | The starting point of a trip does not equal the ending point of the preceding trip (Trip-Trip)
T4 | The starting point of a trip does not equal the location of the preceding activity (Activity-Trip)
T5 | Ending point of a trip equals the starting point of the same trip
T6 | Beginning time of an activity does not equal the end time of the previous activity

Daily patterns
P1 | The end time of the last activity in the day does not equal 11:59:00PM
P2 | First activity does not start at 12:00:00AM
P3 | There are fewer than five activities in the diary

Day to day transition
D1 | The location of the first activity in day 2 does not equal the location of the last activity in day 1

Household inconsistencies
H1 | Total number of persons in the household (Q23) does not equal the sum of persons stated in different age groups (Q24)
H2 | Total number of persons in the household (Q23) does not equal the number of people stated in the household questionnaire
H3 | Number of people differs between household form and the diaries
H4 | Person has driver's license with age less than 16
H5 | Person stated in the diary form that s/he drove, but s/he is not a licensed driver in the household form
H6 | Person does not have a driver's license but in the household form stated that s/he sometimes drives a car

Before and After counts, listed in the order they appear in the source:
Rules B1 to T2: 138, 113, 121, 5, 9, 0, 352, 474, 10, 262, 179, 230, 333, 30, 74, 25, 320, 0, 686, 0
Rules T3 to H6: 710, 299, 509, 241, 208, 85, 522, 0, 79, 0, 6, 40, 0, 18, 57, 0, 25, 0, 74, 0, 21, 2, 11, 0, 445, 0, 440, 0
Rule T5 finds trips that end at the same location as they started. This is usually a sign of episode misreporting (see example in
Figure 8). Such records were fixed accordingly. The remaining 85 cases correspond to people who went for a walk or took a dog for a
walk. There were no additional activities on their way, so these records are considered correct and remain unchanged.
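
The trip chaining rules T1 to T5 of Table 8 compare each record with the one before it. The following Python sketch shows one way to write them; it assumes that activities carry a `location` key, trips carry `origin` and `destination` keys, and `Q5` marks trips, none of which is prescribed by the original procedure.

```python
def flag_trip_chaining(records):
    """Return (index, rule) pairs for records that break rules T1 to T5."""
    flags = []
    for i, rec in enumerate(records):
        is_trip = rec["Q5"] == 1
        # T5: a trip that ends where it started.
        if is_trip and rec["origin"] == rec["destination"]:
            flags.append((i, "T5"))
        if i == 0:
            continue  # the first record has no predecessor to compare with
        prev = records[i - 1]
        prev_is_trip = prev["Q5"] == 1
        if not is_trip and not prev_is_trip:
            if rec["location"] != prev["location"]:
                flags.append((i, "T1"))  # Activity-Activity
        elif not is_trip and prev_is_trip:
            if rec["location"] != prev["destination"]:
                flags.append((i, "T2"))  # Trip-Activity
        elif is_trip and prev_is_trip:
            if rec["origin"] != prev["destination"]:
                flags.append((i, "T3"))  # Trip-Trip
        elif rec["origin"] != prev["location"]:
            flags.append((i, "T4"))      # Activity-Trip
    return flags
```

A dog walk that starts and ends at home is flagged by T5 in this sketch, which matches the 85 correct records that remain flagged after cleaning.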
Rules B8 and B4 found records with missing location or travel information. In some cases we succeeded in filling in the missing
information based on similar records in the rest of the diary, in the same respondent's diary for the other day, or in the
diaries of the rest of her/his family.
The rule H3 finds households for which the number of received diaries does not correspond to the number of people stated in the
questionnaire. We do not have the complete information about the entire household in these cases. We kept these households in the
data set, even though their exclusion can be considered for some activity-based models.
The flags raised by rules H2, H5, and H6 were mostly caused by missing data. Often, people did not indicate whether or not they
have a driving license. These records were fixed since we had the information that these people actually drove.
The data rectification for rules T3 and T4 was still ongoing at the time of writing. All the remaining flags correspond to
typographical errors. Such errors cause a mismatch when two location strings are compared, but they do not imply a missing trip or
activity. In our analysis we aimed to fix first those records that actually correspond to a problem in the activity patterns. The
rectification for these rules will be completed by the time of the TRB meeting.
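
To illustrate why typographical errors raise flags in string comparisons, the following Python sketch contrasts an exact comparison with a tolerant one. The tolerant comparison is not part of the procedure described in this paper; it uses the standard `difflib.SequenceMatcher`, and the threshold of 0.85 is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def same_location(a, b, threshold=0.85):
    """Return True when two location strings are nearly identical."""
    # Normalize case and whitespace before comparing.
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold
```

An exact comparison treats "123 Main Street" and "123 main  stret" as different places, whereas this check treats them as the same location. Pairs accepted this way would still need review by the coder before any record is changed.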
CONCLUSIONS AND RECOMMENDATIONS
In this paper, we aimed to provide a general guidance for the collection and cleaning of activity-based surveys. It is based on our
practical experience. We emphasized issues that anybody working on a similar project is likely to face. The process starts with
receiving completed surveys from the respondents. These surveys contain many inconsistencies and errors. The framework describes
the data entering process as well as the process that eliminates all these inconsistencies. We described the entire process, the problems
we were facing, as well as suggestions about solutions.
We succeeded in finding inconsistencies in the data and significantly reducing their number in the resulting database. The final data
set is ready to be used in transportation planning models, including those that are activity-based. The major objective of this process,
minimizing the discrepancies between the observed patterns and the patterns recorded in the final data set, was also reached. All the
changes made to the data set were based on expert knowledge, so no automated routine biased the modifications of the data set.
However, we identified several improvements to the proposed framework during our work on the project. These improvements have
been implemented in the current stage of our data entry process, Lottery II, and are strongly recommended for future applications
of similar systems in practice. They are meant to decrease the number of problems in the received surveys as well as to reduce the
amount of time required for the entire process of data entry and data cleaning. In addition to the many details of the procedure
followed here (e.g., thinking of data in a hierarchical format, developing explicit rules and recording all the modifications,
and comparing records within a person's time use pattern and with other household members' time use patterns), some additional
major changes and recommendations are listed below.
1. Preprocessing of the received diaries by an expert
The rather large number of flags in Table 8 suggests that a large amount of time was spent on cleaning the data. The biggest
problem was the treatment of episode misreporting, in which case new record(s) had to be entered and usually a large portion of the
diary modified. For this reason we suggest introducing a preprocessing phase. After the diaries are received, they should be
inspected by an expert, who would clearly mark any necessary changes before the diaries are entered into the data set. It takes
considerably less time to check the diaries before entry than to look for these problems and try to fix them afterwards.
Preprocessing also helps to find additional problems that can be treated before data entry. Looking for these problems on the paper
form is easier, since the entire time use pattern of a person and the household can be observed at once. The proposed automated
rules should still be used for verification purposes and as a tool to find any additional problems.
2. An improved example of activity diary sent to respondents
In the motivational part of this paper we emphasized the importance of an introductory letter as well as a good example. Although
we focused on this problem, further improvement is possible. For this reason, in Lottery II we insert an extra sheet with an
example in the survey materials that are mailed to the respondents. This example additionally provides a graphical representation
of a daily pattern in order to improve respondents' understanding of the required format. Each activity is represented
as a node and each trip as a link between the nodes. The correctly filled diary that corresponds to this pattern is provided as
well.
Since this example is provided on a separate sheet of paper inserted in the package, it can be easily used when the respondents are
filling in their diaries. They can have this sheet in front of them the entire time and they do not have to browse through the form in
order to find this particular example.
3. Enter home and work location (and other known places) into the data set prior to the entry of diaries.
In the diaries, we asked the respondents to record their home, work, school, or other often repeated locations. Having this information
in the data set reduces the number of typographical errors, increases the speed of data entry, and also eliminates the step of assigning
home locations. For this reason, prior to entering the diaries for each household, the frequently repeated locations will be entered in
the data set. Consequently, while entering particular activities in the diaries, the coder can directly choose one of the previously
entered locations and avoid writing the address over again and so reduce the probability of typographical errors.
To conclude, the final data set is rather consistent if we take into consideration the complexity of activity-travel diaries. The
proposed framework ensures that the modifications of the originally reported diaries are minimal. In most cases the original answers
are kept together with newly recoded variables. An example is the activity type assigned to each record. Only a limited number of
activity types is needed for calibration of the CentreSIM model. However, since we keep the original answers, the activity types can
easily be modified to fit the needs of any other activity-based model. The final data set is therefore versatile and consistent at
the same time, which was one of the objectives of the developed process. We succeeded in finding episode misreporting (as described
in the introduction to this paper), which is essential since it affects the number of reported activities and trips and their
sequencing, which are important elements in activity-based approaches to travel demand modeling.
ACKNOWLEDGEMENTS/DISCLAIMERS
Funding for this paper was provided by the federally funded Mid-Atlantic Universities Transportation Center (MAUTC) and the
Center for Intelligent Transportation Systems (CITRANS) at the Pennsylvania State University. Partial funding for the CentreSIM
survey is provided by the Pennsylvania Department of Transportation (PennDOT) through a contract with McCormick Taylor and
Associates (MTA). The survey was conducted at the Transportation Survey Research Center at the Pennsylvania Transportation
Institute (PTI) by a team that includes Mark Hallinan, James Lee, Devani Perera, Aviroop Mukherjee, Julie Whitt, and
Brian Hoffheins. Their dedication is gratefully acknowledged. Li Guan at PTI programmed the database entry software for the
CentreSIM survey.
The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the data presented
herein. The contents do not necessarily reflect the official views or policies of the Commonwealth of Pennsylvania at the time of
publication. This paper does not constitute a standard, specification, or regulation of the Pennsylvania Department of Transportation.
REFERENCES:
Arentze, T.A., F. Hofman, N. Kalfs, and H.J.P.Timmermans (1999), (SYLVIA) System for Logical Verification and Inference of
Activity Diaries, Transportation Research Record, 1660, pp. 156-163.
Arentze T. and H. Timmermans (2000) Albatross: A Learning-Based Transportation Oriented Simulation System. European Institute of
Retailing and Services Studies, Technical University of Eindhoven, Eindhoven, NL.
Axhausen, K.W. (1995) Draft - Travel Diaries: An Annotated Catalogue (2nd edition). http://www.fhwa.dot.gov/ohim/trb/reports.htm accessed June 2003.
Becker G. S. (1976) The Economic Approach to Human Behavior. The University of Chicago Press, Chicago, IL.
Bhat C. and F.S. Koppelman (1999) Activity-based modeling of travel demand. In Handbook of Transportation Science (ed. R.W. Hall),
Kluwer, Boston, MA. pp. 35-61.
Chapin F. S. Jr. (1974) Human Activity Patterns in the City: Things people do in time and space. Wiley, New York, NY.
Golledge R.G. and R. J. Stimson (1997) Spatial Behavior: A Geographic Perspective. The Guilford Press, New York, NY.
Goulias K.G. and T. Kim (2003) Analysis of the Puget Sound Transportation Panel Survey Database in Waves 1-9 Draft Final Report.
Submitted to the Puget Sound Regional Council. Seattle, WA.
Hagerstrand T. (1970) What about people in regional science? Papers of the Regional Science Association, 10, pp. 7-21.
Jones P., F. Koppelman, and J Orfeuil (1990) Activity analysis: State-of-the-art and future directions. In Developments in Dynamic
and Activity-Based Approaches to Travel Analysis. A compendium of papers from the 1989 Oxford Conference (ed. P. Jones).
Avebury, UK. Pp. 34-55.
Kitamura R. (1988) An evaluation of activity-based travel analysis. Transportation 15, 9-34.
Kuhnau J. and K.G. Goulias (2003) Centre SIM: First-generation Model Design, Pragmatic Implementation, and Scenarios, Chapter
15 in Transportation Systems Planning: Methods and Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 16-1 to
16-14.
Manheim, M.L. (1980) Fundamentals of Transport System Analysis – Vol. 1: Basic Concepts. MIT Press. Boston, Massachusetts.
McNally M. G. (2000) The Activity-based Approach. In Handbook of Transport Modelling (eds. D.A. Hensher and K.J. Button).
Pergamon, Amsterdam, NL. pp. 113-128.
Patten M.L. and K.G. Goulias (2003) Integrated Survey Design for a Household Activity-Travel Survey in Centre County,
Pennsylvania. Paper submitted for presentation at the 83rd annual Transportation Research Board Meeting, and Publication in the
Transportation Research Record, Washington, D.C., January 11-15, 2004.
Pendyala R. (2003) Time use and travel behavior in space and time. Chapter 2 in Transportation Systems Planning: Methods and
Applications. Edited by K.G. Goulias, CRC Press, Boca Raton, FL, pp. 2-1 to 2-37.
Richardson, A.J., Ampt, E.S. and Meyburg, A.H. (1995). Survey Methods for Transport Planning, Eucalyptus Press, Melbourne.
Stecher, C., S. Bricka, and L. Goldenberg. Travel Behavior Survey Data Collection Instruments. In Conference Proceedings 10:
Conference on Household Travel Surveys: New Concepts and Research Needs, TRB, National Research Council, Washington, D.C.,
1996, pp. 154-169.
Stopher P. and P. Jones eds. (2003) Transport Survey Quality and Innovation, Pergamon, Amsterdam, NL.
Transportation Research Board (2000) Transport Surveys: Raising the Standard. Proceedings of an International Conference on
Transport Survey Quality and Innovation, May 24-30, 1997, Grainau, Germany. Transportation Research Circular, Number E-C008,
TRB, National Research Council, Washington, D.C.