What about the whole country?

Extending the Activity-Based Person-Trip Synthesizer to all 330
million Americans
Judy Sun ’14 & Luke Cheng ’14
ORF467 F13
The Process
• Generate Schools
• Generate Employee Patronage File
• Assign Patronage
• Generate Patronage-Employee Ratios
• A Look at the Data
• Generate Census File (with Microsoft Access)
• NN Files through 7 NJ Modules by Jake and Talal
• Trip File Generator: Out-of-State commuters, students, workplace assignment, 18 Tour Type (Activity Patterns) assignment, Temporal Dimension
Roadmap
• Schools Data
• Employee-Patronage Data
• A Look at the Data
• Census Data
• Further Steps
Schools Data
Public Schools in the US
Quick stats on Public Schools (2011)
[Figure: bar chart, Number of Schools in US by School Type, PUBLIC vs. CHARTER]

School Type    # of CHARTER    # of PUBLIC      Total
Primary               2,584         51,793     54,377
Middle                  615         16,332     16,947
High                  1,316         19,762     21,078
Other                 1,145          5,847      6,992
No Answer               564          3,525      4,089
Total                 6,224         97,259    103,483
Public Schools: Enrollment
[Figure: bar chart, enrollment by School Type, PUBLIC vs. CHARTER]

School Type      CHARTER        PUBLIC         Total
Primary          896,544    23,226,606    24,123,150
Middle           166,519     9,425,155     9,591,674
High             368,109    13,767,489    14,135,598
Other            626,562     1,289,050     1,915,612
No Answer        (1,128)       (7,016)       (8,144)
Total          2,056,606    47,701,284    49,757,890
Private Schools in the US
[Figure: bar chart, Number of Schools in US by Type]

Type         Number of Schools
Primary                 18,400
Secondary                2,517
Combined                 7,300
Total                   28,217
Private Schools: Enrollment
[Figure: bar chart, # students by Type]

Type         # students
Primary       2,134,007
Secondary       738,600
Combined      1,431,252
Total         4,303,859
Private Schools: School Size
[Figure: histogram, Number of Schools (0 to 600) against School Size in number of students (0 to 2,000)]
Post-secondary schools (2009)
[Figure: bar chart, Number of Schools by Institution type]

Institution type              # of Students Enrolled    % of Total Students    Number of Schools
Graduate                                         291                     0%                  350
Primarily Baccalaureate                    1,483,018                    93%                2,169
Primarily Non-Bacc                            53,903                     3%                  623
Associate's                                   49,263                     3%                1,745
Nondegree-granting postbac                        17                     0%                   14
Nondegree-granting pre-bac                    10,960                     1%                2,698
Total                                      1,597,452                   100%                7,735
Employee-Patronage Data
The Process
• 2012 InfoGroup US Businesses File (5.80 GB)
• 30 CSV files with 500,000 entries each (~200 MB each), split with a Shell Script
• 30 CSV files after patronage generation, data cleaning and mapping (~115 MB each), produced with an R Script
• 1,570 Segmented State Files (1 KB to 20 MB), produced with an R Script
• 51 Merged State Files (8 MB to 390 MB), produced with a Python Script
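The split, segment and merge steps above can be sketched in a few lines. This is only an illustration under assumptions: the original pipeline used a shell script and R scripts for the first steps, whereas this sketch does everything in Python, and the function names and the `state` column name are hypothetical, not taken from the InfoGroup file.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_chunk=500_000):
    """Split one large CSV into numbered chunk files, repeating the header."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    handle = None
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                # Start a new chunk file every rows_per_chunk rows.
                if handle is not None:
                    handle.close()
                path = out_dir / f"chunk_{i // rows_per_chunk:02d}.csv"
                handle = open(path, "w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                chunk_paths.append(path)
            writer.writerow(row)
    if handle is not None:
        handle.close()
    return chunk_paths


def merge_by_state(chunk_paths, out_dir, state_col="state"):
    """Route every row of every chunk into one CSV per state.

    state_col is an assumed column name; adjust it to the real file layout.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, writers = {}, {}
    try:
        for path in chunk_paths:
            with open(path, newline="") as f:
                reader = csv.DictReader(f)
                for row in reader:
                    state = row[state_col]
                    if state not in writers:
                        # First row seen for this state: create its file.
                        handles[state] = open(out_dir / f"{state}.csv", "w", newline="")
                        writers[state] = csv.DictWriter(handles[state], fieldnames=reader.fieldnames)
                        writers[state].writeheader()
                    writers[state].writerow(row)
    finally:
        for h in handles.values():
            h.close()
    return sorted(handles)
```

Keeping one open file per state is cheap here because there are only 51 output files.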
Patronage Generation
• Previous Process: Manual Fine-Tuning
  • Inconsistent: Same NAICS Code, Different Patronage/Employee Ratio
• Current Process: Employee Size Range, Sales Volume Range
  • Not Perfect Data
  • Matching businesses (Zip, County, NAICS, Lat/Long)
  • Same Employee Size Range
  • Assumption: Sales Volume stays the same across time
  • Trying to acquire the 2005 Data for better correlations
  • Ratios from Averaging Previous EP file
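A minimal sketch of the current process, assuming each business record carries an employee size range, a sales volume range, an employee count and (in the previous E-P file) a patron count. The field names and function names are hypothetical; the actual R script may differ.

```python
from collections import defaultdict


def average_ratios(prev_ep):
    """Average patrons-per-employee for each (employee range, sales range) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of ratios, count]
    for b in prev_ep:
        if b["employees"] <= 0:
            continue  # a ratio is undefined without employees
        key = (b["emp_range"], b["sales_range"])
        sums[key][0] += b["patrons"] / b["employees"]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def assign_patronage(businesses, ratios):
    """Estimate patrons for new records; None when no matching ratio exists."""
    out = []
    for b in businesses:
        ratio = ratios.get((b["emp_range"], b["sales_range"]))
        patrons = None if ratio is None else round(ratio * b["employees"])
        out.append({**b, "patrons": patrons})
    return out


# Made-up records, for illustration only.
prev_ep = [
    {"emp_range": "1-4", "sales_range": "A", "employees": 2, "patrons": 4},
    {"emp_range": "1-4", "sales_range": "A", "employees": 4, "patrons": 24},
]
ratios = average_ratios(prev_ep)
new = assign_patronage(
    [{"emp_range": "1-4", "sales_range": "A", "employees": 3}], ratios
)
```

Returning `None` for unmatched combinations keeps the gaps visible instead of silently filling them with a default ratio.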
Comparison: Distributions
Conclusion: Need to use NAICS Codes, in addition to employee size and sales volume
A large number of ratio values in the 0-1 range are offset by values in the 7-20 range, so the averages pile up around 4-5. It is difficult to capture nuances with just employee size and sales volume.
Next Steps: Manpower needed to assign a ratio for each NAICS Code, Sales Volume, Employee Size combination
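The averaging effect can be reproduced with a toy calculation. The ratios and the two code labels below are made up purely for illustration; they are not values from the E-P file.

```python
from statistics import mean

# Hypothetical patronage/employee ratios for two NAICS codes that happen
# to share the same employee size range and sales volume range.
ratios_by_code = {
    "code_A": [0.5] * 6,   # low-patronage businesses
    "code_B": [10.0] * 4,  # high-patronage businesses
}

# Pooling by size and sales only: one average for both kinds of business.
pooled = [r for values in ratios_by_code.values() for r in values]
pooled_mean = mean(pooled)

# Adding the NAICS code to the key keeps the two groups apart.
per_code_mean = {code: mean(values) for code, values in ratios_by_code.items()}
```

The pooled mean lands between the two groups, at a value that describes neither kind of business, while the per-code means recover both.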
A Look at the Data
NJ Counties (Change in NJ EP File)
[Figure: two panels, Uncensored and Un-Named Removed]

NJ Wide
Metric            Uncensored    Un-Named Removed
No. Businesses       +73,500             +39,350
Tot Emp                +4.8M               +4.8M
Emp Size               +7.85               +9.09
Tot Patrons            -4.9M               -5.3M
Avg Patrons           -17.17              -16.29
Nation-Wide
Rank  State           Sales Volume  No. Businesses  Total Employees  Avg Employee Size  Total Patrons  Average Patrons
1     California            $1,889       1,579,342       23,518,022              14.89     36,820,129            23.31
2     Texas                 $2,115         999,331       17,624,235              17.64     24,846,695            24.86
3     Florida               $1,702         895,586       12,331,524              13.77     21,231,864            23.71
4     New York              $1,822         837,773       18,327,933              21.88     19,610,813            23.41
5     Pennsylvania          $2,134         550,678       10,498,442              19.06     13,704,903            24.89
9     New Jersey            $1,919         428,596        8,833,890              20.61      9,986,529            23.30
45    Washington DC         $1,317          49,488        5,702,617             115.23      1,067,938            21.58
47    Rhode Island          $1,814          46,503        1,117,140              24.02      1,201,124            25.83
48    North Dakota          $1,978          44,518          492,547              11.06      1,021,077            22.94
49    Delaware              $2,108          41,296          670,622              16.24      1,011,400            24.49
50    Vermont               $1,554          39,230          379,291               9.67        821,193            20.93
51    Wyoming               $1,679          35,881          340,342               9.49        772,090            21.52
Census Data
Inputs
• 2010 Census Summary File 1
  • http://www2.census.gov/census_2010/04-Summary_File_1/
  • Does not convert to CSV/TXT; Files made for MS Access
  • Process Tables (P12, P16, P29, H13, P43) with Talal’s VBA macro in MS Access (p.78)
  • VBA Code: whereabouts unknown, perhaps with Prof K
• 2012 5-Year Census American Community Survey
  • http://www2.census.gov/acs2012_5yr/summaryfile/
  • Income Data to assign incomes to households and residents
Generation
• Module 1: Outputs resident file for each county in state
  • Rows: Individual People
  • Attributes/Columns: County Number (replace with State Number_County Number for national file), Household ID, Household Type, Lat/Long, ID Number, Age, Sex, Traveler Type, Income Bracket
• Module 2: Out of state/region/nation nodes
• For commentary on the code, see p.17-19 of http://www.princeton.edu/~alaink/Orf467F12/MuftiTripSynthesizer_v.1.pdf
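One possible shape for a row of the resident file, with the county number replaced by a combined state and county identifier for the national file. The field types, the zero-padding widths and the helper name `national_county_id` are assumptions for illustration; the existing modules may encode these attributes differently.

```python
from dataclasses import dataclass


def national_county_id(state_number, county_number):
    """Build a 'State Number_County Number' key, e.g. 34 and 21 -> '34_021'.

    Padding to 2 and 3 digits is an assumption, not a documented format.
    """
    return f"{state_number:02d}_{county_number:03d}"


@dataclass
class Resident:
    county_id: str        # State Number_County Number for the national file
    household_id: int
    household_type: int   # assumed to be a coded category
    lat: float
    lon: float
    person_id: int
    age: int
    sex: str
    traveler_type: int    # assumed to be a coded category
    income_bracket: int   # assumed to be a coded category


# Hypothetical resident, for illustration only.
jane = Resident(national_county_id(34, 21), 1, 2, 40.35, -74.66, 1, 8, "F", 1, 5)
```

A string key avoids collisions between counties that share the same county number in different states.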
Further Steps
What To Do Next?
• Patronage Generation with NAICS, Sales Volume, Employee Size and Research: Low Difficulty
  • I already generated a file mapping all NAICS codes to employment counts and payrolls for patronage assignment, using 2010 Census Data (200K entries)
• Census Data Generation and Rework of NN Generation Modules: High Difficulty
• Optional: Data Verification for Employee-Patronage Files
Modules
• Very hard-coded for NJ; not very well-commented
• Initial National Implementation Ideas:
  • Treat US as one entity with external nodes at airports to represent foreigners
    • Problem: Computationally intensive for 330M people
    • Solution: Do a semi-randomized sample
  • Regionalize the US and use out-of-region external nodes
    • Less labor-intensive; allows parallel processing
  • Doing each state
    • Problem: Hard to generalize code, out-of-state nodes
    • Extremely labor-intensive
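The slides do not define "semi-randomized", so the sketch below swaps in one plausible reading: a stratified random sample that draws the same fraction of residents from every county and attaches an expansion weight. The function name, the `county` key and the weighting scheme are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(residents, fraction, rng, key="county"):
    """Sample the same fraction from each county, keeping at least one person.

    Each sampled resident gets a weight: how many real residents it stands for.
    """
    strata = defaultdict(list)
    for r in residents:
        strata[r[key]].append(r)
    sample = []
    for county in sorted(strata):
        group = strata[county]
        k = max(1, round(fraction * len(group)))
        weight = len(group) / k
        for r in rng.sample(group, k):
            sample.append({**r, "weight": weight})
    return sample


# Made-up population: 100 residents in county A, 10 in county B.
population = [{"id": i, "county": "A"} for i in range(100)]
population += [{"id": 100 + i, "county": "B"} for i in range(10)]
sample = stratified_sample(population, 0.1, random.Random(467))
```

Sampling within each county keeps small counties represented, which a plain random sample of 330M people would not guarantee.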
The Code: Thought Process
• Trips generated state-by-state
  • Use state-level demographic information on residents
  • Ignore state-level boundaries since we have employer and attraction information for the nation.
• Example:
  • John Smith lives in NYC and works in CT.
  • We will get his household from the NYC Census file and the probability distribution of his workplace from the CT E-P file.
  • When we map NYC Trips, we will see John Smith going to CT for work. When we map CT Trips, we will see John Smith returning from work.
• Trip destinations can be approximated using destination county centroids
  • Requires assigning a centroid to each county
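One simple way to assign a centroid to each county is to average the coordinates of the points already known inside it, for example the business locations in the E-P file. This is an assumption about how centroids might be obtained, not a method stated in the slides; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def county_centroids(points):
    """Average lat/lon per county from (county, lat, lon) tuples.

    A plain arithmetic mean is a rough approximation that is reasonable
    for county-sized areas away from the poles and the antimeridian.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # county -> [lat sum, lon sum, n]
    for county, lat, lon in points:
        s = sums[county]
        s[0] += lat
        s[1] += lon
        s[2] += 1
    return {county: (s[0] / s[2], s[1] / s[2]) for county, s in sums.items()}


# Made-up coordinates, for illustration only.
centroids = county_centroids([
    ("county_1", 40.0, -74.0),
    ("county_1", 41.0, -73.0),
    ("county_2", 34.0, -118.0),
])
```

Averaging business locations biases the centroid toward where activity is, which may suit trip destinations better than a geometric center.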
The Code: Thought Process
• Workplace assignment (without replacement):
  • Census maps individuals to workplace
    • John Smith lives in NYC and works in CT
  • Use distribution to match workplace to E-P file (keep a count of employees to match the number given)
    • John Smith mapped to an employer in CT
  • If more than x (e.g. 250) miles, assume arrival at airport
• School Assignment (without replacement):
  • Use bounds and distribution to match students with schools (assume same county)
    • Jane (8) is mapped to elementary school in her county
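Workplace assignment without replacement can be sketched as drawing from employers in proportion to their remaining open positions and decrementing the count after each draw. The record layout, the function names and the weighting by open slots are assumptions for illustration; school assignment could follow the same pattern with enrollment as the capacity.

```python
import math
import random


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    radius = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def assign_workplaces(workers, employers, rng, airport_miles=250.0):
    """Map each worker to an employer in the worker's work county.

    Returns {worker id: (employer id, arrives_by_air)} or None when every
    employer in that county is already full.
    """
    by_id = {e["id"]: e for e in employers}
    open_slots = {e["id"]: e["employees"] for e in employers}
    assignments = {}
    for w in workers:
        candidates = [
            eid for eid, n in open_slots.items()
            if n > 0 and by_id[eid]["county"] == w["work_county"]
        ]
        if not candidates:
            assignments[w["id"]] = None
            continue
        # Larger employers with more open positions are more likely picks.
        weights = [open_slots[eid] for eid in candidates]
        eid = rng.choices(candidates, weights=weights, k=1)[0]
        open_slots[eid] -= 1  # without replacement: the position is now taken
        e = by_id[eid]
        dist = haversine_miles(w["lat"], w["lon"], e["lat"], e["lon"])
        assignments[w["id"]] = (eid, dist > airport_miles)
    return assignments


# Made-up records, for illustration only.
employers = [
    {"id": "e1", "county": "CT_1", "employees": 1, "lat": 41.05, "lon": -73.54},
    {"id": "e2", "county": "CT_1", "employees": 1, "lat": 41.31, "lon": -72.92},
]
workers = [
    {"id": f"w{i}", "work_county": "CT_1", "lat": 40.71, "lon": -74.01}
    for i in range(3)
]
result = assign_workplaces(workers, employers, random.Random(0))
```

With two open positions and three workers, the third worker is left unassigned, which is the signal that the employee counts and the census commuting counts disagree.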
The Code: Thought Process
• Tour Type assignment and Temporal Dimension
  • Can try to repurpose Talal’s code
  • Add in Time Zones in Temporal Dimension
  • Can do this with replacement (patrons)
  • Assumptions: Same behavior across states in terms of work time and leisure time and activity patterns
• Out-of-Country Commuters / Non-Resident Workers
  • International nodes for the states along the Canadian and Mexican borders
  • Trip to the nearest border crossing
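Adding time zones to the temporal dimension could be as simple as shifting each local departure time onto a common clock. The sketch below assumes departure times are stored as seconds after local midnight and uses one standard-time UTC offset per state; it ignores daylight saving time and states that span two zones, and the table and function name are hypothetical.

```python
# Standard-time UTC offsets in hours for a few states (illustrative subset).
UTC_OFFSET_HOURS = {"NJ": -5, "IL": -6, "CO": -7, "CA": -8}


def to_utc_seconds(local_seconds, state):
    """Convert seconds after local midnight to seconds after UTC midnight.

    The result is not wrapped at 24 hours, so trips keep their true order.
    """
    return local_seconds - UTC_OFFSET_HOURS[state] * 3600


# The same 8:00 local departure happens at different moments nationally.
trips = [("CA", 8 * 3600), ("NJ", 8 * 3600), ("IL", 8 * 3600)]
national_order = sorted(trips, key=lambda t: to_utc_seconds(t[1], t[0]))
```

Under the stated assumption that work and leisure patterns are the same across states, the local times stay identical and only this shift differs.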