Principles of Workflow in Data Analysis

Scott Long
Indiana University
November 2010

What is "workflow"?

1. A coordinated framework for conducting data analysis
2. WF involves coordinated procedures for:
   o Planning, organizing and documenting research
   o Cleaning data
   o Analyzing data
   o Presenting results
   o Backing up and archiving materials

You already have a workflow (WF)

1. Your WF might be:
   A. Planned and carefully orchestrated.
   B. Ad hoc, piecemeal, developed in reaction to mistakes.
2. You can improve your WF with a modest investment of time.
   A. The less experience you have, the easier it is.
   B. It will save you time and make you a better data analyst.

Why should you care about workflow?

1. Replication
   o Replication is essential for good science.
   o An effective workflow is essential for replication.
2. Getting the right answers
   o Retractions are embarrassing and can end careers.
3. Time
   o "Science is a voracious institution."
   o An effective workflow makes you more efficient.
4. Errors are inevitable; an effective workflow helps you find and fix them.
5. Gaining the IU advantage
   "The publication of [The Workflow of Data Analysis Using Stata] may even reduce Indiana's comparative advantage of producing hotshot quant PhDs now that grad students elsewhere can vicariously benefit from this important aspect of the training there." Gabriel Rossman on his blog

Origins of the workflow project

1. Easy things: consulting on easy things, instead of hard things.
2. Incorrect results with clever "explanations".
3. A dissertation delayed 18 months to determine why results changed.
4. Irreproducible results from a single, 743-line do-file.
5. Analyzing the wrong dataset: "The datasets are exactly the same except that I changed the married variable."
6. Analyzing the wrong variable while writing an NAS report.
7. Miscoded genes that delayed progress in a study of alcoholism.
8. Collaborations that multiply the ways things can go wrong.
9. Misleading or ambiguous output such as...
Example 1: definitely a problem in a $3M study

. tabulate female sdchild_v1

      R is |      Q15 Would let X care for children
   female? |   Defintel   Probably   Probably  Definitel |     Total
-----------+---------------------------------------------+----------
     0Male |         41         99        155        197 |       492
   1Female |         73         98        156        215 |       542
-----------+---------------------------------------------+----------
     Total |        114        197        311        412 |     1,034

Example 2: which number is which?

. tab occ ed, row

           |                       Years of education
Occupation |      3      6      7      8      9     10     11     12     13 |   Total
-----------+----------------------------------------------------------------+--------
    Menial |      0      2      0      0      3      1      3     12      2 |      31
           |   0.00   6.45   0.00   0.00   9.68   3.23   9.68  38.71   6.45 |  100.00
-----------+----------------------------------------------------------------+--------
   BlueCol |      1      3      1      7      4      6      5     26      7 |      69
           |   1.45   4.35   1.45  10.14   5.80   8.70   7.25  37.68  10.14 |  100.00
-----------+----------------------------------------------------------------+--------
     Craft |      0      3      2      3      2      2      7     39      7 |      84
           |   0.00   3.57   2.38   3.57   2.38   2.38   8.33  46.43   8.33 |  100.00
-----------+----------------------------------------------------------------+--------
  WhiteCol |      0      0      0      1      0      1      2     19      4 |      41
           |   0.00   0.00   0.00   2.44   0.00   2.44   4.88  46.34   9.76 |  100.00
-----------+----------------------------------------------------------------+--------

Example 3: good software doing things badly

. logit tenure i.female i.female#c.articles i.male i.male#c.articles, nocons
note: 0.male#c.articles omitted because of collinearity
note: 1.male#c.articles omitted because of collinearity

------------------------------------------------------------------------------
      tenure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -2.473265   .1351561   -18.30   0.000    -2.738166   -2.208364
             |
     female# |
  c.articles |
          0  |   .0980976   .0098808     9.93   0.000     .0787316    .1174636
          1  |   .0421485   .0098962     4.26   0.000     .0227524    .0615447
             |
      1.male |  -2.693147   .1170916   -23.00   0.000    -2.922642   -2.463651
             |
       male# |
  c.articles |
          0  |  (omitted)
          1  |  (omitted)
------------------------------------------------------------------------------

Did StataCorp read the WF book?

Why learning WF is difficult

1. Tacit knowledge
2. Heavy lifting
3. Time to practice
What is tacit knowledge?

1. Explicit knowledge is the stuff of textbooks and articles.
2. Tacit knowledge is implicit and undocumented (Michael Polanyi).
   A. People are unaware of their essential tacit knowledge.
      o Henry Bessemer's patent for making steel didn't work (1855)
   B. Tacit knowledge is transferred "at the bench".
      o Personal computers impede the transfer of tacit knowledge.

Data analysis includes a lot of heavy lifting

"The reality, of course, today is that if you come up with a great idea you don't get to go quickly to a successful product. There's a lot of undifferentiated heavy lifting that stands between your idea and that success." Jeff Bezos, amazon.com

Undifferentiated heavy lifting
The foundation of WF is ironical optimism

The universal aptitude for ineptitude makes any human accomplishment an incredible miracle. Dr. John Paul Stapp

The Workflow of Data Analysis Using Stata

1. Makes tacit knowledge about WF explicit.
2. It deals with a lot of undifferentiated heavy lifting.
3. It contains specifics on the general issues discussed today.
4. The book focuses on tools in Stata, but the principles apply broadly.
WF starts with replication

1. An effective WF facilitates replication.
2. You must plan for replication at the start of a project.
3. Disciplines are increasingly concerned with replicability.
   o Articles in Political Science, Economics, Sociology and other fields.
4. Ask yourself:
   o Are your do-files and log files ready for public display?
   o Will they produce exactly the same results as you have published?

Why replication is so hard

1. The curse of dimensionality: 10 minor decisions, each with two reasonable options, lead to 2^10 = 1,024 reasonable ways to create your data.
   o Where to truncate a variable.
   o The seed for the RN generator.
   o Creating a scale with partial missing data.
   o Which cases to keep for analysis.
   o How to code education?
   o What values to assign income greater than $200,000?
   o And so on...

Decisions in the path to analysis: the choices that could be made

Decisions in the path to analysis: the choices made
Why replication is so hard (continued)

2. Documentation: Replication should involve retrieving documentation, not trying to remember what you did.
3. Changing software: 2 weeks of sleepless nights due to version variation. This is particularly difficult when there is an active user community.
4. Lost files: corrupted, lost, unreadable, obsolete, or ambiguous files.

Collaboration and workflow

   o Collaboration makes it more difficult to have an effective, efficient and replicable workflow.
   o Why? And, why can't they do it just like me?
   o Every problem you can have working by yourself is multiplied.

Criteria for choosing a WF, assuming replicability

1. Accuracy
   o If your program is not correct, then nothing else matters. Oliveira and Stewart
2. Efficiency
   o Completing work quickly given accuracy and replicability.
   o Tension between working quickly and working carefully.
3. Standardization
   o Don't repeatedly and inconsistently decide how to do things.
   o Standardization makes it easier to find mistakes.
4. Automation
   o Automated procedures prevent mistakes and are faster.
   o Drukker's Dictum: Never type anything that you can obtain from a saved result. (Did the authors of margins think about this?)
5. Simplicity
   o The more complicated your procedures, the more likely you will make mistakes or abandon your plan.
6. Usability
   o Your workflow should reflect the way you like to work.
   o If you ignore your procedures, it is not a good WF.
7. Scalability
   o Different projects require different workflows.
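
Drukker's Dictum, listed under the automation criterion, can be illustrated with a short Stata sketch. It assumes the auto dataset that ships with Stata, and the variable name mpg_c is made up for illustration; it is not taken from the talk.

```stata
version 11
sysuse auto, clear          // example dataset shipped with Stata

// Error-prone: retyping a rounded number read off the output
// generate double mpg_c = mpg - 21.3

// Safer: summarize leaves the mean in r(mean); use it directly
summarize mpg
generate double mpg_c = mpg - r(mean)

// Verify rather than assume: the centered variable has mean zero
summarize mpg_c
assert abs(r(mean)) < 1e-8
```

Because the value never passes through the keyboard, rerunning the do-file on revised data updates the centering automatically.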

Coordinating multiple workflows

Key factors in collaborations

1. Agreed-upon standards
2. Explicit coordination
3. Enforcement of standards
4. A sense of humor

Steps in your workflow

Step 0. Have a good idea for a project

Step 1. Prepare the data for analysis
   o Data must be accurate.
   o Variables must be carefully named and labeled.
   o This takes 90% of the time, unless you hurry.

Step 2. Conduct analyses
   o Estimate models and create graphs.
   o Often the simplest part of the workflow.

Step 3. Present results
   o Incorporate output into your presentation.
   o Make effective presentations.
   o Maintain the provenance of results.

Step 4. Protecting files
   o Backing up and archiving: preserving the bits and the content.
     - $2,000 to get 1 variable from an "archived" file.
   o Replication is impossible without your data and do-files.
   o "Today's noise is tomorrow's knowledge." David Clemmer

Tasks within each step

Planning

The ideal

Blau and Duncan (1967) The American Occupational Structure
   o All analyses were specified 9 months before output was received.
   o The book was written based entirely on those analyses.
   o None of the later books written with full access to the data were as good.

Issues in planning

1. A plan is a reminder to stay on track, finish the project, and publish results.
   Work. Finish. Publish. Michael Faraday's sign in his lab
2. A little planning goes a long way and almost always saves time.
3. Planning includes:
   o General goals, publishing plans, and firm deadlines.
   o Division of labor and accountability.
   o Proposal for data construction: names, labels, formats.
   o Procedures for handling missing data.
   o Anticipated analyses.
   o Guidelines and responsibility for documentation.
   o Procedures and schedule for backing up and archiving materials.

Organizing

1. Organization is motivated by the need to:
   o Find things
   o Avoid duplication
2. It requires explicit, consistent decisions about naming and storing things.
3. Organization:
   o Helps you work faster
   o Rewards consistency and uniformity
   o Organization is contagious

Organizing: the curse of cheap storage

1. It is easier to create a file than to find a file.
2. It is easier to find a file than to know what is in the file.
3. With disk space so cheap, it is tempting to create a lot of files.

Signs of poor organization

1. You can't find a file and think you deleted it.
2. You find multiple versions of a file and don't know which is which.
3. You and a colleague are working on different versions of the same paper. You changed what she changed and now you have three versions of the paper.
4. You need the final version of the paper that was submitted for review, but you have two (or 16) files with "final" in the name.
   o final_report_v16.docx
   o NSF_science_report20101021.docx

Organizing: a standard directory structure for all projects

\WF project
    \- History
        \2009-03-06 project directory created
    \- Hold then delete
    \- Pre posted
    \- To clean
    \Documentation
    \Posted
    \Resources
    \Text
        \- Versions
    \Work
        \- To do

Organizing: wfsetupsingle.bat makes it easy

For example, a batch file makes creating uniform directories easy.

REM workflow talk 2 \ wfsetupsingle.bat jsl 2009-07-12
REM directory structure for single person.
REM Build CDATE as yyyy-mm-dd; assumes %DATE% is US-style "Ddd mm/dd/yyyy".
FOR /F "tokens=2,3,4 delims=/- " %%a in ("%DATE%") do set CDATE=%%c-%%a-%%b
md "- History\%cdate% project directory created"
md "- Hold then delete"
md "- Pre posted"
md "- To clean"
md "Documentation"
md "Posted"
md "Resources"
md "Text\- Versions\"
md "Work\- To do"
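
The same directory layout could also be created from inside Stata. The following is only a sketch of that idea, not part of the original batch file: it assumes the working directory is a new, empty project folder, and it creates parent directories first because Stata's mkdir does not create intermediate directories.

```stata
version 11

// Assumption: the current working directory is the empty project folder.
// Build today's date as yyyy-mm-dd from Stata's c(current_date).
local today : display %tdCCYY-NN-DD date(c(current_date), "DMY")

// Parents first: mkdir does not create intermediate directories
mkdir "- History"
mkdir "- History/`today' project directory created"
mkdir "- Hold then delete"
mkdir "- Pre posted"
mkdir "- To clean"
mkdir "Documentation"
mkdir "Posted"
mkdir "Resources"
mkdir "Text"
mkdir "Text/- Versions"
mkdir "Work"
mkdir "Work/- To do"
```

Running it a second time in the same folder stops with an error because the directories already exist, which is a reasonable safeguard for a setup script.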

Organizing: uniform formats for do-files

capture log close
log using wftalk-example, replace text

//  program:    wftalk-example.do
//  task:
//  project:
//  author:     jsl \ 2010-07-27

version 11
clear all
set linesize 80
local tag "wftalk-example.do jsl 2010-07-27"

//  #1
//  Description of task 1

//  #2
//  Description of task 2

log close
exit

Templates make this structure easy to use.

Organization should be like a Model T

Any color you want as long as it is black….

Documentation

Too often it is more like a 'bug'

1. Long's Law: It is always faster to document it today than tomorrow.
   Corollary 1: Nobody likes to write documentation.
   Corollary 2: Nobody regrets having written documentation.
   Have you ever said: "Drat, this program has too many comments."
2. Documentation occurs on many levels: logs, metadata, comments, names.
3. Without documentation, replication is virtually impossible, mistakes are more likely, and work takes longer.
4. The more codified the field the greater the emphasis on documentation.
   A. The Research Log by the American Chemical Society.
   B. Loss of tenure for an altered research log.
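
Documentation at the metadata level can travel with the dataset itself. The sketch below is one possible way to do that in Stata, using the auto dataset that ships with Stata; the variable gp100m, the do-file name, and the initials in the tag are hypothetical.

```stata
version 11
sysuse auto, clear

// Hypothetical tag: name of the do-file, who ran it, and the full date
local tag "wf-doc-sketch.do xyz 2010-11-15"

// Create a variable, then document it immediately
generate gp100m = 100 / mpg
label variable gp100m "Gallons per 100 miles"
notes gp100m: 100/mpg \ `tag'

// Record the change at the dataset level as well
notes _dta: added gp100m \ `tag'

// List every note stored with the data
notes
```

The space after the backslash matters: a backslash placed directly before the macro would stop Stata from expanding it.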

Suggestions for writing documentation

1. Do it today.
2. Check it tomorrow or next week: it always makes sense today.
3. Keep up with documentation by tying it to events in the project.
4. Include full dates and names.

The core of your documentation: the research log

A real example...

Execution and computing

1. Execution involves carrying out tasks within each step.
2. Effective execution requires the right tools.
   o Software
     a. Text editor
     b. File manager
     c. Statistical software
     d. Macro program (even if only to insert time stamps)
     e. Word processor
   o Hardware: display, storage, memory, CPU
3. Planning is probably more important than computing power.

For example…

Cornell 1975: the entire computing infrastructure

IBM 370 with 240K memory
Winchester drives with 3MB storage
   o Cost of computing $1,000,000.
   o Mean time to degree 7.6 years.

A thought experiment on planning and computing

1. Randomly divide yourselves into two groups.
   o The computers can compute whenever they want to.
   o The planners can only compute for two six-hour sessions a week.
2. Who finishes first?

Indiana 2009: a disposable PC

Asus 1000HE with 2GB memory: 10,000 times more
FreeAgent with 1TB storage: 350,000 times more...
   o Cost of computing $400 (2,500 times less).
   o Mean time to degree 7.6 years.

Principles for a computing workflow

1. Dual workflow: keep data management and data analysis separate.
2. Run order: name files so that if they are rerun in alphabetical order, you will produce exactly the same results.
3. Posting principle for sharing results (defined later)

Dual workflow

Data management ==>     <== Data analysis

The essential posting principle

The posting principle is defined by two rules:
1. The share rule: Only share results after the files are posted.
2. The no-change rule: Once a file is posted, never change it.

Run order and a dual workflow

Data management
   data01.do
   data02V2.do
   data03.do
   data03-1.do
   data03-2.do
   data04.do

Data analysis
   stat01a.do
   stat01b.do
   stat01cV2.do
   stat02a.do
   stat02a1.do
   stat02b.do
   stat03aV2.do
   stat03b.do
   stat03c.do
   stat03c1.do
   stat03c2V2.do
   stat03d.do
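
One way to make the run order explicit is a master do-file that calls each posted do-file in sequence. The sketch below is hypothetical: it is not shown in the talk, it simply reuses the file names from the lists above in the order given there, and it assumes those files sit in the working directory.

```stata
// Hypothetical master do-file: reruns the whole project in run order.
// Assumes the do-files named on the slide are in the working directory.
version 11

// Data management comes first ...
do data01.do
do data02V2.do
do data03.do
do data03-1.do
do data03-2.do
do data04.do

// ... then data analysis, which only reads the datasets created above
do stat01a.do
do stat01b.do
do stat01cV2.do
do stat02a.do
do stat02a1.do
do stat02b.do
do stat03aV2.do
do stat03b.do
do stat03c.do
do stat03c1.do
do stat03c2V2.do
do stat03d.do
```

If the master file runs from start to finish without manual steps and reproduces the posted logs, the run-order principle holds.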

Data analysis: use do-files!

Legible do-files: output that is easy to read

1. Lots of thoughtful comments
2. Alignment, indentation and spacing
3. Short lines without wrapping
4. No ambiguous abbreviations: l a l in 1/3

Robust do-files

1. They are self-contained
2. They include version control (version 11.1)
3. They exclude directory information (which might change)
4. They explicitly set seeds for random numbers
5. They require that you archive user-written ado-files

Simply put: It should run on another computer at a later date without changes.
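
The points on robust do-files can be put together in a short skeleton. This is a minimal sketch rather than the author's template: the log name and the seed value are arbitrary choices, and the data come from the auto dataset shipped with Stata so that no directory path is needed.

```stata
capture log close
log using wf-robust-sketch, replace text   // hypothetical log name

version 11.1          // interpret commands as Stata 11.1 did
clear all             // self-contained: start from an empty session
set linesize 80
set seed 20101115     // arbitrary, but written down

sysuse auto, clear    // no directory information in the file name

// Random numbers are now reproducible from run to run
generate u = runiform()
summarize u

log close
exit
```

Running the file twice should give identical summary statistics for u; dropping the set seed line breaks that.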

Legible log files (in text, not SMCL)

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                       Years of education
Occupation |      3      6      7      8      9     10     11     12     13 |   Total
-----------+----------------------------------------------------------------+--------
    Menial |      0      2      0      0      3      1      3     12      2 |      31
           |   0.00   6.45   0.00   0.00   9.68   3.23   9.68  38.71   6.45 |  100.00
-----------+----------------------------------------------------------------+--------
   BlueCol |      1      3      1      7      4      6      5     26      7 |      69
           |   1.45   4.35   1.45  10.14   5.80   8.70   7.25  37.68  10.14 |  100.00
-----------+----------------------------------------------------------------+--------
     Craft |      0      3      2      3      2      2      7     39      7 |      84
           |   0.00   3.57   2.38   3.57   2.38   2.38   8.33  46.43   8.33 |  100.00

Automation

1. Much of data analysis involves repetitive tasks.
2. Repetition invites errors.
3. Automation is faster, and less error prone.
   A. macros: words that represent strings of text.
   B. loops: multiple execution of the same commands.
   C. returned results: avoiding typing the value of any statistical result.
   D. matrices: hold and summarize key results.
   E. ado-files: write programs that do what you want.
   F. me.hlp: don't keep looking up the same things. For example, …
      help me
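
Items A through D can be combined in a few lines. The following sketch assumes the auto dataset shipped with Stata; the macro, matrix, and outcome choices are made up for illustration and do not come from the talk.

```stata
version 11
sysuse auto, clear

// A. Macro: a word that stands for a string of text
local rhs "weight length"

// D. Matrix: one row per outcome, to hold the key results
matrix results = J(2, 2, .)
matrix rownames results = price mpg
matrix colnames results = N r2

// B. Loop: run the same commands for each outcome
local row = 0
foreach y in price mpg {
    local ++row
    quietly regress `y' `rhs'
    // C. Returned results: copy e(N) and e(r2), never retype them
    matrix results[`row', 1] = e(N)
    matrix results[`row', 2] = e(r2)
}

// One compact summary instead of two pages of regression output
matrix list results, format(%9.3f)
```

Adding a third outcome means changing the loop list and the matrix dimensions, not copying and editing a block of commands.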
ǣ‡ƒ•›–‘—•‡”‡•—Ž–•…‘ŽŽ‡…–‘”
InStata,type:
findit snag
‘”ˆŽ‘™̳͸͹
snagcollectsdozensorhundredsofresultstomakethemeasiertodigest.
o Thestandardoutputisusedtoverifytheresults.
o The“snagged”summaryletsyoudiscoverwhatyouwant.
o Anyoneusingmarginsknowswhythisisnecessary.
‘”ˆŽ‘™̳͸ͻ
Data cleaning, including names and labels

Creating a codebook

Planning names

Truncation and careless names

Example: ownsex and ownsexu caused weeks of confusion.
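
Truncated value labels, as in the first example table of this talk, can be avoided by planning labels that stay distinct within the eight or so characters a two-way table shows. The sketch below uses made-up data and hypothetical label names; in particular, the meaning and ordering of the four response categories are assumptions, since the original output only shows their truncated forms.

```stata
version 11
clear

// Made-up data: four respondents, one per response category
input female sdchild
0 1
0 2
1 3
1 4
end

label variable female  "R is female?"
label variable sdchild "Q15 Would let X care for children"

// Short labels that start with the value and stay distinct when truncated
// (category meanings are assumed for illustration)
label define Lfemale 0 "0Male" 1 "1Female"
label define Ldef4   1 "1Def" 2 "2Prob" 3 "3ProbNot" 4 "4DefNot"
label values female  Lfemale
label values sdchild Ldef4

tabulate female sdchild
```

Starting each label with its numeric value also makes it easy to check recodes against the codebook.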

Types of data cleaning

Cleaning 1a: finding an error with a graph

Cleaning 1b: reversing the graph

Cleaning 2: remembering a coding decision

Cleaning 3: understanding the substantive process

Cleaning 4: avoiding expensive mistakes

Analyzing the data

1. Take lots of classes in statistics.
2. Find exemplars; don't rediscover the wheel; don't do it "your way".

Presentations and provenance

1. Content and methods are substantive, disciplinary decisions.
2. Presentations and preservation of provenance are universal.

Tables too small

mlogit (N=337): Factor Change in the Odds of occ

Variable: white (sd=.27642268)

Odds comparing    |
Alternative 1     |
to Alternative 2  |         b        z    P>|z|      e^b  e^bStdX
------------------+---------------------------------------------
Menial  -BlueCol  |  -1.23650   -1.707   0.088   0.2904   0.7105
Menial  -Craft    |  -0.47234   -0.782   0.434   0.6235   0.8776
Menial  -WhiteCol |  -1.57139   -1.741   0.082   0.2078   0.6477
Menial  -Prof     |  -1.77431   -2.350   0.019   0.1696   0.6123
BlueCol -Menial   |   1.23650    1.707   0.088   3.4436   1.4075
BlueCol -Craft    |   0.76416    1.208   0.227   2.1472   1.2352
BlueCol -WhiteCol |  -0.33488   -0.359   0.720   0.7154   0.9116
BlueCol -Prof     |  -0.53780   -0.673   0.501   0.5840   0.8619
Craft   -Menial   |   0.47234    0.782   0.434   1.6037   1.1395
Craft   -BlueCol  |  -0.76416   -1.208   0.227   0.4657   0.8096
Craft   -WhiteCol |  -1.09904   -1.343   0.179   0.3332   0.7380
Craft   -Prof     |  -1.30196   -2.011   0.044   0.2720   0.6978
WhiteCol-Menial   |   1.57139    1.741   0.082   4.8133   1.5440
WhiteCol-BlueCol  |   0.33488    0.359   0.720   1.3978   1.0970
WhiteCol-Craft    |   1.09904    1.343   0.179   3.0013   1.3550
WhiteCol-Prof     |  -0.20292   -0.233   0.815   0.8163   0.9455
Prof    -Menial   |   1.77431    2.350   0.019   5.8962   1.6331
Prof    -BlueCol  |   0.53780    0.673   0.501   1.7122   1.1603
Prof    -Craft    |   1.30196    2.011   0.044   3.6765   1.4332
Prof    -WhiteCol |   0.20292    0.233   0.815   1.2250   1.0577
----------------------------------------------------------------

Variable: ed (sd=2.9464271)

Odds comparing    |
Alternative 1     |
to Alternative 2  |         b        z    P>|z|      e^b  e^bStdX
------------------+---------------------------------------------
Menial  -BlueCol  |   0.09942    0.972   0.331   1.1045   1.3404
Menial  -Craft    |  -0.09382   -0.962   0.336   0.9105   0.7585
Menial  -WhiteCol |  -0.35316   -3.011   0.003   0.7025   0.3533
Menial  -Prof     |  -0.77885   -6.795   0.000   0.4589   0.1008
BlueCol -Menial   |  -0.09942   -0.972   0.331   0.9054   0.7461
BlueCol -Craft    |  -0.19324   -2.494   0.013   0.8243   0.5659
BlueCol -WhiteCol |  -0.45258   -4.425   0.000   0.6360   0.2636
BlueCol -Prof     |  -0.87828   -8.735   0.000   0.4155   0.0752
Craft   -Menial   |   0.09382    0.962   0.336   1.0984   1.3184
Craft   -BlueCol  |   0.19324    2.494   0.013   1.2132   1.7671
Craft   -WhiteCol |  -0.25934   -2.773   0.006   0.7716   0.4657
Craft   -Prof     |  -0.68504   -7.671   0.000   0.5041   0.1329
WhiteCol-Menial   |   0.35316    3.011   0.003   1.4236   2.8308
WhiteCol-BlueCol  |   0.45258    4.425   0.000   1.5724   3.7943
WhiteCol-Craft    |   0.25934    2.773   0.006   1.2961   2.1471
WhiteCol-Prof     |  -0.42569   -4.616   0.000   0.6533   0.2853
Prof    -Menial   |   0.77885    6.795   0.000   2.1790   9.9228
Prof    -BlueCol  |   0.87828    8.735   0.000   2.4067  13.3002
Prof    -Craft    |   0.68504    7.671   0.000   1.9838   7.5264
Prof    -WhiteCol |   0.42569    4.616   0.000   1.5307   3.5053
----------------------------------------------------------------

Variable: exper (sd=13.959364)

Colors that aren't distinct when printed/projected

Labels that aren't large enough

Documenting the provenance

The circled text contains results I may need to confirm later:

Turning on "show/hide ¶" reveals the provenance:

Captions that indicate the provenance

twoway (line art_root2 art_root3 art_root4 art_root5 articles,      ///
    lwidth(medium)), ytitle(Number of Publications to the k-th Root) ///
    yscale(range(0 8.)) legend(pos(11) rows(4) ring(0))              ///
    caption(wf7-caption.do \ jsl 2008-04-09, size(vsmall))
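
The caption does not have to be typed into every graph command. One possible approach, sketched here with the auto dataset shipped with Stata, is to store the provenance once in a local macro and reuse it; the do-file name and initials in the tag are hypothetical.

```stata
version 11
sysuse auto, clear

// Hypothetical provenance tag: do-file name, initials, full date
local tag "wf-caption-sketch.do xyz 2010-11-15"

// Every graph made by this do-file carries the same small caption
twoway (scatter mpg weight),        ///
    ytitle("Mileage (mpg)")         ///
    caption("`tag'", size(vsmall))
```

When the do-file is renamed or rerun on a later date, only the one line defining the tag needs to change.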

Preserving your data

Hope, foolishly, springs eternal (the ... syndrome)

When it comes to saving your work, expect things to go wrong, expect that you will delete the wrong file at the worst possible time, and expect a hose to be left on in the room above your computer. If you expect the worst, you might be able to prevent it.

Examples of data loss

1. Kennedy assassination on November 22, 1963 and the 9/11 survey.
2. 508K volumes in obsolete formats at British Museum. 2M videos at IU.
3. Neil Armstrong's walk on the moon on July 20, 1969, the lost moon tapes, and Pink Floyd's Dark Side of the Moon.
   "a fuzzy gray blob wading through an inkwell" Dark Side of the Moon

A simple approach to preserving files

Tactics: portable drives at home

Tactics: portable drive at work

Tactics: Live Sync (soon to be Live Mesh)

Off-line backups

Dropbox and similar services, enterprise mass storage, local servers.
1. Install the program
2. Drop files into the folder
3. Retrieve them from any machine with Dropbox
4. Have shared folders for collaboration
5. Avoid sending attachments even for one-time file exchanges

Data storage 1981 to 2009

1. Size per drive increased by a factor of more than 300,000.
2. Cost per gigabyte decreased by a factor of 7,000,000.
3. A shoebox full of portable drives can hold enough IBM cards to fill a 30M cubic foot building; 60M cubic feet next month. With compression…

Changing your workflow

1. Slowly, systematically, thoughtfully.
2. Finish the last 5% of the change.
3. Like Penn and Teller, master a few cool tricks.
4. Don't do it under deadline.

Whose workflow

1. There are many viable workflows.
2. The key advantage of the WF book is that it is written down.
3. Alan Acock wrote:
   o "Not everyone will agree with all of [Long's] suggestions."
   o "I will post the announcement of Workflow on my door with the following note: 'I am glad to help anybody who followed at least 25% of the advice Long provides - and brings me their do-files!'"
4. Do you really want to spend your time rediscovering the mistakes I made?