Rob Sherwood
Who am I?
• Stanford 2008-2011;
Visiting Researcher/PostDoc
• Currently CTO of Big Switch Networks
Research Background
• Internet Security
• Peer-to-Peer
• Internet Measurement
• Software Defined Networking
2
Big Money Questions:
•Why this paper?
– Hint: not because it’s short
•Who is Vern Paxon?
3
• Isn’t it man-made? Why not just model it?
Partial Answers:
• Statistical models of packet arrival and traffic matrices inaccurate
• The actual topology is unknown
• Many parts are intentionally obscured for commercial gain
• Apply natural science principles
4
• Patrick Harvey, "Many of the potential pitfalls in data gathering and analysis that the paper notes--while relevant to non-Internet data--seem to be somewhat exacerbated by the heavily-layered Internet architecture.
This may be especially true of ‘misconception’, the potential for which means that sound data analysis likely must not treat Internet abstractions as opaque as is typical in the development of many actual systems, but instead take into account many layers and modules in addition to those of most immediate proximity to the measured data."
5
• Hard/formal definition?
• Why is this so important for measurement?
6
• “Real” science depends heavily on
Significant Figures
– E.g., C=2.997,924,58 x10 8 meters/second
• How do we apply SigFigs with computer systems?
• gettimeofday() == 1130322148.939977000
7
• About Time?
• About TCP?
• About Routers/Switches?
8
• What is this? Why is it important?
Critical tip:
• Save exact cut-and-paste command for every graph
• People will ask you to reproduce
War Stories:
• DNS data
• OptAck – Nick’s class last year
9
• Anonymous 1, "I feel like the first half of this paper could have been titled "Reasons to Never Ever Use tcpdump".
• Anonymous 2, “The author says this advice is drawn from his experiences so now I am bit skeptical of every chart I see.
”
1
0
• Obviously misconceptions are bad
– What can we do about them?
– Is “learn lots of domain knowledge” enough?
• What does Calibration mean in practice?
• Answer: if you want to do it right:
– “measure twice or more , cut publish once”
– Can be painstaking, but better than retraction
– Great Firewall of China visualization
1
1
• Paper was published in 2004
– Most of the lessons were learned before then
• 10+ years later, we have Hadoop, AWS,
– Is big data management still an issue?
• My Dissertation gathered 4+TB (!!! )
– Needed tuned RAID, mysql, condor, and 300+ machines to process
1
2
• Or important?
• Truth time:
– who has already had this problem?
1
3
Practical answer:
•Privacy is Important
•Very hard to get consent
– Map IPs to people?
– People don’t understand the cost or benefit
•Most academic institutions have a “fail fast” approach to legal threats
Anonymizing and De-anonymizing data
•Huge research topic; very interesting
1
4
• Anonymous, " I am a bit worried about the conflict that might appear between the two ideals of collecting an abundance of metadata and making datasets publicly available. If a lot of metadata is collected in order to allow for the reuse of the data in many settings, this only increases the complexity of dealing with privacy issues in regards to the use of the data"
1
5
• Follow Up Papers/IMC Guidelines
• “BotNet Labs”/Password distribution
• Open Question:
– Should Internet Measurement go through IRB approval?
• War story:
– Multiple accidental DoS experiments
– Very unhappy people == unhappy advisor
1
6
• Anonymous, "I feel the authors should have discussed the issue of intrusiveness of a measurement technique in the accuracy/ misconception section -- the authors rightly describe the importance of collecting metadata especially for publicly made data. But how much metadata should one collect - and what if this metadata collection actually causes an overhead and skews the results?"
1
7
• Very few of these concepts apply to just the Internet
• Rob’s claim:
– This paper made me a better scientist
– Which then made me a better system designer
• Additional Questions/Comments?
1
8