
DOE roundtable notes
Justin Cummins
Analytics breakout - day one





What is encompassed by the term analytics?
◦ A working definition of analytics is the systematic distillation of information from data.
◦ Is it a predictive science based on statistical reasoning combined with reductive methods?
◦ Important to remember the distinction between metrics and statistics.
◦ Progression of analytics
▪ from methodologies expressing the state of the art to defined methods
▪ from a basis in collected data to empirical rules and, eventually, foundational rules
◦ Analytics may also refer to the study of analysts
▪ Data including workflows, analysis and data needs, and supporting tools
▪ Should study how analysts work, including how they gain insight, which is rooted in domain knowledge.
▪ Sharing this information can be muddied by commercial interests
◦ Telling the story behind collected data.
◦ Predicting future events or data based on existing knowledge.
◦ In the meantime, techniques should demonstrate something actionable.
What is the role of data collection?
◦ Analytics is currently geared toward data post-acquisition: data collection and subsequent
searching
◦ Data is collected opportunistically: grab it because we can or just in case
◦ Sometimes a specific set of data is targeted for collection and neighboring data is included.
◦ Hopefully the field will progress to learning in advance what to collect or what is important
◦ Validation should ensure the data collected was what was intended
◦ Distilling or filtering such data is fundamental
Analytics can vary greatly based on context
◦ Analytics differ based on different people's needs.
◦ For example, forensics and situational awareness are two contexts which may depend on a
common set of data but with very different needs.
What types of searches are done?
◦ Is a search hypothesis-driven or merely looking for patterns?
◦ Is there an abstraction for patterns? Can it be done?
◦ How legitimate is it for someone to process the data with a neural-net algorithm, looking for patterns without any goal or hypothesis?
Analytics can be categorized temporally based on the type of knowledge they reveal
◦ A focus on past knowledge might include forensic analytics.


◦ A focus on present knowledge lends itself to tactical and operational analytics.
◦ Analytics targeting future knowledge can be categorized as strategic or predictive.
A community or consortium could be created to study and share knowledge pertaining to these
issues.
◦ This community could study, in a systematic fashion, the work analysts perform and the tools they will require in the future.
◦ Could perhaps learn from how the anti-spam community shares data.
◦ Should the community be closed? If so, closed to the public or to commercial interests?
◦ What could be shared?
▪ Infrastructure for shared analytic experiences may be too difficult without data or
sensitive information.
▪ White papers would be possible but are of limited usefulness.
▪ The real value lies in the process by which a conclusion was reached, provided it is replicable.
▪ Supervised learning could be done with learning and validation sets (a minimal sketch follows this list).
▪ Data sharing is a possibility but would have to be studied in context.
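To make the learning/validation-set idea concrete, here is a minimal sketch assuming one party shares labeled feature vectors; the features, labels, and classifier choice are illustrative placeholders, not anything from the discussion.

# Minimal sketch (illustrative only): split shared, labeled data into a
# learning set and a validation set, train a simple classifier, and report
# validation accuracy. The feature vectors and labels are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # placeholder feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # placeholder "benign"/"suspicious" labels

# Hold out 30% of the shared data as a validation set.
X_learn, X_val, y_learn, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_learn, y_learn)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))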
Department of Energy's possible role
◦ Help to develop tools for processing data at large, distributed scales (e.g. Cassandra, Hadoop); a minimal Hadoop Streaming sketch follows this list.
◦ Possibly coordinate with the National Visualization and Analytics Center (NVAC) on cyber-analytics.
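As a rough illustration of what such tooling looks like in practice (the log format, field positions, and file names below are assumptions, not anything prescribed here), a Hadoop Streaming job can count occurrences per source IP with two small Python scripts:

# mapper.py -- illustrative Hadoop Streaming mapper. Assumes whitespace-separated
# log lines whose first field is a source IP (a hypothetical format).
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        print(f"{fields[0]}\t1")   # emit "<ip>\t1" for the reducer to sum

# reducer.py -- illustrative Hadoop Streaming reducer: sums the counts per key,
# relying on Hadoop having sorted the mapper output by key before this stage.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")

These would be submitted with the hadoop-streaming jar that ships with a given Hadoop installation; the point is simply that commodity tools can push a basic distillation step out to where the data lives.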
Analytics breakout - day two summary
Purpose of breakout: Intent is to prepare a presentation for your lab director for the upcoming Cyber Summit 3.
How relevant to DOE?
What can community provide?
How will community benefit?
Describing the problem:
DOE is a very prominent target because of its prominent name, nuclear material/info, national security repository, international research community, etc.
Transition from independent hackers to state-sponsored attacks.
How much is cybercrime costing us (incl. power and misused resources)?
We can support operations with improved analytics. Beyond immediate answers, analytics provide
longer-term patterns and perhaps predictive ability.
A fault-free (blame-free) system to report problems to authorities. Things like CERT do not see into lab networks.
Solutions:
We are creating a DOE collaborative cyber community for shared analytics.
The result will be amplified capabilities, enhanced efficiency, and lowered cost.
Community provides cohesive background of different skill sets/domain knowledge.
Cyber Summit 3 will be a framework for this collaboration.
Advancing state of analytics
Predictive analytics allow us to choose time and place of battle (level playing field?).
We will provide a common definition of cyber analytics.
Working definition of analytics: systematic distillation of knowledge from information.
Analysis can be applied to the design of new systems and to thwart attacks.
Other:
Propose lab director guarantee integrity of computing resources, similar to how they must vouch for
security of nuclear resources.
- Adding responsibility is not a selling point.
Good cyber-analytics may not benefit lab directors because of "cyber shame" generated by findings.
Ed Talbot - “Charge to the Community”



Cyber Summit
Would like to present a clear problem statement to upper management
Three main questions
◦ What can this community provide that would help?
◦ How can this community benefit?
◦ How is it relevant to the Department of Energy? Or, why is it critical to DOE's mission?
Discussion: How is funding done (if at all) for cybersecurity? What is the point of the community?
Richard Perlotto - “State of the Internet”




Shadowserver project – nonprofit
Sinkhole counts and other statistics on worms/malware
Several questions arose during the talk, some around statements on network management or general security.
Discussion:
◦ Questions of tool building, data sources, and how we can build our experiences scientifically so that we can quantify what is out there. What has been working to reduce the scope of the data? What successes or failures have occurred with hardware, software, brain power?
▪ Tools for computing with large, distributed data stores are lacking. Projects like Hadoop and Cassandra are immature.
◦ Call for improved metrics? Is that enough or what else is needed?
◦ Do metrics just describe changes over time or something different?
◦ Analysis means building a hypothesis, testing it, and reporting results. Initially, hypotheses begin with community wisdom.
◦ Some claim there is no science base in our field. What we see are measurements. Measurements are often not taken to validate a hypothesis, just to figure out what's happening. It's not a scientific process. However, observation is needed before hypothesis. This is statistics, not metrics.
◦ We measure because the data is available rather than forming a hypothesis to figure out what we want to measure.
◦ How can we use the statistics to predict something or understand some impact?
◦ Commonly gathered information is important as a baseline. We gain experience with it, even if the data's scientific usefulness is an open question.
◦ Collected data is the start of building a taxonomy (variant A, B, etc.) but will it be complete?
Stephanie Wehner - “Unconditional Security from Noisy
Quantum Storage”


Goal: two-party protocols (inputs x and y, output f(x,y)); an illustrative sketch follows these notes
◦ Password input and check
◦ Price bid against a hidden minimum
Weak measurements can be made on qubits with a low(er) probability of destroying information
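To make the two-party framing concrete, here is an in-the-clear sketch of the example functionalities f(x, y) noted above; it only shows what f computes. The point of the noisy-quantum-storage result is that such functions can be evaluated without either party learning more than the output, which this plain code does not attempt.

# Illustrative only: the ideal two-party functionalities, written without any security.
def password_check(attempt: str, stored: str) -> bool:
    # f(x, y) = [x == y]; the checker should learn only whether the password matched
    return attempt == stored

def bid_meets_minimum(bid: float, hidden_minimum: float) -> bool:
    # f(x, y) = [x >= y]; the seller should learn only whether the bid clears the minimum
    return bid >= hidden_minimum

print(password_check("hunter2", "hunter2"))  # True
print(bid_meets_minimum(95.0, 100.0))        # False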
Julia Narvaez - “Talaris Report Review”



Previous roundtable included breakouts on trust.
◦ It would have been preferable to have a social scientist in attendance.
◦ That was not possible; however, there was a cognitive scientist.
There were some questions on the semantics of statements in the presentation.
◦ Although risk management may not be a science, it is an important management aspect to
be involved in security.
◦ There was also some discussion on best practices, which are not clearly the best and are not
universally accepted.
Reproducibility in experiments was discussed as a problem area for the field.
◦ Discussion revealed there may not be many good reference examples of scientifically sound
cyber security experimentation.
◦ Such work may be difficult to publish given current conferences and institutional incentives.
◦ The field may be able to look at other computer science research areas for a solution,
potentially software engineering research.
Julia Narvaez - “Talaris Report Review” - Raw notes






Was there a social scientist involved with trust discussions?
No, but there was a cognitive scientist.
Complex systems can't be approached with both a holistic and a systematic approach (they are opposites).
Risk management isn't a science.
Can management be removed from improving cyber-security?
Perhaps it should refer to the science of risk.
Takes offense to the term 'best practices'. Maybe 'minimally acceptable'. No one actually defines best practices.
Is precision the problem?
Should it be 'best principles'?
Few experiments are reproducible, even by authors.
What do we mean by experiments? Any examples?
Keystroke studies...
Science isn't easily publishable.
A journal of reproducible computational science?
Recurring topic. Perhaps software engineering has partially addressed it.
Was there a scientific evaluation grand challenge proposed?
I don't think specifically.
Christian Kreibich - “Spamalytics” - raw notes


There have been wild claims as to botnet spam profitability
Your profit equation is wrong: "cost to send * conversion rate * sale profit".
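A minimal sketch of the per-message arithmetic the comment appears to be driving at (every number below is an invented placeholder, not a figure from the talk): expected profit per message is the conversion rate times the profit per sale, minus the cost to send, rather than a product of all three terms.

# Illustrative arithmetic only; the rate, profit, and cost below are invented.
conversion_rate = 1e-7      # hypothetical fraction of messages leading to a sale
sale_profit = 100.0         # hypothetical profit per completed sale (dollars)
cost_to_send = 1e-8         # hypothetical cost per message sent (dollars)

expected_profit_per_message = conversion_rate * sale_profit - cost_to_send
print(f"expected profit per message: ${expected_profit_per_message:.8f}")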
Did the botmaster notice the decline?
We didn't hide anything. We didn't think they would notice. They are becoming more aware
though.
Did you filter out your own traffic, multiple clicks, or crawlers?
Yes
How could they detect your manipulation?
They could sprinkle in their own addresses to verify spam is being sent.
Looking at the conversion rate, did you compare it to a typical direct mail campaign?
Superficially.
I was thinking of a legitimate, aggressive direct email campaign.
There is a great venue for this (over-aggressive email marketing ruining the market) called ___?
Talking about dilution of conversion rate based on feedback or how the addresses were
originally collected
How did the proxies change over your experiment? Did the botmaster have performance
expectations?
Some spam platforms have fancy accounting built-in.
To be clear, what are you referring to with spam?
I don't know of any legitimate company selling bot-based spam, but I wouldn't be surprised to
see it later.
Did you look at what domains were better at filtering?
There is a backup slide for that.
If you have a GUI email client and an image displays, did that count as a user site visit?
No
Did you consider lowering the pharmacy prices?
What was the purchase price in the pharmacy?
Typical was $100. The site lists cheap things that become an expensive package at checkout.
What percentage of the botnet were you?
We'll come to that.
An ACSAC paper talked about a class that sometimes went to spam for buying products.
Another project also bought spam products. There are pictures somewhere.
Were the websites in different languages?
No. English experience may have been a factor.
You don't know why from the data, as a single data point; can't generalize out.
They're not fulfilling any orders?
Yes, just taking the money.
Did you see evidence of repeat business?
We talked to Ironport who did purchases. They did get info targeting repeat visitors. I have no
insight into how often they sent out product. Often you get it; not placebos.
I heard many of the pharmaceutical operations started as spammers and grew into legitimate
businesses.
Don't know how often that actually happens.
What is the lifespan of a scammer?
This is something we're trying to understand in another project. This one ran over a year very
well. For certain classes of binaries, MS is doing well at cleaning up.
Did the subject line of emails reflect the campaigns?
For ours, yes they did.
I'd like to hear about the decisions/ethics for the project.
2 stages. Lawyer on board since start of collaboration. They determined subtle tech aspects
made it work. We did not generate spam (just manipulate). We did not infect with our binaries.
Next stage was IRB approval at berkeley. We got idea early on that experiment would be okay.
How did your group work with the sponsor?
At the stage of funding, we didn't foresee this experiment. The broad level was to fight these
large campaigns.
What could a group like this do/offer for this type of scientific experiment?
Not sure what you offer. Legal assurance of what is okay or not would be great. Publishing
committees worried that articles would encourage others to go further. As example, keyloggers
would track how users work and upload data to server on a botnet and researchers collected this
data.
There was a Sandia employee who did a hackback, got terminated, filed a lawsuit, and won.
Not familiar with this case. Another paper registered “drive-by” domains. When users visited
them, their system was scanned.
As a PC, if you saw a paper that you rejected for ethical reasons and they had IRB approval
what would you do?
In this case, the PC chair asked the authors what they had done ahead of time. They got blanket
approval ahead of time for area of research.
If they had actual approval, what then?
Not every PC member would understand the issues. Reviewers should ask chair to forward IRB
doc or not review.
What conferences are you seeing these types of submissions?
IEEE, NDSS, some others.
Analytics breakout - day one – raw notes
Led by Richard Strelitz
Charge:

 Clearly articulate the problem that needs to be fixed and how this problem relates to the mission of the Department.
 Show how the results generated by the Grassroots community can affect the DOE Mission.
 Develop a 'headline' summarizing what the problem is and how we will address it.
 Show how the laboratory directed research and development (LDRD) investments are being
used to do foundational research in cybersecurity.
6 people (incl. scribes) attended
 What is your expectation? What is analytics? Predictive science and reductionism as opposed to holistic approaches. Deep respect for statistical reasoning.
 Is it about metrics or statistics?
 Hope to listen in and learn. One project at PNNL. Seemed most relevant. Start with data we
have?
 Expression of state of the art?
 How different from analysis?
 Methodology to method. Data-based to empirical and hopefully foundational rules.
 Also the study of analysts; how they go about things. Cyber-analytics. What are their needs and how can those be supported with tools?
 Seems to be focused on data collection and search.
 Just post data-acquisition.
 Need to know what to collect beforehand.
 Could be opportunistic.
 Cyber-analytics tells the story behind the collected data.
 We collect because it's there or just in case. Analytics should help us define what's important to
us.
 Distill good stories.
 Collect all data anyway. Filtering.
 Subset of data interested in might be over here.
 Did I collect the right thing?
 Is there a useful forum for known good guys posting what they intend to do?
 Problem with forums is that commercial entities have a vested interest in those ideas. Hard to
share even in trusted environment.
 Should we address that or wait for meetings.
 Do we talk about methodology or social problems? 'Analytics' differs based on different people's needs.
 Problem specific winnowing filters.
 This is scientific method though.
 I think we should be looking over shoulders and seeing how they gain insight.
 There was a reason you started collecting. Not because you had lots of empty hard drives.
 We have a specific set we want (IP address). We collect a lot more though. Redundant data can
be useful.
 What do you think you need for your problem? I need this info to distinguish between scenarios. Conjecture-based collection of data from archives to find info using tools. How important is it to stop predation of AV corps?
 It was the sharing of research ideas; that someone would monetize them.
 What makes an analyst is domain knowledge....
 Do they know what they're looking for or just for patterns?
Can you abstract for patterns? Systematic distillation of information from data.
End result of analytics is to answer a hypothesis.
Should we consider anybody with a neural net algorithm going through the data to find some pattern?
Bunch of roles people engage in. Different roles with the same data even (forensics vs.
situational awareness). Prediction but in the meantime demonstrates something actionable.
In particle physics you have hypothesizers and builders. Opposing goals.
Tactical, right now, strategic (past, present, future). Temporal basis.
Would the process be post-processing or co-processing?
Visual model. Tactical, forensic, and predictive. PNNL image of cyber-analytics. (Will be
emailed).
What do we offer the community? Have we answered it? What kind of sample questions could
you see under this definition?
What is offered? This community would be willing to look at these temporal areas in a systematic fashion for analysts who have/need this kind of information.
Would the sharing of knowledge lead to a source of funds? A consortium or community lead to
a block grant? Persistent web presence or body of knowledge. Open to a select few; don't want
to share. How does the spam community share data?
Share data in three roles. Free to consumers. Free to researchers to some extent. With a fee to commercial vendors (most difficult).
If we set up infrastructure for shared analytic experiences?
Would be difficult to share without data or sensitive parts. Basically white papers which no one
reads because of poor quality.
We're always outsiders and can't get data. Hope it's not that way on the inside.
Value in knowing how someone arrived at a conclusion, if it is replicable. If not, it doesn't have any value to me.
If we had a framework and plugins (black boxes) to manipulate and put out data, would it be useful?
Can't be required to put my data through your boxes. We have several types of query
mechanisms around the storage points. Not easily available to outsiders. Can make my own
horizontal queries.
Can we use a subset for testing?
It wouldn't be possible for the full data to be shared for performance reasons.
Sanitization not there. There are etiquette papers.
We publish summaries instead of actual data.
In the context of supervised learning we'd like to have learning sets and validation sets. What is the size at which I could make a reasonable statement?
Data sharing opens another front. This is great for researchers but not necessarily for operations.
Would love to have analysts go back through the data to look for patterns.
How would you provide trusted access?
That hasn't come up yet.
That could be on spec; not funded yet.
Have only given out data on your own networks.
Govt often offers to have you come in-house for inspection, then leave without the data.
One priority for analytics is the establishment of test protocols or test datasets that are available for testing.
Provenance, freshness, releasability, privacy.
Want to give data with no liability attached to it, for either party.
Need a collection of data and sharing of experience about tools, unless it's a person-to-person link.
Who owns the data, you?
I collected it, so theoretically.
Why should DOE care?
There need to be off-the-shelf tools for dealing with massive scales of data.
This is what DOE likes to do but not necessarily what is critical to it.
Protecting NNSA and office assets … critical infrastructure. What steps can be taken to establish an analytical presence?
Have we answered #3?
4-fold mission: protection of nuclear sensitive materials/secrets, protection of power grid/energy, science mission, environmental mission. Weapons of mass cyber-effect (PNNL name).
Analytics not a prevention measure. It's a detection measure.
Could become preventive.
Protect DOE and US assets against foreign actors.
Industrial espionage and cyber-warfare.
Foreign actors are interested in disrupting our way of life, even as far as depriving us of YouTube.
Relevance established. You need analytics to enable part of it. Must be timely.
Instead of foreign national, change to state sponsored or criminal. Pulled up NATO definition of
cyber-warfare. Analytics would allow us to detect patterns.
Could be used for prediction as well. Predict from developing patterns.
Force of smart grid idea. People love it but it makes you cringe.
Zeus infecting blu-ray players.
Privacy issues with power grid as well.
How would the National Visualization and Analytics Center (NVAC) view a cyber-analytics trust?
We're still defining it.
Would there be a way of leveraging NVAC for a cyber-analytics initiative?
They would love to talk.
Analytics breakout - day one summary presentation – raw notes


What can this community provide that would help?
Starting with a definition... develop a community for the organization, dissemination, and collection of analysis and meta-analysis.
From pattern detection to pattern recognition. Wanted to find a way of amassing a critical mass to develop a persistent body of knowledge on which analytics theory and practice can be developed. Would hope to attain sustained funding for this clearinghouse. We felt we could help protect the integrity and existence of DOE infrastructure for the people of this country. It would do no good to realize a compromise, to find vulnerabilities in smart grid systems, or to find that the tools and info of the NSA had escaped; even if they were irrelevant to the DOE mission, the fallout would be awful. It would denigrate the head of DOE for this data to leak out.
There are better arguments than embarrassment.
When I taught, I tried to figure out what students understood and how to map that to what I understand. What is their self-interest in relation to what I want/need?
How would analytics look different in the future (different than charts on Conficker)?
He wants methods to find info he didn't know was in his dataset. There isn't the time or the tools to do this for him. Cyber-crime is hiding beneath the camouflage of the network data. How do we pick up harbingers?