Don’t Be Scared of Analytics and Forensics:
How to Use These Tools in E-Discovery
By Dr. Gavin W. Manes, CEO of Avansic: E-Discovery & Digital Forensics
By now, legal professionals have been exposed to the concept of e-discovery and the
possibilities that modern tools provide for review. These include advanced
processes, methods, and tools such as analytics, concept clustering, predictive coding,
and de-duplication that offer the possibility of more targeted review. They can be
daunting due to terminology and changing technology, but overcoming that fear
means moving forward shrewdly into the wider world of e-discovery.
Analytics, which is the computational analysis of data to determine facts or patterns,
is an incredibly powerful tool in e-discovery. It has become more affordable and
easier to use, despite some challenges relating to adoption and usability. Indeed,
there are ways to use analytics to find what you’re looking for rather than merely to
eliminate irrelevant items. Understanding concept clustering and predictive coding
(and knowing the difference between the two) can help you determine the best
circumstances in which to apply these sophisticated technologies.
The realm of forensics offers e-discovery help through collection protocols and
metadata handling. Knowledge of forensics hashing and de-duplication methods can
be a big advantage for legal professionals during e-discovery.
How to Use Analytics
There are several ways to use analytics to assist in e-discovery: “more like this”
document location, organizing and prioritizing a review, and identification of exact or
near duplicates for de-duplication purposes.
Finding additional hot documents that are “more like this” is akin to an online
shopping tool recommending other products of interest. Although useful in any
situation, this is a particular time-saver in larger cases. Once a document of interest is
located, it can be used to find other documents that are similar in content. This can
be extremely useful if you aren’t quite sure what you seek or even that it exists.
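
For the technically inclined, here is a minimal sketch of how a “more like this”
search can work under the hood, using TF-IDF vectors and cosine similarity in Python
with scikit-learn. The documents below are hypothetical illustrations, and commercial
review platforms use their own proprietary models.

    # A minimal "more like this" sketch: score every document against a
    # known hot document using TF-IDF cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Quarterly revenue forecast and budget memo",  # the hot document
        "Budget memo for quarterly revenue planning",  # similar content
        "Holiday party invitation and catering menu",  # unrelated
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

    # Compare the first (hot) document against the whole collection.
    scores = cosine_similarity(vectors[0], vectors).flatten()
    for doc, score in zip(documents, scores):
        print(f"{score:.2f}  {doc}")

Documents scoring near 1.0 share most of their meaningful vocabulary with the hot
document and are good candidates for priority review.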
Analytics enables a user to better organize e-discovery review by grouping
documents by ideas and concepts. Linear review becomes more efficient since the
reviewer stays within the same topic and is not constantly shifting gears.
Additionally, it allows for the use of “specialist” reviewers who may have specific
knowledge of a certain document subject (for example, a drug interaction).
Analytics can also be used for text-based de-duplication, which includes exact and
near-dupe identification. This is not the same as forensics de-dupe, which is focused
on exact duplication of the entire contents of a document. Text-based de-duplication
utilizes processes that identify noise words or phrases, repeating text, white space,
headers, and other textual data or formatting. Once this material is removed, the
remaining text can be compared to determine whether there is duplication. If the text is the same and
the human reviewer would see the same content, there is usually not a reason to
review duplicate versions, so the number of documents to be reviewed decreases.
The exception would be context-based review for date, custodian, privilege, and so
on.
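
As a rough illustration of exact text de-duplication, the Python sketch below
normalizes white space and strips a couple of hypothetical noise words before hashing
what remains; documents that reduce to the same hash are treated as duplicates.
Production tools apply far more elaborate normalization rules.

    # Exact text de-dupe sketch: normalize, strip noise words, then hash.
    import hashlib
    import re

    NOISE_WORDS = {"confidential", "draft"}  # hypothetical noise terms

    def text_fingerprint(text: str) -> str:
        words = re.findall(r"[a-z0-9]+", text.lower())
        kept = [w for w in words if w not in NOISE_WORDS]
        return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()

    seen = {}
    for doc_id, text in [("DOC-1", "DRAFT  Merger   terms attached."),
                         ("DOC-2", "Merger terms attached."),
                         ("DOC-3", "Unrelated lunch plans.")]:
        fp = text_fingerprint(text)
        if fp in seen:
            print(f"{doc_id} duplicates {seen[fp]}")  # DOC-2 duplicates DOC-1
        else:
            seen[fp] = doc_id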
For near-dupes, analytics leverages the results of exact de-duplication but allows for
variance in word placement and length to identify similar documents with
material differences. The software provides a “grade” for the similarity of the
documents using machine learning and word proximity. While the removal of common
artifacts is similar across almost all analytics platforms, the grading systems for
determining near-dupes vary (and may be configurable). Unlike exact text de-dupe,
near-dupe analysis may require multiple iterations to produce the desired result.
When used properly, this can be an effective tool to decrease the number of
documents to review.
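
One simple way to compute such a similarity grade, shown below as an illustration
rather than any vendor’s actual method, is to compare overlapping word trigrams
(“shingles”) with Jaccard similarity; the 0.6 cutoff is an arbitrary example.

    # Near-dupe grading sketch: Jaccard similarity over word trigrams.
    def shingles(text: str, n: int = 3) -> set:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def near_dupe_grade(a: str, b: str) -> float:
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

    doc_a = "please review the attached merger agreement before friday"
    doc_b = "please review the attached merger agreement before monday morning"
    grade = near_dupe_grade(doc_a, doc_b)
    print(f"grade {grade:.2f}:", "near-dupe" if grade >= 0.6 else "distinct")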
On a final note, analytics can be used to reduce the review burden, but it should be
combined with a regimented, tested workflow that includes quality control and a
strong emphasis on sampling.
How to Use Concept Clustering
Concept clustering leverages analytics to group similar documents together based on
word proximity and similar phrases; it follows the “more like this” idea discussed
above. It utilizes mathematical formulas to determine similarity based on the proximity
of words, the similarity of phrases, and their placement within the document. The
majority of concept clustering uses unsupervised machine learning. The result is a
relative measure of how similar documents are to each other, which is otherwise a
difficult measurement to take.
Different tools may assess different levels of similarity between the same documents.
However, once concepts have been noted, groups of similar documents can be
created, which feeds directly into a more efficient review.
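
The sketch below illustrates the unsupervised approach with k-means clustering over
TF-IDF vectors; the documents and the choice of two clusters are hypothetical, and
commercial platforms use their own proprietary algorithms.

    # Concept clustering sketch: unsupervised k-means over TF-IDF vectors.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "drug interaction study results for compound A",
        "adverse drug interaction reported in the trial",
        "quarterly sales figures for the northeast region",
        "regional sales targets and quarterly forecast",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    # Documents sharing a label land in the same review batch.
    for label, doc in sorted(zip(labels, documents)):
        print(label, doc)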
How to Use Predictive Coding
Predictive coding is a workflow that leverages analytics, sampling, and sometimes
concept clustering. It uses different variants of supervised machine learning. Most
common tools have workflows built into a review platform, and this technology has
therefore become more accessible: it is no longer necessary to hire mathematicians
to design the process and sample the results, because precision is monitored
throughout.
The most common predictive coding workflow is as follows: select a sample of data,
code that set, and use it to train the algorithm to code the remaining documents.
This is immediately followed by random sampling of the trained results to determine
accuracy, and the process is repeated if precision is insufficient.
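
The Python sketch below walks through one such training round with a hypothetical,
reviewer-coded seed set and an arbitrary 0.8 precision target; real predictive coding
engines are built into review platforms and are considerably more sophisticated.

    # One predictive coding round: train on a coded seed set, then check
    # precision on a held-out sample before trusting the model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score
    from sklearn.model_selection import train_test_split

    # Hypothetical seed set: 1 = responsive, 0 = not responsive.
    texts = [
        "merger agreement draft terms", "merger negotiation timeline",
        "merger due diligence checklist", "agreement on merger price",
        "office holiday party menu", "parking garage access form",
        "cafeteria lunch schedule", "gym membership discount notice",
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]

    X = TfidfVectorizer().fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, stratify=labels, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    precision = precision_score(y_test, model.predict(X_test), zero_division=0)
    print(f"precision on sample: {precision:.2f}")
    if precision < 0.8:  # hypothetical target
        print("below target: code more documents and retrain")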
Forensics Collection
Although it may seem that forensics is not necessary in a run-of-the-mill e-discovery
project, forensics collection and a proper chain of custody can help during document
review.
A true forensics, bit-by-bit copy of native documents is the best way to preserve the
metadata in those documents for retrieval at any time. In fact, better tools may exist
in the future to extract metadata from native documents. However, certain types of
metadata that do not exist within the native file content must be preserved at the
time of collection: the date the file was created, where the file was located, what
computer the file was located on, and the original file name. If collection is not
performed in a forensically sound manner, this data may be lost or altered without
detection, and that information can be useful during document review.
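
As an illustration only, the sketch below records collection-time metadata alongside
a content hash so that original dates, paths, and the source machine survive even if
the copies are later moved; an actual forensics collection relies on write-blockers
and specialized tooling rather than a script like this.

    # Capture collection-time metadata alongside an MD5 content hash.
    import hashlib
    import os
    import socket
    from datetime import datetime, timezone

    def collection_record(path: str) -> dict:
        stat = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return {
            "original_name": os.path.basename(path),
            "original_path": os.path.abspath(path),
            "source_host": socket.gethostname(),
            # st_ctime is creation time on Windows; on most Unix systems
            # it is the last metadata-change time instead.
            "created": datetime.fromtimestamp(stat.st_ctime, timezone.utc).isoformat(),
            "modified": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
            "md5": digest,
        }

    print(collection_record(__file__))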
Forensics De-dupe
An MD5 hash is very useful for determining whether two pieces of data, such as files,
are exactly the same. From a computer science perspective, MD5 is a one-way
mathematical function that takes a string of data as an input and yields an ostensibly
unique value.
In practical terms, MD5 provides a value that acts as a “fingerprint” for the data. If
two MD5 hash values are the same, then the data that produced them is, for all
practical purposes, the same.
In the forensics context, an MD5 hash generally means the hash of the content of the
data, excluding operating-system-level metadata such as created date, filename, and
file path. In e-discovery processing, an MD5 hash can have many different meanings;
in some cases a program may choose to hash only a portion of the data, such as the
values from an email header.
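
The short sketch below demonstrates the forensics sense of the term: two files with
different names but identical bytes produce identical digests, because the hash
covers content only, not operating-system metadata.

    # Content hashing ignores names and paths: same bytes, same digest.
    import hashlib
    import os
    import tempfile

    def md5_of_file(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    with tempfile.TemporaryDirectory() as d:
        for name in ("report_v1.docx", "copy_of_report.docx"):
            with open(os.path.join(d, name), "wb") as f:
                f.write(b"Quarterly report, final version.")
        print(md5_of_file(os.path.join(d, "report_v1.docx")))
        print(md5_of_file(os.path.join(d, "copy_of_report.docx")))  # identical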
MD5 hashing is a very powerful tool in review. For instance, the use of MD5 hashes
doesn’t just allow the identification of duplicates of user-created documents; it also
allows the automatic exclusion of well-known files, such as operating system files
and application executables.
When producing data, an MD5 hash is absolutely necessary in order to prove that
what was produced is the same as what was received. Care is necessary to determine
the specific information the hash encompasses; when receiving a hash in a loadfile,
you should always ask the question, “an MD5 hash of what?” since the hash might
cover only a portion of the data.
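
A verification pass over a production might look like the sketch below; the loadfile
layout (bates, filename, and md5 columns) is hypothetical, and as noted above you
should confirm exactly what each listed hash covers before relying on it.

    # Verify produced files against the MD5 hashes listed in a loadfile.
    import csv
    import hashlib

    def md5_of_file(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    with open("loadfile.csv", newline="") as f:
        for row in csv.DictReader(f):  # expects bates, filename, md5 columns
            actual = md5_of_file(row["filename"])
            status = "OK" if actual == row["md5"].lower() else "MISMATCH"
            print(row["bates"], status)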
Conclusion
E-discovery review can be helped tremendously by analytics and forensics
processes. Specifically, there is rarely a case where analytics and concept clustering
aren’t useful in e-discovery review. Predictive coding is typically helpful where there
is a unique set of time and volume requirements. Hybrid models combine supervised
and unsupervised learning and automate the processes so that little configuration or
knowledge is required of the user. Knowledge of these tools gives the user the
confidence to consider their usefulness for future projects.
About the Author
Dr. Gavin W. Manes is a nationally recognized expert in e-discovery and digital
forensics. He is currently the CEO of Avansic, a firm that provides the legal,
business, and government sectors with e-discovery, digital forensics, data
preservation, and online review services. He founded Avansic in 2004 while serving
as a Computer Science professor at the University of Tulsa. There he led the creation
of nationally recognized research efforts in digital forensics and telecommunications
security.