Using Big Data to Predict Veteran Suicide Risk

Overview

Patterns and Predictions (P&P) is a predictive analytics firm whose core technology provides unstructured and linguistics-driven prediction. It is the technology powering the Durkheim Project’s ‘Big Data’ analytics network for the assessment of mental health risks. Partners include Bloomberg, The Geisel School of Medicine at Dartmouth, Cloudera, and Attivio. Funding sources include the U.S. Government’s Defense Advanced Research Projects Agency (DARPA), and customers include Global 100 companies. The company’s principal partner, Chris Poulin, is co-inventor of the company’s core Centiment® technology.

The Durkheim Project is named in honor of David Émile Durkheim, a French sociologist whose 1897 publication, Suicide, defined early text analysis for suicide risk and provided important theoretical explanations relating to societal disconnection. The project follows its namesake’s lead in its search for what Durkheim referred to as the “qualities” of suicide: those specific patterns and clues that point to suicide risk. The Durkheim Project, though, has one valuable tool at its disposal that the founding sociologist didn’t have: technology.

The Challenge

Suicide is an issue with which the US military has struggled for years. Today, the battle against this pervasive enemy continues, with staggering and persistently increasing casualties. In one of its many articles on the subject, Time reports that the record number of 349 military suicides in 2012 far exceeded the number of American combat deaths in Afghanistan for the same year.
The rate of military suicides is roughly double that of adults in the general US population.[1] In its Suicide Data Report, 2012, the US Department of Veterans Affairs (VA) noted that, “Information on the characteristics and outcomes of veterans at risk for suicide is critical to the development of improved suicide prevention programs.”[2]

The Durkheim Project is well-positioned to deliver on the promise of this crucial information. With its powerful array of advanced analytics, real-time predictive modeling, and machine learning working in concert, the project seeks to identify critical correlations between veterans’ communications and suicide risk in what Fast Company describes as “the most vital use of ‘Big Data’ we’ve ever seen.”[3]

[1] Time, One a Day. July 23, 2012
[2] The US Department of Veterans Affairs, Suicide Data Report, 2012
[3] Fast Company Labs, This May Be The Most Vital Use Of “Big Data” We’ve Ever Seen. July 12, 2012

CUSTOMER SUCCESS STORY

Key Highlights

Industries
• Government
• Healthcare and Life Sciences

Location
• Portsmouth, NH, USA

Business Application Supported
• Predictive analytics that identify risk factors for suicide

Impact
• Accurate, linguistics-driven correlations between real-time communications and suicide risk
• Infrastructure delivers lower cost, better computational throughput, and reduced complexity of IT support

Technologies in Use
• Hadoop Platform: CDH
• Hadoop Components: Cloudera Impala, Cloudera Search
• Servers: Cray grid, Amazon EC2
• Analytic Tools: Patterns and Predictions Centiment®; Attivio

Big Data Scale
• Over 1 TB of jobs processed per day in real time
• Up to 100,000 active-duty personnel and veterans supported in real time

Solution

Phase One

The Durkheim Project began in 2010 with initial funding from DARPA, a research arm of the Department of Defense (DoD), and with prior research from Dartmouth College, with which P&P and Poulin have ties.
Poulin and his specialists are key players in the project’s multi-disciplinary team, which also includes experts in artificial intelligence and medical professionals from private companies, the Geisel School of Medicine at Dartmouth, and the VA.

Phase One of the project began with a study of three cohorts of 100 subjects each, representing “non-psychiatric”, “psychiatric”, and “suicide positive” profiles. The researchers developed linguistics-driven prediction models, generated from unstructured clinical notes, to estimate suicide risk.

In 2011, P&P began sourcing the technology and building out the integrated foundational infrastructure and predictive modeling that would support the project’s extensive data collection and analysis once it was scaled up. Distributed technologies like Apache Hadoop presented a logical solution for an efficient and highly scalable big data platform, but the project required a lightweight machine learning framework that would run on Hadoop and detect real-time risk at scale. “Most of the big data machine learning solutions that were out there were of low performance in accuracy, or highly complex in implementation and in integration with our existing environment,” explained Poulin.

Cloudera’s category leadership and subject matter expertise with Hadoop and big data led Poulin to engage Cloudera Professional Services to co-develop Bayesian counters, a lightweight statistical model that detects risk at scale, based on Apache HBase and CDH (Cloudera’s Distribution Including Apache Hadoop), the market-leading, 100% open source distribution of Hadoop and related projects. The Cloudera-based framework is a cornerstone technology of the Durkheim Project.

The tightly integrated system was “trained” by feeding in isolated statistical indicators: keyword combinations, patterns, and other linguistic clues determined through careful analysis of previous data from a variety of veterans’ database sources.
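The source does not describe how the Bayesian counters are implemented, so the following is only a minimal sketch of the general idea of a count-based Bayesian text model, assuming a naive Bayes formulation with additive smoothing. The class name `BayesianCounter`, the labels `flagged` and `control`, and the placeholder tokens are all hypothetical; the sketch keeps counts in memory rather than in Apache HBase and is not the project’s actual model.

```python
import math
from collections import Counter, defaultdict


class BayesianCounter:
    """Toy count-based naive Bayes model (hypothetical, for illustration only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                # additive (Laplace) smoothing
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> documents seen
        self.vocabulary = set()

    def train(self, tokens, label):
        """Update the counters with one labelled, pre-tokenized document."""
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def _log_score(self, tokens, label):
        """Unnormalized log posterior of a label that was seen in training."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        label_total = sum(self.token_counts[label].values())
        denominator = label_total + self.smoothing * len(self.vocabulary)
        for token in tokens:
            if token not in self.vocabulary:
                continue  # ignore tokens never seen during training
            count = self.token_counts[label][token]
            score += math.log((count + self.smoothing) / denominator)
        return score

    def probability(self, tokens, label):
        """Posterior probability of `label`, normalized over all trained labels."""
        scores = {l: self._log_score(tokens, l) for l in self.doc_counts}
        top = max(scores.values())
        weights = {l: math.exp(s - top) for l, s in scores.items()}
        return weights[label] / sum(weights.values())


# Placeholder tokens stand in for the linguistic indicators; they are not real data.
model = BayesianCounter()
model.train(["term_a", "term_b"], "flagged")
model.train(["term_a", "term_c"], "flagged")
model.train(["term_d", "term_e"], "control")
model.train(["term_d", "term_c"], "control")

p = model.probability(["term_a", "term_b"], "flagged")
```

Because training only increments counters, a model of this kind can be updated one document at a time, which is one plausible reason a counter-based design suits a distributed store; that reading is an inference, not something the source states.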
Once trained, the machine learning could then identify useful clues in real data and establish a risk “score.” Because suicide is such a personal act, and one in which the person tends to keep up an outward appearance of being fine, explained Poulin, “the risk signals are weaker. When you deploy the system at scale, machine learning has to be very sensitive on that big data.”

The Phase One build and testing concluded in early 2013. It validated that the project’s machine learning data fabric was viable, with models that were 65% accurate in predicting suicide risk among a veteran control group.

Phase Two

Phase Two of the Durkheim Project launched in July 2013 and focused, with Cloudera’s involvement, on the project’s ultimate objective of “suicidality prediction at scale” across different types of structured and unstructured data. Facebook joined DARPA in supporting this phase, through the promotion of content of consenting participants for the project’s monitoring purposes. With a target number of 100,000 veteran participants, the data most certainly will be “big.”

Those veterans who opt into the project receive a unique Facebook app and a mobile app for either iOS or Android, all designed to capture posts, Tweets, mobile uploads, and even location. Additional profile data is captured as well, including physician information and clinical notes. To ensure compliance with various privacy and HIPAA regulations, all captured data is stored in a secure environment behind the medical firewall at Dartmouth’s Geisel School of Medicine.

As participants join, individual profiles are set up and made accessible, via a dashboard, to researchers at Geisel and to clinicians.
The system assigns an overall risk score to each profile based on the collective information and on keywords that are specific to each participant. Running text analytics against the continuously fed data pool yields a very large number of variables that can then be compared and analyzed, resulting in a real-time assessment of the participant’s mental health. Said Poulin, “The computational processing to analyze that data requires a big data fabric, but the benefit is that it’s much more informative.”

The technical rubric for the project is “maximum speed at minimum cost”, which prompted adoption of Cloudera Search and Cloudera Impala. “The project has a very complex workflow,” explained Poulin. “All of our machine learning is indexed, and we actually access all of the machine learning through search interfaces, which can get expensive. With Cloudera Search and Impala, our ingestion of data on Hadoop is promisingly efficient in terms of lower costs, better computational throughput, and reduced complexity of IT support.”

Impact

The complexity and sensitivity of the topic of suicide, combined with the intensifying battle that the military faces, make for a very weighty backdrop for the Durkheim Project. In that respect, “the technology aspect of the project has almost been easier than the social engineering,” said Poulin. “If a person is really committed to taking his own life, you need to be both informed and gentle enough to try to help that person find an alternative outcome.”

Still in its initial phases, though, the Durkheim Project is authorized only to monitor and analyze data. While the project has delivered statistically valid results that accurately predict suicide risk in a control group of veterans, its critical research is restricted, at least for the time being, to a non-interventional protocol.
Poulin hopes that continuing to scale the project’s risk classifiers on Cloudera will help to establish the necessary confidence in the project’s ability to assess risk in real time, so that the team can apply for an interventional study.

Phase One of The Durkheim Project predicted suicide risk among a veteran control group with 65% accuracy, demonstrating statistical significance.

“One of the promises of big data in this case,” Poulin stated, “is that you can shorten the distance between the people who need help and the system that can get them help. That is our goal, and one we want to continue to move toward with Cloudera as our partner.”

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, process and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Only Cloudera offers everything needed on a journey to an enterprise data hub, including software for business critical data challenges such as storage, access, management, analysis, security and search. As the leading educator of Hadoop professionals, Cloudera has trained over 40,000 individuals worldwide. Over 1200 partners and a seasoned professional services team help deliver faster time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.

www.cloudera.com
1-888-789-1488 or 1-650-362-0488
Cloudera, Inc., 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries.
All other trademarks are the property of their respective companies. Information is subject to change without notice.