Executive Briefing Series (Volume 6, Number 1), January 2013
Big Data and Business Analytics: Realizing Opportunities
An Executive Summary of the November 16, 2012 Workshop
Written by Dr. Erran Carmel and Mr. Michael Carleton
Edited by Dr. Gwanhoo Lee and Ms. Marianne Du

Contents
1. Presentations
   Michael Brown, Chief Technology Officer, comScore, Inc.
   Michael W. Carleton, Senior Research Fellow, American University (former CIO, U.S. Department of Health and Human Services)
   Dr. Erran Carmel, Professor, American University
   Jill DeGraff Thorpe, Vice President, Strategic Initiatives & General Counsel, AFrame Digital
2. Group Discussion
   Facilitated by Michael Carleton, Senior Research Fellow, CITGE

Foundations and Questions
By Dr. Erran Carmel, Professor, American University, and Michael W. Carleton, Senior Research Fellow, American University (former CIO, U.S. Department of Health and Human Services)

Carmel set the groundwork on Big Data, giving the audience some background. He used this definition: Big Data is the amount of data just beyond technology's ability to store, manage, and process efficiently.

Carmel asked: How big is Big? (Carleton later called it the data deluge.) At the organizational level, "Currently Big Data refers to data volumes in the range of exabytes (10^18 bytes) and beyond." He cited an estimate that roughly 2 zettabytes (2x10^21 bytes) of data were created in 2011 alone. The numbers are staggering:
- Facebook creates 10 terabytes of data every day.
- The Large Hadron Collider generates about 40 terabytes per second.
- 30 billion RFID tags are produced every year.
- Mammograms create vast volumes of data, but interpretation is uncertain.

We are now in the early Industrial Revolution of Data. Data is the new "raw material of business," says The Economist. The rate of data generation exceeds storage capacity, and it certainly exceeds our ability to use it. Carleton pointed out that we often collect large volumes of data that we do not yet know how to interpret. The public assumes that all this data will somehow make us "better," but we need to find the sweet spot between the resources available for data collection and the value of the data.

Do you know the Three V's of Big Data? You should by now; they appear in the introduction to every big data discussion:
- High Volume
- High Velocity
- High Variety - the data are very complex (from sensors, the internet, etc.)

Do you know the Types of Analytics? You should also know these; they too appear in the introduction to every big data discussion:
- Descriptive - e.g., medical
- Estimative
- Predictive - business, politics (Nate Silver's predictions of the Obama win)
- Prescriptive - medical

Carleton offered several observations about Big Data from a practitioner and public-sector point of view, emphasizing the opportunities to create commercial business value and social value while avoiding offending consumers and citizens. As an example of the potential pitfalls, he raised the need to ask more questions about security and privacy before proceeding with aggregations and mash-ups from disparate data sources, specifically:
- How can data be shared without compromising security/privacy?
- How do non-government actors use big data, and what are the security/privacy implications? (Take note: from a legal perspective, we should fear "Little Brother" (private business entities') uses of personally identifiable information more than "Big Brother" (government) uses.)
- How do we address the severability of data creation from value creation?
- What influences consent rates for data creation/sharing?

Each country has different definitions of privacy and propriety, with differing frameworks around whether personally identifiable data is protected as property or as a human right.

Carleton shared more examples of "the data deluge": high rates of data capture (there are already 1 billion transistors per human on the planet), the growing gap that Tim O'Reilly measures between "information created" and "available storage," and the explosion of unstructured data such as graphics and video files. He then offered an overview of the work by the National Institute of Standards and Technology (NIST) to come to terms with the emerging technologies in a manner that balances beneficial opportunities with avoidable risks. After touching on NIST's differentiation between SQL and NoSQL Big Data frameworks, Tene and Polonetsky's urgent call for an update of privacy protections, and Armour and Kaisler's taxonomy of types of analytics, Carleton shared some anticipated findings from work in progress on Big Data business cases.

He asked about the widely varying business models emerging around Big Data, especially data that originates in the public sector and is put to use (often in creative secondary and tertiary uses) by private-sector businesses and academic centers. These include, but are apparently not limited to:
- Use dynamic data streams (a heavy flow of real-time data)
- Keep an exclusive "treasure trove" of data and sell it to others
- Data is free, but the algorithms are proprietary
- A big-data-enabled value chain

He then offered a concise contrast between Federal government stewardship of big data sets and commercial-sector exploitation of big data sets:
- There are many constraints (legislative and regulatory) on the collection of information by the Federal government.
- Politicians often restrict data use to avoid regulation and gain votes.
- Government can only collect data for defined purposes and only retain it for limited periods.
- Agencies are dissuaded from multiple uses of data.
- Sustaining Federal government information stewardship requires cooperation from citizens and government transparency.

Next, Carleton examined the ease and consequences of decoupling value creation and value extraction in the Information Age, touching on:
- Value creation vs. value extraction
- Value creation using analytics on others' data, avoiding the cost of generating the data
- Value extraction: running algorithms to finely target some sectors and avoid, or intentionally neglect, others
- A potential conflict in which creators do not want to share with extractors

Should we treat taxpayer-funded data sets as a public asset to be stewarded like public lands or water resources? Carleton gave an example close to his heart from DHHS: the NHANES National Youth Fitness Survey (NYFS), which collects data on the exercise and nutrition habits of U.S. youth through interviews and fitness tests.

Case 1: comScore
By Michael Brown (a founding member of comScore and an entrepreneur for two decades)

comScore, based in Reston with 1,000+ employees, is a huge data collector and analyzer. It is essentially a data factory and delivers "digital intelligence." The growth is staggering: 250 million records created every day in 2005; 1.6 billion created daily in 2009; 2.5 billion per hour in 2012.
As data volumes increased after 2005, comScore realized that it needed to update its methodology. It created tracking code for websites, so that every visit creates a record. A diagram presented at the workshop showed the two main flows routed into one of two big data systems: Greenplum, a big SQL database (the Greenplum database is a lab for experimentation and ideation, not production), and Hadoop. Both of these converge into a data warehouse in Sybase.

Now come the big business challenges. Can we gain new value or insight from data beyond its original intent? Can you gain new revenue? Solve a new problem? comScore is now able to develop an actual revenue-generating product in four hours using roughly 100 lines of PSQL. Its Device Essentials product shows some interesting trends: comScore used the dataset to determine the non-PC device share of online traffic by country. For example, mobile is very popular in Singapore, less so in North America; iPhone users prefer Wi-Fi/LAN connections, while Android users prefer mobile networks. (A brief illustrative sketch of this kind of processing follows the Q&A below.)

The audience asked:
- Is privacy a concern? Brown answered that comScore designs its systems to destroy the native IP address (obfuscation) so information can be logged without violating privacy laws; other types of personal information are stripped from the dataset before processing.
- What has to be disclosed to users? There is an opt-out page, and clients must sign agreements on data use.
- How do you account for user self-selection during data generation? They take a cross-section and bundle it with demographics.
- What about user preferences (e.g., Washington Post vs. New York Times)? Data is compartmentalized to avoid bias.
- What development framework do you use? Agile/Scrum, with a 250-member team, about 30% of whom are software engineers.
- How does the experimental environment transfer to an individual company? If companies want insights from big data, can they run analysis on smaller scales? Software has historically been requirements-based; big data demands a flexible framework that can handle undiscovered requirements. Large, flexible datasets require flexible algorithms. Tools are less important than the philosophy of data selection.
- How will technology need to evolve to accommodate large data sets? Speed is absolutely core (refer to Dr. Black's last lecture), and we must refine ways to make algorithms more efficient.
- What are the interesting research areas? What do you need to know? Parallelized algorithms, sampling methods, and translating anonymous data to discrete consumers.
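Two technical points above lend themselves to a short illustration: destroying the native IP address before a visit is logged, and computing the non-PC device share of traffic by country. The Python sketch below is a minimal, hypothetical rendering of those ideas; the record fields, the salted-hash obfuscation scheme, and the function names are illustrative assumptions, not comScore's implementation, which was reportedly built as PSQL over its Greenplum/Hadoop infrastructure at vastly larger scale.

```python
# Hypothetical sketch of two ideas from the comScore discussion:
# (1) obfuscate the native IP address before a visit record is logged, and
# (2) aggregate non-PC device share of traffic by country (Device Essentials-style).
# Field names and the salted-hash scheme are illustrative assumptions, not
# comScore's actual implementation.
import hashlib
from collections import defaultdict

SALT = b"rotate-this-secret-regularly"  # hypothetical salt for one-way hashing

def obfuscate_ip(ip: str) -> str:
    """Replace the native IP with a one-way salted hash so raw IPs are never stored."""
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:16]

def log_visit(raw_event: dict) -> dict:
    """Strip personal identifiers from a raw tracking event before it is logged."""
    return {
        "visitor": obfuscate_ip(raw_event["ip"]),
        "country": raw_event["country"],
        "device": raw_event["device"],   # e.g. "pc", "phone", "tablet"
    }

def non_pc_share_by_country(events):
    """Percentage of logged traffic coming from non-PC devices, per country."""
    total, non_pc = defaultdict(int), defaultdict(int)
    for e in events:
        total[e["country"]] += 1
        if e["device"] != "pc":
            non_pc[e["country"]] += 1
    return {c: 100.0 * non_pc[c] / total[c] for c in total}

if __name__ == "__main__":
    raw = [
        {"ip": "203.0.113.7", "country": "SG", "device": "phone"},
        {"ip": "203.0.113.8", "country": "SG", "device": "pc"},
        {"ip": "198.51.100.2", "country": "US", "device": "pc"},
    ]
    logged = [log_visit(e) for e in raw]
    print(non_pc_share_by_country(logged))  # e.g. {'SG': 50.0, 'US': 0.0}
```

In a production pipeline the same shape of computation would presumably be expressed as set-based SQL over the warehouse rather than in-memory Python, but the order of operations is the point: anonymize at ingestion, then aggregate.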
The mesh network in the home environment triangulates user movements in house to identify patterns of behavior (using the bathroom too often, visiting the kitchen at night, etc.) For example – 91 year old woman fell, became unconscious, but system detected fall and warned family, got her to hospital, showed doctor three months of sensor data to guide treatments. This illustrates the need to develop way for aging and unhealthy population to live with dignity while reducing unsustainable healthcare costs. It is especially important for underserved communities. Jill asked: How can we bring intelligent alerting and analytics to relevant data? first, medical use is most important issue for data systems. For example: IBM Watson used for clinical diagnosis and care management instructions. Providers are beginning to recognize that data gathered in the home is more important than clinical data. Such next generation providers are developing holistic, 24/7 medical data collection. The goal is to identify dangerous trends and warn users/doctors. Contrast that with the older “Lifeline” which had a stigma, Aframe must develop nonintrusive socially acceptable data gathering with attractive value proposition for consumer. How can researchers integrate with this? NIH and DARPA grants are a good foundation. A project example: effectiveness research to see if real-time continuous monitoring has positive impact on user health and treatment. A key weakness in our healthcare is that 5% of population creates 50% of expenditures. Health reform aims to improve return on expenditures. Spending rates level off after childhood then spike after age 65. Where do these high expenditures occur? Transfer across care settings (highest risk for re-admittance to hospital). One of the settings is the home. So we need to adopt home as primary point of care. There is emerging consensus among doctors and medical researchers of importance of home care. There is a need to migrate from episodic to continuous care. Tele-monitoring enables new models of care. It is expected that an integrated approach of enabling doctors to deliver new outcomes. There were many questions from the audience to this game changing technology for health and for health data. -4©2012 Center for IT and the Global Economy, Kogod School of Business, American University CITGE Executive Workshop January 29, 2013 What about monitoring patient compliance? Jill responded: we’re a $180 billion annual market for medication adherence. Many technologies in use, but none incorporate biofeedback so none are perfect.; but AFrame’s product monitors biofeedback (heart rate, blood pressure) to track medication consumption What types of data do you collect? There are 3 dimensions: Stability (accelerometer) about 10% of data; Biometrics – any wireless sensor device can be pulled for vitals tracking data; Activity – measure rates of exertion and patterns of behavior. Does device have Wi-Fi? No because, it would create too much battery drain. Microcontroller does some analytics and then opens net connection when threshold reached. How mainstream is this? Is it limited to the upper range of the market? There is interest in demand across market, but focused on demand driven by health reform and pay for performance ideas. PACE program is used as model for super-generalized data. It is very complicated for indigent elderly to navigate Medicare/Medicaid framework – so the firm developed a per-user model. New medical payment and treatment models will spur demand. 
Carmel - Big Data in Academic Research

In this young area there are two vectors. The management perspective is mainly advocacy-based research with very little empirical work, whereas the computer science community is all over this: very gung-ho about Big Data, with great innovation in algorithm design and orders of magnitude more output (100+ articles on Big Data are added to the ACM library every month, and at least three major workshops have been held around the world). Carmel gave a sampling of research and topics in this field:
- CluChunk looks at clustering large-scale data: how to take large amounts of messy data and organize it into manageable and efficient chunks to optimize performance (a generic sketch of this chunking idea follows at the end of this section). It is CS-focused and looked at the effectiveness of NoSQL databases for managing big data.
- Flex-KV, by authors from IBM and Carnegie Mellon, looks beyond NoSQL solutions and proposes a flexible key-value storage system.

Why is the management perspective lagging? Probably because it tends to look at the past rather than the future. Our field needs to incorporate predictive analytics for theory generation across academia. One article Carmel does recommend is the LaValle et al. survey article in MIT Sloan Management Review, a study sponsored and run by IBM looking at large corporate users of Big Data.

Carleton - The Philosophy of Big Data

Mike ended by discussing the philosophy of Big Data. Current models focus on building a digital platform for a business and owning part of the architecture. The key question in his mind is: what are the data transmission requirements for business success? Businesses must decide how to handle big data - develop their own framework or hire experts? Many opportunities are being created; the potential for value creation is nearly infinite, and the rewards are so great that their distribution is nearly arbitrary.
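The CluChunk work cited above concerns organizing data that is too large to handle in one pass into manageable chunks. As a generic illustration of that idea (not the CluChunk algorithm itself), the Python sketch below clusters a large stream of points chunk by chunk, updating cluster centers incrementally in the style of mini-batch k-means; the chunk size, initialization, and update rule are illustrative choices.

```python
# Generic illustration of chunked clustering: when a dataset is too large to
# cluster in one pass, process it in fixed-size chunks and update cluster
# centers incrementally (mini-batch k-means style). Not the CluChunk algorithm.
import random

def chunks(stream, size):
    """Yield successive fixed-size chunks from an iterable of points."""
    buf = []
    for point in stream:
        buf.append(point)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def nearest(point, centers):
    """Index of the center closest (squared Euclidean distance) to the point."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

def cluster_in_chunks(stream, k, chunk_size=1000, seed=0):
    """Incrementally cluster a large stream of points, one chunk at a time."""
    rng = random.Random(seed)
    centers, counts = [], []
    for chunk in chunks(stream, chunk_size):
        if not centers:                      # initialize centers from the first chunk
            centers = rng.sample(chunk, k)
            counts = [1] * k
        for p in chunk:
            i = nearest(p, centers)
            counts[i] += 1
            eta = 1.0 / counts[i]            # per-center learning rate
            centers[i] = tuple(c + eta * (x - c) for c, x in zip(centers[i], p))
    return centers

if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)] + \
           [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5000)]
    rng.shuffle(data)
    print(cluster_in_chunks(data, k=2, chunk_size=500))  # centers near (0,0) and (10,10)
```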
Open Discussion

Question for the audience: How has big data affected your business?
- There is huge expense up front (JHU spent $600 million on its record system, the single biggest expense after new buildings), so projects must be carefully selected to maximize the value returned.
- Start with a set of problems, find/buy/access data, and build models and potential solutions. But some problems don't have solutions - the solution perturbs the problem - and that requires constant, incremental data gathering.
- The extremes are data mining (no domain knowledge in the algorithms; process everything) and its antagonist (look at the details and you miss the forest, but look at the forest and you lose the details).
- We need to pay attention to government use of big data. It is already being used in the intelligence community and healthcare, but academics should look for opportunities in other federal agencies, especially in light of tightened fiscal budgets.
- Is there data already owned by agencies that could be reprocessed for new value (especially "dark data")? Are there nuggets in stored data that can be analyzed anew? Sensor data from flights?
- Mike: The Federal government has taken the position that if it exposes data to the public, someone will use it. But should the feds try to add value to the data? There is a big ideological divide. Who gets to repackage and sell data once it has been collected? Government is taking a hands-off approach to data value and leaving it to the private sector.
- There are explicit restrictions on the use of medical and financial data, so agencies are averse to exposing it to the public. Other agencies are so overwhelmed by the volume of data that they lack the resources to extract additional or secondary uses. There is also a fear that politicians and the public will misinterpret the data, due to a poor understanding of statistics, and cause more problems. But there is still a possibility of creating a value chain in which the government collects large data sets and makes them available for public processing.
- The big seller and driver for openness will be fraud prevention. Once value has been proven, the public will want to create other value, such as disease prediction.
- Government often collects large volumes of data but only uses the small subset related to its mission. The government does not see data collection as its role.
- A former astrophysicist with Crohn's disease keeps an "amazingly complete" data set about his digestive system.
- There are many restrictions on when and how government can collect data. The regulatory framework is still based on the Paperwork Reduction Act, from a time when data collection was expensive, and the rules are focused on recurring rather than opportunistic collection.
- So much big data is being collected that the challenge is sifting through large sets to figure out what is valuable. Storage growth is very important.
- How can businesses educate customers on the different types of data? The challenge is developing good rules on how and where to use datasets, and then communicating them to customers.
- The ideal big data professional is a skilled programmer who can work with complex technical tools and a mathematician who can tailor and develop algorithms. Maybe that is too many skills for one person to master; the key to managing big data may be assembling strong cross-disciplinary teams.
- Businesses want to capture every single customer interaction, down to discrete page views, and develop tools to react in real time.
- Business readiness to embrace big data is very important too. Good data does not guarantee good decisions, and giving companies more data will not always lead to better decisions. Integrating big data is as much about refining management processes as it is about developing new technology.
- Some businesses are paralyzed by the scope and difficulty of big data systems: they may have access to useful datasets but lack the technical or managerial skills needed to leverage them.
- Organizations need to develop ways to openly share large datasets, at least within the organization. This again points to the importance of cross-disciplinary teams.
- See the Cheesecake Factory for a good example of a big data implementation, as described in this article: http://www.newyorker.com/reporting/2012/08/13/120813fa_fact_gawande
- The tools are not suited to the average person. Developing new tools that allow people to visualize and analyze big data is key to fully leveraging it.
- One of the things we need to think about is: how much data is enough? You can gather so much data that you cannot effectively process it, and there is a point where additional data produces diminishing returns.
- We need to develop ways to measure the "goodness" of data, both in terms of quality and of value delivered. It is also important to account for the motives of data creators: are they truth-seeking, or do they have an agenda? Different specialists can reach different conclusions from the same dataset even if motives are pure. We are operating under the theory that giving experts more data will guarantee better conclusions, and there is evidence that this is not true.
- What about NP-hard problems, where more data will have little or no meaningful impact?
- Behavioral economics shows a degradation of people's reasoning skills as the size of a data set increases.
- Companies would be more willing to share data if there were a stronger framework for sharing the value derived from it.
- The rising importance of big data highlights the importance of strong search algorithms.

Presenter Bios

Dr. Erran Carmel
Professor, American University
Professor Carmel teaches information technology with a specialty in the globalization of technology. He studies global software teams, offshoring of information technology, and the emergence of software industries around the world. His 1999 book "Global Software Teams" was the first on this topic and is considered a landmark in the field, helping many organizations take their first steps into distributed tech work. His second book, "Offshoring Information Technology," came out in 2005 and has been especially successful in outsourcing/offshoring classes. He has written over 80 articles, reports, and manuscripts. He consults and speaks to industry and professional groups.

Michael Carleton
Senior Research Fellow, American University (former CIO, U.S. Department of Health and Human Services)
Michael W. Carleton served as Chief Information Officer (CIO) for the United States Department of Health and Human Services (HHS) and the General Services Administration (GSA). Mr. Carleton holds a Master of Science in Information Resources Management from Syracuse University and a Master of Public Administration from Northeastern University. He is also a distinguished alumnus of the National Defense University's Information Resources Management College and the Society for Information Management International's Regional Leadership Forum. He is a past president of the Capital Area Chapter of the Society for Information Management.

Michael Brown
Chief Technology Officer, comScore, Inc.
Michael Brown was a founding member of comScore, Inc. in 1999. He leads the company's technology efforts to measure Internet and digital activities. He has been responsible for over 17 patent applications at comScore, three of which have already been issued by the U.S. Patent and Trademark Office. Prior to joining comScore, Mike worked on projects that included a large help desk deployment and modernization effort for Deutsche Bahn in Frankfurt, Germany.
In 1993, Brown cofounded Pragmatic Image Technologies, a consulting group focused on implementation of IBM's ImagePlus. One of the core projects completed was the successful rollout of ImagePlus at Pennsylvania Blue Shield to over 1,200 users, resulting in the largest image workflow installation on the East Coast at that time. Brown holds a bachelor's degree in computer science from the University of Maryland and a master's in computer and information science from Hood College.

Jill DeGraff Thorpe
Vice President, Strategic Initiatives & General Counsel, AFrame Digital, Inc.
Jill DeGraff Thorpe brings 20 years' experience advising public and private companies in corporate, strategic partnering, M&A, structured finance, technology acquisition, and private equity transactions. She is familiar with all corporate operating functions and has advised on key matters in intellectual property, product development, sales and marketing, contracts, corporate compliance, risk management, and human resources. Ms. DeGraff Thorpe was Associate General Counsel for CyberCash, serving as its corporate and securities counsel. Before that, she practiced law at Morrison & Foerster, specializing in corporate, securities, and financial transactions. She holds a B.A. cum laude from Wellesley College and a J.D. from the University of Virginia School of Law.

Confirmed Attendees (ordered by affiliation)
Jill Thorpe - AFrame Digital, Inc. - Vice President, Strategic Initiatives & General Counsel
Engin Cakici - American University - Assistant Professor
Mike Carleton - American University - Senior Research Fellow
Erran Carmel - American University - Professor
Mary Culnan - American University & Bentley University - Senior Research Fellow and Professor Emeritus
William J. DeLone - American University - Professor
Alberto Espinosa - American University - Associate Professor
Keyvan Gheissari - American University - Student
Michael Ginzberg - American University - Dean, Kogod School of Business
Itir Karaesmen-aydin - American University - Assistant Professor
Jill Klein - American University - Director, Professional MBA; Executive in Residence, Information Technology
Irene Lam - American University - OIT, Enterprise Systems
Gwanhoo Lee - American University - Associate Professor and Director, CITGE
Kelsey Lee - American University - Student
Phyllis Peres - American University - Senior Vice Provost and Dean of Academic Affairs
Kamalika Sandell - American University - Associate CIO, Office of Information Technology
First names: Matthew, Bob, Sandra, Paritosh, Margaret, Larry, Filippo, Michael, Mindy, Yvonne, Patrick, Vanessa, Ron, Shannon
Last names: Sloan, Smothers, Uttarwar, Weber, Fitzpatrick, Morelli, Brown, Ko, Chaplin, Murray, Sherman, Renjilian
Organizations: American University (5), Computech, Inc. (2), comScore, Inc., COTELCO Center, CSC (3), Emerios Government Services
Titles: Student, Faculty, Student, Student, Student, President, Chief Technology Officer, Chief Technology Officer, Graduate Research Associate, Partner, Consultant, Partner, President
Mohamoud Jibrell - Howard Hughes Medical Institute - VP for Information Technology
Steve Kaisler - i_SW Corporation - Senior Scientist
Peter Keen - Keen Innovations - Chairman
Curtis Generous - Navy Federal Credit Union - Chief Technology Officer
Carol Hayes - Navy Federal Credit Union - Assistant Vice President, Enterprise Data
Names: Susan Bennett, Bill DeLeo, Prasanna Lal Das, Jason Bongard
Organizations: Paragon Technology Group, Inc., SAS, World Bank
Titles: Strategy Services Program Manager; Director, Release Engineering; Lead Program Officer

CITGE Executive Team
- Dr. William H. DeLone, Executive Director, CITGE; Professor, Kogod School of Business, American University
- Dr. Gwanhoo Lee, Director, CITGE; Associate Professor, Kogod School of Business, American University
- Dr. Richard J. Schroth, Executive-in-Residence, Kogod School of Business, American University; CEO, Executive Insights, Ltd.
- Michael Carleton, Senior Research Fellow; former CIO, U.S. Department of Health and Human Services
- Dr. Frank Armour, Research Fellow

CITGE Advisory Council
- Steve Cooper, CIO, Air Traffic Organization, Federal Aviation Administration
- Bill DeLeo, Director of Release Engineering Architecture, SAS
- Mohamoud Jibrell, CIO, Howard Hughes Medical Institute
- Joe Kraus, CIO, U.S. Holocaust Memorial Museum
- Ed Trainor, former CIO, Amtrak
- Susan Zankman, SVP of Information Resources Finance and Management Services, Marriott International

Associated Faculty and Research Fellows
- Dr. Erran Carmel, Professor, Kogod School of Business, American University
- Dr. J. Alberto Espinosa, Associate Professor, Kogod School of Business, American University
- Dr. Peter Keen, Distinguished Research Fellow; Chairman, Keen Innovation
- Dr. Mary Culnan, Senior Research Fellow; Slade Professor of Management and Information Technology, Bentley College