Data Science Strategy & Project Report (dASTRA)

STRATEGY DOCUMENT AND DETAILED PROJECT REPORT (DPR) ON DATA SCIENCE, TECHNOLOGY, RESEARCH & APPLICATIONS (dASTRA)

DRAFT REPORT

DEVELOPED FOR
DEPARTMENT OF SCIENCE & TECHNOLOGY, Ministry of Science & Technology, Government of India

CONSULTANCY DEVELOPMENT CENTRE, 2nd Floor, Core IV-B, India Habitat Centre, Lodhi Road, New Delhi – 110003

TABLE OF CONTENTS

Executive Summary
List of PDAC Members
1. Introduction: New Generation Computational Paradigm
2. Data Science & Technology
3. Data Science – Research & Development
4. Data Science Applications
5. Entrepreneurship Development & Start-ups
6. Data Science Policy Perspectives
7. Training & Capacity Building
8. Investments – Detailed Project Report
9. Conclusions
List of Abbreviations
List of Tables
List of Figures
References
Annexure
Acknowledgements

EXECUTIVE SUMMARY

Data Science, Technology, Research and Applications (dASTRA)

Data is becoming ever cheaper and more important. We are now digitizing analog content that was created over centuries and collecting myriad new types of data from web logs, mobile devices, sensors, instruments, and transactions. One study estimates that 90 percent of the data in the world today has been created in the past two years, and the volume is growing manifold day by day. At the same time, new technologies are emerging to organize and make sense of this avalanche of data. We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship, improve the human condition, and create commercial and social value. The rise of “big data” has the potential to deepen our understanding of phenomena ranging from physical and biological systems to human social and economic behavior. Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago.

Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization that wants to attract a community is how to use data effectively: not just its own data, but all of the data that is available and relevant. Our ability to derive social and economic value from the newly available data is limited by a lack of expertise. Working with this data requires distinctive new skills and tools. The corpora are often too voluminous to fit on a single computer, to manipulate with traditional databases or statistical tools, or to represent using standard graphics software. The data is also more heterogeneous than the highly curated data of the past. Digitized text, audio, and visual content, like sensor and weblog data, is typically messy, incomplete, and unstructured; it is often of uncertain provenance and quality; and it frequently must be combined with other data to be useful. Working with user-generated data sets also raises challenging issues of privacy, security, and ethics.

Scientific progress is the result of relentless academic research. The scientific community has for some time been focused on the growing challenges of Big Data in a number of disciplines. The immense repository of past and current academic knowledge is increasing at an exponential rate, and readily qualifies as Big Data in terms of volume, variety and velocity of growth. Estimating the veracity of this data also presents challenges. As the amount of knowledge in an academic field grows, a quick assessment of the state of the art in any sub-field becomes that much harder. One way to accelerate the process of discovery is to significantly enhance current search capabilities to support deep scientific queries.
This includes:

i) improving the efficiency and depth of search by enabling segmentation and recognition of all the components of a traditional academic research paper, including graphs, tables, and diagrams;
ii) developing tools to integrate various sources of information on any topic, not just from textual content but often from parallel channels such as video, speech, and the web, in order to gain a comprehensive understanding of the topic; and, most importantly,
iii) uncovering non-obvious connections between methods, features, data, constraints, and parameters across the spectrum of reported scientific data using advanced data mining approaches.

We believe that this will require enhancements to the state of the art in a variety of disciplines such as computer vision, pattern recognition, Natural Language Processing (NLP) and fusion of classifiers. We will make a case for the viability of this plan and step through a case study in machine learning techniques for combining classifiers. We believe that the development of such technologies is also likely to have significant broader societal impact.

Big Data

By definition, Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. In other words, big data is characterised by volume, variety (structured and unstructured data), velocity (a high rate of change), veracity (uncertainty and incompleteness) and value. By 2017, the global big data industry is expected to be a USD 25 billion industry. Nasscom predicts that the Indian big data industry will be worth more than 1 billion in the coming years.

Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, sensor data, etc. that we produce and share every second. We are not talking Terabytes but Zettabytes or Brontobytes.
On Facebook alone we send 10 billion messages per day, click the "like" button 4.5 billion times and upload 350 million new pictures each and every day. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute! This increasingly makes data sets too large to store and analyse using traditional database technology. With big data technology we can now store and use these data sets with the help of distributed systems, where parts of the data are stored in different locations and brought together by software.

Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activity, or the milliseconds it takes trading systems to analyse social media networks and pick up signals that trigger decisions to buy or sell shares. Big data technology now allows us to analyse data while it is being generated, without ever putting it into databases.

Variety refers to the different types of data we can now use. In the past we focused on structured data that fits neatly into tables or relational databases, such as financial data (e.g. sales by product or region). In fact, 80% of the world’s data is now unstructured and therefore cannot easily be put into tables (think of photos, video sequences or social media updates). With big data technology we can now harness different types of data (structured and unstructured), including messages, social media conversations, photos, sensor data, and video or voice recordings, and bring them together with more traditional, structured data.

Veracity refers to the messiness or trustworthiness of the data.
With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hash tags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of the content), but big data and analytics technology now allows us to work with this type of data. The volumes often make up for the lack of quality or accuracy.

Value: There is another V to take into account when looking at Big Data: value. It is all well and good having access to big data, but unless we can turn it into value it is useless. One can therefore safely argue that value is the most important V of Big Data. It is important that businesses make a business case for any attempt to collect and leverage big data. It is all too easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.

Wireless sensor technology has advanced to the point where it is feasible to equip even everyday items with a variety of sensors and to measure state at a frequency and scale not possible a few years ago. This development has led to the idea of an “internet of things” and to the application of data-driven analytics in different domains. We now talk of smart cities and villages, where each component of infrastructure can be closely monitored and controlled for efficient use of resources and a higher quality of living. Efficiently storing, managing and processing the data generated in the process requires the development of new algorithms and new approaches to traditional problems. Specific areas in which both domain expertise and access to data are available among the co-investigators are transportation, power and water distribution networks, health care, agriculture and food products, and finance.

Empirical research is a critical component of comprehensive scientific enquiry in any academic discipline. The central idea of such an approach is the use of data to solve problems or achieve objectives.
The increasing ability to store large data sets and process them effectively, together with the advent of powerful algorithms to analyse the data, has increased the importance of such an approach in making significant contributions to an academic discipline. This trend is particularly prominent when contrasted with theory-based approaches, which rely on axiomatic extensions based on principles of reasoning. While these two approaches are not mutually exclusive, and in most cases should augment each other in providing a comprehensive understanding of the subject of investigation, the advent of the petabyte age has greatly tilted research contribution in favour of the data-driven approach. At the very least a data-driven approach seeks to validate findings from fundamental theory, and at best it wholly supplants the need for any fundamental theory to make conjectures and testable hypotheses. In this light, it becomes critical for researchers to use and work with data and analytics to make a meaningful contribution to the body of literature in their respective fields.

The impact of this work can also be felt in the area of data security and privacy. The proposed project requires strengthening the perimeter security of a data centre as well as its access control policies, mechanisms and architecture.

International Initiatives

Big Data for Development: United Nations (UN) initiatives

The recent waves of global shocks (food, fuel, and financial) have revealed a wide gap between the onset of a global crisis and the availability of actionable information that can help protect the world’s most vulnerable populations against further regressions.

Traditional statistics, household surveys and census data have been effective in tracking medium- to long-term development trends, but can be ineffective in generating the type of real-time picture that decision makers need in order to develop timely responses to ongoing issues.
For example, much of the data used to track progress toward the Millennium Development Goals (MDGs) dates back to 2008 or earlier and does not take into account the more recent economic crisis. While this may feed a perception that there is a scarcity of information about the wellbeing of populations, the opposite is in fact true. Thanks to the digital revolution, there is an ocean of data, continuously generated in both developed and developing nations, that did not exist even a few years ago.

Since its inception in 2009, Global Pulse has been investigating the viability of using new and alternative data sources to support development goals. This includes data from:

i. Online Content - Public news stories, blogs, Twitter, Facebook, obituaries, birth announcements, job postings, e-commerce, etc.
ii. Data Exhaust - Anonymized data generated through the use of services such as telecommunications, mobile banking, online search, hotline usage, transit, etc.
iii. Physical Sensors - Satellite imagery, video, traffic sensors, etc.
iv. Crowdsourced Reports - Information actively produced or submitted by citizens through mobile phone-based surveys, user-generated maps, etc.

It has become clear that protecting social development gains requires the ability to profile and respond, quickly and as accurately as possible, to crises that have the potential to undo decades of development work. Today’s shocks are fast, global, and fluid, and they demand more agile response systems. The private sector is already finding ways to analyze this new data efficiently. Innovative companies are using real-time analytics to better understand the changing needs of their customers and to respond with more agile platforms.

The United Nations (UN) Global Pulse is working to design approaches for harnessing big data and real-time analytics to monitor development progress, emerging vulnerabilities and the overall well-being of the populations the UN serves.

Other global initiatives

International interest in data science has been growing over the last decade, and over the last two years it has drawn frenzied involvement from various quarters. This has led to what is often heralded as the ‘Big Data’ Revolution. The activity in this area is at different levels, ranging from governments that look to data science to address their problems (making cities safer, lowering energy dependency, tackling healthcare, agriculture, etc.), to businesses that aim to be more profitable, and finally to academic institutions that conduct research to improve the knowledge we gain from information. A review of some of the key international research activities is presented below:

The White House announced in March 2012 a "Big Data Research and Development Initiative" that consisted of six Federal departments and agencies. This 200 million dollar initiative works with the NSF (National Science Foundation), NIH (National Institutes of Health), Department of Defense, Department of Energy, and the U.S. Geological Survey. The initiative is aimed at helping to solve some of the United States’ “most pressing challenges by improving the ability to extract knowledge and insights from large and complex collections of digital data.”

The European Commission has funded the “Big Data Public Private Forum”. It partners with 11 other institutes (industry and academia), with the vision of building a self-sustainable industrial community around Big Data in Europe.

In May 2012 Intel entered into a partnership with MIT’s CSAIL (Computer Science and Artificial Intelligence Laboratory) through a contribution of 12.5 million and the establishment of the bigdata@csail initiative. In this program, experts in hardware and software development, theoretical computer science, and computer security come together to develop new architectures capable of sorting and storing massive quantities of information, as well as the algorithms that can process them.
This was founded alongside the U.S. State of Massachusetts’ inaugural “Massachusetts Big Data Initiative”, which provides funding from the state government and private companies to a variety of research institutions.

In May 2013 the UK government and a private philanthropist created a £30 million “Big Data” health research centre at the University of Oxford. This follows the already completed £35 million first phase of the centre, the Target Discovery Institute, which won a further £10 million for research activity. The UK government has also categorized “Big Data” as one of the “eight great technologies” outlined by the Universities and Science Minister as being a government priority.

The AMPLab at the University of California, Berkeley, is a five-year, multi-million dollar Big Data initiative. It receives funding from the NSF, DARPA and many industrial sponsors.

The NSF Cluster Exploratory (CluE) program provides NSF-funded researchers with software and services running on a Google-IBM cluster to explore innovative research ideas in data-intensive computing.

National efforts

As with the various international initiatives, there has been interest in Data Science from different corporations and educational institutions in India. A few of these initiatives are summarized below:

At IIT Bombay there is a group of researchers who investigate issues related to indexing web data, organizing the semi-structured information found on the web, structured learning and large-scale optimization. The focus of the group is on algorithms for web data. They are funded by Yahoo! Labs, Microsoft, IBM, HP Labs, and others. Prof. Soumen Chakrabarti is recognized as one of the world leaders in web mining and indexing, while Prof. Sunita Sarawagi is likewise a leader in the domain of structured output prediction. While they have a large facility for distributed computing funded by Yahoo! Labs, the activities are not exclusive to that facility.

The Indian Institute of Science, Bangalore has several groups working in related areas across multiple departments. In the Computer Science and Automation Department, the machine learning groups work on large-scale optimization and ranking problems. While the focus is not explicitly on "big" data analytics, they are among the most successful groups in the country in terms of research output. Prof. Jayant Haritsa is a well-recognized expert on database systems and has been working in collaboration with IBM Research on building data management applications that can handle large volumes of data. In the Supercomputing Education and Research Centre (SERC) there have been efforts to start big data analytics facilities, especially focused on the study of biological systems. In the ECE Department there is a network analysis group that studies complex networks and deals with issues of scale.

IIT Delhi and IIT Kanpur have data analytics groups that look at different aspects of data handling, storage and analytics. A database management and information extraction meta-group has recently been formed across IIT Bombay, IIT Delhi, and IIIT Delhi under the IMPECS scheme, with each institution focusing on sub-areas in this domain. IIT Kharagpur has a large complex networks group and a centre for network analysis, again under IMPECS. It also has several researchers who look at data analytics and scaling to large data volumes.

The Indian Institute of Management, Bangalore operates a Data Centre and Analytics Lab. The purpose of this initiative is to support interdisciplinary empirical research using data primarily on India and other emerging markets. The centre also offers a one-year certificate program in Analytics.

IIIT Hyderabad is another institute that is active in data analytics work. While several successful products and startups have been incubated there, they do not focus on large data handling issues.
Some of the notable ideas from their groups are the eSagu system for agricultural analytics and veeoz, a real-time social media tracking system.

One feature that will distinguish our efforts from the rest is that we are looking at data from engineering systems as well as biological, technological, financial and social media sources. To the best of our knowledge, such a concerted effort is not available on a large scale elsewhere. IIT Bombay has a group that looks at power system analytics, headed by Prof. Soman. The complex networks group at IIT Kharagpur, headed by Prof. Niloy Ganguly, has analyzed the Indian Railways network and derived many interesting insights.

There are several groups in Indian industry that look at big data analytics. One group that is highly regarded globally is the one at Microsoft Research. Not only does it publish cutting-edge research, it also contributes very actively to data analytics product development at Microsoft. IBM Research labs probably has the largest collection of researchers working in data analytics and related areas, organized into different groups such as business analytics, information management, human language technologies, etc. Many of the large labs have active machine learning and data analytics research groups; notable among them are Yahoo! Labs, GE Research, Xerox Research Center India, Adobe Research, etc.

In addition to the national importance given to data science, there has also been considerable focus recently in India on building Smart Cities. In the budget for the current fiscal year the government had planned to develop 100 smart cities across the country through a $1.2 billion investment.

Gap areas

From the previous two sections it should be abundantly clear that research institutes, government bodies, and corporations are taking Data Science and Big Data initiatives very seriously. However, most of these efforts are focused on areas in IT that are already well established.
Further, the participants in and users of these systems are knowledgeable about the use of IT and computers. One of the distinguishing features of this effort is its interdisciplinary nature. A further focus area in our context is to reach the masses who have little or no knowledge of IT. In this context, participating in the development of this rich research area and ensuring reachability to end users has huge implications for the near future. With the right effort and people, the Big Data revolution can be useful in many ways.

HOW BIG IS DATA IN INDIA?

i. We are living in the age of information overload. A huge amount of data is constantly being generated around us. Automation is increasingly being adopted, which in turn leads to greater amounts of data. The challenge today for enterprises as well as small and medium businesses (SMBs) is manifold. Indian SMBs and enterprises are sitting on a gold mine of information. Making sense of these huge data sets has become imperative. In these circumstances, big data analytics has become one of the more talked-about topics in India.
ii. Big data has tremendous potential in India. With social media usage on the rise and increased adoption of technology by sectors such as BFSI (banking, financial services, and insurance), retail, hospitality, etc., big data analytics is on the agenda of boardrooms across Indian enterprises. However, most Indian enterprises are still coming to terms with this concept. While everybody realizes the importance and potential of analyzing these data sets, very few have the capability to do it. It is widely accepted that Indian enterprises base their decisions mostly on intuition and ‘gut feel’ and have barely scratched the surface in terms of using data for decision-making.
iii. In India, many of the large enterprises have started using or are contemplating the use of big data analytics. SMBs are still some distance away from adopting this concept.
Their challenges are more basic: effective data storage and management. However, there are many medium businesses that are already past the initial stages of IT adoption and are expected to take this up shortly.

Data Science: R & D PERSPECTIVE

In the Big Data research context, so-called analytics over Big Data is playing a leading role. Analytics covers a wide family of problems arising mainly in the context of Database, Data Warehousing and Data Mining research. Analytics research is intended to develop complex procedures that run over very large data repositories with the objective of extracting the useful knowledge hidden in them. One of the most significant application scenarios in which Big Data arises is, without doubt, scientific computing. Here, scientists and researchers produce huge amounts of data every day through experiments (e.g., in disciplines like high-energy physics, astronomy, biology, bio-medicine, and so forth). But extracting useful knowledge for decision-making purposes from these massive data repositories is almost impossible for current DBMS-inspired analysis tools.

From a methodological point of view, there are also research challenges. A new methodology is required for transforming Big Data stored in heterogeneous data sources of differing nature (e.g., legacy systems, the Web, scientific data repositories, sensor and stream databases, social networks) into a structured, and hence readily interpretable, format for target data analytics. As a consequence, data-driven approaches in biology, medicine, public policy, social sciences, and the humanities can replace traditional hypothesis-driven research in science.

The research problems linked to the discovery of new insights from big data belong to a novel and rapidly expanding research domain: machine learning.
Situated at the intersection of statistics, computer science and emerging applications in industry, this research domain focuses on the development of fast and efficient algorithms for processing data, with the main goal of delivering accurate predictions of various kinds. To name only a few applications, think of business cases such as product recommendation, segmentation of customers, fraud detection or churn prevention. Machine learning techniques can address such applications using a set of generic methods that differ from more traditional statistical techniques. The emphasis is on real-time and highly scalable predictive analytics, using fully automatic and generic methods that simplify most of the problems of data analytics.

At the user layer, visualization and interactive exploration are important problems for Big Data. A novel class of visualization metaphors, methodologies and solutions must be devised in order to cope with the emerging challenges posed by the visualization of Big Data: real-time visualization of extracted core data, visualization of mashed-up data, and effective visualization on mobile devices are interesting problems. Coupled with visualization, interactive exploration is a critical milestone in Big Data research; in fact, enormous data sets are difficult to explore while extracting useful knowledge. Strategies need to address issues such as conceptual navigation, concept drift, interaction metaphors, and so forth.

Environmental monitoring has become reliant upon automated sensors for data acquisition. This results in the generation of large, high-dimensional data streams (‘Big Data’) that personnel must search through to identify data structures. Nature-inspired computation, including artificial neural networks (ANNs), makes it possible to unearth complex, recurring patterns within sizable data volumes.
This has applications in agriculture, weather monitoring, epidemiological study, traffic planning, pollution monitoring, and ecological and natural resource management.

Data: Science & Technology - Challenges

Some of the S&T challenges that researchers across the globe and in India are facing relate to the data deluge in:

i. Astrophysics
ii. Materials Science
iii. Earth & atmospheric observations
iv. Energy
v. Fundamental Science
vi. Computational Biology, Bioinformatics & Medicine
vii. Engineering & Technology, GIS and Remote Sensing
viii. Cognitive science
ix. Statistical data

These challenges require the development of advanced algorithms, visualization techniques, data streaming methodologies and analytics. The overall constraints that the community is facing are:

• The IT challenge: Storage and computational power
• The computer science: Algorithm design, visualization, scalability (machine learning, network and graph analysis, streaming of data and text mining), distributed data, architectures, data dimension reduction and implementation
• The mathematical science: Statistics, optimisation, uncertainty quantification, model development (statistical, ab initio, simulation), analysis and systems theory
• The multi-disciplinary approach: Contextual problem solving

Data Science: Business perspective

ANALYTICS COMPANIES IN INDIA DURING THE LAST TWO YEARS:

‘Analytics India Magazine’ has published a study on how analytics organizations are coming up in various cities in India and where the action is taking place. By analytics organizations, it refers to companies that provide services externally around analytics and related fields. This can include training organizations or even large consulting companies with analytics as a service line. It also includes product companies that have created products with a deep focus on or dependency on analytics. The study provides the following insights into the potential of business analytics (BA) in the country:

i. 6% of analytics organizations worldwide are either based in India or have operations in India.
ii. The number of analytics companies in India has grown threefold in the last 1.5 years.
iii. Analytics firms have also grown in size. A year ago, 71% of analytics firms in India had fewer than 50 employees. This year, that share has decreased to 66%.
iv. Bangalore is still the hub of analytics in India, though other cities are coming up.

Analytics Industry - A key to growth of India

Imagine a situation in which someone is browsing the men’s shoes section at Pantaloons and is about to buy a pair, and then receives a message from Indiatimes: “The same shoe is being offered with 25% discount, just login here”. A scanner reads the shoe data, the customer’s Pantaloons card is linked to his mobile and his mobile is linked to Indiatimes. Indiatimes and Pantaloons are doing joint marketing. This is a win-win situation for everybody, and it is only possible with the help of analytics. Analytics is thus no longer a luxury for an organization but a hygiene factor. Let us take a brief look at the current analytics industry in India:

i. Size of the Indian analytics market: USD 375 million
ii. Number of companies operating in this segment in India: more than 500
iii. Expected Indian analytics market by 2017: USD 1.15 billion, as per a Business Standard report.

Big Data Analytics and Digital Social Networks

This is a focused research area engaged in the analysis of social networks in the context of the new digital cultural ethos of India. Such analysis can have many dimensions: technological, sociological, cultural, economic and strategic. There is a special need in the country today for a deep analytical capability around the content and activity of social networks.
Since much of the content of social networks is textual information (emails, blogs, tweets, SMS, websites, documents), audio and images/video, the information sciences involved in data analytics of social networks would have to span spectral, image, text and quantitative analytics. The ability and capability to monitor and detect patterns of information flow in social networks can provide extremely strategic value to national security: • Alerting the nation security agencies to disturbing or threatening trends that are not obvious. • Determining hot spots in the networks that should be monitored and leveraged for rapid broadcast of information that can have salutary effects to calm the citizenry and counter the effects of harmful disinformation. At the same time, there would be scholarship and commercial value to be derived from the expertise created by this activity. This would allow an embedding of the strategic effort as a covert effort for obvious advantage. Desiderata and Resources Required: xi i. Data: Distributed data centres and dedicated broadband connectivity to other centres to eventually get a seamless semantic experience which have high speed access to huge amounts of real time data in order to carry out analysis at with reasonable turnaround times. The irony is that openness in society makes us vulnerable to terrorism and yet openness is key to a good defense. While we are always interested in connecting the dots, collecting the dots is a crucial first step! ii. Curation: with such large data streams, it would be critical to have technology for automated classification and clustering with human oversight for better accuracy. Much of these required technology pieces are available today but the challenge is in integration and getting effective pipelines engineered for preparing the data in knowledge bases that are in a form to be leveraged for rapid interpretation and inference. iii. 
Analysis: connecting the dots, discovering patterns, generating hypotheses, predicting outcomes. This needs a crack technical team with strong training in mathematics and the decision sciences. There also have to be a few key domain experts feeding the technologists with the key questions as well as keeping them honest with quick feedback on the interpretations. The core team working on strategic issues related to national security would be embedded within this analytics team.

Big Data: Business Analytics
Business Analytics is the science of examining data (Big Data in the form of text, quantitative data, qualitative data, etc.) to bring forth the underlying information. This information can reveal undiscovered patterns or establish hidden relationships, which can shape the decision-making capability of an organization. There are two important facets of analytics. The first is practical intuitiveness: there can be hundreds of ways to analyze a given data set, and although none can be completely correct, each will give some direction. The point is to pursue that direction and keep updating it with the trend. The second is real-time response: if the Google search engine took one hour to list all the possible matches, it could not have reached its present scale. All the analysis has to be done on the fly.

APPLICATIONS OF BIG DATA-BUSINESS ANALYTICS IN GOVERNMENT SECTOR:
There are many ways in which ‘Big Data – Business Analytics’ can be leveraged by the Central and State Governments to drive growth and change and to implement various policies and government schemes. Some of the prominent areas are:
i. Aadhaar: As the majority of citizens in the country (more than 60 crore at the last count) have been provided with an Aadhaar number, the governments can use this facility to plan, implement and monitor their citizen-related initiatives.
ii.
Direct Benefit Transfer Scheme: The governments can decide the funding for various schemes, ensure that the money reaches the beneficiaries, and keep track of improvement and growth within a scheme and in any particular region where people benefit from it.
iii. Impact of Election and Voting system: Governments can analyze this big data to frame policies and schemes based on those statistics, which will help the people of the country as well as the growth of the country.
iv. Impact and conditions of Infrastructure Projects: Analysis of the large amount of data periodically collected can help the governments in preserving critical infrastructure all over the country.
v. Impact of Education: Analysis of the large amount of data periodically collected about the delivery, outputs, outcomes and impact of education initiatives at the primary, secondary and tertiary levels can be useful in formulating education policies.
vi. Impact of Health care initiatives: Analysis of the large amount of data periodically collected about the delivery, outputs, outcomes and impact of healthcare initiatives at the primary, secondary and tertiary levels can be useful in formulating healthcare policies.

BUSINESS ANALYTICS FOR TAX ADMINISTRATION:
The Central as well as State Governments are involved in multiple tax regimes, at the corporate as well as the individual level. The country's income tax-payer base itself is about 3 crore, and the number has been inching up slowly for the last 5-10 years; the government would like to see it grow at a faster pace. Consider the following relevant facts:
• According to government data, the total tax payers in the country stood at about 3.24 crore during fiscal year 2011-12 (FY12).
• The Finance Ministry collected Rs. 4.73 lakh crore in indirect taxes during 2012-13.
• For the current fiscal, it has fixed the target of collecting Rs. 5.65 lakh crore in indirect taxes, comprising customs, excise and service tax.
• Total collection of indirect taxes stood at about Rs. 2,28,550 crore during the first six months of 2013-14.
• Direct tax collection from corporate and income tax payers, which was at Rs. 14,530 crore till August, surged to Rs. 18,077 crore till September 15, 2013.
Our total direct taxes are only 9 per cent of our GDP, whereas they should be about 18 per cent, and you cannot raise this by taxing people whom you have already taxed. The governments are always looking for efficient ways and means of ‘Improving Tax Administration’. This is possible by analyzing the huge amounts of data available on various parameters typical of the tax regime, such as spending patterns and interstate movement of goods.

BIG DATA ANALYTICS AND THE INDIA EQUATION
To tap the analytics momentum, India now needs to build a sustainable analytics eco-system that brings in a strong partnership across industry players, government, and academia. Some of the key actions for the analytics eco-system in India would be:
i. Talent Pool - Create industry-academia partnerships to groom the talent pool in universities, and develop a strong internal training curriculum to advance analytical depth.
ii. Collaborate - Form analytics forums across organization boundaries to discuss the pain points of the practitioner community and share best practices to scale analytics organizations.
iii. Capability Development - Invest in long-term skills and capabilities that form the basis for differentiation and value creation. There needs to be an innovation culture that will facilitate IP creation and asset development.
iv. Value Creation - Building rigor to measure the impact of analytics deployment is critical to earning legitimacy within the organization.
Big Data and analytics offer tremendous untapped potential to drive big business outcomes. Leveraging India as a global analytics hub can be one of the key levers for organizations to move up their analytics maturity curve.
HOW CAN BUSINESS ANALYTICS MAKE INDIAN BUSINESS MORE COMPETITIVE?
The role analytics plays in organizations today goes far beyond the slipshod Excel/pivot table culture of yesteryear. Analytics has now become a de facto requirement in organizations, with companies designating dedicated teams with KRAs to achieve specific revenue numbers using analytics. This is corroborated by a recent report from the research firm Gartner, which says that the benefits of fact-based decision-making are clear to business managers in a broad range of disciplines, from marketing, sales, supply chain management, manufacturing and engineering to risk management, finance and HR. The firm predicts that BI and analytics will remain the top focus for CIOs through 2017. Consider the following examples from Indian industry and business:
i. Take the case of the retailer Shoppers Stop, which uses analytics to mine customer preferences and buying behaviour in order to source merchandise more intelligently and connect with customers on things they would like to see at the stores. The retailer's buying team uses sales data to figure out what is selling and where, which in turn enables it to take supply decisions. While category performance reviews earlier took days, with the technology now in use, insights are available in a couple of hours.
ii. Flipkart faced a similar challenge, with a pressing need to improve inventory utilization. Flipkart needed to integrate complex data from disparate sources and deliver analytical data to the staff in various departments. Using a BA solution, Flipkart was able to optimize stock levels and lower the costs associated with excess stock, improving its inventory utilization by 5 percent and providing up-to-date analytics for embedded, data-driven decision making.
iii. Aircel had a variety of heterogeneous systems for capturing massive amounts of customer data, presenting the business with the gruelling task of extracting information from the vast amount of data in disparate systems and getting an integrated view of customers in order to analyze customer demographics, usage patterns, social behaviour and more. By using an analytics solution, an integrated data view was achieved, enabling a 360-degree view of the customer life cycle, including tasks such as customer identification, customer acquisition, customer relationship management, customer retention and customer value enhancement.
iv. For Mahindra & Mahindra, analytics has opened up new avenues for interacting with customers. For instance, if a customer goes to a dealership, the services on offer can now range from exchange of used cars to insurance and so forth, based on the customer information on file.
v. Another example of innovative usage of analytics is Mahindra's electric vehicle, the E2O (formerly REVA), which is equipped with switches that continuously send information about the vehicle and battery performance back to the company, where this data is now being crunched. Juxtaposing this data with GPS data will open up new avenues for interaction with the customer.

MOVING TOWARD AN ANALYTICS CULTURE
Staying the course: While the point of inflection, where value exceeds investment, may still remain elusive for many companies, the business should clearly recognize that the shift to analytics is a long-term endeavour.
Get the executives on board: Even though business analytics initiatives are typically incremental, getting the top brass to see the value will help drive a culture in which the norm is data-based decision-making. According to a recent survey, effective users of business analytics are nearly always (86%) in organizations where executive management places a great deal of trust in the results of analytics.
Getting quick wins on important issues can help gain the confidence of senior management.
Data comes first: Before embarking on analytics initiatives, organizations need to assess the effectiveness of their data-management strategies. Those who have a solid approach to their data are more than twice as likely to have successful analytics programs. Viewing data as a strategic asset, and as the backbone of effective decision-making, is a key element of an analytics culture.
Get your “analytics” on: Organizations desirous of reaping high benefits from business analytics have to move boldly into new technology. They have to increase their use of analytics significantly, at nearly four times the rate of other companies.
Share the knowledge: In developing the analytics culture, “silo-busting” is essential. Information and data must be shared across the organization. People must have access to the data they need. Effective users of business analytics have to be much more proficient than their counterparts at collaborating and sharing information.
Integrate: Companies that wish to take the next step beyond collaboration, namely integration across the organization, have to be well on their way to building a strong analytics culture. Integration is one of the key components in getting benefits from analytics. The “competitive edge” so often promoted in the marketplace really only comes when the organization takes a holistic approach to analytics.
Hire the right talent: Adoption of analytical tools without the right people to make the best use of them can prove to be a poor investment. In developing a functional analytics culture, the linchpins are people, process, and infrastructure.
Find your equilibrium: The average mix of intuition to analytics in decision-making should be 60/40. For those organizations using analytics effectively, the scale tips toward analytics, at 53/47 versus 62/38 for all others.

The broad contours of the BDI programme positioning strategy shall be:
i.
To develop core generic technologies, tools and algorithms for wider application by Government, planners and policy makers.
ii. To understand the present status of the industry in terms of market size, different players providing services across sectors/functions, opportunities, SWOT of industry, policy framework (if any), present skill levels available, etc.
iii. To carry out a market landscape survey to assess the future opportunities and demand for skill levels in the next 10 years.
iv. To carry out a gap analysis in terms of skill levels and policy framework.
v. To evolve a strategic Road Map and micro-level action plan clearly defining the roles of various stakeholders (Govt., Industry, Academia, Industry Associations and others) with clear timelines and outcomes for the next 10 years.

Deliverables and Cost Benefit Analysis
Smart city Transportation: The Centre of Excellence in Urban Transport has been recording GPS data from over 75 Metropolitan Transport Corporation buses for several months now. These voluminous data are stored in a database provided by the supplier of the GPS hardware equipment. There exists significant scope for optimizing the storage and retrieval of this data. This is a critical need, since the data is used for real-time travel time information provision as well as bus arrival prediction. The work here shall address the scalability issue of such real-time databases. Traffic data from 25 video cameras installed on the road medians and shoulders, transmitted through a dedicated wireless network, are also being collected at the Centre, and they require a standardized archival and retrieval system. Low-level real-time image processing techniques are used to convert the data into useful traffic information. Data analytics techniques can be used to better extract useful information from the video data or the processed data. These data are anticipated to be of high importance for researchers and practitioners alike.
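As an illustration of the bus arrival prediction mentioned above, a minimal sketch follows. It assumes the GPS pings have already been matched to stops along a route, and it predicts arrival by summing the mean historical travel time of each remaining segment. The function names, the (timestamp, stop index) ping layout and the simple averaging are assumptions made for illustration only; they are not the method used by the Centre.

```python
from statistics import mean


def segment_travel_times(pings):
    """Derive per-segment travel times (seconds) from time-ordered pings.

    Each ping is a hypothetical (timestamp_s, stop_index) pair: the time at
    which the bus was observed at a given stop along the route.
    """
    times = {}
    for (t0, s0), (t1, s1) in zip(pings, pings[1:]):
        if s1 == s0 + 1:  # keep only hops between consecutive stops
            times[(s0, s1)] = t1 - t0
    return times


def predict_arrival(history, current_stop, target_stop, now_s):
    """Predict arrival time at target_stop by adding the mean historical
    travel time of every segment still ahead of the bus."""
    eta = now_s
    for s in range(current_stop, target_stop):
        samples = [trip[(s, s + 1)] for trip in history if (s, s + 1) in trip]
        if not samples:
            raise ValueError(f"no history for segment {s}->{s + 1}")
        eta += mean(samples)
    return eta


# Two past trips over stops 0..3 (made-up numbers).
trip_a = segment_travel_times([(0, 0), (120, 1), (300, 2), (420, 3)])
trip_b = segment_travel_times([(0, 0), (100, 1), (320, 2), (460, 3)])
history = [trip_a, trip_b]

# A bus is at stop 1 at t = 1000 s; estimate when it reaches stop 3.
eta = predict_arrival(history, current_stop=1, target_stop=3, now_s=1000)
print(eta)
```

A production system would likely weight recent trips more heavily and condition on time of day; the plain average is used here only to keep the idea visible.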
Smart Grid data: Smart grid development is one of the most important technology revolutions taking place, as electricity grids are among the world's largest pieces of infrastructure yet to be digitized. To fully leverage the capabilities of grid enhancements, one has to turn naturally to data analytics. A large amount of real-time data can be collected from smart meters, PMUs and other sensors in smart electricity grids. These data can be used to detect events such as severe voltage and frequency fluctuations or a sudden increase in demand from a particular location, and in some cases even to predict or detect blackouts or cyber attacks. Further, the data can be helpful in developing models for forecasting the load. It is proposed to develop new methods for detecting events in real time using the multi-sensor data. The project involves theoretical and simulation studies. Some of our collaborators at VJTI Mumbai (IITM has an MoU with them) are working with Power Grid Corporation of India Ltd (PGCIL) and are willing to share their PMU (phasor measurement unit) data with us. Another possible source of data is the PEPS group at IITB, who monitor real-time PMU data across India.
Water Flow networks data: Urban water distribution networks are being renovated and instrumented in order to ensure 24x7 water supply in several cities of India. Data related to flow rates, pressures, and tank levels can be obtained continuously and used to (i) monitor the performance of the network, especially with respect to leakages and non-revenue water, (ii) optimize operations such that water can be delivered to customers at the desired flow rates and pressures, and (iii) determine the health of sensors and pipes for scheduling maintenance. Pimpri-Chinchwad municipality near Pune, with more than 1 lakh connections, has already achieved 85% metering of connections and has been gathering data for the past several months.
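The monitoring of non-revenue water described above can be illustrated with a small mass-balance sketch: for each zone, non-revenue water is taken as the volume supplied minus the volume metered at customer connections, and zones whose non-revenue fraction exceeds a threshold are flagged. The zone names, volumes and the 25% threshold below are hypothetical values chosen for illustration; they are not data from the municipality.

```python
def non_revenue_water(inflow_kl, metered_kl):
    """Return (volume, fraction) of water supplied but not metered/billed.

    Both arguments are volumes in kilolitres over the same period.
    """
    if inflow_kl <= 0:
        raise ValueError("inflow must be positive")
    nrw = inflow_kl - metered_kl
    return nrw, nrw / inflow_kl


def flag_zones(zones, threshold=0.25):
    """List zones whose non-revenue fraction exceeds the threshold.

    zones maps a zone name to an (inflow_kl, metered_kl) pair.
    """
    flagged = []
    for name, (inflow, metered) in zones.items():
        _, fraction = non_revenue_water(inflow, metered)
        if fraction > threshold:
            flagged.append(name)
    return sorted(flagged)


# Made-up monthly volumes for three hypothetical zones.
zones = {
    "zone_a": (1000.0, 900.0),
    "zone_b": (800.0, 480.0),
    "zone_c": (500.0, 400.0),
}
flagged = flag_zones(zones)
print(flagged)
```

In practice the balance would also account for tank level changes and authorized unmetered use; these are omitted here for brevity.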
The municipality is willing to provide us the data so that the solutions identified above can be developed and later implemented in their operations.
Socio-Economic Initiatives data: Socio-economic issues can be divided into two parts: (i) Health and (ii) Food security. A tropical country like India is prone to seasonal diseases such as dengue during the monsoon season, cancer due to chewing of tobacco, and other life-threatening diseases like tuberculosis. Further, the affected people are usually from rural areas, have limited knowledge of the disease and its effects, and practise traditional medicine. Insurance schemes promoted by the Government will benefit from this data collection initiative. For example, the Government of Tamil Nadu offers insurance schemes for people below the poverty line. Big data analysis on patient profiles will help in predicting outbreaks of diseases and in taking proactive action, thereby saving money for the Government in terms of premiums paid for insurance. With respect to food security, effective and efficient distribution of food products that reaches people at the lowest economic level is of primary importance. Big Data analytics can help in ensuring that such people benefit from the technology advancement.
Interdisciplinary studies and courses are very few in the current context. Such courses are offered by institutions which want to use big data for their work but are not focused on Information Technology. For example, agricultural institutes find it difficult to get access to high-end computing, even though enough data is collected by them from the fields. Introducing interdisciplinary courses and helping researchers in such areas to access high-end computing will help in moving the benefits to end users quickly.
Our academic outreach will be at three levels: train IT-knowledgeable people to be good data scientists; train people with domain knowledge to be knowledgeable in data science so that they can interface better with data scientists; and, finally, educate end users on the possibilities of data science, in a model similar to a popular science program.
Biological network analysis: The recent decade has witnessed a paradigm shift in biology, from the study of individual genes and proteins to the study of genes, proteins and metabolites interacting in a concerted network of metabolic, signalling and regulatory networks. Fuelled by developments in sequencing and other experimental techniques, a deluge of biological ‘omics’ data has been generated: genomic (sequence), transcriptomic (gene expression), proteomic (protein quantitation) and metabolomic (metabolite levels). Many challenges exist in the analysis, integration and assimilation of the biological network data in order to better understand biological system function and generate new hypotheses for experimental verification. As the 2013 citation for the Nobel Prize in Chemistry put it, “Today the computer is just as important a tool for chemists as the test tube.” Some of the problems currently being addressed include the analysis of metabolic networks to identify critical targets for therapeutic intervention, the identification of essential proteins in protein interaction networks, and the learning of reaction rules from complex metabolic networks.
Cancer genomics data: While sequencing the first human genome took over a decade of extensive collaboration across labs, the three gigabases of the human genome can now be sequenced in a few days on ‘next-generation sequencing’ (NGS) machines. However, NGS data is in the form of short reads and demands intensive computational analysis to crunch the data, re-assess quality and generate sequences.
It is also critical to develop infrastructure and algorithms for effective storage, indexing and retrieval of the terabytes of NGS data. With the establishment of the National Cancer Tissue Biobank at IITM, we are uniquely placed to analyse extensive genomic data from varied cancer tissues, of particular relevance in the Indian context. NGS data lend themselves to a wide range of analyses, from the identification of critical genes mutated in cancerous tissues to identifying changes at the gene or signalling network level through mathematical modelling.
Healthcare Data Analytics: Our group has been collaborating with various hospitals and research institutes, such as Sankara Nethralaya Vision Research Foundation and Patterson Cancer Research Centre, and has developed innovative early-stage screening algorithms. We are also working with companies in the electronic health records domain to enable more insightful analytics on patient data. The Centre will work toward building a large suite of algorithms that are specifically tailored for the healthcare domain.
JEE/GATE data: It is perceived that IIT could greatly benefit from using state-of-the-art techniques in storing, formatting and accessing data related to JEE and GATE applicants through a centralized repository. This could lead to many data analytic initiatives. Examples include the detection of duplicate exam attempts, the analysis of demographic changes in applicants, and the examination of the relationship between JEE performance and subsequent college performance.
Telephonic networks data: Telephone service providers generate a lot of data for each call that is placed through their systems. These are typically available as call data records, which have information related to the phone numbers involved, time of the call, duration, cell tower location, calling plan, charges incurred, etc.
While the individual call records provide a lot of information, organizing these into "call graphs" typically leads to more insights. The challenge here is that a tremendous volume of data is generated every day and the data is very dynamic in nature. We therefore need new techniques for processing the data without delay and for building systems that are responsive to the changing data. Typical questions that end users are interested in relate to churn prediction, service recommendations, viral marketing opportunities, graph evolution models, and behaviour analytics. While there are privacy concerns in releasing live data, several organizations have made anonymised data available online. Even if we are not able to obtain the necessary permissions for live data, we can suitably organize the available public data and build access mechanisms for them to facilitate research in this domain.
Financial data: Conducting research in financial markets and their functioning relies on actual market microstructure data. Critical research on the liquidity of securities, volatility, arbitrage (pure and statistical), market making, lead-lag effects, etc., entails empirical work that can be back-tested against the actual market behaviour of exchange-listed securities. Working with market microstructure data, which can grow to enormous sizes, can address many pressing questions about the behaviour of financial markets.
Statistical Experimental data: Controlled experiments are carried out in every discipline where it is critical to empirically understand a system, product or process. In order to make advances in algorithms for experimentation, it is important to have many datasets reflecting the different response surfaces that are typically encountered in an experimental exercise. This initiative will carry forward an ongoing research effort in this area to gather data sets published in journals across various engineering disciplines.
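The idea of organizing call data records into a "call graph", mentioned under telephonic networks data above, can be sketched as follows. Each record contributes to a directed edge from caller to callee, weighted by call count and total duration; simple per-subscriber features such as the number of distinct contacts can then be read off the graph. The record layout (caller, callee, duration) and the function names are assumptions for illustration; real call data records carry more fields, such as cell tower and calling plan.

```python
from collections import defaultdict


def build_call_graph(records):
    """Aggregate call data records into a weighted directed graph.

    records: iterable of hypothetical (caller, callee, duration_s) tuples.
    Returns {caller: {callee: [call_count, total_duration_s]}}.
    """
    graph = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for caller, callee, duration in records:
        edge = graph[caller][callee]
        edge[0] += 1          # one more call on this edge
        edge[1] += duration   # accumulate talk time
    return graph


def out_degree(graph, node):
    """Number of distinct subscribers that `node` has called."""
    return len(graph.get(node, {}))


# Made-up, anonymised records.
records = [
    ("A", "B", 60),
    ("A", "B", 30),
    ("A", "C", 10),
    ("B", "A", 45),
]
graph = build_call_graph(records)
print(graph["A"]["B"], out_degree(graph, "A"))
```

Features of this kind (contact count, talk time per contact, and how they change over time) are commonly used as inputs to churn models; for a high-volume, dynamic stream the same aggregation would need to be maintained incrementally rather than rebuilt in batch.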
Chemometric data: Chemometrics is the science of analysing experimental data generated from chemical systems. In the laboratory, sophisticated analytical instruments such as Raman, IR, NIR, NMR and UV-vis spectrometers and various chromatography techniques are employed to measure indirectly the quantity of chemicals in the samples. The data generated using these instruments are multivariate in nature. Further, chemometric methods are routinely applied to the “omics” data generated in biological science, such as metabolomics, proteomics and genomics.

The field of data science is emerging at the intersection of social science, statistics, information and computer science, and other application domain disciplines. Keeping in view the fast future growth of Data Science and Analytics across various applications, it is imperative to chalk out a strategic Road Map and investments in this direction to reap the benefits for the overall development of the country.

OBJECTIVES OF THE STUDY
• Assess the present status of the industry in terms of market size, different players providing services across sectors/functions, opportunities, SWOT of industry, policy framework (if any), present skill levels available, etc.
• Market landscape survey to assess the future opportunities and demand for skill levels in the next 10 years.
• Gap analysis in terms of skill levels and policy framework.
• Evolve a strategic Road Map and micro-level action plan clearly defining the roles of various stakeholders (Govt., Industry, Academia, Industry Associations and others) with clear timelines and outcomes for the next 10 years.
• The international scenario may also be examined while evolving the Strategic Road Map.

THE CONSULTATIVE APPROACH ADOPTED:
i.
Two Consultative Meetings and four Interactive Workshops were held as per the details given below:

CONSULTATIVE MEETINGS (CM) & INTERACTIVE WORKSHOPS (IW) ORGANIZED
S. No.   DATE       CM / IW HELD AT      NUMBER OF PARTICIPANTS
1        28/11/14   New Delhi (CM)       34
2        07/01/15   Bengaluru (IW)       31
3        19/01/15   Pune (IW)            20
4        29/01/15   Hyderabad (IW)       40
5        20/02/15   Kolkata (IW)         52
6        25/03/15   New Delhi (CM)       42
TOTAL                                    219

ii. The Draft Report has been uploaded on the Consultancy Development Centre's (CDC) website www.cdc.org.in under the announcement section from 13.04.2015 for a period of two weeks, inviting comments/inputs from stakeholders.
iii. The Draft Report, after the consultative process at (i) and (ii) above, will be presented to the Secretary, DST and the PDAC members, tentatively on 27th May 2015.
iv. The final Report will suitably incorporate all the inputs received in the above consultative process.

The present study, through a combination of primary and secondary research, has established the need for an urgent initiative on the part of DST to (i) strengthen the dASTRA Ecosystem of the country, and (ii) take steps to nurture the same so as to leverage the uniquely advantageous position of the country's manpower, not only in scientific research and development but in business and industry as well. The project is to be implemented in five years and the cost has been estimated to be around Rs. 580 crore.
The major activities of the project will include (i) R&D PROMOTION through Open Sky Research, Cluster Based Network Programs, and an International Collaborative Research Program, (ii) ESTABLISHMENT OF CENTRE OF EXCELLENCE FOR DATA SCIENCE, (iii) SKILL DEVELOPMENT CAPACITY & TRAINING through Fellowship Based UG/PG & PhD, Short Term Training for Faculty, On-Line Programs, National Workshops & Conferences, Collaborative Interactive Conferences, and Entrepreneur Development, (iv) INTERNATIONAL LINKAGES & COLLABORATIONS through UN (R&D and Standards), Regional Associations/Collaborations, and Bilateral & Multi Lateral Exchange Programs, and (v) INFRASTRUCTURE DEVELOPMENT.

LIST OF PDAC MEMBERS
S. No.   NAME                       ORGANIZATION
1        Prof. Sankar K. Pal        Distinguished Scientist and Former Director, ISI, Kolkata
2        Prof. Santanu Choudhury    Professor, IIT Delhi
3        Prof. Bapiraju             Professor, Central University of Hyderabad, Hyderabad
4        Prof. Ramesh Hariharan     Adjunct Faculty, IISc, Bangalore
5        Dr. Raghavendra Singh      IBM Research, New Delhi
6        Dr. Gautam Shroff          TCS Innovation Labs, New Delhi
7        Prof. Vijay Chandru        Adjunct Professor, ICTS, Bangalore
8        Prof. S. Pyne              Professor, CR Rao Advanced Institute of Mathematics, Statistics and Computer Science, Hyderabad
9        Shri Avnish Sabharwal      Accenture India (Pvt.) Limited, Bangalore

1. INTRODUCTION: NEW GENERATION COMPUTATIONAL PARADIGM
1.1. AN INTERNATIONAL PERSPECTIVE
1.1.1 AN OVERVIEW – DATA SCIENCE
Throughout human history, the most successful decisions made in the world of business have been based on the interpretation of available data. Every day, 2.5 quintillion bytes of data are created, so much that 90% of the data in the world today has been created in the last two years. Correct analysis of the data is the key success factor in being able to make better, data-based decisions.
Given the quantity and complexity of the data that is being created, traditional database management tools and data processing applications simply cannot keep up, much less make sense of it all. The challenges for handling big data include capture, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information that can be derived from analysis of a single large set of related data, compared to separate smaller sets with the same total amount of data. Some estimates for the data growth are as high as 50 times by the year 2020.

1.1.2 DATA SCIENCE
Data science is deep knowledge discovery through data inference and exploration. This discipline often involves using mathematical and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to uncover the hidden insight that lies beneath the surface. It centres on evidence-based analytical rigor and building robust decision capabilities. Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It is all about adding substantial enterprise value by learning from data. See figure 1.1 as given below.

FIGURE 1.1: DATA SCIENCE FOR BUSINESS
SOURCE: https://datajobs.com/what-is-data-science dated 3/4/15

Techopedia (http://www.techopedia.com/definition/30202/data-science dated 3/4/15) defines data science as a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data. It is geared toward helping individuals and organizations make better decisions from stored, consumed and managed data. Data science enables the use of theoretical, mathematical, computational and other practical methods to study and evaluate data.
The key objective is to extract required or valuable information that may be used for multiple purposes, such as decision making, product development, trend analysis and forecasting.

1.1.3 DATA SCIENCE ECOSYSTEM:
Considering the above, data science isn't new, but the demand for quality data has exploded recently. This isn't a fad or a rebranding, it's an evolution. Decisions that govern everything from successful presidential campaigns to a one-man startup headquartered at a kitchen table are now based on real, actionable data, not hunches and guesswork. Because data science is growing so rapidly, we now have a massive ecosystem of useful tools. Since data science is so inherently cross-functional, it is really hard to categorize the companies and the tools they provide for users. But at the very highest level, they break down into the three main parts of a data scientist's work flow: (i) Getting data, (ii) Wrangling data and (iii) Analyzing data. A schematic representation of the DATA SCIENCE ECOSYSTEM is given in figure 1.2 below.

FIGURE 1.2: DATA SCIENCE ECOSYSTEM
SOURCE: http://www.computerworld.com/article/2899647/the-data-science-ecosystem.html dated 3/4/15

1.1.4 DATA SCIENTIST:
Rising alongside the relatively new technology of big data is the new job title data scientist. While not tied exclusively to big data projects, the data scientist role does complement them because of the increased breadth and depth of data being examined, as compared to traditional roles. A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.
Good data scientists will not just address business problems; they will pick the right problems that have the most value to the organization. Whereas a traditional data analyst may look only at data from a single source – a CRM system, for example – a data scientist will most likely explore and examine data from multiple disparate sources. The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data. Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure. As per Techopedia, a data scientist is an individual that practices data science. Data science techniques include data mining, big data analysis, data extraction and data retrieval. Moreover, data science concepts and processes are derived from data engineering, statistics, programming, social engineering, data warehousing, machine learning and natural language processing, among others. 
1.1.5 DATA SCIENCE BEYOND 2015:
Kurt Cagle, an information architect, data scientist, author and industry analyst working with Avalon Consulting, LLC, predicts (https://www.linkedin.com/pulse/ten-trends-data-science-2015-kurt-cagle#a11y-content dated 3/4/15) the following for data science during 2015:
• Rise of Data Virtualization
• Hybrid Data Stores Become More Common
• Semantics Becomes Standard
• Databases Become Working Memory
• Move Towards a Universal Data Query Language
• Data Analytics Moves Beyond SQL
• Data Science Teams: will consist of Integrator, Data Translation Specialist, Curators, Data Scientist, Domain Expert, Visualizers and Data Science Manager.

Big Data News (http://www.bigdatanews.com/profiles/blogs/13-new-trends-in-big-data-and-data-science dated 3/4/15) has forecast the following trends in the use of data science in the time to come:
• The rise of data plumbing, to make big data run smoothly, safely, reliably, and fast through all "data pipes" (Internet, Intranet, in-memory, local servers, cloud), optimizing redundancy, load balance, data caching, data storage, data compression, signal extraction, data summarization and more.
• The rise of the data plumber, system architect, and system analyst (a new breed of engineers and data scientists), a direct result of the rise of data plumbing.
• Use of data science in unusual fields such as astrophysics, and the other way around (data science integrating techniques from these fields).
• The rise of right-sized data (as opposed to big data). Other keywords related to this trend are "light analytics", "big data diet", "data outsourcing", and the re-birth of "small data". Big data is not going away, and is indeed getting bigger every second, but many businesses are trying to leverage an increasingly smaller portion of it, rather than being lost in a (costly) ocean of unexploited data.
• Putting more intelligence (sometimes called AI or deep learning) into rudimentary big data applications (currently lacking any true statistical science) such as recommendation engines, crowdsourcing or collaborative filtering. Purpose: detecting and eliminating spam, fake profiles, fake traffic, propaganda, attacks, scams, bad recommendations and other abuses, as early as possible.
• High performance computing (HPC), which could revolutionize the way algorithms are designed.
• Forecasting space weather (best time / best location to land on Mars), and natural events on Earth (volcanoes, earthquakes, undersea weather patterns and implications for humans, when Earth's magnetic field will flip).
• Use of data science for automated content generation (including content aggregation and classification); for automated correction of student essays; in court to strengthen the level of evidence - or lack of - against a defendant; for plagiarism detection; for car traffic optimization and to compute optimum routes; for identifying, selecting and keeping ideal employees; for automated income tax audits sent to taxpayers to avoid costly litigation and time wasting; for urban planning; for precision agriculture.
• Measuring the yield of big data or data science initiatives (that is, benefit after software and HR costs, over baseline).
• Digital health: diagnostics/treatment offered by a robot (artificial intelligence, decision trees) and/or remote doctors; digital law: the same, with attorneys replaced by robots, at least for mundane cases or tasks. Even lawyers and doctors could have their jobs replaced by robots! This assumes that a lot of medical or legal data gets centralized, processed and made well structured for easy querying, updating and retrieval by (automated) deep learning systems.
• Analytic processes (even in batch mode) accessible from your browser anywhere on any device. Growth of analytics apps and APIs.

1.1.6 WHAT IS BIG DATA?
Big data is a phenomenon that is characterized by the rapid expansion of raw data. This data is being collected and generated so quickly that it is inundating government and society. It therefore represents both a challenge and an opportunity. The challenge is how to harness this volume of data, and the opportunity is how the effectiveness of society’s institutions can be enhanced by properly analyzing this information. It is now commonplace to distinguish big data solutions from conventional IT solutions by considering the SEVEN dimensions given below and in Figure 1.3.
• Volume: Big data solutions must manage and process larger amounts of data.
• Velocity: Big data solutions must process more rapidly arriving data.
• Variety: Big data solutions must deal with more kinds of data, both structured and unstructured.
• Veracity: Big data solutions must validate the correctness of the large amount of rapidly arriving data.
• Variability: Big data solutions must check whether the data is consistent in terms of availability or interval of reporting, and whether it accurately portrays the event reported.
• Visualization: Once big data has been processed, it must be presented in a manner that is readable and accessible.
• Value: Big data solutions must provide valuable inputs to the decision making process of the organization.

FIGURE 1.3: SEVEN DIMENSIONS OF BIG DATA
(Figure labels: Variety, Variability, Velocity, Volume, Veracity, Value, Visualization)

As a result, big data solutions are characterized by real-time complex processing and data relationships, advanced analytics, and search capabilities. These solutions emphasize the flow of data, and they move analytics from the research labs into the core processes and functions of enterprises.

1.1.7 BUSINESS (ORGANIZATIONAL) VALUE OF BIG DATA:
Big data is a technology to transform analysis of data-heavy workloads, but it is also a disruptive force.
It is fuelling the transformation of entire industries that require constant analysis of data to address daily business challenges. Big data is about broader use of existing data, integration of new sources of data, and analytics that delve deeper by using new tools in a more timely way to increase efficiency or to enable new business models. Today, big data is becoming a business imperative because it enables organizations to accomplish several objectives:
• Apply analytics beyond the traditional analytics use cases to support real-time decisions, anytime and anywhere
• Tap into all types of information that can be used in data-driven decision making
• Empower people in all roles to explore and analyze information and offer insights to others
• Optimize all types of decisions, whether they are made by individuals or are embedded in automated systems, by using insights that are based on analytics
• Provide insights from all perspectives and time horizons, from historic reporting to real-time analysis, to predictive modelling
• Improve business outcomes and manage risk, now and in the future

In short, big data provides the capability for an organization to reshape itself into a contextual enterprise, an organization that dynamically adapts to the changing needs of its individual users/customers by using information from a wide range of sources. Although it is true that many organizations/businesses use big data technologies to manage the growing capacity requirements of today’s applications, the contextual enterprise uses big data to enhance revenue streams by changing the way that it does business.

Volume – Scalability
Data volume is increasing faster than computing resources and processor speeds that exist in the marketplace. Over the last five years, the evolution of processor technology largely stalled, and we no longer see a doubling of chip clock cycle frequency every 18 - 24 months. The size of big data is easily recognized as an obvious challenge.
Big data is pushing scalability in storage, with increases in data density on disks to match. A large percentage of the data might not be of interest. It can be filtered and compressed by an order of magnitude. The challenge is to filter intelligently without discarding data samples that might be relevant to the task.

Volume – Impact of Networking
The failure of a networking device affects multiple data nodes. This means that a job might need to be restarted or more load must be pushed to the available nodes, which makes jobs take a lot longer to finish. As a result, networks must be designed to provide redundancy with multiple paths between computing nodes and, furthermore, must be able to scale. In addition, the network must be able to handle bursts effectively without dropping packets.

Volume – Cloud Services
Big data and cloud services are two initiatives that are at the top of the agenda for many organizations. There is a view that cloud computing can provide the opportunity to enhance organizations’ agility, enable efficiencies, and reduce costs. In many cases, cloud computing provides a flexible model for organizations to scale their big data capabilities. However, this needs to be done with careful planning, especially estimating the amount of data to analyze by using the big data capability in the cloud, because not all public or private cloud offerings are built to accommodate big data solutions.

Velocity – Access Latencies
Access latencies create bottlenecks in systems in general, but especially with big data. The speed at which data can be accessed while in memory, network latency, and the access time for hard disks all have performance and capacity implications. For big data, data movement is usually not feasible, because it puts an unbearable load on the network.
For example, moving petabytes of data across a network in a one-to-one or one-to-many fashion requires an extremely high-bandwidth, low-latency network infrastructure for efficient communication between computer nodes.

Big data uses different types of analytics, such as “adaptive predictive models, automated decision making, network analytics, analytics on data-in-motion, and new visualization.” Previously, data was pre-cleaned and stored in a data mart. Now, most or even all source data is retained. Furthermore, new types of feeds, such as video or social media feeds (Twitter, for instance), are available.

Velocity - Rapid use and rapid data interpretation
It is crucial in today’s fast-paced business climate to derive rapid insight from data. Consequently, agility is essential for businesses. Successfully taking advantage of the value of big data requires experimentation and exploration, and both need to be done rapidly.

Velocity - Response time
Response times for results are still critical, despite the increase in data size. To ensure speed and real-time feedback from big data, a new approach is emerging where data sets are processed entirely within a server’s memory.

Velocity - Impact of security on performance and capacity
The increased velocity of data corresponds to an increase in security-relevant data. According to Tim Mather of KPMG, “Many big data systems were not designed with security in mind.” The security mechanisms need to be applied in a manner that does not increase access latency. In addition, big data technology enables massive data aggregation beyond what was previously possible. Therefore, organizations need to make data security and privacy high priorities as they collect more data in trying to get a single view of the customer.
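To give a sense of scale for the point made under Velocity – Access Latencies about moving petabytes across a network, the following is a minimal sketch of the underlying arithmetic. The link speeds, the decimal definition of a petabyte and the `efficiency` parameter are illustrative assumptions, not figures from this report, and the calculation ignores protocol overhead, contention and storage throughput.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # assumption: decimal petabyte, expressed in bytes


def transfer_time_days(data_bytes, link_bits_per_second, efficiency=1.0):
    """Idealised time, in days, to move data_bytes over a single link.

    efficiency is a hypothetical factor (0 < efficiency <= 1) standing in
    for the share of nominal bandwidth that is actually usable.
    """
    if data_bytes < 0 or link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed or efficiency")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


# Illustrative link speeds only; real deployments will differ.
for label, bits_per_second in [("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9), ("100 Gbit/s", 100e9)]:
    days = transfer_time_days(PETABYTE, bits_per_second)
    print(f"{label}: about {days:.1f} days per petabyte")
```

Even under these idealised assumptions, a single petabyte occupies a 10 Gbit/s link for more than a week, which is one way to see why analytics is usually brought to the data rather than the data to the analytics.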
Variety – Data Type
One of the crucial challenges that affect performance and capacity in a big data system arises from the variety of data types that can be introduced during typical processing cycles. These challenges can arise for these reasons:
o Growth, necessitating the addition of new systems, which can result in an uncontrolled heterogeneous landscape in the enterprise (such as a plethora of types of systems)
o The introduction of new systems that provide data but introduce challenges in identifying its relevance in big data systems

Variety - Tuning
The rise of information from a variety of sources, such as social media, sensors, mobile devices, videos, and chats, results in an explosion of the volume of data. Previously, companies often discarded the data because of the cost of storing it.

Veracity – Cleaning the Messy Data
The huge amount of data that comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, and cell phone GPS signals, to name a few, is messy data. Veracity deals with uncertain or imprecise data. If the data is error-prone, the information that is derived from it is unreliable, and users lose confidence in the output. Cleaning the existing data and putting processes in place to reduce the accumulation of dirty data is crucial.

Veracity – Performance & Capacity
To address the performance and capacity challenges that arise from lack of veracity, it is important to have data quality strategies and tools as part of a big data infrastructure. The aim of the data quality strategies is to ascertain “fit for purpose.” This involves evaluating the intended use of big data within the organization and determining how accurate the data needs to be to meet the business goal of the particular use case.
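The "fit for purpose" idea can be made concrete with a small sketch: measure a data quality attribute and compare it with the level a given use case requires. Everything specific here is a hypothetical illustration rather than part of this report: the record layout, the `completeness` measure, and the two thresholds are assumptions chosen only to show the shape of such a check.

```python
def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    records = list(records)
    if not records:
        return 0.0
    complete = sum(
        all(record.get(field) not in (None, "") for field in required_fields)
        for record in records
    )
    return complete / len(records)


def fit_for_purpose(score, required_score):
    """True when the measured quality meets what the use case needs."""
    return score >= required_score


# Hypothetical sensor readings; two of the four have a missing field.
readings = [
    {"sensor_id": "s1", "timestamp": "2015-03-04T10:00", "value": 21.5},
    {"sensor_id": "s2", "timestamp": "", "value": 19.0},
    {"sensor_id": "s3", "timestamp": "2015-03-04T10:05", "value": None},
    {"sensor_id": "s4", "timestamp": "2015-03-04T10:10", "value": 22.1},
]

score = completeness(readings, ["sensor_id", "timestamp", "value"])

# Assumed thresholds: a trend overview tolerates gaps, billing does not.
print("trend overview:", fit_for_purpose(score, 0.40))
print("billing:", fit_for_purpose(score, 0.99))
```

The same data set can therefore be acceptable for one use case and unacceptable for another, which is why the required accuracy is determined per use case rather than once for the whole organization.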
The data quality approaches that the organization adopts need to include several strategies:
o Definition of data quality benchmarks and criteria
o Identification of key data quality attributes (such as timeliness and completeness)
o Data lifecycle management and compliance
o Metadata requirements and management
o Data element classification

1.2. INTERNATIONAL SCENARIO, BEST PRACTICES, BUSINESS MODELS AND OPPORTUNITIES AVAILABLE

1.2.1 BIG DATA
Big Data’s biggest strength is its versatility and global application, so it naturally has an enormous, widespread impact. Use of Big Data in government, whether local, national or international, can be a game changer. Every government faces numerous challenges, the biggest perhaps being to make sense of the massive amounts of information it receives every day and to make decisions based on it, decisions which may in turn affect an entire country or even multiple nations. Not only is it tough to scrutinize all the information, but it is even more difficult to verify it. Flawed information can have devastating consequences. This is where Big Data comes to the rescue. With the help of Big Data, governments can derive crucial insights to aid decision making in real time from the heaps of ever-mounting data received from a myriad of sources, including the Web, biological and industrial sensors, video, email, and social communications. Governments can utilize Big Data to serve their citizens better and overcome countless challenges such as increasing health care costs, unemployment, natural calamities, poverty, illiteracy, terrorism, and international trade relations. Big Data in government can be the touchstone of a nation’s global standing. Here are a few areas where implementing Big Data can bring governments enormous benefits:

Air-Rail-Road Safety & Transport: With Big Data, governments can improve air-rail-road networks and transportation, and minimize accidents and mishaps.
Healthcare: Big Data tools can be used to improve treatment efficiency and provide more personalized care to patients.

Education: Education is another important area where Big Data can do wonders for the government. Big Data can help governments understand the educational needs of the population better.

Agriculture: Big Data can help governments and government agencies keep track of numerous factors within and outside the national borders: land, livestock, crops grown, crops required to be cultivated, food scarcity/abundance, flood/famine, farmer welfare, and countless other agriculture-related issues.

Poverty: Big Data makes it easy for governments to assess the greatest needs of their people and allows them to focus on areas where poverty alleviation is required.

Weather: Weather officials can use Big Data to predict impending weather-related emergencies and quickly alert residents to danger, and consequently save numerous lives.

Tax compliance: Big Data can help tax agencies detect and regulate tax frauds, waste & abuse of unpaid taxes, denied refunds.

Crime Prevention: Big Data tools can help law and order agencies in identifying emerging threats and in anticipating and averting criminal activity.

Big data technology is a very powerful and useful tool for governments across the globe. Admittedly, it cannot resolve all problems at once, but it is one big step in the right direction. Big Data empowers governments with the right tools to bring about important changes that can have a far-reaching impact on generations, present and future.

Consider the following examples from various countries:
• Seoul uses analytics to find late night bus routes: In the South Korean capital Seoul, night bus routes are determined by late night call volumes. Here’s how the city is helping late night commuters reach home safely. When the government was figuring out how to operate the night buses, it color-coded areas in the city based on call volumes.
It then found out how many passengers get on and off at each bus stop in the high call volume areas to determine the busiest routes the buses should ply.
• Singapore Government provides personalized services to its citizens: Singapore government websites are able to better recognize citizens’ needs with a new big data analytics tool. The cloud-based tool can process and understand a citizen’s question accurately and provide an answer within seconds. This capability enables citizens to better navigate government services and get personalized advice when using online services. The tool also provides government agencies with insights on citizens’ needs and priorities.
• Data sharing tips from Colorado – Health, Education, labor & industry: Using Big Data concepts, the data shared are around eligibility and service quality, for example: are people getting served in a reasonable time frame, and what are the demographics of the people who are consuming health, education, and employment services.
• Singapore Government’s initiatives in using analytics to improve quality of decisions & lives: The government believes that data analytics has huge opportunities to impact government services and improve citizens’ lives in a wide range of areas, such as healthcare, transportation, education, retail and waste management. A large volume of data is being generated from sensors and mobile devices today. This includes person-to-person, person-to-machine and machine-to-machine communication, added Sen. He and his team are tasked to evaluate and apply advanced analytics techniques and models that can help organizations get a “360-degree view on people, technology and policies to improve the quality of decisions and improve citizens’ lives and journey of experience at various touch points.”
• Australian Immigration became over 30% more effective thanks to analytics: After having deployed a new analytics system 18 months ago, they are generally now 30-40% more effective.
The analytics system allows the Department of Immigration and Border Protection to identify high-risk passengers with less disruption to other passengers coming into Australia. This has become possible because the analytics system combines data from the visa approval process, travel history to and from Australia and even real-time data collected during check-in. It analyses these datasets to profile the level of risk posed by each of the 50,000 passengers arriving in Australia every day.
• South Korea Government improves citizen engagement through Open Data & Big Data: The South Korean Ministry of Government Legislation (MOLEG) has significantly improved citizen engagement by enabling easy access to and search of accurate and timely legal information. The Centre gathers all kinds of information related to legislation, current laws and their histories, the constitution, laws passed in the national assembly, treaties, presidential decrees, decrees produced by each ministry, and other rules including local governments’ ordinances and regulations. MOLEG has also created a mobile app so that citizens can access the Centre on the go. Besides making legal information open and easily searchable by citizens, MOLEG wants to involve the public in the lawmaking process.
• Big Data: Digital Agenda for European Commission:
o Healthcare: saving lives with better diagnosis
o Transport: fewer accidents and traffic jams
o Environment: reduced energy consumption
o Agriculture: safer food and increased productivity
o Manufacturing and retail: optimized processes for safer and personalized products
• Some More Ways Big Data Is Used Today To Change Our World:
o Understanding and Targeting Customers
o Understanding and Optimizing Business Processes
o Personal Quantification and Performance Optimization
o Improving Healthcare and Public Health
o Improving Sports Performance
o Improving Science and Research
o Optimizing Machine and Device Performance
o Improving Security and Law Enforcement
o Improving and Optimizing Cities and Countries
o Financial Trading / Pricing
o Out of home advertising
o Retail Habits
o Politics
o Weather
o Heart Disease
o Infectious diseases
o Doctor performance

1.2.2 OPEN DATA
Governments and public authorities across the world are launching Open Data initiatives. Research indicates that by October 2011, twenty eight nations around the world had established Open Data portals. Public administration officials are now beginning to realize the value that opening up data can have. For instance, the direct impact of Open Data on the EU27 economy was estimated at €32 Billion in 2010, with an estimated annual growth rate of 7%. However, very few governments are taking the right measures to realize the economic benefits of Open Data. Political support, the breadth and refresh rate of data released, the ease of sourcing data and participation from the user community determine the degree of maturity of an Open Data program.
Capgemini Consulting conducted an analysis of 23 select countries across the world which have already initiated Open Data programs, and rated them on a set of parameters as given below in Figure 1.4:

FIGURE 1.4: PARAMETERS USED FOR BENCHMARKING COUNTRIES ON OPEN DATA INITIATIVES (Source: Capgemini)

Based on their positioning and pace of adoption of Open Data initiatives, the 23 countries were classified into three categories: Beginners, Followers and Trend Setters. The results are as given below in Figure 1.5:

FIGURE 1.5: BENCHMARKING OF OPEN DATA INITIATIVES, SELECT COUNTRIES, 2012 (Source: Capgemini)

1.3. INDIAN PERSPECTIVE

1.3.1 BIG DATA ANALYTICS
An IDC Insight document examines the Big Data India market trends and provides a forecast for the period of 2014–2017. It also captures the current market situation, spending and adoption patterns across end user organizations, as well as business drivers, inhibitors and use cases across verticals. It covers the state of Big Data adoption across end user organizations and the forecast for the coming years. The report is based on key findings from a survey of 250+ end user organizations and in-depth interviews with 5+ supply side vendors. The report provides Big Data spend for each technology segment (infrastructure, software and services) and the growth pattern for each segment. "IDC expects the Big Data technology and services market in India to witness a phenomenal compounded annual growth rate (CAGR) of 36.3% for the period of 2012–2017 to reach US$ 191 million from US$ 40.7 million in CY12.
The huge growth potential is attributed to the inclination of the business functions to get meaningful insight out of the humongous data growth in their organizations," (www.idc.com/getdoc.jsp?containerId...dated 21/02/15)

Looking into the future of IoT, Big Data and Cloud Computing, NetApp has reported in CIOL that the year 2015 will see quantum increases in data generation, led by the IoT phenomenon. Data will become the new gold. A leading industry analyst firm’s Digital Universe analysis of the growth of data projects that intelligent connected devices will increase the share of “useful data” that can be analyzed and used to make decisions from 22% in 2013 to 35% in 2020. This “useful data” needs to be in digital storage in order to be analyzed and used. This will compel enterprises and government alike to think harder about network efficiency, storage and analytics. If India is to achieve the goals we have set for ourselves in 2014, a calibrated approach, born of long term technology roadmaps, is imperative. Analytics deployments will be spurred by the increasingly complex marketing and consumer engagement environment that has been created in the digital era.

1.3.2 CLOUD COMPUTING & SOFTWARE DEFINED STORAGE
Organizations contemplating both green field and brown field cloud deployments will tend towards a multi-vendor hybrid cloud environment that provides the benefits of both worlds – public and private cloud. Avoidance of lock-in, leverage in negotiations, or simply a desire for choice will make customers reluctant to work with one cloud vendor, and multiple-vendor hybrid clouds will attain prominence. This growth will be further boosted as big data evolves and drives the need for sophisticated storage infrastructure.

Software Defined Storage (SDS) is a foundational platform that addresses a range of use cases by managing data placement according to cost, compliance, availability, and performance requirements.
SDS can be deployed on different hardware platforms and will extend to cloud architectures as well. SDS will enable consistent data accessibility across cloud platforms, thus simplifying data management. To date, enterprises have used disks to store their critical data. These SATA disks come with a lot of challenges, including space usage, time taken for input and the overhead costs of maintaining the requisite environment. While this is not going to change and at least 80% of enterprise data will continue to reside on disks, Flash will start taking baby steps as organizations become aware of its advantages and ease of use. However, the growth of this transformative technology will be hindered by costs – the least expensive SSDs will likely be 10 times more expensive than the least expensive SATA disks.

1.3.3 INTERNET OF THINGS – IoT
Companies are increasingly looking for scale-out applications. To accommodate this need, Docker containers are more resource efficient and reduce the storage space required as compared to hypervisors. We will see the emergence of a robust ecosystem for data management through Docker and other surrounding services in 2015. With IoT devices expected to grow to 4.9 billion in 2015, up 30 per cent from 2014, and to reach 25 billion by 2020 as per a leading analyst firm, unstructured data is being created by every device thinkable, from smart phones, laptops, and social to cloud applications. Organizations need to become technologically sharp to deal with the changing dynamics in the big data space. They should adopt improved storage solutions to address their needs, and the above predictions hold good for them. (www.netapp.com/in/.../news/.../news-rel-20141219-184398.aspx, dated 25/12/14)

1.3.4 INDIA’S HIGH DEMAND FOR BIG DATA WORKERS
The biggest fallout of the big data revolution, where every type of business gathers and analyzes data, is a massive human resources shortage.
Across the globe, thousands of data analytics jobs are going begging because of a shortage of qualified manpower. A McKinsey Global Institute Study Report (Big data: The next frontier for innovation, competition, and productivity) projects that the US alone will face a shortage of about 190,000 data scientists by 2018 and, further, a shortfall of 1.5 million managers and analysts who can understand and make decisions using big data. According to this report, India produces the third largest absolute number of BA professionals, after the USA and China; however, India produces only 1.12 BA professionals per 100, compared with 8.11 for the USA and 1.31 for China. The worry is that India’s figure of 1.12 is lower than that of most countries. See Figure 1.6.

FIGURE 1.6: NUMBER OF GRADUATES WITH DEEP ANALYTICAL TRAINING
SOURCE: McKinsey Global Institute Study Report (Big data: The next frontier for innovation, competition, and productivity)

Three key types of talent are required to capture value from big data:
o Deep Analytical Talent - people with technical skills in statistics and machine learning, for example, who are capable of analyzing large volumes of data to derive business insights;
o Data-Savvy Managers and Analysts - who have the skills to be effective consumers of big data insights, i.e., capable of posing the right questions for analysis, interpreting and challenging the results, and making appropriate decisions; and
o Supporting Technology Personnel - who develop, implement, and maintain the hardware and software tools, such as databases and analytic programs, needed to make use of big data.

Data analytics as a job discipline became mainstream almost a decade ago, and the demand for trained professionals has been growing steadily since.
Given India's reputation for the availability of professionals in varied disciplines at reasonable costs, global banks and financial services firms were the first to migrate their analytics work to India, followed by pharmaceutical and life sciences companies. Global retailers, consumer firms, logistics firms, consultancies, and engineering firms have all begun routing their data analytics work to IT services providers and specialized analytics service providers in India. The talent deficit is on two fronts: data scientists who can perform analytics, and analytics consultants who can understand and use the data. In the first category, big data engineers and scientists are extremely scarce; in the second, better quality is needed, and India is going to be short of a million data consultants soon.

1.3.5 NASSCOM PERSPECTIVE
To address the growing business opportunities in the Analytics and Big Data space, the National Association of Software and Services Companies (NASSCOM) has taken initiatives such as holding the NASSCOM Big Data & Analytics Summit 2014 in Hyderabad. With the theme of “Industrialization of Analytics”, the summit deliberated on how to build analytically-mature organizations with analytics embedded at the business core & across the business value chain. At the summit, industry leaders shared best practices on processes, tools, technology, techniques and applications used in the context of analytics, as well as insights on how to build India’s analytics talent strength. The following are the highlights:

Global analytics market
As firms gain access to greater volumes and newer varieties of data, and as they unearth more innovative ways of generating insights for improved customer engagement, implementing analytics is gaining in importance. The global analytics market (software products and outsourced services) has been growing at over 12 per cent since 2012.
The 2014 market size is estimated at USD 96 billion and is projected to reach USD 121 billion by 2016. Outsourced services around analytics are growing at a faster CAGR of over 14 per cent vis-à-vis analytics software (CAGR ~10 per cent). This growth is being driven by a host of factors: cloud and in-memory computing; mobile devices and social media; the emergence of different business units across an organization as consumers of analytics; and so on. With analytics being consistently recognized as the top priority for CXOs, firms are also industrializing analytics within the organizational culture, and this in turn is seeing the emergence of the Chief Data Officer’s role.

India analytics market
Compared to the global market, the overall India analytics market is minuscule and currently accounts for only a 1 per cent share. The India market (exports and domestic) is growing at double the rate of the global market, at 24 per cent CAGR. In FY2014, the total market was USD 954 million, and it is expected to reach nearly USD 2.3 billion by FY2016. The ratio of exports to domestic is likely to remain steady at 85:15 during this period. Currently, this segment has over 600 firms offering analytics-related products and services, and it employs about 29,000 people. India is the primary target market for ~50 per cent of these firms. The fact that India’s Top 100 IT-BPM (integrated) firms and about 500+ start-up firms are focused on analytics is proof of this technology’s increasing relevance. India is rapidly emerging as the analytics hub for the world. It has the complete range of ecosystem players, from GICs, integrated IT-BPM firms and pure-play analytics firms to BPM-KPOs and a vibrant set of analytics product firms. In terms of geographic density, Bengaluru has the highest number of analytics firms – 29 per cent, followed by Mumbai and Pune – 24 per cent. Apart from this, many Tier II/III cities are also emerging as hubs: Trivandrum, Kochi, Mysore, Indore, etc.
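Since several of the market figures in this chapter are expressed as compounded annual growth rates, a short sketch of the arithmetic may help readers cross-check them. The formula is the standard CAGR definition; the helper names `cagr` and `project` are illustrative, and treating the calendar or fiscal years as whole-year intervals (for example, FY2014 to FY2016 as two years) is an assumption made here, not something the sources state.

```python
def cagr(start_value, end_value, years):
    """Compounded annual growth rate implied by two values `years` apart."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value reached after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


# IDC figures quoted in section 1.3.1: US$ 40.7 million in CY12, 36.3% CAGR to 2017.
print(f"IDC projection for 2017: US$ {project(40.7, 0.363, 5):.1f} million")

# Global analytics market: USD 96 billion (2014) to USD 121 billion (2016).
print(f"Global implied CAGR: {cagr(96, 121, 2):.1%}")

# India analytics market: USD 954 million (FY2014) to nearly USD 2.3 billion (FY2016).
print(f"India implied CAGR: {cagr(954, 2300, 2):.1%}")
print(f"India at 24% for two years: USD {project(954, 0.24, 2):.0f} million")
```

Under these assumptions, the IDC and global figures are consistent with the growth rates quoted alongside them. The India figures are not: growing from USD 954 million to nearly USD 2.3 billion in two years implies a rate well above 24 per cent, so the stated CAGR and the FY2016 estimate may refer to different base periods or market definitions, which the source does not specify.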
Analytics in the India domestic market

There is also a pull factor from the user side: firms in India are beginning to realize the value of implementing analytics. The potential impact can be operational (cost control, process efficiencies), customer-facing (user insights, targeted marketing) and strategic (driving sales, improved decision making). Firms in the BFSI, telecom and e-commerce verticals have so far taken the lead in adopting and applying analytics to a wide range of business areas: portfolio analytics, risk and compliance analytics, customer loyalty, subscriber profiling, churn management, etc. Emerging verticals that are still in the pilot phase of adoption include retail, manufacturing, and media and entertainment. One of the key verticals showing great promise is Government: SEBI (fraud detection), NATGRID (anti-terrorism), and state-level initiatives such as the Maharashtra Sales Tax Department and Hyderabad’s intelligent transport system.

1.4. SWOT ANALYSIS OF THE BIG DATA ANALYTICS

1.4.1 THE NEED

A SWOT analysis helps in understanding strengths and weaknesses, and in identifying open opportunities and the threats that may accompany them. It helps differentiate between marginal and valuable opportunities, and in deciding what to exploit and what to ignore. It also gives a sense of the threats and their intensity, making it possible to keep an eye on those unlikely to cause damage while guarding against increasingly dangerous ones. Finally, it provides an opportunity to identify the gaps, which will lead to the preparation of a strong and structured Strategic Roadmap for Big Data Analytics. Below is the SWOT analysis of big data analytics in India.

1.4.2 SWOT ANALYSIS – BIG DATA ANALYTICS, INDIA

Strengths
• There is a growing interest in archiving, sensing, behavioral data, and personal data.
• There is a large amount of content and data available – the issue is accessing and making use of it.
• There is broad and detailed domain know-how as well as process know-how available.
• Many domains have innovative technology and skilled people.
• There are many universities/institutions with high capacity where skills can be developed.
• There are avenues where good science, engineering and domain-specific education can be obtained.
• Immense growth opportunity in the analytics market: Indian product firms have shown a growth rate of 20-40 per cent in the last few years; several emerging players have witnessed over 100 per cent growth within the first year of launch. (NASSCOM)
• Analytics – a definite market for India: Over 100 Indian analytics-focused software product firms have successfully developed and launched products catering to niche business needs, cutting across vertical-specific, horizontal process-centric and niche applications and platforms. (NASSCOM)
• Growing start-up base accelerating the growth: Four-fold increase in analytics start-ups in the last four years. (NASSCOM)
• Innovative offerings focusing on end-to-end customer business needs. (NASSCOM)

Weaknesses
• There are no established cooperation networks between content providers in several domains.
• Computer clusters and cloud resources are not readily available and accessible to users/stakeholders such as researchers in institutes and research labs.
• There are not many SMEs that are dynamic and flexible and can react quickly to market changes.
• Geospatial and environmental data sets and supporting infrastructure data sets are not readily available.
• There is no strong, established content/data market in India.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of failure.
• There are few large companies to lead the market, and many small companies that need nurturing.
• There is a lack of access to Big Data facilities that make data more easily accessible.
• There is no visibility of ecosystem service offerings.
• It is unclear what data should be preserved, and for how long, in all the different sectors and markets.
• Lack of processable linked data, and of aggregated/combined data.
• Lack of seamless data access and inter-connectivity, and low levels of interoperability: data is often in silos, and data sharing is difficult due to an ineffective Data Sharing Policy and a lack of standards, e.g. for formats and semantics.
• Migration of data between systems, versions or partners is challenging.
• Accessing and processing data sets that are too big to be handed to the end user is difficult.
• Public data in the country is not available to the extent it should be.
• The quality of data, even in open data portals, is often very low.
• The different languages within the country create a barrier (multilingualism) during data processing.
• Structural data sources often lack precise semantics.
• Poor and inconsistent use or management of metadata.
• There is a lack of specialized education programs for data analysts.
• There are not enough skilled people to participate in capacity building training programs.
• Legislative restrictions on data sharing decrease availability across the country and make national, industry and domain-focused initiatives that address these issues more difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to address.
• There is no well-designed data governance: Data governance is a must-have, and no longer merely a good-to-have. In today's hyper-competitive markets, insightful knowledge means the difference between success and being overwhelmed. But it has to be based on the right data, driven by business requirements.
• Data protection policy: "Ignoring data security, data quality and data access can cost organizations millions of dollars, hurting enterprise agility, efficiency and reputation."

Opportunities
• In a multi-cultural society, mixing various cultures, practices, strengths and approaches can result in creative thinking.
• The topics proposed by the DST/BDI and best-practice examples from other initiatives can lead to synergies.
• Strengthening the Indian market, e.g. by fusing the emerging start-up nucleus.
• Create many SMEs to pursue the low-hanging fruit of Big Data, for which agility is required.
• Investment in the entire innovation chain, beyond basic research.
• Investment support mechanisms for SMEs/Research/Institutions/Students/Scholars/Entrepreneurs.
• Collaboration within Industry/Academia/DST/Service Providers/Data Generators.
• Improve and encourage innovation and creativity to create cost-effective solutions.
• There is the opportunity to open up completely new and different business areas and services.
• New applications can be created throughout the Big Data ecosystem, ranging over acquisition, data extraction, analysis, visualization and utilization.
• Easier syndication of data and content across industries/domains.
• Micropayments for processed data or the results from analytics.
• Wearable sensors and sensor technologies become mainstream, generating more data.
• The explosion of device types opens up access to any data from any device for greater and more varied usage.
• Development of APIs for access becoming standardized and available.
• Interoperability tools and standardized APIs to facilitate data exchange.
• Greater visibility and increased use of directory services for data sources.
• Use semantics to align content from various data sources.
• Providing facilities to better navigate and curate data.
• Contextualization and personalization of data.
• The evolution of different sectors and the increased volume of data enable innovative applications to be developed.
• Exploring new research areas.
• Training focused on innovation in DST/BDI.
• Use and exploration of Big Data to be ubiquitous in education and training.
• Address the safe and secure storage of data on a national basis.
• User-generated and crowd-sourced content is increasingly available, which will help solve a variety of recurring problems once and for all.
• Data-as-a-service can significantly lower market entry barriers (in particular to new markets).
• Shift from technology push to end-user engagement.
• Create rich and complex data value chains.
• Develop strong and workable policies for data access in the country, across private and public data, to help build comprehensive capabilities.
• By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier: As the presence of the Internet of Things (IoT), such as connected devices, sensors and smart machines, grows, the ability of things to generate new types of real-time information and to actively participate in an industry’s value stream will also grow. (GARTNER)
• By 2017, more than 30% of enterprise access to broadly based big data will be via intermediary data broker services, serving context to business decisions: Digital business demands real-time situation awareness. This includes insights into what goes on both inside and outside the organization. How do weather patterns impact inventory? More so, how do this season’s customer preferences, as expressed in social media, suggest greater or lesser inventory? (GARTNER)
• By 2017, more than 20% of customer-facing analytic deployments will provide product tracking information leveraging the IoT: Fueled by the Nexus of Forces (mobile, social, cloud and information), customers now demand a lot more information from their vendors.
The rapid dissemination of the IoT will create a new style of customer-facing analytics, product tracking, where increasingly less expensive sensors will be embedded into all types of products. (GARTNER)
• Analytics – opening up a gamut of opportunities for Indian software product firms. (NASSCOM)
• Big Data as a service (BDaaS): the delivery of statistical analysis tools or information by an outside provider, helping organizations understand and use insights gained from large information sets to gain a competitive advantage.

Threats
• Many skilled professionals leave the country to work in other regions, adding to the risk of a “Brain Drain”.
• Acute lack of skilled professionals and graduates.
• Non-standardization of the ‘contents’, ‘duration’, ‘mode of delivery’ and ‘certification’ of the skilling and/or up-skilling efforts made by the education/training ecosystem.
• There are no existing ecosystems or portals where reliable data sets are available; these need to be created.
• Policies are often too connected to the ‘old data’ world.
• Complete analysis of ethical and privacy issues is needed.
• Risk of over-regulation and protectionism in the country as compared to elsewhere in the developed world.
• Policies of data availability: for example, companies are not willing to make data available, just in case it may cause legal action or result in competition.
• Technology & Techniques: To capture value from big data, organizations will have to deploy new technologies, e.g. storage, computing and analytical software. The range of technology and technique challenges, and the priorities set for tackling them, will differ depending on the data maturity of the institution.
• Organizational Change and Talent: Organizational leaders may not fully understand and appreciate the value in big data or how to unlock this value.
• Shortage of Skills: A wide range of skills is relevant for businesses wanting to use data analytics, including knowledge of statistical techniques, the ability to program and use software, market-specific knowledge and communication. These skills may not be available in the required quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills required for big data analysis is for students to work closely with a company during their studies. Collaboration between a university/institution with analysis expertise and a business with real-world data can be beneficial for both parties.
• Trying to rush all data out to everyone all at once: Consider the whole cycle from the acquisition of data to the extraction of information, and consider the hygiene factors along this path. There is a time in which data should be immediately available to decision makers, and there is a time when it can be retired.
• BDaaS requires a coordinated effort: Successful Big Data-as-a-Service implementation would require close collaboration between Enterprise Architects, Data Architects, Database admins, BI and DW SMEs, SOA experts, InfoSec representatives and business strategists.
• Data Sharing Policy: The recommendations made by CODATA on Capacity Building and the Data Sharing Principles in Developing Countries are given below. Unless these are implemented, the use of Big Data Analytics may not take off as desired.
o Data should be open and unrestricted.
o Data should be free to the user.
o Data should be informative and assessed for quality.
o Data sharing should be timely.
o Data should be easy to find and access.
o Data should be interoperable.
o Data should be sustainable.
o Data contributors should be given credit.
o Data access should be equitable.
o Data may be restricted, in exceptional cases, if adequately justified.

1.5. THE IDENTIFIED GAP AREAS

1.5.1 GAP IDENTIFICATION

Big data, if used successfully, will be a big leap in the intelligent governance of industry, business, research and government. Before moving further, however, it is worth examining the issues raised in the SWOT analysis and identifying the gaps. The challenges are enormous, but if this paradigm shift can be executed properly, it will change the way the future looks. It will be a hard journey full of ifs and buts, but the risk and effort are worth taking, as intelligent governance is the need of the hour. To help prepare the Strategic Roadmap, the identified gaps can be grouped into the following categories:
o Market and Business
o Technical
o Data, Content and Usage
o Education and Skills
o Policy, Legal and Security

Market and Business
• Rewarding efforts to improve and encourage innovation and creativity to create cost-effective solutions.
• Exploiting the opportunity to open up completely new and different business areas and services.
• There are not many SMEs that are dynamic and flexible and can react quickly to market changes.
• There are few large companies to lead the market, and many small companies that need nurturing.
• Encouraging a shift from technology push to end-user engagement.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of failure.
• Launching new initiatives to strengthen the Indian market, e.g. by fusing the emerging start-up nucleus.
• Initiatives that can lead to the creation of many SMEs to pursue the low-hanging fruit of Big Data, for which agility is required.
• Providing investment in the entire innovation chain, beyond basic research.
• Trying to rush all data out to everyone all at once: Consider the whole cycle from the acquisition of data to the extraction of information, and consider the hygiene factors along this path. There is a time in which data should be immediately available to decision makers, and there is a time when it can be retired.
• BDaaS requires a coordinated effort: Successful Big Data-as-a-Service implementation would require close collaboration between Enterprise Architects, Data Architects, Database admins, BI and DW SMEs, SOA experts, InfoSec representatives and business strategists.

Technical
• Computer clusters and cloud resources are not readily available and accessible to users/stakeholders such as researchers in institutes and research labs.
• There is a lack of access to Big Data facilities that make data more easily accessible.
• Migration of data between systems, versions or partners is challenging.
• Accessing and processing data sets that are too big to be handed to the end user is difficult.
• The quality of data, even in open data portals, is often very low.
• Technology & Techniques: To capture value from big data, organizations will have to deploy new technologies, e.g. storage, computing and analytical software. The range of technology and technique challenges, and the priorities set for tackling them, will differ depending on the data maturity of the institution.
• Provide a platform for collaboration within Industry/Academia/DST/Service Providers/Data Generators.
• Organizational Change and Talent: Organizational leaders may not fully understand and appreciate the value in big data or how to unlock this value.
• The different languages within the country create a barrier (multilingualism) during data processing.
• Structural data sources often lack precise semantics.
• Poor and inconsistent use or management of metadata.
• A mechanism to encourage a large number of research projects on the topics proposed by the DST/BDI, drawing on best-practice examples from other initiatives that can lead to synergies.
• Encouraging development of new applications throughout the Big Data ecosystem, ranging over acquisition, data extraction, analysis, visualization and utilization.

Data, Content and Usage
• Facilitate easier syndication of data and content across industries/domains.
• There are no established cooperation networks between content providers in several domains.
• Geospatial and environmental data sets and supporting infrastructure data sets are not readily available.
• There is no strong, established content/data market in India.
• There is no visibility of ecosystem service offerings.
• Providing facilities to better navigate and curate data.
• Encouraging contextualization and personalization of data.
• Lack of processable linked data, and of aggregated/combined data.
• Lack of seamless data access and inter-connectivity, and low levels of interoperability: data is often in silos, and data sharing is difficult due to an ineffective Data Sharing Policy and a lack of standards, e.g. for formats and semantics.

Education and Skills
• There is a lack of specialized education programs for data analysts.
• Development of APIs for access becoming standardized and available.
• Development of interoperability tools and standardized APIs to facilitate data exchange.
• There are not enough skilled people to participate in capacity building training programs.
• Many skilled professionals leave the country to work in other regions, adding to the risk of a “Brain Drain”.
• Investment support mechanisms for SMEs, Research, Institutions, Students, Scholars, and Entrepreneurs.
• Acute lack of skilled professionals and graduates.
• Non-standardization of the ‘contents’, ‘duration’, ‘mode of delivery’ and ‘certification’ of the skilling and/or up-skilling efforts made by the education/training ecosystem.
• Shortage of Skills: A wide range of skills is relevant for businesses wanting to use data analytics, including knowledge of statistical techniques, the ability to program and use software, market-specific knowledge and communication. These skills may not be available in the required quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills required for big data analysis is for students to work closely with a company during their studies. Collaboration between a university/institution with analysis expertise and a business with real-world data can be beneficial for both parties.
• The ways and means to leverage the broad and detailed domain know-how and process know-how available in some parts of industry, research and business.
• Encouraging the sharing of the expertise available in the many domains that have innovative technology and highly skilled people.

Policy, Legal and Security
• It is unclear what data should be preserved, and for how long, in all the different sectors and markets.
• Public data in the country is not available to the extent it should be.
• Legislative restrictions on data sharing decrease availability across the country and make national, industry and domain-focused initiatives that address these issues more difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to address.
• There is no well-designed data governance: Data governance is a must-have, and no longer merely a good-to-have. In today's hyper-competitive markets, insightful knowledge means the difference between success and being overwhelmed. But it has to be based on the right data, driven by business requirements.
• Data protection policy: "Ignoring data security, data quality and data access can cost organizations millions of dollars, hurting enterprise agility, efficiency and reputation."
• There are no existing ecosystems or portals where reliable data sets are available; these need to be created.
• Policies are often too connected to the ‘old data’ world.
• Complete analysis of ethical and privacy issues is needed.
• Risk of over-regulation and protectionism in the country as compared to elsewhere in the developed world.
• Policies of data availability: for example, companies are not willing to make data available, just in case it may cause legal action or result in competition.
• Data Sharing Policy: The recommendations made by CODATA on Capacity Building and the Data Sharing Principles in Developing Countries are given below. Unless these are implemented, the use of Big Data Analytics may not take off as desired.
o Data should be open and unrestricted.
o Data should be free to the user.
o Data should be informative and assessed for quality.
o Data sharing should be timely.
o Data should be easy to find and access.
o Data should be interoperable.
o Data should be sustainable.
o Data contributors should be given credit.
o Data access should be equitable.
o Data may be restricted, in exceptional cases, if adequately justified.

2. DATA SCIENCE & TECHNOLOGY

2.1 WORLDWIDE SITUATION

Big Data and the Internet of Things are disrupting entire markets, with machine data blurring the boundary between the virtual world and the physical world. This market matters: a recent Goldman Sachs report cites a $2 trillion opportunity by 2020 for IoT, with the potential to impact everything from new product opportunities, to shop floor optimization, to factory worker efficiency gains that will power top-line and bottom-line gains.
The company that delivers high-quality big data solutions fastest, and enables customers to connect people, data and things to transform their industries and organizations, will win. (blog.pentaho.com/tag/iot/ dated 25/12/14) In today's world, technology drives businesses, and the internet is central to solving most business problems. Hardly any job can be completed without the use of computers and the internet. But the widespread use of these technologies across every possible business also generates an enormous amount of data, which cannot be accommodated or managed by the traditional databases we are used to.

2.1.1 UN’S GLOBAL PULSE

Overview

Global Pulse is an innovation initiative launched by the Executive Office of the United Nations Secretary-General, in response to the need for more timely information to track and monitor the impacts of global and local socio-economic crises. The Global Pulse initiative is exploring how new digital data sources and real-time analytics technologies can help policymakers understand human well-being and emerging vulnerabilities in real time, in order to better protect populations from shocks. Digital data is seen as offering the opportunity to gain a better understanding of changes in human well-being, and to get real-time feedback on how well policy responses are working. The overarching objective of Global Pulse is to mainstream the use of data mining and real-time data analytics in development organizations and communities of practice. Global Pulse promotes awareness of the opportunities Big Data presents for relief and development, forges public-private data sharing partnerships, generates high-impact analytical tools and approaches through its network of Pulse Labs, and drives broad adoption of useful innovations across the UN System.
The objectives of the initiative include:
• Increasing the number of Big Data for Development (BD4D) innovation success cases
• Lowering systemic barriers to big data for development adoption and scaling
• Strengthening cooperation within the big data for development ecosystem

Big Data for Development

Since its inception in 2009, Global Pulse has been investigating the viability of using new and alternative data sources to support development goals. This includes data from:
• Online Content - Public news stories, blogs, Twitter, Facebook, obituaries, birth announcements, job postings, e-commerce, etc.
• Data Exhaust - Anonymized data generated through the use of services such as telecommunications, mobile banking, online search, hotline usage, transit, etc.
• Physical Sensors - Satellite imagery, video, traffic sensors, etc.
• Crowdsourced Reports - Information actively produced or submitted by citizens through mobile phone-based surveys, user-generated maps, etc.

Global Pulse is exploring innovative methods and frameworks for combining new types of digital data with traditional indicators to track global development in real time. See figure 2.1.

Research Overview

Global Pulse identifies problems that could be addressed through real-time monitoring of digital data. It also designs and conducts applied research projects that aim to discover practical uses of Big Data to solve these challenges, and to prototype technology tools for monitoring development progress and tracking emerging vulnerabilities. The Pulse Lab teams assist in conducting pilot-based evaluations of new tools and approaches within existing programs and policy initiatives. Global Pulse also forges strategic public-private partnerships to secure access to sources of Big Data, state-of-the-art analytical tools, and expert advisors in the relevant technical fields.
FIGURE 2.1: INNOVATIVE CYCLE (SOURCE: GLOBAL PULSE)

Goals of the Project

The development of a new set of technology tools, partnerships and capacities is designed to complement existing data-gathering and analysis methods. These should contribute to improved global development outcomes in three ways:
• Enhanced Early Warning
• Real-Time Awareness
• Real-Time Feedback

Projected Program for 2015-2016
• Pulse Lab Network. With at least 3 Pulse Labs launched, labs are sharing analytical methodologies and key innovations in relevant technologies to support institutional partners in the adoption of real-time data into their decision-making and monitoring.
• Real-Time Monitoring Framework. Building on continued joint research on real-time monitoring with governments, UN agencies, the private sector and academia, a compilation of methods papers is published.
• Pulse Lab Handbook. A handbook capturing lessons and best practices in analysis, technology innovation, community engagement and partnerships to support government use of Big Data for real-time development monitoring and planning.
• Technology Toolkit. An integrated suite of free and open source technology tools for data collection, analysis, and decision support, made available to the global community.
• Data Philanthropy Network. Global Pulse assembles a global network of public and private sector partners sharing data through a secure network to support real-time tracking of development.

2.1.2 THE WORLD DATA SYSTEM (WDS) STRATEGY PLAN 2014-2018

The World Data System (WDS) is an Interdisciplinary Body of the International Council for Science (ICSU). Its vision is “a world where excellence in science is effectively translated into policy making and socio-economic development.
In such a world, universal and equitable access to scientific data and information is a reality and all countries have the scientific capacity to use these and to contribute to generating the new knowledge that is necessary to establish their own development pathways in a sustainable manner.” Its goals include:
• Enable universal and equitable access to quality-assured scientific data, data services, products and information
• Ensure long-term data stewardship
• Foster compliance to agreed-upon data standards and conventions
• Provide mechanisms to facilitate and improve access to data and data products

The Strategic Committee on Information and Data of ICSU works closely with ICSU’s Committee on Data for Science and Technology (CODATA), and is developing strategic collaboration on issues of common interest. WDS Strategic Targets are:
• Enable universal and equitable access to scientific data, data services, products and information
• Ensure long-term data stewardship
• Foster compliance to agreed-upon data standards and conventions
• Provide mechanisms to facilitate and improve access to data and data products

The major targets for the current years are as follows:
• Make trusted data services an integral part of international collaborative scientific research. To this end, ICSU-WDS will endeavour to:
o Involve WDS Members more closely in international collaborative scientific research.
o Promote the use of best practices in international collaborative research programs.
• Nurture active disciplinary and multidisciplinary scientific data services communities. To this end, ICSU-WDS will strive to:
o Support existing data communities whose practices serve their members and the scientific community well.
o Strengthen emerging communities by helping them to identify their needs and to organize their activities.
o Provide mechanisms that facilitate cross-disciplinary interactions and activities.
o Contribute towards scientific development by improving the analytical environment.
• Improve the funding environment. ICSU-WDS seeks to play a key role in coordinating this by working with its Members to:
o Promote international, national, and disciplinary policies that lead to sustainable long-term funding.
o Engage and work with research funders to increase resources for data services, including as part of research funding.
• Improve the trust in and quality of open scientific data services. ICSU-WDS is committed to increasing the quality of, and trust in, the services provided by its Members, and will concentrate on the following targets:
o Provide a certification framework for WDS Regular and Network Members.
o Actively promote policies of full and open access to data at national and international fora.
o Foster interoperable practices to facilitate data sharing.
o Facilitate access to, and use or reuse of, datasets (including through publication), in particular for multidisciplinary research.
• Position ICSU-WDS as the premium global multidisciplinary network for quality-assessed scientific research data.

2.1.3 WORLD ECONOMIC FORUM (WEF): REWARDS AND RISKS OF BIG DATA - 2014

Extracting Value from Big Data

Big data is changing our lives and changing the way we do business. Data-based value creation requires the identification of patterns from which predictions can be inferred and decisions made. It requires understanding the right way to create this value, and knowing how to separate valuable information from hype.
It means a clear understanding of some of the following:
• How the network unleashes the benefits of big data;
• How policymakers and business executives need to develop action plans to extract value from big data;
• How to balance the risks and rewards and how to manage them;
• Rebalancing socioeconomic asymmetry in a data-driven economy;
• The role of regulation and trust building in translating the potential of big data into socioeconomic results; and
• How to define organizational change to take full advantage of big data.

2.1.4 BIG DATA FOR DEVELOPMENT IN CHINA – UNDP PERSPECTIVE (NOVEMBER 2014)

China has the world’s largest mobile phone market, with over 1.2 billion mobile subscriptions, and over 600 million Internet users. It is also the world’s most active environment for social media, with the government estimating that over 250 million people use social media. The digital universe in China is also expected to continue growing at a rapid rate, with the country’s share of global digital data expected to rise to 18 percent by 2020, up from 13 percent in 2014. China is therefore a favorable environment in which the Big Data approach could be effective in providing insights on emerging concerns that are highly relevant to China’s development. A general approach to the use of big data for development is given in figure 2.2.

Two proposed levels of work in relation to Big Data for Development in China:

To leverage the considerable potential of Big Data for Development, UNDP has identified 2 levels of work:
• Create an enabling environment for Big Data for Development.
• Tackle particular development challenges with the Big Data approach.

In November 2013, China’s National Bureau of Statistics (NBS) signed a series of agreements with 11 major Chinese enterprises, aiming to build long-term collaborative relationships on using Big Data.
These enterprises have indicated their willingness to share data with NBS to maximize the effect of Big Data application. For example, the cooperation between NBS and Baidu focuses on 3 main aspects:
• Generalizing the official statistical data and programs through the Baidu website;
• Improving the predicting model of the macro economy by combining the Big Data on the web with survey data collected by NBS;
• Grasping more meaningful statistical requirements and completing the survey programs by following netizens’ paths on the web platform.

Development agencies could seek to be involved in these established partnerships, or even create new partnerships with these enterprises, to identify and share data that would be relevant for development purposes.

FIGURE 2.2: 6 ILLUSTRATIVE EXAMPLES OF BIG DATA FOR DEVELOPMENT (SOURCE: UNDP PERSPECTIVE (NOVEMBER 2014))

Tackle specific development challenges with the Big Data approach:
To leverage big data to overcome the development challenges in China, the following measures have been devised by the UN:
• Promote sustainable e-waste disposal practices
• Improve productivity of the public sector
• Understand socioeconomic development trends
• Map poverty
• Improve urban transport planning
• Identify pollution hotspots in cities

Challenges for application of Big Data for Development in China:
Adopting the Big Data approach raises a multitude of issues that need to be addressed to ensure effective application of Big Data for Development. China, like any other country, will need to address these challenges while using Big Data for development purposes. These challenges can be broadly grouped into 3 categories.
This is represented in figure 2.3.
• Operational/systemic challenges
  o Privacy: respect of the principle of purpose specification; limiting the amount of data collected and stored; obtaining valid consent from data subjects; whether or not the data will be distributed to third parties; giving individuals appropriate access to the data collected about them and to information and decisions made about them
  o Changes in decision-making process
  o Administrative barriers
• Data challenges
  o Accessibility
  o Availability
  o Reliability
• Analytical challenges
  o Dissonance between perceptions and facts
  o Data interpretation

FIGURE 2.3: MAJOR CHALLENGES CONFRONTING BIG DATA FOR DEVELOPMENT (SOURCE: UNDP PERSPECTIVE (NOVEMBER 2014))

2.1.5 BIG DATA – USA

The Office of Science & Technology Policy (OSTP) of the USA has been spearheading the use of Big Data across Federal Government establishments. Below are highlights of ongoing Federal government programs that address the challenges of, and tap the opportunities afforded by, the big data revolution to advance agency missions and further scientific discovery and innovation.

The Office of Science:
• The Office of Advanced Scientific Computing Research (ASCR) provides leadership to the data management, visualization and data analytics communities, including digital preservation and community access.
• The High Performance Storage System (HPSS) is software that manages petabytes of data on disks and robotic tape systems.
• Mathematics for Analysis of Petascale Data addresses the mathematical challenges of extracting insights from huge scientific datasets, finding key features and understanding the relationships between those features.
• The Next Generation Networking program supports tools that enable research collaborations to find, move and use large data: from the Globus Middleware Project in 2001, to the GridFTP data transfer protocol in 2003, to the Earth Systems Grid (ESG) in 2007.

The Office of Basic Energy Sciences (BES):
• BES Scientific User Facilities have supported a number of efforts aimed at assisting users with data management and analysis of big data, which can be as big as terabytes of data per day from a single experiment.
• The Atmospheric Radiation Measurement (ARM) Climate Research Facility of the Biological and Environmental Research Program (BER) is a multi-platform scientific user facility that provides the international research community with infrastructure for obtaining precise observations of key atmospheric phenomena needed for the advancement of atmospheric process understanding and climate models.
• The Systems Biology Knowledgebase (Kbase) is a community-driven software framework enabling data-driven predictions of microbial, plant and biological community function in an environmental context.

The Office of Fusion Energy Sciences (FES):
• The Scientific Discovery through Advanced Computing (SciDAC) partnership between FES and the Office of Advanced Scientific Computing Research (ASCR) addresses big data challenges associated with computational and experimental research in fusion energy science.

The Office of High Energy Physics (HEP):
• The Computational High Energy Physics Program supports research for the analysis of large, complex experimental data sets as well as large volumes of simulated data, an undertaking that typically requires a global effort by hundreds of scientists.
The Office of Nuclear Physics (NP):
• The US Nuclear Data Program (USNDP) is a multisite effort involving seven national labs and two universities that maintains and provides access to extensive, dedicated databases spanning several areas of nuclear physics, which compile and cross-check all relevant experimental results on important properties of nuclei.

The Office of Scientific and Technical Information (OSTI):
• OSTI, the only U.S. federal agency member of DataCite (a global consortium of leading scientific and technical information organizations), plays a key role in shaping the policies and technical implementations of the practice of data citation. Data citation enables efficient reuse and verification of data, so that the impact of data may be tracked and a scholarly structure that recognizes and rewards data producers may be established.

Health and Human Services (HHS):
• Centers for Disease Control and Prevention (CDC): BioSense 2.0 is the first system to take into account the feasibility of regional and national coordination for public health situation awareness through an interoperable network of systems, built on existing state and local capabilities.
• Networked phylogenomics for bacteria and outbreak ID: CDC’s Special Bacteriology Reference Laboratory (SBRL) identifies and classifies unknown bacterial pathogens for effective, rapid outbreak detection.
• Centers for Medicare & Medicaid Services (CMS): A data warehouse based on Hadoop is being developed to support analytic and reporting requirements from Medicare and Medicaid programs.

National Institute of General Medical Sciences:
• The Models of Infectious Disease Agent Study (MIDAS) is an effort to develop computational and analytical approaches for integrating infectious disease information rapidly and providing modelling results to policy makers at the local, state, national, and global levels.
While data need to be collected and integrated globally, public health policies are implemented locally, so information must also be fine-grained, with needs for data access, management, analysis and archiving.

2.1.6 BIG DATA – AUSTRALIA (Accenture’s 2014 Australia Survey Results)

This survey was designed to understand perceptions of and experience with big data. Some of the important findings are given in figures 2.4 to 2.6.

FIGURE 2.4: AUSTRALIAN ORGANIZATIONS LAG IN THE USE OF MANY DATA SOURCES
Source: Accenture Big Success with Big Data Survey, April 2014

FIGURE 2.5: AUSTRALIAN ORGANIZATIONS HOWEVER LEAD IN THE USE OF SOME DATA SOURCES
Source: Accenture Big Success with Big Data Survey, April 2014

FIGURE 2.6: ORGANIZATIONS USING BIG DATA TO IMPROVE THE CUSTOMER EXPERIENCE
Source: Accenture Big Success with Big Data Survey, April 2014

Australian Government Service Scenario

The data held by Australian Government agencies has been recognized as a government and national asset. Departments can now ask questions that were previously unanswerable because the data was not available or the processing methods were not feasible. It is expected that big data analytics will be used to streamline service delivery, create opportunities for innovation, and identify new service and policy approaches, as well as support the effective delivery of existing programs across a broad range of government operations.

To facilitate improved delivery, a “Better Practice Guide” has been developed; its salient feature is its coverage of the applications of Big Data. The guide presents a consolidated view of how analytics is applied in big data projects, which often fall into the domains of scientific, economic and social research: customer/client segmentation and marketing research, campaign management, behavioral economics initiatives, enhancing the service delivery experience and efficiency, intelligence discovery, fraud detection and risk scoring.
Figure 2.7 shows the identified areas.

FIGURE 2.7: CATEGORIES OF BUSINESS PROCESSES THAT CAN BENEFIT FROM BIG DATA PROJECTS (BigInsights Submission, www.biginsights.com.)

2.1.7 ECONOMIST INTELLIGENCE UNIT (EIU): WHO’S BIG ON BIG DATA?

In September 2014, The Economist Intelligence Unit (EIU) carried out a global survey of 395 C-level executives with sponsorship from Platfora. The findings are summarized below:

Finding 1: Executives’ attitudes towards big data are overwhelmingly positive (See chart 1 and chart 2 as given in figure 2.8 and 2.9 respectively)
FIGURE 2.8: VIEW OF THE FUTURE OF BIG DATA
FIGURE 2.9: ATTITUDE TOWARDS BIG DATA

Finding 2: Executives agree on the need for big-data solutions and want to know more (See Chart 3 as given in figure 2.10)
FIGURE 2.10: PERSONAL KNOWLEDGE OF BIG DATA

Finding 3: Customer processes currently stand out as candidates for big-data analytics (See Chart 4 as given in figure 2.11)
FIGURE 2.11: PRIORITY APPLICATION OF BIG DATA

Finding 4: Lack of understanding about how to use big data stands in the way of implementation (See Chart 5 as given in figure 2.12)
FIGURE 2.12: INTERNAL OBSTACLES IN USE OF BIG DATA

Finding 5: Implementation is also held back by lack of agreement about the value of big data (See Chart 6 as given in figure 2.13)
FIGURE 2.13: CEO’S VIEW OF BIG DATA

Finding 6: Optimal value from big data comes from the creation of enterprise-wide big-data teams (See Chart 7 as given in figure 2.14)
FIGURE 2.14: STRATEGIES FOR OBTAINING OPTIMUM VALUE FROM BIG DATA TOOLS

Finding 7: Specialized technical skills are needed to optimize use of big data, but in a supportive role (See Chart 8 as given in figure 2.15)
FIGURE 2.15: HOW THE ORGANIZATION ADDRESSES HUMAN ASPECT OF BIG DATA

2.1.8 IDC WORLDWIDE BIG DATA AND ANALYTICS PREDICTIONS FOR 2015

Some of the important predictions from the IDC FutureScape for Big Data and Analytics are given below.
• Visual data discovery tools will be growing 2.5 times faster than the rest of the business intelligence (BI) market. By 2018.
• Over the next five years, spending on cloud-based Big Data and analytics (BDA) solutions will grow three times faster than spending on on-premise solutions.
• Shortage of skilled staff will persist. In the U.S. alone there will be 181,000 deep analytics roles in 2018 and five times that many positions requiring related skills in data management and interpretation.
• By 2017, unified data platform architecture will become the foundation of BDA strategy.
• Growth in applications incorporating advanced and predictive analytics.
• Adoption of technology to continuously analyze streams of events will accelerate in 2015 as it is applied to Internet of Things (IoT) analytics, which is expected to grow at a five-year compound annual growth rate (CAGR) of 30%.

3. DATA SCIENCE RESEARCH & DEVELOPMENT

3.1 DATA SCIENCE – RESEARCH CHALLENGES

3.1.1 SCIENCE & TECHNOLOGY – CHALLENGES

Some of the S&T challenges that researchers across the globe and in India are facing relate to the data deluge in:
• Astrophysics
• Materials Science
• Earth & atmospheric observations
• Energy
• Fundamental Science
• Computational Biology, Bioinformatics & Medicine
• Engineering & Technology, GIS and Remote Sensing
• Cognitive science
• Statistical data

These challenges require the development of advanced algorithms, visualization techniques, data streaming methodologies and analytics.
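As one illustration of what a data streaming methodology can look like in practice, the sketch below computes a running mean and variance in a single pass over the data, using Welford's online algorithm, so that the full dataset never needs to be held in memory. It is an illustrative sketch only and not a method prescribed by this report; the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Single-pass (streaming) mean and variance using Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the updated mean, which keeps the recurrence numerically stable.
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (NaN) for fewer than two observations."""
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)


# Hypothetical sensor readings arriving one at a time.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
```

Each observation is processed once and then discarded, so memory use stays constant however long the stream runs, which is the property that makes such methods suitable for sensor and event data.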
The overall constraints that the community is facing are:
• The IT Challenge: Storage and computational power
• The computer science: Algorithm design, visualization, scalability (Machine Learning, network & Graph analysis, streaming of data and text mining), distributed data, architectures, data dimension reduction and implementation
• The mathematical science: Statistics, Optimization, uncertainty quantification, model development (statistical, Ab Initio, simulation), analysis and systems theory
• The multi-disciplinary approach: Contextual problem solving

3.1.2 CHALLENGES IN ACHIEVING ACTIONABLE INSIGHT WITH DATA & ANALYTICS

Big data technologies are maturing to a point at which more organizations are prepared to pilot and adopt big data as a core component of the information management and analytics infrastructure. Big data, as a compendium of emerging disruptive tools and technologies, is positioned as the next great step in enabling integrated analytics in many common business scenarios. As big data wends its inextricable way into the enterprise, information technology (IT) practitioners and business sponsors alike will bump up against a number of challenges that must be addressed before any big data program can be successful. Some of those challenges are:

Uncertainty of the Data Management Landscape – There are many competing technologies, and within each technical area there are numerous rivals. Our first challenge is making the best choices while not introducing additional unknowns and risk to big data adoption.

The Big Data Talent Gap – The excitement around big data applications seems to imply that there is a broad community of experts available to help in implementation. However, this is not yet the case, and the talent gap poses our second challenge.
Getting Data into the Big Data Platform – The scale and variety of data to be absorbed into a big data environment can overwhelm the unprepared data practitioner, making data accessibility and integration our third challenge.

Locating data and software tools: Investigators need straightforward means of knowing what datasets and software tools are available and where to obtain them, along with descriptions of each dataset or tool. Ideally, this would include all published and resource datasets and software tools, both basic and clinical, and, to the extent possible, even unpublished or proprietary data and software.

Synchronization Across the Data Sources – As more data sets from diverse sources are incorporated into an analytical platform, the potential for time lags to impact data currency and consistency becomes our fourth challenge.

Getting Useful Information out of the Big Data Platform – Using big data for different purposes, ranging from storage augmentation to enabling high-performance analytics, is impeded if the information cannot be adequately provisioned back within the other components of the enterprise information architecture, making big data syndication another challenge.

Standardizing data and metadata: Investigators need data to be in standard formats to facilitate interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Extending policies and practices for data and software sharing: While significant progress has been made, broad and rapid sharing of data and software is not yet the norm in all areas of biomedical research. Establishing effective data- and software-sharing practices requires appropriate policies, changes in the research culture, recognition of the contributions made by data and software generators, and technical innovations.
Validation of software to ensure quality, reproducibility, provenance, and interoperability is essential.

Developing new methods for analyzing Big Data: The size, complexity, and multidimensional nature of many datasets make data analysis extremely challenging. Substantial research is needed to develop new methods and software tools for analyzing such large, complex, and multidimensional datasets. User-friendly data workflow platforms and visualization tools are also needed to facilitate the analysis of Big Data.

Focusing on knowledge to advance the business agenda: Faced with structured and unstructured data (historic, current and predictive), users are not able to take a 360-degree view of the extraordinary volume of data available. They are therefore unable to extract what they need or to discover what they don’t yet know they need.

Overcoming internal obstacles: An information-centric organization needs an information-driven mindset, from the top down. That means employees must be managed, measured and compensated based on how well they use data to make decisions and drive business outcomes. New ways of running the information-centric businesses of tomorrow will require new organizational models. Formal change management efforts will be needed to create a high-performance culture prepared for the organizational implications of new skills, capabilities and infrastructure.

Training researchers to analyze biomedical Big Data effectively and to design tools for doing so: The challenges of biomedical Big Data are multifaceted. Advances in biomedical sciences using Big Data will require more scientists with the appropriate data science expertise and skills, including those in quantitative science areas such as computational biology, biomedical informatics, biostatistics, and related areas. Users of Big Data software tools and resources must be trained to use them well.
Meeting the need for speed: In a hypercompetitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly. Visualization helps organizations perform analyses and make decisions much more rapidly, but the challenge is going through the sheer volumes of data and accessing the level of detail needed, all at high speed. The challenge only grows as the degree of granularity increases.

Addressing data quality: Even if you can find and analyze data quickly and put it in the proper context for the audience that will be consuming the information, the value of data for decision-making purposes will be jeopardized if the data is not accurate or timely. This is a challenge with any data analysis, but it becomes even more pronounced given the volumes of information involved in big data projects.

3.1.3 BIG DATA SECURITY AND PRIVACY CHALLENGES

KPMG has identified five key security and privacy challenges organizations must address to help ensure proper control of their Big Data program:

Big Data governance: The implementation of Big Data initiatives may lead to the creation or discovery of previously secret or sensitive information through the combination of different data sets. Organizations that attempt to implement Big Data initiatives without a strong governance regime in place risk placing themselves in ethical dilemmas without set processes or guidelines to follow. Therefore, a strong ethical code, along with process, training, people, and metrics, is imperative to govern what organizations can do within a Big Data program.

Maintaining original privacy and security requirements (original intent) of data throughout the information life cycle: Data that is collected and used for Big Data will likely be correlated with other data sets that may ultimately create new data sets or alter the original data in different, often unforeseen ways.
Organizations must make sure that all security and privacy requirements that are applied to their original data sets are tracked and maintained across Big Data processes throughout the information life cycle, from data collection to disclosure or retention/destruction.

Re-identification risk: Data that has been processed, enhanced, or changed by Big Data programs may have benefits both internal and external to the organization. Often, the data must be anonymized to protect the privacy of the original data source, such as customers or vendors. Data that is not properly anonymized prior to external release (or in some cases, internal as well) may result in the compromise of data privacy as the data is combined with previously collected, complex data sets including geo-location, image recognition, and behavioral tracking. If data is simply de-identified, possible correlation between data subjects contained within separate data sets must be evaluated, as third parties with access to several data sets may be able to re-identify otherwise anonymous individuals.

Third parties – usage and honoring contractual obligations: Matching data sets from other organizations may unlock insights using Big Data that an organization could not uncover with its data alone. It may also pose significant risk, as the security and privacy data protections in place at the third-party organization may not be adequate. Prior to sharing data with third parties, organizations must evaluate their relevant practices and decide whether they are satisfactory.

Interpreting current regulations and anticipating future regulations: As noted, the United States and the EU do not have laws or regulations specific to Big Data; however, there are existing laws restricting the collection, use, and storage of specific personal information types, including financial, health, and children’s data.
To keep current with quickly changing and newly implemented laws, companies must perform an initial inventory of applicable laws and update this inventory on a regular basis.

3.2. MODELS FOR RESEARCH

3.2.1 TARGETED RESEARCH PROGRAM: To provide support to large-scale, multidisciplinary projects that demonstrate potential to expand on existing strengths or develop new innovative research related to strategic areas of emphasis for the DST’s BDI. Targeted research proposals must have significant institutional and department support.

3.2.2 BRIDGE FUNDING PROGRAM: Bridge funding will provide financial support for existing research programs for which external funding sources have been expended. The funding needs to support the continuation of the operations of a lab or program in order to avoid ending the program while external support is being reviewed or pursued. Bridge funding must be used in a strategic and coordinated way to maintain project/lab momentum while assuring effective use of limited resources.

3.2.3 SEED FUNDING PROGRAM: Seed funding will afford an opportunity for special projects with the aim of fostering the engagement of multidisciplinary teams to establish linkages towards attaining extramural funding. This program will be open to all disciplines, with prioritization given to projects that have the potential to position the researcher or research team to be competitive for external funding or to bring high impact to DST’s BDI through the proposed work.

3.2.4 OTHER MODELS
Apart from the above core initiatives, a few more models are suggested in table 3.1.

TABLE 3.1: MODELS FOR RESEARCH

S. N. 1 – MAJOR MODEL FOR RESEARCH: Research Projects
MINOR MODELS, WITH PURPOSE & BRIEF DESCRIPTION:
  o Research Project Grants: To provide support to an institution on behalf of a principal investigator for a project proposed by the investigator.
  o Small Grants: To provide limited research support, usually for preliminary, short-term projects; two years maximum; non-renewable.
  o Conference Grants: To provide funding for conferences to exchange and disseminate information related to program interests.
  o Exploratory/Developmental Grant: To encourage new research in a given program area; preliminary data not generally required.
  o Resource-Related Research Projects: To support research projects to enhance capacity of resources that serve biomedical research.
  o Education Projects: To support the development or implementation of a program in education, information, training, technical assistance, coordination, or evaluation.
  o Field Trial Planning Grant: To support initial development of a field/clinical trial, e.g., establishing a research team, developing tools for managing data and overseeing the research, and developing a trial design, protocol, recruitment strategies, and procedure manuals.
  o Small Business Technology Transfer Grants: To support collaborative research by a small business with a research institution on a project intended for commercialization in two phases: the Phase I grant is used to establish the technical merit and feasibility of the research concept; the Phase II grant supports further research leading to a product or service.
  o Small Business Innovation Research Grants: To support small business research on a project intended for commercialization in two phases: the Phase I grant is used to establish the technical merit and feasibility of the research concept; the Phase II grant supports further research leading to a product or service.
  o High Priority, Short Term Project Award: To provide interim, non-renewable research support for up to one year to highly meritorious applications.

S. N. 2 – MAJOR MODEL FOR RESEARCH: Fellowship Awards
MINOR MODELS, WITH PURPOSE & BRIEF DESCRIPTION:
  o Research Service Awards: For Individual Postdoctoral Fellows; to provide individual fellowships to postdoctoral trainees.
  o Senior Research Service Awards: For Senior Fellows; to provide opportunities for experienced scientists to make major changes in the direction of their research careers, broaden their scientific background, or acquire new research capabilities.

S. N. 3 – MAJOR MODEL FOR RESEARCH: Career Development Awards
MINOR MODELS, WITH PURPOSE & BRIEF DESCRIPTION:
  o Mentored Research Development Award: Career development in a new area of BDA research – 1 year.
  o Independent Research Development Award: Develop the career of the funded researcher – 1 year.

3.3. CENTRE OF EXCELLENCE (CoE) FOR DATA SCIENCE

Centres of excellence are emerging as a vital strategic asset to serve as the primary vehicle for managing complex change initiatives. Centres of excellence exist to bring central focus to many business issues, for example, data integration, project management, enterprise architecture, business and IT optimization, and enterprise-wide access to information.

What is a Centre of Excellence?
Forrester Research, Inc. defines a Centre of Excellence as “A formally appointed and documented body of knowledge and experience on a particular subject area with the goals of providing expertise, managing governance practices, and supporting projects associated with the subject area.”

Why a Centre of Excellence for DATA SCIENCE/BDA is necessary:
A Centre of Excellence will maximize the quality, efficiency and application of analytics across all lines of business, resulting in greater confidence and consistency in decision-making. It will lead to a higher success rate for business analytics deployments, delivering more value at less cost and in less time. The CoE drives end-user adoption, leading to a smoother path to improved outcomes, and provides a formal organizational structure, enabling the organization (DST) to strike the right balance between agility and sound management in deploying analytics technologies. It will also eliminate the gap between Business and IT, improving time-to-market and responsiveness to change.
3.3.1 CoE VALUE PROPOSITION: A Centre of Excellence has the following major propositions, each with its constituents; see table 3.2.

TABLE 3.2: CoE VALUE PROPOSITION

S. N. 1 and 2 – MAJOR PROPOSITIONS: Governance & Practices; Purpose Drive Disruption
CONSTITUENTS: Thought Leadership, Quality, Customer Loyalty, Increased Revenues, Metrics & KPIs, Innovation, Integration, Collaboration, Alignment, Growth

S. N. 3 and 4 – MAJOR PROPOSITIONS: Framework & Reusable Artifacts; Knowledge Dissemination
CONSTITUENTS: Rapid Solutions, Efficiency, Reduced Cycle, Reduced Cost, High Performance, Enablement, Competency, Employee Loyalty, Expert System, Just in Time Advice

3.3.2 DATA SCIENCE CENTRE OF EXCELLENCE COVERAGE

Any good CoE should provide the following:
• Establishing Competitive Advantage
• Discovery of real, viable use cases
• Identification of situations needing and not needing Big Data
• Resource Management
• Collaboration
• Awareness
• Requirements Gathering
• Best Practices
• Building a Steering Committee
• Enterprise Architecture
• Big Data Maturity Models
• Damage Control when projects go off track

BDA CoE Scope: A comprehensive BDA CoE is broadly scoped to include the services, functions, tools and metrics to ensure the organization invests in the most valuable projects, and then delivers the expected business benefits from project outcomes. The BDA CoE Function Chart given in figure 3.1 provides a summary of typical BDA CoE responsibilities.

FIGURE 3.1: BDA CoE FUNCTION CHART (SOURCE: www.batimes.com/.../the-ba-practice-lead-handbook-5-getting-organized, dated 19/05/13)
[Figure content: BDA Centre of Excellence, comprising BDA Standards, BDA Development, BDA Services and BDA Full Cycle Governance]

3.3.3 THE DATA SCIENCE CoE MISSION AND OBJECTIVES: The objectives are met through training, consulting and mentoring business analysts and project team members, by providing BDA resources to the project teams, by facilitating the portfolio management process, and by serving as the custodian of BA best practices.
The strategic BDA CoE generally performs all or a subset of the following services:
• BA Standards – provides standard business analysis practices
  o Methods
  o Knowledge Management
  o Continuous Improvement
• BDA Development – provides professional development for business analysts
  o BDA Career Path
  o Coaching and Mentoring
  o Training and Professional Development
  o Team Building
• BA Services – serves as a group of facilitators and on-the-job trainers who are skilled and accomplished business analysts to provide business analysis consulting support including:
  o Conducting market research, benchmark, and feasibility studies
  o Developing and maintaining the business architecture
  o Preparing and monitoring the business case
  o Eliciting, analyzing, specifying, documenting, validating, and managing requirements
  o Managing requirements verification and validation activities, for example, the user acceptance test
  o Preparing the organization for deployment of a new business solution
  o Providing resources to augment project teams to perform business analysis activities that are under-resourced or urgent
• Full cycle Governance – promotes a full life-cycle governance process, managing investments in business solutions from research and development to operations; provides a home (funding and resources) for pre-project business analysis and business case development
  o Business Program Management
  o Strategic Project Resources
  o Portfolio Management
  o Enterprise Analysis
  o Benefits Management

3.3.4 KPIs FOR MEASURING THE SUCCESS OF A CoE: Some of the KPIs given below may be deployed to measure the success of the CoEs established:
• Higher project success rate
• Reduced costs for professional services, management overhead and TCO
• Reduced gap between Business and IT, improving time to market and responsiveness to change
• More unified collaboration between departments and regions
• Increased ROI and clearer identification of competitive advantage
• Greater confidence and consistency in decision-making
• Higher success rate for business analytics deployments
• The right balance between agility and sound management in deploying analytics solutions

3.3.5 SOME SUGGESTIONS FOR CREATING CENTRES OF EXCELLENCE:
• Centre for Causal Modelling and Discovery
• Centre for Predictive Computational Phenotyping
• Centre for Mobility Data Integration to Insight
• Centre for Computational Knowledge Engine
• Centre for Big Data in Translational Genomics
• Centre for Patient-Centred Information Commons
• Centre for Mobile Sensor Data-to-Knowledge
• Centre for Expanded Data Annotation and Retrieval
• Centre for Big Data for Discovery Science

3.4. BIG DATA IN SCIENCE AND TECHNOLOGY R&D

3.4.1 BIG DATA: R & D PERSPECTIVE

In the Big Data research context, so-called analytics over Big Data play a leading role. Analytics cover a wide family of problems, mainly arising in the context of Database, Data Warehousing and Data Mining research. Analytics research is intended to develop complex procedures that run over large-scale, enormous data repositories with the objective of extracting useful knowledge hidden in such repositories. One of the most significant application scenarios where Big Data arise is, without doubt, scientific computing. Here, scientists and researchers produce huge amounts of data per day via experiments (e.g., in disciplines like high-energy physics, astronomy, biology, bio-medicine, and so forth). But extracting useful knowledge for decision-making purposes from these massive, large-scale data repositories is almost impossible for current DBMS-inspired analysis tools.

From a methodological point of view, there are also research challenges.
A new methodology is required for transforming Big Data stored in heterogeneous data sources of differing nature (e.g., legacy systems, the Web, scientific data repositories, sensor and stream databases, social networks) into a structured, and hence well-interpretable, format for target data analytics. As a consequence, data-driven approaches in biology, medicine, public policy, social sciences, and humanities can replace the traditional hypothesis-driven research in science.

The research problems linked to the discovery of new insights from big data belong to a novel and rapidly expanding research domain: machine learning. At the boundary of statistics, computer science and emerging applications in industry, this research domain focuses on the development of fast and efficient algorithms for processing data, with the main goal of delivering accurate predictions of various kinds. To name only a few applications, think of business cases such as product recommendation, segmentation of customers, fraud detection or churn prevention. Machine learning techniques can address such applications using a set of generic methods that differ from more traditional statistical techniques. The emphasis is on real-time and highly scalable predictive analytics, using fully automatic and generic methods that simplify most of the problems of data analytics.

At the user layer, visualization and interactive exploration are important problems for Big Data. A novel class of visualization metaphors, methodologies and solutions must be devised in order to cope with the emerging challenges posed by the visualization of Big Data; real-time visualization of extracted core data, visualization of mashed-up data, and effective visualization on mobile devices are interesting problems.
Coupled with visualization, interactive exploration is a critical challenge in Big Data research; in fact, very large data sets are difficult to explore while extracting useful knowledge. Strategies need to address issues such as conceptual navigation, concept drift, interaction metaphors, and so forth.

Environmental monitoring has become reliant upon automated sensors for data acquisition. This results in the generation of large, high-dimensional data streams (‘Big Data’) that personnel must search through to identify data structures. Nature-inspired computation, including artificial neural networks (ANNs), helps uncover complex, recurring patterns within sizable data volumes. This has applications in agriculture, weather monitoring, epidemiological study, traffic planning, pollution monitoring, and ecological and natural resource management.

The world has become a much more dangerous place. The existence of private organizations willing to kill randomly to further their view of the world, and then to hide among innocents to avoid being attacked in turn, presents a challenge that was not present a few decades ago. Surveillance involving all forms of data, from monitoring of media reports, Twitter streams and videos to social sensing, requires processing huge volumes of constantly changing and uncertain data to derive the desired intelligence. This is a significant research challenge. We need equivalent advances in technology to prevent terrorist mayhem from proliferating. Data mining on big data can be used to detect bad actors.

3.4.2 CODATA RECOMMENDATIONS FOR SCIENTIFIC PROGRAMS
The international scientific community has a responsibility to examine all the opportunities to use Big Data for knowledge discovery that will benefit society and the sustainability of the planet.
Scientific research and discovery present particularly significant challenges and notable opportunities for transdisciplinary, international research programs. The challenges and opportunities of Big Data have significant implications for scientific data services and infrastructure providers. The Workshop on Big Data for International Scientific Programs, held jointly with CODATA, has made the following recommendations to address the challenges and take advantage of the opportunities of the Big Data age:
• Respond to the importance of Big Data for international scientific programs
• Exploit the benefits of Big Data for society
• Improve understanding of Big Data through international collaboration
• Promote universal access to Big Data through global research infrastructures
• Explore and address the challenges of Big Data stewardship
• Encourage capacity building and skills development
• Foster development of policies to maximize exploitation of Big Data

Proposed Actions for a CODATA Working Group:
• Establish a CODATA Working Group on Big Data for Scientific Programs
• Produce case studies in Big Data for international scientific programs
• Promote sharing of Big Data solutions across scientific disciplines
• Research policy, ethical and legal issues for Big Data
• Research stewardship and sustainability challenges for Big Data

3.4.3 EUROPEAN RESEARCH AGENDA FOR BIG DATA ANALYTICS
The vision of Big Data Analytics in Europe is based on the fair use of big data with the development of associated policies and standards, as well as on empowering citizens, whose digital traces are recorded in the data. It is expected to provide a data and knowledge infrastructure that serves citizens, scientists, institutions and businesses through:
• Access to data and knowledge services, and
• Access to analytical services and results,
within a framework of policies for access and sharing based on the values of privacy, trust, individual empowerment and public good.
This means the fulfilment of several requirements at different levels, some of which are given below:
• Scientific and technological challenges, such as laying new foundations for Big Data Analytics that integrate knowledge discovery from Big Data with statistical modelling and complex systems science
• Semantic data integration and enrichment technologies
• Scalable, distributed, streaming Big Data Analytics technologies
• Data requirements, such as who owns and uses personal data, the real value of such data, and how to make it possible to access and link the different data sources
• Education and data literacy
• Promotional initiatives for data analytics and BDA-as-a-service
• Effective ways of promoting and helping the development of Big Data Analytics

3.4.4 BIG DATA IN GENOMICS
Life Sciences have been highly affected by the generation of large data sets, specifically by overloads of omics information (genomes, transcriptomes, epigenomes and other omics data from cells, tissues and organisms). Next-Generation Sequencing (NGS) platforms that use semiconductors or nanotechnology have exponentially increased the rate of biological data generation in the last two years. The steadily decreasing costs have enabled the generation of information at the petabyte scale. However, there is a lack of the computational infrastructure needed to securely generate, maintain, transfer, and analyze large-scale information in life sciences and to integrate omics data with other data sets, such as clinical data from patients (mainly from Electronic Medical Records or EMRs).

Genomics
Personal genomics is a key enabler for predictive medicine, where a patient’s genetic profile can be used to determine the most appropriate medical treatment. Projects such as ENCODE have produced piles of data, illustrating how Big Data is becoming integral to scientific research.
Indeed, science today is increasingly “social”, especially in fields such as genomics in which huge amounts of data are generated.

There is a felt need to store the data and information generated by big projects, and computational solutions such as cloud-based computing have emerged. Cloud computing is the only storage model that can provide the elastic scale needed for DNA sequencing. Many companies are using cloud solutions from different providers; however, challenges remain, such as the security and privacy of personal medical and scientific data. Some companies, though, offer solutions (table 3.3).

TABLE 3.3: EXAMPLES OF COMPANIES & INSTITUTIONS PROVIDING SOLUTIONS TO GENERATE, ANALYZE & VISUALIZE OMICS & CLINICAL DATA (SOURCE: Big Data in Genomics: Challenges and Solutions)
Company / Institution | Type of Solution | Website
Appistry | Appistry's high-performance big data platform combines self-organizing computational storage with optimized and distributed high-performance computing to provide secure, HIPAA-compliant, accurate, on-demand analysis of omics data in association with clinical information | www.appistry.com
BGI | Beijing Genomics Institute (BGI)'s solution serves as a solid foundation for large-scale bioinformatics processing. The BGI computing platform is an integrated service composed of versatile software and powerful hardware applied to life sciences | www.genomics.cn/en
CLC Bio | CLC Bio has a bioinformatics platform where both desktop and server software are integrated and optimized for best performance. CLC Bio utilizes proprietary algorithms, based on published methods, to accelerate data calculations and achieve remarkable improvements in big data analytics | www.clcbio.com
DNAnexus | DNAnexus provides solutions for NGS by using cloud computing infrastructure with scalable systems and advanced bioinformatics in a web-based platform to solve data management and the challenges in analysis that are common in unified systems | www.dnanexus.com
Genome International Corporation (GIC) | Genome International Corporation (GIC) is a research-driven company that provides innovative bioinformatics products and custom research solutions for corporate, government, and academic laboratories in life sciences | www.genome.com
GNS Healthcare | GNS Healthcare is a big data analytics company that has developed a scalable approach to deal with big data solutions that could be applied across the healthcare industry | www.gnshealthcare.com
Foundation Medicine | Foundation Medicine is a molecular information company at the forefront of bringing comprehensive cancer genomic analysis to routine clinical care. Foundation Medicine is pioneering the development of a comprehensive cancer diagnostic test combining omics data, clinical information and big data analytics applied to cancer research | www.foundationmedicine.com
Knome | Knome analyzes whole genome data using software-based tests to simultaneously examine and compare many genes, gene networks, and genomes, as well as to integrate other forms of molecular and non-molecular data. Knome provides a platform and tools to help researchers and doctors develop next-generation, software-based tests and make clinical decisions | www.knome.com
NextBio | NextBio's big data technology enables users to systematically integrate and interpret public and proprietary molecular data and clinical information from individual patients, population studies and model organisms, applying genomic data in useful ways in both scientific and medical research | www.nextbio.com

In time to come, channels to deal with increasing amounts of genomics data will be needed to store, transfer, analyze, visualize, and generate “short” reports for researchers and clinicians. It is possible that the genomics industry could be helped by cloud computing, which will transform medicine and life sciences.

Data-driven Science and Medicine:
In all likelihood, the complexity of the data generated in scientific projects will only increase as we continue to isolate and sequence individual cells and organisms while lowering the costs to generate and analyze this data, such that hundreds of millions of samples can be profiled.

In the future, the big genome centres will require high-performance computational environments for integrating all the data generated. The integration between hardware and software infrastructures tailored to deal with big data in life sciences will become more common in the years to come. Data-driven medicine will enable the discovery of new treatment options based on multi-modal molecular measurements on patients and on learning from the trends in differential diagnosis, prognosis and prescription side-effects in clinical databases. Moreover, the combination of omics data with clinical information from patients will enable new scientific knowledge that could be applied in clinics to help in patient care. Considering all possible scenarios, the role of big data will be very significant in both scientific inquiry and patient care.

Major Challenges
• Big Data generation and acquisition will create challenges for the storage, transfer and security of information.
• The second challenge will be to transfer data from one location to another.
• The third major challenge will be posed by the security and privacy of data from individuals.
3.4.5 BIG DATA AND REMOTE SENSING
Remote sensing researchers have long been using remote sensing data to address localized science questions, such as assessing the amount of developed versus undeveloped land in a particular metropolitan area, or quantifying timber resources in a given forested area. Subsequently, as software and hardware capabilities for processing large volumes of imagery became more accessible, and image availability also increased, remote sensing correspondingly expanded to encompass regional and global scales, such as estimating vegetation biomass covering the Earth’s land surfaces, or measuring the sea surface temperatures of our oceans. With today’s processing capacity, this has been extended yet further to include investigations of large-scale dynamic processes, such as assessing global ecosystem shifts resulting from climate change, or improving the modeling of weather patterns and storm events around the world.

There is a logical progression as research and applications keep pace with greater data availability and ongoing improvements in processing tools. But the field of remote sensing, and its associated data, is continuing to grow. What else can remote sensing tell us and how else can this immense volume of data be used? Are there relationships yet to be exploited that can be used to indicate consumer behavior and habits in certain markets? Are there geospatial patterns in population expansion that can be used to better predict future development and resource utilization? The answers to these and many other similar questions can suitably be provided by using big data in remote sensing.

3.4.6 ISACA: RISKS AND CONCERNS WITH BIG DATA
The process of big data analytics involves analyzing the collected data to find patterns and correlations that may not be initially apparent, but may be useful in making business decisions.
These data, including personal data, are useful from a marketing perspective in understanding the likes and dislikes of potential buyers and in analyzing and predicting their buying behaviour. Personal data can be categorized as:
• Volunteered data: created and explicitly shared by individuals (e.g., social network profiles)
• Observed data: captured by recording the actions of individuals (e.g., location data when using cell phones)
• Inferred data: data about individuals based on analysis of volunteered or observed information (e.g., credit scores)

Risks and Concerns with Big Data
While big data can supply a competitive advantage and other benefits, it also carries significant risk. Enterprises that have huge amounts of structured and unstructured data available should be asking:
• Where should we store the data?
• How are we going to protect the data?
• How are we going to utilize the data safely and lawfully?

As security policies and procedures are still developing in many areas, big data risk management is evolving. Though the need to manage data risk within the enterprise may not be clearly communicated and understood at all management levels, it is essential to point out that addressing big data risks and concerns cannot be seen exclusively as an information technology exercise. The entire enterprise, including legal, finance, compliance, internal audit and other business departments, needs to get involved in data risk. It is essential to understand that some data should be considered “toxic” in the sense that loss of control over these data could be damaging to the enterprise. Examples of potentially “toxic” data are:
• Private or custodial information, such as credit card numbers
• Strategic information, such as intellectual property, business plans and product designs
• Information such as key performance indicators, sales figures and financial metrics
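
The three personal-data categories and the notion of “toxic” data described above can be illustrated with a small tagging scheme for a data inventory. The following Python sketch is illustrative only: the class names (`PersonalDataCategory`, `ToxicKind`, `DataField`), the `PERFORMANCE` label, the example field names and the category assigned to each field are assumptions made for this example and are not taken from ISACA guidance.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PersonalDataCategory(Enum):
    """The three categories of personal data listed in the text."""
    VOLUNTEERED = "volunteered"  # created and explicitly shared by individuals
    OBSERVED = "observed"        # captured by recording individuals' actions
    INFERRED = "inferred"        # derived by analysing volunteered/observed data


class ToxicKind(Enum):
    """Kinds of potentially 'toxic' data; the label names are hypothetical."""
    PRIVATE_OR_CUSTODIAL = "private or custodial"
    STRATEGIC = "strategic"
    PERFORMANCE = "performance"  # KPIs, sales figures, financial metrics


@dataclass(frozen=True)
class DataField:
    """One entry in a (hypothetical) enterprise data inventory."""
    name: str
    personal_category: Optional[PersonalDataCategory] = None
    toxic_kind: Optional[ToxicKind] = None

    @property
    def is_toxic(self) -> bool:
        # A field counts as toxic as soon as it carries any ToxicKind tag.
        return self.toxic_kind is not None


def toxic_fields(fields: List[DataField]) -> List[str]:
    """Return the names of all fields tagged as toxic, in inventory order."""
    return [f.name for f in fields if f.is_toxic]


# Hypothetical inventory; every field name and tag here is an assumption.
INVENTORY = [
    DataField("social_profile", PersonalDataCategory.VOLUNTEERED),
    DataField("phone_location", PersonalDataCategory.OBSERVED),
    DataField("credit_score", PersonalDataCategory.INFERRED),
    DataField("credit_card_number", PersonalDataCategory.VOLUNTEERED,
              ToxicKind.PRIVATE_OR_CUSTODIAL),
    DataField("product_design", toxic_kind=ToxicKind.STRATEGIC),
    DataField("quarterly_sales", toxic_kind=ToxicKind.PERFORMANCE),
]

if __name__ == "__main__":
    print(toxic_fields(INVENTORY))
```

Keeping the personal-data category and the toxic tag as two independent attributes reflects the fact that the two classifications in the text are separate: a field can be personal without being toxic, toxic without being personal, or both.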
Enterprises that rely on personal data that are generated or that can be modified by the public have to be extra careful. Social media data can be a highly valuable source for assessing customer sentiment, tracking the effectiveness of marketing campaigns and learning more about consumers. Dealing with this kind of data will require addressing current uncertainties and points of tension:
• Privacy: Individual needs for privacy vary.
• Global governance: There is a lack of global legal interoperability.
• Personal data ownership: The concept of property rights is not easily extended to data, creating challenges in establishing usage rights.
• Transparency: Too much transparency too soon presents as much of a risk to destabilizing the personal data ecosystem as too little transparency.
• Value distribution: Even before value can be shared more equitably, more clarity is required on what truly constitutes value for each stakeholder.

Strategies for Addressing Big Data Risk
The main strategy for addressing risk is aligning the technology solution to business needs. The COBIT 5 framework addresses this in the goals cascade by aligning stakeholder drivers and stakeholder needs. ISACA has identified seven enablers that should be applied to assist the enterprise in addressing risk and improving its ability to meet its business objectives and create value for its stakeholders. It further defines the dimensions of Data Quality. Goals of information are divided into three sub-dimensions of quality; see table 3.4 below.
TABLE 3.4: DATA QUALITY SUB DIMENSIONS (SOURCE: ISACA)
Intrinsic Quality
• Accuracy
• Objectivity
• Believability
• Reputation
Contextual and Representational Quality
• Relevancy
• Completeness
• Currency
• Appropriate amount of information
• Concise representation
• Consistent representation
• Interpretability
• Understandability
• Ease of manipulation
Security/Accessibility Quality
• Availability/timeliness
• Restricted access

Governance for Big Data
Governance ensures that stakeholders’ needs, conditions and options are evaluated to determine balanced, agreed-on enterprise objectives to be achieved. The scope of an enterprise’s governance, risk and compliance would most likely be expanded to create a unified system that consolidates silos and business functions to enable access to all the data. The end-to-end governance approach that is at the foundation of COBIT 5 is depicted in figure 3.2 below, which shows the key components of a governance system.

Assurance Considerations for Big Data
Controls around big data can be grouped into four categories:
• Approach and understanding: This category addresses demonstrating the right tone at the top of the enterprise. A critical facet in this effort is the establishment and implementation of a data policy.
• Data Quality: Controls should be established and implemented across the data flow to assess data against the accuracy, reliability, completeness and timeliness criteria defined in the data policy and associated standards.
• Confidentiality and privacy: Through the data risk management process, all sensitive data should be identified and appropriate controls put in place. The nature of the sensitive information could vary from personal information to competitive secrets.
• Availability: Reliable (i.e., tested) disaster recovery arrangements should be in place to ensure that data are available in accordance with the recovery point objective (RPO) and recovery time objective (RTO) criteria defined in a business impact analysis.

FIGURE 3.2: GOVERNANCE OBJECTIVE: VALUE CREATION

4. DATA SCIENCE APPLICATIONS

4.1. BIG DATA - DIGITAL INDIA, MAKE-IN-INDIA

4.1.1 BIG DATA IMPACT ON INDIA:
Fast data networks and less expensive smartphones will drive the adoption of many new services and set new expectations regarding customer experience. Four major technologies and services (Big Data analytics, Internet of Things, Mobile Financial Services, and Network Functions Virtualization) will impact the country:
• Big Data and Analytics will both be a big boost for the KPO businesses in India
• Big Data & Analytics to trigger jobs growth for India
• Digital India powered by the Big Data from Smart Cities
• The mobile explosion and Big Data Analytics

4.1.2 THE DIGITAL INDIA IMPETUS
Ernst & Young: Ernst & Young says there is a lot of reason for the IT industry to be positive about 2015. This view seems to be based on the following:
• The Government has estimated an investment of US$ 26 billion in technology for 2014-15 for digitization, infrastructural improvements, and a push for manufacturing and technology in healthcare and agriculture.
• The Indian government’s Digital India program, which aims to transform India into a digitally-empowered society and knowledge economy, will bring forth many opportunities for a large number of IT industry players to develop platforms providing government services and information to people in all parts of the country. Security and data accessibility solutions will see increased demand from the government.
• The development of 100 smart cities, under the ‘Smart Cities’ GoI initiative, will require companies to build consortiums to bag these projects.
This will drive investments at all layers of ICT infrastructure, benefitting companies which are into technology consulting, telecommunications, networks, hardware infrastructure, managed services and systems integration.

The Government of India, through its “Make in India” initiative, is increasing its focus on this sector, and aims to transform it from a consumption-driven market to one with the manufacturing capability to meet local and export-related demand. Several incentives are being offered by the government, including financial assistance in setting up electronics manufacturing clusters, capital subsidies to Electronic System Design & Manufacturing (ESDM) units, approval for setting up semiconductor fabrication units, and the setting up of a US$2 billion Electronics Development Fund to fund selected projects.

IDC: According to research agency IDC, IT spend in Indian manufacturing will double by 2016. IDC’s Manufacturing Insights predicts that India's manufacturing IT spending will grow to $8,781.8 million by 2016, double the manufacturing IT spending of 2011, representing a CAGR of 14.5% between 2012 and 2016. The sector with the highest IT spend in Indian manufacturing in 2012 was automotive, followed by chemicals and consumer products.

4.1.3 MAKE IN INDIA:
The government’s ‘Make In India’ campaign aims at spurring manufacturing-led growth, with more focus on the ease of doing business than on an incentive-linked investment climate. The push for manufacturing has two aims: to create jobs and to lift growth. According to the India Electronics and Semiconductor Association (IESA), an industry body, the electronic system design and manufacturing (ESDM) industry will benefit from the government’s Make in India campaign and is projected to see investment proposals worth Rs 10,000 crore over the next two years.
The Internet of Things and big data go hand in hand. With access to more information and the ability to rapidly analyze it, manufacturers will be able to develop new tools that improve quality, increase throughput, and reduce machine failure and downtime, and so achieve a leading competitive advantage.

Tata Group chairman: Cyrus Mistry says, "Today, emerging technologies in the digital and physical space are transforming business at a pace never seen before. We must deepen our understanding in several areas such as digitization and big data analytics and develop an innovation and technology roadmap to effectively serve evolving customer needs." He further commented that "In India, recent policy measures and the strategic direction defined by the new government, especially its ambitious 'Make in India' campaign, hold the promise of re-igniting growth in the years to come."

4.2. GOVERNMENT & BIG DATA

4.2.1 BDA APPLICATIONS IN GOVERNMENT SECTOR:
There are many ways in which ‘Big Data – Business Analytics’ can be leveraged by the Central and the State Governments to drive growth and change and to implement various policies and government schemes. Some of the prominent areas are:
• AADHAAR: As the majority of citizens in the country (more than 60 crore at the last count) have been provided with an AADHAAR number, the governments can use this facility to plan, implement and monitor their citizen-related initiatives.
• Direct Benefit Transfer Scheme: The Governments can decide the funding for various schemes, ensure that the money reaches the beneficiaries, and keep track of the improvement and growth within a scheme and in any particular region where people benefit from it.
• Impact of Election and Voting system: Governments can analyze this big data to make policies and schemes based on those statistics, which will help the people of the country as well as the growth of the country.
• Impact and conditions of Infrastructure Projects: Analysis of the large amount of data periodically collected can help the governments in preserving critical infrastructure all over the country.
• Impact of Education: Analysis of the large amount of data periodically collected about the delivery, outputs, outcomes and impact of the education initiatives at primary, secondary and tertiary level can be useful in formulating education policies.
• Impact of Health care initiatives: Analysis of the large amount of data periodically collected about the delivery, outputs, outcomes and impact of the healthcare initiatives at primary, secondary and tertiary level can be useful in formulating healthcare policies.
• Business Analytics for Tax Administration: The Central as well as State Governments are involved in multiple tax regimes, at corporate as well as individual level. The country's income tax-payer base itself is about 3 crore, and the number has been inching up slowly for the last 5-10 years; the government would like to see it growing at a faster pace. The governments are always looking for efficient ways and means of ‘Improving Tax Administration’. This is possible by analyzing the huge amounts of data available on various parameters typical of the tax regime, such as spending patterns and the interstate movement of goods.
• Crowd sourcing platform mygov.in: Already, the Prime Minister’s Office is using Big Data Analytics to process citizens’ ideas and sentiments through the crowd sourcing platform mygov.in and is implementing an attendance system for India’s Central Government employees through attendance.gov.in.
Similarly, the state Government of Telangana is employing Big Data Analytics for the data collected from nearly 3.5 crore people across strata.

4.2.2 POSSIBLE BENEFITS OF BIG DATA FOR STATE AND LOCAL GOVERNMENTS:
Big Data technologies allow groups to play out scenarios under controlled circumstances, customize what-if planning for different organizations, support data-backed decision-making, identify correlations and trends in underlying data, and more. By laying the foundation for effective use of Big Data, state and local government agencies can:
• Make better decisions more quickly
• Improve mission outcomes
• Identify and reduce inefficiencies
• Eliminate waste, fraud and abuse
• Improve productivity of their resources
• Boost ROI and cut total cost of ownership (TCO)
• Enhance transparency and service delivery
• Reduce security threats (both information and physical) and crime

4.2.3 SOME MORE AREAS OF BDA APPLICATIONS IN GOVERNMENT:
• Public Services Data
• Social Services Data
• Economic Data
• Public Safety
• Public Health Care Issues
• Education & Training
• Civic Infrastructure
• Employment Opportunities
• Sports & Recreation
• Various Taxes/Revenue Related Data
• Transportation
• Public Distribution System
• Census Data
• Crime Prevention and Prediction
• Tourism Data
• Environmental Data
• Locational/Geographic Data pertaining to social, economic and other aspects of citizens/business/government/employees

4.3. BIG DATA ANALYTICS TRENDS

4.3.1 TRANSPARENCY MARKET RESEARCH REPORT
“Big Data Market - Global Scenario, Trends, Industry Analysis, Size, Share and Forecast, 2012 – 2018”, the market intelligence report by Transparency Market Research, sheds significant light on the various market elements, the factors that drive and hinder growth, and the booming regional markets of the global big data market.
The report estimates that the global big data market, valued at an estimated $6.3 billion in 2012, will reach $48.3 billion by 2018, growing at a CAGR of 40.5% over the forecast period of the report, i.e. between 2012 and 2018. In terms of revenue, the current leader of this market is North America, which, according to the report, will maintain its leading rank and hold a share of about 54.5% of the global big data market during the forecast period. It could be followed by Europe.

4.3.2 BIG DATA PREDICTIONS OF 2015
Big Data goes mainstream: 2015 will see Big Data management become more mainstream. In many ways, we are still in the infancy of Big Data, but the consistent growth is becoming unstoppable.

Everything goes up in the cloud: One of the problems encountered by businesses trying to manage Big Data was the complex technology involved. Cloud solutions are already starting to offer a way forward, and 2015 will likely see more steps in this direction.

People-based marketing drives digital marketing: To date, Big Data-driven marketing has been fueled by cookie data. Cookies, an invention from when the desktop drove the Web, are no longer the most important data source. Led by Facebook, Google, and other major players, people-based marketing will command a premium in digital marketing in 2015 and soon become the standard.

Big Data is called just 'data': The terminology may well change as the technology becomes standard operating procedure. If managing and analyzing vast quantities of data becomes the industry standard, the word "big" could become unnecessary and tautological.

The time between collection and results will be shorter: Collecting data from consumers has value only if it translates to improved business outcomes, and 2015 should see a rise in more rapid ROI results.

Pervasive personalization emerges: The Internet of Things will lead to a tsunami.
As the Internet of Things begins to get traction (everything from Fitbits to iWatches to Nests), with sensors becoming ubiquitous, personalized marketing communication will be everywhere.

In-Memory Databases: In-memory databases give companies the freedom to access, analyze and act on data much more quickly than regular databases. This in turn means that decisions can either be made more quickly, as data can be analyzed faster, or be better informed, as more data can be analyzed in the same amount of time.

Non-Data Scientists: We are likely to see more automated platforms that allow employees who may not have as much skill with data as others to collect, analyze and make decisions based on this data. This could be anything from simple-to-use interfaces with a more complex backend to simpler tasks that could create business results.

More Sensor Driven Data: The Internet of Things is evolving and more companies are using it, but it may well hit its tipping point in 2015. This would be sensor-to-sensor data being collected, collated and analyzed through purely sensor-based collection.

Deeper Customer Insight: Despite the fact that transactional data is still more plentiful than sensor data, 2015 may be the year that we see it being truly looked at in multi-dimensional ways to create even deeper customer insight.

HR Analytics: Once thought of as the definition of making your employees ‘just a number’, HR Analytics is being shown to have significant benefits to both the company and its employees.

No Ownership In Just One Department: Data will become a commodity that is not just kept in one department alone and used purely by senior company leaders.

The Internet of Things: There are many implications that come with IoT-enabled devices generating massive streams of data. For instance, imagine equipping a whole workforce with IoT-enabled devices. That could generate terabytes of data every day!
However, with existing data warehousing technologies and big data tools, data originating from IoT technology has the potential to create value in many industries.

Big Data Security: Given the magnitude of last year's hacks, tightening cyber security will be a top priority for organizations in 2015 and onwards. At the same time, big data insights can themselves be used to help increase security: big data analytics has the potential to complement existing security methods.

Faster Growth in the Big Data Market: According to Gartner, 85% of Fortune 500 companies aren't yet prepared to take advantage of Big Data, so as big data tools become more widespread (and cheaper), companies will scramble to become more data-driven in order to stay competitive.

4.3.3 INTERESTING TRENDS TO WATCH

Facial recognition and geospatial monitoring: Data from inexpensive cameras and cell phones is now widely available to train machine learning systems. Expect to see plenty of innovation in this field.

Citizen backlash: Between government monitoring, data breaches, and well-intentioned commercial efforts that cross the "creepy" line, people are starting to realize just how much can be learned about them from the data they unintentionally produce. It may not be long before we see public demands for enforceable accountability on those who collect or disseminate personal data.

Analytics driving the physical world: Technology that controls physical activities (think of the Google self-driving car, or even the Nest thermostat) has received a significant amount of media attention. Many consumers seem eager for these analytics-enabled capabilities today. In the rush to serve consumer appetites, it will be important for businesses to plan thoroughly for the potential consequences, good and bad, of these capabilities.

4.4. SOCIAL MEDIA AND BIG DATA ANALYTICS

4.4.1 SOCIAL MEDIA

A business can tap into Big Data and use the information to improve planning, deliver targeted campaigns, take full advantage of omni-channel marketing, and optimize social media and interactions in real time. Big Data means that networks can know their users better than ever, and it allows those with knowledge of a user base to target users better by looking at their interests, location and search history. Facebook can help a brand up to a point, but the completion of that journey must be managed by the organizations themselves: data and insights can track the right person down, but they cannot complete the sale.

An increasing number of tools are available to help businesses track data from social media more accurately. As Big Data grows, so do the availability and functionality of the technology available to deal with it. Some solutions undertake a quantitative analysis of social media that includes sentiment, some allow tracking of social media channels, and others are designed to track the effectiveness of campaigns against specific targets. Search terms, keywords and brand names can be used to identify the content that references them, and this content can be further analyzed using analytics tools and software.

Businesses should use social listening tools to capture and analyze huge amounts of data. The keyword strategy is the most important element in finding the content that matters to the business. Smart search terms in a listening tool can help surface relevant data about your products and your competitors.

4.4.2 REASONS TO EXPLORE BIG DATA WITH SOCIAL MEDIA ANALYTICS

Reason# 1 Social Media Analytics and Volume. Many factors contribute to the growing volume of social media data to explore. Unstructured data is streaming in, and more sensor and machine-to-machine data is being collected.
Proper use of Social Media Analytics can help create significant value from the relevant data.

Reason# 2 Social Media Analytics and Velocity. Data in social media streams at exceptional speed and must be dealt with in a timely manner. This is worth exploring in social media analytics, as it is one of the great challenges for many organizations.

Reason# 3 Social Media Analytics and Variety. Data in social media comes in all types of formats: structured numeric data in traditional databases, information generated from line-of-business applications, unstructured text documents, email, video, audio, stock ticker data and financial transactions.

Reason# 4 Social Media Analytics and Variability. Social media data flows can be highly unpredictable, with periodic peaks. Data loads driven by what is trending in social media, mixed with unstructured data, are even more challenging to manage yet interesting to explore.

Reason# 5 Social Media Analytics and Complexity. Data in social media comes from numerous sources. Linking, matching, connecting and correlating relationships, hierarchies and multiple data linkages across these sources is a great challenge. Data this complex can spiral out of control if it is not managed properly.

It is absolutely essential to discover more about exploring Big Data with social media analytics. If you are serious about optimizing your website, then you should learn the best ways to design and develop a website.

4.4.3 SOCIAL MEDIA ANALYTICS IN INDIA & TOOLS:

Social media analytics is still at a very nascent stage. Both consumers and providers have so far been perplexed about how best to consume the gold mine of social chatter online. Service providers have so far preferred the product route, mostly centred on social media monitoring and real-time evaluation.
For enterprises and end consumers, the transition from speaking on social media to listening is only now approaching. Here is a list of important social analytics companies in India that are making the most difference in the whole ecosystem. The list is in no particular order and may be updated as the field evolves and new startups enter this area.

Important Social Media Analytics Companies in India:

Simplify360: Simplify360 is a leading Social Business Intelligence company. It facilitates Social Customer Service, Online Reputation Management, Real Time Market Intelligence and Social Media Performance Analytics, and is mainly used by enterprises (CMO's and CIO's offices), BPO companies and digital media agencies.

Germin8: Germin8 is a Big Data analytics company headquartered in Mumbai. It focuses on building products for analysing social media data and textual data available within organizations, to help them make better decisions based on insights drawn from that data. Explic8™ is Germin8's stakeholder insights and engagement platform. It collects and analyses conversations in real time from public sources such as social media and news and from private sources such as emails and chats, and converts them into industry-specific actionable insights and leads. The results are presented as live interactive dashboards that generate actionable insights for departments such as Marketing, Customer Care, Corporate Communications and Sales. The product was launched in 2012 and is currently used by over 100 brands, directly or through partner agencies.

BlueOcean: Blueocean Market Intelligence offers a comprehensive, end-to-end social intelligence solution that addresses business challenges and helps organizations gain a much-needed advantage.
It enables organizations to monitor, track and measure social media effectiveness on various channels, and to monitor the ROI of social media initiatives. Its digital scientist team has moved away from traditional KPI measurements on social media and deployed innovative, enhanced measurements to measure true ROI and better understand the social landscape. The company delivers through the integration of traditional research techniques, measurements and contextual mapping to provide comprehensive insights that solve intriguing business problems.

Frrole: Frrole is a social intelligence startup that mines precise content and deep insights from Twitter data. Its media- and brand-focused offering, built on top of this intelligence, allows customers to identify in real time what people, influencers and cities are talking about, and to integrate that content and the derived insights directly into their products. While most social analytics/intelligence products provide results based on statistics and the first level of NLP, Frrole goes two levels deeper, building semantic context for each topic and tying it to information available in general and historical data sets.

Unmetric: For Fortune 500 companies, agencies and other large global brands that seek to engage more meaningfully with their target audiences, Unmetric provides an online platform for understanding how well their own and their competitors' content, campaigns and top-line metrics perform in social media. Unmetric combines human cognition and technology to track and analyze the online behavior of 18,000 brands, segmented across 30 sectors, on all major social networks. Unlike listening or publishing services, Unmetric is a seeing platform, providing global brands with data to analyze, benchmark and enhance their social media efforts.

Infinite analytics: Infinite Analytics is a big data and social data analytics company.
Its flagship product, SocialGenomix, uses a consumer's social data, along with NLP, machine learning, semantic technologies and predictive analytics, to predict consumer behavior, personalize user experience and provide actionable insights to e-retailers.

Thoughtbuzz: ThoughtBuzz is a social media intelligence company. Its web-based tool helps companies monitor and track online conversations. ThoughtBuzz offers a full-feature analytics service with unlimited access to billions of social media conversations, as well as features such as automated sentiment analysis and geo-demographics. It is suited to in-depth research, historical analysis, and the preparation of value-added reports. It goes beyond what companies offer today by using real-time information. Other features such as key themes, demographics and topic intentions are also available.

Konnectsocial: Konnect Social is a search engine for forums, blogs, news and social media. The company states that understanding the unique needs and requirements of its users has always been paramount, and that it makes every effort to listen to its customers and stay on top of industry trends. It describes its experts as experienced and versatile enough to provide best-in-class solutions for any kind of business, and its technology solutions as a way to increase the effectiveness of a customer's current IT infrastructure.

Abzooba: Abzooba's social media monitoring, analysis and analytics platform uses technologies such as natural language processing, domain-specific ontologies and machine-learning-based classification over Big Data to provide organizations with actionable intelligence and insights in real time.

4.4.4 RECENT EXAMPLES OF USE OF SOCIAL MEDIA ANALYTICS IN INDIA:

The Prime Minister's Office is using Big Data techniques to process ideas thrown up by citizens on its crowdsourcing platform mygov.in, place them in the context of the popular mood as reflected in trends on social media, and generate actionable reports for ministries and departments to consider and implement.

Elections in India are a classic Big Data problem, and the 2014 general elections were the biggest of them all. While technology may be able to process this enormous volume of data, how can all this information be consumed and understood by a billion people, and in real time as it happens? The Indian general elections also have another perspective that is often overlooked. Consider the facts: 300 parties, 8,000 candidates, 800 million voters, and 1 million booths served and secured by about 20 million officials. The mix is further enriched by a variety of structured and unstructured information: candidate histories, crime records, declared assets and audacious election manifestos. Added to this is the frenetic activity on the day of results: live streaming of results, with about 21,000 votes to be counted per second, from all corners of a country spanning an area of about 1 million square miles. With plans in place and trial runs completed, the visualization dashboard went live on the morning of 16th May, the counting day. Would Gramener technology stand the ultimate performance test on this D-Day?

4.5. OPEN SOURCE SOFTWARE FOR BIG DATA ANALYTICS

4.5.1 OPEN SOURCE SOFTWARE

Consider the role of the modern data scientist. Unlike a pure statistician, a data scientist is also expected to write code and understand business. Data science is a multi-disciplinary practice requiring a broad range of knowledge and insight. It's not unusual for a data scientist to explore a fresh set of data in the morning, create a model before lunch, run a series of analytics in the afternoon and brief a team of digital marketers before heading home at night. In addition to possessing a wide range of practical knowledge, a data scientist must also be agile and flexible.
Today's swiftly changing markets require lightning-fast reflexes: companies must be capable of assessing new data and responding in the space of a heartbeat to unexpected shifts in commerce, across all industry verticals and economic sectors. The speed of modern business plays to the strengths of data science and open source programming.

In the past, business moved relatively slowly and large-scale market trends were fairly predictable. As a result, most companies were quite comfortable relying on proprietary (closed source) software to analyze data. The downside of proprietary software, however, is that it cannot be quickly modified or updated to handle unexpected circumstances or disruptions of existing business models. Until recently, it was common practice for traditional vendors to release updated versions of critical proprietary software only quarterly or annually. Open source software can be modified or rewritten in days or hours, making it an ideal choice for real-time analytics.

Moreover, the open source movement is democratizing data science. In the past, you needed special training on a proprietary system and years of experience to become a valuable member of a business or research team. Thanks to a wider choice of open source tools, more people can begin contributing valuable insight and analysis from the start.

It's hard to overstate the influence of open source software on the spectacular rise of data science. Open source isn't just an interesting aspect of the data science revolution; it's absolutely critical. However, the following key points must be considered when evaluating open source software:

• Open source software is not free.
• Be a bit wary of experts, even well-intentioned ones.
• Think of open source software as a platform and not a product.
• There are no guarantees.
• Is your organization a good fit for open source software?
4.5.2 SOME OF THE BEST OPEN SOURCE TOOLS FOR BIG DATA ANALYTICS:

o Apache Sqoop: Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import.
o Apache Giraph: Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilises Apache Hadoop's MapReduce implementation to process graphs.
o Apache Hama: Apache Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations, e.g. matrix, graph and network algorithms.
o Cloudera Impala: Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
o Apache Drill: Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.
o Neo4j: Neo4j is an open-source graph database, implemented in Java. The developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables".
o Couchbase Server: Couchbase Server, originally known as Membase, is an open source, distributed (shared-nothing architecture) NoSQL document-oriented database that is optimised for interactive applications. These applications must service many concurrent users creating, storing, retrieving, aggregating, manipulating and presenting data.
o SciDB: SciDB is an array database designed for multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications.
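The incremental load described for Sqoop above (importing only the rows added since the last import) can be illustrated with a small sketch. This is not Sqoop itself: it uses Python's built-in sqlite3 module as a stand-in for both the source database and the target store, and the table names (orders, orders_copy), the columns and the function name incremental_import are hypothetical, chosen only for illustration. The idea shown is simply to remember the highest key already copied and to fetch only rows above it on the next run.

```python
import sqlite3


def incremental_import(source, target, last_value):
    """Copy rows whose id is greater than last_value.

    Returns the new high-water mark to pass to the next run.
    Table and column names are hypothetical.
    """
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    target.executemany(
        "INSERT INTO orders_copy (id, amount) VALUES (?, ?)", rows
    )
    target.commit()
    # If nothing new arrived, keep the previous high-water mark.
    return rows[-1][0] if rows else last_value


# Stand-in source database with two existing rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# Stand-in target store.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL)")

# First run copies both existing rows.
last_value = incremental_import(source, target, last_value=0)

# A new row arrives; the second run copies only that row.
source.execute("INSERT INTO orders VALUES (3, 7.25)")
last_value = incremental_import(source, target, last_value)

copied = target.execute("SELECT COUNT(*) FROM orders_copy").fetchone()[0]
```

Keeping the high-water mark outside the copy routine is what makes a saved job repeatable: each run needs only the value returned by the previous one.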
4.5.3 SOME OTHER OPEN SOURCE TOOLS FOR BIG DATA

Big Data Analysis Platforms and Tools
• Hadoop • MapReduce • GridGain • HPCC • Storm

Databases/Data Warehouses
• Cassandra • HBase • MongoDB • Neo4j • CouchDB • OrientDB • Terrastore • FlockDB • Hibari • Riak • Hypertable • BigData • Hive • InfoBright Community Edition • Infinispan • Redis

Business Intelligence
• Talend • Jaspersoft • Palo BI Suite/Jedox • Pentaho • SpagoBI • KNIME • BIRT/Actuate

Data Mining
• RapidMiner/RapidAnalytics • Mahout • Orange • Weka • jHepWork • KEEL • SPMF • Rattle

File Systems
• Gluster • Hadoop Distributed File System

Programming Languages
• Pig/Pig Latin • R • ECL

Big Data Search
• Lucene • Solr

Data Aggregation and Transfer
• Sqoop • Flume • Chukwa

Miscellaneous Big Data Tools
• Terracotta • Avro • Oozie • Zookeeper

4.6 AMOUNT OF BIG DATA IN INDIA

4.6.1 AUTHENTIC DATA

The authenticated inventory of public domain data sets is available at data.gov.in. It has more than 300 catalogs, and these catalogs contain more than 13,500 data sets on a variety of subjects. These sets belong to the Open Data category. Open data is data that can be freely used, reused and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike. The most important characteristics of open data are:

Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.

Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution, including the intermixing with other datasets.

Universal Participation: everyone must be able to use, reuse and redistribute; there should be no discrimination against fields of endeavour or against persons or groups. For example, 'non-commercial' restrictions that would prevent 'commercial' use, or restrictions of use for certain purposes (e.g.
only in education), are not allowed.

Following is the inventory list of the data sets available at data.gov.in:

Central Publication Metrics (SOURCE: data.gov.in)

MINISTRY/DEPARTMENT | RESOURCE (DATASET) | TOTAL CATALOGS
Ministry of Home Affairs | 3843 | 244
  Department of Home | 3592 | 109
    Registrar General and Census Commissioner, India | 3582 | 106
  Department of States | 234 | 118
    National Crime Records Bureau (NCRB) | 234 | 118
Ministry of Agriculture | 2350 | 368
  Department of Agriculture and Co-operation | 2272 | 312
    Directorate of Marketing and Inspection (DMI) | 2267 | 310
    Directorate of Economics and Statistics (DES) | 5 | 2
  Department of Animal Husbandry, Dairying and Fisheries | 51 | 31
  Department of Agricultural Research and Education (DARE) | 27 | 25
    Indian Council of Agricultural Research (ICAR) | 27 | 25
Planning Commission | 1560 | 776
  Unique Identification Authority of India (UIDAI) | 4 | 4
Ministry of Statistics and Programme Implementation | 1432 | 345
Ministry of Water Resources | 1060 | 557
Ministry of Health and Family Welfare | 868 | 141
  Department of Health and Family Welfare | 854 | 127
  Department of AIDS Control | 10 | 10
  Department of Ayurveda, Yoga and Naturopathy, Unani, Siddha and Homoeopathy (AYUSH) | 4 | 4
Ministry of Road Transport and Highways | 476 | 133
Ministry of Power | 475 | 5
  Central Electricity Authority | 475 | 5
Rajya Sabha | 248 | 154
Ministry of Human Resource Development | 217 | 64
  Department of Higher Education | 197 | 59
  Department of School Education and Literacy | 47 | 7
    National Council of Educational Research and Training (NCERT) | 47 | 7
Ministry of Finance | 175 | 124
  Department of Economic Affairs | 118 | 105
  Department of Financial Services | 43 | 9
  Department of Revenue | 9 | 5
    Financial Intelligence Unit - India | 9 | 5
  Department of Disinvestment | 5 | 5
Ministry of Commerce and Industry | 142 | 17
  Department of Commerce | 83 | 9
    Directorate General of Foreign Trade (DGFT) | 80 | 6
  Department of Industrial Policy and Promotion | 59 | 8
    Office of the Economic Adviser | 29 | 2
Ministry of Environment and Forests | 127 | 10
  Central Pollution Control Board | 127 | 10
Ministry of Science and Technology | 112 | 86
  Department of Science and Technology (DST) | 66 | 44
    National Science and Technology Management Information System (NSTMIS) | 66 | 41
    NSDI India GEO Portal, National Spatial Data Infrastructure (NSDI) | 0 | 3
  Department of Biotechnology, Government of India | 46 | 42
    National Institute of Biomedical Genomics (NIBMG), Kalyani | 6 | 6
    Regional Centre for Biotechnology (RCB), Gurgaon | 5 | 5
    Rajiv Gandhi Centre for Biotechnology (RGCB) | 4 | 4
    Institute of Bioresources and Sustainable Development (IBSD) | 4 | 4
    Bio Processing Unit (BPU), Mohali | 2 | 2
    National Institute of Animal Biotechnology (NIAB), Hyderabad | 2 | 2
    National Institute of Immunology (NII) | 1 | 1
    National Institute of Plant Genome Research (NIPGR), New Delhi | 1 | 1
    National Agri-Food Biotechnology Institute (NABI), Mohali | 1 | 1
    National Centre for Cell Sciences (NCCS) | 1 | 1
Lok Sabha Secretariat | 112 | 100
Ministry of Chemicals and Fertilizers | 104 | 14
  Department of Fertilizers | 70 | 8
  Department of Chemicals and Petrochemicals | 34 | 6
Ministry of Corporate Affairs | 39 | 26
Ministry of Micro, Small and Medium Enterprises | 33 | 14
Ministry of Mines | 33 | 26
  Indian Bureau of Mines | 21 | 13
  Geological Survey of India | 12 | 13
Ministry of Defence | 29 | 23
  Department of Defence Research and Development | 29 | 23
    Defence Research and Development Organisation (DRDO) | 29 | 23
Ministry of Petroleum and Natural Gas | 27 | 26
Ministry of New and Renewable Energy | 26 | 12
Ministry of Drinking Water and Sanitation (MDWS) | 24 | 16
Ministry of Communications and Information Technology | 18 | 18
  Department of Electronics and Information Technology (DeitY) | 9 | 8
  Department of Posts | 8 | 9
  Department of Telecommunications (DOT) | 1 | 1
Comptroller And Auditor General of India (CAG) | 16 | 16
Ministry of Earth Sciences | 8 | 10
  India Meteorological Department (IMD) | 8 | 8
  Indian National Centre for Ocean Information Services (INCOIS) | 0 | 2
Ministry of Information and Broadcasting | 7 | 7
Ministry of Tourism | 3 | 3
Ministry of Panchayati Raj | 3 | 3
Ministry of Civil Aviation | 3 | 3
Ministry of Rural Development | 2 | 2
  Department of Land Resources (DLR) | 2 | 2
Department of Atomic Energy | 1 | 1
Ministry of Development of North Eastern Region | 1 | 1
Department of Space | 0 | 15
  Indian Space Research Organization | 0 | 15
    National Remote Sensing Centre | 0 | 15
Total | 13574 | 3360

5. ENTREPRENEURSHIP DEVELOPMENT & START UPS

5.1. ENTREPRENEURSHIP DEVELOPMENT

5.1.1 WORLD WIDE - BIG DATA VENDOR REVENUE AND MARKET FORECAST

Wikibon forecasts that the total Big Data market may exceed $47 billion by 2017. That translates to a 31% compound annual growth rate over the five-year period 2012-2017. The growth of Big Data revenue so far may be due to a number of factors, including:

• An increased awareness of the benefits of Big Data as applied to industries beyond the Web, most notably financial services, pharmaceuticals, and retail;
• The maturation of Big Data software such as Hadoop, NoSQL data stores, in-memory analytic engines, and massively parallel processing analytic databases;
• Increasingly sophisticated professional services practices that assist enterprises in practically applying Big Data hardware and software to business use cases;
• Increased investment in Big Data infrastructure by massive Web properties (most notably Google, Facebook, and Amazon) and by government agencies for intelligence and counterterrorism purposes.

The Big Data market is still in the early adopter phase and is poised for significant growth. For the Big Data market to reach its full potential, enterprises and vendors must overcome several obstacles. While a detailed discussion of these obstacles is outside the purview of this report, they are worth noting. They include:

• The well-publicized lack of analytic specialists and Data Scientists.
• Lack of understanding among enterprises of how to organize Big Data staff to best identify business requirements for Big Data projects.
• Organizational resistance to adopting Big Data analytics-driven decision-making.
• Vendor marketing overly focused on "speeds-and-feeds," product features and "Big Data-washing" rather than laying out a vision for Big Data in the enterprise.
• Development of Big Data platforms and tools by vendors that eschew open frameworks in favor of closed, locked-down solutions.
• Lack of best practices and related technologies for managing Big Data as a corporate asset.
• Dearth of Big Data application development tools and services.

Top 10 Vendors

As part of its market-sizing efforts, Wikibon tracked and/or modeled the 2012 Big Data revenue of more than 60 vendors. This list includes both Big Data pure-plays (vendors that derive all or nearly all of their revenue from the sale of Big Data products and services) and vendors for whom Big Data sales are just one of multiple revenue streams. A partial list including the top 10 vendors is given in Table 5.1 below.

TABLE 5.1: 2012 WORLDWIDE BIG DATA REVENUE BY TOP 10 VENDORS ($US MILLIONS) - WIKIBON

S. N. | Vendor | Big Data Revenue | Total Revenue | Big Data Revenue as % of Total Revenue | % Big Data Hardware Revenue | % Big Data Software Revenue | % Big Data Services Revenue
1 | IBM | $1,252 | $103,930 | 1% | 19% | 31% | 50%
2 | HP | $664 | $119,895 | 1% | 34% | 29% | 38%
3 | Teradata | $435 | $2,665 | 16% | 31% | 28% | 41%
4 | Dell | $425 | $59,878 | 1% | 83% | 0% | 17%
5 | Oracle | $415 | $39,463 | 1% | 25% | 34% | 41%
6 | SAP | $368 | $21,707 | 2% | 0% | 67% | 33%
7 | EMC | $336 | $23,570 | 1% | 24% | 36% | 39%
8 | Cisco Systems | $214 | $47,983 | 0% | 58% | 0% | 42%
9 | PwC | $199 | $31,500 | 1% | 0% | 0% | 100%
10 | Microsoft | $196 | $71,474 | 0% | 0% | 67% | 33%

Wikibon's Big Data market forecast, broken down by market component through 2017 in billion US$, is given in Table 5.2 below.
TABLE 5.2: BIG DATA MARKET FORECAST BROKEN DOWN BY MARKET COMPONENT THROUGH 2017 IN BILLION US$ - WIKIBON

MARKET COMPONENT | 2014 | 2015 | 2016 | 2017 (year-wise revenue forecast in billion US$)
Big Data XaaS | 1.78 | 2.52 | 2.97 | 3.31
Big Data Professional Services | 10.62 | 14.15 | 16.17 | 17.59
Big Data Application – Analytic and Transactional | 3.47 | 5.29 | 6.48 | 7.38
Big Data NoSQL | 0.50 | 0.79 | 0.98 | 1.12
Big Data SQL | 1.72 | 2.14 | 2.36 | 2.51
Big Data Infrastructure Software | 0.64 | 0.88 | 1.03 | 1.14
Big Data Networking | 0.56 | 0.75 | 0.86 | 0.93
Big Data Storage | 4.20 | 5.59 | 6.39 | 6.95
Big Data Compute | 4.89 | 6.26 | 7.01 | 7.53
TOTAL BIG DATA REVENUE | 28.38 | 38.37 | 44.25 | 48.46

5.2 DEVELOPMENT OF THE BA INDUSTRY IN INDIA

Analytics Industry - Key to Growth of India: Imagine a situation where someone is browsing the men's shoes section at Pantaloons and is about to buy a pair, and then receives a message from Indiatimes: "The same shoe is being offered with 25% discount, just login here". A scanner reads the shoe data, the customer's Pantaloons card is linked to his mobile, and his mobile is linked to Indiatimes. Indiatimes and Pantaloons are doing joint marketing. This is a win-win situation for everybody, and it is only possible with the help of analytics. Analytics is therefore no longer a luxury for an organization but a hygiene factor. Consider the following:

• Size of the Indian analytics market: $375 million
• No. of companies operating in this segment in India: more than 500
• Expected Indian analytics market by 2015: $1.15 billion, as per a Business Standard report.

The chart given in Figure 5.1 below gives further information about the various types of analytics applications and the classification of the analytics industry in various business segments.

How Big Is Big Data In India?

• We are living in the age of information overload. A huge amount of data is constantly being generated around us.
Increasingly, automation is being adopted, which in turn generates greater amounts of data. The challenge today for enterprises as well as small and medium businesses (SMBs) is manifold. Indian SMBs and enterprises are sitting on a gold mine of information. Making sense of these huge data sets has become imperative. In these circumstances, big data analytics has become one of the more talked-about topics in India.
• Big data has tremendous potential in India. With social media usage on the rise and increased adoption of technology by sectors such as BFSI (banking, financial services, and insurance), retail and hospitality, big data analytics is on the agenda of boardrooms across Indian enterprises. However, most Indian enterprises are still coming to terms with this concept. While everybody realizes the importance and the potential of analyzing these data sets, very few have the capability to do it. It is widely accepted that Indian enterprises base their decisions mostly on intuition and 'gut-feel' and have barely scratched the surface in terms of using data for decision-making.
• In India, many of the large enterprises have started using or are contemplating the use of big data analytics. SMBs are still some distance away from adopting this concept. Their challenges are more basic: effective data storage and management. However, many medium businesses that are already past the initial stages of IT adoption are expected to take this up shortly.

FIGURE 5.1: ANALYTICS APPLICATIONS, AND CLASSIFICATION OF ANALYTICS INDUSTRY (SOURCE: http://www.iitk.ac.in/ime/MBA_IITK/avantgarde/?p=1165 dated 27/08/13)

Development of BA Industry in India

Big Data, Open Data, analytics, data insights and visualization open up lucrative opportunities for Indian companies, startups and incumbent IT/KPO players. A number of market research reports throw light on this opportunity, and a wave of startups is emerging in this space.
TechSparks has summarized the market trends and insights into the BA industry in India. Some indicators are as follows:

A report by NASSCOM and CRISIL Global Research and Analytics predicts that the global Big Data market will reach $25 billion by 2015, up from $5.3 billion in 2011; the Indian Big Data industry will reach $1 billion by 2015.

At the 2014 edition of its Big Data and Analytics Summit, NASSCOM released another report in partnership with BlueOcean Market Intelligence, which predicts that the analytics market in India could reach $2.3 billion by the end of 2017-18.

According to research by Avendus Capital, the data analytics market in India is expected to reach $1.15 billion by 2015, and will account for a fifth of India's knowledge process outsourcing (KPO) market of $5.6 billion.

The market leader, the US, is expected to have a shortage of 140,000 to 190,000 analytics professionals by 2018, which opens up a huge opportunity for product and service companies in India.

The Internet of Things is another market opportunity for India in Big Data and analytics. According to Machina Research data cited at a recent panel of TiE Bangalore, the global market for IoT in 2020 will be worth $373 billion in hardware and software revenue, and India will account for $10-12 billion of this total. Early stage startups like SenseGiz.com and mature startups such as ConnectM are active in this space.

"Analytics holds the key importance for the commercial growth of India in the future to come," according to Nishu Navneet of IIT Kanpur, citing research that shows 29% of analytics companies are in Bangalore, 25% in NCR, 8% in Pune, 8% in Hyderabad, 6% in Chennai and 6% in Mumbai.

Analytics India magazine divides the Indian analytics market into three kinds of players: service providers (80% of the market), captives (back offices for analytics: 15% of the market) and domestic market (Indian companies using analytics: 5% of the market).
Indian companies in a number of sectors are using analytics: banking (ICICI, HDFC, Axis, Yes Bank, Kotak Mahindra), telecom (Bharti Airtel, Idea Cellular), automotive (Tata Motors, Mahindra & Mahindra) and eCommerce (Flipkart, Snapdeal, Jabong). Information Week magazine has also documented the use of analytics by a range of Indian players: HDFC, Shoppers Stop and Aircel. Outside the private sector, the BJP used analytics and real-time social media monitoring during its recent election campaign.

Indian startups and service providers in the space of analytics services and data insights include a range of players such as AbsolutData, DataWeave, Flutura, Formcept, Fractal Analytics, GenPact, Germin8, LatentView, Mu Sigma, Nanobi and Veda Semantics.

5.3 ANALYTICS AS A SERVICE (AAAS)

Analytics as a Service (AaaS) is an extensible analytical platform provided through a cloud-based delivery model, in which various tools for data analytics are available and can be configured by the user to efficiently process and analyze huge quantities of heterogeneous data. Customers feed their enterprise data into the platform and get back concrete and more useful analytic insights. These insights are generated by Analytical Apps, which orchestrate concrete data analytic workflows. The workflows are built using an extensible collection of services that implement analytical algorithms, many of them based on Machine Learning concepts. The data provided by the user can be enhanced by external, ‘curated’ data sources. A diagram describing the concept is given in Figure 5.2.

The AaaS platform is designed to be extensible in order to handle various potential use cases. One concrete case of this is the collection of Analytical Services, but it is not the only one. For example, the system can support the integration of very different external data sources.
To make AaaS extensible and easily configurable, the platform includes a series of tools that support the complete lifecycle of its analytics capabilities.

FIGURE 5.2: CONCEPTUAL DIAGRAM OF AaaS (Atos White paper on DAaaS)

The Importance

A common scenario among organizations is the struggle to improve the data-to-insight conversion rate. The hurdles are found at all levels of the organization, including IT, business and leadership. Some of the often-heard refrains are:

• Data is dirty
• Available data cannot be trusted
• We have no experience in analyzing unstructured data streaming from social media and other digital channels
• As a decision maker I have no access to data
• The expensive software bought cannot be maintained by the IT department
• Though the IT department would like to support business with new technologies, its budget does not allow it

The traditional processes for software delivery, procurement and system management do not generate a robust business case that positively impacts both business and IT. The initial investment (CapEx) is generally high for introducing new technologies from the analytics arena. Once the new technology is in place, further direct and indirect costs add to the IT budget because of the increased complexity of the IT landscape (OpEx). This is the background against which organizations generally struggle to achieve a competitive data-to-insight conversion rate.

An ideal AaaS case is when data is already in the cloud, or at least easy to upload into the cloud. The most obvious data streams to link to AaaS are external ones such as social media or machine-to-machine (m2m). Establishing a new AaaS channel typically takes hours, after which the Analytics Expert can support decision makers with data-driven insights. This is how Analytics-as-a-Service drastically improves the data-to-decision conversion rate.
The Challenges of AaaS

Analytic solutions that need to support Big Data services present additional challenges. This is even more the case if these services are intended to be delivered through a cloud environment.

• Information Lifecycle Management
• Data model diversity
• Analytic knowledge
• Data volume
• Real-time analytics
• Security
• Privacy

Benefits of AaaS

The main benefit of AaaS is that it lowers the barrier of entry to advanced analytical capabilities, without demanding that the user commit large internal infrastructure and human resources to the project. Table 5.3 compares AaaS with a complex custom project run by the customer:

TABLE 5.3: COMPARISON BETWEEN AaaS & INTERNAL BD PROJECT

AaaS
• Data Scientists working for the organization explore the AppStore for an Analytical App that fits the problem.
• They rent the Analytical App for a specific time or quantity of data.
• They configure the Analytical App to their needs including, for example, the usage of external data sources provided by the AaaS.
• Then the data is fed from the internal systems to the Analytical App.
• The SMEs in the company validate the results and even enhance them with some customization.
• Outcomes are available for all other uses.

Typical Internal Big Data Project
• Data Scientists need additional resources to design and implement the solution.
• Installs a complete Big Data infrastructure based on some complex technology like Hadoop.
• Implements complex analytical processes in low-level languages, becoming in reality an expensive coder.
• Integrates the new system with the enterprise systems with more development effort.
• Examines the results and reiterates until achieving success.

5.4 POSSIBLE MODELS FOR ENTREPRENEURSHIP BUILDING

Entrepreneurial development is the systematic and organized development of a person into an entrepreneur.
The development of an entrepreneur means inculcating entrepreneurial skills in a common person: providing the needed knowledge, developing technical, financial, marketing and managerial skills, and building the entrepreneurial attitude.

Despite all the hurdles to success, this is a great time to be an entrepreneur in India. With huge open opportunities in Software as a Service (SaaS), mobile payments, gaming, entertainment, marketplaces, and simply easier and better access to information and products, the potential for impact is immense. By leveraging mobile and internet technology, entrepreneurs in India have the opportunity to transform the way Indians will lead their lives.

DST, in collaboration with the Technological Development Board (TDB), can become a catalyst for the emergence of competent first-generation entrepreneurs, and for the transition of existing entrepreneurs into growth-oriented BDA enterprises, by promoting and encouraging entrepreneurship in Big Data Analytics through entrepreneurship consulting, education, training, research and institution building.
TDB will be the funding agency, while DST will have the responsibilities given below:

• Provide initial Capital/Seed Financing
• Enhance management bandwidth
• Accelerate the product development
• Provide help in building prototype, market validation and business plan
• Facilitate access to an ecosystem of US founders/Strategic Partners and Clients
• Engage with thought leaders in The Hive Big Data community worldwide
• Accelerate access to sources of future capital
• In house business and strategy team to help develop the intricacies of the entrepreneurs’ business
• Technology team to help with data science and petabyte scale systems
• Advisors from our extensive network of technology and business experts
• Engagement with thought leaders in the Big Data community

In addition to mentoring startups, DST may also host periodic talks and panel discussions to share knowledge and bring together experts and visionaries from academia and the industry. In the initial stage, DST should engage with entrepreneurs to help them refine their product concept.

DST may also take up responsibilities in the area of entrepreneurship training and education, with the following additional objectives:

• To promote and develop high-end entrepreneurship for BDA manpower as well as self-employment by utilizing S&T infrastructure and by using S&T methods.
• To facilitate and conduct various informational services relating to promotion of entrepreneurship in BDA.
• To network agencies of the support system, academic institutions and Research & Development (R&D) organizations in BDA to foster entrepreneurship and self-employment using BDA.
• To act as a policy advisory body with regard to entrepreneurship in BDA.
• To evolve standardized materials and processes for selection, training, support and sustenance of entrepreneurs, potential and existing.
• To serve as an apex national level resource institute for accelerating the process of entrepreneurship development in BDA, ensuring its impact across the country and among all strata of the society.
• To provide vital information and support to trainers, promoters and entrepreneurs by organizing research and documentation activities relevant to entrepreneurship development in BDA.
• To train trainers, promoters and consultants in various areas of entrepreneurship development in BDA.
• To offer consultancy nationally/internationally for promotion of entrepreneurship and small business development in BDA.
• To provide national/international forums for interaction and exchange of experiences helpful for policy formulation and modification in the BDA domain at various levels.
• To share international experience and expertise in entrepreneurship development in BDA.
• To share experience and expertise in entrepreneurship development in BDA across national frontiers.

6. DATA SCIENCE POLICY PERSPECTIVES

6.1 CONCERNS FOR BIG DATA AND PUBLIC POLICY

Advances in digital technology are making it possible to collect, store and process ever-expanding amounts of data. This explosion of data holds tremendous potential to boost innovation, productivity, efficiency and, ultimately, economic growth and social value. The use of ‘big’ data, however, raises many questions:

• What do individuals think about the data being gathered about their everyday activities (for example, through social media and the internet, sensors, radio-frequency identification chips, geospatial technologies, loyalty cards or transport cards)?
• Who should own and control such data?
• What is the right trade-off between privacy, intellectual property rights and security and allowing society to benefit from data-driven innovations and better ways of living?
• Is the right to be forgotten practicable, useful and meaningful, and does it need to be complemented with a right to be remembered?
• What sorts of curation mechanisms are most effective in ensuring data quality and interoperability across organizational boundaries, particularly in the case of open data sets?
• How can we assess the impact of big data on existing communications, legal and regulatory systems?
• How can society benefit most from big data?

6.2 BIG DATA: MANAGING THE LEGAL AND REGULATORY RISKS

When adopting a new and potentially disruptive technology such as Big Data, all the risks need to be identified and managed. That includes securing asset values and addressing the other legal and regulatory risks. Among other things, a failure to address legal and regulatory risk in relation to Big Data could result in a serious regulatory breach, attracting fines, reputational damage and loss of business. This section considers how to identify and manage such risks.

Controlling use of big data

Data privacy law is one area of law that any business will have to take very seriously in relation to the use of Big Data. While these laws vary from country to country, in Europe there are certain commonalities. Big Data typically involves the reuse of data originally collected for another purpose. Among other things, such reuse would need to be 'not incompatible' with the original purpose for which the data was collected for reuse to be permissible. The Article 29 Working Party (consisting of the data privacy regulators across the EU) has set out a four stage test to determine when this requirement is met. The four stage test includes a requirement that safeguards are put in place to ensure fair processing and to prevent undue impact on the relevant individual. This could include 'functional separation' (that is, anonymising / pseudonymising or aggregating the results). In many cases, the only way to overcome data privacy concerns in relation to Big Data will be by way of adequate consent notifications.
To obtain effective consent in relation to Big Data analytics is not straightforward.

How do you protect rights in big data?

Across the EU, the intellectual property right that could provide the most protection is the database protection regime. It has limitations, as do copyright and patents in relation to Big Data. The law of confidentiality may provide some protection, depending on the particular information and its source. As the law in this area may provide only limited protection, it may sometimes be necessary to return to the basics: ensure that any disclosure is coupled with adequate contractual confidentiality provisions limiting further use and disclosure. Conversely, it will be essential to check that the compilation of a Big Data data set has not infringed a third party’s intellectual property or contractual rights.

What are the other potential liabilities?

Among the potential liabilities that need to be addressed is the question of data reliability. Data sourced from publicly available sources, from another business, or collated by the business itself, may contain errors. Data sets may have their origin in several different sources. So-called 'open data' is typically licensed on terms similar to those applicable to open source software. Such terms usually give little or no comfort in relation to the reliability (and non-infringing nature) of the licensed material. Public providers of such data sets (such as local authorities or central government) are seldom willing to accept liability for losses arising from reliance on the data (particularly when the data are provided free or for a nominal charge).

What technical and organizational measures should be considered?

Interception, appropriation and corruption of data remain an issue for businesses possessing Big Data sets, just as with any other data.
The data privacy laws in many countries require that the data controller implement appropriate technical and organizational measures to safeguard the security of personal data. Such laws typically require the data controller to flow down these requirements in contractual relations with its suppliers. These requirements will apply to Big Data sets held by businesses that contain personal data. Businesses will also need to take into account the new EU Data Protection Regulation, which will require technical and organizational measures to be provided for by design and by default. Purely technical solutions, implemented in the absence of a more comprehensive approach to information governance, may not be adequate.

The need for expertise: A recent survey by Accenture (Big Success with Big Data Survey, April 2014) found that 41% of businesses reported a lack of appropriately skilled resources to implement a Big Data project. Such expertise will need to include a legal and regulatory compliance review. It is simply a case of taking steps to address these issues early on.

6.3 DATA AND PRIVACY: A TECHNOLOGICAL PERSPECTIVE, A REPORT BY THE PRESIDENT’S COUNCIL OF ADVISORS ON SCIENCE AND TECHNOLOGY (PCAST, USA)

The report makes five major recommendations, as given below.

Recommendation 1: Policy attention should focus more on the actual uses of big data and less on its collection and analysis.

Recommendation 2: Policies and regulation, at all levels of government, should not embed particular technological solutions, but rather should be stated in terms of intended outcomes. To avoid falling behind the technology, it is essential that policy concerning privacy protection should address the purpose (the “what”) rather than prescribing the mechanism (the “how”).
Recommendation 3: With coordination and encouragement from Office of Science and Technology Policy (OSTP), the Networking and Information Technology Research and Development Program (NITRD) agencies should strengthen U.S. research in privacy-related technologies and in the relevant areas of social science that inform the successful application of those technologies.

Recommendation 4: OSTP, together with the appropriate educational institutions and professional societies, should encourage increased education and training opportunities concerning privacy protection, including career paths for professionals. Programs that provide education leading to privacy expertise (akin to what is being done for security expertise) are essential and need encouragement.

Recommendation 5: The United States should take the lead both in the international arena and at home by adopting policies that stimulate the use of practical privacy-protecting technologies that exist today.

6.4 BEYOND NDSAP: REGULATORY MODE

National Data Sharing and Accessibility Policy (NDSAP): NDSAP aims to provide an enabling provision and platform for proactive and open access to the data generated by various Government of India entities. The objective of this policy is to facilitate access to Government of India owned shareable data (along with its usage information) in machine readable form through a wide area network all over the country in a periodically updatable manner, within the framework of various related policies, acts and rules of Government of India, thereby permitting a wider accessibility and usage by public. NDSAP is designed so as to apply to all sharable non-sensitive data available either in digital or analog forms but generated using public funds by various Ministries/Departments/Subordinate offices/Organizations/Agencies of Government of India as well as States.
There is a felt need to elevate this Policy into an Act so that its aim, to provide an enabling provision and platform for proactive and open access to the data generated by various Government of India entities, can be fully achieved.

Open Government Data (OGD) Platform India: The OGD Platform (http://data.gov.in) has been set up to provide access to datasets published by different government entities in open format. It also provides search, discovery and on-the-fly data conversion (to widely used open formats) mechanisms for instant access to desired datasets. The OGD Platform has a backend data management system which is used by government departments to publish their datasets through a predefined workflow. They shall also have a dashboard to see, at one point, the current status of their datasets, usage analytics, and feedback and queries from citizens.

OGD Platform India is still at a nascent stage and is undergoing considerable change. One of the major challenges faced is the formation of an NDSAP Cell in every Ministry/Department. As per policy guidelines, in order to implement NDSAP, each Department is required to establish an NDSAP Cell, headed by the Data Controller, who could be assisted by a number of Data Contributors and a few domain specialists. These professionals would monitor and manage the open data initiative in their respective Ministry/Department and extend technical support to ensure quality as well as correctness of the data.

6.5 EXISTING STANDARDS – CODATA & WAY FOR INDIA

CODATA Capacity Building and the Data Sharing Principles in Developing Countries

CODATA is concerned with all types of data resulting from experimental measurements, observations and calculations in every field of science and technology, including the physical sciences, biology, geology, astronomy, engineering, environmental science, ecology and others.
Particular emphasis is given to data management problems common to different disciplines and to data used outside the field in which they were generated. CODATA has issued guidelines on data sharing, particularly for developing countries. To start with, it would be a good idea to accept these guidelines and then work towards developing specific guidelines suited to our own requirements. Broadly, the CODATA guidelines on sharing data, in the form of 10 Principles, are as given below.

• 1. Data should be open and unrestricted. Data generated with public support, including private foundations, should be openly accessible and subject to unrestricted (re)use, absent specific, justified reasons to the contrary (see Principle-10). Openness is especially beneficial for development purposes and research uses, but can benefit all society equally and have a multiplier effect on the economy.

• 2. Data should be free to the user. In most cases, any cost for access is an insurmountable barrier to users in the developing world. Therefore, data should be free online to the user. In some special cases, the cost of access to data may be no more than the marginal cost of filling a user request. At the same time, it is recognized that adequate preparation and open availability of data require sufficient financial support (see Principle-7).

• 3. Data should be informative and assessed for quality. Data should be of known quality and integrity, and should be organized and described (with metadata) in datasets sufficient to allow them to be understood and effectively (re)used by others. Baseline technical and management standards need to be established, especially in the developing world where state-of-the-art practices are not yet as prevalent. Adequate preparation and the use of non-proprietary software are especially important for any datasets expected to have long-term value.

• 4. Data sharing should be timely.
Once datasets are sufficiently informative and quality controlled, they should be released as quickly as possible. This can be done in steps, starting with the metadata to avoid duplication. In some cases, such as public emergencies and disasters, open release of relevant data should be an immediate priority. In other cases, such as research, data should be openly available no later than upon the publication or patenting of results. Users in developing countries have the most to gain from such policies.

• 5. Data should be easy to find and access. Upon the public release of any dataset, the provider should promote ease of access by the broadest user base. Diverse means of publication should be considered in recognition of potential connectivity and other technological challenges.

• 6. Data should be interoperable, when necessary. If data from a dataset are likely to be combined with data from one or more other datasets (e.g., in geospatially referenced research), special attention should be given to making such data technically, semantically, and legally interoperable.

• 7. Data should be sustainable. The life-cycles of any datasets that are expected to be reused by others should be planned at the outset with support sufficient to successfully implement the first six Principles. The lower availability of funding in developing countries, especially for long-term preservation, makes this a key priority so that valuable datasets remain intelligible and are not lost or in need of rescue. Cost recovery for data archiving and availability should not be borne by the users, consistent with Principle-2, but by other entities in the data lifecycle.

• 8. Data contributors should be given credit. A significant incentive for the open disclosure and “publication” of a dataset is the ability to properly cite and attribute the contributor(s), whether internal or external to an organization.
Any subsequent user of the data has at least an ethical obligation, and possibly a legal one, to cite and attribute the source of the data whenever they are reused, and not to misuse the data in any way. Such practices can also improve the integrity of the data sets made available by the contributors, in support of Principle-3. In particular, data contributors in the developing world require recognition and rewards for such disclosure, and this should become common practice. A persistent digital identifier, attached to the dataset online, is the best way to promote this goal.

• 9. Data access should be equitable. Open access and use of data in developing countries, especially for public purposes, should be supported by the governments and institutions in the more economically developed nations. Capacity building of essential experts and infrastructure in developing countries should be a priority of international organizations. Similarly, experts in developing countries should join and actively participate in the relevant regional and international organizations.

• 10. Data may be restricted for a limited time, if adequately justified. Restrictions may be placed on access to and uses of publicly funded data and datasets for specified periods of time. Justified restrictions may include specific protections of national security, personal privacy, intellectual property, confidentiality, and other values, such as indigenous peoples’ rights or location of endangered species. Nevertheless, the default rule should be one of openness, consistent with Principle 1, and any restrictions should be minimized to the extent possible.

Way for India

While efforts are being made to provide teeth to NDSAP by converting and elevating the policy into an Act, India may start enforcing the 10 principles enunciated by CODATA for developing countries.

7. TRAINING & CAPACITY BUILDING

7.1 SKILLS NEEDED & AVAILABLE: QUALITY & QUANTITY:

Big Data Analytics work is fluid, often practiced under pressure and frequently demanding attention to detail while simultaneously focusing on the larger purpose. Analysis often involves the customization or creation of tools, the painstaking cleaning of datasets, and the technical and analytic challenges of linking datasets. Within this challenging environment, data workers must have a strong skill set that combines technical and business acumen, involving creativity and agility as well as strong problem-solving skills. Grit, dogged persistence and resilience in the face of these daily challenges underlie the essential skill set of data workers, without which survival and success are unlikely.

Indeed, a unique combination of skills is required to make the most of the opportunity offered by Big Data Analytics. These are the Hard and Soft Skills. Apart from hard skills such as Subject Matter Expertise, Mathematics & Statistics Knowledge and Data & Technical skills, there is a need for soft skills such as Problem Solving, Story Telling, Collaboration, Curiosity, Communication and Creativity. The skill sets can also be classified by Role Groups such as Developers, Architects, Analysts, Coders, Data Scientists, Data Engineers, Designers, Administrators, Project Managers, Business Experts, and Consultants. Given below is an additional list of a few Business Analytics roles across industries:

• Data Analyst
• Financial Analyst
• Pricing Analyst
• Website Analyst
• Retail Sales Analyst
• Business Analyst
• Marketing Analytics Manager
• Supply Chain Analyst
• Fraud Analyst
• Clinical Analyst

7.2 THE ESSENTIAL SET OF DATA SCIENCE COMPETENCIES:

Becoming a data scientist requires comprehensive mastery of a number of fields. However, one does not need to learn a lifetime’s worth of data-related information and skills as quickly as possible. Instead, learn to read data science job descriptions closely.
This will enable one to apply for jobs for which one already has the necessary skills, or to develop specific data skill sets to match the jobs one wants. Nevertheless, the essential competencies given below need to be acquired as soon as possible.

Basic Tools: No matter the type of company, one is expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL.

Basic Statistics: At least a basic understanding of statistics is vital for a data scientist.

Machine Learning: This can mean things like k-nearest neighbors, random forests and ensemble methods.

Multivariable Calculus and Linear Algebra: Understanding these concepts is most important at companies where the product is defined by the data and small improvements in predictive performance or algorithm optimization can lead to huge wins for the company.

Data Munging: Often, the data being analyzed is going to be messy and difficult to work with. Because of this, it is really important to know how to deal with imperfections in data. Some examples of data imperfections include missing values, inconsistent string formatting (e.g., ‘New Delhi’ versus ‘new delhi’ versus ‘ND’), and date formatting (‘2014-01-01’ vs. ‘01/01/2014’, unix time vs. timestamps, etc.).

Data Visualization & Communication: Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time or at companies where data scientists are viewed as people who help others make data-driven decisions.

Software Engineering: It is important to have a strong software engineering background.

Thinking Like A Data Scientist: It is important to think about what things are important, and what things are not. How should one, as the data scientist, interact with the engineers and product managers? What methods should one use? When do approximations make sense?
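The data imperfections named under Data Munging above (missing values, inconsistent string formatting and mixed date formats) can be illustrated with a short sketch. It is only an illustration under stated assumptions, not a recommended pipeline: the alias table `CITY_ALIASES`, the helper names `clean_city` and `clean_date`, the reading of ‘01/01/2014’ as day/month/year, and the treatment of all-digit strings as Unix time in seconds are all hypothetical choices made for this example.

```python
from datetime import datetime, timezone

# Hypothetical alias table: maps lower-cased variants to one canonical name.
CITY_ALIASES = {"new delhi": "New Delhi", "nd": "New Delhi"}

# Date layouts tried in order; '01/01/2014' is ASSUMED to be day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def clean_city(raw):
    """Return a canonical city name, or None for a missing value."""
    if raw is None or not raw.strip():
        return None
    # Collapse repeated whitespace and ignore case before the alias lookup.
    key = " ".join(raw.split()).lower()
    return CITY_ALIASES.get(key, key.title())


def clean_date(raw):
    """Return an ISO date string (YYYY-MM-DD), or None if it cannot be parsed."""
    if raw is None or not str(raw).strip():
        return None
    text = str(raw).strip()
    if text.isdigit():
        # ASSUMPTION: an all-digit value is Unix time in seconds (UTC).
        return datetime.fromtimestamp(int(text), tz=timezone.utc).date().isoformat()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Four records describing the same place and day in different ways,
# plus one record with missing values.
records = [
    {"city": "New Delhi", "date": "2014-01-01"},
    {"city": "new delhi ", "date": "01/01/2014"},
    {"city": "ND", "date": "1388534400"},
    {"city": None, "date": ""},
]
cleaned = [
    {"city": clean_city(r["city"]), "date": clean_date(r["date"])}
    for r in records
]
```

After cleaning, the first three records become identical, and the record with missing values is kept with explicit `None` fields rather than being silently dropped, so the decision on how to handle it stays with the analyst.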
Creativity: There are no hard and fast rules about what a company should use big data for.

Business skills: An understanding of business objectives, and of the underlying processes which drive profit and business growth, is also essential.

Communication ability: Both inter-personal and written – an essential part of a data scientist’s skill set is the ability to communicate the results of the analysis to other members of the team as well as to the key decision-makers, who need to be able to quickly understand the key messages and insights.

7.3 INDUSTRY NEEDS & THE GAPs

• According to the Business Standard, the Indian analytics market is expected to grow to $1.15 billion, and industry bodies predict a five-fold growth in the number of big data professionals by 2015.
• For many organizations, especially in India, where big data is booming at an exponential rate, finding the right talent and knowing what skills to look for continues to be a major roadblock.
• Students graduating from many Indian colleges and universities do not possess several of the advanced skill sets required of big data workers, such as predictive analysis skills, working with advanced business intelligence tools, and data integration skills.
• The existing, experienced big data professionals do not have adequate expertise to train fresh talent, and hence most entry level professionals need to learn with experience.
• According to the Jigsaw Academy annual salary report 2014 for analytics professionals, the average salary of entry level big data professionals has increased 27 percent since 2013, from Rs 5.2 lakhs to Rs 6.6 lakhs per annum. Typically, there is also a 250 percent increase in salary while moving from an entry level analyst to the position of a manager.
• Keeping in mind the above challenges, in-house HR professionals in organizations that recruit big data analysts will face a two-fold problem: identifying the right candidates with hybrid skill sets, and staying within the budget.
• Recruiters have to think out of the box while looking for big data talent. Expanding their search outside the regular computer science related streams, to fields where mathematical and business skills are heavily utilized, is one way to overcome the scarcity.
• A few innovative start-ups have taken the internship route to identify the most suitable talent for this niche. These start-ups have developed internship programs to acquire, identify and nurture the right talent from top Indian colleges.
• While big data is changing the game for recruiters, big data organizations themselves need to face up to the challenges of hiring the right people for the right seats.

7.4 MODELS FOR CAPACITY BUILDING & TRAINING

Existing Training Institutes

Finding trained and competent analytics personnel has become one of the biggest challenges for employers in India. To address this, a large number of analytics training institutes have surfaced all over the country. These institutes focus on training candidates so as to make them fit for the Business Analytics and Business Intelligence sector. Given below is a list of some prominent analytics training institutes in the country.
• Academy For Decision Science And Analytics, Ahmedabad
• Analytics Training Institute, Bengaluru
• Big Data Training, Chennai
• Business Analytics, Bengaluru
• Indian Institute of Technology, Mumbai
• International School of Engineering, Hyderabad
• Jigsaw Academy, Bangalore
• Mudra Institute of Communications, Ahmedabad
• NI Analytics, Bengaluru
• Ureach Solutions, Bengaluru

Online Big Data Training Courses: To meet the demand for big data talent, many universities and training institutes have started offering online courses that cater to learning and working with Hadoop technologies. Following are some of the institutes and organizations offering online courses on Big Data Analytics and related technologies:
• AnalytixLabs
• Cloudera Data Analyst
• Cloudera Introduction to Data Science
• Edureka Big Data and Hadoop
• Edureka Data Science
• Edvancer’s Eduventures
• EMC2 Data Science and Big Data Analytics
• GuRu Prevails
• Jigsaw Wiley Certified Big Data Specialist
• Imarticus Learning
• International School of Engineering (INSOFE)
• Ivy® Professional School
• Learning Tree Big Data Analytics
• Learning Tree Big Data Analytics with Pig, Hive and Impala
• NIVT, Industrial Training Centre affiliated to NCVT, DGE&T
• SimpliLearn Big Data and Hadoop Developer

Institutes offering PG qualifications
Some of the other institutions and organizations offering courses leading to academic qualifications in BDA in the country are as given below:
• Analytics Essentials – IIIT, Bangalore
• Business Analytics and Intelligence (BAI) – IIM Bangalore
• Certificate Program in Business Analytics – ISB, Hyderabad
• Data Analysis Online courses – SRM University
• Executive Program in Business Analytics – IIM Calcutta
• Executive Program in Business Analytics and Business Intelligence – IIM Ranchi
• Jigsaw Academy courses
• M.Tech. Computer Science and Engineering with Specialization in Big Data Analytics – VIT
• M.Tech. (Database Systems) – SRM University
• M.Tech. Computer Engineering and Predictive Analytics – Crescent Engineering College
• M.Tech. (ICT) in 'Data Science and Analytics' – Institute of Engineering & Technology (IET), Ahmedabad
• M.Tech. with specialization in Business Analytics – Hindustan University
• Post Graduate Program in Business Analytics – Praxis Business School
• Post Graduate Program in Business Analytics and Big Data – Aegis
• Statistical Techniques for Data Mining & Business Analytics (DMBA) – ISI Mumbai

Offerings from SW Companies: Major software companies are offering training and certification in Big Data Analytics. Some of the training programs are free, although the certifications generally are not. Programs from prominent big data companies are as given below:
• Oracle big data training
• SAS big data training
• SAP training and certification
• Microsoft SQL Server training
• IBM training and certifications
• HP Vertica certification

8. INVESTMENTS – DETAILED PROJECT REPORT

8.1 OBJECTIVES OF THE STUDY
• Assess the present status of the industry in terms of market size, the different players providing services across sectors/functions, opportunities, SWOT of the industry, policy framework (if any), present skill levels available, etc.
• Conduct a market landscape survey to assess the future opportunities and the demand for skill levels in the next 10 years.
• Carry out a gap analysis in terms of skill levels and policy framework.
• Evolve a strategic Road Map and micro-level action plan clearly defining the roles of various stakeholders (Govt., Industry, Academia, Industry Associations and others), with clear timelines and outcomes for the next 10 years.
• The international scenario may also be examined while evolving the Strategic Road Map.

8.2 CONSULTATIVE APPROACH
8.2.1 THE NEED
Indian business, research institutions and enterprises are sitting on a gold mine of information.
Making sense of these huge data sets has become imperative. In these circumstances, big data analytics has become one of the more talked-about topics in India. Big data has tremendous potential in India. With social media usage on the rise and increased adoption of technology by almost all sectors, big data analytics is on the agenda of boardrooms across Indian enterprises. However, most Indian enterprises are still coming to terms with this concept. Apart from business and industry, the government, through its various ministries and arms, research organizations and institutes of higher learning, is yet another source of huge amounts of data. While everybody realizes the importance and potential of analyzing these data sets, not much has been achieved by way of a structured and concerted approach to channelize resources and efforts towards leveraging big data in the country. In addition, the Big Data domain of the country has a very large number of stakeholders with stakes in very divergent fields. A Consultative Approach is best suited to handle the enormous task of preparing a strategic roadmap for big data analytics.

8.3 METHODOLOGY
The methodology adopted for the study was two-fold: Primary Research and Secondary Research. A diagrammatic representation of the entire approach is given in Figure 8.1 below.

FIGURE 8.1: APPROACH & METHODOLOGY

8.3.1 SECONDARY RESEARCH
This involved capturing relevant information from the public domain through research articles, published documents on Big Data Analytics, net searches, etc. A very large number of research papers, reports, books, and other public-domain documents and presentations were consulted, in addition to information collected during participation in a number of Big Data related conferences/seminars held recently in the country. A list of the materials referred to has been included in the Bibliography given in the report.
An organized and structured thought process, as given in Table 8.1 below, was deployed to cull relevant information.

TABLE 8.1: STRATEGIC THOUGHT PROCESS
(Columns: MAJOR STRATEGIC THOUGHT; MAPPING WITH THE OBJECTIVES OF THE STUDY; CONTRIBUTING FACTORS TO BE INVESTIGATED)

MAJOR STRATEGIC THOUGHT
Data Science and its supporting role in Big Data
Assessment of the current status
Opportunities, threats, gaps and questions
Data Science and the Global Scenario
Identification of the Pillars of Information Driven Governance and Business Transformation
Maturity Stages on Road to Big Data
Leveraging Data Science for Scientific Research & Development

MAPPING WITH THE OBJECTIVES OF THE STUDY / CONTRIBUTING FACTORS TO BE INVESTIGATED
• Understanding Data Science
• Defining Big Data
• Types of data and non-marketing applications
• Driving value from Big Data
• How to approach a Big Data project
• Major stakeholders
• Availability of data
• The current and future technologies to be used
• Adequacy of the available infrastructure
• Quality & quantity of manpower available
• Available Technologies & Providers
• Where and how do we start?
• How do we create a business case for a pilot?
• SWOT Analysis
• What data is relevant?
• Assess the present status of the industry
• Emerging ICT Paradigm
• Assessing the present status of the industry
• SWOT Analysis of Big Data Analytics
• Market landscape survey to assess the future opportunities and demand
• Successful applications of Big Data
• Big Data for Development
• Big Data Market
• Identification of the Business Drivers
• Doing more with the data – using Big Data and Business Analytics
• The aims of deploying an enterprise data hub
• The international scenario for evolving Strategic Road Map
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders
• Initiate – Kick start & build first success
• Scale up – Build confidence in sustainable success
• Applications in R&D projects
• Establishment of Centre for Excellence in Data Science
• Dissemination of Big Data knowledge
• Capacity Building through Training programs
Big Data Road Map
• Indian Perspective
• Identify Gap Areas
• Challenges in R&D for S&T
• Gap analysis in terms of skills levels and policy framework
Digital India
• Provisions under Digital India Initiative
• Leveraging provisions under ‘Digital India’
• Leveraging Big Data & Open Data
• Synergy with e-Governance initiatives
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders
Data Demand Trends
• Identify Gaps by roles and skills
• Gap Closing – Centre of Excellence, Skilling/Up skilling, Training Programs, Workshops at various levels of stakeholders
• The regulatory context
• Privacy law as applied to Big Data
• Responsible Big Data business practices
• R & D Projects
• Gap analysis in terms of skills levels and policy framework
Big Data Policy Perspective
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders
• Formulation of the Project, its Objectives & Targets, Cost Benefits & Outcomes, Monitoring mechanisms and Action Plan including technology, cyber security and other relevant issues
• Justification of the Project
• Project Objectives & Targets
• Project Design & Costs
• Envisaged Benefits & Outcomes
• Evaluation parameters {Measurable Indicators}
• Project Monitoring and MIS
• Roles of various Stakeholders
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders
to Managing the governance issues of Big Data
Detailed Project Report and
Future Outlook

8.3.2 PRIMARY RESEARCH
The Primary Research consisted of obtaining feedback from two National Consultative Meetings, four sets of questionnaires (one for each of the four major stakeholder groups, viz. Data Generators, Researchers, End Users and Service Providers; see Annexure-1), and four Interactive Workshops held in four locations (Bengaluru, Pune, Hyderabad and Kolkata). The names of the participants/respondents and their respective organizations are provided in Annexure-2. A summary of the efforts made is given in the tables below.

TABLE 8.2: CONSULTATIVE MEETINGS (CM) & INTERACTIVE WORKSHOPS (IW) ORGANIZED
S. No. | DATE | HELD AT | NUMBER OF PARTICIPANTS
1 | 28/11/14 | New Delhi (CM) | 34
2 | 07/01/15 | Bengaluru (IW) | 31
3 | 19/01/15 | Pune (IW) | 20
4 | 29/01/15 | Hyderabad (IW) | 40
5 | 20/02/15 | Kolkata (IW) | 52
6 | 25/03/15 | New Delhi (CM) | 42
TOTAL | | | 219

TABLE 8.3: RESPONSES RECEIVED FROM THE STAKEHOLDERS
S. No. | CATEGORY OF STAKEHOLDER
1 | Data Generators (DG)
2 | Researchers (RE)
3 | End Users (EU)
4 | Service Providers (SP)
NUMBER OF SUGGESTIONS/RESPONSES RECEIVED (all categories): 100+, including filled-in questionnaires

Second Consultative Meeting: The second consultative meeting was held on 25th March 2015 at New Delhi. At this meeting, the findings of this draft Report were presented to the stakeholders, and their comments, observations and suggestions were invited. The comments and suggestions have been incorporated in this report in Chapter 4.

8.4 ANALYSIS AND RESULTS
8.4.1 PARTICIPANTS’ SUGGESTIONS
During the Consultative Meetings and the Interactive Workshops, the participants made valuable suggestions. These suggestions are consolidated as given below.
First Consultative Meeting
• Based on the suggestions of participants, the stakeholders were divided into four groups, namely:
o RESEARCHERS (RE)
o DATA GENERATORS (DG)
o END USERS (EU)
o SERVICE PROVIDERS (SP)
• A questionnaire for each stakeholder group should be designed separately to capture the specific inputs.
• Each questionnaire is divided into several parts, consolidating questions pertaining to a specific discipline so as to facilitate filling in by a person from that discipline.
• Big data analytics is at an initial stage in India. The questionnaire may be sent along with a concept note giving a brief about the project and the purpose behind filling in the questionnaire.
• The questions may be framed to answer "what" rather than "how".
• The structure of the questionnaire may be categorized under various heads; some heads may be less relevant or not relevant for a particular stakeholder. Therefore, different questionnaires need to be prepared for different stakeholders such as government, industry, individual experts, etc.
• In multiple-choice questions, there should not be more than 3-4 options; too many options (say 10) for a particular question will not produce an appropriate outcome.
• It is important to identify and select an appropriate person to fill in the questionnaire, based on skills, expertise and experience in the area of Big Data analytics.
• Some questions in the questionnaire do not seem relevant and are difficult to answer, e.g. "How often do you update your data?" Such questions may be removed.
• Some important stakeholders in data analytics are: academic institutes, industry, and agencies like IRDA, UIDAI (Aadhaar), the Electoral Office, NSDC, etc.
• Curation of data is an important aspect and needs to be included in the Report, as in the present environment the lack of standards in data storage results in seek/incomplete/non-authentic data.
• Quality of data is an important aspect, to be checked for ontology and metadata.
• The Report should have specific outcomes against a definite scope of work. Focus areas may be Human Resource Development (HRD), policy framework, research proposals, and association with industries for data analytics based on their specific requirements. In the policy framework, ownership of data may be defined.
• Digitalization of data, Digital Asset Management, and training of SMEs, academics and other stakeholders for the same may be part of the Report, as very few government organizations presently have skilled manpower for data analytics.
• DST may initiate summer/winter training schools, workshops and conferences for widespread dissemination of data analytics.
• The project initiative may be linked to Digital India and Make in India. Organizations like DIT may be associated in this regard. A Disaster Management system may be developed through this project.
• Stakeholders for data may be categorized as follows:
o Data generators
o Data brokers (who do data analytics)
o Data implementers
• Data analytics can generate the basis for many IPRs. The Report may be documented in such a way that it can be recast as an Act, implementable at national and international level.
• Generally in data analytics, more attention is given to the semantic aspect and applied technology; however, processes need to be given more attention.
• The strategic document should not include the scenario after 10 years; considering the speed at which technology is upgraded, it would not be of practical value.

Four Interactive Workshops
Some of the important suggestions and observations made during the four interactive workshops are as given below:
• Is data really big or are we making it big?
• Is it about Big Data only or about DATA?
• Need to educate and create awareness among people to generate and store only relevant data.
• The problem of data sharing could be eased by elevating the National Data Sharing and Accessibility Policy (NDSAP) to an Act.
• “Big Data” should be changed to “Analytics and Big Data” or “Data Science”.
• The data available with a large number of organizations is not Big Data, in the sense that it does not challenge existing computing technologies. However, these are very important problems, and users can add great value by analyzing the data.
• Many institutes have started offering courses in analytics and big data without understanding them well; a white paper will help to eliminate these misconceptions.
• Need for a national-level curriculum for Analytics and Big Data.
• Lack of real-world big data sets, and the need for government agencies to participate more actively via open data, etc.
• Feasibility of allowing the final-year project work to be a "big-data-analytics" project.
• Lack of trained tier II faculty.
• Identify the top 5 PROBLEMS to be solved by Big Data Analytics.
• Policy on sharing data.
• Use of all forms of analytics, viz. Descriptive, Predictive and Prescriptive.
• Creating a platform where all the stakeholders can interact, offering what they have and seeking what they want.
• Take advantage of the ‘Incubation Plans’ offered by many organizations.
• Creating a ‘SAFE KEEP’ platform for all the data created, shared and used.
• Creating a ‘Regulatory Authority’ for all facets of Big Data in the country.

Suggested Research Areas
• Big Data: River Network optimization - A data driven analytics approach.
• Understanding urban/rural people's perceptions of immunization
• Online Signals for Risk Factors of Non-Communicable Diseases (NCDs)
• Characterizing human behaviour during floods through the lens of mobile phone activity
• Mining Indian Tweets to Understand Food Price Crises
• Advocacy Monitoring through Social Data: Women and Children Health
• Analyzing Online Content for Insight on Women and Employment in India
• Analytics and Understanding Social Conversations through Big Data
• Unemployment analysis through Social Media
• Monitoring Food Security Issues Through News Media and Analytics
• India, state-wise and region-wise Snapshots of Mental Health/Wellbeing - Mobile Survey
• Daily Tracking of Essential Commodity Prices in India through data mining and analytics
• Twitter and Perceptions of Crisis-Related Stress
• Population migration and analytics
• Food and Nutrition Security Monitoring and Analysis
• Monitoring Household Coping Strategies During Complex Crises
• Economic crisis, tourism decline and its impact on local dependents
• Impacts of the financial crisis on health and poverty in India
• Impact of the financial crisis on primary schools, teachers and parents
• A Visual Analytics Approach to Understanding Poverty Assessment through Disaster Impacts in India
• Monitoring the impact of the global financial crisis on crime in India
• Urban crime pattern analysis, unemployment, education, social hierarchy and economic linkages
• Search Engines in Indian Languages
• Understanding Social Media
• Summarization of Data
• IoT
• Data Engineering
• Robotics
• Visual Information Technology
• Segmenting Videos
• Healthcare
• Cognitive Science
• Signal Processing
• Drug Development
• Computational Neurological Science
• Pattern Analysis and Machine Intelligence
• Statistical Data Analysis
• Image Analysis and Retrieval
• Video Image Analysis and Retrieval
• Data and Text Mining
• Web and Social Network Mining
• Bio-informatics and Computational Biology
• Hadoop and MapReduce

Pattern Analysis and Machine Intelligence
• Dimensionality Reduction
• Density Estimation
• Artificial Neural Networks
• Kernel Methods
• Large-Scale Machine Learning
• Soft computing and uncertainty analysis
• Regression Analysis
• Manifold Learning
• Support Vector Machines
• Pattern Classification and Clustering
• Reinforcement Learning
• Cognitive Machine

Statistical Data Analysis
• Biostatistics
• Statistical Computing
• Large Dimensional Random Matrices
• Computational Finance
• Statistical Genomics
• Non Parametric and Robust Statistics
• Stochastic Processes
• Robust Inference
• Multivariate Analysis

Image Analysis and Retrieval
• Hyper-spectral Image Analysis
• Automatic Target Recognition
• Remote Sensing Image Analysis
• Content Based Image Retrieval
• Face, Pose and Gait Recognition
• Fuzzy Image Modeling
• Digital Watermarking
• Medical image processing & retrieval
• Mathematical Morphology
• Document Image Analysis
• Optical Character Recognition
• Multi-resolution Image Analysis

Video Image Analysis and Retrieval
• Background Subtraction
• Moving Object Detection and Tracking
• Shot Boundary Detection
• Video Retrieval
• Target Detection from Video
• Shadow Removal
• Video Sequence Matching
• Video Copy Detection
• Video Storyboard Generation

Data and Text Mining
• Association/Correlation Analysis
• Rule Mining
• Sequence Mining
• Graph-Pattern Mining
• Information Retrieval
• Granular Mining
• Computational Forensics
• Data Warehousing
• Data Visualization

Web and Social Network Mining
• Reliability and Cost analysis of Complex Networks
• Fitting Distributions to Network Data
• Centrality Measures in Large Scale Social Networks
• Structural Balance and Transitivity
• Relational Network Mining
• Network Visualization
• Target Set Selection
• Community Detection
• Rough-Fuzzy Granular Model of Social Network

8.5 CONSOLIDATION OF QUESTIONNAIRE RESPONSES
All the Questionnaires
received were consolidated as per the stakeholder category, viz. Data Generators, Data Researchers, End Users and Service Providers. These consolidated responses are provided as per the details below:
Consolidated Responses from Data Generators: Annexure 3
Consolidated Responses from Data Researchers: Annexure 4
Consolidated Responses from End Users: Annexure 5
Consolidated Responses from Service Providers: Annexure 6

Major findings, observations, concerns, suggestions and apprehensions of the stakeholders are as given below:

8.5.1 CURRENT STATUS, STRATEGY & PROFILE
a. Stakeholder Segment/Category: Organizations generally operate in one category/segment; however, they sometimes operate in multiple categories/segments.
b. Commonly active Data Segments are:
• Analytics and Simulation segments
• Banking and Finance
• Capacity building initiatives in Big Data Management, Analytics and Machine Learning
• Click stream data and analytics services
• Cloud BI and cloud data services
• Consultancy in Big Data Management, Analytics and Machine Learning
• Customers
• Data Analysis
• Genomics & Life Sciences
• Industry: Education, Manufacturing, Media & Content, Logistics and E-Commerce
• Internet and Media
• IoT data & analytics services
• Policy Making
• Research in the areas of Big Data Management, Analytics and Machine Learning
• Retail
• Telecom
• Transaction, both internal & external
• Transaction Information
c. Data Segments that are outsourced are:
• Banking and Finance
• CRM apps and customization
• Debit Card Data
• ERP apps and customization
• Internet and Media
• Mobile apps and customization
• Retail
• Telecom
d. Budget provisions for Big Data usage during 2014-15, 2015-16 and 2016-17 vary between Rs. 100 lakhs and Rs. 500 lakhs.
e. Areas where more investment in resources is likely to be made by most organizations:
• Capacity Building
• Software Tools: most preferred area
• Data Sources
• Other Data Generation
• Agile Process Quality & ISO compliance for data services
• Data security & industry-specific compliance audits/standards
• User Experience (UX) standards and best practice
• Sales & Marketing standards & scaling the business
f. The current state of big data activities within organizations is generally in flux and is represented by:
• Not yet started to consider big data's use within our organization
• Offering training programs and consultancy
• One or more pilots or proofs of concept
• Implementing big data technologies
g. General expectations of the organizations from Big Data Analytics in the next 10 years are varied and can be captured by:
• It is going to rule many organizations
• Plan to set up an internationally known centre of excellence in Big Data Management, Analytics, Mining, Machine Learning for Research and Development, Consultancy Services and Capacity Building
• Capacity Building
• Training/Research
• Big data will enable integrated cloud warehousing that integrates external and internal data.
• Machine learning algorithms will enable analytic automation that gives competitive
• End-user intelligent apps will become more dependent upon big data APIs that make them smarter and more personalized
• Smarter cities
• Smoother e-governance
• Superior analytics for business growth and customer service satisfaction
• Data Driven Methods
• Conflict resolution
• Early warning systems
• Superior analytics for business growth and customer service satisfaction
• Big Data techniques will allow organizations to analyze data for patterns more quickly and at a much lower cost.
• It will lead to important business insights that can drive the business.
h. Current state of big data activities within organization:
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
i. Organization's competitive position can be described as ‘on par with industry’ or as ‘underperforming industry / market peers’.
j. Big data management is generally viewed strategically at senior levels of the organization.
k. Generally there is enough of a “big data culture” in the organization, where the use of big data in decision-making is valued and rewarded.
l. Organizations are not sure about the usefulness of Big Data Analytics applications.
m. Organizational data is not available at data.gov.in.

8.5.2 MANPOWER, SKILL GAPS AND TRAINING NEEDS
a. Following are the identified skill gaps in dealing with data and analytics:
• Data integration skills
• Data storage skills
• Tooling / software skills
• Visualization skills
b. Big Data experts are employed in the following areas:
• Computer science: artificial intelligence and machine learning experts
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc.)
• Computer science: text, voice, music, image and video experts
• OR and applied mathematics
• Statistics and econometrics
• Those who understand business and the data that goes with it
c. The training needs as identified by the stakeholders are:
• Application related courses (Big Data in marketing, finance, logistics, etc.)
• Computer science: machine learning and artificial intelligence courses
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc.)
• Computer science: text, image and video recognition courses
• High Frequency Data
• Operations research and applied mathematics courses
• Software tools such as Splunk, ELK, Cloudera
• Statistics and econometrics courses
• Strategy courses on Big Data for top management
d.
Capacity Building initiatives that need to be taken are listed below:

Name of the Program | Who should be the participants | Duration | Modality of Delivery
Application related courses (Big Data in marketing, finance, logistics, etc.) | Business Units | 3-5 days | Classroom Training
Basic Statistics | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Big Data | BE/Graduates | 6 months | Apprentice model
Big Data Certifications | Researchers and Practitioners | 2 Months | Hybrid – Class room + e-learning
Big Data in marketing, finance | Middle Management and Lower Management | 3 Days | Class room program
Cloud BI | MCAs | 3 months | Apprentice model
Cloud DS | Diplomas | 3 months | Apprentice model
Computer science: text, image and video recognition courses | Business Units | 3-5 days | Classroom Training
Data Mining and Data Warehousing | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
M.Sc. (Big Data) | B.Sc. | 4 Semesters | Class Room
M.Tech. (Big Data) | B.Tech. | 4 Semesters | Class Room
Multivariate Analysis | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Numerical Methods | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Operation Research | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Programming courses | IT officers | 7 Days | Class room program
Statistics and econometrics courses | Middle Management and Lower Management | 3 Days | Class room program
Strategy courses on Big Data | Top Management | Half Day | Class room program
Strategy courses on Big Data for top management | Business Units | 3-5 days | Classroom Training
Strategy courses on Big Data for top management | C-level professionals, researchers and policy makers | 2 Months | Hybrid – Class room + e-learning
Strategy courses on Big Data for top management | Business Units | 3-5 days | Classroom Training

8.5.3 PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
a. Organizations have taken initiatives in the following areas related to Big Data Science & Technology:
• Analysis of Unstructured/Semi-structured data
• Data streaming & Processing
• New Computational Models
• Security & Privacy issues
• Visualization & Visual Analytics
b. Organizations have taken initiatives in the following areas related to Big Data Infrastructure:
• Big Data Open Platforms
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• System Architectures, Design and Deployment
c. Organizations have taken initiatives in the following areas related to Big Data Search, Mining and Management:
• Algorithms & Systems for Big Data Search
• Cloud/Grid/Stream Data Mining: Big Velocity Data
• Computational Modeling & Data Integration
• Data Acquisition, Integration, Cleaning & Best Practices
• Multimedia and Multi-structured Data: Big Variety Data
• Search & Mining of a variety of data including scientific, engineering, social, sensor & multimedia
• Visualization Analytics for Big Data
d. Organizations have taken initiatives in the following areas related to Big Data applications:
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data as a Service
• Complex Big Data Applications in Science, Engineering, Medicine, Healthcare, Finance, Business, Law and Education
• Retailing, social media and Telecommunication
e. Organizations have timely access to the information they need only to some extent.
f. Organizations are able to get only a modest competitive advantage from information.
g. Challenges inhibiting organizations from acquiring and integrating data:
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficulty in sharing data internally and/or in integrating internal data across silos
h. Challenges inhibiting organizations from analyzing data:
• Lack of software/tools, and/or software too difficult to use
• Inconsistent data across a variety of source systems
i. Challenges inhibiting organizations from acting on data insights and analytics:
• Lack of software/tools that allow end-users to perform analytics themselves
j. The biggest impediments to using big data for effective decision-making:
• Too many “silos”: data is not pooled for the benefit of the entire organization.
k. It is generally agreed that the issue now is not the growing volumes of data, but rather being able to analyze and act on data in real time.
8.5.4 AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
a. Steps taken by the organizations to integrate data into the organization’s business:
• Improve data collection processes
• Redesign/re-engineer important business processes
• Train current employees or recruit new employees in BA
• Upgrade IT systems
b. Areas reasonably ‘developed’ to ‘well developed’ in the organization that may help its use of big data:
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
c. Organizations feel that the number of Big Data specialists in the organization will increase next year (2015).
d. Suggestions on Data Storage, Data Curation and Data Retrieval include that these technologies are evolving and should be constantly innovated, and that the organization roadmap should focus on alignment with emerging technologies.
e. Organizations suggested that the FINAL PRODUCTS for which the Big Data community may strive include:
• Big Data as a Service providing easy experimentation and quick prototyping
• Big Data Analytics platforms for Internet of Things and wearable devices
• Solutions/Protocols for seamless data integration, privacy and security
• Big Data Analysis Platforms
f. Organizations suggested the following thrust areas for researchers in the Big Data discipline:
• Immediately:
o Better algorithms/platforms
o Big Data Management, ETL and Analytics – improving the open source solutions
o Procuring real time data
o Data gathering
o Data integration
o Data integrity
o Data security
• In the next 5-10 years:
o Scalable Machine Learning for Big Data
o IoT and Big Data integration and products
Developing Medical layer for supporting end users, System developers, Building DSS, KSS, Event triggering systems and agents aiming at integrating with Internet of things. 8.5.5 TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED a. Type of data analyzed by the organizations in the context of Big Data applications: • • Numerical data (for statistics, predictions, etc) Text (automated text analysis) b. The support, the organizations, would like to get from the Government: • Building a central repository of financial markets statistical data. • Clarity on security aspect Clarity on statutory / regulatory / compliance requirements High level 5-year country strategy Our Analytic capability may be used for the needy Partnering with peer organizations and relevant government agencies • • • • 120 • • • • • Setting up of SEZ’s for smaller set ups like ours which completely export the services. The current SEZ’s are unaffordable and only larger companies can get the benefit of working out of a SEZ. Support for enhancing the capacity of our Big Data Engineering Lab Support for offering internationally known Big Data certifications in India The government’s roadmap on big data To foster an encouraging environment for entrepreneurship especially for small start Ups. c. Organizations consider that the amount of data available to support decision-making is enough d. Challenges faced by the organizations in GENERATING Data include: • • Ensuring uniformity in data structure. Coping with rapid changes in business requirements. e. Challenges faced by the organization in CLEANING the Generated Data include: • • • Identifying mandatory data fields to ensure correct analytics. Data correlation Data quality f. Advanced analytics methods used by the organizations in Big Data applications • • • Statistics and econometrics Operations research (OR) / applied mathematics Artificial intelligence (AI) and machine learning g. 
Organizations consider the following as the most important factors for successful Big Data implementations. "1"=most important, to "5" =least important. • • • • • • • • A clear company strategy-1 A sound procedure for legal, ethical and reputational issues-3 An organizational structure that supports multi-disciplinary projects-4 Financial budget-1 Support by higher management-1 Supporting systems and procedures-4 Talent-3 Training-4 h. Open Source domain Tools and Platforms used by the organizations for Big Data Analytics. 121 • • • • • • • • Apache Hadoop Ecosystem – Hortonworks, Cloudera Apache Solr/Lucene Graph Data Bases – Neo4J, etc Hadoop Hortonworks Mapreduce No SQL data bases – Mongo DB, CAssandra R , Python – SciPy i. Organizations consider that their Performance in Information and Analytic Tasks are as follows: on A Scale of 1 To 5, Where 1=Poorly and 5=Very Well • • • Acquire and integrate data Analyze data Act on data-driven insights j. Organizations currently not envisage that the DATA CURATION function in-house is a part of data analytics in the organization. 3 3 4 8.5.6 SECURITY CONCERNS a. Initiative taken by the organizations for Big Data Security & Privacy • • • • • Challenges for Big Data Security & Privacy Cyber security and Gigabit Networks Intrusion Detection Sociological Aspects of Big Data Privacy Visualizing Large Scale Security Data b. Organization’s views on the IPR Issues as related to Big Data Analytics: • • • • • Cost of Patent filing is too high Implement the right policies for big data governance. In the crowd sourcing world of Big Data Analytics it is very difficult to clearly demarcate the IPR related boundaries. Most of the research outcomes are not commercialized by governmental organizations Need to think through certain fundamental legal aspects of IPR, e.g. 
"who owns the input data companies are using in their analysis, and who owns the output?” 122 • Over emphasis on IPR may also hamper the open innovation approach in the internet based application development model. c. Views and suggestions of the organizations on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics • • • Will comply with national regulatory requirements All data needs to be made available on common portal accessible to all We will comply with national regulatory requirements 8.5.7 OTHER INFORMATION TO SHARE a. We have more data but we don’t have proper documentation b. Even if we have data but we don’t have operating resources to act as analyst. c. We find it difficult to identify the resource persons who have knowledge and skill in areas like Econometrics, operational research, multivariate tools, computer science etc. 8.6 JUSTIFICATION OF THE PROJECT 8.6.1 FINDINGS OF THE SWOT ANALYSIS SWOT undertaken in Chapter 1 indicates that the country should take immediate and firm steps in the area of development and leveraging of BDA. The important components of the SWOT Analysis are reproduced as below. Strengths • There is a broad and detailed domain know-how as well as process know-how available. • Immense growth opportunity in the analytics market: Indian product firms have shown a growth rate of 20-40 per cent in the last few years; several emerging players have witnessed over 100 per cent growth within the first year of launch. (NASSCOM) • Analytics – a definite market for India: Over 100 Indian analytics focused software product firms have successfully developed and launched products catering to niche business needs, cut across vertical-specific, horizontal process-centric and niche applications and platforms. (NASSCOM) • Growing start-up base accelerating the growth: Four-fold increase in analytics startups in the last four years. 
(NASSCOM)
• Innovative offerings focusing on end-to-end customer business needs. (NASSCOM)
Weaknesses
• There is no existing, strong content/data market in India.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of failure.
• Public data in the country is not available to the extent it should be.
• The different languages within the country create a barrier (multilingualism) during data processing. Structural data sources often lack precise semantics.
• There is a lack of specialized education programs for data analysts.
• There are not enough skilled people to participate in capacity building training programs.
• Legislative restrictions on data sharing decrease availability across the country and make national/industry/domain-focused initiatives that address these issues more difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to address.
• There is no well-designed data governance: Data governance is a must-have, and no longer merely a good-to-have. In today's hyper-competitive markets, insightful knowledge means the difference between success and being overwhelmed. But it has to be based on the right data, based on business requirements.
• Data protection policy: "Ignoring data security, data quality and data access can cost organizations millions of dollars, hurting enterprise agility, efficiency and reputation."
Opportunities
• Strengthening the Indian market, e.g. by fusing the emerging start-up nucleus.
• Creating many SMEs for the low-hanging fruits of Big Data, for which agility is required.
• Investment in the entire innovation chain, beyond basic research.
• Investment support mechanisms for SMEs/Research/Institutions/Students/Scholars/Entrepreneurs.
• There is the opportunity to open up completely new and different business areas and services.
• New applications can be created throughout the Big Data ecosystem, ranging over acquisition, data extraction, analysis, visualization and utilization.
• APIs for access are becoming standardized and available.
• Providing facilities to better navigate and curate data.
• Contextualization and personalization of data.
• The evolution of different sectors and the increased volume of data enable innovative applications to be developed.
• Exploring new research areas.
• User-generated and crowd-sourced content is increasingly available and will help solve a variety of recurring problems once and for all.
• Shift from technology push to end-user engagement.
• By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier: As the presence of the Internet of Things (IoT), such as connected devices, sensors and smart machines, grows, the ability of things to generate new types of real-time information and to actively participate in an industry's value stream will also grow. (GARTNER)
• By 2017, more than 30% of enterprise access to broadly based big data will be via intermediary data broker services, serving context to business decisions: Digital business demands real-time situation-awareness. This includes insights into what goes on both inside and outside the organization. How do weather patterns impact inventory? More so, how do this season's customer preferences as expressed in social media suggest greater or lesser inventory? (GARTNER)
• By 2017, more than 20% of customer-facing analytic deployments will provide product tracking information leveraging the IoT: Fueled by the Nexus of Forces (mobile, social, cloud and information), customers now demand a lot more information from their vendors. The rapid dissemination of the IoT will create a new style of customer-facing analytics, product tracking, where increasingly less expensive sensors will be embedded into all types of products.
(GARTNER)
• Analytics – Opening up a gamut of opportunities for Indian software product firms. (NASSCOM)
• Big Data as a Service (BDaaS): The delivery of statistical analysis tools or information by an outside provider that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.
Threats
• Many skilled professionals leave the country to work in other regions, adding to the risk of a "Brain Drain".
• Acute lack of skilled professionals and graduates.
• There are no existing ecosystems and portals where reliable data sets are available; there is a need to create them.
• Policies of data availability: for example, companies are not willing to make data available 'just in case' it may cause legal action or result in competition.
• Shortage of skills: A wide range of skills is relevant for businesses wanting to use data analytics, including knowledge of statistical techniques, the ability to program and use software, market-specific knowledge and communication. These skills may not be available in the required quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills required for big data analysis is for students to work closely with a company during their studies. Collaboration between a university/institution with analysis expertise and a business with real-world data can be beneficial for both parties.
• Data Sharing Policy: Non-implementation is hindering the take-off of Big Data Analytics.
The project is justified because there is an urgent need to take advantage of the OPPORTUNITIES by leveraging our STRENGTHS and, at the same time, to take cognizance of the THREATS and mitigate them by making plans to overcome the identified WEAKNESSES.

8.6.2 THE GAP IDENTIFICATION:
Gap identification was carried out in Chapter 1. The gaps have been identified in terms of the following major issues.
o Market and Business
o Technical
o Data, Content and Usage
o Education and Skills
o Policy, Legal and Security
Consider the issues of (i) Education and Skills and (ii) Policy, Legal and Security: sooner or later these gaps have to be plugged, and the initiative has to be taken by the Government. Because the Government is the largest producer and user of data, the issue of Data, Content and Usage affects the Government much more than the corporate sector. Logically, the action has to come from the Government. With the huge business potential within the country and outside, an early initiative is likely to give positive results. Considering the Government's call of 'Make-in-India', there is urgency for the Government to take the first step so that the corporate world can join in the efforts.

8.6.3 RESPONSES RECEIVED FROM THE STAKEHOLDERS:
As indicated in Chapter 3, preliminary research was carried out through circulated questionnaires and through inputs gathered in consultative meetings and interactive workshops held with different stakeholders. The detailed responses received have been analyzed to provide an insight into the ground realities of BDA in the country. The following are the important parameters that reveal the ground situation:
• Current Status, Strategy & Profile
• Manpower, Skill Gaps and Training Needs
• Perceived Success Factors, Impediments & Challenges for Big Data Application
• Areas of Application, Models & Infrastructure
• Type, Amount of Data & Analytical Techniques Used
• The support expected from Government
• Security Concerns, Data sharing and IPR Issues
Important and salient responses from the stakeholders are summarized below:
Current Status, Strategy & Profile: Organizations do not necessarily belong to one category; they may operate in multiple categories/segments. The pace of activity is rather sluggish. Many data segments are actively used.
They are all very optimistic about the growth in business related to BDA and are planning to invest Rs. 100 – 500 Lakhs in the next three years. The investments are for capacity building, software, ISO compliance, data security, etc.
Manpower, Skill Gaps and Training Needs: The skill gaps identified include data integration, data storage and visualization skills. The representative training needs identified are application-related courses (Big Data in marketing, finance, logistics, etc.), computer science courses in machine learning and artificial intelligence in the technology area, and strategy courses on Big Data for top management. The capacity building efforts identified range from short courses of 3-5 days to mid-term courses lasting a few months and UG and PG courses.
Perceived Success Factors, Impediments & Challenges for Big Data Application: Many organizations are, on their own, taking initiatives in the areas of Technology, Infrastructure, Management and Big Data Applications. The major concerns are inconsistencies in data from various source systems, legacy infrastructure that inhibits data collection, and difficulty in sharing data internally and/or in integrating internal data across silos. The challenges faced include a lack of software/tools that allow end-users to perform analytics themselves. It is generally agreed that the issue is not the growing volume of data, but rather the ability to analyze and act on data in real time.
Areas of Application, Models & Infrastructure: There is a strong feeling that the number of Big Data specialists required in organizations will increase next year (2015). The suggested immediate thrust areas of research in BDA include better algorithms/platforms; Big Data Management, ETL and Analytics – improving the open source solutions; procuring real-time data; data gathering; data integration; and data security.
Type, Amount of Data & Analytical Techniques Used: The types of data analyzed by the organizations in the context of Big Data applications include both numerical data (for statistics, predictions, etc.) and text (automated text analysis).
The support expected from Government:
• Building a central repository of financial markets statistical data
• Clarity on the security aspect
• Clarity on statutory / regulatory / compliance requirements
• High-level 5-year country strategy
• Our analytic capability may be used for the needy
• Partnering with peer organizations and relevant government agencies
• Setting up of SEZs for smaller set-ups like ours which completely export their services. The current SEZs are unaffordable and only larger companies can get the benefit of working out of a SEZ.
• Support for enhancing the capacity of our Big Data Engineering Lab
• Support for offering internationally known Big Data certifications in India
• The government's roadmap on big data
• Fostering an encouraging environment for entrepreneurship, especially for small start-ups
Security Concerns, Data sharing and IPR Issues: The organizations have taken a few initiatives for Big Data Security & Privacy. On the Data Sharing Policy, they are willing to comply with the national regulatory requirements; however, they also feel that public domain data should be available on a common portal and be equally accessible to all. Their views on the issues related to IPR include:
• NDSAP should be elevated to an act.
• The cost of patent filing is too high.
• Implement the right policies for big data governance.
• Most of the research outcomes are not commercialized by governmental organizations.
• Need to think through certain fundamental legal aspects of IPR, e.g. "who owns the input data companies are using in their analysis, and who owns the output?"
It is certain that the stakeholders are ready to take off on BDA, provided their concerns are alleviated and they are given support in the weak areas. This also calls for urgent action on the part of the Government so that the country does not miss an opportunity that is otherwise within reach.

8.6.4 CAPACITY BUILDING & TRAINING AND ENTREPRENEURSHIP DEVELOPMENT:
In earlier chapters, extensive analysis has been carried out on the opportunities worldwide and within the country vis-à-vis the preparedness of the country. Substantial supporting facts and figures have been provided on the following important indicators:
• IDC Worldwide Big Data and Analytics Predictions for 2015
• The EIU Survey
• Transparency Market Research Report
• Big Data Trends & Predictions of 2015
• European research agenda for Big Data Analytics
• Skills Needed & Available: Quality & Quantity
• The Essential Set of Data Science Competencies
• Worldwide Big Data Vendor Revenue and Market Forecast
• Development of the BA Industry in India
• Analytics as a Service (AaaS)
• Possible Models for Entrepreneurship Building
Considering the opportunities available worldwide and within the country, and taking into account the factors identified earlier through the SWOT analysis of the Indian scenario, it can justifiably be concluded that this is the most opportune moment for DST to take the country, its BDA-related resources and their potential into the BDA ecosystem of the world. The project is justified.
8.7 PROJECT OBJECTIVES & TARGETS
8.7.1 VISION & THRUST AREAS
The following two alternative vision statements are given to spell out DST's initiative:
"To become established as the complete support system provider in the country in the ecosystem of Data Science, Technology, Research & Applications (dASTRA)"
Or
"To act as a facilitator to promote and develop Data Science, Technology, Research & Applications (dASTRA) and related ecosystem in the country"
The major thrust areas would be:
• Big Data Science & Technology
• Big Data Infrastructure
• Big Data Search, Mining and Management
• Big Data Security & Privacy
• Big Data Applications/Research

8.7.2 OBJECTIVES
The following may be the objectives of the Data Science initiative of the Department of Science & Technology. DST, with its all-pervasive intervention in the dASTRA ecosystem, should strive to achieve the following:
• Talent Pool - Create industry-academia partnerships to groom the talent pool in universities, and develop a strong internal training curriculum to advance analytical depth.
• Collaborate - Form analytics forums across organization boundaries to discuss the pain points of the practitioner community and share best practices for scaling analytics organizations.
• Capability Development - Invest in long-term skills and capabilities that form the basis for differentiation and value creation. There needs to be an innovation culture that will facilitate IP creation and asset development.
• Value Creation - Building the rigor to measure the impact of analytics deployment is critical to earning legitimacy within the organization.
8.7.3 ACTIVITIES
For the Department of Science & Technology, Government of India, based on the vision statement given above, the main activities will include, but not be limited to, the following:
• R&D PROMOTION
o Open Sky Research
o Cluster Based Network Programs
o International Collaborative Research Program
• ESTABLISHMENT OF CENTRES OF EXCELLENCE ON DATA SCIENCE
• SKILL DEVELOPMENT, CAPACITY & TRAINING
o Fellowship Based UG/PG and PhD
o Short Term Training for Faculty
o On-Line Programs
o National Workshops & Conferences
o Collaborative Interactive Conferences
o Entrepreneur Development
• INTERNATIONAL LINKAGES & COLLABORATIONS
o UN (R&D and Standards)
o Regional Associations/Collaborations
o Bilateral & Multi Lateral Exchange Programs
• INFRASTRUCTURE DEVELOPMENT
A schematic representation is given in figure 8.2 below.
FIGURE 8.2: DST's VISION OF dASTRA IN INDIA (schematic of the five activity groups listed above and their sub-activities)
FIGURE 8.3: CONCEPTUAL MODEL OF SIX MONTHS STUDENT'S PROJECTS LINKED WITH BIG DATA
FIGURE 8.4: SUGGESTED CAPACITY BUILDING MODEL
o Train faculty: July - August
o Train students: September - December
o Students undertake projects: January - June
o Certification by industry: June

8.7.4 TARGETS
Tentative targets are as shown in table 8.4 below.
TABLE 8.4: TENTATIVE TARGETS (UNITS PLANNED FOR THE PERIOD)
COMPONENT | UNITS | YEAR 1 | YEAR 2 | YEAR 3 | YEAR 4 | YEAR 5 | TOTAL
R&D PROMOTION
Open Sky Research | Number of Grants | 5 | 12 | 12 | 12 | 12 | 53
Cluster Based Network Programs | Numbers of Programs | 10 | 12 | 12 | 12 | 12 | 58
International Collaborative Research Program | Number of Programs | 5 | 5 | 5 | 5 | 5 | 25
ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | Number of Centres | 4 | 5 | 5 | 5 | 5 | 24
SKILL DEVELOPMENT, CAPACITY & TRAINING
Fellowship Based UG/PG and PhD in 80:20 ratio | Number of Fellowships | 1292 | 1300 | 1500 | 1600 | 1700 | 7392
Short Term Training for Faculty | Number of Training Programs | 2 | 2 | 2 | 2 | 2 | 10
On-Line Programs | Number of Programs | 2 | 2 | 4 | 4 | 4 | 16
National Workshops & Conferences | Numbers | 2 | 2 | 4 | 4 | 4 | 16
Collaborative Interactive Conferences | Numbers | 2 | 2 | 3 | 3 | 3 | 13
Entrepreneur Development | Number of Projects | 1 | 2 | 3 | 10 | 15 | 31
INTERNATIONAL LINKAGES & COLLABORATIONS
UN (R&D and Standards) | Numbers | 1 | 1 | 1 | 1 | 1 | 5
Regional Associations/Collaborations | Numbers | 1 | 1 | 1 | 1 | 1 | 5
Bilateral & Multi Lateral Exchange Programs | Numbers | 1 | 1 | 1 | 1 | 1 | 5
INFRASTRUCTURE DEVELOPMENT | Number of Programs | 4 | 4 | 4 | 4 | 4 | 20

8.8 PROJECT DESIGN & COSTS
8.8.1 PROJECT DESIGN
Project Management Unit: The envisaged project is of very high value, it is spread over five years, and its outcomes are vital for the country. Considering this, it is suggested that the project be managed through a Project Management Unit (PMU). The PMU would be headed by the Director, Big Data Initiative, Department of Science & Technology, Government of India, and would be situated at the office of the Director, Big Data Initiative, DST. The major functions of the PMU would be:
• To work as the nodal agency of the Government of India for Big Data Initiatives and to coordinate with all the stakeholders.
• Selecting components and activities to be included in the project from time to time and preparing annual action plans.
• Developing and fine-tuning the final delivery contents, mechanisms and performance measurement criteria of each component/activity to be undertaken by the project.
• Developing and finalizing the guidelines, terms and conditions of the grants, and the various other formats and documents needed for making requests to participate in the project activities, submitting periodic reports, funds utilization statements, etc.
• Seeking proposals from individuals, institutions and other organizations for undertaking the various components and activities selected to be included in the project from time to time.
• Assigning the responsibility for delivery to competent agencies within government (State & Central) and outside, such as national institutes of higher learning, research organizations, service providers and others in the dASTRA ecosystem.
• Monitoring all aspects of the delivery of the project components and activities and ensuring the quality of delivery.
• Ensuring effective coordination with implementing agencies, together with collection of information pertaining to implementation and progress.
• Overseeing and managing the project funds. CPSMS could be used for this purpose.
• The cost of the PMU will be met from the total project cost. The PMU cost should not exceed 2.5% of the total project cost.
Organization Structure
The overall project initiative will be spearheaded by the Director, dASTRA/BDI. It is suggested that there be a Project Head to support the Director, BDI in implementing the project. The project has four major initiative areas: (i) Research & Development (R&D), (ii) Capacity & Training (C&T), (iii) International Linkages & Collaborations (ILC) and (iv) Entrepreneurship Development (ED). Each of these areas should be looked after by a Divisional Head.
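
The 2.5% ceiling on PMU cost can be made concrete with a short calculation. The sketch below is illustrative only: it takes the component totals stated in Table 8.6 (in Rs. lakhs) as inputs, and the function names `pmu_ceiling_lakhs` and `within_pmu_cap` are hypothetical helpers introduced here, not part of the report.

```python
# Component totals over the project life, in Rs. lakhs, as listed in Table 8.6.
COMPONENT_TOTALS_LAKHS = {
    "R&D Promotion": 19000,
    "Centres of Excellence on Data Science": 24000,
    "Skill Development, Capacity & Training": 11500,
    "International Linkages & Collaborations": 1000,
    "Infrastructure Development": 2500,
}

# The report caps the PMU cost at 2.5% of the total project cost.
# Stored in basis points (1 bp = 0.01%) so the arithmetic stays exact.
PMU_CAP_BASIS_POINTS = 250


def pmu_ceiling_lakhs(total_cost_lakhs: int) -> float:
    """Return the maximum PMU cost allowed under the 2.5% cap (hypothetical helper)."""
    return total_cost_lakhs * PMU_CAP_BASIS_POINTS / 10_000


def within_pmu_cap(pmu_cost_lakhs: float, total_cost_lakhs: int) -> bool:
    """Check a proposed PMU budget against the cap (hypothetical helper)."""
    return pmu_cost_lakhs <= pmu_ceiling_lakhs(total_cost_lakhs)


total_cost = sum(COMPONENT_TOTALS_LAKHS.values())
ceiling = pmu_ceiling_lakhs(total_cost)
print(f"Total project cost: Rs. {total_cost} lakhs")
print(f"PMU ceiling at 2.5%: Rs. {ceiling} lakhs")
```

Under these figures the five component totals add up to the stated grand total of Rs. 58,000 lakhs, so the PMU budget over the five years would have to stay within Rs. 1,450 lakhs.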

Implementation of the project will involve a considerable amount of coordination with external agencies, both national and international. It is suggested, therefore, to provide a Coordinator to each Divisional Head. To take care of the very large number of stakeholders, such as students, participants of the training programs, faculty members, etc., and to keep track of the information and documents received and sent out from the project, a pool of Project Assistants is recommended. The project will also have some Multi Task Staff. Table 8.5 gives the details of the project personnel, their number and estimated costs. These costs should not exceed 2.5% of the total project cost. The Organization Structure is as given in figure 8.5.

TABLE 8.5: SUGGESTED PROJECT PERSONNEL
S. N. | Project Position | Numbers
1. | Project Head | 1
2. | Divisional Head | 4
3. | Coordinator | 5
4. | Project Assistant | 8
5. | Multi Task Staff | 12

FIGURE 8.5: SUGGESTED ORGANIZATION STRUCTURE

8.8.2 PROJECT COSTS
TABLE 8.6: COMPUTATION OF COSTS – ALL COSTS IN Rs. LAKHS
COMPONENT | UNITS | UNIT COST | YEAR 1 UNITS | YEAR 1 COST | YEAR 2 UNITS | YEAR 2 COST | YEAR 3 UNITS | YEAR 3 COST | YEAR 4 UNITS | YEAR 4 COST | YEAR 5 UNITS | YEAR 5 COST | PROJECT LIFE UNITS | PROJECT LIFE COST
R&D PROMOTION
Open Sky Research | Number of Grants | 100 | 5 | 500 | 12 | 1200 | 12 | 1200 | 12 | 1200 | 12 | 1200 | 53 | 5300
Cluster Based Network Programs | Numbers of Programs | 150 | 10 | 1500 | 12 | 1800 | 12 | 1800 | 12 | 1800 | 12 | 1800 | 58 | 8700
International Collaborative Research Program | Number of Programs | 200 | 5 | 1000 | 5 | 1000 | 5 | 1000 | 5 | 1000 | 5 | 1000 | 25 | 5000
TOTAL FOR R&D PROMOTION | - | - | - | 3000 | - | 4000 | - | 4000 | - | 4000 | - | 4000 | - | 19000
ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | Number of Centres | 1000 | 4 | 4000 | 5 | 5000 | 5 | 5000 | 5 | 5000 | 5 | 5000 | 24 | 24000
TOTAL FOR ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | - | - | - | 4000 | - | 5000 | - | 5000 | - | 5000 | - | 5000 | - | 24000
SKILL DEVELOPMENT, CAPACITY & TRAINING
Fellowship Based UG/PG & PhD in 80:20 ratio: PG/UG | Number of Fellowships PG/UG | 1.2 | 1215 | 1458 | 1225 | 1470 | 1390 | 1668 | 1390 | 1668 | 1500 | 1800 | - | -
Fellowship Based UG/PG & PhD in 80:20 ratio: PhD | Number of Fellowships PhD | 3.0 | 77 | 231 | 75 | 225 | 110 | 330 | 210 | 630 | 200 | 600 | - | -
Fellowships, UG/PG and PhD combined | - | - | - | - | - | - | - | - | - | - | - | - | 7392 | 10080
Short Term Training for Faculty | Number of Training Programs | 20 | 2 | 40 | 2 | 40 | 2 | 40 | 2 | 40 | 2 | 40 | 10 | 200
On-Line Programs | Number of Programs | 30 | 2 | 60 | 2 | 60 | 4 | 120 | 4 | 120 | 4 | 120 | 16 | 480
National Workshops & Conferences | Numbers | 30 | 2 | 60 | 2 | 60 | 4 | 120 | 4 | 120 | 4 | 120 | 16 | 480
Collaborative Interactive Conferences | Number of Projects | 20 | 2 | 40 | 2 | 40 | 3 | 60 | 3 | 60 | 3 | 60 | 13 | 260
Entrepreneur Development | Number of Projects | *** | 1 | - | 2 | - | 5 | - | 10 | - | 15 | - | 33 | -
TOTAL FOR SKILL DEVELOPMENT, CAPACITY & TRAINING | - | - | - | 2000 | - | 2000 | - | 2500 | - | 2500 | - | 2500 | - | 11500
INTERNATIONAL LINKAGES & COLLABORATIONS
UN (R&D and Standards) | Numbers | 60 | 1 | 60 | 1 | 60 | 1 | 60 | 1 | 60 | 1 | 60 | 5 | 300
Regional Associations/Collaborations | Numbers | 40 | 1 | 40 | 1 | 40 | 1 | 40 | 1 | 40 | 1 | 40 | 5 | 200
Bilateral & Multi Lateral Exchange Programs | Numbers | 100 | 1 | 100 | 1 | 100 | 1 | 100 | 1 | 100 | 1 | 100 | 5 | 500
TOTAL FOR INTERNATIONAL LINKAGES & COLLABORATIONS | - | - | - | 200 | - | 200 | - | 200 | - | 200 | - | 200 | - | 1000
INFRASTRUCTURE DEVELOPMENT | Number of Programs | 125 | 4 | 500 | 4 | 500 | 4 | 500 | 4 | 500 | 4 | 500 | 20 | 2500
TOTAL FOR INFRASTRUCTURE DEVELOPMENT | - | - | - | 500 | - | 500 | - | 500 | - | 500 | - | 500 | - | 2500
GRAND TOTAL | - | - | - | 9700 | - | 11700 | - | 12200 | - | 12200 | - | 12200 | - | 58000
*** No funds are planned under DST, as proposals will only be evaluated and approved for funding through TDB.
NOTE: The total project cost is inclusive of the PMU cost. The PMU cost should not exceed 2.5% of the total project cost.

8.9 DIRECT AND INDIRECT BENEFITS
The envisaged outputs/benefits and the possible outcomes of the project are summarized below in table 8.7.

TABLE 8.7: ENVISAGED OUTPUTS/BENEFITS AND THE POSSIBLE OUTCOMES

R&D PROMOTION

PROJECT COMPONENT: Open Sky Research on Big Data
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with industry
OUTPUTS:
• New tools created
• New solutions created
OUTCOMES:
• Patents
• Recognition and acceptability of Indian talent
• New BDA application areas
• New revenue channels
• Increased business in BDA
• Closer interaction between industry & DST

PROJECT COMPONENT: Cluster Based Network Programs
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with industry
OUTPUTS:
• New tools created
• New solutions created
OUTCOMES:
• Patents
• Recognition and acceptability of Indian talent
• New BDA application areas
• New revenue channels
• Increased business in BDA
• Closer interaction between industry & DST

PROJECT COMPONENT: International Collaborative Research Program
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with international agencies & industry
OUTPUTS:
• New knowledge & experiences
• New tools created
• New solutions created
OUTCOMES:
• Patents
• Recognition and acceptability of Indian talent
• New BDA application areas
• New revenue channels
• Increased business in BDA
• Documentation of the new knowledge & experience
• Closer interaction between international agencies, industry & DST

CENTRES OF EXCELLENCE

PROJECT COMPONENT: Centres of Excellence
INPUTS:
• Selection of the theme for CoE
• Preparation of the guidelines for implementation
• Selection of the implementing agency
• Funding
• Supervision
OUTPUTS:
• Centre of Excellence established
OUTCOMES:
• Higher project success rate
• Reduced costs for professional services, management overhead and TCO
• Reduced gap between Business and IT, improving time to market and responsiveness to change
• Best Practices

SKILL DEVELOPMENT, CAPACITY & TRAINING

PROJECT COMPONENT: Fellowships Based UG/PG and PhD
INPUTS:
• Selecting students
• Performance evaluation
OUTPUTS:
• Availability of skilled human resources
OUTCOMES:
• Employment generation
• Placement of the certified resources
• Acceptance by employers
• Increased business in BDA
• Closer interaction between industry & DST

PROJECT COMPONENT: Short Term Training Programs for Faculty
INPUTS:
• Designing of the courses
• Selecting implementing agency
• Selecting trainees
• Administering courses
• Performance evaluation
OUTPUTS:
• Availability of certified trainers for further training of others
OUTCOMES:
• Increased number of training programs could be organized
• Increased availability of up-skilled human resources
• Increased availability of certified resources for deployment

PROJECT COMPONENT: On-Line Training Programs
INPUTS:
• Designing of the courses
• Selecting implementing agency
• Selecting students
• Administering courses
• Performance evaluation
OUTPUTS:
• Availability of up-skilled human resources
• Availability of certified resources for deployment
OUTCOMES:
• Placement of the certified resources in more responsible jobs
• New/additional work areas undertaken by employers
• Better salaries/promotion offered by employers
• Increased business in BDA
• Closer interaction and higher confidence level between industry & DST

PROJECT COMPONENT: National Workshops/Conferences; Collaborative International Conferences
INPUTS:
• Identification of BDA areas/themes
• Development of the guidelines and contents
• Receiving workshop/conference research papers
• Selection of the papers
OUTPUTS:
• New knowledge and experience gained
OUTCOMES:
• Adapting and implementing the newly gained knowledge and experience in the Indian BDA ecosystem
• Documentation of the new knowledge & experience

PROJECT COMPONENT: Entrepreneurship Development
INPUTS:
• Scrutinizing the proposal
OUTPUTS:
• New venture created
OUTCOMES:
• Employment generation
• Increased business in BDA
• New BDA application areas
• Closer interaction between industry & DST

INTERNATIONAL LINKAGES & COLLABORATION

PROJECT COMPONENT: UN (R&D and Standards); Regional Associations/Collaborations; Bilateral & Multi Lateral Exchange Programs
INPUTS:
• Identification of areas, institutions and countries
• Tying up with institutions and countries
• Preparation of the guidelines for implementation
• Scrutiny of proposals for participation
OUTPUTS:
• New knowledge and experience gained
OUTCOMES:
• Adapting and implementing the newly gained knowledge and experience in the Indian BDA ecosystem
• Documentation of the new knowledge & experience

INFRASTRUCTURE DEVELOPMENT

PROJECT COMPONENT: Infrastructure Development
INPUTS:
• Preparation of the guidelines for implementation
• Selection of the implementing agency
• Funding
• Supervision
OUTPUTS:
• Upgraded infrastructure available for R&D at some Centres but usable by multiple agencies
OUTCOMES:
• Higher project success rate
• Reduced costs for infrastructure overhead and TCO
• Improved R&D facilities to generate Best Practices

8.10 EVALUATION PARAMETERS
Many of the initiatives taken in the project may not be quantifiable, and the impact of the project will have to be understood through its qualitative aspects. However, the following is a list of possible evaluation parameters and measurable indicators for understanding the results of the project:
• Number of research papers published
• Number of new ventures created
• Number of Centres of Excellence established
• Number of new tools created
• Number of new solutions created
• Number of Best Practices developed
• Number of fellowships awarded
• Number of PhDs
• Number of trainers trained
• Number of implementing agencies selected
• Requests received from industry as a result of closer interaction between industry & DST
• Number of training programs organized
• Number of trainees trained
• Number of the
new BDA application areas identified
• Number of Business Models to deploy BDA identified
• Number of tie-ups with business
• Number of proposals received for Venture capital/seed money etc.
• Number of new Revenue channels generated
• Number of Patents registered
• Awards and Recognitions won
• Number of international collaborative research projects started/completed
• Number of cluster based network projects started/completed
• Number of Open sky research projects undertaken on Big Data
• Number of Infrastructure projects started/implemented
• Number of online programs launched
• Number of participants benefited through online programs
• Number of national workshops/conferences organized
• Number of collaborative international conferences organized

8.11 PROJECT MONITORING AND MIS

Monitoring, MIS & Evaluation is a process of continued gathering of information and its analysis, in order to determine whether progress is being made towards pre-specified goals and objectives, and to highlight whether there are any unintended (positive or negative) effects from a project and its activities. Monitoring, MIS and Evaluation are closely related concepts that are distinct but complementary. Monitoring and MIS is the continuous collection of data on specified indicators to facilitate decision making on whether an intervention (project, program or policy) is being implemented in line with the design, i.e. its activity schedules and budget; Evaluation is the periodic and systematic collection of data to assess the design, implementation and impact in terms of effectiveness, efficiency, distribution and sustainability of outcomes and impacts. The concept is shown in a schematic diagram in figure 8.6 below.
FIGURE 8.6: MONITORING, MIS & EVALUATION

Monitoring and Evaluation systems provide the project owners and the other stakeholders with regular information on progress relative to targets, and this supports:

Accountability: demonstrating to the funding agency, beneficiaries and implementing partners that expenditure, actions and results are as agreed or can reasonably be expected in the situation.

Operational management/Implementation: provision of the information needed to co-ordinate the human, financial and physical resources committed to the project and to improve performance.

Strategic management: provision of information to inform the setting and adjustment of objectives and strategies.

Capacity building: building the capacity, self-reliance and confidence of beneficiaries and implementing staff and partners to effectively initiate and implement development initiatives.

Benefits at the project level:
• Provide regular feedback on project performance and show any need for ‘mid-course’ corrections
• Identify problems early and propose solutions
• Monitor access to project services and outcomes by the target population
• Evaluate achievement of project objectives
• Incorporate stakeholder views and promote participation, ownership and accountability

The key indicators: Indicators may be qualitative or quantitative variables that measure project performance and achievements. Indicators are developed for all levels of project logic, i.e. indicators are needed to monitor progress with respect to inputs, activities, outputs, outcomes and impact, and to give feedback on areas of success and where improvement is required. For the project, the monitoring, MIS and evaluation indicators are explained in table 8.8 below.
TABLE 8.8: MONITORING, MIS AND EVALUATION INDICATORS

Indicator: Input indicators
Purpose & Description: Input indicators are quantified and time-bound statements of the resources financed by the project, and are usually monitored by routine accounting and management records. They are mainly used by managers closest to implementation, and are consulted frequently (daily or weekly). They are often left out of discussions of project monitoring, though they are part of essential management information. An accounting system is needed to track expenditures and provide data on costs for analysis of the cost effectiveness and efficiency of project processes and the production of outputs.

Indicator: Process indicators
Purpose & Description: Process indicators monitor the activities completed during implementation, and are often specified as milestones or completion of sub-contracted tasks, as set out in time-scaled work schedules. One of the best process indicators is often to closely monitor the project's procurement processes. Every output depends on the procurement of goods, works or services, and the process has well defined steps that can be used to monitor progress by each package of activities.

Indicator: Output indicators
Purpose & Description: Output indicators monitor the production of goods and delivery of services by the project. They are often evaluated and reported with the use of performance measures based on cost or operational ratios. The indicators for inputs, activities and outputs, and the systems used for data collection, recording and reporting are sometimes collectively referred to as the project physical and financial monitoring system, or management information system (MIS). The core of an M&E system and an essential part of good management practice, it can also be referred to as ‘implementation monitoring’.

Indicator: Outcome indicators
Purpose & Description: Outcome indicators are specific to a project’s purpose and the logical chain of cause and effect that underlies its design. Often achievement of outcomes will depend at least in part on the actions of beneficiaries in responding to project outputs, and indicators will depend on data collected from beneficiaries.

Indicator: Impact indicators
Purpose & Description: Impact indicators usually refer to medium or long-term developmental change to which the project is expected to contribute. Dealing with the effects of project outcomes on beneficiaries, measures of change often involve statistics concerning economic or social welfare, collected either from existing regional or sectoral statistics or through relatively demanding surveys of beneficiaries.

Selection of Indicators for Monitoring, MIS and Evaluation: Considering the expectations from the current project, an indicative selection of the indicators is given in table 8.9 below.

TABLE 8.9: INDICATIVE SELECTION OF THE INDICATORS FROM MONITORING, MIS AND EVALUATION

Indicator Type: Input indicators, Process indicators, Output indicators
Important Indicators for the Project components:
• Number of the themes selected/finalized for CoE
• Progress of preparation of the Guidelines for implementation of CoE
• Number of Courses design completed for UG Students
• Number of Courses design completed for short term courses
• Training of Trainers
• Progress of Establishment of CoEs
• Number of implementing agencies for CoE, Short term courses
• Funds released and due for various activities
• Progress of preparation/review of Guidelines for various activities
• Progress in preparation of guidelines for entrepreneurships approvals
• Progress in Identification of the application areas, Scrutinizing the proposal and Tying up with international organizations, countries and industry for various activities
• Progress in Identification of the application areas for the Open sky research on Big Data
• Number of Centre of Excellence established
• Number of students/candidates selected for UG/PG fellowships, short term courses etc.
• Performance Results of the UG/PG Students/candidates who attended trainers courses
• Number of entrepreneurship proposals sanctioned
• Number of proposals Scrutinized and tie ups finalized with industry for cluster networks
• Number of proposals Scrutinized and tie ups finalized with industry for Open sky research on Big Data
• New tools created and new Solutions created towards the Open sky research on Big Data

Indicator Type: Outcome indicators
Important Indicators for the Project components:
• Number of Best Practices developed by CoEs
• Number of candidates who attended short term courses
• Number of New/additional work areas provided to candidates/students by the employers
• Additional capacity for training created by training of trainers courses
• Employment generation, Increased business in BDA and New BDA application areas developed due to sanctioned entrepreneurship projects
• Number of Patents and Recognition achieved by way of creation of New tools and New Solutions
• New BDA application areas and New Revenue channels created through New tools and New Solutions
• New BDA application/research areas created through international collaboration activities
• Number of Patents and Recognition achieved by way of Open sky research on Big Data
• New BDA application areas and New Revenue channels created through Open sky research on Big Data

Indicator Type: Impact indicators
Important Indicators for the Project components:
• Number of Research papers published
• Number of new ventures created
• Number of Centres of Excellence established
• Number of new tools created
• Number of new Solutions created
• Number of Best Practices developed
• Number of fellowships awarded
• Number of Trainers Trained
• Number of implementing agencies selected
• Feedback on the acceptance of employed resources from the employers
• Salaries offered by employers
• Requests received from industry as a result of closer interaction between industry & DST
• Number of New/additional work areas
undertaken by employers
• Promotion / increased salaries offered by employers
• Number of training programs organized
• Number of trainees trained
• Number of types of trainings organized
• Number of the new BDA application areas identified
• Number of Business Models to deploy BDA identified
• Number of tie-ups with business
• Number of proposals received for Venture capital/seed money etc.
• Number of new BDA application areas generated
• Number of new Revenue channels generated
• Increased business in BDA
• Number of Patents registered
• Awards and Recognitions won
• Number of international collaborative research projects started/completed
• Number of cluster based network projects started/completed
• Number of Open sky research projects undertaken on Big Data
• Number of Infrastructure projects started/implemented
• Number of online programs launched
• Number of participants benefited through online programs
• Number of national workshops/conferences organized
• Number of collaborative international conferences organized

8.12 ROLES OF VARIOUS STAKEHOLDERS

The Big Data Analytics ecosystem consists of a large number of stakeholder types. With its continuous spread there is hardly any entity which is left untouched. The following are some of the important stakeholders:
o Researchers
o Data Generators
o End Users
o Service Providers
o Platform Providers
o Data Curators
o Software Professionals
o Skill/Job Seekers
o Academicians
o Researchers
o Trainees/Students
o Trainers
o Customers/Common Citizen
o Funding Agencies/Government Departments
o Other Government Departments
o Industry Associations
o National and International Regulatory Bodies
o Professionals, Practitioners and those interested in adjacent technologies

For the purposes of identifying the roles of the stakeholders, the above listed stakeholders could be grouped as given below:
i. Funding Agencies
ii. Industry/Corporate/Industry Associations/Regulatory Bodies
iii.
Trainers/Academics/Researchers
iv. Employment Seekers

In order to achieve the objectives of the DST initiative as mentioned earlier in this chapter, the following roles may be assigned to the stakeholder groups listed above.

Funding Agencies: The main role of these stakeholders would be financing the skill upgradation, R&D and training in the area of Big Data Analytics. The funding would not be provided by DST alone; industry would also partner DST in funding the overall development activities, as mentioned earlier, of the dASTRA ecosystem in the country.

Industry: Industry will be both benefactor and beneficiary of BDA development in the country. Industry has to provide all assistance, help and support to DST’s dASTRA initiative, particularly by participating in research, upskilling, employment, funding, developing standards, and establishing and improving the skill calibration and certification activities.

Trainers: The role of this community would be to set the curriculum, contents, course materials etc. for the training, teaching, skilling and upskilling of the large number of job seekers as well as those who are already in jobs. They will have to participate, in collaboration with industry, in developing standards and in establishing and improving the skill calibration and certification activities. This will be in addition to the research initiatives in dASTRA, so as to maintain an edge in the national and international competitive situation.

Employment Seekers: This will be, perhaps, the largest group within the dASTRA ecosystem. The most important role they need to play is to make a determined effort to pick up the competencies in the BDA area and then to use them for the overall benefit of society.

8.13 OTHER IMPORTANT ISSUES FOR CONSIDERATION

There are many other important issues that need to be considered by the project, with the necessary initiatives taken during the project life.
As many of these issues will need coordination between DST and other governmental and other agencies, it is difficult to list all the issues at present, and it is also not possible to prepare an action plan for them. Moreover, many more issues will come up from time to time. An indicative list of such issues is given below.

i. Analytics Maturity Model: Efforts should be made to take up the development of the Analytics Maturity Model. In this exercise, apart from investigating the international practices and standards, the practitioners, service providers and users should all be involved in the development work.

ii. Organizing Contests for Analytics Models: When an individual or a group of professionals works in the context of a competition, they try a few things, get to the top of a leader board, and are pretty happy with themselves. Using competition to perfect various types of Big Data models will be a good idea. To encourage developers and service providers, periodical contests may be organized. DST can host contests for data scientists. Companies that want problems solved post them, along with relevant data sets, on the site. Anyone can submit a solution, and each competitor is ranked on a leader board throughout the competition. Substantial prize money can either be provided by DST or be sponsored by the company that is seeking the best solution. The contests may be theme based and vertical based. Participants may be supplied with a real data set of ‘transactions’ related to a real life situation. These may include ‘actions’ by the subjects involved in the transactions, real life ‘results’ of the transactions and ‘values’ of other parameters etc. Using the provided data, the participants may be asked to build a model to analyze the data and address one or more of the given questions/problems faced by the sponsor.

iii.
Big Data Governance: As the domain of Big Data Analytics is relatively new for the country and the practitioners, there is a need to provide support systems, policies and procedures for the governance of Big Data in the country.

iv. Big Data Best Practices: Encouragement should be provided for the development, documentation and dissemination of Best Practices in the domain of Big Data. These best practices may be nationally or internationally developed.

v. Creation of a National Advisory Body: Creation of an internal organization structure, with involvement of the stakeholders. This could be an extension of the current PDAC in terms of its scope and advisory role.

vi. Catchy name for Big Data Analytics in India: In order to create awareness and to popularize Big Data Analytics, a competition may be organized to select a catchy name/acronym for Big Data in India.

vii. Big Data Competency Framework: The primary responsibility of the Data Scientist - Big Data is the design, development and support of applications, very large databases and infrastructure for storing structured and unstructured company data, for use in analyzing business activities, detecting patterns and reporting trends. Over a period of time a large number of professionals will be trained and will be available for work in the area of Big Data Analytics. Also, a large number of organizations will be initiated into this area. In order to provide a firm footing both to the professionals and to the users, there will be a need to create, standardize and popularize a Big Data Competency Framework for the entire Big Data ecosystem. Some of the areas that may be taken up for developing the Big Data Competency Framework are as follows:
• Application Architect
• Application Developer
• Senior Application Developer
• Big Data Manager
• Chief Data Officer
• Data Architect
• Data Engineer
• Data Scientist
• Data Visualization Specialist
• System Administrator

viii.
Big Data Regulatory Framework: Over a short period of time the usage of Big Data will spread in the country. That will eventually give rise to a number of issues related to the usage of Big Data and the related legal aspects, especially concerning personal information, privacy of personal data etc. Therefore, there is a need to think about the ethical and regulatory framework around big data, as it will increasingly impact the lives of individuals and underpin customer service, innovation, quality and business operations. In a world of big data, decisions about individuals will increasingly be made on the basis of patterns and profiling. Big data therefore has deep social implications concerning when we want to prejudge people based on data about past behaviour, personal characteristics and similarities to others. All this is strongly linked to debates about privacy. Data profiling, especially where large amounts of personal data are aggregated together, provides very deep insights into individuals. The benefits of these activities are cheap (or free) personalized services, and to date many consumers have been content with this trade-off. However, greater concern may be shown as analysis goes deeper into our activities and personal lives. Organizations in both R&D and business will need to have appropriate governance to manage the risks and ensure data is used in acceptable ways. Policymakers also need to consider the regulatory framework carefully, and encourage the range of skills needed to exploit big data. To provide lasting solutions to such problems, DST may take initiatives as appropriate for developing a Big Data Regulatory Framework.

ix. Organization & Regulation of CoEs: It is envisaged that one of the important activities of the project would be the creation of CoEs. To achieve good results it will be desirable to define the CoEs’ role, their organization and the mechanisms of sharing the outcomes.
It is advisable that the outcomes be measured based on the adoption of the solutions. The performance of the CoEs should be reviewed periodically.

x. Encouragement to MSME: Considering the interest being taken by individuals in developing Big Data based solutions, it is envisaged that a good number of them will be technically qualified to take up the Big Data challenges offered by governments and government agencies; however, they or their small start-ups and/or outfits will lack financial worthiness in terms of annual turnover etc. It is suggested, therefore, that some mechanism be developed and implemented to give due recognition to this handicap on the part of the MSMEs. To achieve this, some suggestions are given below:
• For fixing a turnover cap, the contracting organizations may be asked and/or mandated to ask that a supplier company's turnover be just more than twice the contract value.
• Public bodies may also be asked to limit the number of lots a supplier can win. This ability to break contracts into lots will encourage more SME participation.
• Suppliers who have performed poorly on a previous contract can be excluded from future competitions by the contracting authority.
• Public bodies should also take into account the education, experience and achievements of ‘individuals’ at the award stage of the competition.

Some other issues that need the attention of DST, in order to encourage the participation of MSMEs, are lack of knowledge and awareness among MSMEs; capacity issues; and complex procurement processes.

xi. Types, grades, competency levels and probable requirement of Big Data Professionals: It will be desirable that DST, in cooperation with organizations like NASSCOM, NSDC etc., takes the initiative in identifying the various levels/types/grades/competencies and numbers of Big Data ‘professionals’ needed by industry, business and academia.

xii.
Setting up uniform curricula for Big Data professionals: Data science’s learning curve is formidable. To a large extent one will need a degree, or something substantially like it, to prove commitment to this career. There are run-of-the-mill certificates, and other qualifications and degrees, in data-science-related fields. What is most important to the modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in a core curriculum of skills, tools and approaches. Big data initiatives will thrive if all data scientists have been trained and certified on a basic minimum curriculum with foundations such as (a) Paradigms and practices, (b) Algorithms and modelling, (c) Tools and platforms, (d) Applications and outcomes etc. Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. It is, therefore, necessary to make sure that data scientists acquire certifications and degrees that reflect them actually developing statistical models that use real data and address substantive business issues. A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall. To achieve this, DST may, in collaboration with academic partners, businesses, R&D organizations and industry, develop a curriculum (that is, an approach, learning model and course content) that reflects the mix of technical and problem-solving skills necessary to prepare students/professionals for Big Data and analytics careers across all industries.

xiii.
Big Data Portal: Creating a platform (PORTAL) where all the stakeholders can interact, and give and seek what they have or want from the Big Data ecosystem.

xiv. Publication of Research: The use of big data in development is largely being driven by opportunistic partnerships between private companies, researchers and academics. Data exhaust is often owned by the private sector, especially mobile phone operators. Online activity, sensing data, and crowdsourced information are often publicly accessible, but the size and complexity of these data sets require specialized analytical skills. Because of this, and because big data analytics is still in a nascent phase methodologically, professional researchers and academics currently have a high degree of influence on how big data is actually utilized. Some of these professional researchers and academics work in-house for the interested firms, but most are in public and private university systems. DST may play a key role in publicizing the potential role of big data in development. It can fund big data research through a variety of financing streams, and take the initiative in creating forums where big data researchers can exchange ideas and data sets. The current landscape of big data is, overall, less the result of agenda setting by a small group of politically and economically powerful institutions than it is the unplanned aggregate of diverse projects focusing on those aspects of big data analytics that are methodologically and legally tractable. In the short term, big data projects will need to rely on complementary “ground-truthing” data from traditional sources in order to assess the nature and magnitude of bias in big data sets. Such validation procedures are necessary for end-users of the data, including policymakers, to interpret the contextual meaning of big data across cultures and economies.
In addition, big data sets are not by virtue of their size exempt from the conventional requirements of good theoretical and statistical practice, including careful problem identification, model construction, and hypothesis testing. Therefore, such an initiative on the part of DST will bring the researchers, users, policy makers and the public at large closer to each other, and together they will be instrumental in making the BIG USE of Big Data.

8.14 ACTION PLAN

Implementation of the project will involve, apart from the many administrative actions, the following major activities:
• Establishment of PMU
• Preparing Guidelines
• Calling for proposals
• Selection of agencies
• Assigning/sanctioning projects
• Review of Schemes
• Yearly review of progress
• Mid Term Review
• Preparation and Publication of progress reports

As is evident, some of the activities are one-time; the remaining activities need to be carried out periodically or as the need arises. The Project Head will undertake the activities in time so that the aims of the project are achieved. Considering the major activities, an action plan for the implementation of the project is given below in table 8.10.

TABLE 8.10: TENTATIVE IMPLEMENTATION ACTION PLAN
MAJOR ACTIVITY: Establishment of PMU; Preparing Guidelines; Calling for proposals; Selection of agencies; Assigning/sanctioning projects; Review of Schemes; Yearly review of progress; Mid Term Review; Preparation and Publication of progress reports
TIME AXIS: YEAR 1 by MONTHS (1 to 12); YEAR 2 to YEAR 5 by quarters (Q1 to Q4)

9. CONCLUSIONS

Scientific progress is a result of relentless academic research endeavour. The scientific community has been focused for a while now on the growing challenges of Data Science in a number of disciplines.
This immense repository of past/current academic knowledge is increasing at an exponential rate, and handily qualifies as Big Data in terms of volume, variety and velocity of growth. The estimation of the veracity of this data also presents challenges. As the amount of knowledge in an academic field grows, a quick assessment of the state-of-the-art in any sub-field becomes that much harder. One way of accelerating the process of discovery is to significantly enhance current search capabilities to support deep scientific queries. This includes:
• Improving the efficiency and depth of search by enabling segmentation and recognition of all the components of a traditional academic research publication, including graphs, tables, and diagrams.
• Developing tools to integrate various sources of information on any topic, not just from the textual content but often from parallel channels such as video, speech, and the web, in order to gain a comprehensive understanding of the topic; and, most importantly,
• Making unapparent connections between methods, features, data, constraints, and parameters across the spectrum of reported scientific data using advanced data mining approaches.

Keeping in view the fast future growth of Business Analytics across various applications, it is imperative to chalk out a strategic Road Map in this direction to reap the benefits for the overall development of the country. The present study, through a combination of primary and secondary research, has established the need for an urgent initiative on the part of DST to (i) strengthen the dASTRA Ecosystem of the country, and (ii) take steps to nurture the same so as to leverage the uniquely advantageous position of the country’s manpower, not only in scientific research and development but in business and industry also. The project is to be implemented in five years and the cost has been estimated to be around Rs. 580 Crores.
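As a rough cross-check of the cost estimate, the sketch below adds up the year-wise grand totals given in the cost computation of Chapter 8 (in Rs. lakhs) and converts the sum to crores. It also works out the ceiling implied by the note that the PMU cost should not exceed 2.5% of the total project cost. The variable names are illustrative and are not part of the report; the only outside assumption is the standard conversion of 100 lakhs to 1 crore.

```python
# Year-wise grand totals (Rs. lakhs), Year 1 to Year 5, as listed in the
# cost computation table of Chapter 8.
yearly_totals_lakhs = [9700, 11700, 12200, 12200, 12200]

LAKHS_PER_CRORE = 100  # standard conversion: 1 crore = 100 lakhs

# Total project cost in lakhs and in crores.
total_lakhs = sum(yearly_totals_lakhs)
total_crores = total_lakhs / LAKHS_PER_CRORE

# PMU ceiling: 2.5% of the total project cost, written as 25/1000
# so that the arithmetic stays exact.
pmu_cap_lakhs = total_lakhs * 25 / 1000

print(f"Total project cost: Rs. {total_lakhs} lakhs (Rs. {total_crores:.0f} crores)")
print(f"PMU cost ceiling (2.5%): Rs. {pmu_cap_lakhs:.0f} lakhs")
```

The yearly figures sum to Rs. 58,000 lakhs, which agrees with the estimate of about Rs. 580 crores, and the 2.5% ceiling on the PMU cost corresponds to Rs. 1,450 lakhs.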
The major activities of the project will include (i) R&D PROMOTION through Open Sky Research, Cluster Based Network Programs and the International Collaborative Research Program, (ii) ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE, (iii) SKILL DEVELOPMENT, CAPACITY & TRAINING through Fellowship Based UG/PG and Ph D, Short Term Training for Faculty, On-Line Programs, National Workshops & Conferences, Collaborative International Conferences and Entrepreneurship Development, (iv) INTERNATIONAL LINKAGES & COLLABORATIONS through UN (R&D and Standards), Regional Associations/Collaborations and Bilateral & Multi Lateral Exchange Programs, and (v) INFRASTRUCTURE DEVELOPMENT.

LIST OF ABBREVIATIONS

AaaS : Analytics as a Service
BA : Business Analytics
BD : Big Data
BDA : Big Data Analytics/Big Data and analytics
BDaaS : Big Data Analytics as a Service
BDI : Big Data Initiative
CapEx : Capital Expenses
CDC : Consultancy Development Centre
CEO : Chief Executive Officer
CODATA : Committee on Data for Science and Technology
CoE/COE : Centre of Excellence
CSI : Computer Society of India
dASTRA : Data Science, Technology, Research & Applications
DST : Department of Science & Technology
EIU : Economist Intelligence Unit
EMR : Electronic Medical Records
ESDM : Electronic System Design and Manufacturing industry
EU : European Union
HRD : Human Resource Development
ICSU : International Council for Science
ICT : Information & Communication Technology
IESA : India Electronics and Semiconductor Association
IoT : Internet of Things
IPR : Intellectual Property Rights
ISACA : Information Systems Audit and Control Association
IT : Information Technology
KPI : Key Performance Indicators
KPO : Knowledge Process Outsourcing
M&E : Monitoring & Evaluation
m2m : Machine to Machine
MIS : Management Information System
NASSCOM : National Association of Software and Services Companies
NDSAP : National Data Sharing and Accessibility Policy
NSDC : National Skill Development Corporation
OGD : Open Government Data
OpEx : Operating Expenses
OSTI : Office of Scientific and Technical Information (USA)
PCAST : President’s Council of Advisors on Science and Technology, USA
PG : Post Graduate
PMU : Project Monitoring Unit
R&D : Research & Development
ROI : Return on Investment
RPO : Recovery Point Objective
RTO : Recovery Time Objective
S&T : Science & Technology
SaaS : Software as a Service
SEZ : Special Economic Zone
SMB : Small and Medium Businesses
SME : Small and Medium Enterprises
SW : Software
TCO : Total Cost of Ownership
TDB : Technology Development Board of DST
UG : Under Graduate
UIDAI : Unique Identification Authority of India
UN : United Nations
UNDP : United Nations Development Programme
WDS : World Data System
WEF : World Economic Forum

LIST OF TABLES

Table 3.1: Models for Research
Table 3.2: CoE Value Proposition
Table 3.3: Examples of Companies & Institutions Providing Solutions to Generate, Analyze & Visualize Omics & Clinical Data
Table 3.4: Data Quality Sub Dimensions
Table 5.1: 2012 Worldwide Big Data Revenue by Top 10 Vendors
Table 5.2: Big Data Market Forecast Broken Down By Market Component through 2017
Table 5.3: Comparison Between AaaS & Internal BD Project
Table 8.1: Strategic Thought Process
Table 8.2: Consultative Meetings & Interactive Workshops Organized
Table 8.3: Responses Received From The Stakeholders
Table 8.4: Tentative Targets
Table 8.5: Suggested Project Personnel
Table 8.6: Computation Of Costs – All Costs In Rs. Lakhs
Table 8.7: Envisaged Outputs/Benefits And The Possible Outcomes
Table 8.8: Monitoring, MIS and Evaluation Indicators
Table 8.9: Indicative Selection of the Indicators From Monitoring, MIS and Evaluation
Table 8.10: Tentative Implementation Action Plan

LIST OF FIGURES
Figure 1.1: Data Science & Business
Figure 1.2: Data Science Ecosystem
Figure 1.3: Seven Dimensions of Big Data
Figure 1.4: Parameters Used for Benchmarking Countries on Open Data Initiatives
Figure 1.5: Benchmarking of Open Data Initiatives, Select Countries, 2012
Figure 1.6: Number of Graduates with Deep Analytical Training
Figure 2.1: Innovative Cycle
Figure 2.2: 6 Illustrative Examples of Big Data for Development
Figure 2.3: Major Challenges Confronting Big Data for Development
Figure 2.4: Australian Organizations Lag in the Use of Many Data Sources
Figure 2.5: Australian Organizations However Lead in the Use of Some Data Sources
Figure 2.6: Organizations Using Big Data to Improve the Customer Experience
Figure 2.7: Categories of Business Processes That Can Benefit from Big Data Projects
Figure 2.8: View of the Future of Big Data
Figure 2.9: Attitude towards Big Data
Figure 2.10: Personal Knowledge of Big Data
Figure 2.11: Priority Application of Big Data
Figure 2.12: Internal Obstacles in Use of Big Data
Figure 2.13: CEO’s View of Big Data
Figure 2.14: Strategies for Obtaining Optimum Value from Big Data Tools
Figure 2.15: How the Organization Addresses Human Aspect of Big Data
Figure 3.1: BDA CoE Function Chart
Figure 3.2: Governance Objective: Value Creation
Figure 5.1: Analytics Applications, and Classification of Analytics Industry
Figure 5.2: Conceptual Diagram of AaaS
Figure 8.1: Approach & Methodology
Figure 8.2: DST’s Vision of dASTRA in India
Figure 8.3: Conceptual Model of Six Months Student’s Projects Linked with Big Data
Figure 8.4: Suggested Capacity Building Model
Figure 8.5: Suggested Organization Structure
Figure 8.6: Monitoring, MIS & Evaluation

ANNEXURE
Annexure 1: Set of 4 Questionnaires
Annexure 2: List of Participants of the Consultative Meetings and Interactive Workshops
Annexure 3: Consolidated Responses from Data Generators
Annexure 4: Consolidated Responses from Data Researchers
Annexure 5: Consolidated Responses from End Users
Annexure 6: Consolidated Responses from Service Providers

ANNEXURE 1
SET OF 4 QUESTIONNAIRES

QUESTIONNAIRE FOR DATA GENERATORS (DG)

PART A GENERAL ORGANIZATIONAL PROFILE
1. Name & Address of the Organization/Department:
Telephones: Fax: E-mail: Website:
2. Name & Designation and Address of the CEO/HOD:
Telephones: Mobile: E-mail:
3. Name, Designation and Address of the Respondent:
Telephones: Mobile: E-mail:
4. Date & Place:

PART B CURRENT STATUS, STRATEGY & PROFILE
n. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple answers are possible)
SEGMENT/CATEGORY          YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
o. Mention the Data Segments in which you are active:
p. Mention the Data Segments that you Outsource:
q. Is Your Organizational Data Available At Data.Gov.In?
• Yes
• No
• Do not know
r. What are your expectations from Big Data Analytics in the next 10 years?
s. In our organization, Big Data management is not viewed strategically at senior levels of the organisation.
• Agree
• Disagree
• Don’t know/Not applicable
t. There is not enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable

PART C MANPOWER, SKILL GAPS AND TRAINING NEEDS
e. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
f. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc.)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
g. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc.)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc.)
• Any Other (Please Specify)
h. For the purposes of the Capacity Building initiatives, please suggest the programs and other details as given below:
Name of the Program | Who should be the participants | Modality of Delivery | Coverage | Duration

PART D PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
Purposely left blank as there are no questions under this head for Data Generators.

PART E AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
g. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your Organization’s Business? (Multiple answers possible)
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
h. How well are these areas developed in your organization? (Answer in terms of Very well, Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
i. What do you predict will happen to the number of Big Data specialists in your organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
j. Please make suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval

PART F TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
1. What support do you need from the Government?
2. Looking specifically at your organization/department, how would you characterise the amount of data available to support decision-making?
• Too much
• Enough
• Not enough
• Don’t know
3. Mention the Challenges faced by you in GENERATING Data.
4. Mention the Challenges faced by you in CLEANING the Generated Data.
5. Do you have the DATA CURATION function in-house? If not, please mention the reasons.

PART G SECURITY CONCERNS
d. Your organization has taken initiative in which of the following areas related to Big Data Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
e. Please provide your views on the IPR Issues as related to Big Data Analytics.
f. Please provide your views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics.

PART H ANY OTHER INFORMATION YOU MAY LIKE TO SHARE

QUESTIONNAIRE FOR END USERS (EU)

PART A GENERAL ORGANIZATIONAL PROFILE
5. Name & Address of the Organization/Department:
Telephones: Fax: E-mail: Website:
6.
Name & Designation and Address of the CEO/HOD:
Telephones: Mobile: E-mail:
7. Name, Designation and Address of the Respondent:
Telephones: Mobile: E-mail:
8. Date & Place:

PART B CURRENT STATUS, STRATEGY & PROFILE
u. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple answers are possible)
SEGMENT/CATEGORY          YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
v. Mention the Data Segments in which you are active:
w. Mention the Data Segments that you Outsource:
x. What are the Organization's Budget Provisions for Big Data Usage (in Lakhs of Rupees)?
• 2014 – 15
• 2015 – 16
• 2016 – 17
y. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
z. How would you describe your organization's competitive position?
• Underperforming industry / market peers
• On par with industry / market peers
• Outperforming industry / market peers
• Don't know
aa. What is the current state of big data activities within your organization?
• We have not yet started to consider big data's use within our organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
bb. How useful have the Big Data Analytics Applications been in the past in your organization?
cc. What are your expectations from Big Data Analytics in the next 10 years?
dd. In our organization, Big Data management is not viewed strategically at senior levels of the organisation.
• Agree
• Disagree
• Don’t know/Not applicable
ee. There is not enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable

PART C MANPOWER, SKILL GAPS AND TRAINING NEEDS
i. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
j. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc.)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
k. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc.)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc.)
• Any Other (Please Specify)
l. For the purposes of the Capacity Building initiatives, please suggest the programs and other details as given below:
Name of the Program | Who should be the participants | Modality of Delivery | Coverage | Duration

PART D PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
l. To What Extent Do You Have Timely Access To The Information Needed To Do Your Job Successfully?
• To some extent
• To a great extent
• Completely
• Don't know
m. To what extent does information and business analytics create a competitive advantage for your organization within its industry or markets?
• Modest advantage
• On par with competitors
• Significant advantage
• Don't know
n. Which challenges inhibit your organization from acquiring and integrating data?
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficulty in sharing data internally and/or in integrating internal data across silos
• Security, privacy and/or malware concerns
o. What challenges inhibit your organization from analyzing data?
• Too much data to analyze; overwhelmed by data
• Lack of software/tools and/or software too difficult to use
• Lack of skills
• Inconsistent data across variety of source systems
• Customer privacy concerns
p. Which challenges inhibit your organization from acting on data insights and analytics?
• Lack of understanding of how to use analytics to improve the business
• Lack of skills to interpret and leverage the data
• Lack of software/tools that allow end-users to perform analytics themselves
• Too time consuming or costly to perform all the analytics desired
q. Your organization has taken initiative in which of the following areas related to Big Data Science & Technology? (Multiple answers possible)
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
• New Computational Models
• Data & Information Quality and New Data Standards
r. Your organization has taken initiative in which of the following areas related to Big Data Infrastructure? (Multiple answers possible)
• System Architectures, Design and Deployment
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• Big Data Open Platforms
s. Your organization has taken initiative in which of the following areas related to Big Data Search, Mining and Management? (Multiple answers possible)
• Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Computational Modeling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
• Mobility and Big Data
• Multimedia and Multi-structured Data-Big Variety Data
t. Your organization has taken initiative in which of the following areas related to Big Data Applications? (Multiple answers possible)
• Complex Big Data Applications in Science, Engineering, Medicine, Healthcare, Finance, Business, Law and Education
• Indian Traditional Knowledge
• Transportation
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data Analytics in Central and State Governments, Public Sector and Society in General
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of Big Data Government Deployments/Projects
u. What are your organisation’s three biggest impediments to using big data for effective decision-making?
• Too many “silos”: data is not pooled for the benefit of the entire organisation.
• Shortage of skilled people to analyse the data properly.
• Big data is not viewed sufficiently strategically by senior management.
• Something not on this list (please specify).
v. To what extent do you agree with the following statement: “The issue for us is now not the growing volumes of data, but rather being able to analyse and act on data in real-time.”
• Agree
• Disagree
• Don’t know/Not applicable
w. In which areas does your organization have Big Data applications? (Multiple answers possible)
• E-Commerce, e-Business, Online Operations (Web shops, etc.)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Supply chain management and logistics
• Information Technology
• Finance and administration
• HR and people development
• Risk management
• We do not have applications
• I don't know in which area(s) we have applications
• Other (please specify)

PART E AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
k. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your Organization’s Business? (Multiple answers possible)
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
l. How well are these areas developed in your organization? (Answer in terms of Very well, Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
m. What do you predict will happen to the number of Big Data specialists in your organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
n. Please make suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
o. As the country would like to keep ahead of the rest of the world in the area of Big Data applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive.
p. For the Researchers in the Big Data Discipline, what should be the thrust Areas:
• Immediately
• In the next 5-10 years

PART F TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
6. At what pace is the available data within your function updated or refreshed?
• As it is streamed in real-time
• Less than a day
• Less than a week
• Monthly or later
7. What support do you need from the Government?
8. Looking specifically at your organization/department, how would you characterise the amount of data available to support decision-making?
• Too much
• Enough
• Not enough
• Don’t know
9. How Well Does Your Organization Perform the Following Information and Analytic Tasks on a Scale of 1 to 5, Where 1=Poorly and 5=Very Well?
• Acquire and integrate data
• Analyze data
• Act on data-driven insights
10. Which data does your organization analyze in the context of Big Data applications? (Multiple answers possible)
• Numerical data (for statistics, predictions, etc.)
• Text (automated text analysis)
• Audio (voice, music)
• Images (automated image recognition)
• Video (automated video recognition)
• Don't know
• Other (please specify)
11. Does your organization apply advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
12. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
13. How does your organization develop Big Data applications?
• We mainly develop our applications internally.
• We mainly develop our applications externally (outsourcing).
• We do both: we develop internally as well as externally.
• Don’t know.
14. What are, in your opinion, the most important factors for successful Big Data implementations? Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues

PART G SECURITY CONCERNS
g. Your organization has taken initiative in which of the following areas related to Big Data Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
h. Please provide your views on the IPR Issues as related to Big Data Analytics.
i. Please provide your views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics.

PART H ANY OTHER INFORMATION YOU MAY LIKE TO SHARE

QUESTIONNAIRE FOR RESEARCHERS (RE)

PART A GENERAL ORGANIZATIONAL PROFILE
9. Name & Address of the Organization/Department:
Telephones: Fax: E-mail: Website:
10. Name & Designation and Address of the CEO/HOD:
Telephones: Mobile: E-mail:
11. Name, Designation and Address of the Respondent:
Telephones: Mobile: E-mail:
12. Date & Place:

PART B CURRENT STATUS, STRATEGY & PROFILE
ff. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple answers are possible)
SEGMENT/CATEGORY          YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
gg. Mention the Data Segments in which you are active:
hh. Mention the Data Segments that you Outsource:
ii. What are the Organization's Budget Provisions for Big Data Usage (in Lakhs of Rupees)?
• 2014 – 15
• 2015 – 16
• 2016 – 17
jj. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
kk.
What is the current state of big data activities within your organization? • We have not yet started to consider big data's use within our organization • We are in the process of developing a strategy / roadmap • We have started one or more pilots or proofs of concept • We are implementing big data technologies 186 ll. What are your expectations from Big Data Analytics in the next 10 years? mm. In our Organization the Big data management is not viewed strategically at senior levels of the organisation. • Agree • Disagree • Don’t know/Not applicable nn. There is not enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded. • Agree • Disagree • Don’t know/Not applicable 187 PART C MANPOWER, SKILL GAPS AND TRAINING NEEDS m. Identify the skills gaps within your functional area in dealing with data and analytics. • Visualization skills • Data integration skills • Data analysis skills • Data storage skills • Tooling / software skills n. How many Big Data experts does your organization employ and in which area? • Computer science: programming experts (R, Python, SQL, SAS, Java, etc) • Computer science: Artificial Intelligence and machine learning experts • Computer science: text, voice, music, image and video experts • Experts in statistics and econometrics • Experts in OR and applied mathematics • Other (please specify) o. What are the training needs of your organization? • Strategy courses on Big Data for top management • Computer science: programming courses (R, Python, SQL, SAS, Java, etc) • Computer science: text, image and video recognition courses • Computer science: machine learning and artificial intelligence courses • Statistics and econometrics courses • Operations research and applied mathematics courses • Application related courses (Big Data in marketing, finance, logistics, etc) • Any Other (Please Specify) p. 
For the purposes of the Capacity Building initiatives, please suggest the programs and other details as given below: Name of the Who should be Modality of Coverage Duration Program the participants Delivery 188 PART D PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION x. Your organization has taken initiative in which of the following areas related to Big Data Science & Technology? (Multiple answers possible) • Data streaming & Processing • Analysis of Unstructured/Semi-structured data • Visualization & Visual Analytics • Security & Privacy issues • New Computational Models • Data & Information Quality and New Data Standards. y. Your organization has taken initiative in which of the following areas related to Big Data Infrastructure ? (Multiple answers possible) • System Architectures, Design and Deployment • Programming Models • Software Techniques & Architectures in Cloud/Grid/Stream Computing • Big Data Open Platforms z. Your organization has taken initiative in which of the following areas related to Big Data Search, Mining and Management ? (Multiple answers possible) • Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia • Algorithms & Systems for Big Data Search • Data Acquisition, Integration, Cleaning & Best Practices • Visualization Analytics for Big Data • Computational Modeling & Data Integration • Cloud/Grid/Stream Data Mining-Big Velocity Data • Mobility and Big Data • Multimedia and Multi-structured Data-Big Variety Data aa. Your organization has taken initiative in which of the following areas related to Big Data Applications? 
(Multiple answers possible) • Complex Big Data Applications in Science, Engineering, • Medicine, Healthcare, Finance, Business, Law and Education • Indian Traditional Knowledge • Transportation • Retailing, social media and Telecommunication • Big Data Analytics in Small Business Enterprises (SMEs) • Big Data Analytics in Central and State Governments, Public Sector and Society in General 189 • • • • Real-life Case Studies of Value Creation through Big Data Analytics Big Data as a Service Big Data Industry deployments & Standards and Experiences of Big Data Government Deployments/ Projects. bb. To what extent do you agree with the following statement: “The issue for us is now not the growing volumes of data, but rather being able to analyse and act on data in real-time. • Agree • Disagree • Don’t know/Not applicable 190 PART E AREAS OF APPLICATION, MODELS & INFRASTRUCTURE q. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your Organization’s Business? (Multiple answers possible ) • Upgrade IT Systems • Improve data collection processes • Training current employees or recruiting new employees in BA • Redesigned/reengineered your important Business Processes r. How well are these areas developed in your organization? (Answer in terms of Very well, Reasonably well, Not so well, Don't know) • A clear company strategy • A sound procedure for legal, ethical and reputational issues • An organization structure that supports multi-disciplinary projects • Financial budget • Support by higher management • Supporting systems and procedures • Talent • Training s. What do you predict will happen to the number of Big Data specialists in your organization next year (2015) • It will decrease • It will remain stable • It will increase • Don't know t. As the country would like to keep ahead of the rest of the world in the area of Big Data applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive. u. 
For the Researchers in the Big Data Discipline, what should be the thrust areas:
• Immediately
• In the next 5-10 years

PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED

15. What support do you need from the Government?
16. What are your suggestions for leveraging Big Data Analytics applications in Government? (In terms of the following)
• The Opportunities
• Possible Application Areas
• Priority Application Areas for the next 10 years
• Market size
• Skills gaps and actions needed to fill in the gaps
• Policy Frameworks
17. Does your organization apply advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
18. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
19. What are, in your opinion, the most important factors for successful Big Data implementations? Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
20. Please suggest the Tools and Platforms to be used for Big Data Analytics in the Open Source domain.

PART G
SECURITY CONCERNS

j. Your organization has taken initiative in which of the following areas related to Big Data Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
k. Please provide your views on the IPR Issues as related to Big Data Analytics.
l.
Please provide your views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics.

PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE

QUESTIONNAIRE FOR SERVICE PROVIDERS (SP)

PART A
GENERAL ORGANIZATIONAL PROFILE

13. Name & Address of the Organization/Department:
Telephones: Fax: E-mail: Website:
14. Name & designation and Address of the CEO/HOD:
Telephones: Mobile: E-mail:
15. Name, Designation and Address of the Respondent:
Telephones: Mobile: E-mail:
16. Date & Place:

PART B
CURRENT STATUS, STRATEGY & PROFILE

oo. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple answers are possible)
SEGMENT/CATEGORY, YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
pp. Mention the Data Segments in which you are active:
qq. Mention the Data Segments that you Outsource:
rr. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
ss. How would you describe your organization's competitive position?
• Underperforming industry / market peers
• On par with industry / market peers
• Outperforming industry / market peers
• Don't know
tt. What are your expectations from Big Data Analytics in the next 10 years?
uu. In our organization, big data management is not viewed strategically at senior levels of the organisation.
• Agree
• Disagree
• Don’t know/Not applicable
vv. There is not enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable

PART C
MANPOWER, SKILL GAPS AND TRAINING NEEDS

q. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills • Data integration skills • Data analysis skills • Data storage skills • Tooling / software skills r. How many Big Data experts does your organization employ and in which area? • Computer science: programming experts (R, Python, SQL, SAS, Java, etc) • Computer science: Artificial Intelligence and machine learning experts • Computer science: text, voice, music, image and video experts • Experts in statistics and econometrics • Experts in OR and applied mathematics • Other (please specify) s. What are the training needs of your organization? • Strategy courses on Big Data for top management • Computer science: programming courses (R, Python, SQL, SAS, Java, etc) • Computer science: text, image and video recognition courses • Computer science: machine learning and artificial intelligence courses • Statistics and econometrics courses • Operations research and applied mathematics courses • Application related courses (Big Data in marketing, finance, logistics, etc) • Any Other (Please Specify) t. For the purposes of the Capacity Building initiatives, please suggest the programs and other details as given below: Name of the Who should be Modality of Coverage Duration Program the participants Delivery 198 PART D PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION cc. Your organization has taken initiative in which of the following areas related to Big Data Science & Technology? (Multiple answers possible) • Data streaming & Processing • Analysis of Unstructured/Semi-structured data • Visualization & Visual Analytics • Security & Privacy issues • New Computational Models • Data & Information Quality and New Data Standards. dd. Your organization has taken initiative in which of the following areas related to Big Data Infrastructure ? (Multiple answers possible) • System Architectures, Design and Deployment • Programming Models • Software Techniques & Architectures in Cloud/Grid/Stream Computing • Big Data Open Platforms ee. 
Your organization has taken initiative in which of the following areas related to Big Data Search, Mining and Management? (Multiple answers possible)
• Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Computational Modeling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
• Mobility and Big Data
• Multimedia and Multi-structured Data-Big Variety Data
ff. Your organization has taken initiative in which of the following areas related to Big Data Applications? (Multiple answers possible)
• Complex Big Data Applications in Science, Engineering, Medicine, Healthcare, Finance, Business, Law and Education
• Indian Traditional Knowledge
• Transportation
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data Analytics in Central and State Governments, Public Sector and Society in General
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of Big Data Government Deployments/Projects
gg. In which areas does your organization have Big Data applications? (Multiple answers possible)
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Supply chain management and logistics
• Information Technology
• Finance and administration
• HR and people development
• Risk management
• We do not have applications
• I don't know in which area(s) we have applications.
• Other (please specify)

PART E
AREAS OF APPLICATION, MODELS & INFRASTRUCTURE

v. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your Organization’s Business?
(Multiple answers possible)
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
w. How well are these areas developed in your organization? (Answer in terms of Very well, Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
x. What do you predict will happen to the number of Big Data specialists in your organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
y. Please make suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
z. As the country would like to keep ahead of the rest of the world in the area of Big Data applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive.

PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED

21. What support do you need from the Government?
22. What are your suggestions for leveraging Big Data Analytics applications in Government? (In terms of the following)
• The Opportunities
• Possible Application Areas
• Priority Application Areas for the next 10 years
• Market size
• Skills gaps and actions needed to fill in the gaps
• Policy Frameworks
23. Which data does your organization analyze in the context of Big Data applications? (Multiple answers possible):
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
• Audio (voice, music)
• Images (automated image recognition)
• Video (automated video recognition)
• Don't know
• Other (please specify)
24.
Does your organization apply advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
25. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
26. What are, in your opinion, the most important factors for successful Big Data implementations? Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
27. Please suggest the Tools and Platforms to be used for Big Data Analytics in the Open Source domain.

PART G
SECURITY CONCERNS

m. Your organization has taken initiative in which of the following areas related to Big Data Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
n. Please provide your views on the IPR Issues as related to Big Data Analytics.
o. Please provide your views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics.

PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE

ANNEXURE 2
LIST OF PARTICIPANTS OF THE CONSULTATIVE MEETINGS AND INTERACTIVE WORKSHOPS

FIRST CONSULTATIVE MEETING HELD ON 28th NOVEMBER 2014 AT NEW DELHI
LIST OF THE PARTICIPANTS
S. No. Name, Designation & Organization
1. Prof. Sankar K Pal, Distinguished Scientist and Former Director, Indian Statistical Institute, Kolkata
2. Prof.
Santanu Chaudhury, Dhananjay Chair Professor, Department of Electrical Engineering, Indian Institute of Technology, New Delhi
3. Prof. Bapi Raju Surampudi, Professor, Deptt. of Computer/Info. Sciences, and Coordinator, Centre for Neural and Cognitive Sciences (CNCS), University of Hyderabad, Gachibowli
4. Prof. Ramesh Hariharan, Adjunct Professor, Strand Life Sciences, Bangalore
5. Dr. Raghavendra Singh, Research Staff Member, IBM Research, India Research Laboratory, New Delhi
6. Shri Avnish Sabharwal, Managing Director & Strategy Head, Accenture India (Pvt.) Ltd., Bangalore
7. Mr. G M Bagai, Scientist 'G', Department of Scientific & Industrial Research, Ministry of Science and Technology, New Delhi
8. Mr. Sanjay S Gahlout, Deputy Director General, National Informatics Centre, New Delhi
9. Shri K R Murali Mohan, Head, Big Data Initiative Division, Department of Science & Technology, Ministry of Science and Technology, New Delhi
10. Prof. Amit Kumar Bardhan, Associate Professor, Faculty of Management Studies, University of Delhi, Delhi
11. Dr. Prageet Aeron, Assistant Professor (Information Management), International Management Institute, New Delhi
12. Prof. Krishan Lal, President, The Korean Academy of Science and Technology (KAST), National Physical Laboratory, New Delhi
13. Shri Nikunj Garg, Manager Enterprise Risk Services, Deloitte Touche Tohmatsu India Private Limited, Gurgaon
14. Shri Prashant Gupta, Director - Enterprise Risk Services, Deloitte Touche Tohmatsu India Private Limited, Gurgaon
15. Dr. A K Singh Suryavanshi, Professor & Dean, Department of Business Management & Entrepreneurship, National Institute of Food Technology Entrepreneurship & Management, Sonipat
16. Dr. Sudeep Marwah, Indian Council of Agricultural Research, New Delhi
17. Shri Mratunjay Tewari, Deputy General Manager (IT), Indian Railway Catering & Tourism Corporation Ltd., New Delhi
18.
Shri Ramakant Tiwari, Deputy General Manager (IT), Indian Railway Catering & Tourism Corporation Ltd., New Delhi
19. Dr. Nahid Alam, The Associated Chambers of Commerce & Industry of India, New Delhi
20. Shri Uday Laroia, Deputy Director, Confederation of Indian Industry, New Delhi
21. Shri Manjeet Bose, Director, NASSCOM, New Delhi
22. Shri Bhushan Mohan, Department of Electronics and Information Technology, New Delhi
23. Shri Rahul Mittal, NIIT Technologies, NOIDA
24. Shri Shobit Bahadur, Head-Research, Ma Foi Analytics and Research, Chennai
25. Shri K Pandia Rajan, Chairman & Managing Director, Ma Foi Strategic Consultants Private Limited, Chennai
26. Dr. Udita Taneja, Associate Professor (Healthcare Management), University School of Management Studies, New Delhi
27. Prof. Usha Munshi, Indian Institute of Public Administration, New Delhi
28. Shri Anupam Bhatnagar, Managing Advisor, National Co-operative Consumer's Federation of India Ltd., NOIDA
29. Shri Pradeep Dadlani, Director, Sycom Projects Consultants Pvt. Ltd., New Delhi
30. Shri Rohit Anand, Managing Director, Value Edge Research Services, New Delhi
31. Dr. Praveen Arora, Adviser & Head, CHORD (NSTMIS) Division, Department of Science & Technology, Ministry of Science and Technology, New Delhi
32. Mr. Deepak Agrawal, CDC
33. Mr. S. K. Lalwani, CDC
34. Mr. B. G. Gupta, CDC

LIST OF THE PARTICIPANTS
FIRST INTERACTIVE WORKSHOP, BENGALURU, 7th JANUARY 2015
S. No. NAME, ORGANIZATION
1. Mr. K. J. Rajeshwar, WIPRO Ltd.
2. Mr. Yogesh Simmhan, IISC
3. Mr. C. Bhattacharya, IISC
4. Mr. N. Viswanath, IISC
5. Mr. N. S. N. Sastry, ISI
6. Mr. Hatim Matiwala, SAP
7. Mr. Animesh Bisaria, Integra Micro Software Services Pvt. Ltd.
8. Ms. Nandini S. S., Integra Micro Software Services Pvt. Ltd.
9. Mr. G. Sabuwel, IBAB
10. Mr. N. Yatindra, IBAB
11. Mr. Mohan Muniswamaiah, Gooly Consultancy Services
12. Mr. U. Dinesh Kumar, IIM
13. Mr. Arun Singh R., IISC
14. Mr. R. P. Thangavelu, CSIR-4PI
15. Mr. Vidyadhar Mudkani, CSIR-NAL
16. Mr. Harsha Vardhan, TALLY-ASSOCHAM
17. Mr. Vamsi Veeramachaneni, Strand Life Sciences
18. Mr. Adiya R. Sinha, SECON
19. Mr. Nikunj Garg, Deloitte
20. Dr. Murli Manohar, DST
21. Mr. S. K. Lalwani, CDC
22. Mr. B. G. Gupta, CDC
23. Mr. Jaiprakash S, Big Data Analytics India Pvt Ltd
24. Mr. Asad Awasi, ASSOCHAM
25. Mr. Devika P. Madalli, ISI
26. Mr. Mohan Mokashi, The LOBICONS
27. Mr. T. V. Suresh Kumar, MS Ramaiah Institute of Technology
28. Mr. Srinivasa K. G., MS Ramaiah Institute of Technology
29. Mr. ARD Prasad, ISI
30. Prof. K Sankara Rao, Centre for Ecological Sciences, Indian Institute of Science
31. Prof. K. R. Kullayiswamy, Indian Institute of Science

LIST OF THE PARTICIPANTS
SECOND INTERACTIVE WORKSHOP, PUNE, 19th JANUARY 2015
S. No. Name, Organization
1. Ms. Radhika G Brahme, National Aids Research Institute (Indian Council of Medical Research)
2. Mr. Maheshwari Desai, Harbinger Systems Private Ltd.
3. Mr. Pravda Godbole, Inteliment Technologies
4. Dr. R. R. Hirwani, CSIR-Unit for Research and Development of Information Products
5. Dr. Vijay Khare, International Center & Professor of Defence and Strategic Studies, University of Pune
6. Mr. Kundan Kumar, MITCON Consultancy & Engineering Services Ltd.
7. Dr. Karthikeyan, CSIR-National Chemical Laboratory
8. Mr. Siddharth Thomas, Cignex Datamatics Technologies Ltd.
9. Wg Cdr Srinivas, AIPER
10. Mr. Gautam, Harbinger Systems Pvt. Ltd.
11. Mr. Vivek, EY
12. Dr. D. M. Thakore, BVU College of Engineering
13. Prof. S. Z. Gawali, BVU College of Engineering
14. Prof. Rejani Meshram, Pune University
15. Dr. Lalita Khare, Modern Institute of Business Management
16. Dr. Nanaji Shewale, GIPE
17. Mr. S. K. Lalwani, CDC
18. Mr. Deepak Agarwal, CDC
19. Mr. B. G. Gupta, CDC
20. Mr. Prashany Pansare, Inteliment Technologies

LIST OF THE PARTICIPANTS
THIRD INTERACTIVE WORKSHOP, HYDERABAD, 29th JANUARY 2015
S. No. Name, Organization
1. Prof. Nirmala Apsingikar, Administrative Staff College of India
2. Mr.
Syed Azgar, Engineering Staff College of India
3. Ms. Kiran Hegda, Logic Matter
4. Major Gen. (Retd) R. Shiv Kumar, GITAM University
5. Mr. Maruthi Kumar, System Soft Technologies (India) Pvt. Ltd.
6. Prof. Kamlakar, IIIT
7. Prof. Arun K. Pujari, University of Hyderabad
8. Mr. R. Raghvan, Insurance Information Bureau of India
9. Prof. K. S. Rajan, IIIT
10. Prof. S. Bapi Raju, IIIT
11. Mr. E. Pttabhi Rama Rao, Indian National Centre for Ocean Information Services (ESSO-INCOIS), Ministry of Earth Sciences
12. Ms. Pallavi Rao, Logic Matter
13. Prof. C. R. Rao, School of Computer and Information Sciences
14. Dr. BLS Prakasa Rao, CR Rao Advanced Institute for Mathematics Statistics and Computer Science
15. Dr. S. Ravichandran, National Academy of Agricultural Research Management
16. Mr. Dipanjan Roy, IIIT
17. Dr. M L Saundh, GVK Emergency Management and Research Institute
18. Mr. Surya Putchala, Zettamine Tech.
19. Mr. Tejpal Pola, Zettamine Tech.
20. Prof. R. Ravi, IDRBT
21. Prof. Sobhan, IIT
22. Dr. D. Subramanyam, DRR
23. Dr. Sundarsan Jena, GITAM University
24. Dr. S. Phani Kumar, GITAM University
25. Mr. Raghu Patri, DATAWISE
26. Ms. Nupur Pavan Bang, IIB
27. Ms. Aruna M., PITS Pilani
28. Mr. J. A. Chaudhary, Talentsprint
29. Mr. P. Krishna Reddy, IIIT
30. Ms. Kavita Vemeri, IIIT
31. Ms. Jaswinder Kaur, DATAWISE
32. Mr. V. Srinivas Rao, BT & BT
33. Mr. M Krishna, State Government
34. Dr. Shanthi, Engineering Staff College of India
35. Mr. B. G. Gupta, CDC
36. Mr. S. K. Lalwani, CDC
37. Ch. Sobhan Basu, Indian Institute of Technology
38. Prof. PJ Narayanan, IIIT Hyderabad
39. Dr. Priyanka Srivastava, IIIT Hyderabad
40. Prof. Vasudeva Varma, IIIT Hyderabad

LIST OF THE PARTICIPANTS
FOURTH INTERACTIVE WORKSHOP, KOLKATA, 17th FEBRUARY 2015
S. No. Name, Organization
1. Prof. K. Sankar Pal, ISI
2. Prof. Ashish Ghosh, ISI
3. Mr. S. K. Lalwani, CDC
4. Mr. B. G. Gupta, CDC
5. Prof. Sanghmitra Badhopadhaya, ISI
6. Prof. Rajat K. De, ISI
7. Prof.
Meeta Nasipuri, Jadavpur University
8. Dr. Sumita Ghosh, Jadavpur University
9. Dr. Subhdip Basu, Jadavpur University
10. Mr. Bhaktipada Kundu, Cognizant Technology Solution
11. Prof. Subhamoy Chakraborti, Magma Fincorp Ltd.
12. Prof. Phalguni Gupta, NITTTR
13. Prof. Pabitra Mitra, IIT KH
14. Mr. P K Chatterjee, Conmat Technologies Pvt. Ltd.
15. Mr. Gautam Das, Protech Infosystems Pvt. Ltd.
16. Mr. Arnab Ganguli, Webcon Consulting (India) Ltd.
17. Prof. Kalyan Kumar Bhar, Indian Institute of Engineering Science & Technology
18. Mr. Debasish Hajra, PricewaterhouseCoopers Private Limited
19. Mr. SK Ray, AKB Power Consultants Pvt. Ltd.
20. Dr. A N Roy, National Institute of Research on Jute and Allied Fibre Technology
21. Dr. Sucheta Tripathy, CSIR
22. Dr. Apurba Kr. Ghosh, University of Burdwan
23. Mr. Amarlal Chaudhary, University of Burdwan
24. Mr. Sudipta Sen Sharma, ISI
25. Mr. K. K. Chakraborti, Adansa Solutions Pvt. Ltd.
26. Ms. Suman Kundu, ISI
27. Mr. K Ramachandra Murthy, ISI
28. Mr. Arnab Biswas, ISI
29. Mr. Satish Chandra, ISI
30. Mr. Animesh Basak, ISI
31. Mr. Ansuman Das, ISI
32. Mr. Chrag Gupta, ISI
33. Mr. Arnab Kundu, ISI
34. Mr. Kamlesh Nayak, ISI
35. Mr. Prateek Pandey, ISI
36. Mr. Dyaneshwar Patil, ISI
37. Mr. Pratish Ranjan, ISI
38. Mr. Kushai Sen, ISI
39. Ms. Procheta Sen, ISI
40. Mr. Ankit Sharma, ISI
41. Mr. Abhishek Singh, ISI
42. Mr. Gautam Banerjee, Business Brio
43. Shri Gautam Das, Protech Infosystems Pvt. Ltd.
44. Ms. Deepshikha Banerjee, Conmat Technologies Pvt. Ltd.
45. Ms. Kanishka Dhamija, Indian Statistical Institute
46. Shri Arindam Pal, Indian Statistical Institute
47. Shri Ajoy K Ray, IIEST
48. Shri Kunal Shrivastava, Indian Statistical Institute
49. Ms. Niharika Das, Indian Statistical Institute
50. Ms. Shrabana Dutta, Indian Statistical Institute
51. Ms. Romi Banerjee, Indian Statistical Institute
52.
Shri Bhaskar Dey Indian Statistical Institute 218 LIST OF THE PARTICIPANTS SECOND CONSULTATIVE MEETING, NEW DELHI, 25th MARCH 2015 S. No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 NAME OF THE PARTICIPANT Prof. Sankar K Pal Prof. Santanu Chaudhary Prof. S Bapi Raju Prof. Ramesh Hariharan DR. Raghavendra Singh Prof. Prageet Aeron Shri Prashant Arya Shri Zubin Baben Shri Gautam Banerjee Prof. Amit Bardhan Shri SK Dey Biswas Dr. Lovneesh Chanana Dr. K P Chaudhary Ms. Soumya Das Ms. VineetaDixit Shri Parameswara Rao Ganta Prof. Ashish Ghosh Shri Radhesh Gupta Lt. Col M Haridas Dr. B Kanagadurai Shri Vipul Kaushik Shri Sanjay Krishen Shri Shirish Mahendru Ms. Kamini Malhotra Shri Rahul Mittal Shri Bhushan Mohan Mohankrishnan P Prof. Usha Munshi Dr. K R Murali Mohan Shri Deepak Pandhi Dr. Maya Ramanath DR. V Ravi Dr. S Ravichandran DR. Ravi Sekhar ORGANIZATION Indian Statistical Institute, Kolkata Indian Institute of Technology, New Delhi Cognitive Science Lab, Hyderabad Strand Life Sciences, Bangalore IBM Research, India Research Laboratory, New Delhi International Management Institute, New Delhi Google India Pvt. Ltd., Gurgaon. SAP India Pvt. Ltd. Bangalore Business Brio, Kolkata Faculty of Management Studies, New Delhi Indian Council of Medical Research, New Delhi SAP India Pvt. Ltd., New Delhi National Physical Laboratory, New Delhi RudrabhishekInfosystemPvt. Ltd., Noida Google India Pvt. Ltd., Gurgaon RTL Technologies, Hyderabad ISI, Kolkata Centre for Land Warfare Studies, Delhi Cantt. CSIR - Central Road Research Institute, New Delhi Ernst & Young LLP, Gurgaon Intel Technology India Pvt. Ltd., Bangalore Fairwood Design Pvt. 
Ltd., Noida Defence Research and Development Organisation, New Delhi NIIT Technologies Ltd., Noida NASSCOM, Bangalore Indian Institute of Public Administration, New Delhi DST, Ministry of Science and Technology, New Delhi Confederation of Indian Industry, Gurgoan IIT, New Delhi IDRBT, Hyderabad NAARM, Hyderabad CSIR - Central Road Research Institute 219 S. No. 35 36 37 38 39 40 41 42 NAME OF THE PARTICIPANT Shri Bimal Sikdar Shri Salam Shyamsunder Singh Shri Vishin Sukh DR. KVS Viswanathan Shri Deepak Agrawal Shri S. K. Lalwani Shri B. G. Gupta Shri Soumaya ORGANIZATION Project Art, New Delhi Deptt. of Economic AffairsMinistry of Finance, New Delhi Fairwood Design Pvt. Ltd., Noida NASSCOM, Bangalore CDC CDC CDC CDC 220 ANNEXURE 3 CONSOLIDATED RESPONSES FROM DATA GENERATORS 221 ANNEXURE 3 CONSOLIDATED RESPONSES FROM DATA GENERATORS (DG) Current Status, Strategy & Profile ww. Stakeholder Segment/Category : Multiple - Researchers, Data Generators, End Users xx. Active Data Segments : Customers, Transaction both internal & external yy. Data Segments that is outsourced: Sometimes e.g. Debit Card Data zz. Data not available at Data.Gov.In aaa. Expectations from Big Data Analytics in the next 10 years: • Superior analytics for business growth and customer service satisfaction • Big Data techniques will allow organization to analyze data for patterns more quickly and at a much lower cost. It will lead to important business insights that can drive the business. bbb. Big data management is viewed strategically at senior levels of the organisation. ccc. Generally there is not enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded. Manpower, Skill Gaps and Training Needs u. Identified skills gaps in dealing with data and analytics. • Visualization skills • Tooling / software skills v. 
Big Data experts employed in areas • Computer science: programming experts (R, Python, SQL, SAS, Java, etc) • Statistics and econometrics w. The training needs are: • Strategy courses on Big Data for top management • Computer science: programming courses (R, Python, SQL, SAS, Java, etc) • Computer science: text, image and video recognition courses • Statistics and econometrics courses • Application related courses (Big Data in marketing, finance, logistics, etc) • High Frequency Data x. Required Capacity Building initiatives as per details given below: Name of the Program Strategy courses on Big Data Big Data in marketing, finance Statistics and econometrics Who should be the participants Top Management Middle Management and Lower Management Middle Management and 222 Overview Extensive Half Day 3 Days Modality of Delivery Class room program Class room program Extensive 3 Days Class room program Coverage Duration courses Programming courses Strategy courses on Big Data Strategy courses on Big Data for top management Computer science: text, image and video recognition courses Application related courses (Big Data in marketing, finance, logistics, etc) Lower Management IT officers Top Management Business Units T&O Business Units T&O Business Units T&O Extensive Overview Detailed 7 Days Half Day 3-5 days Class room program Class room program Classroom Training Detailed 3-5 days Classroom Training Detailed 3-5 days Classroom Training Perceived Success Factors, Impediments & Challenges for Big Data Application There are no questions under this head for Data Generators Areas of Application, Models & Infrastructure aa. Steps taken to Integrate Data into Organization’s Business: • Upgrade IT Systems • Improve data collection processes • Redesigned/reengineered your important Business Processes bb. 
Areas developed in organization:
Very well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
Reasonably well to very well
• Supporting systems and procedures
• Talent
• Training
cc. The number of Big Data specialists in organization next year (2015) will increase.
dd. No significant suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval

Type, Amount of Data & Analytical Techniques Used

28. Support needed from the Government:
• Building a central repository of financial markets statistical data.
• The government’s roadmap on big data
• High level 5-year country strategy
• Clarity on statutory / regulatory / compliance requirements
• Partnering with peer organizations and relevant government agencies
• Clarity on security aspect
29. The amount of data available to support decision-making is enough.
30. Challenges faced in GENERATING Data
• Ensuring uniformity in data structure.
• Coping with rapid changes in business requirements.
31. Challenges faced in CLEANING the Generated Data
• Identifying mandatory data fields to ensure correct analytics.
• Data correlation
• Data quality
32. DATA CURATION function in-house
• Data curation is currently not envisaged as part of data analytics in the organisation.

Security Concerns

p. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
q. Views on the IPR Issues as related to Big Data Analytics.
• Need to think through certain fundamental legal aspects of IPR, e.g. “who owns the input data companies are using in their analysis, and who owns the output?”
• Implement the right policies for big data governance.
• In the crowd sourcing world of Big Data Analytics it is very difficult to clearly demarcate the IPR related boundaries.
• Over emphasis on IPR may also hamper the open innovation approach in the internet based application development model.
r. Views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
• Will comply with national regulatory requirements
• Big Data Analytics will be beneficial by extracting unstructured information and combining it with the power of social media.

ANNEXURE 4
CONSOLIDATED RESPONSES FROM DATA RESEARCHERS (RE)

Current Status, Strategy & Profile

ddd. Stakeholder Segment/Category: Mostly Researcher, sometimes multiple such as Service Provider.
eee. Active Data Segments:
• Research in the areas of Big Data Management, Analytics and Machine Learning.
• Consultancy in Big Data Management, Analytics and Machine Learning.
• Capacity building initiatives in Big Data Management, Analytics and Machine Learning.
• Data Analysis
• Genomics & Life Sciences.
fff. Data Segments that are outsourced:
• Nil
ggg. Budget Provisions For Big Data Usage in Lakhs of Rupees
• 2014-15: 100 to 200 Lakhs
• 2015-16: 100 to 200 Lakhs
• 2016-17: 300 to 500 Lakhs
hhh. Areas where more Investment in Resources
• Capacity Building
• Software Tools: most preferred area
• Data Sources
• Other: Data Generation
iii. Current state of big data activities within organization
• Not yet started to consider big data's use within our organization
• Offering training programmes and consultancy
• One or more pilots or proofs of concept
• Implementing big data technologies
jjj.
Expectations from Big Data Analytics in the next 10 years
• It is going to rule many organizations
• Plan to set up an internationally known centre of excellence in Big Data Management, Analytics, Mining, Machine Learning for Research and Development, Consultancy Services and Capacity Building
• Capacity Building
• Training/Research
kkk. Big data management is viewed strategically at senior levels of the organisation.
lll. Generally there is enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.

Manpower, Skill Gaps and Training Needs

y. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data storage skills
z. Big Data experts employed in areas (about 2-5 experts)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
aa. The training needs are:
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
bb.
Required Capacity Building initiatives as per details given below: Who should be the Name of the Program Coverage Duration participants Strategy courses on C level professionals, Data Strategy, tangible 2 Months Big Data for top researchers and policy benefits and action management makers plans Big data Certifications Basic Statistics Numerical Methods Multivariate Analysis Operation Research Data Mining and Data Researchers and Practitioners Research Scholars/Academic Professionals/Corporate Personnel Research Scholars/Academic Professionals/Corporate Personnel Research Scholars/Academic Professionals/Corporate Personnel Research Scholars/Academic Professionals/Corporate Personnel Research Modality of Delivery Hybrid – Class room + e-learning Open Source platforms –Hadoop ETL, etc ALL 2 Months 40 Hours Hybrid – Class room + e-learning Class Room session ALL 40 Hours Class Room session ALL 40 Hours Class Room session ALL 40 Hours Class Room session ALL 40 Hours Class Room Session 227 Warehousing Scholars/Academic Professionals/Corporate Personnel Perceived Success Factors, Impediments & Challenges for Big Data Application hh. Organizational initiative in the following areas related to Big Data Science & Technology • Analysis of Unstructured/Semi-structured data • Security & Privacy issues • New Computational Models ii. Organizational initiative in the following areas related to Big Data Infrastructure • System Architectures, Design and Deployment • Programming Models • Software Techniques & Architectures in Cloud/Grid/Stream Computing • Big Data Open Platforms jj. Organizational initiative in the following areas related to Big Data Search, Mining and Management • Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia • Algorithms & Systems for Big Data Search • Computational Modelling & Data Integration • Cloud/Grid/Stream Data Mining-Big Velocity Data kk. 
Organizational initiative in the following areas related to Big Data applications
• Complex Big Data Applications in Science, Engineering,
• Medicine, Healthcare, Finance, Business, Law and Education
• Retailing, social media and Telecommunication
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
ll. It is generally agreed that the issue for us is now not the growing volumes of data, but rather being able to analyse and act on data in real-time.

Areas of Application, Models & Infrastructure
ee. Steps taken to Integrate Data into Organization’s Business:
• Upgrade IT Systems
• Training current employees or recruiting new employees in BA
ff. Areas developed in organization: Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
gg. The number of Big Data specialists in organization next year (2015) will increase.
hh. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Big Data as a Service providing easy experimentation and quick prototyping
• Big Data Analytics platforms for Internet of Things and wearable devices
• Solutions/Protocols for seamless data integration, privacy and security.
ii. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately: Better algorithms/platforms for Big Data Management, ETL and Analytics – improving the open source solutions
• In the next 5-10 years: Scalable Machine Learning for Big Data, IoT and Big Data integration and products.

Type, Amount of Data & Analytical Techniques Used
33.
Support needed from the Government:
• Our Analytic capability may be used for the needy
• Support for enhancing the capacity of our Big Data Engineering Lab
• Support for offering internationally known Big Data certifications in India
34. Suggestions for leveraging Big Data Analytics applications in Government
• Possible Application Areas
35. Organization applies advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications
36. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
37. The most important factors for successful Big Data implementations? Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy: 1
• Support by higher management: 1
• Talent: 3
• Training: 4
• Supporting systems and procedures: 4
• Financial budget: 1/4
• An organizational structure that supports multi-disciplinary projects: 4
• A sound procedure for legal, ethical and reputational issues: 3
38. Tools and Platforms to be used for Big Data Analytics in the Open Source domain.
• Apache Hadoop Ecosystem – Hortonworks, Cloudera
• Apache Solr/Lucene
• NoSQL databases – MongoDB, Cassandra
• R, Python – SciPy
• Graph databases – Neo4j, etc

Security Concerns
s. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
t. Views on the IPR Issues as related to Big Data Analytics.
• Cost of Patent filing is too high
• Most of the research outcomes are not commercialized by governmental organizations
u.
Views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics: NIL

Any Other Information You May Like To Share
d. We have more data, but we don’t have proper documentation.
e. Even when we have data, we don’t have the operating resources to act as analysts.
f. We find it difficult to identify resource persons who have knowledge and skill in areas like econometrics, operations research, multivariate tools, computer science, etc.

ANNEXURE 5
CONSOLIDATED RESPONSES FROM END USERS (EU)

Current Status, Strategy & Profile
mmm. Stakeholder Segment/Category: Mostly multiple, such as Data Generator and Researcher.
nnn. Active Data Segments:
• Transaction Information
• Policy Making
ooo. Data Segments that are outsourced: NIL
ppp. Budget Provisions For Big Data Usage in Lakhs of Rupees
• 2014 – 15: 100
• 2015 – 16: 200
• 2016 – 17: 300
qqq. Areas where more Investment in Resources
• Capacity Building
• Software Tools
• Data Sources
rrr. Organization's competitive position
• Underperforming industry / market peers
sss. Current state of big data activities within organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
ttt. Usefulness of Big Data Analytics Applications: Not sure
uuu. Expectations from Big Data Analytics in the next 10 years
• Superior analytics for business growth and customer service satisfaction
• Data Driven Methods
• Conflict resolution
• Early warning systems
vvv. Big data management is viewed strategically at senior levels of the organisation.
www. Generally there is enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.

Manpower, Skill Gaps and Training Needs
cc. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
dd. Big Data experts employed in areas (About 2 – 5 Experts)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
ee. The training needs are:
• Strategy courses on Big Data for top management
• Computer science: text, image and video recognition courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
ff. Required Capacity Building initiatives as per details given below:

Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Strategy courses on Big Data for top management | Business Units T&O | Detailed | 3-5 days | Classroom Training
Computer science: text, image and video recognition courses | Business Units T&O | Detailed | 3-5 days | Classroom Training
Application related courses (Big Data in marketing, finance, logistics, etc) | Business Units T&O | Detailed | 3-5 days | Classroom Training
Strategy courses on Big Data for top management | Business Units T&O | Detailed | 3-5 days | Classroom Training
M. Tech | B. Tech | | 4 Semesters | Class Room
M. Sc | B. Sc | | 4 Semesters | Class Room

Perceived Success Factors, Impediments & Challenges for Big Data Application
mm. The extent of timely Access to Information needed
• To some extent
nn. The extent of competitive advantage created by information
• Modest advantage
oo. Challenges inhibiting from acquiring and integrating data
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficult to share data internally and/or to integrate internal data across silos
pp.
Challenges inhibiting from analyzing data
• Lack of software/tools and/or software too difficult to use
• Inconsistent data across variety of source systems
qq. Challenges inhibiting from acting on data insights and analytics
• Lack of software/tools that allow end-users to perform analytics themselves
rr. Organizational initiative in the following areas related to Big Data Science & Technology
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
ss. Organizational initiative in the following areas related to Big Data Infrastructure
• System Architectures, Design and Deployment
• Programming Models
• Big Data Open Platforms
tt. Organizational initiative in the following areas related to Big Data Search, Mining and Management
• Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Mobility and Big Data
uu. Organizational initiative in the following areas related to Big Data applications
• Complex Big Data Applications in Science, Engineering,
• Big Data Analytics in Small Business Enterprises (SMEs)
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
vv. The biggest impediments to using big data for effective decision-making
• Too many “silos”: data is not pooled for the benefit of the entire organisation.
ww. It is generally agreed that the issue for us is now not the growing volumes of data, but rather being able to analyse and act on data in real-time.
xx.
Areas of Big Data applications
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Information Technology
• Finance and administration
• Risk management

Areas of Application, Models & Infrastructure
jj. Steps taken to Integrate Data into Organization’s Business:
• Upgrade IT Systems
• Improve data collection processes
• Redesigned/reengineered important Business Processes
kk. Areas developed in organization: Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
ll. The number of Big Data specialists in organization next year (2015) will increase.
mm. Suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
These technologies are evolving and call for constant innovation; the organization's roadmap should focus on alignment with emerging technologies.
nn. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Quality of data
• Accessibility of data
• Data consciousness
• Right to data
oo. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately
  o Procuring real time data
  o Data gathering
  o Data integration
  o Data integrity
  o Data security
• In the next 5-10 years
  o Developing Medical layer for supporting end users
  o System developers
  o Building DSS, KSS
  o Event triggering systems and agents aiming at integrating with Internet of Things.

Type, Amount of Data & Analytical Techniques Used
39. Pace at which data is available, updated or refreshed
• As it is streamed in real-time
• Less than a week
40.
Support needed from the Government:
• The government’s roadmap on big data
• High level 5-year country strategy
• Clarity on statutory / regulatory / compliance requirements
• Partnering with peer organizations and relevant government agencies
• Clarity on security aspect
41. The amount of data available to support decision-making
• Enough
42. Organizational Performance in Information and Analytic Tasks on a Scale of 1 to 5, Where 1=Poorly and 5=Very Well
• Acquire and integrate data: 3
• Analyze data: 3
• Act on data-driven insights: 4
43. Type of data analyzed in the context of Big Data applications:
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
44. Organizational use of advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications
• Yes
45. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Operations research (OR) / applied mathematics
46. Developing Big Data applications
• Both internally as well as externally
47. Consider the following factors as most important factors for successful Big Data implementations
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues

Security Concerns
v. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
w. Views on the IPR Issues as related to Big Data Analytics.
• Recently initiated this process
x.
Views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics:
• All data needs to be made available on common portal accessible to all
• We will comply with national regulatory requirements

ANNEXURE 6
CONSOLIDATED RESPONSES FROM SERVICE PROVIDERS (SP)

Current Status, Strategy & Profile
xxx. Stakeholder Segment/Category: Mostly Service Provider, sometimes Platform Provider.
yyy. Active Data Segments:
• Telecom
• Retail
• Banking and Finance
• Internet and Media
• Industry: Education, Manufacturing, Media & Content, Logistics and E-Commerce
  o Cloud BI and cloud data services
• IoT data & analytics services
• Click stream data and analytics services
• Analytics and Simulation segments.
zzz. Data Segments that are outsourced:
• Telecom
• Retail
• Banking and Finance
• Internet and Media
• Mobile apps and customization
• CRM apps and customization
• ERP apps and customization
aaaa. Areas where more Investment in Resources
• Capacity Building
• Software Tools
• Data Sources
• Agile Process Quality & ISO compliance for data services
• Data security & industry specific compliance audits/standards.
• User Experience (UX) standards and best practice.
• Sales & Marketing standards & scaling the business
bbbb. Organization's competitive position
• On par with industry / market peers
• Don't know
cccc. Expectations from Big Data Analytics in the next 10 years
• Big data will enable integrated cloud warehousing that integrates external and internal data.
• Machine learning algorithms will enable analytic automation that gives competitive
• End user intelligent apps will get more dependent upon big data APIs that make them smarter and more personalized
• Smarter cities
• Smoother e-governance
dddd. Big data management is NOT viewed strategically at senior levels of the organisation.
eeee.
Generally there is enough of a “big data culture” in the organisation, where the use of big data in decision-making is valued and rewarded.

Manpower, Skill Gaps and Training Needs
gg. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Tooling / software skills
hh. Big Data experts employed in areas (Generally about 2 – 5 Experts, sometimes more)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• People who understand business and the data that goes with it.
ii. The training needs are:
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Training in software tools such as Splunk, ELK, Cloudera.
jj. Required Capacity Building initiatives as per details given below:

Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Cloud BI | MCAs | BI devOps for analytics | 3 months | Apprentice model
Big Data | BE Grads | R & Visualization | 6 months | Apprentice model
Cloud DS | Diplomas | Data processing | 3 months | Apprentice model

Perceived Success Factors, Impediments & Challenges for Big Data Application
yy. Organizational initiative in the following areas related to Big Data Science & Technology
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
zz.
Organizational initiative in the following areas related to Big Data Infrastructure
• System Architectures, Design and Deployment
• Big Data Open Platforms
aaa. Organizational initiative in the following areas related to Big Data Search, Mining and Management
• Search & Mining of variety of data including scientific, engineering, social, sensor & multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Multimedia and Multi-structured Data-Big Variety Data
bbb. Organizational initiative in the following areas related to Big Data applications
• Medicine, Healthcare, Finance, Business, Law and Education
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
ccc. Organization has Big Data applications in
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• Fraud detection / management
• Customer and market analysis
• Customer service
• Information Technology

Areas of Application, Models & Infrastructure
pp. Steps taken to Integrate Data into Organization’s Business:
• Training current employees or recruiting new employees in BA
qq. Areas developed in organization: Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
rr. The number of Big Data specialists in organization next year (2015) will increase.
ss. Suggestions on the following important aspects: NIL
tt. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Big Data Analysis Platforms

Type, Amount of Data & Analytical Techniques Used
48.
Support needed from the Government:
• Setting up of SEZs for smaller set-ups like ours which completely export the services. The current SEZs are unaffordable and only larger companies can get the benefit of working out of a SEZ.
• Fostering an encouraging environment for entrepreneurship, especially for small start-ups.
49. NO suggestions for leveraging Big Data Analytics applications in Government
50. Organization analyzes in the context of Big Data applications
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
51. Organization applies advanced analytics methods (statistics, econometrics, operations research, artificial intelligence, applied mathematics) in Big Data applications
52. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Artificial intelligence (AI) and machine learning
53. The most important factors for successful Big Data implementations? Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy: 1
• Support by higher management: 2
• Talent: 1
• Training: 3
• Financial budget: 2
• An organizational structure that supports multi-disciplinary projects: 4
• A sound procedure for legal, ethical and reputational issues: 4
54. Tools and Platforms to be used for Big Data Analytics in the Open Source domain
• Hortonworks
• Hadoop
• MapReduce

Security Concerns
y. Your organization has taken initiative in which of the following areas related to Big Data Security & Privacy? (Multiple answers possible) NO COMMENTS
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
z. Please provide your views on the IPR Issues as related to Big Data Analytics. NO COMMENTS
aa.
Please provide your views and suggestions on the adequacy or otherwise of the National Data Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics. NO COMMENTS

ACKNOWLEDGEMENTS

Consultancy Development Centre (CDC), an Autonomous Institution of DSIR, Ministry of Science & Technology, Government of India, was commissioned to prepare the Strategic Document on Data Science, Technology, Research and Applications (dASTRA) for the Data Science Initiative taken by the Department of Science & Technology. CDC is thankful to the Department of Science & Technology, Ministry of Science & Technology, Government of India, for reposing confidence in it by assigning it this task of national importance.

CDC is especially thankful to the numerous Governmental, Public & Private Organizations, NGOs, Educational, Academic and Research Institutions and individuals for sparing their time and effort to respond to the survey questionnaires, take part in personal discussions and interviews, and participate in the Consultative Meetings and Interactive Workshops organized around the country.

The team working on this study has studied, consulted and referred to a very large number of research papers, reports, books, other public domain documents and presentations; in addition, it has participated in a number of Big Data related conferences and seminars held recently in the country. A list of the materials referred to has been included in the Bibliography given in the report. Many ideas from the above materials and personal interactions have directly or indirectly become part of this report. The team would like to acknowledge with thanks the valuable contributions made by the various authors of these interactions, documents and presentations. It is requested that this be taken as a personal acknowledgement of each and every person whose ideas have found a place in this report.
