DRAFT REPORT V3

STRATEGY DOCUMENT AND
DETAILED PROJECT REPORT (DPR) ON
DATA SCIENCE, TECHNOLOGY, RESEARCH &
APPLICATIONS (dASTRA)
DRAFT REPORT
DEVELOPED FOR
DEPARTMENT OF SCIENCE & TECHNOLOGY,
Ministry of Science & Technology, Government of India
CONSULTANCY DEVELOPMENT CENTRE,
2nd Floor, Core IV-B, India Habitat Centre, Lodhi Road, New Delhi – 110003
TABLE OF CONTENTS
Executive Summary
List of PDAC Members
1. Introduction: New Generation Computational Paradigm
2. Data Science & Technology
3. Data Science – Research & Development
4. Data Science Applications
5. Entrepreneurship Development & Start-ups
6. Data Science Policy Perspectives
7. Training & Capacity Building
8. Investments – Detailed Project Report
9. Conclusions
List of Abbreviations
List of Tables
List of Figures
References
Annexure
Acknowledgements
EXECUTIVE SUMMARY
Data Science, Technology, Research and Applications (dASTRA)
Data is increasingly becoming cheap and important. We are now digitizing analog content that was
created over centuries and collecting myriad new types of data from web logs, mobile devices, sensors,
instruments, and transactions. A study estimates that 90 percent of the data in the world today has been
created in the past two years, and the volume continues to grow manifold day by day.
At the same time, new technologies are emerging to organize and make sense of this avalanche of data.
We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship,
improve the human condition, and create commercial and social value. The rise of “big data” has the
potential to deepen our understanding of phenomena ranging from physical and biological systems to
human social and economic behavior.
Virtually every sector of the economy now has access to more data than would have been imaginable
even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to
extract value from it. The question facing every organization that wants to attract a community is how
to use data effectively — not just their own data, but all of the data that’s available and relevant.
Our ability to derive social and economic value from the newly available data is limited by the lack of
expertise. Working with this data requires distinctive new skills and tools. The corpuses are often too
voluminous to fit on a single computer, to manipulate with traditional databases or statistical tools, or to
represent using standard graphics software. The data is also more heterogeneous than the highly
curated data of the past. Digitized text, audio, and visual content, like sensor and weblog data, is typically
messy, incomplete, and unstructured; it is often of uncertain provenance and quality; and frequently
must be combined with other data to be useful. Working with user-generated data sets also raises
challenging issues of privacy, security, and ethics.
Scientific progress is a result of relentless academic research endeavour. The scientific community has
been focused for a while now on the growing challenges of Big Data in a number of disciplines. This
immense repository of past/current academic knowledge is increasing at an exponential rate, and
handily qualifies as Big Data in terms of volume, variety and velocity of growth. The estimation of the
veracity of this data also presents challenges. As the amount of knowledge in an academic field grows, a
quick assessment of the state-of-the-art in any sub-field becomes that much harder. One way of
accelerating the process of discovery is to significantly enhance current search
capabilities to support deep scientific queries. This includes:
i) improving the efficiency and depth of search by enabling segmentation and recognition of all
the components of a traditional academic research including graphs, tables, and diagrams,
ii) developing tools to integrate various sources of information on any topic, not just from the
textual content but often from parallel channels such as video, speech, and the web, in order to
gain comprehensive understanding on the topic, and most importantly,
iii) making unapparent connections between methods, features, data, constraints, and parameters
across the spectrum of reported scientific data using advanced data mining approaches.
We believe that this will require enhancements to the state-of-the-art in a variety of disciplines such as
computer vision, pattern recognition, Natural Language Processing (NLP) and fusion of classifiers. We
will make a case for the viability of this plan and step through a case study in machine learning
techniques for combining classifiers. We believe that the development of such technologies is also likely
to have significant broader societal impact.
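As an illustration only, one of the simplest ways of combining classifiers is weighted majority voting. The sketch below is not the case study referred to above; the three toy classifiers (by_length, by_words, by_sentences) and their thresholds are hypothetical and exist only to make the example runnable.

```python
from collections import Counter


def majority_vote(predictions, weights=None):
    """Fuse the labels proposed by several classifiers for one sample.

    Each classifier casts one vote, optionally scaled by a weight that
    reflects how much that classifier is trusted (an assumption here).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    tally = Counter()
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    # The label with the largest total weight wins.
    return tally.most_common(1)[0][0]


def fuse(classifiers, sample, weights=None):
    """Run every classifier on the sample and combine their outputs."""
    return majority_vote([clf(sample) for clf in classifiers], weights)


# Hypothetical toy classifiers that label a text as "long" or "short".
def by_length(text):
    return "long" if len(text) > 20 else "short"


def by_words(text):
    return "long" if len(text.split()) > 4 else "short"


def by_sentences(text):
    return "long" if text.count(".") > 1 else "short"


ensemble = [by_length, by_words, by_sentences]
```

Calling fuse(ensemble, some_text) returns the label supported by most of the classifiers. Passing weights lets a more reliable classifier outvote the others, which is the basic idea that more elaborate fusion schemes refine.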
Big Data
By definition, Big Data is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. In
other words, big data is characterised by volume, variety (structured and unstructured data), velocity
(high rate of change), veracity (uncertainty and incompleteness) and value. By 2017, the global big
data industry is expected to be a USD 25 billion industry. NASSCOM predicts that the Indian big data industry
will be worth more than 1 billion in the coming years.
Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter
messages, photos, video clips, sensor data, etc. we produce and share every second. We are not talking
Terabytes but Zettabytes or Brontobytes. On Facebook alone we send 10 billion messages per day, click
the "like" button 4.5 billion times and upload 350 million new pictures each and every day. If we take all
the data generated in the world between the beginning of time and 2008, the same amount of data will
soon be generated every minute! This increasingly makes data sets too large to store and analyse using
traditional database technology. With big data technology we can now store and use these data sets
with the help of distributed systems, where parts of the data are stored in different locations and brought
together by software.
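The idea of storing parts of a data set in different locations and bringing them together in software can be sketched in a few lines. This is a minimal single-process illustration, assuming hash-based partitioning; the function names (node_for, distribute, count_by_key) are hypothetical and the "nodes" are simply in-memory lists standing in for separate machines.

```python
import hashlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick the storage node for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def distribute(records, num_nodes):
    """Split (key, value) records across nodes; each node holds one shard."""
    nodes = defaultdict(list)
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


def count_by_key(nodes):
    """Aggregate per shard first, then merge the partial results."""
    totals = defaultdict(int)
    for shard in nodes.values():
        local = defaultdict(int)  # work done where the data lives
        for key, value in shard:
            local[key] += value
        for key, subtotal in local.items():  # software brings results together
            totals[key] += subtotal
    return dict(totals)


records = [("likes", 3), ("photos", 1), ("likes", 2), ("messages", 5), ("photos", 4)]
shards = distribute(records, num_nodes=3)
totals = count_by_key(shards)
```

Because every record with the same key lands on the same node, the per-shard totals can be merged without moving the raw data, and the result does not depend on how many nodes are used.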
Velocity refers to the speed at which new data is generated and the speed at which data moves around.
Just think of social media messages going viral in seconds, the speed at which credit card transactions
are checked for fraudulent activities, or the milliseconds it takes trading systems to analyse social media
networks to pick up signals that trigger decisions to buy or sell shares. Big data technology allows us
now to analyse the data while it is being generated, without ever putting it into databases.
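Analysing data while it is being generated usually means keeping a small running summary instead of the full history. The sketch below assumes Welford's online algorithm for the running mean and variance and a simple "three standard deviations" rule for flagging unusual values; the threshold, the warm-up length and the sample amounts are illustrative assumptions, not figures from this report.

```python
import math


class RunningStats:
    """Running mean and standard deviation without storing the stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of the values seen so far.
        return math.sqrt(self._m2 / self.count) if self.count > 1 else 0.0


def flag_unusual(stream, threshold=3.0, warmup=5):
    """Yield values that lie far from the running mean of earlier values."""
    stats = RunningStats()
    for amount in stream:
        if (stats.count >= warmup and stats.std > 0
                and abs(amount - stats.mean) > threshold * stats.std):
            yield amount
        stats.update(amount)


# Hypothetical transaction amounts arriving one at a time.
transactions = [10, 12, 11, 9, 10, 11, 500, 10]
flagged = list(flag_unusual(transactions))
```

Each value is inspected once and then discarded, so memory use stays constant however long the stream runs.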
Variety refers to the different types of data we can now use. In the past we focused on structured data
that neatly fits into tables or relational databases, such as financial data (e.g. sales by product or region).
In fact, 80% of the world’s data is now unstructured, and therefore can’t easily be put into tables (think
of photos, video sequences or social media updates). With big data technology we can now harness
different types of data (structured and unstructured), including messages, social media conversations,
photos, sensor data, video or voice recordings and bring them together with more traditional,
structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality
and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and
colloquial speech, as well as the reliability and accuracy of content), but big data and analytics technology
now allows us to work with these types of data. The volumes often make up for the lack of quality or
accuracy.
Value: Then there is another V to take into account when looking at Big Data: Value. It is all well and
good having access to big data but unless we can turn it into value it is useless. So you can safely argue
that 'value' is the most important V of Big Data. It is important that businesses make a business case for
any attempt to collect and leverage big data. It is so easy to fall into the buzz trap and embark on big
data initiatives without a clear understanding of costs and benefits.
Wireless sensor technology has advanced to such a point that it is feasible to equip even everyday items
with a variety of sensors and measure state at a frequency and scale not possible a few years ago. This
development has led to the idea of an “internet of things” and the application of data-driven analytics
to different domains. We now talk of smart cities / villages, where each component of infrastructure
can be closely monitored and controlled for efficient use of resources and higher quality of living. To
efficiently store, manage and process the data that is generated in the process requires the
development of new algorithms and approaches to traditional problems. Specific areas in which both
domain expertise and access to data are available among the co-investigators are
transportation, power and water distribution networks, health care, agriculture and food products,
and finance.
Empirical Research is a critical component of a comprehensive scientific enquiry in any academic
discipline. The central idea of such an approach is the use of data to solve problems or achieve
objectives. The increasing ability to store large data sets and process them effectively, and the advent of
powerful algorithms to analyse the data has increased the importance of such an approach in making
significant contributions to an academic discipline. This trend is particularly prominent when contrasted
with the use of theory based approaches, which rely on axiomatic extensions based on principles of
reasoning. While these two approaches are not mutually exclusive, and in most cases should augment
each other in providing a comprehensive understanding of the subject of investigation, the advent of
the petabyte age has greatly tilted research contribution in favour of the data driven approach. At the
very least a data driven approach seeks to validate findings from fundamental theory, and at best it
wholly supplants the need for any fundamental theory to make conjectures and testable hypotheses. In
this light, it becomes critical for researchers to use and work with data and analytics to make a
meaningful contribution to the body of literature in their respective field.
The impact of this work can also be felt in the area of data security and privacy. The proposed project
requires strengthening the perimeter security of a data centre as well as putting in place access control policies,
mechanisms and architecture.
International Initiatives
Big Data for Development: United Nations (UN) initiatives
The recent waves of global shocks – food, fuel, and financial – have revealed a wide gap between the
onset of a global crisis and the availability of actionable information that can help protect the world’s
most vulnerable populations against further regressions.
Traditional statistics, household surveys and census data have been effective in tracking medium- to long-term development trends, but can be ineffective in generating the type of real-time picture that decision
makers need in order to develop timely responses to ongoing issues. For example, much of the data used
to track progress toward the Millennium Development Goals (MDGs) dates back to 2008 or earlier and
doesn’t take into account the more recent economic crisis.
While this may feed a perception that there is a scarcity of information about the wellbeing of
populations, the opposite is in fact true. Thanks to the digital revolution, there is an ocean of data, being
continuously generated in both developed and developing nations, that did not exist even a few years
ago.
Since its inception in 2009, Global Pulse has been investigating the viability of using new and alternative
data sources to support development goals. This includes data from:
i. Online Content - Public news stories, blogs, Twitter, Facebook, obituaries, birth announcements,
job postings, e-commerce, etc.
ii. Data Exhaust - Anonymized data generated through the use of services such as
telecommunications, mobile banking, online search, hotline usage, transit, etc.
iii. Physical Sensors - Satellite imagery, video, traffic sensors, etc.
iv. Crowdsourced Reports - Information actively produced or submitted by citizens through mobile
phone-based surveys, user generated maps, etc.
It has become clear that protecting social development gains requires the ability to quickly, and as
accurately as possible, profile and respond to crises that have the potential to undo decades of
development work. Today’s shocks—fast, global, and fluid—demand more agile response systems.
The private sector is already finding ways to efficiently analyze this new data to better understand its
customers. Innovative companies are utilizing real-time analytics to better understand the changing
needs of their customers and to respond with more agile platforms.
The United Nations (UN) Global Pulse is working to design approaches for harnessing big data and real-time analytics for monitoring development progress, emerging vulnerabilities and the overall
well-being of the populations the UN serves.
Other global initiatives
International focus on data science has been growing over the last decade, and over the last
two years it has drawn intense involvement from various quarters. This has led to what is often heralded
as the ‘Big Data’ Revolution. The activity in this area is at different levels, ranging from governments that
look at data science to address their problems (make cities safer, lower energy dependency, tackle
healthcare, agriculture, etc.), to businesses that are aiming to be more profitable, and finally to
academic institutions that are conducting research to improve the knowledge we gain from information.
A review of some of the key international research activities is presented below:
The White House announced in March 2012 a "Big Data Research and Development Initiative" that
consisted of six Federal departments and agencies. This 200 million dollar initiative works with the NSF
(National Science Foundation), NIH (National Institutes of Health), Department of Defense, Department
of Energy, and the U.S. Geological Survey. This initiative is aimed at helping to solve some of the United
States’ “most pressing challenges by improving the ability to extract knowledge and insights from large
and complex collections of digital data.”
The European Commission has funded the “Big Data Public Private Forum”. It is partnered with 11
other institutes (industry and academia) with the vision of building a self-sustainable industrial
community around Big Data in Europe.
In May 2012 Intel entered into a partnership with MIT’s CSAIL (Computer Science and Artificial
Intelligence Laboratory) through a contribution of 12.5 million and the establishment of the
bigdata@csail initiative. In this program, experts in hardware and software development, theoretical
computer science, and computer security come together to develop new architectures capable of
sorting and storing massive quantities of information, as well as the algorithms that can process them.
This was launched alongside the U.S. state of Massachusetts’ inaugural “Massachusetts Big Data
Initiative”, which provides funding from the state government and private companies to a variety of
research institutions.
In May 2013 the UK government and a private philanthropist created a £30 million “Big Data” health
research centre at the University of Oxford. This follows an already complete £35 million first phase of
the centre – The Target Discovery Institute – which won a further £10 million for research
activity. The UK government has also categorized “Big Data” as one of the “eight great technologies”
outlined by Universities and the Science Minister as being a government priority.
The AMPLab at the University of California, Berkeley, is a five year, multi-million dollar Big Data
initiative. They receive funding from the NSF, DARPA and many industrial sponsors.
The NSF Cluster Exploratory (CluE) program provides NSF-funded researchers software and services
running on a Google-IBM cluster to explore innovative research ideas in data-intensive computing.
National efforts
Similar to the various international initiatives, there has been interest from different corporations and
educational institutions in Data Science. A few of these initiatives are summarized below:
At IIT Bombay, a group of researchers investigates issues related to indexing web data,
organizing the semi-structured information found on the web, structured learning and large scale
optimization. The focus of the group is on algorithms for web data. They are funded by Yahoo! labs,
Microsoft, IBM, HP labs, and others. Prof. Soumen Chakrabarti is recognized as one of the world leaders
in web mining and indexing, while Prof. Sunita Sarawagi is likewise a leader in the domain of structured
output prediction. While they have a large facility for distributed computing funded by Yahoo! labs, the
activities are not exclusive to that facility.
Indian Institute of Science, Bangalore has several groups working in related areas across multiple
departments. In the Computer Science and Automation Department, the machine learning groups work
on large scale optimization and ranking problems. While the focus is not explicitly on "big" data
analytics, they are one of the most successful groups in the country in terms of their research output.
Prof. Jayant Haritsa is a well-recognized expert on database systems and has been working in
collaboration with IBM research in building data management applications that can handle large
volumes of data. In the Supercomputing Education and Research Centre (SERC) there have been efforts
to start big data analytics facilities, especially focused on the study of biological systems. In the ECE
Department there is a network analysis group that studies complex networks and deals with issues of
scale.
IIT Delhi and IIT Kanpur have data analytics groups that look at different aspects of data handling,
storage and analytics. There is a database management and information extraction meta-group that has
been formed recently across IIT Bombay, Delhi, and IIIT Delhi, under the IMPECS scheme with each
institution focusing on sub-areas in this domain. IIT Kharagpur has a large complex networks group and
a center for network analysis, again under IMPECS. They also have several researchers who look at data
analytics and scaling to large data volumes.
The Indian Institute of Management, Bangalore operates a Data Centre and Analytics Lab. The purpose
of this initiative is to support interdisciplinary empirical research using data primarily on India and other
emerging markets. The centre also offers a one year certificate program in Analytics.
IIIT Hyderabad is another institute that is active in data analytics work. While there have been several
successful products and startups that have been incubated there, they do not focus on large data
handling issues. Some of the notable ideas from their groups are the eSagu system for agricultural
analytics, and veeoz, a real-time social media tracking system.
One feature that will distinguish our efforts from the rest is that we are looking at data from engineering
systems as well as biological, technological, financial and social media sources. To the best of our
knowledge, such a concerted effort is not under way on a large scale elsewhere. IIT Bombay has a
group that looks at power system analytics, headed by Prof. Soman. The complex networks group at IIT
Kharagpur, headed by Prof. Niloy Ganguly, has analyzed the Indian Railways network and derived many
interesting insights.
There are several groups in Indian industry that look at big data analytics. The group most highly regarded
globally is the one at Microsoft Research. Not only do they publish cutting-edge research, they also
contribute very actively to data analytics product development at Microsoft. IBM Research labs has
probably the largest collection of researchers working in data analytics and related areas, organized into
different groups such as business analytics, information management, human language technologies,
etc. Many of the large labs have active machine learning/data analytics research groups; notable among them are
Yahoo! Labs, GE Research, Xerox Research Center India, Adobe Research, etc.
In addition to the national importance attached to data sciences, there has also been considerable focus
recently in India on building Smart Cities. In the budget for the current fiscal year the government
planned to develop 100 smart cities across the country through a $1.2 billion investment.
Gap areas
From the previous two sections it should be abundantly clear that research institutes, government
bodies, and corporations are taking Data Science and Big Data initiatives very seriously. However, most
of these efforts are focused on areas in IT that are already well established. Further, the
participants and users of these systems are knowledgeable about the use of IT and computers. One of
the distinguishing features of this effort is the interdisciplinary nature of this initiative. A further focus
area in our context is to reach the masses with low or minimal knowledge of IT. In this context,
participating in the development of this rich research area and ensuring reachability to end users has
huge implications for the near future. With the right effort and people, the Big Data revolution can be
useful in many ways.
HOW BIG IS DATA IN INDIA?
i. We are living in the age of information overload. A huge amount of data is constantly being
generated around us. Increasingly, automation is being adopted and consequently leads to
greater amounts of data. The challenge today for enterprises as well as small and medium
businesses (SMBs) is manifold. Indian SMBs and enterprises are sitting on a gold mine of
information. Making sense of these huge data sets has become imperative. In these
circumstances, big data analytics has become one of the more talked about topics in India.
ii. Big data has tremendous potential in India. With social media usage on the rise and increased
adoption of technology by sectors such as BFSI (banking, financial services, and insurance), retail,
hospitality, etc., big data analytics is on the agenda of boardrooms across Indian enterprises.
However, most Indian enterprises are still coming to terms with this concept. While everybody
realizes the importance and the potential to analyze these data sets, very few have the
capability of doing it. It is widely accepted that Indian enterprises base their decisions mostly on
intuitions and ‘gut-feel’ and have barely scratched the surface in terms of using data for
decision-making.
iii. In India, many of the large enterprises have started using or are contemplating the use of big
data analytics. SMBs are still some distance away from adopting this concept. Their challenges
are more basic – effective data storage and management. However, many medium
businesses that are already past the initial stages of IT adoption are expected to take this up
shortly.
Data Science: R & D PERSPECTIVE
In the Big Data research context, so-called analytics over Big Data is playing a leading role. Analytics
covers a wide family of problems arising mainly in the context of Database, Data Warehousing and Data
Mining research. Analytics research is intended to develop complex procedures that run over very large
data repositories with the objective of extracting useful knowledge hidden in such
repositories. One of the most significant application scenarios where Big Data arises is, without doubt,
scientific computing. Here, scientists and researchers produce huge amounts of data per day via
experiments (e.g., in disciplines like high-energy physics, astronomy, biology, bio-medicine, and so forth).
But extracting useful knowledge for decision-making purposes from these massive, large-scale data
repositories is almost impossible for current DBMS-inspired analysis tools. From a methodological point
of view, there are also research challenges. A new methodology is required for transforming Big Data
stored in heterogeneous and different-in-nature data sources (e.g., legacy systems, Web, scientific data
repositories, sensor and stream databases, social networks) into a structured, hence well-interpretable
format for target data analytics. As a consequence, data-driven approaches, in biology, medicine, public
policy, social sciences, and humanities, can replace the traditional hypothesis-driven research in science.
The research problems linked to the discovery of new insights from big data belong to a novel
and rapidly expanding research domain: machine learning. At the intersection of statistics, computer science
and emerging applications in industry, this research domain focuses on the development of fast and
efficient algorithms for processing data, with the main goal of delivering accurate predictions of various
kinds. To name only a few applications, think of business cases such as product recommendation,
segmentation of customers, fraud detection or churn prevention. Machine learning techniques can solve
such applications using a set of generic methods that differ from more traditional statistical techniques.
The emphasis is on real-time and highly scalable predictive analytics, using fully automatic and generic
methods that simplify most of the problems of data analytics. At the user layer, visualization and
interactive exploration are important problems for Big Data. A novel class of visualization metaphors,
methodologies and solutions must be devised in order to cope with the emerging challenges posed by
the visualization of Big Data; real-time visualization of extracted core data, visualization of
mashed-up data, and effective visualization over mobile devices are interesting problems. Coupled with
visualization issues, interactive exploration issues are critical milestones to traverse in the context of Big
Data research; in fact, enormous-sized data are difficult to explore while extracting useful knowledge.
Strategies need to address issues such as conceptual navigation, concept drift, interaction metaphors,
and so forth.
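To make one of the business cases named above concrete, customer segmentation can be approached with a generic clustering method such as k-means. The following is a minimal sketch under stated assumptions: the customer features (visits per month, average spend), the sample values and the deterministic initialisation from the first k points are all illustrative choices, not part of this report.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then re-centre."""
    # Assumption: start from the first k points so the run is repeatable.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
    return centroids, labels


# Hypothetical customers: (store visits per month, average spend per visit).
customers = [(1, 200), (9, 1500), (2, 250), (10, 1400), (1, 180), (8, 1600)]
centroids, labels = kmeans(customers, k=2)
```

The returned labels assign each customer to a segment. In practice the features would be rescaled first, since here the spend column dominates the distance, and the initial centroids would normally be chosen more carefully.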
Environmental monitoring has become reliant upon automated sensors for data acquisition. This
results in the generation of large, high-dimensional data streams (‘Big Data’) that personnel must search
through to identify data structures. Nature-inspired computation, inclusive of artificial neural networks
(ANNs), affords the unearthing of complex, recurring patterns within sizable data volumes. This has
applications in agriculture, weather monitoring, epidemiological study, traffic planning, pollution
monitoring, and ecological and natural resource management.
Data: Science & Technology - Challenges
Some of the S&T challenges that researchers across the globe and in India are facing relate to the data
deluge in:
i. Astrophysics
ii. Materials Science
iii. Earth & atmospheric observations
iv. Energy
v. Fundamental Science
vi. Computational Biology, Bioinformatics & Medicine
vii. Engineering & Technology, GIS and Remote Sensing
viii. Cognitive science
ix. Statistical data
These challenges require development of advanced algorithms, visualization techniques, data streaming
methodologies and analytics. The overall constraints that the community is facing are:
• The IT challenge: Storage and computational power
• The computer science: Algorithm design, visualization, scalability (Machine Learning, network &
Graph analysis, streaming of data and text mining), distributed data, architectures, data
dimension reduction and implementation
• The mathematical science: Statistics, Optimisation, uncertainty quantification, model
development (statistical, Ab Initio, simulation), analysis and systems theory
• The multi-disciplinary approach: Contextual problem solving
Data Science: Businesses perspective
ANALYTICS COMPANIES IN INDIA DURING LAST TWO YEARS:
‘Analytics India Magazine’ has published a study on how analytics organizations are coming up in various
cities in India and where the action is taking place. By Analytics Organizations it refers to companies that
provide services externally around analytics and related fields. This can include training organizations or
even large consulting companies with analytics as a service line. It has also included product companies
that have created products with a deep focus or dependency around analytics. The study provides
the following insights into the potential of business analytics (BA) in the country:
i. 6% of analytics organizations worldwide are either based out of India or have operations in
India.
ii. The number of analytics companies in India has grown threefold in the last 1.5 years.
iii. Analytics firms have also grown in size. A year back, the percentage of analytics firms in
India with fewer than 50 employees was 71%. This year, this number has decreased to 66%.
iv. Bangalore is still the hub of analytics in India, though other cities are coming up.
Analytics Industry- A key to growth of India
Imagine a situation where someone is browsing the men’s shoes section at Pantaloons, is about to buy
a pair, and then receives a message from Indiatimes: “The same shoe is being offered with 25% discount,
just login here”. A scanner reads the shoe data, the customer’s Pantaloons card is linked to his mobile
and his mobile is linked to Indiatimes. Indiatimes and Pantaloons are doing joint marketing. It is a win-win
situation for everybody that is only possible with the help of analytics. Analytics is therefore no longer a
luxury for an organization but a hygiene factor. Let us take a brief look at the current analytics industry
of India:
i. Size of the Indian analytics market: $375 million
ii. Number of companies operating in this segment in India: more than 500
iii. Expected size of the Indian analytics market by 2017: $1.15 billion, as per a Business Standard report.
Big Data Analytics and Digital Social Networks
This is a focused research area engaged in the analysis of social networks in the context of the new
digital cultural ethos of India. Such analysis has many possible dimensions, ranging from the
technological, sociological, cultural and economic to the strategic. There is a special need in the country today
for a deep analytical capability around the content and activity of social networks. Since much of the
content of social networks is textual information (emails, blogs, tweets, SMS, websites, documents),
audio and images/video, the information sciences involved in data analytics of social networks would
have to span spectral, image, text and quantitative analytics.
The ability and capability to monitor and detect patterns of information flow in social networks can
provide extremely strategic value to national security:
• Alerting the national security agencies to disturbing or threatening trends that are not obvious.
• Determining hot spots in the networks that should be monitored and leveraged for rapid
broadcast of information that can have salutary effects to calm the citizenry and counter the
effects of harmful disinformation.
At the same time, there would be scholarship and commercial value to be derived from the expertise
created by this activity. This would allow an embedding of the strategic effort as a covert effort for
obvious advantage.
Desiderata and Resources Required:
i. Data: Distributed data centres and dedicated broadband connectivity to other centres, to
eventually provide a seamless semantic experience with high-speed access to huge amounts
of real-time data, so that analysis can be carried out with reasonable turnaround times. The irony is
that openness in society makes us vulnerable to terrorism and yet openness is key to a good
defense. While we are always interested in connecting the dots, collecting the dots is a crucial
first step!
ii. Curation: With such large data streams, it is critical to have technology for automated
classification and clustering, with human oversight for better accuracy. Many of the required
technology pieces are available today, but the challenge lies in integration and in engineering
effective pipelines that prepare the data in knowledge bases in a form that can be
leveraged for rapid interpretation and inference.
iii. Analysis: connecting the dots, discovering patterns, generating hypotheses, predicting
outcomes. This needs a crack technical team with strong mathematical and decision sciences
training. There also have to be a few key domain experts feeding the technologists with the key
questions as well as keeping them honest with quick feedback on the interpretations. The core
team working on strategic issues related to national security would be embedded within this
analytics team.
Big Data: Business Analytics
Business Analytics is the science of examining data (Big Data in the form of text, quantitative, qualitative,
etc.) to bring forth underlying information. This information can reveal undiscovered patterns or
establish hidden relationships, which can shape the decision-making capability of an organization.
There are two important facets of analytics. The first is practical intuitiveness: a given data set can be
analyzed in hundreds of ways, and while none can be completely correct, each gives some direction.
The point is to pursue that direction and keep updating it as trends change.
The second is real-time response: if the Google search engine took an hour to list all possible matches,
it could not have achieved its enormous scale. All the analysis has to be done on the fly.
APPLICATIONS OF BIG DATA-BUSINESS ANALYTICS IN GOVERNMENT SECTOR:
There are many ways in which ‘Big Data – Business Analytics’ can be leveraged by the Central and
State Governments to drive growth, bring about change and implement various policies
and government schemes. Some of the prominent areas are:
i. Aadhaar: As the majority of citizens in the country (more than 60 crore at the last count) have
been provided with an Aadhaar number, the governments can use this facility to plan,
implement and monitor their citizen-related initiatives.
ii. Direct Benefit Transfer Scheme: The governments can decide the funding for various
schemes, ensure that the money reaches the beneficiaries, and keep track of improvement
and growth within each scheme and in any particular region where people benefit from
it.
iii. Impact of Election and Voting system: Governments can analyze this big data to frame
policies and schemes based on those statistics, which will help the people of the country
as well as the growth of the country.
iv. Impact and conditions of Infrastructure Projects: Analysis of the large amount of data
periodically collected can help the governments in preserving critical infrastructure all over
the country.
v. Impact of Education: Analysis of the large amount of data periodically collected about
delivery, outputs, outcomes and impact of the education initiatives at primary, secondary
and tertiary level can be useful in formulating education policies.
vi. Impact of Health care initiatives: Analysis of the large amount of data periodically collected
about delivery, outputs, outcomes and impact of the healthcare initiatives at primary,
secondary and tertiary level can be useful in formulating healthcare policies.
BUSINESS ANALYTICS FOR TAX ADMINISTRATION:
The Central as well as State Governments are involved in multiple tax regimes, at the corporate as well as
the individual level. The country's income tax-payer base itself is about 3 crore, and the number has been
inching up slowly for the last 5-10 years; the government would like to see it grow at a faster
pace. Consider the following relevant facts:
• According to government data, the total tax payers in the country stood at about 3.24 crore
during fiscal year 2011-12 (FY12).
• The Finance Ministry had collected Rs. 4.73 lakh crore in indirect taxes during 2012-13. For the
current fiscal, it has fixed the target of collecting Rs. 5.65 lakh crore in indirect taxes, comprising
customs, excise and service tax.
• Total collection of indirect taxes stood at about Rs. 2,28,550 crore during the first six months of
2013-14.
• Direct tax collection from corporate and income tax payers, which was at Rs. 14,530 crore till
August, surged to Rs. 18,077 crore till September 15, 2013.
• Our total direct taxes are only 9 per cent of our GDP, whereas it should be about 18 per cent,
and you cannot raise it by taxing people who you have already taxed.
The governments are always looking for efficient ways and means of ‘Improving Tax Administration’.
This is possible by analyzing the huge amounts of data available on various parameters typical to the tax
regime, such as spending patterns and interstate movement of goods.
BIG DATA ANALYTICS AND THE INDIA EQUATION
To tap the analytics momentum, India now needs to build a sustainable analytics eco-system that brings
in a strong partnership across the industry players, government, and academia. Some of the key actions
for the analytics eco-system in India would be:
i. Talent Pool - Create industry-academia partnerships to groom the talent pool in universities,
and develop a strong internal training curriculum to advance analytical depth.
ii. Collaborate - Form analytics forums across organization boundaries to discuss the pain points
of the practitioner community and share best practices to scale analytics organizations.
iii. Capability Development - Invest in long-term skills and capabilities that form the basis for
differentiation and value creation. There needs to be an innovation culture that will
facilitate IP creation and asset development.
iv. Value Creation - Building rigor to measure the impact of analytics deployment is critical
to earning legitimacy within the organization.
Big Data and analytics offer tremendous untapped potential to drive big business outcomes. For
organizations, leveraging India as a global analytics hub can be one of the key levers to move up
their analytics maturity curve.
HOW CAN BUSINESS ANALYTICS MAKE INDIAN BUSINESS MORE COMPETITIVE?
The role analytics plays in organizations today goes far beyond the slipshod Excel/pivot table culture of
yesteryear. Analytics has now become a de facto requirement in organizations, with companies
designating dedicated teams with KRAs to achieve specific revenue numbers using analytics. This is
corroborated by a recent report from the research firm Gartner, which says that the benefits of
fact-based decision-making are clear to business managers in a broad range of disciplines, from marketing,
sales, supply chain management, manufacturing and engineering to risk management, finance and HR. The
firm predicts that BI and analytics will remain the top focus for CIOs through 2017. Consider the
following examples from Indian Industry and Business:
i. Take the case of the retailer Shoppers Stop, which uses analytics to mine customer preferences and
buying behaviour to source merchandise more intelligently and connect with the customers on
things they would like to see at the stores. The retailer’s buying team uses sales data to figure
out what is selling and where, which in turn enables it to take supply decisions. While earlier the
time taken for category performance reviews ran into days, now with the technology in use,
insights are available in a couple of hours.
ii. Flipkart faced a similar challenge, where there was a pressing need to improve inventory
utilization. Flipkart needed to integrate complex data from disparate sources and deliver
analytical data to the staff in various departments. Using a BA solution, Flipkart was able to
optimize stock levels and lower costs associated with excess stock, improving its inventory
utilization by 5 percent and providing up-to-date analytics for embedded, data-driven decision
making.
iii. Aircel had a variety of heterogeneous systems for capturing massive amounts of customer data,
presenting the business with the gruelling task of extracting information from the vast amount of
data in disparate systems and getting an integrated view of customers to analyze customer
demographics, usage patterns, social behaviour and more. By using an analytics solution, an
integrated data view was achieved, enabling a 360-degree view of the customer life cycle,
including tasks such as customer identification, customer acquisition, customer relationship
management, customer retention and customer value enhancement.
iv. For Mahindra & Mahindra, analytics has opened up new avenues for interacting with the
customers. For instance, if a customer goes to a dealership, the services available
can now range from offering exchange for used cars to insurance and so forth, based on the customer
information on file.
v. Another example of innovative usage of analytics is Mahindra’s electric vehicle, the E2O
(formerly REVA), which is equipped with switches that continuously send back information
about the vehicle/battery performance of the car to the company, where this data is now being
crunched. If this data can be juxtaposed with GPS data, it will open up new avenues for
interaction with the customer.
MOVING TOWARD AN ANALYTICS CULTURE.
Staying the course: While the point of inflection, where value exceeds investment, may still remain
elusive for many companies, the business should clearly recognize that the shift to analytics is a
long-term endeavour.
Get the executives on board: Even though business analytics initiatives are typically incremental, getting
the top brass to see the value will help drive a culture in which the norm is data-based decision-making.
According to a recent survey, effective users of business analytics are nearly always (86%) in
organizations where executive management places a great deal of trust in the results of analytics.
Getting quick wins on important issues can help gain the confidence of senior management.
Data comes first: Before embarking on analytics initiatives, organizations need to assess the
effectiveness of their data-management strategies. Those who have a solid approach to their data are
more than twice as likely to have successful analytics programs. Viewing data as a strategic asset—and
as the backbone of effective decision-making—is a key element to an analytics culture.
Get your “analytics” on: Organizations desirous of reaping high benefits from business analytics have to
boldly move into new technology. They have to significantly increase their use of analytics at nearly four
times the rate of other companies.
Share the knowledge: In developing the analytics culture, “silo-busting” is essential. Information and
data must be shared across the organization. People must have access to the data they need. Effective
users of business analytics have to be much more proficient than their counterparts at collaborating and
sharing information.
Integrate: Companies that wish to take the next step beyond collaboration, namely integration across the
organization, have to be well on their way to building a strong analytics culture. Integration is one of the
key components in getting benefits from analytics. The “competitive edge” so often promoted in
the marketplace really only comes when the organization takes a holistic approach to analytics.
Hire the right talent: Adoption of analytical tools without the right people to make the best use of them
can prove to be a poor investment. In developing a functional analytics culture, the linchpins are people,
process, and infrastructure.
Find your equilibrium: The average mix of intuition to analytics in decision-making should be 60/40. For
those organizations using analytics effectively, the scale tips toward analytics at 53/47, versus
62/38 for all others.
The broad contours of the BDI programme positioning strategy shall be:
i. To develop core generic technologies, tools and algorithms for wider applications by Govt,
planners and policy makers.
ii. To understand the present status of the industry in terms of market size, different players
providing services across sectors/ functions, opportunities, SWOT of industry, policy
framework (if any), present skill levels available etc.
iii. To carry out a market landscape survey to assess the future opportunities and demand for skill
levels in the next 10 years.
iv. To carry out gap analysis in terms of skill levels and policy framework.
v. To evolve a strategic Road Map and micro level action plan clearly defining the roles of
various stakeholders – Govt., Industry, Academia, Industry Associations and others, with
clear timelines and outcomes for the next 10 years.
Deliverables and cost benefit Analysis
Smart city Transportation: The Centre of Excellence in Urban Transport has been recording GPS data
from over 75 Metropolitan Transport Corporation buses for several months now. This voluminous data
is stored in a database provided by the supplier of the GPS hardware equipment. There exists
significant scope for optimizing the storage and retrieval of this data. This is a critical need, since the data is
used for real-time travel time information provision as well as bus arrival prediction. The work here shall
address the scalability issue of such real-time databases. Traffic data from 25 video cameras installed on
the road medians and shoulders, transmitted through a dedicated wireless network, is also being
collected at the Centre and requires a standardized archival and retrieval system. Low-level
real-time image processing techniques are used to convert the data into useful traffic information. Data
analytics techniques can be used to better extract useful information from the video data or the
processed data. These data are anticipated to be of high importance for researchers and practitioners
alike.
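As a minimal sketch of how recorded GPS data could feed travel-time information, the Python example below averages the observed travel time for each pair of consecutive stops. The record layout (bus id, stop id, arrival time in seconds), the function name and the sample values are assumptions made purely for illustration; the schema of the Centre's actual database is not described in this report.

```python
from collections import defaultdict

def segment_travel_times(pings):
    """Average travel time (seconds) between consecutive stops.

    `pings` is an iterable of (bus_id, stop_id, timestamp_seconds) tuples,
    a hypothetical layout with one record per stop arrival.
    """
    by_bus = defaultdict(list)
    for bus_id, stop_id, ts in pings:
        by_bus[bus_id].append((ts, stop_id))

    totals = defaultdict(float)
    counts = defaultdict(int)
    for visits in by_bus.values():
        visits.sort()  # order each bus's stop arrivals by time
        for (t0, s0), (t1, s1) in zip(visits, visits[1:]):
            if s0 != s1:  # skip repeated records at the same stop
                totals[(s0, s1)] += t1 - t0
                counts[(s0, s1)] += 1
    return {seg: totals[seg] / counts[seg] for seg in totals}

# Illustrative records only, not real bus data.
pings = [
    ("bus1", "A", 0), ("bus1", "B", 300), ("bus1", "C", 720),
    ("bus2", "A", 60), ("bus2", "B", 420),
]
averages = segment_travel_times(pings)
print(averages)
```

A production system would more likely maintain such averages incrementally per time of day, which is where the storage and retrieval scalability mentioned above becomes important.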
Smart Grid data: Smart grid development is one of the most important technology revolutions taking
place, as electricity grids are among the world's largest pieces of infrastructure yet to be digitized. To
fully leverage the capabilities of grid enhancements, one has to naturally turn to data analytics. A large
amount of real-time data can be collected from smart meters, PMUs and other sensors in smart
electricity grids. These data can be used to detect events such as severe voltage and frequency
fluctuations or a sudden increase in demand from a particular location, and in some cases even to predict or
detect blackouts or cyber attacks. Further, the data can be helpful in developing models for forecasting the
load. It is proposed to develop new methods for detecting events in real time using the multi-sensor
data. The project involves theoretical and simulation studies. Some of our collaborators at VJTI Mumbai
(IITM has an MoU with them) are working with Power Grid Corporation of India Ltd (PGCIL) and are
willing to share their PMU (phasor measurement unit) data with us. Another possible source of data is the
PEPS group at IITB, who monitor real-time PMU data across India.
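To illustrate what event detection on a sensor stream can look like in its simplest form, the sketch below flags samples that deviate strongly from a trailing window of recent readings. It is not the method proposed for the project; the window size, the threshold and the sample frequency values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev

def detect_events(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window
    mean by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    events = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Guard against a perfectly flat window (sigma == 0).
            if abs(value - mu) > threshold * max(sigma, 1e-6):
                events.append(i)
        history.append(value)
    return events

# Illustrative grid frequency readings in Hz, with one sudden dip.
frequency_hz = [50.00, 50.01, 49.99, 50.00, 50.01, 50.00, 49.60, 50.00, 50.01]
print(detect_events(frequency_hz))  # prints [6]
```

Note that the dip itself enters the window and inflates the spread for the next few samples, so a practical detector would typically use a more robust baseline and combine several sensors.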
Water Flow networks data: Urban water distribution networks are being renovated and instrumented in order to
ensure 24x7 water supply in several cities of India. Data related to flow rates, pressures, and tank levels can be
continuously obtained and used to (i) monitor the performance of the network, especially with respect to
leakages and non-revenue water, (ii) optimize operations such that water can be delivered to customers
at desired flow rates and pressures, and (iii) determine the health of sensors and pipes for scheduling
maintenance. The Pimpri-Chinchwad municipality near Pune, with more than 1 lakh connections, has already
achieved 85% metering of connections and has been gathering data for the past several months. The
municipality is willing to provide us the data to enable the solutions identified above to be developed and
later implemented in its operations.
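The monitoring of non-revenue water in item (i) rests on a simple mass balance: water supplied to a zone minus water recorded at customer meters. The sketch below shows that arithmetic; the function name, the single-zone setup and the volumes are hypothetical and are not taken from the municipality's data.

```python
def non_revenue_water(inflow_m3, metered_consumption_m3):
    """Return (volume, fraction) of water supplied to a zone but not
    recorded at customer meters over the same period."""
    if inflow_m3 <= 0:
        raise ValueError("inflow must be positive")
    billed = sum(metered_consumption_m3)
    lost = inflow_m3 - billed
    return lost, lost / inflow_m3

# Illustrative volumes in cubic metres for one zone and one period.
zone_inflow = 1000.0
meter_readings = [220.0, 310.0, 180.0, 90.0]
lost, fraction = non_revenue_water(zone_inflow, meter_readings)
print(f"{lost:.0f} m3 unaccounted for ({fraction:.0%} of inflow)")
```

With only partial metering, the unmetered connections would have to be estimated before the balance is meaningful, which is one reason continuous data collection matters.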
Socio-Economic Initiatives data: Socio-economic issues can be divided into two parts: (i) health and (ii)
food security. A tropical country like India is prone to seasonal diseases such as dengue during the monsoon
season, cancer due to chewing of tobacco, and other life-threatening diseases like tuberculosis. Further,
the affected people are usually from rural areas, with limited knowledge of the disease and its effects, and
often rely on traditional medicines. Insurance schemes propagated
by the Government will benefit from this initiative on data collection. For example, the Government of
Tamil Nadu offers insurance schemes for
people below the poverty line. Big data analysis of patient profiles will help in predicting outbreaks of
diseases and in taking proactive action, thereby saving the Government money in terms of premiums
paid for insurance. With respect to food security, effective and efficient distribution of food products
that reaches people at the lowest level of economic status is of primary importance. Big Data
analytics can help in ensuring that such people benefit from technological advancement.
Interdisciplinary studies and courses are very few in the current context. Such courses are offered by
institutions that want to use big data for their work but are not focused on Information Technology. For
example, agricultural institutes find it difficult to access high-end computing, even though
they collect ample data from the fields. Offering interdisciplinary courses
and helping researchers in such areas gain access to high-end computing will help in
moving the benefits to end users quickly. Our academic outreach will be at three levels: train IT
knowledgeable people to be good data scientists; train people with domain knowledge to be data
science knowledgeable so that they can interface better with data scientists; and finally educate end-users
on the possibilities of data science, in a model similar to a popular science program.
Biological network analysis: The recent decade has witnessed a paradigm shift in biology, from the
study of individual genes and proteins to the study of genes, proteins and metabolites interacting in a
concerted system of metabolic, signalling and regulatory networks. Fuelled by developments in
sequencing and other experimental techniques, a deluge of biological ‘omics’ data has been generated:
genomic (sequence), transcriptomic (gene expression), proteomic (protein quantitation) and
metabolomic (metabolite levels). Many challenges exist in the analysis, integration and assimilation of
the biological network data, to better understand biological system function and generate new hypotheses
for experimental verification. As the 2013 citation for the Nobel Prize in Chemistry put it, “Today the
computer is just as important a tool for chemists as the test tube.” Some of the problems currently being
addressed include the analysis of metabolic networks to identify critical targets for therapeutic
intervention, identification of essential proteins in protein interaction networks, and the learning of
reaction rules from complex metabolic networks.
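As a rough illustration of network-based screening, the sketch below ranks proteins by their number of distinct interaction partners, on the common working assumption that highly connected "hub" proteins are more likely to be essential. This is only one simple proxy and is not claimed to be the approach used by the groups mentioned above; the protein names and interactions are invented.

```python
from collections import defaultdict

def rank_by_degree(interactions):
    """Rank proteins by number of distinct interaction partners,
    highest first, breaking ties alphabetically."""
    neighbours = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    return sorted(neighbours, key=lambda p: (-len(neighbours[p]), p))

# Hypothetical protein-protein interactions (undirected pairs).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
ranking = rank_by_degree(edges)
print(ranking)  # prints ['P1', 'P2', 'P3', 'P4', 'P5']
```

Candidates ranked this way would still need experimental verification, in line with the hypothesis-generation role described above.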
Cancer genomics data: While sequencing the first human genome took over a decade of extensive
collaboration across labs, the three gigabases of the human genome can now be sequenced in a few
days on the ‘next-generation sequencing’ (NGS) machines. However, NGS data is in the form of short
reads and demands intensive computational analysis to crunch the data, re-assess quality and generate
sequences. It is also critical to develop infrastructure and algorithms for effective storage, indexing and
retrieval of the terabytes of NGS data. With the establishment of the National Cancer Tissue Biobank at
IITM, we are uniquely placed to analyse extensive genomic data from varied cancer tissues, of particular
relevance in the Indian context. NGS data lend themselves to a wide range of analyses, right from the
identification of critical genes mutated in cancerous tissues, to identifying changes at the gene or
signalling network level, through mathematical modelling.
Healthcare Data Analytics: Our group has been collaborating with various hospitals and research
institutes, such as Sankara Nethralaya Vision Research Foundation and Patterson Cancer Research
Centre, and has developed innovative early-stage screening algorithms. We are also working with
companies in the electronic health records domain to enable more insightful analytics on patient data.
The Centre will work toward building a large suite of algorithms that are specifically tailored for the
healthcare domain.
JEE/GATE data: It is perceived that IIT could greatly benefit from using state-of-the-art techniques for
storing, formatting and accessing data related to JEE and GATE applicants through a centralized repository.
This could lead to many data analytics initiatives. Examples include the detection of
duplicate exam attempts, analysis of demographic changes in applicants, and examination of the relationship
between JEE performance and subsequent college performance.
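One of the examples above, detection of duplicate exam attempts, can be sketched as grouping applications by a normalised identity key. The key used here (name plus date of birth), the record layout and the sample applicants are assumptions for illustration only; a real repository would likely need fuzzier matching and additional fields.

```python
from collections import defaultdict

def normalise(name):
    """Lower-case a name and collapse punctuation and repeated spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def find_duplicates(applicants):
    """Group application ids that share a normalised name and date of birth."""
    groups = defaultdict(list)
    for app_id, name, dob in applicants:
        groups[(normalise(name), dob)].append(app_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Invented applicant records: (application id, name, date of birth).
applicants = [
    ("A001", "R. Kumar", "1997-04-12"),
    ("A002", "r  kumar", "1997-04-12"),
    ("A003", "S. Iyer", "1998-01-30"),
]
duplicates = find_duplicates(applicants)
print(duplicates)  # prints [['A001', 'A002']]
```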
Telephonic networks data: Telephone service providers generate a lot of data per call that is placed
through their system. These are typically available as call data records which have information related to
the phone numbers involved, time of the call, duration, cell tower location, calling plan, charges
incurred, etc. While the individual call records provide a lot of information, organizing these into "call
graphs" typically leads to more insights. The challenge here is that a tremendous volume of data is
generated every day and the data is very dynamic in nature. So we need new techniques for processing
the data without delay and build systems that are responsive to the changing data. Typical questions
that end users are interested in are related to churn prediction, service recommendations, viral
marketing opportunities, graph evolution models, and behaviour analytics. While there are privacy
concerns in releasing live data, several organizations have made anonymised data available online. Even
if we are not able to obtain necessary permissions for live data, we can suitably organize the available
public data and build access mechanisms for them to facilitate research in this domain.
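The idea of organizing call data records into a call graph can be sketched as follows: each record becomes a directed edge from caller to callee, and repeated calls are aggregated into edge weights. The three-field record layout and the user identifiers are hypothetical; actual call data records carry many more fields, as listed above.

```python
from collections import defaultdict

def build_call_graph(records):
    """Aggregate call data records into a weighted directed graph.

    Returns {caller: {callee: [call_count, total_duration_seconds]}}.
    """
    graph = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for caller, callee, duration in records:
        edge = graph[caller][callee]
        edge[0] += 1
        edge[1] += duration
    return graph

# Invented, anonymised-style records: (caller, callee, duration in seconds).
records = [
    ("u1", "u2", 60), ("u1", "u2", 120), ("u2", "u3", 30), ("u3", "u1", 45),
]
graph = build_call_graph(records)
out_degree = {user: len(callees) for user, callees in graph.items()}
print(graph["u1"]["u2"], out_degree)
```

Because each record only touches one edge, this aggregation can be updated as new records arrive, which suits the dynamic nature of the data noted above.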
Financial data: Conducting research in financial markets and their functioning relies on actual market
microstructure data. Critical research work in the liquidity of securities, volatility, arbitrage (pure and
statistical), market making, lead-lag effects, etc., entails empirical research work that can be back-tested
on the actual market behaviour of exchange-listed securities. Working with market microstructure data,
which can grow to enormous sizes, can address many pressing questions in the behaviour of financial
markets.
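As one small illustration of the empirical work mentioned above, the sketch below estimates a lead-lag relationship by finding the lag at which one series correlates best with an earlier segment of another. The function names and the two short series are invented for illustration; real studies would use returns computed from microstructure data and test for statistical significance.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def best_lag(leader, follower, max_lag):
    """Return (lag, scores): the lag in periods at which `follower`
    correlates most strongly with earlier values of `leader`."""
    scores = {}
    for lag in range(max_lag + 1):
        x = leader[:len(leader) - lag]  # leader values up to n - lag
        y = follower[lag:]              # follower values `lag` periods later
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores

# Invented series: the follower repeats the leader one period later.
leader = [1, 3, 2, 5, 4, 6, 5, 8]
follower = [0, 1, 3, 2, 5, 4, 6, 5]
lag, scores = best_lag(leader, follower, max_lag=2)
print(lag)  # prints 1
```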
Statistical Experimental data: Controlled experiments are often carried out in every discipline where it
is critical to empirically understand a system, product or process. In order to make advances in
algorithms for experimentation, it is important to have a lot of datasets reflecting the different response
surfaces that typically undergo an experimental exercise. This initiative will carry forward an ongoing
research effort in this area to gather data sets published in journals across various engineering
disciplines.
Chemometric data: Chemometrics is the science of analysing experimental data generated from
chemical systems. In the laboratory, sophisticated analytical instruments such as Raman, IR, NIR,
NMR and UV-vis spectrometers, various chromatography systems, etc., are employed to indirectly measure the
quantity of chemicals in samples. The data generated using these instruments are multivariate
in nature. Further, chemometric methods are routinely applied to the “omics” data generated in
biological science, such as metabolomics, proteomics, genomics, etc.
The field of data science is emerging at the intersection of the fields of social science, statistics,
information and computer science and other application domain disciplines. Keeping in view the fast
growth of Data Science and Analytics in future across the various applications, it is imperative to chalk
out a strategic Road Map and investments in this direction to reap the benefits towards the overall
development of the country.
OBJECTIVES OF THE STUDY
• Assess the present status of the industry in terms of market size, different players providing
services across sectors/ functions, opportunities, SWOT of industry, policy framework (if
any), present skill levels available etc.
• Market landscape survey to assess the future opportunities and demand for skill levels in
next 10 years
• Gap analysis in terms of skills levels and policy framework
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various
stakeholders - Govt., Industry, Academia, Industry Associations and others with clear
timelines and outcome for the next 10 years.
• The international scenario may also be examined while evolving Strategic Road Map.
THE CONSULTATIVE APPROACH ADOPTED:
i. Two Consultative Meetings and four Interactive Workshops were held as per the details given
below:
CONSULTATIVE MEETINGS (CM) & INTERACTIVE WORKSHOPS (IW) ORGANIZED
S. No.   DATE       HELD AT            NUMBER OF PARTICIPANTS
1        28/11/14   New Delhi (CM)     34
2        07/01/15   Bengaluru (IW)     31
3        19/01/15   Pune (IW)          20
4        29/01/15   Hyderabad (IW)     40
5        20/02/15   Kolkata (IW)       52
6        25/03/15   New Delhi (CM)     42
TOTAL                                  219
ii. The Draft Report has been uploaded on the Consultancy Development Centre’s (CDC) website
www.cdc.org.in under the announcement section from 13.04.2015 for a period of two weeks, for
inviting comments/inputs of stakeholders.
iii. The Draft Report, after the consultative process at (i) and (ii) above, will be presented to the Secretary,
DST and the PDAC members tentatively on 27th May 2015.
iv. The final Report will suitably incorporate all the inputs received in the above consultative
process.
The present study, through a combination of primary and secondary research, has established the
need for an urgent initiative on the part of DST to (i) strengthen the dASTRA Ecosystem of the country and (ii)
take steps to nurture it, so as to leverage the uniquely advantageous position of the country’s
manpower not only in scientific research and development but in business and industry
as well.
The project is to be implemented in five years and the cost has been estimated to be around Rs. 580
Crores. The major activities of the project will include (i) R&D PROMOTION through Open Sky
Research, Cluster Based Network Programs, International Collaborative Research Program, (ii)
ESTABLISHMENT OF CENTRE OF EXCELLENCE FOR DATA SCIENCE, (iii) SKILL DEVELOPMENT CAPACITY &
TRAINING through Fellowship Based UG/PG & PhD, Short Term Training for Faculty,
On-Line Programs, National Workshops & Conferences, Collaborative Interactive Conferences,
Entrepreneur Development, (iv) INTERNATIONAL LINKAGES & COLLABORATIONS through UN (R&D
and Standards), Regional Associations/Collaborations, Bilateral & Multi Lateral Exchange Programs,
and (v) INFRASTRUCTURE DEVELOPMENT.
LIST OF PDAC MEMBERS
S. No.   NAME                       ORGANIZATION
1        Prof. Sankar K. Pal        Distinguished Scientist and Former Director ISI, Kolkata
2        Prof. Santanu Choudhury    Professor, IIT Delhi
3        Prof. Bapiraju             Professor, Central University of Hyderabad, Hyderabad
4        Prof. Ramesh Hariharan     Adjunct Faculty, IISc, Bangalore
5        Dr. Raghavendra Singh      IBM Research, New Delhi
6        Dr. Gautam Shroff          TCS Innovation Labs, New Delhi
7        Prof. Vijay Chandru        Adjunct Professor, ICTS, Bangalore
8        Prof. S. Pyne              Professor, CR Rao Advanced Institute of Mathematics,
                                    Statistics and Computer Science, Hyderabad
9        Shri Avnish Sabharwal      Accenture India (Pvt.) Limited, Bangalore
1. INTRODUCTION: NEW GENERATION COMPUTATIONAL PARADIGM
1.1. AN INTERNATIONAL PERSPECTIVE
1.1.1 AN OVERVIEW – DATA SCIENCE
Down through the years of human history, the most successful decisions that were made in the
world of business were based on the interpretation of available data. Every day, 2.5 quintillion bytes
of data are created—so much that 90% of the data in the world today has been created in the last
two years. Correct analysis of the data is the key success factor in being able to make better
decisions that are based on the data.
Given the quantity and complexity of the data that is being created, traditional database
management tools and data processing applications simply cannot keep up, much less make sense
of it all. The challenges for handling big data include capture, storage, search, sharing, transfer,
analysis, and visualization. The trend to larger data sets is due to the additional information that can
be derived from analysis of a single large set of related data, compared to separate smaller sets with
the same total amount of data. Some estimates for the data growth are as high as 50 times by the
year 2020.
1.1.2 DATA SCIENCE
Data science is deep knowledge discovery through data inference and exploration. This discipline
often involves using mathematical and algorithmic techniques to solve some of the most analytically
complex business problems, leveraging troves of raw information to figure out hidden insight that
lies beneath the surface. It centres on evidence-based analytical rigor and building robust decision
capabilities. Ultimately, data science matters because it enables companies to operate and
strategize more intelligently. It is all about adding substantial enterprise value by learning from data.
See figure 1.1 as given below.
FIGURE 1.1: DATA SCIENCE FOR BUSINESS
SOURCE: https://datajobs.com/what-is-data-science dated 3/4/15
Techopedia (http://www.techopedia.com/definition/30202/data-science dated 3/4/15) offers the following definition:
Data science is a broad field that refers to the collective processes, theories, concepts, tools and
technologies that enable the review, analysis and extraction of valuable knowledge and information
from raw data. It is geared toward helping individuals and organizations make better decisions from
stored, consumed and managed data. Data science enables the use of theoretical, mathematical,
computational and other practical methods to study and evaluate data. The key objective is to
extract required or valuable information that may be used for multiple purposes, such as decision
making, product development, trend analysis and forecasting.
1.1.3 DATA SCIENCE ECOSYSTEM:
Considering the above, Data science isn't new, but the demand for quality data has exploded
recently. This isn't a fad or a rebranding, it's an evolution. Decisions that govern everything from
successful presidential campaigns to a one-man startup headquartered at a kitchen table are now
based on real, actionable data, not hunches and guesswork. Because data science is growing so
rapidly, we now have a massive ecosystem of useful tools.
Since data science is so inherently cross-functional, it is really hard to categorize the companies and
the tools provided by them for users. But at the very highest level, they break down into the three
main parts of a data scientist's workflow: (i) getting data, (ii) wrangling data and (iii)
analyzing data. A schematic representation of the DATA SCIENCE ECOSYSTEM is given in figure
1.2 below.
FIGURE 1.2: DATA SCIENCE ECOSYSTEM
SOURCE: http://www.computerworld.com/article/2899647/the-data-science-ecosystem.html dated 3/4/15
1.1.4 DATA SCIENTIST:
Rising alongside the relatively new technology of big data is the new job title data scientist. While not
tied exclusively to big data projects, the data scientist role does complement them because of the
increased breadth and depth of data being examined, as compared to traditional roles. A data scientist
represents an evolution from the business or data analyst role. The formal training is similar, with a
solid foundation typically in computer science and applications, modeling, statistics, analytics and
math. What sets the data scientist apart is strong business acumen, coupled with the ability to
communicate findings to both business and IT leaders in a way that can influence how an organization
approaches a business challenge. Good data scientists will not just address business problems; they
will pick the right problems that have the most value to the organization.
Whereas a traditional data analyst may look only at data from a single source – a CRM system, for
example – a data scientist will most likely explore and examine data from multiple disparate sources.
The data scientist will sift through all incoming data with the goal of discovering a previously hidden
insight, which in turn can provide a competitive advantage or address a pressing business problem. A
data scientist does not simply collect and report on data, but also looks at it from many angles,
determines what it means, then recommends ways to apply the data.
Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning
existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist
will then communicate informed conclusions and recommendations across an organization’s
leadership structure.
As per Techopedia, a data scientist is an individual that practices data science. Data science techniques
include data mining, big data analysis, data extraction and data retrieval. Moreover, data science
concepts and processes are derived from data engineering, statistics, programming, social engineering,
data warehousing, machine learning and natural language processing, among others.
1.1.5 DATA SCIENCE BEYOND 2015:
Kurt Cagle, an information architect, data scientist, author and industry analyst working with Avalon
Consulting, LLC., predicts, (https://www.linkedin.com/pulse/ten-trends-data-science-2015-kurt-cagle#a11y-content
dated 3/4/15) the following for DATA SCIENCE during 2015:
• Rise of Data Virtualization
• Hybrid Data Stores Become More Common
• Semantics Becomes Standard
• Databases Become Working Memory
• Move Towards a Universal Data Query Language
• Data Analytics Moves Beyond SQL
• Data Science Teams: will consist of Integrator, Data Translation Specialist, Curators, Data
Scientist, Domain Expert, Visualizers and Data Science Manager.
Big Data News (http://www.bigdatanews.com/profiles/blogs/13-new-trends-in-big-data-and-data-science dated 3/4/15) has forecasted
the following trends in relation to use of Data Science in the time to come:
• The rise of data plumbing, to make big data run smoothly, safely, reliably, and fast through all
"data pipes" (Internet, Intranet, in-memory, local servers, cloud), optimizing redundancy, load
balance, data caching, data storage, data compression, signal extraction, data summarization
and more.
• The rise of the data plumber, system architect, and system analyst (a new breed of engineers
and data scientists), a direct result of the rise of data plumbing.
• Use of data science in unusual fields such as astrophysics, and the other way around (data
science integrating techniques from these fields).
• The rise of right-sized data (as opposed to big data). Other keywords related to this trend are
"light analytics", "big data diet", "data outsourcing", and the rebirth of "small data". Not that big
data is going away; it is indeed getting bigger every second, but many businesses are trying to
leverage an increasingly smaller portion of it, rather than being lost in a (costly) ocean of
unexploited data.
• Putting more intelligence (sometimes called AI or deep learning) into rudimentary big data
applications (currently lacking any true statistical science) such as recommendation engines,
crowdsourcing or collaborative filtering. Purpose: detecting and eliminating spam, fake profiles,
fake traffic, propaganda, attacks, scams, bad recommendations and other abuses, as early as
possible.
• High performance computing (HPC), which could revolutionize the way algorithms are designed.
• Forecasting space weather (best time / best location to land on Mars), and natural events on
Earth (volcanoes, earthquakes, undersea weather patterns and implications for humans, when
Earth's magnetic field will flip).
• Use of data science for automated content generation (including content aggregation and
classification); for automated correction of student essays; data science used in court to
strengthen the level of evidence - or lack of - against a defendant; for plagiarism detection; for
car traffic optimization and to compute optimum routes; for identifying, selecting and keeping
ideal employees; for automated income tax audits sent to taxpayers to avoid costly litigation
and time wasting; for urban planning; for precision agriculture.
• Measuring yield of big data or data science initiatives (that is, benefit after software and HR
costs, over baseline).
• Digital health: diagnostic/treatment offered by a robot (artificial intelligence, decision trees)
and/or remote doctors; digital law: same thing, with attorneys replaced by robots, at least for
mundane cases or tasks. Even lawyers and doctors could have their jobs replaced by robots! This
assumes that a lot of medical or legal data gets centralized, processed and made well structured
for easy querying, updating and retrieval by (automated) deep learning systems.
• Analytic processes (even in batch mode) accessible from your browser anywhere on any device.
Growth of analytics apps and APIs.
1.1.6 WHAT IS BIG DATA?
Big data is a phenomenon that is characterized by the rapid expansion of raw data. This data is
being collected and generated so quickly that it is inundating government and society. Therefore, it
represents both a challenge and an opportunity. The challenge is related to how this volume of data
is harnessed, and the opportunity is related to how the effectiveness of society’s institutions is
enhanced by properly analyzing this information.
It is now commonplace to distinguish big data solutions from conventional IT solutions by
considering the SEVEN dimensions given below and in Figure 1.3.
• Volume: Big data solutions must manage and process larger amounts of data.
• Velocity: Big data solutions must process more rapidly arriving data.
• Variety: Big data solutions must deal with more kinds of data, both structured and unstructured.
• Veracity: Big data solutions must validate the correctness of the large amount of rapidly arriving
data.
• Variability: Big data solutions must check whether the data is consistent in terms of availability or
interval of reporting, and whether it accurately portrays the event reported.
• Visualization: Once big data has been processed, it needs to be presented in a manner
that is readable and accessible.
• Value: Big data solutions must provide valuable inputs to the decision-making process of the
organization.
FIGURE 1.3: SEVEN DIMENSIONS OF BIG DATA
(Volume, Velocity, Variety, Veracity, Variability, Visualization, Value)
As a result, big data solutions are characterized by real-time complex processing and data
relationships, advanced analytics, and search capabilities. These solutions emphasize the flow of
data, and they move analytics from the research labs into the core processes and functions of
enterprises.
1.1.7 BUSINESS (ORGANIZATIONAL) VALUE OF BIG DATA:
Big data is a technology to transform analysis of data-heavy workloads, but it is also a disruptive
force. It is fuelling the transformation of entire industries that require constant analysis of data to
address daily business challenges. Big data is about broader use of existing data, integration of new
sources of data, and analytics that delve deeper by using new tools in a more timely way to increase
efficiency or to enable new business models. Today, big data is becoming a business imperative
because it enables organizations to accomplish several objectives:
• Apply analytics beyond the traditional analytics use cases to support real-time decisions,
anytime and anywhere
• Tap into all types of information that can be used in data-driven decision making
• Empower people in all roles to explore and analyze information and offer insights to others
• Optimize all types of decisions, whether they are made by individuals or are embedded in
automated systems by using insights that are based on analytics
• Provide insights from all perspectives and time horizons, from historic reporting to real-time
analysis, to predictive modelling
• Improve business outcomes and manage risk, now and in the future
In short, big data provides the capability for an organization to reshape itself into a contextual
enterprise, an organization that dynamically adapts to the changing needs of its individual
users/customers by using information from a wide range of sources. Although it is true that many
organizations/businesses use big data technologies to manage the growing capacity requirements of
today’s applications, the contextual enterprise uses big data to enhance revenue streams by
changing the way that it does business.
Volume – Scalability
Data volume is increasing faster than computing resources and processor speeds that exist in the
marketplace. Over the last five years, the evolution of processor technology largely stalled, and we
no longer see a doubling of chip clock cycle frequency every 18 - 24 months. The size of big data is
easily recognized as an obvious challenge. Big data is pushing scalability in storage, with increases in
data density on disks to match. A large percentage of the data might not be of interest. It can be
filtered and compressed by an order of magnitude. The challenge is to filter intelligently without
discarding data samples that might be relevant to the task.
Volume – Impact of Networking
The failure of a networking device affects multiple data nodes. This means that a job might need to
be restarted or more loads must be pushed to the available nodes, which makes jobs take a lot
longer to finish. As a result, networks must be designed to provide redundancy with multiple paths
between computing nodes and, furthermore, must be able to scale. In addition, the network must
be able to handle bursts effectively without dropping packets.
Volume – Cloud Services
Big data and cloud services are two initiatives that are at the top of the agenda for many
organizations. There is a view that cloud computing can provide the opportunity to enhance
organizations’ agility, enable efficiencies, and reduce costs. In many cases, cloud computing provides
a flexible model for organizations to scale their big data capabilities. However, this needs to be done
with careful planning, especially estimating the amount of data to analyze by using the big data
capability in the cloud, because not all public or private cloud offerings are built to accommodate big
data solutions.
Velocity – Access Latencies
Access latencies create bottlenecks in systems in general, but especially with big data. The speed at
which data can be accessed while in memory, network latency, and the access time for hard disks all
have performance and capacity implications. For big data, data movement is usually not feasible,
because it puts an unbearable load on the network. For example, moving petabytes of data across a
network in a one-to-one or one-to-many fashion requires an extremely high-bandwidth, low-latency
network infrastructure for efficient communication between computer nodes.
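To give a sense of scale for the petabyte example above, the following minimal sketch estimates the ideal transfer time over a single link. The function name, the decimal definition of a petabyte (10^15 bytes) and the link speeds are illustrative assumptions, not figures taken from the sources cited in this report. Real transfers would take longer because of protocol overhead and contention.

```python
def transfer_time_days(data_bytes: float, link_bits_per_second: float,
                       efficiency: float = 1.0) -> float:
    """Ideal time, in days, to move data_bytes over one link.

    efficiency is the assumed usable fraction of the nominal link rate
    (1.0 means no protocol overhead, which is optimistic).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # assumption: decimal petabyte, in bytes

# Illustrative link speeds in Gbit/s (assumed values).
for gbps in (1, 10, 100):
    days = transfer_time_days(PETABYTE, gbps * 10**9)
    print(f"1 PB over {gbps} Gbit/s: about {days:.1f} days")
```

Even on a dedicated 10 Gbit/s link, a single petabyte needs roughly nine days under these idealized assumptions, which is why processing is usually brought to the data rather than the other way around.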
Big data uses different types of analytics, such as “adaptive predictive models, automated decision
making, network analytics, analytics on data-in-motion, and new visualization.” Previously, data was
pre-cleaned and stored in a data mart. Now, most or even all source data is retained. Furthermore,
new types of feeds, such as video or social media feeds are available (Twitter, for instance).
Velocity - Rapid use and rapid data interpretation
It is crucial in today’s fast-paced business climate to derive rapid insight from data. Consequently,
agility is essential for businesses. Successfully taking advantage of the value of big data requires
experimentation and exploration, and both need to be done rapidly and in a timely manner.
Velocity - Response time
Response times for results are still critical, despite the increase of data size. To ensure speed and
real-time feedback from big data, a new approach is emerging where data sets are processed
entirely within a server’s memory.
Velocity - Impact of security on performance and capacity
The increased velocity of data corresponds to an increase in security-relevant data. According to Tim
Mather of KPMG, “Many big data systems were not designed with security in mind.”
The security mechanisms need to be applied in a manner that does not increase access latency. In
addition, big data technology enables massive data aggregation beyond what was previously
possible. Therefore, organizations need to make data security and privacy high priorities as they
collect more data in trying to get a single view of the customer.
Variety – Data Type
One of the crucial challenges that affect performance and capacity in a big data system arises from
the variety of data types that can be introduced during typical processing cycles. These challenges
can arise for these reasons:
o Growth, necessitating the addition of new systems, which can result in an uncontrolled
heterogeneous landscape in the enterprise (such as a plethora of types of systems)
o The introduction of new systems that provide data but introduce challenges in identifying its
relevance in big data systems
Variety - Tuning
The rise of information from a variety of sources, such as social media, sensors, mobile devices,
videos, and chats, results in an explosion of the volume of data. Previously, companies often
discarded the data because of the cost of storing it.
Veracity – Cleaning the Messy Data
The huge amount of data that comes from digital pictures, videos, posts to social media sites,
intelligent sensors, purchase transaction records, and cell phone GPS signals, to name a few, is messy
data. Veracity deals with uncertain or imprecise data. If the data is error-prone, the information
that is derived from it is unreliable, and users lose confidence in the output. Cleaning the existing
data and putting processes in place to reduce the accumulation of dirty data is crucial.
Veracity – Performance & Capacity
To address the performance and capacity challenges that arise from lack of veracity, it is important
to have data quality strategies and tools as part of a big data infrastructure. The aim of the data
quality strategies is to ascertain “fit for purpose.” This involves evaluating the intended use of big
data within the organization and determining how accurate the data needs to be to meet the
business goal of the particular use case. The data quality approaches that the organization adopts
need to include several strategies:
o Definition of data quality benchmarks and criteria
o Identification of key data quality attributes (such as timeliness and completeness)
o Data lifecycle management and compliance
o Metadata requirements and management
o Data element classification
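As an illustration of the first two strategies, the sketch below measures two data quality attributes named above, completeness and timeliness, and compares them with benchmarks. The record layout, field names, benchmark thresholds and the seven-day freshness window are all hypothetical; an organization would set them according to its own "fit for purpose" assessment.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)


def timeliness(records, timestamp_field, now, max_age):
    """Fraction of records whose timestamp is no older than max_age."""
    if not records:
        return 0.0
    fresh = sum(
        1 for r in records
        if r.get(timestamp_field) is not None and now - r[timestamp_field] <= max_age
    )
    return fresh / len(records)


# Hypothetical sample records and benchmarks, for illustration only.
records = [
    {"id": 1, "amount": 120.0, "recorded_at": datetime(2015, 3, 1)},
    {"id": 2, "amount": None, "recorded_at": datetime(2015, 3, 3)},
    {"id": 3, "amount": 75.5, "recorded_at": datetime(2015, 1, 10)},
    {"id": 4, "amount": 10.0, "recorded_at": datetime(2015, 3, 4)},
]
now = datetime(2015, 3, 4)
benchmarks = {"completeness": 0.90, "timeliness": 0.70}

scores = {
    "completeness": completeness(records, ["id", "amount"]),
    "timeliness": timeliness(records, "recorded_at", now, timedelta(days=7)),
}
for attribute, score in scores.items():
    status = "meets" if score >= benchmarks[attribute] else "falls below"
    print(f"{attribute}: {score:.2f} {status} benchmark {benchmarks[attribute]:.2f}")
```

Keeping the benchmarks separate from the measurement functions lets the same checks be reused across use cases that tolerate different levels of accuracy.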
1.2. INTERNATIONAL SCENARIO, BEST PRACTICES, BUSINESS MODELS AND
OPPORTUNITIES AVAILABLE
1.2.1 BIG DATA
Big Data’s biggest strength is its versatility and global application. So, quite naturally, it has
enormous, widespread impact. Use of Big Data in government – local, national or international – can
be a game changer, because every government faces numerous challenges, the biggest perhaps being
making sense of the massive amounts of information it receives every day and making decisions
based on it, which in turn may affect an entire country or even multiple nations.
Not only is it tough to scrutinize all the information, but it is even more difficult to verify it. Flawed
information can have devastating consequences.
This is where Big Data comes to the rescue! With the help of Big Data, governments can derive
crucial insights to aid decision making in real-time from the heaps of ever-mounting data received
from a myriad of sources, including the Web, biological and industrial sensors, video, email, and
social communications. Governments can utilize Big Data to serve their citizens better and
overcome countless challenges like increasing health care costs, unemployment, natural calamities,
poverty, illiteracy, terrorism, international trade relations, and so on.
Big Data in government can be the touchstone of a nation’s global standing. Here are a few areas
where implementing Big Data can get governments enormous benefits:
Air-Rail-Road Safety & Transport: With Big Data, governments can improve air-rail-road networks
and transportation, and minimize accidents and mishaps.
Healthcare: Big Data tools can be used to intensify treatment efficiency and provide more
personalized care to patients.
Education: Education is another important area where Big Data can do wonders for the
government. Big data can help governments understand the educational needs of the population
better.
Agriculture: Big Data can help governments and government agencies keep track of numerous
factors within and outside of the national borders – land, livestock, crops grown, crops required to
be cultivated, food scarcity/abundance, flood/famine, farmer welfare, and other countless
agriculture-related issues.
Poverty: Big Data makes it easy for the governments to assess the greatest needs of their people
and allows them to focus on areas where poverty alleviation is required.
Weather: Weather officials can use Big Data to predict impending weather-related emergencies and
quickly alert the residents of danger and consequently save numerous lives.
Tax compliance: Big Data can help tax agencies detect and regulate tax frauds, waste & abuse of
unpaid taxes, denied refunds.
Crime Prevention: Big Data tools can help law and order agencies in identifying emerging threats,
anticipating and averting criminal activity.
Big data technology is a very powerful and useful tool for governments across the globe. Agreed, it
cannot resolve all problems at once, but it is one big step in the right direction. Big Data empowers
governments with the right tools to bring about important changes that can have ubiquitous impact
on generations – present and future! Consider the examples given below from a variety of
countries:
• Seoul uses analytics to find late night bus routes: In South Korean capital Seoul, night bus
routes are determined by late night call volumes. Here’s how the city is helping late night
commuters reach home safely. When the government was figuring out how to operate the night
buses, it color-coded areas in the city based on call volumes. It then found out how many
passengers get on and off at each bus stop in the high call volume areas to determine the busiest
routes the buses should ply.
• Singapore Government provides personalized services to its citizens: Singapore government
websites are able to better recognize citizens’ needs with a new big data analytics tool. The cloud-based
tool can process and understand a citizen’s question accurately and provide an answer within
seconds. This capability enables citizens to better navigate government services and get
personalized advice when using online services. The tool also provides government agencies with
insights on citizens’ needs and priorities.
• Data sharing tips from Colorado – Health, Education, labor & industry: Using Big Data
concepts, the data shared are around eligibility and service quality, for example: are people getting
served in a reasonable time frame, and what are the demographics of the people who are
consuming health, education, and employment services.
• Singapore Government’s initiatives in Using analytics to improve quality of decisions & lives:
The government believes that data analytics has huge opportunities to impact government services
and improve citizens’ lives in a wide range of areas, such as healthcare, transportation, education,
retail and waste management. A large volume of data is being generated from sensors and mobile
devices today. This includes communication between person-to-person, person-to-machine and
machine-to-machine, added Sen. He and his team are tasked to evaluate and apply advanced
analytics techniques and models that can help organizations get a “360-degree view on people,
technology and policies to improve the quality of decisions and improve citizens’ lives and journey of
experience at various touch points.”
• Australian Immigration became over 30% more effective thanks to analytics: After having
deployed a new analytics system 18 months ago, they are generally now 30-40% more effective. The
analytics system allows the Department of Immigration and Border Protection to identify high-risk
passengers with less disruption to other passengers coming into Australia. This has become
possible as the analytics system combines data from the visa approval process, travel history to and
from Australia and even real-time data collected during check-in. It analyses these datasets to
profile the level of risk posed by each of the 50,000 passengers arriving in Australia every day.
• South Korea Government improves citizen engagement through Open Data & Big Data: The
South Korean Ministry of Government Legislation (MOLEG) has significantly improved citizen
engagement by enabling easy access to and search of accurate and timely legal information. The
Centre gathers all kinds of information related to legislation, current laws and their histories,
constitution, laws passed in the national assembly, treaties, presidential decrees, decrees produced
by each ministry, and other rules including local governments’ ordinances and regulations. MOLEG
has also created a mobile app so that citizens can access the Centre on the go. Besides making legal
information open and easily searchable by citizens, MOLEG wants to involve the public in the lawmaking process.
• Big Data: Digital Agenda for European Commission:
o Healthcare: saving lives with better diagnosis
o Transport: fewer accidents and traffic jams
o Environment: reduced energy consumption
o Agriculture: safer food and increased productivity
o Manufacturing and retail: optimized processes for safer and personalized products
• Some More Ways Big Data Is Used Today To Change Our World:
o Understanding and Targeting Customers
o Understanding and Optimizing Business Processes
o Personal Quantification and Performance Optimization
o Improving Healthcare and Public Health
o Improving Sports Performance
o Improving Science and Research
o Optimizing Machine and Device Performance
o Improving Security and Law Enforcement
o Improving and Optimizing Cities and Countries
o Financial Trading / Pricing
o Out of home advertising
o Retail Habits
o Politics
o Weather
o Heart Disease
o Infectious diseases
o Doctor performance
1.2.2 OPEN DATA
Governments and public authorities across the world are launching Open Data initiatives. Research
indicates that by October 2011, twenty-eight nations around the world had established Open Data
portals.
Public administration officials are now beginning to realize the value that opening up data can have.
For instance, the direct impact of Open Data on the EU27 economy was estimated at €32 Billion in
2010, with an estimated annual growth rate of 7%.
However, very few governments are taking the right measures to realize the economic benefits
of Open Data. Political support, breadth and refresh rate of data released, the ease in sourcing data
and participation from the user community determine the degree of maturity of an Open Data
program.
Capgemini Consulting conducted an analysis of 23 select countries across the world, which have
already initiated Open Data programs, and rated them on a set of parameters as given below in
figure 1.4:
FIGURE 1.4: PARAMETERS USED FOR BENCHMARKING COUNTRIES ON OPEN DATA INITIATIVES
(Source: Capgemini)
After analyzing the 23 countries, based on their positioning and pace of adoption of Open Data
initiatives, the study classified them into three categories – Beginners, Followers and Trend
Setters. The results are given below in figure 1.5:
FIGURE 1.5: BENCHMARKING OF OPEN DATA INITIATIVES, SELECT COUNTRIES, 2012
(Source: Capgemini)
1.3. INDIAN PERSPECTIVE
1.3.1 BIG DATA ANALYTICS
An IDC Insight document examines the Big Data India market trends and provides a forecast for the
period of 2014–2017. It also captures the current market situation, spending and adoption patterns
across end user organizations, as well as business drivers and inhibitors and use cases across
verticals. This covers the Big Data state of adoption across end user organizations and also the
forecast for the coming years. The report is based on key findings from a survey of 250+ end user
organizations and in-depth interviews done with 5+ supply side vendors. The report provides Big
Data spend for each technology segment — infrastructure, software and services — and the growth
pattern for each segment.
"IDC expects the Big Data technology and services market in India to witness a phenomenal
compounded annual growth rate (CAGR) of 36.3% for the period of 2012–2017 to reach US$ 191
million from US$ 40.7 million in CY12. The huge growth potential is attributed to the inclination of
the business functions to get meaningful insight out of the humongous data growth in their
organizations," (www.idc.com/getdoc.jsp?containerId...dated 21/02/15)
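The growth figures quoted above can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below is only an arithmetic illustration. It assumes that 2012 to 2017 spans five compounding periods, and the function names are hypothetical.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate implied by a start value, end value and period."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: float) -> float:
    """Value reached after compounding start_value at rate for the given years."""
    return start_value * (1 + rate) ** years


# Figures quoted from IDC in the text, in US$ million.
start_2012, end_2017 = 40.7, 191.0
years = 5  # assumption: CY12 to CY17 treated as five compounding periods

implied_rate = cagr(start_2012, end_2017, years)
projected_2017 = project(start_2012, 0.363, years)

print(f"Implied CAGR from the two endpoints: {implied_rate:.1%}")
print(f"US$ {start_2012} million compounded at 36.3%: US$ {projected_2017:.1f} million")
```

Under that five-period assumption, the quoted rate and the two endpoint values agree to within rounding.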
Looking into the future of IoT, Big Data and Cloud Computing, NetApp has reported in
CIOL that the year 2015 will see quantum increases in data generation, led by the IoT phenomenon.
Data will become the new gold. A leading industry analyst firm’s Digital Universe analysis of the
growth of data projects that intelligent connected devices will increase the amount of “useful data”
that can be analyzed and used to make decisions from 22% in 2013 to 35% in 2020. This “useful
data” needs to be in digital storage in order to enable the analysis and use of this data. This will
compel enterprises and government alike to think harder about network efficiency, storage and
analytics.
If India is to achieve the goals we have set for ourselves in 2014, a calibrated approach is
imperative, born of long-term technology roadmaps. Analytics deployments will be spurred by the
increasingly complex marketing and consumer engagement environment that has been created in
the digital era.
1.3.2 CLOUD COMPUTING & SOFTWARE DEFINED STORAGE
Organizations contemplating both green field and brown field cloud deployments will tend towards
a multi-vendor hybrid cloud environment, that will provide the benefits of both the worlds – public
and private cloud. Avoidance of lock-in, leverage in negotiations, or simply a desire for choice will
make customers reluctant to work with one cloud vendor, and multiple-vendor hybrid clouds will
attain prominence. This growth will further be boosted as big data evolves and drives the need for
sophisticated storage infrastructure.
Software Defined Storage (SDS) is a foundational platform that addresses a range of use cases, managing
data placement according to cost, compliance, availability, and performance requirements. SDS has
the ability to be deployed on different hardware platforms and will extend to cloud architectures as
well. SDS will enable data accessibility across cloud platforms consistently, thus simplifying data
management.
Till date, enterprises have used disks to store their critical data. These SATA disks come with a lot of
challenges including space usage, time taken to input and overhead costs to maintain the requisite
environment. While this is definitely not going to change and at least 80% of enterprise data will
continue to reside on disks, Flash will start taking baby steps as organizations become aware of its
advantages and ease of use. However, the growth of this transformative technology will be hindered
by costs – the least expensive SSDs will likely be 10 times more expensive than the least expensive
SATA disks.
1.3.3 INTERNET OF THINGS – IoT
Companies are increasingly looking for scale-out applications. To accommodate this need, Docker
containers are more resource-efficient and reduce the storage space required as compared to
hypervisors. We will see the emergence of a robust ecosystem for data management through Docker
and other surrounding services in 2015.
With IoT devices expected to grow to 4.9 billion in 2015, up 30 per cent from 2014, and to reach 25
billion by 2020 as per a leading analyst firm, unstructured data is being created by every device
thinkable – from smart phones, laptops, and social to cloud applications. Organizations need to
become technologically sharp to deal with the changing dynamics in the big data space. They should
adopt improved storage solutions to address their needs and the above predictions hold good for
them. (www.netapp.com/in/.../news/.../news-rel-20141219-184398.aspx, dated 25/12/14)
1.3.4 INDIA’S HIGH DEMAND FOR BIG DATA WORKERS
The biggest fallout of the big data revolution -- where every type of business gathers and analyzes
data -- is a massive human resources shortage. Across the globe, thousands of data analytics jobs
are going begging because of a shortage of qualified manpower. A McKinsey Global Institute
Study Report (Big data: The next frontier for innovation, competition, and productivity) projects that
the US alone will face a shortage of about 190,000 data scientists by 2018 and, further, a shortfall of
1.5 million managers and analysts who can understand and make decisions using big data. As per
this report, India produces the third-largest absolute number of BA professionals after the USA and
China; however, India produces only 1.12 BA professionals per 100, as compared to the USA’s 8.11 and
China’s 1.31. The worry is that India’s figure of 1.12 is smaller than that of most countries. See
Figure 1.6.
FIGURE 1.6: NUMBER OF GRADUATES WITH DEEP ANALYTICAL TRAINING
SOURCE: McKinsey Global Institute Study Report (Big data: The next frontier for innovation, competition, and productivity)
Three key types of talent are required to capture value from big data:
o Deep Analytical Talent - people with technical skills in statistics and machine learning, for
example, who are capable of analyzing large volumes of data to derive business insights;
o Data-Savvy Managers and Analysts - who have the skills to be effective consumers of big data
insights—i.e., capable of posing the right questions for analysis, interpreting and challenging the
results, and making appropriate decisions; and
o Supporting Technology Personnel - who develop, implement, and maintain the hardware and
software tools such as databases and analytic programs needed to make use of big data.
Data analytics as a job discipline became mainstream almost a decade ago, and the demand for
trained professionals has been growing steadily since. Given India's reputation for the availability of
professionals in varied disciplines at reasonable costs, global banks and financial services firms were
the first to migrate their analytics work to India, followed by pharmaceutical and life sciences
companies. Global retailers, consumer firms, logistics firms, consultancies, and engineering firms
have all begun routing their data analytics work to IT services providers and specialized analytics
service providers in India.
The talent deficit is on two fronts: data scientists who can perform analytics, and analytics
consultants who can understand and use the data. In the first category, big data engineers and
scientists are extremely scarce; in the second, better quality is needed, and India is going to be short
of a million data consultants soon.
1.3.5 NASSCOM PERSPECTIVE
To address the growing business opportunities in the Analytics and Big Data space, the National
Association of Software and Services Companies (NASSCOM) held the NASSCOM Big Data & Analytics
Summit 2014 in Hyderabad. With the theme of “Industrialization of Analytics”, the summit deliberated
on how to build analytically mature organizations with analytics embedded at the business core and
across the business value chain. Industry leaders shared best practices on the processes, tools,
technologies, techniques and applications used in analytics, as well as insights on how to build
India's analytics talent strength.
The following are the highlights:
Global analytics market
As firms gain access to greater volumes and newer varieties of data, and as they unearth more
innovative ways of generating insights for improved customer engagement, implementing analytics
is gaining in importance. The global analytics market (software products and outsourced services) has
been growing at over 12 per cent a year since 2012. The 2014 market size is estimated at USD 96
billion and is projected to reach USD 121 billion by 2016. Outsourced analytics services are growing
at a faster CAGR of over 14 per cent, compared with about 10 per cent for analytics software. This
growth is being driven by a host of factors: cloud and in-memory computing; mobile devices and social
media; the emergence of different business units across an organization as consumers of analytics;
and so on. With analytics consistently recognized as a top priority for CXOs, firms are also
industrializing analytics within the organizational culture, which in turn is leading to the
emergence of the Chief Data Officer role.
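The global projection quoted above can be checked with simple compound-growth arithmetic. The sketch below is only an illustration: the function name `project` is hypothetical, and the 12 per cent rate is the lower bound quoted in the text ("over 12 per cent"), so the result is a floor rather than a restatement of the NASSCOM estimate.

```python
def project(value: float, annual_rate: float, years: int) -> float:
    """Compound `value` forward at `annual_rate` for a whole number of years."""
    return value * (1.0 + annual_rate) ** years


# Figures quoted in the text: USD 96 billion in 2014, growing at about 12 per cent a year.
size_2014_usd_bn = 96.0
assumed_rate = 0.12  # assumption: the text only says "over 12 per cent"

size_2016_usd_bn = project(size_2014_usd_bn, assumed_rate, 2)
print(f"Projected 2016 market size: USD {size_2016_usd_bn:.1f} billion")
```

At exactly 12 per cent the result lands slightly below the quoted USD 121 billion, which is consistent with a growth rate a little above 12 per cent.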
India analytics market
Compared to the global market, the overall India analytics market is minuscule and currently
accounts for only a 1 per cent share. The India market (exports and domestic) is growing at double
the rate of the global market, at a 24 per cent CAGR. In FY2014, the total market was USD 954 million,
and it is expected to reach nearly USD 2.3 billion by FY2016. The ratio of exports to domestic is
likely to remain steady at 85:15 during this period. Currently, this segment has over 600 firms
offering analytics-related products and services, and it employs about 29,000 people. India is the
primary target market for ~50 per cent of these firms. The fact that India's Top 100 IT-BPM
(integrated) firms and 500+ start-up firms are focused on analytics is proof of
this technology's increasing relevance.
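The relationship between the India market figures quoted above (a starting size, a target size and a CAGR) can be worked out with the standard CAGR formula. The sketch below is an illustration only; the function names `implied_cagr` and `years_to_reach` are hypothetical, and the inputs are the figures as quoted in the text.

```python
import math


def implied_cagr(start: float, end: float, years: float) -> float:
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


def years_to_reach(start: float, end: float, annual_rate: float) -> float:
    """Number of years needed to grow from `start` to `end` at `annual_rate`."""
    return math.log(end / start) / math.log(1.0 + annual_rate)


fy2014_usd_mn = 954.0   # FY2014 market size quoted in the text
target_usd_mn = 2300.0  # "nearly USD 2.3 billion" quoted in the text

rate_two_years = implied_cagr(fy2014_usd_mn, target_usd_mn, 2)
years_at_24pc = years_to_reach(fy2014_usd_mn, target_usd_mn, 0.24)

print(f"Rate implied by a two-year horizon: {rate_two_years:.1%}")
print(f"Years needed at 24 per cent a year: {years_at_24pc:.1f}")
```

Read literally, the quoted figures do not all describe the same two-year projection: reaching USD 2.3 billion from USD 954 million in two years implies a rate of roughly 55 per cent, whereas growth at 24 per cent a year takes about four years. The sketch only shows the arithmetic; it does not settle which of the quoted figures is the intended one.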
India is rapidly emerging as the analytics hub for the world. It has the complete range of ecosystem
players, from GICs, integrated IT-BPM firms and pure-play analytics firms to BPM-KPOs and a vibrant
set of analytics product firms. In terms of geographic density, Bengaluru has the highest share of
analytics firms (29 per cent), followed by Mumbai and Pune (24 per cent). Apart from these, many
Tier II/III cities are also emerging as hubs: Trivandrum, Kochi, Mysore, Indore, etc.
Analytics in the India domestic market
There is also a pull factor from the user side: firms in India are beginning to realize the value of
implementing analytics. The potential impact can be operational (cost control, process efficiencies),
customer-facing (user insights, targeted marketing) and strategic (driving sales, improved decision
making).
Firms in the BFSI, telecom and ecommerce verticals have so far been taking the lead in adopting and
applying analytics to a wide range of business areas – portfolio analytics, risk & compliance
analytics, customer loyalty, subscriber profiling, churn management, etc. Emerging verticals that are
still in the pilot phase of adoption include retail, manufacturing and media & entertainment. One of
the key verticals showing great promise is the Government, through SEBI (fraud detection), NATGRID
(anti-terrorism) and state-level initiatives such as the Maharashtra Sales Tax Department and
Hyderabad's intelligent transport system.
1.4. SWOT ANALYSIS OF THE BIG DATA ANALYTICS
1.4.1 THE NEED
A SWOT analysis helps in understanding strengths and weaknesses and in identifying open
opportunities and the threats that may accompany them. It helps to differentiate between marginal
and valuable opportunities, and to decide what to exploit and what to ignore. It also gives a sense
of the threats and their intensity, so that threats unlikely to cause damage can be kept under watch
while increasingly dangerous ones receive attention. Finally, it provides an opportunity to identify
the gaps, which will lead to the preparation of a strong and structured Strategic Roadmap for Big
Data Analytics. Below is the SWOT analysis of big data analytics in India.
1.4.2 SWOT ANALYSIS – BIG DATA ANALYTICS, INDIA
Strengths
• There is a growing interest in archiving, sensing, behavioral data, and personal data.
• There is a large amount of content and data available – the issue is accessing and making use of
it.
• There is broad and detailed domain know-how as well as process know-how available.
• Many domains have innovative technology and skilled people.
• There are many universities/institutions with high capacity where skills can be developed.
• Avenues where good science/engineering/domain-specific education can be obtained.
• Immense growth opportunity in the analytics market: Indian product firms have shown a growth
rate of 20-40 per cent in the last few years; several emerging players have witnessed over 100
per cent growth within the first year of launch. (NASSCOM)
• Analytics – a definite market for India: Over 100 Indian analytics-focused software product firms
have successfully developed and launched products catering to niche business needs, cutting across
vertical-specific, horizontal process-centric and niche applications and platforms. (NASSCOM)
• Growing start-up base accelerating the growth: Four-fold increase in analytics start-ups in the
last four years. (NASSCOM)
• Innovative offerings focusing on end-to-end customer business needs. (NASSCOM)
Weaknesses
• There are no established cooperation networks between content providers in several domains.
• Computer clusters and cloud resources are not readily available and accessible to
users/stakeholders such as researchers in institutes and research labs.
• There are not many SMEs that are dynamic and flexible and can react quickly to market changes.
• Geospatial and environmental data sets and supporting infrastructure data sets are not readily
available.
• There is no existing and strong content/data market in India.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of failure.
• There are few large companies to lead the market, and many small-sized companies that need
nurturing.
• There is a lack of access to Big Data facilities that make data more easily accessible.
• There is no visibility of ecosystem service offerings.
• It is unclear what data should be preserved, and for how long, in all the different sectors and
markets.
• Lack of processable linked data, and of aggregated/combined data.
• Lack of seamless data access and inter-connectivity, and low levels of interoperability: data is
often in silos and data sharing is difficult due to an ineffective Data Sharing Policy as well as
standards, e.g. formats and semantics.
• Migration of data between systems, versions or partners is challenging.
• Access to and processing of data sets that are too big to be given to the end user.
• Public data in the country is not available to the extent it should be.
• The quality of data, even in open data portals, is often very low.
• The different languages within the country create a barrier (multilingualism) during data
processing. Structural data sources often lack precise semantics.
• Poor and inconsistent use or management of metadata.
• There is a lack of specialized education programs for data analysts.
• There are not enough skilled people to participate in capacity building training programs.
• Legislative restrictions on data sharing decrease availability across the country and make
national/industry/domain-focused initiatives that address these issues more difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to address.
• There is no well-designed data governance: Data governance is a must-have, and no longer
merely a good-to-have. In today's hyper-competitive markets, insightful knowledge
means the difference between success and being overwhelmed. But it has to be based on the
right data, based on business requirements.
• Data protection Policy: "Ignoring data security, data quality and data access can cost
organizations millions of dollars, hurting enterprise agility, efficiency and reputation."
Opportunities
• Being a multi-cultural society, various cultures/practices/strengths/approaches can result in
creative thinking if they are mixed.
• The proposed topics by the DST/BDI and best practice examples in other initiatives can lead to
synergies.
• Strengthening the Indian market, e.g. by fusing the emerging start-up nucleus.
• Create lots of SMEs for the low-hanging fruits of Big Data for which agility is required.
• Investment in the entire innovation chain, beyond basic research.
• Investment support mechanisms for SMEs/Research/Institutions/Students/Scholars/Entrepreneurs.
• Collaboration within Industry/Academia/DST/Service Providers/Data Generators.
• Improve and encourage innovation & creativity to create cost-effective solutions.
• There is the opportunity to open up completely new and different business areas and services.
• New applications can be created throughout the Big Data ecosystem, ranging over acquisition,
data extraction, analysis, visualization and utilization.
• Easier syndication of data and content across industry/domains.
• Micropayments for processed data or the results from analytics.
• Wearable sensors and sensor technologies become mainstream, generating more data.
• The explosion of device types opens up access to any data from any device for greater and more
varied usage.
• Development of APIs for access becoming standardized and available.
• Interoperability tools and standardized APIs to facilitate data exchange.
• Greater visibility and increased use of directory services for data sources.
• Use semantics to align content from various data sources.
• Providing facilities to better navigate and curate data.
• Contextualization and personalization of data.
• The evolution of different sectors and the increased volume of data enable innovative
applications to be developed.
• Exploring new research areas.
• Training focused on innovation in DST/BDI.
• Use and exploration of Big Data to be ubiquitous in education and training.
• Address the safe and secure storage of data on a national basis.
• User-generated and crowd-sourced content is increasingly available, which will help solve a
variety of recurring problems once and for all.
• Data-as-a-service can significantly lower the market entry barriers (in particular to new
markets).
• Shift from technology push to end-user engagement.
• Create rich and complex data value chains.
• Develop strong and workable policies for data access in the country across private and public
data to help build comprehensive capabilities.
• By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes
and products from a decade earlier: As the presence of the Internet of Things (IoT), such as
connected devices, sensors and smart machines, grows, the ability of things to generate new
types of real-time information and to actively participate in an industry’s value stream will also
grow. (GARTNER)
• By 2017, more than 30% of enterprise access to broadly based big data will be via intermediary
data broker services, serving context to business decisions: Digital business demands real-time
situation-awareness. This includes insights into what goes on both inside and outside the
organization. How do weather patterns impact inventory? More so, how do this season’s customer
preferences as expressed in social media suggest greater or lesser inventory? (GARTNER)
• By 2017, more than 20% of customer-facing analytic deployments will provide product tracking
information leveraging the IoT: Fueled by the Nexus of Forces (mobile, social, cloud and
information), customers now demand a lot more information from their vendors. The rapid
dissemination of the IoT will create a new style of customer-facing analytics, product tracking,
where increasingly less expensive sensors will be embedded into all types of products.
(GARTNER)
• Analytics – Opening up a gamut of opportunities for Indian software product firms. (NASSCOM)
• Big Data as a service (BDaaS): That is, the delivery of statistical analysis tools or information by
an outside provider that helps organizations understand and use insights gained from large
information sets in order to gain a competitive advantage.
Threats
• Many skilled professionals leave the country to work in other regions, adding to the risk of a
“Brain Drain”.
• Acute lack of skilled professionals and graduates.
• Non-standardization of the ‘contents’, ‘duration’, ‘mode of delivery’ and ‘certification’ of the
skilling and/or up-skilling efforts made by the education/training ecosystem of the society.
• There are no existing ecosystems and portals where reliable data sets are available; however,
there is a need to create them.
• Policies are often too connected to the ‘old data’ world.
• Complete analysis of ethical and privacy issues is needed.
• Risk of over-regulation and protectionism in the country as compared to elsewhere in the
developed world.
• Policies of data availability; for example, companies are not willing to make data available
‘just-in-case’ it may cause a legal action or result in competition.
• Technology & Techniques: To capture value from big data, organizations will have to deploy
new technologies, e.g. storage, computing and analytical software. The range of technology and
technique challenges, and the priorities set for tackling them, will differ depending on the data
maturity of the institution.
• Organizational Change and Talent: Organizational leaders may not fully understand and
appreciate the value in big data as well as how to unlock this value.
• Shortage of Skills: There is a wide range of skills relevant for businesses wanting to use data
analytics, including knowledge of statistical techniques, the ability to program and use software,
market-specific knowledge and communication. These skills may not be available in the required
quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills required for
big data analysis is for students to work closely with a company during their studies.
Collaboration between a university/institution with analysis expertise and a business with
real-world data can be beneficial for both parties.
• Trying to rush all data out to everyone all at once: Consider the whole cycle from the acquisition
of data to the extraction of information, and consider the hygiene factors along this path. There
is a time in which data should be immediately available to decision makers, and there is a time
when it can be retired.
• BDaaS requires a coordinated effort: Successful Big Data-as-a-Service implementation would
require close collaboration between Enterprise Architects, Data Architects, Database admin, BI
and DW SMEs, SOA experts, InfoSec representatives and business strategists.
• Data Sharing Policy: The recommendations made by CODATA on Capacity Building and the Data
Sharing Principles in Developing Countries are as given below. Unless these are
implemented, the use of Big Data Analytics may not take off as desired.
o Data should be open and unrestricted.
o Data should be free to the user.
o Data should be informative and assessed for quality.
o Data sharing should be timely.
o Data should be easy to find and access.
o Data should be interoperable.
o Data should be sustainable.
o Data contributors should be given credit.
o Data access should be equitable.
o Data may be restricted, in exceptional cases, if adequately justified.
1.5. THE IDENTIFIED GAP AREAS
1.5.1 GAP IDENTIFICATION
Big data, if used successfully, will be a big leap in the field of intelligent governance of
industry, business, research and government. However, before moving further, it would be
beneficial to examine the issues raised in the SWOT analysis and to identify the gaps.
The challenges are enormous, but if this shift in paradigm can be executed properly, it will
change what the future looks like. It will be a hard journey full of ifs and buts, but the risk and
effort are worth taking, as intelligent governance is the need of the hour.
To help prepare the Strategic Roadmap, the identified gaps can be grouped into the following
categories:
o Market and Business
o Technical
o Data, Content and Usage
o Education and Skills
o Policy, Legal and Security
Market and Business
• Rewarding the efforts to improve and encourage innovation & creativity to create cost-effective
solutions.
• Exploit the opportunity to open up completely new and different business areas and services.
• There are not many SMEs that are dynamic and flexible and can react quickly to market changes.
• There are few large companies to lead the market, and many small-sized companies that need
nurturing.
• Encouraging a shift from technology push to end-user engagement.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of failure.
• Launching new initiatives so as to strengthen the Indian market, e.g. by fusing the emerging
start-up nucleus.
• Initiatives that can lead to the creation of lots of SMEs for the low-hanging fruits of Big Data
for which agility is required.
• Providing investment in the entire innovation chain, beyond basic research.
• Trying to rush all data out to everyone all at once: Consider the whole cycle from the acquisition
of data to the extraction of information, and consider the hygiene factors along this path. There
is a time in which data should be immediately available to decision makers, and there is a time
when it can be retired.
• BDaaS requires a coordinated effort: Successful Big Data-as-a-Service implementation would
require close collaboration between Enterprise Architects, Data Architects, Database admin, BI
and DW SMEs, SOA experts, InfoSec representatives and business strategists.
Technical
• Computer clusters and cloud resources are not readily available and accessible to
users/stakeholders such as researchers in institutes and research labs.
• There is a lack of access to Big Data facilities that make data more easily accessible.
• Migration of data between systems, versions or partners is challenging.
• Access to and processing of data sets that are too big to be given to the end user.
• The quality of data, even in open data portals, is often very low.
• Technology & Techniques: To capture value from big data, organizations will have to deploy
new technologies, e.g. storage, computing and analytical software. The range of technology and
technique challenges, and the priorities set for tackling them, will differ depending on the data
maturity of the institution.
• Provide a platform for collaboration within Industry/Academia/DST/Service Providers/Data
Generators.
• Organizational Change and Talent: Organizational leaders may not fully understand and
appreciate the value in big data as well as how to unlock this value.
• The different languages within the country create a barrier (multilingualism) during data
processing. Structural data sources often lack precise semantics.
• Poor and inconsistent use or management of metadata.
• A mechanism to encourage a large number of research projects among the topics proposed by the
DST/BDI, and best practice examples in other initiatives that can lead to synergies.
• Encouraging development of new applications throughout the Big Data ecosystem, ranging over
acquisition, data extraction, analysis, visualization and utilization.
Data, Content and Usage
• Facilitate easier syndication of data and content across industry/domains
• There are no established cooperation networks between content providers in several domains.
• Geospatial and environmental data sets and supporting infrastructure data sets are not readily
available.
• There is no existing and strong content/data market in India.
• There is no visibility of ecosystem service offerings.
• Providing facilities to better navigate and curate data.
• Encouraging contextualization and personalization of data.
• Lack of processable linked data, and of aggregated/combined data.
• Lack of seamless data access and inter-connectivity, and low levels of interoperability: data is
often in silos and data sharing is difficult due to an ineffective Data Sharing Policy as well as
standards, e.g. formats and semantics.
Education and Skills
• There is a lack of specialized education programs for data analysts.
• Development of APIs for access becoming standardized and available.
• Development of interoperability tools and standardized APIs to facilitate data exchange.
• There are not enough skilled people to participate in capacity building training programs.
• Many skilled professionals leave the country to work in other regions; adding to the risk of a
“Brain Drain”.
• Investment support mechanisms for SMEs, Research, Institutions, Students, Scholars, and
Entrepreneurs.
• Acute lack of skilled professionals and graduates.
• Non-standardization of the ‘contents’, ‘duration’, ‘mode of delivery’ and ‘certification’ of the
skilling and/or up-skilling efforts made by the education/training ecosystem of the society.
• Shortage of Skills: There are a wide range of skills relevant for businesses wanting to use data
analytics, including knowledge of statistical techniques, the ability to program and use software,
market-specific knowledge and communication. These skills may not be available in required
quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills required for
big data analysis is for students to work closely with a company during their studies.
Collaboration between a university/institution with analysis expertise and a business with
real-world data can be beneficial for both parties.
• The ways and means to leverage the broad and detailed domain know-how as well as process
know-how available in some parts of the industry/research and business.
• To encourage the sharing of the expertise available in the many domains where innovative
technology and highly skilled people exist.
Policy, Legal and Security
• It is unclear what data should be preserved, and for how long, in all the different sectors and
markets.
• Public data in the country is not available to the extent it should be.
• Legislative restrictions on data sharing decrease availability across the country and make
national/industry/domain-focused initiatives that address these issues more difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to address.
• There is no well-designed data governance: Data governance is a must-have, and no longer
merely a good-to-have. In today's hyper-competitive markets, insightful knowledge
means the difference between success and being overwhelmed. But it has to be based on the
right data, based on business requirements.
• Data protection Policy: "Ignoring data security, data quality and data access can cost
organizations millions of dollars, hurting enterprise agility, efficiency and reputation."
• There are no existing ecosystems and portals where reliable data sets are available; however,
there is a need to create them.
• Policies are often too connected to the ‘old data’ world.
• Complete analysis of ethical and privacy issues is needed.
• Risk of over-regulation and protectionism in the country as compared to elsewhere in the
developed world.
• Policies of data availability; for example, companies are not willing to make data available
‘just-in-case’ it may cause a legal action or result in competition.
• Data Sharing Policy: The recommendations made by CODATA on Capacity Building and the Data
Sharing Principles in Developing Countries are as given below. Unless these are
implemented, the use of Big Data Analytics may not take off as desired.
o Data should be open and unrestricted.
o Data should be free to the user.
o Data should be informative and assessed for quality.
o Data sharing should be timely.
o Data should be easy to find and access.
o Data should be interoperable.
o Data should be sustainable.
o Data contributors should be given credit.
o Data access should be equitable.
o Data may be restricted, in exceptional cases, if adequately justified.
2. DATA SCIENCE & TECHNOLOGY
2.1 WORLDWIDE SITUATION
Big Data and the Internet of Things are disrupting entire markets, with machine data blurring the
virtual world with the physical world. This market matters: a recent Goldman Sachs report cites an
astounding $2 trillion opportunity by 2020 for IoT, with the potential to impact everything from new
product opportunities, to shop floor optimization, to factory worker efficiency gains that will power
top-line and bottom-line gains. The company that delivers high-quality big data solutions fastest and
enables customers to connect people, data and things to transform their industries and
organizations will win. (blog.pentaho.com/tag/iot/, dated 25/12/14)
In the current world, technology drives businesses, and the internet is used to address almost every
underlying business problem. Hardly any job can be deemed complete without the use of computers and
the internet. But the widespread use of these technologies across every possible business also leads
to an enormous amount of data: data that the traditional databases we are used to seeing around us
can neither hold nor manage.
2.1.1 UN’S GLOBAL PULSE:
Overview
Global Pulse is an innovation initiative launched by the Executive Office of the United Nations
Secretary-General, in response to the need for more timely information to track and monitor the
impacts of global and local socio-economic crises. The Global Pulse initiative is exploring how new,
digital data sources and real-time analytics technologies can help policymakers understand human
well-being and emerging vulnerabilities in real-time, in order to better protect populations from
shocks.
It is felt that digital data offers the opportunity to gain a better understanding of changes in human
well-being, and to get real-time feedback on how well policy responses are working. The overarching
objective of Global Pulse is to mainstream the use of data mining and real-time data analytics into
development organizations and communities of practice.
Global Pulse promotes awareness of the opportunities Big Data presents for relief and development,
forges public-private data sharing partnerships, generates high-impact analytical tools and
approaches through its network of Pulse Labs, and drives broad adoption of useful innovations across
the UN System. The objectives of the initiative include:
• Increasing the number of Big Data for Development (BD4D) innovation success cases
• Lowering systemic barriers to big data for development adoption and scaling
• Strengthening cooperation within the big data for development ecosystem
Big Data for Development
Since its inception in 2009, Global Pulse has been investigating the viability of using new and
alternative data sources to support development goals. This includes data from:
• Online Content - Public news stories, blogs, Twitter, Facebook, obituaries, birth
announcements, job postings, e-commerce, etc.
• Data Exhaust - Anonymized data generated through the use of services such as
telecommunications, mobile banking, online search, hotline usage, transit, etc.
• Physical Sensors - Satellite imagery, video, traffic sensors, etc.
• Crowdsourced Reports - Information actively produced or submitted by citizens through
mobile phone-based surveys, user generated maps, etc.
Global Pulse is exploring innovative methods and frameworks for combining new types of digital
data with traditional indicators to track global development in real-time. See figure 2.1.
Research Overview
Global Pulse identifies problems that could be addressed through real-time monitoring of digital
data. It also designs and conducts applied research projects with the aim to discover practical uses of
Big Data to solve the challenges and prototype technology tools for monitoring development
progress and tracking emerging vulnerabilities.
The Pulse Lab teams assist in conducting pilot-based evaluations of new tools and approaches
within existing programs and policy initiatives. Global Pulse also forges strategic public-private
partnerships to secure access to sources of Big Data, state-of-the-art analytical tools, and expert
advisors in the relevant technical fields.
FIGURE 2.1: INNOVATIVE CYCLE
(SOURCE: GLOBAL PULSE)
Goals of the Project
The development of a new set of technology tools, partnerships and capacities is designed to
complement existing data-gathering and analysis methods. These should contribute to
improved global development outcomes in three ways:
• Enhanced Early Warning
• Real-Time Awareness
• Real-Time Feedback
Projected Program for 2015-2016
• Pulse Lab Network. With at least 3 Pulse Labs launched, labs are sharing analytical methodologies
and key innovations in relevant technologies to support institutional partners in the adoption of
real-time data into their decision-making and monitoring.
• Real-Time Monitoring Framework. Building on continued joint research on real-time monitoring
with governments, UN agencies, the private sector and academia, Global Pulse publishes a compilation
of methods papers.
• Pulse Lab Handbook. A handbook capturing lessons and best practices in analysis, technology
innovation, community engagement and partnerships to support government use of Big Data for
real-time development monitoring and planning.
• Technology Toolkit. An integrated suite of free and open source technology tools for data
collection, analysis, and decision support, made available to the global community.
• Data Philanthropy Network. Global Pulse assembles a global network of public and private sector
partners sharing data through a secure network to support real-time tracking of development.
2.1.2 THE WORLD DATA SYSTEM (WDS) STRATEGY PLAN 2014-2018
The World Data System (WDS) is an Interdisciplinary Body of the International Council for Science
(ICSU). Its vision is “a world where excellence in science is effectively translated into policy making
and socio-economic development. In such a world, universal and equitable access to scientific data
and information is a reality and all countries have the scientific capacity to use these and to
contribute to generating the new knowledge that is necessary to establish their own development
pathways in a sustainable manner.” Its goals include:
• Enable universal and equitable access to quality-assured scientific data,
data services, products and information
• Ensure long-term data stewardship
• Foster compliance with agreed-upon data standards and conventions
• Provide mechanisms to facilitate and improve access to data and
data products
The Strategic Committee on Information and Data of ICSU works closely with ICSU’s Committee on
Data for Science and Technology (CODATA) and is developing strategic collaboration on issues of
common interest. WDS Strategic Targets are:
• Enable universal and equitable access to scientific data, data services, products and
information
• Ensure long-term data stewardship
• Foster compliance to agreed-upon data standards and conventions
• Provide mechanisms to facilitate and improve access to data and data products
The major targets of the current years are as follows:
• Make trusted data services an integral part of international collaborative scientific research.
To this end, ICSU-WDS will endeavour to:
o Involve WDS Members more closely in international collaborative scientific research.
o Promote the use of best practices in international collaborative research programs.
• Nurture active disciplinary and multidisciplinary scientific data services communities. To
this end, ICSU-WDS will strive to:
o Support existing data communities whose practices serve their members and the
scientific community well.
o Strengthen emerging communities by helping them to identify their needs and to
organize their activities.
o Provide mechanisms that facilitate cross-disciplinary interactions and activities.
o Contribute towards scientific development by improving the analytical environment.
• Improve the funding environment. ICSU-WDS seeks to play a key role in this
coordination by working with its Members to:
o Promote international, national, and disciplinary policies that lead to sustainable
long-term funding.
o Engage and work with research funders to increase resources for data services,
including as part of research funding.
• Improve the trust in and quality of open scientific data services. ICSU-WDS is committed to
increasing the quality of, and trust in, the services provided by its Members, and will concentrate
on the following targets:
o Provide a certification framework for WDS Regular and Network Members.
o Actively promote policies of full and open access to data at national and international
fora.
o Foster interoperable practices to facilitate data sharing.
o Facilitate access to, and use or reuse of, datasets (including through publication), in
particular for multidisciplinary research.
• Position ICSU-WDS as the premium global multidisciplinary network for quality-assessed
scientific research data.
2.1.3 WORLD ECONOMIC FORUM (WEF): REWARDS AND RISKS OF BIG DATA - 2014
Extracting Value from Big Data
Big data is changing our lives and the way we do business. Data-based value creation
requires identifying patterns from which predictions can be inferred and decisions made. It
also requires understanding how to create this value and how to separate valuable information
from hype. This means a clear understanding of some of the following:
• How the network unleashes the benefits of big data;
• How policymakers and business executives need to develop action plans to extract value
from big data;
• How to balance the risks and rewards and how to manage them;
• How to rebalance socioeconomic asymmetry in a data-driven economy;
• What role regulation and trust building may play in turning the potential of big data
into socioeconomic results, and how to define organizational change to take full advantage
of big data.
2.1.4 BIG DATA FOR DEVELOPMENT IN CHINA – UNDP PERSPECTIVE (NOVEMBER 2014)
China has the world’s largest mobile phone market, with over 1.2 billion mobile subscriptions. It has
over 600 million Internet users and is the world’s most active environment for social media, with the
government estimating that over 250 million people use social media. It is also estimated that the
digital universe in China will continue to grow at a rapid rate, with the country’s share of global
digital data expected to rise to 18 percent by 2020, up from 13 percent in 2014. China is therefore a
favorable environment where the Big Data approach could be effective in providing insights on
emerging concerns that are highly relevant to China’s development. A general approach to the use of
big data for development is given in figure 2.2.
Two proposed levels of work in relation to Big Data for Development in China:
To leverage the considerable potential of Big Data for Development, UNDP has identified 2 levels of
work in relation to Big Data for Development:
• Create an enabling environment for Big Data for Development.
• Tackle particular development challenges with the Big Data approach.
In November 2013, China’s National Bureau of Statistics (NBS) signed a series of agreements with 11
major Chinese enterprises, aiming to build long-term collaborative relationships on using Big Data.
These enterprises have indicated their willingness to share data with NBS to maximize the effect of
Big Data applications. For example, the cooperation between NBS and Baidu focuses on 3 main
aspects:
• Generalizing the official statistical data and programs through the Baidu website;
• Improving the predicting model of the macro economy by combining the Big Data on the
web with survey data collected by NBS;
• Grasping more meaningful statistical requirements and completing the survey programs by
following netizens’ paths on the web platform.
Development agencies could seek to be involved in these established partnerships or even create
new partnerships with these enterprises to identify and share data that would be relevant for
development purposes.
FIGURE 2.2: 6 ILLUSTRATIVE EXAMPLES OF BIG DATA FOR DEVELOPMENT
(SOURCE: UNDP PERSPECTIVE (NOVEMBER 2014))
Tackle specific development challenges with the Big Data approach:
To leverage big data to overcome the development challenges in China, the following measures have
been devised by the UN:
• Promote sustainable e-waste disposal practices:
• Improve productivity of the public sector:
• Understand socioeconomic development trend:
• Map poverty:
• Improve urban transport planning
• Identify pollution hotspots in cities:
Challenges for application of Big Data for Development in China:
Adopting the Big Data approach raises a multitude of issues that need to be addressed in order to
ensure effective application of Big Data for Development. China, like any other country, will also
need to address these challenges while using Big Data for development purposes. These challenges
can be broadly grouped into 3 categories, as represented in figure 2.3.
• Operational/systemic challenges
o Privacy: respect of the principle of purpose specification; limiting the amount of data
collected and stored; obtaining valid consents from data subjects; whether or not the data will
be distributed to third parties; giving individuals appropriate access to the data collected
about them; access to information and decisions made about them.
o Changes in decision-making process:
o Administrative barriers:
• Data challenges
o Accessibility:
o Availability:
o Reliability:
• Analytical challenges
o Dissonance between perceptions and facts:
o Data interpretation:
FIGURE 2.3: MAJOR CHALLENGES CONFRONTING BIG DATA FOR DEVELOPMENT
(SOURCE: UNDP PERSPECTIVE (NOVEMBER 2014))
2.1.5 BIG DATA – USA
The Office of Science & Technology Policy (OSTP) of the USA has been spearheading the adoption of
Big Data across Federal Government establishments. Below are
highlights of ongoing Federal government programs that address the challenges of, and tap the
opportunities afforded by, the big data revolution to advance agency missions and further scientific
discovery and innovation.
The Office of Science:
• The Office of Advanced Scientific Computing Research (ASCR) provides leadership to the
data management, visualization and data analytics communities including digital
preservation and community access.
• The High Performance Storage System (HPSS) is software that manages petabytes of data
on disks and robotic tape systems.
• Mathematics for Analysis of Petascale Data addresses the mathematical challenges of
extracting insights from huge scientific datasets and finding key features and understanding
the relationships between those features.
• The Next Generation Networking program supports tools that enable research
collaborations to find, move and use large data: from the Globus Middleware Project in
2001, to the GridFTP data transfer protocol in 2003, to the Earth Systems Grid (ESG) in 2007.
The Office of Basic Energy Sciences (BES):
• BES Scientific User Facilities have supported a number of efforts aimed at assisting users
with data management and analysis of big data, which can be as big as terabytes of data per
day from a single experiment.
• The Biological and Environmental Research Program (BER), Atmospheric Radiation
Measurement (ARM) Climate Research Facility is a multi-platform scientific user facility
that provides the international research community infrastructure for obtaining precise
observations of key atmospheric phenomena needed for the advancement of atmospheric
process understanding and climate models.
• The Systems Biology Knowledgebase (KBase) is a community-driven software framework
enabling data-driven predictions of microbial, plant and biological community function in an
environmental context.
The Office of Fusion Energy Sciences (FES):
• The Scientific Discovery through Advanced Computing (SciDAC) partnership between FES
and the Office of Advanced Scientific Computing Research (ASCR) addresses big data
challenges associated with computational and experimental research in fusion energy
science.
The Office of High Energy Physics (HEP):
• The Computational High Energy Physics Program supports research for the analysis of large,
complex experimental data sets as well as large volumes of simulated data, an undertaking
that typically requires a global effort by hundreds of scientists.
The Office of Nuclear Physics (NP):
• The US Nuclear Data Program (USNDP) is a multisite effort involving seven national labs and
two universities that maintains and provides access to extensive, dedicated databases
spanning several areas of nuclear physics, which compile and cross-check all relevant
experimental results on important properties of nuclei.
The Office of Scientific and Technical Information (OSTI):
• OSTI, the only U.S. federal agency member of DataCite (a global consortium of leading
scientific and technical information organizations), plays a key role in shaping the policies
and technical implementations of the practice of data citation, which enables efficient reuse
and verification of data so that the impact of data may be tracked, and a scholarly structure
that recognizes and rewards data producers may be established.
Health and Human Services (HHS):
• Centers for Disease Control and Prevention (CDC): BioSense 2.0 is the first system to take into
account the feasibility of regional and national coordination for public health situation
awareness through an interoperable network of systems, built on existing state and local
capabilities.
• Networked phylogenomics for bacteria and outbreak ID: CDC’s Special Bacteriology
Reference Laboratory (SBRL) identifies and classifies unknown bacterial pathogens for
effective, rapid outbreak detection.
• Centers for Medicare & Medicaid Services (CMS): A data warehouse based on Hadoop is
being developed to support analytic and reporting requirements from Medicare and
Medicaid programs.
National Institute of General Medical Sciences:
• The Models of Infectious Disease Agent Study (MIDAS) is an effort to develop
computational and analytical approaches for integrating infectious disease information
rapidly and providing modelling results to policy makers at the local, state, national, and
global levels. While data need to be collected and integrated globally, because public health
policies are implemented locally, information must also be fine-grained, with needs for data
access, management, analysis and archiving.
2.1.6 BIG DATA – AUSTRALIA (Accenture’s 2014 Australia Survey Results)
This survey was designed to understand perceptions of and experience with big data. Some of the
important findings are given in figures 2.4 to 2.6.
FIGURE 2.4 : AUSTRALIAN ORGANIZATIONS LAG IN THE USE OF MANY DATA SOURCES
Source: Accenture Big Success with Big Data Survey, April 2014
FIGURE 2.5: AUSTRALIAN ORGANIZATIONS HOWEVER LEAD IN THE USE OF SOME DATA SOURCES
Source: Accenture Big Success with Big Data Survey, April 2014
FIGURE 2.6: ORGANIZATIONS USING BIG DATA TO IMPROVE THE CUSTOMER EXPERIENCE
Source: Accenture Big Success with Big Data Survey, April 2014
Australian Government Service Scenario
The data held by Australian Government agencies has been recognized as a government and
national asset. Departments can ask questions that were previously unanswerable, because the data
wasn't available or the processing methods were not feasible.
It is expected that big data analytics will be used to streamline service delivery, create opportunities
for innovation, and identify new service and policy approaches as well as support the effective
delivery of existing programs across a broad range of government operations.
To facilitate improved delivery, a “Better Practice Guide” has been developed; the salient
feature of this guide is its treatment of the applications of Big Data. Big data projects often
fall into the domains of scientific, economic and social research, with
analytics applied to customer/client segmentation and marketing research, campaign
management, behavioral economics initiatives, enhancing the service delivery experience and
efficiency, intelligence discovery, fraud detection and risk scoring. Figure 2.7 shows the identified
areas.
FIGURE 2.7: CATEGORIES OF BUSINESS PROCESSES THAT CAN BENEFIT FROM BIG DATA PROJECTS
(BigInsights Submission, www.biginsights.com.)
2.1.7 ECONOMIST INTELLIGENCE UNIT (EIU) : WHO’S BIG ON BIG DATA?
In September 2014, the Economist Intelligence Unit (EIU) carried out a global survey of 395 C-level
executives with sponsorship from Platfora. The findings are summarized below:
Finding 1: Executives’ attitudes towards big data are overwhelmingly positive (See chart 1
and chart 2 as given in figure 2.8 and 2.9 respectively)
FIGURE 2.8: VIEW OF THE FUTURE OF BIG DATA
FIGURE 2.9: ATTITUDE TOWARDS BIG DATA
Finding 2: Executives agree on the need for big-data solutions and want to know more (See
Chart 3 as given in figure 2.10)
FIGURE 2.10: PERSONAL KNOWLEDGE OF BIG DATA
Finding 3: Customer processes currently stand out as candidates for big-data analytics (See
Chart 4 as given in figure 2.11)
FIGURE 2.11: PRIORITY APPLICATION OF BIG DATA
Finding 4: Lack of understanding about how to use big data stands in the way of
implementation (See Chart 5 as given in figure 2.12)
FIGURE 2.12: INTERNAL OBSTACLES IN USE OF BIG DATA
Finding 5: Implementation is also held back by lack of agreement about the value of big data (See
Chart 6 as given in figure 2.13)
FIGURE 2.13: CEO’S VIEW OF BIG DATA
Finding 6: Optimal value from big data comes from the creation of enterprise-wide big-data
teams (See Chart 7 as given in figure 2.14).
FIGURE 2.14: STRATEGIES FOR OBTAINING OPTIMUM VALUE FROM BIG DATA TOOLS
Finding 7: Specialized technical skills are needed to optimize use of big data, but in a supportive
role (See Chart 8 as given in figure 2.15)
FIGURE 2.15: HOW THE ORGANIZATION ADDRESSES HUMAN ASPECT OF BIG DATA
2.1.8 IDC WORLDWIDE BIG DATA AND ANALYTICS PREDICTIONS FOR 2015
Some of the important predictions from the IDC FutureScape for Big Data and Analytics are as given
below.
• Visual data discovery tools will be growing 2.5 times faster than the rest of the business
intelligence (BI) market. By 2018.
• Over the next five years spending on cloud-based Big Data and analytics (BDA) solutions will
grow three times faster than spending for on-premise solutions.
• Shortage of skilled staff will persist. In the U.S. alone there will be 181,000 deep analytics
roles in 2018 and five times that many positions requiring related skills in data management
and interpretation.
• By 2017 unified data platform architecture will become the foundation of BDA strategy.
• Growth in applications incorporating advanced and predictive analytics.
• Adoption of technology to continuously analyze streams of events will accelerate in 2015 as
it is applied to Internet of Things (IoT) analytics, which is expected to grow at a five-year
compound annual growth rate (CAGR) of 30%.
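To give a sense of scale for the 30% five-year CAGR quoted above, the following minimal sketch compounds a hypothetical base index of 100. The base value is an assumption chosen only for illustration and is not an IDC figure; the function names are likewise illustrative.

```python
def compound_growth(base: float, rate: float, years: int) -> float:
    """Return the value of `base` after growing at annual `rate` for `years` years."""
    return base * (1.0 + rate) ** years


def cagr(start: float, end: float, years: int) -> float:
    """Return the compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    base_index = 100.0  # hypothetical starting index, not a market-size estimate
    final_index = compound_growth(base_index, 0.30, 5)
    # A 30% CAGR sustained for five years multiplies the base by 1.3 ** 5.
    print(f"Index after 5 years: {final_index:.1f}")
    print(f"Implied CAGR: {cagr(base_index, final_index, 5):.2%}")
```

In other words, a 30% CAGR sustained over five years corresponds to a roughly 3.7-fold increase over the starting level.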
3. DATA SCIENCE RESEARCH & DEVELOPMENT
3.1 DATA SCIENCE – RESEARCH CHALLENGES
3.1.1 SCIENCE & TECHNOLOGY – CHALLENGES
Some of the S&T challenges that researchers across the globe and in India are facing relate to the
data deluge in:
• Astrophysics
• Materials Science
• Earth & atmospheric observations
• Energy
• Fundamental Science
• Computational Biology, Bioinformatics & Medicine
• Engineering & Technology, GIS and Remote Sensing
• Cognitive science
• Statistical data
These challenges require development of advanced algorithms, visualization techniques, data
streaming methodologies and analytics. The overall constraints that the community is facing are:
• The IT challenge: Storage and computational power
• The computer science: Algorithm design, visualization, scalability (Machine Learning,
network & graph analysis, streaming of data and text mining), distributed data,
architectures, data dimension reduction and implementation
• The mathematical science: Statistics, optimization, uncertainty quantification, model
development (statistical, ab initio, simulation), analysis and systems theory
• The multi-disciplinary approach: Contextual problem solving
3.1.2 CHALLENGES IN ACHIEVING ACTIONABLE INSIGHT WITH DATA & ANALYTICS
Big data technologies are maturing to a point at which more organizations are prepared to pilot and
adopt big data as a core component of the information management and analytics infrastructure.
Big data, as a compendium of emerging disruptive tools and technologies, is positioned as the next
great step in enabling integrated analytics in many common business scenarios.
As big data wends its inexorable way into the enterprise, information technology (IT) practitioners
and business sponsors alike will bump up against a number of challenges that must be addressed
before any big data program can be successful. Some of those challenges are:
Uncertainty of the Data Management Landscape – There are many competing technologies, and
within each technical area there are numerous rivals. Our first challenge is making the best choices
while not introducing additional unknowns and risk to big data adoption.
The Big Data Talent Gap – The excitement around big data applications seems to imply that there is
a broad community of experts available to help in implementation. However, this is not yet the case,
and the talent gap poses our second challenge.
Getting Data into the Big Data Platform – The scale and variety of data to be absorbed into a big
data environment can overwhelm the unprepared data practitioner, making data accessibility and
integration our third challenge.
Locating data and software tools: Investigators need straightforward means of knowing what
datasets and software tools are available and where to obtain them, along with descriptions of
each dataset or tool. Ideally, this would include all published and resource datasets and software
tools, both basic and clinical, and, to the extent possible, even unpublished or proprietary data
and software.
Synchronization Across the Data Sources – As more data sets from diverse sources are incorporated
into an analytical platform, the potential for time lags to impact data currency and consistency
becomes our fourth challenge.
Getting Useful Information out of the Big Data Platform – Using big data for different purposes,
ranging from storage augmentation to enabling high-performance analytics, is impeded if the
information cannot be adequately provisioned back to the other components of the enterprise
information architecture, making big data syndication another challenge.
Standardizing data and metadata: Investigators need data to be in standard formats to facilitate
interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets
need to be described by standard metadata to allow novel uses as well as reuse and integration.
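As a minimal sketch of what checking datasets against standard metadata could look like, the example below validates a dataset description against a small set of required fields. The field names (title, creator, date_issued, format, license) are assumptions chosen for illustration; they are not taken from this report or from any specific metadata standard.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {
    "title": str,
    "creator": str,
    "date_issued": date,
    "format": str,
    "license": str,
}


def validate_metadata(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a dataset description (empty if none)."""
    problems: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


if __name__ == "__main__":
    # Illustrative record only; the values are made up.
    example = {
        "title": "Rainfall observations",
        "creator": "Example Observatory",
        "date_issued": date(2015, 1, 1),
        "format": "CSV",
    }
    for problem in validate_metadata(example):
        print(problem)
```

A check of this kind can be run when a dataset is registered, so that records missing required descriptors are caught before they are shared.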
Extending policies and practices for data and software sharing: While significant progress has
been made, broad and rapid sharing of data and software is not yet the norm in all areas of
biomedical research. Establishing effective data- and software-sharing practices requires
appropriate policies, changes in the research culture, recognition of the contributions made by
data and software generators, and technical innovations. Validation of software to ensure quality,
reproducibility, provenance, and interoperability is essential.
Developing new methods for analyzing Big Data: The size, complexity, and multidimensional
nature of many datasets make data analysis extremely challenging. Substantial research is
needed for developing new methods and software tools for analyzing such large, complex, and
multidimensional datasets. User-friendly data workflow platforms and visualization tools are also
needed to facilitate the analysis of Big Data.
Focusing on knowledge to advance the business agenda: Faced with structured and unstructured data
(historic, current and predictive), users are not able to take a 360-degree view of the extraordinary
volume of data available. As a result, they are unable to extract what they need or to discover what
they don’t yet know they need.
Overcoming internal obstacles: An information-centric organization needs an information-driven
mindset – from the top down. That means employees must be managed, measured and
compensated based on how well they use data to make decisions and drive business outcomes. New
ways of running the information-centric businesses of tomorrow will require new organizational
models. Formal change management efforts will be needed to create a high-performance culture
prepared for the organizational implications of new skills, capabilities and infrastructure.
Training researchers for analyzing and for designing tools for analyzing biomedical Big Data
effectively: The challenges of biomedical Big Data are multifaceted. Advances in biomedical
sciences using Big Data will require more scientists with the appropriate data science expertise
and skills, including those in many quantitative science areas such as computational biology,
biomedical informatics, biostatistics, and related areas. Users of Big Data software tools and
resources must be trained to use them well.
Meeting the need for speed: In a hypercompetitive business environment, companies not
only have to find and analyze the relevant data they need, they must do so quickly. Visualization
helps organizations perform analyses and make decisions much more rapidly, but the challenge is
going through the sheer volumes of data and accessing the level of detail needed, all at high
speed. The challenge only grows as the degree of granularity increases.
Addressing data quality: Even if you can find and analyze data quickly and put it in the proper
context for the audience that will be consuming the information, the value of data for
decision-making purposes will be jeopardized if the data is not accurate or timely. This is a
challenge with any data analysis, but when considering the volumes of information involved in big
data projects, it becomes even more pronounced.
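The accuracy and timeliness concerns above can be made measurable. The sketch below, under stated assumptions, computes two simple indicators for a batch of records: completeness (share of records with all required fields populated) and timeliness (share of records updated within an allowed age). The field names, the seven-day threshold and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Sequence


def quality_report(
    rows: List[Dict[str, Any]],
    required: Sequence[str],
    timestamp_field: str,
    max_age: timedelta,
    now: datetime,
) -> Dict[str, float]:
    """Return the fraction of complete rows and the fraction of fresh rows."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "timeliness": 0.0}
    # A row is complete when every required field is present and not None.
    complete = sum(
        1 for row in rows if all(row.get(name) is not None for name in required)
    )
    # A row is fresh when its timestamp is no older than max_age.
    fresh = sum(
        1
        for row in rows
        if row.get(timestamp_field) is not None
        and now - row[timestamp_field] <= max_age
    )
    return {"completeness": complete / total, "timeliness": fresh / total}


# Hypothetical sample data for illustration only.
SAMPLE_NOW = datetime(2015, 1, 31)
SAMPLE_ROWS = [
    {"id": 1, "value": 10.0, "updated": datetime(2015, 1, 30)},
    {"id": 2, "value": None, "updated": datetime(2015, 1, 29)},
    {"id": 3, "value": 7.5, "updated": datetime(2014, 12, 1)},
    {"id": 4, "value": 3.2, "updated": datetime(2015, 1, 25)},
]

if __name__ == "__main__":
    report = quality_report(
        SAMPLE_ROWS, ("id", "value"), "updated", timedelta(days=7), SAMPLE_NOW
    )
    print(report)
```

Tracking such indicators per data source makes it easier to see where quality degrades as volumes grow.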
3.1.3 BIG DATA SECURITY AND PRIVACY CHALLENGES
KPMG has identified five key security and privacy challenges organizations must address to help
ensure proper control of their Big Data program:
Big Data governance: The implementation of Big Data initiatives may lead to the creation or
discovery of previously secret or sensitive information through the combination of different data
sets. Organizations that attempt to implement Big Data initiatives without a strong governance
regime in place risk placing themselves in ethical dilemmas without set processes or guidelines to
follow. Therefore, a strong ethical code, along with process, training, people, and metrics, is
imperative to govern what organizations can do within a Big Data program.
Maintaining original privacy and security requirements (original intent) of data throughout the
information life cycle: Data that is collected and used for Big Data will likely be correlated with
other data sets that may ultimately create new data sets or alter the original data in different, often
unforeseen ways. Organizations must make sure that all security and privacy requirements that are
applied to their original data sets are tracked and maintained across Big Data processes throughout
the information life cycle from data collection to disclosure or retention/destruction.
Re-identification risk: Data that has been processed, enhanced, or changed by Big Data programs
may have benefits both internal and external to the organization. Often, the data must be
anonymized to protect the privacy of the original data source, such as customers or vendors. Data
that is not properly anonymized prior to external release (or in some cases, internal as well) may
result in the compromise of data privacy as the data is combined with previously collected, complex
data sets including geo-location, image recognition, and behavioral tracking. If data is simply
de-identified, possible correlation between data subjects contained within separate data sets must be
evaluated, as third parties with access to several data sets may be able to re-identify otherwise
anonymous individuals.
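One common way to reason about re-identification risk is a k-anonymity check: every combination of quasi-identifiers (attributes that are not names but can single a person out when combined) should be shared by at least k records. This technique is not named in the source discussion; the sketch below is offered only as an illustration, and the field names and sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]

# Hypothetical de-identified records; field names are illustrative only.
SAMPLE_RECORDS: List[Record] = [
    {"postcode": "110003", "birth_year": 1980, "gender": "F"},
    {"postcode": "110003", "birth_year": 1980, "gender": "F"},
    {"postcode": "110003", "birth_year": 1975, "gender": "M"},
]
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def _key(record: Record, quasi_identifiers: Sequence[str]) -> Tuple[object, ...]:
    """Build the quasi-identifier combination for one record."""
    return tuple(record[name] for name in quasi_identifiers)


def smallest_group(records: List[Record], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifiers.

    The data set is k-anonymous for any k up to this value. Returns 0 if empty.
    """
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def records_below_k(
    records: List[Record], quasi_identifiers: Sequence[str], k: int
) -> List[Record]:
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[_key(r, quasi_identifiers)] < k]


if __name__ == "__main__":
    print("Smallest group size:", smallest_group(SAMPLE_RECORDS, QUASI_IDENTIFIERS))
    for record in records_below_k(SAMPLE_RECORDS, QUASI_IDENTIFIERS, k=2):
        print("At risk of re-identification:", record)
```

Records flagged by such a check would need further generalization or suppression before release. Note that k-anonymity alone does not address every linkage attack, so it is best treated as one screening step among several.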
Third parties – usage and honoring contractual obligations: Matching data sets from other
organizations may unlock insights using Big Data that an organization could not uncover with its data
alone. It may also pose significant risk, as the security and privacy data protections in place at the
third-party organization may not be adequate. Prior to sharing data with third parties, organizations
must evaluate their relevant practices and decide whether they are satisfactory.
Interpreting current regulations and anticipating future regulations: As noted, the United States and
the EU do not have laws or regulations specific to Big Data; however, there are existing laws
restricting the collection, use, and storage of specific personal information types, including financial,
health, and children’s data. To keep current with quickly changing and newly implemented laws,
companies must perform an initial inventory of applicable laws and update this inventory on a
regular basis.
3.2 MODELS FOR RESEARCH
3.2.1 TARGETED RESEARCH PROGRAM:
This program will provide support to large-scale, multidisciplinary projects that demonstrate
potential to expand on existing strengths or develop new innovative research related to strategic
areas of emphasis for the DST’s BDI. Targeted research proposals must have significant institutional
and department support.
3.2.2 BRIDGE FUNDING PROGRAM:
Bridge funding will provide financial support for existing research programs whose external
funding has been exhausted. The funding needs to support the continuation of the operations of a
lab or program in order to avoid ending the program while external support is being reviewed or
pursued. Bridge funding must be used in a strategic and coordinated way to maintain project/lab
momentum while assuring effective use of limited resources.
3.2.3 SEED FUNDING PROGRAM:
Seed funding will afford an opportunity for special projects with
the aim of fostering the engagement of multidisciplinary teams to establish linkages towards
attaining extramural funding. This program will be open to all disciplines, with prioritization given to
projects that have the potential to position the researcher or research team to be competitive for
external funding or to bring high impact to DST’s BDI through the proposed work.
3.2.4 OTHER MODELS
Apart from the above core initiatives, a few more models are suggested, as given in table 3.1.
TABLE 3.1: MODELS FOR RESEARCH
S. N. | MAJOR MODEL FOR RESEARCH | MINOR MODELS | PURPOSE & BRIEF DESCRIPTION
1 | Research Projects | Research Project Grants | To provide support to an institution on behalf of a principal investigator for a project proposed by the investigator.
1 | Research Projects | Small Grants | To provide limited research support, usually for preliminary, short-term projects; two years maximum; non-renewable.
1 | Research Projects | Conference Grants | To provide funding for conferences to exchange and disseminate information related to program interests.
1 | Research Projects | Exploratory/Developmental Grant | To encourage new research in a given program area; preliminary data not generally required.
1 | Research Projects | Resource-Related Research Projects | To support research projects to enhance capacity of resources that serve biomedical research.
1 | Research Projects | Education Projects | To support the development or implementation of a program in education, information, training, technical assistance, coordination, or evaluation.
1 | Research Projects | Field Trial Planning Grant | To support initial development of a field/clinical trial, e.g., establishing a research team, developing tools for managing data and overseeing the research, and developing a trial design, protocol, recruitment strategies, and procedure manuals.
1 | Research Projects | Small Business Technology Transfer Grants | To support collaborative research by a small business with a research institution on a project intended for commercialization in two phases. Phase I grant: to establish the technical merit and feasibility of the research concept. Phase II grant: to support further research leading to a product or service.
1 | Research Projects | Small Business Innovation Research Grants | To support small business research on a project intended for commercialization in two phases. Phase I grant: to establish the technical merit and feasibility of the research concept. Phase II grant: to support further research leading to a product or service.
1 | Research Projects | High Priority, Short Term Project Award | To provide interim, nonrenewable research support for up to one year to highly meritorious applications.
2 | Fellowships Awards | Research Service Awards | For Individual Postdoctoral Fellows: to provide individual fellowships to postdoctoral trainees.
2 | Fellowships Awards | Senior Research Service Awards | For Senior Fellows: to provide opportunities for experienced scientists to make major changes in the direction of their research careers, broaden their scientific background, or acquire new research capabilities.
3 | Career Development Awards | Mentored Research Development Award | Career development in a new area of BDA research – 1 year
3 | Career Development Awards | Independent Research Development Award | Develop the career of the funded researcher – 1 year
3.3 CENTRE OF EXCELLENCE (CoE) FOR DATA SCIENCE
Centres of excellence are emerging as a vital strategic asset to serve as the primary vehicle for
managing complex change initiatives. Centres of excellence exist to bring about central focus to
many business issues, for example, data integration, project management, enterprise architecture,
business and IT optimization, and enterprise-wide access to information.
What is a Centre of Excellence?
Forrester Research, Inc. defines a Centre of Excellence as “A formally appointed and documented
body of knowledge and experience on a particular subject area with the goals of providing expertise,
managing governance practices, and supporting projects associated with the subject area.”
Why a Centre of Excellence for DATA SCIENCE/BDA is necessary:
A Centre of Excellence will maximize the quality, efficiency and application of analytics across all
lines of business, resulting in greater confidence and consistency in decision-making. It will lead to a
higher success rate for business analytics deployments, delivering more value at less cost and in less
time. The CoE drives end user adoption, leading to a smoother path to improved outcomes and
provides a formal organizational structure, enabling the organization (DST) to strike the right
balance between agility and sound management in deploying analytics technologies. It will also
eliminate the gap between Business and IT, improving time-to-market and responsiveness to
change.
3.3.1 CoE VALUE PROPOSITION:
A Centre of Excellence offers four major value propositions; these and their constituents are
listed in table 3.2.
TABLE 3.2: CoE VALUE PROPOSITION
S. N. | MAJOR PROPOSITIONS | CONSTITUENTS
1, 2 | Governance & Practices; Purpose Drive Disruption | Thought Leadership; Quality; Customer Loyalty; Increased Revenues; Metrics & KPIs; Innovation; Integration; Collaboration; Alignment; Growth
3, 4 | Framework & Reusable Artifacts; Knowledge Dissemination | Rapid Solutions; Efficiency; Reduced Cycle; Reduced Cost; High Performance; Enablement; Competency; Employee Loyalty; Expert System; Just in Time Advice
3.3.2 DATA SCIENCE CENTRE OF EXCELLENCE COVERAGE
Any good CoE should provide the following:
• Establishing Competitive Advantage
• Discovery of real, viable use cases
• Identification of situations needing and not needing Big Data
• Resource Management
• Collaboration Awareness
• Requirements Gathering Best Practices
• Building a Steering Committee
• Enterprise Architecture
• Big Data Maturity Models
• Damage Control when projects go off track
BDA CoE Scope:
A comprehensive BDA CoE is broadly scoped to include the services, functions, tools and metrics to
ensure the organization invests in the most valuable projects, and then delivers the expected
business benefits from project outcomes. The BDA CoE function chart in figure 3.1 provides a
summary of typical BDA CoE responsibilities.
FIGURE 3.1: BDA CoE FUNCTION CHART
(SOURCE: www.batimes.com/.../the-ba-practice-lead-handbook-5-getting-organized, dated 19/05/13)
[Figure: the BDA Centre of Excellence and its four functions: BDA Standards, BDA Development, BDA Services, and BDA Full Cycle Governance]
3.3.3 THE DATA SCIENCE CoE MISSION AND OBJECTIVES:
The objectives are met through training, consulting and mentoring business analysts and project
team members, by providing BDA resources to the project teams, by facilitating the portfolio
management process, and by serving as the custodian of BA best practices. The strategic BDA CoE
generally performs all or a subset of the following services:
• BA Standards – provides standard business analysis practices
   o Methods
   o Knowledge Management
   o Continuous Improvement
• BDA Development – provides professional development for business analysts
   o BDA Career Path
   o Coaching and Mentoring
   o Training and Professional Development
   o Team Building
• BA Services – serves as a group of facilitators and on-the-job trainers who are skilled and
  accomplished business analysts to provide business analysis consulting support including:
   o Conducting market research, benchmark, and feasibility studies
   o Developing and maintaining the business architecture
   o Preparing and monitoring the business case
   o Eliciting, analyzing, specifying, documenting, and validating requirements
   o Managing requirements verification and validation activities, for example, the user
     acceptance test
   o Preparing the organization for deployment of a new business solution
   o Providing resources to augment project teams to perform business analysis activities
     that are under-resourced or urgent
• Full cycle Governance – promotes a full life-cycle governance process, managing
  investments in business solutions from research and development to operations; provides
  a home (funding and resources) for pre-project business analysis and business case
  development
   o Business Program Management
   o Strategic Project Resources
   o Portfolio Management
   o Enterprise Analysis
   o Benefits Management
3.3.4 KPIs FOR MEASURING THE SUCCESS OF A CoE:
Some of the KPIs given below may be used to measure the success of the CoEs established:
• Higher project success rate
• Reduced costs for professional services, management overhead and TCO
• Reduced gap between Business and IT, improving time to market and responsiveness to
  change
• More unified collaboration between departments and regions
• Increased ROI and clearer identification of competitive advantage
• Greater confidence and consistency in decision-making
• Higher success rate for business analytics deployments
• The right balance between agility and sound management in deploying analytics solutions
3.3.5 SOME SUGGESTIONS FOR CREATING CENTRES OF EXCELLENCE:
• Centre for Causal Modelling and Discovery
• Centre for Predictive Computational Phenotyping
• Centre for Mobility Data Integration to Insight
• Centre for Computational Knowledge Engine
• Centre for Big Data in Translational Genomics
• Centre for Patient-Centered Information Commons
• Centre for Mobile Sensor Data-to-Knowledge
• Centre for Expanded Data Annotation and Retrieval
• Centre for Big Data for Discovery Science
3.4. BIG DATA IN SCIENCE AND TECHNOLOGY R&D
3.4.1 BIG DATA: R & D PERSPECTIVE
In the Big Data research context, analytics over Big Data plays a leading role. Analytics covers a
wide family of problems arising mainly in database, data warehousing and data mining research.
Analytics research aims to develop complex procedures that run over very large data repositories
with the objective of extracting the useful knowledge hidden in them. One of the most significant
application scenarios in which Big Data arises is, without doubt, scientific computing. Here,
scientists and researchers produce huge amounts of data every day through experiments (e.g., in
disciplines such as high-energy physics, astronomy, biology and bio-medicine). But extracting useful
knowledge for decision-making purposes from these massive data repositories is almost impossible
with current DBMS-inspired analysis tools. There are also methodological research challenges. A new
methodology is required for transforming Big Data stored in heterogeneous data sources (e.g., legacy
systems, the Web, scientific data repositories, sensor and stream databases, social networks) into a
structured, and hence readily interpretable, format for target data analytics. As a consequence,
data-driven approaches in biology, medicine, public policy, social sciences and humanities can
replace the traditional hypothesis-driven research in science.
The research problems linked to the discovery of new insights from big data belong to a novel
and rapidly expanding research domain: machine learning. At the intersection of statistics, computer
science and emerging applications in industry, this research domain focuses on the development of
fast and efficient algorithms for processing data, with the main goal of delivering accurate
predictions of various kinds. Business applications include product recommendation, customer
segmentation, fraud detection and churn prevention. Machine learning techniques address such
applications with a set of generic methods that differ from more traditional statistical techniques.
The emphasis is on real-time and highly scalable predictive analytics, using fully automatic and
generic methods that simplify most of the problems of data analytics. At the user layer,
visualization and interactive exploration are important problems for Big Data. A novel class of
visualization metaphors, methodologies and solutions must be devised to cope with the emerging
challenges of visualizing Big Data; real-time visualization of extracted core data, visualization of
mashed-up data, and effective visualization on mobile devices are interesting problems. Coupled with
visualization, interactive exploration is a critical milestone in Big Data research; in fact,
enormous data sets are difficult to explore while extracting useful knowledge. Strategies need to
address issues such as conceptual navigation, concept drift and interaction metaphors.
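As a minimal sketch of one of the business cases named above, customer segmentation, the following Python code groups customers with a plain k-means loop. The customer figures, the two features (annual spend and visits per month), the fixed starting centroids and the function name kmeans are illustrative assumptions and do not come from any system cited in this report; a production pipeline would more likely rely on an established library.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (annual spend in thousands, visits per month).
customers = [(2, 1), (3, 2), (2.5, 1.5), (40, 12), (42, 10), (38, 11)]
# Fixed starting centroids keep the example deterministic (an assumption).
labels, centres = kmeans(customers, centroids=[(0.0, 0.0), (50.0, 15.0)])
print(labels)   # cluster index per customer
print(centres)  # final cluster centres
```

With this toy input the low-spend and high-spend customers fall into separate clusters. On real data the features would need scaling and the number of clusters would have to be chosen and validated.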
Environmental monitoring has become reliant upon automated sensors for data acquisition.
This results in the generation of large, high-dimensional data streams (‘Big Data’) that personnel
must search through to identify data structures. Nature-inspired computation, inclusive of artificial
neural networks (ANNs), affords the unearthing of complex, recurring patterns within sizable data
volumes. This has applications in agriculture, weather monitoring, epidemiological study, traffic
planning, pollution monitoring, and ecological and natural resource management.
The world has become a much more dangerous place. The existence of private organizations
willing to kill randomly to further their view of the world, and then to hide among innocents to avoid
being attacked in turn, presents a challenge that was not present a few decades ago. Surveillance
that draws on all forms of data, from media reports and Twitter streams to videos and social
sensing, requires processing huge volumes of constantly changing and uncertain data to derive the
desired intelligence. This is a significant research challenge. Equivalent advances in technology are
needed to prevent terrorist mayhem from proliferating, and data mining on big data can be used to
detect bad actors.
3.4.2 CODATA RECOMMENDATIONS FOR SCIENTIFIC PROGRAMS
The international scientific community has a responsibility to examine all the opportunities to use
Big Data for knowledge discovery that will benefit society and the sustainability of the planet.
Scientific research and discovery present particularly significant challenges and notable
opportunities for transdisciplinary, international research programs. The challenges and
opportunities of Big Data have significant implications for scientific data services and infrastructure
providers.
The Workshop on Big Data for International Scientific Programs, jointly with CODATA, has made the
following recommendations to address the challenges and take advantage of the opportunities
of the Big Data age.
• Respond to the importance of Big Data for international scientific programs
• Exploit the benefits of Big Data for society
• Improve understanding of Big Data through international collaboration
• Promote universal access to Big Data through global research infrastructures
• Explore and address the challenges of Big Data stewardship
• Encourage capacity building and skills development
• Foster development of policies to maximize exploitation of Big Data
Proposed Actions for a CODATA Working Group:
• Establish a CODATA Working Group on Big Data for Scientific Programs
• Produce case studies in Big Data for international scientific programs
• Promote sharing of Big Data solutions across scientific disciplines
• Research policy, ethical and legal issues for Big Data
• Research stewardship and sustainability challenges for Big Data
3.4.3 EUROPEAN RESEARCH AGENDA FOR BIG DATA ANALYTICS
The vision of Big Data Analytics in Europe is based on the fair use of big data with the development
of associated policies and standards, as well as on empowering citizens, whose digital traces are
recorded in the data. It is expected to provide a data and knowledge infrastructure serving
citizens, scientists, institutions and businesses through:
• Access to data and knowledge services, and
• Access to analytical services and results, within a framework of policies for access and
  sharing based on the values of privacy, trust, individual empowerment and public good.
This means fulfilling several requirements at different levels, some of which are given below.
• Scientific and technological challenges, such as laying new foundations for Big Data Analytics,
  which integrate knowledge discovery from Big Data with statistical modelling and complex
  systems science
• Semantic data integration and enrichment technologies
• Scalable, distributed, streaming Big Data Analytics technologies
• Data requirements, such as who owns and uses personal data, the real value of such data,
  and how to make it possible to access and link the different data sources
• Education and data literacy
• Promotional initiatives for data analytics and BDA-as-a-service
• Effective ways of promoting and helping the development of Big Data Analytics
3.4.4 BIG DATA IN GENOMICS
Life Sciences have been highly affected by the generation of large data sets, specifically by overloads
of omics information (genomes, transcriptomes, epigenomes and other omics data from cells,
tissues and organisms).
Next-Generation Sequencing (NGS) platforms that use semiconductors or nanotechnology have
exponentially increased the rate of biological data generation in the last two years. The steadily
decreasing costs have enabled the generation of information at the petabyte scale. However, there
is a lack of computational infrastructure that is needed to securely generate, maintain, transfer, and
analyze large-scale information in life sciences and to integrate omics data with other data sets,
such as clinical data from patients (mainly from Electronic Medical Records or EMRs).
Genomics
Personal genomics is a key enabler for predictive medicine, where a patient’s genetic profile can be
used to determine the most appropriate medical treatment. Projects such as ENCODE have produced
piles of data, illustrating how Big Data is becoming integral to scientific research. Indeed, science
today is increasingly “social”, especially in fields such as genomics in which huge amounts of data
are generated.
Big projects need to store the data and information they generate, and computational solutions
such as cloud-based computing have emerged in response. Cloud computing is the only storage model
that can provide the elastic scale needed for DNA sequencing. Many companies are using cloud
solutions from different providers; however, challenges remain, such as the security and privacy of
personal medical and scientific data. Some companies and institutions offer solutions (table 3.3).
TABLE 3.3: EXAMPLES OF COMPANIES & INSTITUTIONS PROVIDING SOLUTIONS TO GENERATE,
ANALYZE & VISUALIZE OMICS & CLINICAL DATA (SOURCE: Big Data in Genomics: Challenges and Solutions)
Company / Institution | Type of Solution | Website
Appistry | Appistry's high-performance big data platform combines self-organizing computational storage with optimized and distributed high-performance computing to provide secure, HIPAA-compliant, accurate on-demand analysis of omics data in association with clinical information | www.appistry.com
BGI | Beijing Genomics Institute (BGI)'s solution serves as a solid foundation for large-scale bioinformatics processing. The BGI computing platform is an integrated service composed of versatile software and powerful hardware applied to life sciences | www.genomics.cn/en
CLC Bio | CLC Bio has a bioinformatics platform where both desktop and server software are integrated and optimized for best performance. CLC Bio utilizes proprietary algorithms, based on published methods, to accelerate data calculations and achieve remarkable improvements in big data analytics | www.clcbio.com
DNAnexus | DNAnexus provides solutions for NGS by using cloud computing infrastructure with scalable systems and advanced bioinformatics in a web-based platform to solve data management and the challenges in analysis that are common in unified systems | www.dnanexus.com
Genome International Corporation (GIC) | Genome International Corporation (GIC) is a research-driven company that provides innovative bioinformatics products and custom research solutions for corporate, government, and academic laboratories in life sciences | www.genome.com
GNS Healthcare | GNS Healthcare is a big data analytics company that has developed a scalable approach to deal with big data solutions that could be applied across the healthcare industry | www.gnshealthcare.com
Foundation Medicine | Foundation Medicine is a molecular information company on the forefront of bringing comprehensive cancer genomic analysis to routine clinical care. Foundation Medicine is pioneering the development of a comprehensive cancer diagnostic test combining omics data, clinical information and big data analytics applied to cancer research | www.foundationmedicine.com
Knome | Knome analyzes whole genome data using software-based tests simultaneously to examine and compare many genes, gene networks, and genomes as well as integrate other forms of molecular and non-molecular data. Knome provides a platform and tools to help researchers and doctors develop next generation, software-based tests and make clinical decisions. | www.knome.com
NextBio | NextBio's big data technology enables users to systematically integrate and interpret public and proprietary molecular data and clinical information from individual patients, population studies and model organisms, applying genomic data in useful ways in both scientific and medical research. | www.nextbio.com
In time to come, channels will be needed that can store, transfer, analyze and visualize the
increasing amounts of genomics data and generate “short” reports for researchers and clinicians.
Cloud computing could help the genomics industry meet this need and, in doing so, transform
medicine and life sciences.
Data-driven Science and Medicine:
The complexity of the data generated in scientific projects is likely only to increase as we
continue to isolate and sequence individual cells and organisms while lowering the costs of
generating and analyzing this data, such that hundreds of millions of samples can be profiled.
In the future, the big genome centres will require high-performance computational
environments for integrating all the data generated. The integration of hardware and
software infrastructures tailored to deal with big data in life sciences will become more common in
the years to come. Data-driven medicine will enable the discovery of new treatment options
based on multi-modal molecular measurements on patients and on learning from the trends in
differential diagnosis, prognosis and prescription side-effects in clinical databases.
Moreover, the combination of omics data with clinical information from patients will yield new
scientific knowledge that could be applied in the clinic to help in patient care. Across all
possible scenarios, the role of big data will be very significant in both scientific inquiry and
patient care.
Major Challenges
• Big Data generation and acquisition will create challenges for storage, transfer and security
  of information.
• The second challenge will be to transfer data from one location to another.
• The third major challenge will be posed by the security and privacy of data from individuals.
3.4.5 BIG DATA AND REMOTE SENSING
Remote sensing researchers have long been using remote sensing data to address localized
science questions, such as assessing the amount of developed versus undeveloped land in
a particular metropolitan area, or quantifying timber resources in a given forested area.
Subsequently, as software and hardware capabilities for processing large volumes of imagery
became more accessible, and image availability also increased, remote sensing correspondingly
expanded to encompass regional and global scales, such as estimating vegetation biomass covering
the Earth’s land surfaces, or measuring the sea surface temperatures of our oceans. With today’s
processing capacity, this has been extended yet further to include investigations of large-scale
dynamic processes, such as assessing global ecosystem shifts resulting from climate change, or
improving the modeling of weather patterns and storm events around the world.
There is a logical progression as research and applications keep pace with greater data availability
and ongoing improvements in processing tools. But the field of remote sensing, and its associated
data, is continuing to grow. What else can remote sensing tell us and how else can this immense
volume of data be used? Are there relationships yet to be exploited that can be used to indicate
consumer behavior and habits in certain markets? Are there geospatial patterns in population
expansion that can be used to better predict future development and resource utilization? The
answers to these and many other similar questions can suitably be provided by using big data in
remote sensing.
3.4.6 ISACA: RISKS AND CONCERNS WITH BIG DATA
The process of big data analytics involves analyzing the collected data to find patterns and
correlations that may not be initially apparent, but may be useful in making business decisions.
These data, which include personal data, are useful from a marketing perspective in understanding
the likes and dislikes of potential buyers and in analyzing and predicting their buying behaviour.
Personal data can be categorized as:
• Volunteered data: Created and explicitly shared by individuals (e.g., social network
  profiles)
• Observed data: Captured by recording the actions of individuals (e.g., location data
  when using cell phones)
• Inferred data: Data about individuals based on analysis of volunteered or observed
  information (e.g., credit scores)
Risks and Concerns with Big Data
While big data can supply a competitive advantage and other benefits, it also carries
significant risk. Enterprises that have huge amounts of structured and unstructured data
available should be asking:
• Where should we store the data?
• How are we going to protect the data?
• How are we going to utilize the data safely and lawfully?
As security policies and procedures are still developing in many areas, big data risk
management is evolving. Though the need to manage data risk within the enterprise may not be
clearly communicated and understood at all management levels, addressing big data risk and
concerns cannot be seen exclusively as an information technology exercise. The entire enterprise,
including legal, finance, compliance, internal audit and other business departments, needs to get
involved in data risk.
Some data should be considered “toxic” in the sense that loss of control over these data could be
damaging to the enterprise. Examples of potentially “toxic” data are:
• Private or custodial information, such as credit card numbers
• Strategic information, such as intellectual property, business plans and product designs
• Information such as key performance indicators, sales figures and financial metrics
Enterprises that rely on personal data that are generated or that can be modified by the public have
to be extra careful. Social media data can be a highly valuable source for assessing customer
sentiment, tracking the effectiveness of marketing campaigns and learning more about consumers.
Dealing with this kind of data will require addressing current uncertainties and points of tension:
• Privacy: Individual needs for privacy vary.
• Global governance: There is a lack of global legal interoperability.
• Personal data ownership: The concept of property rights is not easily extended to
  data, creating challenges in establishing usage rights.
• Transparency: Too much transparency too soon poses as much risk of destabilizing the
  personal data ecosystem as too little transparency.
• Value distribution: Even before value can be shared more equitably, more clarity is
  required on what truly constitutes value for each stakeholder.
Strategies for Addressing Big Data Risk
The main strategy for addressing risk is aligning the technology solution to business needs. The
COBIT 5 framework addresses this in the goals cascade by aligning stakeholder drivers and
stakeholder needs. ISACA has identified seven enablers that should be applied to assist the
enterprise in addressing risk and improving its ability to meet its business objectives and create
value for its stakeholders. It further defines the dimensions of data quality: the quality goals of
information are divided into three sub-dimensions (see table 3.4 below).
TABLE 3.4: DATA QUALITY SUB DIMENSIONS (SOURCE: ISACA)
Intrinsic Quality
• Accuracy
• Objectivity
• Believability
• Reputation
Contextual and Representational Quality
• Relevancy
• Completeness
• Currency
• Appropriate amount of information
• Concise representation
• Consistent representation
• Interpretability
• Understandability
• Ease of manipulation
Security/Accessibility Quality
• Availability/timeliness
• Restricted access
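To illustrate how two of the contextual sub-dimensions in table 3.4, completeness and currency, might be turned into measurable checks, a minimal Python sketch follows. The record layout, the field names, the one-year freshness threshold and the function names are hypothetical choices made for this example only; ISACA does not prescribe these formulas.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2015, 1, 10)},
    {"id": 2, "email": None, "updated": date(2014, 1, 2)},
    {"id": 3, "email": "c@example.com", "updated": date(2015, 2, 20)},
    {"id": 4, "email": "d@example.com", "updated": date(2013, 11, 5)},
]


def completeness(rows, field):
    """Share of rows in which `field` is present (not None)."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)


def currency(rows, field, as_of, max_age_days):
    """Share of rows whose `field` date is at most `max_age_days` old."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return fresh / len(rows)


as_of = date(2015, 3, 1)  # assumed reporting date
email_completeness = completeness(records, "email")
update_currency = currency(records, "updated", as_of, max_age_days=365)
print(email_completeness)  # share of records with an email address
print(update_currency)     # share of records updated within the last year
```

Scores of this kind could then be compared against thresholds set out in the enterprise's data policy.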
Governance for Big Data
Governance ensures that stakeholders’ needs, conditions and options are evaluated to determine
balanced, agreed-on enterprise objectives to be achieved. The scope of an enterprise’s governance,
risk and compliance would most likely be expanded to create a unified system that consolidates
silos and business functions to enable access to all the data. The end-to-end governance approach
that is at the foundation of COBIT 5 is depicted in figure 3.2 below, which shows the key components
of a governance system.
Assurance Considerations for Big Data
Controls around big data can be grouped into four categories:
• Approach and understanding: This category addresses demonstrating the right tone at the
  top of the enterprise. A critical facet in this effort is the establishment and implementation
  of a data policy.
• Data Quality: Controls should be established and implemented across the data flow to
  assess data against the accuracy, reliability, completeness and timeliness criteria defined in
  the data policy and associated standards.
• Confidentiality and privacy: Through the data risk management process, all sensitive data
  should be identified and appropriate controls put in place. The nature of the sensitive
  information could vary from personal information to competitive secrets.
• Availability: Reliable (i.e., tested) disaster recovery arrangements should be in place to
  ensure that data are available in accordance with the data recovery point objective (RPO)
  and recovery time objective (RTO) criteria defined in a business impact analysis.
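The availability control above can be made concrete with a small check of a disaster recovery test against RPO and RTO targets. The sketch below is illustrative only: the backup schedule, the failure and restore times, the four-hour targets and the function names are all assumed values, not figures from ISACA or from any business impact analysis.

```python
from datetime import datetime, timedelta


def meets_rpo(backup_times, failure_time, rpo):
    """True if the newest backup taken before the failure is no older
    than the recovery point objective (maximum tolerable data loss)."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        return False  # nothing to restore from
    return failure_time - max(earlier) <= rpo


def meets_rto(failure_time, restored_time, rto):
    """True if service was restored within the recovery time objective."""
    return restored_time - failure_time <= rto


# Hypothetical figures from a disaster recovery test.
backups = [datetime(2015, 3, 1, h) for h in (0, 6, 12, 18)]  # every 6 hours
failure = datetime(2015, 3, 1, 16, 30)
restored = datetime(2015, 3, 1, 19, 0)

rpo_ok = meets_rpo(backups, failure, rpo=timedelta(hours=4))
rto_ok = meets_rto(failure, restored, rto=timedelta(hours=4))
print(rpo_ok)  # last backup before the failure was at 12:00
print(rto_ok)  # service was back 2.5 hours after the failure
```

In this example the restore time is within target but the six-hourly backups cannot guarantee a four-hour RPO, which is the kind of gap a tested recovery arrangement is meant to expose.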
FIGURE 3.2: GOVERNANCE OBJECTIVE: VALUE CREATION
4. DATA SCIENCE APPLICATIONS
4.1. BIG DATA - DIGITAL INDIA, MAKE-IN-INDIA
4.1.1 BIG DATA IMPACT ON INDIA:
Fast data systems and less expensive smartphones will drive the adoption of many new services
and set new expectations for customer experience. Four major technologies and services, namely
Big Data analytics, the Internet of Things, Mobile Financial Services and Network Functions
Virtualization, will impact the country:
• Big Data and analytics will both be a big boost for the KPO businesses in India
• Big Data and analytics will trigger jobs growth for India
• Digital India will be powered by the Big Data from smart cities
• The mobile explosion and Big Data analytics
4.1.2 THE DIGITAL INDIA IMPETUS
Ernst & Young says there is a lot of reason for the IT industry to be positive about 2015. This
understanding seems to be based on the following:
• The Government has estimated an investment of US$ 26 billion in technology for 2014-15
  for digitization, infrastructural improvements, and a push for manufacturing and technology in
  healthcare and agriculture. The Indian government’s Digital India program, which aims to
  transform India into a digitally-empowered society and knowledge economy, will bring forth
  many opportunities for a large number of IT industry players to develop platforms providing
  government services and information to people in all parts of the country. Security and data
  accessibility solutions will see increased demand from the government.
• The development of 100 smart cities under the GoI ‘Smart Cities’ initiative will require
  companies to build consortiums to bag these projects. This will drive investments at all
  layers of ICT infrastructure, benefitting companies in technology consulting,
  telecommunications, networks, hardware infrastructure, managed services and systems
  integration.
• The Government of India, through its “Make in India” initiative, is increasing its focus on this
  sector, and aims to transform it from a consumption-driven market into one with the
  manufacturing capability to meet local and export-related demand. Several incentives are
  being offered by the government, including financial assistance in setting up electronics
  manufacturing clusters, capital subsidies to Electronic System Design & Manufacturing
  (ESDM) units, approval for setting up semiconductor fabrication units, and the
  setting up of a US$2 billion Electronics Development Fund to fund selected projects.
IDC: According to research agency IDC, IT spend in Indian manufacturing will double by 2016.
IDC’s Manufacturing Insight predicts that India's manufacturing IT spending will grow to $8,781.8
million by 2016, double the manufacturing IT spending of 2011, representing a CAGR of 14.5%
between 2012 and 2016. The sector with the highest IT spend in Indian manufacturing in 2012 was
automotive, followed by chemicals and consumer products.
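The growth figures quoted from IDC can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The Python sketch below is a worked illustration under stated assumptions: the 2011 and 2012 base values are not given in the text, so one is derived from the doubling claim and the other from the quoted 14.5% rate, and the period counts (five and four years) are an interpretation of the wording.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over `years` compounding periods."""
    return (end_value / start_value) ** (1 / years) - 1


def implied_start(end_value, rate, years):
    """Starting value that grows to `end_value` at `rate` over `years`."""
    return end_value / (1 + rate) ** years


spend_2016 = 8781.8  # US$ million, the IDC projection quoted above

# Assumption: doubling between 2011 and 2016 spans five compounding periods.
doubling_rate = cagr(spend_2016 / 2, spend_2016, years=5)
print(f"{doubling_rate:.2%}")

# Assumption: "between 2012 and 2016" means four periods from a 2012 base.
base_2012 = implied_start(spend_2016, rate=0.145, years=4)
print(f"{base_2012:.1f}")
```

Under these assumptions the two statements are roughly consistent: doubling in five years corresponds to a growth rate of a little under 15% per year.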
4.1.3 MAKE IN INDIA:
The government’s ‘Make In India’ campaign aims at spurring manufacturing-led growth, with more
focus on the ease of doing business than on an incentive-linked investment climate. The push
for manufacturing has two aims: to create jobs and to lift growth. According to the India Electronics
and Semiconductor Association (IESA), an industry body, the electronic system design and
manufacturing (ESDM) industry will benefit from the government’s Make in India campaign and is
projected to see investment proposals worth Rs 10,000 crore over the next two years.
The Internet of Things and big data go hand in hand: with access to more information and the
ability to analyze it rapidly, manufacturers will be able to develop new tools that improve quality,
increase throughput, and reduce machine failure and downtime, and so achieve a leading
competitive advantage.
Tata Group chairman Cyrus Mistry says, "Today, emerging technologies in the digital and physical
space are transforming business at a pace never seen before. We must deepen our understanding in
several areas such as digitization and big data analytics and develop an innovation and technology
roadmap to effectively serve evolving customer needs." He further commented that "In India,
recent policy measures and the strategic direction defined by the new government, especially its
ambitious 'Make in India' campaign, hold the promise of re-igniting growth in the years to come."
4.2. GOVERNMENT & BIG DATA
4.2.1 BDA APPLICATIONS IN GOVERNMENT SECTOR:
There are many ways in which ‘Big Data – Business Analytics’ can be leveraged by the Central and
State Governments to drive growth and change and to implement their various policies and
schemes. Some of the prominent areas are:
• Aadhaar: As the majority of citizens in the country (more than 60 crore at the last count) have
  been provided with an Aadhaar number, the governments can use this facility to plan,
  implement and monitor their citizen-related initiatives.
• Direct Benefit Transfer Scheme: The governments can decide the funding for various
  schemes, ensure that the money reaches the beneficiaries, and keep track of improvement
  and growth within each scheme and in any particular region where people benefit from
  it.
• Impact of Election and Voting system: Governments can analyze this big data to make
  policies and schemes based on those statistics, which will help the people of the country
  as well as the growth of the country.
• Impact and conditions of Infrastructure Projects: Analysis of the large amount of data
  periodically collected can help the governments in preserving critical infrastructure all over
  the country.
• Impact of Education: Analysis of the large amount of data periodically collected about the
  delivery, outputs, outcomes and impact of education initiatives at primary, secondary
  and tertiary level can be useful in formulating education policies.
• Impact of Health care initiatives: Analysis of the large amount of data periodically collected
  about the delivery, outputs, outcomes and impact of healthcare initiatives at primary,
  secondary and tertiary level can be useful in formulating healthcare policies.
• Business Analytics for Tax Administration: The Central as well as State Governments are
  involved in multiple tax regimes, at both the corporate and the individual level. The country's
  income tax-payer base itself is about 3 crore, and the number has been inching up slowly for the
  last 5-10 years; the government would like to see it grow at a faster pace. The
  governments are always looking for efficient ways and means of ‘Improving Tax
  Administration’. This is possible by analyzing the huge amounts of data available on various
  parameters typical of the tax regime, such as spending patterns and the interstate movement of
  goods.
• Crowd sourcing platform mygov.in: The Prime Minister’s Office is already using Big Data
  Analytics to process citizens’ ideas and sentiments through the crowd sourcing
  platform mygov.in, and is implementing an attendance system for India’s Central Government
  employees through attendance.gov.in. Similarly, the state Government of Telangana is
  employing Big Data Analytics for the data collected from nearly 3.5 crore people across
  strata.
4.2.2 POSSIBLE BENEFITS OF BIG DATA FOR STATE AND LOCAL GOVERNMENTS:
Big Data technologies allow groups to play out scenarios under controlled circumstances, customize
what-if planning to different organizations, support data-backed decision-making, and identify
correlations and trends in underlying data and more. By laying the foundation for effective use of
Big Data, the state and local government agencies can:
• Make better decisions more quickly
• Improve mission outcomes
• Identify and reduce inefficiencies
• Eliminate waste, fraud and abuse
• Improve productivity of their resources
• Boost ROI, cut total cost of ownership (TCO)
• Enhance transparency and service delivery
• Reduce security threats (both information and physical) and crime
4.2.3 SOME MORE AREAS OF BDA APPLICATIONS IN GOVERNMENT:
• Public Services data
• Social Services Data
• Economic data
• Public Safety
• Public Health Care Issues
• Education & Training
• Civic Infrastructure
• Employment Opportunities
• Sports & Recreation
• Various Taxes/Revenue Related Data
• Transportation
• Public Distribution System
• Census data
• Crime Prevention and Prediction
• Tourism Data
• Environmental Data
• Locational/Geographic Data pertaining to social, economic and other aspects of
  citizens/business/government/employees
4.3. BIG DATA ANALYTICS TRENDS
4.3.1 TRANSPARENCY MARKET RESEARCH REPORT
“Big Data Market - Global Scenario, Trends, Industry Analysis, Size, Share and Forecast, 2012 –
2018”, the market intelligence report by Transparency Market Research, sheds light on
the various market elements, the factors that drive and hinder growth, and the booming regional
markets of the global big data market. The report estimates that the global big data market,
estimated at $6.3 billion in 2012, will reach $48.3 billion by 2018, growing year on year at a
CAGR of 40.5% over the forecast period of the report, i.e. between 2012 and 2018. In terms of
revenue, the current leader of this market is North America, which, according to the report,
will maintain its leading rank and hold a share of about 54.5% of the global big data market
during the forecast period. It could be followed by Europe.
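The growth figures quoted above can be cross-checked with the standard compound annual growth rate formula. The short Python sketch below is illustrative only: the function names are hypothetical, and it assumes six compounding years between 2012 and 2018, which the report does not state explicitly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.40 means 40%)."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value reached after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


# Figures quoted from the report (in $ billion); six periods is an assumption.
implied_rate = cagr(6.3, 48.3, 6)
projected_2018 = project(6.3, 0.405, 6)

print(f"Implied CAGR 2012-2018: {implied_rate:.1%}")
print(f"2018 value at a 40.5% CAGR: ${projected_2018:.1f} billion")
```

Under that assumption, the implied rate and the projected 2018 value both land close to the quoted 40.5% and $48.3 billion, so the three figures are mutually consistent.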
4.3.2 BIG DATA PREDICTIONS FOR 2015
Big Data goes mainstream: 2015 will see Big Data management become more mainstream. In many
ways, we are still in the infancy of Big Data, but the consistent growth is becoming unstoppable.
Everything goes up in the cloud: One of the problems encountered by businesses trying to manage
Big Data was the complex technology involved. Cloud solutions are already starting to offer a way
forward, and 2015 will likely see more steps in this direction.
People-based marketing drives digital marketing: To date, Big Data-driven marketing has been
fueled by cookie data. Cookies, an invention from when the desktop drove the Web, are no longer
the most important data source. With Facebook, Google, and other major players, people-based
marketing will command a premium in digital marketing in 2015 and soon become the standard.
Big Data is called just 'data': The terminology may well change as the technology becomes standard
operating procedure. If managing and analyzing vast quantities of data becomes the industry
norm, the word "big" could become unnecessary and tautological.
The time between collection and results will be shorter: Collecting data from consumers has value
only if it translates to improved business outcomes and 2015 should see a rise in more rapid ROI
results.
Pervasive personalization emerges: The Internet of Things will lead to a tsunami. As The Internet of
Things begins to get traction—everything from Fitbits to iWatches to Nests—with sensors becoming
ubiquitous, personalized marketing communication will be everywhere.
In-Memory Databases: In-memory databases allow companies to access, analyze and act on data
much more quickly than regular databases. This means that decisions can either be made more
quickly, because data is analyzed faster, or be better informed, because more data can be
analyzed in the same amount of time.
Non-Data Scientists: We are likely to see more automated platforms that allow employees who
have less skill with data to collect, analyze and make decisions based on it. These could range
from simple-to-use interfaces over a more complex backend to simpler tasks that still create
business results.
More Sensor Driven Data: The internet of things is evolving and more companies are using it, but it
may well hit its tipping point in 2015. This would be sensor-to-sensor data being collected, collated
and analyzed through purely sensor based collection.
Deeper Customer Insight: Although transactional data is still more plentiful than sensor
data, 2015 may be the year in which it is truly examined in multi-dimensional ways to create
even deeper customer insight.
HR Analytics: Once thought of as the definition of making your employees ‘just a number’, HR
Analytics are being shown to have significant benefits to both the company and its employees.
No Ownership In Just One Department: Data will become a commodity that is not just kept in one
department alone and used purely by senior company leaders.
The Internet of Things: There are many implications that come with IoT-enabled devices generating
massive streams of data. For instance, imagine equipping a whole workforce with IoT-enabled
devices. That could generate terabytes of data every day. However, with existing data
warehousing technologies and big data tools, data originating from IoT technology has the potential
to create value in many industries.
Big Data Security: Given the magnitude of last year's hacks, tightening cyber security will be a top
priority for companies in 2015 and onwards. At the same time, big data insights can be used to help
increase security, and big data analytics have the potential to complement existing security methods.
Faster Growth in the Big Data Market: According to Gartner, 85% of Fortune 500 companies aren’t
yet prepared to take advantage of Big Data, so as the big data tools become more widespread (and
cheaper), companies will scramble to become more data-driven in order to stay competitive.
4.3.3 INTERESTING TRENDS TO WATCH
Facial recognition and geospatial monitoring: Data from inexpensive cameras and cell phones is now
widely available to train machine learning systems. Expect to see plenty of innovation in this field.
Citizen backlash: Between government monitoring, data breaches, and well-intentioned commercial
efforts that cross the “creepy” line, people are starting to realize just how much can be learned
about them from the data they unintentionally produce. It may not be long before we see public
demands for enforceable accountability on those who collect or disseminate personal data.
Analytics driving the physical world: Technology that controls physical activities (think of the Google
self-driving car, or even the Nest thermostat) has received a significant amount of media attention.
Many consumers seem eager for these analytics-enabled capabilities today. In the rush to serve
consumer appetites, it will be important for businesses to thoroughly plan for the potential
consequences— good and bad—of these capabilities.
4.4. SOCIAL MEDIA AND BIG DATA ANALYTICS
4.4.1 SOCIAL MEDIA
Businesses can tap into Big Data and use the information to improve planning, deliver targeted
campaigns, take full advantage of the omni-channel approach, and optimize social media
interactions in real time.
Big Data means that networks can know their users better than ever, and it allows those with
knowledge of a user base to target users better by looking at their interests, location and search
history. Facebook can help a brand up to a point, but the completion of that journey must be
managed by the organizations themselves: data and insights can track the right person down, but
they can't complete the sale.
There are an increasing number of tools available to help businesses more accurately track the data
from social media. As Big Data grows, so does the availability and functionality of the technology
available to deal with it.
Some solutions can undertake a quantitative analysis of social media that includes sentiment, some
allow for tracking of social media channels and others are designed to track the effectiveness of
campaigns against specific targets. Search terms, key words and brand names can be used to
identify the content that references them and these can be further analyzed using analytics tools
and software.
Businesses should use social listening tools to capture and analyze huge amounts of data. The
keyword strategy is the most important element of finding the content that matters to the business.
Smart search terms in a listening tool can help surface relevant data about your products and your
competitors.
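The keyword strategy described above can be illustrated with a small Python sketch that filters posts by search terms and counts brand mentions. It is a minimal illustration, not a description of any particular listening tool: the brand names, sample posts and function names are all hypothetical.

```python
import re
from collections import Counter


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern for all search terms."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(escaped) + r")\b", re.IGNORECASE)


def count_mentions(posts, terms):
    """Return the posts that mention any term, plus a per-term count of posts."""
    pattern = build_pattern(terms)
    matched, counts = [], Counter()
    for post in posts:
        # A set ensures each term is counted at most once per post.
        hits = {m.group(1).lower() for m in pattern.finditer(post)}
        if hits:
            matched.append(post)
            counts.update(hits)
    return matched, counts


# Hypothetical sample data: one brand and one competitor.
posts = [
    "Loving my new AcmePhone, the battery lasts for days",
    "Is the acmephone camera better than the RivalPhone one?",
    "Traffic is terrible today",
]
matched, counts = count_mentions(posts, ["AcmePhone", "RivalPhone"])
for term, n in counts.most_common():
    print(term, n)
```

The matched posts could then be passed to a sentiment or campaign-tracking step; which terms to search for remains the decision that matters most.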
4.4.2 REASONS TO EXPLORE BIG DATA WITH SOCIAL MEDIA ANALYTICS
Reason #1: Social Media Analytics and Volume. Many factors contribute to the growing volume of
social media data available to explore. Unstructured data is streaming in, and a growing amount of
sensor and machine-to-machine data is being collected. Proper use of Social Media Analytics can help
extract value from the relevant data.
Reason #2: Social Media Analytics and Velocity. Data in social media streams in at exceptional
speed and must be dealt with in a timely manner. This is one of the great challenges for many
organizations, which makes it worth exploring in Social Media analytics.
Reason #3: Social Media Analytics and Variety. Data in social media comes in all types of formats:
structured numeric data in traditional databases, information generated from line-of-business
applications, unstructured text documents, email, video, audio, stock ticker data and financial
transactions.
Reason #4: Social Media Analytics and Variability. Social media data flows can be highly
unpredictable, with periodic peaks. Data loads driven by what is trending in social media, mixed
with unstructured data, are even more challenging to manage, yet interesting to explore.
Reason #5: Social Media Analytics and Complexity. Data in social media comes from numerous
sources. It is a great challenge to link, match and connect the data and to correlate
relationships, hierarchies and multiple data linkages. Data can become this complex, and if it is
not managed properly it can spiral out of control.
It is therefore essential to learn more about exploring Big Data with social media analytics.
4.4.3 SOCIAL MEDIA ANALYTICS IN INDIA & TOOLS:
Social media analytics in India is still at a very nascent stage. Both consumers and providers have
so far been unsure how best to use the gold mine of social chatter online.
Service providers have so far preferred the product route, mostly centered on social media
monitoring and real-time evaluation. For enterprises and end consumers, the transition from
speaking on social media to listening is only now approaching.
Here is a list of important social analytics companies in India that are making the most difference
in the whole ecosystem. The list is not in any particular order and would be kept updated as the
field evolves and new startups foray into this area. Important Social Media Analytics Companies in India:
Simplify360: Simplify360 is a leading Social Business Intelligence company. It facilitates
Social Customer Service, Online Reputation Management, Real Time Market Intelligence and Social
Media Performance Analytics, and is mainly used by enterprises (CMO and CIO offices), BPO
companies and digital media agencies.
Germin8: Germin8 is a Big Data analytics company headquartered in Mumbai. Germin8 is focused
on building products for analysing social media data and textual data available within organizations
to help them make better decisions based on insights drawn from that data.
Explic8™ is Germin8's stakeholder insights and engagement platform, which collects and analyses
conversations in real time from public and private sources and converts them into
industry-specific actionable insights and leads.
The stakeholder conversations are taken from both public sources, like social media and news, and
private sources, like emails and chats. They are analysed through the tool and presented in the
form of live interactive dashboards, which generate actionable insights for various departments
within an organization, such as Marketing, Customer Care, Corporate Communications and Sales. The
product was launched in 2012 and is currently being used by over 100 brands, directly or through
partner agencies.
BlueOcean: Blueocean Market Intelligence offers a comprehensive and end-to-end social
intelligence solution that effectively addresses business challenges and helps organizations gain a
much-needed advantage. It enables organizations to monitor, track and measure social media
effectiveness on various channels, and monitor the ROI of social media initiatives. Their digital
scientist team has been able to move away from traditional KPI measurements on social, and deploy
innovative and enhanced measurements to effectively measure true ROI and better understand the
social landscape. They deliver through the integration of traditional research techniques,
measurements and contextual mapping to provide comprehensive insights to solve intriguing
business problems.
Frrole: Frrole is a social intelligence startup with an ability to mine precise content and deep insights
from Twitter data. Its media and brand focused offering built on top of this intelligence allows
customers to identify in real-time what people, influencers and cities are talking about and integrate
that content and derived insights directly into their products.
While most Social Analytics/Intelligence products provide results based on statistics and the first
level of NLP, Frrole goes two levels deeper building semantic context for each topic and tying it up
with information available in the general and historical data sets.
Unmetric: For Fortune 500 companies, agencies and other large global brands that seek to more
meaningfully engage with their target audiences, Unmetric provides an online platform that enables
them to understand, uncover and unlock insights into how well they and their competitors’ content,
campaigns, and top-line metrics perform in social media. Unmetric combines the power of human
cognition and technology to track and analyze the online behavior of 18,000 brands segmented
across 30 sectors for all major social networks. Unlike listening or publishing services, Unmetric is a
seeing platform, providing global brands with data to analyze, benchmark and enhance their social
media efforts.
Infinite Analytics: Infinite Analytics is a big data and social data analytics company.
Its flagship product, SocialGenomix, uses a consumer's social data along with NLP, machine
learning, semantic technologies and predictive analytics to predict consumer behavior, personalize
user experience and provide actionable insights to e-retailers.
ThoughtBuzz: ThoughtBuzz is a social media intelligence company. Its web-based tool helps
companies monitor and track online conversations. ThoughtBuzz offers a full-feature analytics
service with unlimited access to billions of social media conversations, as well as features such
as automated sentiment analysis, key themes, geo-demographics and topic intentions. It is suited
to in-depth research, historical analysis and the preparation of value-added reports. It goes
beyond what other companies offer today and uses real-time information.
Konnect Social: Konnect Social is a search engine for forums, blogs, news and social media.
The company says that understanding the unique needs and requirements of its users is paramount,
and that it makes every effort to listen to its customers and stay on top of industry trends. It
describes its experts as experienced and versatile enough to provide best-in-class solutions for
any kind of business, keep those solutions running at optimum performance levels, and extend their
functionality, so that its technology helps customers increase the effectiveness of their current
IT infrastructure.
Abzooba: Abzooba’s social media monitoring, analyses, and analytics platform uses sophisticated
technologies such as Natural language processing, Domain specific ontologies and Machine learning
based classification over Big Data to provide organizations with actionable intelligence and insights
in real-time.
4.4.4 RECENT EXAMPLES OF USE OF SOCIAL MEDIA ANALYTICS IN INDIA:
The Prime Minister's Office is using Big Data techniques to process ideas thrown up by citizens on its
crowd sourcing platform mygov.in, place them in the context of the popular mood as reflected in trends
on social media, and generate actionable reports for ministries and departments to consider and
implement.
Elections in India are a classic Big Data problem, and the 2014 general elections were the biggest of
them all. While technology may be able to process this enormous volume of data, how can all this
information be consumed and understood by a billion people, in real time, as it happens?
The Indian General Elections also have another dimension that is often overlooked. Consider the
numbers: 300 parties, 8000 candidates, 800 million voters, and 1 million booths served and secured
by ~20 million officials. This heady mix is further enriched by a variety of structured and
unstructured information: candidate histories, crime records, declared assets and audacious
election manifestos. On top of this comes the frenetic activity on the day of results.
Live streaming of results: ~21000 votes had to be counted per second, from all corners of the country,
spanning an area of ~1 million square miles. With plans in place and trial runs completed, the
visualization dashboard went live on the morning of 16th May, the counting day. The question was
whether Gramener's technology would stand the ultimate performance test on this D-Day.
4.5. OPEN SOURCE SOFTWARE FOR BIG DATA ANALYTICS
4.5.1 OPEN SOURCE SOFTWARE
Consider the role of the modern data scientist. Unlike a pure statistician, a data scientist is also
expected to write code and understand business. Data science is a multi-disciplinary practice
requiring a broad range of knowledge and insight. It’s not unusual for a data scientist to explore a
fresh set of data in the morning, create a model before lunch, run a series of analytics in the
afternoon and brief a team of digital marketers before heading home at night.
In addition to possessing a wide range of practical knowledge, a data scientist must also be agile and
flexible. Today’s swiftly changing markets require lightning fast reflexes – companies must be
capable of assessing new data and responding in the space of a heartbeat to unexpected shifts in
commerce, across all industry verticals and economic sectors.
The speed of modern business plays to the strengths of data science and open source programming.
In the past, business moved relatively slowly and large-scale market trends were fairly predictable.
As a result, most companies were quite comfortable relying on proprietary (closed source) software
to analyze data. The downside of proprietary software, however, is that it cannot be quickly
modified or updated to handle unexpected circumstances or disruptions of existing business
models. Until recently, it was common practice for traditional vendors to release updated versions
of critical proprietary software quarterly or annually.
Open source software can be modified or rewritten in days or hours, making it an ideal choice for
real-time analytics.
Moreover, the open source movement is democratizing data science. In the past, you needed
special training on a proprietary system and years of experience to become a valuable member of a
business or research team. Thanks to a wider choice of open source tools, more people can begin
contributing valuable insight and analysis from the start.
It's hard to downplay the influence of open source software on the spectacular rise of data science.
Open source isn't just an interesting aspect of the data science revolution; it's absolutely critical.
However, the following key points must be considered when evaluating open source software:
• Open source software is not free.
• Be a bit wary of experts, even well-intentioned ones.
• Think of open source software as a platform and not a product.
• There are no guarantees.
• Is your organization a good fit for open source software?
4.5.2 SOME OF THE BEST OPEN SOURCE TOOLS FOR BIG DATA ANALYTICS:
o Apache Sqoop: Sqoop is a command-line interface application for transferring data
  between relational databases and Hadoop. It supports incremental loads of a single
  table or a free form SQL query as well as saved jobs which can be run multiple times to
  import updates made to a database since the last import.
o Apache Giraph: Apache Giraph is an Apache project to perform graph processing on big
  data. Giraph utilises Apache Hadoop's MapReduce implementation to process graphs.
o Apache Hama: Apache Hama is a distributed computing framework based on Bulk
  Synchronous Parallel computing techniques for massive scientific computations, e.g.
  matrix, graph and network algorithms.
o Cloudera Impala: Cloudera Impala is Cloudera's open source massively parallel
  processing (MPP) SQL query engine for data stored in a computer cluster running
  Apache Hadoop.
o Apache Drill: Apache Drill is an open-source software framework that supports
  data-intensive distributed applications for interactive analysis of large-scale datasets.
  Drill is the open source version of Google's Dremel system which is available as an
  infrastructure service called Google BigQuery.
o Neo4j: Neo4j is an open-source graph database, implemented in Java. The developers
  describe Neo4j as "embedded, disk-based, fully transactional Java persistence engine
  that stores data structured in graphs rather than in tables".
o Couchbase Server: Couchbase Server, originally known as Membase, is an open source,
  distributed (shared-nothing architecture) NoSQL document-oriented database that is
  optimised for interactive applications. These applications must service many concurrent
  users; creating, storing, retrieving, aggregating, manipulating and presenting data.
o SciDB: SciDB is an array database designed for multidimensional data management and
  analytics common to scientific, geospatial, financial, and industrial applications.
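The idea of an incremental load, mentioned above for Sqoop, can be sketched in a few lines of Python. This is not Sqoop's interface or implementation: it is a generic illustration using the standard sqlite3 module, and the table names (orders, orders_copy) and the function name are hypothetical. The sketch copies only rows whose id is greater than the last id imported, and returns the new checkpoint for the next run.

```python
import sqlite3


def incremental_import(source, target, last_id):
    """Copy rows with id > last_id from source to target; return the new checkpoint."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    target.executemany("INSERT INTO orders_copy (id, amount) VALUES (?, ?)", rows)
    target.commit()
    # If nothing new arrived, the checkpoint stays where it was.
    return rows[-1][0] if rows else last_id


# Hypothetical source database with two rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])
source.commit()

# Hypothetical target store standing in for the analytics side.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL)")

checkpoint = incremental_import(source, target, 0)  # first run copies both rows

source.execute("INSERT INTO orders (amount) VALUES (?)", (30.0,))
source.commit()
checkpoint = incremental_import(source, target, checkpoint)  # copies only the new row

copied = target.execute("SELECT COUNT(*) FROM orders_copy").fetchone()[0]
print(checkpoint, copied)
```

Storing the checkpoint between runs is what allows a saved job to be run repeatedly without re-importing rows it has already transferred.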
4.5.3 SOME OTHER OPEN SOURCE TOOLS FOR BIG DATA
Big Data Analysis Platforms and Tools
• Hadoop
• MapReduce
• GridGain
• HPCC
• Storm
Databases/Data Warehouses
• Cassandra
• HBase
• MongoDB
• Neo4j
• CouchDB
• OrientDB
• Terrastore
• FlockDB
• Hibari
• Riak
• Hypertable
• BigData
• Hive
• InfoBright Community Edition
• Infinispan
• Redis
Business Intelligence
• Talend
• Jaspersoft
• Palo BI Suite/Jedox
• Pentaho
• SpagoBI
• KNIME
• BIRT/Actuate
Data Mining
• RapidMiner/RapidAnalytics
• Mahout
• Orange
• Weka
• jHepWork
• KEEL
• SPMF
• Rattle
File Systems
• Gluster
• Hadoop Distributed File System
Programming Languages
• Pig/Pig Latin
• R
• ECL
Big Data Search
• Lucene
• Solr
Data Aggregation and Transfer
• Sqoop
• Flume
• Chukwa
Miscellaneous Big Data Tools
• Terracotta
• Avro
• Oozie
• Zookeeper
4.6. AMOUNT OF BIG DATA IN INDIA
4.6.1 AUTHENTIC DATA
The authenticated inventory of public domain data sets is available at data.gov.in. It has more
than 300 catalogs, and these catalogs contain more than 13500 data sets on a variety of subjects.
These sets belong to the Open Data category. Open data is data that can be freely used, reused and
redistributed by anyone, subject only, at most, to the requirement to attribute and share alike. The
most important characteristics of open data are:
• Availability and Access: the data must be available as a whole and at no more
  than a reasonable reproduction cost, preferably by downloading over the
  internet. The data must also be available in a convenient and modifiable form.
• Reuse and Redistribution: the data must be provided under terms that permit
  reuse and redistribution, including the intermixing with other datasets.
• Universal Participation: everyone must be able to use, reuse and redistribute;
  there should be no discrimination against fields of endeavour or against persons
  or groups. For example, 'non-commercial' restrictions that would prevent
  'commercial' use, or restrictions of use for certain purposes (e.g. only in
  education), are not allowed.
The inventory of the data sets available at data.gov.in is as follows:
Central Publication Metrics (SOURCE: data.gov.in)
MINISTRY/DEPARTMENT | RESOURCE (DATASET) | TOTAL CATALOGS
Ministry of Home Affairs | 3843 | 244
  Department of Home | 3592 | 109
    Registrar General and Census Commissioner, India | 3582 | 106
  Department of States | 234 | 118
    National Crime Records Bureau (NCRB) | 234 | 118
Ministry of Agriculture | 2350 | 368
  Department of Agriculture and Co-operation | 2272 | 312
    Directorate of Marketing and Inspection (DMI) | 2267 | 310
    Directorate of Economics and Statistics (DES) | 5 | 2
  Department of Animal Husbandry, Dairying and Fisheries | 51 | 31
  Department of Agricultural Research and Education (DARE) | 27 | 25
    Indian Council of Agricultural Research (ICAR) | 27 | 25
Planning Commission | 1560 | 776
  Unique Identification Authority of India (UIDAI) | 4 | 4
Ministry of Statistics and Programme Implementation | 1432 | 345
Ministry of Water Resources | 1060 | 557
Ministry of Health and Family Welfare | 868 | 141
  Department of Health and Family Welfare | 854 | 127
  Department of AIDS Control | 10 | 10
  Department of Ayurveda, Yoga and Naturopathy, Unani, Siddha and Homoeopathy (AYUSH) | 4 | 4
Ministry of Road Transport and Highways | 476 | 133
Ministry of Power | 475 | 5
  Central Electricity Authority | 475 | 5
Rajya Sabha | 248 | 154
Ministry of Human Resource Development | 217 | 64
  Department of Higher Education | 197 | 59
  Department of School Education and Literacy | 47 | 7
    National Council of Educational Research and Training (NCERT) | 47 | 7
Ministry of Finance | 175 | 124
  Department of Economic Affairs | 118 | 105
  Department of Financial Services | 43 | 9
  Department of Revenue | 9 | 5
    Financial Intelligence Unit - India | 9 | 5
  Department of Disinvestment | 5 | 5
Ministry of Commerce and Industry | 142 | 17
  Department of Commerce | 83 | 9
    Directorate General of Foreign Trade (DGFT) | 80 | 6
  Department of Industrial Policy and Promotion | 59 | 8
    Office of the Economic Adviser | 29 | 2
Ministry of Environment and Forests | 127 | 10
  Central Pollution Control Board | 127 | 10
Ministry of Science and Technology | 112 | 86
  Department of Science and Technology (DST) | 66 | 44
    National Science and Technology Management Information System (NSTMIS) | 66 | 41
    NSDI India GEO Portal, National Spatial Data Infrastructure (NSDI) | 0 | 3
  Department of Biotechnology, Government of India | 46 | 42
    National Institute of Biomedical Genomics (NIBMG), Kalyani | 6 | 6
    Regional Centre for Biotechnology (RCB), Gurgaon | 5 | 5
    Rajiv Gandhi Centre for Biotechnology (RGCB) | 4 | 4
    Institute of Bioresources and Sustainable Development (IBSD) | 4 | 4
    Bio Processing Unit (BPU), Mohali | 2 | 2
    National Institute of Animal Biotechnology (NIAB), Hyderabad | 2 | 2
    National Institute of Immunology (NII) | 1 | 1
    National Institute of Plant Genome Research (NIPGR), New Delhi | 1 | 1
    National Agri-Food Biotechnology Institute (NABI), Mohali | 1 | 1
    National Centre for Cell Sciences (NCCS) | 1 | 1
Lok Sabha Secretariat | 112 | 100
Ministry of Chemicals and Fertilizers | 104 | 14
  Department of Fertilizers | 70 | 8
  Department of Chemicals and Petrochemicals | 34 | 6
Ministry of Corporate Affairs | 39 | 26
Ministry of Micro, Small and Medium Enterprises | 33 | 14
Ministry of Mines | 33 | 26
  Indian Bureau of Mines | 21 | 13
  Geological Survey of India | 12 | 13
Ministry of Defence | 29 | 23
  Department of Defence Research and Development | 29 | 23
    Defence Research and Development Organisation (DRDO) | 29 | 23
Ministry of Petroleum and Natural Gas | 27 | 26
Ministry of New and Renewable Energy | 26 | 12
Ministry of Drinking Water and Sanitation (MDWS) | 24 | 16
Ministry of Communications and Information Technology | 18 | 18
  Department of Electronics and Information Technology (DeitY) | 9 | 8
  Department of Posts | 8 | 9
  Department of Telecommunications (DOT) | 1 | 1
Comptroller And Auditor General of India (CAG) | 16 | 16
Ministry of Earth Sciences | 8 | 10
  India Meteorological Department (IMD) | 8 | 8
  Indian National Centre for Ocean Information Services (INCOIS) | 0 | 2
Ministry of Information and Broadcasting | 7 | 7
Ministry of Tourism | 3 | 3
Ministry of Panchayati Raj | 3 | 3
Ministry of Civil Aviation | 3 | 3
Ministry of Rural Development | 2 | 2
  Department of Land Resources (DLR) | 2 | 2
Department of Atomic Energy | 1 | 1
Ministry of Development of North Eastern Region | 1 | 1
Department of Space | 0 | 15
  Indian Space Research Organization | 0 | 15
    National Remote Sensing Centre | 0 | 15
Total | 13574 | 3360
5. ENTREPRENEURSHIP DEVELOPMENT & START UPS
5.1. ENTREPRENEURSHIP DEVELOPMENT
5.1.1 WORLD WIDE - BIG DATA VENDOR REVENUE AND MARKET FORECAST
Wikibon forecasts that the Total Big Data market may exceed $47 billion by 2017. That translates to
a 31% compound annual growth rate over the five year period 2012-2017. The growth rate so far of
Big Data revenue may be due to a number of factors, including:
• An increased awareness of the benefits of Big Data as applied to industries beyond the Web,
  most notably financial services, pharmaceuticals, and retail;
• The maturation of Big Data software such as Hadoop, NoSQL data stores, in-memory
  analytic engines, and massively parallel processing analytic databases;
• Increasingly sophisticated professional services practices that assist enterprises in practically
  applying Big Data hardware and software to business use cases;
• Increased investment in Big Data infrastructure by massive Web properties (most notably
  Google, Facebook, and Amazon) and by government agencies for intelligence and
  counter-terrorism purposes.
The Big Data market is still within the confines of the early adopter phase and is poised for
significant growth. For the Big Data market to reach its full potential, enterprises and vendors must
overcome several obstacles. While a detailed discussion of these obstacles is outside the purview of
this report, they are worth noting. They include:
• The well-publicized lack of analytic specialists and Data Scientists.
• Lack of understanding among enterprises on how to organize Big Data staff to best identify business requirements for Big Data projects.
• Organizational resistance to adopting Big Data analytics-driven decision-making.
• Vendor marketing overly focused on “speeds-and-feeds,” product features and “Big Data-washing” rather than laying out a vision for Big Data in the enterprise.
• Development of Big Data platforms and tools by vendors that eschew open frameworks in favor of closed, locked-down solutions.
• Lack of best practices and related technologies for managing Big Data as a corporate asset.
• Dearth of Big Data application development tools and services.
Top 10 Vendors
As part of its market-sizing efforts, Wikibon tracked and/or modeled the 2012 Big Data revenue of
more than 60 vendors. This list includes both Big Data pure-plays – those vendors that derive close
to if not all their revenue from the sale of Big Data products and services – and vendors for whom
Big Data sales are just one of multiple revenue streams. A partial list including the top 10 vendors is
given in Table 5.1 below.
TABLE 5.1: 2012 WORLDWIDE BIG DATA REVENUE BY TOP 10 VENDOR ($US MILLIONS) - WIKIBON
S.N. | Vendor | Big Data Revenue | Total Revenue | Big Data Revenue as % of Total Revenue | % Big Data Hardware Revenue | % Big Data Software Revenue | % Big Data Services Revenue
1 | IBM | $1,252 | $103,930 | 1% | 19% | 31% | 50%
2 | HP | $664 | $119,895 | 1% | 34% | 29% | 38%
3 | Teradata | $435 | $2,665 | 16% | 31% | 28% | 41%
4 | Dell | $425 | $59,878 | 1% | 83% | 0% | 17%
5 | Oracle | $415 | $39,463 | 1% | 25% | 34% | 41%
6 | SAP | $368 | $21,707 | 2% | 0% | 67% | 33%
7 | EMC | $336 | $23,570 | 1% | 24% | 36% | 39%
8 | Cisco Systems | $214 | $47,983 | 0% | 58% | 0% | 42%
9 | PwC | $199 | $31,500 | 1% | 0% | 0% | 100%
10 | Microsoft | $196 | $71,474 | 0% | 0% | 67% | 33%
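The "Big Data Revenue as % of Total Revenue" column in Table 5.1 can be checked by dividing each vendor's Big Data revenue by its total revenue. The following is a minimal Python sketch of that arithmetic for three rows copied from the table; the function and variable names are illustrative and do not come from Wikibon.

```python
# Rows copied from Table 5.1 (US$ millions): vendor -> (Big Data revenue, total revenue).
# The names used here are illustrative; only the figures come from the table.
VENDORS = {
    "IBM": (1252, 103930),
    "Teradata": (435, 2665),
    "SAP": (368, 21707),
}


def big_data_share(big_data_revenue: float, total_revenue: float) -> float:
    """Return Big Data revenue as a percentage of total revenue."""
    return 100.0 * big_data_revenue / total_revenue


for vendor, (big_data, total) in VENDORS.items():
    print(f"{vendor}: {big_data_share(big_data, total):.1f}% of total revenue")
```

Rounded to whole percentages, these three values agree with the 1%, 16% and 2% listed in the table.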
Wikibon’s Big Data market forecast broken down by market component through 2017 in Billion US$
is given in table 5.2 below.
TABLE 5.2: BIG DATA MARKET FORECAST BROKEN DOWN BY MARKET COMPONENT THROUGH
2017 IN BILLION US$ - WIKIBON
YEAR WISE FORECAST IN BILLION US$
REVENUE MARKET COMPONENT | 2014 | 2015 | 2016 | 2017
Big Data XaaS | 1.78 | 2.52 | 2.97 | 3.31
Big Data Professional Services | 10.62 | 14.15 | 16.17 | 17.59
Big Data Application – Analytic and Transactional | 3.47 | 5.29 | 6.48 | 7.38
Big Data NoSQL | 0.50 | 0.79 | 0.98 | 1.12
Big Data SQL | 1.72 | 2.14 | 2.36 | 2.51
Big Data Infrastructure Software | 0.64 | 0.88 | 1.03 | 1.14
Big Data Networking | 0.56 | 0.75 | 0.86 | 0.93
Big Data Storage | 4.20 | 5.59 | 6.39 | 6.95
Big Data Compute | 4.89 | 6.26 | 7.01 | 7.53
TOTAL BIG DATA REVENUE | 28.38 | 38.37 | 44.25 | 48.46
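The compound annual growth rate implied by the totals in Table 5.2 can be worked out directly. The sketch below is illustrative only: it uses the standard CAGR formula, (end / start) ** (1 / years) - 1, on the 2014-2017 totals. That is a shorter window than the 2012-2017 period behind the 31% figure quoted earlier, so the two rates are not expected to match.

```python
# Totals copied from the "TOTAL BIG DATA REVENUE" row of Table 5.2 (US$ billion).
TOTAL_REVENUE = {2014: 28.38, 2015: 38.37, 2016: 44.25, 2017: 48.46}


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


first_year, last_year = min(TOTAL_REVENUE), max(TOTAL_REVENUE)
rate = cagr(TOTAL_REVENUE[first_year], TOTAL_REVENUE[last_year], last_year - first_year)
print(f"CAGR {first_year}-{last_year}: {rate:.1%}")

# Year-on-year growth, to show how the forecast growth rate changes over time.
years = sorted(TOTAL_REVENUE)
for previous, current in zip(years, years[1:]):
    growth = TOTAL_REVENUE[current] / TOTAL_REVENUE[previous] - 1.0
    print(f"{previous}->{current}: {growth:.1%}")
```

On these figures the 2014-2017 rate comes out between 19% and 20% a year, and year-on-year growth declines in each successive year of the forecast.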
5.2 DEVELOPMENT OF THE BA INDUSTRY IN INDIA
Analytics Industry- Key to Growth of India:
Imagine a shopper browsing the men’s shoe section at Pantaloons who is about to buy a pair and
then receives a message from Indiatimes: “The same shoe is being offered with a 25% discount, just
log in here”. A scanner reads the shoe data, the customer’s Pantaloons card is linked to his mobile,
and his mobile is linked to Indiatimes. Indiatimes and Pantaloons are doing joint marketing. It is a
win-win situation for everybody, and it is possible only with the help of analytics. Analytics is
therefore no longer a luxury for an organization but a hygiene factor. Consider the following:
• Size of the Indian analytics market – $375 million
• No. of companies operating in this segment in India – More than 500
• Expected Indian analytics market by 2015 – $1.15 billion, as per a Business Standard report.
The chart given in Figure 5.1 below gives further information about the various types of analytics
applications and the classification of the analytics industry into various business segments.
How Big Is Big Data In India?
• We are living in the age of information overload. A huge amount of data is constantly being generated around us. Automation is increasingly being adopted, which in turn generates still greater amounts of data. The challenge today for enterprises as well as small and medium businesses (SMBs) is manifold. Indian SMBs and enterprises are sitting on a gold mine of information. Making sense of these huge data sets has become imperative. In these circumstances, big data analytics has become one of the more talked about topics in India.
• Big data has tremendous potential in India. With social media usage on the rise and increased adoption of technology by sectors such as BFSI (banking, financial services, and insurance), retail, hospitality, etc., big data analytics is on the agenda of boardrooms across Indian enterprises. However, most Indian enterprises are still coming to terms with this concept. While everybody realizes the importance of, and the potential in, analyzing these data sets, very few have the capability of doing it. It is widely accepted that Indian enterprises base their decisions mostly on intuition and ‘gut-feel’ and have barely scratched the surface in terms of using data for decision-making.
• In India, many of the large enterprises have started using or are contemplating the use of big data analytics. SMBs are still some distance away from adopting this concept. Their challenges are more basic – effective data storage and management. However, there are many medium businesses that are already past the initial stages of IT adoption and are expected to take this up shortly.
FIGURE 5.1: ANALYTICS APPLICATIONS, AND CLASSIFICATION OF ANALYTICS INDUSTRY
(SOURCE: http://www.iitk.ac.in/ime/MBA_IITK/avantgarde/?p=1165 dated 27/08/13)
Development of BA Industry in India
Big Data, Open Data, analytics, data insights and visualization open up lucrative opportunities for
Indian companies, startups and incumbent IT/KPO players. A number of market research reports
throw light on this opportunity, and a wave of startups is emerging in this space. TechSparks has
summarized the market trends and insights into the BA industry in India; some indicators are as
follows:
A report by NASSCOM and CRISIL Global Research and Analytics predicts that the global Big Data
market will reach $25 billion by 2015, up from $5.3 billion in 2011; the Indian industry in Big Data
will reach $1 billion by 2015.
At the 2014 edition of its Big Data and Analytics Summit, NASSCOM released another report in
partnership with BlueOcean Market Intelligence, which predicts that the analytics market in India
could reach $2.3 billion by the end of 2017-18.
According to research by Avendus Capital, the data analytics market in India is expected to reach
$1.15 billion by 2015, and will account for a fifth of India’s knowledge process outsourcing (KPO)
market of $5.6 billion. The market leader, the US, is expected to have a shortage of 140,000 – 190,000
analytics professionals by 2018, which opens up a huge opportunity for product and service
companies in India.
The Internet of Things is another market opportunity for India in Big Data and analytics. According to
Machina Research data cited at a recent panel of TiE Bangalore, the global market for IoT in 2020
will be worth $373 billion in revenue in hardware and software, and India will account for $10-12
billion of this total revenue. Early stage startups like SenseGiz.com and mature startups such as
ConnectM are active in this space.
“Analytics holds the key importance for the commercial growth of India in the future to come,”
according to Nishu Navneet of IIT Kanpur, citing research that shows 29% of analytics companies are
in Bangalore, 25% in NCR, 8% in Pune, 8% in Hyderabad, 6% in Chennai and 6% in Mumbai.
Analytics India magazine divides the Indian analytics market into three kinds of players: service
providers (80% of the market), captives (back offices for analytics: 15% of the market) and domestic
market (Indian companies using analytics: 5% of the market).
Indian companies in a number of sectors are using analytics: banking (ICICI, HDFC, Axis, Yes Bank,
Kotak Mahindra), telecom (Bharti Airtel, Idea Cellular), automotive (Tata Motors, Mahindra &
Mahindra) and eCommerce (Flipkart, Snapdeal, Jabong). Information Week magazine has also
documented the use of analytics by a range of Indian players: HDFC, Shoppers Stop and Aircel.
Outside of the private sector, the BJP party used analytics and real time social media monitoring
during its recent election campaign.
Indian startups and service providers in the space of analytics services and data insights include a
range of players such as AbsolutData, DataWeave, Flutura, Formcept, Fractal Analytics, GenPact,
Germin8, LatentView, Mu Sigma, Nanobi and Veda Semantics.
5.3 ANALYTICS AS A SERVICE (AAAS)
Analytics as a Service (AaaS) is an extensible analytical platform provided using a cloud-based
delivery model, where various tools for data analytics are available and can be configured by the
user to efficiently process and analyze huge quantities of heterogeneous data. Customers will feed
their enterprise data into the platform, and get back concrete and more useful analytic insights.
These analytic insights are generated by Analytical Apps, which orchestrate concrete data analytic
workflows.
These workflows are built using an extensible collection of services that implement analytical
algorithms, many of them based on Machine Learning concepts. The data provided by the user can
be enhanced by external, ‘curated’ data sources. A diagram describing the concept is given in
Figure 5.2.
The AaaS platform is designed to be extensible, in order to handle various potential use cases. One
concrete case of this is the collection of Analytical Services, but it is not the only one. For example,
the system can support the integration of very different external data sources. To make AaaS
extensible and easily configured, the platform includes a series of tools to support the complete
lifecycle of its analytics capabilities.
FIGURE 5.2: CONCEPTUAL DIAGRAM OF AaaS
(Atos White paper on DAaaS)
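The AaaS concept described above (Analytical Apps that orchestrate workflows built from an extensible collection of analytical services, optionally enriched with curated external data) can be illustrated with a small Python sketch. All names here (SERVICES, register, run_app and the three sample services) are hypothetical and chosen for illustration; they are not the interface of any actual AaaS product, and the "curated" data and customer records are made up.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# Hypothetical names throughout: this is not the API of any real AaaS product.
Record = Dict[str, Any]
Service = Callable[[List[Record]], List[Record]]

# The extensible collection of analytical services, keyed by name.
SERVICES: Dict[str, Service] = {}


def register(name: str) -> Callable[[Service], Service]:
    """Decorator that adds a service to the collection under the given name."""
    def decorator(func: Service) -> Service:
        SERVICES[name] = func
        return func
    return decorator


@register("drop_missing")
def drop_missing(records: List[Record]) -> List[Record]:
    """Keep only records that carry a 'sales' value."""
    return [r for r in records if "sales" in r]


@register("enrich")
def enrich(records: List[Record]) -> List[Record]:
    """Join each record with an external, curated source (stubbed with made-up values)."""
    curated_population = {"north": 120, "south": 80}
    return [{**r, "population": curated_population.get(r["region"])} for r in records]


@register("flag_above_average")
def flag_above_average(records: List[Record]) -> List[Record]:
    """Mark records whose sales exceed the mean (a stand-in for an ML-based step)."""
    average = mean(r["sales"] for r in records)
    return [{**r, "above_average": r["sales"] > average} for r in records]


def run_app(workflow: List[str], records: List[Record]) -> List[Record]:
    """An 'Analytical App': run the named services in the given order."""
    for name in workflow:
        records = SERVICES[name](records)
    return records


# Made-up customer data fed into the platform.
customer_data = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 30.0},
    {"region": "south"},  # incomplete record, removed by the first step
]
insights = run_app(["drop_missing", "enrich", "flag_above_average"], customer_data)
print(insights)
```

Under this design, adding a new analytical capability means registering one more function, and a new Analytical App is just a different ordered list of service names.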
The Importance
A common scenario among organizations is the struggle to improve the data-to-insight
conversion rate. The hurdles are found at all levels of the organization, including IT, business and
leadership. Some of the frequently heard refrains are:
• Data is dirty
• Available data cannot be trusted
• We have no experience in analyzing unstructured data streaming from social media and other digital channels
• As a decision maker I have no access to data
• The expensive software bought cannot be maintained by the IT department
• Though the IT department would like to support business with new technologies, their budgets don’t allow them
The traditional processes for software delivery, procurement and system management do not
generate a robust business case that positively impacts both business and IT. The initial investment
(CapEx) is generally high for introducing new technologies from the analytics arena. Once the new
technology is in place, further direct and indirect costs will add to the IT budget due to the increased
complexity of the IT landscape (OpEx). This is the background against which organizations generally
struggle to achieve a competitive data-to-insight conversion rate.
An ideal AaaS case is when data is already in the cloud, or at least easy to upload into the cloud. The
most obvious data streams to link to AaaS are external ones such as social media or
machine-to-machine (M2M) data. A new AaaS channel can typically be established within hours, and the
Analytics Expert can support decision makers with data-driven insights. This is how Analytics-as-a-Service
drastically improves the data-to-decision conversion rate.
The Challenges of AaaS
Analytic solutions that need to support Big Data services present additional challenges. This is even
more the case if these services are intended to be delivered through a cloud environment.
• Information Lifecycle Management
• Data model diversity
• Analytic knowledge
• Data volume
• Real-time analytics
• Security
• Privacy
Benefits of AaaS
The main benefit of AaaS is to lower the barrier of entry to advanced analytical capabilities,
without demanding that the user commit large internal infrastructure and human resources to
the project. Table 5.3 provides a comparison between AaaS and a complex custom project run
internally by the customer:
TABLE 5.3: COMPARISON BETWEEN AaaS & INTERNAL BD PROJECT
AaaS | Typical Internal Big Data Project
• Data Scientists working for the organization explore the AppStore for an Analytical App that fits the problem. | • Data Scientists need additional resources to design and implement the solution.
• They rent the Analytical App for a specific time or quantity of data. | • Installs a complete Big Data infrastructure based on some complex technology like Hadoop.
• They configure the Analytical App to their needs including, for example, the usage of external data sources provided by the AaaS. | • Implements complex analytical processes in low-level languages, becoming in reality an expensive coder.
• Then the data is fed from the internal systems to the Analytical App. | • Integrates the new system with the enterprise systems with more development effort.
• The SMEs in the company validate the results and even enhance them with some customization. | • Examines the results and iterates until achieving success.
• Outcomes are available for all other uses. |
5.4 POSSIBLE MODELS FOR ENTREPRENEURSHIP BUILDING
Entrepreneurial development is the systematic and organized development of a person into an
entrepreneur. It refers to inculcating entrepreneurial skills in an ordinary person by providing the
needed knowledge, developing technical, financial, marketing and managerial skills, and building
an entrepreneurial attitude.
Despite all the hurdles to success, this is a great time to be an entrepreneur in India. With huge open
opportunities in Software as a Service (SaaS), mobile payments, gaming, entertainment,
marketplaces, and just easier and better access to information and products, the potential for
impact is immense. By leveraging mobile and internet technology, entrepreneurs in India have the
opportunity to transform the way Indians will lead their lives.
DST, in collaboration with the Technology Development Board (TDB), can become a catalyst in
facilitating the emergence of competent first-generation entrepreneurs and the transition of existing
entrepreneurs into growth-oriented BDA enterprises through entrepreneurship consulting,
education, training, research and institution building, and by promoting and encouraging
entrepreneurship in Big Data Analytics. TDB will be the funding agency, while DST will have the
responsibilities given below:
• Provide initial Capital/Seed Financing
• Enhance management bandwidth
• Accelerate the product development
• Provide help in building prototype, market validation and business plan
• Facilitate access to an ecosystem of US founders/Strategic Partners and Clients
• Engage with thought leaders in The Hive Big Data community worldwide
• Accelerate access to sources of future capital
• In house business and strategy team to help develop the intricacies of the entrepreneurs’ business
• Technology team to help with data science and petabyte scale systems
• Advisors from our extensive network of technology and business experts
• Engagement with thought leaders in the Big Data community
In addition to mentoring startups, DST may also host periodic talks and panel discussions to share
knowledge and bring together experts and visionaries from academia and the industry.
In the initial stage, DST should engage with entrepreneurs to help them refine their product concept.
DST may also take up responsibilities in the area of entrepreneurship training and education
with the following additional objectives:
• To promote and develop high-end entrepreneurship for BDA manpower as well as self-employment by utilizing S&T infrastructure and by using S&T methods.
• To facilitate and conduct various informational services relating to promotion of entrepreneurship in BDA.
• To network agencies of the support system, academic institutions and Research & Development (R&D) organizations in BDA to foster entrepreneurship and self-employment using BDA.
• To act as a policy advisory body with regard to entrepreneurship in BDA.
• To evolve standardized materials and processes for selection, training, support and sustenance of entrepreneurs, potential and existing.
• To serve as an apex national level resource institute for accelerating the process of entrepreneurship development in BDA ensuring its impact across the country and among all strata of the society.
• To provide vital information and support to trainers, promoters and entrepreneurs by organizing research and documentation activities relevant to entrepreneurship development in BDA.
• To train trainers, promoters and consultants in various areas of entrepreneurship development in BDA.
• To offer consultancy nationally/internationally for promotion of entrepreneurship and small business development in BDA.
• To provide national/international forums for interaction and exchange of experiences helpful for policy formulation and modification in the BDA domain at various levels.
• To share international experience and expertise in entrepreneurship development in BDA.
• To share experience and expertise in entrepreneurship development in BDA across national frontiers.
6. DATA SCIENCE POLICY PERSPECTIVES
6.1 CONCERNS FOR BIG DATA AND PUBLIC POLICY
Advances in digital technology are making it possible to collect, store and process ever-expanding
amounts of data. This explosion of data holds tremendous potential to boost innovation,
productivity, efficiency and, ultimately, economic growth and social value. The use of ‘big’ data,
however, raises many questions:
• What do individuals think about the data being gathered about their everyday activities (for example, through social media and the internet, sensors, radio-frequency identification chips, geospatial technologies, loyalty cards or transport cards)?
• Who should own and control such data?
• What is the right trade-off between privacy, intellectual property rights and security and allowing society to benefit from data-driven innovations and better ways of living?
• Is the right to be forgotten practicable, useful and meaningful, and does it need to be complemented with a right to be remembered?
• What sorts of curation mechanisms are most effective in ensuring data quality and interoperability across organizational boundaries, particularly in the case of open data sets?
• How can we assess the impact of big data on existing communications, legal and regulatory systems?
• How can society benefit most from big data?
6.2 BIG DATA: MANAGING THE LEGAL AND REGULATORY RISKS
When adopting a new and potentially disruptive technology such as Big Data, all the risks need to be
identified and managed. That includes securing asset values and addressing the other legal and
regulatory risks. Among other things, a failure to address legal and regulatory risk in relation to Big
Data could result in a serious regulatory breach, attracting fines, reputational damage and loss of
business. This section considers how to identify and manage such risks.
Controlling use of big data
Data privacy law is one area of law that any business will have to take very seriously in
relation to the use of Big Data. While these laws vary from country to country, in Europe there are
certain commonalities. Big Data typically involves the reuse of data originally collected for another
purpose. Among other things, such reuse would need to be 'not incompatible' with the original
purpose for which the data were collected for reuse to be permissible. The Article 29 Working Party
(consisting of the data privacy regulators across the EU) has set out a four-stage test to determine
when this requirement is met. The four-stage test includes a requirement that safeguards are put in
place to ensure fair processing and to prevent undue impact on the relevant individual. This could
include 'functional separation' (that is, anonymising / pseudonymising or aggregating the results). In
many cases, the only way to overcome data privacy concerns in relation to Big Data will be by way of
adequate consent notifications. Obtaining effective consent in relation to Big Data analytics is not
straightforward.
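The 'functional separation' safeguards mentioned above (pseudonymising identifiers and aggregating results) can be illustrated with a short Python sketch. It is a technical illustration under stated assumptions, not a statement of what any regulator accepts: the keyed hash (HMAC-SHA256), the placeholder key, the minimum group size of 2 and the sample records are all choices made for this example.

```python
import hashlib
import hmac
from collections import Counter

# Assumption: the secret key is stored separately from the analytics data set.
# The value below is a placeholder, not a real key.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def aggregate_by_city(records, minimum_group_size=2):
    """Report counts per city, suppressing groups smaller than the threshold."""
    counts = Counter(record["city"] for record in records)
    return {city: n for city, n in counts.items() if n >= minimum_group_size}


# Made-up sample records.
raw = [
    {"customer_id": "C-1001", "city": "Pune"},
    {"customer_id": "C-1002", "city": "Pune"},
    {"customer_id": "C-1003", "city": "Chennai"},
]

# Analysts work on the pseudonymised copy, never on the raw identifiers.
pseudonymised = [
    {"customer_ref": pseudonymise(r["customer_id"]), "city": r["city"]} for r in raw
]
print(pseudonymised[0]["customer_ref"][:12], "...")
print(aggregate_by_city(pseudonymised))
```

The same customer always maps to the same reference, so records can still be linked for analysis, while reversing the mapping requires the key. Small groups are dropped from the aggregate because a count of one can point to an individual.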
How do you protect rights in big data?
Across the EU, the intellectual property right that could provide the most protection is the database
protection regime. It has limitations, as do copyright and patents in relation to Big Data. The law of
confidentiality may provide some protection, depending on the particular information and its
source. As the law in this area may provide only limited protection, it may sometimes be necessary
to return to the basics: ensure that any disclosure is coupled with adequate contractual
confidentiality provisions limiting further use and disclosure. Conversely it will be essential to check
that the compilation of a Big Data data set has not infringed a third party’s intellectual property or
contractual rights.
What are the other potential liabilities?
Among the potential liabilities that need to be addressed is the question of data reliability. Data
sourced from publicly available sources, from another business, or collated by the business itself,
may contain errors.
Data sets may have their origin in several different sources. So-called 'open data' is typically licensed
on terms similar to those applicable to open source software. Such terms usually give little or no
comfort in relation to the reliability (and non-infringing nature) of the licensed material.
Public providers of such data sets (such as local authorities or central government) are seldom
willing to accept liability for losses arising from reliance on the data (particularly when the data are
provided free or for a nominal charge).
What technical and organizational measures should be considered?
Interception, appropriation and corruption of data remain an issue for businesses possessing Big
Data sets, just as with any other data. The data privacy laws in many countries require that the data
controller implements appropriate technical and organizational measures to safeguard the security
of personal data. Such laws typically require the data controller to flow down these requirements in
contractual relations with their suppliers. These requirements will apply to Big Data sets held by
businesses that contain personal data.
Businesses will also need to take into account the new EU Data Protection Regulation, which will
require technical and organizational measures to be provided for by design and by default.
Purely technical solutions, implemented in the absence of a more comprehensive approach to
information governance, may not be adequate.
The need for expertise:
A recent survey by Accenture (Big Success with Big Data Survey, April 2014) found that 41% of
businesses reported a lack of appropriately skilled resources to implement a Big Data project. Such
expertise will need to include a legal and regulatory compliance review. It is simply a case of taking
steps to address these issues early on.
6.3 DATA AND PRIVACY: A TECHNOLOGICAL PERSPECTIVE, A REPORT BY THE PRESIDENT’S COUNCIL OF
ADVISORS ON SCIENCE AND TECHNOLOGY (PCAST, USA)
The body has come out with five major recommendations, as given below.
Recommendation 1: Policy attention should focus more on the actual uses of big data and less on
its collection and analysis.
Recommendation 2: Policies and regulation, at all levels of government, should not embed
particular technological solutions, but rather should be stated in terms of intended outcomes. To
avoid falling behind the technology, it is essential that policy concerning privacy protection should
address the purpose (the “what”) rather than prescribing the mechanism (the “how”).
Recommendation 3: With coordination and encouragement from Office of Science and Technology
Policy (OSTP), the Networking and Information Technology Research and Development Program
(NITRD) agencies should strengthen U.S. research in privacy-related technologies and in the
relevant areas of social science that inform the successful application of those technologies.
Recommendation 4: OSTP, together with the appropriate educational institutions and professional
societies, should encourage increased education and training opportunities concerning privacy
protection, including career paths for professionals. Programs that provide education leading to
privacy expertise (akin to what is being done for security expertise) are essential and need
encouragement.
Recommendation 5: The United States should take the lead both in the international arena and at
home by adopting policies that stimulate the use of practical privacy-protecting technologies that
exist today.
6.4 BEYOND NDSAP: REGULATORY MODE
National Data Sharing and Accessibility Policy (NDSAP):
NDSAP aims to provide an enabling provision and platform for proactive and open access to the data
generated by various Government of India entities. The objective of this policy is to facilitate access
to Government of India owned shareable data (along with its usage information) in machine
readable form through a wide area network all over the country in a periodically updatable manner,
within the framework of various related policies, acts and rules of Government of India, thereby
permitting a wider accessibility and usage by public. National Data Sharing and Accessibility Policy
(NDSAP) is designed so as to apply to all sharable non-sensitive data available either in digital or
analog forms but generated using public funds by various Ministries/Departments /Subordinate
offices/Organizations/ Agencies of Government of India as well as States.
A need is felt to elevate this policy into an Act so that the aim to provide an enabling provision
and platform for proactive and open access to the data generated by various Government of India
entities can be fully achieved.
Open Government Data (OGD) Platform India:
The OGD Platform (http://data.gov.in) has been set up to provide access to datasets published by
different government entities in open format. It also provides search, discovery and on-the-fly data
conversion (to widely used open formats) mechanisms for instant access to desired datasets. The OGD
Platform has a backend data management system which is used by government departments to
publish their datasets through a predefined workflow. Departments shall also have a dashboard to see
the current status of their datasets, usage analytics, and feedback and queries from citizens at one
point. OGD Platform India is still at a nascent stage and is going through a number of changes.
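As an illustration of what on-the-fly conversion to a widely used open format involves, the Python sketch below converts CSV text into JSON using only the standard library. It is not the OGD Platform's own implementation, which is not described in this report; the function name and the column names in the sample are assumptions made for the example.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Small sample; the column names are assumed for this example.
sample_csv = (
    "ministry,resources,catalogs\n"
    "Ministry of Tourism,3,3\n"
    "Ministry of Civil Aviation,3,3\n"
)
converted = csv_to_json(sample_csv)
print(converted)
```

The CSV reader returns every value as text, so a production converter would also need per-column type information (for example from the dataset's metadata) to emit numbers as numbers.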
One of the major challenges faced is the formation of an NDSAP Cell in every
Ministry/Department. As per policy guidelines, in order to implement NDSAP, each Department is
required to establish an NDSAP Cell, which shall be headed by the Data Controller, who could be
assisted by a number of Data Contributors and a few domain specialists. These professionals would
monitor and manage the open data initiative in their respective Ministry/Department and extend
technical support to ensure quality as well as correctness of the data.
6.5 EXISTING STANDARDS – CODATA & WAY FOR INDIA
CODATA Capacity Building and the Data Sharing Principles in Developing Countries
CODATA is concerned with all types of data resulting from experimental measurements,
observations and calculations in every field of science and technology, including the physical
sciences, biology, geology, astronomy, engineering, environmental science, ecology and others.
Particular emphasis is given to data management problems common to different disciplines and to
data used outside the field in which they were generated.
CODATA has come out with guidelines on data sharing, particularly for developing countries.
To start with, it will be a good idea to accept these guidelines and then work towards developing
specific guidelines suited to our own requirements. Broadly, the CODATA guidelines on sharing data,
in the form of 10 Principles, are as given below.
•
1. Data should be open and unrestricted. Data generated with public support, including private
foundations, should be openly accessible and subject to unrestricted (re)use, absent specific,
justified reasons to the contrary (see Principle-10). Openness is especially beneficial for
94
development purposes and research uses, but can benefit all society equally and have a
multiplier effect on the economy.
•
2. Data should be free to the user. In most cases, any cost for access is an insurmountable
barrier to users in the developing world. Therefore, data should be free online to the user. In
some special cases, access to data may be no more than the marginal cost of filling a user
request. At the same time, it is recognized that adequate preparation and open availability of
data require sufficient financial support (see Principle-7).
•
3. Data should be informative and assessed for quality. Data should be of known quality and
integrity, and should be organized and described (with metadata) in datasets sufficient to allow
them to be understood and effectively (re)used by others. Baseline technical and management
standards need to be established, especially in the developing world where state-of-the art
practices are not yet as prevalent. Adequate preparation and the use of nonproprietary
software are especially important for any datasets expected to have long-term value.
•
4 Data sharing should be timely. Once datasets are sufficiently informative and quality
controlled, they should be released as quickly as possible. This can be done in steps, starting
with the metadata to avoid duplication. In some cases, such as public emergencies and
disasters, open release of relevant data should be an immediate priority. In other cases, such as
research, data should be openly available no later than upon the publication or patenting of
results. Users in developing countries have the most to gain from such policies.
•
5 Data should be easy to find and access. Upon the public release of any dataset, the provider
should promote ease of access by the broadest user base. Diverse means of publication should
be considered in recognition of potential connectivity and other technological challenges.
•
6 Data should be interoperable, when necessary. If data from a dataset are likely to be
combined with data from one or more other datasets (e.g., in geospatially referenced research),
special attention should be given to making such data technically, semantically, and legally
interoperable.
•
7 Data should be sustainable. The life-cycles of any datasets that are expected to be reused by
others should be planned at the outset with support sufficient to successfully implement the
first six Principles. The lower availability of funding in developing countries, especially for longterm preservation, makes this a key priority so that valuable datasets remain intelligible and are
not lost or in need of rescue. Cost recovery for data archiving and availability should not be
borne by the users, consistent with Principle-2, but by other entities in the data lifecycle.
•
8 Data contributors should be given credit. A significant incentive for the open disclosure and
“publication” of a dataset is the ability to properly cite and attribute the contributor(s), whether
internal or external to an organization. Any subsequent user of the data has at least an ethical
obligation—and possibly a legal one—to cite and attribute the source of the data whenever they
are reused, and not to misuse the data in any way. Such practices can also improve the integrity
of the data sets made available by the contributors, in support of Principle-3. In particular, data
contributors in the developing world require recognition and rewards for such disclosure, and
this should become common practice. A persistent digital identifier, attached to the dataset
online, is the best way to promote this goal.
• 9 Data access should be equitable. Open access and use of data in developing countries,
especially for public purposes, should be supported by the governments and institutions in the
more economically developed nations. Capacity building of essential experts and infrastructure
in developing countries should be a priority of international organizations. Similarly, experts in
developing countries should join and actively participate in the relevant regional and
international organizations.
• 10 Data may be restricted for a limited time, if adequately justified. Restrictions may be placed
on access to and uses of publicly funded data and datasets for specified periods of time.
Justified restrictions may include specific protections of national security, personal privacy,
intellectual property, confidentiality, and other values, such as indigenous peoples’ rights or
location of endangered species. Nevertheless, the default rule should be one of openness,
consistent with Principle 1, and any restrictions should be minimized to the extent possible.
Way for India
While efforts are being made to provide teeth to NDSAP by converting and elevating the policy into
an act, India may start enforcing the 10 principles enunciated by CODATA for developing countries.
7. TRAINING & CAPACITY BUILDING
7.1 SKILLS NEEDED & AVAILABLE: QUALITY & QUANTITY:
Big Data Analytics work is fluid, often practiced under pressure and frequently demanding attention
to detail while simultaneously focusing on the larger purpose. Analysis often involves the
customization or creation of tools, the painstaking cleaning of datasets, and the technical and
analytic challenges of linking datasets. Within this challenging environment, data workers must have
a strong skill set that combines technical and business acumen, involving creativity and agility as
well as strong problem-solving skills. Grit, dogged persistence and resilience in the face of these
daily challenges underlie the essential skill set of data workers; without them, survival and success
are unlikely.
Indeed, a unique combination of skill sets is required to make the most of the opportunity offered by
Big Data Analytics. These are the hard and soft skills. Apart from hard skills such as Subject
Matter Expertise, Mathematics & Statistics Knowledge and Data & Technical Skills, there is a need for soft
skills such as Problem Solving, Story Telling, Collaboration, Curiosity, Communication and Creativity.
The skill sets can also be classified by role groups such as Developers, Architects, Analysts, Coders,
Data Scientists, Data Engineers, Designers, Administrators, Project Managers, Business Experts and
Consultants. Given below is an additional list of a few Business Analytics roles across industries:
• Data Analyst
• Financial Analyst
• Pricing Analyst
• Website Analyst
• Retail Sales Analyst
• Business Analyst
• Marketing Analytics Manager
• Supply Chain Analyst
• Fraud Analyst
• Clinical Analyst
7.2 THE ESSENTIAL SET OF DATA SCIENCE COMPETENCIES:
Becoming a data scientist requires comprehensive mastery of a number of fields. However, one
does not need to learn a lifetime's worth of data-related information and skills as quickly as possible.
Instead, one should learn to read data science job descriptions closely. This will enable one to apply for
jobs for which one already has the necessary skills, or to develop specific data skill sets to match the
jobs one wants. Nevertheless, the essential competencies given below need to be acquired as soon as possible.
Basic Tools: No matter the type of company, one is expected to know how to use the tools of the
trade. This means a statistical programming language, like R or Python, and a database querying
language like SQL.
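As a rough illustration of how the two tools named above fit together, the sketch below uses Python's standard `sqlite3` module to run a SQL aggregation from a Python program. The table name `staff`, the function name `average_salary_by_role` and the sample figures are hypothetical and are not taken from this report.

```python
import sqlite3


def average_salary_by_role(rows):
    """Load (role, salary_lakhs) rows into an in-memory SQLite table and
    return the average salary per role, computed by a SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE staff (role TEXT, salary_lakhs REAL)")
        conn.executemany("INSERT INTO staff VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT role, AVG(salary_lakhs) FROM staff "
            "GROUP BY role ORDER BY role"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data, for illustration only.
sample_rows = [
    ("Data Analyst", 5.0),
    ("Data Analyst", 7.0),
    ("Fraud Analyst", 6.0),
]
result = average_salary_by_role(sample_rows)
print(result)
```

The SQL statement does the grouping and averaging, while Python handles loading the data and consuming the result; the same division of labour would apply with R in place of Python.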
Basic Statistics: At least a basic understanding of statistics is vital as a data scientist.
Machine Learning: This can mean things like k-nearest neighbors, random forests, and ensemble
methods etc.
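To make the first of these methods concrete, here is a minimal sketch of a k-nearest neighbours classifier in plain Python. It assumes numeric feature tuples, Euclidean distance and a simple majority vote; the function name `knn_predict` and the sample points are hypothetical, and a production system would normally rely on an established library instead.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Return the most frequent label among those neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-cluster sample data, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (7, 7)))
```

An odd value of k is used here so that a vote between two classes cannot tie.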
Multivariable Calculus and Linear Algebra: Understanding these concepts is most important at
companies where the product is defined by the data and small improvements in predictive
performance or algorithm optimization can lead to huge wins for the company.
Data Munging: Often times, the data being analyzed is going to be messy and difficult to work with.
Because of this, it’s really important to know how to deal with imperfections in data. Some
examples of data imperfections include missing values, inconsistent string formatting (e.g., ‘New
Delhi’ versus ‘new delhi’ versus ‘ND’), and date formatting (‘2014-01-01’ vs. ‘01/01/2014’, unix time
vs. timestamps, etc.).
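The imperfections listed above can be handled with a few small normalization helpers. The sketch below is illustrative only: the alias table `CITY_ALIASES`, the function names, and the reading of '01/01/2014' as day/month/year are assumptions, not part of this report.

```python
from datetime import datetime, timezone

# Hypothetical alias table mapping known variants to one canonical spelling.
CITY_ALIASES = {"new delhi": "New Delhi", "nd": "New Delhi"}

# Accepted text formats; day-first is assumed for slash-separated dates.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_city(raw):
    """Return a canonical city name, or None for a missing value."""
    if raw is None:
        return None
    key = " ".join(raw.split()).lower()  # trim and collapse whitespace
    if not key:
        return None
    return CITY_ALIASES.get(key, key.title())


def normalize_date(raw):
    """Return the date as 'YYYY-MM-DD', or None if missing or unparseable."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        # Numeric input is treated as Unix time in seconds (UTC).
        return datetime.fromtimestamp(raw, tz=timezone.utc).date().isoformat()
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print([normalize_city(c) for c in ("New Delhi", "new delhi", "ND", None)])
print([normalize_date(d) for d in ("2014-01-01", "01/01/2014", 1388534400)])
```

Returning None for missing or unparseable values keeps the gaps explicit, so that a later step can decide whether to drop, impute or flag them.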
Data Visualization & Communication: Visualizing and communicating data is incredibly important,
especially at young companies who are making data-driven decisions for the first time or companies
where data scientists are viewed as people who help others make data-driven decisions.
Software Engineering: It is important to have a strong software engineering background.
Thinking Like A Data Scientist: It’s important to think about what things are important, and what
things aren’t. How should one, as the data scientist, interact with the engineers and product
managers? What methods should one use? When do approximations make sense?
Creativity: There are no hard and fast rules about what a company should use big data for.
Business skills: An understanding of business objectives, and of the underlying processes which drive
profit and business growth, is also essential.
Communication ability: Both inter-personal and written. An essential part of a data scientist's skill set
is the ability to communicate the results of the analysis to other members of the team as well as to
the key decision-makers, who need to be able to quickly understand the key messages and insights.
7.3 INDUSTRY NEEDS & THE GAPS
• According to the Business Standard, the Indian analytics market is expected to grow to $1.15
billion, and industry bodies predict a five-fold growth in the number of big data professionals
by 2015.
• For many organizations, especially in India, where big data is booming at an exponential rate,
finding the right talent and knowing what skills to look for continues to be a major
roadblock.
• Students graduating from many Indian colleges and universities do not possess
several of the advanced skill sets required of big data workers, such as predictive
analysis skills, working with advanced business intelligence tools and data integration skills.
• The existing, experienced big data professionals do not have adequate expertise to train
fresh talent, and hence most entry-level professionals need to learn with experience.
• According to the Jigsaw Academy annual salary report 2014 for analytic professionals, the
average salary of entry-level big data professionals has increased 27 percent since 2013,
from Rs 5.2 lakhs to Rs 6.6 lakhs per annum. Typically, there is also a 250 percent increase in
salary while moving from an entry-level analyst to the position of a manager.
• Keeping in mind the above challenges, in-house HR professionals in organizations that
recruit big data analysts will face a two-fold problem: identifying the right candidates with
hybrid skill sets and staying within the budget.
• Recruiters have to think out of the box while looking for big data talent. Expanding their
search outside the regular computer science related streams, to those fields where
mathematical and business skills are heavily utilized, is one way to overcome the scarcity.
• A few innovative start-ups have taken the internship route to identify the most suitable talent
for this niche. These start-ups have developed internship programs to identify, acquire and
nurture the right talent from top Indian colleges.
• While big data is changing the game for recruiters, big data organizations themselves need
to face up to the challenges of hiring the right people for the right seats.
7.4 MODELS FOR CAPACITY BUILDING & TRAINING
Existing Training Institutes
Finding trained and competent analytics personnel has become one of the biggest challenges for
employers in India. To address this issue, a large number of analytics training institutes
have surfaced all over the country. These institutes focus on training candidates so
as to make them fit for the Business Analytics and Business Intelligence sector. Given below is a list of
some prominent analytics training institutes in the country.
• Academy For Decision Science And Analytics, Ahmedabad
• Analytics Training Institute, Bengaluru
• Big Data Training, Chennai
• Business Analytics, Bengaluru
• Indian Institute of Technology, Mumbai
• International School of Engineering, Hyderabad
• Jigsaw Academy, Bangalore
• Mudra Institute of Communications, Ahmedabad
• NI Analytics, Bengaluru
• Ureach Solutions, Bengaluru
Online Big Data Training Courses:
To meet the demand for big data talent, many universities and training institutes have started offering
online courses that cater to learning and working with Hadoop technologies. Following are some
of the institutes and organizations offering online courses on Big Data Analytics and related
technologies:
• AnalytixLabs
• Cloudera Data Analyst
• Cloudera Introduction to Data Science
• Edureka Big Data and Hadoop
• Edureka Data Science
• Edvancer’s Eduventures
• EMC2 Data Science and Big Data Analytics
• GuRu Prevails
• Jigsaw Wiley Certified Big Data Specialist
• Imarticus Learning
• International School of Engineering (INSOFE)
• Ivy® Professional School
• Learning Tree Big Data Analytics
• Learning Tree Big Data Analytics with Pig, Hive and Impala
• NIVT, Industrial Training Centre affiliated to NCVT, DGE&T
• SimpliLearn Big Data and Hadoop Developer
Institutes offering PG qualifications
Some of the other institutions and organizations offering courses leading to academic qualifications
in Big Data Analytics (BDA) in the country are given below:
• Analytics Essentials – IIIT, Bangalore
• Business Analytics and Intelligence (BAI) – IIM Bangalore
• Certificate Program in Business Analytics – ISB, Hyderabad
• Data Analysis Online courses – SRM University
• Executive Program in Business Analytics – IIM Calcutta
• Executive Program in Business Analytics and Business Intelligence – IIM Ranchi
• Jigsaw Academy courses
• M.Tech. Computer Science and Engineering with Specialization in Big Data Analytics – VIT
• M.Tech. (Database Systems) – SRM University
• M.Tech. Computer Engineering and Predictive Analytics – Crescent Engineering College
• M.Tech. (ICT) in 'Data Science and Analytics' – Institute of Engineering & Technology (IET),
Ahmedabad
• M.Tech. with specialization in Business Analytics – Hindustan University
• Post Graduate Program in Business Analytics – Praxis Business School
• Post Graduate Program in Business Analytics and Big Data – Aegis
• Statistical Techniques for Data Mining & Business Analytics (DMBA) – ISI Mumbai
Offerings from SW Companies:
Major software companies are offering training & certification in Big Data Analytics. Some of the
training programs are free, although the certifications generally are not. Programs from
six prominent big data companies are listed below:
• Oracle big data training
• SAS big data training
• SAP training and certification
• Microsoft SQL server training
• IBM training and certifications
• HP Vertica certification
8. INVESTMENTS – DETAILED PROJECT REPORT
8.1 OBJECTIVES OF THE STUDY
• Assess the present status of the industry in terms of market size, different players providing
services across sectors/functions, opportunities, SWOT of industry, policy framework (if any),
present skill levels available, etc.
• Conduct a market landscape survey to assess the future opportunities and demand for skill levels in the next
10 years.
• Carry out a gap analysis in terms of skill levels and policy framework.
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various
stakeholders (Govt., Industry, Academia, Industry Associations and others) with clear timelines
and outcomes for the next 10 years.
• The international scenario may also be examined while evolving the Strategic Road Map.
8.2 CONSULTATIVE APPROACH
8.2.1 THE NEED
Indian businesses, research institutions and enterprises are sitting on a gold mine of information. Making
sense of these huge data sets has become imperative. In these circumstances, big data analytics has
become one of the more talked about topics in India. Big data has tremendous potential in India. With
social media usage on the rise and increased adoption of technology by almost all sectors, big data
analytics is on the agenda of boardrooms across Indian enterprises. However, most Indian enterprises are
still coming to terms with this concept. Apart from business & industry, the government through its
various ministries and arms, research organizations and institutes of higher learning are yet another source
of the huge amount of data generated.
While everybody realizes the importance and the potential of analyzing these data sets, not much has been
done or achieved by way of a structured and concerted approach to channelize the resources and efforts
to exploit and leverage the possibilities of using big data in the country. In addition, the Big Data domain of
the country has a very large number of stakeholders with stakes in very divergent fields. To handle
the enormous task of preparing a strategic roadmap for big data analytics, a Consultative Approach is the
most suitable.
8.3 METHODOLOGY
The methodology adopted for the study was two-fold: Primary Research and Secondary Research. A
diagrammatic representation of the entire approach is given in Figure 8.1 below.
FIGURE 8.1: APPROACH & METHODOLOGY
8.3.1 SECONDARY RESEARCH:
This involved capturing relevant information from the public domain through research articles, published
documents on Big Data Analytics, Net searches, etc. A very large number of research papers, reports, books,
other public domain documents and presentations were consulted, in addition to information collected during
participation in a number of Big Data related conferences/seminars held recently in the country. A list of the
materials referred to has been included in the Bibliography given in the report. An organized and structured
thought process, as given in Table 8.1 below, was deployed to cull relevant information.
TABLE 8.1: STRATEGIC THOUGHT PROCESS

MAJOR STRATEGIC THOUGHT: Data Science and its supporting role in Big Data
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Understanding Data Science
• Defining Big Data
• Types of data and non-marketing applications
• Driving value from Big Data
• How to approach a Big Data project
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Assess the present status of the industry

MAJOR STRATEGIC THOUGHT: Assessment of the current status
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Major stakeholders
• Availability of data
• The current and future technologies to be used
• Adequacy of the available infrastructure
• Quality & quantity of manpower available
• Available Technologies & Providers
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Emerging ICT Paradigm
• Assessing the present status of the industry

MAJOR STRATEGIC THOUGHT: Opportunities, threats, Gaps and questions
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Where and how do we start?
• How do we create a business case for a pilot?
• SWOT Analysis
• What data is relevant?
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• SWOT Analysis of Big Data Analytics
• Market landscape survey to assess the future opportunities and demand

MAJOR STRATEGIC THOUGHT: Data Science and The Global Scenario
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Successful applications of Big Data
• Big Data for Development
• Big Data Market
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• The international scenario for evolving Strategic Road Map

MAJOR STRATEGIC THOUGHT: Identification of the Pillars of Information Driven Governance and Business Transformation
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Identification of the Business Drivers
• Doing more with the data – using Big Data and Business Analytics
• The aims of deploying an enterprise data hub
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders

MAJOR STRATEGIC THOUGHT: Maturity Stages on Road to Big Data
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Initiate – Kick start & build first success
• Scale up – Build confidence in sustainable success
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Big Data Road Map

MAJOR STRATEGIC THOUGHT: Leveraging Data Science for Scientific Research & Development
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Applications in R&D projects
• Establishment of Centre for Excellence in Data Science
• Dissemination of Big Data knowledge
• Capacity Building through Training programs
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Indian Perspective
• Identify Gap Areas
• Challenges in R&D for S&T
• Gap analysis in terms of skills levels and policy framework

MAJOR STRATEGIC THOUGHT: Digital India
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Provisions under Digital India Initiative
• Leveraging provisions under ‘Digital India’
• Leveraging Big Data & Open Data
• Synergy with e-Governance initiatives
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders

MAJOR STRATEGIC THOUGHT: Data Demand Trends
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Identify Gaps by roles and skills
• Gap Closing – Centre of Excellence, Skilling/Up skilling, Training Programs, Workshops at various levels of stakeholders
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Gap analysis in terms of skills levels and policy framework

MAJOR STRATEGIC THOUGHT: Managing the governance issues of Big Data
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• The regulatory context
• Privacy law as applied to Big Data
• Responsible Big Data business practices
• R & D Projects
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Big Data Policy Perspective
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders

MAJOR STRATEGIC THOUGHT: Detailed Project Report and Future Outlook
CONTRIBUTING FACTORS TO BE INVESTIGATED:
• Formulation of the Project, its Objectives & Targets, Cost Benefits & Outcomes, Monitoring mechanisms and Action Plan including technology, cyber security and other relevant issues
MAPPING WITH THE OBJECTIVES OF THE STUDY:
• Justification of the Project
• Project Objectives & Targets
• Project Design & Costs
• Envisaged Benefits & Outcomes
• Evaluation parameters {Measurable Indicators}
• Project Monitoring and MIS
• Roles of various Stakeholders
• Evolve a strategic Road Map and micro level action plan clearly defining roles of various stakeholders to
8.3.2 PRIMARY RESEARCH
The Primary Research consisted of obtaining feedback from the two National Consultative Meetings,
feedback through four sets of questionnaires (one each for the four major stakeholders, viz. Data
Generators, Researchers, End Users and Service Providers; see Annexure-1), and four Interactive Workshops
held in four locations (Bengaluru, Pune, Hyderabad & Kolkata). The names of the participants/respondents
and their respective organizations are provided in Annexure-2. A summary of the efforts made is given in
the tables below:
TABLE 8.2: CONSULTATIVE MEETINGS (CM) & INTERACTIVE WORKSHOPS (IW) ORGANIZED
S. No.   DATE       CM / IW HELD AT    NUMBER OF PARTICIPANTS
1        28/11/14   New Delhi (CM)     34
2        07/01/15   Bengaluru (IW)     31
3        19/01/15   Pune (IW)          20
4        29/01/15   Hyderabad (IW)     40
5        20/02/15   Kolkata (IW)       52
6        25/03/15   New Delhi (CM)     42
TOTAL                                  219
TABLE 8.3: RESPONSES RECEIVED FROM THE STAKEHOLDERS
S. No.   CATEGORY OF STAKEHOLDER
1        Data Generators (DG)
2        Researchers (RE)
3        End Users (EU)
4        Service Providers (SP)
NUMBER OF SUGGESTIONS/RESPONSES RECEIVED: 100+, including filled-in questionnaires
Second Consultative Meeting:
The second consultative meeting was held on 25th March 2015 at New Delhi. At this meeting, the findings of
this draft Report were presented to the stakeholders, and their comments, observations and suggestions
were invited. The comments & suggestions have been incorporated in this report in Chapter 4.
8.4 ANALYSIS AND RESULTS
8.4.1 PARTICIPANTS’ SUGGESTIONS
During the Consultative Meetings and the Interactive Workshops, the participants made valuable
suggestions. These suggestions are consolidated as given below.
First Consultative Meeting
• Based on the suggestions of participants, the stakeholders were divided into four groups,
namely:
o RESEARCHERS (RE),
o DATA GENERATORS (DG),
o END USERS (EU) and
o SERVICE PROVIDERS (SP)
• A questionnaire for each of the stakeholder groups should be designed separately to capture the specific
inputs.
• Each questionnaire is divided into several parts, consolidating questions pertaining to a specific
discipline so as to facilitate filling in by a person from that discipline.
• Big data analytics is at an initial stage in India. The questionnaire may be sent along with a concept
note giving a brief about the project and the purpose behind filling in the questionnaire.
• The questions may be framed to answer "what" rather than "how".
• The structure of the questionnaire may be categorized under various heads; some heads may be
less relevant or not relevant for a particular stakeholder. Therefore, different questionnaires
need to be prepared for different stakeholders such as government, industry, individual experts, etc.
• In multiple choice questions, there should be no more than 3-4 options; too many options (such as 10)
for a particular question will not result in an appropriate outcome.
• It is important to identify and select an appropriate person to fill in the questionnaire based on the
person's skills, expertise and experience in the area of Big Data analytics.
• Some questions in the questionnaire do not seem relevant and are difficult to answer,
e.g. "How often do you update your data?" Such questions may be removed.
• Some important stakeholders in data analytics are: academic institutes, industry, and agencies like
IRDA, UIDAI (Aadhaar), Electoral office, NSDC, etc.
• Curation of data is an important aspect and needs to be included in the Report, as in the present
environment the lack of standards in data storage results in seek/incomplete/non-authentic data.
• Quality of data is an important aspect, to be checked for ontology and metadata.
• The Report should have specific outcomes against a definite scope of work. Focus areas may be
Human Resource Development (HRD), policy framework, research proposals, and association with
industries for data analytics based on their specific requirements. In the policy framework, ownership of
data may be defined.
• Digitalization of data, Digital Asset Management and training of SMEs, academics and other
stakeholders for the same may be part of the Report, as very few government organizations
presently have skilled manpower for data analytics.
• DST may initiate summer/winter training schools, workshops and conferences for widespread
dissemination of data analytics.
• The project initiative may be linked to Digital India and Make in India. Organizations like DIT may be
associated in this regard. A disaster management system may be developed through this project.
• Stakeholders for data may be categorized as follows:
o Data generators
o Data brokers (who do data analytics)
o Data implementers
• Data analytics can generate a base for many IPRs. The Report may be documented in such a way
that it can be converted into an Act, implementable at national and international level.
• Generally in data analytics, more attention is given to the semantic aspect and applied technology;
however, processes need to be given more attention.
• The strategy document should not cover the scenario beyond 10 years; considering the speed at which
technology gets upgraded, doing so would not be of practical value.
Four Interactive Workshops
Some of the important suggestions and observations made during the four interactive workshops
are as given below:
• Is data really big or are we making it big? Is it about Big Data only or about DATA?
• Need to educate and create awareness in people to generate and store only relevant data.
• The problem of data sharing could be eased by elevating the National Data Sharing and
Accessibility Policy (NDSAP) to an act.
• Big Data should be changed to “Analytics and Big Data”, or “Data Science”.
• The data available with a large number of organizations are not Big Data in the sense that they do
not challenge the existing computing technologies. However, they represent very important problems,
and the user can add great value by analyzing the data.
• Many institutes have started offering courses in analytics and big data without understanding
them well; a white paper will help to eliminate these misconceptions.
• Need for a national level curriculum for Analytics and Big Data.
• Lack of real-world big data sets and need for government agencies to participate more actively
via open data, etc.
• Feasibility of allowing the final year project work to be a "big-data-analytics" project.
• Lack of trained tier II faculty.
• Identify top 5 PROBLEMS to be solved by Big Data Analytics.
• Policy on sharing Data.
• Use of all forms of Analytics, viz. Descriptive, Predictive and Prescriptive.
• Creating a Platform where all the stakeholders can interact and give and seek what they have or
want.
• Take advantage of the ‘Incubation Plans’ offered by many organizations.
• Creating a ‘SAFE KEEP’ platform for all the data Created, Shared as well as Used.
• Creating a ‘Regulatory Authority’ for all the facets of Big Data in the country.
Suggested Research Areas
• Big Data: River Network optimization - A data driven analytics approach
• Understanding Urban/Rural people's perceptions on immunization
• Online Signals for Risk Factors of Non-Communicable Diseases (NCDs)
• Characterizing human behaviour during floods through the lens of mobile phone activity
• Mining Indian Tweets to Understand Food Price Crises
• Advocacy Monitoring through Social Data: Women and Children Health
• Analyzing Online Content for Insight on Women and Employment in India
• Analytics and Understanding Social Conversations through Big Data
• Unemployment analysis through Social Media
• Monitoring Food Security Issues Through News Media and Analytics
• India and State-wise, region-wise Snapshots of Mental Health/Wellbeing - Mobile Survey
• Daily Tracking of essential Commodity Prices in India through data mining and analytics
• Twitter and Perceptions of Crisis-Related Stress
• Population migration and analytics
• Food and Nutrition Security Monitoring and Analysis
• Monitoring Household Coping Strategies During Complex Crises
• Economic crisis, tourism decline and its impact on local dependents
• Impacts of the financial crisis on health and poverty in India
• Impact of the financial crisis on primary schools, teachers and parents
• A Visual Analytics Approach to Understanding Poverty Assessment through Disaster Impacts in
India
• Monitoring the impact of the global financial crisis on crime in India
• Urban crime pattern analysis, unemployment, education, social hierarchy and economic linkages
• Search Engines in Indian Languages
• Understanding Social Media
• Summarization of Data
• IoT
• Data Engineering
• Robotics
• Visual Information Technology
• Segmenting Videos
• Healthcare
• Cognitive Science
• Signal Processing
• Drug Development
• Computational Neurological Science
• Pattern Analysis and Machine Intelligence
• Statistical Data Analysis
• Image Analysis and Retrieval
• Video Image Analysis and Retrieval
• Data and Text Mining
• Web and Social Network Mining
• Bio-informatics and Computational Biology
• Hadoop and MapReduce
Pattern Analysis and Machine Intelligence
• Dimensionality Reduction
• Density Estimation
• Artificial Neural Networks
• Kernel Methods
• Large-Scale Machine Learning
• Soft computing and uncertainty analysis
• Regression Analysis
• Manifold Learning
• Support Vector Machines
• Pattern Classification and Clustering
• Reinforcement Learning
• Cognitive Machine
Statistical Data Analysis
• Biostatistics
• Statistical Computing
• Large Dimensional Random Matrices
• Computational Finance
• Statistical Genomics
• Non Parametric and Robust Statistics
• Stochastic Processes
• Robust Inference
• Multivariate Analysis
Image Analysis and Retrieval
• Hyper-spectral Image Analysis
• Automatic Target Recognition
• Remote Sensing Image Analysis
• Content Based Image Retrieval
• Face, Pose and Gait Recognition
• Fuzzy Image Modeling
• Digital Watermarking
• Medical image processing & retrieval
• Mathematical Morphology
• Document Image Analysis
• Optical Character Recognition
• Multi-resolution Image Analysis
Video Image Analysis and Retrieval
• Background Subtraction
• Moving Object Detection and Tracking
• Shot Boundary Detection
• Video Retrieval
• Target Detection from Video
• Shadow Removal
• Video Sequence Matching
• Video Copy Detection
• Video Storyboard Generation
Data and Text Mining
• Association/Correlation Analysis
• Rule Mining
• Sequence Mining
• Graph-Pattern Mining
• Information Retrieval
• Granular Mining
• Computational Forensics
• Data Warehousing
• Data Visualization
Web and Social Network Mining
• Reliability and Cost analysis of Complex Networks
• Fitting Distributions to Network Data
• Centrality Measures in Large Scale Social Networks
• Structural Balance and Transitivity
• Relational Network Mining
• Network Visualization
• Target Set Selection
• Community Detection
• Rough-Fuzzy Granular Model of Social Network
8.5 CONSOLIDATION OF QUESTIONNAIRE RESPONSES
All the Questionnaires received were consolidated as per the Stakeholder Category viz. Data Generators,
Data Researchers, End Users and Service Providers. These consolidated responses are provided as per the
details below:
Consolidated Responses from Data Generators: Annexure 3
Consolidated Responses from Data Researchers: Annexure 4
Consolidated Responses from End Users: Annexure 5
Consolidated Responses from Service Providers: Annexure 6
Major findings, observations, concerns, suggestions and apprehensions of the Stakeholders are as given
below:
8.5.1 CURRENT STATUS, STRATEGY & PROFILE
a. Stakeholder Segment/Category:
Organizations generally operate in one category/segment; however, they sometimes
operate in multiple categories/segments as well.
b. Commonly active Data Segments are:
• Analytics and Simulation segments
• Banking and Finance
• Capacity building initiatives in Big Data Management, Analytics and Machine Learning
• Click stream data and analytics services
• Cloud BI and cloud data services
• Consultancy in Big Data Management, Analytics and Machine Learning
• Customers
• Data Analysis
• Genomics & Life Sciences
• Industry: Education, Manufacturing, Media & Content, Logistics and E-Commerce
• Internet and Media
• IoT data & analytics services
• Policy Making
• Research in the areas of Big Data Management, Analytics and Machine Learning
• Retail
• Telecom
• Transaction both internal & external
• Transaction Information
c. Data Segments that are outsourced are:
• Banking and Finance
• CRM apps and customization
• Debit Card Data
• ERP apps and customization
• Internet and Media
• Mobile apps and customization
• Retail
• Telecom
d. Budget provisions for Big Data usage during 2014-15, 2015-16 and 2016-17 vary between Rs. 100
lakhs and Rs. 500 lakhs.
e. Areas where more Investment in Resources is likely to be made by most organizations:
• Capacity Building
• Software Tools: most preferred area
• Data Sources
• Other Data Generation
• Agile Process Quality & ISO compliance for data services
• Data security & industry specific compliance audits/standards
• User Experience (UX) standards and best practice
• Sales & Marketing standards & scaling the business
f. The current state of big data activities within organizations is generally in a state of flux and is
represented by:
• Not yet started to consider big data's use within our organization
• Offering training programs and consultancy
• One or more pilots or proofs of concept
• Implementing big data technologies
g. General expectations of the organizations from Big Data Analytics in the next 10 years are of
varied nature and can be captured by:
• It is going to rule many organizations
• Plan to set up an internationally known centre of excellence in Big Data Management, Analytics,
Mining and Machine Learning for Research and Development, Consultancy Services and Capacity
Building
• Capacity Building
• Training/Research
• Big data will enable integrated cloud warehousing that integrates external and internal data
• Machine learning algorithms will enable analytic automation that gives competitive advantage
• End user intelligent apps will become more dependent upon big data APIs that make them smarter
and more personalized
• Smarter cities
• Smoother e-governance
• Superior analytics for business growth and customer service satisfaction
• Data-driven methods
• Conflict resolution
• Early warning systems
• Big Data techniques will allow organizations to analyze data for patterns more quickly and at a
much lower cost, leading to important business insights that can drive the business
h. Current state of big data activities within organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
i. Organizations describe their competitive position as either ‘on par with industry’ or
‘underperforming industry / market peers’.
j. Big data management is generally viewed strategically at senior levels of the organization.
k. Generally there is enough of a “big data culture” in the organization, where the use of big data
in decision-making is valued and rewarded.
l. Organizations are not sure about the usefulness of Big Data Analytics applications.
m. Organizational data is not available at data.gov.in.
8.5.2 MANPOWER, SKILL GAPS AND TRAINING NEEDS
a. Following are the identified skills gaps in dealing with data and analytics:
• Data integration skills
• Data storage skills
• Tooling / software skills
• Visualization skills
b. Big Data experts are employed in the following areas:
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: text, voice, music, image and video experts
• OR and applied mathematics
• Statistics and econometrics
• Those who understand business and the data that goes with it
c. The training needs as identified by the stakeholders are:
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Computer science: machine learning and artificial intelligence courses
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• High Frequency Data
• Operations research and applied mathematics courses
• Software tools such as Splunk, ELK, Cloudera
• Statistics and econometrics courses
• Strategy courses on Big Data for top management
d. Capacity Building initiatives needed to be taken are listed as per details given below:
Name of the Program | Who should be the participants | Duration | Modality of Delivery
Application related courses (Big Data in marketing, finance, logistics, etc) | Business Units | 3-5 days | Classroom Training
Basic Statistics | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Big Data | BE/Graduates | 6 months | Apprentice model
Big data Certifications | Researchers and Practitioners | 2 Months | Hybrid – Class room + elearning
Big Data in marketing, finance | Middle Management and Lower Management | 3 Days | Class room program
Cloud BI | MCAs | 3 months | Apprentice model
Cloud DS | Diplomas | 3 months | Apprentice model
Computer science: text, image and video recognition courses | Business Units | 3-5 days | Classroom Training
Data Mining and Data Warehousing | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room Session
M. Sc (Big Data) | B. Sc | 4 Semesters | Class Room
M. Tech. (Big Data) | B. Tech | 4 Semesters | Class Room
Multivariate Analysis | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Numerical Methods | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Operation Research | Research Scholars/Academic Professionals/Corporate Personnel | 40 Hours | Class Room session
Programming courses | IT officers | 7 Days | Class room program
Statistics and econometrics courses | Middle Management and Lower Management | 3 Days | Class room program
Strategy courses on Big Data | Top Management | Half Day | Class room program
Strategy courses on Big Data for top management | Business Units | 3-5 days | Classroom Training
Strategy courses on Big Data for top management | C level professionals, researchers and policy makers | 2 Months | Hybrid – Class room + elearning
Strategy courses on Big Data for top management | Business Units | 3-5 days | Classroom Training
8.5.3 PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
a. Organizations have taken initiatives in the following areas that are related to Big Data Science
& Technology
• Analysis of Unstructured/Semi-structured data
• Data streaming & Processing
• New Computational Models
• Security & Privacy issues
• Visualization & Visual Analytics
b. Organizations have taken initiatives in the following areas that are related to Big Data
Infrastructure
• Big Data Open Platforms
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• System Architectures, Design and Deployment
c. Organizations have taken initiatives in the following areas that are related to Big Data Search,
Mining and Management
• Algorithms & Systems for Big Data Search
• Cloud/Grid/Stream Data Mining - Big Velocity Data
• Computational Modeling & Data Integration
• Data Acquisition, Integration, Cleaning & Best Practices
• Multimedia and Multi-structured Data - Big Variety Data
• Search & Mining of a variety of data including scientific, engineering, social, sensor & multimedia
• Visualization Analytics for Big Data
d. Organizations have taken initiatives in the following areas that are related to Big Data
applications
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data as a Service
• Complex Big Data Applications in Science, Engineering, Medicine, Healthcare, Finance, Business,
Law and Education
• Retailing, social media and Telecommunication
e. Organizations are able to get timely access to the information they need only to some extent.
f. Organizations are able to derive only a modest competitive advantage from information.
g. Challenges inhibiting the organizations from acquiring and integrating data
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficulty in sharing data internally and/or integrating internal data across silos
h. Challenges inhibiting organizations from analyzing data
• Lack of software/tools and/or software too difficult to use
• Inconsistent data across a variety of source systems
i. Challenges inhibiting organizations from acting on data insights and analytics
• Lack of software/tools that allow end-users to perform analytics themselves
j. The biggest impediments to using big data for effective decision-making
• Too many “silos”: data is not pooled for the benefit of the entire organization.
k. It is generally agreed that the issue now is not the growing volume of data, but rather
being able to analyze and act on data in real time.
8.5.4 AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
a. Steps taken by the organizations to Integrate Data into Organization’s Business:
• Improving data collection processes
• Redesigning/re-engineering important business processes
• Training current employees or recruiting new employees in BA
• Upgrading IT systems
b. Areas reasonably ‘developed’ to ‘well developed’ in organization that may help use of big data
in the organization:
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
c. Organizations feel that the number of Big Data specialists in the organization will increase
next year (2015).
d. Suggestions on Data Storage, Data Curation and Data Retrieval note that these technologies
are evolving and should be constantly innovated, and that the organization's roadmap should
focus on alignment with emerging technologies.
e. Organizations suggested that the FINAL PRODUCTS for which the Big Data Community may strive
include:
• Big Data as a Service providing easy experimentation and quick prototyping
• Big Data Analytics platforms for Internet of Things and wearable devices
• Solutions/Protocols for seamless data integration, privacy and security
• Big Data Analysis Platforms
f. Organizations suggested the following thrust areas for the Researchers in the Big Data
Discipline:
• Immediately:
o Better algorithms/platforms for Big Data Management, ETL and Analytics – improving the open
source solutions
o Procuring real time data
o Data gathering
o Data integration
o Data integrity
o Data security
• In the next 5-10 years:
o Scalable Machine Learning for Big Data, IoT and Big Data integration and products
o Developing Medical layer for supporting end users
o System developers
o Building DSS, KSS
o Event triggering systems and agents aiming at integrating with Internet of Things
8.5.5 TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
a. Type of data analyzed by the organizations in the context of Big Data applications:
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
b. The support the organizations would like to get from the Government:
• Building a central repository of financial markets statistical data.
• Clarity on security aspect
• Clarity on statutory / regulatory / compliance requirements
• High level 5-year country strategy
• Our Analytic capability may be used for the needy
• Partnering with peer organizations and relevant government agencies
• Setting up of SEZs for smaller set-ups like ours which completely export their services. The
current SEZs are unaffordable and only larger companies can get the benefit of working out of an
SEZ.
• Support for enhancing the capacity of our Big Data Engineering Lab
• Support for offering internationally known Big Data certifications in India
• The government’s roadmap on big data
• To foster an encouraging environment for entrepreneurship, especially for small start-ups.
c. Organizations consider that the amount of data available to support decision-making is
enough
d. Challenges faced by the organizations in GENERATING Data include:
• Ensuring uniformity in data structure.
• Coping with rapid changes in business requirements.
e. Challenges faced by the organization in CLEANING the Generated Data include:
• Identifying mandatory data fields to ensure correct analytics.
• Data correlation
• Data quality
f. Advanced analytics methods used by the organizations in Big Data applications
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
g. Organizations consider the following as the most important factors for successful Big Data
implementations, rated from "1" = most important to "5" = least important:
• A clear company strategy: 1
• A sound procedure for legal, ethical and reputational issues: 3
• An organizational structure that supports multi-disciplinary projects: 4
• Financial budget: 1
• Support by higher management: 1
• Supporting systems and procedures: 4
• Talent: 3
• Training: 4
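The ratings in item g are easier to read when the factors are grouped by score. The following is a minimal sketch in Python, assuming the scores are transcribed exactly as reported above; the names `factor_scores` and `group_by_score` are illustrative only and do not come from the survey instrument.

```python
from collections import defaultdict

# Scores transcribed from item g (1 = most important, 5 = least important).
factor_scores = {
    "A clear company strategy": 1,
    "A sound procedure for legal, ethical and reputational issues": 3,
    "An organizational structure that supports multi-disciplinary projects": 4,
    "Financial budget": 1,
    "Support by higher management": 1,
    "Supporting systems and procedures": 4,
    "Talent": 3,
    "Training": 4,
}


def group_by_score(scores):
    """Return {score: [factors]} with scores ascending and factors sorted by name."""
    groups = defaultdict(list)
    for factor, score in scores.items():
        groups[score].append(factor)
    return {score: sorted(names) for score, names in sorted(groups.items())}


tiers = group_by_score(factor_scores)
for score, names in tiers.items():
    print(score, "; ".join(names))
```

Grouped this way, strategy, budget and senior management support form the top tier, while structure, systems and training fall in the lowest tier reported.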
h. Open source tools and platforms used by the organizations for Big Data Analytics:
• Apache Hadoop Ecosystem – Hortonworks, Cloudera
• Apache Solr/Lucene
• Graph databases – Neo4j, etc
• Hadoop
• Hortonworks
• MapReduce
• NoSQL databases – MongoDB, Cassandra
• R, Python – SciPy
i. Organizations rate their performance in information and analytic tasks as follows, on a
scale of 1 to 5, where 1 = poorly and 5 = very well:
• Acquire and integrate data: 3
• Analyze data: 3
• Act on data-driven insights: 4
j. Organizations do not currently envisage an in-house DATA CURATION function as a part of
data analytics in the organization.
8.5.6 SECURITY CONCERNS
a. Initiatives taken by the organizations for Big Data Security & Privacy cover:
• Challenges for Big Data Security & Privacy
• Cyber security and Gigabit Networks
• Intrusion Detection
• Sociological Aspects of Big Data Privacy
• Visualizing Large Scale Security Data
b. Organization’s views on the IPR Issues as related to Big Data Analytics:
• Cost of patent filing is too high
• Implement the right policies for big data governance.
• In the crowd sourcing world of Big Data Analytics it is very difficult to clearly demarcate the IPR
related boundaries.
• Most of the research outcomes are not commercialized by governmental organizations
• Need to think through certain fundamental legal aspects of IPR, e.g. “who owns the input data
companies are using in their analysis, and who owns the output?”
• Over emphasis on IPR may also hamper the open innovation approach in the internet based
application development model.
c. Views and suggestions of the organizations on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
• We will comply with national regulatory requirements
• All data needs to be made available on a common portal accessible to all
8.5.7 OTHER INFORMATION TO SHARE
a. We have more data but we don’t have proper documentation
b. Even where we have data, we do not have the operating resources to act as analysts.
c. We find it difficult to identify resource persons who have knowledge and skill in areas like
econometrics, operational research, multivariate tools, computer science etc.
8.6 JUSTIFICATION OF THE PROJECT
8.6.1 FINDINGS OF THE SWOT ANALYSIS
The SWOT analysis undertaken in Chapter 1 indicates that the country should take immediate and firm
steps to develop and leverage BDA. The important components of the SWOT analysis are reproduced
below.
Strengths
• There is a broad and detailed domain know-how as well as process know-how
available.
• Immense growth opportunity in the analytics market: Indian product firms have
shown a growth rate of 20-40 per cent in the last few years; several emerging players
have witnessed over 100 per cent growth within the first year of launch. (NASSCOM)
• Analytics – a definite market for India: Over 100 Indian analytics focused software
product firms have successfully developed and launched products catering to niche
business needs, cut across vertical-specific, horizontal process-centric and niche
applications and platforms. (NASSCOM)
• Growing start-up base accelerating the growth: Four-fold increase in analytics startups in the last four years. (NASSCOM)
• Innovative offerings focusing on end-to-end customer business needs. (NASSCOM)
Weaknesses
• There is no existing and strong content/data market in India.
• There is a lack of a solid start-up culture because of risk aversion and intolerance of
failure.
• Public data in the country is not available to the extent it should be.
• The different languages within the country create a barrier (multilingualism) during
data processing. Structural data sources often lack precise semantics.
• There is a lack of specialized education programs for data analysts.
• There are not enough skilled people to participate in capacity building training
programs.
• Legislative restrictions on data sharing decrease availability across the country and
make nationally/industry/domain focused initiatives that address these issues more
difficult.
• Rules and regulations are fragmented across the country/industry/domain.
• There are high security/sensitivity/confidentiality demands that can be difficult to
address.
• There is no well-designed data governance: Data governance is a must-have, and
no longer merely a good-to-have. In today's extremely hyper-competitive markets,
insightful knowledge means the difference between success and being overwhelmed.
But it has to be based on the right data, based on business requirements.
• Data protection Policy: "Ignoring data security, data quality and data access can
cost organizations millions of dollars, hurting enterprise agility, efficiency and
reputation."
Opportunities
• Strengthening the Indian market, e.g. by fusing the emerging start-up nucleus.
• Create lots of SMEs for the low hanging fruits of Big Data for which agility is
required.
• Investment in the entire innovation chain, beyond basic research.
• Investment support mechanisms for SMEs/Research/
Institutions/Students/Scholars/Entrepreneurs.
• There is the opportunity to open up completely new and different business areas
and services.
• New applications can be created throughout the Big Data ecosystem, ranging over
acquisition, data extraction, analysis, visualization and utilization.
• Development of APIs for access becoming standardized and available.
• Providing facilities to better navigate and curate data.
• Contextualization and personalization of data.
• The evolution of different sectors and the increased volume of data enable
innovative applications to be developed.
• Exploring new research areas.
• User generated and crowd-sourced content is increasingly available, which will help
a variety of recurring problems to be solved once and for all.
• Shift from technology push to end-user engagement.
• By 2020, information will be used to reinvent, digitalize or eliminate 80% of
business processes and products from a decade earlier: As the presence of the Internet
of Things (IoT) — such as connected devices, sensors and smart machines — grows,
the ability of things to generate new types of real-time information and to actively
participate in an industry’s value stream will also grow. (GARTNER)
• By 2017, more than 30% of enterprise access to broadly based big data will be via
intermediary data broker services, serving context to business decisions:
Digital business demands real-time situation-awareness. This includes insights into
what goes on both inside and outside the organization. How do weather patterns
impact inventory? More so, how do this season’s customer preferences as expressed
in social media suggest greater or lesser inventory? (GARTNER)
• By 2017, more than 20% of customer-facing analytic deployments will provide
product tracking information leveraging the IoT: Fueled by the Nexus of Forces
(mobile, social, cloud and information), customers now demand a lot more
information from their vendors. The rapid dissemination of the IoT will create a new
style of customer-facing analytics — product tracking — where increasingly less
expensive sensors will be embedded into all types of products. (GARTNER)
• Analytics – Opening up a gamut of opportunities for Indian software product firms
(NASSCOM)
• Big Data as a service (BDaaS): That is the delivery of Statistical Analysis tools or
information by an outside provider that helps organizations understand and use
insights gained from large information sets in order to gain a competitive advantage.
Threats
• Many skilled professionals leave the country to work in other regions; adding to the
risk of a “Brain Drain”.
• Acute lack of skilled professionals and graduates.
• There are no existing ecosystems and portals where reliable data sets are
available; there is a need to create them.
• Policies of data availability; for example companies are not willing to make data
available ‘just-in-case’ it may cause a legal action or result in competition.
• Shortage of Skills: There are a wide range of skills relevant for businesses wanting to
use data analytics, including knowledge of statistical techniques, the ability to program
and use software, market-specific knowledge and communication. These skills may not
be available in required quantity and quality.
• Business-Education Collaboration: One way to provide the multi-disciplinary skills
required for big data analysis is for students to work closely with a company during
their studies. Collaboration between a university/institution with analysis expertise
and a business with real world data can be beneficial for both parties.
• Data Sharing Policy: Non-implementation is hindering Big Data Analytics takeoff.
The project is justified as there is an urgent need to take appropriate action to take advantage of the
OPPORTUNITIES by leveraging our STRENGTHS and, at the same time, to take cognizance of the THREATS and
mitigate them by making plans to overcome the identified WEAKNESSES.
8.6.2 THE GAP IDENTIFICATION:
Gap identification was carried out in Chapter 1. The gaps have been identified in terms of the following
major issues.
o Market and Business
o Technical
o Data, Content and Usage
o Education and Skills
o Policy, Legal and Security
Consider the issues of (i) Education and Skills and (ii) Policy, Legal and Security: sooner or later
these gaps have to be plugged, and the initiative has to be taken by the Government. Because the
Government is the largest producer and user of data, the issue of Data, Content and Usage affects
it far more than it affects the corporate sector, so logically the action has to come from the
Government. With the huge business potential within the country and outside, an early
initiative is likely to give positive results. Considering the Government’s call of ‘Make-in-India’,
it is urgent that the Government take the first step so that the corporate world can join in the
effort.
8.6.3 RESPONSES RECEIVED FROM THE STAKEHOLDERS:
As indicated in Chapter 3, preliminary research was carried out through circulated questionnaires
and inputs gathered in consultative meetings and interactive workshops held with different
stakeholders. The detailed responses have been analyzed to provide an insight into the ground
realities of BDA in the country. The following are the important parameters that reveal the ground
situation:
• Current Status, Strategy & Profile
• Manpower, Skill Gaps and Training Needs
• Perceived Success Factors, Impediments & Challenges for Big Data Application
• Areas of Application, Models & Infrastructure
• Type, Amount of Data & Analytical Techniques Used
• The support expected from Government
• Security Concerns, Data sharing and IPR Issues
Important and salient responses from the stakeholders are summarized below:
Current Status, Strategy & Profile:
Organizations do not necessarily belong to one category and may operate in multiple
categories/segments. The pace of activity is rather sluggish. Many Data Segments are in active
use. Organizations are very optimistic about growth in BDA-related business and are
planning to invest Rs. 100 – 500 lakhs over the next three years. The investments are for capacity
building, software, ISO compliance, data security etc.
Manpower, Skill Gaps and Training Needs:
The skill gaps identified include data integration, data storage and visualization skills. The
representative training needs identified are application related courses (Big Data in marketing,
finance, logistics, etc), Computer science: machine learning and artificial intelligence courses
in the technology area, and strategy courses on Big Data for top management. The capacity
building efforts identified range from short courses of 3-5 days through mid-term courses
lasting a few months to UG and PG courses.
Perceived Success Factors, Impediments & Challenges for Big Data Application:
Many organizations are, on their own, taking initiatives in the areas of Technology, Infrastructure,
Management and Big Data Applications. The major concerns are inconsistencies in data
from various source systems, legacy infrastructure that inhibits data collection, and difficulty in
sharing data internally and/or integrating internal data across silos. The challenges faced include
a lack of software/tools that allow end-users to perform analytics themselves. It is generally agreed
that the issue is not the growing volume of data, but rather being able to analyze and act on data
in real time.
Areas of Application, Models & Infrastructure:
There is a strong feeling that the number of Big Data specialists required in organizations will
increase next year (2015). The suggested immediate thrust areas of research in BDA include better
algorithms/platforms for Big Data Management, ETL and Analytics (improving the open source
solutions), procuring real time data, data gathering, data integration and data security.
Type, Amount of Data & Analytical Techniques Used:
The types of data analyzed by the organizations in the context of Big Data applications include both
numerical data (for statistics, predictions, etc) and text (automated text analysis).
The support expected from Government:
• Building a central repository of financial markets statistical data.
• Clarity on security aspect
• Clarity on statutory / regulatory / compliance requirements
• High level 5-year country strategy
• Our Analytic capability may be used for the needy
• Partnering with peer organizations and relevant government agencies
• Setting up of SEZs for smaller set-ups like ours which completely export their services. The
current SEZs are unaffordable and only larger companies can get the benefit of working out of an
SEZ.
• Support for enhancing the capacity of our Big Data Engineering Lab
• Support for offering internationally known Big Data certifications in India
• The government’s roadmap on big data
• To foster an encouraging environment for entrepreneurship, especially for small start-ups.
Security Concerns, Data sharing and IPR Issues:
The organizations have taken a few initiatives for Big Data Security & Privacy. On the Data Sharing
Policy they are willing to comply with the national regulatory requirements; however, they also feel
that public domain data should be available on a common portal and be equally accessible to all.
Their views on the issues related to IPR include:
• NDSAP be elevated to an act.
• Cost of Patent filing is too high
• Implement the right policies for big data governance.
• Most of the research outcomes are not commercialized by governmental organizations
• Need to think through certain fundamental legal aspects of IPR, e.g. "who owns the input data
companies are using in their analysis, and who owns the output?”
It is certain that the stakeholders are ready to take off on BDA, provided their concerns are
alleviated and they are given support in the weak areas. This also calls for urgent
action on the part of the government so that the country does not miss an opportunity that is
otherwise within reach.
8.6.4 CAPACITY BUILDING & TRAINING AND ENTREPRENEURSHIP DEVELOPMENT:
In earlier chapters extensive analysis has been carried out on the opportunities worldwide and within
the country vis-à-vis the preparedness of the country. Substantial supporting facts and figures have
been provided on the following important indicators:
• IDC Worldwide Big Data and Analytics Predictions for 2015
• The EIU Survey
• Transparency Market Research Report
• Big Data Trends & Predictions of 2015
• European research agenda for Big Data Analytics
• Skills Needed & Available: Quality & Quantity
• The Essential Set of Data Science Competencies
• Worldwide Big Data Vendor Revenue and Market Forecast
• Development of the BA Industry in India
• Analytics as a Service (AaaS)
• Possible Models for Entrepreneurship Building
Considering the opportunities available worldwide and within the country, and taking into account the
factors identified earlier through the SWOT analysis of the Indian scenario, it can justifiably be
concluded that this is the most opportune moment for DST to take the country, its BDA-related
resources and their potential into the BDA ecosystem of the world. The project is justified.
8.7 PROJECT OBJECTIVES & TARGETS
8.7.1 VISION & THRUST AREAS
The following two vision statements are given to spell out DST’s Initiative:
"To become established as the complete support system provider in the
country in the ecosystem of Data Science, Technology, Research &
Applications (dASTRA)"
Or
“To act as a facilitator to promote and develop Data Science, Technology,
Research & Applications (dASTRA) and related ecosystem in the country”
The major thrust areas would be:
• Big Data Science & Technology
• Big Data Infrastructure
• Big Data Search, Mining and Management
• Big Data Security & Privacy
• Big Data Applications/Research
8.7.2 OBJECTIVES
The following may be the objectives of the Data Science initiative of the Department of Science & Technology:
DST, with its all-pervasive intervention in the dASTRA ecosystem, should strive to achieve the following:
• Talent Pool - Create industry academia partnership to groom the talent pool in universities as
well as develop strong internal training curriculum to advance analytical depth.
• Collaborate - Form analytics forum across organization boundaries to discuss the pain-points of
the practitioner community and share best practices to scale analytics organizations.
• Capability Development - Invest in long term skills and capabilities that form the basis for
differentiation and value creation. There needs to be an innovation culture that will facilitate IP
creation and asset development.
• Value Creation - Building rigor to measure the impact of analytics deployment is very critical to
earn legitimacy within the organization.
8.7.3 ACTIVITIES
For the Department of Science & Technology, Government of India, based on the vision statements given
above, the main activities will include, but not be limited to, the following:
• R&D PROMOTION
o Open Sky Research
o Cluster Based Network Programs
o International Collaborative Research Program
• ESTABLISHMENT OF CENTRES OF EXCELLENCE ON DATA SCIENCE
• SKILL DEVELOPMENT, CAPACITY & TRAINING
o Fellowship Based UG/PG and PhD
o Short Term Training for Faculty
o On-Line Programs
o National Workshops & Conferences
o Collaborative Interactive Conferences
o Entrepreneur Development
• INTERNATIONAL LINKAGES & COLLABORATIONS
o UN (R&D and Standards)
o Regional Associations/Collaborations
o Bilateral & Multi Lateral Exchange Programs
• INFRASTRUCTURE DEVELOPMENT
A schematic representation is given in Figure 8.2 below.
FIGURE 8.2: DST’s VISION OF dASTRA IN INDIA
(Schematic of the five activity groups: R&D Promotion; Establishment of Centres of Excellence;
Skill Development, Capacity & Training; International Linkages & Collaborations; Infrastructure
Development.)
FIGURE 8.3: CONCEPTUAL MODEL OF SIX MONTHS STUDENT’S PROJECTS LINKED WITH BIG DATA
FIGURE 8.4: SUGGESTED CAPACITY BUILDING MODEL
(Train faculty: July - August → Train students: September - December → Students undertake
projects: January - June → Certification by industry: June.)
8.7.4 TARGETS
Tentative targets are as shown in table 8.4 below.
TABLE 8.4: TENTATIVE TARGETS
UNITS PLANNED FOR THE PERIOD
COMPONENT | UNITS | YEAR 1 | YEAR 2 | YEAR 3 | YEAR 4 | YEAR 5 | TOTAL
R&D PROMOTION
Open Sky Research | Number of Grants | 5 | 12 | 12 | 12 | 12 | 53
Cluster Based Network Programs | Numbers of Programs | 10 | 12 | 12 | 12 | 12 | 58
International Collaborative Research Program | Number of Programs | 5 | 5 | 5 | 5 | 5 | 25
ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | Number of Centres | 4 | 5 | 5 | 5 | 5 | 24
SKILL DEVELOPMENT, CAPACITY & TRAINING
Fellowship Based UG/PG and PhD in 80:20 ratio | Number of Fellowships | 1292 | 1300 | 1500 | 1600 | 1700 | 7392
Short Term Training for Faculty | Number of Training Programs | 2 | 2 | 2 | 2 | 2 | 10
On-Line Programs | Number of Programs | 2 | 2 | 4 | 4 | 4 | 16
National Workshops & Conferences | Numbers | 2 | 2 | 4 | 4 | 4 | 16
Collaborative Interactive Conferences | Numbers | 2 | 2 | 3 | 3 | 3 | 13
Entrepreneur Development | Number of Projects | 1 | 2 | 3 | 10 | 15 | 31
INTERNATIONAL LINKAGES & COLLABORATIONS
UN (R&D and Standards) | Numbers | 1 | 1 | 1 | 1 | 1 | 5
Regional Associations/Collaborations | Numbers | 1 | 1 | 1 | 1 | 1 | 5
Bilateral & Multi Lateral Exchange Programs | Numbers | 1 | 1 | 1 | 1 | 1 | 5
INFRASTRUCTURE DEVELOPMENT | Number of Programs | 4 | 4 | 4 | 4 | 4 | 20
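The row totals in table 8.4 can be cross-checked with a few lines of code. The sketch below is illustrative only: the dictionary name and function name are assumptions introduced here, the figures are transcribed from table 8.4, and the Centres of Excellence row is read as 4, 5, 5, 5, 5, which is the reading consistent with table 8.6.

```python
# Yearly targets (Year 1..Year 5) and stated totals, transcribed from Table 8.4.
# TARGETS and mismatched_rows are illustrative names, not part of the report.
TARGETS = {
    "Open Sky Research": ([5, 12, 12, 12, 12], 53),
    "Cluster Based Network Programs": ([10, 12, 12, 12, 12], 58),
    "International Collaborative Research Program": ([5, 5, 5, 5, 5], 25),
    "Centres of Excellence on Data Science": ([4, 5, 5, 5, 5], 24),
    "Fellowships (UG/PG and PhD)": ([1292, 1300, 1500, 1600, 1700], 7392),
    "Short Term Training for Faculty": ([2, 2, 2, 2, 2], 10),
    "On-Line Programs": ([2, 2, 4, 4, 4], 16),
    "National Workshops & Conferences": ([2, 2, 4, 4, 4], 16),
    "Collaborative Interactive Conferences": ([2, 2, 3, 3, 3], 13),
    "Entrepreneur Development": ([1, 2, 3, 10, 15], 31),
    "UN (R&D and Standards)": ([1, 1, 1, 1, 1], 5),
    "Regional Associations/Collaborations": ([1, 1, 1, 1, 1], 5),
    "Bilateral & Multi Lateral Exchange Programs": ([1, 1, 1, 1, 1], 5),
    "Infrastructure Development": ([4, 4, 4, 4, 4], 20),
}


def mismatched_rows(targets):
    """Return the components whose yearly units do not add up to the stated total."""
    return [
        name
        for name, (yearly, stated_total) in targets.items()
        if sum(yearly) != stated_total
    ]
```

With the figures as transcribed, `mismatched_rows(TARGETS)` returns an empty list, i.e. every row total agrees with its yearly breakdown.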
8.8 PROJECT DESIGN & COSTS
8.8.1 PROJECT DESIGN
Project Management Unit:
The envisaged project is of very high value, is spread over five years, and its outcomes are
vital for the country. It is therefore suggested that the project be managed through a Project
Management Unit (PMU). The PMU would be headed by the Director, Big Data Initiative, Department of
Science & Technology, Government of India, and would be situated at the office of the Director, Big Data
Initiative, DST. The major functions of the PMU would be:
• To work as the nodal agency of the Government of India for Big Data Initiatives and to coordinate
with all the stakeholders.
• Selecting components and activities to be included in the project from time to time and preparing
annual action plans.
• Developing and fine tuning the final delivery contents, mechanisms and performance
measurement criteria of each component/activity to be undertaken by the project.
• Developing and finalizing the guidelines and the terms and conditions of the grants, and the various other
formats and documents needed for making requests to participate in the project activities,
submitting periodic reports, funds utilization statements etc.
• Seeking proposals from individuals, institutions and other organizations for undertaking the various
components and activities selected to be included in the project from time to time.
• Assigning the responsibility for delivery to competent agencies within government (State &
Central) and outside, such as national institutes of higher learning, research organizations, service
providers and others in the dASTRA ecosystem.
• Monitoring all aspects of the delivery of the project components and activities and ensuring the
quality of delivery.
• Ensuring effective coordination with implementing agencies, together with collection of
information pertaining to implementation and progress.
• Overseeing and managing the project funds. CPSMS could be used for this purpose.
• The cost of the PMU will be met from the total project cost. The PMU cost should not exceed 2.5%
of the total project cost.
Organization Structure
The overall project initiative will be spearheaded by the Director, dASTRA/BDI. It is suggested that a
Project Head support the Director, BDI in the implementation of the project. The project has four
major initiative areas: (i) Research & Development (R&D), (ii) Capacity & Training (C&T), (iii)
International Linkages & Collaborations (ILC) and (iv) Entrepreneurship Development (ID). Each of these
areas should be looked after by a Divisional Head. Implementation of the project will involve a considerable
amount of coordination with external agencies, both national and international. It is suggested, therefore,
that a Coordinator be provided to each Divisional Head. To take care of the very large number of
stakeholders such as students, participants of the training programs, faculty members etc., and to keep
track of the information and documents received and sent out by the project, a pool of Project
Assistants is recommended. The project will also have some Multi Task Staff. Table 8.5 gives the details of the
project personnel, their number and estimated costs. These costs should not exceed 2.5% of the total
project cost. The Organization Structure is as given in figure 8.5.
TABLE 8.5: SUGGESTED PROJECT PERSONNEL
S. N. | Project Position | Numbers
1. | Project Head | 1
2. | Divisional Head | 4
3. | Coordinator | 5
4. | Project Assistant | 8
5. | Multi Task Staff | 12
FIGURE 8.5: SUGGESTED ORGANIZATION STRUCTURE
8.8.2 PROJECT COSTS
TABLE 8.6: COMPUTATION OF COSTS – ALL COSTS IN Rs. LAKHS
The YEAR and PROJECT LIFE columns show UNITS / COST; rows giving totals show COST only.
COMPONENT | UNITS | UNIT COST | YEAR 1 | YEAR 2 | YEAR 3 | YEAR 4 | YEAR 5 | PROJECT LIFE
R&D PROMOTION
Open Sky Research | Number of Grants | 100 | 5 / 500 | 12 / 1200 | 12 / 1200 | 12 / 1200 | 12 / 1200 | 53 / 5300
Cluster Based Network Programs | Numbers of Programs | 150 | 10 / 1500 | 12 / 1800 | 12 / 1800 | 12 / 1800 | 12 / 1800 | 58 / 8700
International Collaborative Research Program | Number of Programs | 200 | 5 / 1000 | 5 / 1000 | 5 / 1000 | 5 / 1000 | 5 / 1000 | 25 / 5000
TOTAL FOR R&D PROMOTION | | | 3000 | 4000 | 4000 | 4000 | 4000 | 19000
ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | Number of Centres | 1000 | 4 / 4000 | 5 / 5000 | 5 / 5000 | 5 / 5000 | 5 / 5000 | 24 / 24000
TOTAL FOR ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE | | | 4000 | 5000 | 5000 | 5000 | 5000 | 24000
SKILL DEVELOPMENT, CAPACITY & TRAINING
Fellowship Based UG/PG & PhD in 80:20 ratio (PG/UG) | Number of Fellowships | 1.2 | 1215 / 1458 | 1225 / 1470 | 1390 / 1668 | 1390 / 1668 | 1500 / 1800 | 7392 / 10080 (PG/UG and Ph D combined)
Fellowship Based UG/PG & PhD in 80:20 ratio (Ph D) | Number of Fellowships | 3.0 | 77 / 231 | 75 / 225 | 110 / 330 | 210 / 630 | 200 / 600 | (included in the row above)
Short Term Training for Faculty | Number of Training Programs | 20 | 2 / 40 | 2 / 40 | 2 / 40 | 2 / 40 | 2 / 40 | 10 / 200
On-Line Programs | Number of Programs | 30 | 2 / 60 | 2 / 60 | 4 / 120 | 4 / 120 | 4 / 120 | 16 / 480
National Workshops & Conferences | Numbers | 30 | 2 / 60 | 2 / 60 | 4 / 120 | 4 / 120 | 4 / 120 | 16 / 480
Collaborative Interactive Conferences | Number of Projects | 20 | 2 / 40 | 2 / 40 | 3 / 60 | 3 / 60 | 3 / 60 | 13 / 260
Entrepreneur Development | Number of Projects | *** | 1 | 2 | 5 | 10 | 15 | 33
TOTAL FOR SKILL DEVELOPMENT, CAPACITY & TRAINING | | | 2000 | 2000 | 2500 | 2500 | 2500 | 11500
INTERNATIONAL LINKAGES & COLLABORATIONS
UN (R&D and Standards) | Numbers | 60 | 1 / 60 | 1 / 60 | 1 / 60 | 1 / 60 | 1 / 60 | 5 / 300
Regional Associations/Collaborations | Numbers | 40 | 1 / 40 | 1 / 40 | 1 / 40 | 1 / 40 | 1 / 40 | 5 / 200
Bilateral & Multi Lateral Exchange Programs | Numbers | 100 | 1 / 100 | 1 / 100 | 1 / 100 | 1 / 100 | 1 / 100 | 5 / 500
TOTAL FOR INTERNATIONAL LINKAGES & COLLABORATIONS | | | 200 | 200 | 200 | 200 | 200 | 1000
INFRASTRUCTURE DEVELOPMENT | Number of Programs | 125 | 4 / 500 | 4 / 500 | 4 / 500 | 4 / 500 | 4 / 500 | 20 / 2500
TOTAL FOR INFRASTRUCTURE DEVELOPMENT | | | 500 | 500 | 500 | 500 | 500 | 2500
GRAND TOTAL | | | 9700 | 11700 | 12200 | 12200 | 12200 | 58000
*** No funds planned under DST as proposal will only be evaluated and approved for funding through TDB
NOTE: The Total project cost is inclusive of the PMU cost. The PMU cost should not exceed 2.5 % of the total project cost.
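The arithmetic behind table 8.6 (cost = units × unit cost, summed over the project life) and the 2.5% ceiling on the PMU cost can be reproduced as a minimal sketch. The names `LINE_ITEMS`, `project_life_cost`, `grand_total`, `yearly_totals` and `PMU_CAP_RATE` are assumptions introduced for illustration; the figures are transcribed from table 8.6, and Entrepreneur Development is left out because no DST funds are planned for it.

```python
from decimal import Decimal

# (component, unit cost in Rs. lakhs, units planned for Year 1..Year 5),
# transcribed from Table 8.6. Entrepreneur Development is omitted because
# the table plans no DST funds for it (proposals are routed through TDB).
LINE_ITEMS = [
    ("Open Sky Research", "100", [5, 12, 12, 12, 12]),
    ("Cluster Based Network Programs", "150", [10, 12, 12, 12, 12]),
    ("International Collaborative Research Program", "200", [5, 5, 5, 5, 5]),
    ("Centres of Excellence on Data Science", "1000", [4, 5, 5, 5, 5]),
    ("Fellowships PG/UG", "1.2", [1215, 1225, 1390, 1390, 1500]),
    ("Fellowships Ph D", "3.0", [77, 75, 110, 210, 200]),
    ("Short Term Training for Faculty", "20", [2, 2, 2, 2, 2]),
    ("On-Line Programs", "30", [2, 2, 4, 4, 4]),
    ("National Workshops & Conferences", "30", [2, 2, 4, 4, 4]),
    ("Collaborative Interactive Conferences", "20", [2, 2, 3, 3, 3]),
    ("UN (R&D and Standards)", "60", [1, 1, 1, 1, 1]),
    ("Regional Associations/Collaborations", "40", [1, 1, 1, 1, 1]),
    ("Bilateral & Multi Lateral Exchange Programs", "100", [1, 1, 1, 1, 1]),
    ("Infrastructure Development", "125", [4, 4, 4, 4, 4]),
]

PMU_CAP_RATE = Decimal("0.025")  # the 2.5% ceiling stated in the note to Table 8.6


def project_life_cost(unit_cost, units):
    """Project-life cost of one line item: unit cost times total units."""
    return Decimal(unit_cost) * sum(units)


def grand_total(items):
    """Sum of project-life costs over all line items, in Rs. lakhs."""
    return sum(project_life_cost(cost, units) for _, cost, units in items)


def yearly_totals(items):
    """Cost per year (Year 1..Year 5) recomputed from the line items."""
    return [
        sum(Decimal(cost) * units[year] for _, cost, units in items)
        for year in range(5)
    ]


total = grand_total(LINE_ITEMS)
pmu_ceiling = total * PMU_CAP_RATE
```

On these figures the line items reproduce the project-life total of Rs. 58,000 lakhs, so the 2.5% ceiling on the PMU cost works out to Rs. 1,450 lakhs. A per-year recomputation (for example Rs. 9,589 lakhs for Year 1) does not coincide with the printed yearly totals, which appear to have been smoothed across years in the Skill Development sub-total, although the project-life figure agrees.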
8.9 DIRECT AND INDIRECT BENEFITS
The envisaged outputs/benefits and the possible outcomes of the project are summarized below in table
8.7.
TABLE 8.7: ENVISAGED OUTPUTS/BENEFITS AND THE POSSIBLE OUTCOMES
PROJECT COMPONENT | INPUTS | OUTPUTS | OUTCOMES

R&D PROMOTION

Open Sky Research on Big Data
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with industry
OUTPUTS:
• New tools created
• New Solutions created
OUTCOMES:
• Patents
• Recognition and Acceptability of Indian talent
• New BDA application areas
• New Revenue channels
• Increased business in BDA
• Closer interaction between industry & DST

Cluster Based Network Programs
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with industry
OUTPUTS:
• New tools created
• New Solutions created
OUTCOMES:
• Patents
• Recognition and Acceptability of Indian talent
• New BDA application areas
• New Revenue channels
• Increased business in BDA
• Closer interaction between industry & DST

International Collaborative Research Program
INPUTS:
• Identification of the application areas
• Scrutinizing the proposal
• Tying up with international agencies & industry
OUTPUTS:
• New knowledge & experiences
• New tools created
• New Solutions created
OUTCOMES:
• Patents
• Recognition and Acceptability of Indian talent
• New BDA application areas
• New Revenue channels
• Increased business in BDA
• Documentation of the new knowledge & experience
• Closer interaction between international agencies, industry & DST

CENTRES OF EXCELLENCE

Centres of Excellence
INPUTS:
• Selection of the theme for CoE
• Preparation of the guidelines for implementation
• Selection of the implementing agency
• Funding
• Supervision
OUTPUTS:
• Centre of Excellence established
OUTCOMES:
• Higher project success rate
• Reduced costs for professional services, management overhead and TCO
• Reduced gap between Business and IT, improving time to market and responsiveness to change
• Best Practices

SKILL DEVELOPMENT, CAPACITY & TRAINING

Fellowships Based UG/PG and Ph D
INPUTS:
• Selecting students
• Performance evaluation
OUTPUTS:
• Availability of skilled human resources
OUTCOMES:
• Employment generation
• Placement of the certified resources
• Acceptance of the employers
• Increased business in BDA
• Closer interaction between industry & DST

Short Term Training Programs for Faculty
INPUTS:
• Designing of the courses
• Selecting implementing agency
• Selecting trainees
• Administering courses
• Performance evaluation
OUTPUTS:
• Availability of Certified trainers for further training others
OUTCOMES:
• Increased number of training programs could be organized
• Increased availability of up skilled human resources
• Increased availability of Certified resources for deployment

On-Line Training Programs
INPUTS:
• Designing of the courses
• Selecting implementing agency
• Selecting students
• Administering courses
• Performance evaluation
OUTPUTS:
• Availability of up skilled human resources
• Availability of Certified resources for deployment
OUTCOMES:
• Placement of the certified resources for more responsible jobs
• New/additional work areas undertaken by employers
• Better salaries/promotion offered by employers
• Increased business in BDA
• Closer interaction and higher confidence level between industry & DST

National Workshops/Conferences and Collaborative International Conferences
INPUTS:
• Identification of BDA areas/themes
• Development of the guidelines and contents
• Receiving conference research papers
• Selection of the papers
OUTPUTS:
• New knowledge and Experience gained
OUTCOMES:
• Adapting and Implementing the newly gained knowledge and experience in Indian BDA ecosystem
• Documentation of the new knowledge & experience

Entrepreneurship Development
INPUTS:
• Scrutinizing the proposal
OUTPUTS:
• New venture created
OUTCOMES:
• Employment generation
• Increased business in BDA
• New BDA application areas
• Closer interaction between industry & DST

INTERNATIONAL LINKAGES & COLLABORATION

UN (R&D and Standards); Regional Associations/Collaborations; Bilateral & Multi Lateral Exchange Programs
INPUTS:
• Identification of areas, institutions and countries
• Tying up with institutions and countries
• Preparation of the guidelines for implementation
• Scrutiny of proposals for participation
OUTPUTS:
• New knowledge and Experience gained
OUTCOMES:
• Adapting and Implementing the newly gained knowledge and experience in Indian BDA ecosystem
• Documentation of the new knowledge & experience

INFRASTRUCTURE DEVELOPMENT

Infrastructure Development
INPUTS:
• Preparation of the guidelines for implementation
• Selection of the implementing agency
• Funding
• Supervision
OUTPUTS:
• Upgraded infrastructure available for R&D at some Centers but usable by multiple agencies
OUTCOMES:
• Higher project success rate
• Reduced costs for infrastructure overhead and TCO
• Improved R&D facilities to generate Best Practices
8.10 EVALUATION PARAMETERS
Many of the initiatives taken in the project may not be quantifiable, and the impact of the project will have
to be understood in qualitative terms. However, the following is a list of possible evaluation parameters
and measurable indicators for understanding the results of the project:
• Number of Research Papers published
• Number of new ventures created
• Number of Centres of Excellence established
• Number of new tools created
• Number of new Solutions created
• Number of Best Practices developed
• Number of fellowships awarded
• Number of Ph Ds
• Number of Trainers Trained
• Number of implementing agencies selected
• Requests received from industry as a result of closer interaction between industry & DST
• Number of training programs organized
• Number of trainees trained
• Number of new BDA application areas identified
• Number of Business Models to deploy BDA identified
• Number of tie-ups with business
• Number of proposals received for Venture capital/seed money etc.
• Number of new Revenue channels generated
• Number of Patents registered
• Awards and Recognitions won
• Number of international collaborative research projects started/completed
• Number of cluster based network projects started/completed
• Number of Open sky research projects undertaken on Big Data
• Number of Infrastructure projects started/implemented
• Number of online programs launched
• Number of participants benefited through online programs
• Number of national workshops/conferences organized
• Number of collaborative international conferences organized
8.11 PROJECT MONITORING AND MIS
Monitoring, MIS & Evaluation is a process of continuous gathering and analysis of information in
order to determine whether the project is progressing towards its pre-specified goals and
objectives, and to highlight any unintended (positive or negative) effects of the
project and its activities. Monitoring, MIS and Evaluation are closely related concepts that are
distinct but complementary. Monitoring and MIS is the continuous collection of data on specified
indicators to facilitate decision making on whether an intervention (project, program or policy) is
being implemented in line with its design, i.e. its activity schedules and budget, while Evaluation is
the periodic and systematic collection of data to assess the design, implementation and impact in
terms of effectiveness, efficiency, distribution and sustainability of outcomes and impacts. The
concept is shown in a schematic diagram in figure 8.6 below.
FIGURE 8.6: MONITORING, MIS & EVALUATION
Monitoring and Evaluation systems provide the project owners and the other stakeholders with
regular information on progress relative to targets, which supports:
Accountability: demonstrating to the funding agency, beneficiaries and implementing partners that
expenditure, actions and results are as agreed or can reasonably be expected in the situation.
Operational management/Implementation: provision of the information needed to co-ordinate
the human, financial and physical resources committed to the project and to improve performance.
Strategic management: provision of information to inform the setting and adjustment of objectives
and strategies.
Capacity building: building the capacity, self-reliance and confidence of beneficiaries and
implementing staff and partners to effectively initiate and implement development initiatives.
Benefits at the project level:
• Provide regular feedback on project performance and show any need for ‘mid-course’
corrections
• Identify problems early and propose solutions
• Monitor access to project services and outcomes by the target population
• Evaluate achievement of project objectives
• Incorporate stakeholder views and promote participation, ownership and accountability
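The ‘mid-course’ correction idea above can be illustrated with a minimal sketch that compares cumulative actuals against cumulative targets for a single indicator. The targets are the Open Sky Research grants from table 8.4; the actual figures, the 80% threshold and the names `progress_report`, `OPEN_SKY_TARGETS` and `HYPOTHETICAL_ACTUALS` are assumptions made purely for illustration and are not project data.

```python
from itertools import accumulate


def progress_report(targets, actuals, threshold=0.8):
    """Compare cumulative actuals with cumulative targets, year by year.

    Returns one (year, cumulative_target, cumulative_actual, needs_correction)
    tuple for each year in which an actual value has been reported.
    The threshold is an assumed tolerance, not a figure from the report.
    """
    cumulative_targets = accumulate(targets)
    cumulative_actuals = accumulate(actuals)
    report = []
    for year, (target, actual) in enumerate(
        zip(cumulative_targets, cumulative_actuals), start=1
    ):
        needs_correction = actual < threshold * target
        report.append((year, target, actual, needs_correction))
    return report


# Targets: Open Sky Research grants per year, from Table 8.4.
OPEN_SKY_TARGETS = [5, 12, 12, 12, 12]
# Hypothetical grants actually awarded in the first three years.
HYPOTHETICAL_ACTUALS = [5, 9, 8]

report = progress_report(OPEN_SKY_TARGETS, HYPOTHETICAL_ACTUALS)
```

With these hypothetical actuals the indicator stays within tolerance in the first two years and is flagged in the third, when 22 grants have been awarded against a cumulative target of 29.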
The key indicators: Indicators may be qualitative or quantitative variables that measure project
performance and achievements. Indicators are developed for all levels of project logic, i.e.
indicators are needed to monitor progress with respect to inputs, activities, outputs, outcomes and
impact, and to give feedback on areas of success and where improvement is required. For the project the
monitoring, MIS and evaluation indicators are explained in table 8.8 below.
TABLE 8.8: MONITORING, MIS AND EVALUATION INDICATORS
Indicator | Purpose & Description

Input indicators
Input indicators are quantified and time-bound statements of the resources financed
by the project, and are usually monitored by routine accounting and management
records.
They are mainly used by managers closest to implementation, and are consulted
frequently (daily or weekly). They are often left out of discussions of project
monitoring, though they are part of essential management information. An
accounting system is needed to track expenditures and provide data on costs for
analysis of the cost effectiveness and efficiency of project processes and the
production of outputs.

Process indicators
Process indicators monitor the activities completed during implementation, and are
often specified as milestones or completion of sub-contracted tasks, as set out in
time-scaled work schedules.
One of the best process indicators is often to closely monitor the project's
procurement processes. Every output depends on the procurement of goods, works
or services and the process has well defined steps that can be used to monitor
progress by each package of activities.

Output indicators
Output indicators monitor the production of goods and delivery of services by the
project. They are often evaluated and reported with the use of performance
measures based on cost or operational ratios.
The indicators for inputs, activities and outputs, and the systems used for data
collection, recording and reporting are sometimes collectively referred to as the
project physical and financial monitoring system, or management information system
(MIS). The core of an M&E system and an essential part of good management
practice, it can also be referred to as ‘implementation monitoring’.

Outcome indicators
Outcome indicators are specific to a project’s purpose and the logical chain of cause
and effect that underlies its design.
Often achievement of outcomes will depend at least in part on the actions of
beneficiaries in responding to project outputs, and indicators will depend on data
collected from beneficiaries.

Impact indicators
Impact indicators usually refer to medium or long-term developmental change to
which the project is expected to contribute.
Dealing with the effects of project outcomes on beneficiaries, measures of change
often involve statistics concerning economic or social welfare, collected either from
existing regional or sectoral statistics or through relatively demanding surveys of
beneficiaries.
Selection of Indicators for Monitoring, MIS and Evaluation: Considering the expectations from the current
project, an indicative selection of the indicators is given in table 8.9 below.
TABLE 8.9: INDICATIVE SELECTION OF THE INDICATORS FROM MONITORING, MIS AND EVALUATION
Indicator Type | Important Indicators for the Project components
Input indicators, Process indicators and Output indicators
• Number of the themes selected/finalized for CoE
• Progress of preparation of the Guidelines for implementation of CoE
• Number of Courses design completed for UG Students,
• Number of Courses design completed for short term courses,
• Training of Trainers
• Progress of Establishment of CoEs
• Number of implementing agencies for CoE, Short term courses
• Funds released and due for various activities
• Progress of preparation/review of Guidelines for various activities
• Progress in preparation of guidelines for entrepreneurships approvals
• Progress in Identification of the application areas, Scrutinizing the proposal
and Tying up with international organizations, countries and industry for
various activities
• Progress in Identification of the application areas for the Open sky
research on Big Data
• Number of Centre of Excellence established
• Number of students/candidates selected for UG/PG fellowships, short
term courses etc.
• Performance Results of the UG/PG Students/ candidates attended trainers
courses
• Number of entrepreneurship proposals sanctioned
• Number of proposals Scrutinized and tie ups finalized with industry for
cluster networks
• Number of proposals Scrutinized and tie ups finalized with industry for
Open sky research on Big Data
• New tools created and new Solutions created towards the Open sky
research on Big Data
Outcome indicators
• Number of Best Practices developed by CoEs
• Number of candidates who attended short term courses
• Number of New/additional work areas provided to candidates/students by
the employers
• Additional capacity for training created by training of trainers courses
• Employment generation, Increased business in BDA and New BDA
application areas developed due to sanctioned entrepreneurship projects
• Number of Patents and Recognition achieved by way of creation of New
tools and New Solutions
• New BDA application areas and New Revenue channels created through
New tools and New Solutions
• New BDA application/research areas created through international
collaboration activities
• Number of Patents and Recognition achieved by way of Open sky research
on Big Data
• New BDA application areas and New Revenue channels created through
Open sky research on Big Data
Impact indicators
• Number of Research papers published
• Number of new ventures created
• Number of Centres of Excellence established
• Number of new tools created
• Number of new Solutions created
• Number of Best Practices developed
• Number of fellowships awarded
• Number of Trainers Trained
• Number of implementing agencies selected
• Feedback on the acceptance of employed resources from the employers
• Salaries offered by employers
• Requests received from industry as a result of closer interaction between
industry & DST
• Number of New/additional work areas undertaken by employers
• Promotion / increased salaries offered by employers
• Number of training programs organized
• Number of trainees trained
• Number of types of trainings organized
• Number of the new BDA application areas identified
• Number of Business Models to deploy BDA identified
• Number of tie-ups with business
• Number of proposal received for Venture capital/seed money etc.
• Number of new BDA application areas generated
• Number of new Revenue channels generated
• Increased business in BDA
• Number of Patents registered
• Awards and Recognitions won
• Number of international collaborative research projects
started/completed
• Number of cluster based network projects started/completed
• Number of Open sky research projects undertaken on Big Data
• Number of Infrastructure projects started/implemented
• Number of online programs launched
• Number of participants benefited through online programs
• Number of national workshops/conferences organized
• Number of collaborative international conferences organized
8.12 ROLES OF VARIOUS STAKEHOLDERS
The Big Data Analytics ecosystem consists of a large number of stakeholder types. With its continuous
spread there is hardly any entity which is left untouched. The following are some of the important
stakeholders:
o Researchers
o Data Generators
o End Users
o Service Providers
o Platform Providers
o Data Curators
o Software Professionals
o Skill/Job Seekers
o Academicians
o Trainees/Students
o Trainers
o Customers/Common Citizen
o Funding Agencies/Government Departments
o Other Government Departments
o Industry Associations
o National and International Regulatory Bodies
o Professionals, Practitioners and those interested in adjacent technologies
For the purposes of identifying the roles of the stakeholders the above listed stakeholders could be
grouped as given below:
i. Funding Agencies
ii. Industry/Corporate/ Industry Associations/ Regulatory Bodies
iii. Trainers/Academics/Researchers
iv. Employment Seekers
In order to achieve the objectives of the DST initiative as mentioned earlier in the chapter above the
following roles may be assigned to the stakeholder groups as listed above.
Funding Agencies: The main role of these stakeholders would be financing the skill up gradation, R&D and
training in the area of Big Data Analytics. The funding would not be provided by DST alone; the Industry
would also partner DST in funding the overall development activities, as mentioned earlier, of the dASTRA
ecosystem in the country.
Industry: Industry will be both a benefactor and a beneficiary of BDA development in the country. The
industry has to provide all assistance, help and support to DST’s dASTRA initiative, particularly by
participating in research, up skilling, employment, funding, developing standards, and establishing and
improving the Skill calibration and certification activities.
Trainers: The role of this community would be to set the curriculum, contents, course materials etc. for
the training, teaching, skilling and up skilling of the large number of job seekers as well as those who are
already in jobs. They will have to participate, in collaboration with the industry, in developing standards and in
establishing and improving the Skill calibration and certification activities. This will be in addition to the
research initiatives in dASTRA, so as to maintain an edge in the national and international competitive
situation.
Employment Seekers: This will be, perhaps, the largest group within the dASTRA ecosystem. The most
important role they need to play is in terms of a determined effort to pick up the competencies in BDA
area and then to use them for the overall benefit of the society.
8.13 OTHER IMPORTANT ISSUES FOR CONSIDERATION
There are many other important issues that need to be considered by the project, with the necessary
initiatives taken during the project life. As many of these issues will need coordination between DST
and other governmental and non-governmental agencies, it is difficult to list all the issues at present,
and it is also not possible to prepare an action plan for them. Moreover, many more issues
will come up from time to time. An indicative list of such issues is given below.
i. Analytics Maturity Model: Efforts should be made to take up the development of the
Analytics Maturity Model. In this exercise, apart from investigating the international
practices and standards, the practitioners, service providers and the users should all be
involved in the development work.
ii. Organizing Contests for Analytics Models: When an individual or a group of
professionals works in the context of a competition, they try a few things, get
to the top of a leader board and are pretty happy with themselves. Using
competition to perfect various types of Big Data models will be a good idea. To
encourage developers and service providers, periodic contests may be organized.
DST can host contests for data scientists. Companies that want problems solved post
them, along with relevant data sets, on the site. Anyone can submit a solution, and each
competitor is ranked on a leader board throughout the competition. Substantial prize money
can be either provided by DST or sponsored by the company that is seeking the
best solution. The contests may be theme based and vertical based. Participants may be
supplied with a real data set of ‘transactions’ related to a real life situation. These may
include ‘actions’ by the subjects involved in the transactions, real life ‘results’ of the
transactions and ‘values’ of other parameters etc. Using the provided data, the
participants may be asked to build a model to analyze the data and address one or
more of the given questions/problems faced by the sponsor.
iii. Big Data Governance: As the domain of Big Data Analytics is relatively new for the
country and the practitioners, there is a need to provide support systems, policies and
procedures for the governance of Big Data in the country.
iv. Big Data Best Practices: Providing encouragement for the development, documentation
and dissemination of the Best Practices in the domain of Big Data. These best practices
may be nationally or internationally developed.
v. Creation of a National Advisory Body: Creation of an Internal Organization Structure, with
the involvement of the Stakeholders. This could be an extension of the current PDAC in
terms of its scope and advisory role.
vi. Catchy name for Big Data Analytics in India: In order to create awareness and to
popularize Big Data Analytics, a competition may be organized to select a catchy
name/acronym for Big Data in India.
vii. Big Data Competency Frame Work: The primary responsibility of the Data Scientist - Big
Data is the design, development and support of applications, very large databases and
infrastructure for storing structured and unstructured company data and for use in
analyzing business activities, detecting patterns and reporting trends. Over time a
large number of professionals will be trained and will be available for work in the area of
Big Data Analytics. A large number of organizations will also be initiated into this area. In
order to provide a firm footing to both professionals and users, there will be a need
to create, standardize and popularize a Big Data Competency Frame Work for the entire
Big Data ecosystem. Some of the roles that may be taken up for developing the Big Data
Competency Framework are as follows:
• Application Architect
• Application Developer
• Senior Application Developer
• Big Data Manager
• Chief Data Officer
• Data Architect
• Data Engineer
• Data Scientist
• Data Visualization Specialist
• System Administrator
viii. Big Data Regulatory Frame Work: Over a short period of time the usage of Big Data
will spread across the country. That will eventually give rise to a number of issues related to
the usage of Big Data and related legal aspects, especially concerning personal
information, privacy of personal data etc.
Therefore, there is a need to think about the ethical and regulatory framework around
big data, as it will increasingly impact the lives of individuals and underpin customer
service, innovation, quality and business operations. In a world of big data, decisions
about individuals will increasingly be made on the basis of patterns and profiling.
Big data therefore has deep social implications about when we want to prejudge
people based on data about past behaviour, personal characteristics and similarities to
others. All this is strongly linked to debates about privacy. Data profiling, especially
where large amounts of personal data are aggregated together, provides very deep
insights into individuals. The benefit of these activities is cheap (or free) personalized
services, and to date many consumers have been content with this trade-off. However,
greater concern may be shown as analysis goes deeper into our activities and personal
lives. Organizations both in R&D and in business will need to have appropriate
governance to manage the risks and ensure data is used in acceptable ways.
Policymakers also need to consider the regulatory framework carefully, and encourage
the range of skills needed to exploit big data.
To provide lasting solutions to such problems, DST may take initiatives as appropriate
for developing Big Data Regulatory Frame Work.
ix. Organization & Regulation of CoEs: It is envisaged that one of the important activities
of the project would be the creation of CoEs. To achieve good results it will be desirable to
define the role of the CoEs, their organization and the mechanisms for sharing the outcomes. It is
advisable that the outcomes be measured based on the adoption of the solutions. There
should be a periodic review of the performance of the CoEs.
x. Encouragement to MSME: Considering the interest being taken by individuals in
developing Big Data based solutions, it is envisaged that a good number of them
will be technically qualified to take up the Big Data Challenges offered by governments
and government agencies. However, they or their small start-ups and outfits will lack
financial worthiness in terms of annual turnover etc. It is suggested, therefore, that
some mechanism be developed and implemented to give due recognition to this
handicap on the part of the MSMEs. To achieve this, some suggestions are given below:
• To fix a turnover cap, the contracting organizations may be asked or
mandated to cap the turnover they require of a supplier company at about twice
the contract value.
• Public bodies may also be asked to limit the number of lots a supplier can win.
This ability to break contracts into lots will encourage more SME participation.
• Suppliers who have performed poorly on a previous contract can be excluded
from future competitions by the contracting authority.
• Public bodies should also take into account the education, experience and
achievements of ‘individuals’ at the award stage of the competition.
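The screening suggestions above can be illustrated with a minimal sketch. It assumes the turnover cap is read as "the turnover demanded of a supplier may not exceed twice the contract value". The names `Supplier`, `required_turnover`, `is_eligible` and the constant `TURNOVER_CAP_MULTIPLIER` are hypothetical and are not part of any existing procurement system. The award-stage assessment of individuals' education and experience is qualitative and is not modelled here.

```python
from dataclasses import dataclass

# Assumption: the required turnover is capped at twice the contract value.
TURNOVER_CAP_MULTIPLIER = 2.0


@dataclass
class Supplier:
    name: str
    annual_turnover: float          # same currency unit as the contract value
    lots_already_won: int = 0
    poor_past_performance: bool = False


def required_turnover(contract_value: float, requested_minimum: float) -> float:
    """Clamp the buyer's requested minimum turnover to the cap."""
    return min(requested_minimum, TURNOVER_CAP_MULTIPLIER * contract_value)


def is_eligible(supplier: Supplier, contract_value: float,
                requested_minimum: float, max_lots: int) -> bool:
    """Apply the exclusion, lot-limit and turnover-cap checks in turn."""
    if supplier.poor_past_performance:
        return False                # excluded for poor past performance
    if supplier.lots_already_won >= max_lots:
        return False                # lot limit already reached
    needed = required_turnover(contract_value, requested_minimum)
    return supplier.annual_turnover >= needed
```

Under these assumptions, a start-up with a turnover of 250 bidding for a contract worth 100 would qualify even if the buyer had originally asked for a turnover of 500, because the requirement is clamped to 200.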
Some other issues that need the attention of DST in encouraging the participation of
MSMEs are: lack of knowledge and awareness among MSMEs; capacity issues; and complex
procurement processes.
xi. Types, grades, competency levels and probable requirement of Big Data Professionals:
It will be desirable that DST, in cooperation with organizations like NASSCOM, NSDC etc.,
takes the initiative in identifying the various levels/types/grades/competencies and
numbers of Big Data ‘professionals’ needed by industry, business and academia.
xii. Setting up uniform curricula for Big Data professionals: Data science’s learning curve is
formidable. To a large extent one will need a degree, or something substantially like it,
to prove commitment to this career. There are run-of-the-mill certificates, as well as other
qualifications and degrees in data-science-related fields. What is most important to the modern
business is not that every data scientist has a big honking doctorate. What matters most
is that a substantial body of personnel has a common grounding in a core curriculum of
skills, tools and approaches. Big data initiatives will thrive if all data scientists have been
trained and certified on a basic minimum curriculum with foundations such as (a)
Paradigms and practices, (b) Algorithms and modelling, (c) Tools and platforms, and (d)
Applications and outcomes.
Classroom instruction is important, but a curriculum that is 100 percent devoted to
reading books, taking tests and sitting through lectures is insufficient. Hands-on
laboratory work is paramount for a truly well-rounded data scientist. It is, therefore,
necessary to make sure that data scientists acquire certifications and degrees that
reflect their having actually developed statistical models that use real data and address
substantive business issues.
A business-oriented data-science curriculum should produce expert developers of
statistical and predictive models. It should not degenerate into a program that produces
analytics geeks with heads stuffed with theory but whose diplomas are only fit for
hanging on the wall.
To achieve this, DST may, in collaboration with academic partners, businesses, R&D
organizations and industry, develop a curriculum (that is, an approach, learning model and
course content) that reflects the mix of technical and problem-solving skills
necessary to prepare students/professionals for Big Data and analytics careers across all
industries.
xiii. Big Data Portal: Creating a platform (PORTAL) where all the stakeholders can interact,
and give and seek what they have or want from the Big Data Ecosystem.
xiv. Publication of Research: The use of big data in development is largely being driven by
opportunistic partnerships between private companies, researchers and academics.
Data exhaust is often owned by the private sector, especially mobile phone operators.
Online activity, sensing data, and crowdsourced information are often publicly
accessible, but the size and complexity of these data sets require specialized analytical
skills. Because of this, and because big data analytics are still in a nascent phase
methodologically, professional researchers and academics currently have a high degree
of influence in how big data is actually utilized.
Some of these professional researchers and academics work in-house for the interested
firms, but most are in public and private university systems. DST may play a key role in
publicizing the potential role of big data in development. It can fund big data research
through a variety of financing streams, and take initiative in creating forums where big
data researchers can exchange ideas and data sets. The current landscape of big data is,
overall, less the result of agenda setting by a small group of politically and economically
powerful institutions than it is the unplanned aggregate of diverse projects focusing on
those aspects of big data analytics that are methodologically and legally tractable.
In the short term, big data projects will need to rely on complementary “ground-truthing”
data from traditional sources in order to assess the nature and magnitude of
bias in big data sets. Such validation procedures are necessary for end-users of the data,
including policymakers, to interpret the contextual meaning of big data across cultures
and economies. In addition, big data sets are not by virtue of their size exempt from the
conventional requirements of good theoretical and statistical practice, including careful
problem identification, model construction, and hypothesis testing.
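As an illustration of such a validation step, the sketch below compares estimates derived from a big data source with ground-truth figures from a traditional source, such as a survey. It is a minimal sketch only: the function name `bias_summary` and the sample figures are made up for illustration and do not come from this study.

```python
from statistics import mean


def bias_summary(big_data_estimates, ground_truth):
    """Compare paired estimates, for example region by region.

    Returns the mean signed error (the direction of the bias) and the
    mean absolute error (its typical magnitude).
    """
    if len(big_data_estimates) != len(ground_truth) or not ground_truth:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [b - g for b, g in zip(big_data_estimates, ground_truth)]
    return {
        "mean_error": mean(errors),
        "mean_absolute_error": mean(abs(e) for e in errors),
    }


# Made-up illustrative figures: big data estimates versus survey values
summary = bias_summary([12.0, 9.0, 15.0, 8.0], [10.0, 10.0, 12.0, 8.0])
```

A mean error well away from zero would suggest that the big data source systematically over- or under-estimates the quantity of interest, which is the kind of bias that ground-truthing is meant to expose.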
Therefore, such an initiative on the part of DST will bring researchers, users,
policymakers and the public at large closer together, and they will all be
instrumental in making the BIG USE of Big Data.
8.14 ACTION PLAN
Implementation of the project will involve, apart from the many administrative actions, the
following major activities:
• Establishment of PMU
• Preparing Guidelines
• Calling for proposals
• Selection of agencies
• Assigning/sanctioning projects
• Review of Schemes
• Yearly review of progress
• Mid Term Review
• Preparation and Publication of progress reports
As is evident, some of the activities are one-time, while the remaining activities need to be carried
out periodically or as the need arises. The Project Head will undertake the activities in time so
that the aims of the project are achieved. Considering the major activities, an action plan for the
implementation of the project is given below in Table 8.10.
TABLE 8.10: TENTATIVE IMPLEMENTATION ACTION PLAN
Timeline columns: YEAR 1 (months 1 to 12); YEAR 2 to YEAR 5 (quarters Q1 to Q4 each)
MAJOR ACTIVITY
• Establishment of PMU
• Preparing Guidelines
• Calling for proposals
• Selection of agencies
• Assigning/sanctioning projects
• Review of Schemes
• Yearly review of progress
• Mid Term Review
• Preparation and Publication of progress reports
9. CONCLUSIONS
Scientific progress is a result of relentless academic research endeavour. The scientific community
has been focused for a while now on the growing challenges of Data Science in a number of
disciplines. This immense repository of past/current academic knowledge is increasing at an
exponential rate, and handily qualifies as Big Data in terms of volume, variety and velocity of
growth. The estimation of the veracity of this data also presents challenges. As the amount of
knowledge in an academic field grows, a quick assessment of the state-of-the-art in any sub-field
becomes that much harder. One way of accelerating the process of discovery is to
significantly enhance current search capabilities to support deep scientific queries. This includes:
• Improving the efficiency and depth of search by enabling segmentation and recognition of all
the components of a traditional academic research paper, including graphs, tables, and diagrams.
• Developing tools to integrate various sources of information on any topic, not just from the
textual content but often from parallel channels such as video, speech, and the web, in order to
gain a comprehensive understanding of the topic; and, most importantly,
• Making unapparent connections between methods, features, data, constraints, and parameters
across the spectrum of reported scientific data using advanced data mining approaches.
Keeping in view the expected fast growth of Business Analytics across various applications, it is
imperative to chalk out a strategic Road Map in this direction to reap the benefits for the
overall development of the country.
The present study, through a combination of primary and secondary research, has established the
need for an urgent initiative on the part of DST to (i) strengthen the dASTRA Ecosystem of the country, and (ii)
take steps to nurture it so as to leverage the unique advantageous position of the country’s
manpower not only in scientific research and development but also in business and industry.
The project is to be implemented in five years and the cost has been estimated to be around Rs. 580
Crores. The major activities of the project will include (i) R&D PROMOTION through Open Sky
Research, Cluster Based Network Programs, International Collaborative Research Program, (ii)
ESTABLISHMENT OF CENTRE OF EXCELLENCE ON DATA SCIENCE, (iii) SKILL DEVELOPMENT CAPACITY & TRAINING through Fellowship Based UG/PG and Ph D, Short Term Training for Faculty,
On-Line Programs, National Workshops & Conferences, Collaborative Interactive Conferences,
Entrepreneur Development, (iv) INTERNATIONAL LINKAGES & COLLABORATIONS through UN (R&D
and Standards), Regional Associations/Collaborations, Bilateral & Multi Lateral Exchange Programs,
and (v) INFRASTRUCTURE DEVELOPMENT.
LIST OF ABBREVIATIONS
AaaS : Analytics as a Service
BA : Business Analytics
BD : Big Data
BDA : Big Data Analytics/Big Data and Analytics
BDaaS : Big Data Analytics as a Service
BDI : Big Data Initiative
CapEx : Capital Expenses
CDC : Consultancy Development Centre
CEO : Chief Executive Officer
CODATA : Committee on Data for Science and Technology
CoE/COE : Centre of Excellence
CSI : Computer Society of India
dASTRA : Data Science, Technology, Research & Applications
DST : Department of Science & Technology
EIU : Economist Intelligence Unit
EMR : Electronic Medical Records
ESDM : Electronic System Design and Manufacturing industry
EU : European Union
HRD : Human Resource Development
ICSU : International Council for Science
ICT : Information & Communication Technology
IESA : India Electronics and Semiconductor Association
IoT : Internet of Things
IPR : Intellectual Property Rights
ISACA : Information Systems Audit and Control Association
IT : Information Technology
KPI : Key Performance Indicators
KPO : Knowledge Process Outsourcing
M&E : Monitoring & Evaluation
m2m : Machine to Machine
MIS : Management Information System
NASSCOM : National Association of Software and Services Companies
NDSAP : National Data Sharing and Accessibility Policy
NSDC : National Skill Development Corporation
OGD : Open Government Data
OpEx : Operating Expenses
OSTI : Office of Scientific and Technical Information (USA)
PCAST : President’s Council of Advisors on Science and Technology, USA
PG : Post Graduate
PMU : Project Monitoring Unit
R&D : Research & Development
ROI : Return on Investment
RPO : Recovery Point Objective
RTO : Recovery Time Objective
S&T : Science & Technology
SaaS : Software as a Service
SEZ : Special Economic Zone
SMB : Small and Medium Businesses
SME : Small and Medium Enterprises
SW : Software
TCO : Total Cost of Ownership
TDB : Technology Development Board of DST
UG : Under Graduate
UIDAI : Unique Identification Authority of India
UN : United Nations
UNDP : United Nations Development Programme
WDS : World Data System
WEF : World Economic Forum
LIST OF TABLES
Page No.
Table 3.1: Models for Research    048
Table 3.2: CoE Value Proposition    051
Table 3.3: Examples of Companies & Institutions Providing Solutions to Generate, Analyze & Visualize Omics & Clinical Data    057
Table 3.4: Data Quality Sub Dimensions    062
Table 5.1: 2012 Worldwide Big Data Revenue by Top 10 Vendors    082
Table 5.2: Big Data Market Forecast Broken Down By Market Component through 2017    082
Table 5.3: Comparison Between AaaS & Internal BD Project    088
Table 8.1: Strategic Thought Process    104
Table 8.2: Consultative Meetings & Interactive Workshops Organized    106
Table 8.3: Responses Received From The Stakeholders    106
Table 8.4: Tentative Targets    132
Table 8.5: Suggested Project Personnel    134
Table 8.6: Computation Of Costs – All Costs In Rs. Lakhs    135
Table 8.7: Envisaged Outputs/Benefits And The Possible Outcomes    137
Table 8.8: Monitoring, MIS and Evaluation Indicators    142
Table 8.9: Indicative Selection of the Indicators From Monitoring, MIS and Evaluation    143
Table 8.10: Tentative Implementation Action Plan    152
LIST OF FIGURES
Page No.
Figure 1.1: Data Science & Business    002
Figure 1.2: Data Science Ecosystem    003
Figure 1.3: Seven Dimensions of Big Data    007
Figure 1.4: Parameters Used For Benchmarking Countries on Open Data Initiatives    014
Figure 1.5: Benchmarking Of Open Data Initiatives, Select Countries, 2012    014
Figure 1.6: Number of Graduates With Deep Analytical Training    017
Figure 2.1: Innovative Cycle    028
Figure 2.2: 6 Illustrative Examples of Big Data for Development    032
Figure 2.3: Major Challenges Confronting Big Data for Development    033
Figure 2.4: Australian Organizations Lag In The Use Of Many Data Sources    036
Figure 2.5: Australian Organizations However Lead In the Use of Some Data Sources    037
Figure 2.6: Organizations Using Big Data to Improve The Customer Experience    037
Figure 2.7: Categories of Business Processes That Can Benefit From Big Data Projects    038
Figure 2.8: View of the Future of Big Data    039
Figure 2.9: Attitude towards Big Data    039
Figure 2.10: Personal Knowledge of Big Data    040
Figure 2.11: Priority Application of Big Data    040
Figure 2.12: Internal Obstacles In Use Of Big Data    041
Figure 2.13: CEO’s View of Big Data    041
Figure 2.14: Strategies for Obtaining Optimum Value from Big Data Tools    042
Figure 2.15: How the Organization Addresses Human Aspect of Big Data    042
Figure 3.1: BDA CoE Function Chart    063
Figure 3.2: Governance Objective: Value Creation    052
Figure 5.1: Analytics Applications, And Classification of Analytics Industry    084
Figure 5.2: Conceptual Diagram of AaaS    086
Figure 8.1: Approach & Methodology    103
Figure 8.2: DST’s Vision of dASTRA in India    130
Figure 8.3: Conceptual Model of Six Months Student’s Projects Linked with Big Data    131
Figure 8.4: Suggested Capacity Building Model    131
Figure 8.5: Suggested Organization Structure    134
Figure 8.6: Monitoring, MIS & Evaluation    141
ANNEXURE
Page No.
Annexure 1: Set of 4 Questionnaires    162
Annexure 2: List of participants of the Consultative Meetings and Interactive Workshops    205
Annexure 3: Consolidated Responses from Data Generators    221
Annexure 4: Consolidated Responses from Data Researchers    225
Annexure 5: Consolidated Responses from End Users    231
Annexure 6: Consolidated Responses from Service Providers    238
ANNEXURE 1
SET OF 4 QUESTIONNAIRES
QUESTIONNAIRE FOR
DATA GENERATORS (DG)
PART A
GENERAL ORGANIZATIONAL PROFILE
1. Name & Address of the Organization/Department:
Telephones:
Fax:
E-mail:
Website:
2. Name, Designation and Address of the CEO/HOD:
Telephones:
Mobile:
E-mail:
3. Name, Designation and Address of the Respondent:
Telephones:
Mobile:
E-mail:
4. Date & Place:
PART B
CURRENT STATUS, STRATEGY & PROFILE
n. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple
answers are possible)
SEGMENT/CATEGORY            YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
o. Mention the Data Segments in which you are active:
p. Mention the Data Segments that you Outsource:
q. Is your organization's data available at data.gov.in?
• Yes
• No
• Do not know
r. What are your expectations from Big Data Analytics in the next 10 years?
s. In our organization, Big Data management is not viewed strategically at senior levels.
• Agree
• Disagree
• Don’t know/Not applicable
t. There is not enough of a “big data culture” in the organisation, where the use of big data in
decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable
PART C
MANPOWER, SKILL GAPS AND TRAINING NEEDS
e. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
f. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
g. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Any Other (Please Specify)
h. For the purposes of the Capacity Building initiatives, please suggest the programs and other
details as given below:
Name of the Program    Who should be the participants    Modality of Delivery    Coverage    Duration
PART D
PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
Purposely left blank, as there are no questions under this head for Data Generators.
PART E
AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
g. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your
Organization’s Business? (Multiple answers possible )
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
h. How well are these areas developed in your organization? (Answer in terms of Very well,
Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
i. What do you predict will happen to the number of Big Data specialists in your
organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
j. Please make suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
1. What support do you need from the Government?
2. Looking specifically at your organization/department, how would you characterise the
amount of data available to support decision-making?
• Too much
• Enough
• Not enough
• Don’t know
3. Mention the Challenges faced by you in GENERATING Data
4. Mention the Challenges faced by you in CLEANING the Generated Data
5. Do you have the DATA CURATION function in-house? If not, please mention the reasons.
PART G
SECURITY CONCERNS
d. Your organization has taken initiative in which of the following areas related to Big Data
Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
e. Please provide your views on the IPR Issues as related to Big Data Analytics.
f. Please provide your views and suggestions on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE
QUESTIONNAIRE FOR
END USERS (EU)
PART A
GENERAL ORGANIZATIONAL PROFILE
5. Name & Address of the Organization/Department:
Telephones:
Fax:
E-mail:
Website:
6. Name, Designation and Address of the CEO/HOD:
Telephones:
Mobile:
E-mail:
7. Name, Designation and Address of the Respondent:
Telephones:
Mobile:
E-mail:
8. Date & Place:
PART B
CURRENT STATUS, STRATEGY & PROFILE
u. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple
answers are possible)
SEGMENT/CATEGORY            YES (Y)/NO (N)
RESEARCHERS (RE)
DATA GENERATORS (DG)
END USERS (EU)
SERVICE PROVIDERS (SP)
PLATFORM PROVIDER (PP)
DATA CURATOR (DC)
v. Mention the Data Segments in which you are active:
w. Mention the Data Segments that you Outsource:
x. What are the organization's budget provisions for Big Data usage (in lakhs of rupees)?
• 2014 – 15
• 2015 – 16
• 2016 – 17
y. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
z. How would you describe your organization's competitive position?
• Underperforming industry / market peers
• On par with industry / market peers
• Outperforming industry / market peers
• Don't know
aa. What is the current state of big data activities within your organization?
• We have not yet started to consider big data's use within our organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
bb. How useful have Big Data Analytics applications been in your organization in the past?
cc. What are your expectations from Big Data Analytics in the next 10 years?
dd. In our organization, Big Data management is not viewed strategically at senior levels.
• Agree
• Disagree
• Don’t know/Not applicable
ee. There is not enough of a “big data culture” in the organisation, where the use of big data in
decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable
PART C
MANPOWER, SKILL GAPS AND TRAINING NEEDS
i. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
j. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
k. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Any Other (Please Specify)
l. For the purposes of the Capacity Building initiatives, please suggest the programs and other
details as given below:
Name of the Program    Who should be the participants    Modality of Delivery    Coverage    Duration
PART D
PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
l. To what extent do you have timely access to the information needed to do your job
successfully?
• To some extent
• To a great extent
• Completely
• Don't know
m. To what extent does information and business analytics create a competitive advantage for
your organization within its industry or markets?
• Modest advantage
• On par with competitors
• Significant advantage
• Don't know
n. Which challenges inhibit your organization from acquiring and integrating data?
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficulty in sharing data internally and/or in integrating internal data across silos
• Security, privacy and/or malware concerns
o. What challenges inhibit your organization from analyzing data?
• Too much data to analyze; overwhelmed by data
• Lack of software/tools and/or software too difficult to use
• Lack of skills
• Inconsistent data across variety of source systems
• Customer privacy concerns
p. Which challenges inhibit your organization from acting on data insights and analytics?
• Lack of understanding of how to use analytics to improve the business
• Lack of skills to interpret and leverage the data
• Lack of software/tools that allow end-users to perform analytics themselves
• Too time consuming or costly to perform all the analytics desired
q. Your organization has taken initiative in which of the following areas related to Big Data
Science & Technology? (Multiple answers possible)
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
• New Computational Models
• Data & Information Quality and New Data Standards
r. Your organization has taken initiative in which of the following areas related to Big Data
Infrastructure? (Multiple answers possible)
• System Architectures, Design and Deployment
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• Big Data Open Platforms
s. Your organization has taken initiative in which of the following areas related to Big Data
Search, Mining and Management ? (Multiple answers possible)
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Computational Modeling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
• Mobility and Big Data
• Multimedia and Multi-structured Data-Big Variety Data
t. Your organization has taken initiative in which of the following areas related to Big Data
Applications? (Multiple answers possible)
• Complex Big Data Applications in Science, Engineering, Medicine, Healthcare, Finance, Business, Law and Education
• Indian Traditional Knowledge
• Transportation
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data Analytics in Central and State Governments, Public Sector and Society in General
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry Deployments & Standards and Experiences of Big Data Government Deployments/Projects.
u. What are your organisation’s three biggest impediments to using big data for effective
decision-making?
• Too many “silos”—data is not pooled for the benefit of the entire organisation.
• Shortage of skilled people to analyse the data properly.
• Big data is not viewed sufficiently strategically by senior management
• Something not on this list (please specify).
v. To what extent do you agree with the following statement: “The issue for us is now not the
growing volumes of data, but rather being able to analyse and act on data in real-time.”
• Agree
• Disagree
• Don’t know/Not applicable
w. In which areas does your organization have Big Data applications? (Multiple answers possible)
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Supply chain management and logistics
• Information Technology
• Finance and administration
• HR and people development
• Risk management
• We do not have applications
• I don't know in which area(s) we have applications.
• Other (please specify)
PART E
AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
k. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your
Organization’s Business? (Multiple answers possible )
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
l. How well are these areas developed in your organization? (Answer in terms of Very well,
Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
m. What do you predict will happen to the number of Big Data specialists in your
organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
n. Please make suggestions on the following important aspects :
• Data Storage
• Data Curation
• Data Retrieval
o. As the country would like to keep ahead of the rest of the world in the area of Big Data
applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive.
p. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately
• In the next 5-10 years
PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
6. At what pace is the available data within your function updated or refreshed?
• As it is streamed in real-time
• Less than a day
• Less than a week
• Monthly or later
7. What support do you need from the Government?
8. Looking specifically at your organization/department, how would you characterise the
amount of data available to support decision-making?
• Too much
• Enough
• Not enough
• Don’t know
9. How Well Does Your Organization Perform the Following Information and Analytic Tasks on A
Scale Of 1 To 5, Where 1=Poorly and 5=Very Well
• Acquire and integrate data
• Analyze data
• Act on data-driven insights
10. Which data does your organization analyze in the context of Big Data applications? (Multiple
answers possible):
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
• Audio (voice, music)
• Images (automated image recognition)
• Video (automated video recognition)
• Don't know
• Other (please specify)
11. Does your organization apply advanced analytics methods (statistics, econometrics,
operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
12. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
13. How does your organization develop Big Data applications?
• We mainly develop our applications internally.
• We mainly develop our applications externally (outsourcing)
• We do both: we develop internally as well as externally
• Don’t know.
14. What are in your opinion the most important factors for successful Big Data implementations?
Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
PART G
SECURITY CONCERNS
g. Your organization has taken initiative in which of the following areas related to Big Data
Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
h. Please provide your views on the IPR Issues as related to Big Data Analytics.
i. Please provide your views and suggestions on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE
QUESTIONNAIRE FOR
RESEARCHERS (RE)
PART A
GENERAL ORGANIZATIONAL PROFILE
9. Name & Address of the Organization/Department:
Telephones:
Fax:
E-mail:
website:
10. Name & designation and Address of the CEO/HOD:
Telephones:
Mobile:
E-mail:
11. Name, Designation and Address of the Respondent:
Telephones:
Mobile:
E-mail:
12. Date & Place:
PART B
CURRENT STATUS, STRATEGY & PROFILE
ff. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple
answers are possible)
SEGMENT/CATEGORY | YES (Y)/NO (N)
RESEARCHERS (RE) |
DATA GENERATORS (DG) |
END USERS (EU) |
SERVICE PROVIDERS (SP) |
PLATFORM PROVIDER (PP) |
DATA CURATOR (DC) |
gg. Mention the Data Segments in which you are active:
hh. Mention the Data Segments that you Outsource:
ii. What are the Organization's Budget Provisions for Big Data Usage, in Lakhs of Rupees?
• 2014 – 15
• 2015 – 16
• 2016 – 17
jj. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
kk. What is the current state of big data activities within your organization?
• We have not yet started to consider big data's use within our organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
ll. What are your expectations from Big Data Analytics in the next 10 years?
mm. In our Organization the Big data management is not viewed strategically at senior
levels of the organisation.
• Agree
• Disagree
• Don’t know/Not applicable
nn. There is not enough of a “big data culture” in the organisation, where the use of big data in
decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable
PART C
MANPOWER, SKILL GAPS AND TRAINING NEEDS
m. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
n. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
o. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Any Other (Please Specify)
p. For the purposes of the Capacity Building initiatives, please suggest the programs and other
details as given below:
Name of the Program | Who should be the participants | Modality of Delivery | Coverage | Duration
PART D
PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
x. Your organization has taken initiative in which of the following areas related to Big Data
Science & Technology? (Multiple answers possible)
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
• New Computational Models
• Data & Information Quality and New Data Standards.
y. Your organization has taken initiative in which of the following areas related to Big Data
Infrastructure ? (Multiple answers possible)
• System Architectures, Design and Deployment
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• Big Data Open Platforms
z. Your organization has taken initiative in which of the following areas related to Big Data
Search, Mining and Management ? (Multiple answers possible)
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Computational Modeling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
• Mobility and Big Data
• Multimedia and Multi-structured Data-Big Variety Data
aa. Your organization has taken initiative in which of the following areas related to Big Data
Applications? (Multiple answers possible)
• Complex Big Data Applications in Science, Engineering,
• Medicine, Healthcare, Finance, Business, Law and Education
• Indian Traditional Knowledge
• Transportation
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data Analytics in Central and State Governments, Public Sector and Society in General
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry Deployments & Standards and Experiences of Big Data Government Deployments/Projects.
bb. To what extent do you agree with the following statement: “The issue for us is now not the
growing volumes of data, but rather being able to analyse and act on data in real-time.”
• Agree
• Disagree
• Don’t know/Not applicable
PART E
AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
q. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your
Organization’s Business? (Multiple answers possible )
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
r. How well are these areas developed in your organization? (Answer in terms of Very well,
Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
s. What do you predict will happen to the number of Big Data specialists in your
organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
t. As the country would like to keep ahead of the rest of the world in the area of Big Data
applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive.
u. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately
• In the next 5-10 years
PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
15. What support do you need from the Government?
16. What are your suggestions for leveraging Big Data Analytics applications in Government
(In terms of the following)
• The Opportunities
• Possible Application Areas
• Priority Application Areas for the next 10 years
• Market size
• Skills gaps and actions needed to fill in the gaps
• Policy Frameworks
17. Does your organization apply advanced analytics methods (statistics, econometrics,
operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
18. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
19. What are in your opinion the most important factors for successful Big Data implementations?
Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
20. Please suggest the Tools and Platforms to be used for Big Data Analytics in the Open Source
domain.
PART G
SECURITY CONCERNS
j. Your organization has taken initiative in which of the following areas related to Big Data
Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
k. Please provide your views on the IPR Issues as related to Big Data Analytics.
l. Please provide your views and suggestions on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE
QUESTIONNAIRE FOR
SERVICE PROVIDERS (SP)
PART A
GENERAL ORGANIZATIONAL PROFILE
13. Name & Address of the Organization/Department:
Telephones:
Fax:
E-mail:
website:
14. Name & designation and Address of the CEO/HOD:
Telephones:
Mobile:
E-mail:
15. Name, Designation and Address of the Respondent:
Telephones:
Mobile:
E-mail:
16. Date & Place:
PART B
CURRENT STATUS, STRATEGY & PROFILE
oo. Please identify the Stakeholder Segment/Category your organization belongs to: (Multiple
answers are possible)
SEGMENT/CATEGORY | YES (Y)/NO (N)
RESEARCHERS (RE) |
DATA GENERATORS (DG) |
END USERS (EU) |
SERVICE PROVIDERS (SP) |
PLATFORM PROVIDER (PP) |
DATA CURATOR (DC) |
pp. Mention the Data Segments in which you are active:
qq. Mention the Data Segments that you Outsource:
rr. In Which Areas Will Your Organization Be Investing More Resources?
• Capacity Building
• Software Tools
• Data Sources
• Other
ss. How would you describe your organization's competitive position?
• Underperforming industry / market peers
• On par with industry / market peers
• Outperforming industry / market peers
• Don't know
tt. What are your expectations from Big Data Analytics in the next 10 years?
uu. In our Organization the Big data management is not viewed strategically at senior levels of the
organisation.
• Agree
• Disagree
• Don’t know/Not applicable
vv. There is not enough of a “big data culture” in the organisation, where the use of big data in
decision-making is valued and rewarded.
• Agree
• Disagree
• Don’t know/Not applicable
PART C
MANPOWER, SKILL GAPS AND TRAINING NEEDS
q. Identify the skills gaps within your functional area in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
r. How many Big Data experts does your organization employ and in which area?
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• Other (please specify)
s. What are the training needs of your organization?
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Any Other (Please Specify)
t. For the purposes of the Capacity Building initiatives, please suggest the programs and other
details as given below:
Name of the Program | Who should be the participants | Modality of Delivery | Coverage | Duration
PART D
PERCEIVED SUCCESS FACTORS, IMPEDIMENTS & CHALLENGES FOR BIG DATA APPLICATION
cc. Your organization has taken initiative in which of the following areas related to Big Data
Science & Technology? (Multiple answers possible)
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
• New Computational Models
• Data & Information Quality and New Data Standards.
dd. Your organization has taken initiative in which of the following areas related to Big Data
Infrastructure ? (Multiple answers possible)
• System Architectures, Design and Deployment
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• Big Data Open Platforms
ee. Your organization has taken initiative in which of the following areas related to Big Data
Search, Mining and Management ? (Multiple answers possible)
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Computational Modeling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
• Mobility and Big Data
• Multimedia and Multi-structured Data-Big Variety Data
ff. Your organization has taken initiative in which of the following areas related to Big Data
Applications? (Multiple answers possible)
• Complex Big Data Applications in Science, Engineering,
• Medicine, Healthcare, Finance, Business, Law and Education
• Indian Traditional Knowledge
• Transportation
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data Analytics in Central and State Governments, Public Sector and Society in General
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry Deployments & Standards and Experiences of Big Data Government Deployments/Projects.
gg. In which areas does your organization have Big Data applications? (Multiple answers possible)
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Supply chain management and logistics
• Information Technology
• Finance and administration
• HR and people development
• Risk management
• We do not have applications
• I don't know in which area(s) we have applications.
• Other (please specify)
PART E
AREAS OF APPLICATION, MODELS & INFRASTRUCTURE
v. Has Your Organization Taken any of the Steps Mentioned Below to Integrate Data into Your
Organization’s Business? (Multiple answers possible )
• Upgrade IT Systems
• Improve data collection processes
• Training current employees or recruiting new employees in BA
• Redesigned/reengineered your important Business Processes
w. How well are these areas developed in your organization? (Answer in terms of Very well,
Reasonably well, Not so well, Don't know)
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
x. What do you predict will happen to the number of Big Data specialists in your
organization next year (2015)?
• It will decrease
• It will remain stable
• It will increase
• Don't know
y. Please make suggestions on the following important aspects :
• Data Storage
• Data Curation
• Data Retrieval
z. As the country would like to keep ahead of the rest of the world in the area of Big Data
applications, please suggest FINAL PRODUCTS for which the Big Data Community may strive.
PART F
TYPE, AMOUNT OF DATA & ANALYTICAL TECHNIQUES USED
21. What support do you need from the Government?
22. What are your suggestions for leveraging Big Data Analytics applications in Government
(In terms of the following)
• The Opportunities
• Possible Application Areas
• Priority Application Areas for the next 10 years
• Market size
• Skills gaps and actions needed to fill in the gaps
• Policy Frameworks
23. Which data does your organization analyze in the context of Big Data applications? (Multiple
answers possible):
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
• Audio (voice, music)
• Images (automated image recognition)
• Video (automated video recognition)
• Don't know
• Other (please specify)
24. Does your organization apply advanced analytics methods (statistics, econometrics,
operations research, artificial intelligence, applied mathematics) in Big Data applications?
• Yes
• No
• Don't know
25. Which advanced analytics methods does your organization use in Big Data applications?
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
• None
• Any Other (Please Specify)
26. What are in your opinion the most important factors for successful Big Data implementations?
Please rate from "1" (=most important) to "5" (=least important).
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
27. Please suggest the Tools and Platforms to be used for Big Data Analytics in the Open Source
domain.
PART G
SECURITY CONCERNS
m. Your organization has taken initiative in which of the following areas related to Big Data
Security & Privacy? (Multiple answers possible)
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
n. Please provide your views on the IPR Issues as related to Big Data Analytics.
o. Please provide your views and suggestions on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics
PART H
ANY OTHER INFORMATION YOU MAY LIKE TO SHARE
ANNEXURE 2
LIST OF PARTICIPANTS OF THE CONSULTATIVE MEETINGS
AND INTERACTIVE WORKSHOPS
FIRST CONSULTATIVE MEETING HELD ON 28th NOVEMBER 2014 AT NEW DELHI
LIST OF THE PARTICIPANTS
S. No.
Name
Designation & Organization
1.
Prof. Sankar K Pal
Distinguished Scientist and Former Director, Indian Statistical
Institute, Kolkata
2.
Prof. Santanu Chaudhury,
Dhananjay Chair Professor, Department of Electrical Engineering,
Indian Institute of Technology, New Delhi
3.
Prof. Bapi Raju Surampudi
Professor, Deptt. of Computer/Info. Sciences; Coordinator, Centre
for Neural and Cognitive Sciences (CNCS), University of Hyderabad,
Gachibowli
4.
Prof. Ramesh Hariharan
Adjunct Professor, Strand Life Sciences, Bangalore
5.
Dr. Raghavendra Singh
Research Staff Member, IBM Research, India Research Laboratory,
New Delhi
6.
Shri Avnish Sabharwal
Managing Director & Strategy Head, Accenture India (Pvt.) Ltd.,
Bangalore
7.
Mr. G M Bagai
Scientist `G', Department of Scientific & Industrial Research,
Ministry of Science and Technology, New Delhi
8.
Mr. Sanjay S Gahlout
Deputy Director General, National Informatics Centre, New Delhi
9.
Shri K R Murali Mohan
Head, Big Data Initiative Division, Department of Science &
Technology, Ministry of Science and Technology, New Delhi
10.
Prof. Amit Kumar Bardhan
Associate Professor, Faculty of Management Studies, University of
Delhi, Delhi
11.
Dr. Prageet Aeron
Assistant Professor (Information Management), International
Management Institute, New Delhi
12.
Prof. Krishan Lal
President, The Korean Academy of Science and Technology (KAST),
National Physical Laboratory, New Delhi
13.
Shri Nikunj Garg
Manager Enterprise Risk Services, Deloitte Touche Tohmatsu India
Private Limited, Gurgaon
14.
Shri Prashant Gupta
Director - Enterprise Risk Services, Deloitte Touche Tohmatsu India
Private Limited, Gurgaon
15.
Dr. A K Singh Suryavanshi
Professor & Dean, Department of Business Management &
Entrepreneurship, National Institute of Food Technology
Entrepreneurship & Management, Sonipat
16.
Dr. Sudeep Marwah
Indian Council of Agricultural Research, New Delhi
17.
Shri Mratunjay Tewari
Deputy General Manager (IT), Indian Railway Catering & Tourism
Corporation Ltd., New Delhi
18.
Shri Ramakant Tiwari
Deputy General Manager (IT), Indian Railway Catering & Tourism
Corporation Ltd., New Delhi
19.
Dr. Nahid Alam
The Associated Chambers of Commerce & Industry of India, New
Delhi
20.
Shri Uday Laroia
Deputy Director, Confederation of Indian Industry, New Delhi
21.
Shri Manjeet Bose
Director, NASSCOM, New Delhi
22.
Shri Bhushan Mohan
Department of Electronics and Information Technology, New Delhi
23.
Shri Rahul Mittal
NIIT Technologies, NOIDA
24.
Shri Shobit Bahadur
Head-Research, Ma Foi Analytics and Research, Chennai
25.
Shri K Pandia Rajan
Chairman & Managing Director, Ma Foi Strategic Consultants Private
Limited, Chennai
26.
Dr. Udita Taneja
Associate Professor (Healthcare Management), University School of
Management Studies, New Delhi
27.
Prof. Usha Munshi
Indian Institute of Public Administration, New Delhi
28.
Shri Anupam Bhatnagar
Managing Advisor, National Co-operative Consumer's Federation of
India Ltd. NOIDA
29.
Shri Pradeep Dadlani
Director, Sycom Projects Consultants Pvt. Ltd., New Delhi
30.
Shri Rohit Anand
Managing Director, Value Edge Research Services, New Delhi
31.
Dr. Praveen Arora
Adviser & Head, CHORD (NSTMIS) Division, Department of Science
& Technology, Ministry of Science and Technology, New Delhi
32.
Mr. Deepak Agrawal
CDC
33.
Mr. S. K. Lalwani
CDC
34.
Mr. B. G. Gupta
CDC
LIST OF THE PARTICIPANTS
FIRST INTERACTIVE WORKSHOP, BENGALURU, 7th JANUARY 2015
S. No.
Name
Organization
1.
Mr. K. J. Rajeshwar
WIPRO Ltd.
2.
Mr. Yogesh Simmhan
IISC
3.
Mr. C. Bhttacharya
IISC
4.
Mr. N. Viswanath
IISC
5.
Mr. N. S. N. Sastry
ISI
6.
Mr. Hatim Matiwala
SAP
7.
Mr. Animesh Bisaria
Integra Micro Software Services Pvt. Ltd.
8.
Ms. Nandini S. S.
Integra Micro Software Services Pvt. Ltd.
9.
Mr. G. Sabuwel
IBAB
10.
Mr. N. Yatindra
IBAB
11.
Mr. Mohan Muniswamaiah
Gooly Consultancy Services
12.
Mr. U. Dinesh Kumar
IIM
13.
Mr. Arun Singh R.
IISC
14.
Mr. R. P. Thangavelu
CSIR-4PI
15.
Mr. Vidyadhar Mudkani
CSIR-NAL
16.
Mr. Harsha Vardhan
TALLY-ASSOCHAM
17.
Mr. Vamsi Veeramachaneni
Strand Life Sciences
18.
Mr. Adiya R. Sinha
SECON
19.
Mr. Nikunj Garg
Deloitte
20.
Dr. Murli Manohar
DST
21.
Mr. S. K. Lalwani
CDC
22.
Mr. B. G. Gupta
CDC
23.
Mr. Jaiprakash S
Big Data Analytics India Pvt Ltd
24.
Mr. Asad Awasi
ASSOCHAM
25.
Mr. Devika P. Madalli
ISI
26.
Mr. Mohan Mokashi
The LOBICONS
27.
Mr. T. V. Suresh Kumar
MS Ramaiah Institute of Technology
28.
Mr. Srinivasa K. G.
MS Ramaiah Institute of Technology
29.
Mr. ARD Prasad
ISI
30.
Prof. K Sankara Rao
Centre for Ecological Sciences, Indian Institute of Science
31.
Prof. K. R. Kullayiswamy
Indian Institute of Science
LIST OF THE PARTICIPANTS
SECOND INTERACTIVE WORKSHOP, PUNE, 19th JANUARY 2015
S. No.
Name
Organization
1.
Ms. Radhika G Brahme
National AIDS Research Institute (Indian Council of Medical
Research)
2.
Mr. Maheshwari Desai
Harbinger Systems Private Ltd.
3.
Mr. Pravda Godbole
Inteliment Technologies
4.
Dr. R. R. Hirwani
CSIR-Unit for Research and Development of Information Products
5.
Dr. Vijay Khare
International Center & Professor of Defence and Strategic Studies,
University of Pune
6.
Mr. Kundan Kumar
MITCON Consultancy & Engineering Services Ltd.
7.
Dr. Karthikeyan
CSIR-National Chemical Laboratory,
8.
Mr. Siddharth Thomas
Cignex Datamatics Technologies Ltd.
9.
Wg Cdr Srinivas
AIPER
10.
Mr. Gautam
Harbinger Systems Pvt. Ltd.
11.
Mr. Vivek
EY
12.
Dr. D. M. Thakore
BVU College of Engineering
13.
Prof. S. Z. Gawali
BVU College of Engineering
14.
Prof. Rejani Meshram
Pune University
15.
Dr. Lalita Khare
Modern Institute of Business Management
16.
Dr. Nanaji Shewale
GIPE
17.
Mr. S. K. Lalwani
CDC
18.
Mr. Deepak Agarwal
CDC
19.
Mr. B. G. Gupta
CDC
20.
Mr. Prashany Pansare
Inteliment Technologies
LIST OF THE PARTICIPANTS
THIRD INTERACTIVE WORKSHOP, HYDERABAD, 29th JANUARY 2015
S. No.
Name
Organization
1.
Prof. Nirmala Apsingikar
Administrative Staff College of India
2.
Mr. Syed Azgar
Engineering Staff College of India
3.
Ms. Kiran Hegda
Logic Matter
4.
Major Gen. (Retd) R. Shiv Kumar
GITAM University
5.
Mr. Maruthi Kumar
System Soft Technologies (India) Pvt. Ltd.
6.
Prof. Kamlakar
IIIT
7.
Prof. Arun K. Pujari
University of Hyderabad
8.
Mr. R. Raghvan
Insurance Information Bureau of India
9.
Prof. K. S. Rajan
IIIT
10.
Prof. S. Bapi Raju
IIIT
11.
Mr. E. Pattabhi Rama Rao
Indian National Centre for Ocean Information Services (ESSO-INCOIS),
Ministry of Earth Sciences
12.
Ms. Pallavi Rao
Logic Matter
13.
Prof. C. R. Rao
School of Computer and Information Sciences
14.
Dr. BLS Prakasa Rao
CR Rao Advanced Institute for Mathematics Statistics and Computer
Science
15.
Dr. S. Ravichandran
National Academy of Agricultural Research Management
16.
Mr. Dipanjan Roy
IIIT
17.
Dr. M L Saundh
GVK Emergency Management and Research Institute
18.
Mr. Surya Putchala
Zettamine Tech.
19.
Mr. Tejpal Pola
Zettamine Tech.
20.
Prof. R. Ravi
IDRBT
21.
Prof. Sobhan
IIT
22.
Dr. D. Subramanyam
DRR
23.
Dr. Sundarsan Jena
GITAM University
24.
Dr. S. Phani Kumar
GITAM University
25.
Mr. Raghu Patri
DATAWISE
26.
Ms. Nupur Pavan Bang
IIB
27.
Ms. Aruna M.
BITS Pilani
28.
Mr. J. A. Chaudhary
TalentSprint
29.
Mr. P. Krishna Reddy
IIIT
30.
Ms. Kavita Vemeri
IIIT
31.
Ms. Jaswinder Kaur
DATAWISE
32.
Mr. V. Srinivas Rao
BT & BT
33.
Mr. M Krishna
State Government
34.
Dr. Shanthi
Engineering Staff College of India
35.
Mr. B. G. Gupta
CDC
36.
Mr. S. K. Lalwani
CDC
37.
Ch. Sobhan Basu
Indian Institute of Technology
38.
Prof. PJ Narayanan
IIIT Hyderabad
39.
Dr. Priyanka Srivastava
IIIT Hyderabad
40.
Prof. Vasudeva Varma
IIIT Hyderabad
LIST OF THE PARTICIPANTS
FOURTH INTERACTIVE WORKSHOP, KOLKATA, 17th FEBRUARY 2015
S. No.
Name
Organization
1.
Prof. K. Sankar Pal
ISI
2.
Prof. Ashish Ghosh
ISI
3.
Mr. S. K. Lalwani
CDC
4.
Mr. B. G. Gupta
CDC
5.
Prof. Sanghamitra Bandyopadhyay
ISI
6.
Prof. Rajat K. De
ISI
7.
Prof. Meeta Nasipuri
Jadavpur University
8.
Dr. Sumita Ghosh
Jadavpur University
9.
Dr. Subhdip Basu
Jadavpur University
10.
Mr. Bhaktipada Kundu
Cognizant Technology Solutions
11.
Prof. Subhamoy Chakraborti
Magma Fincorp Ltd.
12.
Prof. Phalguni Gupta
NITTTR
13.
Prof. Pabitra Mitra
IIT Kharagpur
14.
Mr. P K Chatterjee
Conmat Technologies Pvt. Ltd.
15.
Mr. Gautam Das
Protech Infosystems Pvt. Ltd.
16.
Mr. Arnab Ganguli
Webcon Consulting (India) Ltd.
17.
Prof. Kalyan Kumar Bhar
Indian Institute of Engineering Science & Technology
18.
Mr. Debasish Hajra
PricewaterhouseCoopers Private Limited
19.
Mr. SK Ray
AKB Power Consultants Pvt. Ltd.
20.
Dr. A N Roy
National Institute of Research on Jute and Allied Fibre Technology
21.
Dr. Sucheta Tripathy
CSIR
22.
Dr. Apurba Kr. Ghosh
University of Burdwan
23.
Mr. Amarlal Chaudhary
University of Burdwan
24.
Mr. Sudipta Sen Sharma
ISI
25.
Mr. K. K. Chakraborti
Adansa Solutions Pvt. Ltd.
26.
Ms. Suman Kundu
ISI
27.
Mr. K Ramachandra Murthy
ISI
28.
Mr. Arnab Biswas
ISI
29.
Mr. Satish Chandra
ISI
30.
Mr. Animesh Basak
ISI
31.
Mr. Ansuman Das
ISI
32.
Mr. Chrag Gupta
ISI
33.
Mr. Arnab Kundu
ISI
34.
Mr. Kamlesh Nayak
ISI
35.
Mr. Prateek Pandey
ISI
36.
Mr. Dyaneshwar Patil
ISI
37.
Mr. Pratish Ranjan
ISI
38.
Mr. Kushai Sen
ISI
39.
Ms. Procheta Sen
ISI
40.
Mr. Ankit Sharma
ISI
41.
Mr. Abhishek Singh
ISI
42.
Mr. Gautam Banerjee
Business Brio
43.
Shri Gautam Das
Protech Infosystems Pvt. Ltd.
44.
Ms. Deepshikha Banerjee
Conmat Technologies Pvt. Ltd.
45.
Ms. Kanishka Dhamija
Indian Statistical Institute
46.
Shri Arindam Pal
Indian Statistical Institute
47.
Shri Ajoy K Ray
IIEST
48.
Shri Kunal Shrivastava
Indian Statistical Institute
49.
Ms. Niharika Das
Indian Statistical Institute
50.
Ms. Shrabana Dutta
Indian Statistical Institute
51.
Ms. Romi Banerjee
Indian Statistical Institute
52.
Shri Bhaskar Dey
Indian Statistical Institute
LIST OF THE PARTICIPANTS
SECOND CONSULTATIVE MEETING, NEW DELHI, 25th MARCH 2015
S. No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
NAME OF THE PARTICIPANT
Prof. Sankar K Pal
Prof. Santanu Chaudhary
Prof. S Bapi Raju
Prof. Ramesh Hariharan
Dr. Raghavendra Singh
Prof. Prageet Aeron
Shri Prashant Arya
Shri Zubin Baben
Shri Gautam Banerjee
Prof. Amit Bardhan
Shri SK Dey Biswas
Dr. Lovneesh Chanana
Dr. K P Chaudhary
Ms. Soumya Das
Ms. Vineeta Dixit
Shri Parameswara Rao Ganta
Prof. Ashish Ghosh
Shri Radhesh Gupta
Lt. Col M Haridas
Dr. B Kanagadurai
Shri Vipul Kaushik
Shri Sanjay Krishen
Shri Shirish Mahendru
Ms. Kamini Malhotra
Shri Rahul Mittal
Shri Bhushan Mohan
Mohankrishnan P
Prof. Usha Munshi
Dr. K R Murali Mohan
Shri Deepak Pandhi
Dr. Maya Ramanath
Dr. V Ravi
Dr. S Ravichandran
Dr. Ravi Sekhar
ORGANIZATION
Indian Statistical Institute, Kolkata
Indian Institute of Technology, New Delhi
Cognitive Science Lab, Hyderabad
Strand Life Sciences, Bangalore
IBM Research, India Research Laboratory, New Delhi
International Management Institute, New Delhi
Google India Pvt. Ltd., Gurgaon
SAP India Pvt. Ltd. Bangalore
Business Brio, Kolkata
Faculty of Management Studies, New Delhi
Indian Council of Medical Research, New Delhi
SAP India Pvt. Ltd., New Delhi
National Physical Laboratory, New Delhi
Rudrabhishek Infosystem Pvt. Ltd., Noida
Google India Pvt. Ltd., Gurgaon
RTL Technologies, Hyderabad
ISI, Kolkata
Centre for Land Warfare Studies, Delhi Cantt.
CSIR - Central Road Research Institute, New Delhi
Ernst & Young LLP, Gurgaon
Intel Technology India Pvt. Ltd., Bangalore
Fairwood Design Pvt. Ltd., Noida
Defence Research and Development Organisation, New Delhi
NIIT Technologies Ltd., Noida
NASSCOM, Bangalore
Indian Institute of Public Administration, New Delhi
DST, Ministry of Science and Technology, New Delhi
Confederation of Indian Industry, Gurgaon
IIT, New Delhi
IDRBT, Hyderabad
NAARM, Hyderabad
CSIR - Central Road Research Institute
S. No.
NAME OF THE PARTICIPANT
ORGANIZATION
35
Shri Bimal Sikdar
Project Art, New Delhi
36
Shri Salam Shyamsunder Singh
Deptt. of Economic Affairs, Ministry of Finance, New Delhi
37
Shri Vishin Sukh
Fairwood Design Pvt. Ltd., Noida
38
Dr. KVS Viswanathan
NASSCOM, Bangalore
39
Shri Deepak Agrawal
CDC
40
Shri S. K. Lalwani
CDC
41
Shri B. G. Gupta
CDC
42
Shri Soumaya
CDC
ANNEXURE 3
CONSOLIDATED RESPONSES FROM DATA GENERATORS (DG)
Current Status, Strategy & Profile
ww. Stakeholder Segment/Category: Multiple - Researchers, Data Generators, End Users
xx. Active Data Segments: Customers, Transactions (both internal and external)
yy. Data segments that are outsourced: Sometimes, e.g. debit card data
zz. Data not available at Data.Gov.In
aaa. Expectations from Big Data Analytics in the next 10 years:
• Superior analytics for business growth and customer service satisfaction
• Big Data techniques will allow the organization to analyze data for patterns more quickly and at
a much lower cost. This will lead to important business insights that can drive the business.
bbb. Big data management is viewed strategically at senior levels of the organisation.
ccc. Generally there is not enough of a “big data culture” in the organisation, where the use of big
data in decision-making is valued and rewarded.
Manpower, Skill Gaps and Training Needs
u. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Tooling / software skills
v. Big Data experts employed in areas
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Statistics and econometrics
w. The training needs are:
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: text, image and video recognition courses
• Statistics and econometrics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• High Frequency Data
x. Required Capacity Building initiatives as per details given below:
Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Strategy courses on Big Data | Top Management | Overview | Half Day | Class room program
Big Data in marketing, finance | Middle Management and Lower Management | Extensive | 3 Days | Class room program
Statistics and econometrics courses | Middle Management and Lower Management | Extensive | 3 Days | Class room program
Programming courses | IT officers | Extensive | 7 Days | Class room program
Strategy courses on Big Data | Top Management | Overview | Half Day | Class room program
Strategy courses on Big Data for top management | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Computer science: text, image and video recognition courses | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Application related courses (Big Data in marketing, finance, logistics, etc) | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Perceived Success Factors, Impediments & Challenges for Big Data Application
There are no questions under this head for Data Generators
Areas of Application, Models & Infrastructure
aa. Steps taken to Integrate Data into Organization’s Business:
• Upgrade IT Systems
• Improve data collection processes
• Redesign/re-engineer important business processes
bb. Areas developed in organization :
Very Well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
Reasonably well to very well
• Supporting systems and procedures
• Talent
• Training
cc. The number of Big Data specialists in organization next year (2015) will increase.
dd. No significant suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
Type, Amount of Data & Analytical Techniques Used
28. Support needed from the Government:
• Building a central repository of financial markets statistical data.
• The government’s roadmap on big data
• High level 5-year country strategy
• Clarity on statutory / regulatory / compliance requirements
• Partnering with peer organizations and relevant government agencies
• Clarity on security aspects
29. The amount of data available to support decision-making is enough
30. Challenges faced in GENERATING Data
• Ensuring uniformity in data structure.
• Coping with rapid changes in business requirements.
31. Challenges faced in CLEANING the Generated Data
• Identifying mandatory data fields to ensure correct analytics.
• Data correlation
• Data quality
32. DATA CURATION function in-house
• Data curation is currently not envisaged as part of data analytics in the organisation.
Security Concerns
p. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
q. Views on the IPR Issues as related to Big Data Analytics.
• Need to think through certain fundamental legal aspects of IPR, e.g. “who owns the input
data companies are using in their analysis, and who owns the output?”
• Implement the right policies for big data governance.
• In the crowd sourcing world of Big Data Analytics it is very difficult to clearly demarcate the
IPR related boundaries.
• Over emphasis on IPR may also hamper the open innovation approach in the internet based
application development model.
r. Views and suggestions on the adequacy or otherwise of the National Data Sharing and
Accessibility Policy (NDSAP) as related to Big Data Analytics
• Will comply with national regulatory requirements
• Big Data Analytics will be beneficial by extracting unstructured information and combining it
with the power of social media.
ANNEXURE 4
CONSOLIDATED RESPONSES FROM DATA RESEARCHERS (RE)
Current Status, Strategy & Profile
ddd. Stakeholder Segment/Category: Mostly Researcher, sometimes multiple such as Service
Provider.
eee. Active Data Segments:
• Research in the areas of Big Data Management, Analytics and Machine Learning.
• Consultancy in Big Data Management, Analytics and Machine Learning.
• Capacity building initiatives in Big Data Management, Analytics and Machine Learning.
• Data Analysis
• Genomics & Life Sciences.
fff. Data segments that are outsourced:
• Nil
ggg. Budget Provisions for Big Data Usage (in Lakhs of Rupees)
• 2014 – 15: 100 to 200 Lakhs
• 2015 – 16: 100 to 200 Lakhs
• 2016 – 17: 300 to 500 Lakhs
hhh. Areas where more investment in resources is needed
• Capacity Building
• Software Tools : Most preferred area
• Data Sources
• Other Data Generation
iii. Current state of big data activities within organization
• Not yet started to consider big data's use within our organization
• Offering training programmes and consultancy
• One or more pilots or proofs of concept
• Implementing big data technologies
jjj. Expectations from Big Data Analytics in the next 10 years
• It is going to rule many organizations
• Plan to set up an internationally known centre of excellence in Big Data Management,
Analytics, Mining, Machine Learning for Research and Development, Consultancy Services
and Capacity Building
• Capacity Building
• Training/Research
kkk. Big data management is viewed strategically at senior levels of the organisation.
lll. Generally there is enough of a “big data culture” in the organisation, where the use of big data
in decision-making is valued and rewarded.
Manpower, Skill Gaps and Training Needs
y. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data storage skills
z. Big Data experts employed in areas (About 2 – 5 Experts)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
aa. The training needs are:
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
bb. Required Capacity Building initiatives as per details given below:
Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Strategy courses on Big Data for top management | C level professionals, researchers and policy makers | Data Strategy, tangible benefits and action plans | 2 Months | Hybrid – Class room + e-learning
Big data Certifications | Researchers and Practitioners | Open Source platforms – Hadoop, ETL, etc | 2 Months | Hybrid – Class room + e-learning
Basic Statistics | Research Scholars/Academic Professionals/Corporate Personnel | ALL | 40 Hours | Class Room session
Numerical Methods | Research Scholars/Academic Professionals/Corporate Personnel | ALL | 40 Hours | Class Room session
Multivariate Analysis | Research Scholars/Academic Professionals/Corporate Personnel | ALL | 40 Hours | Class Room session
Operation Research | Research Scholars/Academic Professionals/Corporate Personnel | ALL | 40 Hours | Class Room session
Data Mining and Data Warehousing | Research Scholars/Academic Professionals/Corporate Personnel | ALL | 40 Hours | Class Room session
Perceived Success Factors, Impediments & Challenges for Big Data Application
hh. Organizational initiative in the following areas related to Big Data Science & Technology
• Analysis of Unstructured/Semi-structured data
• Security & Privacy issues
• New Computational Models
ii. Organizational initiative in the following areas related to Big Data Infrastructure
• System Architectures, Design and Deployment
• Programming Models
• Software Techniques & Architectures in Cloud/Grid/Stream Computing
• Big Data Open Platforms
jj. Organizational initiative in the following areas related to Big Data Search, Mining and
Management
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Computational Modelling & Data Integration
• Cloud/Grid/Stream Data Mining-Big Velocity Data
kk. Organizational initiative in the following areas related to Big Data applications
• Complex Big Data Applications in Science, Engineering,
• Medicine, Healthcare, Finance, Business, Law and Education
• Retailing, social media and Telecommunication
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
ll. It is generally agreed that the issue for us is now not the growing volumes of data, but rather
being able to analyse and act on data in real-time.
Areas of Application, Models & Infrastructure
ee. Steps taken to Integrate Data into Organization’s Business:
• Upgrade IT Systems
• Training current employees or recruiting new employees in BA
ff. Areas developed in organization : Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
gg. The number of Big Data specialists in organization next year (2015) will increase.
hh. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Big Data as a Service providing easy experimentation and quick prototyping
• Big Data Analytics platforms for Internet of Things and wearable devices
• Solutions/Protocols for seamless data integration, privacy and security.
ii. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately: Better algorithms/platforms for Big Data management, ETL and analytics –
improving the open source solutions
• In the next 5-10 years: Scalable Machine Learning for Big Data, IOT and BIG data integration
and products.
Type, Amount of Data & Analytical Techniques Used
33. Support needed from the Government:
• Our Analytic capability may be used for the needy
• Support for enhancing the capacity of our Big Data Engineering Lab
• Support for offering internationally known Big Data certifications in India
34. Suggestions for leveraging Big Data Analytics applications in Government
• Possible Application Areas
35. Organization applies advanced analytics methods (statistics, econometrics, operations research,
artificial intelligence, applied mathematics) in Big Data applications
36. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Operations research (OR) / applied mathematics
• Artificial intelligence (AI) and machine learning
37. The most important factors for successful Big Data implementations? Please rate from "1"
(=most important) to "5" (=least important).
• A clear company strategy-1
• Support by higher management-1
• Talent-3
• Training-4
• Supporting systems and procedures-4
• Financial budget-1/4
• An organizational structure that supports multi-disciplinary projects-4
• A sound procedure for legal, ethical and reputational issues-3
38. Tools and Platforms to be used for Big Data Analytics in the Open Source domain.
• Apache Hadoop Ecosystem – Hortonworks, Cloudera
• Apache Solr/Lucene
• NoSQL databases – MongoDB, Cassandra
• R, Python – SciPy
• Graph databases – Neo4j, etc
Security Concerns
s. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
t. Views on the IPR Issues as related to Big Data Analytics.
• Cost of Patent filing is too high
• Most of the research outcomes are not commercialized by governmental organizations
u. Views and suggestions on the adequacy or otherwise of the National Data Sharing and
Accessibility Policy (NDSAP) as related to Big Data Analytics: NIL
Any Other Information You May Like To Share
d. We have more data but we don’t have proper documentation
e. Even when we have data, we do not have staff who can act as analysts.
f. We find it difficult to identify the resource persons who have knowledge and skill in areas like
Econometrics, operational research, multivariate tools, computer science etc.
ANNEXURE 5
CONSOLIDATED RESPONSE FROM END USERS (EU)
Current Status, Strategy & Profile
mmm. Stakeholder Segment/Category: Mostly multiple, such as Data Generator and Researcher.
nnn. Active Data Segments:
• Transaction Information
• Policy Making
ooo. Data segments that are outsourced: NIL
ppp. Budget Provisions for Big Data Usage (in Lakhs of Rupees)
• 2014 – 15: 100
• 2015 – 16: 200
• 2016 – 17: 300
qqq. Areas where more investment in resources is needed
• Capacity Building
• Software Tools
• Data Sources
rrr. Organization's competitive position
• Underperforming industry / market peers
sss. Current state of big data activities within organization
• We are in the process of developing a strategy / roadmap
• We have started one or more pilots or proofs of concept
• We are implementing big data technologies
ttt. Usefulness of Big Data Analytics Applications: Not sure
uuu. Expectations from Big Data Analytics in the next 10 years
• Superior analytics for business growth and customer service satisfaction
• Data-driven methods
• Conflict resolution
• Early warning systems
vvv. Big data management is viewed strategically at senior levels of the organisation.
www. Generally there is enough of a “big data culture” in the organisation, where the use of
big data in decision-making is valued and rewarded.
Manpower, Skill Gaps and Training Needs
cc. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Data storage skills
• Tooling / software skills
dd. Big Data experts employed in areas (About 2 – 5 Experts)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
ee. The training needs are:
• Strategy courses on Big Data for top management
• Computer science: text, image and video recognition courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
ff. Required Capacity Building initiatives as per details given below:
Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Strategy courses on Big Data for top management | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Computer science: text, image and video recognition courses | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Application related courses (Big Data in marketing, finance, logistics, etc) | Business Units, T&O | Detailed | 3-5 days | Classroom Training
Strategy courses on Big Data for top management | Business Units, T&O | Detailed | 3-5 days | Classroom Training
M. Tech | B. Tech | - | 4 Semesters | Class Room
M. Sc | B. Sc | - | 4 Semesters | Class Room
Perceived Success Factors, Impediments & Challenges for Big Data Application
mm. The extent of timely Access to Information needed
• To some extent
nn. The extent of competitive advantage created by information
• Modest advantage
oo. Challenges inhibiting from acquiring and integrating data
• Inconsistencies in data from various source systems
• Legacy infrastructure that inhibits data collection
• Difficulty sharing data internally and/or integrating internal data across silos
pp. Challenges inhibiting from analyzing data
• Lack of software/tools and/or software too difficult to use
• Inconsistent data across variety of source systems
qq. Challenges inhibiting from acting on data insights and analytics
• Lack of software/tools that allow end-users to perform analytics themselves
rr. Organizational initiative in the following areas related to Big Data Science & Technology
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
ss. Organizational initiative in the following areas related to Big Data Infrastructure
• System Architectures, Design and Deployment
• Programming Models
• Big Data Open Platforms
tt. Organizational initiative in the following areas related to Big Data Search, Mining and
Management
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Mobility and Big Data
uu. Organizational initiative in the following areas related to Big Data applications
• Complex Big Data Applications in Science, Engineering,
• Big Data Analytics in Small Business Enterprises (SMEs)
• Real-life Case Studies of Value Creation through Big Data Analytics
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
vv. The biggest impediments to using big data for effective decision-making
• Too many “silos”—data is not pooled for the benefit of the entire organisation.
ww. It is generally agreed that the issue for us is now not the growing volumes of data, but
rather being able to analyse and act on data in real-time.
xx. Areas of Big Data applications
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• e-Governance
• Direct and online marketing
• Fraud detection / management
• Customer and market analysis
• Customer service
• Information Technology
• Finance and administration
• Risk management
Areas of Application, Models & Infrastructure
jj. Steps taken to Integrate Data into Organization’s Business:
• Upgrade IT Systems
• Improve data collection processes
• Redesign/re-engineer important business processes
kk. Areas developed in organization : Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
ll. The number of Big Data specialists in organization next year (2015) will increase.
mm. Suggestions on the following important aspects:
• Data Storage
• Data Curation
• Data Retrieval
These technologies are evolving and call for constant innovation; the organization's
roadmap should stay aligned with emerging technologies.
nn. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Quality of data,
• Accessibility of data,
• Data consciousness,
• Right to data
oo. For the Researchers in the Big Data Discipline, what should be thrust Areas:
• Immediately
o Procuring real time data
o Data gathering,
o Data integration,
o Data integrity
o Data security
• In the next 5-10 years
o Developing Medical layer for supporting end users,
o System developers,
o Building DSS, KSS,
o Event triggering systems and agents aiming at integrating with Internet of things.
Type, Amount of Data & Analytical Techniques Used
39. Pace at which data is available, updated or refreshed
• As it is streamed in real-time
• Less than a week
40. Support needed from the Government:
• The government’s roadmap on big data
• High level 5-year country strategy
• Clarity on statutory / regulatory / compliance requirements
• Partnering with peer organizations and relevant government agencies
• Clarity on security aspect
41. The amount of data available to support decision-making
• Enough
42. Organizational Performance in Information and Analytic Tasks on A Scale Of 1 To 5, Where
1=Poorly and 5=Very Well
• Acquire and integrate data: 3
• Analyze data: 3
• Act on data-driven insights: 4
43. Type of data analyzed in the context of Big Data applications:
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
44. Organizational use of advanced analytics methods (statistics, econometrics, operations
research, artificial intelligence, applied mathematics) in Big Data applications
• Yes
45. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Operations research (OR) / applied mathematics
46. Developing Big Data applications
• Both internally as well as externally
47. Consider the following factors as most important factors for successful Big Data
implementations
• A clear company strategy
• Support by higher management
• Talent
• Training
• Supporting systems and procedures
• Financial budget
• An organizational structure that supports multi-disciplinary projects
• A sound procedure for legal, ethical and reputational issues
Security Concerns
v. Initiative taken for Big Data Security & Privacy
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
w. Views on the IPR Issues as related to Big Data Analytics.
• Recently initiated this process
x. Views and suggestions on the adequacy or otherwise of the National Data Sharing and
Accessibility Policy (NDSAP) as related to Big Data Analytics:
• All data needs to be made available on common portal accessible to all
• We will comply with national regulatory requirements
ANNEXURE 6
CONSOLIDATED RESPONSE FROM SERVICE PROVIDERS (SP)
Current Status, Strategy & Profile
xxx. Stakeholder Segment/Category: Mostly Service Provider, sometimes Platform Provider.
yyy. Active Data Segments:
• Telecom,
• Retail,
• Banking and Finance,
• Internet and Media
• Industry: Education, Manufacturing, Media & Content, Logistics and E-Commerce
o Cloud BI and cloud data services
• IoT data & analytics services
• Click stream data and analytics services
• Analytics and Simulation segments.
zzz. Data segments that are outsourced:
• Telecom,
• Retail,
• Banking and Finance,
• Internet and Media
• Mobile apps and customization
• CRM apps and customization
• ERP apps and customization
aaaa. Areas where more investment in resources is needed
• Capacity Building
• Software Tools
• Data Sources
• Agile Process Quality & ISO compliance for data services
• Data security & industry specific compliance audits/standards.
• User Experience (UX) standards and best practices.
• Sales & Marketing standards & scaling the business
bbbb. Organization's competitive position
• On par with industry / market peers
• Don't know
cccc. Expectations from Big Data Analytics in the next 10 years
• Big data will enable Integrated cloud warehousing that integrates external and internal data.
• Machine learning algorithms will enable analytic automation that gives competitive
• End user intelligent apps will get more dependent upon big data APIs that make them
smarter and more personalized
• Smarter cities,
• Smoother e-governance
dddd. Big data management is NOT viewed strategically at senior levels of the organisation.
eeee. Generally there is enough of a “big data culture” in the organisation, where the use of
big data in decision-making is valued and rewarded.
Manpower, Skill Gaps and Training Needs
gg. Identified skills gaps in dealing with data and analytics.
• Visualization skills
• Data integration skills
• Data analysis skills
• Tooling / software skills
hh. Big Data experts employed in areas (Generally about 2 – 5 Experts, sometimes more)
• Computer science: programming experts (R, Python, SQL, SAS, Java, etc)
• Computer science: Artificial Intelligence and machine learning experts
• Computer science: text, voice, music, image and video experts
• Experts in statistics and econometrics
• Experts in OR and applied mathematics
• People who understand business and the data that goes with it.
ii. The training needs are:
• Strategy courses on Big Data for top management
• Computer science: programming courses (R, Python, SQL, SAS, Java, etc)
• Computer science: machine learning and artificial intelligence courses
• Statistics and econometrics courses
• Operations research and applied mathematics courses
• Application related courses (Big Data in marketing, finance, logistics, etc)
• Training in software tools such as Splunk, ELK, Cloudera.
jj. Required Capacity Building initiatives as per details given below:
Name of the Program | Who should be the participants | Coverage | Duration | Modality of Delivery
Cloud BI | MCAs | BI devOps for analytics | 3 months | Apprentice model
Big Data | BE Grads | R & Visualization | 6 months | Apprentice model
Cloud DS | Diplomas | Data processing | 3 months | Apprentice model
Perceived Success Factors, Impediments & Challenges for Big Data Application
yy. Organizational initiative in the following areas related to Big Data Science & Technology
• Data streaming & Processing
• Analysis of Unstructured/Semi-structured data
• Visualization & Visual Analytics
• Security & Privacy issues
zz. Organizational initiative in the following areas related to Big Data Infrastructure
• System Architectures, Design and Deployment
• Big Data Open Platforms
aaa. Organizational initiative in the following areas related to Big Data Search, Mining and
Management
• Search & Mining of variety of data including scientific, engineering, social, sensor &
multimedia
• Algorithms & Systems for Big Data Search
• Data Acquisition, Integration, Cleaning & Best Practices
• Visualization Analytics for Big Data
• Multimedia and Multi-structured Data-Big Variety Data
bbb. Organizational initiative in the following areas related to Big Data applications
• Medicine, Healthcare, Finance, Business, Law and Education
• Retailing, social media and Telecommunication
• Big Data Analytics in Small Business Enterprises (SMEs)
• Big Data as a Service
• Big Data Industry deployments & Standards and Experiences of
ccc. Organization has Big Data applications in
• E-Commerce, e-Business, Online Operations (Web shops, etc)
• Fraud detection / management
• Customer and market analysis
• Customer service
• Information Technology
Areas of Application, Models & Infrastructure
pp. Steps taken to Integrate Data into Organization’s Business:
• Training current employees or recruiting new employees in BA
qq. Areas developed in organization: Not so well to reasonably well
• A clear company strategy
• A sound procedure for legal, ethical and reputational issues
• An organization structure that supports multi-disciplinary projects
• Financial budget
• Support by higher management
• Supporting systems and procedures
• Talent
• Training
rr. The number of Big Data specialists in organization next year (2015) will increase.
ss. Suggestions on the following important aspects: NIL
tt. Suggested FINAL PRODUCTS for which the Big Data Community may strive.
• Big Data Analysis Platforms
Type, Amount of Data & Analytical Techniques Used
48. Support needed from the Government:
• Setting up SEZs for smaller setups like ours that export their services entirely. The
current SEZs are unaffordable, and only larger companies can benefit from operating out
of an SEZ.
• Fostering an encouraging environment for entrepreneurship, especially for small start-ups.
49. NO suggestions for leveraging Big Data Analytics applications in Government
50. Organization analyzes in the context of Big Data applications
• Numerical data (for statistics, predictions, etc)
• Text (automated text analysis)
51. Organization applies advanced analytics methods (statistics, econometrics, operations research,
artificial intelligence, applied mathematics) in Big Data applications
52. Advanced analytics methods used in Big Data applications
• Statistics and econometrics
• Artificial intelligence (AI) and machine learning
53. The most important factors for successful Big Data implementations? Please rate from "1"
(=most important) to "5" (=least important).
• A clear company strategy: 1
• Support by higher management: 2
• Talent: 1
• Training: 3
• Financial budget: 2
• An organizational structure that supports multi-disciplinary projects: 4
• A sound procedure for legal, ethical and reputational issues: 4
54. Tools and Platforms to be used for Big Data Analytics in the Open Source domain
• Hortonworks
• Hadoop
• MapReduce
Security Concerns
y. Your organization has taken initiative in which of the following areas related to Big Data Security
& Privacy? (Multiple answers possible) NO COMMENTS
• Intrusion Detection
• Cyber security and Gigabit Networks
• Visualizing Large Scale Security Data
• Challenges for Big Data Security & Privacy
• Sociological Aspects of Big Data Privacy
z. Please provide your views on the IPR Issues as related to Big Data Analytics. NO COMMENTS
aa. Please provide your views and suggestions on the adequacy or otherwise of the National Data
Sharing and Accessibility Policy (NDSAP) as related to Big Data Analytics NO COMMENTS
ACKNOWLEDGEMENTS
Consultancy Development Centre (CDC), an Autonomous Institution of DSIR, Ministry of Science &
Technology, Government of India, was commissioned to prepare a Strategic Document on Data
Science, Technology, Research and Applications (dASTRA) for the Data Science Initiative taken by
the Department of Science & Technology.
CDC is thankful to the Department of Science & Technology, Ministry of Science & Technology,
Government of India, for reposing confidence in it by assigning it this task of national importance.
CDC is especially thankful to the numerous Governmental, Public & Private Organizations, NGOs,
Educational, Academic and Research Institutions and individuals for sparing their time and effort to
respond to the survey questionnaires, take part in personal discussions and interviews, and participate
in the Consultative Meetings and the Interactive Workshops organized around the country.
The team working on this study has studied, consulted and referred to a very large number of research
papers, reports, books, other public domain documents and presentations; in addition, it has
participated in a number of Big Data related conferences/seminars held recently in the country. A list
of the materials referred to has been included in the Bibliography given in the report. Many ideas from
the above materials and personal interactions have directly or indirectly become part of this
report. The team would like to acknowledge with thanks the valuable contributions made by the
various participants in these interactions and the authors of these documents and presentations. It is
requested that this be taken as a personal acknowledgement of each and every person whose ideas
have found a place in this report.