3A G3
Big Data
981919 黃于庭
991619 鍾佳琳
991635 陸雨新
991660 魏松毅
991604 林右千
991632 游智鈞
991637 杜韋霆
991664 梅耀文
991616 李嘉芸
991634 陳鈺玟
991648 何冠儀
Question a:
Describe its possible definitions
991637 杜韋霆
What is big data?
With advances in science and technology, we
automatically create large amounts of data every day.
This data is generated from many sources, such as:
• sensors used to gather climate information
• posts to social media sites
• digital pictures and videos
• purchase transaction records
• cell phone GPS signals
We can call this kind of data “Big data”.
Ref: Speed of Business --- IBM
http://www-01.ibm.com/software/data/bigdata/
Definitions
• Wiki
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to
capture, curate, manage, and process the data within a tolerable elapsed time.
Ref: http://en.wikipedia.org/wiki/Big_data
• Gartner
“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
Ref: http://www.gartner.com/it-glossary/big-data/
• Doug Laney
Data sets where the three Vs—volume, velocity and variety—present specific challenges in
managing these data sets.
Ref: http://www.isaca.org/Knowledge-Center/Blog/Lists/Posts/Post.aspx?ID=299
• Webopedia
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process using traditional database and software
techniques.
Ref: http://www.webopedia.com/TERM/B/big_data.html
Definitions (continued)
• Andrew Brust
We can safely say that Big Data is about the technologies and practice of
handling data sets so large that conventional database management systems
cannot handle them efficiently, and sometimes cannot handle them at all.
• John Rauser
Any amount of data that's too big to be handled by one computer.
• Techopedia.com
Big data refers to a process that is used when traditional data mining and
handling techniques cannot uncover the insights and meaning of the
underlying data. Data that is unstructured or time sensitive or simply very
large cannot be processed by relational database engines. This type of data
requires a different processing approach called big data, which uses massive
parallelism on readily-available hardware.
And more!
Ref: http://www.opentracker.net/article/25-definitions-big-data
Definitions (conclusion)
There are many different definitions of big data,
but most of them share these points:
1. The data sets are very large.
2. They are hard to handle with commonly used software
tools.
3. Processing time matters.
4. The data comes in many types.
Why is it so important?
Big data issues are important because:
1. The data sets that companies gather are more than just
text; they also include video and images.
2. The methods that generate data are different from those of the
past.
3. Manual or on-hand tools are not efficient enough.
4. Companies require fast reaction and accuracy.
5. Data is useless until it is processed.
Ref: http://www.arthurtoday.com/2012/01/big-data.html
What are its characteristics?
991632 游智鈞
• Many definitions of big data are in use, but the most
widely cited characterization comes from a
2001 paper by Doug Laney of META Group, which
describes such data sets by three Vs: Volume, Velocity and Variety.
• Most people talk about the 3Vs, but some talk
about 4Vs, adding a fourth V, “Veracity”.
• IBM proposed the concept of the 3Is:
Instrumented, Interconnected and Intelligent.
Batch: Batch processing is not continuous
processing of data; it is used for very
large files. The files to be
transmitted are gathered
over a period and then sent
together as a batch. [2]
Reference:
[2] http://www.datasciencecentral.com/forum/topics/the-3vs-thatdefine-big-data
[3] http://contest.trendmicro.com/2013/tw/train.htm
4V (Volume, Velocity, Variety, Veracity)
• Volume: Many factors contribute to the increase in
data volume: past transaction records, daily data collected
from sensors, data created by social media, etc.
• Velocity: How fast data is created and how fast
it must be processed. The sooner a business can process its data,
the more benefit it can gain from it.
• Variety: Today's data comes in a variety of formats.
1. Structured data: database data (trial balance, financial report, general information)
2. Semi-structured data: email, blog posts
3. Unstructured data: text, video, photo, audio [4]
Reference :
[4]雲端時代的殺手級應用-海量資料分析 胡世忠 著
• Veracity: Because data can come from anywhere, its
correctness cannot be guaranteed, and not all of it is beneficial
to an enterprise. So it is important for enterprises to obtain
useful data and analyze it.
Reference:
[5] http://skyfollow.com/big-datavelocity-comparisons-incoming-ratechart/
3I (Instrumented, Interconnected, Intelligent)
• Instrumented: A huge change in data sources. We
place sensors on many things so that people can
perceive the physical world with more sensitivity and more
completeness. E.g.: smart meters
• Interconnected: A huge change in the way data is
transmitted. We use sensors, RFID and other communication
technologies to let objects communicate with each other.
• Intelligent: A huge change in the way data is used. E.g.:
the Watson supercomputer
Reference:
[6] http://www.ibm.com/smarterplanet/ie/en/overview/ideas/
Conclusion
Big data does not only mean
a large volume of data; it also signals that
daily life has entered a new stage, one that
brings more convenient and more intelligent
choices.
Question b:
What are the possible challenges
and opportunities of big data?
991604 林右千
991616 李嘉芸
991619 鍾佳琳
Challenges - Understand & Use
991604 林右千
• The challenge is how we can understand and use big data
when it comes in an unstructured format, such as text or
video.
• Unstructured data is a generic label for describing any
corporate information that is not in a database.
Unstructured data can be textual or non-textual.
– Textual unstructured data is generated in media like email messages,
PowerPoint presentations, Word documents, collaboration software
and instant messages.
– Non-textual unstructured data is generated in media like JPEG images,
MP3 audio files and Flash video files.
References from:
http://spotfire.tibco.com/blog/?p=6793
http://searchbusinessanalytics.techtarget.com/definition/unstructured-data
Challenges - Understand & Use
Direct quote from: http://www.slideshare.net/Hadoop_Summit/hadoopsopportunity-to-power-nextgeneration-architectures
Challenges - Understand & Use
• For example, as social media applications like Twitter and
Facebook go mainstream, the growth of unstructured data is
expected to far outpace the growth of structured data.
• In customer-facing businesses, the information contained in
unstructured data can be analyzed to improve customer
relationship management and relationship marketing.
References from:
http://searchbusinessanalytics.techtarget.com/definition/unstructured-data
Opportunities - Government
• Opportunities for government fall into
three areas, each with an example:
1. Improve administrative efficiency - ACSSA
2. Combat and prevent crime - Memphis PD
3. Improve traffic problems - Stockholm
Direct quote from:
http://www.alamedasocialservices.org/public/index.cfm
References from: 胡世忠. 雲端時代的殺手級應用:Big Data海量資料分析. 臺北市: 天下雜誌股份有限公司. 2013: 9789862416730
Direct quote from:
http://www.memphispolice.org/
Improve administrative efficiency - ACSSA
• Alameda County is the seventh largest county in California.
The Alameda County Social Services Agency (ACSSA) provides
social services to as many as 140,000 people living below the
poverty line, with 19,000 actively managed cases.
• The antiquated systems it was using could not keep up with
the need for information, which meant that the agency’s
understanding of what was happening out in the community
lagged weeks or even months behind actual events.
• ACSSA teamed with IBM to deploy an information
management system that combined analytics with business
intelligence to give workers an agency-wide, comprehensive
view of individual cases.
References from: IBM The SmarterCities Leadership Series. Smarter Government Services.
Improve administrative efficiency - ACSSA
• Outcome:
1) ACSSA now has an average annual savings of nearly $25M.
2) Real-time understanding of case and program status enables the agency to
find the best assistance programs for each situation.
3) Real time tracking reveals relationships between benefit recipients
and programs, helping to eliminate waste, fraud and redundancy.
4) Reports are generated in minutes instead of weeks or months.
5) The system has increased the productivity and win rates of agency
lawyers who defend the agency when a claimant appeals their
discontinuation of benefits, which saves the agency $900,000
annually.
References from: IBM The SmarterCities Leadership Series. Smarter Government Services.
Combat and prevent crime - Memphis PD
• Memphis PD uses Blue CRUSH (Crime Reduction Utilizing
Statistical History) to reduce the crime rate.
• At the heart of Blue CRUSH is a predictive model that
incorporates fresh crime data from sources that range from
the MPD’s records management system to video cameras
monitoring events on the street.
• Blue CRUSH lays bare underlying crime trends in a way that
promotes an effective fast response, as well as a deeper
understanding of the longer-term factors (like abandoned
housing) that affect crime trends.
References from: IBM Smarter Planet Leadership Series. Memphis PD:
Keeping ahead of criminals by finding the “hot spots”. 2011
Combat and prevent crime - Memphis PD
• It happens at the precinct level. Looking at multilayer maps
that show crime hot spots, commanders can see not only
current activity levels, but also any shifts in such activities
that may have resulted from previous changes in policing
deployment and tactics. At each weekly meeting,
commanders go over these results with their officers to judge
what worked, what didn’t and how to adjust tactics in the
coming week.
References from:IBM Smarter Planet Leadership Series. Memphis PD:
Keeping ahead of criminals by finding the “hot spots”. 2011
Combat and prevent crime - Memphis PD
• Outcome:
1) 30% reduction in serious crime overall, including a 36.8% reduction
in crime in one targeted area
2) 15% reduction in violent crime
3) 4x increase in the share of cases solved in the MPD’s Felony Assault
Unit (FAU), from 16 percent to nearly 70 percent
4) Overall improvement in the ability to allocate police resources in a
budget-constrained fiscal environment
References from: IBM Smarter Planet Leadership Series. Memphis PD:
Keeping ahead of criminals by finding the “hot spots”. 2011
Improve traffic problems - Stockholm
• The Swedish National Road Administration (SNRA) and the
Stockholm City Council announced a trial congestion tax.
• The goal was not only to reduce congestion but also to encourage
ancillary benefits, such as improving public transport and
alleviating environmental damage. The government’s plan is
to devote revenue from the tax to completing a ring road
around the city.
• With help from IBM, the solution they came up with was an
innovative, high-tech traffic charging system that directly
charges drivers who use city center roads during peak
business hours.
References from: Driving Change in Stockholm. 2008
Improve traffic problems - Stockholm
• The way it works, drivers can install simple transponder tags
that communicate with receivers at the control points and
trigger automatic payment of road use fees. Once a vehicle
passes a roadside control point during designated congestion
hours, it is recognized by the transponder that is read by
sensors.
• In addition, cars passing through these control points are
photographed, and the license plate numbers are used to
identify those vehicles without tags and to provide evidence
to support the enforcement of non-payers. The information is
sent to a computer system that matches the vehicle with its
registration data, and a fee is charged to the owner. All of the
above steps can be completed within milliseconds.
References from: Driving Change in Stockholm. 2008
Improve traffic problems - Stockholm
• Outcome:
1) Traffic was down nearly 25 percent.
2) Public transport schedules had to be redesigned because of the
increase in speed from reduced congestion.
3) 40,000 more travelers used Stockholm Transport on an ordinary
weekday than the year before—an increase of six percent.
4) The reduction in traffic has led to a drop in emissions from road
traffic by eight to 14 percent in the inner-city.
5) Greenhouse gases such as carbon dioxide have fallen by 40 percent
in the inner-city.
References from: Driving Change in Stockholm. 2008
Opportunities - Manufacturing
• The opportunities for manufacturing.
Direct quote from: McKinsey Global Institute. Big data: The next frontier for
innovation, competition, and productivity. June 2011: 78
Opportunities - Manufacturing
Direct quote from: McKinsey Global Institute. Big data: The next frontier for
innovation, competition, and productivity. June 2011: 78
Opportunities - Manufacturing
• Example:
Haitai Confectionery & Food Co., Ltd. is a South
Korean company whose main business is retail and instant
foods, especially confectionery, beverages and ice cream.
Haitai uses a business intelligence and analytics platform to
analyze historical data, track changes in supply and
demand, and forecast demand. This lets it quickly grasp market
demand and reduce days in inventory.
References from: 胡世忠. 雲端時代的殺手級應用:Big Data海量資料分析. 臺北市: 天下雜誌股份有限公司. 2013: 201-202. 9789862416730
Challenge ─ Privacy and Security
991616 李嘉芸
• In the big data era, the Internet continuously releases
huge amounts of data. Society benefits from
the use of big data, but privacy has nowhere to
hide.
• As more and more data is produced, stored and
analyzed, information such as business
sales, personal spending habits and identity
is being stored in various forms.
SOURCE: R[1], R[2]
Challenge ─ Privacy and Security
• Large amounts of data conceal substantial
economic and political interests, which are
uncovered particularly through data integration, analysis
and mining.
• The technological innovations of the
big data era have also created strong demand
across all sectors of society for personal
privacy protection.
SOURCE: R[1], R[2]
In the big data era, personal privacy
can be invaded in the following ways:
• During data storage: users cannot know the
exact storage location of their data, and they lose control over how
their personal data is collected, stored, used and shared.
• During data transmission: because transmission is more open
and diverse, it may lead to data leakage or
eavesdropping.
• During data destruction: the data may already
have been backed up, so destruction may be incomplete.
SOURCE: R[1], R[2]
How to strengthen the protection of
personal privacy:
1. Include personal information protection in national strategy
and planning.
2. Build a complete body of personal privacy law: we need to
create a personal privacy protection law and basic rules. In addition,
we should actively promote legislation related to
privacy protection to reduce violations of personal
privacy.
3. Strengthen the technical protection of personal privacy: encourage
the development of privacy protection technologies that prevent
personal data from being processed in unnecessary or undesirable
ways, and that let users know where their data is stored and how
it is processed.
SOURCE: R[1], R[2]
Opportunities - Energy
• High oil and electricity prices keep
sustainable energy a persistent issue
and make big data analysis increasingly
important in the energy industry.
SOURCE: R[3]
Opportunities - Energy
• Big data analysis directly affects profit, so
many companies in the industry have already installed
intelligent monitoring equipment that
collects large amounts of data in real time for
simulation and analysis, which is used to
increase productivity and reduce costs.
SOURCE: R[3]
Opportunities - Energy
• The energy industry uses big data analysis
in the following ways:
1. Mobile data integration: a power company
can analyze consumers' activity patterns
and their comments on its website, and then
develop services that better match users'
needs.
SOURCE: R[3]
Opportunities - Energy
2. Data-linked thermostats: thermostats can
record and transmit the electricity
consumed by adjusting the temperature in a user's
home. Each thermostat generates tens of
thousands of records a month. Put to good
use, this data helps power companies
regulate electricity supply and encourages
users to change their consumption habits.
SOURCE: R[3]
Opportunities - Energy
3. Studying the charging habits of electric vehicle
owners: by tracking and analyzing
owners' charging habits, the power company
can learn in which periods people use more
electricity, and encourage
users to charge during off-peak hours.
SOURCE: R[3]
References
• R[1], 大數據時代個人隱私保護刻不容緩
http://big5.ce.cn/gate/big5/www.ce.cn/xwzx/gnsz/gdxw/201212/20/t20121220_23958532.shtml
• R[2], 大數據時代﹕數據開放更注重個人隱私保護
http://big5.gmw.cn/g2b/IT.gmw.cn/2013-04/12/content_7292096.htm
• R[3], 雲端時代的殺手級應用-Big Data 海量資料分析, 胡世忠, 天下雜誌股份有限公司, 2013/03/08
Challenges – Storage
991619 鍾佳琳
• "Big data" refers to data sets that are too large to be
captured, handled, analyzed or stored in an appropriate
timeframe using traditional infrastructures.
Unit        Size
Bit         1 or 0
Byte        8 bits
Kilobyte    1,000 bytes
Megabyte    1,000 KB
Gigabyte    1,000 MB
Terabyte    1,000 GB
Petabyte    1,000 TB
Exabyte     1,000 PB
Zettabyte   1,000 EB
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes
Source from: R[1],R[2]
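The unit ladder on this slide can be checked with a few lines of Python. This is only a sketch: the names `UNITS`, `to_bytes` and `humanize` are made up for illustration, and it assumes the decimal convention used on the slide, where each step is a factor of 1,000.

```python
# Decimal storage units, in the order used on the slide; each step is x1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


if __name__ == "__main__":
    print(to_bytes(1, "PB"))     # one petabyte expressed in bytes
    print(humanize(10 ** 15))    # and back to a readable unit
```

The same two helpers can be used to compare figures quoted in different units, for example a data set given in terabytes against a disk given in petabytes.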
Challenges – Storage
• Storage is especially challenging because there are many
different kinds of data that need to be stored.
• Persistence
– Many big data applications involve regulatory compliance that
dictates data be saved for years or decades.
– Medical information is often saved for the life of the patient.
Financial information is typically saved for seven years.
– Big data users are also saving data longer because it’s part of a
historical record or used for time-based analysis. This
requirement for longevity means storage manufacturers need to
include on-going integrity checks and other long-term
reliability features, as well as address the need for data-in-place upgrades.
Source from: R[2]
Challenges – Storage
• Storage must evolve
– Big data has outgrown its own
infrastructure, and it’s driving the
development of storage, networking
and computer systems designed to
handle its specific requirements.
Source From: R[2]
Challenges – Storage
Types of data:
– Structured Data
• Data that resides in fixed fields within a record or file.
– Semi-structured Data
• XML, E-mail, Blog
– Unstructured Data
• pictures, digital audio, video, Word, pdf
Source from: R[7]
Structured Data (Traditional)
Business User
Data Warehouse Administrator
Business Analyst
Direct quote from: R[3]
Structured Data (Traditional)
• Traditionally, data processing for analytic purposes followed a
fairly static blueprint. Namely, through the regular course of
business enterprises create modest amounts of structured
data with stable data models via enterprise applications like
CRM, ERP and financial systems.
Source From: R[3]
Structured Data (Traditional)
• Data integration tools are used to extract, transform and load
the data from enterprise applications and transactional
databases to a staging area where data quality and data
normalization occur and the data is modeled into neat rows
and tables.
Source From: R[3]
Structured Data (Traditional)
• The modeled, cleansed data is then loaded into an enterprise
data warehouse. This routine usually occurs on a scheduled
basis – usually daily or weekly, sometimes more frequently.
Source From: R[3]
Structured Data (Traditional)
How each user works with a traditional data warehouse:
• Data Warehouse Administrator: Create and schedule regular reports to run against
normalized data stored in the warehouse, which
are distributed to the business. They also create dashboards and other limited
visualization tools for executives and
management.
• Business Analyst: Use data analytics tools/engines to run advanced
analytics against the warehouse, or more often
against sample data migrated to a local data mart
due to size limitations.
• Business User: Perform basic data visualization and limited
analytics against the data warehouse via front-end business intelligence tools from vendors like
SAP BusinessObjects and IBM Cognos.
Source From: R[3]
Semi-structured Data
& Unstructured Data
• Hadoop is an open source framework for processing,
storing and analyzing massive amounts of distributed,
unstructured data.
• It was designed to handle petabytes and exabytes of
data distributed over multiple nodes in parallel.
• Fundamental concept
– Hadoop breaks up Big Data into multiple parts so
each part can be processed and analyzed at the
same time.
Source from: R[3]
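The "break it into parts and process them at the same time" idea can be sketched on a single machine. The code below is not Hadoop's API; it is a minimal Python illustration in which a thread pool stands in for cluster nodes, and the function names (`split_into_parts`, `count_words`, `parallel_word_count`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(records, num_parts):
    """Break the data set into roughly equal parts (stand-ins for blocks on different nodes)."""
    return [records[i::num_parts] for i in range(num_parts)]


def count_words(part):
    """Per-part step: each worker counts words in its own part only."""
    counts = Counter()
    for line in part:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, num_parts=4):
    """Process all parts concurrently, then merge the partial results."""
    parts = split_into_parts(records, num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partial_counts = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    logs = ["big data is big", "data is distributed", "big clusters process data"]
    print(parallel_word_count(logs, num_parts=2))
```

Because each part is counted independently and the partial counts are simply added, the result does not depend on how the data is split. That independence is what lets the real system spread the parts over many machines.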
Opportunities - Healthcare
• Healthcare opportunities fall
into five broad
categories:
– Clinical operations
– Payment/pricing
– R&D
– New business models
– Public health
Source from: R[4],R[7]
Opportunities - Healthcare
Categories and how to apply them:
• Clinical operations – Clinical decision support systems
The current generation of such systems
analyzes physician entries and compares
them against medical guidelines to alert for
potential errors such as adverse drug
reactions or events.
By deploying these systems, providers can
reduce adverse reactions and lower
treatment error rates and liability claims,
especially those arising from clinical
mistakes.
• Payment/pricing – Health Economics and Outcomes Research and
performance-based pricing plans
Patients would obtain improved health
outcomes with a value-based formulary and
gain access to innovative drugs at
reasonable costs.
Source from: R[4],R[7]
Opportunities - Healthcare
Categories and how to apply them:
• R&D – Personalized medicine
The objective of this lever is to examine the
relationships among genetic variation,
predisposition for specific diseases, and
specific drug responses and then to account
for the genetic variability of individuals in the
drug development process.
• New business models – Online platforms and communities
Examples of this business model in practice
include Web sites such as
PatientsLikeMe.com, where individuals can
share their experience as patients in the
system.
• Public health – Be better prepared for emerging diseases
and outbreaks
This lever offers numerous benefits,
including a smaller number of claims and
payouts, thanks to a timely public health
response that would result in a lower
incidence of infection.
Source from: R[4],R[7]
Opportunities - Healthcare
• Example
This type of Big Data healthcare company is focused on “Increasing Awareness”.
A mobile app called Asthmapolis is an example of this type. A mobile sensor
device is attached to an asthma inhaler, which then monitors where and when
asthma attacks happen. The device wirelessly synchronizes with an iOS/Android
app, allowing users to track their triggers and symptoms.
Source from: R[5],R[6]
References
• R[1], The Wall Street Journal, January 21, 2013
http://online.wsj.com/article/SB10001424127887323468604578245540627666664.html
• R[2], Storage for big data, page2 & page5, April 2, 2012
http://searchstorage.techtarget.com/magazineContent/Storage-for-big-data?pageNo=1
• R[3], Big Data: Hadoop, Business Analytics and Beyond, April 16, 2013
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
• R[4], McKinsey Global Institute, May 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
• R[5], How Big Data Is Improving Healthcare, October 2, 2012
http://readwrite.com/2012/10/02/how-big-data-is-improving-healthcare
• R[6], ASTHMAPOLIS
http://asthmapolis.com/
• R[7], 雲端時代的殺手級應用-Big Data 海量資料分析, 胡世忠, 天下雜誌股份有限公司, 2013/03/08
Question c:
Explain how a corporation can
deal with the problems
associated with big data and
explain possible solutions
Problem: How to analyze and apply
Big Data
Pattern recognition
Classification
Anomaly Detection
991648 何冠儀
Pattern recognition
• What is a pattern? [1]
– A pattern can be a picture, a string of characters, a set
of symbols, a sequence of signals, etc.
• What is pattern recognition?
– The act of taking in raw data and taking an action
based on the “category” of the pattern. [1]
– Pattern recognition is a science of “decision” making. [2]
Pattern recognition
• Process [2]
– Feature: a characteristic of a sample
– Training sample: a sample used to build the system
– Test sample: a sample used to test the accuracy of the system
Pattern recognition
• Application [1][2]
– Biometric authentication: fingerprints, voice prints
– Voice recognition: analyzing the content of a speaker's speech
– Medical image analysis: X-rays, nuclear medicine imaging
– Wireless telecommunication analysis: determining how
many wireless networks are present in a space
– Satellite image analysis: determining which areas are grassland,
river, sand, buildings, etc.
– Handwriting recognition: identifying handwritten text
Classification
• What is Classification? [3][4]
– It is used to group items based on certain key
characteristics.
– It constructs a model from the
training set and the values (class labels) of a
classifying attribute, and uses that model to classify new
data.
• Purposes [5]
– Analyze the factors that affect how data is classified
– Predict the category (class label) of data
Classification
• The process of classification [5][6]
1. Build the model:
– Use the existing data to find a classification model,
– such as a decision tree or classification rules.
2. Assess the model:
– Divide the existing data into two groups:
training samples and test samples.
– First phase: use the training samples to build the model
– Second phase: use the test samples to evaluate the accuracy
of the model
3. Use the model:
– Find the reasons behind how data is classified
– Predict the class of new data
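The three steps (build, assess, use) can be sketched in plain Python. The slides name decision trees and classification rules as models; to keep the example short, this sketch swaps in a nearest-centroid classifier instead. The function names and the toy credit data are invented for illustration.

```python
import random


def train(samples):
    """Step 1, build the model: one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Step 3, use the model: assign the class whose centroid is closest."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, samples):
    """Step 2, assess the model: fraction of samples classified correctly."""
    hits = sum(predict(model, features) == label for features, label in samples)
    return hits / len(samples)


def split(samples, test_ratio=0.3, seed=0):
    """Divide the existing data into training samples and test samples."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical toy data: (income, debt) -> credit class.
    data = [((5.0, 1.0), "good"), ((6.0, 0.5), "good"), ((5.5, 1.5), "good"),
            ((6.5, 1.0), "good"), ((2.0, 4.0), "bad"), ((1.5, 3.5), "bad"),
            ((2.5, 4.5), "bad"), ((1.0, 5.0), "bad")]
    training, testing = split(data)
    model = train(training)
    print("test accuracy:", accuracy(model, testing))
    print("new applicant:", predict(model, (5.8, 0.8)))
```

Keeping the test samples out of `train` is the point of the second step: accuracy measured on the training samples themselves would overstate how well the model handles new data.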
Classification
• Algorithms [7]
– Support vector machines: supervised
learning models with associated learning algorithms that
analyze data and recognize patterns [8]
– Neural networks: an interconnected group of
artificial neurons that processes information using
a connectionist approach to computation [9]
– Kernel density estimation: a fundamental data smoothing
problem in which inferences about the population are made
based on a finite data sample [10]
– Decision trees: create a model that predicts the value of a
target variable based on several input variables [11]
Classification
• Application[7]
– Speech recognition: the translation of spoken words
into text [12]
– Biological classification: a method of scientific
taxonomy used to group and categorize organisms into groups
such as genus or species [13]
– Credit scoring: a credit score is a numerical expression based on a
statistical analysis of a person's credit files, representing
the creditworthiness of that person [14]
Anomaly detection
• What is Anomaly detection?
– Anomalies: the set of data points that are considerably
different from the remainder of the data [15]
– Anomaly detection usually produces a large number of false alarms. [16]
– Anomalies are also referred to as exceptions or deviations. [17]
Anomaly detection
• Common causes of anomalies [19]
– Data from different classes: objects differ
because they are of a different type or class
– Natural variation: data sets are modeled by statistical
distributions, in which some variation in the
data is expected
– Data measurement and collection errors: errors
made during data collection or during the measurement
process
Anomaly detection
• Categories [17]
1.Unsupervised :
– No labels assumed
– Based on the assumption that anomalies are very rare
compared to normal data
2.Supervised :
– Labels available for both normal data and anomalies
– Similar to rare class mining
3.Semi-supervised :
– Labels available only for normal data
Anomaly detection
• Techniques [18]
1.Model-Based:
– Build a model of the data.
– Anomalies are objects that do not fit the model.
2.Proximity-Based:
– Define a proximity measure between objects
– Anomalies are objects that are distant from most of the
other objects
3.Density-Based:
– Estimate the density of objects
– Anomalies are objects that are in regions of low density
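Two of these techniques can be sketched on one-dimensional data in Python. This is an illustration under assumptions: the "model" is a simple mean and standard deviation, the proximity measure is absolute difference, and the thresholds, function names and sample readings are all made up. A density-based detector is not shown.

```python
from statistics import mean, stdev


def model_based_anomalies(values, z_threshold=2.0):
    """Model-based: fit a mean/standard-deviation model and flag values that do not fit it."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def proximity_based_anomalies(values, k=2, distance_threshold=10.0):
    """Proximity-based: flag values whose k-th nearest neighbour is far away."""
    flagged = []
    for i, v in enumerate(values):
        distances = sorted(abs(v - other) for j, other in enumerate(values) if j != i)
        if distances[k - 1] > distance_threshold:
            flagged.append(v)
    return flagged


if __name__ == "__main__":
    # Hypothetical sensor readings with one obvious outlier.
    readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1]
    print(model_based_anomalies(readings))
    print(proximity_based_anomalies(readings))
```

Both functions are unsupervised in the sense of the earlier slide: no labels are used, and they rely on anomalies being rare compared with normal data.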
Anomaly detection
• Applications [18]
– Intrusion detection: monitoring systems and
networks for unusual behavior
– Fraud detection : looking for buying patterns
different from typical behavior
– System health monitoring: using unusual symptoms
or test results to indicate potential health problems
– Detecting ecosystem disturbances: trying to predict
events like hurricanes and floods
– Public health: using medical statistics reports for
diagnosis
References
[1] http://www.ie.ksu.edu.tw/ie1/100ie/web/files/download/2011.11.15.%E6%9C%B1%E5%AE%B6%E5%BE%B7.pdf
[2]http://nthur.lib.nthu.edu.tw/dspace/handle/987654321/4878
[3] http://zh.scribd.com/doc/137177757/Statistical-PatternRecognition-2nd-Ed
[4] http://www.wisegeek.com/what-is-a-data-miningclassification.htm
[5] http://sls.weco.net/node/10936
[6] http://faculty.stust.edu.tw/~jehuang/DMCourse/ch5-3.html
[7] http://en.wikipedia.org/wiki/Statistical_classification
[8]http://en.wikipedia.org/wiki/Support_vector_machine
[9] http://en.wikipedia.org/wiki/Artificial_neural_networks
[10] http://en.wikipedia.org/wiki/Kernel_density_estimation
[11] http://en.wikipedia.org/wiki/Decision_tree_learning
[12] http://en.wikipedia.org/wiki/Speech_recognition
[13] http://en.wikipedia.org/wiki/Biological_classification
[14] http://en.wikipedia.org/wiki/Credit_scoring
[15] http://www.slideshare.net/guest76d673/chap10-anomalydetection
[16] 經濟部九十年度科技專案 國家資通安全技術服務計劃 入侵偵測系統簡介 陳培德 國立成功大學電機所博士候選人
[17] www.siam.org/meetings/sdm08/TS2.ppt
[18] http://www.cli.di.unipi.it/~tamberi/old/docs/tdm/anomalydetection.pdf
Question c:
Explain how a corporation can deal with the problems associated
with big data and explain possible solutions
Association rule learning &
Predictive modeling
991634 陳鈺玟
Association rule learning
• What is association rule learning?
– Association rules describe frequent co-occurrences
in sets.
– Association rule learning was first used by major
supermarket chains to discover interesting
relations between products.
– A set of techniques for discovering interesting
relationships among variables in data.
Ref[1]:http://dataminingintelligence.com/?p=60
Ref[2]:http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[3]:http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
Association rule learning
• Basic measures in association rule learning:
– Suppose a supermarket has 100,000 transactions, of
which 2,000 include both butter and bread, and 800 of
these 2,000 transactions also include milk. For the rule {butter, bread} → {milk}:
• Support: How many transactions does this rule cover?
– 800 times in 100,000 transactions
– alternatively, 800/100,000 = 0.8%
• Confidence: How strong is the implication of the
rule?
– 800 times in 2,000 transactions
– 800/2,000 = 40%
Ref[4]:http://akashrajak.webs.com/%20New%20Folder/Association%20Rule%20MiningApplications%20in%20Various%20Areas.pdf
Ref[5]:http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[6]:http://en.wikipedia.org/wiki/Association_rule_learning
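The support and confidence figures on this slide can be reproduced with a short Python sketch. The helper names `support` and `confidence` are illustrative, and the transaction list is a made-up stand-in that keeps the slide's proportions at 1/100 scale (1,000 transactions, 20 with butter and bread, 8 of those with milk).

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # rule never applies, so report no confidence
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


if __name__ == "__main__":
    # Hypothetical data with the slide's proportions, scaled down by 100.
    transactions = (
        [{"butter", "bread", "milk"}] * 8
        + [{"butter", "bread"}] * 12
        + [{"eggs"}] * 980
    )
    print(f"support:    {support(transactions, {'butter', 'bread', 'milk'}):.1%}")
    print(f"confidence: {confidence(transactions, {'butter', 'bread'}, {'milk'}):.0%}")
```

Support is measured against all transactions, while confidence is measured only against those that already contain the left-hand side of the rule. That is why the same 800 baskets give 0.8% in one case and 40% in the other.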
Association rule learning
• Examples:
– Which products are frequently bought together
by customers?
• DataTable = Receipts x Products
• onions and potatoes → hamburger
– Which courses tend to be attended together?
• DataTable = Students x Courses
• Programming → Computer Science, Algorithm
Ref[7]:http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Association rule learning
• Applications:
– Market basket analysis :
• 1. Encourage more purchases: learn whether certain groups of
items are consistently purchased together, and adjust store layouts accordingly
• 2. Improve efficiency: flag which merchandising efforts are
ineffective and which products are not selling
• 3. Enhance inventory management: eliminate slow-moving items and increase the supply of fast-moving merchandise
• 4. Extract information about website visitors from
logs: a merchant can analyze visitor browsing patterns,
login counts, past purchase behavior, and responses to promotions
to eliminate what isn't working and focus on what is
Ref[8]:http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants
Ref[9]:http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
Association rule learning
• Applications:
– Protein sequences :
• For health care: proteins are important constituents of the
cellular machinery of any organism and are sequences made up
of 20 types of amino acids. Association rule learning can
enhance our understanding of protein composition and holds the
potential to give clues about the global interactions among amino acids
– Census data :
• For the general public and government: census data hold a huge
variety of general statistical information on society. Information from
population and economic censuses can be used for forecasting when planning public
services such as education, health, transport, and funding
Ref[10]:http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
Ref[11]:http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants
Ref[12]:http://akashrajak.webs.com/-%20New%20Folder/Association%20Rule%20MiningApplications%20in%20Various%20Areas.pdf
Predictive modeling
• What is Predictive modeling?
– Short definition : using data to make decisions
– Long definition : using data to take actions and
make decisions using models that are statistically
valid and empirically derived
– A process by which a model is created or chosen
to try to best predict the probability of an
outcome given a set amount of input data.
Ref[13]:http://en.wikipedia.org/wiki/Predictive_modelling
Ref[14]:http://cdn.oreillystatic.com/en/assets/1/event/85/Best%20Practices%20for%20Building%20and%20Deploying%20Predictive%20Models%20over%20Big%20Data%20Presentation.pdf
Predictive modeling
• Basic idea of predictive modeling:
– The birth of a predictive model :
• A predictive model is the result of combining data and
mathematics.
• To put it formally, data + modeling technique
= predictive model
Ref[15]:http://www.ibm.com/developerworks/library/ba-predictiveanalytics2/
Ref[16]:http://www.ibm.com/developerworks/library/ba-predictiveanalytics2/fig01.gif
Predictive modeling
• Common categories of models :
– Predictive models : how likely an event is
• For example, how likely a credit card transaction is to
be fraudulent, how likely a visitor to a web site is to
click on an ad, or how likely a company is to go
bankrupt.
– Summary models : summarize data
• For example, divide credit card transactions or airline
passengers into different groups depending upon their
characteristics.
Ref[17]:http://opendatagroup.com/predictive-analytics-faq/
Predictive modeling
• Applications:
– Financial:
• 1. Optimize the availability, allocation, and yield of assets
• 2. Improve business outcomes, make better decisions,
increase competitiveness
– Operational:
• 1. Exceed service-level commitments by increasing
speed and reducing the risk of failure
• 2. Optimize maintenance schedules around conditions
Ref[18]:http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
Ref[19]:http://www-01.ibm.com/software/data/bigdata/industry-retail.html
Predictive modeling
• Applications:
– Customer relationship management: analyze and
understand the products in demand, and predict customers'
buying habits in order to promote relevant products
– Product or economy-level prediction : predicting
store-level demand for inventory management purposes,
predicting the unemployment rate for the next year
– Clinical decision support systems : experts use this in
health care primarily to predict which patients are at risk of
developing certain conditions
Ref[20]:http://en.wikipedia.org/wiki/Predictive_modelling
Ref[21]:http://en.wikipedia.org/wiki/Predictive_analytics
Question c:
Explain how a corporation can deal with the problems associated
with big data, and explain the possible solutions
Cluster analysis
neural networks
Sentiment Analysis
991635 陸雨新
Cluster analysis
▲What is Cluster analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the characteristics found in
the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Ref[1][5]
K-Means Clustering on Big Data
▲What is k-means?
• Given k, the k-means algorithm is implemented in four
steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean
point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignment changes
Ref[1][5]
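The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the function names, the random initial partition, the squared Euclidean distance, the `max_iter` cap, and the six sample points are all choices made here for the example.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Mean point of a nonempty collection of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: partition the objects into k nonempty subsets (round-robin
    # over a shuffled order guarantees every subset gets a member).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centers: List[Point] = []
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # If a cluster lost all members, keep its previous center.
            new_centers.append(centroid(members) if members else centers[c])
        centers = new_centers
        # Step 3: assign each object to the nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical sample data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Because the result depends on the initial partition, practical implementations usually run the algorithm several times with different seeds and keep the best clustering.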
The K-Means Clustering Method
• Example
[Figure: k-means on a two-dimensional scatter plot (both axes from 0 to 10) with K=2. Panels in order: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Ref[1][5]
What are neural networks?
• Connectionism refers to a computer modeling
approach to computation that is loosely based
upon the architecture of the brain.
• Many different models, but all include:
– Multiple, individual “nodes” or “units” that operate at the
same time (in parallel)
– A network that connects the nodes together
– Learning can occur with gradual changes in connection
strength
Ref[2][3][5]
Feed-forward nets
Information flow is unidirectional
Data is presented to Input layer
Passed on to Hidden Layer
Passed on to Output layer
Information is distributed
Information processing is parallel
Internal representation (interpretation) of data
Ref[2][3][5]
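Feed-forward nets
[Figure: fully connected feed-forward network with three input nodes, two hidden nodes (j and i), and one output node (k).]
Input Layer: Node 1 = 1.0, Node 2 = 0.4, Node 3 = 0.7
Hidden Layer: Node j, Node i
Output Layer: Node k
Connection weights:
W1j = 0.20, W1i = 0.10
W2j = 0.30, W2i = -0.10
W3j = -0.10, W3i = 0.20
Wjk = 0.10, Wik = 0.50
Ref[2][3][5]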
Neural Network Input Format
$$\text{newValue} = \frac{\text{originalValue} - \text{minimumValue}}{\text{maximumValue} - \text{minimumValue}}$$
where
newValue: the computed value falling in the [0,1] interval range
originalValue: the value to be converted
minimumValue: the smallest possible value for the attribute
maximumValue: the largest possible attribute value
Ref[2][3][5]
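The input scaling above can be illustrated with a short Python sketch. The function name `min_max_normalize` and the sample attribute (ages with assumed bounds of 18 and 60) are hypothetical and chosen here only for illustration.

```python
def min_max_normalize(original_value: float,
                      minimum_value: float,
                      maximum_value: float) -> float:
    """Scale original_value into [0, 1] given the attribute's bounds."""
    if maximum_value == minimum_value:
        raise ValueError("maximum_value and minimum_value must differ")
    return (original_value - minimum_value) / (maximum_value - minimum_value)


# Hypothetical attribute: customer age, assumed to range from 18 to 60.
ages = [18, 30, 45, 60]
scaled = [min_max_normalize(a, 18, 60) for a in ages]
print([round(v, 3) for v in scaled])
```

The smallest possible value maps to 0 and the largest to 1, so every scaled input falls in the range the network expects.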
The Sigmoid Function(Output)
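$$f(x) = \frac{1}{1 + e^{-x}}$$
where
e is the base of natural logarithms, approximated by 2.718282.
Input to node j:
$$(\text{node}_1 \times W_{1j}) + (\text{node}_2 \times W_{2j}) + (\text{node}_3 \times W_{3j}) = (1 \times 0.2) + (0.4 \times 0.3) + (0.7 \times -0.1) = 0.25$$
Output of node j:
$$\text{node}_j = f(0.25)$$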
Ref[2][3][5]
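The node j calculation above can be checked with a few lines of Python. This is a small sketch using the input values and weights from the slide; the function names `sigmoid` and `node_output` are assumptions made here.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the inputs passed through the sigmoid."""
    net = sum(value * weight for value, weight in zip(inputs, weights))
    return sigmoid(net)


inputs = [1.0, 0.4, 0.7]         # values of input nodes 1, 2, 3
weights_to_j = [0.2, 0.3, -0.1]  # W1j, W2j, W3j

net_j = sum(v * w for v, w in zip(inputs, weights_to_j))
out_j = node_output(inputs, weights_to_j)
print(round(net_j, 2), round(out_j, 4))
# prints: 0.25 0.5622
```

The same `node_output` call, applied with the weights of node i and then with the hidden-layer outputs as inputs to node k, would complete one forward pass through the network.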
Sentiment Analysis
• Sentiment
A thought, view, or attitude, especially one based mainly on
emotion instead of reason
• Sentiment Analysis
Also known as opinion mining: the
use of natural language processing (NLP) and computational
techniques to automate the extraction or classification of
sentiment from typically unstructured text
Ref[4]
Motivation
• Consumer information
– Product reviews
(Is the customer who wrote this email satisfied or dissatisfied?)
• Marketing
– Consumer attitudes
– Trends
(Based on a sample of tweets, how are people responding to this ad campaign/product
release/news item?)
• Politics
– Politicians want to know voters’ views
– Voters want to know politicians’ stances and who else supports them
(How have bloggers' attitudes about the president changed since the election?)
• Social
– Find like-minded individuals or communities
Ref[4][5]
Challenges
• People express opinions in complex ways
• In opinion texts, lexical content alone can be
misleading
• Intra-textual and sub-sentential reversals, negation,
and topic changes are common
• Rhetorical devices and modes such as sarcasm, irony,
and implication are hard to detect
Ref[5]
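The point that lexical content alone can mislead can be shown with a toy Python sketch. Everything here is an assumption for illustration: the tiny word lexicon, the negator list, and both scoring functions are hypothetical and far simpler than the NLP techniques the cited slides describe.

```python
import re
from typing import List

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}
NEGATORS = {"not", "never", "no"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text: str) -> int:
    """Sum word polarities, ignoring context entirely."""
    return sum(LEXICON.get(tok, 0) for tok in tokenize(text))


def negation_aware_score(text: str) -> int:
    """Flip the polarity of the next sentiment word after a negator."""
    score = 0
    negate = False
    for tok in tokenize(text):
        if tok in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(tok, 0)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


print(naive_score("This phone is not good"))             # 1 (misleading)
print(negation_aware_score("This phone is not good"))    # -1
print(negation_aware_score("Oh great, it broke again"))  # 1 (sarcasm is missed)
```

Handling negation fixes the first sentence, but the sarcastic sentence is still scored as positive, which is the kind of challenge listed above.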
References
[1] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEMQFjAC&url=http%3A%2F%2Fwww.gersteinlab.org%2Fcourses%2F545%2F07spr%2Fslides%2FDM_clustering.ppt&ei=GOi2Ue7MDImakAX62YGQDg&usg=AFQjCNFHk7vRJAci6AD_PrgNBFytWJCnSA&sig2=wRKWMuMHDeaSLhox04LI4g
[2] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CFMQFjAD&url=http%3A%2F%2Fweb.cecs.pdx.edu%2F~mperkows%2FCAPSTONES%2F2005%2FL005.Neural_Networks.ppt&ei=OOq2UYmBoWulQXOjYGoCw&usg=AFQjCNGlQtvoUuAoYEuWHcnGFnlzG55yYA&sig2=wv7yax6HyWtxPJdCogjUAg
[3] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEkQFjAC&url=http%3A%2F%2Fwww.math.uaa.alaska.edu%2F~afkjm%2Fcs405%2Fhandouts%2FNN.ppt&ei=OOq2UYmBoWulQXOjYGoCw&usg=AFQjCNEoRa7VHnmdee2HkxsqhK_lCOIjrg&sig2=iNux5xl1vzL2Du79jm3ow
[4] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&ved=0CFgQFjAE&url=http%3A%2F%2Fwww.public.asu.edu%2F~huanliu%2Fdmml_presentation%2F2008%2FSentiment%2BAnalysis.ppt&ei=cuu2UfPlJcXXkgXVoYHIAg&usg=AFQjCNGRQmOjDghJPcXGTrOCqLvjgRyDNw&sig2=T578hj5uiIb5ubsHovZ9rw
[5] http://www.lct-master.org/files/MullenSentimentCourseSlides.pdf
[6] Richard J. Roiger, Michael W. Geatz, “Data Mining: A Tutorial-Based Primer” (2003)
Assume that you are a team of IT
staff and your team is assigned to
provide a cost and benefit
evaluation of the big data solutions.
Evaluation: Association rule learning
(991660)
• Retail
– Better understanding of the correlations between
products or information
– Increased revenue
– Easier inventory management
– Better understanding of customer spending
patterns (market basket analysis)
References: http://en.wikipedia.org/wiki/Association_rule_learning
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Evaluation: Classification
• Customer classification
– Find more potential customers to increase
revenue
• Banking
– Identify the risk level of each loan customer
– Long-term credit ratings (Standard & Poor's)
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
http://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98
Classification: Standard & Poor's
• Long-term credit ratings
– The company rates borrowers on a scale from AAA
to D. Intermediate ratings are offered at each level
between AA and CCC (e.g., BBB+, BBB and BBB-).
For some borrowers, the company may also offer
guidance (termed a "credit watch") as to whether
it is likely to be upgraded (positive), downgraded
(negative) or uncertain (neutral).
References : http://en.wikipedia.org/wiki/Standard_%26_Poor's
http://countryeconomy.com/ratings/taiwan
Evaluation: Cluster analysis
• Find common traits in consumer groups
– Increase marketing effectiveness
– Better understanding of customer behavior
– More detailed classification of customer
characteristics
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
http://zh.wikipedia.org/wiki/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90
Evaluation: Neural networks
(991664)
• Enhanced information processing efficiency
• Assist in establishing cost-effective IT models
– Help to create predictive models
• Stock market prediction model
• Sales volume forecast model
• Weather forecasting model
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
http://zh.wikipedia.org/wiki/%E4%BA%BA%E5%B7%A5%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C
Neural networks: CompStat
• In 1994, Police Commissioner William Bratton
introduced a data-driven management model
in the New York City Police Department called
CompStat.
• CompStat has diffused quickly across the
United States and has become a widely
embraced management model focused on
crime reduction.
• The crime rate was reduced by 27%.
References: http://www.compstat.umd.edu/what_is_cs.php
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Evaluation: Sentiment analysis
• Gives the company a better understanding of
customers' emotions
– Increase marketing effectiveness
– Strengthen corporate image
– Better understanding of customer behavior
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
http://en.wikipedia.org/wiki/Sentiment_analysis
YouTube with sentiment analysis
• Collect users' viewing history to analyze
their preferences.
• Recommend videos related to those
preferences to increase the time users spend on
YouTube.
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
WiseWindow: Mass Opinion Business
Intelligence
• Wise Window, Inc. provides mass opinion
business intelligence solutions. The company
offers Mass Opinion Business Intelligence, a
solution that translates mass opinions expressed
on the Web into actionable data for business.
• It interprets blogs, news reports,
online forums, and social networking sites to
capture the market's instant reactions and opinion
trends for a particular product, service, person,
or news topic (sentiment analysis).
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
http://www.inside.com.tw/2011/03/03/emotion-robot
http://en.wikipedia.org/wiki/WiseWindow
Pattern recognition
(981919)
• The cost of misclassification depends on a number of different factors
• In many applications misclassification costs
are hard to quantify, because they combine monetary costs,
time, and other more subjective costs.
References: http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Pattern recognition
• Broadens the types of information available
– Converts more types of information into a form that can be analyzed
• Voice recognition [2]
• Image recognition
• Handwriting recognition
• Face recognition
• Medical diagnosis [1]
References: [1] http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
[2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Pattern recognition
• The cost of choosing pattern recognition
- It may be very difficult to assign costs
- The assigned costs may be the subjective opinion of an expert
References: [2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
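Pattern recognition
• The benefit of choosing pattern recognition
- Regularization favors simpler models over more complex ones
- The Bayesian approach facilitates a seamless
intermixing of expert knowledge and objective observations
References: [3] http://en.wikipedia.org/wiki/Pattern_recognition
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)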
Optical character recognition (OCR)
• OCR is the mechanical or electronic conversion
of scanned images of handwritten,
typewritten, or printed text into machine-encoded text. It is widely used as a form of
data entry from some sort of original paper
data source, whether documents, sales
receipts, mail, or any number of printed
records.
References: http://en.wikipedia.org/wiki/Optical_character_recognition
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Orc: SinoPac Securities (永豐金控)
• SinoPac Securities selected Orc's trading and
connectivity technology solutions to
strengthen its Asian market trading
capabilities.
• SinoPac uses Orc Trading to enhance its electronic
trading tools for trading, pricing, and risk
management.
References: http://www.orcgroup.com/Global/Additional%20languages/Chinese%20Traditional/SinoPac_Orc_191109_Final_Chinese.pdf
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Anomaly detection
• Reduce the product defect rate
– Reduce cost
• Data flow anomaly detection
– Strengthen information security
– Used to detect whether the system has been hacked
References: [5] http://blog.udn.com/chungchia/3460421#ixzz2VpmDuTBS
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Anomaly detection
• The cost of choosing anomaly detection
Any behavior outside the defined scope of normal behavior is
treated as abnormal, which often produces false positives
that reject legitimate network connections
References:
[4] http://avp.toko.edu.tw/docs/class/3/%E5%85%A5%E4%BE%B5%E5%81%B5%E6%B8%AC%E8%88%87%E9%A0%90%E9%98%B2%E7%B3%BB%E7%B5%B1%E7%B0%A1%E4%BB%8B%E8%88%87%E6%87%89%E7%94%A8.pdf
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Anomaly detection
• The benefit of choosing anomaly detection
- In abuse and network intrusion detection, the interesting
objects are often unexpected bursts of activity rather than rare objects
- This pattern does not adhere to the common
statistical definition of an outlier as a rare object
- A cluster analysis algorithm may be able to detect the
micro-clusters formed by these patterns
References: http://en.wikipedia.org/wiki/Anomaly_detection
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Predictive modeling
• Market price forecasting
– Crude oil price, price of gold
– Stock market investing
• Stock market prediction model
• Sales volume forecast model
• The cost of choosing predictive modeling
– History cannot always predict the future
– The issue of unknown unknowns
– Self-defeat of an algorithm
References: http://en.wikipedia.org/wiki/Predictive_modelling
http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Public Sentiment with Stock Price
• Derwent Capital Markets is known as an early
pioneer in the use of social media sentiment
analysis to trade financial derivatives.
• Reported prediction accuracy: 87.6% (in 2011)
References: http://en.wikipedia.org/wiki/Derwent_Capital_Markets
http://www.forbes.com/sites/tomiogeron/2012/02/28/datasift-launches-historical-twittersearch-for-businesses/
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)
Predictive modeling
• The benefit of choosing predictive modeling
- Moore’s law increases computing capability and drives
down its cost
- This has resulted in an exponentially increasing amount of
scientific data being produced each year
- Helps target customer retention programmes
References: http://opendatagroup.com/predictive-analytics-faq/
http://en.wikipedia.org/wiki/Predictive_modelling
胡世忠, 雲端時代的殺手級應用：海量資料分析 (Killer Applications of the Cloud Era: Massive Data Analysis)