3A G3 Big Data
981919 黃于庭, 991619 鍾佳琳, 991635 陸雨新, 991660 魏松毅, 991604 林右千, 991632 游智鈞, 991637 杜韋霆, 991664 梅耀文, 991616 李嘉芸, 991634 陳鈺玟, 991648 何冠儀

Question a: Describe its possible definitions
991637 杜韋霆

What is big data?
With advances in science and technology, we automatically create large amounts of data every day. These data are generated from many sources, such as:
• sensors used to gather climate information
• posts to social media sites
• digital pictures and videos
• purchase transaction records
• cell phone GPS signals
We can call this kind of data “big data”.
Ref: Speed of Business --- IBM http://www-01.ibm.com/software/data/bigdata/

Definitions
• Wiki: Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
  Ref: http://en.wikipedia.org/wiki/Big_data
• Gartner: “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  Ref: http://www.gartner.com/it-glossary/big-data/
• Doug Laney: Data sets where the three Vs (volume, velocity and variety) present specific challenges in managing these data sets.
  Ref: http://www.isaca.org/Knowledge-Center/Blog/Lists/Posts/Post.aspx?ID=299
• Webopedia: Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.
  Ref: http://www.webopedia.com/TERM/B/big_data.html

Definitions (continued)
• Andrew Brust: We can safely say that Big Data is about the technologies and practice of handling data sets so large that conventional database management systems cannot handle them efficiently, and sometimes cannot handle them at all.
• John Rauser: Any amount of data that's too big to be handled by one computer.
• Techopedia.com: Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured or time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.
And more!
Ref: http://www.opentracker.net/article/25-definitions-big-data

Definitions (conclusion)
There are many different definitions of big data, but most of them make the same points:
1. The data sets are very large.
2. They are hard to handle with commonly used software tools.
3. Data processing time is important.
4. There are many types of data.

Why is it so important?
Big data issues are important because:
1. The data sets that companies gather are more than just words; they also include video and images.
2. The methods that generate data are different from those of the past.
3. Manual or on-hand tools are not efficient enough.
4. Companies require fast reaction and accuracy.
5. Data are useless before they are processed.
Ref: http://www.arthurtoday.com/2012/01/big-data.html

What are its characteristics?
991632 游智鈞
• There are many definitions of big data, but the origins of the term come from a 2001 paper by Doug Laney of Meta Group, which defines big data as data sets that have three Vs: Volume, Velocity and Variety.
• Most people talk about the 3Vs, but some talk about 4Vs, adding a fourth V, “Veracity”.
• IBM proposed a concept of 3I: Instrumented, Interconnected and Intelligent.

Batch: Batch processing is not continuous processing of data; it is used for very large files. The files to be transmitted are gathered over a period and then sent together as a batch.
[2]
Reference:
[2] http://www.datasciencecentral.com/forum/topics/the-3vs-thatdefine-big-data
[3] http://contest.trendmicro.com/2013/tw/train.htm

4V (Volume, Velocity, Variety, Veracity)
• Volume: Many factors contribute to the increase in data volume: past transaction records, daily data collected from sensors, data created by social media, etc.
• Velocity: How fast data is created and how fast it must be processed. The sooner a business can process its data, the more benefit the data can bring.
• Variety: Today's data come in a variety of formats.
  1. Structured data: database data (trial balance, financial report, general information)
  2. Semi-structured data: email, blog posts
  3. Unstructured data: text, video, photo, audio [4]
• Veracity: Because data can come from anywhere, you cannot guarantee that the data are correct or that any given data benefit the enterprise. It is therefore important for enterprises to obtain useful data and analyze it.
[5]
Reference:
[4] 胡世忠, 雲端時代的殺手級應用-海量資料分析
[5] http://skyfollow.com/big-datavelocity-comparisons-incoming-ratechart/

3I (Instrumented, Interconnected, Intelligent)
• Instrumented: A huge change in data sources. We place sensors on many things so that people can perceive the physical world more sensitively and more comprehensively. E.g.: smart meter
• Interconnected: A huge change in the way data is transmitted. We use sensors, RFID and other communication technologies to communicate between objects.
• Intelligent: A huge change in the way data is used. E.g.: the Watson supercomputer
Reference:
[6] http://www.ibm.com/smarterplanet/ie/en/overview/ideas/

Conclusion
Big data does not only mean a large volume of data; it also signals that life has entered another level, bringing more convenient and more intelligent choices.

Question b: What are the possible challenges and opportunities of big data?
991604 林右千, 991616 李嘉芸, 991619 鍾佳琳

Challenges - Understand & Use
991604 林右千
• The challenge is how we can understand and use big data when it comes in an unstructured format, such as text or video.
• Unstructured data is a generic label for describing any corporate information that is not in a database. Unstructured data can be textual or non-textual.
  – Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages.
  – Non-textual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files.
References from:
http://spotfire.tibco.com/blog/?p=6793
http://searchbusinessanalytics.techtarget.com/definition/unstructured-data
Direct quote from: http://www.slideshare.net/Hadoop_Summit/hadoopsopportunity-to-power-nextgeneration-architectures
• For example, as social media applications like Twitter and Facebook go mainstream, the growth of unstructured data is expected to far outpace the growth of structured data.
• In customer-facing businesses, the information contained in unstructured data can be analyzed to improve customer relationship management and relationship marketing.
References from: http://searchbusinessanalytics.techtarget.com/definition/unstructured-data

Opportunities - Government
• Opportunities for government fall into three areas, each with an example:
  1. Improve administrative efficiency - ACSSA
  2. Combat and prevent crime - Memphis PD
  3. Improve traffic problems - Stockholm
Direct quote from: http://www.alamedasocialservices.org/public/index.cfm
References from: 胡世忠. 雲端時代的殺手級應用： Big Data海量資料分析. 臺北市: 天下雜誌股份有限公司. 2013: 9789862416730
Direct quote from: http://www.memphispolice.org/

Improve administrative efficiency - ACSSA
• Alameda County is the seventh largest county in California.
The Alameda County Social Services Agency (ACSSA) provides social services to as many as 140,000 people living below the poverty line, with 19,000 actively managed cases.
• The antiquated systems it was using could not keep up with the need for information, which meant that the agency’s understanding of what was happening out in the community lagged weeks or even months behind actual events.
• ACSSA teamed with IBM to deploy an information management system that combined analytics with business intelligence to give workers an agency-wide, comprehensive view of individual cases.
References from: IBM The SmarterCities Leadership Series. Smarter Government Services.
• Outcome:
  1) ACSSA now has an average annual savings of nearly $25M.
  2) Real-time understanding of case and program status enables workers to find the best assistance programs for each situation.
  3) Real-time tracking reveals relationships between benefit recipients and programs, helping to eliminate waste, fraud and redundancy.
  4) Reports are generated in minutes instead of weeks or months.
  5) The system has increased the productivity and win rates of agency lawyers who defend the agency when a claimant appeals their discontinuation of benefits, which saves the agency $900,000 annually.
References from: IBM The SmarterCities Leadership Series. Smarter Government Services.

Combat and prevent crime - Memphis PD
• Memphis PD uses Blue CRUSH (Crime Reduction Utilizing Statistical History) to reduce the crime rate.
• At the heart of Blue CRUSH is a predictive model that incorporates fresh crime data from sources that range from the MPD’s records management system to video cameras monitoring events on the street.
• Blue CRUSH lays bare underlying crime trends in a way that promotes an effective fast response, as well as a deeper understanding of the longer-term factors (like abandoned housing) that affect crime trends.
References from: IBM Smarter Planet Leadership Series. Memphis PD: Keeping ahead of criminals by finding the “hot spots”. 2011
• It happens at the precinct level. Looking at multilayer maps that show crime hot spots, commanders can see not only current activity levels, but also any shifts in such activities that may have resulted from previous changes in policing deployment and tactics. At each weekly meeting, commanders go over these results with their officers to judge what worked, what didn’t and how to adjust tactics in the coming week.
• Outcome:
  1) 30% reduction in serious crime overall, including a 36.8% reduction in crime in one targeted area
  2) 15% reduction in violent crime
  3) 4x increase in the share of cases solved in the MPD’s Felony Assault Unit (FAU), from 16 percent to nearly 70 percent
  4) Overall improvement in the ability to allocate police resources in a budget-constrained fiscal environment
References from: IBM Smarter Planet Leadership Series. Memphis PD: Keeping ahead of criminals by finding the “hot spots”. 2011

Improve traffic problems - Stockholm
• The Swedish National Road Administration (SNRA) and the Stockholm City Council announced a trial Congestion Tax.
• The goal was not only to reduce congestion, but also to encourage ancillary benefits, such as improving public transport and alleviating environmental damage. The government’s plan is to devote revenue from the tax to completing a ring road around the city.
• With help from IBM, the solution they came up with was an innovative, high-tech traffic charging system that directly charges drivers who use city center roads during peak business hours.
• How it works: drivers can install simple transponder tags that communicate with receivers at the control points and trigger automatic payment of road use fees. When a vehicle passes a roadside control point during designated congestion hours, it is recognized through its transponder, which is read by sensors.
• In addition, cars passing through these control points are photographed, and the license plate numbers are used to identify vehicles without tags and to provide evidence to support enforcement against non-payers. The information is sent to a computer system that matches the vehicle with its registration data, and a fee is charged to the owner. All of the above steps can be completed within milliseconds.
• Outcome:
  1) Traffic was down nearly 25 percent.
  2) Public transport schedules had to be redesigned because of the increase in speed from reduced congestion.
  3) 40,000 more travelers used Stockholm Transport on an ordinary weekday than the year before, an increase of six percent.
  4) The reduction in traffic has led to a drop in emissions from road traffic of eight to 14 percent in the inner city.
  5) Greenhouse gases such as carbon dioxide have fallen by 40 percent in the inner city.
References from: Driving Change in Stockholm. 2008

Opportunities - Manufacturing
• The opportunities for manufacturing.
Direct quote from: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. June 2011: 78
• Example: Haitai Confectionery & Food Co., Ltd. is a South Korean company whose main business is retail and instant foods, especially confectionery, beverages and ice cream.
Haitai uses a business intelligence and analysis platform to analyze historical data, track changes in supply and demand, and forecast demand. This lets it quickly grasp market demand and reduce days in inventory.
References from: 胡世忠. 雲端時代的殺手級應用：Big Data海量資料分析. 臺北市: 天下雜誌股份有限公司. 2013: 201-202. 9789862416730

Challenge - Privacy and Security
991616 李嘉芸
• In the Big Data era, the Internet constantly releases huge amounts of data. Society benefits from the use of big data, but privacy has nowhere to hide.
• As the amount of data produced, stored and analyzed increases, information about business sales, personal spending habits, identity and so on is stored in various forms.
• Large amounts of data conceal substantial economic and political interests, which surface particularly through data integration, analysis and mining.
• The technological innovations of the Big Data era have also created strong demand, across all sectors of society, for personal privacy protection.
SOURCE: R[1], R[2]

In the Big Data era, personal privacy can be invaded in the following ways:
• During data storage: users cannot know the exact storage location of their data, and they lose control over the collection, storage, use and sharing of personal data.
• During data transmission: because transmission is more open and diverse, data may leak or be eavesdropped on.
• During data destruction: the data may already have been backed up, so destruction may be incomplete.
SOURCE: R[1], R[2]

How to strengthen the protection of personal privacy:
1. Include personal information protection in national strategy and planning.
2. Build a complete personal privacy protection law: we need to create a personal privacy protection law and basic rules.
In addition, we should actively promote legislation related to privacy protection to reduce violations of personal privacy.
3. Strengthen the technical protection of personal privacy: encourage the development of privacy protection technologies that prevent personal data from being processed in unnecessary or undesirable ways, and that let users know where their data are stored and how they are processed.
SOURCE: R[1], R[2]

Opportunities - Energy
• High oil and electricity prices keep sustainable energy issues alive and make big data analysis increasingly important in the energy industry.
• Big data analysis directly affects profit, so many in the industry have already installed intelligent monitoring equipment. It collects large amounts of data in real time for simulation and analysis, and the results are used to increase productivity and reduce costs.
• The energy industry uses big data analysis in the following ways:
  1. Mobile data integration: a power company can analyze consumers' activity patterns and comments on its website, and then develop services that better match users' needs.
  2. Data-linked thermostats: thermostats can record and transmit the electricity consumed by adjusting the temperature in a user's home. Each thermostat generates tens of thousands of records a month. Used well, these data help power companies regulate electricity and encourage users to change their consumption habits.
  3. Studying the charging habits of electric vehicle owners: by tracking and analyzing owners' charging habits, the power company can learn in which periods people use more electricity, and encourage users to charge during off-peak hours.
SOURCE: R[3]

References
• R[1], 大數據時代個人隱私保護刻不容緩 http://big5.ce.cn/gate/big5/www.ce.cn/xwzx/gnsz/gdxw/201212/20/t20121220_23958532.shtml
• R[2], 大數據時代﹕數據開放更注重個人隱私保護 http://big5.gmw.cn/g2b/IT.gmw.cn/2013-04/12/content_7292096.htm
• R[3], 雲端時代的殺手級應用-Big Data 海量資料分析, 胡世忠, 天下雜誌股份有限公司, 2013/03/08

Challenges - Storage
991619 鍾佳琳
• "Big data" refers to data sets that are too large to be captured, handled, analyzed or stored in an appropriate timeframe using traditional infrastructures.

Unit        Size
Bit         1 or 0
Byte        8 bits
Kilobyte    1,000 bytes
Megabyte    1,000 KB
Gigabyte    1,000 MB
Terabyte    1,000 GB
Petabyte    1,000 TB
Exabyte     1,000 PB
Zettabyte   1,000 EB
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes
Source from: R[1], R[2]

• Storage is especially challenging because there are many different kinds of data that need to be stored.
• Persistence
  – Many big data applications involve regulatory compliance that dictates data be saved for years or decades.
  – Medical information is often saved for the life of the patient. Financial information is typically saved for seven years.
  – Big data users are also saving data longer because it’s part of an historical record or used for time-based analysis. This requirement for longevity means storage manufacturers need to include on-going integrity checks and other long-term reliability features, as well as address the need for data-in-place upgrades.
Source from: R[2]
• Storage must evolve
  – Big data has outgrown its own infrastructure, and it’s driving the development of storage, networking and computer systems designed to handle its specific requirements.
Source from: R[2]

Types of data:
– Structured Data
  • Data that resides in fixed fields within a record or file.
– Semi-structured Data
  • XML, E-mail, Blog
– Unstructured Data
  • pictures, digital audio, video, Word, pdf
Source from: R[7]

Structured Data (Traditional)
[Figure: users of a traditional data warehouse: Business User, Data Warehouse Administrator, Business Analyst]
Direct quote from: R[3]
• Traditionally, data processing for analytic purposes followed a fairly static blueprint. Namely, through the regular course of business, enterprises create modest amounts of structured data with stable data models via enterprise applications like CRM, ERP and financial systems.
• Data integration tools are used to extract, transform and load the data from enterprise applications and transactional databases to a staging area where data quality and data normalization occur and the data is modeled into neat rows and tables.
• The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine usually occurs on a scheduled basis, usually daily or weekly, sometimes more frequently.
Source from: R[3]

User: how they use the traditional data warehouse
• Data Warehouse Administrator: Create and schedule regular reports to run against normalized data stored in the warehouse, which are distributed to the business. They also create dashboards and other limited visualization tools for executives and management.
• Business Analyst: Use data analytics tools/engines to run advanced analytics against the warehouse, or more often against sample data migrated to a local data mart due to size limitations.
• Business User: Perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence tools from vendors like SAP BusinessObjects and IBM Cognos.
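The extract, transform and load routine described for the traditional warehouse can be illustrated with a small sketch. This is only an assumption-laden toy, not the tooling the cited source describes: the CSV export, the `sales` table, the cleaning rules and every function name are hypothetical, and an in-memory SQLite database stands in for the enterprise data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from an enterprise application (e.g. a CRM system).
RAW_EXPORT = """customer,amount,currency
 Alice ,100.50,usd
BOB,20,USD
alice,,usd
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform (staging area): apply data quality and normalization rules."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # data quality rule: skip rows with a missing amount
        cleaned.append((row["customer"].strip().title(),
                        float(row["amount"]),
                        row["currency"].upper()))
    return cleaned

def load(rows, conn):
    """Load: write the modeled rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

# In-memory SQLite stands in for the data warehouse (an assumption).
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_EXPORT)), conn)

# A "regular report" of the kind an administrator might schedule.
report = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(report)
```

In a real deployment the three steps would run on a schedule (daily or weekly, as noted above) rather than once at import time.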
Source from: R[3]

Semi-structured Data & Unstructured Data
• Hadoop is an open source framework for processing, storing and analyzing massive amounts of distributed, unstructured data.
• It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
• Fundamental concept
  – Hadoop breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time.
Source from: R[3]

Opportunities - Healthcare
• Healthcare opportunities divide into five broad categories: clinical operations, public health, payment/pricing, new business models, and R&D.
Source from: R[4], R[7]

Categories and how to apply them:
• Clinical operations – Clinical decision support systems: The current generation of such systems analyzes physician entries and compares them against medical guidelines to alert for potential errors such as adverse drug reactions or events. By deploying these systems, providers can reduce adverse reactions and lower treatment error rates and liability claims, especially those arising from clinical mistakes.
• Payment/pricing – Health Economics and Outcomes Research and performance-based pricing plans: Patients would obtain improved health outcomes with a value-based formulary and gain access to innovative drugs at reasonable costs.
Source from: R[4], R[7]
• R&D – Personalized medicine: The objective of this lever is to examine the relationships among genetic variation, predisposition for specific diseases, and specific drug responses and then to account for the genetic variability of individuals in the drug development process.
• New business models – Online platforms and communities: Examples of this business model in practice include Web sites such as PatientsLikeMe.com, where individuals can share their experience as patients in the system.
• Public health – Be better prepared for emerging diseases and outbreaks: This lever offers numerous benefits, including a smaller number of claims and payouts, thanks to a timely public health response that would result in a lower incidence of infection.
Source from: R[4], R[7]
• Example: This type of Big Data healthcare company is focused on “Increasing Awareness”. A mobile app called Asthmapolis is an example of this type. A mobile sensor device is attached to an asthma inhaler, which then monitors where and when asthma attacks happen. The device wirelessly synchronizes with an iOS/Android app, allowing users to track their triggers and symptoms.
Source from: R[5], R[6]

References
• R[1], The Wall Street Journal, January 21, 2013 http://online.wsj.com/article/SB10001424127887323468604578245540627666664.html
• R[2], Storage for big data, page2 & page5, April 2, 2012 http://searchstorage.techtarget.com/magazineContent/Storage-for-big-data?pageNo=1
• R[3], Big Data: Hadoop, Business Analytics and Beyond, April 16, 2013 http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
• R[4], McKinsey Global Institute, May 2011 http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
• R[5], How Big Data Is Improving Healthcare, October 2, 2012 http://readwrite.com/2012/10/02/how-big-data-is-improving-healthcare
• R[6], ASTHMAPOLIS http://asthmapolis.com/
• R[7], 雲端時代的殺手級應用-Big Data 海量資料分析, 胡世忠, 天下雜誌股份有限公司, 2013/03/08

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions
Problem: How to analyze and apply big data
Pattern recognition, Classification, Anomaly detection
991648 何冠儀

Pattern recognition
• What is a pattern? [1]
  – A pattern is a picture, a string of characters, a set of symbols, a sequence of signals, etc.
• What is pattern recognition?
  – The act of taking in raw data and taking an action based on the “category” of the pattern. [1]
  – Pattern recognition is a science of "decision". [2]
• Process [2]
  – Feature: a characteristic of the sample
  – Training sample: a sample used to build the system
  – Test sample: a sample used to test the accuracy of the system
• Applications [1][2]
  – Biometric authentication: fingerprint, voice print
  – Voice recognition: analyzing the content of what a speaker says
  – Medical image analysis: X-rays, nuclear medicine imaging
  – Wireless telecommunication analysis: determining how many wireless networks are present in a space
  – Satellite image analysis: determining which areas are grassland, river, sand, buildings, etc.
  – Handwriting recognition: identifying handwritten text

Classification
• What is classification? [3][4]
  – It is used to group items based on certain key characteristics.
  – It classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data.
• Purposes [5]
  – Analyze the factors that affect how data are classified
  – Predict the category of data (class label)
• The process of classification [5][6]
  1. Establish the model:
    – Use the existing data to find classification models.
    – Examples: decision trees, classification rules
  2. Assess the model:
    – Existing information will be divided into two groups: training samples and testing samples.
    – First phase: use the training samples to build the model
    – Second phase: use the test samples to evaluate the accuracy of the model
  3. Use the model:
    – Find out the reasons behind the classification of the data
    – Predict the class of new data
• Algorithms [7]
  – Support vector machines: supervised learning models with associated learning algorithms that analyze data and recognize patterns. [8]
  – Neural networks: an interconnected group of artificial neurons that processes information using a connectionist approach to computation. [9]
  – Kernel estimation: a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. [10]
  – Decision trees: create a model that predicts the value of a target variable based on several input variables. [11]
• Applications [7]
  – Speech recognition: the translation of spoken words into text. [12]
  – Biological classification: a method of scientific taxonomy used to group and categorize organisms into groups such as genus or species. [13]
  – Credit scoring: a numerical expression based on a statistical analysis of a person's credit files, to represent the creditworthiness of that person. [14]

Anomaly detection
• What is anomaly detection?
  – Anomalies?
    The set of data points that are considerably different from the remainder of the data. [15]
  – Anomaly detection usually produces a large number of false alarms. [16]
  – Anomalies are also referred to as exceptions or deviations. [17]
• Common causes of anomalies [19]
  – Data from different classes: objects differ because they are of a different type or class
  – Natural variation: data sets are modeled by statistical distributions, which admit variation in the data
  – Data measurement and collection errors: errors in data collection or during the measurement process
• Categories [17]
  1. Unsupervised:
    – No labels assumed
    – Based on the assumption that anomalies are very rare compared to normal data
  2. Supervised:
    – Labels available for both normal data and anomalies
    – Similar to rare class mining
  3. Semi-supervised:
    – Labels available only for normal data
• Techniques [18]
  1. Model-based:
    – Build a model of the data.
    – Anomalies are objects that do not fit the model.
  2. Proximity-based:
    – Define a proximity measure between objects.
    – Anomalies are objects that are distant from most of the other objects.
  3. Density-based:
    – Estimate the density of objects.
    – Anomalies are objects that are in regions of low density.
• Applications [18]
  – Intrusion detection: monitoring systems and networks for unusual behavior
  – Fraud detection: looking for buying patterns that differ from typical behavior
  – System health monitoring: using unusual symptoms or test results to indicate potential health problems
  – Detecting ecosystem disturbances: trying to predict events like hurricanes and floods
  – Public health: using medical statistics reports for diagnosis

References
[1] http://www.ie.ksu.edu.tw/ie1/100ie/web/files/download/2011.11.15.%E6%9C%B1%E5%AE%B6%E5%BE%B7.pdf
[2] http://nthur.lib.nthu.edu.tw/dspace/handle/987654321/4878
[3] http://zh.scribd.com/doc/137177757/Statistical-PatternRecognition-2nd-Ed
[4] http://www.wisegeek.com/what-is-a-data-miningclassification.htm
[5] http://sls.weco.net/node/10936
[6] http://faculty.stust.edu.tw/~jehuang/DMCourse/ch5-3.html
[7] http://en.wikipedia.org/wiki/Statistical_classification
[8] http://en.wikipedia.org/wiki/Support_vector_machine
[9] http://en.wikipedia.org/wiki/Artificial_neural_networks
[10] http://en.wikipedia.org/wiki/Kernel_density_estimation
[11] http://en.wikipedia.org/wiki/Decision_tree_learning
[12] http://en.wikipedia.org/wiki/Speech_recognition
[13] http://en.wikipedia.org/wiki/Biological_classification
[14] http://en.wikipedia.org/wiki/Credit_scoring
[15] http://www.slideshare.net/guest76d673/chap10-anomalydetection
[16] 經濟部九十年度科技專案, 國家資通安全技術服務計劃, 入侵偵測系統簡介, 陳培德 (國立成功大學電機所博士候選人)
[17] www.siam.org/meetings/sdm08/TS2.ppt
[18] http://www.cli.di.unipi.it/~tamberi/old/docs/tdm/anomalydetection.pdf

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions
Association rule learning & Predictive modeling
991634 陳鈺玟

Association rule learning
• What is association rule learning?
  – Association rules describe frequent co-occurrences in sets.
  – Association rule learning was first used by major supermarket chains to discover interesting relations between products.
  – A set of techniques for discovering interesting relationships among variables in data.
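As a minimal sketch of how frequent co-occurrences and rule strength might be measured, the code below computes two standard quantities for a rule such as {butter, bread} → {milk}: support (the share of all transactions that contain every item in the rule) and confidence (among transactions that contain the left-hand side, the share that also contain the right-hand side). The receipts and function names are hypothetical, invented only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical toy receipts, not data from any cited source.
receipts = [
    {"butter", "bread", "milk"},
    {"butter", "bread"},
    {"bread", "milk"},
    {"butter", "bread", "milk"},
    {"eggs"},
]

rule_support = support(receipts, {"butter", "bread", "milk"})
rule_confidence = confidence(receipts, {"butter", "bread"}, {"milk"})
print(rule_support, rule_confidence)
```

The same ratios apply to the supermarket figures used in this section: 800 of 100,000 transactions gives a support of 0.8%, and 800 of 2,000 gives a confidence of 40%.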
Ref[1]: http://dataminingintelligence.com/?p=60
Ref[2]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[3]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/

• Basic way of association rule learning:
  – Suppose a supermarket has 100,000 transactions, of which 2,000 include both butter and bread, and 800 of these 2,000 transactions also include milk. For the rule {butter, bread} → {milk}:
  • Support: How many times does this rule apply?
    – 800 times in 100,000 transactions
    – alternatively, 800/100,000 = 0.8%
  • Confidence: How strong is the implication of the rule?
    – 800 times in 2,000 transactions
    – 800/2,000 = 40%
Ref[4]: http://akashrajak.webs.com/%20New%20Folder/Association%20Rule%20MiningApplications%20in%20Various%20Areas.pdf
Ref[5]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[6]: http://en.wikipedia.org/wiki/Association_rule_learning

• Examples:
  – Which products are frequently bought together by customers?
    • DataTable = Receipts x Products
    • onions and potatoes → hamburger
  – Which courses tend to be attended together?
    • DataTable = Students x Courses
    • Programming → Computer Science, Algorithm
Ref[7]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf

• Applications:
  – Market basket analysis:
    • 1. Encourage more purchases: learn whether certain groups of items are consistently purchased together, in order to adjust store layouts
    • 2. Improve efficiency: flag which merchandising efforts are ineffective and which products are not selling
    • 3. Enhance inventory management: eliminate slow-moving items and increase the supply of fast-moving merchandise
    • 4. Extract information about website visitors from logs: a merchant could analyze data on visitor browsing patterns, login counts, past purchase behavior, and responses to promotions, to eliminate what isn't working and focus on what does
Ref[8]: http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants
Ref[9]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
  – Protein sequences:
    • For health care: proteins are important constituents of the cellular machinery of any organism, and they are sequences made up of 20 types of amino acids. Association rule learning can enhance our understanding of protein composition and holds the potential to give clues regarding the global interactions.
  – Census data:
    • For the general public and government: census data hold a huge variety of general statistical information on society. Information from population and economic censuses can be used for forecasting when planning public services such as education, health, transport and funds.
Ref[10]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
Ref[11]: http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants
Ref[12]: http://akashrajak.webs.com/-%20New%20Folder/Association%20Rule%20MiningApplications%20in%20Various%20Areas.pdf

Predictive modeling
• What is predictive modeling?
  – Short definition: using data to make decisions
  – Long definition: using data to take actions and make decisions using models that are statistically valid and empirically derived
  – A process by which a model is created or chosen to try to best predict the probability of an outcome given a set amount of input data.
Ref[13]: http://en.wikipedia.org/wiki/Predictive_modelling
Ref[14]: http://cdn.oreillystatic.com/en/assets/1/event/85/Best%20Practices%20for%20Building%20and%20Deploying%20Predictive%20Models%20over%20Big%20Data%20Presentation.pdf

Predictive modeling
• Basic way of predictive modeling:
– The birth of a predictive model:
• A predictive model is the result of combining data and mathematics.
• To put it formally, data + model technique = predictive model
Ref[15]: http://www.ibm.com/developerworks/library/ba-predictiveanalytics2/
Ref[16]: http://www.ibm.com/developerworks/library/ba-predictiveanalytics2/fig01.gif

Predictive modeling
• Common categories of models:
– Predictive models: estimate how likely an event is
• For example, how likely a credit card transaction is to be fraudulent, how likely a visitor to a web site is to click on an ad, or how likely a company is to go bankrupt.
– Summary models: summarize data
• For example, divide credit card transactions or airline passengers into different groups depending upon their characteristics.
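One hedged way to picture "data + model technique = predictive model" is a straight line fitted by ordinary least squares. This is only one of many possible model techniques and is not taken from the cited sources; the weekly sales figures and the helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to a new input."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical data: units sold in weeks 1 to 5.
weeks = [1, 2, 3, 4, 5]
units = [12, 15, 15, 19, 20]

model = fit_line(weeks, units)    # the "model technique" applied to the "data"
forecast = predict(model, 6)      # the resulting model used on an unseen week
print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}, week 6 forecast={forecast:.2f}")
```

A model that estimates how likely an event is, such as fraud, would swap the straight line for a technique that outputs probabilities, but the fit-then-predict workflow stays the same.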
Ref[17]: http://opendatagroup.com/predictive-analytics-faq/

Predictive modeling
• Applications:
– Financial:
• 1. Optimize availability, allocation and yield of assets
• 2. Improve business outcomes, make better decisions, increase competitiveness
– Operational:
• 1. Exceed service level commitments by increasing speed and reducing risk of failure
• 2. Optimize maintenance schedules around conditions
Ref[18]: http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
Ref[19]: http://www-01.ibm.com/software/data/bigdata/industry-retail.html

Predictive modeling
• Applications:
– Customer relationship management: analyze and understand the products in demand, and predict customers' buying habits in order to target promotions
– Product or economy-level prediction: predicting store-level demand for inventory management purposes, predicting the unemployment rate for the next year
– Clinical decision support systems: experts use these in health care primarily to predict which patients are at risk of developing certain conditions
Ref[20]: http://en.wikipedia.org/wiki/Predictive_modelling
Ref[21]: http://en.wikipedia.org/wiki/Predictive_analytics

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions
Cluster analysis, neural networks, sentiment analysis
991635 陸雨新

Cluster analysis
▲ What is cluster analysis?
• Cluster: a collection of data objects
➢ Similar to one another within the same cluster
➢ Dissimilar to the objects in other clusters
• Cluster analysis
➢ Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
➢ As a stand-alone tool to get insight into data distribution
➢ As a preprocessing step for other algorithms
Ref[1][5]

K-Means Clustering on Big Data
▲ What is k-means?
• Given k, the k-means algorithm is implemented in four steps:
➢ 1. Partition the objects into k nonempty subsets
➢ 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
➢ 3. Assign each object to the cluster with the nearest seed point
➢ 4. Go back to step 2; stop when no assignment changes
Ref[1][5]

The K-Means Clustering Method
• Example
[Figure: k-means with K=2 on a 0 to 10 by 0 to 10 scatter plot. Panels in order: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Ref[1][5]

What are neural networks?
• Connectionism refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain.
• There are many different models, but all include:
➢ Multiple, individual "nodes" or "units" that operate at the same time (in parallel)
➢ A network that connects the nodes together
➢ Learning that can occur through gradual changes in connection strength
Ref[2][3][5]

Feed-forward nets
• Information flow is unidirectional
➢ Data is presented to the input layer
➢ Passed on to the hidden layer
➢ Passed on to the output layer
• Information is distributed
• Information processing is parallel
• Internal representation (interpretation) of data
Ref[2][3][5]

Feed-forward nets
[Figure: fully connected feed-forward network. Input layer: Node 1 (input 1.0), Node 2 (input 0.4), Node 3 (input 0.7). Hidden layer: Node j, Node i. Output layer: Node k.]
Connection weights:
W1j = 0.20    W1i = 0.10
W2j = 0.30    W2i = –0.10
W3j = –0.10   W3i = 0.20
Wjk = 0.10    Wik = 0.50
Ref[2][3][5]

Neural Network Input Format
\[ \text{newValue} = \frac{\text{originalValue} - \text{minimumValue}}{\text{maximumValue} - \text{minimumValue}} \]
where
newValue: the computed value falling in the [0,1] interval range
originalValue: the value to be converted
minimumValue: the smallest possible value for the attribute
maximumValue: the largest possible attribute value
Ref[2][3][5]

The Sigmoid Function (Output)
\[ f(x) = \frac{1}{1 + e^{-x}} \]
where e is the base of natural logarithms, approximated by 2.718282.
Input to node j:
\[ (\text{node}_1 \times W_{1j}) + (\text{node}_2 \times W_{2j}) + (\text{node}_3 \times W_{3j}) = (1 \times 0.2) + (0.4 \times 0.3) + (0.7 \times -0.1) = 0.25 \]
Output of node j:
\[ \text{node}_j = f(0.25) \approx 0.562 \]
Ref[2][3][5]

Sentiment Analysis
• Sentiment
➢ A thought, view, or attitude, especially one based mainly on emotion instead of reason
• Sentiment analysis
➢ Also known as opinion mining
➢ The use of natural language processing (NLP) and computational techniques to automate the extraction or classification of sentiment from typically unstructured text
Ref[4]

Motivation
• Consumer information
➢ Product reviews (Is the customer who wrote this email satisfied or dissatisfied?)
• Marketing
➢ Consumer attitudes
➢ Trends (Based on a sample of tweets, how are people responding to this ad campaign/product release/news item?)
• Politics
➢ Politicians want to know voters' views
➢ Voters want to know politicians' stances and who else supports them (How have bloggers' attitudes about the president changed since the election?)
• Social
➢ Find like-minded individuals or communities
Ref[4][5]

Challenges
• People express opinions in complex ways
• In opinion texts, lexical content alone can be misleading
• Intra-textual and sub-sentential reversals, negation, and topic change are common
• Rhetorical devices/modes such as sarcasm, irony, implication, etc.
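The negation challenge listed above can be made concrete with a minimal sketch, assuming a tiny hand-made word list. The lexicon, the negation rule and the function name `lexicon_score` are hypothetical illustrations, not a method from the cited slides; real systems use far richer NLP.

```python
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "unhappy"}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text, handle_negation=True):
    """Sum word polarities; optionally flip the next sentiment word after a negation."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")
        if word in NEGATIONS or word.endswith("n't"):
            # Remember the negation until the next sentiment-bearing word.
            negate = handle_negation
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


review = "The service was not good"
naive = lexicon_score(review, handle_negation=False)   # counts "good" as praise
aware = lexicon_score(review)                          # flips "good" after "not"
print(naive, aware)
```

Counting words alone reads the review as positive, while the negation-aware pass reads it as negative. Sarcasm and irony would still defeat both, which is why lexical content alone can mislead.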
Ref[5]

References
[1] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEMQFjAC&url=http%3A%2F%2Fwww.gersteinlab.org%2Fcourses%2F545%2F07spr%2Fslides%2FDM_clustering.ppt&ei=GOi2Ue7MDImakAX62YGQDg&usg=AFQjCNFHk7vRJAci6AD_PrgNBFytWJCnSA&sig2=wRKWMuMHDeaSLhox04LI4g
[2] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CFMQFjAD&url=http%3A%2F%2Fweb.cecs.pdx.edu%2F~mperkows%2FCAPSTONES%2F2005%2FL005.Neural_Networks.ppt&ei=OOq2UYmBoWulQXOjYGoCw&usg=AFQjCNGlQtvoUuAoYEuWHcnGFnlzG55yYA&sig2=wv7yax6HyWtxPJdCogjUAg
[3] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEkQFjAC&url=http%3A%2F%2Fwww.math.uaa.alaska.edu%2F~afkjm%2Fcs405%2Fhandouts%2FNN.ppt&ei=OOq2UYmBoWulQXOjYGoCw&usg=AFQjCNEoRa7VHnmdee2HkxsqhK_lCOIjrg&sig2=iNux5xl1vzL2Du79jm3ow
[4] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&ved=0CFgQFjAE&url=http%3A%2F%2Fwww.public.asu.edu%2F~huanliu%2Fdmml_presentation%2F2008%2FSentiment%2BAnalysis.ppt&ei=cuu2UfPlJcXXkgXVoYHIAg&usg=AFQjCNGRQmOjDghJPcXGTrOCqLvjgRyDNw&sig2=T578hj5uiIb5ubsHovZ9rw
[5] http://www.lct-master.org/files/MullenSentimentCourseSlides.pdf
[6] Richard J. Roiger, Michael W. Geatz, "Data Mining: A Tutorial-Based Primer" (2003)

Assume that you are a team of IT staff and your team is assigned to provide a cost and benefit evaluation of the big data solutions.
Evaluation: Association rule learning (991660)
• Retail
– Better understanding of the correlations between products or pieces of information
– Effectively increases income
– Easier inventory management
– Better understanding of customer spending patterns (market basket analysis)
References: http://en.wikipedia.org/wiki/Association_rule_learning
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Evaluation: Classification
• Customer classification
– Find more potential customers to increase income
• Banking
– Identify risky loan customers at each level
– Long-term credit ratings (Standard & Poor's)
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
http://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98

Classification: Standard & Poor's
• Long-term credit ratings
– The company rates borrowers on a scale from AAA to D. Intermediate ratings are offered at each level between AA and CCC (e.g., BBB+, BBB and BBB-). For some borrowers, the company may also offer guidance (termed a "credit watch") as to whether the rating is likely to be upgraded (positive), downgraded (negative) or uncertain (neutral).
References: http://en.wikipedia.org/wiki/Standard_%26_Poor's
http://countryeconomy.com/ratings/taiwan

Evaluation: Cluster analysis
• Find common traits in consumer groups
– Increase marketing effectiveness
– Better understanding of customer behavior
– More detailed classification of customer characteristics
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
http://zh.wikipedia.org/wiki/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90

Evaluation: Neural networks (991664)
• Enhanced information processing efficiency
• Assist in establishing cost-effective IT modules
– Help to create predictive modules
• Stock market prediction module
• Sales volume forecast module
• Weather forecasting module
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
http://zh.wikipedia.org/wiki/%E4%BA%BA%E5%B7%A5%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C

Neural networks: CompStat
• In 1994, Police Commissioner William Bratton introduced a data-driven management model in the New York City Police Department called CompStat.
• CompStat has diffused quickly across the United States and has become a widely embraced management model focused on crime reduction.
• Reported reduction in the crime rate: 27%.
References: http://www.compstat.umd.edu/what_is_cs.php
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Evaluation: Sentiment analysis
• Gives the company a better understanding of customers' emotions
– Increase marketing effectiveness
– Strengthen corporate image
– Better understanding of customer behavior
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
http://en.wikipedia.org/wiki/Sentiment_analysis

YouTube with sentiment analysis
• Collect users' past viewing records to analyze their preferences.
• Recommend videos related to each user's preferences to increase the time the user spends on YouTube.
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

WiseWindow: Mass Opinion Business Intelligence
• Wise Window, Inc. provides mass opinion business intelligence solutions. The company offers Mass Opinion Business Intelligence, a solution that translates mass opinions expressed on the Web into actionable data for business.
• It interprets blogs, news reports, online forums and social networking sites to capture the market's instant reactions and opinion trends regarding a particular product, service, person or news topic (sentiment analysis).
References: 胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
http://www.inside.com.tw/2011/03/03/emotion-robot
http://en.wikipedia.org/wiki/WiseWindow

Pattern recognition (981919)
• Depends on a number of different factors
• In many applications misclassification costs are hard to quantify, because they combine monetary costs, time and other more subjective costs.
References: http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Pattern recognition
• More types of information
– Transform more types of information for analysis
• Voice recognition [2]
• Image recognition
• Handwriting recognition
• Face recognition
• Medical diagnosis [1]
References: [1] http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
[2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Pattern recognition
• The cost of choosing pattern recognition
– It may be very difficult to assign costs
– The costs may be the subjective opinion of an expert
References: [2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Pattern recognition
• The benefit of choosing pattern recognition
– Regularization favors simpler models
– The Bayesian approach facilitates a seamless intermixing of expert knowledge and objective observations
References: [3] http://en.wikipedia.org/wiki/Pattern_recognition
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Optical character recognition (OCR)
• OCR is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.
It is widely used as a form of data entry from some sort of original paper data source, whether documents, sales receipts, mail, or any number of printed records.
References: http://en.wikipedia.org/wiki/Optical_character_recognition
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Orc: SinoPac Securities (永豐金控)
• SinoPac Securities selected Orc's trading and connectivity technology solutions to strengthen its Asian market trading capabilities.
• It uses Orc Trading to enhance its electronic trading tools for trading, pricing and risk management.
References: http://www.orcgroup.com/Global/Additional%20languages/Chinese%20Traditional/SinoPac_Orc_191109_Final_Chinese.pdf
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Anomaly detection
• Reduce the product defect rate
– Reduce cost
• Data flow anomaly detection
– Strengthen information security
– Used to detect whether the system has been hacked
References: [5] http://blog.udn.com/chungchia/3460421#ixzz2VpmDuTBS
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Anomaly detection
• The cost of choosing anomaly detection
– Any behavior outside the scope of normal behavior is treated as abnormal, which often produces false positives that refuse normal network connections
References: [4] http://avp.toko.edu.tw/docs/class/3/%E5%85%A5%E4%BE%B5%E5%81%B5%E6%B8%AC%E8%88%87%E9%A0%90%E9%98%B2%E7%B3%BB%E7%B5%B1%E7%B0%A1%E4%BB%8B%E8%88%87%E6%87%89%E7%94%A8.pdf
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Anomaly detection
• The benefit of choosing anomaly detection
– In the context of abuse and network intrusion detection, the interesting objects are often unexpected bursts of activity rather than rare objects
– This pattern does not adhere to the common statistical definition of an outlier as a rare object
– A cluster analysis algorithm may be able to detect the micro clusters formed by these patterns
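As a hedged illustration of data-flow anomaly detection, the sketch below flags time windows whose request count deviates strongly from the mean, using a plain z-score. The traffic counts, the 2.5 threshold and the function name `detect_anomalies` are hypothetical; the slides do not prescribe this method, and the z-score is only the "common statistical definition" baseline, not the cluster-based approach.

```python
from statistics import mean, pstdev


def detect_anomalies(counts, threshold=2.5):
    """Return indices of windows whose z-score magnitude exceeds threshold."""
    mu = mean(counts)
    sigma = pstdev(counts)          # population standard deviation
    if sigma == 0:
        return []                   # perfectly flat traffic: nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Hypothetical requests per minute; window 7 contains a burst.
requests_per_minute = [102, 98, 101, 97, 103, 99, 100, 480, 101, 99]

flagged = detect_anomalies(requests_per_minute)
print(flagged)
```

The threshold sets the trade-off described under the cost of anomaly detection: a lower value catches more intrusions but also refuses more normal connections.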
References: http://en.wikipedia.org/wiki/Anomaly_detection
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Predictive modeling
• Market price forecast
– Crude oil price, price of gold
– Stock market investing
• Stock market prediction module
• Sales volume forecast module
• The cost of choosing predictive modeling
– History cannot always predict the future
– The issue of unknown unknowns
– Self-defeat of an algorithm
References: http://en.wikipedia.org/wiki/Predictive_modelling
http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Public Sentiment with Stock Price
• Derwent Capital Markets is known as an early pioneer in the use of social media sentiment analysis to trade financial derivatives.
• Prediction accuracy rate: 87.6% (in 2011)
References: http://en.wikipedia.org/wiki/Derwent_Capital_Markets
http://www.forbes.com/sites/tomiogeron/2012/02/28/datasift-launches-historical-twittersearch-for-businesses/
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)

Predictive modeling
• The benefit of choosing predictive modeling
– Moore's law increases computing capability and drives down its cost
– An exponentially increasing amount of scientific data is produced each year
– Supports customer retention programmes
References: http://opendatagroup.com/predictive-analytics-faq/
http://en.wikipedia.org/wiki/Predictive_modelling
胡世忠, 雲端時代的殺手級應用：海量資料分析 (The Killer Application of the Cloud Era: Big Data Analytics)
