Unit 5 Analytics • Business Analytics

Business analytics is the use of analytical methods, either manual or automated, to derive relationships from data.

CLUSTER ANALYSIS

What is cluster analysis?
• Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects (e.g., respondents, products, or other entities) based on the characteristics they possess. It is a means of grouping records based upon attributes that make them similar. If plotted geometrically, the objects within a cluster will be close together, while the distance between clusters will be farther apart.
• Cluster Variate - a mathematical representation of the selected set of variables on which the objects' similarities are compared.
• In cluster analysis the grouping is based on distance (proximity); you do not know in advance who or what belongs to which group, or even the number of groups.

Applications:
• Psychiatry - characterizing patients on the basis of clusters of symptoms can be useful in identifying an appropriate form of therapy.
• Biology - used to find groups of genes that have similar functions.
• Information retrieval - the World Wide Web consists of billions of Web pages, and a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars and theaters. Each category (cluster) can be broken into subcategories (sub-clusters), producing a hierarchical structure that further assists a user's exploration of the query results.
• Climate - understanding the Earth's climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.

Common Roles Cluster Analysis can play:
• Data reduction - a researcher may be faced with a large number of observations that are meaningless unless classified into manageable groups. Cluster analysis can perform this data reduction objectively by reducing the information from an entire population or sample to information about specific groups.
• Hypothesis generation - cluster analysis is also useful when a researcher wishes to develop hypotheses concerning the nature of the data or to examine previously stated hypotheses.

Most Common Criticisms of CA
• Cluster analysis is descriptive, atheoretical, and noninferential. Cluster analysis has no statistical basis upon which to draw inferences from a sample to a population. There is no guarantee of unique solutions, because the cluster membership for any number of solutions depends upon many elements of the procedure, and many different solutions can be obtained by varying one or more of those elements.
• Cluster analysis will always create clusters, regardless of the actual existence of any structure in the data. When using cluster analysis, the researcher is making an assumption of some structure among the objects. The researcher should always remember that just because clusters can be found does not validate their existence. Only with strong conceptual support and then validation are the clusters potentially meaningful and relevant.
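The last criticism can be illustrated directly. Below is a minimal sketch (assuming Python with NumPy and SciPy are available; the data are random and purely illustrative, not from the course material) showing that a clustering procedure will return a three-cluster "solution" even when the data contain no structure at all.

```python
# Sketch: hierarchical clustering returns clusters even for structureless data.
# Assumes Python with NumPy and SciPy installed; the data are purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(seed=1)
points = rng.uniform(0, 10, size=(50, 2))       # 50 random points with no real group structure

tree = linkage(points, method="ward")           # build an agglomerative hierarchy anyway
labels = fcluster(tree, t=3, criterion="maxclust")  # ask for a three-cluster solution

print(np.bincount(labels)[1:])  # three non-empty "clusters" are produced despite no real structure
```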
Most Common Criticisms of CA
• The cluster solution is not generalizable because it is totally dependent upon the variables used as the basis for the similarity measure. This criticism can be made against any statistical technique, but cluster analysis is generally considered more dependent on the measures used to characterize the objects than other multivariate techniques, with the cluster variate completely specified by the researcher. As a result, the researcher must be especially cognizant of the variables used in the analysis, ensuring that they have strong conceptual support.

Objectives of cluster analysis
• Cluster analysis is used for:
– Taxonomy description. Identifying groups within the data.
– Data simplification. The ability to analyze groups of similar observations instead of all individual observations.
– Relationship identification. The simplified structure from cluster analysis portrays relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be observed when selecting clustering variables:
– Only variables that relate specifically to the objectives of the cluster analysis are included.
– The variables selected characterize the individuals (objects) being clustered.

How does cluster analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups. To accomplish this task, we must address three basic questions:
– How do we measure similarity?
– How do we form clusters?
– How many groups do we form?

Measuring Similarity
• Similarity represents the degree of correspondence among objects across all of the characteristics used in the analysis. It is a set of rules that serve as criteria for grouping or separating items.
– Correlational measures. Less frequently used; large values of r do indicate similarity.
– Distance measures. Most often used as a measure of similarity, with higher values representing greater dissimilarity (distance between cases), not similarity.

[Figure: two profile charts, Graph 1 and Graph 2, plotting values across Categories 1-4; Graph 1 represents the higher level of similarity.]
• Both graphs have the same correlation (r = 1), which implies the same pattern, but the distances (d) are not equal.

Distance Measures - several distance measures are available, each with specific characteristics:
• Euclidean distance. The most commonly recognized measure, often referred to as straight-line distance.
• Squared Euclidean distance. The sum of the squared differences without taking the square root.
• City-block (Manhattan) distance. Uses the sum of the variables' absolute differences.
• Chebychev distance. The maximum of the absolute differences in the clustering variables' values. Frequently used when working with metric (or ordinal) data.
• Mahalanobis distance (D²). A generalized distance measure that accounts for the correlations among variables in a way that weights each variable equally.

Illustration: Simple Example
• Suppose a marketing researcher wishes to determine market segments in a community based on patterns of loyalty to brands and stores. A small sample of seven respondents is selected as a pilot test of how cluster analysis is applied. Two measures of loyalty, V1 (store loyalty) and V2 (brand loyalty), were measured for each respondent on a 0-10 scale.

How do we measure similarity?
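A minimal sketch of the distance computation follows (assuming Python with NumPy and SciPy; the V1/V2 scores used here are illustrative values chosen to be consistent with the proximity matrix shown next, not values taken from the original slides).

```python
# Sketch: computing the Euclidean distance (proximity) matrix for the seven respondents.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative (V1, V2) loyalty scores for respondents A-G; these are assumed values
# consistent with the proximity matrix shown below, not taken from the slides.
scores = np.array([
    [3, 2],   # A
    [4, 5],   # B
    [4, 7],   # C
    [2, 7],   # D
    [6, 6],   # E
    [7, 7],   # F
    [6, 4],   # G
])

# Pairwise straight-line (Euclidean) distances between all seven respondents.
distance_matrix = squareform(pdist(scores, metric="euclidean"))
print(np.round(distance_matrix, 3))   # e.g. d(A, B) = sqrt((3-4)^2 + (2-5)^2) = 3.162
```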
Proximity Matrix of Euclidean Distances Between Observations

Observation    A      B      C      D      E      F      G
A             ---
B            3.162   ---
C            5.099  2.000   ---
D            5.099  2.828  2.000   ---
E            5.000  2.236  2.236  4.123   ---
F            6.403  3.606  3.000  5.000  1.414   ---
G            3.606  2.236  3.606  5.000  2.000  3.162   ---

How do we form clusters?
• SIMPLE RULE:
– Identify the two most similar (closest) observations not already in the same cluster and combine them.
– We apply this rule repeatedly to generate a number of cluster solutions, starting with each observation as its own "cluster" and then combining two clusters at a time until all observations are in a single cluster. This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range of cluster solutions. It is also an agglomerative method because clusters are formed by combining existing clusters.

AGGLOMERATIVE PROCESS AND CLUSTER SOLUTION

Step | Minimum Distance Between Unclustered Observations | Observation Pair | Cluster Membership | Number of Clusters | Overall Similarity Measure (Average Within-Cluster Distance)
Initial solution | --- | --- | (A) (B) (C) (D) (E) (F) (G) | 7 | 0
1 | 1.414 | E-F | (A) (B) (C) (D) (E-F) (G) | 6 | 1.414
2 | 2.000 | E-G | (A) (B) (C) (D) (E-F-G) | 5 | 2.192
3 | 2.000 | C-D | (A) (B) (C-D) (E-F-G) | 4 | 2.144
4 | 2.000 | B-C | (A) (B-C-D) (E-F-G) | 3 | 2.234
5 | 2.236 | B-E | (A) (B-C-D-E-F-G) | 2 | 2.896
6 | 3.162 | A-B | (A-B-C-D-E-F-G) | 1 | 3.420

• In steps 1, 2, 3 and 4, the overall similarity measure does not change substantially, which indicates that we are forming other clusters with essentially the same heterogeneity as the existing clusters (for example, at step 2 the measure is the average of the E-F, E-G and F-G distances).
• When we get to step 5, we see a large increase. This indicates that joining clusters (B-C-D) and (E-F-G) resulted in a single cluster that was markedly less homogeneous.

How many groups do we form?
• The three-cluster solution of step 4 therefore seems the most appropriate for a final cluster solution, with two equally sized clusters, (B-C-D) and (E-F-G), and a single outlying observation (A). This approach is particularly useful in identifying outliers, such as observation A. It also depicts the relative size of the varying clusters, although it becomes unwieldy when the number of observations increases.

Graphical Portrayals
• Dendrogram - graphical representation (tree graph) of the results of a hierarchical procedure. Starting with each object as a separate cluster, the dendrogram shows graphically how the clusters are combined at each step of the procedure until all are contained in a single cluster.

Outliers: Removed or Retained?
• Outliers can severely distort the representativeness of the results if they appear as structure (clusters) inconsistent with the objectives.
– They should be removed if the outliers represent:
• Aberrant observations not representative of the population
• Observations of small or insignificant segments within the population that are of no interest to the research objectives
– They should be retained if they represent an undersampling or poor representation of relevant groups in the population; in that case the sample should be augmented to ensure representation of these groups.

Detecting Outliers
• Outliers can be identified based on the similarity measure by:
– Finding observations with large distances from all other observations.
– Graphic profile diagrams highlighting outlying cases.
– Their appearance in cluster solutions as single-member or very small clusters.
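The agglomerative procedure and the dendrogram described above can be reproduced with standard tools. The sketch below assumes Python with NumPy, SciPy and Matplotlib, and reuses the same illustrative V1/V2 scores as before; it uses single linkage, the nearest-neighbour rule implied by the merge distances in the agglomerative process table.

```python
# Sketch: reproducing the agglomerative (single-linkage) procedure for the
# seven-respondent example. Assumes the illustrative (V1, V2) scores used earlier
# and Python with NumPy, SciPy and Matplotlib available.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

labels = ["A", "B", "C", "D", "E", "F", "G"]
scores = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

tree = linkage(pdist(scores), method="single")   # each row: cluster i, cluster j, distance, size
print(tree[:, 2].round(3))                       # merge distances: 1.414, 2.0, 2.0, 2.0, 2.236, 3.162

three_clusters = fcluster(tree, t=3, criterion="maxclust")
print(dict(zip(labels, three_clusters)))         # groups correspond to (A), (B-C-D) and (E-F-G)

dendrogram(tree, labels=labels)                  # tree graph of the clustering steps
```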
Sample Size
• The researcher should ensure that the sample size is large enough to provide sufficient representation of all relevant groups in the population.
• The researcher must therefore be confident that the obtained sample is representative of the population.

Standardizing the Data
• Clustering variables that use widely differing numbers of scale points or that exhibit large differences in standard deviations should be standardized.
– The most common standardization is conversion to Z scores (mean of 0 and standard deviation of 1).

Deriving Clusters
There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
• Hierarchical Cluster Analysis
• Nonhierarchical Cluster Analysis
• Combination of Both Methods

Hierarchical Cluster Analysis
• This stepwise procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that is either agglomerative or divisive, resulting in the construction of a hierarchy or tree-like structure (dendrogram) depicting the formation of the clusters. This is one of the most straightforward methods.
• HCA is preferred when:
– The sample size is moderate (under 300-400, not exceeding 1,000).

Two Basic Types of HCA
• Agglomerative Algorithm
• Divisive Algorithm
*Algorithm - defines how similarity is measured between multiple-member clusters in the clustering process.

Agglomerative Algorithm
• A hierarchical procedure that begins with each object or observation in a separate cluster. In each subsequent step, the two clusters that are most similar are combined to build a new aggregate cluster. The process is repeated until all objects are finally combined into a single cluster: from n clusters to 1.
• Similarity decreases during successive steps. Clusters cannot be split.

Divisive Algorithm
• Begins with all objects in a single cluster, which is then divided at each step into two additional clusters that contain the most dissimilar objects. The single cluster is divided into two clusters, then one of these clusters is split, for a total of three clusters. This continues until all observations are in single-member clusters: from 1 cluster to n sub-clusters.

Dendrogram / Tree Graph
[Figure: a dendrogram (tree graph), with the divisive method and the agglomerative method shown as movement in opposite directions along the tree.]

Agglomerative Algorithms
• Among numerous approaches, the five most popular agglomerative algorithms are (the Mahalanobis distance discussed earlier is a distance measure that can be used with them, not a linkage algorithm):
– Single Linkage
– Complete Linkage
– Average Linkage
– Centroid Method
– Ward's Method

• Single Linkage
– Also called the nearest-neighbor method; defines similarity between clusters as the shortest distance from any object in one cluster to any object in the other.
• Complete Linkage
– Also known as the farthest-neighbor method.
– The opposite approach to single linkage: it assumes that the distance between two clusters is based on the maximum distance between any two members of the two clusters.
• Average Linkage
– The distance between two clusters is defined as the average distance between all pairs of the two clusters' members.
• Centroid Method
– Cluster centroids are the mean values of the observations on the clustering variables.
– The distance between two clusters equals the distance between their centroids.
• Ward's Method
– The similarity between two clusters is the sum of squares within the clusters, summed over all variables.
– Ward's method tends to join clusters with a small number of observations, and it is strongly biased toward producing clusters with the same shape and roughly the same number of observations.

Hierarchical Cluster Analysis
• Hierarchical cluster analysis provides an excellent framework with which to compare any set of cluster solutions.
• This method helps in judging how many clusters should be retained or considered.

TEXT MINING

Mining Text Data: An Introduction
Data mining / knowledge discovery can operate over several kinds of data, for example:
• Structured data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)
• Multimedia: Loans($200K, [map], ...)
• Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
• Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...

"Search" versus "Discover"

                          Structured Data    Unstructured Data (Text)
Search (goal-oriented)    Data Retrieval     Information Retrieval
Discover (opportunistic)  Data Mining        Text Mining
© 2002, AvaQuest Inc.

Data Retrieval
• Find records within a structured database.
– Database type: Structured
– Search mode: Goal-driven
– Atomic entity: Data record
– Example information need: "Find a Japanese restaurant in Boston that serves vegetarian food."
– Example query: "SELECT * FROM restaurants WHERE city = boston AND type = japanese AND has_veg = true"
© 2002, AvaQuest Inc.

Information Retrieval
• Find relevant information in an unstructured information source (usually text).
– Database type: Unstructured
– Search mode: Goal-driven
– Atomic entity: Document
– Example information need: "Find a Japanese restaurant in Boston that serves vegetarian food."
– Example query: "Japanese restaurant Boston", or Boston -> Restaurants -> Japanese
© 2002, AvaQuest Inc.

Data Mining
• Discover new knowledge through analysis of data.
– Database type: Structured
– Search mode: Opportunistic
– Atomic entity: Numbers and dimensions
– Example information need: "Show the trend over time in the number of visits to Japanese restaurants in Boston."
– Example query: "SELECT SUM(visits) FROM restaurants WHERE city = boston AND type = japanese ORDER BY date"
© 2002, AvaQuest Inc.

Text Mining
• Discover new knowledge through analysis of text.
– Database type: Unstructured
– Search mode: Opportunistic
– Atomic entity: Language feature or concept
– Example information need: "Find the types of food poisoning most often associated with Japanese restaurants."
– Example query: Rank diseases found associated with "Japanese restaurants"
© 2002, AvaQuest Inc.

Motivation for Text Mining
• Approximately 90% of the world's data is held in unstructured or semi-structured formats; only about 10% is structured, numerical or coded information (source: Oracle Corporation).
• Information-intensive business processes demand that we move beyond simple document retrieval to "knowledge" discovery.
© 2002, AvaQuest Inc.

Challenges of Text Mining
• Very high number of possible "dimensions"
– All possible word and phrase types in the language!
• Unlike data mining:
– records (= documents) are not structurally identical
– records are not statistically independent
• Complex and subtle relationships between concepts in text
• Ambiguity and context sensitivity
– automobile = car = vehicle = Toyota
– Apple (the company) or apple (the fruit)
© 2002, AvaQuest Inc.
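The term and document statistics introduced in the next section can be sketched in a few lines. The example below is a minimal, illustrative sketch in plain Python over a tiny invented corpus (the three "documents" are made up for illustration); it also shows how quickly the number of text "dimensions" (distinct terms) grows.

```python
# Sketch: term frequency and document frequency over a tiny invented corpus,
# illustrating the basic statistics used in text processing and the rapid growth
# of the number of dimensions (distinct terms). Plain Python only.
from collections import Counter

documents = [
    "the customer is unhappy with the statement format",
    "customer requests a daily balance on the statement",
    "angry call about the new statement",
]

tokenized = [doc.lower().split() for doc in documents]

term_frequency = [Counter(tokens) for tokens in tokenized]           # counts within each document
document_frequency = Counter(term for tokens in tokenized
                             for term in set(tokens))                # number of documents containing each term
vocabulary = sorted(document_frequency)                              # one "dimension" per distinct term

print(len(vocabulary), "dimensions for only", len(documents), "short documents")
print(term_frequency[0]["statement"], document_frequency["statement"])
```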
Text Processing
• Statistical Analysis
– Quantify text data
• Language or Content Analysis
– Identifying structural elements
– Extracting and codifying meaning
– Reducing the dimensions of text data
© 2002, AvaQuest Inc.

Statistical Analysis
• Use statistics to add a numerical dimension to unstructured text:
– Term frequency
– Document frequency
– Term proximity
– Document length
© 2002, AvaQuest Inc.

Content Analysis
• Lexical and syntactic processing
– Recognizing "tokens" (terms)
– Normalizing words
– Language constructs (parts of speech, sentences, paragraphs)
• Semantic processing
– Extracting meaning
– Named entity extraction (people names, company names, locations, etc.)
• Extra-semantic features
– Identifying feelings or sentiment in text
• Goal = dimension reduction
© 2002, AvaQuest Inc.

Syntactic Processing
• Lexical analysis
– Recognizing word boundaries
– A relatively simple process in English
• Syntactic analysis
– Recognizing larger constructs
– Sentence and paragraph recognition
– Parts-of-speech tagging
– Phrase recognition
© 2002, AvaQuest Inc.

Named Entity Extraction
• Identify and type language features. Examples:
– People names
– Company names
– Geographic location names
– Dates
– Monetary amounts
– Others (domain specific)
© 2002, AvaQuest Inc.

Simple Entity Extraction
• "The quick brown fox jumps over the lazy dog"
[Figure: the noun phrases "quick brown fox" and "lazy dog" are recognized and typed, e.g. Mammal, Canidae.]
© 2002, AvaQuest Inc.

Entity Extraction in Use
• Categorization
– Assign structure to unstructured content to facilitate retrieval
• Summarization
– Get the "gist" of a document or document collection
• Query expansion
– Expand query terms with related "typed" concepts
• Text mining
– Find patterns, trends and relationships between concepts in text
© 2002, AvaQuest Inc.

Extra-semantic Information
• Extracting hidden meaning or sentiment based on the use of language.
– Example: "Customer is unhappy with their service!" -> sentiment = discontent
• Sentiment includes:
– Emotions: fear, love, hate, sorrow
– Feelings: warmth, excitement
– Mood, disposition, temperament, ...
• Or even (someday): lies, sarcasm
© 2002, AvaQuest Inc.

Text Mining: General Applications
• Relationship analysis
– If A is related to B, and B is related to C, there is potentially a relationship between A and C.
• Trend analysis
– Occurrences of A peak in October.
• Mixed applications
– Co-occurrences of A together with B peak in November.
© 2002, AvaQuest Inc.

Text Mining: Business Applications
• Example 1: Decision support in CRM
– What are customers' typical complaints?
– What is the trend in the number of satisfied customers in England?
• Example 2: Knowledge management - people finder
• Example 3: Personalization in e-commerce
– Suggest products that fit a user's interest profile (even based on personality information).
© 2002, AvaQuest Inc.

Example 1: Decision Support using Bank Call Center Data
• The needs:
– Analysis of call records as input into the decision-making process of the bank's management
– Quick answers to important questions:
• Which offices receive the most angry calls?
• Which products have the fewest satisfied customers?
• ("Angry" and "Satisfied" are recognizable sentiments)
– A user-friendly interface and visualization tools
© 2002, AvaQuest Inc.

Example 1: Decision Support using Bank Call Center Data
• The information source:
– Call center records
– Example: AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, "mr stark has been with the company for about 20 yrs.
He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account."
© 2002, AvaQuest Inc.

Commercial Tools
• IBM Intelligent Miner for Text
• Semio Map
• InXight LinguistX / ThingFinder
• LexiQuest
• ClearForest
• Teragram
• SRA NetOwl Extractor
• Autonomy
© 2002, AvaQuest Inc.

User Interfaces for Text Mining
• We need some way to present the results of text mining in an intuitive, easy-to-manage form. Options:
– Conventional text "lists" (1D)
– Charts and graphs (2D)
– Advanced visualization tools (3D+): network maps, landscapes, 3D "spaces"
© 2002, AvaQuest Inc.

• Charts and graphs: http://www.cognos.com/
• Visualization, network maps: http://www.lexiquest.com/
• Visualization, landscapes: http://www.aurigin.com/
© 2002, AvaQuest Inc.

Basic Measures for Text Retrieval
[Figure: within the set of all documents, the relevant set and the retrieved set overlap in the "relevant & retrieved" documents.]
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
• For example, if 80 documents are retrieved and 60 of them are relevant, while 100 relevant documents exist in the collection, then precision = 60/80 = 0.75 and recall = 60/100 = 0.60.

Types of Text Data Mining
• Keyword-based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a common source
• Link analysis: unusual correlations between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: finding information that violates usual patterns
• Hypertext analysis
– Patterns in anchors/links
– Anchor text correlations with the linked objects

Text Categorization
• Pre-given categories and labeled document examples (the categories may form a hierarchy).
• Classify new documents.
• A standard classification (supervised learning) problem.
[Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, ..., Science.]

Applications
• News article classification
• Automatic email filtering
• Web page classification
• Word sense disambiguation
• ...

IN-MEMORY ANALYTICS

• Oracle Exalytics hardware is delivered as a server that is optimally configured for in-memory analytics for business intelligence workloads.
• Multiple Oracle Exalytics machines can be clustered together to expand available memory capacity and to provide high availability.
• Oracle Exalytics includes several terabytes of DRAM, distributed across multiple processors, with a high-speed processor interconnect architecture designed to provide single-hop access to all memory.
• This memory is backed by several terabytes of flash storage and hard disks. These stores can be further backed by network-attached storage, extending the capacity of the machine and providing high data availability.
• The memory and storage are matched with the right amount of compute capacity.
• A high-performance business intelligence system requires fast connectivity to data warehouses, operational systems and other data sources.
• Multiple network interfaces are also required for deployments that demand high availability, load balancing and disaster recovery.
• Oracle Exalytics provides dedicated high-speed InfiniBand ports for connecting to Oracle Exadata. These operate at a quad data rate of 40 Gb/s and provide extreme performance for connecting to both data warehousing and OLTP data sources.
• Several 10 Gb/s Ethernet ports are also available for high-speed connectivity to other data sources and clients. Where legacy network interconnectivity is desired, multiple 1 Gb/s compatible ports are also available.
• The Oracle Exalytics In-Memory Machine features an optimized Oracle Business Intelligence Foundation Suite and Oracle TimesTen In-Memory Database for Exalytics.
• Business Intelligence Foundation takes advantage of the large memory, processors, concurrency, storage, networking, operating system, kernel, and system configuration of the Oracle Exalytics hardware.
• This optimization results in better query responsiveness and higher user scalability.
• Oracle TimesTen In-Memory Database for Exalytics is an optimized in-memory analytic database, with features exclusively available on the Oracle Exalytics platform.
• Oracle Exalytics engineers hardware and software designed to work together seamlessly.

Oracle Exalytics Software Overview
• Oracle Exalytics is designed to run the entire stack of Oracle Business Intelligence and Enterprise Performance Management applications. The following software runs on Exalytics:
1. Oracle Business Intelligence Foundation Suite, including Oracle Essbase
2. Oracle TimesTen In-Memory Database for Exalytics
3. Oracle Endeca Information Discovery
4. Oracle Exalytics In-Memory Software

Oracle Business Intelligence Foundation Suite
• Delivers the most complete, open, and integrated business intelligence platform on the market today.
• Provides comprehensive capabilities for business intelligence, including enterprise reporting, dashboards, ad hoc analysis, multi-dimensional OLAP, scorecards, and predictive analytics on an integrated platform.
• Includes the industry's best-in-class server technology for relational and multi-dimensional analysis and delivers a rich end-user experience that includes visualization, collaboration, alerts and notifications, search and mobile access.
• Includes Oracle Essbase, the industry-leading multidimensional OLAP server for analytic applications.

Oracle TimesTen In-Memory Database for Exalytics
• A memory-optimized, full-featured relational database with persistence.
• TimesTen stores all its data in memory in optimized data structures and supports query algorithms specifically designed for in-memory processing.
• SQL programming interfaces provide real-time data management, fast response times, and very high throughput for a variety of workloads.
• Specifically enhanced for analytical processing at in-memory speeds.
• Columnar compression: reduces the memory footprint for in-memory data. Compression ratios of 5X are practical and help expand in-memory capacity. Analytic algorithms are designed to operate directly on compressed data, further speeding up in-memory analytic queries.

Oracle Exalytics In-Memory Software
• A collection of in-memory features, optimizations and configuration, certified and supported only on the Oracle Exalytics In-Memory Machine.
• Oracle Exalytics In-Memory Software consists of:
1. Oracle OBIEE In-Memory Accelerator
2. Oracle Essbase In-Memory Accelerator
3. Oracle In-Memory Data Caching
4. Oracle BI Publisher In-Memory Accelerator

Oracle OBIEE In-Memory Accelerator
• Hardware-specific optimizations to Oracle Business Intelligence Enterprise Edition (included in Oracle Business Intelligence Foundation Suite).
• The optimizations tune OBIEE specifically for the Oracle Exalytics hardware, for example for the processor architecture, number of cores, and memory latency and its variations.
• Up to 3X improvement in throughput at high loads, so it can handle 3X more users than similarly configured commodity hardware.

Oracle Essbase In-Memory Accelerator
• Leverages processor concurrency, memory latency, memory distribution and the availability of flash storage, and provides improvements to overall storage-layer performance and enhancements to parallel operations.
• Up to 4X faster query execution, as well as reductions in writeback and calculation operations, including batch processes.
• Flash storage provides 25X faster Oracle Essbase data load performance and 10X faster Oracle Essbase calculations when compared to hard disk drives.

In-Memory Data Caching
• There are multiple ways to use the large memory available with Oracle Exalytics:
– Oracle In-Memory Intelligent Result Cache
– Oracle In-Memory Adaptive Data Mart
– Oracle In-Memory Table Caching

Oracle In-Memory Intelligent Result Cache
• A completely reusable in-memory cache, populated with the results of previous logical queries generated by the server.
• Provides data for repeated queries.
• For best query acceleration, Oracle Exalytics provides tools to analyze usage and to identify and automate the pre-seeding of result caches. Pre-seeding ensures instant responsiveness for queries at run time.

Oracle In-Memory Adaptive Data Mart
• Identifies and creates a data mart for the relevant "hot" data.
• Implementing the in-memory data mart in Oracle TimesTen for Exalytics provides the most effective improvement in query responsiveness for large data sets.
• Tests with customer data have shown a 20X reduction in query response times as well as a 5X increase in throughput.
• Automated recommendation: creating a data mart for query acceleration has often required expensive and error-prone manual research to determine which data aggregates or cubes from subject areas to cache in memory. Oracle Exalytics dramatically reduces, and in many cases eliminates, the costs associated with this manual tuning by providing tools that identify, create and refresh the best-fit in-memory data mart for a specific business intelligence deployment.

In-Memory Table Caching
• Selected facts and dimensions from analytic data warehouses may be cached "as is" into the Oracle TimesTen In-Memory Database on Exalytics.
• In many cases, especially with enterprise business intelligence data warehouses, including the pre-packaged Business Intelligence Applications provided by Oracle, the entire data warehouse may fit entirely in memory.
• In such cases, replication tools like Oracle GoldenGate and ETL tools like Oracle Data Integrator can be used to implement table-caching mechanisms into the Oracle TimesTen In-Memory Database for Exalytics.
• Once tables are cached, Oracle Business Intelligence metadata can be configured to route queries to the TimesTen In-Memory Database, providing faster query responses for the cached data.

Oracle BI Publisher In-Memory Accelerator
• Oracle Business Intelligence Publisher features piped document generation and high-volume bursting that enable the generation of an extremely large number of personalized reports and documents for business users in time periods that were previously not achievable.
• This is enabled via a configuration setting and the use of Exalytics' substantial compute, memory, and multithreading capabilities. Performance can be further optimized by configuring BI Publisher to use memory disks (for non-clustered environments only) or solid-state drives.

IN-DATABASE ANALYTICS

• Refers to the integration of data analytics into data warehousing functionality.
• Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods.
• Traditional approaches to data analysis require data to be moved out of the database into a separate analytics environment for processing, and then back to the database (SPSS from IBM is an example of a tool that still does this today).
• Doing the analysis in the database, where the data resides, eliminates the costs, time and security issues associated with the old approach by doing the processing in the data warehouse itself.
• "Big data" is one of the primary reasons it has become important to collect, process and analyze data efficiently and accurately.
• The need for in-database processing has become more pressing as the amount of data available to collect and analyze continues to grow exponentially (due largely to the rise of the Internet), from megabytes to gigabytes, terabytes and petabytes.
• Also, the speed of business has accelerated to the point where a performance gain of nanoseconds can make a difference in some industries. As more people and industries use data to answer important questions, the questions they ask become more complex, demanding more sophisticated tools and more precise results.
• All of these factors in combination have created the need for in-database processing. The introduction of the column-oriented database, specifically designed for analytics, data warehousing and reporting, has helped make the technology possible.

Types
There are three main types of in-database processing:
1. Translating a model into SQL code;
2. Loading C or C++ libraries into the database process space as built-in user-defined functions (UDFs);
3. Loading out-of-process libraries, typically written in C, C++ or Java, and registering them in the database as built-in UDFs callable from a SQL statement.

1. Translating models into SQL code
• In this type of in-database processing, a predictive model is converted from its source language into SQL that can run in the database, usually in a stored procedure.
• Many analytic model-building tools can export their models in either SQL or PMML (Predictive Model Markup Language).
• Once the SQL is loaded into a stored procedure, values can be passed in through parameters and the model is executed natively in the database.
• Tools that can use this approach include SAS, SPSS, R and KXEN.

2. Loading C or C++ libraries into the database process space
• With C or C++ UDF libraries that run in process, the functions are typically registered as built-in functions within the database server and called like any other built-in function in a SQL statement.
• Running in process allows the function to have full access to the database server's memory, parallelism and processing-management capabilities. Because of this, the functions must be well-behaved so as not to negatively impact the database or the engine.
• This type of UDF gives the highest performance of any method for OLAP, mathematical, statistical, univariate-distribution and data mining algorithms.

3. Out-of-process libraries
• Out-of-process UDFs are typically written in C, C++ or Java.
• By running out of process, they do not pose the same risk to the database or the engine, because they run in their own process space with their own resources.
• They would not be expected to have the same performance as an in-process UDF.
• They are registered in the database engine and called through standard SQL, usually in a stored procedure.
• Out-of-process UDFs are a safe way to extend the capabilities of a database server and are an ideal way to add custom data mining libraries.

Uses
• In-database processing makes data analysis more accessible and relevant for high-throughput, real-time applications including fraud detection, credit scoring, risk management, transaction processing, pricing and margin analysis, usage-based micro-segmenting, behavioral ad targeting and recommendation engines, such as those used by customer service organizations to determine next-best actions.

GOOGLE ANALYTICS

• A web analytics service offered by Google that tracks and reports website traffic.
• Google launched the service in November 2005.
• Google Analytics is now the most widely used web analytics service on the Internet.

History
• Google acquired Urchin Software Corp. in April 2005. Google's service was developed from Urchin on Demand. The system also brings ideas from Adaptive Path, whose product, Measure Map, was acquired and used in the redesign of Google Analytics in 2006. Google continued to sell the standalone, installable Urchin WebAnalytics Software through a network of value-added resellers until its discontinuation on March 28, 2012.
• The Google-branded version was rolled out in November 2005 to anyone who wished to sign up. However, due to extremely high demand for the service, new sign-ups were suspended only a week later.
• As capacity was added to the system, Google began using a lottery-type invitation-code model. Prior to August 2006 Google was sending out batches of invitation codes as server availability permitted; since mid-August 2006 the service has been fully available to all users, whether they use Google for advertising or not.
• The newer version of the Google Analytics tracking code is known as the asynchronous tracking code, which Google claims is significantly more sensitive and accurate, and is able to track even very short activities on the website. The previous version delayed page loading and so, for performance reasons, it was generally placed just before the </body> closing HTML tag. The new code can be placed between the <head>...</head> HTML head tags because, once triggered, it runs in parallel with page loading.
• In April 2011 Google announced the availability of a new version of Google Analytics featuring multiple dashboards, more custom report options, and a new interface design. This version was later updated with other features such as real-time analytics and goal flow charts.
• In October 2012 the latest version of Google Analytics was announced, called Universal Analytics. The key differences from the previous versions were cross-platform tracking, flexible tracking code to collect data from any device, and the introduction of custom dimensions and custom metrics.

Features
• Integrated with AdWords, users can review online campaigns by tracking landing page quality and conversions (goals).
Goals might include sales, lead generation, viewing a specific page, or downloading a particular file.
• Google Analytics' approach is to show high-level, dashboard-type data for the casual user, and more in-depth data further into the report set. Google Analytics can identify poorly performing pages with techniques such as funnel visualization, and can show where visitors came from, how long they stayed and their geographical position. It also provides more advanced features, including custom visitor segmentation.
• Google Analytics e-commerce reporting can track sales activity and performance. The e-commerce reports show a site's transactions, revenue, and many other commerce-related metrics.
• On September 29, 2011, Google Analytics launched Real-Time analytics.
• AdWords (Google AdWords) is an advertising service by Google for businesses wanting to display ads on Google and its advertising network. The AdWords program enables businesses to set a budget for advertising and pay only when people click the ads. The ad service is largely focused on keywords.
• Google Analytics includes Google Website Optimizer, rebranded as Google Analytics Content Experiments.
• The Google Analytics cohort analysis feature helps in understanding the behavior of component groups of users apart from the user population as a whole. It is very beneficial to marketers and analysts for the successful implementation of a marketing strategy.
• Google Analytics is implemented with "page tags", in this case called the Google Analytics Tracking Code (GATC), which is a snippet of JavaScript code that the website owner adds to every page of the website. The tracking code runs in the client browser when the client browses the page (if JavaScript is enabled in the browser), collects visitor data and sends it to a Google data-collection server as part of a request for a web beacon.
• The tracking code loads a larger JavaScript file from the Google web server and then sets variables with the user's account number. The larger file (currently known as ga.js) is typically 18 KB. The file does not usually have to be loaded, however, due to browser caching: assuming caching is enabled in the browser, ga.js is downloaded only once at the start of the visit. Because all websites that implement Google Analytics with the ga.js code use the same master file from Google, a browser that has previously visited any other website running Google Analytics will already have the file cached on the machine.
• In addition to transmitting information to a Google server, the tracking code sets first-party cookies (if cookies are enabled in the browser) on each visitor's computer. These cookies store anonymous information such as whether the visitor has been to the site before (new or returning visitor), the timestamp of the current visit, and the referrer site or campaign that directed the visitor to the page (e.g., search engine, keywords, banner, or email).
• If the visitor arrived at the site by clicking on a link tagged with Urchin Traffic Monitor (UTM) codes, such as
http://toWebsite.com?utm_source=fromWebsite&utm_medium=bannerAd&utm_campaign=fundraiser2012
the tag values are passed to the database too.

Limitations
• The Google Analytics for Mobile Package allows Google Analytics to be applied to mobile websites.
The Mobile Package contains server-side tracking codes that use PHP, JavaServer Pages, ASP.NET, or Perl as the server-side language.
• Many ad-filtering programs and browser extensions (such as Firefox's Adblock and NoScript) can block the Google Analytics Tracking Code. This prevents some traffic and users from being tracked and leads to holes in the collected data. Also, privacy networks like Tor will mask the user's actual location and present inaccurate geographical data. Some users do not have JavaScript-enabled/capable browsers or turn this feature off. However, these limitations are considered small, affecting only a small percentage of visits.
• The largest potential impact on data accuracy comes from users deleting or blocking Google Analytics cookies. Without cookies being set, Google Analytics cannot collect data. Any individual web user can block or delete cookies, resulting in the loss of data for those visits. Website owners can encourage users not to disable cookies, for example by making visitors more comfortable using the site through posting a privacy policy.
• These limitations affect the majority of web analytics tools that use page tags (usually JavaScript programs) embedded in web pages to collect visitor data, store it in cookies on the visitor's computer, and transmit it to a remote database by pretending to load a tiny graphic "beacon".
• Another limitation of Google Analytics for large websites is the use of sampling in the generation of many of its reports. To reduce the load on their servers and to provide users with a relatively quick response to their queries, Google Analytics limits reports to 500,000 randomly sampled visits at the profile level for its calculations. While margins of error are indicated for the visits metric, margins of error are not provided for any other metrics in the Google Analytics reports. For small segments of data, the margin of error can be very large.

Performance concerns
• There have been several online discussions about the impact of Google Analytics on site performance. However, Google introduced asynchronous JavaScript code in December 2009 to reduce the risk of slowing the loading of pages tagged with the ga.js script.

Privacy issues
• Due to its ubiquity, Google Analytics raises some privacy concerns. Whenever someone visits a website that uses Google Analytics, if JavaScript is enabled in the browser, Google tracks that visit via the user's IP address in order to determine the user's approximate geographic location. To meet German legal requirements, Google Analytics can anonymize the IP address.
• Google has also released a browser plug-in that turns off the sending of data about a page visit to Google. Since this plug-in is produced and distributed by Google itself, it has met with much discussion and criticism. Furthermore, the realisation that Google scripts track user behaviour has spawned the production of multiple, often open-source, browser plug-ins to reject tracking cookies. These plug-ins offer the user a choice of whether to allow Google Analytics (for example) to track his or her activities. However, partially because of new European privacy laws, most modern browsers allow users to reject tracking cookies.
• It has been reported that behind proxy servers and multiple firewalls, errors can occur that change time stamps and register invalid searches.
• Webmasters who seek to mitigate Google Analytics-specific privacy issues can employ a number of alternatives that have their back ends hosted on their own machines. Until its discontinuation, an example of such a product was Urchin WebAnalytics Software from Google itself.
• On January 20, 2015, the Associated Press reported, in an article titled "Government health care website quietly sharing personal data", that HealthCare.gov was providing access to enrollees' personal data to private companies that specialize in advertising. Google Analytics was mentioned in that article.

Assignment / Test
• What is clustering? Explain its types.
• What is text mining?
• What is the difference between in-memory analytics and in-database analytics?