ANALYTICS IN BIG DATA ERA
ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA
MAURIZIO SALUSTI
SAS
Copyright © 2012, SAS Institute Inc. All rights reserved.

NEW QUESTIONS WITH BIG DATA
• Data do not always fit a structured data model
• Often we need to join data that do not share the same keys
• Often data arrive as a periodic flow in real time
• Often we need to recognize patterns in data that change frequently
New ways are needed to manage data that are distributed and not structured in the classical way: we need a different paradigm to organize data and, above all, to query them.
Collecting several sources and managing them opens several new problems:
• Relational data (GRAPH DATA) can be useful to understand how events spread in a population.
• Data in motion coming from several tools in the field (sensor devices) provide dynamic patterns, often without a history of their form.

ANALYSIS
• You cannot always apply sampling to extract data
• You cannot always join data to define the ABT
• Often you need to know how the environment can influence changes in events.
• Often you need to merge information collected in different time windows.
• SQL queries are often useless for reaching these data:
  • Information is not organized into DB structures
  • Data provide information in very different ways: e.g. text is not easy to query using traditional query languages.
  • Merging is driven by fuzzy keys, where you can assign group information according to statistical relationships.
  • Events can be driven by relationships with other data rather than by a specific behavior.
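
The merging on fuzzy keys mentioned in the ANALYSIS slide can be illustrated with a minimal Python sketch. It is not the method used in the presentation: plain string similarity from the standard library stands in for the statistical relationship, and the record layout, the `fuzzy_join` helper and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two lightly normalized keys."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_join(left, right, key, threshold=0.8):
    """Pair each left record with its most similar right record on `key`.

    Records whose best match scores below `threshold` stay unmatched (None).
    The threshold is an assumed value and would need tuning on real data.
    """
    joined = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            score = similarity(l_rec[key], r_rec[key])
            if score > best_score:
                best, best_score = r_rec, score
        match = best if best_score >= threshold else None
        joined.append((l_rec, match, best_score))
    return joined


# Hypothetical sources whose keys do not match exactly.
sensors = [{"name": "Pump Station 12"}, {"name": "Boiler A"}]
registry = [
    {"name": "pump-station 12", "site": "North"},
    {"name": "Turbine 7", "site": "South"},
]

result = fuzzy_join(sensors, registry, "name")
for left_rec, right_rec, score in result:
    print(left_rec["name"], "->", right_rec, f"(score {score:.2f})")
```

An exact SQL join on `name` would match neither record here. The similarity score plays the role of the fuzzy key, and the unmatched record is kept rather than dropped so it can be handled separately.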
SAS PROCEDURES
BIG DATA ALSO REQUIRES SEVERAL METHODOLOGICAL STRATEGIES:
• Methods for pattern recognition coming from statistical inference, using the SEMMA paradigm for supervised and unsupervised data patterns.
• Methods coming from stochastic process analysis, for both continuous time and discrete events, such as diffusion processes or Markov chain processes.
• Time series forecasting: stochastic processes in continuous time with continuous state space
• Multivariate analysis applied to semantic rules to discover text patterns
• Graph analysis

ANALYTICAL CATEGORIES AND TARGET USAGE
Statistics
• Binary target & continuous no. predictions
• Linear, NonLinear, & Mixed Linear modeling
Data Mining
• Complex relationships
• Tree-based Classification
• Variable Selection
Text Mining
• Parsing large-scale text collections
• Extract entities
• Auto. Stemming & synonym detection
Forecasting
• Large-scale, multiple hierarchy problems
Econometrics
• Probability of events
• Severity of random events
Optimization
• Local search optimization
• Large-scale linear & mixed integer problems
• Graph theory

• Data coming from different sources can be tied together using different methods, such as linear or non-linear canonical decomposition.
• Pattern variability in data in motion, such as data coming from devices, can be sampled, or the pattern distribution can be simulated.
• Sparse vector data with missing values can be simulated using particular regression methods.
• Discrete choices among different events can be defined using multinomial discrete models.
• Automatic time series forecasting can consider many series at the same time.

GRAPH ANALYSIS
Network graph analysis can be used to:
• Measure the importance of nodes and the relationships among them.
• Measure changes over time in a network.
• Identify how events spread through the network using particular diffusion processes.
[Figure: network diagram labeling a node and a link]

Scenario
REAL TIME MONITORING SYSTEM:
• Build and manage the behavioral patterns of the measures for each type of sensor, to detect abnormal processes through alarm rules (offline process).
• Build scenarios of how events spread and influence different parts of the system.
• Monitor measures to detect anomalies and check the validity of the rules over time (online process).
• Produce models to predict abnormalities in the medium term.

Scenario
INTEGRATED PROCESS CONTROL:
• Shewhart-type control charts, identifying the role of the history of the measures and the trend-cycle components according to the Box-Jenkins methodology
• Multivariate analysis of processes: the main tool for statistical process control of measures in relation to each other, considering Markov chain processes or diffusion processes
• Classification of system components: machines can be classified according to their behavior and information about their specific characteristics
• Identifying alarm patterns: diagnostic threshold rules identified from the control charts to minimize false alarms, depending on the history of the event to be monitored in real time

ADMINISTRATION SYSTEM: EXAMPLE
• System interface
• Extraction rules DABT
• Pattern recognition and event handling module
• Management of event process thresholds for the alert process
• Measures metadata and classification
• Historical process data storage
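
The offline step that turns historical process data into alert thresholds, in the spirit of the Shewhart-type control charts named in the integrated process control scenario, might look like the following Python sketch. It is a simplification under stated assumptions: the limits are the classic mean plus or minus three sample standard deviations, no Box-Jenkins trend-cycle adjustment is applied, and the function names and readings are hypothetical.

```python
from statistics import mean, stdev


def shewhart_limits(history, k=3.0):
    """Derive lower limit, center line and upper limit from historical data.

    Assumes the history is in control and roughly stationary; k=3 is the
    conventional three-sigma rule, used here as an assumed default.
    """
    center = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, lcl, ucl):
    """Return the positions of readings that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Hypothetical historical measures for one sensor (offline process).
history = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lcl, center, ucl = shewhart_limits(history)

# Hypothetical new readings checked against the stored thresholds.
new_readings = [10.0, 10.3, 12.5, 9.9]
alarms = out_of_control(new_readings, lcl, ucl)
print(f"limits: [{lcl:.3f}, {ucl:.3f}], alarms at positions {alarms}")
```

Widening `k` reduces false alarms at the cost of slower detection, which is the trade-off the diagnostic threshold rules are meant to manage.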
REAL TIME MONITORING SYSTEM: EXAMPLE
• Alert rules and pattern thresholds module with real-time checks
• Real-time modelling
• Data streaming analysis and update of historical data
• Real-time feedback
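
The online loop of the real-time monitoring example (check each incoming measure against the thresholds, update the historical statistics, return feedback) can be sketched in Python as follows. This is an illustrative assumption, not the architecture of the presented system: the `StreamMonitor` class, its three-sigma rule, the warm-up length and the sample stream are all hypothetical, and the running statistics use Welford's online algorithm.

```python
import math


class StreamMonitor:
    """Checks streaming readings against thresholds that follow the history."""

    def __init__(self, k=3.0, warmup=5):
        self.k = k            # alert threshold width in standard deviations
        self.warmup = warmup  # readings absorbed before any alert is raised
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations from the mean

    def _update(self, x):
        """Welford's online update of the mean and squared deviations."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def check(self, x):
        """Return True if x breaches the threshold; otherwise add it to history."""
        if self.n >= self.warmup:
            sigma = math.sqrt(self.m2 / (self.n - 1))
            if abs(x - self.mean) > self.k * sigma:
                return True  # real-time feedback: alert, history left unchanged
        self._update(x)
        return False


# Hypothetical stream of sensor readings with one abnormal value.
monitor = StreamMonitor()
stream = [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 35.0, 20.0]
alerts = [x for x in stream if monitor.check(x)]
print("alerts:", alerts)
```

Keeping alerting readings out of the history is a design choice made for this sketch, so that a single abnormal value does not widen the thresholds for the readings that follow.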