CyberVisual: Designing User Environments for Large Scale Networks and Simulations

by Nahom Hailemariam Workie

S.B., Computer Science, M.I.T., 2013

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 2014

Copyright 2014 Nahom H. Workie. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, May 15, 2014
Certified by: Dr. Abel Sanchez, MIT Geospatial Data Center, Thesis Supervisor
Accepted by: Prof. Albert R. Meyer, Chairman, Masters of Engineering Thesis Committee

CyberVisual: Designing User Environments for Large Scale Networks and Simulations

by Nahom Hailemariam Workie

Submitted to the Department of Electrical Engineering and Computer Science on May 15, 2014, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Data collection within the technology sector has grown at an astounding rate over the last decade. This growth has given rise to computational techniques and statistical tools that let us see trends and answer queries; however, most of this information has been presented as numbers in a dataset resulting from these computations. As the trend toward data collection and analysis continues, it becomes increasingly important to present this information to users and non-data scientists in quantifiable and concise ways. With respect to large data in the field of networks and cyber security, the tools for calculating and simulating threats are available, but there are currently very few options for showing the results. Here we demonstrate CyberVisual, a visualization tool for displaying simulations and their results in client-side applications. CyberVisual is a two-part attempt at changing how this information is presented through visual encoding. The tool is a client-side application that enables a user to make better sense of information using various visualization techniques that emphasize different modes of summarization and interaction with a given dataset. Through this we aim to improve the future of visualization by giving users different and more interactive approaches for showing massive datasets.

Keywords: Human-Computer Interaction, Visualization Analytics, Computer Systems, Big Data

Acknowledgements

This work is partially funded by the MIT Lincoln Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this thesis are those of the author and do not necessarily reflect the views of the MIT Lincoln Laboratory.

Dr. Abel Sanchez, for not only serving as my supervisor, but also giving me the opportunity to really dive into, learn, and explore new areas and technologies that I wouldn't have been able to discover otherwise.

Ivan Ruiz, for helping me brainstorm and try new ideas, as well as keeping me focused towards the end of this experience.
My family, for their unwavering support throughout my life.

Table of Contents

Chapter 1: Introduction
  1.1 The Simulator
  1.2 Analytics and Event Correlation
  1.3 Visualization Engine
Chapter 2: History and Background of Data Visualization
  2.1 The Growth of Visualization
  2.2 Current State of Dashboards and Visualizations
Chapter 3: Simulation, Analytics and Event Correlation
  3.1 Modeling Networks
  3.2 Analytics Module
    A. Support Vector Machines (SVM)
    B. One-Class Support Vector Machines (OC-SVM)
    C. Statistical Models and Covariance Estimation
  3.3 Summary
Chapter 4: CyberVisual User-Input Environment
  4.1 Iterative Design for User-Environments
  4.2 CyberVisual Input
Chapter 5: CyberVisual User-Output Environment
Chapter 6: Further Work
  6.1 Adaptive User Interface (AUI) Dashboards
Chapter 7: Conclusion
References

Chapter 1: Introduction

The importance of data collection has risen exponentially over the past decade, giving rise to tremendous growth within the Information Technology (IT) sector. As this growth continues, with increasing storage capabilities and the rapid development of cloud infrastructures, protecting the datacenters that store this information is quickly becoming a priority. Part of the reason lies in the simple fact that the global datacenters of most multinational technology companies hold a vast amount of public as well as private data. This data not only needs to be reliably stored, but also needs to be protected, as it contains personal or internal information.

With regards to network growth and security, the movement towards Big Data and the rise of statistical tools for uncovering patterns in this data have enabled new ways of finding potential problems and anomalies and detecting possible threats. Furthermore, these methods, combined with pre-existing technologies, have also provided ways to better show these weaknesses to non-experts.

The datacenters used by many corporations today are integral to their success. However, cyber-attacks and network downtime are some of the biggest factors that impede the performance of these centers and, by extension, the companies themselves.
As the infrastructures of these companies grow, it becomes increasingly important to detect potential weaknesses and vulnerabilities for improvement within the global network. One possible way to achieve this is to create a simulation of the global network using various parameters in order to understand network behavior. The ability to test various scenarios and possible outcomes through simulation allows for a cost-efficient way to detect possible areas of improvement, potentially well in advance. This global simulation and visualization system seamlessly integrates a global data center infrastructure with an event correlation and visualization engine. In this thesis, we briefly outline the simulation and event correlation system used, but mainly focus on the visualization aspect of the system.

1.1: The Simulator

In order to learn and understand more about network behavior within large global infrastructures, we need to be able to simulate system performance and service availability. The purpose of the global simulator is to provide this understanding without affecting the actual global network and its services. Furthermore, the simulator should also provide granularity by enabling the simulation of entire datacenters at a time, as well as collecting information at the level of an individual client-server pair. Finally, the simulator is modular, as it is still only one piece of the data simulation framework, which is to be integrated with an analytics and correlation module as well as a visualization module.

Creating a simulator has immense benefits over the standard way of approaching this problem, which is akin to infrastructure profiling. The simplicity and overall cheaper cost of a simulator compared with actual testing and profiling is one of the primary reasons it is a better approach. Furthermore, infrastructure profiling can also reduce performance in large networks simply because numerous parameters are being analyzed while the network is running, creating overhead that can greatly reduce system performance. One of the biggest advantages of a simulator, however, is the ability for users to test out various possible scenarios to see how the network would respond. These sorts of "what if" questions enable the owners of these datacenters to systematically ask and test different use cases in order to improve and strengthen the overall infrastructure of the network. [1]

The system as a whole is also modular in the sense that it can be used with other components in order to create tools for multiple use cases. An example of this is the Malicious Activities Detection System (MADS) shown below, which is used for anomaly detection in network topologies.

Figure 1.1: Context diagram for the MADS framework, a use case of the data simulation system for tracking malicious activities and anomalies; the framework connects the operator, a web browser, and the simulator.

1.2: Analytics and Event Correlation

The analytics component is the second module in the data simulation framework. In order to detect malicious activities, anomalies, and bottlenecks in networks from the simulation data, the analytics component relies on both statistical and classification models. The statistical models in this component use methods such as covariance estimation, where the assumption is made that anomalies and various activities are generated within a standard (normal) distribution.
The inference this assumption makes is that activities deemed normal happen in the areas of the distribution that correspond to high probabilities. In contrast, the low-probability areas of the distribution are assumed to contain potential problems or malicious activities. The analytics component also uses classification models, which make no assumptions about the data and its probability distribution. These methods become particularly useful when the data has no standard distribution and is strongly non-Gaussian. A strong example of this type of model used in MADS is the One-Class Support Vector Machine (OC-SVM), which works well with unlabeled datasets, in contrast to support vector machines, which are usually considered supervised learning algorithms (i.e., there is training data that can be used as a baseline). These methods of analyzing the data allow for an analytics engine that is effective at classifying and analyzing different types of networks and anomalies.

Figure 1.2: 2D anomaly detection example using different correlation methods (One-Class SVM classification and covariance estimation) on Gaussian and bimodal distributions.

1.3: Visualization Engine

The visualization component is a two-part attempt at designing applications for a broad range of use cases. The first aspect is summarization of a given dataset, which enables the user to see trends and outliers in network application performance in a simple web application, with a user interface focused on distilling the information and the insights gained. This client-side visualization application takes the output of the simulator and continuously imports it into a database. This data is then retrieved in the browser using a visualization library and displayed in a multitude of ways, highlighting different correlations and attributes of the data.

The purpose of this application is to process the multivariate dataset and use a layered approach to visualize the information. We aim to do this by first creating a super set of smaller visualization views, and then continuously adjusting and refining the message given by each layer. As a user traverses the different layers of the application, we can also change the context and the queries that each layer explains. This approach allows us to present various types of relationships and correlations with varying degrees of detail.

Figure 1.3: Context diagram for the summarization visualization component, connecting the simulator output, the converter, the analytics module, and the visualization.

However, in order to answer the "what if" question discussed earlier, a second approach is needed. The second aspect of the visualization component is an interactive web application that enables users to better formulate the input to the simulation. By enabling the user to select the datacenters and how the network is overlaid, we create an interactive user experience that helps the user formulate and answer queries about the network topology and other analyzed information.

By using these applications we can provide a user with multiple visualization toolsets for various use cases. For example, a user can switch through different views of the data and understand the different correlations between them, as well as experiment with different topologies to see where possible improvements can be made, or find bottlenecks in an already developed infrastructure.
The full goal of my work with regards to the visualization component is two-fold. First, I want to understand how to build better user interfaces and visualizations that are not only appealing, but also significantly more informative than other ways of representing data. Second, I want to create new methodologies for improving user interaction and testing approaches with the use of machine learning. Although these goals are broad and by no means simple, I think our work on the visualization component of the data simulation framework is an effective start toward answering and completing them.

Chapter 2: History and Background of Data Visualization

To say the increase in the collection of data over the past decade has been tremendous is somewhat of an understatement. Due to the decreasing cost of memory and storage, combined with new ways of collecting data from a multitude of different resources, data is being stored at an immense and still increasing rate. This storage of data enables new possibilities in how we approach problems and technology. Specifically, with more information becoming available to us, it becomes substantially easier to make inferences and decisions for problems in different sectors of industry. However, in order to better understand our current state and where we're heading, we need to better understand how we got here. In this section, we briefly go over the rise of data collection as well as the use of visualizations and dashboards. From there we begin to highlight some of the problems that we're currently facing with visualizations, so that later we can connect them to some of the novel approaches that might address these problems.

In the early 2000s, the idea of big data was considered more theory than practice. Although data collection was increasing at the time, it was still considerably slower and nowhere near enough to be considered a shift in the paradigm of how we make decisions. The idea of the data scientist was not nearly as prominent then as it is just 10-12 years later. This is due to numerous factors, one of the primary being that the infrastructure for accessing and sharing so much data was still in its early stages. In 2000, the world's capacity for data communication was 2.2 exabytes. This increased by almost a factor of 30 to 65 exabytes in 2007, and is expected to reach 667 exabytes by 2014. As storage and communication capabilities increased dramatically over time, and as approaches to finding and storing data matured, this growth in data science exploded. [3]

The infrastructure supporting this growth in data science has developed rapidly over the past decade. One of the major contributors was the creation of the MapReduce process by Google in 2004, which enables parallel processing of substantial amounts of data in order to compute and deliver query results. This process was later modified and further built out for big data by Apache, creating Hadoop starting in 2005. With other technologies such as MongoDB being developed, the architecture for big data quickly grew and matured during the late 2000s. However, even with this growth, this paradigm for how we approach problems is still very much in its early stages.
With respect to industry, big data is currently growing at over 10% per year, and with more people gaining access to the internet, combined with improvements in storage and communication networks, this number will continue to grow immensely over the next decade. This presents new opportunities to take advantage of this information, specifically with respect to making critical decisions for different problems or systems.

2.1: The Growth of Visualization

When data science is considered, one of the primary concerns that usually comes up is how the data and results are presented. Given that some multivariate datasets have thousands if not hundreds of thousands of rows, it becomes crucial to figure out the optimal way to show this information. The early uses of data visualization were not actually created entirely for this purpose; rather, they were a subdomain of computer graphics, which was used early on for modeling and presenting problems. As computer modeling and graphics grew and the idea of big data emerged, these areas naturally combined into the field that we understand today. [4]

Of course, the use of data visualization to show information goes back well before the modern computer era, as maps have been used to show geographical data for millennia. At the end of the 18th century, we began mapping out and graphing mathematical functions, as well as medical and economic data. The beginning of data collection as we now know it began during this time as well. Over time we have learned and created better ways to show information using different types of metaphors, graphs, and references to simplify our understanding of problems and the world around us. [7]

Data visualization as we know it could be said to have started in the 1960s, as statisticians began the process of information visualization. Ways to display different statistics and information grew throughout the latter half of the 1900s; one prime example is "The Visual Display of Quantitative Information" by Edward Tufte, whose work could be considered a precursor to modern-day information visualization. The growth of this field has enabled a massive number of smaller areas that are actively being developed. [8]

The development of software for data visualization emerged in the late 2000s as the big data industry took off. Dashboards, user environments for analysis, and libraries for creating different forms of presentation became increasingly complex as we created improved ways of showing information. However, even as we develop new methods for presenting new information, we still face crucial challenges that need to be better understood and solved. Data visualization is no longer just about showing graphically appealing charts and maps, but also about analyzing how users will actually understand and take in the information. To create and present better information and improve the state of this field, we have to better understand the cohorts of users that will use the visualizations as well as the use cases they aim to satisfy.

Later we discuss in more detail some methods to improve user environments and visualization. Some of the more promising ideas in this space come from testing users in order to figure out the best way to create and improve visualizations (A/B testing), as well as using test cases and artificial intelligence to figure out the optimal way to improve a user interface (planning).
These methods of iterating and improving on user interfaces can also be used to improve visualizations, enabling users to garner better insights and patterns from different datasets. Essentially, these methods aren't creating new ways to show the information; rather, they use information about the user and the use cases to iterate and improve on an already created visualization.

2.2: Current State of Dashboards and Visualizations

Finally, in this brief depiction of the growth and evolution of data science and visualization, we come to outlining the problems that are emerging with dashboards and visualizations. With the abundance of data and the numerous libraries and languages that support the creation of unique and detailed visualizations, the biggest problem that arises is that many of these graphics are often ineffective. To understand why, we first discuss what makes an effective visualization and then list some of the shortcomings and problems that occur when designing visualizations, as well as how some of these issues can be resolved if carefully considered with respect to the data.

One of the primary purposes of information visualization is to enable the user to quickly identify patterns, answer questions, understand correlations and relationships, and absorb the dimensions of the dataset. That's a lot of different requirements for a graphic, and often visualizations satisfy only one or two of these points instead of the full spectrum. Visualizations also tend to focus completely on a single point, which can confuse the viewer or cause them to misunderstand the purpose and relationships of the respective dataset. [2]

In order to effectively show a dataset, a visualization should create a natural mapping from the data to the visualization. This means that the user should easily be able to see the dimensions and the idea behind the dataset when looking at the visualization. For example, if you have a map with latitude and longitude, it's relatively easy to see a dataset with a list of cities and their exact locations. Although this is a simple example, it's straightforward to see the mapping between the data and the visualization. As datasets become increasingly complex, this intuitiveness becomes more difficult to achieve; however, maintaining this relationship helps immensely in the creation of effective graphics.

Another prime aspect of using visualizations to show information is that they give us a better understanding of dataset density. For example, if a viewer had to search a table of people to find how many live in California, they would have to go through thousands of rows to get the count. This problem can easily be visualized with a bar chart that shows the frequency count. Using this, the viewer would not only see the result of their query, but also the total density of the dataset, giving them a bigger perspective on the overall picture.

The last and probably one of the more important aspects of visualization that often goes wrong concerns labels. Effective visualizations create labels that allow the user to better identify trends and patterns within the data, as well as see the scale at which these patterns are occurring. For example, if a graph shows exponential growth, the viewer can clearly tell the rate of growth for the respective dataset is high.
However, with no sense of scale, the viewer has no idea what this information is relative to. Incorrect or careless use of labels and color schemes creates visualizations that often look appealing, but give no meaningful insight or information to the viewer. [6]

As we consider what makes for effective visualization designs, it becomes easier to see why many of today's dashboards and graphics are ineffective. Creating good graphics requires thinking thoroughly about the different aspects of the dataset as well as the different cohorts of users that will utilize the graphic. Often we tend to overdo one aspect of a visualization without considering other dimensions. One of the better ways to think about designing graphics is commonly used in User Interface (UI) design, which is to consider the different dimensions of usability: easy to learn, effective, engaging, error tolerant, and efficient. [5] Some of these, like error tolerance, apply differently from how they are defined for user interfaces. For example, in UI design, error tolerance means preventing the user from doing anything that would break the system or mislead them; this is often done with confirmation pages that make sure the user is correctly using the interface. In information visualization, error tolerance applies in the sense of making sure that the viewer doesn't misread or misunderstand the information given to them. This can be done with legends or better labels that effectively guide the viewer and prevent them from making possible errors when exploring the dataset. By applying these different dimensions of usability, in addition to better mapping of the datasets, we can aim to improve the inefficiencies that are present in modern visualizations.

Chapter 3: Simulation, Analytics and Event Correlation

When testing a system with numerous datacenters worldwide, it can be more practical to simulate and analyze the data in software than to physically test the entire system. The simulator system was designed specifically for this purpose, enabling the user to see outcomes and insights for various possible scenarios using input parameters as constraints. In this section, we continue the discussion of the modules from the previous chapter by going into detail about the different aspects of the simulator.

One of the primary motivations for the simulator was to experiment with and scale networks to show resource and time allocations and their impact with regards to multiple concurrent users. In order to do so, the simulation system makes assumptions about resource use and allocation. The system simulates a request initiated by a user and the corresponding response from the respective server, modeling a client-server interface. Further, the system scales this up by modeling multiple concurrent user requests and their respective responses. On a large scale, this enables users to understand how the datacenters in their network will behave and perform under different conditions and topologies.

In discussing the architecture of the simulation platform, we dissect in more detail the aspects of the models that matter most to the CyberVisual project. The simulator was implemented in C# and uses the CCR (Concurrency and Coordination Runtime) library for asynchronous methods and behavior. Its input files are primarily in XML format. During initialization, these files are interpreted and deserialized into the network and each of the various sites. The simulator uses an XML parser to create the different objects that make up the simulated network topology.
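To make the deserialization step concrete, here is a minimal sketch of how a topology input file could be turned into site objects. It is written in Python for illustration only: the actual simulator is C# and defines its own schema, so the tag names and fields shown below are assumptions.

```python
# Illustrative sketch only -- the real simulator is written in C# with its own XML schema;
# the <site>, <latency>, and <bandwidth> tags here are hypothetical.
# Assumed input shape:
#   <network><site name="USA"><latency>12</latency><bandwidth>100</bandwidth></site></network>
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    latency_ms: float
    bandwidth_mbps: float

def load_sites(path: str) -> list[Site]:
    """Deserialize a topology input file into site objects for the simulated network."""
    root = ET.parse(path).getroot()
    return [
        Site(
            name=node.attrib["name"],
            latency_ms=float(node.findtext("latency")),
            bandwidth_mbps=float(node.findtext("bandwidth")),
        )
        for node in root.iter("site")
    ]
```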
Although this XML input methodology allows the simulator to run efficiently when the user manually edits values to define a topology, it becomes immensely ineffective when the user does not understand the inner workings of the simulator or how to modify and create different topologies. The CyberVisual project aims to correct this by providing a graphical user interface (GUI) that lets users create and draw their own network topology. However, this requires converting and customizing the input into the standard XML format taken by the simulator. We discuss this implementation, including a converter module that creates these standard input files, in a later section.

The output of the simulator is in Microsoft Excel (.xls) format, which lets the user see the data in a familiar format and user interface. This also enables the results of the simulation to be easily translated into graphs and charts. Further, the output system for the simulator is designed to be modular, which lets us connect it to a client-side site for further analysis. Although having the precise numbers of a simulation is useful, enabling a user to see this information in a multitude of ways (i.e., numbers, graphs, charts, animations, etc.) is more beneficial because it helps the user understand the information. A client-side website is substantially more effective than an Excel output primarily because it excels at exactly this, offering a variety of visualizations rather than just the numbers of the network simulation.

Figure 3.1: The different blocks of the CyberVisual simulation platform.

3.1: Modeling Networks

The premise of the simulator is based on a test of a single-user client-server system carried out in a lab by Dassault Systemes; the idea builds up to a multi-system model that tests multiple concurrent users with regards to system resource allocation and timing. The simulator breaks down different aspects of the datacenters and servers by creating different linear and non-linear performance models. Although assumptions are made in this regard, this still produces large-scale simulations that allow users to see accurate and realistic results of simulated systems.
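As a rough illustration of the kind of question the simulator answers (how response time degrades as concurrent users scale up), the toy sketch below models all users sharing a single server queue. It is a minimal sketch under made-up request and service rates, not the thesis's C# simulator.

```python
# Toy sketch only -- not the thesis's C# simulator. All users share one server queue,
# and the request/service rates are made-up parameters chosen for illustration.
import random
from statistics import mean

def avg_response_time(num_users, reqs_per_user_per_s, service_rate, n_requests=20_000, seed=1):
    """Estimate mean response time for a single shared server under concurrent load."""
    random.seed(seed)
    arrival_rate = num_users * reqs_per_user_per_s   # combined request rate (req/s)
    clock, server_free_at, latencies = 0.0, 0.0, []
    for _ in range(n_requests):
        clock += random.expovariate(arrival_rate)             # next request arrives
        start = max(clock, server_free_at)                     # queue if the server is busy
        server_free_at = start + random.expovariate(service_rate)
        latencies.append(server_free_at - clock)               # waiting time + service time
    return mean(latencies)

for users in (10, 50, 100):
    print(users, "concurrent users ->", round(avg_response_time(users, 0.2, 25.0), 3), "s")
```

Even this toy model shows the qualitative behavior the simulator is built to expose: as concurrent users approach the server's capacity, response times climb sharply.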
The simulator as a whole however is only a partially distributed system, as there are constraints that must be put on specific aspects of the network inputs. To ensure the performance of the simulator it was also tested and validated against laboratory results for multiple/concurrent users. 24 25 3.2: Analytics Module The analytics module of the CyberVisual system is responsible for the classification and identification of attacks and anomalies that can be modeled via the simulator. At the end of a simulation, the first module is responsible for aggregation and summarization of data. In order to create this module, we must consider ways to process and prepare the input as well as identifying the methods, which would enable us to perform analysis on this multivariate dataset efficiently. Here we examine the analytics module created by the MADS framework in order to see how a general case analytic module would work. [9] In order to complete a user controlled adjustment of time-series visualization, we need to aggregate and analyze the simulation data at different time intervals. After which, the data is converted to the appropriate input module for the visualization engine which can access and displays this data in a client-side browser. Therefore, the analysis engine has to be converted into multiple smaller parts in order to complete this task. First, we must create a converter class that enables the efficient conversion of the simulation data so that it becomes an appropriate input for the analysis module. This is a crucial but relatively trivial task that allows the data to be ready as input for supervised learning algorithms. The MADS analysis engine itself contains multiple algorithms in order to analyze this input. One of the most important aspects of this analysis is the classification algorithm of simulator data between when the simulator is running under normal conditions compared to when there is an attack or an abnormal increase of data in the data output. An example of this is a Distributed Denial of Service (DDoS) attack. This type of attack can be simulated by systematically increasing the 25 26 number of operations per minute per application. This allows for an overload in the usage of that operation which is similar to the expected symptoms of a DDoS attack. When these types of attacks are being simulated, it's becomes a vital aspect ofthe analytics module in the MADS system to be able to classify them as such. There are several algorithms that allow for this kind of classification, a relatively straightforward approach to this is by using Support Vector Machines (SVMs). Using the marginal distance of the farthest points from the hyper-plane it becomes dramatically easier to separate normal simulation data from that of a simulated anomaly. This approach also places no assumptions on the distribution of data, which is immensely useful if the data has complex boundaries (ie. defining normal activity). To explain this idea further, let's consider the idea of normal simulation activity. This would regularly mean that the simulation is behaving correctly with regards to specific margins of data; however, there is no exact metric for defining this because the values would be in the specific range rather than a discrete value. Therefore, for this type of non-Gaussian distribution it becomes particularly useful to not depend on statistical models, which depend heavily on assumptions of the distribution of the input data. 
B. One-Class Support Vector Machines (OC-SVM)

Unlike traditional Support Vector Machines, OC-SVMs are unsupervised learning algorithms that can work with an unlabeled dataset. This makes them immensely useful for detecting anomalies, especially considering that, as classification methods, they have no dependence on the actual distribution of the dataset. Similar to the SVMs discussed in the previous section, the OC-SVM algorithm attempts to find a hyperplane to use for classification between normal performance and actual anomalies. The decision function for this algorithm is:

$$f(x) = \sum_{n=1}^{N} \alpha_n K(x_n, x) - \rho$$

In this function, the $\alpha_n$ are the Lagrange multipliers, $\rho$ is the bias of the hyperplane, and $K$ is the kernel function. If $f(x) < 0$, the observation can be classified as an outlier. The OC-SVM method works well when the data does not follow a well-known distribution (e.g., Gaussian). However, when the distribution of the data is known, it's better in most cases to switch to statistical models for analyzing the data, such as covariance estimation.

C. Statistical Models and Covariance Estimation

In the previous sections, we briefly discussed two approaches to anomaly detection based on the assumption that the data does not follow a well-known distribution. However, in some cases it's possible to make basic assumptions so that the data can fit a known distribution. The basic example is to fit the data to the normal distribution, where anomalies occur with low probability while normal data occurs with high probability. With this type of approach we can use statistical models and approximations like covariance estimation in order to estimate and predict anomalies in the data. Combining these approaches, the MADS system presents a comprehensive example of an analytics module for CyberVisual.
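A hedged sketch of both unsupervised approaches, loosely mirroring the comparison in Figure 1.2, is shown below using scikit-learn; the training data is synthetic and the feature columns are hypothetical simulator metrics rather than MADS output.

```python
# Hedged sketch mirroring the Figure 1.2 comparison, not the MADS implementation.
# The training set is synthetic, unlabeled "normal" traffic; feature names are hypothetical.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
train = rng.normal(loc=[200, 40], scale=[20, 5], size=(500, 2))   # normal operating data
test = np.array([[205.0, 39.0], [480.0, 41.0]])                   # second row mimics a DDoS spike

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)  # no distribution assumed
cov_est = EllipticEnvelope(contamination=0.05).fit(train)              # assumes Gaussian-like data

print(oc_svm.predict(test), cov_est.predict(test))   # +1 = inlier, -1 = outlier for both
print(oc_svm.decision_function(test))                # sign of f(x) gives the classification
```

The choice between the two mirrors the discussion above: the covariance-based detector is preferable only when the Gaussian assumption roughly holds, while the OC-SVM is the safer default for strongly non-Gaussian data.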
3.3: Summary

The simulator gives us a piece-by-piece approach to analyzing networks, which is crucial for seeing the behavior of a network at large scale, while the analytics module is a crucial part of the CyberVisual system because it lays the foundation for answering questions from the simulated dataset. By simulating and analyzing network data using different methods, we begin to obtain results that enable further investigation through better queries and insights. In the process, this also gives us an increasing understanding of different possible simulation scenarios. Finally, it also creates the appropriate output format for the visualization module, which takes this information and presents it to the user in multiple frames and views, making it more effective and easier to understand.

Chapter 4: CyberVisual User-Input Environment

In examining the current state of the simulator, we separately consider the two primary use cases for its operators. First we consider the input stage of the simulator, which enables an operator to create the network topology for the simulation. We explain the state of the process prior to the CyberVisual project and the different steps that the operator took, along with an analysis of the effectiveness of the system. From there we present and analyze the CyberVisual project, putting the focus on the iterative design process for visualization. Here specifically, we focus on improving visualizations for user environments that require user input, a perspective on the topics discussed earlier, albeit with unique challenges and approaches.

The simulator system originally used XML files as the primary source of input for the network topology. By defining various parameters for each aspect of a network, the user can customize and create detailed constraints with respect to the two types of applications that the simulator tests. During simulation, one of the primary methods of testing network topology was simulating CAD and VIS applications, which required various input files for Connections, Search, and Filtering, as well as network latency, bandwidth, and other parameters. The combination of these parameters effectively simulates a client-server architecture within a system. Before doing so, however, the user is required to place the parameters in an input folder. Although a substantial number of these files could be exempt from this process, it is still a difficult and time-consuming bottleneck, since the parameters have to be modified after every simulation run for a different network.

Figure 4.1: A CAD Connect file, one of the many input files that the user had to manually edit in order for the simulator to create the network topology.

In total, running a complete simulation requires the user to go through more than 30 XML files in the input folder, and possibly modify up to that many. Not only is this a time-consuming task, it's also immensely difficult for an operator with very little domain knowledge of the simulation system. Yet this parallels scenarios in which companies have to teach new employees how to use various unique and complicated dashboard systems for actively monitoring networks. The difficulty lies not only in getting the operator familiar with how to run the system, but also in getting them to actually understand all the possibilities that are offered.
4.1: Iterative Design for User-Environments

The primary goal of the CyberVisual project was to create a unique user environment that makes this process more efficient and usable. We used an iterative design approach, first abstracting away the various details of the input files and asking whether we could reduce them to a simple table where the user inputs the parameters instead of editing numerous XML files.

Figure 4.2: The prototype (v1.0) showed a table-style view that kept track of every XML file as a tab (at the top), followed by all the parameter options and the value for each parameter.

The approach for the first version of CyberVisual was simple: minimize the click count so that the user can find and enter the information as quickly as possible. Click counting has long been used in website user interface design, primarily to make sure that sites are usable. The 3-click rule, which has been disproven as showing a correlation between success rate and number of clicks, is still widely used by designers as good practice. With regards to visualization, this rule can actually be of practical use if the graphic is designed so that the user can get an answer to their question within three clicks, letting them gain the value and insights they need in a relatively short amount of time.

In this prototype, the tabbed view is used to abstract away the folder and access each of the files independently on its respective tab. This approach is effective because it minimizes the amount of time the user has to spend preparing the input by opening each file independently. Further, having one location to view each independent file gives the user a better understanding of the overall structure of the input data, thereby decreasing the learning curve. However, one of the underlying problems with this iteration is that it doesn't address one of the earliest aspects of good visualization that we discussed earlier: it doesn't naturally map the data into the visualization. The data in this case is a network topology consisting of datacenters with geo-location data.

4.2: CyberVisual Input

In order to better use the geospatial information given, we iterated on the previous design by adding a map and enabling the user to actually design the environment that they want to simulate. This information is then submitted to the server via AJAX requests and processed to recreate the XML files that the user originally had to input. This approach lets the user select the type of server they want to place, where the respective server is located, and the parameters governing server connectivity, bottlenecks, and so on. Overall it creates a simple user environment that dramatically decreases the learning curve while giving the user more control over the actual customizations.

Figure 4.3: The UI for CyberVisual allows the user to select four different types of servers that can be placed at any location on the map.
Figure 4.4: After selecting two datacenters, the user can connect them using the wire option and then fill out the table with the parameters required for the connection. Parameters that are optional for the connection are automatically set to default values, decreasing the amount of information the user is forced to enter.

Figure 4.5: The basic connection between two datacenters.

CyberVisual uses Google Maps to create a layout with a familiar user interface that is both efficient and easy to learn. The combination of a map and a table layout for the parameters creates a full user experience for creating a global network. Further, the color scheme is easy to understand: buttons are highlighted in different colors based on the importance of their actions. The "Run Simulation" button, for example, is red to caution the user before running an actual simulation. This is because running a simulation sends the current state of the map and the table to the server through AJAX requests and deletes the state of the board. Although a confirmation dialog appears after the user decides to run a simulation, this action could still cost the user the most time if they don't carefully consider the current state of the UI before selecting it.

On the development end, the input user interface for CyberVisual was created using the Rails framework for Ruby. On the front end, we used an HTML object approach to create the servers and overlaid them on a map of the world (created using the Google Maps API). When the user creates a new datacenter, the application creates an object holding the state of the different parameters for that type of datacenter. Some of these parameters are set by default to expedite the process. The default values are typically the ones that wouldn't concern the user and are set based on the type of datacenter chosen and other earlier parameters. To make the user interface even simpler, this information is saved internally without bothering the user or overloading them with too much information. The user therefore sees a simple graphic while the backend of the application handles the state and processing of the data.

Figure 4.6: Context diagram showing the connection between the CyberVisual Input UI and the server via AJAX requests.

Server-side, the application first collects all the data into several different tables separated into location, application, and server-type data. After the user inputs the parameters for a server and places it on the map, an event is triggered that takes all the parameters and sends them to the server via AJAX. There, the server separates the different types of data and sends them to the converter module. This module takes the information from the different arrays and recreates the XML files associated with it. This process is near-instantaneous, with no lag time, so the user is quickly notified that the information has been saved. After this is complete, a script takes the information and runs the simulator. The output of the simulator is an .xls sheet with the information on each of the datacenters the user created.
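The converter step can be pictured roughly as follows. This is an illustrative sketch only: the real converter lives in the Rails backend, and the tag names, dictionary fields, and output filename used here are assumptions.

```python
# Hedged sketch of the converter idea only -- the real converter is part of the Rails
# backend. The tag names, dictionary fields, and "Connections.xml" filename are hypothetical.
import xml.etree.ElementTree as ET

def write_topology_xml(datacenters: list[dict], path: str) -> None:
    """Recreate a simulator input file from the parameters posted by the browser."""
    root = ET.Element("network")
    for dc in datacenters:
        site = ET.SubElement(root, "site", name=dc["name"], type=dc["server_type"])
        ET.SubElement(site, "latitude").text = str(dc["lat"])
        ET.SubElement(site, "longitude").text = str(dc["lng"])
        # optional parameters fall back to defaults, mirroring the UI behaviour
        ET.SubElement(site, "bandwidth").text = str(dc.get("bandwidth", 100))
    ET.ElementTree(root).write(path, encoding="utf-8")

write_topology_xml(
    [{"name": "USA", "server_type": "primary", "lat": 37.4, "lng": -122.1}],
    "Connections.xml",
)
```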
Although there are numerous advantages to the current CyberVisual system as described above, there are also some inherent disadvantages that need to be considered. The CyberVisual Input UI doesn't work well for power users (i.e., the user cohorts that completely understand the simulation system). This is primarily due to the inability to make advanced changes when a large number of the parameters are set to defaults. In aiming to create a simple and straightforward user interface, we also decreased the amount of control that the user has in choosing options for their network. We believe this design tradeoff is worth it, however, because it enables a large group of cohorts to actually use the system, even if it somewhat limits the capacity of the overall system.

Chapter 5: CyberVisual User-Output Environment

The second case we present is the CyberVisual User-Output environment, in which we consider how different types of users will see and interact with the output of a given simulation. First, we consider the previous state of the output and the primary ways that information about a given simulation was passed to the viewer. This enables a deeper conversation on the bigger topic of making decisions when we have data. Within this discussion we present the CyberVisual User-Output interface and analyze its improvements over traditional outputs. Further, we describe the trend of improving visualizations within dashboards for decision making, as well as the disadvantages of the current state of dashboards.

As we previously stated, the output of a given simulation is saved as a single Excel (.xls) file. Each sheet of this file contains information about a specific datacenter in the network topology. For example, all the information about a particular datacenter in the United States would be stored in a single sheet within the file. The sheet also contains the time interval over which the simulation was run, the operations in execution, the number of users, and many other parameters related to simulating a client-server environment.

The primary structure of each datacenter sheet covers the two applications being simulated. The CAD and VIS applications represent different applications that might be run from different servers. They aim to test and estimate how much processing power and speed these applications would require in different network topologies with different numbers of users. Further, a network can be customized to run only one of these applications instead of both, since the server configuration might not support the other application. The output of the simulation lets the user see the precise numbers of the simulation as well as three graphs, which show the frequency parameters for the number of users logged in for each application. Probably the most important metric in each of the sheets is Operations in Execution, which, along with the number of users, allows the viewer to get an idea of what normal output looks like for a specific number of logged-in users.
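For a sense of how this per-datacenter sheet structure might be consumed programmatically, the sketch below reads the workbook with pandas; the file path and column names are assumptions, not the simulator's actual headers.

```python
# Hedged sketch: inspecting the simulator's .xls output with pandas. The layout is as
# described above (one sheet per datacenter); the column names below are hypothetical.
import pandas as pd

sheets = pd.read_excel("simulation_output.xls", sheet_name=None)   # dict: sheet name -> DataFrame

for datacenter, df in sheets.items():
    # hypothetical columns: time, cad_ops_in_execution, vis_ops_in_execution, logged_users
    ops_per_user = df["cad_ops_in_execution"] / df["logged_users"].clip(lower=1)
    print(datacenter, "peak CAD operations per logged user:", round(ops_per_user.max(), 2))
```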
The structure of the output for the simulation lets us take a closer look at what makes a good output for different cohorts of users. The primary reason for using cohorts is the basic notion that users might not have the same background knowledge when it comes to using the simulator and analyzing its results. In this case, the primary cohorts are network administrators looking at simulation data, with the primary variable of difference being time (i.e., experience), which gives them different levels of domain knowledge about the types of output possible for various network topologies. Taking this into consideration, our approach was to create a simplified user interface with the option to look at more detailed views of the datacenters if the user elects to see them. This let us design primarily for users who want to use the dashboard for a quick look at the current state rather than for deep analysis. This approach also helps close the gap in domain knowledge between multiple cohorts of users by creating a user interface that focuses on consistency with other visualizations, making it substantially easier to learn and operate.

In the first iteration of this interface, our objective was to use a layered approach to organize the information into different sections. The primary reasoning was to simplify and drastically reduce the amount of information the user has to see by organizing it into a clearly divided layout with labeled parts. An immensely important aspect of this was creating an ordered structure for how the user views the information. As mentioned in earlier sections, one of the primary difficulties in creating an effective visualization is overloading the view with too much information at one time. The best way to reduce this is to create an ordering that is consistent with how the user would look for information. This consistency is a dimension of usability in the sense that it contributes to an interface that is easy to learn.

Figure 5.1: One of the first designs for the output environment. The idea behind this design was to use the layout to control and create a stricter ordering for how the user views the information.

The layered approach does this by taking a sky-to-ground view, meaning that it tries to show an overview of the entire system first before getting into details about a particular module or piece. In essence, the goal is to give the viewer a summary of what's going on at a system or network level and then enable them to dissect any part of that respective system or network.

The primary problem with the first iteration (Figure 5.1) was that, although a stricter ordering was created, it still presented too much information. The idea behind it was to use geo-location data at the center and then use the different sections of the page to display individual datacenter information. If the user selected a datacenter on the center map, the information for that particular datacenter would be displayed in one or more of the sections on either side of the map.

Figure 5.2: The main page for the CyberVisual Output Interface, showing the worldwide datacenter overview.

However, this approach minimizes the amount of actual summary or overview the user gets of the entire system, because more space is reserved for individual datacenters in the side sections, leaving less room to actually show the overview of the entire system.
As we iterated on the designs, we decided to emulate some aspects of the original Excel output by using a tabbed approach. Using different tabs to show each datacenter in detail, we can give an overview of the entire system in the first tab and break apart and analyze each individual datacenter under its own tab. This is the final approach we used for CyberVisual, and one that balances information and amount of detail with efficiency and an ordering that allows the viewer to get a more holistic understanding of the entire dataset.

With the CyberVisual interface, the first thing the user notices is the world map with points for each datacenter in the network (Figure 5.2). The user can select any of the datacenters and is automatically taken to the tab for that specific datacenter. They can also simply select the tabs to access each individual datacenter without using the map. The simplicity of this interface is that it stays within the 3-click rule discussed earlier: it takes a single click to get to detailed information about each individual datacenter. Further, the data is ordered in an intuitive sequence, which lets the user first get an idea of the overall system before diving into specifics.

The main page also includes metrics that pertain to the entire network. For example, immediately below the map are two multi-line graphs that represent the total input and output data of the network, color-separated for each of the datacenters. This shows the scale at which information is passed throughout the entire network; having seen it, the user can get a better understanding of what to expect from the different datacenters. For example, if a particular datacenter shows zero output data, it is easy to conclude there may be a bottleneck at that particular location. For an analyst, this provides a simple way to quickly see the state of each datacenter in the system.

Figure 5.3: Global push and pull network charts.

When a user selects the page for a particular datacenter, they see the information separated into four different graphics. When the user hovers over any of the graphics, a detailed description of the information represented is shown. This allows for a shorter learning curve, specifically for users who are new to or unfamiliar with the application. The different color schemes of the individual datacenters represent the results of the CAD and VIS applications with respect to time. The top two graphics on the datacenter pages represent the metrics the user usually cares about most: the number of operations in execution for each application, as well as the number of logged-in users. Quick access to these makes the entire process simpler and more efficient for the user.

Figure 5.4: User and datacenter operations data (logged CAD vs. VIS application users).

The lower graphs on each datacenter's page represent other operations that are called by the different applications.
In particular, the polar area charts enable the user to see the magnitude of each operation, while the web chart on the left enables the user to compare the magnitudes of the different operations. The advantage of these types of graphs is that the user can quickly see when a number is off by an order of magnitude, which lets them spot potential problems much faster. For example, if the correct values for VIS Search and Connect are always lower than 2, it would be easy to draw a red circle at the value 2 on the chart, making it immediately noticeable when one of the operations exceeds that mark. The basic premise behind using these charts is that early and quick detection of problems becomes much easier because less depends on the domain knowledge of the user.

Figure 5.5: Application operations data.

In many ways the CyberVisual system is very similar to the Excel sheet that is output by the simulator. The primary difference is the medium of presentation. The CyberVisual environment is an application that can be accessed by anyone on the web at any time. This is an immense advantage over the current output because this medium allows anyone to access the information (provided they have the credentials to do so). Further, the user interface is also optimized for touch screens as well as mobile devices, making it accessible from many devices and locations.

The CyberVisual output system is built using Node.js, with several different APIs for creating the visualizations. To gather the required information, the system uses AJAX requests to load the output file into a JavaScript module that processes it. The module creates numerous arrays that take the information from the different sheets and clean the data, after which a main function builds the main/overview page of the user interface. Each of the arrays is then passed into sub-procedures that handle the individual tabs for the unique datacenters.

The first iteration of the project used D3.js for drawing the different graphs and displaying information. However, after finalizing precisely which visualizations to use, it was scrapped in favor of specific libraries that could create the graphics. For the map on the homepage, we use Vectormaps.js to create a vector map based on a CSV (Comma Separated Values) file. This also allowed quick customization of the map graphic, letting us plot each individual datacenter. An alternative approach could have been the Google Maps API, which was used for the CyberVisual Input interface. To draw the individual graphs, we used a combination of Highcharts.js and Chart.js. The primary benefit of using these two libraries is that they are both lightweight and enable plenty of customization without the learning curve that typically comes with D3.js. Overall, both libraries were efficient at creating powerful and graphically appealing visualizations without taking too much development time, which was immensely important given the iterative design process used in the project.
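As a rough illustration of this loading pipeline, the sketch below requests the simulator output with an AJAX call and renders one of the overview line graphs with Highcharts. The endpoint, field names, and container id are assumptions made for illustration; the real system processes the simulator's Excel output rather than a pre-converted JSON file.

    // Illustrative sketch only: '/simulation/output.json', the field names, and
    // 'global-push-chart' are hypothetical, not taken from the CyberVisual source.
    var request = new XMLHttpRequest();
    request.open('GET', '/simulation/output.json');
    request.onload = function () {
      var output = JSON.parse(request.responseText);

      // Build one cleaned series per datacenter for the global push-data graph.
      var pushSeries = output.datacenters.map(function (dc) {
        return { name: dc.name, data: dc.pushData };   // [[timestamp, bytes], ...]
      });

      // Highcharts draws the color-separated multi-line chart below the world map.
      Highcharts.chart('global-push-chart', {
        title: { text: 'Global Network Push Data' },
        xAxis: { type: 'datetime' },
        yAxis: { title: { text: 'Data pushed' } },
        series: pushSeries
      });
    };
    request.send();

Structuring the cleaned data as one array per datacenter mirrors the per-tab sub-procedures described above: the same arrays that feed the overview chart can be handed to the routines that build each datacenter's tab.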
As with any other system, the CyberVisual Output interface has tradeoffs that come from being optimized for a specific set of users. By using graphs to show the majority of the information, the user loses the precision that was available in the original system. It is very easy to see the magnitude of performance for a specific application operation; however, it is substantially harder to determine the exact value of different operations. Further, the primary problem with the Chart.js library is that it can lag when loading, especially if the network simulation contains a lot of information. This can be frustrating at times, considerably so when the information is needed quickly. Nevertheless, the system is an improvement for the purpose of introducing simulation results to a wider range of audiences and making them substantially easier to see and access.

Chapter 6: Further Work

In the previous few sections we discussed the design trade-offs of both CyberVisual environments and analyzed the development strategies used in both interfaces. With each environment it is easy to see that there is still room for improving and iterating on the designs. Here we present a few ideas for improving and changing how we think about the data and each user interface. Further, we analyze trends in large-scale visualizations to suggest ways we can apply these patterns to our current environments.

First, let us consider our assumptions about good visualization. In an earlier section we described a good visualization as one that follows and takes advantage of the different dimensions of usability. Specifically, we also mentioned that good visualizations often guide a user through some ordering of the data, enabling the user to gradually build up information. However, even presenting the data in an ordering can still be problematic for some users, because the user interface itself (especially for dashboards) can have a steep learning curve. Consider the Bloomberg terminal dashboard as an example. Its user interface has been described by UX Magazine and by users as "hideous"; however, because of its de facto status as the premier dashboard for the finance industry, the design has not actually changed in years. This presents a dual problem: the first is that the data itself (in this case financial data) is complicated to start with; the second, and the easier one to solve, is that the user interface and dashboard are complicated to learn and understand. The same problem applies to some network dashboards, and it is problematic primarily because it makes it much more difficult to train people to quickly learn and use the different interfaces. Essentially, solving this problem lets users with little domain knowledge get to the primary goal of understanding the dataset, rather than spending a lot of time trying to learn the environment.

6.1: Adaptive User Interface (AUI) Dashboards

An approach that could potentially solve this problem is creating a user interface that gradually changes over time. As increasingly sophisticated frameworks for developing web applications are created, it is becoming easier to quickly modify and iterate on a user interface. Further, with the abundance of storage and data sources, it is also becoming easier to collect more information about individual users and their usage patterns.
For example, by attaching a simple click counter to event listeners on every link in a web application and saving this information with every session of a logged-in user, it becomes easy to see how a specific user navigates through the application (a brief sketch of this idea appears at the end of this section). Although this method might not be the most practical to scale, it is one of many ways that user information and usage patterns can be collected. This is the central idea behind the Adaptive User Interface (AUI) model, which heavily emphasizes the personalization of user interfaces.

For example, with respect to our simulation interface, when a user first logs into the dashboard to run a simulation they would be given a very minimalistic user interface to start with. The UI would be simple enough that the user could design a basic network, similar to the CyberVisual Input Environment. The main goal would be to get the user up and running as soon as possible. The benefit of the AUI comes in as the user becomes increasingly familiar with the dataset and the user interface. Gradually the system notices which options the user tends to select, as well as the usage patterns of the logged-in user, and responds by enabling new areas of the UI that let the user quickly select and modify those options. Over time the user interface can become as complicated as necessary as the user grows from a novice into a power user. By enabling this kind of change in the dashboard, the user interface effectively trains the user on the dataset and the options available to them. Therefore, this approach can not only save time, but can also be cost-efficient, as the user would not need training from others in order to traverse the steep learning curve.

However, there are a few loopholes in this design that should also be covered. When a user starts with a minimalistic design, there are numerous features that would not be available to them immediately. Therefore, if a user needed a specific feature, a search bar with auto-fill at the top of the screen would let them select that feature, and a widget would be displayed in a subsection of the screen in order to use it. The AUI in this scenario is a Single Page Application (SPA), where new widgets are displayed on the screen as the user needs them. Over time these widgets would simply load at startup as the system recognizes the user's viewing and usage patterns. The other concern to address is users who are already power or advanced users and do not want to start with a minimalist design. This case is simple because each user logs in with credentials, and therefore the UI is personalized and saved separately for every user.

The AUI design is an ideal next step for the CyberVisual project, primarily because it would enable non-experts to quickly get to learning and understanding actual network data instead of spending too much time learning the dashboard and user interface. It also targets all cohorts of users, regardless of experience or other variables, by creating an efficient way to personalize and optimize the interface for each individual user. The primary tradeoff of this approach is that it is time-consuming: creating an adaptive user interface can take a lot of development and testing time. Further, it could also be more costly to scale to a massive number of users.
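A minimal sketch of the click-counting idea described at the start of this section is shown below. The data-feature attribute, the /api/usage endpoint, and the currentUserId global are hypothetical placeholders, not part of CyberVisual; they only illustrate how per-user usage patterns could be recorded for an AUI.

    // Hypothetical sketch: 'data-feature', '/api/usage', and window.currentUserId
    // are placeholders used only to illustrate per-user usage tracking.
    var usage = {};   // feature id -> click count for the logged-in user's session

    document.querySelectorAll('a[data-feature]').forEach(function (link) {
      link.addEventListener('click', function () {
        var feature = link.dataset.feature;
        usage[feature] = (usage[feature] || 0) + 1;
      });
    });

    // Persist the counts when the session ends so the dashboard can later promote
    // the most-used widgets into that user's default layout.
    window.addEventListener('beforeunload', function () {
      navigator.sendBeacon('/api/usage', JSON.stringify({
        user: window.currentUserId,
        counts: usage
      }));
    });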
Finally, as is the case with all types of AUIs, the adaptation process requires a good understanding of the user's usage patterns and intentions. Without accurately understanding these aspects, the adaptation might be detrimental to the user, as it could steer away from the user's goals and use cases.

Chapter 7: Conclusion

In this section I discuss the overall motivation and objectives of the CyberVisual project. CyberVisual represents an iterative approach to improving dashboards and visualizations using the same principles applied to the design of user interfaces in many of today's web applications. This approach is beneficial because it avoids the pitfalls that commonly occur in ineffective visualizations. By creating visualizations that follow the different dimensions of usability, we can ensure, for example, that our visualizations are not only familiar but also easy to learn for most groups of users, regardless of their diverse backgrounds. Further, by applying common design principles used in today's websites, such as the 3-click rule, we can improve the efficiency of dashboards and let users get to their query's result much faster.

With the input environment for CyberVisual, we presented different ways to show information while creating an interface primed for interaction. Further, we explored different modes of input for the same dataset to optimize how we begin the process of mapping a visualization set to a data outcome. The biggest takeaway was the importance of creating a user interface that aims for and enables efficiency for the user. To make this a reality, we explored associations with information visualizations, using data maps to provide an easy way to build network topologies. Further, we created a system to translate this information into custom input files for the simulation system.

By creating the output environment for CyberVisual, we explored the other side of the visualization picture: the mapping of data into visualization. Here we presented novel ways to show information using data visualizations such as polar area and radar graphs and various other types of charts. We also explored information visualization techniques such as parallel coordinates to show global output information. Further, we discussed why current visualizations within dashboards are ineffective by analyzing our user interface against the dimensions of usability mentioned above, and we discussed methods for effectively mapping data and visualization.