CyberVisual: Designing User Environments for Large Scale Networks and Simulations
by
Nahom Hailemariam Workie
S.B., C.S., M.I.T., 2013
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 2014
Copyright 2014 Nahom H. Workie. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.
Author: Signature redacted
Department of Electrical Engineering and Computer Science
May 15, 2014
Certified by: Signature redacted
Dr. Abel Sanchez, MIT Geospatial Data Center
Thesis Supervisor
Accepted by: Signature redacted
Prof. Albert R. Meyer, Chairman
Masters of Engineering Thesis Committee
CyberVisual: Designing User Environments for Large Scale Networks and Simulations
by
Nahom Hailemariam Workie
Submitted to the Department of Electrical Engineering and Computer Science
May 15th, 2014
In Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The growth of data collection within the technology sector has been increasing at an
astounding rate over the last decade. This growth has given rise to techniques and statistical tools
for computation that enable us to see trends and answer queries; however, most of this
information has been in the form of numbers in a data set resulting from these computations. As
this trend towards data collection and analysis continues, it becomes increasingly important to
show and present this information to users and non-data scientists with quantifiable and concise
methods. With respect to large data in the field of networks and cyber security, though the tools
for calculating and simulating threats are available, there are currently very few options for
showing the results. Here we demonstrate CyberVisual, a possible visualization tool for
displaying and simulating calculations on client-side applications. CyberVisual is a two-part
attempt at changing how this information is presented in the form of visual encoding. The tool is
a client-side application that would enable a user to make better sense of information using
various visualization techniques to emphasize different modes of summarization and interaction
with a given dataset. Through this we aim to improve the future of visualization by offering
users different and more interactive approaches for exploring massive data sets.
Keywords: Human-Computer Interaction, Visualization Analytics, Computer Systems,
Big Data
Acknowledgements
This work is partially funded by the MIT Lincoln Laboratory. Any opinions, findings, and
conclusions or recommendations expressed in this thesis are those of the author and do not
necessarily reflect the views of the MIT Lincoln Laboratory.
Dr. Abel Sanchez, for not only serving as my supervisor, but also giving me the opportunity to
really dive into and get a chance to learn and explore new areas and technologies that I wouldn't
have been able to discover otherwise.
Ivan Ruiz, for helping me brainstorm and try new ideas, as well as keeping me focused towards
the end of this experience.
My family, for their unwavering support throughout my life.
Table of Contents
Chapter 1: Introduction
1.1 The Simulator
1.2 Analytics and Event Correlation
1.3 Visualization Engine
Chapter 2: History and Background of Data Visualization
2.1: The Growth of Visualization
2.2: Current State of Dashboards and Visualizations
Chapter 3: Simulation, Analytics and Event Correlation
3.1 Modeling Networks
3.2 Analytics Module
A. Support Vector Machines (SVM)
B. One Class-Support Vector Machines (OC-SVM)
C. Statistical Models and Covariance Estimation
3.3 Summary
Chapter 4: CyberVisual User-Input Environment
4.1: Iterative Design for User-Environments
4.2: CyberVisual Input
Chapter 5: CyberVisual User-Output Environment
Chapter 6: Further Work
6.1: Adaptive User Interface (AUI) Dashboards
Chapter 7: Conclusion
References
Chapter 1: Introduction
The importance of data collection has exponentially risen over the past decade giving rise
to a tremendous growth within the Information Technology (IT) sector. As this growth continues
with an increase in storage capabilities as well as the rapid development of cloud infrastructures,
the protection of the datacenters that store this information is quickly becoming a priority. A
portion of the reason for growth lies in the simple fact that the global datacenters for the majority
of multinational technology companies hold a vast amount of public as well as private data. This
data not only needs to be reliably stored, but also needs to be protected as it contains personal or
internal information.
With regards to network growth and security, the movement towards and emphasis on
Big Data and the rise of statistical tools to uncover patterns in this data have also enabled new
ways of finding potential problems and anomalies and detecting possible threats. Furthermore, these
methods combined with pre-existing technologies have also provided ways to better show these
weaknesses to non-experts. The datacenters used by many corporations today are integral to their
success. However, cyber-attacks and network down time are some of the biggest factors that
impede the performance of these centers and, by extension, the companies themselves. As the
infrastructures of these companies grow it becomes increasingly important to detect potential
weaknesses and vulnerabilities for improvement within the global network.
As such, one possible way to achieve this is by creating a simulation of the global
network using various parameters in order to comprehend network behavior. Further, this ability
to test various scenarios and possible outcomes through simulations allows for a cost-efficient
way to detect possible areas of improvement potentially well in advance. This global simulation
and visualization system seamlessly integrates a global data center infrastructure with an event
correlation and visualization engine. With regards to this thesis, we'll briefly outline the
simulation and event correlation system used, but mainly focus on the visualization aspect of this
system.
1.1: The Simulator
In order to learn and understand more about network behavior within large global
infrastructures, we need to be able to simulate system performance and service availability. The
purpose of the global simulator is to provide this understanding without affecting the actual
global network and its services. Furthermore, the simulator should also provide granularity by
enabling the simulation of entire datacenters at a time, as well as being able to collect
information at the individual single client-server level. Finally, the simulator is also modular, as
it is still only one piece of the data simulation framework, to be integrated with an analytics and
event correlation module as well as a visualization module.
Creating a simulator has immense benefits over the standard way of approaching this
problem which is akin to infrastructure profiling. The simplicity and overall cheaper cost of a
simulator compared with actually testing and profiling is one of the primary reasons that make it
a better approach. Furthermore, infrastructure profiling can also reduce performance in large
networks simply because numerous parameters are being analyzed while the network is running
creating a larger overhead which can greatly reduce system performance. One of the biggest
advantages of a simulator however, is in the ability for users to test out various possible scenarios
to see how the network would respond. These sorts of "what if" questions enable the owners of
these datacenters to systematically ask and test different use cases in order to improve and
strengthen the overall infrastructure of the network. [1]
The system as a whole is also modular in the sense that it can be used with other
components in order to create tools for multiple use cases. An example of this is the Malicious
Activities Detection System (MADS) shown below, which is used for anomaly detection in
network topologies.
Figure 1.1: The diagram above shows the context diagram for the MADS (Malicious Activities Detection System) framework, a use case of the data simulation system for tracking malicious activities and anomalies. Its components include the holistic operator, the simulator, and the web browser.
1.2: Analytics and Event Correlation
The analytics component is the second module in the data simulation framework. In
order to detect malicious activities and anomalies as well as bottlenecks in networks from the
simulation data, the analytics component relies on both statistical and classification models.
The statistical models in this component use methods such as covariance estimation, where
the assumption is made that anomalies and various activities are generated within a standard
(normal) distribution. The inference this assumption makes is that activities which are deemed
normal happen in the areas of the distribution that correlate to high probabilities. In contrast,
the low probability areas of said distribution are assumed to contain potential problems or
malicious activities.
The analytics component also uses classification models, where no assumptions are made
about the data and its probability distribution. These methods become particularly useful when
the data has no standard distribution and is strongly non-Gaussian. A strong example of this
type of system used in MADS is the One-Class Support Vector Machine (OC-SVM) which
works well with unlabeled datasets in contrast to support vector machines that are usually
considered to be supervised learning algorithms (i.e. there is labeled training data that can be used
as a baseline). These methods of analyzing the data allow for an analytics engine that is
effective at classifying and analyzing different types of networks and anomalies.
Figure 1.2: 2D anomaly detection example using different correlation methods and their learned decision functions. For a Gaussian distribution, One-Class SVM (classification) produced 6 errors and covariance estimation (statistical) produced 2 errors; for a bimodal distribution, One-Class SVM produced 4 errors and covariance estimation produced 20 errors.
1.3: Visualization Engine
The visualization component is a two-part attempt at designing applications for a broad
range of use cases. The first aspect is summarization of a given dataset, which enables the user to
see trends and outliers in network application performance in a simple web application with a user
interface focused on simplifying the information and insights gained.
This client-side visualization application uses the output of the simulator data and
continuously imports it into a database. This data is then called on the browser using a
visualization library and displayed in a multitude of ways highlighting different correlations and
attributes of the data. The purpose of this application is to process the multivariate dataset and
use a layered approach in order to visualize this information. We aim to do this by first creating a
super set of smaller visualization views in order to continuously adjust and refine the message
given by the layer.
As a user traverses through the different layers of the application, we can also change the
context and queries that are explained by each layer. This approach allows us to present various
different types of relationships and correlations with varying degrees of detail.
Figure 1.3: Context diagram for the summarization visualization component, showing the simulator, the converter, the analytics module, and the input and output visualizations.
However, in order to answer the "what if" question discussed earlier, a second approach
is needed. The second aspect of the visualization component is an interactive web application
that enables users to better formulate the input to the simulation. By enabling the user to select
the datacenters and how the network is overlayed, we enable an interactive user experience that
helps engage the user to formulate and answer queries with regards to the network topology and
other analyzed information.
By using these applications we can provide a user multiple visualization toolsets for
various use cases. For example, a user can switch through different views of data and understand
the different correlations between them. As well as experiment with different topologies to see
where possible improvements can be made, or find bottlenecks in an already developed
infrastructure.
The full goal of my work with regards to the visualization component is two-fold. First, I
want to understand how to build better user interfaces and visualizations that are not only
appealing, but also significantly more informing than other ways of representing data. Secondly,
I also want to try and create new methodologies for improving user interaction and testing
approaches with the use of machine learning. Although these goals are broad and by no means
simple, I think our work with regards to the visualization component of the data simulation
framework is an effective start into trying to answer and complete these goals.
Chapter 2: History and Background of Data Visualization
To say the increase in the collection of data over the past decade has been tremendous is
somewhat of an understatement. Due to the decreasing cost of memory and storage, combined
with new ways of collecting data from a multitude of different resources, data is being stored at an
immense rate that has been increasing over the past several years. This storage of data enables
new possibilities in how we approach problems and technology. Specifically, with more
information becoming available to us it becomes substantially easier to make inferences and
decisions for problems in different sectors of industry.
However, in order to better understand our current state and where we're heading, we
need to better understand how we got here. In this section, we briefly go over the rise of the
collection of data as well as the use of visualizations and dashboards. From there we begin to
highlight some of the problems that we're currently facing with visualizations, so that later we
can connect them to some of the novel ways that might improve these problems.
In the early 2000s, the idea of big data was considered more theory than practice.
Although there was an increase in the collection of data at the time, it was still considerably
slower and nowhere near enough to be considered a shift in the paradigm of how we make
decisions. The phrase "data scientist" was not nearly as common then as it is just 10-12 years
later. This is due to numerous factors, one of the primary being that the infrastructure for
accessing and sharing so much data was still in its early stages. In 2000, the world's capacity
for data communication was 2.2 Exabytes. This increased by almost a factor of 30 in 2007 to
65 Exabytes, and is expected to increase to 667 Exabytes by 2014. As storage and
communication capabilities increased dramatically over time and with approaches to finding and
storing data matured, this growth in data science exploded. [3]
The infrastructure for supporting this growth in data science has happened rapidly
over the past decade. One of the major contributors to this was the creation of the MapReduce
process by Google in 2004. This process enables the parallel processing of substantial
amounts of data in order to compute and deliver query results. Later, the approach was modified
and extended for big data by Apache, starting in 2005, to create Hadoop. With other
technologies such as MongoDB being developed, the architecture for big data quickly grew and
matured during the late 2000s.
However, even with this growth, this paradigm for how we approach problems is
still very much in its early stages. Within industry, big data is currently growing at over
10% per year, and with an increase in the number of people gaining access to the internet
combined with the improvement of storage and communication networks this number will
continue to grow immensely over the next decade. This presents new opportunities to take
advantage of this information, specifically with respect to making critical decisions for different
problems or systems.
2.1: The Growth of Visualization
Often when data science is considered, one of the primary concerns that usually comes up is
how this data and results are presented. Given that some multivariate datasets have thousands if
not hundreds of thousands of rows, it becomes crucial to figure out the optimal way to show this
information. Early data visualizations were not actually created solely for this
purpose; rather, they were a subdomain of computer graphics, which was used early on for
modeling and presenting problems. As computer modeling and graphics grew and the idea of big
data emerged, these areas naturally combined into the field that we understand today. [4]
Of course, the use of data visualization to show information goes back well before even
the modern computer era, as maps have been used to show geographical data for
millennia. At the end of the 18th century, we began mapping out and graphing mathematical
functions, as well as medical and economic data. The beginning of data collection as we now
know it began during this time as well. Over time we have learned and created better ways to
show information using different types of metaphors, graphs and references to simplify our
understanding of problems and the world around us. [7]
Data visualization as we know it could be said to have started in the 1960s as statisticians
began the process of information visualization. Ways to display different statistics and
information grew throughout the latter half of the 1900s; one prime example of this was "The
Visual Display of Quantitative Information" by Edward Tufte, whose work could be said to be a
precursor to modern-day information visualization. The growth of this field has enabled a
massive number of smaller areas that are actively being developed. [8]
The development of software for data visualization emerged in the late 2000s as the big
data industry took off. Dashboards, user environments for analysis, and libraries to create
different forms of presentation became increasingly complex as we began to create improved
ways of showing information. However, even as we develop new methods for improving how we
present new information we still face crucial challenges that need to be better understood and
solved. Data visualization is no longer about just showing graphically appealing graphs and maps
in order to show information, but also an analysis of how users will actually understand and take
in this information.
In creating and presenting better information to improve the state of this field, we have to
better understand the cohorts of users that are looking to use the visualizations as well as the use
cases that they aim to satisfy. Later we discuss in more detail some methods to improve user
environments and visualization. Some of the more promising ideas in this space come from
testing users in order to figure out the best way to create and improve visualizations (A/B testing),
as well as using test cases and Artificial Intelligence in order to figure out the optimal way to
improve a user interface (Planning). These methods of iterating and improving on user interfaces
can also be used to improve visualizations in order to enable users to garner better insights and
patterns from different datasets. Essentially, these methods aren't creating new ways to show this
information, rather they're using information about the user and use cases to iterate and improve
on an already created visualization.
2.2: Current State of Dashboards and Visualizations
Finally in the brief depiction of the growth and evolution of data science and
visualization, we now come to outlining the problems that are emerging with dashboards and
visualizations. With the abundance of data and numerous libraries and languages to support the
creation of various unique and detailed visualizations, the biggest problem that arises is that
many of these graphics are often ineffective. In order to understand why, we first discuss
what makes an effective visualization and then list some of the shortcomings and
problems that occur when designing visualizations, as well as how some of these issues can be
resolved if carefully considered with respect to the data.
One of the primary purposes of information visualization is to enable the user to quickly
identify patterns, answer questions, understand correlations/relationships, and absorb the
dimension of the dataset. That's a lot of different requirements for a graphic and often
visualizations complete one or two of these points instead of the full spectrum. Often
visualizations tend to focus completely on a single point, which can confuse the viewer or lead
them to misunderstand the purpose and relationships of the respective dataset. [2]
In order to effectively show a dataset, a visualization should create a natural mapping from
the data to the visualization. This means that the user should easily be able to see the
dimensions and idea behind the dataset when looking at the visualization. For example, if you
have a map with latitude and longitude it's relatively easy to see a dataset with a list of cities and
their exact locations. Although this is a simple example, it's straightforward to see the mapping
between the data and the visualization. As the datasets become increasingly complex the
intuitiveness becomes more difficult to achieve; however, maintaining this relationship helps
immensely in the creation of effective graphics.
Another prime aspect of using visualizations to show information is the fact that it
enables us to get a better understanding of dataset density. For example, if a viewer had to sort
through a table of people to find the count of the number who live in California they would have
to sort through thousands of rows of a table to see the number of people. This problem can easily
be visualized through a bar chart that shows the frequency count. Using this the viewer would
not only be able to see the result of their query, but they would also be able to see the total
density of the dataset giving them a bigger perspective on the overall picture.
The last and probably one of the more important aspects of visualization that
seems to go wrong concerns labels. Effective visualizations create labels that allow the user to
better identify trends and patterns within the data, as well as see the scale at which these patterns
are occurring. For example, if a graph shows exponential growth, the viewer can clearly tell the
rate of growth for the respective dataset is high. However, with no sense of scale, the viewer has
no idea of what this information is relative to. Incorrect or unthoughtful use of labels and color
scheme create visualizations that often tend to look appealing, but give no meaningful insight or
information to the viewers. [6]
As we consider what makes for effective visualization designs, it becomes easier
to see why today's dashboards and graphics are ineffective. Creating good graphics requires a
thorough thought process about the different aspects of the dataset as well as the different cohorts of users
that will utilize the graphic. Often we tend to overdo one aspect of a visualization without
considering other dimensions. One of the better ways to think about designing graphics is
commonly used in User Interface (UI) design, which is to consider the different dimensions of
usability. These dimensions are easy to learn, effective, engaging, error tolerant and efficient. [5]
Some of these, like error tolerant apply differently from how they are defined in user
interfaces. For example, in UI the idea of error tolerance means enabling the user so that they
don't do anything that would break the system or mislead them. This is often done by using
confirmation pages in order to make sure the user is correctly using the interface. In information
visualization, the idea of error tolerance applies in the sense of making sure that the viewer doesn't
misread or misunderstand the information given to them. This can be done by making legends or
better labels in order to effectively guide the viewer and prevent them from making possible
errors when exploring the dataset.
By applying these different dimensions of usability in addition to better mapping of the
datasets, we can aim to improve the inefficiencies that are present in modern visualizations.
Chapter 3: Simulation, Analytics and Event Correlation
When testing a system with numerous datacenters worldwide, it can be more practical to
simulate and analyze this data using software than to physically test the entire system. The
simulator system was designed specifically for this purpose by enabling the user to see outcomes
and insights for various possible scenarios using input parameters as constraints. In this section,
we briefly continue the discussion regarding the modules from the previous chapter by going into
detail about the different aspects of the simulator.
One of the primary motivations for the simulator was to experiment with and
scale networks to show resource/time allocations and their impact with regards to multiple
concurrent users. In order to do so, the simulation system makes assumptions about resource use and
allocation: it simulates a request initiated by a user and the
corresponding response that would occur from the respective server, demonstrating a client-server
interface. Further, the system also takes this to scale by modeling multiple concurrent user
requests and respective responses. On a large scale, this enables users to understand how the
datacenters in their network will behave and perform under different conditions and topologies.
In discussing the architecture of the simulation platform we'll dissect different
aspects of the models in more detail, as they are more important, in some respects, to the
CyberVisual project. The simulator was implemented in C# and uses the CCR (Concurrency and
Coordination Runtime) library for asynchronous methods and behavior. Its input files are
primarily in XML format. During initialization these files are interpreted and deserialized into
the network and each of the various sites. The simulator uses an XML parser in order to create
the different objects that are part of the simulated network topology.
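As a rough illustration of this deserialization step (the actual simulator is written in C# and defines its own schema), the Python sketch below parses a hypothetical topology file into simple site objects; the element and attribute names used here are placeholders, not the simulator's real input format.

```python
# Minimal sketch of deserializing a network-topology XML file into objects.
# Element/attribute names ("site", "name", "latency", "bandwidth") are
# hypothetical; the real simulator (written in C#) defines its own schema.
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    latency_ms: float
    bandwidth_mbps: float


def load_topology(path: str) -> list[Site]:
    root = ET.parse(path).getroot()
    sites = []
    for node in root.findall("site"):
        sites.append(Site(
            name=node.get("name"),
            latency_ms=float(node.get("latency", "0")),
            bandwidth_mbps=float(node.get("bandwidth", "0")),
        ))
    return sites
```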
Although this XML input methodology enables the simulator to run efficiently
when the user manually edits and changes values to simulate a topology, it becomes immensely
ineffective when the user understands neither the inner workings of the simulator nor how to
actually modify and create different topologies. The CyberVisual project aims to correct this by
providing a graphical user interface (GUI) that enables the users to create and draw their own
network topology. However, this problem requires dealing with a conversion and customization
of the input files into the standard XML format taken by the simulator. We discuss this
implementation later on in a different section including a converter module to create these
standard input files.
The output of the simulator is a file in Microsoft Excel (.xls) format, which enables
the user to easily see the data in a familiar format and user interface. This also enables the results
of the simulation to easily be translated into graphs and various charts. Further, the output system
for the simulator is designed to be modular, which enables us to connect and link it to a client-side
site for further analysis. Although having the precise numbers of a simulation is useful, enabling
a user to see this information in a multitude of ways (i.e. numbers, graphs, charts, animations,
etc.) is more beneficial because it helps the user understand this information. A client-side website
is substantially more effective than an Excel output primarily because it excels at doing this,
enabling a variety of visualizations rather than just the numbers of the network simulation.
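As a hedged illustration of how a client-side application might consume this output, the Python sketch below loads one datacenter sheet from the simulator's Excel file using pandas; the sheet and column names are assumed for illustration and do not reflect the simulator's actual output schema.

```python
# Sketch: load one datacenter sheet from the simulator's .xls output.
# Sheet and column names here are illustrative assumptions.
import pandas as pd


def load_datacenter_sheet(path: str, sheet: str) -> pd.DataFrame:
    # .xls files need the xlrd engine installed; .xlsx files use openpyxl.
    df = pd.read_excel(path, sheet_name=sheet)
    # Keep only the columns a visualization layer might plot.
    return df[["Time", "OperationsInExecution", "LoggedUsers"]]


# Example usage (file and sheet names are hypothetical):
# df = load_datacenter_sheet("simulation_output.xls", "USA Datacenter")
```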
Figure 3.1: The figure above shows the different blocks of the CyberVisual simulation platform.
3.1: Modeling Networks
The premise of the simulator is based on a test of a single-user client-server
system carried out in a lab by Dassault Systemes. The idea builds up to a multi-system model tested against multiple concurrent users with regard to system resource allocation
and timing. The simulator breaks down different aspects of different datacenters/servers by
creating different linear and non-linear performance models. Although there are assumptions that
are made with this regard, this still produces large-scale simulations that allow for users to see
accurate and realistic results of simulated systems.
The network architecture of the simulator basically uses a distributed co-design system
via a client/server topology approach for distributed file system design. This begins with
software that enables a user's machine to connect over the network to the main server. Different
sites in the network are represented in this system using dispersed nodes. The model enables
them to act as either primary or local datacenter nodes.
Figure 3.2: The figure above shows a basic one client-server architecture (client, server, CPU, and memory).
This setup enables a low-latency system. Further, it creates a more stable network with
multiple copies of information, thereby reducing network loss. Further, since we can use the web
user interface to define the network as an input, we can also design and simulate different types
of networks based on different input modifications. The simulator as a whole however is only a
partially distributed system, as there are constraints that must be put on specific aspects of the
network inputs. To ensure the performance of the simulator it was also tested and validated
against laboratory results for multiple/concurrent users.
3.2: Analytics Module
The analytics module of the CyberVisual system is responsible for the classification and
identification of attacks and anomalies that can be modeled via the simulator. At the end of a
simulation, the first module is responsible for aggregation and summarization of data. In order to
create this module, we must consider ways to process and prepare the input as well as identify
the methods that would enable us to perform analysis on this multivariate dataset efficiently.
Here we examine the analytics module created by the MADS framework in order to see how a
general case analytic module would work. [9]
In order to support user-controlled adjustment of time-series visualizations, we need to
aggregate and analyze the simulation data at different time intervals. After this, the data is
converted into the appropriate input for the visualization engine, which can access and
display this data in a client-side browser. Therefore, the analysis engine has to be broken into
multiple smaller parts in order to complete this task. First, we must create a converter class that
enables the efficient conversion of the simulation data so that it becomes an appropriate input for
the analysis module. This is a crucial but relatively trivial task that allows the data to be ready as
input for supervised learning algorithms.
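A minimal sketch of such a converter is shown below, assuming the simulation output has already been loaded into a pandas DataFrame with a timestamp column; the column names and the one-minute resampling interval are illustrative choices rather than the MADS implementation.

```python
# Sketch: aggregate simulation output into fixed time intervals so the
# analytics module receives one feature vector per interval.
# Column names and the interval length are assumptions.
import pandas as pd


def aggregate_by_interval(df: pd.DataFrame, interval: str = "1min") -> pd.DataFrame:
    df = df.copy()
    df["Time"] = pd.to_datetime(df["Time"])
    grouped = df.set_index("Time").resample(interval)
    # One row per interval: average load, peak operations, and average users.
    features = grouped.agg({"OperationsInExecution": ["mean", "max"],
                            "LoggedUsers": "mean"})
    features.columns = ["ops_mean", "ops_max", "users_mean"]
    return features.fillna(0.0)
```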
The MADS analysis engine itself contains multiple algorithms in order to analyze this input.
One of the most important aspects of this analysis is the classification of simulator data
between runs under normal conditions and runs in which there is an
attack or an abnormal increase in the output data. An example of this is a Distributed Denial
of Service (DDoS) attack. This type of attack can be simulated by systematically increasing the
number of operations per minute per application. This allows for an overload in the usage of that
operation which is similar to the expected symptoms of a DDoS attack.
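As a rough illustration of this idea (and not the simulator's actual mechanism), the sketch below scales a per-minute operations series over a chosen window to mimic the overload pattern of a DDoS-style attack; the baseline traffic, window, and scaling factor are arbitrary assumptions.

```python
# Sketch: inject a DDoS-like overload into a per-minute operations series
# by systematically scaling the counts inside an attack window.
# The baseline, window, and scaling factor are illustrative assumptions.
import numpy as np


def inject_ddos(ops_per_minute: np.ndarray, start: int, end: int,
                factor: float = 10.0) -> np.ndarray:
    attacked = ops_per_minute.astype(float).copy()
    attacked[start:end] *= factor  # abnormal spike in operation usage
    return attacked


baseline = np.random.poisson(lam=50, size=120)     # two hours of "normal" traffic
simulated_attack = inject_ddos(baseline, start=60, end=75)
```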
When these types of attacks are being simulated, it becomes a vital aspect of the analytics
module in the MADS system to be able to classify them as such. There are several algorithms that
allow for this kind of classification; a relatively straightforward approach is to use
Support Vector Machines (SVMs). Using the marginal distance of the farthest points from the
hyper-plane it becomes dramatically easier to separate normal simulation data from that of a
simulated anomaly. This approach also places no assumptions on the distribution of the data, which is
immensely useful if the data has complex boundaries (i.e. when defining normal activity).
To explain this idea further, let's consider the idea of normal simulation activity. This would
regularly mean that the simulation is behaving correctly with regards to specific margins of data;
however, there is no exact metric for defining this because the values would fall within a specific
range rather than take a discrete value. Therefore, for this type of non-Gaussian distribution it becomes
particularly useful to not depend on statistical models, which depend heavily on assumptions of
the distribution of the input data.
Furthermore, SVMs attempt to separate data from the origin with the maximum margin
using a hyperplane, which is dramatically different from other ways of separating data, such as using
lines or planes to separate positive and negative samples of input data. In the next section, I briefly
discuss the machine learning algorithms used by the MADS system.
A. Support Vector Machines (SVM)
In looking at SVMs, let's first consider a simple linear classifier for a binary classification
problem. This is a practical consideration since our actual system will do binary classification
(anomaly vs. standard performance) at the basic level. If we state that the classifier has features
denoted by x and labels denoted by y, then we can state the non-vector parameterization of the
classification equation as:
$$h_{w,b}(x) = g(w^{T}x + b)$$
where
$$g(z) = \begin{cases} 1 & \text{if } z > 0, \\ -1 & \text{otherwise.} \end{cases}$$
From the definition above it's easy to see that the classifier will output two results that
correspond to either one the values in the binary classification problem. This is particularly useful
since it allows us to quickly analyze and classify a multivariate dataset from a given sample time
provided that the dataset is labeled and we understand what the key requirements are for classifying
this data. However, there are occasions where this is not possible. For example, when the dataset
is unlabeled it becomes difficult to use supervised learning algorithms like this type of SVM in
order to do anomaly detection. Therefore, it's crucial to explore other types of algorithms that
could potentially help improve this type of detection.
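The decision rule defined above can be implemented directly, as in the minimal sketch below; the weight vector and bias are placeholder values, since in practice they would be learned by training the SVM.

```python
# Sketch of the linear decision rule h_{w,b}(x) = g(w.x + b),
# with g(z) = 1 if z > 0 and -1 otherwise.
import numpy as np


def classify(w: np.ndarray, b: float, x: np.ndarray) -> int:
    # Returns +1 (e.g. anomaly) or -1 (e.g. normal), matching the binary labels.
    return 1 if float(np.dot(w, x) + b) > 0 else -1


# Placeholder parameters; in practice w and b are learned by the SVM.
w = np.array([0.7, -1.2])
b = 0.1
print(classify(w, b, np.array([2.0, 0.5])))   # -> 1
```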
B. One Class-Support Vector Machines (OC-SVM)
Unlike traditional Support Vector Machines, OC-SVMs are unsupervised learning
algorithms, which can work with an unlabeled dataset. This makes them immensely useful for
detecting anomalies especially considering the additional fact that classification algorithms have
no dependence on the actual distribution of datasets. Similar to the SVMs discussed in the previous
section, the OC-SVM algorithm attempts to find a hyperplane to use for classification between
normal performance and actual anomaly detection. The decision function for this algorithm is as
follows:
$$f(x) = \sum_{n=1}^{N} \alpha_n K(x_n, x) - \rho$$
In this function, α_n are the Lagrange multipliers, ρ is the bias of the hyperplane and K is the
kernel function. In this case, if f(x) < 0, then an observation can be classified as an outlier. The
OC-SVM method works well when there is no well-known distribution for the data (i.e.
Gaussian). However, when the distribution of the data is known, it's better in most cases to switch
to statistical models for analyzing the data such as covariance estimation.
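A minimal sketch of this approach using scikit-learn's OneClassSVM is shown below; the synthetic data, kernel choice, and nu parameter are illustrative assumptions rather than the values used in MADS.

```python
# Sketch: One-Class SVM anomaly detection on unlabeled simulation features.
# The synthetic data, kernel, and nu value are illustrative assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # "normal" activity
outliers = rng.uniform(low=-6, high=6, size=(10, 2))      # potential anomalies

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(normal)

# predict() returns +1 for inliers and -1 for outliers,
# i.e. points whose decision-function value falls below zero.
print(model.predict(outliers))
```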
C. Statistical Models and Covariance Estimation
In the previous sections, we briefly discussed two approaches to anomaly detection
based on the assumption that there is no well-known distribution for the data. However, in some
cases it's possible to make basic assumptions so that the data can fit into a known distribution.
The basic example is to fit the data to the normal distribution, where anomalies in this distribution
occur with low probability while normal data occurs with high probability. When dealing
with this type of approach we can use statistical models and approximations like covariance
estimation in order to estimate and predict anomalies in the data. Combining these approaches, the
MADS system presents a comprehensive example of an analytics module for CyberVisual.
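A minimal sketch of this statistical approach, using scikit-learn's EllipticEnvelope (a robust covariance estimator) as a stand-in, is shown below; the synthetic data and contamination fraction are assumptions for illustration.

```python
# Sketch: covariance-estimation-based anomaly detection, assuming the
# normal data roughly follows a Gaussian distribution.
# The synthetic data and contamination fraction are illustrative assumptions.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
data = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.3], [0.3, 1]], size=300)

detector = EllipticEnvelope(contamination=0.02, random_state=1)
labels = detector.fit_predict(data)      # +1 = normal, -1 = anomaly
scores = detector.mahalanobis(data)      # distance from the fitted Gaussian

print(int((labels == -1).sum()), "points flagged as low-probability anomalies")
```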
3.3: Summary
The simulator gives us a piece-by-piece approach to analyzing networks, which is
crucial in enabling us to see the behavior of a network on a large scale. Meanwhile, the analytics
module is a crucial part of the CyberVisual system as it lays the foundation for answering
questions from the simulated dataset. By simulating and analyzing network data using different
methods we can begin to obtain results that enable further investigation through better queries
and insights. In the process, this also gives us an increasing understanding of different possible
simulation scenarios. Finally, it also creates the appropriate output format for the visualization
model which takes this information and gives it to the user in multiple frames and views making
it more effective and easier to understand.
Chapter 4: CyberVisual User-Input Environment
In examining the current state of the simulator we separately consider the two primary use
cases for the operators. First we consider the input state of the simulator which enables an operator
to basically create the network topology for the simulation. We explain the state of the process
prior to the CyberVisual project and the different steps that the operator took, as well as an analysis
on the effectiveness of the system. From there we present and analyze the CyberVisual project,
putting into focus the iterative design process for visualization. Here specifically, we will focus on
improving visualizations for user environments that require user-input, a perspective on the topics
discussed earlier albeit with unique challenges and approaches.
The simulator system originally used XML files as the primary source of input for
network topology. In defining various parameters for each aspect of a network the user can
specifically customize and create constraints in detail with respect to the two types of applications
that the simulator would test. During simulation, one of the primary methods of testing network
topology was by simulating CAD and VIS applications which required various input files for
Connections, Search, Filtering, as well as network latency, bandwidth, etc.
The combination of these parameters would effectively simulate a client-server
architecture within a system. Before doing so, however, the user is required to input the parameters
in an input folder. Although a substantial number of these files could be exempt from this process,
it is still a difficult and time-consuming bottleneck since the parameters would have to be modified after
every simulation run for a different network.
Figure 4.1: The screenshot above shows a CAD Connect file, one of the many input files that the user had to manually enter in order for the simulator to create the network topology.
In total, running a complete simulation requires the user to go through the input folder
of 30+ XML files and possibly modify up to that many. Not only is this a time-consuming
task, it's also immensely difficult for an operator with very little domain knowledge of
the simulation system. Yet this parallels scenarios in which companies have to teach new
employees how to use various unique and complicated dashboard systems for actively
monitoring networks. The difficulty lies in not only getting the operator to get familiar with how
to run the system, but also for them to actually understand all the possibilities that are offered.
4.1: Iterative Design for User-Environments
The primary goal for the CyberVisual project was to create a unique user environment to
make this process more efficient and usable. To do so, we used an iterative design approach,
first abstracting out the various details of the input files and asking whether we could reduce them to a
simple table where the user inputs the parameters instead of editing numerous XML files.
Figure 4.2: The prototype (v1.0) above showed a table-style view that kept track of every XML file as a tab (at the top, e.g. Connect, Search, Filter), followed by all the parameter options and the values for each parameter.
The approach for the first version of CyberVisual was simple, minimize the click count in
order to enable the user to find and enter the information as quickly as possible. Click counting
has been used constantly in website user interface design, primarily for making sure that sites are
usable. The 3-click rule, even though it has been disproven as a predictor of success rate,
is still widely used by designers as good practice. With regards to
visualization this rule could actually be of practical use if the graphic is designed so that the user
can get an answer to their question within 3 clicks. This would enable the user to gain the value
and insights they need in a relatively short amount of time.
In this prototype, the tabbed view is used to abstract away the folder and access
each of the files independently on their respective tabs. This approach is effective due to the fact
that it also minimizes the amount of time the user has to spend preparing the input by opening
each file independently. Further, having one location to view each independent file enables the
user to also gain a better understanding of the overall structure of the input data thereby
decreasing the learning curve for the respective user. However, one of the underlying problems
with this iteration is that it doesn't address one of the earliest aspects of good visualizations that
we discussed earlier: it doesn't naturally map the data into a visualization. The data in this case is a
network topology consisting of datacenters with geo-location data.
4.2: CyberVisual Input
In order to better use the geospatial information given, we iterated on the previous design
by adding a map and enabling the user to actually design the user environment that they want to
simulate. Then this information is submitted to and sent back to the server via AJAX requests
and processed to recreate the XML files that the user originally had to input. This approach
enabled the user to basically select the type of server that they wanted to place, where the
respective server would be located, as well as the parameters with regards to server connectivity,
bottlenecks, etc. Overall it would create a simple user environment that would dramatically
decrease the learning curve for the user in order to give them more control over the actual
customizations.
Figure 4.3: The UI for CyberVisual ("CyberVisual: Interactive Large-Scale Network Simulation Analysis") allows the user to select 4 different types of servers that can be set on any location on the map.
Figure 4.4: After selecting 2 datacenters, the user can choose to connect them together using the wire option. By doing so they can fill out the table with the parameters required for the connection. Also, parameters that could be optional for this connection are automatically set to a default value, decreasing the amount of information the user is forced to enter.
Figure 4.5: The screenshot above shows the basic connection between two datacenters.
CyberVisual's use of Google Maps creates a layout that provides a
familiar user interface that is both efficient and easy to learn. The combination of a
map and a table layout for the parameters creates a full user experience for creating a global
network. Further, the color scheme is easy to understand where buttons are highlighted in
different colors based on the importance of their actions.
The "Run Simulation" button for example is red to caution user before running an actual
simulation. This is due to the fact that running a simulation sends the current state of the map and
the table to the server through AJAX requests and deletes the state of the board. Although a
confrmation button occurs after the user decides to run a simulation, it could still potentially cost
the user the most amount of time if they don't careful consider the current state of the UI before
selecting this option.
On the development end, the input user interface for CyberVisual was created
using the Rails framework for Ruby. On the front end, we used an HTML object approach in
order to create the servers and overlaid these servers on a map of the world (created using the
Google Maps API). When the user creates a new datacenter, it creates an object with a state of
different parameters for the type of datacenter. Some of these parameters are set by default to
expedite the process.
The default values are typically the ones that wouldn't concern the user and would be set
based on the type of datacenter chosen and other earlier parameters. To make the user interface
even simpler this information is saved internally without having to bother the user or overload
them with too much information. Therefore, the user would see a simple graphic while the
backend of the application handled the state and processing of the data.
Figure 4.6: The context diagram above shows the connection between the CyberVisual Input UI and the server via AJAX requests.
Server-side the application first collects all the data into several different tables separated
between location, application, and server type data. After the user inputs the parameters for a
server and places it on the map, an event gets triggered to take all the parameters and send them
to the server via AJAX. Once there, the server separates the different types of data and sends them
to the converter module. This module takes the information from the different information arrays
and recreates the XML files associated with this information.
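The sketch below illustrates, in Python, the kind of transformation the converter module performs: turning the parameter dictionaries received from the browser back into an XML input file. The actual converter is part of the Rails application and targets the simulator's own schema, so the element and attribute names here are hypothetical.

```python
# Sketch: rebuild a simulator input XML file from the parameter dictionaries
# sent by the browser. Element and attribute names are hypothetical; the real
# converter lives in the Rails application and targets the simulator's schema.
import xml.etree.ElementTree as ET


def build_topology_xml(datacenters: list[dict], path: str) -> None:
    root = ET.Element("network")
    for dc in datacenters:
        site = ET.SubElement(root, "site", name=dc["name"], type=dc["type"])
        ET.SubElement(site, "location",
                      lat=str(dc["lat"]), lng=str(dc["lng"]))
        for key, value in dc.get("params", {}).items():
            ET.SubElement(site, "param", name=key, value=str(value))
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Example usage with one hypothetical datacenter:
build_topology_xml(
    [{"name": "USA", "type": "primary", "lat": 42.36, "lng": -71.09,
      "params": {"bandwidth": 1000, "latency": 20}}],
    "network_input.xml",
)
```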
This process is near instantaneous, with no lag time, and therefore the user will quickly
be prompted that the information has been saved. After this is complete a script takes the
information and runs the simulator. The output of the simulator is an .xls sheet with the
information on each of the datacenters the user had created.
Although there are numerous advantages of the current CyberVisual system as
described above there are also some inherent disadvantages that need to be considered. The
CyberVisual Input UI doesn't do well for power users (i.e. the user cohorts that completely
understand the simulation system). This is primarily due to the inability to make advanced
changes when numerous parameters are set to default. In aiming to create a simple
and straightforward user interface we also decreased the amount of control that the user has in
choosing options for their network. We believe that this design tradeoff is worth it in this case
however because it enables a large group of cohorts to actually use the system even if it
somewhat limits the capacity of the overall system.
Chapter 5: CyberVisual User-Output Environment
The second case we present is the CyberVisual User-Output environment, in which
we consider how different types of users will see and interact with the output of a given simulation.
First, we consider the previous state of the output and the primary ways that information about a
given simulation is passed to the viewer. This will enable for a deeper conversation on the bigger
topic of making decisions when we have data. It's within this discussion that we depict the
CyberVisual User-Output interface and analyze its improvements over traditional outputs. Further,
we describe the trend in improving new visualization within dashboards for decision making as
well as the disadvantages of the current state of dashboards.
As we previously stated the output of a given simulation is saved as a single Excel (.xls)
file. Each sheet of this file contains information about a specific datacenter in the network topology.
For example, all the information about a particular datacenter in the United States would be stored
in a single sheet within the file. From there, the sheet also contains the time interval for which the
simulation was run, the given operations in execution, the number of users, and many other parameters
related to simulating a client-server environment. The primary structure of each datacenter
sheet contains two different applications that are being simulated. The CAD and VIS applications
basically represent different applications that might be run from different servers. They aim to test
and estimate how much processing power and speed these applications would take in different
network topologies with different amounts of users.
Further, different networks can be customized to only run one of these applications instead
of both, due to the fact that the server configuration might not support the application. The
output of the simulation enables the user to precisely see the output numbers of the simulation as
well as the three graphs. These show the frequency parameters for the number of users that are
logged in for each of the applications. Probably one of the most important metrics in each of the
sheets is the Operations in Execution, a metric that, along with the number of users,
allows the viewer to get an idea of what the normal output is for a specific number of logged-in
users.
The structure of the output for the simulation enables us to take a better look and analyze
what makes a good output for different cohorts of users. The primary reason for using cohorts in
this case is the basic notion that the users might not have the same background knowledge when
it comes to using the simulator and analyzing its results. In this case the primary cohorts would be
network administrators looking at simulation data, with the primary variable of difference being time
(i.e. experience), which gives them different levels of domain knowledge about the types of output
possible for various network topologies.
Taking this into consideration, our approach was to create a simplified user interface that
offers the option to look at more detailed views of the datacenters if the user elects to see them. This
enables us to design primarily for users who are looking to use this dashboard for a quick current-state
look rather than for deep analysis. Also, this approach enables us to close the gap in domain
knowledge between multiple cohorts of users by creating a user interface that focuses on
consistency with other visualizations making it substantially easier to learn and operate.
In the first iteration of this interface our objective was to use a layer approach in order to
modulate the information into different sections. The primary reasoning with this is to simplify
and drastically reduce the amount of information the user has to see by organizing the information
into a clearly divided layout with labeled parts. An aspect of this that was immensely important
was creating an ordered structure in how the user viewed the information. As mentioned in earlier
sections, one of the primary difficulties of creating an effective visualization is overloading the
view with too much information at one time. The best way to reduce this
is to create an ordering of sorts that is consistent with how the user would look for information.
This consistency is a dimension of usability in the sense that it is a subset of an interface that is
easy to learn.
Figure 5.1: One of the first designs for the Output Environment ("Visualization Layer 1"). The idea behind this design was to use the layout to control and create a stricter ordering for how the user views this information.
The layered approach does this by taking a sky to ground view, this meaning that it tries to
show an overview of the entire system first before getting specifically into details about a particular
module or piece. In essence, the goal of this is to give the viewer a summary of what's going on at
a system or a network level and then enable them to dissect any part of that respective
system/network. The primary problem with the first iteration (Figure 5.1) was that,
although a stricter ordering was created, it still presented too much information. The idea behind this was
to use geo-location data at the center and then use the different sections of the page to display
individual datacenter information. If the user selected a datacenter on the center map, then the
information for that particular datacenter would be displayed in one or more of the sections that
are on either side of the map.
Figure 5.2: The main page for the CyberVisual Output Interface, showing the Worldwide Datacenter Overview map with the USA, Mexico, Brazil, England, Germany, China, India, and Australia datacenters, node/datacenter/time selection controls, the "Run Simulation" button, and the Simulation Analysis Overview.
However, this approach minimizes the amount of actual summary or overview that the user
gets on the entire system by enabling more space to be saved for individual datacenters on the side
sections, which left less information to actually show for the overview of the entire system. As
we iterated on the designs we decided to emulate some aspects of the original Excel output by
using a tabbed approach. Using different tabs to show each datacenter in detail we can give an
overview of the entire system within the first tab and break apart and analyze each individual
datacenter under their own individual tab. This is the final approach we used for CyberVisual, and
one that balances information and amount of detail with efficiency and ordering that allows the
viewer to get a more holistic understanding of the entire dataset.
With the CyberVisual interface, the first thing that the user notices is the world map with
points for each datacenter in the network (Figure 5.2). The user can select any one of the datacenters
and they are automatically linked to the correct tab for the specific datacenter. They can also just
select the tabs to access each individual datacenter if they don't want to use the map. The
simplicity of this interface is that it stays well within the 3-click rule that was discussed
earlier: it takes a single click to get to detailed information regarding each individual datacenter.
Further, the data is ordered in an intuitive sequence which enables the user to first get an idea of
the overall system before diving into specifics.
The main page also includes metrics that pertain to the entire network. For example, immediately below the map are two multi-line graphs representing the total input and output data of the network, color-coded by datacenter. These show the scale at which information is passed throughout the entire network; having seen them, the user has a better idea of what to expect from the different datacenters. For example, if a particular datacenter outputs zero data, it is easier to conclude that there is a bottleneck at that location. For an analyst, this provides a simple way to quickly assess the state of each datacenter in the system.
[Figure 5.3 image not reproduced; panels: "Global US Network Push Data" and "Global US Network Pull Data" line charts plotted over time (22:13:00 to 22:14:00).]
Figure 5.3: Global push and pull network charts.
When a user navigates to the page for a particular datacenter, the information is separated into four different graphics. When the user hovers over any of the graphics, a detailed description of the information it represents is shown, which shortens the learning curve, particularly for users who are new to or unfamiliar with the application. The different color schemes of the individual datacenters represent different results of the CAD and VIS applications with respect to time. The top two graphics on each datacenter page show the metrics the user usually cares most about: the number of operations in execution for each application and the number of logged-in users. Quick access to these makes the entire process simpler and more efficient.
[Figure 5.4 image not reproduced; panels include per-datacenter tabs (Brazil, Mexico, China, Australia), the "Simulation Analysis for the England" charts, and a "Logged CAD (Red) vs. VIS App Users" chart.]
Figure 5.4: User and datacenter operations data.
The lower graphs on each datacenter's page represent other operations called by the different applications. In particular, the polar area charts let the user see the magnitude of each operation, while the web (radar) chart on the left lets the user compare the magnitudes of the different operations. The advantage of these chart types is that the user can quickly see when a number is noticeably off in magnitude, which makes potential problems visible much faster. For example, if the correct values for VIS Search and Connect are always lower than 2, it is easy to mark a red circle at the value 2 on the chart so that any operation exceeding that mark is immediately noticeable. The basic premise behind these charts is that they make early and quick detection of problems much easier while placing fewer demands on the domain knowledge of the user.
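To make the thresholding idea concrete, here is a minimal sketch of a polar area chart using a Chart.js 2-style API (the Charts.js library discussed later in this chapter); the operation names, values, canvas id, and the fixed ceiling of 2 are illustrative assumptions rather than the thesis configuration.

// Minimal sketch (Chart.js 2-style API, illustrative data): a polar area chart
// of per-operation magnitudes where slices above the expected ceiling of 2 are
// drawn in red so an out-of-range operation is immediately noticeable.
var opLabels = ['VIS Search', 'VIS Connect', 'CAD Open', 'CAD Update'];
var opValues = [1.4, 3.2, 0.9, 1.1];
new Chart(document.getElementById('ops-chart').getContext('2d'), {
  type: 'polarArea',
  data: {
    labels: opLabels,
    datasets: [{
      data: opValues,
      // Color each slice by whether it crosses the threshold.
      backgroundColor: opValues.map(function (v) {
        return v > 2 ? 'rgba(220, 53, 69, 0.6)' : 'rgba(54, 162, 235, 0.6)';
      })
    }]
  }
});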
[Figure 5.5 image not reproduced; panels show per-operation charts with labels such as Update, Open, Session, Select, and Validate.]
Figure 5.5: Application operations data.
In many ways the CyberVisual system is actually very similar to the Excel sheets output by the simulator. The primary difference is the medium of presentation. The CyberVisual environment is an application that can be accessed by anyone on the web at any time. This presents an immense advantage over the current output, because this medium allows anyone to access the information (given that they have the credentials to do so). Further, the user interface is also optimized for touch screens and mobile devices, making it accessible from many devices and locations.
The CyberVisual output system is built using Node.js, with several different APIs for creating the visualizations. To gather the required information, the system uses AJAX requests to load the file into a JavaScript module that processes it. The module creates numerous arrays that take the information from the different sheets and clean the data. A main function then creates the main/overview page of the user interface, and each of the arrays is passed into sub-procedures that handle the individual tabs for the unique datacenters. The first iteration of the project used D3.js for drawing the different graphs and displaying information; however, after finalizing precisely which visualizations to use, it was scrapped in favor of specific libraries that could create the graphics.
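A simplified sketch of this flow appears below; the endpoint, the assumption that the Excel output has been converted to JSON, and the renderer function names are all hypothetical stand-ins rather than the actual CyberVisual module.

// Hypothetical sketch of the load/clean/render flow described above.
function renderOverview(datacenters) {
  // Placeholder: build the world map and global push/pull charts here.
  console.log('overview for', datacenters.length, 'datacenters');
}

function renderDatacenterTab(dc) {
  // Placeholder: build the four per-datacenter graphics here.
  console.log('tab for', dc.name, 'with', dc.rows.length, 'rows');
}

function loadSimulationOutput(url) {
  var xhr = new XMLHttpRequest();                       // classic AJAX request
  xhr.open('GET', url);
  xhr.onload = function () {
    // Assume the Excel output was exported to JSON: { sheetName: [rows...] }.
    var sheets = JSON.parse(xhr.responseText);
    var datacenters = Object.keys(sheets).map(function (name) {
      // One cleaned array per sheet, dropping empty rows.
      return { name: name, rows: sheets[name].filter(function (r) { return r != null; }) };
    });
    renderOverview(datacenters);                        // main/overview tab first
    datacenters.forEach(renderDatacenterTab);           // then one tab per datacenter
  };
  xhr.send();
}

loadSimulationOutput('/simulation-output.json');        // hypothetical endpoint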
To create the map on the homepage, we use Vectormaps.js to build a vector map from a CSV (Comma Separated Values) file. This also allowed quick customization of the map graphic, letting us plot each individual datacenter. An alternative approach could have been the Google Maps API, which was used for the CyberVisual Input interface. To draw the individual graphs, we used a combination of Highcharts.js and Charts.js. The primary benefit of these two libraries is that they are both lightweight and allow plenty of customization without the learning curve that typically comes with D3.js. Overall, both libraries were efficient in creating powerful and graphically appealing visualizations without taking too much development time, which was immensely important given the iterative design process used in the project.
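As a flavor of how such a chart is configured, here is a minimal Highcharts sketch of a multi-line push-data chart with one colored series per datacenter; the container id, series names, values, and y-axis unit are illustrative assumptions, not the thesis configuration.

// Minimal Highcharts sketch (illustrative data): one colored line per
// datacenter, in the spirit of the "Global US Network Push Data" chart.
new Highcharts.Chart({
  chart: { renderTo: 'push-chart', type: 'line' },   // hypothetical container id
  title: { text: 'Global Network Push Data' },
  xAxis: { categories: ['22:13:00', '22:13:10', '22:13:20', '22:13:30'] },
  yAxis: { title: { text: 'Data pushed (MB)' } },    // unit is an assumption
  series: [
    { name: 'USA Datacenter',     data: [52, 61, 58, 64] },
    { name: 'Brazil Datacenter',  data: [33, 30, 41, 37] },
    { name: 'England Datacenter', data: [47, 44, 49, 51] }
  ]
});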
As with any other system, the CyberVisual Output interface has tradeoffs from being optimized for a specific set of users. By using graphs to show the majority of the information, the user loses the precision of the original system: it is very easy to see the magnitude of performance for a specific application operation, but substantially harder to read off the exact value of different operations. Further, the primary problem with the Charts.js library is that it can lag at load time, especially if the network simulation contains a lot of information. This can be frustrating, particularly when the information is needed quickly. Nonetheless, the system is an improvement for the purpose of introducing the simulation to a wider range of audiences and making it substantially easier to see and access simulation information.
Chapter 6: Further Work
In the previous few sections we have discussed the design trade-offs of both CyberVisual environments and analyzed the development strategies used in both interfaces. With each environment it is easy to see that there is still room for improving and iterating on the designs. Here we present a few ideas for improving and changing how we think about the data and each user interface. Further, we analyze trends in large-scale visualization to suggest ways to apply these patterns to our current environments.
First, let us consider our assumptions about good visualization. In an earlier section we described a good visualization as one that follows and takes advantage of the different dimensions of usability. Specifically, we also noted that good visualizations often guide a user through some ordering of the data, enabling the user to gradually build up information. However, even presenting data in an ordering can be problematic for some users, because the user interface itself (especially for dashboards) can have a steep learning curve.
Consider the Bloomberg terminal dashboard as an example. Its user interface has been described by UX Magazine and by users as "hideous"; however, because of its de facto status as the premier dashboard for the finance industry, the design has not actually changed in years. This presents a dual problem. The first is that the data itself (in this case financial data) is complicated to start with. The second, and the easier problem to solve, is that the user interface and dashboard itself is complicated to learn and understand. The same problem applies to some network dashboards, and it is problematic primarily because it is much more difficult to train people to quickly learn and use the different interfaces. Essentially, solving this problem lets users with little domain knowledge get straight to the primary goal of understanding the dataset, rather than spending a lot of time learning the environment.
6.1: Adaptive User Interface (AUI) Dashboards
An approach that could potentially solve this problem is to create a user interface that gradually changes over time. As increasingly sophisticated frameworks for developing web applications are created, it becomes easier to quickly modify and iterate on a user interface. Further, with the abundance of storage and data sources, it is also becoming easier to collect information about individual users and their usage patterns. For example, by attaching a simple click counter to event listeners on every link in a web application, and saving this information with every session associated with a logged-in user, it becomes easy to see how a specific user navigates through the application. Although this method might not be the most practical to scale, it is one of many ways that user information and usage patterns can be collected.
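A bare-bones sketch of such a click counter is shown below; the storage endpoint and the decision to flush the counts when the session ends are assumptions made for illustration.

// Hypothetical sketch: count clicks per link for the logged-in user's session
// so that later sessions can adapt the interface to the observed patterns.
var clickCounts = {};

document.querySelectorAll('a').forEach(function (link) {
  link.addEventListener('click', function () {
    var key = link.getAttribute('href') || link.id;     // identify the link
    clickCounts[key] = (clickCounts[key] || 0) + 1;
  });
});

// Persist the counts when the session ends; the endpoint name is illustrative.
window.addEventListener('beforeunload', function () {
  navigator.sendBeacon('/api/usage', JSON.stringify(clickCounts));
});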
This is the central idea behind the Adaptive User Interface (AUI) model, which pushes heavily toward the personalization of user interfaces. For example, with respect to our simulation interface, when a user first logs into the dashboard to run a simulation they would be given a very minimalistic user interface to start with. The UI would be simple enough that the user could design a basic network similar to the CyberVisual Input Environment. The main goal would be to get the user up and running as soon as possible. The benefit of the AUI, however, comes in as the user becomes increasingly familiar with the dataset and the user interface.
Gradually, the system notices which options the user tends to select, as well as the usage patterns of the logged-in user, and responds by enabling new areas of the UI that let the user quickly select and modify these options. Over time the user interface can become as complicated as necessary as the user grows from a novice into a power user. By enabling this kind of change in the dashboard, the user interface effectively trains the user on the dataset and the options available to them. This approach can therefore not only save time but also be cost-efficient, since the user would not need training from others to traverse the steep learning curve.
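The adaptation step itself can be sketched as a simple thresholding rule over the recorded counts; the data-widget attribute, selectors, and the threshold of five uses are assumptions rather than a specification.

// Hypothetical adaptation sketch: widgets start hidden and are revealed once
// the stored usage counts show the user reaching for a feature often enough.
var REVEAL_THRESHOLD = 5;   // illustrative: reveal after 5 recorded uses

function adaptInterface(usageCounts) {
  document.querySelectorAll('[data-widget]').forEach(function (widget) {
    var feature = widget.dataset.widget;                // e.g. "select-time"
    var uses = usageCounts[feature] || 0;
    widget.style.display = uses >= REVEAL_THRESHOLD ? 'block' : 'none';
  });
}

// Example: counts previously saved for this logged-in user.
adaptInterface({ 'select-time': 7, 'run-simulation': 2 });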
However, there are a few gaps in this design that should also be covered. When a user starts with a minimalistic design, there are numerous features that are not immediately available to them. Therefore, if a user needs a specific feature, a search bar with auto-fill at the top of the screen would let them select that feature, and a widget would be displayed in a subsection of the screen so the feature can be used. The AUI in this scenario is a Single Page Application (SPA), where new widgets are displayed on the screen as the user needs them; over time these widgets would simply load at startup as the system recognizes the user's viewing and usage patterns. The other concern to address is users who are already power or advanced users and do not want to start with a minimalist design. This case is simple, however, because each user logs in with credentials, so the UI is personalized and saved separately for every user.
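One way to sketch this on-demand escape hatch is an auto-fill search box backed by a small feature registry that mounts the requested widget when selected; the registry contents and element ids below are hypothetical.

// Hypothetical sketch: an auto-fill search box that mounts hidden features on
// demand in a single-page setting. The registry maps feature names to simple
// render functions; real widgets would be far richer.
var featureRegistry = {
  'Select Time':    function (host) { host.textContent = '[time selector widget]'; },
  'Run Simulation': function (host) { host.textContent = '[simulation runner widget]'; }
};

// Assumes <input id="feature-search" list="features">, <datalist id="features">,
// and <div id="widget-area"> exist in the page markup.
var input = document.getElementById('feature-search');
var datalist = document.getElementById('features');
Object.keys(featureRegistry).forEach(function (name) {
  var option = document.createElement('option');        // auto-fill suggestions
  option.value = name;
  datalist.appendChild(option);
});

input.addEventListener('change', function () {
  var render = featureRegistry[input.value];
  if (render) {
    var host = document.createElement('div');            // widget subsection
    document.getElementById('widget-area').appendChild(host);
    render(host);
  }
});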
The AUI design is an ideal next step for the CyberVisual project, primarily because it would let non-experts quickly get to learning and understanding actual network data instead of spending too much time learning the dashboard and user interface. It also efficiently targets all cohorts of users, regardless of experience or other variables, by creating an efficient way to personalize and optimize for each individual user. The primary tradeoff of this approach is that it is time-consuming: creating an adaptive user interface can take a lot of development and testing time, and it could also be more costly to scale to a massive number of users. Finally, as with all types of AUIs, the adaptation process requires a good understanding of the user's usage patterns and intentions. Without accurately understanding these aspects, the adaptation might be detrimental to the user, as it could steer away from the user's goals and use cases.
Chapter 7: Conclusion
In this section I discuss the overall rationale and objectives of the CyberVisual project. CyberVisual represents an iterative approach to improving dashboards and visualizations by using the same principles used to design user interfaces in many of today's web applications. This approach is beneficial because it avoids the common pitfalls of most ineffective visualizations. By creating visualizations that follow the different dimensions of usability, we can ensure, for example, that our visualizations are not only familiar but also easy to learn for most groups of users, regardless of their diverse backgrounds. Further, by using common design principles found in today's web sites, such as the 3-click rule, we can improve the efficiency of dashboards and let the user get to their query's result much faster.
With the input environment for CyberVisual, we presented different ways to show information while creating an interface primed for interaction. Further, we explored different modes of input for the same dataset to best optimize how we begin the process of mapping a visualization set to a data outcome. The biggest takeaway was to create a user interface that aims for, and enables, efficiency for the user. To make this a reality, we explored associations between information visualizations and data maps to create an easy way to build network topologies. Further, we created a system to translate this information into custom input files for the simulation system.
By creating the output environment for CyberVisual, we explored the other side of the visualization picture: the mapping of data into visualization. Here we presented novel ways to show information using data visualizations such as polar and radar graphs and various other types of charts. We also explored information visualization techniques, such as parallel coordinates, to show global output information. Further, we examined why current visualizations within dashboards are ineffective by analyzing our user interface against the different dimensions of usability mentioned above, and discussed methods for effectively mapping data to visualization.