
A Tool for Elastic Architectures

A tool for the modeling and deployment of elastic architectures
Thomas Girard
thomas.girard@epfl.ch
18 bis Avenue de France
1003 Lausanne
Switzerland
School of Computer and Communication Sciences
Master’s Project
Winter 2012
Supervisor
Christian Gasser
ELCA
Responsible
Prof. André Schiper
EPFL / LSR
Abstract
Designing large applications is a complex task, and making these
applications tolerant to failures and scalable is even more complex. The
deployment and operational costs of such applications are significant.
The Platform as a Service model of cloud computing can help reduce
those costs by providing simplified deployment mechanisms, high level
monitoring capabilities and elasticity to existing applications.
However, the scope of features provided by such platforms does not
extend to whole systems, but merely to single applications. Even when
solutions are found to solve this problem, they cannot easily be used
on different cloud providers, since they are not standardized.
This project introduces a runtime application that bridges the gap
between Platform as a Service and complex, loosely-coupled applications. This runtime relies on a modeling language used to describe
application systems, rules to deploy them and ensure their correct operation. Applications are deployed on the cloud, and the runtime uses
a monitoring and supervision mechanism to ensure that they function
according to their model. Features whose scope is wider than a single
application, such as elasticity, can be achieved using the runtime even
when the applications and the platform do not explicitly implement
them.
Contents

1 Introduction 4
1.1 Large Scale Applications . . . . . . . . . . . . . . . . . . . . 4
1.1.1 The case of Secutix . . . . . . . . . . . . . . . . . . . . . 7
1.2 Cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Infrastructure as a Service . . . . . . . . . . . . . . . . . 9
1.2.2 Platform as a Service . . . . . . . . . . . . . . . . . . . . 9
1.3 Is cloud computing good enough for systems? . . . . . . . . . . . 10

2 State of the art 12
2.1 Cloud Foundry . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Experience with the platform . . . . . . . . . . . . . . . . . 15
2.2 Other platforms . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Invasiveness . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Architecture 18
3.1 Modeling applications and their deployment . . . . . . . . . . . 18
3.1.1 The problem domain: SOA . . . . . . . . . . . . . . . . . . . 19
3.1.2 The target domain: Platform as a Service constraints . . . . . 21
3.1.3 Model extensibility . . . . . . . . . . . . . . . . . . . . . 22
3.2 The meta-model . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Deployment process . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Continuous deployment . . . . . . . . . . . . . . . . . . . . 28
3.4 Runtime scalability and fault tolerance . . . . . . . . . . . . . 29

4 Implementation 30
4.1 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Model and DSL implementation . . . . . . . . . . . . . . . . . 31
4.1.2 Core components and deployment process . . . . . . . . . . . . 32
4.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Distributed Event Manager . . . . . . . . . . . . . . . . . . 35
4.2.2 Replication and Fault-tolerance of the Runtime . . . . . . . . 39
4.3 Plugin system . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Web frontend . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Experiments 43
5.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 Application performance, scalability . . . . . . . . . . . . . 44
5.3.2 Elasticity . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Conclusion 49
1
1.1
Introduction
Large Scale Applications
Software has become a central part of many businesses today. Some rely on
it for their internal processes; others make their revenue by selling software
as a service. As their importance increases, the scope of applications tends
to expand. This increase in functionality makes applications more complex,
composed of multiple logical units that communicate with each other. The
number of users of these applications can also grow very large, which means
that performance is more than ever a relevant issue.
The term application is generally used to describe a single piece of software, but this term is too simplistic to describe modern applications, which
are actually systems in which multiple smaller applications work together.
The term Service Oriented Architecture has been coined to describe such
systems. The main idea behind SOA is to divide problems into small focused tasks and implement each task as a service that is independent from
the rest of the system. Each service should be reusable and provide a clean
interface for its users. Once the required services are available, building an
application amounts to connecting them together in a meaningful way.
Even though every system is different, most share concerns that are
orthogonal to the business problem they target. It does not matter whether
an application is some kind of banking software or a social network; in
both cases it is important for it to be scalable, fault tolerant, easy to manage, etc. These non-functional requirements have a significant cost both in
terms of development and in terms of maintenance during the operational
phase of the application. It is also desirable that those concerns are separated from business logic. But because these features intervene in many
locations in software, achieving such separation is neither obvious nor easy.
Any required service may also need its configuration to be managed. System administrators must ensure that the application and associated services
function correctly at all times. All of this generates significant costs.
Furthermore, since these systems are large, it quickly becomes unfeasible
to develop them as a single monolithic deliverable. Software development is often approached using an Agile methodology. These methods put
emphasis on an iterative and incremental development process. Typically,
the different services of an application are specified as well as their interfaces. Once this is done, it is possible to develop each service independently
from the others, as long as the specified interface is respected. This is not
always the case in practice, because bugs or incomplete and misunderstood
specifications may lead to failures and regressions. Processes, such as continuous integration, must be created to ensure that updating an application
does not cause problems with other applications that are not updated at the
same time. This takes resources and is prone to failure when the process is
not automated.
Solving those problems can be made easier by using frameworks or middleware designed for this specific task. Looking at the big picture, a modern
web application relies on many components: server-side network infrastructure (e.g. a load balancer), operating system, runtime (e.g. a Java virtual
machine), application server (e.g. JBoss), web server (e.g. Tomcat). Furthermore, these applications may require one or more frameworks or libraries
(database connectors, external monitoring libraries, persistence APIs, etc.).
They may also communicate with one or more services (databases or queues
for example) that are present on the same or other physical servers.
Figure 1: A layered view of web applications. In grey are the infrastructure
layers, in blue the platform layers, and in red the third-party components of
the application.
All of these components form an architecture in which each layer provides abstractions that, in theory, should let users forget about the layers
underneath. Figure 1 gives a view of those layers. In an ideal world, developers would appreciate being able to focus only on upper layers, those that
are close to their application. But this is not always possible because the abstraction provided by each layer is rarely perfect and some details from the
underlying layers may leak. The Java Virtual Machine, for example, aims
to make Java programs cross-platform ("write once, run anywhere"), which
means giving an abstraction of the operating system. This abstraction is
not perfect though, and OS-specific code sometimes has to be written. Furthermore while each new layer added to the system may ease part of the
deployment and development, the layers themselves must also be managed.
The next paragraphs examine some of the non-functional requirements
that arise in large modern applications.
High availability and fault tolerance refer to the ability of a system to
reliably serve requests, even in the presence of failures. Such systems cannot have single points of failure, and consequently components are usually
replicated and/or distributed. This requires complex synchronization algorithms. Failures can be caused by software or hardware, but also by
network problems or natural disasters. For these reasons, it is common to
distribute systems over multiple physical locations, which introduces additional difficulties in the way the parts of the system communicate. Latency
gets higher and bandwidth may be restricted, making it even harder to satisfy non-functional requirements. High availability also means that system
updates should not require a full halt of the system, which means that deployment of new applications and updates have to be done live. This last
requirement is, however, not too difficult to satisfy when the application is
already replicated.
Scalability is the ability of a software system to support high loads if
provided with enough resources. Roughly speaking, a system is scalable if its ability to accomplish its task increases almost linearly with the
amount of computing power available. Scaling up is crucial because it means
that the application is able to support higher loads, whenever more clients
start using the system. Non-distributed applications suffer from bottlenecks (mainly disk, memory and processor speed) that only technological
advances in hardware can overcome. For this reason scalable applications
are generally distributed applications. Building distributed applications is
harder than building non-distributed ones, and deploying them takes more
work than deploying non-distributed systems.
Elasticity is a concept somewhat related to scalability [2]. It is the ability
of an application to dynamically adjust its resource consumption depending
on a varying amount of work that is to be executed. Most services have
a varying number of users in the course of a day (e.g. high load during
office hours, and almost no users during night time). An elastic system
automatically requests computing resources during peak hours, and releases
them when the load decreases. A system that is not elastic would either
require manual intervention to achieve the same effect, or would require the
maximum amount of resources at all times. This ability to scale systems,
and even better to scale them automatically based on usage was one of the
main ideas that led to the creation of Amazon Web Services.
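As a rough illustration, an elasticity rule of this kind can be sketched as a simple threshold policy. The function names below (`get_load`, `add_instance`, `remove_instance`) are hypothetical placeholders for a provider's monitoring and provisioning API, not part of any real platform.

```python
# Illustrative threshold-based elasticity rule. All callbacks are
# hypothetical stand-ins for a cloud provider's monitoring/provisioning API.

def autoscale(get_load, add_instance, remove_instance, instances,
              high=0.75, low=0.25, min_instances=1):
    """Adjust the number of instances based on the average load in [0, 1].

    Returns the new instance count after at most one scaling action.
    """
    load = get_load()
    if load > high:
        add_instance()          # peak hours: request more resources
        return instances + 1
    if load < low and instances > min_instances:
        remove_instance()       # low load: release resources
        return instances - 1
    return instances            # load is in the comfort zone
```

Run periodically, such a rule grows the system during peak hours and shrinks it at night, which is exactly the behaviour a non-elastic system would need manual intervention to achieve.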
6
Multi-tenancy is the ability of a single installation of a system to serve
multiple customers (or tenants). For example, an accounting system may
be sold as a service by the software vendor to multiple businesses. All of
those businesses actually connect to the same system to use the service,
but have the impression that each of them owns a single separate instance.
Multi-tenancy reduces deployment complexity and operational costs, since
the vendor has only one system to manage even though there are multiple
clients. However, multi-tenant applications are harder to develop since the
system must provide isolation between tenants.
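The isolation requirement can be illustrated with a minimal sketch in which every read and write is scoped by a tenant identifier; the in-memory store below is a hypothetical stand-in for a real shared database.

```python
# Minimal sketch of tenant isolation in a shared store: every row carries
# a tenant_id, and every query is filtered by it, so one tenant can never
# observe another tenant's data. Names are illustrative.

class TenantStore:
    def __init__(self):
        self._rows = []  # shared storage for all tenants

    def insert(self, tenant_id, row):
        # every write is tagged with the owning tenant
        self._rows.append({**row, "tenant_id": tenant_id})

    def query(self, tenant_id):
        # every read is scoped: only the caller's rows are visible
        return [r for r in self._rows if r["tenant_id"] == tenant_id]
```

In a real multi-tenant application the same scoping must be enforced on every access path, which is what makes such systems harder to develop.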
Data persistence and consistency: most applications manipulate critical
data. Losing some of this data is not acceptable, and software engineers
must ensure that the data is persisted: stored safely and recoverable in case
of failure. This often involves, once again, replication. Another requirement is that the data should be consistent (with respect to business rules).
Achieving consistency often involves the use of transactions. These requirements are usually fulfilled by relying on a database management system
(DBMS), but managing and interfacing those DBMS is not straightforward.
Monitoring a system consists of observing it and collecting data about performance and events that occur. Monitoring is crucial for two reasons. First,
most systems today are not fully automated and may require some human
intervention in case of incidents. Having a good monitoring system reduces
the delay between failure and recovery, as it helps identify problems and
their causes. Secondly, monitoring is a good way to detect bottlenecks affecting systems and can be used to improve performance. It can be interesting
to monitor over multiple layers, from the hardware up to the applications
themselves, and having a central point to access monitoring information
makes it easier to have a sensible view of the system as a whole. It may
however be complex to implement.
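As a minimal illustration of that central access point, components from several layers could push samples into a single registry; all names below are hypothetical, not a real monitoring API.

```python
# Sketch of centralized monitoring: components from any layer (hardware,
# platform, application) report samples to one registry, which then offers
# a single view of the system. Names are illustrative.

import time

class MonitoringRegistry:
    def __init__(self):
        self._samples = []  # (timestamp, layer, metric, value)

    def report(self, layer, metric, value):
        self._samples.append((time.time(), layer, metric, value))

    def latest(self, metric):
        """Most recently reported value of a metric, across all layers."""
        matching = [s for s in self._samples if s[2] == metric]
        return matching[-1][3] if matching else None
```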
1.1.1
The case of Secutix
Secutix [1] is an application developed by ELCA Informatique [2] for professionals working in the entertainment and events
industry. It is used to manage sales (of tickets or goods) as well as customer
relationships. Businesses in multiple countries use Secutix, such
as the Swiss Federal Railways (in order to sell train tickets), the Stade de
France, or the Paléo Festival.
[1] http://www.secutix.com
[2] http://www.elca.ch
Secutix is Software as a Service (SaaS): customers get to use Secutix and
customize their instance of the software as they see fit, but ELCA manages
the hardware and application to ensure that it works correctly. Secutix fits
the description of complex Service Oriented Architecture described above.
It involves more than a hundred different services. It must be able to scale
(and quickly: when the Paléo Festival opens its ticket sales, tens of thousands
of tickets are sold in a few hours). As the load later decreases, elasticity
would be a benefit.
The system is multi-tenant; each client does not get its own full deployment of the Secutix platform. It has to be fault tolerant, consistent and
secure. In summary, Secutix must meet all the non-functional requirements
previously described.
Managing Secutix, however, is not an easy task. It involves monitoring
many servers and infrastructural components, and updating applications whenever new versions are made available by the multiple development teams,
while still ensuring that the applications do not create conflicts. The overall
cost of ownership is significant, and making these tasks easier would
be of significant interest to ELCA.
1.2
Cloud computing
Cloud computing has been a trendy topic in the IT community for almost
a decade. Amazon was one of the first well known players in this market,
unveiling their Amazon Web Services brand in 2002. While the term cloud
computing is often used as a marketing buzzword, it has many interesting
underlying concepts. The basic idea behind cloud computing is to provide layers of abstraction in client-server architectures. The cloud provider
assumes the task of providing and maintaining some of the layers described
previously. The customer uses those layers and adds on top of them the additional resources required to run their application, including the application
itself.
If, for example, a cloud provider takes care of the infrastructure, the task of managing the hardware and the network is effectively lifted
from the customer. This model is called Infrastructure as a Service, and it
is only one way to provide cloud services. Because there are many layers,
there are multiple kinds of cloud computing models; the next sections discuss two of the most common ones. Note that what is described here is not
an exhaustive view of what cloud computing is: only a small subset of cloud
computing concepts are relevant in the context of this project.
1.2.1
Infrastructure as a Service
An Infrastructure as a Service (or IaaS) provider usually lets customers use
servers (or more generally computing resources) with a time based pricing
model. One can simply choose the desired amount of resources (memory,
processors) and start deploying the needed environment (operating system
and further layers). To make things easier, virtual images may be provided
which contain pre-configured operating systems, frameworks and runtimes,
but in the end the management of these layers must be assumed by the
customer. The layers abstracted by IaaS are in grey in Figure 1.
This is the kind of service offered by Amazon Web Services, in particular
with their Elastic Compute Cloud (EC2) platform. It makes life easier both
for small businesses, which do not have to buy and maintain hardware or
network equipment, and for larger businesses which do not have to increase
the size of their IT departments. This allows for significant savings:
because the task of managing hardware for many businesses is now handled
by a single provider, economies of scale are possible.
One of the other main benefits of IaaS is that it enables easy scalability:
adding or removing resources to a system can be done nearly instantly. It
can also be done programmatically, which means that elasticity becomes
feasible more easily. The infrastructure sold to clients can also be more
homogeneous, partly thanks to the use of virtualization technologies. High
availability is also made easier, because the IaaS provider has a larger
pool of resources to ensure it. The cost of spreading the infrastructure
over multiple locations is split across all the customers.
1.2.2
Platform as a Service
The level of abstraction provided by Platform as a Service (or PaaS) is
higher than that of IaaS. Whereas IaaS only manages the hardware (and
sometimes services), PaaS takes things further by providing the operating
system and runtime environment. These additional layers are shown in blue
in Figure 1. When deploying applications on a Platform as a Service, the
customer does not have to set up anything but their own application and
the PaaS configuration.
PaaS providers sometimes rely on an IaaS and extend it by providing
access to servers with preinstalled and preconfigured environments. The use
of the platform is often made easier by letting users deploy and manage their
software using a web console. The choice of runtime can be made very easily,
for example by ticking options on a web page. A single click can be enough
to have an application running and accessible to clients in seconds. Each
PaaS provider tends to have its own way of doing things, and there exists no
standard describing platforms or their interfaces. This can make switching
from one provider to another problematic.
PaaS is generally strongly web-oriented. Most of the PaaS offerings
currently on the market only allow the deployment and management of
web applications. This can, in some cases, be a limitation in the kinds of
applications PaaS can be useful for.
Enterprise applications often rely on services such as databases or communication middleware. The notion of platform generally encompasses
services. PaaS providers allow customers to request instances of services
and bind those instances to their application. Some platforms are able to
inject part of the configuration relevant to services in the application directly. Cloud Foundry for example is able to inject database access objects
into Spring applications.
Concerning monitoring, it is unclear whether the use of PaaS is beneficial
overall. On the one hand, monitoring capabilities may be included by the
PaaS, which is helpful. On the other hand, because the platform hides
some of the layers from the user, it becomes unfeasible to add custom-made
monitoring on top of the PaaS abstraction. For example it would not be
possible, in Cloud Foundry, to use the Java Management Extensions (JMX),
the monitoring API provided by the Java Virtual Machine.
Some PaaS offerings are examined in more detail in section 2 of this
report. One of the things to note is the fact that PaaS offers a somewhat restrictive paradigm to customers. The fact that some platforms only manage
web applications was already mentioned. Another example of restriction is
that platforms may only be able to provide a small set of environments and
services. The reason for this is that the platform has to be able to automatically configure a considerable number of layers. As such, adding new
environments or services takes work and requires specific knowledge. Other restrictions, which are examined later in detail, make PaaS a
very useful tool, but one that cannot fit all needs.
1.3
Is cloud computing good enough for systems?
Both IaaS and PaaS provide features that can make the task of deploying
SOAs easier. In particular, PaaS alleviates most management problems,
but comes at the cost of new constraints. Non-functional requirements such
as elasticity can be achieved more easily using PaaS. It is wrong however
to think that simply putting an application in the cloud instantly makes
it elastic or scalable. Applications still have to be conceived and deployed
properly for this to work.
Consider for example scalability. Replication is a good means of achieving
scalability because new replicas (or instances) of the application can be
added when needed. Replication requires that all instances are able to work
together to keep a consistent state for the application. For example in
a banking system, the balance of each account must be the same in all
instances. This means that the state of the application must be properly
shared between all instances. This can be achieved, for example, by making
the application stateless and storing the state in a shared database (but in
this case the database could become the scalability bottleneck).
PaaS helps with scalability by simplifying the following tasks:
• providing and managing the database,
• adding new instances of the application to the system,
• taking care of load balancing between instances.
What PaaS cannot do, however, is ensure the consistency of the state. This
means that the application must still be designed with replication and scalability in mind. Cloud computing is not a cure-all that can take a non-distributed application and magically make it scalable.
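The stateless-replica pattern mentioned above can be sketched as follows; a Python dict stands in for the shared database, and all names are illustrative, not any platform's API.

```python
# Sketch of the stateless-replica pattern: every instance keeps no state
# of its own and works against a shared store, so any replica can serve
# any request and all replicas see the same data (here, account balances,
# echoing the banking example above).

class SharedStore:
    """Stand-in for a shared database holding the application state."""
    def __init__(self):
        self.balances = {}

class StatelessInstance:
    """An application replica with no local state."""
    def __init__(self, store):
        self.store = store

    def deposit(self, account, amount):
        # read-modify-write against the shared store
        new_balance = self.store.balances.get(account, 0) + amount
        self.store.balances[account] = new_balance
        return new_balance
```

Because both replicas below share one store, a deposit handled by one instance is immediately visible to the other, which is what lets a load balancer add or remove replicas freely (at the cost of the store becoming the potential bottleneck noted above).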
PaaS does not consider entities on a scale greater than that of a single
application. This can be problematic because SOAs only work properly when
consistency is achieved on a global scope. The fact that each application in
a system is correct in itself does not guarantee that the system as a whole
works. For any system there is a set of rules and constraints that must be
satisfied. For example whenever an application fails, some actions could be
necessary to prevent further failures in other applications. This means that
some events or behavior in one application may affect multiple applications
in the system. Such behaviors cannot be enforced in platforms that only let
users manage things at the scope of a single application.
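Such a system-wide rule can be sketched as follows: given the dependency relationships between the applications of a system, a failure in one application identifies every (transitively) dependent application that may need attention. The dependency representation and the reaction are hypothetical illustrations, not part of any PaaS API.

```python
# Illustrative system-scope failure rule: compute which applications are
# affected when one fails, by walking the dependency graph transitively.
# A single-application PaaS cannot express this kind of cross-application
# behavior.

def react_to_failure(failed_app, dependencies):
    """Return the set of applications that (transitively) depend on
    failed_app and may therefore need a corrective action.

    dependencies maps each application to the set of apps it depends on.
    """
    affected, frontier = set(), [failed_app]
    while frontier:
        current = frontier.pop()
        for app, deps in dependencies.items():
            if current in deps and app not in affected:
                affected.add(app)
                frontier.append(app)
    return affected
```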
Additionally, the notion of “PaaS paradigm” is not actually well defined.
There is no standard defining what Platform as a Service is. Each PaaS provider
has its own model of how things run, and these models do not rely on an
interoperable standard. One application may work as-is on one platform,
and require changes on another. There are also examples where the PaaS
abstraction is leaky: some of the technicalities of the underlying layers still
have to be taken into account by developers.
The aim of the project is to improve this state of affairs by providing a
mechanism that better adapts the benefits of PaaS to SOAs. The aim is not
to create a new PaaS, but an application (the runtime) that relies on existing
providers and augments the available features to alleviate the shortcomings
stated above. One subgoal is for this runtime to be provider agnostic, and
thus provide a way to easily migrate from one PaaS provider to another.
Note that while the term runtime is used to describe the application developed in this project, it should not be confused with platform runtimes such
as the JVM or a Ruby runtime.
2
State of the art
There is a large number of cloud computing solutions providing varied
sets of features. Since there is no strict definition of what cloud computing
is, many different approaches exist and there is no one-size-fits-all solution.
This section examines a few cloud computing offerings and analyzes the
commonalities between them. The main differences are examined as well
and in particular this report tries to determine which parts of the problems
described in the introduction already have interesting solutions and what
these solutions are.
Cloud computing, and more specifically Platform as a Service, is a relatively new topic. Consequently, players in the field tend to appear, evolve
and vanish quite quickly. Many large companies (VMWare, Red Hat,
Google, Microsoft) have only begun rolling out their offerings less than 5
years ago. A lot of things happened during the six months this project took
place. It has been interesting to see other companies pursue goals similar
to what is described in this report. Some of these companies announced
features that come close to what was developed in this project, which comes
as a form of validation of the work done.
2.1
Cloud Foundry
Cloud Foundry (CF) is an open source Platform as a Service project by
VMWare. It is a rather new product, with a first release in 2011, but has
received significant support from users and the industry. There are actually
two sides to the CF project: the open source platform [3], and a commercial
offering by VMWare [4]. At the time of writing this report, the commercial
offering is not yet publicly available.
[3] http://cloudfoundry.org
[4] http://cloudfoundry.com
Figure 2: Cloud Foundry as a bridge between runtime/frameworks, services
and infrastructure.
The platform is developed in Ruby but is able to deploy applications
running on the Java Virtual Machine (including Java EE, Spring, Scala/Lift
applications and more), Ruby frameworks, Node.js and more. Third parties
provide support for more platforms, which is possible thanks to the open
source nature of the project. AppFog, for example, adds PHP support to the
platform. This support is provided at two distinct levels: the runtime level
(Java Virtual Machine, Ruby interpreter) and the framework level (Spring,
Ruby on Rails). This is the top side of the triangle in Figure 2 (image taken
from http://blog.cloudfoundry.com/post/4613634293/).
The project includes support for a number of services including databases
(MySQL, Postgres, etc.), messaging (RabbitMQ) and key-value stores (Redis). As with the frameworks and runtimes, support for other services can
be developed by third parties. This is the left side of the triangle.
Finally the right side is infrastructure independence. Cloud Foundry
provides an API that enables users to deploy their applications on multiple
clouds without needing to perform any change in their code. This is an
interesting step towards standardisation of cloud platforms.
2.1.1
Architecture
At the beginning of this project, the open source nature and versatility of
Cloud Foundry made it an interesting target platform to develop for. The
choice was made to use Cloud Foundry to develop and deploy the project,
which means that the implementation choices discussed in this report were
influenced by the workings and development of Cloud Foundry. This
section gives an overview of the Cloud Foundry internals. More details can
be found in [8].
The brain of Cloud Foundry is the cloud controller. This component
exposes the interface to the cloud. In particular this means that it will
receive user applications and re-package them to enable their deployment
in the platform. The main goal of this repackaging is to wrap the application in a CF-specific environment that the other platform components
can manipulate. The cloud controller will provision services and request
the deployment of applications. It also updates the configuration files of
the application so that they are properly bound to requested services. The
repackaged applications are called droplets.
The second most important component is the DEA (Droplet Execution
Agent). DEAs advertise their features (amount of RAM available, supported
frameworks, etc.) to the controller. Whenever an application instance must
be deployed, the controller contacts the DEAs with the request, and DEAs
that are able to handle the application will volunteer to do so. Adding more
DEAs to a system means that more applications can be deployed.
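The advertise-and-volunteer placement described above can be sketched roughly as follows; the attributes and the selection policy are simplified illustrations, not Cloud Foundry's actual protocol or message format.

```python
# Rough sketch of DEA placement: each DEA advertises its capabilities
# (here just RAM and supported frameworks), and a deployment request is
# handed to a DEA that can run the application. Simplified illustration.

class DEA:
    def __init__(self, name, ram_mb, frameworks):
        self.name, self.ram_mb, self.frameworks = name, ram_mb, frameworks

    def can_run(self, app):
        # a DEA volunteers only if it has the resources and the framework
        return (app["ram_mb"] <= self.ram_mb
                and app["framework"] in self.frameworks)

def place(app, deas):
    """Return the name of a DEA volunteering to run the app, or None."""
    volunteers = [d for d in deas if d.can_run(app)]
    return volunteers[0].name if volunteers else None
```

Adding more DEA objects to the pool directly increases how many applications can be placed, mirroring the horizontal growth described above.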
Service nodes manage CF provisioned services such as databases.
The health manager ensures that deployed applications and services run
correctly. It will monitor deployed applications and verify whether they
are running. When a crash is detected, the health manager tells the cloud
controller, which has the task of fixing the problem. The health manager is
an important component because it makes the platform resistant to failures.
The router does what its name implies: it receives requests for a given
application, and forwards it to the right instance in the cloud.
Applications in Cloud Foundry are bound to domain names. The router ensures that those names always point to an application instance, even when
instances are added, removed or moved. The router also provides load balancing, since when more than one instance of an application is running, the
router will dispatch requests to all instances evenly. The routing mechanism
uses what is called sticky sessions: once a user request has been routed to a
particular instance, all further requests by this user are routed to the same
instance. Sticky sessions ensure that temporary stateful data belonging to
one user is available for the duration of this user’s session.
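The sticky-session behaviour described above can be sketched as follows; this is a simplified illustration of the idea, not the actual Cloud Foundry router implementation.

```python
# Sketch of sticky-session routing: the first request of a session is
# load-balanced round-robin across instances, and the chosen instance is
# remembered so that all later requests of that session land on it.

class StickyRouter:
    def __init__(self, instances):
        self.instances = instances
        self.sessions = {}   # session_id -> pinned instance
        self._next = 0       # round-robin counter for new sessions

    def route(self, session_id):
        if session_id not in self.sessions:
            # new session: pick the next instance in round-robin order
            chosen = self.instances[self._next % len(self.instances)]
            self.sessions[session_id] = chosen
            self._next += 1
        return self.sessions[session_id]
```

The session table is what keeps a user's temporary stateful data reachable for the whole session, at the price of less even load distribution when sessions are long-lived.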
All components in the Cloud Foundry architecture can be replicated,
which ensures that there is no single point of failure or bottleneck in the
system.
2.1.2
Experience with the platform
After six months of using Cloud Foundry, it has become apparent that the platform is still not very mature. Documentation is lacking
and disorganized. Furthermore, the public excitement surrounding Cloud
Foundry that was perceptible at the beginning of this project seems to have
partially faded. In addition, the slow evolution of CF contrasts with the
very fast progress made by other providers in the market.
It is a difficult task to build a reliable private cloud based on Cloud
Foundry. A lot of time was spent during the project setting up such a cloud
for testing the developed runtime. CF does not work out of the box. When it
fails, error messages are difficult to understand. Official support is slow and
does not always help. Community support is almost nonexistent. Even third
parties such as RightScale, which claims to provide easy installation of Cloud
Foundry on Amazon Web Services, cannot ensure a smooth experience.
When the platform works well it is very efficient and pleasant to use.
Sadly it seldom works well enough.
2.2
Other platforms
This section does not aim to give a comprehensive view of all the platforms
and features on the market. Only the most interesting features are discussed
here. Note that the platforms were not actually tested, but only reviewed
based on their own documentation and third-party reviews. Actually testing
them would have been a time consuming and potentially costly process.
Examining and testing all available offerings would be a long task worthy of
a report of its own.
Here are a few of the most important platforms that were examined:
• Cloud Foundry
• Microsoft Windows Azure
• Heroku
• Google App Engine
• Amazon Web Services
Note that AWS is not strictly a PaaS provider but an IaaS provider. However, tools such as RightScale tend to bridge the gap between the two concepts. Furthermore, AWS offers a wide range of services in addition to its infrastructure (such as S3 for storage or SQS for messaging). These services bring AWS closer to the PaaS paradigm.
It is difficult to compare the providers based on pricing; the exercise can even be considered meaningless given the variable nature of the proposed services. All providers adopt usage-based billing. Services are billed based on usage as well, but the usage metric varies (bandwidth, data storage, number of requests). Cloud Foundry also stands out from the other providers in this regard: there is currently no commercial offering available.
Language and framework support: Java is supported on many platforms, and generally any JVM language (e.g. Scala) will also be supported. All of the platforms mentioned above support it. Ruby (and more specifically the Ruby on Rails web framework) is also well represented (only Google App Engine does not support it), which is not surprising given the rising popularity of Ruby on Rails. Microsoft's Windows Azure is the only platform supporting the C# language, but upcoming support from Cloud Foundry has been announced. Languages such as JavaScript (through the Node.js framework), PHP and Python are also supported by a few platforms. In the context of this report, Java was the most important language to consider.
Note that the same language may not be equally supported on all providers. Cloud Foundry, for example, supports not only Java but also the Spring Framework. This support for Spring is one of the most valuable features of CF: it allows deployment to the cloud with strictly no configuration changes. By contrast, Google App Engine's Java support is restricted: using sockets, for example, is not allowed.
Features: one of the most advertised features among the platforms is scalability. All providers offer some mechanism that makes scaling easy. Most of the time, however, scaling is not automated, and new instances must be requested manually. Whenever the process is automated, it relies on monitoring metrics that the platform provides (resource usage, application performance). It is unclear whether any of the providers offers a simple mechanism to make one application elastic based on the metrics collected on another.
From this stems a general observation about the platforms: most of them assume that customers will only deploy a single application. The concept of multiple interdependent applications is not approached in tutorials or in the platforms' presentations. This seems to be a limitation of these platforms.
Services: all of the examined providers offer some kind of service. Databases and messaging services are available on all of them, but in different flavors.
Cloud Foundry and Heroku provide well known options (MySQL, Postgres).
Amazon Web Services provides its own SimpleDB and SQS, but its IaaS
nature technically allows the deployment of any application. Google App
Engine provides a datastore that is not actually SQL compliant.
Heroku has the most comprehensive collection of services (which it calls add-ons), with more than 50 available. These add-ons range from the most generic ones to exotic offerings such as video encoding. In this regard, the add-ons provided by Heroku come close to the notion of service as defined in the context of SOA in section 1.1.
2.2.1 Invasiveness
In general, the providers can be split into two categories: invasive and non-invasive ones. Invasive providers, such as Google App Engine or Amazon Web Services, provide an API for their services. Applications that are to be deployed on those platforms have to use the API in order to benefit from the features. A direct effect of this is that the application becomes coupled to the API and thus to the platform provider. The services provided by the API are proprietary too. For example, Amazon Simple Queue Service (SQS) provides messaging capabilities. Contrast this with Cloud Foundry's non-invasive model, in which RabbitMQ is currently the only available messaging service. RabbitMQ is a third-party service that is not only available on Cloud Foundry. Furthermore, using RabbitMQ in Cloud Foundry does not tie applications to Cloud Foundry, only to RabbitMQ itself.
Non-invasive providers are more portable than invasive ones. In general, however, platforms, invasive or not, do not share common standards. Invasive providers tend to have more services, and those services are more customized. The paradigm offered by those providers can be described more as an ecosystem than as a platform. In general there seems to be a trade-off between invasiveness and the amount of functionality offered. Cloud Foundry, which is the most open of the platforms that were examined, lacks many features the other platforms provide.
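The coupling effect can be illustrated with a sketch. An application that talks to messaging only through a neutral interface confines provider-specific code to adapters; the interface and adapter names below are invented, and the adapter bodies merely indicate where real SDK calls would go:

```java
/** Illustration of the coupling trade-off: the application depends only on a
 *  neutral interface; provider-specific code is confined to adapters. */
interface MessageQueue {
    void send(String message);
}

/** Adapter that would wrap a proprietary API such as Amazon SQS. */
class ProprietaryQueueAdapter implements MessageQueue {
    public void send(String message) {
        // would call the provider SDK here
    }
}

/** Adapter that would wrap a portable broker such as RabbitMQ (AMQP). */
class PortableQueueAdapter implements MessageQueue {
    public void send(String message) {
        // would call an AMQP client here
    }
}

/** Application code: knows nothing about the chosen provider. */
public class OrderPublisher {
    private final MessageQueue queue;
    public OrderPublisher(MessageQueue queue) { this.queue = queue; }
    public void publish(String order) { queue.send(order); }
}
```

With the invasive approach, calls to the provider SDK would instead be scattered through the application code, which is what ties it to the platform.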
3 Architecture
The project is composed of two main parts: a modeling language to describe SOA in the cloud, and an application to deploy and monitor systems based on such models. The features of the project and the main concepts underlying them are discussed in this section. Three main axes are examined: the modeling of enterprise applications and their deployment, the deployment process itself, and finally the monitoring of the deployed system.
While each of these axes follows logically from the previous one, they can also work separately. For example, the monitoring tool can be used on any system, not necessarily one that is deployed using the runtime. While this was not a design requirement, the tools are designed not to be tightly coupled to one another and to be generic enough to work on their own.
3.1 Modeling applications and their deployment
The runtime is able to deploy applications of arbitrary complexity which are composed of a large set of different entities. In order to describe such systems and handle them inside the application, users describe their applications using models. A meta-model was built to define the boundaries of such models. Meta-models are a description of entities, relations, attributes and their semantics [5]. Meta-models do not define syntax or implementation. The created meta-model is then implemented in the form of a domain-specific language (DSL) that users can manipulate to create their models. Multiple DSLs can be implemented as long as they adhere to the contract of the meta-model, giving users more flexibility to express models.
Figure 3: Relationships between models, DSLs, Meta-model and constraints
The meta-model itself must be flexible, to avoid restricting the possibilities offered to users. It must also be simple for developers to use. Finally, it must be extensible, because it is not possible, in the scope of this project, to define every relevant component users might ever need in their models.
There are two sets of constraints that have to be considered when designing
the meta-model, even though these sets are sometimes conflicting. The first
set is called the problem domain: the goal the project must accomplish, and
the constraints that stem from SOA and their deployment. The second set
is the target domain: the platform that is used to deploy SOA on, with its
own model and limitations. These two sets of requirements lead to two approaches to the problem of creating a meta-model. In a bottom-up approach, the problem domain is analyzed to find the desired features; those features are then expressed in the meta-model. In a top-down approach, the target domain's limitations and concepts are examined and translated into the meta-model.
Using the first approach tends to lead to a meta-model that expresses the problem domain knowledge very well, but does not necessarily translate well into the target domain. Conversely, the second approach could lead to a situation where the meta-model is a good fit for the target, but fails to represent concepts in the problem domain properly. Since both approaches have their advantages and weaknesses, a hybrid approach was used in which the model was designed incrementally by mixing the two. This ensures that the meta-model does not stray too far from either set of constraints, so that the end result satisfies all the requirements.
The next sections examine the assumptions made about each set of constraints. The question of extensibility is discussed afterwards.
3.1.1 The problem domain: SOA
The basic question here is: “what do users of the runtime want to model?” An approach similar to model-driven development was used. In this paradigm, experts in the problem domain should be able to understand and design models easily. This means that in order to create a proper meta-model, it was important to understand the problem domain and build the meta-model based on the needs and knowledge of the people who would use it [3]. This was not an easy task because of the experimental nature of this project; it had no real clients or experts to express concrete requirements and criticize design decisions.
Systems are an assembly of applications. Applications are thus the basic building block and, as such, one of the most important entities in the meta-model. Most other concepts that must be modeled are attributes or properties of applications. Attributes and properties are often shared by multiple applications. It must be easy to model and modify shared attributes. In particular, the process must not be error-prone: repetition of data in models must be avoided (the DRY5 principle). This indicates that a way to group applications and attributes must be provided in the model.
Since modeled systems can be arbitrarily large and complex, models describing those systems are large too. For this reason it makes sense for models to have a concept of hierarchy or scope, so that someone manipulating a model can focus on smaller parts of it without having to keep everything in mind.
Applications that share attributes often have some kind of dependency on one another. The notion of dependency is very important because it is what makes a system more than just a collection of applications. Expressing these dependencies is thus an important part of the model. Dependency management must be easy to visualize and express; as such, it makes sense to apply the notion of scope mentioned above to dependencies as well.
Finally, monitoring concepts are the hardest part to express because there are many kinds of monitoring. There are actually two main aspects to consider here: monitoring and supervision. Monitoring is the collection of data about the system; this data can be used in various ways. A significant amount of data of various kinds can be collected in systems: histograms of resource usage, explicit reporting made by applications, events triggered by the system itself, events triggered by the deployment process, etc. Every monitoring system can have its peculiarities, and providing a single software component able to monitor all relevant indicators for any user is not realistic. Consequently, it makes sense to provide an extensible monitoring entity, called a Monitor, that can be extended by users.
Acquiring data is only half of the battle. Simply monitoring the system is useful because it allows administrators to analyze the data and use it as they see fit. However, many concepts of SOA are to be expressed and achieved at a system-wide level. Some of those concepts (such as elasticity and reliability) were examined in the introduction. Achieving such goals without human interaction requires that the system be able to process data produced by monitoring and react to it. A simple example: whenever an application crashes, the system should be able to detect this and restart the application. This kind of mechanism, which has effects on the system, is referred to as supervision. It is expressed in models using a Supervisor entity. This entity receives data from Monitors and reacts to it by altering the state of the system. The exact mechanism is examined in a later section.
5 DRY: Don't Repeat Yourself
3.1.2 The target domain: Platform as a Service constraints
The previous sections gave a high-level overview of what PaaS is and the main concepts behind it. While the mentioned features are commonly expected from PaaS, their implementation may vary a lot from provider to provider. This section describes a subset of functionalities and constraints that are found in most platforms. The goal of the meta-model is not to model those constraints, but they cannot be ignored either, in order to ensure that models can still reasonably be mapped to PaaS providers.
PaaS is application-centric, which fits well with the ability to model applications and their properties. Applications in a PaaS have required attributes related to how the platform is supposed to deploy or manage them. These must be integrated into the meta-model. Examples of such attributes include the amount of memory required by the application, or the kind of runtime it must be deployed in.
Service Oriented Architecture defines the notion of service as a reusable software component. It does not use a different terminology to distinguish an ad-hoc application that encapsulates business logic from a component such as a generic database. PaaS has a slightly different definition of service: only the generic database is considered a service; the component implementing business logic is considered an application. The difference is that in PaaS, a service is an entity that the platform automatically provisions, and that is not provided by the platform user. From now on, this report will use the word service only to refer to components automatically provisioned by a platform. Custom components will be referred to as applications.
The notion of Service must thus be added to the meta-model, with its relevant attributes (service type and settings). Describing services is not enough: users must also be able to describe which applications are bound to which services. This notion of binding can be expressed well by considering that services are available in defined scopes and automatically bound to the applications in those scopes.
While PaaS is application-centric, it does not provide features whose scope extends beyond a single application. Entities such as Dependencies can, by construction, have a scope that is larger than a single application. These concepts may not easily be mapped to PaaS. This is a significant constraint, but these concepts are too important to be left out of the meta-model. The implementation of those features is examined in section 4.1.
3.1.3 Model extensibility
The concepts of dependency, monitor and supervisor have already been identified. Their semantics are examined in the next section, but the requirement for them to be extensible has already been established. Given any data provided by monitoring, a supervisor should be able to react to it by altering the system. There is, once again, a multitude of interesting supervision behaviors, and implementing them all was not a goal of this project. This problem was addressed by creating an extensible plugin system. The implementation details of this system are discussed in section 4.3; for the time being, the reader should just keep in mind the following: since the implementations of the plugins define their semantics, the task of describing those entities in the model is reduced to giving their class name, and optionally a few relevant parameters or arguments to the plugin implementations.
3.2 The meta-model
The paragraphs above establish requirements for the meta-model, including extensibility, a scoping mechanism, dependencies and a number of entities: application, service, supervisor and monitor. The Application and Service entities are relevant to the deployment on the target platform, while the Supervisor and Monitor entities belong to the monitoring mechanism developed in the project.
In order to provide scope, entities of the model can be nested. This means that entities in a model form a tree. As such, service, monitoring and dependency bindings can be modeled in a clean fashion using the parent/child relationship of the tree. A service is bound to an application if that application is a child of the service. A Monitor sees only the applications that are its children in the tree, and so on. Properties defined on an entity affect all relevant descendants in the tree. Figure 4 demonstrates the concept.
Figure 4: In this model fragment, App1 and App2 have a dependency, but
only App2 is bound to Service1.
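The binding rule of Figure 4 can be sketched in code: an application is bound to every Service found among its ancestors. The classes below are simplified stand-ins for the real meta-model implementation:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model-tree classes (illustrative, not the project's real ones). */
class Entity {
    final String name;
    Entity parent;
    final List<Entity> children = new ArrayList<>();
    Entity(String name) { this.name = name; }
    Entity add(Entity child) { child.parent = this; children.add(child); return child; }
}

class Service extends Entity { Service(String name) { super(name); } }
class Application extends Entity { Application(String name) { super(name); } }

public class ScopeResolver {
    /** Walk up the tree and collect the services the application is bound to. */
    public static List<String> boundServices(Application app) {
        List<String> services = new ArrayList<>();
        for (Entity e = app.parent; e != null; e = e.parent) {
            if (e instanceof Service) services.add(e.name);
        }
        return services;
    }

    public static void main(String[] args) {
        Entity root = new Entity("system");
        Service service1 = new Service("Service1");
        root.add(service1);
        Application app1 = new Application("App1");
        root.add(app1);                      // App1 is outside the service's scope
        Application app2 = new Application("App2");
        service1.add(app2);                  // App2 is a child of the service
        System.out.println(boundServices(app2)); // [Service1]
        System.out.println(boundServices(app1)); // []
    }
}
```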
All entities in models have names that serve as identifiers. Users are also able to attach metadata and a description to their entities to make models more descriptive. The next paragraphs examine the entities and their semantics.
PaaS entities
A Deployable is a description of a single application that must be deployed on the target platform. The name deployable is used instead of application to avoid ambiguities and to insist on the fact that a Deployable must actually be associated with a deployable application file. In PaaS platforms, the Deployable is mapped to the basic unit of deployment that the platform can handle. A model without any deployable would be of limited use, since no application would actually run. By extension, the model tree should have a deployable as the leaf of every branch.
Deployables reference a single file, such as a .war file (a Java web archive containing executable classes). This report refers to these files as artifacts. When the model is deployed, artifacts are acquired by the runtime and deployed on the PaaS. The meta-model itself does not put any constraint on the kind of deployables that can be described; the underlying PaaS will, however, only be able to handle certain types of artifacts. Deployables also have attributes such as the required memory per instance and which framework and runtime to use (e.g. a Java Virtual Machine, or a Ruby environment and framework such as Ruby on Rails).
The Service entity matches the notion of service in a PaaS. Service entities declare the kinds of services that are required by the system (e.g. a database, or more specifically a MySQL database). The binding of services to deployables is implicitly defined by the structure of the model tree: any application that is in the scope of a service is bound to the service.
The platform is responsible for provisioning and managing services; the runtime does no more than request them during deployment.
Only the two entities above, Deployable and Service, can actually be
represented in a PaaS paradigm. The remaining entities described below
are implemented by the runtime and are at the core of the features this
project provides.
Dependencies
Applications can depend on one another. The concrete meaning of a Dependency is not directly defined in the meta-model, but is implemented by plugins. The model describes the kind of dependency that should be verified at runtime, and any relevant parameters the dependency may require to execute its verification. Dependencies can be of many types; an obvious example is the ability to declare a dependency between versions of different applications, which would prevent the deployment of an application that is incompatible with another.
Monitoring and Supervision
This report uses the term monitoring to describe the ability of the runtime to keep an up-to-date view of the deployed system and to examine any event that occurs. Supervision is then the ability to react to those events to keep the system in a consistent state.
The task of the Monitor entity is to observe the system (by polling or listening) for specific events and to propagate these events so that the runtime can react to them. One can simply see monitors as event sources. In order to accommodate all kinds of monitoring needs, this entity is implemented by multiple plugins. Examples of monitors include:
• an agent that periodically sends an HTTP request to a web page and reports the response time or any abnormal HTTP response code;
• an agent that observes the number of messages an application sends, and reports this number when it crosses a certain threshold;
• an agent that observes the memory or CPU usage of instances of an application, and reports it.
The role of monitors is not to alter the system but merely to observe it and report significant events.
Whereas a monitor is an event producer, a Supervisor is an event consumer. Supervisors subscribe to monitors and are notified whenever an event occurs. Their task is to react to events by executing actions and altering the system. Supervisors are simple components providing this fixed functionality; however, they rely on Condition and Action entities, which are themselves implemented by plugins, in order to provide an interesting range of supervision functionalities. Multiple supervisors can subscribe to a single monitor, but each supervisor subscribes to only one monitor.
A Condition entity examines the state of the system after an event and resolves to a boolean value. One Supervisor can have more than one Condition (defined in a specific order). Conditions are evaluated successively; if all conditions hold, then the Supervisor executes its Action. Conditions are useful to filter out events by inspecting the system in order to decide whether the event is relevant.
An Action defines the role of its supervisors. Actions represent the task that the supervisor must execute whenever an event is received and all conditions hold. Again, this is a pluggable component, because actions should not be limited in their nature. Actions can, however, be shared between multiple supervisors, so that common tasks do not need to be described multiple times.
The fact that monitors, conditions and actions are all implemented by plugins makes this system very flexible. It becomes possible to describe almost any monitoring task, since any significant kind of data can be monitored and any resulting action can be taken. The separation of monitoring and supervision makes for a model in which each component (Monitor, Condition, Action) can be kept simple and focused, which promotes reusability.
This simple mechanism allows new application instances to be created automatically when the load increases. The configuration of the elasticity behavior is contained in the model, and it is easy to change parameters such as the load threshold or the instance creation delay. Each component has a single, simple and reusable task to execute. This model is close to the actual system deployed and tested in section 5.
Figure 5 shows a model for an application that is subject to significant load variations. This system has a master node and multiple slave instances. There is a dependency between the two deployables, to ensure that they use the same version of a given library. Whenever the load on the master becomes too high (as detected by the monitor), new instances of the slave application have to be started; this action is performed by the supervisor. The supervisor has a condition ensuring that at most one instance is started every 20 seconds (to avoid creating new instances while the previous one is not yet up).
Figure 5: Example of model. Not all attributes are shown for simplicity.
Blue blocks are plugins.
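The "at most one instance every 20 seconds" condition from the example can be sketched as a small plugin; the class name is invented, and time is passed in explicitly to keep the sketch testable:

```java
/** Sketch of a cooldown condition: allow at most one instance start per
 *  cooldown period (illustrative, not the project's actual plugin). */
public class CooldownCondition {
    private final long cooldownMillis;
    private long lastTriggered = Long.MIN_VALUE / 2;   // "far in the past"

    public CooldownCondition(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** Returns true (and arms the cooldown) only if enough time has elapsed
     *  since the last accepted event. */
    public synchronized boolean holds(long nowMillis) {
        if (nowMillis - lastTriggered < cooldownMillis) return false;
        lastTriggered = nowMillis;
        return true;
    }

    public static void main(String[] args) {
        CooldownCondition c = new CooldownCondition(20_000);
        System.out.println(c.holds(0));      // true: first event passes
        System.out.println(c.holds(5_000));  // false: still cooling down
        System.out.println(c.holds(25_000)); // true: 25 s after the accepted event
    }
}
```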
3.3 Deployment process
This section examines what it means to deploy a model as described by the defined meta-model. The deployer is a core component of the runtime that has to interact with the whole system. The task of the deployer is the following: given a model, a set of deployable files and a target deployment platform (PaaS), make the applications run as described by the model. This requires multiple successive steps. Furthermore, the need to continuously update and deploy new models, without requiring a full stop of the system, has to be considered. These successive steps are shown in Figure 6, and each of them is described below.
Figure 6: The successive steps when deploying a new model. Validation
steps are in green, blue steps are the actual deployment phases, and the red
step is continuous monitoring. Resources external to the Deployer are in
orange.
Runtime validation: during this phase, the model is examined and some basic constraints are checked. For example, the deployer verifies that every name referenced in the model is actually declared somewhere. Some constraints can be enforced, such as verifying that no more than one application requests a binding to the same URL. This is a straightforward step that relies only on the model. This phase has to be executed only when the model changes.
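The URL-uniqueness constraint mentioned above can be sketched as a simple check (a stand-in for the deployer's actual validation code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of one runtime-validation check: no two applications may request
 *  the same URL (illustrative only). */
public class UrlValidation {
    /** Returns the URLs requested more than once (empty means the check passes). */
    public static Set<String> duplicateUrls(List<String> requestedUrls) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();
        for (String url : requestedUrls) {
            if (!seen.add(url)) duplicates.add(url);   // add() is false on repeats
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(duplicateUrls(List.of("a.example.com", "b.example.com"))); // []
        System.out.println(duplicateUrls(List.of("a.example.com", "a.example.com"))); // [a.example.com]
    }
}
```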
Dependency validation: this phase instantiates dependencies and validates them. Dependencies read the artifacts that are to be deployed, so the artifacts are fetched during this phase. This phase has to be executed again whenever any artifact is updated, to ensure that the updates are safe.
Target compatibility validation: the model and the artifacts are inspected, and the system computes all of the resources, services and features that the deployment requires. It can then query the target deployment platform to ensure that all required resources and services are actually available. This phase fails if, for example, the model requests more RAM than the PaaS platform allows, or if a service or environment is requested that the platform is not able to provide.
Target deployment: this phase actually updates the applications and services on the cloud. Services are provisioned, applications are created, artifacts are uploaded, and services are bound to applications. At the end of this phase, the system is ready to be run (but not yet ready to be monitored).
Runtime deployment: the monitoring components are created in the runtime. Monitors and supervisors are instantiated but not yet enabled. They will be enabled automatically at the next phase, and can be stopped and restarted manually using the web interface.
Monitoring: the monitoring system starts. Monitors start polling or listening for events and relaying them to the runtime. If any application is set to auto-start, it is started. The real effect of this phase varies significantly depending on the model, but the general idea is that a chain of events is triggered that starts most or all applications one after the other, until the whole system is up and running.
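The ordering of the six phases can be summarized in code; the enum is invented for illustration and carries no behavior:

```java
import java.util.List;

/** Sketch of the deployment pipeline ordering (phase names from the text). */
public class DeploymentPipeline {
    enum Phase {
        RUNTIME_VALIDATION,              // model-only checks
        DEPENDENCY_VALIDATION,           // needs the artifacts
        TARGET_COMPATIBILITY_VALIDATION, // queries the PaaS
        TARGET_DEPLOYMENT,               // provisions, uploads, binds
        RUNTIME_DEPLOYMENT,              // instantiates monitors/supervisors
        MONITORING                       // the system comes alive
    }

    /** A full deployment runs every phase in order. */
    public static List<Phase> fullDeployment() {
        return List.of(Phase.values());
    }

    public static void main(String[] args) {
        // Validation phases always precede the deployment phases.
        System.out.println(fullDeployment().get(0)); // RUNTIME_VALIDATION
    }
}
```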
3.3.1 Continuous deployment
The successive steps defined above have a major shortcoming: they do not allow the system to be updated (either by a model or an artifact update) without a full restart. This has to be overcome to achieve high availability and to avoid unnecessary restarts of applications (which are not always painless).
In the case where only one or more artifacts are updated, the problem is not too complex. The deployer has to retrieve the new artifacts and revalidate all relevant dependencies. If they are satisfied, the runtime can simply stop, update and restart the relevant applications. Note that this incurs some downtime, which cannot be avoided, because starting the updated application before stopping the previous one could have unpredictable results unless the application is specifically developed to handle this. Having both versions running simultaneously could, in general, put the persistent data of the application in an unexpected state.
Whenever the model changes (which is not necessarily a frequent operation), the technique is to analyze both versions of the model and compute two new models that contain only the changes. One model contains the obsolete entities, which have to be stopped and deleted. The other model contains the new entities that are to be created. The runtime goes through the first three phases of the deployment process. The two next phases, however, are executed twice: the first time to remove obsolete entities, and the second time to create the new ones. The process is shown in Figure 7.
Figure 7: The deployment process for an already running model.
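The model-diff computation can be sketched over entity names (the real deployer works on entity objects, but set difference captures the idea):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the model-diff step: given the entity names of the old and new
 *  model, compute which entities to remove and which to create. */
public class ModelDiff {
    public static Set<String> obsolete(Set<String> oldModel, Set<String> newModel) {
        Set<String> result = new HashSet<>(oldModel);
        result.removeAll(newModel);   // in the old model but not the new one
        return result;
    }

    public static Set<String> created(Set<String> oldModel, Set<String> newModel) {
        Set<String> result = new HashSet<>(newModel);
        result.removeAll(oldModel);   // in the new model but not the old one
        return result;
    }

    public static void main(String[] args) {
        Set<String> oldModel = Set.of("App1", "App2", "ServiceA");
        Set<String> newModel = Set.of("App1", "App3", "ServiceA");
        System.out.println(obsolete(oldModel, newModel)); // [App2]
        System.out.println(created(oldModel, newModel));  // [App3]
    }
}
```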
3.4 Runtime scalability and fault tolerance
Section 1.1 mentioned that making applications scalable is both one of the challenges faced when developing enterprise applications and one of the core benefits of cloud computing. It is clear that the application developed in this project must not become a bottleneck that slows down otherwise scalable systems. For this reason, scalability of the runtime was considered a mandatory objective.
Two core concepts were defined to achieve this goal:
• Making the application stateless.
• Keeping consistency among instances by using services provided by
the PaaS.
The reasoning is as follows: the end goal is to ensure that the system deployed by the runtime works properly. A failure of the PaaS can, in general, cause an unrecoverable failure of one application, and by extension the failure of the whole system. As such, requiring that the runtime be able to recover in this case does not provide any benefit. The consequence is that the PaaS provides a failure model that is easy to work with: services provisioned by the platform can be considered either faulty (in which case correctness of the runtime is not required) or always correct (in which case correctness of the runtime is also required).
A large part of the complexity in distributed algorithms comes from the fact that any process can crash at any time. This is what makes problems such as consensus or group membership hard. The assumption made above removes this complexity: services provided by the platform can be assumed to be always correct. Thus, in this failure model, complex distributed algorithms become much easier (e.g. consensus is solved by letting the always-correct process decide every time) and distributed problems are solved more easily.
Storing the state of the runtime in a database makes it stateless. The database can be considered an always-correct process. Since the database is shared by all instances, the state is easy to share too. Once the application is made stateless, any instance can be started or can crash at any time and the state is still preserved. Instances communicate with each other to know who is present in the system, and can re-dispatch any task that was lost in a crash.
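The statelessness idea can be sketched as follows: all state goes through a shared store, so a fresh instance sees exactly what a crashed one left behind. An in-memory map stands in for the database here, and the interface is invented for the example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of statelessness: all runtime state lives behind a shared store
 *  (the database in the real runtime; an in-memory map here). */
interface StateStore {
    void put(String key, String value);
    String get(String key);
}

class InMemoryStateStore implements StateStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
}

public class RuntimeInstance {
    private final StateStore store;   // shared by all instances

    public RuntimeInstance(StateStore store) { this.store = store; }

    public void recordTask(String taskId, String status) { store.put(taskId, status); }
    public String taskStatus(String taskId) { return store.get(taskId); }

    public static void main(String[] args) {
        StateStore shared = new InMemoryStateStore();
        RuntimeInstance a = new RuntimeInstance(shared);
        a.recordTask("deploy-App1", "running");
        // Instance a "crashes"; a fresh instance sees the same state.
        RuntimeInstance b = new RuntimeInstance(shared);
        System.out.println(b.taskStatus("deploy-App1")); // running
    }
}
```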
4 Implementation
The previous sections established the required features and the high-level concepts that the project has to implement. Those features were implemented in the form of a Runtime application. The implementation is examined in this section to see how it fits the requirements. Interesting features are discussed, as well as the technical difficulties that were encountered and how they were overcome.
The runtime is implemented in Java and uses the Spring Framework for dependency injection, database access and web-related libraries. The dependency injection mechanism of Spring promotes loose coupling between the components of the application and separates interface from implementation. This means that the runtime can be easily modified and extended in the future.
The project is divided into three main parts, each examined in its own sub-section:
• The core application, which defines the interfaces and the main components, including the deployer (section 4.1).
• The extensions, which are concrete implementations of the interfaces defined in the core (section 4.2).
• The web frontend, which is a facade that lets users interact with the system easily (section 4.4).
Furthermore, a small library of plugins (dependencies, monitors, actions
and conditions) has been developed in a fourth independent sub-project.
The plugin system and a few plugin examples are discussed in section 4.3.
4.1 Core
The main components of the Core project can be seen in Figure 9. Only three components have concrete implementations in this project; the others are interfaces, for which one or more implementations were developed in the extensions project. The components are examined in this section, while the next one introduces implementations for the interfaces. The core project also contains the implementation of the meta-model as a set of Java classes. This implementation is discussed first, followed by the other components.
4.1.1 Model and DSL implementation
The model entities are simple Java classes that correspond to the meta-model discussed in section 3.2. Entities implement the Composite design pattern: applications are leaves, and all other entities can act as composites. The package also provides utility classes to manipulate models, such as the ModelVisitor, an abstract implementation of the Visitor pattern for models. Both Visitor and Composite are described in [6]. Using these two patterns makes it easy to create new classes that read models without having to modify the model classes themselves.
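The combination of the two patterns can be illustrated with a minimal sketch. The class names below (other than ModelVisitor) are invented for this example and do not reflect the project's actual entity hierarchy:

```java
import java.util.ArrayList;
import java.util.List;

// Composite: Application is a leaf, Group a composite (illustrative names).
abstract class Entity {
    final String name;
    Entity(String name) { this.name = name; }
    abstract void accept(ModelVisitor v);
}

class Application extends Entity {
    Application(String name) { super(name); }
    void accept(ModelVisitor v) { v.visitApplication(this); }
}

class Group extends Entity {
    final List<Entity> children = new ArrayList<>();
    Group(String name) { super(name); }
    Group add(Entity e) { children.add(e); return this; }
    void accept(ModelVisitor v) {
        v.visitGroup(this);
        for (Entity e : children) e.accept(v); // recurse into the composite
    }
}

// Visitor: readers of the model subclass this, the model classes never change.
abstract class ModelVisitor {
    void visitApplication(Application a) {}
    void visitGroup(Group g) {}
}

// Example reader: collects every name in the model, e.g. for validation.
class NameCollector extends ModelVisitor {
    final List<String> names = new ArrayList<>();
    void visitApplication(Application a) { names.add(a.name); }
    void visitGroup(Group g) { names.add(g.name); }
}
```

A checker, a deployer or a serializer can each be written as one more ModelVisitor subclass, without touching the entity classes.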
A sub-package, dsl, provides classes that make it easier to build models directly in the Java language. This package relies on a Java feature called double-brace initialization to get past Java's usually very verbose syntax6. The following figure shows a short example comparing this syntax with the usual one. The double-brace syntax is less verbose (though still brace- and parenthesis-heavy) and helps show how the components of the model are nested inside one another. It also does not require creating and referencing variables.
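Since Figure 8 is an image, here is a minimal self-contained sketch of the idiom; DslSystem and its application method are invented stand-ins for the project's real DSL classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a model class exposed by the dsl package.
class DslSystem {
    final List<String> apps = new ArrayList<>();
    void application(String name) { apps.add(name); }
}

class DslExample {
    // Double-brace initialization: "new DslSystem() {{ ... }}" creates an
    // anonymous subclass whose instance-initializer block runs the nested
    // calls, so the code's indentation mirrors the model's nesting.
    static DslSystem build() {
        return new DslSystem() {{
            application("master");
            application("slave");
        }};
    }
}
```

A known cost of the idiom is that every double-brace expression creates an extra anonymous class, and, when used in a non-static context, the instance keeps a reference to its enclosing object.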
This is only a small improvement over the standard Java syntax, however; a full-fledged DSL, with supporting tools and even a graphical editor, would be more expressive. This could not be achieved during the short duration of this project.
6: This is used, for example, in the Java Anti-Template Language, to generate HTML.
Figure 8: Comparison of the double-brace syntax (left) with usual Java
syntax (right).
4.1.2 Core components and deployment process
Figure 9: Main components of the Core project. Blue blocks represent
concrete components, green blocks are abstract interfaces. Arrows indicate
dependencies.
The Runtime component provides the public interface to the project. It has methods to start and stop the system, and others to interact with models, the deployer and deployed systems. It is mostly a facade for the underlying components, performing little logic itself. The Runtime has the task of storing and retrieving user models, a task that is delegated to the ModelDAO interface. This interface defines the storeModel and getModel methods.
The Parsers component is a collection of instances of the ModelParser interface. This interface defines components that read InputStreams in a predefined format and produce a Java representation of a deployable model. When a user uploads a model file through the web interface, the Runtime passes it to the Parsers class, which tries to find an available ModelParser that understands the file format. If such a parser is found, the file is parsed and a model instance is returned.
The Deployer is the most complex component, implementing the features described in section 3.3. The next paragraphs consider what happens during the deployment of a model, and examine step by step which components and interfaces are used, and how. As a reminder, Figure 6 describes the six deployment steps.
Step 1, runtime validation, is implemented by the ModelChecker. This class visits the model (the term visit is used when a class traverses the model using the visitor pattern) and verifies that all names in the model are unique and that every name referenced by a component actually exists in the model. It also verifies that the required plugins are available in the class path, and that no URL is bound to more than one application. None of these validation steps requires outside resources.
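The two pure checks described here (unique names, resolvable references) can be sketched on a simplified model; the method names below are assumptions for illustration, not the actual ModelChecker API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the step-1 validation logic.
class ModelCheckerSketch {
    /**
     * names: every declared entity name in the model, in declaration order.
     * references: referencing entity -> referenced name.
     * Returns the list of validation errors (empty list = model is valid).
     */
    static List<String> check(List<String> names, Map<String, String> references) {
        List<String> errors = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String n : names) {
            if (!seen.add(n)) errors.add("duplicate name: " + n);
        }
        // every referenced name must exist somewhere in the model
        for (Map.Entry<String, String> ref : references.entrySet()) {
            if (!seen.contains(ref.getValue()))
                errors.add(ref.getKey() + " references unknown name: " + ref.getValue());
        }
        return errors;
    }
}
```

In the real implementation these checks are performed by a visitor over the model classes rather than over flat lists, but the logic is the same.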
Step 2, dependency validation, relies on the DependencyChecker. This class visits the model and instantiates Dependency plugins when it encounters them. Once a plugin is instantiated, its validation method is called with the relevant deployables passed as arguments. If the plugin requires access to artifacts, it has access to a collection of ArtifactSources, which have methods to retrieve artifacts. An ArtifactSource retrieves artifacts as described in the model, and returns a representation of the artifact with metadata (such as length or modification time) and direct access to the file data. The DependencyChecker returns to the deployer a list of failed dependencies (and reasons for the failures), if any occurred.
Step 3, target compatibility validation, relies on the DeployTarget interface. An implementation of this interface provides methods to interact with a particular PaaS, such as login, createApplication or getCloudStatus. To execute this step, the canDeployModel method of the target is called. The implementation is thus platform-dependent, but is expected, for example, to ensure that all requested resources (memory, services, runtimes, etc.) are available on the platform.
From step 4 onward, the deployment process starts to have side effects. Step 4 is the target deployment. A class, the CloudDeployer,
visits the model and uses the CloudTarget to create applications and services on the PaaS. Data about the current deployment state (e.g. the currently deployed artifacts) is stored using the DeployDAO.
Step 5 is the runtime deployment, in which instances of Monitors and Supervisors are created. This step uses the final interface presented in Figure 9, the EventManager, which has the task of managing all of the monitoring and supervision. Step 6 simply starts the monitors and the applications that have to be started, and the continuous monitoring of the system begins.
This concludes the description of the core project. This project has been kept as simple as possible, so that the deployment logic stays separate from the non-functional requirements (persistence, fault tolerance). The resulting implementation has only a few concrete classes and a number of interfaces. The next section examines the implementations of these interfaces.
4.2 Extensions
The core project establishes seven interfaces, as described above. Figure 10 shows those interfaces and some of the implementations produced for them. This section examines those implementations. Note that only the most important implementations, those that enable reliable operation of the runtime, are discussed here. The project also provides simple implementations for most interfaces that rely on the system's main memory. These implementations are not suitable if the system has to survive crashes, but they are easier to deploy and generally faster.
Figure 10: Interfaces are in green, implementations in orange. The purple block, the event manager, is described in sub-section 4.2.1.
SerialParser is a simple implementation of the ModelParser interface that reads models from their standard Java serialized form. Models are created as Java source code, as seen for example in Figure 8, and written to a file using the Java API's ObjectOutputStream class. This is one of the most straightforward implementations that can be produced for this interface, but manipulating such models is not practical: they cannot be edited by a human, and any modification to the source code requires recompiling and running a Java program.
The ModelDAO and DeployDAO interfaces are respectively implemented by the JDBCModelDAO and JDBCDeployDAO classes. Both implementations rely on Spring's JdbcTemplate, which provides a standard abstraction over all JDBC-compliant databases. MySQL was used as the default database (because it is both simple and available as a service in Cloud Foundry). Using a database ensures persistence of the data even when the Runtime crashes, which means that both models and deployment information can be restored upon restart. It also enables sharing of data between all replicas of the runtime, as long as all replicas use the same database. The ModelDAO uses proxy classes of the model so that large models can be partially loaded from the database, in order to improve performance.
Two implementations of the ArtifactSource interface are provided. HTTPArtifactSource reads artifacts from a configurable HTTP URL. It uses HTTP headers as metadata for the artifacts (e.g. the Last-Modified header is used to verify whether a new version of an artifact is available). A second implementation, JDBCArtifactSource, relies on a database (in the same way the JDBCModelDAO above does). This second implementation is actually a WriteableArtifactSource, an extended interface that the web frontend can use to let users store artifacts from the web application. A proof of concept for an SVNArtifactSource was partially developed to fetch artifacts from SVN repositories.
Finally, the DeployTarget interface is implemented by CloudFoundryTarget, which, as its name implies, is able to deploy to the Cloud Foundry PaaS. It relies on the Cloud Foundry API for Java8. A dummy implementation named SimulationTarget is also available; it does not actually deploy anything, but can be used for testing purposes.
4.2.1 Distributed Event Manager
The event manager is a more complex component which deserves a more thorough inspection. The overall organisation of the implementation can be seen in Figure 11.
8: https://github.com/cloudfoundry/vcap-java-client
Figure 11: Distributed implementation of the EventManager interface. The implementation defines its own sub-interfaces (in green) and their implementations (in orange).
The actual logic of managing monitors and supervisors is implemented by the LocalEventManager. This component, which is in itself a valid implementation of the EventManager interface, works only for a single instance of the application. Section 4.3 describes the interface of monitors and supervisors; the local event manager is a straightforward implementation that satisfies this model. It uses threads to let monitors run concurrently, and queries the local runtime instance to build execution contexts.
It relies on the MonitorDAO to store data about monitoring. The default implementation of this interface relies on a database and thus ensures persistence of the monitoring data.
However, such an implementation will not scale when the number of monitors gets large, and it cannot be replicated as-is. For this reason, the DistributedEventManager was implemented.
Each instance of the Runtime application (referred to as a Peer) runs a distributed event manager which has its own private local event manager instance. It also relies on three interfaces, for each of which a single implementation was created:
• PeerDAO, which reliably stores data about the participants in the distributed system.
• PeerMessaging, which peers use to communicate with one another and keep track of which participants are alive.
• MonitorMessaging, used to communicate commands and data about the monitoring processes.
A theoretical replication model was described in section 3.4. Used together, these three interfaces aim to implement this distributed model. The goal is for the peers to keep the monitoring system running even in the presence of multiple failures. This requires two things: peers must know the other participants in the distributed system, and they must be able to cooperate to share the monitoring work and avoid loss of data.
First, the JDBCPeerDAO implements the PeerDAO interface using a JDBC-compliant database. The implementation is similar to the others discussed previously. The aim of this interface is to store data about the peers. The database process serves as the persistent and always correct node of our distributed model. It has one additional task: it assigns a unique identity to each of the instances. This is shown in Figure 12.
Figure 12: A new peer connects to the database and is assigned a unique identifier. Other peers are now aware there is a new participant. Black arrows between peers represent heartbeat messages.
Both the Peer and Monitor messaging interfaces are implemented using Spring AMQP (Advanced Message Queuing Protocol, http://www.springsource.org/spring-amqp). RabbitMQ, which is an available service on the Cloud Foundry platform, implements this protocol. AMQP provides messaging abstractions based on queues and exchanges that make it easy to implement broadcast and unicast messaging. It also provides abstractions for work queues: multiple processes subscribe to a single message queue, and each message is delivered to exactly one subscriber. Each of these messaging abstractions is useful in different cases, which are covered in the next paragraphs.
The AMQPPeerMessaging implementation of the peer messaging interface uses both unicast and broadcast messages. The main goal of the protocol implemented here is to detect peer failures. The general idea is simple: each peer knows every other registered peer from the database. Since each peer is assigned a unique (numeric) identifier, a total ordering of the peers is implicitly defined. For any peer with identifier P, a predecessor function pred(P) and a successor function succ(P) can be defined. pred(P) is the peer with the largest identifier smaller than P; if no such peer exists, the predecessor is the peer with the largest identifier. succ(P) is the peer with the smallest identifier larger than P; if no such peer exists, the successor is the peer with the smallest identifier. Each peer has exactly one successor and one predecessor, and no peer is the predecessor or successor of more than one peer.
In Figure 12, black arrows indicate the succ function; e.g. the successor of Peer1 is Peer2. These relations establish a single cycle containing all peers. There cannot be a partition between the peers, because the database is assumed to be correct. Peers periodically send heartbeat messages to their successors, and expect messages from their predecessors. When a peer has not received a message from its predecessor for too long, it informs the database that this peer is considered dead. All peers learn this, and in particular, if P dies, pred(P) knows that its new successor is now succ(P), closing the cycle. Even if P was only experiencing a temporary failure, it must accept that it was declared dead, and must stop participating in the protocol. This is commonly known as a crash-stop model.
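The ring functions and the effect of declaring a peer dead can be sketched directly from the definitions above. This is an illustrative reimplementation over a set of live identifiers, not the project's actual AMQPPeerMessaging code:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the peer ring: identifiers are totally ordered, and
// pred/succ wrap around, forming a single cycle over all live peers.
class PeerRing {
    final SortedSet<Integer> peers = new TreeSet<>();

    /** Largest identifier smaller than p, wrapping to the largest overall. */
    int pred(int p) {
        SortedSet<Integer> smaller = ((TreeSet<Integer>) peers).headSet(p, false);
        return smaller.isEmpty() ? peers.last() : smaller.last();
    }

    /** Smallest identifier larger than p, wrapping to the smallest overall. */
    int succ(int p) {
        SortedSet<Integer> larger = ((TreeSet<Integer>) peers).tailSet(p, false);
        return larger.isEmpty() ? peers.first() : larger.first();
    }

    /** A missed heartbeat leads p's successor to declare p dead;
     *  removing p closes the ring around it (crash-stop: no return). */
    void declareDead(int p) { peers.remove(p); }
}
```

With peers {1, 2, 5}, succ(5) wraps to 1 and pred(1) wraps to 5; after declareDead(2), the successor of 1 becomes 5, as in the recovery described above.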
The possibility of malicious peers trying to compromise the system is not
considered. Such a task would be easy: a malicious peer can simply declare
all peers as dead to bring the system to a halt.
Finally, the AMQPMonitorMessaging implementation of the monitor messaging interface relies on the guarantees described above to distribute the monitoring workload among peers. All peers subscribe to a work queue.
Whenever a model has to be deployed, all the monitors that must be deployed are sent as messages through the queue. By default, messages are
distributed in a round robin fashion, which ensures an equal number of
monitors on each peer.
Peers have to confirm the reception of monitors, so that every participant knows the distribution of monitors at all times. Whenever a peer fails, its successor detects the failure and declares the peer dead, as described above. It then initiates recovery by re-sending the monitors of the failed peer on the work queue. The number of participants in the system is known, and so is the number of active monitors. Whenever a new peer joins the group, the other peers detect that they have too many monitors, and redistribute the excess ones on the work queue.
4.2.2 Replication and Fault-tolerance of the Runtime
This section examined the most important implementations of the interfaces required by the runtime. Used together, these implementations achieve the fault-tolerance and scalability goals that were set for the project. Using a central database for application data makes the whole application stateless, which means that replication becomes possible.

The distributed event manager described in the previous section synchronizes instances for the only resource-consuming task of the runtime: monitoring. Crashes are automatically detected and recovered from by re-sending monitors. There is a small downtime for monitors when they need to be re-dispatched, but this is considered acceptable given the continuous nature of the monitoring process.
4.3 Plugin system
The meta-model defines four entities (dependencies, monitors, conditions and actions) that have to be implemented by plugins; these entities are called pluggable. The plugin system is simple: the model declares pluggable entities with two attributes, a class name and a set of key-value arguments.
When the model is deployed and a pluggable entity is required, the runtime uses the reflection capabilities of Java to create a plugin instance. Plugin instances must accept arguments, but each implementation is free to process them however it wants. Each plugin type interface declares its own logic method:
• For Dependencies: DependencyResult isSatisfied(DependencyContext context)
• For Conditions: boolean isSatisfied(ConditionContext context)
• For Actions: boolean execute(ActionContext context)
• For Monitors: Collection<Event> fetchEvents(MonitorContext context)
Monitors have additional methods that the runtime uses to obtain information about their current status, as well as methods to start and stop the monitor.
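The reflective instantiation step can be sketched as follows. The interface shape mirrors the Condition signature listed above, but the plugin class, the loader, and the argument handling are invented for this example:

```java
import java.util.Map;

// Simplified plugin contract: here arguments are passed straight to the
// logic method; the real interfaces receive them separately from a Context.
interface ConditionPlugin {
    boolean isSatisfied(Map<String, String> args);
}

// Example plugin: fires when the queue of pending work exceeds a threshold.
class QueueTooLong implements ConditionPlugin {
    public boolean isSatisfied(Map<String, String> args) {
        int threshold = Integer.parseInt(args.getOrDefault("threshold", "100"));
        int queueSize = Integer.parseInt(args.getOrDefault("queueSize", "0"));
        return queueSize > threshold;
    }
}

class PluginLoader {
    // Same idea as the runtime: resolve the class named in the model
    // and invoke its no-argument constructor reflectively.
    static ConditionPlugin load(String className) throws Exception {
        return (ConditionPlugin) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

Because the class name comes from the model, any class on the class path implementing the right interface becomes usable without recompiling the runtime, which is exactly what makes the plugin library independent of the core project.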
The role of a plugin developer is thus to implement the relevant code to perform the behavior of their choice. Figure 13 shows an example of an Action that requests a new instance of the applications in its supervisor's scope. The maximum number of instances can be configured in the model, as shown in Figure 14.
Figure 13: Java class for an action plugin.
Figure 14: This Java snippet shows how the plugin can be used in a model.
Each type of pluggable entity uses its own Context to perform its task. The context usually contains some part of the model (the part that is in the scope of the plugin instance). It also gives access to other components of the runtime. A Monitor, for example, has read access to the CloudTarget in order to see the current resource usage. However, only the ActionContext gives write access to runtime components; the other contexts are read-only.
However, since plugins can run any Java code (including code using the Reflection API), it is very hard to guarantee proper and secure behavior. Only trusted plugins should be used. This problem could be partially addressed by proper use of Java's SecurityManager, which has not yet been implemented in the project.
4.4 Web frontend
The implementation of the web frontend is not of particular technical interest; this section only describes a few of its features. The interface displays data about the current cloud target, such as the supported runtimes, frameworks and services, as well as the limits for the current user (maximum number of applications, bandwidth, etc.).
A screen gives a summary of the state of all the currently deployed applications. The details for each application include its current resource usage, the number and status of its instances, etc. An example is shown in Figure 15.
Figure 15: How an application appears in the web frontend.
Figure 16: Monitors also report their current status in the web interface.
Only text reports are currently supported but providing a rich web interface
for monitors is an interesting future feature.
Finally, the current model is visible in one screen, and the user can
upload a new model at any time. Models can be completely or partially
deployed (e.g. by selecting only the validation steps). In case of errors, a
report is given, including for example all dependencies that failed and why.
5 Experiments
A small document conversion application was developed in order to test the project. Using a web interface, users are able to upload documents to the application, select a document converter and download the converted documents. For this experiment, a converter is used to transform Word documents to PDF. The application is composed of two modules, master and slave. The master application manages the web frontend but does not perform any conversion logic, which is handled by the slave application.
Each slave instance only processes one document at a time. Converting documents requires a lot of resources, so for the application to scale, multiple slave instances must be used. It is not necessary to have multiple masters, because their workload is not very high (replicating them is technically possible, though).
Master and slave instances communicate with each other using the RabbitMQ service provided by Cloud Foundry. The master sends all received documents through an exchange. Slave nodes subscribe to this exchange and receive documents as they come. Once a document has been processed by a slave, it sends the document back to the master through a direct queue. A slave does not accept new documents from the exchange until it has finished processing the previous one. This means that when all slaves are busy, documents accumulate in a queue until a slave becomes available.
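This work-distribution scheme can be illustrated with a toy in-process simulation, using a BlockingQueue as a stand-in for the RabbitMQ exchange (the actual experiment uses AMQP messaging, not shared memory):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the experiment: slaves take one "document" at a time
// from a shared queue, and only take the next one when done.
class ConversionFarm {
    static int convertAll(int documents, int slaves) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < documents; i++) queue.add(i);

        AtomicInteger converted = new AtomicInteger();
        Thread[] workers = new Thread[slaves];
        for (int s = 0; s < slaves; s++) {
            workers[s] = new Thread(() -> {
                // each slave holds at most one document at a time,
                // so pending documents accumulate when all are busy
                while (queue.poll() != null) {
                    converted.incrementAndGet(); // the "conversion" itself
                }
            });
            workers[s].start();
        }
        for (Thread w : workers) w.join();
        return converted.get();
    }
}
```

The one-document-at-a-time rule is what lets the queue length serve as a load signal: it grows exactly when all slaves are busy, which is the quantity the elasticity policy of section 5.3.2 reacts to.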
The project is used in three different ways for this experiment:
• The application is modeled using the DSL.
• Monitors are created to measure the performance of the application.
• Supervisors are used to achieve elasticity.
5.1 Metrics
This section introduces the metrics of interest for these tests. Some of the
metrics are used as a simple benchmark of performance for the document
generation application. Other metrics are used to compare the behavior of
the application with or without the runtime.
Throughput is the number of documents converted per unit of time.
Latency is the time elapsed between the submission of a document and the
end of its conversion. These two metrics are the most important ones to
measure the performance of the application.
The application is Scalable if the throughput increases linearly with the number of slave nodes. Given the above description of the application, scalability is expected until either the master or RabbitMQ gets overwhelmed.
Elasticity is how well the system copes with increasing or decreasing load. It can be described as the ratio between the amount of computing time used to complete a given task when elasticity is enabled and when it is disabled. For example, assume a scenario in which 10 instances are required 20% of the time, when the load peaks, and only 2 are needed the rest of the time. When this scenario is run for 10 hours, an inelastic service would use 100 compute-hours. An ideal elastic service would use 36 compute-hours (2 hours with 10 instances, 8 hours with 2 instances). The elasticity ratio would be 36%.
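The arithmetic of the example can be made explicit; the helper below is only a restatement of the definition above, not part of the project:

```java
// Elasticity ratio = elastic compute-hours / inelastic compute-hours,
// where the inelastic service is provisioned for the peak the whole time.
class ElasticityRatio {
    /** phases: {instances, hours} pairs describing the elastic service. */
    static double ratio(int[][] phases, int peakInstances, double totalHours) {
        double elastic = 0;
        for (int[] phase : phases) {
            elastic += phase[0] * phase[1]; // instance-hours for this phase
        }
        double inelastic = peakInstances * totalHours;
        return elastic / inelastic;
    }
}
```

For the scenario above: (10 x 2 + 2 x 8) / (10 x 10) = 36/100 = 0.36, i.e. the 36% quoted in the text.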
5.2 Test setup
A private cloud relying on Cloud Foundry is used for the experiments. The Cloud Foundry components (see section 2.1 for details) and the deployed applications are hosted on a virtual machine.

The master instance requires 512 MB of memory, whereas each slave instance requires 256 MB. Furthermore, the runtime requires 512 MB per instance. In total, it is thus possible to deploy at most one master instance, one runtime instance and six slave instances before reaching the memory limit.
Multiple tests are run. In each one, documents have to be converted and the metrics defined above are measured. Some tests are done without the runtime, some with the runtime but no elasticity mechanism in place, and finally some use the elasticity mechanism. Figure 5 in section 3.2 describes the model used for the application.
5.3 Results

5.3.1 Application performance, scalability
In the first test, 10000 documents have to be processed by a fixed number
of instances ranging from one to six. Throughput and latency are measured
to establish absolute performance and how well the application scales.
Figure 17: Processing 10000 documents with 1 to 6 instances.
The graph indicates that the application scales, but not extremely well: per-instance performance is down to 65% when six instances are enabled. One explanation for this unsatisfying result is that the multiple instances actually compete for the resources of a single virtual machine. Deploying instances on separate machines could improve scalability. Note that Cloud Foundry provides no way to influence how instances of an application are placed, which can be problematic.
In absolute terms, the maximum throughput for a single instance is close
to 22 documents per second, whereas it goes up to 90 documents per second
for six instances.
Figure 18: In a second test, average latency for all documents is measured.
There seems to be no significant impact of the number of instances on the
latency.
The third test measures the time required for an instance to be available.
A new instance is requested on the cloud controller, and the throughput is
measured.
Figure 19: An instance takes approximately 15 seconds to become available. This value is used to establish the elasticity policy.
These three tests were executed with and without the runtime. The presence of the runtime makes no significant difference in performance. This is not surprising, because the runtime does not interact with the applications at all.
5.3.2 Elasticity
In the final test, the performance of the monitoring system is measured and elasticity is tested. The elasticity policy is based on the number of documents waiting to be processed: when this number gets too large, a new instance is requested; when the queue is empty, instances are released until only one remains. The system waits at least 15 seconds between two instance requests.
The number of documents to be processed varies over time during this
test. At first, 40 documents are added to the queue per second for one
minute. In this situation, two instances are able to process the documents.
Then 80 documents per second are added for another minute, this requires
at least four instances. After that, 100 documents per second are added
to the queue for 40 seconds. Even 6 instances cannot handle that amount
of documents, and so the queue size is bound to increase. Finally, 200
documents per second are added during 20 seconds, which will require the
maximum amount of processing for the remainder of the test. In total, 15200
documents have to be processed.
Figure 20: Number of documents added to the queue, queue size and
throughput over time when six instances are always active. The task takes
210 seconds to complete. As long as the load is low enough, the queue stays
almost empty.
Figure 21: Same scenario with elasticity. The number of instances over time
is shown. The elastic system takes a few more seconds to complete the task.
Documents are accumulated in the queue even when the load is small.
Instances                   | 4      | 5      | 6      | Elastic
Elapsed time                | 235 s  | 216 s  | 210 s  | 224 s
Difference with 6 instances | +12%   | +3%    | -      | +7%
Instance-time               | 940 s  | 1080 s | 1260 s | 912 s
Difference with 6 instances | -25%   | -14%   | -      | -28%

Table 1: Elasticity test results
The table above summarizes the results for the elastic setting and for 4, 5 and 6 fixed instances. The elastic system requires a quarter less computing resources, but completes the task in 7% more time. Using fewer instances also reduces the required resources, but increases the execution time comparatively more. This gives a slight advantage to the elastic system, even though the elasticity policy used is very simple. A better policy could improve these results significantly.
The elastic system has a significant disadvantage, though: it increases latency even when the load is not very high. At the very beginning of the test, when only 40 documents are added per second, the queue contains more than 300 documents. For this reason, the elasticity policy used in this test would not be suited for tasks in which low latency is important.
What matters more than the raw performance of the elastic system is the simple fact that it works, even though Cloud Foundry does not provide automatic elasticity. Any application can be made elastic using this mechanism, and it works on any CF-compatible cloud.
6 Conclusion
The conception of large application systems is a complex task, and making these applications scalable, tolerant to failures, loosely coupled, consistent, etc. accounts for a considerable part of this complexity. Service Oriented Architecture makes the development of applications simpler by promoting reusability and modularity: applications become systems in which services interact to achieve a global task. Deploying, monitoring and managing applications is also a difficult problem, and represents a significant cost. SOA does not address the difficulty of implementing non-functional requirements, and can make deployment processes more complex.
Cloud computing, and in particular the emergence of the Platform as a Service paradigm, has the potential to alleviate these problems. PaaS makes the deployment of applications easy and lessens the amount of configuration required to get a system working. Furthermore, platforms provide mechanisms to monitor applications and make them scalable. However, the PaaS paradigm is limited in scope: features are restricted to monolithic applications and do not apply to systems. When applications are loosely coupled, the proper operation of the system requires that some level of consistency is achieved across multiple applications. An example of this is elasticity: when the load increases on one part of the system (e.g. a frontend application), it can be necessary to pre-emptively create new instances on another part of the system (e.g. a backend application) to cope with the load.
The runtime developed in this project helps bridge the gap between the requirements of large application systems and the features provided by PaaS. This is achieved in three ways. First, by creating a meta-model to describe complex applications and their deployment. Second, by providing a deployment process that extends to systems as a whole, including a dependency mechanism that spans multiple applications. Third, by creating an extensible monitoring and supervision mechanism to detect significant events in deployed applications, and using those events to alter the behavior of the system as a whole. The use of an open source platform, Cloud Foundry, and the choice to keep the developed runtime loosely coupled with the platform also promote portability among multiple cloud providers.
Finally, the tests performed with the proof-of-concept document generation application validate the choices made during the project. They demonstrate that the runtime makes it possible to deploy an application and to enforce system-wide rules on a platform that does not support them. Applications do not need to be modified, which means that non-functional requirements can be separated from application logic. In the tests, elasticity is achieved for the document generation application, even though neither the platform nor the application explicitly implements it.
Future work

While the prototype described in this report works, there is much room for improvement.
One of the first steps would be to validate the project against a larger, real-life application. In section 3.1.1, the importance of improving the meta-model and the DSL in an incremental process was discussed. Letting domain experts evaluate and criticise the current implementation would probably yield interesting insights and lead to a more refined model. In particular, developing a real domain-specific language, with specific tools for editing and visualizing models, would be an interesting task.
Furthermore, testing the runtime on a larger scale would be the only way
to truly validate its scalability and performance. The test presented in
section 5 is a good start, but may not reflect real-world situations.
Another aspect to consider is integration with other tools. Most modern
software is developed using Agile methods. These methods promote continuous integration and frequent releases. Being able to automatically test and
update components in the cloud, under the supervision of the runtime, would
be a valuable feature. This could be accomplished, for example, by
creating a Maven plugin that interacts with the runtime.
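Such an integration could, for instance, hook the runtime into the Maven build lifecycle. The fragment below is purely hypothetical: no such plugin exists yet, and all coordinates, goals, and names are placeholders.

```xml
<!-- Hypothetical plugin coordinates: all names are placeholders. -->
<plugin>
  <groupId>ch.elca.runtime</groupId>
  <artifactId>runtime-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>deploy</phase>
      <goals>
        <!-- Push the freshly built component to the runtime for supervision. -->
        <goal>update-component</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```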
References
[1] Ali Arsanjani, Service-oriented modeling and architecture. IBM developerWorks, November 2004.
[2] Gartner Research, Cloud computing: defining and describing an emerging phenomenon. Gartner Research, June 2008.
[3] Markus Voelter, Best Practices for DSLs and Model-Driven Development. Journal of Object Technology, vol. 8, no. 6, September-October 2009, pp. 79-102. http://www.jot.fm/issues/issue_2009_09/column6/.
[4] Steven Kelly, Risto Pohjonen, Worst Practices For Domain-Specific Modeling. IEEE Software, vol. 26, no. 4, July-August 2009, pp. 22-29.
[5] Mellor, Clark, Futagami, Model-Driven Development. IEEE Software, vol. 20, no. 5, September-October 2003, pp. 14-18.
[6] Gamma, Helm, Johnson, Vlissides, Design Patterns: Elements of
Reusable Object-Oriented Software. Addison-Wesley, 1994.
[7] Peter Mell, Timothy Grance, The NIST Definition of Cloud Computing.
NIST Special Publication 800-145, September 2011.
[8] Ezra Zygmuntowicz, Cloud Foundry Open PaaS Deep Dive. http://blog.cloudfoundry.com/post/4754582920/cloud-foundry-open-paas-deep-dive, April 2011.