CS 4200 - FINAL YEAR PROJECT REPORT

SIDDHI-CEP

By Project Group - 04
Suhothayan S. (070474R)
Gajasinghe K.C.B. (070137M)
Loku Narangoda I.U. (070329E)
Chaturanga H.W.G.S (070062D)

Project Supervisors
Dr. Srinath Perera
Ms. Vishaka Nanayakkara

Coordinated By
Mr. Shantha Fernando

THIS REPORT IS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF BACHELOR OF SCIENCE OF ENGINEERING AT UNIVERSITY OF MORATUWA, SRI LANKA.

3rd of September 2011

Abstract

Project Title : Siddhi-CEP - High Performance Complex Event Processing Engine
Authors : Suhothayan Sriskandarajah - 070474R
          Kasun Gajasinghe - 070137M
          Isuru Udana Loku Narangoda - 070329E
          Subash Chaturanga - 070062D
Coordinator : Dr. Shantha Fernando
Supervisors : Dr. Srinath Perera
              Mrs. Vishaka Nanayakkara

Over the last half decade or so, Complex Event Processing (CEP) has been one of the most rapidly emerging fields. Due to the massive volume of business transactions and numerous new technologies like RFID (Radio Frequency Identification), it has become a real challenge to provide real-time, event-driven systems that can process data and handle high input data rates with near-zero latency.

The basic function of a complex event processor is to match queries against events and trigger a response. These queries describe the events that the system needs to search for within the input data streams. Unlike traditional systems such as relational database systems, which run hundreds of short-lived queries against stored, static data, event-driven systems run stored queries constantly against extremely dynamic data streams. In fact, an event processing system is an upside-down view of a database. The tasks of CEP are to identify meaningful patterns, relationships and data abstractions among unrelated events and to fire an immediate response.

Siddhi is an Apache-2.0 licensed Complex Event Processing engine.
It addresses some of the main concerns of the event processing world, where there is a clear need for an open-source engine able to process a flood of events that may go well over one hundred thousand events per second with near-zero latency. This required careful design of the generic concepts of a CEP engine. Siddhi was designed after a detailed literature review that examined each concept separately. The current Siddhi implementation provides an extensible, scalable framework that the open-source community can extend to match specific business needs.

Table of Figures

Figure 1 Maven Structure
Figure 2 Some CEP engines and their brief distributions with their time line
Figure 3 S4 architecture
Figure 4 Esper architecture
Figure 5 PADRES broker network
Figure 6 PADRES router architecture
Figure 7 Aurora Pipeline Architecture
Figure 8 Siddhi high-level architecture
Figure 9 Siddhi Class Diagram
Figure 10 Sequence Diagram
Figure 11 Siddhi Sequence Diagram
Figure 12 Siddhi Implementation View
Figure 13 Siddhi Process View
Figure 14 Siddhi Deployment View
Figure 15 Siddhi Use Case Diagram
Figure 16 Siddhi Event Tuple
Figure 17 Siddhi Pipeline Architecture
Figure 18 Map holding Executor Listeners
Figure 19 Simple Siddhi Query
Figure 20 Siddhi Query Form UI Implementation Used in OpenMRS
Figure 21 Siddhi Query with Simple Condition
Figure 22 Siddhi Time Window
Figure 23 Siddhi Time Window Query
Figure 24 Siddhi Batch Window
Figure 25 Siddhi Unique Query
Figure 26 Scrum Development Process
Figure 27 Test-Driven Development (TDD)
Figure 28 Call Tree for the Time Window in JProfiler
Figure 29 The Memory Usage of Siddhi for a Time Window query
Figure 30 Siddhi Benchmark
Figure 31 DocBook Documentation Configuration
Figure 32 Siddhi Web Site

Table of Graphs

Graph 1 Siddhi Vs Esper Simple Filter Comparison
Graph 2 Siddhi Vs Esper Average over Time Window Comparison
Graph 3 Siddhi Vs Esper State Machine Comparison

Table of Tables

Table 1 Comparison between Database Applications and Event-driven Applications
Table 2 Different Every Operator Cases

Table of Contents

Abstract
Table of Figures
Table of Graphs
Table of Tables
1. INTRODUCTION
  1.1. Complex Event processing
  1.2. Aims and Objectives
2. LITERATURE SURVEY
  2.1. Background
    2.1.1. What is Complex Event Processing?
    2.1.2. Why Complex Event Processing?
    2.1.3. CEP General Use Cases
  2.2. Terminology
  2.3. Tools & Technology studies
    2.3.1. Compiler Generators
      ANTLR
    2.3.2. Building and Project Management tools
      Apache Maven
      Apache ANT
    2.3.3. Version Control Systems
      Subversion
  2.4. CEP Implementation Related Study
    2.4.1. Some Well Known CEP Implementations
      S4 [3] [4]
      Esper/NEsper [7] [6]
      PADRES [6] [8]
      Intelligent Event Processor (IEP) [6] [10]
      Sopera [6] [11]
      Stream-based And Shared Event Processing (SASE) [12]
      Cayuga [6] [14] [15]
      Aurora and Borealis [6] [20] [21] [15]
      TelegraphCQ [6] [15] [26]
      STREAM [6] [33]
      PIPES
      BEA WebLogic [15]
      Coral8 [15]
      Progress Apama [15]
      StreamBase [15]
      Truviso [15] [41]
  2.5. Some Interesting Research Papers
    2.5.1. Event Stream Processing with Out-of-Order Data Arrival [42]
    2.5.2. Efficient Pattern Matching over Event Streams [43]
  2.6. What We Have Gained from the Literature Survey
3. SIDDHI DESIGN
  3.1. Siddhi Architecture
      Input Adapters
      Siddhi-core
      Output Adapters
      Compiler
      Pluggable UI
  3.2. 4+1 Model
    3.2.1. Logical View
      Class Diagram
      Sequence Diagram
    Implementation View
    Process View
    Deployment View
    Use Case View
      Use case 1.
      Use case 2.
      Use case 3.
  3.3. Major Design Components
    3.3.1. Event Tuples
    3.3.2. Pipeline architecture
    3.3.3. State Machine
      Sequence Queries
      Every Operator
      Design Decisions in Sequence Processor
      Pattern Queries
      Kleene Star Operator
      Design Decisions in Pattern Processor
    3.3.4. Processor Architecture
      Executors
      Event Generators
    3.3.5. Query Object Model
    3.3.6. Query parser
    3.3.7. Window
      Time Window
      Batch window
      Time Batch Window
      Length Batch Window
    3.3.8. “UNIQUE” Support
4. Implementation
  4.1. Process models
    4.1.1. Scrum Development Process
    4.1.2. Test-driven development (TDD)
  4.2. Version control
  4.3. Project management
  4.4. Coding Standards & Best Practices Guidelines for Siddhi
    4.4.1. General
    4.4.2. Java Specific
  4.5. Profiling
  4.6. Benchmark
  4.7. Documentation
  4.8. Web Site
  4.9. Bug tracker
  4.10. Distribution
5. Results
  5.1. Performance testing
6. Discussion and Conclusion
  6.1. Known issues
  6.2. Future work
    6.2.1. Incubating Siddhi at Apache
    6.2.2. Find out a query language for Siddhi
    6.2.3. Out of order event handling
  6.3. Siddhi Success Story
  6.4. Conclusion
Abbreviations
Bibliography
Appendix A

1. INTRODUCTION

1.1. Complex Event processing

Data processing is one of the key functions in computing. It refers to the process by which a computer program takes in data and analyzes it to convert it into usable information. Data is essentially a set of unorganized facts that can be converted into useful information. Analyzing, sorting and storing data are a few of the major tasks involved in data processing. Data processing is closely connected with Event-Driven Architecture (EDA).
EDA can be viewed as a subset of data processing in which a stream of data items (which can be called events) is processed. One might think that an event is just another piece of data that carries a time-stamp, but an event has a broader meaning. A useful way to state the relationship between data and events is “Data is derived from the Events”. Events are therefore different from data; rather, the representation of an event contains the data.

During the last half decade, Complex Event Processing (CEP) has been one of the most rapidly emerging fields in data processing. Due to the massive volume of business transactions and numerous new technologies like RFID (Radio Frequency Identification), it has become a real challenge to provide real-time, event-driven systems that can process data and handle high input data rates with near-zero latency (nearly real time).

The basic function of a complex event processor is to match queries against events and trigger a response immediately. These queries describe the events that the system needs to search for within the input data streams. Unlike traditional systems such as relational database management systems (RDBMS), which run hundreds of short-lived queries against stored, static data, event-driven systems run stored queries constantly against extremely dynamic data streams. In effect, an event processing system is an inverted database: the search queries are stored in the system and matched against incoming data. Hence, Complex Event Processing is used in systems such as data monitoring centers, financial services, Web analysis and many more, where extremely dynamic data is being generated.

In the abstract, the task of CEP is to identify meaningful patterns, relationships and data abstractions among unrelated events and to fire an immediate response, such as an alert message.
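The inversion described above, where queries are stored and data flows past them, can be illustrated with a short Java sketch. It is only an illustration under stated assumptions: the class names (StandingQueryDemo, StockEvent, StandingQuery) and the sample symbols and prices are hypothetical and are not taken from Siddhi's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only; every class and method name here is hypothetical.
final class StandingQueryDemo {

    /** Event object: a stock symbol and the price of one trade. */
    static final class StockEvent {
        final String symbol;
        final double price;

        StockEvent(String symbol, double price) {
            this.symbol = symbol;
            this.price = price;
        }
    }

    /** A stored query: a name plus the condition an event must satisfy. */
    static final class StandingQuery {
        final String name;
        final Predicate<StockEvent> condition;

        StandingQuery(String name, Predicate<StockEvent> condition) {
            this.name = name;
            this.condition = condition;
        }
    }

    private final List<StandingQuery> queries = new ArrayList<>();
    private final List<String> alerts = new ArrayList<>();

    /** Queries are registered once, before any data arrives. */
    void register(StandingQuery query) {
        queries.add(query);
    }

    /** Each incoming event is checked against every stored query as it arrives. */
    void onEvent(StockEvent event) {
        for (StandingQuery query : queries) {
            if (query.condition.test(event)) {
                alerts.add(query.name + ": " + event.symbol + " @ " + event.price);
            }
        }
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        StandingQueryDemo engine = new StandingQueryDemo();
        engine.register(new StandingQuery("ibmAbove100",
                e -> e.symbol.equals("IBM") && e.price > 100.0));

        engine.onEvent(new StockEvent("IBM", 99.0));   // does not satisfy the query
        engine.onEvent(new StockEvent("IBM", 101.5));  // satisfies the query
        engine.onEvent(new StockEvent("MSFT", 150.0)); // different symbol

        for (String alert : engine.alerts()) {
            System.out.println(alert);
        }
    }
}
```

Nothing is stored except the queries and the alerts they raise; the events themselves are examined once on arrival and then discarded, which is the opposite of the store-then-query model of a database.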
Examples of CEP applications:
- Searching a document for a specific keyword
- Radio Frequency Identification (RFID)
- Financial market transaction pattern analysis

1.2. Aims and Objectives

The main aim of our project is to implement a 100% open-source, high-performance complex event processing engine. There are several commercial and very few open-source CEP engines currently available. Most of them were implemented early in this decade and have since become stable. Because they were implemented some time ago, there is room to improve them, even at the architectural level. However, now that they are stable, there appears to be little tendency to further improve their underlying systems. Our main aim is therefore to identify those weaknesses and build a better open-source CEP implementation on a recent JDK.

The objectives are as follows:
- Carry out a literature survey, compare and contrast different implementations of event processing engines, and come up with an effective, computationally efficient architecture that can process any type of external event stream. The factors to look for are support for high-speed processing and low memory consumption.
- Implement the basic Complex Event Processing engine framework.
- Build the planned features for the engine on top of that framework.
- Profile the Java code base over several iterations and improve its efficiency. This ensures that the written code introduces little or no overhead.

2. LITERATURE SURVEY

2.1. Background

The following sections provide the background of this project. They describe what Complex Event Processing systems are, why there is a need for them, and their general use cases.

2.1.1. What is Complex Event Processing?

Event processing can be defined as a methodology that performs predefined operations on event objects, including analyzing, creating, reading, transforming or deleting them.
Generally, Complex Event Processing (CEP) can be defined as an emerging technology that creates actionable, situational knowledge from distributed message-based systems, databases and applications in real time or near real time. In other words, a CEP software implementation aggregates information from distributed systems in real time and applies rules to discern patterns and trends that would otherwise go unnoticed.

From another point of view, a CEP engine can be seen as a database turned upside-down: instead of storing the data and running queries against the stored data, the CEP engine stores the queries and runs the data through them as a stream.

Complex Event Processing is thus primarily an event processing concept. It deals with processing events from several event streams, identifying the meaningful events within the event domain, and firing when a match is found based on the rules provided. CEP uses techniques such as the detection of complex patterns of events, event hierarchies, and relationships between events such as causality and timing.

Table 1 Comparison between Database Applications and Event-driven Applications

                  Database Applications         Event-driven Applications
Query Paradigm    Ad-hoc queries or requests    Continuous standing queries
Latency           Seconds, hours, days          Milliseconds or less
Data Rate         Hundreds of events/sec        Tens of thousands of events/sec or more

2.1.2. Why Complex Event Processing?

The IT industry is growing more complex day by day, and the present state of the industry can be described as an event-driven era. In modern enterprise software systems, events are a very frequent commodity. Extracting what is relevant from what is not can therefore be a nightmare, especially when thousands of changes take place per second. Complex Event Processing is a new way to deal with applications in which many agents produce huge amounts of data per second and that data must be transformed into reliable information in a short period of time.
These applications produce massive amounts of data per second, more than traditional computing power (hardware and software) has the capacity to handle. Therefore, when massive numbers of incoming events must be processed in real time, the classical methodologies fail, and this is where CEP comes into play.

For example, say you have to trace increases in the stock prices of a particular set of companies. The traditional approach was to put all the trades in a database under a particular schema and, at the end of the day, go through the whole database to check whether there was an increase in stock price. Generally, there will be hundreds of such companies to track and thousands of buy and sell transactions per second on the stock market, so the database will be enormous. This is where the real need for a CEP engine arises: it gives real-time feedback based on the specified business rules while saving both time and space.

As explained, unlike traditional static data analysis methodologies, CEP engines are event-driven: the logic of the analysis is applied in advance, and each new event is processed as soon as it arrives, immediately updating all high-level information and triggering any rules that have been defined.

With CEP, businesses can map discrete events to expected outcomes and relate a series of events to key performance indicators (KPIs). In this way CEP gives businesses more insight into which events will have the greatest operational impact, helping them to seize opportunities and mitigate risks.

2.1.3. CEP General Use Cases

In the early days of Complex Event Processing, CEP systems were used for monitoring stock trading systems, and many still believe that this is the major use case of CEP. Today, however, there are many other interesting applications of CEP, especially across the IT industry, financial markets and manufacturing organizations.
They are as follows:
- Cleansing and validation of data: CEP engines can recognize patterns in the data and filter out anomalous events that fall outside recognized parameters.
- Alerts and notifications: CEP engines can monitor event streams, detect patterns, and send notifications by hooking into email servers, posting messages to web services and so on (for example, a real-time business system should be able to send notifications when problems occur).
- Decision making systems: CEP engines are used in automated business decision making systems that take current conditions into their knowledge base.
- Feed handling: Most CEP platforms come with many built-in feed handlers for common market data formats.
- Data standardization: CEP engines are capable of standardizing data about the same entity from different sources within a common reference schema.

2.2. Terminology

This section covers a small set of basic terms related to event processing. However far event processing technologies have progressed, there is still no standardized terminology, even for basic terms such as ‘event’, ‘event stream’ and ‘event processing’. The definitions vary slightly depending on the implementation and the product. There is an ongoing effort to standardize this [1]. The definitions here give a generic idea of the terms and show the distinctions between different implementations.

The most heavily used basic term in Complex Event Processing is ‘event’, and it is also the term that is most often misused. An ‘event’ can be defined as anything that happens, or is contemplated as happening. However, the term is mostly used for the representation of an event. The authors of the Event Processing Glossary [1] have resolved this by defining two separate terms, ‘event’ and ‘event object’. The term ‘event object’ refers to the representation of a particular event. This can be a tuple, vector, or row implementation. Events are of two kinds: simple events and composite (complex) events.
A simple event refers to a single occurrence. A complex event is an aggregation of several events. To illustrate, take the example of stock-market transactions. Each purchase or sale of stock is a simple event. The simple event object may consist of the stock symbol, the buying or selling price, and the volume. However, if we record all the purchases and sales of the stock of a specific company and return them as one event, that event can be considered a complex event. In this case, it consists of a number of simple events [2].

The event processing engine receives events through an event stream. An event stream is a linearly ordered sequence of events, ordered by time of arrival. The usage of event streams varies depending on the implementation. Some implementations allow a single stream to carry events of different types, while others restrict each stream to a predefined type. Siddhi, for example, restricts the event type of a given stream. The user can create different streams to send events of different types. This keeps the implementation clear and less confusing.

The processing of events goes by a variety of names. Mostly it is called ‘complex event processing’ (CEP) or ‘event stream processing’ (ESP). CEP is the name we have used for Siddhi throughout this report.

A ‘query’ is the basic means of specifying the rules or patterns to match in the incoming stream of events. It specifies the event streams needed, the operations to be performed on those events, and how to output the outcome of the operations. The output is generally also an event. It can mostly be regarded as a composite event if aggregator functions are specified in the query.

The query language follows a SQL-like structure. The processing of a query differs from SQL, but the language has many similarities.
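The terms defined above (simple event object, typed event stream, complex event) can be made concrete with a small Java sketch based on the stock-market example. All names in it (EventStreamDemo, Trade, TradeSummary, EventStream, summarize) are hypothetical and chosen for illustration; they are not Siddhi classes, and the simple mean used for the average price is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; the names below are hypothetical and not Siddhi's API.
final class EventStreamDemo {

    /** Simple event object: one purchase or sale of a stock. */
    static final class Trade {
        final String symbol;
        final double price;
        final int volume;

        Trade(String symbol, double price, int volume) {
            this.symbol = symbol;
            this.price = price;
            this.volume = volume;
        }
    }

    /** Complex (composite) event: an aggregation of several simple events. */
    static final class TradeSummary {
        final String symbol;
        final int tradeCount;
        final long totalVolume;
        final double averagePrice;

        TradeSummary(String symbol, int tradeCount, long totalVolume, double averagePrice) {
            this.symbol = symbol;
            this.tradeCount = tradeCount;
            this.totalVolume = totalVolume;
            this.averagePrice = averagePrice;
        }
    }

    /** A stream restricted to one event type, kept in order of arrival. */
    static final class EventStream<T> {
        private final List<T> events = new ArrayList<>();

        void send(T event) {
            events.add(event);
        }

        List<T> inArrivalOrder() {
            return Collections.unmodifiableList(events);
        }
    }

    /** Folds all trades of one company into a single complex event. */
    static TradeSummary summarize(EventStream<Trade> stream, String symbol) {
        int count = 0;
        long volume = 0;
        double priceSum = 0.0;
        for (Trade trade : stream.inArrivalOrder()) {
            if (trade.symbol.equals(symbol)) {
                count++;
                volume += trade.volume;
                priceSum += trade.price;
            }
        }
        // Avoid dividing by zero when no trade matched the symbol.
        double average = count == 0 ? 0.0 : priceSum / count;
        return new TradeSummary(symbol, count, volume, average);
    }

    public static void main(String[] args) {
        EventStream<Trade> trades = new EventStream<>();
        trades.send(new Trade("IBM", 100.0, 10));
        trades.send(new Trade("MSFT", 50.0, 5));
        trades.send(new Trade("IBM", 110.0, 20));

        TradeSummary summary = summarize(trades, "IBM");
        System.out.println(summary.symbol + ": " + summary.tradeCount + " trades, volume "
                + summary.totalVolume + ", average price " + summary.averagePrice);
    }
}
```

Using a generic type parameter for the stream mirrors the choice described above of restricting each stream to one event type: an event of the wrong type is rejected when the code is compiled rather than when it is processed.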
For example, the SELECT, FROM, WHERE, HAVING and GROUP BY clauses each mean the same thing as in SQL, though they are processed differently. Different CEP implementations use different query languages, and there is no standard for them. These query languages extend SQL with the ability to process real-time data streams. In SQL, we send a query to be performed on the data rows stored in a database table. Here, the queries are fed to the system beforehand, and the real-time streams of events are passed through these queries, which perform the operations specified. A query fires a new event when a match for its rule/pattern occurs.

2.3. Tools & Technology studies
The following sections describe the tools and technologies we have used to provide the basic infrastructure for developing Siddhi. They cover build management tools, the version control system, the compiler generator, etc.

2.3.1. Compiler Generators
As a part of our literature survey we looked at compiler generators. We have to construct a compiler which generates the query object model from a query. One of the popular tools which can be used to construct a compiler is ANTLR.

ANTLR
ANTLR, which stands for “Another Tool for Language Recognition”, is a tool which can be used to create compilers, recognizers and interpreters. ANTLR provides a framework with strong support for tree construction, translation, error reporting and recovery. ANTLR provides a single syntax notation for specifying both the lexer and the parser, which makes it easy for users to specify their rules. It also has a graphical grammar editor and debugger called ANTLRWorks, which makes it even easier to use. ANTLR uses Extended Backus-Naur Form (EBNF) grammars and supports many target programming languages such as Java, C, C++, C#, Ruby and Python. The parser generated by ANTLR is an LL(*) parser, which can use arbitrary lookahead.
Since ANTLR generates top-down parsers, it uses syntactic predicates to resolve ambiguities, such as those that would otherwise require left factoring, which plain top-down parsers cannot handle.

2.3.2. Building and Project Management tools

Apache Maven
Apache Maven is a project management tool, developed to make the build process much easier. Maven was initially created to manage the complex build process of the Jakarta Turbine project. Maven is evolving rapidly, and though its newest version is 3, version 2 is still widely used.

Features of Maven
I. Maven understands how a project is typically built.
II. Maven makes use of its built-in project knowledge to simplify and facilitate project builds.
III. Maven prescribes and enforces a proven dependency management system that is in tune with today's globalized and connected project teams.
IV. Maven is completely flexible for power users; the built-in models can be overridden and adapted declaratively for specific application scenarios.
V. Maven is fully extensible for scenario details not yet covered by existing behaviors.
VI. Maven is continuously improved by capturing any newfound best practices and identified commonality between user communities and making them a part of Maven's built-in project knowledge.
VII. Maven can be used to create IDE project files from build files, using the following commands:
a. Creating an Eclipse artifact for any source containing a build script: mvn eclipse:eclipse
b. Creating an IntelliJ IDEA artifact: mvn idea:idea

Figure 1 Maven Structure

Project object model (POM): The POM is a model for Maven 2 which is partially built into the Maven main engine. pom.xml, an XML-based metadata file, is the build file which holds the declarations of the components.
Dependency management model: Dependency management is a key part of Maven.
Maven's dependency management can be adapted to most requirements, and its model is built into Maven 2. This model is a proven, workable and productive model currently deployed by major open source projects.
Build life cycle and phases: These are the interfaces between Maven's built-in model and the plug-ins. The default lifecycle has the following build phases.
validate - validate the project is correct and all necessary information is available
compile - compile the source code of the project
test - test the compiled source code using a suitable unit testing framework. These tests should not require the code be packaged or deployed
package - take the compiled code and package it in its distributable format, such as a JAR
integration-test - process and deploy the package if necessary into an environment where integration tests can be run
verify - run any checks to verify the package is valid and meets quality criteria
install - install the package into the local repository, for use as a dependency in other projects locally
deploy - done in an integration or release environment, copies the final package to the remote repository for sharing with other developers and projects
Plug-ins: Most of the effective work of Maven is performed using Maven plug-ins.
Following is a part of a pom.xml file. Here we have used the p2 feature plug-in to generate a p2 feature.

<plugins>
  <plugin>
    <groupId>org.wso2.maven</groupId>
    <artifactId>carbon-p2-plugin</artifactId>
    <version>1.1</version>
    <executions>
      <execution>
        <id>p2-feature-generation</id>
        <phase>package</phase>
        <goals>
          <goal>p2-feature-gen</goal>
        </goals>
      </execution>
    </executions>
  </plugin>
</plugins>

Apache ANT
Apache ANT is another popular build tool. The acronym ANT stands for “Another Neat Tool”. It is similar to other build tools like make and nmake. The major difference compared to those tools is that ANT is written in Java, so ANT is well suited to Java projects.
ANT uses XML to specify the build structure. Apache ANT provides a rich set of operations that we can use to write build scripts. ANT is widely used in the industry as a general-purpose build tool for Java projects. ANT targets can be invoked by simple commands. To run an ANT target called foo, one just types ‘ant foo’ at the command prompt.

2.3.3. Version Control Systems
Most open source projects are not developed by a single developer; usually they are a team effort. Therefore there should be a way to manage the source code. That is the task of a version control system, which manages files and directories over time.

Subversion
Subversion is a version control system distributed under an Apache/BSD-style open source license. It is a replacement for the CVS version control system. Subversion can be used by people on different computers, and everyone can modify the source code of a project at the same time. If someone has done something incorrectly, we can simply undo those changes by looking into the project history. Subversion has an official API (which CVS does not). Subversion is written in C as a set of libraries, but it has language bindings for many programming languages, which makes it a very extensible version control system. JavaHL is the Java language binding of Subversion. Though the default UI of Subversion is a command line interface, many third-party tools have been developed to provide better user interfaces for different environments. For Windows there is a client called TortoiseSVN. Subclipse and Subversive are two plug-ins for the Eclipse IDE.

2.4. CEP Implementation Related Study
The following are some of the projects which have made significant efforts in the same research area. Some of these have not been completed, and some other projects have not yet released their CEP engine.
This shows that even though CEP is a relatively old concept, significant work is still going on, and no single CEP engine has taken over this market. The following figure shows some CEP engines, with brief descriptions, along a time line.

Figure 2 Some CEP engines and their brief distributions with their time line

2.4.1. Some Well Known CEP Implementations
Let’s look at some of the well known Complex Event Processing engines in the market, with their features, advantages and disadvantages.

S4 [3] [4]
S4 was created and released by Yahoo! It is a framework for "processing continuous, unbounded streams of data." The framework allows for massively distributed computation over data that is constantly changing. It was initially developed to personalize search advertising products at Yahoo!, and Yahoo! has now released it under the Apache License v2. The architecture of S4 resembles the Actors model [5], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. This design choice also makes it relatively easy to reason about correctness due to the general absence of side-effects. S4 was designed for big data, with the ability to mine information from continuous data streams using user-defined operators. Though the S4 design shares many attributes with IBM’s Stream Processing Core (SPC) middleware [6], architecturally S4 has some differences. S4 is believed to achieve a greater level of simplicity due to the symmetry of its design, where all nodes in the cluster are identical and there is no centralized control. This is accomplished by leveraging ZooKeeper [3], which is a simple and elegant cluster management service that can be shared by many systems in a data center. The disadvantage of S4 is that it allows lossy failovers.
Upon a server failure, processes are automatically moved to a standby server. Since the state of the processes at the time of failure is held in local memory, data can be lost during this handoff. The state is regenerated using the input streams. Downstream systems must degrade gracefully. Further, nodes cannot be added to or removed from a running cluster.

Figure 3 S4 architecture

Esper/NEsper [7] [6]
EsperTech [7] brings Complex Event Processing (CEP) to the mainstream with an open source approach, ensuring rapid innovation with quality productization, support and services for mission critical environments, from SOA to eXtreme Transaction Processing deployments. Esper runs on a Java 5 or Java 6 JVM and is fully embeddable. A tailored Event Processing Language (EPL) allows registering queries in the engine, using Java objects (POJO, JavaBeans) to represent events. A listener class, which is basically also a POJO, will then be called by the engine when the EPL condition is matched as events come in. The EPL allows expressing complex matching conditions that include temporal windows and joining different event streams, as well as filtering and sorting them. [7]

Figure 4 Esper architecture

The internals of Esper are made up of fairly complex algorithms, primarily relying on state machines and delta networks in which only changes to data are communicated across object boundaries when required. Esper is available under the GNU General Public License (GPL) v2. Esper and NEsper are embeddable components written in Java and C# respectively. They are not servers by themselves but are designed to hook into any sort of server, and are therefore suitable for integration into any Java process or .NET-based process, including J2EE application servers or standalone Java applications. Esper has a pull query API. Events in Esper allow object representation and dynamic typing.
Esper features a Statement Object Model API, which is a set of classes to directly construct, manipulate or interrogate EPL statements. Esper also has a commercial version; the disadvantage of Esper is that its free version does not contain a GUI management, editor or portal application. Esper also does not currently have a server. Esper provides only a small number of key input and output adapters through EsperIO, and provides an adapter framework.

PADRES [6] [8]
PADRES (Publish/Subscribe Applied to Distributed Resource Scheduling) is developed by the Middleware Systems Research Group (MSRG) at the University of Toronto. It is an enterprise-grade event management infrastructure designed for large-scale event management applications. Ongoing research seeks to add and improve enterprise-grade qualities of the middleware. A publish/subscribe middleware [9] provides many benefits to enterprise applications. Content-based interaction simplifies IT development and maintenance by decoupling enterprise components. The expressive PADRES subscription language supports sophisticated interactions among components, and allows fine-grained queries and event management functions. Furthermore, scalability is achieved with in-network filtering and processing capabilities.

Figure 5 PADRES broker network
Figure 6 PADRES router architecture

Intelligent Event Processor (IEP) [6] [10]
Intelligent Event Processor (IEP) is a product of CollabNet, Inc. It is an open source Complex Event Processing (CEP) engine. IEP is a JBI Service Engine and is a part of the Open ESB community. OpenESB is an open source project with the goal of building a world-class Enterprise Service Bus. An ESB provides a flexible and extensible platform on which to build SOA and Application Integration solutions.

Sopera [6] [11]
SOPERA is a complete and proven SOA platform, which is rigorously oriented to practical requirements.
Companies and organizations benefit from the SOA know-how integrated in SOPERA when implementing sophisticated SOA strategies. SOPERA has the ability to predict the failure of a business process by monitoring event patterns. The patterns SOPERA detects are schema based: when it discovers a certain schema of events that leads to a failure of the business process, and all of the events of the pattern have occurred within the time window, it fires a new complex event that alerts the staff in advance about a future process failure. This provides the ability to react proactively.

Stream-based And Shared Event Processing (SASE) [12]
The goal of the SASE research project, conducted by UC Berkeley and the University of Massachusetts Amherst, is to design and develop an efficient, robust RFID stream processing system that addresses the challenges in emerging RFID deployments, including the data-information mismatch, incomplete and noisy data, and high data volume, and that enables real-time tracking and monitoring. The paper [13] presented on SASE gives insight into the different algorithms used in its efficient state machine implementation. SASE extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. SASE supports high-volume streams, extracts events from large windows which may span up to 12 hours, and includes flexible use of negation in sequences, parameterized predicates, and sliding windows. This approach is based on a new abstraction of CEP, i.e., a dataflow paradigm with native sequence operators at the bottom, pipelining query-defined sequences to subsequent relational-style operators.
The SASE language supports not only basic constructs such as sequence and negation that existing event languages have, but also offers flexible use of negation in event sequences, adds parameterized predicates for correlating events via value-based constraints, includes sliding windows for imposing additional temporal constraints, and resolves the semantic subtlety of negation when used together with sliding windows. Unlike previous work that focuses on complex event “detection” (i.e., only reporting that an event query is satisfied but not how), SASE explicitly reports which events were used to match the query. This significantly increases the complexity of query processing. The SASE approach employs an abstraction of complex event processing that is a dataflow (query-defined event sequence) paradigm with pipelined operators, as in relational query processing. As such, it provides flexibility in query execution, ample opportunities for optimization, and extensibility as the event language evolves. The paper [13] provides a comparison between SASE and a relational stream processor, TelegraphCQ (TCQ) [12], developed at the University of California, Berkeley. TCQ uses an n-way join to handle an equivalence test over an event sequence. This incurs high overhead when the sequence is long. Moreover, TCQ only considers equality comparisons in joins. Therefore, temporal constraints for sequencing, e.g., “s.time > r.time”, are evaluated only after the join. In contrast, SASE uses an NFA to naturally capture the sequencing of events, and the PAIS algorithm to handle the equivalence test during NFA execution, yielding much better scalability. SASE also has some limitations. It does not handle a hierarchy of complex event types: the output of one query cannot be used as an input to another.
SASE also assumes a total ordering of events. A known issue with this assumption arises because a composite event usually obtains its timestamp from one of its primitive events; when such composite events are mixed with primitive events to detect more complex events, the assumption of a total order on all events no longer holds. Further, the SASE language can be extended to support aggregates such as count() and avg(), but these have not yet been implemented.

Cayuga [6] [14] [15]
This project is part of a 2007 AFRL/IF/AFOSR Minigrant titled “User-Centric Personalized Extensibility for Data-Driven Web Applications,” by James Nagy (AFRL/IFED) [16]. This minigrant focuses on Cayuga as a stateful publish/subscribe system for use in a graphical programming model (also being developed at Cornell) known as Hilda. An overview of both systems can be found in the Minigrant Proposal. Researchers at Cornell describe Cayuga as a general-purpose complex event processing system [17]. The system can be used to detect event patterns in event streams. The Cayuga system is designed to leverage traditional publish/subscribe techniques to allow for high scalability [18]. This leads to comparisons not only with other data stream management systems, but also with publish/subscribe systems, to demonstrate the applications and capabilities of Cayuga. The Cayuga system architecture is designed to efficiently support a large number of concurrent subscriptions. Its core components include a query processing engine, an index component, a metadata manager, and a memory manager. One of the most novel components of Cayuga is the implementation of the processing engine, which utilizes a variation of nondeterministic finite automata [18]. The automata in Cayuga are a generalization of the standard nondeterministic finite automata model: they read relational streams instead of a finite input alphabet, and their state transitions are performed using predicates.
The use of automata allows input data to be stored, so that new inputs can be compared against previously encountered events. Cayuga requires users to specify their interests in the structured Cayuga Event Language (CEL). Not every Cayuga query can be implemented by a single automaton. In order to process arbitrary queries, Cayuga supports resubscription. This is similar to pipelining: the output stream from one query is used as the input stream to another query. Because of resubscription, query output must be produced in real time. Since each tuple output by a query has the same detection time as the last input event that contributed to it, its processing (by resubscription) must take place in the same epoch in which that event arrived. This motivates the Cayuga Priority Queue. The only form of query optimization performed by the engine is to merge manifestly equivalent states; events having the same time stamps are processed together and then updated in the automata’s Internal String Table (a weak-referenced hash table). There is also research on a distributed implementation of Cayuga, known as FingerLakes [19].

Aurora and Borealis [6] [20] [21] [15]
The primary goal of the Aurora project [20] is to build a single infrastructure that can efficiently and seamlessly meet the requirements of demanding real-time streaming applications. This project has been superseded by the Borealis [21] project. Both Aurora and Borealis are described as general-purpose data stream management systems [22] in the papers published by their creators at Brandeis University, Brown University, and the Massachusetts Institute of Technology. The goal of the systems is to support various real-time monitoring applications. The overall system architecture of Aurora and Borealis is based on “boxes-and-arrows” process- and work-flow systems [23]. Data flows through the system as tuples, along pathways, which are the arrows in the model.
The data is processed at operators, which are the boxes. After the last processing component, the tuples are delivered to an application for processing [22].

Figure 7 Aurora Pipeline Architecture

Three types of graphs are used to monitor the Aurora and Borealis systems: latency graphs, value-based graphs, and loss-tolerance graphs. By monitoring these graphs, the systems can carry out several optimizations to decrease system stress. The primary optimizations are inserting processing boxes, moving processing boxes, combining two boxes into a single, larger box, reordering boxes, and load shedding [22]. Load shedding, one of the most important optimizations introduced in these systems, means that the number of tuples presented for processing is reduced in order to end an overloaded state. In Aurora and Borealis, load shedding opts to drop the tuples going to systems that are more tolerant of lost and missing data. Borealis, the second-generation system developed by Brandeis, Brown, and MIT [24], improves on and integrates the stream processing functionality of Aurora, and takes its distribution techniques from a project known as Medusa [25]. It should also be noted that the Aurora team has now commercialized the Aurora project through StreamBase [23].

TelegraphCQ [6] [15] [26]
TelegraphCQ was developed by the University of California at Berkeley, and was designed to provide event processing capabilities alongside relational database management capabilities by building on PostgreSQL [27]. Since PostgreSQL is an open source database, they have modified its existing architecture to allow continuous queries over streaming data [28].
TelegraphCQ focuses on issues such as scheduling and resource management for groups of queries, support for out-of-core data, variable adaptivity, dynamic QoS support, and parallel cluster-based processing and distribution. It also allows multiple simultaneous notions of time, such as logical sequence numbers or physical time. TelegraphCQ uses different types of windows, which impose different requirements on the query processor and its underlying storage manager [27]. One fundamental issue in TelegraphCQ concerns the use of logical (i.e., tuple sequence number) vs. physical (i.e., wall clock) timestamps. If the former are used, the memory requirements of a window can be known a priori, while in the latter case memory requirements depend on fluctuations in the data arrival rate. Another issue related to memory requirements concerns the type of window used in the query. Consider the execution of a MAX aggregate over a stream. For a landmark window, it is possible to compute the answer iteratively by simply comparing the current maximum to the newest element as the window expands. For a sliding window, on the other hand, computing the maximum requires maintaining the entire window. Further, the direction of movement and the “hop” size of the windows (the distance between consecutive windows, as defined by the for loop) also have a significant impact on query execution. For instance, if the hop size exceeds the size of the window itself, then some portions of the stream are never involved in the processing of the query.
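The window behaviour described above can be illustrated with a minimal Java sketch. It is not TelegraphCQ code; the class and method names (WindowMaxSketch, landmarkMax, slidingMax, coveredByHoppingWindows) are hypothetical and windows are counted in tuples (logical time) for simplicity. The landmark MAX keeps a single running value, the sliding MAX has to retain the window contents, and the last method marks which stream positions are ever covered when the hop size is larger than the window.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Illustrative sketch only; names are assumptions, not part of any CEP engine.
final class WindowMaxSketch {

    /** Landmark window: the window only grows, so one running maximum is enough state. */
    static long[] landmarkMax(long[] stream) {
        long[] out = new long[stream.length];
        long currentMax = Long.MIN_VALUE;
        for (int i = 0; i < stream.length; i++) {
            currentMax = Math.max(currentMax, stream[i]);
            out[i] = currentMax;
        }
        return out;
    }

    /** Sliding window over the last `size` tuples: the window contents must be retained. */
    static long[] slidingMax(long[] stream, int size) {
        long[] out = new long[stream.length];
        Deque<Long> window = new ArrayDeque<>();
        for (int i = 0; i < stream.length; i++) {
            window.addLast(stream[i]);
            if (window.size() > size) {
                window.removeFirst();          // the oldest tuple expires
            }
            long max = Long.MIN_VALUE;
            for (long v : window) {            // recompute over the retained window
                max = Math.max(max, v);
            }
            out[i] = max;
        }
        return out;
    }

    /** Marks the stream positions that fall inside at least one hopping window. */
    static boolean[] coveredByHoppingWindows(int streamLength, int size, int hop) {
        if (hop <= 0) {
            throw new IllegalArgumentException("hop must be positive");
        }
        boolean[] covered = new boolean[streamLength];
        for (int start = 0; start < streamLength; start += hop) {
            int end = Math.min(start + size, streamLength);
            for (int i = start; i < end; i++) {
                covered[i] = true;
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        long[] prices = {5, 3, 9, 2, 1, 4};
        System.out.println(Arrays.toString(landmarkMax(prices)));
        System.out.println(Arrays.toString(slidingMax(prices, 3)));
        // Window size 2 with hop 3: positions 2 and 5 are never processed.
        System.out.println(Arrays.toString(coveredByHoppingWindows(6, 2, 3)));
    }
}
```

The sliding version recomputes the maximum over the stored window on every arrival, which keeps the sketch close to the text; a production engine would more likely use an incremental structure, but it would still need to hold window state that the landmark case avoids.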
There are several significant open problems in TelegraphCQ with respect to the complexity and quality of routing policies: understanding how ticket-based schemes perform under a variety of workloads and how they compare to (NP-hard) optimal schedule computations, modifying such schemes to adjust the priority of individual queries, and evaluating the feasibility (in terms of computational complexity and quality) of more sophisticated schemes. Routing decisions could consume significant portions of overall execution time. For this reason, two techniques play a key role in TelegraphCQ: batching tuples, by dynamically adjusting the frequency of routing decisions in order to reduce per-tuple costs, and fixing operators, by adapting the number and order of operators scheduled with each decision to reduce per-operator costs. Since TelegraphCQ was designed with a storage subsystem that exploits the sequential write workload, with a broadcast-disk style of read behaviour, queries accessing data that spans memory and disk also raise significant Quality of Service issues in terms of deciding what work to drop when the system is in danger of falling behind the incoming data stream. Currently the developers of TelegraphCQ are extending the Flux module of TelegraphCQ to serve as the basis of the cluster-based implementation. There has also been a spate of work on sharing work across queries, which is related to the problem of multi-query optimization originally posed by Sellis et al. [29] and also addressed by the group at IIT-Bombay [30] [31] [32]. It should also be noted that though TelegraphCQ is licensed under the BSD license, there is also a commercialized version of TelegraphCQ named the Truviso event processing system.

STREAM [6] [33]
STREAM is a Distributed Data Stream Management System, produced at Stanford University [33]. The goal of STREAM is to be able to consider both structured data streams and stored data together.
Queries over data streams are issued declaratively, but are then translated into flexible physical query plans. The STREAM system includes adaptive approaches to processing and load shedding, provides approximate answers, and also manipulates query plans during execution. In STREAM, queries are independent units that logically generate separate plans, but those plans are then combined by the system and ultimately result in an Aurora-like mega plan. One of the notable features of the STREAM system is its subscription language, known as the Continuous Query Language (CQL). CQL features two layers: an abstract semantics layer and an implementation of the abstract semantics. The implementation uses SQL to express relational operations and adds extensions for stream-related operations. Currently STREAM has several limitations, such as in merging subexpressions with different window sizes, sampling rates, or filters. This is because it handles resource sharing and approximation separately. The number of tuples in a shared queue at any time depends on the rate at which tuples are added to the queue and on the rate at which the slowest parent operator consumes them. When queries with common subexpressions produce parent operators that consume tuples at different rates, it is preferable not to use a shared subplan, a case which STREAM does not currently handle. STREAM was released under the BSD license, and according to the STREAM homepage the project has now officially wound down [33]. It is now used as the base for the Coral8 event processing engine.

PIPES
PIPES [34] is developed by the University of Marburg. It is a flexible and extensible infrastructure providing fundamental building blocks for implementing a data stream management system (DSMS). PIPES covers the functionality of the Continuous Query Language (CQL). The first and last public release of PIPES was made in 2004 under the GNU Lesser General Public License.
BEA WebLogic [15]
BEA Systems developed the WebLogic Real Time and WebLogic Event Server systems, which focus on enterprise-level system architectures and service integration. BEA WebLogic provides a complete event processing and event-driven service-oriented architecture infrastructure that supports high-volume, real-time, event-driven applications. It is one of the few commercial offerings of a complete, integrated solution for event processing and service-oriented architectures. The BEA WebLogic system includes a series of Eclipse-based developer tools for easy development and some administration tools for monitoring throughput, latency, and other statistics. Since BEA was acquired by Oracle Corporation, Oracle has released some non-programmatic interfaces that allow all interested parties to configure queries and rules for processing event data.

Coral8 [15]
The Coral8 event processing tool is designed to process multiple data streams and handle heterogeneous stream data. Coral8 is capable of operations that require filtering, aggregation, correlation (including correlation across streams), pattern matching, and other complex operations in near-real-time [35] [36]. The Coral8 Engine is composed of two tools, the Coral8 Server and the Coral8 Studio, and Coral8 also comes with a Software Development Kit (SDK) to perform further optimizations. The Coral8 Server is the heart of Coral8 [35] and provides clustering support [37]. The Coral8 Server also publishes a status data stream that can be used to monitor the performance and activity of the server, and provides Simple Network Management Protocol (SNMP) support for use by management consoles and monitoring frameworks. Coral8 Studio provides an IDE-like interface which allows administrators to add and remove queries and input and output data streams.
It uses a subscription language called Continuous Computation Language (CCL) to manage queries.

Progress Apama [15]
The Progress Apama event stream processing platform [38] consists of several tools, including an event processing engine, data stream management tools, event visualization tools, adapters for converting external events into internal events, and some development tools. The Apama technology was tested at the Air Force Research Laboratory by Robert Farrell (AFRL/IFSA) [15] to test Apama's marketing claims relating to throughput and latency. The results showed that Apama could process events at rates measured in thousands of events per second [38].

StreamBase [15]
The StreamBase event processing engine was developed based on research from the Massachusetts Institute of Technology, Brown University, and Brandeis University. It is an improved version of the Aurora project [39]. StreamBase provides a Server and a Studio module [40]. The Server module is designed to scale from a single CPU to a distributed system. The Studio is Eclipse-based and provides not only graphical (drag-and-drop) creation of queries, but also text-based editing of the queries, which use StreamSQL.

Truviso [15] [41]
Truviso is a commercial event processing engine based on the TelegraphCQ project of UC Berkeley. The most important feature of Truviso is that it supports fully functional SQL alongside a stream processing engine. Truviso queries are simply standard SQL with extensions that add functionality for time windows and event processing. In addition, Truviso's integrated relational database allows for easy caching, persistence, and archival of data streams, for queries that include not only real-time data but also historical data.

2.5. Some Interesting Research Papers
Siddhi has taken inspiration from some valuable research papers when designing its architecture.
We read several papers and then chose the algorithms that were most suitable for us and performed better.

2.5.1. Event Stream Processing with Out-of-Order Data Arrival [42]
This paper provides an in-depth architectural and algorithmic view of managing the arrival of out-of-order data (with respect to timestamp). This matters because a CEP engine can lose accuracy in real time due to network traffic and other factors. The system presented is very similar to SASE [12] in that it also uses stacks to handle event arrivals. The authors provide out-of-order handling as a feature: all stacks are enabled at the beginning to maintain a clock, which is used to check for out-of-order event arrivals based on their timestamps. The paper also provides an algorithm to handle the out-of-order events that are identified. This will be useful only for projects with a design similar to SASE.

2.5.2. Efficient Pattern Matching over Event Streams [43]
This paper focuses on richer query languages for efficient processing over streams. Its query evaluation framework is based on three principles: first, the evaluation framework should be sufficient for the full set of pattern queries. Second, given such full support, it should be computationally efficient. Third, it should allow optimization in a principled way. To achieve this, the authors tested a formal evaluation model, NFAb, which combines a finite automaton with a match buffer. When comparing SQL-TS, Cayuga, SASE+ and CEDR, they found SASE+ to be much richer and more useful.

2.6. What We Have Gained from the Literature Survey
The literature survey greatly helped us to understand the different implementations of present CEP engines and to learn their pros and cons. Through this survey we were able to come up with the best architecture for Siddhi-CEP by understanding how other CEP engines are implemented.
From the literature review we found that pipelines would be the most appropriate model for event passing, and we therefore decided to build our core on a producer-consumer architecture with an Aurora-like structure to obtain high performance [44]. We also found an interesting paper [45] to help us implement Query Plan Management, which will not only improve efficiency but also allow approximation of data in stream management.

3. SIDDHI DESIGN

This chapter discusses all the major sections of our project in terms of system architecture and design. We discuss the basic system architecture in detail using various diagrams such as architecture diagrams, use case diagrams and class diagrams. This chapter also discusses the system design considerations that we thought about before starting the implementation phase. Since the significant factor that makes this project different from other projects is performance, this chapter concentrates on how we achieved performance through our system design and on the problems we faced.

3.1. Siddhi Architecture

Figure 8 Siddhi high-level architecture

Input Adapters

Siddhi receives events through input adapters. The major task of the input adapters is to provide an interface for event sources to send events to Siddhi. Siddhi has several types of input adapters, each accepting a different type of event. For instance, Siddhi accepts XML events, POJO events, JSON events, etc., and converts all of those different types of events into a single data structure for internal processing. This data structure is simply a tuple.

Siddhi-core

Siddhi-core is the heart of the Siddhi complex event processing engine. All the processing is done in this component. The core consists of many sub-components such as Processors and Event Queues.
As indicated in the diagram, input events are placed on input queues; the processors then fetch the events and process them, and after processing those events are placed into the output queues. More details on Siddhi-core are given under “Detailed System Design”.

Output Adapters

Detected complex events are notified to the appropriate clients through output adapters.

Compiler

This component takes the user queries, compiles them and builds the object model according to the query. Mainly, the compiler validates the user-entered query and, if it is valid, creates the intermediate query object model that the Siddhi engine/core can understand.

Pluggable UI

The pluggable UI can be used to display useful statistics and to perform monitoring tasks. It is not an essential component for processing events.

3.2. 4+1 Model

3.2.1. Logical View

Class Diagram

The class diagram of Siddhi is shown in the figure.

Figure 9 Siddhi Class Diagram

Siddhi Manager:
o This is the main manager class, which manages Queries, Input Adapters and Output Adapters accordingly.

Input Adapter & Output Adapter:
o The Input Adapter and Output Adapter are provided as interfaces for Events, since they are customizable by the clients in order to handle various types of input and output Events (XML, JSON, etc.).

Event & Event Streams:
o Events are serializable, and the Event structure depends on the Event Stream from which it is generated.

Dispatcher, Processor & Event Queue:
o Dispatchers delegate Event tasks to each query by passing Events to the various Event Queues, and the Processors do the processing by fetching the Events from the corresponding Event Queues.

Sequence Diagram

This shows how the Queries, Input Adapters and Output Adapters are assigned, and how the Siddhi Manager invokes the appropriate Event Dispatchers to pass the Events to the corresponding Query and process them.
Receiving Events from the client, processing them, and firing the right output Event are handled concurrently, as depicted in figure 3. Note: Without loss of generality, two QueryProcessors and two SpecificQueryDispatchers are shown in order to illustrate how several instances can be present at the same time.

Figure 10 Sequence Diagram

Figure 11 Siddhi Sequence Diagram

Implementation View

Figure 12 Siddhi Implementation View

For the ‘compiler’ component, we decided to use ANTLR as the parser generator and compiler. The queries fed by users are passed through the compiler. ANTLR validates the query and, if there is an error or anomaly in the query, returns an error. Otherwise, the query is transformed into an intermediate object model and passed to the Siddhi-core.

For testing purposes, we are using JUnit as our testing framework. Since Siddhi development is carried out as Test-Driven Development (TDD), the testing framework played a major role in the development: the whole system is tested with different kinds of inputs, and the output is observed to make sure that the system generates the intended output.

The Siddhi system is currently hosted in a source repository at SourceForge and uses Subversion as its Version Control System (VCS).

Process View

The Process View gives a detailed description of end-to-end event processing in Siddhi. Input adapters in Siddhi listen to external event sources. Each incoming event is put into a blocking input queue. An event dispatcher then dispatches the events from the input blocking queue to the Siddhi Processors. Here, Siddhi duplicates the input Events according to the number of queries that are interested in processing them; if only one query is interested in an event, Siddhi simply passes the event to that query without duplication.
Thereafter, the queries (the rules defined by the client, each represented by a query processing element instance) process the event and, depending on the query and the type of event, either store it for further processing or drop it. Whenever there is a need to fire an output event, Siddhi creates an output event and passes it to the output queue.

Figure 13 Siddhi Process View

Deployment View

The Deployment view, also known as the Physical view, illustrates the system from a system engineer's perspective. Deployment diagrams show the hardware used in the system, the software that runs on it, and the middleware that connects the different hardware systems together.

Figure 14 shows the deployment diagram of the Siddhi-CEP project. The three-dimensional boxes represent nodes, either software or hardware. The hardware/physical nodes are shown with the stereotype <<device>>.

The deployment diagram depicts the deployment of Siddhi in an SOA environment. Here the Siddhi processor is wrapped inside an Event Server. The Event Server exposes Siddhi to the outside as a Web Service, allowing other Web Services to send events to Siddhi without any hassle. For Siddhi to have an Event Server, our current plan is to make Siddhi compatible with the current WSO2 Event Server implementation. This will make Siddhi more like a plug-and-run module.

Siddhi uses event streams to receive events from different streams/event types. The Event Server wraps the ‘Siddhi input streams’ and exposes what are called ‘input topics’. Topics define a high-level, easier-to-understand interface that hides the internal details of the Siddhi Processor. The event type of the streams is not necessarily exposed by input topics; rather, input topics and streams can have a many-to-many mapping, meaning that one input topic can have many streams and one stream can be mapped to many input topics.
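The many-to-many relationship between input topics and streams can be illustrated with a small registry. This is only a minimal sketch: the class `TopicStreamMapping`, its methods, and the topic names are hypothetical and are not part of Siddhi or the WSO2 Event Server; only the stream name `cseEventStream` appears elsewhere in this report.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical registry (not Siddhi API) linking input topics and input streams. */
final class TopicStreamMapping {
    // Both directions are stored so that either side can be looked up directly.
    private final Map<String, Set<String>> streamsByTopic = new HashMap<>();
    private final Map<String, Set<String>> topicsByStream = new HashMap<>();

    /** Registers one topic-to-stream link; repeated links are ignored by the sets. */
    void map(String topic, String stream) {
        streamsByTopic.computeIfAbsent(topic, k -> new LinkedHashSet<>()).add(stream);
        topicsByStream.computeIfAbsent(stream, k -> new LinkedHashSet<>()).add(topic);
    }

    /** Streams that receive the events published on the given topic. */
    Set<String> streamsFor(String topic) {
        return streamsByTopic.getOrDefault(topic, Collections.emptySet());
    }

    /** Topics that feed the given stream. */
    Set<String> topicsFor(String stream) {
        return topicsByStream.getOrDefault(stream, Collections.emptySet());
    }

    public static void main(String[] args) {
        TopicStreamMapping mapping = new TopicStreamMapping();
        mapping.map("stockQuotes", "cseEventStream");   // topic names are made up
        mapping.map("stockQuotes", "nyseEventStream");  // one topic, many streams
        mapping.map("auditFeed", "cseEventStream");     // one stream, many topics
        System.out.println(mapping.streamsFor("stockQuotes"));
        System.out.println(mapping.topicsFor("cseEventStream"));
    }
}
```

Keeping both directions of the mapping costs some memory but lets the Event Server side resolve streams from a topic, and topics from a stream, without scanning all entries.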
To configure the wrapping, the user should:
o Define input topics
o Map them to an event input stream (this is inside the Siddhi engine)
o Define queries to be processed
o Define an output topic per query, then map a stream to the output topic if applicable (optional)

Through this configuration the Event Server will be able to receive events from different event sources and process them using Siddhi-CEP. Siddhi then alerts the Event Server via its output event streams when an event ‘hit’ occurs, and the Event Server notifies the relevant parties as defined.

Figure 14 Siddhi Deployment View

Use Case View

Figure 15 Siddhi Use Case Diagram

Use case 1.

This use case is querying Siddhi-CEP. Querying Siddhi includes three main things.

I. Assigning Input adapters

There will be several input adapters available in Siddhi, and the client can select one according to the type of message the client intends to send for processing. For example, the client can use the SOAP/XML message adapters to convert XML to event tuples for the Siddhi engine to process. When there is no adapter for the type of event the client intends to send, the API is flexible enough that clients can write their own adapters and easily plug them in to Siddhi.

II. Assigning Output adapters

Output adapters send output events back to the client in the set format. The client can select an existing output adapter according to the type of message the client intends to receive as output. When the client cannot find a suitable adapter, the client writes a callback method in a custom output adapter and sets its reference in Siddhi.

III. Submit EPL Query

Another part of querying is submitting EPL (Event Programming Language) queries. After an EPL query is received, Siddhi compiles the query and builds the object model. This generated object model is used to check for matching incoming events.

Use case 2.

The second use case is Event Sources sending events to Siddhi.
The format of the Event is set when the query is sent to the Siddhi engine. Here the Event Sources call the appropriate input adapter and send the Events through it.

Use case 3.

The third use case is Event Subscribers getting notified of matching events. This happens in the form of a callback in the Siddhi Output Adapters. The output format is defined when the output event is sent from the Siddhi engine.

3.3. Major Design Components

This section covers the design of Siddhi, from basic elements to high-level design components.

3.3.1. Event Tuples

In Siddhi, the internal representation of an event is a Tuple. A Tuple is a very basic data structure for representing an event.

Stream Id | Data 1 | Data 2 | Data 3

Figure 16 Siddhi Event Tuple

Design Alternatives: Plain Old Java Object (POJO), XML

The reason for selecting Tuple as the internal data structure for an event:

In the initial stage of the project we designed the Siddhi architecture with XML as the internal representation of events, but later we moved to event tuples. A Tuple is a very simple data structure, and retrieving data from a Tuple is very simple compared to the alternatives. This helps Siddhi process events faster by minimizing the overhead of accessing data in an event. That is the major reason for selecting Tuple as the data structure for the internal event representation.

3.3.2. Pipeline architecture

The architecture of Siddhi follows a producer-consumer model. As described in the diagram, input Events are placed in an Event Queue, from which the processors fetch and process the events one at a time. Siddhi processors use Executors as their internal processing elements. These Executors access each event and produce a Boolean output indicating whether the event matches or not. Events that do not match are simply discarded, and the matching events are then processed by the processors accordingly.
When a matching event occurs according to the query, the processor either stores the event for further processing or creates the appropriate output event and places it in the Output Queue, where it is either used by another processor or sent to the end user as a notification. This presents a simple many-to-many producer-consumer design pattern by developing pipeline architecture according to the code books.

Figure 17 Siddhi Pipeline Architecture

Design Alternatives:
o Single processor model
o Parallel processing model
o Pipeline architecture model

The reason for choosing pipeline architecture:

The single processor model is widely used in the CEP domain; some well-known CEP engines such as Esper use such architectures. But this solution is highly discouraged as it uses only one processing thread. Since a CEP engine is supposed to deliver high performance, running a single-threaded process on multi-core processors is highly inefficient.

Compared to the single processor model, parallel processing seems more attractive. Here each complex query is processed by its own thread. The disadvantage of this model is its resource utilization: in most cases many complex queries share several common sub queries, so many duplicate sub queries end up running at the same time.

The pipeline architecture model rectifies the issues of both the single processor model and the parallel processing model by running only one instance of each sub query at a time. Each sub query runs in a separate thread, further parallelizing the execution. This facilitates the much faster processing and high throughput that we required.

3.3.3. State Machine

The state machine is one of the major components of Siddhi. It can be considered the most vital part in processing complex queries. In Siddhi we have used a state machine for handling two types of queries: Sequence queries and Pattern queries.
Sequence Queries

In sequence queries we can define Siddhi to fire an event when a series of conditions is satisfied one after the other.

Example

We define the sequence of conditions as,
A -> B -> C
Let's consider the following event sequence,
A1, B1, A2, A3, C1, A4, B2, B3, C2
An event is fired when Siddhi receives C1. Siddhi captures the event sequence A1, B1, C1.

Every Operator

In the above-mentioned sequence, after capturing an event successfully, Siddhi stops looking for new events. If we want to continue the process, we can use the ‘Every’ operator.

Consider the event sequence
A1 B1 C1 B2 A2 D1 A3 B3 E1 A4 F1 E2 B4

Table 2 Different Every Operator Cases

Example: every ( A -> B )
Description: Detect an A event followed by a B event. At the time when B occurs, the sequence matcher restarts and looks for the next A event.
Matched Sequences:
Matches on B1 for combination {A1, B1}
Matches on B3 for combination {A2, B3}
Matches on B4 for combination {A4, B4}

Example: every A -> B
Description: An event fires for every A event followed by a B event.
Matched Sequences:
Matches on B1 for combination {A1, B1}
Matches on B3 for combination {A2, B3} and {A3, B3}
Matches on B4 for combination {A4, B4}

Example: A -> every B
Description: An event fires for an A event followed by every B event.
Matched Sequences:
Matches on B1 for combination {A1, B1}
Matches on B2 for combination {A1, B2}
Matches on B3 for combination {A1, B3}
Matches on B4 for combination {A1, B4}

The Sequence processor does the core processing of sequence queries.

Design Decisions in Sequence Processor

In Siddhi the basic unit that does condition matching is the Executor. For Sequence Conditions we have a series of ‘Followed-by Executors’, one corresponding to each state. Here we have deviated from the conventional state machine concept. In a conventional state machine only one state is active and listening to inputs at a particular time. But because of the ‘Every’ operator, here we have multiple states listening to input events.
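The three placements of the ‘Every’ operator in Table 2 can be reproduced with a small sketch for the two-step sequence A -> B. This is an illustration only, not the Siddhi Sequence Processor: the class `EveryOperatorSketch` and its methods are hypothetical, events are plain strings such as "A1", and the event type is assumed to be the first character of the name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not Siddhi code) of the 'every' placements for A -> B. */
final class EveryOperatorSketch {

    /** every ( A -> B ): capture one A, match it on the next B, then restart. */
    static List<String> everyGroup(List<String> events) {
        List<String> matches = new ArrayList<>();
        String pendingA = null;
        for (String e : events) {
            if (e.startsWith("A")) {
                if (pendingA == null) pendingA = e;   // later A events are ignored until restart
            } else if (e.startsWith("B") && pendingA != null) {
                matches.add(pendingA + "," + e);
                pendingA = null;                      // restart and look for the next A
            }
        }
        return matches;
    }

    /** every A -> B: each A opens its own listener; the next B completes all of them. */
    static List<String> everyA(List<String> events) {
        List<String> matches = new ArrayList<>();
        List<String> pendingAs = new ArrayList<>();
        for (String e : events) {
            if (e.startsWith("A")) {
                pendingAs.add(e);
            } else if (e.startsWith("B")) {
                for (String a : pendingAs) matches.add(a + "," + e);
                pendingAs.clear();
            }
        }
        return matches;
    }

    /** A -> every B: the first A is kept and every later B matches with it. */
    static List<String> everyB(List<String> events) {
        List<String> matches = new ArrayList<>();
        String firstA = null;
        for (String e : events) {
            if (e.startsWith("A")) {
                if (firstA == null) firstA = e;
            } else if (e.startsWith("B") && firstA != null) {
                matches.add(firstA + "," + e);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
            "A1", "B1", "C1", "B2", "A2", "D1", "A3", "B3", "E1", "A4", "F1", "E2", "B4");
        System.out.println("every (A -> B): " + everyGroup(stream));
        System.out.println("every A -> B:   " + everyA(stream));
        System.out.println("A -> every B:   " + everyB(stream));
    }
}
```

In the second and third variants more than one listener can be waiting at once, which is the situation the text describes as multiple states listening to input events.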
A Map to hold Executor Listeners

We have used a Map to store the currently active Followed-By Executors.

Map <StreamId, LinkedList<FollowedByExecutor>>

The Linked List holds the Followed-By Executors that belong to a particular stream. When sending events, we send them only to the executors that correspond to the input event's stream id.

Design Alternatives: A Linked List to hold all the executors

We could have stored all the executors in one Linked List and sent each new input event to all of them iteratively.

Figure 18 Map holding Executor Listeners (diagram: an incoming event is looked up by its Stream Id in the Map, which leads to the Linked List of executors for that stream; each executor reports matched/mismatched)

The reason for choosing a Map

In the initial design of the Sequence Processor we used only a Linked List to hold all the currently active executors. While making performance improvements we saw that sending all the input events to all the active executors reduced performance. So we decided to use a Map to avoid this, selecting only the necessary executors using the key of the Map.

Pattern Queries

In Pattern Queries we can define Siddhi to fire an event when a series of conditions is satisfied one after the other in a consecutive manner.

Example:

We define the sequence of conditions as,
A -> B -> C
Let's consider the following event sequence,
B1, A1, B2, C1, A2, B3, B4, C2
An event is fired when Siddhi receives C1. Siddhi captures the event sequence A1, B2, C1.

Unlike in Sequence Queries, here the sequence of conditions has to be satisfied consecutively. For example, no event is fired for the event sequence
B1, A1, A2, C1, A3, B3, B4, C2
Here pattern matching starts when Siddhi receives the A1 event. It then listens for a B event. Since Siddhi receives the A2 event instead, the pattern fails.
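The consecutive matching rule of Pattern Queries can be sketched as follows. This is a simplified illustration, not the Siddhi Pattern Processor: `PatternMatcherSketch` is a hypothetical class, event types are assumed to be the first character of each event name, and two behaviours are assumptions rather than statements from the design: an event that breaks a partial match may itself start a new one, and matching continues after a successful match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not Siddhi code) of strictly consecutive pattern matching. */
final class PatternMatcherSketch {

    /** Returns every run of consecutive events whose types equal the (non-empty) pattern. */
    static List<List<String>> match(List<String> pattern, List<String> events) {
        List<List<String>> matches = new ArrayList<>();
        List<String> partial = new ArrayList<>();
        for (String e : events) {
            String type = e.substring(0, 1);
            if (!type.equals(pattern.get(partial.size()))) {
                partial.clear();                      // pattern broken: drop the partial match
            }
            if (type.equals(pattern.get(partial.size()))) {
                partial.add(e);                       // event fits the currently expected state
            }
            if (partial.size() == pattern.size()) {
                matches.add(new ArrayList<>(partial)); // all states satisfied consecutively
                partial.clear();
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("A", "B", "C");
        // First example stream from the text: prints [[A1, B2, C1]]
        System.out.println(match(pattern,
            Arrays.asList("B1", "A1", "B2", "C1", "A2", "B3", "B4", "C2")));
        // Second example stream from the text: prints []
        System.out.println(match(pattern,
            Arrays.asList("B1", "A1", "A2", "C1", "A3", "B3", "B4", "C2")));
    }
}
```

The sketch keeps a single partial match at a time, so it does not cover patterns whose prefix overlaps with itself; it is intended only to show why A2 arriving after A1 discards the match in progress.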
Kleene Star Operator

For pattern queries we can use the Kleene star operator to define an unbounded number of intermediate conditions.

Example:

We define the sequence of conditions as,
A -> B* -> C
Here B* stands for zero or more B events.
Let's consider the following event sequence,
A1, B1, B2, C1, A2, B3, B4, C2
An event is fired when Siddhi receives C1. Siddhi captures the event sequences A1, B1, B2, C1 and A2, B3, B4, C2.

Design Decisions in Pattern Processor

List to hold Executor Listeners

For Pattern Conditions we have a series of Pattern Executors, one corresponding to each state. We have used a Linked List to store the currently active Executors.

Design Alternative: As discussed for Sequence queries, we could use a Map to hold the currently active event listeners.

The reason for choosing a Linked List:

Compared to the Sequence Processor, the number of active executors in the Pattern Processor is very low, so there is no need to use a Map to filter out executors. This keeps the implementation simple and avoids unnecessary overhead in finding the relevant executors.

3.3.4. Processor Architecture

The Siddhi processor has two major components:
o Executors
o Event Generators

Executors

Executors are the principal processing elements in the Siddhi processor. These Executors are generated by the Query parser by parsing the Query Object Model defined by the user. The Executors form a tree-like structure, and when an Event is passed to the Executor tree, the tree processes it and returns true if the Event matches the query or false if it does not. Although many Executor trees can be present in the processor at the same time, only one is processed at any particular moment. The Executor trees are processed in depth-first search (DFS) order, and when a mismatch occurs at any node, processing is terminated and the nodes recursively return false up to the root, notifying the processor of the failure and making that event obsolete.
If all the nodes have matching outputs, the root notifies true to the processor.

AndExecutor, ExpressionExecutor, FollowedByExecutor, NotExecutor, OrExecutor, PatternExecutor

Design Alternatives: Multiple Executor model

Reason for choosing the Tree executor model:

Since each sub query has its own query condition, there must be a corresponding executor for each condition. It is therefore essential to have many different types of Executors to handle all the different cases. Though the ‘multiple executor model’ satisfies this requirement, it has trouble rejecting non-matching Events at an early stage, because it needs to process duplicated execution nodes in different sequences, which consumes time and delays finding the failing nodes. In the Tree executor model, the Executors are arranged in an optimal order, and in this arrangement, if the left sub-tree returns false, the right sub-tree is not processed and the false output is immediately notified to the processor.

Event Generators

The Event generator is the other most important element in the processor. Its duty is to produce output Event tuples according to the query definition and place them in the output queue.

Design Alternatives: Bundled output Events

The reason for selecting formatted output Events:

Some well-known CEP solutions send all accepted events bundled together when the query is satisfied. This bundling of Events occurs when executing queries such as Pattern, Sequence and Join queries, because when such a query is satisfied many output events need to be sent at the same time. In the first iteration of Siddhi we adopted this model, but it failed because of its lack of support for query chaining: our architecture became more and more complex when we needed to plug in queries for the bundled output Events. Therefore we decided to follow the SQL convention with the advice of Dr.
Srinath by formatting output. Here the output is converted into a single output Event as defined in the query definition. This enables better control over the other queries and enables Siddhi to have a pluggable pipeline architecture.

3.3.5. Query Object Model

The Query object model depicts the internal structure of the query, which the Siddhi-core can understand. Since we have not yet implemented the compiler, the Query object model is currently used to configure Siddhi queries. Though this is not as user-friendly as SQL, Siddhi follows an SQL-like query structure and the Query object model resembles it, so it is easy for users to understand and write queries in Siddhi.

Figure 19 Simple Siddhi Query

Since a Siddhi query is currently an object model, it has to be coded as shown in Figure 19 or generated programmatically. The ability to generate the query programmatically allows Siddhi to have various interfaces suited to its deployment. For example, the ‘OpenMRS’ implementation uses a ‘Graphical Form Entry’ to define the queries [46].

Figure 20 Siddhi Query Form UI Implementation Used in OpenMRS

Design Alternatives: Custom object model

Reasons for selecting an SQL-like object model:

Using a custom object model is feasible, but we decided to choose an SQL-like object model because of the following advantages.
1. By following SQL standards, the Siddhi Query Language falls in line with relational algebraic expressions, and Siddhi queries can also utilize the optimization techniques used in SQL and relational databases.
2. Other CEP solutions follow the same trend, and it is widely accepted, as CEP and SQL queries mostly express the same set of functions.
3. SQL is common, so it is easy for users to start using Siddhi if the Siddhi query language is in line with SQL.

3.3.6. Query parser

The Query parser parses the Query object model and creates the query Executors.
After the query (the Query Object Model) is added to the Siddhi Manager and the Siddhi Manager is updated, the Siddhi Manager parses the Query object models and creates Executors from them. Finally, the created Executors are assigned to a processor. Although the Executor tree structure is currently quite similar to the conditions defined in the query, various optimization techniques have been implemented for faster performance and user friendliness.

Design Alternatives: Generating the Executors without having an internal Query object model

The reason for generating the Query object model and then converting it to Executors:

When designing, we looked into both alternatives. Generating Executors without a Query object model seems very fast and straightforward, but this model is not capable of handling complex queries, as it fails to optimize the Executors by reusing the available resources. When we create the Query object model first and then create the Executors, Siddhi is able to gain a better understanding of the nature of each query and tune the Executors accordingly. Using this architecture Siddhi was able to achieve performance improvements, as it allowed Siddhi to use SQL query optimization techniques and resource sharing to build the optimal pipeline structure.

For example, managing the Executors according to the complexity of the query is one of its major features in query optimization. When the queries have a collection of simple conditions, Siddhi uses the Simple Executors. These are fast but limited in capability. When more complex conditions are defined, the Query parser tries to convert them to simple queries and use the Simple Executors. Only if this attempt fails does Siddhi use the Complex Executors, which contain MVEL engines that can perform highly complex operations, but at the expense of time.

Figure 21 Siddhi Query with Simple Condition

3.3.7. Window

Time Window

The time window is a sliding window that keeps track of the events that arrived within a given amount of time before the current time. The time window concept (as well as the length window) is useful for analyzing events that arrived within a limited amount of time. For example, it is used for statistical analysis of the arrived events, such as the average or sum of a particular attribute.

Figure 22 Siddhi Time Window

Take the following query as an example (the internal Query model is shown). This query filters the buying and selling events of the stock symbol “IBM” from the input stream ‘cseEventStream’. It then calculates the average stock price within an hour and generates an output event containing the attributes (symbol, avgPrice).

Figure 23 Siddhi Time Window Query

Batch window

Siddhi has built-in data windows, for both Time and Length, that act on batches of events arriving via event streams. The two batch windows that Siddhi supports are named Time Batch Window and Length Batch Window. The batch window concept is exactly the same as the typical Siddhi sliding window concept, except that it sends events as a batch at the end of the time or length interval instead of sending each event at its arrival time.

Time Batch Window: Collects all the events within a given time interval, as defined in the time window, and sends them all at once to the Siddhi core listeners to process as a batch of events.

Length Batch Window: Collects the given number of events, as defined in the length window, and sends them all at once as a batch of events to the Siddhi core listeners to process.

The following diagram illustrates how the Time Batch Window works for a window size of 4 seconds and an event receiving start time of t.

Figure 24 Siddhi Batch Window

1. At time t + 1 seconds, event E1 arrives and enters the batch window. The Siddhi Event Listener is not informed.
2. At time t + 3 seconds, event E2 arrives and enters the batch window. The Siddhi Event Listener is still not informed.
3. At time t + 4 seconds, the Siddhi Event Listener captures all the events in the batch window (E1, E2) and hands them over at once, as a batch, to the Siddhi core to consume, and starts a new batch to wait for new events.
4. At time t + 5 seconds, event E3 arrives and enters the batch window. The Siddhi Event Listener is still not informed.
5. At time t + 8 seconds, the Siddhi Event Listener captures all the events in the batch window (only E3) and hands them over at once, as a batch, to the Siddhi core to consume, and starts a new batch to wait for new events. It also sets the “isNew” Boolean flag of events E1 and E2 to false.

These steps take place continuously throughout the event processing period.

3.3.8. “UNIQUE” Support

The unique view is used for filtering duplicated events. Handling event duplication is an important concept in Complex Event Processing, especially when multiple event streams are involved. Say, for example, there are several RFID sensors in a supermarket, each covering a different section, to get readings of the products. There will be overlap between the sensors, which will then produce the same event. This may lead to inaccurate and incorrect output, so it needs to be handled. The UNIQUE functions provide support for that. The duplicate removal mechanism is customizable, and the user can choose which event is to be filtered and which is to be kept from a set of duplicate events. There are two main methodologies for matching control:

UNIQUE
o The UNIQUE view is a view that includes only the most recent among events having the same value(s) for the result of the specified expression or list of expressions.

FIRSTUNIQUE
o The FIRSTUNIQUE view is a view that only includes the first event out of a set of duplicated events. The event duplication is determined by evaluating the given expression.
If this view is set, whenever a duplicated event arrives, it is dropped.

The criteria used to evaluate event duplication are also customizable and can be specified by the user. In the query, the user can specify any parameters/fields that need to be matched. UNIQUE support is available for both time-windowed queries and queries without a time window.

Figure 25 Siddhi Unique Query

4. Implementation

This section covers the implementation details of Siddhi. First we describe the development process models we used. Then we describe how we manage the source, how we build the project, and so on.

4.1. Process models

We used two process models for Siddhi development: Scrum and Test-Driven Development (TDD). This section describes what these process models are and the reasons we chose them.

4.1.1. Scrum Development Process

Scrum is a software development process that is a form of agile software development. It is an iterative and incremental approach that progresses through several iterations called sprints. A sprint is generally two to four weeks long, and at the start of each sprint the team sets the goals and objectives to be completed during it. The scrum team then has daily meetings (generally starting at the same time and place every day) to get an update on the project status. The daily scrums are generally limited to 15 minutes.

Figure 26 Scrum Development Process

Generally there is a set of defined features that clients require from the product the team is developing. This set of features is called the product backlog. For each sprint, the team takes a set of features out of the product backlog and puts them into the Sprint Backlog, which they implement during the sprint. The daily meetings take place to get an update on each feature the team members are developing. A “Sprint Burn Down” chart is used to track this progress.
It shows the remaining work, the unfinished work and how far the work is completed. At the end of the sprint, all tasks are reviewed; the unfinished features are put back into the product backlog; and the client is updated about the new features. Generally a product release happens if the product is in a releasable state.

We chose Scrum for development because it seemed the most appropriate for our work, especially since we were writing the product from scratch and needed an iterative development process. Further, Scrum was well suited because we had a large knowledge base to cover and communication between team members was very important for the successful completion of the project. In addition, there are several other advantages of Scrum, including:
o Increase in productivity
o High communication between team members makes sure that everyone understands the project requirements well
o Frequent releases of the product. That way, users do not have to wait long to benefit from newly implemented features. The “Release early, release often” theme is becoming popular among open-source projects these days.

4.1.2. Test-driven development (TDD)

While using Scrum, we used test-driven development as our development methodology within Scrum sprints. In test-driven development, the developer first writes a failing test case for a desired feature or task. Then the developer writes code to pass the test, and then brings the developed code in line with the coding standards and guidelines that the team agreed upon.

Figure 27 Test-Driven Development (TDD)

The above flow chart shows how this process progresses. This process has several benefits. Since the needed tests are already created, there is less need to debug the code. Because of the nature of the process, there will be test cases for every small feature. That means the code coverage by tests will be higher, which makes the code less prone to errors.
Since several features and functions were needed from Siddhi, there was a great need for testing the code. For code written from scratch, the base/underlying framework needs to be tested thoroughly to make sure that the code is accurate and does the intended job. It would be hard to deal with broken features identified later on, because by then a lot of development would probably have been carried out on top of them. It might mean that all the relevant sections need to be rewritten. Test-driven development avoids these issues to a great extent.

4.2. Version control

For version control we used the Subversion (SVN) version control system. The main reason for selecting Subversion is that all of us are very familiar with it, and it has all the features we expect. We host our source code on sourceforge.net, one of the most popular sites for hosting open source projects. The Siddhi source can be found at http://siddhi.sourceforge.net/source-repository.html

All of us commit our regular work to the trunk. When we have a stable milestone implementation in the trunk, we create a tag from it.

4.3. Project management

We used Apache Maven as our project management tool. Maven makes the build process easier. The Siddhi project is structured into the following main modules.

1. siddhi-api
2. siddhi-core
3. siddhi-io
4. siddhi-benchmark

siddhi-api: contains the classes related to creating queries. We decided to have a separate module for the API in order to make a clear separation from the core. siddhi-api does not depend on the core of Siddhi.

siddhi-core: contains the classes related to the core processing. This can be considered the heart of Siddhi. siddhi-core depends on the siddhi-api module.

siddhi-io: contains the input/output related classes.

siddhi-benchmark: contains the classes related to the Siddhi benchmark.
siddhi-benchmark depends on the siddhi-core module.

In Siddhi we have used Maven for the following tasks.

- To build the binaries from the source code
- To create the Javadoc documentation
- To create the website

4.4. Coding Standards & Best Practices Guidelines for Siddhi

Siddhi employs a set of coding standards and best practices to maintain consistency and increase the readability of the Siddhi code. The guidelines are as follows.

4.4.1. General

Comments

- Doc comments: All classes and all methods/functions MUST have doc comments. The comments should explain each parameter, the return type and any assumptions made.
- Line comments: In case you have complex logic, explain any genius logic and the rationale for doing something.

Logging

- Log then and there. Have ample local information and context for the logs.
- Remember that logs are for users. Make them meaningful and readable, and make sure you spell check (ispell).
- Use the correct log level, e.g. do not log errors as warnings or vice versa.
- Remember to log the error before throwing an exception.
- We use commons-logging 1.1 for logging purposes.

Logic

- Make sure that your genius code is readable.
- Always use meaningful variable names. Remember, compilers can handle long variable names.
- Declare variables locally, as and when required.
- The underscore character should be used only when declaring constants, and should not be used anywhere else in Java code.

Methods/Functions

- Make sure the function/method names are self-descriptive. One should be able to explain a function/method using a single sentence without conjunctions (that is, no and/or in the description).
- Have proper separation of concerns. Check whether you do multiple things in a function.
- Too many parameters are a smell; they indicate that something is wrong.
- Use status variables to capture status and return at the end whenever possible. Avoid returning from multiple places; that makes code less readable.

Committing to repository

- Use your own account for committing. Don’t use the primary account.
- Use a separate commit for each distinct change you make. For example, if you are going to fix two bugs, first fix one bug and commit it; then fix the other bug and commit it.

The following is a set of additional guidelines.

- Be consistent in managing state, e.g. initialize to FALSE and set to TRUE everywhere else.
- Where does that if block end, or what block did you end right now? Have a comment at the end of a block, at the }.
- Use if statements rationally; ensure the behavior is homogeneous.
- When returning a collection, return an empty collection and not null (or NULL).
- Do not use interfaces to declare constants. Use a final class with public static final attributes and a private constructor.
- Always use braces ({}) to surround code blocks, even if it is a single line.
- When throwing an exception, be sure to define who should catch it.
- Be sure to catch those exceptions that you can handle.
- Do not use string literals in the code; instead declare constants and use them. Constant names should be self-descriptive.
- Use constants already defined whenever possible; check to see if someone has already declared one.

4.4.2. Java Specific

- Coding conventions - http://java.sun.com/docs/codeconv/ The only exception is line length, say 100.
- Run FindBugs on your code - http://findbugs.sourceforge.net/

4.5. Profiling

Profiling is a dynamic program analysis methodology. It is used for finding memory leaks, the usage of functions and methods in the code, the frequency of method calls and so on. Profiling is a very important phase of software development, especially when the product is performance critical. Profilers present the profiled information using charts and tables detailing each aspect of the product. Since Siddhi’s edge lies largely in performance, profiling played a key part. We used JProfiler, a commercial all-in-one Java profiler.
Our main need was to reduce the time spent on the CPU, so the hot-spot methods and the method call stack traces were what we needed most. Hot-spot methods are the ones that take a considerable percentage of the CPU time. The implementations of most of these methods can be improved by taking a careful look at the code. Reducing the number of method invocations is another aspect. Often the code contains unnecessary method calls, where the same functionality can be achieved with fewer calls. Siddhi is targeted to process as many as 30,000 events per second, so an enormous number of method calls are made while processing events. Even one change in a method may therefore produce a big increase in performance. We ran the profiler after each iteration of our development and made the necessary changes. Our supervisor, Dr. Srinath Perera, also helped with this.

Figure 28 Call Tree for the Time Window in JProfiler

The Call Tree view shows the stack trace of where the program started and which methods it has called. The method calls are shown in a tree structure along with their execution times and number of invocations. With this view we can identify where Siddhi spends most of its time when executing. Ideally, the time spent in utility classes/methods should be low, with the core methods of Siddhi taking most of the time. This view can be used to check things like that. After identifying the flaws in the code, we improved the code and profiled again to measure whether there was a change. This cycle was repeated.

Further, there is a Hot Spots tab under the same “CPU Views” section, which shows the hot-spot regions in the code that consume most of the CPU time. It generally shows the time and the percentage of CPU time for each method. These methods need special attention.

The following diagram shows the memory view of Siddhi for the time window query. This view, along with a heap walker snapshot, is useful for determining whether there are any memory leaks in the code.
A quick analysis of this graph shows that EventImpl objects take about 49 MB of memory, while LinkedBlockingQueue objects take about 24 MB. Together these two account for more than 60% of the total memory.

Figure 29 The Memory Usage of Siddhi for a Time Window query

4.6. Benchmark

Siddhi has released a benchmark for performance evaluation. The benchmark can be used directly to evaluate the performance of Siddhi from different viewpoints, and it can also be used to compare performance with other competing CEP engines. The Siddhi benchmark kit listens for events from remote clients over TCP and processes them in the server. The benchmark evaluates performance on VWAP (Volume Weighted Average Price) events. Many other CEP benchmarks also use VWAP events.

Figure 30 Siddhi Benchmark

The benchmark server can handle events from multiple clients connected to it. It can also run a given number of queries on the Siddhi core. This is done by adding a system parameter, -Dsiddhi.benchmark.symbol=1000, which generates 1000 different queries from the given query by changing only the symbol of the query. The benchmark has shell scripts for starting the client and the server, and all the options mentioned above can be configured by changing the parameters in these scripts. The server output is logged to serverout.log, and all the results can be found there. The kit can report CEP performance results on an event basis or on a time basis, selected by setting the parameter BMEVAL to true or false. We encourage using the event-based evaluation method, which is more expressive than the other. All time durations include the network overhead, which means that time is measured end to end.

The server script has options that you can edit, or you can leave an attribute unspecified to use its default value.
# Uncomment following and set JVM options
#JVM_OPT="-Xms1024m -Xmx1024m"

# Set a port to allow clients to connect to this server or use default port without specifying any
PORT="-port 5555"

# Set maximum limit of events for the CEP to process
LIMIT="-limit 100"

# Set variable true for event-based benchmark evaluation or set false to evaluate on a time basis; results will be reported accordingly
BMEVAL="-eventbs false"

In the client script:

# Uncomment following and set JVM options
#JVM_OPT="-Xms1024m -Xmx1024m"

# Set host server address or use default host without specifying any
HOST="-host"

# Set batch wait time in milliseconds, which sleeps for the given time between event batches, or comment out the param to skip waiting
BATCHWT="-batch_wait 10"

# Set port to connect to the server or use default port without specifying any
PORT="-port 5555"

The Siddhi benchmark can thus be used to understand and analyze Siddhi's performance against different queries on a given number of events, and even to compare it with other CEP implementations.

4.7. Documentation

We have fully documented Siddhi for users and future developers, and have created a user guide and a developer guide for that purpose. DocBook was used as our markup language. DocBook is a semantic markup language for technical documentation. It is highly popular among open-source communities as their default documentation format. Compared with other formats such as doc and odf, DocBook can be used to create document content in a presentation-neutral form. Further, it makes localizing the documentation considerably easier. The presentation-neutral form comes from its very nature: a DocBook document is written in XML using the DTD rules DocBook provides, and it does not contain any styling information. We can then add any styling to it using an XSL style sheet.
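As a minimal sketch of this separation between content and styling, the following Java program applies a small XSL style sheet to a DocBook-style fragment using the JDK's built-in javax.xml.transform API. The fragment, the style sheet and the class name are illustrative assumptions only; they are not part of the Siddhi documentation build.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocBookToHtmlSketch {

    // A tiny DocBook-style fragment (assumed, un-namespaced): structure only, no styling.
    static final String DOCBOOK =
        "<article><title>Siddhi User Guide</title>"
        + "<para>Siddhi is a complex event processing engine.</para></article>";

    // A hypothetical, minimal style sheet that decides how each element is presented.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/article'>"
        + "<html><body><xsl:apply-templates/></body></html></xsl:template>"
        + "<xsl:template match='title'><h1><xsl:value-of select='.'/></h1></xsl:template>"
        + "<xsl:template match='para'><p><xsl:value-of select='.'/></p></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the given style sheet to the given XML and returns the result as text. */
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The same DOCBOOK content could be given a different look by swapping the style sheet.
        System.out.println(transform(DOCBOOK, XSL));
    }
}
```

The content string never changes; only the style sheet determines that a title becomes a heading and a para becomes a paragraph, which is the presentation-neutral property described above.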
XSL (Extensible Stylesheet Language) style sheets can be written to transform a DocBook document into several forms, including PDF, HTML web pages, MS Word doc, EPUB and the WebHelp format. These powerful features, together with our familiarity with the product, led us to use DocBook as our documentation format. We generate the documentation in PDF and DocBook WebHelp (for web publishing) formats. The documentation is generated by Apache Maven using docbkx-maven-plugin. The following code snippet generates it.

Figure 31 DocBook Documentation Configuration

4.8. Web Site

We have developed a website for our project that details Siddhi's features, explains how to access our source repository, and hosts the documentation, project deliverables (releases), etc. The website is hosted at SourceForge and can be accessed via http://siddhi.sourceforge.net/. [47]

Figure 32 Siddhi Web Site

It provides access to the following information.

- Downloads - Releases and source code
- Documentation - User guide, developer guide, Javadocs, FAQ, and a source code viewer (ViewVC)
- Project Information - Mailing lists, project team, bug tracking system
- The license of Siddhi - The site provides a link to a copy of Siddhi’s license, the Apache License, version 2.0

4.9. Bug tracker

An issue tracking system keeps track of the issues of a particular software system. In the open source world JIRA is a well-known issue tracker. However, Siddhi is currently hosted on SourceForge, so we did not set up a separate issue tracking system and instead used the native SourceForge issue tracker. There, people who have signed up for the project can create issues regarding Siddhi and set various important attributes of each issue, such as its description, priority, resolution, status and assignee. We have raised several major issues there and assigned each to the relevant person, who then receives a notification that work has been assigned to him based on that issue.
This gave our group better remote communication in addition to the siddhi-dev mailing list discussions. When an issue is resolved, the reporter or an admin can close it. As we already have two clients (OpenMRS and the Hybrid California Weather forecasting project), they can also report bugs in our issue tracker (which is very straightforward and useful) and wait for the Siddhi developers to resolve them. The issue tracking system thus gives us a good opportunity to learn of bugs that we may not find ourselves but users could.

4.10. Distribution

The Siddhi jars are hosted in a Maven repository provided by sonatype.org. Through this, any Maven project can use Siddhi directly by simply adding the dependency configuration and the repository configuration to its pom.xml.

<dependencies>
    ...
    <dependency>
        <groupId>org.siddhi</groupId>
        <artifactId>siddhi-api</artifactId>
        <type>jar</type>
    </dependency>
    <dependency>
        <groupId>org.siddhi</groupId>
        <artifactId>siddhi-core</artifactId>
        <type>jar</type>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>sonatype-nexus-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>
</repositories>

5. Results

The results of the performance testing for a given set of queries are discussed below.

5.1. Performance testing

Performance is the most anticipated outcome of a Complex Event Processing engine. How fast it can process and evaluate events and patterns, and notify the subscribers, is very important. End-to-end delivery time is a key measure that CEP clients expect. This led Siddhi to have a benchmark performance kit for evaluating performance. Besides evaluating Siddhi's performance on its own, we as the Siddhi team decided to do a performance comparison with an existing competing CEP engine in today’s CEP market.
As a first step, we decided to do a performance comparison with Esper. (Esper is one of the most widely used CEP engines in the current market, with many customers, including more than 30-40 giant customers such as Oracle and Swisscom.) Because of incompatibilities and the way the Esper benchmark interprets its performance results, we could not use the Esper performance benchmark kit directly to make a fair comparison under 100% similar conditions for both engines. We therefore had to implement a separate framework that provides exactly the same conditions to both Esper and Siddhi. Initially we compared performance with three different types of queries, which cover most of the basic, important CEP functionality: pattern matching, simple filtering and filtering with data windows. The following three graphs illustrate the performance comparison of Siddhi and Esper for these three basic queries.

NOTE: These three queries are very similar to the Esper performance benchmark queries.

1. Performance comparison for a simple filter without a time or length window.

Graph 1 Siddhi Vs Esper Simple Filter Comparison

2. Performance comparison for a timed window query for average calculation for a given symbol.

Graph 2 Siddhi Vs Esper Average over Time Window Comparison

3. Performance comparison for a state machine.

Graph 3 Siddhi Vs Esper State Machine Comparison

6. Discussion and Conclusion

6.1. Known issues

Event order might change when passing events from one processor to another within the Siddhi core. The reason is that, when events processed in parallel are combined, the order in which they arrive at the combining point may depend not only on the order in which the input events arrived but also on the execution speed of the preceding processes.

6.2. Future work

In future we expect to address the above-mentioned issue. Apart from that, the following features will also add value to Siddhi.

6.2.1. Incubating Siddhi at Apache

The Siddhi team has the idea of incubating the Siddhi project at the Apache Software Foundation (ASF), which is a well-known open source software foundation. This will help Siddhi gain good recognition in the open source world, and many people will get a chance to use the project and contribute to its future development.

6.2.2. Find out a query language for Siddhi

We need to find a query language that is sufficient to express the full set of pattern queries. Currently Siddhi uses an object model to represent a query. We expect to implement a query language that provides a simpler way to write queries.

6.2.3. Out of order event handling

For various reasons, such as network issues, events may arrive at the CEP engine out of order. That means an event with a more recent timestamp may arrive before an event with an older timestamp. Currently Siddhi does not support out-of-order event handling.

6.3. Siddhi Success Story

Implementing a high performance complex event processing (CEP) engine is a challenging project. The idea was initiated by Dr. Sanjiva Weerawarana and Dr. Srinath Perera, who had a vision of a completely open-source, high-performing CEP engine. We were tempted to take up this challenge, even though students in the past couple of years had dropped the idea, feeling that it was too risky for a final year project. Our external supervisor is Dr. Srinath Perera and our internal supervisor is Mrs. Vishaka Nanayakkara. After a couple of meetings with them we put together a list of tasks to accomplish before starting our project. As the first step, we did a lot of research by reading several papers related to complex event processing. Then, with some understanding, we started our first iteration, trying to implement a working CEP engine. But our first attempt failed (yes, it was a miserable failure).
But then, after several rounds of discussions and meetings, we took some important architectural decisions based on our failures. We knew we would have to throw away more code in future, until we found the best design. Therefore, even on some vacation days, we decided as a project group to come to the university and work on our project. The most important decision we took in terms of implementation was that we would never look into the code of other CEP implementations, because we always wanted to come up with our own ideas to reach our goal. During the implementation process we faced difficulties, but over a couple more iterations we improved the quality of our implementation step by step by doing performance testing and profiling on our code. In the meantime we looked into the functionality of other CEP engines and started implementing the features ours lacked.

With the help of Dr. Srinath, we got a client for a project from the US who were in need of a Complex Event Processing engine. They had some requirements they expected from a CEP engine. We collected their requirements and implemented some of those features in Siddhi as appropriate. At the same time, we were also contacted by OpenMRS, one of the most popular medical record systems in the world, holding several million patient records. As Siddhi was in a public Maven repository at that time, they tried out our CEP engine and were impressed. Thus, Siddhi is now running in the back end of the OpenMRS NCD Module.

We got great support from our supervisors. Dr. Srinath helped us a lot in designing Siddhi, and his contribution helped greatly in making this project a success. Ms. Vishaka Nanayakkara always guided us by giving valuable advice to stay on track with the project and with our research. We would therefore like to thank both our supervisors for the important role they have played in bringing our project to this state.
Our next goal is to push this project to the Apache Software Foundation and release it under the Apache License v2, so that we can dedicate a high performance CEP engine to the open source community.

6.4. Conclusion

As per the above discussion, Siddhi can be used for complex event processing, delivering the performance that most people expect. Siddhi has addressed the absolute need for an open-source CEP engine with the ability to process a huge flood of events, which may go well over one hundred thousand events per second, with near-zero latency. We have carried out a detailed literature survey, comparing and contrasting different event processing architectures, and have come up with an architecture that appears to be computationally efficient. It has been optimized for high speed processing with low memory consumption. Currently, Siddhi has all the common features that a Complex Event Processing engine should support, built on the basic complex event processing framework we have written. Further, Siddhi has some additional features that were added based on requests from users. The current Siddhi implementation provides an extendable, scalable framework for the open-source community for extending Siddhi to match specific business needs.

Abbreviations

CEP – Complex Event Processing
ESP – Event Stream Processing
EDA – Event-Driven Architecture
EPL – Event Processing Language
RDBMS – Relational Database Management System
XML – eXtensible Markup Language
SOAP – Simple Object Access Protocol
ESB – Enterprise Service Bus
BAM – Business Activity Monitor
SOA – Service Oriented Architecture
URL – Uniform Resource Locator
ASF – Apache Software Foundation
BPM – Business Process Management

Bibliography

[1] D. Luckham and R. Schulte. (2007, May) Event Processing Glossary. [Online]. http://complexevents.com/?p=195
[2] T. J. Owens, "Survey of Event Processing," Technical report, Air Force Research Laboratory, Information Directorate, 2007.
[3] (2010) S4: Distributed Stream Computing Platform. [Online]. http://s4.io
[4] (2010, Nov.) S4. [Online]. http://wiki.s4.io/Manual/S4Overview
[5] C. Hewitt and H. Baker, "Actors and Continuous Functionals."
[6] L.J. Fülöp, G. Tóth, R. Rácz, J. Pánczél, T. Gergely, A. Beszédes, and L. Farkas, "Survey on Complex Event Processing and Predictive Analytics," 2010.
[7] (2010, Nov.) Homepage of Esper/NEsper. [Online]. http://www.espertech.com/
[8] Homepage of PADRES. [Online]. http://research.msrg.utoronto.ca/Padres/
[9] Hans-Arno Jacobsen, Alex Cheung, Guoli Li, Balasubramaneyam Maniymaran, Vinod Muthusamy, and Reza Sherafat Kazemzadeh, "The PADRES Publish/Subscribe System."
[10] Homepage of Intelligent Event Processor (IEP). [Online]. http://wiki.openesb.java.net/Wiki.jsp?page=IEPSE
[11] (2010, May) Sopera Homepage. [Online]. http://www.sopera.de/en/home
[12] Stream-based And Shared Event Processing (SASE) Home page. [Online]. http://sase.cs.umass.edu/
[13] E. Wu, Y. Diao, and S. Rizvi, "High-performance complex event processing over streams," Chicago, IL, USA, 2006.
[14] Cayuga Homepage. [Online]. http://www.cs.cornell.edu/database/cayuga/
[15] Air Force Research Laboratory, "Survey of Event Processing," 2007.
[16] J. Nagy, "User-Centric Personalized Extensibility for Data-Driven Web Applications, IF/AFOSR Minigrant Proposal," 2007.
[17] A. Demers, J. Gehrke, B. Panda, M. Riedewald, V. Sharma, and W. White, "Cayuga: A General Purpose Event Monitoring System," Asilomar, California, January 2007.
[18] A. Demers, J. Gehrke, M. Hong, M. Riedewald, and W. White, "A General Algebra and Implementation for Monitoring Event Streams," 2005.
[19] K. Vikram, "FingerLakes: A Distributed Event Stream Monitoring System."
[20] Aurora Homepage. [Online]. http://www.cs.brown.edu/research/aurora/
[21] Borealis Homepage.
[22] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, et al., "Aurora: a data stream management system," San Diego, California, 2003.
[23] H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, et al., "Retrospective on Aurora," vol. 13, no. 4, 2004.
[24] D. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, et al., "The Design of the Borealis Stream Processing Engine," 2005.
[25] S. Zdonik, M. Stonebraker, M. Cherniack, U. Cetintemel, et al., "The Aurora and Medusa Projects," 2003.
[26] TelegraphCQ Homepage. [Online]. http://telegraph.cs.berkeley.edu/
[27] S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, et al., "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World," 2003.
[28] S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Madden, F. Reiss, and M. Shah, "TelegraphCQ: An Architectural Status Report," IEEE Data Engineering Bulletin, vol. 26, no. 1, 2003.
[29] T. Sellis, "Multiple Query Optimization," 1988.
[30] N. Dalvi, S. Sanghai, P. Roy, and S. Sudarshan, "Pipelining in Multi-Query Optimization," 2001.
[31] A. Gupta, S. Sudarshan, and S. Viswanathan, "Query Scheduling in Multi Query Optimization," 2001.
[32] P. Roy, A. Seshadri, A. Sudarshan, and S. Bhobhe, "Efficient and Extensible Algorithms For Multi Query Optimization," 2000.
[33] STREAM Homepage. [Online]. http://www-db.stanford.edu/stream/
[34] PIPES Homepage. [Online]. http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES/
[35] Sybase Complex Event Processing. [Online]. http://www.coral8.com/developers/documentation.html
[36] Sybase Complex Event Processing. [Online]. http://www.coral8.com/developers/documentation.html
[37] Coral8: The Fastest Path to Complex Event Processing. [Online]. http://www.coral8.com/developers/documentation.html
[38] Progress Apama – Monitor, Analyze, and Act on Events in Under a Millisecond. [Online]. http://www.progress.com/apama/index.ssp
[39] S. Zdonik. (2006, March) Stream Processing Overview [presentation] Workshop on Event Processing, Hawthorne, New York.
[40] Real-Time Data Processing with a Stream Processing Engine. [Online]. http://www.streambase.com/print/knowledgecenter.htm
[41] Truviso Product Brief. [Online]. http://www.truviso.com/resources/
[42] M. Li, M. Liu, L. Ding, E.A. Rundensteiner, and M. Mani, "Event Stream Processing with Out-of-Order Data Arrival," 2007.
[43] J. Agrawal, Y. Diao, D. Gyllstrom, and N. Immerman, "Efficient pattern matching over event streams," 2008.
[44] T. Shafeek, "Aurora: A New Model And Architecture For Data Stream Management, B.Tech Seminar Report, Government Engineering College, Thrissur," 2010.
[45] R. Motwani, J. Widom, and A. Arasu, "Query Processing, Resource Management, and Approximation in a Data Stream Management System," 2003.
[46] OpenMRS Wiki. [Online]. https://wiki.openmrs.org/display/docs/Notifiable+Condition+Detector+%28NCD%29+Module
[47] Oracle and BEA. (2011-5-6). [Online]. http://www.oracle.com/us/corporate/Acquisitions/bea/index.html

Appendix A
Apache License, Version 2.0

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royaltyfree, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royaltyfree, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NONINFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.