Impact Alternatives
Impact without Impact

Daniel L. Needles
January 21, 2011
Rev 2.0

Modification History

AUTHOR              DATE         COMMENTS
Daniel L. Needles   05/01/2003   Initial draft.
Daniel L. Needles   1/1/2010     Added Omnibus 7.X case studies: XinY, Event Audit.
Daniel L. Needles   1/21/2011    Impact Replacement. Additional clean up.

COPYRIGHT © 2010, NMS Guru

Any copying, distribution, or use of any of the information contained within this document in any way or form is not permitted without the written consent of NMS Guru. This is an unpublished work protected under the copyright laws. All rights reserved.

Contents

MODIFICATION HISTORY
CONTENTS
INTRODUCTION
    WHAT DOES THE TIVOLI NETCOOL SUITE DO?
    WHAT IS OMNIBUS?
    WHAT IS IMPACT?
    WEIGHING THE ALTERNATIVE OPTIONS?
        INTRODUCTION
        COMPLEXITY
        COST
        FLEXIBILITY
        LOAD
        ROBUSTNESS (HA/DR)
        SUMMARY
    DOCUMENT PURPOSE
    INTENDED AUDIENCE
POLICIES WITHOUT IMPACT
    INTRODUCTION
    POLICY DEVELOPMENT
    PROBE, AUTOMATION, AND DATABASE FIELD DESIGN
        AUTOMATION AND DATABASE FIELD DESIGN
        DATABASE FIELD DESIGN
        AUTOMATION DESIGN
    SOLUTION IMPLEMENTATION AND TESTING
    DOCUMENTATION AND TRAINING
    SUMMARY
CASE STUDY: LIGHTS OUT NOC (OMNIBUS 3.X)
    INTRODUCTION
    BACKGROUND
    THE SOLUTION
        PAGING POLICY DEVELOPMENT
        PAGING SOLUTION - FIRST ATTEMPT WITH DELAY EDGE TRIGGERS
        PAGING SOLUTION - SECOND ATTEMPT WITH AUTOMATIONS AND DB FIELDS
    TRANSLATION
    AUTOMATIONS
CASE STUDY: XINY (OMNIBUS 7.X)
    INTRODUCTION
    REQUIREMENTS
        PROVIDED SOLUTION FRAGMENT
        ENVIRONMENT FEATURES
    RECOMMENDATIONS
        INTRODUCTION
        METHOD 1: PROBE UNIVERSAL INCLUDE FILE 1
        METHOD 2: PROBE UNIVERSAL INCLUDE FILE 2
        METHOD 3: AUTOMATIONS IN COLLECTION (EVENT DEPENDENT)
        METHOD 4: AUTOMATIONS IN COLLECTION (EVENT INDEPENDENT)
        METHOD 5: COLLECTION IMPACT
        METHOD 6: AGGREGATION IMPACT
        SUMMARY
    XINY SOLUTION DESIGN
        INTRODUCTION
        NEW_ROW AND DEDUPLICATION:
        XINY_NEW_ROW AND XINY_DEDUPLICATION
        XINY_EXPIRE
        XINYWINDOW_EXPIRE
        XINY_CLEANUP
    TEST SCENARIOS
        INTRODUCTION
        SCENARIO 1: XINY WITHIN EVENT LIFESPAN
        SCENARIO 2: XINY LIFESPAN SPANNING TWO RELATED EVENTS
        SCENARIO 3: XINY LIFESPAN NOT QUITE SPANNING TWO RELATED EVENTS
        SCENARIO 4: CLEAN HA/DR FAILOVER (BETWEEN XINY EVENTS)
        SCENARIO 5: DIRTY HA/DR FAILOVER (DURING AN XINY EVENT)
        SCENARIO 6: AN XINY FAILURE TO THRIVE
        SCENARIO 7: AN INTERRUPTED XINY LIFESPAN (BY GENERICCLEAR)
    SUMMARY
CASE STUDY: EVENT AUDIT (OMNIBUS 7.X)
    INTRODUCTION
    AUTOMATION AUDIT ARCHITECTURE
    AUTOMATION AUDIT REPORT
    INSTALLATION
        STEP 1: CREATE THE TABLE ALERTS.ALARMAUDIT
        STEP 2: ADD THE AUDIT FIELD (VARCHAR(64)) TO ALERTS.STATUS
        STEP 3: INSERT AN INITIAL RECORD INTO THE ALERTS.ALARMAUDIT TABLE
        STEP 4: CREATE THE CLEAN_ALARMAUDIT_TABLE AUTOMATION
        STEP 5: OPTIONAL: MAKE A PLACE HOLDER FOR TRACKING PROBE EVENTS
        STEP 6: OPTIONAL: MAKE A PLACE HOLDER FOR TRACKING PROBE EVENTS
        STEP 7: UPDATE STATE_CHANGE TO TRACK AUTOMATION UPDATES
        STEP 8: UPDATE GENERICCLEAR, EXPIRE, AND OTHER AUTOMATIONS TO ENABLE EVENT AUDITING
        STEP 9-N: UPDATE OTHER EVENT SPECIFIC AUTOMATIONS (FUTURE EXPANSION)
    SUMMARY
CASE STUDY: IMPACT REPLACEMENT
    INTRODUCTION
    COMPATIBLE FUNCTIONALITY
    FLEXIBILITY
    SPEED
    ROBUSTNESS
    MAINTAINABILITY
SUMMARY
APPENDIX A: XINY NEW TABLES AND TABLE UPDATES
    ALERTS.XINY
    ALERTS.XINYWINDOW
    ALERTS.STATUS
APPENDIX B: ALERTS.STATUS XINY CENTRIC AUTOMATIONS
    NEW_ROW AUTOMATION:
        SETTINGS
        ACTION
    DEDUPLICATION AUTOMATION
        SETTINGS
        ACTION
APPENDIX C: ALERTS.XINY CENTRIC AUTOMATIONS
    XINY_NEW_ROW AUTOMATION
        SETTINGS
        ACTION
    XINY_DEDUPLICATION AUTOMATION
        SETTINGS
        ACTION
APPENDIX D: ALERTS.XINY CENTRIC AUTOMATIONS
    XINY_EXPIRE (TEMPORAL TRIGGER)
        SETTINGS
        EVALUATE
        ACTION
APPENDIX E: XINYWINDOW CENTRIC AUTOMATIONS
    XINYWINDOW_EXPIRE (DATABASE TRIGGER)
        SETTINGS
        ACTION
    XINY_CLEANUP (TEMPORAL TRIGGER)
        SETTINGS
        ACTION
APPENDIX F: EVENTSTREAM.PL
APPENDIX G: EVENT AUDIT REPORT SCRIPT
APPENDIX H: EVENT AUDIT REPORT (EXAMPLE 1)
APPENDIX I: EVENT AUDIT REPORT (EXAMPLE 2)
APPENDIX I: IMPACT-LIKE PERL SHELL
IMPORTANT NOTICE
ABOUT NMS GURU
AUTHOR

Figures

FIGURE 1: EXAMPLE OF AN EVENT LIST
FIGURE 2: EXAMPLE OF TYPICAL SINGLE TIER NETCOOL DEPLOYMENT
FIGURE 3: POLICY FLOW DIAGRAM
FIGURE 4: LIGHTS OUT NOC - ENTERPRISE MONITORING ARCHITECTURE
FIGURE 5: LIGHTS OUT NOC - NOTIFICATION SUPPRESSION GUI
FIGURE 6: LIGHTS OUT NOC - DELAYED EDGE TRIGGER EXAMPLE
FIGURE 7: LIGHTS OUT NOC - POLICY FOR A DOWN EVENT
FIGURE 8: LIGHTS OUT NOC - POLICY FOR A UP EVENT
TABLE 9: LIGHTS OUT NOC - FIELD ADDITIONS
FIGURE 10: LIGHTS OUT NOC - TRIGGER NEODOWNPAGE
FIGURE 11: LIGHTS OUT NOC - TRIGGER NEOCORRELATEUP
FIGURE 12: LIGHTS OUT NOC - TRIGGER NEOFIREUPPAGE
FIGURE 13: XINY - SIX POSSIBLE APPROACHES
FIGURE 14: XINY - AUTOMATION FLOW DIAGRAM
FIGURE 15: XINY - TEST SCENARIO 1
FIGURE 16: XINY - TEST SCENARIO 2
FIGURE 17: XINY - TEST SCENARIO 3
FIGURE 18: XINY - TEST SCENARIO 4
FIGURE 19: XINY - TEST SCENARIO 5
FIGURE 20: XINY - TEST SCENARIO 6
FIGURE 21: XINY - TEST SCENARIO 7
FIGURE 22: ADMIN GUI - TABLE CREATION
FIGURE 23: ADMIN GUI - AUTOMATION CLEAN_ALARMAUDIT_TABLE
FIGURE 24: ADMIN GUI - AUTOMATION CLEAN_ALARMAUDIT_TABLE
FIGURE 25: ADMIN GUI - AUTOMATION NEW_ROW
FIGURE 26: ADMIN GUI - AUTOMATION DEDUPLICATION
FIGURE 27: ADMIN GUI - AUTOMATION STATE_CHANGE
FIGURE 28: ADMIN GUI - AUTOMATION GENERICCLEAR
FIGURE 29: ADMIN GUI - AUTOMATION EXPIRE
FIGURE 30: SCHEMA: ALERTS.XINY
FIGURE 31: SCHEMA: ALERTS.XINYWINDOW
FIGURE 32: SCHEMA CHANGES: ALERTS.STATUS

Introduction

What does the Tivoli Netcool Suite do?

IBM's flagship for fault management is the Tivoli Netcool suite. It consists of several products working together to shuttle events from disparate sources into a common database for display, reporting, and analysis.

Figure 1: Example of an Event List

The events flow through the system as follows. Each probe receives events from a particular source, the most common being syslog and SNMP traps. Using detailed instructions contained in rules files, the probes convert the information about each alarm into a common format and insert it into the object server database. Many applications pull the events from the object server for reports, display, data redundancy, or further analysis.

Figure 2: Example of Typical Single Tier Netcool Deployment

What is Omnibus?

Together the probes, gateway, and object server are called Omnibus. Omnibus collectively acts as an event processor by effectively:

• Consolidating events from multiple Network Management Systems (NMSs) and Element Management Systems (EMSs) into a single display screen.
• Deduplicating multiple events to a single device and/or service centric event and in the process drastically reducing the number of events.
• Normalizing the disparate events into a consistent presentation while preserving the events' meaning.
• Correctly stressing an event’s individual importance and/or urgency relative to other events displayed, regardless of the disparate event sources.

What is Impact?
Impact is a separate product that can be integrated into Omnibus to increase the functionality of the fault management solution. Impact provides two additional features not readily available in Omnibus:

1. Outside data source integration (e.g. web services, Oracle, Sybase, etc.)
2. Complex event enrichment, correlation, and/or augmentation.

Impact performs these two functions by executing three steps periodically:

1. Pull object server events.
2. Perform work: programmatically read and write data to and from:
   a. Object server
   b. Web services (XML, HTML)
   c. Databases (Oracle, Sybase, MySQL, etc.)
3. Update object server events.

The result is an extension of fault management functionality.

Weighing the Alternative Options?

Introduction

At first glance Impact appears the most logical, if not the only, method to provide outside data source integration and/or complex event enrichment and correlation. However, one of the chief strengths of Netcool is its ability to interoperate with other applications and programs. As a result Impact is not always the best solution for the problem at hand. The following chart weighs four common approaches with regard to the complexity, cost, flexibility, load, and robustness (high availability (HA) and disaster recovery (DR)) of each approach.

Option        Complex   Cost   Flexible   Load   HA/DR
Impact        4         1      4          2      2
Automations   3         5      3          1      4
Probes        5         5      1          5      4
Scripts       1         2      5          4      5

Complexity

Impact uses a proprietary, fourth-generation language. As a result, specialists are required to administer, program, and maintain the policies in the application. Further, Impact lacks many of the program constructs of a third-generation language while not abstracting out constructs as most fourth-generation languages do. For example, in order to scale multiple policies, the policies need to be cascaded from a single data source reader. This construct is not prebuilt and requires considerable configuration and coding.
The result is neither intuitive nor self-documenting.

Despite these issues, Impact is designed to extend Netcool and as a result is simpler than most other approaches. The only reason Impact is not simpler than the probes approach is that probes are limited and as a result offer fewer options, which makes their approach simpler.

Cost

The complexity of Impact requires high-end consultants to maintain it effectively. This, on top of a high purchase price and yearly maintenance, makes Impact very costly in comparison to the alternatives: automations, probes, and scripts. Licensing and maintenance for automations and probes are already paid for since they come with the core Netcool product, Omnibus. Scripts are slightly cheaper than Impact because they do not carry the hefty maintenance cost associated with it.

Flexibility

Impact is the most flexible approach besides scripting. Both automations and probes are limited in language and do not have the ability to work directly with outside data sources. However, automations and probes can both act on the event immediately, so there is no latency in event updates, unlike Impact or scripting. Scripting provides all the flexibility of Impact and more, at the cost of building the Impact framework functionality.

Load

Load is not one of Impact's strong suits. Once Impact is installed and configured, in most cases the product can handle at most 6-10 mildly complex algorithms, and only if the polling interval is backed off from 3-4 seconds to 30-90 seconds. This of course means that for 30-90 seconds the object server events are not updated, which leads to other issues. Automations do not perform much better and can place a heavy load on the system. Probes are the least "heavy" but are very limited in application. Finally, scripts often provide the best functionality and can scale considerably better than Impact.
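The three-step cycle that Impact runs (pull events, perform work, update events) is also what a script-based alternative has to build for itself. The following is a minimal Python sketch of one pass of that cycle. It is not Impact's or Omnibus's actual interface: the Event fields, the INVENTORY lookup table, and the function names are all hypothetical stand-ins, and an in-memory list takes the place of the object server.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical, simplified event record (not the real alerts.status schema).
    identifier: str
    node: str
    enriched: bool = False
    location: str = ""


# Hypothetical stand-in for an outside data source such as an inventory database.
INVENTORY = {"router1": "Denver", "switch7": "Austin"}


def pull(event_store):
    """Step 1: pull the events that still need work."""
    return [e for e in event_store if not e.enriched]


def work(events, inventory):
    """Step 2: look each pulled event up in the outside data source."""
    return {e.identifier: inventory.get(e.node, "unknown") for e in events}


def update(event_store, results):
    """Step 3: write the results back to the event store."""
    for e in event_store:
        if e.identifier in results:
            e.location = results[e.identifier]
            e.enriched = True


def poll_once(event_store, inventory):
    """Run one pull / work / update pass."""
    update(event_store, work(pull(event_store), inventory))
```

A real script would repeat poll_once on a timer. Whatever interval is chosen is also the longest an event can wait before it is updated, which is the latency trade-off described in the Load discussion.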
Robustness (HA/DR)

As the Java-based Impact program consumes system resources, the program as a whole begins to behave badly. Part of this is poor product architecture, in that Impact does not take into account the architecture of Java. In particular, each JVM handles resource allocation for the Java programs it runs. When the JVM gets overloaded or is context switching, garbage collection falls behind. When it does, the JVM blocks all processing until the garbage is collected. In effect, it is cleaning up objects that have been used and then abandoned. If the code spawns a lot of objects, the garbage collector consumes the CPU as it traverses memory to recover it. The result is that when the primary needs to fail over to the secondary in the cluster, there is not enough memory or CPU to handle the failover correctly. Unfortunately this is exactly the scenario one expects the clustering to handle. Worse, to work around the issue, the primary and secondary Impact are shut down together before the primary is restarted, which results in an Impact functionality outage anyway.

Summary

When extending the functionality of Netcool into event manipulation and integration with outside data sources, there are many options. Only one of these is Impact. A variety of factors need to be taken into consideration when selecting an approach. This section has discussed five of these: complexity, cost, flexibility, load, and robustness (high availability (HA) and disaster recovery (DR)).

Document Purpose

There are alternatives to purchasing Impact or building an in-house solution. Many NOCs require only simple automated diagnostics, notifications, and/or escalation processes.
The purpose of this document is to explore Impact alternatives for these cases through one of the following:

• Omnibus Automations and Database Fields
• Probe Rules and Database Fields
• Customized Scripts and Database Fields

Several case studies are provided within this document that fully document how this is done:

• Lights Out NOC: Automated notification and escalation.
• XinY: Flagging X events occurring in Y period of time.
• Event Audit: What policy/automation/probe touched which events when.
• Impact Replacement: An Impact-like Perl script that emulates Impact behavior. Procedures to emulate policies can be written and embedded in the script to replace Impact functionality.

Intended Audience

This document is directed toward managers and engineers tasked with creating, maintaining, and/or using a distributed NMS architecture.

Policies without Impact

Introduction

Through the addition of database fields and the careful use of Omnibus automations, Omnibus can implement complex and effective state machines. However, since this is not a designed feature of the product, there are three caveats that complicate the implementation of these solutions:

1. The state transitions are ALL timed or based on database actions: insert, reinsert, update, and delete. Only Object Server triggers or probe processing can provide the action to transition between states.
2. Unlike a normal policy model, all states are available to the automation for state transitions. That is, in the case of the Object Server automations the SQL WHERE clause must be used to exclude events from the transitions. Otherwise events not in a particular state will also be transitioned, breaking the model.
3. The presentation of automations and event records is not engineered towards displaying policy models. Thus, other facilities need to be used to describe and document the policy model.
This is critical since natural entropy, more than any other factor, eventually leads to the demise of any monitoring solution.

With these limitations in mind, there is a straightforward four-step process to create and implement models within the probes and/or Omnibus’s Object Server:

1. Policy Development.
2. Probe, Automation, and Database Field Design.
3. Solution Implementation and Testing.
4. Documentation and Training.

Policy Development

The first step is to fully delineate the 'policy' to be modeled. That is, the business process and its impact should be fully delineated and vetted with the customer base. This is best explained through an example.

Figure 3: Policy Flow Diagram
[Flowchart: Event Received; Operator Tier-1; Event Resolved?; Older than 15 minutes?; Engineer Tier-2; Event Resolved?; Older than 60 minutes?; Manager Tier-3; Event Removed From Escalation]

Imagine a small company wants to implement a lights out NOC to give its workers the flexibility of working from home as well as performing other duties. In this case, once all the stakeholders were consulted and the requirements identified, a simple escalation policy was drafted:

1. When an event for a set of particular devices is received by Omnibus, a 24x7 NOC tier-1 operator is notified and troubleshoots the problem.
2. If the operator cannot solve (clear) the issue within 15 minutes, the system notifies tier-2 by paging an engineer (whether or not the event has been acknowledged).
3. If after an hour no resolution is found, the system escalates to tier-3 by paging the manager of the engineering group (whether or not the event has been acknowledged).

In each case the event must be resolved in order to remove it from the escalation process. Once the problem is resolved, the escalation process can begin anew if the issue reoccurs.
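The drafted policy can be written down as a small decision function, which is a convenient way to vet it with stakeholders before any Omnibus work begins. This is only an illustrative Python sketch: the Tier names and the escalation_tier function are hypothetical, and treating the 15 and 60 minute marks as inclusive is an assumption, since the policy text does not say what happens at exactly those ages.

```python
from enum import Enum


class Tier(Enum):
    # Hypothetical labels for who currently owns the event.
    OPERATOR_TIER_1 = "tier-1 operator"
    ENGINEER_TIER_2 = "tier-2 engineer"
    MANAGER_TIER_3 = "tier-3 manager"
    REMOVED = "removed from escalation"


# Thresholds taken from the drafted policy.
TIER_2_AFTER_MINUTES = 15
TIER_3_AFTER_MINUTES = 60


def escalation_tier(age_minutes: float, resolved: bool) -> Tier:
    """Return who should hold the event under the drafted policy.

    Acknowledgement is deliberately not an input: the policy escalates
    whether or not the event has been acknowledged.
    """
    if resolved:
        return Tier.REMOVED
    # Assumption: an event is escalated once it reaches the threshold age.
    if age_minutes >= TIER_3_AFTER_MINUTES:
        return Tier.MANAGER_TIER_3
    if age_minutes >= TIER_2_AFTER_MINUTES:
        return Tier.ENGINEER_TIER_2
    return Tier.OPERATOR_TIER_1
```

For example, escalation_tier(20, resolved=False) returns Tier.ENGINEER_TIER_2, while any resolved event returns Tier.REMOVED regardless of its age.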
During the vetting process one area of concern was discovered: what if tier-1 or tier-2 deletes the event? This would bypass the escalation solution. The decision was made to alter the delete event tool in Webtop and the thick Windows-based Omnibus client to prevent the deletion of these events by these users. Though these users should not have access to nco_sql, this also needed to be verified.

Probe, Automation, and Database Field Design

Once the policy has been identified and described, the solution can be designed through modifications and additions to probe rules, automations, and the alerts.status table. In our lights out NOC example, implementing the solution within the probe rules can be ruled out. The issue is that the escalation is time based and not event instance based. Since the probe rules are only activated when an event hits the probe, this is likely not the best location to model the bulk of the policy. This leaves the Omnibus automations to perform the state transitions and the alerts.status database fields to record the various states in the policy.

Automation and Database Field Design

Once the models have been designed, they must be translated into Omnibus database fields and automations. The database fields are used to keep track of an event's flow through the model, while the automations transition the events from one stage to the next in the model by changing the value of the database fields.

Database Field Design

Database fields contain information regarding the event’s participation in the policy model. Fields can be used to indicate the position (state) within the model. They can also be used to provide pertinent information for the model such as paging addresses. Which database fields are to be used depends on the application. The cleanest method is to add custom fields with field names that clearly describe how they are used.
However, depending on change control, memory limitations, and other constraints this is sometimes not possible. Alternatively, existing fields can be used by adding additional possible values, appending values to the end of the field, or taking over their function entirely. In these cases documentation and testing are critical to ensure business functions are not disrupted.

In this example, a custom integer field called “State” is created to track the status of the escalation. For instance, the following states could be set:

State   Status
0       Event Received by Tier-1 (default)
1       Escalated to Tier-2
2       Escalated to Tier-3
3       Event Removed or NOT Participating in the Escalation Process

The State field not only allows for status tracking, but when used in conjunction with the automations it ensures that events aren’t processed multiple times. Once the fields and field values are selected, three additional constraints have to be considered: model granularity, model crosstalk, and failover constraints.

Model Granularity

As stated above, the Omnibus Object Server will use a field in each record to track an event through the model. The implication is that, without building extra complexity into a model, the lowest granularity that can be obtained by a model is the database record. For example, if the mttrapd rules file is configured to use the same identifier for any interface error (buffer overruns, CRC, etc.), no simple model using automations and database fields can be built to track these events separately.

Model Crosstalk and Failover Constraints

In other cases, processes might disrupt the model unexpectedly. For example, a misconfigured probe or the generic clear automation could reset the database field, and thus the model, in random cases. Similarly, if the bidirectional gateway is not set up correctly, the model’s database fields could fail to copy across or get reset depending on which automation is firing when on both servers.
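The way a State field and a guarded WHERE clause work together can be tried out away from the Object Server. The sketch below uses Python's built-in sqlite3 module as a stand-in; it is not ObjectServer SQL and not the automations used in the case study. The State values come from the table above, while the cut-down status table, the use of Severity = 0 to mean "resolved", and the run_escalation function are assumptions made for illustration.

```python
import sqlite3

# State values from the table above.
RECEIVED, TIER_2, TIER_3, REMOVED = 0, 1, 2, 3


def make_store():
    """Create an in-memory stand-in for a cut-down alerts.status table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE status ("
        " Identifier TEXT PRIMARY KEY,"
        " FirstOccurrence INTEGER,"  # epoch seconds
        " Severity INTEGER,"         # assumption: 0 means the event is cleared
        " State INTEGER DEFAULT 0)"
    )
    return db


def run_escalation(db, now):
    """One pass of the model; every UPDATE is guarded by the current State."""
    # Resolved events leave the escalation process from any active state.
    db.execute(
        "UPDATE status SET State = ? WHERE Severity = 0 AND State != ?",
        (REMOVED, REMOVED),
    )
    # Tier-2 to tier-3 after 60 minutes. This runs before the tier-1 step so
    # that an event moves at most one state per pass.
    db.execute(
        "UPDATE status SET State = ? "
        "WHERE State = ? AND ? - FirstOccurrence >= 3600",
        (TIER_3, TIER_2, now),
    )
    # Tier-1 to tier-2 after 15 minutes. Without the State guard, events
    # already escalated or removed would be pulled back to tier-2.
    db.execute(
        "UPDATE status SET State = ? "
        "WHERE State = ? AND ? - FirstOccurrence >= 900",
        (TIER_2, RECEIVED, now),
    )
```

Dropping the State condition from any of these statements reproduces the failure described in the second caveat: events that are not in the intended state get transitioned anyway and the model breaks.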
Automation Design

As stated earlier, automations provide the transport mechanism between states within the policy by changing the values of the database fields. Where the database fields remember the current state of the model, the automations provide the logic, such as when and if the event should be escalated. Details such as how often the events should be checked, what conditions should be considered, and what actions need to be performed are all specified within the automations.

Correctly understanding how each type of trigger works is essential to the design of the policy. In general, there are three types of automations: temporal, database, and signal. Temporal automations can be used for transitions based on a timer as well as for transitions that can place a load on the system; a temporal automation guarantees the regularity at which the automation executes. Database automations provide an interrupt-driven rather than a polled transition. Signal automations have less relevance in policy building, although they have their uses when augmenting an existing administration function. The Lights Out NOC automation design is discussed in detail in the first case study.

Solution Implementation and Testing

Once the design is completed and implemented, each transition in the model should be unit tested. This can be accomplished by monitoring the data stream or by simulating the actions through database insertions using nco_sql. If possible, every known event occurrence should be tested to provide the most comprehensive testing. In addition, failover should be forced for every state of the model to verify that the model is correctly designed.

After the solution is rolled into production, it should still be monitored. It is unlikely that every possible case has been caught in the development, test, and stage environments. As a result, expecting and planning for issues is the more cost-effective, customer-centric approach.
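The difference between an interrupt-driven (database) automation and a polled (temporal) automation can be sketched in Python as follows. The EventTable class, the Owner and FirstOccurrence keys, and the 300-second age limit are all hypothetical stand-ins chosen for the illustration; none of this is the ObjectServer API.

```python
class EventTable:
    """Toy stand-in for an event table; not the ObjectServer API."""

    def __init__(self):
        self.rows = {}
        self.insert_hooks = []  # database-style automations, fired on insert

    def insert(self, identifier, row):
        self.rows[identifier] = row
        for hook in self.insert_hooks:
            hook(row)


def mark_unowned(row):
    """Database-style automation: runs once, at the moment of insertion."""
    if not row.get("Owner"):  # "Owner" is a hypothetical field
        row["State"] = 3      # not participating in the escalation process


def temporal_pass(table, now, max_age):
    """Temporal-style automation: called on a timer, it escalates every
    Tier-1 row that has been waiting for at least max_age seconds."""
    escalated = []
    for identifier, row in table.rows.items():
        if row["State"] == 0 and now - row["FirstOccurrence"] >= max_age:
            row["State"] = 1
            escalated.append(identifier)
    return escalated
```

A sketch like this can also serve as a cheap test bed: each transition can be exercised by inserting a row and calling the pass with a chosen clock value before the real automations are written.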
Documentation and Training

Summary

Through a simple scenario, this section described the four-step approach to creating policy models within an Omnibus Object Server. However, models can be created and used for numerous more complex scenarios. The following sections describe several solutions that were deployed at Fortune 500 businesses.

Case Study: Lights Out NOC (Omnibus 3.X)

Introduction

This case study reviews a simple escalation and paging notification policy built for a Fortune 500 company with a pre-existing Netcool architecture. Their Omnibus solution remained on version 3.6 due to pre-existing integrations and lack of budget. Since several companies remain in this earlier state, this case study presents a 3.X based solution.

The company's architecture consisted of a Netcool virtual server pair collecting events from five probes. Initially a simple paging procedure was developed to send pages to the responsible engineer, but because an excessive number of pages was being sent, the engineers began to ignore them. As such, the system was ineffective. The client required a solution that capitalized on their existing investment and would increase the effectiveness of their system.

Background

A NOC staff monitored the network using the Netcool architecture illustrated in the following figure. The system also sent out pages for specific events.
To issue the actual pages, Netcool calls the nco_page utility, which relays the requests to the paging application. The probes collect network information and translate it into events. Using a lookup file, the probe adds the engineer responsible for the event to the custom PageDestination field during probe processing. If the device creating the event is unknown, or no engineer is assigned to the event, the field is left blank. The event is then passed on to the Object Server. The Object Server's automations call the nco_page script to page the engineer if the PageDestination field is populated. The nco_page script calls the paging application, which looks up the value originally set by the probe in the PageDestination field in its configuration file to determine which people or groups need to be paged. The page is then sent.

Figure 4: Lights Out NOC - Enterprise Monitoring Architecture
(Diagram components: a Primary Server and a Secondary Server, each running the paging application, the nco_page script, and an Object Server (Primary_Object_Server and Secondary_Object_Server) joined by a bidirectional gateway to form the Virtual Object Server; Systems 1, 2, and 3 hosting two TrapD probes, a Ping probe, a Syslog probe, and an HPOV NNM5 probe, with Paging.lookup files.)

Figure 5: Lights Out NOC - Notification Suppression GUI

The final aspect of the paging process is the paging/notification blackout period. Through WebTop, a blackout web page (a cgi form) was created to enable users to create blackout periods. The cgi form created a special event in the Object Server that indicated the blackout start and stop times. Within the Object Server, an automation suppressed any events corresponding to the blackout event, thus preventing notifications from being sent out for those events.
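The paging decision described above can be summarized in a few lines of Python. This is a sketch under assumptions rather than the client's implementation: the lookup is modeled as a dictionary keyed by node name, and a blackout is assumed to match an event on its Node field and on a start/stop time range, which the source does not spell out.

```python
def page_destination(node, lookup):
    """Probe side: return the paging group for a node, or "" when the
    device is unknown or has no engineer assigned."""
    return lookup.get(node, "")


def in_blackout(event, blackouts):
    """Assumption: a blackout applies when the node matches and the event
    time falls between the blackout start and stop times (inclusive)."""
    return any(
        b["Node"] == event["Node"] and b["Start"] <= event["Time"] <= b["Stop"]
        for b in blackouts
    )


def should_page(event, blackouts):
    """Object Server side: page only when PageDestination is populated
    and no blackout period covers the event."""
    return bool(event["PageDestination"]) and not in_blackout(event, blackouts)
```

Note that nothing in this decision limits how many pages go out, which is the weakness the rest of the case study addresses.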
Although this process correctly sent pages to the engineers during the correct time periods, it had one major flaw. So many pages were sent that engineers began ignoring their pagers. This rendered the mechanism ineffective. More discretion was necessary.

The Solution

The best solution available was to use a policy to develop an escalation system that reduced the number of pages. However, the client could not afford the overhead of a policy engine such as Impact. It was decided that a combination of database fields and automations would be used to create the necessary policy model. Architecting this solution was a four-step process:

1. An effective paging policy had to be agreed upon.
2. The model needed to be drawn out as a flow diagram and visually tested to ensure that the policy was implemented effectively.
3. The model had to be implemented using automations and database fields.
4. The end result needed to be unit tested to ensure effectiveness.

Paging Policy Development

The first step in rectifying the problem was to develop an effective paging policy. After some discussion among the managers, engineers, and NOC staff, the following policy was agreed upon:

1. The engineer that is responsible for a device will receive the page. If the engineer responsible for the device is unknown, no page is sent.
2. Pages do NOT go out for down events resolved within 5 minutes by a corresponding up event.
3. Pages MUST go out for down events not resolved after 5 minutes.
4. Pages do NOT go out for up events that do NOT match a "paged" down event.
5. Pages MUST go out for up events that correspond to "paged" down events.
6. Severity MUST be set to 0 for any matching up and down events. (This enabled a pre-existing cleaning automation to delete the records after a period of time.)

Paging Solution - First Attempt with Delay Edge Triggers

Once the paging policy was decided, a delay edge trigger solution was attempted.
The trigger was configured to activate on a per row basis after a one-interval time delay. The time interval for the trigger was 3 minutes. Any event requiring a page that was still outstanding after 3 minutes would cause the ascend action to generate a problem page, while the disappearance of the event would cause the descend action to generate a resolved page for the event.

At first this seemed like a reasonable approach. However, a review of how edge triggers are designed reveals a flaw in the logic. Even though the delay edge trigger is configured to perform its ascend and descend actions on each record, the activation of the trigger itself occurs on the entire group of records. This results in four distinct types of behavior, depending on how the lifespan of an event falls relative to the lifespan of the delay edge trigger.

Figure 6: Lights Out NOC - Delayed Edge Trigger Example
(Timeline showing the lifespans of Events A, B, C, and E against the periods in which the trigger condition is met and not met, with the ascend and descend actions marked.)

Event A   An ascend action is executed by the edge trigger for the event. This does not meet the logic's requirements because no descend action is executed for the event.
Event B   An ascend action and a descend action are executed. This meets the logic's requirements.
Event C   No action is executed. This does not meet the logic's requirements.
Event E   A descend action is executed. This does not meet the logic's requirements because no ascend action is executed for the event.

Events A and B are present when the ascend action is executed, and the trigger executes an action for these two events. Events C and E occur after the trigger has executed; therefore no ascend actions are executed for these events. Events A and C resolve, but since events B and E still meet the trigger's condition, no descend action is executed. Finally, events B and E resolve and the descend action is executed.
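The four behaviors can be reproduced with a small Python simulation. It is a simplified model of the behavior described above, not of the actual trigger engine: the one-interval delay is ignored, the trigger is sampled at whole-number ticks, the lifespans are invented, and the descend action is assumed to run against the rows seen in the last sample that still matched the condition.

```python
def simulate_edge_trigger(lifespans, end_time):
    """lifespans maps an event name to (start, stop); the event matches the
    trigger condition for start <= t < stop. The trigger activates on the
    group of matching rows, while its actions run once per row."""
    actions = {name: [] for name in lifespans}
    condition_met = False
    last_matched = []
    for t in range(end_time + 1):
        matched = [n for n, (start, stop) in lifespans.items() if start <= t < stop]
        if matched and not condition_met:
            # Rising edge for the group: ascend runs for rows present now.
            for name in matched:
                actions[name].append("ascend")
            condition_met = True
        elif not matched and condition_met:
            # Falling edge for the group: descend runs for the rows that
            # were present in the last matching sample (assumption).
            for name in last_matched:
                actions[name].append("descend")
            condition_met = False
        if matched:
            last_matched = matched
    return actions


# Hypothetical lifespans arranged like Events A, B, C, and E above.
example = {"A": (0, 4), "B": (0, 8), "C": (2, 4), "E": (2, 8)}
result = simulate_edge_trigger(example, end_time=9)
```

With these lifespans only Event B receives both an ascend and a descend action; A receives only an ascend, E only a descend, and C neither.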
To meet the logic's requirements, an ascend and a descend action would need to be executed for each event. The use of a delay edge trigger at a row level therefore will not work.

Paging Solution - Second Attempt with Automations and Database Fields

Since the edge trigger would not work, it was decided to design a model-based solution. The paging policy was converted into the two related models shown below: one model for the down event and another model for the up event.

Figure 7: Lights Out NOC - Policy for a Down Event
(Flow: Down Event Initial State; NeoPageDown (Trigger): Meet Paging Criteria; NeoPageDown (Action): Send Problem Page.)

Figure 8: Lights Out NOC - Policy for an Up Event
(Flow: Up Event Initial State; NeoCorrelateUp (Trigger): Matching Down Event? If no, Clear Up Event. If yes, NeoCorrelateUp (Action): Down Page Sent? If no, Clear Down Event; if yes, Mark Down Event for Page. NeoFireUpPage (Trigger): Event Marked for Page? If yes, NeoFireUpPage (Action): Send Page and Clear Down Event.)

When Netcool receives a down event, it begins the Down Event model. This model waits in the Down Event Initial State until either a corresponding up event occurs or the down event ages more than 5 minutes. If a corresponding up event occurs within 5 minutes, the NeoCorrelateUp automation clears the down and up events. If no corresponding up event is detected within 5 minutes, the NeoPageDown automation executes, sending a page and updating the event to indicate that a page has been sent.

If the corresponding up event occurs 5 minutes or more after the down event, the Up Event policy begins. The NeoCorrelateUp automation determines whether there is a down event corresponding to the up event and, if so, whether a page has been sent out for the down event. If a (problem) page has been sent out, the NeoCorrelateUp automation updates the down event to indicate that another (resolution) page needs to be sent.
The NeoFireUpPage automation finds the updated down event that requires a resolution page and sends out the page. If no (problem) page has been sent out for the corresponding down event, the NeoCorrelateUp automation clears the up and down events. If no down event is found, the automation clears the up event.

Translation

The implementation required the addition of database fields and the creation of three automations, each with a trigger. Four fields were added to the Netcool database. They are as follows.

Field             Description
PageDestination   The probe populates this field. It specifies a Telamon paging group representing the engineer who is responsible for the device.
USummary          The cleared Up event's Summary field is copied into the matching Down event's USummary field.
ClearedBy         Either 0 or 1. Used to determine whether an Up event has been correlated to a Down event.
Page              This field is set to 1 if a page is to be sent. It simplifies the policy by offloading to the nco_page script the responsibility of determining whether an Up event requires a page.

Table 9: Lights Out NOC - Field Additions

Automations

In addition to the field additions, the following triggers and automations were used to create the policies.

Trigger Name     NeoDownPage
Sample Rate      57
Type             Level
Threshold        0
Row Count        Greater than 0
Execution        For each matched row
Condition        select * from alerts.status where Page = 0 and Type = 1 and ClearedBy = 0 and Severity > 0 and PageDestination <> '' and ServerName = 'NCOMS' and (StateChange < getdate - 300);
Ascend Action    NeoDownPage
SQL Action       update alerts.status via '@Identifier' set Page = 1;
Executable       /opt/Omnibus/utils/nco_page**
Host
Parameters       -d @PageDestination -n @Node -u @Mastername -m @Summary
Run As

** This PERL script was augmented so that if the Page field is set to 0, no page goes out. If the Page field is set to any other value, a page goes out.
Figure 10: Lights Out NOC - Trigger NeoDownPage

Trigger Name     NeoCorrelateUp
Sample Rate      58
Type             Level
Threshold        0
Row Count        Greater than 0
Execution        For each matched row
Condition        select * from alerts.status where Type = 2 and ClearedBy = 0 and Severity > 0 and ServerName = 'NCOMS';
Ascend Action    NeoCorrelateUp
SQL Action       update alerts.status set USummary = @Summary, ClearedBy = 1 where Type = 1 and ClearedBy = 0 and Manager = '@Manager' and AlertGroup = '@AlertGroup' and AlertKey = '@AlertKey' and Node = '@Node' and ServerName = 'NCOMS' and Severity > 0;
                 update alerts.status via '@Identifier' set Severity = 0;
Executable
Parameters
Host
Run As

Figure 11: Lights Out NOC - Trigger NeoCorrelateUp

Trigger Name     NeoFireUpPage
Sample Rate      57
Type             Level
Threshold        0
Row Count        Greater than 0
Execution        For each matched row
Condition        select * from alerts.status where Type = 1 and ClearedBy > 0 and Page = 1 and Severity > 0 and ServerName = 'NCOMS';
Ascend Action    NeoFireUpPage
SQL Action       update alerts.status via '@Identifier' set Severity = 0;
Executable       /opt/Omnibus/utils/nco_page
Parameters       -d @PageDest -n @Node -u @Mastername -q @Page -m @USummary
Host             localhost
Run As           0

Figure 12: Lights Out NOC - Trigger NeoFireUpPage

The behavioral model highlights three points.

1. All up events can immediately be set to severity 0 (clear), regardless of whether a matching down event is discovered.
2. Down events that do not have their PageDestination field populated do not issue a page.
3. Repetitive (flapping) events that resolve within 5 minutes are ignored by the model.

Case Study: XinY (Omnibus 7.X)

Introduction

Often when detecting network, system, and other problems, the individual events cannot be taken out of the context of a larger picture.
For example, sometimes the number of occurrences of a particular event within a window of time determines whether there is an issue. This type of detection is called XinY functionality because X instances within a window of Y seconds flag a problem.

This section describes a case study in which a Fortune 500 company requested that XinY functionality be built into the existing Netcool fault management solution such that, upon detection of an XinY event, either:

- The Severity is incremented, unless it is already set at Critical (5), in which case the AlarmPriority is incremented; or
- The Summary is prefixed with the text "XinY Policy X events in Y seconds Met:", where X and Y are populated with the correct number of instances and seconds.

In addition, the XinY solution needed to provide the following abilities:

- Allow for individually specified X and Y values for different types of events.
- Unset the XinY condition after a period of time. (The time period is globally set.)
- Track XinY state even if an event is deleted and reinserted into the event list.
- Ignore the GenericClear associated with a problem event.

There were many possible XinY implementations that met these criteria. This case study explores an implementation which:

- Reviewed the requirements of the customer and the demands of the pre-existing Netcool architecture.
- Reviewed and ranked six possible XinY solutions, recommending one solution for implementation.
- Implemented one of the XinY solutions with a detailed design.
- Fully tested the implementation.

At the end of this case study the reader should have enough information to evaluate their own customer needs and environmental demands and to determine the correct XinY approach for their customer environment.
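The core XinY test and its two effects can be sketched in Python as follows. This is an illustration of the requirement only, assuming a sliding window held in memory; the class name, the effect labels "severity" and "summary", and the event dictionary are hypothetical and do not correspond to any implementation discussed later.

```python
from collections import deque


class XinYDetector:
    """Flags a problem when at least x occurrences fall within y seconds."""

    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.times = deque()  # timestamps still inside the sliding window

    def observe(self, timestamp):
        """Record one occurrence and report whether the XinY condition is met."""
        self.times.append(timestamp)
        # Drop occurrences that are more than y seconds older than this one.
        while timestamp - self.times[0] > self.y:
            self.times.popleft()
        return len(self.times) >= self.x


def apply_effect(event, effect, x, y):
    """Apply one of the two requested effects to an event dictionary."""
    if effect == "severity":
        if event["Severity"] < 5:
            event["Severity"] += 1
        else:
            # Already Critical (5): raise AlarmPriority instead.
            event["AlarmPriority"] += 1
    elif effect == "summary":
        prefix = f"XinY Policy {x} events in {y} seconds Met: "
        event["Summary"] = prefix + event["Summary"]
```

For example, with X = 3 and Y = 10, occurrences at 0, 4, and 9 seconds meet the condition on the third occurrence, while a later occurrence at 30 seconds starts from an empty window.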
Requirements

The requirements for a XinY solution for the customer depended on:

- The provided solution fragment
- Environment features

These requirements in turn pointed to four possible strategies to address these aspects.

Provided Solution Fragment

Very few projects are set up perfectly. Usually the customer defines not only what they need, but also aspects of the solution. Sometimes consulting with the customer can free up the solution constraints. However, this was not the case with this customer. In particular, as part of the initial requirements, a fragment of the XinY solution was provided. The population specification for the events through the rules files was decided via four additional fields within the event:

1. XinY*        Current X count against XinY state.
2. XinYXValue   Event count threshold for the event.
3. XinYYValue   Event time threshold for the event.
4. XinYEffect   The action taken when the XinY threshold is breached.

* Since the event will likely be deleted and possibly reinserted during the course of XinY calculations, this field could not be used to track the XinY X count unless the field was back-populated by an outside source. However, the field was leveraged as a semaphore to prevent corrupted updates to alerts.status due to simultaneous updates by the primary and secondary object servers.

Environment Features

In addition to the solution fragment, five key environment features dominated the potential architecture for the XinY logic and helped determine the final design:

1. The customer had a two-tier Netcool architecture: collection and aggregation/display.
2. "Blocked" events are pooled for five minutes inside collection and are not visible in aggregation. (NOTE: This "hold down" logic was created in a previous project. As part of the analysis for the XinY project, it was noted that marking rather than restraining the events in collection would have been a much less complex and more maintainable approach. However, the scope of the XinY project prevented revisiting and correcting this "hold down" logic.)
3. Non-blocked events are deleted every minute after forwarding to aggregation. Thus XinY "state" information cannot be stored within the events.
4. Without an added stream of data between collection and aggregation, the sliding window information is lost when events are pushed from collection to aggregation, since that information cannot be contained efficiently within the events (alerts.status).
5. Like XinY, the Generic Clear automation updates the Severity field. Thus, the conflict between these two solutions needed to be addressed.

Four possible strategies were devised to counter these issues.

Two were efficient:
1. Apply XinY logic BEFORE the collection layer.
2. Apply XinY logic completely within the collection layer.

One was inefficient and complex:
3. Shuttle sliding window information from the collection layer to aggregation.

One was inaccurate:
4. Use fixed windows instead of sliding windows between layers. (Thus all necessary XinY information is contained in the event's alerts.status fields.)

Based on these strategies, six possible approaches were discussed in response to the requirements. These are described in the next section.

Recommendations

Introduction

There are many possibilities for implementing XinY logic. A virtual meeting (conference call) was set up to brainstorm options among the engineers, the customer, and other stakeholders. As with most solutions, the biggest hurdles were political, not technical. In particular, the natural bias of a good engineer came into play. The customer had many good and great engineers. Good engineers are more created than born. They are forged from life experiences working among many companies, coupled with a strong personality and mindset. A lot of bad comes with the good. Inventing options within a group is not a strong point of the good engineer.
The pressures of business, an artist's personality, and the politics of past relationships often inhibit the openness required to build a full set of options within a team. As a result, most participants came with their minds already made up with regard to the correct solution. Having an open mind is not easy for engineers, especially good ones, who naturally tend over time to drink more deeply of their own Kool-Aid. Though such an approach serves the individual well, it often hinders the outcome of a team effort.

Much of the problem comes from the fact that engineers are creators. Like all artists, they tend to build from the inside out, turning natural creativity and inspiration into working solutions. Unfortunately, as with artists, their ego becomes inseparable from their solutions. Any criticism, justified or not, is perceived as a personal attack. Unlike salesmen, who create by adapting to the customer and the outside environment, engineers under pressure in a team tend to become more rigid in their beliefs and focus on their position. The result is a battle over positions rather than a brainstorm to generate possible solutions.

To combat these issues, five ground rules were given for the discussion in advance:

- Only one stakeholder to represent each area, voted in by the respective group.
- Strict agenda:
  o Brainstorm ideas and suspend all judgment until later
  o Select potential ideas to develop
  o Brainstorm development of the selected solutions
  o Solution selection
- Separate the person from the position/problem. Be hard on the mutually shared issues, not on people.
- Only one person may be irrational or emotional at a time (i.e., no emotional reaction to someone else's emotions).
- Focus on interests, not positions.

As a result, the mindset of the group was opened, and peer pressure and moderation by the consultant prevented escalation.
The next issue became the single-mindedness of the group. As plausible solutions emerged from the group, the tendency was to argue from these positions rather than brainstorm more approaches. This naturally leads to all-or-nothing thinking, where one approach can succeed only if another is compromised. This rarely reflects reality. So, to keep out of this rut, the focus was constantly returned to a discussion of issues rather than positions.

For example, at one point the brainstorming broke down into a struggle between using Impact and using Omnibus automations. People engaged in narrowing the gap between these two positions rather than broadening the discussion to other options. Further, the assumption was that only by compromising one approach could the other succeed. The focus became these two positions rather than inventing alternatives to the core issue that everyone shared. The moderator was able to expand the discussion by throwing in a third option: using PERL with the DBI and DBD::Sybase modules instead of either Impact or Omnibus. In the end, the PERL option was not even among the developed approaches. However, introducing the alternative at this stage broke apart the Omnibus and Impact camps. As a result, two probe options arose as possibilities.

The result of this session was six developed approaches to the XinY solution. The six approaches took into account the four strategies listed above and are listed in the table below.

#  Method                                 XinY Set    Max         Skill      System      HA/DR   Event     Rank
                                          Accuracy    XValue      Req.       Load                Reliant
1  Probe Universal Include File 1         Low         N/A         Lowest     Low         Low     *No       #4
   (Strategies 1 and 4)
2  Probe Universal Include File 2         Very High   <20+        Low        Low         Low     *No       #3
   (Strategy 1)
3  Collection Automations                 High        Unlimited   Med        Med         Med     +Yes      No
   (Strategy 2)
4  Collection Automations with XinY       High        Unlimited   Med        Med         Med     *+No      #1
   Lifespan tracking (Strategy 2)
5  Collection Impact                      High        Unlimited   Med-High   High        Med     *+No      #2
   (Strategy 2)
6  Aggregation Impact + Automations       Med         Unlimited   Highest    Very High   Med     +Yes      No
   (Strategy 3 and maybe 4)

* Special logic needed to handle Generic Clear since it too updates Severity
+ Applied at collection layer

Figure 13: XinY - Six possible Approaches

The table's fields are described as follows:

- XinY Set Accuracy: How quickly the XinY condition is detected and set.
- Max XValue: If a sliding window is used, the maximum X that can be used.
- Skill Req.: The staff skill level required to maintain the solution.
- System Load: The memory and CPU load of the solution.
- HA/DR: How accurate and resilient the high availability and disaster recovery associated with the solution are.
- Event Reliant: Whether the XinY state depends on the event not being deleted from alerts.status.
- Rank: The relative ranking of the recommendations.

Method 1: Probe Universal Include File 1

This was by far the simplest and most maintainable approach. An include rules file used arrays indexed by the Identifier to track the threshold conditions. The following is the proof of concept implemented for this method:

    ## The arrays X, Y, CurrX, and IsXinY must be declared with array
    ## statements at the start of the main rules file.
    ## IF XinY SET, NOTHING TO DO. (NOTE: THERE IS NO XinY UNSET in this logic)
    @XinYXValue=10   ## XinY Watermark (event count threshold)
    @XinYYValue=10   ## XinY Watermark (window length in seconds)
    if (!(match(IsXinY[@Identifier],"1"))){
        ## INITIALIZE ARRAY (NOTE: Taking the MAX in the dedup automation
        ## can prevent a restart from clearing the XinY state)
        if (match(X[@Identifier],"")){
            $tempdate = getdate
            Y[@Identifier] = timetodate($tempdate,"%D %T")
            X[@Identifier] = "1"
            CurrX[@Identifier] = "1"
            @XinY=0                      ## OBJECT SERVER MEMORY
            IsXinY[@Identifier] = "0"    ## PROBE MEMORY
            $State = "XinY Initialize."
        ## UPDATE X TALLY, AND CHECK IS XinY ASSERTED
        ## (NOTE: NO XinY RESET IN THIS LOGIC.)
        } else {
            CurrX[@Identifier] = int(CurrX[@Identifier]) + 1
            ## Events and seconds since the start of the current window
            $dx = int(CurrX[@Identifier]) - int(X[@Identifier])
            $dy = getdate - datetotime(Y[@Identifier],"%D %T")
            ## INSIDE TIME WINDOW
            if (int($dy) <= int(@XinYYValue)) {
                ## AND THRESHOLD CROSSED
                if (int($dx) >= int(@XinYXValue)) {
                    $State = "XinY Tripped."
                    IsXinY[@Identifier] = "1"   ## So Probe as well as Object Server knows
                    @XinY=1
                ## AND NOT CROSSED THRESHOLD
                } else {
                    $State = "XinY: Not tripped."
                }
            ## TIME WINDOW EXPIRED: start a new fixed window and take the
            ## current tally as its baseline
            } else {
                Y[@Identifier] = timetodate(getdate,"%D %T")
                X[@Identifier] = CurrX[@Identifier]
                $State = "XinY: Window expired, reset"
            }
        }
        ## DEBUGGING
        $Y = Y[@Identifier]
        $X = X[@Identifier]
        $CurrX = CurrX[@Identifier]
        details($Y,$X,$CurrX,$dx,$dy,$State)
        @XinYDebug = $State
        update(@XinYDebug)
    }

Probe rules arrays persist from event to event, which made this method a viable approach. However, the solution was inaccurate since the logic did not use a sliding window. Further, there were HA/DR issues, since all XinY state information was lost if the probe died (though this information was preserved through IHUPs). This solution preserved the XinY state when the event was deleted from the object server, since the information was stored within the probe. Further, this form of the solution was in production at more than two Fortune 500 companies.
Finally, clearing the XinY condition was implemented through a simple automation. The maximum delay in clearing the XinY condition was the poll period of the automation. The HA/DR issues, probe system limitations, and lack of a sliding window made this a less desirable solution for the customer in this case study.

Method 2: Probe Universal Include File 2

This solution mirrors Method 1 with the addition of a sliding window. The sliding window raised the event accuracy to the highest value. Only the delay in clearing the XinY condition via the automation compromised its accuracy. Still, this was the highest event accuracy that could be expected from any of the solutions. The HA/DR and probe system limitations made this a less desirable solution for the customer in this case study.

Method 3: Automations in Collection (Event Dependent)

This solution performed all its work in the collection tier rather than the aggregation tier. The automation that shuttled events from collection to aggregation was changed to enable XinY events to persist in collection until their XinY status cleared. This greatly simplified the determination of the sliding window and removed the need, when going to aggregation, either to discard the sliding window or to shuttle all the associated information back and forth.

This solution had four drawbacks. First, the solution was more complex than the probe solution: the sliding window information must be shuttled between the two object servers. Second, the deletion of aged events (i.e., old X) was performed by a temporal trigger, so delays led to inaccuracy. That is, in some cases an XinY event was asserted when the real problem was that the temporal trigger had not fired yet. By subtracting the temporal trigger period from the Y value, it was possible to control whether the error took the form of a false XinY assertion, a missed XinY assertion, or something in between.
Third, the load of the automation solution was placed on the object server rather than the probe. Fourth, the combination of Probe HA and Object Server HA could potentially cause an outage. Specifically, if the primary object server went down hard or for an extended period of time, there would be an outage for all events lasting up to the duration of the bidirectional gateway's poll cycle.

Method 4: Automations in Collection (Event Independent)

This method performed all its work in the collection tier. Unlike Method 3, this approach tracked the XinY state in a persistent table separate from the alerts.status table. Though the extra table increased the complexity over Method 3, it gave the XinY logic independence from the event shuttling logic between the tiers. The customer implemented this approach due to its logic independence and automation-centric approach.

Method 5: Collection Impact

Like Method 3, this method performed all its work in collection. To do so, Impact was moved from the aggregation tier to the collection tier. This allowed the normal XinY algorithm to have full visibility of the blocked events, which remain held in collection for five minutes. However, Impact's difficulties with failover and with proper function under high loads made it less HA/DR safe. This method also inherited the HA/DR issue caused by the incompatibility between the Probe HA and the Object Server HA. Finally, the added complexity of Impact made the solution less maintainable. Despite these drawbacks, the customer bias, system resource availability, and flexibility of the solution made this the second most desirable approach.

Method 6: Aggregation Impact

This method had all the complexity of Method 3 and Method 4 combined. The solution needed to operate independently within collection and aggregation. Further, any sliding window information in collection needed to be either discarded or shuttled from collection to aggregation.
In short, this method had to account for the many nuances arising from splitting the solution across the two tiers. This was the least desirable solution of the six.

Summary

Six potential solutions were developed with the engineers, customers, and other stakeholders. As a result, the biases and politics of the group were accounted for and expectations were set. The group settled on Method 4 due to its flexibility, event independence, HA/DR ability, and its decent accuracy, maintenance, and load requirements. Though the probe alternative appeared a better fit, the probe's system resources and the political mindset prevented it from being a viable option.

XinY Solution Design

Introduction

Once the automation-based XinY solution was selected, the detailed design began. The XinY solution was designed to:

Provide XinY state independence from event existence.
Co-exist with GenericClear.
Provide XinY unset as well as set ability.

Three tables were required instead of the standard two. In addition to alerts.status tracking the event state and alerts.XinYWindow tracking the sliding window, alerts.XinY was required to preserve the XinY state even when events were deleted from collection after being forwarded to aggregation. Along with the three tables, five new automations and modifications to two existing automations were required. The figure below describes the relationship between the tables and the automations. These are discussed in detail in the subsections below.
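To make the reason for the third table concrete, the following is a minimal Python sketch (not the ObjectServer implementation) of XinY state that survives the deletion of the event it belongs to. The class and method names are hypothetical; only the three table roles and the back-fill behavior come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class XinYState:
    """Hypothetical stand-in for one alerts.XinY row."""
    identifier: str
    tripped: bool = False
    now_severity: int = 0


@dataclass
class Collection:
    """Hypothetical in-memory model of the three collection-tier tables."""
    status: dict = field(default_factory=dict)   # alerts.status: Identifier -> Severity
    xiny: dict = field(default_factory=dict)     # alerts.XinY: Identifier -> XinYState
    window: dict = field(default_factory=dict)   # alerts.XinYWindow: Identifier -> timestamps

    def insert_event(self, identifier, severity, now):
        # Create the XinY state on first sight, otherwise reuse what persisted.
        state = self.xiny.setdefault(identifier, XinYState(identifier))
        self.window.setdefault(identifier, []).append(now)
        # Back-fill from the persisted state unless the event is a clear (Severity 0).
        if state.tripped and severity > 0:
            severity = state.now_severity
        self.status[identifier] = severity

    def forward_and_delete(self, identifier):
        # The event leaves collection; the XinY state and window stay behind.
        self.status.pop(identifier, None)
```

In this sketch, deleting an event touches only `status`, so a later re-insert with the same Identifier picks up the escalated severity again.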
Figure 14: XinY - Automation Flow Diagram (numbered flows 1-9 between Alerts.status [event state], Alerts.XinY [XinY state], and Alerts.XinYWindow [XinY sliding window state])

New_row and Deduplication

The modifications to these two automations update the XinY state and keep the alerts.status table in sync with the alerts.XinY table. The numbers below refer to the numbers in the figure above and specify the actions taken:

1. Initialize alerts.XinY with the XinY state and record the "before XinY state" values of the Summary, Severity, and AlarmPriority fields.
2. If alerts.XinY is already tracking a XinY state associated with the event, then:
   a. Update the XinY state in alerts.XinY.
   b. If XinY is set, populate alerts.status from alerts.XinY with the current Severity (unless the event is cleared), AlarmPriority, or Summary according to the XinYEffect value. (This is needed because event deletion or deduplication erases the current XinY state in alerts.status.)
   c. Update alerts.status XinY to 2 to indicate XinY is in progress. (This field is also used to keep the primary and secondary Object Servers from stepping on each other as they independently update the events.)

XinY_new_row and XinY_deduplication

These automations update the sliding window and assert and reassert the XinY condition. The numbers below refer to the numbers in the figure above and specify the actions taken:

3. Keep the XinY state current for the affected Summary, Severity, and AlarmPriority fields and update the sliding window state.
4. If the XinY state is triggered or retriggered:
   a. If this is the first time XinY is tripped:
      1. Update the affected alerts.status fields: Summary, Severity, and AlarmPriority.
      2. Update XinYEffect to enable the correct back-out procedures.
   b. Reset LastY to the value of LastOccurrence. This resets the fixed window for unsetting the XinY condition.
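The set and unset behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the production automations: it counts the window directly instead of keeping a running X counter, it does not clear the window on the first trip, and the class and method names are hypothetical. The unset rule (no retrigger for twice the Y period) is taken from the XinY_Expire listing in Appendix D.

```python
from collections import deque


class SlidingXinY:
    """Simplified stand-in for one alerts.XinY row plus its alerts.XinYWindow rows."""

    def __init__(self, x_required, y_seconds, unset_factor=2):
        self.x_required = x_required      # X: instances needed to trip
        self.y_seconds = y_seconds        # Y: sliding window length in seconds
        self.unset_factor = unset_factor  # quiet period, in multiples of Y, before unset
        self.window = deque()             # occurrence timestamps, oldest first
        self.tripped = False
        self.last_y = 0                   # start of the fixed unset window

    def expire(self, now):
        # Drop window entries that have aged outside the Y period.
        while self.window and self.window[0] < now - self.y_seconds:
            self.window.popleft()

    def record(self, occurred):
        # Add one occurrence; trip or retrigger when X instances fall inside Y.
        self.expire(occurred)
        self.window.append(occurred)
        if len(self.window) >= self.x_required:
            self.tripped = True
            self.last_y = occurred
        return self.tripped

    def unset_if_quiet(self, now):
        # Unset once no retrigger has happened for unset_factor * Y seconds.
        if self.tripped and now - self.last_y > self.unset_factor * self.y_seconds:
            self.tripped = False
            self.window.clear()
            return True
        return False
```

With X=3 and Y=60, occurrences at 0 and 30 seconds age out before the one at 100 seconds arrives, so the condition only trips on the third occurrence inside a single 60-second span.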
XinY_Expire

This automation times out the XinY state and sliding window events. In addition, the automation detects when it is time to unset the XinY state and unsets it. The numbers below refer to the numbers in the figure above and specify the actions taken:

5. Delete from alerts.XinYWindow the sliding window events that have aged outside the 'Y' parameter. If no window events remain, delete the XinY state from the alerts.XinY table.
6. Detect whether it is time to un-assert a XinY event and, if so:
   a. Revert to the previous Summary, Severity (unless currently cleared), and AlarmPriority alerts.status field values.
   b. Clear the XinY state information for the event from the alerts.XinY and alerts.XinYWindow tables.

XinYWindow_Expire

This automation provides a shortcut to counting the sliding window events, since counting is expensive in SQL. Instead, this automation decrements the alerts.XinY X field every time a record is deleted from the sliding window table. The numbers below refer to the numbers in the figure above and specify the actions taken:

8. The alerts.XinY X field is decremented every time a sliding window record expires.

XinY_CleanUp

This automation is purely precautionary. If things are not working well, it ensures that the alerts.XinY and alerts.XinYWindow tables do not grow without bound. The numbers below refer to the numbers in the figure above and specify the actions taken:

9. Any records older than a day are trimmed from the alerts.XinYWindow table. Any events older than 4 weeks are trimmed from the alerts.XinY table.

Test Scenarios

Introduction

The solution was put through a battery of tests to verify expected behavior. Each of the scenarios is described in the subsections below.

Scenario 1: XinY within Event Lifespan

Figure 15: XinY - Test Scenario 1 (XinY lifespan and event lifespan plotted against time)

This is the simplest scenario.
The counting, setting, and unsetting of the XinY event occur within the lifespan of a single event. In reality this will be rare except for blocked events, since non-blocked events are deleted from collection about every minute.

Scenario 2: XinY Lifespan Spanning Two Related Events

Figure 16: XinY - Test Scenario 2 (XinY lifespan and two event lifespans plotted against time)

This is the second most common scenario. The counting, setting, and unsetting of the XinY event occur over the lifespan of multiple inserts and deletes of an event (using the same Identifier). This means that when the event returns and XinY is set, the affected fields must be restored correctly in alerts.status from the current state: XinY and one of Summary, Severity, or AlarmPriority. In addition, the XinY window expires when no event is in alerts.status. (Thus the XinY state needs to be cleared independently of the event state.)

Scenario 3: XinY Lifespan not quite Spanning Two Related Events

Figure 17: XinY - Test Scenario 3 (XinY lifespan and two event lifespans plotted against time)

This is the most common scenario. The counting, setting, and unsetting of the XinY event occur over the lifespan of multiple inserts and deletes of an event (using the same Identifier). The XinY state expires while an event is still in the alerts.status table. Thus, both the XinY and event state need to be cleared.

Scenario 4: Clean HA/DR Failover (between XinY events)

Figure 18: XinY - Test Scenario 4 (HA/DR event, XinY lifespan, and event lifespan plotted against time)

This is a special case where a failover occurs outside a XinY event.

Scenario 5: Dirty HA/DR Failover (during an XinY event)

Figure 19: XinY - Test Scenario 5 (HA/DR event, XinY lifespan, and event lifespan plotted against time)

This is a special case where a failover occurs within a XinY event.
Scenario 6: An XinY Failure to Thrive

Figure 20: XinY - Test Scenario 6 (XinY lifespan and event lifespan plotted against time)

This is a case where the XinY assertion never happens because not enough instances occur within the window.

Scenario 7: An interrupted XinY Lifespan (by GenericClear)

Figure 21: XinY - Test Scenario 7 (XinY lifespan and event lifespan plotted against time)

This is a case where a Generic Clear causes the problem event to clear out. In this case the XinY state should persist and ignore the Generic Clear, but allow the Generic Clear to delete the event as it normally would. If a new instance of the event occurs, the XinY state is used to back-fill the correct Summary, Severity, or AlarmPriority according to the XinYEffect set for that event.

Summary

After two relatively short weeks, the XinY solution was created and rolled into production.

Case Study: Event Audit (Omnibus 7.X)

Introduction

One of the largest weaknesses of the Netcool suite is the inability to determine which automation, Impact policy, probe rule, or other entity updated an event and when. The Event Audit Automations combined with the Audit Report Script were designed to solve this problem. This case study describes the steps to implement an event audit solution through automation changes and additions, along with additional alerts tables and alerts.status fields.

Automation Audit Architecture

The Event Audit Solution is a fully extendable and scalable way of tracking which automations touch which events and when. It can be further extended to track probe and other program updates to the events by applying the changes made in the state_change automation to the new_row and deduplication automations as well. If a new automation or program is added to the Netcool solution, it too can be audited. The new automation or program simply must update the event's Audit field with its name.
The state_change automation picks up this value and creates an audit trail of the change. If there is more than one insert and/or update statement in the new automation or program, appending an index to the program name in the Audit field provides further granularity in the audit.

The Event Audit Solution consists of 4 parts:

1. Audit field in alerts.status
2. State_change automation (plus new_row and deduplication if probe tracking is desired as well)
3. Event inserting/updating automations and/or tools
4. clean_alarmaudit_table automation

The audit process occurs in a 4 step life cycle:

1. The event inserting and/or updating automation populates the alerts.status Audit field with its name (and an index if there is more than one insert and/or update statement).
2. Before the insertion occurs, the state_change automation takes the Audit field and populates the alerts.alarmaudit table through a simple SQL statement:

   insert into alerts.alarmaudit VALUES (0,getdate(),new.Audit,new.Identifier);

3. The audit trail remains until the event in alerts.status is deleted from the system.
4. Within a minute after the event is deleted from the alerts.status table, the clean_alarmaudit_table automation runs and clears out the audit trail.

The result of this solution is a fully extendable audit capability that tracks which automation or program touched which events and when.

Automation Audit Report

The Automation Audit Report (AuditTable.cgi) is a PERL program that provides statistics on which automations have touched which events and when. The program has four command line options:

Identifier: Provides statistics on a particular event as identified by its identifier.
Node: Provides statistics on a particular node.
Start: Relative to the present, when the period covered by the report starts.
End: Relative to the present, when the period covered by the report ends.

Example runs and results of this report program are included in the appendices at the end of this whitepaper.

Installation

Read and understand this entire section before beginning installation. In particular, know exactly what is needed for the tables, because there is a bug in Omnibus where deleting a column or making other edits to a table causes the deletion of all automations that touch that table. Needless to say, this can take a few minutes to resolve and results in an unacceptable outage in production.

Step 1: Create the table alerts.alarmaudit

This can be performed either with nco_sql at the command line or through the Admin GUI. The command line version is as follows:

-- CREATE THE TABLE
CREATE TABLE alerts.alarmaudit PERSISTENT
(
    Key INCR PRIMARY KEY,
    Occurred TIME,
    Entity VARCHAR(64),
    Identifier VARCHAR(255)
);
go

In the GUI you will see the following if the table is created correctly:

Figure 22: Admin GUI - Table Creation

Step 2: Add the Audit field (varchar(64)) to alerts.status

The solution also requires adding the Audit field (varchar(64)) to alerts.status. This can be performed either with nco_sql at the command line or through the Admin GUI.

Step 3: Insert an initial record into the alerts.alarmaudit table

This step can be performed from the command line using the nco_sql utility.
Within nco_sql enter the following command:

INSERT INTO alerts.alarmaudit VALUES (0,getdate,'TestAutomation','TestEvent');
go

Step 4: Create the Clean_Alarmaudit_Table automation

Figure 23: Admin GUI - Automation Clean_alarmaudit_table

This automation deletes any audit entry not associated with an event in the alerts.status table. This step can be performed through the Admin GUI. The automation needs specific values set as shown in these figures.

Figure 24: Admin GUI - Automation Clean_alarmaudit_table

After creating this automation, use the nco_sql command at the command prompt to verify that it deletes the entry added to the audit table in step 3.

Step 5: OPTIONAL: Make a place holder for tracking probe events

When probes and other tools insert new events into the alerts.status table they invoke the new_row automation. Through the Admin GUI insert the following lines in GREEN to stage tracking these updates at a later date:

Figure 25: Admin GUI - Automation new_row

Step 6: OPTIONAL: Make a place holder for tracking probe events

When probes and other tools insert on top of existing events in the alerts.status table they invoke the deduplication automation. Through the Admin GUI insert the following lines in GREEN to stage tracking these updates at a later date:

Figure 26: Admin GUI - Automation deduplication

Step 7: Update state_change to track automation updates

This automation is the workhorse of the audit automation solution. Specifically, when automations or other tools update existing events (rather than insert new events or insert on top of existing events) in the alerts.status table they invoke the state_change automation.
Through the Admin GUI insert the following line in the automation:

insert into alerts.alarmaudit VALUES (0,getdate(),new.Audit,new.Identifier);

This populates the audit table each time an automation touches an event. The Audit field contains the automation that last touched the event. Since this is a pre-insert automation, it guarantees consistency and no trashing of updates:

Figure 27: Admin GUI - Automation state_change

Step 8: Update GenericClear, Expire, and other automations to enable Event Auditing

Setting the Audit field sets the stage for the state_change automation to populate the audit table for every event that the automation touches. Thus, any automation with an insert, reinsert, or update must be changed to assign its automation name (and sub index, if required) to the Audit field. Below is an example of an updated GenericClear (rudimentary) automation.

Figure 28: Admin GUI - Automation GenericClear

Similar updates can be performed on the Expire automation:

Figure 29: Admin GUI - Automation expire

Step 9-N: Update Other Event Specific automations (Future Expansion)

Any automation that updates an event via the insert or update command needs to be modified to set the Audit field to the name of the automation. If there is more than one insert or update command, numeric tags should be appended to indicate which command updated which event (e.g. Automation-1). This simple step sets the stage for the state_change automation to populate the audit table for every event that the automation touches.

Summary

The previous sections described the step by step installation of the event auditing solution. Though solution design and implementation should be shared by the same responsible parties, this ideal was not possible.
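The audit life cycle installed in the steps above can be illustrated with a small Python simulation. This is only a sketch of the idea, not ObjectServer code: the class and method names are hypothetical, and the trigger behavior is reduced to plain method calls.

```python
import time


class AuditedStatusTable:
    """Hypothetical in-memory model of alerts.status plus alerts.alarmaudit."""

    def __init__(self):
        self.status = {}      # Identifier -> dict of event fields
        self.alarmaudit = []  # (Occurred, Entity, Identifier) audit records

    def update(self, identifier, audit_name, now=None, **fields):
        # The updating automation sets the Audit field to its own name...
        row = self.status.setdefault(identifier, {})
        row.update(fields)
        row["Audit"] = audit_name
        # ...and the state_change step copies it into the audit table.
        occurred = now if now is not None else time.time()
        self.alarmaudit.append((occurred, audit_name, identifier))

    def delete(self, identifier):
        self.status.pop(identifier, None)

    def clean_alarmaudit(self):
        # Mirrors clean_alarmaudit_table: drop records whose event no longer exists.
        self.alarmaudit = [r for r in self.alarmaudit if r[2] in self.status]
```

A name with an index, such as "GenericClear-1", distinguishes individual update statements within one automation, as described in Step 9-N.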
The testing and deployment of the solution were performed by another group due to budget shortfalls. During the testing by that group, the Generic Clear automation generated over 98% of the audit records. The solution was modified to consolidate the statistics in much the same way that deduplication summarizes multiple alerts into a single event. Based on the results of the ongoing audit, plans were made to migrate the Generic Clear logic from the Object Server automation into the probe mttrapd and syslog rules files. Although the Generic Clear automation was still needed to resolve polled events, the majority of the traffic was trap and syslog based. In addition, the solution proved extremely effective at reducing troubleshooting time by narrowing down the culprit automation, rule, and/or script that erroneously updated various events. Time will tell if additional modifications to the logic are required.

Case Study: Impact Replacement

Introduction

In some cases, for a variety of technical and budgetary reasons, a custom script is the best solution. The most readily available methods to do this are via PERL and JAVA. In this case study the client had a 24 page Impact policy that had, over the course of 6 years, migrated from Impact 2.3 all the way to Impact 4.01. As a result, the policy was very complex and very customer-specific. This case study discusses the generic part of the project, which was to create a PERL script to simulate Impact's generic functionality. During the course of the 4 week project the generic aspects of the project included:

1. Write the PERL based Impact-like shell.
2. Test and improve the Impact-like shell efficiency and robustness.
3. Integrate the policy specific PERL.
4. Deploy the PERL program into production.

The result was a PERL based replacement for an Impact implementation of a policy. Appendix I contains the source code of the Impact-like PERL shell.
The program was written with five goals in mind.

1. Compatible Functionality
2. Flexibility
3. Speed
4. Robustness
5. Maintainability

The following sections describe how the program was architected to satisfy these requirements.

Compatible Functionality

The most important goal of the project was to keep the same functionality. To ensure this during the testing and initial production phases, the events were fed both to a Netcool stack using Impact and to a Netcool stack using the Impact-like PERL script. Another script periodically scraped the events from both solutions and highlighted the differences. Manual inspection of the differences determined whether the discrepancies were caused by true script functionality gaps, Impact bugs, or simple timing. After a couple of weeks the pseudo code of the Impact-like PERL shell logic coalesced to:

1. Declare packages, environment variables, database security variables, command line parameter defaults, and other database assignments.
2. Declare and prepare Oracle SQL statements for tables: equipment, site, card, path, port, and channel.
3. Load procedures: LogMsg, dumpnetcool, usage, and commandline. Each is documented with pseudo code.
4. Run main program:
   a. Prepare Netcool SQL for alerts.status and open the logfile and change summary file.
   b. Loop forever (unless DEBUG & 16, in which case only loop once).
      i. Pull events from Netcool alerts.status.
      ii. For each event:
      iii. Clear data structures.
      iv. Perform Impact Policy work.
      v. Build and execute UPDATE against Netcool alerts.status to update the event.
      vi. Sleep the remaining time in the cycle and log if the cycle has overrun its length. Make sure to sleep a minimum time between cycles ($CYCLEMIN).
   c. Disconnect from Netcool and Oracle.

Flexibility

From the outset it was predicted that the manner in which the script was used would evolve as the project progressed. This was certainly the case.
Midway through the project the script was used to audit the event stream, which required different functions. Further, benchmarking and debugging became stronger requirements as dictated by the end customer that used the system. Most of this flexibility was managed by three aspects: an open architecture, global configuration variables, and extendible command line options. The command line options became extensive. As a result a 'help' option was added to document the various ways the program could be run:

USAGE: psuedoimpact
  [-debug <debug number 1-255>]
        1   - Log any warning or errors.
        2   - Function Entry and exit logging.
        4   - Inner function verbose logging
        8   - Dump before and after Netcool alert.status fields.
        16  - Query Netcool alerts.status only once.
        32  - Clear old log.
        64  - Show initial alert.status field values.
        128 - Do not perform update, instead document what would be done.
  [-node] (Node to perform extra logging on.)
  [-logfile] (full path and file to log file.)
 *[-sumfile] (full path and file to the change summary file.)
  [-cycle] (Cycle period in seconds.)
  [-cyclemin] (minimum delay between cycles, even if the cycle overruns.)
  [-delay] (Ignore events with Populated field younger than)
  [-stdout] --Log to standard out
  [-help]

NOTE: Program must run as the user root

DEBUGING
**************************
There are 8 flags available for debugging.
When diagnosing a specific problem in development -255 or -127 is used as it will:
  Clear the old log (32)
  Query the events only once and exit the program (16)
  Show any warning or errors (1)
  Show entry and exit from functions (2)
  Provide verbose logging of the inner workings (4)
  Dump the initial values of the event's fields that can be updated (64)
  Dump the before and after values (if there is a change) for each event (8)
  Write a summary of the updates for each event in the sumfile (8)
Normal operation of the program can use the default DEBUG value of 1, which will log WARNING and ERRORS to an event's journal as well as to the logfile.

SUMFILE
***************************
* When debug flag 8 is set, the summary file contains a record of all proposed (debug & 128) or actual changes made in a ~ delimited file that can easily be viewed with Excel. It also includes any journal entries or detected errors. Each row represents one event. The following is a column description:

-- Fields used to identify and select the event...
  Node - Node referenced in the event.
  DESCR - Description of the Node (From Oracle.)
  IP_ADDRESS - IP Address of the Node (From Oracle.)
  MODEL - Model of the event (From Oracle.)
  AlertGroup - Netcool event field
  AlertKey - Netcool event field
  IsJournal - Journal exist?

-- Fields show the before and after values of netcool fields for the event. If there is no change, null is shown in the before and after values to make the changes stand out. Each field is repeated twice to show the before and after values respectively. The columns include:
  Customer
  Node
  Site
  Summary
  JournalEntry

-- Error messages against the event from processing.
  ERRORS

Speed

The initial tests of the program proved the script could only handle 1-2 events per second.
After some benchmark analysis it was discovered that the majority of the time was taken up by Oracle SQL queries and by the creation and destruction of hashes, which were caused by calls to the Perl DBI module's $sth->fetchrow_hashref. Two changes to the code addressed this:

1. Database select return function changes. The fetchrow_hashref() calls were converted to fetchrow_arrayref() calls. In addition, the bind_columns() function was used to avoid having to map array indexes to field names; the fields are instead read directly by name through bound variables.
2. Database select prepare function changes. All the various forms of the select statement were prepared ahead of the main loop. As a result, the initial parsing by the database and the binding of search variables were done in advance.

These changes enabled the script to handle 15-20 events per second rather than the initial 1-2 events per second.

Robustness

The robustness of the solution was improved in two ways:

Scheduler: The scheduler was written to provide a minimum downtime between cycles. This ensured that other components would not be starved of contact with the object server by the script's persistence. Additionally, checks were built into the scheduler to see if it overran its scheduled run time. If this occurred, messages were logged to this effect.

High availability/Disaster recovery: The program was written to have a primary and a secondary mode. The secondary gathers only older events to update and otherwise performs the same functionality as the primary. By changing the settings, any number of scripts could run to provide redundancy, or the algorithm could be changed to provide load balancing.

Maintainability

A significant percentage of the script is directed toward maintainability. These features are self-evident:

1. Debug option
2. Inline comments
3. Use of bind_columns() in combination with fetchrow_arrayref()

Together these features help to self-document the code.
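The scheduler behavior described under Robustness above (sleep the remainder of the cycle, always pause a minimum time, log overruns) can be sketched as follows. This is a Python illustration rather than the PERL from Appendix I; the function name and parameters are hypothetical, with `cycle` and `cycle_min` standing in for the -cycle and -cyclemin options.

```python
import time


def run_cycles(work, cycle, cycle_min, max_cycles,
               clock=time.monotonic, sleep=time.sleep, log=print):
    """Run work() once per cycle and return the number of overrun cycles."""
    overruns = 0
    for _ in range(max_cycles):
        start = clock()
        work()
        elapsed = clock() - start
        if elapsed > cycle:
            # The cycle took longer than scheduled: record it.
            overruns += 1
            log(f"cycle overran: {elapsed:.1f}s > {cycle}s")
        # Sleep the remaining time, but never less than the minimum pause,
        # so other ObjectServer clients are not starved.
        sleep(max(cycle - elapsed, cycle_min))
    return overruns
```

Passing `clock`, `sleep`, and `log` as parameters is a design choice of this sketch that allows the timing logic to be exercised without real delays.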
Summary

In large and small enterprises alike, the out of the box functionality of IBM Tivoli Netcool Omnibus is often not enough. Additional event enrichment and correlation, with or without outside data sources, may be required. Traditionally, Impact is selected as the only alternative. However, in many cases, through judicious use of probe rules, automations, and alerts.status database fields, an Impact-like policy can be created through these facilities alone. Further, in some cases replacing Impact with home grown scripts makes sense. This document described generally how to implement these alternatives and provided several case studies that are currently in production at several Fortune 500 companies.

Appendix A: XinY New Tables and Table Updates

Alerts.XinY

The XinY table is used to track the current state of the XinY property of an event.

Order  Name                  Datatype  Length  Primary Key  Description
2      Identifier            VarChar   255     Yes          Unique index for XinY state
3      LastOccurrence        UTC       4       No           Used for the time window calc
4      X                     Integer   N/A     No           Used for the instance calc
5      XinYXValue            Integer   N/A     No           Instance count required for XinY
6      XinYYValue            Integer   N/A     No           Time window limit for XinY
9      XinYEffect            Integer   N/A     No           What to do once XinY tripped
7      LastY                 UTC       N/A     No           Fixed window for XinY set/unset calculation
13     NowXinYSeverity       Integer   N/A     No           For backfilling events with current XinY state.
12     NowXinYAlarmPriority  Integer   N/A     No           For backfilling events with current XinY state.
14     NowXinYSummary        VarChar   255     No           For backfilling events with current XinY state.
10     PreXinYSeverity       Integer   N/A     No           For restoration if XinY unset
11     PreXinYAlarmPriority  Integer   N/A     No           For restoration if XinY unset
8      PreXinYSummary        VarChar   255     No           For restoration if XinY unset

Figure 30: Schema: Alerts.XinY

NOTE: The order field indicates the order of the columns. This is only important for the INSERT statements that appear in the automations new_row and deduplication. Otherwise the order of the columns can be changed.

Alerts.XinYWindow

The XinYWindow table is used to track the various instances of an event to calculate the X value (number of instances) in the XinY condition. As the events age outside the Y time period (in seconds), they are deleted from this table.

NOTE: The order field indicates the order of the columns. This is only important for the INSERT statements that appear in the automations XinY_new_row and XinY_deduplication. Otherwise the order of the columns can be changed.

Order  Name        Datatype  Length  Primary Key  Description
4      Idx         INCR      N/A     Yes          Unique index for XinYWindow event
2      Identifier  VarChar   255     No           Unique index for XinY state
3      Occurred    UTC       4       No           When it occurred (for sliding window calculations)

Figure 31: Schema: Alerts.XinYWindow

Alerts.Status

These are the additional fields added to alerts.status. They are the same as what was specified as part of the requirements for the solution.

Name        Datatype  Length  Primary Key  Description
XinY        Integer   N/A     Yes          1=new, 2=in XinY calc, 3=XinY set
XinYXValue  Integer   255     No           Instance count to hit.
XinYYValue  UTC       4       No           Time period limit
XinYEffect  Integer   N/A     No           What to do once XinY is tripped.
Figure 32: Schema Changes: Alerts.Status

Appendix B: Alerts.Status XinY Centric Automations

new_row automation

SETTINGS
On alerts.status
Pre database action on insert
Apply to row
Enabled

ACTION
begin
    if ( %user.is_gateway = false ) then
        set new.Tally = 1;
        set new.ServerName = getservername();
    end if;
    set new.StateChange = getdate();
    set new.InternalLast = getdate();
    if( new.ServerSerial = 0 ) then
        set new.ServerSerial = new.Serial;
    end if;

    -- Company Customer XinY Logic
    if (new.XinYEffect > 0) then
        for each row XinYrow in alerts.XinY where XinYrow.Identifier = new.Identifier
        begin
            if (new.XinYEffect = 1) then
                set new.Summary=XinYrow.NowXinYSummary;
            else
                if (new.Severity > 0) then
                    set new.Severity=XinYrow.NowXinYSeverity;
                end if;
                set new.AlarmPriority=XinYrow.NowXinYAlarmPriority;
            end if;
        end;
        INSERT INTO alerts.XinY VALUES(new.Identifier,new.LastOccurrence,1,new.XinYXValue,new.XinYYValue,0,new.Summary,new.XinYEffect,new.Severity,new.AlarmPriority,new.AlarmPriority,new.Severity,new.Summary);
        set new.XinY = 2; -- XinY activated.
    end if;
    -- ENDCompany Customer XinY Logic
end

deduplication automation

SETTINGS
On alerts.status
Pre database action on Reinsert
Apply to row
Enabled

ACTION
declare
    gw_dedup char( 255 );
    time_now utc;
begin
    -- Get the date once
    set time_now = getdate();
    if( %user.is_gateway = false ) then
        -- Deduplication for non-gateway clients
        set old.Tally = old.Tally + 1;
        set old.LastOccurrence = new.LastOccurrence;
        set old.StateChange = time_now;
        set old.InternalLast = time_now;
        set old.Summary = new.Summary;
        set old.AlertKey = new.AlertKey;
        if ( (new.Severity = 0) and (old.Severity > 0) ) then
            set old.ClearTime = time_now;
            set old.InitialSeverity = old.Severity;
        end if;
        set old.Severity = new.Severity;
    else
        -- Deduplication for gateway clients.
        -- This section of the trigger emulates
        -- the gateway deduplication in v3.6 ObjectServer.
        set gw_dedup = get_prop_value( 'GWDeduplication' );
        case
            -- Do not increment Tally
            when( gw_dedup = '0' ) then
                set old.LastOccurrence = new.LastOccurrence;
                set old.StateChange = time_now;
                set old.InternalLast = time_now;
                set old.Summary = new.Summary;
                set old.AlertKey = new.AlertKey;
                set old.Severity = new.Severity;
            -- Replace the 'old' row with the 'new' row
            when( gw_dedup = '1' ) then
                set row old = new;
            -- Drop the reinsert
            when( gw_dedup = '2' ) then
                cancel;
            -- Identical to non-gateway deduplication
            when( gw_dedup = '3' ) then
                set old.Tally = old.Tally + 1;
                set old.LastOccurrence = new.LastOccurrence;
                set old.StateChange = time_now;
                set old.InternalLast = time_now;
                set old.Summary = new.Summary;
                set old.AlertKey = new.AlertKey;
                set old.Severity = new.Severity;
            -- Any other value is taken to be a drop
            else
                cancel;
        end case;
    end if;

    -- Company Customer XinY Logic
    if (new.XinYEffect > 0) then
        for each row XinYrow in alerts.XinY where XinYrow.Identifier = new.Identifier
        begin
            if (new.XinYEffect = 1) then
                set old.Summary=XinYrow.NowXinYSummary;
            else
                if (new.Severity > 0) then
                    set old.Severity=XinYrow.NowXinYSeverity;
                end if;
                set old.AlarmPriority=XinYrow.NowXinYAlarmPriority;
            end if;
        end;
        INSERT INTO alerts.XinY VALUES(new.Identifier,new.LastOccurrence,1,new.XinYXValue,new.XinYYValue,0,old.Summary,new.XinYEffect,old.Severity,old.AlarmPriority,old.AlarmPriority,old.Severity,old.Summary);
        Set old.XinY = 2; -- XinY reactivated.
    end if;
    -- ENDCompany Customer XinY Logic
end

Appendix C: Alerts.XinY Centric Automations

XinY_new_row automation

SETTINGS
On alerts.XinY
Pre database action on Insert
Apply to row
Enabled

ACTION
Begin
    -- NOTE: Do not increment X, so that when X=0 => delete XinY state.
INSERT INTO alerts.XinYWindow VALUES (new.Identifier ,new.LastOccurrence,0); -- THIS APPEARS UNEEDED --SET old.NowXinYSummary = new.NowXinYSummary; --SET old.NowXinYSeverity = new.NowXinYSeverity; --SET old.NowXinYAlarmPriority = new.NowXinYAlarmPriority; end XinY_deduplication automation SETTINGS On alerts.XinY Pre database action on Reinsert Apply to row Enabled ACTION begin SET old.X=old.X+1; IF (((old.X >= old.XinYXValue) and (old.X < 32000)) or ((old.X >= old.XinYXValue-32000) and (old.X > 32000))) THEN -- XINY FIRST TIME, SO DO ACTION IF (old.LastY = 0) THEN SET old.X = 31999 + old.XinYXValue; -- Offset for decrementing from delete DELETE FROM alerts.XinYWindow WHERE Identifier=new.Identifier; -- New slate. IF (old.XinYEffect = 1) THEN UPDATE alerts.status SET Summary = 'XinY Policy (' + to_char(old.XinYXValue) + ' event in ' + to_char(old.XinYYValue) + ' seconds Impact without Impact Appendix C: Alerts.XinY Centric Automations 56 Met:' + old.PreXinYSummary, XinY=3 where Identifier=new.Identifier and XinY!=3; SET old.NowXinYSummary = 'XinY Policy (' + to_char(old.XinYXValue) + ' event in ' + to_char(old.XinYYValue) + ' seconds Met:' + old.PreXinYSummary; ELSEIF (old.XinYEffect = 2) THEN IF (new.PreXinYSeverity < 5) THEN SET old.NowXinYSeverity = new.PreXinYSeverity+1; UPDATE alerts.status SET Severity = old.NowXinYSeverity, XinY=3 where Identifier=old.Identifier and XinY!=3; ELSEIF (old.PreXinYAlarmPriority < 3) THEN SET old.NowXinYAlarmPriority = new.NowXinYAlarmPriority+1; UPDATE alerts.status SET AlarmPriority = old.NowXinYAlarmPriority, XinY=3 where Identifier=old.Identifier and XinY!=3; SET old.XinYEffect=3; -- Alternate undo behavior when unsetting XinY ELSE -- Exception code if Severity and AlarmPriority maxed out? 
SET old.XinYEffect=4; -- Alternate undo behavior when unsetting XinY END IF; END IF; END IF; -- UPDATE START OF UNSET FIXED WINDOW (NOT SLIDING WINDOW ON UNSET) SET old.LastY=new.LastOccurrence; -- NEW ALERTS.STATUS ON OLD XINY; REPOPULATE ALERTS.STATUS END IF; INSERT INTO alerts.XinYWindow VALUES (new.Identifier ,new.LastOccurrence,0); end Impact without Impact Appendix C: Alerts.XinY Centric Automations 57 Appendix D: Alerts.XinY Centric Automations XinY_Expire (Temporal Trigger) SETTINGS Every: 15 seconds Priority: 1 Enabled EVALUATE Bind As: XinYbind -- CLEAR XinY STATE REGARDLESS IF EVENT EXISTS CURRENTLY. select Identifier,X,XinYYValue,LastY,XinYEffect, PreXinYSeverity, PreXinYAlarmPriority, PreXinYSummary from alerts.XinY ACTION begin if %rowcount > 0 then for each row XinYrow in XinYbind begin -- UNSET XINY AFTER LONG ENOUGH PERIOD OF NO XINY: NOTE ASSUMES SAME SYSTEM CLOCK MATCHES EVENT CLOCK IF ((XinYrow.LastY > 0) AND ((getdate()-XinYrow.LastY) > XinYrow.XinYYValue*2)) THEN IF (XinYrow.XinYEffect = 1) THEN -- Cut new text from Summary UPDATE alerts.status SET Summary=XinYrow.PreXinYSummary, XinY=1 WHERE Identifier = XinYrow.Identifier and XinY!=1; ELSEIF (XinYrow.XinYEffect = 2) THEN -- Unset Severity UPDATE alerts.status SET Severity=XinYrow.PreXinYSeverity, XinY=1 WHERE Identifier = XinYrow.Identifier and XinY!=1 and Severity>0; ELSEIF (XinYrow.XinYEffect = 3) THEN -- Unset AlarmPriority UPDATE alerts.status SET AlarmPriority=XinYrow.PreXinYAlarmPriority, XinY=1 WHERE Identifier = XinYrow.Identifier and XinY!=1; -ELSEIF (XinYrow.XinYEffect = 4) THEN -- Unset Nothing END IF; DELETE FROM alerts.XinY WHERE (Identifier = XinYrow.Identifier); -Delete XinY lifespan record DELETE FROM alerts.XinYWindow WHERE (Identifier = XinYrow.Identifier); -- Delete XinY window records -- OTHERWISE AGE OUT OLDER XINY WINDOW EVENTS ELSE DELETE FROM alerts.XinYWindow WHERE (Identifier = XinYrow.Identifier) and (Occurred < (getdate()-XinYrow.XinYYValue)); end if; end; end if; 
Impact without Impact Appendix D: Alerts.XinY Centric Automations 58 DELETE FROM alerts.XinY WHERE (X=0); -- IF XinYWindow_Expire aged all events. end Impact without Impact Appendix D: Alerts.XinY Centric Automations 59 Appendix E: XinYWindow Centric Automations XinYWindow_Expire (Database Trigger) SETTINGS On alerts.XinYWindow Pre database action on Delete Apply to row Enabled ACTION Being -- Short cut for impossible SQL: update set X=(select count() from alerts.XinYWindow)... UPDATE alerts.XinY SET X=X-1 WHERE ((Identifier=old.Identifier) AND (X != 32000)); End XinY_CleanUp (Temporal Trigger) SETTINGS Every: 1 hour Priority: 1 Enabled ACTION Begin -- Stop gap measure too ensure stale events don’t hang around. DELETE FROM alerts.XinYWindow WHERE (Occurred < (getdate()-134400)); -- 1 day -DELETE FROM alerts.XinY WHERE (Occurred < (LastOccurrence-2419200)); -4 weeks End Impact without Impact Appendix E: XinYWindow Centric Automations 60 Appendix F: Eventstream.pl The following script was used to inject events into the alerts.status table to verify the XinYsolution functioned as expected. #!/usr/bin/perl ############################################################################# # # PROGRAM: eventstream.pl Daniel L. 
Needles 2009-10-06 # PURPOSE: Simulate events by directly inserting into alerts.status ############################################################################# # #use strict; use IPC::Open2; my $user='root'; # USER NAME my $name='DEV_COL01'; # DATABASE NAME (OMNIBUS) ## GENERIC CLEAR TEST my @cmds = ( # Node/IDentifer,WaitTime,Summary,Severity,XinY,XinYXValue,XinYYValue,XinYEffec t "'Node 0',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1b',5,'Good Event',0,2,0,0,0,0", "'Node 1b',5,'Good Event',0,2,0,0,0,0", "'Node 1b',5,'Good Event',0,2,0,0,0,0", "'Node 1b',5,'Good Event',0,2,0,0,0,0", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", "'Node 1',5,'Bad Event: Sev 4',4,1,0,2,20,2", # "'Node 1','Bad Event: Sev 4',4,1,2,10,1", # "'Node 1','Bad Event: Sev 4',4,1,2,10,1", # "'Node 2','Bad Event: Sev 4',4,1,2,10,1", # "'Node 3','Bad Event: Sev 4',4,1,2,10,1", # "'Node 4','Bad Event: Sev 4',4,1,2,10,1", # "'Node 5','Bad Event: Sev 4',4,1,2,10,1", # "'Node 6','Bad Event: Sev 4',4,1,2,10,1" ); ######################################################################## ########################### MAIN ##################################### ######################################################################## if ( $#cmds > 0) { #my $pid = open2(FIN,FOUT,"nco_sql -server $name -user $user -password $pass"); Impact without Impact Appendix F: Eventstream.pl 61 my $pid = open2(FIN,FOUT,"nco_sql -server $name -user $user"); for ($i=0; $i<=$#cmds; $i++) { my $tm = time(); my ($Node,$waittm,$dmy) = split /,/,$cmds[$i],3; 
$cmds[$i]="$Node,$dmy"; my $ins="insert into alerts.status (Serial,Identifier,Tally,FirstOccurrence,StateChange,InternalLast,LastOccurre nce,Node,Summary,Severity,Type,XinY,XinYXValue,XinYYValue,XinYEffect) values ($i, $Node,1,$tm,$tm,$tm,$tm,$cmds[$i]);\n"; print FOUT "$ins"; print "$ins"; print FOUT "go\n"; sleep $waittm; } # print FOUT "select rtrim(Node),rtrim(Summary),XinYXValue,XinYYValue,XinY,XinYEffect,Tally,FirstO ccurrence,LastOccurrence from alerts.status;\n"; # sleep 1; # print FOUT "go\n"; # sleep 4; print FOUT "quit\n"; } ## DETERMINE SCHEMA AND BUFFER ALL INPUT INTO ARRAY $d=1; my $allline; while ( my $line=<FIN> ) { # chomp($line); $line=~s:\0:,:g; $line=~s:\s+,:,:g; $line=~s:,\s+:,:g; $line=~s:[ \t]+: :g; if (!($line =~ /^\s*$/)) { $allline.=$line; } } #$allline =~s:,\n::smg; print $allline; Impact without Impact Appendix F: Eventstream.pl 62 APPENDIX G: Event Audit Report Script This program provides statistical information regarding what automations, probes, and/or Impact policies have touched which events and when. #!/usr/bin/perl ######################################################################## PROGRAM: AuditTable.cgi # PURPOSE: To create an HTML table of Netcool Automations # ISSUES: There is a known limit of 255 char retreval via this # version of DBI. 
#######################################################################
use CGI;
use Time::localtime;
use Getopt::Long;   ## HANDLES COMMAND LINE OPTIONS
use DBI;

my $Id;
my $Node;
my $Start;
my $End;
GetOptions("Identifier=s" => \$Id,
           "Node=s"       => \$Node,
           "Start=i"      => \$Start,
           "End=i"        => \$End);

my $q = new CGI;
$q->import_names("X");   # Read in site data inputs
my $filter = ($q->param('filter'));
my $custom_vis = "hidden";

#### OPEN Netcool DATABASE CONNECTION
$ENV{"ORACLE_HOME"}="/opt/oracle/product/11.1.0/db_1";
$ENV{"LD_LIBRARY_PATH"}="/usr/local/lib:/usr/lib:/opt/oracle/product/11.1.0/db_1/lib";
$ENV{"SYBASE"}="/usr/local";
#print "Content-TYPE: text/html","\n\n";
$|++;   # Unbuffer output
my $dbh = DBI->connect( 'dbi:Sybase:NETCOM', "user", "password",
    { RaiseError=> 0, PrintError => 0, AutoCommit => 0 } );

## OUTPUT HTML FILE HEADER
print <<HTML;
<html>
<head>
<style>
a.menu:link { color: #FFFFFF; size: -3; text-decoration: none; }
a.menu:visited { color: #FFFFFF; size: -3; text-decoration: none; }
a.menu:hover { color: #99CCFF; size: -3 }
.fttext1 { font-size: 12px; font-family: Arial, Helvetica, sans-serif; color: #000000; font-weight: bold; }
.Texth { horizontal-align:center; color: black; background-color:yellow; font-family: "Arial, Helvetica, sans-serif"; font-size: 12pt; font-weight: bold; text-decoration: underline; }
.Textu { color: black; background-color: yellow; font-family: "Arial, Helvetica, sans-serif"; font-size: 12pt; font-weight: bold; text-decoration: underline; }
.Text2 { color: black; font-family: "Arial, Helvetica, sans-serif"; font-size: 10pt; font-weight: normal; }
.Text1 { vertical-align:top; color: black; font-family: "Arial, Helvetica, sans-serif"; font-size: 10pt; font-weight: bold; }
</style>
</head>
<body>
<table width=90\% border=1 cellspacing =1 cellpadding =1>
HTML

## BUILD THE QUERY. -Identifier TAKES PRECEDENCE OVER -Node.
my @where;
if ($Id) {
    push @where, "Identifier = '$Id'";
} elsif ($Node) {
    ## ASSUMED INTENT: RESTRICT THE SUBSELECT TO THE REQUESTED NODE
    push @where, "Identifier in (select Identifier from alerts.status where Node = '$Node')";
}
## -Start AND -End ARE OFFSETS IN SECONDS BEFORE NOW
if ($Start) { my $tm=time()-$Start; push @where, "Occurred > $tm"; }
if ($End)   { my $tm=time()-$End;   push @where, "Occurred < $tm"; }
my $select_sql = 'select Occurred, Entity, Identifier from alerts.alarmaudit';
$select_sql .= ' where ' . join(' and ', @where) if (@where);
#print "'$select_sql'\n";

my $sth = $dbh->prepare("$select_sql");
$sth->execute;

my (%Event, %Automation, %Hourly, %Periods, %Distribution);
my $min=99999999999999;
my $max=0;
while(my ($Occurred,$Entity,$Identifier)=$sth->fetchrow_array ) {
    $Occurred=~s:\0::;
    $Entity=~s:\0::;
    $Identifier=~s:\0::;
    $Event{$Identifier}++;        ## Group by Event
    $Automation{$Entity}++;       ## Group by Automation
    my $secs=$Occurred;
    $Occurred=int($secs/3600)*3600;
    $Hourly{$Occurred}++;         ## Group by the hour
    $secs%=3600;
    $secs=int($secs/300);
    $Periods{$secs}++;            ## Group by 5 min on hour
    $min=($min<$Occurred)?$min:$Occurred;
    $max=($max>$Occurred)?$max:$Occurred;
}

## GET EVENT DISTRIBUTION
foreach my $event (sort keys %Event) { $Distribution{$Event{$event}}++; }

## HEADER
print "<tr><td><table width=100% border=1 cellspacing =2 cellpadding =2>\n";
print " <tr><td width=100% class=Texth>AUTOMATION AUDIT REPORT</td></tr>\n";
my $tm = localtime($min);
my $buf1=sprintf("%02d/%02d/%04d %02d:%02d:%02d", $tm->mon+1, $tm->mday, $tm->year+1900, $tm->hour, $tm->min, $tm->sec);
$tm = localtime($max);
my $buf2=sprintf("%02d/%02d/%04d %02d:%02d:%02d ", $tm->mon+1, $tm->mday, $tm->year+1900, $tm->hour, $tm->min, $tm->sec);
## USE THE COMMAND LINE VALUE; THE LOOP VARIABLE $Identifier IS OUT OF SCOPE HERE
my $name=($Id)?"All Events Where Identifier = $Id":($Node)?"All Events Where Node = $Node":'All Events In The Audit Table';
print " <tr><td width=100% class=Texth>Against $name</td></tr>\n";
print " <tr><td width=100% class=Texth>For period from $buf1 to $buf2</td></tr>\n";
print "</table></td></tr>\n";

## AUTOMATION HITS
print "<tr><td><table width=100% border=1 cellspacing =2 cellpadding =2>\n";
print " <tr><td width=250 class=Textu>Automation</td><td class=Textu>Hit Count</td></tr>\n";
foreach my $automation (sort keys %Automation) {
    ## LABEL EMPTY ENTITIES, BUT LOOK UP THE COUNT WITH THE ORIGINAL KEY
    my $label=($automation)?$automation:'Non Automation Update';
    print " <tr><td width=250 class=Text1>$label:</td><td class=Text2>$Automation{$automation}</td></tr>\n";
}
print "</table></td></tr>\n";

## PERIODS IN THE HOUR HITS
print "<tr><td><table width=100% border=1 cellspacing =2 cellpadding =2>\n";
print " <tr><td width=250 class=Textu>Time Periods</td><td class=Textu>Number of Events during period</td></tr>\n";
foreach my $event (sort { $a <=> $b } keys %Periods) {
    my $from = 5 * $event;
    my $to = $from+5;
    print " <tr><td width=250 class=Text1>Period $from-$to Min:</td><td class=Text2>$Periods{$event}</td></tr>\n";
}
print "</table></td></tr>\n";

## HOURLY HISTORY
print "<tr><td><table width=100% border=1 cellspacing =2 cellpadding =2>\n";
print " <tr><td width=250 class=Textu>Hourly History</td><td class=Textu>Number of Hits</td></tr>\n";
foreach my $event (sort { $a <=> $b } keys %Hourly) {
    my $tm = localtime($event);
    my $buf=sprintf("%02d/%02d/%04d %02d hr", $tm->mon+1, $tm->mday, $tm->year+1900, $tm->hour);
    print " <tr><td width=250 class=Text1>$buf</td><td class=Text2>$Hourly{$event}</td></tr>\n";
}
print "</table></td></tr>\n";

## HIT BY EVENT BY COUNT
print "<tr><td><table width=100% border=1 cellspacing =2 cellpadding =2>\n";
print " <tr><td width=250 class=Textu>Events Hit X Times by Automations</td><td class=Textu>Number of Events with X Hits</td></tr>\n";
foreach my $event (sort { $a <=> $b } keys %Distribution) {
    print " <tr><td width=250 class=Text1>Events hit $event times:</td><td class=Text2>$Distribution{$event}</td></tr>\n";
}
print "</table></td></tr>\n";

## END OF OUTPUT
print "</table>\n </body></html>";
$dbh->disconnect;

APPENDIX H: Event Audit Report (Example 1)

This is the first example using the Automation Audit Report script. In this case we are interested in the audit trail for a particular Identifier. We accomplish this by specifying the identifier through the following command:

perl AuditTable.cgi -Identifier 'den0118-admw103-2003OMX DS1VME CLI LOST1E1LOS0303-2003-202' > q.html

This produces the following report:

AUTOMATION AUDIT REPORT
Against All Events Where Identifier =
For period from 09/21/2009 20:00:00 to 09/26/2009 15:00:00

Automation                      Hit Count
GenericClear Automation-1:      625

Time Periods                    Number of Events during period
Period 0-5 Min:                 52
Period 5-10 Min:                53
Period 10-15 Min:               51
Period 15-20 Min:               52
Period 20-25 Min:               55
Period 25-30 Min:               53
Period 30-35 Min:               53
Period 35-40 Min:               50
Period 40-45 Min:               52
Period 45-50 Min:               52
Period 50-55 Min:               51
Period 55-60 Min:               51

Hourly History                  Number of Hits
09/21/2009 20 hr                5
09/21/2009 21 hr                5
09/21/2009 22 hr                6
09/21/2009 23 hr                5
09/22/2009 00 hr                6
09/22/2009 01 hr                5
09/22/2009 02 hr                6
09/22/2009 03 hr                5
09/22/2009 04 hr                5
09/22/2009 05 hr                6
09/22/2009 06 hr                5
09/22/2009 07 hr                6
09/22/2009 08 hr                5
09/22/2009 09 hr                6
09/22/2009 10 hr                5
09/22/2009 11 hr                6
09/22/2009 12 hr                5
09/22/2009 13 hr                5
09/22/2009 14 hr                6
09/22/2009 15 hr                5
09/22/2009 16 hr                5
09/22/2009 17 hr                6
09/22/2009 18 hr                5
09/22/2009 19 hr                5
09/22/2009 20 hr                6
09/22/2009 21 hr                5
09/22/2009 22 hr                6
09/22/2009 23 hr                5
09/23/2009 00 hr                6
09/23/2009 01 hr                5
09/23/2009 02 hr                6
09/23/2009 03 hr                5
09/23/2009 04 hr                5
09/23/2009 05 hr                6
09/23/2009 06 hr                5
09/23/2009 07 hr                6
09/23/2009 08 hr                5
09/23/2009 09 hr                6
09/23/2009 10 hr                5
09/23/2009 11 hr                5
09/23/2009 12 hr                6
09/23/2009 13 hr                5
09/23/2009 14 hr                6
09/23/2009 15 hr                4
09/23/2009 16 hr                6
09/23/2009 17 hr                5
09/23/2009 18 hr                5
09/23/2009 19 hr                6
09/23/2009 20 hr                5
09/23/2009 21 hr                6
09/23/2009 22 hr                5
09/23/2009 23 hr                6
09/24/2009 00 hr                5
09/24/2009 01 hr                6
09/24/2009 02 hr                5
09/24/2009 03 hr                5
09/24/2009 04 hr                6
09/24/2009 05 hr                5
09/24/2009 06 hr                6
09/24/2009 07 hr                5
09/24/2009 08 hr                6
09/24/2009 09 hr                5
09/24/2009 10 hr                5
09/24/2009 11 hr                5
09/24/2009 12 hr                5
09/24/2009 13 hr                6
09/24/2009 14 hr                5
09/24/2009 15 hr                6
09/24/2009 16 hr                5
09/24/2009 17 hr                6
09/24/2009 18 hr                5
09/24/2009 19 hr                6
09/24/2009 20 hr                5
09/24/2009 21 hr                5
09/24/2009 22 hr                6
09/24/2009 23 hr                5
09/25/2009 00 hr                6
09/25/2009 01 hr                5
09/25/2009 02 hr                6
09/25/2009 03 hr                5
09/25/2009 04 hr                5
09/25/2009 05 hr                6
09/25/2009 06 hr                5
09/25/2009 07 hr                6
09/25/2009 08 hr                5
09/25/2009 09 hr                6
09/25/2009 10 hr                5
09/25/2009 11 hr                5
09/25/2009 12 hr                6
09/25/2009 13 hr                5
09/25/2009 14 hr                6
09/25/2009 15 hr                5
09/25/2009 16 hr                6
09/25/2009 17 hr                5
09/25/2009 18 hr                6
09/25/2009 19 hr                5
09/25/2009 20 hr                5
09/25/2009 21 hr                6
09/25/2009 22 hr                5
09/25/2009 23 hr                6
09/26/2009 00 hr                5
09/26/2009 01 hr                6
09/26/2009 02 hr                5
09/26/2009 03 hr                5
09/26/2009 04 hr                6
09/26/2009 05 hr                5
09/26/2009 06 hr                6
09/26/2009 07 hr                5
09/26/2009 08 hr                6
09/26/2009 09 hr                5
09/26/2009 10 hr                6
09/26/2009 11 hr                5
09/26/2009 12 hr                5
09/26/2009 13 hr                6
09/26/2009 14 hr                5
09/26/2009 15 hr                3

Events Hit X Times by Automations    Number of Events with X Hits
Events hit 625 times:                1

APPENDIX I: Event Audit Report (Example 2)

This is the second example using the Automation Audit Report script. In this case we are interested in all events for the last 3 days.
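AuditTable.cgi subtracts the -Start and -End values from the current time, so both options are offsets in seconds before now rather than absolute timestamps. The following is a minimal sketch of that arithmetic, written in Perl like the script itself; the helper name days_to_seconds is hypothetical and is not part of AuditTable.cgi.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of AuditTable.cgi): convert a look-back
# period in days to the seconds offset that -Start and -End expect.
sub days_to_seconds {
    my ($days) = @_;
    return $days * 24 * 60 * 60;
}

my $now          = time();
my $start_offset = days_to_seconds(3);   # 3 days = 259200 seconds
my $end_offset   = 0;                    # 0 means "up to now"

# The same bounds the script builds into its WHERE clause:
#   Occurred > now - Start   and   Occurred < now - End
my $lower = $now - $start_offset;
my $upper = $now - $end_offset;

printf "perl AuditTable.cgi -Start %d -End %d\n", $start_offset, $end_offset;
printf "selects audit rows with Occurred between %d and %d\n", $lower, $upper;
```

By this arithmetic a three-day window is -Start 259200, while a value of 6220800 corresponds to 72 days. Because the script only tests whether -End is set, an -End of 0 adds no upper bound to the query.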
We accomplish this by specifying:

    Node
    Start time as seconds prior to now
    End time for the period as seconds prior to now

This translates to the following command:

perl AuditTable.cgi -Node 'wdc1749-rocm1' -start 6220800 -end 0 > q.html

AUTOMATION AUDIT REPORT
Against All Events Where Node = wdc1749-rocm1
For period from 09/21/2009 19:00:00 to 09/26/2009 15:00:00

Automation                      Hit Count
Non Automation Update:
EscalateAlarms Automation -1:   2
EscalateAlarms Automation-1:    22
GenericClear Automation-1:      69491
GenericClear Automation-2:      58919
Level1Page Automation-1:        181
Level1Page Automation-3:        11
Level1Page Automation-5:        3

Time Periods                    Number of Events during period
Period 0-5 Min:                 11375
Period 5-10 Min:                12114
Period 10-15 Min:               10671
Period 15-20 Min:               11495
Period 20-25 Min:               10392
Period 25-30 Min:               10638
Period 30-35 Min:               11969
Period 35-40 Min:               11043
Period 40-45 Min:               12175
Period 45-50 Min:               10442
Period 50-55 Min:               11780
Period 55-60 Min:               12088

Hourly History                  Number of Hits
09/21/2009 19 hr                3
09/21/2009 20 hr                855
09/21/2009 21 hr                1060
09/21/2009 22 hr                1123
09/21/2009 23 hr                891
09/22/2009 00 hr                1131
09/22/2009 01 hr                1099
09/22/2009 02 hr                1053
09/22/2009 03 hr                1109
09/22/2009 04 hr                914
09/22/2009 05 hr                1052
09/22/2009 06 hr                950
09/22/2009 07 hr                998
09/22/2009 08 hr                933
09/22/2009 09 hr                819
09/22/2009 10 hr                944
09/22/2009 11 hr                738
09/22/2009 12 hr                913
09/22/2009 13 hr                814
09/22/2009 14 hr                991
09/22/2009 15 hr                1071
09/22/2009 16 hr                1136
09/22/2009 17 hr                1089
09/22/2009 18 hr                1081
09/22/2009 19 hr                1038
09/22/2009 20 hr                1090
09/22/2009 21 hr                1003
09/22/2009 22 hr                1106
09/22/2009 23 hr                1087
09/23/2009 00 hr                892
09/23/2009 01 hr                1068
09/23/2009 02 hr                1084
09/23/2009 03 hr                949
09/23/2009 04 hr                1182
09/23/2009 05 hr                1070
09/23/2009 06 hr                983
09/23/2009 07 hr                993
09/23/2009 08 hr                891
09/23/2009 09 hr                1135
09/23/2009 10 hr                969
09/23/2009 11 hr                916
09/23/2009 12 hr                1034
09/23/2009 13 hr                815
09/23/2009 14 hr                2092
09/23/2009 15 hr                1387
09/23/2009 16 hr                1290
09/23/2009 17 hr                1019
09/23/2009 18 hr                991
09/23/2009 19 hr                858
09/23/2009 20 hr                1019
09/23/2009 21 hr                961
09/23/2009 22 hr                1070
09/23/2009 23 hr                1062
09/24/2009 00 hr                1006
09/24/2009 01 hr                1195
09/24/2009 02 hr                1160
09/24/2009 03 hr                1090
09/24/2009 04 hr                1123
09/24/2009 05 hr                1055
09/24/2009 06 hr                1021
09/24/2009 07 hr                1020
09/24/2009 08 hr                2324
09/24/2009 09 hr                1322
09/24/2009 10 hr                1037
09/24/2009 11 hr                1736
09/24/2009 12 hr                2774
09/24/2009 13 hr                1356
09/24/2009 14 hr                2472
09/24/2009 15 hr                1116
09/24/2009 16 hr                1187
09/24/2009 17 hr                1221
09/24/2009 18 hr                1361
09/24/2009 19 hr                1233
09/24/2009 20 hr                1472
09/24/2009 21 hr                1539
09/24/2009 22 hr                1348
09/24/2009 23 hr                1230
09/25/2009 00 hr                1206
09/25/2009 01 hr                1292
09/25/2009 02 hr                1195
09/25/2009 03 hr                1115
09/25/2009 04 hr                1199
09/25/2009 05 hr                994
09/25/2009 06 hr                1027
09/25/2009 07 hr                1089
09/25/2009 08 hr                965
09/25/2009 09 hr                945
09/25/2009 10 hr                1424
09/25/2009 11 hr                1188
09/25/2009 12 hr                977
09/25/2009 13 hr                1000
09/25/2009 14 hr                1057
09/25/2009 15 hr                965
09/25/2009 16 hr                1196
09/25/2009 17 hr                1089
09/25/2009 18 hr                1250
09/25/2009 19 hr                1091
09/25/2009 20 hr                1468
09/25/2009 21 hr                1138
09/25/2009 22 hr                1340
09/25/2009 23 hr                2162
09/26/2009 00 hr                1100
09/26/2009 01 hr                1455
09/26/2009 02 hr                1076
09/26/2009 03 hr                1041
09/26/2009 04 hr                1136
09/26/2009 05 hr                984
09/26/2009 06 hr                1129
09/26/2009 07 hr                1688
09/26/2009 08 hr                1467
09/26/2009 09 hr                1671
09/26/2009 10 hr                1980
09/26/2009 11 hr                1606
09/26/2009 12 hr                1421
09/26/2009 13 hr                1152
09/26/2009 14 hr                1187
09/26/2009 15 hr                498

Events Hit X Times by Automations    Number of Events with X Hits
Events hit 1 times:                  2086
Events hit 2 times:                  520
Events hit 3 times:                  622
Events hit 4 times:                  226
Events hit 5 times:                  185
Events hit 6 times:                  135
Events hit 7 times:                  110
Events hit 8 times:                  52
Events hit 9 times:                  25
Events hit 10 times:                 24
Events hit 11 times:                 31
Events hit 12 times:                 10
Events hit 13 times:                 8
Events hit 14 times:                 44
Events hit 15 times:                 121
Events hit 16 times:                 217
Events hit 17 times:                 24
Events hit 18 times:                 4
Events hit 19 times:                 7
Events hit 20 times:                 4
Events hit 21 times:                 5
Events hit 22 times:                 5
Events hit 23 times:                 8
Events hit 24 times:                 5
Events hit 25 times:                 1
Events hit 26 times:                 5
Events hit 27 times:                 4
Events hit 28 times:                 3
Events hit 29 times:                 11
Events hit 30 times:                 6
Events hit 31 times:                 6
Events hit 32 times:                 6
Events hit 33 times:                 1
Events hit 34 times:                 2
Events hit 35 times:                 2
Events hit 37 times:                 1
Events hit 38 times:                 2
Events hit 39 times:                 1
Events hit 40 times:                 1
Events hit 42 times:                 3
Events hit 43 times:                 1
Events hit 44 times:                 2
Events hit 47 times:                 2
Events hit 50 times:                 3
Events hit 51 times:                 3
Events hit 52 times:                 1
Events hit 53 times:                 4
Events hit 55 times:                 2
Events hit 57 times:                 1
Events hit 62 times:                 2
Events hit 63 times:                 2
Events hit 64 times:                 1
Events hit 67 times:                 1
Events hit 72 times:                 1
Events hit 73 times:                 1
Events hit 74 times:                 2
Events hit 76 times:                 1
Events hit 84 times:                 1
Events hit 85 times:                 2
Events hit 89 times:                 1
Events hit 91 times:                 1
Events hit 93 times:                 4
Events hit 94 times:                 2
Events hit 95 times:                 1
Events hit 97 times:                 4
Events hit 101 times:                1
Events hit 105 times:                1
Events hit 111 times:                1
Events hit 112 times:                1
Events hit 117 times:                2
Events hit 118 times:                1
Events hit 123 times:                1
Events hit 128 times:                1
Events hit 129 times:                2
Events hit 130 times:                1
Events hit 131 times:                2
Events hit 133 times:                1
Events hit 134 times:                1
Events hit 138 times:                1
Events hit 139 times:                1
Events hit 140 times:                1
Events hit 141 times:                1
Events hit 142 times:                1
Events hit 145 times:                1
Events hit 146 times:                1
Events hit 150 times:                1
Events hit 153 times:                2
Events hit 154 times:                1
Events hit 155 times:                2
Events hit 158 times:                1
Events hit 159 times:                2
Events hit 162 times:                1
Events hit 165 times:                1
Events hit 170 times:                1
Events hit 171 times:                1
Events hit 172 times:                2
Events hit 174 times:                1
Events hit 175 times:                1
Events hit 177 times:                1
Events hit 178 times:                3
Events hit 179 times:                1
Events hit 182 times:                2
Events hit 184 times:                1
Events hit 188 times:                1
Events hit 194 times:                1
Events hit 205 times:                1
Events hit 207 times:                1
Events hit 208 times:                1
Events hit 219 times:                1
Events hit 220 times:                2
Events hit 228 times:                1
Events hit 236 times:                1
Events hit 239 times:                1
Events hit 253 times:                1
Events hit 278 times:                2
Events hit 299 times:                2
Events hit 301 times:                1
Events hit 302 times:                1
Events hit 305 times:                1
Events hit 391 times:                1
Events hit 437 times:                1
Events hit 438 times:                1
Events hit 450 times:                1
Events hit 454 times:                1
Events hit 457 times:                1
Events hit 538 times:                1
Events hit 571 times:                2
Events hit 573 times:                2
Events hit 623 times:                1
Events hit 624 times:                4
Events hit 625 times:                2
Events hit 627 times:                1
Events hit 634 times:                1
Events hit 665 times:                1
Events hit 809 times:                1
Events hit 812 times:                2
Events hit 814 times:                2
Events hit 816 times:                1
Events hit 817 times:                1
Events hit 818 times:                1
Events hit 819 times:                3
Events hit 820 times:                1
Events hit 822 times:                3
Events hit 823 times:                2
Events hit 824 times:                1
Events hit 876 times:                1
Events hit 1440 times:               1
Events hit 1442 times:               1
Events hit 1443 times:               2
Events hit 1806 times:               2
Events hit 1854 times:               1
Events hit 2155 times:               1
Events hit 2160 times:               1
Events hit 2164 times:               1
Events hit 2165 times:               1
Events hit 3139 times:               1
Events hit 3189 times:               1
Events hit 3198 times:               1
Events hit 3257 times:               1
Events hit 3268 times:               1
Events hit 3915 times:               1
Events hit 3924 times:               1
Events hit 4255 times:               1
Events hit 4346 times:               1
Events hit 4362 times:               1
Events hit 5030 times:               1
Events hit 6549 times:               1
Events hit 6848 times:               1

APPENDIX I: Impact-like PERL Shell

The following shell provides a basic example of an Impact-like PERL program.

#!/usr/bin/perl
################################################################################
# 09/24/2009 Daniel L. Needles Version 0.1
# PROGRAM: psuedoimpact
# USAGE: psuedoimpact
#   [-debug <debug number 1-255>]
#       1   - Log any warning or errors.
#       2   - Function Entry and exit logging.
#       4   - Inner function verbose logging
#       8   - Dump before and after Netcool alert.status fields.
#       16  - Query Netcool alerts.status only once.
#       32  - Clear old log.
#       64  - Show initial alert.status field values.
#       128 - Do not perform update, instead document
#             what would be done.
#   [-node] (Node to perform extra logging on.)
#   [-logfile] (full path and file to log file.)
#   *[-sumfile] (full path and file to the change summary file.)
#   [-cycle] (Cycle period in seconds.)
#   [-cyclemin] (minimum delay between cycles, even if the cycle
#       overruns.)
#   [-delay] (Ignore events with Populated field younger than)
#   [-stdout] --Log to standard out
#   [-help]
# DESCRIPTION: Program enriches Netcool based on Oracle info:
#   @Node : With HOST name, if Node=IP and HOST in Oracle.
#       CIRC_PATH_HUM_ID.
#   @Customer : With Customer information found ($customer).
#   Journal : With warning, errors found by Populate.pl.
# PURPOSE: Program replaces Impact's enrichment functionality and
#   updates Netcool with information pulled from Oracle.
# PSEUDO CODE:
#   1.
Declare packages,environment vars,database security vars, command line # # parmaeter defaults, other database assignments. # # 2. Declare and prepare Oracle SQL statements for tables: equipment, site, # # card, path, port, and channel. # # 3. Load procedures: LogMsg, dumpnetcool, usage, commandline, # # Each is documented regarding PSUDO code. # # 4. Run main program: # # a. Prepare Netcool SQL for alert.status and open logfile and change # # summary file. # # b. Loop forever (unless DEBUG & 16 in which case only loop once.) # # 1. Pull events from Netcool alerts.status # # 2. For each event # # a. Clear data structures. # # b. Perform Impact Policy work # # i. Build and execute UPDATE against Netcool alerts.status to update # # the event. # # 3. Sleep the remaining time in the cycle and log if the cyle has # # overran its length. Make sure to sleep a minimum time between # # cycles ($CYCLEMIN) # # c. Disconnect from Netcool and Oracle. # Impact without Impact APPENDIX I: Impact-like PERL Shell 86 ################################################################################ # ERRORS AND WARNING: # # There are several errors and warnings logged by the system. These are # # stored in the Journal entries for the events and the log file IF DEBUG # # IS SET TO 1. (The default value of DEBUG is 1 if not specified.) # # The errors occur within the Policy code's exceptions and are coded as # # follows: # # # # WARN: <ERROR NUMBER>: <ERROR MESSAGE> # ################################################################################ # DEBUG RULES: # # 0 - No logging. # # 1 - Errors and warning logging. # # 2 - Function Entry and exit logging. # # 4 - Node based logging. # # 8 - Dump before and after Netcool alert.status fields. # # 16 - Query Netcool alerts.status only once. # # 32 - Clear old log # # 64 - Show initial alert.status field values. # #128 - Do not perform update, instead document what would be done. 
# ################################################################################ use strict; ## TRAINING WHEELS ON ###################################### PACKAGES ################################ use Getopt::Long; ## HANDLES COMMAND LINE OPTIONS use DBI; ## ACCESS TO NETCOOL AND ORACLE VIA DATABASES use Time::localtime; ## Convert raw seconds to date ############################# ENVIRONMENT VARIABLES ############################ $ENV{SYBASE}='/usr/local'; ## Local of SYBASE $ENV{LD_LIBRARY_PATH}='/usr/local/lib:/usr/lib:/opt/oracle/orahome/lib'; $ENV{'ORACLE_HOME'}='/opt/oracle/orahome'; ########################## my $ORA_USER='USER'; my $ORA_PASS='PASSWORD'; my $NETCOOL_USER='root'; my $NETCOOL_PASS=''; my $NETCOOL_UID=500; DATABASE PERMISSIONS AND IDENT ###################### ## Oracle USER ## Oracle PASSWORD ## Netcool USER ## Netcool PASSWORD ## Netcool USERID (Needed to insert journals) ################### COMMAND LINE OPTION DEFAULTS ############################### my $STDOUT=0; ## DEFAULT TO OUTPUT TO LOG FILE RATHER THAN STDOUT my $NODE; ## Used for Debug & 4. my $DEBUG=1; ## DEFAULT TO PRINT ERRORS AND WARNS TO THE LOG my $CYCLESPEED=60; ## CYCLE PERIOD IN SECONDS my $CYCLEMIN=10; ## MIN REST PERIOD BETWEEN CYCLES IN SECONDS my $ISPRIMARY=1; ## IS THIS THE PRIMARY PROGRAM? my $SECONDARYDELAY=$CYCLESPEED*2; ## IF SECONDARY WHAT ADDITIONAL DELAY IS USED my $EVENTDELAY=($ISPRIMARY)?259200:345600; ## REPOPULATE EVENTS 2 DAYS OLD ON ## PRIMARY AND 3 DAYS OLD ON SECONDARY my $POPULATESERVER=($ISPRIMARY)?'Primary':'Secondary'; ## WILL BE CALCULATED AGAIN AFTER COMMANDLINE ARGS my $LOGFILE=''; ## ASSUME NO LOG FILE NAME. my $SUMMARYFILE=''; ## ASSUME NO 'SUMMARY OF CHANGES' FILE ##################### OTHER DATABASE ASSIGNMENTS ############################### my ($NC,$XNG_CARD,$XNG_DS1CHANNEL,$XNG_EQUIPMENT,$XNG_PATH,$XNG_PORT,$XNG_SITE,$XNG_STSPO RT); ## DATABASE TABLE HANDLES. 
my $NCJournalEntry; ## JOURNAL UPDATES my %NCDMP; ## DETECTS UPDATES TO NETCOOL (BEFORE AND AFTER) my $oracle_dbh; ## ORACLE FILE HANDLE Impact without Impact APPENDIX I: Impact-like PERL Shell 87 my $netcool_dbh; ## NETCOOL FILE HANDLE my $UPDATESET=''; ## USE TO BUILD UPDATE SET COMMAND CLAUSE my @UPDATECOL= qw/Customer Node Site Summary/; ## FULL INVENTORY OF FIELDS THAT CAN BE SET ########################### DATABASE SET UP ################################## ## OPEN CONNECTION TO ORACLE $oracle_dbh = DBI->connect( 'dbi:Oracle:oracle', $ORA_USER, $ORA_PASS, { RaiseError=> 1, PrintError => 1, AutoCommit => 1 } ) or die "Connection failed to Oracle (oracle): " . $DBI::errst; ## OPEN CONNECTION TO NETCOOL $netcool_dbh = DBI->connect( 'dbi:Sybase:NCOMS', $NETCOOL_USER, $NETCOOL_PASS, { RaiseError=> 1, PrintError => 0, AutoCommit => 1} ) or die "Connection failed: " . $DBI::errst; ## BIND VARIABLES FOR THE NETCOOL EVENT FIELDS IN FORM: NC<FIELD NAME> my ($NCSummary,$NCSite,$NCSerial,$NCTally,$NCNode,$NCIdentifier,$NCCustomer,$NCAlertKey,$ NCAlertGroup); my $NCERROR; ## ERROR TRACKING... ## SQL PREP: DEFINE THE COLUMNS SELECTED FROM NETCOOL my $IMPACTSELECT="Summary,Site,Serial,Tally,Node,Identifier,Customer,AlertKey,AlertGroup" ; my $IMPACTFILTER; ## NETCOOL FILTER HAS TO BE SET AFTER COMMANDLINE IS CALLED. 
################################################################################
##################### SQL PREPARE STATEMENTS ###################################
####### THIS, ALONG WITH BIND_COLUMNS IS USED TO SPEED UP THINGS 10X ###########
################################################################################
# HERE ARE SOME EXAMPLES OF ORACLE SQL PREP STATEMENTS
#
####################### SQLPREP: ORACLE EQUIPMENT TABLE #######################
my ($EQEQUIP,$EQDESCR,$EQIP_ADDRESS,$EQMODEL);

## VERSION 1: LOOKUP ON IP ADDRESS
my $xng_equipment1 = $oracle_dbh->prepare(<<SQL);
SELECT EQUIP, DESCR, IP_ADDRESS, MODEL
FROM webuser.IMP_OSS_EQUIP_INST
WHERE IP_ADDRESS = ?
SQL
#$xng_equipment1->bind_columns(\$EQEQUIP,\$EQDESCR,\$EQIP_ADDRESS,\$EQMODEL);

## VERSION 2: LOOKUP ON HOSTNAME
my $xng_equipment2 = $oracle_dbh->prepare(<<SQL);
SELECT EQUIP, DESCR, IP_ADDRESS, MODEL
FROM webuser.IMP_OSS_EQUIP_INST
WHERE DESCR = ?
SQL
#$xng_equipment2->bind_columns(\$EQEQUIP,\$EQDESCR,\$EQIP_ADDRESS,\$EQMODEL);

################################################################################
####################### END OF SQL PREPARE SECTION #############################
################################################################################

################################################################################
# PROCEDURE: LogMsg
# PURPOSE:   Write messages to logfiles or whatever else is needed.
################################################################################
sub LogMsg {
    # NOTE: DO NOT CALL LogMsg to declare Function ENTRY,EXIT FOR LOGMSG - INFINITE LOOP
    my $msg = shift;
    my $tm = localtime(time());
    my $buf=sprintf("%02d/%02d/%04d %02d:%02d:%02d:",
        $tm->mon+1, $tm->mday, $tm->year+1900, $tm->hour, $tm->min, $tm->sec);
    print LOG "$buf $msg\n";

    ## LOG ERROR AND WARNINGS ADDED TO JOURNAL
    if ($msg =~ /^ERROR:/) {
        $NCJournalEntry .= " ERROR: contact Netcool admin. $msg";
    } elsif ($msg =~ /^WARN:/) {
        $NCJournalEntry.=" $msg";
    }
    return;
}

################################################################################
# PROCEDURE: dumpprint
# PURPOSE:   If debug & 8 is set, then this procedure is called. The procedure
#            documents the changes done to the particular event.
################################################################################
sub dumpprint {
    ## LOG THE BEFORE AND AFTER IMAGE OF EACH POTENTIALLY UPDATABLE NETCOOL
    # EVENT FIELD TO THE LOG. (NOTE EACH FIELD WILL BE OF THE FORM:
    #     <BEFORE VALUE>~<AFTER VALUE>
    # IF THERE HAS BEEN NO CHANGE TO THE FIELD THEN THE UPDATE WILL SHOW:
    #     ~
    LogMsg("\n************ EVENT DUMP ******************");
    LogMsg("Customer: $NCDMP{Customer}");
    LogMsg("Node: $NCDMP{Node}");
    LogMsg("Site: $NCDMP{Site}");
    LogMsg("Summary: $NCDMP{Summary}");
    LogMsg("JOURNAL: $NCDMP{JournalEntry}");
    LogMsg("Error: $NCERROR");
    LogMsg("******************************************\n");

    ## SPECIAL FILTERS TO MAKE IT EASIER TO READ THE CHANGE SUMMARY FILE
    my $journalit=($NCTally < 30)?'Yes':'No';

    ## DUMP A ROW TO THE CHANGE SUMMARY FILE THAT REPRESENTS THE EVENTS
    print SUM join('~', $NCNode, $EQDESCR, $EQIP_ADDRESS, $EQMODEL,
        $NCAlertGroup, $NCAlertKey, $journalit,
        $NCDMP{Customer}, $NCDMP{Node}, $NCDMP{Site}, $NCDMP{Summary},
        $NCDMP{JournalEntry}, $NCERROR), "\n";
    return;
}

################################################################################
# PROCEDURE: usage
# PURPOSE:   Reports available usage parameters for the program to the
#            standard out.
################################################################################
sub usage {
    print <<EOF;
USAGE: psuedoimpact
    [-debug <debug number 1-255>]
          1 - Log any warning or errors.
          2 - Function Entry and exit logging.
          4 - Inner function verbose logging
          8 - Dump before and after Netcool alerts.status fields.
         16 - Query Netcool alerts.status only once.
         32 - Clear old log.
         64 - Show initial alerts.status field values.
        128 - Do not perform update, instead document what would be done.
    [-node]      (Node to perform extra logging on.)
    [-logfile]   (full path and file to log file.)
   *[-sumfile]   (full path and file to the change summary file.)
    [-cycle]     (Cycle period in seconds.)
    [-cyclemin]  (minimum delay between cycles, even if the cycle overruns.)
    [-delay]     (Ignore events with Populated field younger than this many seconds.)
    [-stdout]    --Log to standard out
    [-help]

NOTE: Program must run as the user root

DEBUGGING **************************
There are 8 flags available for debugging. When diagnosing a specific
problem in development -debug 255 or -debug 127 is used as it will:
    Clear the old log (32)
    Query the events only once and exit the program (16)
    Show any warning or errors (1)
    Show entry and exit from functions (2)
    Provide verbose logging of the inner workings (4)
    Dump the initial values of the event's fields that can be updated (64)
    Dump the before and after values (if there is a change) for each event (8)
    Write a summary of the updates for each event in the sumfile (8)
Normal operation of the program can use the default DEBUG value of 1, which
will log WARNING and ERRORS to an event's journal as well as to the logfile.

SUMFILE ***************************
* When debug flag 8 is set, the summary file contains a record of all proposed
(debug & 128) or actual changes made in a ~ delimited file that can easily be
viewed with Excel. It also includes any journal entries or detected errors.
Each row represents one event. The following is a column description:

-- Fields used to identify and select the event...
    Node       - Node referenced in the event.
    DESCR      - Description of the Node (From Oracle.)
    IP_ADDRESS - IP Address of the Node (From Oracle.)
    MODEL      - Model of the Node (From Oracle.)
    AlertGroup - Netcool event field
    AlertKey   - Netcool event field
    IsJournal  - Does a journal exist?

-- Fields show the before and after values of netcool fields for the event.
   If there is no change, null is shown in the before and after values to
   make the changes stand out. Each field is repeated twice to show the
   before and after values respectively. The columns include:
    Customer
    Node
    Site
    Summary
    JournalEntry

-- Error messages against the event from processing.
    ERRORS
EOF
    exit;
}

################################################################################
# PROCEDURE: commandline
# PURPOSE:   Parses commandline parameters and sets their values
################################################################################
sub commandline {
    my $HELP;
    GetOptions("debug=i"    => \$DEBUG,
               "stdout"     => \$STDOUT,
               "node=s"     => \$NODE,
               "cycle=i"    => \$CYCLESPEED,
               "cyclemin=i" => \$CYCLEMIN,
               "logfile=s"  => \$LOGFILE,
               "sumfile=s"  => \$SUMMARYFILE,
               "delay=i"    => \$EVENTDELAY,
               'h|help|?'   => \$HELP);
    if ( $HELP ) { usage(); }   ## usage() PRINTS THE HELP TEXT AND EXITS
    return;
}

###############################################################################
# PROCEDURE: POLICY-SPECIFIC-CODE
# PURPOSE:   Code specific to the policy shows up as procedures.
# NETCOOL:   The procedure assigns the following fields
#              @Site : Oracle <Database>.<Field>
# PSEUDO CODE:
#
###############################################################################
sub PolicySpecificCode {
    ($DEBUG & 2) && LogMsg('Entering PolicySpecificCode');
    ## ::
    ## ::
    ## ::
    ($DEBUG & 4) && LogMsg("PolicySpecificCode: AlertGroup: $NCAlertGroup\n Site: $NCSite");
    ## ::
    ## ::
    ## ::
    ($DEBUG & 2) && LogMsg('Exiting PolicySpecificCode');
    return;
}

################################################################################
################################## MAIN ######################################
################################################################################
$|++;                           # Unbuffer output
commandline();                  ## PARSE PARAMETERS
$IMPACTFILTER=($ISPRIMARY)
    ?"(Populated < getdate-$EVENTDELAY) and (Severity > 0) and (Type = 1) and ( ...IMPACTFILTER ...)"
    :"(Populated < getdate-$EVENTDELAY) and (Severity > 0) and (LastOccurrence<getdate-$SECONDARYDELAY) and (Type = 1) and ( ...IMPACTFILTER... )";
my $nc = $netcool_dbh->prepare("SELECT $IMPACTSELECT FROM status WHERE $IMPACTFILTER");
$POPULATESERVER=($ISPRIMARY)?'Primary':'Secondary';

## OPEN LOG
$LOGFILE=($LOGFILE)?$LOGFILE:'psuedoimpact.log';        ## DEFAULT FILE.
if ($DEBUG & 32) {
    open(LOG, "> $LOGFILE") || die("Can't open '$LOGFILE'");
} else {
    open(LOG, ">> $LOGFILE") || die("Can't open '$LOGFILE'");
}
select(LOG); $|=1; select(STDOUT);                      ## Unbuffer logging
($DEBUG & 2) && LogMsg("Starting Main Program (after commandline parsing)");

## OPEN SUMMARY OF CHANGES TO EVENTS LOG
$SUMMARYFILE=($SUMMARYFILE)?$SUMMARYFILE:'psuedoimpact.sum';    ## DEFAULT FILE.
if ($DEBUG & 32) {
    open(SUM, "> $SUMMARYFILE") || die("Can't open '$SUMMARYFILE'");
} else {
    open(SUM, ">> $SUMMARYFILE") || die("Can't open '$SUMMARYFILE'");
}
select(SUM); $|=1; select(STDOUT);      ## Unbuffer logging

my $loopstart;                  ## TIME THE START OF THE LOOP
my $istrue=1;                   ## FLAG TO ENABLE EXITING OF LOOP IF DEBUG 16 SET
                                ## (ONLY LOOP ONCE)

## PRINT TABLE HEADER WITH ~ DELIMITED FIELDS TO CHANGE SUMMARY FILE
($DEBUG & 8) && print SUM "Node~DESCR~IP_ADDRESS~MODEL~AlertGroup~AlertKey~IsJournal~Customer~Customer~Node~Node~Site~Site~Summary~Summary~JournalEntry~JournalEntry~ERRORS\n";

############################# MAIN LOOP ########################################
while ( $istrue ) {
    ##########################################################################
    ########################## PREPARE NETCOOL QUERY #########################
    ##########################################################################
    $loopstart=time();                  ## START TIMER
    $istrue=($DEBUG & 16)?0:1;          ## LOOP FOREVER, OR ONCE?
    ($DEBUG & 4) && LogMsg("netcool_dbh->prepare(\"SELECT $IMPACTSELECT FROM status WHERE $IMPACTFILTER\")");
    $nc->execute();                     ## GRAB UPDATED EVENTS FROM NETCOOL
    $nc->bind_columns(\$NCSummary,\$NCSite,\$NCSerial,\$NCTally,\$NCSeverity,\$NCNode,
        \$NCIdentifier,\$NCCustomer,\$NCAlertKey,\$NCAlertGroup);
    my $ncidx=0;                        ## COUNT NETCOOL EVENTS. RESET AFTER EVERY LOOP
    ########################## PREPARE NETCOOL QUERY #########################
    ################################### END ################################
    ##########################################################################

    ######################## ITERATE OVER ALL NETCOOL EVENTS #################
    ##########################################################################
    while ($NC = $nc->fetchrow_arrayref) {
        ######################################################################
        ####################### INITIALIZE DATA #############################
        ######################################################################
        ## REMOVE NULLS
        for (my $i=0; $i<=$#$NC; $i++) { $$NC[$i]=~s:\0::g; }
        $ncidx++;                       ## COUNT NETCOOL EVENTS. START AT 1.

        ## DEBUG AND DETECT CHANGES PREP
        foreach my $item (keys %NCDMP) { $NCDMP{$item} = ''; }  ## INITIALIZE HASH TO NULL
        $NCERROR='';                    ## INITIALIZE ERROR DETECTION
        $NCJournalEntry='';             ## CLEAR JOURNAL CACHE

        ## TRANSCRIBE BIND VARIABLES TO HASH
        ## BEFORE I UPDATE ANY FIELDS WHAT ARE THEIR INITIAL VALUES
        $NCDMP{Customer} = $NCCustomer;
        $NCDMP{Node} = $NCNode;
        $NCDMP{Site} = $NCSite;
        $NCDMP{Summary} = $NCSummary;
        $NCDMP{Class} = $NCClass;

        ## LOG INITIAL VALUES OF THE FIELDS THAT MAY BE UPDATED
        if ($DEBUG & 64) {
            LogMsg("BEFORE UPDATE");
            foreach my $fld (sort keys %NCDMP) {
                LogMsg(" $fld = $NCDMP{$fld}");
            }
        }

        ## LOG IF DEBUG ON THE NODE
        if ( $DEBUG & 4) {
            LogMsg("Processing " . uc($NCNode) . " - " . $NCSummary);
        }
        ####################### INITIALIZE DATA #############################
        ############################### END ################################
        ######################################################################

        #################### DETERMINE VALUES FROM ORACLE ####################
        ######################################################################
        # The code that represents the work the policy does goes here
        #
        #################### DETERMINE VALUES FROM ORACLE ####################
        ############################### END ################################
        ######################################################################

        ####################### UPDATE EVENT LOGIC ##########################
        ######################################################################
        ## MARK EVENT AS TOUCHED IF JOURNAL UPDATED, ERROR LOGGED, OR PROPER UPDATE TIME
        if ($NCJournalEntry || $NCERROR || (($NCSeverity > 3) and ($NCTally < 30))) {
            my $tm = localtime(time());
            my $buf=sprintf("%02d/%02d/%04d %02d:%02d:%02d: ",
                $tm->mon+1, $tm->mday, $tm->year+1900, $tm->hour, $tm->min, $tm->sec);
            $NCJournalEntry = "POPULATE.PL: $buf $NCJournalEntry ";
        }

        ## DETECT IF ANY CHANGE WAS MADE
        $NCDMP{Customer}=($NCCustomer ne $NCDMP{Customer})?
            "$NCDMP{Customer}~$NCCustomer":'~';
        $NCDMP{Node}=($NCNode ne $NCDMP{Node})?"$NCDMP{Node}~$NCNode":'~';
        $NCDMP{Site}=($NCSite ne $NCDMP{Site})?"$NCDMP{Site}~$NCSite":'~';
        $NCDMP{Summary}=($NCSummary ne $NCDMP{Summary})?
            "$NCDMP{Summary}~$NCSummary":'~';
        $NCDMP{JournalEntry}=($NCJournalEntry ne $NCDMP{JournalEntry})?
            "$NCDMP{JournalEntry}~$NCJournalEntry":'~';

        ## BUILD UPDATE STATEMENT (NOTE ALL FIELDS ARE STRINGS)
        $UPDATESET='';
        foreach my $fld (@UPDATECOL) {
            if ($NCDMP{$fld} ne '~') {
                my ($dmy,$val) = split /~/,$NCDMP{$fld};
                ## DO INTEGERS DIFFERENT FROM STRINGS
                if ($fld eq 'Class' || $fld eq 'Severity') {
                    $UPDATESET.=",$fld = $val";
                } else {
                    $UPDATESET.=",$fld = '$val'";
                }
            }
        }

        ## UPDATE OR LOG NO UPDATE
        if ($UPDATESET && !($DEBUG & 128)) {
            $UPDATESET="PopulatedBy='$POPULATESERVER',Populated=$loopstart" . $UPDATESET;
            if ($DEBUG & 8) { dumpprint(); }
            my $sth=$netcool_dbh->prepare(
                qq{update status set $UPDATESET where Identifier = '$NCIdentifier'});
            $sth->execute;
            ($DEBUG & 8) && LogMsg("netcool_dbh->do(UPDATE status SET $UPDATESET where Identifier = '$NCIdentifier')");
            ## DO JOURNAL ENTRY ONLY IF THE JOURNAL TEXT CHANGED
            if ($NCDMP{JournalEntry} ne '~') {
                my $sth2=$netcool_dbh->prepare(<<SQL);
INSERT INTO alerts.journal VALUES (
    '$NCSerial:$NETCOOL_UID:$loopstart', $NCSerial, $NETCOOL_UID, $loopstart,
    '$NCJournalEntry', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')
SQL
                ($DEBUG & 8) && LogMsg("INSERT INTO alerts.journal VALUES ('$NCSerial:$NETCOOL_UID:$loopstart',$NCSerial,$NETCOOL_UID,$loopstart,'$NCJournalEntry','', '', '', '', '', '', '', '', '', '', '', '', '', '', '')");
                $sth2->execute;
            }
        } else {
            if (($DEBUG & 128) && $UPDATESET) {
                $UPDATESET="PopulatedBy='$POPULATESERVER',Populated=$loopstart" . $UPDATESET;
                if ($DEBUG & 8) { dumpprint(); }
                LogMsg("netcool_dbh->do(UPDATE status SET $UPDATESET where Identifier = '$NCIdentifier')");
            } elsif (($DEBUG & 8) || ($DEBUG & 128)) {
                LogMsg("No update.");
                ## IF NO DEBUG THEN SAY NOTHING SO DONT NEED AN ELSE CLAUSE HERE...
            }
        }
        ####################### UPDATE EVENT LOGIC ##########################
        ################################ END #################################
        ######################################################################
    }
    ######################## ITERATE OVER ALL NETCOOL EVENTS #################
    ################################### END ################################

    ## LOOP TIMING CALCULATIONS AND LOGGING
    my $loopend=time();
    my $loopperiod=$loopend-$loopstart;
    my $SLEEP=(($loopperiod+$CYCLEMIN)<$CYCLESPEED)?$CYCLESPEED-$loopperiod:$CYCLEMIN;
    if ($SLEEP == $CYCLEMIN) {
        $loopperiod+=$SLEEP;
        ## THIS IS THE ONLY ERROR THAT DOESNT GET LOGGED TO AN EVENT BECAUSE IT IS GLOBAL IN NATURE
        ($DEBUG & 1) && LogMsg("ERROR: Poll cycle overrun. Cycle set at $CYCLESPEED secs but will be set to $loopperiod secs to allow for a minimum $CYCLEMIN secs break between cycles.");
    }
    ($DEBUG & 4) && LogMsg("$ncidx Netcool events processed in $loopperiod seconds. Sleeping $SLEEP seconds for loop period of $loopperiod (should be $CYCLESPEED. Minimum rest allowed: $CYCLEMIN)");
    if ( $istrue ) {            ## IF DEBUG & 16 AND GOING TO EXIT ANYWAY, DON'T WAIT.
        sleep($SLEEP);
    }
}
$netcool_dbh->disconnect;
$oracle_dbh->disconnect;
($DEBUG & 2) && LogMsg("Exiting Main Program");

IMPORTANT NOTICE

ALL INFORMATION PROVIDED IN THIS PAPER IS PROVIDED "AS IS" WITH ALL FAULTS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED. NMS GURU DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED INCLUDING, WITHOUT LIMITATION, THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
NMS GURU SHALL NOT BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, EXEMPLARY, PUNITIVE OR INCIDENTAL DAMAGES INCLUDING, WITHOUT LIMITATION, LOST PROFITS OR REVENUES, COSTS OF REPLACEMENT GOODS, LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE ANY PRODUCT MENTIONED IN THIS PAPER, DAMAGES RESULTING FROM USE OF OR RELIANCE ON THE INFORMATION PRESENT, EVEN IF NMS GURU HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. NMS GURU IS NOT LIABLE FOR THE ACCURACY OR UTILITY OF THE INFORMATION CONTAINED IN THIS WHITE PAPER. NMS GURU'S DISCUSSION OF ANOTHER COMPANY'S PRODUCTS AND/OR SERVICES DOES NOT CONSTITUTE EITHER AN ENDORSEMENT OR A RECOMMENDATION. THE CONTENTS OF THIS PAPER ARE FOR INFORMATION PURPOSES ONLY.

About NMS Guru

NMS Guru architects and manages comprehensive enterprise management solutions through principal consultants with decades of experience and deep roots in the industry. Specialties include: monitoring, performance, configuration, provisioning, change, and security solutions for networks, systems, applications, and business processes. These practices are integrated holistically by weaving together the strategic initiatives from above (OSS/BSS, BPM, ITIL, FCAPS, TMN, etc.) with the tactical realities from below (tools, people, knowledge and processes). The result is increased operational awareness and extended useful lifespan of the enterprise management solution. NMS Guru is headquartered in Austin, TX. (Along with NMS tools: IBM Tivoli, CA NetQOS, SolarWinds, and many others.) For more information, visit the website at http://nmsguru.com or call 1.512.617.6694.

Author

Dan Needles is the founder of NMS Guru. Over the past 20 years as an Enterprise IT Operations Architect he has designed and implemented fault, performance and configuration management solutions and products for over 50 Fortune 500 companies and government entities.
He can be contacted by email at guru@nmsguru.com or by phone at 1.512.627.6694.