Avid System Monitor
Ed Harper
November 2010

Avid System Monitoring overview
• Avid System Monitor delivers an enterprise-wide monitoring solution for Avid systems and infrastructure switches

Overview
• Single GUI visibility to the whole infrastructure
• Standards-based polling and event notification: SNMP, IP, HTTP
• Tightly integrated with Avid Health Monitor
• Integrates with enterprise management

Devices Managed
• ISIS 5000 & 7000: System Director
• Interplay: Media Indexer, Lookup Server, Interplay Engine, Capture, ASF services
• SNMP network switches (Cisco, Foundry, Force10)

Capabilities
• Real-time statistics, thresholds
• Events, alarms, notifications (email)
• Historical statistics
• Surveillance dashboard
• Flexible reporting tools

Benefits
• Proactive real-time status and statistics
  - Identify anomalies, prevent outages
  - System-wide diagnostic tools, faster restoration
• Trend analysis

What it is
• A tool to increase system availability by identifying issues in real time
• A tool to help identify potential problems in a system as they are occurring
• A single tool for monitoring all necessary components of the “system”, including Avid gear, network infrastructure, and 3rd party devices
• A tool that collects performance data over time so that it can be graphed (and trends identified)
• A tool that will continually evolve to identify known problems within a system (after knowledge of those problems has been learned during Code Blues, etc.)
• A window into the specific state of the Avid and selected infrastructure system components at a given point in time

It also provides enough flexibility for customers to refine and fine-tune the tool’s outputs once the basic functions are mastered.

Overview
• Avid System Monitor delivers enterprise solution monitoring for Avid systems and infrastructure
  – Proactive system health and status monitoring
  – Statistics gathering, graphing and thresholds
  – Event logging, intelligent alarm processing and notification
  – Dashboard views showing outages and availability
    • Simple drill-down to isolate issues
  – Standards based
    • SNMP, HTTP & IP port status
  – Avid Monitoring Gateway service installed on Avid Service Framework (ASF) enabled devices to provide visibility to Avid System Monitor via HTTP

Monitoring components
Monitoring Server
• Recommended platform: SR2500
• GUI, SNMP & HTTP collection
• SQL Database
• Java (JDK) Environment
Monitored Node Agents
• Interplay Engine
• Stream Server
• Capture
• Media Indexer
• Interplay Lookup Service (LUS)
• ISIS 7000 System Director
• ISIS 5000
3rd Party
• Cisco switches
• Foundry switches
• Force10
Real Time Audit (Agentless)
• Avid Service Framework
• Provides time sync
• AirSpeed, AirSpeed Multi Stream
• Capture Manager
• DNS, DHCP services
• Time Sync

Monitoring Environment
• Monitored Avid Services & Devices
  – Detailed monitoring including status, statistics etc.
    • Avid Service Framework (ASF)
      – Media Indexer (MI)
      – ASF Lookup Service
    • Interplay Engine
    • Stream Server
    • Interplay Capture
    • ISIS 5000 & 7000: System Director
• Real-time inventory
  – Device up/down status without detailed monitoring
    • Workflow Engine, iNews FTS, Workstation Service, Time Sync service, Multicast repeater, LowRes Encoder
• 3rd Party Elements
  – Windows services: DNS, DHCP etc.
  – Network switches
    • Cisco, Foundry, Force10

Dashboard
• Single-screen view with intelligent grouping of devices & domains
• High-level status
  – Alarms
  – Notifications
  – Node Status
  – Resource Graphs
• Click on any device group to automatically filter information for the selected devices

Events & Alarms
• Extensive Event Logging
  – Severity, source etc.
  – Acknowledgement
  – Search
  – Fine-grain event details
  – Correlating up/restore
• Alarms
  – Flexible rules to allow event aggregation in the alarm view, counting multiple occurrences of the same event
    • Severity
    • Last time of event
    • Count occurrences
    • Link to event details
    • Option to auto-clean events
  – Operator instructions specific to alarm & device type

Notifications
• Flexible notification to email
  – Individuals or groups
• Automatic Escalation
  – Escalation to a higher-level group if a notification is not acknowledged within a certain time
    • Example: a Minor event is sent to the Ops team; if unacknowledged for 20 minutes, it is raised to Major priority and a notification is issued to the Management team
• Notification logging, with timestamps including response time

Statistics & Charts
• Historical statistics gathering, trending, charts
• Thresholds set to trigger events and notifications on ‘interesting’ conditions
  – Specifically tuned to Avid components, based on real-world experience

Threshold Event Notification
• Flexible Threshold engine
  – Configurable on any counter in the system
  – Extensive pre-programmed thresholds provided in the Avid monitoring package
  – Simple process to add customer-specific thresholds
Chart: Media Indexer Media Files, with Admin configurable
trigger levels

Threshold Configuration
• Custom configuration of Threshold Event
  – Any counter value collected by OpenNMS
  – Type: High, Low, Relative Change, Absolute Change
  – Datasource: entity to collect counter data (graph properties)
  – Datasource Type: node or interface
  – Datasource Label: string displayed in event
  – Value: threshold value
  – Re-arm: reset / cleared value
  – Trigger: number of times the threshold must be broken to create an event

Node View
• Single-screen dashboard per node
  – Current Status
  – Availability: system and individual services
  – Notifications, Recent Events, Recent Outages

Outages & Availability
• Current Outages
  – Node or service down
• Calculated 30-Day Availability
  – Color coded
• Grouped by Device / Service Type
  – Click to drill down

Surveillance View: Flexible Grouping
• Current Outages by
  – Device Type
  – Workgroup or location
• Grouping by
  – Service
  – Category
  – Simple customization

Node Discovery
• Configure OpenNMS to discover devices and services on a specific IP address or range
  – Automated capability query of generic IP, SNMP and Avid-specific services & device capabilities
  – Add device names to nodes for readability if desired
    • IP address and DNS names displayed by default
• Automated capabilities scan every 24 hours

Network Switch Monitoring
• SNMP monitoring and statistics gathering for Cisco, Foundry & Force10 infrastructure Zone 2 switches
  – SNMP
    • Link Up
    • Link Down
  – Network
    • Spanning Tree Topology Change
    • Bandwidth Utilization
  – Thermal
    • Max temp exceeded
  – System
    • Memory utilization
    • Processor utilization
  – Cisco
    • Configuration change
  – Foundry
    • Startup config change
    • Running config change
    • Telnet login / logout

Maps
• OpenNMS provides a mapping tool with device status
  – Multiple maps to allow views for LAN, editors etc.
  – Link discovery finds node connectivity
    • Not all links are shown correctly: ISIS switches are not manageable, so devices appear connected to the adjacent switch

Proving its Value (a real field
example)
• Phased Roll-out
  – Monitoring SNMP switches (only)
• Customer Reported AirSpeed “Slow Down”
  – Avid CS / Systems Engineers queried OpenNMS remotely
  – Pulled switch bandwidth utilization
    • Switches operating correctly
    • Within a few minutes the troubleshooting team moved on to investigate specific devices
  – Without OpenNMS, proving switch operation required access and a labor-intensive process of monitoring scripts and driving traffic loads
    • Time consuming: ~1 day to prove switches
Faster resolution
Greater customer satisfaction

Example
• Memory utilization on Interplay Media Indexer
• Charts show steady consumption of server RAM during load test
• Performance impacted as memory maxed out
• Thresholds provide notification when x% exceeded

Server & System Requirements
Category                     Requirement
Avid System Monitor Server   Recommended: Intel SR2500 Server
Operating System             Windows 2003
Processor                    2 GHz or better
Memory                       2 GB
Java JDK                     Provided with Avid System Monitor
PostgreSQL Database          Provided with Avid System Monitor
Client Browser               Adobe SVG viewer: required for Internet Explorer client browser to view map pages (Firefox etc. have an SVG viewer built in)

Pricing, Availability etc.
• Delivery
  – Value-add offered to customers with Avid Uptime support
    • Software download
    • Phased roll-out at selected customer production networks
  – Typically switch monitoring
• Pricing
  – Avid System Monitor available to Avid Uptime support contract customers
  – PSG installation
    • PSG engagement required

Summary
• Real-time monitoring of devices, services, networks & infrastructure
  – Avid Customer Success
  – Customer IT / Admin
• Statistics, thresholds, events and notifications
• Broad enterprise system support
  – Increasing breadth and depth
• Proactive warnings and notification of potential problems
• Improved time to resolution

Avid Monitoring Solution
[Diagram. Labels: OpenNMS (GUI, SNMP data collection, trap receiver); protocols: ICMP (ping), HTTP/TCP, SNMP, Avid TCP port monitoring, service / IP monitoring; monitored elements: ISIS client, Editor; DNS, time sync; LAN switches; Interplay Engine, Stream Server, Archive; Lookup Server; Media Indexer; AirSpeed, AirSpeedMS; ISIS System Director, ISIS Engine, ISIS ISB, ISIS switch; ASF Monitoring Gateway; ASF Health Monitor. Annotations: “ICMP only”; “Full Monitoring; events, statistics”.]

Failure Modes Monitored
• Avid System Monitor is tuned to identify specific failure modes
  – As found in field experience / Code Blue
• Media Indexer
  – MI in the HAG with a weight of "0": indicates an "election issue", which can cause major system slowdown.
  – Number of quarantined files growing: indicates a faulty ingest device creating bad files.
  – Different file count between each of the HAG MIs: indicates an issue with ISIS notifications. Some files will appear offline to some clients.
  – Different time on each of the machines in the WG: can be the cause of lost ISIS notifications (see above).
  – MI heap usage running dangerously high: indicates your WG file count or client count is causing too much stress on that MI. Eventually, the MI will thrash.
  – Number of files added/updated on last full resync, when it is greater than 0. This value is displayed in the Health Monitor, under each storage pane of the MI.
• Interplay Engine
  – Time to perform login (should be below 15 seconds): indicates engine slowness
  – Number of journal files (should be below 50): indicates journal integration stuck/dead
  – Number of deletes (should be below 100 for 5-minute polling intervals during normal production time): indicates deletion during production time
  – Number of loaded objects / number of total objects (should be above 30%): indicates engine cache warm-up causing slowness
  – Backup running flag: should be off during production time
• Avid Service Framework Lookup Service (LUS)
  – For LUS, here are things we could check today via the SNMP Gateway.
    However, these monitor points don't really contribute to most of the problems we see related to ASF. They are the only data points that are available today.
  – Monitor Handle Count (either via gateway or MSFT agent): should be below some threshold (<5000)
  – Monitor Thread Count (either via gateway or MSFT agent): should be below some threshold (<500)
  – Monitor Events In Queue (via gateway): should be less than 50
  – Check that a process is bound to port 4160 on the box (don't know how to do that with OpenNMS): confirms that the LUS process is running
  – Monitor Memory Usage (either via gateway or MSFT agent): should be below some threshold (<200 MB)
• ISIS
  – ISIS monitors a number of critical areas and sends an event to the Windows event log when values reach a defined value or threshold. You can configure ISIS to send an email when an error or warning event occurs. You can also configure the System Director to generate an SNMP trap when the event occurs. The top areas include the following:
    • Temperature and presence of components such as switches, storage elements, and power supplies.
    • Workspace usage thresholds. For example, an Admin can enable warning and error thresholds. If you set the workspace threshold to 90%, ISIS will generate an error event when a workspace reaches 90% full.
    • Disk health issues such as disk failed or disk performance degraded, based on continuous monitoring.
    • Server failover notifications. For example, on a failover system you are notified when the system fails over to the other node.
    • Metadata problems.
      For example, a problem opening a metadata file, or metadata in a file that seems out of date.

Monitored Device Matrix
Device / Service                 Version(s), Notes                                  Inv / Mon
Unity ISIS                       v2.x                                               √
Interplay Engine                 v2.x                                               √
Media Indexer                    v2.x                                               √
Interplay Engine                 v2.x                                               √
Interplay Lookup Service (LUS)   v2.x                                               √
AirSpeed                                                                            √
AirSpeed Multi Stream                                                               √
Capture Manager                                                                     √
Interplay Capture                v2.x                                               √
DNS & DHCP services                                                                 √
Avid Time Sync Service                                                              √
3rd Party Network Switches       Cisco / Foundry / Force10                          √
Windows Services                 DNS, DHCP, Time, Anti-virus, auto-update etc.      √

Inv: Real-time inventory; service or server Up/Down
Mon: Full monitoring; detailed alarms, statistics etc.
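
The Value, Re-arm and Trigger parameters listed under Threshold Configuration can be illustrated with a small sketch of a "High" threshold. This is not OpenNMS code: the class name `HighThreshold`, the method `evaluate`, and the event strings are hypothetical. It also assumes that "number of times the threshold must be broken" means consecutive samples, and that the threshold re-arms once a sample falls to or below the Re-arm value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HighThreshold:
    """Illustrative 'High' threshold with re-arm and trigger count (hypothetical names)."""
    value: float    # Value: level at or above which a sample counts as a breach
    rearm: float    # Re-arm: level at or below which the threshold resets
    trigger: int    # Trigger: consecutive breaches needed to create an event

    def __post_init__(self) -> None:
        self._breaches = 0   # consecutive breaches seen while armed
        self._armed = True   # False after an event, until a sample re-arms it

    def evaluate(self, sample: float) -> Optional[str]:
        """Return 'exceeded', 'rearmed', or None for one polled sample."""
        if self._armed:
            if sample >= self.value:
                self._breaches += 1
                if self._breaches >= self.trigger:
                    self._armed = False
                    self._breaches = 0
                    return "exceeded"
            else:
                # Assumption: a non-breaching sample resets the consecutive count.
                self._breaches = 0
        elif sample <= self.rearm:
            self._armed = True
            return "rearmed"
        return None


# Example: memory utilization (%) polled over eight intervals.
monitor = HighThreshold(value=90.0, rearm=80.0, trigger=3)
samples = [70, 92, 95, 96, 93, 85, 78, 91]
events = [monitor.evaluate(s) for s in samples]
print(events)
```

With these sample values, a single event is raised on the third consecutive reading at or above 90, no further events are raised while the value stays high, and the threshold re-arms when the reading drops to 78. Keeping the Re-arm value below the Value is what prevents repeated notifications when a counter hovers around the limit.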