Scenario       Description/Example                                   Time Horizon   Data Size
Alerting       Detecting and Mitigating Problems                     Now            Small to Large
Dashboards     Service Insight                                       Now-Recent     Modest
Reports        How is feature X adoption progressing day over day?   Hourly/Daily   Medium
Data Science   Building prediction models based on past behaviors    Unlimited      Very Large

Complex cloud architecture example...

Cloud apps have key differences from traditional on-premises systems
• Internet-facing, always up
• Service SLAs – uptime requirements
• Larger scale – ISVs/SaaS vendors host all customers vs. sell/deploy each customer 1-by-1

Troubleshooting in the Cloud
• Too many machines/databases/etc. to troubleshoot manually
• Separate “mitigate” vs. “root cause” (RCA) determination
• Generate telemetry to determine RCA (later)
• Find a way to get things working ASAP (reboot/failover/whatever)

• Analyze: Up to a certain size, tools to analyze and monitor the system work
• System for the system: Beyond that, you need a system to monitor the system

Event Tracing for Windows (ETW)
• Native to Windows platform
• Great performance & OK diagnostic tooling
• Historically hard to publish events

EventSource class
• New in .NET Framework 4.5
• Meant to ease authoring experience
• Extensible but supports ETW-only out of the box

Semantic Logging Application Block (SLAB)
• Provides several destinations for events published with EventSource
• Does not require any knowledge of ETW
• Additional tooling support for authoring events

Data Source                            Description
IIS Logs                               Information about IIS web sites.
Azure Diagnostic infrastructure logs   Information about Diagnostics itself.
IIS Failed Request logs                Information about failed requests to an IIS site or application.
Windows Event logs                     Information sent to the Windows event logging system.
Performance counters                   Operating System and custom performance counters.
Crash dumps                            Information about the state of the process in the event of an application crash.
Custom error logs                      Logs created by your application or service.
.NET EventSource                       Events generated by your code using the .NET EventSource class.
Manifest based ETW                     ETW events generated by any process.

Health (master)
• sys.event_log
• sys.bandwidth_usage
• sys.database_connection_stats

Data Access & Usage
• sys.dm_db_index_usage_stats
• sys.dm_db_missing_index_details
• sys.dm_db_missing_index_groups
• sys.dm_db_missing_index_group_stats
• sys.dm_exec_sessions

Performance
• sys.dm_exec_query_stats
• sys.dm_exec_sql_text
• sys.dm_exec_query_plan
• sys.dm_exec_requests
• sys.dm_db_wait_stats

Resource Usage
• master.sys.resource_usage*
• master.sys.resource_stats*
• userdb.sys.dm_db_resource_stats

Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx

sys.dm_exec_query_stats
  Details: Cumulative view of query statistics
  Use: Total and average resource consumption
sys.dm_exec_sql_text
  Details: Returns the text of the SQL batch that is identified by the specified sql_handle
  Use: Provide overall batch text for statement
sys.dm_exec_query_plan
  Details: Returns plan in XML for specified plan handle
  Use: Provide plan for tuning and analysis
sys.dm_exec_requests
  Details: Current requests executing on your DB
  Use: Check for blocking, contention-related issues, convoys, etc.

• Look at the Top N’s
  • CPU / IO / Worker Time / Executions / Avgs
• Compare Queries Between Shards
  • Plan Changes
  • Resources
  • Executes / Hot Shards?

What is Slow?
• Look at Durations…
  • DML
  • Blocking / Waits / Throttling
  • One Offs

• Works on-premises and in the cloud
• Free -> ~ $2578.00/mo (10 xlarge instances)
• Agent based, hooking the profiling API
• Great cross-instance correlation features

Availability
Performance
Usage
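The “Top N” review of sys.dm_exec_query_stats described above can be prototyped offline once rows from the DMV have been exported. The following is a minimal sketch under stated assumptions, not a production tool: the function name top_n_queries and the sample rows are hypothetical, while the column names (sql_handle, execution_count, total_worker_time, total_logical_reads) are the ones the DMV exposes. It ranks queries either by cumulative totals or by per-execution averages.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Hypothetical sample rows, shaped like an export of sys.dm_exec_query_stats.
# The DMV reports total_worker_time in microseconds.
SAMPLE_ROWS: List[Row] = [
    {"sql_handle": "0x01", "execution_count": 1000,
     "total_worker_time": 5_000_000, "total_logical_reads": 20_000},
    {"sql_handle": "0x02", "execution_count": 10,
     "total_worker_time": 4_000_000, "total_logical_reads": 900_000},
    {"sql_handle": "0x03", "execution_count": 50_000,
     "total_worker_time": 2_500_000, "total_logical_reads": 100_000},
]


def top_n_queries(rows: List[Row], metric: str, n: int = 10,
                  average: bool = False) -> List[Row]:
    """Return the n rows with the highest value of `metric`.

    With average=True the metric is divided by execution_count, so rarely
    executed but expensive queries rise to the top.
    """
    def score(row: Row) -> float:
        value = float(row[metric])
        if average:
            # Guard against a zero execution count in exported data.
            return value / max(int(row["execution_count"]), 1)
        return value

    return sorted(rows, key=score, reverse=True)[:n]


if __name__ == "__main__":
    for row in top_n_queries(SAMPLE_ROWS, "total_worker_time", n=2):
        print("by total CPU:", row["sql_handle"])
    for row in top_n_queries(SAMPLE_ROWS, "total_worker_time", n=1, average=True):
        print("by average CPU:", row["sql_handle"])
```

Ranking by totals surfaces the queries that dominate overall load, while ranking by averages surfaces expensive one-offs. Running the same ranking per shard is one way to compare queries between shards.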
[Diagram labels: Application, Telemetry, DB, SCOM]

SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324

Generating Telemetry
• WA Table Storage: General maximum throughput is 1000 entities / partition / table
• Performance Counters:
  • Use part of the timestamp as the partition key (limits the number of concurrent entity writes)
  • Each partition key is 60 seconds wide, and entries are written asynchronously in bulk

Consuming Telemetry
• WA Table Storage read performance degrades with # entities/partition
• Example: Entities/Partition := (# perf counter entries) * (# role instances being monitored)

Scaling The Solution – You can extend this approach by
• Collecting performance counters at a coarser grain (Example: 1 minute -> 5 minutes)
• Filtering more records (skip WARN/INFO messages, keep ERROR)

Problems
• Some PaaS services don’t expose performance counters (Azure SQL DB, Service Bus, etc.)

[Diagram labels: Application, Telemetry, Reports/Dashboards, DB, Telemetry DB, DMVs, Worker Role]

http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-servicefundamentals.aspx

Generating Telemetry / Consuming Telemetry
• WA Blob Storage supports higher limits (but you need to batch writes better)
• Polling DBs requires DMV diffing (which is imperfect but better than nothing)
• Multi-threading helps scale the system (to a point), but eventually you have latency
• Database allows use of existing tools (Reporting Services, etc.)
• Writing dashboards takes some time initially, but it can really help

Scaling The Solution – You can extend this approach by
• (Same as approach 1 – collect less often or collect less data)

Problems
• Eventually you want data “faster” and things slow down as you scale your service

[Diagram labels: All Geo-Regions; One Region; Alerting/Compute; Deployment; Job Complete Notification; HDI; Hive; Pig; Cluster; Scheduling; WA Storage; Data Exhaust; Persist Telemetry; Partitioned Queues; Map-Reduce Jobs; On-Premises Data Warehouse; ETL; Persist Curated Data; Transform/Load; Data Warehouse]

Generating Telemetry / Consuming Telemetry
• On-node collectors batch telemetry and write to multiple WA Blob Storage containers
• Per-geo-region accounts (collocated with service stamps in each region)
• Big Data (Hadoop or similar) system reads data across all stamps
• Aggregations/trace processing generate output data (to WA Blob Storage)
• ETL moves data into the DW
• Users query the DW with a star schema (facts/dimensions) using normal DB techniques
• Reports are generated for common activities needed to run the business
• Queries using Hive against Hadoop are also possible

Scaling The Solution – You can extend this approach by
• Adding more cores to Hadoop
• Buying a larger DW box
• Changing the aggregation grain for aggregation jobs

Problems
• E2E latency
• Layers between Hadoop world and Microsoft world (expertise in two technology stacks)

http://msdn.microsoft.com/en-us/library/jj853352.aspx
http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx
https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://developer.microsoft.com
http://technet.microsoft.com/library/dn765472.aspx
http://technet.microsoft.com/en-us/library/hh546785.aspx
http://www.microsoft.com/en-us/server-cloud/products/windows-azure-pack
http://azure.microsoft.com/en-us/