NETWORK PERFORMANCE MONITORING SYSTEM

Dimpalben Prajapati
B.E., Gujarat University, India, 2007

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2011

A Project by Dimpalben Prajapati

Approved by:
Jinsong Ouyang, Ph.D., Committee Chair
Chung-E Wang, Ph.D., Second Reader

Student: Dimpalben Prajapati

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.

Nikrouz Faroughi, Ph.D., Graduate Coordinator
Department of Computer Science

Abstract of NETWORK PERFORMANCE MONITORING SYSTEM by Dimpalben Prajapati

The goal of this project is to develop a system that monitors and controls, in real time, the Linux and Windows servers in the ECS network. Such a system helps the IT staff in the ECS department monitor the servers remotely through a web interface. An IT administrator can identify server performance issues simply by analyzing the statistics shown on the web interface. The system has three main components. The first is a web interface, which provides tabular and graph views of performance data for various servers. The second is a Linux managing server, which collects and processes server statistics from the agents. The third is the set of Linux and Windows agents deployed on ECS servers, which collect the raw data from the servers.

Committee Chair: Jinsong Ouyang, Ph.D.

ACKNOWLEDGEMENTS

I would like to thank all the people who have helped and guided me in completing my Master's project. My sincere thanks to Dr.
Jinsong Ouyang for giving me the opportunity to work on this project and for guiding me throughout it. His insight into the project helped me greatly in deepening my conceptual knowledge of system programming. I am extremely thankful to Dr. Chung-E Wang for agreeing to be my second reader and for his prompt review of my project.

Finally, I would like to thank my professors, department, and school for giving me this opportunity to learn and grow toward attaining this degree. My friends always helped me learn and gave valuable feedback as I completed the milestones of my Master's program. I would like to thank my parents and in-laws, who have always motivated me and blessed me to reach this far in my life. Last but not least, I am extremely thankful to my fiancé Nishad, who constantly encouraged me to achieve my goals and helped make my dreams come true. I dedicate my success to him; he will always be the person behind all my achievements in life.

TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures

Chapter
1. INTRODUCTION
    1.1 Background
    1.2 Motivation
    1.3 Goal
    1.4 Results
2. ARCHITECTURE
    2.1 Operating Console
    2.2 Web Server and Web Interface
    2.3 Database
    2.4 Managing Server
    2.5 Agents
3. IMPLEMENTATION
    3.1 Web Interface
        3.1.1 Sending Data Collection Request
        3.1.2 Receiving Performance Stats
        3.1.3 Stop Updates
        3.1.4 Settings
        3.1.5 AJAX
        3.1.6 RGraph
    3.2 Managing Server
    3.3 Agent
        3.3.1 Linux Agent
        3.3.2 Windows Agent
4. CONFIGURATION
    4.1 Configuration Specs
    4.2 Installation
5. CONCLUSION
6. FUTURE WORK
Bibliography

LIST OF TABLES

1. Database Schema

LIST OF FIGURES

1. System Architecture
2. Web Server to Managing Server Communication
3. Process Flow Chart
4. Socket Communication
5. Performance Stats in Tabview
6. Data Loading using AJAX
7. Set Interval
8. Delete Data from Database
9. Graph for Overall CPU Utilization
10. CPU Structure
11. Memory Structure
12. Disk I/O Structure
13. Network I/O Structure
14. Process Structure
15. Performance Stats Structure
16. Linux Agent Process Flow Chart

Chapter 1
INTRODUCTION

The Network Performance Monitoring System remotely monitors, in real time, the servers deployed in the Engineering and Computer Science (ECS) department network. The advantage for the IT administrator is that he or she does not have to go to each individual server and type commands or watch the task manager. The system gives the flexibility to check performance metrics from anywhere, at any time, and in real time. Performance monitoring can help solve many server issues and improve server responsiveness. It can identify the problems a server is experiencing and direct the IT administrator toward resolving them.

1.1 Background

Among existing products and research, HP offers a performance monitoring product called HP Performance Manager Software. “HP Performance Manager Software is a web-based analysis and visualization tool that analyzes performance trends of applications, system, and services.”[1] This software monitors performance in real time and tracks historical trends, which helps in managing and optimizing the performance of mission-critical servers.

1.2 Motivation

The servers deployed in the ECS network are among the busiest, because students do all the computing for their course projects and assignments on them. Constant monitoring helps identify the busy hours so that the administrator can add resources during those periods. This provides a better user experience and prevents the servers from bogging down.

Existing systems, such as HP's Performance Manager Software and other monitoring tools, help improve an organization's network resources, prevent errors in network access, and provide easy access to remote resources. After reviewing existing systems and software, I chose to implement not all of the functions they provide but a minimal set of basic features. This establishes a foundation for a larger system in the future.

1.3 Goal

The main challenge and goal of this project is to deliver a working system that monitors both Windows and Linux servers. When the system is deployed on the ECS network, it should be easy to configure. The source code should not need any internal modification in order to run in the ECS environment. The system should collect and process five sets of statistics: system-wide CPU utilization, system-wide memory utilization, disk I/O statistics, network utilization, and process resource utilization.

1.4 Results

The system was successfully developed and deployed on servers of the ECS department. The user can see updates coming from the selected servers in tables and graphs as expected.
The managing server can handle requests for a single server as well as for multiple servers. It implements multithreading, handling the requests for each server in an individual thread. The system can monitor Linux servers as well as Windows servers.

Chapter 2
ARCHITECTURE

This chapter discusses the components, modules, interfaces, and database in detail. The project follows an architecture with three main components, each performing a specific function. The components are interconnected in a way that binds the whole system together. The first is the web interface, which handles all user interaction; the user has no knowledge of any other component. The second is the managing server, which is the bridge between the agents (Windows and Linux) and the web server. The third is the managed agent, which interacts with the Linux and Windows servers.

Figure 1. System Architecture

2.1 Operating Console

This component provides access to the whole system. The user does not have to know the underlying architecture. The user can be an IT administrator or anyone who can analyze the performance metrics of a Linux or Windows server.

Requirements:
- Any JavaScript-enabled web browser, such as Internet Explorer, Mozilla Firefox, Safari, or Chrome.
- The user should be familiar with the web environment.
- The user should understand system performance metrics such as CPU, memory, and network utilization, disk I/O, and network I/O.

2.2 Web Server and Web Interface

The web server hosts the web interface. The web interface is developed in PHP using HTML, JavaScript, and AJAX technologies. It lets the user select a single server or multiple servers. When the user chooses a server, the web server obtains the performance statistics and the web interface displays them in tabular and graph views. The web interface presents the performance statistics in a user-friendly view that resembles the Windows task manager.
It uses the Yahoo UI library to wrap the performance statistics in five different categories: Overall, CPU, Network, Disk, and Process.

The web server interacts with two other components of the system, as shown in Figure 1: the managing server and the database system. When the user requests data updates, the web server forwards that request to the managing server. This communication uses sockets, where the web server is the socket client and the managing server is the socket server.

Requirements:
- LAMP (Linux, Apache, MySQL, PHP) platform
- Text editor such as Dreamweaver, emacs, vi, or Bluefish
- Understanding of AJAX technology, PHP to MySQL operations, HTML, CSS, and JavaScript
- Sockets in PHP

Libraries used:
- Prototype.js
- Yahoo UI (tabview, calendar)
- RGraph

Figure 2 shows the communication between the different components of the system. It depicts how these components are interconnected and which information is passed between them. To see the performance statistics in real time, the user selects a single server or multiple servers from the list provided and starts data collection.

Figure 2. Web Server to Managing Server Communication (components: Web Interface, Web Server, Managing Server, Database; messages: list of servers, command for data collection, performance stats)

On receiving the user's request for data updates, the web server forms the command for data collection to send to the managing server. This command has the following format:

    | (STARTDATACOLLECTION), 1 byte | (NumOfServers), 1 byte | (Interval), 1 byte | (ServerName/s), 16*NumOfServers bytes |

After sending the command for data collection to the managing server, the web server waits for the response. The managing server sends serialized performance statistics to the web server. When the web server receives the response, it performs four main tasks. First, it de-serializes the statistics. Second, it inserts them into the database. Third, it binds the response into tabular format. Fourth, it generates graphs from the current statistics and displays them on the web interface.

2.3 Database

The database contains historical performance statistics. Whenever the web server receives performance statistics from the managing server, it inserts them into the corresponding tables. The following is an outline of the database schema:

Table 1. Database Schema

    Table Name      Description
    SERVER_TOTAL    Contains total CPU utilization, total network utilization, and memory metrics for each server
    CPU_INFO        Contains CPU metrics for each CPU of a server
    NETWORK_INFO    Contains network metrics for each adapter of a server
    DISK_INFO       Contains disk metrics for each server
    PROCESS_INFO    Contains process metrics for each server
    SETTINGS        Settings for the interval and the address of the managing server
    SERVER          All servers that can be monitored

2.4 Managing Server

The managing server is responsible for forwarding each request from the web server to the appropriate agent, because the web server has no knowledge of the agents. After sending the request, it receives a response containing performance statistics. The managing server performs the managerial part of this system.

The work done by this component is highly time-sensitive. Requests and responses must be processed quickly because the web server expects updates at very short intervals. To meet this requirement, the managing server performs the data gathering task using multithreading. It serves the request for each server in a separate thread instead of serving the requests one by one, achieving faster communication through parallel processing.

After sending the request to an agent, the managing server waits for the response from that agent and then forwards it to the web server. Some of the performance metrics need no extra calculation or processing. For others, the managing server performs a calculation and then sends the statistics to the web server.
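As one illustration of the kind of calculation the managing server might perform, the sketch below averages per-CPU utilization figures into a single overall entry. The struct mirrors the cpuInfo structure shown in Chapter 3 (Figure 10); the function name aggregateCpu and the simple averaging rule are assumptions made for illustration, not the project's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the cpuInfo structure shown in Figure 10. */
typedef struct {
    long long cpuNum;                  /* -1 when the entry holds overall CPU data */
    float userUtil;                    /* user mode utilization */
    float sysUtil;                     /* system mode utilization */
    float idleUtil;                    /* idle utilization */
    unsigned long long numInterrupts;  /* number of interrupts */
} cpuInfo;

/* Hypothetical helper: averages the utilization of n CPUs and sums their
 * interrupt counts. The result is tagged with cpuNum = -1 ("overall"). */
cpuInfo aggregateCpu(const cpuInfo *cpus, size_t n)
{
    cpuInfo overall = { -1, 0.0f, 0.0f, 0.0f, 0ULL };

    assert(cpus != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        overall.userUtil += cpus[i].userUtil;
        overall.sysUtil += cpus[i].sysUtil;
        overall.idleUtil += cpus[i].idleUtil;
        overall.numInterrupts += cpus[i].numInterrupts;
    }
    if (n > 0) {
        overall.userUtil /= (float)n;
        overall.sysUtil /= (float)n;
        overall.idleUtil /= (float)n;
    }
    return overall;
}
```

A caller would pass the array of per-CPU entries received from an agent together with the CPU count, and could then send the returned entry alongside the per-CPU data.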

Requirements:
- C/C++ on a UNIX platform
- C sockets
- Multithreading (pthreads)
- Sorting algorithms
- C structures and pointer manipulation

2.5 Agent

The agent is a crucial part of this project. An agent has to be deployed on each server that the user wants to monitor; otherwise, monitoring is not possible. The agent sits idle until it receives a data fetching request from the managing server. When the request arrives, the agent starts reading the performance stats from the server it runs on, binds the stats into a data structure, and sends them back to the managing server. There are two types of agents: the Linux agent, which runs on Linux servers, and the Windows agent, which runs on Windows servers.

Requirements:
- C/C++ on the Linux platform
- Visual C/C++ on the Windows platform
- VB scripting
- /proc Linux pseudo filesystem
- PDH interface to get raw counters

Figure 3 shows the process flow chart for the overall architecture of this project. It depicts how the components interact with each other and what information they exchange.

Figure 3. Process Flow Chart (the web interface passes the list of servers with the interval to the web server; the command for data collection goes from the web server to the managing server and on to the agent; performance stats return from the agent through the managing server to the web server, which inserts the data into the database; the web interface reads its data from the database)

Chapter 3
IMPLEMENTATION

This chapter discusses the implementation in detail. The implementation follows the architecture discussed above, with each component tightly connected to the next. The three main components, the web server, the managing server, and the agent, are connected with each other using sockets. The project implements a client/server model for each component. The sockets use UDP rather than TCP because establishing a connection is not essential for starting data collection. The rationale is that all traffic stays inside the ECS network, no external parties are involved, and each source is trusted.
The communication is datagram-based: messages are passed from one point to another, and using UDP makes the communication faster and eliminates the overhead of establishing a connection.

Figure 4. Socket Communication (the web server sends a request for data collection to the managing server, which sends a request for data collection to the Windows agent and the Linux agent; the agents return raw performance stats, and the managing server returns a calculated response containing performance stats to the web server)

3.1 Web Interface

The web interface is implemented on the LAMP (Linux, Apache, MySQL, and PHP) platform. It uses various technologies and libraries, such as AJAX, Yahoo UI, and RGraph, to make the interface more interactive. To implement, enhance, or extend the web interface, a developer should have a grasp of PHP with MySQL, HTML, JavaScript, AJAX, and CSS.

The web application has the following directory structure:
- nwMgtSystem (parent folder)
    - Images
    - js
    - lib
    - Styles
    - All PHP files

The tasks of each part of the web interface are distributed across different files and folders. All images displayed on the web interface come from the Images folder. Two main JavaScript files are used in this project. One is loadData.js, which contains the code for the AJAX requests and responses that fetch data from the database for display in tables and graphs. The other is prototype.js, which contains the framework for AJAX and some DOM functions.

The lib folder contains the libraries for Yahoo tabview, RGraph, and the calendar. Yahoo tabview provides the AJAX tabs that display the different performance metrics. RGraph generates the line charts and pie charts. calendar.js provides the interactive calendar tool that the user uses when deleting data from the database.

The web interface follows the style guidelines that CSUS and ECS provide. The Styles folder contains site.css, which is used across the entire web application, and tabview.css, which styles the tabs.

Index.php is the base of the web application hierarchy. It uses three template files: header.php, mainNavigation.php, and footer.php.
It is divided into four main tasks: sending the data collection request, receiving performance stats, stopping data collection, and settings.

3.1.1 Sending Data Collection Request

When the user clicks the “Start Updates” button, the page sends the data collection request by calling the sendRequest() function. This function gets the selected servers, collects them into an array, and sends an AJAX request to getData.php.

    var ddServers = document.getElementById("ddListServers");
    var selectedServers = new Array();
    var i;
    var count = 0;
    // Collect the value of every selected entry in the server list
    for (i = 0; i < ddServers.options.length; i++)
    {
        if (ddServers.options[i].selected)
        {
            selectedServers[count] = ddServers.options[i].value;
            count++;
        }
    }
    var val = selectedServers;
    // Post the selected servers to getData.php
    new Ajax.Request("getData.php",
    {
        method: 'post',
        postBody: 'q=' + val,
        onComplete: callshowResponse
    });

getData.php opens a socket to the managing server. The address and port on which the managing server listens for requests from the web server are listed in config.php. The web server forms the command, whose structure is given in Chapter 2. The following snippet shows the process of sending the command for data collection.

    $socket = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
    $serverName = $managingServer;
    $portnum = $port;
    socket_connect($socket, $serverName, $portnum);
    $sendcmd = chr($cmd) . chr($numServers); // 1st byte=cmd and 2nd byte=numServers
    socket_sendto($socket, $sendcmd, strlen($sendcmd), 0, $serverName, $portnum);
    socket_sendto($socket, $interval, strlen($interval), 0, $serverName, $portnum);
    socket_sendto($socket, $server, strlen($server), 0, $serverName, $portnum);

When the AJAX request completes, the function callshowResponse() starts the timer for receiving the data collection response from the managing server.

    function callshowResponse(req)
    {
        var serversRequested = req.responseText;
        var interval = $('txtInterval').value * 1000; // setInterval expects milliseconds
        setInterval("startReceivingData()", interval);
    }

The timer is now active and fires at every specified interval.
Each time it fires, it calls the startReceivingData() function.

3.1.2 Receiving Performance Stats

After the start data collection command has been sent, the startReceivingData() function sends an AJAX request to receiveData.php to receive performance statistics. receiveData.php binds a socket for communication with the managing server and starts receiving data for all the requested servers. It also connects to the database and, as soon as it receives data, inserts every metric into the database. The following snippet shows how it binds the socket and receives the platformID, the number of CPUs, and the CPU structures.

    $socketRecv = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
    if (!socket_bind($socketRecv, $address, $port)) {
        // handle bind failure
    }
    // Getting platformID
    socket_recv($socketRecv, $buf, 1, 0);
    // Getting ServerName
    socket_recv($socketRecv, $buf, 16, 0);
    // Getting number of cpus
    socket_recv($socketRecv, $buf, 8, 0);
    $cpuStructSize = 28; // 28bytes -> |cpunum=8|sysutil=4|userutil=4|idleutil=4|numinterrupts=8
    socket_recv($socketRecv, $buf_cpuinfo, $numCPUs*28, 0);

When the request is finished, the data is loaded on the active tab.

    if (myTabs.get('activeIndex') == 1)
        loadCPUStats();
    else if (myTabs.get('activeIndex') == 2)
        loadNetworkStats();

The loadCPUStats() function first gets the data to load into the table and then calls the updategraph() function to update the graph with the latest data. updategraph() is a generic function that loads the graph for every performance metric. Depending on the flag passed to it, the corresponding AJAX request is performed. The AJAX response provides the set of data points to plot on the graph.

Data loading for the Overview, CPU, Network, Memory, Disk, and Process tabs is done in getOverallStats.php, getCPUStats.php, getNetworkStats.php, getMemoryStats.php, getDiskStats.php, and getProcessStats.php respectively. Graph loading is done in graphOverall.php, graphCPU.php, graphNetwork.php, graphMemory.php, graphDisk.php, and graphProcess.php.

Figure 5 shows the tabs for the different metrics.
This view resembles the Windows task manager view.

Figure 5. Performance Stats in Tabview

The web interface has the following tabs:
- The Overview tab displays overall CPU utilization (for example, if gaia has 4 CPUs, it displays the combined utilization of all of them), memory utilization, and network utilization.
- The CPU tab displays the utilization of each individual CPU of each server. The performance stats for every server are displayed individually. This view resembles the Linux ‘top’ command display, where each row of output is for one CPU.
- The Network tab displays performance stats for each network adapter. It includes bytes received, bytes sent, number of errors, collisions, and network utilization. It also displays a graph of network utilization, which is the overall utilization of all adapters.
- The Memory tab displays memory performance stats: total memory, free memory, used memory, memory page in, memory page out, memory swap in, and memory swap out. It also shows a graph of memory utilization.
- The Disk tab displays each disk partition of the server. This view also resembles the Linux ‘top’ command display, where each row of output is for one disk partition. The details include the number of reads on the disk partition, the number of writes, the major and minor numbers, and the number of blocks.
- The Process tab shows the same details as the Linux ‘top’ command. It displays statistics for every running, suspended, and idle process. These include PID, priority, process status, process memory utilization, and process CPU utilization.

Figure 6 shows how data is loaded in the tabs.

Figure 6. Data Loading using AJAX

3.1.3 Stop Updates

When the user clicks the “Stop Updates” button, an AJAX request is sent to stopUpdates.php, which in turn sends the command ‘CMD_STOP_UPDATES’ to the managing server.

    $cmd = "4"; // command for StopDataCollection = 4
    $sendcmd = chr($cmd); // 1st byte=cmd
    socket_sendto($socket, $sendcmd, strlen($sendcmd), 0, $serverName, $portnum);

3.1.4 Settings

The web interface also provides a way to reconfigure the interval at which data updates arrive from the managing server, and to delete data from the database. The delete functionality supports date ranges, and the user may also specify only a from date or only a to date. If the user specifies only a from date, data is deleted from that date up to the current date and time. If the user specifies only a to date, all data before that date is deleted.

Figure 7. Set Interval

Figure 8. Delete Data from Database

3.1.5 AJAX

AJAX is a methodology for creating asynchronous web applications. “With Ajax, web applications can send data to, and retrieve data from, a server asynchronously (in the background) without interfering with the display and behavior of the existing page.”[1] The web application uses AJAX to send requests to the web server and receive responses asynchronously without disturbing what the user sees. The following snippet shows how an AJAX request is made to get data from the web server:

    new Ajax.Request("getData.php",
    {
        method: 'post',
        postBody: 'queryString=' + serverNames,
        onComplete: callshowResponse
    });

The following snippet shows how the AJAX response is received from the web server:

    function showResponseCPUStats(req)
    {
        updategraph("CPU");
        contentCPU = req.responseText + "<br>";
        $('divCPUStats').innerHTML = contentCPU;
    }

3.1.6 RGraph

RGraph is an external library for generating graphs. I use this library to generate line graphs for CPU, network, and overall performance statistics. The library uses the HTML canvas element to draw graphs on the fly. “<canvas> is an HTML element which can be used to draw graphics using scripting (usually JavaScript)”[2]. This element is a very flexible tool for drawing graphics.
The following code snippet shows how to create a <canvas> element:

    <canvas id="cvsCPU" width="500" height="300">[No canvas support]</canvas>

The RGraph library takes the id of the canvas element to generate a graph. The library requires data points to plot, and these come from the AJAX response. As soon as the web server receives performance statistics from the managing server, the AJAX engine requests the most recent data from the database and forms the response for the tabular view and the graph view. RGraph then uses the data from the AJAX response to draw the graph on the canvas. The following snippet shows part of this process:

    lineSys = new RGraph.Line('cvsCPU', arrSys);
    ...
    lineSys.Draw();
    RGraph.Register(lineSys);
    RGraph.Redraw();
    lineUser = new RGraph.Line('cvsCPU', arrUser);

Every time data is received from the managing server, the graph is updated as well, producing a moving graph just like the Windows task manager. To display CPU utilization, the graph combines three series: system utilization, user utilization, and idle utilization. The following figure displays the graph view for CPU utilization:

Figure 9. Graph for Overall CPU Utilization

3.2 Managing Server (ms)

The managing server is implemented entirely in C/C++. Implementing this component requires a very good understanding of C sockets, multithreading (POSIX threads), and data structures. The managing server is developed on a UNIX platform. I used the gcc compiler, the gdb debugger, and a makefile.

The managing server acts as a bridge between the web server and the agents. It behaves as a socket server for the web server and as a socket client for the agents. When the managing server is first started, it reads the server.conf file in the msserver directory. This file contains the configuration for the socket connections.
The following snippet shows this configuration:

    [WEBSERVER]
    hostcomputer ;name
    9100 ;port
    [MANAGING SERVER]
    9200 ;port
    [AGENTS]
    9300 ;port

The [WEBSERVER] section gives the web server name (e.g. hostcomputer) and the port that the managing server uses to send data to the web server (e.g. 9100). The [MANAGING SERVER] section lists the port on which the managing server listens for requests from the web server. The [AGENTS] section lists the port on which the agents listen for requests. The managing server uses this port number when communicating with an agent.

The managing server is divided into two main interfaces: ws2msinterface and ms2agentinterface. The ws2msinterface is responsible for communication between the web server and the managing server. The ms2agentinterface is responsible for communication between the managing server and the agents.

The ws2msinterface contains the structures for the web server command and its arguments. The web server sends the “START DATA COLLECTION” command to the managing server. From that command, the managing server obtains the interval, the number of servers, and the names of the servers that need monitoring. The detailed structure of this command is described in the Web Server section of Chapter 2. The managing server starts an individual thread for each server named in the command. Each thread establishes communication with the corresponding agent.

To achieve this, the managing server uses multithreading in C/C++ through the POSIX thread (pthread) library. “It allows one process to spawn a new concurrent process flow”.[3] One process creates additional threads of execution to gain speed through parallel processing. Here, the managing server creates dataCollectionThread, which runs independently. The following snippet shows thread creation:

    pthread_create(dataCollectionThread, NULL, collectData, NULL);

All threads communicate with their agents independently, without interrupting other threads.
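The thread-per-server approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the workerArg type, the collectFromAll() function, and the MAX_SERVERS limit are hypothetical, and the collectData() worker here only sets a flag where the real worker would exchange messages with an agent over a socket.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_SERVERS 8   /* assumed upper bound, for illustration only */
#define NAME_LEN 16     /* server names occupy 16 bytes in the command */

/* Hypothetical per-thread argument: one entry per monitored server. */
typedef struct {
    char serverName[NAME_LEN];
    int done;           /* set to 1 by the worker when it has finished */
} workerArg;

/* Placeholder worker. The real worker would contact the agent for
 * arg->serverName and collect its performance stats. */
static void *collectData(void *p)
{
    workerArg *arg = (workerArg *)p;
    arg->done = 1;
    return NULL;
}

/* Starts one thread per server and waits for all of them to finish.
 * Returns the number of workers that completed, or -1 on bad input. */
int collectFromAll(workerArg *args, int numServers)
{
    pthread_t threads[MAX_SERVERS];
    int started = 0;
    int completed = 0;

    assert(args != NULL);
    if (numServers < 0 || numServers > MAX_SERVERS)
        return -1;

    for (int i = 0; i < numServers; i++) {
        if (pthread_create(&threads[i], NULL, collectData, &args[i]) != 0)
            break;      /* stop launching if a thread cannot be created */
        started++;
    }
    for (int i = 0; i < started; i++) {
        if (pthread_join(threads[i], NULL) == 0 && args[i].done)
            completed++;
    }
    return completed;
}
```

Passing each thread a pointer to its own workerArg keeps the per-server state separate, so the workers do not need locks to avoid interfering with one another.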
Received performance stats from agents are manipulated in their respective thread only. Managing server receives metrics for individual CPUs of each server so the calculation is necessary before metrics send back to web server. After that calculation, all the received data is captured and formatted properly in corresponding structures. Detailed description about those structures is explained later in this chapter. The last step in ws2msinterface is to send back the response containing performance metrics. Now metrics are bound in perfstats structure and sent over to network. 26 The following line shows how we send data through socket. sendto(afd, stats->ci, stats->numCPUs * sizeof(cpuInfo), 0, serv_addr, sizeof(sockaddr_in)) The ms2agentinterface first sends “START DATA COLLECTION” command to agent so that agent can start reading the performance metrics. When agent receives this command, it starts the process of getting the performance stats. Agent wraps all statistics data in perfStats structure and sends back to managing server. The process of communication should be known at both sides so both managing serve and agent share common data structures. They include structures to hold CPU, Memory (RAM), Network, Disk I/O, and Process. The interface also shares one important structure perfStats to send over the network. This structure wraps structures for all performance metrics. Following figures described set of data structure shared by ms2agentInterface: 27 // structure to hold CPU information. typedef struct { long long cpuNum; //-1 if sending overall CPU data float userUtil; //User mode utilization per CPU float sysUtil; //sys mode utilization per CPU float idleUtil; //idle utilization per CPU unsigned long long numInterrupts; //# of interrupts per CPU } cpuInfo; Figure 10. CPU Structure // structure to hold memory (RAM) information. 
    typedef struct {
        unsigned long long free;    //Free memory
        unsigned long long used;    //Used memory
        unsigned long long total;   //Total memory
        float util;                 //Memory utilization
        unsigned long long pgin;    //# of pages paged in
        unsigned long long pgout;   //# of pages paged out
        unsigned long long swpin;   //# of swapped pages swapped in
        unsigned long long swpout;  //# of swapped pages swapped out
    } memInfo;

Figure 11. Memory Structure

    // structure to hold disk I/O information.
    typedef struct {
        char paritionName[32];              //Partition name
        unsigned long long numBlocks;       //# of blocks/sectors per partition
        unsigned long long major;           //Major number
        unsigned long long minor;           //Minor number
        unsigned long long numRdOp;         //# of read operations
        unsigned long long numWrOp;         //# of write operations
        unsigned long long numBlocksRd;     //# of blocks read
        unsigned long long numBlocksWr;     //# of blocks written
    } diskInfo;

Figure 12. Disk I/O Structure

    // structure to hold network information.
    typedef struct {
        char networkName[256];          //"Overall" if sending overall network data
        unsigned long long bytesRec;    //bytes received per Adapter
        unsigned long long bytesSent;   //bytes sent per Adapter
        unsigned long long errors;      //number of errors per Adapter
        unsigned long long colls;       //number of collisions per Adapter
        float util;                     //network utilization per Adapter
    } networkInfo;

Figure 13. Network I/O Structure

    // structure to hold process information.
    typedef struct {
        unsigned long long pid;         // PID
        unsigned long long userID;      // UserId
        unsigned long long priority;    // Process Priority
        unsigned long long sizeTotal;   // Total program size
        unsigned long long sizeRes;     // Resident Set size
        unsigned long long pages;       // Shared pages
        char state;                     // Process Status
        unsigned long long cpuTime;     // Process CPU utilization
        float cpuUtil;                  // Used in calculating the process % cpu utilization on the managing server.
        float memUtil;                  // Process memory utilization
        unsigned long long wallTime;    // Running time (wall time)
        char cmd[256];                  // Process Command
    } procInfo;

Figure 14. Process Structure

    //structure to send over the network
    typedef struct {
        char platformID;                    //'L' for linux, 'W' for windows
        char agentName[16];                 //server name, i.e
        unsigned long long numCPUs;         //# of CPUs on the server
        cpuInfo* ci;                        //appropriate # of cpuInfo structs
        unsigned long long ctxt;            // total number of context switches.
        memInfo mi;                         //memInfo struct
        unsigned long long numLogDrives;    //# of disk drives
        diskInfo* di;                       //appropriate # of diskInfo structs
        unsigned long long numNetAdapters;  //# of network adapters
        networkInfo* ni;                    //appropriate # of networkInfo structs
        unsigned long long numProcs;        //# of processes on server
        procInfo* pi;                       //appropriate # of procInfo structs
    } perfStats;

Figure 15. Performance Stats Structure

3.3 Agent

3.3.1 Linux Agent

The Linux agent handles the collection of performance metrics from a Linux server. It is designed to send unprocessed, raw system metrics to the managing server. When the managing server sends the data collection command, i.e. CMD_START_DATA_COLLECTION, the agent starts reading the performance metrics from /proc. The agent is implemented in C, and its communication layer uses UDP as the network protocol. The code is divided into three parts. The first part listens for a command from the managing server, the second part executes the command, and the third part sends the performance metrics back to the managing server. The agent waits for the managing server to start the data collection process, which avoids unnecessary processing on the agent side. Figure 16 shows the flow chart for agent processing. The structure of CMD_START_DATA_COLLECTION is the same as the structure that the web server sends to the managing server.
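The three parts of the agent described above (listen, execute, reply) can be sketched as a single UDP request handler. This is a minimal sketch under stated assumptions: the command is sent here as a plain string and the reply is a fixed placeholder, whereas the project uses its own command structure and the perfStats data; openAgentSocket and serveOneCommand are hypothetical names. The sketch binds to the loopback address only so that it can be tried on one machine.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed wire format: the command travels as a plain string. The real
 * command structure is defined elsewhere in the project. */
#define CMD_START_DATA_COLLECTION "START DATA COLLECTION"

/* Creates a UDP socket bound to the loopback address on the given port
 * (0 lets the kernel pick a free port). Returns the descriptor or -1. */
static int openAgentSocket(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Handles one datagram. Returns 1 if the command was executed and a
 * reply was sent, 0 if the command was unknown, -1 on a socket error. */
static int serveOneCommand(int fd)
{
    char cmd[64];
    struct sockaddr_in peer;
    socklen_t peerLen = sizeof peer;

    assert(fd >= 0);

    /* Part 1: wait for a command from the managing server. */
    ssize_t n = recvfrom(fd, cmd, sizeof cmd - 1, 0,
                         (struct sockaddr *)&peer, &peerLen);
    if (n < 0)
        return -1;
    cmd[n] = '\0';
    if (strcmp(cmd, CMD_START_DATA_COLLECTION) != 0)
        return 0;

    /* Part 2: collect the metrics (placeholder for the /proc readers). */
    const char reply[] = "PERFSTATS";

    /* Part 3: send the result back to whoever issued the command. */
    if (sendto(fd, reply, sizeof reply, 0,
               (struct sockaddr *)&peer, peerLen) < 0)
        return -1;
    return 1;
}
```

Because UDP is connectionless, the reply address is taken from the received datagram, which is why the agent needs no prior knowledge of the managing server's endpoint.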
[Flow chart, in order: Create the socket; Bind address of managing server to socket; Get the command from managing server; decision "If CMD_START_DATA_COLLECTION" (Yes / No); on Yes: Read the performance metrics from /proc; Fill PerfStats data structure; Send PerfStats to managing server.]

Figure 16. Linux Agent Process Flow Chart

Once the agent receives the data collection command, it starts the most important and critical part of the whole system: reading the performance metrics from the Linux server. This requires a thorough understanding of the proc file system and efficient file handling in C/C++.

Proc file system:

“The proc file system is a pseudo-file system which is used as an interface to kernel data structures. It is commonly mounted at /proc.”[5]

Read Memory Stats:

/proc/meminfo

meminfo reports memory usage on the system. It is used for reading total memory and free memory. From these two metrics, used memory and total memory utilization are calculated.

    f = fopen("/proc/meminfo", "r");
    fscanf(f, "MemTotal: %d kB MemFree: %d kB", &mtotal, &mfree);
    meminfo->total = mtotal;
    meminfo->free = mfree;
    meminfo->used = meminfo->total - meminfo->free;
    meminfo->util = (float) meminfo->used/(1.0*meminfo->total)*100;

/proc/vmstat

This file displays virtual memory statistics. It is used for reading the number of pages paged in, paged out, swapped in, and swapped out.

    f = fopen("/proc/vmstat", "r");
    fscanf(f, "pgpgin %d pgpgout %d pswpin %d pswpout %d", &pgin, &pgout, &swpin, &swpout);

Here, the agent opens the vmstat file, reads the statistics, and puts them in pgin, pgout, swpin, and swpout.

Read CPU Statistics:

/proc/stat

This file contains information about kernel activity. It gives the time spent in user mode, in kernel mode, and idle, along with the total number of interrupts. The CPU times are reported per CPU. We also get the number of context switches across all CPUs. The code snippet given below shows the process of capturing CPU statistics.
    fp = fopen("/proc/stat", "r");
    fscanf(fp, "cpu%ld %ld %ld %ld %ld %ld %ld %ld", &cpuNum, &user, &nice, &sys, &idle, &val3, &val4, &val5);
    cinfo[i].cpuNum = cpuNum;
    cinfo[i].user = user + nice;
    cinfo[i].system = sys;
    cinfo[i].idle = idle;

This process runs for each CPU, and we store the statistics in the CPU_INFO data structure.

Read Disk Statistics:

/proc/partitions

This file contains a list of device numbers, their sizes, and their names. We read the major and minor numbers, the number of blocks, and the partition name for each partition from this file.

    f = fopen("/proc/partitions", "r");
    fscanf(f, " %d %d %d %s\n", &numMajor[i], &numMinor[i], &numBlocks[i], (char*)&partName[i]);

/proc/diskstats

This file contains disk I/O statistics for each disk device.

    f = fopen("/proc/diskstats", "r");
    fscanf(f, " %d %d %s %d %d %d %d %d %d %d %d %d %d %d",
           &numMajor[j], &numMinor[j], (char*)&partName[j],
           &ioRead[j], &blockRead[j], &ioWrite[j], &blockWrite[j],
           &temp1[j], &temp2[j], &temp3[j], &temp4[j], &temp5[j], &temp6[j], &temp7[j]);
    strncpy(di[i].paritionName, partName[i], 32);
    di[i].numBlocks = numBlocks[i];
    di[i].major = numMajor[i];
    di[i].minor = numMinor[i];
    di[i].numRdOp = ioRead[i];
    di[i].numWrOp = ioWrite[i];
    di[i].numBlocksRd = blockRead[i];
    di[i].numBlocksWr = blockWrite[i];

Here, we open /proc/partitions and /proc/diskstats to read disk statistics. This is done for each disk partition.

Read Network Statistics:

/proc/net/dev

This pseudo-file contains network device status information, including the total number of received and sent bytes and packets, and the number of errors and collisions. We need performance metrics for each network adapter.
    f = fopen("/proc/net/dev", "r");
    fscanf(f, " %[^:]: %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
           b[i].networkName,
           /* receive: bytes, packets, errs, drop, fifo, frame, compressed, multicast */
           &bytesRec, &a, &rxErrs, &a, &a, &a, &a, &a,
           /* transmit: bytes, packets, errs, drop, fifo, colls, carrier, compressed */
           &bytesSent, &a, &txErrs, &a, &a, &colls, &a, &a);
    b[i].bytesRec = bytesRec;
    b[i].errors = rxErrs + txErrs;   /* receive and transmit errors combined */
    b[i].bytesSent = bytesSent;
    b[i].colls = colls;

Here, network performance metrics are read for each network adapter, and then the totals are calculated as shown below:

    total_bytesrec = total_bytesrec + b[i].bytesRec;
    total_bytesent = total_bytesent + b[i].bytesSent;
    total_errs = total_errs + b[i].errors;
    total_colls = total_colls + b[i].colls;

Read Process Statistics:

This is the most complex procedure among all the performance metrics. It requires step-by-step reading and calculation. We read the files under /proc one by one and store the information in the proc_info data structure.

/proc/[pid]

“There is a numerical subdirectory for each running process; the subdirectory is named by the process ID. Each such subdirectory contains the many pseudo-files and directories.”[6]

Process statistics are found under the /proc file system. These files give the process ID, process priority, shared pages, resident size, process memory utilization, CPU utilization, and process name for each process in the system.
    sprintf(fname, "/proc/%s/stat", ent->d_name);
    fscanf(f, "%d %s %c %d %d %d %d %d %u %lu %lu %lu %lu %lu %lu %lu %lu %ld %ld %ld %ld %ld",
           &nodePtrPid, (char*)&(nodePtr->cmd), &(nodePtr->state),
           &ppid, &pgrp, &session, &tty_nr, &tpgid, &flags,
           &minflt, &cminflt, &majflt, &cmajflt,
           &utime, &stime, &cutime, &cstime,
           &priority, &nice, &num_thr, &itrealv, &startTime);
    // Calculate CPU utilization here
    // nodePtr->cpuUtil = (utime + stime)/100.0;
    nodePtr->cpuTime = (utime + stime);
    // Assumption startTime/100 = wall time in secs
    // getWallTime(&(nodePtr->wt), startTime);
    nodePtr->wallTime = ontime - startTime/100;

    sprintf(fname, "/proc/%s/statm", ent->d_name);
    fscanf(f, "%d %d %d", &nodePtrSizeTotal, &nodePtrSizeRes, &nodePtrPages);
    // Calculate memory utilization
    nodePtr->memUtil = (nodePtr->sizeRes/(1.0*nodePtr->sizeTotal))*100;

After filling the perfStats data structure, the agent sends it to the managing server. The procedure for sending data to the managing server is the same one used to send performance stats to the web server. The flag sendingWebserver is set to false when data is being sent to the managing server. The procedure is declared as follows:

    int send_perfstats(int afd, const sockaddr* serv_addr, perfStats* stats, int sendingWebserver);

This procedure is defined in perfstatsnetwork.c. After sending the data to the managing server, the agent waits for another command from the managing server. The heart of this whole performance monitoring system is reading /proc to get the most recent performance statistics of any Linux server.

3.3.2 Windows Agent

The Windows agent handles the collection of performance metrics from Windows machines. When the agent receives the “START_DATA_COLLECTION” command, it starts fetching the performance metrics. To achieve this, it uses PDH and Windows Management Instrumentation (WMI) functions.
To collect performance data using PDH functions, the agent performs the following steps [7]:

- Create a query: Before using PDH functions, we create a new query that collects performance data from a real-time source by calling the PdhOpenQuery function. The function returns a handle to the query that is used by all subsequent PDH function calls.
- Add counters to the query: After creating the query, the PdhAddCounter function is called once for each counter to add it to the query.
- Collect the performance data: The PdhCollectQueryData function collects the current raw data value for all counters. The PdhGetFormattedCounterValue function computes a displayable value for the specified counter.
- Display the performance data: The PDH_FMT_COUNTERVALUE structure returns the counter value as a LONG, double, LONGLONG, LPCSTR, or LPCWSTR through its longValue, doubleValue, largeValue, AnsiStringValue, and WideStringValue members respectively, ready for printing.
- Close the query: The PdhCloseQuery function closes the query and releases all allocated system resources. PdhCloseQuery closes all counter handles associated with the query.

Windows Management Instrumentation (WMI)

“Windows Management Instrumentation (WMI) is the infrastructure for management data and operations on Windows-based operating systems”.[8] COM must be initialized before WMI can be accessed. To collect performance data using WMI, the following steps are performed:

- Initialize COM: Because WMI is based on COM technology, CoInitializeEx is called before WMI is accessed. General COM security levels are set using CoInitializeSecurity.
- Create a connection to a WMI namespace: WMI runs in a different process than the application, so a connection must be created between the application and WMI. The initial locator to WMI is obtained using CoCreateInstance. The connection to WMI is established through a call to the IWbemLocator::ConnectServer method.
- Set the security levels on the WMI connection: To use the WMI connection, the impersonation and authentication levels must be set. The security levels on the IWbemServices proxy are set with a call to CoSetProxyBlanket.
- Get the data from the query: WMI exposes a variety of COM interfaces to access and manipulate data.
- Clean up and shut down the application: After the queries to WMI are completed, all COM pointers must be released.

The Windows agent is written in C/C++, with scripting done in VB. The communication layer uses UDP as the network protocol. The code consists of three parts. The first part listens for a command from the managing server, the second part executes the command, and the third part sends the relevant data back to the managing server. The first part consists of a simple while loop in the agent.cpp file, which listens for a command from the managing server. The commands are specified in the ws2msinterface.h file. For the second part, in which the agent executes the given command, an outline is provided below to explain how performance data collection is implemented, since that is the most complicated part. When the command to collect performance data is received, the agent executes the copy stats functions.

COUNTER PATHS USED FOR READING PERFORMANCE DATA

PROCESSOR COUNTER PATHS:

\\Processor(_Total)\\% Processor Time

% Processor Time is the percentage of elapsed time that the processor spends executing a non-idle thread. It is calculated by measuring the percentage of time that the processor spends executing the idle thread and then subtracting that value from 100%. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval.
It should be noted that the accounting calculation of whether the processor is idle is performed at an internal sampling interval of the system clock (10 ms). Because today's processors are very fast, % Processor Time can underestimate processor utilization: the processor may spend a lot of time servicing threads between system clock samples. Workload-based timer applications are one example of applications that are more likely to be measured inaccurately, as timers are signaled just after the sample is taken.

\\Processor(_Total)\\% Privileged Time

% Privileged Time is the percentage of elapsed time that the process threads spent executing code in privileged mode. When a Windows system service is called, the service will often run in privileged mode to gain access to system-private data. Threads executing in user mode cannot access this data. Calls to the system can be explicit or implicit, such as page faults or interrupts. Unlike some early operating systems, Windows uses process boundaries for subsystem protection in addition to the traditional protection of user and privileged modes. Some of the work done by Windows on behalf of the application might appear in other subsystem processes in addition to the privileged time in the process.

\\Processor(_Total)\\% User Time

% User Time is the percentage of elapsed time the processor spends in user mode. User mode is a restricted processing mode designed for applications, environment subsystems, and integral subsystems. Privileged mode is designed for operating system components and allows direct access to hardware and all memory. The operating system switches application threads to privileged mode to access operating system services. This counter displays the average busy time as a percentage of the sample time.
\\Processor(_Total)\\Interrupts/sec

Interrupts/sec is the average rate, in incidents per second, at which the processor receives and services hardware interrupts. It does not include deferred procedure calls (DPCs), which are counted separately. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended. The system clock typically interrupts the processor every 10 milliseconds, creating a background of interrupt activity. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

MEMORY COUNTER PATHS:

\\Memory\\Cache Bytes

Cache Bytes is the sum of the Memory\\System Cache Resident Bytes, Memory\\System Driver Resident Bytes, Memory\\System Code Resident Bytes, and Memory\\Pool Paged Resident Bytes counters. This counter displays the last observed value only; it is not an average.

\\Memory\\Available Bytes

Available Bytes is the amount of physical memory, in bytes, available to processes running on the computer. It is calculated by adding the amount of space on the Zeroed, Free, and Standby memory lists. Free memory is ready for use; Zeroed memory consists of pages of memory filled with zeros to prevent subsequent processes from seeing data used by a previous process; Standby memory is memory that has been removed from a process' working set (its physical memory) en route to disk, but is still available to be recalled. This counter displays the last observed value only; it is not an average.

DISK I/O COUNTER PATHS:

\\LogicalDisk(Partition name)\\Disk Writes/sec

Shows the rate at which write operations are performed on the disk.
\\LogicalDisk(Partition name)\\Disk Reads/sec

Shows the rate at which read operations are performed on the disk.

\\LogicalDisk(Partition name)\\Disk Write Bytes/sec

Shows the rate at which bytes are transferred to the disk during write operations.

\\LogicalDisk(Partition name)\\Disk Read Bytes/sec

Shows the rate at which bytes are transferred from the disk during read operations.

GetDiskFreeSpace(), a Win32 API, was used to determine some of the low-level information about a disk. Information such as disk space, disk space used, free disk space, root path name, sectors per cluster, and bytes per sector can be determined using this API call. Its syntax is:

    BOOL WINAPI GetDiskFreeSpace(
        __in  LPCTSTR lpRootPathName,
        __out LPDWORD lpSectorsPerCluster,
        __out LPDWORD lpBytesPerSector,
        __out LPDWORD lpNumberOfFreeClusters,
        __out LPDWORD lpTotalNumberOfClusters
    );

NETWORK I/O COUNTER PATHS:

\\Network Interface(pAdapter->Description)\\Packets Received/sec

Packets Received/sec is the rate at which packets are received on the network interface.

\\Network Interface(pAdapter->Description)\\Packets Sent/sec

Packets Sent/sec is the rate at which packets are sent on the network interface.

\\Network Interface(pAdapter->Description)\\Packets Received Discarded

Packets Received Discarded is the number of inbound packets that were chosen to be discarded even though no errors had been detected to prevent their delivery to a higher-layer protocol. One possible reason for discarding packets could be to free up buffer space.

\\Network Interface(pAdapter->Description)\\Packets Received Errors

Packets Received Errors is the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.

\\Network Interface(pAdapter->Description)\\Packets Outbound Discarded

Packets Outbound Discarded is the number of outbound packets that were chosen to be discarded even though no errors had been detected to prevent transmission.
One possible reason for discarding packets could be to free up buffer space.

\\Network Interface(pAdapter->Description)\\Packets Outbound Errors

Packets Outbound Errors is the number of outbound packets that could not be transmitted because of errors.

\\Network Interface(pAdapter->Description)\\Current Bandwidth

Current Bandwidth is an estimate of the current bandwidth of the network interface in bits per second (bps). For interfaces that do not vary in bandwidth, or for those where no accurate estimation can be made, this value is the nominal bandwidth.

\\Network Interface(pAdapter->Description)\\Bytes Total/sec

Bytes Total/sec is the rate at which bytes are sent and received over each network adapter, including framing characters. Network Interface\\Bytes Total/sec is the sum of Network Interface\\Bytes Received/sec and Network Interface\\Bytes Sent/sec.

    GetAdaptersInfo(pAdapterInfo, &ulOutBufLen)

This function returns information about all the network adapters. pAdapter->Description gives the name of the adapter, which can be used in the counter path to get further information.

PROCESS COUNTER PATHS:

\\System(_Total)\\Processes

This shows the number of processes in the computer at the time of data collection. This is an instantaneous count, not an average over the time interval. Each process represents a program that is running.

\\Process(*)\\Priority Base

This shows the current base priority of this process. Threads within a process can raise and lower their own base priority relative to the process's base priority.

\\Process(*)\\Elapsed Time

This shows the time, in seconds, that this process has been running.

\\Process(*)\\% Processor Time

This shows the percentage of time that the processor spent executing a non-idle thread. It is calculated by measuring the duration that the idle thread is active during the sample interval, and subtracting that time from 100%.
(Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval. Code executed to handle some hardware interrupts and trap conditions is included in this count.

Chapter 4
CONFIGURATION

4.1 Configuration Specs

Web application configuration:

The web application is already deployed under the project account named nms.

    Username: nms
    Password: ********

To change the password, go to ecs.csus.edu, log in to the ECS portal with the above credentials, and select the change password link from the menus listed under My Account (the link is shown in the right-side menu list).

Database configuration:

The database is located on the athena server under the user account nms. The name of the database is also nms. To connect to the MySQL server:

    mysql -h athena -u nms -p

This will prompt for the password. The example below shows the database access procedure in brief:

    [nms@athena ~]> mysql -h athena -u nms -p
    Enter password:
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 1605
    Server version: 5.1.40-community MySQL Community Server (GPL)
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    mysql> use nms
    Database changed
    mysql> exit
    Bye
    [nms@athena ~]>

Socket configuration:

One configuration file is maintained on the managing server side. This file contains the web server address and the ports for the web server, the managing server, and the agents. If the web server is later moved from athena to some other server, this config file must be changed. Care should also be taken that the ports listed are not already assigned to some other application.

    [WEBSERVER]
    130.86.67.252 ;name
    9100 ;port
    [MANAGING SERVER]
    9200 ;port
    [AGENTS]
    9300 ;port

Other than this configuration, no change is needed for the components of the system to communicate successfully.
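A reader for a configuration file of this shape could look like the sketch below. It is not the project's parser: nmsConfig and parseConfig are hypothetical names, and the sketch assumes that each section header and each "value ;label" pair sits on its own line. It parses from a string so that it can be tried without a file; in practice the file's contents would be read into a buffer first.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the values in the configuration file. */
typedef struct {
    char webServerName[64];
    int  webServerPort;
    int  managingServerPort;
    int  agentPort;
} nmsConfig;

/* Parses configuration text laid out as "[SECTION]" lines followed by
 * "value ;label" lines. Returns 0 on success, -1 if a value is missing. */
static int parseConfig(const char *text, nmsConfig *cfg)
{
    char section[32] = "";
    char line[128];
    const char *p = text;

    assert(text != NULL && cfg != NULL);
    memset(cfg, 0, sizeof *cfg);

    while (*p != '\0') {
        /* Copy one line (without its newline) into the line buffer. */
        size_t len = strcspn(p, "\n");
        size_t n = len < sizeof line - 1 ? len : sizeof line - 1;
        memcpy(line, p, n);
        line[n] = '\0';
        p += len;
        if (*p == '\n')
            p++;

        char value[64], label[16];
        if (sscanf(line, " [%31[^]]]", section) == 1)
            continue;                   /* section header such as [AGENTS] */
        if (sscanf(line, " %63s ;%15s", value, label) != 2)
            continue;                   /* blank or unrecognised line */

        if (strcmp(section, "WEBSERVER") == 0 && strcmp(label, "name") == 0)
            snprintf(cfg->webServerName, sizeof cfg->webServerName, "%s", value);
        else if (strcmp(section, "WEBSERVER") == 0 && strcmp(label, "port") == 0)
            cfg->webServerPort = atoi(value);
        else if (strcmp(section, "MANAGING SERVER") == 0 && strcmp(label, "port") == 0)
            cfg->managingServerPort = atoi(value);
        else if (strcmp(section, "AGENTS") == 0 && strcmp(label, "port") == 0)
            cfg->agentPort = atoi(value);
    }

    /* All four values must be present for the configuration to be usable. */
    return (cfg->webServerName[0] != '\0' && cfg->webServerPort > 0 &&
            cfg->managingServerPort > 0 && cfg->agentPort > 0) ? 0 : -1;
}
```

Rejecting an incomplete file at load time gives an early, clear failure if a section is lost while the file is edited, for example when the web server is moved to another host.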
Agent configuration:

The agent learns the managing server's socket endpoint only when it receives data from the managing server, so no configuration setup is needed on the agent; the port on which the agent listens for requests is fixed by mutual agreement between the managing server and the agent.

4.2 Installation

Once the web application is deployed on the web server, no additional installation is needed. The index.php page is already built, so there is no need to write an index.html file. After deploying the application to the web server, make sure the database configuration matches the location of the database. If the database server, username, or password is changed, update the config.php file to restore database connectivity.

Copy the managing server code from the CD provided. If the managing server process is not running as a cron job or background process, follow the steps below to run it manually:

- Open the terminal and go to the msserver directory.
- Execute “make server” to compile the project.
- Execute “./server” to run the managing server.

To manually run the Linux agent process, follow the steps described below:

- Open the terminal and go to the msserver directory.
- Execute “make server” to compile the project.
- Execute “./agent” to run the agent.
- Follow these steps for each server that needs monitoring; otherwise the web interface will not show any updates for it.

To run the Windows agent, follow the steps described below:

- Copy the agent code to the Visual Studio projects folder on the target Windows server.
- Open the project in Visual Studio.
- Run it using “Ctrl+F5”. This starts the Windows agent on the Windows server.

Chapter 5
CONCLUSIONS

I developed this project to provide a way to monitor any server anytime, anywhere, with real-time data. The process of getting and displaying the data is transparent to the end user.
The user does not need background knowledge of how the web interface sends the request, how the performance metrics are calculated on the managing server side, or how the data is sent back from the managing server to the web server. After pressing the “Start Updates” button, the user simply sees updates arriving at every specified interval, and the graph moves just as it does in Windows Task Manager. The system also offers the flexibility to add more servers that need monitoring, and the time interval at which updates come from the servers is configurable. Overall, this project tries to accommodate the requirements for monitoring servers at an organizational level.

Chapter 6
FUTURE WORK

An important aspect of this project is that it stores the performance metrics in a database, which provides a way to analyze historic data and make important decisions, such as which server needs more bandwidth or more disk space, which process bogs down a server the most, and what the network and CPU utilization of a specific server is. Right now, however, historic data can be analyzed only by going through the whole database. To overcome this limitation, the web application can be extended to display the historic data in the web interface itself. Because the foundation for this enhancement is already built, adding database-related enhancements in the future should be quick.
BIBLIOGRAPHY

[1] HP Performance Manager software
    http://www8.hp.com/us/en/software/software-solution.html?compURI=tcm:245-937022
[2] AJAX Programming [Online]
    http://en.wikipedia.org/wiki/Ajax_(programming)
[3] Canvas Tutorial
    https://developer.mozilla.org/en/Canvas_tutorial
[4] POSIX thread (pthread) libraries
    http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
[5] proc - Linux man page
    http://linux.die.net/man/5/proc
[6] /proc/[pid] - Linux man page
    http://linux.die.net/man/5/proc
[7] Using the PDH functions to consume counter data
    http://msdn.microsoft.com/en-us/library/windows/desktop/aa373214(v=vs.85).aspx
[8] Windows Management Interface
    http://msdn.microsoft.com/en-us/library/windows/desktop/aa394582(v=vs.85).aspx