Troubleshooting Website Availability

WARNING
The clustered host machines VHOST1 and VHOST2 host the rest of the servers as virtual machines. Restarting a host without first shutting down its virtual machines or migrating them to another server is like pulling the plug on a physical server. Data loss and OS corruption may occur.

Check Web Server Functionality
Go to http://nursing.byu.edu/DBOnlineCheck.aspx
The page should display 10,000 “a” characters. Because this page doesn’t interact with the database, it reflects the performance of the web server alone.

If the page times out:
Try to recycle the worker process:
   o Remote into the web server.
   o Open IIS Manager (IIS 7).
   o Navigate to the site that isn’t responding and choose “Restart” from the right pane.
Check the Load Balancing Cluster:
   o Remote into VNURSE1 or VNURSE2.
   o Open “Network Load Balancing”.
   o Check that both servers appear green under the IP address of the website that is being load balanced.
   o If one of the servers is not green, right-click on it and choose “Host Status”. Resolve any error messages.
   o If the server is green, you might need to check IIS on the server that isn’t responding.
Reset IIS:
   o Open a command prompt.
   o Type iisreset

If the page reports an error:
Check the database exception log in the Ex_Exceptions table (see instructions below).
If no error has been logged, try to browse to the page from the server that actually hosts the site. By default, localhost has access to debug information that isn’t available to remote clients.

Checking Website Exceptions when the Website is Down
Open SSMS and connect to NURSE-DBCLUSTER (if this doesn’t work, check the section below about database availability).
Open Databases > www_ConMaster > Tables.
Right-click on Ex_Exceptions and choose “Select Top 100 Rows”.
To see the latest exceptions first, add a new line “ORDER BY ExceptionTime DESC” beneath the FROM clause and run the query again.
To read a whole exception, copy the contents of any grid cell and paste them into Notepad.
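The browser check described under “Check Web Server Functionality” can also be scripted. The following is a minimal sketch, not an official monitoring tool. It assumes the 10,000 “a” characters appear as one unbroken run in the response, and the function names and the 10-second timeout are hypothetical choices made for illustration.

```python
import urllib.error
import urllib.request

CHECK_URL = "http://nursing.byu.edu/DBOnlineCheck.aspx"  # URL given in this guide
EXPECTED_A_COUNT = 10_000  # the page should display 10,000 "a" characters


def body_looks_healthy(body: str, expected: int = EXPECTED_A_COUNT) -> bool:
    """Return True if the body contains an unbroken run of `expected` "a" characters.

    Assumption: the page renders the characters contiguously, with no markup
    or line breaks between them.
    """
    return "a" * expected in body


def check_site(url: str = CHECK_URL, timeout_seconds: float = 10.0) -> str:
    """Fetch the check page and classify the result as a short status string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status ("the page reports an error").
        return f"error: HTTP {exc.code}"
    except OSError as exc:
        # Covers timeouts and connection failures ("the page times out").
        return f"timeout or unreachable: {exc}"
    return "ok" if body_looks_healthy(body) else "unexpected content"
```

Calling `check_site()` returns "ok" when the expected run of characters is present. Any other result points to the “If the page times out” or “If the page reports an error” steps.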
Check Database Availability
Ping the database server (NURSE-DBCLUSTER).
   o If it doesn’t respond, follow the instructions below about restarting a clustered resource.
Open SSMS and connect to NURSE-DBCLUSTER.
   o If you get a timeout or “server not responding” error, follow the instructions below about querying the health of a clustered resource.
Check the connection pool.
   o Right-click on the server (in Object Explorer) and choose Reports > Standard Reports > Activity – All Sessions.
   o Expand the logins and see how many connections are in use. If the exception log reports “max connection pools” or there are more than 100 connections, follow the instructions about restarting a clustered resource.

Querying a Clustered SQL Resource
Remote into VHOST1 or VHOST2.
Click Start > Failover Cluster Manager.
Choose “Manage a Cluster” (in the management pane) and click OK.
Expand VNURSE-CLUSTER and click on “Services and applications”.
Look at the status of SQL Server.
   o If it is offline or failed, right-click on it and choose “Show the critical events for this application”.
   o Look at any errors that explain the failure and fix them.

Checking the SQL Server Agent Logs
Open SSMS and connect to NURSE-DBCLUSTER.
Expand SQL Server Agent > Error Logs.
Open the error log with the date/time closest to the suspected failure time.
Check the boxes next to the available logs to see their error entries.

Restarting a Clustered SQL Resource
Remote into VHOST1 or VHOST2.
Click Start > Failover Cluster Manager.
Choose “Manage a Cluster” (in the management pane) and click OK.
Expand VNURSE-CLUSTER > SQL Server.
In the central pane under “Other Resources”, right-click on SQL Server and choose “Take this resource offline”. You will get an error about disconnecting clients; choose to continue.
Once it is offline, right-click on SQL Server again and choose “Bring this resource online”.

Check Server Disk Space Levels
If there was a major problem, there could be gigabytes of error logs and memory dumps that have filled the hard drive of the web or database server.
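The disk check in this section can also be done from a script run locally on the server being checked. The following is a minimal sketch using Python’s standard `shutil.disk_usage`. The function names and the 10% warning threshold are assumptions made for illustration, not values from this guide.

```python
import shutil

LOW_SPACE_PERCENT = 10.0  # assumed warning threshold, not taken from this guide


def percent_free(free_bytes: int, total_bytes: int) -> float:
    """Return free space as a percentage of the total (0.0 for an empty volume)."""
    if total_bytes <= 0:
        return 0.0
    return free_bytes / total_bytes * 100


def free_space_report(path: str, warn_below: float = LOW_SPACE_PERCENT) -> dict:
    """Summarize free space on the volume that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_pct = percent_free(usage.free, usage.total)
    return {
        "path": path,
        "free_gb": round(usage.free / 1024 ** 3, 1),
        "free_percent": round(free_pct, 1),
        "low": free_pct < warn_below,  # True when the volume is nearly full
    }
```

On the host that owns the database, `free_space_report("Q:\\")` would report on the Q drive. On the web server, the drive letter to pass (for example `"C:\\"`) is an assumption that depends on how that server is laid out.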
Log on to the web server and open Computer to check the disk levels.
Determine which of the VHOSTs currently owns the database server (see the steps about querying a clustered resource).
Remote into the host that owns the database and open Computer to check the disk levels. The SQL Server shows up as the Q drive on the host that owns it.

Failover Cluster Manager Not Responding
Follow these steps if loading a particular cluster resource’s details says “taking longer than normal” AND a cancel button appears. Restarting a cluster node should be the last resort for fixing a problem. If you have to restart a cluster node, follow the appropriate instructions below.

Determine if the problem is localized to the node
Remote into the other cluster node and try to open the Failover Cluster Manager.
If the manager opens and the cluster is responsive:
   o Follow the instructions below for restarting the cluster service.
   o Move any applications that seem unresponsive on the other node to this one.
If the manager doesn’t respond:
   o Try accessing the nodes in the left pane of the cluster manager. Sometimes resource DLLs freeze the clustered applications while the nodes themselves are still accessible.
   o If the nodes are accessible, restart the cluster service on each node (see instructions below).

Look at the event logs for the cluster
Before deciding how bad the problem is, look through the event logs in Event Viewer for details. If the cluster has failed, there will be a large number of errors logged. Scroll down to where the first error appears and read the log entries there (including information and warning entries).

If the problem is with the whole cluster
Try restarting the cluster service on each node.
If that doesn’t work, prepare the nodes for a physical restart (follow the instructions below).
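The decision path above, for an unresponsive Failover Cluster Manager, can be summarized as a small lookup. This is only a sketch of the steps as written in this guide. The function name and parameters are hypothetical, and the final branch is an interpretation that combines the event log review with the whole-cluster steps.

```python
def next_step(other_node_manager_opens: bool, nodes_accessible: bool = False) -> str:
    """Suggest the next action when Failover Cluster Manager hangs on one node.

    other_node_manager_opens: the manager opens and responds on the other node.
    nodes_accessible: the nodes can still be opened in the left pane, even
        though the manager is otherwise unresponsive.
    """
    if other_node_manager_opens:
        # The problem appears to be local to the first node.
        return ("Restart the cluster service, then move any unresponsive "
                "applications from the other node to this one.")
    if nodes_accessible:
        # Resource DLLs may have frozen the applications; the nodes still respond.
        return "Restart the cluster service on each node."
    # Neither check passed: treat it as a problem with the whole cluster.
    return ("Review the cluster event logs, try restarting the cluster service "
            "on each node, and if that fails prepare the nodes for a physical "
            "restart.")
```

For example, `next_step(False, True)` returns the advice to restart the cluster service on each node.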
Preparing Cluster Nodes for a Physical Restart
Try to migrate applications to another node (this only works if the other host is online):
   o Open the Failover Cluster Manager.
   o Expand “Services and applications”.
   o Right-click on any applications hosted by the node, select “Move this service or application…” and select the other host.
   o For virtual machines, right-click and select “Live migrate the virtual machine to…” or “Quick migrate virtual machine…” and choose the other node.
If the Failover Cluster Manager on the node you are restarting is unresponsive, you may need to run the migration from the target node instead.

If you can’t migrate the virtual machines to another node
Open the Hyper-V Manager:
   o Check the status of the virtual machines.
   o Try connecting to the virtual machines and logging on.
   o Shut the virtual machines down to prevent data loss.
   o If the virtual machines are not responsive, close the Hyper-V Manager. The host’s OS will attempt to shut the virtual machines down, which may be better than just switching them off in the Hyper-V Manager, so don’t worry if they aren’t responding.
Any applications or services that failed to migrate will be offline until the server reboots. You can now restart the host machine.

Restarting the Cluster Service on a Node
IMPORTANT! Before restarting the cluster service, try to migrate services, applications and virtual machines to a node that is still functional, following the instructions under “Preparing Cluster Nodes for a Physical Restart”. Prepare the host machine as if it were about to be physically rebooted; this keeps the applications from becoming unstable.
Open Failover Cluster Manager.
Expand VNURSE-CLUSTER > Nodes > VHOST#.
Right-click on the cluster node (in the left pane) and choose “More Actions” > “Stop Cluster Service”.
Once it is stopped, right-click on the node again and choose “More Actions” > “Start Cluster Service”.
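The stop and start steps above can also be issued from a script instead of the Failover Cluster Manager. The following is a hedged sketch that wraps the `Stop-ClusterNode` and `Start-ClusterNode` cmdlets from the FailoverClusters PowerShell module. It assumes that module is installed on the machine running the script, the helper names are hypothetical, and it defaults to a dry run that only prints the commands. Migrate services and virtual machines first, as described above.

```python
import subprocess

KNOWN_NODES = ("VHOST1", "VHOST2")  # the cluster nodes named in this guide


def build_restart_commands(node: str) -> list:
    """Build the stop and start commands for one node (nothing is executed here)."""
    if node not in KNOWN_NODES:
        raise ValueError(f"unknown cluster node: {node!r}")

    def powershell(cmdlet: str) -> list:
        # Assumption: the FailoverClusters module is available on this machine.
        script = f"Import-Module FailoverClusters; {cmdlet} -Name {node}"
        return ["powershell", "-NoProfile", "-Command", script]

    return [powershell("Stop-ClusterNode"), powershell("Start-ClusterNode")]


def restart_cluster_service(node: str, dry_run: bool = True) -> None:
    """Stop, then start, the cluster service on `node`.

    With dry_run=True (the default) the commands are only printed.
    """
    for command in build_restart_commands(node):
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True stops the sequence if a command fails.
            subprocess.run(command, check=True)
```

Running `restart_cluster_service("VHOST1")` prints the two commands without touching the cluster. Passing `dry_run=False` executes them and should only be done once the node has been prepared.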