SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment

Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu
Cluster and Grid Computing Lab
Services Computing Technology and System Lab
Huazhong University of Science and Technology

Bing Bing Zhou
Centre for Distributed and High Performance Computing
School of Information Technologies
University of Sydney


Introduction

- Many applications need high availability
- But there are still numerous security vulnerabilities
- Fixing all bugs in testing is impossible
- Server downtime is very costly (1 hr = $84,000~$108,000)
- Virtualization technology brings new challenges
  - There are more application instances on a single machine
- How to guarantee high availability?


Current Approaches & Limitations

- Rx
  - Change the execution environment
- Micro-reboot
  - Software components are fail-stop and individually recoverable
- Failure-oblivious computing
  - Manufacture values for out-of-bounds reads
  - Discard out-of-bounds writes
- STEM
  - Emulate the function, and potentially others within a larger scope, to return error values

Limitations
- Deterministic bugs are still there
- Require program redesign
- A narrow suitability for only a small number of applications or memory bugs
- ……

ASSURE better addresses these problems [ASPLOS’09]. SHelp can be considered an extension of ASSURE to a virtualized computing environment.


ASSURE Overview (ASPLOS’09)

- Bypass the “faulty” functions
  - Rescue points: locations in the existing application code used to handle programmer-anticipated failures
  - Error virtualization: force a heuristic-based error return in a function
- Quick recovery for future faults
  - Take a checkpoint once the appropriate rescue point is called

[Figure: walking the stack turns the execution graph (input, foo(), bar(), bad()) into the rescue graph (input, foo(), bar(), other()).]

    int bad(char* buf)
    {
        char rbuf[10];
        int i = 0;

        if(buf == NULL)
            return -1;          /* the programmer-anticipated error return */
        while(i < strlen(buf)) {
            rbuf[i++] = *buf++; /* no bound check: overflows rbuf for long input */
        }
        return 0;
    }


ASSURE Limitations (ASPLOS’09)

- A potential problem arises when the appropriate rescue point is in the main procedure of an application
  - In the Light-HTTPd example, only candidate rescue point B (main) can survive the faults
- Two cases
  - High overhead from frequent checkpointing
  - No rescue point is appropriate

main.c (candidate rescue point B):

    int main()
    ...
        if (!fork()) { /* this is the child process */
            while(1)
            {
                ...
                if(serveconnection(new_fd)==-1) break;
                ...

protocol.c (candidate rescue point A):

    int serveconnection(int sockfd)
    ...
        char tempdata[8192], *ptr, *ptr2, *host_ptr1, *host_ptr2;
        char filename[255];
        ...
        while(!strstr(tempdata, "\r\n\r\n") && !strstr(tempdata, "\n\n"))
        {
            if((numbytes=recv(sockfd, tempdata+numbytes, 4096-numbytes, 0))==-1)
                return -1;
        }
        for(loop=0; loop<4096 && tempdata[loop]!='\n' && tempdata[loop]!='\r'; loop++)
            tempstring[loop] = tempdata[loop];
        ...
        ptr = strtok(tempstring, " ");
        ...
        Log("Connection from %s, request = \"GET %s\"", inet_ntoa(sa.sin_addr), ptr);
        ...
        strcat(filename, ptr);   /* 4. Copy: Buffer Overflow A */
        ...

util.c:

    void Log(char *format, ...)
    ...
        char temp[200], temp2[200], logfilename[255];   /* 4) Create */
        ...
        va_start(ap, format);
        // format it all into temp
        vsprintf(temp, format, ap);   /* 5) Copy: Buffer Overflow B */
        ...

Slide annotations:
- Buffer Overflow A: memory region "filename", size 255 bytes. Steps: 1. Define, 2. Create, 3. Assignment, 4. Copy.
- Buffer Overflow B: memory region "temp", size 200 bytes. Steps: 1) Define, 2) Assignment, 3) Use, 4) Create, 5) Copy.


SHelp Main Idea

- “Weighted” rescue points
  - Assign weight values to rescue points
  - When an appropriate rescue point is chosen, its associated weight value is incremented
  - Once a fault is detected, the rescue point with the largest weight value is tested first
- Error handling information sharing in VMs
  - A two-level storage hierarchy for rescue point management
    - A global rescue point database in Dom0
    - A rescue point cache in each DomU
  - Weight values are updated between Dom0 and the DomUs to share error handling information
  - The accumulated weight values in Dom0 provide a useful guideline for diagnosing serious bugs


SHelp Architecture

- Sensors for detecting software faults
- Recovery and Test component for choosing the appropriate rescue point

[Figure: SHelp architecture. Components shown: Dom0 (Rescue Point Database, Management); each DomU (Application 1 ... Application n, Sensors, Rescue Point Cache, Checkpoint & Rollback, Recovery & Test, Control Unit); VMM (Xen); Hardware; Report to Programmers.]


SHelp Procedure

- Determine candidate rescue points
- Prioritize candidate rescue points and test them one by one
  - Test the rescue point with the largest weight value first
- Increment the corresponding weight values
- Quick recovery for the same stack smashing bug

[Figure: SHelp procedure, steps ① to ⑥. Labels shown: Program execution, Checkpoint, Fault detected, Rollback to previous checkpoints, Log Inputs, Candidate Rescue Points, Rescue Point Cache, Rescue Point Database (Dom0), Matched, Update, Select and Instrument, Survival Test, Bug-Rescue List, Update Weight Value, Appropriate Rescue Point, Report Module, Send Bug Report, Programmers.]


Implementation Details

- Updating the Rescue Point Cache
  - At the application level -> LRU
  - At the trace level of applications -> LFUM
    - Considers the globally maximum weight value and the local hit rate for trace i
    - [Formula: TraceFlag(i) in terms of k, w_max, and h(i)]
- Updating Weight Values of Rescue Points
  - Real-time updating for the RP database
  - Periodic updating for the RP cache
- Bug-Rescue List
  - The stack is corrupted in a stack smashing bug
  - Getting the trace requires replaying the program -> high overhead
  - Record the appropriate rescue point related to the fault
  - Choose it to probabilistically survive faults


Experimental Setup

- Implementation
  - Linux 2.6.18.8 kernel with BLCR and TCPCP checkpoint support
  - Xen 3.2.0 and Dyninst 6.0
- Platform
  - Intel Xeon E6550, 4 MB L2 cache, 1 GB memory
  - 100 Mbps Ethernet connection
- Applications

    Application       Version   Bug                Depth
    Apache            2.0.49    Off-by-one         2
    Apache            2.0.50    Heap overflow      2
    Apache            2.0.59    NULL dereference   3
    Light-HTTPd       0.1       Stack smashing     2
    Light-HTTPd-dbz   0.1       Divide-by-zero     2
    ATP-HTTPd         0.4b      Stack smashing     1
    Null-HTTPd        0.5.0     Heap overflow      1
    Null-HTTPd-df     0.5.0     Double free        3


Comparison between ASSURE and SHelp

- Web server application: Light-HTTPd
- Select the function serveconnection as the appropriate rescue point
- Throughput is only about 60 KB/s in ASSURE

[Figure: throughput (MB/s) versus elapsed time (sec, 0 to 20), one panel for ASSURE and one for SHelp.]


SHelp Recovery Performance

- First-1: new faults occur
- First-2: the same faults occur again in the local VM or in other VMs

[Figure: recovery time (s, 0 to 24) per web server application (Apache 2.0.49, Apache 2.0.50, Apache 2.0.59, Light-HTTPd, Light-HTTPd-dbz, ATP-HTTPd, Null-HTTPd, Null-HTTPd-df), with First-1 and First-2 bars broken down into Test, Instrument, and Analysis.]


Benefits of the Bug-Rescue List

- Subsequent: with the Bug-Rescue List

[Figure: recovery time (s, 0 to 24) for Light-HTTPd and ATP-HTTPd, comparing First-1, First-2, and Subsequent, broken down into Test, Instrument, and Analysis.]


Checkpoint/Rollback Overhead Analysis

- Lightweight checkpoint and rollback
- Modified BLCR with TCPCP tool support

[Figure: checkpoint and rollback time (ms, 0 to 60) per web server application (Apache 2.0.49, Apache 2.0.50, Apache 2.0.59, Light-HTTPd, ATP-HTTPd, Null-HTTPd).]


Conclusions and Future Work

“Weighted” rescue points and a two-level storage hierarchy for rescue point management make the system perform more effectively and efficiently.
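
The weighted rescue points and the two-level storage hierarchy summarized above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the SHelp implementation: the types `rescue_point` and `rp_store`, the functions `rp_find`, `rp_get`, `rp_cache_refresh`, and `rp_recover`, the capacity `RP_MAX`, and the stand-in survival test `rp_demo_survives` are all hypothetical names. The sketch assumes that candidates are ordered by the weights in the DomU cache, that the Dom0 database is updated immediately when a rescue point survives the test, and that the cache picks up the new weights only on its next periodic refresh.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RP_MAX 16  /* hypothetical capacity, chosen only for this sketch */

struct rescue_point {
    const char *func;  /* function that serves as the rescue point */
    int weight;        /* times it was chosen as the appropriate rescue point */
};

/* One store type serves as both the Dom0 database and a DomU cache. */
struct rp_store {
    struct rescue_point rp[RP_MAX];
    size_t count;
};

/* Return the entry for func, or NULL if the store does not hold it. */
struct rescue_point *rp_find(struct rp_store *s, const char *func)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->rp[i].func, func) == 0)
            return &s->rp[i];
    return NULL;
}

/* Return the entry for func, creating it with weight 0 if needed. */
struct rescue_point *rp_get(struct rp_store *s, const char *func)
{
    struct rescue_point *p = rp_find(s, func);
    if (p != NULL)
        return p;
    assert(s->count < RP_MAX);
    p = &s->rp[s->count++];
    p->func = func;
    p->weight = 0;
    return p;
}

/* Periodic step: copy the current Dom0 weights into a DomU cache. */
void rp_cache_refresh(struct rp_store *cache, const struct rp_store *db)
{
    for (size_t i = 0; i < db->count; i++)
        rp_get(cache, db->rp[i].func)->weight = db->rp[i].weight;
}

/* Stand-in survival test (assumption): only serveconnection survives. */
int rp_demo_survives(const char *func)
{
    return strcmp(func, "serveconnection") == 0;
}

/*
 * Test the candidates in order of decreasing cached weight. The first one
 * that passes the survival test gets its weight incremented in the Dom0
 * database and is returned; NULL means no candidate survived.
 */
const char *rp_recover(struct rp_store *cache, struct rp_store *db,
                       const char **cand, size_t n,
                       int (*survives)(const char *))
{
    const char *order[RP_MAX];

    assert(n <= RP_MAX);

    /* Stable insertion sort of the candidates, largest weight first. */
    for (size_t i = 0; i < n; i++) {
        int w = rp_get(cache, cand[i])->weight;
        size_t j = i;
        while (j > 0 && rp_get(cache, order[j - 1])->weight < w) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = cand[i];
    }

    for (size_t i = 0; i < n; i++) {
        if (survives(order[i])) {
            rp_get(db, order[i])->weight++;  /* real-time database update */
            return order[i];
        }
    }
    return NULL;
}
```

A caller would keep one `rp_store` for the Dom0 database and one per DomU, call `rp_recover` when a sensor reports a fault, and call `rp_cache_refresh` on a timer. Keeping the cache slightly stale trades some ordering accuracy for fewer Dom0 round trips, which is one plausible reading of the periodic cache update described in the implementation details.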

Future Work
- Integrate the COW mechanism in BLCR
- Evaluate the effectiveness of our system for more complex server and client applications


Thank you! Questions?