CSEP Internal Server Error Report

鍾珍慧
Distributed Systems Laboratory, Department of Computer Science and Information Engineering, National Central University
2012/10/24

Problem 1: The CSEP web page cannot find the lab manual

Cause: The path written in templates/ade/detail.html is wrong.

Solution
1. Modify templates/ade/detail.html
① Before:
    <div id="manual">
        <h2><a href="/static/{{ manual.man }}">實驗手冊</a></h2>
        <h3 class="mantitl">原理</h3>
② After:
    <div id="manual">
        <h2><a href="/static/man/{{ manual.man }}">實驗手冊</a></h2>
        <h3 class="mantitl">原理</h3>

Problem 2: Creating an instance from the CSEP web page fails with an error saying the function takes only 1 argument but was given 11

Cause: The constructor of the CSS Instance class is written incorrectly. It has to override Django's built-in __init__() and pass the arguments it receives on to it. Django builds model objects from database rows by passing the column values positionally; the css_instance table has 10 columns, which together with self would account for the 11 arguments in the error.

Solution
1. Modify css/models.py
① Before:
    def __init__(self):
        self.vmc = vmcontrol.vmcontrolFactory()
② After:
    def __init__(self, *args):
        super(Instance, self).__init__(*args)
        self.vmc = vmcontrol.vmcontrolFactory()

Problem 3: CSEP cannot obtain the VM's IP

Cause: Sub-project 2 accidentally changed the vm-start API. Its return value should be
    0 <VM's IP>
but it became
    0
As a result, after CSEP creates an experiment it cannot obtain the current IP address of the sub-project 2 VM, so no connection can be established between the VM and CSEP.

Solution
1. Modify ade/vmcontrollers.py: add the cpcs-vm-getip API provided by sub-project 2 to CSEP, and change CPCS_Controllers.bootup() so that, instead of returning the VM's IP after a successful boot, it returns a status value: 0 for boot success or 1 for boot failure.
① Before:
    def bootup(self, instance):
        args = 'vm-start %s' % (instance.curImage.uuid)
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)
        # Look each line in output, find the match pattern, and get it out and return it.
        result = self._result_check(output)
        if result[0] == "0":
            return result[1]
② After:
    def bootup(self, instance):
        args = 'vm-start %s' % (instance.curImage.uuid)
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)
        # Look each line in output, find the match pattern, and get it out and return it.
        result = self._result_check(output)
        if result[0] == "0":
            return int(result[0])
    ...
    def getip(self, uuid):
        """
        Get the ip of instance in CPCS platform
        Return IP if success, None if failed
        """
        # Note: despite its name, `uuid` is the Instance object, not a UUID string.
        args = 'cpcs-vm-getip %s' % uuid.curImage.uuid
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)
        result = self._result_check(output)
        if result[0] == "0":
            return result[1]
        else:
            cseplib.errmsg(__name__, sys._getframe(), result[1])
            return None

2. Modify css/models.py: in instance.bootup(), check the value returned by bootup, 0 (success) or 1 (failure), and on success call the newly added getip() to obtain the VM's IP.
① Before:
        ip = self.vmc.bootup(self)
        # Synchronize the Instance object. It might have been saved by vmc.
② After:
        boot_rlt = self.vmc.bootup(self)
        if boot_rlt == 0:
            ip = self.vmc.getip(self)
        else:
            ip = None
        # Synchronize the Instance object. It might have been saved by vmc.

Problem 4: Clicking Delete for an experiment on the platform web page does not remove the experiment from the page

[Screenshots: before clicking / after clicking]

Cause:
1. In css.models, deleting an instance does not check the argument that is passed in.
2. The VM records in the MySQL css_instance table do not match the VMs that actually exist on the CPCS platform. That is, the CSEP DB believes a VM with a certain UUID exists on the CPCS platform, but the platform has no such VM. The following errors can be seen in /var/log/apache/csep-error.log:
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete 36fb1482-fd94-582e-853c-a2297994f683
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete 36fb1482-fd94-582e-853c-a2297994f683
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Wed Aug 29 14:26:10 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Wed Aug 29 14:26:10 2012] [error] Traceback (most recent call last):
    [Wed Aug 29 14:26:10 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Wed Aug 29 14:26:10 2012] [error]     tobe = exp.id
    [Wed Aug 29 14:26:10 2012] [error] AttributeError: 'unicode' object has no attribute 'id'

    [Wed Aug 29 23:51:01 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-suspend de9fcab8-3f12-1008-12b0-6b6727d5d610
    [Wed Aug 29 23:51:01 2012] [error] Exception in thread Thread-39:
    [Wed Aug 29 23:51:01 2012] [error] Traceback (most recent call last):
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    [Wed Aug 29 23:51:01 2012] [error]     self.run()
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/threading.py", line 484, in run
    [Wed Aug 29 23:51:01 2012] [error]     self.__target(*self.__args, **self.__kwargs)
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 329, in suspend
    [Wed Aug 29 23:51:01 2012] [error]     cha_repo.allsuperuser('suspend', caseid=str(self.id))
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 280, in allsuperuser
    [Wed Aug 29 23:51:01 2012] [error]     profile.channel.broadcast(instruction, **kw_args)
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 267, in broadcast
    [Wed Aug 29 23:51:01 2012] [error]     conn.disconnect()
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/local/lib/python2.6/dist-packages/stomp/connect.py", line 403, in disconnect
    [Wed Aug 29 23:51:01 2012] [error]     self.__socket.shutdown(socket.SHUT_RDWR)
    [Wed Aug 29 23:51:01 2012] [error]   File "<string>", line 1, in shutdown
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/socket.py", line 165, in _dummy
    [Wed Aug 29 23:51:01 2012] [error]     raise error(EBADF, 'Bad file descriptor')
    [Wed Aug 29 23:51:01 2012] [error] error: [Errno 9] Bad file descriptor
    [Wed Aug 29 23:51:01 2012] [error]
    [Wed Aug 29 23:51:01 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Wed Aug 29 23:51:01 2012] [error] [ERROR] ade.vmcontrollers.suspend : The wrong VM uuid.
    [Wed Aug 29 23:51:01 2012] [error] None
3. Some VM UUID values stored in the MySQL css_image table are NULL. The following errors can be seen in /var/log/apache/csep-error.log:
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] Traceback (most recent call last):
    [Thu Aug 30 02:42:28 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Thu Aug 30 02:42:28 2012] [error]     tobe = exp.id
    [Thu Aug 30 02:42:28 2012] [error] AttributeError: 'unicode' object has no attribute 'id'

Solution
1. The CPCS vm-list API returns an empty result for a VM UUID that does not exist. Using this property, add a function named vmExist() to class CPCSController in ade/vmcontrollers.py that checks whether a given UUID exists on the CPCS platform.
① After:
    def vmExist(self, instance):
        if (instance.curImage.uuid is None) or (instance.curImage.uuid == ""):
            cseplib.errmsg(__name__, sys._getframe(), "VM with this uuid doesn't exist.")
            return False
        else:
            args = 'vm-list %s' % (instance.curImage.uuid)
            if debug:
                cseplib.debugmsg(__name__, sys._getframe(), args)
            ret, output = self._executor(args, sync=True)
            result = self._result_check(output)
            if ret == "0":
                return True
            else:
                return False

2. Modify suspend() in class Experiments in css/models.py
① Before:
    def suspend(self):
        """
        Suspend all the instances belong to the experiment
        and change the state.
        """
        for ins in self.instances.all():
            if ins.state == 'R':
                ins.suspend()
        self.state = 'S'
        self.save()
        cha_repo = ChannelRepository()
        cha_repo.allsuperuser('suspend', caseid=str(self.id))
② After:
    def suspend(self):
        """
        Suspend all the instances belong to the experiment
        and change the state.
        """
        for ins in self.instances.all():
            # Ask the platform for the status of this instance's current image.
            ins_info = ins.vmc.list(css.models.Image.objects.filter(id=ins.curImage_id).get().uuid)
            ins_info = ins_info.split()
            if ins_info[0] == '0' and ins_info[3] == 'running' and ins.state == 'R':
                ins.suspend()
        self.state = 'S'
        self.save()
        cha_repo = ChannelRepository()
        cha_repo.allsuperuser('suspend', caseid=str(self.id))

3. Modify the delete function of class Instance in css/models.py so that, for the UUID to be deleted, it checks (1) whether the UUID is empty and (2) whether the UUID exists on the CPCS platform, using ade.vmcontrollers.CPCSController().vmExist().
① Before:
    def delete(self, *args, **kwargs):
        """
        Overriding the original model delete method
        """
        if int(self.vmc.delete(str(self.curImage))) == 0:
            # Delete img in platform Success, remove record
            super(Instance, self).delete(*args, **kwargs)
            return 0
        else:
            # Delete failed, Keep the instance and return 1
            return 1
② After:
    def delete(self, *args, **kwargs):
        """
        Overriding the original model delete method
        """
        if self.curImage is None:
            cseplib.errmsg(__name__, sys._getframe(), "Instance%s's curImage is NULL." % (self.id))
            super(Instance, self).delete(*args, **kwargs)
            return 0
        if not self.vmc.vmExist(self):
            cseplib.errmsg(__name__, sys._getframe(), "Instance%s: VM uuid is NULL." % (self.id))
            super(Instance, self).delete(*args, **kwargs)
            return 0
        if int(self.vmc.delete(self)) == 0:
            # Delete img in platform Success, remove record
            super(Instance, self).delete(*args, **kwargs)
            return 0
        else:
            # Delete failed, Keep the instance and return 1
            return 1

Problem 5: The error message "tobe = exp.id" appears in /var/log/apache/csep-error.log

    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] Traceback (most recent call last):
    [Thu Aug 30 02:42:28 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Thu Aug 30 02:42:28 2012] [error]     tobe = exp.id
    [Thu Aug 30 02:42:28 2012] [error] AttributeError: 'unicode' object has no attribute 'id'

Cause: css.models.ExperimentRepository.delExp is written incorrectly. When looking up the experiment to delete, it should use its exp argument as the key and search the database for the record with that experiment ID. It should not read the id attribute of exp, because exp is itself an experiment ID. The original approach is what makes the log report that exp has no attribute named id.

Solution
1. Modify css.models.ExperimentRepository.delExp
① Before:
    def delExp(self, exp):
        """
        Delete an experiment, it will leave forever....
        Return zero for success, 1 for otherwise
        """
        try:
            tobe = exp.id
            tobe = exp
        except AttributeError:
            # Get AttributeError, it implys exp is exp_id
            tobe = Experiment.objects.get(pk=exp)
        try:
            instances = tobe.instances.all()
            for ins in instances:
                if ins.delete() != 0:
                    return 1
            channel = tobe.channel
            tobe_id = tobe.id
            tobe.delete()
            channel.delete()
            cha_repo = ChannelRepository()
            cha_repo.allsuperuser('delete', caseid=str(tobe_id))
            return 0
        except:
            import traceback
            traceback.print_exc()
② After:
    def delExp(self, exp):
        try:
            tobe = Experiment.objects.get(pk=exp)
        except:
            cseplib.errmsg(__name__, sys._getframe(),
                           "Can't find any experiments with exp_id %s" % (exp))
            return 1
        try:
            instances = tobe.instances.all()
            for ins in instances:
                image_id = ins.curImage_id
                if ins.delete() != 0:
                    return 1
            channel = tobe.channel
            tobe_id = tobe.id
            tobe.delete()
            channel.delete()
            cha_repo = ChannelRepository()
            cha_repo.allsuperuser('delete', caseid=str(tobe_id))
            return 0
        except:
            import traceback
            traceback.print_exc()

Problem 6: When an experiment is created on the CSEP web page, the page does not show the IP and port to connect to after the VM boots

[Screenshots: error case / normal case]

Cause:
1. The mechanism in ade/vagent_server.py that looks up the UUID belonging to a VM is flawed:
    def initvagt(self, *args, **kwargs):
        """
        Find a uninitialized image, and send the UUID to a vagent_client
        vagent_client will use that UUID in later usage, such as ctport
        """
        sock = kwargs['clientsock']
        client_ip = sock.getpeername()[0]
        max_try = 10
        for i in range(1, max_try):
            # Loop until find the instance by IP
            try:
                instance = Instance.objects.filter(ip=client_ip, initialized=False)[0]
                break
            except IndexError:
                if DEBUG:
                    cseplib.debugmsg(__name__, sys._getframe(),
                                     "Can't find the instance by ip %s" % (client_ip))
                if i == (max_try-1):
                    cseplib.debugmsg(__name__, sys._getframe(),
                                     "Failed to lookup instance by IP more")
                    return 1
                time.sleep(15)
        sock.send(str(instance.curImage.uuid))
        instance.initialized = True
        instance.save()
        return 0
With this design, the lookup takes the first uninitialized record that matches the client's IP. If the css_instance table holds data like the following, the duplicated IP records can make it impossible to determine the VM's UUID correctly:
    mysql> select * from css_instance order by ip;
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
    | id   | exp_id | srcImage_id | curImage_id | ip            | state | reverse_port | guest_os | last_report         | initialized |
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
    | 2148 | 1274   | 4           | 2119        | 140.115.3.216 | S     | NULL         | NULL     | 2012-07-04 11:41:32 | 0           |
    | 2135 | 1267   | 4           | 2106        | 140.115.3.216 | S     | NULL         | NULL     | 2012-07-04 05:05:00 | 0           |
    | 2095 | 1244   | 4           | 2066        | 140.115.3.216 | S     | NULL         | NULL     | 2012-06-15 14:16:56 | 0           |
    | 1506 | 829    | 4           | 1450        | 140.115.3.216 | S     | NULL         | NULL     | 2012-03-05 21:12:37 | 0           |
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
2. When the case template was made, the uuid.pkl file was included in it. The VM therefore assumes that uuid.pkl already holds its own uuid, while the uuid in that file is actually empty. (ade.vagent_client.init() decides whether to send a request to vagent_server asking for its own uuid by checking whether /etc/init.d/uuid.pkl on Linux, or C:\\uuid.pkl on Windows, exists.) When this happens, the following can be seen in /var/log/apache/csep-error.log:
    [Wed Aug 29 11:16:23 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:33 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:43 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:53 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:03 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:13 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:23 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:33 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:43 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
3. The data in the css_instance table is faulty.
① The curImage_id column is NULL:
    mysql> select * from css_instance where curImage_id IS NULL order by last_report;
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
    | id  | exp_id | srcImage_id | curImage_id | ip   | state | reverse_port | guest_os | last_report         | initialized |
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
    | 696 | 361    | 3           | NULL        | NULL | S     | NULL         | NULL     | 2011-12-15 03:19:34 | 0           |
    | 697 | 361    | 4           | NULL        | NULL | S     | NULL         | NULL     | 2011-12-15 03:19:55 | 0           |
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
② The ip column is NULL. When this happens, the following can be seen in /var/log/apache/csep-error.log:
    [Wed Aug 29 05:54:21 2012] [error] [DEBUG] csep.ade.vagent_server.initvagt : Can't find the instance by ip 140.115.14.160
4. The UUID of the experiment VM is recorded as empty in the css_image table, so the UUID that vagent_server returns to vagent_client may be empty. This can be observed in uuid.pkl inside the VM:
    Error case    Normal case
    S''           S'7e8412af-176c-4bb8-5c5e-5176529c3dc8'
    p0            p0
    .             .

Solution

Problem 7:

Cause:

Solution: Redesign the IP mechanism of the css_instance table
1. Modify resume() in class Instance in css/models.py so that, after the resume, it asks the CPCS platform for the VM's new IP and stores it in the css_instance table.
① Before:
    def resume(self):
        """
        Recovery from a suspend state
        Return 0 when success, 1 is failed.
        """
        if self.vmc.resume(self):
            self = Instance.objects.get(pk=self.id)
            self.state = 'R'
            self.last_report = datetime.datetime.now()
            self.save()
            return 0
        return 1
② After:
    def resume(self):
        """
        Recovery from a suspend state
        Return 0 when success, 1 is failed.
        """
        if self.vmc.resume(self):
            self = Instance.objects.get(pk=self.id)
            self.ip = self.vmc.getip(self)
            self.state = 'R'
            self.last_report = datetime.datetime.now()
            self.save()
            return 0
        return 1
③ Run the Django shell under /var/www/csep to work on the DB:
    $ ./manage.py shell

Problem 8:

Cause:

Solution

Conclusion
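The fixes for Problems 3 and 7 share one pattern: booting (or resuming) a VM and asking for its IP are now two separate platform calls, each answered with a "<status> <payload>" reply. The following is a minimal sketch of that flow, not the CSEP code itself. The names parse_reply, boot_and_get_ip, and fake_executor are hypothetical and exist only for this illustration; only the vm-start and cpcs-vm-getip commands and the "0 <VM's IP>" reply shape are taken from the report, and the exact reply format of cpcs-vm-getip is assumed to match it.

```python
def parse_reply(output):
    """Split a '<status> <payload>' reply into (status, payload).

    An empty reply is treated as a failure (status "1"); this is an
    assumption made for the sketch, not documented platform behaviour.
    """
    parts = output.strip().split(None, 1)
    if not parts:
        return "1", ""
    status = parts[0]
    payload = parts[1] if len(parts) > 1 else ""
    return status, payload


def boot_and_get_ip(executor, uuid):
    """Boot the VM, then ask for its IP in a separate call.

    Returns the IP string on success, or None if either step fails.
    """
    status, _ = parse_reply(executor("vm-start %s" % uuid))
    if status != "0":
        return None
    status, payload = parse_reply(executor("cpcs-vm-getip %s" % uuid))
    if status != "0" or not payload:
        return None
    return payload


def fake_executor(args):
    """Hypothetical stand-in for the real command runner, for illustration only."""
    replies = {
        "vm-start good-uuid": "0",
        "cpcs-vm-getip good-uuid": "0 140.115.3.216",
    }
    return replies.get(args, "1 The wrong VM uuid.")


if __name__ == "__main__":
    print(boot_and_get_ip(fake_executor, "good-uuid"))
    print(boot_and_get_ip(fake_executor, "bad-uuid"))
```

Keeping the status check separate from the payload means a reply that carries only "0" is still recognised as a successful boot, while a missing IP is reported as None instead of being stored in the css_instance table.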