CSEP Internal Server Error Report

鍾珍慧
Distributed Systems Laboratory, Department of Computer Science and Information Engineering, National Central University
2012/10/24
Abstract
Contents
Problem 1: The lab manual cannot be found on the CSEP web page

Cause: the path written in templates/ade/detail.html is wrong.
Solution
1. Edit templates/ade/detail.html
① Before:
    <div id="manual">
        <h2><a href="/static/{{ manual.man }}">實驗手冊</a></h2>
        <h3 class="mantitl">原理</h3>
② After:
    <div id="manual">
        <h2><a href="/static/man/{{ manual.man }}">實驗手冊</a></h2>
        <h3 class="mantitl">原理</h3>
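Problem 2: Creating an instance from the CSEP web page fails with an error saying the function takes only 1 argument but 11 were given

Cause: the constructor of Instance in css/models.py is written incorrectly. It overrides Django's built-in __init__() without accepting and forwarding the arguments that Django passes in.
Solution
1. Edit css/models.py
① Before:
    def __init__(self):
        # ...
        self.vmc = vmcontrol.vmcontrolFactory()
② After:
    def __init__(self, *args, **kwargs):
        # Forward positional and keyword arguments to Django's Model.__init__
        super(Instance, self).__init__(*args, **kwargs)
        self.vmc = vmcontrol.vmcontrolFactory()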
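The argument count in the error message is consistent with the css_instance table shown under Problem 6: the table has ten columns, and a model object built from a database row receives one positional value per column, which together with self makes 11. The following is a minimal sketch of the failure and the fix. It uses a plain Python stand-in instead of Django; RowModel, BrokenInstance and FixedInstance are hypothetical names, not CSEP code.

```python
class RowModel(object):
    """Hypothetical stand-in for a Django model base class."""
    columns = ("id", "exp_id", "srcImage_id", "curImage_id", "ip", "state",
               "reverse_port", "guest_os", "last_report", "initialized")

    def __init__(self, *args, **kwargs):
        # Positional values are matched to the columns in order.
        for name, value in zip(self.columns, args):
            setattr(self, name, value)
        for name, value in kwargs.items():
            setattr(self, name, value)


class BrokenInstance(RowModel):
    def __init__(self):  # accepts only self, like the code before the fix
        self.vmc = "vm controller"


class FixedInstance(RowModel):
    def __init__(self, *args, **kwargs):
        # Let the base class consume the row first, then add the extra attribute.
        super(FixedInstance, self).__init__(*args, **kwargs)
        self.vmc = "vm controller"


# One row of css_instance: ten column values.
row = (2148, 1274, 4, 2119, "140.115.3.216", "S", None, None,
       "2012-07-04 11:41:32", 0)

try:
    BrokenInstance(*row)
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)  # complains about the unexpected extra arguments

fixed = FixedInstance(*row)
print(broken_error)
print(fixed.ip, fixed.vmc)
```

Calling the base constructor before setting self.vmc keeps the extra attribute from interfering with how the row values are assigned.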
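Problem 3: CSEP cannot obtain the VM's IP

Cause: Subproject 2 (子二) accidentally changed the vm-start API. Its return value, which should be
    0 <VM's IP>
became
    0
As a result, after CSEP creates an experiment it cannot learn the current IP address of the Subproject 2 VM, so no connection can be established between the VM and CSEP.
Solution
1. Edit ade/vmcontrollers.py: add the cpcs-vm-getip API provided by Subproject 2 to CSEP, and change CPCS_Controllers.bootup() so that, instead of returning the VM's IP after a successful boot, it returns 0 for boot success or 1 for boot failure.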
① Before:
    def bootup(self, instance):
        args = 'vm-start %s' % (instance.curImage.uuid)
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)

        # Look at each line in output, find the matching pattern, and return it.
        result = self._result_check(output)
        if result[0] == "0":
            return result[1]
② After:
    def bootup(self, instance):
        args = 'vm-start %s' % (instance.curImage.uuid)
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)

        # Look at each line in output, find the matching pattern, and return it.
        result = self._result_check(output)
        if result[0] == "0":
            return int(result[0])

    def getip(self, uuid):
        """
        Get the ip of instance in CPCS platform
        Return IP if success, None if failed
        """
        # Note: despite its name, `uuid` is an Instance object.
        args = 'cpcs-vm-getip %s' % uuid.curImage.uuid
        if debug:
            cseplib.debugmsg(__name__, sys._getframe(), args)
        ret, output = self._executor(args, sync=True)
        result = self._result_check(output)
        if result[0] == "0":
            return result[1]
        else:
            cseplib.errmsg(__name__, sys._getframe(), result[1])
            return None
2. Edit css/models.py: in instance.bootup(), check the value returned by bootup, 0 (success) or 1 (failure), and on success call the newly added getip() to obtain the VM's IP.
① Before:
    ip = self.vmc.bootup(self)
    # Synchronize the Instance object. It might have been saved by vmc.
② After:
    boot_rlt = self.vmc.bootup(self)
    if boot_rlt == 0:
        ip = self.vmc.getip(self)
    else:
        ip = None
    # Synchronize the Instance object. It might have been saved by vmc.
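The two-step flow (boot first, then ask for the IP) can be sketched as follows. This is an illustration only: FakeCPCS is a hypothetical stand-in for the real ./cpcs_api command, and result_check is an assumed reading of what _result_check() does, namely splitting the output into a status field and the rest. Neither is taken from the CSEP source.

```python
def result_check(output):
    """Split '<status> <payload>' into (status, payload). Assumed format."""
    parts = output.strip().split(None, 1)
    status = parts[0] if parts else "1"
    payload = parts[1] if len(parts) > 1 else ""
    return status, payload


class FakeCPCS(object):
    """Hypothetical stand-in for the CPCS command-line API."""

    def __init__(self, ips):
        self.ips = ips  # uuid -> IP of the VMs known to the platform

    def run(self, args):
        command, uuid = args.split()
        if uuid not in self.ips:
            return "1 The wrong VM uuid."
        if command == "vm-start":
            return "0"  # changed behaviour: status only, no IP
        if command == "cpcs-vm-getip":
            return "0 %s" % self.ips[uuid]
        return "1 Unknown command."


def bootup(api, uuid):
    status, _ = result_check(api.run("vm-start %s" % uuid))
    return 0 if status == "0" else 1


def getip(api, uuid):
    status, payload = result_check(api.run("cpcs-vm-getip %s" % uuid))
    return payload if status == "0" else None


def boot_and_get_ip(api, uuid):
    # Ask for the IP only after the boot succeeded.
    if bootup(api, uuid) == 0:
        return getip(api, uuid)
    return None


api = FakeCPCS({"36fb1482": "140.115.3.216"})
print(boot_and_get_ip(api, "36fb1482"))
print(boot_and_get_ip(api, "missing"))
```

Keeping the boot status and the IP lookup in separate calls means a later change to the output of vm-start no longer affects how the IP is obtained.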
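Problem 4: An experiment cannot be deleted from the platform web page; clicking Delete has no effect
[Screenshots: before clicking / after clicking]
Cause:
1. In css.models, deleting an instance does not check the argument that is passed in.
2. The VM records in the MySQL css_instance table do not match the VMs that actually exist on the CPCS platform (that is, the CSEP DB believes a VM with a given UUID exists on the CPCS platform, but the platform has no such VM). Error messages like the following can be seen in /var/log/apache/csep-error.log: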
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete 36fb1482-fd94-582e-853c-a2297994f683
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete 36fb1482-fd94-582e-853c-a2297994f683
    [Wed Aug 29 14:26:10 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Wed Aug 29 14:26:10 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Wed Aug 29 14:26:10 2012] [error] Traceback (most recent call last):
    [Wed Aug 29 14:26:10 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Wed Aug 29 14:26:10 2012] [error]     tobe = exp.id
    [Wed Aug 29 14:26:10 2012] [error] AttributeError: 'unicode' object has no attribute 'id'
    ...
    [Wed Aug 29 23:51:01 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-suspend de9fcab8-3f12-1008-12b0-6b6727d5d610
    [Wed Aug 29 23:51:01 2012] [error] Exception in thread Thread-39:
    [Wed Aug 29 23:51:01 2012] [error] Traceback (most recent call last):
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    [Wed Aug 29 23:51:01 2012] [error]     self.run()
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/threading.py", line 484, in run
    [Wed Aug 29 23:51:01 2012] [error]     self.__target(*self.__args, **self.__kwargs)
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 329, in suspend
    [Wed Aug 29 23:51:01 2012] [error]     cha_repo.allsuperuser('suspend', caseid=str(self.id))
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 280, in allsuperuser
    [Wed Aug 29 23:51:01 2012] [error]     profile.channel.broadcast(instruction, **kw_args)
    [Wed Aug 29 23:51:01 2012] [error]   File "/var/www/csep/css/models.py", line 267, in broadcast
    [Wed Aug 29 23:51:01 2012] [error]     conn.disconnect()
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/local/lib/python2.6/dist-packages/stomp/connect.py", line 403, in disconnect
    [Wed Aug 29 23:51:01 2012] [error]     self.__socket.shutdown(socket.SHUT_RDWR)
    [Wed Aug 29 23:51:01 2012] [error]   File "<string>", line 1, in shutdown
    [Wed Aug 29 23:51:01 2012] [error]   File "/usr/lib/python2.6/socket.py", line 165, in _dummy
    [Wed Aug 29 23:51:01 2012] [error]     raise error(EBADF, 'Bad file descriptor')
    [Wed Aug 29 23:51:01 2012] [error] error: [Errno 9] Bad file descriptor
    [Wed Aug 29 23:51:01 2012] [error]
    [Wed Aug 29 23:51:01 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Wed Aug 29 23:51:01 2012] [error] [ERROR] ade.vmcontrollers.suspend : The wrong VM uuid.
    [Wed Aug 29 23:51:01 2012] [error] None
3. Some VM UUID values in the MySQL css_image table are NULL. Error messages like the following can be seen in /var/log/apache/csep-error.log:
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] Traceback (most recent call last):
    [Thu Aug 30 02:42:28 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Thu Aug 30 02:42:28 2012] [error]     tobe = exp.id
    [Thu Aug 30 02:42:28 2012] [error] AttributeError: 'unicode' object has no attribute 'id'

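Solution
1. The CPCS vm-list API returns an empty result for a VM UUID that does not exist. Using this property, add a function named vmExist() to class CPCSController in ade/vmcontrollers.py that checks whether a given UUID exists on the CPCS platform.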
① After:
    def vmExist(self, instance):
        # An empty or missing UUID can never exist on the platform.
        if (instance.curImage.uuid is None) or (instance.curImage.uuid == ""):
            cseplib.errmsg(__name__, sys._getframe(),
                           "VM with this uuid doesn't exist.")
            return False
        else:
            args = 'vm-list %s' % (instance.curImage.uuid)
            if debug:
                cseplib.debugmsg(__name__, sys._getframe(), args)
            ret, output = self._executor(args, sync=True)
            result = self._result_check(output)
            # Test the status field of the API output, as bootup() and getip() do.
            if result[0] == "0":
                return True
            else:
                return False
2. Edit the suspend() function of class Experiment in css/models.py
① Before:
    def suspend(self):
        """
        Suspend all the instances belong to the experiment
        and change the state.
        """
        for ins in self.instances.all():
            if ins.state == 'R':
                ins.suspend()
        self.state = 'S'
        self.save()
        cha_repo = ChannelRepository()
        cha_repo.allsuperuser('suspend', caseid=str(self.id))
② After:
    def suspend(self):
        """
        Suspend all the instances belong to the experiment
        and change the state.
        """
        for ins in self.instances.all():
            ins_info = ins.vmc.list(
                css.models.Image.objects.filter(id=self.curImage_id).get().uuid)
            ins_info = ins_info.split()
            # Suspend only VMs that the platform itself reports as running.
            if ins_info[0] == '0' and ins_info[3] == 'running' and ins.state == 'R':
                ins.suspend()
        self.state = 'S'
        self.save()
        cha_repo = ChannelRepository()
        cha_repo.allsuperuser('suspend', caseid=str(self.id))
3. Edit the delete function of class Instance in css/models.py so that, for the UUID to be deleted, it checks (1) whether the UUID is empty and (2) whether the UUID exists on the CPCS platform, using ade.vmcontrollers.CPCSController().vmExist().
① Before:
    def delete(self, *args, **kwargs):
        """
        Overriding the original model delete method
        """
        if int(self.vmc.delete(str(self.curImage))) == 0:
            # Delete img in platform Success, remove record
            super(Instance, self).delete(*args, **kwargs)
            return 0
        else:
            # Delete failed, Keep the instance and return 1
            return 1
② After:
    def delete(self, *args, **kwargs):
        """
        Overriding the original model delete method
        """
        if self.curImage is None:
            cseplib.errmsg(__name__, sys._getframe(),
                           "Instance%s's curImage is NULL." % (self.id))
            super(Instance, self).delete(*args, **kwargs)
            return 0
        if not self.vmc.vmExist(self):
            cseplib.errmsg(__name__, sys._getframe(),
                           "Instance%s: VM uuid is NULL." % (self.id))
            super(Instance, self).delete(*args, **kwargs)
            return 0
        if int(self.vmc.delete(self)) == 0:
            # Delete img in platform Success, remove record
            super(Instance, self).delete(*args, **kwargs)
            return 0
        else:
            # Delete failed, Keep the instance and return 1
            return 1
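The guard order used by the new delete() can be sketched without Django or CPCS as follows. FakePlatform, vm_exist, vm_delete and delete_instance are hypothetical names for this illustration; the real code goes through vmExist() and vmc.delete().

```python
class FakePlatform(object):
    """Hypothetical stand-in for the CPCS platform: a set of existing VM UUIDs."""

    def __init__(self, uuids):
        self.uuids = set(uuids)

    def vm_exist(self, uuid):
        # Empty or missing UUIDs can never exist on the platform.
        if uuid is None or uuid == "":
            return False
        return uuid in self.uuids

    def vm_delete(self, uuid):
        if uuid in self.uuids:
            self.uuids.remove(uuid)
            return 0
        return 1


def delete_instance(records, instance_id, platform):
    """Return 0 when the DB record was removed, 1 when it was kept."""
    uuid = records[instance_id]
    if not platform.vm_exist(uuid):
        # Stale record: nothing to delete on the platform, just drop the row.
        del records[instance_id]
        return 0
    if platform.vm_delete(uuid) == 0:
        del records[instance_id]
        return 0
    return 1


platform = FakePlatform(["de9fcab8"])
# Record 1 is valid, record 2 points at a VM the platform no longer has,
# record 3 has a NULL UUID.
records = {1: "de9fcab8", 2: "36fb1482", 3: None}
results = [delete_instance(records, i, platform) for i in (1, 2, 3)]
print(results, records)
```

Treating a record whose VM is already gone as a successful delete is what lets the web page clean up the stale rows described in causes 2 and 3.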
Problem 5: An error message for tobe = exp.id appears in /var/log/apache/csep-error.log
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers.delete : vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : ./cpcs_api vm-delete None
    [Thu Aug 30 02:42:28 2012] [error] [DEBUG] ade.vmcontrollers._executor : 1 The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] [ERROR] ade.vmcontrollers.delete : The wrong VM uuid.
    [Thu Aug 30 02:42:28 2012] [error] Traceback (most recent call last):
    [Thu Aug 30 02:42:28 2012] [error]   File "/var/www/csep/css/models.py", line 590, in delExp
    [Thu Aug 30 02:42:28 2012] [error]     tobe = exp.id
    [Thu Aug 30 02:42:28 2012] [error] AttributeError: 'unicode' object has no attribute 'id'


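Cause: css.models.ExperimentRepository.delExp is written incorrectly. When looking up the experiment to delete, it should use the argument exp passed to delExp as the key and search the database for the record with that experiment ID. It should not read an id attribute from exp, because exp is itself an experiment ID. The original approach is what produces the log message saying that exp has no attribute named id.
Solution
1. Edit css.models.ExperimentRepository.delExp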
① Before:
    def delExp(self, exp):
        """
        Delete an experiment, it will leave forever....
        Return zero for success, 1 for otherwise
        """
        try:
            tobe = exp.id
            tobe = exp
        except AttributeError:
            # Got AttributeError, it implies exp is exp_id
            tobe = Experiment.objects.get(pk=exp)
        try:
            instances = tobe.instances.all()
            for ins in instances:
                if ins.delete() != 0:
                    return 1
            channel = tobe.channel
            tobe_id = tobe.id
            tobe.delete()
            channel.delete()
            cha_repo = ChannelRepository()
            cha_repo.allsuperuser('delete', caseid=str(tobe_id))
            return 0
        except:
            import traceback
            traceback.print_exc()
② After:
    def delExp(self, exp):
        try:
            # exp is an experiment ID, so look the record up by primary key.
            tobe = Experiment.objects.get(pk=exp)
        except:
            cseplib.errmsg(__name__, sys._getframe(),
                           "Can't find any experiments with exp_id %s" % (exp))
            return 1
        try:
            instances = tobe.instances.all()
            for ins in instances:
                image_id = ins.curImage_id
                if ins.delete() != 0:
                    return 1
            channel = tobe.channel
            tobe_id = tobe.id
            tobe.delete()
            channel.delete()
            cha_repo = ChannelRepository()
            cha_repo.allsuperuser('delete', caseid=str(tobe_id))
            return 0
        except:
            import traceback
            traceback.print_exc()
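The difference between the two versions comes down to treating exp as a key rather than as an object. A minimal sketch of the lookup-by-ID approach, with a dictionary standing in for the database (Experiment, EXPERIMENTS, get_experiment and del_exp here are hypothetical stand-ins, not the CSEP classes):

```python
class Experiment(object):
    """Hypothetical in-memory stand-in for the Django Experiment model."""

    def __init__(self, pk):
        self.id = pk


EXPERIMENTS = {1274: Experiment(1274)}  # pretend database table


def get_experiment(pk):
    # Like a primary-key lookup: the key may arrive as text from the web layer.
    return EXPERIMENTS[int(pk)]


def del_exp(exp):
    """Return 0 on success, 1 otherwise; exp is an experiment ID, not an object."""
    try:
        tobe = get_experiment(exp)
    except (KeyError, ValueError):
        # No such experiment, or the ID is not a number.
        return 1
    del EXPERIMENTS[tobe.id]
    return 0


first = del_exp(u"1274")   # ID passed as a text string
second = del_exp(u"1274")  # already deleted
bad = del_exp(u"abc")      # not a valid ID
print(first, second, bad)
```

Because the ID arrives as a text string, reading exp.id can never work, which matches the 'unicode' object has no attribute 'id' message in the log.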
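Problem 6: When an experiment is created from the CSEP web page, the page does not show the IP and port to connect to after the VM boots.
[Screenshots: error case / normal case]
Cause:
1. The mechanism in ade/vagent_server.py that looks up the UUID belonging to a VM is flawed: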
    def initvagt(self, *args, **kwargs):
        """
        Find a uninitialized image, and send the UUID to a vagent_client
        vagent_client will use that UUID in later usage, such as ctport
        """
        sock = kwargs['clientsock']
        client_ip = sock.getpeername()[0]
        max_try = 10
        for i in range(1, max_try):
            # Loop until find the instance by IP
            try:
                instance = Instance.objects.filter(ip=client_ip,
                                                   initialized=False)[0]
                break
            except IndexError:
                if DEBUG:
                    cseplib.debugmsg(__name__, sys._getframe(),
                                     "Can't find the instance by ip %s" % (client_ip))
                if i == (max_try-1):
                    cseplib.debugmsg(__name__, sys._getframe(),
                                     "Failed to lookup instance by IP more")
                    return 1
                time.sleep(15)
        sock.send(str(instance.curImage.uuid))
        instance.initialized = True
        instance.save()
        return 0
With this design, if the css_instance table contains data like the following, the duplicated IP records can make it impossible to determine the VM's UUID correctly:
    mysql> select * from css_instance order by ip;
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
    | id   | exp_id | srcImage_id | curImage_id | ip            | state | reverse_port | guest_os | last_report         | initialized |
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
    | 2148 |   1274 |           4 |        2119 | 140.115.3.216 | S     |         NULL | NULL     | 2012-07-04 11:41:32 |           0 |
    | 2135 |   1267 |           4 |        2106 | 140.115.3.216 | S     |         NULL | NULL     | 2012-07-04 05:05:00 |           0 |
    | 2095 |   1244 |           4 |        2066 | 140.115.3.216 | S     |         NULL | NULL     | 2012-06-15 14:16:56 |           0 |
    | 1506 |    829 |           4 |        1450 | 140.115.3.216 | S     |         NULL | NULL     | 2012-03-05 21:12:37 |           0 |
    +------+--------+-------------+-------------+---------------+-------+--------------+----------+---------------------+-------------+
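The ambiguity can be reproduced with the four rows above. The sketch below uses a plain list in place of the Django query; candidates() is a hypothetical helper that applies the same filter as initvagt (same IP, not yet initialized).

```python
# The four css_instance rows from the query above, reduced to the relevant columns.
rows = [
    {"id": 2148, "curImage_id": 2119, "ip": "140.115.3.216", "initialized": False},
    {"id": 2135, "curImage_id": 2106, "ip": "140.115.3.216", "initialized": False},
    {"id": 2095, "curImage_id": 2066, "ip": "140.115.3.216", "initialized": False},
    {"id": 1506, "curImage_id": 1450, "ip": "140.115.3.216", "initialized": False},
]


def candidates(rows, client_ip):
    """Rows that the IP-based lookup would consider for this client."""
    return [r for r in rows if r["ip"] == client_ip and not r["initialized"]]


matches = candidates(rows, "140.115.3.216")
is_ambiguous = len(matches) > 1

# Taking element [0] silently picks one of several equally plausible rows,
# so the UUID sent back may belong to a different VM.
picked = matches[0]
print(len(matches), is_ambiguous, picked["curImage_id"])
```

Any lookup keyed only on the IP has this problem as soon as old records keep an address that a new VM later receives.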
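2. When the case template was built, the uuid.pkl file was included in it. The VM therefore assumes that uuid.pkl already contains its own UUID, when in fact the UUID stored in uuid.pkl is empty. (ade.vagent_client.init() checks whether /etc/init.d/uuid.pkl on Linux, or C:\\uuid.pkl on Windows, exists in order to decide whether to send a request to vagent_server asking for its own UUID.)
When this happens, the following can be seen in /var/log/apache/csep-error.log: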
    [Wed Aug 29 11:16:23 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:33 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:43 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:16:53 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:03 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:13 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:23 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:33 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
    [Wed Aug 29 11:17:43 2012] [error] [DEBUG] csep.ade.vagent_server.ctport : Can't find the instance by client_uuid
3. The data in the css_instance table is faulty
① The curImage_id column is NULL:
    mysql> select * from css_instance where curImage_id IS NULL order by last_report;
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
    | id  | exp_id | srcImage_id | curImage_id | ip   | state | reverse_port | guest_os | last_report         | initialized |
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
    | 696 |    361 |           3 |        NULL | NULL | S     |         NULL | NULL     | 2011-12-15 03:19:34 |           0 |
    | 697 |    361 |           4 |        NULL | NULL | S     |         NULL | NULL     | 2011-12-15 03:19:55 |           0 |
    +-----+--------+-------------+-------------+------+-------+--------------+----------+---------------------+-------------+
② The ip column is NULL. When this happens, the following can be seen in /var/log/apache/csep-error.log:
    [Wed Aug 29 05:54:21 2012] [error] [DEBUG] csep.ade.vagent_server.initvagt : Can't find the instance by ip 140.115.14.160
4. The UUID of the experiment VM is recorded as empty in the css_image table, so the UUID that vagent_server returns to vagent_client may be empty. This can be observed in uuid.pkl inside the VM:
Error case:
    S''
    p0
    .
Normal case:
    S'7e8412af-176c-4bb8-5c5e-5176529c3dc8'
    p0
    .
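The three lines shown are a protocol 0 pickle of a string: S introduces the quoted string, p0 stores it in the memo, and the final dot ends the stream. A small check for the empty-UUID case could therefore look like the sketch below. read_uuid is a hypothetical helper, not part of vagent_client, and pickle data should only be loaded when it comes from a trusted source.

```python
import pickle


def read_uuid(data):
    """Return the UUID stored in the contents of uuid.pkl, or None if it is empty."""
    value = pickle.loads(data)
    return value if value else None


# File contents as shown above, written out as bytes.
broken = b"S''\np0\n."
normal = b"S'7e8412af-176c-4bb8-5c5e-5176529c3dc8'\np0\n."

print(read_uuid(broken))
print(read_uuid(normal))
```

A check like this on the client side would distinguish a uuid.pkl that merely exists from one that actually holds a UUID.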
Solution
1. Edit
① Before:
② After:
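Problem 7:
Cause:
Solution: redesign the IP mechanism of the css_instance table
1. Edit the resume() function of class Instance in css/models.py so that, after resuming, it asks the CPCS platform for the VM's new IP and stores it in the css_instance table.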
① Before:
    def resume(self):
        """
        Recovery from a suspend state
        Return 0 when success, 1 is failed.
        """
        if self.vmc.resume(self):
            self = Instance.objects.get(pk=self.id)
            self.state = 'R'
            self.last_report = datetime.datetime.now()
            self.save()
            return 0
        return 1
② After:
    def resume(self):
        """
        Recovery from a suspend state
        Return 0 when success, 1 is failed.
        """
        if self.vmc.resume(self):
            self = Instance.objects.get(pk=self.id)
            # Ask the CPCS platform for the VM's current IP after resuming.
            self.ip = self.vmc.getip(self)
            self.state = 'R'
            self.last_report = datetime.datetime.now()
            self.save()
            return 0
        return 1
③ :
2. Run the Django shell under /var/www/csep to update the DB:
    $ ./manage.py shell
① Before:
② After:
③ :
Problem 8:
Cause:
Solution
1. Edit
① Before:
② After:




Conclusion