Segmentation of Chinese Long Sentences Using Commas

Mei xun Jin, Mi-Young Kim, Dongil Kim, and Jong-Hyeok Lee
Pohang University of Science and Technology, Advanced Information Technology Research Center
Div. of Computer, Electronics and Telecommunications, Yanbian University of Science and Technology
ACL SIGHAN Workshop 2004
My research topic
• The sentence is a fundamental unit for NLP.
• Resolving the boundaries of Chinese sentences
(or topic chains).
– Commas and full-stops are often confused in Chinese.
– A full-stop sometimes can be replaced with a comma.
– A comma sometimes should be replaced with a full-stop.
– Vice versa.
• Sentence segmentation is inherently ambiguous.
Samples
• 正因為沒有經過仔細全面的設計規畫,我們發展的
步驟錯亂、標準參差,由於向歐、美、日本取法的
模範不一,不但各方面無法配合,甚至會有衝突,
而設定的辦法不能取得大眾的共識與認同,與整個
社會格格不入,導致化橘為枳,貌似神非自然不在
話下。
• 這是有點霸道,但也有道理,因為他們是上市公司,
每一季要向美國證管會報告總公司、附屬公司及子
公司的營運及財務狀況,帳都是照一套會計原則來
做,所以很多時候他們的要求,是出自一種單純的
需要,而並不是故意要來欺負我們。
Outline
• Segmentation of Chinese long sentences using
commas.
• Types of commas
• Features
• Experiments
• Conclusion
Motivation
• Chinese has a rather different set of salient
ambiguities from the perspective of statistical
parsing.
• In Chinese, a subordinate clause or coordinate
clause is sometimes connected without any
conjunctions in a sentence.
• Clause segmentation is also rather different
compared with western languages.
• Segment Chinese long sentences using commas.
Segmentation
• Syntactic analysis of a sentence
1. Segment the sentence at a comma.
2. Do the dependency analysis for each segment.
3. Set the dependency relation between segment
pairs.
• In Chinese dependency parsing, not all
commas are proper as segmentation points.
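The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' parser: `parse_segment` and `link_segments` are hypothetical callbacks, and the toy functions `right_headed` and `link_to_last` only stand in for a real dependency parser and a real segment-linking model.

```python
def split_at_commas(tokens, comma=","):
    """Step 1: cut the token list at every comma (commas are dropped)."""
    segments, current = [], []
    for tok in tokens:
        if tok == comma:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


def right_headed(segment):
    """Toy stand-in for step 2 (hypothetical): every word attaches to the
    last word of its segment; -1 marks the segment root."""
    heads = [len(segment) - 1] * (len(segment) - 1) + [-1]
    return segment, heads


def link_to_last(parsed):
    """Toy stand-in for step 3 (hypothetical): attach each segment root to
    the root of the final segment and return sentence-level head indices."""
    offsets, total = [], 0
    for toks, _ in parsed:
        offsets.append(total)
        total += len(toks)
    final_root = offsets[-1] + parsed[-1][1].index(-1)
    out = []
    for k, (off, (_, heads)) in enumerate(zip(offsets, parsed)):
        for h in heads:
            if h == -1:
                out.append(-1 if k == len(parsed) - 1 else final_root)
            else:
                out.append(off + h)
    return out


def parse_by_segments(tokens, parse_segment, link_segments):
    """Segment at commas, parse each segment, then link the segments."""
    parsed = [parse_segment(seg) for seg in split_at_commas(tokens)]
    return link_segments(parsed)
```

A real system would replace the two toy callbacks; the point of the sketch is only that each segment is parsed independently before any cross-comma dependency is set.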
Segmentation: Case 1
• Only one dependency line crosses the comma.
– one_dep_line_cross comma
Segmentation: Case 2
• Some of the words fail to find their heads.
– mul_dep_lines_cross comma
Segmentation: Case 3
• Some words find the wrong head.
– mul_dep_lines_cross comma
• Segmentation at a one_dep_line_cross comma is helpful for
reducing parsing complexity and can contribute to accurate
parsing results.
• Segmentation at a mul_dep_lines_cross comma should be avoided.
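The difference between the two kinds of comma can be made concrete by counting the dependency lines that span a comma in a gold dependency tree. The sketch below assumes the tree is given as a list `heads`, where `heads[i]` is the index of the head of token `i` and `-1` marks the root; the function names and the extra label `no_dep_line_cross` are illustrative choices, not terms from the slides.

```python
def count_crossing_arcs(heads, comma_index):
    """Count dependency arcs whose dependent and head lie on opposite
    sides of the comma. The comma's own attachment is ignored."""
    count = 0
    for dep, head in enumerate(heads):
        if head < 0 or dep == comma_index or head == comma_index:
            continue
        if (dep < comma_index) != (head < comma_index):
            count += 1
    return count


def comma_kind(heads, comma_index):
    """Label a comma by the number of dependency lines crossing it."""
    n = count_crossing_arcs(heads, comma_index)
    if n == 1:
        return "one_dep_line_cross"
    if n > 1:
        return "mul_dep_lines_cross"
    return "no_dep_line_cross"  # illustrative label for the remaining case
```

For a five-token sentence with the comma at index 2, `heads = [1, 4, 4, 4, -1]` has a single crossing arc (token 1 to token 4), while `heads = [4, 4, 4, 4, -1]` has two.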
Inter-clause comma and Intra-clause
comma
• Intra-clause comma
– Occurring within a clause.
– 北海在數年前,是一個默默無聞的小漁村。 (Several years ago, Beihai was an obscure little fishing village.)
• Inter-clause comma
– At the end of a clause.
– 小明在寫作業,媽媽在打毛衣。 (Xiao Ming is doing his homework, and his mother is knitting a sweater.)
• Segment the long sentence at inter-clause
commas.
– Comma classification
Two segments adjoining a comma
• To identify whether a comma is an inter-clause comma or an intra-clause comma.
• Assign values to each comma
– (left_seg, right_seg)
– The left_seg/right_seg can be a phrase (p) or a clause (c).
– (p, p), (p, c), (c, p), (c, c)
Syntactic relation between two
adjoining segments
• Relation
– Whether any word of the left segment has a dependency
relation with a word of the right segment.
• Direction
– How many direction(s) the dependency relations
between the two segments have.
• Head
– Which segment contains the head of a word
in the other segment.
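The three properties above can be computed from a dependency tree once the two segments are known. The following is a hedged sketch: the `heads` list encoding (head index per token, `-1` for the root), the use of `range` objects for segments, and the return format are all assumptions made for illustration.

```python
def segment_relation(heads, left, right):
    """Describe the dependency relations between two adjoining segments.

    heads: heads[i] is the index of the head of token i (-1 for the root).
    left, right: range objects covering the token indices of each segment.
    """
    head_sides = set()
    for dep, head in enumerate(heads):
        if dep in left and head in right:
            head_sides.add("right")   # a left-segment word is headed on the right
        elif dep in right and head in left:
            head_sides.add("left")    # a right-segment word is headed on the left
    direction = len(head_sides)
    if direction == 1:
        head = next(iter(head_sides))
    elif direction == 2:
        head = "both"
    else:
        head = None
    return {"relation": 1 if head_sides else 0,
            "direction": direction,
            "head": head}
```

With `heads = [1, 4, 4, 4, -1]`, `left = range(0, 2)` and `right = range(3, 5)`, the only cross-segment arc is headed in the right segment, so the result is Relation = 1, one direction, Head = right.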
Comma values | Syntactic relation | Sample | Comma classification type
(c, p) | Relation = 0 | 在單位裡,他是個好領導,在家裡,他是好爸爸。 | (c, p)-I
(c, p) | Relation = 1, Head = right = p | 科研成果快速轉化為生產力,是這個開發區的特點。 |
(c, p) | Relation = 1, Head = left = c | 學生們來到了操場,高高興興地。 | (c, p)-II
(p, c) | Relation = 0 | 韓國對大連投資已連續三年增長,在大連,韓國投資企業受到各種優惠。 | (p, c)-I
(p, c) | Relation = 1, Head = left = p | 統計資料表明,大連對韓出口達一億多美元。 |
(p, c) | Relation = 1, Head = right = c | 一九九四年,通用在中國購買了四千多萬美元的東西。 | (p, c)-II
(p, p) | | 中國銀行在去年十月,聘請日某公司做顧問。 | (p, p)
(c, c) | | 一號產品佔據不到二成,二號產品比重達七成以上。 | (c, c)
Estimate the type of comma
• To identify whether a comma is inter-clause or intra-clause,
we need to estimate the types of the segments to its left
and right.
– Classify a comma into one of (c, c), (c, p), (p, c),
and (p, p).
• Classification using SVM
– With a number of kernels.
Features
• Directly relevant feature category:
– Predicate
– Complements
• Indirectly relevant feature category:
– Auxiliary words
– Adverbials
– Prepositions
– Clausal conjunctions
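As a rough illustration of how such feature categories might be read off a POS-tagged sentence, the sketch below collects the tags in a fixed window around a comma and flags which categories occur on each side. The mapping from categories to Penn Chinese Treebank tags in `CATEGORY_TAGS`, the feature names, and the default window size are assumptions for illustration only; complements are not modelled, and the slides do not specify the actual feature encoding.

```python
# Assumed mapping from feature categories to Penn Chinese Treebank POS tags.
CATEGORY_TAGS = {
    "predicate": {"VV", "VA", "VC", "VE"},
    "auxiliary": {"AS", "DEC", "DEG", "DER", "DEV", "SP", "MSP"},
    "adverbial": {"AD"},
    "preposition": {"P"},
    "conjunction": {"CS", "CC"},
}


def comma_features(pos_tags, comma_index, window=3):
    """Build a sparse feature dict for the comma at comma_index."""
    left = pos_tags[max(0, comma_index - window):comma_index]
    right = pos_tags[comma_index + 1:comma_index + 1 + window]
    feats = {}
    # Positional POS features, numbered outward from the comma.
    for k, tag in enumerate(reversed(left), start=1):
        feats[f"L{k}={tag}"] = 1
    for k, tag in enumerate(right, start=1):
        feats[f"R{k}={tag}"] = 1
    # Category indicators for each side of the comma.
    for side, tags in (("left", left), ("right", right)):
        for name, tagset in CATEGORY_TAGS.items():
            feats[f"{side}_has_{name}"] = int(any(t in tagset for t in tags))
    return feats
```

A dictionary like this could be vectorised and passed to any classifier; a larger window lets the indicators see predicates that sit further from the comma.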
Directly relevant features
Indirectly relevant features
Experiments
• Dataset
– Chinese Penn Treebank 2.0
– 10-fold cross-validation
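For reference, 10-fold cross-validation partitions the data into ten folds and uses each fold once as the test set while training on the other nine. A minimal sketch using only the standard library is given below; the function name, the shuffling, and the seed are illustrative assumptions, since the slides do not say how the folds were drawn.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Scores are then averaged over the ten test folds.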
Results: Different kernel
Results: Window size & POS
Results: Parsing accuracy
• Parsing procedure
1. POS Tagging.
2. Long sentence segmentation by comma.
3. Parsing based on segmentation.
Conclusion
• Chinese sentence segmentation by
classification of the comma.
• Improving the accuracy of dependency parsing
by 9.6%.
• The accuracy for the segmentation is not yet
satisfactory.
– Inter-F score = 83.72%
– Accuracy = 85.43%
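The two figures above are standard classification metrics. The sketch below shows how they are usually computed, assuming that "Inter-F score" means the F-score with inter-clause commas as the positive class; that reading, the label strings, and the function names are assumptions, not details given in the slides.

```python
def accuracy(gold, pred):
    """Fraction of commas whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold, pred, positive="inter"):
    """Balanced F-score (harmonic mean of precision and recall) for one class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Accuracy counts every comma equally, whereas the F-score focuses on how well the inter-clause commas, which are the segmentation points, are recovered.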