Processing Phase of Summarizer for Multiple News Single Punjabi Documents

International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 7- Dec 2013
Vishal Gupta
Assistant Professor, UIET Panjab University,
Sector-25, Chandigarh, India
Abstract—This paper discusses the processing phase of a Punjabi
summarizer for single text documents that contain multiple news
articles. This is the first time that such a Punjabi summarizer has
been implemented and developed. For this project we developed
several Punjabi language resources, including a noun morph for
Punjabi, a stemming procedure for Punjabi, key term detection for
Punjabi text and named entity detection for Punjabi. A single
newspaper page usually contains several news items of different
lengths. Based on the compression ratio (C.R.) given by the user,
the proposed system retrieves the mainlines (headlines) of every
news item, the sentences immediately after the mainlines, and
other suitable sentences selected by relevance. The relevance of a
line is calculated from text features that are either statistical or
language oriented. The approach has two steps: a) pre-processing
and b) processing. In the pre-processing step, the input text is
represented in a structured manner. In the processing step, the
features that decide which lines are important are selected and
calculated. The statistical features are key term detection in
Punjabi, the length of a line and numeric data. The language
oriented features are: extraction of mainlines in Punjabi
newspapers, extraction of the sentences next to mainlines, Punjabi
nouns, Punjabi names, English terms that are used as-is in
Punjabi, Punjabi cue terms and the presence of title terms in a
line. Lines are then ranked by the value of the line-feature
equation, in which all features are assumed to be equally
important. The highest-ranked lines are retrieved as the summary
at the chosen C.R. and are presented in the order in which they
occur in the input text, which preserves coherence.
Keywords—Text summarizer, detection of Punjabi mainlines,
detection of names in Punjabi, detection of key terms in Punjabi.
I. INTRODUCTION
Text summarization [1] [2] is the technique of shortening an
input text while preserving its information content and theme.
It has two sub-steps [3]: i) pre-processing, in which the input
text is represented in a structured manner, and ii) processing,
in which every line is ranked by calculating the line-feature
equation and the lines with the highest scores are retrieved as
the summary, in the order in which they occur in the input.
This paper discusses the processing sub-step of a Punjabi
summarizer for single text documents that contain multiple news
articles. This is the first time that such a Punjabi summarizer
has been implemented and developed. For this project we
developed several Punjabi language resources, including a noun
morph for Punjabi, a stemming procedure for Punjabi, key term
detection for Punjabi text and named entity detection for
Punjabi. A single newspaper page usually contains several news
items of different lengths. Based on the C.R. given by the user,
the proposed system retrieves the mainlines (headlines) of every
news item, the sentences immediately after the mainlines, and
other suitable sentences selected by relevance. The relevance of
a line is calculated from text features that are either
statistical or language oriented. The approach has two steps:
a) pre-processing and b) processing. In the pre-processing step
[8], the input text is represented in a structured manner. In
the processing step, the features that decide which lines are
important are selected and calculated. The statistical features
are key term detection in Punjabi, the length of a line and
numeric data. The language oriented features are: extraction of
mainlines in Punjabi newspapers, extraction of the sentences
next to mainlines, Punjabi nouns, Punjabi names, English terms
that are used as-is in Punjabi, Punjabi cue terms and the
presence of title terms in a line. Lines are then ranked by the
value of the line-feature equation, in which all features are
assumed to be equally important. The highest-ranked lines are
retrieved as the summary at the chosen C.R. and are presented in
the order in which they occur in the input text, which preserves
coherence.
ISSN: 2231-5381
II. PROCESSING PHASE
In this step [4] [12], the features that decide which lines are
important are applied and their values are calculated. Lines are
then ranked by applying the line-feature equation, which gives
the score of a line as:
Score(line) = feature_1 + feature_2 + feature_3 + ... + feature_n
Here feature_1, feature_2, feature_3, ..., feature_n are the
values of the features of that line, calculated in the various
steps of this summarizer. The highest-scored lines are retrieved
as the summary in the order in which they occur in the input
text, so that coherence among the lines is maintained.
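The ranking and selection described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name summarize, the representation of feature values as one list per line, and the interpretation of the C.R. as a fraction of lines (rounded up) are all assumptions made for the example.

```python
import math

def summarize(lines, feature_rows, compression_ratio):
    """Return the highest-scoring lines, restored to their input order.

    lines: list of str, one entry per input line.
    feature_rows: one list of feature values per line (equal weights).
    compression_ratio: fraction of lines to keep, e.g. 0.3 for 30% C.R.
                       (assumption: the ratio is applied to the line count).
    """
    # Line-feature equation: the score is the plain sum of the feature values.
    scores = [sum(row) for row in feature_rows]
    # Assumed rounding rule: keep at least one line, round up otherwise.
    keep = max(1, math.ceil(compression_ratio * len(lines)))
    # Rank line indices by score; sorted() is stable, so ties keep input order.
    ranked = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    # Restore the original order of the selected lines for coherence.
    chosen = sorted(ranked[:keep])
    return [lines[i] for i in chosen]
```

Sorting the selected indices again before returning is what keeps the summary in input order, regardless of which line scored highest.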
http://www.ijettjournal.org
A. Detection of Mainlines and Lines Next to Mainlines
This summarizer is the first to implement automatic detection
of mainlines (headlines) and of the sentences next to mainlines
in single Punjabi documents taken from Punjabi newspapers.
Mainlines are always relevant in newspapers because they convey
much of the information of the whole news item, so they are
always included in the summary. The sentence next to a mainline
usually also carries relevant information, so it normally
becomes part of the summary as well. The corpus of Punjabi news
articles contains 65,722 mainlines, which make up around seven
percent of the corpus. A sentence in Punjabi normally ends with
a question mark, an exclamation mark or the vertical line
character. So if a line of Punjabi text does not end with one of
these characters but with a newline (Enter key) character, that
line is a mainline. Moreover, if the sentence immediately after
a mainline ends with a question mark, vertical line character or
exclamation mark, it is marked as the line next to the mainline.
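The line-ending rule above can be sketched as follows. This is a simplified illustration under stated assumptions: the input is assumed to hold one sentence or headline per line after pre-processing, the "vertical line character" is taken to be the danda (U+0964) with the ASCII bar accepted as a fallback, and the function name tag_mainlines and the tag strings are hypothetical.

```python
# Sentence-final characters; the danda and the ASCII bar are both accepted
# because the text only says "vertical line character" (assumption).
SENTENCE_ENDERS = ("?", "!", "\u0964", "|")

def tag_mainlines(lines):
    """Label each non-empty line as 'mainline', 'next_to_mainline' or 'other'."""
    tags = []
    previous_was_mainline = False
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines between news items
        if not text.endswith(SENTENCE_ENDERS):
            tag = "mainline"            # ends with a newline only
        elif previous_was_mainline:
            tag = "next_to_mainline"    # first sentence after a mainline
        else:
            tag = "other"
        tags.append((text, tag))
        previous_was_mainline = (tag == "mainline")
    return tags
```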
B. Detection of Cue Terms in Punjabi
In English there are various cue terms such as "conclude", "at
last", "ultimately" and "summary". A line that contains a cue
term at any position (start, middle or end) is relevant because
it tends to highlight important information. We created a list
of Punjabi cue terms from the Punjabi news corpus. Lines
containing these Punjabi cue terms are marked as relevant and
are included in the summary. In the Punjabi news corpus, cue
terms occur 58,708 times and make up 0.52% of the corpus.
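A cue term check of this kind might look like the sketch below. The Punjabi cue term list is not reproduced in this paper, so the English examples quoted above are used purely as stand-ins; the names tokenize and has_cue_term, the punctuation set and the 0/1 return value are assumptions for illustration.

```python
# English cue terms quoted in the text, used only as stand-ins for the
# Punjabi cue term list, which is not given in the paper.
CUE_TERMS = ["conclude", "at last", "ultimately", "summary"]

PUNCTUATION = ".,;:!?\"'()\u0964|"

def tokenize(line):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = [tok.strip(PUNCTUATION).lower() for tok in line.split()]
    return [tok for tok in tokens if tok]

def has_cue_term(line, cue_terms=CUE_TERMS):
    """Return 1 if any cue term (word or phrase) occurs in the line, else 0."""
    tokens = tokenize(line)
    for term in cue_terms:
        phrase = term.lower().split()
        if not phrase:
            continue
        n = len(phrase)
        # Match the phrase as a contiguous run of whole tokens.
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            return 1
    return 0
```

Matching on whole tokens rather than with regular-expression word boundaries is a deliberate choice here, since boundary classes can behave unexpectedly around Gurmukhi vowel signs.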
C. Named Entity Detection in Punjabi
Rule-based named entity detection for Punjabi was first
developed by Gupta & Lehal (2011) [10], and we use that approach
here. It relies on several gazetteer lists, namely a list of
name prefixes, a list of name suffixes, a list of middle parts
of names, a list of last parts of names and a list of Punjabi
proper nouns, which are used to decide whether a Punjabi term is
a named entity. These lists were compiled from the Punjabi news
corpus.
The prefix list contains prefixes of Punjabi names, for example
ਸੀਮਤੀ, ਿਪ. and ਡਾ:. Fourteen such prefixes were found in the
Punjabi news corpus; they occur 17,127 times and make up 0.15%
of the corpus. The suffix list is used to decide whether a term
is a Punjabi name, for example ਜੀਤ, ਪੁਰ, ਪੁਰਾ and ਪੁਰੀ. It has
around fifty suffixes, which occur 225,306 times in the Punjabi
news corpus and make up around two percent of it. The list of
middle parts of names is used to identify whether a term is a
Punjabi named entity, for example ਕੁਮਾਰ, ਕੌਰ and ਕੁਮਾਰੀ. We
found 8 middle parts of names in the Punjabi news corpus; they
occur 97,907 times and make up around one percent of the corpus.
The list of last parts of names is used to detect whether a term
belongs to a Punjabi name and has around 310 entries. In the
Punjabi corpus, 69,268 terms were identified as last parts of
names, which is 0.6135 percent of the corpus. Punjabi names are
essential for retrieving relevant lines for the summary. Names
occur 17,598 times in the news corpus and make up around
fourteen percent of it. The value of this feature is the ratio
of the number of Punjabi names in a line to the length of the
line, so its score varies from 0 to 1. On fifty Punjabi news
articles, this sub-phase achieved a precision of 89.32 percent,
a recall of 83.4 percent and an F-measure of 86.25 percent, that
is, an error of 13.75 percent.
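The name feature can be illustrated with the sketch below. It is a deliberately rough stand-in for the rule-based system of [10]: only the suffix and middle-part examples quoted above are used, line length is assumed to mean the number of tokens, and the names looks_like_name and name_feature are hypothetical.

```python
# Small samples taken from the examples quoted in the text; the real
# gazetteer lists are far larger.
NAME_SUFFIXES = ("ਜੀਤ", "ਪੁਰ", "ਪੁਰਾ", "ਪੁਰੀ")
NAME_MIDDLE_PARTS = {"ਕੁਮਾਰ", "ਕੌਰ", "ਕੁਮਾਰੀ"}

def looks_like_name(token):
    """Rough gazetteer check; the actual system applies more lists and rules."""
    return token in NAME_MIDDLE_PARTS or token.endswith(NAME_SUFFIXES)

def name_feature(tokens):
    """Ratio of name-like tokens to line length, in the range 0 to 1."""
    if not tokens:
        return 0.0
    return sum(looks_like_name(t) for t in tokens) / len(tokens)
```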
D. Detection of Nouns & Common Nouns in English and Punjabi
Lines containing nouns [6] are always relevant. Terms in the
input text are looked up in the Punjabi noun morph, or the noun
stemmer is applied, to check whether they are Punjabi nouns. The
Punjabi noun morph contains around 37,297 nouns. The value of
this feature is the ratio of the number of nouns in a line to
the length of the line, so the score varies from 0 to 1. Nouns
make up around seventeen percent of the terms in the Punjabi
news corpus. The accuracy of this phase, tested on the Punjabi
news corpus, is around ninety eight percent; the 1.57% errors
are caused by nouns that are missing from the Punjabi noun morph
and by stemmer errors.
Many English terms are also written in Punjabi just as they are
used in English, for example ਮੋਬਾਈਲ (mobile) and ਟੈਕਨਾਲੋਜੀ
(technology). These terms are usually absent from the Punjabi
noun morph and the Punjabi dictionary because they are not
native Punjabi terms. They are nevertheless very relevant and
are called common nouns of English and Punjabi. Because they
can affect the relevance of lines, a separate list containing
only these common English and Punjabi terms was created.
Input terms are looked up in this list of common noun terms of
English and Punjabi. The score of this feature is the ratio of
the number of common English-Punjabi noun terms in a line to the
line's length, so its value varies from zero to one. We found
18,245 common noun terms of Punjabi and English, which make up
around six percent of the Punjabi news corpus. The accuracy of
this sub-phase is around ninety five percent, tested on fifty
documents of the Punjabi news corpus. The five percent errors
are caused by common Punjabi and English terms that are missing
from the list.
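Both noun features reduce to a lexicon lookup followed by a ratio, which can be sketched as below. The function name lexicon_ratio, the optional stem callable and the two-entry sample list are assumptions for illustration; the actual noun morph, stemmer and common-noun list are not reproduced in this paper.

```python
def lexicon_ratio(tokens, lexicon, stem=None):
    """Fraction of tokens found in `lexicon`, optionally retrying with a stem.

    stem: optional callable mapping a token to its stem (hypothetical hook
          standing in for the Punjabi noun stemmer).
    """
    if not tokens:
        return 0.0
    hits = 0
    for token in tokens:
        if token in lexicon or (stem is not None and stem(token) in lexicon):
            hits += 1
    return hits / len(tokens)

# The two examples quoted in the text, standing in for the full list.
COMMON_EN_PA_NOUNS = {"ਮੋਬਾਈਲ", "ਟੈਕਨਾਲੋਜੀ"}
```

The same function can serve the noun feature by passing the noun morph as the lexicon together with a stemmer, and the common-noun feature by passing the common-noun list without one.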
E. Detection of Key Terms and Title Terms in Punjabi Text
Key terms [7] are useful for extracting important lines. A
system for detecting key terms and title terms in Punjabi was
first implemented by Gupta & Lehal (2011) [11]. Punjabi key
terms are those noun terms that have a high term frequency-
inverse line frequency value. Here the term frequency is the
count of a particular Punjabi noun term in a given line. The
inverse line frequency is based on the number of lines that
contain that noun term and is calculated as log(|L| / LF(t)),
where |L| is the total number of lines in the document and
LF(t) is the number of lines that contain the noun term t.
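The weighting above can be sketched as follows. The paper gives only the inverse line frequency formula, so combining it with the term frequency as a product (as in TF-IDF) and using the natural logarithm are assumptions, as is the function name tf_ilf; the input is assumed to contain noun terms only.

```python
import math
from collections import Counter

def tf_ilf(lines_tokens):
    """Return one {term: tf * log(|L| / LF(t))} dict per line.

    lines_tokens: list of token lists, one per line (noun terms only).
    """
    total_lines = len(lines_tokens)          # |L|
    line_freq = Counter()                    # LF(t)
    for tokens in lines_tokens:
        line_freq.update(set(tokens))        # count each term once per line
    scores = []
    for tokens in lines_tokens:
        tf = Counter(tokens)                 # term frequency within this line
        scores.append(
            {t: tf[t] * math.log(total_lines / line_freq[t]) for t in tf}
        )
    return scores
```

A term that occurs in every line gets a weight of zero, so only terms concentrated in a few lines can rank as key terms.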
The F-measure, recall and precision of this key term extraction
system are 85.2 percent, 90.6 percent and 80.4 percent
respectively. These measures were determined by thoroughly
examining the output of key term extraction on 50 Punjabi news
documents. The errors of around fifteen percent are caused by
Punjabi noun terms that are missing from the Punjabi noun morph,
mistakes in the Punjabi dictionary, typing errors in the input
and violations of the noun stemming rules.
Title sentences are the mainlines of a document that contains a
single or multiple news articles. Punjabi lines containing title
key terms are relevant [5]. Title key terms are obtained by
removing Punjabi stop terms from the title sentences. The score
for this feature is the ratio of the number of unique title key
terms in a line to the total number of title key terms. Its
accuracy, determined on 50 documents of the Punjabi news corpus,
is around ninety seven percent. The three percent errors arise
because some stop terms remain in the title sentences, since the
Punjabi stop term list has only 615 entries.
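The title term feature described above can be sketched as follows. The function names and the use of English tokens in the usage example are assumptions for illustration; the Punjabi stop term list itself is not reproduced in this paper.

```python
def title_key_terms(title_tokens, stop_terms):
    """Title key terms: the title tokens that are not stop terms."""
    return {t for t in title_tokens if t not in stop_terms}

def title_feature(line_tokens, title_terms):
    """Ratio of unique title key terms in the line to all title key terms."""
    if not title_terms:
        return 0.0
    return len(set(line_tokens) & title_terms) / len(title_terms)
```

For example, with the hypothetical title tokens ["the", "budget", "of", "punjab"] and stop terms {"the", "of"}, a line containing only "punjab" out of the two title key terms scores 0.5.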
F. Calculation of Line Relative Length
Short lines are not preferred as part of the summary [5] because
they usually carry very little information, whereas long Punjabi
sentences can carry much information. The value of this feature
is the ratio of the number of terms in a line to the number of
terms in the longest line, so it varies from zero to one.
Score (Relative Length) = number of terms in line / number of
terms in longest line
G. Font Feature of Punjabi Terms
Punjabi lines containing terms that are in bold font,
underlined, in italics, in quotation marks or in a larger font
size are important and should be included in the summary. If
this feature is true for a line, its font flag is set to 1;
otherwise it is 0.
H. Finding Scores of Lines for Final Summary
Lines are ranked by applying the line-feature equation, which
gives the score of a line as:
Score(line) = feature_1 + feature_2 + feature_3 + ... + feature_n
Here feature_1, feature_2, feature_3, ..., feature_n are the
values of the features of that line, calculated in the various
steps of this summarizer. The highest-scored lines are retrieved
as the summary in the order in which they occur in the input
text, so that coherence among the lines is maintained.
III. RESULTS AND DISCUSSIONS
The system was tested on 50 news documents from the Punjabi news
corpus. These documents were of mixed type, containing single or
multiple news items in the same document, and the data set has
6,185 lines and 72,689 terms. The system was evaluated with both
intrinsic and extrinsic measures. Four intrinsic measures were
applied [9] to evaluate the Punjabi summaries: a) F-measure,
b) cosine similarity, c) Jaccard coefficient and d) Euclidean
distance. Two extrinsic evaluation techniques were applied:
i) a question answering task and ii) a key term association
task. The intrinsic results are given in Table I, and the
results of the extrinsic evaluation at different C.R. are given
in Table II.
TABLE I
INTRINSIC SUMMARY EVALUATION
Compression   F-measure   Cosine       Jaccard   Euclidean
Ratio                     Similarity   Coeff.    Distance
10%           98.45       98.89        97.30     0.10
30%           96.53       97.56        95.99     0.29
50%           95.11       96.23        95.12     0.48
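The intrinsic measures used in the evaluation (cosine similarity, Jaccard coefficient and Euclidean distance) are not defined in detail in this paper. A minimal sketch is given below, assuming that each measure compares a system summary with a reference summary represented as bags of terms; the function names and this representation are assumptions, and the exact formulation used for the reported figures may differ.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard_coefficient(a_tokens, b_tokens):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def euclidean_distance(a_tokens, b_tokens):
    """Straight-line distance between the two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))
```

Higher cosine and Jaccard values and a lower Euclidean distance all indicate that the system summary is closer to the reference.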
TABLE II
RESULTS OF EXTRINSIC SUMMARY EVALUATION
Compression   Efficiency of         Efficiency of Key
Ratio         Question Answering    Terms Association
10%           80.67                 81.88
30%           85.58                 94.46
50%           90.45                 96.92
IV. CONCLUSIONS
This is the first time that a summarizer with such high accuracy
has been developed for Punjabi documents containing single and
multiple news items. The language components used in this
system, such as stemming of Punjabi terms, standardization of
Punjabi nouns, detection of Punjabi names, detection of Punjabi
key terms, the list of Punjabi names, the list of common noun
terms of Punjabi and English, the list of Punjabi stop terms and
the lists of suffix and prefix parts of Punjabi names, were
created from scratch because these resources did not previously
exist. Moreover, this is the first time that these Punjabi
language components have been developed, and they may be useful
in implementing other NLP applications for Punjabi.
REFERENCES
[1] F. Kyoomarsi, H. Khosravi, E. Eslami, P.K. Dehkordy, “Optimizing
Text Summarization Based on Fuzzy Logic”, IEEE International
Conference on Computer and Information Science, University of
Shahid Kerman, UK, 2008, pp. 347-352.
[2] V. Gupta, G.S. Lehal, “A Survey of Text Summarization
Extractive Techniques”, International Journal of Emerging
Technologies in Web Intelligence, vol. 2, 2010, pp. 258-268.
[3] J. Lin, “Summarization”, In Encyclopedia of Database Systems,
Springer-Verlag Heidelberg, Germany, 2009.
[4] V. Gupta and G.S. Lehal, “Automatic Punjabi Text Extractive
Summarization System”, In International Conference on Computational
Linguistics COLING-2012, IIT Bombay, India, 2012, pp. 191-198.
[5] M.A. Fattah, F. Ren, “Automatic Text Summarization”, World
Academy of Science Engineering and Technology, vol. 27, 2008,
pp. 192-195.
[6] K. Kaikhah, “Automatic Text Summarization with Neural Networks”,
In IEEE International Conference on Intelligent Systems, Texas, USA,
2004, pp. 40-44.
[7] J. L. Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas, “Document
Clustering and Text Summarization”, Int. Conference on Practical
Application of Knowledge Discovery & Data Mining, London, 2000,
pp. 41-55.
[8] V. Gupta and G.S. Lehal, “Complete Pre processing Phase of Punjabi
Language Text Summarization”, In International Conference on
Computational Linguistics COLING-2012, IIT Bombay, India, 2012,
pp. 199-205.
[9] M. Hassel, “Evaluation of Automatic Text Summarization”,
Licentiate Thesis, Stockholm, Sweden, 2004, pp. 1-75.
[10] V. Gupta and G.S. Lehal, “Named Entity Recognition for Punjabi
Language Text Summarization”, International Journal of Computer
Applications, vol. 33, 2011, pp. 28-32.
[11] V. Gupta and G.S. Lehal, “Automatic Keywords Extraction for
Punjabi Language”, International Journal of Computer Science Issues,
vol. 8, 2011, pp. 327-331.
[12] V. Gupta and G.S. Lehal, “Automatic Text Summarization System for
Punjabi Language”, Journal of Emerging Technologies in Web
Intelligence, vol. 5, 2013, pp. 257-271.