Creating and Using a Correlated Corpus to Glean Communicative Commonalities
Jade Goldstein-Stewart, U.S. Dept. of Defense
Kerri A. Goodwin, Loyola College
Roberta E. Sabin, Loyola College
Ransom K. Winder, MITRE Corporation
Outline
• Motivation
• Corpora collection
• General Corpora Characteristics
– Word count
– Readability
• Future directions
May 30, 2008
LREC
Motivation
• How do computer-mediated communication
genres differ from traditional genres?
– Computer-mediated: email, blog, chat
– Traditional: interview, essay, discussion
• How consistent are communicative features
across genres for a single individual?
• If such commonalities exist, how can they be
utilized for document classification?
Email sample (2E1S3)
I do not feel that gender discrimination is a
problem in the United States at the moment.
My supervisor at my current job is a woman,
and everyone respects her the same as the
owner of the company, who is a man. I think
this issue was more prevalent earlier last
century. In these modern times, it really is not
an issue in my opinion.
Blog sample (2B1S2)
While gender discrimination is something that
should always be avoided ideally, there
are some problems I have with the issue in
general. As the discussion starter
states, discrimination because of sex is defined
as adverse action against another person, that
would not have occurred had the person been
of another sex.
Chat sample (2C1S1)
– Are there a lot of issues like this in the news, because
to me generder discrimination is a thing of the past
– Aren't men found to be naturally more apt in certain
fields, and women in others?
– Did any of you experienece any personal
discrimination at your jobs, or witness it or
anything?
– I definitely agree with that
– Unless one person decides another person is not right
for a job solely based on gender, I don't believe it is
discrimination
Aim: Collect a correlated corpus of text samples
• Including both computer-mediated and non-computer-mediated communication
• Including both individual and interactive, spoken and written
• Across 6 genres:
– email, essay, interview (phone)
– blog, chat, discussion
• From the same individuals
• On 6 distinct topics
Corpora Collection
September 2006 through November 2007
Participants
• All college students, aged 18-29
• 12 students in pilot study
• 21 participants completed both Phase 1 (email, essay,
interview) & Phase 2 (blog, chat, discussion)
• 10M/11W
• 18 Caucasian/3 African-American
• all had English as the primary language spoken at home
Topics
• Piloted via individual interviews with a separate group
• Selected for
– production of expression
– comfort of participants with the topic
• Topics:
1. Catholic Church
2. Gay Marriage
3. Iraq War
4. Legalization of Marijuana
5. Privacy as a U.S. Citizen
6. Gender Discrimination
• Each introduced via a “starter” question
Other Design Issues
• Individual instructions standardized
• Environments controlled
– In-house email system
– Single discussion leader and phone interviewer
– Relaxed discussion and interview setting
– Chat sessions “gently” moderated
• Ordering of genres and topics controlled
• Group membership randomized
– gender balance 2M/2W
All .txt files produced
• Interviews and Discussions transcribed
– by trained psychology students
– punctuation inserted
– non-fluencies preserved
• Discussion and chat transcripts split into individual per-participant files
• Multiple blog entries combined into a single file
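The splitting step can be sketched as follows. This is a minimal illustration, not the procedure the authors used: it assumes each transcript line has the form `SPEAKER: utterance`, and the function names `split_by_speaker` and `write_speaker_files` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def split_by_speaker(transcript):
    """Group utterances by speaker, assuming 'SPEAKER: utterance' lines."""
    turns = defaultdict(list)
    for line in transcript.splitlines():
        # Split at the first colon only, so colons inside the utterance survive.
        speaker, sep, utterance = line.partition(":")
        if not sep or not utterance.strip():
            continue  # skip blank lines and lines without a speaker label
        turns[speaker.strip()].append(utterance.strip())
    return dict(turns)


def write_speaker_files(turns, out_dir, prefix):
    """Write one .txt file per speaker (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for speaker, utterances in turns.items():
        path = out / f"{prefix}_{speaker}.txt"
        path.write_text("\n".join(utterances) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Keeping the grouping separate from the file writing makes it easy to check the per-speaker split before anything is written to disk.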
Resulting Corpora
                        Emails  Essays  Interviews  Blogs*  Chat  Discussion  All
Totals                     180     180         186     132   132         132  942
From same individuals      126     126         126     126   126         126  756
(21 fully parallel corpora)
* Blog entries were combined into single files.
The 21 fully parallel corpora were used in this paper.
Limitations: size, homogeneity of subjects, nonspontaneity of discourse
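The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes one file per participant per topic per genre, which is an inference from 21 × 6 = 126 rather than something the slides state explicitly; the variable names are illustrative.

```python
# Figures taken from the slide; the one-file-per-cell layout is an assumption.
participants = 21
topics = 6
genres = ["email", "essay", "interview", "blog", "chat", "discussion"]

per_genre = participants * topics          # files per genre in the parallel set
parallel_total = per_genre * len(genres)   # all files in the parallel set

# Total files collected per genre, as reported on the slide.
totals = {"email": 180, "essay": 180, "interview": 186,
          "blog": 132, "chat": 132, "discussion": 132}
all_files = sum(totals.values())

print(per_genre, parallel_total, all_files)
```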
General Corpora Characteristics
• Word Count
– by topic
– by genre
– by gender of communicant
• Readability: Flesch reading ease & Flesch-Kincaid grade level
– by topic
– by genre
– by gender of author
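Both readability measures are simple functions of words per sentence and syllables per word. The sketch below uses the standard published coefficients, but the sentence splitter, tokenizer, and syllable counter are rough heuristics chosen for illustration; the slides do not say which tool the authors used, so scores from this code may differ from theirs.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return (Flesch reading ease, Flesch-Kincaid grade level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade
```

Higher reading ease means easier text, while a higher grade level means harder text, so the two scores move in opposite directions.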
Word Count
• No main effect for gender
• No main effect for topic
• Significant topic x gender interaction for Church and Discrimination
[Figure: mean word count (axis range 500 to 700) by topic (Church, Gay, Iraq, Marij, Privacy, Discrim), plotted for men, women, and combined]
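The cell means behind a topic-by-gender comparison can be sketched as below. This computes descriptive means only, not the significance tests reported on the slide; whitespace tokenization and the `(topic, gender, text)` tuple layout are assumptions, and `mean_word_counts` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def mean_word_counts(samples):
    """Mean word count per (topic, gender) cell, plus a 'combined' cell.

    samples: iterable of (topic, gender, text) tuples (assumed layout).
    """
    counts = defaultdict(list)
    for topic, gender, text in samples:
        n_words = len(text.split())  # whitespace tokenization (assumption)
        counts[(topic, gender)].append(n_words)
        counts[(topic, "combined")].append(n_words)
    return {cell: mean(values) for cell, values in counts.items()}
```

Swapping the topic field for a genre label gives the by-genre means in the same way.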
Word Count (cont'd)
• Significant main effect for genre
• Discussion had highest word counts
• Direct communication produced higher word counts
[Figure: mean word count (axis range 100 to 1300) by genre (Email, Essay, Inter, Blog, Chat, Disc), plotted for men, women, and combined]
Readability
• No significant main effect for gender
• Significant main effect for genre
– Discussion and interview had highest reading ease
– Main effect for topic
[Figure: mean Flesch reading ease (axis range 65 to 81) by topic (Church, Gay, Iraq, Marij, Privacy, SexDis), plotted for men, women, and combined]
Readability (cont'd)
• reading ease of conversational genres high
• reading ease of non-conversational genres low
[Figure: grade level (axis range 0 to 12) by topic (Church, Gay, Iraq, Marijuana, Privacy, Sex Disc), one series per genre (Email, Essay, Interview, Blog, Chat, Disc)]
Future Possibilities
• additional features for gender identification and authorship
• sentence complexity
• cohesion of text
• feature change across time within a topic
• classification by topic order
• classification by genre
• conversational dynamics in chat vs.
discussion
Thank you.
Questions?
www.cs.loyola.edu/~res