CULTUROMICS:
EXPERIMENTING ON GOOGLE
BOOKS WITH N-GRAM VIEWER
Susan Haynes
Eastern Michigan University
MASAL 2013
Definition of culturomics
• Computational analysis of cultural and social trends using
quantitative analysis of digitized text.
• First coined in 2010 by the researchers who helped create
Google’s Ngram Viewer.
• Example research:
• Detecting censorship and suppression (e.g., frequency trends of
“Marc Chagall”)
• Trends in fame
• Trends in interest in “current events” (see following slide)
• Using the N-gram Viewer to attempt a physics-motivated statistical model of
fluctuations in English word usage, from the “birth” to the “death” of words
• Mining news archives to identify cultural events (Arab Spring 2011)
“FAME”
The Data
• Google books project:
• As of March 2012, more than 20 million scanned and digitized books
• Dates range back to the 1500s. (Before 1700, approximately 0.5 million English books were published)
• Access to text:
• If in the public domain, the complete text is viewable
• If out of copyright, or if the copyright owner grants permission, the viewer may “preview” a subset of pages
• If the copyright owner does not grant “preview” permission, the owner may still permit “snippets” (2 or 3
lines)
• Unknown owner – “snippets”
• Books with full view or “preview” (including snippets) permission are searchable, though they may
not be viewable.
• Books with neither full view nor “preview” permission are not searchable; Google Books gives only
the book title.
• NB, one opinion: “[…] Google Books searches are an unreliable indicator of the prevalence of specific usages or terms, because many authoritative works fall into the unsearchable category” [http://en.wikipedia.org/wiki/Google_books]
• OCR
• OCR is done very fast (1,000 pages/hour), and there are many errors, including upside-down pages, interleaved
books, and wrong characters. User feedback on errors has been curtailed due to the volume of data.
• The 2012 data are much improved over the previous set of data.
N-grams
• An n-gram is a sequence of n items from the books.
• Sizes 1, 2, and 3 are ‘unigram’, ‘bigram’, and ‘trigram’ (respectively)
• Larger sizes use the value of n: 4-gram, 5-gram, …
• In Google Books, an “item” is a sequence of alphanumerics.
• In computational linguistics, an n-gram can comprise items from
any fixed set: words, syllables, letters, phonemes, …
• In other areas where probabilistic methods are used to
predict the next item in a sequence, an item could be,
e.g., an amino acid (proteomics) or a DNA base pair
(genomics).
• A common model for predicting the next item in an n-gram
is a Markov model of order n-1. (See next slide for the “math”)
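The definition above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the sample tokens are my own, not part of any Google tool:

```python
# Sketch: extracting n-grams from a token sequence, where an "item"
# is a word (as in Google Books).

def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat is pretty".split()
print(ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ('is',), ('pretty',)]
print(ngrams(tokens, 2))  # bigrams: [('the', 'cat'), ('cat', 'is'), ('is', 'pretty')]
```

The same function works for any fixed item set (letters, phonemes, DNA base pairs): only the contents of `tokens` change.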
Markov chain order m:
(m is the “amount of memory” needed to predict
the next state)
A Markov chain of order m is a stochastic process satisfying:

P(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, …, X1 = x1)
  = P(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, …, Xn-m = xn-m), for n > m

(Xi is a random variable; xi is a value)

That is, the state at index n depends only on the values of the previous m states.

Example (order m = 3): The cat is _____
X1 = “The”, X2 = “cat”, X3 = “is”, X4 = ?
P(X4 = “pretty” | X3 = “is”, X2 = “cat”, X1 = “The”)
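A minimal sketch of an order-1 (bigram) Markov predictor, assuming whitespace-tokenized text and a maximum-likelihood estimate from raw counts; all names and the toy corpus are illustrative:

```python
# Sketch: an order-1 Markov model over words, estimated from bigram
# counts: P(next | current) = count(current, next) / count(current, *).
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for each word, how often each word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, word):
    """Most likely next word after `word` (None if `word` was never seen)."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

tokens = "the cat is pretty and the cat is soft".split()
model = train_bigram_model(tokens)
print(predict(model, "cat"))  # "is" -- the only word ever seen after "cat"
```

An order-m model would condition on the previous m words instead of one, i.e. the keys of `counts` would be (m)-tuples.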
Google Books Ngram viewer
The data --- k-gram: k=1 … 5
• Each corpus is divided into several files to make data downloads more manageable. For example, in the English 20120701
corpus, here is the list of 3-gram files: 0 1 2 3 4 5 6 7 8 9 _ADJ_ _ADP_ _ADV_ _CONJ_ _DET_ _NOUN_ _NUM_
_PRON_ _PRT_ _VERB_ a_ aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az b_
ba bb bc bd be bf bg bh bi bj bk bl bm bn bo bp bq br bs bt bu bv bw bx by bz c_ ca cb cc cd ce cf cg
ch ci cj ck cl cm cn co cp cq cr cs ct cu cv cw cx cy cz d_ da db dc dd de df dg dh di dj dk dl dm dn
do dp dq dr ds dt du dv dw dx dy dz e_ ea eb ec ed ee ef eg eh ei ej ek el em en eo ep eq er es et eu
ev ew ex ey ez f_ fa fb fc fd fe ff fg fh fi fj fk fl fm fn fo fp fq fr fs ft fu fv fw fx fy fz g_ ga
gb gc gd ge gf gg gh gi gj gk gl gm gn go gp gq gr gs gt gu gv gw gx gy gz h_ ha hb hc hd he hf hg hh
hi hj hk hl hm hn ho hp hq hr hs ht hu hv hw hx hy hz i_ ia ib ic id ie if ig ih ii ij ik il im in io
ip iq ir is it iu iv iw ix iy iz j_ ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju jv
jw jx jy jz k_ ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz l_ la lb
lc ld le lf lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz m_ ma mb mc md me mf mg mh mi
mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz n_ na nb nc nd ne nf ng nh ni nj nk nl nm nn no np
nq nr ns nt nu nv nw nx ny nz o_ oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot other ou
ov ow ox oy oz p_ pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt pu punctuation pv pw px
py pz q_ qa qb qc qd qe qf qg qh qi qj qk ql qm qn qo qp qq qr qs qt qu qv qw qx qy qz r_ ra rb rc rd
re rf rg rh ri rj rk rl rm rn ro rp rq rr rs rt ru rv rw rx ry rz s_ sa sb sc sd se sf sg sh si sj sk
sl sm sn so sp sq sr ss st su sv sw sx sy sz t_ ta tb tc td te tf tg th ti tj tk tl tm tn to tp tq tr
ts tt tu tv tw tx ty tz u_ ua ub uc ud ue uf ug uh ui uj uk ul um un uo up uq ur us ut uu uv uw ux uy
uz v_ va vb vc vd ve vf vg vh vi vj vk vl vm vn vo vp vq vr vs vt vu vv vw vx vy vz w_ wa wb wc wd we
wf wg wh wi wj wk wl wm wn wo wp wq wr ws wt wu wv ww wx wy wz x_ xa xb xc xd xe xf xg xh xi xj xk xl
xm xn xo xp xq xr xs xt xu xv xw xx xy xz y_ ya yb yc yd ye yf yg yh yi yj yk yl ym yn yo yp yq yr ys
yt yu yv yw yx yy yz z_ za zb zc zd ze zf zg zh zi zj zk zl zm zn zo zp zq zr zs zt zu zv zw zx zy zz
The ngram file format is
ngram TAB year TAB match_count TAB volume_count
• Each file is TSV (tab-separated values). Two sample (contiguous) lines in the y_ file (2-grams) are:

Y2O3 have	1972	1	1
Y2O3 have	1974	1	1

• The first line means the bigram “Y2O3 have” appeared in 1972 one time, in one volume.
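A sketch of parsing rows in this format, using the two sample lines above; the function name is my own:

```python
# Sketch: reading rows in the documented format
#   ngram TAB year TAB match_count TAB volume_count
# The sample rows are the two lines quoted from the y_ file above.
import csv
import io

SAMPLE = "Y2O3 have\t1972\t1\t1\nY2O3 have\t1974\t1\t1\n"

def parse_ngram_rows(text):
    """Return (ngram, year, match_count, volume_count) tuples."""
    rows = []
    for ngram, year, match_count, volume_count in csv.reader(
            io.StringIO(text), delimiter="\t"):
        rows.append((ngram, int(year), int(match_count), int(volume_count)))
    return rows

for row in parse_ngram_rows(SAMPLE):
    print(row)  # e.g. ('Y2O3 have', 1972, 1, 1)
```

In practice one would stream the (large, often compressed) data files line by line rather than hold them in memory.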
Simple example of n-gram viewer
• Computer science, computer technology, information
technology
• Corpus, start date, end date, smoothing
Areas of computer science
Simple example 3
Social impact of computing “digital divide”
Some n-gram viewer operations
• Some arithmetic operations: +, -, /, ( ) , *
• ( and/or ) retrieves the frequency of the phrase “and/or”, while and/or is read as arithmetic and returns #and / #or
• * is useful for comparing trends that have very different frequencies
• Choose a corpus :
• faux:eng_2012, faux:fre_2012
will show frequency trends for “faux” in the English 2012 corpus and the
French 2012 corpus.
• Part of speech tagging
• _NOUN_, _VERB_, …
• E.g., race_NOUN, race_VERB
• Dependencies => (see next slide for example)
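The arithmetic composition can be illustrated locally on toy frequency series. The numbers below are invented; only the operator semantics follow the slide (e.g., the unparenthesized query and/or divides the frequency of “and” by that of “or”):

```python
# Sketch: the Viewer's arithmetic operators applied year-by-year to two
# toy {year: relative frequency} series. The values are made up.

def combine(series_a, series_b, op):
    """Apply op year-by-year over the years the two series share."""
    return {y: op(series_a[y], series_b[y])
            for y in series_a.keys() & series_b.keys()}

f_and = {2000: 0.030, 2001: 0.031}
f_or = {2000: 0.010, 2001: 0.010}

# "and/or" (no parentheses) is read as division: #and / #or
ratio = combine(f_and, f_or, lambda a, b: a / b)
# "and + or" sums the two series
total = combine(f_and, f_or, lambda a, b: a + b)
print(ratio[2000])  # 3.0
```

Multiplying a low-frequency series by a constant (the * operator) rescales it the same way, which is what makes very different trends comparable on one plot.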
Example: how often “computer” modifies
“crime” versus “computer crime”
Non-computer technology trends(?)
Example: computer terminology
Cultural trends in impact of computers(?)
Y2K, identity theft
Using the Google books ngram corpora
for data-mining
• Inventors of the Ngram Viewer state that it is just another,
albeit quantitative, tool for research in culture, linguistics,
etc.
• From my own perspective, weaknesses include:
• The books that are used to build the n-grams are not all accessible
– this makes it impossible to do clustering or other data-mining
techniques on those inaccessible data.
• There is no pointer back to the books an n-gram was obtained from
(just number of hits and number of volumes) – this makes it
impossible to use the Ngram Viewer to identify specific books.
• Can’t really do phrase “discovery” in a particular domain (e.g.,
computing) in the n-gram files – you have to know the phrase ahead of
time (e.g., researchers used “Marc Chagall” to identify censorship;
they did not use “censorship” to discover that Marc Chagall was censored)
Problems
• Books, in fast-moving areas, do not make a good medium
for analysis. For computer-related topics, the web itself
may be better.
• However, web data is unmanageable given its relative (to books) lack of
context. The ngram data sets may be useful for crude identification
of trends over time.
• Trends over time only, not over space or other possible
dimensions (….)
• NOT A PROBLEM: there are a Python utility and a
Windows utility that will retrieve the data the Ngram
Viewer uses to construct its graphs.
References
• Quantitative Analysis of Culture Using Millions of Digitized Books. Jean-Baptiste Michel*, Yuan Kui
Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett,
Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman
Aiden*. Science 331 (2011) [http://www.sciencemag.org/content/331/6014/176.full.pdf]
• http://www.culturomics.org/
• Books.google.com
• Books.google.com/ngrams
• J. Bohannon, Digital Data. Google Books, Wikipedia and the future of culturomics, Science, January 14,
2011 [http://www.sciencemag.org/content/331/6014/135.long]
• Brian Hayes, “Bit Lit”, American Scientist, 99(3), May-June 2011
• Letcher, David W. (April 6, 2011). “Culturomics: A New Way to See Temporal Changes in the Prevalence of
Words and Phrases”. American Institute of Higher Education 6th International Conference Proceedings 4(1):
228 [http://www.amhighed.com/documents/charleston2011/AIHE2011_Proceedings.pdf#page=228]
• Ben Zimmer, “When physicists do linguistics”, Boston Globe, February 10,
2013.[http://bostonglobe.com/ideas/2013/02/10/when-physicistslinguistics/ZoHNxhE6uunmM7976nWsRP/story.html]
• http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3663/3040