Parallelism for LDA
Yang Ruan, Changsi An
(yangruan@indiana.edu, anch@indiana.edu)
1. Overview
As parallelism is very important for large-scale data, we want to use different technologies to parallelize the Latent Dirichlet Allocation algorithm. Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups, which explains why some parts of the data are similar. There are two different implementations of this model: one is Expectation Maximization (the EM method) and the other is Gibbs sampling. Both use iterative algorithms to do the calculation; however, Gibbs sampling is easier to implement in sequential code but harder to parallelize, while the EM method is easier to parallelize, as (Nallapati, Cohen et al.) have already used MapReduce technology to parallelize it. This time we are going to parallelize the Gibbs sampling implementation using iterative MapReduce and a message passing interface.
2. Algorithm

2.1 Guideline of Sequential Algorithm with LDA

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words (Blei, Ng et al. 2003). To clarify this model, assume we have the following conventions: a corpus D = {d1, d2, ..., dm}, where each document di = {w1, w2, ..., wni} is a sequence of words. In contrast to words, we have a vocabulary V = {v1, v2, ..., vl}, which is a non-duplicated projection of all words. Also we assume there exist K topics T = {t1, t2, ..., tk}. Apart from the above notation, we also have two variable parameters: α, the parameter of the Dirichlet distribution used to generate the topic distribution, and β, a conditional probability matrix which records the probability of a word given a specific topic.

The algorithm with the LDA model has two goals. One goal is to generate a topic distribution and an associated word list for each topic, and to give out the probability of the generated case; that probability can be calculated by multiplying the joint probabilities of the events that occur in the sampling process. The other goal is to update the parameters α and β during the generation process.

The algorithm can be described in the following steps: first choose θ ∼ Dir(α). Then for each word wi of document di, sample a topic Tn following a multinomial distribution with parameter θ; after that, sample a word wi' according to p(wi' | Tn, β). Finally, if wi' is not wi, we update β by augmenting the probability of word wi' for topic Tn and diminishing that of word wi.
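To make these steps concrete, here is a minimal sketch of the per-document generative loop in plain Java. The class and method names are ours for illustration, theta is assumed to be already drawn from Dir(α), and beta is the K x V word-given-topic matrix; this is a sketch, not code from our implementation.

import java.util.Random;

// Sketch of the generative steps above: for each word position,
// draw a topic from theta, then draw a word from that topic's row of beta.
public class LdaGenerativeSketch {
    private static final Random RNG = new Random();

    // Draw an index from a discrete distribution p (entries sum to 1).
    static int sampleDiscrete(double[] p) {
        double u = RNG.nextDouble();
        double cum = 0.0;
        for (int i = 0; i < p.length; i++) {
            cum += p[i];
            if (u < cum) return i;
        }
        return p.length - 1; // guard against floating-point rounding
    }

    // Generate nWords word ids for one document:
    // Tn ~ Multinomial(theta), then wi' ~ p(w | Tn, beta).
    static int[] generateDocument(double[] theta, double[][] beta, int nWords) {
        int[] words = new int[nWords];
        for (int n = 0; n < nWords; n++) {
            int topic = sampleDiscrete(theta);      // Tn ~ Multinomial(theta)
            words[n] = sampleDiscrete(beta[topic]); // wi' ~ p(w | Tn, beta)
        }
        return words;
    }
}

The same discrete-sampling helper is what a Gibbs sampler uses to draw a topic from its conditional distribution at each step.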
2.2 ACT model

The Author Conference Topic (ACT) model is an extension model based on LDA (Tang, Zhang et al. 2008). Because this model adds conference and author as two new variables, the size of the computation increases from two 2-D matrices to three 2-D matrices: we will have topic-author, word-topic and conference-topic matrices to do the Gibbs sampling. The sampled results (matrices containing the frequencies) can then be used to compute the final probability results:
- θ is the probability of a topic given an author;
- φ is the probability of a word given a topic;
- ψ is the probability of a conference given a topic.
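As a sketch of how a sampled frequency matrix turns into one of these probabilities, the normalization below computes θ from the topic-author count matrix; the matrix layout, the names, and the symmetric smoothing constant are illustrative assumptions, and φ and ψ follow the same pattern over the word-topic and conference-topic counts.

// Sketch: normalize topic-author counts into theta = p(topic | author).
// authorTopicCounts[a][t] holds how often topic t was sampled for author a.
public class ActProbabilities {
    static double[][] topicGivenAuthor(int[][] authorTopicCounts, double alpha) {
        int numAuthors = authorTopicCounts.length;
        int numTopics = authorTopicCounts[0].length;
        double[][] theta = new double[numAuthors][numTopics];
        for (int a = 0; a < numAuthors; a++) {
            double total = 0.0;
            for (int t = 0; t < numTopics; t++) {
                total += authorTopicCounts[a][t] + alpha; // alpha smooths zero counts
            }
            for (int t = 0; t < numTopics; t++) {
                theta[a][t] = (authorTopicCounts[a][t] + alpha) / total;
            }
        }
        return theta;
    }
}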
3. Technology

This time we are going to use an iterative MapReduce framework (Twister) and a message passing interface (MPI). The reason for choosing Twister is that the algorithm follows an iterative pattern, so it will be much faster to parallelize under Twister, which is specially designed for such patterns, than under a normal MapReduce framework like Hadoop; this has already been demonstrated by Jaliya Ekanayake et al. MPI, in turn, has been a standard parallelism method for years; by using it we have a control group, allowing a comparison between the two technologies.
Twister is an enhanced MapReduce runtime with an extended programming model that supports iterative MapReduce computations efficiently (Ekanayake, Li et al. 2010). To use that feature, the iterative part of the algorithm needs to be done within the program's control of the iteration.

[Figure: parallelism of the program]

There are two main parts to this program. The first part is the Gibbs sampling, which is the iterative part of the program; the second, non-iterative part uses the three 2-D matrices to calculate the final probability distribution. In the first part, each mapper holds three split-up 2-D matrices in its constant memory, generated directly from its document chunk. In each iteration, the reducer broadcasts the merged matrices back onto each mapper. We call this method the Approximate Distributed Author Conference Topic model (AD-ACT).

[Figure: speed-up test of 8k documents running on PolarGrid at Indiana University, using 1 to 20 mappers]
By using this pattern, we can save memory space on each mapper in exchange for the time spent sending the 1-D matrix to each node.
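The per-iteration flow can be sketched in plain Java as follows. The single-process loop and all names here are hypothetical stand-ins for the Twister map (local resampling), reduce (merge), and broadcast stages, not our actual Twister code.

// Sketch of the AD-ACT iteration pattern: every "mapper" resamples its
// chunk against the same stale broadcast counts, the per-mapper counts
// are merged (reduce), and the merged counts become the next broadcast.
public class AdActPatternSketch {
    interface ChunkSampler {
        // Resample one document chunk against the broadcast counts and
        // return that chunk's local counts, flattened to the 1-D matrix
        // that is exchanged between nodes.
        long[] sample(int chunkId, long[] broadcastCounts);
    }

    static long[] runIterations(int mappers, long[] counts,
                                ChunkSampler sampler, int iterations) {
        for (int iter = 0; iter < iterations; iter++) {
            long[] merged = new long[counts.length];
            for (int m = 0; m < mappers; m++) {
                long[] local = sampler.sample(m, counts); // map stage
                for (int i = 0; i < merged.length; i++) {
                    merged[i] += local[i];                // reduce stage
                }
            }
            counts = merged; // broadcast for the next iteration
        }
        return counts;
    }
}

Because every mapper samples against the stale counts from the previous iteration rather than the live ones, the result is only an approximation of sequential Gibbs sampling, which is what the "Approximate" in AD-ACT refers to.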
MPJ Express is a Java implementation of MPI. It is capable of running on distributed machines and multicore processors. Like Twister, it is daemon-controlled and comes with a command line launcher. Here is a figure and some of its API:

[Figure: MPJ Express]
- P2P Communication: Comm.Send(), Comm.Recv(), Comm.Isend(), Comm.Irecv()
- Collective Communication: Intracomm.Bcast(), Intracomm.Gather()
- Process Management: intra-communication (within the same Group): Intracomm; inter-communication (server-client, Cartesian topology): Cartcomm
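A minimal MPJ Express example of the collective call listed above might look like the following; the class name and the toy payload are ours, with rank 0 broadcasting a small array that stands in for a merged 1-D matrix.

import mpi.Intracomm;
import mpi.MPI;

// Rank 0 fills a small array and broadcasts it; every rank ends up
// with the same copy, mirroring how merged matrices reach each node.
public class BcastSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        Intracomm comm = MPI.COMM_WORLD;
        int rank = comm.Rank();

        int[] matrix = new int[8]; // stands in for a merged 1-D matrix
        if (rank == 0) {
            for (int i = 0; i < matrix.length; i++) matrix[i] = i;
        }
        // Collective communication: root 0 broadcasts to all ranks.
        comm.Bcast(matrix, 0, matrix.length, MPI.INT, 0);

        System.out.println("rank " + rank + " received " + matrix[matrix.length - 1]);
        MPI.Finalize();
    }
}

It would be compiled against the MPJ Express libraries and launched with the command line launcher, e.g. mpjrun.sh -np 4 BcastSketch.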
4. Conclusion
The performance of Twister-ACT is not as good as expected. While implementing this parallel program, we found that every time we sample a single element inside the matrices, there is some influence on the following sampling, so the results of parallel computing on the ACT Gibbs sampling can never be the same as the sequential ones. Basically, to minimize this problem, we have to do iterative MapReduce and keep each block of documents in each mapper's memory cache. Still, we have a large quantity of messages to pass, since every mapper requires the full sampling matrices. We can, however, solve some of the memory problem by splitting up the documents, so each mapper only gets a partition of the full document set.

References:

Blei, D. M., A. Y. Ng, et al. (2003). "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993-1022.

Ekanayake, J., H. Li, et al. (2010). Twister: A Runtime for Iterative MapReduce. Proceedings of the First International Workshop on MapReduce and its Applications, ACM HPDC 2010, June 20-25, 2010, Chicago, Illinois, ACM.

Nallapati, R., W. Cohen, et al. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. Pittsburgh, USA, Carnegie Mellon University.

Tang, J., J. Zhang, et al. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. KDD'08, August 24-27, 2008, Las Vegas, Nevada, USA.