
Parallelism for LDA

Yang Ruan, Changsi An
([email protected], [email protected]; @indiana.edu)

1. Overview

Because parallelism is very important for large-scale data, we want to use different technologies to parallelize the latent Dirichlet allocation algorithm. Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar.

There are two different implementations of this model: one is Expectation Maximization (the EM method) and the other is Gibbs sampling. Both use an iterative algorithm to do the calculation. However, Gibbs sampling is easier to implement in sequential code but harder to parallelize, while the EM method is easier to parallelize: Nallapati, Cohen et al. have already used MapReduce to parallelize it. This time we parallelize the Gibbs sampling implementation, using iterative MapReduce and the message passing interface.

2. Algorithm

2.1 Guideline of Sequential Algorithm with LDA

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words (Blei, Ng et al. 2003). The probability of a generated case can be calculated by multiplying the joint probabilities of the events that occurred in the sampling process.

To clarify this model, assume the following conventions. A corpus D = {d1, d2, …, dm}. Each document di = {w1, w2, …, wni} is a sequence of words. In contrast to words, we have a vocabulary V = {v1, v2, …, vl}, which is a non-duplicated projection of all words. We also assume there exist K topics T = {t1, t2, …, tK}.
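The notation above can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the helper name build_vocabulary and the value of K are assumptions made for the example and do not come from the original work.

```python
# Toy corpus D: each document d_i is a sequence of words (made-up data).
corpus = [
    ["topic", "model", "topic", "word"],
    ["word", "author", "conference"],
]


def build_vocabulary(docs):
    """Return V: every distinct word of the corpus, in first-seen order."""
    seen = {}
    for doc in docs:
        for word in doc:
            # setdefault only stores the word the first time it appears
            seen.setdefault(word, len(seen))
    return list(seen)


vocabulary = build_vocabulary(corpus)            # V = {v1, ..., vl}
word_id = {w: i for i, w in enumerate(vocabulary)}

# Documents rewritten as indices into V, the usual input of a sampler.
encoded = [[word_id[w] for w in doc] for doc in corpus]

K = 3                                            # number of topics (assumed)
topics = list(range(K))                          # T = {t1, ..., tK}
```

Encoding words as integer ids keeps the later count matrices plain 2-D arrays indexed by word id and topic id.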
Apart from the above notation, we also have two variable parameters, written here as α and β as in Blei, Ng et al. (2003): α is the parameter of the Dirichlet distribution that generates the topic distribution, and β is a conditional probability matrix that records the probability of a word given a specific topic.

The algorithm with the LDA model has two goals. One goal is to generate a topic distribution and an associated word list for each topic, and to give the probability of the generated case. The other goal is to update the parameters α and β during the generation process.

The algorithm can be described in the following steps. First choose θ ∼ Dir(α). Then, for each word wi of document di, sample a topic Tn from a multinomial distribution with parameter θ, and then sample a word wi' according to p(wi' | Tn, β). If wi' is not wi, update β by increasing the probability of word wi' under topic Tn and decreasing that of word wi.

2.2 ACT model

The Author Conference Topic (ACT) model is an extension of LDA (Tang, Zhang et al. 2008). Because this model adds conference and author as two new variables, the size of the computation increases from two 2-D matrices to three 2-D matrices: we have topic-author, word-topic and conference-topic matrices for the Gibbs sampling. The sampled result (matrices containing the frequencies) can be used to compute the final probabilities: P(topic | author), the probability of a topic given an author; P(word | topic), the probability of a word given a topic; and P(conference | topic), the probability of a conference given a topic.

3. Technology

This time we use an iterative MapReduce framework (Twister) and the message passing interface (MPI).
The reason for choosing Twister is that the algorithm follows an iterative pattern, so it is much faster to parallelize under Twister, which is specially designed for iteration, than under a normal MapReduce framework like Hadoop. This has already been demonstrated by Jaliya et al. MPI has been a standard parallelism method for years. By using it we get a control group, so that the two technologies can be compared.

Twister is an enhanced MapReduce runtime with an extended programming model that supports iterative MapReduce computations efficiently (J.Ekanayake, H.Li et al. 2010). To use that feature, the iterative part of the algorithm needs to be done within the program control of the iteration.

[Figure: parallelism of the program]

There are two main parts of this program. The first part is the Gibbs sampling, which is the iterative part of the program. The second, non-iterative part uses the three 2-D matrices to calculate the final probability distribution. In the first part, each mapper holds three split 2-D matrices in its constant memory, generated directly from its document chunk. In each iteration, the reducer broadcasts the merged matrices to each mapper. We call this method the Approximate Distributed Author Conference Topic model (AD-ACT). By using this pattern, we save memory space on each mapper in exchange for the time spent sending the 1-D matrix to each node.

[Figure: performance chart. Speed-up test of 8k documents running on PolarGrid at Indiana University using 1 to 20 mappers]

MPJ Express is a Java implementation of MPI. It is capable of running on distributed machines and multicore processors. Like Twister, it is controlled through daemons and started with a command-line launcher.
Here is a figure and some of the MPJ Express API:

[Figure: MPJ Express]

P2P Communication
    Comm.Send(), Comm.Recv()
    Comm.Isend(), Comm.Irecv()
Collective Communication
    Intracomm.Bcast()
    Intracomm.Gather()
Process Management
    Intra-communication (within the same group): Intracomm
    Inter-communication (server-client, Cartesian topology): Cartcomm

References:

Blei, D. M., A. Y. Ng, et al. (2003). "Latent Dirichlet allocation." Journal of Machine Learning Research 3: 993-1022.

J.Ekanayake, H.Li, et al. (2010). Twister: A Runtime for Iterative MapReduce. Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, June 20-25, 2010. Chicago, Illinois, ACM.

Nallapati, R., W. Cohen, et al. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. Pittsburgh, USA, Carnegie Mellon University.

Tang, J., J. Zhang, et al. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. KDD'08, August 24–27, 2008, Las Vegas, Nevada, USA.

4. Conclusion

The performance of Twister-ACT is not as good as expected. While implementing this parallel program, we found that every time we sample a single element inside the matrices, it influences the sampling that follows. So the results can never be the same as the sequential ones if we do parallel computing on the ACT Gibbs sampling. To minimize this problem, we have to use iterative MapReduce and keep each block of documents in each mapper's memory cache. We still have a large quantity of messages to pass, since every mapper requires the full sampling matrices. Even so, we can solve some of the memory problem by splitting up the documents, so that each mapper only gets a partition of the full document set.
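The approximation discussed above can be illustrated with a small single-process Python sketch. It is not the Twister or MPJ Express implementation. It uses plain LDA with a single word-topic count matrix instead of the three ACT matrices, runs the "mappers" one after another in a loop, and every name and hyperparameter value (map_sweep, reduce_merge, doc_prior, word_prior) is an assumption made for the example. Each mapper samples its own document partition against a private copy of the counts from the previous iteration, and the reducer adds up the count changes before the next iteration, which is why the result differs from a strictly sequential sampler.

```python
import random


def init_assignments(docs, num_topics, rng):
    """Give every token of every document a random starting topic."""
    return [[rng.randrange(num_topics) for _ in doc] for doc in docs]


def count_word_topic(docs, assignments, vocab_size, num_topics):
    """Build the word-topic frequency matrix for one partition."""
    counts = [[0] * num_topics for _ in range(vocab_size)]
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            counts[w][z] += 1
    return counts


def map_sweep(docs, assignments, global_wt, num_topics, doc_prior, word_prior, rng):
    """One Gibbs sweep over a mapper's partition; returns its count delta."""
    vocab_size = len(global_wt)
    local_wt = [row[:] for row in global_wt]   # private, possibly stale copy
    topic_total = [sum(row[k] for row in local_wt) for k in range(num_topics)]
    for doc, zs in zip(docs, assignments):
        doc_topic = [0] * num_topics
        for z in zs:
            doc_topic[z] += 1
        for i, w in enumerate(doc):
            old = zs[i]
            # remove the token, then resample its topic from the conditional
            local_wt[w][old] -= 1
            topic_total[old] -= 1
            doc_topic[old] -= 1
            weights = [
                (doc_topic[k] + doc_prior) * (local_wt[w][k] + word_prior)
                / (topic_total[k] + vocab_size * word_prior)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            zs[i] = new
            local_wt[w][new] += 1
            topic_total[new] += 1
            doc_topic[new] += 1
    return [[local_wt[w][k] - global_wt[w][k] for k in range(num_topics)]
            for w in range(vocab_size)]


def reduce_merge(global_wt, deltas):
    """Add every mapper's delta into the shared matrix (the 'broadcast' copy)."""
    for delta in deltas:
        for w, row in enumerate(delta):
            for k, change in enumerate(row):
                global_wt[w][k] += change
    return global_wt


def run(docs, vocab_size, num_topics=2, num_mappers=2, iterations=5, seed=0):
    rng = random.Random(seed)
    parts = [docs[i::num_mappers] for i in range(num_mappers)]  # split documents
    assigns = [init_assignments(p, num_topics, rng) for p in parts]
    global_wt = [[0] * num_topics for _ in range(vocab_size)]
    for p, a in zip(parts, assigns):
        reduce_merge(global_wt, [count_word_topic(p, a, vocab_size, num_topics)])
    for _ in range(iterations):
        # every mapper sees the same matrix from the previous iteration
        deltas = [map_sweep(p, a, global_wt, num_topics, 0.1, 0.01, rng)
                  for p, a in zip(parts, assigns)]
        reduce_merge(global_wt, deltas)
    return global_wt, parts, assigns
```

A call such as run([[0, 1, 2, 0], [2, 3, 3], [1, 1, 4], [4, 0, 2, 3]], vocab_size=5) returns the merged matrix together with the partitions and their topic assignments. Exchanging only count deltas keeps the merged matrix consistent with all current assignments, but each mapper still needs the full matrix, which matches the message-volume problem noted in the conclusion.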