NLP in Scala with Breeze and Epic
David Hall, UC Berkeley

ScalaNLP Ecosystem
• Breeze ≈ Numpy/Scipy: linear algebra, scientific computing, optimization
• Epic ≈ PyStruct/NLTK: natural language processing, structured prediction
• Puck: a super-fast GPU parser for English

Natural Language Processing
[Figure: constituency parse tree (S, NP, VP, PP, ...) over "Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap. It certainly won't get there on looks."]

Epic for NLP
[Figure: the same sentence, parsed by Epic]

Named Entity Recognition
• Person
• Organization
• Location
• Misc

NER with Epic
> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
> println(nerModel.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [LOC:Calvin & Hobbes] . ''

But "Calvin & Hobbes" is not a location! What if we annotate a bunch of data ourselves?

Building an NER system
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] . ''

Gazetteers
http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F

Using your own gazetteer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel,
                                   gaz = myGazetteer)

Gazetteers, cont'd
• Careful with gazetteers!
• If the gazetteer is built from the training data, the system will use it, and only it, to make predictions.
• So only known forms will be detected.
• Still, they can be very useful.

Semi-CR-What?
• Semi-Markov Conditional Random Field
• Don't worry about the name.

Semi-CRFs
A semi-CRF scores a segmentation as a sum of per-segment scores. For a segmented restaurant menu:

  score(Chez Panisse)
+ score(Berkeley, CA)
+ score(- A bowl of)
+ score(Churchill-Brenneis Orchards)
+ score(Page mandarins and medjool dates)

Features
Each segment's score is in turn a sum of weights for its features:

score(Chez Panisse) = w(starts-with-Chez)
                    + w(starts-with-C…)
                    + w(ends-with-P…)
                    + w(starts-sentence)
                    + w(shape:Xxx Xxx)
                    + w(two-words)
                    + w(in-gazetteer)

Building your own features
val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._

  word(begin)          // word at the beginning of the span
+ word(end - 1)        // word at the end of the span
+ word(begin - 1)      // word before the span (gets things like Mr.)
+ word(end)            // word one past the end
+ prefixes(begin)      // prefixes up to some length
+ suffixes(begin)      // suffixes up to some length
+ length(begin, end)   // names tend to be 1-3 words
+ gazetteer(begin, end)

Using your own featurizer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel,
                                   featurizer = myFeaturizer)
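Aside: scoring a segmentation by hand
To make the decomposition concrete, here is a minimal sketch of how a semi-CRF scores a candidate segmentation: featurize each segment on its own, then sum the per-segment scores. This is illustrative Scala only, not Epic's API; Segment, segmentFeatures, and the Map-based weights are all made up for the example.

case class Segment(label: String, begin: Int, end: Int)

// A few features in the spirit of the slides above: first word, segment
// length, and word shape, each conjoined with the segment's label.
def segmentFeatures(words: IndexedSeq[String], seg: Segment): Seq[String] = Seq(
  s"${seg.label}:first-word=${words(seg.begin)}",
  s"${seg.label}:length=${seg.end - seg.begin}",
  s"${seg.label}:shape=" + (seg.begin until seg.end)
    .map(i => if (words(i).head.isUpper) "Xxx" else "xxx")
    .mkString(" ")
)

// The score of a segmentation decomposes into a sum over its segments;
// this decomposition is what makes dynamic programming possible later.
def score(words: IndexedSeq[String],
          segments: Seq[Segment],
          weights: Map[String, Double]): Double =
  segments.map { seg =>
    segmentFeatures(words, seg).map(f => weights.getOrElse(f, 0.0)).sum
  }.sum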
Features
• So far, we've been able to do everything with (nearly) no math.
• To understand more, we need to do some math.

Machine Learning Primer
• Training example (x, y)
  – x: a sentence of some sort
  – y: its labeled version
• Goal:
  – we want score(x, y) > score(x, y′) for all y′

Machine Learning Primer
score(x, y) = wᵀf(x, y)        // math
score(x, y) = w.t * f(x, y)    // Scala
score(x, y) = w dot f(x, y)    // Scala, equivalently

Machine Learning Primer
We want   score(x, y) >= score(x, y′)
that is   w dot f(x, y) >= w dot f(x, y′)

Machine Learning Primer
[Figure: the weight vector w, the feature vectors f(x, y) and f(x, y′), and the updated vector w + f(x, y)]
Adding f(x, y) to w increases the score of the right answer:
(w + f(x, y)) dot f(x, y) >= w dot f(x, y)

The Perceptron
> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = DenseVector.tabulate(labelIndex.size) { yy =>
      val features = featuresFor(x, yy)
      weights.t * new FeatureVector(features)
      // or: weights dot new FeatureVector(features)
    }
    ...
  }

The Perceptron (cont'd)
> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = ...
    val y_best = argmax(labelScores)
    if (y != y_best) {
      weights += new FeatureVector(featuresFor(x, y))
      weights -= new FeatureVector(featuresFor(x, y_best))
    }
  }

Structured Perceptron
• Can't enumerate all segmentations: there are O(L · 2ⁿ) of them!
• But dynamic programs exist to max or sum over them...
• ...if the feature function has a nice form.

Structured Perceptron
> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val y_best = bestStructure(weights, x)
    if (y != y_best) {
      weights += featuresFor(x, y)
      weights -= featuresFor(x, y_best)
    }
  }
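Aside: a runnable perceptron
For reference, here is a self-contained version of the loop above using Breeze vectors. It is a sketch of a plain multiclass perceptron, not Epic's trainer; this featuresFor is a stand-in feature function that conjoins a dense input vector with a label by block offset.

import breeze.linalg._

// Stand-in feature function: copy x into the block of the weight space
// reserved for label y, so each label gets its own copy of the features.
def featuresFor(x: DenseVector[Double], y: Int, numLabels: Int): DenseVector[Double] = {
  val f = DenseVector.zeros[Double](x.length * numLabels)
  f(y * x.length until (y + 1) * x.length) := x
  f
}

def train(data: IndexedSeq[(DenseVector[Double], Int)],
          numLabels: Int,
          numEpochs: Int = 10): DenseVector[Double] = {
  val weights = DenseVector.zeros[Double](data.head._1.length * numLabels)
  for (epoch <- 0 until numEpochs; (x, y) <- data) {
    // predict with the current weights
    val yBest = (0 until numLabels).maxBy { yy =>
      weights dot featuresFor(x, yy, numLabels)
    }
    // update only on mistakes, exactly as in the slides
    if (yBest != y) {
      weights += featuresFor(x, y, numLabels)
      weights -= featuresFor(x, yBest, numLabels)
    }
  }
  weights
}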
Constituency Parsing
[Figure: constituency parse tree over "Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap."]

Multilingual Parser
[Bar chart: parsing accuracy for the "Berkeley" parser vs. Epic across several languages, y-axis roughly 70-95]
"Berkeley": [Petrov & Klein, 2007]; Epic: [Hall, Durrett, and Klein, 2014]

Epic Pre-built Models
• Parsing
  – English, Basque, French, German, Swedish, Polish, Korean
  – (working on Arabic, Chinese, Spanish)
• Part-of-speech tagging
  – English, Basque, French, German, Swedish, Polish
• Named entity recognition
  – English
• Sentence segmentation
  – English
  – (OK support for others)
• Tokenization
  – the above languages

What Is Breeze?
• Dense vectors and matrices, sparse vectors, counters
• Matrix decompositions
• Nonlinear optimization
• Probability distributions

Getting started
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.8.1",
  // optional: native linear algebra libraries
  // bulky, but faster
  "org.scalanlp" %% "breeze-natives" % "0.8.1"
)

scalaVersion := "2.11.1"

Linear Algebra
> import breeze.linalg._
> val x = DenseVector.zeros[Int](5)  // DenseVector(0, 0, 0, 0, 0)
> val m = DenseMatrix.zeros[Int](5, 5)
> val r = DenseMatrix.rand(5, 5)
> m.t    // transpose
> x + x  // addition
> m * x  // multiplication by a vector
> m * 3  // by a scalar
> m * m  // by a matrix
> m :* m // element-wise multiplication, like Matlab's .*

Return Type Selection
> import breeze.linalg.{DenseVector => DV}
> val dv = DV(1.0, 2.0)
> val sv = SparseVector.zeros[Double](2)
> dv + sv
// res: DenseVector[Double] = DenseVector(1.0, 2.0)
> sv + sv
// res: SparseVector[Double] = SparseVector()

Return Type Selection, cont'd
> (dv: Vector[Double]) + (sv: Vector[Double])
// res: Vector[Double] = DenseVector(1.0, 2.0)
//   static type: Vector; dynamic type: Dense
> (sv: Vector[Double]) + (sv: Vector[Double])
// res: Vector[Double] = SparseVector()
//   static type: Vector; dynamic type: Sparse

Linear Algebra: Slices
> val m = DenseMatrix.zeros[Int](5, 5)
> m(::, 1)  // slice a column: DenseVector(0, 0, 0, 0, 0)
> m(4, ::)  // slice a row
> m(4, ::) := DV(1, 2, 3, 4, 5).t
> m.toString
0  0  0  0  0
0  0  0  0  0
0  0  0  0  0
0  0  0  0  0
1  2  3  4  5

Linear Algebra: Slices, cont'd
> m(0 to 1, 3 to 4).toString  // contiguous ranges
0  0
0  0
> m(IndexedSeq(3, 1, 4, 2), IndexedSeq(4, 4, 3, 1))  // arbitrary index sequences
0  0  0  0
0  0  0  0
5  5  4  2
0  0  0  0

Universal Functions
> import breeze.numerics._
> log(DenseVector(1.0, 2.0, 3.0, 4.0))
// DenseVector(0.0, 0.6931471805599453,
//             1.0986122886681098, 1.3862943611198906)
> exp(DenseMatrix((1.0, 2.0),
                  (3.0, 4.0)))
> sin(Array(2.0, 3.0, 4.0, 42.0))
// also cos, sqrt, asin, floor, round, digamma, trigamma, ...

UFuncs: In-Place
> import breeze.numerics._
> val v = DV(1.0, 2.0, 3.0, 4.0)
> log.inPlace(v)
// v == DenseVector(0.0, 0.6931471805599453,
//                  1.0986122886681098, 1.3862943611198906)

UFuncs: Reductions
> sum(DenseVector(1.0, 2.0, 3.0))   // 6.0
> sum(DenseVector(1, 2, 3))         // 6
> mean(DenseVector(1.0, 2.0, 3.0))  // 2.0
> mean(DenseMatrix((1.0, 2.0, 3.0),
                   (4.0, 5.0, 6.0))) // 3.5

UFuncs: Reductions, cont'd
> val m = DenseMatrix((1.0, 3.0),
                      (4.0, 4.0))
> sum(m)         // 12.0
> mean(m)        // 3.0
> sum(m(*, ::))  // DenseVector(4.0, 8.0): row sums
> sum(m(::, *))  // DenseMatrix((5.0, 7.0)): column sums
> mean(m(*, ::)) // DenseVector(2.0, 4.0)
> mean(m(::, *)) // DenseMatrix((2.5, 3.5))

Broadcasting
> val dm = DenseMatrix((1.0, 2.0, 3.0),
                       (4.0, 5.0, 6.0))
> sum(dm(::, *))  // DenseMatrix((5.0, 7.0, 9.0))
> dm(::, *) + DenseVector(3.0, 4.0)
// DenseMatrix((4.0, 5.0, 6.0),
//             (8.0, 9.0, 10.0))
> dm(::, *) := DenseVector(3.0, 4.0)
// DenseMatrix((3.0, 3.0, 3.0),
//             (4.0, 4.0, 4.0))
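Aside: broadcasting in practice
Broadcasting plus reductions covers a lot of everyday preprocessing. A small sketch, assuming a recent Breeze (the return types of broadcast reductions like mean(m(::, *)) have shifted slightly across versions): center each column of a matrix at zero.

import breeze.linalg._
import breeze.stats.mean

val data = DenseMatrix((1.0, 10.0),
                       (2.0, 20.0),
                       (3.0, 30.0))

val mu = mean(data(::, *)).t  // column means as a vector: DenseVector(2.0, 20.0)
data(*, ::) -= mu             // subtract each column's mean from that column
// data == DenseMatrix((-1.0, -10.0),
//                     ( 0.0,   0.0),
//                     ( 1.0,  10.0))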
UFuncs: Implementation
> object log extends UFunc
> implicit object logDouble extends log.Impl[Double, Double] {
    def apply(x: Double) = scala.math.log(x)
  }
> log(1.0)                // == 0.0
// elsewhere: magic to automatically make this work:
> log(DenseVector(1.0))   // == DenseVector(0.0)
> log(DenseMatrix(1.0))   // == DenseMatrix(0.0)
> log(SparseVector(1.0))  // == SparseVector() (the resulting 0.0 isn't stored)

UFuncs: Implementation, cont'd
> object add1 extends UFunc
> implicit object add1Double extends add1.Impl[Double, Double] {
    def apply(x: Double) = x + 1.0
  }
> add1(1.0)                // == 2.0
// elsewhere: magic to automatically make this work:
> add1(DenseVector(1.0))   // == DenseVector(2.0)
> add1(DenseMatrix(1.0))   // == DenseMatrix(2.0)
> add1(SparseVector(1.0))  // == SparseVector(2.0)

Breeze Benchmarks (runtime, lower is better)
• Singular Value Decomposition: Breeze 685.5, jblas 135.04, Mahout 2626, Apache Commons 2151
• Sparse/Dense Multiply: Breeze 26, jblas N/A, Mahout 2562, Apache Commons 56
• Sparse/Dense Addition: Breeze 0.037, jblas N/A, Mahout 25.376, Apache Commons 0.075

Nonnegative Matrix Factorization
• Input: a matrix V, with Vij >= 0
• Output: W and H with W * H ≈ V and Wij, Hij >= 0
• The loop below implements the multiplicative updates of [Lee and Seung, 2001]: H is replaced by H :* (W.t * V) :/ (W.t * W * H), and W by W :* (V * H.t) :/ (W * H * H.t), which keeps all entries nonnegative.

Breeze NMF
val n = V.rows
val m = V.cols
val r = dims
val W = DenseMatrix.rand(n, r)
val H = DenseMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * V) :/= (W.t * W * H))
  W :*= ((V * H.t) :/= (W * H * H.t))
}

(W, H)

From Breeze to Gust
The same NMF code (with V renamed to X), swapping DenseMatrix for CuMatrix:

Breeze NMF
val n = X.rows
val m = X.cols
val r = dims
val W = DenseMatrix.rand(n, r)
val H = DenseMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * X) :/= (W.t * W * H))
  W :*= ((X * H.t) :/= (W * H * H.t))
}

(W, H)

Gust NMF (≥6x speedup on my laptop!)
val n = X.rows
val m = X.cols
val r = dims
val W = CuMatrix.rand(n, r)
val H = CuMatrix.rand(r, m)

for (i <- 0 until iters) {
  H :*= ((W.t * X) :/= (W.t * W * H))
  W :*= ((X * H.t) :/= (W * H * H.t))
}

(W, H)

Optimization
Minimize Himmelblau's function: (x^2 + y - 11)^2 + (x + y^2 - 7)^2

Optimization
trait DiffFunction[T] extends (T => Double) {
  /** Calculates both the value and the gradient at a point */
  def calculate(x: T): (Double, T)
}

Optimization
val df = new DiffFunction[DV[Double]] {
  def calculate(values: DV[Double]) = {
    val gradient = DV.zeros[Double](2)
    val (x, y) = (values(0), values(1))
    val value = pow(x * x + y - 11, 2) + pow(x + y * y - 7, 2)
    gradient(0) = 4 * x * (x * x + y - 11) + 2 * (x + y * y - 7)
    gradient(1) = 2 * (x * x + y - 11) + 4 * y * (x + y * y - 7)
    (value, gradient)
  }
}

Optimization
breeze.optimize.minimize(df, DV(0.0, 0.0))
// DenseVector(3.0000000000012905, 1.999999999997128)

Optimization
• Fully configurable (see the sketch below)
• Support for:
  – LBFGS (+ L2/L1 regularization)
  – Stochastic gradient (+ L2/L1 regularization)
  – Truncated Newton
  – Linear programs
  – Conjugate gradient descent
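Aside: driving the optimizer explicitly
breeze.optimize.minimize picks reasonable defaults, but you can also construct a minimizer yourself. A sketch, reusing df from above; LBFGS and ApproximateGradientFunction are Breeze classes, while the parameter values here are just illustrative.

import breeze.linalg.DenseVector
import breeze.optimize._

// Run LBFGS directly with an explicit iteration budget and history size.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 5)
val optimum = lbfgs.minimize(df, DenseVector(0.0, 0.0))
// again lands near DenseVector(3.0, 2.0)

// If you would rather not derive gradients by hand, finite differences
// work too (slower, but handy for checking a hand-written gradient):
val approxDf = new ApproximateGradientFunction[Int, DenseVector[Double]](
  v => math.pow(v(0) * v(0) + v(1) - 11, 2) + math.pow(v(0) + v(1) * v(1) - 7, 2)
)
val optimum2 = lbfgs.minimize(approxDf, DenseVector(0.0, 0.0))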
Thanks!
Breeze, Epic, Puck
www.github.com/dlwh/{breeze, epic, puck}