Cincinnati Apache Spark Meetup — Spark Smorgasbord. Curt Kohler & Darin McBeath. November 11, 2015
| 2 Agenda • Spark Summit Europe Recap • Spark Survey • Spark Packages • Spark and Text Mining • Closing Thoughts • Plans for 2016
| 3 Spark Summit Europe
| 4 Spark Summit overview • Latest developments in the Spark universe: https://spark-summit.org/eu-2015/ • Occurs three times a year • All slides and presentation videos are made available for free on the Summit site. Links for this Summit's videos were posted last week; find them on the Schedule page.
| 5 Summit Europe • Four tracks: Developer – nuts-and-bolts development topics; Applications – cool things people have used Spark for; Data Science – leveraging Spark in a data science context; Research – academic research leveraging Spark • We are mid-cycle between releases right now, so not a lot of new features since the last Spark Summit over the summer • Plan of attack: presentations from Databricks are usually pretty good and are where I typically start; cherry-pick others based on subject matter.
| 6 Spark Education Opportunity • Day 1 of a Summit is typically training sessions: Spark Basics, Data Science, Advanced Topics • Links to videotaped sessions are typically posted as well, so you can follow along at your own pace • Summit Europe's training links are not up yet; previous Summits' sessions are available on their respective sites. Other topics there as well: DevOps
| 7 Interesting Presentations • How Spark Usage is Evolving in 2015 – Matei Zaharia. Overview of, and insights from, the Spark survey results. DataFrames and Tungsten are still the main story this year. Spark 1.6 futures (@ 21:30): Datasets API; DataFrames for GraphX and Streaming; more Tungsten enhancements; Data Source API in Streaming
| 8 Interesting Presentations • Spark DataFrames: Simple and Fast Analysis of Structured Data – Michael Armbrust. Overview of DataFrames: write less code, read less data, and the Catalyst optimizer gets you better performance. Makes it easier to extend Spark to new languages. Vision: only use RDDs if you need strict control over the processing DAG
| 9 Spark Dataset API • Datasets API (@ 27:30): API preview in 1.6. Brings RDD benefits to DataFrames: type-safe, custom lambda functions, fast (goes through Catalyst), DataFrame ↔ Dataset conversion
• Code example:
val df = sqlContext.read.json("people.json")
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.map(…).filter(…).groupBy(…)
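A slightly fuller sketch of the preview API (not from the talk; assumes a Spark 1.6 shell where sqlContext is in scope, and a hypothetical people.json with name and age fields):

import org.apache.spark.sql.Dataset
import sqlContext.implicits._                    // brings the case-class encoders into scope
case class Person(name: String, age: Int)
val df = sqlContext.read.json("people.json")     // hypothetical input file
val ds: Dataset[Person] = df.as[Person]          // typed view over the DataFrame
val adultNames = ds.filter(_.age >= 18).map(_.name)  // typed lambdas, still planned by Catalyst

The point of the API: the lambdas stay typed like RDD code, but execution still goes through the Catalyst optimizer.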
| 10 Interesting Presentations • Spark In Production: Lessons from 300+ production users – Aaron Davidson. Python performance; how people are using R with Spark; investigating network- and CPU-bound workloads; common performance pitfalls • Securing your Spark Applications – Kostas Sakellis & Marcelo Vanzin. Kerberos, HDFS, and YARN: oh my!!! Areas where there are still security issues with Spark deployments; long-term vision of the "Record Service" & Apache Sentry
| 11 Interesting Presentations • Spark UI Visualization – Andrew Or. Demos of how the new Spark UI gives insight into how your application is running and how to debug performance problems: DAG view to understand how your code was turned into processing stages; timeline view for seeing where time is being spent; extra support for DataFrame jobs (SQL tab); extra support for Streaming jobs (Streaming tab)
| 12 For the Data Scientists… • Enabling exploratory data science with Spark & R – Hossein Falaki. Good companion presentation to the one Eugene and Corrine did in August. The demo in the video shows how to get going. • Combining the Strengths of MLlib, scikit-learn, and R – Joseph Bradley. Good talk about the upcoming Spark package pdspark, which aims to provide distributed implementations of various popular scikit-learn and R algorithms in pyspark (and also, hopefully, in Scala as an addition to MLlib).
| 13 Other perspectives • Elsevier had two presenters at Summit Europe • Here are some additional presentations they recommended: Lambda Architecture, Analytics and Data Pathways with Spark Streaming, Kafka, Akka and Cassandra; Time Series Stream Processing with Spark and Cassandra – a more technical presentation of integrating Spark and Cassandra. Very in-depth if you are planning on integrating Spark with Cassandra.
| 14 2015 Spark Survey Full report: http://go.databricks.com/2015-spark-survey Takeaways: • Spark adoption is growing rapidly • Spark use is growing beyond Hadoop • Spark is increasing access to big data
| 15 Databricks Survey • The Databricks survey slides presented at this point in the Meetup have been removed from this copy of the presentation. • You can access the Databricks survey in its entirety for free at this URL: http://go.databricks.com/2015-spark-survey
| 16 Spark User List Activity (chart)
| 17 Cincinnati Sparker Survey • Sent to Cincinnati Sparkers • Completely anonymous • 10 questions • 9 respondents
| 18–27 Cincinnati Sparker Survey (ten slides of survey-results charts)
| 28 Spark Packages Launched in Dec 2014 by Databricks: http://spark-packages.org Community (open source) packages for Spark; many different licenses possible; not part of the Spark distribution. Currently more than 140 packages in 10 high-level categories
| 29 Deployments spark_azure – Spark launch script for Microsoft Azure. spark_gce – Spark launch script for Google Compute Engine. Very similar to the EC2 launch scripts that come bundled with Spark. Both developed by Sigmoid Analytics
| 30 Data Sources spark-redshift – Spark and Redshift integration. Release available. Developed by Databricks. spark-avro – Integration utilities for using Spark with Apache Avro data. Release available. Developed by Databricks. Used by Elsevier. spark-csv – Integration utilities for using Spark with CSV data. Release available. Developed by Databricks. Used by Elsevier. spark-mongodb – Read/write data with Spark SQL from/into MongoDB collections. Release available.
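As a concrete example of what these data source packages buy you, here is a minimal sketch (not from the slides) of reading a CSV file with spark-csv, assuming spark-shell was launched with the spark-csv package as on the "Using a Package" slide below, and that people.csv is a hypothetical input file:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // sample the data to guess column types
  .load("people.csv")
df.printSchema()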
| 31 Data Sources spark-cassandra-connector – Spark and Cassandra integration. Release available. Developed by DataStax. couchbase-spark-connector – Spark and Couchbase integration. Release available. Developed by Couchbase Labs. elasticsearch-hadoop – Spark and Elasticsearch integration. Release available. Developed by Elastic. Used by Elsevier
| 32 Could be Interesting spark-indexedrdd – An efficient updatable key-value store for Apache Spark. Release available. Developed by AMPLab. spark-skewjoin – Joins for skewed datasets in Spark. Release available. spark-corenlp – Stanford CoreNLP wrapper for the Spark ML pipeline API. Developed by Databricks. spark-jobserver – REST job server for Spark. succinct – Data store that enables queries directly on compressed data (think search)
| 33 Elsevier Developed Spark Packages soda – Solr Dictionary Annotator (microservice for Spark). Annotates entities in text using lexicons (dictionaries, controlled vocabularies) stored in Solr. Two major annotation modes are provided: exact and fuzzy. Release in the works. spark-xml-utils – Filter documents based on an XPath expression; return specific nodes for an XPath/XQuery expression; transform documents using an XSLT stylesheet. Release available. Guest blog on Databricks
| 34 Using a Package
cd {spark-install-dir}
./bin/spark-shell --packages elsevierlabs-os:spark-xml-utils:1.2.0
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
Assumes a release for the package is available
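Outside the shell, a released package can usually be added through your build instead. A build.sbt sketch (the coordinates below follow the spark-packages listing; a package's actual Maven groupId may differ):

// spark-csv is published to Maven Central
libraryDependencies += "com.databricks" %% "spark-csv" % "1.2.0"
// packages released only on spark-packages.org can be resolved with the
// sbt-spark-package plugin, e.g. spDependencies += "elsevierlabs-os/spark-xml-utils:1.2.0"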
| 35 Why spark-xml-utils?
<xocs:doc xmlns="http://www.elsevier.com/xml/ja/dtd"
  xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd"
  xmlns:ce="http://www.elsevier.com/xml/common/dtd"
  xmlns:ja="http://www.elsevier.com/xml/ja/dtd"
  xmlns:mml="http://www.w3.org/1998/Math/MathML"
  xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd"
  xmlns:tb="http://www.elsevier.com/xml/common/table/dtd"
  xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.elsevier.com/xml/xocs/dtd http://schema.elsevier.com/dtds/document/fulltext/xcr/xocs-article.xsd">
<xocs:meta>
<xocs:content-family>serial</xocs:content-family>
<xocs:content-type>JL</xocs:content-type>
<xocs:cid>271245</xocs:cid>
<xocs:ssids>
<xocs:ssid type="alllist">291210</xocs:ssid>
<xocs:ssid type="subj">291843</xocs:ssid>
<xocs:ssid type="subj">291847</xocs:ssid>
<xocs:ssid type="content">31</xocs:ssid>
</xocs:ssids>
<xocs:srctitle>Insect Biochemistry and Molecular Biology</xocs:srctitle>
<xocs:normalized-srctitle>INSECTBIOCHEMISTRYMOLECULARBIOLOGY</xocs:normalized-srctitle>
<xocs:orig-load-date yyyymmdd="20000503">2000-05-03</xocs:orig-load-date>
<xocs:ew-transaction-id>2010-04-18T16:48:17</xocs:ew-transaction-id>
<xocs:eid>1-s2.0-S0965174800000278</xocs:eid>
<xocs:pii-formatted>S0965-1748(00)00027-8</xocs:pii-formatted>
<xocs:pii-unformatted>S0965174800000278</xocs:pii-unformatted>
<xocs:doi>10.1016/S0965-1748(00)00027-8</xocs:doi>
<xocs:item-stage>S300</xocs:item-stage>
<xocs:item-version-number>S300.1</xocs:item-version-number>
<xocs:item-weight>FULL-TEXT</xocs:item-weight>
<xocs:hub-eid>1-s2.0-S0965174800X00598</xocs:hub-eid>
...
<xocs:cover-date-text>June 2000</xocs:cover-date-text>
<xocs:cover-date-start>2000-06-01</xocs:cover-date-start>
<xocs:cover-date-end>2000-06-30</xocs:cover-date-end>
<xocs:cover-date-year>2000</xocs:cover-date-year>
…
| 36 spark-xml-utils XPath (filter)
import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val filtered = xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "exists(/xocs:doc[xocs:meta[xocs:content-type='JL' and xocs:cover-date-year > xs:int(2012) and xocs:cover-date-year < xs:int(2015)]])"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath,namespaces)
  recsIter.filter(rec => proc.filterString(rec._2))
})
| 37 spark-xml-utils XQuery (evaluate)
import com.elsevier.spark_xml_utils.xquery.XQueryProcessor
import scala.collection.JavaConverters._
import java.util.HashMap
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val srcyearJson = xmlKeyPair.mapPartitions(recsIter => {
  val xquery = "for $x in /xocs:doc/xocs:meta return " +
    "string-join(('{ \"srctitle\" :\"',$x/xocs:srctitle, '\",\"year\":',$x/xocs:cover-date-year,'}'),'')"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XQueryProcessor.getInstance(xquery,namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2))
})
Output:
{ "srctitle" :"Advances in Mathematics","year":2012}
{ "srctitle" :"Biological Psychiatry","year":2012}
{ "srctitle" :"Earth and Planetary Science Letters","year":2014}
{ "srctitle" :"Fuel","year":2014}
{ "srctitle" :"Icarus","year":2012}
| 38 spark-xml-utils XSLT (transform)
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.head
val srctitles = xmlKeyPair.mapPartitions(recsIter => {
  val proc = XSLTProcessor.getInstance(stylesheet)
  recsIter.map(rec => proc.transform(rec._2))
})
Output:
{ "srctitle" :"Advances in Mathematics","year":2012}
{ "srctitle" :"Biological Psychiatry","year":2012}
{ "srctitle" :"Earth and Planetary Science Letters","year":2014}
{ "srctitle" :"Fuel","year":2014}
{ "srctitle" :"Icarus","year":2012}
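Since both the XQuery and XSLT examples emit one JSON string per document, a natural next step (a sketch, not from the slides) is to load those strings straight into a DataFrame; in Spark 1.x, sqlContext.read.json accepts an RDD[String]:

val articles = sqlContext.read.json(srcyearJson)   // srcyearJson from the XQuery example
articles.groupBy("srctitle").count().show()        // e.g. article counts per source title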
| 39 spark-xml-utils
Consider the scenario where you might want to filter documents where the record is of type 'journal', the stage is 'S300', the publication year is > 2010 and < 2014, the abstract contains 'heart', 'brain', 'body', or 'number', and the first section contains 'red' or 'black'.
// (imports as in the XPath example above)
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val filtered = xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "/xocs:doc[./xocs:meta[xocs:content-type='JL' " +
    "and xocs:item-stage='S300' " +
    "and xocs:cover-date-year > xs:int(2010) " +
    "and xocs:cover-date-year < xs:int(2014)] " +
    "and .//ja:head[.//ce:abstract[tokenize(lower-case(string-join(.//text(),' ')),'\\W+') = ('heart','brain','body','number')]] " +
    "and .//ce:section[position()=1 and tokenize(lower-case(string-join(.//text(),' ')),'\\W+') = ('red','black')]]"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd",
    "ja" -> "http://www.elsevier.com/xml/ja/dtd",
    "ce" -> "http://www.elsevier.com/xml/common/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath,namespaces)
  recsIter.filter(rec => proc.filterString(rec._2))
})
| 40 spark-xml-utils
github: https://github.com/elsevierlabs-os/spark-xml-utils
spark-packages: http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils
| 41 Help Support and Grow the Spark Community
Develop a package. Provide a release (easier with SBT, but possible with Maven). Happy to help and answer questions you might have.
| 42 Text Mining
Stand-off annotations: start with XML, create a string, create OM (original markup) annotations, then create new annotations (SODA, Stanford Core NLP). Very much a work in progress.
| 43 Text Mining (original XML)
<xocs:doc xmlns="http://www.elsevier.com/xml/ja/dtd"
  xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd"
  xmlns:ce="http://www.elsevier.com/xml/common/dtd"
  xmlns:ja="http://www.elsevier.com/xml/ja/dtd"
  xmlns:mml="http://www.w3.org/1998/Math/MathML"
  xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd"
  xmlns:tb="http://www.elsevier.com/xml/common/table/dtd"
  xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.elsevier.com/xml/xocs/dtd http://schema.elsevier.com/dtds/document/fulltext/xcr/xocs-article.xsd">
<xocs:meta>
<xocs:content-family>serial</xocs:content-family>
<xocs:content-type>JL</xocs:content-type>
<xocs:cid>271245</xocs:cid>
<xocs:ssids>
<xocs:ssid type="alllist">291210</xocs:ssid>
<xocs:ssid type="subj">291843</xocs:ssid>
<xocs:ssid type="subj">291847</xocs:ssid>
<xocs:ssid type="content">31</xocs:ssid>
</xocs:ssids>
<xocs:srctitle>Insect Biochemistry and Molecular Biology</xocs:srctitle>
<xocs:normalized-srctitle>INSECTBIOCHEMISTRYMOLECULARBIOLOGY</xocs:normalized-srctitle>
<xocs:orig-load-date yyyymmdd="20000503">2000-05-03</xocs:orig-load-date>
<xocs:ew-transaction-id>2010-04-18T16:48:17</xocs:ew-transaction-id>
<xocs:eid>1-s2.0-S0965174800000278</xocs:eid>
<xocs:pii-formatted>S0965-1748(00)00027-8</xocs:pii-formatted>
<xocs:pii-unformatted>S0965174800000278</xocs:pii-unformatted>
<xocs:doi>10.1016/S0965-1748(00)00027-8</xocs:doi>
<xocs:item-stage>S300</xocs:item-stage>
<xocs:item-version-number>S300.1</xocs:item-version-number>
<xocs:item-weight>FULL-TEXT</xocs:item-weight>
<xocs:hub-eid>1-s2.0-S0965174800X00598</xocs:hub-eid>
...
<xocs:cover-date-text>June 2000</xocs:cover-date-text> <xocs:cover-date-start>2000-06-01</xocs:cover-date-start> <xocs:cover-date-end>2000-06-30</xocs:cover-date-end> <xocs:cover-date-year>2000</xocs:cover-date-year> … | 44 Text Mining (string) serialJL27124529121029184329184731Insect Biochemistry and Molecular BiologyINSECTBIOCHEMISTRYMOLECULARBIOLOGY2000-05-032010-04-18T16:48:171-s2.0S0965174800000278S0965-1748(00)00027-8S096517480000027810.1016/S0965-1748(00)000278S300S300.1FULL-TEXT1-s2.0-S0965174800X005982010-10-15T15:19:33.46732604:0000200006012000063020002000-05-03T00:00:00Zarticleinfo crossmark dco dateupdated tomb dateloaded datesearch indexeddate issuelist volumelist yearnav articletitlenorm authfirstinitialnorm authfirstsurnamenorm cid cids contenttype copyright dateloadedtxt docsubtype doctype doi eid ewtransactionid hubeid issfirst issn issnnorm itemstage itemtransactionid itemweight openaccess openarchive pg pgfirst pglast pii piinorm pubdateend pubdatestart pubdatetxt pubyr sortorder srctitle srctitlenorm srctype subheadings volfirst volissue webpdf webpdfpagecount figure body acknowledge affil articletitle auth authfirstini authfull authkeywords authlast primabst ref alllist content subj ssids0965174809651748303066Volume 30, Issue 69507514507514200006June 20002000-06-012000-06302000converted-articleflaCopyright © 2000 Elsevier Science Ltd. All rights reserved.HIGHLEVELEXPRESSIONMALESPECIFICPHEROMONEBINDINGPROTEINSPBPSINANTENNA EFEMALENOCTUIIDMOTHSCALLAHANF1Introduction2Materials and methods2.1Moth tissue2.2Protein electrophoresis/immunoblotting2.3Immunocytochemistry2.4Library construction/cDNA sequencing2.5Northern Blot analysis3Results4DiscussionAcknowledgementsReferencesBOECKH1979235242JBREER1997115130HIN SECTPHEROMONERESEARCHNEWDIRECTIONSMOLECULARMECHANISMSPHEROMONERECEPTIO NININSECTANTENNAEBREER1990735740HCARLSON1996175180JCHEN1997159172XCHRISTENSEN1 990275283TDENOTTER1996413421CDENOTTER1978370378CDICKENS1995857867JDU199587268732 GFENG1997405412LGYORGYI198898519855THILDEBRAND1996519JKAISSLING1998385395KKLUN19 80165175JKOONTZ19873950MKRIEGER1999720723JKRIEGER1993449456JKRIEGER1991277284JKRI EGER1996297307JLAEMMLI1970680685ULAUE1997217228MLAUE1994178180MLJUNGBERG19932532 60HMAIBECHECOISNE1998815818MMAIBECHECOISNE1997213221MMAMELI1996875882MMCKENNA 19941634016347MMERRITT1998272276TNAGNANLEMEILLOUR19965967POCHIENG1995221232SPEL OSI1996319PPELOSI1995503514PPRESTWICH1995461469GRAMING1989215218KRAMING199050350 9KROBERTSON1999501518HROGERS199916251637MROSS1979807816RSAMBROOK1989JMOLECUL ARCLONINGALABORATORYMANUALSCHNEIDER1998153161DSCHWEITZER1976955960ESEABROOK 198714431453WSTEINBRECHT1996718725RSTEINBRECHT | 45 Text Mining (OM Annotations) 1^om^xocs:doc^0^41420^xmlns:xocs=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fxocs%2Fdtd&xsi:schemaLocation=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fxocs %2Fdtd+http%3A%2F%2Fschema.elsevier.com%2Fdtds%2Fdocument%2Ffulltext%2Fxcr%2Fxocsarticle.xsd&xmlns:xs=http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema&xmlns:xsi=http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchemainstance&xmlns=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fja%2Fdtd&xmlns:ja=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fja%2Fdtd&xmlns:mml=http%3A%2F%2F www.w3.org%2F1998%2FMath%2FMathML&xmlns:tb=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Ftable%2Fdtd&xmlns:sb=http%3A%2F%2Fwww.elsevier.co m%2Fxml%2Fcommon%2Fstructbib%2Fdtd&xmlns:ce=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Fdtd&xmlns:xlink=http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink&xmlns:cals=http%3A%2 F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Fcals%2Fdtd 
2^om^xocs:meta^0^3813^parentId=1
3^om^xocs:content-family^0^6^parentId=2
4^om^xocs:content-type^6^8^parentId=2
5^om^xocs:cid^8^14^parentId=2
6^om^xocs:ssids^14^34^parentId=2
7^om^xocs:ssid^14^20^type=alllist&parentId=6
8^om^xocs:ssid^20^26^type=subj&parentId=6
9^om^xocs:ssid^26^32^type=subj&parentId=6
10^om^xocs:ssid^32^34^type=content&parentId=6
11^om^xocs:srctitle^34^75^parentId=2
12^om^xocs:normalized-srctitle^75^109^parentId=2
13^om^xocs:orig-load-date^109^119^yyyymmdd=20000503&parentId=2
14^om^xocs:ew-transaction-id^119^138^parentId=2
15^om^xocs:eid^138^162^parentId=2
16^om^xocs:pii-formatted^162^183^parentId=2
17^om^xocs:pii-unformatted^183^200^parentId=2
18^om^xocs:doi^200^229^parentId=2
19^om^xocs:item-stage^229^233^parentId=2
20^om^xocs:item-version-number^233^239^parentId=2
21^om^xocs:item-weight^239^248^parentId=2
22^om^xocs:hub-eid^248^272^parentId=2
23^om^xocs:timestamp^272^304^yyyymmdd=20101015&parentId=2
24^om^xocs:dco^304^305^parentId=2
25^om^xocs:tomb^305^306^parentId=2
26^om^xocs:date-search-begin^306^314^parentId=2
27^om^xocs:date-search-end^314^322^parentId=2
28^om^xocs:year-nav^322^326^parentId=2
29^om^xocs:indexeddate^326^346^epoch=957312000&parentId=2
30^om^xocs:articleinfo^346^986^parentId=2
..
| 46 Text Mining (Spark)
// ScienceDirect articles to process
val piis = Array("S0965174800000278","S0013468610015215")
// Databricks mount point for the ScienceDirect XML
val inputBase = "/mnt/sd-fulltext/"
// Databricks mount point to place the string and original markup annotations
val outputBase = "/mnt/els/darin/cat3/"
// Parallelize the piis across the cluster
val piiRDD = sc.parallelize(piis)
// OR: get the values from a file
val piiRDD = sc.textFile("/mnt/els/darin/catPIIs").repartition(4)
| 47 Text Mining (Spark)
// Assumes java.io.File and org.apache.commons.io.FileUtils are imported;
// BaselineMarkup is Elsevier-internal code available on the cluster
piiRDD.map(pii => {
  // Local filename for the pii (that will need to be deleted)
  val piiLocalFile = "/" + pii + ".xml"
  // Copy the xml file identified by the pii to the local filesystem on the worker
  dbutils.fs.cp(inputBase + pii, "file:" + piiLocalFile)
  val xml = FileUtils.openInputStream(new File(piiLocalFile))
  // Generate the string and the om annotations
  val results = BaselineMarkup.process(xml)
  // Get the generated string and write to the local file system on the worker
  val strFileName = "/" + pii + ".str"
  val strFile = new File(strFileName)
  FileUtils.writeStringToFile(strFile, results(0), "UTF-8")
  // Get the original markup annotations and write to the local file system on the worker
  val omFileName = "/" + pii + ".om"
  val omFile = new File(omFileName)
  FileUtils.writeStringToFile(omFile, results(1), "UTF-8")
  // Move the string and om annotations to S3
  dbutils.fs.mv("file:" + strFileName, outputBase + pii + "/str")
  dbutils.fs.mv("file:" + omFileName, outputBase + pii + "/om")
  // Delete the local pii xml file
  dbutils.fs.rm("file:" + piiLocalFile)
}).count // Action to force execution
| 48 Text Mining (Spark)
piiRDD.mapPartitions(piiIter => {
  // Hashmap of include annotations
  val includes = new HashMap[String,Array[String]](Map(
    "om" -> Array("ce:abstract","ce:caption")
  ).asJava)
  // Note: excludes (used below) is built the same way; its definition was elided on the slide
  // Init Stanford Core (use default parameters of "tokenize, ssplit, pos, lemma, ner, parse")
  val stanfordCoreMarkup = StanfordCoreMarkup.getInstance()
  piiIter.map(pii => {
    // localFile(pii, ext) is a small helper (not shown on the slides) that builds
    // a local path for the pii, e.g. "/" + pii + ext
    // Copy the om annotation file identified by the pii to the local filesystem on the worker
    dbutils.fs.cp(inputBase + pii + "/om", "file:" + localFile(pii,".om"))
    val om = FileUtils.openInputStream(new File(localFile(pii,".om")))
    // Copy the string file identified by the pii to the local filesystem on the worker
    dbutils.fs.cp(inputBase + pii + "/str", "file:" + localFile(pii,".str"))
    val str = FileUtils.openInputStream(new File(localFile(pii,".str")))
    // Generate the Stanford Core NLP annotations (includes and excludes)
    val results = stanfordCoreMarkup.process(om,str,includes,excludes)
    // Get the generated string and write to the local file system on the worker
    val scnlpFileName = "/" + pii + ".scnlp"
    val scnlpFile = new File(scnlpFileName)
    FileUtils.writeStringToFile(scnlpFile, results, "UTF-8")
    // Move the Stanford Core NLP annotations to S3
    dbutils.fs.mv("file:" + scnlpFileName, outputBase + pii + "/scnlp")
    // Delete the local om annotation and string files
    dbutils.fs.rm("file:" + localFile(pii,".om"))
    dbutils.fs.rm("file:" + localFile(pii,".str"))
  })
}).count // Action to force execution
| 49 Text Mining (Spark)
// Schema definition for annotations
// (assumes org.apache.spark.sql.types._ and org.apache.spark.sql.Row are imported)
val annotationSchema = StructType(List(
  StructField("DocId", StringType, false),
  StructField("AnnotSet", StringType, false),
  StructField("AnnotType", StringType, false),
  StructField("Start", LongType, false),
  StructField("End", LongType, false),
  StructField("AnnotId", LongType, false),
  StructField("Other", StringType, true)))
// FlatMap to get one annotation per record
val baselineAnnotRDD = sc.wholeTextFiles("/mnt/els/darin/cat3/*/om")
val annotations = baselineAnnotRDD.flatMapValues(rec => rec.split("\n")).map(rec => (rec._1.split("/")(5),rec._2))
// Create a row of annotations
val annotationsRowRDD = annotations.map{v =>
  val arr = v._2.split("\\^")
  val docId = v._1
  val annotSet = arr(1)
  val annotType = arr(2)
  val start = arr(3).toLong
  val end = arr(4).toLong
  val annotId = arr(0).toLong
  var other : String = null
  if (arr.size == 6) { other = arr(5) }
  Row(docId,annotSet,annotType,start,end,annotId,other)}
// Create the data frame
val annotationsDataFrame = sqlContext.createDataFrame(annotationsRowRDD, annotationSchema).cache()
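With the annotations in a DataFrame, Spark SQL queries come for free. A small sketch (not from the slides) of the kind of question you can now ask, e.g. the most common OM annotation types across the corpus:

// Register the DataFrame as a temp table (Spark 1.x API) and query it
annotationsDataFrame.registerTempTable("annotations")
sqlContext.sql("SELECT AnnotType, COUNT(*) AS cnt FROM annotations " +
  "WHERE AnnotSet = 'om' GROUP BY AnnotType ORDER BY cnt DESC").show()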
| 50–51 Text Mining (Spark) (demo screenshots)
| 52 Text Mining (Spark)
What else could you do? Look at the Stanford Core NLP output: frequency distribution of nouns, verbs, and other parts of speech; combine with the OM annotations to do the same scoped to an element; develop smarter search engines. Also looking at GENIA and spaCy. Much, much more …
| 53 Text Mining at Elsevier with Spark
Very much a work in progress. AWS Lambda: triggered by changes made to S3 buckets; better for incremental changes. Spark: great for processing everything at once. Annotation files go to DataFrames and Datasets; Spark SQL for further analysis.
| 54 Some Random Closing Thoughts
IBM purchase of Weather.com; contest on IBM Bluemix (with weather data). Machine Learning class (Stanford). Statistical Learning (Trevor Hastie, Rob Tibshirani) – free online book. Mastering Apache Spark GitBook (free): jaceklaskowski.gitbooks.io/mastering-apache-spark/
| 55 Some Random Closing Thoughts
Spark 1.5.2 released yesterday. Google group for Spark meetup organizers: what's good? What could be better?
| 56 Plans for 2016
• Next meetup planned for February
• Focus for 2016?
• External speakers/vendors?
• High-level vs. detailed?
Thank You
Curt Kohler c.kohler@elsevier.com
Darin McBeath d.mcbeath@elsevier.com