Cincinnati Apache Spark Meetup
Spark Smorgasbord
Curt Kohler & Darin McBeath
November 11, 2015
| 2
Agenda
• Spark Summit Europe Recap
• Spark Survey
• Spark Packages
• Spark and Text Mining
• Closing Thoughts
• Plans for 2016
| 3
Spark Summit Europe
| 4
Spark Summit overview
• Latest developments in the Spark universe
  – https://spark-summit.org/eu-2015/
• Summits occur 3 times/year
• All slides and presentation videos are made available for free on the Summit site
  – Links for this Summit’s videos were posted last week; find them on the Schedule page
| 5
Summit Europe
• Four tracks
  – Developer – nuts-and-bolts development topics
  – Applications – cool things people have used Spark for
  – Data Science – leveraging Spark in a Data Science context
  – Research – academic research leveraging Spark
• Mid-cycle of releases right now
  – Not a lot of new features since Spark Summit East over the summer
• Plan of attack
  – Presentations from Databricks are usually pretty good and where I typically start
  – Cherry-pick others based on subject matter
| 6
Spark Education Opportunity
• Day 1 of a Summit is typically training sessions
  – Spark Basics
  – Data Science
  – Advanced Topics
• Links to videotaped sessions are typically posted as well
  – You can follow along at your own pace
• Summit Europe’s training links are not up yet
  – Previous Summits’ sessions are available on their respective sites
  – Other topics there as well: DevOps
| 7
Interesting Presentations
• How Spark Usage is Evolving in 2015 – Matei Zaharia
  – Overview and insights from the Spark survey results
  – DataFrames and Tungsten are still the main story this year
  – Spark 1.6 futures (@ 21:30)
    - “Datasets API”
    - DataFrames for GraphX and Streaming
    - More Tungsten enhancements
    - Data source API in Streaming
| 8
Interesting Presentations
• Spark DataFrames: Simple and Fast Analysis of Structured Data – Michael Armbrust
  – Overview of DataFrames (see the sketch below):
    - Write less code
    - Read less data
    - The Catalyst optimizer gets you better performance
  – Makes it easier to extend Spark to new languages
  – Vision – only use RDDs if you need strict control over the processing DAG
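To make the “write less code” point concrete, here is a minimal sketch of our own (not from the talk): a one-line aggregation that Catalyst can plan and optimize, assuming a hypothetical DataFrame df with dept and salary columns.

// Hypothetical df with "dept" and "salary" columns; Catalyst plans the
// aggregation and prunes unused columns before any data is read
val avgSalaries = df.groupBy("dept").avg("salary")
avgSalaries.show()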
| 9
Spark Dataset API
• Datasets API (@ 27:30)
  – API preview in 1.6
  – Brings RDD benefits to DataFrames
    - Type-safe
    - Custom lambda functions
    - Fast (goes through Catalyst)
    - DataFrame ↔ Dataset conversion
• Code Example
val df = sqlContext.read.json("people.json")
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.map().filter().groupBy()…
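Filled out slightly (our sketch of the 1.6 preview API; the implicit encoder import is an assumption based on how the preview was described):

import org.apache.spark.sql.Dataset
import sqlContext.implicits._   // encoders needed by .as[Person]

case class Person(name: String, age: Int)

val df = sqlContext.read.json("people.json")
val ds: Dataset[Person] = df.as[Person]

// Typed lambdas that are still planned through Catalyst
val adultNames = ds.filter(_.age >= 18).map(_.name)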
| 10
Interesting Presentations
• Spark in Production: Lessons from 300+ Production Users – Aaron Davidson
  – Python performance
  – How people are using R with Spark
  – Investigating network- and CPU-bound workloads
  – Common performance pitfalls
• Securing Your Spark Applications – Kostas Sakellis & Marcelo Vanzin
  – Kerberos, HDFS, and YARN: oh my!!!
  – Areas where there are still security issues with Spark deployments
  – Long-term vision of the “Record Service” & Apache Sentry
| 11
Interesting Presentations
• Spark UI Visualization – Andrew Or
  – Demos of how the new Spark UI gives insight into how your application is running and how to debug performance problems
    - DAG view to understand how your code was turned into processing
    - Timeline view for seeing where time is being spent
    - Extra support for DataFrame jobs (SQL tab)
    - Extra support for Streaming jobs (Streaming tab)
| 12
For the Data Scientists…
• Enabling Exploratory Data Science with Spark & R – Hossein Falaki
  – Good companion presentation to the one Eugene and Corrine did in August
  – Demo in the video showing how to get going
• Combining the Strengths of MLlib, scikit-learn, and R – Joseph Bradley
  – Good talk about the upcoming Spark package pdspark, which aims to provide distributed implementations of various popular scikit-learn and R algorithms in pyspark (and hopefully in Scala as well, as an addition to MLlib)
| 13
Other perspectives
• Elsevier had two presenters at Summit Europe
• Here are some additional presentations they recommended:
  – Lambda Architecture, Analytics and Data Pathways with Spark Streaming, Kafka, Akka and Cassandra
  – Time Series Stream Processing with Spark and Cassandra
    - A more technical presentation on integrating Spark and Cassandra; very in-depth if you are planning on integrating the two
| 14
2015 Spark Survey
Full report
  – http://go.databricks.com/2015-spark-survey
Takeaways
  – Spark adoption is growing rapidly
  – Spark use is growing beyond Hadoop
  – Spark is increasing access to big data
| 15
Databricks Survey
• The Databricks survey slides that were actually presented at this point in the Meetup have been removed from this copy of the presentation.
• You can access the Databricks survey in its entirety for free at http://go.databricks.com/2015-spark-survey
| 16
Spark User List Activity
| 17
Cincinnati Sparker Survey
• Sent to Cincinnati Sparkers
• Completely anonymous
• 10 questions
• 9 respondents
| 18–27
Cincinnati Sparker Survey (survey-result charts)
| 28
Spark Packages
Launched in Dec 2014 by Databricks
http://spark-packages.org
Community (open source) packages for Spark
Many different licenses possible
Not part of Spark distribution
Currently more than 140 packages
10 high-level categories
| 29
Deployments
spark_azure
Spark launch script for Microsoft Azure
spark_gce
Spark launch script for Google Compute Engine
Very similar to the EC2 launch scripts that come bundled with Spark
Both developed by Sigmoid Analytics
| 30
Data Sources
spark-redshift
Spark and Redshift integration
Release available
Developed by Databricks
spark-avro
Integration utilities for using Spark with Apache Avro data
Release available
Developed by Databricks
Used by Elsevier
spark-csv
Integration utilities for using Spark with CSV data (see the usage sketch after this list)
Release available
Developed by Databricks
Used by Elsevier
spark-mongodb
Read/write data with Spark SQL from/into MongoDB collections.
Release available
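As a quick usage sketch (ours, not from the slides) of one of these data sources: reading a CSV file through spark-csv, with a hypothetical input path.

// Hypothetical path; header/inferSchema are standard spark-csv 1.x options
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3n://my-bucket/data.csv")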
| 31
Data Sources
spark-cassandra-connector
Spark and Cassandra integration (see the sketch after this list)
Release available
Developed by DataStax
couchbase-spark-connector
Spark and Couchbase integration
Release available
Developed by Couchbase Labs
elasticsearch-hadoop
Spark and Elasticsearch integration
Release available
Developed by Elastic
Used by Elsevier
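A minimal read/write sketch for the Cassandra connector (our example; the keyspace, table, and column names are hypothetical):

import com.datastax.spark.connector._

// Read a Cassandra table as an RDD of CassandraRow
val tracks = sc.cassandraTable("music", "tracks")

// Write tuples back, naming the target columns
sc.parallelize(Seq(("Icarus", 2012)))
  .saveToCassandra("music", "stats", SomeColumns("title", "year"))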
| 32
Could be Interesting
spark-indexedrdd
An efficient updatable key-value store for Apache Spark
Release available
Developed by AMPLab
spark-skewjoin
Joins for skewed datasets in Spark
Release available
spark-corenlp
Stanford CoreNLP wrapper for the Spark ML pipeline API
Developed by Databricks
spark-jobserver
REST job server for Spark
succinct
Data store that enables queries directly on compressed data (think search)
| 33
Elsevier Developed Spark Packages
soda
Solr Dictionary Annotator (Microservice for Spark)
Annotate entities in text using lexicons (dictionaries, controlled vocabularies) stored on Solr
Two major annotation modes are provided: exact and fuzzy
Release in the works
spark-xml-utils
Filter documents based on an XPath expression
Return specific nodes for an XPath/XQuery expression
Transform documents using an XSLT stylesheet
Release available
Guest blog post on the Databricks site
| 34
Using a Package
cd {spark-install-dir}
./bin/spark-shell --packages elsevierlabs-os:spark-xml-utils:1.2.0
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
Assumes a release for the package is available
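If you would rather declare a released package as a build dependency instead of on the command line, the ones published to Maven Central can be pulled in the usual way (sbt sketch; coordinates shown are spark-csv's):

// build.sbt
libraryDependencies += "com.databricks" %% "spark-csv" % "1.2.0"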
| 35
Why spark-xml-utils?
<xocs:doc xmlns="http://www.elsevier.com/xml/ja/dtd"
xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:ja="http://www.elsevier.com/xml/ja/dtd"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd"
xmlns:tb="http://www.elsevier.com/xml/common/table/dtd"
xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
xsi:schemaLocation="http://www.elsevier.com/xml/xocs/dtd http://schema.elsevier.com/dtds/document/fulltext/xcr/xocs-article.xsd">
<xocs:meta>
<xocs:content-family>serial</xocs:content-family>
<xocs:content-type>JL</xocs:content-type>
<xocs:cid>271245</xocs:cid>
<xocs:ssids>
<xocs:ssid type="alllist">291210</xocs:ssid>
<xocs:ssid type="subj">291843</xocs:ssid>
<xocs:ssid type="subj">291847</xocs:ssid>
<xocs:ssid type="content">31</xocs:ssid>
</xocs:ssids>
<xocs:srctitle>Insect Biochemistry and Molecular Biology</xocs:srctitle>
<xocs:normalized-srctitle>INSECTBIOCHEMISTRYMOLECULARBIOLOGY</xocs:normalized-srctitle>
<xocs:orig-load-date yyyymmdd="20000503">2000-05-03</xocs:orig-load-date>
<xocs:ew-transaction-id>2010-04-18T16:48:17</xocs:ew-transaction-id>
<xocs:eid>1-s2.0-S0965174800000278</xocs:eid>
<xocs:pii-formatted>S0965-1748(00)00027-8</xocs:pii-formatted>
<xocs:pii-unformatted>S0965174800000278</xocs:pii-unformatted>
<xocs:doi>10.1016/S0965-1748(00)00027-8</xocs:doi>
<xocs:item-stage>S300</xocs:item-stage>
<xocs:item-version-number>S300.1</xocs:item-version-number>
<xocs:item-weight>FULL-TEXT</xocs:item-weight>
<xocs:hub-eid>1-s2.0-S0965174800X00598</xocs:hub-eid>
...
<xocs:cover-date-text>June 2000</xocs:cover-date-text>
<xocs:cover-date-start>2000-06-01</xocs:cover-date-start>
<xocs:cover-date-end>2000-06-30</xocs:cover-date-end>
<xocs:cover-date-year>2000</xocs:cover-date-year>
…
| 36
spark-xml-utils
XPath (filter)
import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val filtered = xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "exists(/xocs:doc[xocs:meta[xocs:content-type='JL' " +
              "and xocs:cover-date-year > xs:int(2012) " +
              "and xocs:cover-date-year < xs:int(2015)]])"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.filter(rec => proc.filterString(rec._2))
})
| 37
spark-xml-utils
XQuery(evaluate)
import com.elsevier.spark_xml_utils.xquery.XQueryProcessor
import scala.collection.JavaConverters._
import java.util.HashMap
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val srcyearJson = xmlKeyPair.mapPartitions(recsIter => {
  val xquery = "for $x in /xocs:doc/xocs:meta return " +
               "string-join(('{ \"srctitle\" :\"', $x/xocs:srctitle, '\",\"year\":', $x/xocs:cover-date-year, '}'),'')"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XQueryProcessor.getInstance(xquery, namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2))
})
Output:
{ "srctitle" :"Advances in Mathematics","year":2012}
{ "srctitle" :"Biological Psychiatry","year":2012}
{ "srctitle" :"Earth and Planetary Science Letters","year":2014}
{ "srctitle" :"Fuel","year":2014}
{ "srctitle" :"Icarus","year":2012}
| 38
spark-xml-utils
XSLT (transform)
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.head
val srctitles = xmlKeyPair.mapPartitions(recsIter => {
  val proc = XSLTProcessor.getInstance(stylesheet)
  recsIter.map(rec => proc.transform(rec._2))
})
Output:
{ "srctitle" :"Advances in Mathematics","year":2012}
{ "srctitle" :"Biological Psychiatry","year":2012}
{ "srctitle" :"Earth and Planetary Science Letters","year":2014}
{ "srctitle" :"Fuel","year":2014}
{ "srctitle" :"Icarus","year":2012}
| 39
spark-xml-utils
Consider the scenario where you want to filter to documents where the record is of type ‘journal’, the stage is ‘S300’, the publication year is > 2010 and < 2014, the abstract contains ‘heart’, ‘brain’, ‘body’ or ‘number’, and the first section contains ‘red’ or ‘black’.

import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap

val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val filtered = xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "/xocs:doc[./xocs:meta[xocs:content-type='JL' " +
              "and xocs:item-stage='S300' " +
              "and xocs:cover-date-year > xs:int(2010) " +
              "and xocs:cover-date-year < xs:int(2014)] " +
              "and .//ja:head[.//ce:abstract[tokenize(lower-case(string-join(.//text(),' ')),'\\W+') = ('heart','brain','body','number')]] " +
              "and .//ce:section[position()=1 and tokenize(lower-case(string-join(.//text(),' ')),'\\W+') = ('red','black')]]"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd",
    "ja" -> "http://www.elsevier.com/xml/ja/dtd",
    "ce" -> "http://www.elsevier.com/xml/common/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.filter(rec => proc.filterString(rec._2))
})
| 40
spark-xml-utils
github
https://github.com/elsevierlabs-os/spark-xml-utils
spark-packages
http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils
| 41
Help Support and Grow the Spark Community
Develop a package
Provide a release
Easier with SBT but possible with Maven
Happy to help and answer questions you might have
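For reference, a minimal publishing sketch using Databricks' sbt-spark-package plugin (our assumption of the typical setup; the package name is illustrative and the plugin version should be checked on spark-packages.org):

// project/plugins.sbt
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.3")

// build.sbt
spName := "myorg/my-spark-package"  // name as it will appear on spark-packages.org
sparkVersion := "1.5.2"             // Spark version to build against
sparkComponents += "sql"            // Spark modules used (e.g. sql, mllib)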
| 42
Text Mining
Stand-off Annotations
Start with XML
  – Create a string
  – Create OM (original markup) annotations
Create new annotations
  – Soda
  – Stanford CoreNLP
Very much a work in progress
| 43
Text Mining (original XML)
(Same XOCS XML metadata sample as shown on the “Why spark-xml-utils?” slide.)
| 44
Text Mining (string)
serialJL27124529121029184329184731Insect Biochemistry and Molecular
BiologyINSECTBIOCHEMISTRYMOLECULARBIOLOGY2000-05-032010-04-18T16:48:171-s2.0S0965174800000278S0965-1748(00)00027-8S096517480000027810.1016/S0965-1748(00)000278S300S300.1FULL-TEXT1-s2.0-S0965174800X005982010-10-15T15:19:33.46732604:0000200006012000063020002000-05-03T00:00:00Zarticleinfo crossmark dco dateupdated tomb
dateloaded datesearch indexeddate issuelist volumelist yearnav articletitlenorm authfirstinitialnorm
authfirstsurnamenorm cid cids contenttype copyright dateloadedtxt docsubtype doctype doi eid
ewtransactionid hubeid issfirst issn issnnorm itemstage itemtransactionid itemweight openaccess
openarchive pg pgfirst pglast pii piinorm pubdateend pubdatestart pubdatetxt pubyr sortorder srctitle
srctitlenorm srctype subheadings volfirst volissue webpdf webpdfpagecount figure body acknowledge affil
articletitle auth authfirstini authfull authkeywords authlast primabst ref alllist content subj ssids0965174809651748303066Volume 30, Issue 69507514507514200006June 20002000-06-012000-06302000converted-articleflaCopyright © 2000 Elsevier Science Ltd. All rights
reserved.HIGHLEVELEXPRESSIONMALESPECIFICPHEROMONEBINDINGPROTEINSPBPSINANTENNA
EFEMALENOCTUIIDMOTHSCALLAHANF1Introduction2Materials and methods2.1Moth tissue2.2Protein
electrophoresis/immunoblotting2.3Immunocytochemistry2.4Library construction/cDNA
sequencing2.5Northern Blot
analysis3Results4DiscussionAcknowledgementsReferencesBOECKH1979235242JBREER1997115130HIN
SECTPHEROMONERESEARCHNEWDIRECTIONSMOLECULARMECHANISMSPHEROMONERECEPTIO
NININSECTANTENNAEBREER1990735740HCARLSON1996175180JCHEN1997159172XCHRISTENSEN1
990275283TDENOTTER1996413421CDENOTTER1978370378CDICKENS1995857867JDU199587268732
GFENG1997405412LGYORGYI198898519855THILDEBRAND1996519JKAISSLING1998385395KKLUN19
80165175JKOONTZ19873950MKRIEGER1999720723JKRIEGER1993449456JKRIEGER1991277284JKRI
EGER1996297307JLAEMMLI1970680685ULAUE1997217228MLAUE1994178180MLJUNGBERG19932532
60HMAIBECHECOISNE1998815818MMAIBECHECOISNE1997213221MMAMELI1996875882MMCKENNA
19941634016347MMERRITT1998272276TNAGNANLEMEILLOUR19965967POCHIENG1995221232SPEL
OSI1996319PPELOSI1995503514PPRESTWICH1995461469GRAMING1989215218KRAMING199050350
9KROBERTSON1999501518HROGERS199916251637MROSS1979807816RSAMBROOK1989JMOLECUL
ARCLONINGALABORATORYMANUALSCHNEIDER1998153161DSCHWEITZER1976955960ESEABROOK
198714431453WSTEINBRECHT1996718725RSTEINBRECHT
| 45
Text Mining (OM Annotations)
1^om^xocs:doc^0^41420^xmlns:xocs=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fxocs%2Fdtd&xsi:schemaLocation=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fxocs
%2Fdtd+http%3A%2F%2Fschema.elsevier.com%2Fdtds%2Fdocument%2Ffulltext%2Fxcr%2Fxocsarticle.xsd&xmlns:xs=http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema&xmlns:xsi=http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchemainstance&xmlns=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fja%2Fdtd&xmlns:ja=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fja%2Fdtd&xmlns:mml=http%3A%2F%2F
www.w3.org%2F1998%2FMath%2FMathML&xmlns:tb=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Ftable%2Fdtd&xmlns:sb=http%3A%2F%2Fwww.elsevier.co
m%2Fxml%2Fcommon%2Fstructbib%2Fdtd&xmlns:ce=http%3A%2F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Fdtd&xmlns:xlink=http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink&xmlns:cals=http%3A%2
F%2Fwww.elsevier.com%2Fxml%2Fcommon%2Fcals%2Fdtd
2^om^xocs:meta^0^3813^parentId=1
3^om^xocs:content-family^0^6^parentId=2
4^om^xocs:content-type^6^8^parentId=2
5^om^xocs:cid^8^14^parentId=2
6^om^xocs:ssids^14^34^parentId=2
7^om^xocs:ssid^14^20^type=alllist&parentId=6
8^om^xocs:ssid^20^26^type=subj&parentId=6
9^om^xocs:ssid^26^32^type=subj&parentId=6
10^om^xocs:ssid^32^34^type=content&parentId=6
11^om^xocs:srctitle^34^75^parentId=2
12^om^xocs:normalized-srctitle^75^109^parentId=2
13^om^xocs:orig-load-date^109^119^yyyymmdd=20000503&parentId=2
14^om^xocs:ew-transaction-id^119^138^parentId=2
15^om^xocs:eid^138^162^parentId=2
16^om^xocs:pii-formatted^162^183^parentId=2
17^om^xocs:pii-unformatted^183^200^parentId=2
18^om^xocs:doi^200^229^parentId=2
19^om^xocs:item-stage^229^233^parentId=2
20^om^xocs:item-version-number^233^239^parentId=2
21^om^xocs:item-weight^239^248^parentId=2
22^om^xocs:hub-eid^248^272^parentId=2
23^om^xocs:timestamp^272^304^yyyymmdd=20101015&parentId=2
24^om^xocs:dco^304^305^parentId=2
25^om^xocs:tomb^305^306^parentId=2
26^om^xocs:date-search-begin^306^314^parentId=2
27^om^xocs:date-search-end^314^322^parentId=2
28^om^xocs:year-nav^322^326^parentId=2
29^om^xocs:indexeddate^326^346^epoch=957312000&parentId=2
30^om^xocs:articleinfo^346^986^parentId=2
..
| 46
Text Mining (Spark)
// ScienceDirect articles to process
val piis = Array("S0965174800000278","S0013468610015215")
// Databricks mount point for the ScienceDirect XML
val inputBase = "/mnt/sd-fulltext/"
// Databricks mount point to place the string and original markup annotations
val outputBase = "/mnt/els/darin/cat3/"
// Parallelize the piis across the cluster
val piiRDD = sc.parallelize(piis)
OR
// Get the values from a file
val piiRDD = sc.textFile("/mnt/els/darin/catPIIs").repartition(4)
| 47
Text Mining (Spark)
import java.io.File
import org.apache.commons.io.FileUtils

piiRDD.map(pii => {
  // Local filename for the pii (that will need to be deleted)
  val piiLocalFile = "/" + pii + ".xml"
  // Copy the xml file identified by the pii to the local filesystem on the worker
  dbutils.fs.cp(inputBase + pii, "file:" + piiLocalFile)
  val xml = FileUtils.openInputStream(new File(piiLocalFile))
  // Generate the string and the om annotations
  val results = BaselineMarkup.process(xml)
  // Get the generated string and write to the local file system on the worker
  val strFileName = "/" + pii + ".str"
  val strFile = new File(strFileName)
  FileUtils.writeStringToFile(strFile, results(0), "UTF-8")
  // Get the original markup annotations and write to the local file system on the worker
  val omFileName = "/" + pii + ".om"
  val omFile = new File(omFileName)
  FileUtils.writeStringToFile(omFile, results(1), "UTF-8")
  // Move the string and om annotations to S3
  dbutils.fs.mv("file:" + strFileName, outputBase + pii + "/str")
  dbutils.fs.mv("file:" + omFileName, outputBase + pii + "/om")
  // Delete the local pii xml file
  dbutils.fs.rm("file:" + piiLocalFile)
}).count // Action to force execution
| 48
Text Mining (Spark)
import java.io.File
import java.util.HashMap
import org.apache.commons.io.FileUtils
import scala.collection.JavaConverters._

// Assumed helpers for this stage: localFile(pii, ext) returns a local path such as
// "/" + pii + ext, and excludes is a HashMap of annotation types to skip, built like includes
piiRDD.mapPartitions(piiIter => {
  // Hashmap of include annotations
  val includes = new HashMap[String,Array[String]](Map(
    "om" -> Array("ce:abstract","ce:caption")
  ).asJava)
  // Init Stanford Core (use default parameters of "tokenize, ssplit, pos, lemma, ner, parse")
  val stanfordCoreMarkup = StanfordCoreMarkup.getInstance()
  piiIter.map(pii => {
    // Copy the om annotation file identified by the pii to the local filesystem on the worker
    dbutils.fs.cp(inputBase + pii + "/om", "file:" + localFile(pii,".om"))
    val om = FileUtils.openInputStream(new File(localFile(pii,".om")))
    // Copy the string file identified by the pii to the local filesystem on the worker
    dbutils.fs.cp(inputBase + pii + "/str", "file:" + localFile(pii,".str"))
    val str = FileUtils.openInputStream(new File(localFile(pii,".str")))
    // Generate the Stanford Core NLP annotations (includes and excludes)
    val results = stanfordCoreMarkup.process(om, str, includes, excludes)
    // Write the generated annotations to the local file system on the worker
    val scnlpFileName = "/" + pii + ".scnlp"
    val scnlpFile = new File(scnlpFileName)
    FileUtils.writeStringToFile(scnlpFile, results, "UTF-8")
    // Move the Stanford Core NLP annotations to S3
    dbutils.fs.mv("file:" + scnlpFileName, outputBase + pii + "/scnlp")
    // Delete the local om annotation and string files
    dbutils.fs.rm("file:" + localFile(pii,".om"))
    dbutils.fs.rm("file:" + localFile(pii,".str"))
  })
}).count // Action to force execution
| 49
Text Mining (Spark)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema definition for annotations
val annotationSchema = StructType(List(
  StructField("DocId", StringType, false),
  StructField("AnnotSet", StringType, false),
  StructField("AnnotType", StringType, false),
  StructField("Start", LongType, false),
  StructField("End", LongType, false),
  StructField("AnnotId", LongType, false),
  StructField("Other", StringType, true)))
// FlatMap to get one annotation per record; the path segment is the pii (DocId)
val baselineAnnotRDD = sc.wholeTextFiles("/mnt/els/darin/cat3/*/om")
val annotations = baselineAnnotRDD.flatMapValues(rec => rec.split("\n")).map(rec => (rec._1.split("/")(5), rec._2))
// Create a row per annotation
val annotationsRowRDD = annotations.map { v =>
  val arr = v._2.split("\\^")
  val docId = v._1
  val annotSet = arr(1)
  val annotType = arr(2)
  val start = arr(3).toLong
  val end = arr(4).toLong
  val annotId = arr(0).toLong
  var other: String = null
  if (arr.size == 6) {
    other = arr(5)
  }
  Row(docId, annotSet, annotType, start, end, annotId, other)
}
// Create the data frame
val annotationsDataFrame = sqlContext.createDataFrame(annotationsRowRDD, annotationSchema).cache()
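The next two slides showed this DataFrame being queried; a representative query of our own (not the one from the slides) might be:

// Count annotation types per document (hypothetical query, Spark 1.5-era API)
annotationsDataFrame.registerTempTable("annotations")
sqlContext.sql(
  "SELECT DocId, AnnotType, COUNT(*) AS cnt " +
  "FROM annotations GROUP BY DocId, AnnotType ORDER BY cnt DESC"
).show()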
| 50–51
Text Mining (Spark) (query and output screenshots)
| 52
Text Mining (Spark)
What else could you do?
Look at the Stanford Core NLP output
  – Frequency distribution of nouns, verbs, and other parts of speech
  – Combine with the OM annotations to do the same, scoped to an element
  – Develop smarter search engines
Also looking at GENIA and spaCy
Much, much more …
| 53
Text Mining at Elsevier with Spark
Very much a work in progress
AWS Lambda
  – Triggered by changes made to S3 buckets
  – Better for incremental changes
Spark
  – Great for processing everything at once
Annotation files to DataFrames and Datasets
Spark SQL
  – Further analysis
| 54
Some Random Closing Thoughts
IBM
  – Purchase of The Weather Company (weather.com)
  – Contest on IBM Bluemix (with weather data)
Machine Learning class (Stanford)
Statistical Learning (Trevor Hastie, Rob Tibshirani)
  – Free online book
Mastering Apache Spark GitBook (free)
  – jaceklaskowski.gitbooks.io/mastering-apache-spark/
| 55
Some Random Closing Thoughts
Spark 1.5.2 released yesterday
Google group for Spark meetup organizers
What’s good?
What could be better?
| 56
Plans for 2016
• Next meetup planned for February
• Focus for 2016?
• External speakers/vendors?
• High-level vs. Detailed?
Thank You
Curt Kohler
c.kohler@elsevier.com
Darin McBeath
d.mcbeath@elsevier.com