A View From the Trenches: Open-Source, Data-Intensive Software Chris A. Mattmann

advertisement
A View From the Trenches:
Open-Source, Data-Intensive Software
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
Roadmap
• 1st part of the talk
– The importance and concerns of open-source
software
• 2nd part of the talk
– NASA Earth Science Data Systems
– Real-world open-source software and its use in
various projects
9-Mar-11
ARR-MATTMANN
2
And you are?
• Architect/Developer at
NASA JPL in Pasadena,
CA
• Software
Architecture/Engineerin
g Prof at USC
• Apache Member involved in
– OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC),
Incubator (PMC), SIS (Mentor), Lucy (Mentor) and
Gora (Champion), MRUnit (Mentor)
9-Mar-11
ARR-MATTMANN
3
Open Source Software Development
• It’s here
• Chances are if you build
software nowadays, are you
are using some type of
open source software
• Let me give you some
context…
9-Mar-11
ARR-MATTMANN
4
The NASA ESDS Context
Where is open
source most
useful?
9-Mar-11
Which area should
produce
open source
ARR-MATTMANN
5
Concerns in the Open Source World
• Licensing
– GPL(v2, v3?), LGPL(v?), BSD, MIT, ASLv2
– Your own custom license approved by OSS
• NASA OSS license?
• Caltech license?
– Copy-left versus Copy-right
• Redistribution
– Can you take open source product X and use it in your commercially interested
software Y?
• If so, do you have to pay for it?
– Should others pay for your open source product if they use it in their
commercial application?
• Open Source “Help Desk” Syndrome versus Community
– Are you trying to simply make your open source software (releases) available
for distribution (aka help desk)?
– Are you trying to get others to “buy in” to your open source software?
9-Mar-11
ARR-MATTMANN
6
Concerns in the Open Source World
• Intellectual Property
– Who owns it?
– How does the Open Source Software affect your IP?
• Open Source Ecosystems
– Where can you find the “killer app” you need?
– Which communities are conducive for longevity?
– How relevant are “generic” open source software communities to NASA Earth
Science Data Systems?
• Contributing
–
–
–
–
Are you even allowed to contribute to a OSS community?
Can you do it on “company” time?
What’s required?
What’s the governance?
• Responsiveness
– How response is the OSS community to your projects’ needs?
9-Mar-11
ARR-MATTMANN
7
Concerns in the Open Source World
• Help/Guidance
– Is the OSS community/project alive?
– How can you tell whether the project is alive?
– How can you “follow the rainbow” to the OSS pot of gold?
• Interaction
– What are the best practices for interacting with OSS communities?
• Implementation strategies
– Insulation: how do you insulate your project from OSS change?
– Configuration Management: what are the important CM issues when
using Open Source Software in NASA ESDS?
• Architectural strategies
– How to design your system to take advantage of OSS?
• Legal strategies
– How to avoid getting sued by Oracle some huge tech company?
9-Mar-11
ARR-MATTMANN
8
The NASA ESDS Context
The aforementioned OSS
concerns are cross cutting
against the whole ESDS
enterprise!
9-Mar-11
ARR-MATTMANN
9
What is industry doing?
• Big Data?
– They got it
– Bigger than NASA even!
– How about processing
1 PB/day?
• How are they
doing it?
– Open Source
– “Big Data”
technologies
• What is JPL doing to be part of this?
– That’s my goal!
– Actively representing the science mission needs
9-Mar-11
ARR-MATTMANN
10
Licensing
• Relates to: redistribution, intellectual property,
contributing, legal strategies
• There are tons of OSS approved licenses
– What’s the difference between them?
• The difference mostly has to do with
– Commercialization
– Redistribution
– Attribution
• Let’s take a few examples
9-Mar-11
ARR-MATTMANN
11
Some OSS licenses
•
Apache
License
V2
– Allows (unrestricted):
•
•
•
•
•
Redistribution (or not)
Commercialization
Must keep ASL
headers and
NOTICE file bearing
ASL attribution
Contributors to ASF
sign Apache CLAs
or CCLAs
BSD License
– Allows
•
•
•
•
9-Mar-11
Redistribution (or not)
Commercialization
Must include header
in code or in NOTICE
0 contrib. restrictions
ARR-MATTMANN
12
Some OSS licenses that aren’t so
friendly
• GPL(v2, v3), LGPL
– Allows (restricted):
• Redistribution
– Based on implied commerciality and copyleft
• Commercialization
– May need to pay license fees
• Standard attribution requirements (NOTICE or in source code
headers) based on copyleft
– “Copy left” Syndrome
• Software must be redistributed under the same original terms of
the any upstream author.
• Author is not free to decide redistribution terms, or even source
code inclusion terms (except under fair use which supersedes)
9-Mar-11
ARR-MATTMANN
13
Why licenses are important
• “War Story”
– Amazon EC2, S3
• Johnson and Johnson Pharm. R&D
– At a recent conference I met the director of R&D for J&J. He presented
a story wherein which J&J needed large bursting processing and
limited data storage for some drug tests they were conducted. They
decided to use Amazon EC2. After reviewing Amazon’s licensing policy
for EC2 J&J’s laywers determined that Amazon claimed IP for any data
or computational results produced on its cloud. Since the need for
Amazon’s processing and cloud was limited to a few trials, and since
the costs were so outrageous to stand up its own cluster for these
experiments, J&J decided to forge ahead with Amazon with the
understanding that its lawyers would “duke it out.” should the need
arise with Amazon’s lawyers based on the FOU restrictions and IP
claims induced by Amazon EC2.
9-Mar-11
ARR-MATTMANN
14
Why licenses are important
• “War Story 2”
– Oracle versus Google
• Is there really free Java with your free lunch?
– Although there is no definitive answer besides papers filed in
court, Oracle’s claims in its lawsuit are based on perceived
patents for the Java Virtual Machine and its associated IP. Sun
originally filed patents on the Java Virtual Machine and its
translation of programming language code into runtime
executables. In order for a JVM to be a “certified” (read:
trademark) JVM, the JVM must pass a “Test Compatibility Kit”
(TCK), which Oracle/Sun license at a cost to JVM vendors. By
purchasing the TCK, a JVM builder is given IP knowledge of the
JVM patents. If a company builds a JVM but does not purchase
the TCK, Oracle/Sun loses licensing dollars.
9-Mar-11
ARR-MATTMANN
15
What does OSS licensing mean to
Software Development?
• Awareness
–
–
–
–
Know your distribution requirements
Know your contributor expectations
Know your field of use expectations
Know your commercialization expectations
• Leverage licenses that give the most leeway in all
important categories above
– Allows the decision on some of the above aspects to be
delayed without fear of penalty or prejudice
• Friendly licenses: ASLv2, BSD, MIT
• Know what you are signing *before* you sign it (or use
it)
9-Mar-11
ARR-MATTMANN
16
Redistribution
• So you’ve built some awesome piece of software
• And you’re wondering
– What are my options for distributing it?
– Question: what license are you going to choose?
– Hopefully one that supports redistribution under your
own terms!
• How to redistribute the software?
– Requires infrastructure
– Who’s going to set it up?
9-Mar-11
ARR-MATTMANN
17
OSS Redistribution Infrastructure
Issue
tracking
Portals
Source code
repository
Repository browsing
Acceptance testing
Mailing lists
9-Mar-11
ARR-MATTMANN
18
OSS Ecosystems
• Where should you go to for your open source project?
• Should NASA have its own?
• Should your project (SIPS, DAAC, proposal) have its own?
9-Mar-11
ARR-MATTMANN
19
OSS Ecosystems
• FTR, there are tons of concerns here
– Should the ecosystem impose license restrictions?
• Only support license X or Y?
–
–
–
–
–
What are the redistribution policies?
What’s the community?
What’s the IP?
What’s the infrastructure support?
Is it a foundation with rules, or just free for all?
• Elephant in the room: What is the exposure? How many people are going
to community C and are going to see your software and perhaps want to
use it, improve it, file bugs against it, file patches, etc.?
• Where/how does this matter to you?
– Standing up our own OSS ecosystem may make sense for large, coarse-grained
federations
– A LOT more difficult to justify OTOH, for fine grained components, and module reuse
9-Mar-11
ARR-MATTMANN
20
Community versus “Help Desk”
• How do you want to have your open source software
project run in the open?
– Do you expect folks to come to you and only file bugs?
•
•
•
•
Then you and your team are the only ones who can fix them?
Then you and your team are the only ones who can release updates?
”Help Desk” open source project
Examples: Sourceforge.net, Google Code, etc.
– Do you expect to grow a community of interest where
volunteers actively engage in software development?
• Are volunteers empowered to pick up a shovel and help dig the hole?
• Can volunteers (including your own paid employees) file issues?
• Do you want to give the community a stake/vote in the overall
process?
•  “Community Building” open source project
• Examples: Apache Software Foundation, Eclipse Foundation, etc.
9-Mar-11
ARR-MATTMANN
21
Communities for your organization
• Working on common software
– Measured not in terms of what center contributor works for, but
in terms of
•
•
•
•
•
Number of patches contributed (high quality)
Mailing list questions answered
# of releases made, or helped with
Tests written
Documentation added
– Take the politics out of it and just work on “core” common code
of mutual interest
• Deciding on the right redistribution mechanism and license
– Apache Software Foundation and ASLv2 provide openness and
ability for center-local redistribution, commerciality and other
decisions (for internal distributions and beyond)
9-Mar-11
ARR-MATTMANN
22
Our experience: Apache OODT
• Front Page
article on
NASA.gov
– Free and available
to download
• First true
NASA
FOSS
technology
– Distributed at
Apache
• Featured on
Slashdot
9-Mar-11
ARR-MATTMANN
23
Free as in beer
•
“Something like the Apache foundation is the best place for released government software.
A previous attempt at release and public distribution via a private company was a truly
dismal failure. OpenPBS (portable batch system) is supposed to be available to anyone that
asks. However when you do ask a sales rep strings you along for more than a month trying
to sell you something that they can't actually assure you will fit your requirements (and is
no longer under development) even when the free one is documented as doing so. It was a
truly stupid waste of the salesperson's time and mine that would have exceeded the price
of providing the file for download or sending by email by several orders of magnitude and
generated a lot of ill will. I'll go as far as saying it was blatant false advertising using a
government funded open source product to do a bait and switch to try to sell me an
unmaintained product they picked up in a corporate take over. My experience appears to
have been identical to that of many that attempted to obtain this government funded open
source software that NASA had declared was available for anyone. Eventually due to this
open source project becoming closed the project just had to fork and the compatible
Torque batch system was developed by people that had actually get hold of the original
OpenPBS.”
– Slashdot user comment
9-Mar-11
ARR-MATTMANN
24
So, again, why do we care?
• Software Engineering Research
Challenges/Questions
– Students are using open source
– They will inevitably see it in their projects, will use it
to build out next generation infrastructure, etc.
• Practitioner challenges
– I still hear the “I just picked one” syndrome
• We need to better train software engineering
students and practitioners to understand the
dimensions of open source software
9-Mar-11
ARR-MATTMANN
25
2nd part of the Talk
• Some context and real-examples
9-Mar-11
ARR-MATTMANN
26
NASA Ground Data Systems
Credit: D. Woollard
9-Mar-11
ARR-MATTMANN
27
Context
• NASA develops science data processing systems for
multiple earth science missions
• These systems convert the instrument telemetry
delivered to earth from space into useful data for
scientific research
• Typical characteristics
– Remote sensing instruments that orbit the Earth multiple
times daily
– Data are acquired constantly
– Complex algorithms convert instrument measurements to
geophysical quantities
9-Mar-11
ARR-MATTMANN
28
Some Key Open Source Technologies in
Use to Enable Science Data
• Apache Tomcat
– Core
App
Server
for many of our Data Web Services and Web Apps
• Apache HTTPD
– Powers front-facing portals, redirects to app servers (like Tomcat),
vhosting, as well as fronting for Plone
• Apache Tika
– Metadata
extraction, content identification
• Apache Hadoop
Also using Apache
OODT to link all of
these specific
technologies together!
– Cloud Storage
9-Mar-11
ARR-MATTMANN
29
Application to planetary science
•
•
•
•
Often unique, one of a kind missions
– Can drive technological changes
Apache OODT, Solr
Instruments are competed and
developed by academic and industry
– Highly distributed acquisition and processing across
partner organizations
– Highly diverse data sets given heterogeneity of the
instruments and the targets (i.e. solar system)
Missions are required to share science
data results:
– Common domain information model used to drive
system implementations
– Expert scientific help to the user community on using
the data
– Peer-review of data results to ensure quality
All planetary science data results from
NASA (and some international)
missions is deposited into the Planetary
Data System, a federation of nodes in
the USA and other systems
9-Mar-11
ARR-MATTMANN
internationally
Planetary Data System
Distributed Planetary Science Archive
Rings Node
Ames Research Center
Moffett Field, CA
Geosciences Node
Washington University
St. Louis, MO
Imaging Node
JPL and USGS
Pasadena, CA and Flagstaff, AZ
THEMIS Data Node
Arizona State University
Tempe, AZ
Central Node
Jet Propulsion Laboratory
Pasadena, CA
Small Bodies Node
University of Maryland
College Park, MD
Atmospheres Node
New Mexico State University
Las Cruces, NM
Planetary Plasma Interactions Node
University of California Los Angeles
Los Angeles, CA
Navigation Ancillary Information Node
Jet Propulsion Laboratory
Pasadena, CA
DJC-30
Application for Earth Missions
•
Leveraged OODT software framework for
constructing ground data systems for earth
science missions
–
–
•
Execution of “processors” based on a set of rules
Explicit separation of workflow management
from management of computational resources
Provided “lights out” operations
User Interface (Process Monitoring & Control, Instrument Command ing, Data Verification)
Multiple Missions
–
–
–
–
–
SeaWinds
QuikSCAT
Orbiting Carbon Observatory (OCO), OCO-2…
NP Sounder PEATE
SMAP
PreProce ssors
(PP)
En gi ne e rin g
An al ysis
(EA)
S cie n ce
Le ve l
Proce ssors
(LP)
S cie n ce
An al ysis
an d
Q u ality
Re portin g
(S A)
Spacecraft
& Ancillary
Files
Science
Products
Released
to
PO.DAAC
Product Delivery (PM)
Instrument
Commands
File Transfer (FX)
•
SeaWinds on
ADEOS II
(Launched
Dec 2002)
Used OODT Catalog and Archive Service
software
Focus is on Workflow Management
Constructed “workflows”
–
–
•
Apache OODT, Lucene, Tika,
Tomcat, HTTPD
Data Management and Automatic Process Control (PM) using OODT
Credit: D. Freeborn, C. Mattmann, D. Woollard
9-Mar-11
ARR-MATTMANN
DJC-31
Application to Airborne Science
Missions
Apache OODT, Solr,
Tika
A full service stack is deployed for each
mission, utilizing any mission-provided
resources as well as a cloud computing
infrastructure.
Credit:
D. Freeborn, D. Woollard, E. Law, D. Crichton,
L. Kay-Im,
9-Mar-11
ARR-MATTMANN
Mission proprietary data is presented in a
mission-specific secure portal while publicly
available data is aggregated in the public portal.
32
32
Earth System Grid Federation
•
•
•
•
•
DOE-funded federation to distribute climate model output to the climate modeling community
Common services for access to repositories and portals/gateways
Highly decoupled
Open source framework (software packaged and distributed) mandated by DOE SciDAC Program
A Recent question….how do you link observations and climate model output?
Apache Solr
9-Mar-11
ARR-MATTMANN
33
Application to Climate Research
(…Climate Data Exchange)
Specific Tools
(H2O, CO2, …)
Apache OODT,
Apache Solr
9-Mar-11
Credit: A. Braverman, C. Mattmann, D. Crichton,
L. Cinquini, M. Cayanan
ARR-MATTMANN
34
Application to Cancer Research: Early
Detection Research Network
Apache OODT,
From: distributed research databases
•EDRN has pioneered the use of
informatics technologies to support
biomarker research
Apache Solr
•EDRN has developed a
comprehensive infrastructure to
support biomarker data
management across EDRN’s
distributed cancer centers
• It supports capture and access to
a diverse collection of distributed
sets of information and results
based on a core ontology for
biomarker research
9-Mar-11
Credit: D. Crichton, C. Mattmann, S. Kelly
ARR-MATTMANN A. Hart, H. Kincaid, S. Hughes
35
Application to Health Informatics:
Virtual Pediatric Intensive Care Unit
• Principal collaboration with Childrens Hospital Los
Angeles to develop a research network to capture and
share data sets from pediatric intensive care units
• CHLA has established a collaboration across 85
hospitals
Apache OODT,
Apache Solr
• Grant funded by the National Library of Medicine to
research an informatics infrastructure and validate
methods to support data analysis
– Ultimate goal is to improve decision support through an
integrated network of information and researchers
9-Mar-11
ARR-MATTMANN
36
The Square Kilometer Array
• 1 sq. km of
antennas
• Never-before
seen
resolution
looking into
the sky
• 700 TB
Apache OODT,
Apache Solr
– Per second!
9-Mar-11
ARR-MATTMANN
37
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– mattmann@apache.org
– @chrismattmann on Twitter
9-Mar-11
ARR-MATTMANN
38
Acknowledgements
• Some Tika material inspired by Jukka Zitting’s
talks
– http://www.slideshare.net/jukka/text-and-metadataextraction-with-apache-tika
– http://www.slideshare.net/jukka/text-and-metadataextraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
– OODT Team
• Dan Crichton’s keynote from ApacheConNA 2011
9-Mar-11
ARR-MATTMANN
39
2011 Workshop on Software
Engineering for Cloud Computing
(SECLOUD)
• This is the
workshop at
ICSE that Craig
was talking
about
• 10 accepted
research
papers and 5
“demos”
• Come join us in
Hawaii!
https://sites.google.com/site/icsecloud2011/
9-Mar-11
ARR-MATTMANN
40
Book
• Jukka Zitting and I are writing
a book on Tika
– Working on Chapters 14
and 15 of 15
• Early Access available
through MEAP
program
• http://manning.com/mattmann/
9-Mar-11
ARR-MATTMANN
41
Download