Is there an app for that - iPlant Pods

Applied CyberInfrastructure Concepts
ISTA 420/520 Fall 2013
Will Computers Crash Genomics? Science Vol 331 Feb 2011
Nirav Merchant (nirav@email.arizona.edu)
Bio Computing & iPlant Collaborative
Eric Lyons (ericlyons@email.arizona.edu)
Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/
1
PowerPoint Does Rocket Science--and Better Techniques for Technical Reports
-Essay by Edward Tufte
2
Topic Coverage
 Frontiers in Massive Data Analysis (Textbook)
 Focus of this course
 What is “Cyberinfrastructure”?
 What is “Big Data”?
 Who is a “Data Scientist”?
 Why should you care about it?
 Discussion: Resources, conduct, etc.
3
Simple Formula
+
=
4
The Reality
Perl, Python, Java, Ruby, Fortran, C, C#, C++, R, Matlab, etc.
+
+
Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc.
and lots of glue…
5
Simple Formula
+
=
7
Science Paradigms
1. Thousand years ago: science was empirical
describing natural phenomena,
observations
2. Last few hundred years: theoretical branch
using models, generalizations
3. Last few decades: a computational branch
simulating complex phenomena
4. Today: data exploration (eScience)
unify theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray
to the National Research Council – Computer Science and Telecommunications
Board in Mountain View, CA, on January 11, 2007
8
The Fourth Paradigm:
Data-Intensive Scientific Discovery
 Increasingly, scientific breakthroughs will be
powered by advanced computing capabilities that
help researchers manipulate and explore massive
datasets.
 The speed at which any given scientific discipline
advances will depend on how well its researchers
collaborate with one another, and with
technologists, in areas of eScience such as
databases, workflow management, visualization,
and cloud computing technologies.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
9
The Discovery Lifecycle
The Fourth Paradigm: Data-Intensive Scientific Discovery
10
Evolution of X-Info
 The evolution of X-Info and Comp-X for
each discipline X (e.g., Bio-Informatics,
Computational-Biology)
 How to codify and represent our knowledge
 The Generic Problems:
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share it with others
• Query and Vis tools
• Building and executing models
• Integrating data and literature
• Documenting experiments
• Curation and long-term
preservation
11
 Classic paradigm: You produce data,
analyze, interpret (end to end)
 Conventional paradigm: Consortium/centers
produce data and you consume it
 New paradigm: Consortium/centers have
produced data and are creating
“cyberinfrastructure” to tackle the “grand
challenge”
12
13
14
The “V”s of big data
Volume
Velocity
Variety
(Value)
Attributed to Gartner Consulting
15
Big Data
 Extracting meaningful results from vast
amounts of data (linked data)
 Big data “information assets” demand cost-effective,
innovative forms of information
processing for enhanced insight and
decision making.
 “Big Data” Is Only the Beginning of Extreme
Information Management
 Big Data Technology, All Is Not New
Attributed to Gartner Consulting
16
The transition (Data->Big data)
http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
17
The hype cycle (2014)
18
Dealing with the choices (ThoughtWorks)
http://assets.thoughtworks.com/assets/technology-radar-july-2014-en.pdf
http://www.thoughtworks.com/radar
20
Building Data Science Teams
 Technical expertise: the best data scientists
typically have deep expertise in some scientific
discipline.
 Curiosity: a desire to go beneath the surface
and discover and distill a problem down into a
very clear set of hypotheses that can be tested.
 Storytelling: the ability to use data to tell a story
and to be able to communicate it effectively.
 Cleverness: the ability to look at a problem in
different, creative ways.
D.J. Patil: characterizing data scientist qualities
21
EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
22
EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
23
EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
24
Big Data: Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
25
Rise of the “data janitors”
26
@danariely
@jeremyjarvis
27
So what is the course about?
 Provide you with key concepts to work with
“Big Data”
 Get familiar with the use of cyberinfrastructure
 Help you build your “tool chest” of simple
and easy to use resources
 Ability to work as a TEAM
 Having fun with cutting edge computing
infrastructure
 Working with REAL data and infrastructure !
 Take these pragmatic skills to your lab/job
28
 Abstraction:
Computational thinking (C.T.) is operating in terms of multiple
layers of abstraction simultaneously
C.T. is defining the relationships between the layers
 Automation:
C.T. is thinking in terms of mechanizing the
abstraction layers and their relationships
 Mechanization is possible due to precise and
exacting notations and models
 There is some “machine” below (human or computer,
virtual or physical)
 They give us the ability and audacity to scale.
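The abstraction and automation points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the task names (download, clean, analyze, plot, report) and the helpers run_task and run_workflow are hypothetical and do not come from the course material. The dictionary plays the role of the precise notation, and the loop that runs it plays the role of the "machine" below.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Notation layer: each (hypothetical) task lists the tasks it depends on.
workflow = {
    "download": set(),
    "clean": {"download"},
    "analyze": {"clean"},
    "plot": {"analyze"},
    "report": {"analyze", "plot"},
}


def run_task(name, log):
    """Stand-in for real work: it only records that the task ran."""
    log.append(name)


def run_workflow(graph):
    """Automation: derive a valid order from the notation, then execute it."""
    log = []
    for name in TopologicalSorter(graph).static_order():
        run_task(name, log)
    return log


order = run_workflow(workflow)
print(order)
```

Because the relationships are written down exactly, the same description could be handed to a different "machine" (a person, a laptop, a cluster scheduler) without changing the description itself.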
29
Focus for this course
 Project based learning class
 Introduce fundamental concepts, tools and
resources, best practices for effectively
managing common tasks associated with
analyzing large datasets
 Provide familiarity with cyberinfrastructure (CI)
resources available at the University of Arizona
campus, iPlant Collaborative, NSF XSEDE
centers, FutureGrid
30
Focus for this course
 Learn how to automate computation (tasks)
 Learn how to utilize distributed compute and
storage resources
 Adopt a “large-ish” dataset
(and do fun things with it)
 Build your tool chest for working with “Big Data”
 YOU will develop wiki based documentation of
these best practices
 YOU will learn how to effectively collaborate in
interdisciplinary team settings
COMPUTATIONAL THINKING
AND DOING!
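As a rough sketch of what automating a task and spreading it over several workers can look like, the Python below splits a dataset into chunks, runs the same task on each chunk in a worker pool, and combines the partial results. Everything in it is an assumption made for illustration: the toy records, the function names chunked, count_gc and total_gc, and the use of a local thread pool as a stand-in for distributed compute resources.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy, made-up records standing in for a "large-ish" dataset.
records = ["ACGT", "GGCC", "ATAT", "GCGC", "AACC", "TTGG"]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_gc(chunk):
    """Per-chunk task: count G and C characters in every record of the chunk."""
    return sum(rec.count("G") + rec.count("C") for rec in chunk)


def total_gc(items, chunk_size=2, workers=3):
    """Run the per-chunk task on a pool of workers, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_gc, chunked(items, chunk_size)))
    return sum(partials)


total = total_gc(records)
print(total)
```

The split, apply, combine shape is the part that carries over to real infrastructure; the local thread pool is only a placeholder for whatever executes the chunks.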
31
Topics we will emphasize
 Scalable Data Handling: iRODS
 Distributed Workflow Management:
Makeflow
 Visualization: ParaView
 Computing platforms: XSEDE, UA Campus,
FutureGrid
 iPlant CI
 Software Carpentry: Git, wiki etc.
 Stitching all of this together
32
Class logistics
 Grading based on:
 Assignments (~5)
 Group Projects
 Class participation
 Midterm (focused on key concepts and
problem solving)
 Graduates vs. undergraduates
 Demonstrated application towards novel
discovery (hopefully using their data)
 Mentorship to undergraduates
33
Where/What is the XYZ?
 Class documentation is on iPlant wiki
(google “iPlant wiki” and go to ACIC)
 If you don’t find it (search again), then write
your own
 Where are the PPT from class?
 How do I form a group?
 How do I turn in my homework?
 What if my group hates me or I hate them?
 How is this class different than last year?
34
far
 How difficult is this class?
 Do I need a laptop?
 Do I need to know Linux?
 Do I need to be a sysadmin?
 Do I need to know how to program in X language?
 Will this help me with my X project?
 Will all my jobs run faster (and can I
graduate sooner)?
 Can I audit this class, sit in, or sleep?
 Can I bring my X to class?
35
Pragmatic Cyberinfrastructure
(CI)
 Pragmatic*: Dealing with things sensibly and
realistically in a way that is based on practical
rather than theoretical considerations
 Cyberinfrastructure**: Research environments that
support advanced data analytics, data management,
data visualization and other computing and
information processing services distributed over
the Internet. These capabilities are beyond the
scope of a single institution to implement
Knowledgeable people and engaged community
are essential components for successful CI
* Oxford dictionary (from my mac)
** Wikipedia
(wikipedia)
 Research environments that support:
 advanced data acquisition
 data storage
 data management
 data integration
 data mining
 data visualization
 and other computing and information
processing services distributed over the
Internet beyond the scope of a single
institution
37
Man is the best computer we can put aboard a
spacecraft.
And the only one that can be mass produced
with unskilled labor.
-- Wernher von Braun
Greatest rocket scientist in history
http://earthobservatory.nasa.gov/Features/vonBraun/
38
39