Applied CyberInfrastructure Concepts
ISTA 420/520, Fall 2013

Will Computers Crash Genomics? Science Vol 331, Feb 2011

Nirav Merchant (nirav@email.arizona.edu), Bio Computing & iPlant Collaborative
Eric Lyons (ericlyons@email.arizona.edu), Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/

PowerPoint Does Rocket Science--and Better Techniques for Technical Reports
Essay by Edward Tufte

Topic Coverage
Frontiers in Massive Data Analysis (textbook)
Focus of this course
What is “Cyberinfrastructure”?
What is “Big Data”?
Who is a “Data Scientist”?
Why should you care about it?
Discussion: resources, conduct, etc.

Simple Formula
(figure) + (figure) = (figure)

The Reality
PERL, Python, Java, Ruby, Fortran, C, C#, C++, R, Matlab, etc.
+ (figure) +
Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc.
and lots of glue...

Simple Formula
(figure) + (figure) = (figure)

Science Paradigms
1. Thousand years ago: science was empirical, describing natural phenomena and observations
2. Last few hundred years: a theoretical branch using models and generalizations
3. Last few decades: a computational branch simulating complex phenomena
4. Today: data exploration (eScience), unifying theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunications Board in Mountain View, CA, on January 11, 2007

The Fourth Paradigm: Data-Intensive Scientific Discovery
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The Discovery Lifecycle
The Fourth Paradigm: Data-Intensive Scientific Discovery

Evolution of X-Info
The evolution of X-Info and Comp-X for each discipline X (e.g. Bio-Informatics, Computational-Biology)
How to codify and represent our knowledge
The Generic Problems:
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share it with others
• Query and Vis tools
• Building and executing models
• Integrating data and literature
• Documenting experiments
• Curation and long-term preservation

Classic paradigm: You produce data, analyze, and interpret it (end to end)
Conventional paradigm: Consortium/centers produce data and you consume it
New paradigm: Consortium/centers have produced data and are creating “cyber infrastructure” to tackle the “grand challenge”

The “V” of big data
Volume
Velocity
Variety
(Value)
Attributed to Gartner Consulting

Big Data
Extracting meaningful results from vast amounts of data (linked data)
Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
“Big Data” Is Only the Beginning of Extreme Information Management
Big Data Technology, all Is Not New
Attributed to Gartner Consulting

The transition (Data -> Big data)
http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

The hype cycle (2014)

Dealing with the choices (ThoughtWorks)
http://assets.thoughtworks.com/assets/technology-radar-july-2014-en.pdf
http://www.thoughtworks.com/radar

Building Data Science Teams
Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
Cleverness: the ability to look at a problem in different, creative ways.
D.J. Patil: characterizing data scientist qualities

EMC
http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

Big Data: Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Rise of the “data janitors”

@danariely @jeremyjarvis

So what is the course about?
Provide you with key concepts to work with “Big Data”
Get familiar with the use of Cyberinfrastructure
Help you build your “tool chest” of simple and easy-to-use resources
Ability to work as a TEAM
Having fun with cutting-edge computing infrastructure
Working with REAL data and infrastructure!
Take these pragmatic skills to your lab/job

Abstraction:
C.T. (computational thinking) is operating in terms of multiple layers of abstraction simultaneously
C.T. is defining the relationships between the layers
Automation:
C.T. is thinking in terms of mechanizing the abstraction layers and their relationships
Mechanization is possible due to precise and exacting notations and models
There is some “machine” below (human or computer, virtual or physical)
They give us the ability and audacity to scale.
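
The abstraction and automation points above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not course material: the task `gc_content`, the sample `reads`, and the two "machines" `run_serial` and `run_threaded` are hypothetical names. The top layer says what to compute, the bottom layer is the "machine" that runs it, and `analyze` defines the relationship between the two, so the machine can be swapped without touching the task.

```python
from concurrent.futures import ThreadPoolExecutor


# Top layer: the "what". The task is described without saying how it is run.
def gc_content(sequence):
    """Fraction of G and C bases in a DNA string (hypothetical example task)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


# Bottom layer: the "machine". Either function can run the same task.
def run_serial(task, items):
    """Apply the task to each item, one after another."""
    return [task(item) for item in items]


def run_threaded(task, items, workers=4):
    """Apply the task using a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, items))


# Middle layer: the relationship between the task and the machine below it.
def analyze(items, machine=run_serial):
    return machine(gc_content, items)


reads = ["GGCC", "ATAT", "GATC", ""]
serial_result = analyze(reads)
threaded_result = analyze(reads, machine=run_threaded)
print(serial_result)  # [1.0, 0.0, 0.5, 0.0]
```

Because the task layer never mentions threads, the same call could in principle be pointed at a larger machine (a cluster scheduler, for example) by writing another `machine` function, which is one way to read "the ability and audacity to scale".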

Focus for this course
Project-based learning class
Introduce fundamental concepts, tools and resources, and best practices for effectively managing common tasks associated with analyzing large datasets
Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, NSF XSEDE centers, and FutureGrid

Focus for this course
Learn how to automate computation (tasks)
Learn how to utilize distributed compute and storage resources
Adopt a “large-ish” dataset (and do fun things with it)
Build your tool chest for working with “Big Data”
YOU will develop wiki-based documentation of these best practices
YOU will learn how to effectively collaborate in interdisciplinary team settings
COMPUTATIONAL THINKING AND DOING!

Topics we will emphasize
Scalable data handling: iRODS
Distributed workflow management: Makeflow
Visualization: ParaView
Computing platforms: XSEDE, UA Campus, FutureGrid, iPlant CI
Software Carpentry: Git, wiki, etc.
Stitching all of this together

Class logistics
Grading based on:
Assignments (~5)
Group projects
Class participation
Midterm (focused on key concepts and problem solving)
Graduates vs. undergraduates:
Demonstrated application towards novel discovery (hopefully using their data)
Mentorship to undergraduates

Where/What is the XYZ
Class documentation is on the iPlant wiki (google “iPlant wiki” and go to ACIC)
If you don’t find it (search again), then write your own
Where are the PPTs from class?
How do I form a group?
How do I turn in my homework?
What if my group hates me or I hate them?
How is this class different than last year?

How difficult is this class?
Do I need a laptop?
Do I need to know LINUX?
Do I need to be a sysadmin?
Do I need to know how to program in X language?
Will this help me with my X project?
Will all my jobs run faster (and can I graduate sooner)?
Can I take this for audit, sit in, sleep?
Can I bring my X to class?

Pragmatic Cyberinfrastructure (CI)
Pragmatic*: Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations
Cyberinfrastructure**: Research environments that support advanced data analytics, data management, data visualization and other computing and information processing services distributed over the Internet. These capabilities are beyond the scope of a single institution to implement
Knowledgeable people and an engaged community are essential components for successful CI
* Oxford dictionary (from my mac)
** Wikipedia

Research environments that support:
advanced data acquisition
data storage
data management
data integration
data mining
data visualization
and other computing and information processing services distributed over the Internet beyond the scope of a single institution

Man is the best computer we can put aboard a spacecraft. And the only one that can be mass produced with unskilled labor.
-- Wernher von Braun
Greatest rocket scientist in history
http://earthobservatory.nasa.gov/Features/vonBraun/