Data Science Rearch and Education at Digital Science Center@SOIC

advertisement

Data Science at

Digital Science Center@SOIC

November 5 2014

Geoffrey Fox

Judy Qiu gcf@indiana.edu

, xqiu@Indiana.edu

http://www.infomall.org

School of Informatics and Computing

Digital Science Center

Indiana University Bloomington

Digital Science Center Leadership

• Indiana University Faculty

• Geoffrey Fox, David Crandall, Judy Qiu, Gregor von Laszewski

Data Science Center Research Areas

• Digital Science Center Facilities

• RaPyDLI Deep Learning Environment

• HPC-ABDS and Cloud DIKW Big Data Environments

• Java Grande Runtime

• CloudIOT Internet of Things Environment

• SPIDAL Scalable Data Analytics Library

• Big Data Ogres Classification and Benchmarks

• Cloudmesh Cloud and Bare metal Automation

• Data Science Education with MOOC’s

DSC Computing Systems

• Working with SDSC on NSF XSEDE Comet System (Haswell)

• Adding 64-128 node Haswell based system (Juliet)

– 128-256 GB memory per node

– Substantial conventional disk per node (8TB) plus PCI based SSD

– Infiniband with SR-IOV

• Older machines

– India (128 nodes, 1024 cores), Bravo (16 nodes, 128 cores),

Delta(16 nodes, 192 cores), Echo(16 nodes, 192 cores), Tempest

(32 nodes, 768 cores) with large memory, large disk and GPU

– Cray XT5m with 672 cores

• Optimized for Cloud research and Large scale Data analytics exploring storage models, algorithms

• Bare-metal v. Openstack virtual clusters

• Extensively used in Education

NSF Data Science Project I

• 3 yr. XPS: FULL: DSD: Collaborative Research: Rapid Prototyping HPC

Environment for Deep Learning IU, Tennessee (Dongarra), Stanford

(Ng)

• “Rapid Python Deep Learning Infrastructure” (RaPyDLI) Builds optimized Multicore/GPU/Xeon Phi kernels (best exascale dataflow) with Python front end for general deep learning problems with

ImageNet exemplar. Leverage Caffe from UCB.

Large neural networks combined with large datasets (typically imagery, video, audio, or text) are increasingly the top performers in benchmark tasks for vision, speech, and Natural

Language Processing. Training often requires customization of the neural network architecture, learning criteria, and dataset pre-processing.

Classified

OUT

IN

NSF Data Science Project II

• 5 yr. Datanet: CIF21 DIBBs: Middleware and High

Performance Analytics Libraries for Scalable Data Science

IU, Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony

Brook (Wang), Arizona State(Beckstein), Utah(Cheatham)

HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data

Stack.

• SPIDAL (Scalable Parallel Interoperable Data Analytics

Library): Scalable Analytics for Biomolecular Simulations,

Network and Computational Social Science, Epidemiology,

Computer Vision, Spatial Geographical Information

Systems, Remote Sensing for Polar Science and Pathology

Informatics.

Big Data Software Model

Harp Plug-in to Hadoop

Make ABDS high performance – do not replace it!

Application

Framework

Resource

Manager

MapReduce

Applications

Map-Collective or Map-

Communication

Applications

MapReduce V2

Harp

YARN

1.20

1.00

0.80

0.60

0.40

0.20

0.00

0 20

100K points

40 60 80

Number of Nodes

200K points

100 120

300K points

140

Work of Judy Qiu and Bingjing Zhang.

Left diagram shows architecture of Harp Hadoop Plug-in that adds high performance communication, Iteration (caching) and support for rich data abstractions including key-value

Right side shows efficiency for 16 to 128 nodes (each 32 cores) on WDA-SMACOF dimension reduction dominated by conjugate gradient

Sequential

Parallel

Tweet

Clustering with Storm

Judy Qiu and Xiaoming Gao

Storm Bolts coordinated by

ActiveMQ to synchronize parallel cluster center updates

Speedup on up to 96 bolts on two clusters Moe and Madrid

Red curve is old algorithm; green and blue new

Java Grande and C# on 40K point DAPWC Clustering

Very sensitive to threads v MPI

C# Hardware 0.7 performance Java Hardware

64 Way parallel

128 Way parallel

C#

Java

256 Way parallel

TXP

Nodes

Total

Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data

Raw

Data

Storm

Pub-Sub

Data

System Orchestration / Dataflow / Workflow

Archival Storage – NOSQL like Hbase

Batch Processing (Iterative MapReduce)

Information Knowledge Wisdom

Streaming Processing (Iterative MapReduce)

Storm Storm Storm Storm

Decisions

Storm

Internet of Things (Smart Grid)

IOTCloud

• Device  Pub-Sub  Storm 

Datastore  Data Analysis

Apache Storm provides scalable distributed system for processing data streams coming from devices in real time.

• For example Storm layer can decide to store the data in cloud storage for further analysis or to send control data back to the devices

• Evaluating Pub-Sub Systems

ActiveMQ, RabbitMQ, Kafka,

Kestrel

Turtlebot and Kinect

Kafka Latency

RabbitMQ Latency

RabbitMQ outperforms

Kafka with

Storm

Big Data Ogres and their Facets

• 51 Big Data use cases: http://bigdatawg.nist.gov/usecases.php

Ogres classify Big Data Applications with facets and benchmarks

Facets I: Features identified from 51 use cases: PP(26), MR(18), MR-

Statistics(7), MR-Iterative(23), Graph(9), Fusion(11), Streaming/DDDAS(41),

Classify(30), Search/Query(12), Collaborative Filtering(4), LML(36), GML(23),

Workflow(51), GIS(16), HPC(5), Agents(2)

– MR MapReduce; L/GML Local/Global Machine Learning

Facets II: Some broad features familiar from past like

– BSP (Bulk Synchronous Processing) or not?

– SPMD (Single Program Multiple Data) or not?

– Iterative or not?

– Regular or Irregular ?

– Static or dynamic ?,

– communication/compute and I-O/compute ratios

– Data abstraction (array, key-value, pixels, graph…)

Facets III: Data Processing Architectures

14

Core Analytics Facet I

• Map-Only

• Pleasingly parallel - Local Machine Learning LML

• MapReduce:

• Search/Query/Index

• Summarizing statistics as in LHC Data analysis (histograms)

Recommender Systems (Collaborative Filtering)

• Linear Classifiers (Bayes, Random Forests)

• Alignment and Streaming Genomic Alignment, Incremental

Classifiers

• Global Analytics: Nonlinear Solvers (structure depends on objective function)

– Stochastic Gradient Descent SGD and approximations to Newton’s

Method

– Levenberg-Marquardt solver

Core Analytics Facet II

• Global Analytics: Map-Collective (See Mahout, MLlib)

Often use matrix-matrix,-vector operations, solvers (conjugate gradient)

Clustering (many methods), Mixture Models, LDA (Latent Dirichlet Allocation),

PLSI (Probabilistic Latent Semantic Indexing)

SVM and Logistic Regression

Outlier Detection (several approaches)

PageRank, (find leading eigenvector of sparse matrix)

SVD (Singular Value Decomposition)

MDS (Multidimensional Scaling)

• Learning Neural Networks (Deep Learning)

• Hidden Markov Models

• Graph Analytics (Global Analytics subset)

• Graph Structure and Graph Simulation

• Communities, subgraphs/motifs, diameter, maximal cliques, connected components, Betweenness centrality, shortest path

• Linear/Quadratic Programming , Combinatorial Optimization,

Branch and Bound

16

Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters

17

3D Phylogenetric Tree from WDA SMACOF

LC-MS Proteomics Mass Spectrometry

The brownish triangles are peaks outside any cluster.

The colored hexagons are peaks inside clusters with the white hexagons being determined cluster center

Fragment of 30,000 Clusters

241605 Points

19

Cloudmesh Software Defined System Toolkit

• Cloudmesh Open source http://cloudmesh.github.io/ supporting

– The ability to federate a number of resources from academia and industry. This includes existing FutureSystems infrastructure , Amazon Web Services, Azure,

HP Cloud, Karlsruhe using several IaaS frameworks

– IPython-based workflow as an interoperable onramp

Supports reproducible computing environments

Uses internally

Libcloud and

Cobbler

Celery Task/Query manager (AMQP -

RabbitMQ)

MongoDB

Scientific Impact of High End Resources

(XSEDE TAS)

• Is there some way to provide an indication about the impact of providing such facilities?

• Using EXTENSIVE Bibliometric data as criteria as mashup

• Sources: NSF, ISI Web of Science, (Google), XSEDE

• 140K publications, 20K XSEDE users, ~5K externally verified publications, 2M related publication database

• Metrics

• Number of Publications, citations, projects, users, researchers

• H-index, G-Index, I-index, …

• Correlation to externally vetted data, journal impact , …

• Unique data set to conduct extensive analysis.

• Previous effort only analyzed about 1% of the data

• We are not aware of similar comprehensive efforts.

Gregor von Laszewski

Fugang Wang

• Portal

• Users can look up their own data

• Generally useable and can be adapted this for your resources, department, ….

IU TAS

Interface

Layer

IU TAS

Publications

Mashup

IU TAS Architecture

REST API

IU TAS

Service

Layer

Portal

IU TAS

Publication

Mashup

IU TAS REST Services

NSF Award

DB Mining

3 rd Party

Queries

XSEDE

Portal

XD Entities

Mashup

IU NSF Awards

Publication data for XSEDE Users

UB TAS Databases &Reports

XDMoD

Warehous

XDcDB

Mirror

Publications &

Accounts

XSEDE Databases

XSEDE

Quarterly

Reports

POPS

Proposal

Data

NSF Database

NSF Awards

Original

Data Source

3 rd Party Data

Microsoft Academic

Search

Google Scholar

(User profiles)

ISI Web of Science

Citeseer, PUBMed,

ACM, IEEE, …

Mendeley

50

Comparing XSEDE Supported Publications with Peers

Top 1Q (%)

2Q(%)

3Q(%)

4Q(%)

25

0

BIOPHYSICAL

JOURNAL

JOURNAL OF

PHYSICAL

CHEMISTRY B

ASTROPHYSICAL

JOURNAL

JOURNAL OF

CHEMICAL

PHYSICS

JOURNAL OF THE

AMERICAN

CHEMICAL

SOCIETY

JOURNAL OF

CHEMICAL

THEORY AND

COMPUTATION

PHYSICAL REVIEW

D

PHYSICAL REVIEW

LETTERS

PHYSICAL REVIEW

B

MONTHLY

NOTICES OF THE

ROYAL

ASTRONOMICAL

SOCIETY

TOP 10 Overall

• ~ 1500 XSEDE supported publications appeared in these top 10 journals (by # of XSEDE supported publications published)

• Comparing each single publication with all peers appeared in the same issue. Get percentile ranking based on citation data (per ISI Web of Science data).

• Percentage of how many belongs to each quarter (top 25% to bottom 25%).

• In general trends towards higher quarter.

• Differences among fields (Physics, Astrophysics, Astronomical; Chemistry; etc.)

1

Data Science Definition from NIST Public Working Group

• Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and analytical hypothesis analysis.

• A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle.

See Big Data Definitions in http://bigdatawg.nist.gov/V1_output_docs.php

25

IU Data Science Program

• Program managed by cross disciplinary Faculty in Data

Science. Currently Statistics and Informatics and

Computing School (31 faculty) but will expand scope to full campus

• A purely online 4-course Certificate in Data Science has been running since January 2014 (with 70 students total in 2 semesters)

– 4 students will get certificate end of this semester

– Most students are professionals taking courses in “free time”

• A campus wide Ph.D. Minor in Data Science has been proposed.

• Courses labelled as “Decision-maker” and “Technical” paths where McKinsey says an order of magnitude more (1.5 million by 2018) unmet job openings in

Decision-maker track

McKinsey Institute on Big Data Jobs

• There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as

1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

• Decision Maker Path aimed at 1.5 million jobs. Technical Path covers the

140,000 to 190,000 http://www.mckinsey.com/mgi/publications/big_data/index.asp.

27

IU Data Science Program: Masters

• Masters Fully approved by University and State

October 14 2014 and starts January 2015

• Blended online and residential (any combination)

– Online offered at in-state rates (~$1100 per course)

Informatics, Computer Science, Information and

Library Science in School of Informatics and

Computing and the Department of Statistics, College

of Arts and Science, IUB

• 30 credits (10 conventional courses)

• Basic (general) Masters degree plus tracks

– Currently only track is “Computational and Analytic Data

Science ”

– Other tracks expected such as Biomedical Data Science

Background on MOOC’s

• MOOC’s are a “disruptive force” in the educational environment

– Coursera, Udacity, Khan Academy and many others

• MOOC’s have courses and technologies

• Google Course Builder and OpenEdX are open source MOOC technologies

• Blackboard, Canvas and others are learning management systems with (some) MOOC support

• The MOOC version of Fox’s Big Data Applications and Analytics course has ~2000 students enrolled.

• Coursera Offerings have much larger enrollment

29

http://x-informatics.appspot.com/course

Example

Google

Course Builder

MOOC

4 levels

Course

Section (12)

Units(29)

Lessons(~150)

Units are ~ traditional lecture

Lessons are ~10 minute segments

31

Example

Google

Course Builder

MOOC

The Physics

Section expands to 4 units and 2

Homeworks

Unit 9 expands to 5 lessons

Lessons played on Youtube

“talking head video +

PowerPoint”

The community group for one of classes and one forum (“No more malls”)

33

Office

Mix Site

Lectures

Made as ~15 minute lessons linked here

Metadata on

Microsoft Site

34

Potpourri of Online Technologies

Canvas (Indiana University Default): Best for interface with IU grading and records

Google Course Builder: Best for management and integration of components

Ad hoc web pages: alternative easy to build integration

Mix: Simplest faculty preparation interface

Adobe Presenter/Camtasia: More powerful video preparation that support subtitles but not clearly needed

Google Community: Good social interaction support

YouTube: Best user interface for videos

Hangout: Best for instructor-students online interactions (one instructor to 9 students with live feed). Hangout on air mixes live and streaming (30 second delay from archived YouTube) and more participants

35

Download