An Architecture-based Framework For Understanding Large-Volume Data Distribution
Chris A. Mattmann
USC CSSE Annual Research Review
March 17, 2009

Agenda
• Research Problem and Importance
• Our Approach
  – Classification
  – Selection
  – Analysis
• Evaluation
  – Precision, Recall, Accuracy Measurements
  – Speed
• Conclusion & Future Work

Research Problem and Importance
• Content repositories are growing rapidly in size
• At the same time, we expect more immediate dissemination of this data
• How do we distribute it…
  – In a performant manner?
  – Fulfilling system requirements?
[Chart: NASA Planetary Data System Archive Volume Growth. x-axis: Year (1990 to 2008); y-axis: accumulated volume in TBytes (0 to 90).]

Data Distribution Scenarios
• A medium-sized volume of data, e.g., on the order of a gigabyte, needs to be delivered across a LAN, using multiple delivery intervals consisting of 10 megabytes of data per interval, to a single user.
• A Backup Site periodically connects across the WAN to the Digital Movie Repository to back up its entire catalog and archive of over 20 terabytes of movie data and metadata.

Data Distribution Problem Space

Insight: Software Architecture
• The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition

Data Distribution Systems
[Diagram: a Data Producer component sends data into an unidentified ("???") element.]
[Diagram, continued: the "???" element is a Connector that delivers the data to four Data Consumer components.]
• Insight: Use software connectors to model data distribution technologies

Impact of Data Distribution Technologies
• Broad variety of data distribution technologies
• Some are highly efficient, some more reliable
• P2P, Grid, Client/Server, and Event-based
• Some are entirely appropriate to use, some are not appropriate

Data Movement Technologies
• Wide array of available OTS "large-scale" connector technologies
  – GridFTP, Aspera, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW, and more
• Which one is the best one?
• How do we compare them
  – Given our current architecture?
  – Given our distribution scenarios & requirements?

Research Question
• What types of software connectors are best suited for delivering vast amounts of data to users in these hugely distributed data systems, in a manner that satisfies the users' particular scenarios and is performant and scalable?

Broad variety of distribution connector families
• P2P, Grid, Client/Server, and Event-based
• Though each connector family varies slightly in some form or fashion
  – They all share 3 common atomic connector constituents
    • Data Access, Stream, Distributor
    • Adapted from our group's ICSE2000 Connector Taxonomy

Connector Tradeoff Space
• Surveyed properties of 13 representative distribution connectors, across all 4 distribution connector families, and classified them
  – Client/Server
    • SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP, Commercial UDP Technology
  – Peer to Peer
    • Bittorrent
  – Grid
    • GridFTP, bbFTP
  – Event-based
    • GLIDE, Siena

Large Heterogeneity in Connector Properties
[Four bar charts, each plotting the number of connectors (y-axis) per property value (x-axis):]
  – Procedure Call Connector Breakdown (5 connectors, 2 families): proc_call_params_return_value, proc_call_cardinality_senders, proc_call_invocation_explicit, proc_call_params_invocation_record, proc_call_params_datatransfer, proc_call_accessibility, proc_call_semantics
  – Data Access Connector Breakdown (8 connectors, 4 families): data_access_locality, data_access_persistence, data_access_avail_transient, data_access_cardinality_receivers, data_access_accesses, data_access_cardinality_senders
  – Stream Connector Breakdown (8 connectors, 4 families): stream_formats, stream_cardinality_send, stream_localities, stream_deliveries, stream_throughput, stream_cardinality_receiv, stream_state, stream_identity, stream_bounds, stream_synchronicity, stream_buffering
  – Distributor Connector Breakdown (8 connectors, 4 families): distributor_routing_membership, distributor_delivery_type, distributor_naming_type, distributor_naming_structures, distributor_routing_type, distributor_delivery_semantics, distributor_routing_path, distributor_delivery_mechanisms

How do experts make these decisions?
• Performed survey of 33 "experts"
• Experts defined to be
  – Practitioners in industry, building data-intensive systems
  – Researchers in data distribution
  – Admitted architects of data distribution technologies
• General consensus?
  – They don't know the how and the why about which connector(s) are appropriate
  – They rely on anecdotal evidence and "intuition"

Expert Survey Demographic
[Pie chart: respondent backgrounds. Categories: Cancer Research, Planetary Science, Earth Science, Industry, Grid Computing, Professors, Web Technologies, Open Source, Students. Slice values as extracted: 6%, 6%, 12%, 18%, 6%, 12%, 22%, 12%, 6%.]
• 45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.

Percentage Breakdown of Expert Responses
[Pie chart. Categories: No Response, Not Comfortable, No Time, Full Response. Slice values as extracted: 3%, 15%, 15%, 67%.]

Why is it bad to have these types of experts?
• They employ a small set of COTS and/or pervasive distribution technologies, and stick to them
  – Regardless of the scenario requirements
  – Regardless of the capabilities at users' institutions
• They lack a comprehensive understanding of benefits/tradeoffs amongst available distribution technologies
  – They have "pet technologies" that they have used in similar situations
  – These technologies are not always applicable and frequently only satisfy one or two scenario requirements and ignore the rest

Our Approach: DISCO
• Develop a software framework for:
  – Connector Classification
    • Build metadata profiles of connector technologies, describing their intrinsic properties (DCPs)
  – Connector Selection
    • Adaptable, extensible algorithm development framework for selecting the "right" connectors (and identifying wrong ones)
  – Connector Selection Analysis
    • Measurement of accuracy of results
  – Connector Performance Analysis

DISCO in a Nutshell

Scenario Language
• Describes distribution scenarios
[Diagram: dimensions of a Data Distribution scenario and their value types]
  – Total Volume: e.g., 10 MB, 100 GB, etc. (int + higher-order unit)
  – Delivery Schedule: Number of Intervals (e.g., 1, 10; int), Volume Per Interval, Timing of Interval
  – Access Policies: e.g., SSL/HTTP 1.0, Linux File System Perms (string from controlled value range)
  – Geographic Distribution: WAN, LAN
  – Performance Requirements: Scalability, Dependability, Consistency, Efficiency (1-10, computed scale)
  – Number of Users: e.g., 1, 10 (int)
  – Number of User Types: e.g., 1, 10 (int)
  – Number of Data Types: e.g., 1, 10 (int)
  – Producers: Automatic, Initiated
  – Consumers: Automatic, Initiated
  – Types of Data: Data, Metadata

Distribution Connector Model
• Developed model for distribution connectors
• Identified the combination of primitive connectors that a distribution connector is made from

Distribution Connector Model
• Model defines important properties of each of the important "modules" within a distribution connector
• Defines value space for each property
• Defines each property
• Properties are based on the combination of underlying "primitive" connector constituents
• Model forms the basis for a metadata description (or profile) of a distribution connector

Selection Algorithms
• So far
  – Let data system architects encode the data distribution scenarios within their system using the scenario language
  – Let connector gurus describe important properties of connectors using architectural metadata (connector model)
• Selection Algorithms
  – Use scenario(s) and connector properties to identify the "best" connectors for the given scenario(s)

Selection Algorithms
• Formal Statement of the problem

Selection Algorithms
• Selection algorithm interface
[Diagram: a scenario and the Connector KB feed the selection algorithm interface, which outputs a ranked list of (connector, rank) pairs.]
This interface is desirable because it allows a user to rank and compare how "appropriate" each connector is, rather than having a binary decision.
(bbFTP, 0.157)
(FTP, 0.157)
(GridFTP, 0.157)
(HTTP/REST, 0.157)
(SCP, 0.157)
(UFTP, 0.157)
(Bittorrent, 0.021)
(CORBA, 0.005)
(Commercial UDP Technology, 0.005)
(GLIDE, 0.005)
(RMI, 0.005)
(Siena, 0.005)
(SOAP, 0.005)

Selection Algorithm Approach
• White box
  – Consider the internal properties of a connector (e.g., its internal architecture) when selecting it for a distribution scenario
• Black box
  – Consider the external (observable) properties of the connector (such as performance) when selecting it for a distribution scenario

Develop complementary selection algorithms
• Software architects fill out Bayesian domain profiles containing conditional probabilities
  – Likelihood that a connector, given attribute A and its value, and given a scenario requirement, is appropriate for scenario S
• Users familiar with connector technologies develop score functions
  – Relating observable properties (performance reqs) of a connector to scenario dimensions

Selection Analysis
• How do we make decisions based on a rank list?
• Insight: looking at the rank list, it is apparent that many connectors are similarly ranked, while many are not
  – Appropriate versus Inappropriate?
Selection Analysis
[Figure: the rank list below, annotated with "appropriate" and "inappropriate" groups]
(bbFTP, 0.15789473684210525)
(FTP, 0.15789473684210525)
(GridFTP, 0.15789473684210525)
(HTTP/REST, 0.15789473684210525)
(SCP, 0.15789473684210525)
(UFTP, 0.15789473684210525)
(Bittorrent, 0.02105263157894737)
(CORBA, 0.005263157894736843)
(Commercial UDP Technology, 0.005263157894736843)
(GLIDE, 0.005263157894736843)
(RMI, 0.005263157894736843)
(Siena, 0.005263157894736843)
(SOAP, 0.005263157894736843)

Selection Analysis
• Employed k-means data clustering algorithm
  – k parameter defines how many sets the data is partitioned into
• Allows for clustering of data points (x, y) around a "centroid" or mean value
• We developed an exhaustive connector clustering algorithm based on k-means
  – clusters connectors into 2 groups, appropriate and inappropriate
  – uses connector rank value as y parameter (x is the connector name)
  – exhaustive in the sense that it iterates over all possible connector clusters (vanilla k-means is heuristic & possibly incomplete)

Tool Support
• Allows a user to utilize different connector knowledge bases, configure selection algorithms, execute them, and visualize their results

Decision Process
[Chart: reported values 87% and 80.5%]
• Precision - the fraction of connectors correctly identified as appropriate for a scenario
• Accuracy - the fraction of connectors correctly identified as appropriate or inappropriate for a scenario

Decision Process: Speed

Conclusions & Future Work
• Conclusions
  – Domain experts (gurus) rely on tacit knowledge and often cannot explain design rationale
  – DISCO provides a quantification of & framework for understanding an ad hoc process
  – Bayesian algorithm has a higher precision rate
• Future Work
  – Explore the tradeoffs between white-box and black-box approaches
  – Investigate the role of architectural mismatch in connectors for data system architectures

Thank You! Questions?
Backup

Related Work
• Software Connectors
  – Mehta00 (Taxonomy), Spitznagel01, Spitznagel03, Arbab04, Lau05
• Data Distribution/Grid Computing
  – Crichton01, Chervenak00, Kesselman01
• COTS Component/Connector selection
  – Bhuta07, Mancebo05, Finkelstein05
• Data Dissemination
  – Franklin/Zdonik97
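
Sketch: exhaustive two-way clustering of a rank list
The Selection Analysis slides describe an exhaustive, k-means-based split of the rank list into appropriate and inappropriate connectors. The Python sketch below is one possible reading of that idea, not the DISCO implementation. It tries every way of dividing the connectors into two non-empty groups and keeps the division with the smallest total squared distance between each rank and its group mean (the usual k-means objective). The function names `sse` and `split_exhaustively`, the squared-distance cost, and the rule that the group with the higher mean rank is labeled appropriate are assumptions made for illustration. The rank values are the ones shown on the Selection Analysis slide.

```python
from itertools import combinations

# Rank list as shown on the Selection Analysis slide.
RANKS = {
    "bbFTP": 0.15789473684210525,
    "FTP": 0.15789473684210525,
    "GridFTP": 0.15789473684210525,
    "HTTP/REST": 0.15789473684210525,
    "SCP": 0.15789473684210525,
    "UFTP": 0.15789473684210525,
    "Bittorrent": 0.02105263157894737,
    "CORBA": 0.005263157894736843,
    "Commercial UDP Technology": 0.005263157894736843,
    "GLIDE": 0.005263157894736843,
    "RMI": 0.005263157894736843,
    "Siena": 0.005263157894736843,
    "SOAP": 0.005263157894736843,
}


def sse(values):
    """Sum of squared distances from each value to the group mean (centroid)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_exhaustively(ranks):
    """Return (appropriate, inappropriate) connector names.

    Hypothetical helper: enumerates every split of the connectors into two
    non-empty groups and keeps the one with the lowest total within-group
    squared error. The group with the higher mean rank is called appropriate.
    """
    names = sorted(ranks)
    best = None  # (cost, group_a, group_b)
    for size in range(1, len(names)):
        for group in combinations(names, size):
            rest = [n for n in names if n not in group]
            cost = sse([ranks[n] for n in group]) + sse([ranks[n] for n in rest])
            if best is None or cost < best[0]:
                best = (cost, list(group), rest)
    _, group_a, group_b = best

    def mean_rank(group):
        return sum(ranks[n] for n in group) / len(group)

    if mean_rank(group_a) >= mean_rank(group_b):
        return group_a, group_b
    return group_b, group_a


if __name__ == "__main__":
    appropriate, inappropriate = split_exhaustively(RANKS)
    print("appropriate:  ", appropriate)
    print("inappropriate:", inappropriate)
```

On this rank list the sketch places the six connectors ranked near 0.158 in the appropriate group and the remaining seven in the inappropriate group. With 13 connectors the loop examines 2^13 - 2 = 8190 candidate groups, which is cheap, but the count doubles with every added connector, so an exhaustive search of this kind only suits small connector knowledge bases.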