An Architecture-based Framework For Understanding Large-Volume Data Distribution Chris A. Mattmann

USC CSSE Annual Research Review
March 17, 2009
Agenda
• Research Problem and Importance
• Our Approach
– Classification
– Selection
– Analysis
• Evaluation
– Precision, Recall, Accuracy Measurements
– Speed
• Conclusion & Future Work
Research Problem and Importance
• Content repositories are growing rapidly in size
• At the same time, we expect more immediate dissemination of this data
• How do we distribute it…
– In a performant manner?
– Fulfilling system requirements?
[Figure: NASA Planetary Data System Archive Volume Growth; x-axis: Year (1990 to 2008); y-axis: TB (accumulated), 0 to 90]
Data Distribution Scenarios
• A medium-sized volume of data, e.g., on the order of a gigabyte, needs to be delivered across a LAN, using multiple delivery intervals consisting of 10 megabytes of data per interval, to a single user.
• A Backup Site periodically connects across the WAN to the Digital Movie Repository to back up its entire catalog and archive of over 20 terabytes of movie data and metadata.
Data Distribution Problem Space
Insight: Software Architecture
• The definition of a system in the form of its canonical
building blocks
– Software Components: the computational units in the system
– Software Connectors: the communications and interactions
between software components
– Software Configurations: arrangements of components and
connectors and the rules that guide their composition
Data Distribution Systems
Insight: Use Software Connectors to model data distribution technologies
[Diagram: a Data Producer component sends data through a connector (marked “???”) to several Data Consumer components]
Impact of Data Distribution
Technologies
• Broad variety of data
distribution technologies
• Some are highly
efficient, some more
reliable
• P2P, Grid, Client/Server,
and Event-based
• Some are entirely
appropriate to use,
some are not
appropriate
Data Movement Technologies
• Wide array of available OTS “large-scale” connector technologies
– GridFTP, Aspera, HTTP/REST, RMI, CORBA,
SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP,
SFTP, SCP, Siena, GLIDE/PRISM-MW, and more
• Which one is the best one?
• How do we compare them?
– Given our current architecture?
– Given our distribution scenarios & requirements?
Research Question
• What types of software connectors are best suited for delivering vast amounts of data to users, satisfying their particular scenarios, in a manner that is performant and scalable, in these hugely distributed data systems?
Broad variety of distribution
connector families
• P2P, Grid, Client/Server, and Event-based
• Though each connector family varies
slightly in some form or fashion
– They all share 3 common atomic connector
constituents
• Data Access, Stream, Distributor
• Adapted from our group’s ICSE2000 Connector
Taxonomy
Connector Tradeoff Space
• Surveyed properties of 13 representative distribution
connectors, across all 4 distribution connector
families and classified them
– Client/Server
• SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP,
Commercial UDP Technology
– Peer to Peer
• Bittorrent
– Grid
• GridFTP, bbFTP
– Event-based
• GLIDE, Siena
Large Heterogeneity in Connector Properties
[Figure: four bar charts counting connectors per property value; y-axis: Num Connectors]
– Procedure Call Connector Breakdown (5 connectors, 2 families): proc_call_params_return_value, proc_call_cardinality_senders, proc_call_invocation_explicit, proc_call_params_invocation_record, proc_call_params_datatransfer, proc_call_accessibility, proc_call_semantics
– Data Access Connector Breakdown (8 connectors, 4 families): data_access_locality, data_access_persistence, data_access_avail_transient, data_access_cardinality_receivers, data_access_accesses, data_access_cardinality_senders
– Stream Connector Breakdown (8 connectors, 4 families): stream_formats, stream_cardinality_send, stream_localities, stream_deliveries, stream_throughput, stream_cardinality_receiv, stream_state, stream_identity, stream_bounds, stream_synchronicity, stream_buffering
– Distributor Connector Breakdown (8 connectors, 4 families): distributor_routing_membership, distributor_delivery_type, distributor_naming_type, distributor_naming_structures, distributor_routing_type, distributor_delivery_semantics, distributor_routing_path, distributor_delivery_mechanisms
How do experts make these
decisions?
• Performed survey of 33 “experts”
• Experts defined to be
– Practitioners in industry, building
data-intensive systems
– Researchers in data distribution
– Admitted architects of data
distribution technologies
• General consensus?
– They don’t know the how and the why
about which connector(s) are
appropriate
– They rely on anecdotal evidence
and “intuition”
Expert Survey Demographic
[Pie chart: respondent backgrounds: Cancer Research, Planetary Science, Earth Science, Industry, Grid Computing, Professors, Web Technologies, Open Source, Students; slice values 6%, 6%, 12%, 18%, 6%, 12%, 22%, 12%, 6% (order of values not matched to labels)]
45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.
[Pie chart: Percentage Breakdown of Expert Responses: No Response, Not Comfortable, No Time, Full Response; slice values 3%, 15%, 15%, 67% (order of values not matched to labels)]
Why is it bad to have these
types of experts?
• Employ a small set of COTS, and/or pervasive
distribution technologies, and stick to them
– Regardless of the scenario requirements
– Regardless of the capabilities at user’s institutions
• Lack a comprehensive understanding of
benefits/tradeoffs amongst available distribution
technologies
– They have “pet technologies” that they have used in similar
situations
– These technologies are not always applicable and
frequently only satisfy one or two scenario requirements and
ignore the rest
Our Approach: DISCO
• Develop a software framework for:
– Connector Classification
• Build metadata profiles of connector technologies,
describing their intrinsic properties (DCPs)
– Connector Selection
• Adaptable, extensible algorithm development framework
for selecting the “right” connectors (and identifying wrong
ones)
– Connector Selection Analysis
• Measurement of accuracy of results
– Connector Performance Analysis
DISCO in a Nutshell
Scenario Language
• Describes distribution scenarios (diagram root: Data Distribution)
– Total Volume: e.g., 10 MB, 100 GB, etc. (int + higher order unit)
– Delivery Schedule
• Number of Intervals: e.g., 1, 10 (int)
• Volume Per Interval
• Timing of Interval
– Access Policies: e.g., SSL/HTTP 1.0, Linux File System Perms (string from controlled value range)
– Geographic Distribution: WAN, LAN
– Performance Requirements (1-10, computed scale)
• Scalability
• Dependability
• Consistency
• Efficiency
– Number of Users: e.g., 1, 10 (int)
– Number of User Types: e.g., 1, 10 (int)
– Number of Data Types: e.g., 1, 10 (int)
– Types of Data: Data, Metadata
– Producers: Automatic, Initiated
– Consumers: Automatic, Initiated
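One way to picture a scenario written in this language is as a small typed record. The sketch below is only an illustration: the class and field names (`Scenario`, `DeliverySchedule`, `PerformanceRequirements`, and so on) are hypothetical and are not DISCO's actual schema, and decimal units (1 GB = 10^9 bytes) are assumed. It encodes the LAN scenario from the earlier slide: about a gigabyte, delivered in 10 MB intervals, to a single user.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DeliverySchedule:
    number_of_intervals: int
    volume_per_interval_bytes: int
    timing_of_interval: str = ""  # free-form here; the slide gives no value range


@dataclass
class PerformanceRequirements:
    # Each requirement sits on the 1-10 scale named on the slide.
    scalability: int
    dependability: int
    consistency: int
    efficiency: int

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")


@dataclass
class Scenario:
    # Hypothetical field names; they mirror the slide's dimensions only loosely.
    total_volume_bytes: int
    delivery_schedule: DeliverySchedule
    geographic_distribution: str  # "WAN" or "LAN"
    number_of_users: int = 1
    number_of_user_types: int = 1
    number_of_data_types: int = 1
    access_policies: List[str] = field(default_factory=list)
    types_of_data: List[str] = field(default_factory=list)  # "data", "metadata"
    performance: Optional[PerformanceRequirements] = None

    def __post_init__(self) -> None:
        if self.geographic_distribution not in ("WAN", "LAN"):
            raise ValueError("geographic_distribution must be 'WAN' or 'LAN'")


MB = 10**6  # decimal units are an assumption of this sketch
GB = 10**9

# The LAN scenario: roughly 1 GB, 10 MB per interval, one user.
lan_scenario = Scenario(
    total_volume_bytes=1 * GB,
    delivery_schedule=DeliverySchedule(
        number_of_intervals=(1 * GB) // (10 * MB),
        volume_per_interval_bytes=10 * MB,
    ),
    geographic_distribution="LAN",
    number_of_users=1,
)
```

The performance requirements are left unset for this scenario because the slide does not state them.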
Distribution Connector Model
• Developed model for distribution
connectors
• Identified combination of primitive
connectors that a distribution
connector is made from
Distribution Connector Model
• Model defines important properties of
each of the important “modules” within
a distribution connector
• Defines value space for each property
• Defines each property
• Properties are based on the
combination of underlying “primitive”
connector constituents
• Model forms the basis for a metadata
description (or profile) of a distribution
connector
Selection Algorithms
• So far
– Let data system architects encode the data
distribution scenarios within their system using
scenario language
– Let connector gurus describe important properties
of connectors using architectural metadata
(connector model)
• Selection Algorithms
– Use scenario(s) and connector properties to identify
the “best” connectors for the given scenario(s)
Selection Algorithms
• Formal Statement of the problem
Selection Algorithms
• Selection algorithm interface
[Diagram: scenario + Connector KB → selection algorithm (?) → ranked connector list]
• This interface is desirable because it allows a user to rank and compare how “appropriate” each connector is, rather than having a binary decision
(bbFTP, 0.157)
(FTP, 0.157)
(GridFTP, 0.157)
(HTTP/REST, 0.157)
(SCP, 0.157)
(UFTP, 0.157)
(Bittorrent, 0.021)
(CORBA, 0.005)
(Commercial UDP Technology, 0.005)
(GLIDE, 0.005)
(RMI, 0.005)
(Sienna, 0.005)
(SOAP, 0.005)
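The shape of this interface can be sketched in a few lines: a selection algorithm takes a scenario and a connector knowledge base and returns (connector, rank) pairs, highest rank first. The type aliases and function names below are hypothetical, not DISCO's API. The full-precision ranks shown later in the deck sum to 1, so this sketch normalizes raw scores the same way; how DISCO itself normalizes is not stated on the slides.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes, chosen only for illustration.
Scenario = Dict[str, object]             # scenario dimension -> value
ConnectorKB = Dict[str, Dict[str, str]]  # connector name -> property -> value
Ranking = List[Tuple[str, float]]        # (connector name, rank), best first
SelectionAlgorithm = Callable[[Scenario, ConnectorKB], Ranking]


def to_ranking(raw_scores: Dict[str, float]) -> Ranking:
    """Normalize raw scores so they sum to 1, then sort best-first."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one connector needs a positive score")
    ranked = [(name, score / total) for name, score in raw_scores.items()]
    # Sort by descending rank; break ties by name for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


def uniform_selector(scenario: Scenario, kb: ConnectorKB) -> Ranking:
    """Placeholder algorithm: ranks every connector equally.

    It exists only to show what satisfies the interface.
    """
    return to_ranking({name: 1.0 for name in kb})


selector: SelectionAlgorithm = uniform_selector
```

Because every algorithm returns the same ranked-list shape, different selection strategies can be swapped in behind one call site and their outputs compared directly.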
Selection Algorithm Approach
• White box
– Consider the internal properties of a
connector (e.g., its internal architecture)
when selecting it for a distribution scenario
• Black box
– Consider the external (observable)
properties of the connector (such as
performance) when selecting it for a
distribution scenario
Develop complementary
selection algorithms
• Software architects fill out Bayesian domain profiles containing conditional probabilities
– Likelihood a connector, given attribute A and its value, and given scenario requirement, is appropriate for scenario S
• Users familiar with connector technologies develop score functions
– Relating observable properties (performance reqs) of connector to scenario dimensions
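To make the two styles concrete, here is a minimal sketch of each. Everything in it is an assumption for illustration: the profile layout, the function names, the string used to name a scenario requirement, and the choice to multiply conditional probabilities as if attributes were independent (a naive-Bayes-style simplification). The slides do not give DISCO's actual formulas, so this should not be read as the author's algorithm.

```python
from math import prod
from typing import Dict, List, Tuple

# Hypothetical domain profile:
# (attribute, value, scenario requirement) -> P(appropriate | attribute=value, requirement)
DomainProfile = Dict[Tuple[str, str, str], float]


def bayesian_ranks(
    profile: DomainProfile,
    connectors: Dict[str, Dict[str, str]],
    requirements: List[str],
    default: float = 0.5,
) -> Dict[str, float]:
    """White-box sketch: combine per-attribute probabilities for each connector.

    Multiplying the probabilities treats attributes as independent, which is
    an assumption of this sketch. Missing profile entries fall back to `default`.
    """
    raw = {}
    for name, attributes in connectors.items():
        factors = [
            profile.get((attribute, value, requirement), default)
            for attribute, value in attributes.items()
            for requirement in requirements
        ]
        raw[name] = prod(factors)
    total = sum(raw.values())
    if total == 0:
        return raw
    return {name: score / total for name, score in raw.items()}


def throughput_score(
    total_volume_bytes: float,
    window_seconds: float,
    measured_bytes_per_second: float,
) -> float:
    """Black-box sketch: a hypothetical score function relating an observed
    property (throughput) to a scenario dimension (total volume).

    Returns the fraction of the volume deliverable in the window, capped at 1.
    """
    if total_volume_bytes <= 0:
        raise ValueError("total_volume_bytes must be positive")
    deliverable = measured_bytes_per_second * window_seconds
    return min(1.0, deliverable / total_volume_bytes)


# Illustrative, made-up probabilities and placeholder connector names.
profile = {
    ("distributor_delivery_semantics", "exactly once", "high dependability"): 0.9,
    ("distributor_delivery_semantics", "best effort", "high dependability"): 0.1,
}
connectors = {
    "connector_a": {"distributor_delivery_semantics": "exactly once"},
    "connector_b": {"distributor_delivery_semantics": "best effort"},
}
ranks = bayesian_ranks(profile, connectors, ["high dependability"])
```

In this toy setup `connector_a` ends up ranked well above `connector_b`, because the made-up profile favors exactly-once delivery when dependability matters.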
Selection Analysis
• How do we make decisions based on a
rank list?
• Insight: looking at the rank list, it is
apparent that many connectors are
similarly ranked, while many are not
– Appropriate versus Inappropriate?
Selection Analysis
appropriate
inappropriate
(bbFTP, 0.15789473684210525)
(FTP,0.15789473684210525)
(GridFTP,0.15789473684210525)
(HTTP/REST, 0.15789473684210525)
(SCP, 0.15789473684210525)
(UFTP, 0.15789473684210525)
(Bittorrent, 0.02105263157894737)
(CORBA, 0.005263157894736843)
(Commercial UDP Technology, 0.005263157894736843)
(GLIDE, 0.005263157894736843)
(RMI, 0.005263157894736843)
(Sienna, 0.005263157894736843)
(SOAP, 0.005263157894736843)
Selection Analysis
Selection Analysis
• Employed k-means data clustering algorithm
– k parameter defines how many sets data is partitioned into
• Allows for clustering of data points (x, y) around a
“centroid” or mean value
• We developed an exhaustive connector clustering
algorithm based on k-means
– clusters connectors into 2 groups: appropriate and inappropriate
– uses connector rank value as y parameter (x is the connector
name)
– exhaustive in the sense that it iterates over all possible connector
clusters (vanilla k-means is heuristic & possibly incomplete)
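The exhaustive idea can be sketched as follows: try every way of splitting the connectors into two non-empty groups, score each split with the usual k-means objective (sum of squared distances to each group's mean rank), and keep the best split. The function names are hypothetical, and the slides do not state the exact objective DISCO uses, so the squared-error objective is an assumption here. The rank values are the ones listed on the Selection Analysis slide.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def sum_squared_error(values: List[float]) -> float:
    """k-means objective for one cluster: squared distance to the centroid."""
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def mean_rank(names: List[str], ranks: Dict[str, float]) -> float:
    return sum(ranks[name] for name in names) / len(names)


def exhaustive_two_clusters(ranks: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (appropriate, inappropriate) by checking every 2-way split."""
    names = sorted(ranks)
    if len(names) < 2:
        raise ValueError("need at least two connectors to form two clusters")
    # Pin the first name to group A so each split is visited exactly once.
    first, rest = names[0], names[1:]
    best_cost, best_split = None, None
    for size in range(len(rest)):  # group B always keeps at least one member
        for extra in combinations(rest, size):
            chosen = set(extra)
            group_a = [first, *extra]
            group_b = [name for name in rest if name not in chosen]
            cost = sum_squared_error([ranks[n] for n in group_a]) + sum_squared_error(
                [ranks[n] for n in group_b]
            )
            if best_cost is None or cost < best_cost:
                best_cost, best_split = cost, (group_a, group_b)
    group_a, group_b = best_split
    # The cluster with the higher mean rank is labeled "appropriate".
    if mean_rank(group_a, ranks) < mean_rank(group_b, ranks):
        group_a, group_b = group_b, group_a
    return sorted(group_a), sorted(group_b)


slide_ranks = {
    "bbFTP": 0.15789473684210525,
    "FTP": 0.15789473684210525,
    "GridFTP": 0.15789473684210525,
    "HTTP/REST": 0.15789473684210525,
    "SCP": 0.15789473684210525,
    "UFTP": 0.15789473684210525,
    "Bittorrent": 0.02105263157894737,
    "CORBA": 0.005263157894736843,
    "Commercial UDP Technology": 0.005263157894736843,
    "GLIDE": 0.005263157894736843,
    "RMI": 0.005263157894736843,
    "Sienna": 0.005263157894736843,
    "SOAP": 0.005263157894736843,
}
appropriate, inappropriate = exhaustive_two_clusters(slide_ranks)
```

With 13 connectors there are only 4,095 splits to check, so brute force is cheap. On these values the sketch puts the six connectors ranked near 0.158 in one group and the remaining seven in the other.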
Tool Support
• Allows a user to utilize different connector
knowledge bases, configure selection
algorithms and execute them and visualize
their results
Decision Process
87%
80.5%
• Precision - the fraction of connectors correctly identified as appropriate for a scenario
• Accuracy - the fraction of connectors correctly identified as appropriate or inappropriate for a scenario
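These measures can be computed by comparing the framework's "appropriate" set against an expert's answer for the same scenario. The sketch below uses the standard information-retrieval definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), accuracy = (TP + TN) / total); treating these as the definitions behind the slide's wording is an assumption, and the function name and example sets are hypothetical.

```python
from typing import Dict, Iterable


def evaluate(
    predicted_appropriate: Iterable[str],
    expert_appropriate: Iterable[str],
    all_connectors: Iterable[str],
) -> Dict[str, float]:
    """Compare a predicted 'appropriate' set with an expert's answer key."""
    predicted = set(predicted_appropriate)
    expected = set(expert_appropriate)
    universe = set(all_connectors)

    true_pos = len(predicted & expected)
    false_pos = len(predicted - expected)
    false_neg = len(expected - predicted)
    true_neg = len(universe - predicted - expected)

    # Guard against empty denominators instead of raising ZeroDivisionError.
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    accuracy = (true_pos + true_neg) / len(universe) if universe else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Made-up example: four connectors, two predicted, two expected, one in common.
metrics = evaluate(
    predicted_appropriate={"a", "b"},
    expert_appropriate={"a", "c"},
    all_connectors={"a", "b", "c", "d"},
)
```

In the made-up example one of two predictions is correct, one of two expected connectors is found, and two of four connectors are classified correctly, so all three measures come out to 0.5.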
Decision Process: Speed
Conclusions & Future Work
• Conclusions
– Domain experts (gurus) rely on tacit knowledge and
often cannot explain design rationale
– DISCO provides a quantification of & framework for
understanding an ad hoc process
– Bayesian algorithm has a higher precision rate
• Future Work
– Explore the tradeoffs between white-box and black-box approaches
– Investigate the role of architectural mismatch in
connectors for data system architectures
Thank You!
Questions?
Backup
Related Work
• Software Connectors
– Mehta00 (Taxonomy), Spitznagel01,
Spitznagel03, Arbab04, Lau05
• Data Distribution/Grid Computing
– Crichton01, Chervenak00, Kesselman01
• COTS Component/Connector selection
– Bhuta07, Mancebo05, Finkelstein05
• Data Dissemination
– Franklin/Zdonik97