Software and Enterprise Architectures
CSE 5095
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-255
Storrs, CT 06269-2155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
Copyright © 2008 by S. Demurjian, Storrs, CT.

Software Architectures
  Emerging Discipline in Mid-1990s
  Software as Collection of Interacting Components
    What are Local Interactions (within Component)?
    What are Global Interactions (between Components)?
  Advantages of SW Architectural Design
    Understand Communication/Synchronization
    Definition of Database Requirements
    Identification of Performance/Scaling Issues
    Detailing of Security Needs and Constraints
  Towards Large-Scale Software Development
  For Biomedical Informatics:
    What are Architectures for Data Sharing?
    How is Interoperability Facilitated?

Concepts of Software Architectures
  Exceed Traditional Algorithm/Data Structure Perspective
  Emphasize Componentwise Organization and System Functionality
  Focus on Global and Local Interactions
  Identify Communication/Synchronization Requirements
  Define Database Needs and Dependencies
  Consider Performance/Scaling Issues
  Understand Potential Evolution Dimensions

The HTSS Software Architecture
  [Figure: HTSS architecture diagram]
  Legend: IL = Item Locator; CR = Cash Register; IC = Invent. Control; DO = Deli Orderer for Shopper/Employee
  Other labels: SDO, EDO, Payment, Item IC, Order IC, Non-Local Client Int., Inventory Control, Global Server, Local Server
  Databases: CreditCardDB, ItemDB, ATM-BankDB, OrderDB, SupplierDB

Multiple Backend Database System (MBDS)
  [Figure labels: Host/User, Database Controller, Backend Database Processor (three)]

The MBDS Processes
  Figure labels: Database Controller, Request Preparation, Post Processing, Put Msg., Get Msg., Get Msg., Put Msg.
  Figure labels (continued): Directory Management, Record Processing, Concurrency Control, Disk I/O, Backend Database Processor

Multiple Processes in MBDS
  No.  Type                                     SRC   DST
  1    New Request                              Host  ReqP
  2    Results of Request                       PoPr  Host
  3    Number of Reqs in Transaction            ReqP  PoPr
  4    Aggregate Operators (Sum, etc.)          ReqP  PoPr
  6    Parsed Request to Backends               ReqP  DM
  12   Backend Aggregate Operator Results       RecP  PoPr
  15   Ids for Accessing Database Indexes       DM    DMs
  16   Request and Disk Addresses               DM    RecP
  21   Ids for Accessing Database Records       DM    CC
  22   Locks Obtained: Okay to Execute Request  CC    RecP
  23   ID of Finished Request                   RecP  CC
  (ReqP = Request Preparation, PoPr = Post Processing, DM = Directory Management, RecP = Record Processing, CC = Concurrency Control)

Message Passing in MBDS
  [Figure: message flow among the MBDS processes, with arrows labeled by step letter and message number: A1, B3, C4, D6, E15 (To Backend(s)), F15 (From Other Backend), G21, H22, I16, J23, K12]
  Processes shown: Request Preparation, Post Processing, Get Msg., Put Msg., Directory Management, Record Processing, Concurrency Control, Disk I/O

Software Design Levels
  Architecturally:
    Modules
    Interconnections Among Modules
    Decomposition into Subsystems
  Code:
    Algorithms/Data Structures
    Tasking/Control Threads
  Executable:
    Memory Management
    Runtime Environment
  Is this a Realistic/Accurate View?
    Yes for a Single “Application”
    What about Application of Applications? System of Systems?

Software Engineering - an Oxymoron?
  Is there any Engineering? Is there any Science?
  Collection of Disparate Techniques:
    Data-Flow Diagrams
    E-R Diagrams
    Finite State Machines
    Petri Nets
    UML Class, Object, Sequence, Etc.
    Design Patterns
    Model Driven Architectures
  What is being “Engineered”?
  How do we Know we are Done?
    E.g. Does Artifact Match Specification?

What's Available for Engineering Software?
  Specification (Abstract Models, Algebraic Semantics)
  Software Structure (Bundling Representation with Algorithms)
  Language Issues (Models, Scope, User-Defined Types)
  Information Hiding (Protect Integrity of Information)
  Integrity Constraints (Invariants of Data Structures)
  Is this up to date? What else can be Added to List?
    Design Patterns
    Model Driven Architectures
    XML - Data Modeling and Dependencies
    Others?

Engineering Success in Computing
  Compilers Have Had Great Success
    Originally by Hand
    Then Compiler Compilers
    Parser Generators - Lex/Yacc
  Solid Science Behind Compilers
    Regular, Context Free, Context Sensitive Languages
    FSAs, PDAs, CFGs, etc.
  Science has Provided Engineering Success re. Ease and Accuracy of Modern Compiler Writing

History of Programming
  C - Still Remains Industry Workhorse
    Separate Compilation
    Decomposition of System into Subsystems, etc.
    Shared Declarations
    ADTs in C, But Compiler won't Enforce Them
  Modula-II and Ada 83
    Had Information Hiding
    Public/Private Paradigm
    Module/Package Concepts
    Import/Export Paradigm
  Rigor Enforced by Compiler – but Can’t
    Bind/Group Modules into Subsystems
    Precisely Specify Interconnections and Interactions Among Subsystems and Components

‘Recent-Past’ Generation?
  C++ and Ada95 Considered “Legacy” Languages - Old
  Java, C# - Are they Headed Toward Legacy?
    How do they Rate?
    What Do they Offer that Hasn't been Offered Before?
    What are Unique Benefits and Potential of Java?
  What about new Web Technologies?
    JavaScript, Perl, PHP, Python, Ruby
    XML and SOAP
  How do all of these fit into this process?
    Particularly in Regards to C/S Solutions!

What's Next Step?
  Architectural Description Languages
    Provide Tools to Describe Architectures
    Definition and Communication
  Codification of Architectural Expertise
  Frameworks for Specific Domains
    DB vs. GUI vs. Embedded vs. C/S
  Formal Underpinning for Engineering Rigor
  What has Appeared for Each of these?
    Struts for GUI
    Open Source Frameworks (MediaWiki)
    Wide-Ranging Standards (XML)
    Model-Driven Architectures
    What Else???

Architectural Styles
  What are Popular Architectural Styles?
  How are they Characterized?
  Example in Practice
  Explore a Taxonomy of Styles
  Focus on “Micro-Architectures”
    Components
    Flow Among Components
    Represents “Single” Application
  Forms Basis for “Macro-Architectures”
    System of Systems
    Application of Applications
    Significantly Scaling Up

Taxonomy of Architectural Styles
  Data Flow Systems
    Batch Sequential
    Pipes and Filters
  Call & Return Systems
    Main/Subroutines (C, Pascal)
    Object Oriented
    Implicit Invocation
    Hierarchical Systems
  Virtual Machines
    Interpreters
    Rule Based Systems
  Data Centered Systems
    DBS
    Hypertext
    Blackboards
  Independent Components
    Communicating Processes/Event Systems
    Client/Server
      Two-Tier
      Multi-Tier

Taxonomy of Architectural Styles
  Establish Framework of …
  Components
    Building Blocks for Constructing Systems
    A Major Unit of Functionality
    Examples Include: Client, Server, Filter, Layer, DB
  Connectors
    Defining the Ways that Components Interact
    What are the Protocols that Mandate the Allowable Interactions Among Components?
    How are Protocols Enforced at Run/Design Time?
    Examples Include: Procedure Call, Event Broadcast, DB Protocol, Pipe

Overall Framework
  What Is the Design Vocabulary?
    Connectors and Components
  What Are Allowable Structural Patterns?
    Constraints on Combining Components & Connectors
  What Is the Underlying Conceptual Model?
    Von Neumann, Parallel, Agent, Message-Passing…
    Are there New Emerging Models? Collaborative Environments/Shareware?
  What Are Essential Invariants of a Style?
    Limits on Allowable Components & Connectors
  Common Examples of Usage
  Advantages and Disadvantages of a Style
  Common Specializations of a Style

Pipes and Filters
  Components are Independent Entities. No Shared State!
  Components with Input and Output
  Connectors for Flow Streams of I/O
  [Figure labels: Sort, Sort, Merge]
  Filters:
    Invariant: Unaware of Up and Down Stream Behavior
    Streamed Behavior: Output Could Go From One Filter to the Next One, Allowing Multiple Filters to Run in Parallel.
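
The Sort, Sort, Merge arrangement on the Pipes and Filters slide can be sketched with Python generators. This is only an illustration under stated assumptions: the function names (`sort_filter`, `merge_filter`, `run_pipeline`) and the integer data are hypothetical, and generators stand in for pipes. Each filter sees only its own input and output and shares no state with its neighbors.

```python
import heapq
from typing import Iterable, Iterator, List


def sort_filter(stream: Iterable[int]) -> Iterator[int]:
    """Filter: reads its entire input, then emits the items in sorted order."""
    yield from sorted(stream)


def merge_filter(left: Iterable[int], right: Iterable[int]) -> Iterator[int]:
    """Filter: lazily merges two already-sorted input streams into one."""
    yield from heapq.merge(left, right)


def run_pipeline(a: Iterable[int], b: Iterable[int]) -> List[int]:
    """Connect the filters; the generator objects act as the pipes."""
    return list(merge_filter(sort_filter(a), sort_filter(b)))


if __name__ == "__main__":
    print(run_pipeline([5, 1, 4], [3, 2, 6]))
```

Note that a sort filter cannot emit anything until it has read all of its input, while the merge filter is fully streamed: it produces output as soon as both upstream filters begin to deliver items.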

Pipes and Filters
  Possible Specializations:
    Pipelines - Linear Sequence
    Bounded - Limits on Data Amounts
    Typed Pipes - Known Data Format
  What is a Classic Example?
  Other Examples:
    Compilers
    Sequential Processes
    Parallel Processes

Pipes and Filters - Another Example
  Text Information Retrieval Systems
    Scanning Newspapers for Key Words, Etc.
    Also, Boolean Search Expressions
  Where is Such an Architecture Utilized Today?
  What is Potential Usage in BMI?
  [Figure labels: User Commands, Search, Disk Controller, Controller, Programming, Result, Query Resolver, Control, Term, Search, Comparator, Data, DB]

ADTs and OO Architectures
  Widespread Usage in the 1990’s
  Advantages Are Well Known
  [Figure: network of objects (obj) and their operations (op), annotated with the labels Components and Connectors]
  Disadvantages:
    Interaction Requires Object Identity
    If Identity Changes, It Is Difficult to Track All Affected Objects.

Implicit Invocation
  Similar to OO in the Sense that Components Can Call Services on Other Components
  How Does this Work?
    Components Have List of Events they can Raise and List of Procedures to Handle Events
    When Event is Raised, it is Broadcast
    All Components that Have Procedure to Handle Broadcast Event will Act Upon it
    The Component That Raised the Event has no Knowledge of Which Component(s) will Handle Event
  What are Some Examples?

Implicit Invocation
  Advantages
    No Need to Know the Targeted Components
    Single Event can Impact Multiple Components
    New Event Handlers can Easily be Added
    New Events Can then be Raised
  Disadvantages
    No Control Over the Order of Processing When an Event is Raised
    No Control Over “Who” and “How Many” Process Events
    Very Non-Deterministic System Behavior

What has OO Evolved Into?
  What has Classic OO Solution Evolved into Today?
    Client (Browser + Struts)
    Server (Many Variants of OO Languages)
    Database Server (typically Relational)
  Different Style (e.g., Design Pattern)
    Does Pattern Capture All Aspects of Style?
    Do we Need to Couple Technology with Pattern?
  [Figure: sample visit record: Dr. D, Jan 01, 08; Fever, Flu, Bed Rest; No Scripts; No Tests]
  Item(Phy_Name*, Date*, Visit_Flag, Symptom, Diagnosis, Treatment, Presc_Flag, Pre_No, Pharm_Name, Medication, Test_Flag, Test_Code, Spec_No, Status, Tech)

Layered Systems
  [Figure labels: Useful Systems, Base Utility, Core level, Users]
  Components - Virtual Machine at Each Layer
  Connectors - Protocols That Specify How Layers Interact
  Interaction Is Restricted to Adjacent Layers

Layered Systems
  Advantages:
    Increasing Levels of Abstraction
    Support Enhancement - New Layers
    Support for Reuse
  Drawbacks:
    Not Feasible for All Systems
    Performance Issues With Multiple Layers
    Defining Abstractions Is Difficult.

Layered Systems in BMI
  One Approach to Constructing Access to Patient Data for Clinical Research and Clinical Practice
  Construct Layered Data Repositories as Below
  Each Layer Targets Different User Group
  Need to Fine Tune Access Even within Layers
  [Figure labels: Aggregated, De-identified, Patient Data, Provider, Cl. Researchers, Public Health Researchers]

ISO as Layered Architecture
  ISO Open Systems Interconnect (OSI) Model
    Now Widely Used as a Reference Architecture
    7-layer Model
    Provides Framework for Specific Protocols (Such as IP, TCP, FTP, RPC, UDP, RSVP, …)
  [Figure: two protocol stacks side by side, each with the layers Application, Presentation, Session, Transport, Network, Data Link, Physical]

ISO OSI Model
  [Figure: two protocol stacks side by side, each with the layers Application, Presentation, Session, Transport, Network, Data Link, Physical]
  Physical (Hardware)/Data Link Layer Networks: Ethernet, Token Ring, ATM
  Network Layer Net: The Internet
  Transport Layer Net: TCP-based Network
  Presentation/Session Layer Net: HTTP/HTML, RPC, PVM, MPI
  Applications, E.g., WWW, Window System, Algorithm

Repositories
  [Figure: knowledge sources ks1 through ks8 around a Blackboard (shared data)]
  Knowledge Sources Interact With the Blackboard.
  Blackboard Contains the Problem Solving State Data.
  Control Is Driven by the State of the Blackboard.
  DB Systems Are a Form of Repository With a Layer Between the BB and the KSs - Supports Concurrent Access, Security, Integrity, Recovery

Database System as a Repository
  [Figure: clients c1 through c8 around a Database (shared data)]
  Clients Interact With the DBMS
  Database Contains the Problem Solving State Data
  Control is Driven by the State of the Database
  Concurrent Access, Security, Integrity, Recovery
  Single Layer System: Clients have Direct Access
  Control of Access to Information must be Carefully Defined within DB Security/Integrity

Team Project as a Repository
  [Figure: clients c1 through c8 around a shared Web Portal]
  Clients are Providers, Patients, Clinical Researchers
  Database Underlies Web Portal
  Simply a Portion of Architecture
    Interactions with PHR (Patients)
    Interactions with EMR (Providers)
    Interactions with Database/Warehouse (Researchers)

Interpreters
  [Figure labels: Inputs, Outputs, Program being interpreted, Data (program state), Simulated interpretation engine, Selected instruction, Selected data, Internal interpreter state]
  What Are Components and Connectors?
  Where Have Interpreters Been Used in CS&E?
    LISP, ML, Java, Other Languages, OS Command Line

Java as Interpreter
  [Figure]

Process Control Paradigms
  [Figure: two control loops, With Feedback and Without Feedback, each labeled Input variables, Set point, Controller, Δs to manipulated variables, Process, Controlled variable]
  Also: Open vs. Closed Loop Systems
  Well Defined Control and Computational Characteristics
  Heavily Used in Engineering Fields.

Process Architecture: Statechart Diagram?
  [Figure]

Process Architecture: Activity Diagram?
  Clear Applicability to Medical Processes that have Underlying BMI – Low Level Processes
  [Figure labels: Waiting for Heart Signal, timeout, irregular beat, Heartbeat, Heart Signal, Waiting for Resp. Signal, Breath, Trigger Local Alarm, Trigger Remote Alarm, Resp Signal, Alarm Reset]

Design Patterns as Software Architectures
  Emerged as the Recognition that in Object-Oriented Systems Repetitions in Design Occurred
  Gained Prominence in 1995 with Publication of “Design Patterns: Elements of Reusable Object-Oriented Software”, Addison-Wesley
    “… descriptions of communicating objects and classes that are customized to solve a general design problem in a particular context…”
  Akin to Complicated Generic
  Usage of Patterns Requires
    Consistent Format and Abstraction
    Common Vocabulary and Descriptions
  Simple to Complex Patterns – Wide Range

The Observer Pattern
  Utilized to Define a One-to-Many Relationship Between Objects
  When Object Changes State – all Dependents are Notified and Automatically Updated
  Loosely Coupled Objects
    When one Object (Subject – an Active Object) Changes State, then Multiple Objects (Observers – Passive Objects) are Notified
    Observer Object Implements Interface to Specify the Way that Changes are to Occur
  Two Interfaces and Two Concrete Classes

The Observer Pattern
  [Figure]

Model View Controller
  http://java.sun.com/blueprints/patterns/MVC-detailed.html
  [Figure]

Model View Controller
  Three Parts of the Pattern:
  Model
    Enterprise Data and Business Rules for Accessing and Updating Data
  View
    Renders the Contents (or Portion) of Model
    Deals with Presentation of Stored Data
    Pull or Push Model Possible
  Controller
    Translates Interactions with View into Actions on Model
    Actions could be Button Clicks (GUI), Get/Post http (Web), etc.

Model View Controller
  http://java.sun.com/blueprints/patterns/MVC-detailed.html
  [Figure]

UML for System Modeling
  UML is a Language for Specifying, Visualizing, Constructing, and Documenting Software Artifacts
  What Does a Modeling Language Provide?
    Model Elements: Concepts and Semantics
    Notation: Visual Rendering of Model Elements
    Guidelines: Hints and Suggestions for Using Elements in Notation
  References and Resources
    Web: http://www.uml.org/
  Is UML Sufficient for Complexity of BMI?
    Able to Model Information Needs for BMI?
    Able to Represent Required Architectures?

UML Diagrammatic Representations
  Component Diagram: Captures the Physical Structure of the Implementation
  Deployment Diagram: Captures the Topology of a System’s Hardware
  Collaboration Diagram: Captures Dynamic Behavior (Message-Oriented)
  What About Other Diagrams?
    State Chart Diagram: Captures Dynamic Behavior (Event-Oriented)
    Activity Diagram: Captures Dynamic Behavior (Activity-Oriented)
    These and Others Seem too Low Level …
  What is Role of UML for BMI?
    Yet Another Design Artifact
    Can it be More?

Component Diagram
  Captures the Physical Structure of the Implementation
  [Figure]

Deployment Diagram
  Captures the Topology of a System’s Hardware
  [Figure]

Collaboration Diagram
  [Figure]

Single and Multi-Tier Architectures
  Widespread use in Practice for All Types of Distributed Systems and Applications
  Two Kinds of Components
    Servers: Provide Services - May be Unaware of Clients
      Web Servers (unaware?)
      Database Servers and Functional Servers (aware?)
    Clients: Request Services from Servers
      Must Identify Servers
      May Need to Identify Self
  A Server Can be Client of Another Server
  Expanding from Micro-Architectures (Single Computer/One Application) to Macro-Architecture

Single and Multi-Tier Architectures
  Normally, Clients and Servers are Independent Processes Running in Parallel
  Connectors Provide Means for Service Requests and Answers to be Passed Among Clients/Servers
  Connectors May be RPC, RMI, etc.
  Advantages
    Parallelism, Independence
    Separation of Concerns, Abstraction
    Others?
  Disadvantages
    Complex Implementation Mechanisms
    Scalability, Correctness, Real-Time Limits
    Others?
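
The client/server roles above, including a server that is itself a client of another server, can be sketched in a few lines of Python. This is a minimal in-process illustration, not a distributed implementation: threads stand in for independent processes, `queue.Queue` objects stand in for the RPC or RMI connectors, and every name here (`serve`, `call`, `start_tiers`, the price table) is hypothetical.

```python
import queue
import threading


def serve(requests: queue.Queue, handler) -> None:
    """Server loop: handle (payload, reply_queue) pairs until a None sentinel."""
    while True:
        item = requests.get()
        if item is None:
            break
        payload, reply = item
        reply.put(handler(payload))


def call(requests: queue.Queue, payload):
    """Client-side connector: send one request and block for its answer."""
    reply = queue.Queue(maxsize=1)
    requests.put((payload, reply))
    return reply.get(timeout=5)


def start_tiers():
    """Start a data tier and a logic tier; return (logic_requests, shutdown)."""
    prices = {"widget": 4, "gadget": 10}  # hypothetical stored data
    db_requests = queue.Queue()
    logic_requests = queue.Queue()

    def db_handler(item_name):
        return prices.get(item_name)

    def logic_handler(order):
        item_name, quantity = order
        unit_price = call(db_requests, item_name)  # this server acts as a client
        return None if unit_price is None else unit_price * quantity

    threads = [
        threading.Thread(target=serve, args=(db_requests, db_handler), daemon=True),
        threading.Thread(target=serve, args=(logic_requests, logic_handler), daemon=True),
    ]
    for worker in threads:
        worker.start()

    def shutdown():
        logic_requests.put(None)
        db_requests.put(None)
        for worker in threads:
            worker.join(timeout=5)

    return logic_requests, shutdown


if __name__ == "__main__":
    logic, stop = start_tiers()
    print(call(logic, ("widget", 3)))
    stop()
```

The client only needs to identify the logic tier's request queue; it never sees the data tier. That mirrors the separation of concerns listed under Advantages, while the extra hop illustrates where response-time costs come from.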

Example: Software Architectural Structure
  [Figure labels: Initial Data Entry Operator (Scanning & Posting), Advanced Data Entry Operators, Analyst, Manager, 10-100MB Network, Document Server, Stored Images/CD, Database Server Running Oracle, RMI Registry, RMI Act. Obj/Server (two), Functional Server]

Business Process Model
  [Figure labels: DB, DB, Historical Completed Records, Applications, Licensing DB, Supervisor Review, Scanner, DB, Licensing Division Scanning Operator, Stored Images, Licensing Division Printer, Data Entry Operator, DB, Basic Information Entered, New Licenses, New Appointments, FOI Letters (Request Information, etc.)]

Two-Tier Architecture
  Small Manufacturer Previously on C++
  New Order Entry, Inventory, and Invoicing Applications in Java Programming Language
  Existing Customer and Order Database
  Most of Business Logic in Stored Procedures
  Tool-generated GUI Forms for Java Objects

Three-Tier Architecture
  Passenger Check-in for Regional Airline
  Local Database for Seating on Today's Flights
  Clients Invoke EJBs at Local Site Through RMI
  EJBs Update Database and Queue Updates
  JMS Queues Updates to Legacy System
  JDBC API Used to Access Local Database

Four-Tier Architecture
  Web Access to Brokerage Accounts
  Only HTML Browser Required on Front End
  "Brokerbean" EJB Provides Business Logic
  Login, Query, Trade Servlets Call Brokerbean
  Use JNDI to Find EJBs, RMI to Invoke Them

Architecture Comparisons
  Two-tier Through JDBC API is Simplest
  Multi-tier: Separate Business Logic, Protect Database Integrity, More Scalable
  JMS Queues vs. Synchronous (RMI or IDL): Availability, Response Time, Decoupling
  JMS Publish & Subscribe: Off-line Notification
  RMI IIOP vs. JRMP vs. Java IDL: Standard Cross-language Calls or Full Java Functionality
  JTS: Distributed Integrity, Lockstep Actions

Comments on Architectural Styles
  Architectural Styles Provide Patterns
    Suppose Designing a New System
    During Requirements Discovery, Behavior and Structure of System Will Emerge
    Attempt to Match to Architectural Style
    Modify, Extend Style as Needed
  By Choosing Existing Architectural Style
    Know Advantages and Disadvantages
    Ability to Focus in on Problem Areas and Bottlenecks
    Can Adjust Architecture Accordingly
  Architectures Range from Large Scale to Small Scale in their Applicability
  We’ll see Examples for BMI Shortly …

Other Issues in Software Architectures
  Consider a Set of Applications
    New Software
    Legacy, COTS, Databases, etc.
  A Distributed Application is a Set of Applications Deployed Over a Network that Communicate
  Relationship Between Applications
    Different Implementations of “Same” Application on Different Hardware Platforms
  Configuration of Various Hardware Nodes
    Different Node Types in the Network
  Issue: What is the ‘Best’ Way to Deploy Applications Across the Network of Available Resources?

Distributed Application & Hardware Nodes
  [Figure]
  Computers & Connections May have Different Characteristics that Affect their Usage
    Speed
    Storage
    Bandwidth

Objective: ‘Best’ Deployment
  A Distributed System is Optimally Deployed if it Yields the Best Performance
  Performance: Efficient Use of Resources via Throughput, Response Time, or Number of Messages
  What are Implications in BMI?
    Need to Bring Together Multiple Assets
    Work Efficiently Across Network
    Unifying Clinical Research Repositories

Distr. Systems: Combo of Requirements
  [Figure labels: interaction patterns, software elements, hardware elements, Specification, interfaces, connections, protocols]

Deployment Influenced by Many Factors
  [Figure labels: algorithms, software architecture, underlying network, replication degree, Performance, processing nodes, usage patterns, middleware, deployment]

Framework for Design and Deployment
  [Figure labels: SOFTWARE, HARDWARE, Dependencies, Deployment, PERFORMANCE]

What is I5?
  Five Definition Languages
    Interface
    Inheritance
    Implementation
    Instantiation
    Installation
  Five Formal Integrated Graphical Languages Based on UML’s Implementation Diagrams
  The Application, Network, Dependencies and the Deployment are Part of an Integrated Framework

The Five Levels of I5
  (Levels run from Abstraction to Detail)
  Interface (I1) - Types of Components, Nodes and Connectors
  Implementation (I2) - Classes of Components, Nodes and Connectors
  Integration (I3) - Dependencies Between Component and Node Classes
  Instantiation (I4) - Instances of Each Class Definition
  Installation (I5) - Deployment of Each Instance (Requirements and Complete Deployment)

Levels of Specification in I5
  Types - Generic Definition of Components, Nodes, and Connectors According to Their Role
    Defined in I1
    Used in I2 to Define Classes
  Classes - Different Implementations of the Types
    Defined in I2
    Used in I3 to Associate Software Components and Hardware Artifacts and I4 to Define Instances
  Instances - Identical Copies of the Different Classes
    Defined in I4
    Used in I5 to Deploy Instances Across Nodes

UML
  UML is a Set of Graphical Specification Languages (OMG’s Standard Design Language Since November, 1997)
  Implementation Diagrams
    Component Diagrams: Show the Physical Structure of the Code in Terms of Code Components and Their Dependencies
    Deployment Diagrams: Show the Physical Architecture of the Hardware and Software in the System. They Have a Type and an Instance Version.
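
The way the five I5 levels build on one another (types, classes, class dependencies, instances, deployment) can be made concrete with a small consistency checker. This is a sketch under stated assumptions, not the I5 notation itself, which the deck says is specified in Z or UML: the `I5Spec` container, its field names, and all example component and node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class I5Spec:
    """One container per I5 level; all example names are hypothetical."""
    component_types: Set[str] = field(default_factory=set)             # I1
    node_types: Set[str] = field(default_factory=set)                  # I1
    component_classes: Dict[str, str] = field(default_factory=dict)    # I2: class -> type
    node_classes: Dict[str, str] = field(default_factory=dict)         # I2: class -> type
    supports: Set[Tuple[str, str]] = field(default_factory=set)        # I3: (node class, component class)
    component_instances: Dict[str, str] = field(default_factory=dict)  # I4: instance -> class
    node_instances: Dict[str, str] = field(default_factory=dict)       # I4: instance -> class
    deployment: Dict[str, str] = field(default_factory=dict)           # I5: component instance -> node instance

    def errors(self) -> List[str]:
        """Return every cross-level inconsistency found (empty list if none)."""
        found = []
        for cls, typ in self.component_classes.items():
            if typ not in self.component_types:
                found.append(f"I2: component class {cls} refines unknown type {typ}")
        for cls, typ in self.node_classes.items():
            if typ not in self.node_types:
                found.append(f"I2: node class {cls} refines unknown type {typ}")
        for node_cls, comp_cls in sorted(self.supports):
            if node_cls not in self.node_classes or comp_cls not in self.component_classes:
                found.append(f"I3: supports pair ({node_cls}, {comp_cls}) uses an undefined class")
        for inst, cls in self.component_instances.items():
            if cls not in self.component_classes:
                found.append(f"I4: component instance {inst} has unknown class {cls}")
        for inst, cls in self.node_instances.items():
            if cls not in self.node_classes:
                found.append(f"I4: node instance {inst} has unknown class {cls}")
        for inst, cls in self.component_instances.items():
            node = self.deployment.get(inst)
            if node is None:
                found.append(f"I5: component instance {inst} is not deployed")
            elif node not in self.node_instances:
                found.append(f"I5: {inst} is deployed on unknown node {node}")
            elif (self.node_instances[node], cls) not in self.supports:
                found.append(f"I5: node {node} does not support class {cls}")
        return found


def example_spec() -> I5Spec:
    """Build a small, consistent specification with made-up names."""
    return I5Spec(
        component_types={"FrontEnd", "Replica"},
        node_types={"Workstation"},
        component_classes={"XFrontEnd": "FrontEnd", "Counter": "Replica"},
        node_classes={"SunOS": "Workstation"},
        supports={("SunOS", "XFrontEnd"), ("SunOS", "Counter")},
        component_instances={"fe1": "XFrontEnd", "ct1": "Counter"},
        node_instances={"sun1": "SunOS", "sun2": "SunOS"},
        deployment={"fe1": "sun1", "ct1": "sun2"},
    )
```

Calling `example_spec().errors()` returns an empty list; removing a pair from `supports` or an entry from `deployment` makes the corresponding I5 check fail, which is the sense in which each level constrains the ones below it.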

UML
  When to Use Deployment Diagrams
  “… In practice, I haven’t seen this kind of diagram used much. Most people do draw diagrams to show this kind of information but they are informal cartoons. On the whole, I don’t have a problem with that since each system has its own physical characteristics that you want to emphasize. As we wrestle more and more with distributed systems, however, I’m sure we will require more formality as we understand better which issues need to be highlighted in deployment diagrams.”
  From “UML Distilled. Applying the Standard Object Modeling Language”, by Martin Fowler. Addison-Wesley, Object Technology Series, 7th. Reprint June, 1998.

Pros and Cons of Graphical Modeling
  Advantages:
    Clear to Show Structure
    Excellent Communication Vehicle
    Addresses Different Aspects of Modeling in an Integrated Fashion
  Disadvantages:
    Shows Little (or No) Details
    There is a Big Gap Between Specification and Implementation
    Limited by Screen Size & Printable Page
  Solution: Associate a Complete Textual Specification to Graphical Model that Contains the Necessary Details for Each Element

Design Concepts
  Interface
    Interaction With the Outer World
    Signature + Requested Services
  Type: Abstract Entity - Interface + Semantics
  Subtype: Inherits the Supertype Definition
  Class: Implementation of a Type
  Realization: Relation Between a Type and a Class That Implements It
  Subclass: Inherits the Superclass Implementation
  Instance: Element of a Class

The I5 Framework
  An Integrated Specification Framework for Distributed Systems
  Support for the Architectural Specification of OO and Component Based Distributed Systems
    Heterogeneous Network - Platforms
  A Five Level Framework for Defining Software and Hardware (Platforms) With a Uniform Notation and With Different Levels of Abstraction
  Specified Textually in Z or Graphically in UML
    Emphasis on Implementation Diagrams
  Please See http://www.engr.uconn.edu/~cecilia
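
The vocabulary on the Design Concepts slide (type, subtype, class, realization, subclass, instance) maps onto ordinary object-oriented code. The sketch below uses Python abstract base classes to show the distinctions; the counter example and all class names are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod


class Counter(ABC):
    """Type: an interface (signatures) plus the intended semantics."""

    @abstractmethod
    def increment(self) -> int:
        """Add one and return the new value."""


class ResettableCounter(Counter):
    """Subtype: inherits the supertype definition and extends it."""

    @abstractmethod
    def reset(self) -> None:
        """Set the value back to zero."""


class MemoryCounter(Counter):
    """Class: an implementation (realization) of the Counter type."""

    def __init__(self) -> None:
        self._value = 0

    def increment(self) -> int:
        self._value += 1
        return self._value


class MemoryResettableCounter(MemoryCounter, ResettableCounter):
    """Subclass: inherits MemoryCounter's implementation and realizes the subtype."""

    def reset(self) -> None:
        self._value = 0


if __name__ == "__main__":
    counter = MemoryResettableCounter()  # instance: an element of a class
    counter.increment()
    counter.reset()
    print(counter.increment())
```

A type such as `Counter` cannot be instantiated directly; only a class that realizes it can. That is the same separation I5 draws between the types of level I1, the classes of I2, and the instances of I4.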

Dependencies Between Levels
  [Figure: artifacts produced at each level]
  INTERFACE: Component Types, Node Types
  IMPLEMENTATION: Component Classes, Node Classes
  INTEGRATION: Implementation Dependencies
  INSTANTIATION: Inst. Components, Inst. Nodes, System Instantiation
  INSTALLATION: Installation Req. (together, separated), Installation Req. (fix location), Complete Installation

Interface - Software: I1S
  Component Types
    Type
    Supertypes
    Associated Interfaces
    Calls
  Properties
    Types are Unique
    Supertypes Must Be Part of I1S
    Calls Must Be Satisfied in I1S

Interface - Software: I1S
  [Figure: component types Client, FrontEnd, and Replica linked by <<call>> dependencies; interfaces shown: response, request, receive, gossip]

Interface - Hardware: I1H
  Node Types
  Connector Types
  Connections
  Properties
    All Node Types Must Be Connected
    Only Node and Connector Types Defined Take Part in the Connections
  [Figure labels: MPI, Sockets, SUN, Intel Pentium]

Implementation - Software: I2S
  Component Classes
    Component Type
    Class
    Superclasses
    Calls to Classes
    Interfaces
  Properties:
    Only Types in I1S are Allowed
    Superclasses Are Realizations of the Supertypes
    Calls & Inheritance are Satisfied Within I2S

Implementation - Software: I2S
  [Figure: component classes PCCtrCl, XCtrCl, XFrontEnd, and Counter linked by <<call>> dependencies; interfaces shown: response, request, receive, gossip]

Implementation - Hardware: I2H
  Node Classes
    Node Type
    Class
  Connector Classes
    Type
    Class
  Connections Between Node Classes
  Properties
    Node and Connector Classes Refine the Types in I1H
    Connections are With Connector Classes That Refine Connector Types in I1H

Implementation - Hardware: I2H
  [Figure: types MPI, Sockets, SUN, and Intel Pentium with <<realizes>> relations to the classes MPI_Impl, CSockets, SUN OS 4.1.4, and Win95]

Software and Hardware Integration: I3
  Relation <<supports>>
    Instances of the Component Class May Run on Instances of the Node Class
    Important Step Since it Constrains Deployment Options
  Properties
    Only Node and Component Classes Defined in I2 Can Participate in the <<supports>> Relation

Software and Hardware Integration: I3
  [Figure: <<supports>> relations between the node classes (Win95, SUN OS 4.1.4, with connectors MPI_Impl and CSockets) and the component classes PCCtrCl, XCtrCl, XFrontEnd, and Counter; interfaces shown: response, request, receive, gossip]

Instantiation - Software: I4S
  Component Instances
    Class
    Identification
    Calls
  Properties
    Instance Calls Refine Class Calls
    Only Classes in I2S May Be Instantiated

Instantiation - Software: I4S
  [Figure: component instances c1:PCCtrCl, c2:PCCtrCl, c3:PCCtrCl, c4:XCtrCl, fe1:XFrontEnd, fe2:XFrontEnd, and ct1:Counter through ct6:Counter, connected through the request, response, receive, and gossip interfaces]

Instantiation - Hardware: I4H
  Node Instances
    Class
    Identification
  Connector Instances
    Class
    Identification
    Set of Connected Nodes
  Properties
    There are Only Instances of the Node & Connector Classes Defined in I2H
    Connectors Refine I2H Connections

Instantiation - Hardware: I4H
  [Figure: node instances pc1:Win95 through pc4:Win95 and sun1:SunOS4.1.4 through sun10:SunOS4.1.4; connector instances sock1 through sock4 and mpi1]

Installation Requirements
  A Set of Component Instances Must Be Deployed Together or Separated
  Fix the Location of Some Component Instances
  All Installation Requirements Must Be Consistent With the Requirements Imposed by All the Previous Specification Levels
  Requirements
    Together
    Separated
    Fix

Installation - Requirements: Ifix, Iseparated
  [Figure: fix requirements involving fe1:XFrontEnd and fe2:XFrontEnd (request, receive) and the nodes sun2:SunOS4.1.4 and sun3:SunOS4.1.4]
  separated = {ct1:Counter, ct2:Counter, ct3:Counter, ct4:Counter, ct5:Counter, ct6:Counter}

Mapping Applications to Hardware
  Applications (Left) and Hardware (Right) Instances
  Restrictions on Which Applications can be Deployed on Which Hardware?
  Which Applications Deployed Together?
  Which Applications Must be Separate?

Objective: ‘Best’ Optimal Deployment
  [Figure]

Using I5 for BMI
  Focus at Architectural Level
  Multiple Assets to Bring Together
    Hospital EMRs, Provider EMRs, Other Systems
    Multiple and Disparate Hardware
  Different Contexts and Needs
    Clinical Practice – (Near) Real-Time Integration/Access
    Clinical Research – De-Identified Integrated Repository
  Performance will be Key Issue
    Clinical Practice – Time of Access
    Clinical Research – Volume of Information
    Some Genomic Data Requires Terabytes of Data!
    Information Overload Possible

The Next Big Challenge
  Macro-Architectures
    System of Systems
    Application of Applications
  Involves Two Key Issues
    Interoperability
      Heterogeneous Distributed Databases
      Heterogeneous Distributed Systems
      Autonomous Applications
    Scalability
      Rapid and Continuous Growth
      Amount of Data
      Variety of Data Types
      Different Privacy Levels or Ownerships of Data

Interoperability: A Classic View
  [Figure labels: Simple Federation, Multiple Nested Federation, Local Schema, Federated Integration, FDB Global Schema, FDB 1, FDB3, FDB Global Schema 4, Federation]

What is CORBA?
  Differs from Typical Programming Languages
  Objects can be …
    Located Throughout Network
    Interoperate with Objects on other Platforms
    Written in Any PLs for which there is Mapping from IDL to that Language
  [Figure labels: Application Interfaces, Domain Interfaces, Object Request Broker, Object Services]

What is CORBA?
  Allow Interactions from Client to Server
  CORBA Installed on All Participating Machines
  [Figure labels: Client Application, Static Stub, DII, ORB Interface, Client ORB Core, Network, Server Application, Skeleton, DSI, ORB Interface, Object Adapter, Server ORB Core]
  Figure notes: IDL - Independent; Same for all applications; There may be multiple object adapters

CORBA-Based Development
  [Figure labels: IDL file, IDL Compiler, Stub, Client Application, ORB/IIOP, IDL Compiler, Skeleton, Object Implementation, ORB/IIOP]

Database Interoperability in the Internet
  Technology
    Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML
  Architecture
    Information Broker
      Mediator-Based Systems
      Agent-Based Systems

ORB Integration: Java Client + Legacy Application
  [Figure labels: Java Client, Legacy Application, Java Wrapper, Object Request Broker (ORB)]
  CORBA is the Medium of Info. Exchange
  Requires Java/CORBA Capabilities

Java Client with Wrapper to Legacy Application
  [Figure labels: Java Client, Java Application Code, WRAPPER, Mapping Classes, JAVA LAYER, NATIVE LAYER, Native Functions (C++), RPC Client Stubs (C), Network, Legacy Application]
  Interactions Between Java Client and Legacy Appl. via C and RPC
  C is the Medium of Info. Exchange
  Java Client with C++/C Wrapper

COTS and Legacy Appls. to Java Clients
  [Figure: two parallel stacks, one for a COTS Application and one for a Legacy Application, each with Java Application Code, Native Functions that Map to the Appl (NATIVE LAYER), Mapping Classes (JAVA LAYER), and a JAVA NETWORK WRAPPER, connected over the Network to a Java Client]
  Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers

Java Client to Legacy App via RDBS
  [Figure labels: Java Client, Relational Database System (RDS), Legacy Application; flows: Transformed Legacy Data, Updated Data, Extract and Generate Data, Transform and Store Data]

JDBC
  JDBC API Provides DB Access Protocols for Open, Query, Close, etc.
Different Drivers for Different DB Platforms
[Figure labels: JDBC API; Java Application; Driver Manager; Driver; Oracle; Access; Sybase]
SWEA103

Connecting a DB to the Web
[Figure labels: DBMS; CGI Script Invocation or JDBC Invocation; Web Server; Internet; Browser]
Web Servers are Stateless
DB Interactions Tend to be Stateful
Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open
SWEA104

Connecting More Efficiently
[Figure labels: DBMS; Helper Processes; CGI Script or JDBC Invocation; Web Server; Internet; Browser]
To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection
Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
System is Still Stateless
SWEA105

DB-Internet Architecture
[Figure labels: WWW Client (Netscape); WWW client (Info. Explore); WWW Client (HotJava); Internet; HTTP Server; DBWeb Dispatcher; DBWeb Gateway (four instances)]
SWEA106

Biomedical Architectures
Transcend Normal Two, Three, and Four Tier Solutions: Macro-Architecture
  An Architecture of Architectures!
  Need to Integrate Systems that are Themselves Multi-Tier and Distributed
Need to Resolve Data Ownership Issues
  State of Connecticut Agencies Don't Share
  Competing Hospitals Seek to Protect Market Share
T1, T2, and Clinical Research
  Requires Interoperating Genomic Databases/Supercomputers
  Integration of De-identified Patient Data from Multiple Sources to Allow Sufficient Study Samples
  De-identified Data Repositories or Data Marts
  Dealing with Ownership Issues (DNA Research)
SWEA107

Consider Team Project Architecture
[Figure labels: Providers; Patients; PHR; EMR; Web-Based Portal (XML + HL7); Open Source DB (XML or MySQL); Feedback Repository; Clinical Researchers; Education Materials]
SWEA108

Internet and the Web
A Major Opportunity for Business
  A Global Marketplace
  Business Across State and Country Boundaries
A Way of Extending Services
  Online Payment vs.
  VISA, Mastercard
A Medium for Creation of New Services
  Publishers, Travel Agents, Teller, Virtual Yellow Pages, Online Auctions …
A Boon for Academia
  Research Interactions and Collaborations
  Free Software for Classroom/Research Usage
  Opportunities for Exploration of Technologies in Student Projects
What are Implications for BMI? Where is the Advantage?
SWEA109

WWW: Three Market Segments
[Figure labels: Intranet; Business to Business; Internet; Corporate Network; Provider Network; Server; Decision support; Mfg.; System monitoring; corporate repositories; Workgroups; Information sharing; Ordering info./status; Targeted electronic commerce; Sales; Marketing; Information Services; Exposure to Outside]
SWEA110

Information Delivery Problems on the Net
Everyone can Publish Information on the Web Independently at Any Time
  Consequently, there is an Information Explosion
  Identifying Information Content More Difficult
There are too Many Search Engines but too Few Capable of Returning High Quality Data
Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes
What are Information Delivery Issues for BMI?
  Publishing of Patient Education Materials
  Publishing of Provider Education Materials
  How Can Patients/Providers Find what they Need?
  How do they Know if it's Relevant? Reputable?
SWEA111

Example Web Applications
Scenario 1: World Wide Wait
  A Major Event is Underway and the Latest, Up-to-the-Minute Results are Being Posted on the Web
  You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Pointing at the Result Posting Site, and Wait, and Wait, and Wait …
What is the Problem?
  The Scalability Problems are the Result of a Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application
May not be Relevant to BMI: Hard to Apply Scenario
SWEA112

Example Web Applications
Scenario 2:
  Many Applications Today have the Need for Tracking Changes in Local and Remote Data Sources and Notifying of Changes If Some Condition Over the Data Source(s) is Met
  To Monitor Changes on Web, You Need to Fire up Your Trusty Web Browser from Time to Time, Cache the Most Recent Result, and Difference Manually Each Time You Poll the Data Source(s)
Issue: Pure Pull is Not the Answer to All Problems
BMI: If a Patient Enters Data that Sets off a Chain Reaction, how Can Provider be Notified and in Turn the Provider Notify the Patient (Bad Health Event)?
SWEA113

What is the Problem?
Applications are Asymmetric but the Web is Not
  Computation Centric vs. Information Flow Centric
Type of Asymmetry
  Network Asymmetry
    Satellite, CATV, Mobile Clients, Etc.
  Client to Server Ratio
    Too Many Clients can Swamp Servers
  Data Volume
    Mouse and Key Click vs. Content Delivery
  Update and Information Creation
    Clients Need to be Informed or Must Poll
Clearly, for BMI, Simple Web Environment/Browser is Not Sufficient: No Auto-Notification
SWEA114

What are Information Delivery Styles?
Pull-Based System
  Transfer of Data from Server to Client is Initiated by a Client Pull
  Clients Determine when to Get Information
  Potential for Information to be Old Unless Client Periodically Pulls
Push-Based System
  Transfer of Data from Server to Client is Initiated by a Server Push
  Clients may get Overloaded if Push is Too Frequent
Hybrid
  Pull and Push Combined
  Pull First and then Push Continually
SWEA115

Publish/Subscribe
Semantics: Servers Publish/Clients Subscribe
  Servers Publish Information Online
  Clients Subscribe to the Information of Interest (Subscription-based Information Delivery)
  Data Flow is Initiated by the Data Sources (Servers) and is Aperiodic
  Danger: Subscriptions can Lead to Other Unwanted Subscriptions
Applications
  Unicast: Database Triggers and Active Databases
  1-to-n: Online News Groups
May work for Clinical Researcher to Provider Push
SWEA116

Design Options for Nodes
Three Types of Nodes:
  Data Sources
    Provide Base Data which is to be Disseminated
  Clients
    Who are the Net Consumers of the Information
  Information Brokers
    Acquire Information from Other Data Sources, Add Value to that Information and then Distribute this Information to Other Consumers
    By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Need of Many Users
Brokers may be Ideal Intermediaries for BMI!
  Act on Behalf of Patients, Providers
  Incorporate Secure Access
SWEA117

Research Challenges
Ubiquitous/Pervasive
  Many computers and information appliances everywhere, networked together
Inherent Complexity:
  Coping with Latency (Sometimes Unpredictable)
  Failure Detection and Recovery (Partial Failure)
  Concurrency, Load Balancing, Availability, Scale
  Service Partitioning
  Ordering of Distributed Events
"Accidental" Complexity:
  Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.
  Autonomy: Change and Evolve Autonomously
  Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
SWEA118

Infosphere
Problem: too many sources, too much information
[Figure labels: Internet: Information Jungle; Infopipes; Clean, Reliable, Timely Information, Anywhere; Digital Earth; Personalized Filtering & Info. Delivery; Sensors]
SWEA119

Current State-of-Art
[Figure labels: Web Server; Mainframe; Database Server; Thin Client]
SWEA120

Infosphere Scenario - for BMI
[Figure labels: Infotaps & Fat Clients; Sensors; Variety of Servers; Many sources; Database Server]
SWEA121

Heterogeneity and Autonomy
Heterogeneity: How Much can we Really Integrate?
  Syntactic Integration
    Different Formats and Models
    Web/SQL Query Languages
  Semantic Interoperability
    Basic Research on Ontology, Etc.
Autonomy
  No Central DBA on the Net
  Independent Evolution of Schema and Content
  Interoperation is Voluntary
Interface Technology (Support for ISVs)
  DCOM: Microsoft Standard
  CORBA, Etc.
SWEA122

Security and Data Quality
Security
  System Security in the Broad Sense
  Attacks: Penetrations, Denial of Service
  System (and Information) Survivability
    Security Fault Tolerance
    Replication for Performance, Availability, and Survivability
Data Quality
  Web Data Quality Problems
    Local Updates with Global Effects
    Unchecked Redundancy (Mutual Copying)
    Registration of Unchecked Information
    Spam on the Rise
SWEA123

Legacy Data Challenge
Legacy Applications and Data
  Definition: Important and Difficult to Replace
  Typically, Mainframe Mission Critical Code
  Most are OLTP and Database Applications
Evolution of Legacy Databases
  Client-server Architectures
  Wrappers
  Expensive and Gradual in Any Case
SWEA124

Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability
  Combining SQL with Keyword Queries
Consistent Updates
  Atomic Transactions and Beyond
But Everything has to be in a Database!
  Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
  Interoperable Query Processing
  Extended Transaction Updates
Commodities DB Software
  A Little Help is Still Good If it is Cheap
  Internet Facilitates Software Distribution
  Databases as Middleware
SWEA125

Data Warehousing and Data Mining
Data Warehousing
  Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making
  Underlying Infrastructure in Support of Mining
  Provides Means to Interact with Multiple DBs
  OLAP (On-Line Analytical Processing) vs. OLTP
Data Mining
  Discovery of Information in Vast Data Sets
  Search for Patterns and Common Features
  Discover Information not Previously Known
    Medical Records Accessible Nationwide
    Research/Discover Cures for Rare Diseases
  Relies on Knowledge Discovery in DBs (KDD)
SWEA126

Data Warehousing and OLAP
A Data Warehouse
  Database is Maintained Separately from an Operational Database
  "A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support of Management's Decision Making Process" [W.H. Inmon]
OLAP (On-Line Analytical Processing)
  Analysis of Complex Data in the Warehouse
  Attempt to Attain "Value" through Analysis
  Relies on Trained, Skilled Knowledge Workers who Discover Information
Data Mart
  Organized Data for a Subset of an Organization
  Establish De-Identified Marts for BMI Research
SWEA127

Building a Data Warehouse
Option 1
  Leverage Existing Repositories
  Collate and Collect
  May Not Capture All Relevant Data
Option 2
  Start from Scratch
  Utilize Underlying Corporate Data
[Figure labels: Corporate data warehouse; Option 1: Consolidate Data Marts; Option 2: Build from scratch; Data Mart ...]
[Figure labels, continued: Data Mart (three more); Corporate data]
SWEA128

BMI - Partition/Excerpt Data Warehouse
Clinical and Epidemiological Research (and for T2 and T1)
  Each Study Submitted to Institutional Review Board (IRB)
  For Human Subjects (Assess Risks, Protect Privacy)
  See: http://resadm.uchc.edu/hspo/irb/
To Satisfy IRB (and Privacy, Security, etc.), Reverse Process to Create a Data Mart for each Approved Study
  Export/Excerpt Study Data from Warehouse
  May be Single or Multiple Sources
[Figure labels: BMI data warehouse; Data Mart ... Data Mart]
SWEA129

Data Warehouse Characteristics
Utilizes a "Multi-Dimensional" Data Model
Warehouse Comprised of
  Store of Integrated Data from Multiple Sources
  Processed into Multi-Dimensional Model
Warehouse Supports Time Series and Trend Analysis
  "Super-Excel" Integrated with DB Technologies
Data is Less Volatile than Regular DB
  Doesn't Dramatically Change Over Time
  Updates at Regular Intervals
  Specific Refresh Policy Regarding Some Data
SWEA130

Three Tier Architecture
[Figure labels: External data sources; Operational databases; Extract, Transform, Load, Refresh; monitor; integrator; metadata; Data Warehouse; Data marts; serve; OLAP Server; Query; report; Summarization report; Data mining]
SWEA131

Data Warehouse Design
Most Data Warehouses use a Star Schema to Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension Table whose Columns are the Attributes of that Dimension
A Fact Table Connects All Dimension Tables with a Multiple Join
  Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimension Tables, which Provide its Multi-Dimensional Coordinates
  Each Tuple in the Fact Table Stores the Measures for those Coordinates
Links Between the Fact Table and the Dimension Tables Form a Shape Like a Star
SWEA132

What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful
For BMI: Axes for Diagnosis, Drug, Subject Age
SWEA133

Multi-Dimensional Schemas
Supporting Multi-Dimensional Schemas Requires Two Types of Tables:
  Dimension Table: Tuples of Attributes for Each Dimension
  Fact Table: Measured/Observed Variables with Pointers into Dimension Table
Star Schema
  Characterizes Data Cubes by having a Fact Table with a Single Table for Each Dimension
Snowflake Schema
  Dimension Tables from Star Schema are Organized into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
SWEA134

Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date (dimension table): Date, Month, Year
Product (dimension table): ProductNo, ProdName, ProdDesc, Category
Store (dimension table): StoreID, City, State, Country, Region
Customer (dimension table): CustID, CustName, CustCity, CustCountry
SWEA135

Example of Star Schema for BMI
Patient Fact Table: Visit Date, Vitals, Symptoms, Patient, Medications
Date (dimension table): Date, Month, Year
Vitals (dimension table): BP, Temp, Resp, HR (Pulse)
Symptoms (dimension table): Pulmonary, Heart, Mus-Skel, Skin, Digestive, Etc.
Patient (dimension table): PatientID, PatientName, PatientCity, PatientCountry
Medications: Reference another Star Schema for all Meds
SWEA136

A Second Example of Star Schema …
SWEA137

and Corresponding Snowflake Schema
SWEA138

Data Warehouse Issues
Data Acquisition
  Extraction from Heterogeneous Sources
  Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Quality
  Is the Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include:
  How Current is the Data? Frequency of Update?
  Availability of Warehouse?
  Dependencies of Data?
  Distribution, Replication, and Partitioning Needs?
  Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?
  For CTSA: Data Ownership (Competing Hosps).
SWEA139

Knowledge Discovery
Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully
Knowledge Discovery
  Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set
  Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from Large Collection of Data
Data Mining
  A Critical Step in the Knowledge Discovery Process
  Extracts Implicit Information from Large Data Set
SWEA140

Steps in a KDD Process
Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
  Choosing the Mining Method(s) and Algorithm(s)
  Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
  This is Not an Automated Process!
  Requires Significant Human Interaction!
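The middle KDD steps (cleaning, integration, mining) can be sketched in a few lines. This is only an illustration under stated assumptions: the two record sources, the comma-separated basket format, and the function names `clean`, `integrate`, and `mine_pairs` are all hypothetical and not part of the slides, and pair counting stands in for whatever mining method would actually be chosen. Evaluation of the result is deliberately left to a person, in line with the caveats above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw purchase records from two made-up sources.
SOURCE_A = [" Beer, nuts ", "milk,bread", ""]
SOURCE_B = ["BEER,Chips", "Milk, Bread ,beer"]


def clean(record):
    """Data cleaning: normalize case and whitespace, drop empty items."""
    return frozenset(
        item.strip().lower() for item in record.split(",") if item.strip()
    )


def integrate(*sources):
    """Data integration: merge sources, discarding records that cleaned to nothing."""
    return [basket for source in sources for basket in map(clean, source) if basket]


def mine_pairs(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


baskets = integrate(SOURCE_A, SOURCE_B)
pair_counts = mine_pairs(baskets)

# The counts are candidate patterns only; judging them is the human step.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Keeping each step as its own function mirrors the slide's list, so a different cleaning rule or mining method can be swapped in without touching the rest.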
SWEA141

OLAP Strategies
OLAP Strategies
  Roll-Up: Summarization of Data
  Drill-Down: from the General to Specific (Details)
  Pivot: Cross Tabulate the Data Cubes
  Slice and Dice: Projection Operations Across Dimensions
  Sorting: Ordering Result Sets
  Selection: Access by Value or Value Range
Implementation Issues
  Persistent with Infrequent Updates (Loading)
  Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes
  Recovery Less Critical - Mostly Read Only
  Temporal Aspects of Data (Versions) Important
SWEA142

On-Line Analytical Processing
Data Cube
  A Multidimensional Array
  Each Attribute is a Dimension
  In Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date

  Product      Store     Date     Sale
  acron        Rolla,MO  7/3/99   325.24
  budwiser     LA,CA     5/22/99  833.92
  large pants  NY,NY     2/12/99  771.24
  3’ diaper    Cuba,MO   7/30/99  81.99

[Figure: data cube with axes Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), Date (Jan, Feb, March, April)]
SWEA143

On-Line Analytical Processing
For BMI: Imagine a Data Table with Patient Data
  Define Axis
  Summarize Data
  Create Perspective to Match Research Goal
  Essentially De-identified Data Mart

  Patient  Med      BirthDat  Dosage
  Steve    Lipitor  1/1/45    10mg
  John     Zocor    2/2/55    80mg
  Harry    Crestor  3/3/65    5mg
  Lois     Lipitor  4/4/66    20mg
  Charles  Crestor  7/1/59    10mg

[Figure: data cube with axes Medication (Lescol, Crestor, Zocor, Lipitor), Dosage (5, 10, 20, 40, 80), Decade (1940s, 1950s, 1960s, 1970s)]
SWEA144

Examples of Data Mining
The Slicing Action
  A Vertical or Horizontal Slice Across Entire Cube
[Figure labels: Multi-Dimensional Data Cube (Months, Products, Sales); Slice on city Atlanta (Months, Products, Sales)]
SWEA145

Examples of Data Mining
The Dicing Action
  A Slice First Identifies One Dimension
  A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions
[Figure labels: Months, Products, Sales; Dice on Electronics and Atlanta (March 2000, Electronics, Atlanta)]
SWEA146

Examples of Data Mining
Drill Down - Takes a
Facet (e.g., Q1) and Decomposes into Finer Detail
Roll Up: Combines Multiple Dimensions
  From Individual Cities to State
[Figure labels: Drill down on Q1 (Jan, Feb, March; Products, Sales); Roll Up on Location (State, USA) (Q1, Q2, Q3, Q4; Products, Sales)]
SWEA147

Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
Time Series Data for Glucose, BP, Peak Flow, etc.
[Figure labels: Spatial databases; Multimedia databases; World Wide Web; Time series data; Geographical and Satellite Data]
SWEA148

Advantages/Objectives of Data Mining
Descriptive Mining
  Discover and Describe General Properties
  60% of People who Buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months
Predictive Mining
  Infer Interesting Properties based on Available Data
  People who Buy Beer on Friday usually also Buy Nuts or Chips
Result of Mining
  Order from Chaos
  Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.
  Impact on Marketing Strategy
SWEA149

Data Mining Methods (1)
Association
  Discover the Frequency of Items Occurring Together in a Transaction or an Event
  Example
    80% of Customers who Buy Milk also Buy Bread
      Hence - Bread and Milk Adjacent in Supermarket
    50% of Customers Forget to Buy Milk/Soda/Drinks
      Hence - Available at Register
Prediction
  Predicts Some Unknown or Missing Information based on Available Data
  Example
    Forecast Sale Value of Electronic Products for Next Quarter via Available Data from Past Three Quarters
SWEA150

Association Rules
Motivated by Market Analysis
Rules of the Form
  $\text{Item}_1 \wedge \text{Item}_2 \wedge \dots \wedge \text{Item}_k \Rightarrow \text{Item}_{k+1} \wedge \dots \wedge \text{Item}_n$
Example
  "Beer ^ Soft Drink => Pop Corn"
Problem: Discovering All Interesting Association Rules in a Large Database is Difficult!
  Issues
    Interestingness
    Completeness
    Efficiency
Basic Measurement for Association Rules
  Support of the Rule
  Confidence of the Rule
SWEA151

Data Mining Methods (2)
Classification
  Determine the Class or Category of an Object based on its Properties
  Example
    Classify Companies based on the Final Sale Results in the Past Quarter
Clustering
  Organize a Set of Multi-dimensional Data Objects in Groups to Minimize Inter-group Similarity and Maximize Intra-group Similarity
  Example
    Group Crime Locations to Find Distribution Patterns
SWEA152

Classification
Two Stages
  Learning Stage: Construction of a Classification Function or Model
  Classification Stage: Prediction of Classes of Objects Using the Function or Model
Tools for Classification
  Decision Tree
  Bayesian Network
  Neural Network
  Regression
Problem
  Given a Set of Objects whose Classes are Known (Training Set), Derive a Classification Model which can Correctly Classify Future Objects
SWEA153

An Example
Attributes

  Attribute    Possible Values
  outlook      sunny, overcast, rain
  temperature  continuous
  humidity     continuous
  windy        true, false

Class Attribute - Play/Don't Play the Game
Training Set
  Values that Set the Condition for the Classification
  What are the Patterns Below?

  Outlook   Temperature  Humidity  Windy  Play
  sunny     85           85        false  No
  overcast  83           78        false  Yes
  sunny     80           90        true   No
  sunny     72           95        false  No
  sunny     72           70        false  Yes
  …         …            …         …      ...
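One way to look for a pattern in the training set is to score candidate splits the way a decision-tree learner would. The sketch below is an assumption-laden illustration, not the method used in the course: it uses only the five rows visible on the slide (the rest are elided), it picks entropy-based information gain as the scoring rule, and the candidate tests, including the thresholds of 80, are hypothetical choices made for the example.

```python
from collections import Counter
from math import log2

# The five training rows visible on the slide; the remaining rows are elided there.
# Columns: outlook, temperature, humidity, windy, play
ROWS = [
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, test):
    """Entropy reduction obtained by splitting rows with a boolean test."""
    labels = [row[-1] for row in rows]
    passed = [row[-1] for row in rows if test(row)]
    failed = [row[-1] for row in rows if not test(row)]
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in (passed, failed) if part
    )
    return entropy(labels) - remainder


# Hypothetical candidate tests; the numeric thresholds are assumptions.
CANDIDATE_TESTS = {
    "outlook == sunny": lambda row: row[0] == "sunny",
    "temperature <= 80": lambda row: row[1] <= 80,
    "humidity <= 80": lambda row: row[2] <= 80,
    "windy": lambda row: row[3],
}

gains = {name: information_gain(ROWS, test) for name, test in CANDIDATE_TESTS.items()}
best = max(gains, key=gains.get)
print(best)  # humidity <= 80
```

On these five rows the humidity test separates Play from Don't Play completely, which is the learning stage in miniature; with the full training set the best split could well differ.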
SWEA154

Data Mining Methods (3)
Summarization
  Characterization (Summarization) of General Features of Objects in the Target Class
  Example
    Characterize People's Buying Patterns on the Weekend
    Potential Impact on "Sale Items" & "When Sales Start"
    Department Stores with Bonus Coupons
Discrimination
  Comparison of General Features of Objects Between a Target Class and a Contrasting Class
  Example
    Comparing Students in Engineering and in Art
    Attempt to Arrive at Commonalities/Differences
SWEA155

Summarization Technique
Attribute-Oriented Induction
Generalization using Concept Hierarchy (Taxonomy)

  barcode  category    brand       content    size
  14998    milk        Dairyland   Skim       2L
  12998    mechanical  MotorCraft  valve 23a  12in
  …        …           …           …          ...

  Category  Content  Count
  milk      skim     280
  milk      2%       98
  …         …        ...

[Figure: concept hierarchy. Labels: food; Milk (Skim milk, 2% milk); bread (White bread, whole wheat); brands Lucern, Dairyland, Wonder, Safeway]
SWEA156

Why is Data Mining Popular?
Technology Push
  Technology for Collecting Large Quantity of Data
    Bar Code, Scanners, Satellites, Cameras
  Technology for Storing Large Collection of Data
    Databases, Data Warehouses
    Variety of Data Repositories, such as Virtual Worlds, Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances
  Targeted Marketing by Age, Region, Income, etc.
  Exploiting User Preferences/Customized Shopping
What is Potential for BMI?
  How do you see Data Mining Utilized?
  What are Key Issues to Worry About?
SWEA157

Requirements & Challenges in Data Mining
Security and Social
  What Information is Available to Mine?
  Preferences via Store Cards/Web Purchases
  What is Your Comfort Level with Trends?
User Interfaces and Visualization
  What Tools Must be Provided for End Users of Data Mining Systems?
  How are Results for Multi-Dimensional Data Displayed?
Performance Guarantees
  Range from Real-Time for Some Queries to Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets
SWEA158

An Initiative of the University of Connecticut Center for Public Health and Health Policy
Robert H. Aseltine, Jr., Ph.D.
Cal Collins
January 16, 2008
SWEA159

What is CHIN?
State of Connecticut Agencies Collect and Maintain Data in Separate Databases such as:
  Vital Statistics: Birth, Death (DPH)
  Surveillance data: Lead Screening and Immunization Registries (DPH)
  Administrative services: LINK system (DCF), CAMRIS (DMR)
  Benefit programs: WIC (DPH), Medicaid (DSS)
  Educational achievement: (PSIS)
Such Data is Un-Integrated
  Impossible to Track/Assess Target Populations
  Difficult to Develop Evidence-Based Practices
  Limits Meaningful Interactions Among State Agencies
SWEA160

What Do We Mean by "Integration?"

UCONN Health Center Low Birth Weight Infant Registry

  Last Name  First Name  DOB         SSN           Birth Wt. (kg)
  Appel      April       01/01/1999  016-000-9876  2.8
  Berry      John        02/02/1997  216-000-4576  2.9
  Carat      Colleen     03/03/1993  119-000-1234  1.9
  Ernst      Max         04/04/1994  116-000-3456  2.7
  Gomez      Gloria      05/05/1995  036-000-9999  2.6
  Hurst      William     06/06/1996  016-000-5599  3.1
  Keller     Helene      07/07/1997  017-000-2340  2.5
  Martinez   Pedro       08/08/1998  018-000-9886  3.0
  Rodriguez  Felix       09/09/1999  029-000-9111  2.8
  Smith      Peggy       10/10/2000  016-000-8787  2.5

Dept. of Mental Retardation Birth to Three System

  Last Name  First Name  DOB         Street   Town
  Allen      Gwen        01/01/1999  Apple    Enfie
  Buck       Jerome      07/01/1999  Burbank  West
  Cleary     Jane        03/03/1993  Cedar    Tolla
  Dory       Daniel      03/03/1993  Dogfish  Hartf
  Ernst      Max         04/04/1994  Elm      Enfie
  Friday     Joe         11/03/1999  Fruit    Wind
  Glenn      Valerie     03/23/1998  Glen     Branf
  Martinez   Pedro       08/08/1998  High     Hartf
  Riley      Lily        03/03/1996  Ipswich  Bridg
  Sanchez    Ramon       03/03/1993  Juniper  New

CT Dept. of Education PSIS System

  Last Name  First Name  CMT Math  Polio Vac Date  Days in Attendance
  Appel      April       134       01/05/1999      179
  Carat      Colleen     256       05/01/1998      122
  Cleary     Jane        268       01/28/2000      178
  Ernst      Max         152       01/09/1999      145
  Gomez      Gloria      289       01/01/1999      168
  Friday     Joe         265       10/01/1999      170
  Keller     Helene      309       11/01/2001      180
  Martinez   Pedro       248       12/01/2003      180
  Riley      Lily        201       01/01/1999      122
  Sanchez    Ramon       249       01/01/1999      159

Integrated Result

  Last Name  First Name  DOB         SSN           Birth Wt.  Street  Town      CMT Math Grade 3  Polio Vaccination Date  Days in Attendance
  Ernst      Max         04/04/1994  116-000-3456  2.7        Elm     Enfield   152               01/09/1999              145
  Martinez   Pedro       08/08/1998  018-000-9886  3.0        High    Hartford  248               12/01/2003              180
SWEA161

Key Challenges to Integrating Data
Security and Privacy
  HIPAA
  FERPA
  WIC, Social Security (Medicaid/Medicare) regulations
  State statutes
Alteration/disruption of business practices
Unique identification of individuals/cases
Accuracy and reliability of data
Disparate hardware/software platforms
SWEA162

The Solution: CHIN
Connecticut Health Information Network
A Federated Network That:
  Allows Shared Access to "Health"-related Data From Heterogeneous Databases
  Allows Agencies to Retain Complete Control Over Access to Data
  Has Minimal Impact on Business Practices
  Complies with Security and Privacy Statutes
  Incorporates Cutting-edge Approaches to Case Matching
Partnership of:
  Early Partners: DPH, DCF, DDS, DoE, DOIT, UConn, Akaza Research
SWEA164

CHIN Processes and Components
[Figure: process steps with the components responsible for them. Steps: Define data elements in CHIN; Map data elements to source database; Publish "metadata" to CHIN with security and privacy rules; Build Query; Review Committee Approval; Query Execution: Identifier Matching and Data Merge; De-identify Data; Integrated, De-identified Data. Components: CHIN Metadata Registry; CHIN Contributor; CHIN Metadata Registry and CHIN Trusted Broker; CHIN GRID and Trusted Broker; CHIN Enterprise Administration; CHIN Metadata Registry and CHIN Query Builder; CHIN Trusted Broker and De-Identification Engine]
SWEA165

Original CHIN Architecture
http://publichealth.uconn.edu/CHIN.php
SWEA166

Second CHIN Architecture: User Side
[Figure labels: A & A
Contributor; Contributor]
SWEA167

Second CHIN Architecture: Contributor Side
[Figure labels: A & A; Front End; Trusted Broker]
SWEA168

Current CHIN Architecture
SWEA169

CHIN Architecture: Standards-based
All data is mapped to Health Level Seven's Clinical Document Architecture (CDA) in XML
  Health Level Seven (HL7) is an ANSI-approved Standards Developing Organization
  HL7 has its own XML Special Interest Group, responsible for developing XML implementations of its standards
  HL7 is also an active participant in W3C, the organization responsible for the development of XML
  CDA was approved as an ANSI standard in November of 2000.
Component Architecture communicates via Web Services and OGSA Grid standards
SWEA170

CHIN Arch.: Proven, Open Components
Components are based on open-source libraries
  The grid-based servers Mako and Virtual Mako are part of the Mobius Project from Ohio State University's Dept. of BioInformatics
  The translation tools to get data into XML are provided by the XQuare and XBridge projects, hosted on the ObjectWeb website, an open source middleware community
  The algorithm and code for identity management is FEBRL, Freely Extensible Biomedical Record Linkage, which was developed at Australian National University
  NuSOAP Web Services Engine for component integration
SWEA171

FEBRL
Identifier matching in FEBRL proceeds in four steps:
  Data cleansing and standardization
    Removes, to the degree possible, string discrepancies based on common misspellings, extra white space, or misplaced name or address components.
  Indexing
    Reduces the number of record comparisons which must be performed, for scalability; blocking, sorting, and bigram indexing methods are all supported.
  Record comparison
    Conducted using an arbitrary composition of exact or inexact string comparison methods over any combination of fields
  Classification.
    Follows the Fellegi-Sunter model, with record pairs assigned a weight based on a palette of probabilities and matches determined based on the record pair weights
SWEA172

FEBRL
The current prototype uses FEBRL to implement a simplistic method of linkage whereby record pairs are declared a match if the first and last name are exactly equal.
Next Steps
  Evaluate the accuracy of linking records over a rubric of five data fields - first name, last name, date of birth, social security number, and gender.
  Exact and inexact matching (i.e., misspellings and slight discrepancies), including experimental variations of the service based on the blinded bigram matching algorithm.
  Assess false positives and false negatives produced by each palette of field comparison algorithms.
  Evaluate the accuracy of linking records using fabricated data sets with characteristics similar to real datasets
  Experiment with variations of canopy cluster matching algorithm.
SWEA173

Other CHIN Issues
Why Choose an Open Architecture?
  Increased Accountability
  Plenty of Documentation and Research
  Greater Transparency
  Ease of Installation, Maintenance, Dissemination
How is Data Ported into CHIN?
  CHIN is based on a Grid, with each organization supporting its own data through a Contributor server
  Agency staff has complete control over access to data on CHIN by other users
  Only one server faces the outside network
SWEA174

Creating a Contributor Server
[Figure labels: External IP Address; Connection to CHIN Trusted Broker; Data Elements; Firewall; Datasource]
Contributor Server Contains:
  XML generated files
  Mako service
  Java files
  *.xqy files
  XML files to generate CDA compliant files
SWEA175

Connecting to rest of Network
Metadata Registry takes information
  About data elements
  About data security
  Datasource information
Contributor profile is registered with CHIN Network Admin
[Figure labels: External IP Address; Connection to CHIN Trusted Broker; Data Elements; Firewall; Datasource]
Contributor Server Contains:
  XML generated files
  Mako service
  Java files
  *.xqy files
  XML files to generate CDA compliant files
SWEA176

How do we get data out?
The Trusted Broker component:
  Pulls XML from the Virtual Mako which reaches out to all Contributors
  Compares records from different Contributors using FEBRL
  De-identifies data sets to generate a final data set for Investigators
The Front End component:
  Provides a central place for users to connect to the system
  Connects to the Metadata Registry and the Trusted Broker via Web Services calls
  Allows different users of the system to perform different actions
SWEA177

Getting Data from CHIN
SWEA178

Getting Data From CHIN
CHIN also contains:
  A Front-end server to take queries
  A Trusted Broker to compare data, perform record linkage, and de-identify results
[Figure labels: XML Files; FEBRL; Result Set; Deidentify; Final Result Set]
SWEA179

Progress to Date
Needs assessment completed
Technical and functional specifications identified
MOUs with state agencies
Expanding list of partners
Prototype developed
Funding for Model Network Development/Deployment/Evaluation 2008
SWEA180

Demo
SWEA181

EMR Architectures
Provider-Based Systems have Two Variants
  All Data In House
  Limited In House - Off Site Storage (Larger, Multi-Site Practices)
Larger Providers (Clinics)
  Control All Own Data
  Sizeable IT Staff for 24-7 Operations
  Control of Own Backups
Smaller Providers - Limited IT Staff
  Desire Out-of-Box Solution
  Local Data for Ease of Access
  Remote Storage - Promotes Off-Hours Access
Even 1st Variant - Service for "Backups"
SWEA182

EMR for Large Providers - AllScript
SWEA183

EMR for Smaller Providers
[Figure labels: Provider's Office (Local EMR, Patient Data); Vendor's Location, Server/Data Farm (Remote EMR); Remote Access]
SWEA184

Integrating Clinical Repositories
Provider/Hospital Relationship
  Provider has Privileges at Hospital
  Provider Chooses Office-Based EMR
  More Easily Integrated with Hospital EMR
Emerging at Community Hospital Level
  Example: Milford Hospital, MA
  All Area Providers with Privileges Linked in
  Ability to See Patient Records, Tests, at Hospital
  Unclear on Uploads from Providers to Hospital
  However, No Link to UMass Medical Center (with which Milford Hospital is Affiliated)
SWEA185

Integrating Clinical Repositories
CTSA - Region Wide Clinical/Translational Research
Target Area Hospitals
  St. Francis, Hartford, Hosp. Central CT, CCMC
  Each Hospital has Own Clinical Repository (EMR)
For Wider-Scoped T1, T2, and Clinical Research
  Need to Integrate these Repositories at Some Level
  What is Most Practical?
    Setting up Centralized De-Identified Repository?
    Creating Data Marts as you go?
    What are Pros and Cons of Each?
  Researcher Seeking CHF Patient Data Needs to have De-Identified Data Mart
SWEA186

Integrating Clinical Repositories
SWEA187

Integrating Clinical Repositories
SWEA188

Integrating Clinical Repositories
SWEA189

Integrating Clinical Repositories
NHIN Prototype Phase I
SWEA190

Integrating Clinical Repositories
NHIN Prototype Phase II
SWEA191

SWEA192

Personal Health Record Integration
SWEA193

Concluding Remarks
Only Scratched Surface on Architectures
  Micro Architectures
  Macro Architectures
  Super-Macro Architectures (We'll see …)
What are Key Facets in the Discussion?
  Role and Impact of Standards
  Open Solutions
  Architectural Variants - Reuse "Architecture"
    Can we Reuse CHIN for Clinical Practice?
    Are All Contributors Simply Each Hospital and EHR?
How do we Connect all of the Pieces?
What are Next Steps?
  Let's Review Some other Work
Source: Wide Range of Presentations on Web
SWEA194