Project Description
Introduction
Next-generation scientific applications will require access to Petabytes of observational imagery, measurements from remote sensing instruments (e.g. radio or infrared telescopes, planetary rovers, undersea drones), data from biological and/or chemical experiments, and other types of data archives. These data sets will reside on geographically distributed data sources with very heterogeneous schemas, data processing capabilities, and usage policies. To be useful to scientists, these applications must correlate these data products to find relationships and trends that lead to new knowledge. Moreover, scientists must be able to incorporate new processing capabilities to custom-tailor the data products generated by sensors and the data already stored in databases. Thus, these new applications must be built on top of systems designed for wide-area networks that support, among other features: 1) reliable 24/7 access to distributed data sources, 2) access to distributed computational resources (e.g. CPU cycles, disk space), 3) efficient distributed query processing, and 4) customization via distributed software deployment.
Typically, database middleware systems have been used as a solution to integrate heterogeneous data sources and support applications that require access to data in these sources. The term Federated System is used to depict a group of data sources integrated via database middleware. Unfortunately, the existing solutions are based on a centralized architecture that cannot scale to the wide-area environments that are becoming commonplace for scientific applications. This architecture, shown in Figure 1, relies on a central integration server to provide client applications with a uniform view of the data, and a single point of access to the data sources. The integration server relies on the capabilities of translators to extract the data from the sources, and perform schema mapping operations to convert data from the local schemas into the schema specified by the client to the integration server. Once the data items have been translated, they are sent back to the integration server for further processing. Most of the query processing occurs at the integration server site, and the data sources often act as mere I/O nodes. A catalog associated with the integration server provides the metadata necessary to guide the process of finding data sources, schema mapping rules, and query processing strategies. There have been two approaches to realizing database middleware systems. The first approach is to use a relational database engine as the integration server, and database gateways as the translators that allow the integration DBMS to access distributed data. In the second approach, a Mediator System specifically tailored for distributed processing is employed. This solution features an integration server called the mediator, which acts as the data integrator, and a group of wrappers that act as the translators.
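As a point of reference, the mediator/wrapper pattern described above can be summarized schematically: the mediator fans a request out to wrappers, each wrapper extracts data from its source and maps it into the global schema, and the mediator finishes all remaining processing centrally. The sketch below is our own illustration of this pattern, not code from any of the cited systems; the class and parameter names are hypothetical.

```python
# Schematic of the centralized mediator/wrapper (integration server/translator) pattern.
from typing import Callable, Dict, List


class Wrapper:
    """Translator that extracts data from one source and maps it to the global schema."""
    def __init__(self, extract: Callable[[str], List[dict]], map_row: Callable[[dict], dict]):
        self.extract = extract            # source-specific data extraction
        self.map_row = map_row            # local-schema -> global-schema mapping

    def query(self, request: str) -> List[dict]:
        return [self.map_row(row) for row in self.extract(request)]


class Mediator:
    """Central integration server: all further query processing happens here."""
    def __init__(self, wrappers: Dict[str, Wrapper]):
        self.wrappers = wrappers

    def run(self, request: str, predicate: Callable[[dict], bool]) -> List[dict]:
        rows: List[dict] = []
        for wrapper in self.wrappers.values():    # data from every source is shipped here
            rows.extend(wrapper.query(request))
        return [r for r in rows if predicate(r)]  # centralized post-processing
```

The centralization is exactly the scalability problem discussed next: every row crosses the network to the integration server before any useful work is done on it.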
By the nature of their architecture, scaling these database middleware systems to very dynamic wide-area environments is extremely difficult. Since most of the processing is done at the integration server, very large amounts of data must be moved over the network in order to deliver the data from the sources to the translators. Support for extending the system with code that implements customized operations such as image subsetting, feature extraction, and spatial analysis is difficult since it is usually done manually. As the number of sites in the system increases, it becomes very difficult to keep track of the sites where the necessary software has been deployed. During query processing, failures at one or more of the sources being used to solve a query often result in an unrecoverable error that forces the query to be restarted, and all the previous work is lost. Access to computational resources, such as clusters and disk farms, is rarely included as a feature that can be accessed via the software toolkits included with the middleware system. Hence, developers need to work with several system toolkits, which requires a team with expertise in each of these systems. Also, the catalog used in the system is assumed to be an “oracle” that knows the location of all the data sources, schema mapping rules, and query operators needed to solve a given query.
This proposal represents our effort to develop a new framework for building database middleware systems that are more aligned with the nature of wide-area environments and with the requirements of scientific applications being deployed in these environments. We propose to realize this framework by building the GaiaNET database middleware system to integrate databases and computational resources on large-scale wide-area networks. GaiaNET will provide applications with the abstractions of three types of services: data services, computational services, and software library services that can be seamlessly combined to build applications that support complex scientific queries and data analysis. These services will be combined using an approach that we call dynamic service composition, which is based on a Peer-to-Peer (P2P) architecture. The most salient novel features of GaiaNET are: 1) self-organization of federated sites, 2) dynamic selection and eviction of sources that participate in solving a query, 3) decentralized query processing with elastically redundant information and computation to meet reliability requirements, 4) decentralized control and coordination of query processing, 5) automatic deployment of application-specific code to remote federated sites, 6) the ability to satisfy Quality of Data (QoD) and Quality of Service (QoS) requirements for applications, and 7) seamless integration with Web technology and standards. GaiaNET will enable its users to federate sensors, databases, clusters, and other scientific equipment to create applications with value-added features that simplify the tasks of scientists and engineers working to analyze the vast amounts of data collected on a daily basis. GaiaNET will be distributed as an open source system. GaiaNET differs from Network Middleware such as CORBA, RMI, .NET and RPC since the latter are used as an infrastructure layer to provide applications with access to the network. GaiaNET (as well as database middleware) is at a higher layer, and can leverage the Network Middleware for connectivity purposes. But neither CORBA, RMI, nor .NET provides services such as distributed query execution, schema mapping, and caching, as GaiaNET will.
Figure 1: Centralized Database Middleware Architecture. (Clients reach a central integration server over the Internet; translators connect the integration server to sources such as Oracle 9i, IBM DB2, XML data, and text data.)
The GaiaNET project will be carried out by a multi-disciplinary team of Computer Scientists, Electrical Engineers and Earth Scientists from the University of Maryland, College Park (UMCP), and the University of Puerto Rico, Mayagüez (UPRM). The Computer Science team will lead the effort to develop GaiaNET and will consist of researchers from UMCP and UPRM. This team will collaborate with Electrical Engineers and Earth Scientists from two research centers at UPRM: 1) the NASA Tropical Center for Earth and Space Studies (TCESS), and 2) the NSF Center for Subsurface Sensing and Imaging Systems (CenSSIS). TCESS is a NASA University Research Center devoted to satellite data acquisition, image processing and analysis. TCESS operates Synthetic Aperture Radar (SAR) and HRPT tracking stations that receive over 70GB of satellite imagery per week. CenSSIS is a multi-institution NSF Engineering Research Center¹ that seeks to revolutionize our ability to detect and image biomedical and environmental-civil objects or conditions that are underground, underwater, or embedded within cells or inside the human body. Researchers from these two centers will contribute their expertise to deploy and use the system, develop domain-specific applications and processing code, and serve as a testbed that provides feedback on the features of the system.
The proposed project has great potential to achieve significant broader impacts with particular emphasis on the
following areas: 1) promote collaborative teaching, training, and learning among different academic institutions, 2) broaden the participation of Hispanic students, faculty, and scientists in cutting-edge database and Internet research (approximately 50% of the undergraduate engineering students at UPRM are women), 3) build much
needed research and education partnerships between an internationally recognized university (UMCP) and a
minority serving university (UPRM), 4) promote graduate and undergraduate research experiences to help increase
the number of Ph.D. and M.S. degrees in the U.S. workforce, 5) provide tools for scientific research to help reduce
costs and turnaround time, and 6) provide a new framework for developing more affordable and accessible
distributed database technology. The project will foster direct interaction among people with widely diverse
educational levels and backgrounds.
¹ Participating institutions in CenSSIS are Northeastern University, Rensselaer Polytechnic Institute, Boston University, and the University of Puerto Rico, Mayagüez.
The remainder of this proposal is organized as follows. Section 2 provides overview material and defines the
problem. Section 3 provides a technical description of the GaiaNET system and the research issues that we must
tackle to build such a system. Section 4 discusses related work. Section 5 discusses the broader impacts that this
project can have on research, education, and our society. Section 6 presents our expertise, management approach
and milestones to be completed over a five-year period. Finally, Section 7 discusses the results generated by our team
members from prior NSF support.
Overview of Service Composition
Consider an Earth Science application that needs to correlate surface satellite images with the land regions near the coast of San Juan, Puerto Rico. The goal of the analysis is to study the effect of urban development on coastal erosion. Various types of satellite images (e.g. AVHRR, MODIS, ETM+) are kept in databases located in Maryland, New York, and Texas. Suppose that Maryland and New York are replicated sites. Schematic maps of coastal regions are kept in San Juan. The application must find the appropriate satellite images from these types, and draw the coastal map on top of the image for a given region. A color code must be used to indicate zones of sand, vegetation, or water depth (just like weather forecasters show bands of showers on city maps). We can model the satellite images with a relational view Images(taken:Date, band:Integer, location:Rectangle, data:BLOB) that indicates the date taken, radiation energy band measured, location on the Earth, and the actual image data bytes. Likewise, the maps can be modeled with a view Maps(taken:Date, agency:String, landtype:Integer, location:Polygon) that indicates the date of map creation, agency that made the map, type of land (e.g. coast, mountain, city), and a polygon with the lines that form the map. To support our application, we must have a database middleware system that can expose these views to the application, find the data to populate the views, and support complex queries to analyze the data and extract new information.
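To make the data model concrete, the sketch below renders the two views as Python records. The field names mirror the Images(...) and Maps(...) views above; the Rectangle/Polygon encodings and the overlaps() bounding-box test are our own simplifying assumptions rather than anything defined in this proposal.

```python
# A minimal sketch of the two relational views used in the Earth Science example.
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

Rectangle = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)
Polygon = List[Tuple[float, float]]            # list of (lon, lat) vertices

@dataclass
class ImageRow:
    taken: date          # date the image was taken
    band: int            # radiation energy band measured
    location: Rectangle  # region of the Earth covered by the image
    data: bytes          # the actual image bytes (BLOB)

@dataclass
class MapRow:
    taken: date          # date the map was created
    agency: str          # agency that produced the map
    landtype: int        # type of land (e.g. coast, mountain, city)
    location: Polygon    # polygon with the lines that form the map

def overlaps(img: Rectangle, map_poly: Polygon) -> bool:
    """Crude bounding-box test standing in for a real spatial-overlap operator."""
    lons, lats = [p[0] for p in map_poly], [p[1] for p in map_poly]
    return not (max(lons) < img[0] or min(lons) > img[2] or
                max(lats) < img[1] or min(lats) > img[3])
```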
We propose a novel framework for building database middleware systems based on services that supply data, software and computing power, and which can be composed to form an execution pipeline that extracts the data, processes data items based on a query specification, and returns the results to the client application. We define a service as a server application that provides some type of functionality and which is reachable over a network. A data service provides access to a collection of data and metadata for a given application domain. These include database engines, web servers, file systems, or any other customized servers. A software service provides access to code and associated metadata; this code performs a given computational task. Examples include sorting routines, spatial indexing code, and feature extraction functions. Finally, a computing service provides access to computational resources required to process a collection of data with a specific set of software routines. Notice that some services might have dual roles, such as a database engine that provides both data and the software to process them. Following the World Wide Web Consortium (W3C) convention, we define a Web service as a service identified by a Uniform Resource Identifier (URI), whose interface is described by XML, and which can be accessed over the Web. In our framework, all data services, software services, and computing services are exposed to applications as Web services. We shall use the words “service” and “Web service” interchangeably, but bear in mind that all the services we mention are actually Web services.
We can model the data sources for our Earth Science application as Web services. Let us assume that the sites at Maryland, New York, and Texas are running data services and computing services, while San Juan is running all three types of services. To solve a query such as “Correlate all images and maps for region R, where the images were taken between June 1999 and April 2002”, we can use the following strategy: 1) invoke the Web data services in New York, Texas, and San Juan to extract the images and maps, 2) route the images and maps to computing services willing to correlate them, 3) route the necessary correlation code from a software service to the computing services found in step 2, and 4) send the results back to the client application. Figure 2 depicts this approach, assuming that data is brought from New York, Texas, and San Juan. The images from Texas and New York are first filtered using the date predicate to remove unwanted ones. All images are then sent to San Juan for correlation, and the results are sent to the client application, which is assumed to be in San Juan.
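As an illustration of this strategy (and only that: the service interfaces below are hypothetical stand-ins, not GaiaNET or W3C APIs), the following sketch models the three kinds of services as simple Python objects and annotates where each of the four steps happens.

```python
# Hypothetical service stubs illustrating the four-step strategy above.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataService:                     # e.g. the image databases at NY and TX
    site: str
    rows: List[dict] = field(default_factory=list)
    def fetch(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # Step 1: extract data, applying the date/region predicate at the source.
        return [r for r in self.rows if predicate(r)]

@dataclass
class SoftwareService:                 # holds named routines such as the correlation code
    routines: Dict[str, Callable] = field(default_factory=dict)
    def fetch_routine(self, name: str) -> Callable:
        # Step 3: supply the code that a computing service needs to run.
        return self.routines[name]

@dataclass
class ComputingService:                # e.g. the computing service at San Juan
    site: str
    def run(self, routine: Callable, *inputs):
        # Steps 2 and 4: execute the shipped routine and return results to the client.
        return routine(*inputs)

# Usage (names are illustrative): images = ny.fetch(in_range) + tx.fetch(in_range)
#   result = sju.run(sw.fetch_routine("correlate"), images, maps)
```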
In general, we can solve the problem of executing a query Q by building a service composition graph. The graph formed by the interaction of services shown in Figure 2 is an example of a service composition graph. Given a query Q, a service composition graph G is a directed acyclic graph (DAG) G = (V, E) with the following properties:
1. V represents a set of Web services acquired to solve Q.
2. Each edge (u,v) ∈ E represents the fact that service u is being used by service v. This is called a service composition between services u and v. These edges are called composition edges.
3. Each edge (u,v) ∈ E has an associated cost C(u,v). For all composition edges (u,v), the cost of the edge is defined as the cost incurred using service u, plus the cost incurred to deliver its results to service v.
4. By executing the services in G = (V, E) as indicated by E we can find a solution to query Q.
The relationships in E form the heart of our framework since they represent the flow of data and computation that enables data to be extracted and processed according to the instructions and code issued in the query Q. In the context of query processing, the composition graph G represents a plan to solve the query. The set of services V takes care of executing one or more of the query operators in this plan, providing the necessary data, or providing the code required for one or more operators. The composition edges indicate how data and code move between the services. The cost of composition edges is application-specific; possible cost metrics include resource usage, response time (wall clock time), volume of data transferred, or monetary cost (assuming some services charge in dollars). Hence, the cost of the computation represented by the composition graphs is also application-specific.
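The definition above translates almost directly into a small data structure. The sketch below is our own modeling of a composition graph with per-edge costs; the cost figures in the example are made up, and response time in seconds is only one of the application-specific metrics just mentioned.

```python
# A minimal sketch of a service composition graph G = (V, E) with composition-edge costs.
from collections import defaultdict

class ServiceCompositionGraph:
    def __init__(self):
        self.services = set()           # V: Web services acquired to solve the query Q
        self.edges = defaultdict(dict)  # E: edges[u][v] = cost of the composition (u, v)

    def add_composition(self, u: str, v: str, usage_cost: float, delivery_cost: float):
        """Service u is used by service v; the edge cost is usage plus delivery."""
        self.services.update((u, v))
        self.edges[u][v] = usage_cost + delivery_cost

    def total_cost(self) -> float:
        """One possible application-specific total: the sum over all composition edges."""
        return sum(cost for targets in self.edges.values() for cost in targets.values())

# The plan of Figure 2, with invented costs (response time, in seconds).
g = ServiceCompositionGraph()
g.add_composition("ImageService@TX", "ComputingService@SJU", 10.0, 25.0)
g.add_composition("ImageService@NY", "ComputingService@SJU", 12.0, 30.0)
g.add_composition("MapService@SJU", "ComputingService@SJU", 2.0, 1.0)
g.add_composition("SoftwareService", "ComputingService@SJU", 1.0, 0.5)
g.add_composition("ComputingService@SJU", "Client@SJU", 40.0, 3.0)
```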
Figure 2: Service Composition Graph for Web Services. (Image data services at TX, NY, and MD, a map service at SJU, a software service, and computing services at several sites are composed to deliver the correlated results to the client at SJU.)
At first glance, it might appear that a service composition graph for a given query Q might be computed once at query time and used throughout the computation of the query. However, this approach will not be suitable for wide-area environments where data sources come and go, network speeds change depending on current traffic, and newly available computer cluster usage might be granted for just a limited amount of time. For example, in our sample scenario it might be the case that the New York database fails, so the system must switch to the data service at Maryland to get the remaining images. Thus, rather than building just one service composition graph G, the middleware system should monitor query execution and transform the current G into a new composition graph G′ whenever it is needed, as shown in Figure 3. This new graph will have services or composition edges not present in graph G, but which are now necessary to complete the execution of query Q. Likewise, unused services and composition edges are removed from G′. Clearly, for G′ to be useful it ought to be the best candidate replacement, meaning that it minimizes the total cost to solve the remainder of query Q. Notice that this process might occur several times during the execution of query Q. Therefore, to solve query Q we actually need a sequence of service composition graphs S = {G0, G1, ..., Gn}, where G0 is the initial service composition graph and Gn is the final composition graph. In this context, service composition graph Gi is produced by running an optimization algorithm to modify the services or compositions in Gi-1, for i ≥ 1. To generate Gi, this modification algorithm shall consider the current system status and the progress achieved so far to solve the query. To the best of our knowledge, no other work in the research literature has modeled query processing and service composition for database middleware in this fashion.
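A rough outline of this idea, under our own assumptions about the monitoring and optimization interfaces (they are passed in as parameters here, not GaiaNET components), is the control loop below, which produces the sequence G0, G1, ..., Gn while the query runs.

```python
# Sketch of adaptive re-optimization: the current composition graph is replaced
# whenever the monitor detects a relevant change (e.g. the NY data service fails).
def execute_with_reoptimization(query, build_initial_graph, optimize, monitor, run_step):
    graphs = [build_initial_graph(query)]          # G0
    while not monitor.query_complete():
        current = graphs[-1]
        if monitor.detects_change(current):
            # Produce Gi from Gi-1, given current system status and progress so far.
            new_graph = optimize(current, monitor.system_status(), monitor.progress())
            graphs.append(new_graph)
        run_step(graphs[-1])                       # advance the query on the current graph
    return graphs                                  # the sequence {G0, G1, ..., Gn}
```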
This proposal presents a five-year plan focused on the research necessary to fully develop this database middleware
and query processing framework, to be known as the GaiaNET database middleware system. We also plan to
collaborate with our partners from TCESS and CenSSIS to develop a realistic scenario that will help us deploy, test
and characterize the system with rigorous performance experiments. In order to realize the GaiaNET system, our research effort must overcome a series of barriers that block any attempt at a straightforward implementation of an efficient database middleware system for wide-area environments. These barriers are:
1. Barrier #1 – Poor understanding of an adaptive query model for wide-area system dynamics: Most query processing models assume a stable environment where system dynamics do not change during query execution. There are a few notable exceptions, such as Query Scrambling (REF) and Eddies (REF), which attempt to adapt to changes in the execution environment. However, these solutions mostly deal with reordering of operators. We need a general framework that takes into consideration opportunities to bring data from alternate sources (perhaps concurrently), send query computation to other sites, or partition the query computation by leveraging the redundancy of data and computing resources on a wide-area network. Quality of Service (QoS) and Quality of Data (QoD) guarantees must also be present in this framework.
2. Barrier #2 – Lack of a decentralized middleware architecture that can adapt to system conditions:
Currently, database middleware systems follow a centralized architecture. All configuration information and
relationships must be present in the catalog. Federations are built by hand, requiring interaction between system
administrators. The integration server becomes a single point of failure in the system. There is a need for a P2P
database middleware architecture that provides redundancy, more opportunities to find sites for processing, and
is based on Web services.
Figure 3: Reconfiguration of Service Composition Graph for Web Services. (The image service at NY has become unavailable, so the composition is reconfigured to obtain the remaining images through the services at MD, while the rest of the graph from Figure 2 is retained.)
3. Barrier #3 – Lack of a well-established framework for Web service compositions: Web services are still an evolving technology. Vendors are focusing on marketing, offering few concrete scenarios and virtually no documentation. At the time of this writing, the W3C Working Group on Web Services is just beginning a standardization process. Composition of services is a problem that is not well understood, and there are almost no reference implementations. Performance metrics needed to compare different composition schemes are yet to be identified. Algorithms to find a composition graph to solve a given query are not known.
4. Barrier #4 – Inadequate APIs to write applications that use Web service compositions: There is no language to specify the composition of Web services in a declarative fashion. Thus, right now compositions must be specified implicitly in the structure of the applications. WSDL (REF) is a language used to define a Web service, but it does not specify compositions. Moreover, most APIs only provide support to register services, and to specify XML messages with SOAP to exchange requests. How all these metadata elements get disseminated and managed efficiently throughout the system is still unclear.
5. Barrier #5 – Inadequate query execution engines for processing based on Web Service Composition: Typically, query optimizers and query execution engines are separate modules in a database middleware engine. While some systems perform query execution in a distributed fashion, optimization is centralized. This arrangement precludes a more adaptive operation that enables the system to monitor current execution performance and quickly adapt to changes by modifying the query plan with more efficient query sub-plans. There is little support for discovering new processing strategies once the query is running. As a result, the middleware cannot capitalize on new resources (e.g. computer clusters or replicated collections) that become available after query execution starts.
6. Barrier #6 – Limited understanding of how to build systems that feature some form of Self-Administration: Database middleware systems rely heavily on system administrators to define schemas, available data sources, available software for query processing, and policies for system usage. This scheme is not scalable for a wide-area environment. In (REF) we developed the concept of self-extensible middleware, which means that during query processing the middleware system is capable of shipping code needed for query processing to remote sites. This allows the system to dynamically extend its functionality and adapt to the needs of the query at hand. This framework must be extended to a P2P setting, and new features such as computing site discovery, data source discovery, dynamic federation formation, and application-code discovery should be developed.
7. Barrier #7 – Limited tools to automate or semi-automate the process of schema mapping: Schema mapping is a very difficult task, and there are few tools to help administrators build the infrastructure to simplify the generation of schema mapping rules via automation. Very often, developers must either hard-code schema mapping rules into the translators, or write stored procedures that perform the schema mapping during data extraction. Schema mapping ought to be automated as much as possible, and should follow a more declarative process.
8. Barrier #8 – Limited tools to protect federated systems and Web Services from attacks: Like most networked systems, database middleware systems can be vulnerable to attacks. Often, these systems delegate the role of system security to the network and operating system infrastructure that hosts each data source. Issues such as intrusion detection, and data security to prevent data misuse, are currently being explored, but relatively little technology transfer and adoption has occurred.
Our research plan will focus on dealing with barriers 1-6 because we find this is the niche that best fits our expertise and interests. There are several on-going projects (REFS) that are working on barrier #7. Likewise, the volume of research activity geared towards dealing with barrier #8 is impressive, to say the least. We shall keep track of advances in schema mapping and security to incorporate them into GaiaNET as they become available, and whenever possible, we will attempt to bring our own contributions to the system. But the focus of this research will be the development of a framework for Web service composition and query processing in wide-area environments, with emphasis on their realization in the GaiaNET middleware system.
GaiaNET System Architecture
We have designed GaiaNET to be an open source system tightly coupled with the Web. The rationale for this choice is to leverage the wealth of experience in deploying Web-based applications that has been acquired by many enterprises. By leveraging Web technology, which has proven to be scalable and very reliable, we can maximize the acceptance of GaiaNET, and reduce the cost, deployment time, and risk incurred by an enterprise willing to use it.
[Figure: The GaiaNET P2P environment, with clients and client workstations connecting to CSP, QSB, and DAP nodes that front databases and sensors, together with P2P Cubetree Caching and Storage Services.]
NICK???
Adaptive Replication Services
Bienvenido???
Smart Mirror Services
Pedro???
Sensor Management Services
Isidoro???
Related Work
A Distributed Database System is composed of a collection of Database Management Systems (DBMS), connected
together via a computer network, that agree to work together in a federation to manipulate their distributed data
collection and share their query processing capabilities. Among the most interesting prototypes built over the past
two decades we have R*[50], Distributed Ingres [44], ADMS+/- [40] and Mariposa [42, 45]. These systems
proposed to federate groups of autonomous DBMS located on different hosts, and with no central authority
imposing restrictions on the local operations at each site. In fact, these were early examples of Peer-to-Peer (P2P)
systems, which are now again gaining broader acceptance. In most Distributed Databases it was assumed that all sites were relational systems, and that they either ran the same DBMS or adhered to a common communications protocol. Distributed join processing [9, 24, 32, 48], transaction processing [35] and data caching [12] have been the major focus of research on these systems. The work in [20, 27] studied the structure of the DBMS
to be run at the federated sites, and argued that a client-server DBMS model was a more appropriate and efficient
arrangement than P2P. The work in [12] explores different strategies to cache frequently used data at the client
DBMS and reduce the latency incurred when data must be requested from the server DBMS running on a
mainframe.
The basic assumption in a Heterogeneous Database System is that the sites being federated have very different
characteristics in terms of data models, database servers, and query execution capabilities. Often, Heterogeneous
Database Systems are referred to as Database Middleware Systems since they are a middle layer of software that
interconnects the client applications with the data sources. Still, most of these systems are based on the assumption
that data sources reside in mainframes or enterprise servers. Typically, the server application used at the
middleware layer to service client requests is called the integration server. In its most basic form, the middleware
layer can be made out of database gateways, which are software modules that allow a single-site database server to
gain access to the data managed by another database server (possibly from a different vendor). The gateway gives
the local DBMS an access method mechanism to the data stored in the foreign database server, and this local DBMS
becomes an integration server (also called application server) for its clients. Database gateways are provided by
major database vendors such as Oracle [16], Informix [15] and Sybase [17].
By far, the most complex and interesting type of middleware system is the Mediator system, in which a server
application called the mediator acts as the integration server for the client applications. The mediator is specially
tailored for data translation [14], schema mapping [31], and distributed query processing [26, 38]. The mediator
provides services such as query parsing, query optimization, query execution and transaction processing. Moreover,
the mediator provides a common data model used to resolve the conflicts that might arise as result of the differences
in the schemas of the data sources. The mediator relies on wrappers to gain access to the information contained in
the data sources. The wrappers receive requests from the mediator to query data in the data sources, and then they
generate queries or procedure calls specific to the data source in order to fetch the data of interest. All the data
values retrieved from a data source are translated from the local schema into the schema specified by the mediator,
and then sent to the mediator, where they are further processed to produce the final result of the query. Some of
the better known mediator-type middleware systems are Pegasus [4], TSIMMIS [13], DISCO [46], METU [22],
Garlic [39] and MOCHA [38].
Agent Systems are based on the idea that applications can be built in terms of groups of intelligent agents that work
as a group to solve a given task. Agents exhibit intelligence, mobility, and autonomy to carry out tasks and make decisions on behalf of the users. Thus, agents can be used to monitor stocks, buy goods on-line, participate in auctions, and perform data integration operations. The precise definition of what an agent is (or should be) and how to implement it is somewhat of a controversial issue. Some researchers view agents as small programs [8, 28], others think of them in terms of logic semantics [5], while some researchers [41] dismiss them as a bad idea that isolates users from the experience of interacting with networked applications. The body of literature on Agent Systems is extensive, but it is mostly focused on topics more relevant to Artificial Intelligence than Database Systems, so we shall not go any further in this discussion of agents.
Sensor Networks [11, 23] are networks formed when a set of small, unattended sensor devices are deployed over a given area. The sensors form ad-hoc relationships among themselves to cooperate in sensing physical phenomena. Some of the target applications for sensor networks are surveillance of inhospitable terrain, monitoring and detection of forest fires, study of traffic patterns in metropolitan areas, and monitoring of equipment in manufacturing plants. Data Streams emerge in this context as an abstraction representing a continuous flow of information produced by the sensors, which can seldom be stored in raw form. Recent efforts have proposed mechanisms to efficiently optimize
[49], process [34], and aggregate [21] data emanating from Data Streams.
Broader Impacts
Personnel, Management Approach and Milestones
The project will be organized into four teams, each one led by a faculty member. Dr. Manuel Rodríguez-Martínez
will be the project PI and leader of the Distributed Data Management and Integration Team. Dr. Bienvenido Vélez-Rivera will be the leader of the Programmatic Interfaces and Visualization Team. The Parallel Processing and
Storage Team will be led by Dr. Pedro I. Rivera-Vega. Finally, Dr. Miguel Vélez-Reyes will be the leader of the
Image Processing Team. Each team will have at least one graduate student and one undergraduate student from our
programs in Electrical Engineering, Computer Engineering, and Computer and Information Science and Engineering
(CISE). In addition, we will provide two assistantships for Earth Science students to be integrated into the project.
We will have monthly group meetings, and one project review day per semester. During the UPRM Industrial
Affiliates Week we will have a NASA TerraScope Workshop to present project status to the public and bring NASA
personnel for assessment of the project. We will form a committee to manage the interaction with our TCESS and
CenSSIS partners, and incorporate their user perspective throughout the development of TerraScope. Administrative
operations will be managed by the Center for Computing Research and Development (CECORD). These activities
include equipment purchase, travel arrangements, and organization of the TerraScope Workshop. The UPRM Ph.D.
program in CISE will also contribute with the organization of seminars and student activities.
The following list presents a brief description of the expertise of the personnel associated with this project (for details, see the resumes at the end of this proposal):
• Dr. Manuel Rodriguez-Martinez, PI, Computer Science
Experience: Database Middleware Systems, Distributed Query Processing, and Computer Networks. Member of the ADM Group. Developer of the MOCHA System for the NASA ESIP Federation.
• Dr. Nick Roussopoulos, PI, Computer Science
Experience:
• Dr. Bienvenido Velez-Rivera, Co-PI, Computer Science
Experience: Information Retrieval, Distributed Systems, Cluster Computing, and Human-Computer Interaction. Developer of the Info-Radar system at MIT. Coordinator of Computer Science Programs at UPRM.
• Dr. Isidoro Couvertier, Co-PI, Electrical Engineering
• Dr. Pedro I. Rivera-Vega, Co-PI, Computer Science
Experience: Parallel Algorithms, Data Structures Analysis, High Performance Computing, and Computer Science Education. Coordinator of the Computer Science B.S. Program at UPR-Rio Piedras.
Prior NSF Support