Ashish Sharma CS-257 ID:118 Databases are created independently, even if they later need to work together. The use of databases evolves, so we can not design a database to support every possible future use. We will understand Information integration from an example of University Database. Earlier we had different databases for different functions like; Registrar Database for keeping data about courses and student grades for generating transcripts. Bursar Database for keeping data about the tuition payments by students. Human Resource Department Database for recording employees including those students with teaching assistantship jobs. Applications were build using these databases like generation of payroll checks, calculation of taxes and social security payments to government. But these databases independently were of no use as a change in 1 database would not reflect in the other database which had to be performed manually. For e.g. we want to make sure that Registrar does not record grades of the student who did not pay the fees at Bursars office. Building a whole new database for the system again is a very expensive and time consuming process. In addition to paying for a very expensive software the University will have to run both the old and the new databases together for a long time to see that the new system works properly or not. A Solution for this is to build a layer of abstraction, called middleware, on top of all legacy databases, without disturbing the original databases. Now we can query this middleware layer to retrieve or update data. Often this layer is defined by a collection of classes and queried in an Object oriented language. New applications can be written to access this layer for data, while the legacy applications continue to run using the legacy database. When we try to connect information sources that were developed independently, we invariably find that sources differ in many ways. Such sources are called Heterogeneous, and the problem of integrating them is referred to as the Heterogeneity Problem. There are different levels of heterogeneity viz. 1. 2. 3. 4. 5. 6. Communication Heterogeneity. Query-Language Heterogeneity. Schema Heterogeneity. Data type differences. Value Heterogeneity. Semantic Heterogeneity. Today, it is common to allow access to your information using HTTP protocols. However, some dealers may not make their databases available on net, but instead accept remote accesses via anonymous FTP. Suppose there are 1000 dealers of Aardvark Automobile Co. out of which 900 use HTTP while the remaining 100 use FTP, so there might be problems of communication between the dealers databases. The manner in which we query or modify a dealer’s database may vary. For e.g. Some of the dealers may have different versions of database like some might use relational database some might not have relational database, or some of the dealers might be using SQL, some might be using Excel spreadsheets or some other database. Even assuming that the dealers use a relational DBMS supporting SQL as the query language there might be still some heterogeneity at the highest level like schemas can differ. For e.g. one dealer might store cars in a single relation while the other dealer might use a schema in which options are separated out into a second relation. Serial Numbers might be represented by a character strings of varying length at one source and fixed length at another. The fixed lengths could differ, and some sources might use integers rather than character strings. The same concept might be represented by different constants at different sources. The color Black might be represented by an integer code at one source, the string BLACK at another, and the code BL at a third. Terms might be given different interpretations at different sources. One dealer might include trucks in Cars relation, while the another puts only automobile data in Cars relation. One dealer might distinguish station wagons from the minivans, while another doesn’t.