SCC762 Distributed Database Systems Session 1 Basic Concepts of Distributed Database Systems 1.1 Study points 1.2 What is a distributed database system? 1.3 Advantages and disadvantages of DDBSs 1.4 Research and development issues 1.5 Overview of relational DBMSs 1.6 Overview of computer networks Session 1 - Basic Concepts of Distributed Database Systems 1-1 SCC762 Distributed Database Systems 1.1 Study points On the completion of this session you should be able to: 1.1 understand the concepts of distributed database systems (DDBSs) 1.2 understand the issues concerned in research and development of DDBSs 1.3 identify the various levels of transparencies provided by a distributed database management system (DDBMS) 1.4 have a basic knowledge on the relational data model and relational databases 1.5 be able to use the relational algebra to express data manipulation processes 1.6 be able to use SQL to express data definition and data manipulation processes 1.7 have a basic knowledge on computer networks and their impact on distributed database systems References: [VO]: Chapters 1, 2, and 3; [BG]: Chapters 1, 2, and 3. 1.2 What is a distributed database system? Distributed database systems can be defined as follows: Distributed database system (DDBS) = Databases + Computers + Computer Network + Distributed database management system (DDBMS) The database system manages data; the computer network makes the communication possible; and the DDBMS (distributed database management system) is the software that realises the mechanisms and policies for manipulating distributed data. A distributed database system can be simply defined as a collection of multiple logically interrelated databases distributed over a computer network and managed by a distributed database management system. A distributed database management system (DDBMS) is the software system that permits the management of the DDBS and makes the distribution transparent to users. From the architectural point of view, a distributed database system consists of a collection of sites, connected together via a communication network, in which ➢ ➢ 1-2 Each site is a database system site in its own right, but The sites have agreed to work together (if necessary), so that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user’s own site. Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Figure 1 depicts the architecture of a distributed database. Communication Network CS CS CS CS ... Server Client Server Client Client Site 1 Site 2 Site 3 Site N Figure 1: Architecture of a DDBS ➢ ➢ ➢ Client: the frontend of a DDBS where the access requests are issued. Server: the backend of a DDBS where the database is stored. Communication system (CS): it enables the communication between clients and servers. Figure 2 depicts the evolution of distributed databases. Program 1 Data Specification Program 1 File 1 Data Specification ... ... ... Database ... Data Specification Program N Data Specification File N Program N Traditional file processing: decentralised; unintegrated Database processing: centralised; integrated Site 1 Communication Network Site 2 Site N Site 3 Distributed database processing: distributed; integrated Figure 2. Evolution Session 1 - Basic Concepts of Distributed Database Systems 1-3 SCC762 Distributed Database Systems The first generation of data processing is decentralised and unintegrated where data are stored in individual files and specifications are embedded into the programs that manipulate the data. Files are therefore not shared, and any changes in the file structure will affect the data specifications in the programs. The second generation of data processing is centralised and integrated in which data are stored in a centralised database and data specification are stored in a centralised location, normally the same location as the database. The advantages of this model is that changes in database may only affect data specifications but not the programs. The third generation of data processing is distributed and integrated in which data and their local specifications are distributed in a network and there also exists a global view of all the data stored in the network. 1.3 Advantages and disadvantages of DDBSs Three issues have led us to the distributed database systems: ➢ Distributed system (hardware/software). Nowadays, most of our computer systems are connected by networks and therefore are distribtued. We are already working in such a distribtued environment and therefore a DDBS that is capable of coordinating the work of distributed recources is needed. ➢ Distributed application. Many of our applications are distributed in nature. For example, the banking system, the point of sales system, the order and inventory management system, etc. A DDBS that can support such applications is needed. ➢ Distributed data. Most of our data are distribtued in nature. Many organisations, such as Deakin University, have multiple locations. Therefore data are already distributed. A DDBS that is capable of manaing the distribtued data is needed. Advantages of a DDBS can be briefly summarised as transparent management of distributed and replicated data, reliability through distributed transactions, improved performance, and easier system expansion. They are outlined as follows: ➢ Transparency: One of the most important issue in distributed database is the maintenance of transparencies in a DDBMS. It refers to separation of the higher level semantics of a system from lower-level implementation issues. It mainly includes data independence, network transparancy, replication transparency, and fragmentation transparency. • 1-4 Data independence: it is a fundamental form of transparency within a single DBMS. It includes both logical data independence and physical data independence. Logical data independence: The ability to modify the conceptual database schema without having to change external schemas or application programs. The possible changes in the conceptual schema are adding a new record type or a new field in the current record, or deleting a record type from the database or a field from a record type. In the latter case, external schemas that refer only to the unchanging record types should be able to remain unchanged. Physical data independence: The ability to modify the physical database schema without causing application programs to be rewritten. Modifications in the physical schema level include changes to the length of fields and records, changes to indexes on data files, changes to the organisation of records in files, Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems ➢ and so forth. • Network (Distribution) Transparency: From the user point of view, you don’t know whether you access a single database or distributed database. It is classified into location transparency and naming transparency. Location transparency refers to the fact that data objects to be accessed without the knowledge of their location. Naming transparency means that a unique name is provided for each object in the database. • Replication transparency: it enables multiple instances of data objects to be used to increase reliability and performance without knowledge of the replicas by users or application programs. • Fragmentation transparency: logically related data may be fragmented and stored in multiple locations to increase the reliablity and performance. This information should be hidden from users and application programs. However, who should provide these transparencies? The data access layer of the network? The OS? Or the DBMS? Each method has its pros and cons and the DBMS should take the responsibility of providing transparencies related to data and its access. Improved performance: it is well-known that in general,80% of the information generated locally is consumed locally; only 20% locally generated information will be consumed by (accessed by or sent to) external entities. Therefore, have data stored locally will improve most local activities. ➢ Improved reliability / availability: reliability is the probability that if a system S is working at time t0, then S will continue to work in time interval (t0, t). Availability is the probability that S is operational at a given time t. When data are distributed, then the faulty of one part of the system will not cause the whole system to stop working. ➢ Economics: a parallel super-computer costs millions of dollars while the same amount of money can buy a distributed computer system (servers, PCs and LAN connections) with a much high computing power. ➢ Expandability: It is easier to expand in a distributed environmen by simply adding in more workstations and servers. ➢ Sharability: users in a distributed system can easily share their resources. There are also some disadvantages in using a distributed database system. Some of them are listed below: ➢ Lack of experience and standards: many people are used to centralised control and there are not many standards for distributed databases to follow. ➢ Complexity: in psychology there is a rule saying that as human bings our limit to understand the number of things happening in parallel is 7 ±2 . However, a distributed database system normally has hundreds and thousands of events happening simultaneously. It is very hard for human beings to fully understand such a complicated system. ➢ Cost: there must be some ininitial cost of shifting from a centralised system to a distributed system and of setting up and maintaining the computer network. ➢ Distribution of control: data equal to power and money in the sense of business. Some people are reluctant to lose their control of data. Session 1 - Basic Concepts of Distributed Database Systems 1-5 SCC762 Distributed Database Systems ➢ Security: It is more difficult to maintain the security in a distributed environment than to maintain the security in a centralised environment. ➢ Difficulty of change: it is hard to persude people to change from a familiar environment to a new environment even the new one provides more benefits. 1.4 Research and development issues We can list the following issues on the research and development of distributed database systems: ➢ Distributed database design: The two fundamental design issues are fragmentation, the separation of the database into partitions called fragments, and distribution, the optimum distribution of fragments. 1-6 ➢ Distributed query processing: Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations. The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way. The factor to be considered are the distribution of data, communication costs, and lack of sufficient locally available information. The objective is to optimise where the inherent parallelism is used to improve the performance of executing the transaction. ➢ Distributed directory management: A directory contains information about data items in the database. It may be global to the entire DDBS or local to each site; it can be centralized at one site or distributed over several sites; there can be a single copy or multiple copies. ➢ Distributed concurrency control: It involves the synchronisation of accesses to the distributed database, such that the integrity of the database is maintained. Two fundamental primitives are locking, which is based on the mutual exclusion of accesses to data items, and timestamping, where the transactions are executed in some order. ➢ Distributed deadlock management: The deadlock problem in DDBSs is similar in nature to that encountered in operating systems. The competition among users for access to a set of resources can result in a deadlock if the synchronisation mechanism is based on locking. The well-known alternatives of prevention, avoidance, and detection/recovery also apply to DDBSs. ➢ Reliability of DDBMSs: It is important that mechanisms be provided to ensure the consistency of the database as well as to detect failures and recover from them. ➢ Operating system support: In distributed environments there is the problem of having to deal with multiple layers of network software. The work is to provide both adequate support for distributed database operations and general operating system support for other applications. ➢ Heterogeneous databases: When there is no homogeneity among the databases at various sites, it becomes necessary to provide a translation machanism between database systems. This translation mechanism usually involves a canonical form to facilitate data translation, as well as program templates for translating data manipulation instructions. Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Figure 3 shows the relationships among these issues. Directory management Query processing Distributed database design Reliability Concurrency control Deadlock management Figure 3: Relationship among problem areas 1.5 Overview of relational DBMSs A DDBS consists of many DBSs and most of them are relational DBSs. Therefore most DDBSs are relational as well. The important issues to understand in relational databases are: ➢ The relational data model ➢ Normalisation ➢ Relational algebra ➢ the entity/relationship model ➢ The ANSI-SPARC 3-level architecture ➢ SQL 1.5.1 The relational data model We give the following term definitions: A domain D is a set of atomic values and is associated with a type. For example, INTEGER, STRING, DATE, are domains. A domain has a name, a type, and a format. For example, BIRTHDATE could be a domain's name, its type could be DATE, and its format could be dd/mm/yy. A relation schema R, denoted by R(A1, A2,..., An), is made up of a relation name R and a list of attributes A1, A2, ..., An. The domain of attribute Ai is denoted as dom(Ai), i=1, ..., n). A relation schema is used to describe a relation. R is called the name of the relation and n the degree of the relation. A relation r of the relation schema R(A1, A2, ..., An), denoted by r(R), is a set of m-tuples r = {t1, t2, ..., tm}. Each m-tuple t is an ordered list of n values t = < v1, v2, ..., vn>, where each value vi, 1 <= i <= n is an element of dom(Ai), or is a special null value. A relation schema is an abstract kind of object, and a relation (table) is a concrete picture Session 1 - Basic Concepts of Distributed Database Systems 1-7 SCC762 Distributed Database Systems (instance) of such an abstract object. The primary key of a relation is a unique identifier that can uniquely identify a tuple of that relation. Table 1 lists some formal terms and their informal equivalents. Table 1: Example data models Formal Terms Informal Equivalents relation table tuple row or record cardinality number of rows attribute column or field degree number of columns primary key unique identifier domain pool of legal values We can list the properties of a relation as follows: ➢ There are no duplicate tuples - so there is always a primary key. ➢ Tuples are unordered (top to bottom). ➢ Attributes are unordered (left to right). ➢ All (simple) attribute values are atomic. Or equivalently, relations do not contain repeating groups A relation satisfying this condition is said to be normalised. A relational database contains tables (relations), nothing but tables. 1.5.2 Normalisation Normalisation provides a method of representing data and their relationships precisely in a tabular format that makes the database easy to understand and operationally efficient. It is basically a formalisation of avoiding redundancy; of trying to achieve “one fact one place.” There are five normalisation levels (forms). Each form's rules place additional conditions on the database design. By satisfying the first set of rules, the data is said to be in the first normal form or 1NF. Moving on to the second set results in the second normal form (2NF), and so on up to the fifth normal form (5NF). Figure 4 depicts these normal form levels. 1-8 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Universe of relations (Normalised and unnormalised) 1NF relations 2NF relations 3NF relations 4NF relations 5NF relations Figure4. Levels of normal forms One additional normal form, called Boyce-Codd Normal Form (BCNF), can be defined between 3NF and 4NF. We summarise the conditions addressed by each normal form as follows: ➢ 1NF: Repeating groups. ➢ 2NF: A column’s dependency on only part of a composite key (partial dependency). ➢ 3NF: A nonkey column’s representing a fact about another nonkey column (transitive dependency). ➢ 4NF: Two or more independent, multivalued facts occurring for an entity. ➢ 5NF: Interdependent columns (symmetric constraint). A designer does not have to work through the 1NF and 2NF to get the 3NF. The rules leading to and including the 3NF can be summed up in a single statement: Each attribute must be a fact about the key, the whole key, and nothing but the key. Designers can use this statement to test whether their designs are in 3NF. For the majority of database designers, the 3NF is the extent of normalisation needed or desired. Two additional normal forms can improve the design in certain situations, however. The 4NF and 5NF apply only when the database includes one-to-many and many-to-many relations, and then only in some special situations. An example of normalisation Unnormalised relation: TRADESMAN Tradesman ID Tradesman Name Company ID Company Name Company Location Dependent Name Dependent Date of Birth 1-9 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Skill ID Skill Name Skill Where Attained Skill Level 1NF: remove the repeating groups, shown as in Figure 5. DEPENDENT TRADESMAN Primary Key Tradesman ID Tradesman Name Primary Key Tradesman ID Dependent Name Dependent Date of Birth Company ID Company Name Company Location SKILL Primary Key Tradesman ID Skill ID Skill Name Where Skill Attained Skill Level Figure 5. First normal form 2NF: remove partial dependency, shown as in Figure 6. SKILL is in 1NF but not in 2NF: Partial dependency SKILL Primary Key Tradesman ID Skill ID Skill Name Where Skill Attained Skill Level SKILL and TSKILL are now in 2NF: TSKILL Primary Key SKILL Primary Key Tradesman ID Skill ID Skill ID Skill Name Where Skill Attained Skill Level Figure 6. Second normal form 3NF: remove transitive dependency, shown as in Figure 7. 1-10 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems TRADESMAN is in 2NF but not in 3NF: Transitive dependency TRADESMAN Primary Key Tradesman ID Tradesman Name Company ID Company Name Company Location TRADESMAN and DEPARTMENT are in 3NF: TRADESMAN Primary Key DEPARTMENT Tradesman ID Tradesman Name Primary Key Company ID Company ID Company Name Company Location Figure 7. Third normal form 1.5.3 Relational algebra Relational algebra consists of a collection of high-level operators that operate on relations. Each such operator takes one or two relations as its input and produces a new relation as its output. The original eight operators defined by Codd (the pioneer of the relational model) are: ➢ The traditional (mathematical) set of operations: union, intersection,difference, and Cartesian product. ➢ The special relational operations: restrict, project, join, and divide. Figure 8 depicts how these operators work on tables. 1-11 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Restrict Project a1 a2 b1 b2 a1 a2 x1 x2 x1 x2 Product y1 y2 = a1 a2 y1 y2 b1 b2 x1 x2 c1 c2 b1 b2 y1 y2 c1 c2 x1 x2 c1 c2 y1 y2 Union Intersection a1 a2 b1 b2 c1 c2 a2 x1 (Natural) Join a3 x2 c2 x3 Difference a1 a2 = a1 a2 x1 b1 b2 c1 c2 x3 a1 c2 Divide a2 = a1 c2 Figure 8: The original eight operators The RESTRICT operation is called SELECT in relational languages. We use the Drinker database of Figure 9 to review some of the operations in relational databases. 1-12 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems FREQUENT SERVES LIKE (The Drinker frequently goes to the Bar) (The Bar serves the Beer) (The Drinker likes the Beer) Drinker Bar Bar Beer Drinker Beer Ullman Manuel’s Manuel’s Miller Lite Ullman Miller Lite Ullman Orchard Night Manuel’s Tiger Ullman Tiger Ullman Faculty Club Orchard Night VB Ullman Anchor Ullman Dynasty Manuel’s Qindao Jones Anchor Sam Manuel’s Faculty Club Tiger Sam Anchor Sam Orchard Night Faculty Club Miller Lite Smith Dynasty Dynasty Anchor Figure 9: Sample data of the Drinker database Restrict and Project: List the name of bars that Ullman frequently goes to. π bar ( σ drinker = ′Ullman′ ( Frequent ) ) Union: List names of bars that Sam and Smith frequently go to. π bar σ drinker = ′Sam′ ( Frequent ) ∪ σdrinker = ′Smith′( Frequent ) Or: π bar ( σ ( drinker = ′Sam′ ) ∨ drinker = ′Smith′ ( Frequent ) ) Intersection: List drinkers who like both “Miller Lite” and “VB” π drinker ( σ beer = ′Miller Lite′ ( Likes ) ) ∩ π (σ ( Likes ) ) drinker beer = ′VB′ But the following is wrong (no beer is named both Miller Lite and VB): π drinker ( σ beer = ′Miller Lite′ ∧ beer = ′VB′ ( Likes ) ) Difference: List bars who do not serve “Miller Lite”. 1-13 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems ➢ Step 1: all bars that serve “Miller Lite” B 2 = π bar ( σ beer = ′Miller Lite′ ( Serves ) ) ➢ Step 2: all bars B1 = π bar ( Frequent ) ∪ π bar ( Serves ) ➢ Step 3: the difference result: B 1 – B 2. Product: Frequent × Serves Join: List beers that Ullman drinks. ➢ Step 1: all Frequents tuples regarding Ullman σ drinker = ′Ullman′ ( Frequent ) ➢ Step 2: Link with bar σ drinker = ′Ullman′ ( Frequent )∞Serves ➢ Step 3: Project onto beer π beer ( σ drinker = ′Ullman′ ( Frequent )∞Serves ) Other types of joins: semi-join, θ−joins. 1.5.4 The entity/relationship model The Entity/Relationship (E/R) data model is most often used as a tool for communication between database designers and end users during the analysis phase of database development. The E/R model is used to construct a conceptual data model, which is a representation of the structure of a database that is independent of the software (such as the DBMS) that will be used to implement the database. An E/R model is usually expressed as an E/R diagram, which is a graphical representation of the E/R model. ➢ Entity: things which can be distinctly identified. A weak entity is an entity that is existence-dependent on some other entity. A regular entity is an entity that is not weak. 1-14 ➢ Property: entities have properties (or attributes). All entities of a given type have certain properties in common. Each property draws its values from a corresponding value set. Properties can be simple or composite, key, single- or multi-valued, missing, base or derived. ➢ Relationship: an association among entities. The entities involved in a given relationship are said to be the participants in that relationship. The number of participants in a given relationship is called the degree of that relationship. The participation can be total or partial. An E/R relationship can be one-to-one, oneto-many, or many-to-many. ➢ Supertype and Subtype: A supertype is a generic entity that is subdivided into Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems subtypes. For example, a TRADESMAN can be divided into ELECTRICIAN and PLUMBER subtypes. A subtype is then a subset of a supertype that shares common attributes or relationships distinct from other subsets. Entity subtypes behave in exactly the same way as any entity type. ➢ Generalisation and categorisation: Generalisation is the concept that some things (entities) are subtypes of other, more general things (entities). Categorisation is the opposite concept that an entity comes with various subtypes. Entity/Relationship diagrams constitute a technique for representing the logical structure of a database in a pictorial manner. ➢ Entities: each entity type is shown as a rectangle, labeled with the name of the entity type in question. For a weak entity, the rectangle is shown with double lines. ➢ Properties: are shown as ellipses, labeled with the name of the property in question and attached to the relevant entity by means of a straight line. The ellipse is dotted if the property is derived and double lined if the property is multivalued. If the property is composite, its component properties are shown as further ellipses, connected to the ellipse for the composite property in question by means of a straight line. Key properties are underlined. ➢ Relationships: each relationship is shown as a diamond, labeled with the name of the relationship in question. The participants in each relationship are connected by means of straight lines; each such line is associated with one of the symbols.. ➢ Subtypes and supertypes: Let Y be a subtype of X. Then we draw a straight line from Y to X, marked with a hook to represent the mathematical “subset of" operator. An entity/relationship diagram is an abstract database design. Such a design can be easily mapped into a relational database definition. Figure 10 shows an E/R model for a tradesman database design. TID CID Name Location COMPANY BELONGTO SID Name TRADESMAN Address HAS GrantedBy Figure 10. Example E/R model The E/R model is mapped into the following relations: COMPANY(CID, Name, Location) TRADESMAN(TID, Name, Address, CID) 1-15 Session 1 - Basic Concepts of Distributed Databases SKILL Name Level SCC762 Distributed Database Systems SKILL(SID, Name, Level) HAS(TID, SID, GrantedBy) 1.5.5 The ANSI/SPARC 3-level architecture Figure 11 shows the three levels of the DBMS architecture proposed by ANSI/SPARC Study Group on Data Base Management Systems. External Level External View 1 ... External View n External/Conceptual Mapping Conceptual Schema Conceptual Level Conceptual/Internal Mapping Internal Schema Internal Level Stored Database ... Figure 11. The three levels of the ANSI/SPARC architecture Broadly speaking, ➢ The internal level is the lowest level of abstraction at which one describes how the data is physically stored. It has an internal schema. ➢ The conceptual level is the next higher level of abstraction at which one describes what are actually stored in the database and the relationships that exist among the data. It has a conceptual schema. ➢ The external level is the level closest to the users at which one describes the entire database which is of interest to individual users. It has a number of external schemas or external views, one for each group of users. There are two levels of mapping: ➢ Conceptual/internal mapping: defines the correspondence between the conceptual view and the stored database; it specifies how conceptual records and fields are represented at the internal level. ➢ 1-16 External/conceptual mapping: defines the correspondence between a particular external view and the conceptual view. Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Figure 12 shows a greatly simplified view of a database system. DBMS Database Application Programs End Users Figure 12. Simplified picture of a database system 1.5.6 SQL Data definition A relation (table) can be defined by using the following statement: create table <relation_name> ( <attribute_name> <type> [<attribute_constraint>] {, <attribute_name> <type> [<attribute_constraint>]} {, [relation_constraint>]} ) ; The <relation_name> has to be unique for an individual user only. The <attribute_name> has to be unique within the relation. The <type> of an attribute is one of the following (in Oracle): Table 2: Common Oracle types 1-17 number a numeric field of maximum 40 digits number(m. n) decimal number of m digits, n to the right of the decimal point number(m) a numeric field of maximum m digits char(n) a character string of fixed length n Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Table 2: Common Oracle types number a numeric field of maximum 40 digits varchar(n) a variable length character string of maximum length n varchar2(n) as varchar(n) date valid dates Example: create table dept (deptid number(4) primary key, dname varchar(20) not null, location varchar2(20)); Data Manipulation SELECT UPDATE DELETE INSERT The general form of the SELECT statement: SELECT [DISTINCT] item(s) FROM table(s) [WHERE condition] [GROUP BY field(s)] [HAVING condition] [ORDER BY field(s)] ; Here are some examples of using SQL. Query: List the name of bars that Ullman frequently goes to. SQL: SELECT Bar FROM Frequent WHERE Drinker = ‘Ullman’; Query: List names of bars that Sam and Smith frequently go to. SQL: SELECT Bar 1-18 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems FROM Frequent WHERE Drinker = ‘Sam’ or Drinker = ‘Smith’; Query: List drinkers who like both “Miller Lite” and “VB” SQL: SELECT Drinker FROM Like WHERE Beer = ‘Miller Lite’ INTERSECT SELECT Drinker FROM Like WHERE Beer = ‘VB’; But the following is wrong (no beer is named both Miller Lite and VB): SELECT Drinker FROM Like WHERE Beer = ‘Miller Lite’ and Beer = ‘VB’; Query: List bars who do not serve “Miller Lite”. SQL: SELECT Bar FROM Frequent UNION SELECT Bar FROM Serves MINUS SELECT Bar FROM Serves WHERE Beer = ‘Miller Lite’; Query: List beers that Ullman drinks. SQL: SELECT Beer FROM Frequent, Serves WHERE Drinker = ‘Ullman’; 1.6 Overview of computer networks 1.6.1 A brief history of computer networks A computer network is a collection of computers interconnected through a data network, and the data network enables computers to exchange information among them. Computer networks form the basis of a distributed database system. Over the years, the public and 1-19 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems private data networks have evolved from packet-switching network in 10s and 100s of kilobits per second (kbps), to frame relay network operating at up to 2 mega-bits per second (Mbps), and now Asynchronous Transfer Mode (ATM) operating at 155 Mbps or more. Fast Ethernet has evolved from 10-Mbps to 100-Mbps and now Gigabit Ethernet. The Table 3 shows a brief networking history. Table 3: Milestones of data network development Year Milestones 1966 ARPA packet-switching experimentation 1969 First Arpanet nodes operational 1972 Distributed e-mail invented 1973 For non-U.S. computer linked to Arpanet 1975 Arpanet transitioned to Defence Communications Agency 1980 New host added every 20 days 1981 TCP/IP experimentation began 1983 TCP/IP switchover completed 1986 NSFnet backbone created 1990 Arpanet retired 1991 Gopher introduced 1991 WWW invented 1992 Mosaic introduced 1995 Internet backbone privatized 1.6.2 The ISO/OSI reference architecture The OSI (Open System Interconnection) Reference Model was developed by ISO (International Standards Organisation) as a model for implementing data communication between cooperating systems. It has seven layers. ➢ Application: providing end-user applications. 1-20 ➢ Presentation: translating the information to be exchanged into terms that are understood by the end systems. ➢ Session: for the establishment, maintenance, and termination of connections. ➢ Transport: for reliable transfer of data between end systems. ➢ Network: for routing the data to its destination and for network addressing. ➢ Data link: preparing data in frames for transfer over the link and error detection and correction in frames. Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems ➢ Physical: defining the electrical and mechanical standards and signaling requirements for establishing, maintaining, and terminating connections. Figure 13 depicts the OSI process and peer-to-peer communication, where SDU represent service data unit, Hi (i = 2, ...,7) headers of layer i, and T2 is the trailer for layer 2. User Data H7 Data H6 H5 H3 H2 SDU SDU H4 SDU SDU SDU User Peer-to-peer communication T2 Bits Application Application Presentation Presentation Session Session Transport Transport Network Network Network Data Link Data Link Data Link Physical Physical Physical Source node Intermediate node Destination node Figure 13.OSI process and peer-to-peer communication 1.6.2 Internet architecture Internet is the largest data network in the world. It is an interconnection of several packetswitched networks and has a layered architecture. Figure 14 shows the comparison of Internet and OSI architectures. 1-21 Session 1 - Basic Concepts of Distributed Databases SCC762 Distributed Database Systems Application Presentation Application Session Transport Transport Network Data Link Physical Internet Network Access Physical Figure 14. Comparison of Internet and OSI architectures Internet has the following layers: ➢ Network access layer: It relies on the data link and physical layer protocols of the appropriate network and no specific protocols are defined. ➢ Internet layer: The Internet Protocol (IP) defined for this layer is a simple connectionless datagram protocol. It offers no error recovery and any error packets are simply discarded. ➢ Transport layer: Two protocols are defined: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP is a connection-oriented protocol that permits the reliable transfer of data between the source and destination end users. UDP is a connectionless protocol that offers neither error recovery and nor flow control. ➢ User process layer (application): It describes the applications and technologies that are used to provide end-user services. Large computer networks such as the Internet are formed through the help of bridges and routers. Bridges and routers are intermediate systems that provide a communication path and perform the necessary relaying and routing functions so that data can be exchanged between devices attached to different subnetworks in the internet. ➢ Bridge: operates at layer 2 of the OSI architecture and acts as a relay of frames between like networks. ➢ Router: operates at layer 3 of the OSI architecture and route packets between potentially different networks. Both bridge and router assume the same upper-layer protocols are in use. 1-22 Session 1 - Basic Concepts of Distributed Databases