Data Models

Avi Silberschatz, Henry F. Korth, S. Sudarshan

1 Introduction

Underlying the structure of a database is a data model. A data model is a collection of conceptual tools for describing the real-world entities to be modeled in the database and the relationships among these entities. Data models differ in the primitives available for describing data and in the amount of semantic detail that can be expressed. The various data models that have been proposed fall into three different groups: object-based logical models, record-based logical models, and physical data models. Physical data models are used to describe data at the lowest level. Physical data models capture aspects of database system implementation that are not covered in this article. Thus, our focus here is on the object-based and record-based logical models.

Recently, a new model, the object-relational model, has been developed. It merges the object-oriented data model with the dominant record-based model, the relational model. We discuss this model briefly at the end of this article.

Further details on data models appear in database texts, including [Silberschatz et al. 1996] and [Ullman 1988].

2 Object-Based Logical Models

The object-based models use the concepts of entities or objects and relationships among them rather than the implementation-based concepts, such as records, used in the record-based models. Object-based logical models provide flexible structuring capabilities and allow data constraints to be specified explicitly. Below, we present descriptions of the two most widely used representatives of these models: the Entity-Relationship model and the Object-Oriented model.

2.1 The Entity-Relationship Model

The Entity-Relationship (E-R) data model is one of several semantic data models; that is, it attempts to represent the meaning of the data. The E-R model employs three basic concepts: entity sets, relationship sets, and attributes.
An entity is an "object" in the real world that is distinguishable from all other objects. An entity set is a set of entities of the same type that share the same properties (or attributes). Attributes are descriptive properties possessed by all members of an entity set. Each entity has its own value for each attribute. A set of attributes that suffices to distinguish all entities in an entity set is called a primary key. A relationship is an association among several entities.

Extended E-R features include specialization, generalization, higher- and lower-level entity sets, attribute inheritance, and aggregation. An explanation of these features is beyond the scope of this article. Further discussion of the E-R model appears in [Chen 1976], which introduced the E-R model.

2.2 Object-Oriented Model

The object-oriented data model is an adaptation of the object-oriented programming language paradigm to database systems. The model is based on the concept of encapsulating data, and code that operates on that data, in an object. Entities, in the sense of the E-R model, are represented as objects with attribute values represented by instance variables within the object. The value stored in an instance variable is itself an object. Thus, a containment relationship, the is-part-of relationship, is established among objects. An advantage of the containment concept is the ability for objects to be shared among several containing objects.

An object may send a message to another object, causing that object to execute a method in response. Methods are procedures, written in a general-purpose programming language, which manipulate the object's local instance variables and may send messages to other objects. This encapsulation of code and data has proven useful in developing modular systems.

Objects that contain the same types of values and the same methods are grouped together into classes. A class may be viewed as a type definition for objects.
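
The encapsulation, containment, and message-passing ideas described above can be sketched in a few lines of Python. This is only an illustration: the class names (`Address`, `Employee`), their methods, and the sample values are hypothetical and do not come from any particular object-oriented database system.

```python
class Address:
    """A class: a type definition for objects holding a city."""

    def __init__(self, city):
        self.city = city  # instance variable; its value is itself an object

    def relocate(self, new_city):
        # Method that manipulates the object's local instance variable.
        self.city = new_city


class Employee:
    """An entity represented as an object with instance variables."""

    def __init__(self, name, address):
        self.name = name
        self.address = address  # containment: an Address is part of an Employee

    def move_to(self, new_city):
        # "Sending a message" to the contained object: it runs its own method.
        self.address.relocate(new_city)

    def describe(self):
        return f"{self.name} lives in {self.address.city}"


# One Address object shared by two containing objects (hypothetical data).
home = Address("Murray Hill")
alice = Employee("Alice", home)
bob = Employee("Bob", home)

alice.move_to("Austin")   # the change is visible through both containers
print(bob.describe())
```

Because `alice` and `bob` hold the same `Address` object rather than copies, an update made through one container is seen through the other, which is the object-sharing advantage of containment mentioned above.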
This combination of data and code into a type definition is similar to the programming language concept of abstract data types. Classes are organized into an inheritance hierarchy: each class inherits attributes and methods from the classes that are above it in the hierarchy. This hierarchical structure facilitates code sharing among classes. Taking full advantage of both the code-sharing and object-sharing features is an important aspect of object-oriented data modeling.

Object-oriented data models for databases extend the above-mentioned data modeling features of the object-oriented paradigm. The extensions include data integrity constraints, persistence of data (which allows transient data to be distinguished from persistent data), and support for collections.

There are two approaches to creating an object-oriented database language:

1. Extending existing database languages with concepts from the object-oriented paradigm.
2. Extending existing object-oriented programming languages to deal with databases by adding concepts such as persistence and collections.

For further discussion of the object-oriented model, see [Kim 1990].

3 Record-Based Logical Models

Record-based models are so named because the database is structured in fixed-format records of several types. Each record type defines a fixed number of fields, or attributes, and each field is usually of a fixed length. The use of fixed-length records simplifies the physical-level implementation of the database.

The relational model has established itself as the primary data model for commercial data processing applications. The first database systems were based on either the network model or the hierarchical model, both of which are tied more closely to the underlying implementation of the database, and are now decreasing in importance and real-world use.

3.1 The Relational Model

The power of the relational data model lies in its rigorous mathematical foundations and a simple user-level paradigm.
Mathematically speaking, a relation is a subset of the cartesian product of an ordered list of domains. For example, let $E$ be the set of all employee identification numbers, $D$ the set of all department names, and $S$ the set of all salaries. An employment relation is a set of 3-tuples $(e, d, s)$ where $e \in E$, $d \in D$, and $s \in S$. A tuple $(e, d, s)$ represents the fact that employee $e$ works in department $d$ and earns salary $s$.

At the user level, we represent a relation as a table. This table has one column for each domain and one row for each tuple. Each column has a name, which serves as a column header, and is called an attribute of the relation. The set of attributes for a relation is called the relation schema.

The process of designing a relational database involves the selection of a set of relation schemas. An initial set of schemas can be generated from an E-R database design by using a relation to represent each entity set and relationship set. There are often many possible choices that the database designer might make. To illustrate these choices, consider a database of employees, departments, and managers. Assume that a department has only one manager. If we use a single schema (employee, department, manager), then we must repeat the manager of a department once for each employee. We can avoid this redundancy by using two schemas (employee, manager) and (manager, department). However, if a particular manager manages two departments, we cannot represent a situation where an employee works in only one of these two departments. If, instead, we choose the two schemas (employee, department) and (manager, department), we avoid this difficulty and, at the same time, avoid redundancy. The theory of normalization helps in the choice of relation schemas.

There are several languages for expressing operations on relational databases.
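
As a small illustration of the definition of a relation and of the schema-design discussion above, the following Python sketch models relations as sets of tuples. All identifiers and sample values (`e1`, `sales`, `m1`, and so on) are hypothetical, and the joins are written as plain set comprehensions rather than in any database language.

```python
from itertools import product

# Hypothetical domains: employee ids, department names, salaries.
E = {"e1", "e2", "e3"}
D = {"sales", "research"}
S = {50000, 60000}

# An employment relation is a subset of the cartesian product E x D x S.
employment = {("e1", "sales", 50000), ("e2", "sales", 60000),
              ("e3", "research", 60000)}
is_relation = employment <= set(product(E, D, S))

# Single schema (employee, department, manager): manager m1 is repeated
# once for each employee of the sales department.
single = {("e1", "sales", "m1"), ("e2", "sales", "m1"),
          ("e3", "research", "m2")}
m1_rows_single = sum(1 for (_, _, m) in single if m == "m1")

# Schemas (employee, department) and (manager, department).
works_in = {(e, d) for (e, d, m) in single}
manages = {(m, d) for (e, d, m) in single}
m1_rows_split = sum(1 for (m, _) in manages if m == "m1")

# Joining on department recovers exactly the original facts.
rejoined = {(e, d, m) for (e, d) in works_in
            for (m, d2) in manages if d == d2}

# Schemas (employee, manager) and (manager, department), with one manager
# running two departments: the join on manager adds facts that are false.
facts = {("e1", "sales", "m1"), ("e2", "marketing", "m1")}
emp_mgr = {(e, m) for (e, d, m) in facts}
mgr_dept = {(m, d) for (e, d, m) in facts}
bad_join = {(e, d, m) for (e, m) in emp_mgr
            for (m2, d) in mgr_dept if m == m2}
spurious = bad_join - facts

print(is_relation, m1_rows_single, m1_rows_split, rejoined == single)
print(sorted(spurious))
```

In this sketch the (employee, department) and (manager, department) pair stores each manager once per department and still reconstructs the original tuples, while the (employee, manager) and (manager, department) pair can no longer tell which of the manager's two departments each employee works in.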
In all these languages, the expressions and/or operations are over relations, and their results are also relations. This allows queries to be constructed modularly from sub-queries, and allows for automated query optimization.

The relational calculus is a nonprocedural language, based on mathematical logic, that defines the basic power required in a relational query language. The relational algebra is a procedural language that is equivalent in power to the relational calculus, and defines the basic operations used within relational query languages. Commercial database systems use languages with more "syntactic sugar." The three most influential commercial languages are SQL, QBE, and Quel. Of these three, SQL has clearly established itself as the standard relational database language, represented by the SQL-92 standard. Further versions of the SQL standard are under development.

Further discussion of the relational model can be found in the seminal paper by [Codd 1970], which introduced the relational model. Formal aspects of the relational model are presented in detail in [Maier 1983].

3.2 The Network and Hierarchical Models

The network data model is an abstraction of the design concepts used in the implementation of databases. As a result, the model is tied more closely to physical-level design than is the relational model. In the network model, data items are represented by collections of records, and relationships among data are represented by links, which correspond to pointers at the physical level. The hierarchical model is similar to the network model except that links in the hierarchical model must form a tree structure, while the network model allows arbitrary graphs.

4 Object-Relational Data Models

Object-relational data models are hybrids of the object-oriented and the relational data models. They extend the relational data model by providing an extended type system and object-oriented concepts such as object identity.
The extended type systems allow complex types including nonatomic values such as nested relations, and inheritance at the level of attribute domains as well as at the level of relations. Such extensions attempt to preserve the relational foundations, while extending the modeling power.

There is a trend towards the amalgamation of features of the relational and object-oriented models. The SQL-3 standard currently under development includes object-oriented features within the framework of an extended version of the current relational SQL standard. Market-leading relational database products are adding object-oriented features so as to compete with object-oriented and object-relational database products. Future database systems can be expected to offer the high level of abstraction of object-orientation along with the relative efficiency and uniformity of the relational model.

Bibliography

[Chen 1976] P. P. Chen, "The Entity-Relationship Model: Toward a Unified View of Data," ACM Transactions on Database Systems, Volume 1, Number 1 (March 1976), pages 9-36.

[Codd 1970] E. F. Codd, "A Relational Model for Large Shared Data Banks," Communications of the ACM, Volume 13, Number 6 (June 1970), pages 377-387.

[Kim 1990] W. Kim, Introduction to Object-Oriented Databases, MIT Press, Cambridge, MA (1990).

[Maier 1983] D. Maier, The Theory of Relational Databases, Computer Science Press, Rockville, MD (1983).

[Silberschatz et al. 1996] A. Silberschatz, H. F. Korth, and S. Sudarshan, Database System Concepts, Third Edition, McGraw Hill, New York, NY (1996).

[Ullman 1988] J. D. Ullman, Principles of Database and Knowledge-base Systems, Volume I, Computer Science Press, Rockville, MD (1988).