D@TA NERD
November 2008

Normalisation, ERD Modelling, Data Analysis and more!
PLUS: Crosswords, Puzzles, Spot the Difference and more!

Cartoon of the month: WINNER!
Got a funny picture? Send us a copy! The funniest one each month will win a prize!

DATA-NERD Issue 1 November 2008

CONTENTS

COVER 1, 2, 3, 4 by the Artist Eimear Duffy

Contents & Desk Top Publishing by Patrick Crowe

A Cradle for our Creativity – Ian Reston. In this article Ian provides a practical solution to help the publishing staff understand what the hell is going on with the creative department!

HOW DATA ANALYSIS CAN HELP YOU TO IMPROVE YOUR SEXUAL LIFE – Alfredo del Campo, our in-house Latin Lover, solves all your problems... Read it if you need IT!

Database Design: Only for fellows with Mercs? – Gary Gallagher. From the building site to the gas guzzler, or Carroll to Chen. Gary stresses that design leads to function.

Let’s get Physical – Denis Farrell examines the logical and the physical, the Brain or the Brawn!

INTRODUCTION TO ERD MODELLING – Fatih Degirmenci. No better place to start than with the model man.

Mrs. Peacock In The Library... – By Gene Kelly. In this Study, Detective Kelly TERD will lay the Cluedo plot.

Introduction to Normalisation – Kevin Mallon. Here we are introduced to the normal: A FIRST FOR MANY OF US, AND NORMAL IS SUBJECTIVE.

IS NORMAL BETTER? – SAM Senior. Mr SQL visits Dr Database to find out if he's normal, and looks at the Yin and Yang of life as a database.

ERD and Distributed Databases – Tanya Polianinova. What’s the point in having a good ERD if you don’t spread it around?

Against all odds: A Puzzle … – Padraic Lavin. Wow, now I am confused.

Speed Test – Patrick Crowe sends the theory around the lap a couple of times.
A Cradle for our Creativity
By Ian Retson

In this first Issue of our Data-Attack Magazine we thought: what better way to relate our readers to the subject than to describe in outline our very own in-house bespoke Cradle database? This is the key part of our information system that allows us to focus on bringing you interesting creative articles like this one and to spend less time worrying about the mechanics required to produce them.

“Genius is one percent inspiration and ninety-nine percent perspiration” [1]

The Cradle is at the core of our steady state organization, driving our business in the creation, collection and communication of information aimed at you, the Database NERD, and the wannabee NERD community. There are separate specialist Publishing and Distribution systems that were purchased as off-the-shelf packages. This allowed us to concentrate on our key information system.

The “hands that rock the cradle” [2], or stakeholders, were identified initially within the Inception Phase; this provides us with a top-down external view of the system and helps us establish boundaries:

• Our NERD customer (YOU) demands informative, varied-format and fun articles that also communicate the latest trends within the world of databases.
• A free cut-down on-line version of each Magazine Issue is also made available and is used as a vehicle for registration of extra keen SUPER-NERDS.
• The ACCOUNTANTS (NON-NERD) require that we are cost effective.
• The EDITORIAL (UBER-NERD) staff require that articles are available for review, to meet editorial and final production deadlines.
• The JOURNALIST (NERD-SYMPATHISER) requires a repository where they can lodge their articles and have access to a library of previous contributions from internal and external sources.
• The NERD in turn is encouraged to provide feedback including contributions (NERD-SYMPATHISER-NERD).
During the Elaboration Phase the following details were established, providing a bottom-up view of the system; note the nouns and verbs:

A Magazine is issued on a regular basis, made up of Articles approved by the Editor. SubEditors are responsible for individual departments, e.g. News, Puzzles, Feedback, etc. An Issue may be categorized as regular, special re-issue or on-line version. An Article is created from one or more Items contributed by our in-house and external Agency Journalists. An Item is designated a media type, which currently distinguishes between photograph, illustration and text, but there may be more in the future. At the moment only one magazine is produced, but market conditions permitting we hope to expand into the OO Modeling world and on to infinity. Our Subscribers are both individuals and retail shops. Subscribers are encouraged to contribute articles.

Note that we didn’t leave our data experts perspiring in the basement; we embraced them as an integral part of the ongoing analysis & design, and so we avoided the mistake where “The database team often works on its own without open doors of communication.” [3]

“The foundation of modern database technology is without question the relational model; it is that foundation that makes the field a science”. [4]

“Design Engineering should always begin with a consideration of data; the foundation for all other elements of the design”. [5]

Some interesting nerdy points from the Cradle ERD:

• The description of the stakeholders provides us with insight into the boundaries and scope of the system. The Publishing, Distribution and Accountancy packages are outside the scope of the Cradle System; however, the entities Article, Subscriber and Staff respectively indicate the genesis of data interfaces between the systems.
• Note the correlation between the nouns in the business description and the entity names in the ERD. The verbs would normally provide us with the associations or relationships between the entities, but here they can be spotted as Foreign Key attributes. Can you add the association roles to the ERD?
• The main high-volume transactional tables are Item followed by Article, which act as the main system repository; both have numeric primary key constituents for efficient processing. The remaining tables are more like Master control tables, concerned with categorizing & grouping the transactions.
• An Article may consist of one or more Items. This promotes parallel activity, allowing items to be contributed outside of Issue and Article deadlines; the concept also supports the efficient re-use of items across multiple articles over time.
• The cancelled attribute in Article provides us with the capability of stopping an article being added to an Issue after it has been approved by the Editor. This allows us to resurrect the article for future issues and avoids a messy deletion option. How would a deletion option work? What would be its consequences?
• Editor and Journalist are shown as separate entities since their roles are quite distinct within the system, i.e. an Editor controls Articles and Issues whereas Journalists contribute Items, but both are subtypes of the Staff entity. Note that a Journalist may be external and therefore is not a {complete} subtype [This is a discussion for our sister Magazine Object-Attack!!!!].
• Further normalization can be achieved, as you may have spotted: Address information is present in the Agency, Staff and Subscriber entities. How would you rationalize this into the Diagram?

“He who asks a question is a fool for five minutes; he who does not ask a question is a fool forever.” [6]

Answers in next Issue when again more

HOW DATA ANALYSIS CAN HELP YOU TO IMPROVE YOUR SEXUAL LIFE

Greetings, my dear reader!
Now that I’ve got your attention we can move on to the fascinating world of Data Analysis. Right now, you must be wondering: “And what on earth does Data Analysis have to do with my sexual life?” Fair enough; keep reading this article and you will find out for yourself.

First of all, let’s give a definition of Data Analysis: it “is a process of gathering, modelling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.” – Wikipedia.

In this article, we will cover what role Data Analysis plays in the design of a project; the next step will be to talk about how we can collect data and the various techniques to do so. Following that point, if you are still with me, we will have an overview of both quantitative and qualitative data and, most important of all, we will discover the links between your sexual life and Data Analysis.

Project phases design

Data Analysis is one of multiple steps, but no less important for that, belonging to the complex process of Engineering Methodology. These are the different phases we follow that comprise the design of a project/product: Analysis, Design, Standards and Support. Data Analysis is in the first phase; its input will be the results of Data Gathering and its output will be the input for the Conceptual Model and Usability Requirements.

Having said that, everybody agrees that Data Analysis is a useful activity, but in the real world we find a surprisingly common case where collected data is stored but never analysed.

Data Gathering/Data Collection Techniques

This is the very initial phase of the design.
Following are the most common techniques that are adopted to gather data: User Interviews, Contextual Enquiry, Personas / Scenarios, Direct Interview, Indirect Interviews.

Quantitative data analysis

A brief definition of quantitative research is a measure of how many actors (these can be humans, or anything that interacts with the system under study) act in a particular way. The collection of data tends to include a large amount of information, e.g. the minimum number of interviews should be 50. Questionnaires are the most common tool used for this purpose, normally with closed questions.

Quantitative data analysis strategy:

For describing the participants, we can use the typical descriptive statistics, such as frequency counts, proportions, measures of central tendency (mean, median, mode), and measures of dispersion (standard deviation, inter-quartile range, etc.). For relationships or associations, we can rely on measures of association and correlation. For comparative studies, we have several techniques to work with, e.g. Student's t-test, the Mann-Whitney U test, the paired t-test, Analysis of Variance, …

Here I recommend some useful software programs for analysing quantitative data:

• Epi-info: Covers most of the statistical analyses.
• Minitab: Covers all the basic statistical analyses.
• SPSS: Statistical package.

Analysing qualitative research

Some useful software that can help:

• NVIVO: Accumulates data, assigns codes to data and analyses this encoded data numerically.
• Ethnograph: A similar program to NVIVO.

Note for the reader: honestly, did you really believe that the Analysis of Data could improve the sexual life of anybody?.... Got ya!

So, what is qualitative research?
In market research, it is used to help the observer understand people's motives, how they feel and why. For this purpose, the researcher asks questions such as “why do you...?” to collect detailed information. Compared to quantitative methods, where the accumulated data is much larger, samples tend to be smaller.

Here we have the most common methods of Data Analysis in Qualitative Research, compiled by Donald Ratcliff: Typology, Taxonomy, Constant Comparison/Grounded Theory, Analytic Induction, Logical Analysis/Matrix Analysis, Quasi-statistics, Event Analysis/Microanalysis, Metaphorical Analysis, Domain Analysis, Hermeneutical Analysis, Discourse Analysis, Semiotics, Content Analysis, Phenomenology/Heuristic Analysis, Narrative Analysis.

DATA PRESENTATION

Information processed from a sample can be presented in many ways. Rather than just giving plain numbers about the central tendency and dispersion, we should look for friendlier ways of presenting data, such as graphs or charts (e.g. frequency polygon, histogram, bar/pie), so that people can see the result of the research more clearly.

Database Design: Only for fellows with Mercs?
GARY GALLAGHER

Liam Carroll is currently one of Ireland’s leading property developers. His company, Zoe Developers, have built more apartments in Dublin’s inner city than all other builders combined. Carroll is heavily involved in the high profile Dublin docklands re-development project and is responsible for the Cherrywood Innovation and Technology Business Park site in Loughlinstown, which houses corporate giants such as Dell and Friends First. Carroll’s current standing, however, is a far cry from his initial forays into property development.
Examples of early efforts in 1989 include Fisherman’s Wharf, “a humdrum scheme of townhouses and apartment blocks”, and Portobello Harbour, described as having “no design or functional integrity” (narrow, lego-like constructions with one room on each floor). All of these early developments share one startling characteristic: Carroll did not employ architects for their design. Architects, Carroll claimed, were “only interested in designing penthouses for fellows with Mercs”. It was only with the introduction of Government apartment design guidelines in 1995, coupled with the prospect of more complex development schemes, that Carroll finally decided to engage architects to plan and design properties correctly. This move paid off, catapulting Carroll from his status as ‘the shoebox apartment king’ to a respected and successful developer responsible for some of the country’s largest residential and commercial developments.

As you worryingly re-examine the title of this magazine, possibly thinking that you may have picked up the wrong one in the shop, let me re-assure you that the above anecdote does hold some connection to Database Design. It is based on a widely accepted saying among database workers that building a database without a design is akin to building a house without an architect’s blueprint. Before elaborating, let us first examine what we are talking about when discussing Database Design.

Database design is also referred to as database modelling; however, it has nothing to do with women, catwalks or lingerie (sorry lads). Fear not though, as some similarities do exist for the more imaginative of us. Data modelling is essentially a method of organising data so that it can be used effectively by databases. It is concerned with structuring data in a way that is presentable and places it in nice neat packages for processing by the database. It is the first, and some would argue most important, step in creating a database.
Webopedia defines data modelling as “the analysis of data entities and their relationships to other data entities”. An entity, in this case, is any object about which we wish to store information in the database. Entities are items in the real world that are capable of existing independently. To illustrate this, think about a computer vendor’s database. Here, in simple terms, you would need to store information about the vendor’s clients (cust. ID, name, address, tel. #) and about the products that it sells (model number, spec, price, availability). The entities here are therefore ‘client’ and ‘product’.

Now that you have some idea what it is, you may ask why it’s important enough for us to waste our time and your money publishing a whole magazine about Database Design. Fair question; let’s try to demonstrate why it is so useful (obviously the sight of the Portobello Harbour shanties isn’t enough for you) by looking at another example, this time loosely based around the current scramble to become GWB’s successor at the White House.

In the aftermath of such an election, in-depth analysis would be carried out on various aspects of the election. For example, knowing the number of people that have voted for the various political parties would be invaluable. This could be achieved by including a column in the database, from the very beginning, recording which party each person voted for. If, however, this column was omitted at the beginning, it would be very time consuming to collate the relevant data to get the same result. It is at the database design stage that the decision to include such a column would be made.

The importance of the design stage is equally apparent with even the most basic of databases. You could say that if a Formula 1 racing car doesn’t have smooth aerodynamics, it will drag and go slower. Equally, if a database doesn’t adhere to best practices, it won’t perform as efficiently as possible.
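The election example above can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration under stated assumptions: the table name vote, the columns voter_id and party, and the party names are all hypothetical and do not come from any real system.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The design decision: a 'party' column is included from the very beginning.
# Table and column names are hypothetical, chosen for illustration only.
conn.execute(
    "CREATE TABLE vote (voter_id INTEGER PRIMARY KEY, party TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO vote (voter_id, party) VALUES (?, ?)",
    [(1, "Party A"), (2, "Party B"), (3, "Party A"),
     (4, "Party A"), (5, "Party B")],
)

# Because the column exists, counting votes per party is a single query.
votes_per_party = dict(
    conn.execute("SELECT party, COUNT(*) FROM vote GROUP BY party")
)
print(votes_per_party)
```

Had the party column been left out of the design, no query could recover these totals; the data would have to be collated by hand from some other source.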
There are several methodologies used for creating the ‘perfect’ database. In this edition we focus on what are widely regarded as the two most effective techniques: the usage of Entity Relationship Diagrams (ERDs) to assist in matching the business needs of the database to the physical design; and a process of safeguarding the database from structural problems, known as Database Normalisation.

An ERD is essentially a graphic representation of the entities, and the relationships between the entities, within a database. Although initially introduced in the 1960s by a General Electric engineer, the development of ERDs is credited to the American scientist Professor Peter Chen. Chen’s original ERD paper was selected as one of the 38 most influential papers in Computer Science, resulting in his ERD approach being ranked as one of the top methodologies in systems development by several surveys of FORTUNE 500 companies. Yes folks, it works.

While an ERD is mainly concerned with the relationships between the entities of a database, the goal of database normalisation is to reduce the amount of space a database consumes by eliminating unnecessary duplication of data, thus increasing overall performance. Although often previously overlooked as a complicated process for academic geniuses, it is now accepted that a grasp of the principles of normalisation can drastically improve database performance.

These methodologies will be explained in more detail as you read on, where their importance will hopefully become even more apparent. Should their relevance escape you, however, you may want to consider again the following. From the shoebox king to one of the world’s most influential computer scientists, the basic principles used in creating a database remain: effective planning and design are essential parts of any project. Without them, the roof might fall in.
Keeping IT Real: How to use Logical ERD Modelling in Effective Database Design
by Aine Daly

The logical data model is primarily focused on the representation of REALITY… tangible objects, actual characteristics, bona fide relationships… these are the fundamentals of logical modelling. It is analytically structured to reflect the core requirements of a business. The model is independent of technology and is not created with a physical data store in mind. That will come into play in the next phase of design: the Physical Model.

Systems have both

Technological Components
• Program
• Database Management System
• Screen Components

&

Technology Independent Components
• Logical Data model
• Business Rules

The logical model concentrates on the needs of the business; no details are included about the physical hardware and database technology. It reveals the business processes and data that exist and reflects the relationships between the two.

The goals of the Logical ERD model are:

a) ESTABLISH INFORMATION/BUSINESS REQUIREMENTS... DATA ENTITIES, RELATIONSHIPS, ATTRIBUTES, CARDINALITY...
b) GRAPHICALLY REPRESENT THESE REQUIREMENTS... SO THAT THEY MAY BE UNDERSTOOD

COMMUNICATION between the Business/organization and the Database designer is critical in order to achieve the above objectives. Both may have different ideas about what the requirements and structure of the database should be, and collaboration ensures that the system developed will fit the business needs. The Logical ERD can be used as a tool of communication as it can be easily explained to non-technical clients.

Logical Entity Relationship Diagram Models convey a great deal of information using a very apt and succinct notation. The components used in Logical ERD development are: Entities, Relationships and Attributes. Using these components the logical model identifies entities and the correct relationships among them.
The term unique identifier is used to describe a data element that differentiates between one entity and another. It replaces the term Primary Key because, once again, it is technology independent, whereas a Primary Key represents a unique identification of a row in a table that can be used as a foreign key in a related table.

Normalization is used to remove redundant data and optimize the overall data structure by grouping the data elements correctly, ensuring that entities are properly formed and each attribute is assigned to the correct entity. This systematic process produces a solid database structure which will allow data to be stored and retrieved in the most efficient manner.

If the correct data is not captured, problems are sure to follow. If the relevant entities or relationships are not represented correctly in a data model, then end-user queries about these entities and relationships cannot be answered.

Regardless of the application that is used in implementation, if you take the time to carefully build a logical model your result will be a solid foundation for your database. It is this framework which will dictate the relevance, speed and efficiency of the final database and an organization's success when using it to conduct business. It should also have a positive impact on the cost of the system development, as it resolves problems at an early stage and does not incorporate redundant data. Figuring out these issues at the design and database development phase is significantly cheaper than trying to fix a problem in an implemented system.
The next step is the Physical model, summarised below:

Logical model to Physical model:
• The implementation of the logical model in the chosen database structure
• The physical diagram is platform-specific and is a more detailed mapping of the logical model to the physical hardware and database technology

Let’s get Physical
By Denis Farrell

To understand Physical ERD Modelling fully, we have to look at the complete ERD Modelling picture. In the design phase of databases, data is represented using a certain data model. These data models are a collection of concepts or notations for describing data, data relationships and data constraints. Data models are one of the following:

1. Conceptual models
• Collection of entities.
• Flexible data structuring capabilities.
• Examples of this kind of model are the object-oriented model, the semantic data model and the entity-relationship model.

2. Record-based logical models
• Data is considered as a collection of fixed-size records.
• These models are closer to the physical level or file structure, so they are easier to implement.
• The three best-known models of this kind are the relational data model, the network data model and the hierarchical data model.

3. Physical models
• Provide concepts that describe the details of how data is stored in the computer’s memory.

It is important to understand how logical and physical models relate to each other and the differences between them.

Logical

The first stage is to gather all the business requirements for the planned database and convert these requirements into a model. The logical model does not look directly at the needs of the database; instead, the business requirements are used to determine the needs of the database. After all the business requirements and information are collected, reports and diagrams are produced, together with entity relationship diagrams, business process diagrams, and eventually process flow diagrams. The diagrams created should demonstrate the processes and data that exist.
They should also demonstrate the relationship between the data and the business processes.

Logical modelling should clearly depict a visual illustration of the activities and data relevant to a particular business. Logical modelling has implications for the direction of the design of the database; however, it also indirectly affects the performance and administration of an implemented database. If time is taken to perform logical modelling, more opportunities arise for planning the design of the physical database. Logical modelling produces diagrams and documentation which determine whether or not the business requirements have been completely gathered. This information is then reviewed by developers, management and end users to decide if more research and work is required before the commencement of the physical modelling.

From Logical Modelling we expect to get the following deliverables.

• Entity relationship diagrams: these give the development team the initial picture of what the database needs to deliver. They show the different categories of data for the business and how they relate to each other.
• Business process diagrams: the process model illustrates all the parent and child processes that are performed by individuals within a company. This shows the development team how data moves within the business.
• User feedback documentation

Physical Modelling

Physical modelling relates to the actual design of a database. It is a cost-effective and practical tool for problem solving and design optimisation. The requirements that were recognised in the logical model set out the basis for the design of the database. The physical model deals with converting the requirements gathered in the logical model into a relational database model. Throughout physical modelling, objects such as tables and columns are created. These are based on the entities and attributes defined in the logical model.
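As a rough sketch of that step, the snippet below turns two logical entities into physical tables using Python's built-in sqlite3 module. The entities and attributes (client and product) are borrowed from the computer vendor example earlier in this issue; the exact table names, column names and data types are assumptions made for illustration, and another database product would represent them differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical entity 'client' (cust. ID, name, address, tel. #) becomes a table;
# each attribute becomes a column. Names and types are illustrative only.
conn.execute("""
    CREATE TABLE client (
        cust_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        address  TEXT,
        tel      TEXT
    )
""")

# Logical entity 'product' (model number, spec, price, availability).
conn.execute("""
    CREATE TABLE product (
        model_number  TEXT PRIMARY KEY,
        spec          TEXT,
        price         REAL,
        availability  INTEGER
    )
""")

def column_names(table):
    # PRAGMA table_info returns one row per column; index 1 holds the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

client_columns = column_names("client")
product_columns = column_names("product")
print(client_columns)
print(product_columns)
```

The data types chosen here (INTEGER, TEXT, REAL) are SQLite's own, which is one small instance of a physical model being tied to the software it is built on.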
Also at this stage constraints are defined, including the primary keys, foreign keys and other unique keys. Views can be created from the database tables to summarise data. All the pieces are brought together in the physical model, and this defines the database for the business.

One restriction of physical modelling is that it is software specific. This means that the objects defined in the physical model can vary depending on the relational database software being used. Variations exist in the way the data types are represented and stored. Conceptually, the basic types of data are the same, with different implementations. Database systems also differ in their objects: those available in one may not be available in another and, as a result, physical models are hardware and software dependent. Oracle is an example of software that will work with many operating systems, such as Windows NT and UNIX. Java-based products can be used on virtually all operating platforms, hence their popularity. So when choosing database software, hardware and operating system platforms, these need to be looked at in conjunction with one another.

From physical modelling we expect to get the following deliverables.

• Server model diagrams: these demonstrate relationships within a database and show tables and columns.
• User feedback documentation
• Database design documentation

INTRODUCTION TO ERD MODELLING
By Fatih Degirmenci

One of the most painful problems of database design is the differing views of designers, programmers, and users, which leads to the design of useless databases or databases which do not reflect their actual purpose. Data Modelling is the first step of the Database Design Process and it sits between real world objects and the database model. To keep everyone involved and aware of the design, it is necessary to use a method that simplifies the design process.
Entity Relationship Diagram Modelling is a method that removes potential roadblocks and simplifies the database design process.

DATABASE DESIGN AND ERD MODELLING

Database design is a software engineering activity that falls within the design activity of the generic software engineering process. The database design process consists of a number of steps, including identifying the data to be stored, determining relationships between stored data, and structuring data. [1] Modelling is an intermediary step that falls between requirements gathering and construction, and ERD Modelling is a widely used modelling schema for this purpose. It gives us an abstract, notional representation of structured data, using a conceptual schema to design the database. It is a general data modelling type for relational databases, which helps to simplify the design process. [2]

Some of the key terms of ERD Modelling are described by Peter Chen as below:

“An entity is a “thing” which can be distinctly identified. A specific person, company, or event is an example of an entity. A relationship is an association among entities.” [3]

There are several types of ERD Modelling, and the most widely used type was developed by Peter Chen. In Chen’s ERD Modelling, entities are represented by rectangles, and the entity name inside the rectangle is expressed in singular form. [4]

[Figure: the entity student drawn as a rectangle]

Entity attributes are not shown on the ERD itself in the original Chen model, but the model has been extended to include attributes. An attribute preceded by an asterisk is the identifier of the entity. [4]

[Figure: the entity student with attributes *sId, name, address and telephone]

Relationships show how two or more entities are related to each other, in the form of verbs; for example, student submits assignment. In this example, student and assignment are entities and submits is the relationship.
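For readers who think better in code than in boxes and lines, here is a minimal Python sketch of the same vocabulary: an entity with an identifier and attributes, and a relationship named by a verb. The class names Entity and Relationship are invented for this illustration and are not part of Chen's notation; the assignment identifier aId and its title attribute are likewise assumptions, since only the student attributes are given above.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                      # singular entity name, e.g. "student"
    identifier: str                # the attribute marked with an asterisk
    attributes: list = field(default_factory=list)

    def describe(self):
        # The identifier is listed first and prefixed with '*'.
        attrs = ["*" + self.identifier] + self.attributes
        return f"{self.name}({', '.join(attrs)})"

@dataclass
class Relationship:
    verb: str                      # the verb that names the relationship
    source: Entity
    target: Entity

    def describe(self):
        return f"{self.source.name} {self.verb} {self.target.name}"

# The student entity uses the attributes from the article.
student = Entity("student", "sId", ["name", "address", "telephone"])
# Hypothetical attributes for assignment, for illustration only.
assignment = Entity("assignment", "aId", ["title"])
submits = Relationship("submits", student, assignment)

print(student.describe())
print(submits.describe())
```

This is only a way of writing down what the diagram already says; it makes no claim about how an ERD tool stores its models.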
[Figure: Chen notation, student submits assignment]

There are several other notations which can be used to draw ERDs, and one of the most widely used is Crow’s foot notation. [1] If we redraw the above example using this notation, we get the diagram below.

[Figure: Crow’s foot notation, student submits assignment]

Relationships can take several forms: one-to-one, one-to-many, and many-to-many. In a one-to-one relationship, one entity is related to only one other entity. In the previous example, a student is related to one assignment, which shows a one-to-one relationship. In the real world, a student may submit more than one assignment, and this is a good opportunity to show a one-to-many relationship. In this case, the relationship can be redrawn as below.

[Figure: Crow’s foot notation, student submits assignment as a one-to-many relationship]

The completed ERD shows the overall plan of the database and is called the logical ERD. Database designers need to be aware of the logical ERD. In DBMS terms, the realization is done in the physical ERD schema.

In database design, communication with end users is an important step in gathering the requirements of the database and reaching a common view of real world entities. When data modelling starts, the differences between the end users’ views and the developer’s views become the main problem. It falls to the developer to solve it, which can be done by creating a data model that the end user can understand.

ERD Modelling is useful when users need to know more about the design and developers need to explain design aspects to users. This type of schema gives users and developers the chance to share a common view of the data and of how database design issues can be handled.

REFERENCES

[1] “Entity-relationship model - Wikipedia, the free encyclopedia”; http://en.wikipedia.org/wiki/Entity-relationship_model.
[2] S. Bagui and R. Earp, Database Design Using Entity-relationship Diagrams, Auerbach Publications, 2003.
[3] P.P.S. Chen, “The entity-relationship model: toward a unified view of data,” ACM Transactions on Database Systems (TODS), vol. 1, 1976, pp. 9-36.
[4] J.L.
Harrington, Relational Database Design Clearly Explained, Morgan Kaufmann Publishers, 2002.

Mrs. Peacock in the Library
By: Gene Kelly

Mrs. Peacock In The Library With The Candle Stick?

Dr. Black Murdered!

Dr. John Black (48), self-made millionaire, hosted a weekend celebration at his country mansion to mark the 30th anniversary of his company, DBD Inc. Suspicions first arose when Dr. Black was nowhere to be seen in the drawing room for pre-dinner drinks on Saturday night. By the time desserts were being served there was still no sign of Dr. Black, and Mrs. White, his maid of 25 years, now feeling a little worried, went to his room to look for him. Just as she was about to knock on his door, she heard a scream echo from what appeared to be the kitchen. This was abruptly followed by another scream coming from the entrance hall. Mrs. White went to investigate…

[Photo caption: Black’s Tudor Mansion, built in 1586]

When Mrs. White reached the bottom of the stairs she was met by Mrs. Peacock, who was being comforted by Reverend Green. They were both standing beside a pool of blood which had been smeared across the carpet. Mrs. White felt a strange feeling in her stomach; she wasn’t sure if it was worry or hope. She continued to the kitchen to find the source of the first scream. In the kitchen she was met by Miss Scarlet, who was standing by the cold room with her hand on the door. Mrs. White was closely followed by Professor Plum, who had also come to find the source of the scream. They both looked into the open cold room to find the body of Dr. Black. Mrs. White ran to the nearest telephone, which was in the Lounge. She called the local police station and informed them of the news. They would send someone over right away… As Mrs.
White made her way back to the others, she passed through the Billiards Room, where she met Colonel Mustard sitting in a leather armchair, swirling his snifter of cognac with one hand while holding his wooden pipe with the other, apparently oblivious to the happenings in the rest of the house. Mrs. White told Colonel Mustard about the body and led him through the conservatory into the ballroom, where the rest of the guests had gathered. When Mrs. White arrived in the ballroom, she noticed that one of the bronze candlesticks that stood by the fireplace was missing. Just as she was about to point this out, a knock! Mrs. White went to answer the front door, where she was met by Mr. Parker, the local police officer, and another man whom she did not recognise. Mr. Parker introduced the other man as Dr. Peter Chen, who was visiting from Louisiana State University to help update the methods used to collect police data. With that Dr. Chen proclaimed, "Don't worry Mrs. White, I'm on the CASE!"

Introduction to Normalisation
By Kevin Mallon

Normalisation is the process of organising data in a database. The goal of data normalisation is to reduce and, if possible, eliminate data redundancy. This is an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. Redundant data also wastes disk space and creates maintenance problems. The main reason for normalising is the possible corruption of databases due to three main factors: insertion anomalies, deletion anomalies and update anomalies.

[Figure: Why Normalise? Insertion Anomalies, Deletion Anomalies, Update Anomalies]

Normalisation can also be referred to as canonical synthesis, as this is the process of designing a database model without redundant data items.
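To make the update anomaly concrete, here is a minimal sketch in Python using the standard sqlite3 module. The enrolment table, its columns and the names stored in it are invented for illustration and are not taken from the article. The lecturer for a course is stored once per student, so changing only one of the copies leaves the data contradicting itself.

```python
import sqlite3


def update_anomaly_demo():
    """Show how redundant data lets one fact end up with two values."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical unnormalised table: the lecturer is repeated on every row.
    conn.execute(
        "CREATE TABLE enrolment (student TEXT, course TEXT, lecturer TEXT)"
    )
    conn.executemany(
        "INSERT INTO enrolment VALUES (?, ?, ?)",
        [
            ("Ann", "Databases", "Dr Black"),
            ("Bob", "Databases", "Dr Black"),
        ],
    )
    # The lecturer changes, but only one of the redundant copies is updated.
    conn.execute(
        "UPDATE enrolment SET lecturer = 'Dr Green' WHERE student = 'Ann'"
    )
    rows = conn.execute(
        "SELECT DISTINCT lecturer FROM enrolment "
        "WHERE course = 'Databases' ORDER BY lecturer"
    ).fetchall()
    conn.close()
    # More than one lecturer for the same course means the data is inconsistent.
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(update_anomaly_demo())
```

If the lecturer were held once in a separate course table, the same change would be a single-row update and the two answers could not disagree.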
Well normalised data makes the task of programming a lot easier and works very well in multi-platform, enterprise-wide environments. Data Normalisation is sometimes known as the cure for Spreadsheet Syndrome: the lumping of every possible piece of information into as few tables as possible, sometimes into a single table.

[Figure: Normalisation, Spreadsheet Syndrome, Concepts]

The original concept of database normalisation was introduced by Edgar Frank Codd in 1970 in his paper “A Relational Model of Data for Large Shared Data Banks”. In this paper, Codd states “there is, in fact, a very simple elimination procedure which we shall call normalization. Through decomposition non-simple domains are replaced by "domains whose elements are atomic (non-decomposable) values."”

There are a few rules for database normalisation. Each rule is called a "normal form." If the first rule is observed, the database is said to be in "first normal form." 1NF is often referred to as the atomic rule. In a database, this means that each column should be designed to hold one and only one piece of information. If the first three rules are observed, the database is considered to be in "third normal form." Although other levels of normalisation are possible, third normal form is considered the highest level necessary for most applications.

The concept of functional dependencies is the basis for the first three normal forms. A functional dependency occurs when one attribute in a relation uniquely determines another attribute. This can be written A -> B, which is the same as stating "B is functionally dependent upon A."

The table below shows the three most common forms of normalisation.

Level: Rule
First Normal Form (1NF): An entity type is in 1NF when it contains no repeating groups of data.
Second Normal Form (2NF): An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.
Third Normal Form (3NF): An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.

SOLUTION TO: Puzzle Page 34
[Crossword solution grid]

Is Normal Better?
By: Sam Senior

Mr SQL Visits Dr Database to Find Out if He's Normal...

Mr SQL: Dr Database, I am not sure if I am Normal or not. Can you help me?

Dr Database: Well, Mr SQL, do you feel atomic?

Mr SQL: Not sure what you mean?

Dr Database: Well, a Normalised database has atomic data. Think of an atom. In other words, the data can't be broken down any more. For example, first name can't be broken down any more.

Mr SQL: I'm just a raw, Unnormalised database.

Dr Database: Do you feel any anomalies?

Mr SQL: Oh, yes, plenty Doc. I have inconsistent data and my CPU's very hot and overloaded. Also, I feel so bloated and large...must be all the redundant data I have.

Dr Database: Sounds like you have an acute case of Spreadsheet Syndrome. Well, I guess you need to be Normalised. I will outline the basic plan...

Three Normal Forms later...

Mr SQL: Wow! I followed the plan of decomposing tables into more tables and can feel the redundant data just slipping away.

Dr Database: As I predicted, you now have no duplicated data due to decreased redundancy.

Mr SQL: My CPU is a lot cooler but when people query me it takes me longer to respond because of the table JOINs.

Dr Database: Well, we could Denormalise you a bit.

Mr SQL: Denormalise? But I spent ages trying to Normalise! Why would I want to do that?

Dr Database: Well, it's not all black and white. Hear me out...

What are the advantages of Normalisation?

Since there is no duplication in a Normalised database there will be few or no anomalies.
This means little to no administration is needed to ensure that redundant data is accurate and up-to-date. In addition, little or no redundant data means lower storage requirements. A simpler, more efficient structure also means the database is more scalable. Also, write actions such as INSERT, UPDATE and DELETE, ie: writing to the database, will run better.

However, it's not all good...

As the table count increases during the Normalisation process so too does the JOIN count. If the database is large then JOIN jungles can be created, which can eventually affect response times.

What can be done to improve performance? Improve the Normalisation design so that it reflects the data usage; create indexes for frequently queried attributes; use clustering; or just accept poor performance. However, if the users still complain…Denormalise! Denormalisation is part of the physical design phase and can only be done after the data has been Normalised.

ANOMALY WARNING: DO NOT DENORMALISE UNNORMALISED/RAW DATABASES!

Question: don't read any further. What do you think Denormalisation means and why would a SQL administrator do it?

“Denormalisation is the design process of taking normalised data and producing a physical design in which normalised data is rearranged so that optimal access and manipulation of data can be achieved.” [Inmon]

Normalised Database Example
CUSTOMER: CustomerNum, CustomerName...
CUST_PHONE: CustomerNum, Phone

Denormalised Database Example
CUSTOMER: CustomerNum, CustomerName, Phone1, Phone2, Phone3...

Here are some reasons why a database administrator would contemplate using Denormalisation.

• Calculated values. A Normalised database holds no calculated values. For example, an online shopping cart may have a field called total_price (price * quantity), which is forbidden by Third Normal Form. Information Warehouses use large numbers of precalculated summary tables known as Materialised Views.
This improves response times for summary data, ie: no complex calculations are required because a pre-calculated result on a summary table is queried.

• The key reason: performance. To avoid JOIN jungles. A Normalised database must locate the relevant tables and then JOIN the data either to get the information or to process the data. Thus a Normalised database uses a higher amount of I/O and CPU. In addition, Relational DBMSs are optimised to perform three-way joins, so the database loses efficiency when more complex joins are required. The outcome of Denormalisation is better response times, ie: reduced I/O and CPU. For systems that depend on real-time information Denormalisation may be required.

• To maintain historical data. For example, a Salesperson's surname may change, and if the name is stored only once in a Normalised database, an invoice report will not show both the old and the new surname. However, if the surname is also stored in a separate invoices table as redundant data, then both surnames will appear in the report.

• For specific application requirements. Application coding could be simpler because the data is spread across fewer tables and is easier to locate.

What tools can be used to Denormalise?

To reduce the number of tables/joins it is important to analyse which entities are accessed by applications and how they relate to each other. This can be achieved by using Entity Relationship Diagrams, Data Flow Diagrams and Cross-Reference Matrices to identify database usage.

Disadvantages...

The key risk of Denormalisation is anomalies caused by redundant data. Tracking the redundant data will require extra administrative effort.

Like everything in life, there's a balance, Yin and Yang, et cetera... There’s a happy medium between Normalisation and Denormalisation, but both require a complete understanding of the data and the specific business requirements.
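The CUSTOMER and CUST_PHONE example can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the table and column names follow the article, but the denormalised table is renamed CUSTOMER_FLAT (a hypothetical name) so that both designs can live in one database, and the sample customer and phone numbers are invented.

```python
import sqlite3


def build_demo():
    """Create both designs side by side in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalised design: one row per phone number.
        CREATE TABLE CUSTOMER (CustomerNum INTEGER PRIMARY KEY, CustomerName TEXT);
        CREATE TABLE CUST_PHONE (CustomerNum INTEGER, Phone TEXT);
        -- Denormalised design: phone numbers folded into the customer row.
        CREATE TABLE CUSTOMER_FLAT (
            CustomerNum INTEGER PRIMARY KEY, CustomerName TEXT,
            Phone1 TEXT, Phone2 TEXT, Phone3 TEXT
        );
        INSERT INTO CUSTOMER VALUES (1, 'Ann');
        INSERT INTO CUST_PHONE VALUES (1, '555-0100');
        INSERT INTO CUST_PHONE VALUES (1, '555-0101');
        INSERT INTO CUSTOMER_FLAT VALUES (1, 'Ann', '555-0100', '555-0101', NULL);
        """
    )
    return conn


def phones_normalised(conn, customer_num):
    """Needs a JOIN, but copes with any number of phones per customer."""
    rows = conn.execute(
        "SELECT p.Phone FROM CUSTOMER c "
        "JOIN CUST_PHONE p ON p.CustomerNum = c.CustomerNum "
        "WHERE c.CustomerNum = ? ORDER BY p.Phone",
        (customer_num,),
    ).fetchall()
    return [row[0] for row in rows]


def phones_denormalised(conn, customer_num):
    """Single-table read with no JOIN, but limited to three phone columns."""
    row = conn.execute(
        "SELECT Phone1, Phone2, Phone3 FROM CUSTOMER_FLAT WHERE CustomerNum = ?",
        (customer_num,),
    ).fetchone()
    return [phone for phone in row if phone is not None]


if __name__ == "__main__":
    conn = build_demo()
    print(phones_normalised(conn, 1))
    print(phones_denormalised(conn, 1))
```

Both functions return the same phone numbers. The trade-off is in the shape of the query: the normalised read pays for a JOIN, while the denormalised read is a single lookup that cannot hold a fourth phone without a change to the table.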
ERD and Distributed Databases
By: Tanya Polianinova

Distributed databases are widely used by many companies for data storage and manipulation. The next few paragraphs explain the concepts of Distributed Databases and describe the principles behind the Entity Relationship Diagram. The advantages and disadvantages of both are discussed, along with a description of each.

History

Databases have been in use since the early days of electronic computing. The Distributed Database concept was introduced around the 1970s, and since then a variety of organisations worldwide have used distributed databases for data storage. Around the same time the Entity Relationship Diagram was introduced by Peter Chen, following the earlier data structure diagrams of Charles Bachman. ERDs are used for many kinds of database design and can serve as a ‘foundation’ for database development and planning.

Distributed Databases

A database is a collection of data stored on a computerised system. Data is created, stored, organised, sorted, manipulated and retrieved using software programs or a Database Management System (DBMS) and a variety of query languages, such as SQL.

A Distributed Database is a database that stores data in different locations on the network, which may be in different geographical locations. It is controlled by a DBMS and allows multiple users to access and manipulate data without interfering with each other. In other words, although the data is spread across several sites, the user sees the database as a centralised system with the data stored in one place.

Data is spread across sites using fragments, which allow multiple copies of the same data. Different forms of data distribution can be used. Data can be replicated, where copies of the same data are kept in many different locations. Data can also be Horizontally or Vertically Fragmented.
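As a rough illustration of the two kinds of fragmentation, the sketch below splits a small list of records in plain Python. It assumes the usual meaning of the terms: a horizontal fragment holds a subset of the rows, while a vertical fragment holds a subset of the columns and repeats the key so that rows can be put back together. The function names, the sample student records and the idea of choosing a site by campus are all invented for the example.

```python
def horizontal_fragments(rows, site_of):
    """Split rows between sites; every fragment keeps all of the columns."""
    fragments = {}
    for row in rows:
        fragments.setdefault(site_of(row), []).append(row)
    return fragments


def vertical_fragments(rows, key, column_groups):
    """Split columns between systems; every fragment repeats the key column."""
    return [
        [{column: row[column] for column in [key, *columns]} for row in rows]
        for columns in column_groups
    ]


if __name__ == "__main__":
    # Hypothetical student records used only for this illustration.
    students = [
        {"id": 1, "name": "Ann", "campus": "Dublin", "fees": 100},
        {"id": 2, "name": "Bob", "campus": "Cork", "fees": 200},
    ]
    print(horizontal_fragments(students, lambda row: row["campus"]))
    print(vertical_fragments(students, "id", [["name", "campus"], ["fees"]]))
```

Repeating the key in each vertical fragment is what lets the DBMS rebuild a complete row and present the data to the user as if it were stored in one place.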
With Horizontal fragmentation the data is split by rows across different sites, whereas with Vertical fragmentation the data is split by columns across multiple systems. Sometimes data is reorganised, in other words manipulated in some way, for example summarised, and then stored. The last method of data distribution is known as Separate Schema, in which the data is kept in different databases so that different systems can access and use it through different programs and interfaces.

Data in a Distributed Database is regularly synchronised to ensure that all of it is up-to-date. Data synchronisation is done by using timestamps. Every time data in the database is created or updated, a timestamp is recorded with the date and time of that change. The system then compares timestamps to see whether the data has been modified since the previous synchronisation, and updates the data if required.

A Distributed Database is designed in such a way that the user sees the database as a centralised system, rather than a system with data spread across multiple locations. Although a Distributed Database has a very complex design, can be costly to create and has very high security requirements, it has many benefits. Those benefits include reduced network traffic, as a central server or the network is not needed for most database activities, as well as improved data manipulation time, reliability and availability.

ERDs for Distributed Databases

An Entity Relationship Diagram, or ERD, is used to represent graphically the entities (tables or objects) of a database and the relationships between these entities. An ERD shows data flows and interactions between different objects, which are linked together by unique identifiers or primary keys. Each entity in an ERD represents an object of some kind, e.g. a student or person, which is accompanied by its attributes, for example ID, Name, Date of Birth, Address, etc.
The entities interact with each other through relationships, e.g. a student is assigned to a group. Sometimes the relationship defines the number of entities with which the object interacts, e.g. many students can be assigned to one group.

ERDs are easy to use and create, and are a good communication tool. An ERD can be used as the foundation for the database design and structure. It is important, as it represents the structure and behaviour of the system or the user requirements. It can be used as an element of the planning and development processes. Although an ERD can be a weak tool for representing specifications and data descriptions, and can even cause a loss of information, it has an advantage over other ways of representing database structure, as it comes in a graphical form. This allows people without any specific technical skills to understand how the database works. This is a very useful characteristic, as database design can be very complex and difficult to understand.

ERD Puzzle: Fill in the blanks
By: Giammarco Schisani
19th of October 2008

Instructions

Given the following description of an Entity Relationship Diagram, fill in the blanks in the Puzzle below. The number in brackets after each blank is its crossword clue number.

Entity Relationship Diagrams

A relational ____ (10) can be modelled using ____ (7) Relationship Diagrams (or ER Diagrams). Such diagrams are capable of describing the main components of an Entity Relationship ____ (6): entities and ____ (2).

An entity describes something that can be uniquely identified, such as:
• An ____ (12) in an e-____ (13) website;
• A customer in an e-commerce ____ (9);
• A product in an e-commerce website;
• A ____ (11) of products in an e-commerce website (e.g. “Monitors”, “Printers”, etc.).

Entities can often be described by a ____ (4) (e.g. “order”, “customer”, etc.). In an ER ____ (5 down), an entity is described with a box:

[Diagram: a box labelled Order]

A relationship describes how two or more entities relate to each other. Relationships can often be described by a ____ (8).
For example:
• “Places”: A customer places an order.

In an ER diagram, a relationship is described by a ____ (5 across):

[Diagram: Customer, Places, Order]

Both entities and relationships can have attributes. An attribute represents information about the entity or relationship. For example:
• An “Order” entity might have an “ID” ____ (1) that uniquely identifies the order;
• A “Customer” entity might have “Name” and “Surname” attributes;
• A “Places” relationship between a “Customer” and an “Order” entity might have a “Date” attribute indicating when the order was placed.

In an ER Diagram, an attribute is represented by an ____ (3):

[Diagram: Customer with attributes Firstname and Surname; Places with attribute Date; Order with attribute ID]

See Page 26 for Solution

[Crossword grid with clues numbered 1 to 13]

Puzzle 1: Against all odds
By Paraic Lavin

You work in a small company as a database administrator earning lots of money. The tables below (A, B & C) have been designed by three different colleagues who work in another division. Their boss has asked you to check them in order to prevent future problems with efficiency, etc. Can you spot the odd table out?

Table A (Figure 1)
Did you know? #1 Data should be presented in table format.

Table B (Figure 2)
Did you know? #2 Data should be accessible without ambiguity.

Table C (Figure 3)
Did you know? #3 INSERT, DELETE, UPDATE commands must be supported by use of a single command.

Puzzle 2: Deleting for good not for evil

Puzzle 2A - “The Adventures of Dataman”

You are “Dataman”, a superhero with a penchant for whiskey who recognises bad design in database tables as evil. Can you remove one column from the following table in Figure 4 so that the table is converted into first normal form (1NF), and save the world from evil yet again?

Table D (Figure 5)
Did you know? #5 Physical changes to the data store should not affect the logical database structure.
Puzzle 2B - “Dataman Returns”

Al-primary-key-da have attacked western financial markets by introducing bad design into one critical database table. Governments across the world have said they will guarantee all affected tables, but the public fears that it is not enough. Can you delete one column and save the world yet again from financial ruin?

Table E (Figure 6)
Did you know? #6 Constraints must exist to preserve data integrity.

Table F (Figure 7)
Did you know? #7 Codd's 12 rules are really 13 rules because they are numbered 0 to 12.

Answers:

Puzzle 1: Against all odds: The answer is Table A. Although none of the tables is fully normalised, Table A is clearly not normalised at all, as it has repeating information, i.e. Class_1, Class_2, Class_3. Should two of these columns be deleted in favour of one “Class” column, the table would be in 1NF (First Normal Form).

Puzzle 2: Deleting for good:
Puzzle 2A: Delete column FavColour or FavColour2. Either answer is correct.
Puzzle 2B: Delete column CustomerName from Table E, as this information is duplicated in Table F.

The Need for Speed - War of The Fields
By Patrick Crowe

In this edition of DATA-Nerd we take the chance to get out of the classroom and take a couple of laps under the clock. In this practical I examine whether the theory regarding the correct definition of database fields really matters for performance and, if it does, whether it makes a real difference out in the real world.

Objective

To examine the difference in performance between two databases identical in all respects, except that the field type for one column was declared as INT in one database and NCHAR in the second. The column in question was used to contain numbers only.

The Test

All operations were executed using queries in MS SQL Server Management Express.
The results were obtained using the Client Statistics functionality in the same application.

The DATABASES

DATABASE: Speed_Test
Column Name     Data Type     Allow Nulls
NUMBER_INT      int           Checked
Letter          nchar(10)     Checked
WOTW            text          Checked

DATABASE: Speed_Test2
Column Name     Data Type     Allow Nulls
NUMBER_nchar    nchar(100)    Checked
Letter          nchar(10)     Checked
WOTW            text          Checked

Contents of Database
Column Name                 Content
NUMBER_INT/NUMBER_nchar     Number from 1 to 535294
Letter                      A
WOTW                        The first paragraph from War of the Worlds by H.G. Wells, 1898 (source: http://www.bartleby.com/1002/101.html); 230 words, 1331 characters.

The databases contained 535294 rows after population.

TEST 1 - BULK INSERT

To test the bulk import speed, the data was imported from a comma-separated (CSV) text file using the following:

BULK INSERT Test_Table FROM 'c:\test2.csv' WITH (FIELDTERMINATOR = ',')

RESULTS

Database               Total Execution Time (ms)
Speed_Test (INT)       436437
Speed_Test2 (nchar)    242875
Difference             193562

TEST 2 - Simple select

The following select was used to return all rows of the database.

For Database: Speed_Test
Select * from [Test_Table] Where Number_INT > 0

For Database: Speed_Test2
Select * from [Test_Table] Where Number_nchar > 0

RESULTS

The test was run 4 times for each database and the results are in milliseconds.

Database       Test 1    Test 2    Test 3    Test 4    Average
Speed_Test     15734     15062     14156     14750     14925.5
Speed_Test2    194406    213265    209062    244390    215280.8
Difference                                             200355.3

[Chart: execution time in ms (0 to 300000) for Tests 1 to 4; series SpeedTest, SpeedTest2 and Difference]

Conclusions

It is clear from the test results that, in this particular environment, the correct declaration of a numeric field has a significant effect on performance. As part of the overall design of a database, care should be taken to declare numeric and character fields with the appropriate data types to help optimise performance.
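Readers who want to try a scaled-down version of Test 2 can use a harness like the Python sketch below. It does not reproduce the SQL Server set-up described here: it uses the standard sqlite3 module, invented table names (test_int and test_text) and an explicit CAST to stand in for the conversion that a numeric comparison on a character column requires. Treat it as a way of measuring your own environment, not as confirmation of the figures above.

```python
import sqlite3
import time


def build(conn, table, col_type, n):
    """Create a one-column table holding the numbers 1..n as int or text."""
    conn.execute(f"CREATE TABLE {table} (number {col_type})")
    if col_type == "INTEGER":
        values = ((i,) for i in range(1, n + 1))
    else:
        values = ((str(i),) for i in range(1, n + 1))
    conn.executemany(f"INSERT INTO {table} VALUES (?)", values)


def timed_count(conn, sql):
    """Run a COUNT query and return (row count, elapsed seconds)."""
    start = time.perf_counter()
    count = conn.execute(sql).fetchone()[0]
    return count, time.perf_counter() - start


def run_speed_test(n=100_000):
    conn = sqlite3.connect(":memory:")
    build(conn, "test_int", "INTEGER", n)
    build(conn, "test_text", "TEXT", n)
    int_count, int_secs = timed_count(
        conn, "SELECT COUNT(*) FROM test_int WHERE number > 0"
    )
    # The CAST makes the per-row text-to-number conversion explicit.
    text_count, text_secs = timed_count(
        conn, "SELECT COUNT(*) FROM test_text WHERE CAST(number AS INTEGER) > 0"
    )
    conn.close()
    return {
        "int_count": int_count,
        "int_secs": int_secs,
        "text_count": text_count,
        "text_secs": text_secs,
    }


if __name__ == "__main__":
    print(run_speed_test())
```

Both queries should return the same number of rows; only the elapsed times are expected to differ, and by how much will depend on the machine and the database engine.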
The Environment

Hardware
Machine: Lenovo ThinkPad R61
CPU: Core 2 Duo T8100 @ 2.10 GHz
Memory: 4 GB RAM
Disk Space (at start of Speed Test): 142 GB, 84MB free

Software
Operating System: Windows XP Professional 2002, Service Pack 2
Database: Microsoft SQL Server 2005 Standard Edition, Version 9.00.1399.06
Database Management: MS SQL Server Management Express, Version 9.00.2047.00
Other Software (open but not in use during test): MS EXCEL, Google Chrome

HOROSCOPE

Psychic Meg is on hand to analyse the cosmos! What the stars have in store for you!

ARIES
The stars have aligned just for you. Now is the time to sell your collection on eBay. The recession hasn’t hit your star sign just yet! Sell sell sell!

TAURUS
This will be a deeply depressing week when you realise your database has way more friends than you do. Maybe now is a good time to step into the real world.

GEMINI
Be careful what you wish for; it just might happen. Think BIG and BIG is what you will get. Hopefully this won’t apply to your waistline but could be very advantageous in your career!

CANCER
Fail to plan and you could be planning to fail! Make sure your recovery and failover plans do work. This month could be tricky… Be prepared!

LEO
This is your future self! Don’t give up on your time-travel research. Take the time to include people around you in formulating a plan. Others will appreciate it and recognise you as a team player.

VIRGO
'My Precious' - Finishing your Germanic translation of the Lord of the Rings book will finally bring to a close 6 years' worth of Friday and Saturday nights. Time to party!

LIBRA
You are destined to meet the person of your dreams this week. Keep your distance however. Time to kick on-line dating into cyberspace. Things are not always as they seem!

SCORPIO
“There is no spoon!” Keep this phrase in mind this month as nothing is clear or set in stone just yet. Clarity will come next month. Swirling your cup will help mix the coffee, milk and sugar.
CAPRICORN
You will arrive in a strange universe where you still live in your parents’ house, Battlestar Galactica is no longer cool, and your mum still licks her thumb and uses it to wash dirt off your face. Do your best to survive until the next wormhole opens up, then jump as if your life depended on it!

AQUARIUS
Front page news - Your dreams of making “Wonder Woman vs. Cat Woman” into a movie will finally be realised. Keep the spandex-wearing stories to yourself though: your plan of world domination must remain a secret. The world is not ready just yet!

SAGITTARIUS
Feeling paranoid that your car might be an Autobot? Don’t fret; you aren’t losing your mind. It will need a service, so book it in soon.

PISCES
Abandon ship. Your robots have become self-aware. All mayhem is about to break loose. You and your kind are the first to be integrated and soldered into the motherboard. Abort while you can!

Advertisement

Want to Learn more? Check out www.comp.dit.ie for the full range of innovative, exciting and flexible industry-focused full-time and part-time undergraduate and postgraduate courses.