ADA Question and Answers

Table of Contents
Q1: Data Warehousing vs. OLTP: 2010, 2011
Q2: Dimensional Models, Star, Galaxy etc: 2009, 2011
Q3: Cube, Roll-Up, Slice and Dice etc: 2011
Q4: XML: 2011, 2010
Q5: A-Priori: 2008, 2010, 2011
Q6: XPATH: 2008, 2009, 2010
Q7: Oracle: 2010
Q8: Schemas and ROLAP: 2010
Q9: Support and Confidence: 2008, 2009, 2010
Q10: Objects: 2008, 2009
Q11: Decisions and DW: 2009
Q12: SQL XML: 2008
Q13: Objects: 2008
Q14: DW Data: 2008
Q15: OLAP: 2008

Q1: Data Warehousing vs. OLTP: 2010, 2011

As a Data Warehouse consultant, you have been asked to brief a client company whose experience has been in transaction processing based systems and are considering embarking on a data warehouse project.

2011

a) Briefly discuss how Online Transaction Processing (OLTP) systems differ from Data Warehousing systems. Justify your answer with examples.

A DW is designed for query and analysis, as opposed to transaction processing. A DW separates analysis workload from transaction workload. A DW’s main purpose in a business is as an analytical tool. A DW environment includes an extraction (and transportation), transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

              | OLTP                  | OLAP
Source        | Original/current data | Multiple sources, historical data
Purpose       | Basic tasks           | Planning, decisions, problem solving
Shows         | Snapshot of business  | Multidimensional views of biz
INSERT/UPDATE | Short & fast          | Long running batch
Queries       | Simple                | Complex, with aggregations
Speed         | Very fast             | Slow
Space         | Small                 | Large
Design        | Normalised            | Denormalised, uses schemas
Backup        | Daily                 | Can just reload OLTP data

OLTP                                              | Data warehouse
Holds current data                                | Holds historical data
Stores detailed data                              | Stores detailed, lightly & highly summarised data
Data is dynamic                                   | Data is largely static
Repetitive processing                             | Ad hoc, unstructured & heuristic processing
High level of transaction throughput              | Medium to low level of transaction throughput
Predictable pattern of usage                      | Unpredictable pattern of usage
Transaction-driven                                | Analysis driven
Application-oriented                              | Subject-orientated
Supports day-to-day decisions                     | Supports strategic decisions
Serves large number of clerical/operational users | Serves relatively low number of managerial users

b) Discuss two options that one could use to manage highly summarised data with a view to improving query performance in the data warehouse environment. Why would you use one option over the other? Give examples to support your answer.

meh

c) Briefly discuss the role of the Staging Area in an enterprise data warehouse environment.

A staging area simplifies building summaries or materialized views and general warehouse management. Staging areas are often useful for performing intensive calculations or data-cleansing operations that may adversely affect the production query environment.

2010

a) Highlight a common Data Warehouse architecture, briefly discussing its main components.

b) Discuss the typical problems that can occur with Data Warehousing projects.

- Underestimation of resources for data loading
- Required data not captured
- Increased end-user demands
- High demand for resources
- High maintenance
- Data ownership
- Long-duration project
- Data homogenisation
- Complexity of integration

Q2: Dimensional Models, Star, Galaxy etc: 2009, 2011

a) Dimensional Modelling is a logical design technique used to design star and derivative schemas.

i) Compare and contrast Star and Snowflake schemas, explaining when you would use one over the other. 2009, 2011

Star schema: a central fact table joined directly to denormalised dimension tables. Snowflake: the dimension tables are normalised into sub-tables. This uses less space, but queries need more joins and so run slower.

ii) Briefly explain the term Star Galaxy and why you would design for this. What design decisions must you make to ensure you achieve this design?

A galaxy schema is a grouping of star schemas, each with its own fact table, typically sharing some dimension tables. You might design for this if the fact tables use different dimension levels, eg sales in one table go by the month, and sales in the other go by the year.

b) Data Warehouse environments are considered I/O intensive. Explain what approaches you could take to improve I/O performance.

Partitioning divides a table into partitions. With partition pruning, a query doesn’t have to read the whole table, just the partitions that contain the relevant data. Bitmap indexes are a lot smaller than B-Trees on low-cardinality columns. Materialised Views store pre-computed query results, so the summary does not have to be recalculated from the detail rows.

c) Briefly discuss the definition of the Dimension Object in the Oracle Database. While not mandatory, why is it good practice to define them for your dimension table?

A dimension object (CREATE DIMENSION) declares the hierarchical relationships between the columns of a dimension table, eg day rolls up to month, which rolls up to year. Defining one is good practice because it lets the optimiser use query rewrite to answer higher-level queries from materialised views.

2009

Dimensional modelling is a specialised type of ER modelling used in DW design.

a) Discuss in detail the main tables (and how they are related) commonly found in dimensional data warehouse schemas. Provide examples to support your answer.

Fact tables, dimension tables and the relationships between them: the fact table holds the measures plus a foreign key to each dimension table.
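
The fact and dimension tables described in the answer above can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_store), the columns and the sample rows are all made up for the example and do not come from the exam paper.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table with a foreign key
# to each of two dimension tables.
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    city     TEXT,
    region   TEXT
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,   -- measure
    amount     REAL       -- measure
);
INSERT INTO dim_product VALUES
    (1, 'Hurley', 'Equipment'),
    (2, 'Sliotar', 'Equipment'),
    (3, 'Jersey', 'Clothing');
INSERT INTO dim_store VALUES
    (1, 'Dublin', 'East'),
    (2, 'Galway', 'West');
INSERT INTO fact_sales VALUES
    (1, 1, 2, 60.0),
    (2, 1, 5, 50.0),
    (3, 2, 1, 45.0),
    (1, 2, 1, 30.0);
""")

# A typical star join: the fact table is joined to each dimension,
# and the measure is aggregated by dimension attributes.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.region
    ORDER BY p.category, s.region
""").fetchall()

for category, region, total in rows:
    print(category, region, total)
```

A snowflake version of the same sketch would move category and region out into their own tables, which is where the extra joins come from.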

c) Compare and contrast an Enterprise Data Warehouse with a Virtual Data Warehouse.

meh

Q3: Cube, Roll-Up, Slice and Dice etc: 2011

a) With reference to the multidimensional cube below, explain how the following cube and visualisation operations function:

i) Slice and dice: Slice fixes a single value on one dimension, eg one product, and shows the full info on it. Dice selects a subset of values on two or more dimensions, eg several products over selected quarters, and shows the smaller sub-cube for them.

ii) Roll up and drill down: Roll up provides a more summarised view, eg all products in the north, instead of N/E and N/W. Drill down gives more detailed views, eg instead of by Quarters, it would be by month. Or you could just get the data by supplier.

iii) Cube: Gives summarised data, with “NULL” in the grouped-out columns to mark the rows that show totals.

iv) Pivot and Nest: Pivot rotates the cube to view it along a different dimension, eg instead of arranging by Quarter, arrange by product. Nest is representing the cube as a table.

b) Briefly discuss by way of example how OLAP functionality is provided by the ROLLUP and CUBE functions of the SQL: 2003 standard. As part of your answer, you should address the output produced by using these functions, noting how they differ.

Take a pet store DB with animal type, amount of that animal, and location. Grouping by animal type and then location, ROLLUP gives the amount of each animal in each store, a subtotal for each animal type across all stores, and an overall total among all stores. CUBE does this too, but adds further rows which give the total of all types of animals in each store. Both SQL commands add more summarised rows to the result, but at different levels of grain.

c) Compare and contrast MOLAP and ROLAP environments. Following a hybrid approach, how have these environments been used to complement each other in the past?

A hybrid (HOLAP) environment has both parts. The RDBMS part stores un-aggregated data while the MOLAP part stores SELECTED pre-computed aggregated data.

Q4: XML: 2011, 2010

a) Briefly discuss the use of "Russian Doll" design in defining XML Schema documents.
Highlight alternative design approaches giving appropriate examples to support your answer. 2010, 2011

The Russian Doll design contains only one single global element. All the other elements are local. You nest element declarations within a single global declaration, which you can use once only. You must define only the root element within the global namespace. Since it contains only one single global element, Russian Doll is the simplest and easiest pattern for instance developers to use. However, if its elements or types are intended for reuse, Russian Doll is not suitable for schema developers.

b) Discuss the term "round-tripping" in terms of document-centric and data-centric XML documents. 2010, 2011

With round-tripping, you can store an XML document in a native XML database and get the "same" document back again. This is important to document-centric applications, for which things like CDATA sections, entity usage, comments, and processing instructions form an integral part of the document. It is also vital to many legal and medical applications, which are required by law to keep exact copies of documents. Round-tripping is less important to data-centric applications, which generally care only about elements, attributes, text, and hierarchical order. All native XML databases can round-trip documents at the level of elements, attributes, PCDATA, and document order. How much more they can round-trip depends on the database.

c) The Oracle database provides a native abstract data type called XMLType for XML data with a choice of structured or unstructured storage. What are their key characteristics? When would you choose one storage type over the other? 2010, 2011

Structured allows for SQL functions to be called on the XML file and for the XML file to be treated as a table. Trailing new lines, whitespace within tags and the data format of non-string data types are lost, but DOM fidelity is maintained.
Unstructured consumes considerable space, but is very flexible when schemas change. It also maintains the original XML byte for byte, which is important in some applications. This can be used when you want to retrieve the whole document, or when you do not want to perform piece-wise updates on the XML document.

Q5: A-Priori: 2008, 2010, 2011

Gaelic Sports Inc is a chain of stores that specialises in sports equipment for Gaelic games. The chain has allocated a limited budget to organise product displays in their stores to maximise turnover.

a) Briefly explain the tasks in the Knowledge Discovery in Databases (KDD) process to meet this requirement. 2008, 2011

- Preparation: Must understand the application domain and the goals of the organisation.
- Target data: Select from the diverse data held by the organisation, excluding irrelevant data.
- Pre-processed data: Data must be cleaned to remove any incorrect or inconsistent data.
- Transformed data: Perform certain transformations on the data.
- Knowledge: Transformed data is searched using one or more data mining techniques to find potential patterns.
- Patterns: Patterns identified must be interpreted into knowledge, ie how can they help the decision-making process? Visualisation tools can help this process.

2008

b) How are the tasks in Knowledge Discovery in Databases (KDD) likely to change if the source of the data is from a data warehouse rather than a database?

You only have the “Knowledge” and “Patterns” steps, because the DW has already taken care of the rest (selecting, cleaning and transforming the data).

2011

b) The A-Priori Algorithm is used to mine Association Rules. State the A-Priori Property in your own words. Explain why this property is important in the A-Priori algorithm. 2010, 2011

meh

c) Using the transaction below to support your answer, briefly explain how the A-Priori Algorithm works: 2010, 2011

i) In generating frequent Itemsets?

Count the occurrences of each candidate itemset and keep those whose count meets the given cutoff (the minimum support), eg 3.

ii) In generating Association Rules?

Iterate through the frequent itemsets, splitting each one into candidate rules (IF some items THEN the remaining items) and using the itemset counts to work out each rule's confidence. Remove the rules that fall below the minimum confidence (pruning).

Q6: XPATH: 2008, 2009, 2010

b) Examine the XML documents in Appendix A. Identify and explain the errors that would be encountered when Book.xml document is checked for well formedness and is validated in a parser. 2008, 2009, 2010

meh

c) Write a XPATH expression that identifies the name of the character in the book that has a friend called Lassie. As part of your answer, discuss the structure of the location path expression and how it operates on the document.

meh

2009

b) XML documents can be categorised as document-centric or data-centric. Outline the key characteristics that differentiate these two types of documents. Briefly discuss a suitable approach to storing each of these document types. Give reasons for your decisions.

Document centric: XML data that you always use in its complete form.

Data centric: XML data where the main interest is in only pieces of the total set of data within a document.

Document centric can be stored as a CLOB since you won’t want the document altered. Data centric can be stored as object relational so it can be mapped to an OO programming language.

c) Contrast the use of DTDs and XML Schema for defining XML document instances.

The DTD provides a basic grammar for defining an XML Document in terms of the metadata that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define what the data can and cannot contain. It provides far more control for the developer over what is legal, and it provides an Object Oriented approach, with all the benefits this entails.

Q7: Oracle: 2010

c) Discuss the fidelity provided with Oracle's structured storage option.

It provides DOM fidelity rather than byte-for-byte document fidelity: the elements, attributes and data come back intact, but things like trailing new lines, whitespace within tags and the original formatting of non-string data types can be lost.

Q8: Schemas and ROLAP: 2010

a) Discuss the different types of schemas used in dimensional modelling and adopted ROLAP architectures.
Why would you use one over the other?

meh

b) Compare and contrast "Create Table As Select" (CTAS) tables with Materialised Views used by Data Warehouse environments for summary management.

CTAS is when you store a summary query, or just query results, in a separate table (requires manual updates). It is not transparent to the user: they must query the summary table directly for results. Materialized Views also store the information in a separate table, but this table is called a view and is managed by the DBMS (it can refresh automatically). It is transparent to the user: the DBMS rewrites their query to use the summary table.

c) Examine the approaches that can be taken with Slowly Changing Dimensions.

meh

Q9: Support and Confidence: 2008, 2009, 2010

In market basket analysis, the task of association seeks to uncover rules for quantifying the relation between two or more attributes.

i) What is meant by the support and the confidence of an association rule? 2009, 2010

Support is the percentage of all transactions in which the items of the rule occur together. Confidence is the percentage of the transactions containing the first item (the IF part) that also contain the second item (the THEN part).

ii) Showing your calculations, calculate the support and confidence for the following rule: IF a customer purchases Plasma TV THEN they will also purchase DVD Player. Out of 2000 customers, 400 purchased Plasma TVs and of the 400 who bought Plasma TVs, 120 bought DVD players. 2009

Support = 120/2000 = 3/50 = 6%. Confidence = 120/400 = 3/10 = 30%.

b) Briefly explain the classification and clustering techniques used in data mining. What is their main difference? 2008, 2010

Clustering: the task of discovering groups and structures in the data that are in some way “similar”, without using known structure.

Classification: the task of generalizing known structure to apply to new data. Example: an email program might attempt to classify an email as legitimate or spam.

Q10: Objects: 2008, 2009

a) You work for a company that has a Relational Database infrastructure.
However, in recent years, existing applications are gradually being re-engineered as object-oriented applications. Objects increasingly need to support multi-media data as well as the traditional information held in the company database. Provide an executive summary of why an Object Database Management System could be more suitable for this scenario than the current infrastructure. 2008, 2009

meh

b) In reference to Object Databases, explain what is meant by transparent persistence. Briefly explain how the client cache supports this concept.

Transparent persistence in object database products refers to the ability to directly manipulate data stored in a database using an object programming language.

c) From your own web-based research, identify one company using an object database today, and explain why they chose an object database. 2008, 2009

meh

Q11: Decisions and DW: 2009

In recent times, organisations have focused on ways to use operational data to support decision making as a means of gaining competitive advantage in the market place. The data warehouse was deemed the solution to providing a single consolidated view of the organisation's data to support this decision-making.

a) Briefly discuss the main components of a typical Data Warehouse Architecture, identifying their key interactions with each other.

meh

b) Explain the role of the metadata repository and how it supports:

i) The decision makers of an organisation.

meh

ii) The main components of the data warehouse architecture.

meh

Q12: SQL XML: 2008

b) The SQL 2003 standard has defined extensions to SQL to enable the publication of XML, commonly referred to as SQL/XML. Briefly discuss these extensions using examples to support your answer.

meh

c) There are a number of approaches to storing an XML document in a relational database. Describe two of these approaches outlining which approach is better and why.

meh

Q13: Objects: 2008

b) Discuss the concept of object identity in the Object Database and how it differs from that in the Relational Database. Describe some possible ways in which an identifier can be generated. In what way is the identifier generated within Objectivity?

meh

Q14: DW Data: 2008

Data warehouse data can be described as a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.

a) Explain the main characteristics of data in data warehouse as defined in the above definition.

meh

b) Discuss the important role metadata plays in data warehouse architectures.

meh

c) Discuss the main reasons for implementing a data mart.

meh

Q15: OLAP: 2008

a) Discuss what OLAP represents.

meh

b) Vendors of OLAP tools argue that a multi-dimensional conceptual view of data can be delivered without multi-dimensional storage. Contrast the architecture of multidimensional OLAP (MOLAP) with relational OLAP (ROLAP).

meh

c) ROLAP can use a star schema to represent the data. Discuss the key tables of this schema identifying how they differ.

meh
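
As a footnote to the OLAP questions, the difference between ROLLUP and CUBE asked about in Q3 b) can be sketched in plain Python. This is not how a database implements these operators, just an illustration of which grouping sets each one produces. The pet-store rows and the names ROWS, aggregate, rollup_sets and cube_sets are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pet-store rows: (animal_type, location, amount)
ROWS = [
    ("cat", "north", 3),
    ("cat", "south", 2),
    ("dog", "north", 4),
    ("dog", "south", 1),
]
COLUMNS = ("animal_type", "location")


def aggregate(rows, grouping_sets):
    """Sum amount per grouping set; a column left out of the set
    appears as None, standing in for the NULL of a SQL total row."""
    totals = defaultdict(int)
    for animal_type, location, amount in rows:
        record = {"animal_type": animal_type, "location": location}
        for kept in grouping_sets:
            key = tuple(record[c] if c in kept else None for c in COLUMNS)
            totals[key] += amount
    return dict(totals)


def rollup_sets(columns):
    # ROLLUP(a, b) groups by (a, b), then (a), then () for the grand total.
    return [columns[:i] for i in range(len(columns), -1, -1)]


def cube_sets(columns):
    # CUBE(a, b) groups by every subset: (a, b), (a), (b), ().
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


rollup = aggregate(ROWS, rollup_sets(COLUMNS))
cube = aggregate(ROWS, cube_sets(COLUMNS))

# The rows CUBE has that ROLLUP lacks are the per-location totals.
extra = sorted(set(cube) - set(rollup), key=str)
print(len(rollup), len(cube), extra)
```

With two grouping columns, ROLLUP yields three grouping sets and CUBE four, so the only rows CUBE adds here are the totals per location across all animal types.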