ADA Question and Answers
Table of Contents
Q1: Data Warehousing vs. OLTP: 2010, 2011
Q2: Dimensional Models, Star, Galaxy etc: 2009, 2011
Q3: Cube, Roll-Up, Slice and Dice etc: 2011
Q4: XML: 2011, 2010
Q5: A-Priori: 2008, 2010, 2011
Q6: XPATH: 2008, 2009, 2010
Q7: Oracle: 2010
Q8: Schemas and ROLAP: 2010
Q9: Support and Confidence: 2008, 2009, 2010
Q10: Objects: 2008, 2009
Q11: Decisions and DW: 2009
Q12: SQL XML: 2008
Q13: Objects: 2008
Q14: DW Data: 2008
Q15: OLAP: 2008
Q1: Data Warehousing vs. OLTP: 2010, 2011
As a Data Warehouse consultant, you have been asked to brief a client company whose experience
has been in transaction-processing systems and which is considering embarking on a data
warehouse project.
2011
a) Briefly discuss how Online Transaction Processing (OLTP) systems differ from Data Warehousing
systems. Justify your answer with examples.
A DW is designed for query and analysis, as opposed to transaction processing. A DW separates
analysis workload from transaction workload. A DW’s main purpose in a business is as an analytical
tool. A DW environment includes an extraction (and transportation), transformation, and loading
(ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools,
and other applications that manage the process of gathering data and delivering it to business users.
                 OLTP                     OLAP
Source           Original/current data    Multiple sources, historical data
Purpose          Basic tasks              Planning, decisions, problem solving
Shows            Snapshot of business     Multidimensional views of the business
INSERT/UPDATE    Short & fast             Long running batch
Queries          Simple                   Complex, with aggregations
Speed            Very fast                Slow
Space            Small                    Large
Design           Normalised               Denormalised, uses schemas
Backup           Daily                    Can just reload OLTP data
OLTP systems                                         Data warehousing systems
Holds current data                                   Holds historical data
Stores detailed data                                 Stores detailed, lightly & highly summarised data
Data is dynamic                                      Data is largely static
Repetitive processing                                Ad hoc, unstructured & heuristic processing
High level of transaction throughput                 Medium to low level of transaction throughput
Predictable pattern of usage                         Unpredictable pattern of usage
Transaction-driven                                   Analysis-driven
Application-oriented                                 Subject-oriented
Supports day-to-day decisions                        Supports strategic decisions
Serves large number of clerical/operational users    Serves relatively low number of managerial users
b) Discuss two options that one could use to manage highly summarised data with a view to
improving query performance in the data warehouse environment. Why would you use one
option over the other? Give examples to support your answer.
meh
c) Briefly discuss the role of the Staging Area in an enterprise data warehouse environment.
A staging area simplifies building summaries or materialized views and general warehouse
management. Staging areas are often useful for performing intensive calculations or data-cleansing
operations that may adversely affect the production query environment.
2010
a) Highlight a common Data Warehouse architecture, briefly discussing its main components.
b) Discuss the typical problems that can occur with Data Warehousing projects.
• Underestimation of resources for data loading
• Required data not captured
• Increased end user demands
• High demand for resources
• High maintenance
• Data ownership
• Long duration project
• Data homogenisation
• Complexity of integration
Q2: Dimensional Models, Star, Galaxy etc: 2009, 2011
a) Dimensional Modelling is a logical design technique used to design star and derivative schemas.
i) Compare and contrast Star and Snowflake schemas, explaining when you would use one over
the other. 2009, 2011
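Star schema: A central fact table joined directly to denormalised dimension tables. Queries need
fewer joins, so it is simpler and usually faster, at the cost of redundant dimension data.
Snowflake schema: The dimension tables are normalised into further tables. It uses less space, but
queries need more joins and run slower. Use a star when query performance matters most, and a
snowflake when the dimension tables are large enough that the space saving matters.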
ii) Briefly explain the term Star Galaxy and why you would design for this. What design decisions
must you make to ensure you achieve this design?
A galaxy schema (fact constellation) is a grouping of star schemas, each with its own fact table,
that share dimension tables. You might design for this if the fact tables record data at different
dimension levels, e.g. sales in one table go by the month and sales in the other go by the year.
b) Data Warehouse environments are considered I/O intensive. Explain what approaches you
could take to improve I/O performance.
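Partitioning divides a table into partitions; with partition pruning, a query doesn’t have to read
the whole table, just the partitions that contain the relevant data.
Bitmap indexes are a lot smaller than B-tree indexes on low-cardinality columns, so less index data
is read.
Materialised views store pre-computed query results (e.g. summaries), so the base tables do not have
to be scanned again.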
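As a rough illustration of partition pruning, the sketch below keeps sales rows in per-month
partitions and reads only the partition a query needs. The data, the month keys and the function
name total_for_month are made up for the example; a real database does this inside the optimiser.

```python
# Hypothetical sales rows, range-partitioned by month (illustrative data only).
partitions = {
    "2011-01": [("hurley", 30.0), ("sliotar", 5.0)],
    "2011-02": [("helmet", 45.0)],
    "2011-03": [("hurley", 30.0), ("helmet", 45.0), ("sliotar", 5.0)],
}


def total_for_month(month):
    """Sum sales for one month, reading only that month's partition."""
    rows_read = 0
    total = 0.0
    # Pruning: the other partitions are never touched.
    for _product, amount in partitions.get(month, []):
        rows_read += 1
        total += amount
    return total, rows_read


total, rows_read = total_for_month("2011-02")
all_rows = sum(len(rows) for rows in partitions.values())
print(f"total={total}, rows read={rows_read} of {all_rows}")
```

Without partitioning, the same query would have to scan every row to find the February sales.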
c) Briefly discuss the definition of the Dimension Object in the Oracle Database. While not
mandatory, why is it good practice to define them for your dimension table?
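A dimension object (CREATE DIMENSION) declares the hierarchical relationships between the columns
of a dimension table, e.g. day rolls up to month and month rolls up to year. It is good practice
because the optimiser can use these relationships for query rewrite, so a materialised view at
one level can answer a query at a higher level.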
2009
Dimensional modelling is a specialised type of ER modelling used in DW design.
a) Discuss in detail the main tables (and how they are related) commonly found in dimensional
data warehouse schemas. Provide examples to support your answer.
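Fact table: holds the numeric measures (e.g. quantity sold, sale amount) plus a foreign key to each
dimension table.
Dimension tables: hold the descriptive attributes used to filter and group the facts (e.g. product,
store, time).
Relationships: each dimension row relates to many fact rows (one-to-many).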
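A minimal sketch of a fact table and two dimension tables, using Python's built-in sqlite3 module.
The table names, column names and rows are invented for the example and are not taken from the
exam paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per product/store.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);
    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Hurley', 'Equipment'), (2, 'Jersey', 'Clothing');
    INSERT INTO dim_store   VALUES (1, 'Dublin', 'East'), (2, 'Galway', 'West');
    INSERT INTO fact_sales  VALUES (1, 1, 2, 60.0), (2, 1, 1, 50.0), (1, 2, 3, 90.0);
""")

# A typical star join: aggregate the facts, grouped by dimension attributes.
rows = conn.execute("""
    SELECT s.region, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY s.region, p.category
    ORDER BY s.region, p.category
""").fetchall()
print(rows)
```

The fact table is joined to each dimension by its key, and the dimension attributes (region,
category) decide how the measures are grouped.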
c) Compare and contrast an Enterprise Data Warehouse with a Virtual Data Warehouse.
meh
Q3: Cube, Roll-Up, Slice and Dice etc: 2011
a) With reference to the multidimensional cube below, explain how the following cube and
visualisation operations function:
i) Slice and dice: Slice fixes a single value on one dimension, e.g. one product, and shows the
remaining dimensions for it. Dice selects chosen values on two or more dimensions, e.g. some
products in some quarters, and shows the resulting smaller cube.
ii) Roll up and drill down: Roll up provides a more summarised view, e.g. all products in the
North, instead of N/E and N/W. Drill down gives a more detailed view, e.g. by month instead of
by quarter, or by adding a dimension such as supplier.
iii) Cube: Gives summarised data for every combination of dimensions, using “NULL” in a column
to mark the subtotal and total rows.
iv) Pivot and Nest: Pivot rotates the cube to show a different orientation of the axes, e.g.
products down the rows instead of quarters. Nest shows more than two dimensions in a table by
placing one dimension inside another on the same axis.
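The operations can be sketched on a tiny cube held in a Python dictionary. The cube contents
(products, regions, quarters and unit counts) are hypothetical and stand in for the exam's cube,
which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, quarter) -> units sold.
cube = {
    ("TV", "North", "Q1"): 10, ("TV", "North", "Q2"): 12,
    ("TV", "South", "Q1"): 7,  ("TV", "South", "Q2"): 9,
    ("DVD", "North", "Q1"): 4, ("DVD", "North", "Q2"): 6,
    ("DVD", "South", "Q1"): 3, ("DVD", "South", "Q2"): 5,
}

# Slice: fix one dimension to a single value (product = TV); the result has
# one dimension fewer.
slice_tv = {(r, q): v for (p, r, q), v in cube.items() if p == "TV"}

# Dice: restrict two dimensions to chosen values; the result is a smaller
# cube with the same dimensions.
dice = {k: v for k, v in cube.items() if k[1] == "North" and k[2] == "Q2"}

# Roll-up: summarise to a coarser view, here by aggregating away the quarter.
rollup = defaultdict(int)
for (p, r, _q), v in cube.items():
    rollup[(p, r)] += v

# Pivot: swap the roles of two axes without changing any values.
pivot = {(r, p): v for (p, r), v in rollup.items()}

print(slice_tv)
print(dice)
print(dict(rollup))
print(pivot)
```

Drill down is the reverse of the roll-up step: going from the (product, region) totals back to
the per-quarter cells.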
b) Briefly discuss by way of example how OLAP functionality is provided by the ROLLUP and CUBE
functions of the SQL: 2003 standard. As part of your answer, you should address the output
produced by using these functions, noting how they differ.
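Take a pet store DB with animal type, location, and amount of that animal. GROUP BY ROLLUP(type,
location) returns the amount of each animal type per store, a subtotal for each animal type across
all stores, and the overall total among all stores. GROUP BY CUBE(type, location) returns all of
this, plus further rows giving the total of all types of animals in each store.
Both add summary rows, marked with NULL in the summarised columns, to the normal GROUP BY output,
but at a different grain: ROLLUP summarises along one hierarchy, following the column order, while
CUBE summarises every combination of the columns.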
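The difference in output can be imitated in plain Python by listing the grouping sets each
function produces. This is only a sketch of the semantics, not how a database implements them; the
pet store rows and the helper name aggregate are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical pet store rows: (animal_type, location, amount).
rows = [
    ("cat", "Dublin", 3), ("cat", "Cork", 2),
    ("dog", "Dublin", 5), ("dog", "Cork", 1),
]
COLUMNS = ("animal_type", "location")


def aggregate(grouping_sets):
    """Sum amount for each grouping set; None plays the role of SQL NULL."""
    result = {}
    for keep in grouping_sets:
        for animal, location, amount in rows:
            values = {"animal_type": animal, "location": location}
            key = tuple(values[c] if c in keep else None for c in COLUMNS)
            result[key] = result.get(key, 0) + amount
    return result


# ROLLUP(a, b) groups by (a, b), (a) and (): prefixes of the column list only.
rollup_sets = [COLUMNS[:n] for n in range(len(COLUMNS), -1, -1)]
# CUBE(a, b) groups by every subset: (a, b), (a), (b) and ().
cube_sets = [s for n in range(len(COLUMNS), -1, -1) for s in combinations(COLUMNS, n)]

rollup = aggregate(rollup_sets)
cube = aggregate(cube_sets)
print(len(rollup), len(cube))
print(sorted(set(cube) - set(rollup), key=str))
```

The keys that appear only in the CUBE result are the per-location totals, i.e. the rows with NULL
in the animal type column.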
c) Compare and contrast MOLAP and ROLAP environments. Following a hybrid approach, how
have these environments been used to complement each other in the past?
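MOLAP stores the data, including pre-computed aggregates, in multidimensional arrays: queries are
fast, but it scales less well. ROLAP keeps the data in relational tables and builds the
multidimensional view with SQL: it scales to large volumes, but queries are slower.
A hybrid (HOLAP) has both parts. The RDBMS part stores the un-aggregated data while the MOLAP
part stores selected pre-computed aggregated data.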
Q4: XML: 2011, 2010
a) Briefly discuss the use of "Russian Doll" design in defining XML Schema documents. Highlight
alternative design approaches giving appropriate examples to support your answer. 2010, 2011
The Russian Doll design contains only one single global element. All the other elements are local. You
nest element declarations within a single global declaration, which you can use once only. You must
define only the root element within the global namespace.
Since it contains only one single global element, Russian Doll is the simplest and easiest pattern to
use by instance developers. However, if its elements or types are intended for reuse, Russian Doll is
not suitable for schema developers.
b) Discuss the term "round-tripping" in terms of document-centric and data-centric XML
documents. 2010, 2011
With round-tripping, you can store an XML document in a native XML database and get the "same"
document back again. This is important to document-centric applications, for which things like
CDATA sections, entity usage, comments, and processing instructions form an integral part of the
document. It is also vital to many legal and medical applications, which are required by law to keep
exact copies of documents.
Round-tripping is less important to data-centric applications, which generally care only about
elements, attributes, text, and hierarchical order.
All native XML databases can round-trip documents at the level of elements, attributes, PCDATA,
and document order. How much more they can round-trip depends on the database.
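A small sketch of what is lost when a document does not round-trip exactly, using Python's
standard xml.etree.ElementTree parser as a stand-in for a data-centric store. The sample document
is made up for the example.

```python
import xml.etree.ElementTree as ET

# A small document with a comment and a CDATA section (illustrative content).
original = "<note><!-- reviewed --><body><![CDATA[a < b]]></body></note>"

# Parse and serialise again, as a data-centric store might.
root = ET.fromstring(original)
round_tripped = ET.tostring(root, encoding="unicode")

print(round_tripped)
print(round_tripped == original)
```

The elements and text survive, but the comment and the CDATA section do not come back as written,
which is acceptable for data-centric use and not for document-centric use.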
c) The Oracle database provides a native abstract data type called XMLType for XML data with a
choice of structured or unstructured storage. What are their key characteristics? When would you
choose one storage type over the other? 2010, 2011
Structured storage allows SQL functions to be called on the XML document and the document to be
treated as a table. It maintains DOM fidelity, but trailing new lines, whitespace within tags and
the original formatting of non-string data types are lost. Choose it when the document is queried
or updated piece by piece.
Unstructured storage consumes considerable space, but is very flexible when schemas change. It also
maintains the original XML byte for byte, which is important in some applications. Choose it when
you want to retrieve the whole document, or when you do not want to perform piece-wise updates
on the XML document.
Q5: A-Priori: 2008, 2010, 2011
Gaelic Sports Inc is a chain of stores that specialises in sports equipment for Gaelic games. The chain
has allocated a limited budget to organise product displays in their stores to maximise turnover.
a) Briefly explain the tasks in the Knowledge Discovery in Databases (KDD) process to meet this
requirement. 2008, 2011
Preparation: Understand the application domain and the goals of the organisation.
Target data: Select the relevant data from the diverse data held by the organisation, excluding
irrelevant data.
Pre-processed data: Clean the data to remove any incorrect or inconsistent data.
Transformed data: Transform the data into a form suitable for mining.
Patterns: Search the transformed data using one or more data mining techniques to find
potential patterns.
Knowledge: Interpret the patterns identified into knowledge that can help the decision
making process. Visualisation tools can help this process.
2008
b) How are the tasks in Knowledge Discovery in Databases (KDD) likely to change if the source of
the data is from a data warehouse rather than a database?
You mainly have the “Knowledge” and “Patterns” steps left, because the data in a DW has already
been selected, cleaned, integrated and transformed during ETL.
2011
b) The A-Priori Algorithm is used to mine Association Rules. State the A-Priori Property in your
own words. Explain why this property is important in the A-Priori algorithm. 2010, 2011
meh
c) Using the transaction below to support your answer, briefly explain how the A-Priori Algorithm
works: 2010, 2011
i) In generating frequent Itemsets?
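Count the occurrences of each item and keep the items that meet the given cutoff (minimum
support), e.g. 3. Combine the frequent items into candidate pairs, count them and again keep those
that meet the cutoff. Repeat for larger itemsets until no new frequent itemsets are found, pruning
any candidate that has an infrequent subset.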
ii) In generating Association Rules?
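Iterate through all frequent itemsets and split each one into every possible IF part and THEN part.
Calculate the confidence of each rule and remove the rules below the minimum confidence.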
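Both stages can be sketched in a few lines of Python. This is a simplified illustration rather
than an optimised implementation, and the transactions and thresholds (MIN_SUPPORT,
MIN_CONFIDENCE) are assumptions, since the exam's transaction table is not reproduced here.

```python
from itertools import combinations

# Hypothetical transactions (not the exam's data).
transactions = [
    {"hurley", "sliotar", "helmet"},
    {"hurley", "sliotar"},
    {"hurley", "helmet"},
    {"sliotar", "helmet"},
    {"hurley", "sliotar", "helmet"},
]
MIN_SUPPORT = 3        # minimum number of transactions containing the itemset
MIN_CONFIDENCE = 0.7   # minimum confidence for a rule to be kept


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets():
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUPPORT]
    frequent = {}
    k = 1
    while level:
        frequent.update({s: count(s) for s in level})
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if count(c) >= MIN_SUPPORT]
    return frequent


def association_rules(frequent):
    """Split each frequent itemset into IF -> THEN parts and keep confident rules."""
    rules = []
    for itemset, support in frequent.items():
        for n in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, n)):
                confidence = support / frequent[antecedent]
                if confidence >= MIN_CONFIDENCE:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules


frequent = frequent_itemsets()
rules = association_rules(frequent)
for itemset, support in frequent.items():
    print(sorted(itemset), support)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

The prune step is where the cost saving comes from: a candidate is discarded without counting it
if any of its smaller subsets was already found to be infrequent.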
Q6: XPATH: 2008, 2009, 2010
b) Examine the XML documents in Appendix A. Identify and explain the errors that would be
encountered when Book.xml document is checked for well formedness and is validated in a
parser. 2008, 2009, 2010
meh
c) Write a XPATH expression that identifies the name of the character in the book that has a friend
called Lassie. As part of your answer, discuss the structure of the location path expression and
how it operates on the document.
meh
2009
b) XML documents can be categorised as document-centric or data-centric. Outline the key
characteristics that differentiate these two types of documents. Briefly discuss a suitable approach
to storing each of these document types. Give reasons for your decisions.
Document-centric: XML that you always use in its complete form. It is usually written for people to
read (e.g. books, emails), with irregular structure and mixed content where order matters.
Data-centric: XML where the main interest is in pieces of the total set of data within a document.
It is usually produced for machines and has a regular structure (e.g. orders, invoices).
Document-centric XML can be stored as a CLOB since you won’t want the document altered; data-centric
XML can be stored object-relationally so that individual pieces can be queried and updated.
c) Contrast the use of DTDs and XML Schema for defining XML document instances.
A DTD provides a basic grammar for defining an XML document in terms of the elements and attributes
that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define
what the data can and cannot contain through data types. It gives the developer far more control
over what is legal, is itself written in XML, supports namespaces, and takes an object-oriented
approach (types can be extended or restricted), with all the benefits this entails.
Q7: Oracle: 2010
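c) Discuss the fidelity provided with Oracle's structured storage option.
It provides DOM fidelity rather than document fidelity: elements, attributes, text and their order
are preserved, but the document is not kept byte for byte, so insignificant whitespace and the
original formatting of non-string data can be lost.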
Q8: Schemas and ROLAP: 2010
a) Discuss the different types of schemas used in dimensional modelling and adopted ROLAP
architectures. Why would you use one over the other?
meh
b) Compare and contrast "Create Table As Select" (CTAS) tables with Materialised Views used by
Data Warehouse environments for summary management.
CTAS is when you store the results of a summary query (or any query) in a separate table. It is not
refreshed when the base data changes (requires manual updates) and is not transparent to the user:
they must query the summary table directly for results.
Materialised Views also store query results in a separate table, but it is managed by the DBMS,
which can refresh it automatically. It is transparent to the user: the DBMS rewrites queries on the
base tables to use the materialised view.
c) Examine the approaches that can be taken with Slow Changing Dimensions.
meh
Q9: Support and Confidence: 2008, 2009, 2010
In market basket analysis, the task of association seeks to uncover rules for quantifying the relation
between two or more attributes.
i) What is meant by the support and the confidence of an association rule? 2009, 2010
Support is the percentage of all transactions in which the items of the rule occur together.
Confidence is the percentage of the transactions containing the first item (the IF part) that also
contain the second item (the THEN part).
ii) Showing your calculations, calculate the support and confidence for the following rule:
IF a customer purchases Plasma TV THEN they will also purchase DVD Player.
Out of 2000 customers, 400 purchased Plasma TVs and of the 400 who bought Plasma TVs, 120
bought DVD players. 2009
Support = 120/2000 = 3/50 = 6%. Confidence = 120/400 = 3/10 = 30%.
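The same calculation as a short Python check, using the figures from the question. The variable
names are chosen for the example.

```python
from fractions import Fraction

total_customers = 2000    # all transactions
bought_tv = 400           # transactions containing Plasma TV
bought_tv_and_dvd = 120   # transactions containing both Plasma TV and DVD Player

# support(A -> B) = P(A and B); confidence(A -> B) = P(B | A)
support = Fraction(bought_tv_and_dvd, total_customers)
confidence = Fraction(bought_tv_and_dvd, bought_tv)

print(f"support = {support} = {float(support):.0%}")
print(f"confidence = {confidence} = {float(confidence):.0%}")
```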
b) Briefly explain the classification and clustering techniques used in the data mining. What is their
main difference? 2008, 2010
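Clustering – is the task of discovering groups and structures in the data that are in some way
“similar”.
Classification – is the task of generalizing known structure to apply to new data. Example: Email
program might attempt to classify an email as legitimate or spam.
Main difference: classification is supervised, working from classes that are known in advance,
while clustering is unsupervised and finds the groups itself.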
Q10: Objects: 2008, 2009
a) You work for a company that has a Relational Database infrastructure. However, in recent
years, existing applications are gradually being re-engineered as object-oriented applications.
Objects increasingly need to support multi-media data as well as the traditional information held in
the company database.
Provide an executive summary of why an Object Database Management System could be more
suitable for this scenario than the current infrastructure. 2008, 2009
meh
b) In reference to Object Databases, explain what is meant by transparent persistence. Briefly
explain how the client cache supports this concept.
Transparent persistence in object database products refers to the ability to directly manipulate data
stored in a database using an object programming language.
c) From your own web-based research, identify one company using an object database today, and
explain why they chose an object database. 2008, 2009
meh
Q11: Decisions and DW: 2009
In recent times, organisations have focused on ways to use operational data to support decision
making as a means of gaining competitive advantage in the market place. The data warehouse was
deemed the solution to providing a single consolidated view of the organisations data to support this
decision-making.
a) Briefly discuss the main components of a typical Data Warehouse Architecture, identifying their
key interactions with each other.
meh
b) Explain the role of the metadata repository and how it supports:
i)The decision makers of an organisation.
meh
ii) The main components of the data warehouse architecture.
meh
Q12: SQL XML: 2008
b) The SQL 2003 standard has defined extensions to SQL to enable the publication of XML,
commonly referred to as SQL/XML. Briefly discuss these extensions using examples to support
your answer.
meh
c) There are a number of approaches to storing an XML document in a relational database.
Describe two of these approaches outlining which approach is better and why.
meh
Q13: Objects: 2008
b) Discuss the concept of object identity in the Object Database and how it differs from that
in the Relational Database. Describe some possible ways in which an identifier can be generated.
In what way is the identifier generated within Objectivity?
meh
Q14: DW Data: 2008
Data warehouse data can be described as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
a) Explain the main characteristics of data in data warehouse as defined in the above definition.
meh
b) Discuss the important role metadata plays in data warehouse architectures.
meh
c) Discuss the main reasons for implementing a data mart.
meh
Q15: OLAP: 2008
a) Discuss what OLAP represents.
meh
b) Vendors of OLAP tools argue that a multi-dimensional conceptual view of data can be delivered
without multi-dimensional storage. Contrast the architecture of multidimensional OLAP (MOLAP)
with relational OLAP (ROLAP).
meh
c) ROLAP can use a star schema to represent the data. Discuss the key tables of this schema
identifying how they differ.
meh