STAT503X - Assignment 4 - Databases

Due: any time before 5:00pm on May 7, 2003. Each person must hand in their own
assignment, rather than one per working group. No extensions beyond the due
date will be granted.
This assignment lets you practice your SQL skills by investigating a (well-)known data set. The results of
all tasks/questions must be supported by the corresponding SQL code. You can choose to work on either the
Tao Buoy data or the Biotin data (I or II below). For both data sets, the first step is to get a database
client if you haven't done so yet. (The notes have details on clients; DbVisualizer looks to be a good choice.)
If you have trouble installing it, please do not hesitate to contact us.
Details of database: Address: 129.186.194.32 User: stat503 Password: cookie
I Analyzing the Tao (Tao) data in a database with SQL commands Tasks:
(a) Describe the locations of the buoys (actual location and deviation from where they are apparently
expected to be).
(b) Give summaries for the 5 measured variables.
(c) Calculate correlations between the variables for each individual year and compare the results for
"normal" years (93-94) and El Niño years (97-98). Describe the differences.
(d) Look at the trend of sea surface temperature for the south-east buoys (define this region yourself)
over all years. How do they differ from the rest of the buoys?
(e) Describe the missing values in the data set with respect to time and space. What kind of imputation
would you suggest? Why?
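Many SQL dialects have no built-in correlation function, so the per-year correlations in (c) may have to be assembled from plain aggregates. The sketch below shows one way to do that with the standard Pearson formula. It runs against an in-memory SQLite database rather than the course server, and the table `tao_demo`, its columns `year`, `sst` and `air_temp`, and the demo rows are all hypothetical stand-ins for the real Tao schema. The `sqrt` function is registered from Python because not every SQLite build ships it; other database systems usually provide their own.

```python
import math
import sqlite3

# In-memory stand-in for the course database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.create_function("sqrt", 1, math.sqrt)  # not every SQLite build has sqrt()
conn.execute("CREATE TABLE tao_demo (year INTEGER, sst REAL, air_temp REAL)")
conn.executemany(
    "INSERT INTO tao_demo VALUES (?, ?, ?)",
    [
        (1993, 1.0, 2.0), (1993, 2.0, 4.0), (1993, 3.0, 6.0),
        (1993, 4.0, None),  # a missing value, excluded pairwise below
        (1997, 1.0, 3.0), (1997, 2.0, 2.0), (1997, 3.0, 1.0),
    ],
)

# Pearson r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)),
# computed per year; rows with a NULL in either variable are dropped first.
CORR_BY_YEAR = """
SELECT year,
       (COUNT(*) * SUM(sst * air_temp) - SUM(sst) * SUM(air_temp))
       / sqrt((COUNT(*) * SUM(sst * sst) - SUM(sst) * SUM(sst))
            * (COUNT(*) * SUM(air_temp * air_temp)
               - SUM(air_temp) * SUM(air_temp))) AS r
FROM tao_demo
WHERE sst IS NOT NULL AND air_temp IS NOT NULL
GROUP BY year
ORDER BY year
"""

corr_by_year = dict(conn.execute(CORR_BY_YEAR).fetchall())
for year, r in corr_by_year.items():
    print(year, round(r, 3))
```

The `WHERE` clause matters: without it, each `SUM` would silently skip its own NULLs while `COUNT(*)` would still count every row, and the pieces of the formula would no longer refer to the same observations.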
II Analyzing the Biotin (Biorma, Biotibs) data in a database with SQL commands Tasks: (Note that
this data has not been published yet, so please don't distribute it beyond this class.)
(a) Calculate the differences between the replicates for each treatment, and summarize these.
(b) List the genes with large differences in the replicates from each treatment and compare these lists.
Do any genes have large differences between replicates on all treatments?
(c) Compute the average of the replicates for each gene. Return the list of genes that have large
differences on Bio1 relative to the other treatments.
(d) Repeat (c) for the rma data (median polish). Intersect this list with the list from tibs (log average
ratio). Do they detect the same genes as interesting?
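Comparing two gene lists, as in (b) and (d), maps naturally onto SQL's `INTERSECT` operator. The following is a minimal sketch of that idea only, again on an in-memory SQLite database: the tables `biotibs_demo` and `biorma_demo`, the columns `gene`, `rep1` and `rep2`, the cutoff of 1.0 and all values are made up for illustration and do not reflect the real Biotin tables. The replicate-difference condition in the `WHERE` clauses is just a placeholder for whatever criterion you settle on for "large".

```python
import sqlite3

# In-memory stand-ins for the two Biotin tables (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
for table in ("biotibs_demo", "biorma_demo"):
    conn.execute(f"CREATE TABLE {table} (gene TEXT, rep1 REAL, rep2 REAL)")
conn.executemany(
    "INSERT INTO biotibs_demo VALUES (?, ?, ?)",
    [("g1", 5.0, 7.5), ("g2", 3.0, 3.1), ("g3", 8.0, 6.0)],
)
conn.executemany(
    "INSERT INTO biorma_demo VALUES (?, ?, ?)",
    [("g1", 5.2, 7.0), ("g2", 3.0, 4.5), ("g3", 8.0, 7.9)],
)

THRESHOLD = 1.0  # arbitrary cutoff for a "large" difference (assumption)

# Each SELECT flags genes by the same criterion in one table;
# INTERSECT keeps only the genes flagged in both.
SHARED_GENES = """
SELECT gene FROM biotibs_demo WHERE ABS(rep1 - rep2) > :cut
INTERSECT
SELECT gene FROM biorma_demo WHERE ABS(rep1 - rep2) > :cut
ORDER BY gene
"""

shared = [row[0] for row in conn.execute(SHARED_GENES, {"cut": THRESHOLD})]
print(shared)
```

Swapping `INTERSECT` for `EXCEPT` would instead list the genes flagged in the first table but not in the second, which is useful when describing where the two methods disagree.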
For both data sets, discuss the differences and restrictions you faced when answering the
questions with SQL instead of R, GGobi or Mondrian.