CS157B-4 Mid Lecture Notes - Department of Computer Science

CS157B Lecture Notes
From January 25 to April 03
Hoang, Luong
CS157B
Section 04
Dr. Lin
Weeks of January 25 – February 12: no lecture notes were assigned until
the date of February 12.
Week of Feb 12 - Feb 21, 2006
WEEKLY LECTURE NOTES: The following, together with the attachment, is every command I ran in SQL, following all the statements from sql_review.doc, as Professor Lin told us to do for this week's required LECTURE NOTES.
-------------------------------------------------------------------------------------------------
SQL*Plus: Release 8.1.7.0.0 - Production on Sun Feb 19 13:05:01 2006
(c) Copyright 2000 Oracle Corporation. All rights reserved.
Connected to:
Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production
With the Partitioning option
JServer Release 8.1.7.0.0 - Production
SQL> SELECT *
2 FROM SUPPLIERS
3 ;
S# SNAME          STATUS CITY
-- ---------- ---------- ----------
S1 SMITH              20 LONDON
S2 JONES              10 PARIS
S3 BLAKE              30 PARIS
S4 CLARK              20 LONDON
S5 ADAMS              30 ATHENS
SQL> SELECT *
2 FROM PARTs
3 ;
P# PNAME    COLOR      WEIGHT CITY
-- -------- ------ ---------- ----------
P1 NUT      RED            12 LONDON
P2 BOLT     GREEN          17 PARIS
P3 SCREW    BLUE           17 ROME
P4 SCREW    RED            14 LONDON
P5 CAM      BLUE           12 PARIS
P6 COG      RED            19 LONDON
6 rows selected.
SQL> SELECT P# FROM SHIPMENTS;
P#
--
P1
P2
P3
P4
P5
P6
P1
P2
P2
P2
P4
P5
12 rows selected.
SQL> SELECT DISTINCT P# FROM SHIPMENTS;
P#
--
P1
P2
P3
P4
P5
P6
6 rows selected.
SQL> SELECT COUNT(*)
2 FROM SUPPLIERS
3 ;
  COUNT(*)
----------
         5
SQL> SELECT COUNT(Status)
2 From Suppliers
3 ;
COUNT(STATUS)
-------------
            5
SQL> SELECT COUNT(Distinct Status)
2 FROM SUPPLIERS;
COUNT(DISTINCTSTATUS)
---------------------
                    3
SQL> SELECT COUNT(*) FROM SHIPMENTS WHERE P# = 'P2';
  COUNT(*)
----------
         4
SQL> SELECT P#, SUM(QTY)
2 FROM SHIPMENTS
3 GROUP BY P#;
P#   SUM(QTY)
-- ----------
P1        600
P2       1000
P3        400
P4        500
P5        500
P6        100
6 rows selected.
SQL> Select employee.ssn, employee.dno
2 from employee, department
3 where employee.prikey=department.mgrprikey
4 ;
SSN               DNO
---------- ----------
333445555           5
987654321           4
888665555           4
SQL> Select E.dno, E.prikey, E.salary
2 from employee E, department
3 where E.prikey=department.mgrprikey;
       DNO PRIKEY           SALARY
---------- ------------ ----------
         5 2                 40000
         4 4                 43000
         4 8                 25000
SQL> Select E.dno,count(E.salary)
2 from employee E
3 group by E.dno
4 having count(E.salary) >1;
       DNO COUNT(E.SALARY)
---------- ---------------
         4               4
         5               4
SQL> select E.dno, E.salary , Count(*)
2 from employee E
3 group by E.dno, E.salary
4 having count(*) >1 ;
       DNO     SALARY   COUNT(*)
---------- ---------- ----------
         4      25000          3
SQL> select E.salary, count(E.dno)
2 from employee E
3 group by E.salary
4 having count(E.dno) >1;
    SALARY COUNT(E.DNO)
---------- ------------
     25000            4
SQL> select count(E.dno), E.prikey
2 from employee E
3 group by E.prikey
4 having count(E.dno) >1;
no rows selected
SQL> Select employee.dno,
2 avg(employee.salary) as ave
3 from employee, department
4 where employee.prikey=department.mgrprikey
5 group by employee.dno
6 having avg (employee.salary ) >10;
       DNO        AVE
---------- ----------
         4      34000
         5      40000
SQL> select E.dno, avg(E.salary) as ave
2 from employee E, department
3 where E.prikey=department.mgrprikey
4 group by E.dno
5 having avg(E.salary) >10;
       DNO        AVE
---------- ----------
         4      34000
         5      40000
SQL> Select F.fname, F.minit, F.lname,
2 F.salary as mgr_salary,
3 F.dno, G. dept_avg
4 from Department D,
5 Employee F,
6 (select E.dno as edno,
7 avg(E.salary) as dept_avg
8 from employee E
9 group by E.dno) G
10 where D.dnumber= G.edno
11 and F.prikey=D.mgrprikey
12 and F.salary > dept_avg;
FNAME      M LNAME      MGR_SALARY        DNO   DEPT_AVG
---------- - ---------- ---------- ---------- ----------
FRANKLIN   T WONG            40000          5      33250
JENNIFER   S WALLACE         43000          4      29500
SQL> SELECT S.*, P.*
2 FROM SUPPLIERS S, PARTS P
3 WHERE S.CITY='London' and P.CITY='Paris';
no rows selected
SQL> SELECT SUPPLIERS.S#, SUPPLIERS.SNAME, SUPPLIERS.STATUS, SUPPLIERS.CITY, PARTS.P#, PARTS.PNAME, PARTS.COLOR, PARTS.WEIGHT
2 FROM SUPPLIERS, PARTS
3 WHERE SUPPLIERS.CITY = PARTS.CITY
4 ;
S# SNAME          STATUS CITY       P# PNAME    COLOR      WEIGHT
-- ---------- ---------- ---------- -- -------- ------ ----------
S1 SMITH              20 LONDON     P1 NUT      RED            12
S4 CLARK              20 LONDON     P1 NUT      RED            12
S1 SMITH              20 LONDON     P4 SCREW    RED            14
S4 CLARK              20 LONDON     P4 SCREW    RED            14
S1 SMITH              20 LONDON     P6 COG      RED            19
S4 CLARK              20 LONDON     P6 COG      RED            19
S2 JONES              10 PARIS      P2 BOLT     GREEN          17
S3 BLAKE              30 PARIS      P2 BOLT     GREEN          17
S2 JONES              10 PARIS      P5 CAM      BLUE           12
S3 BLAKE              30 PARIS      P5 CAM      BLUE           12
10 rows selected.
SQL> SELECT SUPPLIERS.*,PARTS.*
2 FROM SUPPLIERS, PARTS
3 WHERE SUPPLIERS.City > PARTS.City
4 ;
S# SNAME          STATUS CITY       P# PNAME    COLOR      WEIGHT CITY
-- ---------- ---------- ---------- -- -------- ------ ---------- ----------
S2 JONES              10 PARIS      P1 NUT      RED            12 LONDON
S3 BLAKE              30 PARIS      P1 NUT      RED            12 LONDON
S2 JONES              10 PARIS      P4 SCREW    RED            14 LONDON
S3 BLAKE              30 PARIS      P4 SCREW    RED            14 LONDON
S2 JONES              10 PARIS      P6 COG      RED            19 LONDON
S3 BLAKE              30 PARIS      P6 COG      RED            19 LONDON
6 rows selected.
SQL> SELECT P#, MIN(WEIGHT)
2 FROM PARTS
3 WHERE COLOR = 'RED'
4 ;
SELECT P#, MIN(WEIGHT)
*
ERROR at line 1:
ORA-00937: not a single-group group function
SQL> Select p.p# , p.pname, sum(sp.qty) as sum
2 from Shipments sp, Parts p
3 where p.p# = sp.p#
4 and p.weight > 0
5 group by sp.p#, p.p#, p.pname
6 having sum( sp.qty) >= 100
7 ;
P# PNAME           SUM
-- -------- ----------
P1 NUT             600
P2 BOLT           1000
P3 SCREW           400
P4 SCREW           500
P5 CAM             500
P6 COG             100
6 rows selected.
SQL> Select p.p# , p.pname, h.sum
2 from parts p, (Select sp.p#, sum(sp.qty) as sum
3 from shipments sp
4 group by sp.p#
5 having sum(sp.qty) >= 100) h
6 where p.p# = h.p#
7 and p.weight > 0;
P# PNAME           SUM
-- -------- ----------
P1 NUT             600
P2 BOLT           1000
P3 SCREW           400
P4 SCREW           500
P5 CAM             500
P6 COG             100
6 rows selected.
SQL> Select k.p# , k.pname, h.sum
2 from
3 (Select sp.p#, sum(sp.qty) as sum
4 from shipments sp
5 group by sp.p#
6 having sum(sp.qty) >= 100) h
7 ,
8 (Select p.p# , p.pname
9 From Parts p
10 where p.weight > 0)k
11 where k.p# = h.p#;
P# PNAME           SUM
-- -------- ----------
P1 NUT             600
P2 BOLT           1000
P3 SCREW           400
P4 SCREW           500
P5 CAM             500
P6 COG             100
6 rows selected.
SQL> SELECT PARTS.*
2 FROM PARTS
3 WHERE PARTS.PNAME LIKE 'C%';
P# PNAME    COLOR      WEIGHT CITY
-- -------- ------ ---------- ----------
P5 CAM      BLUE           12 PARIS
P6 COG      RED            19 LONDON
SQL> SELECT S#
2 FROM SUPPLIERS
3 WHERE STATUS > 25
4 ;
S#
--
S3
S5
SQL> SELECT S#
2 FROM SUPPLIERS
3 WHERE STATUS is NULL;
no rows selected
SQL> SELECT S#, P#
2 FROM SHIPMENTs
3 WHERE QTY IS NOT NULL;
S# P#
-- --
S1 P1
S1 P2
S1 P3
S1 P4
S1 P5
S1 P6
S2 P1
S2 P2
S3 P2
S4 P2
S4 P4
S4 P5
12 rows selected.
SQL> SELECT Sname
2 FROM SUPPLIERS
3 WHERE S# not in (SELECT S#
4 FROM SHIPMENTS
5 WHERE P# = 'P2');
SNAME
----------
ADAMS
SQL> SELECT S#
2 FROM SUPPLIERs
3 WHERE CITY =
4 (SELECT CITY
5 FROM SUPPLIERs
6 WHERE S#='S1');
S#
--
S1
S4
SQL> SELECT SNAME
2 FROM SUPPLIERS
3 WHERE NOT EXISTS
4 (SELECT *
5 FROM SHIPMENTS
6 WHERE S# = Suppliers.S#
7 AND P# = 'P2');
SNAME
----------
ADAMS
SQL> SELECT SNAME
2 FROM SUPPLIERS
3 WHERE NOT EXISTS
4 (SELECT *
5 FROM SHIPMENTS
6 WHERE S#=S#
7 AND P#='P2');
no rows selected
SQL> SELECT P#
2 FROM PARTS
3 WHERE WEIGHT > 16
4 UNION
5 SELECT P#
6 FROM SHIPMENTS
7 WHERE S# = 'S2';
P#
--
P1
P2
P3
P6
SQL> UPDATE Suppliers
2 SET Status=Null
3 WHERE S# = 'S5';
1 row updated.
SQL> UPDATE SHIPMENTS
2 SET QTY = 0
3 WHERE 'London' =
4 (SELECT CITY
5 FROM SUPPLIERS
6 WHERE SHIPMENTS.S#= SUPPLIERS.S#);
0 rows updated.
SQL> DELETE
2 FROM SHIPMENTs
3 WHERE P# = 'P3';
1 row deleted.
SQL> DELETE
2 FROM SHIPMENTs
3 WHERE 'London'=
4 (SELECT CITY
5 FROM SUPPLIERs
6 WHERE S.S# =
7 SP.S#);
SP.S#)
*
ERROR at line 7:
ORA-00904: invalid column name
SQL> INSERT
2 INTO SUPPLIERs(S#, SNAME, STATUS)
3 VALUES ('S6', 'James', 35);
1 row created.
SQL> INSERT
2 INTO SHIPMENTs(S#,P#,QTY)
3 VALUES ('S2', 'P2', 1000);
1 row created.
SQL> CREATE TABLE TEMP
2 (SNUM CHAR(5),
3 PNUM CHAR(5),
4 QTY NUMBER(38));
Table created.
SQL> INSERT
2 INTO TEMP (SNUM, PNUM, QTY)
3 SELECT *
4 FROM SHIPMENTS
5 Where S#='S2';
3 rows created.
SQL> exit
Week of Feb 20 - Feb 27, 2006
February 20, 2006
The following code creates a temporary table that holds the supplier number, the part number, and the shipped quantity:
create table temp
( snum char(5),
pnum char(5),
qty number(38));
The following statement shows the format used to insert tuples into the temporary table; the rows come from a query that selects all shipments whose snum is S2:
insert into temp (snum,pnum,qty)
select * from shipments where snum ='S2';
QUERY OPTIMIZATION
-------------------------------------------------------------------------------------------------
The following code gets the part names from the combination of the PARTS and SHIPMENTS tables in which the weight is greater than 1000. Note that the inner query has no join condition, so it forms the Cartesian product of the two tables rather than their natural join:
SELECT G.PNAME
FROM (SELECT p.P#,p.PNAME,p.COLOR,p.WEIGHT,p.CITY,sp.S#,sp.qty
FROM PARTS P, SHIPMENTS SP) g
WHERE g.WEIGHT > 1000
This is the code we went over in class, going step by step to convert an SQL statement into a relational algebra expression:
Select p.pname , sum(sp.qty) as sum
from parts p, shipments sp
where p.pnum = sp.pnum
and p.weight > 1000
group by p.pname, p.pnum
having sum(sp.qty ) >= 1000
1.) Take the Cartesian product of the two tables:
(P × SP)
2.) Keep only the rows where the part numbers match and the weight is over 1000 (together with step 1 this gives the join):
σ P.PNUM = SP.PNUM and P.WEIGHT > 1000 (P × SP)
3.) Group by part name and part number, summing the quantity, and keep the groups whose sum is at least 1000:
σ SUM >= 1000
(γ P.PNAME, P.PNUM, SUM(SP.QTY) → SUM
(σ P.PNUM = SP.PNUM and P.WEIGHT > 1000 (P × SP)))
4.) Finally, project the part name and the sum to match the original SQL statement:
π P.PNAME, SUM
(σ SUM >= 1000
(γ P.PNAME, P.PNUM, SUM(SP.QTY) → SUM
(σ P.PNUM = SP.PNUM and P.WEIGHT > 1000 (P × SP))))
(The above code is from queryoptimization.doc on Dr. Lin's website)
The Advert-IT database development group has outlined 17 steps for optimizing tables and queries:
1. Show the minimum number of fields within the query.
2. Index all the join fields, fields in expressions, sorted fields, and the fields used in restrictions.
3. Use the primary keys and unique keys from the table as much as possible.
4. Numeric keys are preferable to text keys.
5. Do not use blank unique fields.
6. Do not use if and only if functions in the query statements.
7. Do not use domain aggregations.
8. Use between and equal instead of <, >, which speeds up the queries.
9. Use fixed column headings in queries.
10. For reports based on queries, use Portrait view in preference to Landscape and set Fast Laser Printing to Yes (View, Options, Other Properties).
11. Make sure to use make-table queries for running on static data.
12. Do not use count(columnName); instead use count(*).
13. For joined columns in 1-to-x relationships, test the comparative performance of placing the restriction on the 1 side or on the x side.
14. Use shorter names instead of longer ones.
15. Normalize all the tables.
16. Denormalize the tables.
17. Avoid the use of Distinct Row.
The end of class consisted of Fahad presenting how to set up Oracle on a personal computer: creating new users, creating a new database, and connecting to a database via Oracle and the command line. He concluded with a small overview.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
February 22, 2006
(Notes from optimization_law.doc)
Product and join operations are commutative and associative.
(The following are from optimization_law.doc):
R × S = S × R
R ⋈ S = S ⋈ R (⋈ is natural join)
R ∪ S = S ∪ R
R ∩ S = S ∩ R
R × (S × T) = (R × S) × T
R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
R ∪ (S ∪ T) = (R ∪ S) ∪ T
R ∩ (S ∩ T) = (R ∩ S) ∩ T
Pushing selections down is usually the most effective way to lower the number of rows early, since joins take much time. The document has an example with over 10 thousand rows in which the difference in time taken is clearly noticeable.
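A minimal sketch of why pushing a selection below a join helps. The two toy relations, their sizes, and the weight threshold are made up for illustration and are not the example from optimization_law.doc; the point is only that both plans return the same rows while the second one joins far fewer of them.

```python
# Hypothetical toy relations (illustrative names and sizes only).
parts = [{"pnum": f"P{i}", "weight": i % 20} for i in range(1, 101)]
shipments = [{"snum": f"S{i % 5}", "pnum": f"P{(i % 100) + 1}", "qty": 100}
             for i in range(1000)]

def join_then_select(parts, shipments, min_weight):
    """Join first, filter afterwards: every matching pair is materialized."""
    joined = [(p, sp) for p in parts for sp in shipments
              if p["pnum"] == sp["pnum"]]
    result = [(p, sp) for p, sp in joined if p["weight"] > min_weight]
    return result, len(joined)

def select_then_join(parts, shipments, min_weight):
    """Push the selection below the join: filter PARTS before joining."""
    heavy = [p for p in parts if p["weight"] > min_weight]
    joined = [(p, sp) for p in heavy for sp in shipments
              if p["pnum"] == sp["pnum"]]
    return joined, len(joined)

slow, slow_rows = join_then_select(parts, shipments, 17)
fast, fast_rows = select_then_join(parts, shipments, 17)
print(slow_rows, fast_rows)  # intermediate join sizes of the two plans
```

Both plans produce identical results here; only the size of the intermediate join differs.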
Basic selection properties:
1. σA(R) = σA(σA(R))
2. σA(σB(R)) = σB(σA(R))
The following three rules relate selection to the set operators; they are used to push selections down an expression tree.
1.
2.
3.
In class we then went over each group's database to check whether progress was being made.
For the next assignment (our project) we need to:
- Find the term frequency for each term.
- Find the TFIDF, for which we need to find the terms in all the documents.
- Find the LSI.
- Finally, create a database with token, filename, and the TFIDF values.
ALL OF THIS IS DUE March 8, 2006
MIDTERM NEXT WEEK March 3, 2006
Week of Feb 27 – March 5, 2006
LECTURE NOTES (REQUIRED ALONG WITH THE PROGRESS REPORT):
February 27, 2006
SUBJECT: BITMAP INDEXES
- A bitmap index on a column is a collection of bit-vectors, one for each distinct value in the column; each bit-vector has as many bits as there are records.
- If a record holds the value, the corresponding bit in that value's vector is '1'; if not, it is '0'.
- Distinct values are the different values that occur within the column.
- The length of each bit-vector is the number of tuples in the table.
Example table:
F    G
30   foo
30   bar
40   baz
50   foo
40   bar
30   baz
Granular Form (bitmap index on F):
Value   Granular values   Bit-vector
30      e1,e2,e6          110001
40      e3,e5             001010
50      e4                000100
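The bit-vectors in the granular form can be derived mechanically from a column. The following is a small sketch, not code from the lecture; the function name build_bitmap_index is my own, and the two lists are the F and G columns of the example table.

```python
def build_bitmap_index(column):
    """Return {value: bit-vector string}, one bit per tuple, in tuple order."""
    index = {}
    for position, value in enumerate(column):
        # Start every new value with an all-zero vector, then set its bit.
        bits = index.setdefault(value, ["0"] * len(column))
        bits[position] = "1"
    return {value: "".join(bits) for value, bits in index.items()}

f_column = [30, 30, 40, 50, 40, 30]
g_column = ["foo", "bar", "baz", "foo", "bar", "baz"]

f_index = build_bitmap_index(f_column)
g_index = build_bitmap_index(g_column)
print(f_index)
print(g_index)
```

The leftmost bit corresponds to tuple 1, matching the way the vectors are written in these notes.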
- A bitmap index takes the form of a table with two columns: the distinct values and their bit-vectors. In the example each bit-vector has length 6, so every entry has the shape 'xxxxxx'.
- The bit-vector of a given value has one position per tuple, with '1' where that value exists and '0' otherwise.
- ex) 30 | 110001
- A granular value is an 'e' followed by the number of the row that holds the value.
- The index entries are then sorted by value.
- If the table has n tuples, the index needs n bits for each distinct value, so at most n^2 bits in total (when every value is distinct).
- The advantage of this structure is a faster search. For example:
SELECT title
FROM Movie
WHERE studioName = 'Disney' AND
year = 1995;
We look at the bitmap indexes for studioName and year, which give 100001 (studioName = 'Disney') and 100100 (year = 1995). Intersecting them (bitwise AND) gives 100000, so we know that tuple 1 is the only one with both 'Disney' and 1995.
- The bitmap index can also be applied to other SQL statements, such as range queries:
SELECT *
FROM R
WHERE age BETWEEN 23 AND 25
AND salary BETWEEN 60 AND 70;
ex) of the bitmap indices:
Age   Salary
25    60
22    55
30    70
22    55
23    55
25    100
23    45
30    45
AGE:
22    01010000
23    00001010
25    10000100
30    00100001
SALARY:
45    00000011
60    10000000
55    01011000
70    00100000
100   00000100
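To answer the range query with these indexes, OR together the bit-vectors of every value inside each range, then AND the two results. The sketch below is my own illustration of that procedure (the helper names are not from the lecture); it stores each bit-vector as a Python integer with tuple 1 as the leftmost bit.

```python
def bitmap_index(column):
    """Map each distinct value to an integer whose bits mark matching tuples."""
    n = len(column)
    index = {}
    for position, value in enumerate(column):
        # Tuple 1 is the leftmost (most significant) bit, as in the notes.
        index[value] = index.get(value, 0) | (1 << (n - 1 - position))
    return index

def range_bitmap(index, low, high):
    """OR together the bit-vectors of every value in [low, high]."""
    result = 0
    for value, bits in index.items():
        if low <= value <= high:
            result |= bits
    return result

ages = [25, 22, 30, 22, 23, 25, 23, 30]
salaries = [60, 55, 70, 55, 55, 100, 45, 45]

age_bits = range_bitmap(bitmap_index(ages), 23, 25)
salary_bits = range_bitmap(bitmap_index(salaries), 60, 70)
answer = age_bits & salary_bits
matching_tuples = [i + 1 for i in range(len(ages))
                   if answer & (1 << (len(ages) - 1 - i))]

print(format(age_bits, "08b"), format(salary_bits, "08b"), format(answer, "08b"))
print(matching_tuples)
```

For this data the age range gives 10001110, the salary range gives 10100000, and the intersection 10000000 selects only tuple 1 (age 25, salary 60).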
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
March 1, 2006
- MIDTERM EXAM on tables, SQL statements, term frequency, query algebra, and rules.
Week of March 6 – March 12, 2006
LECTURE NOTES (REQUIRED ALONG WITH THE PROGRESS REPORT):
March 6, 2006
Given the following database from the DB-2 example in relational_database_example.doc, do the following:
EMPLOYEE
PRIKEY    FNAME     MINIT    LNAME     SSN        BDATE       ADDRESS                   SEX      SALARY   SUPERPRIKEY  DNO
char(12)  char(10)  char(1)  char(10)  char(10)   char(10)    char(30)                  char(1)  decimal  char(12)     short
1         John      B        Smith     123456789  1965-01-09  731 Fondren, Houston, TX  M        30000    2            5
2         Franklin  T        Wong      333445555  1955-12-08  638 Voss, Houston, TX     M        40000    8            5
3         Alicia    J        Zelaya    999887777  1968-07-19  3321 Castle, Spring, TX   F        25000    4            4
4         Jennifer  S        Wallace   987654321  1941-06-20  291 Berry, Bellaire, TX   F        43000    8            4
5         Ramesh    K        Narayan   666884444  1962-09-15  975 Fire Oak, Humble, TX  M        38000    2            5
6         Joyce     A        English   453453453  1972-07-31  5631 Rice, Houston, TX    F        25000    2            5
7         Ahmad     V        Jabbar    987987987  1969-03-29  980 Dallas, Houston, TX   M        25000    4            4
8         James     E        Borg      888665555  1937-11-10  450 Stone, Houston, TX    M        55000    null         1
1.) Create 100 tuples from this table by repeating each row, so that there will be several duplicates. The only difference between the tuples will be the primary key, which runs from 1 to 100.
2.) Write Java code that generates SQL code for every possible association (combination) of the columns.
3.) Print the SQL code generated for the 2000 possible statements to a separate file for viewing.
Send all three of the above to Dr. Lin: due by Wednesday for 120 points, or by Monday for 100 points.
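A rough sketch of step 2, written in Python for brevity rather than the Java the assignment asks for. The shape of the generated statement (GROUP BY with HAVING COUNT(*) > 1) is an assumption on my part, and the function name and threshold parameter are hypothetical. With the 11 columns of the EMPLOYEE table there are 2^11 - 1 = 2047 non-empty column subsets, which is where the figure of about 2000 statements comes from.

```python
from itertools import combinations

COLUMNS = ["PRIKEY", "FNAME", "MINIT", "LNAME", "SSN", "BDATE",
           "ADDRESS", "SEX", "SALARY", "SUPERPRIKEY", "DNO"]

def association_queries(columns, table="EMPLOYEE", threshold=1):
    """Yield one GROUP BY statement per non-empty subset of the columns."""
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            cols = ", ".join(subset)
            # Assumed query shape: count how often each value combination repeats.
            yield (f"SELECT {cols}, COUNT(*) FROM {table} "
                   f"GROUP BY {cols} HAVING COUNT(*) > {threshold};")

queries = list(association_queries(COLUMNS))
print(len(queries))
print(queries[0])
```

Writing the list to a file for step 3 is then one line per statement.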
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
March 8, 2006
The LSI table should have been due today; the combination of all 20k files will be due by next week.
The following code fills the LSI table by computing the value for each distinct token in each document:
insert into lsi (id,token,tfidf)
select id,token,
occurrences*log(10,(total/concerned)) TFIDF
from
(select count(distinct id) total from data),
(select distinct id, token,count(position) occurrences
from data group by id,token) pairs
natural join
(select count(distinct id) concerned, token from data group by token);
This code (Dr. Lin's) finds the LSI values for the tokens; it should be run for all tokens within the supplied 20,000 files, due next week Wednesday.
THE SQL STATEMENTS I USE TO COMPUTE THE VALUES:
1.) tf(ti;dj) = number of occurrences of term i ('the') in document j ('58142')
select count(id) from database where token = ' ' and id = ' ';
2.) Tr = total number of documents
select count(distinct id) from database;
3.) Tr(ti) = number of documents containing term i ('the')
select count(distinct id) from database where token = ' ';
4.) Then use the Java Math class to compute the logarithm.
Here is the piece of code used:
// Term frequency: occurrences of the token in one document.
tf = "select count(id) from database where token = '" + tokHold +
"' and id = '" + idHold + "'";
// Total number of documents.
tr = "select count(distinct id) from database";
// Number of documents that contain the token.
trti = "select count(distinct id) from database where token = '" +
tokHold + "'";
// No trailing ';' inside the strings: the Oracle JDBC driver typically rejects it (ORA-00911).
tfSet = s1.executeQuery(tf);
trSet = s2.executeQuery(tr);
trtiSet = s3.executeQuery(trti);
tfSet.next();
trSet.next();
trtiSet.next();
// tfidf = tf * log10(total documents / documents containing the token)
value = (tfSet.getDouble(1)) * Math.log10(trSet.getDouble(1) / trtiSet.getDouble(1));
s.execute("insert into tfidf values('" + tokHold + "' , '" + idHold + "' , " + value + ")");
// The last line inserts the computed value into the new table, which holds all the tfidf values.
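The same formula, tf * log10(Tr / Tr(ti)), can be checked on a few tiny documents without a database. This is only a sketch for sanity-checking numbers; the three documents and the function name tfidf_table are made up for illustration.

```python
import math
from collections import Counter

def tfidf_table(documents):
    """Return {(token, doc_id): tf * log10(N / df)} for tokenized documents."""
    total_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    table = {}
    for doc_id, tokens in documents.items():
        for token, tf in Counter(tokens).items():
            table[(token, doc_id)] = tf * math.log10(total_docs / doc_freq[token])
    return table

# Hypothetical toy corpus, not the 20,000 project files.
docs = {
    "d1": ["the", "cat", "sat"],
    "d2": ["the", "dog", "sat", "sat"],
    "d3": ["the", "bird"],
}
scores = tfidf_table(docs)
for key in sorted(scores):
    print(key, round(scores[key], 4))
```

A token such as 'the' that occurs in every document gets the value 0, because log10(3/3) = 0.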
It is not possible to generate the LSI table example Dr. Lin gave last week because of Oracle's limit of 1000 columns per table. So it will be reformatted like this:
3 columns: token, id, lsi value
The table should contain the tokens of all 20,000 files.
The new tokenizer should recognize numbers combined with words, such as cs157b, instead of treating numbers and words differently.
Week of March 13 – March 19, 2006
LECTURE NOTES (REQUIRED ALONG WITH THE PROGRESS REPORT):
March 13, 2006
(on appiori_sql.doc)
A table of food with 8 tuples. A new table is shown with a column for each different type of food; every time the item exists in the transaction there is a 1, and a 0 otherwise. But this method is not efficient because of the amount of space needed to hold columns for all possible items.
drop table supermarket;
create table supermarket(
transaction varchar(5),
item1 varchar(5),
item2 varchar(5),
item3 varchar(5),
...
item7 varchar(5));
Association Rules: how do we find the important items?
Bad way: select item1 from supermarket group by item1 having count(*)>1;
then repeat the previous SQL statement for all columns (8 times), followed by
select item1,item2 from supermarket group by item1,item2 having count(*)>1; ... and so on for all combinations.
To know whether or not a pair is important, look first at the individual columns: if item1 and item2 are each important, then compute the association of item1 & item2. Only if a subset is significant do we compute its superset.
Functional dependencies in the shown table: if a person bought parsley then they bought cucumber, and if a person bought onion then they bought tomato.
How to find functional dependencies (according to Dr. Lin):
Select item1,count(distinct item2) from supermarket group by item1 having count(distinct item2)>1;
The grouping has to be on item1 alone: grouping by item1,item2 leaves a single item2 value in every group, so the count could never exceed 1. If no row is returned, then every x has exactly one y, i.e. x->y.
The code I used to figure out functional dependency:
Select item1,count(distinct item2) from employee group by item1 having count(distinct item2)>1;
Alternative code, but more complicated:
SELECT A, MIN(B), MAX(B)
FROM Table
GROUP BY A
HAVING MIN(B) <> MAX(B);
=> no row returned means B is determined by A; then do the next step
SELECT B, MIN(A), MAX(A)
FROM Table
GROUP BY B
HAVING MIN(A) <> MAX(A);
=> no row returned means A is determined by B
Another example: select sname,count(distinct status) from suppliers group by sname;
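The GROUP BY test can be tried on the SUPPLIERS data from the SQL review. This sketch uses Python's built-in sqlite3 module instead of Oracle, and renames the S# column to snum because that is easier to quote; the helper name determines is hypothetical. A column a determines b when no group of a contains more than one distinct b.

```python
import sqlite3

def determines(conn, table, a, b):
    """Return True if column a functionally determines column b in table."""
    # Identifiers are interpolated directly; only use trusted names here.
    query = (f'SELECT "{a}" FROM "{table}" GROUP BY "{a}" '
             f'HAVING COUNT(DISTINCT "{b}") > 1')
    # No violating group found means the dependency a -> b holds.
    return conn.execute(query).fetchone() is None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE suppliers (snum TEXT, sname TEXT, status INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?, ?, ?)",
    [("S1", "SMITH", 20, "LONDON"), ("S2", "JONES", 10, "PARIS"),
     ("S3", "BLAKE", 30, "PARIS"), ("S4", "CLARK", 20, "LONDON"),
     ("S5", "ADAMS", 30, "ATHENS")],
)

print(determines(conn, "suppliers", "sname", "status"))
print(determines(conn, "suppliers", "city", "status"))
print(determines(conn, "suppliers", "status", "city"))
```

On this data sname determines status, while city does not determine status (PARIS has status 10 and 30) and status does not determine city (status 30 occurs in PARIS and ATHENS). A dependency that holds on five rows is of course only a candidate, not a proof for the schema.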
Homework: find the functional dependencies for the EMPLOYEE table.
Decision rules?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
March 15, 2006
Project stuff: the new TF of pairs = (# of occurrences) / (some sort of normalization that is computationally inexpensive)
The normalization is whatever we can think of. This is now the association of 2 tokens.
The normalization should be inexpensive in terms of the amount of computation done by the computer and reflective of the length of the articles.
A data warehouse is a storage table for easier data analysis. The data warehouse is similar to a regular database, but the columns of the database are the rows of the data warehouse. This type of storage is ideal for data analysis, but it cannot be used for modification. To analyze new data, the data warehouse must be repopulated with some sort of crawler.
A group member found SQL statements to find functional dependencies. The grouping has to be on one column only, since grouping by both A and B leaves a single B value in every group:
SELECT A, MIN(B), MAX(B)
FROM Table
GROUP BY A
HAVING MIN(B) <> MAX(B);
// no row returned means B is determined by A; then do the next step
SELECT B, MIN(A), MAX(A)
FROM Table
GROUP BY B
HAVING MIN(A) <> MAX(A);
// no row returned means A is determined by B
Week of March 20 – March 27, 2006
March 20, 2006
Association rules: we keep a column value only where it appears at least threshold times (count >= threshold); if it does not repeat at least that often, we do not care about it. This is for length 1.
If the length is 2 or greater, we form the combinations of columns of that length and then check with the threshold whether each combination appears at least threshold times.
(7 choose 1) + (7 choose 2) + ... + (7 choose 7) = 2^7 - 1 = 127
Questions on the quiz:
1.) List the associations of length 2 with threshold = 3.
2.) Which triples meet the Apriori conditions?
TABLE FROM THE QUIZ
t1: cucumber,parsley,onion,tomato,salt,bread
t2: tomato,cucumber,parsley
t3: tomato,cucumber,onion,parsley
t4: tomato,cucumber,onion,bread
t5: tomato,salt,onion
t6: bread,cheese
t7: tomato,cheese,cucumber
t8: bread
What is the answer to this: which item sets of length 2 have a count of 3 or greater?
ANSWER: cucumber/tomato and tomato/onion (counting the table above, cucumber/parsley, cucumber/onion and parsley/tomato also each appear in exactly 3 transactions)
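The quiz answer can be checked by counting every pair directly. The sketch below is a brute-force count, not the pruned Apriori candidate generation, and it assumes the table exactly as transcribed in these notes (with 'parsly' in t2 read as parsley); the function name frequent_itemsets is my own.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "t1": {"cucumber", "parsley", "onion", "tomato", "salt", "bread"},
    "t2": {"tomato", "cucumber", "parsley"},
    "t3": {"tomato", "cucumber", "onion", "parsley"},
    "t4": {"tomato", "cucumber", "onion", "bread"},
    "t5": {"tomato", "salt", "onion"},
    "t6": {"bread", "cheese"},
    "t7": {"tomato", "cheese", "cucumber"},
    "t8": {"bread"},
}

def frequent_itemsets(transactions, size, threshold):
    """Count every item set of the given size and keep those at or above threshold."""
    counts = Counter()
    for items in transactions.values():
        # Sorting makes each item set a canonical, alphabetically ordered tuple.
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= threshold}

pairs = frequent_itemsets(transactions, 2, 3)
for pair, count in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

On the transcribed table this finds five pairs at threshold 3; cucumber/tomato (5) and onion/tomato (4) are the only ones that also pass a threshold of 4. Calling frequent_itemsets(transactions, 3, 3) answers the question about triples the same way.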
On Wednesday we will cover the decision rules.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
March 22, 2006
Whenever you hand in something, place your name in the filename, even for the Java files, so you have to modify the Java class names.
Stanford lecture tomorrow, Thursday, from 7:30 pm to 8:30 pm. Last lecture of the semester according to him.
He then discussed the grading for the lecture notes for about 15 minutes.
Midterm the Monday after spring break. Class notes should be expanded (more notes from outside sources and practice) to receive full credit. Minimum notes will get the minimum passing grade.
Project due when we come back: the two-token tfidf project.
Soft copy of the notes due Monday when we come back. It should include project notes. He says the grades will be distributed as follows:
100,95,90,85,80,75,70,50(if you don't turn in)
So on Monday for CS157B we have: midterm, project, soft copy notes
Last 10 minutes of class...
BITMAPS: table with two columns
- is better than datawarehouse
- vertical representation
ex) a translation of a bitmap from his bitmap slides (online) of example 1
        1  2  3  4  5  6
-------------------------
30      1  1  0  0  0  1
40      0  0  1  0  1  0
50      0  0  0  1  0  0
foo     1  0  0  1  0  0
bar     0  1  0  0  1  0
baz     0  0  1  0  0  1
This form of a database is better since it takes up fewer bits than the old way: each 0 or 1 is only one bit, while storing the token names over and over can take a lot more.
Know how to convert a regular table into a bitmap.
We also covered the binary conversion of the tables to find the intersection of several columns.
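A short sketch of that binary conversion: each bit-vector string is converted to an integer, the integers are ANDed, and the result is converted back. The helper names are my own; the two vectors are the rows for 30 and foo from example 1.

```python
def intersect(*bit_vectors):
    """AND equal-length bit-vector strings and return the result as a string."""
    width = len(bit_vectors[0])
    result = (1 << width) - 1          # start with all bits set
    for bits in bit_vectors:
        result &= int(bits, 2)         # binary string -> integer, then AND
    return format(result, f"0{width}b")

def tuple_numbers(bits):
    """List the 1-based tuple positions whose bit is set."""
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]

both = intersect("110001", "100100")   # rows for F = 30 and G = foo
print(both, tuple_numbers(both))
```

The result 100000 says that tuple 1 is the only one with F = 30 and G = foo.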