09-25-10 - NDSU Computer Science

Weekly Research Meeting Notes:
Date: Saturday 09/25/10
Time: 10:00am to 1:00pm
Place: CS conference room
Members Present:
1. Dr. William Perrizo
2. Dr. George Hamer
3. Mohammad Kabir Hossain
4. Arjun Roy
5. Tingda Lu (online via Skype)
Dr. Perrizo opened the first research meeting of fall 2010 by welcoming everyone present,
especially George, who came all the way from Brookings, South Dakota (SDSU) to attend.
The following topics were discussed in the meeting:
1. Dr. Perrizo asked all members to develop a research portfolio listing their research
interests, current research topics, etc. The portfolios may then be uploaded to the
group’s website (which will be on Dr. Perrizo’s website at
web.cs.ndsu.nodak.edu/~perrizo/classes/saturday).
2. Dr. Perrizo emphasized the importance of paper publication. He mentioned that at least
14 publications at the Ph.D. level and 5-6 publications at the master's level are required
in order to compete in the job marketplace.
3. To reach this target, Dr. Perrizo suggested that members help one another with their
research, becoming co-authors of each other's papers as well as lead authors of their own.
The group's co-author policy is that the main student developer is the first author,
Dr. Perrizo is the last author, and other contributors are listed alphabetically in between.
4. Dr. Perrizo informed the group of the deadlines for two conferences, CATA-2011 and
BICoB-2011. Both have full paper submission deadlines of October 22, 2010, but there is a
good chance that the dates will be shifted to November 22.
5. Dr. Perrizo then talked about patents and described the earlier patents related to
his work on P-trees and other topics (see below for a description).
Presentation:
Dr. Perrizo gave a PowerPoint presentation on different data mining techniques:
classification, clustering and association rule mining.
1. First he discussed the Entity-Relationship model of a database. Classification, as a
data mining technique, can be applied to any table using a class label column of that
table. Since entities and relationships are both tables, classification can be done on either.
Clustering is also done on a table, usually as a preparatory step to classification.
2. In order to apply different data mining techniques we need to find the similarity between
two data points. This similarity can be found by determining the distance between the
data points in n-dimensional space. Then different distance functions were discussed.
3. Clustering serves as preparation for classification.
4. Association rule mining is always done on a relationship between two entities (e.g.,
Market Basket Research is done on the buys relationship between Customers and Items).
5. When computing similarity, if two attributes have the same measure, how should the tie
be broken? That may be a paper topic. (It may be solved by introducing more attributes?)
Dr. Perrizo’s note: Does anyone recall what this was about? I am drawing a blank. Did
it have to do with Ptree CkNN and what to do if it results in a tie vote?
6. Closed k-NN (CkNN) is a classification technique that blends k-NN and ϵ-NN and is
more effective than k-NN: not just the first k points but all points at the same distance
as the kth point participate in the classification vote. Be careful using ϵ, because in
high dimensions no points may be near neighbors. A sphere in high dimensions has almost
no points in it, since the [hyper]volume of the unit n-sphere goes to zero as n goes to
infinity and is highest at dimension 5 (one version of the curse of dimensionality).
7. Collaborative filtering and the Netflix problem were discussed. In the Netflix problem
there is a database of movie id, user id, date and rating. Users rate movies on a scale
from 1 to 5; a rating of 0 means the movie was not rated. The Netflix problem is a
classification problem in which the rating has to be determined given the user id, movie id
and date.
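The closed k-NN vote and the volume remark in item 6 can be sketched in a few lines of Python. This is only an illustration, not the group's P-tree based implementation: the function names, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closed_knn_vote(train, query, k, dist=euclidean):
    """Closed k-NN: every point as close as the kth nearest point gets a vote.

    train is a list of (point, label) pairs. Returns (winners, n_voters),
    where winners lists every label tied for the most votes.
    """
    if not train:
        raise ValueError("training set is empty")
    ranked = sorted(((dist(p, query), label) for p, label in train),
                    key=lambda pair: pair[0])
    k = min(k, len(ranked))
    cutoff = ranked[k - 1][0]  # distance of the kth nearest point
    # Plain k-NN would stop after k points; closing the neighborhood
    # keeps all points tied with the kth point at the cutoff distance.
    voters = [label for d, label in ranked if d <= cutoff]
    counts = Counter(voters)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return winners, len(voters)


def unit_ball_volume(n):
    """Volume of the radius-1 ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


# Toy data: three points sit at distance 1 from the query, one at distance 3.
train = [((1, 0), "A"), ((0, 1), "B"), ((-1, 0), "B"), ((3, 0), "A")]
print(closed_knn_vote(train, (0, 0), k=2))

# Dimension (from 1 to 20) at which the unit ball volume peaks,
# and how small the volume already is in 100 dimensions.
print(max(range(1, 21), key=unit_ball_volume))
print(unit_ball_volume(100))
```

With k=2, plain k-NN on the toy data would see one "A" and one "B" and tie, while the closed neighborhood admits the third point at the same distance and the vote is decided. A tie among the returned winners is still possible in general, which is the open question raised in item 5.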
At the end, Dr. Perrizo thanked all the members present and announced that the
next meeting will be on 10/09/10.
Note taker:
Mohammad K Hossain.
Ptree PATENT Portfolio
1. United States Patent and Trademark Office Patent Number 6,941,303 B2 (NDSU-RFT-75), issued
September 6, 2005, “System and Method for Organizing, Compressing and Structuring Data for Data Mining
Readiness”, Inventor: William K. Perrizo, Application No. 957637 filed on 2001-09-20, Abstract: A system and
method to take data, which is in the form of an n-dimensional array of binary data where the binary data is
comprised of bits that are identified by a bit position within the n-dimensional array, and create one file for each bit
position of the binary data while maintaining the bit position identification and to store the bit with the
corresponding bit position identification from the binary data within the created file. Once this bit-sequential
format of the data is achieved, the formatted data is structured into a tree format that is data-mining-ready. The
formatted data is structured by dividing each of the files containing the binary data into quadrants according to the
bit position identification and recording the count of 1-bits for each quadrant on a first level. Then, recursively
dividing each of the quadrants into further quadrants and recording the count of 1-bits for each quadrant until all
quadrants comprise a pure-1 quadrant or a pure-0 quadrant to form a basic tree structure.
2. United States Patent and Trademark Office Patent Number 7,051,028 B2 (NDSU RFT-79), issued May 23,
2006, “Read-Commit Order Concurrency Control (ROCC)”, Inventors: Victor T. Shi and William K. Perrizo, North
Dakota State University, Abstract: A system and method for concurrency control in high performance database
systems. Generally includes receiving a database access request message from a transaction. Then, generating an
element that corresponds to the access request message. The element type is that of a read element, commit element,
validated element, or restart element. The element is then posted to a read-commit (RC) queue. If the element is a
commit element, an intervening validation of the transaction is performed. Upon the transaction passing validation
the requested database access is performed.
3. United States Patent and Trademark Office Patent Number 7,089,244 B2 (NDSU-RFT-99), issued August
8, 2006, “Multiversion read-commit order concurrency control”, Inventors: Victor T. Shi and William K. Perrizo,
North Dakota State University, Application No. 10440442 filed on 2003-05-16, Abstract: A system and method
for multi-version concurrency control in high performance database systems. Generally includes receiving a
database access request message from a transaction. Then, generating an element that corresponds to the access
request message. The element type is that of a read element, commit element, validated element, or restart element.
The element is then posted to a read-commit (RC) queue. If the element is a commit element, an intervening
validation of the transaction is performed. Upon the transaction passing validation the requested database access is
performed.
4. United States Patent and Trademark Office Patent Number 7,640,219; Issued December 29, 2009.
Parameter Optimized, Vertical, Nearest Neighbor Vote and Boundary Based Classification (NDSU RFT-203).
Inventor: William Perrizo.
This invention involves a Computer Aided Detection (CAD) model that is designed
to diagnose Pulmonary Embolisms (PE) from CT image information data sheets. This high performance
classification system includes a Local Decision Boundary based classification combined with an evolutionary
algorithm for parameter optimization and a vertical data structure for efficient processing. The invention was
developed as a solution for the ACM KDD Cup competition in 2006, and won task 3 of that competition. A U.S.
Provisional Patent application was filed August 4, 2007.
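The quadrant-count structure described in the abstract of patent 1 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it handles a single bit file laid out as a square 2-D array whose side is a power of two, and the function name and dictionary layout are hypothetical.

```python
def build_count_tree(bits, row=0, col=0, size=None):
    """Recursively record 1-bit counts per quadrant of a square bit array.

    bits is a list of equal-length rows of 0/1 values; the side length is
    assumed to be a power of two. Recursion stops at pure-1 or pure-0
    quadrants, as in the abstract.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    node = {"count": count, "children": []}
    # A quadrant is pure when it holds only 0s or only 1s; a single cell
    # is always pure, so the size check is just a safety net.
    if count == 0 or count == size * size or size == 1:
        return node
    half = size // 2
    # Quadrant order (an assumption): upper-left, upper-right,
    # lower-left, lower-right.
    for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
        node["children"].append(
            build_count_tree(bits, row + dr, col + dc, half))
    return node


# Hypothetical 4x4 bit file for one bit position.
bits = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
tree = build_count_tree(bits)
print(tree["count"])                             # total number of 1-bits
print([c["count"] for c in tree["children"]])    # first-level quadrant counts
```

Only the mixed lower-left quadrant is subdivided further here; the pure quadrants are stored as a single count, which is where the compression comes from.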
Two patents pending.
1. A divisional patent application related to PCM (6,941,303 B2) (NDSU-RFT-94), Similar Function Data
Mining with P-trees – similar to “System and Method for Organizing, Compressing and Structuring Data for Data
Mining Readiness”.
2. Vertical Set Inner Product (VSIP) (NDSU-RFT-159) This novel algorithm provides at least a 10-fold increase in
clustering and classifying numeric data by providing a horizontal calculation across a vertical P-tree dataset. It is
related to but distinct from RFT-75 (PCM). A U.S. provisional application was filed on November 17, 2004 and a
PCT patent application was filed November 17, 2005.