Weekly Research Meeting Notes

Date: Saturday 09/25/10
Time: 10:00am to 1:00pm
Place: CS conference room

Members Present:
1. Dr. William Perrizo
2. Dr. George Hamer
3. Mohammad Kabir Hossain
4. Arjun Roy
5. Tingda Lu (online through Skype)

Dr. Perrizo opened the first research meeting of fall 2010 by welcoming everyone present, especially George, who came all the way from Brookings, South Dakota (SDSU) to attend.

The following topics were discussed in the meeting:
1. Dr. Perrizo asked all members to develop a research portfolio listing their research interests, current research topics, etc. The portfolios may then be uploaded to the group’s website (which will be on Dr. Perrizo’s website at web.cs.ndsu.nodak.edu/~perrizo/classes/saturday).
2. Dr. Perrizo emphasized the importance of publishing papers. He mentioned that at least 14 publications at the Ph.D. level and 5-6 publications at the master's level are required in order to compete in the job marketplace.
3. To reach this target, Dr. Perrizo suggested that members help each other with their research and become co-authors of one another's papers as well as lead authors of their own. The group's co-author policy is that the main student developer is the first author, Dr. Perrizo is the last author, and the other contributors are listed in alphabetical order in the middle.
4. Dr. Perrizo then informed the group about the deadlines of two conferences, CATA2011 and BICoB-2011. Both have a full paper submission date of October 22, 2010, but there is a good chance that the dates will be shifted to November 22.
5. Dr. Perrizo then talked about patents and described the previous patents related to his work on P-trees and other topics (see below for a description).

Presentation: Dr. Perrizo gave a PowerPoint presentation on different data mining techniques: classification, clustering and association rule mining.
1. 
First he discussed the Entity-Relationship model of a database. Classification can be applied to any table, using one column of that table as the class label. Since entities and relationships are both tables, classification can be done on either. Clustering is also done on a table, usually as a preparatory step to classification.
2. To apply the different data mining techniques we need to measure the similarity between two data points. This similarity can be found from the distance between the data points in n-dimensional space. Different distance functions were then discussed.
3. Clustering is a preparation for classification.
4. Association rule mining is always done on a relationship between two entities (e.g., Market Basket Research is done on the buys relationship between Customers and Items).
5. While finding the similarity, if two attributes have the same measures, how should the tie be broken? That may be a paper topic. (It may be solved by introducing more attributes?)
Dr. Perrizo’s note: Does anyone recall what this was about? I am drawing a blank. Did it have to do with Ptree CkNN and what to do if it results in a tie vote?
6. Closed k-NN (CkNN) is a classification technique that blends k-NN and ϵ-NN and is more effective than k-NN: not just the first k points but all points at the same distance as the kth point participate in the classification vote. Be careful using ϵ, because in high dimensions there may be no near neighbors at all. A sphere in high dimension has almost no points in it, since the [hyper]volume of the unit n-sphere goes to zero as n goes to infinity and is highest at dimension 5 (one version of the curse of dimensionality).
7. Collaborative filtering and the Netflix problem were discussed. In the Netflix problem there is a database of movie id, user id, date and rating. Users rate movies on a scale from 1 to 5. A rating of 0 means the movie was not rated.
The Netflix problem is a classification problem in which the rating has to be determined given the user id, movie id and date.

At the end Dr. Perrizo thanked all the members present and announced that the next meeting will be on 10/09/10.

Note taker: Mohammad K Hossain.

Ptree PATENT Portfolio
1. United States Patent and Trademark Office Patent Number 6,941,303 B2 (NDSU-RFT-75), issued September 6, 2005, “System and Method for Organizing, Compressing and Structuring Data for Data Mining Readiness”, Inventor: William K. Perrizo, Application No. 957637 filed on 2001-09-20. Abstract: A system and method to take data, which is in the form of an n-dimensional array of binary data where the binary data is comprised of bits that are identified by a bit position within the n-dimensional array, and create one file for each bit position of the binary data while maintaining the bit position identification and to store the bit with the corresponding bit position identification from the binary data within the created file. Once this bit-sequential format of the data is achieved, the formatted data is structured into a tree format that is data-mining-ready. The formatted data is structured by dividing each of the files containing the binary data into quadrants according to the bit position identification and recording the count of 1-bits for each quadrant on a first level. Then, recursively dividing each of the quadrants into further quadrants and recording the count of 1-bits for each quadrant until all quadrants comprise a pure-1 quadrant or a pure-0 quadrant to form a basic tree structure.
2. United States Patent and Trademark Office Patent Number 7,051,028 B2 (NDSU RFT-79), issued May 23, 2006, “Read-Commit Order Concurrency Control (ROCC)”, Inventors: Victor T. Shi and William K. Perrizo, North Dakota State University. Abstract: A system and method for concurrency control in high performance database systems.
Generally includes receiving a database access request message from a transaction. Then, generating an element that corresponds to the access request message. The element type is that of a read element, commit element, validated element, or restart element. The element is then posted to a read-commit (RC) queue. If the element is a commit element, an intervening validation of the transaction is performed. Upon the transaction passing validation the requested database access is performed.
3. United States Patent and Trademark Office Patent Number 7,089,244 B2 (NDSU-RFT-99), issued August 8, 2006, “Multiversion read-commit order concurrency control”, Inventors: Victor T. Shi and William K. Perrizo, North Dakota State University, Application No. 10440442 filed on 2003-05-16. Abstract: A system and method for multi-version concurrency control in high performance database systems. Generally includes receiving a database access request message from a transaction. Then, generating an element that corresponds to the access request message. The element type is that of a read element, commit element, validated element, or restart element. The element is then posted to a read-commit (RC) queue. If the element is a commit element, an intervening validation of the transaction is performed. Upon the transaction passing validation the requested database access is performed.
4. United States Patent and Trademark Office Patent Number 7,640,219, issued December 29, 2009. Parameter Optimized, Vertical, Nearest Neighbor Vote and Boundary Based Classification (NDSU RFT-203). Inventor: William Perrizo. This invention involves a Computer Aided Detection (CAD) model that is designed to diagnose Pulmonary Embolisms (PE) from CT image information data sheets. This high performance classification system includes a Local Decision Boundary based classification combined with an evolutionary algorithm for parameter optimization and a vertical data structure for efficient processing.
The invention was developed as a solution for the ACM KDD Cup competition in 2006, and won task 3 of that competition. A U.S. Provisional Patent application was filed August 4, 2007.

Two patents pending.
1. A divisional patent application related to PCM (6,941,303 B2) (NDSU-RFT-94), Similar Function Data Mining with P-trees, similar to “System and Method for Organizing, Compressing and Structuring Data for Data Mining Readiness”.
2. Vertical Set Inner Product (VSIP) (NDSU-RFT-159). This novel algorithm provides at least a 10-fold increase in clustering and classifying numeric data by providing a horizontal calculation across a vertical P-tree dataset. It is related to but distinct from RFT-75 (PCM). A U.S. provisional application was filed on November 17, 2004 and a PCT patent application was filed November 17, 2005.
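
Illustrative sketch (not part of the meeting discussion or the patent text): the abstract of patent 6,941,303 B2 above describes splitting data into one file per bit position and then recursively dividing each file into quadrants, recording the count of 1-bits until every quadrant is pure-1 or pure-0. The Python below is one minimal way to picture that idea for a small 2-D array. The function names bit_plane and build_count_tree, the dictionary layout of a node, the quadrant order (upper-left, upper-right, lower-left, lower-right) and the restriction to square arrays whose side is a power of two are all assumptions made for this example, not details taken from the patent or from the group's P-tree code.

```python
def bit_plane(values, bit):
    """Extract one bit position from a 2-D array of non-negative integers.

    Hypothetical helper: stands in for the "one file per bit position" step.
    """
    return [[(v >> bit) & 1 for v in row] for row in values]


def build_count_tree(bits, row=0, col=0, size=None):
    """Build a quadrant count tree over a square block of a 0/1 array.

    Each node records the number of 1-bits in its block. A block that is
    pure-0 or pure-1 becomes a leaf; a mixed block is split into four
    quadrants (assumed order: upper-left, upper-right, lower-left,
    lower-right) and each quadrant is processed recursively.
    """
    if size is None:
        size = len(bits)  # assumes a square array with a power-of-two side
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    node = {"count": count, "size": size, "children": []}
    if count == 0 or count == size * size:
        return node  # pure-0 or pure-1 quadrant: stop dividing
    half = size // 2
    for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
        node["children"].append(
            build_count_tree(bits, row + dr, col + dc, half))
    return node


if __name__ == "__main__":
    plane = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
    tree = build_count_tree(plane)
    print(tree["count"])                           # 9
    print([q["count"] for q in tree["children"]])  # [4, 1, 0, 4]
```

In this example only the upper-right quadrant is mixed, so it is the only one that gets divided further; the other three are already pure and are stored as single counts, which is where the compression comes from.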