CSC 177-Term Project Report C Sc 177 Term Project Cover Page Due 5-20-08 5pm (submit it to the CSC Department office before 5pm 5/20/08 or to the instructor at 5:15pm in RVR 5029) Student(s) Name : Vrushali Shah, Ektabahen Patel, Hetalben Savaliya__ Grade ______ Title of the project: Data mart and Data-mining on REXUS Hand-in-check list: A hardcopy of final report (without appendix) with cover page for the term project An electronic copy on a CD including all of important writings of your term project Project oral presentation power point file with improvement made based on comments of the class and instructor during oral presentation. Project final report (100%) containing the following parts, font >= 11: 1. 2. 3. 4. 5. 6. 7. objective statement of the term project (1/3 -1/2 page); background information (1 page); design principle of your data mining system/ scope of study (1/3 – 1/2 page); implementation issues and solutions/ survey results/ diagrams/ tables (3-5 pages); summary of learning experience such as experiments and readings (1/2 - 1 page); references (authors, title, publishing source data, date of publication, URL) and you should quote each reference in your report text. appendix (optional) containing a set of supporting material such as examples, sample demo sessions, and any information that reflects your effort regarding the project. 1 CSC 177-Term Project Report 1. Objective Main objective of our project is to implement datamart which can help people as well as government employees to get basic information based on the given data and apply datamining rules using data mining tool for future predition and compare the algorithm’s result. This datamart can help to provide answers for following questions based on government property dataset: 1. How People can find Government Properties based on Location ? 2. How People can find more details on Government Properties ? 3. How People can find details of leasing company ? 4. How one can know which property get expired in given duration? For datamining, we have used classification algorithms :Naive Bayes, J48 and ID3 to predict the future price and the property ownership (leased/owened) based on given class and compare results of them. 2. Background Information The original database of this project comes from http://www.data.gov/. This website “enables the public to participate in government by providing downloadable Federal datasets to build applications, conduct analyses, and perform research” [from http://www.data.gov/about]. We downloaded dataset that contains federal government REXUS (Real Estate Across the United States) inventory building and leasing information. This REXUS is a primary tool used by government to track and manage government’s real property assets and store inventory data. We are using this dataset for our project. This dataset contains building data, customer data and leasing information. There are mainly 2 tables. One contains property information and second data contain leasing information of property. Property dataset has 10310 entries and 17 attributes and leasing dataset contain 8782 entries with 18 attributes. Property dataset contains following attributes: Location Code, Region Code, Bldg Address1, Bldg Address2, Bldg City, Bldg County, Bldg State, Bldg Zip, Congressional 2 CSC 177-Term Project Report District, Bldg Status, Property Type, Bldg ANSI Usable, Total Parking Spaces, Owned/Leased, Construction Date, Historical Type, Historical Status and ABA Accessibility Flag. Leasing property contains following attributes: Lease Number, Current Expiration Date, Lease Initial Effective Date, Location Code, Lease ANSI Rentable Sqft, Lease Usable Sqft, Lse Structured Parking Spaces, Lse Surface Parking Spaces, Lease Annual Rent Amount, Lease Responsibility, Lessor Name, Lessor In-Care-Of, Lessor Address 1,Lessor Address 2,Lessor Country, Lessor State, Lessor Zip Code, Lessor City REXUX contain detail information about all government property like address, parking space, historical and categorical information as well as leasing contract information if property is on leased. This dataset is not cleaned. As it is real data, it has some missing information and some anonymous information. So our first step is to clean this data by removing some unwanted attributes, which are not require to fulfill our project goal and fill missing value with some constant value. 3. Design principle of your data mining system/ scope of study The design principles of this project included data cleaing and preprocessing, data warehousing and classification. The first phase of this project includes cleaning the data and make it compatible to import in database and datamining tool, the next phase includes to implement data warehouse to retrive the search result from the given data set and the final step is to apply datamining algorithms to get classification result and compare this algorithm’s performance. For Data cleaning and preprocesing, we went through manually checking all attribute entries and made changes using Microsoft Office Excell, for implimenting datamart we used PHP 5.* with MySQL database and for website implimentation, we used HTML, CSS, Javascript and JQuery. We used Weka tool to get result of classification algorithms such as Naïve Bayes, J48, and ID3 and compared their threshold curve graphs. 3 CSC 177-Term Project Report 4. Implementation issues and solutions/ survey results/ diagrams/ tables 4.1 Data Preprocessing The first step of our project was to go through data cleaning and data pre processing steps, as our dataset was not cleaned and up to date. We removed following attributes from original dataset during data preprocessing: Preprocessing on leasing property dataset: 1. Removed Lease ANSI Rentable Sqft , Lse Structured Parking Spaces, Lse Surface Parking Spaces, Lessor Address 2. 2. Changed date format to MySQL compatible date for current Expiration Date, Lease Initial Effective Date. 3. Changed Lease Annual Rent Amount, Lse Structured Parking Spaces datatype to integer. 4. Renamed all attribute to database table compatible name. Like replace space with “_”). Preprocessing on Property dataset: 1. Removed Bldg Address2, Congressional District, Bldg Status, Property Type, Bldg ANSI Usable and ABA Accessibility Flag. 2. Changed date format to MySQL compatible date for Construction Date. 3. Changed Parking Spaces data type to integer. 4. Renamed all attribute to database table compatible name. We made data compatible to import in MySQL database after preprocessing step . 4.2 Data Mart The next step was to build a data mart. For this, we understood all attributes information and their importance and based on given data, we drew the star schema for our project to achieve our goal to get answer of following questions: 4 CSC 177-Term Project Report 4.2.1 Star Schema Figure 1: Star Schema In the above star schema there are four dimensional tables connected with the fact table. 1. LessorInfo table : This table contains the information about the person who leases or lets a property to the another person. Therefore, in this table it contains its name, company name, address and LessorInfoId to map with fact table. 2.PLocationInfo table :It contains the brief details of the property location such as County, State and City. This table is also used for overall search by just City, State and County. 3.PDetails : In this table, it has all the detailed information about Property Details such as Location code, Address and so on. This table is used for advanced search when anyone want to search using its construction date, parking area and so on. 4. Leasing Details : This table has information about the property who can be on lease such as its Lease Number, when lease will expire. This table is also very useful in Advanced search when one can search by giving some amount range, parking area or property area and so on. 5 CSC 177-Term Project Report 5. Fact table : It is the fact table which contains all information about the Primary keys of Dimensional tables and a measure -which is the annual amount of the property who can be in lease.In this fact table, measure is the annual amount which can be added to any dimensional table. Amount is the additive measure type. 4.3 Data Mining and Compare performance Our final step is to use WEKA tool for applying Data mining classification algorithm on our dataset and compare their performance. During this phase we go through following steps: Transport data in WEKA tool, Apply algorithms, Compare performance. In this project, we have also implemented data mining on the two tables such as Leasing Data and Property Data. We have applied data mining algorithms such as J48,NaiveBayes and ID3.The data is evaluated on three attributes of property data such as Bldg State, Owned/Leased and Historical Type. This is evaluated by using training data set. The threshold curve for false positive rate vs true positive rate of Owned and Leased property attribute is as follows: ID3 Threshold Curve Result : Owned Leased J48 Threshold Curve Result : Owned Leased 6 CSC 177-Term Project Report Naive Bayes Threshold Curve Result : Owned Leased From The above we can conclude that Naive Bayes has more informative curve compared to ID3 and J48.Moreover,the confusion matrix we got in calculation seems more promising in Naive Bayes algorithm. Therefore, Naive Bayes is better compared to ID3 and J48. 4.3.1 Data mining tool Comparision : For this term projct,we have decided to use ID3 algorithm result across WEKA and Rapidminer.To compare WEKA and Rapidminer ,we made decision tree and compared the outcome. The decision tree of WEKA was as follows : [Figure : WEKA Decision tree] 7 CSC 177-Term Project Report The Decision tree of Rapidminer was as follows : [Figure : Rapid miner Decision tree] From the above trees, we can conclude that the outcome tree of Rapid miner is more meaningful. Using this tree, one can predict whether to buy or to lease the property in the particular state. Whereas in WEKA the decision tree is not clear. Therefore, we can also conclude that WEKA is not useful when the dataset is larger. 5. Summary of learning experience such as experiments and readings Learned Got Data Mining tool such as WEKA better understanding of classification algorithms such as J48,Naive Bayes and ID3 algorithm. Learned Got PHP and MySQL. awareness of Data Mart applications and how they are being useful in day to day lives. Team work advantages Read many articles to get clear idea of how to do data mining and warehousing. 6. References http://www.data.gov/ http://www.slideshare.net/dataminingtools/rapidminer-introduction-to-datamining 8 CSC 177-Term Project Report Textbook: Data Mining for Business Intelligence by Galit Shumeli, Nitin Patel, Peter Bruce. Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000. Witten, Ian and Frank, Eibe, Data Mining: Practical machine learning tools with Java implementations, Morgan Kaufmann Publishing, San Francisco 2000. 9 CSC 177-Term Project Report APPENDIX : 10 CSC 177-Term Project Report 11 CSC 177-Term Project Report 12