CSC 177-Term Project Report
C Sc 177 Term Project Cover Page
Due 5-20-08 5pm
(submit it to the CSC Department office before 5pm 5/20/08
or to the instructor at 5:15pm in RVR 5029)
Student(s) Name: Vrushali Shah, Ektabahen Patel, Hetalben Savaliya    Grade: ______
Title of the project: Data mart and Data-mining on REXUS
Hand-in checklist:
A hardcopy of the final report (without appendix) with cover page for the term project
An electronic copy on a CD including all important writings of your term project
• Project oral presentation PowerPoint file with improvements made based
on comments of the class and instructor during the oral presentation.
• Project final report (100%) containing the following parts, font >= 11:
1. objective statement of the term project (1/3 - 1/2 page);
2. background information (1 page);
3. design principle of your data mining system/ scope of study (1/3 - 1/2 page);
4. implementation issues and solutions/ survey results/ diagrams/ tables (3-5 pages);
5. summary of learning experience such as experiments and readings (1/2 - 1 page);
6. references (authors, title, publishing source data, date of publication, URL) and you should quote each
reference in your report text.
7. appendix (optional) containing a set of supporting material such as examples, sample demo
sessions, and any information that reflects your effort regarding the project.
1. Objective
The main objective of our project is to implement a data mart that helps the public as
well as government employees obtain basic information from the given data, and to apply
classification algorithms with a data mining tool to make predictions and compare the
algorithms' results.
The data mart helps answer the following questions about the government property
dataset:
1. How can people find government properties by location?
2. How can people find more details about a government property?
3. How can people find details of the leasing company?
4. How can one find out which property leases expire within a given period?
For data mining, we used the classification algorithms Naive Bayes, J48 and
ID3 to predict the future price and the property ownership (leased/owned) for a
given class attribute, and compared their results.
2. Background Information
The original data for this project comes from http://www.data.gov/. This
website “enables the public to participate in government by providing downloadable
Federal datasets to build applications, conduct analyses, and perform research” [from
http://www.data.gov/about]. We downloaded the dataset that contains the federal government's
REXUS (Real Estate Across the United States) inventory of building and leasing
information. REXUS is a primary tool used by the government to track and manage
its real property assets and to store inventory data. We use this dataset for
our project.
This dataset contains building data, customer data and leasing information. It
consists of two main tables: one contains property information and the other contains leasing
information for the properties. The property dataset has 10310 entries and 17 attributes, and the leasing
dataset contains 8782 entries with 18 attributes.
The property dataset contains the following attributes: Location Code, Region Code, Bldg
Address1, Bldg Address2, Bldg City, Bldg County, Bldg State, Bldg Zip, Congressional
District, Bldg Status, Property Type, Bldg ANSI Usable, Total Parking Spaces,
Owned/Leased, Construction Date, Historical Type, Historical Status and ABA
Accessibility Flag.
The leasing dataset contains the following attributes: Lease Number, Current Expiration Date,
Lease Initial Effective Date, Location Code, Lease ANSI Rentable Sqft, Lease Usable
Sqft, Lse Structured Parking Spaces, Lse Surface Parking Spaces, Lease Annual Rent
Amount, Lease Responsibility, Lessor Name, Lessor In-Care-Of, Lessor Address
1, Lessor Address 2, Lessor Country, Lessor State, Lessor Zip Code and Lessor City.
REXUS contains detailed information about all government properties, such as address,
parking space, historical and categorical information, as well as leasing contract
information if the property is leased. The dataset is not clean. As it is real data, it has
some missing and some anonymous information. Our first step is therefore to clean
the data by removing attributes that are not required to fulfill our
project goal and by filling missing values with a constant value.
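The cleaning rule described above (drop attributes that are not needed, then fill missing values with a constant) can be sketched in a few lines of Python. This is an illustration only, not the procedure used in the project: the placeholder value "UNKNOWN", the helper names and the sample row are assumptions made for the example.

```python
# Hypothetical placeholder; the report does not state which constant was used.
MISSING_PLACEHOLDER = "UNKNOWN"


def drop_attributes(row, unwanted):
    """Return a copy of the row without the attributes listed in `unwanted`."""
    return {key: value for key, value in row.items() if key not in unwanted}


def fill_missing(row, placeholder=MISSING_PLACEHOLDER):
    """Return a copy of the row with None/empty/blank values replaced."""
    return {
        key: (value if value is not None and str(value).strip() else placeholder)
        for key, value in row.items()
    }


# Made-up sample row using attribute names from the property dataset.
sample = {"Bldg City": "SACRAMENTO", "Bldg Address2": "", "Historical Type": None}
cleaned = fill_missing(drop_attributes(sample, {"Bldg Address2"}))
print(cleaned)
```

A single constant keeps every record usable for classification, at the cost of treating "missing" as one more category value.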
3. Design principle of your data mining system/ scope of study
The design of this project comprises data cleaning and preprocessing, data
warehousing and classification. The first phase cleans the data
and makes it compatible for import into the database and the data mining tool. The next phase
implements a data warehouse to retrieve search results from the given data set,
and the final phase applies data mining algorithms to obtain classification results and
compares the algorithms' performance.
For data cleaning and preprocessing, we manually checked all
attribute entries and made changes using Microsoft Office Excel. To implement the
data mart we used PHP 5.* with a MySQL database, and for the website we
used HTML, CSS, JavaScript and jQuery. We used the WEKA tool to obtain the results of the
classification algorithms Naive Bayes, J48 and ID3, and compared their threshold
curve graphs.
4. Implementation issues and solutions/ survey results/ diagrams/ tables
4.1 Data Preprocessing
The first step of our project was data cleaning and preprocessing,
as our dataset was neither clean nor up to date. We made the following changes
to the original dataset during preprocessing:
Preprocessing on the leasing dataset:
1. Removed Lease ANSI Rentable Sqft, Lse Structured Parking Spaces, Lse Surface
Parking Spaces and Lessor Address 2.
2. Changed the date format of Current Expiration Date and
Lease Initial Effective Date to a MySQL-compatible date.
3. Changed the data type of Lease Annual Rent Amount and Lse Structured Parking Spaces to
integer.
4. Renamed all attributes to names compatible with database tables (for example, replacing spaces with
“_”).
Preprocessing on the property dataset:
1. Removed Bldg Address2, Congressional District, Bldg Status, Property Type,
Bldg ANSI Usable and ABA Accessibility Flag.
2. Changed the date format of Construction Date to a MySQL-compatible date.
3. Changed the data type of Total Parking Spaces to integer.
4. Renamed all attributes to names compatible with database tables.
After preprocessing, the data was ready for import into the MySQL database.
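The three kinds of transformation listed above (renaming attributes, converting dates, converting numbers to integers) can be sketched as small Python helpers. This is a hedged illustration rather than the project's actual procedure, which was carried out in Excel. The source date layout "%m/%d/%Y" and the presence of "$" and "," in amounts are assumptions; only the target format YYYY-MM-DD is fixed by MySQL's DATE type.

```python
import re
from datetime import datetime


def to_column_name(attribute):
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", attribute.strip()).strip("_")


def to_mysql_date(text, source_format="%m/%d/%Y"):
    """Convert a date string to MySQL's YYYY-MM-DD format.

    The default source format is an assumption about the raw file.
    """
    return datetime.strptime(text.strip(), source_format).strftime("%Y-%m-%d")


def to_int(text):
    """Convert a numeric string such as "$12,500.00" to an integer (truncating)."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(float(cleaned))


print(to_column_name("Lease Annual Rent Amount"))
print(to_mysql_date("5/20/2008"))
print(to_int("$12,500.00"))
```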
4.2 Data Mart
The next step was to build a data mart. We studied each attribute
and its importance and, based on the given data, drew the star schema for
our project so that the data mart can answer the questions listed in Section 1.
4.2.1 Star Schema
Figure 1: Star Schema
In the star schema of Figure 1, four dimension tables are connected to the fact table.
1. LessorInfo table: This table contains information about the person who leases or
lets a property to another party. It holds the lessor's name,
company name, address and the LessorInfoId that maps to the fact table.
2. PLocationInfo table: It contains brief details of the property location, such as
County, State and City. This table is also used for the overall search by City, State and
County.
3. PDetails table: This table holds the detailed property information, such as
Location Code, Address and so on. It is used for the advanced search when a user
wants to search by construction date, parking area and so on.
4. Leasing Details table: This table holds information about properties that are leased,
such as the Lease Number and when the lease will expire. It is also very useful in the
advanced search, when a user searches by an amount range, parking area,
property area and so on.
5. Fact table: The fact table contains the primary keys of the
dimension tables and one measure, the annual rent amount of a leased
property. The annual amount is an additive measure, so it can be summed across any
of the dimensions.
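To make the star schema concrete, the following sketch builds a reduced version of it and runs two of the searches from Section 1. It is an illustration under stated assumptions: an in-memory SQLite database stands in for the project's MySQL database, the table names follow the list above, and the column names (other than LessorInfoId), the key names and all rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LessorInfo     (LessorInfoId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE PLocationInfo  (PLocationInfoId INTEGER PRIMARY KEY, City TEXT, County TEXT, State TEXT);
CREATE TABLE PDetails       (PDetailsId INTEGER PRIMARY KEY, LocationCode TEXT, Address TEXT);
CREATE TABLE LeasingDetails (LeasingDetailsId INTEGER PRIMARY KEY, LeaseNumber TEXT, ExpirationDate TEXT);
-- The fact table holds one foreign key per dimension plus the additive measure.
CREATE TABLE Fact (
    LessorInfoId     INTEGER REFERENCES LessorInfo(LessorInfoId),
    PLocationInfoId  INTEGER REFERENCES PLocationInfo(PLocationInfoId),
    PDetailsId       INTEGER REFERENCES PDetails(PDetailsId),
    LeasingDetailsId INTEGER REFERENCES LeasingDetails(LeasingDetailsId),
    AnnualAmount     INTEGER
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO LessorInfo VALUES (?, ?)", [(1, "Lessor A"), (2, "Lessor B")])
conn.executemany("INSERT INTO PLocationInfo VALUES (?, ?, ?, ?)",
                 [(1, "Sacramento", "Sacramento", "CA"), (2, "Reno", "Washoe", "NV")])
conn.executemany("INSERT INTO PDetails VALUES (?, ?, ?)",
                 [(1, "LOC1", "1 Example St"), (2, "LOC2", "2 Example St"),
                  (3, "LOC3", "3 Example St")])
conn.executemany("INSERT INTO LeasingDetails VALUES (?, ?, ?)",
                 [(1, "L-001", "2009-06-30"), (2, "L-002", "2010-01-31"),
                  (3, "L-003", "2009-12-15")])
conn.executemany("INSERT INTO Fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 120000), (1, 1, 2, 2, 80000), (2, 2, 3, 3, 50000)])


def total_rent_by_state(connection):
    """Sum the additive measure along the location dimension."""
    query = """
        SELECT p.State, SUM(f.AnnualAmount)
        FROM Fact f JOIN PLocationInfo p ON p.PLocationInfoId = f.PLocationInfoId
        GROUP BY p.State ORDER BY p.State
    """
    return connection.execute(query).fetchall()


def leases_expiring_between(connection, start, end):
    """Lease numbers whose expiration date falls in [start, end] (ISO dates)."""
    query = """
        SELECT LeaseNumber FROM LeasingDetails
        WHERE ExpirationDate BETWEEN ? AND ?
        ORDER BY ExpirationDate
    """
    return [row[0] for row in connection.execute(query, (start, end))]


print(total_rent_by_state(conn))
print(leases_expiring_between(conn, "2009-01-01", "2009-12-31"))
```

Because the annual amount is additive, the same SUM can be grouped by lessor or by any other dimension simply by joining a different dimension table.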
4.3 Data Mining and Performance Comparison
Our final step was to use the WEKA tool to apply data mining classification
algorithms to our dataset and compare their performance. This phase consisted of
the following steps: import the data into WEKA, apply the algorithms, and compare
their performance.
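One common way to bring tabular data into WEKA is its ARFF text format, which declares each attribute and then lists the rows. The report does not say which import route was used, so the sketch below is only an assumption about how that step could look; the function name, relation name and sample rows are hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    `attributes` is a list of (name, kind) pairs where kind is either the
    string "numeric" or a list of allowed nominal values. Names are assumed
    to contain no spaces. Missing values (None) are written as "?".
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            values = ",".join(kind)
            lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


# Made-up rows using attribute names from the property dataset.
arff_text = to_arff(
    "property",
    [("Bldg_State", ["CA", "NV"]),
     ("Total_Parking_Spaces", "numeric"),
     ("Owned_Leased", ["OWNED", "LEASED"])],
    [("CA", 12, "OWNED"), ("NV", None, "LEASED")],
)
print(arff_text)
```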
We applied data mining to both tables, the leasing data and the property data,
using the algorithms J48, Naive Bayes and ID3. The models were built on three
attributes of the property data: Bldg State, Owned/Leased and Historical Type,
and were evaluated on the training data set. The threshold curves (false positive
rate vs. true positive rate) for the Owned and Leased classes are as follows:
[Figure : ID3 threshold curves for the Owned and Leased classes]
[Figure : J48 threshold curves for the Owned and Leased classes]
[Figure : Naive Bayes threshold curves for the Owned and Leased classes]
From the above, we conclude that Naive Bayes has a more informative curve
than ID3 and J48. Moreover, the confusion matrix we obtained for Naive Bayes
looks more promising. Therefore, Naive Bayes performs better than
ID3 and J48 on this dataset.
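For readers unfamiliar with threshold curves, the sketch below shows how the points of such a curve (false positive rate vs. true positive rate) and the underlying confusion-matrix counts can be derived from a classifier's scores. It is a generic illustration, not WEKA's implementation; the labels and scores are made up and do not reproduce the project's results.

```python
def confusion_counts(labels, scores, threshold, positive="LEASED"):
    """Return (tp, fp, tn, fn) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == positive:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_curve(labels, scores, positive="LEASED"):
    """List of (false positive rate, true positive rate) points.

    Assumes both classes occur at least once in `labels`.
    """
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold, positive)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve; closer to 1.0 is better."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up example: score = predicted probability of the LEASED class.
labels = ["LEASED", "LEASED", "OWNED", "LEASED", "OWNED"]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
curve = threshold_curve(labels, scores)
print(curve)
print(area_under_curve(curve))
```

Comparing the area under each algorithm's curve gives a single number to back up a visual judgment of which curve is "more informative".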
4.3.1 Data mining tool comparison:
For this term project, we compared the results of the ID3 algorithm in WEKA
and RapidMiner. To compare the two tools, we built a decision tree in each and
compared the outcomes. The WEKA decision tree was as follows:
[Figure : WEKA Decision tree]
The RapidMiner decision tree was as follows:
[Figure : RapidMiner Decision tree]
From the above trees, we conclude that the RapidMiner tree is more
meaningful. Using this tree, one can predict whether to buy or to lease a property in a
particular state, whereas the WEKA decision tree is not clear. Therefore, we also
conclude that WEKA is less useful when the dataset is large.
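As background for reading either tree, ID3 chooses the attribute to split on by information gain: the reduction in entropy of the class attribute after partitioning the rows by that attribute. The sketch below computes it for a tiny made-up table; the rows and the underscore attribute names are assumptions for illustration and do not come from the project's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    return base - remainder


# Made-up rows: here the state fully determines ownership, the historical type does not.
rows = [
    {"Bldg_State": "CA", "Historical_Type": "A", "Owned_Leased": "OWNED"},
    {"Bldg_State": "CA", "Historical_Type": "B", "Owned_Leased": "OWNED"},
    {"Bldg_State": "NV", "Historical_Type": "A", "Owned_Leased": "LEASED"},
    {"Bldg_State": "NV", "Historical_Type": "B", "Owned_Leased": "LEASED"},
]
gain_state = information_gain(rows, "Bldg_State", "Owned_Leased")
gain_hist = information_gain(rows, "Historical_Type", "Owned_Leased")
print(gain_state, gain_hist)
```

In this toy table ID3 would put Bldg_State at the root, which matches the kind of per-state buy-or-lease reading described above.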
5. Summary of learning experience such as experiments and readings
• Learned a data mining tool, WEKA.
• Got a better understanding of classification algorithms such as J48, Naive Bayes and
ID3.
• Learned PHP and MySQL.
• Got awareness of data mart applications and how they are useful in day-to-day
life.
• Experienced the advantages of teamwork.
• Read many articles to get a clear idea of how to do data mining and warehousing.
6. References
• http://www.data.gov/
• http://www.slideshare.net/dataminingtools/rapidminer-introduction-to-datamining
• Shmueli, G., Patel, N. and Bruce, P., Data Mining for Business Intelligence (textbook).
• Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan
Kaufmann, 2001.
• Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
APPENDIX :