file - ODU Computer Science

CS825: Information Visualization
Homework #2
By
Ahmed H. Salem
UIN: 00940407
Part 1: Written Assignment
Question 1:
a. Dataset with ordering relationship:
Dataset Name: US overseas loans and Grants
Explanation: The dataset represents the value of financial aid given by the US
to other countries. The data can be ordered either by the amount of aid or its
date.
Link:
http://www.google.com/publicdata/explore?ds=h0f5ln0phd8c_#!ctype=l&strail=false&bcs=d&nselm=h&met_y=value&scale_y=lin&ind_y=false&rdim=world&idim=world:Earth&ifdim=world&hl=en_US&dl=en_US&ind=false
b. Data with a distance metric:
Dataset name: Bluetooth traces
Explanation: The dataset represents Bluetooth signals collected over a period of
several days. Each signal record contains a value called received signal strength
(RSSI), which describes the strength of the Bluetooth signal at the receiver side.
A distance can be estimated for this data from the RSSI value, since a weaker
signal generally indicates a more distant sender. This distance can be very helpful
in various Bluetooth applications, for example those concerned with localization.
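One common way to turn an RSSI reading into an approximate distance is the
log-distance path-loss model. The sketch below is only an illustration: the function
name rssi_to_distance, the reference power of -59 dBm at 1 m, and the path-loss
exponent of 2.0 are assumptions, not values taken from the dataset.

```python
def rssi_to_distance(rssi_dbm, ref_power_dbm=-59.0, path_loss_exponent=2.0):
    """Estimate distance in metres from an RSSI reading (log-distance model).

    ref_power_dbm is the assumed RSSI at 1 m. Both defaults are hypothetical
    and would need calibration for a real device and environment.
    """
    return 10 ** ((ref_power_dbm - rssi_dbm) / (10.0 * path_loss_exponent))


# A reading equal to the reference power maps to 1 m; weaker readings map farther.
for rssi in (-59.0, -79.0):
    print(rssi, round(rssi_to_distance(rssi), 2))
```

In practice the estimate is noisy, so it is better treated as a rough ordering of
nearer and farther devices than as an exact measurement.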
Link: http://crawdad.cs.dartmouth.edu/meta.php?name=cambridge/haggle
c. Data with an absolute zero:
Dataset Name: Demographic statistics by the U.S. census bureau
Explanation: The dataset provides information on population growth, fertility
measures, and population density for the countries of the world. The population
attribute has an absolute zero, since a value of zero means that there are no people
at all. All four basic mathematical operations can therefore be applied to it.
Link:
http://www.google.com/publicdata/explore?ds=h650d9ipptcp4_#!strail=false&bcs=d&nselm=h&rdim=region&idim=country:EG:GZ&ifdim=region&hl=en_US&dl=en_US&ind=false
Question 2: Describe the difference between a data attribute and a
value. Use examples to clarify your response.
Datasets have both attributes and values. Every dataset consists of rows holding the
individual instances or readings, and columns holding the features or attributes.
Each attribute takes a value in every row, and that value can differ from row to row.
For example, consider the cars dataset discussed in class. It has multiple attributes
(e.g. Engine Size, Dealer Cost, Cyl., etc.). These attributes are the same for all
instances (rows), while each attribute holds the value that corresponds to its row.
For example, the cylinder attribute for Toyota Camry LE 4dr is 4, while for Toyota
Camry LE V6 4dr it is 6. The attribute is the same for both instances, but the values
are different.
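The distinction can be sketched in a few lines of Python. The two rows below use only
the cylinder values quoted above; the key names "Name" and "Cyl" are chosen for
illustration and are not necessarily the column names of the class dataset.

```python
# Two rows (instances) of a cars table: the keys are attributes, the entries are values.
cars = [
    {"Name": "Toyota Camry LE 4dr", "Cyl": 4},
    {"Name": "Toyota Camry LE V6 4dr", "Cyl": 6},
]

# The attribute set is identical for every row...
attributes = sorted(cars[0].keys())
same_attributes = all(sorted(row.keys()) == attributes for row in cars)

# ...while the value stored under one attribute varies by row.
cyl_values = [row["Cyl"] for row in cars]

print(attributes, same_attributes, cyl_values)
```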
Question 3: Strategies for dealing with missing data in datasets.
Dealing with missing data is a recurring problem for researchers. The following
sections briefly describe some common strategies and explain when each one should be
used.
Deleting rows:
This approach is as simple as ignoring the whole row when it contains missing
information. Its strength is that it is easy to apply, both computationally and in
terms of coding. Its drawback is the amount of data that is discarded entirely.
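A minimal sketch of row deletion, assuming missing fields are stored as None. The
records and the helper name drop_incomplete are hypothetical and serve only to show
how much data can be lost.

```python
# Hypothetical toy records; None marks a missing reading.
rows = [
    {"id": 1, "height": 170.0, "weight": 65.0},
    {"id": 2, "height": None, "weight": 80.0},
    {"id": 3, "height": 182.0, "weight": None},
    {"id": 4, "height": 160.0, "weight": 55.0},
]


def drop_incomplete(records):
    """Keep only the records in which no field is missing."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = drop_incomplete(rows)
discarded = len(rows) - len(complete)
print(len(complete), "rows kept,", discarded, "rows discarded")
```

Here half of the rows are lost even though each discarded row is missing only one
field, which illustrates the drawback described above.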
Replacing the missing:
In this approach the missing data is replaced by a specific, noticeable value, for
example -1 if all the data is expected to be positive. This makes it very easy to
spot the faulty records when plotted. The problem with this method is that the data
must be handled with great care so that the replaced fields are not used in other
operations. For example, when counting how many times a value appears in that field,
counting the negative placeholders would have a misleading effect on the final
result.
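The pitfall can be sketched as follows, assuming missing readings are stored as None
and -1 is used as the placeholder. The readings are hypothetical.

```python
SENTINEL = -1  # assumed marker; usable only because real readings are positive

readings = [12, None, 7, None, 9]  # hypothetical values, None = missing

# Replace each missing reading with the sentinel so it stands out in a plot.
marked = [SENTINEL if r is None else r for r in readings]

# Pitfall: naive statistics treat the sentinels as real observations.
naive_count = len(marked)
naive_mean = sum(marked) / len(marked)

# Safer: filter the sentinels out before computing anything.
valid = [r for r in marked if r != SENTINEL]
safe_count = len(valid)
safe_mean = sum(valid) / len(valid)

print(marked, naive_count, naive_mean, safe_count, safe_mean)
```

The naive count and mean both include the two placeholders, so every later operation
on this field needs the same filtering step.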
Averaging the missing:
This approach assigns the average value of the attribute to the missing field. Its
main advantage is that it has minimal impact on the statistical features of the data.
Its drawback is that it might hide the outliers in the data. The average might also
be misleading if the value is compared with other fields in the same record.
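A minimal sketch of averaging, assuming missing fields are stored as None. The
column values are hypothetical.

```python
from statistics import mean, pvariance

values = [4.0, None, 6.0, 8.0, None]  # hypothetical column, None = missing

# Average of the observed values only.
observed = [v for v in values if v is not None]
column_mean = mean(observed)

# Fill every missing field with that average.
imputed = [column_mean if v is None else v for v in values]

# The mean is unchanged, but the spread of the column shrinks.
print(imputed, mean(imputed), pvariance(observed), pvariance(imputed))
```

The mean of the column is preserved, while its variance becomes smaller because the
filled-in fields all sit exactly at the average.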
Nearest Neighbor:
This approach replaces the missing value with the value of the nearest neighbor. This
approach appears to be the best fit for the missing data problem. However, the
nearest neighbor is determined using all the features, so it might not be the closest
choice with respect to the single feature being replaced.
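A minimal sketch of nearest-neighbor replacement, assuming Euclidean distance over
the remaining features and None for the missing field. The records, the three-field
layout, and the helper name impute_nearest are hypothetical.

```python
import math

# Hypothetical records: (feature_a, feature_b, target); None = missing target.
records = [
    (1.0, 2.0, 10.0),
    (1.1, 2.1, None),
    (5.0, 6.0, 50.0),
]


def impute_nearest(data):
    """Fill a missing target with the target of the closest complete record,
    where closeness is Euclidean distance over the other two features."""
    complete = [r for r in data if r[2] is not None]
    filled = []
    for a, b, target in data:
        if target is None:
            nearest = min(complete, key=lambda r: math.dist((a, b), r[:2]))
            target = nearest[2]
        filled.append((a, b, target))
    return filled


print(impute_nearest(records))
```

The second record borrows its value from the first one, which is closest in the
other features; whether that value is also close to the true missing value is not
guaranteed.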
It is clear from the previous methods that none of them is superior to the others in
every case. It is up to the researcher to choose the replacement method that fits the
data or, in other words, gives the best results. For example, if the study is
concerned with statistical analysis, row elimination will appear as the best fit, as
the others can cause a shift in the statistical analysis. If the goal is to plot the
results visually, then replacing the missing values with specific values comes into
the picture, as it easily exposes the faulty records. If the data trend is to be
measured, then the average will be a good candidate, as it blends the missing fields
into the dataset trend. If a classification approach is to run on the data, then the
nearest neighbor can be applied before the training phase to produce the most
coherent dataset possible.