CS825: Information Visualization
Homework #2
By Ahmed H. Salem
UIN: 00940407

Part 1: Written Assignment

Question 1:

a. Dataset with an ordering relationship:
Dataset Name: US overseas loans and grants
Explanation: The dataset represents the value of financial aid given by the US to other countries. The data can be ordered either by the amount of aid or by its date.
Link: http://www.google.com/publicdata/explore?ds=h0f5ln0phd8c_#!ctype=l&strail=false&bcs=d&nselm=h&met_y=value&scale_y=lin&ind_y=false&rdim=world&idim=world:Earth&ifdim=world&hl=en_US&dl=en_US&ind=false

b. Data with a distance metric:
Dataset Name: Bluetooth traces
Explanation: The dataset contains Bluetooth signals collected over a period of several days. Each signal carries a received signal strength indicator (RSSI) value, which measures the strength of the Bluetooth signal at the receiver side. A distance metric can be derived for this data from the RSSI parameter. This distance is helpful in various Bluetooth applications, for example those concerned with localization.
Link: http://crawdad.cs.dartmouth.edu/meta.php?name=cambridge/haggle

c. Data with an absolute zero:
Dataset Name: Demographic statistics by the U.S. Census Bureau
Explanation: The dataset provides information on population growth, fertility measures, and population density for the countries of the world. The population attribute has an absolute zero, since a population of zero means that no people are present. As a result, all four basic mathematical operations can be meaningfully applied to it.
Link: http://www.google.com/publicdata/explore?ds=h650d9ipptcp4_#!strail=false&bcs=d&nselm=h&rdim=region&idim=country:EG:GZ&ifdim=region&hl=en_US&dl=en_US&ind=false

Question 2: Describe the difference between a data attribute and a value. Use examples to clarify your response.

Datasets should have both attributes and values. Every dataset consists of rows holding the individual instances or readings, and columns holding the features or attributes.
Every attribute has a value in each row, and that value can differ from row to row. For example, consider the cars dataset discussed in class. It has multiple attributes (e.g., Engine Size, Dealer Cost, Cyl., etc.). These attributes are the same for all instances (rows), while each row holds its own value for every attribute. For example, the cylinder attribute for the Toyota Camry LE 4dr is 4, while for the Toyota Camry LE V6 4dr it is 6. The attribute is the same for both instances, but the values are different.

Question 3: Strategies for dealing with missing data in datasets.

Dealing with missing data is a recurring problem for researchers. The following sections briefly describe some strategies and explain when each one should be used.

Deleting rows: This approach is as simple as ignoring the whole row when it contains missing information. Its strength is that it is easy to apply, both computationally and in terms of coding. Its drawback is the amount of data that is discarded entirely.

Replace the missing: In this approach, the missing data is replaced by a specific, noticeable value, for example -1 if all the data is expected to be positive. This makes it very easy to spot the faulty records when they are plotted. The problem with this method is that the data must be handled with great care so that the replaced fields are not used in other operations. For example, when counting how often values occur in that field, including the negative placeholder values would have a misleading effect on the final result.

Averaging the missing: This approach assigns the average value of the attribute to the missing field. The main advantage of this approach is that it leaves the attribute's mean unchanged, so its impact on the data's statistical features is small. The drawback is that the average might hide outliers in the data. The average can also be misleading if the value is compared with other fields in the same record.

Nearest Neighbor: This approach replaces the missing value with the value of the nearest neighbor.
This approach appears to be the best fit for the missing data problem. However, the nearest neighbor is determined using all the features, so it might not be the closest choice with respect to the single feature being replaced.

It is clear from the previous methods that none of them is superior to the others. It is up to the researcher to choose the replacement method that fits the data or, in other words, gives the best results. For example, if the research study is concerned with statistical analysis, row elimination will appear as the best fit, as the other methods can cause a shift in the statistical analysis. If the goal is to plot the results visually, then replacing the missing data with specific values comes into the picture, as it will easily expose the faulty records. If the data trend is to be measured, then the average is a good candidate, as it blends the missing fields into the dataset trend. If a classification approach is to be run on the data, then the nearest neighbor can be applied before the training phase to produce the most coherent dataset possible.
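
The four strategies from Question 3 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: missing readings are assumed to be stored as None, the sample rows are hypothetical (loosely modeled on the engine size and cylinder attributes of the cars example), and the function names are made up for this sketch. The nearest-neighbor variant assumes squared Euclidean distance over the other attributes and takes its donor only from complete rows.

```python
from statistics import mean

# Hypothetical sample rows: (engine_size, cylinders).
# Assumption: a missing reading is stored as None.
rows = [
    (2.4, 4),
    (3.0, 6),
    (2.5, None),
    (3.3, 6),
]


def delete_rows(rows):
    """Deleting rows: drop every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r)]


def replace_with_sentinel(rows, sentinel=-1):
    """Replace the missing: put a noticeable placeholder in each empty field."""
    return [tuple(sentinel if v is None else v for v in r) for r in rows]


def fill_with_mean(rows, col):
    """Averaging the missing: use the mean of the observed values in `col`."""
    avg = mean(r[col] for r in rows if r[col] is not None)
    return [r[:col] + (avg,) + r[col + 1:] if r[col] is None else r
            for r in rows]


def fill_with_nearest(rows, col):
    """Nearest neighbor: copy `col` from the closest complete row."""
    complete = delete_rows(rows)

    def distance(a, b):
        # Squared Euclidean distance over every attribute except `col`,
        # skipping attributes that are also missing in the incomplete row.
        return sum((x - y) ** 2
                   for i, (x, y) in enumerate(zip(a, b))
                   if i != col and x is not None)

    filled = []
    for r in rows:
        if r[col] is None:
            donor = min(complete, key=lambda c: distance(r, c))
            r = r[:col] + (donor[col],) + r[col + 1:]
        filled.append(r)
    return filled
```

On this sample, fill_with_mean(rows, 1) assigns a cylinder count of about 5.33 to the incomplete row, which shows how an average can be misleading when compared with other fields, while fill_with_nearest(rows, 1) copies the value 4 from the row with the closest engine size.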