

Chapter Overview:

1) Data representations mathematically abstract measurements of real-world objects so that we can formally define the problem we are interested in and combine the data with algorithms and models to conduct a data analysis.

2) The key to an appropriate data representation is balancing the abstract and the concrete.

3) The chapter presents 5 goals of data representation and 6 existing challenges and future directions.

Goals of Data Representation

1) Reducing Computation

 “Smart” data representation can dramatically boost computational efficiency. E.g., the Google search engine uses an inverted index to quickly look up documents by keyword.
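To make the inverted-index idea concrete, here is a minimal Python sketch. It is a toy illustration of the data structure only, not a description of how any production search engine is built; the corpus and the function names `build_inverted_index` and `query` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query(index, *keywords):
    """Return ids of documents that contain every keyword (AND query)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy corpus, keyed by document id.
docs = {
    1: "data representation reduces computation",
    2: "huffman code reduces storage",
    3: "sampling large scale data",
}
index = build_inverted_index(docs)
print(query(index, "data"))
print(query(index, "reduces", "storage"))
```

A query only touches the posting sets of its keywords, so its cost depends on how many documents match rather than on the size of the whole corpus.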

2) Reducing Storage and/or Communication

 An appropriate data structure also reduces the space required for data storage and transfer. E.g., Huffman coding is used for lossless data compression.
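As an illustration of Huffman coding, the following Python sketch builds a prefix code from symbol frequencies and checks that decoding recovers the input. The function names and the sample string are assumptions made for this example, and the bit string is kept as text for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free, so the first match is the symbol
            out.append(reverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"  # hypothetical sample input
codes = huffman_codes(text)
bits = encode(text, codes)
print(len(bits), "bits instead of", 8 * len(text))
```

Frequent symbols receive short codes and rare symbols long ones, which is where the saving comes from.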

3) Reducing Statistical Complexity and Discovering the Structure in the Data

 Reducing statistical complexity involves two basic ideas:

 a) Reduce the number of features (dimensionality reduction). Keep only the features with the most explanatory power, which reduces computational complexity and removes noise. Identifying the important features or dimensions is usually done with statistical models such as PCA.

 b) Reduce the number of samples (clustering). Through clustering, the original data are grouped so that the number of samples to be analyzed is reduced.

Also, it is much easier to differentiate clusters than individual samples, which makes the analytical models more powerful in terms of generalization and prediction.
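The clustering idea in b) can be sketched with a small k-means (Lloyd's algorithm) implementation. This is a minimal illustration for 2-D points; the naive initialization from the first k points, the fixed iteration count, and the sample data are all assumptions of this sketch.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Assumption: naive initialization from the first k points.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8),
          (0.8, 1.1), (9.1, 8.9), (8.8, 9.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

After clustering, later analysis can work with the two centroids instead of the six original samples.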

4) Exploratory Data Analysis and Data Interpretation

 Exploratory data analysis refers to simple preliminary examinations of the data that give insight into its properties and help formulate hypotheses about the data.

 Interpreting the data after dimensionality reduction or clustering makes it possible to process the data further with more sophisticated and promising models.

5) Sampling and Large-Scale Data Representation

 Sampling is an effective way to reduce the amount of data, so that an expensive computation can be prototyped on a small sample.

 Sampling also serves as a way to control bias, e.g., by selecting samples that are relevant to the outcome of the analysis.
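The prototyping use of sampling can be illustrated as follows. In this sketch the "expensive" computation is just a mean standing in for something costlier, and the synthetic population, the seed, and the sample size are assumptions chosen for the example.

```python
import random

def expensive_statistic(values):
    """Stand-in for a costly computation; here simply the mean."""
    return sum(values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population of 100,000 synthetic measurements.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
# Uniform random sample without replacement, 1% of the data.
sample = rng.sample(population, 1_000)

full = expensive_statistic(population)
approx = expensive_statistic(sample)
print(f"full data: {full:.2f}  sample: {approx:.2f}")
```

The sample estimate is close to the full-data value at a fraction of the cost, which is what makes prototyping on a sample practical. Uniform sampling is only one choice; a sampling scheme that favors relevant records would replace `rng.sample`.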

Challenges and Future Directions

1) How to Extend Existing Methods to Massive Data Systems

 Most existing models and approaches are not compatible with large-scale computing systems, such as multi-core processors or distributed clusters of commodity machines.

New data representations need to be designed for these systems so that the analysis can benefit from powerful hardware. E.g., MapReduce includes an innovative data representation for distributing tasks among nodes.
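The key-value representation behind the MapReduce programming model can be sketched in a single process. This is not the API of any real MapReduce framework; the function names, the word-count task, and the two input chunks (imagined as living on different nodes) are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

# Hypothetical input, split into chunks that could sit on different nodes.
chunks = ["big data big systems", "data systems scale"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across nodes without changing the result.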

2) Heavy-Tailed and High-Variance Data

 Reducing statistical complexity assumes that the data are sparse and that only a small part of the data is relevant to the output of the analysis. This assumption may not hold in large-scale systems where the data have high variance. E.g., traditional clustering algorithms work poorly on large-scale social networks.

 The “heavy tail” phenomenon requires that predictive or generalizing models be able to capture a small signal against a background of noise.
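To show what a heavy tail looks like in practice, the sketch below compares a light-tailed and a heavy-tailed synthetic data set by how much of the total is contributed by the largest 1% of values. The two distributions, their parameters, and the helper `top_share` are assumptions chosen for illustration.

```python
import random

def top_share(values, fraction=0.01):
    """Fraction of the total sum contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
# Light-tailed: values concentrated around 10.
light = [abs(rng.gauss(10.0, 1.0)) for _ in range(n)]
# Heavy-tailed: Pareto distribution with shape parameter 1.1.
heavy = [rng.paretovariate(1.1) for _ in range(n)]

light_share = top_share(light)
heavy_share = top_share(heavy)
print(f"top 1% share, light-tailed: {light_share:.3f}")
print(f"top 1% share, heavy-tailed: {heavy_share:.3f}")
```

In the heavy-tailed case a handful of extreme values carries a large part of the total, so summaries that assume a few dominant, well-behaved components can be misleading.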

3) Develop a Middleware for Large-Scale Graph Analytics

 There is a trade-off between algorithms that perform better at one particular prediction task and algorithms that are more understandable or interpretable.

A middleware is needed that connects theory and computation, analogous to the connection between graph theory and linear algebra computation.
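The connection between graph theory and linear algebra can be illustrated by expressing breadth-first search as repeated matrix-vector products with the adjacency matrix. This is a small pure-Python sketch; the function names and the example graph are assumptions, and a real system would use sparse matrix routines.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def bfs_levels(adjacency, source):
    """BFS distances from source, computed with matrix-vector products.

    adjacency[i][j] is 1 if there is an edge i -> j, else 0.
    Unreachable vertices keep the distance None.
    """
    n = len(adjacency)
    # Transpose so that a product pushes the frontier along edges i -> j.
    transposed = [[adjacency[i][j] for i in range(n)] for j in range(n)]
    dist = [None] * n
    frontier = [1 if i == source else 0 for i in range(n)]
    level = 0
    while any(frontier):
        for i, active in enumerate(frontier):
            if active:
                dist[i] = level
        reached = matvec(transposed, frontier)
        # Next frontier: reached vertices that have not been visited yet.
        frontier = [1 if r and dist[i] is None else 0
                    for i, r in enumerate(reached)]
        level += 1
    return dist

# Hypothetical graph: undirected path 0-1-2-3 plus an isolated vertex 4.
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(bfs_levels(adjacency, 0))
```

Once a graph algorithm is phrased this way, it can reuse whatever optimized linear algebra layer the platform provides.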

4) Manipulation and Integration of Heterogeneous Data

 Combining diverse data into a single mode damages the structure present in each representation.

 Principled ways are needed to incorporate metadata on top of a base representation.

 The first challenge in data integration is that the most appropriate representation may differ across data sources.

 The second challenge in data integration is how to take advantage of the different information sources.

5) Understanding and Exploiting the Relative Strengths of Data-Oblivious Versus Data-Aware Methods

 Data-oblivious dimensionality reduction methods compute the dimensionality-reducing mapping without using (or having knowledge of) the data.

 Data-aware dimensionality reduction methods tailor the mapping to a given data set.

 The challenge is to merge the benefits of the data-oblivious and data-aware dimensionality reduction approaches.
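The contrast between the two families can be sketched with one representative of each: a Gaussian random projection as a data-oblivious mapping, and the leading principal direction (as in PCA) as a data-aware one. Both choices, the function names, and the sample data are assumptions of this sketch; the power iteration assumes the data are not constant and that the start vector is not orthogonal to the leading direction.

```python
import random

def random_projection_matrix(d, k, seed=0):
    """Data-oblivious: a k x d Gaussian matrix drawn without looking at any data."""
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def top_principal_direction(data, iterations=100):
    """Data-aware: leading eigenvector of the covariance matrix (power iteration)."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [v]  # a 1 x d projection matrix

def project(matrix, data):
    """Apply a k x d projection matrix to every d-dimensional row."""
    return [[sum(m * x for m, x in zip(mrow, row)) for mrow in matrix]
            for row in data]

# Hypothetical data lying along the direction (1, 1, 0) in 3-D.
data = [[float(t), float(t), 0.0] for t in range(10)]

oblivious = random_projection_matrix(d=3, k=2)  # never sees `data`
aware = top_principal_direction(data)           # fitted to `data`
print(project(oblivious, data)[:2])
print(aware)
```

The oblivious matrix can be generated before any data arrive and reused across data sets, while the aware direction adapts to the structure of this particular data set but must be recomputed when the data change.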

6) Combining Algorithmic and Statistical Perspectives

 Much algorithmic work has focused on finding approximate solutions under different models of data access.

 Statistical approaches enable a model to learn from the noisy or uncertain data that are observed.

 The challenge is to combine and exploit the complementary strengths of the algorithmic and statistical approaches.
