Generating Semantic Annotations For Research Datasets

GENERATING AUTOMATIC SEMANTIC
ANNOTATIONS FOR RESEARCH DATASETS
AYUSH SINGHAL AND JAIDEEP SRIVASTAVA
CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA
CONTENTS
 Motivation
 Problem statement
 Proposed approach
   Data type labelling (experiments and results)
   Application concept (experiments and results)
   Similar dataset identification (experiments and results)
 Conclusions and future work
MOTIVATION
 Annotation is the act of adding a note by way of comment or explanation.
 Beyond documents, images and videos are searchable only when they carry tags or annotations (i.e., textual content).
 Recently, genomic and archaeological databases have been annotated for indexing.
ANNOTATING RESEARCH DATASETS
 Without annotations, a dataset has no context and is hard to find through popular search engines.
 Annotation makes the dataset visible and informative.
EXAMPLE OF STRUCTURED ANNOTATION
PROBLEM STATEMENT
 Given a dataset name "D" as a string of English characters, the research task is to generate semantic annotations for the dataset denoted by "D" in the following categories:
 Characteristic data type
 Application domain
 List of similar datasets
PROPOSED APPROACH
Research challenges
 There is no universal schema for describing the content of a dataset; the only common attribute is the dataset name.
 There is no well-known structure for the semantic annotation of research datasets.
 The proposed structure should positively impact a user's search for datasets.
CONTEXT GENERATION
 Critical step: how to generate useful context for a dataset.
 A dataset is described well by its usage in research, i.e., research articles and journals.
 Get a proxy for this usage from web knowledge: the Google Scholar search engine.
 Used the top-50 search results to build the context for the dataset, its "global context" (sketch below).
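As a rough illustration, a minimal Python sketch of this step; the `search_scholar` callable and its result fields are hypothetical placeholders, since the slides only specify "top-50 Google Scholar results".

```python
def build_global_context(dataset_name, search_scholar, k=50):
    """Build a 'global context' for a dataset name from web knowledge.

    `search_scholar` is a hypothetical callable that queries Google
    Scholar and yields result dicts with 'title' and 'snippet' keys;
    the slides only fix the use of the top-50 results.
    """
    pieces = []
    for i, hit in enumerate(search_scholar(dataset_name)):
        if i >= k:                      # keep only the top-k hits
            break
        pieces.append(hit.get("title", ""))
        pieces.append(hit.get("snippet", ""))
    # The global context is the concatenated text of all retrieved hits.
    return " ".join(p for p in pieces if p)
```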
IDENTIFYING DATA TYPE LABELS
 For a dataset 'D':
Given: the global context of 'D' and a list of data types
Required: the data type(s) of 'D'
 Approach: supervised multi-label classification (sketch below)
Feature construction:
1. Preprocessing of the global context (stop-word removal, etc.)
2. BOW and TFIDF representations of the global context of 'D'
3. Dimensionality reduction by PCA, retaining 98% of the variance
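A minimal sketch of the feature-construction and classification pipeline with scikit-learn; the training data shown is a hypothetical placeholder, and one-vs-rest AdaBoost stands in for the AdaBoost.MH used in the slides (scikit-learn ships no AdaBoost.MH).

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: global contexts and author-provided
# data type label sets (the real corpora have 42 / 110 instances).
contexts = ["social network graph of signed trust edges ...",
            "census income records with multivariate attributes ...",
            "road network graph of california nodes and edges ..."]
label_sets = [{"graph", "signed"}, {"multivariate"}, {"graph"}]

# Binarize the label sets into a multi-label indicator matrix.
Y = MultiLabelBinarizer().fit_transform(label_sets)

# Steps 1-2: stop-word removal + TFIDF features over the contexts.
X = TfidfVectorizer(stop_words="english").fit_transform(contexts).toarray()
# Step 3: PCA keeping 98% of the variance.
X = PCA(n_components=0.98).fit_transform(X)
# One-vs-rest AdaBoost as a stand-in for AdaBoost.MH.
clf = OneVsRestClassifier(AdaBoostClassifier()).fit(X, Y)
```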
EXPERIMENTS AND RESULTS
Dataset | Instances | Label count | Label density | Label cardinality
SNAP    | 42        | 5           | 0.34          | 1.69
UCI     | 110       | 4           | 0.275         | 1.1
Ground truth: author-provided data type labels.
Baseline: ZeroR classifier.
Evaluation metrics: standard multi-label classification metrics (Tsoumakas et al., 2010).
SNAP dataset (TFIDF features):
Measure             | ZeroR | AdaBoostMH
F-measure ↑         | 0.025 | 0.172
Average Precision ↑ | 0.657 | 0.663
Macro AUC ↑         | 0.5   | 0.555

UCI dataset (BOW features):
Measure             | ZeroR | AdaBoostMH
F-measure ↑         | 0.854 | 0.873
Average Precision ↑ | 0.908 | 0.924
Macro AUC ↑         | 0.5   | 0.54
CONCEPT GENERATION
 Given a dataset 'D', find k descriptors (n-gram words) for the application of the dataset.
 Approach: concept extraction from world knowledge (Wikipedia, DBpedia).
 Input feature: the global context of 'D'.
 Preprocessing of the global context.
 Used text-analytics tools (AlchemyAPI) for concept generation (sketch below).
 Pruning of the input query terms.
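The slides use AlchemyAPI, which has since been retired; the sketch below assumes a generic AlchemyAPI-style REST endpoint, and the URL, parameter names, and response shape are placeholders rather than the real API.

```python
import requests

def extract_concepts(global_context, api_key, k=10):
    """Query a concept-tagging web service for the top-k concepts.

    CONCEPTS_URL and the request/response shape are hypothetical
    placeholders for the AlchemyAPI-style call used in the slides.
    """
    CONCEPTS_URL = "https://example.com/text/ranked-concepts"  # placeholder
    resp = requests.post(
        CONCEPTS_URL,
        data={"apikey": api_key, "text": global_context, "maxRetrieve": k},
        timeout=30,
    )
    resp.raise_for_status()
    # Assume the service returns {"concepts": [{"text": ..., "relevance": ...}]}
    return [(c["text"], float(c["relevance"]))
            for c in resp.json().get("concepts", [])]
```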
EXPERIMENTS AND RESULTS
 Baseline: context generated from the short description provided by the dataset owner, with the same text pre-processing.
 Evaluation metric: user rating.
[Figure: comparison of average user ratings on the UCI and SNAP datasets.]
IDENTIFYING SIMILAR DATASETS
 Given a dataset 'D', find the k most similar datasets from a list of datasets.
 Approach: cosine similarity between the TFIDF vector of the global context of 'D' and that of each dataset d_i in the list.
 Select the top-k from the list ranked in descending order of similarity (sketch below).
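A minimal sketch of this step with scikit-learn; the `contexts` mapping (dataset name to global context) is a hypothetical placeholder:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(name, contexts, k=5):
    """Rank datasets by cosine similarity of their global-context
    TFIDF vectors; `contexts` maps dataset name -> global context."""
    names = list(contexts)
    X = TfidfVectorizer(stop_words="english").fit_transform(
        [contexts[n] for n in names])
    q = names.index(name)
    sims = cosine_similarity(X[q], X).ravel()
    sims[q] = -1.0                      # exclude the query dataset itself
    order = np.argsort(sims)[::-1][:k]  # descending similarity, top-k
    return [(names[i], sims[i]) for i in order]
```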
EXPERIMENTS AND RESULTS
 Ground truth: the dataset categorization provided by the repository owners (different categorizations for SNAP and UCI).
 Baseline: context generated from the owner's description.
 Evaluation metric: precision@k (sketch below).
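For reference, a minimal sketch of the precision@k computation, assuming `retrieved` is the ranked list of dataset names returned above and `relevant` is the set of datasets sharing the query dataset's repository category:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved dataset names that fall in
    the ground-truth relevant set (repository categorization)."""
    top = retrieved[:k]
    return sum(1 for name in top if name in relevant) / float(k)
```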
[Figure: precision@k on the SNAP and UCI datasets.]
USE CASE: SYNTHETIC QUERYING
 Synthetic querying on the annotated database of research datasets.
 50 queries on the SNAP database and 50 queries on the UCI database.
 Query structure: find a <data type> dataset used for <concept> like <similar to>
 The <fields> are randomly generated from their respective lists (sketch below).
 Evaluation metric: overlap between the context of the retrieved results and the input query.
 Baseline: querying Google and extracting dataset names from the retrieved results.
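A minimal sketch of the query generation and the overlap metric; the three vocabulary lists are placeholders, and Jaccard overlap is an assumption, since the slides do not fix the exact overlap measure.

```python
import random

def synthetic_query(data_types, concepts, datasets):
    """Instantiate the slide's query template with random field values.
    The three lists are placeholders for the annotation vocabularies."""
    return "find a {} dataset used for {} like {}".format(
        random.choice(data_types),
        random.choice(concepts),
        random.choice(datasets),
    )

def term_overlap(result_context, query):
    """Word overlap between a retrieved result's context and the input
    query; Jaccard similarity is assumed here, not taken from the slides."""
    a = set(result_context.lower().split())
    b = set(query.lower().split())
    return len(a & b) / float(len(a | b))
```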
QUANTITATIVE AND QUALITATIVE EVALUATION
Comparison of Google results with the annotated DB for a few sample queries.
CONCLUSIONS AND FUTURE WORK
 Real-world datasets play an important role for testing and validation purposes.
 General-purpose search engines cannot find datasets due to their lack of annotation.
 A novel concept of structured semantic annotation of datasets: data type labels, application concepts, and similar datasets.
 Annotations are generated using global context from the web corpus.
 Data type labels are identified with a multi-label classifier; using web context helps improve accuracy for both the SNAP and UCI test datasets.
CONCLUSIONS AND FUTURE WORK
 Concept generation using web context performs better than the baseline, based on user ratings.
 Web context is not significantly helpful in identifying similar datasets for the UCI and SNAP datasets.
 18% improvement in accuracy over an ordinary Google dataset search (for synthetic queries).
 Future work: finding an overall encompassing structure of annotation; extending the analysis across different domains.
THANK YOU