Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore

Background
• benchmarks help research and development
• the dominant database benchmark is TPC
• SIGMOD Conference 2011
  research track: 87 papers, 17 use TPC (20%)
  industry track: 14 papers, 6 use TPC (43%)
Problem: a few TPC benchmarks but many, many applications
TPC becoming irrelevant?

Vision
a paradigm shift in database benchmark development
from:
• top-down
• committee consensus
• domain-specific package (data generator + queries)
to:
• bottom-up
• community collaboration
• application-specific tools (dataset scaling)
  synthetically scale up/down application data
  application already has queries

Challenge
Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
• s>1
  why: scalability testing
  difficulty: copying doesn’t work (e.g. social network data)
• s<1
  why: application testing
  difficulty: sampling not straightforward (similar to web crawling)
• s=1
  why: privacy/proprietary reasons
  difficulty: encryption is risky

Challenge
Dataset Scaling Problem, continued
similar to D: by query results
difficulty: data correlation
E.g. database = {photos, owners, comments, tags}
kinds of correlation: inter-column, inter-row, inter-column + inter-row
• foreign keys
• photo dimensions (same camera)
• 2 users comment on each other’s photos (social network)
• age and gender
• user likely to comment on own photos
• gardener likely to tag photos of flowers
• tags used by gardener (“rose”, “bee”, “beetle”)

Challenge
scaling a social network:
• extract: D (empirical dataset) to G (empirical social graph); use join query
• scale by s: G to G~ (synthetic social graph); use graph theory: #edges? #triangles? path lengths?
• inject: G~ to D~ (synthetic dataset); any database theory?
E.g. how to inject into D~
• correlation from G~ indicating X and Y comment on each other’s photos
• correlation between Alice’s birthday and wall posts by her classmates
• correlation among tags used by bird watchers

Challenge
• online social networks are here to stay
• their datasets can be huge
• their datasets have commercial value
where is the database theory?
Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?

Vision (for the next 25 years):
a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
• Dataset Scaling Problem
• Attribute Value Correlation Problem for Social Networks
Payoff:
• commercial value in dataset scaling tools
• new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start: UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
• single-server version
• Hadoop version
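
Illustration (not from the slides): a minimal Python sketch of why "copying doesn’t work" for s>1 on social network data. Everything in it is an assumption made for illustration: the toy tables photos and comments, their rows, and the helper names extract_graph, scale_by_copying and count_components. It is not UpSizeR's method. It extracts a social graph from the dataset with a join, scales the dataset by naive copying, and compares the number of connected components before and after.

```python
from collections import defaultdict

# Hypothetical toy dataset D: photos (photo_id -> owner) and comments (commenter, photo_id).
photos = {1: "ann", 2: "bob", 3: "cat"}
comments = [("ann", 2), ("bob", 1), ("bob", 3), ("cat", 3)]


def extract_graph(photos, comments):
    """Join comments with photos: an undirected edge links a commenter to the photo's owner."""
    edges = set()
    for commenter, photo_id in comments:
        owner = photos[photo_id]
        if commenter != owner:  # comments on one's own photos add no edge
            edges.add(frozenset((commenter, owner)))
    return edges


def scale_by_copying(photos, comments, s):
    """Naive scale-up: s renamed copies of every row (s is a positive integer)."""
    new_photos, new_comments = {}, []
    for k in range(s):
        for photo_id, owner in photos.items():
            new_photos[(k, photo_id)] = f"{owner}_{k}"
        for commenter, photo_id in comments:
            new_comments.append((f"{commenter}_{k}", (k, photo_id)))
    return new_photos, new_comments


def count_components(edges):
    """Count connected components among users that have at least one edge."""
    adjacency = defaultdict(set)
    for edge in edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return components


big_photos, big_comments = scale_by_copying(photos, comments, 3)
original_components = count_components(extract_graph(photos, comments))
scaled_components = count_components(extract_graph(big_photos, big_comments))
print(len(big_photos), len(big_comments))      # table sizes scale by s
print(original_components, scaled_components)  # the graph structure does not
```

In this toy case the tables grow exactly threefold, but the one connected community becomes three disjoint copies, so queries that depend on the graph (path lengths, mutual commenters across the whole user base) would behave differently on the copied data than on a dataset that had genuinely grown.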