A Characterization of Big Data Benchmarks
Wen Xiong, Zhibin Yu, Zhendong Bei, Juanjuan Zhao, Fan Zhang, Yubin Zou, Xue Bai, Ye Li, Chengzhong Xu
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Agenda
• Background
• Motivation
• Methodology
• Evaluation
• Conclusion
• Future work

Background
• Requirements of a benchmark suite
• Characteristics of different workload-input pairs
• Spatio-temporal data in a real-world system

ETI Confidential 13/04/2015

Background (1/3)
• Requirements of a benchmark suite
– A benchmark suite should contain workloads that represent a wide range of application domains.
– Workloads in a benchmark suite should be as diverse as possible.
– A benchmark suite should not contain redundant workloads, so that simulation or measurement time stays as short as possible.

Background (1/3)
• Simulation time for different numbers of workload-input pairs
Removing redundancy can cut the number of workload-input pairs by 30% and simulation time by 40%.

Background (2/3)
• Characteristics of different workload-input pairs
– Characteristics of workloads as the input data set size changes
  • Stable
  • Unstable

Background (3/3)
• Spatio-temporal data in the Shenzhen Transportation System
– GPS trajectory data of taxicabs: 30,000+ taxicabs, 90 million GPS points per day.
– Smart card data in the metro transportation system: 15+ million smart cards, 12+ million transaction records per day.

Background (3/3)
(1) 2000 square kilometers, 18 million people.
(2) The road network in Shenzhen contains 73515 vertices and 101794 road segments.

Motivation
• Remove redundancy of a typical benchmark suite
• Provide a benchmark suite for spatio-temporal data

Motivation (1/2)
• Remove redundancy of a typical benchmark suite
– Decrease the experiment time of benchmarking the target system by minimizing the number of typical workload-input pairs.

Motivation (2/2)
• Provide a benchmark suite for spatio-temporal data
– Representative workloads in our benchmark suite are as follows:
  • transaction count (hotregion)
  • spatiotemporal origin destination (sztod)
  • map matching
  • hotspot monitoring
  • spatiotemporal secondary sort

Methodology
• Typical MapReduce-based workloads
• Microarchitecture-level metrics
• Principal component analysis (PCA)
• Hierarchical clustering and K-means clustering

Methodology
• Typical MapReduce-based workloads (1/2):

index  workload        source
1      sort            HiBench
2      wordcount       HiBench
3      terasort        HiBench
4      bayes           HiBench
5      K-means         HiBench
6      Nutch indexing  HiBench
7      pagerank        HiBench
8      hive-join       HiBench
9      hive-aggregate  HiBench
10     grep            DCBench
11     svm             DCBench

Methodology
• Typical MapReduce-based workloads (2/2):

index  workload   source
12     ibcf       DCBench
13     fpg        DCBench
14     hmm        DCBench
15     sztod      our internal program for trajectory data
16     hotregion  our internal program for trajectory data

Methodology
• Microarchitecture-level metrics are as follows:
– Instructions per cycle (IPC)
– L1 instruction cache miss ratio
– L2 instruction cache miss ratio
– Last-level cache miss ratio
– Branch prediction per instruction
– Branch misprediction per instruction
– Off-chip bandwidth utilization

Methodology
• Principal component analysis:
– It reduces the number of program characteristics (dimensions) while controlling the amount of information that is thrown away.
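
The PCA step above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each workload-input pair is a row of metric values, standardizes every metric, and uses plain power iteration (a stand-in for a full eigendecomposition) to find only the first principal component and the share of variance it retains. The function names and the sample values are hypothetical.

```python
import math

def standardize(rows):
    """Scale every metric (column) to zero mean and unit sample variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1)) or 1.0
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance(rows):
    """Covariance matrix of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(rows, iterations=500):
    """Return (unit direction, fraction of total variance kept) for the first PC."""
    cov = covariance(standardize(rows))
    d = len(cov)
    # Asymmetric start vector, so it is unlikely to be orthogonal to the answer.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return v, (eigenvalue / total if total else 0.0)

# Made-up metric rows (IPC, L1I miss ratio, branch misprediction ratio);
# these are illustrative numbers, not measurements from the slides.
sample = [
    [0.72, 0.090, 0.030],
    [0.85, 0.050, 0.020],
    [0.96, 0.040, 0.015],
    [0.80, 0.198, 0.056],
]
direction, kept = first_component(sample)
print("first component:", [round(x, 3) for x in direction])
print("variance kept:", round(kept, 3))
```

Keeping only the components whose cumulative variance share passes a chosen threshold is one way to control how much information is discarded.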

Methodology
• Hierarchical clustering
– Hierarchical clustering is a "bottom-up" approach: each workload-input pair starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. It is useful for looking at multiple clustering possibilities simultaneously, and a dendrogram can be used to select the desired number of clusters.
• K-means clustering
– K-means clustering aims to partition n workload-input pairs into k clusters in which each workload-input pair belongs to the cluster with the nearest mean, where k is a value specified by the user.

Evaluation (instructions per cycle)
The IPC of these sixteen workloads ranges from 0.72 to 0.96, with an average of 0.85. Wordcount has the lowest IPC and hotregion has the highest among these workloads.

Evaluation (L1 ICache miss ratio)
The L1 instruction cache miss ratios of these typical workloads range from 3.9% to 19.8%, with an average of 8.9%. svm has the lowest L1 instruction cache miss ratio and hive-aggre has the highest.

Evaluation (L2 ICache miss ratio)
The L2 instruction cache miss ratios of these workloads range from 23.7% to 64.9%. On average, the DCBench workloads (right side) have a higher L2 instruction miss ratio than the HiBench workloads (left side). Overall, the L2 cache is ineffective on our experiment platform.

Evaluation (branch prediction per instruction)
These values range from 0.18 to 0.23, with an average of 0.21. Hotregion has the lowest branch prediction per instruction, while nutchindexing has the highest.

Evaluation (branch misprediction ratio)
These ratios range from 1.5% to 5.6%, with an average of 2.7%.
Pagerank has the lowest branch misprediction ratio, while nutchindexing has the highest. The results show that the branch predictor of our processor matches these typical MapReduce-based applications.

Evaluation (off-chip bandwidth utilization)
Among the workloads we evaluated, terasort has the highest utilization ratio, at 14%. Overall, on our experiment platform, the processors significantly over-provision off-chip bandwidth for these typical workloads.

Evaluation (Hierarchical clustering)
[Figure: dendrogram of the workload-input pairs; axis: linkage distance, 2 to 8; annotations a, b, c.]
Leaves, in plotted order: sort-30G, sort-60G, sort-15G, terasort-100G, terasort-50G, bayes, terasort-25G, sztod-98G, pagerank, hotregion-17G, hotregion-35G, grep-80G, hive-join, sztod-49G, svm-40G, grep-20G, hotregion-70G, hive-aggre, wordcount-15G, wordcount-30G, wordcount-60G, k-means, sztod-24G, ibcf-8G, hmm-16G, ibcf-4G, hmm-32G, hmm-8G, svm-20G, grep-40G, ibcf-2G, svm-10G, fpg, nutchindexing.

Evaluation (Hierarchical clustering)

index  cluster type    workloads
1      strong cluster  wordcount, sort, terasort
2      weak cluster    sztod, hotregion
3      non cluster     svm, ibcf

(1) Strong cluster: three workload-input pairs of the same workload are clustered together.
(2) Weak cluster: two workload-input pairs of the same workload are clustered together.
(3) Non cluster: no workload-input pairs of the same workload are clustered together.
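
The three cluster types defined above can be expressed as a short rule. The sketch below is an illustration only: it assumes each workload was run with three input sizes and that a clustering step has already assigned a cluster id to every workload-input pair. The function name `cluster_type` and the cluster ids in the example are hypothetical, not taken from the dendrogram.

```python
from collections import Counter

def cluster_type(cluster_ids):
    """Classify one workload from the cluster ids of its workload-input pairs.

    The largest group of pairs sharing a cluster decides the type:
    three (or more, an assumption beyond the slides) -> strong,
    exactly two -> weak, otherwise -> non cluster.
    """
    largest_group = max(Counter(cluster_ids).values())
    if largest_group >= 3:
        return "strong cluster"
    if largest_group == 2:
        return "weak cluster"
    return "non cluster"

# Hypothetical cluster ids for the three input sizes of each workload.
assignments = {
    "wordcount": [4, 4, 4],
    "sztod": [1, 3, 3],
    "svm": [2, 5, 7],
}
for workload, ids in assignments.items():
    print(workload, "->", cluster_type(ids))
```

A workload whose pairs all land in one cluster behaves the same regardless of input size, so a single pair can stand in for it; a non-cluster workload needs each input size kept.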

Evaluation (K-means clustering)
• Selecting 8 workload-input pairs via K-means clustering

cluster  representative  workloads
1        hmm-16G         sztod-98G, hotregion-17G, hmm-16G
2        fpg             fpg, ibcf-2G
3        sztod-49G       sztod-24G, sztod-49G
4        wordcount-30G   wordcount-15G, wordcount-30G, wordcount-60G, svm-20G
5        nutchindexing   nutchindexing
6        hotregion-35G   hotregion-35G, hotregion-70G, bayes, hive-aggre
7        sort-60G        sort-15G, sort-30G, sort-60G, terasort-25G, terasort-50G, terasort-100G, hive-join, pagerank
8        k-means         k-means

Evaluation (K-means clustering)
sort-60G can be taken as the representative workload-input pair of its group, which has eight members.

Conclusion
• Redundancy exists in these pioneering benchmark suites
– For example, sort and terasort.
• The workload behavior of trajectory data analysis applications is dramatically affected by their input data sets.

Future work
• Conduct similarity analysis of workload-input pairs at a larger scale.
– More metrics and larger input sizes.
• Fully implement a big data benchmark suite for spatio-temporal data.
– Data model, data generator, and typical workload-input pairs.

Thank you!