Performance Analysis of Lucene Index on HBase Environment
Group #13: Anand Hegde, Prerna Shraff

Overview
• HBase vs BigTable
• The Problem
• Implementation
• Performance Analysis
• Survey
• Conclusion

HBase vs BigTable
BigTable
• Compressed, high-performance database system
• Built on GFS, using the Chubby lock service, SSTables, etc.
HBase
• The Hadoop database
• Open source, distributed, versioned, and column-oriented
• Modeled after BigTable

The Problem
• Data-intensive computing requires storage solutions for huge amounts of data.
• The requirement is to host very large tables on clusters of commodity hardware.
• HBase provides BigTable-like capabilities on top of Hadoop.
• Prior work in this field includes an experiment using a Lucene index on HBase in an HPC environment (Xiaoming Gao, Vaibhav Nachankar, Judy Qiu).

Architecture

Implementation
• Configured Hadoop and HBase on the Alamo cluster.
• Added scripts to run the program sequentially on multiple nodes.
• Modified the scripts to record the size of the table.
• Modified the scripts to record the execution time for both sequential and parallel runs.

Performance Analysis
• Sequential execution across the same number of nodes for different data sizes.
• Sequential execution across different numbers of data nodes for the same data size.
• Parallel execution across the same number of nodes for different data sizes.

Analysis details
• Analysis performed on the Alamo cluster on FutureGrid
• System type: Dell PowerEdge
• No. of CPUs: 192
• No. of cores: 768
• 3 ZooKeeper nodes + 1 HDFS master + 1 HBase master

Analysis details (sample input records, "###"-delimited fields):
00000004###md###Title###Geoffrey C. Fox Papers Collection 1990
00000004###md###Category###paper, proceedings collection
00000004###md###Authors###Geoffrey C. Fox, others
00000004###md###CreatedYear###1990
00000004###md###Publishers###California Institute of Technology CA
00000004###md###Location###California Institute of Technology CA
00000004###md###StartPage###1
00000004###md###CurrentPage###105
00000004###md###Additional###This is a paper collection of Geoffrey C. Fox
00000004###md###DirPath###Proceedings in a collection of papers from one conference/Fox
00000005###md###Title###C3P Related Papers - T.Barnes
00000005###md###Category###paper, proceedings collection
00000005###md###Authors###T.Barnes, others

Sequential execution
[Chart: time in seconds (0-70) vs. data size (100 MB, 300 MB, 500 MB, 800 MB, 1 GB); number of nodes: 11]

Sequential execution
[Chart: time in seconds (0-7) vs. number of nodes (11, 13, 15, 17, 19); data size: 50 MB]

Parallel Execution
[Chart: time in minutes (0-16) vs. data size (1 GB, 5 GB, 10 GB, 20 GB, 30 GB); number of nodes: 13]

Survey
• Many load-testing frameworks are available for running distributed tests across many machines.
• Popular ones include Grinder, Apache JMeter, and LoadRunner.
• We compared these frameworks to choose the best one.

Why Survey?
• Gives an absolute measure of the system response time.
• Targets regressions in the server and the application code.
• Examines the response.
• Helps evaluate and compare middleware solutions from different vendors.
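The timings charted above were collected by the modified scripts described under Implementation, which record both execution time and the on-disk table size. A minimal sketch of such a measurement script is shown below; the helper names and the `/hbase/<table>` path are assumptions for illustration, not the authors' actual scripts.

```python
import subprocess
import time

def parse_hdfs_du(line):
    """Parse one line of `hadoop fs -du -s` output into (bytes, path).

    Assumes the first column is the size in bytes and the last column is
    the path; newer Hadoop releases insert an extra replicated-size
    column in between, which this sketch simply ignores.
    """
    parts = line.split()
    return int(parts[0]), parts[-1]

def timed_run(command, table, hadoop="hadoop"):
    """Hypothetical wrapper: run one indexing command, then record its
    wall-clock time and the HBase table's size in HDFS."""
    start = time.time()
    subprocess.check_call(command)  # e.g. the sequential or parallel indexing job
    elapsed = time.time() - start
    out = subprocess.check_output(
        [hadoop, "fs", "-du", "-s", "/hbase/" + table]  # assumed HDFS path
    )
    size_bytes, _path = parse_hdfs_du(out.decode())
    return elapsed, size_bytes
```

Logging the `(elapsed, size_bytes)` pair per run is enough to produce the time-vs-data-size charts above.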
LoadRunner
• Commercial automated performance-testing product
• Supports JavaScript and C-script
• Runs on the Windows platform
• Aimed at automated test engineers
• Has a UI
Framework:
• Virtual user scripts
• Controller

Apache JMeter
• Pure Java desktop application
• Designed to load-test functional behavior and measure performance
• Designed for testing web applications
• Java based
• Highly extensible
Test plan:
• Thread groups
• Controllers
• Samplers
• Listeners

Grinder
• Open source
• Uses Jython
• Scripts can be run by defining the tests in the grinder.properties file
Framework:
• Console
• Agent
• Workers

Comparison
• Server monitoring: LoadRunner is strong for MS Windows-based monitoring; Grinder needs a wrapper approach; JMeter has no built-in monitoring.
• Amount of load: LoadRunner restricts the number of users; Grinder restricts the number of agents; JMeter's number of agents depends on the available hardware.
• Able to run in batch? LoadRunner: no; Grinder: no; JMeter: yes.
• Ease of installation: LoadRunner is difficult; Grinder is moderate; JMeter is easy.
• Setting up tests: LoadRunner is icon based; Grinder uses Jython; JMeter is Java based.

Comparison (continued)
• Running tests: LoadRunner is complex; Grinder is moderate; JMeter is simple.
• Result generation: LoadRunner has an integrated analysis tool; Grinder has no integrated tool; JMeter can generate client-side graphs.
• Agent management: LoadRunner is easy/automatic; Grinder is manual; JMeter is real-time/dynamic.
• Cross-platform: LoadRunner: no (MS Windows only); Grinder: yes; JMeter: yes.
• Intended audience: LoadRunner is aimed at non-developers; Grinder at developers; JMeter at non-builders.
• Stability: LoadRunner is poor; Grinder is moderate; JMeter is poor.
• Cost: LoadRunner is expensive; Grinder and JMeter are free (open source).

Roadmap
Study HBase → Study Lucene Indexing → Modify Scripts → Add Scripts → Study Testing Frameworks → Implement Grinder

Conclusion
• Sequential execution takes more time than parallel execution on HBase.
• Our research indicates that HBase is not yet as robust as BigTable.
• For the testing framework, we recommend Grinder: it is open source and has a lot of documentation.
• Grinder also provides good real-time feedback.
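As a pointer for the Grinder recommendation: as noted on the Grinder slide, runs are configured through a grinder.properties file that names the Jython test script. A minimal sketch follows; the file and script names are hypothetical, while the property keys are standard Grinder 3 keys.

```properties
# grinder.properties (minimal sketch; hbase_load_test.py is a hypothetical name)
grinder.script = hbase_load_test.py   # Jython script containing the test logic
grinder.processes = 1                 # worker processes per agent
grinder.threads = 10                  # simulated users per worker process
grinder.runs = 100                    # runs per thread; 0 = run until stopped
```

The console distributes this configuration to each agent, and the agents spawn the worker processes that execute the Jython script, matching the console/agent/worker framework listed on the Grinder slide.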
References
• http://grinder.sourceforge.net/
• http://jmeter.apache.org/
• http://www8.hp.com/us/en/software/softwareproduct.html?compURI=tcm:245-935779
• http://hpcdb.org/sites/hpcdb.org/files/gao_lucene.pdf
• http://hadoop.apache.org/common/docs/stable/file_system_shell.html#du

Thank you