Final Presentation

Performance Analysis of Lucene
Index on HBase Environment
Group #13
Anand Hegde
Prerna Shraff
• HBase vs BigTable
• The Problem
• Implementation
• Performance Analysis
• Survey
• Conclusion
Overview
BigTable
• Compressed, high-performance database system
• Built on GFS, using the Chubby lock service, SSTables, etc.
HBase
• Hadoop Database
• Open source, distributed, versioned, column-oriented
• Modeled after BigTable
HBase vs BigTable
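To make the column-oriented model concrete, here is a minimal sketch using the happybase Python client; it assumes an HBase Thrift server is reachable on localhost:9090, and the table and column names are hypothetical:

```python
# Minimal sketch of HBase's column-oriented data model via the
# happybase Python client. Assumes an HBase Thrift server on
# localhost:9090; table/column names are hypothetical.
import happybase

connection = happybase.Connection('localhost', port=9090)

# One column family; columns inside a family are created on the fly,
# which is what "column oriented" buys us. (Raises if the table exists.)
connection.create_table('docs', {'md': dict()})

table = connection.table('docs')
table.put(b'00000004', {b'md:Title': b'Geoffrey C. Fox Papers Collection 1990',
                        b'md:CreatedYear': b'1990'})

# Read the row back; HBase also versions each cell by timestamp.
row = table.row(b'00000004')
print(row[b'md:Title'])
```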
• Data-intensive computing requires storage solutions for huge amounts of data.
• The requirement is to host very large tables on clusters of commodity hardware.
• HBase provides BigTable-like capabilities on top of Hadoop.
• Prior work in this field includes an experiment using a Lucene index on HBase in an HPC environment (Xiaoming Gao, Vaibhav Nachankar, Judy Qiu).
The Problem
Architecture
• Configured Hadoop and HBase on the Alamo cluster.
• Added scripts to run the program sequentially on multiple nodes.
• Modified the scripts to record the size of the table.
• Modified the scripts to record the execution time of both sequential and parallel runs (see the sketch below).
Implementation
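A minimal sketch of the kind of measurement scripting described above, assuming the hadoop CLI is on the PATH; the table path under /hbase and the job command are hypothetical placeholders (the du shell command is documented in the references):

```python
# Sketch: record the on-disk size of an HBase table (via `hadoop fs -du`)
# and the wall-clock time of one indexing run. Paths and the job command
# are hypothetical placeholders.
import subprocess
import time

def table_size_bytes(table):
    # `hadoop fs -du` prints the byte size as the first column per entry.
    out = subprocess.check_output(
        ['hadoop', 'fs', '-du', '/hbase/%s' % table])
    return sum(int(line.split()[0]) for line in out.splitlines())

def timed_run(cmd):
    start = time.time()
    subprocess.check_call(cmd)
    return time.time() - start

elapsed = timed_run(['hadoop', 'jar', 'lucene-index.jar', '/input/100MB'])
print('run took %.1f s, table is now %d bytes'
      % (elapsed, table_size_bytes('docs')))
```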
• Sequential execution across the same number of nodes for different data sizes.
• Sequential execution across different numbers of data nodes for the same data size.
• Parallel execution across the same number of nodes for different data sizes.
Performance Analysis
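A sketch of how these sequential and parallel runs could be driven; the node names and the remote indexing command are hypothetical, and the actual experiment used shell scripts on Alamo:

```python
# Sketch of the experiment matrix above: the same indexing command run
# one node at a time (sequential) or on all nodes at once (parallel).
# Node names and the remote command are hypothetical.
import subprocess
import time

NODES = ['node%02d' % i for i in range(1, 12)]    # 11 nodes
CMD = 'hadoop jar lucene-index.jar /input/100MB'  # placeholder job

def run_sequential():
    start = time.time()
    for node in NODES:
        subprocess.check_call(['ssh', node, CMD])  # one node at a time
    return time.time() - start

def run_parallel():
    start = time.time()
    procs = [subprocess.Popen(['ssh', node, CMD]) for node in NODES]
    for p in procs:                                # wait for all nodes
        p.wait()
    return time.time() - start

print('sequential: %.1f s, parallel: %.1f s'
      % (run_sequential(), run_parallel()))
```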
• Performed the analysis on the Alamo cluster on FutureGrid
• System type: Dell PowerEdge
• No. of CPUs: 192
• No. of cores: 768
• 3 ZooKeeper nodes + 1 HDFS master + 1 HBase master
Analysis details
00000004###md###Title###Geoffrey C. Fox Papers Collection 1990
00000004###md###Category###paper, proceedings collection
00000004###md###Authors###Geoffrey C. Fox, others
00000004###md###CreatedYear###1990
00000004###md###Publishers###California Institute of Technology CA
00000004###md###Location###California Institute of Technology CA
00000004###md###StartPage###1
00000004###md###CurrentPage###105
00000004###md###Additional###This is a paper collection of Geoffrey C. Fox
00000004###md###DirPath###Proceedings in a collection of papers from one conference/Fox
00000005###md###Title###C3P Related Papers - T.Barnes
00000005###md###Category###paper, proceedings collection
00000005###md###Authors###T.Barnes, others
Analysis details
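The records above are '###'-delimited (document ID, record type, field name, value). Below is a small sketch of parsing them into per-document dictionaries before indexing; the function name is ours, not from the original work:

```python
# Sketch: parse the '###'-delimited metadata records shown above into
# one dictionary per document ID, ready to be fed to the indexer.
from collections import defaultdict

def parse_records(lines):
    docs = defaultdict(dict)
    for line in lines:
        doc_id, rec_type, field, value = line.strip().split('###', 3)
        if rec_type == 'md':                  # metadata record
            docs[doc_id][field] = value
    return docs

sample = [
    '00000004###md###Title###Geoffrey C. Fox Papers Collection 1990',
    '00000004###md###CreatedYear###1990',
]
print(parse_records(sample)['00000004']['Title'])
```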
Sequential execution
[Chart: execution time in seconds (0-70) vs. data size (100 MB, 300 MB, 500 MB, 800 MB, 1 GB); number of nodes: 11]
Sequential execution
[Chart: execution time in seconds (0-7) vs. number of nodes (11, 13, 15, 17, 19); size of data: 50 MB]
Parallel Execution
[Chart: execution time in minutes (0-16) vs. data size (1 GB, 5 GB, 10 GB, 20 GB, 30 GB); number of nodes: 13]
• Many load-testing frameworks are available for running distributed tests across many machines.
• Popular ones include Grinder, Apache JMeter, LoadRunner, etc.
• We compared these frameworks to choose the best one.
Survey
• Gives an absolute measure of system response time.
• Targets regressions in the server and application code.
• Examines the response.
• Helps evaluate and compare middleware solutions from different vendors.
Why Survey?
• Commercial automated performance-testing product
• Supports JavaScript and C-script
• Windows platform only
• Aimed at automated test engineers
• Has a UI
Framework:
• Virtual User Scripts
• Controller
LoadRunner
• Pure Java desktop application
• Designed to load-test functional behavior and measure performance
• Designed for testing web applications
• Java based
• Highly extensible
Test Plan:
• Thread Groups
• Controllers
• Samplers
• Listeners
Apache JMeter
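JMeter test plans are typically built in the UI and saved as .jmx files; below is a sketch of launching such a plan in non-GUI mode from a driver script (the plan and results file names are placeholders):

```python
# Sketch: run a saved JMeter test plan in non-GUI mode and collect the
# results file. The .jmx plan and output names are placeholders;
# -n (non-GUI), -t (test plan), and -l (results log) are standard
# JMeter command-line flags.
import subprocess

subprocess.check_call([
    'jmeter',
    '-n',                    # non-GUI mode
    '-t', 'load_test.jmx',   # test plan built in the JMeter UI
    '-l', 'results.jtl',     # where sample results are written
])
```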
• Open source
• Test scripts are written in Jython
• Scripts are run by defining the tests in the grinder.properties file (see the sketch below)
Framework:
• Console
• Agent
• Workers
Grinder
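A minimal sketch of a Grinder 3 test script (Grinder scripts are Jython); the target URL is a placeholder, and the keys shown in the comment are standard grinder.properties settings:

```python
# Minimal sketch of a Grinder 3 test script (Jython). Pointed to from
# grinder.properties, e.g.:
#   grinder.script = http_test.py
#   grinder.processes = 1
#   grinder.threads = 10
#   grinder.runs = 100
# The target URL is a placeholder.
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder
from net.grinder.plugin.http import HTTPRequest

test = Test(1, "Fetch front page")
request = test.wrap(HTTPRequest())   # record timings against this test

class TestRunner:
    # Grinder instantiates one TestRunner per worker thread and
    # calls it once per run.
    def __call__(self):
        result = request.GET("http://localhost:8080/")
        grinder.logger.info("status %d" % result.statusCode)
```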
Parameter             | LoadRunner                 | Grinder                      | JMeter
Server monitoring     | Strong for MS Windows      | Needs wrapper-based approach | No built-in monitoring
Amount of load        | Number of users restricted | Number of agents restricted  | Number of agents depends on available H/W support
Able to run in batch? | No                         | No                           | Yes
Ease of installation  | Difficult                  | Moderate                     | Easy
Setting up tests      | Icon based                 | Uses Jython                  | Java based
Comparison
Parameter         | LoadRunner               | Grinder                      | JMeter
Running tests     | Complex                  | Moderate                     | Simple
Result generation | Integrated analysis tool | No integrated tool available | Can generate client-side graphs
Agent management  | Easy/automatic           | Manual                       | Real-time/dynamic
Cross platform    | No, MS Windows only      | Yes                          | Yes
Intended audience | Aimed at non-developers  | Aimed at developers          | Aimed at non-builders
Stability         | Poor                     | Moderate                     | Poor
Cost              | Expensive                | Free (open source)           | Free (open source)
Comparison
Study HBase → Study Lucene Indexing → Modify Scripts → Add Scripts → Study Testing Frameworks → Implement Grinder
Roadmap
• Sequential execution takes more time than parallel execution on HBase.
• Our research indicates that HBase is not yet as robust as BigTable.
• For the testing framework, we recommend Grinder: it is open source and well documented.
• Grinder also provides good real-time feedback.
Conclusion
• http://grinder.sourceforge.net/
• http://jmeter.apache.org/
• http://www8.hp.com/us/en/software/softwareproduct.html?compURI=tcm:245-935779
• http://hpcdb.org/sites/hpcdb.org/files/gao_lucene.pdf
• http://hadoop.apache.org/common/docs/stable/file_system_shell.html#du
References
Thank you