Questions and Answers from Improving Hive
and HBase Integration Webinar
We recently held a webinar in which I talked about using Hive and HBase together.
During and after the webinar, we received a lot of interesting questions, and we
thought that addressing them in this blog post would be valuable. If you are not
familiar with HBase and Hive integration, we suggest watching the webinar
before reading on.
So without further ado, here are some selected questions and answers.
Would you advise reconfiguring Hive to move away from the embedded
Apache Derby database? If so, why?
The best relational database for storing Hive's metadata depends on your specific
setup and, most of the time, on your team's experience managing different database
servers. Although it is perfect for running tests and small-scale deployments, the
embedded database makes it hard to share the metadata across clients. Most of the
use cases we see benefit from a centralized location for the Hive schema
information, which serves as the authoritative source for the metadata layer.
Otherwise, every client has to redefine and manage the data sources it uses. The
Hive deployment options part of the talk covers three typical medium- to
large-scale deployment setups.
Would HBase region balancing impact Hive-HBase queries?
Hive's HBase integration uses the HBase MapReduce infrastructure with little
change. From the HBase perspective, Hive behaves like MR jobs that run directly
on top of the Table(Input|Output)Format. This also means that Hive MR jobs which
read from or write to HBase will have similar data locality characteristics.
Regarding load balancing for regions, the Hive integration imposes the same
semantics as plain HBase. You still need to design your data model to find a good
keyspace, distribute the load across regions, avoid region hotspotting, configure
region sizes, configure compactions, monitor the load on servers, etc.
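As one illustration of the keyspace design mentioned above, a common way to avoid hotspotting with monotonically increasing keys (timestamps, sequence IDs) is to prefix each row key with a hash-derived bucket. The Python sketch below is only an illustration and was not part of the webinar. The bucket count, the key format, and the function name salted_key are assumptions chosen for the example.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical; in practice sized relative to the number of regions


def salted_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix row_key with a stable bucket number derived from its hash."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


# Sequential keys would otherwise sort next to each other and land in one region.
sequential = [f"event-{i:06d}" for i in range(1000)]
salted = [salted_key(k) for k in sequential]

# Collect the distinct bucket prefixes that the sequential keys were spread over.
used_buckets = {k.split("-", 1)[0] for k in salted}
print(sorted(used_buckets))
```

The trade-off is that a range scan over the original key order has to be issued once per bucket, so this fits write-heavy workloads better than scan-heavy ones.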
What is the best version combination of Hadoop, Hive, and HBase?
We believe that the best combination of Hadoop, HBase, Hive and other
ecosystem projects is put together in the upcoming HDP release.
In HDP, we are extensively testing stable, production-ready versions of 100% open
source software released by Apache, and will be providing support for that
platform. All the information about HDP can be found here.
When you say Real Time updates are possible in HBase, could you throw
some more light on this?
HBase provides real-time read and write access to the data. As covered in the use
cases section, you can use HBase to update data in real time (not using Hive), and
at the same time query the data using Hive. Having said that, Hive does not yet
support INSERT INTO statements to update data row by row. So real-time updates
refer to updates coming from pure HBase clients not using Hive. Adding real-time
update semantics to Hive has been discussed previously, and we believe it is the
logical next step.
What Reporting tools could be connected to HBase? Are some connectors
available?
From a reporting perspective, we have two options: you can connect the
higher-level tool to HBase directly as a part of the ETL process, or you can connect
through Hive to access HBase.
Of the partners we work with or know about, Talend, Pentaho and Jaspersoft
provide direct HBase connectors as well as Hive support. As long as the table
definitions for Hive-HBase are provided, or HiveQL is supported, any reporting
solution supporting Hive, such as MicroStrategy, should be able to leverage HBase
tables as well, since the storage handler abstracts away the actual access and is
transparent to the other parts of the Hive pipeline (query parsing, plan generation,
optimizer, MapReduce execution, etc.).
Is there a way to avoid loading duplicate data in HBase and Hive, something
like Unique Constraint?
There are two aspects to this question. Let's start with the first aspect, Hive. You can
keep using the same query constraints in Hive with the Hive HBaseStorageHandler.
For example, we can use SELECT DISTINCT to filter duplicates out of the input
data, or use GROUP BY to group similar data together, and
save the results in HBase.
For the second aspect, HBase indexes cells by row and column. The cells can
have multiple versions indexed by timestamp. If you are generating data from
Hive and saving it in HBase, and after the schema and type mapping there are
duplicates in the resulting cells (row + column), the last one to be saved will
overwrite the others. Which write comes last, in this case, is nondeterministic, so
care should be taken with these workloads. Also keep in mind that this behavior is
different from saving the table data in HDFS, since HDFS allows duplicate rows, while
HBase does not.
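To make the difference concrete, here is a small Python sketch that models the two storage behaviors in memory. It is a simplified model under our own assumptions, not HBase code: the record layout, the function names, and the choice to keep only the newest value per cell (ignoring older versions) are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (row key, column, value)


def save_like_hdfs(records: Iterable[Record]) -> List[Record]:
    """A file-backed table simply appends, so duplicate rows survive."""
    return list(records)


def save_like_hbase(records: Iterable[Record]) -> Dict[Tuple[str, str], str]:
    """Cells are addressed by (row, column); a later write replaces an earlier one."""
    table: Dict[Tuple[str, str], str] = {}
    for row, column, value in records:
        table[(row, column)] = value
    return table


records = [
    ("user1", "info:city", "Paris"),
    ("user1", "info:city", "Berlin"),  # same row + column as the record above
    ("user2", "info:city", "Rome"),
]

as_files = save_like_hdfs(records)
as_cells = save_like_hbase(records)
# The arrival order decides which duplicate wins; reversing it changes the result.
as_cells_reversed = save_like_hbase(reversed(records))

print(len(as_files), len(as_cells))
print(as_cells[("user1", "info:city")], as_cells_reversed[("user1", "info:city")])
```

In a real MapReduce job the arrival order is not under your control, which is why deduplicating in the query (for example with SELECT DISTINCT or GROUP BY, as described above) before writing is the safer route.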
How stable/mature is Hive integration for real product usage?
The integration effort between Hive and HBase was started in 2009/2010 by
Facebook, and has seen some recent love from Hortonworks, Facebook and the
community. It has been tested mostly through unit tests and our internal
functional testing as a part of HDP. Although it has some rough edges and is not
fully optimized for novice users, we believe it can be used in production systems.
We are also providing support for this feature.
How do you scale out queries that you are mapping into HBase operations
instead of map reduce jobs? Or are you combining them?
Hive HBase integration continues to compile SQL queries into MapReduce jobs. If
the table to be queried or written to resides in HBase, then the HBase MapReduce
counterparts (InputSplit, InputFormat, etc.) are used to read and persist data.
In the last 2 work engagements, we implemented the Hadoop/HBase/Hive
stack. I see this as becoming a very popular solution. Are you experiencing
the same? Can you speak to any instance of a company implementation you
know of?
All three projects are becoming increasingly popular, and we are seeing an
accelerating trend towards enterprise adoption for all of them. We believe that tight
Hive and HBase integration is a very important next step, as it opens new
opportunities for both Hive and HBase. As you can understand, I cannot talk about a
specific use case here, but given the popularity of the two projects, co-deployments
are getting popular as well. We also noticed some community interest in this
webinar, and in the previous talks I gave at Hadoop and HBase user group meetings
on the same topic.
Do you have a comparison of say Cassandra vs. Hive+HBase? From a data
path perspective, I don't like how Cassandra does it. From your slides it
seems HBase is pretty clean about it.
For Cassandra users, there is the recently released CQL, as well as Hive+Cassandra
integration. My understanding of Cassandra's internals and ecosystem is limited, so
any comparison from me would not do it justice. I would suggest checking out the
options and comparing for yourself.
We hope this blog post sheds some light on the topics not fully covered in the
webinar. Please feel free to leave feedback below, or forward more questions to
[email protected]. We are also interested in hearing about your use cases
and which areas we should focus on next.
Enis Soztutar