Questions and Answers from Improving Hive and HBase Integration Webinar

We recently did a webinar in which I talked about using Hive and HBase together. During and after the webinar, we received a lot of interesting questions, and we thought that addressing them in this blog post would have some value. If you are not familiar with HBase and Hive integration, we suggest watching the webinar first before reading on. So without further ado, here are some of the selected questions and answers.

Would you advise reconfiguring Hive to move away from the embedded Apache Derby database, and if so, why?

Choosing the best relational database product for Hive to store your metadata depends on your specific setup and, most of the time, on your team's experience with managing different database servers. Although perfect for running tests and small-scale deployments, the embedded database makes it hard to share the metadata across clients. Most of the use cases we see benefit from a centralized location to store the Hive schema information, which becomes the authoritative source for the meta layer. Otherwise, every client has to redefine and manage the data sources it uses. The Hive deployment options part of the talk covers three typical medium to large scale deployment setups.

Would HBase region balancing impact Hive-HBase queries?

Hive's HBase integration reuses the HBase MapReduce infrastructure with little change. From the HBase perspective, Hive behaves as if the MR jobs were running directly on top of the Table(Input|Output)Format. This also means that Hive MR jobs which read from or write to HBase will have similar data locality characteristics. Regarding load balancing for regions, Hive integration imposes the same semantics as bare HBase. You still need to design your data model to find a good keyspace, distribute the load across regions, avoid region hotspotting, configure region sizes, configure compactions, monitor the load on servers, etc.

What is the best version combination of Hadoop, Hive, and HBase?

We believe that the best combinations of Hadoop, HBase, Hive and other ecosystem projects are put together in the upcoming HDP (TODO:link here) release. In HDP, we are extensively testing stable, production-ready versions of 100% open source software released from Apache, and will be providing support for that platform. All the information about HDP can be found here.

When you say real-time updates are possible in HBase, could you throw some more light on this?

HBase provides real-time read and write access to the data. As covered in the use cases section, you can use HBase to update data in real time (not using Hive), and at the same time query the data using Hive. Having said that, Hive does not yet support INSERT INTO statements to update data row by row. So real-time updates refer to updates coming from pure HBase clients not using Hive. Adding real-time update semantics to Hive has been discussed previously, and we believe it is the logical next step.

What reporting tools could be connected to HBase? Are some connectors available?

From a reporting perspective, there are two options: you can connect the higher level tool to HBase directly as part of the ETL process, or you can go through Hive to access HBase. Among the partners we work with or know about, Talend, Pentaho and Jaspersoft provide direct HBase connectors as well as Hive support.
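To make the Hive route more concrete, here is a minimal sketch of an HBase-backed Hive table definition using the HBaseStorageHandler. The table and column names (page_views, cf:views) are hypothetical; the storage handler class and the hbase.columns.mapping and hbase.table.name properties are the standard Hive-HBase integration settings.

    -- Map a Hive table onto an existing HBase table (names are hypothetical).
    -- ":key" maps to the HBase row key; "cf:views" maps to a column in family "cf".
    CREATE EXTERNAL TABLE page_views(page_id STRING, views BIGINT)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:views")
    TBLPROPERTIES ("hbase.table.name" = "page_views");

    -- Any tool that speaks HiveQL can then query the HBase data like a regular Hive table:
    SELECT page_id, views FROM page_views WHERE views > 1000;

Dropping the EXTERNAL keyword would instead let Hive create and manage the underlying HBase table rather than mapping onto an existing one.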
As long as the table definitions for Hive-HBase are provided, or HiveQL is supported, any reporting solution that supports Hive, like MicroStrategy, should be able to leverage HBase tables as well, since the storage handler abstracts away the actual access and is transparent to the other parts of the Hive pipeline (query parsing, plan generation, optimization, MapReduce execution, etc.).

Is there a way to avoid loading duplicate data in HBase and Hive, something like a unique constraint?

There are two aspects to this question. Let's start with the first aspect, Hive. You can keep using the same query constraints in Hive with the HBaseStorageHandler. For example, you can use SELECT DISTINCT to filter out duplicate data, or use GROUP BYs to group similar data together, and save the results in HBase. For the second aspect, HBase indexes cells by row and column. The cells can have multiple versions indexed by timestamp. If you are generating data from Hive and saving it in HBase, then after the schema and type mapping, any duplicates in the resulting cells (row + column) mean that the last one to be saved overwrites the others. Which version ends up last is non-deterministic, so care should be taken for these workloads. Also keep in mind that this behavior is different from saving the table data in HDFS, since HDFS allows duplicate rows, while HBase does not.
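As a sketch of the Hive side of this, assuming the hypothetical page_views mapping from the earlier example and a hypothetical staging table raw_page_views:

    -- Collapse duplicates before writing into the HBase-backed table, so each
    -- row key is produced only once (table names are hypothetical).
    INSERT OVERWRITE TABLE page_views
    SELECT page_id, SUM(views)
    FROM raw_page_views
    GROUP BY page_id;

    -- Alternatively, drop only exact duplicate rows:
    -- INSERT OVERWRITE TABLE page_views
    -- SELECT DISTINCT page_id, views FROM raw_page_views;

Grouping by the column that maps to the HBase row key guarantees that each key is written once, which sidesteps the overwrite behavior described above.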
How stable/mature is Hive integration for real product usage?

The integration effort between Hive and HBase was started in 2009/2010 by Facebook, and has seen some recent love from Hortonworks, Facebook and the community. It has been tested mostly through unit test coverage and our internal functional testing as part of HDP. Although it has some rough edges and is not fully optimized for novice users, we believe it can be used in production systems. We are also providing support for this feature.

How do you scale out queries that you are mapping into HBase operations instead of MapReduce jobs? Or are you combining them?

Hive-HBase integration continues to compile SQL queries into MapReduce jobs. If the table to be queried or written to resides in HBase, then the HBase MapReduce counterparts (InputSplit, InputFormat, etc.) are used to read and persist the data.

In the last 2 work engagements, we implemented the Hadoop/HBase/Hive stack. I see this as becoming a very popular solution. Are you experiencing the same? Can you speak to any instance of a company implementation you know of?

All three projects are becoming increasingly popular, and we are seeing an accelerating trend towards enterprise adoption for all of them. We believe that tight Hive and HBase integration is a very important next step, as it opens new opportunities for both Hive and HBase. As you can understand, I cannot talk about a specific use case here, but given the popularity of the two projects, co-deployments are getting popular as well. We also noticed some community interest in this webinar, and in the previous talks I gave at Hadoop and HBase user group meetings on the same topic.

Do you have a comparison of, say, Cassandra vs. Hive+HBase? From a data path perspective, I don't like how Cassandra does it. From your slides it seems HBase is pretty clean about it.

There is the recently released CQL, and Hive+Cassandra integration for Cassandra users. My understanding of the Cassandra internals and ecosystem is limited, so doing a comparison would do them injustice. I would suggest checking out the options and comparing for yourself.

We hope this blog post shed some light on the topics not covered fully in the webinar. Please feel free to leave feedback below, or forward more questions to support@hortonworks.com. We are also interested in hearing about your use cases, and what areas we should focus on next.

Enis Soztutar