3. Where does the data of a Hive table get stored?
By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can
change it by specifying the desired directory in hive.metastore.warehouse.dir configuration
parameter present in the hive-site.xml.
The metastore in Hive stores the metadata information in an RDBMS, using an open source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.
1. Discuss Hadoop, HDFS, blockchain.
2. How does MapReduce differ from Spark?
3. Discuss Shuffle vs Sort on MapReduce.
4. Discuss Hive.
5. How are tables related in a relational database?
6. What is the difference between a logical model and a physical model?
7. Do you have experience with NoSQL databases, e.g., MongoDB and what makes
them different from a traditional SQL database?
5. Why does Hive not store metadata information in HDFS?
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The reason for choosing an RDBMS is to achieve low latency, as HDFS read/write operations are time-consuming processes.
6. What is the difference between local and remote metastore?
Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the
Hive service is running and connects to a database running in a separate JVM, either on the
same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate JVM
and not in the Hive service JVM. Other processes communicate with the metastore server
using Thrift Network APIs. You can have one or more metastore servers in this case to provide higher availability.
7. What is the default database provided by Apache Hive for metastore?
By default, Hive provides an embedded Derby database instance backed by the local disk for
the metastore. This is called the embedded metastore configuration.
8. Suppose I have installed Apache Hive on top of my Hadoop cluster using the default metastore configuration. What will happen if multiple clients try to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time for
accessing the metastore. Therefore, if multiple clients try to access the metastore at the same
time, they will get an error. One has to use a standalone metastore, i.e. Local or remote
metastore configuration in Apache Hive for allowing access to multiple clients concurrently.
Following are the steps to configure MySQL database as the local metastore in Apache Hive:
One should make the following changes in hive-site.xml:
The javax.jdo.option.ConnectionURL property should point to the JDBC URL of the MySQL database, e.g. jdbc:mysql://host/metastore?createDatabaseIfNotExist=true.
The javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
One should also set the username and password as:
javax.jdo.option.ConnectionUserName is set to the desired username.
javax.jdo.option.ConnectionPassword is set to the desired password.
The JDBC driver JAR file for MySQL must be on Hive's classpath, i.e. the JAR file should be copied into Hive's lib directory.
Now, after restarting the Hive shell, it will automatically connect to the MySQL database
which is running as a standalone metastore.
9. What is the difference between external table and managed table?
Here is the key difference between an external table and managed table:
In the case of a managed table, if one drops the table, the metadata information along with the table data is deleted from the Hive warehouse directory.
On the contrary, in the case of an external table, Hive just deletes the metadata information about the table and leaves the table data present in HDFS untouched.
Note: I would suggest you go through the blog on Hive Tutorial to learn more about Managed Tables and External Tables in Hive.
10. Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by
using the clause – LOCATION ‘<hdfs_path>’.
11. When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because
SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the
data together using a single reducer. Therefore, using ORDER BY against a large number of
inputs will take a lot of time to execute.
12. What is a partition in Hive?
Hive organizes tables into partitions to group similar data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.
13. Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table and therefore, reduces the query latency by
scanning only relevant partitioned data instead of the whole data set.
For example, we can partition the transaction log of an e-commerce website by month, e.g. January, February, etc. Any analytics for a particular month, say January, will then scan only the January partition (sub-directory) instead of the whole table data.
14. What is dynamic partitioning and when is it used?
In dynamic partitioning, the values of the partition columns are known only at runtime, i.e. during loading of the data into a Hive table.
One may use dynamic partitioning in the following two cases:
Loading data from an existing non-partitioned table to improve sampling and therefore decrease the query latency.
When one does not know all the partition values beforehand, so finding these partition values manually in a huge data set would be a tedious task.
15. Scenario:
Suppose, I create a table that contains details of all the transactions done by the
customers during the year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING);
Now, after inserting 50,000 tuples in this table, I want to know the total revenue
generated for each month. But, Hive is taking too much time in processing this
query. How will you solve this problem and list the steps that I will be taking in order to
do so?
We can solve this problem of query latency by partitioning the table according to each month.
So, for each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can't partition an existing non-partitioned table directly, so we will take the following steps to solve this problem:
1. Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING);
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non – partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id,
amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.
16. How can you add a new partition for the month December in the above partitioned table?
For adding a new partition to the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';
Note: I suggest you go through the dedicated blog on Hive Commands, where all the commands present in Apache Hive are explained with examples.
17. What is the default maximum number of dynamic partitions that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of dynamic partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value>
Note: You can set the total number of dynamic partitions that can be created by one
statement by using: SET hive.exec.max.dynamic.partitions = <value>
18. Scenario:
I am inserting data into a table based on partitions dynamically. But, I received an error
– FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least
one static partition column. How will you remove this error?
To remove this error, one has to execute the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to Remember:
By default, the hive.exec.dynamic.partition configuration property is set to false in Hive versions prior to 0.9.0.
hive.exec.dynamic.partition.mode is set to strict by default. Hive allows all partitions to be dynamic only in nonstrict mode.
19. Why do we need buckets?
There are two main reasons for performing bucketing on a partition:
A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about the cases where your partition key differs from the join key? In these cases, you can still perform a map-side join by bucketing the table on the join key.
Bucketing makes the sampling process more efficient and therefore, allows us to decrease the
query time.
20. How does Hive distribute the rows into buckets?
Hive determines the bucket number for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For an integer data type, the hash_function is:
hash_function (int_type_column) = value of int_type_column
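The bucketing rule above can be sketched in a few lines of Python (an illustration only, not Hive's actual implementation; the user ids and bucket count are made up for the example):

```python
# Sketch of Hive-style bucketing: bucket = hash_function(column) % num_buckets.
# For an integer column, Hive's hash_function is simply the column value itself.

def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket index for an integer bucketing column."""
    return value % num_buckets

# Distribute some example user ids across 4 buckets.
user_ids = [101, 102, 103, 104, 105]
buckets = {uid: bucket_for(uid, 4) for uid in user_ids}
print(buckets)  # {101: 1, 102: 2, 103: 3, 104: 0, 105: 1}
```

Rows with the same bucketing-column value always land in the same bucket, which is what makes bucketed map-side joins and sampling possible.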
21. What will happen in case you have not issued the command 'SET hive.enforce.bucketing=true;' before bucketing a table in Apache Hive 0.x or 1.x?
The command 'SET hive.enforce.bucketing=true;' allows one to have the correct number of reducers while using the 'CLUSTER BY' clause for bucketing a column. If it is not done, one may find that the number of files generated in the table directory is not equal to the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets by using set mapred.reduce.tasks = num_bucket.
22. What is indexing and why do we need it?
One of the Hive query optimization methods is Hive index. Hive index is used to speed up the
access of a column or set of columns in a Hive database because with the use of index the
database system does not need to read all rows in the table to find the data that one
has selected.
23. Scenario:
Suppose, I have a CSV file – 'sample.csv' present in the '/temp' directory with the following data:
id first_name last_name email gender ip_address
1 Hugh Jackman [email protected] Male
2 David Lawrence [email protected] Male
3 Andy Hall [email protected] Female
4 Samuel Jackson [email protected] Male
5 Emily Rose [email protected] Female
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes
into a record that we can process using Hive. SerDes are implemented using Java. Hive comes
with several built-in SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for
sample.csv by issuing the following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
Now, we can perform any query on the table ‘sample’:
SELECT first_name FROM sample WHERE gender = 'Male';
24. Scenario:
Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to
create a single Hive table corresponding to these files. The data in these files are in the
format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades
when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots
of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a
single sequence file. The steps that will be followed in doing so are as follows:
Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load the data into temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
STORED AS SEQUENCEFILE;
Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input
files and therefore, the problem of having lots of small files is finally eliminated.
What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variables, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest.
An example of this could be identifying the click-through rate for a banner ad.
What do you understand by the statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).
Sensitivity is nothing but "Predicted true events / Total events". True events here are the events which were true and which the model also predicted as true.
The calculation of sensitivity is pretty straightforward:
Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )
where true positives are positive events which are correctly classified as positives.
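The formula above can be sketched directly in Python (the labels below are invented for illustration; 1 marks a positive event, 0 a negative one):

```python
# Sensitivity (true positive rate) = TP / (TP + FN), i.e. true positives
# divided by all actually positive events.

def sensitivity(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn)

actual    = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
print(sensitivity(actual, predicted))  # 0.75 -> 3 of 4 actual positives caught
```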
What are the differences between overfitting and underfitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of
training data, so as to be able to make reliable predictions on general untrained data.
In overfitting, a statistical model describes random error or noise instead of the underlying
relationship. Overfitting occurs when a model is excessively complex, such as having too
many parameters relative to the number of observations. A model that has been overfit has
poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture
the underlying trend of the data. Underfitting would occur, for example, when fitting a linear
model to non-linear data. Such a model too would have poor predictive performance.
Differentiate between univariate, bivariate and multivariate analysis.
Univariate analyses are descriptive statistical analysis techniques which can be
differentiated based on the number of variables involved at a given point of time. For
example, a pie chart of sales by territory involves only one variable, and the analysis can be referred to as univariate analysis.
Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales and spending together can be considered an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect
of variables on the responses.
What are Eigenvectors and Eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually
calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the
directions along which a particular linear transformation acts by flipping, compressing or stretching.
An eigenvalue can be referred to as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression occurs.
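A minimal worked example, assuming a symmetric 2x2 matrix so the eigenvalues can be read off the characteristic polynomial by hand (the matrix [[2, 1], [1, 2]] is chosen purely for illustration):

```python
import math

# Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]] come from the
# characteristic polynomial: lambda^2 - (a + d)*lambda + (a*d - b^2) = 0.
def eigenvalues_2x2(a, b, d):
    trace, det = a + d, a * d - b * b
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

# For [[2, 1], [1, 2]] the eigenvalues are 3 and 1; the eigenvector for 3
# lies along (1, 1): that direction is stretched by a factor of 3.
lam1, lam2 = eigenvalues_2x2(2, 1, 2)
print(lam1, lam2)  # 3.0 1.0

# Check A v = lambda v for v = (1, 1): A v = (2*1 + 1*1, 1*1 + 2*1) = (3, 3).
v = (2 * 1 + 1 * 1, 1 * 1 + 2 * 1)
print(v)  # (3, 3)
```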
Can you cite some examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are. False positives are the cases where you wrongly classify a non-event as an event, a.k.a. a Type I error. False negatives are the cases where you wrongly classify events as non-events, a.k.a. a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume
a patient comes to that hospital and he is tested positive for cancer, based on the lab
prediction but he actually doesn’t have cancer. This is a case of false positive. Here it is of
utmost danger to start chemotherapy on this patient when he actually does not have cancer.
In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy
cells and might lead to severe diseases, even cancer.
Example 2: Let’s say an e-commerce company decided to give $1000 Gift voucher to the
customers whom they assume to purchase at least $10,000 worth of items. They send free
voucher mail directly to 100 customers without any minimum purchase condition because
they assume they will make at least 20% profit on items sold above $10,000. The issue arises if the $1000 gift vouchers are sent to customers who have not actually purchased anything but were wrongly marked as having made $10,000 worth of purchases, which is a false positive with a direct cost.
Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats and
based on certain characteristics they identify whether a particular passenger can be a
threat or not. Due to a shortage of staff, they decide to scan passengers being predicted as
risk positives by their predictive model. What will happen if a true threat customer is being
flagged as non-threat by airport model?
Example 2: What if Jury or judge decide to make a criminal go free?
Example 3: What if you rejected marrying a very good person based on your predictive model and you happen to meet him/her after a few years and realize that you had a false negative?
Can you cite some examples where both false positives and false negatives are equally important?
In the banking industry giving loans is the primary source of making money but at the same
time if your repayment rate is not good you will not make any profit, rather you will risk
huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want to
acquire bad customers. In this scenario, both the false positives and false negatives become
very important to measure.
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting, and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to set aside a portion of the data for testing the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and get an insight into how the model will generalize to an independent data set.
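The splitting idea can be sketched in pure Python as a minimal k-fold scheme (a sketch only; real pipelines would typically shuffle the data and use a library routine): each fold serves once as the validation set while the rest form the training set.

```python
# Minimal k-fold cross-validation split: returns (train, validation)
# index pairs, each fold validating exactly once.
def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return [([i for f in folds[:j] + folds[j + 1:] for i in f], folds[j])
            for j in range(k)]

splits = kfold_indices(6, 3)
for train, val in splits:
    print(train, val)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```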
What are Recommender Systems?
Recommender Systems are a subclass of information filtering systems that are meant to
predict the preferences or ratings that a user would give to a product. Recommender
systems are widely used in movies, news, research articles, products, social tags, music, etc.
Examples include movie recommenders in IMDB, Netflix & BookMyShow, product
recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video
recommendations and game recommendations in Xbox.
What is Collaborative filtering?
Collaborative filtering is the process of filtering used by most recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.
An example of collaborative filtering can be to predict the rating of a particular user based
on his/her ratings for other movies and others’ ratings for all movies. This concept is widely
used in recommending movies in IMDB, Netflix & BookMyShow, product recommenders in
e-commerce sites like Amazon, eBay & Flipkart, YouTube video recommendations and game
recommendations in Xbox.
How can outlier values be detected and treated?
Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is small, they can be assessed individually; for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values.
Not all extreme values are outlier values. The most common ways to treat outlier values are:
1. To change the value and bring it within a range.
2. To just remove the value.
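The percentile-capping treatment mentioned above can be sketched in pure Python (a sketch under the assumption of linear interpolation between sorted values; the sample data is invented):

```python
# Cap outliers at the 1st and 99th percentiles of the observed data.
def percentile(values, p):
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100          # fractional rank of the percentile
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)  # linear interpolation

def cap_outliers(values):
    low, high = percentile(values, 1), percentile(values, 99)
    return [min(max(v, low), high) for v in values]

data = [10, 12, 11, 13, 12, 500]  # 500 is an extreme value
print(cap_outliers(data))          # 500 is pulled down near the 99th percentile
```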
During analysis, how do you treat missing values?
The extent of the missing values is identified after identifying the variables with missing
values. If any patterns are identified the analyst has to concentrate on them as it could lead
to interesting and meaningful business insights.
If there are no patterns identified, then the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. Alternatively, a default value can be assigned, which can be the mean, minimum or maximum value; getting into the data is important here.
If it is a categorical variable, a default value is assigned to the missing entries. If the data follows a known distribution, say a normal distribution, the mean value can be assigned.
If 80% of the values for a variable are missing, then you can answer that you would be dropping the variable instead of treating the missing values.
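Mean/median imputation can be sketched in a few lines of Python (a minimal sketch with None standing in for a missing value; the sample data is invented):

```python
# Replace None entries with the mean (for roughly normal data) or the
# median of the observed values.
def impute(values, strategy="mean"):
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = sum(observed) / len(observed)
    else:  # median
        xs = sorted(observed)
        mid = len(xs) // 2
        fill = xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2
    return [fill if v is None else v for v in values]

data = [10, None, 30, None, 20]
print(impute(data, "mean"))    # [10, 20.0, 30, 20.0, 20]
print(impute(data, "median"))  # [10, 20, 30, 20, 20]
```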
How will you define the number of clusters in a clustering algorithm?
Though the clustering algorithm is not specified, this question mostly refers to K-Means clustering, where "K" defines the number of clusters. The objective of clustering is to
group similar entities in a way that the entities within a group are similar to each other but
the groups are different from each other.
For example, the following image shows three different groups.
The Within Sum of Squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you get an elbow-shaped plot, generally known as the Elbow Curve.
The point after which you don't see any significant decrement in WSS (number of clusters = 6 in the referenced graph) is known as the bending point and is taken as K in K-Means.
This is the most widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
What is the difference between deep and shallow copy?
Ans: A shallow copy creates a new instance and copies the values of the original into it, but for nested objects it copies only the reference pointers rather than the objects themselves. These references point to the original objects, so changes made to a nested member through either copy will affect the other. Shallow copying allows faster execution of the program; how much faster depends on the size of the data involved.
A deep copy recursively duplicates the objects themselves instead of copying reference pointers, so the new object does not share any nested objects with the original. Changes made in the original copy won't affect the deep copy. Deep copying makes execution of the program slower, since a separate copy has to be made for every object that is referenced.
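The difference is easy to see with the standard library copy module (the nested list below is just an illustration):

```python
import copy

# Shallow copy shares nested objects; deep copy duplicates them recursively.
original = [[1, 2], [3, 4]]

shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0].append(99)  # mutate a nested list in the original

print(shallow[0])  # [1, 2, 99] -> shallow copy sees the change
print(deep[0])     # [1, 2]     -> deep copy is unaffected
```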
How is Multithreading achieved in Python?
1. Python has a multi-threading package but if you want to multi-thread to speed your code up,
then it’s usually not a good idea to use it.
2. Python has a construct called the Global Interpreter Lock (GIL). The GIL makes sure that only
one of your ‘threads’ can execute at any one time. A thread acquires the GIL, does a little work,
then passes the GIL onto the next thread.
3. This happens very quickly so to the human eye it may seem like your threads are executing
in parallel, but they are really just taking turns using the same CPU core.
4. All this GIL passing adds overhead to execution. This means that if you want to make your
code run faster then using the threading package often isn’t a good idea.
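A minimal illustration of the point above, assuming CPython: two CPU-bound threads complete correctly, but because of the GIL they take turns on one core rather than running truly in parallel.

```python
import threading

# Two CPU-bound threads doing identical work. They both finish and agree,
# but the GIL serializes their bytecode execution on a single core.
results = {}

def count_squares(name, n):
    results[name] = sum(i * i for i in range(n))

t1 = threading.Thread(target=count_squares, args=("a", 1000))
t2 = threading.Thread(target=count_squares, args=("b", 1000))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["a"] == results["b"])  # True -> same work, done turn by turn
```

For CPU-bound speedups, the multiprocessing module (separate interpreters, separate GILs) is the usual alternative.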
How is memory managed in Python?
1. Memory management in python is managed by Python private heap space. All Python
objects and data structures are located in a private heap. The programmer does not have
access to this private heap. The python interpreter takes care of this instead.
2. The allocation of heap space for Python objects is done by Python’s memory manager. The
core API gives access to some tools for the programmer to code.
3. Python also has an inbuilt garbage collector, which recycles all the unused memory so that it can be made available to the heap space.
Explain what Flask is and its benefits.
Ans: Flask is a web microframework for Python based on "Werkzeug, Jinja2 and good intentions", under a BSD license. Werkzeug and Jinja2 are two of its dependencies. This means it has little to no dependence on external libraries. It makes the framework light, with few dependencies to update and fewer security bugs.
A session basically allows you to remember information from one request to another. In Flask, a session uses a signed cookie, so the user can look at the session contents but cannot modify them unless they have the secret key Flask.secret_key.
What is monkey patching in Python?
Ans: In Python, the term monkey patch only refers to dynamic modifications of a class or
module at run-time.
Consider the below example, where the class lives in a module m.py:
# m.py
class MyClass:
    def f(self):
        print("f()")
We can then run the monkey-patch testing like this:
import m

def monkey_f(self):
    print("monkey_f()")

m.MyClass.f = monkey_f
obj = m.MyClass()
obj.f()
The output will be as below:
monkey_f()
As we can see, we did make some changes in the behavior of f() in MyClass using the function we defined, monkey_f(), outside of the module m.
What does this mean: *args, **kwargs? And why would we use it?
Ans: We use *args when we aren't sure how many arguments are going to be passed to a function, or if we want to pass a stored list or tuple of arguments to a function. **kwargs is used when we don't know how many keyword arguments will be passed to a function, or it can be used to pass the values of a dictionary as keyword arguments. The identifiers args and kwargs are a convention; you could also use *bob and **billy, but that would not be wise.
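Both directions described above can be shown in one short example (the function and argument names are invented for illustration):

```python
# *args collects extra positional arguments into a tuple; **kwargs collects
# extra keyword arguments into a dict.
def describe(*args, **kwargs):
    return args, kwargs

pos, kw = describe(1, 2, 3, color="red", size=10)
print(pos)  # (1, 2, 3)
print(kw)   # {'color': 'red', 'size': 10}

# The star syntax also works in the call direction, unpacking a stored
# list/tuple or dict into arguments:
stored = [1, 2, 3]
options = {"color": "red", "size": 10}
print(describe(*stored, **options) == (pos, kw))  # True
```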
What are negative indexes and why are they used?
Ans: Sequences in Python can be indexed with both positive and negative numbers. Positive indexing uses '0' as the first index, '1' as the second index, and so on.
Negative indexing starts from '-1', which represents the last index in the sequence, with '-2' as the penultimate index, and the sequence carries forward like the positive numbers.
The negative index is commonly used to strip a trailing new-line character from a string, by taking the string except its last character as S[:-1]. Negative indexes are also used to access elements of a sequence counting from the end.
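A quick demonstration of the indexing described above (the string is an arbitrary example ending in a newline):

```python
s = "Python\n"

print(s[-1])   # '\n' -> the last character
print(s[-2])   # 'n'  -> the penultimate character
print(s[:-1])  # 'Python' -> everything except the last character,
               #             a common way to drop a trailing newline
```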
What is the process of compilation and linking in python?
Ans: Compiling and linking allow new extensions to be compiled properly without any error, and linking can be done only after the compilation step passes. If dynamic loading is used, then it depends on the style that is provided with the system. The Python interpreter can be used to provide dynamic loading of the configuration setup files and will rebuild the interpreter.
The steps required for this are:
1. Create a file, with any name, in any language supported by the compiler of your system, for example file.c or file.cpp.
2. Place this file in the Modules/ directory of the distribution which is being used.
3. Add a line to the file Setup.local that is present in the Modules/ directory.
4. Run the file using spam file.o.
5. After a successful run, rebuild the interpreter by using the make command in the top-level directory.
6. If the file is changed, rebuild the Makefile by using the command 'make Makefile'.
What is pickling and unpickling?
Ans: The pickle module accepts any Python object, converts it into a string representation and dumps it into a file by using the dump function; this process is called pickling. The process of retrieving the original Python objects from the stored string representation is called unpickling.
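The round trip can be sketched with pickle.dumps/pickle.loads, the in-memory equivalents of dump/load (the record below is an invented example):

```python
import pickle

# Pickling: serialize a Python object to bytes; unpickling: load it back.
# (pickle.dump/pickle.load work the same way with a file object.)
record = {"id": 1, "name": "Hugh", "scores": [10, 20]}

blob = pickle.dumps(record)    # pickling -> bytes
restored = pickle.loads(blob)  # unpickling -> original object

print(restored == record)  # True
print(restored is record)  # False -> a new, equal object
```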
Mention the differences between Django, Pyramid and Flask.
Flask is a "microframework" primarily built for small applications with simpler requirements, and it is ready to use; for extra functionality you have to use external libraries.
Pyramid is built for larger applications. It provides flexibility and lets the developer use the right tools for their project. The developer can choose the database, URL structure, templating style and more. Pyramid is heavily configurable.
Django can also be used for larger applications, just like Pyramid. It includes an ORM.
What is map function in Python?
Ans: The map function executes the function given as the first argument on all the elements of the iterable given as the second argument. If the given function takes more than one argument, then that many iterables are supplied.
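Both cases can be shown in a couple of lines (the lambdas and lists are arbitrary examples):

```python
# map applies a function to every element of an iterable; with a two-argument
# function, two iterables are supplied and consumed in parallel.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]

sums = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```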