
Data Science Interview Prep: SQL & Probability Questions

Question 1: employee-salaries (sql)
Difficulty: Medium
Tags: sql
Companies: Cortland, Con, MasterClass, Uber, Amazon, Fractal, PepsiCo, Think, Microsoft
Given an employees table and a departments table, select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.
Example:

Input:

employees table
  Column         Type
  id             INTEGER
  first_name     VARCHAR
  last_name      VARCHAR
  salary         INTEGER
  department_id  INTEGER

departments table
  Column  Type
  id      INTEGER
  name    VARCHAR

Output:
  Column                Type
  percentage_over_100K  FLOAT
  department_name       VARCHAR
  number of employees   INTEGER
Solution 1: employee-salaries (sql)
SELECT
    d.name AS department_name,
    SUM(CASE WHEN e.salary > 100000 THEN 1 ELSE 0 END) / COUNT(DISTINCT e.id) AS percentage_over_100K,
    COUNT(DISTINCT e.id) AS number_of_employees
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY 1
HAVING COUNT(DISTINCT e.id) >= 10
ORDER BY 2 DESC
LIMIT 3
Question 2: first-to-six (probability)
Difficulty: Medium
Tags: probability
Companies: Microsoft, Zenefits
Amy and Brad take turns in rolling a fair six-sided die.
Whoever rolls a “6” first wins the game. Amy starts by rolling
first.
What’s the probability that Amy wins?
Solution 2: first-to-six (probability)
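One standard way to derive the answer: let P be the probability that Amy wins. Amy wins immediately with probability 1/6. If both Amy and Brad fail to roll a 6 (probability (5/6) × (5/6) = 25/36), the game effectively restarts with Amy rolling first again, so

P = 1/6 + (25/36) × P

Solving gives (11/36) × P = 1/6, so P = 6/11 ≈ 0.545. Rolling first gives Amy a slight edge.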
Question 3: found-item (probability)
Difficulty: Easy
Tags: probability
Companies: /
Amazon has a warehouse system where items on the website
are located at different distribution centers across a city. Let’s
say in one example city, the probabilities that a specific item X is available at warehouse A or warehouse B are 0.6 and 0.8, respectively.
Given that you’re a customer in this example city and the
items are only found on the website if they exist in the
distribution centers, what is the probability that the item X
would be found on Amazon’s website?
Solution 3: found-item (probability)
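One standard way to compute this, assuming availability at the two warehouses is independent: the item shows up on the website if it is in at least one warehouse, so

P(A or B) = P(A) + P(B) − P(A)P(B) = 0.6 + 0.8 − 0.48 = 0.92

Equivalently, 1 − P(not in A) × P(not in B) = 1 − (0.4)(0.2) = 0.92.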
Question 4: first-touch-attribution (sql)
Difficulty: Hard
Tags: sql
Companies: NerdWallet, Google
The schema below is for a retail online shopping company consisting
of two tables, attribution and user_sessions.
• The attribution table logs a session visit for each row.
• If conversion is true, then the user converted to buying on that
session.
• The channel column represents which advertising platform the user
was attributed to for that specific session.
• Lastly, the user_sessions table maps many session visits back to one user (a many-to-one relationship).
First touch attribution is defined as the channel with which the
converted user was associated when they first discovered the website.
Calculate the first touch attribution for each user_id that converted.
Example:

Input:

attribution table
  Column      Type
  session_id  INTEGER
  channel     VARCHAR
  conversion  BOOLEAN

user_sessions table
  Column      Type
  session_id  INTEGER
  created_at  DATETIME
  user_id     INTEGER

Output:
  user_id  channel
  123      facebook
  145      google
  153      facebook
  172      organic
  173      email
Solution 4: first-touch-attribution (sql)
WITH sessions AS (
SELECT
u.user_id,
a.channel,
ROW_NUMBER() OVER(
PARTITION BY u.user_id
ORDER BY u.created_at ASC
) AS session_num,
SUM(a.conversion) OVER(
PARTITION BY u.user_id
) > 0 AS converted
FROM user_sessions AS u
INNER JOIN attribution AS a
ON u.session_id = a.session_id
)
SELECT
    user_id,
    channel
FROM sessions
-- first touch: the user's earliest session, restricted to users who eventually converted
WHERE session_num = 1
    AND converted = 1
Question 5: post-success (sql)
Difficulty: Medium
Tags: sql
Companies: /
Consider the events table which contains information about the
phases of writing a new social media post.
The action column can have values post_enter, post_submit, or post_cancel for when a user starts to write (post_enter), ends up canceling their post (post_cancel), or posts it (post_submit).
Write a query to get the post success rate for each day in the month
of January 2020.
You can assume that a single user may only make one post per day.
Example:

Input: events table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  action      VARCHAR
  url         VARCHAR
  platform    VARCHAR

Sample:
  user_id  created_at  event_name
  123      2019-01-01  post_enter
  123      2019-01-01  post_submit
  456      2019-01-02  post_enter
  456      2019-01-02  post_cancel

Output:
  Column             Type
  dt                 DATETIME
  post_success_rate  FLOAT
Solution 5: post-success (sql)
Let's see if we can clearly define the metrics we want to calculate before jumping into the problem. We want the post success rate for each day in January 2020. To get that metric, let's assume post success rate can be defined as:

(total posts created) / (total posts entered)

Additionally, since the success rate must be broken down by day, we must make sure that a post that is entered is completed on the same day.

Cool, now that we have these requirements, it's time to calculate our metrics. We know we have to GROUP BY the date to get each day's posting success rate. We also have to break down how we can compute our two metrics of total posts entered and total posts actually created.
Let’s look at the first one. Total posts entered can be
calculated by a simple query such as filtering for where the
event is equal to ‘enter’.
SELECT COUNT(user_id)
FROM events
WHERE action = 'post_enter'
Now we have to get all of the users that also successfully created the post on the same day. We can do this with a join and the correct conditions:
• same user
• successfully posted
• same day
We can get those by doing a LEFT JOIN to the same table and adding in those conditions. Remember we have to do a LEFT JOIN in this case because we want to use the join as a filter for where the conditions have been successfully met.
SELECT *
FROM events AS c1
LEFT JOIN events AS c2
ON c1.user_id = c2.user_id
AND c2.action = 'post_submit'
AND DATE(c1.created_at) = DATE(c2.created_at)
WHERE c1.action = 'post_enter'
AND MONTH(c1.created_at) = 1
AND YEAR(c1.created_at) = 2020
However, this query runs into an issue: if we join on all of our conditions, we'll find that if a user posted multiple times in the same day, we'll be dealing with a multiplying join that will square the actual number of posts that we did.
To simplify it, all we need to do instead is ignore that concern with the JOIN and take the count of all of the events that are submits divided by all of the events that are enters.

SELECT
    DATE(c1.created_at) AS dt
    , COUNT(c2.user_id) / COUNT(c1.user_id) AS post_success_rate
FROM events AS c1
LEFT JOIN events AS c2
ON c1.user_id = c2.user_id
AND c2.action = 'post_submit'
AND DATE(c1.created_at) = DATE(c2.created_at)
WHERE c1.action = 'post_enter'
AND MONTH(c1.created_at) = 1
AND YEAR(c1.created_at) = 2020
GROUP BY 1
Question 6: distribution-of-2x---y (statistics)
Difficulty: Medium
Tags: statistics
Companies: Google
Given that X and Y are independent random variables with normal distributions, what is the mean and variance of the distribution of 2X − Y when the corresponding distributions are X ~ N(3, 4) and Y ~ N(1, 4)?
Solution 6: distribution-of-2x---y (statistics)
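One standard derivation, reading N(3, 4) and N(1, 4) as mean 3, variance 4 and mean 1, variance 4: for independent X and Y,

E[2X − Y] = 2E[X] − E[Y] = 2(3) − 1 = 5
Var(2X − Y) = 2²Var(X) + (−1)²Var(Y) = 4(4) + 4 = 20

so 2X − Y ~ N(5, 20). The variances add even though Y is subtracted, because Var(aY) = a²Var(Y) and a² is positive either way.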
Question 7: upsell-transactions (sql)
Difficulty: Medium
Tags: sql
Companies: Instacart, Apple, Coinbase
We’re given a table of product purchases. Each row in the table
represents an individual user product purchase.
Write a query to get the number of customers that were upsold by
purchasing additional products.
Note: If the customer purchased two things on the same day that
does not count as an upsell as they were purchased within a similar
timeframe.
Example:

Input: transactions table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  product_id  INTEGER
  quantity    INTEGER

Output:
  Column                   Type
  num_of_upsold_customers  INTEGER
Solution 7: upsell-transactions (sql)
Assuming:
• an "upsell" is purchasing an additional product after purchasing a first product
• the additional product(s) must be purchased on a later date (i.e. not the same day as the first product)
• the additional "upsell" product(s) can be the same type of product (product_id) as the first product
select
count(distinct t1.user_id) as num_of_upsold_customers
from
transactions t1
inner join transactions t2
on t1.user_id = t2.user_id
and date(t1.created_at) < date(t2.created_at)
Question 8: seven-day-streak (sql)
Difficulty: Medium
Tags: sql
Companies: Twilio, Amazon
Given a table with event logs, find the percentage of users
that had at least one seven-day streak of visiting the same
URL.
Note: Round the results to 2 decimal places. For example, if
the result is 35% return 0.35.
Example:

Input: events table
  Column      Type
  user_id     INTEGER
  created_at  DATETIME
  url         VARCHAR

Output:
  Column  Type
  output  FLOAT
Solution 8: seven-day-streak (sql)
WITH cte_1 AS (
    SELECT user_id, DATE(created_at) AS login_date, url
    FROM events
),
cte_2 AS (
    SELECT user_id, login_date, url
    FROM cte_1
    GROUP BY user_id, login_date, url
),
cte_3 AS (
    SELECT *,
        DATE_ADD(login_date, INTERVAL -ROW_NUMBER()
            OVER (PARTITION BY user_id, url ORDER BY login_date) DAY) AS interval_group,
        DENSE_RANK() OVER (ORDER BY user_id) dr
    FROM cte_2
),
cte_4 AS (
    SELECT user_id, login_date, url, interval_group,
        MAX(dr) OVER () total_users
    FROM cte_3
),
cte_5 AS (
    SELECT COUNT(*) streak, MIN(login_date) AS cnt, user_id, total_users
    FROM cte_4
    GROUP BY interval_group, user_id, url, total_users
    HAVING COUNT(*) >= 7
),
cte_6 AS (
    SELECT COUNT(DISTINCT user_id) AS stre, total_users
    FROM cte_5
    GROUP BY user_id, total_users
)
SELECT IF((SELECT COUNT(*) FROM cte_6) > 0,
    (SELECT CAST(stre / total_users AS DECIMAL(3,2)) FROM cte_6),
    CAST(0.00 AS DECIMAL(3,2))) AS percent_of_users
Question 9: cumulative-reset (sql)
Difficulty: Hard
Tags: sql
Companies: Amazon
Given a users table, write a query to get the cumulative
number of new users added by the day, with the total reset
every month.
Example:

Input: users table
  Column      Type
  id          INTEGER
  name        VARCHAR
  created_at  DATETIME

Output:
  DATE        INTEGER
  2020-01-01  5
  2020-01-02  12
  …           …
  2020-02-01  8
  2020-02-02  17
  2020-02-03  23
Solution 9: cumulative-reset (sql)
This question first seems like it could be solved by just running a COUNT(*) and grouping by date. Or maybe it's just a regular cumulative distribution function?

But we have to notice that we are actually grouping by a specific interval of month and date, and that when the next month comes around, we want to reset the count of the number of users.

As an aside, the practical benefit of a query like this is that we can get a retention graph that compares the cumulative number of users from one month to another. If we have a goal to acquire 10% more users each month, how do we know if we're on track for this goal on February 15th without the same number for January 15th to compare it to?

So how can we make sure that the total count of users on January 31st rolls back over to 0 on February 1st?
Let’s first just solve the issue of getting the total count of
users. We know that we’ll need to know the number of
users that sign up each day. This can be written pretty
easily.
WITH daily_total AS (
SELECT
DATE(created_at) AS dt
, COUNT(*) AS cnt
FROM users
GROUP BY 1
)
If we can model out that computation, we'll find that the cumulative total is taken from the sum of all of the daily counts at or below the specified date. We can then run our self join on a condition where we set the left table's date as greater than or equal to the right table's date.
Okay, so we know that we have to specify a self join in the same way, where we want to get the cumulative value by comparing each date against the others. The only difference here is that we add an additional condition in the join where the month and year have to be the same. That way we apply a filter to the same month and year AND limit the cumulative total.
FROM daily_total AS t
LEFT JOIN daily_total AS u
ON t.dt >= u.dt
AND MONTH(t.dt) = MONTH(u.dt)
AND YEAR(t.dt) = YEAR(u.dt)
Therefore if we bring it all together:
WITH daily_total AS (
SELECT
DATE(created_at) AS dt
, COUNT(*) AS cnt
FROM users
GROUP BY 1
)
SELECT
    t.dt AS date
    , SUM(u.cnt) AS cumulative_count  -- running total that resets each month via the join conditions
FROM daily_total AS t
LEFT JOIN daily_total AS u
    ON t.dt >= u.dt
    AND MONTH(t.dt) = MONTH(u.dt)
    AND YEAR(t.dt) = YEAR(u.dt)
GROUP BY 1
Question 10: last-transaction (sql)
Difficulty: Easy
Tags: sql
Companies: /
Given a table of bank transactions with columns id, transaction_value,
and created_at representing the date and time for each transaction,
write a query to get the last transaction for each day.
The output should include the id of the transaction, datetime of the
transaction, and the transaction amount. Order the transactions by
datetime.
Example:

Input: bank_transactions table
  Column             Type
  id                 INTEGER
  created_at         DATETIME
  transaction_value  FLOAT

Output:
  Column             Type
  created_at         DATETIME
  transaction_value  FLOAT
  id                 INTEGER
Solution 10: last-transaction (sql)
with last_moment AS (
    select date(created_at) as day, max(created_at) as created_at
    from bank_transactions
    group by 1
)
select
    created_at,
    id,
    transaction_value
from last_moment
left join bank_transactions using (created_at)
order by created_at
Question 11: ad-raters (probability)
Difficulty: Easy
Tags: probability
Companies: /
Let's say we use people to rate ads. There are two types of raters, random and independent from our point of view:
• 80% of raters are careful and they rate an ad as good (60%
chance) or bad (40% chance).
• 20% of raters are lazy and they rate every ad as good (100%
chance).
1. Suppose we have 100 raters each rating one ad independently.
What’s the expected number of good ads?
2. Now suppose we have 1 rater rating 100 ads. What’s the expected
number of good ads?
3. Suppose we have 1 ad, rated as bad. What’s the probability the
rater was lazy?
Solution 11: ad-raters (probability)
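One standard way to work through the three parts:

1. For a single rating, P(good) = 0.8 × 0.6 + 0.2 × 1.0 = 0.68, so with 100 independent raters each rating one ad, the expected number of good ratings is 100 × 0.68 = 68.
2. With 1 rater rating 100 ads, linearity of expectation gives the same expected count of 68, but the variance is much larger: with probability 0.2 the rater is lazy and all 100 ads are rated good, and with probability 0.8 the expected count is 60.
3. A lazy rater never rates an ad as bad, so by Bayes' rule P(lazy | bad) = P(bad | lazy) × P(lazy) / P(bad) = 0 × 0.2 / (0.8 × 0.4) = 0.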
Question 12: compute-deviation (python)
Difficulty: Medium
Tags: python
Companies: Tinder, Optiver, Amazon
Write a function compute_deviation that takes in a list of
dictionaries with a key and list of integers and returns a
dictionary with the standard deviation of each list.
Note: This should be done without using the NumPy built-in
functions.
Example:

Input:
input = [
    {
        'key': 'list1',
        'values': [4, 5, 2, 3, 4, 5, 2, 3],
    },
    {
        'key': 'list2',
        'values': [1, 1, 34, 12, 40, 3, 9, 7],
    }
]

Output:
output = {'list1': 1.12, 'list2': 14.19}
Solution 12: compute-deviation (python)
With NumPy:

import numpy as np

{i['key']: round(np.std(i['values']), 2) for i in input}

Without NumPy:

res = {}
for i in input:
    avg = sum(i['values']) / len(i['values'])
    squares = [(j - avg) ** 2 for j in i['values']]
    res[i['key']] = round((sum(squares) / len(i['values'])) ** 0.5, 2)
print(res)
Question 13: is-it-raining-in-seattle (probability)
Difficulty: Medium
Tags: probability
Companies: Microsoft, Accenture, Facebook
You are about to get on a plane to Seattle. You want to know
if you should bring an umbrella. You call 3 random friends
of yours who live there and ask each independently if it’s
raining. Each of your friends has a 2 ⁄ 3 chance of telling you
the truth and a 1 ⁄ 3 chance of messing with you by lying. All 3
friends tell you that “Yes” it is raining.
What is the probability that it’s actually raining in Seattle?
Solution 13: is-it-raining-in-seattle (probability)
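One standard Bayesian treatment, which requires assuming a prior probability p that it is raining in Seattle: each friend independently says "yes" with probability 2/3 if it is raining and 1/3 if it is not, so

P(raining | three yeses) = p(2/3)³ / (p(2/3)³ + (1 − p)(1/3)³) = 8p / (8p + (1 − p))

With an uninformative prior of p = 0.5 this works out to 8/9 ≈ 0.89; a different prior for rain in Seattle changes the number accordingly.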
Question 14: find-bigrams (python)
Difficulty: Easy
Tags: python
Companies: Indeed, Microsoft
Write a function called find_bigrams that takes a sentence or
paragraph of strings and returns a list of all bigrams.
Example:

Input:
sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""

Output:
def find_bigrams(sentence) ->
[('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children?'),
 ('children?', 'drive'),
 ('drive', 'kids'),
 ('kids', 'to'),
 ('to', 'school,'),
 ('school,', 'soccer'),
 ('soccer', 'practice'),
 ('practice', 'and'),
 ('and', 'other'),
 ('other', 'activities.')]
Solution 14: find-bigrams (python)
def find_bigrams(sentence):
    # Lowercase and split on any whitespace so newlines are handled too
    words = sentence.lower().split()
    result = []
    for i in range(len(words) - 1):
        result.append((words[i], words[i + 1]))
    return result
Question 15: experiment-validity (a/b testing)
Difficulty: Medium
Tags: a/b testing
Companies: Facebook, Google, Metromile, Uber, Grammarly, Airbnb
Let’s say that your company is running a standard control
and variant AB test on a feature to increase conversion rates
on the landing page. The PM checks the results and finds a .04
p-value.
How would you assess the validity of the result?
Solution 15: experiment-validity (a/b testing)
This looks to be statistically significant, but I’d also double
check a few more things before making the conclusion:
1) How long have we been running this experiment? How many times have we run the analysis? If we've run the experiment for 4 weeks and run significance tests 4 times, the likelihood of a false positive significantly increases. Or if we've run the test for only one day, we should wait to see if the results hold.
2) Is the experiment properly randomized? Are the
distributions across treatment/control matching up? Are
the standard deviations roughly on the same trajectory?
3) What is the point estimate? Is this a negative
conversion rate hit or positive? Is this in line with what
we’re expecting?
Question 16: reducing-error-margin (statistics)
Difficulty: Medium
Tags: statistics
Companies: Apple, Walmart
Let's say we have a sample size of n. The margin of error for our sample size is 3.
How many more samples would we need to decrease the margin of error to 0.3?
Solution 16: reducing-error-margin (statistics)
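One standard way to reason about it: the margin of error is proportional to the standard error, which scales as 1/√n. Cutting the margin of error by a factor of 10 (from 3 to 0.3) therefore requires √n to grow by a factor of 10, i.e. the sample size must grow by a factor of 100. We would need 100n samples in total, which is 99n additional samples.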
Question 17: liked-pages (sql)
Difficulty: Medium
Tags: sql
Companies: Snapchat, Facebook, Snap
Let’s say we want to build a naive recommender. We’re given two
tables, one table called friends with a user_id and friend_id columns
representing each user’s friends, and another table called page_likes
with a user_id and a page_id representing the page each user liked.
Write an SQL query to create a metric to recommend pages for each user based on recommendations from their friends' liked pages.
Note: It shouldn’t recommend pages that the user already likes.
Example:

Input:

friends table
  Column     Type
  user_id    INTEGER
  friend_id  INTEGER

page_likes table
  Column   Type
  user_id  INTEGER
  page_id  INTEGER

Output:
  Column            Type
  user_id           INTEGER
  page_id           INTEGER
  num_friend_likes  INTEGER
Solution 17: liked-pages (sql)
Let’s solve this problem by visualizing what kind of output
we want from the query. Given that we have to create a
metric for each user to recommend pages, we know we
want something with a user_id and a page_id along with
some sort of recommendation score.
Let’s try to think of an easy way to represent the scores
of each user_id and page_id combo. One naive method
would be to create a score by summing up the total likes
by friends on each page that the user hasn’t currently
liked. Then the max value on our metric will be the most
recommendable page.
The first thing we have to do, then, is write a query to associate users with their friends' liked pages. We can do that easily with an initial join between the two tables.
WITH t1 AS (
SELECT
f.user_id
, f.friend_id
, pl.page_id
FROM friends AS f
INNER JOIN page_likes AS pl
ON f.friend_id = pl.user_id
)
Now we have every single user_id associated with their friends' liked pages. Can't we just do a GROUP BY on the user_id and page_id fields and get the DISTINCT COUNT of the friend_id field? Not exactly. We still have to filter out all of the pages that the original users also liked.
We can do that by joining the original page_likes table back to the CTE. We can filter out all the pages that the original users liked by doing a LEFT JOIN on page_likes and then selecting all the rows where the JOIN on user_id and page_id is NULL.
SELECT t1.user_id, t1.page_id, COUNT(DISTINCT t1.friend_id)
AS num_friend_likes
FROM t1
LEFT JOIN page_likes AS pl
ON t1.page_id = pl.page_id
AND t1.user_id = pl.user_id
WHERE pl.user_id IS NULL # filter out existing user likes
GROUP BY 1, 2
In this case, we only need to check one column, where pl.user_id IS NULL. Once we GROUP BY the user_id and page_id, we can count the distinct number of friends, which gives the distinct number of likes on each page by friends, creating our metric.
Question 18: customer-orders (sql)
Difficulty: Medium
Tags: sql
Companies: /
Write a query to identify customers who placed more than
three transactions each in both 2019 and 2020.
Example:

Input:

transactions table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  product_id  INTEGER
  quantity    INTEGER

users table
  Column  Type
  id      INTEGER
  name    VARCHAR

Output:
  Column         Type
  customer_name  VARCHAR
Solution 18: customer-orders (sql)
This question gives us two tables and asks us to find the names of customers who placed more than three transactions in both 2019 and 2020.
Note that the phrasing of the question implies this logical expression: customer transactions > 3 in 2019 AND customer transactions > 3 in 2020.
Our first query will join the transactions table to the users table so that we can easily reference both the user's name and the orders together. We can join our tables on the id field of the users table and the user_id field of the transactions table:
FROM transactions t
JOIN users u
ON u.id = user_id
Next, we can work on the shape of the SELECT statement for our CTE. The first two fields we want to include are pretty simple: users.id and name. You might think that we could pull from only the name field here, but the query might fall apart if there are two users that have the same name. Instead, we're going to select both and organize our query according to the users.id field (which we know has no duplicates).
Next, we’re going to make some CASE WHEN statements,
then combine them with SQL’s SUM function to count the
number of transactions that each of our users made in
2019 and 2020.
SUM(CASE WHEN YEAR(t.created_at) = '2019' THEN 1 ELSE 0 END) AS t_2019,
SUM(CASE WHEN YEAR(t.created_at) = '2020' THEN 1 ELSE 0 END) AS t_2020
Notice that we have to make sure both years are accounted for, each with more than three transactions. In the above code, each instance where the YEAR of a user's transaction is 2019 (in the first line) or 2020 (in the second) is assigned a value of 1. All other YEARs are assigned a value of 0. If we SUM these CASE WHEN statements, we'll get the count of transactions made by a given user in both 2019 and 2020.
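Assembled into one query, a sketch of how these pieces could fit together (the CTE name yearly_counts is only illustrative; the final filter applies the more-than-three condition to both years):

WITH yearly_counts AS (
    SELECT
        u.id,
        u.name,
        SUM(CASE WHEN YEAR(t.created_at) = '2019' THEN 1 ELSE 0 END) AS t_2019,
        SUM(CASE WHEN YEAR(t.created_at) = '2020' THEN 1 ELSE 0 END) AS t_2020
    FROM transactions t
    JOIN users u
        ON u.id = t.user_id
    GROUP BY u.id, u.name
)
SELECT name AS customer_name
FROM yearly_counts
WHERE t_2019 > 3
    AND t_2020 > 3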
Question 19: encoding-categorical-features (machine learning)
Difficulty: Medium
Tags: machine learning
Companies: Sentio, Uber, Amazon, AES, Accenture, Visa
Let’s say you have a categorical variable with thousands of
distinct values, how would you encode it?
Solution 19: encoding-categorical-features (machine learning)
This depends on whether the problem is a regression or a
classification model.
If it’s a regression model, one way would be to cluster
them based on the response by working backwards. You
could sort them by the response variable, and then split
the categorical variables into buckets based on the
grouping of the response variable. This could be done by
using a shallow decision tree to reduce the number of
categories.
Another way given a regression model would be to target
encode them. Replace each category in a variable with the
mean response given that category. Now you have one
continuous feature instead of a bunch of categories.
For a binary classification, you can target encode the
column by finding the conditional probability of the
response variable being a one, given that the categorical
column takes a particular value.
Then replace the categorical column with this numerical value. For example, if you have a categorical column of city in predicting loan defaults, and the probability that a person who lives in San Francisco defaults is 0.4, you would then replace "San Francisco" with 0.4.
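As a concrete illustration of that target-encoding idea, a minimal pandas sketch, assuming a DataFrame df with hypothetical columns city (categorical) and default (0/1 response):

import pandas as pd

df = pd.DataFrame({
    'city': ['SF', 'SF', 'NYC', 'NYC', 'SF'],
    'default': [1, 0, 0, 0, 1],
})

# Mean response per category, e.g. P(default | city)
city_means = df.groupby('city')['default'].mean()

# Replace the categorical column with its encoded numeric value
df['city_encoded'] = df['city'].map(city_means)

In practice you would compute the category means on training data only (and often smooth them toward the global mean) to avoid leaking the target into the features.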
Additionally, if working with a classification model, you could try grouping them by the category's frequency. The most frequent categories may dominate the total make-up and the least frequent may make up a long tail with a few samples each. By looking at the frequency distribution of the categories, you could find the drop-off point where you could leave the top X categories alone and then put the rest into an "other" bucket, giving you X+1 categories. If you want to be more precise, keep the categories that make up the 90th percentile of the cumulative distribution and dump the rest into the "other" bucket.
Lastly, we could also try using a Louvain community detection algorithm. Louvain is a method to extract communities from large networks without setting a predetermined number of clusters like K-means.
Question 20: fair-coin (probability)
Difficulty: Easy
Tags: probability
Companies: /
Say you flip a coin 10 times.
It comes up tails 8 times and heads twice.
Is this a fair coin?
Solution 20: fair-coin (probability)
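One standard way to frame this is as a hypothesis test: under the null hypothesis that the coin is fair (p = 0.5), how surprising is a result at least as extreme as 8 tails in 10 flips? A short sketch of the binomial arithmetic:

from math import comb

n, k = 10, 8
# P(X >= 8) under Binomial(10, 0.5)
p_upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
# Symmetric two-sided p-value: at least as extreme in either direction
p_two_sided = 2 * p_upper
print(p_upper, p_two_sided)  # ~0.0547 and ~0.1094

With a two-sided p-value around 0.11 (and even the one-sided value just above 0.05), 10 flips do not give strong evidence against fairness; the usual answer is that we cannot conclude the coin is unfair from such a small sample, though more flips would sharpen the test.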
Question 21: n-die (probability)
Difficulty: Easy
Tags: probability
Companies: Oneida, Facebook
Let’s say you’re playing a dice game. You have 2 dice.
1. What’s the probability of rolling at least one 3?
2. What’s the probability of rolling at least one 3 given N dice?
Solution 21: n-die (probability)
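One standard way to compute both parts is through the complement:

1. With 2 dice, P(at least one 3) = 1 − P(no 3 on either die) = 1 − (5/6)² = 1 − 25/36 = 11/36 ≈ 0.306.
2. With N independent dice, the same argument gives P(at least one 3) = 1 − (5/6)^N, which approaches 1 as N grows.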
Question 22: repeat-job-postings (sql)
Difficulty: Medium
Tags: sql
Companies: /
Given a table of job postings, write a query to retrieve the number of
users that have posted each job only once and the number of users
that have posted at least one job multiple times.
Each user has at least one job posting. Thus the sum of single_post and
multiple_posts should equal the total number of distinct user_id’s.
Example:

Input: job_postings table
  Column       Type
  id           INTEGER
  job_id       INTEGER
  user_id      INTEGER
  date_posted  DATETIME

Output:
  Column          Type
  single_post     INTEGER
  multiple_posts  INTEGER
Solution 22: repeat-job-postings (sql)
We want the value of two different metrics, the number
of users that have posted their jobs once and the number
of users that have posted at least one job multiple
times. What does that mean exactly?
If a user has 5 jobs but only posted each once, then they
are part of the single_post. But if they have 5 jobs and
posted a total of 7 times, then at least one job must have
multiple postings.
In general, if a user's total number of postings exceeds that user's total number of distinct jobs, the pigeonhole principle tells us at least one job must have been posted multiple times.
We first write a subquery to get an organized version of
the job_postings and name it user_job.
We want a count of total job postings per user and job.
Since each job posting has a unique id, we write our
subquery to count posting ids and distinct job ids per
user.
We use COUNT DISTINCT on job_id to get a unique row for
each job and COUNT on id as all id are already unique.
We then GROUP BY user_id so we can compare the
number of distinct jobs per user denoted num_jobs with
the number of total posts per user denoted n_posts.
WITH user_job AS (
SELECT user_id
, COUNT(DISTINCT job_id) AS n_jobs
, COUNT(DISTINCT id) AS n_posts
FROM job_postings
GROUP BY 1
)
Finally, we can simply write our main query to identify
when n_posts exceeds n_jobs for each user. We then
count these users toward multiple_posts.
Note that n_posts is always greater than or equal to n_jobs since each job gets posted at least once. Thus checking whether the two are not equal is the same as checking whether n_posts exceeds n_jobs.
We use CASE WHEN to count a user towards our total multiple_posts whenever n_jobs is not equal to n_posts. If n_jobs = n_posts, we count that user towards single_post.
Our final query is as follows:

WITH user_job AS (
    SELECT user_id
        , COUNT(DISTINCT job_id) AS n_jobs
        , COUNT(DISTINCT id) AS n_posts
    FROM job_postings
    GROUP BY 1
)
SELECT
    SUM(CASE WHEN n_jobs = n_posts THEN 1 ELSE 0 END) AS single_post
    , SUM(CASE WHEN n_jobs != n_posts THEN 1 ELSE 0 END) AS multiple_posts
FROM user_job
Question 23: recurring-character (python)
Difficulty: Easy
Tags: python
Companies: HealthTap, HEB, Facebook
Given a string, write a function recurring_char to find its first recurring
character. Return None if there is no recurring character.
Treat upper and lower case letters as distinct characters.
You may assume the input string includes no spaces.
Example 1:
Input: input = "interviewquery"
Output: output = "i"

Example 2:
Input: input = "interv"
Output: output = None
Solution 23: recurring-character (python)
We know we have to store a unique set of characters of
the input string and loop through the string to check which
ones occur twice.
Given that we have to return the first character that repeats, we should be able to go through the string in one loop, save each unique character in a set, and check whether the current character already exists in that set. If it does, return the character.
def recurring_char(input):
    seen = set()
    for char in input:
        if char in seen:
            return char
        seen.add(char)
    return None
Question 24: average-order-value (sql)
Difficulty: Easy
Tags: sql
Companies: Klaviyo, Facebook, Target
Given three tables, representing customer transactions and customer
attributes:
Write a query to get the average order value by gender.
Note: We’re looking at the average order value by users that have
ever placed an order. Additionally, please round your answer to two
decimal places.
Example:

Input:

users table
  Column  Type
  id      INTEGER
  name    VARCHAR
  sex     VARCHAR

transactions table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  product_id  INTEGER
  quantity    INTEGER

products table
  Column  Type
  id      INTEGER
  name    VARCHAR
  price   FLOAT

Output:
  Column  Type
  sex     VARCHAR
  aov     FLOAT
Solution 24: average-order-value (sql)
Quick solution: For this problem, note that we are going to assume the question asks for the average order value over all users that have ordered at least once.
Therefore, we can apply an INNER JOIN between users and transactions.
SELECT
u.sex
, ROUND(AVG(quantity *price), 2) AS aov
FROM users AS u
INNER JOIN transactions AS t
ON u.id = t.user_id
INNER JOIN products AS p
ON t.product_id = p.id
GROUP BY 1
Question 25: longest-streak-users (sql)
Difficulty: Medium
Tags: sql
Companies: Facebook
Given a table with event logs, find the top five users with the longest
continuous streak of visiting the platform in 2020.
Note: A continuous streak counts if the user visits the platform at least
once per day on consecutive days.
Example:

Input: events table
  Column      Type
  user_id     INTEGER
  created_at  DATETIME
  url         VARCHAR

Output:
  Column         Type
  user_id        INTEGER
  streak_length  INTEGER
Solution 25: longest-streak-users (sql)
WITH grouped AS (
    SELECT
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER()
            OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
        user_id,
        created_at
    FROM (
        SELECT *
        FROM events
        GROUP BY created_at, user_id) dates
)
SELECT
    user_id, streak_length
FROM (
    SELECT user_id, COUNT(*) AS streak_length
    FROM grouped
    GROUP BY user_id, grp
    ORDER BY COUNT(*) DESC) c
GROUP BY user_id
LIMIT 5
Explanation:
We need to find the top five users with the longest
continuous streak of visiting the platform. Before anything
else, let’s make sure we are selecting only distinct dates
from the created_at column for each user so that the
streaks aren’t incorrectly interrupted by duplicate dates.
SELECT *
FROM events
GROUP BY created_at, user_id) dates
After that, the first step is to find a method of calculating
the “streaks” of each user from the created_at column.
This is a “gaps and islands” problem, in which the data is
split into “islands” of consecutive values, separated by
“gaps” (i.e. 1-2-3, 5-6, 9-10). A clever trick which will help us
group consecutive values is taking advantage of the fact
that
subtracting two equally incrementing sequences will
produce the
same difference for each pair of values.
For example, [1, 2, 3, 5, 6] - [0, 1, 2, 3, 4] = [1, 1, 1, 2, 2].
By creating a new column containing the result of such a
subtraction, we can then group and count the streaks for
each user. For our incremental sequence, we can use the
row number of each event, obtainable with either window
functions: ROW_NUMBER() or DENSE_RANK().
The difference between these two functions lies in how they
deal with duplicate values, but since we need to remove
duplicate values either way to accurately count the streaks,
it doesn’t make a difference.
SELECT
    DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER()
        OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
    user_id,
    created_at
FROM (
    SELECT *
    FROM events
    GROUP BY created_at, user_id) dates
With the events categorized into consecutive streaks, it
is simply a matter of grouping by the streaks, counting
each group, selecting the highest streak for each user, and
ranking the top 5 users.
WITH grouped AS (
    SELECT
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER()
            OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
        user_id,
        created_at
    FROM (
        SELECT *
        FROM events
        GROUP BY created_at, user_id) dates
)
SELECT
user_id, streak_length
FROM (
SELECT user_id, COUNT(*) as streak_length
FROM grouped
GROUP BY user_id, grp
ORDER BY COUNT(*) desc) c
GROUP BY user_id
LIMIT 5
Note that the second subquery was necessary in order
to get the streak_length (count) as a column in our final
selection, as it involves multiple groupings.
Question 26: p-value-to-a-layman (statistics)
Difficulty: Easy
Tags: statistics
Companies: Uber, Facebook, Klaviyo, Pocket, Netflix, Sage, Centene, Thermo, Lumen, Surescripts, Apptio, Bolt, Nextdoor
How would you explain what a p-value is to someone who is
not technical?
Solution 26: p-value-to-a-layman (statistics)
The p-value is a fundamental concept in statistical testing.
First, why does this kind of question matter?
What an interviewer is looking for here is whether you can answer this question in a way that conveys your understanding of statistics while still making sense to a non-technical coworker who doesn't understand why a p-value might matter.
For example, if you were a data scientist and explained to
a PM that the ad campaign test has a 0.08 p-value, why
should the PM care about this number?
Here’s how we could explain that.
To understand the p-value, we must first learn about
statistical tests. In statistical tests, you have two
hypotheses. The null hypothesis states that our ad
campaign will not have a measurable increase in daily
active users. The test hypothesis states that our ad
campaign will have a measurable increase in daily active
users.
We then use data to run a statistical test to find out which
hypothesis is true. The p-value can help us determine
this by giving us a probability that we would observe the
current data if the null hypothesis were true. Note, this is
just a statement about probability given an assumption,
the p-value is not a measure of “how likely” the null
hypothesis is to be right, nor does it measure “how likely”
the observations in our data are due to random chance,
which are the most common misinterpretations of what
the p-value is. The only thing the p-value can say is how likely we are to have gotten the data we got if the null hypothesis were true. The difference may seem very abstract and not practical, but using incorrect explanations helps contribute to cult-like worship of p-values in non-technical circles.
Thus, a low p-value indicates that it would be extremely
unlikely that our data would result in this way if the null
hypothesis were true.
Because such data would be extremely unlikely to occur,
we then make the conclusion that the null hypothesis is in
fact false. Typically, p<0.05 is standard for rejecting the null
hypothesis in many practices, but this is just convention,
it may be that in your specific application we need more
confidence (0.01) or less confidence (0.1) to reject a null
hypothesis. For example, in life-or-death situations like
healthcare, we may want a p-value lower than 0.05, while in
studies with many factors like sociological studies, we may
choose to increase the p-value standard to 0.1.
Another important thing to recognize is that the p-value
does not say anything about the “strength” of the statistical
relationship, only whether it exists or not. We could find a very small change in ad revenue from our test (say 1%), but that change could have a low p-value because it would be unlikely to occur if the null hypothesis were true. Likewise, we could find a huge change in ad revenue with a high p-value, which tells us that although the change would be great if the null hypothesis were false, we do not have enough evidence to say that it is in fact false.
Question 27: manager-team-sizes (sql)
Difficulty: Easy
Tags: sql
Companies: /
Write a query to identify the manager with the biggest team
size.
You may assume there is only one manager with the largest
team size.
Example:

Input:

employees table
  Column         Type
  id             INTEGER
  first_name     VARCHAR
  last_name      VARCHAR
  salary         INTEGER
  department_id  INTEGER
  manager_id     INTEGER

managers table
  Column  Type
  id      INTEGER
  name    VARCHAR
  team    VARCHAR

Output:
  Column     Type
  manager    VARCHAR
  team_size  INTEGER
Solution 27: manager-team-sizes (sql)
This question is relatively straightforward. We’re given two
tables and asked to find the manager team with the largest
number of employees.
There are actually a couple of ways we could do this.
Method one involves using the MAX function and method
two (which is the path we chose to follow) involves
creating a sorted list grouped by the manager name. We
chose method two because it takes advantage of the most
basic aspects of SQL to produce an elegant solution to the
problem at hand.
First, we’re going to use a LEFT JOIN to merge our “left”
table, employees, with our “right” table, managers. We’ll
join the two tables where employees’ manager_id field
matches managers’ id field.
Then, because we’re going to need to get a COUNT of
employees under each manager and aggregates don’t
mix well with discrete values, we’re going to GROUP our
query BY the id field of our managers table.
We don’t want to GROUP BY the name field because we
don’t know for certain that we don’t have two managers at
the company with the same name, which would mess up
our query.
Now we can structure the SELECT clause of our query.
We’re going to pull the name field from our managers
table and a COUNT of the id field from our employees
table. Since we already have a GROUP BY clause in place,
our COUNT results will be grouped by manager ID, giving
us the size of each team. Remember that we want to use
aliasing at this stage to make sure our results match the
output table.
Next, we’re going to add an ORDER BY clause that sorts
the results of our query by team size. We’re going to sort
in DESCending order so that the highest team size is first
on our list.
Finally, we can LIMIT the results of our query to 1 and we
will have found the manager with the largest team.
SELECT
m.name AS manager,
COUNT(e.id) AS team_size
FROM managers m
LEFT JOIN employees e
ON e.manager_id = m.id
GROUP BY m.id
ORDER BY COUNT(e.id) DESC
LIMIT 1
Question 28: flight-records (sql)
Difficulty: Hard
Tags: sql
Companies: /
Write a query to create a new table, named flight routes, that
displays unique pairs of two locations.
Note: Duplicate pairs from the flights table, such as Dallas to Seattle and Seattle to Dallas, should have one entry in the flight routes table.

Example:

Input: flights table
  Column                Type
  id                    INTEGER
  source_location       VARCHAR
  destination_location  VARCHAR

Output:
  Column           Type
  destination_one  VARCHAR
  destination_two  VARCHAR
Solution 28: flight-records (sql)
WITH locations AS (
    SELECT id,
        LEAST(source_location, destination_location) AS point_A,
        GREATEST(destination_location, source_location) AS point_B
    FROM flights
    ORDER BY 2, 3
)
SELECT point_A AS destination_one,
    point_B AS destination_two
FROM locations
GROUP BY point_A, point_B
ORDER BY point_A, point_B
Question 29: booking-regression (machine learning)
Difficulty: Medium
Tags: machine learning
Companies: TripAdvisor, Chewy, UBS, Amazon, Facebook, Airbnb
Let’s say we want to build a model to predict booking prices
on Airbnb.
Between linear regression and random forest regression,
which model would perform better and why?
Solution 29: booking-regression (machine learning)
Let’s first quickly explain the differences between linear
and random forest regression before diving into which one
is a better use case for bookings.
Random forest regression is based on the ensemble
machine learning technique of bagging. The two key
concepts of random forests are:
1. Random sampling of training observations when building trees.
2. Random subsets of features for splitting nodes.
Random forest regressions also discretize continuous
variables since they are based on decision trees, which
function through recursive binary partitioning at the
nodes. This effectively means that we can split not only
categorical variables, but also split continuous variables.
Additionally, with enough data and sufficient splits, a step
function with many small steps can approximate a smooth
function for predicting an output.
Linear regression on the other hand is the standard
regression technique in which relationships are modeled
using a linear predictor function, the most common example
of y = Ax + B. Linear regression models are often fitted using
the least-squares approach.
There are also four main assumptions in linear regression:
• A normal distribution of error terms
• Independence in the predictors
• The mean residuals must equal zero with constant
variance
• No correlation between the features
So how do we differentiate between random forest
regression and linear regression independent of the problem
statement?
The differences between random forest regression and standard regression techniques for many applications are:
• Random forest regression can approximate complex
nonlinear shapes without a prior specification. Linear
regression performs better when the underlying function
is linear and has many continuous predictors.
• Random forest regression allows the use of arbitrarily
many predictors (more predictors than data points is
possible).
• Random forest regression can also capture complex interactions between predictors without a prior specification.
• Both will give some semblance of a “feature importance.”
However, linear regression feature importance is much
more interpretable than random forest given the linear
regression coefficient values attached to each predictor.
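To make that comparison concrete, a minimal scikit-learn sketch on synthetic data (the feature names, price surface, and metric here are illustrative assumptions, not part of the original solution):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Toy bookings data: bedrooms, bathrooms, demand index
X = rng.uniform(0, 5, size=(1000, 3))
# Nonlinear price surface plus noise
y = 50 * X[:, 0] + 30 * X[:, 1] + 40 * np.sin(X[:, 2]) + rng.normal(0, 10, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mae, 2))

On data with a nonlinear component like this, the random forest tends to win on error, while the linear model keeps interpretable coefficients; which trade-off matters depends on the problem framing below.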
Now let’s see how each model is applicable to Airbnb’s
bookings. One thing we need to do in the interview is to
understand more context around the problem of predicting
bookings.
To do so we need to understand what features exist in our
dataset.
We can assume the dataset will have features like:
• location features
• Seasonality
• number of bedrooms and bathrooms
• private room, shared, entire home, etc..
• External demand (conferences, festivals, etc…)
Can we extrapolate those features into a linear model that
makes sense?
Probably. If we were to measure the price of bookings in just
one city, we could probably fit a decent linear regression.
Take Seattle, for example: the coefficient for each bedroom, bathroom, time of month, etc. could be standardized across the city if we had a good variable that could take into account location within the city.
Given the nuances of different events that influence
pricing, we could create custom interaction effects
between the features if, for example, a huge festival
suddenly increases the demand of three or four-bedroom
houses.
However, let’s say we have thousands of features in
our dataset to try and predict prices for different types
of homes across the world. If we run a random forest
regression model, the advantages are now forming
complex non-linear combinations into a model from a
dataset that could hold one-bedrooms in Seattle and
mansions in Croatia.
But if our problem set is back to a simple example of one
zipcode of Seattle, then our feature set is dramatically
reduced by variation in geography and type of rental, and
a regular linear regression has benefits in being able to
understand the interpretability of the model to quantify
the pricing factors.
A one-bedroom plus two bathroom could probably double
in price compared to a one-bedroom one-bathroom given
the number of guests it could fit, yet this interaction may
not be true in other parts of the world with different
demand pricing.
Question 30: three-zebras (probability)
Difficulty: Medium
Tags: probability
Companies: Facebook
Three zebras are chilling in the desert. Suddenly a lion attacks.
Each zebra is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and only runs along the outline of the triangle, along either edge.
What is the probability that none of the zebras collide?
Solution 30: three-zebras (probability)
Let’s imagine all of the zebras on an equilateral triangle.
They each have two options of directions to go in if they
are running along the outline to either edge. Given the
case is random, let’s compute the possibilities in which
they fail to collide.
There are really only two non-colliding possibilities: the zebras either all choose to run in a clockwise direction or all in a counter-clockwise direction.
Let's calculate the probability of each. The probability that every zebra chooses to go clockwise is the product of each zebra choosing the clockwise direction. Given there are two choices each, that is 1/2 * 1/2 * 1/2 = 1/8. The probability of every zebra going counter-clockwise is the same, 1/8. Summing the two probabilities gives the answer of 1/4, or 25%.
Question 31: month-over-month (sql)
Difficulty: Medium
Tags: sql
Companies: Salesforce, LinkedIn, Amazon, Sezzle
Given a table of transactions and products, write a query to get the month-over-month change in revenue for the year 2019. Make sure to round month_over_month to 2 decimal places.
Example:

Input:

transactions table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  product_id  INTEGER
  quantity    INTEGER

products table
  Column  Type
  id      INTEGER
  name    VARCHAR
  price   FLOAT

Output:
  Column            Type
  month             INTEGER
  month_over_month  FLOAT
Solution 31: month-over-month (sql)
Whenever there is a question on month-over-month, week-over-week, year-over-year, etc. change, note that it can generally be done in two different ways.
One is using the LAG function that is available in certain SQL engines. Another is to do a sneaky self join.
For both, we're going to first have to sum the transactions and group by the month and the year. Grouping by the year is generally redundant in this case because we are only looking at the year 2019.
WITH monthly_transactions AS (
SELECT
MONTH(created_at) AS month,
YEAR(created_at) AS year,
SUM(price * quantity) AS revenue
FROM transactions AS t
INNER JOIN products AS p
ON t.product_id = p.id
WHERE YEAR(created_at) = 2019
GROUP BY 1,2
ORDER BY 1
)
SELECT * FROM monthly_transactions
Now using the LAG function, we can apply it to our column
of revenue. Notice that the LAG function takes a column and
then a number by which to lag the value by. Then we can
just compute the month over month values by the general
formula.
WITH monthly_transactions AS (
SELECT
MONTH(created_at) AS month,
YEAR(created_at) AS year,
SUM(price * quantity) AS revenue
FROM transactions AS t
INNER JOIN products AS p
ON t.product_id = p.id
WHERE YEAR(created_at) = 2019
GROUP BY 1,2
ORDER BY 1
)
SELECT
month
, ROUND((revenue - previous_revenue)/previous_revenue,
2) AS month_over_month
FROM (
SELECT
month,
revenue,
LAG(revenue,1) OVER (
ORDER BY month
) previous_revenue
FROM monthly_transactions
) AS t
The second way we can do this if we aren’t given the LAG
function to use is to do a self-join on the month - 1.
WITH monthly_transactions AS (
SELECT
MONTH(created_at) AS month,
YEAR(created_at) AS year,
SUM(price * quantity) AS revenue
FROM transactions AS t
INNER JOIN products AS p
ON t.product_id = p.id
WHERE YEAR(created_at) = 2019
GROUP BY 1,2
ORDER BY 1
)
SELECT
mt1.month
, ROUND((mt2.revenue - mt1.revenue)/mt1.revenue, 2) AS
month_over_month
FROM monthly_transactions AS mt1
LEFT JOIN monthly_transactions AS mt2
ON mt1.month = mt2.month - 1
Notes: The second solution’s query results are slightly
different (month 12 is null instead of month 1) and thus will
not pass the test case.
Question 32: ride-coupon (probability)
Difficulty: Easy
Tags: probability
Companies: /
1. A ride-sharing app has probability p of dispensing a $5 coupon to a rider. The app services N riders. How much should we budget for the coupon initiative in total?
2. A driver using the app picks up two passengers.
• What is the probability of both riders getting the coupon?
• What is the probability that only one of them will get the coupon?
Solution 32: ride-coupon (probability)
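One standard way to answer, assuming coupons are dispensed independently with probability p per rider:

1. Each rider costs 5p dollars in expectation, so across N riders the expected total budget is 5pN dollars.
2. For the two passengers: P(both get the coupon) = p², and P(exactly one gets the coupon) = 2p(1 − p).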
Question 33: employee-salaries-etl-error (sql)
Difficulty: Medium
Tags: sql
Companies: Microsoft, Noom, MasterClass, Magical, Think
Let's say we have a table representing a company payroll schema.
Due to an ETL error, instead of updating the salaries every year when doing compensation adjustments, the employees table did an insert instead. The head of HR still needs the current salary of each employee.
Bonus: Write a query to get the current salary for each employee.
Note: Assume no duplicate combination of first and last names. (i.e. no two John Smiths.)
Example:

Input: employees table
  Column         Type
  id             VARCHAR
  first_name     VARCHAR
  last_name      VARCHAR
  salary         INTEGER
  department_id  INTEGER

Output:
  Column      Type
  first_name  VARCHAR
  last_name   VARCHAR
  salary      INTEGER
Solution 33: employee-salaries-etl-error (sql)
The first step we need to do would be to remove duplicates
and retain the current salary for each user.
Given we know there aren’t any duplicate first and last
name combinations, we can remove duplicates from the
employees table by running a GROUP BY on two fields,
the first and last name. This allows us to then get a unique
combinational value between the two fields.
This is great, but at the same time, we’re now stuck with
trying to find the most recent salary from the user. How
would we be able to tell which was the most recent salary
without a datetime column?
Notice that the question states that, instead of updating the salaries every year when doing compensation adjustments, the process did an insert instead. This means that the current salary can be found by looking at the most recent row inserted into the table for each employee. We can assume that an insert will auto-increment the id field in the table, which means that the row we want is the one with the maximum id for each given user.
SELECT first_name, last_name, MAX(id) AS max_id
FROM employees
GROUP BY 1,2
Now that we have the corresponding maximum id, we can
re-join it to the original table in a subquery to then get the
correct salary associated with the id in the sub-query.
SELECT e.first_name, e.last_name, e.salary
FROM employees AS e
INNER JOIN (
SELECT first_name, last_name, MAX(id) AS max_id
FROM employees
GROUP BY 1,2
) AS m
ON e.id = m.max_id
Question 34: paired-products (sql)
Difficulty: Hard
Tags: sql
Companies: Amazon
Let’s say we have two tables, transactions and products.
Hypothetically the transactions table consists of over a billion rows of
purchases bought by users.
We are trying to find paired products that are often purchased
together by the same user, such as wine and bottle openers, chips and
beer, etc..
Write a query to find the top five paired products and their names.
Notes: For the purposes of satisfying the test case, P1 should be the
item that comes first in the alphabet.
Example:

Input:

transactions table
  Column      Type
  id          INTEGER
  user_id     INTEGER
  created_at  DATETIME
  product_id  INTEGER
  quantity    INTEGER

products table
  Column  Type
  id      INTEGER
  name    VARCHAR
  price   FLOAT

Output:
  Column  Type
  P1      VARCHAR
  P2      VARCHAR
  count   INTEGER
Solution 34: paired-products (sql)
We are tasked with finding pairs of products that are
purchased together by the same user. Before we can
do anything, however, we need to join the two tables:
transactions and products on id = product_id, so that we can
associate each transaction with a product name:
SELECT user_id, created_at, products.name
FROM transactions
JOIN products
ON transactions.product_id = products.id
Afterwards, we are faced with the first challenge of selecting
all instances where the user purchased a pair of products
together. One intuitive way to accomplish this is to select
all created_at dates in which more than one transaction
occurred by the same user_id, which would look like this:
SELECT
user_id
, created_at
, products.name
FROM transactions
JOIN products
ON transactions.product_id = products.id
WHERE transactions.id NOT IN (
SELECT id
FROM transactions
GROUP BY created_at, user_id
HAVING COUNT(*) = 1
)
This is an acceptable way to accomplish the task but it runs
into trouble in the next step, where we will need to count all
unique instances of each pairing of products.
Fortunately, there is a clever solution which handles both
parts of the problem efficiently. By self joining the combined
table with itself, we can specify the join to connect rows
sharing created_at and user_id:
WITH purchases AS (
SELECT
user_id
, created_at
, products.name
FROM transactions
JOIN products
ON transactions.product_id = products.id
)
SELECT
t1.name AS P1
, t2.name AS P2
, count(*)
FROM purchases AS t1
JOIN purchases AS t2
ON t1.user_id = t2.user_id
AND t1.created_at = t2.created_at
The self join produces every combination of pairs of
products purchased. However, looking at the resulting
selection, it becomes clear that there is an issue:
Product 1             Product 2
federal discuss hard  federal discuss hard
night sound feeling   night sound feeling
go window serious     go window serious
outside learn nice    outside learn nice
We are including pairs of the same products in our
selection.
To fix this, we add AND t1.name < t2.name. One additional
problem that this solves for us is that it enforces a
consistent order to the pairing of names throughout the
table, namely that the first name will be alphabetically “less”
than the second one (i.e. A < Z). This is important because
it avoids the potential problem of undercounting pairs of
names that are in different orders (i.e. A & B vs B & A).
Finally, we can then finish the problem by grouping and
ordering in order to count the total occurrences of each
pair.
WITH purchases AS (
SELECT
user_id
, created_at
, products.name
FROM transactions
JOIN products
ON transactions.product_id = products.id
)
SELECT
t1.name AS P1
, t2.name AS P2
, count(*)
FROM purchases AS t1
JOIN purchases AS t2
ON t1.user_id = t2.user_id
AND t1.name < t2.name
AND t1.created_at = t2.created_at
GROUP BY 1,2 ORDER BY 3 DESC
LIMIT 5
Question
35
dice-worth-rolling(probability)
Difficulty: 【Easy】
Companies: 【Amazon】
Tags: 【probability】
Let’s play a game. You are given two fair six-sided dice and asked to roll them.
If the sum of the values on the dice equals seven, then you win 21 dollars. However, you must pay $10 for each roll.
Is this game worth playing?
Solution
35
Solution:dice-worth-rolling(probability)
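A quick sketch of the expected-value calculation: there are 6 ways to roll a sum of seven out of 36 equally likely outcomes, so P(sum = 7) = 6/36 = 1/6. The expected winnings per roll are therefore (1/6) · 21 = 3.50 dollars, while each roll costs 10 dollars, so the expected value of one roll is 3.50 - 10 = -6.50 dollars. Since the expected value is negative, the game is not worth playing.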
Question
36
random-sql-sample(sql)
Difficulty: 【Medium】
Tags: 【sql】
Companies: 【Microsoft】,【Apple】,【Two】
Let’s say we have a table with an id and name fields. The table holds
over 100 million rows and we want to sample a random row in the
table without throttling the database.
Write a query to randomly sample a row from this table.
Example:
Input: big_table table
Column  Type
id      INTEGER
name    VARCHAR
Solution
36
Solution:random-sql-sample(sql)
In most SQL databases there exists a RAND() function, which we can normally call like this:
SELECT * FROM big_table
ORDER BY RAND()
and the function will randomly sort the rows in the table.
This function works fine and is fast if you only have let’s
say around 1000 rows. It might take a few seconds to run
at 10K. And then at 100K maybe you have to go to the
bathroom or cook a meal before it finishes.
What happens at 100 million rows? Someone in DevOps is
probably screaming at you.
Random sampling is important in SQL with scale. We don’t
want to use the pre-built function because it wasn’t meant
for performance. But maybe we can re-purpose it for our
own use case.
We know that the RAND() function actually returns a
floating-point between 0 and 1. So if we were to instead
call:
SELECT RAND()
we would get a random decimal point to some Nth
degree of precision. RAND() essentially allows us to seed
a random value. How can we use this to select a random
row quickly?
Let’s try to grab a random number using RAND() from our
table that can be mapped to an id. Given we have 100
million rows, we probably want a random number from 1
to 100 million. We can do this by multiplying our random
seed from RAND() by the MAX number of rows in our table.
SELECT CEIL(RAND() * (
SELECT MAX(id) FROM big_table)
)
We use the CEIL function to round the random value to an
integer. Now we have to join back to our existing table to
get the value.
What happens if we have missing or skipped id values, though? We can solve this by joining on all the ids that are greater than or equal to our random value and selecting only the closest neighbor if a direct match is not possible.
As soon as one row is found, we stop (LIMIT 1), and we read the rows according to the index (ORDER BY id ASC). Now our performance is optimal.
SELECT r1.id, r1.name
FROM big_table AS r1
INNER JOIN (
SELECT CEIL(RAND() * (
SELECT MAX(id)
FROM big_table)
) AS id
) AS r2
ON r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
Question
37
liked-and-commented(sql)
Difficulty: 【Medium】
Tags: 【sql】
Companies: 【Facebook】,【Glassdoor】
You’re given two tables, users and events. The events table holds
values of all of the user events in the action column (‘like’, ‘comment’,
or ‘post’).
Write a query to get the percentage of users that have never liked or
commented. Round to two decimal places.
Example:
Input:
users table
Column      Type
id          INTEGER
name        VARCHAR
created_at  DATETIME

events table
Column      Type
user_id     INTEGER
action      VARCHAR
created_at  DATETIME

Output:
Column         Type
percent_never  FLOAT
Solution
37
Solution:liked-and-commented(sql)
The question gives us two tables (users and events) and
asks us to find the percentage of users who have never
liked or commented.
We know two things at once:
• We will have to join our two tables to get what we need.
• Our final SELECT clause will be one sum divided by another sum.
From there, we can begin to strategize about how to
formulate our query. The trick, here, lies in how we go
about defining data points for users who have never done
something. In this case, the first step in the strategy we’re
going to employ is to create a Common Table Expression (or
CTE) to isolate users who have liked or commented. This
will create a temporary table that can be referenced by
the query that follows.
WITH liked_or_commented AS (
SELECT e.user_id
FROM events e
WHERE action IN ('like', 'comment')
GROUP BY 1
)
Note that we’re using the IN operator in our WHERE clause to determine whether the value of the action field can be found in a list of strings. We could achieve the same effect by using:
WHERE action = 'like'
OR action = 'comment'
Both forms are equally valid and won’t affect the
outcome of the query. We simply chose the form that
would fit on a single line.
The next step in our process to isolate users who
have never liked or commented is to perform a LEFT
JOIN. Remember that in a LEFT JOIN, all of the values of
the first table are preserved and only matching records
from the second table are preserved. That means that
if we perform the following join of the users table to our
temporary table liked_or_commented:
FROM users u
LEFT JOIN liked_or_commented loc
ON u.id = loc.user_id
We’re going to be left with NULL values in the user_id field
of the liked_or_commented table for every user that has
never liked or
commented. We’ve also effectively joined
our users and events table since liked_or_commented is
just a version of events that has been
narrowed to specific parameters.
Now the only thing missing from our query is
the SELECT clause. Here,
since we want our final value to be a percentage, we’re
going to divide
one quantity by another quantity. Specifically, we’re going
to divide the
SUM of users who have never liked or commented by
the COUNT of the
total number of users.
To get the first sum, we’re going to use a combination of
the SUM and CASE WHEN functions in SQL:
SUM(CASE WHEN loc.user_id IS NULL THEN 1 ELSE 0 END)
The CASE WHEN function will assign a value of 1 to
every record where the user_id field of the liked_or_
commented CTE IS a NULL value. In every other case,
the CASE WHEN function will assign a value of 0 to the
record. By summing these values, we effectively get a
count of the number of users who have never liked or
commented.
That gives us our numerator. We’ll use a very
simple COUNT of the id field of the users table to get our
denominator. The last step to get our result is to wrap
our small calculation in the ROUND function (which has the form ROUND(quantity to be rounded, # of decimal places)), which gives us:
ROUND(SUM(CASE WHEN loc.user_id IS NULL THEN 1
ELSE 0 END)
/ COUNT(u.id), 2) AS percent_never
That means that our final query looks like:
WITH liked_or_commented AS (
SELECT e.user_id
FROM events e
WHERE e.action in ('like','comment')
GROUP BY 1
)
SELECT
ROUND(SUM(CASE WHEN loc.user_id IS NULL THEN 1 ELSE 0 END)
/ COUNT(u.id), 2) AS percent_never
FROM users u
LEFT JOIN liked_or_commented loc
ON u.id = loc.user_id
Question
38
daily-active-users(sql)
Difficulty: 【Easy】
Tags: 【sql】
Companies: 【Apple】,【Lattice】
Given a table of user logs with platform information, count
the number of daily active users on each platform for the
year of 2020.
Example:
Input: events table
Column      Type
id          INTEGER
user_id     INTEGER
created_at  DATETIME
action      VARCHAR
url         VARCHAR
platform    VARCHAR

Output:
Column       Type
platform     VARCHAR
created_at   DATETIME
daily_users  INTEGER
Question
39
download-facts(sql)
Difficulty: 【Easy】
Tags: 【sql】
Companies: 【Microsoft】,【Amazon】
Given two tables: accounts, and downloads, find the average number
of downloads for free vs paying accounts, broken down by day.
Note: You only need to consider accounts that have had at least one
download before when calculating the average.
Note: round average_downloads to 2 decimal places.
Example:
Input:
accounts table
Column           Type
account_id       INTEGER
paying_customer  BOOLEAN

downloads table
Column         Type
account_id     INTEGER
download_date  DATETIME
downloads      INTEGER

Output:
Column             Type
download_date      DATETIME
paying_customer    BOOLEAN
average_downloads  FLOAT
Solution
39
Solution:download-facts(sql)
We need to use data from both tables, so the first thing
we should do is to join them somehow.
Since we should consider only accounts that had
downloads during the day, we may use an INNER
JOIN (or simply JOIN). This type of join will discard accounts with no records in the downloads table. If we used a different type of join, for example a LEFT JOIN, we would need to decide how to handle accounts with no records in the downloads table.
For example: If there are three records
within the accounts table and two records in
the downloads table:
accounts table:
account_id  paying_customer
1           0
2           0
3           0

downloads table:
account_id  download_date        downloads
1           2020-01-01 00:00:00  100
2           2020-01-01 00:00:00  200
An INNER JOIN (or simply JOIN) query like:
SELECT *
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id
will output only two rows, omitting account 3:
account_id  paying_customer
1           0
2           0

account_id  download_date        downloads
1           2020-01-01 00:00:00  100
2           2020-01-01 00:00:00  200
But if we needed to take into consideration account
number 3, then our calculation would have been
(100+200+0)/3=100
Our second step is to figure out what columns we need
to display in our output. Those columns are download_date, paying_customer, and a calculated column called average_downloads. We should use the AVG()
function to calculate an average.
Since the AVG function is an aggregate function, we need to apply a GROUP BY clause. Grouping should be done by the columns download_date and paying_customer, since those are the columns we want to differentiate entries by.
SELECT download_date, paying_customer,
AVG(downloads) AS average_downloads
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id
GROUP BY download_date, paying_customer
Lastly, we need to apply the ROUND() function to the
average in order to obtain the final result:
SELECT download_date, paying_customer,
ROUND(AVG(downloads),2) AS average_downloads
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id
GROUP BY download_date, paying_customer
Question
40
project-budget-error(sql)
Difficulty: 【Easy】
Tags: 【sql】
Companies: 【Microsoft】,【Facebook】
We’re given two tables. One is named projects and the other maps
employees to the projects they’re working on.
We want to select the five most expensive projects by budget to
employee count ratio. But let’s say that we’ve found a bug where
there exist duplicate rows in the employee_projects table.
Write a query to account for the error and select the top five most
expensive projects by budget to employee count ratio.
Example:
Input:
projects table
Column      Type
id          INTEGER
title       VARCHAR
state_date  DATETIME
end_date    DATETIME
budget      INTEGER

employee_projects table
Column       Type
project_id   INTEGER
employee_id  INTEGER

Output:
Column               Type
title                VARCHAR
budget_per_employee  FLOAT
Solution
40
Solution:project-budget-error(sql)
Given that the bug only exists in the employee_projects table, we can reuse most of the code from the original version of this question, as long as we rebuild the employee_projects table by removing duplicates.
One way to do so is to simply group by the columns project_id and employee_id. By grouping by both columns, we’re creating a table with distinct values of project_id and employee_id, thereby getting rid of any duplicates.
Then all we have to do is query from that table and nest it into another subquery.
SELECT
p.title,
budget/num_employees AS budget_per_employee
FROM projects AS p
INNER JOIN (
SELECT project_id, COUNT(*) AS num_employees
FROM (
SELECT project_id, employee_id
FROM employee_projects
GROUP BY 1,2
) AS gb
GROUP BY project_id
) AS ep
ON p.id = ep.project_id
ORDER BY budget/num_employees DESC
LIMIT 5;
Question
41
biased-five-out-of-six(probability)
Difficulty: 【Medium】
Tags: 【probability】
Companies: 【Facebook】,【Google】
Let’s say we’re given a biased coin that comes up heads 30%
of the time when tossed.
What is the probability of the coin landing as heads exactly 5
times out of 6 tosses?
Solution
41
Solution:biased-five-out-of-six
(probability)
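A quick sketch of the binomial calculation: the number of heads in 6 independent tosses follows a Binomial(6, 0.3) distribution, so P(exactly 5 heads) = C(6,5) · (0.3)^5 · (0.7) = 6 · 0.00243 · 0.7 ≈ 0.0102, or roughly a 1% chance.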
Question
42
closed-accounts(sql)
Difficulty: 【Medium】
Companies: /
Tags: 【sql】
Given a table of account statuses, write a query to get the percentage
of accounts that were active on December 31st, 2019, and closed on
January 1st, 2020, over the total number of accounts that were active
on December 31st. Each account has only one daily record indicating
its status at the end of the day.
Note: Round the result to 2 decimal places.
Example:
Input: account_status table
Column      Type
account_id  INTEGER
date        DATETIME
status      VARCHAR

account_id  date        status
1           2020-01-01  closed
1           2019-12-31  open
2           2020-01-01  closed

Output:
Column             Type
percentage_closed  FLOAT
Solution
42
Solution:closed-accounts(sql)
At first, this question seems pretty straightforward.
We could just compute a SUM(CASE WHEN...) function
that allows us to compute the total number of closed
accounts divided by the total number of accounts.
SELECT SUM(CASE
WHEN status = "closed"
THEN 1 ELSE 0 END)/COUNT(DISTINCT account_id) as
percentage_closed
FROM account_status
WHERE date = '2020-01-01'
But there’s a problem here! This query would count every
closed account, which is not what we want. We are
looking for accounts that were closed only on January
1st, 2020, and open the day before. The account_status table will have the status of each account for each day.
Firstly, we want to find the number of accounts active on
December 31st, 2019, and closed on January 1st, 2020.
This part is done
within the correct_closed_accounts_cte CTE. Secondly, we count the number of accounts within the num_accounts CTE. Finally, we divide both numbers to get the final solution.
WITH correct_closed_accounts_cte AS
(
SELECT COUNT(*) AS numerator FROM account_status a
JOIN account_status b ON a.account_id = b.account_id
WHERE a.date = '2020-01-01' AND b.date = '2019-12-31'
AND a.status = 'closed' AND b.status ='open'
),
num_accounts AS
(
SELECT numerator , COUNT(DISTINCT account_id) AS
denominator
FROM correct_closed_accounts_cte , account_status
WHERE date =
'2019-12-31' AND status ='open'
)
SELECT CAST((numerator/denominator) AS DECIMAL(3,2))
AS percentage_closed FROM num_accounts;
Question
43
fewer-orders(sql)
Difficulty: 【Easy】
Companies: 【Amazon】
Tags: 【sql】
Write a query to identify the names of users who placed fewer than 3 orders or ordered less than $500 worth of product.
Example:
Input:
transactions table
Column      Type
id          INTEGER
user_id     INTEGER
created_at  DATETIME
product_id  INTEGER
quantity    INTEGER

users table
Column  Type
id      INTEGER
name    VARCHAR
sex     VARCHAR

products table
Column  Type
id      INTEGER
name    VARCHAR
price   FLOAT

Output:
Column           Type
users_less_than  VARCHAR
Solution
43
Solution:fewer-orders(sql)
Code:
SELECT DISTINCT(user_name) users_less_than FROM
(
SELECT u.name user_name, COUNT(t.id) tx_count,
SUM(quantity*price) total_prod_worth FROM users u
LEFT JOIN transactions t ON u.id = t.user_id
LEFT JOIN products p ON t.product_id = p.id
GROUP BY 1
) sub
WHERE tx_count<3 OR total_prod_worth < 500;
Question
44
replace-words-with-stems(python)
Difficulty: 【Medium】
Tags: 【python】
Companies: 【Adobe】,【Facebook】,【ABC】
In data science, there exists the concept of stemming, which is the
heuristic of chopping off the end of a word to clean and bucket it into
an easier feature set.
Given a dictionary consisting of many roots and a sentence, write a
function replace_words to stem all the words in the sentence with the
root forming it. If a word has many roots that can form it, replace it
with the root with the shortest length.
Example:
Input: roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output: "the cat was rat by the bat"
Solution
44
Solution:replace-words-with-stems
(python)
At first, it simply looks like we can just loop through each
word and check if the root exists in the word and if so,
replace the word with the root. But since we are technically
stemming the words, we have to make sure that the roots
are equivalent to the word at its prefix rather than existing
anywhere within the word.
We’re given a list of roots and a sentence string. Given we
have to check each word let’s first split sentence into a list
of words.
words = sentence.split()
Next, we loop through each word in words, and for each
check if it has a prefix equal to one of the roots. To do this,
we loop through each possible substring starting at the
first letter. If we find a prefix matching a root, we replace
that word in the words list with the root it contains.
j = 0
while j < len(words):
    i = 0
    while i < len(words[j]):
        if words[j][:i] in roots:
            words[j] = words[j][:i]
            i = len(words[j])
        i = i + 1
    j = j + 1
Notice the line inside the if statement of the inner while
loop.
i = len(words[j])
We need this statement to ensure that if a word contains
two roots, we replace it with the shorter one. For example,
if the roots list from above also contained the string “catt”
we would still return the same output.
Finally, we need to join our updated list of words back into
a sentence.
new_sentence = " ".join(words)
And our final code is as follows:
def replace_words(roots, sentence):
    words = sentence.split()
    j = 0
    while j < len(words):
        i = 0
        while i < len(words[j]):
            if words[j][:i] in roots:
                words[j] = words[j][:i]
                i = len(words[j])
            i = i + 1
        j = j + 1
    new_sentence = " ".join(words)
    return new_sentence
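As a quick check, running the function on the example input from the question reproduces the expected output:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(replace_words(roots, sentence))  # "the cat was rat by the bat"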
Question
45
acceptance-rate(sql)
Difficulty: 【Easy】
Companies: /
Tags: 【sql】
We’re given two tables. friend_requests holds all the friend requests
made and friend_accepts is all of the acceptances.
Write a query to find the overall acceptance rate of friend requests.
Note: Round results to 4 decimal places.
Example:
Input:
friend_requests table
Column        Type
requester_id  INTEGER
requested_id  INTEGER
created_at    DATETIME

friend_accepts table
Column        Type
acceptor_id   INTEGER
requester_id  INTEGER
created_at    DATETIME

Output:
Column           Type
acceptance_rate  FLOAT
Solution
45
Solution:acceptance-rate(sql)
The overall acceptance_rate is going to be computed by the
total number of acceptances of friend requests divided by
the total friend requests given:
Count of acceptances / Count of friend requests
We can pretty easily get both values. Our denominator will
be the total number of friend requests which will be the base
table. We can compute the total number of acceptances by
then LEFT JOINING to the friend_accepts table.
In the JOIN, we have to make sure we’re joining on the correct columns. In this case, we have to match requester_id in both tables and be sure to also match the second column, requested_id, to acceptor_id.
Note: We cannot compute the DISTINCT count given that
users can send and accept friend requests to multiple other
users.
SELECT CAST(COUNT(b.acceptor_id)/ COUNT(a.requester_id)
AS DECIMAL(5,4)) AS acceptance_rate
FROM friend_requests a
LEFT JOIN friend_accepts b
ON a.requester_id = b.requester_id AND a.requested_id =
b.acceptor_id;
Question
46
attribution-rules(sql)
Difficulty: 【Medium】
Companies: /
Tags: 【sql】
Write a query that creates an attribution rule for each user. If the user
visited Facebook or Google at least once then the attribution is labeled
as “paid.” Otherwise, the attribution is labeled as “organic.”
Example:
Input:
user_sessions table
Column      Type
created_at  DATETIME
session_id  INTEGER
user_id     INTEGER

attribution table
Column      Type
session_id  INTEGER
channel     VARCHAR

Output:
Column     Type
user_id    INTEGER
attribute  VARCHAR
Solution
46
Solution:attribution-rules(sql)
WITH cte AS (
SELECT
user_id,
SUM(CASE WHEN (channel = 'Facebook' OR channel =
'Google') THEN 1 ELSE 0 END) AS paid_count
FROM user_sessions
JOIN attribution ON
user_sessions.session_id = attribution.session_id
GROUP BY user_id)
SELECT
user_id,
CASE WHEN paid_count >=1 THEN 'paid' ELSE 'organic'
END AS attribute
FROM cte
Question
47
notification-deliveries(sql)
Difficulty: 【Hard】
Tags: 【sql】
Companies: 【Twitter】,【Facebook】,【Think】,【LinkedIn】
We’re given two tables, a table of notification_deliveries and a table of
users with created and purchase conversion dates. If the user hasn’t
purchased then the conversion_date column is NULL.
Write a query to get the distribution of total push notifications before
a user converts.
Example:
Input:
notification_deliveries table
Column        Type
notification  VARCHAR
user_id       INTEGER
created_at    DATETIME

users table
Column           Type
id               INTEGER
created_at       DATETIME
conversion_date  DATETIME

Output:
Column        Type
total_pushes  INTEGER
frequency     INTEGER
Solution
47
Solution:notification-deliveries(sql)
If we’re looking for the distribution of total push notifications
before a user converts, we can evaluate that we want our
end result to look something like this:
total_pushes | frequency
-------------+----------
0            | 100
1            | 250
2            | 300
...          | ...
In order to get there, we have to follow a couple of logical conditions for the JOIN between users and notification_deliveries:
• We have to join on the user_id field in both tables.
• We have to exclude all users that have not converted.
• We have to set the conversion_date value as greater than the created_at value in the delivery table in order to get all notifications sent to the user.
Cool, we know this has to be a LEFT JOIN additionally in
order to get the users that converted off of zero push
notifications as well.
We can get the count per user, and then group by that
count to get the overall distribution.
SELECT total_pushes, COUNT(*) AS frequency
FROM (
SELECT u.id, COUNT(nd.notification) as total_pushes
FROM users AS u
LEFT JOIN notification_deliveries AS nd
ON u.id = nd.user_id
AND u.conversion_date >= nd.created_at
WHERE u.conversion_date IS NOT NULL
GROUP BY 1
) AS pushes
GROUP BY 1
Question
48
time-on-fb-distribution(statistics)
Difficulty: 【Medium】
Companies: /
Tags: 【statistics】
What do you think the distribution of time spent per day on
Facebook looks like? What metrics would you use to describe
that distribution?
Solution
48
Solution:time-on-fb-distribution
(statistics)
Having the vocabulary to describe a distribution is an
important skill as a data scientist when it comes to
communicating ideas to your peers. There are 4 important
concepts, with supporting vocabulary, that you can use to
structure your answer to a question like this. These are:
1. Center (mean, median, mode)
2. Spread (standard deviation, inter quartile range, range)
3. Shape (skewness, kurtosis, uni or bimodal)
4. Outliers (Do they exist?)
In terms of the distribution of time spent per day on Facebook
(FB), one can imagine there may be two groups of people on
Facebook:
1. People who scroll quickly through their feed and don’t
spend too much time on FB.
2. People who spend a large amount of their social media
time on FB.
From this point of view, we can make the following claims
about the distribution of time spent on FB, with the caveat
that this needs to be validated with real world data.
1. Center: Since we expect the distribution to be bimodal (see
Shape), we could describe the distribution using mode and
median instead of mean. These summary statistics are good for
investigating distributions that deviate from the classical normal
distribution.
2. Spread: Since we expect the distribution to be bimodal (see
Shape), the spread and range will be fairly large. This means
there will be a large inter quartile range that will be needed to
accurately describe this distribution. Further, refrain from using
standard deviation to describe the spread of this distribution.
3. Shape: From our description, the distribution would be bimodal.
One large group of people would be clustered around the lower
end of the distribution, and another large group would be centered
around the higher end. There could also be some skewness to the
right for those people who may spend a bit too much time on FB.
4. Outliers: You can run outlier detection tests like Grubb’s test,
z-score, or the IQR methods to quantitatively tell which users are
not like the rest.
If we were to ask further questions about the demographics of the
users we were interested in, we could come up with
another story using this same vocabulary to
structure our answer!
Question
49
minimum-change(python)
Difficulty: 【Easy】
Companies: 【Google】
Tags: 【python】
Write a function find_change to find the minimum number
of coins that make up the given amount of change cents.
Assume we only have coins of value 1, 5, 10, and 25 cents.
Example:
Input: cents = 73
Output: def find_change(cents) -> 7
#(25 + 25 + 10 + 10 + 1 + 1 + 1)
Solution
49
Solution:minimum-change(python)
def find_change(cents):
    count = 0
    while cents != 0:
        if cents >= 25:
            count += 1
            cents -= 25
        elif cents >= 10:
            count += 1
            cents -= 10
        elif cents >= 5:
            count += 1
            cents -= 5
        elif cents >= 1:
            count += 1
            cents -= 1
    return count
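Running the function on the example from the question confirms the expected count:
print(find_change(73))  # 7  (25 + 25 + 10 + 10 + 1 + 1 + 1)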
Question
50
swipe-precision(sql)
Difficulty: 【Hard】
Tags: 【sql】
Companies: 【Amazon】,【Tinder】
There are two tables. One table is called swipes that holds a row for
every Tinder swipe and contains a boolean column that determines if
the swipe was a right or left swipe called is_right_swipe. The second is
a table named variants that determines which user has which variant
of an AB test.
Write a SQL query to output the average number of right swipes for
two different variants of a feed ranking algorithm by comparing users
that have swiped the first 10, 50, and 100 swipes on their feed.
Note: Users have to have swiped at least 10 times to be included in the
subset of users to analyze the mean number of right swipes.
Example:
Input:
variants table
Column      Type
id          INTEGER
experiment  VARCHAR
variant     VARCHAR
user_id     INTEGER

swipes table
Column          Type
id              INTEGER
user_id         INTEGER
swiped_user_id  INTEGER
created_at      DATETIME
is_right_swipe  BOOLEAN

Output:
Column             Type
variant            VARCHAR
mean_right_swipes  FLOAT
swipe_threshold    INTEGER
num_users          INTEGER
Solution
50
Solution:swipe-precision(sql)
WITH sample AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS swipe_num
FROM swipes
ORDER BY created_at
),
sample2 AS (
SELECT *,
SUM(is_right_swipe) OVER (PARTITION BY user_id ORDER BY swipe_num) AS swipe_count
FROM sample
)
SELECT
v.variant,
s.swipe_num AS swipe_threshold,
AVG(s.swipe_count) AS mean_right_swipes,
COUNT(s.user_id) AS num_users
FROM sample2 AS s
LEFT JOIN variants AS v
ON s.user_id = v.user_id
WHERE swipe_num IN (10, 50, 100)
GROUP BY v.variant, s.swipe_num
Question
51
random-seed-function(probability)
Difficulty: 【Medium】
Companies: 【Google】
Tags: 【probability】
Let’s say you have a function that outputs a random integer between a minimum value, N, and maximum value, M.
Now let’s say we take the output from the random integer function and place it into another random function as the max value with the same min value N.
1. What would the distribution of the samples look like?
2. What would be the expected value?
Solution
51
Solution:random-seed-function
(probability)
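A brief sketch, assuming both functions draw uniformly with endpoints included: the final sample is no longer uniform, because values near N can be produced by every possible first draw, while values near M can only be produced when the first draw is also near M. The distribution is therefore skewed toward N, with probability decreasing as the value approaches M. For the expected value, condition on the first draw X: E[Y | X = k] = (N + k)/2, so E[Y] = (N + E[X])/2 = (N + (N + M)/2)/2 = (3N + M)/4.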
Question
52
find-the-missing-number(algorithms)
Difficulty: 【Easy】
Tags: 【algorithms】
Companies: 【Microsoft】,【PayPal】
You have an array of integers of length n spanning 0 to n with one missing. Write a function missing_number that returns the missing number in the array.
Note: Complexity of O(N) required.
Example:
Input: nums = [0,1,2,4,5]
missing_number(nums) -> 3
Solution
52
Solution:find-the-missing-number
(algorithms)
There are two ways we can solve this problem. One
way through logical iteration and another through
mathematical formulation. We can look at both as they
both hold O(N) complexity.
The first would be through general iteration through the
array. We can pass in the array and create a set which
will hold each value in the input array. Then we create a
for loop that will span the range from 0 to n, and look to
see if each number is in the set we just created. If it isn’t,
we return the missing number.
def missing_number(nums):
    num_set = set(nums)
    n = len(nums) + 1
    for number in range(n):
        if number not in num_set:
            return number
The second solution requires formulating an equation.
If we know that one number is supposed to be missing
from 0 to n, then we can solve for the missing number by
taking the sum of numbers from 0 to n and
subtracting it from the sum of the input array
with the missing value.
An equation for the sum of numbers from 0 to n
is n(n+1)/2.
Now all we have to do is apply the built-in sum function to the input array, and then subtract one value from the other.
def missing_number(nums):
    n = len(nums)
    total = n * (n + 1) / 2
    sum_of_nums = sum(nums)
    return total - sum_of_nums
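Running the example from the question:
print(missing_number([0, 1, 2, 4, 5]))  # 3.0 with the arithmetic version (3 with the set-based version)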
Question
53
lazy-raters(probability)
Difficulty: 【Medium】
Tags: 【probability】
Companies: 【Facebook】,【Netflix】
Netflix has hired people to rate movies.
Out of all of the raters, 80% of the raters carefully rate movies
and rate 60% of the movies as good and 40% as bad. The
other 20% are lazy raters and rate 100% of the movies as
good.
Assuming all raters rate the same amount of movies, what is
the probability that a movie is rated good?
Solution
53
Solution:lazy-raters(probability)
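A quick sketch using the law of total probability: P(good) = P(good | careful) · P(careful) + P(good | lazy) · P(lazy) = 0.6 · 0.8 + 1.0 · 0.2 = 0.48 + 0.20 = 0.68, so the probability that a movie is rated good is 0.68.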
Question
54
impression-reach(probability)
Difficulty: 【Medium】
Companies: /
Tags: 【probability】
Let’s say we have a very naive advertising platform. Given an
audience of size A and an impression size of B, each user in
the audience is given the same random chance of seeing an
impression.
1. Compute the probability that a user sees exactly 0
impressions.
2. What’s the probability of each person receiving at least 1
impression?
Solution
54
Solution:impression-reach
(probability)
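A brief sketch, assuming each of the B impressions is assigned to one user chosen uniformly at random (with replacement) from the audience of size A:
1. For a given user, each impression misses them with probability (1 - 1/A), so P(exactly 0 impressions) = (1 - 1/A)^B.
2. The probability of receiving at least 1 impression is the complement, 1 - (1 - 1/A)^B, which for large A is approximately 1 - e^(-B/A).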
Question
55
conversations-distribution(analytics)
Difficulty: 【Medium】
Tags: 【analytics】
Companies: 【Amazon】,【Think】
We have a table that represents the total number of messages sent
between two users by date on messenger.
1. What are some insights that could be derived from this table?
2. What do you think the distribution of the number of conversations
created by each user per day looks like?
3. Write a query to get the distribution of the number of conversations
created by each user by day in the year 2020.
Example:
Input: messages table
Column     Type
id         INTEGER
date       DATETIME
user1      INTEGER
user2      INTEGER
msg_count  INTEGER

Output:
Column             Type
num_conversations  INTEGER
frequency          INTEGER
Solution
55
Solution:conversations-distribution
(analytics)
1. Top-level insights that can be derived from this table are
the total number of messages being sent per day, number
of conversations being started, and the average number
of messages per conversation.
If we think about business-facing metrics, we can start
analyzing them by including time series.
How many more conversations are being started over
the past year compared to now? Do more conversations
between two users indicate a closer friendship versus the
depth of the conversation in total messages?
2. The distribution would be likely skewed to the right
or bimodal. If we think about the probability for a user
to have a conversation with more than one additional
person per day, would that likely be going up or down?
The peak is probably around one to five new conversations
a day. After that, we would see a large decrease with a
potential bump of very active users that may be using
messenger tools for work.
3. Given we just want to count the number of
conversations, we can ignore the message count and
focus on getting our key metric of a number of new
conversations created by day in a single query.
To get this metric, we have to group by the date field and count the distinct number of users each user messaged. Afterward, we can then group by that count and take its frequency to get the overall distribution.
SELECT num_conversations, COUNT( * ) AS frequency
FROM (
SELECT user1, DATE(date), COUNT(DISTINCT user2) AS
num_conversations
FROM messages
WHERE YEAR(date) = '2020'
GROUP BY 1,2
) AS t
GROUP BY 1
Question
56
move-zeros-back(algorithms)
Difficulty: 【Medium】
Companies: /
Tags: 【algorithms】
Given an array of integers, write a function move_zeros_back
that moves all zeros in the array to the end of the array. If
there are no zeros, return the input array
Example:
Input: array = [0,5,4,2,0,3]
def move_zeros_back(array) -> [5,4,2,3,0,0]
Solution
56
Solution:move-zeros-back(algorithms)
O(n) time complexity and O(1) space complexity:
Use a variable non_zeros to hold the index where the next non-zero item should be placed.
Loop through the array and swap every non-zero item you find into that position, which pushes the zeros to the end while preserving the order of the non-zero items.
def move_zeros_back(array):
    non_zeros = 0  # index where the next non-zero element belongs
    for i in range(len(array)):
        if array[i] != 0:
            array[non_zeros], array[i] = array[i], array[non_zeros]
            non_zeros += 1
    return array
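Running the example from the question:
print(move_zeros_back([0, 5, 4, 2, 0, 3]))  # [5, 4, 2, 3, 0, 0]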
Question
57
bucket-test-scores(pandas)
Difficulty: 【Medium】
Companies: 【Google】
Tags: 【pandas】
Let’s say you’re given a dataframe of standardized test scores from
high schoolers from grades 9 to 12 called df_grades.
Given the dataset, write a function in Pandas called bucket_test_scores to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100.
Example:
Input: print(df_grades)
user_id  grade  test score
1        10     85
2        10     60
3        11     90
4        10     30
5        11     99

Output: def bucket_test_scores(df_grades) ->
grade  test score  percentage
10     <50         33%
10     <75         66%
10     <90         100%
10     <100        100%
11     <50         0%
11     <75         0%
11     <90         50%
11     <100        100%
Solution
57
Solution:bucket-test-scores(pandas)
import pandas as pd

def bucket_test_scores(df):
    bins = [0, 50, 75, 90, 100]
    labels = ['<50', '<75', '<90', '<100']
    df['test score'] = pd.cut(df['test score'], bins, labels=labels)
    df = (df
          .groupby(['grade', 'test score']).agg({'user_id': 'count'})
          .fillna(0)
          .groupby(['grade']).apply(lambda x: 100 * x / float(x.sum()))
          .groupby(['grade']).cumsum()
          .reset_index())
    df['percentage'] = df.user_id.astype(int).astype(str) + '%'
    df.drop(columns='user_id', inplace=True)
    return df
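As a quick check with the example data from the question (rebuilt here as a df_grades dataframe):
students = {"user_id": [1, 2, 3, 4, 5],
            "grade": [10, 10, 11, 10, 11],
            "test score": [85, 60, 90, 30, 99]}
df_grades = pd.DataFrame(students)
print(bucket_test_scores(df_grades))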
Question
58
friendship-timeline(python)
Difficulty: 【Hard】
Companies: /
Tags: 【python】
There are two lists of dictionaries representing friendship beginnings
and endings: friends_added and friends_removed. Each dictionary
contains the user_ids and created_at time of the friendship beginning /
ending.
Write a function friendship_timeline to generate an output that lists the
pairs of friends with their corresponding timestamps of the friendship
beginning and then the timestamp of the friendship ending.
Note: There can be multiple instances over time when two people
became friends and unfriended; only output lists when a corresponding
friendship was removed.
Example:
Input:
friends_added = [
    {'user_ids': [1, 2], 'created_at': '2020-01-01'},
    {'user_ids': [3, 2], 'created_at': '2020-01-02'},
    {'user_ids': [2, 1], 'created_at': '2020-02-02'},
    {'user_ids': [4, 1], 'created_at': '2020-02-02'}]
friends_removed = [
    {'user_ids': [2, 1], 'created_at': '2020-01-03'},
    {'user_ids': [2, 3], 'created_at': '2020-01-05'},
    {'user_ids': [1, 2], 'created_at': '2020-02-05'}]

Output:
friendships = [
    {'user_ids': [1, 2], 'start_date': '2020-01-01', 'end_date': '2020-01-03'},
    {'user_ids': [1, 2], 'start_date': '2020-02-02', 'end_date': '2020-02-05'},
    {'user_ids': [2, 3], 'start_date': '2020-01-02', 'end_date': '2020-01-05'},
]
Solution
58
Solution:friendship-timeline(python)
def friendship_timeline(friends_added, friends_removed):
    friendships = []
    for removed in friends_removed:
        for added in friends_added:
            if sorted(removed['user_ids']) == sorted(added['user_ids']):
                friends_added.remove(added)
                friendships.append({
                    'user_ids': sorted(removed['user_ids']),
                    'start_date': added['created_at'],
                    'end_date': removed['created_at']
                })
                break
    return sorted(friendships, key=lambda x: x['user_ids'])
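Running the function on the friends_added and friends_removed lists from the example returns the three friendship periods shown in the expected output:
print(friendship_timeline(friends_added, friends_removed))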
Question
59
search-ratings(analytics)
Difficulty: 【Easy】
Companies: 【Facebook】
Tags: 【analytics】
You’re given a table that represents search results from searches on Facebook.
The query column is the search term, the position column represents each position
the search result came in, and the rating column represents the human rating of
the search result from 1 to 5 where 5 is high relevance and 1 is low relevance.
Example:
Input: search_results table
Column     Type
query      VARCHAR
result_id  INTEGER
position   INTEGER
rating     INTEGER

query  result_id  position  rating  notes
dog    1000       1         2       picture of hotdog
dog    998        2         4       dog walking
dog    342        3         1       zebra
cat    123        1         4       picture of cat
cat    435        2         2       cat memes
cat    545        3         1       pizza shops

1. Write a query to compute a metric to measure the quality of the search results for each query.
2. You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are as above, we would rank ‘cat’ as having a better search result ranking precision than ‘dog’ based on the correct sorting by rating. Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.

Output:
Column      Type
query       VARCHAR
avg_rating  FLOAT
Solution
59
Solution:search-ratings(analytics)
1. This is an unusual SQL problem given it asks to define
a metric and then write a query to compute it. Generally,
this should be pretty simple. Can we rank by the metric
and figure out which query has the best overall results?
For example, if the search query for ‘tiger’ has 5s for each
result, then that would be a perfect result.
The way to compute that metric would be to simply take
the average of the rating for all of the results. In which
the query can very easily be:
SELECT query, ROUND(AVG(rating), 2) AS avg_rating
FROM search_results
GROUP BY 1
2. The precision metric is a little more difficult now that
we have to account for a second factor which is position.
We now have to find a way to weight the position in
accordance to the rating to normalize the metric score.
This type of problem set can get very complicated if we
wanted to dive deeper into it.
However, the question is clearly geared more toward being practical in figuring out the metric and writing a simple SQL query than toward developing a search ranking precision scale that optimizes for something like CTR.
In solving the problem, it’s helpful to look at the example
to construct an approach towards a metric. For example,
if the first result is rated at 5 and the last result is rated at
a 1, that’s good. Even better however is if the first result is
rated 5 and the last result is also rated 5. Bad is if the first
result is 1 and the last result is 5.
However, if we use the approach from question number
1, we’ll get the same metric score no matter which ways
the values are ranked by position. So how do we factor
position into the ranking?
What if we took the inverse of the position as our
weighted factor?
In which case it would be 1/position as a weighted score.
Now no matter what the overall rating, we have a way to
weight the position into the formula.
SELECT query, ROUND(AVG((1/position) * rating), 2) AS
avg_rating
FROM search_results
GROUP BY 1
Question
60
employee-project-budgets(sql)
Difficulty: 【Medium】
Companies: /
Tags: 【sql】
We’re given two tables. One is named projects and the other maps
employees to the projects they’re working on.
Write a query to get the top five most expensive projects by budget to
employee count ratio.
Note: Exclude projects with 0 employees. Assume each employee
works on only one project.
projects table
Column      Type
id          INTEGER
title       VARCHAR
state_date  DATETIME
end_date    DATETIME
budget      INTEGER

employee_projects table
Column       Type
project_id   INTEGER
employee_id  INTEGER

Output:
Column               Type
title                VARCHAR
budget_per_employee  INTEGER
Solution
60
Solution:employee-project-budgets(sql)
We’re given two tables, one which has the budget of each
project and the other with all employees associated with
each project.
Since the question specifies that each employee works on only one project and excludes projects with 0 employees, we know we can apply an INNER JOIN between the two tables and not have to worry about duplicates or leaving out non-staffed projects.
SELECT project_id, COUNT(*) AS num_employees
FROM employee_projects
GROUP BY 1
The query above grabs the total number of employees
per project. Now, all we have to do is join it to the projects
table to get the budget for each project and divide it by
the number of employees.
SELECT
p.title,
budget/num_employees AS budget_per_employee
FROM projects AS p
INNER JOIN (
SELECT project_id, COUNT(*) AS num_employees
FROM employee_projects
GROUP BY 1
) AS ep
ON p.id = ep.project_id
ORDER BY 2 DESC
LIMIT 5;
Question
61
expected-tests(statistics)
Difficulty: 【Easy】
Companies: 【Facebook】
Tags: 【statistics】
Suppose there are one million users and we want to expose
1000 users per day to a test. The same user can be selected
twice for the test.
1. What’s the expected value of how long someone will have
to wait before they receive the test?
2. What is the likelihood they get selected after the first day?
Is that closer to 0 or 1?
Solution
61
Solution:expected-tests(statistics)
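A brief sketch, assuming the 1000 users exposed each day are drawn uniformly at random from the one million users: on any given day a particular user is selected with probability p = 1000/1,000,000 = 1/1000.
1. The wait until first selection is geometric with parameter p, so the expected wait is 1/p = 1000 days.
2. The probability of having been selected by the end of the first day is p = 1/1000 = 0.001, which is much closer to 0 than to 1.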
Question
62
max-quantity(sql)
Difficulty: 【Easy】
Companies: 【Amazon】
Tags: 【sql】
Given the transactions table, write a query to get the max
quantity purchased for each distinct product_id, every year.
The output should include the year, product_id, and max_quantity for that product, sorted by year and product_id ascending.
Example:
Input: transactions table
Column      Type
id          INTEGER
user_id     INTEGER
created_at  DATETIME
product_id  INTEGER
quantity    INTEGER

Output:
Column        Type
year          INTEGER
product_id    INTEGER
max_quantity  INTEGER
Solution
62
Solution:max-quantity(sql)
WITH cte AS (
SELECT
id,
created_at,
quantity,
product_id,
dense_rank() OVER
(PARTITION BY product_id,
year(created_at)
ORDER BY
quantity DESC) AS max_rank
FROM
transactions
)
SELECT
year(created_at) AS year,
product_id,
quantity AS max_quantity
FROM
cte
WHERE
max_rank = 1
GROUP BY
1,2,3
Question
63
good-grades-and-favorite-colors(pandas)
Difficulty: 【Easy】
Companies: 【Facebook】
Tags: 【pandas】
You’re given a dataframe of students named students_df:
students_df table
name             age  favorite_color  grade
Tim Voss         19   red             91
Nicole Johnson   20   yellow          95
Elsa Williams    21   green           82
John James       20   blue            75
Catherine Jones  23   green           93
Write a function named grades_colors to select only the rows where
the student’s favorite color is green or red and their grade is above 90.
Example:
Input:
import pandas as pd
students = {"name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
            "age": [19, 20, 21, 20, 23],
            "favorite_color": ["red", "yellow", "green", "blue", "green"],
            "grade": [91, 95, 82, 75, 93]}
students_df = pd.DataFrame(students)
Output: def grades_colors(students_df) ->
name             age  favorite_color  grade
Tim Voss         19   red             91
Catherine Jones  23   green           93
Solution
63
Solution:good-grades-and-favorite-colors(pandas)
This question requires us to filter a data frame by two
conditions: first, the grade of the student, and second,
their favorite color.
Let’s start by filtering by grade since it’s a bit simpler
than filtering by strings. We can filter columns in pandas
by setting our data frame equal to itself with the filter in
place.
In this case:
df_students = df_students[df_students["grade"] > 90]
If we were to look at our data frame after running that
line of code, we’d see that every student with a lower
grade than 90 no longer appears in our data frame.
Now, we need to filter by favorite color, but we want to choose between two colors, red and green. We will use the isin() method, which compares each value in the column with a list of colors passed to it, in this case ['red','green']:
students_df['favorite_color'].isin(['red','green'])
Finally, to combine the two conditions on grade and color and filter the rows, we can use the & operator. Our syntax should look like this:
import pandas as pd

def grades_colors(students_df):
    students_df = students_df[(students_df['grade'] > 90) &
                              students_df['favorite_color'].isin(['red', 'green'])]
    return students_df
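Calling the function on the example dataframe keeps only Tim Voss and Catherine Jones:
print(grades_colors(students_df))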
Question
64
median-probability(statistics)
Difficulty: 【Hard】
Companies: 【Google】
Tags: 【statistics】
Given three random variables independently and identically
distributed from a uniform distribution of 0 to 4, what is the
probability that the median is greater than 3?
Solution
64
Solution:median-probability(statistics)
If we break down this question, we’ll find that another
way to phrase it is to ask what the probability is that at
least two of the variables are larger than 3.
For example, if we look at the combination of events that
satisfy the condition, the events can actually be divided
into two exclusive events.
Event A: All three random variables are larger than 3.
Event B: One random variable is smaller than 3 and two
are larger than 3.
Given these two events satisfy the condition of the
median > 3, we can now calculate the probability of
both of the events occurring. The question can now be
rephrased as
P(Median >3)=P(A)+P(B)
Let’s calculate the probability of the event A. The
probability that a random variable > 3 but less than 4 is
equal to 1 ⁄ 4. So the probability of event A is:
P(A)=(1/4)·(1/4)·(1/4)=1/64
Event B requires that two values be greater than 3 while one random variable is smaller than 3. We can calculate this the same way we calculated the probability of A. The probability of a value being greater than 3 is 1/4 and the probability of a value being less than 3 is 3/4. Since any one of the three variables can be the one below 3, there are three such arrangements, so we multiply by three.
P(B)=3((3/4)·(1/4)·(1/4))=9/64
Therefore, the total probability is
P(A)+P(B)=1/64+9/64=10/64
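A quick Monte Carlo sketch (a hypothetical check using Python's random module, not part of the original solution) agrees with 10/64 ≈ 0.156:
import random
trials = 100_000
hits = sum(sorted(random.uniform(0, 4) for _ in range(3))[1] > 3 for _ in range(trials))
print(hits / trials)  # ≈ 0.156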
Question
65
sms-confirmations(sql)
Difficulty: 【Easy】
Companies: /
Tags: 【sql】
The sms_sends table contains all the messages sent to the users and may
contain various message types.
Confirmation messages (type = "confirmation") are sent when a user
registers an account and requires a response. The system may send multiple
confirmation messages to a single phone number, but the user may confirm
only the latest one sent to them.
The ds column represents the date when SMSs are sent.
The confirmers table contains phone numbers that responded to
confirmation messages and dates when users responded.
Write a query to calculate the number of responses grouped by carrier and
country to the SMSs sent by the system on February 28th, 2020.
Example:
Input:
sms_sends table
Column        Type
ds            DATETIME
country       VARCHAR
carrier       VARCHAR
phone_number  VARCHAR
type          VARCHAR

confirmers table
Column        Type
date          DATE
phone_number  VARCHAR

Output:
Column          Type
carrier         VARCHAR
country         VARCHAR
unique_numbers  INTEGER
Question
66
self-upvotes(sql)
Difficulty: 【Medium】
Tags: 【sql】
Companies: 【Reddit】,【Amazon】
We’re given three tables representing a forum of users and their
comments on posts.
Write a query to get the percentage of comments by each user where
that user also upvoted their own comment.
Note: A user that doesn’t make a comment should have a 0 percent
self-upvoted.
Example:
Input:
users table
Column      Type
id          INTEGER
created_at  DATETIME
username    VARCHAR

comments table
Column      Type
id          INTEGER
created_at  DATETIME
post_id     INTEGER
user_id     INTEGER

comment_votes table
Column      Type
id          INTEGER
created_at  DATETIME
user_id     INTEGER
comment_id  INTEGER
is_upvote   BOOLEAN

Output:
Column                 Type
username               VARCHAR
total_comments         INTEGER
percentage_self_voted  FLOAT
Question
67
eta-experiment(a/b testing)
Difficulty: 【Medium】
Companies: /
Tags: 【a/b testing】
Let’s say you work at Uber. A PM comes to you considering a new feature where, instead of a direct ETA estimate like 5 minutes, the app would display a range such as 3-7 minutes.
How would you conduct this experiment, and how would you know if your results were significant?
Solution
67
Solution:eta-experiment(a/b testing)
Clarify
What ETA we are looking for? Is this from the driver’s app
or rider’s app? Is this the ETA of estimated waiting time
after requesting the ride, or is this the ETA of estimated
arrival time at destination after driver picks up rider?
Let’s say it’s the ETA on the rider’s app, which is the time
between request submission and driver arrived at the
pickup location.
Prerequisites
1. Key metrics: revenue increase? or cancellation rate
decrease?
2. Variant: fixed ETA vs Range ETA, is the change easy to
make?
3. Randomization unit: riders who are about to request a ride; do we have enough randomization units?
Experiment Design
1. Sample size is determined based on statistical power,
statistical significance level, practical significance
boundary, population standard deviation
2. Length of experiment is determined by sample size and
actual number of riders requested daily. For example,
the sample size for each group is 1000, and number of
riders requested daily is 100, then we need at least 20
days to run this experiment. Also we need to consider
ramp-up when launching the experiment, so that the
system can handle the change and make sure the change
rolls out correctly. Another thing we need to consider
when deciding the length of experiment is seasonality.
Generally, we need to run at least 1 week to eliminate
weekday difference, and if the experiment period covers
holiday seasons or any other special time, we might also
need to discard those days or extend the experiment
length.
Run the Experiment and collect data
Results to Decision
1. Sanity checks on randomization and on any other factors that might break the identical conditions between the control and treatment groups (e.g. app downtime)
2. Trade-offs between different metrics, cost to implement, and other opportunity costs, so we often set up a practical significance boundary
3. Compare the p-value with the significance level to check if the change is statistically significant
4. Compare the change with the practical significance boundary to check if the change is practically significant
5. If the change is both statistically significant and practically significant, we will make the decision to launch the change to all riders
Question
68
ctr-by-age(sql)
Difficulty: 【Hard】
Companies: 【Facebook】
Tags: 【sql】
Given two tables, search_events and users, write a query to find the
three age groups (bucketed by decade: age 0-9 falls into group 0, age
10-19 to group 1, …, 90-99 to group 9, with the endpoint included) with
the highest clickthrough rate in 2021. If two or more groups have the
same clickthrough rate, the older group should have priority.
Hint: If a user who clicked the link on 1/1/2021 is 29 years old on that day and has a birthday tomorrow on 2/1/2021, they fall into the [20-29] category. If the same user clicked on another link on 2/1/2021, they turned 30 and will fall into the [30-39] category.
Example:
Input:
search_events table
Column       Type
search_id    INTEGER
query        VARCHAR
has_clicked  BOOLEAN
user_id      INTEGER
search_time  DATETIME

users table
Column     Type
id         INTEGER
name       VARCHAR
birthdate  DATETIME

Output:
Column     Type
age_group  VARCHAR
ctr        FLOAT
Solution
68
Solution:ctr-by-age(sql)
WITH cte_1 AS (
SELECT
has_clicked,
TIMESTAMPDIFF(YEAR, birthdate, search_time) DIV 10 AS age_group
FROM users a
JOIN search_events b ON a.id = b.user_id
WHERE YEAR(search_time) = '2021'
),
cte_2 AS (
SELECT age_group, SUM(has_clicked)/COUNT(1) AS clck_rate
FROM cte_1
GROUP BY age_group
)
SELECT age_group, clck_rate AS ctr
FROM cte_2
ORDER BY clck_rate DESC, age_group DESC
LIMIT 3
Question
69
bank-fraud-model(machine learning)
Difficulty: 【Medium】
Tags: 【machine learning】
Companies: 【DigitalOcean】,【ETRADE】,【World】,【Amazon】,【BMO】,【ByteDance】,【Robinhood】,【Accenture】,【Skillz】,【Urbint】,【Facebook】,【Chartboost】,【s】,【Solar】,【Adobe】,【Square】
Let’s say that you work at a bank that wants to build a model
to detect fraud on the platform.
The bank also wants to implement a text messaging service that will
text customers when the model detects a fraudulent transaction, so
that the customer can approve or deny the transaction with a text
response.
How would we build this model?
Solution
69
Solution:bank-fraud-model(machine
learning)
We should frame this as building a binary classifier on an
imbalanced dataset.
A few considerations we have to make are:
• How accurate is our data? Is all of the data labeled
carefully? How much fraud are we not detecting if
customers don’t even know they’re being defrauded?
• What model works well on an imbalanced
dataset? Generally, tree-based models come to mind.
• How much do we care about interpretability? Building
a highly accurate model for our dataset may not be the
best method if we don’t learn anything from it. In the
case that our customers are being compromised without
us even knowing, then we run into the issue of building
a model that we can’t learn from and feature engineer
for in the future.
• What are the costs of misclassification? If we look
at precision versus recall, we can understand which
metric we care about more, given the business problem at hand.
We can assume that low recall in a fraud scenario would be a
disaster. With many false negatives, fraudulent purchases would go under
the radar with consumers not even knowing they were
being defrauded. This could cost the bank thousands of
dollars in lost revenue given they would have to refund
the cost to the consumer.
Meanwhile if there was low precision, customers would
think their accounts were being defrauded all the time.
They would continue to get text messages until they
switched over to another bank, because the transactions
would always be flagged as fraudulent.
Since the question prompts for a text messaging service,
it might make sense then to optimize for recall to
minimize risk and avoid costly fraudulent charges.
We could also graph the precision recall curves at
different price buckets to understand how the precision
and recall thresholds were set.
For example, if recall was lower for purchases under $10 but very high
for purchases over $1,000, then we've effectively mitigated risk by
making it 100x harder to defraud the bank out of lots of money.
Additional considerations
Reweighting: Algorithms such as LightGBM or SVM will
allow us to reweight data.
Custom Loss Function: We can apply different costs to
different false positives and false negatives depending on
the magnitude of the fraud.
SMOTE/ADASYN: Helps us generate synthetic examples of
the smaller class.
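As a minimal illustration of the reweighting idea (a sketch, not the original author's method; the dataset here is synthetic and the parameters arbitrary), scikit-learn's class_weight option can rebalance an imbalanced fraud-style dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic, highly imbalanced "fraud" data: roughly 1% positives (illustrative only).
    X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99, 0.01], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" reweights samples inversely to class frequency,
    # pushing the model to pay more attention to the rare fraud class.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    clf.fit(X_train, y_train)

    # Since the discussion argues for optimizing recall, report precision/recall rather than accuracy.
    print(classification_report(y_test, clf.predict(X_test), digits=3))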
Question
70
jars-and-coins(probability)
难度标题
【Hard】
题目标签
【probability】
公司标签
【Komodo】,【Google】
A jar holds 1000 coins. Out of all of the coins, 999 are fair and
one is double-sided with two heads. Picking a coin at random,
you toss the coin ten times.
Given that you see 10 heads, what is the probability that the
coin is double headed and the probability that the next toss
of the coin is also a head?
Solution
70
Solution:jars-and-coins(probability)
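The worked solution did not survive extraction; the following is a sketch of the standard Bayes' rule calculation rather than the original text.
P(double-headed | 10 heads) = (1/1000 · 1) / (1/1000 · 1 + 999/1000 · (1/2)^10)
= 1024 / (1024 + 999) ≈ 0.506
For the next toss, condition on which coin we are holding:
P(next toss is heads | 10 heads) ≈ 0.506 · 1 + 0.494 · 1/2 ≈ 0.75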
Question
71
lifetime-plays(database design)
难度标题
【Medium】
题目标签
【database design】
公司标签
【Google】
We have a table called song_plays that tracks each time a
user plays a song.
Let’s say we want to create an aggregation table called
lifetime_plays that records the song count by date for each
user.
Write a SQL query that could perform this ETL each day.
song_plays table: id INTEGER, created_at DATETIME, user_id INTEGER, song_id INTEGER
Solution
71
Solution:lifetime-plays(database
design)
CREATE TABLE song_plays (
    id int,
    date_listen date,
    user_id int,
    song_id int
);

CREATE TABLE lifetime_plays (
    date_listen date,
    user_id int,
    song_id int,
    count_plays int
);

INSERT INTO song_plays (id, date_listen, user_id, song_id)
VALUES (1, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 1),
       (2, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 1),
       (3, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 2),
       (4, STR_TO_DATE('2021-02-28', '%Y-%m-%d'), 1, 2),
       (5, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 2, 1);

-- Daily ETL: insert per-user, per-song play counts for the day being loaded.
INSERT INTO lifetime_plays (date_listen, user_id, song_id, count_plays)
SELECT date_listen,
       user_id,
       song_id,
       COUNT(*) AS count_plays
FROM song_plays
WHERE date_listen = STR_TO_DATE('2021-03-01', '%Y-%m-%d')
GROUP BY 1, 2, 3;

SELECT * FROM lifetime_plays;
Question
72
changing-composer(product metrics)
难度标题
【Easy】
题目标签
【product metrics】
公司标签
【Facebook】
Let’s say that Facebook would like to change the user
interface of the composer feature (the posting box) to be
more like Instagram. Instead of a box, Facebook would add a "+"
button at the bottom of the page.
How would you test if this is a good idea?
Solution
72
Solution:changing-composer(product
metrics)
Let’s make some initial assumptions. We can guess that
we want to try a new user interface to improve certain
key metrics that Instagram does better than Facebook in.
Noticeably, given that Instagram is a photo-sharing app,
we can assume that Facebook wants to improve:
• Posts per active user
• Photo posts per active user
Additionally, we have to measure the trade-offs between
the existing UI of the Facebook composer versus the
Instagram UI. While the current composer feature on
Facebook may make it easier to share status updates and
geo-location or sell items, the Instagram composer may
make the user more inclined to share photo posts.
Therefore, given this hypothesis, one way to initially
understand if this test is a good idea is to measure the
effects of an increase in the proportion of photo posts
to non-photo posts on Facebook and how that affects
general engagement metrics.
For example, if we compare the population of users that
have a percentage of photo posts from 10% of the total
versus 20% of the total posts, does this increase our
active user percentage at all? Would it increase monthly
retention rates?
Another thing we have to be aware of is the drop-off
rate for the Facebook composer versus the Instagram
composer. The drop-off rate would directly affect the
general amount of posts that each user makes. We can
look at the drop-off rate between the two composers by
different segments as well such as geographic location,
device type, and demographic markets.
If we want to run an AB test to actually test the
differences instead of just analyzing our existing
segments, we would have to evaluate these same metrics
but make sure not to compare by specific segments
unless they are a large sample size of the population.
Doing it by market/segment may leave it so that you get
a Simpson’s paradox scenario where for most markets
you get a certain result but in aggregate the result is
different.
When running the A/B test, it's also important to specify the rigor
with which the test must be run. For example, sample size and
distribution matter: we need a sufficiently large sample in both
control and test to get a statistically significant result. We should
also randomly assign users to test/control, run until we reach
significance, and change only a single variable (the composer element).
Question
73
equivalent-index(algorithms)
难度标题
【Medium】
题目标签
【algorithms】
公司标签
【Apple】
Given a list of integers, find the index at which the sum of the left half
of the list is equal to the right half.
If there is no index where this condition is satisfied return -1.
Example 1 :
Input: nums = [1, 7, 3, 5, 6]
Output: equivalent_index(nums) -> 2
Example 2 :
Input: nums = [1,3,5]
Output: equivalent_index(nums) -> -1
Solution
73
Solution:equivalent-index(algorithms)
Our goal is to iterate through the list and quickly compute
the sum of both sides of the index in the iteration.
We can do this by first getting the sum of the entire list.
This allows us to then subtract values from one side to
get the value for the other side. If the values are equal,
then we can return the index.
Given this approach, we can then loop through our list
and apply this formula to each value until we find the
index. If it doesn’t exist then we’ll return -1 at the end.
def equivalent_index(nums):
    total = sum(nums)
    leftsum = 0
    for index, x in enumerate(nums):
        # the formula for computing the right side
        rightsum = total - leftsum - x
        leftsum += x
        if leftsum == rightsum:
            return index
    return -1
Question
74
compute-variance(python)
难度标题
【Easy】
题目标签
【python】
公司标签
【Amazon】
Write a function that outputs the (sample) variance given a
list of integers.
Note: round the result to 2 decimal places.
Example :
Input: test_list = [6, 7, 3, 9, 10, 15]
Output: get_variance(test_list) -> 13.89
Solution
74
Solution:compute-variance(python)
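The worked solution did not survive extraction; below is a minimal sketch. Note that the expected output of 13.89 corresponds to dividing by n (the population formula); dividing by n - 1 would give 16.67 for this input.

    def get_variance(data):
        n = len(data)
        mean = sum(data) / n
        # Dividing by n reproduces the expected output (13.89);
        # use (n - 1) instead for the unbiased sample variance (16.67 here).
        variance = sum((x - mean) ** 2 for x in data) / n
        return round(variance, 2)

    print(get_variance([6, 7, 3, 9, 10, 15]))  # 13.89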
Question
75
stranded-miner(probability)
难度标题
【Hard】
题目标签
【probability】
公司标签
【Facebook】
A miner is stranded and there are two paths he can take.
Path A loops back to itself and takes him 5 days to walk.
Path B brings him to a junction immediately (0 days). The
junction at the end of path B has two paths, say Path BA and Path BB.
Path BA brings him back to his original starting point and
takes him 2 days to walk. Path BB brings him to safety and
takes him 1 day to walk.
Each path has an equal probability of being chosen and
once a wrong path is chosen, he gets disoriented and cannot
remember which path he went through, and the probabilities
remain the same.
What is the expected number of days he will spend before he
exits the mine?
Solution
75
Solution:stranded-miner(probability)
First, some terminology. We will call a particular sequence
through the mine a circuit and a decision to go down one
path a walk. This terminology is borrowed from graph
theory.
We will denote the number of days that the miner spends
stranded as D. Note that D is path-dependent; the sequence of paths
the miner takes matters. For example, the circuit
A → A → B → BB
takes 11 days to complete, while the circuit
B → BA → B → BB
takes 3 days to complete, even though the miner got to BB in the
same number of "walks." Due to this, calculating E[D] directly would require you to come
up with a formula to generate the probability of every
possible circuit. Not impossible, but not something you’re
going to be able to do on the spot in an interview.
Because of this difficulty, we won’t focus on D. Instead,
we will focus on the number of “walks” the miner makes,
that is, the number of times they go down a path.
We will denote this as W.
Note that since W measures trials until a success, it lends itself
to a geometric distribution. At the start of any circuit, the
probability of ending up at path BB is
P(BB) = P(BB|B) · P(B) = (1/2)(1/2) = 0.25
Thus, E[W] = 1/0.25 = 4.
Now, let's think about the number of days per walk while he's
trapped. Since, in the long run, half, a quarter, and a quarter of all
walks end in A, BA, and BB respectively, we have
E[D/W] = 5 · (1/2) + 2 · (1/4) + 1 · (1/4) = 3.25
Now note that we have D = W · (D/W) (since W is never zero). Thus,
E[D] = E[W] · E[D/W] = 4 · 3.25 = 13
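As an optional sanity check (not part of the original solution), a quick Monte Carlo simulation of the mine should land near 13:

    import random

    def days_to_escape():
        days = 0
        while True:
            if random.random() < 0.5:        # Path A: loops back, 5 days
                days += 5
            else:                            # Path B: reach the junction immediately
                if random.random() < 0.5:    # Path BA: back to the start, 2 days
                    days += 2
                else:                        # Path BB: out to safety, 1 day
                    return days + 1

    trials = 200_000
    print(sum(days_to_escape() for _ in range(trials)) / trials)  # ~13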
Question
76
first-names-only(pandas)
难度标题
【Medium】
题目标签
【pandas】
公司标签
【Facebook】,【ICF】
You’re given a dataframe containing a list of user IDs and their full
names (e.g. ‘James Emerson’).
Transform this dataframe into a dataframe that contains the user ids
and only the first name of each user.
Example:
Input:
user_id  name
1034     James Emerson
9430     Fiona Woodward
7281     Alvin Gross
5264     Deborah Handler
8995     Leah Xue

Output:
user_id  name
1034     James
9430     Fiona
7281     Alvin
5264     Deborah
8995     Leah
Solution
76
Solution:first-names-only(pandas)
Simply split the name and take the first one.
def first_name_only(users_df):
    users_df['name'] = users_df['name'].str.split(' ').str[0]
    return users_df
Question
77
netflix-retention(product metrics)
难度标题
【Hard】
题目标签
【product metrics】
公司标签
【Netflix】
Let’s say at Netflix we offer a subscription where customers
can enroll for a 30-day free trial. After 30 days, customers will
be automatically charged based on the package selected.
Let’s say we want to measure the success of acquiring new
users through the free trial.
How can we measure acquisition success and what metrics
can we use to measure the success of the free trial?
Solution
77
Solution:netflix-retention(product
metrics)
First a protip:
Let’s go back to thinking about an idea for a strategy
on Product Metrics type questions. One way we can
frame the concept specifically to this problem is to think
about controllable inputs, external drivers, and then the
observable output. It is critical to spend most of the time in
the interview creating good/bad benchmarks, having numeric
goals, explaining actual performance vs expectation, evaluating
inputs, and not be bogged down over other KPIs that we can’t
really influence.
With that in mind, let’s start out by stating the main goals of
the question. What is Netflix’s business model?
Main Goal:
1. Acquiring new users.
2. Decreasing churn and increasing retention.
Let’s think about acquisition before we dive into measuring the
success of the free trial. Starting out, what questions do we
have on acquisition at a larger scale before we jump into the
problem?
1. What’s the size of the market?
This would be the top of the funnel in terms of acquisition.
Let's say there are seven billion people on the planet, comprising
two billion households. If we assume a 5% penetration of high-speed
broadband internet, the potential market size is 100 million households.
2. Size and quality of the potential leads.
In this case, our leads are the free-trial users. In each segment,
we can break down the number and quality of leads by
different factors such as geography, demographics, device type (TV or mobile), acquisition funnel, etc.
Now, let’s focus on acquisition output metrics. What metrics
can we measure that will define success on a top-level
viewpoint for acquisition?
• Conversion rate percentage: # of trial sign-ups / # of leads
in the pool, by sub-segments. This is the number of leads
that we convert into free trial signups. Leads are defined by
customers that click on ads, sign up their email, or any other
top of the funnel activity before the free trial sign-up.
• Cost per free trial acquisition: This is the cost for signing up
each person to a free trial. This can be calculated by the
total marketing spend on advertising the free
trial divided by the total number of free trial
users.
• Daily Conversion Rate: This is the number of daily users
who convert and start to pay divided by the number of
users who enrolled in the 30-day free trial, thirty days ago.
One problem with this metric is it’s hard to get information
about users who enrolled for the free trial given the 30-day
lag from sign-up to conversion. For example, it would take
30 days to get the conversion data from all the users that
signed up for the free trial today.
Going deeper into the daily conversion rate metric, one way
we can get around the 30-day lag is by looking at cohort
performance over time. Everyone who joined in the month of
January (let’s say between day 1 to day 30) would become
cohort-1. Everyone who joined in the month of February would
be cohort-2, etc…
Then you see at the end of the trial:
• What % of free users paid.
• What % of free users still pay for month two, month three,
month four, etc…until you have metrics for month six to
one year. Then we can look at a second cohort for February
sign-ups. Once you have this, then you compare the 30-day
retention for cohort-2 vs cohort-1, then 60-day
retention for cohort-2 vs cohort-1, and so on.
This tells you if the quality of acquisition is effective enough,
and actually encourages long-term engagement.
Now if we jump into a few engagement metrics:
• Percentage of daily users who consume at least an hour of
Netflix content.
• We can break this down by the percentage of users who are
also consuming content at least 1min, 15mins, 1hour, 3hours,
6+hours in a week.
• Average weekly session duration by user
• We can cut this metric by the behavioral segment of the
users. There are different member profiles such as college
students, movie fanatics, suburban families, romantic
comedy enthusiasts, etc.
• Within each role, there’s the job of providing
recommendation to the acquisition team on which parts
of the business is having the highest growth. More
segmentations then exist past demographics of looking
at usage preferences, time of day, content verticals, to
determine which combination will increase the output of
average weekly session duration.
Question
78
estimated-rounds(probability)
难度标题
【Hard】
题目标签
【probability】
公司标签
【Google】
Let’s say that there are six people trying to divide up into two
equally separate teams. Because they want to pick random
teams, on each round, each person shows their hand in either
a face-up or face-down position. If there are three of each
position, then they’ll split into teams.
What’s the expected number of rounds that everyone will
have to pick a hand side before they split into teams?
Solution
78
Solution:estimated-rounds(probability)
Since “they want to pick random teams” and there is no
additional information given, we can assume there is
a 50 ⁄ 50 chance that each person puts a face down or face
up. Thus the face of every individual person follows a Bernoulli
distribution with probability of success p=0.5. We are looking
for the total number of face ups to be exactly 3, as that would
imply the rest of the group has their faces down. Let F denote
the number of faces that are up. F is the sum of Bernoulli random
variables and as such follows a binomial distribution, meaning the
probability of F = 3 is:
P(F = 3) = C(6, 3) · (1/2)^6 = 20/64 = 0.3125
Let R be the number of rounds before having exactly 3 faces up.
Clearly, R follows a geometric distribution because it denotes the
number of trials before one success. For a geometric random variable
G with success probability p, E[G] = 1/p. Thus the expected number of
rounds until teams form is:
E[R] = 1/0.3125 = 3.2
Question
79
power-size(statistics)
难度标题
【Easy】
题目标签
【statistics】
公司标签
【Trivago】,【Apple】,【Qualcomm】,
【Google】
Let’s say you’re analyzing an AB test with a test and control
group.
1. How do you calculate the sample size necessary for an
accurate measurement?
2. Let’s say that the sample size is similar and sufficient
between the two groups. In order to measure very small
differences between the two, should the power get bigger
or smaller?
Solution
79
Solution:power-size(statistics)
1) The necessary sample size (n) depends on the following
factors:
a. Alpha (default is 0.05)
b. Test’s power (default is 80% corresponding to beta = 0.20)
c. The expected effect size (d) between the test and the
control populations, i.e. d = mu1 - mu2.
d. The population variance of the control population, assuming
that the test population has the same variance.
2) a. In order to measure very small differences between the
two groups, we want to reduce false negative (FN) rate.
b. By default, tests are more sensitive to FP than they are to
FN, alpha = 0.05 while beta = 0.2. This convention implies a
four-to-one trade off between β-risk and α-risk.
Given 2.a. & 2.b., beta should get smaller, which implies that
power (1-beta) should get bigger.
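For reference (this formula is not part of the original answer), the per-group sample size for a two-sided, two-sample comparison of means is commonly approximated as
n ≈ 2 · (z_{1-α/2} + z_{1-β})² · σ² / d²
where d is the minimum detectable difference; with alpha = 0.05 and power = 0.80 this is roughly n ≈ 15.7 · σ² / d².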
Question
80
ad-raters-part-2(probability)
难度标题
【Hard】
题目标签
【probability】
公司标签
【Facebook】
Let’s say we use people to rate ads.
There are two types of raters. Random and independent from
our point of view:
• 80% of raters are careful and they rate an ad as good (60%
chance) or bad (40% chance).
• 20% of raters are lazy and they rate every ad as good
(100% chance).
1. Suppose a rater rates just three ads, and rates them all as good.
What's the probability the rater was lazy?
2. Suppose a rater sees N ads and rates all of them as good.
What happens to the probability that the rater was lazy as N
tends to infinity?
3. Suppose we want to exclude lazy raters. Can you come up
with a rule for classifying raters as careful or lazy?
Solution
80
Solution:ad-raters-part-2(probability)
This should be intuitive: the more times we observe a rater rating
every ad they see as good, the more likely we expect it is that the
rater is lazy.
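The detailed calculation did not survive extraction; the following is a sketch of the standard Bayes' rule argument, not the original text.
1. P(lazy | 3 good ratings) = 0.2 · 1³ / (0.2 · 1³ + 0.8 · 0.6³) = 0.2 / (0.2 + 0.1728) ≈ 0.54.
2. P(lazy | N good ratings) = 0.2 / (0.2 + 0.8 · 0.6^N), which tends to 1 as N tends to infinity, matching the intuition above.
3. One simple rule: flag a rater as lazy once they have rated some number N of consecutive ads as good, choosing N so that the posterior probability above exceeds a chosen cutoff (e.g. 95%).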
Question
81
multi-modal-sample(python)
难度标题
【Easy】
题目标签
【python】
公司标签
【Stitch】,【Google】
Write a function for sampling from a multimodal distribution.
Inputs are keys (i.e. green, red, blue), weights (i.e. 2, 3, 5.5),
and the number of samples drawn from the distribution. The
output should return the keys of the samples.
Example :
Input: keys = ['green', 'red', 'blue']
weights = [1, 10, 2]
n=5
sample_multimodal(keys, weights, n)
Output: ['blue', 'red', 'red', 'green', 'red']
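No solution text appears for this question in the source; one minimal sketch uses the standard library (random.choices samples with replacement in proportion to the weights, which do not need to sum to 1):

    import random

    def sample_multimodal(keys, weights, n):
        # Draw n keys with replacement, proportionally to the given weights.
        return random.choices(keys, weights=weights, k=n)

    print(sample_multimodal(['green', 'red', 'blue'], [1, 10, 2], 5))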
Question
82
converted-sessions(probability)
难度标题
【Medium】
公司标签
/
题目标签
【probability】
Let’s say there are two user sessions that both convert with
probability 0.5.
1. What is the probability that they both converted?
2. Given that there are N sessions and they convert with
probability q, what is the expected number of converted sessions?
Solution
82
Solution:converted-sessions(probability)
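The worked solution did not survive extraction; the standard argument is:
1. Assuming the two sessions convert independently, P(both convert) = 0.5 · 0.5 = 0.25.
2. Each of the N sessions converts with probability q, so by linearity of expectation the expected number of converted sessions is N · q.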
Question
83
complete-addresses(pandas)
难度标题
【Medium】
题目标签
【pandas】
公司标签
【Nextdoor】,【Google】
You’re given two dataframes. One contains information about addresses
and the other contains relationships between various cities and states:
Example :
df_addresses (address):
4860 Sunset Boulevard, San Francisco, 94105
3055 Paradise Lane, Salt Lake City, 84103
682 Main Street, Detroit, 48204
9001 Cascade Road, Kansas City, 64102
5853 Leon Street, Tampa, 33605

df_cities (city, state):
Salt Lake City, Utah
Kansas City, Missouri
Detroit, Michigan
Tampa, Florida
San Francisco, California
Write a function complete_address to create a single dataframe with
complete addresses in the format of street, city, state, zip code.
Input: import pandas as pd
addresses = {"address": ["4860 Sunset Boulevard, San Francisco, 94105", "3055 Paradise
Lane, Salt Lake City, 84103", "682 Main Street, Detroit, 48204", "9001 Cascade Road,
Kansas City, 64102", "5853 Leon Street, Tampa, 33605"]}
cities = {"city": ["Salt Lake City", "Kansas City", "Detroit", "Tampa", "San Francisco"],
"state": ["Utah", "Missouri", "Michigan", "Florida", "California"]}
df_addresses = pd.DataFrame(addresses)
df_cities = pd.DataFrame(cities)
Output: def complete_address(df_addresses, df_cities) ->
address
4860 Sunset Boulevard, San Francisco, California, 94105
3055 Paradise Lane, Salt Lake City, Utah, 84103
682 Main Street, Detroit, Michigan, 48204
9001 Cascade Road, Kansas City, Missouri, 64102
5853 Leon Street, Tampa, Florida, 33605
Solution
83
Solution: complete-addresses
(pandas)
def complete_address(df_addresses, df_cities):
    df_addresses[['street', 'city', 'zipcode']] = df_addresses['address'].str.split(', ', expand=True)
    df_addresses = df_addresses.drop(['address'], axis=1)
    df_addresses = df_addresses.merge(df_cities, on="city")
    df_addresses['address'] = df_addresses[['street', 'city', 'state', 'zipcode']].agg(', '.join, axis=1)
    df_addresses = df_addresses.drop(['street', 'city', 'state', 'zipcode'], axis=1)
    return df_addresses
Question
84
same-side-probability(probability)
难度标题
【Medium】
题目标签
【probability】
公司标签
【Microsoft】,【LinkedIn】
Suppose we have two coins. One is fair and the other biased
where the probability of it coming up heads is 3 ⁄ 4.
Let’s say we select a coin at random and flip it two times.
What is the probability that both flips result in the same side?
Solution
84
Solution: Same Side Probability
(probability)
Let's tackle this by first splitting up the probability of getting the
same side twice for the biased coin and then computing the same thing
for the fair coin.
First the biased coin. We know that if we flip the biased coin
we have a 3 ⁄ 4 chance of getting heads. And so the probability
of heads twice will be 3 ⁄ 4 * 3 ⁄ 4 and the probability of tails
twice is 1 ⁄ 4 * 1 ⁄ 4.
Easy, but now what’s the probability of it being either twice
heads or twice tails? In this case, because the computation
is an OR function, the probability is additive. In which the
probabilities of heads twice OR tails twice is computed by
adding the probabilities together.
(3 ⁄ 4) * (3 ⁄ 4) + (1 ⁄ 4) * (1 ⁄ 4) = 10 ⁄ 16 = 0.625
Now the fair coin. We can apply the same formula from the
biased coin to the fair coin. Since heads and tails are both
equivalently probable, we can compute the formula quite
easily with:
(1 ⁄ 2) * (1 ⁄ 2) + (1 ⁄ 2) * (1 ⁄ 2) = 1 ⁄ 2
Now let’s compute the total probability given a random
selection of either coin. Since there are only two coins and
we are equally likely to pick either of them, the probability of
getting each is 1 ⁄ 2. We can then compute the total probability
by again adding the individual probabilities while multiplying
by the probability of choosing either.
P(same side) = (1/2) · 0.625 + (1/2) · 0.5
             = 0.3125 + 0.25
             = 0.5625
Question
85
string-shift(algorithms)
难度标题
【Easy】
题目标签
【algorithms】
公司标签
【Google】,【PayPal】
Given two strings A and B, write a function can_shift to
return whether or not A can be shifted some number of
places to get B.
Example :
Input: A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
Solution
85
Solution: string-shift(algorithms)
This problem is relatively simple if we figure out the underlying
algorithm that allows us to easily check for string shifts
between strings A and B.
First off, we have to set baseline conditions for string shifting.
Strings A and B must both be the same length and consist
of the same letters. We can check for the former by setting a
conditional statement for if the length of A is equivalent to the
length of B.
Now, let's think about the shift itself. If B is merely a reordering
of A rather than a rotation, the condition should fail. We can check
for a rotation by repeating B and then checking whether A exists in
the repeated string.
For example:
A = 'abcde'
B * 2 = 'cdeabcdeab'
Now all we have to do is check whether A exists in B * 2, which
for the example above is true.
def can_shift(A, B):
    return (A and B and
            len(A) == len(B) and
            A in B * 2)
Question
86
choosing-k(machine learning)
难度标题
【Easy】
题目标签
【machine learning】
公司标签
【Facebook】,【Predictive】,【Qualcomm】,【Intel】
How would you choose the k value when doing k-means
clustering?
Solution
86
Solution: choosing-k(machine learning)
Elbow method: build k-means models using increasing values of k,
recording each model's inertia (aka WCSS, the within-cluster sum of
squares, i.e. the sum of samples' squared distances from their cluster
centers) or distortion (the average of samples' squared distances from
cluster centers). Plot k against distortion/inertia and choose k by
taking the value where the graph forms an "elbow": the point after
which the distortion/inertia starts decreasing in a roughly linear
fashion or nearly parallel to the x-axis.
Silhouette method: calculate the mean silhouette coefficient of the
samples, a measure of how similar each point is to its own cluster
relative to neighboring clusters. For a sample i, with a(i) the mean
distance to points in its own cluster and b(i) the mean distance to
points in the nearest neighboring cluster, the coefficient is
(b(i) - a(i)) / max(a(i), b(i)); we look for a value closest to 1. To
verify the choice, for each k in consideration, plot each cluster's
samples' silhouette scores (marking the average across clusters), and
identify the k with few or no clusters below the average silhouette
score and without wide fluctuations in the size of the silhouette plots.
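A minimal sketch of the elbow method described above (illustrative only; the dataset and range of k are made up):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1_000, centers=4, random_state=0)

    # Fit k-means for a range of k and record the inertia (within-cluster sum of squares).
    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    # Choose the k at the "elbow" of this curve.
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("k")
    plt.ylabel("inertia")
    plt.show()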
Question
87
wau-vs-open-rates(product metrics)
难度标题
【Medium】
题目标签
【product metrics】
公司标签
【Pinterest】,【Facebook】
Let’s say that you’re a data scientist on the engagement
team. A product manager comes up to you and says that the
weekly active users metric is up 5% but email notification
open rates are down 2%. How would you investigate what happened?
Solution
87
Solution: wau-vs-open-rates(product
metrics)
Initially reading this question, we should assume it’s first a
debugging question, and then possibly a dive into trade-offs.
WAU (weekly active users) and email open rates are most of the
time, directly correlated, especially if emails are used as
a call to action to bring users back onto the website. An
email opened and then clicked would lead to an active user, but that
does not necessarily mean the two metrics must move together, or that
email is the only driver of the change.
Let’s bring in some context first or state assumptions.
Specifically, around the two variables at play here.
Weekly active users can be defined as the number of users
active at least once in the past 7 days. Active user can be
defined as a user opening the app or website while logged in on
mobile, web, desktop, etc..
Email open rate is defined as the number of email opens
divided by the number of emails sent. We can assume
that both the email open rate and WAU are being measured
compared to their historical past. Such as if email open rates
were always measured within 24 hours of sending the email, then
the email open rate is not down now because it’s being measured
within 12 hours instead.
One approach is to take a closer look at the email open rate metric.
Given it is a fraction, we can understand that a 2%
decrease in open rate is much smaller in scale when we imagine
it as going from a 30% open rate to a 29.4% open rate. In which
case we can then look at segmentations for factors such as bugs
or seasonal trends that could be causing the problem:
• Bugs in the tracking. One time or progressive. Possibly seasonal.
• Platform: Look into if it was an abnormal event in one of the
platforms (mobile, desktop, ios, android)
• Countries or demographics. If certain countries or demographics
are using it more from new trends.
Now, after looking at segmentations, let's dive into hypotheses about
possible trade-offs.
We also have to remember that WAU is many times directly
influenced by the number of new users coming onto a site. For
example, if after two weeks, the user retention is only 20% of
the original number that is active on Pinterest, and after one
month it is 10%, then we might find that at any given time, WAU
could be primarily made up of new users that had just joined
Pinterest that week.
Given this assumption, we can then say that if there was a huge
influx of new users this week, that could be pushing the WAU
number up while also pushing the email open rate down as
we see more users coming onto the website organically or
through ads, without going through the usual email notifications
that long-term users would be attributed to.
Another hypothesis could be that the increase in WAUs triggers
many user-related email notifications and as a result pushes
down the email open rate by increasing the denominator. We
can also then verify this hypothesis by breaking down the email
open rate by different types of email notifications.
Lastly, we can assume that to generate an increase in WAU,
marketing could have sent a very large amount of emails that
pushed up the overall WAU number and created email fatigue
which in turn lowered the email open rates. To verify this, we
could look at different kinds of email metrics such as unsubscribe
rate, and see if there are different email open and unsubscribe
rates by cohorts of the number of emails received total.
Question
88
group-success(product metrics)
难度标题
【Medium】
公司标签
/
题目标签
【product metrics】
How would you measure the success of Facebook Groups?
Solution
88
Solution: group-success(product metrics)
Success Metrics
The goal here is to evaluate and track metrics that relate to our
three main areas of focus; activation, engagement, and
retention.
Activation is how users discover Facebook Groups. Engagement
is tracking the health of user activity on Facebook Groups. Lastly,
retention helps us measure the long-term effect that Facebook
Groups have on the user to see if the user will come back over
time.
1. % of users that join a group after viewing the group (public group)
[activation]. This indicates how effective a page is at showing value
to the user; a large number of recent posts or of new members would
reflect an active community.
2. % of users that engage (post, comment, react) in the group
within one day of joining [engagement].
3. Average engagement score calculated by some combination
of comments + likes per post by a new or returning user in the
group [engagement]. Indicates how supportive and welcoming
existing group members are to new and old users, but this may
depend on the type of content that is posted.
4. % of users that friend or follow another user of the
group within one week of joining [engagement]. This
metric demonstrates how close users are with each other,
and how friendly they are, but this may come across as weird
behavior and not performed by many users.
5. % of users that are returning members compared to
new users [engagement].
6. % of 30 daily active users [retention]. General retention
metrics to see how community brings repeat value to users.
7. % of users that invite a friend to the group [referral].
Indicates if users will promote a group, but there are other
reasons a user may invite friends and this may not be used by a
lot of users.
Question
89
prime-to-n(python)
难度标题
【Medium】
题目标签
【python】
公司标签
【Tiger】,【Zenefits】,【Amazon】
Given an integer N, write a Python function that returns all of
the prime numbers up to N.
Solution
89
Solution: prime-to-n(python)
from math import ceil

def prime_numbers(N):
    primes = []
    if N > 1:
        primes.append(2)
    if N > 2:
        primes.append(3)
    if N > 4:
        for i in range(2, N + 1):
            is_prime = True
            # all primes except 2 and 3 are of the form 6n +/- 1
            if i % 6 == 1 or i % 6 == 5:
                # this number is odd, so we can start at 3 and check only odd divisors
                for j in range(3, ceil(pow(i, 1/2)) + 1, 2):
                    if i % j == 0:
                        is_prime = False
                        break
            else:
                is_prime = False
            if is_prime:
                primes.append(i)
    return primes
Question
90
subscription-retention(sql)
难度标题
【Hard】
题目标签
【sql】
公司标签
【Stripe】,【Houzz】,【Natera】,【Amazon】,
【Niantic】,【Intuit】
Given a table of subscriptions, write a query to get the retention rate
of each monthly cohort for each plan_id for the three months after
sign-up.
Order your output by start_month, plan_id, then num_month.
If an end_date is in the same month as start_date we say the
subscription was not retained in the first month.
If the end_date occurs in the month after the month of start_date, the
subscription was not retained in the second month. And so on for the
third.
The end_date field is NULL if the user has not canceled.
Example:
Input:
subscriptions table: user_id INTEGER, start_date DATETIME, end_date DATETIME, plan_id VARCHAR
Output: start_month DATETIME, num_month INTEGER, plan_id VARCHAR, retained FLOAT
Solution
90
Solution: subscription-retention(sql)
WITH cte_1 AS (
    SELECT
        *,
        DATE_SUB(start_date, INTERVAL DAYOFMONTH(start_date) - 1 DAY) AS 'start_month'
    FROM subscriptions
    ORDER BY plan_id, start_date
),
cte_2 AS (
    SELECT
        x.column_0,
        cte_1.*
    FROM cte_1
    CROSS JOIN (VALUES ROW (1), ROW (2), ROW (3)) AS x
),
cte_3 AS (
    SELECT
        column_0,
        user_id,
        start_date,
        end_date,
        plan_id,
        DATE(start_month) AS start_month,
        IF(IFNULL(PERIOD_DIFF(DATE_FORMAT(end_date, '%Y%m'),
                              DATE_FORMAT(DATE_ADD(start_date, INTERVAL column_0 - 1 MONTH), '%Y%m')),
                  1) > 0,
           1,
           0) AS x
    FROM cte_2
)
SELECT
    start_month,
    column_0 AS num_month,
    plan_id,
    CAST((SUM(x) / COUNT(x)) AS DECIMAL (3, 2)) AS retained
FROM cte_3
GROUP BY start_month, column_0, plan_id
ORDER BY start_month, plan_id, num_month
Question
91
like-tracker(sql)
难度标题
【Easy】
题目标签
【sql】
公司标签
【Facebook】
The events table tracks every time a user performs a certain
action (like, post_enter, etc.) on a platform.
Write a query to determine how many different users gave a
like on June 6, 2020.
Example:
Input:
events table: user_id INTEGER, created_at DATETIME, action VARCHAR, platform VARCHAR
Output: num_users_gave_like INTEGER
Solution
91
Solution: like-tracker(sql)
SELECT COUNT(DISTINCT user_id) AS num_users_gave_like
FROM events
WHERE DATE(created_at) = DATE("2020-06-06")
AND action = "like"
Question
92
duplicate-rows(sql)
难度标题
【Medium】
题目标签
【sql】
公司标签
【Amazon】
Given a users table, write a query to return only its duplicate
rows.
Example:
Input:
users table: id INTEGER, name VARCHAR, created_at DATETIME
Solution
92
Solution: duplicate-rows(sql)
SELECT
    id,
    name,
    created_at
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at ASC) AS ranking
    FROM users
) AS u
WHERE ranking > 1
Question
93
notification-type-conversion(sql)
难度标题
【Hard】
题目标签
【sql】
公司标签
【Facebook】
We’re given two tables, a table of notification deliveries and a
table of users with created and purchase conversion dates. If
the user hasn’t purchased then the conversion_date column is
NULL.
Write a query to get the conversion rate for each notification.
A user may convert only once.
Example:
notification_deliveries table: notification VARCHAR, user_id INTEGER, created_at DATETIME
users table: id INTEGER, created_at DATETIME, conversion_date DATETIME
Output: notification VARCHAR, conversion_rate FLOAT
Solution
93
Solution: notification-type-conversion(sql)
WITH time_differences AS (
    SELECT a.*,
           b.conversion_date,
           TIMESTAMPDIFF(second, a.created_at, conversion_date) AS delta_t
    FROM notification_deliveries a
    JOIN users b ON a.user_id = b.id
),
find_notification_that_converted AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id, delta_t >= 0 ORDER BY delta_t) AS row_no
    FROM time_differences
),
count_notifications AS (
    SELECT x.notification,
           total_notifications,
           IFNULL(notifications_that_converted, 0) AS notifications_that_converted
    FROM (
        SELECT notification, COUNT(*) AS total_notifications
        FROM find_notification_that_converted
        GROUP BY notification
    ) x
    LEFT JOIN (
        SELECT notification, COUNT(*) AS notifications_that_converted
        FROM find_notification_that_converted
        WHERE conversion_date IS NOT NULL AND row_no = 1 AND delta_t >= 0
        GROUP BY notification
    ) y ON x.notification = y.notification
)
SELECT notification,
       IF(notifications_that_converted = 0, 0.0000,
          notifications_that_converted / total_notifications) AS conversion_rate
FROM count_notifications
Question
94
adding-c-to-sample(statistics)
难度标题
【Easy】
题目标签
【statistics】
公司标签
【Amazon】
Let's say you are working as an analyst. There was an error in
collecting data and all entries are off by some number c.
If you were to add c to all the entries, what would happen to
the sample statistics (mean, median, mode, range, variance)
of the field?
Solution
94
Solution: adding-c-to-sample(statistics)
Adding a constant c to each of the N data points changes the
descriptive statistics as follows:
Mean: increases by c. If m is the current mean, the current sum is mN,
and if we add c to each of the N points, the new sum is
mN + cN = (m + c)N; dividing by N, the new mean is (m + c)N/N = m + c,
so the mean has increased by c.
Median: increases by c, because the ordering of the data points is
preserved when every number is increased by c; the new median is still
the 50th percentile of the data set, just c higher than the current median.
Mode: increases by c, because each data point is increased by the same
amount, so each unique value's relative frequency is maintained and the
new mode is just c higher than the current mode.
Range: remains the same, because the minimum point becomes x_min + c
and the maximum point becomes x_max + c; the range is
max - min = (x_max + c) - (x_min + c), where the c values cancel out
and the range is still x_max - x_min.
Variance: remains the same. Intuitively, the relative spread of the
data set is unchanged since each point has shifted by the same amount c.
Mathematically, if the original variance is V(X), then V(X + c) = V(X),
since adding a constant contributes zero variance.
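A quick numerical check of these claims (illustrative; the array and constant are made up):

    import numpy as np

    x = np.array([3.0, 7.0, 7.0, 10.0, 18.0])
    c = 5.0

    # Mean and median shift up by c; range (ptp) and variance are unchanged.
    for name, f in [("mean", np.mean), ("median", np.median), ("range", np.ptp), ("variance", np.var)]:
        print(name, f(x), f(x + c))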
Question
95
unique-work-days(sql)
难度标题
【Medium】
题目标签
【sql】
公司标签
【Amazon】
You have a table containing information about the projects
employees have worked on and the time period in which they
worked on the project. Each project can be assigned to more
than one employee, and an employee can be working on
more than one project at a time.
Write a query to find how many unique days each employee
worked. Order your query by the employee_id.
Example:
Input:
projects table: id INTEGER, title VARCHAR, start_date DATETIME, end_date DATETIME, budget INTEGER, employee_id INTEGER
Output: employee_id INTEGER, days_worked DATETIME
Solution
95
Solution: unique-work-days(sql)
WITH cte AS (
    SELECT employee_id,
           MIN(start_date) AS min_start_date,
           MAX(end_date) AS max_end_date
    FROM projects
    GROUP BY employee_id
)
SELECT
    employee_id,
    TIMESTAMPDIFF(DAY, min_start_date, max_end_date) AS days_worked
FROM cte
ORDER BY employee_id
Question
96
search-algorithm-recall(machine
learning)
难度标题
【Easy】
题目标签
【machine learning】
公司标签
【Amazon】
Let’s say you work as a data scientist at Amazon. You want
to improve the search results for product search but cannot
change the underlying logic in the search algorithm.
What methods could you use to increase recall?
Solution
96
Solution: search-algorithm-recall
(machine learning)
Given we are not allowed to change the algorithm, we have to treat the
search algorithm as a black box: the underlying model will not change,
so instead we tweak the inputs to the algorithm to increase the
overall recall of the output.
Hence if we modify the search query by adding additional input
keywords or chaining the results of different search terms, we
can get different results for the same original search term.
Remember that recall is the fraction of the relevant documents
that are successfully retrieved over the total amount of
relevant documents. In this case that means we want to
generally increase the number of results returned in terms of
relevance.
Let’s take an example to demonstrate.
Let's assume the algorithm uses lexical search for relevancy,
like Lucene. If the search query is "black shirts", the
results would still be generally relevant if the products returned
were dark colored shirts such as dark grey, dark blue, etc…
Instead however, given how the general search algorithm
might work, “black shirts” would be more likely to bring up “blue
shirts” or “black shoes” first instead of other dark colored shirts
given that the algorithm doesn’t know anything besides lexical
association.
Given that we cannot change the underlying algorithm, we
could surface these different dark colored shirts by appending
a synonyms query. A synonyms query would replace or
add to the existing words in the query with words that are
synonymous. Results for synonyms could be chained to the
first search query results. So we would first return the results
for “black shirts”, and then start returning the results for “dark
grey shirts”, “dark blue shirts”, etc…
Another method would be to try search terms of products that
are adjacent to the values being searched. We would use an
algorithm of collaborative filtering to see what products users
bought together. Such as if people were likely to buy black
pants with a black shirt, we could chain the search terms like
“black pants” and “black shoes” into the query as well.
We can also try modifying the search query by adding in
keywords and tags from relevant products that users click on.
If users that search for “black shirts” click on products that
feature black collared shirts at a higher rate, we can append
the keywords of “collared shirt” to our search query to increase
recall towards general user preference.
Question
97
losing-users(product metrics)
难度标题
【Medium】
题目标签
【product metrics】
公司标签
【Facebook】,【Google】
Let’s say you are working at Facebook and are asked to
investigate the claim that Facebook is losing young users.
1. How would you verify this claim?
2. What test metrics would you look at?
Solution
97
Solution: losing-users(product metrics)
Clarifying Questions:
1. Who counts as a young user?
2. What is the source of this claim? (Number of posts, likes, login
frequency or duration, etc.)
3. Over which time periods is this claim observed?
Assessing Requirements:
1. We need user profile and activity data.
Solution:
1. Assume that young users are identified as those under 25 and that
"losing users" is defined as a decrease in activity hours per month.
2. Divide the data into young users and other users. Detect whether
there is a usage difference between the two groups. After that, detect
whether there is a usage change over time for each group individually.
This way we will understand whether any decline is general or specific
to a group.
3. Assume we discover a large difference for young users between this
year and last year, but no big change for other users. We can accept
the claim at this point.
Validation:
1. We can calculate the decrease rate month by month and try to
understand the reasons for the decrease. We can take into account any
updates to the app or other external factors.
Additional Concerns:
1. We should also look at competitor companies' trends.
Question
98
third-unique-song(sql)
难度标题
【Medium】
题目标签
【sql】
公司标签
【Spotify】,【Apple】
Given a table of song_plays and a table of users, write a
query to extract the earliest date each user played their third
unique song.
Example:
Input:
song_plays table: user_id INTEGER, song_name TEXT, date_played DATETIME
users table: id INTEGER, name VARCHAR
Output: name VARCHAR, date_played DATETIME, song_name TEXT
Solution
98
Solution: third-unique-song(sql)
WITH CTE_ds AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY user_id, song_name
                              ORDER BY date_played) AS br,
           song_plays.*
    FROM song_plays
),
CTE_ds2 AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY date_played) AS br2,
           CTE_ds.*
    FROM CTE_ds
    WHERE br = 1
)
SELECT x.name, y.date_played, song_name
FROM users x
LEFT JOIN (
    SELECT user_id, date_played, song_name
    FROM CTE_ds2
    WHERE br2 = 3
) y ON id = user_id
Question
99
post-composer-drop(product
metrics)
难度标题
【Medium】
公司标签
/
题目标签
【product metrics】
Let's say that the Facebook composer, the posting tool, drops
from 3% posts per user last month to 2.5% posts per user
today.
How would you investigate what happened?
Let’s say the drop is in photo posts. What would you
investigate now?
Solution
99
Solution: post-composer-drop
(product metrics)
Clarification
1. Is it a true drop? Is it a sudden drop, or has it been declining
for a while?
2. Are there any recent changes that might have affected this?
3. Is the per-user metric looking at active users only or everyone?
Investigation of potential root causes: look at the overall trend
(YoY, MoM) to confirm it is indeed a true decline. If there is a
decline, is it sudden? That could indicate the feature itself has a
problem or is not working properly.
1. Are we seeing a similar decline for other features?
2. Has the overall active user base using the Facebook composer dropped?
3. Look at it by different cohort to see if it’s coming from a
particular group:
• Surface (iOS, android, desktop, mobile, etc.)
• Geography
• New/tenure users
• Business type?
• Paid ads or no?
Depending on the findings, we can suggest a few recommendations. If
the drop comes from a specific cohort, we can create targeted
marketing campaigns to promote the posting tool and get those users
back to the feature. If it's due to the feature itself being hard to
use or find, we can work on making the tool easier to use and provide
more education for users on how to use it.
Question
100
permutation-palindrome(algorithms)
难度标题
【Medium】
题目标签
【algorithms】
公司标签
【Snapchat】,【ByteDance】,【Squarepoint】,
【Amazon】,【Snap】
Given a string str, write a function perm_palindrome to
determine whether there exists a permutation of str that is a
palindrome.
Example:
Input: str = 'carerac'
def perm_palindrome(str) -> True
“carerac” returns True since it can be rearranged to form
“racecar” which is a palindrome.
Solution
100
Solution:permutation-palindrome
(algorithms)
def perm_palindrome(str):
    # A string has a palindromic permutation iff at most one character has an odd count.
    arr = [0] * 1000
    num_odds = 0
    for char in str:
        i = ord(char)
        arr[i] += 1
        if arr[i] % 2 != 0:
            num_odds += 1
        else:
            num_odds -= 1
    return num_odds <= 1
Question
101
estimating-d(statistics)
难度标题
【Medium】
题目标签
【statistics】
公司标签
【Spotify】
Given N samples from a uniform distribution [0, d], how
would you estimate d?
Solution
101
Solution:estimating-d(statistics)
What does a uniform distribution look like? Just a straight line
over the range of values from 0 to d, where any value between
0 to d is equally likely to be randomly sampled.
So, let’s make this easy to understand practically. If we’re
given N samples and we have to estimate what d is with zero
context of statistics and based on intuition, what value would
we choose?
For example, if our N sample is 5 and our values are: (1,4,6,2,3),
what value would we guess as d? Probably the max value of 6
right?
But, let’s look at another example. Let’s say our N sample is 5
again and our values are instead: (20,30,28,26,16). Would our
estimate still be the max value of 30?
Intuitively, it doesn’t seem correct right? And that’s because
if we assume d as 30, then that means these values are
spanned from 0 to 30 but somehow all of the values sampled
are above our projected median of 15.
In the first example, all of our values were equally distributed
from 0 to 6, while in this example, all of our values are skewed
above the 50% percentile. Now, we can come up
with a new estimator for d.
One way to compute it would be that the average of a uniform
distribution is in its middle. The two parameters of interest in
a uniform distribution are its minimum and maximum values,
as the entirety of its values are uniformly distributed between
them. If d is the maximum and 0 is the minimum, half of d is its
average.
E(X) is the average, so E(X) = d/2, which gives the estimator
d ≈ 2 · (sample mean of the N values).
How do we know how to choose between the two estimators?
We have to ask the interviewer about the distribution of our N
samples. For example, if we were to continue to sample from
the uniform distribution and calculate the mean of the samples
each time, seeing huge variations of the mean would tell us
that the samples from our distribution are biased.
Question
102
pca-and-k-means(machine learning)
难度标题
【Medium】
题目标签
【machine learning】
公司标签
【Google】,【Palo】,【Uber】,【Booz】,
【Rincon】,【Ocrolus】,【AstraZeneca】,
【QuantumBlack】,【BNP】,【General】
What’s the relationship between PCA and K-means clustering?
Solution
102
Solution:pca-and-k-means
(machine learning)
Both k-means and PCA are unsupervised machine learning techniques.
While PCA is used for dimensionality reduction, k-means is used for
clustering. K-means struggles in high-dimensional settings: it is
computationally expensive, and distances become less informative, so
points may be clustered incorrectly. Hence, before running k-means,
one often performs PCA first to reduce dimensionality.
Question
103
activity-conversion(analytics)
难度标题
【Hard】
题目标签
【analytics】
公司标签
【Apple】,【Facebook】
You’re given three tables, users, transactions and events. We’re
interested in how user activity affects user purchasing behavior.
The events table holds data for user events on the website where the
action field would equal values such as like and comment.
Write a query to prove if users that interact on the website (likes,
comments) convert towards purchasing at a higher volume than users
that do not interact.
users table: id INTEGER, name VARCHAR, created_at DATETIME
transactions table: user_id INTEGER, name VARCHAR, created_at DATETIME
events table: user_id INTEGER, action VARCHAR, created_at DATETIME
Solution
103
Solution:activity-conversion
(analytics)
/* count number of transactions per user */
with tcnt as (
    select users.id, count(transactions.created_at) as no_of_t
    from users
    left join transactions on users.id = transactions.user_id
    group by users.id
),
/* count number of like/comment events per user */
ecnt as (
    select users.id, count(events.created_at) as no_of_e
    from users
    left join events on users.id = events.user_id
    where action = 'like' or action = 'comment'
    group by users.id
)
/* now combine the two tables and determine the avg number of events
   for each number of transactions */
select no_of_t, avg(no_of_e)
from tcnt
left join ecnt on tcnt.id = ecnt.id
group by no_of_t
Question
104
scalped-ticket(probability)
难度标题
【Easy】
公司标签
/
题目标签
【probability】
One of your favorite sports teams is playing at a local
stadium, but you waited until the last minute to buy a ticket.
You can buy a scalped (second-hand) ticket for $50, which has a 20%
chance of not working. If the scalped ticket doesn't work, you'll have
to buy a box office ticket for $70 at the stadium.
1. How much do you expect to pay to go to the sports game?
2. How much money should you set aside for the game?
Solution
104
Solution:scalped-ticket(probability)
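The solution text in the source simply repeats the question, so here is a sketch of the standard expected-value argument:
1. You always pay $50 for the scalped ticket, and with probability 0.2 you must additionally pay $70 at the box office, so the expected cost is 50 + 0.2 · 70 = $64.
2. The expected cost is $64, but to be sure you can get in you should arguably set aside the worst-case amount, 50 + 70 = $120.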
Question
105
button-ab-test(a/b testing)
难度标题
【Easy】
题目标签
【a/b testing】
公司标签
【Nextdoor】,【Amazon】,【Livongo】,
【Agoda】,【Known】,【Impossible】,
【Ibotta】,【Dropbox】,【Gusto】
A team wants to A/B test multiple different changes through
a sign-up funnel.
For example, on a page, a button is currently red and at the
top of the page. They want to see if changing a button from
red to blue and/or from the top of the page to the bottom of
the page will increase click-through.
How would you set up this test?
Solution 105: button-ab-test (a/b testing)
There are two options: run a multi-variant test, or run simultaneous tests.
1. Calculate the desired effect size of our change.
2. Calculate the required sample size and duration of the experiment to detect that effect size (see the sketch after this list).
3. Ensure proper tracking of CTR on our homepage.
4. Ensure the experiment framework properly randomizes users between treatment and control.
5. If we want to run a simultaneous test, we'll need a framework for measuring the interaction effects. We can:
• Measure each variant individually against control.
• For each variant, calculate the interaction term to determine the influence of the other experiment.
• The benefit is that we get more power with the simultaneous test, and we can understand what would happen if we rolled both variants out.
6. If we instead run the tests separately, we gain interpretability of our results, at the cost of potential power improvements and a delay in results.
7. We can apply variance reduction techniques, such as stratification or adding covariates, to reduce the effect of external factors.
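For step 2, a hedged sketch of the sample-size calculation using statsmodels; the baseline CTR and the minimum detectable effect below are made-up numbers:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.10    # assumed current click-through rate
target_ctr = 0.11      # assumed minimum CTR we want to be able to detect

# Convert the two proportions into a standardized effect size (Cohen's h).
effect_size = proportion_effectsize(target_ctr, baseline_ctr)

# Users needed per variant for alpha = 0.05 and 80% power.
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_variant))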
Question 106: merge-sorted-lists (algorithms)
Difficulty: Easy | Tags: algorithms | Companies: Workday, Two, PayPal, Facebook, Indeed
Given two sorted lists, write a function to merge them into
one sorted list.
Bonus: What’s the time complexity?
Example:
Input: list1 = [1,2,5]
list2 = [2,4,6]
Output: def merge_list(list1,list2) -> [1,2,2,4,5,6]
Solution 106: merge-sorted-lists (algorithms)
def merge_list(list1, list2):
    list3 = []
    i = 0
    j = 0
    # Traverse both lists. If the current element of the first list
    # is smaller than the current element of the second list, store
    # the first list's value and increment its index.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            list3.append(list1[i])
            i = i + 1
        else:
            list3.append(list2[j])
            j = j + 1
    # Store remaining elements of the first list
    while i < len(list1):
        list3.append(list1[i])
        i = i + 1
    # Store remaining elements of the second list
    while j < len(list2):
        list3.append(list2[j])
        j = j + 1
    return list3
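Bonus answer: the merge runs in O(n + m) time and uses O(n + m) extra space for the output, where n and m are the lengths of the two input lists, since each element is visited and appended exactly once.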
Question 107: promoting-instagram (product metrics)
Difficulty: Medium | Tags: product metrics | Companies: /
Let’s say you work on the growth team at Facebook and are
tasked with promoting Instagram from within the Facebook
app.
Where and how could you promote Instagram through
Facebook?
Solution 107: promoting-instagram (product metrics)
Goal: increase awareness of Instagram through Facebook.
Hypothesis: showing Instagram ads to users in their News Feed will increase the likelihood that they log in to Instagram by X%.
Run an A/B test:
• Control group: no changes.
• Variant group: shown an Instagram ad as the first ad they see when scrolling through their News Feed.
• Randomly assign users to each group, making sure the groups are unbiased and representative of the population.
• Set a significance level, e.g. 95%.
• Set the experiment duration: how long the experiment will run.
• Set the power, usually 80%.
• Estimate the intended effect size, e.g. 20%.
Metrics:
• Number of Instagram logins within 24 hours of being exposed to the Instagram ad.
• Instagram logins / number of users: the percentage of users logging into Instagram after using Facebook.
• Guardrail metrics: ad revenue, CTR, revenue per session. Since we're taking up ad space, we want to see how much these ads cost us.
Another idea: notify people on Facebook when their friends join Instagram. We can then run a regression of the number of friends a user has on Instagram against the percentage of those users who go on to use Instagram.
Question 108: significance-time-series (statistics)
Difficulty: Medium | Tags: statistics | Companies: Amazon, MasterClass, Apple
Let’s say you have a time series dataset grouped monthly for
the past five years.
How would you find out if the difference between this month
and the previous month was significant or not?
Solution 108: significance-time-series (statistics)
As stated, the dataset is grouped monthly, and for the
purposes of this answer, let’s say that the data is the number
of unique visitors to said website. This means that at the end
of each month, the number of unique visitors from every day
that month is summed up and reported.
We are interested in whether the difference between this
month and the previous month is significant. To test this, we
can take all the differences in unique visitors between every
month and the month after it (e.g. January and February of
Year 1, February and March of Year 2, etc.).
This will result in a population of differences in unique
visitors. We can then take the month we are interested in
and run a t-test against the sample that we have. This sample
size is large enough to extract useful information from it.
Once you get the output t-statistic, you can then calculate
your p-value. If the p-value is less than your desired threshold,
then the difference you are interested in is in fact statistically
significant.
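A hedged sketch of this procedure; monthly_visitors below is hypothetical data standing in for the five years of monthly totals:

import numpy as np
from scipy import stats

# Hypothetical monthly unique-visitor counts for the past five years (60 values).
monthly_visitors = np.random.poisson(lam=50_000, size=60)

# Month-over-month differences; the last one is the change we care about.
diffs = np.diff(monthly_visitors)
latest_diff, historical_diffs = diffs[-1], diffs[:-1]

# One-sample t-test: is the latest change unusual relative to the
# historical population of month-over-month changes?
t_stat, p_value = stats.ttest_1samp(historical_diffs, popmean=latest_diff)
print(t_stat, p_value)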
However, one aspect we should watch out for is confounders that may affect the overall trendline in the full dataset.
One major variable that affects most time series datasets is
seasonality, and you can adjust each month by normalizing
(or dividing) the month’s value by a factor proportional to
the effect of seasonality. We can also quantify seasonality by
looking at year over year change to see if the seasonal effect
is strong or non-existent.
For example, if more users tend to go on the website in the
summertime, you need to adjust May to August months
accordingly. Furthermore, what if there are campaigns that
have increased traffic to the website over the past few years?
Or what if our business has generally done better year over
year?
This would be the effect of trend on time series. One way we
can account for trend is to normalize it like seasonality. But
this wouldn’t work out perfectly if growth had an interaction
effect with seasonality.
One method that we can run to adjust for seasonality and
trend is to run forecasts each month on what our next
month’s expected numbers are. This way, we can compare
our forecasts against our actual numbers.
Our forecasts will have to be tuned to see if there is a linear relationship in the historical data. If there isn't, we would use something more like a three-month moving average rather than a traditional linear ARIMA.
Lastly, we should set a margin of variance on the expectation between our forecast and the actual value. This threshold should be based on the business. If the metric were revenue, for example, we wouldn't want more than, say, a 5% difference between forecast and actual, given that it could signal bigger problems with cash flow and expenses. However, if the metric is less directly tied to the business, such as an engagement metric or revenue from a smaller product offering, then we can accept a larger change in the variance and set the threshold higher.
Question 109: netflix-price (business case)
Difficulty: Hard | Tags: business case | Companies: Netflix
How would you determine if the price of a Netflix subscription
is truly the deciding factor for a consumer?
Solution 109: netflix-price (business case)
Based on the initial question raised, let's focus on conversion, say from free trial to subscription. The same idea applies to investigating retention, which is repeat purchase. One way to approach this question is through a quasi-experiment. Netflix rolls out price changes on a country-by-country basis, and a change "in the US does not influence or indicate a global price change," a Netflix spokesperson told The Verge (source: https://www.theverge.com/2020/10/29/21540346/netflix-priceincrease-united-statesstandard-premium-content-productfeatures).
This creates a good setting for a difference-in-differences analysis: for a chosen period of time, say 2 months (reasonable because Netflix's business model is a monthly subscription), compare the conversion rate between two groups of countries in which only one group experienced a price increase, both before and after the price change. Theoretically, this gives you the average treatment effect of price on the conversion rate.
But of course this method relies on strong assumptions and has limitations, and we need to include a set of covariates to control for confounders, such as socioeconomic factors of each country: GDP, average income, number of movie titles, number of TV titles, cost per movie title, cost per TV title, total library size, etc.
Another option is a geo experiment, where one group of markets in the US is given the control price and another group of markets is given the treatment price. The problem with this approach is that the two groups are most likely not comparable. We can either apply the difference-in-differences method or use matching to take care of this.
Question 110: hundreds-of-hypotheses (a/b testing)
Difficulty: Medium | Tags: a/b testing | Companies: Amazon
You are testing hundreds of hypotheses with many t-tests.
What considerations should be made?
Solution 110: hundreds-of-hypotheses (a/b testing)
Type I error will scale with the number of t-tests you run. If your significance level alpha for a single t-test is 0.05, i.e. we allow a 5% Type I error rate on a single test, then across many tests the overall P(Type I error) will increase.
For example, with 2 independent tests:
P(Type I error) = P(Type I error on A OR Type I error on B) = 2 * P(Type I error on a single test) - P(Type I error on A AND Type I error on B) = 2(0.05) - 0.05^2 = 0.1 - 0.0025 = 0.0975
If you want your P(Type I error) across n tests to remain at 5%, you will need to decrease the alpha used in each individual test (for example, with a Bonferroni correction).
Otherwise, you can first run an F-test to identify whether at least one test sees a significant effect, then run a t-test on the specific experiment with the highest effect size. Granted, the p-value of the test will also depend on the variance of the sample in the given test; if we assume constant variance across tests, then the test with the highest effect size is, in expectation, the best performing test. Only running a single t-test will keep your P(Type I error) low.
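To make the scaling concrete, here is a small sketch (assuming independent tests) of how the family-wise error rate grows and how a Bonferroni correction shrinks the per-test alpha:

alpha = 0.05
for n_tests in (1, 2, 10, 100):
    # P(at least one false positive) across n independent tests
    fwer = 1 - (1 - alpha) ** n_tests
    # Bonferroni-corrected per-test alpha that keeps the family-wise rate near 5%
    per_test_alpha = alpha / n_tests
    print(n_tests, round(fwer, 4), per_test_alpha)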
Question 111: disease-testing-probability (probability)
Difficulty: Hard | Tags: probability | Companies: Asana
Bob takes a test and tests positive for the disease. Bob was in close contact with six other friends; they all take the same test and end up testing negative.
The test has a 1% false positive rate and a 15% false negative rate.
What's the percent chance that Bob was actually negative for the disease?
Solution 111: disease-testing-probability (probability)
While this immediately looks like a Bayes’ Theorem problem,
the lack of disease prevalence leads me to believe that this
is a comparison of binomial distributions or using Bayes’
Theorem with the comparison. Because 7 people have been
close (6 people and Bob), we can conclude that they all share
the same condition, either (1) all positive for the disease or
(2) none positive for the disease. Therefore, we can use the
FPR and FNR values as p for our binomial distributions to
calculate the probability of each of the situations and decide
the most probable.
(1) In the case where they are all positive, we consider X = #
of False Negatives so p=.15 and n=7. Therefore, P(X=6) = 6.8
* 10^(-5)
(2) In the case where they are all negative, we consider X = #
of False Positives so p=.01 and n=7. Therefore, P(X=1) = 0.066
So case 2 is more likely. Taking a page from Bayes’ Theorem,
P(Bob is actually negative) = .066 / (.066 + 6.8 * 10^(-5)) =
0.999
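The two binomial likelihoods above can be checked with a few lines of Python (this is only a sketch of the same computation, not a different method):

from scipy.stats import binom

# Case 1: all 7 are truly positive; 6 of the 7 tests come back (falsely) negative.
p_all_positive = binom.pmf(6, 7, 0.15)    # about 6.8e-5

# Case 2: all 7 are truly negative; 1 of the 7 tests comes back (falsely) positive.
p_all_negative = binom.pmf(1, 7, 0.01)    # about 0.066

# Relative weight of case 2, as in the answer above.
print(p_all_negative / (p_all_negative + p_all_positive))    # about 0.999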
Question 112: parents-joining-teens (product metrics)
Difficulty: Medium | Tags: product metrics | Companies: /
Let’s say you’re a data scientist at Facebook.
How would you evaluate the effect on the engagement of teenage users when their parents join Facebook?
Solution 112: parents-joining-teens (product metrics)
Since you cannot run a randomized test (unless you figure out a way to force parents of teens to join), this will need to be an observational study with a quasi-experimental design to answer the question: "How do parents cause teens to behave in different ways?"
Look at two groups of teen users at two time periods. At time t0, parents of teens in group 1 (test) join Facebook while parents in group 2 (control) do not. At time t1, compare the pre-to-post change in user behavior of users in the test group to that of the control group.
Since random assignment is not possible, you'll need to control for selection bias through matching or regression. The variables to match on would depend on the outcome measure of interest (time spent, engagement on tagged posts, sharing, posting). A few selection controls could include age, affluence, education level of teens and parents, ethnic/cultural background, and the size and density of connections.
Ultimately, compare the pre-to-post change in metrics for the two groups at time t1 (relative to time t0) and see if the differences are significant.
Question 113: emails-opened (sql)
Difficulty: Easy | Tags: sql | Companies: Facebook, Wayfair
The events table tracks every time a user performs a certain
action (like, post_enter, etc.) on a platform.
How many users have ever opened an email?
Example:
Input: events table
  user_id     INTEGER
  created_at  DATETIME
  action      VARCHAR
  platform    VARCHAR

Output:
  num_users_open_email  INTEGER
Solution 113: emails-opened (sql)
SELECT
count(DISTINCT user_id) AS num_users_open_email
FROM
events
WHERE
action = 'email_opened'
Question 114: low-precision (machine learning)
Difficulty: Easy | Tags: machine learning | Companies: /
Let’s say you’re tasked with building a classification model
to determine whether a customer will buy on an e-commerce
platform after making a search on the homepage.
You find that your model is suffering from low precision.
How would you improve it?
Solution 114: low-precision (machine learning)
Plot the ROC curve and increase the classification threshold without sacrificing recall too much; a higher threshold produces fewer, more confident positive predictions, which raises precision. Also give more weight to features related to a user's actions, such as previous activity on the site and login time: basically, features that signal an intention to buy.
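A minimal, self-contained sketch of the threshold idea; the synthetic dataset and logistic regression below are stand-ins for the real purchase-prediction model:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced stand-in for the purchase-prediction problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]

# Raise the decision threshold above the default 0.5 and inspect the trade-off:
# precision generally rises while recall falls.
for threshold in (0.5, 0.6, 0.7, 0.8):
    preds = (probs >= threshold).astype(int)
    print(threshold,
          round(precision_score(y, preds, zero_division=0), 3),
          round(recall_score(y, preds, zero_division=0), 3))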
Question 115: how-many-friends (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: /
You are given a list of lists where each group represents a
friendship.
For example, given the list:
list = [[2,3],[3,4],[5]]
Person 2 is friends with person 3, person 3 is friends with
person 4, etc.
Write a function to find how many friends each person has.
Example 1:
Input: friends = [[1,3],[2,3],[3,5],[4]]
Output: [(1,1), (2,1), (3,3), (4,0), (5,1)]

Example 2:
Input: friends = [[1],[2],[3],[4]]
Output: [(1,0), (2,0), (3,0), (4,0)]
Explanation: every person has no friends on the friends list.
Solution 115: how-many-friends (algorithms)
def how_many_friends(friendships):
    counts = {}
    for friendship in friendships:
        for f in friendship:
            counts[f] = counts.get(f, set())
            counts[f] = counts[f].union(friendship)
    return [
        (f, len(r) - 1)
        for f, r in sorted(counts.items())
    ]
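For reference, running the function on Example 1 reproduces the expected output:

print(how_many_friends([[1, 3], [2, 3], [3, 5], [4]]))
# [(1, 1), (2, 1), (3, 3), (4, 0), (5, 1)]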
Question 116: lowest-paid (sql)
Difficulty: Medium | Tags: sql | Companies: Microsoft
Given tables employees, employee_projects, and projects, find the 3
lowest-paid employees that have completed at least 2 projects.
Note: incomplete projects will have an end date of NULL in the projects
table.
Example:
Input:

employees table:
  id             INTEGER
  first_name     VARCHAR
  last_name      VARCHAR
  salary         INTEGER
  department_id  INTEGER

employee_projects table:
  employee_id  INTEGER
  project_id   INTEGER

projects table:
  id          INTEGER
  title       VARCHAR
  start_date  DATE
  end_date    DATE
  budget      INTEGER

Output:
  employee_id         INTEGER
  salary              INTEGER
  completed_projects  INTEGER
Solution 116: lowest-paid (sql)
SELECT ep.employee_id
, e.salary
, COUNT(p.id) AS completed_projects
FROM employee_projects AS ep
JOIN employees AS e
ON e.id = ep.employee_id
JOIN projects AS p
ON ep.project_id = p.id
WHERE p.end_date IS NOT NULL
GROUP BY 1
HAVING completed_projects > 1
ORDER BY 2
LIMIT 3
Question 117: overfit-avoidance (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: Microsoft, Adobe
Let’s say that you’re training a classification model.
How would you combat overfitting when building tree-based
models?
Solution 117: overfit-avoidance (machine learning)
Overfitting refers to the condition where the model fits the training data very well but fails to generalize to unseen test data. A perfectly fit decision tree performs well on training data but poorly on unseen test data. A good model must not only fit the training data well but also accurately classify records it has never seen.
There are different techniques to avoid overfitting in decision trees:
• Pruning (pre-pruning and post-pruning)
• Ensemble methods
Pruning is a technique that reduces the size of decision trees by removing noncritical and redundant sections used to classify instances. Pruning reduces the complexity of the final classifier.
There are two types of pruning:
1. Pre-pruning: stop growing the tree early, before it perfectly classifies the training set. The hyperparameters of the decision tree, including maximum depth, minimum samples per leaf, and minimum samples per split, can be tuned to stop the tree's growth early and prevent the model from overfitting.
2. Post-pruning: allow the tree to grow fully and perfectly classify the training set, then prune it back by removing branches. In practice, post-pruning is more popular and successful because it is not easy to estimate precisely when to stop growing the tree.
Random forests: a random forest is an ensemble technique for classification and regression that bootstraps multiple decision trees. It uses bootstrap sampling and aggregation (bagging) to prevent overfitting.
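A brief scikit-learn sketch of the ideas above; the specific hyperparameter values are illustrative assumptions, not recommendations:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Pre-pruning: cap the tree's growth with hyperparameters.
pruned_tree = DecisionTreeClassifier(
    max_depth=5,           # "maximum depth"
    min_samples_leaf=20,   # "minimum sample leaf"
    min_samples_split=50,  # "minimum samples split"
)

# Post-pruning is also available via cost-complexity pruning (the ccp_alpha parameter).

# Ensemble method: a random forest bootstraps many trees and aggregates them.
forest = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)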
Question 118: approximate-ad-views (probability)
Difficulty: Easy | Tags: probability | Companies: Facebook
Let’s say you work for a social media website.
Users view 100 posts a day, and each post has a 10% chance
of being an ad.
What is the probability that a user views more than 10 ads
a day? How could you approximate this value using the
standard normal distribution’s cdf?
Solution 118: approximate-ad-views (probability)
The probability that a user sees k ads in 100 daily posts can be modeled by a binomial distribution B(n=100, p=0.1).
This implies the probability that a user sees more than 10 ads out of 100 viewed posts is 1 - CDF(k=10; n=100, p=0.1), where CDF is the binomial CDF.
Since np >= 10, we can approximate the binomial distribution with a normal distribution N(np, np(1-p)) = N(10, 9).
Since 10 is the mean of this normal distribution, we conclude that 1 - CDF(k=10; n=100, p=0.1) is approximately 1 - 0.5 = 0.5.
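A quick check of this approximation with scipy (just the computation described above):

from scipy.stats import binom, norm

n, p = 100, 0.1

# Exact tail: P(X > 10) under Binomial(n=100, p=0.1).
exact = 1 - binom.cdf(10, n, p)

# Normal approximation N(np, np(1-p)) = N(10, 9); without a continuity
# correction, the tail above the mean is exactly 0.5.
approx = 1 - norm.cdf(10, loc=n * p, scale=(n * p * (1 - p)) ** 0.5)

print(exact, approx)    # the approximation is rougher than the exact tail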
Question 119: secret-wins (probability)
Difficulty: Hard | Tags: probability | Companies: Google
There are 100 students that are playing a coin-tossing game.
The students are given a coin to toss. If a student tosses the
coin and it turns up heads, they win. If it comes up tails, then
they must flip again.
If the coin comes up heads the second time, the students will
lie and say they have won when they didn’t. If it comes up
tails then they will say they have lost.
If 30 students at the end say they won, how many students
do we expect actually won the game?
Solution 119: secret-wins (probability)
Let W denote the number of students that actually won. Since we know that 30 students said they won, we only need to consider those students. By the rules of the question, a student reports a win in one of two ways: they flipped heads on the first toss and genuinely won (probability 1/2), or they flipped tails and then heads and lied about winning (probability 1/4). So, conditional on a student saying they won, the probability that they actually won is (1/2) / (1/2 + 1/4) = 2/3.
The number of true winners among the 30 can therefore be modeled as a binomial distribution, W ~ B(30, 2/3). Using the expected value of the binomial, E[W] = 30 * 2/3 = 20.
Question 120: 85-vs-82 (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: Amazon
We have two models: one with 85% accuracy and one with 82%.
Which one do you pick?
Solution 120: 85-vs-82 (machine learning)
At first glance it seems like a trick question. 85% accuracy is
obviously higher than 82%, so there must be a reason why
we should dive into this question and understand what the
broader context behind it is.
What is the model being applied to? What is more important
to the business; a higher accuracy model or a higher
interpretable model? For example, it’s likely that a higher
accuracy model could be a black box and more difficult
for the business to interpret.
A first determination needed is figuring out the correct metric for the model. Accuracy, the fraction of predictions the model got correct, can be misleading. In binary classification terms, for example, if we care more about true positives than true negatives (or vice versa), the less accurate model could have a better value on the metric we actually care about. It makes sense in this case to balance precision and recall for the business use case.
For example, if we’re a doctor trying to estimate the number
of sick patients in a town. We have two models confusion
matrices:
10 | 10
---+--10 | 70
15 | 20
---+--5 | 60
We have an accuracy of 80% (10 + 70) in the first model and
75% (15 + 60) in the second model. It seems like in the first
model it has a better accuracy, yet the second model does a
better job of predicting when patients are sick while underpredicting patients that are healthy. Which model do we care
more about if we’re a doctor? It depends on the severity of
the disease and other factors on how much we care about
precision or recall.
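Reading the two matrices as (TP, FP, FN, TN) = (10, 10, 10, 70) and (15, 20, 5, 60) respectively (an assumption about their orientation), the trade-off can be computed directly:

def summarize(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

print(summarize(10, 10, 10, 70))    # model 1: higher accuracy
print(summarize(15, 20, 5, 60))     # model 2: higher recall on sick patients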
Lastly we should be looking at model scalability. We need
a model that can perform well in production because that’s
how it will be used in real life. This means that predictions
must be generated and scaled to the number of datapoints
the model will have to classify in real time. An example is
the model that won the Netflix prize. The model was not
the one that Netflix actually used, even though it was best
performing, because it wasn’t scalable to Netflix’s audience
size.
Question 121: fast-food-database (database design)
Difficulty: Medium | Tags: database design | Companies: Facebook
1. Design a database for a stand-alone fast food restaurant.
2. Based on the above database schema, write a SQL query
to find the top three highest revenue items sold yesterday.
3. Write a SQL query using the database schema to find the
percentage of customers that order drinks with their meal.
Solution 121: fast-food-database (database design)
users
user_id (pk) | created_date | user_type
1 | 2020-05-25 | "walkin"
orders
id | user_id | item_id | qty | created_date
1 | 1 | 234 | 1 | 2020-09-09
1 | 1 | 432 | 2 | 2020-09-09
items
id (pk) | description | price
234 | "chicken burger" | 10.28
432 | "bread sticks" | 3.5
-- top three highest revenue items sold yesterday
-- revenue = item_price * item_qty
select
    o.item_id, sum(o.qty * i.price) as item_rev
from orders o
inner join items i on i.id = o.item_id
where o.created_date = current_date() - interval 1 day
group by 1
order by 2 desc
limit 3;
-- % = customers who ordered drinks with their meals / total customers who ordered
-- isolate customer orders that include drinks AND meal(s)
-- first, pick all the items which are drinks
with drinks as (
    select id
    from items
    where description ilike '%drink%'
),
non_drinks as (
    select id
    from items
    where id not in (select id from drinks)
),
-- use CASE WHEN to flag users with drink and non-drink orders
user_agg as
(
    select
        user_id,
        max(case when t2.id is not null then 1 else 0 end) as drinks_flag,
        max(case when t3.id is not null then 1 else 0 end) as non_drinks_flag
    from orders t1
    left join drinks t2 on t1.item_id = t2.id
    left join non_drinks t3 on t1.item_id = t3.id
    group by 1
)
select
    100.0 * count(user_agg.user_id)
        / (select count(distinct user_id) from orders) as percentage_users_with_drink
from user_agg
where drinks_flag = 1
  and non_drinks_flag = 1
Question 122: fake-algorithm-reviews (probability)
Difficulty: Medium | Tags: probability | Companies: /
Let’s say we’re trying to determine fake reviews on our
products.
Based on past data, 98% of reviews are legitimate and 2% are fake. If a review is fake, there is a 95% chance that the
machine learning algorithm identifies it as fake. If a review is
legitimate, there is a 90% chance that the machine learning
algorithm identifies it as legitimate.
What is the percentage chance the review is actually fake
when the algorithm detects it as fake?
Question 123: all-tails-consecutive (probability)
Difficulty: Medium | Tags: probability | Companies: Google
Let’s say you flip a fair coin 10 times.
What is the probability that you only get three tails, but, all
the tails happen consecutively?
An example of this happening would be if the flips were
HHHHTTTHHH.
Bonus: What would be the probability of getting only t tails in n coin flips (t ≤ n), requiring that the tails all happen consecutively?
Question 124: top-3-users (sql)
Difficulty: Medium | Tags: sql | Companies: Google
Let’s say you work at a file-hosting website. You have information on users' daily downloads in the download_facts table.
Use the window function RANK to display the top three users by downloads each day. Order your data by date, and then by daily_rank.
Example:
Input: download_facts table
  user_id    INTEGER
  date       DATE
  downloads  INTEGER

Output:
  daily_rank  INTEGER
  user_id     INTEGER
  date        DATE
  downloads   INTEGER
Question 125: bernoulli-sample (algorithms)
Difficulty: Hard | Tags: algorithms | Companies: Uber, Google
Given a random Bernoulli trial generator, write a function to
return a value sampled from a normal distribution.
Example:
Input: def bernoulli_sample(p):
    """
    generate 100 outputs of a Bernoulli sample, given probability of 1 as p
    and 0 as 1 - p
    """
Output: 55
Question 126: whatsapp-metrics (business case)
Difficulty: Easy | Tags: business case | Companies: Amazon
What do you think are the most important metrics for
WhatsApp?
Question 127: amateur-performance (product metrics)
Difficulty: Hard | Tags: product metrics | Companies: Pinterest, Google
You are a data scientist at YouTube focused on creators.
A PM comes to you worried that amateur video creators used to be able to do well, but now it seems like only "superstars" do well.
What data points and metrics would you look at to decide if this is true or not?
Question 128: matching-siblings (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: /
Let’s say that you’re a data scientist working for Facebook.
A product manager has asked you to develop a method to
match users to their siblings on Facebook.
1. How would you evaluate a method or algorithm to match
users with their siblings?
2. What metrics might you use?
Question 129: employees-before-managers (sql)
Difficulty: Medium | Tags: sql | Companies: Amazon
You’re given two tables: employees and managers. Find the
names of all employees who joined before their manager.
Example:
Input:

employees table:
  id          INTEGER
  first_name  VARCHAR
  last_name   VARCHAR
  manager_id  INTEGER
  join_date   DATETIME

managers table:
  id         INTEGER
  name       VARCHAR
  join_date  DATETIME

Output:
  employee_name  VARCHAR
Question 130: monotonic-function (statistics)
Difficulty: Medium | Tags: statistics | Companies: Google
What does it mean for a function to be monotonic?
Why is it important that a transformation applied to a metric
is monotonic?
Question 131: transactions-in-the-last-5-days (sql)
Difficulty: Medium | Tags: sql | Companies: Amazon
Let’s say you work at a bank.
Using the bank_transactions table, find how many users made
at least one transaction each day in the first five days of
January 2020.
bank_transactions table:
  user_id            INTEGER
  created_at         DATETIME
  transaction_value  FLOAT
  id                 INTEGER

Output:
  number_of_users  INTEGER
Question 132: job-recommendation (machine learning)
Difficulty: Hard | Tags: machine learning | Companies: Google, Twitter
Let’s say that you’re working on a job recommendation
engine. You have access to all user Linkedin profiles, a list of
jobs each user applied to, and answers to questions that the
user filled in about their job search.
Using this information, how would you build a job
recommendation feed?
Question 133: target-indices (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: Amazon
Given an array and a target integer, write a function sum_pair_indices that returns the indices of two integers in the array that add up to the target integer. If not found, just return an empty list.
Note: Can you do it in O(n) time?
Note: Even though there could be many solutions, only one needs to be returned.
Example 1 :
Input: array = [1 2 3 4]
target = 5
Output: def sum_pair_indices(array, target) -> [0 3] or [1 2]
Example 2 :
Input: array = [3]
target = 6
Output: Do NOT return [0 0] as you can't use an
index twice.
Question 134: extra-delivery-pay (business case)
Difficulty: Medium | Tags: business case | Companies: /
Let’s say you work at a food delivery company.
How would you measure the effectiveness of giving extra
pay to delivery drivers during peak hours to meet the demand
from consumers?
Question 135: inactive-users (business case)
Difficulty: Medium | Tags: business case | Companies: Google
Let’s say one million Netflix users have not logged in to Netflix
in the past 6 months.
How would you determine the cause? And what would you do
with these users?
Question 136: matrix-analysis (python)
Difficulty: Medium | Tags: python | Companies: Google
Let’s say we have a five-by-five matrix num_employees where each
row is a company and each column represents a department. Each
cell of the matrix displays the number of employees working in that
particular department at each company.
Write a function find_percentages to return a five-by-five matrix that contains the proportion of employees in each department relative to the total number of employees at each company.
Example:
Input: import numpy as np
#Input:
num_employees = np.array( [[10, 20, 30, 30, 10], [15, 15, 5, 10, 5],
[150, 50, 100, 150, 50], [300, 200, 300, 100, 100], [1, 5, 1, 1, 2]] )
Output: def find_percentages(num_employees) ->
#Output:
percentage_by_department = [[0.1, 0.2, 0.3, 0.3, 0.1], [0.3, 0.3,
0.1, 0.2, 0.1], [0.3, 0.1, 0.2, 0.3, 0.1], [0.3, 0.2, 0.3, 0.1, 0.1], [0.1,
0.5, 0.1, 0.1, 0.2]]
Question 137: minimum-absolute-distance (algorithms)
Difficulty: Easy | Tags: algorithms | Companies: McKinsey, Apple
Given an array of integers, write a function min_distance
to calculate the minimum absolute distance between two
elements then return all pairs having that absolute difference.
Note: Make sure to print the pairs in ascending order.
Example:
Input: v = [3, 12, 126, 44, 52, 57, 144, 61, 68, 72, 122]
Output: def min_distance(V) ->
min = 4
[(57, 61), (68, 72), (122, 126)]
Question 138: count-transactions (sql)
Difficulty: Easy | Tags: sql | Companies: Amazon
Let’s say you work at Amazon. With the annual_payments table below, answer the following four questions via SQL queries and output them as a table with the answers to each question.
1. How many total transactions are in this table?
2. How many different users made transactions?
3. How many transactions listed as "paid" have an amount greater or
equal to 100?
4. Which product made the highest revenue? (use only transactions
with a "paid" status)
Example:
Input: annual_payments table
  id               INTEGER
  amount           FLOAT
  created_at       DATETIME
  status           VARCHAR
  user_id          INTEGER
  amount_refunded  FLOAT
  product          VARCHAR
  last_updated     DATETIME

Output:
  question_id  INTEGER
  answer       FLOAT
Question 139: one-element-removed (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: Facebook
There are two lists, list X and list Y. Both lists contain integers
from -1000 to 1000 and are identical to each other except that
one integer is removed in list Y that exists in list X.
Write a function one_element_removed that takes in both lists and returns the integer that was removed, in O(1) space and O(n) time, without using the python set function.
Example:
Input: list_x = [1,2,3,4,5]
list_y = [1,2,4,5]
one_element_removed(list_x, list_y) -> 3
Question 140: fake-news-stories (business case)
Difficulty: Medium | Tags: business case | Companies: Facebook
Mark Zuckerberg calls you at 7pm and says he needs to know exactly what percentage of Facebook stories are fake news by tomorrow at 7pm.
How would you measure this given the time constraint?
Question 141: instagram-tv-success (product metrics)
Difficulty: Hard | Tags: product metrics | Companies: Google, Facebook
Let’s say you’re a Product Data Scientist at Instagram. How
would you measure the success of the Instagram TV product?
Question 142: approval-drop (statistics)
Difficulty: Medium | Tags: statistics | Companies: Intuit, edX, Microsoft
Our overall capital approval rate has gone down. Let’s say last week it was 85% and this week the approval rate went down to 82%, which is a statistically significant reduction.
A first analysis shows that approval rates stayed flat or increased over time when looking at the individual products:
• Product 1: 84% to 85% week over week
• Product 2: 77% to 77% week over week
• Product 3: 81% to 82% week over week
• Product 4: 88% to 88% week over week
What could be the cause of the decrease?
Question 143: variate-anomalies (statistics)
Difficulty: Easy | Tags: statistics | Companies: /
If given a univariate dataset, how would you design a
function to detect anomalies?
What if the data is bivariate?
Question 144: optimal-host (algorithms)
Difficulty: Hard | Tags: algorithms | Companies: LinkedIn, Facebook, Pluralsight, Zillow
Let’s say we have a group of N friends represented by a list of dictionaries, where each entry holds a friend's name and their location on a three-dimensional scale (x, y, z). The friends want to host a party, but they want the friend with the optimal location (least total distance for the group to travel) to host it.
Write a function optimal_host to return the friend that should host the party.
Example:
Input: friends = [
    {'name': 'Bob', 'location': (5, 2, 10)},
    {'name': 'David', 'location': (2, 3, 5)},
    {'name': 'Mary', 'location': (19, 3, 4)},
    {'name': 'Skyler', 'location': (3, 5, 1)},
]
def optimal_host(friends) -> 'David'