Question 1: employee-salaries (SQL)
Difficulty: Medium | Tags: SQL | Companies: Cortland, Con, MasterClass, Uber, Amazon, Fractal, PepsiCo, Think, Microsoft

Given an employees and a departments table, select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.

Example:

Input:

employees table
    Column           Type
    id               INTEGER
    first_name       VARCHAR
    last_name        VARCHAR
    salary           INTEGER
    department_id    INTEGER

departments table
    Column    Type
    id        INTEGER
    name      VARCHAR

Output:
    Column                  Type
    percentage_over_100K    FLOAT
    department_name         VARCHAR
    number of employees     INTEGER

Solution 1: employee-salaries (SQL)

SELECT
    d.name,
    SUM(CASE WHEN e.salary > 100000 THEN 1 ELSE 0 END) / COUNT(DISTINCT e.id) AS pct_above_100k,
    COUNT(DISTINCT e.id) AS c
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY 1
HAVING COUNT(DISTINCT e.id) >= 10
ORDER BY 2 DESC
LIMIT 3

Question 2: first-to-six (probability)
Difficulty: Medium | Tags: probability | Companies: Microsoft, Zenefits

Amy and Brad take turns rolling a fair six-sided die. Whoever rolls a "6" first wins the game. Amy starts by rolling first. What's the probability that Amy wins?

Solution 2: first-to-six (probability)
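One way to set up the calculation (a short sketch, assuming a fair die and strictly alternating turns starting with Amy):

P(Amy wins) = 1/6 + (5/6)(5/6) * P(Amy wins)

Amy either wins immediately with probability 1/6, or both players miss (probability 25/36) and the game restarts with Amy rolling again. Solving:

P(Amy wins) = (1/6) / (1 - 25/36) = 6/11 ≈ 0.545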
Question 3: found-item (probability)
Difficulty: Easy | Tags: probability

Amazon has a warehouse system where items on the website are located at different distribution centers across a city. Let's say that in one example city, the probability that a specific item X is available at warehouse A or warehouse B is 0.6 and 0.8 respectively. Given that you're a customer in this example city and items only appear on the website if they exist in the distribution centers, what is the probability that item X would be found on Amazon's website?

Solution 3: found-item (probability)
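A short sketch, assuming availability at the two warehouses is independent:

P(found on the website) = 1 - P(not at A) * P(not at B) = 1 - (0.4)(0.2) = 0.92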
Question 4: first-touch-attribution (SQL)
Difficulty: Hard | Tags: SQL | Companies: NerdWallet, Google

The schema below is for a retail online shopping company and consists of two tables, attribution and user_sessions.
- The attribution table logs a session visit in each row.
- If conversion is true, then the user converted to buying on that session.
- The channel column represents which advertising platform the user was attributed to for that specific session.
- Lastly, the user_sessions table maps session visits back to one user (many-to-one).

First touch attribution is defined as the channel with which the converted user was associated when they first discovered the website. Calculate the first touch attribution for each user_id that converted.

Example:

Input:

attribution table
    Column        Type
    session_id    INTEGER
    channel       VARCHAR
    conversion    BOOLEAN

user_sessions table
    Column        Type
    session_id    INTEGER
    created_at    DATETIME
    user_id       INTEGER

Output:
    user_id    channel
    123        facebook
    145        google
    153        facebook
    172        organic
    173        email

Solution 4: first-touch-attribution (SQL)

WITH sessions AS (
    SELECT
        u.user_id,
        a.channel,
        ROW_NUMBER() OVER (
            PARTITION BY u.user_id
            ORDER BY u.created_at ASC
        ) AS session_num,
        SUM(a.conversion) OVER (
            PARTITION BY u.user_id
        ) > 0 AS converted
    FROM user_sessions AS u
    INNER JOIN attribution AS a
        ON u.session_id = a.session_id
)
SELECT user_id, channel
FROM sessions
WHERE session_num = 1
    AND converted

Question 5: post-success (SQL)
Difficulty: Medium | Tags: SQL

Consider the events table, which contains information about the phases of writing a new social media post. The action column can have the values post_enter, post_submit, or post_cancel for when a user starts to write a post (post_enter), ends up canceling it (post_cancel), or posts it (post_submit).

Write a query to get the post success rate for each day in the month of January 2020. You can assume that a single user may only make one post per day.

Example:

Input:

events table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    action        VARCHAR
    url           VARCHAR
    platform      VARCHAR

Sample:
    user_id    created_at    action
    123        2019-01-01    post_enter
    123        2019-01-01    post_submit
    456        2019-01-02    post_enter
    456        2019-01-02    post_cancel

Output:
    Column               Type
    dt                   DATETIME
    post_success_rate    FLOAT

Solution 5: post-success (SQL)

Let's clearly define the metric we want to calculate before jumping into the problem. We want the post success rate for each day. Let's assume post success rate can be defined as:

(total posts created) / (total posts entered)

Additionally, since the success rate must be broken down by day, we must make sure that a post that is entered is completed on the same day.

Now that we have these requirements, it's time to calculate our metrics. We know we have to GROUP BY the date to get each day's posting success rate. We also have to break down how we can compute our two metrics: total posts entered and total posts actually created.

Let's look at the first one. Total posts entered can be calculated with a simple query that filters for rows where the action equals 'post_enter'.

SELECT COUNT(user_id)
FROM events
WHERE action = 'post_enter'

Now we have to get all of the users that also successfully created the post on the same day. We can do this with a join and the correct conditions. The conditions are:
- same user
- successfully posted
- same day

We can get those by doing a LEFT JOIN to the same table and adding in those conditions. Remember that we have to do a LEFT JOIN in this case because we want to use the join as a filter for where the conditions have been successfully met.

SELECT *
FROM events AS c1
LEFT JOIN events AS c2
    ON c1.user_id = c2.user_id
    AND c2.action = 'post_submit'
    AND DATE(c1.created_at) = DATE(c2.created_at)
WHERE c1.action = 'post_enter'
    AND MONTH(c1.created_at) = 1
    AND YEAR(c1.created_at) = 2020

However, this query runs into an issue: if we join on all of our conditions and a user posted multiple times in the same day, we'll be dealing with a multiplying join that squares the actual number of posts.

To simplify it, all we need to do is take the count of all of the joined submit events divided by the count of all of the enter events.

SELECT
    DATE(c1.created_at) AS dt,
    COUNT(c2.user_id) / COUNT(c1.user_id) AS post_success_rate
FROM events AS c1
LEFT JOIN events AS c2
    ON c1.user_id = c2.user_id
    AND c2.action = 'post_submit'
    AND DATE(c1.created_at) = DATE(c2.created_at)
WHERE c1.action = 'post_enter'
    AND MONTH(c1.created_at) = 1
    AND YEAR(c1.created_at) = 2020
GROUP BY 1

Question 6: distribution-of-2x---y (statistics)
Difficulty: Medium | Tags: statistics | Companies: Google

Given that X and Y are independent random variables with normal distributions, what are the mean and variance of the distribution of 2X - Y when the corresponding distributions are X ~ N(3, 4) and Y ~ N(1, 4)?

Solution 6: distribution-of-2x---y (statistics)
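A worked sketch, assuming the N(mean, variance) convention (so Var(X) = Var(Y) = 4) and independence of X and Y:

E[2X - Y] = 2*E[X] - E[Y] = 2*3 - 1 = 5
Var(2X - Y) = 2^2 * Var(X) + (-1)^2 * Var(Y) = 4*4 + 4 = 20

So 2X - Y ~ N(5, 20).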
Question 7: upsell-transactions (SQL)
Difficulty: Medium | Tags: SQL | Companies: Instacart, Apple, Coinbase

We're given a table of product purchases. Each row in the table represents an individual user product purchase. Write a query to get the number of customers that were upsold by purchasing additional products.

Note: if the customer purchased two things on the same day, that does not count as an upsell, as they were purchased within a similar timeframe.

Example:

Input:

transactions table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    product_id    INTEGER
    quantity      INTEGER

Output:
    Column                     Type
    num_of_upsold_customers    INTEGER

Solution 7: upsell-transactions (SQL)

Assumptions:
- an "upsell" is purchasing an additional product after purchasing a first product
- the additional product(s) must be purchased on a later date (i.e. not the same day as the first product)
- the additional "upsell" product(s) can be the same type of product (product_id) as the first product

SELECT COUNT(DISTINCT t1.user_id) AS num_of_upsold_customers
FROM transactions t1
INNER JOIN transactions t2
    ON t1.user_id = t2.user_id
    AND DATE(t1.created_at) < DATE(t2.created_at)

Question 8: seven-day-streak (SQL)
Difficulty: Medium | Tags: SQL | Companies: Twilio, Amazon

Given a table with event logs, find the percentage of users that had at least one seven-day streak of visiting the same URL.

Note: round the result to 2 decimal places. For example, if the result is 35%, return 0.35.

Example:

Input:

events table
    Column        Type
    user_id       INTEGER
    created_at    DATETIME
    url           VARCHAR

Output:
    Column    Type
    output    FLOAT

Solution 8: seven-day-streak (SQL)

WITH cte_1 AS (
    SELECT user_id, DATE(created_at) AS login_date, url
    FROM events
),
cte_2 AS (
    SELECT user_id, login_date, url
    FROM cte_1
    GROUP BY user_id, login_date, url
),
cte_3 AS (
    SELECT *,
        DATE_ADD(login_date, INTERVAL -ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY login_date) DAY) AS interval_group,
        DENSE_RANK() OVER (ORDER BY user_id) dr
    FROM cte_2
),
cte_4 AS (
    SELECT user_id, login_date, url, interval_group, MAX(dr) OVER () total_users
    FROM cte_3
),
cte_5 AS (
    SELECT COUNT(*) streak, MIN(login_date) AS cnt, user_id, total_users
    FROM cte_4
    GROUP BY interval_group, user_id, url, total_users
    HAVING COUNT(*) >= 7
),
cte_6 AS (
    SELECT COUNT(DISTINCT user_id) AS stre, total_users
    FROM cte_5
    GROUP BY user_id, total_users
)
SELECT IF((SELECT COUNT(*) FROM cte_6) > 0,
          (SELECT CAST(stre / total_users AS DECIMAL(3,2)) FROM cte_6),
          CAST(0.00 AS DECIMAL(3,2))) AS percent_of_users

Question 9: cumulative-reset (SQL)
Difficulty: Hard | Tags: SQL | Companies: Amazon

Given a users table, write a query to get the cumulative number of new users added by day, with the total reset every month.

Example:

Input:

users table
    Column        Type
    id            INTEGER
    name          VARCHAR
    created_at    DATETIME

Output:
    DATE          INTEGER
    2020-01-01    5
    2020-01-02    12
    …             …
    2020-02-01    8
    2020-02-02    17
    2020-02-03    23

Solution 9: cumulative-reset (SQL)

This question at first seems like it could be solved by just running a COUNT(*) and grouping by date. Or maybe it's just a regular cumulative distribution function? But notice that we are actually grouping by a specific interval of month and date, and that when the next month comes around, we want to reset the count of the number of users.

As a tangential aside, the practical benefit of a query like this is that we can build a retention graph that compares the cumulative number of users from one month to another. If we have a goal to acquire 10% more users each month, how do we know whether we're on track for this goal on February 15th without having the same number to compare it to for January 15th?

So how can we make sure that the total number of users on January 31st rolls back to 0 on February 1st?

Let's first solve the issue of getting the total count of users. We know that we'll need the number of users that sign up each day. This can be written pretty easily.

WITH daily_total AS (
    SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
    FROM users
    GROUP BY 1
)

If we model out that computation, we'll find that the cumulative total is taken from the sum of all of the frequency counts lower than the specified frequency index. We can then run a self join on a condition where we set the left table's frequency index as greater than the right table's frequency index.

Okay, so we know we have to specify a self join in the same way, where we get the cumulative value by comparing each date against the others. The only difference here is that we add an additional condition to the join: the month and year have to be the same. That way we apply a filter to the same month and year and limit the cumulative total.
FROM daily_total AS t
LEFT JOIN daily_total AS u
    ON t.dt >= u.dt
    AND MONTH(t.dt) = MONTH(u.dt)
    AND YEAR(t.dt) = YEAR(u.dt)

Therefore, if we bring it all together:

WITH daily_total AS (
    SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
    FROM users
    GROUP BY 1
)
SELECT t.dt AS date, SUM(u.cnt) AS cumulative_users
FROM daily_total AS t
LEFT JOIN daily_total AS u
    ON t.dt >= u.dt
    AND MONTH(t.dt) = MONTH(u.dt)
    AND YEAR(t.dt) = YEAR(u.dt)
GROUP BY 1
ORDER BY 1

Question 10: last-transaction (SQL)
Difficulty: Easy | Tags: SQL

Given a table of bank transactions with columns id, transaction_value, and created_at representing the date and time for each transaction, write a query to get the last transaction for each day. The output should include the id of the transaction, the datetime of the transaction, and the transaction amount. Order the transactions by datetime.

Example:

Input:

bank_transactions table
    Column               Type
    id                   INTEGER
    created_at           DATETIME
    transaction_value    FLOAT

Output:
    Column               Type
    created_at           DATETIME
    transaction_value    FLOAT
    id                   INTEGER

Solution 10: last-transaction (SQL)

WITH last_moment AS (
    SELECT DATE(created_at) AS day, MAX(created_at) AS created_at
    FROM bank_transactions
    GROUP BY 1
)
SELECT created_at, id, transaction_value
FROM last_moment
LEFT JOIN bank_transactions USING (created_at)
ORDER BY created_at

Question 11: ad-raters (probability)
Difficulty: Easy | Tags: probability

Let's say we use people to rate ads. There are two types of raters, and raters are random and independent from our point of view:
- 80% of raters are careful: they rate an ad as good (60% chance) or bad (40% chance).
- 20% of raters are lazy: they rate every ad as good (100% chance).

1. Suppose we have 100 raters each rating one ad independently. What's the expected number of good ads?
2. Now suppose we have 1 rater rating 100 ads. What's the expected number of good ads?
3. Suppose we have 1 ad, rated as bad. What's the probability the rater was lazy?

Solution 11: ad-raters (probability)

Question 12: compute-deviation (Python)
Difficulty: Medium | Tags: Python | Companies: Tinder, Optiver, Amazon

Write a function compute_deviation that takes in a list of dictionaries, each with a key and a list of integers, and returns a dictionary with the standard deviation of each list.

Note: this should be done without using the NumPy built-in functions.

Example:

Input:

input = [
    {
        'key': 'list1',
        'values': [4, 5, 2, 3, 4, 5, 2, 3],
    },
    {
        'key': 'list2',
        'values': [1, 1, 34, 12, 40, 3, 9, 7],
    }
]

Output:

output = {'list1': 1.12, 'list2': 14.19}

Solution 12: compute-deviation (Python)

With NumPy:

import numpy as np

{i['key']: round(np.std(i['values']), 2) for i in input}

Without NumPy:

res = {}
for i in input:
    avg = sum(i['values']) / len(i['values'])
    squares = [(j - avg)**2 for j in i['values']]
    res[i['key']] = round((sum(squares) / len(i['values']))**(1/2), 2)
print(res)

Question 13: is-it-raining-in-seattle (probability)
Difficulty: Medium | Tags: probability | Companies: Microsoft, Accenture, Facebook

You are about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you "yes", it is raining. What is the probability that it's actually raining in Seattle?

Solution 13: is-it-raining-in-seattle (probability)
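A hedged sketch using Bayes' rule, with an assumed prior p = P(raining) (the question does not specify one):

P(rain | 3 yes) = p * (2/3)^3 / [ p * (2/3)^3 + (1 - p) * (1/3)^3 ] = 8p / (8p + (1 - p))

With an uninformative prior of p = 1/2, this gives 8/9 ≈ 0.89.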
Question 14: find-bigrams (Python)
Difficulty: Easy | Tags: Python | Companies: Indeed, Microsoft

Write a function called find_bigrams that takes a sentence or paragraph of strings and returns a list of all of its bigrams.

Example:

Input:

sentence = """
Have free hours and love children?
Drive kids to school, soccer practice
and other activities.
"""

Output:

def find_bigrams(sentence) ->

[('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children?'),
 ('children?', 'drive'),
 ('drive', 'kids'),
 ('kids', 'to'),
 ('to', 'school,'),
 ('school,', 'soccer'),
 ('soccer', 'practice'),
 ('practice', 'and'),
 ('and', 'other'),
 ('other', 'activities.')]

Solution 14: find-bigrams (Python)

def find_bigrams(sentence):
    # Lowercase and split on whitespace so the output matches the example.
    words = sentence.lower().split()
    result = []
    for i, item in enumerate(words):
        if i < len(words) - 1:
            result.append((words[i], words[i + 1]))
    return result

Question 15: experiment-validity (A/B testing)
Difficulty: Medium | Tags: A/B testing | Companies: Facebook, Google, Metromile, Uber, Grammarly, Airbnb

Let's say that your company is running a standard control and variant A/B test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a .04 p-value. How would you assess the validity of the result?

Solution 15: experiment-validity (A/B testing)

This looks to be statistically significant, but I'd also double-check a few more things before drawing a conclusion:

1) How long have we been running this experiment? How many times have we run the analysis? If we've run the experiment for 4 weeks and run significance tests 4 times, the likelihood of a false positive increases significantly. Or, if we've run the test for only one day, we should wait to see if the results hold.

2) Is the experiment properly randomized? Do the distributions across treatment and control match up? Are the standard deviations roughly on the same trajectory?

3) What is the point estimate? Is this a negative or positive hit to the conversion rate? Is it in line with what we're expecting?

Question 16: reducing-error-margin (statistics)
Difficulty: Medium | Tags: statistics | Companies: Apple, Walmart

Let's say we have a sample size of n. The margin of error for our sample size is 3. How many more samples would we need to decrease the margin of error to 0.3?

Solution 16: reducing-error-margin (statistics)
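A sketch of the scaling argument, assuming the usual margin-of-error form z * s / sqrt(n) with z and s held fixed:

margin of error ∝ 1 / sqrt(n)

Reducing the margin from 3 to 0.3 shrinks it by a factor of 10, so the sample size must grow by a factor of 10^2 = 100. That means we need 100n total samples, i.e. 99n additional samples.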
Question 17: liked-pages (SQL)
Difficulty: Medium | Tags: SQL | Companies: Snapchat, Facebook, Snap

Let's say we want to build a naive recommender. We're given two tables: one table called friends, with user_id and friend_id columns representing each user's friends, and another table called page_likes, with a user_id and a page_id representing the page each user liked.

Write an SQL query to create a metric to recommend pages for each user based on their friends' liked pages.

Note: it shouldn't recommend pages that the user already likes.

Example:

Input:

friends table
    Column       Type
    user_id      INTEGER
    friend_id    INTEGER

page_likes table
    Column     Type
    user_id    INTEGER
    page_id    INTEGER

Output:
    Column              Type
    user_id             INTEGER
    page_id             INTEGER
    num_friend_likes    INTEGER

Solution 17: liked-pages (SQL)

Let's solve this problem by visualizing what kind of output we want from the query. Given that we have to create a metric for each user to recommend pages, we know we want something with a user_id and a page_id along with some sort of recommendation score.

Let's try to think of an easy way to represent the scores of each user_id and page_id combo. One naive method would be to create a score by summing up the total likes by friends on each page that the user hasn't currently liked. The max value of our metric will then be the most recommendable page.

The first thing we have to do is write a query to associate users with their friends' liked pages. We can do that easily with an initial join between the two tables.

WITH t1 AS (
    SELECT
        f.user_id,
        f.friend_id,
        pl.page_id
    FROM friends AS f
    INNER JOIN page_likes AS pl
        ON f.friend_id = pl.user_id
)

Now we have every single user_id associated with their friends' liked pages. Can't we just do a GROUP BY on the user_id and page_id fields and get the DISTINCT COUNT of the friend_id field? Not exactly. We still have to filter out all of the pages that the original users already liked.

We can do that by joining the original page_likes table back to the CTE. We can filter out all the pages that the original users liked by doing a LEFT JOIN on page_likes and then selecting all the rows where the JOIN on user_id and page_id is NULL.

SELECT
    t1.user_id,
    t1.page_id,
    COUNT(DISTINCT t1.friend_id) AS num_friend_likes
FROM t1
LEFT JOIN page_likes AS pl
    ON t1.page_id = pl.page_id
    AND t1.user_id = pl.user_id
WHERE pl.user_id IS NULL  # filter out existing user likes
GROUP BY 1, 2

In this case, we only need to check one column: where pl.user_id IS NULL. Once we GROUP BY the user_id and page_id, we can count the distinct number of friends, which gives us the distinct number of likes on each page by friends, creating our metric.

Question 18: customer-orders (SQL)
Difficulty: Medium | Tags: SQL

Write a query to identify customers who placed more than three transactions each in both 2019 and 2020.

Example:

Input:

transactions table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    product_id    INTEGER
    quantity      INTEGER

users table
    Column    Type
    id        INTEGER
    name      VARCHAR

Output:
    Column           Type
    customer_name    VARCHAR

Solution 18: customer-orders (SQL)

This question gives us two tables and asks us to find the names of customers who placed more than three transactions in both 2019 and 2020. Note that the phrasing of the question implies this logical expression:

customer transactions > 3 in 2019 AND customer transactions > 3 in 2020

Our first query will join the transactions table to the users table so that we can easily reference both the user's name and their orders together. We can join our tables on the id field of the users table and the user_id field of the transactions table:

FROM transactions t
JOIN users u ON u.id = t.user_id
Next, we can work on the shape of our SELECT statement for our CTE. The first two fields we want to include are pretty simple: the user's id and name. You might think that we could pull only the name field here, but the query could fall apart if two users have the same name. Instead, we're going to select both and organize our query according to the users.id field (which we know has no duplicates).

Next, we're going to write some CASE WHEN statements and combine them with SQL's SUM function to count the number of transactions that each of our users made in 2019 and 2020.

SUM(CASE WHEN YEAR(t.created_at) = 2019 THEN 1 ELSE 0 END) AS t_2019,
SUM(CASE WHEN YEAR(t.created_at) = 2020 THEN 1 ELSE 0 END) AS t_2020

Notice that we have to make sure both years are accounted for, with more than three transactions each. In the code above, each instance where the YEAR of a user's transaction is 2019 (in the first line) or 2020 (in the second) is assigned a value of 1. All other years are assigned a value of 0. If we SUM these CASE WHEN statements, we get the count of transactions made by a given user in both 2019 and 2020.
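One way to assemble these pieces into a full query (a sketch; the CTE name yearly_counts and the final filter placement are choices of this write-up, not given in the original explanation):

WITH yearly_counts AS (
    SELECT
        u.id,
        u.name,
        SUM(CASE WHEN YEAR(t.created_at) = 2019 THEN 1 ELSE 0 END) AS t_2019,
        SUM(CASE WHEN YEAR(t.created_at) = 2020 THEN 1 ELSE 0 END) AS t_2020
    FROM transactions t
    JOIN users u ON u.id = t.user_id
    GROUP BY u.id, u.name
)
SELECT name AS customer_name
FROM yearly_counts
WHERE t_2019 > 3
    AND t_2020 > 3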
Question 19: encoding-categorical-features (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: Sentio, Uber, Amazon, AES, Accenture, Visa

Let's say you have a categorical variable with thousands of distinct values. How would you encode it?

Solution 19: encoding-categorical-features (machine learning)

This depends on whether the problem is a regression or a classification model.

If it's a regression model, one way would be to cluster the categories based on the response by working backwards. You could sort them by the response variable and then split the categorical values into buckets based on the grouping of the response variable. This could be done with a shallow decision tree to reduce the number of categories.

Another way, given a regression model, would be to target encode them: replace each category in the variable with the mean response given that category. Now you have one continuous feature instead of a bunch of categories.

For binary classification, you can target encode the column by finding the conditional probability of the response variable being a one, given that the categorical column takes a particular value, and then replace the categorical column with this numerical value. For example, if you have a categorical column of city when predicting loan defaults, and the probability that a person who lives in San Francisco defaults is 0.4, you would replace "San Francisco" with 0.4.

Additionally, if working with a classification model, you could try grouping the categories by frequency. The most frequent categories may dominate the total make-up, and the least frequent may form a long tail with a few samples each. By looking at the frequency distribution of the categories, you could find the drop-off point where you leave the top X categories alone and bucket the rest into an "other" category, giving you X + 1 categories. If you want to be more precise, take the categories that make up the 90th percentile of the cumulative frequency and dump the rest into the "other" bucket.

Lastly, we could also try using the Louvain community detection algorithm. Louvain is a method to extract communities from large networks without setting a predetermined number of clusters, unlike k-means.

Question 20: fair-coin (probability)
Difficulty: Easy | Tags: probability

Say you flip a coin 10 times. It comes up tails 8 times and heads twice. Is this a fair coin?

Solution 20: fair-coin (probability)
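One way to frame it (a sketch, assuming a binomial model under the null hypothesis that the coin is fair):

P(8 or more tails out of 10 | fair) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = (45 + 10 + 1) / 1024 ≈ 0.055

The one-sided p-value is about 0.055 (roughly 0.11 two-sided), so at the conventional 0.05 level we do not have enough evidence to reject the hypothesis that the coin is fair, although the sample is small.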
Question 21: n-die (probability)
Difficulty: Easy | Tags: probability | Companies: Oneida, Facebook

Let's say you're playing a dice game and you have 2 dice.
1. What's the probability of rolling at least one 3?
2. What's the probability of rolling at least one 3 given N dice?

Solution 21: n-die (probability)

Question 22: repeat-job-postings (SQL)
Difficulty: Medium | Tags: SQL

Given a table of job postings, write a query to retrieve the number of users that have posted each of their jobs only once and the number of users that have posted at least one job multiple times.

Each user has at least one job posting, so the sum of single_post and multiple_posts should equal the total number of distinct user_ids.

Example:

Input:

job_postings table
    Column         Type
    id             INTEGER
    job_id         INTEGER
    user_id        INTEGER
    date_posted    DATETIME

Output:
    Column            Type
    single_post       INTEGER
    multiple_posts    INTEGER

Solution 22: repeat-job-postings (SQL)

We want the value of two different metrics: the number of users that have posted their jobs only once and the number of users that have posted at least one job multiple times. What does that mean exactly? If a user has 5 jobs but posted each only once, then they are part of single_post. But if they have 5 jobs and posted a total of 7 times, then at least one job must have multiple postings. In general, if a user's total number of postings exceeds that user's total number of distinct jobs, the pigeonhole principle tells us at least one job must have been posted multiple times.

We first write a subquery to get an organized version of job_postings and name it user_job. We want a count of total job postings per user and job. Since each job posting has a unique id, we write our subquery to count posting ids and distinct job ids per user.

We use COUNT(DISTINCT job_id) to get a unique row for each job and COUNT on id, as all ids are already unique. We then GROUP BY user_id so we can compare the number of distinct jobs per user, denoted n_jobs, with the number of total posts per user, denoted n_posts.

WITH user_job AS (
    SELECT
        user_id,
        COUNT(DISTINCT job_id) AS n_jobs,
        COUNT(DISTINCT id) AS n_posts
    FROM job_postings
    GROUP BY 1
)

Finally, we can simply write our main query to identify when n_posts exceeds n_jobs for each user. We then count these users toward multiple_posts.

Note that n_posts is always greater than or equal to n_jobs, since each job gets posted at least once. Thus checking whether the two are not equal is the same as checking whether n_posts exceeds n_jobs. We use CASE WHEN to count toward our total multiple_posts whenever n_jobs is not equal to n_posts. If n_jobs = n_posts, we count that user toward single_post.

Our final query is as follows:

WITH user_job AS (
    SELECT
        user_id,
        COUNT(DISTINCT job_id) AS n_jobs,
        COUNT(DISTINCT id) AS n_posts
    FROM job_postings
    GROUP BY 1
)
SELECT
    SUM(CASE WHEN n_jobs = n_posts THEN 1 ELSE 0 END) AS single_post,
    SUM(CASE WHEN n_jobs != n_posts THEN 1 ELSE 0 END) AS multiple_posts
FROM user_job

Question 23: recurring-character (Python)
Difficulty: Easy | Tags: Python | Companies: HealthTap, HEB, Facebook

Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character. Treat upper- and lower-case letters as distinct characters. You may assume the input string includes no spaces.

Example 1:
Input: input = "interviewquery"
Output: output = "i"

Example 2:
Input: input = "interv"
Output: output = None

Solution 23: recurring-character (Python)

We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice. Given that we have to return the first recurring character, we should be able to go through the string in one loop, save each unique character, and then simply check whether the current character already exists in that saved set. If it does, return the character.

def recurring_char(input):
    seen = set()
    for char in input:
        if char in seen:
            return char
        seen.add(char)
    return None

Question 24: average-order-value (SQL)
Difficulty: Easy | Tags: SQL | Companies: Klaviyo, Facebook, Target

Given three tables representing customer transactions and customer attributes, write a query to get the average order value by gender.

Note: we're looking at the average order value for users that have ever placed an order. Additionally, please round your answer to two decimal places.

Example:

Input:

transactions table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    product_id    INTEGER
    quantity      INTEGER

users table
    Column    Type
    id        INTEGER
    name      VARCHAR
    sex       VARCHAR

products table
    Column    Type
    id        INTEGER
    name      VARCHAR
    price     FLOAT

Output:
    Column    Type
    sex       VARCHAR
    aov       FLOAT

Solution 24: average-order-value (SQL)

Quick solution: for this problem, note that we are going to assume the question asks for the average order value across all users that have ordered at least once. Therefore, we can apply an INNER JOIN between users and transactions.

SELECT
    u.sex,
    ROUND(AVG(quantity * price), 2) AS aov
FROM users AS u
INNER JOIN transactions AS t
    ON u.id = t.user_id
INNER JOIN products AS p
    ON t.product_id = p.id
GROUP BY 1

Question 25: longest-streak-users (SQL)
Difficulty: Medium | Tags: SQL | Companies: Facebook

Given a table with event logs, find the top five users with the longest continuous streak of visiting the platform in 2020.

Note: a continuous streak counts if the user visits the platform at least once per day on consecutive days.
Example:

Input:

events table
    Column        Type
    user_id       INTEGER
    created_at    DATETIME
    url           VARCHAR

Output:
    Column           Type
    user_id          INTEGER
    streak_length    INTEGER

Solution 25: longest-streak-users (SQL)

WITH grouped AS (
    SELECT
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
        user_id,
        created_at
    FROM (
        SELECT * FROM events GROUP BY created_at, user_id) dates
)
SELECT user_id, streak_length
FROM (
    SELECT user_id, COUNT(*) AS streak_length
    FROM grouped
    GROUP BY user_id, grp
    ORDER BY COUNT(*) DESC) c
GROUP BY user_id
LIMIT 5

Explanation: we need to find the top five users with the longest continuous streak of visiting the platform. Before anything else, let's make sure we are selecting only distinct dates from the created_at column for each user, so that the streaks aren't incorrectly interrupted by duplicate dates.

SELECT * FROM events GROUP BY created_at, user_id

After that, the first step is to find a method for calculating the "streaks" of each user from the created_at column. This is a "gaps and islands" problem, in which the data is split into "islands" of consecutive values separated by "gaps" (i.e. 1-2-3, 5-6, 9-10). A clever trick that helps us group consecutive values is to take advantage of the fact that subtracting two equally incrementing sequences produces the same difference for each pair of values. For example, [1, 2, 3, 5, 6] - [0, 1, 2, 3, 4] = [1, 1, 1, 2, 2]. By creating a new column containing the result of such a subtraction, we can then group and count the streaks for each user.

For our incremental sequence, we can use the row number of each event, obtainable with either of the window functions ROW_NUMBER() or DENSE_RANK(). The difference between these two functions lies in how they deal with duplicate values, but since we need to remove duplicate values either way to accurately count the streaks, it doesn't make a difference here.

SELECT
    DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
    user_id,
    created_at
FROM (
    SELECT * FROM events GROUP BY created_at, user_id) dates

With the events categorized into consecutive streaks, it is simply a matter of grouping by the streaks, counting each group, selecting the highest streak for each user, and ranking the top 5 users.

WITH grouped AS (
    SELECT
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) DAY)) AS grp,
        user_id,
        created_at
    FROM (
        SELECT * FROM events GROUP BY created_at, user_id) dates
)
SELECT user_id, streak_length
FROM (
    SELECT user_id, COUNT(*) AS streak_length
    FROM grouped
    GROUP BY user_id, grp
    ORDER BY COUNT(*) DESC) c
GROUP BY user_id
LIMIT 5

Note that the second subquery was necessary in order to get the streak_length (count) as a column in our final selection, as it involves multiple groupings.

Question 26: p-value-to-a-layman (statistics)
Difficulty: Easy | Tags: statistics | Companies: Uber, Facebook, Klaviyo, Pocket, Netflix, Sage, Centene, Thermo, Lumen, Surescripts, Apptio, Bolt, Nextdoor

How would you explain what a p-value is to someone who is not technical?
Solution 26: p-value-to-a-layman (statistics)

The p-value is a fundamental concept in statistical testing.

First, why does this kind of question matter? What an interviewer is looking for here is whether you can answer in a way that conveys your understanding of statistics while also answering a question from a non-technical coworker who doesn't understand why a p-value might matter. For example, if you were a data scientist and explained to a PM that the ad campaign test has a 0.08 p-value, why should the PM care about this number? Here's how we could explain it.

To understand the p-value, we must first learn about statistical tests. In a statistical test, you have two hypotheses. The null hypothesis states that our ad campaign will not produce a measurable increase in daily active users. The test hypothesis states that our ad campaign will produce a measurable increase in daily active users. We then use data to run a statistical test to find out which hypothesis is supported.

The p-value can help us determine this by giving us the probability that we would observe the current data if the null hypothesis were true. Note that this is just a statement about probability given an assumption: the p-value is not a measure of "how likely" the null hypothesis is to be right, nor does it measure "how likely" the observations in our data are due to random chance, which are the most common misinterpretations of the p-value. The only thing the p-value can tell us is how likely we are to have gotten the data we got if the null hypothesis were true. The difference may seem very abstract and not practical, but using incorrect explanations helps contribute to cult-like worship of p-values in non-technical circles.

Thus, a low p-value indicates that it would be extremely unlikely for our data to turn out this way if the null hypothesis were true. Because such data would be extremely unlikely to occur, we then conclude that the null hypothesis is in fact false. Typically, p < 0.05 is the standard for rejecting the null hypothesis in many practices, but this is just convention; it may be that in your specific application you need more confidence (0.01) or less confidence (0.1) to reject a null hypothesis. For example, in life-or-death situations like healthcare, we may want a stricter threshold than 0.05, while in studies with many factors, like sociological studies, we may choose to raise the threshold to 0.1.

Another important thing to recognize is that the p-value does not say anything about the "strength" of the statistical relationship, only whether it exists or not. We could find a very small change in ad revenue from our test (say 1%), and that change could still have a low p-value, because such a change would be unlikely if the null hypothesis were true. Likewise, we could find a huge change in ad revenue with a high p-value, which tells us that although the change would be great if the null hypothesis were false, we do not have enough evidence to say that it is in fact false.

Question 27: manager-team-sizes (SQL)
Difficulty: Easy | Tags: SQL

Write a query to identify the manager with the biggest team size. You may assume there is only one manager with the largest team size.
Example:

Input:

employees table
    Column           Type
    id               INTEGER
    first_name       VARCHAR
    last_name        VARCHAR
    salary           INTEGER
    department_id    INTEGER
    manager_id       INTEGER

managers table
    Column    Type
    id        INTEGER
    name      VARCHAR
    team      VARCHAR

Output:
    Column       Type
    manager      VARCHAR
    team_size    INTEGER

Solution 27: manager-team-sizes (SQL)

This question is relatively straightforward. We're given two tables and asked to find the manager with the largest number of employees. There are a couple of ways we could do this: method one involves using the MAX function, and method two (the path we chose to follow) involves creating a sorted list grouped by the manager name. We chose method two because it takes advantage of the most basic aspects of SQL to produce an elegant solution to the problem at hand.

First, we're going to use a LEFT JOIN to merge our "left" table, employees, with our "right" table, managers. We'll join the two tables where the employees' manager_id field matches the managers' id field. Then, because we're going to need a COUNT of employees under each manager and aggregates don't mix well with discrete values, we're going to GROUP our query BY the id field of our managers table. We don't want to GROUP BY the name field, because we don't know for certain that there aren't two managers at the company with the same name, which would mess up our query.

Now we can structure the SELECT clause of our query. We're going to pull the name field from our managers table and a COUNT of the id field from our employees table. Since we already have a GROUP BY clause in place, our COUNT results will be grouped by manager id, giving us the size of each team. Remember that we want to use aliasing at this stage to make sure our results match the output table.

Next, we're going to add an ORDER BY clause that sorts the results of our query by team size. We're going to sort in DESCending order so that the largest team size is first on our list. Finally, we can LIMIT the results of our query to 1, and we will have found the manager with the largest team.

SELECT
    m.name AS manager,
    COUNT(e.id) AS team_size
FROM managers m
LEFT JOIN employees e
    ON e.manager_id = m.id
GROUP BY m.id
ORDER BY COUNT(e.id) DESC
LIMIT 1

Question 28: flight-records (SQL)
Difficulty: Hard | Tags: SQL

Write a query to create a new table, named flight_routes, that displays unique pairs of two locations.

Note: duplicate pairs from the flights table, such as Dallas to Seattle and Seattle to Dallas, should have only one entry in the flight_routes table.

Example:

Input:

flights table
    Column                  Type
    id                      INTEGER
    source_location         VARCHAR
    destination_location    VARCHAR

Output:
    Column             Type
    destination_one    VARCHAR
    destination_two    VARCHAR

Solution 28: flight-records (SQL)

WITH locations AS (
    SELECT
        id,
        LEAST(source_location, destination_location) AS point_A,
        GREATEST(destination_location, source_location) AS point_B
    FROM flights
    ORDER BY 2, 3
)
SELECT
    point_A AS destination_one,
    point_B AS destination_two
FROM locations
GROUP BY point_A, point_B
ORDER BY point_A, point_B
Question 29: booking-regression (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: TripAdvisor, Chewy, UBS, Amazon, Facebook, Airbnb

Let's say we want to build a model to predict booking prices on Airbnb. Between linear regression and random forest regression, which model would perform better and why?

Solution 29: booking-regression (machine learning)

Let's first quickly explain the differences between linear and random forest regression before diving into which one is the better fit for bookings.

Random forest regression is based on the ensemble machine learning technique of bagging. The two key concepts of random forests are:
1. Random sampling of training observations when building trees.
2. Random subsets of features for splitting nodes.

Random forest regressions also discretize continuous variables, since they are based on decision trees, which function through recursive binary partitioning at the nodes. This effectively means that we can split not only categorical variables but also continuous variables. Additionally, with enough data and sufficient splits, a step function with many small steps can approximate a smooth function for predicting an output.

Linear regression, on the other hand, is the standard regression technique in which relationships are modeled using a linear predictor function, the most common example being y = Ax + B. Linear regression models are often fitted using the least-squares approach. There are also four main assumptions in linear regression:
- A normal distribution of error terms
- Independent observations
- Residuals with zero mean and constant variance
- No correlation between the features (no multicollinearity)

So how do we differentiate between random forest regression and linear regression independent of the problem statement? The differences between random forest regression and standard regression techniques for many applications are:
- Random forest regression can approximate complex nonlinear shapes without prior specification. Linear regression performs better when the underlying function is linear and has many continuous predictors.
- Random forest regression allows the use of arbitrarily many predictors (more predictors than data points is possible).
- Random forest regression can also capture complex interactions between predictors without prior specification.
- Both will give some semblance of "feature importance." However, linear regression's feature importance is much more interpretable than random forest's, given the coefficient values attached to each predictor.

Now let's see how each model applies to Airbnb's bookings. One thing we need to do in the interview is understand more context around the problem of predicting bookings. To do so, we need to understand which features exist in our dataset. We can assume the dataset will have features like:
- location features
- seasonality
- number of bedrooms and bathrooms
- private room, shared room, entire home, etc.
- external demand (conferences, festivals, etc.)

Can we extrapolate those features into a linear model that makes sense? Probably. If we were to measure the price of bookings in just one city, we could probably fit a decent linear regression.
Take Seattle as an example: the coefficient for each bedroom, bathroom, time of month, etc. could be standardized across the city if we had a good variable to account for location within the city. Given the nuances of different events that influence pricing, we could create custom interaction effects between the features if, for example, a huge festival suddenly increases the demand for three- or four-bedroom houses.

However, let's say we have thousands of features in our dataset and are trying to predict prices for different types of homes across the world. If we run a random forest regression model, the advantage is that it can form complex non-linear combinations from a dataset that could hold one-bedrooms in Seattle and mansions in Croatia.

But if our problem set shrinks back to the simple example of one zipcode in Seattle, then our feature set loses most of its variation in geography and rental type, and a regular linear regression has the benefit of interpretability, letting us quantify the pricing factors. A one-bedroom with two bathrooms could probably double in price compared to a one-bedroom with one bathroom given the number of guests it could fit, yet this interaction may not hold in other parts of the world with different demand pricing.

Question 30: three-zebras (probability)
Difficulty: Medium | Tags: probability | Companies: Facebook

Three zebras are chilling in the desert. Suddenly a lion attacks. Each zebra is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and runs along the outline of the triangle toward one of the other corners. What is the probability that none of the zebras collide?

Solution 30: three-zebras (probability)

Let's imagine all of the zebras on an equilateral triangle. Each has two choices of direction if it runs along the outline toward an adjacent corner. Given that the choices are random, let's compute the cases in which they fail to collide.

There are really only two such possibilities: the zebras either all run in a clockwise direction or all run in a counter-clockwise direction.

Let's calculate the probability of each. The probability that every zebra chooses to go clockwise is the product of each zebra independently choosing the clockwise direction. Given that there are two choices, that is 1/2 * 1/2 * 1/2 = 1/8. The probability of every zebra going counter-clockwise is the same: 1/8. Therefore, summing the two probabilities, we get the correct answer of 1/4, or 25%.

Question 31: month-over-month (SQL)
Difficulty: Medium | Tags: SQL | Companies: Salesforce, LinkedIn, Amazon, Sezzle

Given a table of transactions and products, write a query to get the month-over-month change in revenue for the year 2019. Make sure to round month_over_month to 2 decimal places.

Example:

Input:

transactions table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    product_id    INTEGER
    quantity      INTEGER

products table
    Column    Type
    id        INTEGER
    name      VARCHAR
    price     FLOAT

Output:
    Column              Type
    month               INTEGER
    month_over_month    FLOAT

Solution 31: month-over-month (SQL)

Whenever there is a question about month-over-month, week-over-week, year-over-year, or similar change, note that it can generally be done in two different ways.
One is using the LAG function that is available in certain SQL engines. Another is to do a sneaky join. For both, we first have to sum the transactions and group by the month and the year. Grouping by the year is technically redundant here because we are only looking at 2019.

WITH monthly_transactions AS (
    SELECT
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
        ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1, 2
    ORDER BY 1
)
SELECT * FROM monthly_transactions

Now, using the LAG function, we can apply it to our revenue column. Notice that the LAG function takes a column and then a number of rows by which to lag the value. Then we can compute the month-over-month values with the general formula.

WITH monthly_transactions AS (
    SELECT
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
        ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1, 2
    ORDER BY 1
)
SELECT
    month,
    ROUND((revenue - previous_revenue) / previous_revenue, 2) AS month_over_month
FROM (
    SELECT
        month,
        revenue,
        LAG(revenue, 1) OVER (ORDER BY month) previous_revenue
    FROM monthly_transactions
) AS t

The second way we can do this, if we aren't given the LAG function, is to do a self-join on month - 1.

WITH monthly_transactions AS (
    SELECT
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
        ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1, 2
    ORDER BY 1
)
SELECT
    mt1.month,
    ROUND((mt2.revenue - mt1.revenue) / mt1.revenue, 2) AS month_over_month
FROM monthly_transactions AS mt1
LEFT JOIN monthly_transactions AS mt2
    ON mt1.month = mt2.month - 1

Notes: the second solution's query results are slightly different (month 12 is null instead of month 1) and thus will not pass the test case.

Question 32: ride-coupon (probability)
Difficulty: Easy | Tags: probability

1. A ride-sharing app has probability p of dispensing a $5 coupon to a rider. The app services N riders. How much should we budget for the coupon initiative in total?
2. A driver using the app picks up two passengers.
   - What is the probability of both riders getting the coupon?
   - What is the probability that only one of them will get the coupon?

Solution 32: ride-coupon (probability)
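A worked sketch, assuming coupons are dispensed independently for each rider:

1. Expected total cost = $5 * N * p, so the budget should be about 5Np dollars.
2. P(both get the coupon) = p^2, and P(exactly one gets the coupon) = 2p(1 - p).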
Question 33: employee-salaries-etl-error (SQL)
Difficulty: Medium | Tags: SQL | Companies: Microsoft, Noom, MasterClass, Magical, Think

Let's say we have a table representing a company payroll schema. Due to an ETL error, the employees table, instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.

Write a query to get the current salary for each employee.

Note: assume there are no duplicate combinations of first and last names (i.e. no two John Smiths).

Example:

Input:

employees table
    Column           Type
    id               INTEGER
    first_name       VARCHAR
    last_name        VARCHAR
    salary           INTEGER
    department_id    INTEGER

Output:
    Column        Type
    first_name    VARCHAR
    last_name     VARCHAR
    salary        INTEGER

Solution 33: employee-salaries-etl-error (SQL)

The first step is to remove duplicates and retain the current salary for each user. Given that we know there aren't any duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, first and last name. This gives us a unique combination of the two fields.

This is great, but at the same time we're now stuck trying to find the most recent salary for each user. How would we tell which was the most recent salary without a datetime column?

Notice that the question states that, instead of updating the salaries every year when doing compensation adjustments, the process did an insert instead. This means that the current salary can be found by looking at the most recent row inserted into the table. We can assume that each insert auto-increments the id field, which means the row we want is the one with the maximum id for each given user.

SELECT first_name, last_name, MAX(id) AS max_id
FROM employees
GROUP BY 1, 2

Now that we have the corresponding maximum id, we can re-join it to the original table as a subquery to get the correct salary associated with that id.

SELECT e.first_name, e.last_name, e.salary
FROM employees AS e
INNER JOIN (
    SELECT first_name, last_name, MAX(id) AS max_id
    FROM employees
    GROUP BY 1, 2
) AS m
    ON e.id = m.max_id
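Where window functions are available, the same result can be reached without the self-join. A sketch, assuming MySQL 8+ (or another engine with ROW_NUMBER) and the same auto-incrementing id assumption as above:

SELECT first_name, last_name, salary
FROM (
    SELECT
        e.*,
        ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id DESC) AS rn
    FROM employees AS e
) AS ranked
WHERE rn = 1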
Question 34: paired-products (SQL)
Difficulty: Hard | Tags: SQL | Companies: Amazon

Let's say we have two tables, transactions and products. Hypothetically, the transactions table consists of over a billion rows of purchases made by users. We are trying to find paired products that are often purchased together by the same user, such as wine and bottle openers, chips and beer, etc.

Write a query to find the top five paired products and their names.

Note: for the purposes of satisfying the test case, P1 should be the item that comes first in the alphabet.

Example:

Input:

transactions table
    Column        Type
    id            INTEGER
    user_id       INTEGER
    created_at    DATETIME
    product_id    INTEGER
    quantity      INTEGER

products table
    Column    Type
    id        INTEGER
    name      VARCHAR
    price     FLOAT

Output:
    Column    Type
    P1        VARCHAR
    P2        VARCHAR
    count     INTEGER

Solution 34: paired-products (SQL)

We are tasked with finding pairs of products that are purchased together by the same user. Before we can do anything, however, we need to join the two tables, transactions and products, on id = product_id so that we can associate each transaction with a product name:

SELECT user_id, created_at, products.name
FROM transactions
JOIN products
    ON transactions.product_id = products.id

Afterwards, we are faced with the first challenge: selecting all instances where the user purchased a pair of products together. One intuitive way to accomplish this is to select all created_at dates on which more than one transaction occurred for the same user_id, which would look like this:

SELECT user_id, created_at, products.name
FROM transactions
JOIN products
    ON transactions.product_id = products.id
WHERE transactions.id NOT IN (
    SELECT id
    FROM transactions
    GROUP BY created_at, user_id
    HAVING COUNT(*) = 1
)

This is an acceptable way to accomplish the task, but it runs into trouble in the next step, where we will need to count all unique instances of each pairing of products. Fortunately, there is a clever solution which handles both parts of the problem efficiently. By self-joining the combined table with itself, we can specify the join to connect rows sharing created_at and user_id:

WITH purchases AS (
    SELECT user_id, created_at, products.name
    FROM transactions
    JOIN products
        ON transactions.product_id = products.id
)
SELECT
    t1.name AS P1,
    t2.name AS P2,
    COUNT(*)
FROM purchases AS t1
JOIN purchases AS t2
    ON t1.user_id = t2.user_id
    AND t1.created_at = t2.created_at

The self join produces every combination of pairs of products purchased. However, looking at the resulting selection, it becomes clear that there is an issue:

    Product 1               Product 2
    federal discuss hard    federal discuss hard
    night sound feeling     night sound feeling
    go window serious       go window serious
    outside learn nice      outside learn nice

We are including pairs of the same products in our selection. To fix this, we add AND t1.name < t2.name. An additional problem this solves for us is that it enforces a consistent order for the pairing of names throughout the table, namely that the first name will be alphabetically "less" than the second one (i.e. A < Z). This is important because it avoids the potential problem of undercounting pairs of names that appear in different orders (i.e. A & B vs B & A).

Finally, we can finish the problem by grouping and ordering in order to count the total occurrences of each pair.

WITH purchases AS (
    SELECT user_id, created_at, products.name
    FROM transactions
    JOIN products
        ON transactions.product_id = products.id
)
SELECT
    t1.name AS P1,
    t2.name AS P2,
    COUNT(*)
FROM purchases AS t1
JOIN purchases AS t2
    ON t1.user_id = t2.user_id
    AND t1.name < t2.name
    AND t1.created_at = t2.created_at
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 5

Question 35: dice-worth-rolling (probability)
Difficulty: Easy | Tags: probability | Companies: Amazon

Let's play a game. You are given two fair six-sided dice and asked to roll them. If the sum of the values on the dice equals seven, then you win 21 dollars. However, you must pay $10 for each roll. Is this game worth playing?

Solution 35: dice-worth-rolling (probability)
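A worked expected-value sketch, assuming fair dice and that the $10 fee is paid on every roll:

P(sum = 7) = 6/36 = 1/6
E[payoff per roll] = (1/6) * $21 - $10 = $3.50 - $10 = -$6.50

The expected value per roll is negative, so the game is not worth playing.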
Question 36: random-sql-sample (sql)
Difficulty: Medium | Tags: sql | Companies: Microsoft, Apple, Two

Let's say we have a table with id and name fields. The table holds over 100 million rows, and we want to sample a random row from the table without throttling the database. Write a query to randomly sample a row from this table.

Example:
Input: big_table table: id INTEGER, name VARCHAR

Solution 36: random-sql-sample (sql)

Most SQL databases have a RAND() function, and normally we could write:

SELECT *
FROM big_table
ORDER BY RAND()

and the function will randomly sort the rows in the table. This works fine and is fast if you only have, say, around 1,000 rows. It might take a few seconds at 10K rows, and at 100K you may have time to go to the bathroom or cook a meal before it finishes. What happens at 100 million rows? Someone in DevOps is probably screaming at you.

Random sampling in SQL matters at scale. We don't want to sort by the pre-built function, because it wasn't meant for performance, but we can repurpose it for our use case. The RAND() function returns a floating-point value between 0 and 1, so if we instead call:

SELECT RAND()

we get a random decimal to some Nth degree of precision. RAND() essentially lets us seed a random value. How can we use this to select a random row quickly?

Let's grab a random number using RAND() that can be mapped to an id. Given we have 100 million rows, we want a random number from 1 to 100 million. We can do this by multiplying our random seed from RAND() by the MAX id in the table.

SELECT CEIL(RAND() * (SELECT MAX(id) FROM big_table))

We use the CEIL function to round the random value up to an integer. Now we have to join back to the existing table to get the row. What happens if we have missing or skipped id values, though? We can solve this by joining on all ids greater than or equal to our random value and selecting only the nearest neighbor when a direct match is not possible. As soon as one row is found, we stop (LIMIT 1), and we read the rows according to the index (ORDER BY id ASC). Now our performance is optimal.

SELECT r1.id, r1.name
FROM big_table AS r1
INNER JOIN (
    SELECT CEIL(RAND() * (SELECT MAX(id) FROM big_table)) AS id
) AS r2
    ON r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1

Question 37: liked-and-commented (sql)
Difficulty: Medium | Tags: sql | Companies: Facebook, Glassdoor

You're given two tables, users and events. The events table holds values of all user events in the action column ('like', 'comment', or 'post'). Write a query to get the percentage of users that have never liked or commented. Round to two decimal places.

Example:
Input:
users table: id INTEGER, name VARCHAR, created_at DATETIME
events table: user_id INTEGER, action VARCHAR, created_at DATETIME
Output: percent_never FLOAT

Solution 37: liked-and-commented (sql)

The question gives us two tables (users and events) and asks us to find the percentage of users who have never liked or commented. We know two things at once: we will have to join the two tables, and our final SELECT clause will be one sum divided by another. From there, we can begin to strategize about how to formulate the query. The trick lies in how we define data points for users who have never done something.
In this case, the first step in our strategy is to create a Common Table Expression (CTE) to isolate users who have liked or commented. This creates a temporary table that can be referenced by the query that follows.

WITH liked_or_commented AS (
    SELECT e.user_id
    FROM events e
    WHERE action IN ('like', 'comment')
    GROUP BY 1
)

Note that we're using the IN operator in the WHERE clause to check whether the value of the action field appears in a comma-separated list of strings. We could achieve the same effect with:

WHERE action = 'like' OR action = 'comment'

Both forms are equally valid and won't affect the outcome of the query; we simply chose the one that fits on a single line.

The next step in isolating users who have never liked or commented is to perform a LEFT JOIN. Remember that in a LEFT JOIN, all rows of the first table are preserved and only matching records from the second table are kept. That means that if we perform the following join of the users table to our temporary table liked_or_commented:

FROM users u
LEFT JOIN liked_or_commented loc ON u.id = loc.user_id

we are left with NULL values in the user_id field of liked_or_commented for every user who has never liked or commented. We've also effectively joined our users and events tables, since liked_or_commented is just a version of events narrowed to specific parameters.

Now the only thing missing is the SELECT clause. Since we want the final value to be a percentage, we divide one quantity by another: the count of users who have never liked or commented by the total number of users. To get the numerator, we combine SUM with CASE WHEN:

SUM(CASE WHEN loc.user_id IS NULL THEN 1 ELSE 0 END)

The CASE WHEN expression assigns a value of 1 to every record where the user_id field of the liked_or_commented CTE is NULL, and 0 in every other case. Summing these values effectively counts the users who have never liked or commented, which gives us the numerator. A simple COUNT of the id field of the users table gives us the denominator. The last step is to wrap the calculation in the ROUND function (which has the form ROUND(quantity, number of decimal places)):

ROUND(SUM(CASE WHEN loc.user_id IS NULL THEN 1 ELSE 0 END) / COUNT(u.id), 2) AS percent_never

That means our final query looks like:

WITH liked_or_commented AS (
    SELECT e.user_id
    FROM events e
    WHERE e.action IN ('like', 'comment')
    GROUP BY 1
)
SELECT ROUND(SUM(CASE WHEN loc.user_id IS NULL THEN 1 ELSE 0 END) / COUNT(u.id), 2) AS percent_never
FROM users u
LEFT JOIN liked_or_commented loc ON u.id = loc.user_id

Question 38: daily-active-users (sql)
Difficulty: Easy | Tags: sql | Companies: Apple, Lattice

Given a table of user logs with platform information, count the number of daily active users on each platform for the year of 2020.

Example:
Input: events table: id INTEGER, user_id INTEGER, created_at DATETIME, action VARCHAR, url VARCHAR, platform VARCHAR
Output: platform VARCHAR, created_at DATETIME, daily_users INTEGER
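The source does not include a solution for this question. As a rough sketch of the aggregation logic only (the column names follow the schema above; the events DataFrame itself is hypothetical), a pandas version might look like:

import pandas as pd

def daily_active_users(events: pd.DataFrame) -> pd.DataFrame:
    # events is assumed to have at least: user_id, created_at, platform (other columns are ignored)
    events = events.assign(created_at=pd.to_datetime(events["created_at"]))
    in_2020 = events[events["created_at"].dt.year == 2020]
    # count distinct users per (platform, calendar day)
    return (in_2020
            .groupby(["platform", in_2020["created_at"].dt.date])["user_id"]
            .nunique()
            .reset_index(name="daily_users"))

This mirrors a GROUP BY platform, DATE(created_at) with COUNT(DISTINCT user_id) in SQL.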
Question 39: download-facts (sql)
Difficulty: Easy | Tags: sql | Companies: Microsoft, Amazon

Given two tables, accounts and downloads, find the average number of downloads for free vs. paying accounts, broken down by day.

Note: you only need to consider accounts that have had at least one download when calculating the average.
Note: round average_downloads to 2 decimal places.

Example:
Input:
accounts table: account_id INTEGER, paying_customer BOOLEAN
downloads table: account_id INTEGER, download_date DATETIME, downloads INTEGER
Output: download_date DATETIME, paying_customer BOOLEAN, average_downloads FLOAT

Solution 39: download-facts (sql)

We need data from both tables, so the first thing to do is join them. Since we should consider only accounts that had downloads during the day, we can use an INNER JOIN (or simply JOIN). This type of join discards accounts with no records in the downloads table. If we used a different type of join, for example a LEFT JOIN, we would need to decide how to handle accounts with no records in the downloads table.

For example, suppose there are three records in the accounts table and two records in the downloads table:

accounts: account_id, paying_customer
1, 0
2, 0
3, 0

downloads: account_id, download_date, downloads
1, 2020-01-01 00:00:00, 100
2, 2020-01-01 00:00:00, 200

An INNER JOIN (or simply JOIN) query like:

SELECT *
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id

will output only two rows, omitting account 3. If we had needed to take account 3 into consideration, the calculation would instead have been (100 + 200 + 0) / 3 = 100.

Our second step is to figure out which columns to display in the output: download_date, paying_customer, and a calculated column called average_downloads. We should use the AVG() function to calculate the average. Because AVG is an aggregate function, we need a GROUP BY clause. We group by download_date and paying_customer, since those are the columns we want to differentiate entries by.

SELECT download_date, paying_customer, AVG(downloads) AS average_downloads
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id
GROUP BY download_date, paying_customer

Lastly, we apply the ROUND() function to the average to obtain the final result:

SELECT download_date, paying_customer, ROUND(AVG(downloads), 2) AS average_downloads
FROM accounts a
JOIN downloads b ON a.account_id = b.account_id
GROUP BY download_date, paying_customer

Question 40: project-budget-error (sql)
Difficulty: Easy | Tags: sql | Companies: Microsoft, Facebook

We're given two tables. One is named projects and the other maps employees to the projects they're working on. We want to select the five most expensive projects by budget to employee count ratio, but we've found a bug: there are duplicate rows in the employee_projects table.
Write a query that accounts for the error and selects the top five most expensive projects by budget to employee count ratio.

Example:
Input:
projects table: id INTEGER, title VARCHAR, state_date DATETIME, end_date DATETIME, budget INTEGER
employee_projects table: project_id INTEGER, employee_id INTEGER
Output: title VARCHAR, budget_per_employee FLOAT

Solution 40: project-budget-error (sql)

Given that the bug only exists in the employee_projects table, we can reuse most of the code from the original question as long as we rebuild the employee_projects table with duplicates removed. One way to do so is to group by the columns project_id and employee_id. Grouping by both columns produces a table with distinct (project_id, employee_id) pairs, thereby getting rid of any duplicates. Then all we have to do is nest that result in a subquery and query from it.

SELECT p.title, budget/num_employees AS budget_per_employee
FROM projects AS p
INNER JOIN (
    SELECT project_id, COUNT(*) AS num_employees
    FROM (
        SELECT project_id, employee_id
        FROM employee_projects
        GROUP BY 1,2
    ) AS gb
    GROUP BY project_id
) AS ep ON p.id = ep.project_id
ORDER BY budget/num_employees DESC
LIMIT 5;

Question 41: biased-five-out-of-six (probability)
Difficulty: Medium | Tags: probability | Companies: Facebook, Google

Let's say we're given a biased coin that comes up heads 30% of the time when tossed. What is the probability of the coin landing heads exactly 5 times out of 6 tosses?

Solution 41: biased-five-out-of-six (probability)

Question 42: closed-accounts (sql)
Difficulty: Medium | Tags: sql

Given a table of account statuses, write a query to get the percentage of accounts that were active on December 31st, 2019, and closed on January 1st, 2020, over the total number of accounts that were active on December 31st. Each account has only one daily record indicating its status at the end of the day.

Note: round the result to 2 decimal places.

Example:
Input: account_status table: account_id INTEGER, date DATETIME, status VARCHAR

account_id, date, status
1, 2020-01-01, closed
1, 2019-12-31, open
2, 2020-01-01, closed

Output: percentage_closed FLOAT

Solution 42: closed-accounts (sql)

At first, this question seems pretty straightforward. We could just compute a SUM(CASE WHEN ...) expression that divides the total number of closed accounts by the total number of accounts:

SELECT SUM(CASE WHEN status = 'closed' THEN 1 ELSE 0 END) / COUNT(DISTINCT account_id) AS percentage_closed
FROM account_status
WHERE date = '2020-01-01'

But there's a problem here. This query counts every closed account, which is not what we want: we are looking only for accounts that were closed on January 1st, 2020, and open the day before. The account_status table has the status of each account for each day. First, we find the number of accounts that were active on December 31st, 2019, and closed on January 1st, 2020; this is done within the correct_closed_accounts_cte CTE. Second, we count the number of accounts that were open on December 31st within the num_accounts CTE. Finally, we divide the two numbers to get the solution.
WITH correct_closed_accounts_cte AS (
    SELECT COUNT(*) AS numerator
    FROM account_status a
    JOIN account_status b ON a.account_id = b.account_id
    WHERE a.date = '2020-01-01'
      AND b.date = '2019-12-31'
      AND a.status = 'closed'
      AND b.status = 'open'
),
num_accounts AS (
    SELECT numerator, COUNT(DISTINCT account_id) AS denominator
    FROM correct_closed_accounts_cte, account_status
    WHERE date = '2019-12-31'
      AND status = 'open'
)
SELECT CAST((numerator / denominator) AS DECIMAL(3,2)) AS percentage_closed
FROM num_accounts;

Question 43: fewer-orders (sql)
Difficulty: Easy | Tags: sql | Companies: Amazon

Write a query to identify the names of users who placed fewer than 3 orders or ordered less than $500 worth of product.

Example:
Input:
transactions table: id INTEGER, user_id INTEGER, created_at DATETIME, product_id INTEGER, quantity INTEGER
users table: id INTEGER, name VARCHAR, sex VARCHAR
products table: id INTEGER, name VARCHAR, price FLOAT
Output: users_less_than VARCHAR

Solution 43: fewer-orders (sql)

SELECT DISTINCT(user_name) users_less_than
FROM (
    SELECT u.name user_name,
           COUNT(t.id) tx_count,
           SUM(quantity * price) total_prod_worth
    FROM users u
    LEFT JOIN transactions t ON u.id = t.user_id
    LEFT JOIN products p ON t.product_id = p.id
    GROUP BY 1
) sub
WHERE tx_count < 3 OR total_prod_worth < 500;

Question 44: replace-words-with-stems (python)
Difficulty: Medium | Tags: python | Companies: Adobe, Facebook, ABC

In data science there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set. Given a dictionary consisting of many roots and a sentence, write a function replace_words to stem all the words in the sentence with the root forming them. If a word has many roots that can form it, replace it with the root of the shortest length.

Example:
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output: "the cat was rat by the bat"

Solution 44: replace-words-with-stems (python)

At first it looks like we can simply loop through each word, check whether a root exists anywhere in the word, and if so replace the word with the root. But since we are stemming the words, we have to make sure the root matches the word at its prefix rather than appearing anywhere within the word.

We're given a list of roots and a sentence string. Since we have to check each word, let's first split sentence into a list of words.

words = sentence.split()

Next, we loop through each word in words and check whether it has a prefix equal to one of the roots. To do this, we loop through each possible substring starting at the first letter. If we find a prefix matching a root, we replace that word in the words list with the root it contains.

j = 0
while j < len(words):
    i = 0
    while i < len(words[j]):
        if words[j][:i] in roots:
            words[j] = words[j][:i]
            i = len(words[j])
        i = i + 1
    j = j + 1

Notice the line inside the if statement of the inner while loop:

i = len(words[j])

We need this statement to ensure that if a word contains two roots, we replace it with the shorter one. For example, if the roots list above also contained the string "catt", we would still return the same output. Finally, we need to join our updated list of words back into a sentence.

new_sentence = " ".join(words)

And our final code is as follows:

def replace_words(roots, sentence):
    words = sentence.split()
    j = 0
    while j < len(words):
        i = 0
        while i < len(words[j]):
            if words[j][:i] in roots:
                words[j] = words[j][:i]
                i = len(words[j])
            i = i + 1
        j = j + 1
    new_sentence = " ".join(words)
    return new_sentence
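A quick usage check with the example input from the question, calling the replace_words function defined above:

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(replace_words(roots, sentence))  # -> "the cat was rat by the bat"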
Question 45: acceptance-rate (sql)
Difficulty: Easy | Tags: sql

We're given two tables: friend_requests holds all the friend requests made, and friend_accepts holds all of the acceptances. Write a query to find the overall acceptance rate of friend requests.

Note: round results to 4 decimal places.

Example:
Input:
friend_requests table: requester_id INTEGER, requested_id INTEGER, created_at DATETIME
friend_accepts table: acceptor_id INTEGER, requester_id INTEGER, created_at DATETIME
Output: acceptance_rate FLOAT

Solution 45: acceptance-rate (sql)

The overall acceptance rate is the total number of accepted friend requests divided by the total number of friend requests:

count of acceptances / count of friend requests

We can get both values fairly easily. The denominator is the total number of friend requests, which comes from the base table. We compute the total number of acceptances by LEFT JOINing to the friend_accepts table. In the join, we have to make sure we're matching on the correct columns: requester_id to requester_id, and then the second column, requested_id, to acceptor_id.

Note: we cannot use a DISTINCT count, given that users can send friend requests to, and accept them from, multiple other users.

SELECT CAST(COUNT(b.acceptor_id) / COUNT(a.requester_id) AS DECIMAL(5,4)) AS acceptance_rate
FROM friend_requests a
LEFT JOIN friend_accepts b
    ON a.requester_id = b.requester_id
    AND a.requested_id = b.acceptor_id;

Question 46: attribution-rules (sql)
Difficulty: Medium | Tags: sql

Write a query that creates an attribution rule for each user. If the user visited Facebook or Google at least once, the attribution is labeled "paid." Otherwise, the attribution is labeled "organic."

Example:
Input:
user_sessions table: created_at DATETIME, session_id INTEGER, user_id INTEGER
attribution table: session_id INTEGER, channel VARCHAR
Output: user_id INTEGER, attribute VARCHAR

Solution 46: attribution-rules (sql)

WITH cte AS (
    SELECT user_id,
           SUM(CASE WHEN (channel = 'Facebook' OR channel = 'Google') THEN 1 ELSE 0 END) AS paid_count
    FROM user_sessions
    JOIN attribution ON user_sessions.session_id = attribution.session_id
    GROUP BY user_id
)
SELECT user_id,
       CASE WHEN paid_count >= 1 THEN 'paid' ELSE 'organic' END AS attribute
FROM cte

Question 47: notification-deliveries (sql)
Difficulty: Hard | Tags: sql | Companies: Twitter, Facebook, Think, LinkedIn

We're given two tables: a table of notification_deliveries and a table of users with created and purchase conversion dates. If a user hasn't purchased, the conversion_date column is NULL. Write a query to get the distribution of the total number of push notifications received before a user converts.
Example:
Input:
notification_deliveries table: notification VARCHAR, user_id INTEGER, created_at DATETIME
users table: id INTEGER, created_at DATETIME, conversion_date DATETIME
Output: total_pushes INTEGER, frequency INTEGER

Solution 47: notification-deliveries (sql)

If we're looking for the distribution of total push notifications before a user converts, the end result should look something like this:

total_pushes | frequency
           0 |       100
           1 |       250
           2 |       300
         ... |       ...

To get there, the JOIN between users and notification_deliveries has to follow a few logical conditions:
- Join on the user_id field in both tables.
- Exclude all users that have not converted.
- Require the conversion_date value to be greater than the created_at value in the deliveries table, so that we count all notifications sent to the user before converting.

Additionally, this has to be a LEFT JOIN in order to keep users who converted off of zero push notifications. We can then get the count per user and group by that count to get the overall distribution.

SELECT total_pushes, COUNT(*) AS frequency
FROM (
    SELECT u.id, COUNT(nd.notification) AS total_pushes
    FROM users AS u
    LEFT JOIN notification_deliveries AS nd
        ON u.id = nd.user_id
        AND u.conversion_date >= nd.created_at
    WHERE u.conversion_date IS NOT NULL
    GROUP BY 1
) AS pushes
GROUP BY 1

Question 48: time-on-fb-distribution (statistics)
Difficulty: Medium | Tags: statistics

What do you think the distribution of time spent per day on Facebook looks like? What metrics would you use to describe that distribution?

Solution 48: time-on-fb-distribution (statistics)

Having the vocabulary to describe a distribution is an important skill for a data scientist when it comes to communicating ideas to peers. There are four important concepts, with supporting vocabulary, that you can use to structure an answer to a question like this:

1. Center (mean, median, mode)
2. Spread (standard deviation, interquartile range, range)
3. Shape (skewness, kurtosis, unimodal or bimodal)
4. Outliers (do they exist?)

In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:

1. People who scroll quickly through their feed and don't spend much time on FB.
2. People who spend a large amount of their social media time on FB.

From this point of view, we can make the following claims about the distribution of time spent on FB, with the caveat that they need to be validated with real-world data.

1. Center: Since we expect the distribution to be bimodal (see Shape), we could describe it using the mode and median instead of the mean. These summary statistics are good for investigating distributions that deviate from the classical normal distribution.
2. Spread: Since we expect the distribution to be bimodal, the spread and range will be fairly large. A large interquartile range will be needed to describe this distribution accurately; refrain from using the standard deviation to describe its spread.
3. Shape: From our description, the distribution would be bimodal.
One large group of people would be clustered around the lower end of the distribution, and another large group would be centered around the higher end. There could also be some skewness to the right from people who spend a bit too much time on FB.
4. Outliers: You can run outlier detection tests such as Grubbs' test, z-scores, or IQR methods to quantitatively tell which users are not like the rest.

If we were to ask further questions about the demographics of the users we are interested in, we could come up with another story using this same vocabulary to structure our answer.

Question 49: minimum-change (python)
Difficulty: Easy | Tags: python | Companies: Google

Write a function find_change to find the minimum number of coins that make up the given amount of change cents. Assume we only have coins of value 1, 5, 10, and 25 cents.

Example:
Input: cents = 73
Output: find_change(cents) -> 7  # (25 + 25 + 10 + 10 + 1 + 1 + 1)

Solution 49: minimum-change (python)

def find_change(cents):
    count = 0
    while cents != 0:
        if cents >= 25:
            count += 1
            cents -= 25
        elif cents >= 10:
            count += 1
            cents -= 10
        elif cents >= 5:
            count += 1
            cents -= 5
        elif cents >= 1:
            count += 1
            cents -= 1
    return count

Question 50: swipe-precision (sql)
Difficulty: Hard | Tags: sql | Companies: Amazon, Tinder

There are two tables. One table, swipes, holds a row for every Tinder swipe and contains a boolean column, is_right_swipe, that indicates whether the swipe was a right or a left swipe. The second table, variants, determines which user has which variant of an A/B test. Write a SQL query to output the average number of right swipes for two different variants of a feed ranking algorithm, comparing users at the first 10, 50, and 100 swipes on their feed.

Note: users must have swiped at least 10 times to be included in the subset of users used to analyze the mean number of right swipes.

Example:
Input:
variants table: id INTEGER, experiment VARCHAR, variant VARCHAR, user_id INTEGER
swipes table: id INTEGER, user_id INTEGER, swiped_user_id INTEGER, created_at DATETIME, is_right_swipe BOOLEAN
Output: variant VARCHAR, mean_right_swipes FLOAT, swipe_threshold INTEGER, num_users INTEGER

Solution 50: swipe-precision (sql)

WITH sample AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS swipe_num
    FROM swipes
    ORDER BY created_at
),
sample2 AS (
    SELECT *,
           SUM(is_right_swipe) OVER (PARTITION BY user_id ORDER BY swipe_num) AS swipe_count
    FROM sample
)
SELECT v.variant,
       s.swipe_num AS swipe_threshold,
       AVG(s.swipe_count) AS mean_right_swipes,
       COUNT(s.user_id) AS num_users
FROM sample2 AS s
LEFT JOIN variants AS v ON s.user_id = v.user_id
WHERE swipe_num IN (10, 50, 100)
GROUP BY v.variant, s.swipe_num

Question 51: random-seed-function (probability)
Difficulty: Medium | Tags: probability | Companies: Google

Let's say you have a function that outputs a random integer between a minimum value, N, and a maximum value, M. Now let's say we take the output of that random integer function and pass it into another call of the same function as the max value, keeping the same min value N.

1. What would the distribution of the samples look like?
2. What would be the expected value?

Solution 51: random-seed-function (probability)
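The worked solution is not reproduced in the source. As a rough numerical illustration only (assuming both calls return uniform integers with the endpoints included; the values of N, M, and TRIALS below are arbitrary), a small simulation shows the shape of the distribution and an expected value near (3N + M) / 4:

import random

# Hypothetical parameters for illustration only.
N, M = 1, 100
TRIALS = 100_000

samples = []
for _ in range(TRIALS):
    first = random.randint(N, M)        # first draw: uniform on [N, M]
    second = random.randint(N, first)   # second draw: uniform on [N, first]
    samples.append(second)

# Smaller values are more likely than larger ones, and the mean lands near (3*N + M) / 4.
print(sum(samples) / TRIALS, (3 * N + M) / 4)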
Question 52: find-the-missing-number (algorithms)
Difficulty: Easy | Tags: algorithms | Companies: Microsoft, PayPal

You have an array of integers of length n spanning 0 to n with one number missing. Write a function missing_number that returns the missing number in the array.

Note: complexity of O(N) required.

Example:
Input: nums = [0,1,2,4,5]
missing_number(nums) -> 3

Solution 52: find-the-missing-number (algorithms)

There are two ways we can solve this problem: one through logical iteration and another through a mathematical formulation. We can look at both, as they both run in O(N).

The first is general iteration through the array. We pass in the array and create a set holding each value of the input array. Then we loop over the range from 0 to n and check whether each number is in the set we just created. If it isn't, we return the missing number.

def missing_number(nums):
    num_set = set(nums)
    n = len(nums) + 1
    for number in range(n):
        if number not in num_set:
            return number

The second solution requires formulating an equation. If we know that exactly one number is missing from 0 to n, we can solve for it by taking the sum of the numbers from 0 to n and subtracting the sum of the input array from it.

The sum of the numbers from 0 to n is n(n+1)/2. Now all we have to do is apply the built-in sum function to the input array and subtract the two values.

def missing_number(nums):
    n = len(nums)
    total = n * (n + 1) // 2
    sum_of_nums = sum(nums)
    return total - sum_of_nums

Question 53: lazy-raters (probability)
Difficulty: Medium | Tags: probability | Companies: Facebook, Netflix

Netflix has hired people to rate movies. Out of all of the raters, 80% carefully rate movies and rate 60% of the movies as good and 40% as bad. The other 20% are lazy raters and rate 100% of the movies as good. Assuming all raters rate the same number of movies, what is the probability that a movie is rated good?

Solution 53: lazy-raters (probability)

Question 54: impression-reach (probability)
Difficulty: Medium | Tags: probability

Let's say we have a very naive advertising platform. Given an audience of size A and an impression size of B, each user in the audience is given the same random chance of seeing an impression.

1. Compute the probability that a user sees exactly 0 impressions.
2. What's the probability of each person receiving at least 1 impression?

Solution 54: impression-reach (probability)
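The worked solution is not reproduced in the source. A sketch of the usual calculation, under the added assumption that each of the B impressions is shown to one user chosen uniformly at random from the audience of size A:

\[
P(\text{a given user sees 0 impressions}) = \left(1 - \frac{1}{A}\right)^{B} \approx e^{-B/A},
\qquad
P(\text{at least 1 impression}) = 1 - \left(1 - \frac{1}{A}\right)^{B}
\]

If "each person receiving at least 1 impression" is instead read as a joint event over all A users at once, the per-user complement above is no longer the full answer.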
Question 55: conversations-distribution (analytics)
Difficulty: Medium | Tags: analytics | Companies: Amazon, Think

We have a table that represents the total number of messages sent between two users by date on Messenger.

1. What are some insights that could be derived from this table?
2. What do you think the distribution of the number of conversations created by each user per day looks like?
3. Write a query to get the distribution of the number of conversations created by each user by day in the year 2020.

Example:
Input: messages table: id INTEGER, date DATETIME, user1 INTEGER, user2 INTEGER, msg_count INTEGER
Output: num_conversations INTEGER, frequency INTEGER

Solution 55: conversations-distribution (analytics)

1. Top-level insights that can be derived from this table are the total number of messages sent per day, the number of conversations being started, and the average number of messages per conversation. If we think about business-facing metrics, we can start analyzing them as time series: how many more conversations are being started now compared to the past year? Do more conversations between two users indicate a closer friendship than the depth of a single conversation in total messages?

2. The distribution would likely be skewed to the right or bimodal. If we think about the probability of a user starting a conversation with more than one additional person per day, would that be going up or down? The peak is probably around one to five new conversations a day. After that we would see a large decrease, with a potential bump from very active users who may be using messaging tools for work.

3. Given that we just want to count the number of conversations, we can ignore the message count and focus on the key metric, the number of new conversations created by day, in a single query. To get this metric, we group by the date field and count the distinct number of users messaged. Afterward, we group by that value and count it to get the distribution.

SELECT num_conversations, COUNT(*) AS frequency
FROM (
    SELECT user1, DATE(date), COUNT(DISTINCT user2) AS num_conversations
    FROM messages
    WHERE YEAR(date) = '2020'
    GROUP BY 1,2
) AS t
GROUP BY 1

Question 56: move-zeros-back (algorithms)
Difficulty: Medium | Tags: algorithms

Given an array of integers, write a function move_zeros_back that moves all zeros in the array to the end of the array. If there are no zeros, return the input array.

Example:
Input: array = [0,5,4,2,0,3]
move_zeros_back(array) -> [5,4,2,3,0,0]

Solution 56: move-zeros-back (algorithms)

This runs in O(n) time with O(1) extra space. We keep a pointer non_zeros that scans ahead for the next non-zero item; as we loop through the array, every zero we find is swapped with that next non-zero item.

def move_zeros_back(array):
    non_zeros = 0
    for i in range(len(array)):
        if array[i] == 0:
            # the next non-zero item can only be at or after the current position
            if non_zeros < i:
                non_zeros = i
            while array[non_zeros] == 0:
                non_zeros += 1
                if non_zeros >= len(array):
                    return array
            array[non_zeros], array[i] = array[i], array[non_zeros]
    return array
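A quick check of the function against the question's example, plus a case that starts with a non-zero value:

print(move_zeros_back([0, 5, 4, 2, 0, 3]))  # -> [5, 4, 2, 3, 0, 0]
print(move_zeros_back([1, 0, 2]))           # -> [1, 2, 0]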
Question 57: bucket-test-scores (pandas)
Difficulty: Medium | Tags: pandas | Companies: Google

Let's say you're given a dataframe of standardized test scores from high schoolers in grades 9 to 12 called df_grades. Given the dataset, write a function in pandas called bucket_test_scores to return the cumulative percentage of students that received scores within the buckets <50, <75, <90, and <100.

Example:
Input: print(df_grades)

user_id  grade  test score
      1     10          85
      2     10          60
      3     11          90
      4     10          30
      5     11          99

Output: bucket_test_scores(df_grades) ->

grade  test score  percentage
   10         <50         33%
   10         <75         66%
   10         <90        100%
   10        <100        100%
   11         <50          0%
   11         <75          0%
   11         <90         50%
   11        <100        100%

Solution 57: bucket-test-scores (pandas)

import pandas as pd

def bucket_test_scores(df):
    bins = [0, 50, 75, 90, 100]
    labels = ['<50', '<75', '<90', '<100']
    df['test score'] = pd.cut(df['test score'], bins, labels=labels)
    df = (df
          .groupby(['grade', 'test score']).agg({'user_id': 'count'})
          .fillna(0)
          .groupby(['grade']).apply(lambda x: 100 * x / float(x.sum()))
          .groupby(['grade']).cumsum()
          .reset_index())
    df['percentage'] = df.user_id.astype(int).astype(str) + '%'
    df.drop(columns='user_id', inplace=True)
    return df

Question 58: friendship-timeline (python)
Difficulty: Hard | Tags: python

There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning or ending. Write a function friendship_timeline to generate an output that lists the pairs of friends with the timestamp of the friendship beginning followed by the timestamp of the friendship ending.

Note: there can be multiple instances over time when two people became friends and unfriended; only output entries for which a corresponding friendship was removed.

Example:
Input:
friends_added = [
    {'user_ids': [1, 2], 'created_at': '2020-01-01'},
    {'user_ids': [3, 2], 'created_at': '2020-01-02'},
    {'user_ids': [2, 1], 'created_at': '2020-02-02'},
    {'user_ids': [4, 1], 'created_at': '2020-02-02'}]
friends_removed = [
    {'user_ids': [2, 1], 'created_at': '2020-01-03'},
    {'user_ids': [2, 3], 'created_at': '2020-01-05'},
    {'user_ids': [1, 2], 'created_at': '2020-02-05'}]

Output:
friendships = [
    {'user_ids': [1, 2], 'start_date': '2020-01-01', 'end_date': '2020-01-03'},
    {'user_ids': [1, 2], 'start_date': '2020-02-02', 'end_date': '2020-02-05'},
    {'user_ids': [2, 3], 'start_date': '2020-01-02', 'end_date': '2020-01-05'}]

Solution 58: friendship-timeline (python)

def friendship_timeline(friends_added, friends_removed):
    friendships = []
    for removed in friends_removed:
        for added in friends_added:
            if sorted(removed['user_ids']) == sorted(added['user_ids']):
                friends_added.remove(added)
                friendships.append({
                    'user_ids': sorted(removed['user_ids']),
                    'start_date': added['created_at'],
                    'end_date': removed['created_at']
                })
                break
    return sorted(friendships, key=lambda x: x['user_ids'])
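Calling the function on the example lists from the question reproduces the expected output:

# friends_added and friends_removed as defined in the example above
for friendship in friendship_timeline(friends_added, friends_removed):
    print(friendship)
# {'user_ids': [1, 2], 'start_date': '2020-01-01', 'end_date': '2020-01-03'}
# {'user_ids': [1, 2], 'start_date': '2020-02-02', 'end_date': '2020-02-05'}
# {'user_ids': [2, 3], 'start_date': '2020-01-02', 'end_date': '2020-01-05'}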
Question 59: search-ratings (analytics)
Difficulty: Easy | Tags: analytics | Companies: Facebook

You're given a table that represents search results from searches on Facebook. The query column is the search term, the position column represents the position the search result came in, and the rating column represents the human rating of the search result, from 1 to 5, where 5 is high relevance and 1 is low relevance.

Input: search_results table: query VARCHAR, result_id INTEGER, position INTEGER, rating INTEGER

1. Write a query to compute a metric that measures the quality of the search results for each query.
2. You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for 'dog' and 'cat' are:

query | result_id | position | rating | notes
dog   | 1000      | 1        | 2      | picture of hotdog
dog   | 998       | 2        | 4      | dog walking
dog   | 342       | 3        | 1      | zebra
cat   | 123       | 1        | 4      | picture of cat
cat   | 435       | 2        | 2      | cat memes
cat   | 545       | 3        | 1      | pizza shops

...we would rank 'cat' as having better search result ranking precision than 'dog', based on the correct sorting by rating. Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (the avg_rating column) to 2 decimal places.

Output: query VARCHAR, avg_rating FLOAT

Solution 59: search-ratings (analytics)

1. This is an unusual SQL problem, given that it asks you to define a metric and then write a query to compute it. Generally, this should be pretty simple: can we rank by the metric and figure out which query has the best overall results? For example, if every result for the search query 'tiger' is rated 5, that would be a perfect result set. The way to compute that metric is to simply take the average rating over all of the results for each query:

SELECT query, ROUND(AVG(rating), 2) AS avg_rating
FROM search_results
GROUP BY 1

2. The precision metric is a little more difficult, because we now have to account for a second factor: position. We have to find a way to weight position against the rating to normalize the metric score. This kind of problem can get very complicated if we dive deep into it. However, the question is clearly geared toward being practical, figuring out a metric and an easy SQL query, rather than developing a search-ranking precision scale that optimizes for something like CTR.

In solving the problem, it helps to look at the example to construct an approach to the metric. If the first result is rated 5 and the last result is rated 1, that's good. Even better is if the first result is rated 5 and the last result is also rated 5. Bad is if the first result is rated 1 and the last result is rated 5. If we reuse the approach from part 1, however, we get the same metric score no matter how the values are ordered by position. So how do we factor position into the ranking?

What if we took the inverse of the position as our weighting factor, i.e. 1/position as a weighted score? Now, whatever the overall ratings, we have a way to weight position into the formula.

SELECT query, ROUND(AVG((1/position) * rating), 2) AS avg_rating
FROM search_results
GROUP BY 1

Question 60: employee-project-budgets (sql)
Difficulty: Medium | Tags: sql

We're given two tables. One is named projects and the other maps employees to the projects they're working on. Write a query to get the top five most expensive projects by budget to employee count ratio.

Note: exclude projects with 0 employees. Assume each employee works on only one project.

Input:
projects table: id INTEGER, title VARCHAR, state_date DATETIME, end_date DATETIME, budget INTEGER
employee_projects table: project_id INTEGER, employee_id INTEGER
Output: title VARCHAR, budget_per_employee INTEGER

Solution 60: employee-project-budgets (sql)

We're given two tables: one with the budget of each project, and one with all employees associated with each project.
Since the question specifies that each employee works on only one project and excludes projects with 0 employees, we can apply an INNER JOIN between the two tables and not have to worry about duplicates or about leaving out non-staffed projects.

SELECT project_id, COUNT(*) AS num_employees
FROM employee_projects
GROUP BY 1

The query above grabs the total number of employees per project. Now all we have to do is join it to the projects table to get the budget for each project and divide it by the number of employees.

SELECT p.title, budget/num_employees AS budget_per_employee
FROM projects AS p
INNER JOIN (
    SELECT project_id, COUNT(*) AS num_employees
    FROM employee_projects
    GROUP BY 1
) AS ep ON p.id = ep.project_id
ORDER BY 2 DESC
LIMIT 5;

Question 61: expected-tests (statistics)
Difficulty: Easy | Tags: statistics | Companies: Facebook

Suppose there are one million users and we want to expose 1000 users per day to a test. The same user can be selected twice for the test.

1. What's the expected value of how long someone will have to wait before they receive the test?
2. What is the likelihood they get selected after the first day? Is that closer to 0 or 1?

Solution 61: expected-tests (statistics)

Question 62: max-quantity (sql)
Difficulty: Easy | Tags: sql | Companies: Amazon

Given the transactions table, write a query to get the max quantity purchased for each distinct product_id, every year. The output should include the year, product_id, and max_quantity for that product, sorted by year and product_id ascending.

Example:
Input: transactions table: id INTEGER, user_id INTEGER, created_at DATETIME, product_id INTEGER, quantity INTEGER
Output: year INTEGER, product_id INTEGER, max_quantity INTEGER

Solution 62: max-quantity (sql)

WITH cte AS (
    SELECT id, created_at, quantity, product_id,
           DENSE_RANK() OVER (PARTITION BY product_id, YEAR(created_at) ORDER BY quantity DESC) AS max_rank
    FROM transactions
)
SELECT YEAR(created_at) AS year, product_id, quantity AS max_quantity
FROM cte
WHERE max_rank = 1
GROUP BY 1,2,3
ORDER BY 1,2

Question 63: good-grades-and-favorite-colors (pandas)
Difficulty: Easy | Tags: pandas | Companies: Facebook

You're given a dataframe of students named students_df:

name            | age | favorite_color | grade
Tim Voss        | 19  | red            | 91
Nicole Johnson  | 20  | yellow         | 95
Elsa Williams   | 21  | green          | 82
John James      | 20  | blue           | 75
Catherine Jones | 23  | green          | 93

Write a function named grades_colors to select only the rows where the student's favorite color is green or red and their grade is above 90.

Example:
Input:
import pandas as pd
students = {"name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
            "age": [19, 20, 21, 20, 23],
            "favorite_color": ["red", "yellow", "green", "blue", "green"],
            "grade": [91, 95, 82, 75, 93]}
students_df = pd.DataFrame(students)

Output: grades_colors(students_df) ->

name            | age | favorite_color | grade
Tim Voss        | 19  | red            | 91
Catherine Jones | 23  | green          | 93

Solution 63: good-grades-and-favorite-colors (pandas)

This question requires us to filter a data frame by two conditions: the student's grade, and their favorite color. Let's start by filtering by grade, since it's a bit simpler than filtering by strings.
We can filter a DataFrame in pandas by setting the data frame equal to itself with the filter applied. In this case:

students_df = students_df[students_df["grade"] > 90]

If we look at the data frame after running that line, every student with a grade of 90 or lower no longer appears. Now we need to filter by favorite color, choosing between the two colors red and green. We can use the isin() method, which compares each value against a list passed to it, in this case ['red', 'green']:

students_df['favorite_color'].isin(['red', 'green'])

Finally, to combine the grade and color conditions into a single row filter, we use the & operator. Our function looks like this:

import pandas as pd

def grades_colors(students_df):
    students_df = students_df[(students_df['grade'] > 90) & students_df['favorite_color'].isin(['red', 'green'])]
    return students_df

Question 64: median-probability (statistics)
Difficulty: Hard | Tags: statistics | Companies: Google

Given three random variables independently and identically distributed from a uniform distribution of 0 to 4, what is the probability that the median is greater than 3?

Solution 64: median-probability (statistics)

If we break down this question, another way to phrase it is: what is the probability that at least two of the variables are larger than 3? Looking at the combinations of events that satisfy the condition, they divide into two exclusive events.

Event A: all three random variables are larger than 3.
Event B: one random variable is smaller than 3 and two are larger than 3.

Given that these two events satisfy the condition median > 3, the question can be rephrased as P(median > 3) = P(A) + P(B).

Let's calculate the probability of event A. The probability that a single random variable is greater than 3 (but less than 4) is 1/4, so:

P(A) = (1/4) · (1/4) · (1/4) = 1/64

The probability of event B is that two values are greater than 3 and one random variable is smaller than 3. We calculate it the same way as event A: the probability of a value being greater than 3 is 1/4 and the probability of it being less than 3 is 3/4. Since the variable that falls below 3 can be any one of the three, we multiply by 3:

P(B) = 3 · (3/4) · (1/4) · (1/4) = 9/64

Therefore, the total probability is P(A) + P(B) = 1/64 + 9/64 = 10/64.
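As a quick numerical cross-check of the 10/64 ≈ 0.156 result (a simulation sketch, not part of the source solution):

import random

# Estimate P(median of three Uniform(0, 4) draws > 3) by simulation.
TRIALS = 200_000
hits = 0
for _ in range(TRIALS):
    draws = sorted(random.uniform(0, 4) for _ in range(3))
    if draws[1] > 3:  # draws[1] is the median of the three values
        hits += 1

print(hits / TRIALS)  # should land close to 10/64 = 0.15625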
Question 65: sms-confirmations (sql)
Difficulty: Easy | Tags: sql

The sms_sends table contains all the messages sent to users and may contain various message types. Confirmation messages (type = 'confirmation') are sent when a user registers an account and require a response. The system may send multiple confirmation messages to a single phone number, but the user may confirm only the latest one sent to them. The ds column represents the date when the SMSs were sent. The confirmers table contains phone numbers that responded to confirmation messages and the dates when users responded. Write a query to calculate the number of responses, grouped by carrier and country, to the SMSs sent by the system on February 28th, 2020.

Example:
Input:
sms_sends table: ds DATETIME, country VARCHAR, carrier VARCHAR, phone_number VARCHAR, type VARCHAR
confirmers table: date DATE, phone_number VARCHAR
Output: carrier VARCHAR, country VARCHAR, unique_numbers INTEGER

Question 66: self-upvotes (sql)
Difficulty: Medium | Tags: sql | Companies: Reddit, Amazon

We're given three tables representing a forum of users and their comments on posts. Write a query to get the percentage of comments, per user, where that user also upvoted their own comment.

Note: a user that doesn't make a comment should have a 0 percent self-upvoted rate.

Example:
Input:
users table: id INTEGER, created_at DATETIME, username VARCHAR
comments table: id INTEGER, created_at DATETIME, post_id INTEGER, user_id INTEGER
comment_votes table: id INTEGER, created_at DATETIME, user_id INTEGER, comment_id INTEGER, is_upvote BOOLEAN
Output: username VARCHAR, total_comments INTEGER, percentage_self_voted FLOAT

Question 67: eta-experiment (a/b testing)
Difficulty: Medium | Tags: a/b testing

Let's say you work at Uber. A PM comes to you considering a new feature where, instead of a direct ETA estimate like 5 minutes, the app would instead display a range such as 3-7 minutes. How would you conduct this experiment, and how would you know if your results were significant?

Solution 67: eta-experiment (a/b testing)

Clarify
Which ETA are we looking at? Is this in the driver's app or the rider's app? Is it the ETA for the estimated waiting time after requesting the ride, or the ETA for the estimated arrival time at the destination after the driver picks up the rider? Let's say it's the ETA in the rider's app: the time between request submission and the driver arriving at the pickup location.

Prerequisites
1. Key metrics: a revenue increase? A decrease in cancellation rate?
2. Variant: fixed ETA vs. range ETA; is the change easy to make?
3. Randomization unit: riders who are about to request a ride; do we have enough randomization units?

Experiment Design
1. Sample size is determined by statistical power, the statistical significance level, the practical significance boundary, and the population standard deviation.
2. The length of the experiment is determined by the sample size and the actual number of riders requesting rides daily. For example, if the sample size for each group is 1000 and the number of riders requesting daily is 100, then we need at least 20 days to run this experiment. We also need to consider ramp-up when launching the experiment, so that the system can handle the change and we can make sure the change rolls out correctly. Another consideration for the length of the experiment is seasonality: generally we need to run for at least one week to eliminate weekday differences, and if the experiment period covers holiday seasons or other special periods, we might need to discard those days or extend the experiment length.

Run the experiment and collect the data.

Results to Decision
1. Sanity checks on randomization and on any other factors that might break the identical setup between the control and treatment groups (e.g. app downtime).
2. Trade-offs between different metrics, the cost to implement, and other opportunity costs; this is why we often set up a practical significance boundary.
3. Compare the p-value with the significance level to check whether the change is statistically significant.
4. Compare the change against the practical significance boundary to check whether it is practically significant.
5. If the change is both statistically and practically significant, we make the decision to launch the change to all riders.

Question 68: ctr-by-age (sql)
Difficulty: Hard | Tags: sql | Companies: Facebook

Given two tables, search_events and users, write a query to find the three age groups (bucketed by decade: ages 0-9 fall into group 0, ages 10-19 into group 1, ..., ages 90-99 into group 9, with the endpoint included) with the highest clickthrough rate in 2021. If two or more groups have the same clickthrough rate, the older group should have priority.

Hint: if a user who clicked a link on 1/1/2021 is 29 years old on that day and has a birthday tomorrow on 2/1/2021, they fall into the [20-29] category. If the same user clicked another link on 2/1/2021, they turned 30 and fall into the [30-39] category.

Example:
Input:
search_events table: search_id INTEGER, query VARCHAR, has_clicked BOOLEAN, user_id INTEGER, search_time DATETIME
users table: id INTEGER, name VARCHAR, birthdate DATETIME
Output: age_group VARCHAR, ctr FLOAT

Solution 68: ctr-by-age (sql)

WITH cte_1 AS (
    SELECT has_clicked,
           TIMESTAMPDIFF(YEAR, birthdate, search_time) DIV 10 AS age_group
    FROM users a
    JOIN search_events b ON a.id = b.user_id
    WHERE YEAR(search_time) = '2021'
),
cte_2 AS (
    SELECT age_group, SUM(has_clicked) / COUNT(1) AS clck_rate
    FROM cte_1
    GROUP BY age_group
)
SELECT age_group, clck_rate AS ctr
FROM cte_2
ORDER BY clck_rate DESC, age_group DESC
LIMIT 3

Question 69: bank-fraud-model (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: DigitalOcean, ETRADE, World, Amazon, BMO, ByteDance, Robinhood, Accenture, Skillz, Urbint, Facebook, Chartboost, Solar, Adobe, Square

Let's say that you work at a bank that wants to build a model to detect fraud on the platform. The bank also wants to implement a text messaging service that will text customers when the model detects a fraudulent transaction, so that the customer can approve or deny the transaction with a text response. How would we build this model?

Solution 69: bank-fraud-model (machine learning)

We can frame this as building a binary classifier on an imbalanced dataset. A few considerations we have to make are:

- How accurate is our data? Is all of the data labeled carefully? How much fraud are we not detecting if customers don't even know they're being defrauded?
- What model works well on an imbalanced dataset? Generally, tree models come to mind.
- How much do we care about interpretability? Building a highly accurate model on our dataset may not be the best approach if we don't learn anything from it. If our customers are being compromised without us even knowing, we run into the issue of building a model that we can't learn from or feature-engineer for in the future.
- What are the costs of misclassification? Looking at precision versus recall helps us understand which metric we care about given the business problem at hand.

We can assume that low recall in a fraud scenario would be a disaster.
With low predictive power on false negatives, fraudulent purchases would go under the radar, with consumers not even knowing they were being defrauded. This could cost the bank thousands of dollars in lost revenue, given that they would have to refund the cost to the consumer. Meanwhile, if precision were low, customers would think their accounts were being defrauded all the time; they would keep getting text messages until they switched to another bank, because transactions would constantly be flagged as fraudulent.

Since the question prompts for a text messaging service, it might make sense to optimize for recall, to minimize risk and avoid costly fraudulent charges. We could also graph the precision-recall curves at different price buckets to understand how the precision and recall thresholds should be set. For example, if recall were lower for purchases under $10 but very high for purchases over $1,000, then we have effectively mitigated risk by making it 100x harder to defraud the bank out of large amounts of money.

Additional considerations:
- Reweighting: algorithms such as LightGBM or SVMs allow us to reweight the data.
- Custom loss function: we can apply different costs to different false positives and false negatives depending on the magnitude of the fraud.
- SMOTE/ADASYN: helps us generate synthetic examples of the minority class.

Question 70: jars-and-coins (probability)
Difficulty: Hard | Tags: probability | Companies: Komodo, Google

A jar holds 1000 coins. Out of all of the coins, 999 are fair and one is double-sided with two heads. Picking a coin at random, you toss it ten times. Given that you see 10 heads, what is the probability that the coin is double-headed, and what is the probability that the next toss of the coin is also a head?

Solution 70: jars-and-coins (probability)
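The worked solution is not reproduced in the source. A sketch of the standard Bayes calculation, added for reference:

\[
P(\text{double-headed} \mid 10\ \text{heads}) =
\frac{\frac{1}{1000}}{\frac{1}{1000} + \frac{999}{1000}\left(\frac{1}{2}\right)^{10}} =
\frac{1024}{2023} \approx 0.506
\]
\[
P(\text{next toss is heads} \mid 10\ \text{heads}) =
\frac{1024}{2023}\cdot 1 + \frac{999}{2023}\cdot\frac{1}{2} = \frac{3047}{4046} \approx 0.753
\]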
song_plays table Column Type id INTEGER created_at DATETIME user_id INTEGER `song_id INTEGER >>> 关注公众号 获取更多精彩内容 215 Solution 71 Solution:lifetime-plays(database design) CREATE TABLE song_plays ( id int, date_listen date, user_id int, song_id int ); CREATE TABLE lifetime_plays ( date_listen date, user_id int, song_id int, count_plays int ); INSERT INTO song_plays (id, date_listen, user_id, song_id) VALUES (1, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 1), (2, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 1), (3, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 1, 2), (4, STR_TO_DATE('2021-02-28', '%Y-%m-%d'), 1, 2), (5, STR_TO_DATE('2021-03-01', '%Y-%m-%d'), 2, 1); INSERT INTO lifetime_plays (date_listen, user_id, song_id, count_plays) SELECT date_listen , user_id , song_id >>> 关注公众号 获取更多精彩内容 216 Solution 71 Solution:lifetime-plays(database design) , COUNT(*) AS count FROM song_plays WHERE date_listen = STR_TO_DATE('2021-03-01', '%Y-%m%d') GROUP BY 1, 2, 3 SELECT * FROM lifetime_plays; >>> 关注公众号 获取更多精彩内容 217 Quesetion 72 changing-composer(product metrics) 难度标题 【Easy】 题目标签 【product metrics】 公司标签 【Facebook】 Let’s say that Facebook would like to change the user interface of the composer feature (the posting box) to be more like Instagram. Instead of a box, Facebook would add a“+” button at the bottom of the page. How would you test if this is a good idea? >>> 关注公众号 获取更多精彩内容 218 Solution 72 Solution:changing-composer(product metrics) Let’s make some initial assumptions. We can guess that we want to try a new user interface to improve certain key metrics that Instagram does better than Facebook in. Noticeably, given that Instagram is a photo-sharing app, we can assume that Facebook wants to improve: • Posts per active user • Photo posts per active user Additionally, we have to measure the trade-offs between the existing UI of the Facebook composer versus the Instagram UI. While the current composer feature on Facebook may make it easier to share status updates and geo-location or sell items, the Instagram composer may make the user more inclined to share photo posts. Therefore, given this hypothesis, one way to initially understand if this test is a good idea is to measure the effects of an increase in the proportion of photo posts to non-photo posts on Facebook and how that affects general engagement metrics. >>> 关注公众号 获取更多精彩内容 219 Solution 72 Solution:changing-composer(product metrics) For example, if we compare the population of users that have a percentage of photo posts from 10% of the total versus 20% of the total posts, does this increase our active user percentage at all? Would it increase monthly retention rates? Another thing we have to be aware of is the drop-off rate for the Facebook composer versus the Instagram composer. The drop-off rate would directly affect the general amount of posts that each user makes. We can look at the drop-off rate between the two composers by different segments as well such as geographic location, device type, and demographic markets. If we want to run an AB test to actually test the differences instead of just analyzing our existing segments, we would have to evaluate these same metrics but make sure not to compare by specific segments unless they are a large sample size of the population. Doing it by market/segment may leave it so that you get a Simpson’s paradox scenario where for most markets you get a certain result but in aggregate the result is different. 
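To make that Simpson's paradox risk concrete, here is a minimal Python sketch with made-up numbers (the segment names and counts are purely illustrative, not real Facebook data): the new composer variant wins within every segment, yet loses in aggregate because its traffic is concentrated in the lower-converting segment.

segments = {
    # segment: (control_posters, control_users, variant_posters, variant_users)
    "low_intent":  (10, 100, 150, 1000),
    "high_intent": (300, 1000, 40, 100),
}

totals = {"control": [0, 0], "variant": [0, 0]}
for name, (c_conv, c_n, v_conv, v_n) in segments.items():
    # per-segment posting rates
    print(f"{name}: control {c_conv / c_n:.0%} vs variant {v_conv / v_n:.0%}")
    totals["control"][0] += c_conv
    totals["control"][1] += c_n
    totals["variant"][0] += v_conv
    totals["variant"][1] += v_n

for arm, (conv, n) in totals.items():
    # aggregate posting rates
    print(f"overall {arm}: {conv / n:.0%}")

Within both segments the variant looks better (15% vs 10%, and 40% vs 30%), yet overall it sits at roughly 17% against the control's 28%, which is why the solution above warns against reading segment-level results unless each segment has a sufficiently large, comparable sample.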
>>> 关注公众号 获取更多精彩内容 220 Solution 72 Solution:changing-composer(product metrics) In running the A/B test in addition, it’s important to add in the specific rigidity that the test must be run. For example, sample size and distribution are important to need to make sure we have a sufficiently large enough sample size in both control and test to get a statistically significant result. We should also randomly assign folks to either test/control as well as remember to reach significance and single variable change on the composer element. >>> 关注公众号 获取更多精彩内容 221 Quesetion 73 equivalent-index(algorithms) 难度标题 【Medium】 题目标签 【algorithms】 公司标签 【Apple】 Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied return -1. Example 1 : Input: nums = [1, 7, 3, 5, 6] Output: equivalent_index(nums) -> 2 Example 2 : Input: nums = [1,3,5] Output: equivalent_index(nums) -> -1 >>> 关注公众号 获取更多精彩内容 222 Solution 73 Solution:equivalent-index(algorithms) Our goal is to iterate through the list and quickly compute the sum of both sides of the index in the iteration. We can do this by first getting the sum of the entire list. This allows us to then subtract values from one side to get the value for the other side. If the values are equal, then we can return the index. Given this approach, we can then loop through our list and apply this formula to each value until we find the index. If it doesn’t exist then we’ll return -1 at the end. def equivalent_index(nums): total = sum(nums) leftsum = 0 for index, x in enumerate(nums): # the formula for computing the right side rightsum = total - leftsum - x leftsum += x if leftsum == rightsum: return index return -1 >>> 关注公众号 获取更多精彩内容 223 Quesetion 74 compute-variance(python) 难度标题 【Easy】 题目标签 【python】 公司标签 【Amazon】 Write a function that outputs the (sample) variance given a list of integers. Note: round the result to 2 decimal places. Example : Input: test_list = [6, 7, 3, 9, 10, 15] Output: get_variance(test_list) -> 13.89 >>> 关注公众号 获取更多精彩内容 224 Solution 74 Solution:compute-variance(python) >>> 关注公众号 获取更多精彩内容 225 Solution 74 Solution:compute-variance(python) >>> 关注公众号 获取更多精彩内容 226 Quesetion 75 stranded-miner(probability) 难度标题 【Hard】 题目标签 【probability】 公司标签 【Facebook】 A miner is stranded and there are two paths he can take. Path AA loops back to itself and takes him 5 days to walk it. Path BB brings him to a junction immediately (0 days). The junction at the end of path BB has two paths say Path BABA and Path BBBB. Path BABA brings him back to his original starting point and takes him 2 days to walk. Path BBBB brings him to safety and takes him 1 day to walk. Each path has an equal probability of being chosen and once a wrong path is chosen, he gets disoriented and cannot remember which path he went through, and the probabilities remain the same. What is the expected value of the amount of days he will spend before he exits the mine? >>> 关注公众号 获取更多精彩内容 227 Solution 75 Solution:stranded-miner(probability) First, some terminology. We will call a particular sequence through the mine a circuit and a decision to go down one path a walk. This terminology is borrowed from graph theory. We will denote the number of days that the miner spends stranded as D. Note that D is path-dependent; the sequence of paths the miner takes matters—for example, the circuit. 
A→A→B→BB takes 11 days to complete, while the circuit B→BA→B→BB takes three days to complete, even though the miner reached BB in the same number of "walks." Because of this, calculating E[D] directly would require a formula for the probability of every possible circuit. Not impossible, but not something you're going to be able to do on the spot in an interview. Because of this difficulty, we won't focus on D. Instead, we will focus on the number of "walks" the miner makes, that is, the number of times they go down a path. We will denote this as W.
Note that since W measures trials until a success, it lends itself to a geometric distribution. At the start of any circuit, the probability of ending up at path BB is P(BB) = P(BB | B) · P(B) = 0.5 × 0.5 = 0.25. Thus, E[W] = 1/0.25 = 4.
Now, let's think about the number of days per walk while he's trapped. Since, in the long run, half, a quarter, and a quarter of all walks take paths A, BA, and BB respectively, the expected number of days per walk is E[D/W] = 5·(1/2) + 2·(1/4) + 1·(1/4) = 3.25.
Now note that D = W·(D/W) (since W is never zero), and by Wald's identity the expected total is the expected number of walks times the expected days per walk. Thus, E[D] = E[W]·E[D/W] = 4 · 3.25 = 13.

Question 76: first-names-only (pandas). Difficulty: Medium. Tags: pandas. Companies: Facebook, ICF.
You're given a dataframe containing a list of user IDs and their full names (e.g. 'James Emerson'). Transform this dataframe into a dataframe that contains the user ids and only the first name of each user.
Example:
Input:                          Output:
user_id  name                   user_id  name
1034     James Emerson          1034     James
9430     Fiona Woodward         9430     Fiona
7281     Alvin Gross            7281     Alvin
5264     Deborah Handler        5264     Deborah
8995     Leah Xue               8995     Leah

Solution 76: first-names-only (pandas)
Simply split the name on whitespace and keep the first token.
def first_name_only(users_df):
    users_df['name'] = users_df['name'].str.split(' ').str[0]
    return users_df

Question 77: netflix-retention (product metrics). Difficulty: Hard. Tags: product metrics. Companies: Netflix.
Let's say at Netflix we offer a subscription where customers can enroll for a 30-day free trial. After 30 days, customers will be automatically charged based on the package selected. Let's say we want to measure the success of acquiring new users through the free trial. How can we measure acquisition success, and what metrics can we use to measure the success of the free trial?

Solution 77: netflix-retention (product metrics)
First, a pro tip: let's go back to a general strategy for product metrics questions. One way to frame this problem is to think about controllable inputs, external drivers, and then the observable output. It is critical to spend most of the interview creating good/bad benchmarks, setting numeric goals, explaining actual performance versus expectation, and evaluating inputs, rather than getting bogged down in KPIs we can't really influence.
With that in mind, let's start by stating the main goals of the question. What is Netflix's business model?
Main goals:
1. Acquiring new users.
2. Decreasing churn and increasing retention.
Let's think about acquisition before we dive into measuring the success of the free trial. Starting out, what questions do we have about acquisition at a larger scale?
1. What's the size of the market? This would be the top of the funnel in terms of acquisition.
Let’s say there are seven billion people on the planet that comprises of two billion households. If we assume a 5% penetration of high-speed broadband internet, the potential market size is 100 million households. 2. Size and quality of the potential leads. In this case, our leads are the free-trial users. In each segment, we can break down the number and quality of leads by different factors such as geography, demographics, devicetype (TV or Mobile), acquisition funnel, etc… Now, let’s focus on acquisition output metrics. What metrics can we measure that will define success on a top-level viewpoint for acquisition? • Conversion rate percentage: # of trial sign-ups / # of leads in the pool, by sub-segments. This is the number of leads that we convert into free trial signups. Leads are defined by customers that click on ads, sign up their email, or any other top of the funnel activity before the free trial sign-up. • Cost per free trial acquisition: This is the cost for signing up each person to a free trial. This can be calculated by the total marketing spend on advertising the free trial divided by the total number of free trial users. >>> 关注公众号 获取更多精彩内容 234 Solution 77 Solution:netflix-retention(product metrics) • Daily Conversion Rate: This is the number of daily users who convert and start to pay divided by the number of users who enrolled in the 30-day free trial, thirty days ago. One problem with this metric is it’s hard to get information about users who enrolled for the free trial given the 30-day lag from sign-up to conversion. For example, it would take 30 days to get the conversion data from all the users that signed up for the free trial today. Going deeper into the daily conversion rate metric, one way we can get around the 30-day lag is by looking at cohort performance over time. Everyone who joined in the month of January (let’s say between day 1 to day 30) would become cohort-1. Everyone who joined in the month of February would be cohort-2, etc… Then you see at the end of the trial: • What % of free users paid. • What % of free users till pay for month two, month three, month four, etc…until you have metrics for month six to one year. Then we can look at a second cohort for February sign-ups. Once you have this, then you compare the 30-day retention for cohort-2 vs cohort-1, then 60-day retention for cohort-2 vs cohort-1, and so on. >>> 关注公众号 获取更多精彩内容 235 Solution 77 Solution:netflix-retention(product metrics) This tells you if the quality of acquisition is effective enough, and actually encourages long-term engagement. Now if we jump into a few engagement metrics: • Percentage of daily users who consume at least an hour of Netflix content. • We can break this down by the percentage of users who are also consuming content at least 1min, 15mins, 1hour, 3hours, 6+hours in a week. • Average weekly session duration by user • We can cut this metric by the behavioral segment of the users. There are different member profiles such as college students, movie fanatics, suburban families, romantic comedy enthusiasts, etc. • Within each role, there’s the job of providing recommendation to the acquisition team on which parts of the business is having the highest growth. More segmentations then exist past demographics of looking at usage preferences, time of day, content verticals, to determine which combination will increase the output of average weekly session duration. 
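As a rough illustration of the cohort view described above, here is a minimal pandas sketch; the subscriptions dataframe and its columns (trial_start, first_paid_date) are assumptions for illustration, not Netflix's actual schema. It buckets users into monthly sign-up cohorts and computes each cohort's trial-to-paid conversion rate, and the same groupby pattern extends to 30-, 60-, and 90-day retention comparisons between cohorts.

import pandas as pd

# Hypothetical trial sign-ups; first_paid_date is NaT when the user never converted.
subs = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "trial_start": pd.to_datetime(["2021-01-03", "2021-01-20", "2021-02-02",
                                   "2021-02-15", "2021-02-27"]),
    "first_paid_date": pd.to_datetime(["2021-02-02", None, "2021-03-04",
                                       "2021-03-17", None]),
})

# Bucket users into monthly sign-up cohorts.
subs["cohort"] = subs["trial_start"].dt.to_period("M")

# Trial-to-paid conversion rate per cohort.
conversion_by_cohort = (
    subs.assign(converted=subs["first_paid_date"].notna())
        .groupby("cohort")["converted"]
        .mean()
)
print(conversion_by_cohort)

Comparing these per-cohort rates month over month is what lets us judge whether acquisition quality is holding up despite the 30-day lag on any single user's conversion.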
Question 78: estimated-rounds (probability). Difficulty: Hard. Tags: probability. Companies: Google.
Let's say that there are six people trying to divide into two equal teams. Because they want to pick random teams, on each round every person shows a hand in either a face-up or face-down position. If there are three of each position, then they'll split into teams. What's the expected number of rounds that everyone will have to pick a hand side before they split into teams?

Solution 78: estimated-rounds (probability)
Since "they want to pick random teams" and there is no additional information given, we can assume there is a 50/50 chance that each person puts a hand face down or face up. Thus each person's hand follows a Bernoulli distribution with probability of success p = 0.5. We are looking for the total number of face-up hands to be exactly 3, as that would imply the rest of the group has their hands face down. Let F denote the number of hands that are up. F is the sum of Bernoulli random variables and as such follows a binomial distribution, meaning the probability of F = 3 is:
P(F = 3) = C(6, 3) · (0.5)^3 · (0.5)^3 = 20/64 = 0.3125
Let R be the number of rounds before having exactly 3 hands face up. Clearly, R follows a geometric distribution because it denotes the number of trials before one success. For a geometric random variable G with success probability p, E[G] = 1/p. Thus the expected number of rounds until teams form is:
E[R] = 1/0.3125 = 3.2

Question 79: power-size (statistics). Difficulty: Easy. Tags: statistics. Companies: Trivago, Apple, Qualcomm, Google.
Let's say you're analyzing an A/B test with a test and a control group.
1. How do you calculate the sample size necessary for an accurate measurement?
2. Let's say that the sample size is similar and sufficient between the two groups. In order to measure very small differences between the two, should the power get bigger or smaller?

Solution 79: power-size (statistics)
1) The necessary sample size (n) depends on the following factors:
a. Alpha (default is 0.05).
b. The test's power (default is 80%, corresponding to beta = 0.20).
c. The expected effect size (d) between the test and the control populations, i.e. d = mu1 - mu2.
d. The population variance of the control population, assuming that the test population has the same variance.
2) a. In order to measure very small differences between the two groups, we want to reduce the false negative (FN) rate.
b. By default, tests are more sensitive to FP than to FN: alpha = 0.05 while beta = 0.2. This convention implies a four-to-one trade-off between β-risk and α-risk.
Given 2.a and 2.b, beta should get smaller, which implies that power (1 - beta) should get bigger.

Question 80: ad-raters-part-2 (probability). Difficulty: Hard. Tags: probability. Companies: Facebook.
Let's say we use people to rate ads. There are two types of raters, random and independent from our point of view:
• 80% of raters are careful: they rate an ad as good (60% chance) or bad (40% chance).
• 20% of raters are lazy: they rate every ad as good (100% chance).
1. Suppose a rater rates just three ads, and rates them all as good. What's the probability the rater was lazy?
2. Suppose a rater sees N ads and rates all of them as good. What happens to the probability the rater was lazy as N tends to infinity?
3. Suppose we want to exclude lazy raters. Can you come up with a rule for classifying raters as careful or lazy?
>>> 关注公众号 获取更多精彩内容 241 Solution 80 Solution:ad-raters-part-2(probability) This should be intuitive, the more times we observe a rater rating every ad they see as good, we expect that it’s more likely that the rater is lazy. >>> 关注公众号 获取更多精彩内容 242 Solution 80 Solution:ad-raters-part-2(probability) >>> 关注公众号 获取更多精彩内容 243 Quesetion 81 multi-modal-sample(python) 难度标题 【Easy】 题目标签 【python】 公司标签 【Stitch】,【Google】 Write a function for sampling from a multimodal distribution. Inputs are keys (i.e. green, red, blue), weights (i.e. 2, 3, 5.5), and the number of samples drawn from the distribution. The output should return the keys of the samples. Example : Input: keys = ['green', 'red', 'blue'] weights = [1, 10, 2] n=5 sample_multimodal(keys, weights, n) Output: ['blue', 'red', 'red', 'green', 'red'] >>> 关注公众号 获取更多精彩内容 244 Quesetion 82 converted-sessions(probability) 难度标题 【Medium】 公司标签 / 题目标签 【probability】 Let’s say there are two user sessions that both convert with probability 0.5. 1. What is the probability that they both converted? 2. Given that there are NN sessions and they convert with probability qq, what is the expected number of converted sessions? >>> 关注公众号 获取更多精彩内容 245 Solution 82 Solution:converted-sessions(probability) >>> 关注公众号 获取更多精彩内容 246 Solution 82 Solution:converted-sessions(probability) >>> 关注公众号 获取更多精彩内容 247 Quesetion 83 complete-addresses(pandas) 难度标题 【Medium】 题目标签 【pandas】 公司标签 【Nextdoor】,【Google】 You’re given two dataframes. One contains information about addresses and the other contains relationships between various cities and states: Example : df_addresses df_cities address city state 4860 Sunset Boulevard, San Francisco, 94105 Salt Lake City Utah 3055 Paradise Lane, Salt Lake City, 84103 Kansas City Missouri 682 Main Street, Detroit, 48204 Detroit Michigan 9001 Cascade Road, Kansas City, 64102 Tampa Florida 5853 Leon Street, Tampa, 33605 San Francisco California Write a function complete_address to create a single dataframe with complete addresses in the format of street, city, state, zip code. Input: import pandas as pd addresses = {"address": ["4860 Sunset Boulevard, San Francisco, 94105", "3055 Paradise Lane, Salt Lake City, 84103", "682 Main Street, Detroit, 48204", "9001 Cascade Road, Kansas City, 64102", "5853 Leon Street, Tampa, 33605"]} cities = {"city": ["Salt Lake City", "Kansas City", "Detroit", "Tampa", "San Francisco"], "state": ["Utah", "Missouri", "Michigan", "Florida", "California"]} df_addresses = pd.DataFrame(addresses) df_cities = pd.DataFrame(cities) address 关注公众号 获取更多精彩内容 >>> Output: def complete_address(df_addresses,df_cities) -> 4860 Sunset Boulevard, San Francisco, California, 94105 3055 Paradise Lane, Salt Lake City, Utah, 84103 682 Main Street, Detroit, Michigan, 48204 9001 Cascade Road, Kansas City, Missouri, 64102 5853 Leon Street, Tampa, Florida, 33605 248 Solution 83 Solution: complete-addresses (pandas) def complete_address(df_addresses, df_cities): df_addresses[['street', 'city', 'zipcode']] = df_addresses['address'].str.split(', ', expand=True) df_addresses = df_addresses.drop(['address'], axis=1) df_addresses = df_addresses.merge(df_cities, on="city") df_addresses['address'] = df_addresses[['street', 'city', 'state', 'zipcode']].agg(', '.join, axis=1) df_addresses = df_addresses.drop(['street', 'city', 'state', 'zipcode'], axis=1) return df_addresses >>> 关注公众号 获取更多精彩内容 249 Quesetion 84 same-side-probability(probability) 难度标题 【Medium】 题目标签 【probability】 公司标签 【Microsoft】,【LinkedIn】 Suppose we have two coins. 
One is fair and the other is biased, with a 3/4 probability of coming up heads. Let's say we select a coin at random and flip it two times. What is the probability that both flips result in the same side?

Solution 84: same-side-probability (probability)
Let's tackle this by first splitting up the probability of getting the same side twice for the biased coin, and then computing the same thing for the fair coin.
First, the biased coin. If we flip the biased coin we have a 3/4 chance of getting heads, so the probability of heads twice is 3/4 * 3/4 and the probability of tails twice is 1/4 * 1/4. Now, what's the probability of getting either twice heads or twice tails? Because these two outcomes are mutually exclusive (an OR), the probabilities add:
(3/4) * (3/4) + (1/4) * (1/4) = 10/16 = 0.625
Now the fair coin. We can apply the same formula. Since heads and tails are equally likely:
(1/2) * (1/2) + (1/2) * (1/2) = 1/2
Now let's compute the total probability given a random selection of either coin. Since there are only two coins and we are equally likely to pick either, the probability of picking each is 1/2. The total probability is the sum of the two conditional probabilities, each weighted by the probability of choosing that coin:
P(same side) = (1/2) * (10/16) + (1/2) * (1/2) = 5/16 + 4/16 = 9/16 = 0.5625

Question 85: string-shift (algorithms). Difficulty: Easy. Tags: algorithms. Companies: Google, PayPal.
Given two strings A and B, write a function can_shift to return whether or not A can be shifted some number of places to get B.
Example:
Input:
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False

Solution 85: string-shift (algorithms)
This problem is relatively simple once we figure out the underlying check for string shifts between strings A and B. First, we have to set baseline conditions for string shifting: A and B must be the same length and consist of the same letters. We can check the former with a conditional on whether the length of A equals the length of B.
Now, let's think about the string shift itself. If B is merely a reordering of A, the condition fails. But we can check the ordering by repeating B and then seeing whether A appears in it. For example:
A = 'abcde'
B * 2 = 'cdeabcdeab'
Now all we have to do is check whether A exists in B * 2, which for the example above is true.
def can_shift(A, B):
    return (
        A and B
        and len(A) == len(B)
        and A in B * 2
    )

Question 86: choosing-k (machine learning). Difficulty: Easy. Tags: machine learning. Companies: Facebook, Predictive, Qualcomm, Intel.
How would you choose the k value when doing k-means clustering?

Solution 86: choosing-k (machine learning)
Elbow method - build k-means models using increasing values of k, recording each model's inertia (aka WCSS, the within-cluster sum of squares: the sum of samples' squared distances from their cluster centers) or distortion (the average of samples' squared distances from cluster centers).
Plot k against distortion/ inertia and choose k by taking the value where graph creates an “elbow”: the point after which the distortion/inertia start decreasing in a linear fashion or nearly parallel to X axis. Silhouette method - calculate mean silhouette coefficient of samples, a measure of similarity among points within clusters–for a sample i, distances from points in neighboring clusters, b, relative to points within the same cluster, a: (b(i)-a(i))/max(a(i),b(i))–looking for a value closest to 1. To verify choice, for each k in consideration, plot each clusters’ samples’ silhouette score (marking average across clusters), and identify the k with few or no clusters with below average silhouette score, and without wide fluctuations in the size of the silhouette plots. >>> 关注公众号 获取更多精彩内容 256 Quesetion 87 wau-vs-open-rates(product metrics) 难度标题 【Medium】 题目标签 【product metrics】 公司标签 【Pinterest】,【Facebook】 Let’s say that you’re a data scientist on the engagement team. A product manager comes up to you and says that the weekly active users metric is up 5% but email notification open rates are down 2%. >>> 关注公众号 获取更多精彩内容 257 Solution 87 Solution: wau-vs-open-rates(product metrics) Initially reading this question, we should assume it’s first a debugging question, and then possibly a dive into trade-offs. WAU (weekly active users) and email open rates are most of the time, directly correlated, especially if emails are used as a call to action to bring users back onto the website. An email opened and then clicked would lead to an active user, but it doesn’t necessarily mean that they have correlations or be the only reason causing changes. Let’s bring in some context first or state assumptions. Specifically, around the two variables at play here. Weekly active users can be defined as the number of users active at least once in the past 7 days. Active user can be defined as a user opening the app or website while logged in on mobile, web, desktop, etc.. >>> 关注公众号 获取更多精彩内容 258 Solution 87 Solution: wau-vs-open-rates(product metrics) Email open rate is defined as the number of email opens divided by the number of emails sent. We can assume that both the email open rate and WAU are being measured compared to their historical past. Such as if email open rates were always measured within 24 hours of sending the email, then the email open rate is not down now because it’s being measured within 12 hours instead. One is that we take a closer look at the metric of email open rates. Given it is a fraction, we can understand that a 2% decrease in open rate is much smaller in scale when we imagine it as going from a 30% open rate to a 29.4% open rate. In which case we can then look at segmentations for factors such as bugs or seasonal trends that could be causing the problem: • Bugs in the tracking. One time or progressive. Possibly seasonal. • Platform: Look into if it was an abnormal event in one of the platforms (mobile, desktop, ios, android) • Countries or demographics. If certain countries or demographics are using it more from new trends. >>> 关注公众号 获取更多精彩内容 259 Solution 87 Solution: wau-vs-open-rates(product metrics) Now after looking at segmentations, let’s try to dive into hypothesis of possible trade-offs. We also have to remember that WAU is many times directly influenced by the number of new users coming onto a site. 
For example, if after two weeks, the user retention is only 20% of the original number that is active on Pinterest, and after one month it is 10%, then we might find that at any given time, WAU could be primarily made up of new users that had just joined Pinterest that week. Given this assumption, we can then say that if there was a huge influx of new users this week, that could be pushing the WAU number up while also pushing the email open rate down as we see more users coming onto the website organically or through ads, without going through the usual email notifications that long-term users would be attributed to. >>> 关注公众号 获取更多精彩内容 260 Solution 87 Solution: wau-vs-open-rates(product metrics) Another hypothesis could be that the increase in WAUs triggers many user-related email notifications and as a result pushes down the email open rate by increasing the denominator. We can also then verify this hypothesis by breaking down the email open rate by different types of email notifications. Lastly, we can assume that to generate an increase in WAU, marketing could have sent a very large amount of emails that pushed up the overall WAU number and created email fatigue which in turn lowered the email open rates. To verify this, we could look at different kinds of email metrics such as unsubscribe rate, and see if there are different email open and unsubscribe rates by cohorts of the number of emails received total. >>> 关注公众号 获取更多精彩内容 261 Quesetion 88 group-success(product metrics) 难度标题 【Medium】 公司标签 / 题目标签 【product metrics】 How would you measure the success of Facebook Groups? >>> 关注公众号 获取更多精彩内容 262 Solution 88 Solution: group-success(product metrics) Success Metrics The goal here is to evaluate and track metrics that relate to our three main areas of focus; activation, engagement, and retention. Activation is how users discover Facebook Groups. Engagement is tracking the health of user activity on Facebook Groups. Lastly, retention helps us measure the long-term effect that Facebook Groups have on the user to see if the user will come back over time. 1. % o f u s e r s t h a t j o i n a g r o u p a f t e r v i e w i n g t h e group (public group) [activation]. This indicates how effective a page is at showing value to the user (through a large number of recent posts, or # of new members) would reflect an active community. 2. % of users that engage (post, comment, react) in the group within one day of joining [engagement]. >>> 关注公众号 获取更多精彩内容 263 Solution 88 Solution: group-success(product metrics) 3. Average engagement score calculated by some combination of comments + likes per post by a new or returning user in the group [engagement]. Indicates how supportive and welcoming existing group members are to new and old users, but this may depend on the type of content that is posted. 4. % of users that friend or follow another user of the group within one week of joining [engagement]. This metric demonstrates how close users are with each other, and how friendly they are, but this may come across as weird behavior and not performed by many users. 5. % of users that are returning members compared to new users [engagement]. 6. % of 30 daily active users [retention]. General retention metrics to see how community brings repeat value to users. 7. % of users that invite a friend to the group [referral]. Indicates if users will promote a group, but there are other reasons a user may invite friends and this may not be used by a lot of users. 
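As one example of how a metric like #2 above (the share of new members who post, comment, or react within one day of joining) might be computed, here is a small pandas sketch; the group_joins and group_events tables and their columns are hypothetical and only meant to show the shape of the calculation.

import pandas as pd

# Hypothetical membership and activity data for illustration only.
group_joins = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group_id": [10, 10, 11],
    "joined_at": pd.to_datetime(["2021-05-01", "2021-05-02", "2021-05-03"]),
})
group_events = pd.DataFrame({
    "user_id": [1, 3, 3],
    "group_id": [10, 11, 11],
    "action": ["post", "comment", "react"],
    "created_at": pd.to_datetime(["2021-05-01 12:00", "2021-05-10", "2021-05-03 08:00"]),
})

# Attach each member's activity in the group they joined, then flag actions
# that happened within one day of joining.
merged = group_joins.merge(group_events, on=["user_id", "group_id"], how="left")
delta = merged["created_at"] - merged["joined_at"]
merged["within_one_day"] = delta.between(pd.Timedelta(0), pd.Timedelta(days=1))

# Share of new members with at least one engagement in their first day.
engaged = merged.groupby(["user_id", "group_id"])["within_one_day"].any()
print(engaged.mean())

The other listed metrics (day-one friending, returning-member share, 30-day retention, invites) follow the same join-then-aggregate pattern against the relevant event types.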
>>> 关注公众号 获取更多精彩内容 264 Quesetion 89 prime-to-n(python) 难度标题 【Medium】 题目标签 【python】 公司标签 【Tiger】,【Zenefits】,【Amazon】 Given an integer N, write a Python function that returns all of the prime numbers up to N. >>> 关注公众号 获取更多精彩内容 265 Solution 89 Solution: prime-to-n(python) from math import ceil def prime_numbers(N): primes = [] if N > 1: primes.append(2) if N > 2: primes.append(3) >>> 关注公众号 获取更多精彩内容 266 Solution 89 Solution: prime-to-n(python) from math import ceil if N > 4: for i in range(2,N+1): is_prime = True # all primes except 2 and 3 are of the form 6n +/- 1 if i % 6 == 1 or i % 6 == 5: # this number is odd so we can start at 3 and check only odd numbers for j in range(3,ceil(pow(i,1/2))+1,2): if i % j == 0: is_prime = False break else: is_prime = False if is_prime: primes.append(i) return primes >>> 关注公众号 获取更多精彩内容 267 Quesetion 90 subscription-retention(sql) 难度标题 【Hard】 题目标签 【sql】 公司标签 【Stripe】,【Houzz】,【Natera】,【Amazon】, 【Niantic】,【Intuit】 Given a table of subscriptions, write a query to get the retention rate of each monthly cohort for each plan_id for the three months after sign-up. Order your output by start_month, plan_id, then num_month. If an end_date is in the same month as start_date we say the subscription was not retained in the first month. If the end_date occurs in the month after the month of start_date, the subscription was not retained in the second month. And so on for the third. The end_date field is NULL if the user has not canceled. Example: Input: subscriptions table Output: Column Type Column Type start_month DATETIME user_id INTEGER num_month INTEGER start_date DATETIME plan_id VARCHAR end_date DATETIME retained FLOAT plan_id VARCHAR >>> 关注公众号 获取更多精彩内容 268 Solution 90 Solution: subscription-retention(sql) WITH cte_1 AS ( SELECT *, DATE_SUB(start_date, I N T E R VA L DAYOFMONTH(start_date) - 1 DAY) AS 'start_month' FROM subscriptions ORDER BY plan_id, start_date ), >>> 关注公众号 获取更多精彩内容 269 Solution 90 Solution: subscription-retention(sql) cte_2 AS ( SELECT x.column_0, cte_1.* FROM cte_1 CROSS JOIN ( VALUES ROW (1), ROW (2), ROW (3)) AS x ), >>> 关注公众号 获取更多精彩内容 270 Solution 90 Solution: subscription-retention(sql) cte_3 AS ( SELECT column_0, user_id, start_date, end_date, plan_id, date(start_month) AS start_month, IF(IFNULL(PERIOD_DIFF(DATE_ FORMAT(end_date, '%Y%m'), DATE_FORMAT(DATE_ADD(start_date, INTERVAL column_0 - 1 MONTH), '%Y%m')), >>> 关注公众号 获取更多精彩内容 271 Solution 90 Solution: subscription-retention(sql) 1) > 0, 1, 0) AS x FROM cte_2 ) >>> 关注公众号 获取更多精彩内容 272 Solution 90 Solution: subscription-retention(sql) SELECT start_month, column_0 AS num_month, plan_id, cast((sum(x) / count(x)) AS DECIMAL (3, 2)) AS retained FROM cte_3 GROUP BY start_month, column_0, plan_id ORDER BY start_month, plan_id, num_month >>> 关注公众号 获取更多精彩内容 273 Quesetion 91 like-tracker(sql) 难度标题 【Easy】 题目标签 【sql】 公司标签 【Facebook】 The events table tracks every time a user performs a certain action (like, post_enter, etc.) on a platform. Write a query to determine how many different users gave a like on June 6, 2020. 
Example: Input: events table Column Type user_id INTEGER created_at DATETIME action VARCHAR platform VARCHAR Output: Column Type num_users_gave_like INTEGER >>> 关注公众号 获取更多精彩内容 274 Solution 91 Solution: like-tracker(sql) SELECT COUNT(DISTINCT user_id) AS num_users_gave_like FROM events WHERE DATE(created_at) = DATE("2020-06-06") AND action = "like" >>> 关注公众号 获取更多精彩内容 275 Quesetion 92 duplicate-rows(sql) 难度标题 【Medium】 题目标签 【sql】 公司标签 【Amazon】 Given a users table, write a query to return only its duplicate rows. Example: Input: users table Column Type id INTEGER name VARCHAR created_at DATETIME >>> 关注公众号 获取更多精彩内容 276 92 Solution Solution: duplicate-rows(sql) SELECT id, name, created_at FROM ( SELECT *, row_number() OVER (PARTITION BY id ORDER BY created_ at ASC) AS ranking FROM users) AS u WHERE ranking > 1 >>> 关注公众号 获取更多精彩内容 277 Quesetion 93 notification-type-conversion(sql) 难度标题 【Hard】 题目标签 【sql】 公司标签 【Facebook】 We’re given two tables, a table of notification deliveries and a table of users with created and purchase conversion dates. If the user hasn’t purchased then the conversion_date column is NULL. Write a query to get the conversion rate for each notification. A user may convert only once. Example: notification_deliveries table users table Column Type Column Type notification VARCHAR id INTEGER user_id INTEGER created_at DATETIME created_at DATETIME conversion_date DATETIME Output: Column Type notification VARCHAR conversion_rate FLOAT >>> 关注公众号 获取更多精彩内容 278 Solution 93 Solution: duplicate-rows(sql) WITH time_differences AS ( SELECT a.*,b.conversion_date, TIMESTAMPDIFF(second,a. created_at,conversion_date) delta_t FROM notification_deliveries a JOIN users b ON a.user_id = b.id ) find_notification_that_converted AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id, delta_t>=0 ORDER BY delta_t) AS row_no FROM time_differences ), count_notifications AS ( >>> 关注公众号 获取更多精彩内容 279 Solution 93 Solution: duplicate-rows(sql) SELECT x.notification,total_notifications, IFNULL(notifications_that_converted,0) AS notifications_ that_converted FROM ( SELECT notification, COUNT(*) total_notifications FROM find_notification_that_converted GROUP BY notification )x LEFT JOIN ( >>> 关注公众号 获取更多精彩内容 280 Solution 93 Solution: duplicate-rows(sql) SELECT notification, COUNT(*) AS notifications_that_ converted FROM find_notification_that_converted WHERE conversion_date IS NOT NULL AND row_no = 1 AND delta_t>=0 GROUP BY notification ) y ON x.notification = y.notification ) SELECT notification, IF( notifications_that_converted = 0, 0.0000, notifications_that_converted/total_notifications) AS conversion_rate FROM count_notifications >>> 关注公众号 获取更多精彩内容 281 Quesetion 94 adding-c-to-sample(statistics) 难度标题 【Easy】 题目标签 【statistics】 公司标签 【Amazon】 Let’s say you are working as an analyst. There was an error in collecting data and all entries are off by some number cc. If you were to add cc to all the entries, what would happen to the sample statistics (mean, median, mode, range, variance) of the field? >>> 关注公众号 获取更多精彩内容 282 Solution 94 Solution: adding-c-to-sample(statistics) Adding a constant c to each (of number N) data points consequently changes the descriptive statistics as such: Mean: increases by c because if m is the current mean, the current sum of squares is mN and if we add c to each of those N points, then the new sum of squares is mN + cN = (m+c)N dividing by N to calculate the new mean = (m+c)*N/N = (m+c) the current mean has increased to the new mean by c. 
Median: increases by c because the ordered list of data points is maintained even when every number is increased by c, therefore the new median is still relatively the 50th% of the data set, it will just be c higher than the current median. Mode: increases by c because each data point is increased by the same amount which means each unique value’s relative frequency is maintained and the new mode is only chigher than the current mode. >>> 关注公众号 获取更多精彩内容 283 Solution 94 Solution: adding-c-to-sample(statistics) Range: remains the same because the minimum point becomes x_min + c and the maximum point becomes x_max + c; the range is calculated as max-min : {x_max+c - (x_min + c) }, where the c values cancel out and the range is still {x_max - x_min}. Variance: remains the same because intuitively: the relative spread of the data set is the same as each point has shifted by the same amount c mathematically, if the original variance = V(X) then V(X+c) =V(x) as there is 0-variance in the constant vector representing the size of the increase. >>> 关注公众号 获取更多精彩内容 284 Quesetion 95 unique-work-days(sql) 难度标题 【Medium】 题目标签 【sql】 公司标签 【Amazon】 You have a table containing information about the projects employees have worked on and the time period in which they worked on the project. Each project can be assigned to more than one employee, and an employee can be working on more than one project at a time. Write a query to find how many unique days each employee worked. Order your query by the employee_id. Example: Input: projects table Output: Columns Type Columns Type employee_id INTEGER id INTEGER days_worked DATETIME title VARCHAR start_date DATETIME end_date DATETIME budget INTEGER employee_id INTEGER >>> 关注公众号 获取更多精彩内容 285 Solution 95 Solution: unique-work-days(sql) WITH cte AS ( SELECT employee_id, MIN(start_date) AS min_start_date, MAX(end_date) AS max_end_date FROM projects GROUP BY employee_id) SELECT employee_id, TIMESTAMPDIFF(DAY, min_start_date, max_end_date) AS days_ worked FROM cte ORDER BY employee_id >>> 关注公众号 获取更多精彩内容 286 Quesetion 96 search-algorithm-recall(machine learning) 难度标题 【Easy】 题目标签 【machine learning】 公司标签 【Amazon】 Let’s say you work as a data scientist at Amazon. You want to improve the search results for product search but cannot change the underlying logic in the search algorithm. What methods could you use to increase recall? >>> 关注公众号 获取更多精彩内容 287 Solution 96 Solution: search-algorithm-recall (machine learning) Given we are not allowed to change the algorithm, we have to logically look at this search algorithm like a black box, in which the underlying model will not change but rather tweak the inputs into the algorithm to increase the general recall output. Hence if we modify the search query by adding additional input keywords or chaining the results of different search terms, we can get different results for the same original search term. Remember that recall is the fraction of the relevant documents that are successfully retrieved over the total amount of relevant documents. In this case that means we want to generally increase the number of results returned in terms of relevance. Let’s take an example to demonstrate. >>> 关注公众号 获取更多精彩内容 288 Solution 96 Solution: search-algorithm-recall (machine learning) Let’s assume the algorithm using lexical search for relevancy like Lucene search. 
If the search query is “black shirts”, the results would still be generally relevant if the products returned were dark colored shirts such as dark grey, dark blue, etc… Instead however, given how the general search algorithm might work, “black shirts” would be more likely to bring up “blue shirts” or “black shoes” first instead of other dark colored shirts given that the algorithm doesn’t know anything besides lexical association. Given that we cannot change the underlying algorithm, we could surface these different dark colored shirts by appending a synonyms query. A synonyms query would replace or add to the existing words in the query with words that are synonymous. Results for synonyms could be chained to the first search query results. So we would first return the results for “black shirts”, and then start returning the results for “dark grey shirts”, “dark blue shirts”, etc… >>> 关注公众号 获取更多精彩内容 289 Solution 96 Solution: search-algorithm-recall (machine learning) Another method would be to try search terms of products that are adjacent to the values being searched. We would use an algorithm of collaborative filtering to see what products users bought together. Such as if people were likely to buy black pants with a black shirt, we could chain the search terms like “black pants” and “black shoes” into the query as well. We can also try modifying the search query by adding in keywords and tags from relevant products that users click on. If users that search for “black shirts” click on products that feature black collared shirts at a higher rate, we can append the keywords of “collared shirt” to our search query to increase recall towards general user preference. >>> 关注公众号 获取更多精彩内容 290 Quesetion 97 losing-users(product metrics) 难度标题 【Medium】 题目标签 【product metrics】 公司标签 【Facebook】,【Google】 Let’s say you are working at Facebook and are asked to investigate the claim that Facebook is losing young users. 1. How would you verify this claim? 2. What test metrics would you look at? >>> 关注公众号 获取更多精彩内容 291 Solution 97 Solution: adding-c-to-sample(statistics) Clarifying Questions: 1. Who is young user? 2. What are the source of this claim? ( Number of. posts, likes, log in times or duration etc. ) 3. Which time periods are this claim is observed? Assessing Requirements: 1. Needs to user profile and activity data Solution: 1. Assume that young users are identified as under 25 and losing users is determined as decrease of activity hours per month. 2. Divide data as young users and other users. Detect if there is a usage difference between two of them. After that detect if there is a usage difference for two groups individually according to time. With this way we will understand that is there anygeneral or group based loose. >>> 关注公众号 获取更多精彩内容 292 Solution 97 Solution: adding-c-to-sample(statistics) 3. Assume that we discovered a huge difference for young users between this year and last year, but there is not a big change for other users. We can accept the claim at this point. Validation: 1. We can calculate the decrease rate by month by month and try to understand reasons of this decrease. We can take account any updates on the app or other external factors. Additional Concerns: 1. We should be curious about competitor companies’ trend >>> 关注公众号 获取更多精彩内容 293 Quesetion 98 third-unique-song(sql) 难度标题 【Medium】 题目标签 【sql】 公司标签 【Spotify】,【Apple】 Given a table of song_plays and a table of users, write a query to extract the earliest date each user played their third unique song. 
Example: Input: song_plays table users table Columns Type Columns Type user_id INTEGER id INTEGER song_name TEXT name VARCHAR date_played DATETIME Output: Columns Type name VARCHAR date_played DATETIME song_name TEXT >>> 关注公众号 获取更多精彩内容 294 Solution 98 Solution: third-unique-song(sql) WITH CTE_ds AS ( SELECT ROW_NUMBER() OVER (PARTITION BY user_id,song_ name ORDER BY date_played) br, song_plays.* from song_plays ), CTE_ds2 AS ( SELECT ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date_played) br2,CTE_ds.* FROM CTE_ds WHERE br =1 ) >>> 关注公众号 获取更多精彩内容 295 Solution 98 Solution: third-unique-song(sql) SELECT x.name,y.date_played,song_name FROM users x LEFT JOIN (SELECT user_id, date_played,song_name FROM CTE_ds2 WHERE br2 = 3) y ON id = user_id >>> 关注公众号 获取更多精彩内容 296 Quesetion 99 post-composer-drop(product metrics) 难度标题 【Medium】 公司标签 / 题目标签 【product metrics】 Let’s say that on Facebook composer, the posting tool, drops from 3% posts per user last month to 2.5% post per user today. How would you investigate what happened? Let’s say the drop is in photo posts. What would you investigate now? >>> 关注公众号 获取更多精彩内容 297 Solution 99 Solution: post-composer-drop (product metrics) Clarification 1. Is it a true drop? Is it a sudden drop or has it been declining for a while? 2. Is there any changes that happen recently that might have affect this? 3. is the per user looking at active users only or everyone? Investigation on potential root cause 1. Look at overall trend, YoY, MoM, to confirm it is indeed a true decline. If there is indeed a decline, is it a sudden decline? This could indicate he feature itself might have some problem/not working properly. 1. Are we seeing similar decline for other feature? 2. Has the overall active user base using the Facebook composer drop? >>> 关注公众号 获取更多精彩内容 298 Solution 99 Solution: post-composer-drop (product metrics) 3. Look at it by different cohort to see if it’s coming from a particular group: • Surface (iOS, android, desktop, mobile, etc.) • Geography • New/tenure users • Business type? • Paid ads or no? Depending on the findings, we can suggest few recommendations. If it’s due to specific cohort group, we can create specific marketing campaigns to promote the posting tool to try to get users to get back to using the feature, if it’s due to the feature itself being hard to use/find, we can try working on making the tool easier and provide more education for the user on how to use it and etc. >>> 关注公众号 获取更多精彩内容 299 Quesetion 100 permutation-palindrome(algorithms) 难度标题 【Medium】 题目标签 【algorithms】 公司标签 【Snapchat】,【ByteDance】,【Squarepoint】, 【Amazon】,【Snap】 Given a string str, write a function perm_palindrome to determine whether there exists a permutation of str that is a palindrome. Example: Input: str = 'carerac' def perm_palindrome(str) -> True “carerac” returns True since it can be rearranged to form “racecar” which is a palindrome. >>> 关注公众号 获取更多精彩内容 300 Solution 100 Solution:permutation-palindrome (algorithms) def perm_palindrome(str): arr = [0] * 1000 num_odds = 0 for char in str: i = ord(char) arr[i] += 1 if arr[i] % 2 != 0: num_odds += 1 else: num_odds -= 1 return num_odds <= 1 >>> 关注公众号 获取更多精彩内容 301 Quesetion 101 estimating-d(statistics) 难度标题 【Medium】 题目标签 【statistics】 公司标签 【Spotify】 Given NN samples from a uniform distribution [0,d][0,d], how would you estimate dd? >>> 关注公众号 获取更多精彩内容 302 Solution 101 Solution:estimating-d(statistics) What does a uniform distribution look like? 
Just a straight line over the range of values from 0 to d, where any value between 0 to d is equally likely to be randomly sampled. So, let’s make this easy to understand practically. If we’re given N samples and we have to estimate what d is with zero context of statistics and based on intuition, what value would we choose? For example, if our N sample is 5 and our values are: (1,4,6,2,3), what value would we guess as d? Probably the max value of 6 right? But, let’s look at another example. Let’s say our N sample is 5 again and our values are instead: (20,30,28,26,16). Would our estimate still be the max value of 30? Intuitively, it doesn’t seem correct right? And that’s because if we assume d as 30, then that means these values are spanned from 0 to 30 but somehow all of the values sampled are above our projected median of 15. In the first example, all of our values were equally distributed from 0 to 6, while in this example, all of our values are skewed above the 50% percentile. Now, we can come up with a new estimator for d. >>> 关注公众号 获取更多精彩内容 303 Solution 101 Solution:estimating-d(statistics) One way to compute it would be that the average of a uniform distribution is in its middle. The two parameters of interest in a uniform distribution are its minimum and maximum values, as the entirety of its values are uniformly distributed between them. If d is the maximum and 0 is the minimum, half of d is its average. E(X) is the average, so How do we know how to choose between the two estimators? We have to ask the interviewer about the distribution of our N samples. For example, if we were to continue to sample from the uniform distribution and calculate the mean of the samples each time, seeing huge variations of the mean would tell us that the samples from our distribution are biased. >>> 关注公众号 获取更多精彩内容 304 Quesetion 102 pca-and-k-means(machine learning) 难度标题 【Medium】 题目标签 【machine learning】 公司标签 【Google】,【Palo】,【Uber】,【Booz】, 【Rincon】,【Ocrolus】,【AstraZeneca】, 【QuantumBlack】,【BNP】,【General】 What’s the relationship between PCA and K-means clustering? >>> 关注公众号 获取更多精彩内容 305 Solution 102 Solution:pca-and-k-means (machine learning) Both K means and PCA are unsupervised machine learning techniques. While PCA is used for dimensionality reduction, K-Means can be used for clustering. K-Means fails in high dimensional scenarios ( It is computationally expensive in High Dimension scenarios and may incorrectly clustering things) Hence before Performing a K-Means one always performs a PCA to reduce dimensionality. >>> 关注公众号 获取更多精彩内容 306 Quesetion 103 activity-conversion(analytics) 难度标题 【Hard】 题目标签 【analytics】 公司标签 【Apple】,【Facebook】 You’re given three tables, users, transactions and events. We’re interested in how user activity affects user purchasing behavior. The events table holds data for user events on the website where the action field would equal values such as like and comment. Write a query to prove if users that interact on the website (likes, comments) convert towards purchasing at a higher volume than users that do not interact. 
users table transactions table column type column type id INTEGER user_id INTEGER name VARCHAR name VARCHAR created_at DATETIME created_at DATETIME events table column type user_id INTEGER action VARCHAR created_at DATETIME >>> 关注公众号 获取更多精彩内容 307 Solution 103 Solution:activity-conversion (analytics) /* count number of transactions per user */ with tcnt as ( select users.id, count(transactions.created_at) as no_of_t from users left join transactions on users.id = transactions.id ), /* count number of events per user */ ecnt as ( select users.id, count(events.created_at) as no_of_e from users left join events on users.id = events.user_id where action = 'like' or action = 'comment' ) /* now combine the two tables and determine avg number of events needed for each number of transactions */ select no_of_t, avg(no_of_e) from tcnt left join ecnt on tcnt.id = ecnt.id group by no_of_t >>> 关注公众号 获取更多精彩内容 308 Quesetion 104 scalped-ticket(probability) 难度标题 【Easy】 公司标签 / 题目标签 【probability】 One of your favorite sports teams is playing at a local stadium, but you waited until the last minute to buy a ticket. You can buy a scalped (second-hand) ticket for $50$50, which has a 20% chance of not working. If the scalped ticket doesn’t work, you’ll have to buy a box office ticket for $70$70 at the stadium. 1. How much do you expect to pay to go to the sports game? 2. How much money should you set aside for the game? >>> 关注公众号 获取更多精彩内容 309 Solution 104 Solution:scalped-ticket(probability) One of your favorite sports teams is playing at a local stadium, but you waited until the last minute to buy a ticket. You can buy a scalped (second-hand) ticket for $50, which has a 20% chance of not working. If the scalped ticket doesn’t work, you’ll have to buy a box office ticket for $70 at the stadium. 1. How much do you expect to pay to go to the sports game? 2. How much money should you set aside for the game? >>> 关注公众号 获取更多精彩内容 310 Quesetion 105 button-ab-test(a/b testing) 难度标题 【Easy】 题目标签 【a/b testing】 公司标签 【Nextdoor】,【Amazon】,【Livongo】, 【Agoda】,【Known】,【Impossible】, 【Ibotta】,【Dropbox】,【Gusto】 A team wants to A/B test multiple different changes through a sign-up funnel. For example, on a page, a button is currently red and at the top of the page. They want to see if changing a button from red to blue and/or from the top of the page to the bottom of the page will increase click-through. How would you set up this test? >>> 关注公众号 获取更多精彩内容 311 Solution 105 Solution:button-ab-test(a/b testing) Two Options: - Run a multiple variant test - Run a simultaneous test 1. Calculate the desired effect size of our change 2. Calculate the required sample size & duration of the experiment to hit the desired effect size 3. Ensure proper tracking of CTR within our homepage 4. Ensure proper experiment framework to randomize between treatment/control 5. If we want to run a simultaneous test, we’ll need to have a framework for measuring the interaction effects. We can: • Measure each variant individually against control • For each variant, calculate the values of the interaction term to determine influence of the other experiment • Benefit is we get more power with the simultaneous test. And we can understand what would happen if we rolled both variants out. 1. If we wanted to either run tests separately, this would give us the benefit of interpretability of our results. At the cost of potential power improvements and delay in results. 2. 
We can apply variance reduction techniques like stratification or adding covariates to reduce the effect of external factors. >>> 关注公众号 获取更多精彩内容 312 Quesetion 106 merge-sorted-lists(algorithms) 难度标题 【Easy】 题目标签 【algorithms】 公司标签 【Workday】,【Two】,【PayPal】,【Facebook】, 【Indeed】 Given two sorted lists, write a function to merge them into one sorted list. Bonus: What’s the time complexity? Example: Input: list1 = [1,2,5] list2 = [2,4,6] Output: def merge_list(list1,list2) -> [1,2,2,4,5,6] >>> 关注公众号 获取更多精彩内容 313 Solution 106 Solution:merge-sorted-lists(algorithms) def merge_list(list1, list2): list3 = [] i=0 j=0 # Traverse both lists # If the current element of first list # is smaller than the current element # of the second list, then store the # first list's value and increment the index while i < len(list1) and j < len(list2): if list1[i] < list2[j]: list3.append(list1[i]) i=i+1 else: list3.append(list2[j]) j=j+1 # Store remaining elements of the first list while i < len(list1): list3.append(list1[i]) i=i+1 >>> 关注公众号 获取更多精彩内容 314 Solution 106 Solution:merge-sorted-lists(algorithms) # Store remaining elements of the first list while i < len(list1): list3.append(list1[i]) i=i+1 # Store remaining elements of the second list while j < len(list2): list3.append(list2[j]) j=j+1 return list3 >>> 关注公众号 获取更多精彩内容 315 Quesetion 107 promoting-instagram(product metrics) 难度标题 【Medium】 公司标签 / 题目标签 【product metrics】 Let’s say you work on the growth team at Facebook and are tasked with promoting Instagram from within the Facebook app. Where and how could you promote Instagram through Facebook? >>> 关注公众号 获取更多精彩内容 316 Solution 107 Solution:promoting-instagram (product metrics) Goal: Increase awareness of Instragram through Facebook Hypothesis: Showing Instragram ads to users in their News Feed will increase the likelihood that they will login to Instragram by X%. Run A/B test: • Control group: no changes • Variant group: will be shown Instagram ads as the first ad they see when scrolling through their News Feed. • Randomly assign users to each group, making sure they’re not bias and are representative of the population. • Set a significance level like 95% • Set experiment time, how long the long experiement run • Set the power,usually 80% • Esimate intended effect size - 20% >>> 关注公众号 获取更多精彩内容 317 Solution 107 Solution:promoting-instagram (product metrics) Metrics: • # of Instagram logins after being exposed to the Instagram ad, 24 hours • Instagram logins / # of users - Percent logging into Instagram after using Facebook • Stop-gap metric: Ad revenue, CTR, revenue per session. Since we’re taking up ad space, we want to see how much these ads cost us The other idea is: Notifying ppl on Facebook when their friends join Instagram. We can do a regression of # of friends on Instagram vs % of those users who use Instagram. >>> 关注公众号 获取更多精彩内容 318 Quesetion 108 significance-time-series(statistics) 难度标题 【Medium】 题目标签 【statistics】 公司标签 【Amazon】,【MasterClass】,【Apple】 Let’s say you have a time series dataset grouped monthly for the past five years. How would you find out if the difference between this month and the previous month was significant or not? >>> 关注公众号 获取更多精彩内容 319 Solution 108 Solution:significance-time-series (statistics) As stated, the dataset is grouped monthly, and for the purposes of this answer, let’s say that the data is the number of unique visitors to said website. This means that at the end of each month, the number of unique visitors from every day that month is summed up and reported. 
Question 109: netflix-price (business case)
Difficulty: Hard | Tags: business case | Companies: Netflix

How would you determine if the price of a Netflix subscription is truly the deciding factor for a consumer?

Solution 109: netflix-price (business case)

Based on the question raised, let's focus on conversion, say from free trial to subscription; the same idea applies to investigating retention, which is repeat purchase. One way to approach this question is through a quasi-experiment. Netflix rolls out price changes on a country-by-country basis, and a change "in the US does not influence or indicate a global price change," a Netflix spokesperson told The Verge (source: https://www.theverge.com/2020/10/29/21540346/netflix-priceincrease-united-statesstandard-premium-content-productfeatures). This creates a good setting for a difference-in-differences analysis: for a chosen period of time, say 2 months (reasonable because Netflix's business model is a monthly subscription), compare the conversion rate between two groups of countries, only one of which experienced a price increase, both before and after the price change (a sketch follows below). Theoretically, this gives you the average treatment effect of price on the conversion rate.

Of course this method relies on strong assumptions, and we need to include a set of covariates to control for confounders, such as socio-economic factors of each country (GDP, average income) and catalog factors (# of movie titles, # of TV titles, cost per movie title, cost per TV title, total library size, etc.). Another way is a geo experiment, where one group of markets in the US is given the control price and the other group is given the treatment price. The problem with this approach is that the two groups are most likely not comparable; we can either apply the difference-in-differences method to take care of that or apply a matching method.
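A minimal sketch of the difference-in-differences estimate described above, assuming a hypothetical table with one row per group and period and a conversion_rate column (the numbers are made up):

import pandas as pd

# Hypothetical conversion rates before and after a price change.
df = pd.DataFrame({
    "group":  ["treated", "treated", "control", "control"],
    "period": ["pre", "post", "pre", "post"],
    "conversion_rate": [0.120, 0.105, 0.118, 0.117],
})

means = df.pivot_table(index="group", columns="period", values="conversion_rate")
did = (means.loc["treated", "post"] - means.loc["treated", "pre"]) \
    - (means.loc["control", "post"] - means.loc["control", "pre"])

# Estimated average treatment effect of the price change on conversion,
# under the parallel-trends assumption (here -0.014, i.e. -1.4 points).
print(did)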
Question 110: hundreds-of-hypotheses (a/b testing)
Difficulty: Medium | Tags: a/b testing | Companies: Amazon

You are testing hundreds of hypotheses with many t-tests. What considerations should be made?

Solution 110: hundreds-of-hypotheses (a/b testing)

The Type I error rate will scale with the number of t-tests you run. If your significance level alpha for a single t-test is 0.05, i.e. we allow a 5% Type I error rate on a single test, then across many tests the overall P(Type I error) will increase. For example, with 2 independent tests:

P(Type I error) = P(Type I error on A OR Type I error on B)
                = 2 * P(Type I error on a single test) - P(Type I error on A AND Type I error on B)
                = 2 * 0.05 - 0.05^2   (assuming independence of the tests)
                = 0.1 - 0.0025 = 0.0975

If you want P(Type I error) across n tests to remain at 5%, you will need to decrease the alpha used in each individual test (see the sketch below). Otherwise, you can start by running an F-test to identify whether at least one test shows a significant effect, and then run a t-test on the specific experiment with the highest effect size. Granted, the p-value of that test will also depend on the variance of the sample in that experiment; if we assume constant variance across tests, then the test with the highest effect size is, in expectation, the best-performing test. Only running a single t-test will keep your P(Type I error) low.
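The per-test alpha adjustment mentioned above is usually done with a Bonferroni or Šidák correction (the solution does not name one, so take this as a sketch rather than the author's method):

# Family-wise error rate control across m tests.
m = 100                 # number of t-tests being run
alpha_family = 0.05     # desired overall Type I error rate

alpha_bonferroni = alpha_family / m              # conservative, no independence needed
alpha_sidak = 1 - (1 - alpha_family) ** (1 / m)  # exact under independence

print(alpha_bonferroni)  # 0.0005
print(alpha_sidak)       # ~0.000513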
Question 111: disease-testing-probability (probability)
Difficulty: Hard | Tags: probability | Companies: Asana

Bob takes a test and tests positive for the disease. Bob was close to six other friends; they all take the same test and end up testing negative. The test has a 1% false positive rate and a 15% false negative rate. What's the percent chance that Bob was actually negative for the disease?

Solution 111: disease-testing-probability (probability)

While this immediately looks like a Bayes' Theorem problem, the lack of a disease prevalence suggests that this is a comparison of binomial distributions, or Bayes' Theorem applied to that comparison. Because 7 people have been in close contact (the 6 friends and Bob), we can assume that they all share the same condition: either (1) all positive for the disease or (2) none positive for the disease. Therefore, we can use the FPR and FNR values as p in binomial distributions to calculate the probability of each situation and decide which is more probable.

(1) In the case where they are all positive, we consider X = # of false negatives, so p = 0.15 and n = 7. Therefore P(X = 6) = 6.8 * 10^(-5).
(2) In the case where they are all negative, we consider X = # of false positives, so p = 0.01 and n = 7. Therefore P(X = 1) = 0.066.

So case 2 is far more likely. Taking a page from Bayes' Theorem:
P(Bob is actually negative) = 0.066 / (0.066 + 6.8 * 10^(-5)) = 0.999
(Note the similarity to the value you would get by assuming a prevalence, but without that assumption.)

Question 112: parents-joining-teens (product metrics)
Difficulty: Medium | Tags: product metrics | Companies: /

Let's say you're a data scientist at Facebook. How would you evaluate the effect on the engagement of teenage users when their parents join Facebook?

Solution 112: parents-joining-teens (product metrics)

Since you cannot run a randomized test (unless you figure out a way to force parents of teens to join), this will need to be an observational study with a quasi-experiment design to answer the question "how do parents cause teens to behave in different ways?". Look at 2 groups of teen users at two time periods. At time t0, parents of teens in group 1 (test) join Facebook while parents in group 2 (control) do not. At time t1, compare the pre-to-post change in user behavior of users in test to that of control. Since random assignment is not possible, you'll need to control for selection bias through matching or regression. The variables to match on would depend on the outcome measure of interest (time spent, engagement on tagged posts, sharing, posting). A few selection controls could include age, affluence, education level of teens and parents, ethnic/cultural background, size and density of connections, etc. Ultimately, compare the pre-to-post change in metrics for the 2 groups at time t1 (relative to time t0) and see if the differences are significant.

Question 113: emails-opened (sql)
Difficulty: Easy | Tags: sql | Companies: Facebook, Wayfair

The events table tracks every time a user performs a certain action (like, post_enter, etc.) on a platform. How many users have ever opened an email?
Example:
Input: events table
  user_id     INTEGER
  created_at  DATETIME
  action      VARCHAR
  platform    VARCHAR
Output:
  num_users_open_email  INTEGER

Solution 113: emails-opened (sql)

SELECT COUNT(DISTINCT user_id) AS num_users_open_email
FROM events
WHERE action = 'email_opened'
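Returning to the disease-testing answer (Solution 111), the two binomial probabilities and the final ratio quoted there are easy to check numerically; this is just a verification sketch:

from scipy.stats import binom

# Case 1: all 7 people truly positive -> the 6 negative results are false negatives.
p_all_positive = binom.pmf(6, 7, 0.15)   # ~6.8e-5

# Case 2: all 7 people truly negative -> Bob's positive result is a false positive.
p_all_negative = binom.pmf(1, 7, 0.01)   # ~0.066

# Relative weight of "Bob is actually negative" between the two scenarios.
print(p_all_negative / (p_all_negative + p_all_positive))   # ~0.999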
Question 114: low-precision (machine learning)
Difficulty: Easy | Tags: machine learning | Companies: /

Let's say you're tasked with building a classification model to determine whether a customer will buy on an e-commerce platform after making a search on the homepage. You find that your model is suffering from low precision. How would you improve it?

Solution 114: low-precision (machine learning)

Plot the ROC curve and increase the classification threshold without sacrificing too much recall. Give more weight to features related to a user's actions, such as previous activity on the site and login time: essentially, features that signal an intention to buy.

Question 115: how-many-friends (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: /

You are given a list of lists where each group represents a friendship. For example, given the list:
list = [[2,3],[3,4],[5]]
Person 2 is friends with person 3, person 3 is friends with person 4, etc. Write a function to find how many friends each person has.
Example 1:
Input: friends = [[1,3],[2,3],[3,5],[4]]
Output: [(1,1), (2,1), (3,3), (4,0), (5,1)]
Example 2:
Input: friends = [[1],[2],[3],[4]]
Output: [(1,0), (2,0), (3,0), (4,0)]
Explanation: every person on the friends list has no friends.

Solution 115: how-many-friends (algorithms)

def how_many_friends(friendships):
    counts = {}
    for friendship in friendships:
        for f in friendship:
            # union each person's running friend set with the whole group
            counts[f] = counts.get(f, set())
            counts[f] = counts[f].union(friendship)
    # each person's set contains themselves, so subtract 1
    return [
        (f, len(r) - 1)
        for f, r in sorted(counts.items())
    ]

Question 116: lowest-paid (sql)
Difficulty: Medium | Tags: sql | Companies: Microsoft

Given tables employees, employee_projects, and projects, find the 3 lowest-paid employees that have completed at least 2 projects. Note: incomplete projects will have an end date of NULL in the projects table.
Example:
Input:
employees table
  id             INTEGER
  first_name     VARCHAR
  last_name      VARCHAR
  salary         INTEGER
  department_id  INTEGER
employee_projects table
  employee_id  INTEGER
  project_id   INTEGER
projects table
  id          INTEGER
  title       VARCHAR
  start_date  DATE
  end_date    DATE
  budget      INTEGER
Output:
  employee_id         INTEGER
  salary              INTEGER
  completed_projects  INTEGER

Solution 116: lowest-paid (sql)

SELECT ep.employee_id,
       e.salary,
       COUNT(p.id) AS completed_projects
FROM employee_projects AS ep
JOIN employees AS e ON e.id = ep.employee_id
JOIN projects AS p ON ep.project_id = p.id
WHERE p.end_date IS NOT NULL
GROUP BY 1
HAVING completed_projects > 1
ORDER BY 2
LIMIT 3

Question 117: overfit-avoidance (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: Microsoft, Adobe

Let's say that you're training a classification model. How would you combat overfitting when building tree-based models?

Solution 117: overfit-avoidance (machine learning)

Overfitting refers to the condition where the model fits the training data completely but fails to generalize to unseen test data. A perfectly fit decision tree performs well on training data but poorly on unseen test data. A good model must not only fit the training data well but also accurately classify records it has never seen. There are different techniques to avoid overfitting in decision trees:
- Pruning (pre-pruning and post-pruning)
- Ensemble methods

Pruning is a technique that reduces the size of a decision tree by removing non-critical and redundant sections used to classify instances; it reduces the complexity of the final classifier. There are two types of pruning:
1. Pre-pruning: stop growing the tree early, before it perfectly classifies the training set. The hyperparameters of the decision tree, including maximum depth, minimum samples per leaf, and minimum samples per split, can be tuned to stop the growth of the tree early and prevent the model from overfitting.
2. Post-pruning: allow the tree to grow fully and perfectly classify the training set, then prune it back by removing branches. In practice, post-pruning overfit trees is more popular and successful because it is not easy to estimate precisely when to stop growing the tree.

Random forests: Random Forest is an ensemble technique for classification and regression that bootstraps multiple decision trees. It uses bootstrap sampling and aggregation (bagging) to prevent overfitting. (A scikit-learn sketch of the pruning options follows below.)
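As an illustration of the pre-pruning hyperparameters and cost-complexity post-pruning discussed above (not part of the original answer), here is a minimal scikit-learn sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap depth and leaf size so the tree stops growing early.
pre_pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=20, min_samples_split=40, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow fully, then prune back with cost-complexity pruning (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_train, y_train)

print("pre-pruned test accuracy: ", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))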
Question 118: approximate-ad-views (probability)
Difficulty: Easy | Tags: probability | Companies: Facebook

Let's say you work for a social media website. Users view 100 posts a day, and each post has a 10% chance of being an ad. What is the probability that a user views more than 10 ads a day? How could you approximate this value using the standard normal distribution's CDF?

Solution 118: approximate-ad-views (probability)

The number of ads a user sees in 100 daily posts can be modeled by a binomial distribution B(n=100, p=0.1). This implies the probability that a user sees more than 10 ads out of 100 viewed posts is 1 - CDF(k=10; n=100, p=0.1), where CDF is the binomial CDF. Since np >= 10, we can approximate the binomial distribution with a normal distribution N(np, np(1-p)) = N(10, 9). As 10 is the mean of this normal distribution, the standard normal CDF evaluated at (10 - 10) / 3 is 0.5, so the probability is approximately 1 - 0.5 = 0.5.

Question 119: secret-wins (probability)
Difficulty: Hard | Tags: probability | Companies: Google

There are 100 students playing a coin-tossing game. The students are given a coin to toss. If a student tosses the coin and it turns up heads, they win. If it comes up tails, then they must flip again. If the coin comes up heads the second time, the students will lie and say they have won when they didn't. If it comes up tails, then they will say they have lost. If 30 students at the end say they won, how many students do we expect actually won the game?

Solution 119: secret-wins (probability)

Let W denote the number of students who actually won. Since we know that 30 students said they won, we only need to consider those students. By the rules of the game, a student says they won either because they flipped heads on the first toss (a real win, probability 1/2) or because they flipped tails and then heads (a lie, probability 1/4). Conditional on a student saying they won, the probability that they actually won is therefore (1/2) / (1/2 + 1/4) = 2/3. The number of actual winners among the 30 can thus be modeled as a binomial distribution, W ~ B(30, 2/3), and using the expected value of the binomial, E[W] = 20.

Question 120: 85-vs-82 (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: Amazon

We have two models: one with 85% accuracy, one with 82%. Which one do you pick?

Solution 120: 85-vs-82 (machine learning)

At first glance it seems like a trick question: 85% accuracy is obviously higher than 82%, so there must be a reason to dig into the broader context behind the question. What is the model being applied to? What is more important to the business: a higher-accuracy model or a more interpretable model? For example, it's likely that the higher-accuracy model is a black box and more difficult for the business to interpret.

A first determination would be figuring out the correct metric for the model. Accuracy is simply the fraction of predictions the model got correct, and it can be misleading. In binary classification terms, for example, if we care more about true positives than true negatives (or vice versa), the less accurate model could have a better value for the metric we actually care about. It makes sense in this case to balance precision and recall for the business use case.

For example, suppose we're a doctor trying to estimate the number of sick patients in a town, and we have two models with these confusion matrices:

Model 1:          Model 2:
 10 | 10           15 | 20
 ---+---           ---+---
 10 | 70            5 | 60

We have an accuracy of 80% (10 + 70 out of 100) for the first model and 75% (15 + 60 out of 100) for the second. The first model has better accuracy, yet the second model does a better job of predicting when patients are sick, at the cost of flagging more healthy patients. Which model do we care more about as a doctor? It depends on the severity of the disease and other factors in how much we weight precision versus recall. (A small numeric sketch follows below.)

Lastly, we should look at model scalability. We need a model that can perform well in production, because that's how it will be used in real life: predictions must be generated at the scale and speed of the data points the model will have to classify in real time. An example is the model that won the Netflix Prize: even though it was the best performing, it was not the model Netflix actually used, because it wasn't scalable to Netflix's audience size.
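To make the confusion-matrix comparison above concrete, here is a small sketch. It assumes the rows of the quoted matrices are predicted (sick, healthy) and the columns are actual (sick, healthy); under that reading the numbers match the solution's claim that Model 2 is better at catching sick patients.

def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Model 1: predicted-sick row (10, 10), predicted-healthy row (10, 70)
print(metrics(tp=10, fp=10, fn=10, tn=70))   # (0.80, 0.50, 0.50)

# Model 2: predicted-sick row (15, 20), predicted-healthy row (5, 60)
print(metrics(tp=15, fp=20, fn=5, tn=60))    # (0.75, ~0.43, 0.75)

# Model 1 is more accurate, but Model 2 catches more of the truly sick
# patients (higher recall), which may matter more to a doctor.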
Question 121: fast-food-database (database design)
Difficulty: Medium | Tags: database design | Companies: Facebook

1. Design a database for a stand-alone fast food restaurant.
2. Based on the above database schema, write a SQL query to find the top three highest-revenue items sold yesterday.
3. Write a SQL query using the database schema to find the percentage of customers that order drinks with their meal.
Solution 121: fast-food-database (database design)

users
  user_id (pk) | created_date | user_type
  1            | 2020-05-25   | "walkin"

orders
  id | user_id | item_id | qty | created_date
  1  | 1       | 234     | 1   | 2020-09-09
  1  | 1       | 432     | 2   | 2020-09-09

items
  id (pk) | description      | price
  234     | "chicken burger" | 10.28
  432     | "bread sticks"   | 3.5

-- top three highest-revenue items sold yesterday
-- revenue = item_price * item_qty
select o.item_id, sum(o.qty * i.price) as item_rev
from orders o
inner join items i on i.id = o.item_id
where o.created_date = current_date() - interval 1 day
group by 1
order by 2 desc
limit 3;

-- num customers who ordered drinks with their meals / total customers who ordered
-- isolate customers who ordered drinks AND meal(s)
-- pick all the items which are drinks
with drinks as (
    select id from items where description ilike '%drink%'
),
non_drinks as (
    select id from items where id not in (select id from drinks)
),
-- use case when to flag drinks and non-drinks per user
user_agg as (
    select user_id,
           max(case when t2.id is not null then 1 else 0 end) as drinks_flag,
           max(case when t3.id is not null then 1 else 0 end) as non_drinks_flag
    from orders t1
    left join drinks t2 on t1.item_id = t2.id
    left join non_drinks t3 on t1.item_id = t3.id
    group by 1
)
select (100.0 * count(user_agg.user_id) / (select count(distinct user_id) from orders)) as percentage_users_with_drink
from user_agg
where drinks_flag = 1 and non_drinks_flag = 1

Question 122: fake-algorithm-reviews (probability)
Difficulty: Medium | Tags: probability | Companies: /

Let's say we're trying to determine fake reviews on our products. Based on past data, 98% of reviews are legitimate and 2% are fake. If a review is fake, there is a 95% chance that the machine learning algorithm identifies it as fake. If a review is legitimate, there is a 90% chance that the machine learning algorithm identifies it as legitimate. What is the percentage chance the review is actually fake when the algorithm detects it as fake?

Question 123: all-tails-consecutive (probability)
Difficulty: Medium | Tags: probability | Companies: Google

Let's say you flip a fair coin 10 times. What is the probability that you get only three tails, but all the tails happen consecutively? An example of this happening would be if the flips were HHHHTTTHHH.
Bonus: What would be the probability of getting only t tails in n coin flips (t ≤ n), requiring that all the tails happen consecutively?

Question 124: top-3-users (sql)
Difficulty: Medium | Tags: sql | Companies: Google

Let's say you work at a file-hosting website. You have information on users' daily downloads in the download_facts table. Use the window function RANK to display the top three users by downloads each day. Order your data by date, and then by daily_rank.
Example:
Input: download_facts table
  user_id    INTEGER
  date       DATE
  downloads  INTEGER
Output:
  daily_rank  INTEGER
  user_id     INTEGER
  date        DATE
  downloads   INTEGER

Question 125: bernoulli-sample (algorithms)
Difficulty: Hard | Tags: algorithms | Companies: Uber, Google

Given a random Bernoulli trial generator, write a function to return a value sampled from a normal distribution.
Example:
Input:
def bernoulli_sample(p):
    """generate 100 outputs of a Bernoulli sample, given prob of 1 as p and 0 as 1 - p"""
Output: 55
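The compilation includes no solution for bernoulli-sample (Question 125). One common approach, sketched here under the assumption that bernoulli_sample(p) returns a list of 100 draws in {0, 1} as its docstring suggests, is to sum many Bernoulli trials and standardize the sum via the central limit theorem:

def normal_sample(bernoulli_sample):
    """Return an approximately N(0, 1) draw built only from Bernoulli trials."""
    p = 0.5
    draws = bernoulli_sample(p)          # assumed: list of 100 draws in {0, 1}
    n = len(draws)
    total = sum(draws)                   # Binomial(n, p)
    mean = n * p
    std = (n * p * (1 - p)) ** 0.5
    # By the central limit theorem, the standardized sum is approximately N(0, 1).
    return (total - mean) / std

Using more trials (or averaging several standardized sums) tightens the normal approximation.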
Question 126: whatsapp-metrics (business case)
Difficulty: Easy | Tags: business case | Companies: Amazon

What do you think are the most important metrics for WhatsApp?

Question 127: amateur-performance (product metrics)
Difficulty: Hard | Tags: product metrics | Companies: Pinterest, Google

You are a data scientist at YouTube focused on creators. A PM comes to you worried that amateur video creators used to be able to do well, but now it seems like only "superstars" do well. What data points and metrics would you look at to decide if this is true or not?

Question 128: matching-siblings (machine learning)
Difficulty: Medium | Tags: machine learning | Companies: /

Let's say that you're a data scientist working for Facebook. A product manager has asked you to develop a method to match users to their siblings on Facebook.
1. How would you evaluate a method or algorithm to match users with their siblings?
2. What metrics might you use?

Question 129: employees-before-managers (sql)
Difficulty: Medium | Tags: sql | Companies: Amazon

You're given two tables: employees and managers. Find the names of all employees who joined before their manager.
Example:
Input:
employees table
  id          INTEGER
  first_name  VARCHAR
  last_name   VARCHAR
  manager_id  INTEGER
  join_date   DATETIME
managers table
  id         INTEGER
  name       VARCHAR
  join_date  DATETIME
Output:
  employee_name  VARCHAR

Question 130: monotonic-function (statistics)
Difficulty: Medium | Tags: statistics | Companies: Google

What does it mean for a function to be monotonic? Why is it important that a transformation applied to a metric is monotonic?

Question 131: transactions-in-the-last-5-days (sql)
Difficulty: Medium | Tags: sql | Companies: Amazon

Let's say you work at a bank. Using the bank_transactions table, find how many users made at least one transaction each day in the first five days of January 2020.
bank_transactions table
  user_id            INTEGER
  created_at         DATETIME
  transaction_value  FLOAT
  id                 INTEGER
Output:
  number_of_users  INTEGER

Question 132: job-recommendation (machine learning)
Difficulty: Hard | Tags: machine learning | Companies: Google, Twitter

Let's say that you're working on a job recommendation engine. You have access to all users' LinkedIn profiles, a list of jobs each user applied to, and answers to questions that the user filled in about their job search. Using this information, how would you build a job recommendation feed?

Question 133: target-indices (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: Amazon

Given an array and a target integer, write a function sum_pair_indices that returns the indices of two integers in the array that add up to the target integer. If no pair is found, just return an empty list.
Note: Can you do it in O(n) time?
Note: Even though there could be many solutions, only one needs to be returned.
Example 1:
Input: array = [1, 2, 3, 4], target = 5
Output: sum_pair_indices(array, target) -> [0, 3] or [1, 2]
Example 2:
Input: array = [3], target = 6
Output: [] (do NOT return [0, 0], as you can't use an index twice)

Question 134: extra-delivery-pay (business case)
Difficulty: Medium | Tags: business case | Companies: /

Let's say you work at a food delivery company. How would you measure the effectiveness of giving extra pay to delivery drivers during peak hours to meet the demand from consumers?
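No solution is included for target-indices (Question 133); a common O(n) approach is a single pass with a hash map from value to index, sketched here:

def sum_pair_indices(array, target):
    """Return indices of two distinct elements summing to target, else []."""
    seen = {}                      # value -> index of a previous occurrence
    for i, value in enumerate(array):
        complement = target - value
        if complement in seen:     # a matching earlier element exists
            return [seen[complement], i]
        seen[value] = i
    return []

print(sum_pair_indices([1, 2, 3, 4], 5))   # [1, 2]
print(sum_pair_indices([3], 6))            # []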
Question 135: inactive-users (business case)
Difficulty: Medium | Tags: business case | Companies: Google

Let's say one million Netflix users have not logged in to Netflix in the past 6 months. How would you determine the cause? And what would you do with these users?

Question 136: matrix-analysis (python)
Difficulty: Medium | Tags: python | Companies: Google

Let's say we have a five-by-five matrix num_employees where each row is a company and each column represents a department. Each cell of the matrix displays the number of employees working in that particular department at each company. Write a function find_percentages to return a five-by-five matrix that contains the portion of employees in each department compared to the total number of employees at each company.
Example:
Input:
import numpy as np
num_employees = np.array(
    [[10, 20, 30, 30, 10],
     [15, 15, 5, 10, 5],
     [150, 50, 100, 150, 50],
     [300, 200, 300, 100, 100],
     [1, 5, 1, 1, 2]]
)
Output:
find_percentages(num_employees) ->
percentage_by_department = [[0.1, 0.2, 0.3, 0.3, 0.1],
                            [0.3, 0.3, 0.1, 0.2, 0.1],
                            [0.3, 0.1, 0.2, 0.3, 0.1],
                            [0.3, 0.2, 0.3, 0.1, 0.1],
                            [0.1, 0.5, 0.1, 0.1, 0.2]]

Question 137: minimum-absolute-distance (algorithms)
Difficulty: Easy | Tags: algorithms | Companies: McKinsey, Apple

Given an array of integers, write a function min_distance to calculate the minimum absolute distance between two elements, then return all pairs having that absolute difference.
Note: Make sure to print the pairs in ascending order.
Example:
Input: v = [3, 12, 126, 44, 52, 57, 144, 61, 68, 72, 122]
Output: min_distance(v) -> min = 4, [(57, 61), (68, 72), (122, 126)]

Question 138: count-transactions (sql)
Difficulty: Easy | Tags: sql | Companies: Amazon

Let's say you work at Amazon. With the annual_payments table below, answer the following questions via SQL queries and output them as a table with the answers to each question.
1. How many total transactions are in this table?
2. How many different users made transactions?
3. How many transactions listed as "paid" have an amount greater than or equal to 100?
4. Which product made the highest revenue? (Use only transactions with a "paid" status.)
Example:
Input: annual_payments table
  id               INTEGER
  amount           FLOAT
  created_at       DATETIME
  status           VARCHAR
  user_id          INTEGER
  amount_refunded  FLOAT
  product          VARCHAR
  last_updated     DATETIME
Output:
  question_id  INTEGER
  answer       FLOAT

Question 139: one-element-removed (algorithms)
Difficulty: Medium | Tags: algorithms | Companies: Facebook

There are two lists, list X and list Y. Both lists contain integers from -1000 to 1000 and are identical to each other, except that one integer that exists in list X is removed in list Y. Write a function one_element_removed that takes in both lists and returns the integer that was removed, in O(1) space and O(n) time, without using the Python set function.
Example:
Input:
list_x = [1,2,3,4,5]
list_y = [1,2,4,5]
one_element_removed(list_x, list_y) -> 3

Question 140: fake-news-stories (business case)
Difficulty: Medium | Tags: business case | Companies: Facebook

Mark Zuckerberg calls you at 7pm and says he needs to know exactly what percentage of Facebook stories are fake news by tomorrow at 7pm. How would you measure this given the time constraint?
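No solution is included for one-element-removed (Question 139); since list Y is list X with exactly one integer removed, the difference of the two sums recovers it in O(n) time and O(1) space, for example:

def one_element_removed(list_x, list_y):
    """Return the single integer present in list_x but missing from list_y."""
    # Summing uses O(1) extra space and one pass over each list: O(n) time.
    return sum(list_x) - sum(list_y)

print(one_element_removed([1, 2, 3, 4, 5], [1, 2, 4, 5]))  # 3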
Question 141: instagram-tv-success (product metrics)
Difficulty: Hard | Tags: product metrics | Companies: Google, Facebook

Let's say you're a Product Data Scientist at Instagram. How would you measure the success of the Instagram TV product?

Question 142: approval-drop (statistics)
Difficulty: Medium | Tags: statistics | Companies: Intuit, edX, Microsoft

Capital approval rates have gone down. Let's say last week the overall approval rate was 85% and it went down to 82% this week, which is a statistically significant reduction. A first analysis shows that all approval rates stayed flat or increased week over week when looking at the individual products:
- Product 1: 84% to 85% week over week
- Product 2: 77% to 77% week over week
- Product 3: 81% to 82% week over week
- Product 4: 88% to 88% week over week
What could be the cause of the decrease?

Question 143: variate-anomalies (statistics)
Difficulty: Easy | Tags: statistics | Companies: /

If given a univariate dataset, how would you design a function to detect anomalies? What if the data is bivariate?

Question 144: optimal-host (algorithms)
Difficulty: Hard | Tags: algorithms | Companies: LinkedIn, Facebook, Pluralsight, Zillow

Let's say we have a group of N friends represented by a list of dictionaries, where each entry holds a friend's name and their location on a three-dimensional scale (x, y, z). The friends want to host a party but want the friend with the optimal location (least distance for the group to travel) to host it. Write a function pick_host to return the friend that should host the party.
Example:
Input:
friends = [
    {'name': 'Bob', 'location': (5, 2, 10)},
    {'name': 'David', 'location': (2, 3, 5)},
    {'name': 'Mary', 'location': (19, 3, 4)},
    {'name': 'Skyler', 'location': (3, 5, 1)},
]
pick_host(friends) -> 'David'
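No solution is included for optimal-host (Question 144); one direct O(N^2) sketch is to compute, for each friend, the total Euclidean distance the group would travel to them and pick the minimum:

import math

def pick_host(friends):
    """Return the name of the friend whose location minimizes total travel distance."""
    def total_distance(host):
        # Distance from every friend to the candidate host (self-distance is 0).
        return sum(
            math.dist(host["location"], other["location"])
            for other in friends
        )
    return min(friends, key=total_distance)["name"]

friends = [
    {"name": "Bob",    "location": (5, 2, 10)},
    {"name": "David",  "location": (2, 3, 5)},
    {"name": "Mary",   "location": (19, 3, 4)},
    {"name": "Skyler", "location": (3, 5, 1)},
]
print(pick_host(friends))  # David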