summer-2021-cse-6040-midterm-1-problem
Computing for Data Analysis (Georgia Institute of Technology)
Midterm 1, Spring 2021: Music recommender
Version 1.0
This problem builds on your knowledge of basic Python data structures and string processing. It has seven (7) exercises, numbered 0 to 6. There are eleven
(11) available points. However, to earn 100%, the threshold is just 10 points. (Therefore, once you hit 10 points, you can stop. There is no extra credit for
exceeding this threshold.)
Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the
next one. However, if you see a code cell introduced by the phrase, "Sample result(s) for ...", please run it. Some demo cells in the notebook may depend
on these precomputed results.
The point values of individual exercises are as follows:
Exercise 0: 1 point
Exercise 1: 1 point
Exercise 2: 2 points
Exercise 3: 2 points
Exercise 4: 2 points
Exercise 5: 1 point
Exercise 6: 2 points
Pro-tips.
Many or all test cells use randomly generated inputs. Therefore, try your best to write solutions that do not assume too much. To help you debug,
when a test cell does fail, it will often tell you exactly what inputs it was using and what output it expected, compared to yours.
If you need a complex SQL query, remember that you can define one using a triple-quoted (multiline) string.
If your program's behavior seems strange, try resetting the kernel and rerunning everything.
If you mess up this notebook or just want to start from scratch, save copies of all your partial responses and use Actions → Reset Assignment to
get a fresh, original copy of this notebook. (Resetting will wipe out any answers you've written so far, so be sure to stash those somewhere safe if you
intend to keep or reuse them!)
If you generate excessive output that causes the notebook to load slowly or not at all (e.g., from an ill-placed print statement), use Actions →
Clear Notebook Output to get a clean copy. The clean copy will retain your code but remove any generated output. However, it will also rename
the notebook to clean.xxx.ipynb. Since the autograder expects a notebook file with the original name, you'll need to rename the clean notebook
accordingly. Be forewarned: we won't manually grade "cleaned" notebooks if you forget!
Good luck!
Background and overview: Spotify playlist data
Suppose you are running a musical service and would like to help your users discover artists based on artists they already like. In this problem, you'll prototype
a simple recommender by mining a dataset of user-generated playlists from Spotify, circa 2015.
Your overall workflow will be as follows:
1. Manually inspect the data and how it is stored
2. Gather some preliminary statistics to get a "feel" for the data
3. Clean the data a bit, namely by "normalizing" artist names
4. Use ideas from Notebook 2 to analyze artist co-occurrences in playlists
With that in mind, let's start!
Modules and data. Run the following two code cells, which load some modules this notebook needs as well as the data itself.
The data for this problem are several hundred megabytes in size and so may take a minute to load.
In [1]: ### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
### END HIDDEN TESTS
from pprint import pprint
from testing_tools import load_pickle
print("Ready!")
Opening pickle from './resource/asnlib/publicdata/user_ids.pickle' ...
Opening pickle from './resource/asnlib/publicdata/artist_names.pickle' ...
Opening pickle from './resource/asnlib/publicdata/playlist_names.pickle' ...
Opening pickle from './resource/asnlib/publicdata/track_titles.pickle' ...
Opening pickle from './resource/asnlib/publicdata/artist_translation_table.pickle' ...
Ready!
In [2]: !date
spotify_users = load_pickle('user_playlists.pickle')
print("==> Finished loading the data.")
!date
Fri 26 Feb 2021 12:27:54 AM PST
Opening pickle from './resource/asnlib/publicdata/user_playlists.pickle' ...
==> Finished loading the data.
Fri 26 Feb 2021 12:28:03 AM PST
Familiarize yourself with these data
The variable spotify_users holds the data you'll need. It consists of a list of nearly 16,000 users:
In [3]: print(f"`spotify_users`: type == {type(spotify_users)}, number of elements == {len(spotify_users):,}.")
`spotify_users`: type == <class 'list'>, number of elements == 15,918.
Each element of this list corresponds to a distinct user. Have a look at the user at position 2526 of this list:
In [4]: pprint(spotify_users[2526])
{'playlists': [{'name': 'Favoritas de la radio',
                'tracks': [{'artist': 'Vico C', 'title': 'Desahogo'},
                           {'artist': 'Vico C',
                            'title': 'El Bueno, El Malo Y El Feo (The Good, '
                                     'The Bad & The Ugly) - Feat. Tego '
                                     'Calderón And Eddie Dee'},
                           {'artist': 'Vico C', 'title': 'Quieren'},
                           {'artist': 'Vico C',
                            'title': "Vamonos Po' Encima"}]},
               {'name': 'Starred',
                'tracks': [{'artist': 'Vico C', 'title': 'El'},
                           {'artist': 'Strike 3', 'title': 'Enamorado De Ti'},
                           {'artist': 'Strike 3', 'title': 'Es Por Ti'}]},
               {'name': 'Two',
                'tracks': [{'artist': 'Walk the Moon', 'title': 'Quesadilla'},
                           {'artist': 'Two Door Cinema Club',
                            'title': 'Sleep Alone'},
                           {'artist': 'Two Door Cinema Club',
                            'title': 'Something Good Can Work'},
                           {'artist': 'Two Door Cinema Club',
                            'title': 'Sun'}]}],
 'user_id': '22c5af0c50b557327894d0c9ea6aa5fa'}
Every user has a unique user ID (a hex string) as well as a list of playlists that they have created. Each playlist is named and consists of a list of songs or tracks.
Each track has a title and is performed by an artist (musician or group).
Take a minute to understand how this data is stored: note what data structures are being used (e.g., dictionaries versus lists), for what purpose, and how they
are nested.
If you understand the storage scheme, you should be able to verify the following facts about the above user:
1. The user's ID is '22c5af0c50b557327894d0c9ea6aa5fa'.
2. The user has three playlists, one named 'Favoritas de la radio', another named 'Starred', and the last named 'Two'.
3. The 'Favoritas de la radio' playlist has four songs, all of which were performed by the same artist, 'Vico C'.
4. The 'Starred' playlist has one song also by 'Vico C', but includes two songs by a different artist, 'Strike 3'.
5. The 'Two' playlist has four songs: one by 'Walk the Moon' and three by 'Two Door Cinema Club'.
Other users may have only one playlist with just one song, or many playlists with many songs by many artists.
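As a minimal sketch of how the nesting can be walked, the snippet below loops over users, then playlists, then tracks. The names `tiny_users` and `iter_tracks` are hypothetical and exist only for this illustration; the miniature dataset merely mirrors the structure of `spotify_users`.

```python
# Hypothetical miniature dataset with the same nesting as `spotify_users`.
tiny_users = [
    {'user_id': 'u0',
     'playlists': [{'name': 'Starred',
                    'tracks': [{'artist': 'Vico C', 'title': 'El'},
                               {'artist': 'Strike 3', 'title': 'Es Por Ti'}]}]},
    {'user_id': 'u1',
     'playlists': [{'name': 'Two',
                    'tracks': [{'artist': 'Walk the Moon', 'title': 'Quesadilla'}]}]},
]

def iter_tracks(users):
    """Yield (user_id, playlist_name, artist, title) for every track."""
    for user in users:                        # outer list: one dict per user
        for playlist in user['playlists']:    # each user holds a list of playlists
            for track in playlist['tracks']:  # each playlist holds a list of tracks
                yield user['user_id'], playlist['name'], track['artist'], track['title']

rows = list(iter_tracks(tiny_users))
for row in rows:
    print(row)
```

Each printed tuple corresponds to one track, so the number of tuples equals the total number of tracks across all users and playlists.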
Part A: Preliminary analysis
To make sure you know how to navigate these data, let's start with two basic exercises.
Exercise 0: count_playlists (1 point)
Given a user playlist dataset, users, complete the function, count_playlists(users) so that it returns the total number of playlists.
For instance, suppose the user dataset consists of the following two users:
In [5]: ex0_demo_users = [{'user_id': '0c8435917bd098dce8df8f62b736c0ed',
                           'playlists': [{'name': 'Starred',
                                          'tracks': [{'artist': 'André Rieu',
                                                      'title': 'Once Upon A Time In The West - Main Title Theme'},
                                                     {'artist': 'André Rieu',
                                                      'title': 'The Second Waltz - From Eyes Wide Shut'}]}]},
                          {'user_id': 'fc799d71e8d2004377d6d8e861479559',
                           'playlists': [{'name': 'Liked from Radio',
                                          'tracks': [{'artist': 'The Police', 'title': 'Every Breath You Take'},
                                                     {'artist': 'Lucio Battisti', 'title': 'Per Una Lira'},
                                                     {'artist': 'Alicia Keys ft. Jay-Z', 'title': 'Empire State of Mind'}]},
                                         {'name': 'Starred', 'tracks': [{'artist': 'U2', 'title': 'With Or Without You'}]}]}]
Then count_playlists(ex0_demo_users) would return 1+2=3, because the first user has one playlist (named 'Starred') and the second has two
playlists (one named 'Liked from Radio' and the other also named 'Starred').
In [6]: def count_playlists(users):
            ### BEGIN SOLUTION
            from random import randint
            return count_playlists__soln1(users)
        def count_playlists__soln0(users):
            num_playlists = 0
            for user in users:
                num_playlists += len(user['playlists'])
            return num_playlists
        def count_playlists__soln1(users):
            return sum(len(user['playlists']) for user in users)
            ### END SOLUTION
In [7]: # Demo cell
count_playlists(ex0_demo_users) # should return 3
Out[7]: 3
In [8]: # Test cell 0: `mt1_ex0_count_playlists` (1 point)
        ### BEGIN HIDDEN TESTS
        global_overwrite = False
        def tracks_iterator(users, offsets=False):
            for i, u in enumerate(users):
                for j, p in enumerate(u['playlists']):
                    for k, t in enumerate(p['tracks']):
                        if offsets:
                            yield u, p, t, i, j, k
                        else:
                            yield u, p, t
        def randomly_error(threshold=0.05):
            from random import random
            return random() < threshold
        def mt1_ex0__gen_soln():
            print(f"The Spotify dataset consists of {count_playlists(spotify_users):,} playlists in total.")
        #!date
        #mt1_ex0__gen_soln()
        #!date
        ### END HIDDEN TESTS
        #from testing_tools import mt1_ex0__check
        #print("Testing...")
        assert count_playlists(spotify_users) == 231_844
        from testing_tools import mt1_ex0__check
        for trial in range(250):
            mt1_ex0__check(count_playlists)
        print("\n(Passed!)")
(Passed!)
Exercise 1: count_artist_strings (1 point)
For your next task, suppose we wish to count how many distinct case-insensitive artist strings are in the dataset (across all users and playlists). By "distinct
case-insensitive," we mean two strings a and b would be "equal" if, after conversion to lowercase, they are equal in the Python sense of a == b. For example,
we would treat 'Jay-Z' and 'JAY-Z' as equal, but we would regard 'Jay-Z' (with a hyphen) and 'Jay Z' (without a hyphen) as unequal.
In a subsequent exercise, we will try to normalize names in a different way.
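To make the definition concrete, here is a small sketch using a hypothetical list of four spellings. Lowercasing merges the variants that differ only in capitalization, while the hyphenated and unhyphenated forms remain distinct.

```python
# Hypothetical spellings, chosen only to illustrate the definition above.
names = ['Jay-Z', 'JAY-Z', 'Jay Z', 'jay-z']

# A set keeps one copy of each lowercase string.
distinct = {name.lower() for name in names}

print(sorted(distinct))
print(len(distinct))
```

Note that `str.lower()` returns a new string, so the original `names` list is left as it was.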
Your task. Given a user playlist dataset, users, complete the function count_artist_strings(users) below so that it counts the number of distinct
case-insensitive artist strings contained in users.
For example, recall the demo dataset from Exercise 0:
In [9]: pprint(ex0_demo_users)
[{'playlists': [{'name': 'Starred',
                 'tracks': [{'artist': 'André Rieu',
                             'title': 'Once Upon A Time In The West - Main '
                                      'Title Theme'},
                            {'artist': 'André Rieu',
                             'title': 'The Second Waltz - From Eyes Wide '
                                      'Shut'}]}],
  'user_id': '0c8435917bd098dce8df8f62b736c0ed'},
 {'playlists': [{'name': 'Liked from Radio',
                 'tracks': [{'artist': 'The Police',
                             'title': 'Every Breath You Take'},
                            {'artist': 'Lucio Battisti',
                             'title': 'Per Una Lira'},
                            {'artist': 'Alicia Keys ft. Jay-Z',
                             'title': 'Empire State of Mind'}]},
                {'name': 'Starred',
                 'tracks': [{'artist': 'U2', 'title': 'With Or Without You'}]}],
  'user_id': 'fc799d71e8d2004377d6d8e861479559'}]
Looking across all users and playlists, this dataset has five (5) distinct artist strings: 'André Rieu', 'The Police', 'Lucio Battisti', 'Alicia Keys
ft. Jay-Z', and 'U2'. Observe that 'André Rieu' appears twice, but for our tally, we would count it just once. And if 'the POLICE' had been in the
data, then it would be considered the same as 'The Police'.
Note: Your function must not modify the input dataset. Even if your code returns a correct result, if it changes the input data, the autograder will
mark it as incorrect.
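One way to check this requirement yourself is to snapshot the input before the call and compare afterward. The sketch below is an assumption about how such a check could look, not a description of the autograder; the helper `leaves_input_unchanged` and both counting functions are hypothetical.

```python
from copy import deepcopy

def leaves_input_unchanged(func, data):
    """Return True if calling func(data) leaves data equal to its prior state."""
    snapshot = deepcopy(data)  # independent copy taken before the call
    func(data)
    return data == snapshot

def count_lower_safe(users):
    # Builds new lowercase strings; the input is only read.
    return len({t['artist'].lower()
                for u in users for p in u['playlists'] for t in p['tracks']})

def count_lower_unsafe(users):
    # Overwrites each artist string in place, which alters the caller's data.
    seen = set()
    for u in users:
        for p in u['playlists']:
            for t in p['tracks']:
                t['artist'] = t['artist'].lower()
                seen.add(t['artist'])
    return len(seen)

sample = [{'user_id': 'u0',
           'playlists': [{'name': 'Starred',
                          'tracks': [{'artist': 'The Police',
                                      'title': 'Every Breath You Take'}]}]}]

print(leaves_input_unchanged(count_lower_safe, deepcopy(sample)))    # True
print(leaves_input_unchanged(count_lower_unsafe, deepcopy(sample)))  # False
```

Both functions return the same count, but only the first leaves the dataset intact.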
In [10]: def count_artist_strings(users):
             ### BEGIN SOLUTION
             artist_strings = set()
             for user in users:
                 for playlist in user['playlists']:
                     for track in playlist['tracks']:
                         artist_strings |= {track['artist'].lower()}
             return len(artist_strings)
             ### END SOLUTION
In [11]: # Demo: Should return '5'
count_artist_strings(ex0_demo_users)
Out[11]: 5
In [12]: # Test cell 0: `mt1_ex1_count_artist_strings` (1 point)
         ### BEGIN HIDDEN TESTS
         def mt1_ex1__gen_soln():
             print(f"The Spotify dataset contains a total of {count_artist_strings(spotify_users):,} "
                   "distinct case-insensitive artist strings.")
         #!date
         #mt1_ex1__gen_soln()
         #!date
         ### END HIDDEN TESTS
         #assert count_artist_strings(spotify_users) == 282_555
         from testing_tools import mt1_ex1__check
         print("Testing...")
         for trial in range(250):
             mt1_ex1__check(count_artist_strings)
         print("\n(Passed!)")
Testing...
(Passed!)
Answer for this dataset. If your function works correctly, running it on the full Spotify dataset would result in 282,555 distinct case-insensitive artist strings.
That's a lot of artists! (We have omitted this check to reduce the running time of the notebook.)
Part B: Data cleaning
Unfortunately, artist names are encoded in a messy fashion. Here are some examples:
The artist "Jay-Z" is written as "Jay-Z" and "JAY Z", with several other variations having different capitalization.
Worse, there is no consistent standard for encoding multiple artists who worked together on a song. For example, here is how several of Jay-Z's
collaborations appear:
'Alicia Keys ft. Jay-Z' ⟹ ('ft.' used as an artist-separator)
'A-Trak x Kanye x Jay-Z' ⟹ (' x ' used as an artist-separator)
'JAY Z Featuring Beyoncé' ⟹ (variation on "Jay-Z" and yet another variation on "featuring" to separate artists)
'Jay-Z Featuring Beyoncé Knowles' ⟹ (Beyoncé's last name included in this variation)
'Jay-Z/Kanye West/Lil Wayne/T.I.' ⟹ (... you get the idea ...)
'Jay Z (Dr. Dre, Rakim, & Truth Hurts)'
'Young Jeezy Ft. Jay-Z & Fat Joe'
'Lil Wayne Drake Jay-Z And Gif Majorz' ⟹ (spaces used ambiguously: there are four artists in this example!)
'Timbaland & Magoo feat Jay-Z'
'OutKast/Jay-Z/Killer Mike'
'Jay-Z Ft.Rihanna And Kanye West'
'Pat Benetar vs. Beyonce vs. 3OH!3 Feat. Britney Spears, Christina Aguilera, & M.I.A.' ⟹ ("Benatar" is misspelled as 'Benetar')
'jay z with the roots.     s' ⟹ (yes, excess spaces between . and trailing s are real)
It is difficult to design a robust algorithm to extract individual artist names. Instead, let's use the following approximate algorithm, given an artist-string as input.
1. Lowercase: First, convert all characters to lowercase.
2. Space-equivalents: Next, convert every hyphen ('-'), period ('.'), question mark ('?'), exclamation point ('!'), and underscore ('_') into a space character.
3. Separators: Then split the string, treating the following patterns as artist-name separators:
A. All of the following words, but only when there are spaces both before and after: 'and', 'with', 'ft', 'feat', 'featuring', 'vs', and
'x'.
B. All of the following symbols: '/' (forward slash), '&' (ampersand), comma (','), semicolon (';'), and each enclosing parenthesis or
bracket ('(', ')', '[', ']', '{', '}')
4. Whitespace compression: Lastly, for any artist name-string following the above separation steps, strip out any preceding and trailing whitespace and
collapse multiple consecutive whitespace characters into a single space.
When applying this algorithm, we'll perform steps 1-4 in the exact same sequence as shown above.
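To see how the four steps interact, the sketch below traces one input through each stage. It is an illustrative regex-based variant, not the reference solution: the function name `trace_normalization` is hypothetical, and unusual inputs (for example, two separator words in a row) may be handled differently than by the autograder's version.

```python
import re

def trace_normalization(raw):
    """Return the intermediate result of each stage for one artist string."""
    lowered = raw.lower()                                  # step 1
    spaced = re.sub(r'[-.?!_]', ' ', lowered)              # step 2
    # Step 3A: separator words preceded and followed by a space. The lookahead
    # leaves the trailing space unconsumed so adjacent separators still match.
    marked = re.sub(r' (?:featuring|feat|ft|and|with|vs|x)(?= )', ' & ', spaced)
    # Step 3B: split on symbol separators (and on the '&' markers from 3A).
    pieces = re.split(r'[/&,;()\[\]{}]', marked)
    # Step 4: trim and collapse whitespace; drop pieces that end up empty.
    names = {' '.join(p.split()) for p in pieces} - {''}
    return {'lowered': lowered, 'spaced': spaced, 'names': names}

steps = trace_normalization('Jay-Z Ft.Rihanna And Kanye West')
for label, value in steps.items():
    print(f'{label:>8}: {value!r}')
```

The trace shows why the order matters: 'Ft.Rihanna' only exposes ' ft ' as a separator after step 2 has turned the period into a space.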
Exercise 2: extract_artists (2 points)
Complete the function extract_artists(artist) so that it applies the artist name-separation algorithm described above, returning a Python set consisting
of the separate artist names. For example:
'Alicia Keys ft. Jay-Z' ==> {'jay z', 'alicia keys'}
'A-Trak x Kanye x Jay-Z' ==> {'a trak', 'jay z', 'kanye'}
'JAY Z Featuring Beyoncé' ==> {'jay z', 'beyoncé'}
'Jay-Z Featuring Beyoncé Knowles' ==> {'beyoncé knowles', 'jay z'}
'Jay-Z/Kanye West/Lil Wayne/T.I.' ==> {'lil wayne', 'jay z', 't i', 'kanye west'}
'Young Jeezy Ft. Jay-Z & Fat Joe' ==> {'fat joe', 'jay z', 'young jeezy'}
'Lil Wayne Drake Jay-Z And Gif Majorz' ==> {'gif majorz', 'lil wayne drake jay z'}
'Timbaland & Magoo feat Jay-Z' ==> {'jay z', 'timbaland', 'magoo'}
'OutKast/Jay-Z/Killer Mike' ==> {'outkast', 'jay z', 'killer mike'}
'Jay-Z Ft.Rihanna And Kanye West' ==> {'rihanna', 'jay z', 'kanye west'}
'Pat Benetar vs. Beyonce vs. 3OH!3 Feat. Britney Spears, Christina Aguilera, & M.I.A.' ==> {'m i a', 'beyonce', 'christina aguilera', 'pat benetar', 'britney spears', '3oh 3'}
'jay z with the roots.     s' ==> {'jay z', 'the roots s'}
Note 0: Pay close attention to the target output.
Note 1: This procedure is imperfect. For example, observe that 'Lil Wayne Drake Jay-Z And Gif Majorz' is, in reality, four artists (Lil'
Wayne, Drake, Jay-Z, and Gif Majorz), but the algorithm cannot disambiguate the intention of spaces. Also, in the last example, even though in
reality 'the roots.     s' should resolve to 'the roots', it instead becomes 'the roots s'. And a band like 'Tom Petty and
the Heartbreakers' will be erroneously split into two artists ('Tom Petty' and 'the Heartbreakers'). But it is what it is.
In [13]: def extract_artists(artist):
             ### BEGIN SOLUTION
             import re
             artist = artist.lower() # convert to lowercase
             for space_equivalent in '-.?!_': # step 2: space-equivalents
                 artist = artist.replace(space_equivalent, ' ')
             # step 3A: rewrite separator words as '&' so one split handles everything
             for separator_word in ['featuring', 'feat', 'ft', 'and', 'with', 'vs', 'x']:
                 artist = artist.replace(f' {separator_word} ', ' & ')
             for and_equivalent in '/&,;()[]{}': # step 3B: symbol separators
                 artist = artist.replace(and_equivalent, ' & ')
             artists = artist.split('&')
             # step 4: collapse whitespace (raw string avoids an invalid-escape warning)
             artists = set(re.sub(r'\s+', ' ', a).strip() for a in artists)
             return {a for a in artists if a} # prune empty strings
             ### END SOLUTION
In [14]: # Demo
         ex0_inputs = ['Alicia Keys ft. Jay-Z',
                       'A-Trak x Kanye x Jay-Z',
                       'JAY Z Featuring Beyoncé',
                       'Jay-Z Featuring Beyoncé Knowles',
                       'Jay-Z/Kanye West/Lil Wayne/T.I.',
                       'Young Jeezy Ft. Jay-Z & Fat Joe',
                       'Lil Wayne Drake Jay-Z And Gif Majorz',
                       'Timbaland & Magoo feat Jay-Z',
                       'OutKast/Jay-Z/Killer Mike',
                       'Jay-Z Ft.Rihanna And Kanye West',
                       'Pat Benetar vs. Beyonce vs. 3OH!3 Feat. Britney Spears, Christina Aguilera, & M.I.A.',
                       'jay z with the roots.     s']
         for a in ex0_inputs:
             print(f"'{a}' ==> {extract_artists(a)}")
'Alicia Keys ft. Jay-Z' ==> {'alicia keys', 'jay z'}
'A-Trak x Kanye x Jay-Z' ==> {'kanye', 'a trak', 'jay z'}
'JAY Z Featuring Beyoncé' ==> {'jay z', 'beyoncé'}
'Jay-Z Featuring Beyoncé Knowles' ==> {'beyoncé knowles', 'jay z'}
'Jay-Z/Kanye West/Lil Wayne/T.I.' ==> {'kanye west', 't i', 'lil wayne', 'jay z'}
'Young Jeezy Ft. Jay-Z & Fat Joe' ==> {'fat joe', 'young jeezy', 'jay z'}
'Lil Wayne Drake Jay-Z And Gif Majorz' ==> {'gif majorz', 'lil wayne drake jay z'}
'Timbaland & Magoo feat Jay-Z' ==> {'timbaland', 'magoo', 'jay z'}
'OutKast/Jay-Z/Killer Mike' ==> {'killer mike', 'outkast', 'jay z'}
'Jay-Z Ft.Rihanna And Kanye West' ==> {'rihanna', 'jay z', 'kanye west'}
'Pat Benetar vs. Beyonce vs. 3OH!3 Feat. Britney Spears, Christina Aguilera, & M.I.A.' ==> {'beyonce', '3oh 3', 'pat benetar', 'christina aguilera', 'm i a', 'britney spears'}
'jay z with the roots.     s' ==> {'the roots s', 'jay z'}
In [15]: # Test cell: `mt1_ex2_extract_artists` (2 points)
         ### BEGIN HIDDEN TESTS
         def mt1_ex2__gen_soln(fn_base="artist_translation_table", fn_ext="pickle", overwrite=False):
             from testing_tools import file_exists, load_pickle, save_pickle
             fn = f"{fn_base}.{fn_ext}"
             if file_exists(fn) and not overwrite:
                 print(f"'{fn}' exists; skipping...")
             else: # not file_exists(fn) or overwrite
                 print(f"'{fn}' does not exist or needs to be overwritten; generating...")
                 artist_translation_table = {}
                 for _, _, t in tracks_iterator(spotify_users):
                     artist_translation_table[t['artist']] = extract_artists(t['artist'])
                 save_pickle(artist_translation_table, fn)
         !date
         mt1_ex2__gen_soln(overwrite=False or global_overwrite)
         !date
         ### END HIDDEN TESTS
         from testing_tools import mt1_ex2__check
         print("Testing...")
         for trial in range(250):
             mt1_ex2__check(extract_artists)
         extract_artists__passed = True
         print("\n(Passed!)")
Fri 26 Feb 2021 12:28:04 AM PST
'artist_translation_table.pickle' exists; skipping...
Fri 26 Feb 2021 12:28:04 AM PST
Testing...
(Passed!)
Sample results for Exercise 2: artist_translation_table
If you had a working solution to Exercise 2, then in principle you could use it to normalize and separate the artist names. We have precomputed these
translations for you, for every artist name that appears in the data; run the cell below to load a name-translation table, stored in the variable,
artist_translation_table.
Read and run this cell even if you skipped or otherwise did not complete Exercise 2.
In [16]: from testing_tools import mt1_artist_translation_table as artist_translation_table
         print("\n=== Examples ===")
         for q in ex0_inputs[:5]:
             print(f"artist_translation_table['{q}'] \\\n    == {artist_translation_table[q]}")
=== Examples ===
artist_translation_table['Alicia Keys ft. Jay-Z'] \
== {'alicia keys', 'jay z'}
artist_translation_table['A-Trak x Kanye x Jay-Z'] \
== {'kanye', 'a trak', 'jay z'}
artist_translation_table['JAY Z Featuring Beyoncé'] \
== {'beyoncé', 'jay z'}
artist_translation_table['Jay-Z Featuring Beyoncé Knowles'] \
== {'beyoncé knowles', 'jay z'}
artist_translation_table['Jay-Z/Kanye West/Lil Wayne/T.I.'] \
== {'jay z', 't i', 'lil wayne', 'kanye west'}
Part C: Gathering playlists
The data structure has a complicated nesting. Let's "flatten" it by collecting just the playlists. And for each playlist, let's keep only the artist names.
For example, recall the demo dataset from before:
In [17]: pprint(ex0_demo_users)
[{'playlists': [{'name': 'Starred',
                 'tracks': [{'artist': 'André Rieu',
                             'title': 'Once Upon A Time In The West - Main '
                                      'Title Theme'},
                            {'artist': 'André Rieu',
                             'title': 'The Second Waltz - From Eyes Wide '
                                      'Shut'}]}],
  'user_id': '0c8435917bd098dce8df8f62b736c0ed'},
 {'playlists': [{'name': 'Liked from Radio',
                 'tracks': [{'artist': 'The Police',
                             'title': 'Every Breath You Take'},
                            {'artist': 'Lucio Battisti',
                             'title': 'Per Una Lira'},
                            {'artist': 'Alicia Keys ft. Jay-Z',
                             'title': 'Empire State of Mind'}]},
                {'name': 'Starred',
                 'tracks': [{'artist': 'U2', 'title': 'With Or Without You'}]}],
  'user_id': 'fc799d71e8d2004377d6d8e861479559'}]
The first user has one playlist with two tracks by the same artist. The second user has two playlists, one playlist with three tracks and four artists (since one
track has a compound artist name), and the other playlist with one track.
For our next task, we'd like to construct a copy of this data with the following simpler structure:
In [18]: ex3_demo_output = [{'André Rieu'},
                            {'The Police', 'Lucio Battisti', 'Alicia Keys ft. Jay-Z'},
                            {'U2'}]
This object is simply a Python list of Python sets, with the "outer" list containing playlists and each playlist consisting only of distinct artist strings (without
postprocessing per Exercise 2—we'll handle that later).
Exercise 3: extract_playlists (2 points)
Complete the function, extract_playlists(users), so that it returns the simplified list of artist names as shown above. For instance, calling
extract_playlists(ex0_demo_users) should return an object that matches ex3_demo_output.
Note 0: You should not process the artist names per Exercise 2; that step comes later.
Note 1: You should preserve the exact order of playlists from the input. That is, you should loop over users and playlists in the order that they
appear in the input and produce the corresponding output in that same order.
Note 2: Do not forget that the final output should be a Python list (holding playlists) of Python sets (unprocessed artist names).
Note 3: Your function should not modify the input dataset.
In [19]: def extract_playlists(users):
             ### BEGIN SOLUTION
             playlists = []
             for user in users:
                 for playlist in user['playlists']:
                     artists = set()
                     for track in playlist['tracks']:
                         artists |= {track['artist']}
                     playlists.append(artists)
             return playlists
             ### END SOLUTION
In [20]: # Demo cell
ex3_your_output = extract_playlists(ex0_demo_users)
print("=== Your output ===")
pprint(ex3_your_output)
assert all(a == b for a, b in zip(ex3_your_output, ex3_demo_output)), "Your output does not match the demo output!"
print("\n(Your output matches the demo output — so far, so good!)")
=== Your output ===
[{'André Rieu'},
{'Lucio Battisti', 'Alicia Keys ft. Jay-Z', 'The Police'},
{'U2'}]
(Your output matches the demo output — so far, so good!)
In [21]: # Test cell: `mt1_ex3_extract_playlists` (2 points)
         ### BEGIN HIDDEN TESTS
         def mt1_ex3__gen_soln(fn_base="simple_playlists", fn_ext="pickle", overwrite=False):
             from testing_tools import file_exists, load_pickle, save_pickle
             fn = f"{fn_base}.{fn_ext}"
             if file_exists(fn) and not overwrite:
                 print(f"'{fn}' exists; skipping...")
             else: # not file_exists(fn) or overwrite
                 print(f"'{fn}' does not exist or needs to be overwritten; generating...")
                 simple_playlists = extract_playlists(spotify_users)
                 save_pickle(simple_playlists, fn)
         !date
         mt1_ex3__gen_soln(overwrite=False or global_overwrite)
         !date
         ### END HIDDEN TESTS
         from testing_tools import mt1_ex3__check
         print("Testing...")
         for trial in range(250):
             mt1_ex3__check(extract_playlists)
         extract_playlists__passed = True
         print("\n(Passed!)")
Fri 26 Feb 2021 12:28:04 AM PST
'simple_playlists.pickle' exists; skipping...
Fri 26 Feb 2021 12:28:05 AM PST
Testing...
(Passed!)
Sample results for Exercise 3: simple_playlists
If you had a working solution to Exercise 3, then in principle you could use it to construct simplified playlists for the full Spotify dataset. Instead, we have
precomputed these for you, for every playlist in that dataset; run the cell below to load them into a variable named simple_playlists.
Read and run this cell even if you skipped or otherwise did not complete Exercise 3.
In [22]: simple_playlists = load_pickle('simple_playlists.pickle')
print("\n=== Examples (first three playlists) ===")
pprint(simple_playlists[:3])
Opening pickle from './resource/asnlib/publicdata/simple_playlists.pickle' ...
=== Examples (first three playlists) ===
[{'Cocktail Slippers',
'Crosby, Stills & Nash',
'Crowded House',
'Elvis Costello',
'Elvis Costello & The Attractions',
'Joe Echo',
'Joshua Radin',
'Lissie',
'Paul McCartney',
'Paul McCartney & Eric Clapton',
'The Breakers',
'The Coronas',
'The Len Price 3',
'Tiffany Page'},
{'Biffy Clyro',
'Bruce Springsteen',
'Elbow',
'Madness',
'Miles Kane',
'Noah And The Whale',
"Noel Gallagher's High Flying Birds",
'Oasis',
'Pearl Jam',
'Spector',
'Thunderclap Newman',
'Tom Petty',
'Tom Petty And The Heartbreakers'},
{'2080'}]
Part D: An itemset representation
Our artist-recommender system will reuse ideas from Notebook 2 (pairwise association rule mining). The next two exercises do so.
But first, we'll need to identify analogues of baskets (or receipts) and items for our artist-recommender problem. Here is how we'll do that.
Receipts (baskets): Let's consider each playlist to be a receipt.
Items: Let's consider each distinct artist (after name normalization per Exercise 2!) to be an item.
Example. Recall the simplified playlists example, ex3_demo_output, from Exercise 3:
In [23]: print(ex3_demo_output)
[{'André Rieu'}, {'Lucio Battisti', 'Alicia Keys ft. Jay-Z', 'The Police'}, {'U2'}]
Since there are three playlists, there are three "receipts." We want to treat each one as an itemset consisting of normalized artist names, per Exercise 2.
In [24]: ex4_demo_output = [{'andré rieu'}, {'the police', 'lucio battisti', 'alicia keys', 'jay z'}, {'u2'}]
Observe that the second playlist includes one track having a compound artist name, 'Alicia Keys ft. Jay-Z'. In these instances, each collaborating
artist should become an element of the itemset. Here, both 'alicia keys' and 'jay z' appear in the output.
Whether your Exercise 2 works or not, recall that we precomputed translations from raw artist name strings to itemsets. These are stored in
artist_translation_table, e.g.:
In [25]: artist_translation_table['Alicia Keys ft. Jay-Z']
Out[25]: {'alicia keys', 'jay z'}
Code reuse from Notebook 2. In addition to its concepts, Notebook 2 also has a lot of code we want you to reuse.
For example, recall the make_itemsets(receipts) function. Given a bunch of receipts, it converts each receipt into an itemset, a Python set of its items.
Here is a generalized version of that code, which allows the user to supply a function, make_set, for converting one receipt into an itemset.
In [26]: def make_itemsets(receipts, make_set=set):
             return [make_set(r) for r in receipts]
For example, recall how this function worked in the case where "words" are receipts and the individual letters are items. In that case, simply calling the
default, set, on one receipt creates an itemset:
In [27]: make_itemsets(['hello', 'world'])
Out[27]: [{'e', 'h', 'l', 'o'}, {'d', 'l', 'o', 'r', 'w'}]
To use make_itemsets for our problem, we need to create a function that is compatible with the requirements of the make_set argument. That is your next
task.
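As a rough illustration of what "compatible with make_set" means, any callable that takes one receipt and returns a Python set will do. In the sketch below, `make_itemsets` is repeated from the cell above so the snippet runs on its own, and `vowels_only` is a hypothetical stand-in for the function you will write.

```python
def make_itemsets(receipts, make_set=set):
    # Same definition as in the notebook cell above.
    return [make_set(r) for r in receipts]

def vowels_only(word):
    """Hypothetical make_set: keep only the lowercase vowels of one receipt."""
    return {c for c in word.lower() if c in 'aeiou'}

vowel_itemsets = make_itemsets(['Hello', 'World'], make_set=vowels_only)
print(vowel_itemsets)
```

The list comprehension calls `vowels_only` once per receipt, so the output has one set per input word, in the same order.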
Exercise 4: normalize_artist_set (2 points)
Complete the function, normalize_artist_set(artist_set), where artist_set is a Python set of unprocessed artist names. It should return a Python
set of normalized artist names, per Exercise 2.
For instance,
normalize_artist_set({'Alicia Keys ft. Jay-Z', 'Lucio Battisti', 'The Police'})
should return
{'the police', 'lucio battisti', 'alicia keys', 'jay z'}
Note: You may reuse your function from Exercise 2, if you are confident it is bug-free; otherwise, we recommend using the precomputed values
in artist_translation_table.
In [28]: def normalize_artist_set(artist_set):
             ### BEGIN SOLUTION
             global artist_translation_table # not strictly necessary, but self-documenting
             output_set = set()
             for a in artist_set:
                 output_set |= artist_translation_table[a]
             return output_set
             ### END SOLUTION
In [29]: # Demo cell:
normalize_artist_set({'Alicia Keys ft. Jay-Z', 'Lucio Battisti', 'The Police'})
# expected output: `{'alicia keys', 'jay z', 'lucio battisti', 'the police'}`
Out[29]: {'alicia keys', 'jay z', 'lucio battisti', 'the police'}
In [30]: # Test cell: `mt1_ex4_normalize_artist_set` (2 points)
         ### BEGIN HIDDEN TESTS
         def mt1_ex4__gen_soln(fn_base="normalized_artist_sets", fn_ext="pickle", overwrite=False):
             from testing_tools import file_exists, load_pickle, save_pickle
             def make_itemsets(receipts, make_set=set):
                 return [make_set(r) for r in receipts]
             fn = f"{fn_base}.{fn_ext}"
             if file_exists(fn) and not overwrite:
                 print(f"'{fn}' exists; skipping...")
             else: # not file_exists(fn) or overwrite
                 print(f"'{fn}' does not exist or needs to be overwritten; generating...")
                 simple_playlists = load_pickle('simple_playlists.pickle')
                 normalized_artist_sets = make_itemsets(simple_playlists, make_set=normalize_artist_set)
                 save_pickle(normalized_artist_sets, fn)
         !date
         mt1_ex4__gen_soln(overwrite=False or global_overwrite)
         !date
         ### END HIDDEN TESTS
         from testing_tools import mt1_ex4__check
         print("Testing...")
         for trial in range(250):
             mt1_ex4__check(normalize_artist_set)
         normalize_artist_set__passed = True
         print("\n(Passed!)")
Fri 26 Feb 2021 12:28:08 AM PST
'normalized_artist_sets.pickle' exists; skipping...
Fri 26 Feb 2021 12:28:08 AM PST
Testing...
(Passed!)
Sample results for Exercise 4: artist_itemsets
If you had a working solution to Exercise 4, then in principle you could use it to construct artist itemsets for all of the playlists. Instead, we have precomputed
these for you; run the cell below to load it into a variable named artist_itemsets.
Read and run this cell even if you skipped or otherwise did not complete Exercise 4.
In [31]: artist_itemsets = load_pickle('normalized_artist_sets.pickle')
print("\n=== Examples (first three playlists) ===")
pprint(artist_itemsets[:3])
Opening pickle from './resource/asnlib/publicdata/normalized_artist_sets.pickle' ...
=== Examples (first three playlists) ===
[{'cocktail slippers',
'crosby',
'crowded house',
'elvis costello',
'eric clapton',
'joe echo',
'joshua radin',
'lissie',
'nash',
'paul mccartney',
'stills',
'the attractions',
'the breakers',
'the coronas',
'the len price 3',
'tiffany page'},
{'biffy clyro',
'bruce springsteen',
'elbow',
'madness',
'miles kane',
'noah',
"noel gallagher's high flying birds",
'oasis',
'pearl jam',
'spector',
'the heartbreakers',
'the whale',
'thunderclap newman',
'tom petty'},
{'2080'}]
Exercise 5: get_artist_counts (1 point)
For the Notebook 2 analysis, we also needed a way to count how many receipts each item occurred in. Here, playlists play the role of receipts and artists the role of items. Counting them is your next task.
Given a collection of artist itemsets, complete the function get_artist_counts(itemsets) so that it returns a dictionary-like object with artist names as
keys and, as values, the number of itemsets in which each artist occurs.
For example, suppose you start with these three itemsets:
itemsets = [{'alicia keys', 'jay z', 'lucio battisti', 'the police'}, {'u2', 'the police'}, {'jay z'}]
Then get_artist_counts(itemsets) should return:
{'alicia keys': 1, 'jay z': 2, 'lucio battisti': 1, 'the police': 2, 'u2': 1}
Note: By "dictionary-like," we mean either a conventional Python dictionary or a collections.defaultdict, as you prefer.
Hint: Recall update_item_counts from Notebook 2, which we've provided again in the code cell below.
In [32]: def update_item_counts(item_counts, itemset):
    for a in itemset:
        item_counts[a] += 1

def get_artist_counts(itemsets):
    ### BEGIN SOLUTION
    from collections import defaultdict
    counts = defaultdict(int)
    for s in itemsets:
        update_item_counts(counts, s)
    return counts
    ### END SOLUTION
In [33]: # Demo cell:
itemsets = [{'alicia keys', 'jay z', 'lucio battisti', 'the police'}, {'u2', 'the police'}, {'jay z'}]
get_artist_counts(itemsets)
Out[33]: defaultdict(int,
{'alicia keys': 1,
'lucio battisti': 1,
'jay z': 2,
'the police': 2,
'u2': 1})
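As an aside, the same tally can be sketched with collections.Counter from the standard library. This is only an illustration: the name get_artist_counts_with_counter is hypothetical, and the exercise defines "dictionary-like" as a plain dictionary or a collections.defaultdict, so nothing here confirms that the test cell would accept a Counter.

```python
from collections import Counter

def get_artist_counts_with_counter(itemsets):
    """Count, for each artist, the number of itemsets that contain it."""
    counts = Counter()
    for s in itemsets:
        # Each itemset is a set, so an artist adds at most 1 per itemset.
        counts.update(s)
    return counts

itemsets = [{'alicia keys', 'jay z', 'lucio battisti', 'the police'}, {'u2', 'the police'}, {'jay z'}]
artist_counts_demo = get_artist_counts_with_counter(itemsets)
print(dict(artist_counts_demo))
```

Like a defaultdict(int), a Counter reports 0 for an artist it has never seen, although unlike a defaultdict it does not insert the missing key when you look it up.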
In [34]: # Test cell: `mt1_ex5_get_artist_counts` (1 point)
### BEGIN HIDDEN TESTS
def mt1_ex5__gen_soln(fn_base="artist_counts", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_pickle, save_pickle
    fn = f"{fn_base}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        print(f"'{fn}' exists; skipping...")
    else: # not file_exists(fn) or overwrite
        print(f"'{fn}' does not exist or needs to be overwritten; generating...")
        artist_sets = load_pickle('normalized_artist_sets.pickle')
        artist_counts = get_artist_counts(artist_sets)
        save_pickle(artist_counts, fn)
!date
mt1_ex5__gen_soln(overwrite=False or global_overwrite)
!date
### END HIDDEN TESTS
from testing_tools import mt1_ex5__check
print("Testing...")
for trial in range(250):
    mt1_ex5__check(get_artist_counts)
get_artist_counts__passed = True
print("\n(Passed!)")
Fri 26 Feb 2021 12:28:10 AM PST
'artist_counts.pickle' exists; skipping...
Fri 26 Feb 2021 12:28:11 AM PST
Testing...
(Passed!)
Sample results for Exercise 5: artist_counts
If you had a working solution to Exercise 5, then in principle you could run get_artist_counts(artist_itemsets) to count the number of occurrences of
all artists. Instead, we have precomputed these for you; run the cell below to load it into the object, artist_counts.
Read and run this cell even if you skipped or otherwise did not complete Exercise 5.
In [35]: artist_counts = load_pickle('artist_counts.pickle')
print("Examples:")
for a in ['lady gaga', 'fats domino', 'kishi bashi']:
    print(f"* Artist '{a}' appears in {artist_counts[a]:,} playlists.")
Opening pickle from './resource/asnlib/publicdata/artist_counts.pickle' ...
Examples:
* Artist 'lady gaga' appears in 5,121 playlists.
* Artist 'fats domino' appears in 327 playlists.
* Artist 'kishi bashi' appears in 427 playlists.
Part E: A simple artist-recommender system
We now have all the pieces we need to build a recommender system to help users find artists they might like, building on Notebook 2's pairwise association-rule miner. However, we'll need a modified procedure.
Why? Recall how many artists there are (run the cell below):
In [36]: print(f'The dataset has {len(artist_counts):,} artists! (After name normalization per Exercise 2.)')
The dataset has 258,036 artists! (After name normalization per Exercise 2.)
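To get a feel for the scale, here is a back-of-the-envelope sketch. It assumes the worst case, in which every ordered pair of distinct artists (a, b) is a candidate rule a => b; the number of pairs that actually co-occur in some playlist depends on the data and is likely much smaller.

```python
num_artists = 258_036  # value printed by the cell above

# Worst case: every ordered pair (a, b) with a != b is a candidate rule a => b.
max_rules = num_artists * (num_artists - 1)
print(f"Up to {max_rules:,} candidate rules (about {max_rules:.1e}).")
```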
That's a lot! So rather than finding all association rules, let's use the following procedure instead.
1. First, suppose a user has given us the name of one artist they already like. Call that the root artist.
2. Filter all playlists to only those containing the root artist. Call these the root playlists (or root itemsets).
3. For each root playlist, remove any artists that are "uncommon," based on a given threshold. However, do not remove the root artist; it should
always be kept, whether common or not. Call these resulting playlists the pruned playlists.
4. Run the pairwise association rule miner on these pruned playlists, which should be smaller and thus faster to process, and report the top result(s).
For your last exercise, we'll give you code for Step 2 and need you to combine it with Step 3. We will provide the rest, and if your procedure works, you'll be
able to try it out!
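To make the four steps concrete, here is a minimal sketch on made-up toy data. Everything in it is illustrative: recommend_sketch is a hypothetical name, the playlists and counts are invented, and Step 4 uses a simple stand-in (the fraction of pruned playlists that also contain x) rather than Notebook 2's actual rule miner. It also drops pruned playlists that contain only the root artist, as Exercise 6 below asks.

```python
from collections import defaultdict

def recommend_sketch(root, itemsets, item_counts, min_count, top=3):
    # Step 2: keep only the playlists that contain the root artist.
    root_playlists = [s for s in itemsets if root in s]
    # Step 3: drop uncommon artists, but always keep the root artist.
    pruned = [{a for a in s if a == root or item_counts[a] >= min_count}
              for s in root_playlists]
    # Discard playlists in which only the root artist survived.
    pruned = [s for s in pruned if len(s) >= 2]
    # Step 4 (stand-in): conf(root => x) is estimated as the fraction of
    # pruned playlists that contain x alongside the root.
    co_counts = defaultdict(int)
    for s in pruned:
        for a in s - {root}:
            co_counts[a] += 1
    conf = {a: c / len(pruned) for a, c in co_counts.items()}
    # Highest confidence first; ties broken alphabetically.
    return sorted(conf.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Invented toy data, not taken from the exam's dataset.
toy_playlists = [{'kishi bashi', 'alt j', 'the xx'},
                 {'kishi bashi', 'alt j'},
                 {'kishi bashi', 'obscure band'},
                 {'alt j', 'the xx'}]
toy_counts = {'kishi bashi': 3, 'alt j': 3, 'the xx': 2, 'obscure band': 1}
toy_recs = recommend_sketch('kishi bashi', toy_playlists, toy_counts, min_count=2)
print(toy_recs)
```

In this toy run, the third playlist shrinks to just the root artist after pruning and is discarded, so the confidences are computed over the two playlists that remain.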
Filtering step. Here is code we are providing for Step 2 of this proposed recommender algorithm (filter playlists).
In [37]: def filter_itemsets(root_item, itemsets):
    return [s for s in itemsets if root_item in s]
Here is a demo of filter_itemsets, which generates "root playlists" for the artist, "Kishi Bashi."
Pop-up Video / Behind The Lyrics trivia: At the time of this exam (Spring 2021), Kishi Bashi lives in Athens, Georgia, USA, about 90 minutes or
so outside Atlanta!
In [38]: root_playlists_for_kishi_bashi = filter_itemsets('kishi bashi', artist_itemsets)
print(f"Found {len(root_playlists_for_kishi_bashi)} playlists containing 'kishi bashi.'")
print("Example:", root_playlists_for_kishi_bashi[2])
Found 427 playlists containing 'kishi bashi.'
Example: {'hozier', 'the new pornographers', 'plushgun', 'the smashing pumpkins', 'rockabye baby', 'discovery', "chris o'brien", 'matt nathanson', 'ed sheeran', 'the xx', 'lisa hannigan', 'first aid kit', 'clap your hands say yeah', 'kishi bashi', 'stars'}
Exercise 6: prune_itemsets (2 points)
Complete the function,
def prune_itemsets(root_item, itemsets, item_counts, min_count):
...
so that it implements Step 2 and Step 3 of the recommender. That is, the inputs are:
root_item: The root item (i.e., the root artist name)
itemsets: A collection of itemsets
item_counts: A pre-tabulated count of the number of itemsets in which each item appears
min_count: The minimum number of itemsets in which an item should appear to be considered a recommendation
Your function should return the playlists pruned as follows:
1. Filter the itemsets to only those containing root_item. The resulting itemsets are the filtered itemsets.
2. For each filtered itemset, remove any item a where item_counts[a] < min_count. However, do not remove root_item, regardless of its count.
3. The resulting itemsets are the pruned itemsets. Discard any pruned itemsets that contain only the root item. Return the remaining pruned itemsets as a
Python list of sets.
Note 0: Although the procedure above is written as though your function will modify its input arguments, it must not do so. Use copies as
needed instead. The test cell will not pass if you modify the input arguments.
Note 1: You can return pruned itemsets in any order. (So if the test cell does not pass, it is not because it assumes results in a particular order.)
Example. Suppose the itemsets and item counts are given as follows:
In [39]: ex6_demo_itemsets = [{'alicia keys', 'jay z', 'lucio battisti', 'the police'}, {'u2', 'the police'}, {'jay z'}]
ex6_demo_item_counts = {'alicia keys': 1, 'jay z': 2, 'lucio battisti': 1, 'the police': 2, 'u2': 1}
Then
prune_itemsets('the police', ex6_demo_itemsets, ex6_demo_item_counts, 2)
will return a list with just one itemset, [{'the police', 'jay z'}]. That's because only two itemsets contain 'the police', and of
those, only one contains at least one other item whose count is at least min_count=2 (namely 'jay z', whose count is 2).
In [40]: def prune_itemsets(root_item, itemsets, item_counts, min_count):
    ### BEGIN SOLUTION
    filtered_itemsets = filter_itemsets(root_item, itemsets)
    pruned_itemsets = []
    for s in filtered_itemsets:
        s_pruned = set()
        for x in s:
            if item_counts[x] >= min_count or x == root_item:
                s_pruned |= {x}
        if len(s_pruned) >= 2:
            pruned_itemsets.append(s_pruned)
    return pruned_itemsets
    ### END SOLUTION
In [41]: # Demo cell:
prune_itemsets('the police', ex6_demo_itemsets, ex6_demo_item_counts, 2)
Out[41]: [{'jay z', 'the police'}]
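Note 0 requires that the inputs not be modified. One way to check this yourself, sketched below, is to snapshot the inputs with copy.deepcopy and compare afterward. The function prune_itemsets_sketch is a hypothetical stand-in included only so the block runs on its own. It uses item_counts.get(x, 0), which assumes that an item missing from item_counts should be treated as having count 0; that choice also avoids a subtle mutation, because indexing a defaultdict with a missing key inserts that key.

```python
from copy import deepcopy

def prune_itemsets_sketch(root_item, itemsets, item_counts, min_count):
    pruned = []
    for s in itemsets:
        if root_item not in s:
            continue  # filtering step: skip itemsets without the root item
        # Build a new set rather than removing items from `s` in place.
        kept = {x for x in s
                if x == root_item or item_counts.get(x, 0) >= min_count}
        if len(kept) >= 2:  # discard itemsets holding only the root item
            pruned.append(kept)
    return pruned

demo_itemsets = [{'alicia keys', 'jay z', 'lucio battisti', 'the police'}, {'u2', 'the police'}, {'jay z'}]
demo_counts = {'alicia keys': 1, 'jay z': 2, 'lucio battisti': 1, 'the police': 2, 'u2': 1}

# Snapshot the inputs, call the function, then compare against the snapshots.
itemsets_before = deepcopy(demo_itemsets)
counts_before = deepcopy(demo_counts)
result = prune_itemsets_sketch('the police', demo_itemsets, demo_counts, 2)
inputs_unchanged = (demo_itemsets == itemsets_before) and (demo_counts == counts_before)
print(result, inputs_unchanged)
```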
In [42]: # Test cell: `mt1_ex6_prune_itemsets` (2 points)
### BEGIN HIDDEN TESTS
def mt1_ex6__gen_soln(fn_base="pruned_playlists", root='kishi bashi', threshold=1000, fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_pickle, save_pickle
    fn = f"{fn_base}--{root.replace(' ', '-')}--{threshold}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        print(f"'{fn}' exists; skipping...")
    else: # not file_exists(fn) or overwrite
        print(f"'{fn}' does not exist or needs to be overwritten; generating...")
        artist_itemsets = load_pickle('normalized_artist_sets.pickle')
        artist_counts = load_pickle('artist_counts.pickle')
        pruned_playlists = prune_itemsets(root, artist_itemsets, artist_counts, threshold)
        save_pickle(pruned_playlists, fn)
!date
mt1_ex6__gen_soln(overwrite=False or global_overwrite)
!date
### END HIDDEN TESTS
from testing_tools import mt1_ex6__check
print("Testing...")
for trial in range(250):
    mt1_ex6__check(prune_itemsets)
prune_itemsets__passed = True
print("\n(Passed!)")
Fri 26 Feb 2021 12:28:14 AM PST
'pruned_playlists--kishi-bashi--1000.pickle' exists; skipping...
Fri 26 Feb 2021 12:28:14 AM PST
Testing...
(Passed!)
Fin!
If you passed the preceding exercise, then you have all the pieces necessary to try your recommendation algorithm! It is optional to do so, but if you have any
time left, pick your favorite artist (assuming they are in the dataset) and see if you get reasonable results.
Otherwise, you’ve reached the end of this problem. Don’t forget to restart and run all cells again to make sure your code works when running all code cells in
sequence; and make sure your work passes the submission process. Good luck!
In [43]: assert prune_itemsets__passed == True, "Are you sure you passed Exercise 6?"
# `recommend` implements the complete recommender algorithm
def recommend(root_artist, conf=0.2, min_count=1000, verbose=True):
    from cse6040nb2 import find_assoc_rules, print_rules
    global artist_itemsets, artist_counts
    print("Pruning...")
    pruned_playlists = prune_itemsets(root_artist, artist_itemsets, artist_counts, min_count)
    num_artists = sum(len(p) for p in pruned_playlists)
    print("\t", len(pruned_playlists), "itemsets remain with", num_artists, "artists.")
    print("Finding association rules...")
    rules = find_assoc_rules(pruned_playlists, conf)
    rules = {(a, b): c for (a, b), c in rules.items() if a == root_artist}
    print("\t", len(rules), f"rules of the form `conf('{root_artist}' => x) >= {conf}")
    print(f"\n=== Our top recommendations for '{root_artist}' ===")
    print_rules(rules, limit=20)
# DEMO: 'kishi bashi' produces some spurious results because
# both "Of Monsters and Men" and "Mumford and Sons" are
# erroneously split into two.
recommend('kishi bashi')
Pruning...
411 itemsets remain with 19530 artists.
Finding association rules...
20 rules of the form `conf('kishi bashi' => x) >= 0.2
=== Our top recommendations for 'kishi bashi' ===
conf(kishi bashi => alt j) = 0.265
conf(kishi bashi => passion pit) = 0.258
conf(kishi bashi => men) = 0.255
conf(kishi bashi => of monsters) = 0.255
conf(kishi bashi => vampire weekend) = 0.231
conf(kishi bashi => bon iver) = 0.229
conf(kishi bashi => grizzly bear) = 0.224
conf(kishi bashi => the lumineers) = 0.219
conf(kishi bashi => two door cinema club) = 0.219
conf(kishi bashi => the xx) = 0.219
conf(kishi bashi => m83) = 0.219
conf(kishi bashi => the shins) = 0.219
conf(kishi bashi => first aid kit) = 0.217
conf(kishi bashi => sons) = 0.214
conf(kishi bashi => mumford) = 0.212
conf(kishi bashi => the black keys) = 0.212
conf(kishi bashi => grouplove) = 0.212
conf(kishi bashi => lana del rey) = 0.204
conf(kishi bashi => imagine dragons) = 0.204
conf(kishi bashi => phantogram) = 0.202