Based on the project goals set out by the AC209a instructors, we are to create a Spotify recommender system with two competencies: (1) automatic playlist generation and (2) handling the Cold Start problem. To this end, our team has developed a product that uses data science models and approaches to provide the following functionality: (1) the system takes in a single-song input and generates an in-app playlist on Spotify; (2) the system takes in ‘intent’ and/or ‘mood’ inputs and generates playlists corresponding to those intents without any song input.
Our team believes that such a system adequately addresses both project goals and creates the necessary foundation on which more sophisticated recommender systems can be built. Our project website charts our approach and model design process, and chiefly discusses our data collection and analysis procedure as well as how we arrived at our final models. Subsequently, we evaluate the successes and weaknesses of our model before concluding with some discussion of possible future extensions to our system. All scripts and code referenced in our website are available in our project notebook and our project repository.
In early 2018, Reuters reported that, based on findings in the IFPI’s Global Music Report, music streaming had for the first time overtaken traditional music sales to become the music industry’s largest single revenue source. A vital component of such music streaming services is the use of music recommender systems (MRS) that encourage exploration, discovery, and even more music consumption. These MRS face unique constraints and challenges that set them apart from traditional recommender systems; the two objectives posed to our team were (1) creating a system of automatic playlist generation through intent/mood and (2) dealing with the ‘Cold Start’ problem faced by MRS.
As such, our team’s primary aim is to address the twin objectives and, in doing so, develop our own MRS that should integrate seamlessly with the Spotify platform. Our primary challenges involve using the limited data we have to generate sensible recommendations that are relevant and sufficiently random (in order to maintain an explorative element). Another challenge is to find a way to relate our data to user intent/moods and develop metrics to assess the accuracy of our model given the existing data constraints. An additional challenge is to create an interface and method of communication that is intuitive, user-friendly, and flexible enough to accommodate a wide range of use cases. These challenges are more clearly articulated in the subsequent sections where we detail our Exploratory Data Analysis (EDA) and our baseline model.
The primary dataset that we obtained through the AC209a instructors is the Million Playlist Dataset released by Spotify as part of the 2018 RecSys Challenge. In our preliminary examination of the data, we noticed that it contains neither information on the structural features of tracks nor any user input beyond the inclusion of songs in certain playlists. Our knowledge of MRS told us that there was very little we could do with the base data alone; we needed a means to tease out user input or feedback (to train our model on) and to obtain song features that could be mapped into certain classifications for our recommender system.
As such, this motivated our data collection efforts that we detail below; we ended up obtaining two primary datasets for our system:
The initial dataset contained a set of playlists with the features listed below but little in the way of information about the tracks themselves beyond their membership in a particular playlist and the artist and albums they belonged to.
Base data features:
We wrote a simple Python script that enabled us to extract additional features from Spotify’s API by feeding in the unique track identification codes and requesting song features. For this, we relied on Spotipy (a Python package that enabled easy interaction with Spotify’s API) along with other common Python packages. Subsequently, we created an augmented dataset that extends our initial set of songs from the Million Playlist Dataset by including additional features we’ve extracted from the Spotify API. This was then stored in a SQLite database for easy querying.
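As a rough illustration of the storage step, the sketch below writes per-track audio features into a SQLite table. The table name `audio_features`, the column subset, and the `store_features` helper are assumptions made for illustration, not our actual schema. The feature dictionaries only mimic the shape of an audio-features response; their values are made up and no API request is issued.

```python
import sqlite3

# Hypothetical subset of audio-feature columns, chosen for illustration only.
COLUMNS = ["id", "danceability", "energy", "loudness", "tempo", "valence"]

def store_features(conn, feature_rows):
    """Insert (or overwrite) one row of audio features per track ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audio_features ("
        "id TEXT PRIMARY KEY, danceability REAL, energy REAL, "
        "loudness REAL, tempo REAL, valence REAL)"
    )
    # Skip empty entries (tracks for which no features are available).
    rows = [tuple(row[c] for c in COLUMNS) for row in feature_rows if row]
    conn.executemany(
        "INSERT OR REPLACE INTO audio_features VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

# Made-up rows; None stands in for a track with no features available.
sample = [
    {"id": "track_a", "danceability": 0.8, "energy": 0.7,
     "loudness": -5.2, "tempo": 120.0, "valence": 0.6},
    None,
    {"id": "track_b", "danceability": 0.3, "energy": 0.2,
     "loudness": -14.1, "tempo": 72.0, "valence": 0.1},
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
inserted = store_features(conn, sample)
loud_tracks = conn.execute(
    "SELECT id FROM audio_features WHERE loudness > -10 ORDER BY id"
).fetchall()
```

Keeping the track ID as the primary key means re-running the extraction overwrites a track's row rather than duplicating it.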
Augmented data features:
Detailed explanations of these features are available on the Spotify developer page.
For our second set of data, we relied on a separate Python script that scraped existing user-generated playlists on Spotify and returned TSV files containing Track Titles, Artist Names, Album Titles, and the relevant ID codes. We used this script to scrape playlists that corresponded to each of the four moods/intents that we sought to model, and we ran the collected data through the API once again to populate our data with Spotify’s audio features. We curated these playlists by qualitatively considering how well the playlists align with the specific moods, assessing the number of playlist followers, and ensuring that the tracks were largely released prior to 2018 (such that the songs from these playlists also appear in our Million Playlist Dataset). The playlists are stored in our data folder within our main repository.
Moods/Intents curated:
Both datasets share similar features; the key difference in the second dataset that we scraped is that its playlists reference the specific moods and intents that we curated. We now discuss the EDA that we performed on our primary dataset; the process for the second dataset is similar. The following plots are generated from the 'songs620.csv' file, and the code used to generate the plots is in our notebook.
We first plot a scatter matrix to assess correlation across all of the features that we gathered from the Spotify API for the ‘songs620.csv’ dataset, which we assumed to be representative of the other CSVs. We viewed this assumption as fair because each dataset contains a large number of observations.
From our visual inspection, features such as energy and loudness appear to exhibit multicollinearity. We can deal with this either by dropping one of the attributes or by downscaling one; alternatively, we could perform more rigorous assessments of the features in order to understand which combination of feature weights generates the most intuitively sensible recommendations.
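The visual impression of correlated features can be checked numerically with a correlation matrix. The sketch below is illustrative only: the data are synthetic (energy and loudness are correlated by construction), and the 0.8 threshold is an arbitrary assumption rather than a value we used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three audio features.
energy = rng.uniform(0.0, 1.0, n)
loudness = -30.0 + 25.0 * energy + rng.normal(0.0, 2.0, n)  # tied to energy
tempo = rng.uniform(60.0, 180.0, n)                          # independent

names = ["energy", "loudness", "tempo"]
X = np.column_stack([energy, loudness, tempo])

# Pearson correlation between columns (rowvar=False: columns are variables).
corr = np.corrcoef(X, rowvar=False)

threshold = 0.8  # assumed cut-off for "highly correlated"
flagged = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
```

Pairs that end up in `flagged` are candidates for dropping or downscaling one of the two features.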
Subsequently, we standardized our data and computed the Principal Component Analysis (PCA) vectors so as to visualize the component pairs through another scatter matrix. This would give some intuition for the types of methods we could potentially use for recommendations using these features.
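A minimal sketch of the standardize-then-project step is given below, using a NumPy SVD on random stand-in data with 9 columns. The helper names `standardize` and `pca_project` are hypothetical, and this is not necessarily how our notebook computes the components.

```python
import numpy as np

def standardize(X):
    """Centre each column and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std

def pca_project(X, n_components):
    """Project standardized data onto its leading principal components."""
    Z = standardize(X)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 9)) * np.arange(1, 10)  # columns on different scales
projected, explained_ratio = pca_project(X, 3)
```

Standardizing first matters here because features such as tempo and loudness live on very different scales and would otherwise dominate the leading components.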
We see some clustering in the projected feature space. Intuitively, we might think of these groupings of songs as corresponding to 'genres'. We see roughly 3 clusters, and can now fit these using K-means to get some intuition. We note that the learned cluster centers correspond closely to the 3 clusters we saw visually in the PCA projections.
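The K-means fit can be sketched with scikit-learn as follows. The three blobs are synthetic stand-ins for the projected songs, so the centers and spread are assumptions chosen to make the clusters visible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Three synthetic, well-separated groups in a 2-D projected space.
true_centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
points = np.vstack(
    [c + rng.normal(scale=0.7, size=(100, 2)) for c in true_centers]
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
centers = kmeans.cluster_centers_

# Distance from each true center to the closest learned center.
nearest = [
    float(np.min(np.linalg.norm(centers - c, axis=1))) for c in true_centers
]
```

Small values in `nearest` indicate that the learned centers sit on top of the visually apparent groups.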
This concludes our initial EDA. This EDA seems to suggest that some methods that exploit the clustering of these songs in the feature space could be a viable recommender option. We note that some of the Spotify features seem to be highly correlated; we also identify potential clusters that will give us some insight into how we should design our baseline model. We now consider the relevant literature surrounding recommender systems in order to develop our baseline model.
All references are cited below in our References section.
Given our lack of experience with recommender models, we cast a particularly wide net in the early stages of our project and implemented a number of distinct models. For instance, after reading A Novel Hybrid Music Recommendation System using K-Means Clustering and PLSA by Gurpreet Singh and Rajdavinder Singh Boparai, we implemented K-Means clustering as part of our EDA. The model implemented in Singh & Boparai is based strictly on what songs users play. As we explored the Spotify API, we found that we could extract a number of metrics beyond playlist content (e.g. tempo, danceability) for every song, and thus decided to look elsewhere. Moreover, when we compared the K-Means clustering model to the Gaussian mixture model (GMM) we also implemented, we found that K-Means provided a less rigorous probabilistic framework than the GMM. This GMM was implemented, in part, based on the findings of Rui Chen, Qingyi Hua, Quanli Gao, and Ying Xing in "A Hybrid Recommender System for Gaussian Mixture Model and Enhanced Social Matrix Factorization Technology Based on Multiple Interests", though, as with the previous article, this system was based strictly on preferences and was not as low-level as our ideal model. Our use of track features in our model is informed by Berenzweig et al.’s 2003 discussion of acoustic features and their limitations. These insights influenced how we fed these values into our nearest-neighbours model.
Our metric used to assess our model’s performance is guided by the concepts discussed in Maeleb’s Medium article Recall and Precision at k for Recommender Systems. We considered a similar method of evaluating our recommendation accuracy.
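For reference, precision and recall at k can be computed as in the short sketch below. The function name and the song identifiers are made up for illustration; "relevant" here simply means the set of songs a user is known to like.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one ranked list of recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0            # share of top-k that is relevant
    recall = hits / len(relevant) if relevant else 0.0  # share of relevant found
    return precision, recall

# Made-up ranked recommendations and ground-truth relevant songs.
recommended = ["s1", "s7", "s3", "s9", "s4"]
relevant = {"s1", "s3", "s5", "s8"}

p, r = precision_recall_at_k(recommended, relevant, k=3)
```

In this example two of the top three recommendations are relevant, so precision at 3 is 2/3 and recall at 3 is 2/4.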
Our code that interacts with Spotify’s API references the documentation and sample code provided by the Spotify Developer documentation page. Additionally, we referenced Casey Chu’s public domain code that used the Spotipy package.
The Cold Start problem refers to the issue where a system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.
In the context of our project, the issue concerns how we can create a model that is able to make sound song recommendations for new playlists with relatively few prior user inputs. This would involve the user essentially feeding in perhaps 3 or so songs that they like.
In our case, we have designed our baseline model to take in a single song input in order to generate a playlist of recommended songs solely based on the inputted song. Note that when a user indicates that they like a song, we only focus on the set of playlists that the song is a part of. We can then generate recommendations using techniques like:
Based on the clustering we observed in the EDA, we could think of each song's feature coming from the following generative process:
Each song has a latent state $z$, corresponding to the cluster it belongs to (think of song genres for intuition). Since we saw that the projected clusters had roughly elliptical shapes, we might expect each to be a multi-dimensional Gaussian. Hence, assume that the features $x$ are drawn as
$$z \sim \mathrm{Categorical}(\pi), \qquad x \mid z = k \sim \mathcal{N}(\mu_k, \Sigma_k).$$
We first fit the model to the data using the number of clusters we expect. Then, when the user likes a song $x^*$, we can infer the posterior over the latent state, $p(z \mid x^*)$, to select the cluster that the song belongs to. We can then sample from that cluster's Gaussian to generate a new point in feature space. Since this sample from a multivariate Gaussian may not exactly correspond to a song, we can use our pre-fit Nearest Neighbour algorithm to find the closest match.
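The infer, sample, and snap-to-nearest-song steps described above can be sketched with scikit-learn as follows. The data are synthetic 2-D blobs and `recommend_gmm` is a hypothetical helper, not the `recommend_coldstart` function from our notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)

# Synthetic "songs": three blobs of 150 points each in a 2-D feature space.
centers = np.array([[-4.0, -4.0], [0.0, 4.0], [4.0, -4.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(150, 2)) for c in centers])

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)
nn = NearestNeighbors(n_neighbors=1).fit(X)

def recommend_gmm(liked_idx, n_recs):
    # Posterior over clusters for the liked song; keep the most probable one.
    k = int(np.argmax(gmm.predict_proba(X[[liked_idx]])[0]))
    # Draw new points from that cluster's Gaussian.
    draws = rng.multivariate_normal(gmm.means_[k], gmm.covariances_[k],
                                    size=n_recs)
    # Snap each draw to the closest real song in the dataset.
    _, idx = nn.kneighbors(draws)
    recs = [int(i) for i in idx[:, 0] if int(i) != liked_idx]
    return k, list(dict.fromkeys(recs))  # drop duplicates, keep order

cluster, recs = recommend_gmm(liked_idx=0, n_recs=10)
```

Because several draws can snap to the same song, the function may return fewer than `n_recs` distinct recommendations.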
This method requires us to select the number of clusters we expect in the data. We can select this using a combination of the Silhouette score and Bayesian Information Criterion, which are scoring metrics commonly used to evaluate models. We demonstrate these on a subset of the data below.
Silhouette Score (SS)
This checks how compact and well separated the clusters are. The closer the SS is to 1, the better the clustering is.
Bayesian Information Criterion (BIC)
This checks how well the GMM fits to the data, with a penalty for model complexity. The lower the BIC, the better the fit.
We see that SS selects clusters=2 whereas BIC selects clusters=6. We make our final decision based on the integer mean of the two results, i.e. 4.
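A sketch of this selection rule on synthetic data is shown below. The four blobs and the candidate range of 2 to 7 clusters are assumptions for illustration, so the selected values differ from those reported for our data.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Four synthetic, well-separated blobs of 80 points each.
centers = np.array([[-6.0, 0.0], [0.0, 6.0], [6.0, 0.0], [0.0, -6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(80, 2)) for c in centers])

sil, bic = {}, {}
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    sil[k] = silhouette_score(X, gmm.predict(X))  # compactness and separation
    bic[k] = gmm.bic(X)                           # fit with complexity penalty

k_sil = max(sil, key=sil.get)   # higher silhouette score is better
k_bic = min(bic, key=bic.get)   # lower BIC is better
k_final = (k_sil + k_bic) // 2  # integer mean of the two choices
```

Averaging the two choices is a heuristic compromise: the silhouette score tends to favour fewer, cleaner clusters, while BIC rewards a closer fit to the density.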
We can simply run unsupervised NN on the data and recommend the closest songs corresponding to the song that the user likes, with the hopes that similarity in feature space would result in good song recommendations.
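A minimal sketch of the nearest-neighbour recommender, using scikit-learn on random stand-in features, might look as follows. The helper `recommend_nn` and the choice of Euclidean distance on standardized features are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Random stand-in for 300 songs with 9 numerical audio features each.
features = rng.normal(size=(300, 9))
X = StandardScaler().fit_transform(features)

nn = NearestNeighbors(metric="euclidean").fit(X)

def recommend_nn(song_idx, n_recs=10):
    # Ask for one extra neighbour because the query song matches itself.
    _, idx = nn.kneighbors(X[[song_idx]], n_neighbors=n_recs + 1)
    return [int(i) for i in idx[0] if int(i) != song_idx][:n_recs]

recs = recommend_nn(42)
```

The returned indices are ordered from most to least similar to the input song.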
Note that we would expect NN to generate recommendations closer to the song input, since it is literally picking songs with features that are closest to the input, whereas GMM would recommend songs in the same cluster but potentially far away from the input. Our system could use both techniques together, with NN suggesting relevant similar songs and GMM suggesting relevant 'exploratory' songs for the user to better explore that cluster and discover relevant albeit different music.
Consider the results on a sample song ("Plain Jane" by A$AP Ferg, id 11997 in our DB):
# A$AP Ferg's Plain Jane
df.loc[[11997]]
# Fit the model and generate recommendations
nn_recc_songs, gmm_recc_songs = recommend_coldstart(11997, X_train, df, 10, 10, 0)
Inspecting the recommendations from both methods below, we see that this is, in general, a good baseline model.
# GMM Recommendations
print("GMM Recommendations")
display(gmm_recc_songs[['track_name', 'artist_name']])
# NN Recommendations
print("\nNN Recommendations")
display(nn_recc_songs[['track_name', 'artist_name']])
This corresponds to the task of creating a model for song discovery on the basis of the base playlists and user-specified context information. We set out to develop a system that is able to take in one of our four specified moods/intents:
and generate song recommendations based purely off the user's choice. This approach does not require user song input and will utilize our curated playlists along with a system of mood/intent mapping.
We implemented two methods to tackle this issue:
This works by first accessing additional curated playlists that we have scraped corresponding to these moods. Random songs are then sampled from these playlists (which correspond to the relevant moods) and fed through our models from the previous section. This draws on the strength of the good recommendations that our model was making, but now using input songs that we know correspond to the user mood. This would generate relevant recommendations that also cater to the required mood.
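The sample-based method can be sketched as below. Everything here is synthetic: the catalogue, the two mood playlists, and the helper `recommend_by_mood` are stand-ins, and only a nearest-neighbour lookup is shown rather than both models from the previous section.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for the main song catalogue (500 songs, 9 features).
catalogue = rng.normal(size=(500, 9))

# Made-up curated playlists whose features sit in different regions.
mood_playlists = {
    "party": rng.normal(loc=1.5, size=(40, 9)),
    "sleep": rng.normal(loc=-1.5, size=(40, 9)),
}

def recommend_by_mood(mood, n_seeds=3, n_per_seed=5):
    pool = mood_playlists[mood]
    # Randomly sample seed songs from the curated playlist for this mood.
    seeds = pool[rng.choice(len(pool), size=n_seeds, replace=False)]
    recs = []
    for seed in seeds:
        # Closest catalogue songs to the seed in feature space.
        dist = np.linalg.norm(catalogue - seed, axis=1)
        for i in np.argsort(dist)[:n_per_seed]:
            if int(i) not in recs:
                recs.append(int(i))
    return recs

party = recommend_by_mood("party")
sleep = recommend_by_mood("sleep")
```

Because the seeds are resampled on every call, repeated requests for the same mood return different but related recommendations.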
This approach relies on the intuition that different combinations of the features should map to the relevant moods that we require, and that learning these mappings would allow us to classify our recommendations into moods. We could opt to perform some feature engineering ourselves and generate these maps; however, this requires strong domain expertise. We hence use a neural network to learn these mappings. This works by first accessing additional curated playlists that we have scraped corresponding to these moods. A neural network is then trained to use the previously obtained song features to classify all songs into one of the four moods. Since there are only 9 numerical features, and we expect the mapping to not be too complex, a relatively simple architecture with two hidden layers and dropout is used. This trained network is then applied to the recommended songs from the previous section, and only songs catering to the required mood are recommended.
Note that a potential issue is that the classes are not equally represented in our dataset. This is because the provided Million Playlist Dataset does not contain newer songs, and we had difficulties in finding relevant existing playlists corresponding to the 'angry' and 'sleep' moods that had older songs. There might have been some change in the 'data distribution' over the years, and we did not want this to affect our learning. This issue could be resolved with an updated Million Playlist Dataset.
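To make the architecture and the filtering step concrete, the sketch below implements the forward pass of a two-hidden-layer classifier with dropout in NumPy and uses it to keep only candidates predicted as a given mood. The weights are random and untrained, the hidden sizes (32 and 16) and dropout rate are assumptions, and the fourth mood label is a placeholder; it illustrates the shape of the approach, not our trained model.

```python
import numpy as np

MOODS = ["party", "angry", "sleep", "mood_4"]  # last label is a placeholder

rng = np.random.default_rng(7)

# Untrained, randomly initialised weights: 9 features -> 32 -> 16 -> 4 moods.
W1, b1 = rng.normal(scale=0.3, size=(9, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.3, size=(32, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.3, size=(16, 4)), np.zeros(4)

def dropout(h, rate, training):
    """Inverted dropout: active only while training."""
    if not training or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def predict_proba(X, training=False, rate=0.2):
    h1 = dropout(np.maximum(0.0, X @ W1 + b1), rate, training)   # ReLU layer 1
    h2 = dropout(np.maximum(0.0, h1 @ W2 + b2), rate, training)  # ReLU layer 2
    logits = h2 @ W3 + b3
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def filter_by_mood(candidates, mood):
    """Indices of candidate songs whose predicted mood matches the request."""
    predicted = np.argmax(predict_proba(candidates), axis=1)
    return np.flatnonzero(predicted == MOODS.index(mood))

candidates = rng.normal(size=(20, 9))  # stand-in for recommended songs
probs = predict_proba(candidates)
kept = filter_by_mood(candidates, "party")
```

With dropout disabled at inference time the predictions are deterministic, so the same candidates are always filtered the same way.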
We now try this approach out on a random mood.
sample_recc_songs, nn_recc_songs = recommend_mood('party', model, X_train, df, 10, 10)
Inspecting the recommendations from both methods below, we see that this is, in general, a good model.
# Sample method
print("Sample-based method")
display(sample_recc_songs[['track_name', 'artist_name']])
# Neural network recommendations
print("\nNeural network-based Recommendations")
display(nn_recc_songs[['track_name', 'artist_name']])
The following videos contain demonstrations of our existing recommender systems; they illustrate both our models taking in some form of user input and generating playlists accessible within Spotify.
The Cold Start Demo illustrates our system’s response to the Cold Start problem; it takes in a track title (in this case, Versace by Migos) and queries our database for the song that best matches the inputted title. This query returns a track ID that is fed into both our Gaussian Mixture Model and our Nearest Neighbors Model in order to create two separate playlists in the user’s Spotify account, one per model. As mentioned above, the Gaussian Mixture Model tends to generate more diverse and explorative recommendations because it makes recommendations based on song clusters. Meanwhile, the Nearest Neighbors model tends to make recommendations that are more similar to the inputted song.
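The title lookup step could be sketched as below. The `tracks` table, its rows and IDs, and the use of `difflib` for fuzzy matching are all assumptions for illustration; the demo's actual query may work differently.

```python
import difflib
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the song database
conn.execute(
    "CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, "
    "track_name TEXT, artist_name TEXT)"
)
# Illustrative rows with made-up IDs.
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "Versace", "Migos"), (2, "Plain Jane", "A$AP Ferg")],
)

def find_track_id(conn, title, cutoff=0.6):
    """Return the ID of the track whose name best matches `title`, or None."""
    rows = conn.execute("SELECT track_id, track_name FROM tracks").fetchall()
    names = [name.lower() for _, name in rows]
    # Fuzzy match so small typos in the user's input still resolve.
    match = difflib.get_close_matches(title.lower(), names, n=1, cutoff=cutoff)
    if not match:
        return None
    return rows[names.index(match[0])][0]

exact = find_track_id(conn, "versace")
typo = find_track_id(conn, "Plain Jan")
missing = find_track_id(conn, "zzzz")
```

The returned track ID is what would then be passed on to the two recommendation models.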
The Emotion-Based Playlist Generation demo illustrates our system’s response to automatic playlist generation using user intent/mood without any specific song input. The command-line interface lists several options for the user to select and these options then inform the recommender system to generate random song predictions that correspond to the selected option. As before, we implement two separate models within this system. The first model involves the fitting of a neural net that is trained on curated playlists corresponding to each mood/intent; the neural net assesses the audio features for each mood/intent and uses this to generate random recommendations. The second model is the sample-based approach where the model samples a random subset of tracks from the curated playlists and searches for the nearest neighbours (in terms of audio features) in our dataset and returns them as recommendations. Once again, we segregate the system’s recommendations into two separate playlists.
Initially, our focus was mainly on the Cold Start problem and how this might be resolved. Our recommendations were entirely based on song inputs and relating these inputs to other songs - in these cases, the mapping of a song input to a recommendation was relatively clear given our model.
Following the third project milestone, we shifted our focus in order to deal with the other goal concerning automatic playlist generation through user context and input beyond song inputs. Rather than attempt to refine our model or find some means of gathering data to evaluate the success of our recommender system, we developed an alternative model that could take in user intent and/or mood and map this input into song recommendations.
Additionally, we made an effort to avoid scope creep and decided against attempting to generate a genre map or a refined graphical interface; instead, we chose to focus on refining our recommender systems. Although these implementations would have been great, given the constraints we faced, pursuing those routes would have probably spread us too thin.
Our neural network for our model attempting to automatically generate playlists based on mood/intent has the following test and train performance.
The training accuracy is approximately 0.9837 while the test accuracy is approximately 0.8528. Based on these scores, the neural network learns the mapping from the feature space to mood/intent sufficiently well, although the gap between training and test accuracy suggests some overfitting.
Additionally, based on trials with multiple users, our recommender system is capable of generating accurate song recommendations across a variety of genres fairly consistently.
Our team considers the project a broad success - we have met our goals of developing a functioning and accurate recommender system with an impressive array of use cases. Our team believes that our final deliverable definitely meets the project criteria and that we have gone beyond what was expected of us in order to create a product that we are all proud of.
Although our priorities have definitely shifted in order to meld better with our resources, we made the most of what we had and managed to produce a considerable amount of output. We learnt a lot throughout the course of the project and had the opportunity to apply a lot of what we were taught in AC209a; we also managed to extend our project and incorporate models and approaches beyond our course syllabus.
We generated fairly sensible recommendations for the Cold Start problem using Gaussian Mixture Model (GMM) and Nearest Neighbors (NN) methods. The performance of the two methods is not directly comparable since they serve different purposes. As mentioned above, NN recommends more similar songs while GMM recommends more exploratory songs, and the strength of the overall system lies in using both methods together. This allows users to both exploit their preferences and explore the genre.
We also obtained sensible recommendations for the mood/intent problem using sample track-based and neural network-based approaches. The sample track-based approach generates more accurate recommendations for the provided mood, as it leverages information provided by existing playlists. However, we believe that the neural network-based approach has the potential to improve significantly with a larger dataset. With more songs from each mood/intent, we could learn much more accurate mappings from feature space to moods/intents and significantly improve recommendations. This was difficult for us to implement because the provided dataset did not contain newer songs, as mentioned above.
A key extension we hope to implement for our recommender system is a graphical interface, so that our system is more aesthetically appealing and more intuitive for casual users. We want users to be confident in our recommendations and we believe that a strong aesthetic can help to build that confidence. To that end, we have developed a sample user interface for our existing 4 moods.
Additionally, we hope to incorporate more intents and moods in our model. We could take an approach similar to the one used in our model above to generate these moods and intents; alternatively, we could consider more user-driven approaches and adaptively incorporate moods and intents based on the success of our recommendations or on user feedback. A more quantitative approach could involve a classification model that maps songs to genres, moods, and intents; we could subsequently generate our mood/intent extensions based on the most distinct, well-defined clusters.
We also hope to create a feedback system where users can manually rate the quality of the recommendations, or where we collect usage metadata and evaluate recommendations from it. One possible formulation of this could involve a ‘mulligan’ system where, given an initial user input, we provide a small set of recommendations (say, five). The user would then pick which of the recommendations they want to keep and the system would reparameterize based on the user’s choices.