Recall that the data set provided by the CS109a staff contained the following fields: artist_name, track_uri, artist_uri, track_name, album_uri, duration_ms, and album_name. To augment our data set, we used track_uri with Spotify's API to obtain continuous features associated with each track. As an extension, we could consider using preview_url, which is a link to a 30-second preview (MP3 format) of the track. This feature could further assist our analysis when comparing the similarity between two tracks. In addition to our current methods, we could also use album_uri and artist_uri to extract other features, such as album genre, to enrich our data.
It would also be interesting to see how popular each playlist in our data is in reality. Spotify shows the number of followers for a given playlist. If our data contained playlist_uri, we could have extracted this information to compare the popularity of one playlist with another.
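As a rough sketch of the preview_url extension, the helpers below pull the track ID out of a track_uri and read the preview link from a track object. We assume the track object is the JSON dictionary returned by Spotify's API for a single track, and that its preview_url field may be null when no preview exists. The function names and the sample values are hypothetical and only serve as illustration.

```python
from typing import Optional


def track_id_from_uri(track_uri: str) -> str:
    """Return the ID portion of a URI of the form 'spotify:track:<id>'."""
    kind, _, track_id = track_uri.rpartition(":")
    if not kind.endswith("track") or not track_id:
        raise ValueError(f"not a track URI: {track_uri!r}")
    return track_id


def preview_url(track_object: dict) -> Optional[str]:
    """Return the preview link of a track object, or None if it is missing or null."""
    return track_object.get("preview_url") or None


# Made-up placeholder values, not real Spotify data.
example_track = {"name": "Example Song", "preview_url": None}
print(track_id_from_uri("spotify:track:abc123"))
print(preview_url(example_track))
```

Tracks without a preview would have to be skipped or handled separately before any audio-based comparison.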
Improving the MinMax Algorithm
There are a number of ways we can improve our MinMax Algorithm.
In our simple setting, we define the mood of a playlist as having one minimizing and one maximizing feature. In reality, this is hardly the case. Extracting the genre information using the extension explained above, we can evaluate the relationship between genre and track features. We can then use this information to better model the mood of each playlist.
Our algorithm depends heavily on the pool of songs. If we had a pool of songs with an artist who always produces the highest danceability and lowest acousticness, our playlists would consist only of that artist's songs. We can include a measure of diversity in our performance metric to mitigate this issue in the future.
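One possible diversity measure, sketched here under our own assumptions, is the normalized Shannon entropy of the artists in a generated playlist. It is 0 when every track comes from the same artist and 1 when every track comes from a different artist. The function name artist_diversity is hypothetical, and this is only one of several reasonable choices.

```python
from collections import Counter
from math import log


def artist_diversity(artist_uris):
    """Normalized Shannon entropy of the artist distribution in a playlist.

    Returns 0.0 for a single-artist (or single-track) playlist and 1.0 when
    every track has a distinct artist.
    """
    counts = Counter(artist_uris)
    total = sum(counts.values())
    if total < 2 or len(counts) == 1:
        return 0.0
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    # Dividing by log(total) makes the maximum value 1.0.
    return entropy / log(total)


# Placeholder artist identifiers, one per track in the playlist.
print(artist_diversity(["a", "a", "a", "a"]))
print(artist_diversity(["a", "b", "c", "d"]))
```

A score like this could be reported next to the existing performance metric or combined with it, although how to weight the two is a design choice we have not evaluated.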
We can consider using other clustering algorithms besides KMeans to compare the performance results. Using the audio file features, we can consider deep learning methods as well.
Improving the Seed Algorithm
There are a couple of ways we can improve our Seed Algorithm.
We should look for ways to speed up our current algorithm. There are probably more efficient data structures and algorithms that perform the same operations as our current code; however, more expertise in this domain is needed.
Some of the numbers we selected in our algorithm are arbitrary, such as the 100 clusters and the minimum cardinality of 5 when we find similar playlists. However, with 4 clusters the algorithm took orders of magnitude longer to display any predictions whatsoever. A cross-validation technique would likely improve the selection of such numbers.
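A minimal sketch of such a cross-validation search is shown below. It assumes a hypothetical score_fn(train, valid, params) that runs the Seed Algorithm on the training playlists and returns a validation score where higher is better. The grid keys n_clusters and min_cardinality are illustrative names for the two numbers mentioned above, not identifiers from our code.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_items, n_folds):
    """Yield (train_idx, valid_idx) pairs; each item is in validation exactly once."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, valid


def select_params(playlists, grid, score_fn, n_folds=5):
    """Return (best_params, best_mean_score) over every combination in the grid."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, valid_idx in k_fold_indices(len(playlists), n_folds):
            train = [playlists[i] for i in train_idx]
            valid = [playlists[i] for i in valid_idx]
            scores.append(score_fn(train, valid, params))
        avg = mean(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

For example, the grid could be {"n_clusters": [50, 100, 200], "min_cardinality": [3, 5, 10]}. Since small cluster counts made the algorithm much slower, the grid would need to be chosen with the running time in mind.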
Other measures of similarity between playlists would also likely help our prediction process, and we look forward to exploring these on our own in the future.
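Two standard set-based candidates are sketched below, treating each playlist as a set of track URIs: the Jaccard index and the overlap coefficient. These are offered only as examples of measures we could try; the function names and track identifiers are placeholders.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller playlist."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Placeholder track identifiers.
playlist_1 = ["t1", "t2", "t3"]
playlist_2 = ["t2", "t3", "t4"]
print(jaccard(playlist_1, playlist_2))
print(overlap_coefficient(playlist_1, playlist_2))
```

The Jaccard index penalizes playlists of very different lengths, while the overlap coefficient does not, so the two can rank candidate playlists differently.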