Tutorial – How to use Spotipy API to scrape Spotify Data

Introduction

In this article, we will learn how to collect data from Spotify, a popular music and podcast streaming platform. We will do this through the Spotify Web API, using Spotipy, a Python client library for that API. Our aim in this hands-on exercise is to fetch information about all the tracks in Spotify playlists. We can obtain the track information of any playlist; we only need the URI (Uniform Resource Identifier) of the playlist.

To use the Spotify Web API through Spotipy, the user needs two credential keys: client_id and client_secret. These two keys are unique for each user and help Spotify identify the users of its Web API. In the following part of this article, I will walk through the steps to create these two keys through a Spotify developer account, with the help of screenshots.

Generating Authorization Keys for Spotipy

Step 1: Creating Spotify Developers Account

To create an account on the Spotify developers website, you can either use the existing Spotify account that you use for listening to music, or sign up using your Facebook account or your email address.

Spotipy Developer Account Creation

 

Step 2: Creating a New App

After creating the account and logging in, you will find a screen as shown below.

Spotify Developer Account Dashboard

Next, we create a new app by clicking either the Create Client ID button or the Create New App button.

Creating New App in Spotify Developer Dashboard

Spotify then asks some basic questions about the new app. Next, you have to tell Spotify whether you will use the app for monetary gain, i.e. whether the app is commercial or not. It is advised to choose the non-commercial option. Lastly, we need to agree to some permissions and agreements.

Creating New App in Spotify Developer Dashboard

Step 3: Obtaining Client Id and Client Secret Keys

Once the app is created, we see a dashboard with the name and description of our app. Below the description we find the Client ID, a 32-character alphanumeric string, and below that the Client Secret, also a 32-character alphanumeric string. This was the aim of the walkthrough: we now have the two keys required to authorize use of the Spotify Web API.

Obtaining Client Id and Client Secret Keys

NOTE – Do not use the keys visible in the above image. As mentioned earlier, these keys are unique to each app, so you need to create your own; otherwise you may encounter an error.

Now it’s time to start our hands-on practical example, where we will fetch playlist data and track information using Spotipy.

Importing Spotipy library and authorization credentials

Initially, we need to load the necessary libraries and credential files.

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import json

For easier access to the credential keys, we will store them in a .json file, which simplifies loading them in the notebook.

The content below shows how to add the two keys to the authorization.json file.

Contents of authorization.json

Remember that the keys below are for illustration only; you need to create your own client ID and client secret in your Spotify developer account.

{"client_id": "5e9a80618b284145b54bb1f7df94bb6c",
"client_secret": "0cdef7160e4143118e48abdd939668e8"}
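If you prefer not to keep the keys in a file, a small helper can look for them in environment variables first and fall back to authorization.json. This is only a sketch: the function name load_credentials is hypothetical, while SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are the variable names that Spotipy itself looks for when no keys are passed explicitly.

```python
import json
import os


def load_credentials(path="authorization.json", env=os.environ):
    """Return (client_id, client_secret) from the environment or a JSON file."""
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if client_id and client_secret:
        return client_id, client_secret

    # Fall back to the JSON file with the two keys.
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    missing = [key for key in ("client_id", "client_secret") if not data.get(key)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return data["client_id"], data["client_secret"]
```

The returned pair can be passed to SpotifyClientCredentials in the same way as the values read from the file.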

In another file, playlists_like_dislike.json, we list the URIs (Uniform Resource Identifiers) of the playlists. This file helps in managing the URIs of multiple playlists. As mentioned earlier, a URI identifies a specific playlist on Spotify.

Each entry in playlists_like_dislike.json also has a like attribute, which takes a boolean value. It records whether the user likes or dislikes the songs of that playlist: if true, the user likes all the songs of the playlist, and if false, the user dislikes all of them.

Below we can see how the playlist URIs should be listed along with the like attribute.

NOTE – With the code in this article, you can fetch data for up to 99 songs per playlist in a single run. The Web API returns playlist tracks in pages of at most 100 items, and the audio features request accepts at most 100 track IDs, so trying to fetch more songs in one go leads to an error or to missing data. Longer playlists need to be split into smaller ones or fetched page by page.

Contents of playlists_like_dislike.json

[
{"uri": "spotify:user:Test_1:playlist:27rVIOLKlIKAg63whXrAzz",
"like": true},
{"uri": "spotify:user:Test_2:playlist:6WjtPvXBC2iSO24VsfBpnc",
"like": false}
]
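Since a small typo in this file (for example "true" written as a string) only shows up later, it can help to check the entries when loading them. The following is a minimal sketch under that assumption; load_playlists is a hypothetical helper name, not part of Spotipy.

```python
import json


def load_playlists(path="playlists_like_dislike.json"):
    """Load the playlist list and check the shape of each entry."""
    with open(path, encoding="utf-8") as fh:
        playlists = json.load(fh)

    for position, entry in enumerate(playlists):
        if not isinstance(entry.get("uri"), str):
            raise ValueError(f"entry {position}: 'uri' must be a string")
        # bool is checked explicitly so that "true" (a string) is rejected
        if not isinstance(entry.get("like"), bool):
            raise ValueError(f"entry {position}: 'like' must be true or false")
    return playlists
```

The function returns the same list of dictionaries that json.load gives, so the rest of the notebook can index into it unchanged.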

How to get the URI from Spotify Playlist

Fetching Playlist URI from Spotify App

In the above image, you can see where to obtain the URI of a particular playlist in the Spotify desktop app. Click the three dots, and then copy the URI of the desired playlist. The playlist link and the Spotify URI contain the same playlist ID, so you can copy either of them. The ID is a 22-character alphanumeric code at the end of the link or URI.

The image below shows the Spotify web app; if you are using the web app, you will see something like this.

Fetching Playlist URI from Spotify Web App

This URI is used when communicating with the Spotify API, so that we fetch the correct information about the songs in the playlist.

A single .json file can contain multiple playlists, so each playlist is accessed by its index. Here, the playlist at index 0 is accessed using the code below.
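Because the ID can be copied either as a URI or as a link, a small parser that accepts both forms may be convenient. The sketch below is an illustration only: playlist_id_from is a hypothetical helper, and it assumes the URI forms spotify:playlist:<id> and spotify:user:<name>:playlist:<id> and links whose path contains /playlist/<id>.

```python
from urllib.parse import urlparse


def playlist_id_from(uri_or_link):
    """Extract the playlist ID from a Spotify URI or an open.spotify.com link."""
    text = uri_or_link.strip()
    if text.startswith("spotify:"):
        parts = text.split(":")
        # Handles spotify:playlist:<id> and spotify:user:<name>:playlist:<id>
        if "playlist" in parts[:-1]:
            return parts[parts.index("playlist") + 1]
        raise ValueError(f"not a playlist URI: {text}")

    # Links look like https://open.spotify.com/playlist/<id>?si=...
    segments = [s for s in urlparse(text).path.split("/") if s]
    if "playlist" in segments[:-1]:
        return segments[segments.index("playlist") + 1]
    raise ValueError(f"not a playlist link: {text}")


print(playlist_id_from("spotify:user:Test_1:playlist:27rVIOLKlIKAg63whXrAzz"))
# prints 27rVIOLKlIKAg63whXrAzz
```

Any query string after the ID in a link (such as ?si=...) is dropped by urlparse, so only the 22-character ID is returned.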

In [2]:
# read the two keys; 'with' closes the file after loading
with open('authorization.json') as f:
    credentials = json.load(f)
client_id = credentials['client_id']
client_secret = credentials['client_secret']

playlist_index = 0    # position of the playlist inside playlists_like_dislike.json

with open('playlists_like_dislike.json') as f:
    playlists = json.load(f)
playlist_uri = playlists[playlist_index]['uri']
like = playlists[playlist_index]['like']
In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id,client_secret=client_secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Getting Information of tracks in Spotify playlist

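To obtain the username and playlist_id, we split the URI on the : separator. This assumes a URI of the form spotify:user:<username>:playlist:<playlist_id>, as in the file above. These two IDs identify the playlist whose tracks we want to fetch.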

In [4]:
uri = playlist_uri    # the URI is split by ':' to get the username and playlist ID
username = uri.split(':')[2]
playlist_id = uri.split(':')[4]
In [5]:
results = sp.user_playlist(username, playlist_id, 'tracks')
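The call above returns only the first page of tracks. For longer playlists, one option is to request the items page by page. The sketch below assumes a recent Spotipy version in which the client has a playlist_items method with limit and offset parameters (as far as I know, newer versions also mark user_playlist as deprecated); fetch_all_playlist_items is a hypothetical helper name, and sp is passed in so the function does not create a client itself.

```python
def fetch_all_playlist_items(sp, playlist_id, page_size=100):
    """Collect every item of a playlist by requesting one page at a time."""
    items = []
    offset = 0
    while True:
        page = sp.playlist_items(playlist_id, limit=page_size, offset=offset)
        items.extend(page["items"])
        # 'next' is None on the last page; an empty page also ends the loop
        if page.get("next") is None or not page["items"]:
            break
        offset += len(page["items"])
    return items
```

Each returned item has the same shape as the entries of results['tracks']['items'], so the track details can be read from item['track'] in the same way.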

Fetching track details such as IDs, titles and artists

Here the data for the playlist is returned as a dictionary whose keys hold the track information. We will use lists to store data such as IDs, titles and artists.

For each track we store all of its artists, and separately the first listed artist, which we treat as the primary artist.

In the for loop below, we go over each track of the playlist and append its ID, name and artist information to these lists.

In [6]:
playlist_tracks_data = results['tracks']
playlist_tracks_id = []
playlist_tracks_titles = []
playlist_tracks_artists = []
playlist_tracks_first_artists = []

for track in playlist_tracks_data['items']:
    playlist_tracks_id.append(track['track']['id'])
    playlist_tracks_titles.append(track['track']['name'])
    # adds a list of all artists involved in the song to the list of artists for the playlist
    artist_list = []
    for artist in track['track']['artists']:
        artist_list.append(artist['name'])
    playlist_tracks_artists.append(artist_list)
    playlist_tracks_first_artists.append(artist_list[0])
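Playlist items may occasionally come back without a usable track (for example removed or local tracks), in which case the loop above would fail on track['track']['id']. A more defensive variant is sketched below; extract_track_info is a hypothetical helper, and it returns one dictionary per track instead of four parallel lists.

```python
def extract_track_info(items):
    """Turn playlist items into plain dicts, skipping entries without a track."""
    rows = []
    for item in items:
        track = item.get("track")
        if not track or not track.get("id"):
            # nothing usable to look up later, so skip this entry
            continue
        artists = [artist["name"] for artist in track.get("artists", [])]
        rows.append({
            "id": track["id"],
            "title": track["name"],
            "first_artist": artists[0] if artists else None,
            "all_artists": artists,
        })
    return rows
```

In the notebook this could be called as extract_track_info(results['tracks']['items']).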

Extracting Audio Features of each track

Spotify provides audio features for the songs available on its platform, such as danceability, energy, tempo, and many more. They are described in the Spotify Web API documentation. After fetching these audio features, we store them in a dataframe.

In [7]:
features = sp.audio_features(playlist_tracks_id)
In [8]:
import numpy as np
import pandas as pd
In [9]:
features_df = pd.DataFrame(data=features, columns=features[0].keys())
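The audio features request accepts at most 100 track IDs at a time, so for longer ID lists the call has to be made in batches. A minimal sketch, assuming sp is the Spotipy client created earlier; fetch_audio_features is a hypothetical helper name.

```python
def fetch_audio_features(sp, track_ids, batch_size=100):
    """Call sp.audio_features in batches and return one combined list."""
    track_ids = list(track_ids)
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        features.extend(sp.audio_features(batch))
    return features
```

The combined list keeps the order of track_ids, so it can be turned into a dataframe in the same way as above.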

Merging Dataframes for getting audio features and track information

Now we add the title and artist information to the dataframe of audio features. We also reorder the columns for easier reading.

In [10]:
features_df['title'] = playlist_tracks_titles
features_df['first_artist'] = playlist_tracks_first_artists
features_df['all_artists'] = playlist_tracks_artists
#features_df = features_df.set_index('id')
features_df = features_df[['id', 'title', 'first_artist', 'all_artists',
                           'danceability', 'energy', 'key', 'loudness',
                           'mode', 'acousticness', 'instrumentalness',
                           'liveness', 'valence', 'tempo',
                           'duration_ms', 'time_signature']]
features_df.tail()
Out[10]:
    id                      title                           first_artist     all_artists        danceability  energy  key  loudness  mode  acousticness  instrumentalness  liveness  valence  tempo    duration_ms  time_signature
94  5IF6IBPqbclVR7SQKmCIyA  Just Like You                   Louis Tomlinson  [Louis Tomlinson]  0.703         0.628   0    -5.914    1     0.36400       0.000000          0.384     0.471    138.032  205217       4
95  2UCI5rt3PkM9pXtARihaaQ  Bedroom Floor                   Liam Payne       [Liam Payne]       0.625         0.684   1    -7.147    1     0.34200       0.000064          0.107     0.217    119.932  188234       4
96  2kqAtjOtQPAR0OiYUJR43k  Can We Dance                    The Vamps        [The Vamps]        0.640         0.820   1    -4.729    0     0.00312       0.000000          0.189     0.583    130.108  192711       4
97  3wGCtxNjZe3GrZfkojZ1FB  I’m a Mess                      Jasmine          [Jasmine]          0.636         0.487   11   -7.123    0     0.09460       0.000000          0.111     0.647    97.022   190500       4
98  3jnQF0OxLiAEFFcMdBgJ9s  Oh Cecilia (Breaking My Heart)  The Vamps        [The Vamps]        0.746         0.844   11   -5.506    1     0.03150       0.000000          0.318     0.662    100.027  196684       4

Data Exploration

To understand the scraped data in a better way, we will perform data exploration with the help of visualization.

In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The following count plot shows which artists have the most songs in the playlist.

In [12]:
plt.figure(figsize=(20,30))
sns.countplot(x='first_artist', data=features_df)    # keyword arguments work across seaborn versions
plt.xticks(rotation=90)
Out[12]:
[Figure: count plot of the number of songs per first artist; the x-axis shows 59 distinct artists]
In [13]:
#features_df = features_df.drop(['first_artist', 'all_artists'], axis=1)
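With 59 artists on one axis the plot can be hard to read, so a quick textual summary of the most frequent artists may help. This is a small sketch using only the standard library; top_artists is a hypothetical helper and the sample list is made up for illustration.

```python
from collections import Counter


def top_artists(first_artists, n=5):
    """Return the n most frequent artists as (name, count) pairs."""
    return Counter(first_artists).most_common(n)


# Hypothetical sample; in the notebook, pass playlist_tracks_first_artists
sample = ["The Vamps", "Sia", "The Vamps", "Queen", "Sia", "The Vamps"]
print(top_artists(sample, n=2))
# prints [('The Vamps', 3), ('Sia', 2)]
```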

Spotify Audio Analysis

Spotify also provides an audio analysis of each song, which describes it in terms of bars, beats, sections, tatums and segments. You can learn more about these attributes in the Spotify Web API documentation. Since beats and tatums are finer subdivisions of the same rhythmic information that bars capture, we will only consider bars, sections and segments.

NOTE – The audio analysis of each track contains in-depth technical information, so fetching it takes time, and the cell below will run for a while. The output retrying ... is the message Spotipy prints when it has to retry a request.

In [14]:
num_bars = []
num_sections = []
num_segments = []

for track_id in features_df['id']:
    analysis = sp.audio_analysis(track_id)    # one request per track
    num_bars.append(len(analysis['bars'])) # beats/time_signature
    num_sections.append(len(analysis['sections']))
    num_segments.append(len(analysis['segments']))
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...2secs
retrying ...3secs
retrying ...4secs
retrying ...5secs
[further retrying messages omitted]
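The counting step can also be separated from the network calls, which makes it easier to test and to tolerate a response with a missing key. The following is a sketch under the assumption that sp is the Spotipy client created earlier; both function names are hypothetical.

```python
def count_analysis_parts(analysis):
    """Count bars, sections and segments in one audio-analysis response."""
    return {
        "num_bars": len(analysis.get("bars", [])),
        "num_sections": len(analysis.get("sections", [])),
        "num_segments": len(analysis.get("segments", [])),
    }


def collect_analysis_counts(sp, track_ids):
    """Call sp.audio_analysis once per track and gather the counts."""
    return [count_analysis_parts(sp.audio_analysis(track_id))
            for track_id in track_ids]
```

The result is a list of dictionaries, one per track, which pd.DataFrame can turn into three columns.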

Next, we visualize the distribution of these counts with histograms to learn more about them.

In [15]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
plt.hist(num_bars, bins=20)
plt.xlabel('num_bars')
plt.subplot(1,3,2)
plt.hist(num_sections, bins=20)
plt.xlabel('num_sections')
plt.subplot(1,3,3)
plt.hist(num_segments, bins=20)
plt.xlabel('num_segments')
Out[15]:
[Figure: three histograms showing the distributions of num_bars, num_sections and num_segments]

Now we add these audio analysis counts to our existing dataframe to enrich the dataset.

In [16]:
features_df['num_bars'] = num_bars
features_df['num_sections'] = num_sections
features_df['num_segments'] = num_segments
features_df.head()
Out[16]:
   id                      title                          first_artist     all_artists        danceability  energy  key  loudness  mode  acousticness  instrumentalness  liveness  valence  tempo    duration_ms  time_signature  num_bars  num_sections  num_segments
0  3h4T9Bg8OVSUYa6danHeH5  Animals                        Maroon 5         [Maroon 5]         0.279         0.742   4    -6.460    0     0.000185      0.000000          0.5930    0.328    189.868  231013       4               162       11            921
1  4pbJqGIASGPr0ZpGpnWkDn  We Will Rock You – Remastered  Queen            [Queen]            0.692         0.497   2    -7.316    1     0.676000      0.000000          0.2590    0.475    81.308   122067       4               42        6             353
2  6b3b7lILUJqXcp6w9wNQSm  Cheap Thrills                  Sia              [Sia, Sean Paul]   0.592         0.800   6    -4.931    0     0.056100      0.000002          0.0775    0.728    89.972   224813       4               83        8             914
3  2tpWsVSb9UEmDRxAl1zhX1  Counting Stars                 OneRepublic      [OneRepublic]      0.664         0.705   1    -4.972    0     0.065400      0.000000          0.1180    0.477    122.016  257267       4               129       8             1001
4  1zB4vmk8tFRmM9UULNzbLB  Thunder                        Imagine Dragons  [Imagine Dragons]  0.605         0.822   0    -4.833    1     0.006710      0.134000          0.1470    0.288    167.997  187147       4               128       10            614

The code at the end of this article writes the above dataframe to a .csv file for the playlist. The file is saved in the same folder as this Jupyter notebook. Remember to include the .csv extension in each file name.

Creating Large Dataset

The csv file below contains the information of only the single playlist whose index was provided at the start. To process multiple playlists, you can use a for loop that runs over each playlist, thus creating a large dataset.

So for each URI provided in the playlists_like_dislike.json file, you will get a new .csv file with the information of that playlist. In this way you can build a large dataset of .csv files.

In [17]:
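# index=False (a boolean) leaves out the row index; the string "false" is truthy and would still write it
features_df.to_csv("playlist_" + str(playlist_index) + ".csv", encoding='utf-8', index=False)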
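The for loop over all playlists mentioned above could look roughly like the sketch below. To keep the example self-contained it uses the standard csv module instead of pandas, and fetch_rows is a hypothetical callable that stands for the fetching steps of this article: it takes a playlist URI and returns a list of dictionaries, one per track. The function name export_playlists and the extra like column are also assumptions made for illustration.

```python
import csv
import os


def export_playlists(playlists, fetch_rows, out_dir="."):
    """Write one CSV per playlist and return the paths that were created."""
    paths = []
    for index, playlist in enumerate(playlists):
        rows = fetch_rows(playlist["uri"])
        if not rows:
            continue  # nothing to write for an empty playlist
        path = os.path.join(out_dir, f"playlist_{index}.csv")
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]) + ["like"])
            writer.writeheader()
            for row in rows:
                # carry the playlist-level like flag onto every track
                writer.writerow({**row, "like": playlist["like"]})
        paths.append(path)
    return paths
```

Storing the like flag next to every track keeps the liked and disliked playlists distinguishable after the files are combined into one dataset.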

Conclusion

We have reached the end of this hands-on article. We learned how to fetch the playlist information of different users through the Spotify Web API, using Spotipy, a Python library for that API. We fetched the audio features of each track along with counts from its audio analysis. This article also covered how to create a dataset of playlists and their track information. To explore further and understand Spotipy in more depth, you can refer to the links mentioned below.

Web API Spotipy

Spotipy Official Documentation

  • Palash Sharma

    I am Palash Sharma, an undergraduate student who loves to explore and garner in-depth knowledge in the fields like Artificial Intelligence and Machine Learning. I am captivated by the wonders these fields have produced with their novel implementations. With this, I have a desire to share my knowledge with others in all my capacity.
