Learning AWS SageMaker with a simple example

The official examples for AWS SageMaker are too complicated for getting started. If you want to learn machine learning and the AWS infrastructure, I found that a simple dataset and a simple example work best.

The UC Irvine Machine Learning Repository has a collection of well-documented datasets for machine learning. I plan on using k-means clustering on a classification dataset, and the Geographical Origins of Music dataset will work well. It contains the geographical origin (latitude and longitude) of each piece of music along with measured characteristics of that music. I can run the dataset through k-means and try to cluster the music based on its characteristics, then plot the clusters to see if there is any visible relation between the geography and the clustered characteristics of the music.
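To make the idea of clustering concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on toy data. It is only an illustration of the technique: the points, the starting centroids, and the function name `kmeans` are made up for this example, and it is not how SageMaker's built-in algorithm is implemented.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy data: two obvious groups in a 2-D feature space (hypothetical values)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

The music dataset works the same way, only with 68 feature dimensions per track instead of 2.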

First you need to create a SageMaker notebook instance. Amazon makes you use at least a medium machine, which is probably overkill (and a little expensive) for what we are doing.

Next, download the dataset and upload it to a public S3 bucket so it can be accessed from the notebook.

Now load the Jupyter notebook from my GitHub into the SageMaker instance (or create a new notebook and paste in the code). I'll go through some of the code.


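!wget 'https://s3.amazonaws.com/sagemakerchris/default_features_1059_tracks.txt'
import datetime
import pandas as pd
import numpy as np

# Assumption: the UCI file has no header row, so header=None keeps the first track as data
music = pd.read_csv('default_features_1059_tracks.txt', header=None)
# Keep a reference to the full table (including latitude/longitude) for plotting later
music_orig = music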

The dataset is downloaded from the public S3 location and then loaded into a pandas dataframe. The dataframe can be printed to check how the data looks so far and make sure it was imported correctly.

music = music.drop(music.columns[-2:], axis=1)

The last two columns of the data are the latitude and longitude of the music. We can drop those two columns because we don't want to use them as features in our k-means algorithm. You can also use .shape to check the size of the dataframe and make sure the columns were dropped correctly.
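If you want to see the split without pandas, here is a rough standard-library sketch of the same idea: everything except the last two columns becomes the feature vector, and the last two become the coordinates. The miniature five-column rows are hypothetical stand-ins for the real 70-column file.

```python
import csv
import io

# Hypothetical miniature of the dataset: 3 feature columns, then latitude, longitude
raw = io.StringIO(
    "0.1,0.2,0.3,14.91,-23.51\n"
    "0.4,0.5,0.6,12.65,-8.0\n"
)

features, coords = [], []
for row in csv.reader(raw):
    values = [float(v) for v in row]
    features.append(values[:-2])       # everything except the last two columns
    coords.append(tuple(values[-2:]))  # (latitude, longitude)

# Same kind of sanity check as DataFrame.shape: (rows, columns)
print(len(features), len(features[0]))
print(coords)
```

Checking the row and column counts after the split is the plain-Python equivalent of looking at .shape before and after the drop.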

# as_matrix() was removed in pandas 1.0; .values works on old and new versions
music = music.values.astype(np.float32)

SageMaker's k-means expects the values as a matrix of 32-bit floats.

from sagemaker import KMeans
from sagemaker import get_execution_role
role = get_execution_role()
bucket = "sagemakerchris"
data_location = 's3://{}/kmeans_music/data'.format(bucket)
output_location = 's3://{}/kmeans_music/output'.format(bucket)
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))
kmeans = KMeans(role=role,

The SageMaker training job has to be configured, along with the S3 location where the results will be placed. You can also change the type of machine the training will run on.


The k-means training instance can now fit our data and determine the clusters.

kmeans_predictor = kmeans.deploy(initial_instance_count=1,

To be able to use the model we created, the model has to be deployed as an endpoint.

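result = kmeans_predictor.predict(music)
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]

# Group the (lat, lon) of every track by the cluster it was assigned to
locations = {}
for i, r in enumerate(result):
    out = {
        "lat": music_orig.iloc[i, 68],
        "lon": music_orig.iloc[i, 69],
        "closest_cluster": r.label['closest_cluster'].float32_tensor.values[0],
    }
    print(out)
    try:
        locations[out["closest_cluster"]].append((out["lat"], out["lon"]))
    except KeyError:
        # First track seen for this cluster
        locations[out["closest_cluster"]] = [(out["lat"], out["lon"])]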

We can now feed our music through the model to get the clustering. The output has to be reworked so we can get the cluster values and print them out along with the latitude and longitude.

{'lat': 14.91, 'lon': -23.51, 'closest_cluster': 8.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 9.0}
{'lat': 9.03, 'lon': 38.74, 'closest_cluster': 4.0}
{'lat': 34.03, 'lon': -6.85, 'closest_cluster': 5.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 4.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 8.0}
{'lat': 14.66, 'lon': -17.41, 'closest_cluster': 8.0}
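
The grouping step can be tried without a running endpoint. This is a small sketch that takes the sample records printed above and buckets their coordinates by cluster; using `defaultdict` here is just one convenient way to do it, not necessarily what the notebook does.

```python
from collections import defaultdict

# Records copied from the sample output above
records = [
    {"lat": 14.91, "lon": -23.51, "closest_cluster": 8.0},
    {"lat": 12.65, "lon": -8.0, "closest_cluster": 9.0},
    {"lat": 9.03, "lon": 38.74, "closest_cluster": 4.0},
    {"lat": 34.03, "lon": -6.85, "closest_cluster": 5.0},
    {"lat": 12.65, "lon": -8.0, "closest_cluster": 4.0},
    {"lat": 12.65, "lon": -8.0, "closest_cluster": 8.0},
    {"lat": 14.66, "lon": -17.41, "closest_cluster": 8.0},
]

# Map each cluster id to the list of (lat, lon) points assigned to it
locations = defaultdict(list)
for rec in records:
    locations[int(rec["closest_cluster"])].append((rec["lat"], rec["lon"]))

for cluster in sorted(locations):
    print(cluster, locations[cluster])
```

Even in this small sample, the same coordinates (12.65, -8.0) show up in clusters 4, 8, and 9, so tracks from one place do not necessarily share a cluster.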

It would be interesting to see if the clusters match up with the geographical areas of the music. To do that we can use the gmaps plugin for Jupyter notebooks. The plugin has to be installed from a Jupyter notebook terminal, following the directions on the gmaps GitHub.

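import gmaps

# gmaps needs a Google Maps API key; the value below is a placeholder
gmaps.configure(api_key="YOUR_API_KEY")

fig = gmaps.figure()

colors = [(255,0,0), (0,128,0), (128,0,0), (0,255,0), (255,255,0), (0,255,255), (128,128,0), (128,0,128), (0,128,128), (192,192,192)]
# One heatmap layer per cluster, fading from transparent to the cluster's color
for i, key in enumerate(locations):
    heatmap_layer = gmaps.heatmap_layer(locations[key])
    heatmap_layer.gradient = [(*colors[i], 0), colors[i], colors[i]]
    fig.add_layer(heatmap_layer)

fig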

Each cluster is given a different color and added as a heatmap layer.
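
The gradient line is the least obvious part, so here is a tiny standalone sketch of what it builds. The helper name `cluster_gradient` is made up for illustration; it just shows how `(*rgb, 0)` unpacks a color and appends an alpha of 0 so the layer starts out transparent.

```python
def cluster_gradient(rgb):
    """Build a three-stop gradient for one cluster color: fully transparent
    first, then the solid color for the remaining two stops."""
    transparent = (*rgb, 0)  # unpack (r, g, b) and append alpha = 0
    return [transparent, rgb, rgb]

colors = [(255, 0, 0), (0, 128, 0), (128, 0, 0)]
for cluster_id, rgb in enumerate(colors):
    print(cluster_id, cluster_gradient(rgb))
```

Because every stop after the first is the same solid color, each cluster shows up on the map in a single hue instead of the default multi-color heatmap.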

There is more work to be done to make the clusters easier to see on the map. Someone could build on this to try to draw conclusions about how musical influence relates to music characteristics, geography, and culture.