Kaggle MNIST digit recognition

One of the active competitions on Kaggle is for handwritten digit recognition using the popular MNIST dataset.

Using a Jupyter Notebook that I put up on GitHub, I created a model with Keras and reached rank #1241 with an accuracy of 0.98857.

With the model above, I’ve been experimenting with the ‘patience’ parameter of an early stopping callback while fitting the model. The accuracy seems to max out at around 10 epochs. After that you can get a little bit of improvement over the next 10 epochs, but at the risk of overfitting.
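As a rough sketch, wiring up early stopping in Keras looks something like this. The monitor and patience values here are illustrative, not the exact ones from my notebook, and model, x_train, and y_train are assumed to come from the notebook itself:

from keras.callbacks import EarlyStopping

# Stop training when validation accuracy stops improving; 'patience' is how
# many extra epochs to wait for a new best before giving up.
early_stop = EarlyStopping(monitor='val_acc', patience=3)

model.fit(x_train, y_train,
          validation_split=0.1,
          epochs=30,
          callbacks=[early_stop])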

I plan on adding some data visualizations to plot the training and results.

There are also two additional notebooks in the repo for viewing the dataset by plotting the pixels.
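If you want a quick look without the notebooks, something like this will render a single digit (it assumes the Kaggle train.csv layout: a label column followed by 784 pixel columns):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')
digit = train.iloc[0, 1:].values.reshape(28, 28)  # drop the label, reshape to 28x28

plt.imshow(digit, cmap='gray')
plt.title('label: {}'.format(train.iloc[0, 0]))
plt.show()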

You can also create kernels on the Kaggle site, but if the site goes down (like it did last night and this morning) you can be out of luck trying to commit changes.

MNIST digit recognition on GitHub

Learning AWS Sagemaker with a simple example

The official examples for AWS Sagemaker are too complicated for getting started. If you want to learn both machine learning and the AWS infrastructure, I found a simple dataset and example to be best.

The UC Irvine Machine Learning Repository has a bunch of well-documented datasets for machine learning. I plan on using k-means clustering on a classification dataset. The Geographical Origins of Music dataset will work well. It is a dataset of music origins (latitude and longitude) and measured characteristics of that music. I can pump the dataset through k-means and try to cluster music based on its characteristics. I can then plot the clusters to see if there is any visible relationship between geography and the clustered characteristics of the music.

First you need to create a Sagemaker notebook instance. Amazon makes you use at least a medium machine, which is probably overkill (and a little expensive) for what we are doing.

Next, download the dataset and put it in a public S3 bucket so it can be accessed from the notebook.
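You can do the upload from the S3 console, or with a few lines of boto3. The bucket name here is just my example, and the public-read ACL assumes your bucket allows public objects:

import boto3

s3 = boto3.client('s3')

# Upload the dataset and make the object publicly readable so the
# notebook can fetch it with wget.
s3.upload_file('default_features_1059_tracks.txt',
               'sagemakerchris',
               'default_features_1059_tracks.txt',
               ExtraArgs={'ACL': 'public-read'})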

Now load the Jupyter notebook from my GitHub into the Sagemaker instance (or create a new notebook and paste in the code). I’ll go through some of the code.

!wget 'https://s3.amazonaws.com/sagemakerchris/default_features_1059_tracks.txt'

import pandas as pd
import numpy as np

# The file ships without a header row, so tell pandas not to treat
# the first track as column names.
music = pd.read_csv('default_features_1059_tracks.txt', header=None)
music_orig = music  # drop() below returns a new frame, so this keeps the full data
print(music.head())

The dataset is downloaded from the public S3 location and then loaded into a pandas dataframe. Printing the head of the dataframe is a quick check that the data was imported correctly.

music = music.drop(music.columns[-2:], axis=1)  # drop the trailing lat/lon columns

The last two columns of the data are the latitude and longitude of each track. We drop them because we don’t want to use them as features in our k-means algorithm. You can also use .shape to check the size of the dataframe and make sure the columns were dropped correctly.
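For this dataset the check should look roughly like this (68 audio features plus the two coordinate columns):

print(music_orig.shape)  # expect (1059, 70) - the full frame
print(music.shape)       # expect (1059, 68) - after dropping lat/lon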

music = music.to_numpy().astype(np.float32)  # as_matrix() was removed in newer pandas

The Sagemaker k-means implementation expects the values as a float32 matrix.

from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

bucket = "sagemakerchris"
data_location = 's3://{}/kmeans_music/data'.format(bucket)
output_location = 's3://{}/kmeans_music/output'.format(bucket)
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)

The Sagemaker training job has to be configured, along with where the data and results will be placed. You can also change the type of machine the training runs on; k=10 sets the number of clusters to find.

%%time
kmeans.fit(kmeans.record_set(music))

The k-means estimator can now fit our data and determine the clusters.

%%time
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

To use the model we just trained, it has to be deployed as a hosted endpoint.

%%time
result = kmeans_predictor.predict(music)
print(len(result))

# Pair each prediction with the lat/lon we dropped earlier (columns 68 and 69
# of the original frame) and group the coordinates by cluster.
locations = {}
for i, r in enumerate(result):
    out = {
        "lat": music_orig.iloc[i, 68],
        "lon": music_orig.iloc[i, 69],
        "closest_cluster": r.label['closest_cluster'].float32_tensor.values[0]
    }
    locations.setdefault(out['closest_cluster'], []).append([out['lat'], out['lon']])
    print(out)

We can now feed our music through the model to get the cluster assignments. The output has to be reworked so we can pull out the cluster values and print them alongside the latitude and longitude.

{'lat': 14.91, 'lon': -23.51, 'closest_cluster': 8.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 9.0}
{'lat': 9.03, 'lon': 38.74, 'closest_cluster': 4.0}
{'lat': 34.03, 'lon': -6.85, 'closest_cluster': 5.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 4.0}
{'lat': 12.65, 'lon': -8.0, 'closest_cluster': 8.0}
{'lat': 14.66, 'lon': -17.41, 'closest_cluster': 8.0}

It would be interesting to see if the clusters match up with the geographical areas of the music. To do that we can use the gmaps plugin for Jupyter notebooks. The plugin has to be installed from a Jupyter notebook terminal, following the directions in the gmaps GitHub repo.
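From memory, the setup is roughly the following (check the gmaps README for the current instructions; you’ll also need a Google Maps API key):

!pip install gmaps
!jupyter nbextension enable --py --sys-prefix gmaps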

import gmaps
gmaps.configure(api_key='YOUR_API_KEY')  # placeholder: your Google Maps API key

fig = gmaps.figure()

# Ten distinct RGB colors, one per cluster.
colors = [(255, 0, 0), (0, 128, 0), (128, 0, 0), (0, 255, 0), (255, 255, 0),
          (0, 255, 255), (128, 128, 0), (128, 0, 128), (0, 128, 128), (192, 192, 192)]

for i, value in enumerate(locations.values()):
    heatmap_layer = gmaps.heatmap_layer(value)
    # Fade from transparent to the cluster's color so the layers don't wash out.
    heatmap_layer.gradient = [(*colors[i], 0), colors[i], colors[i]]
    fig.add_layer(heatmap_layer)
fig

Each cluster is given a different color and added as a heatmap layer.

There is more work to be done to make the clustering more visible. Someone could use this to try to draw conclusions about musical influence as it relates to music characteristics, geography, and culture.
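One piece of housekeeping: the deployed endpoint keeps billing while it is up, so tear it down once you’re done experimenting. The predictor object gives you a shortcut for this:

# Delete the hosted endpoint so it stops accruing charges.
kmeans_predictor.delete_endpoint()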