not_notMNIST Dataset Generation
If you have ever worked with classification algorithms, you are almost certainly familiar with the MNIST dataset. If you are a little more involved with machine learning, and especially classification, you have probably heard of notMNIST as well.
However, if you have always wanted your own dataset but didn’t know how to build one, this post is for you!
not_notMNIST is a dataset generator! All it needs is an alphabet and a whole bunch of fonts. The main advantage is that you can use it to generate a really big dataset with Unicode characters of your choice, and train your classifier on that. If you believe that the dataset you generate is worth spreading, share it :)
How to use the generated data?
Once you generate the data, you can just load the pickle file, either per character or for all of the generated data.
The pickle file is stored under $DIR/$WIDTHx$WIDTH.pickle, where $DIR is the output directory (by default ./28x28/) and $WIDTH is the width of the image (defaults to 28). The data has the following format:
- data['labels'] - Targets
- data['images'] - Predictors (square images)
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import matplotlib.pyplot as plt

# Load everything generated for the 100x100 Japanese demo
with open('Demo/Japanese/100x100/100x100.pickle', 'rb') as f:
    data = pickle.load(f)

labels = data['labels']
images = data['images']
num_points = len(labels)

# Show four randomly chosen images in a 2x2 grid
fig, ax = plt.subplots(2, 2)
for i in range(2):
    for j in range(2):
        idx = np.random.randint(num_points)
        ax[i, j].imshow(images[idx], cmap='Greys_r')
plt.show()
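Since the point of the dataset is to train a classifier on it, here is a minimal sketch of turning the loaded dictionary into arrays that most classifiers accept. It assumes that all images have the same square size and that pixel values are either already in [0, 1] or in the 0..255 range; the post does not state the pixel range, so check your own data. The helper name to_arrays and the synthetic demo dictionary are made up for illustration.

```python
import numpy as np


def to_arrays(data):
    """Turn the {'labels', 'images'} dict into (X, y, classes) arrays."""
    images = np.asarray(data['images'], dtype=np.float32)
    labels = np.asarray(data['labels'])
    # Flatten each square image into one row of pixels.
    X = images.reshape(len(images), -1)
    # Assumption: values above 1 mean the pixels are stored as 0..255.
    if X.max() > 1.0:
        X = X / 255.0
    # Map every distinct label to an integer class id.
    classes, y = np.unique(labels, return_inverse=True)
    return X, y, classes


# Tiny synthetic stand-in with the same layout as the pickle contents.
rng = np.random.RandomState(0)
demo = {
    'labels': ['A', 'b', 'A', 'Z'],
    'images': rng.randint(0, 256, size=(4, 28, 28)),
}
X, y, classes = to_arrays(demo)
print(X.shape, y, classes)
```

With real data you would pass the dictionary returned by pickle.load instead of demo. The classes array lets you map a predicted class id back to its character.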
How to generate the data
You need several things installed first:
- ImageMagick
- Python 2.7+
- numpy
- scipy
- pickle (ships with the Python standard library)
Installing the prerequisites depends on your OS and is outside the scope of this post :)
List of arguments
To learn how to use it, let’s go through some examples. You can see the Demo files on GitHub.
Example 1
This is a default run. It will take all possible fonts that it can find, and it will try to generate 28x28 images for every alpha-numeric character. The results will be stored in the 28x28 directory.
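To give a feel for what generating one image involves, here is a rough sketch of how a single character could be rendered with ImageMagick. This is not the tool's actual code, which is not shown in this post. The function name render_command, the output file name, and the black-background colour choice are all assumptions, as is the reading of "alpha-numeric" as ASCII letters plus digits. The label: prefix and the -size, -font, -gravity, -background and -fill options are standard ImageMagick; on ImageMagick 7 the executable is usually magick rather than convert.

```python
import string


def render_command(char, font, width, out_path):
    """Build an ImageMagick command that draws one character on a square canvas."""
    return [
        'convert',
        '-background', 'black',            # assumed colours, swap as needed
        '-fill', 'white',
        '-font', font,
        '-size', '{0}x{0}'.format(width),  # text is fitted to this box
        '-gravity', 'center',
        'label:' + char,
        out_path,
    ]


# Assumption: "alpha-numeric" means ASCII letters and digits (62 characters).
alphabet = string.ascii_letters + string.digits
cmd = render_command(alphabet[0], 'Arial', 28, 'a.png')
print(' '.join(cmd))
# To actually render (requires ImageMagick):
#   import subprocess; subprocess.check_call(cmd)
```

The sketch only builds the command, so it runs without ImageMagick installed. Characters that ImageMagick treats specially in labels (such as @, % or a backslash) would need escaping, which is not handled here.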
Example 2
This one uses 3 new arguments:
- -a AbzZ tells the script to generate data for the letters A, b, z, and Z
- -f Arial: we want only a single font, and it is Arial
- -w 100: images should be of size 100x100
Example 3
This one is more complex:
- -d specifies the output directory to be Demo/Japanese/28x28
- -af tells the script that there is a file with an alphabet, Demo/Japanese/japanese.alphabet
- -ff specifies a file with a list of fonts, Demo/Japanese/japanese.fonts
If we knew where the fonts are located, we could have used the -fd parameter to use all the fonts in there.
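If you want an alphabet file for another script, you can build one from a Unicode range instead of typing it by hand. The sketch below does this for hiragana. The format of japanese.alphabet is not described in this post, so the sketch assumes the file is plain UTF-8 text with the characters written back to back; compare against the Demo file on GitHub before relying on it. The helper write_alphabet and the file name hiragana.alphabet are made up for illustration, and the code targets Python 3.

```python
import io
import os
import tempfile


def write_alphabet(path, chars):
    """Write the characters back to back as UTF-8 text (assumed file format)."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(chars)


# Hiragana letters occupy U+3041..U+3096 in Unicode.
hiragana = ''.join(chr(cp) for cp in range(0x3041, 0x3097))

# Hypothetical file name, written to a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), 'hiragana.alphabet')
write_alphabet(path, hiragana)

with io.open(path, encoding='utf-8') as f:
    print(len(f.read()), 'characters written')
```

The resulting path is what you would hand to -af, provided the assumed format matches what the script expects.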
Example 4
Suppose that you want to use all the fonts in the list except one (or any other number of them). Use the -e argument (or -ef). For example, excluding Arial with -e would use all installed fonts EXCEPT Arial.
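Conceptually, exclusion is just a filter over the list of available fonts. A minimal sketch of that idea follows; the function name select_fonts and the list of installed fonts are invented for illustration, and the real script may match font names differently (for example, ignoring case).

```python
def select_fonts(available, exclude=()):
    """Return the available fonts minus the excluded ones, keeping their order."""
    excluded = set(exclude)
    return [font for font in available if font not in excluded]


installed = ['Arial', 'Courier', 'Times']  # made-up list of installed fonts
print(select_fonts(installed, exclude=['Arial']))
# -> ['Courier', 'Times']
```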
Call for Contributions
I wrote this tool for my own work in about a night. I would appreciate it if you could submit any Issues or PRs.
This work is still in progress…