not_notMNIST Dataset Generation
If you have ever worked with classification algorithms, you are almost certainly familiar with the MNIST dataset. If you are a little more involved with machine learning, and especially classification, you have probably heard of notMNIST as well.
However, if you always wanted to have your own dataset, but didn’t know how to use it – this post is for you!
not_notMNIST is a dataset generator! All it needs is an alphabet and a whole bunch of fonts. The main advantage is that you can use it to generate a really big dataset of Unicode characters of your choice, and train your classifier on that. If you believe that the dataset you generate is worth spreading, share it :)
How to use the generated data?
Once you generate the data, you can just load the pickle file, either per character or for all of the generated data.
pickle file is stored under
$DIR is the output directory (by default
$WIDTH is the width of the image (defaults to 28).
Data has the following format:
data['images'] - Predictors (square images)
    # -*- coding: utf-8 -*-
    import pickle
    import numpy as np
    import matplotlib.pyplot as plt

    with open('Demo/Japanese/100x100/100x100.pickle', 'rb') as f:
        data = pickle.load(f)

    labels = data['labels']
    images = data['images']
    num_points = len(labels)

    f, ax = plt.subplots(2, 2)
    for i in range(2):
        for j in range(2):
            idx = np.random.randint(num_points)
            ax[i, j].imshow(images[idx], cmap='Greys_r')

    plt.show()
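Before training a classifier on the loaded data, you will usually want to flatten the images and turn the labels into integer class indices. Here is a minimal sketch of that step. It assumes that data['images'] holds equally sized square arrays and that data['labels'] holds one label per image, which is what the snippet above suggests. prepare_dataset and its parameters are hypothetical names, not part of not_notMNIST.

```python
import numpy as np


def prepare_dataset(data, test_fraction=0.2, seed=0):
    """Flatten images, encode labels as integers, and split into train/test.

    Assumes data['images'] holds equally sized 2-D arrays and data['labels']
    holds one label per image (an assumption about the pickle layout).
    """
    images = np.asarray(data['images'], dtype=np.float32)
    labels = np.asarray(data['labels'])

    # One row per image, one column per pixel.
    X = images.reshape(len(images), -1)

    # Map each distinct label to an integer class index.
    classes, y = np.unique(labels, return_inverse=True)

    # Shuffle before splitting so both parts contain a mix of classes.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx]), classes


if __name__ == '__main__':
    # Synthetic stand-in for a generated pickle: ten random 28x28 images.
    fake = {
        'images': np.random.rand(10, 28, 28),
        'labels': list('AbzZAbzZAb'),
    }
    (X_train, y_train), (X_test, y_test), classes = prepare_dataset(fake)
    print(X_train.shape, X_test.shape, list(classes))
```

The returned classes array maps each integer index back to its original character, so predictions can be translated back into letters.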
How to generate the data
You need several things installed first:
- Python 2.7+
The installation for the prerequisites would depend on your OS, and is outside of the scope of the current post :)
List of arguments
To learn how to use it, let’s go through some examples. You can see the Demo files on GitHub.
This is a default run. It will take all possible fonts that it can find, and it will try to generate 28x28 images for every alpha-numeric character. The results will be stored in the
This one uses 3 new arguments:
-a AbzZ tells the script to generate data for the letters
-f Arial: we want only a single font, and it is
-w 100: images should be of size
This one is more complex:
-d specifies the output directory to be
-af tells the script that there is a file with an alphabet
-ff specifies a file with a list of fonts
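When the character set is a contiguous Unicode range, the alphabet file for -af can be produced programmatically. The sketch below writes the hiragana letters to a UTF-8 text file. The file name hiragana.txt and the idea that the script accepts a plain UTF-8 file containing the characters are assumptions made for illustration, so check the Demo files on GitHub for the layout the script actually expects.

```python
# -*- coding: utf-8 -*-
import io


def unicode_range(first, last):
    """Return the characters from code point first to last, inclusive."""
    # chr() covers the full Unicode range on Python 3; on Python 2 use unichr().
    return u''.join(chr(cp) for cp in range(first, last + 1))


def write_alphabet(path, characters):
    """Write the characters to a UTF-8 text file (hypothetical layout)."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(characters)


if __name__ == '__main__':
    # Hiragana letters occupy U+3041 through U+3096.
    hiragana = unicode_range(0x3041, 0x3096)
    write_alphabet('hiragana.txt', hiragana)
    print(len(hiragana))
```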
If we knew where the fonts are located, we could have used the -fd parameter to use all the fonts in there.
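To compare the flags from the examples side by side, here is a sketch of an argparse parser that accepts the same options. It is not the script's actual parser: the dest names, the help texts, and every default other than the width of 28 are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags described in this post (a sketch)."""
    parser = argparse.ArgumentParser(
        description='Dataset generator options (illustrative)')
    # Alphabet: given inline or read from a file.
    parser.add_argument('-a', dest='alphabet',
                        help='characters to generate, e.g. AbzZ')
    parser.add_argument('-af', dest='alphabet_file',
                        help='file containing the alphabet')
    # Fonts: a single name, a file listing fonts, or a directory of fonts.
    parser.add_argument('-f', dest='font',
                        help='name of a single font, e.g. Arial')
    parser.add_argument('-ff', dest='font_file',
                        help='file with a list of fonts')
    parser.add_argument('-fd', dest='font_dir',
                        help='directory to take fonts from')
    # Output settings.
    parser.add_argument('-w', dest='width', type=int, default=28,
                        help='image width in pixels')
    parser.add_argument('-d', dest='out_dir', help='output directory')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(['-a', 'AbzZ', '-f', 'Arial', '-w', '100'])
    print(args.alphabet, args.font, args.width)
```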
Suppose that you want to use all the fonts in the list except 1 (or whatever number). Use
-e argument (or
would use all installed fonts EXCEPT
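Excluding fonts amounts to filtering a list of font names. Here is a minimal sketch of the idea, assuming fonts are identified by name and compared case-insensitively. Both are assumptions, and select_fonts is a hypothetical helper, not a function from the script.

```python
def select_fonts(available, excluded=()):
    """Return the available fonts minus the excluded ones, keeping order."""
    # Compare names case-insensitively; this is an assumption for the sketch.
    blocked = {name.lower() for name in excluded}
    return [name for name in available if name.lower() not in blocked]


if __name__ == '__main__':
    installed = ['Arial', 'Courier New', 'Times New Roman']
    print(select_fonts(installed, excluded=['arial']))
```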
Call for Contributions
This work is still in progress…