Image source: not_notMNIST GitHub

# not_notMNIST Dataset Generation

If you have ever worked with classification algorithms, you are almost certainly familiar with the MNIST dataset. If you are a little more involved with machine learning, and especially classification, you have probably heard of notMNIST as well.

However, if you have always wanted your own dataset but didn’t know how to create one, this post is for you!

not_notMNIST is a dataset generator! All it needs is an alphabet and a whole bunch of fonts. The main advantage is that you can use it to generate a really large dataset containing Unicode characters, and train your classifier on it. If you believe the dataset you generate is worth spreading, share it :)

## How to use the generated data?

Once you generate the data, you can just load the pickle file either per character, or for all of the generated data.

The pickle file is stored under $DIR/$WIDTHx$WIDTH.pickle, where $DIR is the output directory (by default ./28x28/) and $WIDTH is the width of the image (defaults to 28). The data has the following format:

• data['labels'] - Targets
• data['images'] - Predictors (square images)
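As a minimal sketch of loading the combined file, assuming the default output location described above (./28x28/28x28.pickle) and the two keys listed; the helper name load_dataset is hypothetical and not part of the tool:

```python
import pickle


def load_dataset(path="./28x28/28x28.pickle"):
    """Load a generated pickle and return (images, labels).

    The default path assumes the default output directory and width;
    adjust it if you generated the data with other settings.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    # 'images' holds the predictors (square images), 'labels' the targets.
    return data["images"], data["labels"]
```

Calling `images, labels = load_dataset()` should then give you the predictors and targets in matching order. If the file was written under Python 2 and is read under Python 3, passing `encoding="latin1"` to `pickle.load` may be necessary.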

## How to generate the data

You need several things installed first:

• ImageMagick
• Python 2.7+
• numpy
• scipy
• pickle (part of the Python standard library)

Installing the prerequisites depends on your OS and is outside the scope of this post :)

### List of arguments

To learn how to use it, let’s go through some examples. The Demo files are available on GitHub.

### Example 1

This is a default run. It takes every font it can find and tries to generate 28x28 images for every alphanumeric character. The results are stored in the 28x28 directory.

### Example 2

This one uses 3 new arguments:

• -a AbzZ tells the script to generate data for the letters A, b, z, and Z
• -f Arial restricts generation to a single font, Arial
• -w 100 sets the image size to 100x100

### Example 3

This one is more complex:

• -d specifies the output directory, Demo/Japanese/28x28
• -af points the script to a file containing the alphabet, Demo/Japanese/japanese.alphabet
• -ff specifies a file with a list of fonts, Demo/Japanese/japanese.fonts

If we knew where the fonts were located, we could have used the -fd parameter to use all the fonts in that directory.

### Example 4

Suppose you want to use all the fonts in the list except one (or several). Use the -e argument (or -ef):

Passing -e Arial would use all installed fonts EXCEPT Arial.
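To keep the flags from the four examples in one place, here is a rough argparse sketch that mirrors them. It is not the tool's actual implementation: the destination names are made up for illustration, and the meaning of -ef (a file listing fonts to exclude) is an assumption by analogy with -af and -ff.

```python
import argparse


def build_parser():
    # Hypothetical mirror of the flags described in the examples;
    # dest names are illustrative, not taken from the real script.
    parser = argparse.ArgumentParser(description="Sketch of the generator's options")
    parser.add_argument("-a", dest="alphabet", help="characters to generate, e.g. AbzZ")
    parser.add_argument("-af", dest="alphabet_file", help="file containing the alphabet")
    parser.add_argument("-f", dest="font", help="single font to use, e.g. Arial")
    parser.add_argument("-ff", dest="fonts_file", help="file with a list of fonts")
    parser.add_argument("-fd", dest="fonts_dir", help="directory with fonts to use")
    parser.add_argument("-e", dest="exclude_font", help="font to exclude, e.g. Arial")
    parser.add_argument("-ef", dest="exclude_file",
                        help="file listing fonts to exclude (assumed meaning)")
    parser.add_argument("-w", dest="width", type=int, default=28,
                        help="image width in pixels (default 28)")
    parser.add_argument("-d", dest="out_dir", default="./28x28/",
                        help="output directory (default ./28x28/)")
    return parser


# The arguments from Example 2, parsed with the sketch above.
args = build_parser().parse_args(["-a", "AbzZ", "-f", "Arial", "-w", "100"])
print(args.alphabet, args.font, args.width)
```

For the authoritative list of options, check the script's own help output or the README on GitHub.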

## Call for Contributions

I wrote this tool for my own work in about a night. I would appreciate it if you could submit any Issues or PRs.

This work is still in progress…

### Zafar Takhirov

I am a recent PhD graduate from Boston University. While my work focuses on digital design, error mitigation, and machine learning, my non-work interests range widely from information theory (go Shannon!), quantum computing, grandfather paradox, Star Trek, Little Mermaid, 'why is the grass green?', 1Q84, etc., etc., etc. If you want to talk about, well, anything - just ping me.
