We use machine learning almost everywhere now, and a lot of research focuses on improving accuracy, training time, model complexity, etc. I would like to contribute a little to a slightly different problem: how do we improve test-time figures of merit?

Suppose you are controlling a robot with a machine learning algorithm and would like to increase its battery life. Training time doesn't matter, and test-time accuracy is important, but not as important as battery life. How would you go about it?

We have recently released a work on hardware for adaptive classifiers, but I wanted to give a more "people-friendly" explanation here.

You can download the iPython Notebook that I used to generate the images here.

The idea is based on selecting from multiple classifiers that differ in complexity and accuracy, such that we maximize all figures of merit at test time.

## “Hardness” of a problem

Let us define the "hardness" of a problem as a measure of how difficult a given example is to classify: if I have several classifiers, how complex a classifier do I need to identify the label correctly?

### Quick example on “Hard” problems

To visualize "hard" data points, let us look at the UCI Penbase dataset. I will use a linear classifier for the "easy" examples, because it is cheap and fast.

In the images below I show three numbers, T:a/P:b/L:c. T is the true label of the image, P is the prediction of the second-degree polynomial classifier, and L is the prediction of the linear classifier. Every time we see examples like these, the linear classifier fails.

As a first-order approximation, let us say that the "hardness" of a problem is defined by its distance from the decision boundary¹. Under that definition, all of the above examples would be considered "hard".

### Simple example

To visualize how the idea of “hardness” would create decision boundaries, let us consider a simpler (2D) example:

The above synthetic dataset was generated using the following True Label generator
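The notebook's exact generator is not reproduced here; a plausible stand-in (the sine-shaped boundary and all the constants below are my assumptions, not necessarily the notebook's) could look like:

```python
import numpy as np

def generate_data(n=2000, seed=42):
    """Generate a 2-D synthetic dataset with a nonlinear true boundary."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(n, 2))
    # True label: points above a sine curve belong to class 1, below to class 0.
    y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)
    return X, y

X, y = generate_data()
```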

Now let's assume that we are running on a tight energy budget and decide to use a cheap Logistic Regression. We then find that the new equation and the plots for the Predicted Labels become:

Notice that some “yellowish” labels bleed into the blue region and vice versa. The accuracy on the training data for the classifier is 94%.
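A minimal sketch of fitting the cheap classifier, assuming a hypothetical sine-shaped true boundary for the synthetic data; the resulting accuracy will not match the 94% quoted above exactly, since the generator is a stand-in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)

# Cheap linear classifier for the "easy" path.
easy_clf = LogisticRegression().fit(X, y)
train_acc = easy_clf.score(X, y)
print(f"Training accuracy: {train_acc:.2%}")
```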

To separate the hard data points from the easy ones, we compute the hard_bias as follows²:

where $\mathbb{1}_{z}$ is an indicator function. Now we can specify the boundaries for the Hardness Pseudo-Labels:
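One illustrative way to realize such a boundary in code: flag as "hard" every sample that falls within a margin of the linear decision boundary. The threshold value and the synthetic dataset here are assumptions, not the notebook's exact choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)
easy_clf = LogisticRegression().fit(X, y)

# Signed distance (up to a scale factor) from the linear decision boundary.
margin = easy_clf.decision_function(X)

hard_bias = 1.0  # illustrative threshold; the notebook's value may differ
hardness = (np.abs(margin) < hard_bias).astype(int)  # 1 = "hard", 0 = "easy"
print(f"Fraction of hard samples: {hardness.mean():.2%}")
```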

If we analyze the synthetic train and test data, we find that only a fraction of the data points require the complex classifier.

### Chooser Function

In order to decide whether or not to use the more expensive classifier, we use a "chooser" function – another logistic regression that predicts the "hardness" of the problem. As a first-order approximation we define the chooser as follows:

Step 1. Create hardness pseudo-labels

Step 2. Train the “chooser”
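The two steps might be sketched as follows; the margin-based pseudo-labels, the threshold, and the dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)
easy_clf = LogisticRegression().fit(X, y)

# Step 1: hardness pseudo-labels from the distance to the linear boundary.
hardness = (np.abs(easy_clf.decision_function(X)) < 1.0).astype(int)

# Step 2: the "chooser" is just another logistic regression fit on those labels.
# class_weight="balanced" compensates for the skew between easy and hard samples.
chooser = LogisticRegression(class_weight="balanced").fit(X, hardness)
```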

## Improvement on decision boundaries

We described above how to create a linear "bias" to separate hard from easy samples, but in practice we can simply treat the incorrectly classified samples as their own class, train the chooser on them, and let it create the boundaries automatically.
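A sketch of this variant, assuming the same kind of synthetic data as before:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)
easy_clf = LogisticRegression().fit(X, y)

# Instead of a hand-tuned margin, label as "hard" exactly the samples the
# cheap classifier gets wrong; the chooser then learns its own boundary.
hardness = (easy_clf.predict(X) != y).astype(int)
```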

## Theory

| Given | Description |
| --- | --- |
| $(x_1, y_1), ..., (x_n, y_n) \in \mathcal{X} \times \{1, ..., C\}$ | Input data / labels |
| $f_1, ..., f_k : \mathcal{X} \rightarrow \{1, ..., C\}$ | Collection of $k$ classifiers |
| $c_1, ..., c_k \in \mathbb{R}$ | Costs associated with the classifiers $f$ |
| $g : \mathcal{X} \rightarrow \{1, ..., k\}$ | "Chooser" function |

| Where | Description |
| --- | --- |
| $\mathcal{X}$ | Input space |
| $\{1, ..., C\}$ | Collection of output labels |
| $B$ | Upper bound on the budget |

Need to train the "chooser" such that

$$\min_{g} \sum_{i=1}^{n} \mathbb{1}_{f_{g(x_i)}(x_i) \neq y_i} \quad \text{s.t.} \quad \sum_{i=1}^{n} c_{g(x_i)} \leq B$$

The equation basically says

Find a setting of the "chooser" function $g$ such that, from all the given (pre-trained) classifiers $f$, it chooses the one that correctly classifies the current input while staying within the budget $B$.

In the code we will not take the budget $B$ into account explicitly, and will control it only through the class weight as shown below. This is done for simplicity, and because we don't have actual numbers on how much more expensive the "hard" classifier is.

Step 1. Train the easy and a hard classifier
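A sketch of this step; the synthetic dataset is an assumption, and the second-degree polynomial classifier mirrors the one used for the Penbase example above, though the notebook's actual models may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)

# "Easy" classifier: plain (linear) logistic regression.
easy_clf = LogisticRegression().fit(X, y)

# "Hard" classifier: logistic regression on second-degree polynomial features.
hard_clf = make_pipeline(PolynomialFeatures(degree=2),
                         LogisticRegression()).fit(X, y)
```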

Step 2. Choose the proper Hyperparameters and run the tests

Decide on the "weight" of the hard class. This is the same as the class weights in SVMs. It is important because the "hardness" labels are highly skewed.

For example, the plots below show how the decision on hardness changes with different weights on the "hard" class.

We can also see the utilization at every different setting for the hardness weight hyperparameter:
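A sketch of a sweep over the hardness weight that would produce output in this format; the dataset, the train/test split, and the weight values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 2))
y = (X[:, 1] > np.sin(2 * X[:, 0])).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

easy_clf = LogisticRegression().fit(X_tr, y_tr)
hard_clf = make_pipeline(PolynomialFeatures(degree=2),
                         LogisticRegression()).fit(X_tr, y_tr)

# Pseudo-labels: samples the cheap classifier misclassifies are "hard".
hardness = (easy_clf.predict(X_tr) != y_tr).astype(int)

results = {}
for w in [1.0, 3.0, 9.0, 27.0, 81.0]:
    # Larger weight on class 1 makes the chooser flag more samples as "hard".
    chooser = LogisticRegression(class_weight={0: 1.0, 1: w}).fit(X_tr, hardness)
    use_hard = chooser.predict(X_te).astype(bool)
    pred = np.where(use_hard, hard_clf.predict(X_te), easy_clf.predict(X_te))
    acc = (pred == y_te).mean()
    util = use_hard.mean()  # fraction routed to the expensive classifier
    results[w] = (acc, util)
    print(f"Test Accuracy @ hard weight = {w} : {acc:.2%}")
    print(f"Hard Utilization: {util:.2%}")
```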

```
Test Accuracy @ hard weight = 1.0 : 95.25%
Hard Utilization: 0.00%
Test Accuracy @ hard weight = 3.0 : 95.25%
Hard Utilization: 0.00%
Test Accuracy @ hard weight = 9.0 : 96.00%
Hard Utilization: 1.00%
Test Accuracy @ hard weight = 27.0 : 98.00%
Hard Utilization: 14.25%
Test Accuracy @ hard weight = 81.0 : 97.75%
Hard Utilization: 23.25%
Test Accuracy @ hard weight = 243.0 : 98.25%
Hard Utilization: 28.50%
Test Accuracy @ hard weight = 729.0 : 98.25%
Hard Utilization: 31.25%
Test Accuracy @ hard weight = 2187.0 : 98.25%
Hard Utilization: 31.75%
Test Accuracy @ hard weight = 6561.0 : 98.50%
Hard Utilization: 32.75%
Test Accuracy @ hard weight = 19683.0 : 98.50%
Hard Utilization: 33.00%
```


1. Note that this is not strictly true. There are cases where a sample is very far from the decision boundary but is still misclassified, and vice versa.

2. This bias is for illustration purposes only; in reality the "hardness" decision is a bit more involved. In particular, $\min_{g \in \mathbb{G}} \sum_i\sum_j{L_j(x_i, y_i)\mathbb{1}_{g(x_i)=j}}$ is an appropriate objective for training a "chooser" function with respect to a loss $L$.


### Zafar Takhirov

I am a recent PhD graduate from Boston University. While my work focuses on digital design, error mitigation, and machine learning, my non-work interests range widely from information theory (go Shannon!), quantum computing, the grandfather paradox, Star Trek, The Little Mermaid, 'why is the grass green?', 1Q84, etc. If you want to talk about, well, anything – just ping me.
