Adaptive Classification and More

We use machine learning almost everywhere now, and a lot of research focuses on improving accuracy, training time, model complexity, and so on. I would like to contribute a little to a slightly different problem: how do we improve the figures of merit at test time?

Suppose you are controlling a robot with a machine learning algorithm and would like to extend its battery life. Training time does not matter, test-time accuracy matters, but battery life matters even more. How would you go about it?

We have recently released a paper on hardware implementations of adaptive classifiers, but I wanted to give a more "people-friendly" explanation here.

You can download the IPython notebook that I used to generate the images here.

The idea is to select among multiple classifiers that differ in complexity and accuracy, so as to get the best trade-off between the test-time figures of merit.

“Hardness” of a problem

Let us define the "hardness" of a problem as how difficult a given example is to classify: if I have several classifiers of different complexity, how complex a classifier do I need to identify the label correctly?

Quick example on “Hard” problems

To visualize "hard" data points, let us look at the UCI Penbase (pen-based handwritten digits) dataset. I will use a linear classifier for the "easy" examples, because it is cheap and fast.

In the images below I show three numbers, T:a/P:b/L:c. T is the true label of the image, P is the prediction of the second-degree polynomial classifier, and L is the prediction of the linear classifier. In other words, these are exactly the examples on which the linear classifier fails.
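For context, here is a minimal sketch of how such a comparison could be set up. The dataset name on OpenML, the particular classifiers, and the split are assumptions for illustration, not the exact setup from the notebook:

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Pen-based handwritten digits (assumed to be available on OpenML as "pendigits")
pen = fetch_openml("pendigits", version=1, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(pen.data, pen.target,
                                          test_size=0.2, random_state=0)

# Cheap linear classifier vs. a second-degree polynomial classifier
linear_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
poly_clf = SVC(kernel="poly", degree=2).fit(X_tr, y_tr)

T, P, L = y_te, poly_clf.predict(X_te), linear_clf.predict(X_te)

# Samples that are "hard" for the linear model but solved by the polynomial one
hard_for_linear = np.where((L != T) & (P == T))[0]
print("{} test samples are hard for the linear classifier".format(len(hard_for_linear)))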

Examples hard for Linear

As a first-order approximation, let's say that the "hardness" of a problem is its distance from the decision boundary [1]. All of the above examples would then be considered "hard".
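For a linear model this distance is easy to compute: the decision function gives the signed margin, and dividing by the weight norm turns it into a geometric distance. A small helper, as a sketch, assuming a fitted binary linear classifier such as the `easy` model used below:

import numpy as np

def distance_from_boundary(clf, X):
  # |w.x + b| / ||w|| for a binary linear classifier (e.g. LogisticRegression)
  scores = clf.decision_function(X)
  return np.abs(scores) / np.linalg.norm(clf.coef_)

# Small distances correspond to "hard" samples for the cheap classifier:
# hardness = distance_from_boundary(easy, X_train)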

Simple example

To visualize how the idea of “hardness” would create decision boundaries, let us consider a simpler (2D) example:

True Dataset

The above synthetic dataset was generated using a hand-crafted true-label generator (the exact expression is in the notebook).

Now let's assume that we are running on a tight energy budget and decide to use a cheap logistic regression as the "easy" classifier. The predicted labels and the resulting decision boundary then look like this:

Logistic Dataset

Notice that some "yellowish" labels bleed into the blue region and vice versa. The accuracy of this classifier on the training data is 94%.
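The exact generator and classifiers live in the notebook; as a stand-in, the code below assumes something along these lines, where `easy` is the cheap logistic regression and `hard` is a more expensive non-linear classifier. The generator here is only illustrative, so the numbers will not match the plots exactly:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Illustrative 2-D synthetic data with -1/+1 labels (stand-in for the true generator)
X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)
y = 2*y - 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

easy = LogisticRegression().fit(X_train, y_train)   # cheap, linear
hard = SVC(kernel="rbf").fit(X_train, y_train)      # expensive, non-linear

print("easy training accuracy: {:.2%}".format(easy.score(X_train, y_train)))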

To separate the hard data points from the easy ones, we compute the hard bias $$H_{\text{bias}}$$ as follows [2]:

$$ H_{\text{bias}} = \max_i \; \mathbb{1}\left[ y'_i \neq y_i \right] \, d\left(x_i, \text{decision boundary}\right) $$

where $$\mathbb{1}[\cdot]$$ is an indicator function (it keeps only the misclassified training points) and $$d(\cdot, \cdot)$$ is the offset of a sample from the cheap classifier's decision boundary. Now we can specify the boundaries for the hardness pseudo-labels:

Hardness
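One simple, purely illustrative way to pick `H_bias` that is consistent with the definition above is to take the largest offset of any misclassified training point from the cheap classifier's decision line (continuing the sketch with `easy`, `X_train`, and `y_train` from above):

# Decision line of the easy classifier: x2 = a*x1 + b
a = -easy.coef_[0][0] / easy.coef_[0][1]
b = -easy.intercept_[0] / easy.coef_[0][1]

# Vertical offset of every training point from that line
offset = np.abs(X_train[:, 1] - (a*X_train[:, 0] + b))

# Widest offset among the points the easy classifier gets wrong
wrong = easy.predict(X_train) != y_train
H_bias = np.max(offset[wrong])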

If we analyze the training and test synthetic data, we find that only a fraction of all data points requires the complex classifier.

Ratio
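Concretely, that fraction can be measured directly from the envelope (a continuation of the sketch, using the illustrative `a`, `b`, and `H_bias` from before):

def hard_fraction(X):
  line = a*X[:, 0] + b
  return np.mean(np.abs(X[:, 1] - line) < H_bias)

print("hard fraction (train): {:.2%}".format(hard_fraction(X_train)))
print("hard fraction (test):  {:.2%}".format(hard_fraction(X_test)))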

Chooser Function

In order to choose whether or not to use the more expensive classifier, we use a "chooser" function: another simple linear model (a LinearSVC in the code below) that predicts the "hardness" of the problem. As a first-order approximation we will define the chooser as follows:

Step 1. Create hardness pseudo-labels

## Preprocessing stage

# Predict the easy labels y' and find out which ones of them are wrong
y_prime_easy = easy.predict(X_train)
wrong = y_prime_easy != y_train

# Compute the decision line of the easy classifier: x2 = a*x1 + b
a = -easy.coef_[0][0] / easy.coef_[0][1]
b = -easy.intercept_[0] / easy.coef_[0][1]
y_prime_easy_regr = X_train[:, 0]*a + b

# Shift the line up and down by the hardness bias
y_decision_hard_up = y_prime_easy_regr + H_bias
y_decision_hard_down = y_prime_easy_regr - H_bias

# Points inside the envelope are "hard" for the cheap classifier
hard_mask = np.logical_and(X_train[:, 1] < y_decision_hard_up,
                           X_train[:, 1] > y_decision_hard_down)

# Create a "hardness" envelope
y_pseudo_1 = -np.ones(y_train.shape)
y_pseudo_1[X_train[:, 1] < y_decision_hard_up] = 1
y_pseudo_2 = -np.ones(y_train.shape)
y_pseudo_2[X_train[:, 1] > y_decision_hard_down] = 1

y_pseudo = -np.ones(y_train.shape)
y_pseudo[hard_mask] = 1

Step 2. Train the “chooser”

from sklearn.svm import LinearSVC

# Two one-sided "hardness" classifiers form the envelope
chooser = [LinearSVC(), LinearSVC()]
chooser[0].fit(X_train, y_pseudo_1)
chooser[1].fit(X_train, y_pseudo_2)

def chooser_predict(X):
  hardness = chooser[0].predict(X)*chooser[1].predict(X) # Has to be -1/+1
  combine = np.zeros(hardness.shape)
  combine[hardness == -1] = easy.predict(X[hardness == -1])
  combine[hardness == 1] = hard.predict(X[hardness == 1])
  return combine
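Once the two LinearSVCs are fitted, the adaptive classifier can be evaluated like any other model; a short usage sketch on the held-out split assumed earlier:

y_adaptive = chooser_predict(X_test)
used_hard = chooser[0].predict(X_test)*chooser[1].predict(X_test) == 1

print("Adaptive test accuracy: {:.2%}".format(np.mean(y_adaptive == y_test)))
print("Hard-classifier utilization: {:.2%}".format(np.mean(used_hard)))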

Decisions of Adaptive Classifier

Improvement on decision boundaries

We described above how to create a linear "bias" to separate the hard points, but in reality we can simply treat the incorrectly classified samples as their own class, train the chooser on that, and let it create the boundaries automatically (this is what the code in the next section does).

Theory

Given:
- input data and labels,
- a collection of classifiers,
- the cost associated with each classifier, and
- a "chooser" function,

together with the input space, the collection of output labels, and an upper bound on the budget.

We need to train the "chooser" such that, for every input, it selects from the given (pre-trained) classifiers one that classifies that input correctly, while keeping the total cost within the budget.
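A rough sketch of that objective in symbols (the notation here is generic, not taken from the paper): let $$g(x) \in \{1, \dots, K\}$$ be the "chooser", $$c_k$$ the k-th classifier, $$\beta_k$$ its cost, and $$B$$ the budget. Then we want

$$ \min_{g} \; \sum_{i} \mathbb{1}\left[ c_{g(x_i)}(x_i) \neq y_i \right] \quad \text{subject to} \quad \sum_{i} \beta_{g(x_i)} \leq B $$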

In the code we will not take the budget into account explicitly, and will control it only through the class weight, as shown below. This is done for simplicity, and because we don't have actual numbers for how much more expensive the "hard" classifier is.

Step 1. Use the trained easy and hard classifiers to create pseudo-labels

# X_train, y_train are the training samples; `easy` and `hard` are the
# pre-trained cheap and expensive classifiers
easy_train = easy.predict(X_train)
hard_train = hard.predict(X_train)

# Pseudo-labels for the chooser: keep the easy prediction (-1/+1) where it is
# correct, and leave 0 where the sample should go to the hard classifier
y_pseudo_auto = np.zeros(y_train.shape)
y_pseudo_auto[easy_train == y_train] = easy_train[easy_train == y_train]

Step 2. Choose the proper Hyperparameters and run the tests

Decide on the weight of the "hard" class. This is the same as the class weights in SVMs. It matters because the hardness labels are highly skewed: only a small fraction of the samples is "hard".

For example, the plots below show how the decision on hardness changes with different weights on the "hard" class.

Decisions of Adaptive Classifier

We can also see the utilization at every different setting for the hardness weight hyperparameter:

class_weights = {-1: 1., +1: 1., 0: 1.}

for _ in range(10):
  chooser_auto = LinearSVC(class_weight=class_weights)
  chooser_auto.fit(X_train, y_pseudo_auto)

  cls_to_use = chooser_auto.predict(X_test)
  # cls_to_use = 2*np.abs(cls_to_use) - 1

  y_adaptive = np.zeros(cls_to_use.shape)
  y_adaptive[cls_to_use == 1] = easy.predict(X_test[cls_to_use == 1])
  y_adaptive[cls_to_use == -1] = easy.predict(X_test[cls_to_use == -1])
  if np.sum(cls_to_use == 0) > 0:
    # print("Found zeros")
    y_adaptive[cls_to_use == 0] = hard.predict(X_test[cls_to_use == 0])

  print("Test Accuracy @ hard weight = {:.1f} : {:.2%}".format(class_weights[0], np.mean(y_adaptive == y_test)))
  print("\tHard Utilization: {:.2%}".format(np.mean(cls_to_use == 0)))
  class_weights[0] *= 3.
Test Accuracy @ hard weight = 1.0 : 95.25%
  Hard Utilization: 0.00%
Test Accuracy @ hard weight = 3.0 : 95.25%
  Hard Utilization: 0.00%
Test Accuracy @ hard weight = 9.0 : 96.00%
  Hard Utilization: 1.00%
Test Accuracy @ hard weight = 27.0 : 98.00%
  Hard Utilization: 14.25%
Test Accuracy @ hard weight = 81.0 : 97.75%
  Hard Utilization: 23.25%
Test Accuracy @ hard weight = 243.0 : 98.25%
  Hard Utilization: 28.50%
Test Accuracy @ hard weight = 729.0 : 98.25%
  Hard Utilization: 31.25%
Test Accuracy @ hard weight = 2187.0 : 98.25%
  Hard Utilization: 31.75%
Test Accuracy @ hard weight = 6561.0 : 98.50%
  Hard Utilization: 32.75%
Test Accuracy @ hard weight = 19683.0 : 98.50%
  Hard Utilization: 33.00%

[1] Note that this is not strictly true. There are cases where a sample is very far from the decision boundary but is still misclassified, and vice versa.

[2] This bias is only for illustration purposes; in reality the "hardness" is subject to a somewhat more involved decision process. In particular, an appropriate loss would be chosen and the "chooser" function trained with respect to that loss.

