Risk and uncertainty for deep learning
OK, you've got your results. But how spread out will your predictions be?

Risk vs. Uncertainty
There are several ways to draw the distinction between risk and uncertainty. The most famous comes from Frank Knight, who states:
Uncertainty must be taken in a sense radically distinct from the familiar notion of Risk, from which it has never been properly separated. The term “risk,” as loosely used in everyday speech and in economic discussion, really covers two things which, functionally at least, in their causal relations to the phenomena of economic organization, are categorically different. … The essential fact is that “risk” means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomenon depending on which of the two is really present and operating. … It will appear that a measurable uncertainty, or “risk” proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all. We … accordingly restrict the term “uncertainty” to cases of the non-quantitive type.
Here we’ll use a variant from Ian Osband:
[…] We identify risk as inherent stochasticity in a model and uncertainty as the confusion over which model parameters apply. For example, a coin may have a fixed $p = 0.5$ of heads and so the outcome of any single flip holds some risk; a learning agent may also be uncertain of $p$.
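To make the distinction concrete, here is a minimal sketch (not from the original post) of the coin example: each flip stays random no matter what (risk), while a Beta posterior over $p$ narrows as we observe more flips (uncertainty).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_true = 0.5                      # fixed coin bias: each flip's outcome is pure risk

for n in (10, 100, 10000):
    heads = rng.binomial(n, p_true)
    # Beta(1, 1) prior -> Beta(1 + heads, 1 + tails) posterior over p
    posterior = stats.beta(1 + heads, 1 + n - heads)
    print(f"n={n:6d}  posterior std of p = {posterior.std():.4f}")

# The posterior std (uncertainty about p) shrinks with data,
# while the per-flip variance p * (1 - p) (risk) does not.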
The data
We’ll use a simulated example for a heteroskedastic random variable $Y$:
\[y = x + e^{-\pi(x + 4)} + |x|\,\sin\!\left(2\pi\left(x + \frac{\epsilon}{16}\right)\right) + \epsilon,\\ \epsilon \sim \mathcal{N}\!\left(0,\ \frac{1}{1 + e^{-|x|}} - \frac{1}{2}\right), \quad x \sim \mathcal{N}(0, 1)\]so that the noise standard deviation grows with $|x|$.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def data_generator(size):
    _x = np.sort(np.random.normal(0, 1, int(size)))
    # heteroskedastic noise: the standard deviation grows with |x|
    eps = np.random.normal(0, 1 / (1 + np.exp(-np.abs(_x))) - 0.5, _x.shape)
    _y = _x + np.exp((-_x - 4) * np.pi) + np.abs(_x) * np.sin(2 * np.pi * (_x + eps / 16)) + eps
    return _x, eps, _y
x, eps, y = data_generator(2000)
plt.figure(figsize=(20, 5))
plt.scatter(x, y, label='Data')
plt.title('Generated Data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show();

plt.figure(figsize=(20, 5))
sns.kdeplot(eps, alpha=0.3, shade=True, label='$\epsilon$', c='b')
sns.kdeplot(x, alpha=0.3, shade=True, label='$x$', c='g')
sns.kdeplot(y, alpha=0.3, shade=True, label='$y$', c='r')
plt.title('Distribution of $x$, $\epsilon$, $y$')
plt.xlim(-2, 2)
plt.legend()
plt.show();

plt.subplots(1, 3, figsize=(20, 3))
plt.subplot(1, 3, 1)
sns.distplot(x)
plt.title('Distribution of $x$')
plt.subplot(1, 3, 2)
sns.distplot(eps)
plt.title('Distribution of $\epsilon$')
plt.subplot(1, 3, 3)
sns.distplot(y)
plt.title('Distribution of $y$')
plt.show();

Simple Regression
from keras.layers import Input, Dense, Dropout
from keras.models import Model
# (or the tensorflow.keras equivalents, depending on your setup)

def generate_simple_model():
    i = Input((1,))
    x = Dense(1000, activation='relu')(i)
    x = Dense(100, activation='relu')(x)
    x = Dense(10, activation='relu')(x)
    o = Dense(1, activation='linear')(x)
    m = Model(i, o)
    return m
x, _, y = data_generator(10000)
simple_model = generate_simple_model()
simple_model.compile(loss='mse', optimizer='adam')
hist = simple_model.fit(x, y, epochs=40, verbose=0)
plt.figure(figsize=(20, 3))
plt.plot(hist.history['loss'])
plt.title('Model training progress')
plt.xlabel('# Epochs')
plt.ylabel('loss (MSE)')
plt.ylim(0)
plt.show()
_y = simple_model.predict(x)
plt.figure(figsize=(20, 5))
plt.scatter(x, y, label='Data', s=1)
plt.plot(x, _y, label='Regression', c='orange')
plt.title('Regression')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show();
Strategies for output distribution
In all the following cases, there are some parameters $\theta$ (model parameters, such as the neural network weights) and some data $D = \{x, y\}$. To obtain an output distribution we need $P(\theta \mid D)$, i.e., the probability of the weights given the data. By Bayes' theorem,

\[P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta')\,P(\theta')\,d\theta'}\]

Generally, the last integral is either difficult or expensive to compute, so the following methods approximate the posterior distribution instead (see the worked equation after the list for how posterior samples are then used):
- Markov chain Monte Carlo (Metropolis-Hastings)
- Variational Inference
- Monte Carlo Dropout
- Bootstrap
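Once we have (approximate) samples $\theta_1, \dots, \theta_S$ from $P(\theta \mid D)$, all of these methods can be used in the same way: the predictive distribution is a Monte Carlo average over the sampled parameters,

\[P(y^* \mid x^*, D) = \int P(y^* \mid x^*, \theta)\,P(\theta \mid D)\,d\theta \;\approx\; \frac{1}{S}\sum_{s=1}^{S} P(y^* \mid x^*, \theta_s)\]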
Markov chain Monte Carlo – Metropolis-Hastings
Markov chain Monte Carlo (MCMC) is a family of methods for drawing samples from an unknown distribution. They combine Monte Carlo sampling with a Markov chain whose stationary distribution is the target distribution. Among MCMC methods, the most famous is Metropolis-Hastings, given by the following algorithm (a minimal toy implementation appears after the list):
- Pick a target function $f$ and a proposal distribution $g$. Typically $f$ is the unnormalized posterior $P(D\mid \theta)P(\theta)$, which we can evaluate, while $g$ is a simple distribution (such as a Gaussian centred on the current sample). As we iterate, the accepted samples accumulate and their empirical distribution approaches $P(\theta\mid D)$.
- For each iteration t:
- we have $x_t$ as the previous sample from $g$
- get a sample $x$ from $g(x\mid x_t)$
- calculate the acceptance ratio $\alpha = \frac{f(x)\,g(x_t \mid x)}{f(x_t)\,g(x \mid x_t)}$, which reduces to $f(x)/f(x_t)$ for a symmetric proposal. It measures how plausible $x$ is relative to the current sample.
- with probability $\min(1, \alpha)$, add $x$ to the gathered samples and move the chain there; otherwise keep (and re-record) the previous sample $x_t$.
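As a concrete illustration (not part of the original post), here is a minimal Metropolis sampler with a symmetric Gaussian proposal, targeting a toy unnormalized density f:

def metropolis(f, x0, n_samples=10000, proposal_std=1.0, seed=0):
    """Sample from a density proportional to f using a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x_t = x0
    for t in range(n_samples):
        x = x_t + rng.normal(0, proposal_std)      # propose from g(x | x_t)
        alpha = f(x) / f(x_t)                      # acceptance ratio (symmetric g)
        if rng.random() < alpha:                   # accept with probability min(1, alpha)
            x_t = x
        samples[t] = x_t                           # re-record x_t on rejection
    return samples

# toy target: unnormalized mixture of two Gaussians
f = lambda t: np.exp(-0.5 * (t - 2) ** 2) + 0.5 * np.exp(-0.5 * (t + 2) ** 2)
toy_samples = metropolis(f, x0=0.0)
print(toy_samples.mean(), toy_samples.std())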
This algorithm provably approximates $P(\theta\mid D)$ asymptotically. However, it is still extremely expensive computationally, sometimes to the point of being infeasible. For that reason, I wasn't able to run it on a DNN (too many parameters), so I switched the model to a polynomial (degree 9 in the code below). I also used PyMC3, a framework for Monte Carlo methods. Note: since we use Metropolis-Hastings, the `sample` call doesn't just read off samples; it runs the chain that updates the parameter values (analogous to a `fit` method).
import pymc3 as pm

polynoms_n = range(10)
with pm.Model() as model:
    # Priors over the polynomial coefficients
    theta = [pm.Normal(f'theta_{i}', 0, sigma=20) for i in polynoms_n]
    # Likelihood: y is normal around the polynomial in x
    likelihood = pm.Normal('y', sum([theta[i] * x ** i for i in polynoms_n]),
                           sigma=1, observed=y)
    # Inference!
    trace = pm.sample(10000, cores=3, step=pm.Metropolis())
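The original post shows the resulting fit as a figure; a sketch of how the trace might be turned into a predictive band (assuming the `trace` from the call above and the 10,000-point `x`, `y` arrays):

# Sketch: evaluate the polynomial for a subsample of posterior draws
# and plot the spread as an uncertainty band.
thetas = np.stack([trace[f'theta_{i}'] for i in polynoms_n], axis=1)   # (draws, 10)
curves = np.stack([np.sum(t * x[:, None] ** np.arange(10), axis=1) for t in thetas[::100]])

plt.figure(figsize=(20, 5))
plt.scatter(x, y, s=1, label='Data')
plt.plot(x, curves.mean(axis=0), c='orange', label='Posterior mean')
plt.fill_between(x, np.percentile(curves, 5, axis=0), np.percentile(curves, 95, axis=0),
                 alpha=0.4, label='90% band')
plt.legend()
plt.show()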


Variational Inference
Introduction
In an ML problem, we want to approximate $\hat{f}(X) = Y$. Given a neural network with weights $w$, we want to maximize the likelihood $p(Y \mid w, X)$; during training, we adjust $w$ so that it increases. For uncertainty, we instead need the posterior probability of the weights, $p(w \mid Y, X)$. Using Bayes' theorem: \[p(w \mid Y, X) = \frac{p(Y \mid w, X) \cdot p(w)}{p(Y \mid X)} = \frac{p(Y \mid w, X) \cdot p(w)}{\int p(Y \mid w', X)\,p(w')\,dw'}\]
The Kullback-Leibler divergence
Given two distributions, $q$ and $p$, we can define the following similarity quasi-measure:

\[KL(q \Vert p) = \mathbb{E}_{q}\!\left[\log\frac{q(w)}{p(w)}\right] = \int q(w)\,\log\frac{q(w)}{p(w)}\,dw\]
Note that it is not a true distance measure, since $KL(q \Vert p) \neq KL(p \Vert q)$. Variational inference picks a tractable family $q(w)$ (e.g., independent Gaussians over the weights) and minimizes $KL\big(q(w) \Vert p(w \mid Y, X)\big)$, turning posterior inference into an optimization problem.
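The reason this is tractable is a standard identity (not derived in the original post): the KL to the posterior differs from a computable objective, the evidence lower bound (ELBO), only by a constant:

\[KL\big(q(w) \,\Vert\, p(w \mid Y, X)\big) = \log p(Y \mid X) \;-\; \underbrace{\Big(\mathbb{E}_{q}\big[\log p(Y \mid w, X)\big] - KL\big(q(w) \Vert p(w)\big)\Big)}_{\text{ELBO}}\]

so maximizing the ELBO minimizes the KL divergence without ever evaluating the intractable evidence term.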
import theano

def variational_network(X_train, y_train):
    # Initialize random weights between each layer
    hidden_shape = (X_train.shape[1], 1000, 100, 10, 1)
    layers_init = [
        np.random.randn(hidden_shape[i], hidden_shape[i + 1]).astype(float)
        for i in range(len(hidden_shape) - 1)
    ]
    with pm.Model() as neural_network:
        model_input = theano.shared(X_train)
        model_output = theano.shared(y_train)
        layers = [model_input]
        for i, layer in enumerate(layers_init):
            weights = pm.Normal(f'w_{i+1}', 0, sigma=1,
                                shape=layer.shape,
                                testval=layer)
            layers.append(weights)
        act = layers[0]
        for layer in layers[1:]:
            # note: the sigmoid is applied to every layer, including the output one
            act = pm.math.sigmoid(pm.math.dot(act, layer))
        output = pm.Normal('out',
                           act, sigma=1,
                           observed=model_output,
                           total_size=y_train.shape[0])
    return model_input, model_output, neural_network

X_train, y_train = x.reshape(-1, 1), y.reshape(-1, 1)  # 2-D arrays matching the (N, 1) network output
model_input, model_output, neural_network = variational_network(X_train, y_train)
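The post doesn't show the fitting and prediction step at this point; a minimal sketch, assuming PyMC3's ADVI is used to fit the approximation and the posterior predictive is then sampled from it (the original post shows the resulting fit as a figure):

# Sketch only: fit a mean-field approximation with ADVI, then sample predictions.
with neural_network:
    approx = pm.fit(n=30000, method=pm.ADVI())               # maximize the ELBO
    trace = approx.sample(draws=500)                         # draws from q(w)
    ppc = pm.sample_posterior_predictive(trace, samples=200)

y_pred_mean = ppc['out'].mean(axis=0).flatten()
y_pred_std = ppc['out'].std(axis=0).flatten()

plt.figure(figsize=(20, 5))
plt.scatter(x, y, s=1, label='Data')
plt.plot(x, y_pred_mean, c='orange', label='Mean prediction')
plt.fill_between(x, y_pred_mean - y_pred_std, y_pred_mean + y_pred_std, alpha=0.4, label='Stdev')
plt.legend()
plt.show()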

Pretty bad, huh? That’s because
- the DNN isn't well optimized; it would take many more training steps to become accurate;
- variational inference is prone to getting stuck in local optima.
Monte Carlo Dropout
Methods #1 and #2, although they don't work well for deep learning, are well established for small data (classical statistics). In 2016, this paper showed that keeping dropout active in every layer at prediction time is a principled way to estimate uncertainty: repeated stochastic forward passes are equivalent to a Monte Carlo approximation of a Bayesian posterior.
def generate_dropout_model():
    i = Input((1,))
    x = Dense(1000, activation='relu')(i)
    # training=True keeps dropout active at prediction time (MC Dropout)
    x = Dropout(0.1)(x, training=True)
    x = Dense(100, activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    x = Dense(10, activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    o = Dense(1, activation='linear')(x)
    m = Model(i, o)
    return m
x, _, y = data_generator(10000)
dropout_model = generate_dropout_model()
dropout_model.compile(loss='mse', optimizer='adam')
hist = dropout_model.fit(x, y, epochs=100, verbose=1)
plt.figure(figsize=(20, 3))
plt.plot(hist.history['loss'])
plt.title('Model training progress')
plt.xlabel('# Epochs')
plt.ylabel('loss (MSE)')
plt.ylim(0)
plt.show()

preds = []
for i in range(10):
    _y = dropout_model.predict(x)
    preds.append(_y)
predictions = np.array(preds)
plt.figure(figsize=(20, 5))
plt.scatter(x, y, label='Data', s=5)
for i, _y in enumerate(predictions):
    plt.scatter(x, _y, label=f'Regression #{i+1}', c='orange', s=1, alpha=0.1)
plt.title('Regression w/ Dropout')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show();

mean = predictions.mean(axis=0).flatten()
stdv = predictions.std(axis=0).flatten()
plt.figure(figsize=(20, 5))
plt.plot(x, mean, label='Mean', ms=1, color='orange')
plt.fill_between(x, mean-stdv, mean+stdv, label='Stdev', alpha=0.4)
plt.title('Error Estimation')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show();

Application — Predicting Car Price Distributions
I used a Kaggle dataset containing a list of used cars, their prices, and features such as power, how many times the car has been sold, mileage, etc. Here is some of the preprocessing:
import pandas as pd

# Fuel
df = pd.get_dummies(df, columns=['Fuel_Type'], prefix='fuel_')
# Transmission
df['Manual'] = (df['Transmission'] == 'Manual').astype(int)
# Owner type (first, second, ...)
df['Owner_Type'] = df['Owner_Type'].map({'First': 1, 'Second': 2, 'Third': 3}).fillna(4)
# Mileage
df['Mileage'] = df['Mileage'].apply(process_mileage)
df['Mileage'] = df['Mileage'].fillna(df['Mileage'].mean())
# Engine
df['Engine'] = df['Engine'].astype('str').str.extract(r'(\d+)')[0].astype('float')
df['Engine'] = df['Engine'].fillna(df['Engine'].mean())
# Power
df['Power'] = df['Power'].astype('str').str.extract(r'(\d+)')[0].astype('float')
df['Power'] = df['Power'].fillna(df['Power'].mean())
# Seats
df['Seats'] = df['Seats'].fillna(df['Seats'].median())
# Drop others
df.drop(['Name', 'Location', 'New_Price', 'Transmission'], axis=1, inplace=True)
And then we train using the following model:
def generate_dropout_model(input_shape):
    i = Input((input_shape,))
    x = Dense(1024, kernel_initializer='normal', activation='relu')(i)
    x = Dropout(0.1)(x, training=True)
    x = Dense(512, kernel_initializer='normal', activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    x = Dense(256, kernel_initializer='normal', activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    x = Dense(128, kernel_initializer='normal', activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    x = Dense(64, kernel_initializer='normal', activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    x = Dense(32, kernel_initializer='normal', activation='relu')(x)
    x = Dropout(0.1)(x, training=True)
    o = Dense(1, kernel_initializer='normal', activation='linear')(x)
    m = Model(i, o)
    return m

Results
Here are some predicted price distributions. For each car, we draw a sample of price predictions from the dropout model and compare the resulting distribution with the real price.
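The figures in the original post show these distributions; a sketch of how one such distribution might be produced with the MC-dropout model, assuming hypothetical `X_test` and `y_test` arrays and a trained `dropout_model`:

# Hypothetical names: X_test, y_test and a trained dropout_model are assumed.
idx = 0                                            # pick one car
x_car = X_test[idx:idx + 1]
# 200 stochastic forward passes -> a sample of predicted prices
price_samples = np.array([dropout_model.predict(x_car)[0, 0] for _ in range(200)])

plt.figure(figsize=(10, 4))
sns.kdeplot(price_samples, shade=True, label='Predicted price distribution')
plt.axvline(y_test[idx], c='r', label='Real price')
plt.legend()
plt.show()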



Conclusion
There are several more approaches for estimating uncertainty not discussed here (some better than others), and a number of them are useful for deep learning too. You can check the complete code in the link below!
References
- https://arxiv.org/pdf/1902.10189.pdf
- https://gdmarmerola.github.io/risk-and-uncertainty-deep-learning/
- https://arxiv.org/pdf/1905.09638.pdf
- https://arxiv.org/pdf/1505.05424.pdf
- https://arxiv.org/pdf/1506.02142.pdf
- https://arxiv.org/pdf/1602.04621.pdf
- https://arxiv.org/pdf/1806.03335.pdf
- https://arxiv.org/pdf/1601.00670.pdf
- https://ermongroup.github.io/cs228-notes/inference/variational/
- https://github.com/ericmjl/website/blob/master/content/blog/variational-inference-with-pymc3-a-lightweight-demo/linreg.ipynb
- https://medium.com/tensorflow/regression-with-probabilistic-layers-in-tensorflow-probability-e46ff5d37baf
- https://papers.nips.cc/paper/8080-randomized-prior-functions-for-deep-reinforcement-learning.pdf
- http://proceedings.mlr.press/v37/salimans15.pdf
- https://www.cs.ubc.ca/~schmidtm/Courses/540-W18/L34.pdf
- http://bayesiandeeplearning.org/2016/papers/BDL_4.pdf
- https://towardsdatascience.com/uncertainty-estimation-for-neural-network-dropout-as-bayesian-approximation-7d30fc7bc1f2