How to Take the Derivative of the Softmax Function

Hongtao Hao / 2022-11-03


This blog post is inspired by MLDawn’s video on the same topic.

Suppose we have a vector $[z_1, z_2, z_3]$. The softmax function then takes the form:

$$S(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$$

$$S(z_2) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}$$

$$S(z_3) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}}$$
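If it helps to see these three equations numerically, here is a minimal NumPy sketch (the example vector and its values are my own, chosen arbitrarily):

```python
import numpy as np

# An example vector [z_1, z_2, z_3]; the values are arbitrary.
z = np.array([1.0, 2.0, 3.0])

def softmax(z):
    """Compute S(z_i) = e^{z_i} / sum_j e^{z_j} for every component."""
    # Subtracting max(z) before exponentiating is a common trick to
    # avoid overflow; it does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(z))  # [0.09003057 0.24472847 0.66524096]
```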

Now we want to take the derivative of $S(z_1)$.

Note that $S(z_1)$ depends on three variables: $z_1$, $z_2$, and $z_3$. Therefore, when we talk about the derivative of $S(z_1)$, we need to specify which variable we are differentiating with respect to.

Below, we derive the derivative of $S(z_1)$. To follow the steps, we first need a few pieces of background.

Pre-requisite knowledge #

Quotient Rule #

The quotient rule goes like this. If we have

$$f(x) = \frac{g(x)}{h(x)}$$

Then the derivative of $f(x)$ is:

$$f^\prime(x) = \frac{df}{dx} = \frac{g^\prime(x)\cdot h(x) - g(x)\cdot h^\prime(x)}{(h(x))^2}$$

You may wonder why. We can derive it from the product rule. Let's say

$$t(x) = \frac{1}{h(x)}$$

We have

$$f(x) = \frac{g(x)}{h(x)} = g(x)\cdot t(x)$$

According to the product rule, we have:

$$f^\prime(x) = g(x)\cdot t^\prime(x) + g^\prime(x)\cdot t(x)$$

How do we get $t^\prime(x)$? This is where the chain rule comes in.

Let’s say we have

$$t(m) = m^{-1}$$

and

$$m = h(x)$$

So

$$\begin{aligned} t^\prime(x) &= \frac{dt}{dx} \\ & = \frac{dt}{dm}\cdot \frac{dm}{dx} \\ &= -m^{-2}\cdot h^\prime(x) \\ &= -(h(x))^{-2}\cdot h^\prime(x)\end{aligned}$$

So we have:

$$\begin{aligned} f^\prime(x) &= g(x)\cdot t^\prime(x) + g^\prime(x)\cdot t(x) \\&= g(x)(-(h(x))^{-2}h^\prime(x)) + g^\prime(x)(h(x))^{-1} \\ &= \frac{-g(x)h^\prime(x)}{(h(x))^2} + \frac{g^\prime(x)}{h(x)} \\ &= \frac{g^\prime(x)h(x) - g(x)h^\prime(x)}{(h(x))^2} \end{aligned}$$
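If you would rather not check the algebra by hand, a quick SymPy sketch (with arbitrary example choices for $g$ and $h$) confirms the identity:

```python
import sympy as sp

x = sp.symbols('x')
# Arbitrary example functions for g(x) and h(x).
g = sp.sin(x)
h = x**2 + 1

lhs = sp.diff(g / h, x)                               # derivative computed by SymPy
rhs = (sp.diff(g, x) * h - g * sp.diff(h, x)) / h**2  # quotient rule formula

print(sp.simplify(lhs - rhs))  # 0, so the two expressions agree
```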

Sum rule #

If we have

$$f(x) = g(x) + h(x)$$

Then,

$$f^\prime(x) = g^\prime(x) + h^\prime(x)$$

The derivative of $e^x$ #

You also need to know that the derivative of $e^x$ is $e^x$ itself.

Sum of softmax outputs #

You also need to know that

$$\sum_{i = 1}^{N} S(z_i) = 1$$
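This is easy to sanity-check numerically (again with an arbitrary example vector):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])  # arbitrary example vector
S = np.exp(z) / np.exp(z).sum()
print(S.sum())  # 1.0 (up to floating-point rounding)
```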

Derivation #

Armed with the above knowledge, we can calculate the derivatives of the softmax function. In the steps below, $\sum$ is shorthand for the denominator $e^{z_1} + e^{z_2} + e^{z_3}$. Let us say we want to take the derivative of $S(z_1)$ with respect to $z_1$:

$$\begin{aligned} \frac{\partial S(z_1)}{\partial z_1} & = \frac{\partial}{\partial z_1} (\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}) \\ &= \frac{(e^{z_1})^\prime \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + e^{z_2} + e^{z_3})^\prime}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{e^{z_1}\cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + 0 + 0)}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{e^{z_1} (\sum - e^{z_1})}{\sum^2} \\ & = \frac{e^{z_1}}{\sum} \cdot \frac{\sum - e^{z_1}}{\sum} \\ & = S(z_1) (1 - S(z_1)) \end{aligned}$$
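As a sanity check, a quick finite-difference sketch (my own, with an arbitrary example vector) agrees with this result:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
eps = 1e-6

# Numerical estimate of dS(z_1)/dz_1 via central differences.
z_plus, z_minus = z.copy(), z.copy()
z_plus[0] += eps
z_minus[0] -= eps
numerical = (softmax(z_plus)[0] - softmax(z_minus)[0]) / (2 * eps)

# Analytical value: S(z_1) * (1 - S(z_1)).
S = softmax(z)
analytical = S[0] * (1 - S[0])

print(numerical, analytical)  # the two should match to ~6 decimal places
```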

Then, what about the derivative of $S(z_1)$ with respect to $z_2$?

$$\begin{aligned} \frac{\partial S(z_1)}{\partial z_2} & = \frac{\partial}{\partial z_2} (\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}) \\ &= \frac{(e^{z_1})^\prime \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + e^{z_2} + e^{z_3})^\prime}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{0 \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (0 + e^{z_2} + 0 )}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{- e^{z_1} \cdot e^{z_2}}{\sum^2} \\ & = - \frac{e^{z_1}}{\sum} \cdot \frac{e^{z_2}}{\sum} \\ & = - S(z_1) S(z_2) \end{aligned}$$

Conclusion #

With the above, we can say that for

$$S(z_i) = \frac{e^{z_i}}{\sum_{j = 1}^{N} e^{z_j}}$$

its derivative

$$\frac{\partial S(z_i)}{\partial z_j}$$

takes two forms. When $i = j$:

$$\frac{\partial S(z_i)}{\partial z_j} = S(z_i)(1 - S(z_i))$$

When $i \neq j$:

$$\frac{\partial S(z_i)}{\partial z_j} = - S(z_i)S(z_j)$$
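These two cases combine into the Jacobian $\operatorname{diag}(S) - S S^\top$. Here is a sketch, again with an arbitrary example vector, that checks both cases at once against a numerical Jacobian:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
S = softmax(z)

# Analytical Jacobian: J[i, j] = S_i (1 - S_i) if i == j, else -S_i S_j,
# which is exactly diag(S) - S S^T.
J_analytical = np.diag(S) - np.outer(S, S)

# Numerical Jacobian via central differences, one input component at a time.
eps = 1e-6
J_numerical = np.zeros((len(z), len(z)))
for j in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    J_numerical[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)

print(np.allclose(J_analytical, J_numerical, atol=1e-8))  # True
```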

#ML

Last modified on 2022-11-04