Suppose we have a vector $[z_1, z_2, z_3]$. The softmax function is defined as:
$$S(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$$
$$S(z_2) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}$$
$$S(z_3) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}}$$
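As a quick sanity check of the definition, here is a minimal numerical sketch in Python, assuming NumPy is available; the logits $[1.0, 2.0, 3.0]$ are just an arbitrary example.

```python
import numpy as np

# Example logits; the specific values are arbitrary and only for illustration.
z = np.array([1.0, 2.0, 3.0])

# Softmax: exponentiate each entry and normalize by the sum of exponentials.
S = np.exp(z) / np.sum(np.exp(z))

print(S)  # e.g. [0.09003057 0.24472847 0.66524096]
```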
Now we want to take the derivative of $S(z_1)$. Let's do it.
Note that $S(z_1)$ is a function of three variables: $z_1$, $z_2$, and $z_3$. Therefore, when we talk about the derivative of $S(z_1)$, we need to specify which variable we differentiate with respect to. In the following, we will derive the derivatives of $S(z_1)$. To understand the steps, we first need the following background.
Pre-requisite knowledge #
Quotient Rule #
The quotient rule goes like this. If we have
$$f(x) = \frac{g(x)}{h(x)}$$
Then the derivative of $f(x)$ is:
$$f^\prime(x) = \frac{df}{dx} = \frac{g^\prime(x)\cdot h(x) - g(x)\cdot h^\prime(x)}{(h(x))^2}$$
You may wonder why this holds. We can derive it from the product rule. Let's say
$$t(x) = \frac{1}{h(x)}$$
We have
$$f(x) = \frac{g(x)}{h(x)} = g(x)\cdot t(x)$$
According to the product rule, we have:
$$f^\prime(x) = g(x)\cdot t^\prime(x) + g^\prime(x)\cdot t(x)$$
How do we get $t^\prime(x)$? This is where the chain rule comes in.
Let’s say we have
$$t(m) = m^{-1}$$
and
$$m = h(x)$$
So
$$\begin{aligned} t^\prime(x) &= \frac{dt}{dx} \\ & = \frac{dt}{dm}\cdot \frac{dm}{dx} \\ &= -m^{-2}\cdot h^\prime(x) \\ &= -(h(x))^{-2}\cdot h^\prime(x)\end{aligned}$$
So we have:
$$\begin{aligned} f^\prime(x) &= g(x)\cdot t^\prime(x) + g^\prime(x)\cdot t(x) \\&= g(x)\cdot(-(h(x))^{-2}h^\prime(x)) + g^\prime(x)\cdot(h(x))^{-1} \\ &= \frac{-g(x)h^\prime(x)}{(h(x))^2} + \frac{g^\prime(x)}{h(x)} \\ &= \frac{g^\prime(x)h(x) - g(x)h^\prime(x)}{(h(x))^2} \end{aligned}$$
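If you want to double-check the quotient rule on a concrete example, here is a small sketch using SymPy (assuming it is installed); the choices $g(x) = x^2$ and $h(x) = x + 1$ are arbitrary.

```python
import sympy as sp

x = sp.symbols('x')
g = x**2   # arbitrary example numerator
h = x + 1  # arbitrary example denominator

f = g / h

# Derivative computed directly by SymPy.
direct = sp.diff(f, x)

# Derivative computed via the quotient rule formula.
quotient_rule = (sp.diff(g, x) * h - g * sp.diff(h, x)) / h**2

# The two expressions agree (their difference simplifies to zero).
print(sp.simplify(direct - quotient_rule) == 0)  # True
```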
Sum rule #
If we have
$$f(x) = g(x) + h(x)$$
Then,
$$f^\prime(x) = g^\prime(x) + h^\prime(x)$$
Exponential #
You also need to know that the derivative of $e^x$ is $e^x$ itself.
Sum #
You also need to know that
$$\sum_{i = 1}^{N} S(z_i) = 1$$
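Continuing the earlier sketch, a quick numerical check of this property, again with an arbitrary example vector:

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])      # same example logits as before
S = np.exp(z) / np.sum(np.exp(z))  # softmax

# The softmax outputs always sum to 1, regardless of z.
print(S.sum())  # 1.0 (up to floating-point rounding)
```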
Derivation #
Armed with the above knowledge, we can calculate the derivative of the softmax function. For brevity, we will write $\sum$ as shorthand for the denominator $e^{z_1} + e^{z_2} + e^{z_3}$. Let us say we want to take the derivative of $S(z_1)$ with respect to $z_1$:
$$\begin{aligned} \frac{\partial S(z_1)}{\partial z_1} & = \frac{\partial}{\partial z_1} (\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}) \\ &= \frac{(e^{z_1})^\prime \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + e^{z_2} + e^{z_3})^\prime}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{e^{z_1}\cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + 0 + 0)}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{e^{z_1} (\sum - e^{z_1})}{\sum^2} \\ & = \frac{e^{z_1}}{\sum} \cdot \frac{\sum - e^{z_1}}{\sum} \\ & = S(z_1) (1 - S(z_1)) \end{aligned}$$
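To make the result concrete, here is a small sketch that compares this analytical derivative against a finite-difference approximation; the logits and step size are arbitrary choices.

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = np.array([1.0, 2.0, 3.0])  # arbitrary example logits
eps = 1e-6                     # finite-difference step size

# Numerical derivative of S(z_1) with respect to z_1.
z_plus = z.copy()
z_plus[0] += eps
numerical = (softmax(z_plus)[0] - softmax(z)[0]) / eps

# Analytical derivative: S(z_1) * (1 - S(z_1)).
S = softmax(z)
analytical = S[0] * (1 - S[0])

print(numerical, analytical)  # the two values should agree closely
```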
Then, what about the derivative of $S(z_1)$ with respect to $z_2$?
$$\begin{aligned} \frac{\partial S(z_1)}{\partial z_2} & = \frac{\partial}{\partial z_2} (\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}) \\ &= \frac{(e^{z_1})^\prime \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (e^{z_1} + e^{z_2} + e^{z_3})^\prime}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{0 \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot (0 + e^{z_2} + 0 )}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\ & = \frac{- e^{z_1} \cdot e^{z_2}}{\sum^2} \\ & = - \frac{e^{z_1}}{\sum} \cdot \frac{e^{z_2}}{\sum} \\ & = - S(z_1) S(z_2) \end{aligned}$$
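The same kind of finite-difference check works for the cross term:

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = np.array([1.0, 2.0, 3.0])  # arbitrary example logits
eps = 1e-6                     # finite-difference step size

# Numerical derivative of S(z_1) with respect to z_2.
z_plus = z.copy()
z_plus[1] += eps
numerical = (softmax(z_plus)[0] - softmax(z)[0]) / eps

# Analytical derivative: -S(z_1) * S(z_2).
S = softmax(z)
analytical = -S[0] * S[1]

print(numerical, analytical)  # the two values should agree closely
```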
Conclusion #
With the above, we can say that for
$$S(z_i) = \frac{e^{z_i}}{\sum_{j = 1}^{N} e^{z_j}}$$
its derivative $\frac{\partial S(z_i)}{\partial z_j}$ takes two forms. When $i = j$:
$$\frac{\partial S(z_i)}{\partial z_j} = S(z_i)(1 - S(z_i))$$
If $i \neq j$:
$$\frac{\partial S(z_i)}{\partial z_j} = - S(z_i)S(z_j)$$
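Putting the two cases together, here is a sketch of the full softmax Jacobian in NumPy; the function name `softmax_jacobian` is just an illustrative choice, not a standard API.

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def softmax_jacobian(z):
    """Jacobian matrix J with J[i, j] = dS(z_i)/dz_j."""
    S = softmax(z)
    # Diagonal entries are S(z_i) - S(z_i)^2 = S(z_i)(1 - S(z_i));
    # off-diagonal entries are -S(z_i) * S(z_j).
    return np.diag(S) - np.outer(S, S)

z = np.array([1.0, 2.0, 3.0])  # arbitrary example logits
print(softmax_jacobian(z))
```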