The sigmoid was the first activation function ever used. It was created to simulate the firing rate of a biological neuron.
The function squashes its input into the range $(0, 1)$.
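For reference, the function itself can be written as

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$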
The sigmoid has problems:
- If we analyze the function, we can see that it kills the gradients when the input is not close to 0 (either positively or negatively), because the gradient of the function with respect to its input is

  $$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)),$$

  which goes towards zero as soon as $\sigma(x)$ saturates near 0 or 1. This produces the problem of vanishing gradients (see the first sketch after this list).
- Another problem is that the sigmoid's outputs are always positive (not zero-centered), so the elements of the gradient vector for the weights of the next layer all have the same sign. This is bad because each update can only move the weights with all-positive or all-negative deltas; in practice this means we can only "zig-zag" towards the optimum, which results in slower convergence. This isn't really a big problem thanks to mini-batches: the same-sign restriction holds only per sample, and the per-sample gradients carry different signs, so the averaged batch update is not constrained this way (see the second sketch after this list).
- The third problem is that the exponential is a bit expensive in terms of computational complexity (see the last sketch after this list).
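A minimal sketch of the first problem, assuming NumPy is available: evaluating the gradient $\sigma(x)(1 - \sigma(x))$ at a few inputs shows how quickly it vanishes once the input moves away from 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigma(x) = {sigmoid(x):.5f}  grad = {sigmoid_grad(x):.2e}")
```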
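A small sketch of the second problem (the values and names here are illustrative, not from the text): when a neuron's inputs are sigmoid activations, they are all positive, so the gradient with respect to each of its weights is the upstream gradient scaled by a positive number, and every component shares the upstream gradient's sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sigmoid activations feeding the next layer: always in (0, 1), hence all positive.
h = 1.0 / (1.0 + np.exp(-rng.normal(size=4)))

# Illustrative upstream gradient dL/dz for the neuron fed by h.
upstream = -0.7

# dL/dw_i = upstream * h_i, so every component inherits the sign of `upstream`.
grad_w = upstream * h
print(h)       # all positive
print(grad_w)  # all negative here, i.e. all the same sign
```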
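Finally, a rough sketch of the third problem, assuming NumPy: timing the sigmoid against a cheaper element-wise max with 0 on the same array. The absolute numbers depend on hardware and library details, so treat this only as an illustration of the relative cost of the exponential.

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

# Sigmoid requires an exponential per element; max(0, x) is a single comparison.
sigmoid_time = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
max_time = timeit.timeit(lambda: np.maximum(0.0, x), number=100)

print(f"sigmoid: {sigmoid_time:.3f}s   max(0, x): {max_time:.3f}s")
```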