- A considerably faster version based on CUDA can be found here - Mish CUDA (All credits to Thomas Brandon for the same)
- Memory Efficient Experimental version of Mish can be found here
- Faster variants for Mish and H-Mish by Yashas Samaga can be found here - ConvolutionBuildingBlocks
- Alternative (experimental improved) variant of H-Mish developed by Páll Haraldsson can be found here - H-Mish (Available in Julia)
- Variance based initialization method for Mish (experimental) by Federico Andres Lois can be found here - Mish_init
- [07/17] Mish added to OpenVino - Open-1187, Merged-1125
- [07/17] Mish added to BetaML.jl
- [07/17] Loss Landscape exploration progress in collaboration with Javier Ideami and Ajay Uppili Arasanipalai
- [07/17] Poster accepted for presentation at DLRLSS hosted by MILA, CIFAR, Vector Institute and AMII
- [07/20] Mish added to Google's AutoML - 502
- Mila: Controlling Minima Concavity in Activation Function
- SharkFin: A Modified Version of ReLU
- β-Mish: An uni-parametric adaptive activation function derived from Mish
- Hard Mish- Memory Efficient and faster equivalent of Mish
Device Optimized Mish for PyTorch is an experimental feature under construction - Torch Dev
- (02/2020): Podcast episode on Mish at Machine Learning Café is out now. Listen on:
- (02/2020): Talk on Mish and Non-Linear Dynamics at Sicara is out now. Watch on:
- (07/2020): CROWN: A comparison of morphology for Mish, Swish and ReLU produced in collaboration with Javier Ideami. Watch on:
- Introduction
- Mathematics under the hood
a. Loss landscape - ImageNet Scores
- MS-COCO
- Variation of Parameter Comparison
a. MNIST
b. CIFAR10 - Edge of Chaos and Rate of Convergence (EOC & ROC)/ Hessian Energy Computation Analysis:
- Significance Level
a. Sample Size = 3
b. Sample Size = 23
c. Confidence Interval Profiles - Properties Summary
- Results
a. Summary of Results (Vision Tasks)
b. Summary of Results (Language Tasks)
c. Sample Result - Try It!
a. Demo Jupyter Notebooks
b. Torch
c. TensorFlow
d. MXNet - Future Work (Coming Soon)
- Acknowledgements
- Cite this work
Inspired by Swish Activation Function (Paper), Mish is a Self Regularized Non-Monotonic Neural Activation Function. Activation Function serves a core functionality in the training process of a Neural Network Architecture and is represented by the basic mathematical representation:
Image Credits: https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_FunctionsAn Activation Function is generally used to introduce non-linearity and over the years of theoretical machine learning research, many activation functions have been constructed with the 2 most popular amongst them being:
- ReLU (Rectified Linear Unit; f(x)=max(0,x))
- TanH
Other notable ones being:
- Softmax (Used for Multi-class Classification in the output layer)
- Sigmoid (f(x)=(1+e-x)-1;Used for Binary Classification and Logistic Regression)
- Leaky ReLU (f(x)=0.001x (x<0) or x (x>0))
Mish Activation Function can be mathematically represented by the following formula:
And it's 1st and 2nd derivatives are given below:
Where:
The Taylor Series Expansion of f(x) at x=0 is given by:
The Taylor Series Expansion of f(x) at x=∞ is given by:
Minimum of f(x) is observed to be ≈-0.30884 at x≈-1.1924
When visualized, Mish Activation Function closely resembles the function path of Swish having a small decay (preserve) in the negative side while being near linear on the positive side. It is a Non-Monotonic Function and as observed from it's derivatives functions shown above and graph shown below, it can be noted that it has a Non-Monotonic 1st derivative and 2nd derivative.
Mish ranges between ≈-0.31 to ∞.
Following image shows the effect of Mish being applied on random noise. This is a replication of the effect of the activation function on the image tensor inputs in CNN models.
Based on mathematical analysis, it is also confirmed that the function has a parametric order of continuity of: C∞
Mish has a very sharp global minima similar to Swish, which might account to gradients updates of the model being stuck in the region of sharp decay thus may lead to bad performance levels as compared to ReLU. Mish, also being mathematically heavy, is more computationally expensive as compared to the time complexity of Swish Activation Function.
The output landscape of 5 layer randomly initialized neural network was compared for ReLU, Swish, and Mish. The observation clearly shows the sharp transition between the scalar magnitudes for the co-ordinates of ReLU as compared to Swish and Mish. Smoother transition results in smoother loss functions which are easier to optimize and hence the network generalizes better. Additional comparison of output landscapes is done for GELU, SELU, ELU, Leaky ReLU, PReLU and RReLU. Most of them similar to ReLU have sharp transitions in the output landscape and thus prove to be a roadblock to effective optimization of gradients.
Loss landscape visualizations for a ResNet-20 for CIFAR 10 using Mish, Swish and ReLU for 200 epochs training:
Mish provides much better accuracy, overall lower loss, smoother and well conditioned easy-to-optimize loss landscape as compared to both Swish and ReLU. For all loss landscape visualizations please visit this readme. For the non-labelled visualization it's ReLu followed by Mish followed by Swish from left to right.
The Pre-Activations (ωx + b) distribution was observed for the final convolution layer in a ResNet v1-20 with Mish activation function before and after training for 20 epochs on CIFAR-10. As shown below, units are being preserved in the negative side which improves the network capacity to generalize well due to less loss of information.
Complex Analysis of Mish Activation Function:
For Installing DarkNet framework, please refer to darknet(Alexey AB)
Network | Activation | Top-1 Accuracy | Top-5 Accuracy | cfg | Weights | Hardware |
---|---|---|---|---|---|---|
ResNet-50 | Mish | 74.244% | 92.406% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
DarkNet-53 | Mish | 77.01% | 93.75% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
DenseNet-201 | Mish | 76.584% | 93.47% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
ResNext-50 | Mish | 77.182% | 93.318% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
Network | Activation | Top-1 Accuracy | Top-5 Accuracy |
---|---|---|---|
CSPResNet-50 | Leaky ReLU | 77.1% | 94.1% |
CSPResNet-50 | Mish | 78.1% | 94.2% |
Pelee Net | Leaky ReLU | 70.7% | 90% |
Pelee Net | Mish | 71.4% | 90.4% |
Pelee Net | Swish | 71.5% | 90.7% |
CSPPelee Net | Leaky ReLU | 70.9% | 90.2% |
CSPPelee Net | Mish | 71.2% | 90.3% |
Results on CSPResNext-50:
MixUp | CutMix | Mosaic | Blur | Label Smoothing | Leaky ReLU | Swish | Mish | Top -1 Accuracy | Top-5 Accuracy | cfg | weights |
---|---|---|---|---|---|---|---|---|---|---|---|
✔️ | 77.9%(=) | 94%(=) | |||||||||
✔️ | ✔️ | 77.2%(-) | 94%(=) | ||||||||
✔️ | ✔️ | 78%(+) | 94.3%(+) | ||||||||
✔️ | ✔️ | 78.1%(+) | 94.5%(+) | ||||||||
✔️ | ✔️ | 77.5%(-) | 93.8%(-) | ||||||||
✔️ | ✔️ | 78.1%(+) | 94.4%(+) | ||||||||
✔️ | 64.5%(-) | 86%(-) | |||||||||
✔️ | 78.9%(+) | 94.5%(+) | |||||||||
✔️ | ✔️ | ✔️ | ✔️ | 78.5%(+) | 94.8%(+) | ||||||
✔️ | ✔️ | ✔️ | ✔️ | 79.8%(+) | 95.2%(+) | cfg | weights |
Results on CSPResNet-50:
CutMix | Mosaic | Label Smoothing | Leaky ReLU | Mish | Top -1 Accuracy | Top-5 Accuracy | cfg | weights |
---|---|---|---|---|---|---|---|---|
✔️ | 76.6%(=) | 93.3%(=) | ||||||
✔️ | ✔️ | ✔️ | ✔️ | 77.1%(+) | 94.1%(+) | |||
✔️ | ✔️ | ✔️ | ✔️ | 78.1%(+) | 94.2%(+) | cfg | weights |
Results on CSPDarkNet-53:
CutMix | Mosaic | Label Smoothing | Leaky ReLU | Mish | Top -1 Accuracy | Top-5 Accuracy | cfg | weights |
---|---|---|---|---|---|---|---|---|
✔️ | 77.2%(=) | 93.6%(=) | ||||||
✔️ | ✔️ | ✔️ | ✔️ | 77.8%(+) | 94.4%(+) | |||
✔️ | ✔️ | ✔️ | ✔️ | 78.7%(+) | 94.8%(+) | cfg | weights |
Results on SpineNet-49:
CutMix | Mosaic | Label Smoothing | ReLU | Swish | Mish | Top -1 Accuracy | Top-5 Accuracy | cfg | weights |
---|---|---|---|---|---|---|---|---|---|
✔️ | 77%(=) | 93.3%(=) | - | - | |||||
✔️ | ✔️ | 78.1%(+) | 94%(+) | - | - | ||||
✔️ | ✔️ | ✔️ | ✔️ | 78.3%(+) | 94.6%(+) | - | - |
Model | Mish | AP50...95 | mAP50 | CPU - 90 Watt - FP32 (Intel Core i7-6700K, 4GHz, 8 logical cores) OpenCV-DLIE, FPS | VPU-2 Watt- FP16 (Intel MyriadX) OpenCV-DLIE, FPS | GPU-175 Watt- FP32/16 (Nvidia GeForce RTX 2070) DarkNet-cuDNN, FPS |
---|---|---|---|---|---|---|
CSPDarkNet-53 (512 x 512) | 42.4% | 64.5% | 3.5 | 1.23 | 43 | |
CSPDarkNet-53 (512 x 512) | ✔️ | 43% | 64.9% | - | - | 41 |
CSPDarkNet-53 (608 x 608) | ✔️ | 43.5% | 65.7% | - | - | 26 |
Architecture | Mish | CutMix | Mosaic | Label Smoothing | Size | AP | AP50 | AP75 |
---|---|---|---|---|---|---|---|---|
CSPResNext50-PANet-SPP | 512 x 512 | 42.4% | 64.4% | 45.9% | ||||
CSPResNext50-PANet-SPP | ✔️ | ✔️ | ✔️ | 512 x 512 | 42.3% | 64.3% | 45.7% | |
CSPResNext50-PANet-SPP | ✔️ | ✔️ | ✔️ | ✔️ | 512 x 512 | 42.3% | 64.2% | 45.8% |
CSPDarkNet53-PANet-SPP | ✔️ | ✔️ | ✔️ | 512 x 512 | 42.4% | 64.5% | 46% | |
CSPDarkNet53-PANet-SPP | ✔️ | ✔️ | ✔️ | ✔️ | 512 x 512 | 43% | 64.9% | 46.5% |
Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.
To observe how increasing the number of layers in a network while maintaining other parameters constant affect the test accuracy, fully connected networks of varying depths on MNIST, with each layer having 500 neurons were trained. Residual Connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization along with a dropout of 25%. The network is optimized using SGD on a batch size of 128, and for fair comparison, the same learning rates for each activation function was maintained. In the experiments, all 3 activations maintained nearly the same test accuracy for 15 layered Network. Increasing number of layers from 15 gradually resulted in a sharp decrease in test accuracy for Swish and ReLU, however, Mish outperformed them both in large networks where optimization becomes difficult.
The consistency of Mish providing better test top-1 accuracy as compared to Swish and ReLU was also observed by increasing Batch Size for a ResNet v2-20 on CIFAR-10 for 50 epochs while keeping all other network parameters to be constant for fair comparison.
Gaussian Noise with varying standard deviation was added to the input in case of MNIST classification using a simple conv net to observe the trend in decreasing test top-1 accuracy for Mish and compare it to that of ReLU and Swish. Mish mostly maintained a consistent lead over that of Swish and ReLU (Less than ReLU in just 1 instance and less than Swish in 3 instance) as shown below. The trend for test loss was also observed following the same procedure. (Mish has better loss than both Swish and ReLU except in 1 instance)
The effect of various Optimizers on the Test Top-1 Accuracy of a simple 4 layered Conv Net with Mish on MNIST was visualized and compared against Swish. Mish had a better accuracy in 7 out of the 9 optimizers as shown below. Mish was also tested for different Learning Rates for SGD optimizer on MNIST and compared to Swish. The comparison confirms that Mish performs best on lower learning rates as compared to Swish.
The effect of various Weight initializers and Regularizers on the Test Top-1 Accuracy in the fully connected Dense Layer of a simple 4 layered Conv Net with Mish on MNIST was compared to that with Swish and the plots beneath shows that Mish has a significant improvement over Swish.
The effect of increasing dropout rates and increasing dense units on Test Top-1 Accuracy for a 4 layered network using Mish on MNIST was compared to Swish. The graphs below show the consistency of Mish over Swish.
All default parameters were used for Optimizers.
For Cosine Annealing, Max η was set at 0.01 (1e-2) and Min η was set at 0.0001 (1e-4)
For One Cycle Policy, Min Learning Rate was set at 0.00000291545 (7e-3), Max Learning Rate was set at 0.00020408163 (7e-2), Min Momentum was set at 0.85, Max Momentum was set at 0.95, Annealing Stage was set at 0.1 and Annealing Rate was set at 0.01.
Coming Soon
The P-values were computed for different activation functions in comparison to that of Mish on terms of Top-1 Testing Accuracy of a Squeeze Net Model on CIFAR-10 for 50 epochs for 3 runs and 23 runs using Adam Optimizer at a Learning Rate of 0.001 and Batch Size of 128. It was observed that Mish beats most of the activation functions at a high significance level in the 3 runs while for 23 runs, it beats ReLU at a high significance of P < 0.0001. Mish also had a comparatively lower standard deviation across both 3 and 23 runs which proves the consistency of performance for Mish.
Activation Function | Peak Accuracy | Mean Accuracy | Standard Deviation of Accuracy | P-value | Mean Loss |
---|---|---|---|---|---|
Mish | 88.15% | 87.93% | 0.04358898943540784 | - | 4.018666666666666% |
ReLU | 87.47% | 87.06% | 0.5311308689955831 | P < 5e-2 (0.0475) | 4.2956666666666665% |
Swish-1 | 87.88% | 87.36333333333333% | 0.135030860670192 | P < 5e-3 (0.0023) | 4.191% |
ELU(α=1.0) | 86.82% | 86.46333333333334% | 0.07571877794400171 | P < 0.0001 | 4.2090000000000005% |
E-Swish (β=1.75) | 87.92% | 87.53999999999999% | 0.33421549934136363 | P < 5e-1 (0.1156) | 4.091333333333333% |
GELU | 87.89% | 87.28% | 0.15620499351812658 | P < 5e-3 (0.0023) | 4.405666666666667% |
HardShrink(λ = 0.5) | 75.44% | 74.89333333333333% | 0.6035174672976259 | P < 0.0001 | 7.278333333333333% |
Hardtanh | 83.39% | 82.79666666666667% | 0.36963946398258196 | P < 0.0001 | 5.132333333333333% |
Leaky ReLU(α=0.3) | 87.27% | 87.06666666666666% | 0.06429100507328683 | P < 0.0001 | 4.067333333333333% |
LogSigmoid | 84.54% | 82.41666666666667% | 0.7203702751594688 | P < 5e-4 (0.0002) | 5.436% |
PReLU | 85.82% | 84.61666666666666% | 0.4534681172181107 | P < 5e-4 (0.0002) | 5.366666666666666% |
RReLU | 87.88% | 86.82333333333334% | 1.1430806329097392 | P < 5e-1 (0.1691) | 4.103666666666666% |
ReLU6 | 87.5% | 87.02333333333333% | 0.092915732431772 | P < 5e-4 (0.0001) | 4.202333333333334% |
SELU | 84.54% | 84.53666666666666% | 0.26388128644020004 | P < 0.0001 | 4.612666666666667% |
CELU(α=1.0) | 87.2% | 86.52% | 0.32969683043669107 | P < 5e-3 (0.0018) | 4.145666666666667% |
Sigmoid | 81.75% | 78.96% | 1.8929606440705533 | P < 5e-3 (0.0012) | 6.463666666666667% |
SoftPlus(β = 1) | 84.93% | 81.92333333333333% | 1.6565727672919628 | P < 5e-3 (0.0033) | 6.008666666666667% |
Tanhshrink | 84.71% | 83.63% | 0.9457272334029457 | P < 5e-3 (0.0014) | 5.002666666666666% |
Tanh | 84.2% | 83.41% | 0.7397972695272689 | P <= 5e-4 (0.0005) | 5.053% |
Softshrink(λ = 0.5) | 83.34% | 82.51666666666667% | 0.22722969289539155 | P < 0.0001 | 5.494666666666666% |
Softsign | 83.64% | 83.23333333333333% | 0.4398105652816147 | P = 0.0001 | 5.056666666666667% |
Aria-2(β = 1, α=1.5) | 83.89% | 82.67666666666666% | 1.3052330570949109 | P < 5e-3 (0.0022) | 6.205666666666667% |
Bent's Identity | 85.66% | 85.19666666666666% | 0.3500476158086701 | P < 5e-4 (0.0002) | 4.434333333333333% |
SQNL | 83.72% | 83.52% | 0.20000000000000284 | P < 0.0001 | 5.045% |
ELisH | 87.89% | 87.86% | 0.04358898943540458 | P < 5e-1 (0.1206) | 4.138% |
Hard ELisH | 86.85% | 86.29% | 0.11789826122551722 | P < 5e-4 (0.0001) | 4.2967% |
SReLU | 85.91% | 85.347% | 0.5600297611139322 | P < 5e-3 (0.0013) | 4.479% |
ISRU (α=1.0) | 84.14% | 82.86% | 0.7396170180122467 | P < 5e-4 (0.0003) | 5.335% |
Flatten T-Swish | 87.35% | 86.85% | 0.11060440015357959 | P < 5e-4 (0.0001) | 4.669% |
Soft Clipping (α=0.5) | 71.62% | 54.087% | 9.498727985016378 | P < 5e-3 (0.0035) | 18.6857% |
SineReLU (ε = 0.001) | 87.3% | 87.13% | 0.187705443004009 | P < 5e-3 (0.0020) | 4.2963% |
Weighted TanH (Weight = 1.7145) | 83.52% | 83.09% | 0.356791255498227 | P < 0.0001 | 5.22% |
Le Cun's Tanh | 84.06% | 82.79% | 0.4751140214025823 | P < 0.0001 | 5.2026666666666666% |
ISRLU (α=1.0) | 87.1% | 86.02% | 0.8259136355172628 | P < 5e-2 (0.0160) | 4.373% |
Activation Function | Mean Accuracy | Mean Loss | Standard Deviation of Accuracy | P-value | Cohen's d Score | 95% CI |
---|---|---|---|---|---|---|
Mish | 87.48% | 4.13% | 0.3967 | - | - | - |
Swish-1 | 87.32% | 4.22% | 0.414 | P = 0.1973 | 0.386 | -0.3975 to 0.0844 |
E-Swish (β=1.75) | 87.49% | 4.156% | 0.411 | P = 0.9075 | 0.034444 | -0.2261 to 0.2539 |
GELU | 87.37% | 4.339% | 0.472 | P = 0.4003 | 0.250468 | -0.3682 to 0.1499 |
ReLU | 86.66% | 4.398% | 0.584 | P < 0.0001 | 1.645536 | -1.1179 to -0.5247 |
ELU(α=1.0) | 86.41% | 4.211% | 0.3371 | P < 0.0001 | 2.918232 | -1.2931 to -0.8556 |
Leaky ReLU(α=0.3) | 86.85% | 4.112% | 0.4569 | P < 0.0001 | 1.47632 | -0.8860 to -0.3774 |
RReLU | 86.87% | 4.138% | 0.4478 | P < 0.0001 | 1.444091 | -0.8623 to -0.3595 |
SELU | 83.91% | 4.831% | 0.5995 | P < 0.0001 | 7.020812 | -3.8713 to -3.2670 |
SoftPlus(β = 1) | 83.004% | 5.546% | 1.4015 | P < 0.0001 | 4.345453 | -4.7778 to -4.1735 |
HardShrink(λ = 0.5) | 75.03% | 7.231% | 0.98345 | P < 0.0001 | 16.601747 | -12.8948 to -12.0035 |
Hardtanh | 82.78% | 5.209% | 0.4491 | P < 0.0001 | 11.093842 | -4.9522 to -4.4486 |
LogSigmoid | 81.98% | 5.705% | 1.6751 | P < 0.0001 | 4.517156 | -6.2221 to -4.7753 |
PReLU | 85.66% | 5.101% | 2.2406 | P = 0.0004 | 1.128135 | -2.7715 to -0.8590 |
ReLU6 | 86.75% | 4.355% | 0.4501 | P < 0.0001 | 1.711482 | -0.9782 to -0.4740 |
CELU(α=1.0) | 86.23% | 4.243% | 0.50941 | P < 0.0001 | 2.741669 | -1.5231 to -0.9804 |
Sigmoid | 74.82% | 8.127% | 5.7662 | P < 0.0001 | 3.098289 | -15.0915 to -10.2337 |
Softshrink(λ = 0.5) | 82.35% | 5.4915% | 0.71959 | P < 0.0001 | 8.830541 | -5.4762 to -4.7856 |
Tanhshrink | 82.35% | 5.446% | 0.94508 | P < 0.0001 | 7.083564 | -5.5646 to -4.7032 |
Tanh | 83.15% | 5.161% | 0.6887 | P < 0.0001 | 7.700198 | -4.6618 to -3.9938 |
Softsign | 82.66% | 5.258% | 0.6697 | P < 0.0001 | 8.761157 | -5.1493 to -4.4951 |
Aria-2(β = 1, α=1.5) | 81.31% | 6.0021% | 2.35475 | P < 0.0001 | 3.655362 | -7.1757 to -5.1687 |
Bent's Identity | 85.03% | 4.531% | 0.60404 | P < 0.0001 | 4.80211 | -2.7576 to -2.1502 |
SQNL | 83.44% | 5.015% | 0.46819 | P < 0.0001 | 9.317237 | -4.3009 to -3.7852 |
ELisH | 87.38% | 4.288% | 0.47731 | P = 0.4283 | 0.235784 | -0.3643 to 0.1573 |
Hard ELisH | 85.89% | 4.431% | 0.62245 | P < 0.0001 | 3.048849 | -1.9015 to -1.2811 |
SReLU | 85.05% | 4.541% | 0.5826 | P < 0.0001 | 4.883831 | -2.7306 to -2.1381 |
ISRU (α=1.0) | 86.85% | 4.669% | 0.1106 | P < 0.0001 | 5.302987 | -4.4855 to -3.5815 |
Flatten T-Swish | 86.93% | 4.459% | 0.40047 | P < 0.0001 | 1.378742 | -0.7865 to -0.3127 |
SineReLU (ε = 0.001) | 86.48% | 4.396% | 0.88062 | P < 0.0001 | 1.461675 | -1.4041 to -0.5924 |
Weighted Tanh (Weight = 1.7145) | 80.66% | 5.985% | 1.19868 | P < 0.0001 | 7.638298 | -7.3502 to -6.2890 |
LeCun's Tanh | 82.72% | 5.322% | 0.58256 | P < 0.0001 | 9.551812 | -5.0566 to -4.4642 |
Soft Clipping (α=0.5) | 55.21% | 18.518% | 10.831994 | P < 0.0001 | 4.210373 | -36.8255 to -27.7154 |
ISRLU (α=1.0) | 86.69% | 4.231% | 0.5788 | P < 0.0001 | 1.572874 | -1.0753 to -0.4856 |
Values rounded up which might cause slight deviation in the statistical values reproduced from these tests
Activation Function | CI |
---|---|
Mish | 87.48 ± 0.1716 |
Swish-1 | 87.32347 ± 0.179027 |
E-Swish (β=1.75) | 87.49391 ± 0.1776597 |
GELU | 87.37083 ± 0.2040073 |
ReLU | 86.65869 ± 0.2524601 |
ELU(α=1.0) | 86.40565 ± 0.1458006 |
Leaky ReLU(α=0.3) | 86.84826 ± 0.1976138 |
RReLU | 86.86913 ± 0.1936264 |
SELU | 83.91086 ± 0.2592722 |
SoftPlus(β = 1) | 83.00434 ± 0.6060631 |
HardShrink(λ = 0.5) | 75.03086 ± 0.4252852 |
Hardtanh | 82.77956 ± 0.1941855 |
LogSigmoid | 81.9813 ± 0.7244 |
PReLU | 85.66478 ± 0.968944 |
ReLU6 | 86.75391 ± 0.1946326 |
CELU(α=1.0) | 86.22826 ± 0.2202884 |
Sigmoid | 74.81739 ± 2.4934984 |
Softshrink(λ = 0.5) | 82.34913 ± 0.3111762 |
Tanhshrink | 82.34608 ± 0.4086837 |
Tanh | 83.15217 ± 0.2978422 |
Softsign | 82.65782 ± 0.2896004 |
Aria-2(β = 1, α=1.5) | 81.30782 ± 1.0182716 |
Bent's Identity | 85.02608 ± 0.2612082 |
SQNL | 83.43695 ± 0.2024614 |
ELisH | 87.37652 ± 0.2064078 |
Hard ELisH | 85.88869 ± 0.2691689 |
SReLU | 85.04565 ± 0.2519697 |
ISRU (α=1.0) | 83.44652 ± 0.4323568 |
Flatten T-Swish | 86.93043 ± 0.1731766 |
SineReLU (ε = 0.001) | 86.48173 ± 0.3808073 |
Weighted Tanh (Weight = 1.7145) | 80.66043 ± 0.518349 |
LeCun's Tanh | 82.71956 ± 0.2519178 |
Soft Clipping (α=0.5) | 55.20956 ± 4.6841037 |
ISRLU (α=1.0) | 86.69956 ± 0.2502932 |
Activation Function Name | Function Graph | Equation | Range | Order of Continuity | Monotonic | Monotonic Derivative | Approximates Identity Near Origin | Dead Neurons | Saturated |
---|---|---|---|---|---|---|---|---|---|
Mish | ≈-0.31 to ∞ | C∞ | No ❎ | No ❎ | Yes ✔️ (Approximates half of identity at origin) | No ❎ | No ❎ |
News: Ajay Arasanipalai recently submitted benchmark for CIFAR-10 training for the Stanford DAWN Benchmark using a Custom ResNet-9 + Mish which achieved 94.05% accuracy in just 10.7 seconds in 14 epochs on the HAL Computing Cluster. This is the current fastest training of CIFAR-10 in 4 GPUs and 2nd fastest training of CIFAR-10 overall in the world.
All results and comparative analysis are present in the Readme file present in the Examples and Benchmarks Folder.
Comparison is done based on the high priority metric, for image classification the Top-1 Accuracy while for Generative Networks and Image Segmentation the Loss Metric. Therefore, for the latter, Mish > Baseline is indicative of better loss and vice versa. For Embeddings, the AUC metric is considered.
Activation Function | Mish > Baseline Model | Mish < Baseline Model |
---|---|---|
ReLU | 55 | 20 |
Swish-1 | 53 | 22 |
SELU | 26 | 1 |
Sigmoid | 24 | 0 |
TanH | 24 | 0 |
HardShrink(λ = 0.5) | 23 | 0 |
Tanhshrink | 23 | 0 |
PReLU(Default Parameters) | 23 | 2 |
Softsign | 22 | 1 |
Softshrink (λ = 0.5) | 22 | 1 |
Hardtanh | 21 | 2 |
ELU(α=1.0) | 21 | 7 |
LogSigmoid | 20 | 4 |
GELU | 19 | 3 |
E-Swish (β=1.75) | 19 | 7 |
CELU(α=1.0) | 18 | 5 |
SoftPlus(β = 1) | 17 | 7 |
Leaky ReLU(α=0.3) | 17 | 8 |
Aria-2(β = 1, α=1.5) | 16 | 2 |
ReLU6 | 16 | 8 |
SQNL | 13 | 1 |
Weighted TanH (Weight = 1.7145) | 12 | 1 |
RReLU | 12 | 11 |
ISRU (α=1.0) | 11 | 1 |
Le Cun's TanH | 10 | 2 |
Bent's Identity | 10 | 5 |
Hard ELisH | 9 | 1 |
Flatten T-Swish | 9 | 3 |
Soft Clipping (α=0.5) | 9 | 3 |
SineReLU (ε = 0.001) | 9 | 4 |
ISRLU (α=1.0) | 9 | 4 |
ELisH | 7 | 3 |
SReLU | 7 | 6 |
Hard Sigmoid | 1 | 0 |
Thresholded ReLU(θ=1.0) | 1 | 0 |
Comparison is done based on the best metric score (Test accuracy) across 3 runs.
Activation Function | Mish > Baseline Model | Mish < Baseline Model |
---|---|---|
Penalized TanH | 5 | 0 |
ELU | 5 | 0 |
Sigmoid | 5 | 0 |
SReLU | 4 | 0 |
TanH | 4 | 1 |
Swish | 3 | 2 |
ReLU | 2 | 3 |
Leaky ReLU | 2 | 3 |
GELU | 1 | 2 |
Configurations | Parameters |
---|---|
Model | Squeeze Excite ResNet-50 (SENet-50) |
Dataset | CIFAR-10 |
Batch Size | 128 |
Epoch | 100 |
Optimizer | Adam |
Learning Rate | 0.001 |
Activation Function | Testing Top-1 Accuracy | Loss | Testing Top-3 Accuracy |
---|---|---|---|
Mish | 90.7931% | 4.75271% | 98.5562% |
Swish-1 | 90.558% | 4.76047% | 98.6748% |
E-Swish (β = 1.75) | 90.5063% | 5.22954% | 98.6946% |
ReLU | 90.447% | 4.93086% | 98.6155% |
GELU | 90.5063% | 5.0612% | 98.754% |
SELU | 86.432% | 6.89385% | 97.8936% |
It was observed that the stability of descent of Loss for SENet-50 with Mish is much better as compared to other activation functions. It was also observed that Mish was the only activation function which crossed the 91% mark for the Test Top-1 accuracy across both the runs while others reached a maximum of 90.7% with Mish recording the highest at 91.248%.
Note - The graph represents the Test Top-1 accuracy and loss. Training Top-1 Accuracy and Loss are represented using dashed lines.
All demo jupyter notebooks are present in the Examples and Benchmarks Folder.
Torch Implementation of Mish Activation Function can be found here
TensorFlow - Keras Implementation of Mish Activation function can be found here
TensorFlow native implementation can be found on TensorFlow Addons
MXNet Implementation of Mish Activation function can be found here
- Comparison of Convergence Rates.
- Normalizing constant for Mish to eliminate the use of Batch Norm.
- Regularizing effect of the first derivative of Mish with repect to Swish.
- Memory Efficient Fast implementation of Mish for PyTorch/ CUDA
- Investigating the loss landscape for Mish.
Thanks to all the people who have helped and supported me massively through this project who include:
- Sparsha Mishra
- Alexandra Deis
- Alexey Bochkovskiy
- Chien-Yao Wang
- Thomas Brandon
- Less Wright
- Manjunath Bhat
- Ajay Uppili Arasanipalai
- Federico Lois
- Javier Ideami
- Ioannis Anifantakis
- George Christopoulos
- Miklos Toth
And many more including the Fast AI community, Weights and Biases Community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, Udacity scholarships team to name a few. Apologies if I missed out anyone.
@misc{misra2019mish,
title={Mish: A Self Regularized Non-Monotonic Neural Activation Function},
author={Diganta Misra},
year={2019},
eprint={1908.08681},
archivePrefix={arXiv},
primaryClass={cs.LG}
}