
🧠 Activation Functions: Deep Neural Network Analysis

License: MIT · Python 3.8+ · PyTorch

Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.

This repository provides a comprehensive comparison of five activation functions in deep neural networks, demonstrating the vanishing gradient problem with Sigmoid and showing why modern activations make deep networks trainable.


🎯 Key Findings

| Activation | Final MSE | Gradient Ratio (L10/L1) | Status |
|------------|-----------|--------------------------|--------|
| ReLU | 0.008 | 1.93 (stable) | ✅ Excellent |
| Leaky ReLU | 0.008 | 0.72 (stable) | ✅ Excellent |
| GELU | 0.008 | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | 2.59×10⁷ (vanishing!) | ❌ Failed |

🔬 The Vanishing Gradient Problem - Visualized

```text
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 5.04×10⁻¹
Layer 5  ████████████                              Gradient: 1.02×10⁻⁴
Layer 10 ▏                                         Gradient: 1.94×10⁻⁸  ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████    Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████  Gradient: 1.36×10⁻³  ← Healthy flow!
```

📊 Visual Results

Learned Functions

![Learned functions](learned_functions.png)

ReLU, Leaky ReLU, and GELU closely approximate the sine wave, Linear learns only a straight line, and Sigmoid fails to learn entirely.

Training Dynamics

![Loss curves](loss_curves.png)

Gradient Flow Analysis

![Gradient flow](gradient_flow.png)

Comprehensive Summary

![Summary figure](summary_figure.png)


🧪 Experimental Setup

Architecture

  • Network: 10 hidden layers × 64 neurons each
  • Task: 1D non-linear regression (sine wave approximation)
  • Dataset: y = sin(x) + ε, where x ∈ [-π, π] and ε ~ N(0, 0.1) (see the data sketch below)
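A minimal sketch of this dataset, assuming the noise term is Gaussian with standard deviation 0.1 and 200 samples (the full batch listed in the training configuration); variable names are illustrative, not taken from train.py:

```python
import numpy as np
import torch

# y = sin(x) + noise, with x drawn uniformly from [-pi, pi]
rng = np.random.default_rng(42)
x = rng.uniform(-np.pi, np.pi, size=(200, 1)).astype(np.float32)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.shape).astype(np.float32)

x_t = torch.from_numpy(x)  # shape (200, 1)
y_t = torch.from_numpy(y)  # shape (200, 1)
```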

Training Configuration

```python
optimizer  = Adam(lr=0.001)
loss_fn    = MSELoss()
epochs     = 500
batch_size = 200   # full batch (200 samples)
seed       = 42
```
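A sketch of how the architecture and configuration above fit together; `make_mlp` and the loop structure are assumptions for illustration, not the actual code in train.py (`x_t`, `y_t` come from the data sketch above):

```python
import torch
import torch.nn as nn

def make_mlp(activation_cls=nn.ReLU, depth: int = 10, width: int = 64) -> nn.Sequential:
    """Deep MLP for 1D regression: `depth` hidden layers of `width` units each."""
    layers, in_features = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_features, width), activation_cls()]
        in_features = width
    layers.append(nn.Linear(width, 1))  # linear output head for regression
    return nn.Sequential(*layers)

torch.manual_seed(42)
model = make_mlp(nn.ReLU)              # swap in nn.Sigmoid, nn.LeakyReLU, nn.GELU, ... to compare
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(500):               # full-batch training
    optimizer.zero_grad()
    loss = loss_fn(model(x_t), y_t)
    loss.backward()
    optimizer.step()
```

Swapping the `activation_cls` argument is the only change between runs; everything else stays fixed.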

Activation Functions Tested

| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | f(x) = x | Always 1 |
| Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 0.25] |
| ReLU | f(x) = max(0, x) | {0, 1} |
| Leaky ReLU | f(x) = max(0.01x, x) | {0.01, 1} |
| GELU | f(x) = x·Φ(x) | Smooth, ≈ (-0.13, 1.13) |

(Φ denotes the standard normal CDF.)
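The gradient ranges above can be spot-checked with autograd; a small sketch (the probe points and printing are illustrative):

```python
import torch
import torch.nn.functional as F

activations = {
    "Linear":     lambda x: x,
    "Sigmoid":    torch.sigmoid,
    "ReLU":       F.relu,
    "Leaky ReLU": lambda x: F.leaky_relu(x, negative_slope=0.01),
    "GELU":       F.gelu,
}

x = torch.linspace(-5.0, 5.0, 101, requires_grad=True)
for name, fn in activations.items():
    # Elementwise derivative f'(x) at the probe points
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    print(f"{name:<10} f'(x) range: [{grad.min().item():.3f}, {grad.max().item():.3f}]")
```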

🚀 Quick Start

Installation

```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```

Run the Experiment

```bash
# Basic 5-activation comparison
python train.py

# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py

# Training dynamics analysis
python train_dynamics.py
```

πŸ“ Repository Structure

```text
activation_functions/
├── README.md                          # This file
├── report.md                          # Detailed analysis report
├── activation_tutorial.md             # Educational tutorial
│
├── train.py                           # Main experiment (5 activations)
├── tutorial_experiments.py            # Extended experiments (8 activations)
├── train_dynamics.py                  # Training dynamics analysis
│
├── learned_functions.png              # Predictions vs ground truth
├── loss_curves.png                    # Training loss over epochs
├── gradient_flow.png                  # Gradient magnitude per layer
├── hidden_activations.png             # Activation patterns
├── summary_figure.png                 # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png             # Extended gradient analysis
├── exp2_activation_distributions.png  # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png     # Sparsity and dead neuron analysis
├── exp3_stability.png                 # Training stability analysis
├── exp4_predictions.png               # Function approximation comparison
├── exp4_representational_heatmap.png  # Representational capacity heatmap
│
├── activation_evolution.png           # Activation evolution during training
├── gradient_evolution.png             # Gradient evolution during training
├── training_dynamics_functions.png    # Training dynamics visualization
├── training_dynamics_summary.png      # Training dynamics summary
│
├── loss_histories.json                # Raw loss data
├── gradient_magnitudes.json           # Gradient measurements
├── gradient_magnitudes_epochs.json    # Gradient evolution data
├── exp1_gradient_flow.json            # Extended gradient data
└── final_losses.json                  # Final MSE per activation
```

📖 Key Insights

Why Sigmoid Fails in Deep Networks

The vanishing gradient problem occurs because:

  1. Sigmoid derivative is bounded: max(σ'(x)) = 0.25 at x=0
  2. Chain rule multiplies gradients: For 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
  3. Early layers don't learn: Gradient signal vanishes before reaching input layers

```text
# Theoretical gradient decay for Sigmoid
gradient_layer_10 = gradient_output * (0.25)^10
                  ≈ gradient_output * 0.000001
                  ≈ 0  # Effectively zero!
```
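The decay can also be observed directly by backpropagating once and inspecting per-layer gradient norms. A sketch reusing the hypothetical `make_mlp`, `x_t`, and `y_t` from the setup sketches above (the repository's own measurements live in gradient_magnitudes.json and gradient_flow.png):

```python
import torch.nn as nn

def layer_grad_norms(model, x, y):
    """Run one backward pass and return each Linear layer's weight-gradient norm."""
    model.zero_grad()
    nn.MSELoss()(model(x), y).backward()
    return [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]

for act_cls in (nn.Sigmoid, nn.ReLU):
    norms = layer_grad_norms(make_mlp(act_cls), x_t, y_t)
    print(act_cls.__name__, " -> ".join(f"{g:.1e}" for g in norms))
```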

Why ReLU Works

ReLU maintains unit gradient for positive inputs:

```text
# ReLU gradient
f'(x) = 1 if x > 0 else 0

# No multiplicative decay!
gradient_layer_10 ≈ gradient_output * 1^10 = gradient_output
```

Practical Recommendations

| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |
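As one concrete example of the table above, GELU is the usual activation inside a Transformer's position-wise feed-forward block; a minimal sketch (class name and dimensions are illustrative):

```python
import torch.nn as nn

class TransformerFFN(nn.Module):
    """Position-wise feed-forward block: Linear -> GELU -> Linear."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```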

📚 Extended Experiments

The tutorial_experiments.py script includes 4 additional experiments:

  1. Gradient Flow Analysis - Depths 5, 10, 20, 50 layers (see the sketch after this list)
  2. Activation Distributions - Sparsity and dead neuron analysis
  3. Training Stability - Learning rate and depth sensitivity
  4. Representational Capacity - Multiple target function approximation
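A sketch of the kind of depth sweep Experiment 1 performs, reusing the hypothetical helpers from the earlier sketches (the actual sweep lives in tutorial_experiments.py):

```python
import torch.nn as nn

for depth in (5, 10, 20, 50):
    norms = layer_grad_norms(make_mlp(nn.Sigmoid, depth=depth), x_t, y_t)
    # Compare the first and last *hidden* layers (the final entry is the output head).
    ratio = norms[0] / max(norms[-2], 1e-30)
    print(f"depth={depth:>2}  first/last hidden-layer gradient ratio = {ratio:.2e}")
```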



📄 Citation

```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```

📜 License

MIT License - feel free to use for education and research!


Generated by Orchestra Research Assistant
