Rebuilding Micrograd
Overview
Micrograd is a minimalistic automatic differentiation (autograd) engine implemented from scratch in Python. It demonstrates the core principles of backpropagation and neural network training using only scalar values and basic mathematical operations.
Technical Architecture
Automatic Differentiation Engine (`engine.py`)
The heart of the system is the `Value` class, which wraps a scalar number and tracks the computational history that produced it:
Core Data Structure
class Value:
    data: float          # the actual numerical value
    grad: float          # gradient (partial derivative of the output w.r.t. this value)
    _prev: set           # operand nodes that produced this value (predecessors in the graph)
    _op: str             # operation that created this value
    _backward: Callable  # closure that computes the local gradients
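As a rough sketch of how these fields fit together, a constructor plus a single operator might look like the following; the parameter name `_children` and the exact signatures are assumptions, and the real `engine.py` may differ in detail:

```python
from typing import Callable

class Value:
    def __init__(self, data: float, _children: tuple = (), _op: str = ''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)   # operand nodes that produced this value
        self._op = _op                # e.g. '+', '*', 'tanh'
        self._backward: Callable[[], None] = lambda: None  # local gradient rule

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # sum rule: d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
```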
Supported Operations
Arithmetic Operations:
- Addition (`+`): implements the sum rule of derivatives
- Multiplication (`*`): implements the product rule of derivatives
- Power (`**`): supports integer/float exponents via the power rule
- Division (`/`): implemented as multiplication by a negative power
- Subtraction (`-`): implemented as addition of the negation
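Continuing the class sketch above, multiplication and power can each carry their own `_backward` rule, while division, subtraction, and negation reduce to them exactly as described in the list (method names simply follow Python's operator protocol; details may differ from the actual `engine.py`):

```python
class Value:
    ...  # constructor and __add__ as sketched above

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # product rule: d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, exponent):
        assert isinstance(exponent, (int, float)), "only int/float exponents are supported"
        out = Value(self.data ** exponent, (self,), f'**{exponent}')

        def _backward():
            # power rule: d(x^n)/dx = n * x^(n-1)
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):             # -a  is  a * -1
        return self * -1

    def __sub__(self, other):      # a - b  is  a + (-b)
        return self + (-other)

    def __truediv__(self, other):  # a / b  is  a * b**-1
        return self * other ** -1
```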
Activation Functions:
- `tanh()`: hyperbolic tangent, with derivative (1 - tanh²(x))
- `relu()`: Rectified Linear Unit, with a step-function derivative
- `exp()`: exponential function, with derivative exp(x)
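A sketch of how these activations could be attached to `Value`, each closure applying the derivative stated above (again a continuation of the earlier sketch rather than the verbatim `engine.py` code):

```python
import math

class Value:
    ...  # arithmetic operators as sketched above

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d/dx tanh(x) = 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'relu')

        def _backward():
            # step-function derivative: 1 where x > 0, else 0
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')

        def _backward():
            # d/dx exp(x) = exp(x), which is exactly out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out
```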
Backpropagation Algorithm
The `backward()` method implements reverse-mode automatic differentiation:
- Local Gradient Computation: each operation defines its local gradient via a `_backward` closure
- Chain Rule Application: gradients propagate backward through the computation graph in reverse topological order
- Accumulation: gradients are accumulated (`+=`) for values that feed into more than one downstream operation
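Putting the three steps together, a `backward()` implementation along these lines first builds a topological ordering of the graph, seeds the output gradient with 1, and then invokes each node's `_backward` closure in reverse order (a sketch; the actual `engine.py` may structure this differently):

```python
class Value:
    ...  # operators and activations as sketched above

    def backward(self):
        # 1. topologically sort every node reachable from this output
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for prev in node._prev:
                    build(prev)
                topo.append(node)
        build(self)

        # 2. seed d(out)/d(out) = 1, then 3. apply the chain rule in reverse,
        #    accumulating gradients into every upstream node
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
```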
Neural Network Library (`nn.py`)
Built on top of the autograd engine, `nn.py` provides modular neural network components:
Module Base Class
- Purpose: Abstract base for all network components
- Features: Parameter collection and gradient zeroing
- Interface: `parameters()` and `zero_grad()` methods
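A minimal sketch of such a base class, assuming only the two-method interface described above:

```python
class Module:
    def parameters(self):
        # subclasses override this to return their trainable Value objects
        return []

    def zero_grad(self):
        # reset accumulated gradients before the next backward pass
        for p in self.parameters():
            p.grad = 0.0
```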
Neuron Implementation
class Neuron(Module):
    w: List[Value]  # weights, randomly initialized in [-1, 1]
    b: Value        # bias, randomly initialized in [-1, 1]
Forward Pass: Computes weighted sum + bias, applies tanh activation
Parameters: Returns all weights and bias as trainable parameters
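A `Neuron` sketch matching this description (random initialization in [-1, 1], tanh activation); the constructor argument name `n_inputs` is illustrative:

```python
import random

class Neuron(Module):
    def __init__(self, n_inputs: int):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # weighted sum of the inputs plus the bias, squashed by tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]
```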
Layer Architecture
- Composition: Contains multiple neurons with shared input
- Output: Returns single value for 1 neuron, list for multiple
- Scalability: Supports arbitrary layer widths
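A corresponding `Layer` sketch, returning a bare `Value` when it holds a single neuron and a list otherwise:

```python
class Layer(Module):
    def __init__(self, n_inputs: int, n_outputs: int):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        outs = [neuron(x) for neuron in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
```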
Multi-Layer Perceptron (MLP)
class MLP(Module):
    layers: List[Layer]  # sequential layer composition
Architecture: Takes an input size and a list of output sizes, one per layer
Forward Pass: Sequential application of layers
Flexibility: Supports arbitrary depth and width configurations
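A sketch of the `MLP` wiring together with an example configuration; the constructor signature is an assumption based on the description above:

```python
class MLP(Module):
    def __init__(self, n_inputs: int, layer_sizes: list):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

# e.g. a network with 3 inputs, two hidden layers of 4 neurons, and 1 output:
# model = MLP(3, [4, 4, 1])
```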
Computational Complexity
Memory: O(n), where n is the number of operations in the computation graph
Time:
- Forward pass: O(n)
- Backward pass: O(n)
- Parameter update: O(p), where p is the number of parameters
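For context, a hypothetical training step built from the pieces above; the squared-error loss and the fixed learning rate of 0.05 are illustrative choices, not something prescribed by `nn.py`:

```python
model = MLP(3, [4, 4, 1])
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0]]
ys = [1.0, -1.0, -1.0]

for _ in range(20):
    preds = [model(x) for x in xs]                                     # forward pass: O(n)
    loss = sum(((p - y) ** 2 for p, y in zip(preds, ys)), Value(0.0))  # squared-error loss

    model.zero_grad()        # clear gradients accumulated by the previous step
    loss.backward()          # backward pass: O(n)

    for p in model.parameters():                                       # update: O(p)
        p.data -= 0.05 * p.grad
```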
Current Limitations
- Scalar Only: No vector/matrix operations
- Simple Optimizers: No built-in optimization algorithms
- Limited Activations: Only tanh, ReLU, and exponential
- No Regularization: No dropout, batch norm, or weight decay
Potential Extensions
- Tensor Support: Extend to multi-dimensional arrays
- More Activations: Sigmoid, softmax, GELU, etc.
- Optimizers: SGD, Adam, RMSprop implementations
- Loss Functions: Cross-entropy, MSE, etc.
- Regularization: L1/L2 penalties, dropout layers
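As one example of the optimizer extension, a small SGD class could simply wrap the parameter update loop shown earlier (a hypothetical API, not part of the current code):

```python
class SGD:
    def __init__(self, params, lr: float = 0.01):
        self.params = list(params)  # Value objects collected via parameters()
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0
```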
References:
- Andrej Karpathy, "The spelled-out intro to neural networks and backpropagation: building micrograd"