Rebuilding Micrograd
Overview
Micrograd is a minimalistic automatic differentiation (autograd) engine implemented from scratch in Python. It demonstrates the core principles of backpropagation and neural network training using only scalar values and basic mathematical operations.
Technical Architecture
Automatic Differentiation Engine (`engine.py`)
The heart of the system is the `Value` class, which wraps a scalar number and tracks the computational history that produced it:
Core Data Structure
class Value:
    data: float          # the actual numerical value
    grad: float          # gradient (partial derivative of the output w.r.t. this value)
    _prev: set           # operand nodes that produced this value (predecessors in the graph)
    _op: str             # operation that created this value
    _backward: Callable  # closure that computes the local gradients
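As a rough sketch of how these fields fit together, a constructor plus a single operator might look like the following; the parameter name `_children` and the exact signatures are assumptions, and the real `engine.py` may differ in detail:

```python
from typing import Callable

class Value:
    def __init__(self, data: float, _children: tuple = (), _op: str = ''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)   # operand nodes that produced this value
        self._op = _op                # e.g. '+', '*', 'tanh'
        self._backward: Callable[[], None] = lambda: None  # local gradient rule

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # sum rule: d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
```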
Supported Operations
Arithmetic Operations:
- Addition (`+`): implements the sum rule of derivatives
- Multiplication (`*`): implements the product rule of derivatives
- Power (`**`): supports integer/float exponents via the power rule
- Division (`/`): implemented as multiplication by a negative power
- Subtraction (`-`): implemented as addition of the negation
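Continuing the class sketch above, multiplication and power can each carry their own `_backward` rule, while division, subtraction, and negation reduce to them exactly as described in the list (method names simply follow Python's operator protocol; details may differ from the actual `engine.py`):

```python
class Value:
    ...  # constructor and __add__ as sketched above

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # product rule: d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, exponent):
        assert isinstance(exponent, (int, float)), "only int/float exponents are supported"
        out = Value(self.data ** exponent, (self,), f'**{exponent}')

        def _backward():
            # power rule: d(x^n)/dx = n * x^(n-1)
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):             # -a  is  a * -1
        return self * -1

    def __sub__(self, other):      # a - b  is  a + (-b)
        return self + (-other)

    def __truediv__(self, other):  # a / b  is  a * b**-1
        return self * other ** -1
```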
Activation Functions:
- `tanh()`: hyperbolic tangent, with derivative (1 - tanh²(x))
- `relu()`: Rectified Linear Unit, with a step-function derivative
- `exp()`: exponential function, with derivative exp(x)
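A sketch of how these activations could be attached to `Value`, each closure applying the derivative stated above (again a continuation of the earlier sketch rather than the verbatim `engine.py` code):

```python
import math

class Value:
    ...  # arithmetic operators as sketched above

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d/dx tanh(x) = 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'relu')

        def _backward():
            # step-function derivative: 1 where x > 0, else 0
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')

        def _backward():
            # d/dx exp(x) = exp(x), which is exactly out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out
```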
Backpropagation Algorithm
The `backward()` method implements reverse-mode automatic differentiation:
- Local Gradient Computation: each operation defines its local gradient via a `_backward` closure
- Chain Rule Application: gradients propagate backward through the computation graph in reverse topological order
- Accumulation: gradients are accumulated (`+=`) for values that feed into more than one downstream operation
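Putting the three steps together, a `backward()` implementation along these lines first builds a topological ordering of the graph, seeds the output gradient with 1, and then invokes each node's `_backward` closure in reverse order (a sketch; the actual `engine.py` may structure this differently):

```python
class Value:
    ...  # operators and activations as sketched above

    def backward(self):
        # 1. topologically sort every node reachable from this output
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for prev in node._prev:
                    build(prev)
                topo.append(node)
        build(self)

        # 2. seed d(out)/d(out) = 1, then 3. apply the chain rule in reverse,
        #    accumulating gradients into every upstream node
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
```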
Neural Network Library (`nn.py`)
Built on top of the autograd engine, `nn.py` provides modular neural network components:
Module Base Class
- Purpose: Abstract base for all network components
- Features: Parameter collection and gradient zeroing
- Interface: `parameters()` and `zero_grad()` methods
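A minimal sketch of such a base class, assuming only the two-method interface described above:

```python
class Module:
    def parameters(self):
        # subclasses override this to return their trainable Value objects
        return []

    def zero_grad(self):
        # reset accumulated gradients before the next backward pass
        for p in self.parameters():
            p.grad = 0.0
```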
Neuron Implementation
class Neuron(Module):
    w: List[Value]  # weights, randomly initialized in [-1, 1]
    b: Value        # bias, randomly initialized in [-1, 1]
Forward Pass: Computes weighted sum + bias, applies tanh activation
Parameters: Returns all weights and bias as trainable parameters
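A `Neuron` sketch matching this description (random initialization in [-1, 1], tanh activation); the constructor argument name `n_inputs` is illustrative:

```python
import random

class Neuron(Module):
    def __init__(self, n_inputs: int):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # weighted sum of the inputs plus the bias, squashed by tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]
```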
Layer Architecture
- Composition: Contains multiple neurons with shared input
- Output: Returns single value for 1 neuron, list for multiple
- Scalability: Supports arbitrary layer widths
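A corresponding `Layer` sketch, returning a bare `Value` when it holds a single neuron and a list otherwise:

```python
class Layer(Module):
    def __init__(self, n_inputs: int, n_outputs: int):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        outs = [neuron(x) for neuron in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
```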
Multi-Layer Perceptron (MLP)
class MLP(Module):
    layers: List[Layer]  # sequential layer composition
Architecture: Takes an input size and a list of output sizes, one per layer
Forward Pass: Sequential application of layers
Flexibility: Supports arbitrary depth and width configurations
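A sketch of the `MLP` wiring together with an example configuration; the constructor signature is an assumption based on the description above:

```python
class MLP(Module):
    def __init__(self, n_inputs: int, layer_sizes: list):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

# e.g. a network with 3 inputs, two hidden layers of 4 neurons, and 1 output:
# model = MLP(3, [4, 4, 1])
```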
Computational Complexity
Memory: O(n), where n is the number of operations in the computation graph
Time:
- Forward pass: O(n)
- Backward pass: O(n)
- Parameter update: O(p), where p is the number of parameters
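For context, a hypothetical training step built from the pieces above; the squared-error loss and the fixed learning rate of 0.05 are illustrative choices, not something prescribed by `nn.py`:

```python
model = MLP(3, [4, 4, 1])
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0]]
ys = [1.0, -1.0, -1.0]

for _ in range(20):
    preds = [model(x) for x in xs]                                     # forward pass: O(n)
    loss = sum(((p - y) ** 2 for p, y in zip(preds, ys)), Value(0.0))  # squared-error loss

    model.zero_grad()        # clear gradients accumulated by the previous step
    loss.backward()          # backward pass: O(n)

    for p in model.parameters():                                       # update: O(p)
        p.data -= 0.05 * p.grad
```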
Current Limitations
- Scalar Only: No vector/matrix operations
- Simple Optimizers: No built-in optimization algorithms
- Limited Activations: Only tanh, ReLU, and exponential
- No Regularization: No dropout, batch norm, or weight decay
Potential Extensions
- Tensor Support: Extend to multi-dimensional arrays
- More Activations: Sigmoid, softmax, GELU, etc.
- Optimizers: SGD, Adam, RMSprop implementations
- Loss Functions: Cross-entropy, MSE, etc.
- Regularization: L1/L2 penalties, dropout layers
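As one example of the optimizer extension, a small SGD class could simply wrap the parameter update loop shown earlier (a hypothetical API, not part of the current code):

```python
class SGD:
    def __init__(self, params, lr: float = 0.01):
        self.params = list(params)  # Value objects collected via parameters()
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0
```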
References:
- Andrej Karpathy, "The spelled-out intro to neural networks and backpropagation: building micrograd"