PyTorch Reference
About PyTorch Reference
The PyTorch Reference is a structured, searchable cheat sheet for the PyTorch deep learning framework. It covers eight core categories — Tensors, Autograd, Neural Networks, Optimizers, Data, Training, GPU, and Save/Load — giving machine learning engineers and researchers a single place to look up any PyTorch API. Each entry includes a description and a real code example, making it practical for building and experimenting with neural network models.
PyTorch is the leading deep learning framework for research and increasingly for production, used by teams working on computer vision, natural language processing, reinforcement learning, and scientific computing. Its dynamic computation graph (autograd) allows flexible model architectures and easy debugging. This reference covers the full training workflow: creating tensors and loading data with DataLoader, defining models with nn.Module or nn.Sequential, choosing loss functions (CrossEntropyLoss, MSELoss) and optimizers (Adam, SGD), running training and validation loops, and saving checkpoints. It also covers the key details of gradient management — requires_grad, backward(), zero_grad(), detach(), and no_grad().
The reference also addresses GPU acceleration, including how to check CUDA availability, move models and tensors to a device, use nn.DataParallel for multi-GPU training, monitor GPU memory usage, and apply mixed precision training with `autocast` and `GradScaler`. The Data section covers custom Dataset classes, torchvision transforms and built-in datasets, and random_split for train/validation splitting. The Save/Load section shows both full model saving and the recommended state_dict approach, plus complete checkpoint saving and loading for resuming interrupted training.
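The workflow described above can be condensed into a minimal training loop. The sketch below uses a toy regression dataset and illustrative layer sizes standing in for real data and a real architecture:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 64 samples, 3 features, 1 regression target.
X, y = torch.randn(64, 3), torch.randn(64, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    for xb, yb in loader:
        opt.zero_grad()                # reset accumulated gradients
        loss = loss_fn(model(xb), yb)  # forward pass + loss
        loss.backward()                # compute gradients via autograd
        opt.step()                     # update weights
```

The same skeleton extends to validation loops (wrap the forward pass in `torch.no_grad()` and call `model.eval()`) and checkpointing, both covered below.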
Key Features
- Covers 8 categories: Tensors, Autograd, Neural Networks, Optimizers, Data, Training, GPU, Save/Load
- Tensor creation: torch.tensor, zeros/ones, randn, arange — plus shape, dtype, device operations
- Autograd: requires_grad, backward(), zero_grad(), detach(), torch.no_grad() context manager
- Network layers: nn.Linear, nn.Conv2d, nn.LSTM, nn.BatchNorm2d, nn.Dropout, nn.Sequential
- Optimizers: Adam and SGD with momentum, plus StepLR learning rate scheduler
- Data pipeline: custom Dataset, DataLoader with batch_size/shuffle, torchvision transforms
- GPU: CUDA availability check, tensor.to(device), DataParallel, mixed precision (autocast + GradScaler)
- Save/Load: state_dict (recommended), full model, and complete training checkpoint patterns
Frequently Asked Questions
What is a PyTorch tensor and how is it different from a NumPy array?
A PyTorch tensor is a multi-dimensional array similar to a NumPy ndarray, but with two key differences: tensors can run on a GPU for hardware-accelerated computation, and tensors support automatic differentiation (autograd). You create tensors with `torch.tensor()`, `torch.zeros()`, `torch.randn()`, and similar functions. Tensors carry `.shape`, `.dtype`, and `.device` attributes. Use `.to("cuda")` to move a tensor to the GPU. NumPy arrays can be converted to tensors with `torch.from_numpy()` and back with `.numpy()`; note that `torch.from_numpy()` shares memory with the source array, so in-place changes to one are visible in the other.
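A brief sketch of these basics; the values are arbitrary:

```python
import numpy as np
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from a Python list
z = torch.zeros(2, 3)                        # all zeros
r = torch.randn(4)                           # standard-normal samples

print(t.shape)   # torch.Size([2, 2])
print(t.dtype)   # torch.float32 (default for Python floats)
print(t.device)  # cpu, unless moved with .to("cuda")

# NumPy round trip: from_numpy shares memory with the source array.
a = np.arange(3, dtype=np.float32)
ta = torch.from_numpy(a)
back = ta.numpy()
```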
How does autograd work in PyTorch?
Autograd is PyTorch's automatic differentiation engine. When you create a tensor with `requires_grad=True`, PyTorch records all operations on it in a computation graph. Calling `.backward()` on a scalar output traverses this graph and computes gradients with respect to all tensors that have `requires_grad=True`, storing them in their `.grad` attribute. In a training loop, call `optimizer.zero_grad()` before each backward pass to reset accumulated gradients, then `loss.backward()`, then `optimizer.step()` to update weights.
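A minimal demonstration: for y = x² + 3x, the derivative at x = 2 is 2x + 3 = 7, and autograd recovers exactly that.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # track operations on x
y = x ** 2 + 3 * x                         # builds the computation graph
y.backward()                               # populates x.grad

print(x.grad)  # tensor(7.)

# Typical training-loop ordering (names are illustrative):
# optimizer.zero_grad()  # clear accumulated gradients
# loss.backward()        # compute new gradients
# optimizer.step()       # apply the update
```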
How do I define a neural network in PyTorch?
Define a class that inherits from `nn.Module`. In `__init__`, declare your layers as attributes (e.g., `self.fc1 = nn.Linear(784, 256)`). In the `forward` method, define how data flows through the layers. PyTorch automatically tracks parameters defined as `nn.Module` attributes. For simple sequential architectures, `nn.Sequential` is a shortcut: `model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))`. Custom modules are needed when your forward pass has branching, skip connections, or other non-sequential logic.
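A minimal custom module following this pattern, using the same 784-input / 10-class shapes mentioned above:

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers declared as attributes are registered automatically.
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # non-linearity between layers
        return self.fc2(x)           # raw logits for 10 classes

model = MLP()
out = model(torch.randn(32, 784))  # batch of 32 flattened inputs
print(out.shape)                   # torch.Size([32, 10])
```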
What is the difference between Adam and SGD optimizers?
SGD (Stochastic Gradient Descent) updates weights by subtracting the gradient scaled by the learning rate. Adding momentum (`torch.optim.SGD(params, lr=0.01, momentum=0.9)`) makes it accumulate a velocity vector that smooths out oscillations. Adam (Adaptive Moment Estimation) maintains per-parameter adaptive learning rates based on first and second moment estimates of the gradients. Adam typically converges faster and is less sensitive to the learning rate choice, making it a good default. SGD with momentum can sometimes generalize better, for example on image classification tasks.
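Both optimizers share the same API, so switching between them is a one-line change. A sketch with a throwaway linear model and illustrative learning rates:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# One update step looks identical for either optimizer:
loss = model(torch.randn(4, 10)).pow(2).mean()
before = model.weight.detach().clone()
adam.zero_grad()
loss.backward()
adam.step()  # weights change after the step
```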
How do I load data efficiently with PyTorch?
Use the `Dataset` and `DataLoader` classes. Create a custom `Dataset` by subclassing `torch.utils.data.Dataset` and implementing `__len__` and `__getitem__`. Wrap it in a `DataLoader` with `batch_size` and `shuffle=True` for training. For image tasks, use `torchvision.datasets` (MNIST, CIFAR10, ImageFolder, etc.) and `torchvision.transforms.Compose` to chain preprocessing steps like resize, crop, tensor conversion, and normalization. Use `num_workers` in DataLoader to load data in parallel with background workers.
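A toy in-memory `Dataset` showing the two required methods; a real dataset would typically read files or query storage inside `__getitem__`:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.x)                 # total number of samples

    def __getitem__(self, i):
        return self.x[i], self.x[i] ** 2   # (input, target) pair

loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=True)
for xb, yb in loader:
    print(xb.shape)  # batches of up to 4 samples, collated automatically
```

In real pipelines, add `num_workers=4` (or similar) to the `DataLoader` to overlap data loading with training.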
How should I save and load a PyTorch model?
The recommended approach is to save only the state_dict (model weights): `torch.save(model.state_dict(), "weights.pth")`. To load, recreate the model architecture and call `model.load_state_dict(torch.load("weights.pth"))`, then `model.eval()` for inference. For checkpointing during training (to resume later), save a dict containing the epoch, model state_dict, optimizer state_dict, and loss value. Avoid saving the entire model object with `torch.save(model, ...)`: it pickles references to the model's class and module paths, so loading breaks if the code is later moved or refactored.
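The state_dict round trip, sketched with a throwaway model and a temporary file; the checkpoint keys at the end are a common convention, not a fixed API:

```python
import os
import tempfile
import torch
from torch import nn

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "weights.pth")
torch.save(model.state_dict(), path)      # weights only, not the class

model2 = nn.Linear(4, 2)                  # recreate the architecture first
model2.load_state_dict(torch.load(path))
model2.eval()                             # switch to inference mode

# Checkpoint pattern for resuming interrupted training:
opt = torch.optim.Adam(model.parameters())
ckpt = {"epoch": 5, "model": model.state_dict(), "optim": opt.state_dict()}
torch.save(ckpt, path)
```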
How do I use GPU acceleration in PyTorch?
First check availability: `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")`. Then move your model and data to the device: `model.to(device)` moves the module's parameters in place, but tensor `.to()` returns a new tensor, so reassign with `data = data.to(device)`. For multi-GPU training, wrap your model with `nn.DataParallel(model)`. For mixed precision training (faster and uses less GPU memory), use the `torch.cuda.amp.autocast()` context manager for the forward pass and `GradScaler` for the backward pass. Monitor memory with `torch.cuda.memory_allocated()` and `torch.cuda.memory_summary()`.
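The device-agnostic pattern in full; this sketch runs on the GPU when one is available and falls back to the CPU otherwise:

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)  # modules move in place (and return self)
x = torch.randn(3, 8)
x = x.to(device)                    # tensors do NOT move in place: reassign

out = model(x)
print(out.device)                   # matches `device`
```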
What is gradient clipping and when should I use it?
Gradient clipping limits the magnitude of gradients before the optimizer step to prevent the exploding gradients problem. Use `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` after `loss.backward()` and before `optimizer.step()`. Exploding gradients are common in recurrent networks (RNNs, LSTMs) when training on long sequences, as gradients are multiplied through many time steps. A `max_norm` of 1.0 is a common starting point, but you may need to tune it based on your loss curves.
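Clipping slots between `backward()` and `step()`; the model and data below are toys, and `max_norm=1.0` is the starting point suggested above:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()
# Rescales gradients in place so their total norm is at most max_norm;
# returns the norm they had *before* clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Comparing `total_norm` against `max_norm` over training is a cheap way to see how often clipping is actually firing, which helps when tuning the threshold.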