Using the Genomic Bottleneck to Compress Models

This guide explains the internals of torchGB.

The torch.distributed package is used in torchGB to parallelize the training process across multiple GPUs. This allows for efficient use of hardware resources and can significantly speed up training times.

To enable distributed training, you need to initialize the multiprocessing environment using dist.init_process_group(). You can choose from various backends such as nccl, gloo, or mpi.

After initialization, you can create a DistributedDataParallel (DDP) wrapper around your model using DistributedDataParallel(model, device_ids=[rank], output_device=rank).

Here’s an example code snippet that demonstrates how to initialize the distributed environment and wrap your model with DDP:

Initializing distributed environment and wrapping model with DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

model = GPT(**experiment_config["model"]).to(rank)
model = DDP(model, device_ids=[rank], output_device=rank)

The genomic bottleneck in torchGB is responsible for parallelizing the training of g-nets across multiple GPUs. To achieve this, we use the predict_weights() method of the GenomicBottleneck class to compute the weights of each g-net. The genomic bottleneck in torchGB is responsible for parallelizing the training of g-nets across multiple GPUs. To achieve this, we use the predict_weights() method of the GenomicBottleneck class to compute the weights of each g-net. This method implicitly updates the model weights with the g-net predictions.

Here’s an example code snippet that demonstrates how to call predict_weights() and propagate gradients through the g-nets:

Calling predict_weights() and propagating gradients

gnets.zero_grad() # Zero g-net gradients
gnets.predict_weights() # Compute p-net weights using g-nets

loss.backward() # Backpropagate through p-net
gnets.backward() # Backpropagate through g-nets

optimizer.step() # Update p-net parameters
gnets.step() # Update g-net parameters

How g-nets are distributed across MPI ranks:

The distribution of g-nets is handled internally by the GenomicBottleneck class. During initialization, the GenomicBottleneck class (see src/torchGB/core.py) analyzes the provided model and creates a set of g-nets for specific layers within the model. This mapping of g-nets to layers is stored in the gnetdict attribute of the GenomicBottleneck class. Each entry in the gnetdict corresponds to a layer’s parameters and is associated with a GNetLayer object (also in src/torchGB/core.py). The GNetLayer object stores important information, including the MPI rank (rank attribute) where the g-net for that layer resides.

The GenomicBottleneck class uses torch.distributed to manage the distribution of g-nets across different ranks. When methods like zero_grad(), backward(), and step() are called on the GenomicBottleneck instance, they internally check the rank attribute of each GNetLayer in the gnetdict. Operations are performed only on the g-nets residing on the current MPI rank. You can see examples of this logic in the implementations of zero_grad(), get_num_params_gnet(), step(), and load() within src/torchGB/core.py.

The register_gnet_type function in src/torchGB/core.py is used to associate specific layer types (e.g., nn.Linear, nn.Conv2d) with initialization and build functions for their corresponding g-nets. This mechanism allows the GenomicBottleneck class to create appropriate g-nets for different types of layers in the model.

For more details on the implementation, refer to the source code, specifically src/torchGB/core.py, src/torchGB/gnet.py, and layer-specific files under src/torchGB/layers. The docstrings and comments within these files provide further insights into the internal workings of g-net distribution and management. The __repr__ method in src/torchGB/core.py also offers a way to print information about the created g-nets and their associated parameters.

To use torch.distributed with torchrun, you need to launch your training script using the --nproc_per_node argument. This will enable distributed training across multiple GPUs.

Here’s an example code snippet that demonstrates how to launch a training script with torchrun:

Launching training script with torchrun

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train_llm_gnet_small.py \
--gpus 1,2,3,4 --seed 42 --language en --batchsize 36 \
--name test --no_commit --log_level DEBUG

By following these steps and using the provided code snippets, you can efficiently parallelize your training process with torchGB.

Tiling of large weight matrices

The tiling/slicing of large weight matrices is implemented for different PyTorch layer types in the src/torchGB/layers directory. Specifically, the conv_gnet.py, attn_gnet.py, and linear_gnet.py files contain the implementation.

In all files, the build_<layer-type>_gnet_output functions are used to compute the output of the g-net for each layer type. These functions take the following inputs:

name: The type of layer (e.g., “conv2d” or “linear”)
param: The weights and bias of the original layer
weights: The weights of the corresponding g-net
tile_shape: A tuple specifying the tile size for each dimension

Inside these functions, the following steps are performed:

1. Compute tile dimensions: The number of tiles in each dimension is computed using the ceiling function (math.ceil) to ensure that the entire weight matrix is covered.

2. Rebuild the weight matrix: For convolutional layers, the build_4d_kernel function is used to reshape the weights into a 4D tensor with the specified tile shape. The resulting tensor is then cut to match the original layer’s output shape using the cut_matrix function. For attention layers, we use the tile_matrix function from the src/torchGB/utils.py file to tile the weight matrix along its rows. Specifically, given a 3x1 tiling (i.e., row_size=3, col_size=1), the input weight matrix is reshaped into tiles of size 3x1, and then swapped to have shape (n, 3, 1). The resulting tensor has shape (n, 3, 1) where n is the number of columns in the original weight matrix. For example, if we have a 12x8 weight matrix, the tile_matrix function would split it into 4 tiles of size 3x1 along its rows:

3. Return the sliced g-net weights: The sliced g-net weights are returned as the final result of the computation.

Here’s an excerpt from the conv_gnet.py file showing this implementation:

Building the convolutional g-net output.

def build_conv2d_gnet_output(name: str, param: Tensor, weights: Tensor, tile_shape) -> Tensor:
    num_row_tiles = math.ceil(param.shape[0]/tile_shape[0])
    num_col_tiles = math.ceil(param.shape[1]/tile_shape[1])

    shape = (num_row_tiles*tile_shape[0],
             num_col_tiles*tile_shape[1],
             param.shape[2], param.shape[3])

    new_weights = build_4d_kernel(weights, shape)
    new_weights = cut_matrix(new_weights, param.shape)
    return new_weights