XLA (Accelerated Linear Algebra) is a domain-specific compiler for optimizing deep learning computations, particularly those related to linear algebra, such as matrix multiplications.

XLA was originally designed with TensorFlow’s computational graph in mind. Without XLA, TensorFlow’s graph executor runs the graph much like an interpreter: each operation is executed individually by launching a precompiled GPU kernel for it.

With XLA, the TensorFlow graph is compiled into a sequence of GPU kernels generated specifically for your model. The computation can then run as a whole rather than one component operation at a time. Because the kernels are specific to your model, the compiler can exploit model-specific information for optimizations that generic precompiled kernels cannot offer.

XLA achieves this through the notion of fusion: the component operations are fused into a single GPU kernel launch. Besides reducing the number of kernel launches, fusion also eliminates the need to store intermediate values in memory. These values are instead streamed directly to the operations that use them while staying in GPU registers.

This memory optimization is of utmost importance, since memory, and memory bandwidth in particular, is typically a scarce resource in hardware accelerators for machine learning.
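
To make the idea of fusion concrete, here is a minimal sketch using JAX, one of the frontends that compiles through XLA. The function name `scaled_sum` and the inputs are made up for illustration. The function consists of three logical operations (multiply, add, reduce); under `jax.jit` the whole function is handed to XLA, which is then free to fuse them rather than materialize the intermediate `x + y * z`. Whether everything actually ends up in one kernel depends on the backend and the XLA version, so treat this as an illustration rather than a guarantee.

```python
import jax
import jax.numpy as jnp


def scaled_sum(x, y, z):
    # Three logical operations: elementwise multiply, elementwise add, reduction.
    return jnp.sum(x + y * z)


# jax.jit compiles the whole function with XLA instead of dispatching each
# operation separately, which gives the compiler the chance to fuse them.
scaled_sum_jit = jax.jit(scaled_sum)

x = jnp.arange(4.0)   # [0., 1., 2., 3.]
y = jnp.ones(4)       # [1., 1., 1., 1.]
z = jnp.full(4, 2.0)  # [2., 2., 2., 2.]

eager_result = scaled_sum(x, y, z)    # operation-by-operation dispatch
jit_result = scaled_sum_jit(x, y, z)  # compiled as one XLA computation

print(float(eager_result), float(jit_result))
```

Both calls compute the same value; the difference lies only in how the work is scheduled on the device.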

While XLA was originally designed for TensorFlow, it can nowadays also be used from JAX (also by Google) and from PyTorch.
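
As a rough sketch of what using XLA from JAX looks like, the snippet below lowers and compiles a small function ahead of time with JAX's `lower` and `compile` stages. The function `affine_relu` and its inputs are hypothetical. The lowered text is the intermediate representation that gets handed to the XLA compiler; its exact contents vary between JAX versions, so the example only inspects it loosely.

```python
import jax
import jax.numpy as jnp


def affine_relu(w, x, b):
    # A small linear-algebra computation: matrix-vector product, bias, ReLU.
    return jnp.maximum(jnp.dot(w, x) + b, 0.0)


w = jnp.ones((2, 3))
x = jnp.ones(3)
b = jnp.array([-5.0, 1.0])

# Stage 1: trace the Python function and lower it to a compiler IR.
lowered = jax.jit(affine_relu).lower(w, x, b)
ir_text = lowered.as_text()

# Stage 2: let XLA compile the lowered program for the current backend.
compiled = lowered.compile()

# The compiled executable is called with arrays of the same shapes and dtypes.
out = compiled(w, x, b)
print(out)
```

In everyday use, calling `jax.jit(affine_relu)(w, x, b)` performs the same steps implicitly on the first call and caches the result.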
