Profiling device memory#
Note
June 2025 update: we recommend using XProf
profiling for device memory analysis. After taking a
profile, open the memory_viewer
tab of the Tensorboard profiler for more
detailed and understandable device memory usage.
The JAX device memory profiler allows us to explore how and why JAX programs are using GPU or TPU memory. For example, it can be used to:
Figure out which arrays and executables are in GPU memory at a given time, or
Track down memory leaks.
Installation#
The JAX device memory profiler emits output that can be interpreted using
pprof (google/pprof). Start by installing pprof
,
by following its
installation instructions.
At the time of writing, installing pprof
requires first installing
Go of version 1.16+,
Graphviz, and then running
go install github.com/google/pprof@latest
which installs pprof
as $GOPATH/bin/pprof
, where GOPATH
defaults to
~/go
.
Note
The version of pprof
from google/pprof is not the same as
the older tool of the same name distributed as part of the gperftools
package.
The gperftools
version of pprof
will not work with JAX.
Understanding how a JAX program is using GPU or TPU memory#
A common use of the device memory profiler is to figure out why a JAX program is using a large amount of GPU or TPU memory, for example if trying to debug an out-of-memory problem.
To capture a device memory profile to disk, use
jax.profiler.save_device_memory_profile()
. For example, consider the
following Python program:
import jax
import jax.numpy as jnp
import jax.profiler
def func1(x):
return jnp.tile(x, 10) * 0.5
def func2(x):
y = func1(x)
return y, jnp.tile(x, 10) + 1
x = jax.random.normal(jax.random.key(42), (1000, 1000))
y, z = func2(x)
z.block_until_ready()
jax.profiler.save_device_memory_profile("memory.prof")
If we first run the program above and then execute
pprof --web memory.prof
pprof
opens a web browser containing the following visualization of the device
memory profile in callgraph format:
The callgraph is a visualization of
the Python stack at the point the allocation of each live buffer was made.
For example, in this specific case, the visualization shows that
func2
and its callees were responsible for allocating 76.30MB, of which
38.15MB was allocated inside the call from func1
to func2
.
For more information about how to interpret callgraph visualizations, see the
pprof documentation.
Functions compiled with jax.jit()
are opaque to the device memory profiler.
That is, any memory allocated inside a jit
-compiled function will be
attributed to the function as a whole.
In the example, the call to block_until_ready()
is to ensure that func2
completes before the device memory profile is collected. See
Asynchronous dispatch for more details.
Debugging memory leaks#
We can also use the JAX device memory profiler to track down memory leaks by using
pprof
to visualize the change in memory usage between two device memory profiles
taken at different times. For example, consider the following program which
accumulates JAX arrays into a constantly-growing Python list.
import jax
import jax.numpy as jnp
import jax.profiler
def afunction():
return jax.random.normal(jax.random.key(77), (1000000,))
z = afunction()
def anotherfunc():
arrays = []
for i in range(1, 10):
x = jax.random.normal(jax.random.key(42), (i, 10000))
arrays.append(x)
x.block_until_ready()
jax.profiler.save_device_memory_profile(f"memory{i}.prof")
anotherfunc()
If we simply visualize the device memory profile at the end of execution
(memory9.prof
), it may not be obvious that each iteration of the loop in
anotherfunc
accumulates more device memory allocations:
pprof --web memory9.prof
The large but fixed allocation inside afunction
dominates the profile but does
not grow over time.
By using pprof
’s
--diff_base
feature to visualize the change in memory usage
across loop iterations, we can identify why the memory usage of the
program increases over time:
pprof --web --diff_base memory1.prof memory9.prof
The visualization shows that the memory growth can be attributed to the call to
normal
inside anotherfunc
.