GPU & Deep Learning
Bertha has NVIDIA GPUs available for deep learning and GPU-accelerated computing. This page covers how to use them from Python (PyTorch, TensorFlow) and R (torch).
You do not need to install CUDA yourself. PyTorch, TensorFlow, and R’s torch package all bundle their own CUDA runtime libraries. Just install the framework and it works.
Checking GPU availability
Before getting started, verify the GPUs are accessible:
```bash
# Quick check — shows GPU model, driver version, and current usage
nvidia-smi

# Or use the bertha dashboard
bertha
```

You can also monitor GPU usage interactively with `nvtop` (pre-installed).
PyTorch
Installation
Use uv to install PyTorch with GPU support:
```bash
# In a project
uv add torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or ad-hoc
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

The `cu121` suffix means PyTorch comes bundled with CUDA 12.1 runtime libraries. This is the recommended variant — it works regardless of which system CUDA version (if any) is installed.
Verify GPU access
```python
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
```

If `torch.cuda.is_available()` returns `False`, see Troubleshooting below.
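Once the check passes, the usual pattern is to select a device once and create tensors on it. A minimal sketch, written so it also runs on machines without a GPU:

```python
import torch

# Standard device-selection pattern: use the GPU when present, else fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(3, 3, device=device)
y = x @ x  # the matrix multiply runs on whichever device holds the tensors
print("Computed on:", y.device)
```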
TensorFlow
Installation
```bash
# Install TensorFlow with bundled CUDA runtime
uv pip install "tensorflow[and-cuda]"
```

The `[and-cuda]` extra includes CUDA runtime libraries, so no system CUDA is needed.
Verify GPU access
```python
import tensorflow as tf
print("GPU available:", tf.config.list_physical_devices('GPU'))
```

R torch
The R torch package also bundles its own CUDA libraries:
```r
install.packages("torch")
torch::torch_is_installed()
torch::cuda_is_available()
```

If CUDA is not detected, torch may need a specific CUDA toolkit version available on the system. Contact the admin if you run into issues.
GPU memory management
Bertha’s GPUs are shared between users. Be mindful of memory usage:
```python
# PyTorch — check current memory usage
import torch

print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

# Free unused cached memory
torch.cuda.empty_cache()
```

```python
# TensorFlow — allow memory growth instead of grabbing all GPU memory
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Use `nvtop` or the Bertha dashboard to check GPU utilization and memory usage before starting large jobs. If someone else is using most of the GPU memory, coordinate or wait.
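To measure how much memory a specific piece of work needs, PyTorch's peak-memory counters can be reset before the work and read afterwards. A sketch with an arbitrary workload, guarded so it is a no-op on CPU-only machines:

```python
import torch

# Measure the peak GPU memory used by one operation.
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak allocated: {peak_gb:.2f} GB")
```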
Troubleshooting
torch.cuda.is_available() returns False
- Check that the NVIDIA driver is loaded: run `nvidia-smi`. If this fails, the system may need a reboot after a driver update.
- Make sure you installed the CUDA-enabled variant of PyTorch (with `cu121` or similar in the index URL). The default `pip install torch` installs CPU-only.
- Verify inside Python:

```python
import torch
print(torch.version.cuda)  # Should show e.g. "12.1", not None
```
Out of memory errors
- Check who’s using the GPU: `nvidia-smi` or `bertha -d`
- Free cached memory: `torch.cuda.empty_cache()`
- Reduce batch size in your training loop
- Use mixed precision training (`torch.amp`) to halve memory usage
TensorFlow not finding GPU
```python
# Check what TensorFlow sees
import tensorflow as tf
print(tf.config.list_physical_devices())
```

If no GPU is listed, ensure you installed `tensorflow[and-cuda]` (not just `tensorflow`).
System Administration
The rest of this page covers driver installation, CUDA toolkit management, and system maintenance. Regular users don’t need this section.
NVIDIA driver installation
```bash
# Check GPU and recommended driver
lspci | grep -i nvidia
ubuntu-drivers devices

# Install recommended driver
sudo apt install nvidia-driver-545

# Reboot required after installation
sudo reboot
```

Driver updates and reboots
NVIDIA drivers are backwards-compatible with CUDA versions, so updating drivers does not break existing PyTorch/TensorFlow installations. However, after a driver update via apt, GPUs will not be detected until the next reboot (the old kernel module is still loaded).
Best practice: time system updates with planned reboots. There is no need to pin driver versions as long as updates include a reboot.
Unattended upgrades
To prevent NVIDIA driver updates from happening automatically (which would leave GPUs unusable until reboot), NVIDIA and CUDA packages are blacklisted in /etc/apt/apt.conf.d/50unattended-upgrades. This means driver updates must be applied manually.
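The blacklist stanza in `/etc/apt/apt.conf.d/50unattended-upgrades` looks roughly like this; the exact package patterns on Bertha may differ:

```
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
    "cuda-";
};
```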
CUDA toolkit management
Most users don’t need a system CUDA installation — frameworks bundle their own. System CUDA is only needed for:
- Custom CUDA C/C++ code that requires `nvcc`
- R torch when it can’t find a compatible bundled CUDA version
If needed, multiple CUDA toolkit versions can be installed side-by-side:
```bash
# Add NVIDIA CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install specific toolkit versions (no driver)
sudo apt install cuda-toolkit-11-8
sudo apt install cuda-toolkit-12-1

# These install to /usr/local/cuda-11.8/, /usr/local/cuda-12.1/, etc.
```

Environment modules can be used to switch between versions:
```bash
module load cuda/12.1
nvcc --version
```

The effort to provide multiple CUDA versions via environment modules has been largely superseded by frameworks bundling their own CUDA runtime. Module-based CUDA is maintained for edge cases but is not the primary approach.
Driver → CUDA compatibility reference
| NVIDIA Driver | Supports CUDA Runtimes | PyTorch Options |
|---|---|---|
| 520+ | CUDA 11.8+ | cu118 |
| 530+ | CUDA 12.1+ | cu118, cu121 |
| 545+ | CUDA 12.3+ | cu118, cu121, cu126 |
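As a quick sanity check against this table, you can compare the CUDA runtime bundled with PyTorch against the maximum CUDA version the driver reports. A sketch; it assumes `nvidia-smi` is on the PATH and skips that step otherwise:

```python
import shutil
import subprocess

import torch

# CUDA runtime bundled with this PyTorch build (None for CPU-only wheels)
print("PyTorch CUDA runtime:", torch.version.cuda)

# nvidia-smi's header reports the newest CUDA version the driver supports,
# e.g. "CUDA Version: 12.3"
if shutil.which("nvidia-smi"):
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "CUDA Version" in line:
            print("Driver supports up to:", line.split("CUDA Version:")[1].split()[0])
```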