Recently I got a GPU and started setting up a deep learning environment. Most guides I found focus on Ubuntu, and for a while I thought Debian might not work. Fortunately, I successfully installed CUDA and TensorRT on my WSL2 Debian setup. Debian is better!

In this article, I set up an environment that runs YOLOv5 via ONNXRuntime (ORT) with both the CUDA and TensorRT execution providers. Here's a quick overview of the setup:

  • Host OS: Windows 11
  • WSL2 OS: Debian 12
  • GPU: RTX 5060 Ti
  • CUDA Version: 12.9
  • ORT version: 1.22

1. Install CUDA

It's recommended to install the GPU driver on the Windows host (via this page) rather than inside WSL. Once it's installed, run nvidia-smi inside WSL to confirm the GPU is visible.
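
For example (the exact driver and CUDA versions reported will depend on your setup):

# inside the WSL2 Debian shell; the Windows driver is exposed to WSL automatically
nvidia-smi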

2. Install Dependencies via Conda

ORT 1.22's CUDA provider depends on several shared libraries. Run ldd libonnxruntime_providers_cuda.so to check them:

Dependency    Version
cublas        12
cudnn         9
curand        10
cufft         11
cudart        12
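
To see which of these are actually unresolved on your system, you can filter the ldd output for missing entries (run from the directory where the ORT provider libraries were extracted):

# list only the dependencies that cannot be resolved yet
ldd libonnxruntime_providers_cuda.so | grep "not found"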

Then install them.

# create and activate a conda environment
conda create -n cudaenv python=3.10
conda activate cudaenv

# install dependencies
conda install libcublas=12 cudnn=9 libcurand=10 libcufft=11 cuda-cudart=12

See conda-forge packages for more details or to search for other versions.

All libraries end up in /home/[Username]/miniforge3/envs/cudaenv/lib. Add this folder to LD_LIBRARY_PATH as follows.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/[Username]/miniforge3/envs/cudaenv/lib
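
The export only lasts for the current shell, so you may want to persist it, for example by appending it to ~/.bashrc (keep the single quotes so the variable is expanded when the shell starts; replace [Username] as before):

# make the path available in future shells as well
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/[Username]/miniforge3/envs/cudaenv/lib' >> ~/.bashrc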

At this point, ORT should be able to run inference using CUDA.

3. Install TensorRT

We need additional libraries to enable the TensorRT provider. The output of ldd libonnxruntime_providers_tensorrt.so indicated that libnvinfer.so.10 was missing.

TensorRT can be downloaded from here. In my case, I needed to download TensorRT 10. Make sure the version is compatible with the CUDA version.

After downloading and extracting the archive, add its library path to LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/[Username]/workspace/libs/TensorRT-10.7.0.23/targets/x86_64-linux-gnu/lib
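
Re-running the earlier ldd check is a quick way to confirm that libnvinfer.so.10 now resolves:

# libnvinfer should no longer show up as "not found"
ldd libonnxruntime_providers_tensorrt.so | grep libnvinfer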

4. Simple Benchmark

The benchmarking code is adapted from this repository, an ORT implementation of YOLOv5. Use the following code to choose the inference backend:

// select the backend via `provider`: 0 = CPU, 1 = CUDA, 2 = TensorRT
std::vector<std::string> available_providers = Ort::GetAvailableProviders();
auto cuda_available = std::find(available_providers.begin(), available_providers.end(), "CUDAExecutionProvider");
auto trt_available = std::find(available_providers.begin(), available_providers.end(), "TensorrtExecutionProvider");
OrtCUDAProviderOptions cuda_options{};
OrtTensorRTProviderOptions trt_options{};
if (provider != 0 && (cuda_available == available_providers.end()))
{
    // a GPU backend was requested but this ORT build has no CUDA provider
    std::cout << "GPU provider not available, falling back to CPU\n";
    session_options.SetIntraOpNumThreads(threads);
    std::cout << "Inference device: CPU\n";
}
else if (provider == 1 && (cuda_available != available_providers.end()))
{
    std::cout << "Inference device: GPU CUDA\n";
    session_options.AppendExecutionProvider_CUDA(cuda_options);
}
else if (provider == 2 && (trt_available != available_providers.end()))
{
    std::cout << "Inference device: GPU TRT\n";
    session_options.AppendExecutionProvider_TensorRT(trt_options);
}
else
{
    // provider == 0, or TensorRT was requested but is not available
    session_options.SetIntraOpNumThreads(threads);
    std::cout << "Inference device: CPU\n";
}

The test measures inference time by running YOLOv5 on a 1216x1216 image 15 times.

TensorRT is the fastest, as expected.

Detection Results

Tips

If you find the full TensorRT download too large, an alternative is to install it via pip (about 4 GB):

pip install --upgrade tensorrt
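
Note that the pip package installs the TensorRT shared libraries inside the Python environment, so ORT may still need LD_LIBRARY_PATH pointed at them. Assuming pip was run inside the activated cudaenv environment, you can locate them like this (the exact directory varies by version):

# find where pip placed libnvinfer inside the active conda environment
find "$CONDA_PREFIX" -name "libnvinfer.so*" 2>/dev/null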