Eigen GPU Module (unsupported/Eigen/GPU)

GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only; link against the CUDA runtime, cuBLAS, and cuSOLVER.

Why this module

Eigen is the linear algebra foundation for a large ecosystem of C++ projects in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, Open3D), scientific computing (Ceres, Stan), and beyond. Many of these projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations without dropping down to raw CUDA library APIs. Third-party projects like EigenCuda and cholespy exist specifically to fill this gap, and downstream projects like Ceres and COLMAP have open requests for GPU-accelerated solvers through Eigen.

The unsupported/Eigen/GPU module aims to close this gap: Existing Eigen users should be able to move performance-critical dense linear algebra to the GPU with minimal code changes and without learning CUDA library APIs directly.

Design philosophy

CPU and GPU coexist. There is no global compile-time switch that replaces CPU implementations (unlike EIGEN_USE_LAPACKE). Users choose GPU solvers explicitly -- gpu::LLT<double> vs LLT<MatrixXd> -- and both coexist in the same binary. This also lets users keep the factored matrix on device across multiple solves, something impossible with compile-time replacement.

Familiar syntax. GPU operations use the same expression patterns as CPU Eigen. Here is a side-by-side comparison:

// ---- CPU (Eigen) ----               // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense>                  #define EIGEN_USE_GPU
                                        #include <unsupported/Eigen/GPU>

MatrixXd A = ...;                       auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...;                       auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

MatrixXd C = A * B;                     gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B);          gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);

                                        MatrixXd X = d_X.toHost();

The GPU version reads like CPU Eigen with explicit upload/download. operator* dispatches to cuBLAS GEMM, .llt().solve() dispatches to cuSOLVER potrf + potrs. Unsupported expressions are compile errors.

Standalone module. unsupported/Eigen/GPU does not modify or depend on Eigen's Core expression template system (MatrixBase, CwiseBinaryOp, etc.). DeviceMatrix is not an Eigen expression type and does not inherit from MatrixBase. The expression layer is a thin compile-time dispatch where every supported expression maps to a single NVIDIA library call. There is no coefficient-level evaluation, lazy fusion, or packet operations.

Explicit over implicit. Host-device transfers, stream management, and library handle lifetimes are visible in the API. There are no hidden allocations or synchronizations except where documented (e.g., toHost() must synchronize to deliver data to the host).
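
As a minimal sketch of what "explicit" means in practice, here is a pipeline with every transfer and stream visible. It uses only the fromHost(matrix, stream), device(ctx), and toHostAsync(stream) signatures listed in the reference below:

gpu::Context ctx;                                   // owns the stream and library handles
auto d_A = gpu::DeviceMatrix<double>::fromHost(A, ctx.stream());  // H2D upload (syncs)
auto d_B = gpu::DeviceMatrix<double>::fromHost(B, ctx.stream());

gpu::DeviceMatrix<double> d_C(A.rows(), B.cols());  // uninitialized device allocation
d_C.device(ctx) = d_A * d_B;                        // async GEMM on ctx's stream

auto transfer = d_C.toHostAsync(ctx.stream());      // async D2H download
MatrixXd C = transfer.get();                        // the only host-side block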

Key concepts

DeviceMatrix<Scalar>

A typed RAII wrapper for a dense column-major matrix in GPU device memory. This is the GPU counterpart of Eigen's MatrixX<Scalar>. A vector is simply a DeviceMatrix with one column.

// Upload from host
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);

// Allocate uninitialized
gpu::DeviceMatrix<double> d_C(m, n);

// Download to host
MatrixXd C = d_C.toHost();

// Async download (returns a future)
auto transfer = d_C.toHostAsync();
// ... do other work ...
MatrixXd C = transfer.get();

DeviceMatrix supports expression methods that mirror Eigen's API: adjoint(), transpose(), triangularView<UpLo>(), selfadjointView<UpLo>(), llt(), lu(). These return lightweight expression objects that are evaluated when assigned.
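
A sketch of the laziness, assuming the expression object is only consumed while its operands are still alive:

auto expr = d_A.adjoint() * d_B;         // lightweight expression object; no GPU work yet
gpu::DeviceMatrix<double> d_C = expr;    // assignment evaluates: a single cublasXgemm call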

gpu::Context

Every GPU operation needs a CUDA stream and library handles (cuBLAS, cuSOLVER). gpu::Context bundles these together.

For simple usage, you don't need to create one -- a per-thread default context is created lazily on first use:

// These use the thread-local default context automatically
d_C = d_A * d_B;
d_X = d_A.llt().solve(d_B);

For concurrent multi-stream execution, create explicit contexts:

gpu::Context ctx1, ctx2;
d_C1.device(ctx1) = d_A1 * d_B1;   // runs on stream 1
d_C2.device(ctx2) = d_A2 * d_B2;   // runs on stream 2 (concurrently)

Usage

Matrix operations (cuBLAS)

auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
gpu::DeviceMatrix<double> d_C = d_A * d_B;
d_C = d_A.adjoint() * d_B;
d_C = d_A * d_B.transpose();

// Scaled and accumulated
d_C += 2.0 * d_A * d_B;             // alpha=2, beta=1
d_C.device(ctx) -= d_A * d_B;       // alpha=-1, beta=1 (requires explicit context)

// Triangular solve (TRSM)
d_X = d_A.triangularView<Lower>().solve(d_B);

// Symmetric/Hermitian multiply (SYMM/HEMM)
d_C = d_A.selfadjointView<Lower>() * d_B;

// Rank-k update (SYRK/HERK)
d_C.selfadjointView<Lower>().rankUpdate(d_A);  // C += A * A^H

Dense solvers (cuSOLVER)

One-shot expression syntax -- Convenient, re-factorizes each time:

// Cholesky solve (potrf + potrs)
d_X = d_A.llt().solve(d_B);

// LU solve (getrf + getrs)
d_Y = d_A.lu().solve(d_B);

Cached factorization -- Factor once, solve many times:

gpu::LLT<double> llt;
llt.compute(d_A);                    // factorize (async)
if (llt.info() != Success) { ... }   // lazy sync on first info() call
auto d_X1 = llt.solve(d_B1);        // reuses factor (async)
auto d_X2 = llt.solve(d_B2);        // reuses factor (async)
MatrixXd X2 = d_X2.toHost();

// LU with transpose solve
gpu::LU<double> lu;
lu.compute(d_A);
auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans);           // A^T Y = B

// QR solve (overdetermined least squares)
gpu::QR<double> qr(A);                // host matrix input
MatrixXd X = qr.solve(B);           // Q^H * B via ormqr, then trsm on R

// SVD
gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV);
VectorXd S = svd.singularValues();
MatrixXd U = svd.matrixU();
MatrixXd VT = svd.matrixVT();
MatrixXd X = svd.solve(B);          // pseudoinverse solve

// SVD: device-side views (no D2H transfer; svd must outlive the views)
auto d_S = svd.d_singularValues();   // DeviceMatrix view of singular values
auto d_U = svd.d_matrixU();          // DeviceMatrix view of U
auto d_VT = svd.d_matrixVT();        // DeviceMatrix view of V^T

// Self-adjoint eigenvalue decomposition
gpu::SelfAdjointEigenSolver<double> es(A);
VectorXd eigenvals = es.eigenvalues();
MatrixXd eigenvecs = es.eigenvectors();
auto d_W = es.d_eigenvalues();        // DeviceMatrix view of eigenvalues
auto d_V = es.d_eigenvectors();       // DeviceMatrix view of eigenvectors

The cached API keeps the factored matrix on device, avoiding redundant host-device transfers and re-factorizations. The d_* accessors return non-owning DeviceMatrix views so downstream cuBLAS/cuSOLVER work can chain without round-tripping through host memory.
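
For example, projecting a right-hand side onto the left singular vectors can stay entirely on device. A sketch, assuming d_B is a DeviceMatrix<double> with a matching row count:

gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV);
auto d_U = svd.d_matrixU();                              // non-owning view into the solver
gpu::DeviceMatrix<double> d_proj = d_U.adjoint() * d_B;  // cuBLAS GEMM, no host round-trip
MatrixXd proj = d_proj.toHost();                         // svd must stay alive until here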

Precision control

GEMM dispatch uses cublasXgemm (type-specific Sgemm/Dgemm/Cgemm/Zgemm). cuBLAS may internally use tensor cores depending on the GPU architecture, matrix dimensions, and cuBLAS math mode settings. No Eigen-specific macros control this; use cublasSetMathMode() on the context's cuBLAS handle (or the NVIDIA_TF32_OVERRIDE environment variable) to configure tensor core behavior if needed.
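
For example, to opt float GEMMs into TF32 tensor-core math on Ampere-class GPUs, set the math mode on the context's handle directly. A sketch using the plain cuBLAS API (not wrapped by this module):

gpu::Context ctx;
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);  // allow TF32
d_C.device(ctx) = d_A * d_B;                        // Sgemm may now use tensor cores
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_DEFAULT_MATH);         // restore default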

Stream control and async execution

Operations are asynchronous by default. The compute-solve chain runs without host synchronization until you need a result on the host:

fromHost(A) --sync-->  compute() --async-->  solve() --async-->  toHost()
   H2D                  potrf                 potrs                D2H
                                                                   sync

Mandatory sync points:

  • fromHost() -- Synchronizes to complete the upload before returning
  • toHost() / HostTransfer::get() -- Must deliver data to host
  • info() -- Must read the factorization status

Cross-stream safety is automatic. DeviceMatrix tracks write completion via CUDA events. When a matrix written on stream A is read on stream B, the module automatically inserts cudaStreamWaitEvent. Same-stream operations skip the wait (CUDA guarantees in-order execution within a stream).
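
A sketch of what this buys you, assuming compatible DeviceMatrix operands d_A through d_E:

gpu::Context ctx1, ctx2;
d_C.device(ctx1) = d_A * d_B;   // write d_C on ctx1's stream
d_D.device(ctx2) = d_C * d_E;   // read d_C on ctx2's stream: the module inserts
                                // cudaStreamWaitEvent on d_C's write event first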

Lifetime of cached factorizations. A gpu::LLT / gpu::LU object owns its CUDA stream, library handle, and the cached factor on device. Destroying the object while a solve() it issued is still in flight is safe, but it is not actually asynchronous: cudaStreamDestroy returns immediately, yet the destructor of the cached factor calls cudaFree, which is fully device-synchronous and stalls until the in-flight potrs/getrs retires. For genuine async pipelining, keep the factorization object alive until you have drained its results (e.g. via toHost(), or by binding consumption to an explicit gpu::Context that outlives both producer and consumer):

gpu::LLT<double> llt(d_A);             // factor stays on device
auto d_X = llt.solve(d_B);
auto h_x = d_X.toHostAsync(llt.stream());
h_x.get();                             // sync: factor + result complete
// llt may now be destroyed without stalling the device

Reference

Supported scalar types

float, double, std::complex<float>, std::complex<double>.
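
Complex scalars follow the same expression patterns, with adjoint() mapping to ConjTrans. For instance, a Gram matrix of a complex operator is a single Zgemm (a sketch):

using Cplx = std::complex<double>;
MatrixXcd A = MatrixXcd::Random(512, 128);
auto d_A = gpu::DeviceMatrix<Cplx>::fromHost(A);
gpu::DeviceMatrix<Cplx> d_G = d_A.adjoint() * d_A;   // cublasZgemm with transA=C
MatrixXcd G = d_G.toHost();                          // G = A^H A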

Expression -> library call mapping

DeviceMatrix expression                 Library call                Parameters
C = A * B                               cublasXgemm                 transA=N, transB=N, alpha=1, beta=0
C = A.adjoint() * B                     cublasXgemm                 transA=C, transB=N
C = A.transpose() * B                   cublasXgemm                 transA=T, transB=N
C = A * B.adjoint()                     cublasXgemm                 transA=N, transB=C
C = A * B.transpose()                   cublasXgemm                 transA=N, transB=T
C = alpha * A * B                       cublasXgemm                 alpha from LHS
C = A * (alpha * B)                     cublasXgemm                 alpha from RHS
C += A * B                              cublasXgemm                 alpha=1, beta=1
C.device(ctx) -= A * B                  cublasXgemm                 alpha=-1, beta=1
X = A.llt().solve(B)                    cusolverDnXpotrf + Xpotrs   uplo, n, nrhs
X = A.llt<Upper>().solve(B)             same                        uplo=Upper
X = A.lu().solve(B)                     cusolverDnXgetrf + Xgetrs   n, nrhs
X = A.triangularView<L>().solve(B)      cublasXtrsm                 side=L, uplo, diag=NonUnit
C = A.selfadjointView<L>() * B          cublasXsymm / cublasXhemm   side=L, uplo
C.selfadjointView<L>().rankUpdate(A)    cublasXsyrk / cublasXherk   uplo, trans=N

DeviceMatrix<Scalar> API

Method                                   Sync?   Description
DeviceMatrix()                           --      Empty (0x0)
DeviceMatrix(rows, cols)                 --      Allocate uninitialized
fromHost(matrix, stream)                 yes     Upload from Eigen matrix
fromHostAsync(ptr, rows, cols, stream)   no      Async upload (caller manages lifetime)
adopt(ptr, rows, cols)                   --      Owning wrapper over a device pointer (caller relinquishes ownership)
view(ptr, rows, cols)                    --      Non-owning view over a device pointer (does not free on destruction)
toHost(stream)                           yes     Synchronous download
toHostAsync(stream)                      no      Returns HostTransfer future
clone(stream)                            no      Device-to-device deep copy
resize(rows, cols)                       --      Discard contents, reallocate
data()                                   --      Raw device pointer
rows(), cols()                           --      Dimensions
sizeInBytes()                            --      Total device allocation size in bytes
empty()                                  --      True if 0x0
adjoint()                                --      Adjoint view (GEMM ConjTrans)
transpose()                              --      Transpose view (GEMM Trans)
llt() / llt<UpLo>()                      --      Cholesky expression builder
lu()                                     --      LU expression builder
triangularView<UpLo>()                   --      Triangular view (TRSM)
selfadjointView<UpLo>()                  --      Self-adjoint view (SYMM, rankUpdate)
device(ctx)                              --      Assignment proxy bound to context

gpu::Context

Unified GPU execution context owning a CUDA stream and library handles.

gpu::Context()                                             // Creates dedicated stream + handles
static gpu::Context& threadLocal()                         // Per-thread default (lazy-created)

cudaStream_t       stream()
cublasHandle_t     cublasHandle()
cusolverDnHandle_t cusolverHandle()

Non-copyable, non-movable (owns library handles).
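
Since the raw handles are exposed, custom device work can share a context's stream and inherit its ordering. A sketch, where my_kernel, grid, and block are placeholders for user-supplied kernel code:

gpu::Context ctx;
d_C.device(ctx) = d_A * d_B;    // GEMM enqueued on ctx.stream()
my_kernel<<<grid, block, 0, ctx.stream()>>>(d_C.data(), d_C.rows(), d_C.cols());
// same stream => the kernel runs after the GEMM completes, no explicit sync needed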

gpu::LLT<Scalar, UpLo> API

GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.

Method                    Sync?      Description
gpu::LLT(A)               deferred   Construct and factorize from host matrix
compute(host_matrix)      deferred   Upload and factorize
compute(DeviceMatrix)     deferred   D2D copy and factorize
compute(DeviceMatrix&&)   deferred   Move-adopt and factorize (no copy)
solve(host_matrix)        yes        Solve, return host matrix
solve(DeviceMatrix)       no         Solve, return DeviceMatrix (async)
info()                    lazy       Syncs stream on first call, returns Success or NumericalIssue

gpu::LU<Scalar> API

GPU dense partial-pivoting LU via cuSOLVER. Same pattern as gpu::LLT, plus a gpu::GpuOp parameter on solve() (NoTrans, Trans, ConjTrans).

gpu::QR<Scalar> API

GPU dense QR decomposition via cuSOLVER (geqrf). Solve uses ormqr (apply Q^H) + trsm (back-substitute on R) -- Q is never formed explicitly.

Method                     Description
gpu::QR()                  Default construct, then call compute()
gpu::QR(A)                 Construct and factorize from host matrix
compute(A)                 Upload + factorize
compute(DeviceMatrix)      D2D copy + factorize
solve(host_matrix)         Solve, return host matrix (syncs)
solve(DeviceMatrix)        Solve, return DeviceMatrix (async)
matrixR()                  Download upper-triangular factor R to host (m >= n only)
info()                     Lazy sync
rows(), cols(), stream()   Dimensions and CUDA stream

gpu::SVD<Scalar> API

GPU dense SVD via cuSOLVER (gesvd). Supports thin, full, and values-only modes via Eigen's ComputeThinU | ComputeThinV, ComputeFullU | ComputeFullV, or 0 (values only).

Method                           Description
gpu::SVD()                       Default construct, then call compute()
gpu::SVD(A, options)             Construct and compute (options default: ComputeThinU | ComputeThinV)
compute(A, options)              Compute from host matrix
compute(DeviceMatrix, options)   Compute from device matrix
singularValues()                 Download singular values to host
matrixU()                        Download U to host (requires ComputeThinU or ComputeFullU)
matrixV()                        Download V to host (matches JacobiSVD::matrixV())
matrixVT()                       Download V^T to host (requires ComputeThinV or ComputeFullV)
d_singularValues()               DeviceMatrix view of singular values (zero-copy)
d_matrixU()                      DeviceMatrix view of U (zero-copy when m >= n)
d_matrixVT()                     DeviceMatrix view of V^T (zero-copy when m >= n)
solve(B)                         Pseudoinverse solve (returns host matrix)
solve(B, k)                      Truncated solve (top k singular triplets)
solve(B, lambda)                 Tikhonov-regularized solve
rank(threshold)                  Count singular values above threshold
info()                           Lazy sync
rows(), cols(), stream()         Dimensions and CUDA stream

Wide matrices (m < n) are handled by internally transposing via cuBLAS geam; in that case d_matrixU() and d_matrixVT() allocate an owning result via a single cublasXgeam adjoint pass. d_singularValues() is always zero-copy.
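
The solve overloads make regularized least squares one-liners. A sketch, assuming the truncation count is integral and the Tikhonov parameter is floating-point so the two overloads resolve as listed above:

gpu::SVD<double> svd(A);                            // thin U and V by default
MatrixXd X_trunc = svd.solve(B, 20);                // keep the top 20 singular triplets
MatrixXd X_ridge = svd.solve(B, 1e-6);              // Tikhonov-regularized pseudoinverse solve
int r = svd.rank(1e-10 * svd.singularValues()(0));  // numerical rank vs. a relative threshold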

gpu::SelfAdjointEigenSolver<Scalar> API

GPU symmetric/Hermitian eigenvalue decomposition via cuSOLVER (syevd).

Method                                 Description
gpu::SelfAdjointEigenSolver()          Default construct, then call compute()
gpu::SelfAdjointEigenSolver(A, mode)   Construct and compute (mode default: ComputeEigenvectors)
compute(A, mode)                       Compute from host matrix
compute(DeviceMatrix, mode)            Compute from device matrix
eigenvalues()                          Download eigenvalues to host (ascending order)
eigenvectors()                         Download eigenvectors to host (columns)
d_eigenvalues()                        DeviceMatrix view of eigenvalues (zero-copy)
d_eigenvectors()                       DeviceMatrix view of eigenvectors (zero-copy, requires ComputeEigenvectors)
info()                                 Lazy sync
rows(), cols(), stream()               Dimensions and CUDA stream

ComputeMode: gpu::SelfAdjointEigenSolver::EigenvaluesOnly or gpu::SelfAdjointEigenSolver::ComputeEigenvectors.

The d_* accessors return non-owning views into the solver's internal device buffers; the solver must outlive any view derived from it.
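
A values-only definiteness check, as a sketch:

gpu::SelfAdjointEigenSolver<double> es(
    A, gpu::SelfAdjointEigenSolver<double>::EigenvaluesOnly);
if (es.info() == Success) {
  VectorXd w = es.eigenvalues();   // ascending order
  bool posdef = w(0) > 0.0;        // smallest eigenvalue decides definiteness
}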

HostTransfer<Scalar> API

Future for async device-to-host transfer.

Method    Description
get()     Block until transfer completes, return host matrix reference. Idempotent.
ready()   Non-blocking poll
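
This lets host work overlap the download. A sketch, where do_other_host_work() is a placeholder:

auto transfer = d_C.toHostAsync();
while (!transfer.ready())
  do_other_host_work();              // poll; useful CPU work goes here
const MatrixXd& C = transfer.get();  // transfer already complete, returns immediately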

Aliasing

Unlike Eigen's Matrix, where omitting .noalias() makes a product evaluate into a temporary for safety, DeviceMatrix dispatches directly to NVIDIA library calls, which have no built-in aliasing protection. All operations are implicitly noalias: the caller must ensure operands don't alias the destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before dispatching to cuBLAS.
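
For an in-place-looking update, route through an explicit temporary. A sketch, assuming DeviceMatrix is move-assignable:

// WRONG: destination aliases an operand; the GEMM result is undefined
// d_A = d_A * d_B;

// RIGHT: materialize into a fresh matrix, then take ownership of it
gpu::DeviceMatrix<double> d_tmp = d_A * d_B;
d_A = std::move(d_tmp);            // assumed move assignment; frees d_A's old buffer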

Future work

  • Reassess host-input vs. device-input API surface. Each solver currently exposes both host-input (compute(MatrixXd), solve(MatrixXd)) and device-input (compute(DeviceMatrix), solve(DeviceMatrix)) overloads, plus host- and device-side accessors (matrixU() vs d_matrixU()). This eases migration from CPU Eigen but may invite accidental host ↔ device round-trips when users mix the two without realizing the cost. Revisit after the initial GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' hands; if the convenience overloads cause more confusion than they save, narrow toward a single explicit fromHost / toHost boundary.

File layout

File                 Depends on                           Contents
GpuSupport.h         <cuda_runtime.h>                     Error macro, DeviceBuffer, PinnedHostBuffer, cuda_data_type<>
DeviceMatrix.h       GpuSupport.h                         DeviceMatrix<>, HostTransfer<>
DeviceExpr.h         DeviceMatrix.h                       GEMM expression wrappers
DeviceBlasExpr.h     DeviceMatrix.h                       TRSM, SYMM, SYRK expression wrappers
DeviceSolverExpr.h   DeviceMatrix.h                       Solver expression wrappers (LLT, LU)
DeviceDispatch.h     all above                            All dispatch functions + Assignment
GpuContext.h         CuBlasSupport.h, CuSolverSupport.h   gpu::Context
CuBlasSupport.h      GpuSupport.h, <cublas_v2.h>          cuBLAS error macro, op/compute type maps
CuSolverSupport.h    GpuSupport.h, <cusolverDn.h>         cuSOLVER params, fill-mode mapping
GpuSolverContext.h   CuSolverSupport.h, CuBlasSupport.h   Shared solver context (stream, handles, scratch)
GpuLLT.h             GpuSolverContext.h                   Cached dense Cholesky factorization
GpuLU.h              GpuSolverContext.h                   Cached dense LU factorization
GpuQR.h              GpuSolverContext.h                   Dense QR decomposition
GpuSVD.h             GpuSolverContext.h                   Dense SVD decomposition
GpuEigenSolver.h     GpuSolverContext.h                   Self-adjoint eigenvalue decomposition

Building and testing

cmake -G Ninja -B build -S . \
  -DEIGEN_TEST_CUDA=ON \
  -DEIGEN_CUDA_COMPUTE_ARCH="70"

cmake --build build --target cublas cusolver_llt cusolver_lu \
  cusolver_qr cusolver_svd cusolver_eigen device_matrix
ctest --test-dir build -L gpu --output-on-failure

EIGEN_TEST_CUBLAS and EIGEN_TEST_CUSOLVER default to ON when CUDA is enabled (cuBLAS and cuSOLVER are part of the CUDA toolkit).