Eigen GPU Module (unsupported/Eigen/GPU)

GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only; link against the CUDA runtime, cuBLAS, and cuSOLVER.

Why this module

Eigen is the linear algebra foundation for a large ecosystem of C++ projects in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, Open3D), scientific computing (Ceres, Stan), and beyond. Many of these projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations without dropping down to raw CUDA library APIs. Third-party projects like EigenCuda and cholespy exist specifically to fill this gap, and downstream projects like Ceres and COLMAP have open requests for GPU-accelerated solvers through Eigen.

The unsupported/Eigen/GPU module aims to close this gap: Existing Eigen users should be able to move performance-critical dense linear algebra to the GPU with minimal code changes and without learning CUDA library APIs directly.

Design philosophy

CPU and GPU coexist. There is no global compile-time switch that replaces CPU implementations (unlike EIGEN_USE_LAPACKE). Users choose GPU solvers explicitly -- gpu::LLT<double> vs LLT<MatrixXd> -- and both coexist in the same binary. This also lets users keep the factored matrix on device across multiple solves, something impossible with compile-time replacement.

Familiar syntax. GPU operations use the same expression patterns as CPU Eigen. Here is a side-by-side comparison:

// ---- CPU (Eigen) ----               // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense>                  #define EIGEN_USE_GPU
                                        #include <unsupported/Eigen/GPU>

MatrixXd A = ...;                       auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...;                       auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

MatrixXd C = A * B;                     gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B);          gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);

                                        MatrixXd X = d_X.toHost();

The GPU version reads like CPU Eigen with explicit upload/download. operator* dispatches to cuBLAS GEMM, .llt().solve() dispatches to cuSOLVER potrf + potrs. Unsupported expressions are compile errors.

Standalone module. unsupported/Eigen/GPU does not modify or depend on Eigen's Core expression template system (MatrixBase, CwiseBinaryOp, etc.). DeviceMatrix is not an Eigen expression type and does not inherit from MatrixBase. The expression layer is a thin compile-time dispatch where every supported expression maps to a single NVIDIA library call. There is no coefficient-level evaluation, lazy fusion, or packet operations.

Explicit over implicit. Host-device transfers, stream management, and library handle lifetimes are visible in the API. There are no hidden allocations or synchronizations except where documented (e.g., toHost() must synchronize to deliver data to the host).
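
As a minimal sketch of what "explicit" means in practice, here is a pipeline with every transfer and stream visible. It uses only the fromHost(matrix, stream), device(ctx), and toHostAsync(stream) signatures listed in the reference below:

gpu::Context ctx;                                   // owns the stream and library handles
auto d_A = gpu::DeviceMatrix<double>::fromHost(A, ctx.stream());  // H2D upload (syncs)
auto d_B = gpu::DeviceMatrix<double>::fromHost(B, ctx.stream());

gpu::DeviceMatrix<double> d_C(A.rows(), B.cols());  // uninitialized device allocation
d_C.device(ctx) = d_A * d_B;                        // async GEMM on ctx's stream

auto transfer = d_C.toHostAsync(ctx.stream());      // async D2H download
MatrixXd C = transfer.get();                        // the only host-side block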

Key concepts

DeviceMatrix<Scalar>

A typed RAII wrapper for a dense column-major matrix in GPU device memory. This is the GPU counterpart of Eigen's MatrixX<Scalar>. A vector is simply a DeviceMatrix with one column.

// Upload from host
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);

// Allocate uninitialized
gpu::DeviceMatrix<double> d_C(m, n);

// Download to host
MatrixXd C = d_C.toHost();

// Async download (returns a future)
auto transfer = d_C.toHostAsync();
// ... do other work ...
MatrixXd C = transfer.get();

DeviceMatrix supports expression methods that mirror Eigen's API: adjoint(), transpose(), triangularView<UpLo>(), selfadjointView<UpLo>(), llt(), lu(). These return lightweight expression objects that are evaluated when assigned.
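
A sketch of the laziness, assuming the expression object is only consumed while its operands are still alive:

auto expr = d_A.adjoint() * d_B;         // lightweight expression object; no GPU work yet
gpu::DeviceMatrix<double> d_C = expr;    // assignment evaluates: a single cublasXgemm call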

gpu::Context

Every GPU operation needs a CUDA stream and library handles (cuBLAS, cuSOLVER). gpu::Context bundles these together.

For simple usage, you don't need to create one -- a per-thread default context is created lazily on first use:

// These use the thread-local default context automatically
d_C = d_A * d_B;
d_X = d_A.llt().solve(d_B);

For concurrent multi-stream execution, create explicit contexts:

gpu::Context ctx1, ctx2;
d_C1.device(ctx1) = d_A1 * d_B1;   // runs on stream 1
d_C2.device(ctx2) = d_A2 * d_B2;   // runs on stream 2 (concurrently)

Usage

Matrix operations (cuBLAS)

auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
gpu::DeviceMatrix<double> d_C = d_A * d_B;
d_C = d_A.adjoint() * d_B;
d_C = d_A * d_B.transpose();

// Scaled and accumulated
d_C += 2.0 * d_A * d_B;             // alpha=2, beta=1
d_C.device(ctx) -= d_A * d_B;       // alpha=-1, beta=1 (requires explicit context)

// Triangular solve (TRSM)
d_X = d_A.triangularView<Lower>().solve(d_B);

// Symmetric/Hermitian multiply (SYMM/HEMM)
d_C = d_A.selfadjointView<Lower>() * d_B;

// Rank-k update (SYRK/HERK)
d_C.selfadjointView<Lower>().rankUpdate(d_A);  // C += A * A^H

Dense solvers (cuSOLVER)

One-shot expression syntax -- Convenient, re-factorizes each time:

// Cholesky solve (potrf + potrs)
d_X = d_A.llt().solve(d_B);

// LU solve (getrf + getrs)
d_Y = d_A.lu().solve(d_B);

Cached factorization -- Factor once, solve many times:

gpu::LLT<double> llt;
llt.compute(d_A);                    // factorize (async)
if (llt.info() != Success) { ... }   // lazy sync on first info() call
auto d_X1 = llt.solve(d_B1);        // reuses factor (async)
auto d_X2 = llt.solve(d_B2);        // reuses factor (async)
MatrixXd X2 = d_X2.toHost();

// LU with transpose solve
gpu::LU<double> lu;
lu.compute(d_A);
auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans);           // A^T Y = B

// QR solve (overdetermined least squares)
gpu::QR<double> qr(A);                // host matrix input
MatrixXd X = qr.solve(B);           // Q^H * B via ormqr, then trsm on R

// SVD
gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV);
VectorXd S = svd.singularValues();
MatrixXd U = svd.matrixU();
MatrixXd VT = svd.matrixVT();
MatrixXd X = svd.solve(B);          // pseudoinverse solve

// SVD: device-side views (no D2H transfer; svd must outlive the views)
auto d_S = svd.d_singularValues();   // DeviceMatrix view of singular values
auto d_U = svd.d_matrixU();          // DeviceMatrix view of U
auto d_VT = svd.d_matrixVT();        // DeviceMatrix view of V^T

// Self-adjoint eigenvalue decomposition
gpu::SelfAdjointEigenSolver<double> es(A);
VectorXd eigenvals = es.eigenvalues();
MatrixXd eigenvecs = es.eigenvectors();
auto d_W = es.d_eigenvalues();        // DeviceMatrix view of eigenvalues
auto d_V = es.d_eigenvectors();       // DeviceMatrix view of eigenvectors

The cached API keeps the factored matrix on device, avoiding redundant host-device transfers and re-factorizations. The d_* accessors return non-owning DeviceMatrix views so downstream cuBLAS/cuSOLVER work can chain without round-tripping through host memory.
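
For example, projecting a right-hand side onto the left singular vectors can stay entirely on device. A sketch, assuming d_B is a DeviceMatrix<double> with a matching row count:

gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV);
auto d_U = svd.d_matrixU();                              // non-owning view into the solver
gpu::DeviceMatrix<double> d_proj = d_U.adjoint() * d_B;  // cuBLAS GEMM, no host round-trip
MatrixXd proj = d_proj.toHost();                         // svd must stay alive until here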

Precision control

GEMM dispatch uses cublasXgemm (type-specific Sgemm/Dgemm/Cgemm/Zgemm). cuBLAS may internally use tensor cores depending on the GPU architecture, matrix dimensions, and cuBLAS math mode settings. No Eigen-specific macros control this; use cublasSetMathMode() on the context's cuBLAS handle (or the NVIDIA_TF32_OVERRIDE environment variable) to configure tensor core behavior if needed.
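
For example, to opt float GEMMs into TF32 tensor-core math on Ampere-class GPUs, set the math mode on the context's handle directly. A sketch using the plain cuBLAS API (not wrapped by this module):

gpu::Context ctx;
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);  // allow TF32
d_C.device(ctx) = d_A * d_B;                        // Sgemm may now use tensor cores
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_DEFAULT_MATH);         // restore default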

Stream control and async execution

Operations are asynchronous by default. The compute-solve chain runs without host synchronization until you need a result on the host:

fromHost(A) --sync-->  compute() --async-->  solve() --async-->  toHost()
   H2D                  potrf                 potrs                D2H
                                                                   sync

Mandatory sync points:

  • fromHost() -- Synchronizes to complete the upload before returning
  • toHost() / HostTransfer::get() -- Must deliver data to host
  • info() -- Must read the factorization status

Cross-stream safety is automatic. DeviceMatrix tracks write completion via CUDA events. When a matrix written on stream A is read on stream B, the module automatically inserts cudaStreamWaitEvent. Same-stream operations skip the wait (CUDA guarantees in-order execution within a stream).
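
A sketch of what this buys you, assuming compatible DeviceMatrix operands d_A through d_E:

gpu::Context ctx1, ctx2;
d_C.device(ctx1) = d_A * d_B;   // write d_C on ctx1's stream
d_D.device(ctx2) = d_C * d_E;   // read d_C on ctx2's stream: the module inserts
                                // cudaStreamWaitEvent on d_C's write event first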

Lifetime of cached factorizations. A gpu::LLT / gpu::LU object owns its CUDA stream, library handle, and the cached factor on device. Destroying the object while a solve() it issued is still in flight is safe, but it is not actually asynchronous: cudaStreamDestroy returns immediately, yet the destructor of the cached factor calls cudaFree, which is fully device-synchronous and stalls until the in-flight potrs/getrs retires. For genuine async pipelining, keep the factorization object alive until you have drained its results (e.g. via toHost(), or by binding consumption to an explicit gpu::Context that outlives both producer and consumer):

gpu::LLT<double> llt(d_A);             // factor stays on device
auto d_X = llt.solve(d_B);
auto h_x = d_X.toHostAsync(llt.stream());
h_x.get();                             // sync: factor + result complete
// llt may now be destroyed without stalling the device

Reference

Supported scalar types

float, double, std::complex<float>, std::complex<double>.
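
Complex scalars follow the same expression patterns, with adjoint() mapping to ConjTrans. For instance, a Gram matrix of a complex operator is a single Zgemm (a sketch):

using Cplx = std::complex<double>;
MatrixXcd A = MatrixXcd::Random(512, 128);
auto d_A = gpu::DeviceMatrix<Cplx>::fromHost(A);
gpu::DeviceMatrix<Cplx> d_G = d_A.adjoint() * d_A;   // cublasZgemm with transA=C
MatrixXcd G = d_G.toHost();                          // G = A^H A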

Expression -> library call mapping

DeviceMatrix expression                 Library call                Parameters
C = A * B                               cublasXgemm                 transA=N, transB=N, alpha=1, beta=0
C = A.adjoint() * B                     cublasXgemm                 transA=C, transB=N
C = A.transpose() * B                   cublasXgemm                 transA=T, transB=N
C = A * B.adjoint()                     cublasXgemm                 transA=N, transB=C
C = A * B.transpose()                   cublasXgemm                 transA=N, transB=T
C = alpha * A * B                       cublasXgemm                 alpha from LHS
C = A * (alpha * B)                     cublasXgemm                 alpha from RHS
C += A * B                              cublasXgemm                 alpha=1, beta=1
C.device(ctx) -= A * B                  cublasXgemm                 alpha=-1, beta=1
X = A.llt().solve(B)                    cusolverDnXpotrf + Xpotrs   uplo, n, nrhs
X = A.llt<Upper>().solve(B)             same                        uplo=Upper
X = A.lu().solve(B)                     cusolverDnXgetrf + Xgetrs   n, nrhs
X = A.triangularView<L>().solve(B)      cublasXtrsm                 side=L, uplo, diag=NonUnit
C = A.selfadjointView<L>() * B          cublasXsymm / cublasXhemm   side=L, uplo
C.selfadjointView<L>().rankUpdate(A)    cublasXsyrk / cublasXherk   uplo, trans=N

DeviceMatrix<Scalar> API

Method                                   Sync?   Description
DeviceMatrix()                           --      Empty (0x0)
DeviceMatrix(rows, cols)                 --      Allocate uninitialized
fromHost(matrix, stream)                 yes     Upload from Eigen matrix
fromHostAsync(ptr, rows, cols, stream)   no      Async upload (caller manages lifetime)
adopt(ptr, rows, cols)                   --      Owning wrapper over a device pointer (caller relinquishes ownership)
view(ptr, rows, cols)                    --      Non-owning view over a device pointer (does not free on destruction)
toHost(stream)                           yes     Synchronous download
toHostAsync(stream)                      no      Returns HostTransfer future
clone(stream)                            no      Device-to-device deep copy
resize(rows, cols)                       --      Discard contents, reallocate
data()                                   --      Raw device pointer
rows(), cols()                           --      Dimensions
sizeInBytes()                            --      Total device allocation size in bytes
empty()                                  --      True if 0x0
adjoint()                                --      Adjoint view (GEMM ConjTrans)
transpose()                              --      Transpose view (GEMM Trans)
llt() / llt<UpLo>()                      --      Cholesky expression builder
lu()                                     --      LU expression builder
triangularView<UpLo>()                   --      Triangular view (TRSM)
selfadjointView<UpLo>()                  --      Self-adjoint view (SYMM, rankUpdate)
device(ctx)                              --      Assignment proxy bound to context

gpu::Context

Unified GPU execution context owning a CUDA stream and library handles.

gpu::Context()                                             // Creates dedicated stream + handles
static gpu::Context& threadLocal()                         // Per-thread default (lazy-created)

cudaStream_t       stream()
cublasHandle_t     cublasHandle()
cusolverDnHandle_t cusolverHandle()

Non-copyable, non-movable (owns library handles).
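
Since the raw handles are exposed, custom device work can share a context's stream and inherit its ordering. A sketch, where my_kernel, grid, and block are placeholders for user-supplied kernel code:

gpu::Context ctx;
d_C.device(ctx) = d_A * d_B;    // GEMM enqueued on ctx.stream()
my_kernel<<<grid, block, 0, ctx.stream()>>>(d_C.data(), d_C.rows(), d_C.cols());
// same stream => the kernel runs after the GEMM completes, no explicit sync needed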

gpu::LLT<Scalar, UpLo> API

GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.

Method                    Sync?      Description
gpu::LLT(A)               deferred   Construct and factorize from host matrix
compute(host_matrix)      deferred   Upload and factorize
compute(DeviceMatrix)     deferred   D2D copy and factorize
compute(DeviceMatrix&&)   deferred   Move-adopt and factorize (no copy)
solve(host_matrix)        yes        Solve, return host matrix
solve(DeviceMatrix)       no         Solve, return DeviceMatrix (async)
info()                    lazy       Syncs stream on first call, returns Success or NumericalIssue

gpu::LU<Scalar> API

GPU dense partial-pivoting LU via cuSOLVER. Same pattern as gpu::LLT, plus a gpu::GpuOp parameter on solve() (NoTrans, Trans, ConjTrans).

gpu::QR<Scalar> API

GPU dense QR decomposition via cuSOLVER (geqrf). Solve uses ormqr (apply Q^H) + trsm (back-substitute on R) -- Q is never formed explicitly.

Method                     Description
gpu::QR()                  Default construct, then call compute()
gpu::QR(A)                 Construct and factorize from host matrix
compute(A)                 Upload + factorize
compute(DeviceMatrix)      D2D copy + factorize
solve(host_matrix)         Solve, return host matrix (syncs)
solve(DeviceMatrix)        Solve, return DeviceMatrix (async)
matrixR()                  Download upper-triangular factor R to host (m >= n only)
info()                     Lazy sync
rows(), cols(), stream()   Dimensions and CUDA stream

gpu::SVD<Scalar> API

GPU dense SVD via cuSOLVER (gesvd). Supports thin, full, and values-only modes via Eigen's ComputeThinU | ComputeThinV, ComputeFullU | ComputeFullV, or 0 (values only).

Method                           Description
gpu::SVD()                       Default construct, then call compute()
gpu::SVD(A, options)             Construct and compute (options default: ComputeThinU | ComputeThinV)
compute(A, options)              Compute from host matrix
compute(DeviceMatrix, options)   Compute from device matrix
singularValues()                 Download singular values to host
matrixU()                        Download U to host (requires ComputeThinU or ComputeFullU)
matrixV()                        Download V to host (matches JacobiSVD::matrixV())
matrixVT()                       Download V^T to host (requires ComputeThinV or ComputeFullV)
d_singularValues()               DeviceMatrix view of singular values (zero-copy)
d_matrixU()                      DeviceMatrix view of U (zero-copy when m >= n)
d_matrixVT()                     DeviceMatrix view of V^T (zero-copy when m >= n)
solve(B)                         Pseudoinverse solve (returns host matrix)
solve(B, k)                      Truncated solve (top k singular triplets)
solve(B, lambda)                 Tikhonov-regularized solve
rank(threshold)                  Count singular values above threshold
info()                           Lazy sync
rows(), cols(), stream()         Dimensions and CUDA stream

Wide matrices (m < n) are handled by internally transposing via cuBLAS geam; in that case d_matrixU() and d_matrixVT() allocate an owning result via a single cublasXgeam adjoint pass. d_singularValues() is always zero-copy.
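
The solve overloads make regularized least squares one-liners. A sketch, assuming the truncation count is integral and the Tikhonov parameter is floating-point so the two overloads resolve as listed above:

gpu::SVD<double> svd(A);                            // thin U and V by default
MatrixXd X_trunc = svd.solve(B, 20);                // keep the top 20 singular triplets
MatrixXd X_ridge = svd.solve(B, 1e-6);              // Tikhonov-regularized pseudoinverse solve
int r = svd.rank(1e-10 * svd.singularValues()(0));  // numerical rank vs. a relative threshold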

gpu::SelfAdjointEigenSolver<Scalar> API

GPU symmetric/Hermitian eigenvalue decomposition via cuSOLVER (syevd).

Method                                 Description
gpu::SelfAdjointEigenSolver()          Default construct, then call compute()
gpu::SelfAdjointEigenSolver(A, mode)   Construct and compute (mode default: ComputeEigenvectors)
compute(A, mode)                       Compute from host matrix
compute(DeviceMatrix, mode)            Compute from device matrix
eigenvalues()                          Download eigenvalues to host (ascending order)
eigenvectors()                         Download eigenvectors to host (columns)
d_eigenvalues()                        DeviceMatrix view of eigenvalues (zero-copy)
d_eigenvectors()                       DeviceMatrix view of eigenvectors (zero-copy, requires ComputeEigenvectors)
info()                                 Lazy sync
rows(), cols(), stream()               Dimensions and CUDA stream

ComputeMode: gpu::SelfAdjointEigenSolver::EigenvaluesOnly or gpu::SelfAdjointEigenSolver::ComputeEigenvectors.

The d_* accessors return non-owning views into the solver's internal device buffers; the solver must outlive any view derived from it.
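
A values-only definiteness check, as a sketch:

gpu::SelfAdjointEigenSolver<double> es(
    A, gpu::SelfAdjointEigenSolver<double>::EigenvaluesOnly);
if (es.info() == Success) {
  VectorXd w = es.eigenvalues();   // ascending order
  bool posdef = w(0) > 0.0;        // smallest eigenvalue decides definiteness
}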

HostTransfer<Scalar> API

Future for async device-to-host transfer.

Method    Description
get()     Block until transfer completes, return host matrix reference. Idempotent.
ready()   Non-blocking poll
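
This lets host work overlap the download. A sketch, where do_other_host_work() is a placeholder:

auto transfer = d_C.toHostAsync();
while (!transfer.ready())
  do_other_host_work();              // poll; useful CPU work goes here
const MatrixXd& C = transfer.get();  // transfer already complete, returns immediately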

Aliasing

Unlike Eigen's Matrix, where omitting .noalias() makes a product evaluate into a temporary for safety, DeviceMatrix dispatches directly to NVIDIA library calls, which have no built-in aliasing protection. All operations are implicitly noalias: the caller must ensure operands don't alias the destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before dispatching to cuBLAS.
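
For an in-place-looking update, route through an explicit temporary. A sketch, assuming DeviceMatrix is move-assignable:

// WRONG: destination aliases an operand; the GEMM result is undefined
// d_A = d_A * d_B;

// RIGHT: materialize into a fresh matrix, then take ownership of it
gpu::DeviceMatrix<double> d_tmp = d_A * d_B;
d_A = std::move(d_tmp);            // assumed move assignment; frees d_A's old buffer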

Future work

  • Reassess host-input vs. device-input API surface. Each solver currently exposes both host-input (compute(MatrixXd), solve(MatrixXd)) and device-input (compute(DeviceMatrix), solve(DeviceMatrix)) overloads, plus host- and device-side accessors (matrixU() vs d_matrixU()). This eases migration from CPU Eigen but may invite accidental host ↔ device round-trips when users mix the two without realizing the cost. Revisit after the initial GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' hands; if the convenience overloads cause more confusion than they save, narrow toward a single explicit fromHost / toHost boundary.

File layout

File                 Depends on                           Contents
GpuSupport.h         <cuda_runtime.h>                     Error macro, DeviceBuffer, PinnedHostBuffer, cuda_data_type<>
DeviceMatrix.h       GpuSupport.h                         DeviceMatrix<>, HostTransfer<>
DeviceExpr.h         DeviceMatrix.h                       GEMM expression wrappers
DeviceBlasExpr.h     DeviceMatrix.h                       TRSM, SYMM, SYRK expression wrappers
DeviceSolverExpr.h   DeviceMatrix.h                       Solver expression wrappers (LLT, LU)
DeviceDispatch.h     all above                            All dispatch functions + Assignment
GpuContext.h         CuBlasSupport.h, CuSolverSupport.h   gpu::Context
CuBlasSupport.h      GpuSupport.h, <cublas_v2.h>          cuBLAS error macro, op/compute type maps
CuSolverSupport.h    GpuSupport.h, <cusolverDn.h>         cuSOLVER params, fill-mode mapping
GpuSolverContext.h   CuSolverSupport.h, CuBlasSupport.h   Shared solver context (stream, handles, scratch)
GpuLLT.h             GpuSolverContext.h                   Cached dense Cholesky factorization
GpuLU.h              GpuSolverContext.h                   Cached dense LU factorization
GpuQR.h              GpuSolverContext.h                   Dense QR decomposition
GpuSVD.h             GpuSolverContext.h                   Dense SVD decomposition
GpuEigenSolver.h     GpuSolverContext.h                   Self-adjoint eigenvalue decomposition

Building and testing

cmake -G Ninja -B build -S . \
  -DEIGEN_TEST_CUDA=ON \
  -DEIGEN_CUDA_COMPUTE_ARCH="70"

cmake --build build --target cublas cusolver_llt cusolver_lu \
  cusolver_qr cusolver_svd cusolver_eigen device_matrix
ctest --test-dir build -L gpu --output-on-failure

EIGEN_TEST_CUBLAS and EIGEN_TEST_CUSOLVER default to ON when CUDA is enabled (cuBLAS and cuSOLVER are part of the CUDA toolkit).