unsupported/Eigen/GPU)GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link against CUDA runtime, cuBLAS, and cuSOLVER).
Eigen is the linear algebra foundation for a large ecosystem of C++ projects in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, Open3D), scientific computing (Ceres, Stan), and beyond. Many of these projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations without dropping down to raw CUDA library APIs. Third-party projects like EigenCuda and cholespy exist specifically to fill this gap, and downstream projects like Ceres and COLMAP have open requests for GPU-accelerated solvers through Eigen.
The unsupported/Eigen/GPU module aims to close this gap: Existing Eigen users should be able to move performance-critical dense linear algebra to the GPU with minimal code changes and without learning CUDA library APIs directly.
CPU and GPU coexist. There is no global compile-time switch that replaces CPU implementations (unlike EIGEN_USE_LAPACKE). Users choose GPU solvers explicitly -- gpu::LLT<double> vs Eigen::LLT<MatrixXd> -- and both coexist in the same binary. This also lets users keep the factored matrix on device across multiple solves, something impossible with compile-time replacement.
Familiar syntax. GPU operations use the same expression patterns as CPU Eigen. Here is a side-by-side comparison:
// ---- CPU (Eigen) ---- // ---- GPU (unsupported/Eigen/GPU) ---- #include <Eigen/Dense> #define EIGEN_USE_GPU #include <unsupported/Eigen/GPU> MatrixXd A = ...; auto d_A = gpu::DeviceMatrix<double>::fromHost(A); MatrixXd B = ...; auto d_B = gpu::DeviceMatrix<double>::fromHost(B); MatrixXd C = A * B; gpu::DeviceMatrix<double> d_C = d_A * d_B; MatrixXd X = A.llt().solve(B); gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B); MatrixXd X = d_X.toHost();
The GPU version reads like CPU Eigen with explicit upload/download. operator* dispatches to cuBLAS GEMM, .llt().solve() dispatches to cuSOLVER potrf + potrs. Unsupported expressions are compile errors.
Standalone module. unsupported/Eigen/GPU does not modify or depend on Eigen's Core expression template system (MatrixBase, CwiseBinaryOp, etc.). DeviceMatrix is not an Eigen expression type and does not inherit from MatrixBase. The expression layer is a thin compile-time dispatch where every supported expression maps to a single NVIDIA library call. There is no coefficient-level evaluation, lazy fusion, or packet operations.
Explicit over implicit. Host-device transfers, stream management, and library handle lifetimes are visible in the API. There are no hidden allocations or synchronizations except where documented (e.g., toHost() must synchronize to deliver data to the host).
DeviceMatrix<Scalar>A typed RAII wrapper for a dense column-major matrix in GPU device memory. This is the GPU counterpart of Eigen's MatrixX<Scalar>. A vector is simply a DeviceMatrix with one column.
// Upload from host auto d_A = gpu::DeviceMatrix<double>::fromHost(A); // Allocate uninitialized gpu::DeviceMatrix<double> d_C(m, n); // Download to host MatrixXd C = d_C.toHost(); // Async download (returns a future) auto transfer = d_C.toHostAsync(); // ... do other work ... MatrixXd C = transfer.get();
DeviceMatrix supports expression methods that mirror Eigen's API: adjoint(), transpose(), triangularView<UpLo>(), selfadjointView<UpLo>(), llt(), lu(). These return lightweight expression objects that are evaluated when assigned.
gpu::ContextEvery GPU operation needs a CUDA stream and library handles (cuBLAS, cuSOLVER). gpu::Context bundles these together.
For simple usage, you don't need to create one -- a per-thread default context is created lazily on first use:
// These use the thread-local default context automatically d_C = d_A * d_B; d_X = d_A.llt().solve(d_B);
For concurrent multi-stream execution, create explicit contexts:
gpu::Context ctx1, ctx2; d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1 d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
auto d_A = gpu::DeviceMatrix<double>::fromHost(A); auto d_B = gpu::DeviceMatrix<double>::fromHost(B); // GEMM: C = A * B, C = A^H * B, C = A * B^T, ... gpu::DeviceMatrix<double> d_C = d_A * d_B; d_C = d_A.adjoint() * d_B; d_C = d_A * d_B.transpose(); // Scaled and accumulated d_C += 2.0 * d_A * d_B; // alpha=2, beta=1 d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context) // Triangular solve (TRSM) d_X = d_A.triangularView<Lower>().solve(d_B); // Symmetric/Hermitian multiply (SYMM/HEMM) d_C = d_A.selfadjointView<Lower>() * d_B; // Rank-k update (SYRK/HERK) d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
One-shot expression syntax -- Convenient, re-factorizes each time:
// Cholesky solve (potrf + potrs) d_X = d_A.llt().solve(d_B); // LU solve (getrf + getrs) d_Y = d_A.lu().solve(d_B);
Cached factorization -- Factor once, solve many times:
gpu::LLT<double> llt; llt.compute(d_A); // factorize (async) if (llt.info() != Success) { ... } // lazy sync on first info() call auto d_X1 = llt.solve(d_B1); // reuses factor (async) auto d_X2 = llt.solve(d_B2); // reuses factor (async) MatrixXd X2 = d_X2.toHost(); // LU with transpose solve gpu::LU<double> lu; lu.compute(d_A); auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B
The cached API keeps the factored matrix on device, avoiding redundant host-device transfers and re-factorizations.
Operations are asynchronous by default. The compute-solve chain runs without host synchronization until you need a result on the host:
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost() H2D potrf potrs D2H sync
Mandatory sync points:
fromHost() -- Synchronizes to complete the upload before returningtoHost() / HostTransfer::get() -- Must deliver data to hostinfo() -- Must read the factorization statusCross-stream safety is automatic. DeviceMatrix tracks write completion via CUDA events. When a matrix written on stream A is read on stream B, the module automatically inserts cudaStreamWaitEvent. Same-stream operations skip the wait (CUDA guarantees in-order execution within a stream).
Lifetime of cached factorizations. A gpu::LLT / gpu::LU object owns its CUDA stream, library handle, and the cached factor on device. Destroying the factorization object while a solve() it issued is still in flight is correct but not actually async: cudaStreamDestroy returns immediately, but the destructor of the cached factor calls cudaFree, which is fully device-synchronous and stalls until the in-flight potrs/getrs retires. For genuine async pipelining keep the factorization object alive until you have drained its results (e.g. via toHost() or by binding consumption to an explicit gpu::Context that outlives both producer and consumer):
gpu::LLT<double> llt(d_A); // factor stays on device auto d_X = llt.solve(d_B); auto h_x = d_X.toHostAsync(llt.stream()); h_x.get(); // sync: factor + result complete // llt may now be destroyed without stalling the device
float, double, std::complex<float>, std::complex<double>.
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
C = A * B | cublasGemmEx | transA=N, transB=N, alpha=1, beta=0 |
C = A.adjoint() * B | cublasGemmEx | transA=C, transB=N |
C = A.transpose() * B | cublasGemmEx | transA=T, transB=N |
C = A * B.adjoint() | cublasGemmEx | transA=N, transB=C |
C = A * B.transpose() | cublasGemmEx | transA=N, transB=T |
C = alpha * A * B | cublasGemmEx | alpha from LHS |
C = A * (alpha * B) | cublasGemmEx | alpha from RHS |
C += A * B | cublasGemmEx | alpha=1, beta=1 |
C.device(ctx) -= A * B | cublasGemmEx | alpha=-1, beta=1 |
X = A.llt().solve(B) | cusolverDnXpotrf + Xpotrs | uplo, n, nrhs |
X = A.llt<Upper>().solve(B) | same | uplo=Upper |
X = A.lu().solve(B) | cusolverDnXgetrf + Xgetrs | n, nrhs |
X = A.triangularView<L>().solve(B) | cublasXtrsm | side=L, uplo, diag=NonUnit |
C = A.selfadjointView<L>() * B | cublasXsymm / cublasXhemm | side=L, uplo |
C.selfadjointView<L>().rankUpdate(A) | cublasXsyrk / cublasXherk | uplo, trans=N |
DeviceMatrix<Scalar> API| Method | Sync? | Description |
|---|---|---|
DeviceMatrix() | -- | Empty (0x0) |
DeviceMatrix(rows, cols) | -- | Allocate uninitialized |
fromHost(matrix, stream) | yes | Upload from Eigen matrix |
fromHostAsync(ptr, rows, cols, outerStride, stream) | no | Async upload (caller manages lifetime) |
toHost(stream) | yes | Synchronous download |
toHostAsync(stream) | no | Returns HostTransfer future |
clone(stream) | no | Device-to-device deep copy |
resize(rows, cols) | -- | Discard contents, reallocate |
data() | -- | Raw device pointer |
rows(), cols() | -- | Dimensions |
sizeInBytes() | -- | Total device allocation size in bytes |
empty() | -- | True if 0x0 |
adjoint() | -- | Adjoint view (GEMM ConjTrans) |
transpose() | -- | Transpose view (GEMM Trans) |
llt() / llt<UpLo>() | -- | Cholesky expression builder |
lu() | -- | LU expression builder |
triangularView<UpLo>() | -- | Triangular view (TRSM) |
selfadjointView<UpLo>() | -- | Self-adjoint view (SYMM, rankUpdate) |
device(ctx) | -- | Assignment proxy bound to context |
gpu::ContextUnified GPU execution context owning a CUDA stream and library handles.
gpu::Context() // Creates dedicated stream + handles static gpu::Context& threadLocal() // Per-thread default (lazy-created) cudaStream_t stream() cublasHandle_t cublasHandle() cusolverDnHandle_t cusolverHandle()
Non-copyable, non-movable (owns library handles).
gpu::LLT<Scalar, UpLo> APIGPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.
| Method | Sync? | Description |
|---|---|---|
gpu::LLT(A) | deferred | Construct and factorize from host matrix |
compute(host_matrix) | deferred | Upload and factorize |
compute(DeviceMatrix) | deferred | D2D copy and factorize |
compute(DeviceMatrix&&) | deferred | Move-adopt and factorize (no copy) |
solve(host_matrix) | yes | Solve, return host matrix |
solve(DeviceMatrix) | no | Solve, return DeviceMatrix (async) |
info() | lazy | Syncs stream on first call, returns Success or NumericalIssue |
gpu::LU<Scalar> APIGPU dense partial-pivoting LU via cuSOLVER. Same pattern as gpu::LLT, plus a gpu::GpuOp parameter on solve() (NoTrans, Trans, ConjTrans).
HostTransfer<Scalar> APIFuture for async device-to-host transfer.
| Method | Description |
|---|---|
get() | Block until transfer completes, return host matrix reference. Idempotent. |
ready() | Non-blocking poll |
Unlike Eigen‘s Matrix, where omitting .noalias() triggers a copy to a temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have no built-in aliasing protection. All operations are implicitly noalias. The caller must ensure operands don’t alias the destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before dispatching to cuBLAS.
| File | Depends on | Contents |
|---|---|---|
GpuSupport.h | <cuda_runtime.h> | Error macro, DeviceBuffer, PinnedHostBuffer, cuda_data_type<> |
DeviceMatrix.h | GpuSupport.h | DeviceMatrix<>, HostTransfer<> |
DeviceExpr.h | DeviceMatrix.h | GEMM expression wrappers |
DeviceBlasExpr.h | DeviceMatrix.h | TRSM, SYMM, SYRK expression wrappers |
DeviceSolverExpr.h | DeviceMatrix.h | Solver expression wrappers (LLT, LU) |
DeviceDispatch.h | all above | All dispatch functions + Assignment |
GpuContext.h | CuBlasSupport.h, CuSolverSupport.h | gpu::Context |
CuBlasSupport.h | GpuSupport.h, <cublas_v2.h> | cuBLAS error macro, op/compute type maps |
CuSolverSupport.h | GpuSupport.h, <cusolverDn.h> | cuSOLVER params, fill-mode mapping |
GpuLLT.h | CuSolverSupport.h | Cached dense Cholesky factorization |
GpuLU.h | CuSolverSupport.h | Cached dense LU factorization |
cmake -G Ninja -B build -S . \ -DEIGEN_TEST_CUDA=ON \ -DEIGEN_CUDA_COMPUTE_ARCH="70" cmake --build build --target cublas cusolver_llt cusolver_lu device_matrix ctest --test-dir build -L gpu --output-on-failure
EIGEN_TEST_CUBLAS and EIGEN_TEST_CUSOLVER default to ON when CUDA is enabled (cuBLAS and cuSOLVER are part of the CUDA toolkit).