unsupported/Eigen/GPU)GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link against CUDA runtime, cuBLAS, and cuSOLVER).
Eigen is the linear algebra foundation for a large ecosystem of C++ projects in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, Open3D), scientific computing (Ceres, Stan), and beyond. Many of these projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations without dropping down to raw CUDA library APIs. Third-party projects like EigenCuda and cholespy exist specifically to fill this gap, and downstream projects like Ceres and COLMAP have open requests for GPU-accelerated solvers through Eigen.
The unsupported/Eigen/GPU module aims to close this gap: Existing Eigen users should be able to move performance-critical dense linear algebra to the GPU with minimal code changes and without learning CUDA library APIs directly.
CPU and GPU coexist. There is no global compile-time switch that replaces CPU implementations (unlike EIGEN_USE_LAPACKE). Users choose GPU solvers explicitly -- gpu::LLT<double> vs LLT<MatrixXd> -- and both coexist in the same binary. This also lets users keep the factored matrix on device across multiple solves, something impossible with compile-time replacement.
Familiar syntax. GPU operations use the same expression patterns as CPU Eigen. Here is a side-by-side comparison:
// ---- CPU (Eigen) ---- // ---- GPU (unsupported/Eigen/GPU) ---- #include <Eigen/Dense> #define EIGEN_USE_GPU #include <unsupported/Eigen/GPU> MatrixXd A = ...; auto d_A = gpu::DeviceMatrix<double>::fromHost(A); MatrixXd B = ...; auto d_B = gpu::DeviceMatrix<double>::fromHost(B); MatrixXd C = A * B; gpu::DeviceMatrix<double> d_C = d_A * d_B; MatrixXd X = A.llt().solve(B); gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B); MatrixXd X = d_X.toHost();
The GPU version reads like CPU Eigen with explicit upload/download. operator* dispatches to cuBLAS GEMM, .llt().solve() dispatches to cuSOLVER potrf + potrs. Unsupported expressions are compile errors.
Standalone module. unsupported/Eigen/GPU does not modify or depend on Eigen's Core expression template system (MatrixBase, CwiseBinaryOp, etc.). DeviceMatrix is not an Eigen expression type and does not inherit from MatrixBase. The expression layer is a thin compile-time dispatch where every supported expression maps to a single NVIDIA library call. There is no coefficient-level evaluation, lazy fusion, or packet operations.
Explicit over implicit. Host-device transfers, stream management, and library handle lifetimes are visible in the API. There are no hidden allocations or synchronizations except where documented (e.g., toHost() must synchronize to deliver data to the host).
DeviceMatrix<Scalar>A typed RAII wrapper for a dense column-major matrix in GPU device memory. This is the GPU counterpart of Eigen's MatrixX<Scalar>. A vector is simply a DeviceMatrix with one column.
// Upload from host auto d_A = gpu::DeviceMatrix<double>::fromHost(A); // Allocate uninitialized gpu::DeviceMatrix<double> d_C(m, n); // Download to host MatrixXd C = d_C.toHost(); // Async download (returns a future) auto transfer = d_C.toHostAsync(); // ... do other work ... MatrixXd C = transfer.get();
DeviceMatrix supports expression methods that mirror Eigen's API: adjoint(), transpose(), triangularView<UpLo>(), selfadjointView<UpLo>(), llt(), lu(). These return lightweight expression objects that are evaluated when assigned.
gpu::ContextEvery GPU operation needs a CUDA stream and library handles (cuBLAS, cuSOLVER). gpu::Context bundles these together.
For simple usage, you don't need to create one -- a per-thread default context is created lazily on first use:
// These use the thread-local default context automatically d_C = d_A * d_B; d_X = d_A.llt().solve(d_B);
For concurrent multi-stream execution, create explicit contexts:
gpu::Context ctx1, ctx2; d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1 d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
auto d_A = gpu::DeviceMatrix<double>::fromHost(A); auto d_B = gpu::DeviceMatrix<double>::fromHost(B); // GEMM: C = A * B, C = A^H * B, C = A * B^T, ... gpu::DeviceMatrix<double> d_C = d_A * d_B; d_C = d_A.adjoint() * d_B; d_C = d_A * d_B.transpose(); // Scaled and accumulated d_C += 2.0 * d_A * d_B; // alpha=2, beta=1 d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context) // Triangular solve (TRSM) d_X = d_A.triangularView<Lower>().solve(d_B); // Symmetric/Hermitian multiply (SYMM/HEMM) d_C = d_A.selfadjointView<Lower>() * d_B; // Rank-k update (SYRK/HERK) d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
One-shot expression syntax -- Convenient, re-factorizes each time:
// Cholesky solve (potrf + potrs) d_X = d_A.llt().solve(d_B); // LU solve (getrf + getrs) d_Y = d_A.lu().solve(d_B);
Cached factorization -- Factor once, solve many times:
gpu::LLT<double> llt; llt.compute(d_A); // factorize (async) if (llt.info() != Success) { ... } // lazy sync on first info() call auto d_X1 = llt.solve(d_B1); // reuses factor (async) auto d_X2 = llt.solve(d_B2); // reuses factor (async) MatrixXd X2 = d_X2.toHost(); // LU with transpose solve gpu::LU<double> lu; lu.compute(d_A); auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B // QR solve (overdetermined least squares) gpu::QR<double> qr(A); // host matrix input MatrixXd X = qr.solve(B); // Q^H * B via ormqr, then trsm on R // SVD gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV); VectorXd S = svd.singularValues(); MatrixXd U = svd.matrixU(); MatrixXd VT = svd.matrixVT(); MatrixXd X = svd.solve(B); // pseudoinverse solve // SVD: device-side views (no D2H transfer; svd must outlive the views) auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values auto d_U = svd.d_matrixU(); // DeviceMatrix view of U auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T // Self-adjoint eigenvalue decomposition gpu::SelfAdjointEigenSolver<double> es(A); VectorXd eigenvals = es.eigenvalues(); MatrixXd eigenvecs = es.eigenvectors(); auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors
The cached API keeps the factored matrix on device, avoiding redundant host-device transfers and re-factorizations. The d_* accessors return non-owning DeviceMatrix views so downstream cuBLAS/cuSOLVER work can chain without round-tripping through host memory.
GEMM dispatch uses cublasXgemm (type-specific Sgemm/Dgemm/Cgemm/Zgemm). cuBLAS may internally use tensor cores depending on the GPU architecture, matrix dimensions, and CUDA math mode settings. No Eigen-specific macros control this; use the standard CUDA_MATH_MODE environment variable or cublasSetMathMode() to configure tensor core behavior if needed.
Operations are asynchronous by default. The compute-solve chain runs without host synchronization until you need a result on the host:
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost() H2D potrf potrs D2H sync
Mandatory sync points:
fromHost() -- Synchronizes to complete the upload before returningtoHost() / HostTransfer::get() -- Must deliver data to hostinfo() -- Must read the factorization statusCross-stream safety is automatic. DeviceMatrix tracks write completion via CUDA events. When a matrix written on stream A is read on stream B, the module automatically inserts cudaStreamWaitEvent. Same-stream operations skip the wait (CUDA guarantees in-order execution within a stream).
Lifetime of cached factorizations. A gpu::LLT / gpu::LU object owns its CUDA stream, library handle, and the cached factor on device. Destroying the factorization object while a solve() it issued is still in flight is correct but not actually async: cudaStreamDestroy returns immediately, but the destructor of the cached factor calls cudaFree, which is fully device-synchronous and stalls until the in-flight potrs/getrs retires. For genuine async pipelining keep the factorization object alive until you have drained its results (e.g. via toHost() or by binding consumption to an explicit gpu::Context that outlives both producer and consumer):
gpu::LLT<double> llt(d_A); // factor stays on device auto d_X = llt.solve(d_B); auto h_x = d_X.toHostAsync(llt.stream()); h_x.get(); // sync: factor + result complete // llt may now be destroyed without stalling the device
float, double, std::complex<float>, std::complex<double>.
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
C = A * B | cublasXgemm | transA=N, transB=N, alpha=1, beta=0 |
C = A.adjoint() * B | cublasXgemm | transA=C, transB=N |
C = A.transpose() * B | cublasXgemm | transA=T, transB=N |
C = A * B.adjoint() | cublasXgemm | transA=N, transB=C |
C = A * B.transpose() | cublasXgemm | transA=N, transB=T |
C = alpha * A * B | cublasXgemm | alpha from LHS |
C = A * (alpha * B) | cublasXgemm | alpha from RHS |
C += A * B | cublasXgemm | alpha=1, beta=1 |
C.device(ctx) -= A * B | cublasXgemm | alpha=-1, beta=1 |
X = A.llt().solve(B) | cusolverDnXpotrf + Xpotrs | uplo, n, nrhs |
X = A.llt<Upper>().solve(B) | same | uplo=Upper |
X = A.lu().solve(B) | cusolverDnXgetrf + Xgetrs | n, nrhs |
X = A.triangularView<L>().solve(B) | cublasXtrsm | side=L, uplo, diag=NonUnit |
C = A.selfadjointView<L>() * B | cublasXsymm / cublasXhemm | side=L, uplo |
C.selfadjointView<L>().rankUpdate(A) | cublasXsyrk / cublasXherk | uplo, trans=N |
DeviceMatrix<Scalar> API| Method | Sync? | Description |
|---|---|---|
DeviceMatrix() | -- | Empty (0x0) |
DeviceMatrix(rows, cols) | -- | Allocate uninitialized |
fromHost(matrix, stream) | yes | Upload from Eigen matrix |
fromHostAsync(ptr, rows, cols, stream) | no | Async upload (caller manages lifetime) |
adopt(ptr, rows, cols) | -- | Owning wrapper over a device pointer (caller relinquishes ownership) |
view(ptr, rows, cols) | -- | Non-owning view over a device pointer (does not free on destruction) |
toHost(stream) | yes | Synchronous download |
toHostAsync(stream) | no | Returns HostTransfer future |
clone(stream) | no | Device-to-device deep copy |
resize(rows, cols) | -- | Discard contents, reallocate |
data() | -- | Raw device pointer |
rows(), cols() | -- | Dimensions |
sizeInBytes() | -- | Total device allocation size in bytes |
empty() | -- | True if 0x0 |
adjoint() | -- | Adjoint view (GEMM ConjTrans) |
transpose() | -- | Transpose view (GEMM Trans) |
llt() / llt<UpLo>() | -- | Cholesky expression builder |
lu() | -- | LU expression builder |
triangularView<UpLo>() | -- | Triangular view (TRSM) |
selfadjointView<UpLo>() | -- | Self-adjoint view (SYMM, rankUpdate) |
device(ctx) | -- | Assignment proxy bound to context |
gpu::ContextUnified GPU execution context owning a CUDA stream and library handles.
gpu::Context() // Creates dedicated stream + handles static gpu::Context& threadLocal() // Per-thread default (lazy-created) cudaStream_t stream() cublasHandle_t cublasHandle() cusolverDnHandle_t cusolverHandle()
Non-copyable, non-movable (owns library handles).
gpu::LLT<Scalar, UpLo> APIGPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.
| Method | Sync? | Description |
|---|---|---|
gpu::LLT(A) | deferred | Construct and factorize from host matrix |
compute(host_matrix) | deferred | Upload and factorize |
compute(DeviceMatrix) | deferred | D2D copy and factorize |
compute(DeviceMatrix&&) | deferred | Move-adopt and factorize (no copy) |
solve(host_matrix) | yes | Solve, return host matrix |
solve(DeviceMatrix) | no | Solve, return DeviceMatrix (async) |
info() | lazy | Syncs stream on first call, returns Success or NumericalIssue |
gpu::LU<Scalar> APIGPU dense partial-pivoting LU via cuSOLVER. Same pattern as gpu::LLT, plus a gpu::GpuOp parameter on solve() (NoTrans, Trans, ConjTrans).
gpu::QR<Scalar> APIGPU dense QR decomposition via cuSOLVER (geqrf). Solve uses ormqr (apply Q^H) + trsm (back-substitute on R) -- Q is never formed explicitly.
| Method | Description |
|---|---|
gpu::QR() | Default construct, then call compute() |
gpu::QR(A) | Construct and factorize from host matrix |
compute(A) | Upload + factorize |
compute(DeviceMatrix) | D2D copy + factorize |
solve(host_matrix) | Solve, return host matrix (syncs) |
solve(DeviceMatrix) | Solve, return DeviceMatrix (async) |
matrixR() | Download upper-triangular factor R to host (m >= n only) |
info() | Lazy sync |
rows(), cols(), stream() | Dimensions and CUDA stream |
gpu::SVD<Scalar> APIGPU dense SVD via cuSOLVER (gesvd). Supports thin, full, and values-only modes via Eigen's ComputeThinU | ComputeThinV, ComputeFullU | ComputeFullV, or 0 (values only).
| Method | Description |
|---|---|
gpu::SVD() | Default construct, then call compute() |
gpu::SVD(A, options) | Construct and compute (options default: ComputeThinU | ComputeThinV) |
compute(A, options) | Compute from host matrix |
compute(DeviceMatrix, options) | Compute from device matrix |
singularValues() | Download singular values to host |
matrixU() | Download U to host (requires ComputeThinU or ComputeFullU) |
matrixV() | Download V to host (matches JacobiSVD::matrixV()) |
matrixVT() | Download V^T to host (requires ComputeThinV or ComputeFullV) |
d_singularValues() | DeviceMatrix view of singular values (zero-copy) |
d_matrixU() | DeviceMatrix view of U (zero-copy when m >= n) |
d_matrixVT() | DeviceMatrix view of V^T (zero-copy when m >= n) |
solve(B) | Pseudoinverse solve (returns host matrix) |
solve(B, k) | Truncated solve (top k singular triplets) |
solve(B, lambda) | Tikhonov regularized solve |
rank(threshold) | Count singular values above threshold |
info() | Lazy sync |
rows(), cols(), stream() | Dimensions and CUDA stream |
Wide matrices (m < n) are handled by internally transposing via cuBLAS geam; in that case d_matrixU() and d_matrixVT() allocate an owning result via a single cublasXgeam adjoint pass. d_singularValues() is always zero-copy.
gpu::SelfAdjointEigenSolver<Scalar> APIGPU symmetric/Hermitian eigenvalue decomposition via cuSOLVER (syevd).
| Method | Description |
|---|---|
gpu::SelfAdjointEigenSolver() | Default construct, then call compute() |
gpu::SelfAdjointEigenSolver(A, mode) | Construct and compute (mode default: ComputeEigenvectors) |
compute(A, mode) | Compute from host matrix |
compute(DeviceMatrix, mode) | Compute from device matrix |
eigenvalues() | Download eigenvalues to host (ascending order) |
eigenvectors() | Download eigenvectors to host (columns) |
d_eigenvalues() | DeviceMatrix view of eigenvalues (zero-copy) |
d_eigenvectors() | DeviceMatrix view of eigenvectors (zero-copy, requires ComputeEigenvectors) |
info() | Lazy sync |
rows(), cols(), stream() | Dimensions and CUDA stream |
ComputeMode: gpu::SelfAdjointEigenSolver::EigenvaluesOnly or gpu::SelfAdjointEigenSolver::ComputeEigenvectors.
The d_* accessors return non-owning views into the solver's internal device buffers; the solver must outlive any view derived from it.
HostTransfer<Scalar> APIFuture for async device-to-host transfer.
| Method | Description |
|---|---|
get() | Block until transfer completes, return host matrix reference. Idempotent. |
ready() | Non-blocking poll |
Unlike Eigen‘s Matrix, where omitting .noalias() triggers a copy to a temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have no built-in aliasing protection. All operations are implicitly noalias. The caller must ensure operands don’t alias the destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before dispatching to cuBLAS.
compute(MatrixXd), solve(MatrixXd)) and device-input (compute(DeviceMatrix), solve(DeviceMatrix)) overloads, plus host- and device-side accessors (matrixU() vs d_matrixU()). This eases migration from CPU Eigen but may invite accidental host ↔ device round-trips when users mix the two without realising the cost. Revisit after the initial GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' hands; if the convenience overloads cause more confusion than they save, narrow toward a single explicit fromHost / toHost boundary.| File | Depends on | Contents |
|---|---|---|
GpuSupport.h | <cuda_runtime.h> | Error macro, DeviceBuffer, PinnedHostBuffer, cuda_data_type<> |
DeviceMatrix.h | GpuSupport.h | DeviceMatrix<>, HostTransfer<> |
DeviceExpr.h | DeviceMatrix.h | GEMM expression wrappers |
DeviceBlasExpr.h | DeviceMatrix.h | TRSM, SYMM, SYRK expression wrappers |
DeviceSolverExpr.h | DeviceMatrix.h | Solver expression wrappers (LLT, LU) |
DeviceDispatch.h | all above | All dispatch functions + Assignment |
GpuContext.h | CuBlasSupport.h, CuSolverSupport.h | gpu::Context |
CuBlasSupport.h | GpuSupport.h, <cublas_v2.h> | cuBLAS error macro, op/compute type maps |
CuSolverSupport.h | GpuSupport.h, <cusolverDn.h> | cuSOLVER params, fill-mode mapping |
GpuSolverContext.h | CuSolverSupport.h, CuBlasSupport.h | Shared solver context (stream, handles, scratch) |
GpuLLT.h | GpuSolverContext.h | Cached dense Cholesky factorization |
GpuLU.h | GpuSolverContext.h | Cached dense LU factorization |
GpuQR.h | GpuSolverContext.h | Dense QR decomposition |
GpuSVD.h | GpuSolverContext.h | Dense SVD decomposition |
GpuEigenSolver.h | GpuSolverContext.h | Self-adjoint eigenvalue decomposition |
cmake -G Ninja -B build -S . \ -DEIGEN_TEST_CUDA=ON \ -DEIGEN_CUDA_COMPUTE_ARCH="70" cmake --build build --target cublas cusolver_llt cusolver_lu \ cusolver_qr cusolver_svd cusolver_eigen device_matrix ctest --test-dir build -L gpu --output-on-failure
EIGEN_TEST_CUBLAS and EIGEN_TEST_CUSOLVER default to ON when CUDA is enabled (cuBLAS and cuSOLVER are part of the CUDA toolkit).