unsupported/Eigen/GPU)GPU-accelerated linear algebra for Eigen users, dispatching to NVIDIA CUDA Math Libraries (cuBLAS, cuSOLVER, cuFFT, cuSPARSE, cuDSS). Requires CUDA 11.4+; cuDSS features require CUDA 12.0+ and a separate cuDSS install. Header-only.
Eigen is the linear algebra foundation for a large ecosystem of C++ projects in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, Open3D), scientific computing (Ceres, Stan), and beyond. Many of these projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations without dropping down to raw CUDA library APIs.
GPU sparse solvers are a particularly acute gap. Sparse factorization is the bottleneck in SLAM, bundle adjustment, FEM, and nonlinear optimization -- exactly the workloads where GPU acceleration matters most. Downstream projects like Ceres and COLMAP have open requests for GPU-accelerated sparse solvers, and third-party projects like cholespy exist specifically because Eigen lacks them. The unsupported/Eigen/GPU module provides GPU sparse Cholesky, LDL^T, and LU factorization via cuDSS, alongside dense solvers (cuSOLVER), matrix products (cuBLAS), FFT (cuFFT), and sparse matrix-vector products (cuSPARSE).
Existing Eigen users should be able to move performance-critical dense or sparse linear algebra to the GPU with minimal code changes and without learning CUDA library APIs directly.
CPU and GPU coexist. There is no global compile-time switch that replaces CPU implementations (unlike EIGEN_USE_LAPACKE). Users choose GPU solvers explicitly -- gpu::LLT<double> vs LLT<MatrixXd>, gpu::SparseLLT<double> vs SimplicialLLT<SparseMatrix<double>> -- and both coexist in the same binary. This also lets users keep the factored matrix on device across multiple solves, something impossible with compile-time replacement.
Familiar syntax. GPU operations use the same expression patterns as CPU Eigen. Here is a side-by-side comparison:
// ---- CPU (Eigen) ---- // ---- GPU (unsupported/Eigen/GPU) ---- #include <Eigen/Dense> #define EIGEN_USE_GPU #include <unsupported/Eigen/GPU> // Dense MatrixXd A = ...; auto d_A = gpu::DeviceMatrix<double>::fromHost(A); MatrixXd B = ...; auto d_B = gpu::DeviceMatrix<double>::fromHost(B); MatrixXd C = A * B; gpu::DeviceMatrix<double> d_C = d_A * d_B; MatrixXd X = A.llt().solve(B); gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B); MatrixXd X = d_X.toHost(); // Sparse (using SpMat = SparseMatrix<double>) SimplicialLLT<SpMat> llt(A); gpu::SparseLLT<double> llt(A); VectorXd x = llt.solve(b); VectorXd x = llt.solve(b);
The GPU version reads like CPU Eigen with explicit upload/download for dense operations, and an almost identical API for sparse solvers. Unsupported expressions are compile errors.
Standalone module. unsupported/Eigen/GPU does not modify or depend on Eigen's Core expression template system (MatrixBase, CwiseBinaryOp, etc.). DeviceMatrix is not an Eigen expression type and does not inherit from MatrixBase. The expression layer is a thin compile-time dispatch where every supported expression maps to a single NVIDIA library call. There is no coefficient-level evaluation, lazy fusion, or packet operations.
Explicit over implicit. Host-device transfers, stream management, and library handle lifetimes are visible in the API. There are no hidden allocations or synchronizations except where documented (e.g., toHost() must synchronize to deliver data to the host).
DeviceMatrix<Scalar>A typed RAII wrapper for a dense column-major matrix in GPU device memory. This is the GPU counterpart of Eigen's MatrixX<Scalar>. A vector is simply a DeviceMatrix with one column.
// Upload from host auto d_A = gpu::DeviceMatrix<double>::fromHost(A); // Allocate uninitialized gpu::DeviceMatrix<double> d_C(m, n); // Download to host MatrixXd C = d_C.toHost(); // Async download (returns a future) auto transfer = d_C.toHostAsync(); // ... do other work ... MatrixXd C = transfer.get();
DeviceMatrix supports expression methods that mirror Eigen's API: adjoint(), transpose(), triangularView<UpLo>(), selfadjointView<UpLo>(), llt(), lu(). These return lightweight expression objects that are evaluated when assigned.
gpu::ContextEvery GPU operation needs a CUDA stream and library handles (cuBLAS, cuSOLVER). gpu::Context bundles these together.
For simple usage, you don't need to create one -- a per-thread default context is created lazily on first use:
// These use the thread-local default context automatically d_C = d_A * d_B; d_X = d_A.llt().solve(d_B);
For concurrent multi-stream execution, create explicit contexts:
gpu::Context ctx1, ctx2; d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1 d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
auto d_A = gpu::DeviceMatrix<double>::fromHost(A); auto d_B = gpu::DeviceMatrix<double>::fromHost(B); // GEMM: C = A * B, C = A^H * B, C = A * B^T, ... gpu::DeviceMatrix<double> d_C = d_A * d_B; d_C = d_A.adjoint() * d_B; d_C = d_A * d_B.transpose(); // Scaled and accumulated d_C += 2.0 * d_A * d_B; // alpha=2, beta=1 d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context) // Triangular solve (TRSM) d_X = d_A.triangularView<Lower>().solve(d_B); // Symmetric/Hermitian multiply (SYMM/HEMM) d_C = d_A.selfadjointView<Lower>() * d_B; // Rank-k update (SYRK/HERK) d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
One-shot expression syntax -- Convenient, re-factorizes each time:
// Cholesky solve (potrf + potrs) d_X = d_A.llt().solve(d_B); // LU solve (getrf + getrs) d_Y = d_A.lu().solve(d_B);
Cached factorization -- Factor once, solve many times:
gpu::LLT<double> llt; llt.compute(d_A); // factorize (async) if (llt.info() != Success) { ... } // lazy sync on first info() call auto d_X1 = llt.solve(d_B1); // reuses factor (async) auto d_X2 = llt.solve(d_B2); // reuses factor (async) MatrixXd X2 = d_X2.toHost(); // LU with transpose solve gpu::LU<double> lu; lu.compute(d_A); auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B // QR solve (overdetermined least squares) gpu::QR<double> qr; qr.compute(d_A); // factorize on device (async) auto d_X = qr.solve(d_B); // Q^H * B via ormqr, then trsm on R MatrixXd X = d_X.toHost(); // SVD (results downloaded on access) gpu::SVD<double> svd; svd.compute(d_A, ComputeThinU | ComputeThinV); VectorXd S = svd.singularValues(); // downloads to host MatrixXd U = svd.matrixU(); // downloads to host MatrixXd VT = svd.matrixVT(); // V^T (matches cuSOLVER) // SVD: device-side views (no D2H transfer; svd must outlive the views) auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values auto d_U = svd.d_matrixU(); // DeviceMatrix view of U auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T // Self-adjoint eigenvalue decomposition gpu::SelfAdjointEigenSolver<double> es; es.compute(d_A); VectorXd eigenvals = es.eigenvalues(); // downloads to host MatrixXd eigenvecs = es.eigenvectors(); // downloads to host auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors
The cached API keeps the factored matrix on device, avoiding redundant host-device transfers and re-factorizations. All solvers also accept host matrices directly as a convenience (e.g., gpu::LLT<double> llt(A) or qr.solve(B)), which handles upload/download internally. The d_* accessors on gpu::SVD and gpu::SelfAdjointEigenSolver return non-owning DeviceMatrix views so downstream cuBLAS/cuSOLVER work can chain without round-tripping through host memory.
Requires cuDSS (separate install, CUDA 12.0+). Define EIGEN_CUDSS before including unsupported/Eigen/GPU and link with -lcudss.
SparseMatrix<double> A = ...; // symmetric positive definite VectorXd b = ...; // Sparse Cholesky -- one-liner gpu::SparseLLT<double> llt(A); VectorXd x = llt.solve(b); // Three-phase workflow for repeated solves with the same sparsity pattern gpu::SparseLLT<double> llt; llt.analyzePattern(A); // symbolic analysis (once) llt.factorize(A); // numeric factorization VectorXd x = llt.solve(b); llt.factorize(A_new_values); // refactorize (reuses symbolic analysis) VectorXd x2 = llt.solve(b); // Sparse LDL^T (symmetric indefinite) gpu::SparseLDLT<double> ldlt(A); VectorXd x = ldlt.solve(b); // Sparse LU (general non-symmetric) gpu::SparseLU<double> lu(A); VectorXd x = lu.solve(b);
gpu::FFT<float> fft; // 1D complex-to-complex VectorXcf X = fft.fwd(x); // forward VectorXcf y = fft.inv(X); // inverse (scaled by 1/n) // 1D real-to-complex / complex-to-real VectorXcf R = fft.fwd(r); // returns n/2+1 complex (half-spectrum) VectorXf s = fft.invReal(R, n); // C2R inverse, caller specifies n // 2D complex-to-complex MatrixXcf B = fft.fwd2(A); // 2D forward MatrixXcf C = fft.inv2(B); // 2D inverse (scaled by 1/(rows*cols)) // Plans are cached and reused across calls with the same size/type.
SparseMatrix<double> A = ...; VectorXd x = ...; gpu::SparseContext<double> ctx; VectorXd y = ctx.multiply(A, x); // y = A * x VectorXd z = ctx.multiplyT(A, x); // z = A^T * x ctx.multiply(A, x, y, 2.0, 1.0); // y = 2*A*x + y ctx.multiply(A, x, y, 1.0, 0.0, // y = A^H * x (Hermitian SpMV) gpu::GpuOp::ConjTrans); // Multiple RHS (SpMM) MatrixXd Y = ctx.multiplyMat(A, X); // Y = A * X MatrixXd Z = ctx.multiplyMat(A, X, gpu::GpuOp::Trans); // Z = A^T * X
GEMM dispatch uses cublasXgemm (type-specific Sgemm/Dgemm/Cgemm/Zgemm). cuBLAS may internally use tensor cores depending on the GPU architecture, matrix dimensions, and CUDA math mode settings. No Eigen-specific macros control this; use the standard CUDA_MATH_MODE environment variable or cublasSetMathMode() to configure tensor core behavior if needed.
Operations are asynchronous by default. The compute-solve chain runs without host synchronization until you need a result on the host:
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost() H2D potrf potrs D2H sync
Mandatory sync points:
fromHost() -- Synchronizes to complete the upload before returningtoHost() / HostTransfer::get() -- Must deliver data to hostinfo() -- Must read the factorization statusCross-stream safety is automatic. DeviceMatrix tracks write completion via CUDA events. When a matrix written on stream A is read on stream B, the module automatically inserts cudaStreamWaitEvent. Same-stream operations skip the wait (CUDA guarantees in-order execution within a stream).
Lifetime of cached factorizations. A gpu::LLT / gpu::LU object owns its CUDA stream, library handle, and the cached factor on device. Destroying the factorization object while a solve() it issued is still in flight is correct but not actually async: cudaStreamDestroy returns immediately, but the destructor of the cached factor calls cudaFree, which is fully device-synchronous and stalls until the in-flight potrs/getrs retires. For genuine async pipelining keep the factorization object alive until you have drained its results (e.g. via toHost() or by binding consumption to an explicit gpu::Context that outlives both producer and consumer):
gpu::LLT<double> llt(d_A); // factor stays on device auto h_x = d_X = llt.solve(d_B).toHostAsync(stream); h_x.get(); // sync: factor + result complete // llt may now be destroyed without stalling the device
float, double, std::complex<float>, std::complex<double> (unless noted otherwise).
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
C = A * B | cublasXgemm | transA=N, transB=N, alpha=1, beta=0 |
C = A.adjoint() * B | cublasXgemm | transA=C, transB=N |
C = A.transpose() * B | cublasXgemm | transA=T, transB=N |
C = A * B.adjoint() | cublasXgemm | transA=N, transB=C |
C = A * B.transpose() | cublasXgemm | transA=N, transB=T |
C = alpha * A * B | cublasXgemm | alpha from LHS |
C = A * (alpha * B) | cublasXgemm | alpha from RHS |
C += A * B | cublasXgemm | alpha=1, beta=1 |
C.device(ctx) -= A * B | cublasXgemm | alpha=-1, beta=1 |
X = A.llt().solve(B) | cusolverDnXpotrf + Xpotrs | uplo, n, nrhs |
X = A.llt<Upper>().solve(B) | same | uplo=Upper |
X = A.lu().solve(B) | cusolverDnXgetrf + Xgetrs | n, nrhs |
X = A.triangularView<L>().solve(B) | cublasXtrsm | side=L, uplo, diag=NonUnit |
C = A.selfadjointView<L>() * B | cublasXsymm / cublasXhemm | side=L, uplo |
C.selfadjointView<L>().rankUpdate(A) | cublasXsyrk / cublasXherk | uplo, trans=N |
DeviceMatrix<Scalar>Typed RAII wrapper for a dense column-major matrix in GPU device memory. Always dense (leading dimension = rows). A vector is a DeviceMatrix with one column.
// Construction DeviceMatrix<Scalar>() // Empty (0x0) DeviceMatrix<Scalar>(rows, cols) // Allocate uninitialized // Upload / download / pointer adoption static DeviceMatrix fromHost(matrix, stream=nullptr) // -> DeviceMatrix (syncs) static DeviceMatrix fromHostAsync(ptr, rows, cols, outerStride, s) // -> DeviceMatrix (no sync, caller manages ptr lifetime) static DeviceMatrix adopt(Scalar* device_ptr, rows, cols) // Owning wrapper over a raw device pointer static DeviceMatrix view(Scalar* device_ptr, rows, cols) // Non-owning view (does not free on destruction) PlainMatrix toHost(stream=nullptr) // -> host Matrix (syncs) HostTransfer toHostAsync(stream=nullptr) // -> HostTransfer future (no sync) DeviceMatrix clone(stream=nullptr) // -> DeviceMatrix (D2D copy, async) // Dimensions and access Index rows() Index cols() size_t sizeInBytes() bool empty() Scalar* data() // Raw device pointer void resize(Index rows, Index cols) // Discard contents, reallocate // Expression builders (return lightweight views, evaluated on assignment) AdjointView adjoint() // GEMM with ConjTrans TransposeView transpose() // GEMM with Trans LltExpr llt() / llt<UpLo>() // -> .solve(d_B) -> DeviceMatrix LuExpr lu() // -> .solve(d_B) -> DeviceMatrix TriangularView triangularView<UpLo>() // -> .solve(d_B) -> DeviceMatrix (TRSM) SelfAdjointView selfadjointView<UpLo>() // -> * d_B (SYMM), .rankUpdate(d_A) (SYRK) Assignment device(gpu::Context& ctx) // Bind assignment to explicit stream
gpu::ContextUnified GPU execution context owning a CUDA stream and library handles.
gpu::Context() // Creates dedicated stream + handles static gpu::Context& threadLocal() // Per-thread default (lazy-created) cudaStream_t stream() cublasHandle_t cublasHandle() cusolverDnHandle_t cusolverHandle()
Non-copyable, non-movable (owns library handles).
gpu::LLT<Scalar, UpLo> -- Dense Cholesky (cuSOLVER)Caches the Cholesky factor on device for repeated solves.
gpu::LLT() // Default construct, then call compute() gpu::LLT(const EigenBase<D>& A) // Convenience: upload + factorize gpu::LLT& compute(const EigenBase<D>& A) // Upload + factorize gpu::LLT& compute(const DeviceMatrix& d_A) // D2D copy + factorize gpu::LLT& compute(DeviceMatrix&& d_A) // Adopt + factorize (no copy) PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async, stays on device) ComputationInfo info() // Lazy sync on first call: Success or NumericalIssue Index rows() / cols() cudaStream_t stream()
gpu::LU<Scalar> -- Dense LU (cuSOLVER)Same pattern as gpu::LLT. Adds a gpu::GpuOp parameter on solve().
PlainMatrix solve(const MatrixBase<D>& B, GpuOp op = GpuOp::NoTrans) // -> host Matrix DeviceMatrix solve(const DeviceMatrix& d_B, GpuOp op = GpuOp::NoTrans) // -> DeviceMatrix
gpu::GpuOp: NoTrans, Trans, ConjTrans.
gpu::QR<Scalar> -- Dense QR (cuSOLVER)QR factorization via cusolverDnXgeqrf. Solve uses ORMQR (apply Q^H) + TRSM (back-substitute on R) -- Q is never formed explicitly.
gpu::QR() // Default construct gpu::QR(const EigenBase<D>& A) // Convenience: upload + factorize gpu::QR& compute(const EigenBase<D>& A) // Upload + factorize gpu::QR& compute(const DeviceMatrix& d_A) // D2D copy + factorize PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async) PlainMatrix matrixR() // -> host Matrix (m >= n only) ComputationInfo info() // Lazy sync Index rows() / cols() cudaStream_t stream()
gpu::SVD<Scalar> -- Dense SVD (cuSOLVER)SVD via cusolverDnXgesvd. Supports ComputeThinU | ComputeThinV, ComputeFullU | ComputeFullV, or 0 (values only). Wide matrices (m < n) handled by internal transpose.
gpu::SVD() // Default construct, then call compute() gpu::SVD(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV) // Convenience gpu::SVD& compute(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV) gpu::SVD& compute(const DeviceMatrix& d_A, unsigned options = ComputeThinU | ComputeThinV) RealVector singularValues() // -> host vector (syncs, downloads) PlainMatrix matrixU() // -> host Matrix (syncs, downloads) PlainMatrix matrixVT() // -> host Matrix (syncs, downloads V^T) DeviceMatrix d_singularValues() // -> DeviceMatrix view (zero-copy) DeviceMatrix d_matrixU() // -> DeviceMatrix view (zero-copy when m >= n) DeviceMatrix d_matrixVT() // -> DeviceMatrix view (zero-copy when m >= n) PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (pseudoinverse) PlainMatrix solve(const MatrixBase<D>& B, Index k) // Truncated (top k triplets) PlainMatrix solve(const MatrixBase<D>& B, RealScalar l) // Tikhonov regularized Index rank(RealScalar threshold = -1) ComputationInfo info() // Lazy sync Index rows() / cols() cudaStream_t stream()
Note: singularValues(), matrixU(), and matrixVT() download to host on each call. The d_* accessors return non-owning DeviceMatrix views into the solver's internal buffers; the gpu::SVD object must outlive any view derived from it. For wide matrices (m < n) the U/V^T views are owning (one cublasXgeam adjoint pass).
gpu::SelfAdjointEigenSolver<Scalar> -- Eigendecomposition (cuSOLVER)Symmetric/Hermitian eigenvalue decomposition via cusolverDnXsyevd. ComputeMode enum: EigenvaluesOnly, ComputeEigenvectors.
gpu::SelfAdjointEigenSolver() // Default construct, then call compute() gpu::SelfAdjointEigenSolver(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors) // Convenience gpu::SelfAdjointEigenSolver& compute(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors) gpu::SelfAdjointEigenSolver& compute(const DeviceMatrix& d_A, ComputeMode mode = ComputeEigenvectors) RealVector eigenvalues() // -> host vector (syncs, downloads, ascending order) PlainMatrix eigenvectors() // -> host Matrix (syncs, downloads, columns) DeviceMatrix d_eigenvalues() // -> DeviceMatrix view (zero-copy) DeviceMatrix d_eigenvectors() // -> DeviceMatrix view (zero-copy, requires ComputeEigenvectors) ComputationInfo info() // Lazy sync Index rows() / cols() cudaStream_t stream()
Note: eigenvalues() and eigenvectors() download to host on each call. The d_* accessors return non-owning DeviceMatrix views into the solver's internal buffers; the gpu::SelfAdjointEigenSolver object must outlive any view derived from it.
HostTransfer<Scalar>Future for async device-to-host transfer. Returned by DeviceMatrix::toHostAsync().
PlainMatrix& get() // Block until complete, return host Matrix ref. Idempotent. bool ready() // Non-blocking poll
gpu::SparseLLT<Scalar, UpLo> -- Sparse Cholesky (cuDSS)Requires cuDSS (CUDA 12.0+, #define EIGEN_CUDSS). Three-phase workflow with symbolic reuse. Accepts SparseMatrix<Scalar, ColMajor, int> (CSC). Matrix dimensions and nonzero count must fit in int (cuDSS limitation; debug builds assert).
gpu::SparseLLT() // Default construct gpu::SparseLLT(const SparseMatrixBase<D>& A) // Analyze + factorize gpu::SparseLLT& analyzePattern(const SparseMatrixBase<D>& A) // Symbolic analysis (reusable) gpu::SparseLLT& factorize(const SparseMatrixBase<D>& A) // Numeric factorization gpu::SparseLLT& compute(const SparseMatrixBase<D>& A) // analyzePattern + factorize void setOrdering(GpuSparseOrdering ord) // AMD (default), METIS, or RCM DenseMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) ComputationInfo info() // Lazy sync Index rows() / cols() cudaStream_t stream()
gpu::SparseLDLT<Scalar, UpLo> -- Sparse LDL^T (cuDSS)Symmetric indefinite. Same API as gpu::SparseLLT.
gpu::SparseLU<Scalar> -- Sparse LU (cuDSS)General non-symmetric. Same API as gpu::SparseLLT (without UpLo).
gpu::FFT<Scalar> -- FFT (cuFFT)Plans cached by (size, type) and reused. Inverse transforms scaled so inv(fwd(x)) == x. Supported scalars: float, double.
// 1D transforms (host vectors in and out) ComplexVector fwd(const MatrixBase<D>& x) // C2C forward (complex input) ComplexVector fwd(const MatrixBase<D>& x) // R2C forward (real input, returns n/2+1) ComplexVector inv(const MatrixBase<D>& X) // C2C inverse, scaled by 1/n RealVector invReal(const MatrixBase<D>& X, Index n) // C2R inverse, scaled by 1/n // 2D transforms (host matrices in and out) ComplexMatrix fwd2(const MatrixBase<D>& A) // 2D C2C forward ComplexMatrix inv2(const MatrixBase<D>& A) // 2D C2C inverse, scaled by 1/(rows*cols) cudaStream_t stream()
All FFT methods accept host data and return host data. Upload/download is handled internally. The C2C and R2C overloads of fwd() are distinguished by the input scalar type (complex vs real).
gpu::SparseContext<Scalar> -- SpMV/SpMM (cuSPARSE)Accepts SparseMatrix<Scalar, ColMajor>. All methods accept host data and return host data. Matrix dimensions and nonzero count must fit in int (cuSPARSE limitation; debug builds assert).
gpu::SparseContext() // Creates own stream + cuSPARSE handle DenseVector multiply(A, x) // y = A * x void multiply(A, x, y, alpha=1, beta=0, // y = alpha*op(A)*x + beta*y op=GpuOp::NoTrans) DenseVector multiplyT(A, x) // y = A^T * x DenseVector multiplyAdjoint(A, x) // y = A^H * x DenseMatrix multiplyMat(A, X, op=GpuOp::NoTrans) // Y = op(A) * X (SpMM) cudaStream_t stream()
Unlike Eigen‘s Matrix, where omitting .noalias() triggers a copy to a temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have no built-in aliasing protection. All operations are implicitly noalias. The caller must ensure operands don’t alias the destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before dispatching to cuBLAS.
compute(MatrixXd), solve(MatrixXd)) and device-input (compute(DeviceMatrix), solve(DeviceMatrix)) overloads, plus host- and device-side accessors (matrixU() vs d_matrixU()). This eases migration from CPU Eigen but may invite accidental host ↔ device round-trips when users mix the two without realising the cost. Revisit after the initial GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' hands; if the convenience overloads cause more confusion than they save, narrow toward a single explicit fromHost / toHost boundary.| File | Depends on | Contents |
|---|---|---|
GpuSupport.h | <cuda_runtime.h> | Error macro, DeviceBuffer, cuda_data_type<> |
DeviceMatrix.h | GpuSupport.h | DeviceMatrix<>, HostTransfer<> |
DeviceExpr.h | DeviceMatrix.h | GEMM expression wrappers |
DeviceBlasExpr.h | DeviceMatrix.h | TRSM, SYMM, SYRK expression wrappers |
DeviceSolverExpr.h | DeviceMatrix.h | Solver expression wrappers (LLT, LU) |
DeviceDispatch.h | all above | All dispatch functions + Assignment |
GpuContext.h | CuBlasSupport.h, CuSolverSupport.h | gpu::Context |
CuBlasSupport.h | GpuSupport.h, <cublas_v2.h> | cuBLAS error macro, op/compute type maps |
CuSolverSupport.h | GpuSupport.h, <cusolverDn.h> | cuSOLVER params, fill-mode mapping |
GpuSolverContext.h | CuSolverSupport.h, CuBlasSupport.h | Shared solver context (stream, handles, scratch) |
GpuLLT.h | GpuSolverContext.h | Cached dense Cholesky factorization |
GpuLU.h | GpuSolverContext.h | Cached dense LU factorization |
GpuQR.h | GpuSolverContext.h | Dense QR decomposition |
GpuSVD.h | GpuSolverContext.h | Dense SVD decomposition |
GpuEigenSolver.h | GpuSolverContext.h | Self-adjoint eigenvalue decomposition |
CuFftSupport.h | GpuSupport.h, <cufft.h> | cuFFT error macro, type-dispatch wrappers |
GpuFFT.h | CuFftSupport.h, CuBlasSupport.h | 1D/2D FFT with plan caching |
CuSparseSupport.h | GpuSupport.h, <cusparse.h> | cuSPARSE error macro |
GpuSparseContext.h | CuSparseSupport.h | SpMV/SpMM via cuSPARSE |
CuDssSupport.h | GpuSupport.h, <cudss.h> | cuDSS error macro, type traits (optional) |
GpuSparseSolverBase.h | CuDssSupport.h | CRTP base for sparse solvers (optional) |
GpuSparseLLT.h | GpuSparseSolverBase.h | Sparse Cholesky via cuDSS (optional) |
GpuSparseLDLT.h | GpuSparseSolverBase.h | Sparse LDL^T via cuDSS (optional) |
GpuSparseLU.h | GpuSparseSolverBase.h | Sparse LU via cuDSS (optional) |
cmake -G Ninja -B build -S . \ -DEIGEN_TEST_CUDA=ON \ -DEIGEN_CUDA_COMPUTE_ARCH="70" \ -DEIGEN_TEST_CUBLAS=ON \ -DEIGEN_TEST_CUSOLVER=ON cmake --build build --target cublas cusolver_llt cusolver_lu \ cusolver_qr cusolver_svd cusolver_eigen \ device_matrix cufft cusparse_spmv ctest --test-dir build -L gpu --output-on-failure # Sparse solvers (cuDSS -- separate install required) cmake -G Ninja -B build -S . \ -DEIGEN_TEST_CUDA=ON \ -DEIGEN_CUDA_COMPUTE_ARCH="70" \ -DEIGEN_TEST_CUDSS=ON cmake --build build --target cudss_llt cudss_ldlt cudss_lu ctest --test-dir build -R '^cudss_' --output-on-failure
gpu::SparseContext currently operates on host vectors and matrices, uploading and downloading on each call. The key missing piece is a DeviceSparseView that holds a sparse matrix on device and supports operator syntax (d_y = d_A * d_x) with DeviceMatrix operands -- keeping the entire SpMV/SpMM pipeline on device. This is essential for iterative solvers and any workflow that chains sparse and dense operations without returning to the host.