| # Eigen GPU Module (`unsupported/Eigen/GPU`) |
| |
| GPU-accelerated linear algebra for Eigen users, dispatching to NVIDIA CUDA |
| Math Libraries (cuBLAS, cuSOLVER, cuFFT, cuSPARSE, cuDSS). Requires CUDA |
| 11.4+; cuDSS features require CUDA 12.0+ and a separate cuDSS install. |
| Header-only. |
| |
| ## Why this module |
| |
| Eigen is the linear algebra foundation for a large ecosystem of C++ projects |
| in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, |
| Open3D), scientific computing (Ceres, Stan), and beyond. Many of these |
| projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations |
| without dropping down to raw CUDA library APIs. |
| |
| GPU sparse solvers are a particularly acute gap. Sparse factorization is the |
| bottleneck in SLAM, bundle adjustment, FEM, and nonlinear optimization -- |
| exactly the workloads where GPU acceleration matters most. Downstream projects |
| like [Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and |
| [COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for |
| GPU-accelerated sparse solvers, and third-party projects like |
| [cholespy](https://github.com/rgl-epfl/cholespy) exist specifically because |
| Eigen lacks them. The `unsupported/Eigen/GPU` module provides GPU sparse Cholesky, LDL^T, |
| and LU factorization via cuDSS, alongside dense solvers (cuSOLVER), matrix |
| products (cuBLAS), FFT (cuFFT), and sparse matrix-vector products (cuSPARSE). |
| |
| Existing Eigen users should be able to move performance-critical dense or |
| sparse linear algebra to the GPU with minimal code changes and without |
| learning CUDA library APIs directly. |
| |
| ## Design philosophy |
| |
| **CPU and GPU coexist.** There is no global compile-time switch that replaces |
| CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers |
| explicitly -- `gpu::LLT<double>` vs `LLT<MatrixXd>`, `gpu::SparseLLT<double>` vs |
| `SimplicialLLT<SparseMatrix<double>>` -- and both coexist in the same binary. |
| This also lets users keep the factored matrix on device across multiple solves, |
| something impossible with compile-time replacement. |
| |
| **Familiar syntax.** GPU operations use the same expression patterns as CPU |
| Eigen. Here is a side-by-side comparison: |
| |
| ```cpp |
// ---- CPU (Eigen) ----             // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense>               #define EIGEN_USE_GPU
                                     #include <unsupported/Eigen/GPU>

// Dense
MatrixXd A = ...;                    auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...;                    auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

MatrixXd C = A * B;                  gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B);       gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);

                                     MatrixXd X = d_X.toHost();

// Sparse (using SpMat = SparseMatrix<double>)
SimplicialLLT<SpMat> llt(A);         gpu::SparseLLT<double> llt(A);
VectorXd x = llt.solve(b);           VectorXd x = llt.solve(b);
| ``` |
| |
| The GPU version reads like CPU Eigen with explicit upload/download for dense |
| operations, and an almost identical API for sparse solvers. Unsupported |
| expressions are compile errors. |
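
For instance, coefficient-wise expressions have no single-library-call
mapping, so they fail to compile rather than silently falling back to the CPU
(a sketch of the failure mode):

```cpp
gpu::DeviceMatrix<double> d_A = ..., d_B = ...;
// d_C = d_A.cwiseProduct(d_B);  // compile error: DeviceMatrix has no
//                               // coefficient-wise expression support
```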
| |
| **Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core |
| expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.). |
| `DeviceMatrix` is not an Eigen expression type and does not inherit from |
| `MatrixBase`. The expression layer is a thin compile-time dispatch where every |
| supported expression maps to a single NVIDIA library call. There is no |
| coefficient-level evaluation, lazy fusion, or packet operations. |
| |
| **Explicit over implicit.** Host-device transfers, stream management, and |
| library handle lifetimes are visible in the API. There are no hidden |
| allocations or synchronizations except where documented (e.g., `toHost()` must |
| synchronize to deliver data to the host). |
| |
| ## Key concepts |
| |
| ### `DeviceMatrix<Scalar>` |
| |
| A typed RAII wrapper for a dense column-major matrix in GPU device memory. |
| This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply |
| a `DeviceMatrix` with one column. |
| |
| ```cpp |
| // Upload from host |
| auto d_A = gpu::DeviceMatrix<double>::fromHost(A); |
| |
| // Allocate uninitialized |
| gpu::DeviceMatrix<double> d_C(m, n); |
| |
| // Download to host |
| MatrixXd C = d_C.toHost(); |
| |
| // Async download (returns a future) |
| auto transfer = d_C.toHostAsync(); |
| // ... do other work ... |
| MatrixXd C = transfer.get(); |
| ``` |
| |
| `DeviceMatrix` supports expression methods that mirror Eigen's API: |
| `adjoint()`, `transpose()`, `triangularView<UpLo>()`, |
| `selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight |
| expression objects that are evaluated when assigned. |
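
As a sketch of the deferred-evaluation contract (using only the methods listed
above), building an expression issues no GPU work until it is assigned:

```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto expr = d_A.adjoint();                    // lightweight view, nothing launched
gpu::DeviceMatrix<double> d_G = expr * d_A;   // one cublasXgemm (transA=C) here
```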
| |
| ### `gpu::Context` |
| |
| Every GPU operation needs a CUDA stream and library handles (cuBLAS, |
| cuSOLVER). `gpu::Context` bundles these together. |
| |
| For simple usage, you don't need to create one -- a per-thread default context |
| is created lazily on first use: |
| |
| ```cpp |
| // These use the thread-local default context automatically |
| d_C = d_A * d_B; |
| d_X = d_A.llt().solve(d_B); |
| ``` |
| |
| For concurrent multi-stream execution, create explicit contexts: |
| |
| ```cpp |
| gpu::Context ctx1, ctx2; |
| d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1 |
| d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently) |
| ``` |
| |
| ## Usage |
| |
| ### Matrix operations (cuBLAS) |
| |
| ```cpp |
| auto d_A = gpu::DeviceMatrix<double>::fromHost(A); |
| auto d_B = gpu::DeviceMatrix<double>::fromHost(B); |
| |
| // GEMM: C = A * B, C = A^H * B, C = A * B^T, ... |
| gpu::DeviceMatrix<double> d_C = d_A * d_B; |
| d_C = d_A.adjoint() * d_B; |
| d_C = d_A * d_B.transpose(); |
| |
| // Scaled and accumulated |
| d_C += 2.0 * d_A * d_B; // alpha=2, beta=1 |
| d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context) |
| |
| // Triangular solve (TRSM) |
| d_X = d_A.triangularView<Lower>().solve(d_B); |
| |
| // Symmetric/Hermitian multiply (SYMM/HEMM) |
| d_C = d_A.selfadjointView<Lower>() * d_B; |
| |
| // Rank-k update (SYRK/HERK) |
| d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H |
| ``` |
| |
| ### Dense solvers (cuSOLVER) |
| |
| **One-shot expression syntax** -- Convenient, re-factorizes each time: |
| |
| ```cpp |
| // Cholesky solve (potrf + potrs) |
| d_X = d_A.llt().solve(d_B); |
| |
| // LU solve (getrf + getrs) |
| d_Y = d_A.lu().solve(d_B); |
| ``` |
| |
| **Cached factorization** -- Factor once, solve many times: |
| |
| ```cpp |
| gpu::LLT<double> llt; |
| llt.compute(d_A); // factorize (async) |
| if (llt.info() != Success) { ... } // lazy sync on first info() call |
| auto d_X1 = llt.solve(d_B1); // reuses factor (async) |
| auto d_X2 = llt.solve(d_B2); // reuses factor (async) |
| MatrixXd X2 = d_X2.toHost(); |
| |
| // LU with transpose solve |
| gpu::LU<double> lu; |
| lu.compute(d_A); |
| auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B |
| |
| // QR solve (overdetermined least squares) |
| gpu::QR<double> qr; |
| qr.compute(d_A); // factorize on device (async) |
| auto d_X = qr.solve(d_B); // Q^H * B via ormqr, then trsm on R |
| MatrixXd X = d_X.toHost(); |
| |
| // SVD (results downloaded on access) |
| gpu::SVD<double> svd; |
| svd.compute(d_A, ComputeThinU | ComputeThinV); |
| VectorXd S = svd.singularValues(); // downloads to host |
| MatrixXd U = svd.matrixU(); // downloads to host |
| MatrixXd VT = svd.matrixVT(); // V^T (matches cuSOLVER) |
| |
| // SVD: device-side views (no D2H transfer; svd must outlive the views) |
| auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values |
| auto d_U = svd.d_matrixU(); // DeviceMatrix view of U |
| auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T |
| |
| // Self-adjoint eigenvalue decomposition |
| gpu::SelfAdjointEigenSolver<double> es; |
| es.compute(d_A); |
| VectorXd eigenvals = es.eigenvalues(); // downloads to host |
| MatrixXd eigenvecs = es.eigenvectors(); // downloads to host |
| auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues |
| auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors |
| ``` |
| |
| The cached API keeps the factored matrix on device, avoiding redundant |
| host-device transfers and re-factorizations. All solvers also accept host |
| matrices directly as a convenience (e.g., `gpu::LLT<double> llt(A)` or |
| `qr.solve(B)`), which handles upload/download internally. The `d_*` accessors |
| on `gpu::SVD` and `gpu::SelfAdjointEigenSolver` return non-owning |
| `DeviceMatrix` views so downstream cuBLAS/cuSOLVER work can chain without |
| round-tripping through host memory. |
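
A minimal sketch of the host-convenience path (per the reference signatures
below; upload and download happen inside the calls):

```cpp
MatrixXd A = ...;            // host, symmetric positive definite
MatrixXd B = ...;            // host right-hand sides
gpu::LLT<double> llt(A);     // uploads A and factorizes on device
MatrixXd X = llt.solve(B);   // uploads B, solves, downloads X (syncs)
```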
| |
| ### Sparse direct solvers (cuDSS) |
| |
| Requires cuDSS (separate install, CUDA 12.0+). Define `EIGEN_CUDSS` before |
| including `unsupported/Eigen/GPU` and link with `-lcudss`. |
| |
| ```cpp |
| SparseMatrix<double> A = ...; // symmetric positive definite |
| VectorXd b = ...; |
| |
| // Sparse Cholesky -- one-liner |
| gpu::SparseLLT<double> llt(A); |
| VectorXd x = llt.solve(b); |
| |
| // Three-phase workflow for repeated solves with the same sparsity pattern |
| gpu::SparseLLT<double> llt; |
| llt.analyzePattern(A); // symbolic analysis (once) |
| llt.factorize(A); // numeric factorization |
| VectorXd x = llt.solve(b); |
| llt.factorize(A_new_values); // refactorize (reuses symbolic analysis) |
| VectorXd x2 = llt.solve(b); |
| |
| // Sparse LDL^T (symmetric indefinite) |
| gpu::SparseLDLT<double> ldlt(A); |
| VectorXd x = ldlt.solve(b); |
| |
| // Sparse LU (general non-symmetric) |
| gpu::SparseLU<double> lu(A); |
| VectorXd x = lu.solve(b); |
| ``` |
| |
| ### FFT (cuFFT) |
| |
| ```cpp |
| gpu::FFT<float> fft; |
| |
| // 1D complex-to-complex |
| VectorXcf X = fft.fwd(x); // forward |
| VectorXcf y = fft.inv(X); // inverse (scaled by 1/n) |
| |
| // 1D real-to-complex / complex-to-real |
| VectorXcf R = fft.fwd(r); // returns n/2+1 complex (half-spectrum) |
| VectorXf s = fft.invReal(R, n); // C2R inverse, caller specifies n |
| |
| // 2D complex-to-complex |
| MatrixXcf B = fft.fwd2(A); // 2D forward |
| MatrixXcf C = fft.inv2(B); // 2D inverse (scaled by 1/(rows*cols)) |
| |
| // Plans are cached and reused across calls with the same size/type. |
| ``` |
| |
| ### Sparse matrix-vector multiply (cuSPARSE) |
| |
| ```cpp |
| SparseMatrix<double> A = ...; |
| VectorXd x = ...; |
| |
| gpu::SparseContext<double> ctx; |
| VectorXd y = ctx.multiply(A, x); // y = A * x |
| VectorXd z = ctx.multiplyT(A, x); // z = A^T * x |
| ctx.multiply(A, x, y, 2.0, 1.0); // y = 2*A*x + y |
ctx.multiply(A, x, y, 1.0, 0.0,  // y = A^H * x (conjugate-transpose SpMV)
             gpu::GpuOp::ConjTrans);
| |
| // Multiple RHS (SpMM) |
| MatrixXd Y = ctx.multiplyMat(A, X); // Y = A * X |
| MatrixXd Z = ctx.multiplyMat(A, X, gpu::GpuOp::Trans); // Z = A^T * X |
| ``` |
| |
| ### Precision control |
| |
GEMM dispatch uses the type-specific `cublasXgemm` entry points
(Sgemm/Dgemm/Cgemm/Zgemm). cuBLAS may internally use tensor cores depending on
the GPU architecture, matrix dimensions, and math-mode settings. No
Eigen-specific macros control this; use `cublasSetMathMode()` on the context's
cuBLAS handle (or the `NVIDIA_TF32_OVERRIDE` environment variable) to
configure tensor-core behavior if needed.
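
For example, opting a context's cuBLAS handle into TF32 tensor-core math for
`float` GEMMs is plain cuBLAS (a sketch; `cublasSetMathMode` and
`CUBLAS_TF32_TENSOR_OP_MATH` are standard cuBLAS 11+ symbols, the handle
accessor is this module's):

```cpp
gpu::Context ctx;
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);
d_C.device(ctx) = d_A * d_B;  // float GEMMs on ctx may now use TF32 tensor cores
```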
| |
| ### Stream control and async execution |
| |
| Operations are asynchronous by default. The compute-solve chain runs without |
| host synchronization until you need a result on the host: |
| |
| ```text |
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost()
    H2D                 potrf                potrs               D2H
                                                                 sync
| ``` |
| |
| Mandatory sync points: |
| - `fromHost()` -- Synchronizes to complete the upload before returning |
| - `toHost()` / `HostTransfer::get()` -- Must deliver data to host |
| - `info()` -- Must read the factorization status |
| |
| **Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion |
| via CUDA events. When a matrix written on stream A is read on stream B, the |
| module automatically inserts `cudaStreamWaitEvent`. Same-stream operations |
| skip the wait (CUDA guarantees in-order execution within a stream). |
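
A sketch of that behavior, producing on one context and consuming on another:

```cpp
gpu::Context ctx1, ctx2;
d_C.device(ctx1) = d_A * d_B;  // d_C written on ctx1's stream
d_D.device(ctx2) = d_C * d_B;  // d_C read on ctx2's stream: the module inserts
                               // cudaStreamWaitEvent; no host synchronization
```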
| |
| **Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns |
| its CUDA stream, library handle, and the cached factor on device. Destroying |
| the factorization object while a `solve()` it issued is still in flight is |
| *correct* but not actually async: `cudaStreamDestroy` returns immediately, |
| but the destructor of the cached factor calls `cudaFree`, which is fully |
| device-synchronous and stalls until the in-flight `potrs`/`getrs` retires. |
| For genuine async pipelining keep the factorization object alive until you |
| have drained its results (e.g. via `toHost()` or by binding consumption to |
| an explicit `gpu::Context` that outlives both producer and consumer): |
| |
| ```cpp |
gpu::LLT<double> llt;
llt.compute(d_A);                          // factor stays on device
gpu::DeviceMatrix<double> d_X = llt.solve(d_B);
auto h_x = d_X.toHostAsync();
MatrixXd x = h_x.get();                    // sync: factor + result complete
// llt may now be destroyed without stalling the device
| ``` |
| |
| ## Reference |
| |
| ### Supported scalar types |
| |
| `float`, `double`, `std::complex<float>`, `std::complex<double>` (unless |
| noted otherwise). |
| |
| ### Expression -> library call mapping |
| |
| | DeviceMatrix expression | Library call | Parameters | |
| |---|---|---| |
| | `C = A * B` | `cublasXgemm` | transA=N, transB=N, alpha=1, beta=0 | |
| | `C = A.adjoint() * B` | `cublasXgemm` | transA=C, transB=N | |
| | `C = A.transpose() * B` | `cublasXgemm` | transA=T, transB=N | |
| | `C = A * B.adjoint()` | `cublasXgemm` | transA=N, transB=C | |
| | `C = A * B.transpose()` | `cublasXgemm` | transA=N, transB=T | |
| | `C = alpha * A * B` | `cublasXgemm` | alpha from LHS | |
| | `C = A * (alpha * B)` | `cublasXgemm` | alpha from RHS | |
| | `C += A * B` | `cublasXgemm` | alpha=1, beta=1 | |
| | `C.device(ctx) -= A * B` | `cublasXgemm` | alpha=-1, beta=1 | |
| | `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs | |
| | `X = A.llt<Upper>().solve(B)` | same | uplo=Upper | |
| | `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs | |
| | `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit | |
| | `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo | |
| | `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N | |
| |
| ### `DeviceMatrix<Scalar>` |
| |
| Typed RAII wrapper for a dense column-major matrix in GPU device memory. |
| Always dense (leading dimension = rows). A vector is a `DeviceMatrix` with |
| one column. |
| |
| ```cpp |
| // Construction |
| DeviceMatrix<Scalar>() // Empty (0x0) |
| DeviceMatrix<Scalar>(rows, cols) // Allocate uninitialized |
| |
| // Upload / download / pointer adoption |
| static DeviceMatrix fromHost(matrix, stream=nullptr) // -> DeviceMatrix (syncs) |
| static DeviceMatrix fromHostAsync(ptr, rows, cols, outerStride, s) // -> DeviceMatrix (no sync, caller manages ptr lifetime) |
| static DeviceMatrix adopt(Scalar* device_ptr, rows, cols) // Owning wrapper over a raw device pointer |
| static DeviceMatrix view(Scalar* device_ptr, rows, cols) // Non-owning view (does not free on destruction) |
| PlainMatrix toHost(stream=nullptr) // -> host Matrix (syncs) |
| HostTransfer toHostAsync(stream=nullptr) // -> HostTransfer future (no sync) |
| DeviceMatrix clone(stream=nullptr) // -> DeviceMatrix (D2D copy, async) |
| |
| // Dimensions and access |
| Index rows() |
| Index cols() |
| size_t sizeInBytes() |
| bool empty() |
| Scalar* data() // Raw device pointer |
| void resize(Index rows, Index cols) // Discard contents, reallocate |
| |
| // Expression builders (return lightweight views, evaluated on assignment) |
| AdjointView adjoint() // GEMM with ConjTrans |
| TransposeView transpose() // GEMM with Trans |
| LltExpr llt() / llt<UpLo>() // -> .solve(d_B) -> DeviceMatrix |
| LuExpr lu() // -> .solve(d_B) -> DeviceMatrix |
| TriangularView triangularView<UpLo>() // -> .solve(d_B) -> DeviceMatrix (TRSM) |
| SelfAdjointView selfadjointView<UpLo>() // -> * d_B (SYMM), .rankUpdate(d_A) (SYRK) |
| Assignment device(gpu::Context& ctx) // Bind assignment to explicit stream |
| ``` |
| |
| ### `gpu::Context` |
| |
| Unified GPU execution context owning a CUDA stream and library handles. |
| |
| ```cpp |
| gpu::Context() // Creates dedicated stream + handles |
| static gpu::Context& threadLocal() // Per-thread default (lazy-created) |
| |
| cudaStream_t stream() |
| cublasHandle_t cublasHandle() |
| cusolverDnHandle_t cusolverHandle() |
| ``` |
| |
| Non-copyable, non-movable (owns library handles). |
| |
| ### `gpu::LLT<Scalar, UpLo>` -- Dense Cholesky (cuSOLVER) |
| |
| Caches the Cholesky factor on device for repeated solves. |
| |
| ```cpp |
| gpu::LLT() // Default construct, then call compute() |
| gpu::LLT(const EigenBase<D>& A) // Convenience: upload + factorize |
| |
| gpu::LLT& compute(const EigenBase<D>& A) // Upload + factorize |
| gpu::LLT& compute(const DeviceMatrix& d_A) // D2D copy + factorize |
| gpu::LLT& compute(DeviceMatrix&& d_A) // Adopt + factorize (no copy) |
| |
| PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) |
| DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async, stays on device) |
| |
| ComputationInfo info() // Lazy sync on first call: Success or NumericalIssue |
| Index rows() / cols() |
| cudaStream_t stream() |
| ``` |
| |
| ### `gpu::LU<Scalar>` -- Dense LU (cuSOLVER) |
| |
| Same pattern as `gpu::LLT`. Adds a `gpu::GpuOp` parameter on `solve()`. |
| |
| ```cpp |
| PlainMatrix solve(const MatrixBase<D>& B, GpuOp op = GpuOp::NoTrans) // -> host Matrix |
| DeviceMatrix solve(const DeviceMatrix& d_B, GpuOp op = GpuOp::NoTrans) // -> DeviceMatrix |
| ``` |
| |
| `gpu::GpuOp`: `NoTrans`, `Trans`, `ConjTrans`. |
| |
| ### `gpu::QR<Scalar>` -- Dense QR (cuSOLVER) |
| |
| QR factorization via `cusolverDnXgeqrf`. Solve uses ORMQR (apply Q^H) + TRSM |
| (back-substitute on R) -- Q is never formed explicitly. |
| |
| ```cpp |
| gpu::QR() // Default construct |
| gpu::QR(const EigenBase<D>& A) // Convenience: upload + factorize |
| |
| gpu::QR& compute(const EigenBase<D>& A) // Upload + factorize |
| gpu::QR& compute(const DeviceMatrix& d_A) // D2D copy + factorize |
| |
| PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) |
| DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async) |
| PlainMatrix matrixR() // -> host Matrix (m >= n only) |
| |
| ComputationInfo info() // Lazy sync |
| Index rows() / cols() |
| cudaStream_t stream() |
| ``` |
| |
| ### `gpu::SVD<Scalar>` -- Dense SVD (cuSOLVER) |
| |
| SVD via `cusolverDnXgesvd`. Supports `ComputeThinU | ComputeThinV`, |
| `ComputeFullU | ComputeFullV`, or `0` (values only). Wide matrices (m < n) |
| handled by internal transpose. |
| |
| ```cpp |
| gpu::SVD() // Default construct, then call compute() |
| gpu::SVD(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV) // Convenience |
| |
| gpu::SVD& compute(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV) |
| gpu::SVD& compute(const DeviceMatrix& d_A, unsigned options = ComputeThinU | ComputeThinV) |
| |
| RealVector singularValues() // -> host vector (syncs, downloads) |
| PlainMatrix matrixU() // -> host Matrix (syncs, downloads) |
| PlainMatrix matrixVT() // -> host Matrix (syncs, downloads V^T) |
| |
| DeviceMatrix d_singularValues() // -> DeviceMatrix view (zero-copy) |
| DeviceMatrix d_matrixU() // -> DeviceMatrix view (zero-copy when m >= n) |
| DeviceMatrix d_matrixVT() // -> DeviceMatrix view (zero-copy when m >= n) |
| |
| PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (pseudoinverse) |
| PlainMatrix solve(const MatrixBase<D>& B, Index k) // Truncated (top k triplets) |
| PlainMatrix solve(const MatrixBase<D>& B, RealScalar l) // Tikhonov regularized |
| |
| Index rank(RealScalar threshold = -1) |
| ComputationInfo info() // Lazy sync |
| Index rows() / cols() |
| cudaStream_t stream() |
| ``` |
| |
| **Note:** `singularValues()`, `matrixU()`, and `matrixVT()` download to host |
| on each call. The `d_*` accessors return non-owning `DeviceMatrix` views into |
| the solver's internal buffers; the `gpu::SVD` object must outlive any view |
| derived from it. For wide matrices (m < n) the U/V^T views are owning (one |
| `cublasXgeam` adjoint pass). |
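
A sketch of chaining a device-side view into cuBLAS without a host round-trip
(`svd` must stay alive until the dependent work has completed):

```cpp
gpu::SVD<double> svd;
svd.compute(d_A, ComputeThinU | ComputeThinV);
auto d_U = svd.d_matrixU();                           // non-owning view (m >= n)
gpu::DeviceMatrix<double> d_P = d_U.adjoint() * d_B;  // U^H * B, all on device
```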
| |
| ### `gpu::SelfAdjointEigenSolver<Scalar>` -- Eigendecomposition (cuSOLVER) |
| |
| Symmetric/Hermitian eigenvalue decomposition via `cusolverDnXsyevd`. |
| `ComputeMode` enum: `EigenvaluesOnly`, `ComputeEigenvectors`. |
| |
| ```cpp |
| gpu::SelfAdjointEigenSolver() // Default construct, then call compute() |
| gpu::SelfAdjointEigenSolver(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors) // Convenience |
| |
| gpu::SelfAdjointEigenSolver& compute(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors) |
| gpu::SelfAdjointEigenSolver& compute(const DeviceMatrix& d_A, ComputeMode mode = ComputeEigenvectors) |
| |
| RealVector eigenvalues() // -> host vector (syncs, downloads, ascending order) |
| PlainMatrix eigenvectors() // -> host Matrix (syncs, downloads, columns) |
| |
| DeviceMatrix d_eigenvalues() // -> DeviceMatrix view (zero-copy) |
| DeviceMatrix d_eigenvectors() // -> DeviceMatrix view (zero-copy, requires ComputeEigenvectors) |
| |
| ComputationInfo info() // Lazy sync |
| Index rows() / cols() |
| cudaStream_t stream() |
| ``` |
| |
| **Note:** `eigenvalues()` and `eigenvectors()` download to host on each call. |
| The `d_*` accessors return non-owning `DeviceMatrix` views into the solver's |
| internal buffers; the `gpu::SelfAdjointEigenSolver` object must outlive any |
| view derived from it. |
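
A sketch of a definiteness check, relying on the ascending eigenvalue order:

```cpp
gpu::SelfAdjointEigenSolver<double> es;
es.compute(d_A);
VectorXd w = es.eigenvalues();  // syncs and downloads, ascending order
bool positive_definite = (es.info() == Success) && (w(0) > 0.0);
```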
| |
| ### `HostTransfer<Scalar>` |
| |
| Future for async device-to-host transfer. Returned by |
| `DeviceMatrix::toHostAsync()`. |
| |
| ```cpp |
| PlainMatrix& get() // Block until complete, return host Matrix ref. Idempotent. |
| bool ready() // Non-blocking poll |
| ``` |
| |
| ### `gpu::SparseLLT<Scalar, UpLo>` -- Sparse Cholesky (cuDSS) |
| |
| Requires cuDSS (CUDA 12.0+, `#define EIGEN_CUDSS`). Three-phase workflow |
| with symbolic reuse. Accepts `SparseMatrix<Scalar, ColMajor, int>` (CSC). |
| Matrix dimensions and nonzero count must fit in `int` (cuDSS limitation; |
| debug builds assert). |
| |
| ```cpp |
| gpu::SparseLLT() // Default construct |
| gpu::SparseLLT(const SparseMatrixBase<D>& A) // Analyze + factorize |
| |
| gpu::SparseLLT& analyzePattern(const SparseMatrixBase<D>& A) // Symbolic analysis (reusable) |
| gpu::SparseLLT& factorize(const SparseMatrixBase<D>& A) // Numeric factorization |
| gpu::SparseLLT& compute(const SparseMatrixBase<D>& A) // analyzePattern + factorize |
| void setOrdering(GpuSparseOrdering ord) // AMD (default), METIS, or RCM |
| |
| DenseMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs) |
| |
| ComputationInfo info() // Lazy sync |
| Index rows() / cols() |
| cudaStream_t stream() |
| ``` |
| |
| ### `gpu::SparseLDLT<Scalar, UpLo>` -- Sparse LDL^T (cuDSS) |
| |
| Symmetric indefinite. Same API as `gpu::SparseLLT`. |
| |
| ### `gpu::SparseLU<Scalar>` -- Sparse LU (cuDSS) |
| |
| General non-symmetric. Same API as `gpu::SparseLLT` (without `UpLo`). |
| |
| ### `gpu::FFT<Scalar>` -- FFT (cuFFT) |
| |
| Plans cached by (size, type) and reused. Inverse transforms scaled so |
| `inv(fwd(x)) == x`. Supported scalars: `float`, `double`. |
| |
| ```cpp |
| // 1D transforms (host vectors in and out) |
| ComplexVector fwd(const MatrixBase<D>& x) // C2C forward (complex input) |
| ComplexVector fwd(const MatrixBase<D>& x) // R2C forward (real input, returns n/2+1) |
| ComplexVector inv(const MatrixBase<D>& X) // C2C inverse, scaled by 1/n |
| RealVector invReal(const MatrixBase<D>& X, Index n) // C2R inverse, scaled by 1/n |
| |
| // 2D transforms (host matrices in and out) |
| ComplexMatrix fwd2(const MatrixBase<D>& A) // 2D C2C forward |
| ComplexMatrix inv2(const MatrixBase<D>& A) // 2D C2C inverse, scaled by 1/(rows*cols) |
| |
| cudaStream_t stream() |
| ``` |
| |
| All FFT methods accept host data and return host data. Upload/download is |
| handled internally. The C2C and R2C overloads of `fwd()` are distinguished by |
| the input scalar type (complex vs real). |
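
A sketch of the real round-trip; `invReal` needs the original length because
n/2+1 half-spectrum coefficients are ambiguous between even and odd n:

```cpp
gpu::FFT<float> fft;
VectorXf r = VectorXf::Random(256);
VectorXcf R = fft.fwd(r);                // R2C: returns 129 = 256/2 + 1 coefficients
VectorXf r2 = fft.invReal(R, r.size());  // C2R, scaled by 1/n: r2 matches r
```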
| |
| ### `gpu::SparseContext<Scalar>` -- SpMV/SpMM (cuSPARSE) |
| |
| Accepts `SparseMatrix<Scalar, ColMajor>`. All methods accept host data and |
| return host data. Matrix dimensions and nonzero count must fit in `int` |
| (cuSPARSE limitation; debug builds assert). |
| |
| ```cpp |
| gpu::SparseContext() // Creates own stream + cuSPARSE handle |
| |
| DenseVector multiply(A, x) // y = A * x |
| void multiply(A, x, y, alpha=1, beta=0, // y = alpha*op(A)*x + beta*y |
| op=GpuOp::NoTrans) |
| DenseVector multiplyT(A, x) // y = A^T * x |
| DenseVector multiplyAdjoint(A, x) // y = A^H * x |
| DenseMatrix multiplyMat(A, X, op=GpuOp::NoTrans) // Y = op(A) * X (SpMM) |
| |
| cudaStream_t stream() |
| ``` |
| |
| ### Aliasing |
| |
Unlike Eigen's `Matrix`, where a product assignment without `.noalias()`
evaluates into a temporary to guard against aliasing, `DeviceMatrix`
dispatches directly to NVIDIA library calls, which have no built-in aliasing
protection. All operations are implicitly `noalias`. The caller must ensure
operands don't alias the destination for GEMM, TRSM, SYMM/HEMM, and
SYRK/HERK. Debug builds assert on these violations before dispatching to
cuBLAS.
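
A sketch of the hazard and its fix:

```cpp
// d_A = d_A * d_B;                         // WRONG: destination aliases an operand
gpu::DeviceMatrix<double> d_T = d_A * d_B;  // RIGHT: evaluate into a fresh matrix
```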
| |
| ## Future work |
| |
| - **Reassess host-input vs. device-input API surface.** Each solver currently |
| exposes both host-input (`compute(MatrixXd)`, `solve(MatrixXd)`) and |
| device-input (`compute(DeviceMatrix)`, `solve(DeviceMatrix)`) overloads, plus |
| host- and device-side accessors (`matrixU()` vs `d_matrixU()`). This eases |
| migration from CPU Eigen but may invite accidental host ↔ device round-trips |
when users mix the two without realizing the cost. Revisit after the initial
| GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' |
| hands; if the convenience overloads cause more confusion than they save, |
| narrow toward a single explicit `fromHost` / `toHost` boundary. |

- **Device-resident sparse matrix-vector products.** `gpu::SparseContext`
  currently operates on host vectors and matrices, uploading and downloading
  on each call. The key missing piece is a `DeviceSparseView` that holds a
  sparse matrix on device and supports operator syntax (`d_y = d_A * d_x`)
  with `DeviceMatrix` operands -- keeping the entire SpMV/SpMM pipeline on
  device. This is essential for iterative solvers and any workflow that chains
  sparse and dense operations without returning to the host.
| |
| ## File layout |
| |
| | File | Depends on | Contents | |
| |------|-----------|----------| |
| | `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `cuda_data_type<>` | |
| | `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` | |
| | `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers | |
| | `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers | |
| | `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) | |
| | `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` | |
| | `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` | |
| | `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps | |
| | `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping | |
| | `GpuSolverContext.h` | `CuSolverSupport.h`, `CuBlasSupport.h` | Shared solver context (stream, handles, scratch) | |
| | `GpuLLT.h` | `GpuSolverContext.h` | Cached dense Cholesky factorization | |
| | `GpuLU.h` | `GpuSolverContext.h` | Cached dense LU factorization | |
| | `GpuQR.h` | `GpuSolverContext.h` | Dense QR decomposition | |
| | `GpuSVD.h` | `GpuSolverContext.h` | Dense SVD decomposition | |
| | `GpuEigenSolver.h` | `GpuSolverContext.h` | Self-adjoint eigenvalue decomposition | |
| | `CuFftSupport.h` | `GpuSupport.h`, `<cufft.h>` | cuFFT error macro, type-dispatch wrappers | |
| | `GpuFFT.h` | `CuFftSupport.h`, `CuBlasSupport.h` | 1D/2D FFT with plan caching | |
| | `CuSparseSupport.h` | `GpuSupport.h`, `<cusparse.h>` | cuSPARSE error macro | |
| | `GpuSparseContext.h` | `CuSparseSupport.h` | SpMV/SpMM via cuSPARSE | |
| | `CuDssSupport.h` | `GpuSupport.h`, `<cudss.h>` | cuDSS error macro, type traits (optional) | |
| | `GpuSparseSolverBase.h` | `CuDssSupport.h` | CRTP base for sparse solvers (optional) | |
| | `GpuSparseLLT.h` | `GpuSparseSolverBase.h` | Sparse Cholesky via cuDSS (optional) | |
| | `GpuSparseLDLT.h` | `GpuSparseSolverBase.h` | Sparse LDL^T via cuDSS (optional) | |
| | `GpuSparseLU.h` | `GpuSparseSolverBase.h` | Sparse LU via cuDSS (optional) | |
| |
| ## Building and testing |
| |
| ```bash |
| cmake -G Ninja -B build -S . \ |
| -DEIGEN_TEST_CUDA=ON \ |
| -DEIGEN_CUDA_COMPUTE_ARCH="70" \ |
| -DEIGEN_TEST_CUBLAS=ON \ |
| -DEIGEN_TEST_CUSOLVER=ON |
| |
| cmake --build build --target cublas cusolver_llt cusolver_lu \ |
| cusolver_qr cusolver_svd cusolver_eigen \ |
| device_matrix cufft cusparse_spmv |
| ctest --test-dir build -L gpu --output-on-failure |
| |
| # Sparse solvers (cuDSS -- separate install required) |
| cmake -G Ninja -B build -S . \ |
| -DEIGEN_TEST_CUDA=ON \ |
| -DEIGEN_CUDA_COMPUTE_ARCH="70" \ |
| -DEIGEN_TEST_CUDSS=ON |
| |
| cmake --build build --target cudss_llt cudss_ldlt cudss_lu |
| ctest --test-dir build -R '^cudss_' --output-on-failure |
| ``` |
| |