# Eigen GPU Module (`unsupported/Eigen/GPU`)
GPU-accelerated linear algebra for Eigen users, dispatching to NVIDIA CUDA
Math Libraries (cuBLAS, cuSOLVER, cuFFT, cuSPARSE, cuDSS). Requires CUDA
11.4+; cuDSS features require CUDA 12.0+ and a separate cuDSS install.
Header-only.
## Why this module
Eigen is the linear algebra foundation for a large ecosystem of C++ projects
in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP,
Open3D), scientific computing (Ceres, Stan), and beyond. Many of these
projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations
without dropping down to raw CUDA library APIs.
GPU sparse solvers are a particularly acute gap. Sparse factorization is the
bottleneck in SLAM, bundle adjustment, FEM, and nonlinear optimization --
exactly the workloads where GPU acceleration matters most. Downstream projects
like [Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and
[COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for
GPU-accelerated sparse solvers, and third-party projects like
[cholespy](https://github.com/rgl-epfl/cholespy) exist specifically because
Eigen lacks them. The `unsupported/Eigen/GPU` module provides GPU sparse Cholesky, LDL^T,
and LU factorization via cuDSS, alongside dense solvers (cuSOLVER), matrix
products (cuBLAS), FFT (cuFFT), and sparse matrix-vector products (cuSPARSE).
Existing Eigen users should be able to move performance-critical dense or
sparse linear algebra to the GPU with minimal code changes and without
learning CUDA library APIs directly.
## Design philosophy
**CPU and GPU coexist.** There is no global compile-time switch that replaces
CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers
explicitly -- `gpu::LLT<double>` vs `LLT<MatrixXd>`, `gpu::SparseLLT<double>` vs
`SimplicialLLT<SparseMatrix<double>>` -- and both coexist in the same binary.
This also lets users keep the factored matrix on device across multiple solves,
something impossible with compile-time replacement.
**Familiar syntax.** GPU operations use the same expression patterns as CPU
Eigen. Here is a side-by-side comparison:
```cpp
// ---- CPU (Eigen) ----           // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense>             #define EIGEN_USE_GPU
                                   #include <unsupported/Eigen/GPU>
// Dense
MatrixXd A = ...;                  auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...;                  auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
MatrixXd C = A * B;                gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B);     gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);
                                   MatrixXd X = d_X.toHost();
// Sparse (using SpMat = SparseMatrix<double>)
SimplicialLLT<SpMat> llt(A);       gpu::SparseLLT<double> llt(A);
VectorXd x = llt.solve(b);         VectorXd x = llt.solve(b);
```
The GPU version reads like CPU Eigen with explicit upload/download for dense
operations, and an almost identical API for sparse solvers. Unsupported
expressions are compile errors.
**Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core
expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.).
`DeviceMatrix` is not an Eigen expression type and does not inherit from
`MatrixBase`. The expression layer is a thin compile-time dispatch where every
supported expression maps to a single NVIDIA library call. There is no
coefficient-level evaluation, lazy fusion, or packet operations.
**Explicit over implicit.** Host-device transfers, stream management, and
library handle lifetimes are visible in the API. There are no hidden
allocations or synchronizations except where documented (e.g., `toHost()` must
synchronize to deliver data to the host).
## Key concepts
### `DeviceMatrix<Scalar>`
A typed RAII wrapper for a dense column-major matrix in GPU device memory.
This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply
a `DeviceMatrix` with one column.
```cpp
// Upload from host
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
// Allocate uninitialized
gpu::DeviceMatrix<double> d_C(m, n);
// Download to host
MatrixXd C = d_C.toHost();
// Async download (returns a future)
auto transfer = d_C.toHostAsync();
// ... do other work ...
MatrixXd C = transfer.get();
```
`DeviceMatrix` supports expression methods that mirror Eigen's API:
`adjoint()`, `transpose()`, `triangularView<UpLo>()`,
`selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight
expression objects that are evaluated when assigned.
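For example, the product expression below performs no GPU work until it is assigned to a `DeviceMatrix` (a minimal sketch; `A` and `B` are assumed to be host `MatrixXd`):
```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
auto prod = d_A.adjoint() * d_B;        // lightweight expression object, no GPU work yet
gpu::DeviceMatrix<double> d_C = prod;   // cublasXgemm (transA=C) dispatched here
```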
### `gpu::Context`
Every GPU operation needs a CUDA stream and library handles (cuBLAS,
cuSOLVER). `gpu::Context` bundles these together.
For simple usage, you don't need to create one -- a per-thread default context
is created lazily on first use:
```cpp
// These use the thread-local default context automatically
d_C = d_A * d_B;
d_X = d_A.llt().solve(d_B);
```
For concurrent multi-stream execution, create explicit contexts:
```cpp
gpu::Context ctx1, ctx2;
d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1
d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
```
## Usage
### Matrix operations (cuBLAS)
```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
gpu::DeviceMatrix<double> d_C = d_A * d_B;
d_C = d_A.adjoint() * d_B;
d_C = d_A * d_B.transpose();
// Scaled and accumulated
d_C += 2.0 * d_A * d_B; // alpha=2, beta=1
d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context)
// Triangular solve (TRSM)
d_X = d_A.triangularView<Lower>().solve(d_B);
// Symmetric/Hermitian multiply (SYMM/HEMM)
d_C = d_A.selfadjointView<Lower>() * d_B;
// Rank-k update (SYRK/HERK)
d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
```
### Dense solvers (cuSOLVER)
**One-shot expression syntax** -- Convenient, re-factorizes each time:
```cpp
// Cholesky solve (potrf + potrs)
d_X = d_A.llt().solve(d_B);
// LU solve (getrf + getrs)
d_Y = d_A.lu().solve(d_B);
```
**Cached factorization** -- Factor once, solve many times:
```cpp
gpu::LLT<double> llt;
llt.compute(d_A); // factorize (async)
if (llt.info() != Success) { ... } // lazy sync on first info() call
auto d_X1 = llt.solve(d_B1); // reuses factor (async)
auto d_X2 = llt.solve(d_B2); // reuses factor (async)
MatrixXd X2 = d_X2.toHost();
// LU with transpose solve
gpu::LU<double> lu;
lu.compute(d_A);
auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B
// QR solve (overdetermined least squares)
gpu::QR<double> qr;
qr.compute(d_A); // factorize on device (async)
auto d_X = qr.solve(d_B); // Q^H * B via ormqr, then trsm on R
MatrixXd X = d_X.toHost();
// SVD (results downloaded on access)
gpu::SVD<double> svd;
svd.compute(d_A, ComputeThinU | ComputeThinV);
VectorXd S = svd.singularValues(); // downloads to host
MatrixXd U = svd.matrixU(); // downloads to host
MatrixXd VT = svd.matrixVT(); // V^T (matches cuSOLVER)
// SVD: device-side views (no D2H transfer; svd must outlive the views)
auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values
auto d_U = svd.d_matrixU(); // DeviceMatrix view of U
auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T
// Self-adjoint eigenvalue decomposition
gpu::SelfAdjointEigenSolver<double> es;
es.compute(d_A);
VectorXd eigenvals = es.eigenvalues(); // downloads to host
MatrixXd eigenvecs = es.eigenvectors(); // downloads to host
auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues
auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors
```
The cached API keeps the factored matrix on device, avoiding redundant
host-device transfers and re-factorizations. All solvers also accept host
matrices directly as a convenience (e.g., `gpu::LLT<double> llt(A)` or
`qr.solve(B)`), which handles upload/download internally. The `d_*` accessors
on `gpu::SVD` and `gpu::SelfAdjointEigenSolver` return non-owning
`DeviceMatrix` views so downstream cuBLAS/cuSOLVER work can chain without
round-tripping through host memory.
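For example, the host-convenience path looks like this (a sketch assuming host matrices `A`, `B1`, `B2`; upload and download are handled internally):
```cpp
gpu::LLT<double> llt(A);        // upload + factorize (potrf)
MatrixXd X1 = llt.solve(B1);    // upload B1, potrs, download X1
MatrixXd X2 = llt.solve(B2);    // reuses the device-resident factor
```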
### Sparse direct solvers (cuDSS)
Requires cuDSS (separate install, CUDA 12.0+). Define `EIGEN_CUDSS` before
including `unsupported/Eigen/GPU` and link with `-lcudss`.
```cpp
SparseMatrix<double> A = ...; // symmetric positive definite
VectorXd b = ...;
// Sparse Cholesky -- one-liner
gpu::SparseLLT<double> llt(A);
VectorXd x = llt.solve(b);
// Three-phase workflow for repeated solves with the same sparsity pattern
gpu::SparseLLT<double> llt;
llt.analyzePattern(A); // symbolic analysis (once)
llt.factorize(A); // numeric factorization
VectorXd x = llt.solve(b);
llt.factorize(A_new_values); // refactorize (reuses symbolic analysis)
VectorXd x2 = llt.solve(b);
// Sparse LDL^T (symmetric indefinite)
gpu::SparseLDLT<double> ldlt(A);
VectorXd x = ldlt.solve(b);
// Sparse LU (general non-symmetric)
gpu::SparseLU<double> lu(A);
VectorXd x = lu.solve(b);
```
### FFT (cuFFT)
```cpp
gpu::FFT<float> fft;
// 1D complex-to-complex
VectorXcf X = fft.fwd(x); // forward
VectorXcf y = fft.inv(X); // inverse (scaled by 1/n)
// 1D real-to-complex / complex-to-real
VectorXcf R = fft.fwd(r); // returns n/2+1 complex (half-spectrum)
VectorXf s = fft.invReal(R, n); // C2R inverse, caller specifies n
// 2D complex-to-complex
MatrixXcf B = fft.fwd2(A); // 2D forward
MatrixXcf C = fft.inv2(B); // 2D inverse (scaled by 1/(rows*cols))
// Plans are cached and reused across calls with the same size/type.
```
### Sparse matrix-vector multiply (cuSPARSE)
```cpp
SparseMatrix<double> A = ...;
VectorXd x = ...;
MatrixXd X = ...;                  // dense multi-RHS block (used by multiplyMat below)
gpu::SparseContext<double> ctx;
VectorXd y = ctx.multiply(A, x); // y = A * x
VectorXd z = ctx.multiplyT(A, x); // z = A^T * x
ctx.multiply(A, x, y, 2.0, 1.0); // y = 2*A*x + y
ctx.multiply(A, x, y, 1.0, 0.0, // y = A^H * x (Hermitian SpMV)
gpu::GpuOp::ConjTrans);
// Multiple RHS (SpMM)
MatrixXd Y = ctx.multiplyMat(A, X); // Y = A * X
MatrixXd Z = ctx.multiplyMat(A, X, gpu::GpuOp::Trans); // Z = A^T * X
```
### Precision control
GEMM dispatch uses `cublasXgemm` (type-specific Sgemm/Dgemm/Cgemm/Zgemm).
cuBLAS may internally use tensor cores depending on the GPU architecture,
matrix dimensions, and CUDA math mode settings. No Eigen-specific macros
control this; use `cublasSetMathMode()` on the context's cuBLAS handle (or the
`NVIDIA_TF32_OVERRIDE` environment variable to disable TF32 globally) to
configure tensor core behavior if needed.
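For example, to opt into TF32 tensor cores for single-precision GEMMs on Ampere-class GPUs, set the math mode on the context's cuBLAS handle using the standard cuBLAS API (a sketch; `d_A`, `d_B`, `d_C` are `DeviceMatrix<float>` instances):
```cpp
#include <cublas_v2.h>

gpu::Context ctx;
// Standard cuBLAS call, not an Eigen-specific knob; CUBLAS_DEFAULT_MATH restores the default.
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);
d_C.device(ctx) = d_A * d_B;    // this GEMM may now use TF32 tensor cores
```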
### Stream control and async execution
Operations are asynchronous by default. The compute-solve chain runs without
host synchronization until you need a result on the host:
```text
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost()
H2D potrf potrs D2H
sync
```
Mandatory sync points:
- `fromHost()` -- Synchronizes to complete the upload before returning
- `toHost()` / `HostTransfer::get()` -- Must deliver data to host
- `info()` -- Must read the factorization status
**Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion
via CUDA events. When a matrix written on stream A is read on stream B, the
module automatically inserts `cudaStreamWaitEvent`. Same-stream operations
skip the wait (CUDA guarantees in-order execution within a stream).
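In practice this means a value produced under one context can be consumed under another without manual event handling (a sketch; `d_A`, `d_B`, `d_E`, `d_C`, `d_D` are `DeviceMatrix<double>` instances):
```cpp
gpu::Context ctx1, ctx2;
d_C.device(ctx1) = d_A * d_B;   // d_C written on ctx1's stream; completion event recorded
d_D.device(ctx2) = d_C * d_E;   // read on ctx2's stream: cudaStreamWaitEvent inserted automatically
```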
**Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns
its CUDA stream, library handle, and the cached factor on device. Destroying
the factorization object while a `solve()` it issued is still in flight is
*safe*, but it is not actually asynchronous: `cudaStreamDestroy` returns
immediately, while the destructor of the cached factor calls `cudaFree`, which
is device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
For genuine async pipelining, keep the factorization object alive until you
have drained its results (e.g. via `toHost()` or by binding consumption to
an explicit `gpu::Context` that outlives both producer and consumer):
```cpp
gpu::LLT<double> llt(d_A); // factor stays on device
auto h_x = llt.solve(d_B).toHostAsync();  // async solve + async D2H
MatrixXd x = h_x.get();    // sync: factor + result complete
// llt may now be destroyed without stalling the device
```
## Reference
### Supported scalar types
`float`, `double`, `std::complex<float>`, `std::complex<double>` (unless
noted otherwise).
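Complex scalars dispatch to the C/Z library variants. For example (a sketch with host `MatrixXcd` inputs `A` and `B`):
```cpp
using cd = std::complex<double>;
auto d_A = gpu::DeviceMatrix<cd>::fromHost(A);
auto d_B = gpu::DeviceMatrix<cd>::fromHost(B);
gpu::DeviceMatrix<cd> d_C = d_A.adjoint() * d_B;   // cublasZgemm with transA=C
```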
### Expression -> library call mapping
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
| `C = A * B` | `cublasXgemm` | transA=N, transB=N, alpha=1, beta=0 |
| `C = A.adjoint() * B` | `cublasXgemm` | transA=C, transB=N |
| `C = A.transpose() * B` | `cublasXgemm` | transA=T, transB=N |
| `C = A * B.adjoint()` | `cublasXgemm` | transA=N, transB=C |
| `C = A * B.transpose()` | `cublasXgemm` | transA=N, transB=T |
| `C = alpha * A * B` | `cublasXgemm` | alpha from LHS |
| `C = A * (alpha * B)` | `cublasXgemm` | alpha from RHS |
| `C += A * B` | `cublasXgemm` | alpha=1, beta=1 |
| `C.device(ctx) -= A * B` | `cublasXgemm` | alpha=-1, beta=1 |
| `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs |
| `X = A.llt<Upper>().solve(B)` | same | uplo=Upper |
| `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs |
| `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit |
| `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo |
| `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N |
### `DeviceMatrix<Scalar>`
Typed RAII wrapper for a dense column-major matrix in GPU device memory.
Always dense (leading dimension = rows). A vector is a `DeviceMatrix` with
one column.
```cpp
// Construction
DeviceMatrix<Scalar>() // Empty (0x0)
DeviceMatrix<Scalar>(rows, cols) // Allocate uninitialized
// Upload / download / pointer adoption
static DeviceMatrix fromHost(matrix, stream=nullptr) // -> DeviceMatrix (syncs)
static DeviceMatrix fromHostAsync(ptr, rows, cols, outerStride, s) // -> DeviceMatrix (no sync, caller manages ptr lifetime)
static DeviceMatrix adopt(Scalar* device_ptr, rows, cols) // Owning wrapper over a raw device pointer
static DeviceMatrix view(Scalar* device_ptr, rows, cols) // Non-owning view (does not free on destruction)
PlainMatrix toHost(stream=nullptr) // -> host Matrix (syncs)
HostTransfer toHostAsync(stream=nullptr) // -> HostTransfer future (no sync)
DeviceMatrix clone(stream=nullptr) // -> DeviceMatrix (D2D copy, async)
// Dimensions and access
Index rows()
Index cols()
size_t sizeInBytes()
bool empty()
Scalar* data() // Raw device pointer
void resize(Index rows, Index cols) // Discard contents, reallocate
// Expression builders (return lightweight views, evaluated on assignment)
AdjointView adjoint() // GEMM with ConjTrans
TransposeView transpose() // GEMM with Trans
LltExpr llt() / llt<UpLo>() // -> .solve(d_B) -> DeviceMatrix
LuExpr lu() // -> .solve(d_B) -> DeviceMatrix
TriangularView triangularView<UpLo>() // -> .solve(d_B) -> DeviceMatrix (TRSM)
SelfAdjointView selfadjointView<UpLo>() // -> * d_B (SYMM), .rankUpdate(d_A) (SYRK)
Assignment device(gpu::Context& ctx) // Bind assignment to explicit stream
```
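For interop with code that already manages device memory, `view()` wraps an existing column-major device buffer without copying (a sketch; `m`, `n`, and `d_B` are assumed, and the buffer's leading dimension must equal its row count):
```cpp
double* d_ptr = nullptr;
cudaMalloc(&d_ptr, m * n * sizeof(double));                  // filled by an external kernel/library
auto d_view = gpu::DeviceMatrix<double>::view(d_ptr, m, n);  // non-owning: caller still frees d_ptr
gpu::DeviceMatrix<double> d_C = d_view * d_B;                // usable in expressions like any DeviceMatrix
// Owning alternative: adopt() takes over the pointer and frees it on destruction.
// auto d_owned = gpu::DeviceMatrix<double>::adopt(d_ptr, m, n);
```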
### `gpu::Context`
Unified GPU execution context owning a CUDA stream and library handles.
```cpp
gpu::Context() // Creates dedicated stream + handles
static gpu::Context& threadLocal() // Per-thread default (lazy-created)
cudaStream_t stream()
cublasHandle_t cublasHandle()
cusolverDnHandle_t cusolverHandle()
```
Non-copyable, non-movable (owns library handles).
### `gpu::LLT<Scalar, UpLo>` -- Dense Cholesky (cuSOLVER)
Caches the Cholesky factor on device for repeated solves.
```cpp
gpu::LLT() // Default construct, then call compute()
gpu::LLT(const EigenBase<D>& A) // Convenience: upload + factorize
gpu::LLT& compute(const EigenBase<D>& A) // Upload + factorize
gpu::LLT& compute(const DeviceMatrix& d_A) // D2D copy + factorize
gpu::LLT& compute(DeviceMatrix&& d_A) // Adopt + factorize (no copy)
PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs)
DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async, stays on device)
ComputationInfo info() // Lazy sync on first call: Success or NumericalIssue
Index rows() / cols()
cudaStream_t stream()
```
### `gpu::LU<Scalar>` -- Dense LU (cuSOLVER)
Same pattern as `gpu::LLT`. Adds a `gpu::GpuOp` parameter on `solve()`.
```cpp
PlainMatrix solve(const MatrixBase<D>& B, GpuOp op = GpuOp::NoTrans) // -> host Matrix
DeviceMatrix solve(const DeviceMatrix& d_B, GpuOp op = GpuOp::NoTrans) // -> DeviceMatrix
```
`gpu::GpuOp`: `NoTrans`, `Trans`, `ConjTrans`.
### `gpu::QR<Scalar>` -- Dense QR (cuSOLVER)
QR factorization via `cusolverDnXgeqrf`. Solve uses ORMQR (apply Q^H) + TRSM
(back-substitute on R) -- Q is never formed explicitly.
```cpp
gpu::QR() // Default construct
gpu::QR(const EigenBase<D>& A) // Convenience: upload + factorize
gpu::QR& compute(const EigenBase<D>& A) // Upload + factorize
gpu::QR& compute(const DeviceMatrix& d_A) // D2D copy + factorize
PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs)
DeviceMatrix solve(const DeviceMatrix& d_B) // -> DeviceMatrix (async)
PlainMatrix matrixR() // -> host Matrix (m >= n only)
ComputationInfo info() // Lazy sync
Index rows() / cols()
cudaStream_t stream()
```
### `gpu::SVD<Scalar>` -- Dense SVD (cuSOLVER)
SVD via `cusolverDnXgesvd`. Supports `ComputeThinU | ComputeThinV`,
`ComputeFullU | ComputeFullV`, or `0` (values only). Wide matrices (m < n)
handled by internal transpose.
```cpp
gpu::SVD() // Default construct, then call compute()
gpu::SVD(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV) // Convenience
gpu::SVD& compute(const EigenBase<D>& A, unsigned options = ComputeThinU | ComputeThinV)
gpu::SVD& compute(const DeviceMatrix& d_A, unsigned options = ComputeThinU | ComputeThinV)
RealVector singularValues() // -> host vector (syncs, downloads)
PlainMatrix matrixU() // -> host Matrix (syncs, downloads)
PlainMatrix matrixVT() // -> host Matrix (syncs, downloads V^T)
DeviceMatrix d_singularValues() // -> DeviceMatrix view (zero-copy)
DeviceMatrix d_matrixU() // -> DeviceMatrix view (zero-copy when m >= n)
DeviceMatrix d_matrixVT() // -> DeviceMatrix view (zero-copy when m >= n)
PlainMatrix solve(const MatrixBase<D>& B) // -> host Matrix (pseudoinverse)
PlainMatrix solve(const MatrixBase<D>& B, Index k) // Truncated (top k triplets)
PlainMatrix solve(const MatrixBase<D>& B, RealScalar l) // Tikhonov regularized
Index rank(RealScalar threshold = -1)
ComputationInfo info() // Lazy sync
Index rows() / cols()
cudaStream_t stream()
```
**Note:** `singularValues()`, `matrixU()`, and `matrixVT()` download to host
on each call. The `d_*` accessors return non-owning `DeviceMatrix` views into
the solver's internal buffers; the `gpu::SVD` object must outlive any view
derived from it. For wide matrices (m < n) the U/V^T views are owning (one
`cublasXgeam` adjoint pass).
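A sketch of chaining further device work through the views (assuming `d_A`, `d_B` are `DeviceMatrix<double>` with m >= n, so the views are zero-copy):
```cpp
gpu::SVD<double> svd;
svd.compute(d_A, ComputeThinU | ComputeThinV);
auto d_U = svd.d_matrixU();                           // non-owning view; svd must stay alive
gpu::DeviceMatrix<double> d_P = d_U.adjoint() * d_B;  // U^H * B stays on device, no host round-trip
```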
### `gpu::SelfAdjointEigenSolver<Scalar>` -- Eigendecomposition (cuSOLVER)
Symmetric/Hermitian eigenvalue decomposition via `cusolverDnXsyevd`.
`ComputeMode` enum: `EigenvaluesOnly`, `ComputeEigenvectors`.
```cpp
gpu::SelfAdjointEigenSolver() // Default construct, then call compute()
gpu::SelfAdjointEigenSolver(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors) // Convenience
gpu::SelfAdjointEigenSolver& compute(const EigenBase<D>& A, ComputeMode mode = ComputeEigenvectors)
gpu::SelfAdjointEigenSolver& compute(const DeviceMatrix& d_A, ComputeMode mode = ComputeEigenvectors)
RealVector eigenvalues() // -> host vector (syncs, downloads, ascending order)
PlainMatrix eigenvectors() // -> host Matrix (syncs, downloads, columns)
DeviceMatrix d_eigenvalues() // -> DeviceMatrix view (zero-copy)
DeviceMatrix d_eigenvectors() // -> DeviceMatrix view (zero-copy, requires ComputeEigenvectors)
ComputationInfo info() // Lazy sync
Index rows() / cols()
cudaStream_t stream()
```
**Note:** `eigenvalues()` and `eigenvectors()` download to host on each call.
The `d_*` accessors return non-owning `DeviceMatrix` views into the solver's
internal buffers; the `gpu::SelfAdjointEigenSolver` object must outlive any
view derived from it.
### `HostTransfer<Scalar>`
Future for async device-to-host transfer. Returned by
`DeviceMatrix::toHostAsync()`.
```cpp
PlainMatrix& get() // Block until complete, return host Matrix ref. Idempotent.
bool ready() // Non-blocking poll
```
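Typical use overlaps the download with independent host work (a sketch):
```cpp
auto transfer = d_C.toHostAsync();   // enqueue D2H copy, return immediately
while (!transfer.ready()) {
  // ... do unrelated host-side work while the copy is in flight ...
}
MatrixXd C = transfer.get();         // copy already complete; get() returns without blocking
```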
### `gpu::SparseLLT<Scalar, UpLo>` -- Sparse Cholesky (cuDSS)
Requires cuDSS (CUDA 12.0+, `#define EIGEN_CUDSS`). Three-phase workflow
with symbolic reuse. Accepts `SparseMatrix<Scalar, ColMajor, int>` (CSC).
Matrix dimensions and nonzero count must fit in `int` (cuDSS limitation;
debug builds assert).
```cpp
gpu::SparseLLT() // Default construct
gpu::SparseLLT(const SparseMatrixBase<D>& A) // Analyze + factorize
gpu::SparseLLT& analyzePattern(const SparseMatrixBase<D>& A) // Symbolic analysis (reusable)
gpu::SparseLLT& factorize(const SparseMatrixBase<D>& A) // Numeric factorization
gpu::SparseLLT& compute(const SparseMatrixBase<D>& A) // analyzePattern + factorize
void setOrdering(GpuSparseOrdering ord) // AMD (default), METIS, or RCM
DenseMatrix solve(const MatrixBase<D>& B) // -> host Matrix (syncs)
ComputationInfo info() // Lazy sync
Index rows() / cols()
cudaStream_t stream()
```
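A sketch of reusing the symbolic analysis across refactorizations with a non-default ordering (the enumerator spelling `GpuSparseOrdering::METIS` is an assumption based on the AMD/METIS/RCM list above):
```cpp
gpu::SparseLLT<double> llt;
llt.setOrdering(GpuSparseOrdering::METIS);  // assumed enumerator spelling
llt.analyzePattern(A);                      // symbolic analysis, once per sparsity pattern
llt.factorize(A);                           // numeric factorization
VectorXd x0 = llt.solve(b);
llt.factorize(A_updated);                   // same pattern, new values: symbolic phase reused
VectorXd x1 = llt.solve(b);
```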
### `gpu::SparseLDLT<Scalar, UpLo>` -- Sparse LDL^T (cuDSS)
Symmetric indefinite. Same API as `gpu::SparseLLT`.
### `gpu::SparseLU<Scalar>` -- Sparse LU (cuDSS)
General non-symmetric. Same API as `gpu::SparseLLT` (without `UpLo`).
### `gpu::FFT<Scalar>` -- FFT (cuFFT)
Plans cached by (size, type) and reused. Inverse transforms scaled so
`inv(fwd(x)) == x`. Supported scalars: `float`, `double`.
```cpp
// 1D transforms (host vectors in and out)
ComplexVector fwd(const MatrixBase<D>& x) // C2C forward (complex input)
ComplexVector fwd(const MatrixBase<D>& x) // R2C forward (real input, returns n/2+1)
ComplexVector inv(const MatrixBase<D>& X) // C2C inverse, scaled by 1/n
RealVector invReal(const MatrixBase<D>& X, Index n) // C2R inverse, scaled by 1/n
// 2D transforms (host matrices in and out)
ComplexMatrix fwd2(const MatrixBase<D>& A) // 2D C2C forward
ComplexMatrix inv2(const MatrixBase<D>& A) // 2D C2C inverse, scaled by 1/(rows*cols)
cudaStream_t stream()
```
All FFT methods accept host data and return host data. Upload/download is
handled internally. The C2C and R2C overloads of `fwd()` are distinguished by
the input scalar type (complex vs real).
### `gpu::SparseContext<Scalar>` -- SpMV/SpMM (cuSPARSE)
Accepts `SparseMatrix<Scalar, ColMajor>`. All methods accept host data and
return host data. Matrix dimensions and nonzero count must fit in `int`
(cuSPARSE limitation; debug builds assert).
```cpp
gpu::SparseContext() // Creates own stream + cuSPARSE handle
DenseVector multiply(A, x) // y = A * x
void multiply(A, x, y, alpha=1, beta=0, // y = alpha*op(A)*x + beta*y
op=GpuOp::NoTrans)
DenseVector multiplyT(A, x) // y = A^T * x
DenseVector multiplyAdjoint(A, x) // y = A^H * x
DenseMatrix multiplyMat(A, X, op=GpuOp::NoTrans) // Y = op(A) * X (SpMM)
cudaStream_t stream()
```
### Aliasing
Unlike Eigen's `Matrix`, where omitting `.noalias()` triggers a copy to a
temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have
no built-in aliasing protection. All operations are implicitly noalias.
The caller must ensure operands don't alias the destination for GEMM, TRSM,
SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before
dispatching to cuBLAS.
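A sketch of the hazard and the fix (assuming `d_A`, `d_B` are `DeviceMatrix<double>`):
```cpp
// Hazard: destination aliases an operand -- cuBLAS GEMM does not support this.
// d_A = d_A * d_B;                        // asserts in debug builds; undefined in release

// Fix: evaluate into a distinct destination matrix.
gpu::DeviceMatrix<double> d_C = d_A * d_B;
```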
## Future work
- **Reassess host-input vs. device-input API surface.** Each solver currently
exposes both host-input (`compute(MatrixXd)`, `solve(MatrixXd)`) and
device-input (`compute(DeviceMatrix)`, `solve(DeviceMatrix)`) overloads, plus
host- and device-side accessors (`matrixU()` vs `d_matrixU()`). This eases
migration from CPU Eigen but may invite accidental host ↔ device round-trips
when users mix the two without realizing the cost. Revisit after the initial
GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users'
hands; if the convenience overloads cause more confusion than they save,
narrow toward a single explicit `fromHost` / `toHost` boundary.
## File layout
| File | Depends on | Contents |
|------|-----------|----------|
| `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `cuda_data_type<>` |
| `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` |
| `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers |
| `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers |
| `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) |
| `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` |
| `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` |
| `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps |
| `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping |
| `GpuSolverContext.h` | `CuSolverSupport.h`, `CuBlasSupport.h` | Shared solver context (stream, handles, scratch) |
| `GpuLLT.h` | `GpuSolverContext.h` | Cached dense Cholesky factorization |
| `GpuLU.h` | `GpuSolverContext.h` | Cached dense LU factorization |
| `GpuQR.h` | `GpuSolverContext.h` | Dense QR decomposition |
| `GpuSVD.h` | `GpuSolverContext.h` | Dense SVD decomposition |
| `GpuEigenSolver.h` | `GpuSolverContext.h` | Self-adjoint eigenvalue decomposition |
| `CuFftSupport.h` | `GpuSupport.h`, `<cufft.h>` | cuFFT error macro, type-dispatch wrappers |
| `GpuFFT.h` | `CuFftSupport.h`, `CuBlasSupport.h` | 1D/2D FFT with plan caching |
| `CuSparseSupport.h` | `GpuSupport.h`, `<cusparse.h>` | cuSPARSE error macro |
| `GpuSparseContext.h` | `CuSparseSupport.h` | SpMV/SpMM via cuSPARSE |
| `CuDssSupport.h` | `GpuSupport.h`, `<cudss.h>` | cuDSS error macro, type traits (optional) |
| `GpuSparseSolverBase.h` | `CuDssSupport.h` | CRTP base for sparse solvers (optional) |
| `GpuSparseLLT.h` | `GpuSparseSolverBase.h` | Sparse Cholesky via cuDSS (optional) |
| `GpuSparseLDLT.h` | `GpuSparseSolverBase.h` | Sparse LDL^T via cuDSS (optional) |
| `GpuSparseLU.h` | `GpuSparseSolverBase.h` | Sparse LU via cuDSS (optional) |
## Building and testing
```bash
cmake -G Ninja -B build -S . \
-DEIGEN_TEST_CUDA=ON \
-DEIGEN_CUDA_COMPUTE_ARCH="70" \
-DEIGEN_TEST_CUBLAS=ON \
-DEIGEN_TEST_CUSOLVER=ON
cmake --build build --target cublas cusolver_llt cusolver_lu \
cusolver_qr cusolver_svd cusolver_eigen \
device_matrix cufft cusparse_spmv
ctest --test-dir build -L gpu --output-on-failure
# Sparse solvers (cuDSS -- separate install required)
cmake -G Ninja -B build -S . \
-DEIGEN_TEST_CUDA=ON \
-DEIGEN_CUDA_COMPUTE_ARCH="70" \
-DEIGEN_TEST_CUDSS=ON
cmake --build build --target cudss_llt cudss_ldlt cudss_lu
ctest --test-dir build -R '^cudss_' --output-on-failure
```
## Future enhancements
- **Device-resident sparse matrix-vector products.** `gpu::SparseContext`
currently operates on host vectors and matrices, uploading and downloading
on each call. The key missing piece is a `DeviceSparseView` that holds a
sparse matrix on device and supports operator syntax (`d_y = d_A * d_x`)
with `DeviceMatrix` operands -- keeping the entire SpMV/SpMM pipeline on
device. This is essential for iterative solvers and any workflow that chains
sparse and dense operations without returning to the host.