blob: 2928a7ccb1708a871de8ded5c4ffc6dc03be7644 [file] [view]
# Eigen GPU Module (`unsupported/Eigen/GPU`)
GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA
CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link
against CUDA runtime, cuBLAS, and cuSOLVER).
## Why this module
Eigen is the linear algebra foundation for a large ecosystem of C++ projects
in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP,
Open3D), scientific computing (Ceres, Stan), and beyond. Many of these
projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations
without dropping down to raw CUDA library APIs. Third-party projects like
[EigenCuda](https://github.com/NLESC-JCER/EigenCuda) and
[cholespy](https://github.com/rgl-epfl/cholespy) exist specifically to fill
this gap, and downstream projects like
[Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and
[COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for
GPU-accelerated solvers through Eigen.
The `unsupported/Eigen/GPU` module aims to close this gap: Existing Eigen users should be
able to move performance-critical dense linear algebra to the GPU with minimal
code changes and without learning CUDA library APIs directly.
## Design philosophy
**CPU and GPU coexist.** There is no global compile-time switch that replaces
CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers
explicitly -- `gpu::LLT<double>` vs `Eigen::LLT<MatrixXd>` -- and both coexist in
the same binary. This also lets users keep the factored matrix on device across
multiple solves, something impossible with compile-time replacement.
**Familiar syntax.** GPU operations use the same expression patterns as CPU
Eigen. Here is a side-by-side comparison:
```cpp
// ---- CPU (Eigen) ---- // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense> #define EIGEN_USE_GPU
#include <unsupported/Eigen/GPU>
MatrixXd A = ...; auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...; auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
MatrixXd C = A * B; gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B); gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);
MatrixXd X = d_X.toHost();
```
The GPU version reads like CPU Eigen with explicit upload/download.
`operator*` dispatches to cuBLAS GEMM, `.llt().solve()` dispatches to
cuSOLVER potrf + potrs. Unsupported expressions are compile errors.
**Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core
expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.).
`DeviceMatrix` is not an Eigen expression type and does not inherit from
`MatrixBase`. The expression layer is a thin compile-time dispatch where every
supported expression maps to a single NVIDIA library call. There is no
coefficient-level evaluation, lazy fusion, or packet operations.
**Explicit over implicit.** Host-device transfers, stream management, and
library handle lifetimes are visible in the API. There are no hidden
allocations or synchronizations except where documented (e.g., `toHost()` must
synchronize to deliver data to the host).
## Key concepts
### `DeviceMatrix<Scalar>`
A typed RAII wrapper for a dense column-major matrix in GPU device memory.
This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply
a `DeviceMatrix` with one column.
```cpp
// Upload from host
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
// Allocate uninitialized
gpu::DeviceMatrix<double> d_C(m, n);
// Download to host
MatrixXd C = d_C.toHost();
// Async download (returns a future)
auto transfer = d_C.toHostAsync();
// ... do other work ...
MatrixXd C = transfer.get();
```
`DeviceMatrix` supports expression methods that mirror Eigen's API:
`adjoint()`, `transpose()`, `triangularView<UpLo>()`,
`selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight
expression objects that are evaluated when assigned.
### `gpu::Context`
Every GPU operation needs a CUDA stream and library handles (cuBLAS,
cuSOLVER). `gpu::Context` bundles these together.
For simple usage, you don't need to create one -- a per-thread default context
is created lazily on first use:
```cpp
// These use the thread-local default context automatically
d_C = d_A * d_B;
d_X = d_A.llt().solve(d_B);
```
For concurrent multi-stream execution, create explicit contexts:
```cpp
gpu::Context ctx1, ctx2;
d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1
d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
```
## Usage
### Matrix operations (cuBLAS)
```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
gpu::DeviceMatrix<double> d_C = d_A * d_B;
d_C = d_A.adjoint() * d_B;
d_C = d_A * d_B.transpose();
// Scaled and accumulated
d_C += 2.0 * d_A * d_B; // alpha=2, beta=1
d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context)
// Triangular solve (TRSM)
d_X = d_A.triangularView<Lower>().solve(d_B);
// Symmetric/Hermitian multiply (SYMM/HEMM)
d_C = d_A.selfadjointView<Lower>() * d_B;
// Rank-k update (SYRK/HERK)
d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
```
### Dense solvers (cuSOLVER)
**One-shot expression syntax** -- Convenient, re-factorizes each time:
```cpp
// Cholesky solve (potrf + potrs)
d_X = d_A.llt().solve(d_B);
// LU solve (getrf + getrs)
d_Y = d_A.lu().solve(d_B);
```
**Cached factorization** -- Factor once, solve many times:
```cpp
gpu::LLT<double> llt;
llt.compute(d_A); // factorize (async)
if (llt.info() != Success) { ... } // lazy sync on first info() call
auto d_X1 = llt.solve(d_B1); // reuses factor (async)
auto d_X2 = llt.solve(d_B2); // reuses factor (async)
MatrixXd X2 = d_X2.toHost();
// LU with transpose solve
gpu::LU<double> lu;
lu.compute(d_A);
auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B
```
The cached API keeps the factored matrix on device, avoiding redundant
host-device transfers and re-factorizations.
### Stream control and async execution
Operations are asynchronous by default. The compute-solve chain runs without
host synchronization until you need a result on the host:
```text
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost()
H2D potrf potrs D2H
sync
```
Mandatory sync points:
- `fromHost()` -- Synchronizes to complete the upload before returning
- `toHost()` / `HostTransfer::get()` -- Must deliver data to host
- `info()` -- Must read the factorization status
**Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion
via CUDA events. When a matrix written on stream A is read on stream B, the
module automatically inserts `cudaStreamWaitEvent`. Same-stream operations
skip the wait (CUDA guarantees in-order execution within a stream).
**Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns
its CUDA stream, library handle, and the cached factor on device. Destroying
the factorization object while a `solve()` it issued is still in flight is
*correct* but not actually async: `cudaStreamDestroy` returns immediately,
but the destructor of the cached factor calls `cudaFree`, which is fully
device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
For genuine async pipelining keep the factorization object alive until you
have drained its results (e.g. via `toHost()` or by binding consumption to
an explicit `gpu::Context` that outlives both producer and consumer):
```cpp
gpu::LLT<double> llt(d_A); // factor stays on device
auto d_X = llt.solve(d_B);
auto h_x = d_X.toHostAsync(llt.stream());
h_x.get(); // sync: factor + result complete
// llt may now be destroyed without stalling the device
```
## Reference
### Supported scalar types
`float`, `double`, `std::complex<float>`, `std::complex<double>`.
### Expression -> library call mapping
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
| `C = A * B` | `cublasGemmEx` | transA=N, transB=N, alpha=1, beta=0 |
| `C = A.adjoint() * B` | `cublasGemmEx` | transA=C, transB=N |
| `C = A.transpose() * B` | `cublasGemmEx` | transA=T, transB=N |
| `C = A * B.adjoint()` | `cublasGemmEx` | transA=N, transB=C |
| `C = A * B.transpose()` | `cublasGemmEx` | transA=N, transB=T |
| `C = alpha * A * B` | `cublasGemmEx` | alpha from LHS |
| `C = A * (alpha * B)` | `cublasGemmEx` | alpha from RHS |
| `C += A * B` | `cublasGemmEx` | alpha=1, beta=1 |
| `C.device(ctx) -= A * B` | `cublasGemmEx` | alpha=-1, beta=1 |
| `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs |
| `X = A.llt<Upper>().solve(B)` | same | uplo=Upper |
| `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs |
| `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit |
| `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo |
| `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N |
### `DeviceMatrix<Scalar>` API
| Method | Sync? | Description |
|--------|-------|-------------|
| `DeviceMatrix()` | -- | Empty (0x0) |
| `DeviceMatrix(rows, cols)` | -- | Allocate uninitialized |
| `fromHost(matrix, stream)` | yes | Upload from Eigen matrix |
| `fromHostAsync(ptr, rows, cols, outerStride, stream)` | no | Async upload (caller manages lifetime) |
| `toHost(stream)` | yes | Synchronous download |
| `toHostAsync(stream)` | no | Returns `HostTransfer` future |
| `clone(stream)` | no | Device-to-device deep copy |
| `resize(rows, cols)` | -- | Discard contents, reallocate |
| `data()` | -- | Raw device pointer |
| `rows()`, `cols()` | -- | Dimensions |
| `sizeInBytes()` | -- | Total device allocation size in bytes |
| `empty()` | -- | True if 0x0 |
| `adjoint()` | -- | Adjoint view (GEMM ConjTrans) |
| `transpose()` | -- | Transpose view (GEMM Trans) |
| `llt()` / `llt<UpLo>()` | -- | Cholesky expression builder |
| `lu()` | -- | LU expression builder |
| `triangularView<UpLo>()` | -- | Triangular view (TRSM) |
| `selfadjointView<UpLo>()` | -- | Self-adjoint view (SYMM, rankUpdate) |
| `device(ctx)` | -- | Assignment proxy bound to context |
### `gpu::Context`
Unified GPU execution context owning a CUDA stream and library handles.
```cpp
gpu::Context() // Creates dedicated stream + handles
static gpu::Context& threadLocal() // Per-thread default (lazy-created)
cudaStream_t stream()
cublasHandle_t cublasHandle()
cusolverDnHandle_t cusolverHandle()
```
Non-copyable, non-movable (owns library handles).
### `gpu::LLT<Scalar, UpLo>` API
GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.
| Method | Sync? | Description |
|--------|-------|-------------|
| `gpu::LLT(A)` | deferred | Construct and factorize from host matrix |
| `compute(host_matrix)` | deferred | Upload and factorize |
| `compute(DeviceMatrix)` | deferred | D2D copy and factorize |
| `compute(DeviceMatrix&&)` | deferred | Move-adopt and factorize (no copy) |
| `solve(host_matrix)` | yes | Solve, return host matrix |
| `solve(DeviceMatrix)` | no | Solve, return `DeviceMatrix` (async) |
| `info()` | lazy | Syncs stream on first call, returns `Success` or `NumericalIssue` |
### `gpu::LU<Scalar>` API
GPU dense partial-pivoting LU via cuSOLVER. Same pattern as `gpu::LLT`, plus a
`gpu::GpuOp` parameter on `solve()` (`NoTrans`, `Trans`, `ConjTrans`).
### `HostTransfer<Scalar>` API
Future for async device-to-host transfer.
| Method | Description |
|--------|-------------|
| `get()` | Block until transfer completes, return host matrix reference. Idempotent. |
| `ready()` | Non-blocking poll |
### Aliasing
Unlike Eigen's `Matrix`, where omitting `.noalias()` triggers a copy to a
temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have
no built-in aliasing protection. All operations are implicitly noalias.
The caller must ensure operands don't alias the destination for GEMM, TRSM,
SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before
dispatching to cuBLAS.
## File layout
| File | Depends on | Contents |
|------|-----------|----------|
| `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `PinnedHostBuffer`, `cuda_data_type<>` |
| `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` |
| `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers |
| `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers |
| `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) |
| `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` |
| `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` |
| `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps |
| `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping |
| `GpuLLT.h` | `CuSolverSupport.h` | Cached dense Cholesky factorization |
| `GpuLU.h` | `CuSolverSupport.h` | Cached dense LU factorization |
## Building and testing
```bash
cmake -G Ninja -B build -S . \
-DEIGEN_TEST_CUDA=ON \
-DEIGEN_CUDA_COMPUTE_ARCH="70"
cmake --build build --target cublas cusolver_llt cusolver_lu device_matrix
ctest --test-dir build -L gpu --output-on-failure
```
`EIGEN_TEST_CUBLAS` and `EIGEN_TEST_CUSOLVER` default to ON when CUDA is enabled
(cuBLAS and cuSOLVER are part of the CUDA toolkit).