# Eigen GPU Module (`unsupported/Eigen/GPU`)
GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA
CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link
against CUDA runtime, cuBLAS, and cuSOLVER).
## Why this module
Eigen is the linear algebra foundation for a large ecosystem of C++ projects
in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP,
Open3D), scientific computing (Ceres, Stan), and beyond. Many of these
projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations
without dropping down to raw CUDA library APIs. Third-party projects like
[EigenCuda](https://github.com/NLESC-JCER/EigenCuda) and
[cholespy](https://github.com/rgl-epfl/cholespy) exist specifically to fill
this gap, and downstream projects like
[Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and
[COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for
GPU-accelerated solvers through Eigen.
The `unsupported/Eigen/GPU` module aims to close this gap: existing Eigen users
should be able to move performance-critical dense linear algebra to the GPU with
minimal code changes and without learning the CUDA library APIs directly.
## Design philosophy
**CPU and GPU coexist.** There is no global compile-time switch that replaces
CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers
explicitly -- `gpu::LLT<double>` vs `LLT<MatrixXd>` -- and both coexist in
the same binary. This also lets users keep the factored matrix on device across
multiple solves, something impossible with compile-time replacement.
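For instance, a sketch using the host-in/host-out convenience overloads
documented below -- both flavors coexist in one translation unit:
```cpp
Eigen::LLT<MatrixXd> cpu_llt(A);    // CPU factorization, unchanged
gpu::LLT<double>     gpu_llt(A);    // GPU factorization, same binary
MatrixXd x_cpu = cpu_llt.solve(b);
MatrixXd x_gpu = gpu_llt.solve(b);  // uploads b, solves on device, syncs
```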
**Familiar syntax.** GPU operations use the same expression patterns as CPU
Eigen. Here is a side-by-side comparison:
```cpp
// ---- CPU (Eigen) ----            // ---- GPU (unsupported/Eigen/GPU) ----
#include <Eigen/Dense>              #define EIGEN_USE_GPU
                                    #include <unsupported/Eigen/GPU>
MatrixXd A = ...;                   auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
MatrixXd B = ...;                   auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
MatrixXd C = A * B;                 gpu::DeviceMatrix<double> d_C = d_A * d_B;
MatrixXd X = A.llt().solve(B);      gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);
                                    MatrixXd X = d_X.toHost();
```
The GPU version reads like CPU Eigen with explicit upload/download.
`operator*` dispatches to cuBLAS GEMM, `.llt().solve()` dispatches to
cuSOLVER potrf + potrs. Unsupported expressions are compile errors.
**Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core
expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.).
`DeviceMatrix` is not an Eigen expression type and does not inherit from
`MatrixBase`. The expression layer is a thin compile-time dispatch where every
supported expression maps to a single NVIDIA library call. There is no
coefficient-level evaluation, lazy fusion, or packet operations.
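As an illustration only (hypothetical names; the real wrappers live in
`DeviceExpr.h` and `DeviceDispatch.h`), the dispatch pattern looks roughly
like this:
```cpp
// Sketch of the wrapper idea -- not the actual implementation.
template <typename Scalar>
struct ProductExpr {                   // plain handle, not a MatrixBase
  const DeviceMatrix<Scalar>& lhs;
  const DeviceMatrix<Scalar>& rhs;
};

template <typename Scalar>
ProductExpr<Scalar> operator*(const DeviceMatrix<Scalar>& a,
                              const DeviceMatrix<Scalar>& b) {
  return {a, b};                       // no GPU work yet
}

// Assignment maps the handle onto exactly one library call:
//   DeviceMatrix<Scalar>& operator=(ProductExpr<Scalar> e);  // -> cublasXgemm
```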
**Explicit over implicit.** Host-device transfers, stream management, and
library handle lifetimes are visible in the API. There are no hidden
allocations or synchronizations except where documented (e.g., `toHost()` must
synchronize to deliver data to the host).
## Key concepts
### `DeviceMatrix<Scalar>`
A typed RAII wrapper for a dense column-major matrix in GPU device memory.
This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply
a `DeviceMatrix` with one column.
```cpp
// Upload from host
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
// Allocate uninitialized
gpu::DeviceMatrix<double> d_C(m, n);
// Download to host
MatrixXd C = d_C.toHost();
// Async download (returns a future)
auto transfer = d_C.toHostAsync();
// ... do other work ...
MatrixXd C2 = transfer.get();
```
`DeviceMatrix` supports expression methods that mirror Eigen's API:
`adjoint()`, `transpose()`, `triangularView<UpLo>()`,
`selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight
expression objects that are evaluated when assigned.
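The deferred evaluation is observable: building the expression does nothing,
and assignment triggers the library call.
```cpp
auto expr = d_A.adjoint() * d_B;  // cheap handle -- no cuBLAS call yet
d_C = expr;                       // cublasXgemm(transA=C, transB=N) fires here
```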
### `gpu::Context`
Every GPU operation needs a CUDA stream and library handles (cuBLAS,
cuSOLVER). `gpu::Context` bundles these together.
For simple usage, you don't need to create one -- a per-thread default context
is created lazily on first use:
```cpp
// These use the thread-local default context automatically
d_C = d_A * d_B;
d_X = d_A.llt().solve(d_B);
```
For concurrent multi-stream execution, create explicit contexts:
```cpp
gpu::Context ctx1, ctx2;
d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1
d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
```
## Usage
### Matrix operations (cuBLAS)
```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
gpu::DeviceMatrix<double> d_C = d_A * d_B;
d_C = d_A.adjoint() * d_B;
d_C = d_A * d_B.transpose();
// Scaled and accumulated
d_C += 2.0 * d_A * d_B; // alpha=2, beta=1
d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context)
// Triangular solve (TRSM)
d_X = d_A.triangularView<Lower>().solve(d_B);
// Symmetric/Hermitian multiply (SYMM/HEMM)
d_C = d_A.selfadjointView<Lower>() * d_B;
// Rank-k update (SYRK/HERK)
d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
```
### Dense solvers (cuSOLVER)
**One-shot expression syntax** -- Convenient, re-factorizes each time:
```cpp
// Cholesky solve (potrf + potrs)
d_X = d_A.llt().solve(d_B);
// LU solve (getrf + getrs)
d_Y = d_A.lu().solve(d_B);
```
**Cached factorization** -- Factor once, solve many times:
```cpp
gpu::LLT<double> llt;
llt.compute(d_A); // factorize (async)
if (llt.info() != Success) { ... } // lazy sync on first info() call
auto d_X1 = llt.solve(d_B1); // reuses factor (async)
auto d_X2 = llt.solve(d_B2); // reuses factor (async)
MatrixXd X2 = d_X2.toHost();
// LU with transpose solve
gpu::LU<double> lu;
lu.compute(d_A);
auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B
// QR solve (overdetermined least squares)
gpu::QR<double> qr(A); // host matrix input
MatrixXd X = qr.solve(B); // Q^H * B via ormqr, then trsm on R
// SVD
gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV);
VectorXd S = svd.singularValues();
MatrixXd U = svd.matrixU();
MatrixXd VT = svd.matrixVT();
MatrixXd X = svd.solve(B); // pseudoinverse solve
// SVD: device-side views (no D2H transfer; svd must outlive the views)
auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values
auto d_U = svd.d_matrixU(); // DeviceMatrix view of U
auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T
// Self-adjoint eigenvalue decomposition
gpu::SelfAdjointEigenSolver<double> es(A);
VectorXd eigenvals = es.eigenvalues();
MatrixXd eigenvecs = es.eigenvectors();
auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues
auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors
```
The cached API keeps the factored matrix on device, avoiding redundant
host-device transfers and re-factorizations. The `d_*` accessors return
non-owning `DeviceMatrix` views so downstream cuBLAS/cuSOLVER work can chain
without round-tripping through host memory.
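For example, a sketch (assuming conforming shapes) that feeds SVD factors
straight into a GEMM:
```cpp
gpu::SVD<double> svd(A);                      // ComputeThinU | ComputeThinV
auto d_U  = svd.d_matrixU();                  // non-owning device views
auto d_VT = svd.d_matrixVT();
gpu::DeviceMatrix<double> d_P = d_U * d_VT;   // cublasXgemm, stays on device
MatrixXd P = d_P.toHost();                    // svd must outlive d_U / d_VT
```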
### Precision control
GEMM dispatch uses `cublasXgemm` (type-specific Sgemm/Dgemm/Cgemm/Zgemm).
cuBLAS may internally use tensor cores depending on the GPU architecture,
matrix dimensions, and cuBLAS math-mode settings. No Eigen-specific macros
control this; call `cublasSetMathMode()` on the context's cuBLAS handle (or
set the `NVIDIA_TF32_OVERRIDE` environment variable) to configure tensor-core
behavior if needed.
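For example, a sketch that relaxes FP32 GEMM to allow TF32 tensor cores on
Ampere-class GPUs (`cublasSetMathMode` and `CUBLAS_TF32_TENSOR_OP_MATH` are
standard cuBLAS; the `gpu::Context` plumbing is this module's):
```cpp
gpu::Context ctx;
cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);
d_C.device(ctx) = d_A * d_B;  // FP32 GEMM on ctx may now use TF32 tensor cores
```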
### Stream control and async execution
Operations are asynchronous by default. The compute-solve chain runs without
host synchronization until you need a result on the host:
```text
fromHost(A) --sync--> compute() --async--> solve() --async--> toHost()
    H2D                 potrf                potrs               D2H
                                                                 sync
```
Mandatory sync points:
- `fromHost()` -- Synchronizes to complete the upload before returning
- `toHost()` / `HostTransfer::get()` -- Must deliver data to host
- `info()` -- Must read the factorization status
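Spelled out as code, a sketch of the chain above with a single blocking call
at the end:
```cpp
auto d_A = gpu::DeviceMatrix<double>::fromHost(A);  // syncs after H2D upload
gpu::LLT<double> llt;
llt.compute(d_A);                 // async potrf on llt's stream
auto d_X = llt.solve(d_B);        // async potrs, ordered behind potrf
auto h_X = d_X.toHostAsync();     // async D2H copy
MatrixXd X = h_X.get();           // the only blocking call in the chain
```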
**Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion
via CUDA events. When a matrix written on stream A is read on stream B, the
module automatically inserts `cudaStreamWaitEvent`. Same-stream operations
skip the wait (CUDA guarantees in-order execution within a stream).
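A sketch of the event handshake (assuming `d_C` and `d_Y` conform):
```cpp
gpu::Context ctx1, ctx2;
d_C.device(ctx1) = d_A * d_B;   // write on ctx1's stream; records an event
d_Y.device(ctx2) = d_C * d_B;   // read on ctx2's stream; the module inserts
                                // cudaStreamWaitEvent on d_C's write event
```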
**Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns
its CUDA stream, library handle, and the cached factor on device. Destroying
the factorization object while a `solve()` it issued is still in flight is
*correct* but not actually async: `cudaStreamDestroy` returns immediately,
but the destructor of the cached factor calls `cudaFree`, which is fully
device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
For genuine async pipelining keep the factorization object alive until you
have drained its results (e.g. via `toHost()` or by binding consumption to
an explicit `gpu::Context` that outlives both producer and consumer):
```cpp
gpu::LLT<double> llt(d_A); // factor stays on device
auto d_X = llt.solve(d_B);
auto h_x = d_X.toHostAsync(llt.stream());
h_x.get(); // sync: factor + result complete
// llt may now be destroyed without stalling the device
```
## Reference
### Supported scalar types
`float`, `double`, `std::complex<float>`, `std::complex<double>`.
### Expression -> library call mapping
| DeviceMatrix expression | Library call | Parameters |
|---|---|---|
| `C = A * B` | `cublasXgemm` | transA=N, transB=N, alpha=1, beta=0 |
| `C = A.adjoint() * B` | `cublasXgemm` | transA=C, transB=N |
| `C = A.transpose() * B` | `cublasXgemm` | transA=T, transB=N |
| `C = A * B.adjoint()` | `cublasXgemm` | transA=N, transB=C |
| `C = A * B.transpose()` | `cublasXgemm` | transA=N, transB=T |
| `C = alpha * A * B` | `cublasXgemm` | alpha from LHS |
| `C = A * (alpha * B)` | `cublasXgemm` | alpha from RHS |
| `C += A * B` | `cublasXgemm` | alpha=1, beta=1 |
| `C.device(ctx) -= A * B` | `cublasXgemm` | alpha=-1, beta=1 |
| `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs |
| `X = A.llt<Upper>().solve(B)` | same | uplo=Upper |
| `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs |
| `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit |
| `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo |
| `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N |
### `DeviceMatrix<Scalar>` API
| Method | Sync? | Description |
|--------|-------|-------------|
| `DeviceMatrix()` | -- | Empty (0x0) |
| `DeviceMatrix(rows, cols)` | -- | Allocate uninitialized |
| `fromHost(matrix, stream)` | yes | Upload from Eigen matrix |
| `fromHostAsync(ptr, rows, cols, stream)` | no | Async upload (caller manages lifetime) |
| `adopt(ptr, rows, cols)` | -- | Owning wrapper over a device pointer (caller relinquishes ownership) |
| `view(ptr, rows, cols)` | -- | Non-owning view over a device pointer (does not free on destruction) |
| `toHost(stream)` | yes | Synchronous download |
| `toHostAsync(stream)` | no | Returns `HostTransfer` future |
| `clone(stream)` | no | Device-to-device deep copy |
| `resize(rows, cols)` | -- | Discard contents, reallocate |
| `data()` | -- | Raw device pointer |
| `rows()`, `cols()` | -- | Dimensions |
| `sizeInBytes()` | -- | Total device allocation size in bytes |
| `empty()` | -- | True if 0x0 |
| `adjoint()` | -- | Adjoint view (GEMM ConjTrans) |
| `transpose()` | -- | Transpose view (GEMM Trans) |
| `llt()` / `llt<UpLo>()` | -- | Cholesky expression builder |
| `lu()` | -- | LU expression builder |
| `triangularView<UpLo>()` | -- | Triangular view (TRSM) |
| `selfadjointView<UpLo>()` | -- | Self-adjoint view (SYMM, rankUpdate) |
| `device(ctx)` | -- | Assignment proxy bound to context |
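A sketch of raw-pointer interop through `adopt` and `view` (standard
`cudaMalloc`; ownership semantics as in the table above):
```cpp
double* raw = nullptr;
cudaMalloc(&raw, sizeof(double) * m * n);                 // column-major m x n
auto owned = gpu::DeviceMatrix<double>::adopt(raw, m, n); // frees on destruction
auto alias = gpu::DeviceMatrix<double>::view(owned.data(), m, n); // no free
d_C = alias * d_B;                                        // usable in expressions
```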
### `gpu::Context`
Unified GPU execution context owning a CUDA stream and library handles.
```cpp
gpu::Context() // Creates dedicated stream + handles
static gpu::Context& threadLocal() // Per-thread default (lazy-created)
cudaStream_t stream()
cublasHandle_t cublasHandle()
cusolverDnHandle_t cusolverHandle()
```
Non-copyable, non-movable (owns library handles).
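A sketch of interleaving module work with a custom kernel on the same stream
(`myKernel`, `grid`, and `block` are hypothetical):
```cpp
auto& ctx = gpu::Context::threadLocal();     // per-thread default context
d_C.device(ctx) = d_A * d_B;                 // GEMM enqueued on ctx's stream
myKernel<<<grid, block, 0, ctx.stream()>>>(  // same stream => ordered after
    d_C.data(), d_C.rows(), d_C.cols());     // the GEMM, no explicit sync
```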
### `gpu::LLT<Scalar, UpLo>` API
GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.
| Method | Sync? | Description |
|--------|-------|-------------|
| `gpu::LLT(A)` | deferred | Construct and factorize from host matrix |
| `compute(host_matrix)` | deferred | Upload and factorize |
| `compute(DeviceMatrix)` | deferred | D2D copy and factorize |
| `compute(DeviceMatrix&&)` | deferred | Move-adopt and factorize (no copy) |
| `solve(host_matrix)` | yes | Solve, return host matrix |
| `solve(DeviceMatrix)` | no | Solve, return `DeviceMatrix` (async) |
| `info()` | lazy | Syncs stream on first call, returns `Success` or `NumericalIssue` |
### `gpu::LU<Scalar>` API
GPU dense partial-pivoting LU via cuSOLVER. Same pattern as `gpu::LLT`, plus a
`gpu::GpuOp` parameter on `solve()` (`NoTrans`, `Trans`, `ConjTrans`).
### `gpu::QR<Scalar>` API
GPU dense QR decomposition via cuSOLVER (`geqrf`). Solve uses `ormqr` (apply
Q^H) + `trsm` (back-substitute on R) -- Q is never formed explicitly.
| Method | Description |
|--------|-------------|
| `gpu::QR()` | Default construct, then call `compute()` |
| `gpu::QR(A)` | Construct and factorize from host matrix |
| `compute(A)` | Upload + factorize |
| `compute(DeviceMatrix)` | D2D copy + factorize |
| `solve(host_matrix)` | Solve, return host matrix (syncs) |
| `solve(DeviceMatrix)` | Solve, return `DeviceMatrix` (async) |
| `matrixR()` | Download upper-triangular factor R to host (m >= n only) |
| `info()` | Lazy sync |
| `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream |
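A sketch of the least-squares flow (assuming m >= n):
```cpp
gpu::QR<double> qr(A);        // geqrf on device; A is m x n
MatrixXd x = qr.solve(b);     // ormqr applies Q^H to b, trsm solves R x = c
MatrixXd R = qr.matrixR();    // upper-triangular factor, host copy
if (qr.info() != Success) { /* factorization failed */ }
```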
### `gpu::SVD<Scalar>` API
GPU dense SVD via cuSOLVER (`gesvd`). Supports thin, full, and values-only
modes via Eigen's `ComputeThinU | ComputeThinV`, `ComputeFullU | ComputeFullV`,
or `0` (values only).
| Method | Description |
|--------|-------------|
| `gpu::SVD()` | Default construct, then call `compute()` |
| `gpu::SVD(A, options)` | Construct and compute (options default: `ComputeThinU \| ComputeThinV`) |
| `compute(A, options)` | Compute from host matrix |
| `compute(DeviceMatrix, options)` | Compute from device matrix |
| `singularValues()` | Download singular values to host |
| `matrixU()` | Download U to host (requires `ComputeThinU` or `ComputeFullU`) |
| `matrixV()` | Download V to host (matches `JacobiSVD::matrixV()`) |
| `matrixVT()` | Download V^T to host (requires `ComputeThinV` or `ComputeFullV`) |
| `d_singularValues()` | `DeviceMatrix` view of singular values (zero-copy) |
| `d_matrixU()` | `DeviceMatrix` view of U (zero-copy when m >= n) |
| `d_matrixVT()` | `DeviceMatrix` view of V^T (zero-copy when m >= n) |
| `solve(B)` | Pseudoinverse solve (returns host matrix) |
| `solve(B, k)` | Truncated solve (top k singular triplets) |
| `solve(B, lambda)` | Tikhonov regularized solve |
| `rank(threshold)` | Count singular values above threshold |
| `info()` | Lazy sync |
| `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream |
Wide matrices (m < n) are handled by internally transposing via cuBLAS `geam`;
in that case `d_matrixU()` and `d_matrixVT()` allocate an owning result via a
single `cublasXgeam` adjoint pass. `d_singularValues()` is always zero-copy.
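A sketch exercising the solve variants from one factorization (the truncated
vs. Tikhonov overloads presumably disambiguate on an integer vs.
floating-point second argument):
```cpp
gpu::SVD<double> svd(A);                 // thin U/V by default
MatrixXd x_pinv  = svd.solve(b);         // pseudoinverse solve
MatrixXd x_trunc = svd.solve(b, 10);     // top-10 singular triplets only
MatrixXd x_tikh  = svd.solve(b, 1e-3);   // Tikhonov, lambda = 1e-3
int r = svd.rank(1e-12 * svd.singularValues()(0));  // relative threshold
```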
### `gpu::SelfAdjointEigenSolver<Scalar>` API
GPU symmetric/Hermitian eigenvalue decomposition via cuSOLVER (`syevd`).
| Method | Description |
|--------|-------------|
| `gpu::SelfAdjointEigenSolver()` | Default construct, then call `compute()` |
| `gpu::SelfAdjointEigenSolver(A, mode)` | Construct and compute (mode default: `ComputeEigenvectors`) |
| `compute(A, mode)` | Compute from host matrix |
| `compute(DeviceMatrix, mode)` | Compute from device matrix |
| `eigenvalues()` | Download eigenvalues to host (ascending order) |
| `eigenvectors()` | Download eigenvectors to host (columns) |
| `d_eigenvalues()` | `DeviceMatrix` view of eigenvalues (zero-copy) |
| `d_eigenvectors()` | `DeviceMatrix` view of eigenvectors (zero-copy, requires `ComputeEigenvectors`) |
| `info()` | Lazy sync |
| `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream |
`ComputeMode`: `gpu::SelfAdjointEigenSolver::EigenvaluesOnly` or
`gpu::SelfAdjointEigenSolver::ComputeEigenvectors`.
The `d_*` accessors return non-owning views into the solver's internal device
buffers; the solver must outlive any view derived from it.
### `HostTransfer<Scalar>` API
Future for async device-to-host transfer.
| Method | Description |
|--------|-------------|
| `get()` | Block until transfer completes, return host matrix reference. Idempotent. |
| `ready()` | Non-blocking poll |
### Aliasing
Unlike Eigen's `Matrix`, where omitting `.noalias()` triggers a copy to a
temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have
no built-in aliasing protection. All operations are implicitly noalias.
The caller must ensure operands don't alias the destination for GEMM, TRSM,
SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before
dispatching to cuBLAS.
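A sketch of the trap and the workaround:
```cpp
// CPU Eigen: A = A * B is safe -- the product evaluates into a temporary.
// Here, d_A = d_A * d_B would hand cuBLAS the same device pointer as both
// GEMM input and output. Evaluate into a fresh matrix instead:
gpu::DeviceMatrix<double> d_AB = d_A * d_B;  // new allocation, no aliasing
```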
## Future work
- **Reassess host-input vs. device-input API surface.** Each solver currently
exposes both host-input (`compute(MatrixXd)`, `solve(MatrixXd)`) and
device-input (`compute(DeviceMatrix)`, `solve(DeviceMatrix)`) overloads, plus
host- and device-side accessors (`matrixU()` vs `d_matrixU()`). This eases
migration from CPU Eigen but may invite accidental host ↔ device round-trips
when users mix the two without realizing the cost. Revisit after the initial
GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users'
hands; if the convenience overloads cause more confusion than they save,
narrow toward a single explicit `fromHost` / `toHost` boundary.
## File layout
| File | Depends on | Contents |
|------|-----------|----------|
| `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `PinnedHostBuffer`, `cuda_data_type<>` |
| `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` |
| `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers |
| `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers |
| `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) |
| `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` |
| `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` |
| `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps |
| `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping |
| `GpuSolverContext.h` | `CuSolverSupport.h`, `CuBlasSupport.h` | Shared solver context (stream, handles, scratch) |
| `GpuLLT.h` | `GpuSolverContext.h` | Cached dense Cholesky factorization |
| `GpuLU.h` | `GpuSolverContext.h` | Cached dense LU factorization |
| `GpuQR.h` | `GpuSolverContext.h` | Dense QR decomposition |
| `GpuSVD.h` | `GpuSolverContext.h` | Dense SVD decomposition |
| `GpuEigenSolver.h` | `GpuSolverContext.h` | Self-adjoint eigenvalue decomposition |
## Building and testing
```bash
cmake -G Ninja -B build -S . \
-DEIGEN_TEST_CUDA=ON \
-DEIGEN_CUDA_COMPUTE_ARCH="70"
cmake --build build --target cublas cusolver_llt cusolver_lu \
cusolver_qr cusolver_svd cusolver_eigen device_matrix
ctest --test-dir build -L gpu --output-on-failure
```
`EIGEN_TEST_CUBLAS` and `EIGEN_TEST_CUSOLVER` default to ON when CUDA is enabled
(cuBLAS and cuSOLVER are part of the CUDA toolkit).