unsupported/Eigen/src/GPU/README.md - mirror - Git at Google

 # Eigen GPU Module (`unsupported/Eigen/GPU`)

 GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA
 CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link
 against CUDA runtime, cuBLAS, and cuSOLVER).

 ## Why this module

 Eigen is the linear algebra foundation for a large ecosystem of C++ projects
 in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP,
 Open3D), scientific computing (Ceres, Stan), and beyond. Many of these
 projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations
 without dropping down to raw CUDA library APIs. Third-party projects like
 [EigenCuda](https://github.com/NLESC-JCER/EigenCuda) and
 [cholespy](https://github.com/rgl-epfl/cholespy) exist specifically to fill
 this gap, and downstream projects like
 [Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and
 [COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for
 GPU-accelerated solvers through Eigen.

 The `unsupported/Eigen/GPU` module aims to close this gap: Existing Eigen users should be
 able to move performance-critical dense linear algebra to the GPU with minimal
 code changes and without learning CUDA library APIs directly.

 ## Design philosophy

 **CPU and GPU coexist.** There is no global compile-time switch that replaces
 CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers
 explicitly -- `gpu::LLT<double>` vs `Eigen::LLT<MatrixXd>` -- and both coexist in
 the same binary. This also lets users keep the factored matrix on device across
 multiple solves, something impossible with compile-time replacement.

 **Familiar syntax.** GPU operations use the same expression patterns as CPU
 Eigen. Here is a side-by-side comparison:

 ```cpp
 // ---- CPU (Eigen) ----               // ---- GPU (unsupported/Eigen/GPU) ----
 #include <Eigen/Dense>                  #define EIGEN_USE_GPU
                                         #include <unsupported/Eigen/GPU>

 MatrixXd A = ...;                       auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
 MatrixXd B = ...;                       auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

 MatrixXd C = A * B;                     gpu::DeviceMatrix<double> d_C = d_A * d_B;
 MatrixXd X = A.llt().solve(B);          gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);

                                         MatrixXd X = d_X.toHost();
 ```

 The GPU version reads like CPU Eigen with explicit upload/download.
 `operator*` dispatches to cuBLAS GEMM, `.llt().solve()` dispatches to
 cuSOLVER potrf + potrs. Unsupported expressions are compile errors.

 **Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core
 expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.).
 `DeviceMatrix` is not an Eigen expression type and does not inherit from
 `MatrixBase`. The expression layer is a thin compile-time dispatch where every
 supported expression maps to a single NVIDIA library call. There is no
 coefficient-level evaluation, lazy fusion, or packet operations.

 **Explicit over implicit.** Host-device transfers, stream management, and
 library handle lifetimes are visible in the API. There are no hidden
 allocations or synchronizations except where documented (e.g., `toHost()` must
 synchronize to deliver data to the host).

 ## Key concepts

 ### `DeviceMatrix<Scalar>`

 A typed RAII wrapper for a dense column-major matrix in GPU device memory.
 This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply
 a `DeviceMatrix` with one column.

 ```cpp
 // Upload from host
 auto d_A = gpu::DeviceMatrix<double>::fromHost(A);

 // Allocate uninitialized
 gpu::DeviceMatrix<double> d_C(m, n);

 // Download to host
 MatrixXd C = d_C.toHost();

 // Async download (returns a future)
 auto transfer = d_C.toHostAsync();
 // ... do other work ...
 MatrixXd C = transfer.get();
 ```

 `DeviceMatrix` supports expression methods that mirror Eigen's API:
 `adjoint()`, `transpose()`, `triangularView<UpLo>()`,
 `selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight
 expression objects that are evaluated when assigned.

 ### `gpu::Context`

 Every GPU operation needs a CUDA stream and library handles (cuBLAS,
 cuSOLVER). `gpu::Context` bundles these together.

 For simple usage, you don't need to create one -- a per-thread default context
 is created lazily on first use:

 ```cpp
 // These use the thread-local default context automatically
 d_C = d_A * d_B;
 d_X = d_A.llt().solve(d_B);
 ```

 For concurrent multi-stream execution, create explicit contexts:

 ```cpp
 gpu::Context ctx1, ctx2;
 d_C1.device(ctx1) = d_A1 * d_B1;   // runs on stream 1
 d_C2.device(ctx2) = d_A2 * d_B2;   // runs on stream 2 (concurrently)
 ```

 ## Usage

 ### Matrix operations (cuBLAS)

 ```cpp
 auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
 auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

 // GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
 gpu::DeviceMatrix<double> d_C = d_A * d_B;
 d_C = d_A.adjoint() * d_B;
 d_C = d_A * d_B.transpose();

 // Scaled and accumulated
 d_C += 2.0 * d_A * d_B;             // alpha=2, beta=1
 d_C.device(ctx) -= d_A * d_B;       // alpha=-1, beta=1 (requires explicit context)

 // Triangular solve (TRSM)
 d_X = d_A.triangularView<Lower>().solve(d_B);

 // Symmetric/Hermitian multiply (SYMM/HEMM)
 d_C = d_A.selfadjointView<Lower>() * d_B;

 // Rank-k update (SYRK/HERK)
 d_C.selfadjointView<Lower>().rankUpdate(d_A);  // C += A * A^H
 ```

 ### Dense solvers (cuSOLVER)

 **One-shot expression syntax** -- Convenient, re-factorizes each time:

 ```cpp
 // Cholesky solve (potrf + potrs)
 d_X = d_A.llt().solve(d_B);

 // LU solve (getrf + getrs)
 d_Y = d_A.lu().solve(d_B);
 ```

 **Cached factorization** -- Factor once, solve many times:

 ```cpp
 gpu::LLT<double> llt;
 llt.compute(d_A);                    // factorize (async)
 if (llt.info() != Success) { ... }   // lazy sync on first info() call
 auto d_X1 = llt.solve(d_B1);        // reuses factor (async)
 auto d_X2 = llt.solve(d_B2);        // reuses factor (async)
 MatrixXd X2 = d_X2.toHost();

 // LU with transpose solve
 gpu::LU<double> lu;
 lu.compute(d_A);
 auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans);           // A^T Y = B
 ```

 The cached API keeps the factored matrix on device, avoiding redundant
 host-device transfers and re-factorizations.

 ### Stream control and async execution

 Operations are asynchronous by default. The compute-solve chain runs without
 host synchronization until you need a result on the host:

 ```text
 fromHost(A) --sync-->  compute() --async-->  solve() --async-->  toHost()
    H2D                  potrf                 potrs                D2H
                                                                    sync
 ```

 Mandatory sync points:
 - `fromHost()` -- Synchronizes to complete the upload before returning
 - `toHost()` / `HostTransfer::get()` -- Must deliver data to host
 - `info()` -- Must read the factorization status

 **Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion
 via CUDA events. When a matrix written on stream A is read on stream B, the
 module automatically inserts `cudaStreamWaitEvent`. Same-stream operations
 skip the wait (CUDA guarantees in-order execution within a stream).

 **Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns
 its CUDA stream, library handle, and the cached factor on device. Destroying
 the factorization object while a `solve()` it issued is still in flight is
 *correct* but not actually async: `cudaStreamDestroy` returns immediately,
 but the destructor of the cached factor calls `cudaFree`, which is fully
 device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
 For genuine async pipelining keep the factorization object alive until you
 have drained its results (e.g. via `toHost()` or by binding consumption to
 an explicit `gpu::Context` that outlives both producer and consumer):

 ```cpp
 gpu::LLT<double> llt(d_A);             // factor stays on device
 auto d_X = llt.solve(d_B);
 auto h_x = d_X.toHostAsync(llt.stream());
 h_x.get();                             // sync: factor + result complete
 // llt may now be destroyed without stalling the device
 ```

 ## Reference

 ### Supported scalar types

 `float`, `double`, `std::complex<float>`, `std::complex<double>`.

 ### Expression -> library call mapping

 | DeviceMatrix expression | Library call | Parameters |
 |---|---|---|
 | `C = A * B` | `cublasGemmEx` | transA=N, transB=N, alpha=1, beta=0 |
 | `C = A.adjoint() * B` | `cublasGemmEx` | transA=C, transB=N |
 | `C = A.transpose() * B` | `cublasGemmEx` | transA=T, transB=N |
 | `C = A * B.adjoint()` | `cublasGemmEx` | transA=N, transB=C |
 | `C = A * B.transpose()` | `cublasGemmEx` | transA=N, transB=T |
 | `C = alpha * A * B` | `cublasGemmEx` | alpha from LHS |
 | `C = A * (alpha * B)` | `cublasGemmEx` | alpha from RHS |
 | `C += A * B` | `cublasGemmEx` | alpha=1, beta=1 |
 | `C.device(ctx) -= A * B` | `cublasGemmEx` | alpha=-1, beta=1 |
 | `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs |
 | `X = A.llt<Upper>().solve(B)` | same | uplo=Upper |
 | `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs |
 | `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit |
 | `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo |
 | `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N |

 ### `DeviceMatrix<Scalar>` API

 | Method | Sync? | Description |
 |--------|-------|-------------|
 | `DeviceMatrix()` | -- | Empty (0x0) |
 | `DeviceMatrix(rows, cols)` | -- | Allocate uninitialized |
 | `fromHost(matrix, stream)` | yes | Upload from Eigen matrix |
 | `fromHostAsync(ptr, rows, cols, outerStride, stream)` | no | Async upload (caller manages lifetime) |
 | `toHost(stream)` | yes | Synchronous download |
 | `toHostAsync(stream)` | no | Returns `HostTransfer` future |
 | `clone(stream)` | no | Device-to-device deep copy |
 | `resize(rows, cols)` | -- | Discard contents, reallocate |
 | `data()` | -- | Raw device pointer |
 | `rows()`, `cols()` | -- | Dimensions |
 | `sizeInBytes()` | -- | Total device allocation size in bytes |
 | `empty()` | -- | True if 0x0 |
 | `adjoint()` | -- | Adjoint view (GEMM ConjTrans) |
 | `transpose()` | -- | Transpose view (GEMM Trans) |
 | `llt()` / `llt<UpLo>()` | -- | Cholesky expression builder |
 | `lu()` | -- | LU expression builder |
 | `triangularView<UpLo>()` | -- | Triangular view (TRSM) |
 | `selfadjointView<UpLo>()` | -- | Self-adjoint view (SYMM, rankUpdate) |
 | `device(ctx)` | -- | Assignment proxy bound to context |

 ### `gpu::Context`

 Unified GPU execution context owning a CUDA stream and library handles.

 ```cpp
 gpu::Context()                                             // Creates dedicated stream + handles
 static gpu::Context& threadLocal()                         // Per-thread default (lazy-created)

 cudaStream_t       stream()
 cublasHandle_t     cublasHandle()
 cusolverDnHandle_t cusolverHandle()
 ```

 Non-copyable, non-movable (owns library handles).

 ### `gpu::LLT<Scalar, UpLo>` API

 GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.

 | Method | Sync? | Description |
 |--------|-------|-------------|
 | `gpu::LLT(A)` | deferred | Construct and factorize from host matrix |
 | `compute(host_matrix)` | deferred | Upload and factorize |
 | `compute(DeviceMatrix)` | deferred | D2D copy and factorize |
 | `compute(DeviceMatrix&&)` | deferred | Move-adopt and factorize (no copy) |
 | `solve(host_matrix)` | yes | Solve, return host matrix |
 | `solve(DeviceMatrix)` | no | Solve, return `DeviceMatrix` (async) |
 | `info()` | lazy | Syncs stream on first call, returns `Success` or `NumericalIssue` |

 ### `gpu::LU<Scalar>` API

 GPU dense partial-pivoting LU via cuSOLVER. Same pattern as `gpu::LLT`, plus a
 `gpu::GpuOp` parameter on `solve()` (`NoTrans`, `Trans`, `ConjTrans`).

 ### `HostTransfer<Scalar>` API

 Future for async device-to-host transfer.

 | Method | Description |
 |--------|-------------|
 | `get()` | Block until transfer completes, return host matrix reference. Idempotent. |
 | `ready()` | Non-blocking poll |

 ### Aliasing

 Unlike Eigen's `Matrix`, where omitting `.noalias()` triggers a copy to a
 temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have
 no built-in aliasing protection. All operations are implicitly noalias.
 The caller must ensure operands don't alias the destination for GEMM, TRSM,
 SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before
 dispatching to cuBLAS.

 ## File layout

 | File | Depends on | Contents |
 |------|-----------|----------|
 | `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `PinnedHostBuffer`, `cuda_data_type<>` |
 | `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` |
 | `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers |
 | `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers |
 | `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) |
 | `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` |
 | `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` |
 | `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps |
 | `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping |
 | `GpuLLT.h` | `CuSolverSupport.h` | Cached dense Cholesky factorization |
 | `GpuLU.h` | `CuSolverSupport.h` | Cached dense LU factorization |

 ## Building and testing

 ```bash
 cmake -G Ninja -B build -S . \
   -DEIGEN_TEST_CUDA=ON \
   -DEIGEN_CUDA_COMPUTE_ARCH="70"

 cmake --build build --target cublas cusolver_llt cusolver_lu device_matrix
 ctest --test-dir build -L gpu --output-on-failure
 ```

 `EIGEN_TEST_CUBLAS` and `EIGEN_TEST_CUSOLVER` default to ON when CUDA is enabled
 (cuBLAS and cuSOLVER are part of the CUDA toolkit).
	# Eigen GPU Module (`unsupported/Eigen/GPU`)

	GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA
	CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link
	against CUDA runtime, cuBLAS, and cuSOLVER).

	## Why this module

	Eigen is the linear algebra foundation for a large ecosystem of C++ projects
	in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP,
	Open3D), scientific computing (Ceres, Stan), and beyond. Many of these
	projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations
	without dropping down to raw CUDA library APIs. Third-party projects like
	[EigenCuda](https://github.com/NLESC-JCER/EigenCuda) and
	[cholespy](https://github.com/rgl-epfl/cholespy) exist specifically to fill
	this gap, and downstream projects like
	[Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and
	[COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for
	GPU-accelerated solvers through Eigen.

	The `unsupported/Eigen/GPU` module aims to close this gap: Existing Eigen users should be
	able to move performance-critical dense linear algebra to the GPU with minimal
	code changes and without learning CUDA library APIs directly.

	## Design philosophy

	CPU and GPU coexist. There is no global compile-time switch that replaces
	CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers
	explicitly -- `gpu::LLT<double>` vs `Eigen::LLT<MatrixXd>` -- and both coexist in
	the same binary. This also lets users keep the factored matrix on device across
	multiple solves, something impossible with compile-time replacement.

	Familiar syntax. GPU operations use the same expression patterns as CPU
	Eigen. Here is a side-by-side comparison:

	```cpp
	// ---- CPU (Eigen) ---- // ---- GPU (unsupported/Eigen/GPU) ----
	#include <Eigen/Dense> #define EIGEN_USE_GPU
	#include <unsupported/Eigen/GPU>

	MatrixXd A = ...; auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
	MatrixXd B = ...; auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

	MatrixXd C = A * B; gpu::DeviceMatrix<double> d_C = d_A * d_B;
	MatrixXd X = A.llt().solve(B); gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);

	MatrixXd X = d_X.toHost();
	```

	The GPU version reads like CPU Eigen with explicit upload/download.
	`operator*` dispatches to cuBLAS GEMM, `.llt().solve()` dispatches to
	cuSOLVER potrf + potrs. Unsupported expressions are compile errors.

	Standalone module. `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core
	expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.).
	`DeviceMatrix` is not an Eigen expression type and does not inherit from
	`MatrixBase`. The expression layer is a thin compile-time dispatch where every
	supported expression maps to a single NVIDIA library call. There is no
	coefficient-level evaluation, lazy fusion, or packet operations.

	Explicit over implicit. Host-device transfers, stream management, and
	library handle lifetimes are visible in the API. There are no hidden
	allocations or synchronizations except where documented (e.g., `toHost()` must
	synchronize to deliver data to the host).

	## Key concepts

	### `DeviceMatrix<Scalar>`

	A typed RAII wrapper for a dense column-major matrix in GPU device memory.
	This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply
	a `DeviceMatrix` with one column.

	```cpp
	// Upload from host
	auto d_A = gpu::DeviceMatrix<double>::fromHost(A);

	// Allocate uninitialized
	gpu::DeviceMatrix<double> d_C(m, n);

	// Download to host
	MatrixXd C = d_C.toHost();

	// Async download (returns a future)
	auto transfer = d_C.toHostAsync();
	// ... do other work ...
	MatrixXd C = transfer.get();
	```

	`DeviceMatrix` supports expression methods that mirror Eigen's API:
	`adjoint()`, `transpose()`, `triangularView<UpLo>()`,
	`selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight
	expression objects that are evaluated when assigned.

	### `gpu::Context`

	Every GPU operation needs a CUDA stream and library handles (cuBLAS,
	cuSOLVER). `gpu::Context` bundles these together.

	For simple usage, you don't need to create one -- a per-thread default context
	is created lazily on first use:

	```cpp
	// These use the thread-local default context automatically
	d_C = d_A * d_B;
	d_X = d_A.llt().solve(d_B);
	```

	For concurrent multi-stream execution, create explicit contexts:

	```cpp
	gpu::Context ctx1, ctx2;
	d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1
	d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently)
	```

	## Usage

	### Matrix operations (cuBLAS)

	```cpp
	auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
	auto d_B = gpu::DeviceMatrix<double>::fromHost(B);

	// GEMM: C = A * B, C = A^H * B, C = A * B^T, ...
	gpu::DeviceMatrix<double> d_C = d_A * d_B;
	d_C = d_A.adjoint() * d_B;
	d_C = d_A * d_B.transpose();

	// Scaled and accumulated
	d_C += 2.0 * d_A * d_B; // alpha=2, beta=1
	d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context)

	// Triangular solve (TRSM)
	d_X = d_A.triangularView<Lower>().solve(d_B);

	// Symmetric/Hermitian multiply (SYMM/HEMM)
	d_C = d_A.selfadjointView<Lower>() * d_B;

	// Rank-k update (SYRK/HERK)
	d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H
	```

	### Dense solvers (cuSOLVER)

	One-shot expression syntax -- Convenient, re-factorizes each time:

	```cpp
	// Cholesky solve (potrf + potrs)
	d_X = d_A.llt().solve(d_B);

	// LU solve (getrf + getrs)
	d_Y = d_A.lu().solve(d_B);
	```

	Cached factorization -- Factor once, solve many times:

	```cpp
	gpu::LLT<double> llt;
	llt.compute(d_A); // factorize (async)
	if (llt.info() != Success) { ... } // lazy sync on first info() call
	auto d_X1 = llt.solve(d_B1); // reuses factor (async)
	auto d_X2 = llt.solve(d_B2); // reuses factor (async)
	MatrixXd X2 = d_X2.toHost();

	// LU with transpose solve
	gpu::LU<double> lu;
	lu.compute(d_A);
	auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B
	```

	The cached API keeps the factored matrix on device, avoiding redundant
	host-device transfers and re-factorizations.

	### Stream control and async execution

	Operations are asynchronous by default. The compute-solve chain runs without
	host synchronization until you need a result on the host:

	```text
	fromHost(A) --sync--> compute() --async--> solve() --async--> toHost()
	H2D potrf potrs D2H
	sync
	```

	Mandatory sync points:
	- `fromHost()` -- Synchronizes to complete the upload before returning
	- `toHost()` / `HostTransfer::get()` -- Must deliver data to host
	- `info()` -- Must read the factorization status

	Cross-stream safety is automatic. `DeviceMatrix` tracks write completion
	via CUDA events. When a matrix written on stream A is read on stream B, the
	module automatically inserts `cudaStreamWaitEvent`. Same-stream operations
	skip the wait (CUDA guarantees in-order execution within a stream).

	Lifetime of cached factorizations. A `gpu::LLT` / `gpu::LU` object owns
	its CUDA stream, library handle, and the cached factor on device. Destroying
	the factorization object while a `solve()` it issued is still in flight is
	correct but not actually async: `cudaStreamDestroy` returns immediately,
	but the destructor of the cached factor calls `cudaFree`, which is fully
	device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
	For genuine async pipelining keep the factorization object alive until you
	have drained its results (e.g. via `toHost()` or by binding consumption to
	an explicit `gpu::Context` that outlives both producer and consumer):

	```cpp
	gpu::LLT<double> llt(d_A); // factor stays on device
	auto d_X = llt.solve(d_B);
	auto h_x = d_X.toHostAsync(llt.stream());
	h_x.get(); // sync: factor + result complete
	// llt may now be destroyed without stalling the device
	```

	## Reference

	### Supported scalar types

	`float`, `double`, `std::complex<float>`, `std::complex<double>`.

	### Expression -> library call mapping

	\| DeviceMatrix expression \| Library call \| Parameters \|
	\|---\|---\|---\|
	\| `C = A * B` \| `cublasGemmEx` \| transA=N, transB=N, alpha=1, beta=0 \|
	\| `C = A.adjoint() * B` \| `cublasGemmEx` \| transA=C, transB=N \|
	\| `C = A.transpose() * B` \| `cublasGemmEx` \| transA=T, transB=N \|
	\| `C = A * B.adjoint()` \| `cublasGemmEx` \| transA=N, transB=C \|
	\| `C = A * B.transpose()` \| `cublasGemmEx` \| transA=N, transB=T \|
	\| `C = alpha * A * B` \| `cublasGemmEx` \| alpha from LHS \|
	\| `C = A * (alpha * B)` \| `cublasGemmEx` \| alpha from RHS \|
	\| `C += A * B` \| `cublasGemmEx` \| alpha=1, beta=1 \|
	\| `C.device(ctx) -= A * B` \| `cublasGemmEx` \| alpha=-1, beta=1 \|
	\| `X = A.llt().solve(B)` \| `cusolverDnXpotrf` + `Xpotrs` \| uplo, n, nrhs \|
	\| `X = A.llt<Upper>().solve(B)` \| same \| uplo=Upper \|
	\| `X = A.lu().solve(B)` \| `cusolverDnXgetrf` + `Xgetrs` \| n, nrhs \|
	\| `X = A.triangularView<L>().solve(B)` \| `cublasXtrsm` \| side=L, uplo, diag=NonUnit \|
	\| `C = A.selfadjointView<L>() * B` \| `cublasXsymm` / `cublasXhemm` \| side=L, uplo \|
	\| `C.selfadjointView<L>().rankUpdate(A)` \| `cublasXsyrk` / `cublasXherk` \| uplo, trans=N \|

	### `DeviceMatrix<Scalar>` API

	\| Method \| Sync? \| Description \|
	\|--------\|-------\|-------------\|
	\| `DeviceMatrix()` \| -- \| Empty (0x0) \|
	\| `DeviceMatrix(rows, cols)` \| -- \| Allocate uninitialized \|
	\| `fromHost(matrix, stream)` \| yes \| Upload from Eigen matrix \|
	\| `fromHostAsync(ptr, rows, cols, outerStride, stream)` \| no \| Async upload (caller manages lifetime) \|
	\| `toHost(stream)` \| yes \| Synchronous download \|
	\| `toHostAsync(stream)` \| no \| Returns `HostTransfer` future \|
	\| `clone(stream)` \| no \| Device-to-device deep copy \|
	\| `resize(rows, cols)` \| -- \| Discard contents, reallocate \|
	\| `data()` \| -- \| Raw device pointer \|
	\| `rows()`, `cols()` \| -- \| Dimensions \|
	\| `sizeInBytes()` \| -- \| Total device allocation size in bytes \|
	\| `empty()` \| -- \| True if 0x0 \|
	\| `adjoint()` \| -- \| Adjoint view (GEMM ConjTrans) \|
	\| `transpose()` \| -- \| Transpose view (GEMM Trans) \|
	\| `llt()` / `llt<UpLo>()` \| -- \| Cholesky expression builder \|
	\| `lu()` \| -- \| LU expression builder \|
	\| `triangularView<UpLo>()` \| -- \| Triangular view (TRSM) \|
	\| `selfadjointView<UpLo>()` \| -- \| Self-adjoint view (SYMM, rankUpdate) \|
	\| `device(ctx)` \| -- \| Assignment proxy bound to context \|

	### `gpu::Context`

	Unified GPU execution context owning a CUDA stream and library handles.

	```cpp
	gpu::Context() // Creates dedicated stream + handles
	static gpu::Context& threadLocal() // Per-thread default (lazy-created)

	cudaStream_t stream()
	cublasHandle_t cublasHandle()
	cusolverDnHandle_t cusolverHandle()
	```

	Non-copyable, non-movable (owns library handles).

	### `gpu::LLT<Scalar, UpLo>` API

	GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device.

	\| Method \| Sync? \| Description \|
	\|--------\|-------\|-------------\|
	\| `gpu::LLT(A)` \| deferred \| Construct and factorize from host matrix \|
	\| `compute(host_matrix)` \| deferred \| Upload and factorize \|
	\| `compute(DeviceMatrix)` \| deferred \| D2D copy and factorize \|
	\| `compute(DeviceMatrix&&)` \| deferred \| Move-adopt and factorize (no copy) \|
	\| `solve(host_matrix)` \| yes \| Solve, return host matrix \|
	\| `solve(DeviceMatrix)` \| no \| Solve, return `DeviceMatrix` (async) \|
	\| `info()` \| lazy \| Syncs stream on first call, returns `Success` or `NumericalIssue` \|

	### `gpu::LU<Scalar>` API

	GPU dense partial-pivoting LU via cuSOLVER. Same pattern as `gpu::LLT`, plus a
	`gpu::GpuOp` parameter on `solve()` (`NoTrans`, `Trans`, `ConjTrans`).

	### `HostTransfer<Scalar>` API

	Future for async device-to-host transfer.

	\| Method \| Description \|
	\|--------\|-------------\|
	\| `get()` \| Block until transfer completes, return host matrix reference. Idempotent. \|
	\| `ready()` \| Non-blocking poll \|

	### Aliasing

	Unlike Eigen's `Matrix`, where omitting `.noalias()` triggers a copy to a
	temporary, DeviceMatrix dispatches directly to NVIDIA library calls which have
	no built-in aliasing protection. All operations are implicitly noalias.
	The caller must ensure operands don't alias the destination for GEMM, TRSM,
	SYMM/HEMM, and SYRK/HERK. Debug builds assert on these violations before
	dispatching to cuBLAS.

	## File layout

	\| File \| Depends on \| Contents \|
	\|------\|-----------\|----------\|
	\| `GpuSupport.h` \| `<cuda_runtime.h>` \| Error macro, `DeviceBuffer`, `PinnedHostBuffer`, `cuda_data_type<>` \|
	\| `DeviceMatrix.h` \| `GpuSupport.h` \| `DeviceMatrix<>`, `HostTransfer<>` \|
	\| `DeviceExpr.h` \| `DeviceMatrix.h` \| GEMM expression wrappers \|
	\| `DeviceBlasExpr.h` \| `DeviceMatrix.h` \| TRSM, SYMM, SYRK expression wrappers \|
	\| `DeviceSolverExpr.h` \| `DeviceMatrix.h` \| Solver expression wrappers (LLT, LU) \|
	\| `DeviceDispatch.h` \| all above \| All dispatch functions + `Assignment` \|
	\| `GpuContext.h` \| `CuBlasSupport.h`, `CuSolverSupport.h` \| `gpu::Context` \|
	\| `CuBlasSupport.h` \| `GpuSupport.h`, `<cublas_v2.h>` \| cuBLAS error macro, op/compute type maps \|
	\| `CuSolverSupport.h` \| `GpuSupport.h`, `<cusolverDn.h>` \| cuSOLVER params, fill-mode mapping \|
	\| `GpuLLT.h` \| `CuSolverSupport.h` \| Cached dense Cholesky factorization \|
	\| `GpuLU.h` \| `CuSolverSupport.h` \| Cached dense LU factorization \|

	## Building and testing

	```bash
	cmake -G Ninja -B build -S . \
	-DEIGEN_TEST_CUDA=ON \
	-DEIGEN_CUDA_COMPUTE_ARCH="70"

	cmake --build build --target cublas cusolver_llt cusolver_lu device_matrix
	ctest --test-dir build -L gpu --output-on-failure
	```

	`EIGEN_TEST_CUBLAS` and `EIGEN_TEST_CUSOLVER` default to ON when CUDA is enabled
	(cuBLAS and cuSOLVER are part of the CUDA toolkit).