| # Eigen GPU Module (`unsupported/Eigen/GPU`) |
| |
| GPU-accelerated dense linear algebra for Eigen users, dispatching to NVIDIA |
| CUDA Math Libraries (cuBLAS, cuSOLVER). Requires CUDA 11.4+. Header-only (link |
| against CUDA runtime, cuBLAS, and cuSOLVER). |
| |
| ## Why this module |
| |
| Eigen is the linear algebra foundation for a large ecosystem of C++ projects |
| in robotics (ROS, Drake, MoveIt, Pinocchio), computer vision (OpenCV, COLMAP, |
| Open3D), scientific computing (Ceres, Stan), and beyond. Many of these |
| projects run on GPU-equipped hardware but cannot use GPUs for Eigen operations |
| without dropping down to raw CUDA library APIs. Third-party projects like |
| [EigenCuda](https://github.com/NLESC-JCER/EigenCuda) and |
| [cholespy](https://github.com/rgl-epfl/cholespy) exist specifically to fill |
| this gap, and downstream projects like |
| [Ceres](https://github.com/ceres-solver/ceres-solver/issues/1151) and |
| [COLMAP](https://github.com/colmap/colmap/issues/4018) have open requests for |
| GPU-accelerated solvers through Eigen. |
| |
| The `unsupported/Eigen/GPU` module aims to close this gap: existing Eigen users
| should be able to move performance-critical dense linear algebra to the GPU
| with minimal code changes and without learning CUDA library APIs directly.
| |
| ## Design philosophy |
| |
| **CPU and GPU coexist.** There is no global compile-time switch that replaces |
| CPU implementations (unlike `EIGEN_USE_LAPACKE`). Users choose GPU solvers |
| explicitly -- `gpu::LLT<double>` vs `LLT<MatrixXd>` -- and both coexist in |
| the same binary. This also lets users keep the factored matrix on device across |
| multiple solves, something impossible with compile-time replacement. |
| |
| **Familiar syntax.** GPU operations use the same expression patterns as CPU |
| Eigen. Here is a side-by-side comparison: |
| |
| ```cpp |
| // ---- CPU (Eigen) ----             // ---- GPU (unsupported/Eigen/GPU) ----
| #include <Eigen/Dense>               #define EIGEN_USE_GPU
|                                      #include <unsupported/Eigen/GPU>
| 
| MatrixXd A = ...;                    auto d_A = gpu::DeviceMatrix<double>::fromHost(A);
| MatrixXd B = ...;                    auto d_B = gpu::DeviceMatrix<double>::fromHost(B);
| 
| MatrixXd C = A * B;                  gpu::DeviceMatrix<double> d_C = d_A * d_B;
| MatrixXd X = A.llt().solve(B);       gpu::DeviceMatrix<double> d_X = d_A.llt().solve(d_B);
| 
|                                      MatrixXd X = d_X.toHost();
| ``` |
| |
| The GPU version reads like CPU Eigen with explicit upload/download. |
| `operator*` dispatches to cuBLAS GEMM, `.llt().solve()` dispatches to |
| cuSOLVER potrf + potrs. Unsupported expressions are compile errors. |
| |
| **Standalone module.** `unsupported/Eigen/GPU` does not modify or depend on Eigen's Core |
| expression template system (`MatrixBase`, `CwiseBinaryOp`, etc.). |
| `DeviceMatrix` is not an Eigen expression type and does not inherit from |
| `MatrixBase`. The expression layer is a thin compile-time dispatch where every |
| supported expression maps to a single NVIDIA library call. There is no |
| coefficient-level evaluation, lazy fusion, or packet operations. |
| |
| **Explicit over implicit.** Host-device transfers, stream management, and |
| library handle lifetimes are visible in the API. There are no hidden |
| allocations or synchronizations except where documented (e.g., `toHost()` must |
| synchronize to deliver data to the host). |
| |
| ## Key concepts |
| |
| ### `DeviceMatrix<Scalar>` |
| |
| A typed RAII wrapper for a dense column-major matrix in GPU device memory. |
| This is the GPU counterpart of Eigen's `MatrixX<Scalar>`. A vector is simply |
| a `DeviceMatrix` with one column. |
| |
| ```cpp |
| // Upload from host |
| auto d_A = gpu::DeviceMatrix<double>::fromHost(A); |
| |
| // Allocate uninitialized |
| gpu::DeviceMatrix<double> d_C(m, n); |
| |
| // Download to host |
| MatrixXd C = d_C.toHost(); |
| |
| // Async download (returns a future) |
| auto transfer = d_C.toHostAsync(); |
| // ... do other work ... |
| MatrixXd C = transfer.get(); |
| ``` |
| |
| `DeviceMatrix` supports expression methods that mirror Eigen's API: |
| `adjoint()`, `transpose()`, `triangularView<UpLo>()`, |
| `selfadjointView<UpLo>()`, `llt()`, `lu()`. These return lightweight |
| expression objects that are evaluated when assigned. |
| |
| ### `gpu::Context` |
| |
| Every GPU operation needs a CUDA stream and library handles (cuBLAS, |
| cuSOLVER). `gpu::Context` bundles these together. |
| |
| For simple usage, you don't need to create one -- a per-thread default context |
| is created lazily on first use: |
| |
| ```cpp |
| // These use the thread-local default context automatically |
| d_C = d_A * d_B; |
| d_X = d_A.llt().solve(d_B); |
| ``` |
| |
| For concurrent multi-stream execution, create explicit contexts: |
| |
| ```cpp |
| gpu::Context ctx1, ctx2; |
| d_C1.device(ctx1) = d_A1 * d_B1; // runs on stream 1 |
| d_C2.device(ctx2) = d_A2 * d_B2; // runs on stream 2 (concurrently) |
| ``` |
| |
| ## Usage |
| |
| ### Matrix operations (cuBLAS) |
| |
| ```cpp |
| auto d_A = gpu::DeviceMatrix<double>::fromHost(A); |
| auto d_B = gpu::DeviceMatrix<double>::fromHost(B); |
| |
| // GEMM: C = A * B, C = A^H * B, C = A * B^T, ... |
| gpu::DeviceMatrix<double> d_C = d_A * d_B; |
| d_C = d_A.adjoint() * d_B; |
| d_C = d_A * d_B.transpose(); |
| |
| // Scaled and accumulated |
| d_C += 2.0 * d_A * d_B; // alpha=2, beta=1 |
| d_C.device(ctx) -= d_A * d_B; // alpha=-1, beta=1 (requires explicit context) |
| |
| // Triangular solve (TRSM) |
| d_X = d_A.triangularView<Lower>().solve(d_B); |
| |
| // Symmetric/Hermitian multiply (SYMM/HEMM) |
| d_C = d_A.selfadjointView<Lower>() * d_B; |
| |
| // Rank-k update (SYRK/HERK) |
| d_C.selfadjointView<Lower>().rankUpdate(d_A); // C += A * A^H |
| ``` |
| |
| ### Dense solvers (cuSOLVER) |
| |
| **One-shot expression syntax** -- Convenient, re-factorizes each time: |
| |
| ```cpp |
| // Cholesky solve (potrf + potrs) |
| d_X = d_A.llt().solve(d_B); |
| |
| // LU solve (getrf + getrs) |
| d_Y = d_A.lu().solve(d_B); |
| ``` |
| |
| **Cached factorization** -- Factor once, solve many times: |
| |
| ```cpp |
| gpu::LLT<double> llt; |
| llt.compute(d_A); // factorize (async) |
| if (llt.info() != Success) { ... } // lazy sync on first info() call |
| auto d_X1 = llt.solve(d_B1); // reuses factor (async) |
| auto d_X2 = llt.solve(d_B2); // reuses factor (async) |
| MatrixXd X2 = d_X2.toHost(); |
| |
| // LU with transpose solve |
| gpu::LU<double> lu; |
| lu.compute(d_A); |
| auto d_Y = lu.solve(d_B, gpu::GpuOp::Trans); // A^T Y = B |
| |
| // QR solve (overdetermined least squares) |
| gpu::QR<double> qr(A); // host matrix input |
| MatrixXd X = qr.solve(B); // Q^H * B via ormqr, then trsm on R |
| |
| // SVD |
| gpu::SVD<double> svd(A, ComputeThinU | ComputeThinV); |
| VectorXd S = svd.singularValues(); |
| MatrixXd U = svd.matrixU(); |
| MatrixXd VT = svd.matrixVT(); |
| MatrixXd X = svd.solve(B); // pseudoinverse solve |
| |
| // SVD: device-side views (no D2H transfer; svd must outlive the views) |
| auto d_S = svd.d_singularValues(); // DeviceMatrix view of singular values |
| auto d_U = svd.d_matrixU(); // DeviceMatrix view of U |
| auto d_VT = svd.d_matrixVT(); // DeviceMatrix view of V^T |
| |
| // Self-adjoint eigenvalue decomposition |
| gpu::SelfAdjointEigenSolver<double> es(A); |
| VectorXd eigenvals = es.eigenvalues(); |
| MatrixXd eigenvecs = es.eigenvectors(); |
| auto d_W = es.d_eigenvalues(); // DeviceMatrix view of eigenvalues |
| auto d_V = es.d_eigenvectors(); // DeviceMatrix view of eigenvectors |
| ``` |
| |
| The cached API keeps the factored matrix on device, avoiding redundant |
| host-device transfers and re-factorizations. The `d_*` accessors return |
| non-owning `DeviceMatrix` views so downstream cuBLAS/cuSOLVER work can chain |
| without round-tripping through host memory. |
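| 
| As a sketch of that chaining (the matrix names here are illustrative, assuming
| `d_A` and `d_B` already live on the device): a device-side SVD can feed its
| left singular basis straight into a cuBLAS product without a host round-trip:
| 
| ```cpp
| gpu::SVD<double> svd;
| svd.compute(d_A);                        // factorize on device (async)
| auto d_U = svd.d_matrixU();              // non-owning view; svd must outlive it
| 
| // Project B onto the left singular basis: P = U^H * B (cuBLAS GEMM).
| gpu::DeviceMatrix<double> d_P = d_U.adjoint() * d_B;
| 
| MatrixXd P = d_P.toHost();               // single D2H transfer at the end
| ```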
| |
| ### Precision control |
| |
| GEMM dispatch uses `cublasXgemm` (type-specific Sgemm/Dgemm/Cgemm/Zgemm).
| cuBLAS may internally use tensor cores depending on the GPU architecture,
| matrix dimensions, and CUDA math mode settings. No Eigen-specific macros
| control this; call `cublasSetMathMode()` on the context's cuBLAS handle (or
| set the `NVIDIA_TF32_OVERRIDE` environment variable) to configure tensor
| core behavior if needed.
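| 
| For example, `float` GEMMs on an explicit context could opt into TF32 tensor
| cores with the standard cuBLAS call (a sketch; `A` and `B` are assumed to be
| `Eigen::MatrixXf` host matrices):
| 
| ```cpp
| gpu::Context ctx;
| // Allow TF32 tensor-core math for float GEMMs on this handle (Ampere+).
| cublasSetMathMode(ctx.cublasHandle(), CUBLAS_TF32_TENSOR_OP_MATH);
| 
| auto d_A = gpu::DeviceMatrix<float>::fromHost(A);
| auto d_B = gpu::DeviceMatrix<float>::fromHost(B);
| gpu::DeviceMatrix<float> d_C(A.rows(), B.cols());
| d_C.device(ctx) = d_A * d_B;             // Sgemm under TF32 rules
| 
| // Restore the default if the handle is shared with precision-sensitive work.
| cublasSetMathMode(ctx.cublasHandle(), CUBLAS_DEFAULT_MATH);
| ```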
| |
| ### Stream control and async execution |
| |
| Operations are asynchronous by default. The compute-solve chain runs without |
| host synchronization until you need a result on the host: |
| |
| ```text |
| fromHost(A) --sync--> compute() --async--> solve() --async--> toHost() |
| H2D potrf potrs D2H |
| sync |
| ``` |
| |
| Mandatory sync points: |
| - `fromHost()` -- Synchronizes to complete the upload before returning |
| - `toHost()` / `HostTransfer::get()` -- Must deliver data to host |
| - `info()` -- Must read the factorization status |
| |
| **Cross-stream safety** is automatic. `DeviceMatrix` tracks write completion |
| via CUDA events. When a matrix written on stream A is read on stream B, the |
| module automatically inserts `cudaStreamWaitEvent`. Same-stream operations |
| skip the wait (CUDA guarantees in-order execution within a stream). |
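| 
| A sketch of what this permits (names illustrative; `d_A` and `d_B` are
| existing device matrices with compatible dimensions):
| 
| ```cpp
| gpu::Context prod, cons;                 // two independent streams
| 
| gpu::DeviceMatrix<double> d_T(d_A.rows(), d_B.cols());
| d_T.device(prod) = d_A * d_B;            // written on prod's stream
| 
| // Reading d_T on cons's stream: the module records d_T's write-completion
| // event and makes cons wait on it (cudaStreamWaitEvent) before the GEMM,
| // so no host synchronization is needed between the two statements.
| gpu::DeviceMatrix<double> d_C(d_T.rows(), d_B.cols());
| d_C.device(cons) = d_T * d_B;
| ```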
| |
| **Lifetime of cached factorizations.** A `gpu::LLT` / `gpu::LU` object owns
| its CUDA stream, library handle, and the cached factor on device. Destroying
| the factorization object while a `solve()` it issued is still in flight is
| *safe* but not actually async: `cudaStreamDestroy` returns immediately, but
| the destructor of the cached factor calls `cudaFree`, which is fully
| device-synchronous and stalls until the in-flight `potrs`/`getrs` retires.
| For genuine async pipelining, keep the factorization object alive until you
| have drained its results (e.g. via `toHost()`, or by binding consumption to
| an explicit `gpu::Context` that outlives both producer and consumer):
| |
| ```cpp |
| gpu::LLT<double> llt(d_A); // factor stays on device |
| auto d_X = llt.solve(d_B); |
| auto h_x = d_X.toHostAsync(llt.stream()); |
| h_x.get(); // sync: factor + result complete |
| // llt may now be destroyed without stalling the device |
| ``` |
| |
| ## Reference |
| |
| ### Supported scalar types |
| |
| `float`, `double`, `std::complex<float>`, `std::complex<double>`. |
| |
| ### Expression -> library call mapping |
| |
| | DeviceMatrix expression | Library call | Parameters | |
| |---|---|---| |
| | `C = A * B` | `cublasXgemm` | transA=N, transB=N, alpha=1, beta=0 | |
| | `C = A.adjoint() * B` | `cublasXgemm` | transA=C, transB=N | |
| | `C = A.transpose() * B` | `cublasXgemm` | transA=T, transB=N | |
| | `C = A * B.adjoint()` | `cublasXgemm` | transA=N, transB=C | |
| | `C = A * B.transpose()` | `cublasXgemm` | transA=N, transB=T | |
| | `C = alpha * A * B` | `cublasXgemm` | alpha from LHS | |
| | `C = A * (alpha * B)` | `cublasXgemm` | alpha from RHS | |
| | `C += A * B` | `cublasXgemm` | alpha=1, beta=1 | |
| | `C.device(ctx) -= A * B` | `cublasXgemm` | alpha=-1, beta=1 | |
| | `X = A.llt().solve(B)` | `cusolverDnXpotrf` + `Xpotrs` | uplo, n, nrhs | |
| | `X = A.llt<Upper>().solve(B)` | same | uplo=Upper | |
| | `X = A.lu().solve(B)` | `cusolverDnXgetrf` + `Xgetrs` | n, nrhs | |
| | `X = A.triangularView<L>().solve(B)` | `cublasXtrsm` | side=L, uplo, diag=NonUnit | |
| | `C = A.selfadjointView<L>() * B` | `cublasXsymm` / `cublasXhemm` | side=L, uplo | |
| | `C.selfadjointView<L>().rankUpdate(A)` | `cublasXsyrk` / `cublasXherk` | uplo, trans=N | |
| |
| ### `DeviceMatrix<Scalar>` API |
| |
| | Method | Sync? | Description | |
| |--------|-------|-------------| |
| | `DeviceMatrix()` | -- | Empty (0x0) | |
| | `DeviceMatrix(rows, cols)` | -- | Allocate uninitialized | |
| | `fromHost(matrix, stream)` | yes | Upload from Eigen matrix | |
| | `fromHostAsync(ptr, rows, cols, stream)` | no | Async upload (caller manages lifetime) | |
| | `adopt(ptr, rows, cols)` | -- | Owning wrapper over a device pointer (caller relinquishes ownership) | |
| | `view(ptr, rows, cols)` | -- | Non-owning view over a device pointer (does not free on destruction) | |
| | `toHost(stream)` | yes | Synchronous download | |
| | `toHostAsync(stream)` | no | Returns `HostTransfer` future | |
| | `clone(stream)` | no | Device-to-device deep copy | |
| | `resize(rows, cols)` | -- | Discard contents, reallocate | |
| | `data()` | -- | Raw device pointer | |
| | `rows()`, `cols()` | -- | Dimensions | |
| | `sizeInBytes()` | -- | Total device allocation size in bytes | |
| | `empty()` | -- | True if 0x0 | |
| | `adjoint()` | -- | Adjoint view (GEMM ConjTrans) | |
| | `transpose()` | -- | Transpose view (GEMM Trans) | |
| | `llt()` / `llt<UpLo>()` | -- | Cholesky expression builder | |
| | `lu()` | -- | LU expression builder | |
| | `triangularView<UpLo>()` | -- | Triangular view (TRSM) | |
| | `selfadjointView<UpLo>()` | -- | Self-adjoint view (SYMM, rankUpdate) | |
| | `device(ctx)` | -- | Assignment proxy bound to context | |
| |
| ### `gpu::Context` |
| |
| Unified GPU execution context owning a CUDA stream and library handles. |
| |
| ```cpp |
| gpu::Context() // Creates dedicated stream + handles |
| static gpu::Context& threadLocal() // Per-thread default (lazy-created) |
| |
| cudaStream_t stream() |
| cublasHandle_t cublasHandle() |
| cusolverDnHandle_t cusolverHandle() |
| ``` |
| |
| Non-copyable, non-movable (owns library handles). |
| |
| ### `gpu::LLT<Scalar, UpLo>` API |
| |
| GPU dense Cholesky (LL^T) via cuSOLVER. Caches factor on device. |
| |
| | Method | Sync? | Description | |
| |--------|-------|-------------| |
| | `gpu::LLT(A)` | deferred | Construct and factorize from host matrix | |
| | `compute(host_matrix)` | deferred | Upload and factorize | |
| | `compute(DeviceMatrix)` | deferred | D2D copy and factorize | |
| | `compute(DeviceMatrix&&)` | deferred | Move-adopt and factorize (no copy) | |
| | `solve(host_matrix)` | yes | Solve, return host matrix | |
| | `solve(DeviceMatrix)` | no | Solve, return `DeviceMatrix` (async) | |
| | `info()` | lazy | Syncs stream on first call, returns `Success` or `NumericalIssue` | |
| |
| ### `gpu::LU<Scalar>` API |
| |
| GPU dense partial-pivoting LU via cuSOLVER. Same pattern as `gpu::LLT`, plus a |
| `gpu::GpuOp` parameter on `solve()` (`NoTrans`, `Trans`, `ConjTrans`). |
| |
| ### `gpu::QR<Scalar>` API |
| |
| GPU dense QR decomposition via cuSOLVER (`geqrf`). Solve uses `ormqr` (apply |
| Q^H) + `trsm` (back-substitute on R) -- Q is never formed explicitly. |
| |
| | Method | Description | |
| |--------|-------------| |
| | `gpu::QR()` | Default construct, then call `compute()` | |
| | `gpu::QR(A)` | Construct and factorize from host matrix | |
| | `compute(A)` | Upload + factorize | |
| | `compute(DeviceMatrix)` | D2D copy + factorize | |
| | `solve(host_matrix)` | Solve, return host matrix (syncs) | |
| | `solve(DeviceMatrix)` | Solve, return `DeviceMatrix` (async) | |
| | `matrixR()` | Download upper-triangular factor R to host (m >= n only) | |
| | `info()` | Lazy sync | |
| | `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream | |
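| 
| For instance, an overdetermined least-squares problem might look like this
| (a sketch; the dimensions are arbitrary):
| 
| ```cpp
| // Minimize ||A x - b||_2 for a tall system (m >= n).
| MatrixXd A = MatrixXd::Random(1000, 50);
| VectorXd b = VectorXd::Random(1000);
| 
| gpu::QR<double> qr(A);                   // geqrf on device
| if (qr.info() != Success) { /* handle factorization failure */ }
| VectorXd x = qr.solve(b);                // ormqr (Q^H b) + trsm on R; syncs
| 
| // CPU reference for comparison:
| VectorXd x_ref = A.householderQr().solve(b);
| ```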
| |
| ### `gpu::SVD<Scalar>` API |
| |
| GPU dense SVD via cuSOLVER (`gesvd`). Supports thin, full, and values-only |
| modes via Eigen's `ComputeThinU | ComputeThinV`, `ComputeFullU | ComputeFullV`, |
| or `0` (values only). |
| |
| | Method | Description | |
| |--------|-------------| |
| | `gpu::SVD()` | Default construct, then call `compute()` | |
| | `gpu::SVD(A, options)` | Construct and compute (options default: `ComputeThinU \| ComputeThinV`) | |
| | `compute(A, options)` | Compute from host matrix | |
| | `compute(DeviceMatrix, options)` | Compute from device matrix | |
| | `singularValues()` | Download singular values to host | |
| | `matrixU()` | Download U to host (requires `ComputeThinU` or `ComputeFullU`) | |
| | `matrixV()` | Download V to host (matches `JacobiSVD::matrixV()`) | |
| | `matrixVT()` | Download V^T to host (requires `ComputeThinV` or `ComputeFullV`) | |
| | `d_singularValues()` | `DeviceMatrix` view of singular values (zero-copy) | |
| | `d_matrixU()` | `DeviceMatrix` view of U (zero-copy when m >= n) | |
| | `d_matrixVT()` | `DeviceMatrix` view of V^T (zero-copy when m >= n) | |
| | `solve(B)` | Pseudoinverse solve (returns host matrix) | |
| | `solve(B, k)` | Truncated solve (top k singular triplets) | |
| | `solve(B, lambda)` | Tikhonov regularized solve | |
| | `rank(threshold)` | Count singular values above threshold | |
| | `info()` | Lazy sync | |
| | `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream | |
| |
| Wide matrices (m < n) are handled by internally transposing via cuBLAS `geam`; |
| in that case `d_matrixU()` and `d_matrixVT()` allocate an owning result via a |
| single `cublasXgeam` adjoint pass. `d_singularValues()` is always zero-copy. |
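| 
| The solve variants compose as follows (a sketch; the truncation count,
| regularization weight, and rank threshold are illustrative):
| 
| ```cpp
| gpu::SVD<double> svd(A);                 // default: ComputeThinU | ComputeThinV
| 
| MatrixXd X  = svd.solve(B);              // pseudoinverse solve
| MatrixXd Xk = svd.solve(B, 10);          // truncated: 10 largest triplets
| MatrixXd Xr = svd.solve(B, 1e-3);        // Tikhonov-regularized, lambda = 1e-3
| 
| // Numerical rank relative to the largest singular value.
| int r = svd.rank(1e-10 * svd.singularValues()(0));
| ```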
| |
| ### `gpu::SelfAdjointEigenSolver<Scalar>` API |
| |
| GPU symmetric/Hermitian eigenvalue decomposition via cuSOLVER (`syevd`). |
| |
| | Method | Description | |
| |--------|-------------| |
| | `gpu::SelfAdjointEigenSolver()` | Default construct, then call `compute()` | |
| | `gpu::SelfAdjointEigenSolver(A, mode)` | Construct and compute (mode default: `ComputeEigenvectors`) | |
| | `compute(A, mode)` | Compute from host matrix | |
| | `compute(DeviceMatrix, mode)` | Compute from device matrix | |
| | `eigenvalues()` | Download eigenvalues to host (ascending order) | |
| | `eigenvectors()` | Download eigenvectors to host (columns) | |
| | `d_eigenvalues()` | `DeviceMatrix` view of eigenvalues (zero-copy) | |
| | `d_eigenvectors()` | `DeviceMatrix` view of eigenvectors (zero-copy, requires `ComputeEigenvectors`) | |
| | `info()` | Lazy sync | |
| | `rows()`, `cols()`, `stream()` | Dimensions and CUDA stream | |
| |
| `ComputeMode`: `gpu::SelfAdjointEigenSolver::EigenvaluesOnly` or |
| `gpu::SelfAdjointEigenSolver::ComputeEigenvectors`. |
| |
| The `d_*` accessors return non-owning views into the solver's internal device |
| buffers; the solver must outlive any view derived from it. |
| |
| ### `HostTransfer<Scalar>` API |
| |
| Future for async device-to-host transfer. |
| |
| | Method | Description | |
| |--------|-------------| |
| | `get()` | Block until transfer completes, return host matrix reference. Idempotent. | |
| | `ready()` | Non-blocking poll | |
| |
| ### Aliasing |
| |
| Unlike CPU Eigen, where a matrix product is evaluated into a temporary unless
| you opt out with `.noalias()`, `DeviceMatrix` dispatches directly to NVIDIA
| library calls, which have no built-in aliasing protection. All operations are
| implicitly `noalias`: the caller must ensure operands don't alias the
| destination for GEMM, TRSM, SYMM/HEMM, and SYRK/HERK. Debug builds assert on
| these violations before dispatching to cuBLAS.
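| 
| A sketch of the hazard and one way to break the alias (names illustrative):
| 
| ```cpp
| // Hazard: destination aliases an operand. On the CPU, Eigen would evaluate
| // the product into a temporary; on the GPU this would dispatch a GEMM that
| // reads and writes d_A concurrently. Debug builds assert here.
| // d_A = d_A * d_B;
| 
| // Fix: break the alias with an explicit device-to-device copy.
| auto d_tmp = d_A.clone();                // D2D deep copy (async)
| d_A = d_tmp * d_B;                       // operands no longer alias
| ```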
| |
| ## Future work |
| |
| - **Reassess host-input vs. device-input API surface.** Each solver currently |
| exposes both host-input (`compute(MatrixXd)`, `solve(MatrixXd)`) and |
| device-input (`compute(DeviceMatrix)`, `solve(DeviceMatrix)`) overloads, plus |
| host- and device-side accessors (`matrixU()` vs `d_matrixU()`). This eases |
| migration from CPU Eigen but may invite accidental host ↔ device round-trips |
| when users mix the two without realizing the cost. Revisit after the initial
| GPU module roll-out (MRs !2408, !2412, !2413, !2414, !2415) is in users' |
| hands; if the convenience overloads cause more confusion than they save, |
| narrow toward a single explicit `fromHost` / `toHost` boundary. |
| |
| ## File layout |
| |
| | File | Depends on | Contents | |
| |------|-----------|----------| |
| | `GpuSupport.h` | `<cuda_runtime.h>` | Error macro, `DeviceBuffer`, `PinnedHostBuffer`, `cuda_data_type<>` | |
| | `DeviceMatrix.h` | `GpuSupport.h` | `DeviceMatrix<>`, `HostTransfer<>` | |
| | `DeviceExpr.h` | `DeviceMatrix.h` | GEMM expression wrappers | |
| | `DeviceBlasExpr.h` | `DeviceMatrix.h` | TRSM, SYMM, SYRK expression wrappers | |
| | `DeviceSolverExpr.h` | `DeviceMatrix.h` | Solver expression wrappers (LLT, LU) | |
| | `DeviceDispatch.h` | all above | All dispatch functions + `Assignment` | |
| | `GpuContext.h` | `CuBlasSupport.h`, `CuSolverSupport.h` | `gpu::Context` | |
| | `CuBlasSupport.h` | `GpuSupport.h`, `<cublas_v2.h>` | cuBLAS error macro, op/compute type maps | |
| | `CuSolverSupport.h` | `GpuSupport.h`, `<cusolverDn.h>` | cuSOLVER params, fill-mode mapping | |
| | `GpuSolverContext.h` | `CuSolverSupport.h`, `CuBlasSupport.h` | Shared solver context (stream, handles, scratch) | |
| | `GpuLLT.h` | `GpuSolverContext.h` | Cached dense Cholesky factorization | |
| | `GpuLU.h` | `GpuSolverContext.h` | Cached dense LU factorization | |
| | `GpuQR.h` | `GpuSolverContext.h` | Dense QR decomposition | |
| | `GpuSVD.h` | `GpuSolverContext.h` | Dense SVD decomposition | |
| | `GpuEigenSolver.h` | `GpuSolverContext.h` | Self-adjoint eigenvalue decomposition | |
| |
| ## Building and testing |
| |
| ```bash |
| cmake -G Ninja -B build -S . \ |
| -DEIGEN_TEST_CUDA=ON \ |
| -DEIGEN_CUDA_COMPUTE_ARCH="70" |
| |
| cmake --build build --target cublas cusolver_llt cusolver_lu \ |
| cusolver_qr cusolver_svd cusolver_eigen device_matrix |
| ctest --test-dir build -L gpu --output-on-failure |
| ``` |
| |
| `EIGEN_TEST_CUBLAS` and `EIGEN_TEST_CUSOLVER` default to ON when CUDA is enabled |
| (cuBLAS and cuSOLVER are part of the CUDA toolkit). |