Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle adjustment (BAL datasets). Compares CPU ConjugateGradient (Jacobi preconditioner) against GPU CG using DeviceMatrix + GpuSparseContext + DeviceScalar.
eigen-gpu-cg branch

For each BAL problem file:
GPU CG uses GpuSparseContext for SpMV, DeviceMatrix for vectors, DeviceScalar with CUBLAS_POINTER_MODE_DEVICE for dot/norm reductions, in-place cwiseProduct via NPP to apply the Jacobi preconditioner, and device-pointer-mode scal to avoid a host sync on the beta update.
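To make the division of work concrete, here is a minimal CPU-only sketch of Jacobi-preconditioned CG in plain C++. It is not the benchmark's code: the `Csr` struct and the `spmv`, `dot`, and `jacobiCg` functions are hypothetical helpers written for illustration, and the comments only indicate which GPU component from the description above would plausibly handle each step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR matrix (hypothetical helper, not an Eigen or cuSPARSE type).
struct Csr {
  std::size_t n;                    // square matrix dimension
  std::vector<std::size_t> rowPtr;  // size n + 1
  std::vector<std::size_t> col;     // column index of each nonzero
  std::vector<double> val;          // value of each nonzero
};

// y = A * x: the SpMV step (GpuSparseContext on the GPU path).
void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < A.n; ++i) {
    double s = 0.0;
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) s += A.val[k] * x[A.col[k]];
    y[i] = s;
  }
}

// Dot product: a reduction (DeviceScalar result on the GPU path).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Jacobi-preconditioned CG for SPD A. Returns the number of iterations run.
int jacobiCg(const Csr& A, const std::vector<double>& b, std::vector<double>& x,
             double tol, int maxIters) {
  const std::size_t n = A.n;
  assert(b.size() == n && x.size() == n);
  std::vector<double> invDiag(n, 1.0), r(n), z(n), p(n), Ap(n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
      if (A.col[k] == i && A.val[k] != 0.0) invDiag[i] = 1.0 / A.val[k];

  spmv(A, x, Ap);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
  for (std::size_t i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];  // cwiseProduct
  p = z;
  double rz = dot(r, z);
  const double threshold = tol * tol * dot(b, b);  // relative residual test

  int it = 0;
  while (it < maxIters && dot(r, r) > threshold) {  // squared norm via dot
    spmv(A, p, Ap);                                  // one SpMV per iteration
    const double alpha = rz / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i];                          // axpy
      r[i] -= alpha * Ap[i];                         // axpy
    }
    for (std::size_t i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];  // cwiseProduct
    const double rzNew = dot(r, z);
    const double beta = rzNew / rz;
    for (std::size_t i = 0; i < n; ++i) p[i] = beta * p[i] + z[i];  // scal, then axpy
    rz = rzNew;
    ++it;
  }
  return it;
}
```

In this sketch each iteration performs one SpMV, three dot products, one element-wise product, one scal, and three axpy-style updates, so the SpMV is the only step whose cost grows with the number of nonzeros rather than the vector length.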
| Dataset | Cameras | Points | Obs | H size | H nnz | CG iters | CPU CG (ms) | GPU CG (ms) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Ladybug-49 | 49 | 7,776 | 31,843 | 23,769 | 1.8M | 4,421 | 4,006 | 1,152 | 3.5x |
| Ladybug-138 | 138 | 19,878 | 85,217 | 60,876 | 4.8M | 7,008 | 21,498 | 3,553 | 6.1x |
| Ladybug-646 | 646 | 73,584 | 327,297 | 226,566 | 18.4M | 10,000* | 123,727 | 14,268 | 8.7x |
| Dubrovnik-356 | 356 | 226,730 | 1,255,268 | 683,394 | 69.8M | 4,308 | 216,149 | 24,493 | 8.8x |
\* Hit the 10,000-iteration cap (poorly conditioned problem). Both the CPU and GPU solvers stopped at the same cap, so both timings cover the same number of iterations and the comparison remains valid.
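Because total time depends on the iteration count, it can help to normalize the table to time per iteration. The small helper below is an illustrative sketch only; the `Run` struct and function names are hypothetical and not part of the benchmark.

```cpp
#include <cassert>

// One table row: iteration count and total solve times in milliseconds.
struct Run {
  double iters;
  double cpuMs;
  double gpuMs;
};

// Average time per CG iteration, in milliseconds.
inline double msPerIter(double totalMs, double iters) {
  assert(iters > 0.0);
  return totalMs / iters;
}

// CPU-to-GPU speedup for one row.
inline double speedup(const Run& r) {
  assert(r.gpuMs > 0.0);
  return r.cpuMs / r.gpuMs;
}
```

For the Ladybug-646 row (10,000 iterations, 123,727 ms CPU, 14,268 ms GPU) this gives roughly 12.4 ms per CPU iteration and 1.43 ms per GPU iteration, which is the same 8.7x ratio as the total times.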
GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1 operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).
| Kernel | Time (ms) | % | Calls |
|---|---|---|---|
| cuSPARSE csrmv (SpMV) | 2507 | 91.3% | 7,006 |
| cuBLAS dot | 92 | 3.4% | 21,020 |
| cuBLAS axpy (device ptr) | 27 | 1.0% | 14,012 |
| cuSPARSE partition | 19 | 0.7% | 7,006 |
| NPP cwiseProduct | 16 + 13 | 1.1% | 14,011 + 7,006 |
| cuBLAS axpy (host ptr) | 12 | 0.5% | 7,005 |
| cuBLAS scal (device ptr) | 11 | 0.4% | 7,005 |
| NPP scalar ops | 7 | 0.2% | 7,006 |
Three profiling-driven optimizations reduced GPU CG time by 1.8x (6.5s → 3.6s on Ladybug-138):
1. In-place cwiseProduct: The Jacobi preconditioner apply (z = invdiag .* residual) was allocating a new DeviceMatrix every iteration. Added z.cwiseProduct(ctx, a, b) that reuses z's buffer. Reduced cudaMalloc calls from 7,053 to 23 (saving 2.3s).
2. squaredNorm via dot(x,x): cuBLAS nrm2 uses a numerically careful scaled-sum-of-squares algorithm (29µs/call). Replaced with dot(x,x) (6.4µs/call), which is 4.5x faster per call and saves ~320ms.
3. Device-pointer scal: p *= beta was converting DeviceScalar beta to host (triggering a stream sync), then calling host-pointer-mode scal. Added operator*=(DeviceScalar) that uses device-pointer-mode scal, eliminating one sync per iteration. Halved cudaStreamSynchronize calls from 14K to 7K.
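The squaredNorm change trades robustness for speed, and a CPU sketch makes the trade-off visible. The two functions below are illustrative only (hypothetical names, not cuBLAS or Eigen code): the first is a plain sum of squares, which is what dot(x,x) computes, and the second follows the scale-and-sum-of-squares scheme of the classic reference BLAS dnrm2, assumed here to be representative of what a "numerically careful" nrm2 does.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain sum of squares, i.e. dot(x, x): one pass, no scaling.
double squaredNormViaDot(const std::vector<double>& x) {
  double s = 0.0;
  for (double v : x) s += v * v;
  return s;
}

// Scaled sum of squares in the style of the classic reference dnrm2.
// Invariant: norm^2 == scale^2 * ssq, with scale the largest |v| seen so far,
// so no intermediate square can overflow.
double normScaled(const std::vector<double>& x) {
  double scale = 0.0;
  double ssq = 1.0;
  for (double v : x) {
    if (v == 0.0) continue;
    const double a = std::fabs(v);
    if (scale < a) {
      ssq = 1.0 + ssq * (scale / a) * (scale / a);
      scale = a;
    } else {
      ssq += (a / scale) * (a / scale);
    }
  }
  return scale * std::sqrt(ssq);
}
```

The scaled version needs a branch and a division per element, which is consistent with it being slower per call. The plain version overflows once entries approach the square root of the largest representable double, so using it assumes residual magnitudes stay far from that range.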
GPU speedup scales with problem size: from 3.5x on Ladybug-49 (24K variables) to 8.8x on Dubrovnik-356 (683K variables). This is expected, since larger problems expose more parallelism for the GPU to exploit.
Iteration counts match: CPU and GPU CG converge in the same number of iterations (within 1%), which indicates that the GPU path is numerically consistent with the CPU path.
Bottleneck is SpMV: CG iteration time is dominated (91%) by the sparse matrix-vector product on H. Further speedup requires either faster SpMV (e.g., block-sparse formats) or algorithmic improvements (Schur complement, better preconditioners).
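An Amdahl-style estimate shows why SpMV is the only kernel worth optimizing further. As a rough sketch, let $f = 0.913$ be the SpMV share of GPU kernel time from the profile and $s$ a hypothetical speedup of SpMV alone; the symbols $f$, $s$, and $S$ are introduced here for illustration, and the estimate covers kernel time only, not CUDA API overhead.

$$
S(s) = \frac{1}{(1 - f) + f / s}, \qquad
S(2) = \frac{1}{0.087 + 0.4565} \approx 1.84, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{0.087} \approx 11.5
$$

Conversely, removing every non-SpMV kernel entirely would give at most $1 / f = 1 / 0.913 \approx 1.10$, so tuning the BLAS-1 and NPP kernels cannot yield more than about 10% on kernel time.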
Remaining overhead: CUDA API calls (cudaMemcpyAsync for 8-byte DeviceScalar transfers) account for ~50% of non-kernel time. Batching multiple scalar reductions into a single transfer would help.
Jacobi preconditioner is weak for BA: The Ladybug-646 problem does not converge in 10K iterations. Ceres uses block Jacobi or Schur complement preconditioners that would also benefit from GPU acceleration.
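For comparison with the scalar Jacobi preconditioner, here is a minimal CPU sketch of a block Jacobi apply. It is not Ceres's implementation and not part of the benchmark; `invertBlock` and `applyBlockJacobi` are hypothetical helpers. It assumes one uniform block size for simplicity, whereas the table's H sizes match 9 parameters per camera and 3 per point (for Ladybug-49, 9 × 49 + 3 × 7,776 = 23,769), so a real version would need 9×9 and 3×3 blocks.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Invert a dense b x b block (row-major) in place using Gauss-Jordan
// elimination with partial pivoting. Returns false if a pivot is (near) zero;
// in that case the contents of m are left partially reduced.
bool invertBlock(std::vector<double>& m, std::size_t b) {
  assert(m.size() == b * b);
  std::vector<double> inv(b * b, 0.0);
  for (std::size_t i = 0; i < b; ++i) inv[i * b + i] = 1.0;
  for (std::size_t c = 0; c < b; ++c) {
    std::size_t piv = c;
    for (std::size_t r = c + 1; r < b; ++r)
      if (std::fabs(m[r * b + c]) > std::fabs(m[piv * b + c])) piv = r;
    if (std::fabs(m[piv * b + c]) < 1e-300) return false;
    if (piv != c) {
      for (std::size_t k = 0; k < b; ++k) {
        std::swap(m[c * b + k], m[piv * b + k]);
        std::swap(inv[c * b + k], inv[piv * b + k]);
      }
    }
    const double d = m[c * b + c];
    for (std::size_t k = 0; k < b; ++k) {
      m[c * b + k] /= d;
      inv[c * b + k] /= d;
    }
    for (std::size_t r = 0; r < b; ++r) {
      if (r == c) continue;
      const double f = m[r * b + c];
      for (std::size_t k = 0; k < b; ++k) {
        m[r * b + k] -= f * m[c * b + k];
        inv[r * b + k] -= f * inv[c * b + k];
      }
    }
  }
  m = inv;
  return true;
}

// z = M^{-1} r, where M^{-1} is block diagonal with uniform block size b and
// invBlocks stores the already-inverted blocks back to back (row-major).
void applyBlockJacobi(const std::vector<double>& invBlocks, std::size_t b,
                      const std::vector<double>& r, std::vector<double>& z) {
  assert(b > 0 && r.size() % b == 0 && z.size() == r.size());
  assert(invBlocks.size() == (r.size() / b) * b * b);
  for (std::size_t blk = 0; blk < r.size() / b; ++blk) {
    const double* B = &invBlocks[blk * b * b];
    for (std::size_t i = 0; i < b; ++i) {
      double s = 0.0;
      for (std::size_t j = 0; j < b; ++j) s += B[i * b + j] * r[blk * b + j];
      z[blk * b + i] = s;
    }
  }
}
```

The blocks are inverted once up front, so the per-iteration apply is a batch of small independent matrix-vector products, which is the kind of work that maps well to a GPU.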
```
# n      nnz_H     cpu_ms  gpu_ms  speedup
23769    1793475   4006    1152    3.48
60876    4791762   21498   3553    6.05
226566   18387948  123727  14268   8.67
683394   69827066  216149  24493   8.82
```
The BAL datasets were downloaded from http://grail.cs.washington.edu/projects/bal/
| File | Source |
|---|---|
| problem-49-7776-pre.txt | Ladybug sequence |
| problem-138-19878-pre.txt | Ladybug sequence |
| problem-646-73584-pre.txt | Ladybug sequence |
| problem-356-226730-pre.txt | Dubrovnik reconstruction |
```bash
# Build
cmake -G Ninja -B build-bench-gpu -S unsupported/benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build-bench-gpu --target bench_ba

# Download BAL datasets
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
bunzip2 *.bz2

# Run (one at a time)
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
```