Bundle Adjustment: GPU CG vs CPU CG Results

Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle adjustment (BAL datasets). Compares CPU ConjugateGradient (Jacobi preconditioner) against GPU CG using DeviceMatrix + GpuSparseContext + DeviceScalar.

Hardware

  • CPU: Intel Core i7-13700HX (Raptor Lake, 12 cores / 24 threads, single thread for Eigen CG)
  • GPU: NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, 4608 CUDA cores, 8 GB GDDR6)
  • CUDA: 13.2 / Driver 595.79
  • OS: Ubuntu 24.04 (WSL2, kernel 6.6.87)

Software

  • Eigen: eigen-gpu-cg branch
  • Google Benchmark 1.9.1
  • Compiler: nvcc 13.2 + g++ 13.3
  • Normal equations: H = J^T*J + I (Levenberg-Marquardt damping lambda=1.0)
  • CG tolerance: 1e-8, max iterations: 10000

Method

For each BAL problem file:

  1. Parse the BAL file (cameras, 3D points, 2D observations)
  2. Compute the full Jacobian J using the BAL camera model (Rodrigues rotation + perspective projection + radial distortion) with central finite differences
  3. Form the normal equations H = J^TJ + lambdaI (sparse, symmetric positive definite)
  4. Solve Hdx = -J^Tr using CG with Jacobi preconditioner on CPU and GPU
  5. Report wall-clock time (mean of 3 repetitions)

GPU CG uses: GpuSparseContext for SpMV, DeviceMatrix for vectors, DeviceScalar with CUBLAS_POINTER_MODE_DEVICE for dot/norm reductions, in-place cwiseProduct via NPP for Jacobi preconditioner application, device-pointer-mode scal to avoid host sync on the beta update.

Results

Summary table

DatasetCamerasPointsObsH sizeH nnzCG itersCPU CG (ms)GPU CG (ms)Speedup
Ladybug-49497,77631,84323,7691.8M4,4214,0061,1523.5x
Ladybug-13813819,87885,21760,8764.8M7,00821,4983,5536.1x
Ladybug-64664673,584327,297226,56618.4M10,000*123,72714,2688.7x
Dubrovnik-356356226,7301,255,268683,39469.8M4,308216,14924,4938.8x

* Hit 10,000 iteration cap (poorly conditioned problem). Both CPU and GPU hit the same cap, so timing comparison remains valid.

Profile breakdown (Ladybug-138, nsys)

GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1 operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).

KernelTime (ms)%Calls
cuSPARSE csrmv (SpMV)250791.3%7,006
cuBLAS dot923.4%21,020
cuBLAS axpy (device ptr)271.0%14,012
cuSPARSE partition190.7%7,006
NPP cwiseProduct16 + 131.1%14,011 + 7,006
cuBLAS axpy (host ptr)120.5%7,005
cuBLAS scal (device ptr)110.4%7,005
NPP scalar ops70.2%7,006

Optimizations applied

Three profiling-driven optimizations reduced GPU CG time by 1.8x (6.5s → 3.6s on Ladybug-138):

  1. In-place cwiseProduct: The Jacobi preconditioner apply (z = invdiag .* residual) was allocating a new DeviceMatrix every iteration. Added z.cwiseProduct(ctx, a, b) that reuses z's buffer. Reduced cudaMalloc calls from 7,053 to 23 (saving 2.3s).

  2. squaredNorm via dot(x,x): cuBLAS nrm2 uses a numerically careful scaled-sum-of-squares algorithm (29µs/call). Replaced with dot(x,x) (6.4µs/call) — 4.5x faster per call, saving ~320ms.

  3. Device-pointer scal: p *= beta was converting DeviceScalar beta to host (triggering a stream sync), then calling host-pointer-mode scal. Added operator*=(DeviceScalar) that uses device-pointer-mode scal, eliminating one sync per iteration. Halved cudaStreamSynchronize calls from 14K to 7K.

Observations

  1. GPU speedup scales with problem size: from 3.5x on small problems (24K variables) to 8.8x on large problems (683K variables). This is expected — larger problems have more parallelism for the GPU to exploit.

  2. Iteration counts match: CPU and GPU CG converge in the same number of iterations (within 1%), confirming numerical equivalence.

  3. Bottleneck is SpMV: CG iteration time is dominated (91%) by the sparse matrix-vector product on H. Further speedup requires either faster SpMV (e.g., block-sparse formats) or algorithmic improvements (Schur complement, better preconditioners).

  4. Remaining overhead: CUDA API calls (cudaMemcpyAsync for 8-byte DeviceScalar transfers) account for ~50% of non-kernel time. Batching multiple scalar reductions into a single transfer would help.

  5. Jacobi preconditioner is weak for BA: The Ladybug-646 problem does not converge in 10K iterations. Ceres uses block Jacobi or Schur complement preconditioners that would also benefit from GPU acceleration.

Scaling plot data

# n        nnz_H       cpu_ms    gpu_ms    speedup
23769      1793475     4006      1152      3.48
60876      4791762     21498     3553      6.05
226566     18387948    123727    14268     8.67
683394     69827066    216149    24493     8.82

BAL datasets

Downloaded from http://grail.cs.washington.edu/projects/bal/

FileSource
problem-49-7776-pre.txtLadybug sequence
problem-138-19878-pre.txtLadybug sequence
problem-646-73584-pre.txtLadybug sequence
problem-356-226730-pre.txtDubrovnik reconstruction

Reproducing

# Build
cmake -G Ninja -B build-bench-gpu -S unsupported/benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build-bench-gpu --target bench_ba

# Download BAL datasets
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
bunzip2 *.bz2

# Run (one at a time)
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_ba --benchmark_repetitions=3