/*
Copyright (c) 2025, AMD Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of AMD nor the names of its contributors may
be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
********************************************************************************
* Content : Documentation on the use of AMD AOCL through Eigen
********************************************************************************
*/
namespace Eigen {
/** \page TopicUsingAOCL Using AMD® AOCL from %Eigen
Starting with %Eigen 3.4, users can benefit from built-in AMD® Optimizing CPU Libraries (AOCL) optimizations, provided that AOCL 5.0 or later is installed.
<a href="https://www.amd.com/en/developer/aocl.html">AMD AOCL</a> provides highly optimized, multi-threaded mathematical routines for x86-64 processors, with a focus on AMD "Zen"-based architectures. AOCL is available on Linux and Windows for x86-64 architectures.
\note
AMD® AOCL is freely available software, but users are responsible for downloading and installing it and for ensuring that their product's license allows linking to the AOCL libraries. AOCL is distributed under a permissive license that allows commercial use.
Using AMD AOCL through %Eigen is straightforward:
-# export \c AOCL_ROOT into your environment
-# define one of the AOCL macros before including any %Eigen headers (see table below)
-# link your program to AOCL libraries (BLIS, FLAME, LibM)
-# ensure your system supports the target architecture optimizations
When doing so, a number of %Eigen's algorithms are silently substituted with calls to AMD AOCL routines.
These substitutions apply only for \b Dynamic \b or \b large \b enough objects with one of the following standard scalar types: \c float, \c double, \c complex<float>, and \c complex<double>.
Operations on other scalar types or mixing reals and complexes will continue to use the built-in algorithms.
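The scalar-type rule above can be pictured as a compile-time predicate. The sketch below is illustrative only: \c is_aocl_scalar and \c is_aocl_pair are hypothetical traits written for this page, not part of %Eigen or AOCL, and they model nothing more than the four listed scalar types and the "no mixing" rule.

```cpp
#include <complex>
#include <type_traits>

// Hypothetical trait (not an Eigen or AOCL API): true only for the four
// scalar types that this page lists as eligible for substitution.
template <typename T>
struct is_aocl_scalar : std::false_type {};

template <> struct is_aocl_scalar<float> : std::true_type {};
template <> struct is_aocl_scalar<double> : std::true_type {};
template <> struct is_aocl_scalar<std::complex<float> > : std::true_type {};
template <> struct is_aocl_scalar<std::complex<double> > : std::true_type {};

// Mixed real/complex operations stay on the built-in path, so both operand
// types must be eligible and identical.
template <typename Lhs, typename Rhs>
struct is_aocl_pair
    : std::integral_constant<bool, is_aocl_scalar<Lhs>::value &&
                                       std::is_same<Lhs, Rhs>::value> {};

static_assert(is_aocl_pair<double, double>::value, "double is eligible");
static_assert(!is_aocl_pair<double, std::complex<double> >::value,
              "mixed real/complex falls back to built-in code");
static_assert(!is_aocl_scalar<long double>::value, "other types fall back");
```

The object size matters as well, so a type that passes this check is necessary but not sufficient for a substitution to happen.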
The AOCL integration targets three core components:
- **BLIS**: High-performance BLAS implementation optimized for modern cache hierarchies
- **FLAME**: Dense linear algebra algorithms providing LAPACK functionality
- **LibM**: Optimized standard math routines with vectorized implementations
\section TopicUsingAOCL_Macros Configuration Macros
You can choose which parts will be substituted by defining one or multiple of the following macros:
<table class="manual">
<tr><td>\c EIGEN_USE_BLAS </td><td>Enables the use of external BLAS level 2 and 3 routines (AOCL-BLIS)</td></tr>
<tr class="alt"><td>\c EIGEN_USE_LAPACKE </td><td>Enables the use of external LAPACK routines via the LAPACKE C interface (AOCL-FLAME)</td></tr>
<tr><td>\c EIGEN_USE_LAPACKE_STRICT </td><td>Same as \c EIGEN_USE_LAPACKE but algorithms of lower robustness are disabled. \n This currently concerns only JacobiSVD which would be replaced by \c gesvd.</td></tr>
<tr class="alt"><td>\c EIGEN_USE_AOCL_VML </td><td>Enables the use of AOCL LibM vector math operations for coefficient-wise functions</td></tr>
<tr><td>\c EIGEN_USE_AOCL_ALL </td><td>Defines \c EIGEN_USE_BLAS, \c EIGEN_USE_LAPACKE, and \c EIGEN_USE_AOCL_VML</td></tr>
<tr class="alt"><td>\c EIGEN_USE_AOCL_MT </td><td>Equivalent to \c EIGEN_USE_AOCL_ALL, but ensures multi-threaded BLIS (\c libblis-mt) is used. \n \b Recommended for most applications.</td></tr>
</table>
\note The AOCL integration enables these optimizations automatically once the matrix/vector size reaches \c EIGEN_AOCL_VML_THRESHOLD (default: 128 elements). For smaller operations, %Eigen's built-in vectorization may be faster because it avoids the function call overhead.
\section TopicUsingAOCL_Performance Performance Considerations
The \c EIGEN_USE_BLAS and \c EIGEN_USE_LAPACKE macros can be combined with AOCL-specific optimizations:
- **Multi-threading**: Use \c EIGEN_USE_AOCL_MT to automatically select the multi-threaded BLIS library
- **Architecture targeting**: AOCL libraries are optimized for AMD Zen architectures (Zen, Zen2, Zen3, Zen4, Zen5)
- **Vector Math Library**: AOCL LibM provides vectorized implementations that process an entire array in a single call
- **Memory layout**: Eigen's column-major storage directly matches AOCL's expected data layout for zero-copy operation
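To make the memory-layout point concrete, the following sketch shows how a column-major buffer is addressed through a pointer and a leading dimension, which is the convention BLAS-style interfaces use. The names \c col_major_offset, \c sum_column, and \c example_matrix are hypothetical helpers for illustration; they are not %Eigen or AOCL functions.

```cpp
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major buffer whose columns are
// `ld` elements apart (the "leading dimension" in BLAS terminology).
constexpr std::size_t col_major_offset(std::size_t row, std::size_t col,
                                       std::size_t ld) {
  return row + col * ld;
}

// Hypothetical stand-in for a BLAS-style routine: it receives only a raw
// pointer, the row count, and the leading dimension, so the caller's buffer
// is read in place and never copied.
inline double sum_column(const double* a, std::size_t rows, std::size_t ld,
                         std::size_t col) {
  double total = 0.0;
  for (std::size_t r = 0; r < rows; ++r) {
    total += a[col_major_offset(r, col, ld)];
  }
  return total;
}

// The 2x3 matrix [[1, 3, 5], [2, 4, 6]] stored column by column.
inline std::vector<double> example_matrix() { return {1, 2, 3, 4, 5, 6}; }
```

With %Eigen, the pointer and leading dimension of a dense matrix are available as \c data() and \c outerStride(), so a buffer in this layout can be handed to an external routine without conversion.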
\section TopicUsingAOCL_Types Supported Data Types and Sizes
AOCL acceleration is applied to:
- **Scalar types**: \c float, \c double, \c complex<float>, \c complex<double>
- **Matrix/Vector sizes**: Dynamic size or compile-time size ≥ \c EIGEN_AOCL_VML_THRESHOLD
- **Storage order**: Both column-major (default) and row-major layouts
- **Memory alignment**: Eigen's data pointers are directly compatible with AOCL function signatures
The current AOCL Vector Math Library integration is specialized for \c double precision, with automatic fallback to scalar implementations for \c float.
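The size rule can be sketched as a simple predicate. This is a minimal illustration under two assumptions: that the comparison is inclusive, as the size rule in the list above states, and that the threshold can be overridden at compile time, as the word "default" suggests. \c above_vml_threshold is a hypothetical helper, not an %Eigen function.

```cpp
#include <cstddef>

// Default value taken from this page; assumed to be overridable with -D.
#ifndef EIGEN_AOCL_VML_THRESHOLD
#define EIGEN_AOCL_VML_THRESHOLD 128
#endif

// Hypothetical helper (not an Eigen API): decides whether an operation on
// `size` coefficients would be handed to the vector math library.
// Assumption: the comparison is inclusive (size >= threshold).
constexpr bool above_vml_threshold(std::size_t size) {
  return size >= static_cast<std::size_t>(EIGEN_AOCL_VML_THRESHOLD);
}

static_assert(!above_vml_threshold(64), "small vectors stay on the built-in path");
static_assert(above_vml_threshold(128), "the threshold itself qualifies");
static_assert(above_vml_threshold(10000), "large vectors qualify");
```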
\section TopicUsingAOCL_Functions Vector Math Functions
The following table summarizes coefficient-wise operations accelerated by \c EIGEN_USE_AOCL_VML:
<table class="manual">
<tr><th>Code example</th><th>AOCL routines</th></tr>
<tr><td>\code
v2 = v1.array().exp();
v2 = v1.array().sin();
v2 = v1.array().cos();
v2 = v1.array().tan();
v2 = v1.array().log();
v2 = v1.array().log10();
v2 = v1.array().log2();
v2 = v1.array().sqrt();
v2 = v1.array().pow(1.5);
v2 = v1.array() + v2.array();
\endcode</td><td>\code
amd_vrda_exp
amd_vrda_sin
amd_vrda_cos
amd_vrda_tan
amd_vrda_log
amd_vrda_log10
amd_vrda_log2
amd_vrda_sqrt
amd_vrda_pow
amd_vrda_add
\endcode</td></tr>
</table>
In the examples, \c v1 and \c v2 are dense vectors of type \c VectorXd with size ≥ \c EIGEN_AOCL_VML_THRESHOLD.
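The routines in the table are array-form functions: one call processes a whole buffer. The sketch below mimics that calling shape with a stand-in so that it builds without AOCL. It assumes the shape is (length, source pointer, destination pointer); the authoritative signatures are in the AOCL LibM headers and should be checked there. \c array_exp_stand_in and \c exp_all are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in with the array-form shape assumed for routines such as
// amd_vrda_exp: (length, source pointer, destination pointer). It uses
// std::exp so that the sketch builds without AOCL installed.
inline void array_exp_stand_in(int len, const double* src, double* dst) {
  assert(len >= 0 && src != nullptr && dst != nullptr);
  for (int i = 0; i < len; ++i) {
    dst[i] = std::exp(src[i]);
  }
}

// One call covers the whole buffer, which is why the call overhead only pays
// off once the vector is large enough.
inline std::vector<double> exp_all(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  if (!x.empty()) {
    array_exp_stand_in(static_cast<int>(x.size()), x.data(), y.data());
  }
  return y;
}
```

A binary operation such as the addition in the last table row would take two source pointers instead of one under the same assumption.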
\section TopicUsingAOCL_Example Complete Example
\code
#define EIGEN_USE_AOCL_MT
#include <iostream>
#include <Eigen/Dense>
int main() {
  const int n = 2048;
  // Large matrices automatically use AOCL-BLIS for multiplication
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd C = A * B;  // Dispatched to dgemm
  // Large vectors automatically use AOCL LibM for math functions
  Eigen::VectorXd v = Eigen::VectorXd::LinSpaced(10000, 0, 10);
  Eigen::VectorXd result = v.array().sin();  // Dispatched to amd_vrda_sin
  // LAPACK decompositions use AOCL-FLAME. Cholesky needs a symmetric
  // positive definite input, so factor the Gram matrix of A, not A itself.
  Eigen::MatrixXd S = A.transpose() * A;
  Eigen::LLT<Eigen::MatrixXd> llt(S);  // Dispatched to dpotrf
  std::cout << "Matrix norm: " << C.norm() << std::endl;
  std::cout << "Vector result norm: " << result.norm() << std::endl;
  return 0;
}
\endcode
\section TopicUsingAOCL_Building Building and Linking
To compile with AOCL support, set the \c AOCL_ROOT environment variable and link against the required libraries:
\code
export AOCL_ROOT=/path/to/aocl
clang++ -O3 -g -DEIGEN_USE_AOCL_ALL \
-I./install/include -I${AOCL_ROOT}/include \
-Wno-parentheses my_app.cpp \
-L${AOCL_ROOT} -lamdlibm -lflame -lblis \
-lpthread -lrt -lm -lomp \
-o eigen_aocl_example
\endcode
For multi-threaded performance, use the multi-threaded BLIS library:
\code
clang++ -O3 -g -DEIGEN_USE_AOCL_MT \
-I./install/include -I${AOCL_ROOT}/include \
-Wno-parentheses my_app.cpp \
-L${AOCL_ROOT} -lamdlibm -lflame -lblis-mt \
-lpthread -lrt -lm -lomp \
-o eigen_aocl_example
\endcode
Key compiler and linker flags:
- \c -DEIGEN_USE_AOCL_ALL: Enable all AOCL accelerations (BLAS, LAPACK, VML)
- \c -DEIGEN_USE_AOCL_MT: Enable multi-threaded version (uses \c -lblis-mt)
- \c -lblis: Single-threaded BLIS library
- \c -lblis-mt: Multi-threaded BLIS library (recommended for performance)
- \c -lflame: FLAME LAPACK implementation
- \c -lamdlibm: AMD LibM vector math library
- \c -lomp: OpenMP runtime for multi-threading support
- \c -lpthread -lrt: System threading and real-time libraries
- \c -Wno-parentheses: Suppress common warnings when using AOCL headers
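When a build does not behave as expected, it can help to confirm which of the configuration macros actually reached the compiler. The following is a small diagnostic sketch, not part of %Eigen: \c active_aocl_flags is a hypothetical helper, and because it includes no %Eigen header it reports only what was passed with \c -D or defined earlier in the same file.

```cpp
#include <string>

// Reports which configuration macros from this page are visible to the
// compiler. Hypothetical helper: it does not include any Eigen header, so it
// does not see macros that Eigen itself would define from EIGEN_USE_AOCL_ALL
// or EIGEN_USE_AOCL_MT.
inline std::string active_aocl_flags() {
  std::string flags;
#if defined(EIGEN_USE_AOCL_MT)
  flags += "EIGEN_USE_AOCL_MT ";
#endif
#if defined(EIGEN_USE_AOCL_ALL)
  flags += "EIGEN_USE_AOCL_ALL ";
#endif
#if defined(EIGEN_USE_BLAS)
  flags += "EIGEN_USE_BLAS ";
#endif
#if defined(EIGEN_USE_LAPACKE) || defined(EIGEN_USE_LAPACKE_STRICT)
  flags += "EIGEN_USE_LAPACKE ";
#endif
#if defined(EIGEN_USE_AOCL_VML)
  flags += "EIGEN_USE_AOCL_VML ";
#endif
  return flags.empty() ? std::string("none") : flags;
}
```

Printing the returned string at program start shows at a glance whether, for example, \c -DEIGEN_USE_AOCL_MT was dropped by the build system.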
\subsection TopicUsingAOCL_EigenBuild Building Eigen with AOCL Support
To build %Eigen with AOCL support, use the following CMake configuration:
\code
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DINCLUDE_INSTALL_DIR=$PWD/install/include \
&& make install -j$(nproc)
\endcode
To build Eigen with AOCL integration and benchmarking capabilities, use the following CMake configuration:
\code
cmake .. -DEIGEN_BUILD_AOCL_BENCH=ON \
-DEIGEN_AOCL_BENCH_FLAGS="-O3 -mavx512f -fveclib=AMDLIBM" \
-DEIGEN_AOCL_BENCH_USE_MT=OFF \
-DEIGEN_AOCL_BENCH_ARCH=znver5 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DINCLUDE_INSTALL_DIR=$PWD/install/include \
&& make install -j$(nproc)
\endcode
**CMake Configuration Parameters:**
<table class="manual">
<tr><th>Parameter</th><th>Expected Values</th><th>Description</th></tr>
<tr><td>\c EIGEN_BUILD_AOCL_BENCH</td><td>\c ON, \c OFF</td><td>Enable/disable AOCL benchmark compilation</td></tr>
<tr class="alt"><td>\c EIGEN_AOCL_BENCH_FLAGS</td><td>Compiler flags string</td><td>Additional compiler flags for the benchmark, for example \c "-O3 -mavx512f -fveclib=AMDLIBM"</td></tr>
<tr><td>\c EIGEN_AOCL_BENCH_USE_MT</td><td>\c ON, \c OFF</td><td>Use multi-threaded AOCL libraries (\c ON recommended for performance)</td></tr>
<tr class="alt"><td>\c EIGEN_AOCL_BENCH_ARCH</td><td>\c znver3, \c znver4, \c znver5, \c native, \c generic</td><td>Target AMD architecture (match your CPU generation)</td></tr>
<tr><td>\c CMAKE_BUILD_TYPE</td><td>\c Release, \c Debug, \c RelWithDebInfo</td><td>Build configuration (\c Release recommended for benchmarks)</td></tr>
<tr class="alt"><td>\c CMAKE_C_COMPILER</td><td>\c clang, \c gcc</td><td>C compiler (clang recommended for AOCL)</td></tr>
<tr><td>\c CMAKE_CXX_COMPILER</td><td>\c clang++, \c g++</td><td>C++ compiler (clang++ recommended for AOCL)</td></tr>
<tr class="alt"><td>\c CMAKE_INSTALL_PREFIX</td><td>Installation path</td><td>Where to install Eigen headers</td></tr>
<tr><td>\c INCLUDE_INSTALL_DIR</td><td>Header path</td><td>Specific path for Eigen headers</td></tr>
</table>
**Architecture Selection Guide:**
- \c znver3: AMD Zen 3 (EPYC 7003, Ryzen 5000 series)
- \c znver4: AMD Zen 4 (EPYC 9004, Ryzen 7000 series)
- \c znver5: AMD Zen 5 (EPYC 9005, Ryzen 9000 series)
- \c native: Auto-detect current CPU architecture
- \c generic: Generic x86-64 without specific optimizations
**Custom Compiler Flags Explanation:**
- \c -O3: Maximum optimization level
- \c -mavx512f: Enable the AVX-512 Foundation instruction set (only if the target CPU supports it)
- \c -fveclib=AMDLIBM: Use AMD LibM for vectorized math functions
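Because \c -mavx512f produces code that fails on CPUs without AVX-512, it can be useful to compare what the binary was built for with what the machine offers. The sketch below is one way to do that, assuming GCC or Clang on x86-64, where the \c __builtin_cpu_supports builtin and the \c __AVX512F__ predefined macro are available. The three function names are hypothetical.

```cpp
// Compile-time view: true when this translation unit was built with AVX-512F
// enabled (for example through -mavx512f or a -march value that implies it).
constexpr bool built_with_avx512f() {
#if defined(__AVX512F__)
  return true;
#else
  return false;
#endif
}

// Run-time view: asks the CPU itself. __builtin_cpu_supports is a GCC/Clang
// builtin for x86 targets; other compilers or targets report false here.
inline bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

// A binary built with AVX-512F should only run on a CPU that provides it.
inline bool avx512f_build_is_safe() {
  return !built_with_avx512f() || cpu_has_avx512f();
}
```

Calling \c avx512f_build_is_safe() early in \c main and exiting with a clear message is friendlier than an illegal-instruction crash later on.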
\subsection TopicUsingAOCL_Benchmark Building the AOCL Benchmark
After configuring Eigen, build the AOCL benchmark executable:
\code
cmake --build . --target benchmark_aocl -j$(nproc)
\endcode
This creates the \c benchmark_aocl executable that demonstrates AOCL acceleration with various matrix sizes and operations.
**Running the Benchmark:**
\code
./benchmark_aocl
\endcode
The benchmark will automatically compare:
- Eigen's native performance vs AOCL-accelerated operations
- Matrix multiplication performance (BLIS vs Eigen)
- Vector math functions performance (LibM vs Eigen)
- Memory bandwidth utilization and cache efficiency
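For readers who want to time their own kernels in the same spirit, a minimal harness might look like the sketch below. It is not taken from \c benchmark_aocl: \c best_seconds and \c speedup are hypothetical helpers, and since the AOCL macros are compile-time switches, the two timings being compared would typically come from two separately compiled binaries.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not taken from benchmark_aocl): runs `work` `repeats`
// times and returns the best wall-clock time in seconds.
template <typename Work>
double best_seconds(Work work, int repeats) {
  assert(repeats > 0);
  double best = -1.0;
  for (int i = 0; i < repeats; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (best < 0.0 || elapsed.count() < best) {
      best = elapsed.count();
    }
  }
  return best;
}

// A ratio above 1 means the candidate (for example an AOCL-enabled build) was
// faster than the baseline for the measured operation.
inline double speedup(double baseline_seconds, double candidate_seconds) {
  return baseline_seconds / candidate_seconds;
}
```

Taking the best of several runs reduces noise from warm-up and scheduling; make sure the timed result is actually used afterwards so the compiler cannot remove the work.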
\section TopicUsingAOCL_CMake CMake Integration
When using CMake, you can use a FindAOCL module:
\code
find_package(AOCL REQUIRED)
target_compile_definitions(my_target PRIVATE EIGEN_USE_AOCL_MT)
target_link_libraries(my_target PRIVATE AOCL::BLIS_MT AOCL::FLAME AOCL::LIBM)
\endcode
\section TopicUsingAOCL_Troubleshooting Troubleshooting
Common issues and solutions:
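- **Link errors**: Ensure \c AOCL_ROOT is set and the \c -L path points at the directory containing the AOCL libraries; if the program links but fails to load at run time, add that directory to \c LD_LIBRARY_PATH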
- **Performance not improved**: Verify you're using matrices/vectors larger than the threshold
- **Thread contention**: Set \c OMP_NUM_THREADS to match your CPU core count
- **Architecture mismatch**: Use appropriate \c -march flag for your AMD processor
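The thread-contention item can be checked programmatically. The sketch below reads \c OMP_NUM_THREADS and compares it with what the standard library reports. The helper names are hypothetical, and note that \c std::thread::hardware_concurrency() counts logical processors (and may return 0), so it is only an approximation of the physical core count.

```cpp
#include <cstdlib>
#include <thread>

// Parses a thread-count string such as the value of OMP_NUM_THREADS.
// Returns 0 when the text is missing or does not start with a positive number.
inline unsigned parse_thread_count(const char* text) {
  if (text == nullptr) {
    return 0;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  return (end != text && value > 0) ? static_cast<unsigned>(value) : 0;
}

// True when the requested count exceeds what the machine reports.
// An `available` value of 0 means "unknown" and never counts as exceeded.
inline bool threads_oversubscribed(unsigned requested, unsigned available) {
  return available != 0 && requested > available;
}

// Convenience wrapper that reads the environment and the hardware report.
inline bool omp_threads_oversubscribed() {
  return threads_oversubscribed(
      parse_thread_count(std::getenv("OMP_NUM_THREADS")),
      std::thread::hardware_concurrency());
}
```

If \c omp_threads_oversubscribed() returns true, lowering \c OMP_NUM_THREADS is a reasonable first step before looking for other causes.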
\section TopicUsingAOCL_Links Links
- AMD AOCL can be downloaded for free <a href="https://www.amd.com/en/developer/aocl.html">here</a>
- AOCL User Guide and documentation available on the AMD Developer Portal
- AOCL is also available through package managers and containerized environments
*/
}