ROCm Libraries¶
rocFFT¶
rocFFT is a software library for computing Fast Fourier Transforms (FFT) written in HIP. It is part of AMD’s software ecosystem based on ROCm. In addition to AMD GPU devices, the library can also be compiled with the CUDA compiler using HIP tools for running on Nvidia GPU devices.
API design¶
Please refer to the rocFFT API design for current documentation. Work in progress.
Installing pre-built packages¶
Download pre-built packages either from ROCm’s package servers or manually from the GitHub releases tab, which may offer newer versions. Release notes are available for each release on the releases tab.
sudo apt update && sudo apt install rocfft
Quickstart rocFFT build¶
Bash helper build script (Ubuntu only)
The root of this repository has a helper bash script, install.sh, that builds and installs rocFFT on Ubuntu with a single command. It takes few options and hard-codes configuration that could otherwise be specified by invoking cmake directly, but it is a quick way to get started and serves as an example of how to build and install. A few commands in the script need sudo access, so it may prompt you for a password.
* ./install.sh -h – shows help
* ./install.sh -id – build library, build dependencies and install globally (-d flag only needs to be specified once on a system)
* ./install.sh -c --cuda – build library and clients for cuda backend into a local directory
Manual build (all supported platforms)
If you use a distro other than Ubuntu, or would like more control over the build process, the rocfft build wiki has helpful information on how to configure cmake and manually build.
Library and API Documentation
Please refer to the Library documentation for current documentation.
Example¶
The following example shows how to use rocFFT to compute a 1D single-precision 16-point complex forward transform.
#include <iostream>
#include <vector>
#include "hip/hip_runtime_api.h"
#include "hip/hip_vector_types.h"
#include "rocfft.h"
int main()
{
    // rocFFT gpu compute
    // ========================================
    // Set up the library once, before any other rocFFT call
    rocfft_setup();
    size_t N = 16;
    size_t Nbytes = N * sizeof(float2);
    // Create HIP device buffer
    float2 *x;
    hipMalloc(&x, Nbytes);
    // Initialize data
    std::vector<float2> cx(N);
    for (size_t i = 0; i < N; i++)
    {
        cx[i].x = 1;
        cx[i].y = -1;
    }
    // Copy data to device
    hipMemcpy(x, cx.data(), Nbytes, hipMemcpyHostToDevice);
    // Create rocFFT plan: in-place, 1 dimension, 1 transform, no plan description
    rocfft_plan plan = NULL;
    size_t length = N;
    rocfft_plan_create(&plan, rocfft_placement_inplace, rocfft_transform_type_complex_forward, rocfft_precision_single, 1, &length, 1, NULL);
    // Execute plan (in-place, so no output buffer; no execution info)
    rocfft_execute(plan, (void**) &x, NULL, NULL);
    // Wait for execution to finish
    hipDeviceSynchronize();
    // Destroy plan
    rocfft_plan_destroy(plan);
    // Copy result back to host
    std::vector<float2> y(N);
    hipMemcpy(y.data(), x, Nbytes, hipMemcpyDeviceToHost);
    // Print results
    for (size_t i = 0; i < N; i++)
    {
        std::cout << y[i].x << ", " << y[i].y << std::endl;
    }
    // Free device buffer
    hipFree(x);
    // Clean up the library once, after the last rocFFT call
    rocfft_cleanup();
    return 0;
}
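As a sanity check on the example above, the expected output can be worked out on the host. The sketch below is a naive reference DFT in plain C++ that needs neither HIP nor rocFFT. The helper name reference_forward_dft is hypothetical and not part of rocFFT, and the code assumes the common unscaled forward convention with a negative exponent. For the constant input used in the example, the result does not depend on that sign convention, because only bin 0 is non-zero.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) forward DFT on the host:
//   X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N)
// Assumption: unscaled, negative-exponent forward transform.
// This is a correctness reference for small N only, not how rocFFT computes the result.
std::vector<std::complex<float>> reference_forward_dft(const std::vector<std::complex<float>>& x)
{
    assert(!x.empty());
    const std::size_t N = x.size();
    const double two_pi = 6.283185307179586;
    std::vector<std::complex<float>> X(N);
    for (std::size_t k = 0; k < N; ++k)
    {
        std::complex<double> sum(0.0, 0.0); // accumulate in double to limit rounding error
        for (std::size_t n = 0; n < N; ++n)
        {
            const double angle = -two_pi * static_cast<double>(k * n) / static_cast<double>(N);
            const std::complex<double> twiddle(std::cos(angle), std::sin(angle));
            sum += std::complex<double>(x[n]) * twiddle;
        }
        X[k] = std::complex<float>(static_cast<float>(sum.real()), static_cast<float>(sum.imag()));
    }
    return X;
}
```

Called on sixteen copies of (1, -1), this returns (16, -16) in element 0 and values that are zero up to rounding everywhere else, which is what the rocFFT example is expected to print, up to floating-point rounding.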
API¶
This section provides details of the library API
Types¶
There are a few data structures that are internal to the library. The pointer types to these structures are given below. The user needs these types to create handles and pass them between different library functions.
-
typedef struct rocfft_plan_t *rocfft_plan¶
Pointer type to plan structure.
This type is used to declare a plan handle that can be initialized with rocfft_plan_create.
-
typedef struct rocfft_plan_description_t *rocfft_plan_description¶
Pointer type to plan description structure.
This type is used to declare a plan description handle that can be initialized with rocfft_plan_description_create.
-
typedef struct rocfft_execution_info_t *rocfft_execution_info¶
Pointer type to execution info structure.
This type is used to declare an execution info handle that can be initialized with rocfft_execution_info_create.
Library Setup and Cleanup¶
The following functions deal with initialization and cleanup of the library.
-
rocfft_status
rocfft_setup()¶ Library setup function, called once in program before start of library use.
-
rocfft_status
rocfft_cleanup()¶ Library cleanup function, called once in program after end of library use.
Plan¶
The following functions are used to create and destroy plan objects.
-
rocfft_status
rocfft_plan_create(rocfft_plan *plan, rocfft_result_placement placement, rocfft_transform_type transform_type, rocfft_precision precision, size_t dimensions, const size_t *lengths, size_t number_of_transforms, const rocfft_plan_description description)¶ Create an FFT plan.
This API creates a plan, which the user can subsequently execute. The function takes the fundamental parameters needed to specify a transform. The dimensions parameter can take a value of 1, 2 or 3. The lengths array specifies the size of the data in each dimension; lengths[0] is the size of the innermost dimension, lengths[1] is the next higher dimension, and so on. The number_of_transforms parameter specifies how many transforms of the same kind need to be computed; a value greater than 1 computes a batch of transforms with a single API call. Additionally, a handle to a plan description can be passed for more detailed transforms. For simple transforms, this parameter can be set to a null pointer.
- Parameters
[out] plan: plan handle
[in] placement: placement of result
[in] transform_type: type of transform
[in] precision: precision
[in] dimensions: dimensions
[in] lengths: dimensions sized array of transform lengths
[in] number_of_transforms: number of transforms
[in] description: description handle created by rocfft_plan_description_create; can be null ptr for simple transforms
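To make the roles of lengths and number_of_transforms concrete, the following host-only sketch computes how many bytes a densely packed batch of complex-to-complex transforms occupies. The helper packed_buffer_bytes is hypothetical (it is not a rocFFT function), and the packed layout is an assumption; real transforms and custom strides would need a different calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed for a densely packed batch of complex-to-complex transforms.
// lengths[0] is the innermost dimension, matching the convention of rocfft_plan_create.
// Assumption: no padding between elements or between transforms.
std::size_t packed_buffer_bytes(const std::vector<std::size_t>& lengths,
                                std::size_t number_of_transforms,
                                std::size_t bytes_per_element)
{
    assert(lengths.size() >= 1 && lengths.size() <= 3); // dimensions must be 1, 2 or 3
    std::size_t elements = number_of_transforms;
    for (std::size_t len : lengths)
        elements *= len; // total elements = product of all lengths times the batch count
    return elements * bytes_per_element;
}
```

For the 16-point single-precision example earlier in this document, one transform of 8-byte float2 elements gives 128 bytes, matching Nbytes there.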
-
rocfft_status
rocfft_plan_destroy(rocfft_plan plan)¶ Destroy an FFT plan.
This API frees the plan. Call it to destroy a plan once it is no longer needed.
- Parameters
[in] plan: plan handle
The following functions are used to query for information after a plan is created.
-
rocfft_status
rocfft_plan_get_work_buffer_size(const rocfft_plan plan, size_t *size_in_bytes)¶ Get work buffer size.
This is one of the plan query functions for obtaining information about a plan. This API gets the work buffer size.
- Parameters
[in] plan: plan handle
[out] size_in_bytes: size of needed work buffer in bytes
-
rocfft_status
rocfft_plan_get_print(const rocfft_plan plan)¶ Print all plan information.
This is one of the plan query functions for obtaining information about a plan. This API prints all plan information to stdout to help the user verify the plan specification.
- Parameters
[in] plan: plan handle
Plan description¶
Most of the time, rocfft_plan_create() is all that is needed to fully specify a transform,
and the description object can be skipped. When a transform specification needs more detail,
a description object must be created and set up, and its handle passed to rocfft_plan_create().
The functions below manage a plan description in order to specify more transform details.
The plan description object can be safely destroyed after the call to rocfft_plan_create().
-
rocfft_status
rocfft_plan_description_create(rocfft_plan_description *description)¶ Create plan description.
This API creates a plan description with which the user can set more plan properties
- Parameters
[out] description: plan description handle
-
rocfft_status
rocfft_plan_description_destroy(rocfft_plan_description description)¶ Destroy a plan description.
This API frees the plan description
- Parameters
[in] description: plan description handle
-
rocfft_status
rocfft_plan_description_set_data_layout(rocfft_plan_description description, rocfft_array_type in_array_type, rocfft_array_type out_array_type, const size_t *in_offsets, const size_t *out_offsets, size_t in_strides_size, const size_t *in_strides, size_t in_distance, size_t out_strides_size, const size_t *out_strides, size_t out_distance)¶ Set data layout.
This is one of the plan description functions for specifying optional additional plan properties through the description handle. This API specifies the layout of buffers. It can be used to specify input and output array types. Not all combinations of array types are supported, and an error code is returned for unsupported cases. Additionally, input and output buffer offsets can be specified. The function can also describe a custom data layout by giving the stride between consecutive elements in each dimension, as well as the distance between the start of consecutive transforms in a batch. The library chooses appropriate defaults if offsets/strides are set to null ptr and/or distances are set to 0.
- Parameters
[in] description: description handle
[in] in_array_type: array type of input buffer
[in] out_array_type: array type of output buffer
[in] in_offsets: offsets, in element units, to start of data in input buffer
[in] out_offsets: offsets, in element units, to start of data in output buffer
[in] in_strides_size: size of in_strides array (must be equal to transform dimensions)
[in] in_strides: array of strides, in each dimension, of input buffer; if set to null ptr library chooses defaults
[in] in_distance: distance between start of each data instance in input buffer
[in] out_strides_size: size of out_strides array (must be equal to transform dimensions)
[in] out_strides: array of strides, in each dimension, of output buffer; if set to null ptr library chooses defaults
[in] out_distance: distance between start of each data instance in output buffer
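The way offset, strides and distance combine can be illustrated with a small host-only sketch. Both helpers below are hypothetical and not part of rocFFT. The sketch assumes all quantities are in element units, handles a single buffer, and assumes that a densely packed layout is what the library's defaults correspond to for contiguous data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position, in element units, of one element of one transform in a buffer
// described by an offset, per-dimension strides, and a distance between transforms.
// coords[0] indexes the innermost dimension, matching lengths[0].
std::size_t strided_element_index(std::size_t offset,
                                  const std::vector<std::size_t>& strides,
                                  std::size_t distance,
                                  const std::vector<std::size_t>& coords,
                                  std::size_t batch)
{
    assert(strides.size() == coords.size()); // one stride per transform dimension
    std::size_t index = offset + batch * distance;
    for (std::size_t d = 0; d < coords.size(); ++d)
        index += coords[d] * strides[d];
    return index;
}

// Strides of a densely packed array: stride 1 in the innermost dimension,
// then the running product of the lower-dimension lengths.
// Assumption: this is the layout the defaults describe for contiguous data.
std::vector<std::size_t> packed_strides(const std::vector<std::size_t>& lengths)
{
    std::vector<std::size_t> strides(lengths.size());
    std::size_t running = 1;
    for (std::size_t d = 0; d < lengths.size(); ++d)
    {
        strides[d] = running;
        running *= lengths[d];
    }
    return strides;
}
```

For example, with offset 3, strides {2, 10} and distance 40, element (1, 2) of the second transform (batch index 1) sits at 3 + 40 + 1*2 + 2*10 = 65.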
Execution¶
The following details the execution function. After a plan has been created, it can be used to compute a transform on specified data. Aspects of the execution can be controlled and any useful information returned to the user.
-
rocfft_status
rocfft_execute(const rocfft_plan plan, void *in_buffer[], void *out_buffer[], rocfft_execution_info info)¶ Execute an FFT plan.
This API executes an FFT plan on buffers given by the user. If the transform is in-place, only the input buffer is needed and the output buffer parameter can be set to NULL. For not-in-place transforms, output buffers must be specified. Note that both the input and output buffer parameters are arrays of pointers; this facilitates passing planar buffers, where the real and imaginary parts are in two separate buffers. For the default interleaved format, only a unit-sized array holding the pointer to the input/output buffer needs to be passed. The final parameter is an execution_info handle. It serves as a way for the user to control execution, as well as for the library to pass any execution-related information back to the user.
- Parameters
[in] plan: plan handle
[inout] in_buffer: array (of size 1 for interleaved data, of size 2 for planar data) of input buffers
[inout] out_buffer: array (of size 1 for interleaved data, of size 2 for planar data) of output buffers, can be nullptr for inplace result placement
[in] info: execution info handle created by rocfft_execution_info_create
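The difference between interleaved and planar buffers, and why the buffer arguments are arrays of pointers, can be sketched on the host. The helpers to_planar and planar_buffer_array are hypothetical and not part of rocFFT. The pointers here refer to host vectors purely for illustration; an actual rocfft_execute call needs device pointers.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Split interleaved complex data (re, im, re, im, ...) into planar form:
// one buffer holding all real parts and one holding all imaginary parts.
std::pair<std::vector<float>, std::vector<float>>
to_planar(const std::vector<std::complex<float>>& interleaved)
{
    std::vector<float> re(interleaved.size());
    std::vector<float> im(interleaved.size());
    for (std::size_t i = 0; i < interleaved.size(); ++i)
    {
        re[i] = interleaved[i].real();
        im[i] = interleaved[i].imag();
    }
    return {re, im};
}

// Shape of the buffer argument for planar data: an array of two pointers.
// For interleaved data the array would hold a single pointer instead.
std::vector<void*> planar_buffer_array(std::vector<float>& re, std::vector<float>& im)
{
    assert(re.size() == im.size()); // both planes describe the same elements
    return {re.data(), im.data()};
}
```

With interleaved data the array has one entry, as in the (void**) &x argument of the earlier example; with planar data it has two entries, one per plane.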
Execution info¶
The execution API rocfft_execute() takes a rocfft_execution_info parameter. This parameter needs
to be created and set up by the user and passed to the execution API. The execution info handle encapsulates
information such as the execution mode and a pointer to any work buffer. It can also hold information that is
a side effect of execution, such as event objects. The following functions manage the execution info
object. Note that the set functions below need to be called before execution and the get functions after
execution.
-
rocfft_status
rocfft_execution_info_create(rocfft_execution_info *info)¶ Create execution info.
This API creates an execution info object with which the user can control plan execution and retrieve execution information.
- Parameters
[out] info: execution info handle
-
rocfft_status
rocfft_execution_info_destroy(rocfft_execution_info info)¶ Destroy an execution info.
This API frees the execution info
- Parameters
[in] info: execution info handle
-
rocfft_status
rocfft_execution_info_set_work_buffer(rocfft_execution_info info, void *work_buffer, size_t size_in_bytes)¶ Set work buffer in execution info.
This is one of the execution info functions for specifying optional additional information to control execution. This API specifies the work buffer to use. It has to be called before the call to rocfft_execute. When rocfft_plan_get_work_buffer_size returns a non-zero value, the library needs a work buffer to compute the transform. In that case, the user has to allocate the work buffer and pass it to the library via this API.
- Parameters
[in] info: execution info handle
[in] work_buffer: work buffer
[in] size_in_bytes: size of work buffer in bytes
-
rocfft_status
rocfft_execution_info_set_stream(rocfft_execution_info info, void *stream)¶ Set stream in execution info.
This is one of the execution info functions for specifying optional additional information to control execution. This API specifies the compute stream, that is, the underlying device queue/stream into which the library inserts its computations. It has to be called before the call to rocfft_execute. The library assumes the user has already created the stream in the program and merely assigns work to it.
- Parameters
[in] info: execution info handle
[in] stream: underlying compute stream
Enumerations¶
This section provides all the enumerations used.
-
enum
rocfft_status¶ rocfft status/error codes
Values:
-
rocfft_status_success¶
-
rocfft_status_failure¶
-
rocfft_status_invalid_arg_value¶
-
rocfft_status_invalid_dimensions¶
-
rocfft_status_invalid_array_type¶
-
rocfft_status_invalid_strides¶
-
rocfft_status_invalid_distance¶
-
rocfft_status_invalid_offset¶
-
enum
rocfft_transform_type¶ Type of transform.
Values:
-
rocfft_transform_type_complex_forward¶
-
rocfft_transform_type_complex_inverse¶
-
rocfft_transform_type_real_forward¶
-
rocfft_transform_type_real_inverse¶
-
enum
rocfft_result_placement¶ Result placement.
Values:
-
rocfft_placement_inplace¶
-
rocfft_placement_notinplace¶
rocBLAS¶
A BLAS implementation on top of AMD’s Radeon Open Compute ROCm runtime and toolchains. rocBLAS is implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs.
Installing pre-built packages¶
Download pre-built packages either from ROCm’s package servers or by clicking the github releases tab and manually downloading, which could be newer. Release notes are available for each release on the releases tab.
sudo apt update && sudo apt install rocblas
Quickstart rocBLAS build¶
Bash helper build script (Ubuntu only)
The root of this repository has a helper bash script, install.sh, that builds and installs rocBLAS on Ubuntu with a single command. It takes few options and hard-codes configuration that could otherwise be specified by invoking cmake directly, but it is a quick way to get started and serves as an example of how to build and install. A few commands in the script need sudo access, so it may prompt you for a password.
* ./install.sh -h – shows help
* ./install.sh -id – build library, build dependencies and install (-d flag only needs to be passed once on a system)
Manual build (all supported platforms)¶
If you use a distro other than Ubuntu, or would like more control over the build process, the rocblas build wiki has helpful information on how to configure cmake and manually build.
Functions supported
A list of exported functions from rocblas can be found on the wiki
rocBLAS interface examples¶
In general, the rocBLAS interface is compatible with the CPU-oriented Netlib BLAS and with the cuBLAS-v2 API, with the explicit exception that traditional BLAS interfaces do not accept handles. The cuBLAS cublasHandle_t is replaced with rocblas_handle everywhere. Thus, porting a CUDA application that calls the cuBLAS API to a HIP application that calls the rocBLAS API should be relatively straightforward. For example, the rocBLAS SGEMV interface is:
GEMV API¶
rocblas_status
rocblas_sgemv(rocblas_handle handle,
              rocblas_operation trans,
              rocblas_int m, rocblas_int n,
              const float* alpha,
              const float* A, rocblas_int lda,
              const float* x, rocblas_int incx,
              const float* beta,
              float* y, rocblas_int incy);
Batched and strided GEMM API¶
rocBLAS GEMM can process matrices in batches with regular strides. There are several variants of this API; the following example takes every stride parameter:
rocblas_status
rocblas_sgemm_strided_batched(
    rocblas_handle handle,
    rocblas_operation transa, rocblas_operation transb,
    rocblas_int m, rocblas_int n, rocblas_int k,
    const float* alpha,
    const float* A, rocblas_int ls_a, rocblas_int ld_a, rocblas_int bs_a,
    const float* B, rocblas_int ls_b, rocblas_int ld_b, rocblas_int bs_b,
    const float* beta,
    float* C, rocblas_int ls_c, rocblas_int ld_c, rocblas_int bs_c,
    rocblas_int batch_count );
rocBLAS assumes that matrices such as A and vectors such as x and y are already allocated in GPU memory and filled with data. Users are responsible for copying data between host and device memory. HIP provides memcpy-style APIs to facilitate data management.
Asynchronous API¶
Except for a few routines (like TRSM) that allocate memory internally, which prevents asynchronous operation, most library routines (like BLAS-1 SCAL, BLAS-2 GEMV, BLAS-3 GEMM) are configured to operate asynchronously with respect to the CPU, meaning these library functions return immediately.
For more information regarding the rocBLAS library and the corresponding API documentation, refer to rocBLAS.
API¶
This section provides details of the library API
Types¶
Definitions¶
Enums¶
Enumeration constants have numbering that is consistent with CBLAS, ACML and most standard C BLAS libraries.
rocblas_operation¶
-
enum
rocblas_operation¶ Used to specify whether the matrix is to be transposed or not.
Values:
-
rocblas_operation_none= 111¶ Operate with the matrix.
-
rocblas_operation_transpose= 112¶ Operate with the transpose of the matrix.
-
rocblas_operation_conjugate_transpose= 113¶ Operate with the conjugate transpose of the matrix.
rocblas_fill¶
rocblas_diagonal¶
rocblas_side¶
-
enum
rocblas_side¶ Indicates the side matrix A is located relative to matrix B during multiplication.
Values:
-
rocblas_side_left= 141¶ Multiply general matrix by symmetric, Hermitian or triangular matrix on the left.
-
rocblas_side_right= 142¶ Multiply general matrix by symmetric, Hermitian or triangular matrix on the right.
-
rocblas_side_both= 143¶
rocblas_status¶
-
enum
rocblas_status¶ rocblas status codes definition
Values:
-
rocblas_status_success= 0¶ success
-
rocblas_status_invalid_handle= 1¶ handle not initialized, invalid or null
-
rocblas_status_not_implemented= 2¶ function is not implemented
-
rocblas_status_invalid_pointer= 3¶ invalid pointer parameter
-
rocblas_status_invalid_size= 4¶ invalid size parameter
-
rocblas_status_memory_error= 5¶ failed internal memory allocation, copy or dealloc
-
rocblas_status_internal_error= 6¶ other internal library failure
rocblas_datatype¶
-
enum
rocblas_datatype¶ Indicates the precision width of data stored in a blas type.
Values:
-
rocblas_datatype_f16_r= 150¶
-
rocblas_datatype_f32_r= 151¶
-
rocblas_datatype_f64_r= 152¶
-
rocblas_datatype_f16_c= 153¶
-
rocblas_datatype_f32_c= 154¶
-
rocblas_datatype_f64_c= 155¶
-
rocblas_datatype_i8_r= 160¶
-
rocblas_datatype_u8_r= 161¶
-
rocblas_datatype_i32_r= 162¶
-
rocblas_datatype_u32_r= 163¶
-
rocblas_datatype_i8_c= 164¶
-
rocblas_datatype_u8_c= 165¶
-
rocblas_datatype_i32_c= 166¶
-
rocblas_datatype_u32_c= 167¶
rocblas_pointer_mode¶
rocblas_layer_mode¶
Functions¶
Level 1 BLAS¶
rocblas_<type>scal()¶
-
rocblas_status
rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)¶
-
rocblas_status
rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)¶ BLAS Level 1 API.
scal scales the vector x by the scalar alpha, for i = 1 , … , n
x := alpha * x ,
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
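The role of n and incx is easiest to see in a host-side reference. The sketch below is plain C++ and does not call rocBLAS; reference_scal is a hypothetical helper, and treating non-positive n or incx as a no-op is an assumption based on the conventions documented for other Level 1 routines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for x := alpha * x, touching n elements spaced incx apart.
// Assumption: the call does nothing when n <= 0 or incx <= 0.
void reference_scal(int n, float alpha, std::vector<float>& x, int incx)
{
    if (n <= 0 || incx <= 0)
        return;
    // The last element touched is x[(n - 1) * incx], so x must be at least that long.
    assert(static_cast<std::size_t>((n - 1) * incx) < x.size());
    for (int i = 0; i < n; ++i)
        x[static_cast<std::size_t>(i) * static_cast<std::size_t>(incx)] *= alpha;
}
```

With x = {1, 2, 3, 4, 5}, n = 3, alpha = 2 and incx = 2, only elements 0, 2 and 4 are scaled, giving {2, 2, 6, 4, 10}.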
rocblas_<type>copy()¶
-
rocblas_status
rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)¶
-
rocblas_status
rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)¶ BLAS Level 1 API.
copy copies the vector x into the vector y, for i = 1 , … , n
y := x,
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
rocblas_<type>dot()¶
-
rocblas_status
rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)¶
-
rocblas_status
rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)¶ BLAS Level 1 API.
dot(u) performs the dot product of vectors x and y
result = x * y;
dotc performs the dot product of complex vectors x and y
result = conjugate (x) * y;
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] result: stores the dot product, either on the host CPU or device GPU. The result is 0.0 if n <= 0.
rocblas_<type>swap()¶
-
rocblas_status
rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy)¶ BLAS Level 1 API.
swap interchange vector x[i] and y[i], for i = 1 , … , n
y := x; x := y
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
-
rocblas_status
rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)¶
rocblas_<type>axpy()¶
-
rocblas_status
rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)¶
-
rocblas_status
rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)¶
-
rocblas_status
rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy)¶ BLAS Level 1 API.
axpy computes y := alpha * x + y
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
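A host-side reference makes the update rule and the two increments explicit. The sketch below is plain C++, not a rocBLAS call; reference_axpy is a hypothetical helper and, as a simplifying assumption, it only covers positive increments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha * x + y over n elements.
// Assumption: positive increments only; other cases are outside this sketch.
void reference_axpy(int n, float alpha,
                    const std::vector<float>& x, int incx,
                    std::vector<float>& y, int incy)
{
    if (n <= 0)
        return;
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
    {
        const std::size_t ix = static_cast<std::size_t>(i) * static_cast<std::size_t>(incx);
        const std::size_t iy = static_cast<std::size_t>(i) * static_cast<std::size_t>(incy);
        y[iy] += alpha * x[ix]; // y is both read and written
    }
}
```

With x = {1, 2, 3}, y = {10, 20, 30}, alpha = 2 and unit increments, y becomes {12, 24, 36}.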
rocblas_<type>asum()¶
-
rocblas_status
rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)¶
-
rocblas_status
rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)¶ BLAS Level 1 API.
asum computes the sum of the magnitudes of the elements of a real vector x, or the sum of the magnitudes of the real and imaginary parts of the elements if x is a complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the asum result, either on the host CPU or device GPU. The result is 0.0 if n <= 0 or incx <= 0.
rocblas_<type>nrm2()¶
-
rocblas_status
rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)¶
-
rocblas_status
rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)¶ BLAS Level 1 API.
nrm2 computes the euclidean norm of a real or complex vector
:= sqrt( x’*x ) for real vector
:= sqrt( x**H*x ) for complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the nrm2 result, either on the host CPU or device GPU. The result is 0.0 if n <= 0 or incx <= 0.
rocblas_i<type>amax()¶
-
rocblas_status
rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)¶
-
rocblas_status
rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)¶ BLAS Level 1 API.
amax finds the first index of the element of maximum magnitude of a real vector x; if x is a complex vector, the magnitude of an element is taken as the sum of the magnitudes of its real and imaginary parts
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amax index, either on the host CPU or device GPU. The result is 0 if n <= 0 or incx <= 0.
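The phrase "first index of the element of maximum magnitude" can be pinned down with a host-side reference. The sketch below is plain C++; reference_iamax is a hypothetical helper, and the 1-based index it returns is an assumption that follows the Netlib BLAS convention, which is consistent with 0 being reserved for the n <= 0 or incx <= 0 case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for i<type>amax on a real vector.
// Returns the 1-based position (assumed Netlib convention) of the first element
// with the largest absolute value, or 0 when n <= 0 or incx <= 0.
int reference_iamax(int n, const std::vector<float>& x, int incx)
{
    if (n <= 0 || incx <= 0)
        return 0;
    assert(static_cast<std::size_t>((n - 1) * incx) < x.size());
    int best = 1;
    float best_magnitude = std::fabs(x[0]);
    for (int i = 1; i < n; ++i)
    {
        const float magnitude = std::fabs(x[static_cast<std::size_t>(i) * static_cast<std::size_t>(incx)]);
        if (magnitude > best_magnitude) // strict comparison keeps the first of any ties
        {
            best_magnitude = magnitude;
            best = i + 1;
        }
    }
    return best;
}
```

For x = {1, -5, 5, 2} with unit increment, the result is 2: the elements -5 and 5 tie in magnitude and the first one wins.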
rocblas_i<type>amin()¶
-
rocblas_status
rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)¶
-
rocblas_status
rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)¶ BLAS Level 1 API.
amin finds the first index of the element of minimum magnitude of a real vector x; if x is a complex vector, the magnitude of an element is taken as the sum of the magnitudes of its real and imaginary parts
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amin index, either on the host CPU or device GPU. The result is 0 if n <= 0 or incx <= 0.
Level 2 BLAS¶
rocblas_<type>gemv()¶
-
rocblas_status
rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)¶
-
rocblas_status
rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)¶ BLAS Level 2 API.
xGEMV performs one of the matrix-vector operations
y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y,
where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] trans: rocblas_operation
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
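How m, n, lda and the transpose option interact is clearer in a host-side reference. The sketch below is plain C++ and does not call rocBLAS. reference_gemv is a hypothetical helper, a bool stands in for rocblas_operation, and column-major storage is assumed, as in Netlib BLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha*op(A)*x + beta*y with real data.
// A is m by n, stored column-major with leading dimension lda (assumed layout).
// transpose == false uses A; transpose == true uses A**T.
void reference_gemv(bool transpose, int m, int n, float alpha,
                    const std::vector<float>& A, int lda,
                    const std::vector<float>& x, int incx,
                    float beta, std::vector<float>& y, int incy)
{
    assert(lda >= m && incx > 0 && incy > 0);
    const int out_len = transpose ? n : m; // number of elements of y
    const int in_len  = transpose ? m : n; // number of elements of x
    for (int i = 0; i < out_len; ++i)
    {
        float acc = 0.0f;
        for (int j = 0; j < in_len; ++j)
        {
            // Column-major: A(row, col) is stored at A[row + col * lda].
            const std::size_t row = static_cast<std::size_t>(transpose ? j : i);
            const std::size_t col = static_cast<std::size_t>(transpose ? i : j);
            acc += A[row + col * static_cast<std::size_t>(lda)]
                 * x[static_cast<std::size_t>(j) * static_cast<std::size_t>(incx)];
        }
        float& yi = y[static_cast<std::size_t>(i) * static_cast<std::size_t>(incy)];
        yi = alpha * acc + beta * yi;
    }
}
```

Note that x has n elements and y has m elements in the non-transposed case, and the roles swap when the transpose is used.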
rocblas_<type>trsv()¶
-
rocblas_status
rocblas_dtrsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)¶
-
rocblas_status
rocblas_strsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx)¶ BLAS Level 2 API.
trsv solves
A*x = b or A**T*x = b,
where x and b are vectors and A is a triangular matrix.
On entry, x holds the right-hand side b; on exit, it is overwritten with the solution.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int m specifies the number of rows of b. m >= 0.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, m )
[in] lda: rocblas_int specifies the leading dimension of A. lda >= max( 1, m ).
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
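For the lower-triangular, non-transposed case, the solve reduces to forward substitution, which the following host-side sketch illustrates. It is plain C++ rather than a rocBLAS call; reference_trsv_lower is a hypothetical helper, a bool stands in for rocblas_diagonal, and it assumes column-major storage and unit increment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference: solve A*x = b by forward substitution, where A is an
// m by m lower triangular matrix in column-major storage (assumed layout).
// On entry x holds b; on exit x holds the solution. Unit increment only.
void reference_trsv_lower(bool unit_diagonal, int m,
                          const std::vector<float>& A, int lda,
                          std::vector<float>& x)
{
    assert(m >= 0 && lda >= (m > 1 ? m : 1));
    const std::size_t ld = static_cast<std::size_t>(lda);
    for (std::size_t i = 0; i < static_cast<std::size_t>(m); ++i)
    {
        float s = x[i];
        for (std::size_t j = 0; j < i; ++j)
            s -= A[i + j * ld] * x[j]; // subtract contributions of already-solved unknowns
        // With a unit diagonal the stored diagonal entries are not referenced.
        x[i] = unit_diagonal ? s : s / A[i + i * ld];
    }
}
```

For A = [[2, 0], [1, 4]] and b = {2, 9}, the solution is {1, 2}.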
rocblas_<type>ger()¶
-
rocblas_status
rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)¶
-
rocblas_status
rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda)¶ BLAS Level 2 API.
xGER performs the matrix-vector operation
A := A + alpha*x*y**T
where alpha is a scalar, x and y are vectors, and A is an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
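The rank-1 update can be written out as a short host-side reference. The sketch is plain C++, not a rocBLAS call; reference_ger is a hypothetical helper that assumes column-major storage and positive increments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for A := A + alpha * x * y**T.
// A is m by n in column-major storage with leading dimension lda (assumed layout);
// x has m elements and y has n elements. Positive increments only.
void reference_ger(int m, int n, float alpha,
                   const std::vector<float>& x, int incx,
                   const std::vector<float>& y, int incy,
                   std::vector<float>& A, int lda)
{
    assert(lda >= m && incx > 0 && incy > 0);
    for (int j = 0; j < n; ++j)
    {
        const float yj = y[static_cast<std::size_t>(j) * static_cast<std::size_t>(incy)];
        for (int i = 0; i < m; ++i)
        {
            const float xi = x[static_cast<std::size_t>(i) * static_cast<std::size_t>(incx)];
            // Element (i, j) of A is stored at A[i + j * lda].
            A[static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * static_cast<std::size_t>(lda)]
                += alpha * xi * yj;
        }
    }
}
```

Starting from a zero 2 by 2 matrix with x = {1, 2}, y = {3, 4} and alpha = 1, the stored array becomes {3, 6, 4, 8}, that is, the matrix [[3, 4], [6, 8]].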
rocblas_<type>syr()¶
-
rocblas_status
rocblas_dsyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *A, rocblas_int lda)¶
-
rocblas_status
rocblas_ssyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *A, rocblas_int lda)¶ BLAS Level 2 API.
xSYR performs the matrix-vector operation
A := A + alpha*x*x**T
where alpha is a scalar, x is a vector, and A is an n by n symmetric matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper or lower triangular part of A is referenced.
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
Level 3 BLAS¶
rocblas_<type>trtri_batched()¶
-
rocblas_status
rocblas_dtrtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, rocblas_int stride_a, double *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)¶
-
rocblas_status
rocblas_strtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, rocblas_int stride_a, float *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)¶ BLAS Level 3 API.
trtri computes the inverse of a triangular matrix A
inv(A);
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper ‘rocblas_fill_upper’ or lower ‘rocblas_fill_lower’
[in] diag: rocblas_diagonal. = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;
[in] n: rocblas_int.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_int “batch stride a”: stride from the start of one “A” matrix to the next
rocblas_<type>trsm()¶
-
rocblas_status
rocblas_dtrsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)¶
-
rocblas_status
rocblas_strsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)¶ BLAS Level 3 API.
trsm solves
op(A)*X = alpha*B or X*op(A) = alpha*B,
where alpha is a scalar, X and B are m by n matrices, A is a triangular matrix, and op(A) is one of
op( A ) = A or op( A ) = A^T or op( A ) = A^H.
The solution matrix X overwrites B.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] side: rocblas_side. rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation. rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int. m specifies the number of rows of B. m >= 0.
[in] n: rocblas_int. n specifies the number of columns of B. n >= 0.
[in] alpha: alpha specifies the scalar alpha. When alpha is zero then A is not referenced and B need not be set before entry.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, k ), where k is m when rocblas_side_left and n when rocblas_side_right. Only the upper/lower triangular part is accessed.
[in] lda: rocblas_int. lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).
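To make the "X overwrites B" behaviour concrete, here is a host-side sketch of a single case only: side left, upper triangular A, op(A) = A. The helper `ref_trsm_left_upper` is hypothetical and assumes column-major storage; it illustrates the mathematics by back substitution and says nothing about how rocBLAS implements trsm.

```cpp
#include <cassert>

// Host-side reference sketch: solve A*X = alpha*B for X, overwriting B.
// Covers only side = left, upper triangular A, op(A) = A (hypothetical helper).
void ref_trsm_left_upper(bool unit_diag, int m, int n, float alpha,
                         const float* A, int lda,
                         float* B, int ldb)
{
    assert(lda >= m && ldb >= m);
    for (int j = 0; j < n; ++j) {                 // each column of B is independent
        for (int i = 0; i < m; ++i) {
            B[i + j * ldb] *= alpha;              // right-hand side alpha*B
        }
        for (int k = m - 1; k >= 0; --k) {        // back substitution
            if (!unit_diag) {
                B[k + j * ldb] /= A[k + k * lda];
            }
            const float xk = B[k + j * ldb];
            for (int i = 0; i < k; ++i) {
                B[i + j * ldb] -= xk * A[i + k * lda];
            }
        }
    }
}
```

With `unit_diag` set, the diagonal of A is never read, which matches the description of rocblas_diagonal_unit above.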
rocblas_<type>gemm()¶
-
rocblas_status
rocblas_dgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)¶
-
rocblas_status
rocblas_sgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)¶
-
rocblas_status
rocblas_hgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, const rocblas_half *B, rocblas_int ldb, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc)¶ BLAS Level 3 API.
xGEMM performs one of the matrix-matrix operations
C = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
- Parameters
[in] handle: rocblas_handle, handle to the rocblas library context queue.
[in] transA: rocblas_operation, specifies the form of op( A )
[in] transB: rocblas_operation, specifies the form of op( B )
[in] m: rocblas_int, number of rows of matrices op( A ) and C
[in] n: rocblas_int, number of columns of matrices op( B ) and C
[in] k: rocblas_int, number of columns of matrix op( A ) and number of rows of matrix op( B )
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int, specifies the leading dimension of A.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int, specifies the leading dimension of B.
[in] beta: specifies the scalar beta.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int, specifies the leading dimension of C.
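The roles of m, n, k and the leading dimensions are easiest to see in a naive host-side reference. The sketch below is not rocBLAS code: `Op`, `op_at` and `ref_gemm` are hypothetical names, storage is assumed column-major, and only real data is handled, so X**H coincides with X**T.

```cpp
#include <cassert>

enum class Op { None, Transpose };  // hypothetical stand-in for rocblas_operation

// Element (i, j) of op(X), where X is column-major with leading dimension ldx.
static float op_at(Op op, const float* X, int ldx, int i, int j)
{
    return op == Op::None ? X[i + j * ldx] : X[j + i * ldx];
}

// Host-side reference sketch of C = alpha*op(A)*op(B) + beta*C.
void ref_gemm(Op transA, Op transB, int m, int n, int k, float alpha,
              const float* A, int lda,
              const float* B, int ldb,
              float beta, float* C, int ldc)
{
    assert(ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += op_at(transA, A, lda, i, p) * op_at(transB, B, ldb, p, j);
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Comparing a GPU result against a loop like this, within a floating-point tolerance, is a common way to validate a call.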
rocblas_<type>gemm_strided_batched()¶
-
rocblas_status
rocblas_dgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶
-
rocblas_status
rocblas_sgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶
-
rocblas_status
rocblas_hgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶ BLAS Level 3 API.
xGEMM_STRIDED_BATCHED performs one of the strided batched matrix-matrix operations
C[i*stride_c] = alpha*op( A[i*stride_a] )*op( B[i*stride_b] ) + beta*C[i*stride_c], for i in [0,batch_count-1]
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are strided batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C an m by n by batch_count strided_batched matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m.
[in] n: rocblas_int. matrix dimension n.
[in] k: rocblas_int. matrix dimension k.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing strided batched matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of “A”.
[in] stride_a: rocblas_int stride from the start of one “A” matrix to the next
[in] B: pointer storing strided batched matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of “B”.
[in] stride_b: rocblas_int stride from the start of one “B” matrix to the next
[in] beta: specifies the scalar beta.
[inout] C: pointer storing strided batched matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of “C”.
[in] stride_c: rocblas_int stride from the start of one “C” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
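The strided batched layout amounts to offsetting each base pointer by i times its stride before running an ordinary GEMM. A host-side sketch, restricted to op( X ) = X for brevity and assuming column-major storage, might look like this; `ref_gemm_strided_batched` is a hypothetical name, not part of rocBLAS.

```cpp
#include <cassert>

// Host-side reference sketch of
//   C[i*stride_c] = alpha*A[i*stride_a]*B[i*stride_b] + beta*C[i*stride_c]
// for i in [0, batch_count-1], with op(X) = X only (hypothetical helper).
void ref_gemm_strided_batched(int m, int n, int k, float alpha,
                              const float* A, int lda, int stride_a,
                              const float* B, int ldb, int stride_b,
                              float beta,
                              float* C, int ldc, int stride_c,
                              int batch_count)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int b = 0; b < batch_count; ++b) {
        const float* Ab = A + b * stride_a;  // start of the b-th A matrix
        const float* Bb = B + b * stride_b;  // start of the b-th B matrix
        float*       Cb = C + b * stride_c;  // start of the b-th C matrix
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p) {
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                }
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
        }
    }
}
```

For densely packed batches the strides would typically be lda*k, ldb*n and ldc*n elements; larger strides leave gaps between consecutive matrices.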
rocblas_<type>gemm_kernel_name()¶
-
rocblas_status
rocblas_dgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶
-
rocblas_status
rocblas_sgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶
-
rocblas_status
rocblas_hgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶
rocblas_<type>geam()¶
-
rocblas_status
rocblas_dgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)¶
-
rocblas_status
rocblas_sgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)¶ BLAS Level 3 API.
xGEAM performs one of the matrix-matrix operations
C = alpha*op( A ) + beta*op( B ),
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] beta: specifies the scalar beta.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of B.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of C.
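One practical use of C = alpha*op( A ) + beta*op( B ) is an out-of-place transpose: set alpha = 1, op( A ) = A**T and beta = 0. The host-side sketch below illustrates the arithmetic only; `Op`, `op_at` and `ref_geam` are hypothetical names, and column-major storage of real data is assumed.

```cpp
#include <cassert>

enum class Op { None, Transpose };  // hypothetical stand-in for rocblas_operation

// Element (i, j) of op(X), where X is column-major with leading dimension ldx.
static float op_at(Op op, const float* X, int ldx, int i, int j)
{
    return op == Op::None ? X[i + j * ldx] : X[j + i * ldx];
}

// Host-side reference sketch of C = alpha*op(A) + beta*op(B), C being m by n.
void ref_geam(Op transA, Op transB, int m, int n,
              float alpha, const float* A, int lda,
              float beta, const float* B, int ldb,
              float* C, int ldc)
{
    assert(ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            C[i + j * ldc] = alpha * op_at(transA, A, lda, i, j)
                           + beta * op_at(transB, B, ldb, i, j);
        }
    }
}
```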
BLAS Extensions¶
rocblas_gemm_ex()¶
-
rocblas_status
rocblas_gemm_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)¶ BLAS EX API.
GEMM_EX performs one of the matrix-matrix operations
D = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
rocblas_gemm_strided_batched_ex()¶
-
rocblas_status
rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_long stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_long stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_long stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_long stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)¶ BLAS EX API.
GEMM_STRIDED_BATCHED_EX performs one of the strided_batched matrix-matrix operations
D[i*stride_d] = alpha*op(A[i*stride_a])*op(B[i*stride_b]) + beta*C[i*stride_c], for i in [0,batch_count-1]
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.
The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_long specifies stride from start of one “A” matrix to the next
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] stride_b: rocblas_long specifies stride from start of one “B” matrix to the next
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[in] stride_c: rocblas_long specifies stride from start of one “C” matrix to the next
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] stride_d: rocblas_long specifies stride from start of one “D” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
Build Information¶
rocblas_get_version_string()¶
-
rocblas_status
rocblas_get_version_string(char *buf, size_t len)¶
Auxiliary¶
rocblas_pointer_to_mode()¶
-
rocblas_pointer_mode
rocblas_pointer_to_mode(void *ptr)¶ Indicates whether the pointer is on the host or device. Currently the HIP API can only recognize whether the input ptr is on the device; it cannot recognize whether it is on the host.
rocblas_create_handle()¶
-
rocblas_status
rocblas_create_handle(rocblas_handle *handle)¶
rocblas_destroy_handle()¶
-
rocblas_status
rocblas_destroy_handle(rocblas_handle handle)¶
rocblas_add_stream()¶
-
rocblas_status
rocblas_add_stream(rocblas_handle handle, hipStream_t stream)¶
rocblas_set_stream()¶
-
rocblas_status
rocblas_set_stream(rocblas_handle handle, hipStream_t stream)¶
rocblas_get_stream()¶
-
rocblas_status
rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)¶
rocblas_set_pointer_mode()¶
-
rocblas_status
rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)¶
rocblas_get_pointer_mode()¶
-
rocblas_status
rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)¶
rocblas_set_vector()¶
-
rocblas_status
rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)¶
rocblas_get_vector()¶
-
rocblas_status
rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)¶
rocblas_set_matrix()¶
-
rocblas_status
rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)¶
rocblas_get_matrix()¶
-
rocblas_status
rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)¶
All API¶
-
file
rocblas-auxiliary.h - #include <hip/hip_runtime_api.h> #include "rocblas-types.h"
rocblas-auxiliary.h provides auxiliary functions in rocblas
Defines
-
_ROCBLAS_AUXILIARY_H_¶
Functions
-
rocblas_pointer_mode
rocblas_pointer_to_mode(void *ptr) Indicates whether the pointer is on the host or device. Currently the HIP API can only recognize whether the input ptr is on the device; it cannot recognize whether it is on the host.
-
rocblas_status
rocblas_create_handle(rocblas_handle *handle)
-
rocblas_status
rocblas_destroy_handle(rocblas_handle handle)
-
rocblas_status
rocblas_add_stream(rocblas_handle handle, hipStream_t stream)
-
rocblas_status
rocblas_set_stream(rocblas_handle handle, hipStream_t stream)
-
rocblas_status
rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)
-
rocblas_status
rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)
-
rocblas_status
rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)
-
rocblas_status
rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)
-
rocblas_status
rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)
-
rocblas_status
rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)
-
rocblas_status
rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)
-
-
file
rocblas-functions.h - #include "rocblas-types.h"
rocblas-functions.h provides Basic Linear Algebra Subprograms of Level 1, 2 and 3, using HIP optimized for AMD HCC-based GPU hardware. This library can also run on CUDA-based NVIDIA GPUs. This file exposes a C89 BLAS interface.
Defines
-
_ROCBLAS_FUNCTIONS_H_¶
Functions
-
rocblas_status
rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx) BLAS Level 1 API.
scal scales the vector x[i] by the scalar alpha, for i = 1, …, n
x := alpha * x
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
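The increment argument is what lets a Level 1 routine walk every incx-th element of a buffer. A host-side sketch of x := alpha * x with a stride follows; `ref_scal` is a hypothetical helper, and the quick return for non-positive n or incx is an assumption borrowed from reference BLAS.

```cpp
#include <cassert>

// Host-side reference sketch of x := alpha * x over n strided elements.
void ref_scal(int n, float alpha, float* x, int incx)
{
    if (n <= 0 || incx <= 0) {
        return;                 // assumption: nothing to do, as in reference BLAS
    }
    assert(x != nullptr);
    for (int i = 0; i < n; ++i) {
        x[i * incx] *= alpha;   // the i-th logical element sits at offset i*incx
    }
}
```

With n = 2 and incx = 2, only the elements at offsets 0 and 2 are scaled; the ones in between are left as they were.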
-
rocblas_status
rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)
-
rocblas_status
rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy) BLAS Level 1 API.
copy copies the vector x into the vector y, for i = 1 , … , n
y := x,
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
-
rocblas_status
rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)
-
rocblas_status
rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result) BLAS Level 1 API.
dot(u) performs the dot product of vectors x and y
result = x * y;
dotc performs the dot product of complex vectors x and y
result = conjugate (x) * y;
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] result: stores the dot product, either on the host CPU or device GPU. The result is 0.0 if n <= 0.
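For real vectors the dot product reduces to a strided multiply-accumulate. The host-side sketch below uses a hypothetical helper `ref_dot` and assumes positive increments; it returns 0.0 when n <= 0, mirroring the note on result above.

```cpp
#include <cassert>

// Host-side reference sketch of result = x * y for real vectors.
float ref_dot(int n, const float* x, int incx, const float* y, int incy)
{
    assert(incx > 0 && incy > 0);
    float result = 0.0f;                        // stays 0.0 when n <= 0
    for (int i = 0; i < n; ++i) {
        result += x[i * incx] * y[i * incy];
    }
    return result;
}
```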
-
rocblas_status
rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)
-
rocblas_status
rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy) BLAS Level 1 API.
swap interchanges vectors x[i] and y[i], for i = 1, …, n
y := x; x := y
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
-
rocblas_status
rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)
-
rocblas_status
rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy) BLAS Level 1 API.
axpy computes y := alpha * x + y
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
-
rocblas_status
rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)
-
rocblas_status
rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)
-
rocblas_status
rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result) BLAS Level 1 API.
asum computes the sum of the magnitudes of elements of a real vector x, or the sum of magnitudes of the real and imaginary parts of elements if x is a complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the asum result, either on the host CPU or device GPU. The result is 0.0 if n <= 0 or incx <= 0.
-
rocblas_status
rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
-
rocblas_status
rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result) BLAS Level 1 API.
nrm2 computes the Euclidean norm of a real or complex vector
:= sqrt( x’*x ) for real vector
:= sqrt( x**H*x ) for complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the nrm2 result, either on the host CPU or device GPU. The result is 0.0 if n <= 0 or incx <= 0.
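For a real vector the definition sqrt( x’*x ) can be written directly, as in the host-side sketch below. `ref_nrm2` is a hypothetical helper and deliberately naive: it applies none of the scaling that production BLAS implementations commonly use to avoid overflow in the sum of squares.

```cpp
#include <cassert>
#include <cmath>

// Host-side reference sketch of the Euclidean norm of a real strided vector.
float ref_nrm2(int n, const float* x, int incx)
{
    if (n <= 0 || incx <= 0) {
        return 0.0f;                   // matches the documented result for n, incx <= 0
    }
    assert(x != nullptr);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float v = x[i * incx];
        sum += v * v;                  // naive accumulation, no overflow scaling
    }
    return std::sqrt(sum);
}
```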
-
rocblas_status
rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
-
rocblas_status
rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result) BLAS Level 1 API.
amax finds the first index of the element of maximum magnitude of a real vector x, or of the maximum sum of magnitudes of the real and imaginary parts if x is a complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amax index, either on the host CPU or device GPU. The result is 0 if n <= 0 or incx <= 0.
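"First index" matters when several elements tie for the largest magnitude. The host-side sketch below keeps the earliest one. `ref_amax` is a hypothetical helper, and the 1-based return value is an assumption taken from the Fortran reference BLAS convention, which is also why 0 can serve as the "no element" result.

```cpp
#include <cassert>
#include <cmath>

// Host-side reference sketch: 1-based index of the first element of maximum
// magnitude in a real strided vector, or 0 if n <= 0 or incx <= 0.
int ref_amax(int n, const float* x, int incx)
{
    if (n <= 0 || incx <= 0) {
        return 0;
    }
    assert(x != nullptr);
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict comparison keeps the earliest index on ties.
        if (std::fabs(x[i * incx]) > std::fabs(x[best * incx])) {
            best = i;
        }
    }
    return best + 1;  // assumption: 1-based, as in reference BLAS
}
```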
-
rocblas_status
rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
-
rocblas_status
rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result) BLAS Level 1 API.
amin finds the first index of the element of minimum magnitude of a real vector x, or of the minimum sum of magnitudes of the real and imaginary parts if x is a complex vector
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amin index, either on the host CPU or device GPU. The result is 0 if n <= 0 or incx <= 0.
-
rocblas_status
rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
-
rocblas_status
rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy) BLAS Level 2 API.
xGEMV performs one of the matrix-vector operations
y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y,
where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] trans: rocblas_operation
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
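Note that the lengths of x and y swap when the transposed form is selected: A is always m by n, so y has m elements for A*x and n elements for A**T*x. A host-side sketch for real data makes this explicit; `ref_gemv` and its `transpose` flag are hypothetical, and column-major storage is assumed.

```cpp
#include <cassert>

// Host-side reference sketch of y := alpha*op(A)*x + beta*y for real data,
// where A is m by n and column-major with leading dimension lda.
void ref_gemv(bool transpose, int m, int n, float alpha,
              const float* A, int lda,
              const float* x, int incx,
              float beta, float* y, int incy)
{
    assert(lda >= m && incx > 0 && incy > 0);
    const int rows = transpose ? n : m;  // length of y
    const int cols = transpose ? m : n;  // length of x
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) {
            const float a = transpose ? A[j + i * lda] : A[i + j * lda];
            acc += a * x[j * incx];
        }
        y[i * incy] = alpha * acc + beta * y[i * incy];
    }
}
```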
-
rocblas_status
rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)
-
rocblas_status
rocblas_strsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx) BLAS Level 2 API.
trsv solves
A*x = alpha*b or A**T*x = alpha*b,
where x and b are vectors and A is a triangular matrix.
The solution vector x overwrites b.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int m specifies the number of rows of b. m >= 0.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, m )
[in] lda: rocblas_int specifies the leading dimension of A. lda = max( 1, m ).
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
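The in-place behaviour, where x holds b on entry and the solution on exit, can be sketched on the host with forward substitution. The hypothetical helper `ref_trsv_lower` covers only a lower triangular, non-transposed A in column-major storage; since the function signatures above carry no alpha argument, the sketch solves A*x = b, which is an assumption about the intended operation.

```cpp
#include <cassert>

// Host-side reference sketch: solve A*x = b in place for lower triangular A.
// On entry x holds b; on exit it holds the solution (hypothetical helper).
void ref_trsv_lower(bool unit_diag, int m,
                    const float* A, int lda,
                    float* x, int incx)
{
    assert(lda >= m && incx > 0);
    for (int i = 0; i < m; ++i) {                 // forward substitution
        float s = x[i * incx];
        for (int j = 0; j < i; ++j) {
            s -= A[i + j * lda] * x[j * incx];    // subtract already-solved terms
        }
        x[i * incx] = unit_diag ? s : s / A[i + i * lda];
    }
}
```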
-
rocblas_status
rocblas_dtrsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)
-
rocblas_status
rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda) BLAS Level 2 API.
xHE(SY)MV performs the matrix-vector operation:
y := alpha*A*x + beta*y,
where alpha and beta are scalars, x and y are n element vectors and A is an n by n Hermitian(Symmetric) matrix.
BLAS Level 2 API
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper or lower
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
xGER performs the matrix-vector operations
A := A + alpha*x*y**T
where alpha is a scalar, x and y are vectors, and A is an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
-
rocblas_status
rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)
-
rocblas_status
rocblas_ssyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *A, rocblas_int lda) BLAS Level 2 API.
xSYR performs the matrix-vector operations
A := A + alpha*x*x**T
where alpha is a scalar, x is a vector, and A is an n by n symmetric matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
-
rocblas_status
rocblas_dsyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *A, rocblas_int lda)
-
rocblas_status
rocblas_strtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, float *invA, rocblas_int ldinvA)¶ BLAS Level 3 API.
trtri computes the inverse of a triangular matrix A and writes the result into invA.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper ‘rocblas_fill_upper’ or lower ‘rocblas_fill_lower’. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.
[in] diag: rocblas_diagonal. = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;
[in] n: rocblas_int. size of matrix A and invA
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
-
rocblas_status
rocblas_dtrtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, double *invA, rocblas_int ldinvA)
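To make the trtri result concrete, the following host-side sketch inverts a lower triangular matrix by forward substitution. It is a simplification under stated assumptions: only the lower case is covered, storage is column-major, and the bool unit_diag replaces rocblas_diagonal. The name ref_trtri_lower is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference: invA := inverse of the lower triangular matrix A.
// Only the lower triangle of A is read and only the lower triangle of invA
// is written. With unit_diag, the diagonal of A is taken to be 1.
void ref_trtri_lower(bool unit_diag, int n,
                     const std::vector<double>& A, int lda,
                     std::vector<double>& invA, int ldinvA)
{
    assert(lda >= n && ldinvA >= n);
    for (int j = 0; j < n; ++j) {
        // Diagonal entry of the inverse.
        invA[j + std::size_t(j) * ldinvA] =
            unit_diag ? 1.0 : 1.0 / A[j + std::size_t(j) * lda];
        // Entries below the diagonal of column j, from A * invA = I.
        for (int i = j + 1; i < n; ++i) {
            double s = 0.0;
            for (int k = j; k < i; ++k)
                s += A[i + std::size_t(k) * lda] * invA[k + std::size_t(j) * ldinvA];
            const double d = unit_diag ? 1.0 : A[i + std::size_t(i) * lda];
            invA[i + std::size_t(j) * ldinvA] = -s / d;
        }
    }
}
```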
-
rocblas_status
rocblas_strtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, rocblas_int stride_a, float *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)
BLAS Level 3 API.
trtri computes the inverse, inv(A), of each triangular matrix A in the batch.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of A is used.
[in] diag: rocblas_diagonal. ‘rocblas_diagonal_non_unit’: A is non-unit triangular; ‘rocblas_diagonal_unit’: A is unit triangular.
[in] n: rocblas_int.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_int “batch stride a”: stride from the start of one “A” matrix to the next
-
rocblas_status
rocblas_dtrtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, rocblas_int stride_a, double *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)
-
rocblas_status
rocblas_strsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)
BLAS Level 3 API.
trsm solves
op(A)*X = alpha*B or X*op(A) = alpha*B,
where alpha is a scalar, X and B are m by n matrices, A is a triangular matrix and op(A) is one of
op( A ) = A or op( A ) = A^T or op( A ) = A^H.
The matrix X is overwritten on B.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] side: rocblas_side. rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation. rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int. m specifies the number of rows of B. m >= 0.
[in] n: rocblas_int. n specifies the number of columns of B. n >= 0.
[in] alpha: alpha specifies the scalar alpha. When alpha is zero then A is not referenced and B need not be set before entry.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. Only the upper/lower triangular part is accessed.
[in] lda: rocblas_int. lda specifies the first dimension of A. If side = rocblas_side_left, lda >= max( 1, m ); if side = rocblas_side_right, lda >= max( 1, n ).
-
rocblas_status
rocblas_dtrsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)
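One case of trsm can be sketched on the host to show how X overwrites B. The code below covers only side = left, uplo = lower, op(A) = A and a non-unit diagonal, and assumes column-major storage. The name ref_trsm_left_lower is hypothetical, and the other side/uplo/transA combinations are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for A * X = alpha * B with A lower triangular (m-by-m),
// B m-by-n, both column-major. On return B holds X.
void ref_trsm_left_lower(int m, int n, double alpha,
                         const std::vector<double>& A, int lda,
                         std::vector<double>& B, int ldb)
{
    assert(lda >= m && ldb >= m);
    for (int j = 0; j < n; ++j) {                 // each column is independent
        for (int i = 0; i < m; ++i) {             // forward substitution
            double s = alpha * B[i + std::size_t(j) * ldb];
            for (int k = 0; k < i; ++k)           // rows k < i already hold X
                s -= A[i + std::size_t(k) * lda] * B[k + std::size_t(j) * ldb];
            B[i + std::size_t(j) * ldb] = s / A[i + std::size_t(i) * lda];
        }
    }
}
```

Because row i of the solution depends only on rows above it, B can be updated in place without a temporary.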
-
rocblas_status
rocblas_hgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, const rocblas_half *B, rocblas_int ldb, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc)
BLAS Level 3 API.
xGEMM performs one of the matrix-matrix operations
C = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
- Parameters
[in] handle: rocblas_handle, handle to the rocblas library context queue.
[in] transA: rocblas_operation, specifies the form of op( A )
[in] transB: rocblas_operation, specifies the form of op( B )
[in] m: rocblas_int, number of rows of matrices op( A ) and C
[in] n: rocblas_int, number of columns of matrices op( B ) and C
[in] k: rocblas_int, number of columns of matrix op( A ) and number of rows of matrix op( B )
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int, specifies the leading dimension of A.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int, specifies the leading dimension of B.
[in] beta: specifies the scalar beta.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int, specifies the leading dimension of C.
-
rocblas_status
rocblas_sgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)
-
rocblas_status
rocblas_dgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)
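The GEMM definition above can be checked against a naive host-side reference. This is a sketch under assumptions, not the rocBLAS implementation: storage is column-major, the element type is real (so X**H equals X**T), and the Op enum and the name ref_gemm are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { None, Transpose };  // hypothetical stand-in for rocblas_operation

// Host-side reference for C = alpha * op(A) * op(B) + beta * C.
// op(A) is m-by-k, op(B) is k-by-n, C is m-by-n, all column-major.
// Unlike production BLAS, this sketch always reads C, even when beta is zero.
void ref_gemm(Op transA, Op transB, int m, int n, int k, float alpha,
              const std::vector<float>& A, int lda,
              const std::vector<float>& B, int ldb,
              float beta, std::vector<float>& C, int ldc)
{
    assert(ldc >= m);
    // Element (r, c) of op(X) for a column-major X with leading dimension ld.
    auto at = [](const std::vector<float>& X, int ld, Op op, int r, int c) {
        return op == Op::None ? X[r + std::size_t(c) * ld]
                              : X[c + std::size_t(r) * ld];
    };
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += at(A, lda, transA, i, p) * at(B, ldb, transB, p, j);
            C[i + std::size_t(j) * ldc] = alpha * acc + beta * C[i + std::size_t(j) * ldc];
        }
}
```

Transposition is handled purely by swapping the index roles, so no copy of A or B is made.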
-
rocblas_status
rocblas_hgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
BLAS Level 3 API.
xGEMM_STRIDED_BATCHED performs one of the strided batched matrix-matrix operations
C[i*stride_c] = alpha*op( A[i*stride_a] )*op( B[i*stride_b] ) + beta*C[i*stride_c], for i in [0,batch_count-1]
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are strided batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C an m by n by batch_count strided_batched matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m.
[in] n: rocblas_int. matrix dimension n.
[in] k: rocblas_int. matrix dimension k.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing strided batched matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of “A”.
[in] stride_a: rocblas_int stride from the start of one “A” matrix to the next
[in] B: pointer storing strided batched matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of “B”.
[in] stride_b: rocblas_int stride from the start of one “B” matrix to the next
[in] beta: specifies the scalar beta.
[inout] C: pointer storing strided batched matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of “C”.
[in] stride_c: rocblas_int stride from the start of one “C” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
-
rocblas_status
rocblas_sgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
-
rocblas_status
rocblas_dgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
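The role of stride_a, stride_b and stride_c is easiest to see in a loop over the batch. The following host-side sketch assumes column-major storage and covers only the untransposed case. The name ref_gemm_strided_batched is hypothetical and the code is not taken from rocBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the strided batched GEMM, no transposes:
// C[i*stride_c] = alpha * A[i*stride_a] * B[i*stride_b] + beta * C[i*stride_c]
// for i in [0, batch_count-1]. Strides are counted in elements.
void ref_gemm_strided_batched(int m, int n, int k, float alpha,
                              const std::vector<float>& A, int lda, std::ptrdiff_t stride_a,
                              const std::vector<float>& B, int ldb, std::ptrdiff_t stride_b,
                              float beta,
                              std::vector<float>& C, int ldc, std::ptrdiff_t stride_c,
                              int batch_count)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int b = 0; b < batch_count; ++b) {
        // Start of the b-th matrix in each buffer.
        const float* a  = A.data() + b * stride_a;
        const float* bm = B.data() + b * stride_b;
        float*       c  = C.data() + b * stride_c;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += a[i + std::ptrdiff_t(p) * lda] * bm[p + std::ptrdiff_t(j) * ldb];
                float& out = c[i + std::ptrdiff_t(j) * ldc];
                out = alpha * acc + beta * out;
            }
    }
}
```

A stride larger than the matrix size simply leaves padding between consecutive matrices, as the 1 by 1 case with stride_a = 2 shows when A is {2, 0, 3, 0}.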
-
rocblas_status
rocblas_hgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
-
rocblas_status
rocblas_sgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
-
rocblas_status
rocblas_dgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
-
rocblas_status
rocblas_sgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)
BLAS Level 3 API.
xGEAM performs one of the matrix-matrix operations
C = alpha*op( A ) + beta*op( B ),
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] beta: specifies the scalar beta.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of B.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of C.
-
rocblas_status
rocblas_dgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)
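GEAM combines scaling, addition and optional transposition in one pass. A minimal host-side sketch follows, assuming column-major storage and real element types. The Op enum and the name ref_geam are hypothetical, not rocBLAS identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { None, Transpose };  // hypothetical stand-in for rocblas_operation

// Host-side reference for C = alpha * op(A) + beta * op(B), C is m-by-n.
// C is only written here, so it must not alias A or B in this sketch.
void ref_geam(Op transA, Op transB, int m, int n, float alpha,
              const std::vector<float>& A, int lda, float beta,
              const std::vector<float>& B, int ldb,
              std::vector<float>& C, int ldc)
{
    assert(ldc >= m);
    // Element (r, c) of op(X) for a column-major X with leading dimension ld.
    auto at = [](const std::vector<float>& X, int ld, Op op, int r, int c) {
        return op == Op::None ? X[r + std::size_t(c) * ld]
                              : X[c + std::size_t(r) * ld];
    };
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + std::size_t(j) * ldc] =
                alpha * at(A, lda, transA, i, j) + beta * at(B, ldb, transB, i, j);
}
```

With alpha = 1, beta = 0 and transA set to transpose, the same routine acts as an out-of-place matrix transpose.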
-
rocblas_status
rocblas_gemm_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)
BLAS EX API.
GEMM_EX performs one of the matrix-matrix operations
D = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
-
rocblas_status
rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_long stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_long stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_long stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_long stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)
BLAS EX API.
GEMM_STRIDED_BATCHED_EX performs one of the strided_batched matrix-matrix operations
D[i*stride_d] = alpha*op(A[i*stride_a])*op(B[i*stride_b]) + beta*C[i*stride_c], for i in [0,batch_count-1]
where op( X ) is one of
op( X ) = X or op( X ) = X**T or op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.
The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.
- Parameters
[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_long specifies stride from start of one “A” matrix to the next
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] stride_b: rocblas_long specifies stride from start of one “B” matrix to the next
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[in] stride_c: rocblas_long specifies stride from start of one “C” matrix to the next
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] stride_d: rocblas_long specifies stride from start of one “D” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
-
rocblas_status
rocblas_get_version_string(char *buf, size_t len)
-
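The GEMM_EX operation separates the storage types of A, B, C and D from the type used for the arithmetic. The host-side sketch below illustrates why that matters, using template parameters in place of a_type, c_type and compute_type. It assumes column-major storage and no transposes. The name ref_gemm_ex is hypothetical, and the int8/int32 combination is chosen for illustration only, without any claim about which type combinations the library supports.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference for D = alpha * A * B + beta * C (no transposes).
// In  : storage type of A and B
// Out : storage type of C and D
// Acc : type of alpha, beta and of the accumulation (the "compute type")
template <typename In, typename Out, typename Acc>
void ref_gemm_ex(int m, int n, int k, Acc alpha,
                 const std::vector<In>& A, int lda,
                 const std::vector<In>& B, int ldb,
                 Acc beta,
                 const std::vector<Out>& C, int ldc,
                 std::vector<Out>& D, int ldd)
{
    assert(lda >= m && ldb >= k && ldc >= m && ldd >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            Acc acc = Acc(0);
            for (int p = 0; p < k; ++p)  // widen each operand before multiplying
                acc += static_cast<Acc>(A[i + std::size_t(p) * lda]) *
                       static_cast<Acc>(B[p + std::size_t(j) * ldb]);
            D[i + std::size_t(j) * ldd] = static_cast<Out>(
                alpha * acc + beta * static_cast<Acc>(C[i + std::size_t(j) * ldc]));
        }
}
```

With 8-bit inputs, a single product such as 100*100 already exceeds the 8-bit range, so accumulating in a wider type is what keeps the result exact. C is read-only here and D is a separate output, matching the [in] and [out] markers in the parameter list.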
-
file
rocblas-types.h
#include <stddef.h>
#include <stdint.h>
#include <hip/hip_vector_types.h>
rocblas-types.h defines data types used by rocblas
Defines
-
_ROCBLAS_TYPES_H_
Typedefs
-
typedef int32_t
rocblas_int To specify whether int32 or int64 is used.
-
typedef int64_t
rocblas_long
-
typedef float2
rocblas_float_complex
-
typedef double2
rocblas_double_complex
-
typedef uint16_t
rocblas_half
-
typedef float2
rocblas_half_complex
-
typedef struct _rocblas_handle *
rocblas_handle
Enums
-
enum
rocblas_operation Used to specify whether the matrix is to be transposed or not.
Parameter constants. The numbering is consistent with CBLAS, ACML and most standard C BLAS libraries.
Values:
-
rocblas_operation_none= 111 Operate with the matrix.
-
rocblas_operation_transpose= 112 Operate with the transpose of the matrix.
-
rocblas_operation_conjugate_transpose= 113 Operate with the conjugate transpose of the matrix.
-
-
enum
rocblas_fill Used by the Hermitian, symmetric and triangular matrix routines to specify whether the upper or lower triangle is being referenced.
Values:
-
rocblas_fill_upper= 121 Upper triangle.
-
rocblas_fill_lower= 122 Lower triangle.
-
rocblas_fill_full= 123
-
-
enum
rocblas_diagonal It is used by the triangular matrix routines to specify whether the matrix is unit triangular.
Values:
-
rocblas_diagonal_non_unit= 131 Non-unit triangular.
-
rocblas_diagonal_unit= 132 Unit triangular.
-
-
enum
rocblas_side Indicates the side matrix A is located relative to matrix B during multiplication.
Values:
-
rocblas_side_left= 141 Multiply general matrix by symmetric, Hermitian or triangular matrix on the left.
-
rocblas_side_right= 142 Multiply general matrix by symmetric, Hermitian or triangular matrix on the right.
-
rocblas_side_both= 143
-
-
enum
rocblas_status rocblas status codes definition
Values:
-
rocblas_status_success= 0 success
-
rocblas_status_invalid_handle= 1 handle not initialized, invalid or null
-
rocblas_status_not_implemented= 2 function is not implemented
-
rocblas_status_invalid_pointer= 3 invalid pointer parameter
-
rocblas_status_invalid_size= 4 invalid size parameter
-
rocblas_status_memory_error= 5 failed internal memory allocation, copy or dealloc
-
rocblas_status_internal_error= 6 other internal library failure
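Callers usually want a readable message for each status code. The sketch below mirrors the values and descriptions listed above in a local enum so that it compiles without the rocBLAS headers. Status, status_to_string and expect_success are hypothetical names, not part of the documented API.

```cpp
#include <cassert>
#include <string>

// Local mirror of the rocblas_status values listed above.
enum class Status {
    success = 0, invalid_handle = 1, not_implemented = 2, invalid_pointer = 3,
    invalid_size = 4, memory_error = 5, internal_error = 6
};

// Maps a status code to the description given in the enum documentation.
std::string status_to_string(Status s)
{
    switch (s) {
    case Status::success:         return "success";
    case Status::invalid_handle:  return "handle not initialized, invalid or null";
    case Status::not_implemented: return "function is not implemented";
    case Status::invalid_pointer: return "invalid pointer parameter";
    case Status::invalid_size:    return "invalid size parameter";
    case Status::memory_error:    return "failed internal memory allocation, copy or dealloc";
    case Status::internal_error:  return "other internal library failure";
    }
    return "unknown status";
}

// Aborts in debug builds when a call did not succeed.
void expect_success(Status s)
{
    assert(s == Status::success && "call did not return success");
}
```

In real code the same switch could be written over rocblas_status directly, with each library call wrapped in a check of this kind.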
-
-
enum
rocblas_datatype Indicates the precision width of data stored in a blas type.
Values:
-
rocblas_datatype_f16_r= 150
-
rocblas_datatype_f32_r= 151
-
rocblas_datatype_f64_r= 152
-
rocblas_datatype_f16_c= 153
-
rocblas_datatype_f32_c= 154
-
rocblas_datatype_f64_c= 155
-
rocblas_datatype_i8_r= 160
-
rocblas_datatype_u8_r= 161
-
rocblas_datatype_i32_r= 162
-
rocblas_datatype_u32_r= 163
-
rocblas_datatype_i8_c= 164
-
rocblas_datatype_u8_c= 165
-
rocblas_datatype_i32_c= 166
-
rocblas_datatype_u32_c= 167
-
-
enum
rocblas_pointer_mode Indicates whether the pointer is a device pointer or a host pointer.
Values:
-
rocblas_pointer_mode_host= 0
-
rocblas_pointer_mode_device= 1
-
-
enum
rocblas_layer_mode Indicates if layer is active with bitmask.
Values:
-
rocblas_layer_mode_none= 0b0000000000
-
rocblas_layer_mode_log_trace= 0b0000000001
-
rocblas_layer_mode_log_bench= 0b0000000010
-
rocblas_layer_mode_log_profile= 0b0000000100
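Because the layer modes are bit flags, several layers can be represented in one value and tested individually. The sketch below mirrors the values listed above as local constants. The helper names layer_enabled and combine_layers are hypothetical, and how the library itself reads the mask is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Local mirror of the rocblas_layer_mode values listed above.
constexpr std::uint32_t layer_mode_none        = 0b0000000000;
constexpr std::uint32_t layer_mode_log_trace   = 0b0000000001;
constexpr std::uint32_t layer_mode_log_bench   = 0b0000000010;
constexpr std::uint32_t layer_mode_log_profile = 0b0000000100;

// True when every bit of `flag` is set in `mask`.
constexpr bool layer_enabled(std::uint32_t mask, std::uint32_t flag)
{
    return flag != 0 && (mask & flag) == flag;
}

// Combines two masks; asserts that only the three documented bits are used.
inline std::uint32_t combine_layers(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t known =
        layer_mode_log_trace | layer_mode_log_bench | layer_mode_log_profile;
    assert(((a | b) & ~known) == 0);
    return a | b;
}
```

For example, combining trace and profile yields 0b101, for which layer_enabled reports trace and profile as active and bench as inactive.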
-
-
enum
rocblas_gemm_algo Indicates which GEMM algorithm is used.
Values:
-
rocblas_gemm_algo_standard= 0b0000000000
-
-
-
file
rocblas.h
#include <stdbool.h>
#include "rocblas-export.h"
#include "rocblas-version.h"
#include "rocblas-types.h"
#include "rocblas-auxiliary.h"
#include "rocblas-functions.h"
rocblas.h includes other *.h and exposes a common interface
Defines
-
_ROCBLAS_H_
-
-
file
buildinfo.cpp
#include <stdio.h>
#include <sstream>
#include <string.h>
#include "definitions.h"
#include "rocblas-types.h"
#include "rocblas-functions.h"
#include "rocblas-version.h"
Functions
-
rocblas_status
rocblas_get_version_string(char *buf, size_t len)
-
-
file
handle.cpp
#include "handle.h"
#include <cstdlib>
Functions
-
static void
open_log_stream(const char *environment_variable_name, std::ostream *&log_os, std::ofstream &log_ofs)
Logging function.
open_log_stream opens the stream log_os for logging. If the environment variable named environment_variable_name is not set, log_os streams to std::cerr. Otherwise, a file is opened at the full logfile path contained in the environment variable. If opening the file succeeds, log_os streams to the file; otherwise it streams to std::cerr.
- Parameters
[in] environment_variable_name: const char* Name of environment variable that contains the full logfile path.
[out] log_os: std::ostream*& Output stream. Streams to std::cerr if environment_variable_name is not set, else set to stream to log_ofs.
[out] log_ofs: std::ofstream& Output file stream. If log_ofs.is_open()==true, then log_os will stream to log_ofs. Else it will stream to std::cerr.
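The selection logic described for open_log_stream can be sketched with the standard library alone. This is an illustration written from the description above, not the library's source, and the name open_log_stream_sketch is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <iostream>

// Points log_os at a file named by the environment variable, or at std::cerr
// when the variable is unset or the file cannot be opened.
void open_log_stream_sketch(const char* environment_variable_name,
                            std::ostream*& log_os, std::ofstream& log_ofs)
{
    assert(environment_variable_name != nullptr);
    log_os = &std::cerr;                                   // default target
    const char* path = std::getenv(environment_variable_name);
    if (path == nullptr)
        return;                                            // variable not set
    log_ofs.open(path);
    if (log_ofs.is_open())
        log_os = &log_ofs;                                 // file opened successfully
}
```

The caller owns log_ofs, so the file stays open for as long as log_os may point at it.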
-
-
file
rocblas_auxiliary.cpp
#include <stdio.h>
#include <hip/hip_runtime.h>
#include "definitions.h"
#include "rocblas-types.h"
#include "handle.h"
#include "logging.h"
#include "utility.h"
#include "rocblas_unique_ptr.hpp"
#include "rocblas-auxiliary.h"
Functions
-
rocblas_pointer_mode
rocblas_pointer_to_mode(void *ptr) Indicates whether the pointer is on the host or the device. Currently, the HIP API can only recognize whether the input ptr is on the device; it cannot recognize whether it is on the host.
-
rocblas_status
rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *mode)
-
rocblas_status
rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode mode)
-
rocblas_status
rocblas_create_handle(rocblas_handle *handle)
-
rocblas_status
rocblas_destroy_handle(rocblas_handle handle)
-
rocblas_status
rocblas_set_stream(rocblas_handle handle, hipStream_t stream_id)
-
rocblas_status
rocblas_get_stream(rocblas_handle handle, hipStream_t *stream_id)
-
__global__ void copy_void_ptr_vector_kernel(rocblas_int n, rocblas_int elem_size, const void * x, rocblas_int incx, void * y, rocblas_int incy)
-
rocblas_status
rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x_h, rocblas_int incx, void *y_d, rocblas_int incy)
-
rocblas_status
rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x_d, rocblas_int incx, void *y_h, rocblas_int incy)
-
__global__ void copy_void_ptr_matrix_kernel(rocblas_int rows, rocblas_int cols, size_t elem_size, const void * a, rocblas_int lda, void * b, rocblas_int ldb)
-
rocblas_status
rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a_h, rocblas_int lda, void *b_d, rocblas_int ldb)
-
rocblas_status
rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a_d, rocblas_int lda, void *b_h, rocblas_int ldb)
Variables
-
constexpr size_t
VEC_BUFF_MAX_BYTES= 1048576
-
constexpr rocblas_int
NB_X= 256
-
constexpr size_t
MAT_BUFF_MAX_BYTES= 1048576
-
constexpr rocblas_int
MATRIX_DIM_X= 128
-
constexpr rocblas_int
MATRIX_DIM_Y= 8
-
-
file
status.cpp
#include <hip/hip_runtime_api.h>
#include "rocblas.h"
#include "status.h"
Functions
-
rocblas_status
get_rocblas_status_for_hip_status(hipError_t status)
-
hipBLAS¶
Introduction¶
The source code is available in the hipBLAS GitHub repository.
hipBLAS is a BLAS marshalling library, with multiple supported backends. It sits between the application and a ‘worker’ BLAS library, marshalling inputs into the backend library and marshalling results back to the application. hipBLAS exports an interface that does not require the client to change, regardless of the chosen backend. Currently, hipBLAS supports rocBLAS and cuBLAS as backends.
Installing pre-built packages¶
Download pre-built packages either from ROCm’s package servers or by clicking the github releases tab and manually downloading, which could be newer. Release notes are available for each release on the releases tab.
sudo apt update && sudo apt install hipblas
Quickstart hipBLAS build¶
Bash helper build script (Ubuntu only)
The root of this repository has a helper bash script install.sh to build and install hipBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.
./install.sh -h -- shows help
./install.sh -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)
Manual build (all supported platforms)
If you use a distro other than Ubuntu, or would like more control over the build process, the hipBLAS build documentation has helpful information on how to configure cmake and build manually.
Functions supported
A list of exported functions from hipblas can be found on the wiki
hipBLAS interface examples¶
The hipBLAS interface is compatible with rocBLAS and cuBLAS-v2 APIs. Porting a CUDA application which originally calls the cuBLAS API to an application calling hipBLAS API should be relatively straightforward. For example, the hipBLAS SGEMV interface is
GEMV API¶
hipblasStatus_t
hipblasSgemv( hipblasHandle_t handle,
hipblasOperation_t trans,
int m, int n, const float *alpha,
const float *A, int lda,
const float *x, int incx, const float *beta,
float *y, int incy );
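For readers new to BLAS, the arguments of the SGEMV interface map onto the operation y := alpha*op(A)*x + beta*y as in the following host-side sketch. It assumes the reference BLAS definition of GEMV, column-major storage and positive increments. The Op enum and the name ref_gemv are hypothetical, and the code does not call hipBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { None, Transpose };  // hypothetical stand-in for hipblasOperation_t

// Host-side reference for y := alpha * op(A) * x + beta * y.
// A is m-by-n, column-major, leading dimension lda >= m.
void ref_gemv(Op trans, int m, int n, float alpha,
              const std::vector<float>& A, int lda,
              const std::vector<float>& x, int incx,
              float beta, std::vector<float>& y, int incy)
{
    assert(lda >= m && incx > 0 && incy > 0);
    const int rows = (trans == Op::None) ? m : n;  // length of y
    const int cols = (trans == Op::None) ? n : m;  // length of x
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c) {
            // Element (r, c) of op(A).
            const float a = (trans == Op::None) ? A[r + std::size_t(c) * lda]
                                                : A[c + std::size_t(r) * lda];
            acc += a * x[c * incx];
        }
        y[r * incy] = alpha * acc + beta * y[r * incy];
    }
}
```

Note that m and n always describe A itself, so the lengths of x and y swap when the transpose is requested.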
Batched and strided GEMM API¶
hipBLAS GEMM can process matrices in batches with regular strides. There are several variants of this API; the following example is the most general one, taking both strides and a batch count:
hipblasStatus_t
hipblasSgemmStridedBatched( hipblasHandle_t handle,
hipblasOperation_t transa, hipblasOperation_t transb,
int m, int n, int k, const float *alpha,
const float *A, int lda, long long bsa,
const float *B, int ldb, long long bsb, const float *beta,
float *C, int ldc, long long bsc,
int batchCount);
hipBLAS assumes that matrices such as A and vectors x, y are already allocated in GPU memory and filled with data. Users are responsible for copying data between host and device memory.
Build¶
Dependencies For Building Library¶
CMake 3.5 or later
The build infrastructure for hipBLAS is based on CMake v3.5, the version of CMake available on ROCm-supported platforms. On a headless machine without the X Window System, we recommend using ccmake; if you have access to X Windows, we recommend using cmake-gui.
One-line install commands for CMake:
Ubuntu: sudo apt install cmake-qt-gui
Fedora: sudo dnf install cmake-gui
Build Library Using Script (Ubuntu only)¶
The root of this repository has a helper bash script install.sh to build and install hipBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.
./install.sh -h -- shows help
./install.sh -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)
Build Library Using Individual Commands¶
mkdir -p [HIPBLAS_BUILD_DIR]/release
cd [HIPBLAS_BUILD_DIR]/release
# Default install location is in /opt/rocm, define -DCMAKE_INSTALL_PREFIX=<path> to specify other
# Default build config is 'Release', define -DCMAKE_BUILD_TYPE=<config> to specify other
CXX=/opt/rocm/bin/hcc ccmake [HIPBLAS_SOURCE]
make -j$(nproc)
sudo make install # sudo required if installing into system directory such as /opt/rocm
Build Library + Tests + Benchmarks + Samples Using Individual Commands¶
The repository contains source for clients that serve as samples, tests and benchmarks. Clients source can be found in the clients subdir.
Dependencies (only necessary for hipBLAS clients)
The hipBLAS samples have no external dependencies, but our unit test and benchmarking applications do. These clients introduce the following dependencies:
- boost
- lapack (lapack itself brings a dependency on a fortran compiler)
- googletest
Linux distros typically have an easy installation mechanism for boost through the native package manager.
Ubuntu: sudo apt install libboost-program-options-dev
Fedora: sudo dnf install boost-program-options
Unfortunately, googletest and lapack are not as easy to install. Many distros do not provide a googletest package with pre-compiled libraries, and the lapack packages do not have the necessary cmake config files for cmake to configure linking the cblas library. hipBLAS provides a cmake script that builds the above dependencies from source. This is an optional step; users can provide their own builds of these dependencies and help cmake find them by setting the CMAKE_PREFIX_PATH definition. The following is a sequence of steps to build dependencies and install them to the cmake default /usr/local.
(optional, one time only)
mkdir -p [HIPBLAS_BUILD_DIR]/release/deps
cd [HIPBLAS_BUILD_DIR]/release/deps
ccmake -DBUILD_BOOST=OFF [HIPBLAS_SOURCE]/deps # assuming boost is installed through package manager as above
make -j$(nproc) install
Once dependencies are available on the system, it is possible to configure the clients to build. This requires a few extra cmake flags to the library cmake configure script. If the dependencies are not installed into system defaults (like /usr/local ), you should pass the CMAKE_PREFIX_PATH to cmake to help find them.
-DCMAKE_PREFIX_PATH="<semicolon separated paths>"
# Default install location is in /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to specify other
CXX=/opt/rocm/bin/hcc ccmake -DBUILD_CLIENTS_TESTS=ON -DBUILD_CLIENTS_BENCHMARKS=ON [HIPBLAS_SOURCE]
make -j$(nproc)
sudo make install # sudo required if installing into system directory such as /opt/rocm
Common build problems¶
Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/hcc/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc. (version does not match) . Please rebuild HIP including cmake or update HCC_HOME variable.
Solution: Download HIP from GitHub, build it from source with hcc, and then either use the newly built HIP instead of the one in /opt/rocm/hip or simply overwrite that location with the new build.
Issue: For Carrizo - HCC RUNTIME ERROR: Fail to find compatible kernel
Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"
Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Fail to find compatible kernel
Solution: export HCC_AMDGPU_TARGET=gfx900
Running¶
Notice¶
Before reading this wiki page, make sure hipBLAS and its client applications have been built successfully, as described in Build hipBLAS libraries and verification code.
Samples
cd [BUILD_DIR]/clients/staging
./example-sscal
For example code that calls hipBLAS, see also the blog post Example C code calling hipBLAS routine.
Unit tests
Run tests with the following:
cd [BUILD_DIR]/clients/staging
./hipblas-test
To run specific tests, use --gtest_filter=match, where match is a ':'-separated list of wildcard patterns (called the positive patterns), optionally followed by a '-' and another ':'-separated pattern list (called the negative patterns). For example, run the gemv tests with the following:
cd [BUILD_DIR]/clients/staging
./hipblas-test --gtest_filter=*gemv*
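To make the filter syntax concrete, here is a minimal C++ sketch of how a filter string of this form could be interpreted. It is not googletest's own implementation: the helper names (wildcardMatch, splitPatterns, matchesAny, matchesFilter) are hypothetical, only the '*' and '?' wildcards are handled, and everything after the first '-' is assumed to be the negative list.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns true if `name` matches `pattern`, where '*' matches any
// (possibly empty) substring and '?' matches any single character.
bool wildcardMatch(const char* pattern, const char* name)
{
    if (*pattern == '\0') return *name == '\0';
    if (*pattern == '*')
        return wildcardMatch(pattern + 1, name) ||
               (*name != '\0' && wildcardMatch(pattern, name + 1));
    if (*name == '\0') return false;
    return (*pattern == '?' || *pattern == *name) &&
           wildcardMatch(pattern + 1, name + 1);
}

// Splits a ':'-separated pattern list into individual patterns.
std::vector<std::string> splitPatterns(const std::string& list)
{
    std::vector<std::string> out;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ':'))
        if (!item.empty()) out.push_back(item);
    return out;
}

// Returns true if `name` matches at least one pattern in the list.
bool matchesAny(const std::vector<std::string>& patterns, const std::string& name)
{
    for (const std::string& p : patterns)
        if (wildcardMatch(p.c_str(), name.c_str())) return true;
    return false;
}

// A test is selected if it matches a positive pattern and no negative one.
// Assumption: an empty positive list is treated as "*".
bool matchesFilter(const std::string& filter, const std::string& testName)
{
    const std::size_t dash = filter.find('-');
    std::string positive = filter.substr(0, dash);
    const std::string negative =
        (dash == std::string::npos) ? "" : filter.substr(dash + 1);
    if (positive.empty()) positive = "*";
    return matchesAny(splitPatterns(positive), testName) &&
           !matchesAny(splitPatterns(negative), testName);
}
```

With this sketch, a filter such as *gemv*-*double* would select a test named gemv_float but not one named gemv_double (both test names are made up for illustration).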
hcRNG¶
Introduction¶
The hcRNG library is an implementation of uniform random number generators targeting AMD heterogeneous hardware via the HCC compiler runtime. The computational resources of the underlying AMD heterogeneous compute hardware are exposed and exploited through the HCC C++ frontend. Refer here for more details on the HCC compiler.
The following random number generators are currently supported:
MRG31k3p
MRG32k3a
LFSR113
Philox-4x32-10
Examples¶
Random number generator Mrg31k3p example:
file: Randomarray.cpp
//This example is a simple random array generation and it compares host output with device output
//Random number generator Mrg31k3p
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>
#include <iostream>
#include <vector>
#include <hcRNG/mrg31k3p.h>
#include <hcRNG/hcRNG.h>
#include <hc.hpp>
#include <hc_am.hpp>
using namespace hc;
int main()
{
hcrngStatus status = HCRNG_SUCCESS;
bool ispassed = 1;
size_t streamBufferSize;
// Number of streams
size_t streamCount = 10;
//Number of random numbers to be generated
//numberCount must be a multiple of streamCount
size_t numberCount = 100;
//Enumerate the list of accelerators
std::vector<hc::accelerator>acc = hc::accelerator::get_all();
accelerator_view accl_view = (acc[1].create_view());
//Allocate memory for host pointers
float *Random1 = (float*) malloc(sizeof(float) * numberCount);
float *Random2 = (float*) malloc(sizeof(float) * numberCount);
float *outBufferDevice = hc::am_alloc(sizeof(float) * numberCount, acc[1], 0);
//Create streams
hcrngMrg31k3pStream *streams = hcrngMrg31k3pCreateStreams(NULL, streamCount, &streamBufferSize, NULL);
hcrngMrg31k3pStream *streams_buffer = hc::am_alloc(sizeof(hcrngMrg31k3pStream) * streamCount, acc[1], 0);
accl_view.copy(streams, streams_buffer, streamCount* sizeof(hcrngMrg31k3pStream));
//Invoke random number generators in device (here stream_length and streams_per_thread arguments are default)
status = hcrngMrg31k3pDeviceRandomU01Array_single(accl_view, streamCount, streams_buffer, numberCount, outBufferDevice);
if(status) std::cout << "TEST FAILED" << std::endl;
accl_view.copy(outBufferDevice, Random1, numberCount * sizeof(float));
//Invoke random number generators in host
for (size_t i = 0; i < numberCount; i++)
Random2[i] = hcrngMrg31k3pRandomU01(&streams[i % streamCount]);
// Compare host and device outputs
for(int i =0; i < numberCount; i++) {
if (Random1[i] != Random2[i]) {
ispassed = 0;
std::cout <<" RANDDEVICE[" << i<< "] " << Random1[i] << "and RANDHOST[" << i <<"] mismatches"<< Random2[i] << std::endl;
break;
}
else
continue;
}
if(!ispassed) std::cout << "TEST FAILED" << std::endl;
//Free host resources
free(Random1);
free(Random2);
//Release device resources
hc::am_free(outBufferDevice);
hc::am_free(streams_buffer);
return 0;
}
Compiling the example code:
/opt/hcc/bin/clang++ $(/opt/hcc/bin/hcc-config --cxxflags --ldflags) -lhc_am -lhcrng Randomarray.cpp
Installation¶
Installation steps
The following are the steps to use the library:
ROCm 2.4 kernel, driver, and compiler installation (if not already done)
Library installation
ROCm 2.4 Installation
To learn more about ROCm, refer here.
a. Installing Debian ROCm repositories
Before proceeding, make sure to completely uninstall any pre-release ROCm packages.
Refer here for instructions on removing pre-release ROCm packages.
Follow these steps to install the rocm package:
wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
sudo apt-get update
sudo apt-get install rocm
Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:
sudo vi /etc/default/grub
sudo update-grub
Then reboot the system.
b. Verifying the Installation
Once the system has rebooted, verify that the ROCm stack installed successfully by running the HSA vector_copy sample application:
cd /opt/rocm/hsa/sample
make
./vector_copy
Library Installation
a. Install using the prebuilt Debian package
wget https://github.com/ROCmSoftwarePlatform/hcRNG/blob/master/pre-builds/hcrng-master-184472e-Linux.deb
sudo dpkg -i hcrng-master-184472e-Linux.deb
b. Build the Debian package from source
git clone https://github.com/ROCmSoftwarePlatform/hcRNG.git && cd hcRNG
chmod +x build.sh && ./build.sh
Running build.sh builds the library and generates a Debian package under the build directory.
Key Features¶
Support for 4 commonly used uniform random number generators.
Single and Double precision.
Multiple streams, created on the host, that generate random numbers either on the host or on computing devices.
Prerequisites
This section lists the known set of hardware and software requirements to build this library
Hardware
CPU: mainstream brand; preferably an Intel Haswell-based CPU with at least 4 cores
System Memory >= 4GB (preferably > 10GB for NN applications over multiple GPUs)
Hard Drive > 200GB (preferably an SSD or NVMe drive for NN applications over multiple GPUs)
Minimum GPU Memory (Global) > 2GB
GPU cards supported
dGPU: AMD R9 Fury X, R9 Fury, R9 Nano
APU: AMD Kaveri or Carrizo
AMD Driver and Runtime
Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime
System software
Ubuntu 14.04 trusty and later
GCC 4.6 and later
CPP 4.6 and later (comes with the GCC package)
python 2.7 and later
python-pip
BeautifulSoup4 (installed using python-pip)
HCC 0.9 from here
Tools and Misc
git 1.9 and later
cmake 2.6 and later (2.6 and 2.8 are tested)
firewall off
root privilege or user account in sudo group
Ubuntu Packages
libc6-dev-i386
liblapack-dev
graphicsmagick
libblas-dev
Tested Environments¶
Driver versions
Boltzmann Early Release Driver + dGPU
Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime
Traditional HSA driver + APU (Kaveri)
GPU Cards
Radeon R9 Nano
Radeon R9 FuryX
Radeon R9 Fury
Kaveri and Carrizo APU
Server System
Supermicro SYS 2028GR-THT 6 R9 NANO
Supermicro SYS-1028GQ-TRT 4 R9 NANO
Supermicro SYS-7048GR-TR Tower 4 R9 NANO
Unit testing¶
a) Automated testing:
Follow these steps to start automated testing:
cd ~/hcRNG/
./build.sh --test=on
b) Manual testing:
(i) Google testing (GTEST) with Functionality check
cd ~/hcRNG/build/test/unit/bin/
All functions are tested against google test.
hipeigen¶
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
For more information go to http://eigen.tuxfamily.org/.
Installation instructions for ROCm¶
The ROCm Platform brings a rich foundation to advanced computing by seamlessly integrating the CPU and GPU with the goal of solving real-world problems.
To install ROCm, follow these steps:
Installing from AMD ROCm repositories¶
AMD is hosting both debian and rpm repositories for the ROCm 2.4 packages. The packages in both repositories have been signed to ensure package integrity. Directions for each repository are given below:
Debian repository - apt-get
Add the ROCm apt repository
The complete installation steps for ROCm can be found here. Alternatively, follow the steps below.
For Debian based systems, like Ubuntu, configure the Debian ROCm repository as follows:
wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
The gpg key might change, so it may need to be updated when installing a new release.
Install or Update
Next, update the apt-get repository list and install/update the rocm package:
Warning
Before proceeding, make sure to completely uninstall any pre-release ROCm packages
sudo apt-get update
sudo apt-get install rocm
Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:
sudo vi /etc/default/grub
sudo update-grub
Once complete, reboot your system.
We recommend you verify your installation to make sure everything completed successfully.
Installation instructions for Eigen¶
Explanation before starting
Eigen consists only of header files, so there is nothing to compile before you can use it. Moreover, these header files do not depend on your platform; they are the same for everybody.
Method 1. Installing without using CMake
You can use the headers in the Eigen/ subdirectory right away. To install, just copy this Eigen/ subdirectory to your preferred location. If you also want the unsupported features, copy the unsupported/ subdirectory too.
Method 2. Installing using CMake
Let’s call this directory ‘source_dir’ (where this INSTALL file is). Before starting, create another directory which we will call ‘build_dir’.
Do:
cd build_dir
cmake source_dir
make install
The make install step may require administrator privileges.
You can adjust the installation destination (the “prefix”) by passing the -DCMAKE_INSTALL_PREFIX=myprefix option to cmake, as is explained in the message that cmake prints at the end.
Build and Run hipeigen direct tests¶
To build the direct tests for hipeigen:
cd build_dir
make check -j $(nproc)
Note: All direct tests should pass with ROCm 2.4
clFFT¶
For Github Repository clFFT
clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and heterogeneous programming.
Pre-built binaries are available here.
Introduction to clFFT¶
The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the DFT definition to reduce the computational cost from O(N^2) to O(N log2(N)) when the sequence length N is the product of small prime factors. Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems, and no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that are optimized for AMD graphics processors but are also functional across CPUs and other compute devices.
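As an illustration of where the savings come from, the following C++ sketch compares a direct evaluation of the DFT sum with a textbook recursive radix-2 FFT. It is a teaching sketch under simplifying assumptions (power-of-two length, host-only, double precision) and does not reflect how clFFT generates its kernels; the names naiveDft and fftRadix2 are hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Direct evaluation of the DFT definition: N output bins with N terms each,
// so the work grows as O(N^2).
std::vector<cd> naiveDft(const std::vector<cd>& x)
{
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += x[t] * std::polar(1.0, -2.0 * pi * double(k * t) / double(n));
    return out;
}

// Recursive radix-2 decimation-in-time FFT. The length must be a power of two.
// Each level does O(N) work and there are log2(N) levels.
std::vector<cd> fftRadix2(const std::vector<cd>& x)
{
    const std::size_t n = x.size();
    if (n <= 1) return x;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    const std::vector<cd> e = fftRadix2(even);
    const std::vector<cd> o = fftRadix2(odd);
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // Symmetry: the same twiddle product feeds two output bins.
        const cd t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * o[k];
        out[k]         = e[k] + t;
        out[k + n / 2] = e[k] - t;
    }
    return out;
}
```

Both functions should agree to within rounding error on any power-of-two input, which makes the naive version a convenient reference when checking FFT output.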
The clFFT library is an open source OpenCL library implementation of discrete Fast Fourier Transforms. The library:
provides a fast and accurate platform for calculating discrete FFTs.
works on CPU or GPU backends.
supports in-place or out-of-place transforms.
supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1.
supports planar (real and complex components in separate arrays) and interleaved (real and complex components as a pair contiguous in memory) formats.
supports dimension lengths that can be any combination of powers of 2, 3, 5, 7, 11 and 13.
supports single and double precision floating point formats.
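Two of the items above can be made concrete with a short host-side C++ sketch: a check for whether a length factors entirely into the supported radices, and a conversion from the interleaved layout to the planar layout. These helpers (hasOnlySmallPrimeFactors, PlanarData, interleavedToPlanar) are hypothetical and are not part of the clFFT API.

```cpp
#include <cstddef>
#include <vector>

// True if n > 0 and n has no prime factors other than 2, 3, 5, 7, 11 and 13.
bool hasOnlySmallPrimeFactors(std::size_t n)
{
    if (n == 0) return false;
    const std::size_t radices[] = {2, 3, 5, 7, 11, 13};
    for (std::size_t r : radices)
        while (n % r == 0) n /= r;
    return n == 1;
}

// Planar layout: real and imaginary components live in separate arrays.
struct PlanarData {
    std::vector<float> re;
    std::vector<float> im;
};

// Splits an interleaved array {re0, im0, re1, im1, ...} into planar arrays.
PlanarData interleavedToPlanar(const std::vector<float>& interleaved)
{
    PlanarData p;
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2) {
        p.re.push_back(interleaved[i]);
        p.im.push_back(interleaved[i + 1]);
    }
    return p;
}
```

For example, a length of 30030 (2 x 3 x 5 x 7 x 11 x 13) passes the check, while 34 (2 x 17) does not.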
clFFT library user documentation¶
Library and API documentation for developers is available online as a GitHub Pages website
API semantic versioning¶
Good software is typically the result of the loop of feedback and iteration; software interfaces no less so. clFFT follows the semantic versioning guidelines. The version number used is of the form MAJOR.MINOR.PATCH.
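As a rough illustration of what a MAJOR.MINOR.PATCH scheme lets client code decide, the following C++ sketch parses a version string and applies the usual semantic versioning compatibility rule. The SemVer type and both functions are hypothetical helpers, not part of clFFT, and the rule shown is the general semantic versioning convention rather than a statement about any particular clFFT release.

```cpp
#include <cstdio>
#include <string>

struct SemVer {
    int maj;
    int min;
    int patch;
};

// Parses "MAJOR.MINOR.PATCH". Returns false if the string is not in that form
// (for example, if a component is missing or trailing characters follow).
bool parseSemVer(const std::string& text, SemVer& out)
{
    char trailing = '\0';
    return std::sscanf(text.c_str(), "%d.%d.%d%c",
                       &out.maj, &out.min, &out.patch, &trailing) == 3;
}

// Under semantic versioning, an installed library is expected to remain
// compatible with code built against an older release with the same MAJOR number.
bool isCompatible(const SemVer& builtAgainst, const SemVer& installed)
{
    if (installed.maj != builtAgainst.maj) return false;
    if (installed.min != builtAgainst.min) return installed.min > builtAgainst.min;
    return installed.patch >= builtAgainst.patch;
}
```

Note that semantic versioning makes no stability promise while MAJOR is 0, so a check like this is only meaningful for releases numbered 1.0.0 and later.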
clFFT Wiki¶
The project wiki contains helpful documentation, including a build primer
Contributing code¶
Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests must be made against the /develop branch.
License¶
The source for clFFT is licensed under the Apache License , Version 2.0
Example¶
The following simple example shows how to use clFFT to compute a simple 1D forward transform
#include <stdlib.h>
/* No need to explicitly include the OpenCL headers */
#include <clFFT.h>
int main( void )
{
cl_int err;
cl_platform_id platform = 0;
cl_device_id device = 0;
cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
cl_context ctx = 0;
cl_command_queue queue = 0;
cl_mem bufX;
float *X;
cl_event event = NULL;
int ret = 0;
size_t N = 16;
/* FFT library related declarations */
clfftPlanHandle planHandle;
clfftDim dim = CLFFT_1D;
size_t clLengths[1] = {N};
/* Setup OpenCL environment. */
err = clGetPlatformIDs( 1, &platform, NULL );
err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
props[1] = (cl_context_properties)platform;
ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
queue = clCreateCommandQueue( ctx, device, 0, &err );
/* Setup clFFT. */
clfftSetupData fftSetup;
err = clfftInitSetupData(&fftSetup);
err = clfftSetup(&fftSetup);
/* Allocate host & initialize data. */
/* Only allocation shown for simplicity. */
X = (float *)malloc(N * 2 * sizeof(*X));
/* Prepare OpenCL memory objects and place data inside them. */
bufX = clCreateBuffer( ctx, CL_MEM_READ_WRITE, N * 2 * sizeof(*X), NULL, &err );
err = clEnqueueWriteBuffer( queue, bufX, CL_TRUE, 0,
N * 2 * sizeof( *X ), X, 0, NULL, NULL );
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
err = clfftSetResultLocation(planHandle, CLFFT_INPLACE);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &bufX, NULL, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer( queue, bufX, CL_TRUE, 0, N * 2 * sizeof( *X ), X, 0, NULL, NULL );
/* Release OpenCL memory objects. */
clReleaseMemObject( bufX );
free(X);
/* Release the plan. */
err = clfftDestroyPlan( &planHandle );
/* Release clFFT library. */
clfftTeardown( );
/* Release OpenCL working objects. */
clReleaseCommandQueue( queue );
clReleaseContext( ctx );
return ret;
}
Build dependencies¶
Library for Windows
To develop the clFFT library code on a Windows operating system, make sure the following packages are installed on your system:
Windows® 7/8.1
Visual Studio 2012 or later
Latest CMake
An OpenCL SDK, such as APP SDK 3.0
Library for Linux
To develop the clFFT library code on a Linux operating system, make sure the following packages are installed on your system:
GCC 4.6 and onwards
Latest CMake
An OpenCL SDK, such as APP SDK 3.0
Library for Mac OSX
To develop the clFFT library code on Mac OS X, it is recommended to generate Unix makefiles with cmake.
Test infrastructure
To test the developed clFFT library code, make sure the following packages are installed on your system:
Googletest v1.6
Latest FFTW
Latest Boost
Performance infrastructure¶
To measure the performance of the clFFT library code, ensure that the Python package is installed on your system.
clBLAS¶
For Github repository clBLAS
This repository houses the code for the OpenCL™ BLAS portion of clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. APPML 1.12 is the most current generally available pre-packaged binary version of the library available for download for both Linux and Windows platforms.
The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces neither hide nor wrap OpenCL interfaces, but rather leave OpenCL state management to the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user from the task of writing, optimizing and maintaining kernel code themselves.
clBLAS update notes 01/2017
v2.12 is a bugfix release that rolls up all fixes in the /develop branch. Thanks to @pavanky, @iotamudelta, @shahsan10, @psyhtest, @haahh, @hughperkins, @tfauck, @abhiShandy, @IvanVergiliev, @zougloub, @mgates3 for contributions to clBLAS v2.12. A summary of the fixes is available on the releases tab.
clBLAS library user documentation¶
Library and API documentation for developers is available online as a GitHub Pages website
clBLAS Wiki
The project wiki contains helpful documentation, including a build primer
Contributing code
Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests should be made against the /develop branch.
License¶
The source for clBLAS is licensed under the Apache License, Version 2.0
Example¶
The simple example below shows how to use clBLAS to compute an OpenCL accelerated SGEMM
#include <sys/types.h>
#include <stdio.h>
/* Include the clBLAS header. It includes the appropriate OpenCL headers */
#include <clBLAS.h>
/* This example uses predefined matrices and their characteristics for
* simplicity purpose.
*/
#define M 4
#define N 3
#define K 5
static const cl_float alpha = 10;
static const cl_float A[M*K] = {
11, 12, 13, 14, 15,
21, 22, 23, 24, 25,
31, 32, 33, 34, 35,
41, 42, 43, 44, 45,
};
static const size_t lda = K; /* i.e. lda = K */
static const cl_float B[K*N] = {
11, 12, 13,
21, 22, 23,
31, 32, 33,
41, 42, 43,
51, 52, 53,
};
static const size_t ldb = N; /* i.e. ldb = N */
static const cl_float beta = 20;
static cl_float C[M*N] = {
11, 12, 13,
21, 22, 23,
31, 32, 33,
41, 42, 43,
};
static const size_t ldc = N; /* i.e. ldc = N */
static cl_float result[M*N];
int main( void )
{
cl_int err;
cl_platform_id platform = 0;
cl_device_id device = 0;
cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
cl_context ctx = 0;
cl_command_queue queue = 0;
cl_mem bufA, bufB, bufC;
cl_event event = NULL;
int ret = 0;
/* Setup OpenCL environment. */
err = clGetPlatformIDs( 1, &platform, NULL );
err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
props[1] = (cl_context_properties)platform;
ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
queue = clCreateCommandQueue( ctx, device, 0, &err );
/* Setup clBLAS */
err = clblasSetup( );
/* Prepare OpenCL memory objects and place matrices inside them. */
bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
NULL, &err );
bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
NULL, &err );
bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
NULL, &err );
err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
M * K * sizeof( *A ), A, 0, NULL, NULL );
err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
K * N * sizeof( *B ), B, 0, NULL, NULL );
err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
M * N * sizeof( *C ), C, 0, NULL, NULL );
/* Call the clBLAS function. Perform gemm on the full matrices (all offsets are 0). */
err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
M, N, K,
alpha, bufA, 0, lda,
bufB, 0, ldb, beta,
bufC, 0, ldc,
1, &queue, 0, NULL, &event );
/* Wait for calculations to be finished. */
err = clWaitForEvents( 1, &event );
/* Fetch results of calculations from GPU memory. */
err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
M * N * sizeof(*result),
result, 0, NULL, NULL );
/* Release OpenCL memory objects. */
clReleaseMemObject( bufC );
clReleaseMemObject( bufB );
clReleaseMemObject( bufA );
/* Finalize work with clBLAS */
clblasTeardown( );
/* Release OpenCL working objects. */
clReleaseCommandQueue( queue );
clReleaseContext( ctx );
return ret;
}
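When checking the contents of the result buffer from an example like the one above, a plain CPU reference can be handy. The following C++ sketch computes C = alpha * A * B + beta * C for row-major matrices using the same leading-dimension convention as the example (lda = K, ldb = N, ldc = N). The function referenceSgemm is a hypothetical helper for verification only; it is not part of clBLAS and handles only the non-transposed case.

```cpp
#include <cstddef>
#include <vector>

// Row-major reference SGEMM without transposition:
//   c[i][j] = alpha * sum_p a[i][p] * b[p][j] + beta * c[i][j]
// a is m x k (leading dimension lda), b is k x n (ldb), c is m x n (ldc).
void referenceSgemm(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const std::vector<float>& a, std::size_t lda,
                    const std::vector<float>& b, std::size_t ldb,
                    float beta, std::vector<float>& c, std::size_t ldc)
{
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

Calling it with copies of the A, B and C arrays from the example (M = 4, N = 3, K = 5, alpha = 10, beta = 20) gives values that can be compared element by element against the result array read back from the device.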
Build dependencies¶
Library for Windows
Windows® 7/8
Visual Studio 2010 SP1, 2012
An OpenCL SDK, such as APP SDK 2.8
Latest CMake
Library for Linux
GCC 4.6 and onwards
An OpenCL SDK, such as APP SDK 2.9
Latest CMake
Library for Mac OSX
Recommended to generate Unix makefiles with cmake
Test infrastructure
Googletest v1.6
Latest Boost
CPU BLAS
Netlib CBLAS (recommended)
Ubuntu: install with "apt-get install libblas-dev"
Windows: download and install lapack-3.6.0, which comes with CBLAS
or ACML on Windows/Linux; Accelerate on Mac OSX
Performance infrastructure¶
Python
clSPARSE¶
For Github repository clSPARSE
clSPARSE is an OpenCL™ library implementing sparse linear algebra routines. This project is the result of a collaboration between AMD Inc. and Vratis Ltd.
What’s new in clSPARSE v0.10.1¶
Bug fix release
Fixes for travis builds
Fix to the matrix market reader in the cuSPARSE benchmark to synchronize with the regular MM reader
Replace cl.hpp with cl2.hpp (thanks to arrayfire)
Fixes for the Nvidia platform; tested 352.79
Fixed buffer overruns in CSR-Adaptive kernels
Fix invalid memory access on Nvidia GPUs in CSR-Adaptive SpMV kernel
clSPARSE features¶
Sparse Matrix - dense Vector multiply (SpM-dV)
Sparse Matrix - dense Matrix multiply (SpM-dM)
Sparse Matrix - Sparse Matrix multiply (SpGEMM) - single precision
Iterative conjugate gradient solver (CG)
Iterative biconjugate gradient stabilized solver (BiCGStab)
Dense to CSR conversions (& converse)
COO to CSR conversions (& converse)
Functions to read matrix market files in COO or CSR format
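For readers unfamiliar with the storage formats named above, the following host-only C++ sketch shows a COO to CSR conversion and a sparse matrix times dense vector product over the CSR result. It illustrates the data layouts only; CsrMatrix, cooToCsr and spmv are hypothetical names, and none of this is the clSPARSE API or its kernel implementation.

```cpp
#include <cstddef>
#include <vector>

// CSR storage: rowPtr has rows + 1 entries. The nonzeros of row r are
// values[rowPtr[r]] .. values[rowPtr[r + 1] - 1], with columns in colIdx.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> colIdx;
    std::vector<float> values;
};

// Builds CSR from COO triplets (cooRow[i], cooCol[i], cooVal[i]) using a
// counting pass followed by a prefix sum. The triplets may be in any order.
CsrMatrix cooToCsr(std::size_t rows, std::size_t cols,
                   const std::vector<std::size_t>& cooRow,
                   const std::vector<std::size_t>& cooCol,
                   const std::vector<float>& cooVal)
{
    CsrMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.rowPtr.assign(rows + 1, 0);
    m.colIdx.resize(cooVal.size());
    m.values.resize(cooVal.size());
    for (std::size_t r : cooRow) ++m.rowPtr[r + 1];      // nonzeros per row
    for (std::size_t r = 0; r < rows; ++r)               // prefix sum
        m.rowPtr[r + 1] += m.rowPtr[r];
    std::vector<std::size_t> next(m.rowPtr.begin(), m.rowPtr.end() - 1);
    for (std::size_t i = 0; i < cooVal.size(); ++i) {
        const std::size_t dst = next[cooRow[i]]++;       // next free slot in the row
        m.colIdx[dst] = cooCol[i];
        m.values[dst] = cooVal[i];
    }
    return m;
}

// y = A * x for a CSR matrix A and a dense vector x.
std::vector<float> spmv(const CsrMatrix& a, const std::vector<float>& x)
{
    std::vector<float> y(a.rows, 0.0f);
    for (std::size_t r = 0; r < a.rows; ++r)
        for (std::size_t i = a.rowPtr[r]; i < a.rowPtr[r + 1]; ++i)
            y[r] += a.values[i] * x[a.colIdx[i]];
    return y;
}
```

A reference like this can be used to cross-check results on small matrices before moving to larger matrix market inputs.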
In the spirit of the other clMath libraries, clSPARSE exports a “C” interface to allow projects to build wrappers around clSPARSE in any language they need. A great deal of thought and effort went into designing the APIs to make them less ‘cluttered’ compared to the older clMath libraries. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0 [3].
API semantic versioning¶
Good software is typically the result of iteration and feedback. clSPARSE follows the semantic versioning guidelines, and while the major version number remains ‘0’, the public API should not be considered stable. We release clSPARSE as beta software (0.y.z) early to the community to elicit feedback and comment. This comes with the expectation that with feedback, we may incorporate breaking changes to the API that might require early users to recompile, or rewrite portions of their code as we iterate on the design.
clSPARSE Wiki
The project wiki contains helpful documentation. A build primer is available, which describes how to use cmake to generate platform-specific build files.
Samples
clSPARSE contains a directory of simple OpenCL samples that demonstrate the use of the API in both C and C++. The superbuild script for clSPARSE also builds the samples as an external project, to demonstrate how an application would find and link to clSPARSE with cmake.
clSPARSE library documentation
API documentation is available at http://clmathlibraries.github.io/clSPARSE/. The samples give an excellent starting point to basic library operations.
Contributing code
Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. Code in the /master branch is considered to be stable and new library releases are made when commits are merged into /master. Active development and pull-requests should be made to the develop branch.
License¶
clSPARSE is licensed under the Apache License, Version 2.0
Compiling for Windows
Windows® 7/8
Visual Studio 2013 and above
CMake 2.8.12 (download from Kitware)
Solution (.sln) or
Nmake makefiles
An OpenCL SDK, such as APP SDK 3.0
Compiling for Linux
GCC 4.8 and above
CMake 2.8.12 (install with distro package manager )
Unix makefiles or
KDevelop or
QT Creator
An OpenCL SDK, such as APP SDK 3.0
Compiling for Mac OSX
CMake 2.8.12 (install via brew)
Unix makefiles or
XCode
An OpenCL SDK (installed via xcode-select --install)
Bench & Test infrastructure dependencies
Googletest v1.7
Boost v1.58
Footnotes
[1]: Changed to reflect CppCoreGuidelines: F.21
[2]: Changed to reflect CppCoreGuidelines: NL.8
[3]: OpenCL 2.0 support is not yet fully implemented; only the interfaces have been designed
clRNG¶
For Github repository clRNG
A library for uniform random number generation in OpenCL.
Streams of random numbers act as virtual random number generators. They can be created on the host computer in unlimited numbers, and then used either on the host or on computing devices by work items to generate random numbers. Each stream also has equally-spaced substreams, which are occasionally useful. The API is currently implemented for four different RNGs, namely the MRG31k3p, MRG32k3a, LFSR113 and Philox-4×32-10 generators.
What’s New¶
Libraries related to clRNG, for probability distributions and quasi-Monte Carlo methods, are available:
Releases
The first public version of clRNG is v1.0.0 beta. Please go to releases for downloads.
Building¶
1. Install the runtime dependency:
An OpenCL SDK, such as APP SDK.
2. Install the build dependencies:
The CMake cross-platform build system. Visual Studio users can use CMake Tools for Visual Studio.
A recent C compiler, such as GCC 4.9, or Visual Studio 2013.
3. Get the clRNG source code.
4. Configure the project using CMake (to generate standard makefiles) or CMake Tools for Visual Studio (to generate solution and project files).
5. Build the project.
6. Install the project (by default, the library will be installed in the package directory under the build directory).
7. Point the environment variable CLRNG_ROOT to the installation directory, i.e., the directory under which include/clRNG can be found. This step is optional if the library is installed under /usr, which is the default.
8. In order to execute the example programs (under the bin subdirectory of the installation directory) or to link clRNG into other software, the dynamic linker must be informed where to find the clRNG shared library. The name and location of the shared library generally depend on the platform.
9. Optionally run the tests.
Example Instructions for Linux¶
On a 64-bit Linux platform, steps 3 through 9 from above, executed in a Bash-compatible shell, could consist of:
git clone https://github.com/clMathLibraries/clRNG.git
mkdir clRNG.build; cd clRNG.build; cmake ../clRNG/src
make
make install
export CLRNG_ROOT=$PWD/package
export LD_LIBRARY_PATH=$CLRNG_ROOT/lib64:$LD_LIBRARY_PATH
$CLRNG_ROOT/bin/CTest
Examples
Examples can be found in src/client. The compiled client program examples can be found under the bin subdirectory of the installation package ($CLRNG_ROOT/bin under Linux). Note that the examples expect an OpenCL GPU device to be available.
Simple example
The simple example below shows how to use clRNG to generate random numbers by directly using device side headers (.clh) in your OpenCL kernel.
#include <stdlib.h>
#include <string.h>
#include "clRNG/clRNG.h"
#include "clRNG/mrg31k3p.h"
int main( void )
{
cl_int err;
cl_platform_id platform = 0;
cl_device_id device = 0;
cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
cl_context ctx = 0;
cl_command_queue queue = 0;
cl_program program = 0;
cl_kernel kernel = 0;
cl_event event = 0;
cl_mem bufIn, bufOut;
float *out;
char *clrng_root;
char include_str[1024];
char build_log[4096];
size_t i = 0;
size_t numWorkItems = 64;
clrngMrg31k3pStream *streams = 0;
size_t streamBufferSize = 0;
size_t kernelLines = 0;
/* Sample kernel that calls clRNG device-side interfaces to generate random numbers */
const char *kernelSrc[] = {
" #define CLRNG_SINGLE_PRECISION \n",
" #include <clRNG/mrg31k3p.clh> \n",
" \n",
" __kernel void example(__global clrngMrg31k3pHostStream *streams, \n",
" __global float *out) \n",
" { \n",
" int gid = get_global_id(0); \n",
" \n",
" clrngMrg31k3pStream workItemStream; \n",
" clrngMrg31k3pCopyOverStreamsFromGlobal(1, &workItemStream, \n",
" &streams[gid]); \n",
" \n",
" out[gid] = clrngMrg31k3pRandomU01(&workItemStream); \n",
" } \n",
" \n",
};
/* Setup OpenCL environment. */
err = clGetPlatformIDs( 1, &platform, NULL );
err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
props[1] = (cl_context_properties)platform;
ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
queue = clCreateCommandQueue( ctx, device, 0, &err );
/* Make sure CLRNG_ROOT is specified to get library path */
clrng_root = getenv("CLRNG_ROOT");
if(clrng_root == NULL)
{
/* Stop here: the string operations below would dereference a NULL pointer */
printf("\nSpecify environment variable CLRNG_ROOT as described\n");
return 1;
}
strcpy(include_str, "-I ");
strcat(include_str, clrng_root);
strcat(include_str, "/include");
/* Create sample kernel */
kernelLines = sizeof(kernelSrc) / sizeof(kernelSrc[0]);
program = clCreateProgramWithSource(ctx, kernelLines, kernelSrc, NULL, &err);
err = clBuildProgram(program, 1, &device, include_str, NULL, NULL);
if(err != CL_SUCCESS)
{
printf("\nclBuildProgram has failed\n");
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 4096, build_log, NULL);
printf("%s", build_log);
}
kernel = clCreateKernel(program, "example", &err);
/* Create streams */
streams = clrngMrg31k3pCreateStreams(NULL, numWorkItems, &streamBufferSize, (clrngStatus *)&err);
/* Create buffers for the kernel */
bufIn = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, streamBufferSize, streams, &err);
bufOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY, numWorkItems * sizeof(cl_float), NULL, &err);
/* Setup the kernel */
err = clSetKernelArg(kernel, 0, sizeof(bufIn), &bufIn);
err = clSetKernelArg(kernel, 1, sizeof(bufOut), &bufOut);
/* Execute the kernel and read back results */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &numWorkItems, NULL, 0, NULL, &event);
err = clWaitForEvents(1, &event);
out = (float *)malloc(numWorkItems * sizeof(out[0]));
err = clEnqueueReadBuffer(queue, bufOut, CL_TRUE, 0, numWorkItems * sizeof(out[0]), out, 0, NULL, NULL);
/* Release allocated resources */
clReleaseEvent(event);
free(out);
clReleaseMemObject(bufIn);
clReleaseMemObject(bufOut);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
return 0;
}
Building the documentation manually¶
The documentation can be generated by running make from within the doc directory. This requires Doxygen to be installed.
hcFFT¶
Installation¶
The following are the steps to use the library:
ROCm 2.4 kernel, driver and compiler installation (if not already done)
Library installation
ROCM 2.4 Installation
To learn more about ROCm, refer to https://github.com/RadeonOpenCompute/ROCm/blob/master/README.md
a. Installing Debian ROCM repositories
Before proceeding, make sure to completely uninstall any pre-release ROCm packages.
Refer https://github.com/RadeonOpenCompute/ROCm#removing-pre-release-packages for instructions to remove pre-release ROCM packages.
The steps to install the rocm package are:
wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
sudo apt-get update
sudo apt-get install rocm
Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:
sudo vi /etc/default/grub
sudo update-grub
and reboot the system.
b. Verifying the Installation
After rebooting, you can verify that the ROCm stack installed successfully by running the HSA vector_copy sample application:
cd /opt/rocm/hsa/sample
make
./vector_copy
Library Installation
a. Install using Prebuilt debian
wget https://github.com/ROCmSoftwarePlatform/hcFFT/blob/master/pre-builds/hcfft-master-87a37f5-Linux.deb
sudo dpkg -i hcfft-master-87a37f5-Linux.deb
b. Build debian from source
git clone https://github.com/ROCmSoftwarePlatform/hcFFT.git && cd hcFFT
chmod +x build.sh && ./build.sh
Running build.sh builds the library and generates a Debian package under the build directory.
c. Install CPU based FFTW3 library
sudo apt-get install fftw3 fftw3-dev pkg-config
Introduction¶
This repository hosts the HCC-based FFT library, which targets GPU acceleration of FFT routines on AMD devices. To learn about the HCC compiler's features, refer here.
The following sub-routines are implemented:
R2C : Transforms real-valued input in the time domain to complex-valued output in the frequency domain.
C2R : Transforms complex-valued input in the frequency domain to real-valued output in the time domain.
C2C : Transforms complex-valued input in the frequency domain to complex-valued output in the time domain, or vice versa.
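The relationship between the R2C input and output sizes can be illustrated with a small sketch. It uses a naive O(N^2) DFT written only for this illustration (the `dft` helper and the sample values are assumptions, and this is not how hcFFT computes transforms). For real input the spectrum is Hermitian-symmetric, so only N/2 + 1 complex outputs carry independent information.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform, for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Eight arbitrary real-valued samples (time domain).
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
n = len(signal)
spectrum = dft(signal)

# Hermitian symmetry of a real signal's spectrum: X[N-k] == conj(X[k]).
symmetric = all(
    abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9 for k in range(1, n)
)

# The non-redundant half that an R2C transform needs to store.
half_spectrum = spectrum[: n // 2 + 1]
print(symmetric, len(half_spectrum))  # True 5
```

A C2R transform goes the other way: it takes the N/2 + 1 stored values and relies on the same symmetry to rebuild N real samples.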
Key Features¶
Supports 1D, 2D and 3D Fast Fourier Transforms
Supports R2C, C2R, C2C, D2Z, Z2D and Z2Z transforms
Supports out-of-place data storage
Ability to choose the desired target accelerator
Single and Double precision
Prerequisites
This section lists the known set of hardware and software requirements to build this library
Hardware
CPU: mainstream brand; preferably an Intel Haswell-based CPU with >= 4 cores
System Memory >= 4GB (preferably > 10GB for NN applications over multiple GPUs)
Hard Drive > 200GB (preferably an SSD or NVMe drive for NN applications over multiple GPUs)
Minimum GPU Memory (Global) > 2GB
GPU cards supported
dGPU: AMD R9 Fury X, R9 Fury, R9 Nano
APU: AMD Kaveri or Carrizo
AMD Driver and Runtime
Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime
System software
Ubuntu 14.04 trusty and later
GCC 4.6 and later
CPP 4.6 and later (come with GCC package)
python 2.7 and later
python-pip
BeautifulSoup4 (installed using python-pip)
HCC 0.9 from here
Tools and Misc
git 1.9 and later
cmake 2.6 and later (2.6 and 2.8 are tested)
firewall off
root privilege or user account in sudo group
Ubuntu Packages
libc6-dev-i386
liblapack-dev
graphicsmagick
libblas-dev
Examples¶
FFT 1D R2C example:
file: hcfft_1D_R2C.cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <vector>
#include "hcfft.h"
#include "hc_am.hpp"
#include "hcfftlib.h"
int main(int argc, char* argv[]) {
int N = argc > 1 ? atoi(argv[1]) : 1024;
// HCFFT work flow
hcfftHandle plan;
hcfftResult status = hcfftPlan1d(&plan, N, HCFFT_R2C);
assert(status == HCFFT_SUCCESS);
int Rsize = N;
int Csize = (N / 2) + 1;
hcfftReal* input = (hcfftReal*)calloc(Rsize, sizeof(hcfftReal));
int seed = 123456789;
srand(seed);
// Populate the input
for(int i = 0; i < Rsize ; i++) {
input[i] = rand();
}
hcfftComplex* output = (hcfftComplex*)calloc(Csize, sizeof(hcfftComplex));
std::vector<hc::accelerator> accs = hc::accelerator::get_all();
assert(accs.size() > 1 && "accs[1] is used below, so at least 2 accelerators are required!");
hc::accelerator_view accl_view = accs[1].get_default_view();
hcfftReal* idata = hc::am_alloc(Rsize * sizeof(hcfftReal), accs[1], 0);
accl_view.copy(input, idata, sizeof(hcfftReal) * Rsize);
hcfftComplex* odata = hc::am_alloc(Csize * sizeof(hcfftComplex), accs[1], 0);
accl_view.copy(output, odata, sizeof(hcfftComplex) * Csize);
status = hcfftExecR2C(plan, idata, odata);
assert(status == HCFFT_SUCCESS);
accl_view.copy(odata, output, sizeof(hcfftComplex) * Csize);
status = hcfftDestroy(plan);
assert(status == HCFFT_SUCCESS);
free(input);
free(output);
hc::am_free(idata);
hc::am_free(odata);
}
Compiling the example code:
This assumes that the library and compiler have been installed as described in Installation.
/opt/rocm/hcc/bin/clang++ `/opt/rocm/hcc/bin/hcc-config --cxxflags --ldflags` -lhc_am -lhcfft -I../lib/include -L../build/lib/src hcfft_1D_R2C.cpp
Tested Environments¶
This section lists the tested combinations of hardware and system software.
Driver versions
- Boltzmann Early Release Driver + dGPU
Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime
Traditional HSA driver + APU (Kaveri)
GPU Cards
Radeon R9 Nano
Radeon R9 FuryX
Radeon R9 Fury
Kaveri and Carrizo APU
Server System
Supermicro SYS 2028GR-THT 6 R9 NANO
Supermicro SYS-1028GQ-TRT 4 R9 NANO
Supermicro SYS-7048GR-TR Tower 4 R9 NANO
Tensile¶
Introduction¶
Tensile is a tool for creating a benchmark-driven backend library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on an AMD GPU.
Overview for creating a custom TensileLib backend library for your application:
Install the PyYAML and cmake dependencies (mandatory), then git clone and cd Tensile.
Create a benchmark config.yaml file in ./Tensile/Configs/.
Run the benchmark. After the benchmark finishes, Tensile will dump 4 directories: 1 and 2 relate to benchmarking; 3 and 4 summarize the results from your library's (e.g. rocBLAS) viewpoint.
1_BenchmarkProblems: has all the problems descriptions and executables generated during benchmarking, where you can re-launch exe to reproduce results.
2_BenchmarkData: has the raw performance results.
3_LibraryLogic: has optimal kernel configurations yaml file and Winner*.csv. Usually rocBLAS takes the yaml files from this folder.
4_LibraryClient: has a client exe, so you can launch from a library viewpoint.
Add the Tensile library to your application’s CMake target. The Tensile library will be written, compiled and linked to your application at application-compile-time.
GPU kernels, written in HIP, OpenCL, or AMD GCN assembly.
Solution classes which enqueue the kernels.
APIs which call the fastest solution for a problem.
Quick Example (Ubuntu):¶
sudo apt-get install python-yaml
mkdir Tensile
cd Tensile
git clone https://github.com/ROCmSoftwarePlatform/Tensile repo
cd repo
git checkout master
mkdir build
cd build
python ../Tensile/Tensile.py ../Tensile/Configs/test_sgemm.yaml ./
After about 10 minutes of benchmarking, Tensile will print out the path to the client you can run.
./4_LibraryClient/build/client -h
./4_LibraryClient/build/client --sizes 5760 5760 1 5760
Benchmark Config example¶
Tensile uses an incremental and “programmable” benchmarking protocol.
Example Benchmark config.yaml as input file to Tensile¶
GlobalParameters:
  PrintLevel: 1
  ForceRedoBenchmarkProblems: False
  ForceRedoLibraryLogic: True
  ForceRedoLibraryClient: True
  CMakeBuildType: Release
  EnqueuesPerSync: 1
  SyncsPerBenchmark: 1
  LibraryPrintDebug: False
  NumElementsToValidate: 128
  ValidationMaxToPrint: 16
  ValidationPrintValids: False
  ShortNames: False
  MergeFiles: True
  PlatformIdx: 0
  DeviceIdx: 0
  DataInitTypeAB: 0
BenchmarkProblems:
  - # sgemm NN
    - # ProblemType
      OperationType: GEMM
      DataType: s
      TransposeA: False
      TransposeB: False
      UseBeta: True
      Batched: True
    - # BenchmarkProblemSizeGroup
      InitialSolutionParameters:
      BenchmarkCommonParameters:
        - ProblemSizes:
          - Range: [ [5760], 0, [1], 0 ]
        - LoopDoWhile: [False]
        - NumLoadsCoalescedA: [-1]
        - NumLoadsCoalescedB: [1]
        - WorkGroupMapping: [1]
      ForkParameters:
        - ThreadTile:
          - [ 8, 8 ]
          - [ 4, 8 ]
          - [ 4, 4 ]
        - WorkGroup:
          - [ 8, 16, 1 ]
          - [ 16, 16, 1 ]
        - LoopTail: [False, True]
        - EdgeType: ["None", "Branch", "ShiftPtr"]
        - DepthU: [ 8, 16]
        - VectorWidth: [1, 2, 4]
      BenchmarkForkParameters:
      JoinParameters:
        - MacroTile
      BenchmarkJoinParameters:
      BenchmarkFinalParameters:
        - ProblemSizes:
          - Range: [ [5760], 0, [1], 0 ]
LibraryLogic:
LibraryClient:
Structure of config.yaml¶
The top-level data structure has the keys GlobalParameters, BenchmarkProblems, LibraryLogic and LibraryClient.
GlobalParameters contains a dictionary storing global parameters used for all parts of the benchmarking.
BenchmarkProblems contains a list of dictionaries representing the benchmarks to conduct; each element, i.e. dictionary, in the list is for benchmarking a single ProblemType. The keys for these dictionaries are ProblemType, InitialSolutionParameters, BenchmarkCommonParameters, ForkParameters, BenchmarkForkParameters, JoinParameters, BenchmarkJoinParameters and BenchmarkFinalParameters. See Benchmark Protocol for more information on these steps.
LibraryLogic contains a dictionary storing parameters for analyzing the benchmark data and designing how the backend library will select which Solution for certain ProblemSizes.
LibraryClient contains a dictionary storing parameters for actually creating the library and creating a client which calls into the library.
Global Parameters¶
Name: Prefix to add to API function names; typically name of device.
MinimumRequiredVersion: Which version of Tensile is required to interpret this yaml file
RuntimeLanguage: Use HIP or OpenCL runtime.
KernelLanguage: For OpenCL runtime, kernel language must be set to OpenCL. For HIP runtime, kernel language can be set to HIP or assembly (gfx803, gfx900).
PrintLevel: 0=Tensile prints nothing, 1=prints some, 2=prints a lot.
ForceRedoBenchmarkProblems: False means don’t redo a benchmark phase if results for it already exist.
ForceRedoLibraryLogic: False means don’t re-generate library logic if it already exists.
ForceRedoLibraryClient: False means don’t re-generate library client if it already exists.
CMakeBuildType: Release or Debug
EnqueuesPerSync: Num enqueues before syncing the queue.
SyncsPerBenchmark: Num queue syncs for each problem size.
LibraryPrintDebug: True means Tensile solutions will print kernel enqueue info to stdout
NumElementsToValidate: Number of elements to validate; 0 means no validation.
ValidationMaxToPrint: How many invalid results to print.
ValidationPrintValids: True means print validation comparisons that are valid, not just invalids.
ShortNames: Convert long kernel, solution and files names to short serial ids.
MergeFiles: False means write each solution and kernel to its own file.
PlatformIdx: OpenCL platform id.
DeviceIdx: OpenCL or HIP device id.
DataInitType[AB,C]: Initialize validation data with 0=0’s, 1=1’s, 2=serial, 3=random.
KernelTime: Use kernel time reported from runtime rather than api times from cpu clocks to compare kernel performance.
The exhaustive list of global parameters and their defaults is stored in Common.py.
Problem Type Parameters¶
OperationType: GEMM or TensorContraction.
DataType: s, d, c, z, h
UseBeta: False means library/solutions/kernel won’t accept a beta parameter; thus beta=0.
UseInitialStrides: False means data is contiguous in memory.
HighPrecisionAccumulate: For tmpC += a*b, use twice the precision for tmpC as for DataType. Not yet implemented.
ComplexConjugateA: True or False; ignored for real precision.
ComplexConjugateB: True or False; ignored for real precision.
For OperationType=GEMM only:
TransposeA: True or False.
TransposeB: True or False.
Batched: True (False has been deprecated).
For OperationType=TensorContraction only (showing batched gemm NT: C[ijk] = Sum[l] A[ilk] * B[jlk]):
IndexAssignmentsA: [0, 3, 2]
IndexAssignmentsB: [1, 3, 2]
NumDimensionsC: 3.
Solution / Kernel Parameters¶
See: Kernel Parameters.
Defaults¶
Because of the flexibility / complexity of the benchmarking process and, therefore, of the config.yaml files, Tensile has a default value for every parameter. If you neglect to put LoopUnroll anywhere in your benchmark, rather than crashing or complaining, Tensile will put the default LoopUnroll options into the default phase (common, fork, join…). This guarantees ease of use and, more importantly, backward compatibility; every time we add a new possible solution parameter, you don’t necessarily need to update your configs; we’ll have a default figured out for you.
However, this may cause some confusion. If your config forks 2 parameters, but you see that 3 were forked during benchmarking, that’s because you didn’t specify the 3rd parameter anywhere, so Tensile stuck it in its default phase, which was forking (for example). Also, specifying ForkParameters: and leaving it empty isn’t the same as leaving ForkParameters out of your config. If you leave ForkParameters out of your config, Tensile will add a ForkParameters step and put the default parameters into it (unless you put all the parameters elsewhere), but if you specify ForkParameters and leave it empty, then you won’t fork anything.
Therefore, it is safest to specify all parameters in your config.yaml files; that way you’ll guarantee the behavior you want. See /Tensile/Common.py for the current list of parameters.
Benchmark Protocol¶
Old Benchmark Architecture was Intractable¶
The benchmarking strategy from version 1 was vanilla flavored brute force: (8 WorkGroups)* (12 ThreadTiles)* (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)* (5 BranchTypes)* …*(1024 ProblemSizes)=23,592,960 is a multiplicative series which grows very quickly. Adding one more boolean parameter doubles the number of kernel enqueues of the benchmark.
Incremental Benchmark is Faster¶
Tensile version 2 allows the user to manually interrupt the multiplicative series with “additions” instead of “multiplies”, i.e., (8 WorkGroups)* (12 ThreadTiles)+ (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)+ (5 BranchTypes)* …+(1024 ProblemSizes)=1,173, which is a dramatically smaller number of enqueues. Now, adding one more boolean parameter may add only 2 more enqueues.
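The difference between the two strategies can be checked with a few lines of arithmetic. The sketch below is not part of Tensile; the helper names are made up for illustration, the option counts are the ones quoted in the text, and the elided "…" terms are simply left out.

```python
from math import prod

# Option counts quoted in the text (the elided "..." terms are omitted).
brute_force_counts = [8, 12, 4, 4, 3, 5, 1024]

def enqueues_brute_force(option_counts):
    """Every combination of every parameter is benchmarked."""
    return prod(option_counts)

def enqueues_incremental(groups):
    """Each group is benchmarked exhaustively; the groups are then summed."""
    return sum(prod(group) for group in groups)

print(enqueues_brute_force(brute_force_counts))        # 23592960
print(enqueues_brute_force(brute_force_counts + [2]))  # 47185920 (doubled)

# The same parameters split into separately benchmarked groups.
groups = [[8, 12], [4, 4, 3], [5], [1024]]
before = enqueues_incremental(groups)
after = enqueues_incremental(groups + [[2]])  # boolean benchmarked on its own
print(after - before)                         # 2
```

A new boolean parameter doubles the brute-force total, but costs only two extra enqueues when it is benchmarked as its own step.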
Phases of Benchmark¶
To make Tensile’s programmability more manageable for the user and developer, the benchmarking protocol has been split up into several steps encoded in a config.yaml file. The sections below reference the following config.yaml. Note that this config.yaml has been created as a simple illustration and does not represent an actual good benchmark protocol. See the configs included in the repository (/Tensile/Configs) for examples of good benchmarking configs.
BenchmarkProblems:
  - # sgemm
    - # Problem Type
      OperationType: GEMM
      Batched: True
    - # Benchmark Size-Group
      InitialSolutionParameters:
        - WorkGroup: [ [ 16, 16, 1 ] ]
        - NumLoadsCoalescedA: [ 1 ]
        - NumLoadsCoalescedB: [ 1 ]
        - ThreadTile: [ [ 4, 4 ] ]
      BenchmarkCommonParameters:
        - ProblemSizes:
          - Range: [ [512], [512], [1], [512] ]
        - EdgeType: ["Branch", "ShiftPtr"]
          PrefetchGlobalRead: [False, True]
      ForkParameters:
        - WorkGroup: [ [8, 32, 1], [16, 16, 1], [32, 8, 1] ]
          ThreadTile: [ [2, 8], [4, 4], [8, 2] ]
      BenchmarkForkParameters:
        - ProblemSizes:
          - Exact: [ 2880, 2880, 1, 2880 ]
        - NumLoadsCoalescedA: [ 1, 2, 4, 8 ]
        - NumLoadsCoalescedB: [ 1, 2, 4, 8 ]
      JoinParameters:
        - MacroTile
      BenchmarkJoinParameters:
        - LoopUnroll: [8, 16]
      BenchmarkFinalParameters:
        - ProblemSizes:
          - Range: [ [16, 128], [16, 128], [1], [256] ]
Initial Solution Parameters¶
A Solution is comprised of ~20 parameters, and all are needed to create a kernel. Therefore, during the first benchmark which determines which WorkGroupShape is fastest, what are the other 19 solution parameters which are used to describe the kernels that we benchmark? That’s what InitialSolutionParameters are for. The solution used for benchmarking WorkGroupShape will use the parameters from InitialSolutionParameters. The user must choose good default solution parameters in order to correctly identify subsequent optimal parameters.
Problem Sizes¶
Each step of the benchmark can override what problem sizes will be benchmarked. A ProblemSizes entry of type Range is a list whose length is the number of indices in the ProblemType. A GEMM ProblemSizes must have 3 elements while a batched-GEMM ProblemSizes must have 4 elements. So, for a ProblemType of C[ij] = Sum[k] A[ik]*B[jk], the ProblemSizes elements represent [SizeI, SizeJ, SizeK]. For each index, there are 5 ways of specifying the sizes of that index:
[1968]
Benchmark only size 1968; n = 1.
[16, 1968]
Benchmark sizes 16 to 1968 using the default step size (=16); n = 123.
[16, 32, 1968]
Benchmark sizes 16 to 1968 using a step size of 32; n = 62.
[64, 32, 16, 1968]
Benchmark sizes from 64 to 1968 with an initial step size of 32, increasing the step size by 16 each iteration.
This causes fewer sizes to be benchmarked when the sizes are large, and more where the sizes are small; this is typically the desired behavior.
n = 15 (64, 96, 144, 208, 288, 384, 496, 624, 768, 928, 1104, 1296, 1504, 1728, 1968). The stride at the beginning is 32, and the last stride taken is 240.
0
The size of this index is just whatever size index 0 is. For a 3-dimensional ProblemType, this allows benchmarking only a 2-dimensional or 1-dimensional slice of problem sizes.
Here are a few examples of valid ProblemSizes for 3D GEMMs:
Range: [ [16, 128], [16, 128], [16, 128] ] # n = 512
Range: [ [16, 128], 0, 0] # n = 8
Range: [ [16, 16, 16, 5760], 0, [1024, 1024, 4096] ] # n = 108
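The counts in these examples can be reproduced with a short sketch. This is not Tensile's own code: `index_sizes` and `count_problem_sizes` are hypothetical helpers, and the expansion rule (inclusive end point, step increment applied after each size) is inferred from the examples in this section.

```python
from math import prod

def index_sizes(spec, default_step=16):
    """Expand one per-index spec into a list of sizes.

    Accepted forms: [size], [start, stop], [start, step, stop],
    [start, step, step_increment, stop].
    """
    if len(spec) == 1:
        return [spec[0]]
    if len(spec) == 2:
        start, stop = spec
        step, increment = default_step, 0
    elif len(spec) == 3:
        start, step, stop = spec
        increment = 0
    else:
        start, step, increment, stop = spec
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size += step
        step += increment
    return sizes

def count_problem_sizes(range_spec):
    """Count the problems in a Range; an integer entry mirrors another
    index, so it contributes no additional combinations."""
    return prod(len(index_sizes(s)) for s in range_spec if isinstance(s, list))

print(count_problem_sizes([[16, 128], [16, 128], [16, 128]]))            # 512
print(count_problem_sizes([[16, 128], 0, 0]))                            # 8
print(count_problem_sizes([[16, 16, 16, 5760], 0, [1024, 1024, 4096]]))  # 108
```

In the last example the first index expands to 27 sizes and the third to 4, which gives the 108 problems noted in the comment.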
Benchmark Common Parameters¶
During this first phase of benchmarking, we examine parameters which will be the same for all solutions for this ProblemType. During each step of benchmarking, there is only 1 winner. In the above example we are benchmarking the dictionary {EdgeType: [Branch, ShiftPtr], PrefetchGlobalRead: [False, True]}; therefore, this benchmark step generates 4 solution candidates, and the winner will be the fastest EdgeType/PrefetchGlobalRead combination. Assuming the winner is EdgeType=ShiftPtr (ET=SP) and PrefetchGlobalRead=True (PGR=T), then all solutions for this ProblemType will have ET=SP and PGR=T. Also, once a parameter has been determined, all subsequent benchmarking steps will use this determined parameter rather than pulling values from InitialSolutionParameters. Because the common parameters will apply to all kernels, they are typically the parameters which are compiler-dependent or hardware-dependent rather than being tile-dependent.
Fork Parameters¶
If we continued to determine every parameter in the above manner, we’d end up with a single fastest solution for the specified ProblemSizes; we usually desire multiple different solutions with varying parameters which may be fastest for different groups of ProblemSizes. One simple example of this is that small tile sizes are fastest for small problem sizes, and large tiles are fastest for large problem sizes.
Therefore, we allow “forking” parameters; this means keeping multiple winners after each benchmark step. In the above example we fork {WorkGroup: […], ThreadTile: […]}. This means that in subsequent benchmarking steps, rather than having one winning parameter, we’ll have one winning parameter per fork permutation; we’ll have 9 winners.
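As a rough illustration of where the 9 winners come from, the fork permutations are the Cartesian product of the forked value lists. The values below are copied from the example config; the dictionary layout is only an assumption made for this sketch, not Tensile's internal representation.

```python
from itertools import product

# Forked values copied from the example config.yaml.
work_groups = [[8, 32, 1], [16, 16, 1], [32, 8, 1]]
thread_tiles = [[2, 8], [4, 4], [8, 2]]

# One permutation per (WorkGroup, ThreadTile) pair; each keeps its own winner.
fork_permutations = [
    {"WorkGroup": wg, "ThreadTile": tt}
    for wg, tt in product(work_groups, thread_tiles)
]

print(len(fork_permutations))  # 9
print(fork_permutations[0])    # {'WorkGroup': [8, 32, 1], 'ThreadTile': [2, 8]}
```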
Benchmark Fork Parameters¶
When we benchmark the fork parameters, we retain one winner per permutation. Therefore, we first determine the fastest NumLoadsCoalescedA for each of the WG,TT permutations, then we determine the fastest NumLoadsCoalescedB for each permutation.
Join Parameters¶
After determining the fastest parameters for all the forked solution permutations, we have the option of reducing the number of winning solutions. When a parameter is listed in the JoinParameters section, that means that of the kept winning solutions, each will have a different value for that parameter. Listing more parameters to join results in more winners being kept, while having a JoinParameters section with no parameters listed results in only 1 fastest solution.
In our example we join over the MacroTile (work-group x thread-tile). After forking tiles, there were 9 solutions that we kept. After joining MacroTile, we’ll only keep five: 16x256, 32x128, 64x64, 128x32 and 256x16. The solutions that are kept are based on their performance during the last BenchmarkForkParameters benchmark, or, if there weren’t any, JoinParameters will conduct a benchmark of all solution candidates then choose the fastest.
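The join step can be sketched as grouping the forked solutions by their MacroTile and keeping the fastest of each group. The helper names and the performance numbers below are invented for illustration; only the WorkGroup and ThreadTile values come from the example config.

```python
from itertools import product

work_groups = [(8, 32, 1), (16, 16, 1), (32, 8, 1)]
thread_tiles = [(2, 8), (4, 4), (8, 2)]

def macro_tile(work_group, thread_tile):
    """MacroTile = WorkGroup * ThreadTile in dimensions 0 and 1."""
    return (work_group[0] * thread_tile[0], work_group[1] * thread_tile[1])

def join_by_macro_tile(solutions, gflops):
    """Keep only the fastest solution for each distinct MacroTile."""
    best = {}
    for solution in solutions:
        key = macro_tile(*solution)
        if key not in best or gflops[solution] > gflops[best[key]]:
            best[key] = solution
    return best

solutions = list(product(work_groups, thread_tiles))  # the 9 forked solutions
# Made-up performance numbers, only so the example has something to compare.
gflops = {s: 1000.0 + 10.0 * i for i, s in enumerate(solutions)}

kept = join_by_macro_tile(solutions, gflops)
print(sorted(kept))
# [(16, 256), (32, 128), (64, 64), (128, 32), (256, 16)]
```

Three of the nine forked solutions share the 64x64 MacroTile and two each share 32x128 and 128x32, which is why fewer solutions survive the join.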
Benchmark Join Parameters¶
After narrowing the list of fastest solutions through joining, you can continue to benchmark parameters, keeping one winning parameter per solution permutation.
Benchmark Final Parameters¶
After all the parameter benchmarking has been completed and the final list of fastest solutions has been assembled, we can benchmark all the solutions over a large set of ProblemSizes. This benchmark represents the final output of benchmarking; it outputs a .csv file where the rows are all the problem sizes and the columns are all the solutions. This is the information which gets analysed to produce the library logic.
Contributing¶
We’d love your help, but…
Never check in a tab (\t); use 4 spaces.
Follow the coding style of the file you’re editing.
Make pull requests against develop branch.
Rebase your develop branch against ROCmSoftwarePlatform::Tensile::develop branch right before pull-requesting.
In your pull request, state what you tested (which OS, what drivers, what devices, which config.yaml’s) so we can ensure that your changes haven’t broken anything.
Dependencies¶
CMake¶
CMake 2.8
Python¶
(One time only)
Ubuntu: sudo apt install python2.7 python-yaml
CentOS: sudo yum install python PyYAML
Fedora: sudo dnf install python PyYAML
Compilers¶
For Tensile_BACKEND = OpenCL1.2 (untested)
Visual Studio 14 (2015). (VS 2012 may also be supported; c++11 should no longer be required by Tensile. Need to verify.)
GCC 4.8 and above
For Tensile_BACKEND = HIP
Public ROCm
Installation¶
Tensile can be installed via:
Download repo and don’t install; install PyYAML dependency manually and call python scripts manually:
git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
python Tensile/Tensile/Tensile.py your_custom_config.yaml your_benchmark_path
Install develop branch directly from repo using pip:
pip install git+https://github.com/ROCmSoftwarePlatform/Tensile.git@develop
tensile your_custom_config.yaml your_benchmark_path
Download repo and install manually: (deprecated)
git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
cd Tensile
sudo python setup.py install
tensile your_custom_config.yaml your_benchmark_path
Kernel Parameters¶
Solution / Kernel Parameters¶
LoopDoWhile: True=DoWhile loop, False=While or For loop
LoopTail: Additional loop with LoopUnroll=1.
EdgeType: Branch, ShiftPtr or None
WorkGroup: [dim0, dim1, LocalSplitU]
ThreadTile: [dim0, dim1]
GlobalSplitU: Split up summation among work-groups to create more concurrency. This option launches a kernel to handle the beta scaling, then a second kernel where the writes to global memory are atomic.
PrefetchGlobalRead: True means outer loop should prefetch global data one iteration ahead.
PrefetchLocalRead: True means inner loop should prefetch lds data one iteration ahead.
WorkGroupMapping: In what order will work-groups compute C; affects caching.
LoopUnroll: How many iterations to unroll inner loop; helps loading coalesced memory.
MacroTile: Derived from WorkGroup*ThreadTile.
DepthU: Derived from LoopUnroll*SplitU.
NumLoadsCoalescedA,B: Number of loads from A (or B) in coalesced dimension.
GlobalReadCoalesceGroupA,B: True means adjacent threads map to adjacent global read elements (but, if transposing data then write to lds is scattered).
GlobalReadCoalesceVectorA,B: True means vector components map to adjacent global read elements (but, if transposing data then write to lds is scattered).
VectorWidth: Thread tile elements are contiguous for faster memory accesses. For example VW=4 means a thread will read a float4 from memory rather than 4 non-contiguous floats.
KernelLanguage: Whether kernels should be written in source code (HIP, OpenCL) or assembly (gfx803, gfx900, …).
The exhaustive list of solution parameters and their defaults is stored in Common.py.
Kernel Parameters Affect Performance¶
The kernel parameters affect many aspects of performance. Changing a parameter may help address one performance bottleneck but worsen another. That is why searching through the parameter space is vital to discovering the fastest kernel for a given problem.
How N-Dimensional Tensor Contractions Are Mapped to Finite-Dimensional GPU Kernels¶
For a traditional GEMM, the 2-dimensional output, C[i,j], is mapped to launching a 2-dimensional grid of work groups, each of which has a 2-dimensional grid of work items; one dimension belongs to i and one dimension belongs to j. The 1-dimensional summation is represented by a single loop within the kernel body.
Special Dimensions: D0, D1 and DU¶
To handle arbitrary dimensionality, Tensile begins by determining 3 special dimensions: D0, D1 and DU.
D0 and D1 are the free indices of A and B (one belongs to A and one to B) which have the shortest strides. This allows the inner-most loops to read from A and B the fastest via coalescing. In a traditional GEMM, every matrix has a dimension with a shortest stride of 1, but Tensile doesn’t make that assumption. Of these two dimensions, D0 is the dimension which has the shortest tensor C stride which allows for fast writing.
DU represents the summation index with the shortest combined stride (stride in A + stride in B); it becomes the innermost loop which gets “U”nrolled. This assignment is also meant to ensure fast reading in the innermost summation loop. There can be multiple summation indices (i.e. embedded loops) and DU will be iterated over in the innermost loop.
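The selection rules above can be written down as a small sketch. This is an interpretation of the description, not Tensile's implementation; the function name, the argument layout and the stride values are all assumptions chosen for illustration.

```python
def pick_special_dimensions(free_a, free_b, summation,
                            strides_a, strides_b, strides_c):
    """Return (D0, D1, DU) following the shortest-stride rules in the text."""
    # Free index of A and free index of B with the shortest strides.
    best_a = min(free_a, key=lambda i: strides_a[i])
    best_b = min(free_b, key=lambda i: strides_b[i])
    # Of those two, D0 is the one with the shorter stride in C.
    d0, d1 = sorted((best_a, best_b), key=lambda i: strides_c[i])
    # DU is the summation index with the shortest combined stride.
    du = min(summation, key=lambda i: strides_a[i] + strides_b[i])
    return d0, d1, du

# Hypothetical strides for C[i,j] = Sum[k,l] A[i,k,l] * B[j,l,k] with sizes
# i=4, j=4, k=8, l=16, assuming the first listed index is contiguous.
strides_a = {"i": 1, "k": 4, "l": 32}
strides_b = {"j": 1, "l": 4, "k": 64}
strides_c = {"i": 1, "j": 4}

print(pick_special_dimensions(["i"], ["j"], ["k", "l"],
                              strides_a, strides_b, strides_c))
# ('i', 'j', 'l')
```

Here l wins as DU because its combined stride (32 + 4) is shorter than that of k (4 + 64).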
GPU Kernel Dimension¶
OpenCL allows for a 3-dimensional grid of work-groups, and each work-group can be a 3-dimensional grid of work-items. Tensile assigns D0 to be dimension-0 of the work-group and work-item grids; it assigns D1 to be dimension-1 of the work-group and work-item grids. All other free or batch dimensions are flattened down into the final dimension-2 of the work-group and work-item grids. Within the GPU kernel, dimension-2 is reconstituted back into whatever dimensions it represents.
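Flattening several dimensions into one and reconstituting them later is ordinary index arithmetic. The sketch below shows the idea with hypothetical helper names; the ordering (first dimension varies fastest) is an assumption for illustration, since the text does not say which ordering Tensile uses.

```python
def flatten_index(indices, sizes):
    """Collapse a multi-dimensional index into one linear index
    (first dimension varies fastest)."""
    flat, stride = 0, 1
    for index, size in zip(indices, sizes):
        flat += index * stride
        stride *= size
    return flat

def unflatten_index(flat, sizes):
    """Recover the per-dimension indices from the linear index."""
    indices = []
    for size in sizes:
        indices.append(flat % size)
        flat //= size
    return tuple(indices)

# Three extra dimensions of sizes 3, 4 and 5 share grid dimension-2.
sizes = (3, 4, 5)
flat = flatten_index((2, 1, 3), sizes)
print(flat)                          # 41
print(unflatten_index(flat, sizes))  # (2, 1, 3)
```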
Languages¶
Tensile Benchmarking is Python¶
The benchmarking module, Tensile.py, is written in python. The python scripts write solution, kernels, cmake files and all other C/C++ files used for benchmarking.
Tensile Library¶
The Tensile API, Tensile.h, is confined to C89 so that it will be usable by most software. The code behind the API is allowed to be c++11.
Device Languages¶
The device languages Tensile supports for the GPU kernels are:
OpenCL 1.2
HIP
Assembly
gfx803
gfx900
Library Logic¶
Running the LibraryLogic phase of benchmarking analyses the benchmark data and encodes a mapping for each problem type. For each problem type, it maps problem sizes to best solution (i.e. kernel).
When you build Tensile.lib, you point the TensileCreateLibrary function to a directory where your library logic yaml files are.
Problem Nomenclature¶
Example Problems¶
Standard GEMM has 4 variants (2 free indices (i, j) and 1 summation index l)
NN (N: non-transpose): C[i,j] = Sum[l] A[i,l] * B[l,j]
NT (T: transpose): C[i,j] = Sum[l] A[i,l] * B[j, l]
TN: C[i,j] = Sum[l] A[l, i] * B[l,j]
TT: C[i,j] = Sum[l] A[l, i] * B[j, l]
C[i,j,k] = Sum[l] A[i,l,k] * B[l,j,k] (batched-GEMM; 2 free indices, 1 batched index k and 1 summation index l)
C[i,j] = Sum[k,l] A[i,k,l] * B[j,l,k] (2D summation)
C[i,j,k,l,m] = Sum[n] A[i,k,m,l,n] * B[j,k,l,n,m] (GEMM with 3 batched indices)
C[i,j,k,l,m] = Sum[n,o] A[i,k,m,o,n] * B[j,m,l,n,o] (4 free indices, 2 summation indices and 1 batched index)
C[i,j,k,l] = Sum[m,n] A[i,j,m,n,l] * B[m,n,k,j,l] (batched image convolution mapped to 7D tensor contraction)
and even crazier
Nomenclature¶
The indices describe the dimensionality of the problem being solved. A GEMM operation takes 2 2-dimensional matrices as input (totaling 4 input dimensions) and contracts them along one dimension (which cancels out 2 of the dimensions), resulting in a 2-dimensional result.
Whenever an index shows up in multiple tensors, those tensors must be the same size along that dimension but they may have different strides.
There are 3 categories of indices/dimensions that Tensile deals with: free, batch and bound.
Free Indices
Free indices are the indices of tensor C which come in pairs; one of the pair shows up in tensor A while the other shows up in tensor B. In the really crazy example above, i/j/k/l are the 4 free indices of tensor C. Indices i and k come from tensor A and indices j and l come from tensor B.
Batch Indices
Batch indices are the indices of tensor C which show up in both tensor A and tensor B. For example, the difference between the GEMM example and the batched-GEMM example above is the additional index. In the batched-GEMM example, the index k is the batch index, which batches together multiple independent GEMMs.
Bound/Summation Indices
The final type of indices are called bound indices or summation indices. These indices do not show up in tensor C; they show up in the summation symbol (Sum[k]) and in tensors A and B. It is along these indices that we perform the inner products (pairwise multiply then sum).
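The three categories can be derived mechanically from which tensors an index appears in. The following sketch applies those definitions to the "really crazy" example above; `classify_indices` is a hypothetical helper written for this illustration, not a Tensile function.

```python
def classify_indices(c, a, b):
    """Split the indices of C = Sum A * B into free, batch and bound sets."""
    c, a, b = set(c), set(a), set(b)
    # Free: in C and in exactly one of A or B.
    free = {i for i in c if (i in a) != (i in b)}
    # Batch: in C and in both A and B.
    batch = c & a & b
    # Bound (summation): in both A and B but not in C.
    bound = (a & b) - c
    return free, batch, bound

# C[i,j,k,l,m] = Sum[n,o] A[i,k,m,o,n] * B[j,m,l,n,o]
free, batch, bound = classify_indices("ijklm", "ikmon", "jmlno")
print(sorted(free), sorted(batch), sorted(bound))
# ['i', 'j', 'k', 'l'] ['m'] ['n', 'o']
```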
Limitations¶
Problems supported by Tensile must meet the following conditions:
There must be at least one pair of free indices.
Tensile.lib¶
After running the benchmark and generating library config files, you’re ready to add Tensile.lib to your project. Tensile provides a TensileCreateLibrary function, which can be called:
set(Tensile_BACKEND "HIP")
set( Tensile_LOGIC_PATH "~/LibraryLogic" CACHE STRING "Path to Tensile logic.yaml files")
option( Tensile_MERGE_FILES "Tensile to merge kernels and solutions files?" OFF)
option( Tensile_SHORT_NAMES "Tensile to use short file/function names? Use if compiler complains they're too long." OFF)
option( Tensile_PRINT_DEBUG "Tensile to print runtime debug info?" OFF)
find_package(Tensile) # use if Tensile has been installed
TensileCreateLibrary(
${Tensile_LOGIC_PATH}
${Tensile_BACKEND}
${Tensile_MERGE_FILES}
${Tensile_SHORT_NAMES}
${Tensile_PRINT_DEBUG}
Tensile_ROOT ${Tensile_ROOT} # optional; use if tensile not installed
)
target_link_libraries( TARGET Tensile )
TODO: Where is the Tensile include directory?
Versioning¶
Tensile follows semantic versioning practices, i.e. Major.Minor.Patch, in BenchmarkConfig.yaml files, LibraryConfig.yaml files and in cmake find_package. Tensile is compatible with a “MinimumRequiredVersion” if Tensile.Major==MRV.Major and Tensile.Minor.Patch >= MRV.Minor.Patch.
Major: Tensile increments the major version if the public API changes, or if either the benchmark.yaml or library-config.yaml files change format in a non-backwards-compatible manner.
Minor: Tensile increments the minor version when new kernel, solution or benchmarking features are introduced in a backwards-compatible manner.
Patch: Bug fixes or minor improvements.
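The compatibility rule stated above (same major version, and Minor.Patch at least the required Minor.Patch) can be expressed as a tuple comparison. This is a sketch of the rule as written, with hypothetical helper names and made-up version numbers; it is not Tensile's own version check.

```python
def parse_version(text):
    """Split 'Major.Minor.Patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch

def is_compatible(tensile_version, minimum_required_version):
    """Same major version, and (minor, patch) >= the required (minor, patch)."""
    major, minor, patch = parse_version(tensile_version)
    mrv_major, mrv_minor, mrv_patch = parse_version(minimum_required_version)
    return major == mrv_major and (minor, patch) >= (mrv_minor, mrv_patch)

# Made-up version numbers, for illustration only.
print(is_compatible("2.4.1", "2.3.0"))  # True
print(is_compatible("2.2.9", "2.3.0"))  # False (minor too old)
print(is_compatible("3.0.0", "2.3.0"))  # False (major differs)
```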
rocALUTION¶
Introduction¶
Overview¶
rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism, targeting modern processors and accelerators including multi/many-core CPU and GPU platforms. The main goal of this package is to provide a portable library for iterative sparse methods on state of the art hardware. rocALUTION can be seen as middle-ware between different parallel backends and application specific packages.
The major features and characteristics of the library are
- Various backends
Host - fallback backend, designed for CPUs
GPU/HIP - accelerator backend, designed for HIP capable AMD GPUs
OpenMP - designed for multi-core CPUs
MPI - designed for multi-node and multi-GPU configurations
- Easy to use
The syntax and structure of the library make it easy to learn. With the help of the examples, anyone can try out the library; no knowledge of HIP, OpenMP or MPI programming is required.
- No special hardware requirements
There are no hardware requirements to install and run rocALUTION. If a GPU device and HIP are available, the library will use them.
- Variety of iterative solvers
Fixed-Point iteration - Jacobi, Gauss-Seidel, Symmetric Gauss-Seidel, SOR and SSOR
Krylov subspace methods - CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGSTAB, Flexible CG/GMRES
Mixed-precision defect-correction scheme
Chebyshev iteration
Multiple MultiGrid schemes, geometric and algebraic
- Various preconditioners
Matrix splitting - Jacobi, (Multi-colored) Gauss-Seidel, Symmetric Gauss-Seidel, SOR, SSOR
Factorization - ILU(0), ILU(p) (based on levels), ILU(p,q) (power(q)-pattern method), Multi-Elimination ILU (nested/recursive), ILUT (based on threshold) and IC(0)
Approximate Inverse - Chebyshev matrix-valued polynomial, SPAI, FSAI and TNS
Diagonal-based preconditioner for Saddle-point problems
Block-type of sub-preconditioners/solvers
Additive Schwarz and Restricted Additive Schwarz
Variable type preconditioners
- Generic and robust design
rocALUTION is based on a generic and robust design allowing expansion in the direction of new solvers and preconditioners and support for various hardware types. Furthermore, the design of the library allows the use of all solvers as preconditioners in other solvers. For example you can easily define a CG solver with a Multi-Elimination preconditioner, where the last-block is preconditioned with another Chebyshev iteration method which is preconditioned with a multi-colored Symmetric Gauss-Seidel scheme.
- Portable code and results
All code based on rocALUTION is portable and independent of HIP or OpenMP. The code will compile and run everywhere. All solvers and preconditioners are based on a single source code, which delivers portable results across all supported backends (variations are possible due to different rounding modes on the hardware). The only difference which you can see for a hardware change is the performance variation.
- Support for several sparse matrix formats
Compressed Sparse Row (CSR), Modified Compressed Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), Hybrid format of ELL and COO (HYB).
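To make two of these formats concrete, the following sketch stores a matrix in COO form and converts it to CSR with a counting sort over the row indices. The `CooMatrix` and `CsrMatrix` structs and the `coo_to_csr` function are hypothetical illustrations using standard containers; rocALUTION's own classes and conversion routines differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical containers for illustration; not rocALUTION types.
struct CooMatrix {
    int nrow, ncol;
    std::vector<int> row, col;   // one (row, col) pair per non-zero
    std::vector<double> val;
};

struct CsrMatrix {
    int nrow, ncol;
    std::vector<int> row_offset; // size nrow + 1
    std::vector<int> col;        // column index per non-zero
    std::vector<double> val;
};

// Convert COO to CSR. Entries within a row keep their COO input order.
inline CsrMatrix coo_to_csr(const CooMatrix& coo) {
    const std::size_t nnz = coo.val.size();
    CsrMatrix csr{coo.nrow, coo.ncol, std::vector<int>(coo.nrow + 1, 0),
                  std::vector<int>(nnz), std::vector<double>(nnz)};
    // Count entries per row (shifted by one), then prefix-sum into offsets.
    for (std::size_t k = 0; k < nnz; ++k) ++csr.row_offset[coo.row[k] + 1];
    for (int i = 0; i < coo.nrow; ++i) csr.row_offset[i + 1] += csr.row_offset[i];
    // Scatter each entry into the next free slot of its row segment.
    std::vector<int> next(csr.row_offset.begin(), csr.row_offset.end() - 1);
    for (std::size_t k = 0; k < nnz; ++k) {
        const int dst = next[coo.row[k]]++;
        csr.col[dst] = coo.col[k];
        csr.val[dst] = coo.val[k];
    }
    return csr;
}
```

COO keeps one row index per non-zero, whereas CSR compresses the row information into `nrow + 1` offsets, which is why CSR is usually the more compact of the two.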
The code is open-source under the MIT license and hosted at https://github.com/ROCmSoftwarePlatform/rocALUTION
Building and Installing¶
Installing from AMD ROCm repositories¶
TODO, not yet available
Building rocALUTION from Open-Source repository¶
Download rocALUTION¶
The rocALUTION source code is available at the rocALUTION github page. Download the master branch using:
git clone -b master https://github.com/ROCmSoftwarePlatform/rocALUTION.git
cd rocALUTION
Note that if you want to contribute to rocALUTION, you will need to check out the develop branch instead of the master branch. See rocalution_contributing for further details. Below are steps to build different packages of the library, including dependencies and clients. It is recommended to install rocALUTION using the install.sh script.
Using install.sh to build dependencies + library¶
The following table lists common uses of install.sh to build dependencies + library. Accelerator support via HIP and OpenMP will be enabled by default, whereas MPI is disabled.
| Command | Description |
|---|---|
| ./install.sh -h | Print help information. |
| ./install.sh -d | Build dependencies and library in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies. |
| ./install.sh | Build library in your local directory. It is assumed dependencies are available. |
| ./install.sh -i | Build library, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users. |
| ./install.sh --host | Build library in your local directory without HIP support. It is assumed dependencies are available. |
| ./install.sh --mpi | Build library in your local directory with HIP and MPI support. It is assumed dependencies are available. |
Using install.sh to build dependencies + library + client¶
The client contains example code, unit tests and benchmarks. Common uses of install.sh to build them are listed in the table below.
| Command | Description |
|---|---|
| ./install.sh -h | Print help information. |
| ./install.sh -dc | Build dependencies, library and client in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies. |
| ./install.sh -c | Build library and client in your local directory. It is assumed dependencies are available. |
| ./install.sh -idc | Build library, dependencies and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users. |
| ./install.sh -ic | Build library and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users. |
Using individual commands to build rocALUTION¶
CMake 3.5 or later is required in order to build rocALUTION.
rocALUTION can be built with cmake using the following commands:
# Create and change to build directory
mkdir -p build/release ; cd build/release
# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
cmake ../.. -DSUPPORT_HIP=ON \
-DSUPPORT_MPI=OFF \
-DSUPPORT_OMP=ON
# Compile rocALUTION library
make -j$(nproc)
# Install rocALUTION to /opt/rocm
sudo make install
GoogleTest is required in order to build rocALUTION client.
rocALUTION with dependencies and client can be built using the following commands:
# Install googletest
mkdir -p build/release/deps ; cd build/release/deps
cmake ../../../deps
sudo make -j$(nproc) install
# Change to build directory
cd ..
# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
cmake ../.. -DBUILD_CLIENTS_TESTS=ON \
-DBUILD_CLIENTS_SAMPLES=ON
# Compile rocALUTION library
make -j$(nproc)
# Install rocALUTION to /opt/rocm
sudo make install
The compilation process produces the shared library librocalution.so and, if HIP support is enabled, librocalution_hip.so. Ensure that the library objects can be found in your library path. If you do not copy the libraries to a specific location, you can add their path to the LD_LIBRARY_PATH variable under Linux.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_rocalution>
Common build problems¶
Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/bin/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc (version mismatch). Please rebuild HIP including cmake or update HCC_HOME variable.
Solution: Download HIP from github and use hcc to build from source and then use the built HIP instead of /opt/rocm/hip.
Issue: For Carrizo - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"
Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: export HCC_AMDGPU_TARGET=gfx900
Issue: Could not find a package configuration file provided by “ROCM” with any of the following names: ROCMConfig.cmake, rocm-config.cmake
Solution: Install ROCm cmake modules
Issue: Could not find a package configuration file provided by “ROCSPARSE” with any of the following names: ROCSPARSE.cmake, rocsparse-config.cmake
Solution: Install rocSPARSE
Issue: Could not find a package configuration file provided by “ROCBLAS” with any of the following names: ROCBLAS.cmake, rocblas-config.cmake
Solution: Install rocBLAS
Simple Test¶
You can test the installation by running a CG solver on a Laplace matrix. After compiling the library you can perform the CG solver test by executing
cd rocALUTION/build/release/examples
wget ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz
gzip -d gr_30_30.mtx.gz
./cg gr_30_30.mtx
For more information regarding the rocALUTION library and the corresponding API documentation, refer to rocALUTION.
API¶
This section provides details of the library API
Host Utility Functions¶
-
template<typename DataType>
void rocalution::allocate_host(int size, DataType **ptr)¶ Allocate buffer on the host.
allocate_host allocates a buffer on the host.
- Parameters
[in] size: number of elements the buffer needs to be allocated for
[out] ptr: pointer to the position in memory where the buffer should be allocated; it is expected that *ptr == NULL
- Template Parameters
DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.
-
template<typename DataType>
void rocalution::free_host(DataType **ptr)¶ Free buffer on the host.
free_host deallocates a buffer on the host. *ptr will be set to NULL after successful deallocation.
- Parameters
[inout] ptr: pointer to the position in memory where the buffer should be deallocated; it is expected that *ptr != NULL
- Template Parameters
DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.
-
template<typename DataType>
void rocalution::set_to_zero_host(int size, DataType *ptr)¶ Set a host buffer to zero.
set_to_zero_host sets a host buffer to zero.
- Parameters
[in] size: number of elements
[inout] ptr: pointer to the host buffer
- Template Parameters
DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.
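The pointer-to-pointer contract of the three host utility functions above (null before allocation, reset to null after deallocation) can be illustrated with plain C++. The `*_sketch` functions below are hypothetical stand-ins built on `new[]`/`delete[]`; they are an assumption about the behaviour described here and not rocALUTION's implementation.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for illustration only.
template <typename DataType>
void allocate_host_sketch(int size, DataType** ptr) {
    assert(*ptr == nullptr);              // caller is expected to pass a null pointer
    if (size > 0) *ptr = new DataType[size];
}

template <typename DataType>
void free_host_sketch(DataType** ptr) {
    delete[] *ptr;
    *ptr = nullptr;                       // reset so the pointer cannot dangle
}

template <typename DataType>
void set_to_zero_host_sketch(int size, DataType* ptr) {
    std::fill_n(ptr, size, DataType());   // a value-initialised element is zero
}
```

Passing the address of the pointer is what allows the deallocation routine to reset the caller's pointer, so a later allocation call can again rely on it being null.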
-
double
rocalution::rocalution_time(void)¶ Return current time in microseconds.
Backend Manager¶
-
int
rocalution::init_rocalution(int rank = -1, int dev_per_node = 1)¶ Initialize rocALUTION platform.
init_rocalution defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. set different number of threads) and new objects are created, only the new objects will use the new configurations.
For control, the library provides the following functions:
set_device_rocalution() is a unified function to select a specific device. If you have compiled the library with a backend and for this backend there are several available devices, you can use this function to select a particular one. This function has to be called before init_rocalution().
set_omp_threads_rocalution() sets the number of OpenMP threads. This function has to be called after init_rocalution().
- Example
#include <rocalution.hpp>
using namespace rocalution;
int main(int argc, char* argv[])
{
    init_rocalution();
    // ...
    stop_rocalution();
    return 0;
}
- Parameters
[in] rank: specifies the MPI rank in a multi-node environment
[in] dev_per_node: number of accelerator devices per node in a multi-GPU environment
-
int
rocalution::stop_rocalution(void)¶ Shutdown rocALUTION platform.
stop_rocalution shuts down the rocALUTION platform.
-
void
rocalution::set_device_rocalution(int dev)¶ Set the accelerator device.
set_device_rocalution lets the user select the accelerator device that is supposed to be used for the computation.
- Parameters
[in] dev: accelerator device ID for computation
-
void
rocalution::set_omp_threads_rocalution(int nthreads)¶ Set number of OpenMP threads.
The number of threads which rocALUTION will use can be set with set_omp_threads_rocalution or by the global OpenMP environment variable (for Unix-like OS this is OMP_NUM_THREADS). During the initialization phase, the library provides affinity thread-core mapping:
If the number of cores (including SMT cores) is greater than or equal to two times the number of threads, then all the threads can occupy every second core ID (e.g. 0, 2, 4, \(\ldots\)). This is to avoid having two threads working on the same physical core when SMT is enabled.
If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0, 1, 2, 3, \(\ldots\)).
If none of the above criteria is matched, then the default thread-core mapping is used (typically set by the OS).
- Note
The thread-core mapping is available only for Unix-like OS.
- Note
The user can disable the thread affinity by calling set_omp_affinity_rocalution(), before initializing the library (i.e. before init_rocalution()).
- Parameters
[in] nthreads: number of OpenMP threads
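The three affinity rules described above can be written down as a small decision function. This is a sketch of the stated rules only; `affinity_map` is a hypothetical name, and how the library actually pins threads is not shown here.

```cpp
#include <cassert>
#include <vector>

// Returns the core ID assigned to each thread, or an empty vector when the
// default (OS-chosen) mapping applies. `ncores` counts SMT cores as well.
inline std::vector<int> affinity_map(int nthreads, int ncores) {
    std::vector<int> core_of_thread;
    if (ncores >= 2 * nthreads) {
        // Enough cores: use every second core ID to avoid SMT siblings.
        for (int t = 0; t < nthreads; ++t) core_of_thread.push_back(2 * t);
    } else if (nthreads <= ncores) {
        // One thread per core ID.
        for (int t = 0; t < nthreads; ++t) core_of_thread.push_back(t);
    }
    // Otherwise: more threads than cores, leave the mapping to the OS.
    return core_of_thread;
}
```

With 4 threads on 8 cores the first rule applies and the threads land on cores 0, 2, 4 and 6; with 6 threads on 8 cores the second rule applies and cores 0 to 5 are used.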
-
void
rocalution::set_omp_affinity_rocalution(bool affinity)¶ Enable/disable OpenMP host affinity.
set_omp_affinity_rocalution enables / disables OpenMP host affinity.
- Parameters
[in] affinity: boolean to turn on/off OpenMP host affinity
-
void
rocalution::set_omp_threshold_rocalution(int threshold)¶ Set OpenMP threshold size.
Whenever you want to work on a small problem, you might observe that the OpenMP host backend is (slightly) slower than using no OpenMP. This is mainly attributed to the small amount of work which every thread should perform and the large overhead of forking/joining threads. This can be avoided by the OpenMP threshold size parameter in rocALUTION. The default threshold is set to 10000, which means that all matrices of this size or smaller will use only one thread (disregarding the number of OpenMP threads set in the system). The threshold can be modified with set_omp_threshold_rocalution.
- Parameters
[in] threshold: OpenMP threshold size
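The effect of the threshold can be summarised as a one-line rule. The function below is a hypothetical illustration of the behaviour described above (sizes up to and including the threshold run on one thread), not code taken from the library.

```cpp
#include <cassert>

// Number of threads used for a problem of the given size, assuming the
// threshold rule described above. The default of 10000 is the documented
// default threshold.
inline int threads_for_size(int size, int omp_threads, int threshold = 10000) {
    return size <= threshold ? 1 : omp_threads;
}
```

Lowering the threshold therefore makes smaller problems eligible for multi-threaded execution, at the cost of paying the fork/join overhead more often.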
-
void
rocalution::info_rocalution(void) Print info about rocALUTION.
info_rocalution prints information about the rocALUTION platform.
-
void
rocalution::info_rocalution(const struct Rocalution_Backend_Descriptor backend_descriptor) Print info about specific rocALUTION backend descriptor.
info_rocalution prints information about the rocALUTION platform of the specific backend descriptor.
- Parameters
[in] backend_descriptor: rocALUTION backend descriptor
-
void
rocalution::disable_accelerator_rocalution(bool onoff = true)¶ Disable/Enable the accelerator.
If you want to disable the accelerator (without re-compiling the code), you need to call disable_accelerator_rocalution before init_rocalution().
- Parameters
[in] onoff: boolean to turn on/off the accelerator
-
void
rocalution::_rocalution_sync(void)¶ Sync rocALUTION.
_rocalution_sync blocks the host until all active asynchronous transfers are completed.
Base Rocalution¶
-
template<typename ValueType>
class BaseRocalution : public rocalution::RocalutionObj¶ Base class for all operators and vectors.
- Template Parameters
ValueType: can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::Operator< ValueType >, rocalution::Vector< ValueType >
-
virtual void
rocalution::BaseRocalution::MoveToAccelerator(void) = 0¶ Move the object to the accelerator backend.
-
virtual void
rocalution::BaseRocalution::MoveToHost(void) = 0¶ Move the object to the host backend.
-
void
rocalution::BaseRocalution::MoveToAcceleratorAsync(void)¶ Move the object to the accelerator backend with async move.
-
void
rocalution::BaseRocalution::MoveToHostAsync(void)¶ Move the object to the host backend with async move.
-
void
rocalution::BaseRocalution::Sync(void)¶ Sync (the async move)
-
void
rocalution::BaseRocalution::CloneBackend(const BaseRocalution<ValueType> &src) Clone the Backend descriptor from another object.
With CloneBackend, the backend can be cloned without copying any data. This is especially useful if several objects should reside on the same backend, but keep their original data.
- Example
LocalVector<ValueType> vec;
LocalMatrix<ValueType> mat;
// Allocate and initialize vec and mat
// ...
LocalVector<ValueType> tmp;
// By cloning backend, tmp and vec will have the same backend as mat
tmp.CloneBackend(mat);
vec.CloneBackend(mat);
// The following matrix vector multiplication will be performed on the backend
// selected in mat
mat.Apply(vec, &tmp);
- Parameters
[in] src: Object, where the backend should be cloned from.
-
virtual void
rocalution::BaseRocalution::Info(void) const = 0¶ Print object information.
Info can print object information about any rocALUTION object. This information consists of object properties and backend data.
- Example
mat.Info();
vec.Info();
-
virtual void
rocalution::BaseRocalution::Clear(void) = 0¶ Clear (free all data) the object.
Operator¶
-
template<typename ValueType>
class Operator : public rocalution::BaseRocalution<ValueType>¶ Operator class.
The Operator class defines the generic interface for applying an operator (e.g. matrix or stencil) from/to global and local vectors.
- Template Parameters
ValueType: can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::GlobalMatrix< ValueType >, rocalution::LocalMatrix< ValueType >, rocalution::LocalStencil< ValueType >
-
virtual IndexType2
rocalution::Operator::GetM(void) const = 0¶ Return the number of rows in the matrix/stencil.
-
virtual IndexType2
rocalution::Operator::GetN(void) const = 0¶ Return the number of columns in the matrix/stencil.
-
virtual IndexType2
rocalution::Operator::GetNnz(void) const = 0¶ Return the number of non-zeros in the matrix/stencil.
-
int
rocalution::Operator::GetLocalM(void) const¶ Return the number of rows in the local matrix/stencil.
-
int
rocalution::Operator::GetLocalN(void) const¶ Return the number of columns in the local matrix/stencil.
-
int
rocalution::Operator::GetLocalNnz(void) const¶ Return the number of non-zeros in the local matrix/stencil.
-
int
rocalution::Operator::GetGhostM(void) const¶ Return the number of rows in the ghost matrix/stencil.
-
int
rocalution::Operator::GetGhostN(void) const¶ Return the number of columns in the ghost matrix/stencil.
-
int
rocalution::Operator::GetGhostNnz(void) const¶ Return the number of non-zeros in the ghost matrix/stencil.
-
void
rocalution::Operator::Apply(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const Apply the operator, out = Operator(in), where in and out are local vectors.
-
void
rocalution::Operator::ApplyAdd(const LocalVector<ValueType> &in, ValueType scalar, LocalVector<ValueType> *out) const Apply and add the operator, out += scalar * Operator(in), where in and out are local vectors.
-
void
rocalution::Operator::Apply(const GlobalVector<ValueType> &in, GlobalVector<ValueType> *out) const Apply the operator, out = Operator(in), where in and out are global vectors.
-
void
rocalution::Operator::ApplyAdd(const GlobalVector<ValueType> &in, ValueType scalar, GlobalVector<ValueType> *out) const Apply and add the operator, out += scalar * Operator(in), where in and out are global vectors.
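The semantics of Apply (out = Operator(in)) and ApplyAdd (out += scalar * Operator(in)) can be illustrated for a matrix operator stored in CSR format. The `CsrMatrix` struct and the free functions below are hypothetical and use standard containers; they only mirror the formulas quoted above and are not the library's implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CSR container for illustration; not a rocALUTION type.
struct CsrMatrix {
    int nrow, ncol;
    std::vector<int> row_offset;  // size nrow + 1
    std::vector<int> col;
    std::vector<double> val;
};

// out = A * in
inline void apply(const CsrMatrix& A, const std::vector<double>& in,
                  std::vector<double>& out) {
    out.assign(A.nrow, 0.0);
    for (int i = 0; i < A.nrow; ++i)
        for (int k = A.row_offset[i]; k < A.row_offset[i + 1]; ++k)
            out[i] += A.val[k] * in[A.col[k]];
}

// out += scalar * A * in (out must already hold nrow entries)
inline void apply_add(const CsrMatrix& A, const std::vector<double>& in,
                      double scalar, std::vector<double>& out) {
    for (int i = 0; i < A.nrow; ++i) {
        double sum = 0.0;
        for (int k = A.row_offset[i]; k < A.row_offset[i + 1]; ++k)
            sum += A.val[k] * in[A.col[k]];
        out[i] += scalar * sum;
    }
}
```

The difference matters in iterative solvers: the first form overwrites the output vector, while the second accumulates into it and so avoids a temporary vector when updating a residual.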
Vector¶
-
template<typename ValueType>
class Vector : public rocalution::BaseRocalution<ValueType>¶ Vector class.
The Vector class defines the generic interface for local and global vectors.
- Template Parameters
ValueType: can be int, float, double, std::complex<float> and std::complex<double>
Subclassed by rocalution::LocalVector< int >, rocalution::GlobalVector< ValueType >, rocalution::LocalVector< ValueType >
-
virtual bool
rocalution::Vector::Check(void) const = 0¶ Perform a sanity check of the vector.
Checks if the vector contains valid data, i.e. if the values are not infinity and not NaN (not a number).
- Return Value
true: if the vector is ok (empty vector is also ok).
false: if there is something wrong with the values.
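The sanity check described for Check() amounts to verifying that every entry is finite. A minimal sketch on a standard container follows; `check_values` is a hypothetical name and real-valued data is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// True if no entry is infinite or NaN. An empty vector passes, matching the
// "empty vector is also ok" rule stated above.
inline bool check_values(const std::vector<double>& v) {
    return std::all_of(v.begin(), v.end(),
                       [](double x) { return std::isfinite(x); });
}
```

A check like this is useful after a solver step that may have diverged, since a single NaN will otherwise propagate through every subsequent vector update.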
-
virtual void
rocalution::Vector::SetValues(ValueType val) = 0¶ Set all values of the vector to given argument.
-
virtual void
rocalution::Vector::SetRandomUniform(unsigned long long seed, ValueType a = static_cast<ValueType>(-1), ValueType b = static_cast<ValueType>(1)) = 0¶ Fill the vector with random values from interval [a,b].
-
virtual void
rocalution::Vector::SetRandomNormal(unsigned long long seed, ValueType mean = static_cast<ValueType>(0), ValueType var = static_cast<ValueType>(1)) = 0¶ Fill the vector with random values from normal distribution.
-
virtual void
rocalution::Vector::ReadFileASCII(const std::string filename) = 0¶ Read vector from ASCII file.
Read a vector from ASCII file.
- Example
LocalVector<ValueType> vec;
vec.ReadFileASCII("my_vector.dat");
- Parameters
[in] filename: name of the file containing the ASCII data.
-
virtual void
rocalution::Vector::WriteFileASCII(const std::string filename) const = 0¶ Write vector to ASCII file.
Write a vector to ASCII file.
- Example
LocalVector<ValueType> vec;
// Allocate and fill vec
// ...
vec.WriteFileASCII("my_vector.dat");
- Parameters
[in] filename: name of the file to write the ASCII data to.
-
virtual void
rocalution::Vector::ReadFileBinary(const std::string filename) = 0¶ Read vector from binary file.
Read a vector from binary file. For details on the format, see WriteFileBinary().
- Example
LocalVector<ValueType> vec;
vec.ReadFileBinary("my_vector.bin");
- Parameters
[in] filename: name of the file containing the data.
-
virtual void
rocalution::Vector::WriteFileBinary(const std::string filename) const = 0¶ Write vector to binary file.
Write a vector to binary file.
The binary format contains a header, the rocALUTION version and the vector data as follows
// Header
out << "#rocALUTION binary vector file" << std::endl;
// rocALUTION version
out.write((char*)&version, sizeof(int));
// Vector data
out.write((char*)&size, sizeof(int));
out.write((char*)vec_val, size * sizeof(double));
- Note
Vector values array is always stored in double precision (e.g. double or std::complex<double>).
- Example
LocalVector<ValueType> vec;
// Allocate and fill vec
// ...
vec.WriteFileBinary("my_vector.bin");
- Parameters
[in] filename: name of the file to write the data to.
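The binary layout quoted above (text header line, version, size, then the values in double precision) can be exercised with standard streams. The following is a sketch under that stated layout only: `write_vector` and `read_vector` are hypothetical helpers, real-valued data is assumed, and the meaning of the version number is not interpreted.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write header, version, size and values in the layout quoted above.
inline void write_vector(std::ostream& out, int version,
                         const std::vector<double>& vec) {
    const int size = static_cast<int>(vec.size());
    out << "#rocALUTION binary vector file" << std::endl;
    out.write(reinterpret_cast<const char*>(&version), sizeof(int));
    out.write(reinterpret_cast<const char*>(&size), sizeof(int));
    out.write(reinterpret_cast<const char*>(vec.data()),
              static_cast<std::streamsize>(vec.size() * sizeof(double)));
}

// Read the same layout back. Returns false if the header does not match
// or the stream ends early.
inline bool read_vector(std::istream& in, int& version,
                        std::vector<double>& vec) {
    std::string header;
    if (!std::getline(in, header) || header != "#rocALUTION binary vector file")
        return false;
    int size = 0;
    in.read(reinterpret_cast<char*>(&version), sizeof(int));
    in.read(reinterpret_cast<char*>(&size), sizeof(int));
    if (!in || size < 0) return false;
    vec.resize(size);
    in.read(reinterpret_cast<char*>(vec.data()),
            static_cast<std::streamsize>(vec.size() * sizeof(double)));
    return static_cast<bool>(in);
}
```

When used with files rather than an in-memory stream, the streams should be opened with `std::ios::binary` so that the raw bytes are not altered by text-mode translation.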
-
void
rocalution::Vector::CopyFrom(const LocalVector<ValueType> &src) Copy vector from another vector.
CopyFrom copies values from another vector.
- Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
- Example
LocalVector<ValueType> vec1, vec2;
// Allocate and initialize vec1 and vec2
// ...
// Move vec1 to accelerator
// vec1.MoveToAccelerator();
// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host
// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);
- Parameters
[in] src: Vector, where values should be copied from.
-
void
rocalution::Vector::CopyFrom(const GlobalVector<ValueType> &src) Copy vector from another vector.
CopyFrom copies values from another vector.
- Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
- Example
LocalVector<ValueType> vec1, vec2;
// Allocate and initialize vec1 and vec2
// ...
// Move vec1 to accelerator
// vec1.MoveToAccelerator();
// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host
// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);
- Parameters
[in] src: Vector, where values should be copied from.
-
void
rocalution::Vector::CopyFromAsync(const LocalVector<ValueType> &src)¶ Async copy from another local vector.
-
void
rocalution::Vector::CopyFromFloat(const LocalVector<float> &src)¶ Copy values from another local float vector.
-
void
rocalution::Vector::CopyFromDouble(const LocalVector<double> &src)¶ Copy values from another local double vector.
-
void
rocalution::Vector::CopyFrom(const LocalVector<ValueType> &src, int src_offset, int dst_offset, int size) Copy vector from another vector with offsets and size.
CopyFrom copies values with specific source and destination offsets and sizes from another vector.
- Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
- Parameters
[in] src: Vector, where values should be copied from.
[in] src_offset: source offset.
[in] dst_offset: destination offset.
[in] size: number of entries to be copied.
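How the three integer arguments of the offset variant relate to each other can be shown with a host-only sketch on standard containers. `copy_with_offsets` is a hypothetical helper; throwing on an out-of-range request is an assumption made for this illustration and says nothing about how the library reacts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copy `size` entries from src[src_offset...] to dst[dst_offset...].
inline void copy_with_offsets(const std::vector<double>& src, int src_offset,
                              std::vector<double>& dst, int dst_offset,
                              int size) {
    if (src_offset < 0 || dst_offset < 0 || size < 0 ||
        static_cast<std::size_t>(src_offset) + size > src.size() ||
        static_cast<std::size_t>(dst_offset) + size > dst.size())
        throw std::out_of_range("copy_with_offsets: range exceeds vector bounds");
    std::copy_n(src.begin() + src_offset, size, dst.begin() + dst_offset);
}
```

Copying 2 entries with a source offset of 1 and a destination offset of 2 moves `src[1]` and `src[2]` into `dst[2]` and `dst[3]` and leaves the rest of `dst` untouched.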
-
void
rocalution::Vector::CloneFrom(const LocalVector<ValueType> &src) Clone the vector.
CloneFrom clones the entire vector, with data and backend descriptor from another Vector.
- Example
LocalVector<ValueType> vec;
// Allocate and initialize vec (host or accelerator)
// ...
LocalVector<ValueType> tmp;
// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);
- Parameters
[in] src: Vector to clone from.
-
void
rocalution::Vector::CloneFrom(const GlobalVector<ValueType> &src) Clone the vector.
CloneFrom clones the entire vector, with data and backend descriptor from another Vector.
- Example
LocalVector<ValueType> vec;
// Allocate and initialize vec (host or accelerator)
// ...
LocalVector<ValueType> tmp;
// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);
- Parameters
[in] src: Vector to clone from.
-
void
rocalution::Vector::AddScale(const LocalVector<ValueType> &x, ValueType alpha) Perform vector update of type this = this + alpha * x.
-
void
rocalution::Vector::AddScale(const GlobalVector<ValueType> &x, ValueType alpha) Perform vector update of type this = this + alpha * x.
-
void
rocalution::Vector::ScaleAdd(ValueType alpha, const LocalVector<ValueType> &x) Perform vector update of type this = alpha * this + x.
-
void
rocalution::Vector::ScaleAdd(ValueType alpha, const GlobalVector<ValueType> &x) Perform vector update of type this = alpha * this + x.
-
void
rocalution::Vector::ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta) Perform vector update of type this = alpha * this + x * beta.
-
void
rocalution::Vector::ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta) Perform vector update of type this = alpha * this + x * beta.
-
void
rocalution::Vector::ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size) Perform vector update of type this = alpha * this + x * beta with offsets.
-
void
rocalution::Vector::ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size) Perform vector update of type this = alpha * this + x * beta with offsets.
-
void
rocalution::Vector::ScaleAdd2(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, const LocalVector<ValueType> &y, ValueType gamma) Perform vector update of type this = alpha * this + x * beta + y * gamma.
-
void
rocalution::Vector::ScaleAdd2(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, const GlobalVector<ValueType> &y, ValueType gamma) Perform vector update of type this = alpha * this + x * beta + y * gamma.
-
virtual void
rocalution::Vector::Scale(ValueType alpha) = 0¶ Perform vector scaling this = alpha * this.
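The update formulas listed above differ only in which operand is scaled, which is easy to mix up. The sketch below spells them out on `std::vector<double>`; the free functions are hypothetical illustrations of the formulas, equal vector lengths are assumed, and the offset variants are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// AddScale: this = this + alpha * x
inline void add_scale(Vec& self, const Vec& x, double alpha) {
    for (std::size_t i = 0; i < self.size(); ++i) self[i] += alpha * x[i];
}

// ScaleAdd: this = alpha * this + x
inline void scale_add(Vec& self, double alpha, const Vec& x) {
    for (std::size_t i = 0; i < self.size(); ++i) self[i] = alpha * self[i] + x[i];
}

// ScaleAddScale: this = alpha * this + x * beta
inline void scale_add_scale(Vec& self, double alpha, const Vec& x, double beta) {
    for (std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i] * beta;
}

// ScaleAdd2: this = alpha * this + x * beta + y * gamma
inline void scale_add2(Vec& self, double alpha, const Vec& x, double beta,
                       const Vec& y, double gamma) {
    for (std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i] * beta + y[i] * gamma;
}
```

In a CG iteration, for example, the solution update x = x + alpha * p corresponds to the AddScale form, while the search direction update p = beta * p + r corresponds to ScaleAdd.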
-
ValueType
rocalution::Vector::Dot(const LocalVector<ValueType> &x) const Compute dot (scalar) product, return this^T x.
-
ValueType
rocalution::Vector::Dot(const GlobalVector<ValueType> &x) const Compute dot (scalar) product, return this^T x.
-
ValueType
rocalution::Vector::DotNonConj(const LocalVector<ValueType> &x) const Compute non-conjugate dot (scalar) product, return this^T x.
-
ValueType
rocalution::Vector::DotNonConj(const GlobalVector<ValueType> &x) const Compute non-conjugate dot (scalar) product, return this^T x.
-
virtual ValueType
rocalution::Vector::Norm(void) const = 0¶ Compute \(L_2\) norm of the vector, return = sqrt(this^T this)
-
virtual ValueType
rocalution::Vector::Asum(void) const = 0¶ Compute the sum of absolute values of the vector, return = sum(|this|)
-
virtual int
rocalution::Vector::Amax(ValueType &value) const = 0¶ Compute the absolute max of the vector, return = index(max(|this|))
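The reductions above can be illustrated on standard containers. The functions below are hypothetical sketches: the conjugated dot product conjugates the first operand, which is a common convention but an assumption here, and returning -1 from `amax` for an empty vector is likewise an assumption made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using CVec = std::vector<std::complex<double>>;

// Conjugated dot product (first operand conjugated; an assumption).
inline std::complex<double> dot(const CVec& a, const CVec& x) {
    std::complex<double> s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += std::conj(a[i]) * x[i];
    return s;
}

// Non-conjugated dot product.
inline std::complex<double> dot_non_conj(const CVec& a, const CVec& x) {
    std::complex<double> s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * x[i];
    return s;
}

// L2 norm: sqrt of the sum of squares.
inline double norm(const std::vector<double>& v) {
    double s = 0.0;
    for (double e : v) s += e * e;
    return std::sqrt(s);
}

// Sum of absolute values.
inline double asum(const std::vector<double>& v) {
    double s = 0.0;
    for (double e : v) s += std::fabs(e);
    return s;
}

// Index of the entry with the largest absolute value; that value is
// written to `value`.
inline int amax(const std::vector<double>& v, double& value) {
    int idx = -1;
    value = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (idx < 0 || std::fabs(v[i]) > value) {
            idx = static_cast<int>(i);
            value = std::fabs(v[i]);
        }
    }
    return idx;
}
```

For real-valued vectors the two dot products coincide; they differ only for complex data, where the conjugated form is the one that yields a real, non-negative result when a vector is multiplied with itself.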
-
void
rocalution::Vector::PointWiseMult(const LocalVector<ValueType> &x) Perform point-wise multiplication (element-wise) of this = this * x.
-
void
rocalution::Vector::PointWiseMult(const GlobalVector<ValueType> &x) Perform point-wise multiplication (element-wise) of this = this * x.
-
void
rocalution::Vector::PointWiseMult(const LocalVector<ValueType> &x, const LocalVector<ValueType> &y) Perform point-wise multiplication (element-wise) of this = x * y.
-
void
rocalution::Vector::PointWiseMult(const GlobalVector<ValueType> &x, const GlobalVector<ValueType> &y) Perform point-wise multiplication (element-wise) of this = x * y.
Local Matrix¶
-
template<typename ValueType>
class LocalMatrix : public rocalution::Operator<ValueType>¶ LocalMatrix class.
A LocalMatrix is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.
- Template Parameters
ValueType: can be int, float, double, std::complex<float> and std::complex<double>
-
unsigned int
rocalution::LocalMatrix::GetFormat(void) const¶ Return the matrix format id (see matrix_formats.hpp)
-
bool
rocalution::LocalMatrix::Check(void) const¶ Perform a sanity check of the matrix.
Checks if the matrix contains valid data, i.e. if the values are not infinity and not NaN (not a number) and if the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO matrices have to be sorted, etc.).
- Return Value
true: if the matrix is ok (empty matrix is also ok).
false: if there is something wrong with the structure or values.
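For a CSR matrix, the structural conditions named above (no negative indices, sorted entries, finite values) could be verified roughly as follows. `check_csr` is a hypothetical function on raw arrays; treating duplicate column indices within a row as invalid is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// True if the CSR arrays are structurally consistent and all values are
// finite. A matrix with no data at all counts as valid when nrow == 0.
inline bool check_csr(int nrow, int ncol, const std::vector<int>& row_offset,
                      const std::vector<int>& col,
                      const std::vector<double>& val) {
    if (nrow < 0 || ncol < 0 || col.size() != val.size()) return false;
    if (row_offset.empty() && col.empty()) return nrow == 0;
    if (row_offset.size() != static_cast<std::size_t>(nrow) + 1) return false;
    const int nnz = static_cast<int>(val.size());
    if (row_offset.front() != 0 || row_offset.back() != nnz) return false;
    for (int i = 0; i < nrow; ++i) {
        // Offsets must be non-decreasing and stay inside the arrays.
        if (row_offset[i] > row_offset[i + 1] || row_offset[i + 1] > nnz)
            return false;
        for (int k = row_offset[i]; k < row_offset[i + 1]; ++k) {
            if (col[k] < 0 || col[k] >= ncol) return false;
            // Columns strictly increasing within a row (sorted, no duplicates).
            if (k > row_offset[i] && col[k - 1] >= col[k]) return false;
            if (!std::isfinite(val[k])) return false;
        }
    }
    return true;
}
```

Running such a check right after importing a matrix from user-provided arrays catches most indexing mistakes before they surface as wrong solver results.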
-
void
rocalution::LocalMatrix::AllocateCSR(const std::string name, int nnz, int nrow, int ncol)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
- Example
LocalMatrix<ValueType> mat;
mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();
mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
-
void
rocalution::LocalMatrix::AllocateBCSR(void)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
-
void
rocalution::LocalMatrix::AllocateMCSR(const std::string name, int nnz, int nrow, int ncol)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
-
void
rocalution::LocalMatrix::AllocateCOO(const std::string name, int nnz, int nrow, int ncol)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
-
void
rocalution::LocalMatrix::AllocateDIA(const std::string name, int nnz, int nrow, int ncol, int ndiag)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
- Example
LocalMatrix<ValueType> mat;
mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();
mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
-
void
rocalution::LocalMatrix::AllocateELL(const std::string name, int nnz, int nrow, int ncol, int max_row)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
- Example
LocalMatrix<ValueType> mat;
mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();
mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
-
void
rocalution::LocalMatrix::AllocateHYB(const std::string name, int ell_nnz, int coo_nnz, int ell_max_row, int nrow, int ncol)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
- Example
LocalMatrix<ValueType> mat;
mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();
mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
-
void
rocalution::LocalMatrix::AllocateDENSE(const std::string name, int nrow, int ncol)¶ Allocate a local matrix with name and sizes.
The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.
- Example
LocalMatrix<ValueType> mat;
mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();
mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();
-
void
rocalution::LocalMatrix::SetDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::SetDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::SetDataPtrMCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::SetDataPtrELL(int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol, int max_row)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::SetDataPtrDIA(int **offset, ValueType **val, std::string name, int nnz, int nrow, int ncol, int num_diag)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::SetDataPtrDENSE(ValueType **val, std::string name, int nrow, int ncol)¶ Initialize a LocalMatrix on the host with externally allocated data.
SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.
- Note
Setting data pointers will leave the original pointers empty (set to NULL).
- Example
// Allocate a CSR matrix
int* csr_row_ptr = new int[100 + 1];
int* csr_col_ind = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers become
// invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);
-
void
rocalution::LocalMatrix::LeaveDataPtrCOO(int **row, int **col, ValueType **val)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
-
void
rocalution::LocalMatrix::LeaveDataPtrCSR(int **row_offset, int **col, ValueType **val)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
-
void
rocalution::LocalMatrix::LeaveDataPtrMCSR(int **row_offset, int **col, ValueType **val)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
-
void
rocalution::LocalMatrix::LeaveDataPtrELL(int **col, ValueType **val, int &max_row)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
-
void
rocalution::LocalMatrix::LeaveDataPtrDIA(int **offset, ValueType **val, int &num_diag)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
-
void
rocalution::LocalMatrix::LeaveDataPtrDENSE(ValueType **val)¶ Leave a LocalMatrix to host pointers.
LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.
- Example
// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr = NULL;
int* csr_col_ind = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
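The pointer hand-over described for the SetDataPtr and LeaveDataPtr functions can be pictured with a small standalone sketch. `ArrayOwner` is a hypothetical stand-in that owns a single array; it is not the rocALUTION LocalMatrix and only mimics the documented behaviour (the caller's pointer becomes NULL on set, the object becomes empty on leave).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an object that owns one raw array.
// It is NOT the rocALUTION LocalMatrix; it only mimics the pointer hand-over.
struct ArrayOwner
{
    double*     data = nullptr;
    std::size_t size = 0;

    ArrayOwner()                             = default;
    ArrayOwner(const ArrayOwner&)            = delete;
    ArrayOwner& operator=(const ArrayOwner&) = delete;
    ~ArrayOwner() { delete[] data; }

    // Take ownership of *ptr and null the caller's pointer (SetDataPtr-like).
    void SetDataPtr(double** ptr, std::size_t n)
    {
        assert(ptr != nullptr && *ptr != nullptr);
        delete[] data; // release anything held before
        data = *ptr;
        size = n;
        *ptr = nullptr;
    }

    // Hand the array back to the caller and become empty (LeaveDataPtr-like).
    void LeaveDataPtr(double** ptr)
    {
        assert(ptr != nullptr);
        *ptr = data;
        data = nullptr;
        size = 0;
    }
};
```

Passing a pointer to the pointer is what allows the callee to reset the caller's variable, so that exactly one side owns the array at any time.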
-
void
rocalution::LocalMatrix::Zeros(void)¶ Set all matrix values to zero.
-
void
rocalution::LocalMatrix::Scale(ValueType alpha)¶ Scale all values in the matrix.
-
void
rocalution::LocalMatrix::ScaleDiagonal(ValueType alpha)¶ Scale the diagonal entries of the matrix with alpha, all diagonal elements must exist.
-
void
rocalution::LocalMatrix::ScaleOffDiagonal(ValueType alpha)¶ Scale the off-diagonal entries of the matrix with alpha, all diagonal elements must exist.
-
void
rocalution::LocalMatrix::AddScalar(ValueType alpha)¶ Add a scalar to all matrix values.
-
void
rocalution::LocalMatrix::AddScalarDiagonal(ValueType alpha)¶ Add alpha to the diagonal entries of the matrix, all diagonal elements must exist.
-
void
rocalution::LocalMatrix::AddScalarOffDiagonal(ValueType alpha)¶ Add alpha to the off-diagonal entries of the matrix, all diagonal elements must exist.
-
void
rocalution::LocalMatrix::ExtractSubMatrix(int row_offset, int col_offset, int row_size, int col_size, LocalMatrix<ValueType> *mat) const¶ Extract a sub-matrix with row/col_offset and row/col_size.
-
void
rocalution::LocalMatrix::ExtractSubMatrices(int row_num_blocks, int col_num_blocks, const int *row_offset, const int *col_offset, LocalMatrix<ValueType> ***mat) const¶ Extract an array of non-overlapping sub-matrices. row_num_blocks and col_num_blocks define the number of blocks for rows and columns; row_offset and col_offset have sizes row_num_blocks+1 and col_num_blocks+1, where offset[i+1]-offset[i] defines the size of the i-th sub-matrix.
-
void
rocalution::LocalMatrix::ExtractDiagonal(LocalVector<ValueType> *vec_diag) const¶ Extract the diagonal values of the matrix into a LocalVector.
-
void
rocalution::LocalMatrix::ExtractInverseDiagonal(LocalVector<ValueType> *vec_inv_diag) const¶ Extract the inverse (reciprocal) diagonal values of the matrix into a LocalVector.
-
void
rocalution::LocalMatrix::ExtractU(LocalMatrix<ValueType> *U, bool diag) const¶ Extract the upper triangular matrix.
-
void
rocalution::LocalMatrix::ExtractL(LocalMatrix<ValueType> *L, bool diag) const¶ Extract the lower triangular matrix.
-
void
rocalution::LocalMatrix::Permute(const LocalVector<int> &permutation)¶ Perform (forward) permutation of the matrix.
-
void
rocalution::LocalMatrix::PermuteBackward(const LocalVector<int> &permutation)¶ Perform (backward) permutation of the matrix.
-
void
rocalution::LocalMatrix::CMK(LocalVector<int> *permutation) const¶ Create permutation vector for CMK reordering of the matrix.
The Cuthill-McKee ordering reduces the bandwidth of a given sparse matrix.
- Example
LocalVector<int> cmk;
mat.CMK(&cmk);
mat.Permute(cmk);
- Parameters
[out] permutation: permutation vector for CMK reordering
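To illustrate what a bandwidth-reducing ordering does, here is a minimal standalone sketch of the textbook Cuthill-McKee traversal on a CSR sparsity pattern. It is not rocALUTION's implementation: the function names, the use of `std::vector`, the assumption of a structurally symmetric pattern, and the convention that `new_index[i]` is the new position of old node `i` are all choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <queue>
#include <vector>

// Bandwidth of a CSR pattern after relabelling node i as new_index[i]:
// the largest |new_index[i] - new_index[j]| over all stored entries (i, j).
int bandwidth(const std::vector<int>& row_ptr,
              const std::vector<int>& col_ind,
              const std::vector<int>& new_index)
{
    const int n  = static_cast<int>(row_ptr.size()) - 1;
    int       bw = 0;
    for(int i = 0; i < n; ++i)
    {
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            bw = std::max(bw, std::abs(new_index[i] - new_index[col_ind[k]]));
        }
    }
    return bw;
}

// Breadth-first traversal that starts from a low-degree node and visits the
// neighbours of each node in order of increasing degree.
std::vector<int> cuthill_mckee(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_ind)
{
    assert(!row_ptr.empty());
    const int n         = static_cast<int>(row_ptr.size()) - 1;
    auto      degree    = [&](int i) { return row_ptr[i + 1] - row_ptr[i]; };
    auto      by_degree = [&](int a, int b) { return degree(a) < degree(b); };

    std::vector<int> starts(n);
    for(int i = 0; i < n; ++i)
    {
        starts[i] = i;
    }
    std::stable_sort(starts.begin(), starts.end(), by_degree);

    // -1: not reached yet, -2: queued, >= 0: final position
    std::vector<int> new_index(n, -1);
    int              next = 0;
    for(int start : starts)
    {
        if(new_index[start] != -1)
        {
            continue; // already reached from an earlier start node
        }
        std::queue<int> todo;
        todo.push(start);
        new_index[start] = -2;
        while(!todo.empty())
        {
            const int u = todo.front();
            todo.pop();
            new_index[u] = next++;

            std::vector<int> nbrs;
            for(int k = row_ptr[u]; k < row_ptr[u + 1]; ++k)
            {
                const int v = col_ind[k];
                if(new_index[v] == -1)
                {
                    new_index[v] = -2;
                    nbrs.push_back(v);
                }
            }
            std::stable_sort(nbrs.begin(), nbrs.end(), by_degree);
            for(int v : nbrs)
            {
                todo.push(v);
            }
        }
    }
    return new_index;
}
```

Calling `bandwidth` once with the identity relabelling and once with the result of `cuthill_mckee` shows the effect of the reordering on a given pattern. The reverse variant simply reverses the resulting order.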
-
void
rocalution::LocalMatrix::RCMK(LocalVector<int> *permutation) const¶ Create permutation vector for reverse CMK reordering of the matrix.
The Reverse Cuthill-McKee ordering reduces the bandwidth of a given sparse matrix.
- Example
LocalVector<int> rcmk;
mat.RCMK(&rcmk);
mat.Permute(rcmk);
- Parameters
[out] permutation: permutation vector for reverse CMK reordering
-
void
rocalution::LocalMatrix::ConnectivityOrder(LocalVector<int> *permutation) const¶ Create permutation vector for connectivity reordering of the matrix.
Connectivity ordering returns a permutation that sorts the matrix by non-zero entries per row.
- Example
LocalVector<int> conn;
mat.ConnectivityOrder(&conn);
mat.Permute(conn);
- Parameters
[out] permutation: permutation vector for connectivity reordering
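Sorting rows by their number of non-zero entries needs only the CSR row offsets. The following standalone sketch shows the idea; the function name `rows_by_nnz`, the ascending order, and the convention that `perm[k]` is the old index of the row placed at position `k` are assumptions of this example, not statements about rocALUTION's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Return the row indices ordered by increasing number of stored entries.
// Rows with equal counts keep their original relative order.
std::vector<int> rows_by_nnz(const std::vector<int>& row_ptr)
{
    assert(!row_ptr.empty());
    const int        n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
    });
    return perm;
}
```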
-
void
rocalution::LocalMatrix::MultiColoring(int &num_colors, int **size_colors, LocalVector<int> *permutation) const¶ Perform multi-coloring decomposition of the matrix.
The Multi-Coloring algorithm builds a permutation (coloring of the matrix) in a way such that no two adjacent nodes in the sparse matrix have the same color.
- Example
LocalVector<int> mc;
int num_colors;
int* block_colors = NULL;
mat.MultiColoring(num_colors, &block_colors, &mc);
mat.Permute(mc);
- Parameters
[out] num_colors: number of colors
[out] size_colors: pointer to array that holds the number of nodes for each color
[out] permutation: permutation vector for multi-coloring reordering
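The coloring idea (no two adjacent nodes share a color) can be sketched with a simple greedy pass over a CSR pattern. This is a standalone illustration under the assumption of a structurally symmetric pattern; the names `greedy_coloring` and `order_by_color` are hypothetical, and the algorithm rocALUTION actually uses is not stated in this section.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Give each node the smallest color not used by an already colored neighbour.
std::vector<int> greedy_coloring(const std::vector<int>& row_ptr,
                                 const std::vector<int>& col_ind,
                                 int&                    num_colors)
{
    assert(!row_ptr.empty());
    const int        n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> color(n, -1);
    num_colors = 0;
    for(int i = 0; i < n; ++i)
    {
        // used[c] is true if a colored neighbour of i already has color c;
        // the extra slot guarantees that a free color is always found
        std::vector<bool> used(num_colors + 1, false);
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            const int j = col_ind[k];
            if(j != i && color[j] >= 0)
            {
                used[color[j]] = true;
            }
        }
        int c = 0;
        while(used[c])
        {
            ++c;
        }
        color[i]   = c;
        num_colors = std::max(num_colors, c + 1);
    }
    return color;
}

// List the nodes color by color; size_colors[c] receives the number of nodes
// that have color c.
std::vector<int> order_by_color(const std::vector<int>& color,
                                int                     num_colors,
                                std::vector<int>&       size_colors)
{
    size_colors.assign(num_colors, 0);
    std::vector<int> order;
    for(int c = 0; c < num_colors; ++c)
    {
        for(int i = 0; i < static_cast<int>(color.size()); ++i)
        {
            if(color[i] == c)
            {
                order.push_back(i);
                ++size_colors[c];
            }
        }
    }
    return order;
}
```

After such a reordering, the rows of one color have no couplings among themselves, which is what makes the blocks attractive for parallel processing.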
-
void
rocalution::LocalMatrix::MaximalIndependentSet(int &size, LocalVector<int> *permutation) const¶ Perform maximal independent set decomposition of the matrix.
The Maximal Independent Set algorithm finds a set with maximal size, that contains elements that do not depend on other elements in this set.
- Example
LocalVector<int> mis;
int size;
mat.MaximalIndependentSet(size, &mis);
mat.Permute(mis);
- Parameters
[out] size: number of independent sets
[out] permutation: permutation vector for maximal independent set reordering
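An independent set in the sense used above can be built greedily: pick a node, then exclude all of its neighbours. The sketch below is a standalone illustration on a structurally symmetric CSR pattern. It yields a set that cannot be extended further, not necessarily the largest possible one, and the name `greedy_independent_set` is hypothetical rather than part of rocALUTION.

```cpp
#include <cassert>
#include <vector>

// Select nodes in index order; every neighbour of a selected node is excluded.
std::vector<int> greedy_independent_set(const std::vector<int>& row_ptr,
                                        const std::vector<int>& col_ind)
{
    assert(!row_ptr.empty());
    const int n = static_cast<int>(row_ptr.size()) - 1;
    enum State { Undecided, Selected, Excluded };
    std::vector<State> state(n, Undecided);
    std::vector<int>   set;
    for(int i = 0; i < n; ++i)
    {
        if(state[i] != Undecided)
        {
            continue;
        }
        state[i] = Selected;
        set.push_back(i);
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            const int j = col_ind[k];
            if(j != i && state[j] == Undecided)
            {
                state[j] = Excluded;
            }
        }
    }
    return set;
}
```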
-
void
rocalution::LocalMatrix::ZeroBlockPermutation(int &size, LocalVector<int> *permutation) const¶ Return a permutation for saddle-point problems (zero diagonal entries)
For Saddle-Point problems, (i.e. matrices with zero diagonal entries), the Zero Block Permutation maps all zero-diagonal elements to the last block of the matrix.
- Example
LocalVector<int> zbp;
int size;
mat.ZeroBlockPermutation(size, &zbp);
mat.Permute(zbp);
- Parameters
[out] size:
[out] permutation: permutation vector for zero block permutation
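The mapping of zero-diagonal rows to the last block can be sketched on plain CSR arrays as follows. This is a standalone illustration with hypothetical names; in particular, `num_nonzero_diag` is an output chosen for this example and is not claimed to match the meaning of the `size` parameter above, which this section does not describe.

```cpp
#include <cassert>
#include <vector>

// Return the row indices with all rows that have a non-zero diagonal entry
// first, followed by the rows whose diagonal entry is zero or not stored.
std::vector<int> zero_diagonal_last(const std::vector<int>&    row_ptr,
                                    const std::vector<int>&    col_ind,
                                    const std::vector<double>& val,
                                    int&                       num_nonzero_diag)
{
    assert(!row_ptr.empty());
    assert(col_ind.size() == val.size());
    const int        n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> with_diag;
    std::vector<int> without_diag;
    for(int i = 0; i < n; ++i)
    {
        bool has_diag = false;
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            if(col_ind[k] == i && val[k] != 0.0)
            {
                has_diag = true;
            }
        }
        (has_diag ? with_diag : without_diag).push_back(i);
    }
    num_nonzero_diag = static_cast<int>(with_diag.size());
    with_diag.insert(with_diag.end(), without_diag.begin(), without_diag.end());
    return with_diag;
}
```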
-
void
rocalution::LocalMatrix::ILU0Factorize(void)¶ Perform ILU(0) factorization.
-
void
rocalution::LocalMatrix::LUFactorize(void)¶ Perform LU factorization.
-
void
rocalution::LocalMatrix::ILUTFactorize(double t, int maxrow)¶ Perform ILU(t,m) factorization based on threshold and maximum number of elements per row.
-
void
rocalution::LocalMatrix::ILUpFactorize(int p, bool level = true)¶ Perform ILU(p) factorization based on power.
-
void
rocalution::LocalMatrix::LUAnalyse(void)¶ Analyse the structure (level-scheduling)
-
void
rocalution::LocalMatrix::LUAnalyseClear(void)¶ Delete the analysed data (see LUAnalyse)
-
void
rocalution::LocalMatrix::LUSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶ Solve LU out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
-
void
rocalution::LocalMatrix::ICFactorize(LocalVector<ValueType> *inv_diag)¶ Perform IC(0) factorization.
-
void
rocalution::LocalMatrix::LLAnalyse(void)¶ Analyse the structure (level-scheduling)
-
void
rocalution::LocalMatrix::LLAnalyseClear(void)¶ Delete the analysed data (see LLAnalyse)
-
void
rocalution::LocalMatrix::LLSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const Solve LL^T out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
-
void
rocalution::LocalMatrix::LLSolve(const LocalVector<ValueType> &in, const LocalVector<ValueType> &inv_diag, LocalVector<ValueType> *out) const Solve LL^T out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
-
void
rocalution::LocalMatrix::LAnalyse(bool diag_unit = false)¶ Analyse the structure (level-scheduling) L-part.
diag_unit == true the diag is 1;
diag_unit == false the diag is 0;
-
void
rocalution::LocalMatrix::LAnalyseClear(void)¶ Delete the analysed data (see LAnalyse) L-part.
-
void
rocalution::LocalMatrix::LSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶ Solve L out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
-
void
rocalution::LocalMatrix::UAnalyse(bool diag_unit = false)¶ Analyse the structure (level-scheduling) U-part.
diag_unit == true the diag is 1;
diag_unit == false the diag is 0;
-
void
rocalution::LocalMatrix::UAnalyseClear(void)¶ Delete the analysed data (see UAnalyse) U-part.
-
void
rocalution::LocalMatrix::USolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶ Solve U out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
-
void
rocalution::LocalMatrix::Householder(int idx, ValueType &beta, LocalVector<ValueType> *vec) const¶ Compute Householder vector.
-
void
rocalution::LocalMatrix::QRDecompose(void)¶ QR Decomposition.
-
void
rocalution::LocalMatrix::QRSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶ Solve QR out = in.
-
void
rocalution::LocalMatrix::Invert(void)¶ Matrix inversion using QR decomposition.
-
void
rocalution::LocalMatrix::ReadFileMTX(const std::string filename)¶ Read matrix from MTX (Matrix Market Format) file.
Read a matrix from Matrix Market Format file.
- Example
LocalMatrix<ValueType> mat;
mat.ReadFileMTX("my_matrix.mtx");
- Parameters
[in] filename: name of the file containing the MTX data.
-
void
rocalution::LocalMatrix::WriteFileMTX(const std::string filename) const¶ Write matrix to MTX (Matrix Market Format) file.
Write a matrix to Matrix Market Format file.
- Example
LocalMatrix<ValueType> mat;

// Allocate and fill mat
// ...

mat.WriteFileMTX("my_matrix.mtx");
- Parameters
[in] filename: name of the file to write the MTX data to.
-
void
rocalution::LocalMatrix::ReadFileCSR(const std::string filename)¶ Read matrix from CSR (rocALUTION binary format) file.
Read a CSR matrix from binary file. For details on the format, see WriteFileCSR().
- Example
LocalMatrix<ValueType> mat;
mat.ReadFileCSR("my_matrix.csr");
- Parameters
[in] filename: name of the file containing the data.
-
void
rocalution::LocalMatrix::WriteFileCSR(const std::string filename) const¶ Write CSR matrix to binary file.
Write a CSR matrix to binary file.
The binary format contains a header, the rocALUTION version and the matrix data as follows
// Header
out << "#rocALUTION binary csr file" << std::endl;

// rocALUTION version
out.write((char*)&version, sizeof(int));

// CSR matrix data
out.write((char*)&m, sizeof(int));
out.write((char*)&n, sizeof(int));
out.write((char*)&nnz, sizeof(int));
out.write((char*)csr_row_ptr, (m + 1) * sizeof(int));
out.write((char*)csr_col_ind, nnz * sizeof(int));
out.write((char*)csr_val, nnz * sizeof(double));
- Note
The values array is always stored in double precision (e.g. double or std::complex<double>).
- Example
LocalMatrix<ValueType> mat;

// Allocate and fill mat
// ...

mat.WriteFileCSR("my_matrix.csr");
- Parameters
[in] filename: name of the file to write the data to.
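The layout listed above (text header line, version, then m, n, nnz and the three CSR arrays) can be exercised with a small standalone writer and reader that use only the C++ standard library. This is a sketch for real-valued data: `CsrData`, `write_csr` and `read_csr` are hypothetical names, the version number is whatever the caller passes, and the code assumes the file is read on a machine with the same `int` size and byte order as the one that wrote it.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Plain container for a real-valued CSR matrix (hypothetical helper type).
struct CsrData
{
    int                 m   = 0; // number of rows
    int                 n   = 0; // number of columns
    int                 nnz = 0; // number of stored entries
    std::vector<int>    row_ptr;
    std::vector<int>    col_ind;
    std::vector<double> val;
};

bool write_csr(const std::string& filename, const CsrData& a, int version)
{
    assert(static_cast<int>(a.row_ptr.size()) == a.m + 1);
    assert(static_cast<int>(a.col_ind.size()) == a.nnz);
    assert(static_cast<int>(a.val.size()) == a.nnz);

    std::ofstream out(filename, std::ios::out | std::ios::binary);
    if(!out)
    {
        return false;
    }
    out << "#rocALUTION binary csr file" << std::endl;
    out.write(reinterpret_cast<const char*>(&version), sizeof(int));
    out.write(reinterpret_cast<const char*>(&a.m), sizeof(int));
    out.write(reinterpret_cast<const char*>(&a.n), sizeof(int));
    out.write(reinterpret_cast<const char*>(&a.nnz), sizeof(int));
    out.write(reinterpret_cast<const char*>(a.row_ptr.data()), (a.m + 1) * sizeof(int));
    out.write(reinterpret_cast<const char*>(a.col_ind.data()), a.nnz * sizeof(int));
    out.write(reinterpret_cast<const char*>(a.val.data()), a.nnz * sizeof(double));
    return static_cast<bool>(out);
}

bool read_csr(const std::string& filename, CsrData& a, int& version)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if(!in)
    {
        return false;
    }
    std::string header;
    std::getline(in, header); // consumes the header line including '\n'
    if(header != "#rocALUTION binary csr file")
    {
        return false;
    }
    in.read(reinterpret_cast<char*>(&version), sizeof(int));
    in.read(reinterpret_cast<char*>(&a.m), sizeof(int));
    in.read(reinterpret_cast<char*>(&a.n), sizeof(int));
    in.read(reinterpret_cast<char*>(&a.nnz), sizeof(int));
    if(!in || a.m < 0 || a.n < 0 || a.nnz < 0)
    {
        return false;
    }
    a.row_ptr.resize(a.m + 1);
    a.col_ind.resize(a.nnz);
    a.val.resize(a.nnz);
    in.read(reinterpret_cast<char*>(a.row_ptr.data()), (a.m + 1) * sizeof(int));
    in.read(reinterpret_cast<char*>(a.col_ind.data()), a.nnz * sizeof(int));
    in.read(reinterpret_cast<char*>(a.val.data()), a.nnz * sizeof(double));
    return static_cast<bool>(in);
}
```

Opening both streams in binary mode keeps the header's line ending a single byte on every platform, so the binary fields that follow start at the same offset for writer and reader.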
-
void
rocalution::LocalMatrix::CopyFrom(const LocalMatrix<ValueType> &src)¶ Copy matrix from another LocalMatrix.
CopyFrom copies values and structure from another local matrix. Source and destination matrix should be in the same format.
- Note
This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.
- Example
LocalMatrix<ValueType> mat1, mat2;

// Allocate and initialize mat1 and mat2
// ...

// Move mat1 to accelerator
// mat1.MoveToAccelerator();

// Now, mat1 is on the accelerator (if available)
// and mat2 is on the host

// Copy mat1 to mat2 (or vice versa) will move data between host and
// accelerator backend
mat1.CopyFrom(mat2);
- Parameters
[in] src: Local matrix where values and structure should be copied from.
-
void
rocalution::LocalMatrix::CopyFromAsync(const LocalMatrix<ValueType> &src)¶ Async copy matrix (values and structure) from another LocalMatrix.
-
void
rocalution::LocalMatrix::CloneFrom(const LocalMatrix<ValueType> &src)¶ Clone the matrix.
CloneFrom clones the entire matrix, including values, structure and backend descriptor from another LocalMatrix.
- Example
LocalMatrix<ValueType> mat;

// Allocate and initialize mat (host or accelerator)
// ...

LocalMatrix<ValueType> tmp;

// By cloning mat, tmp will have identical values and structure and will be on
// the same backend as mat
tmp.CloneFrom(mat);
- Parameters
[in] src: LocalMatrix to clone from.
-
void
rocalution::LocalMatrix::UpdateValuesCSR(ValueType *val)¶ Update CSR matrix entries only, structure will remain the same.
-
void
rocalution::LocalMatrix::CopyFromCSR(const int *row_offsets, const int *col, const ValueType *val)¶ Copy (import) CSR matrix described in three arrays (offsets, columns, values). The object data has to be allocated (call AllocateCSR first)
-
void
rocalution::LocalMatrix::CopyToCSR(int *row_offsets, int *col, ValueType *val) const¶ Copy (export) CSR matrix described in three arrays (offsets, columns, values). The output arrays have to be allocated.
-
void
rocalution::LocalMatrix::CopyFromCOO(const int *row, const int *col, const ValueType *val)¶ Copy (import) COO matrix described in three arrays (rows, columns, values). The object data has to be allocated (call AllocateCOO first)
-
void
rocalution::LocalMatrix::CopyToCOO(int *row, int *col, ValueType *val) const¶ Copy (export) COO matrix described in three arrays (rows, columns, values). The output arrays have to be allocated.
-
void
rocalution::LocalMatrix::CopyFromHostCSR(const int *row_offset, const int *col, const ValueType *val, const std::string name, int nnz, int nrow, int ncol)¶ Allocates and copies (imports) a host CSR matrix.
If the CSR matrix data pointers are only accessible as constant, the user can create a LocalMatrix object and pass const CSR host pointers. The LocalMatrix will then be allocated and the data will be copied to the backend on which the LocalMatrix object is located.
- Parameters
[in] row_offset: CSR matrix row offset pointers.
[in] col: CSR matrix column indices.
[in] val: CSR matrix values array.
[in] name: Matrix object name.
[in] nnz: Number of non-zero elements.
[in] nrow: Number of rows.
[in] ncol: Number of columns.
-
void
rocalution::LocalMatrix::CreateFromMap(const LocalVector<int> &map, int n, int m) Create a restriction matrix operator based on an int vector map.
-
void
rocalution::LocalMatrix::CreateFromMap(const LocalVector<int> &map, int n, int m, LocalMatrix<ValueType> *pro) Create a restriction and prolongation matrix operator based on an int vector map.
-
void
rocalution::LocalMatrix::ConvertToCSR(void)¶ Convert the matrix to CSR structure.
-
void
rocalution::LocalMatrix::ConvertToMCSR(void)¶ Convert the matrix to MCSR structure.
-
void
rocalution::LocalMatrix::ConvertToBCSR(void)¶ Convert the matrix to BCSR structure.
-
void
rocalution::LocalMatrix::ConvertToCOO(void)¶ Convert the matrix to COO structure.
-
void
rocalution::LocalMatrix::ConvertToELL(void)¶ Convert the matrix to ELL structure.
-
void
rocalution::LocalMatrix::ConvertToDIA(void)¶ Convert the matrix to DIA structure.
-
void
rocalution::LocalMatrix::ConvertToHYB(void)¶ Convert the matrix to HYB structure.
-
void
rocalution::LocalMatrix::ConvertToDENSE(void)¶ Convert the matrix to DENSE structure.
-
void
rocalution::LocalMatrix::ConvertTo(unsigned int matrix_format)¶ Convert the matrix to specified matrix ID format.
-
void
rocalution::LocalMatrix::SymbolicPower(int p)¶ Perform symbolic computation (structure only) of \(|this|^p\).
-
void
rocalution::LocalMatrix::MatrixAdd(const LocalMatrix<ValueType> &mat, ValueType alpha = static_cast<ValueType>(1), ValueType beta = static_cast<ValueType>(1), bool structure = false)¶ Perform matrix addition, this = alpha*this + beta*mat.
if structure==false the sparsity pattern of the matrix is not changed;
if structure==true a new sparsity pattern is computed
-
void
rocalution::LocalMatrix::MatrixMult(const LocalMatrix<ValueType> &A, const LocalMatrix<ValueType> &B)¶ Multiply two matrices, this = A * B.
-
void
rocalution::LocalMatrix::DiagonalMatrixMult(const LocalVector<ValueType> &diag)¶ Multiply the matrix with a diagonal matrix (stored in LocalVector); same as DiagonalMatrixMultR().
-
void
rocalution::LocalMatrix::DiagonalMatrixMultL(const LocalVector<ValueType> &diag)¶ Multiply the matrix with diagonal matrix (stored in LocalVector), this=diag*this.
-
void
rocalution::LocalMatrix::DiagonalMatrixMultR(const LocalVector<ValueType> &diag)¶ Multiply the matrix with diagonal matrix (stored in LocalVector), this=this*diag.
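The difference between the left and right variants is easy to see on CSR arrays: this=diag*this scales each row i by diag[i], while this=this*diag scales each column j by diag[j]. The following standalone sketch uses hypothetical helper names and plain `std::vector` storage; it is an illustration only, not the library's implementation.

```cpp
#include <cassert>
#include <vector>

// this = diag * this: every entry of row i is multiplied by diag[i].
void diag_mult_left(const std::vector<int>&    row_ptr,
                    std::vector<double>&       val,
                    const std::vector<double>& diag)
{
    assert(!row_ptr.empty());
    const int n = static_cast<int>(row_ptr.size()) - 1;
    assert(static_cast<int>(diag.size()) == n);
    for(int i = 0; i < n; ++i)
    {
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            val[k] *= diag[i];
        }
    }
}

// this = this * diag: every entry in column j is multiplied by diag[j].
void diag_mult_right(const std::vector<int>&    col_ind,
                     std::vector<double>&       val,
                     const std::vector<double>& diag)
{
    assert(col_ind.size() == val.size());
    for(std::size_t k = 0; k < val.size(); ++k)
    {
        val[k] *= diag[col_ind[k]];
    }
}
```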
-
void
rocalution::LocalMatrix::Gershgorin(ValueType &lambda_min, ValueType &lambda_max) const¶ Compute the spectrum approximation with Gershgorin circles theorem.
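By the Gershgorin circle theorem, every eigenvalue lies in a disc centred at a diagonal entry a_ii with radius equal to the sum of |a_ij| over the off-diagonal entries of row i. For a real matrix this gives the interval bounds computed in the standalone sketch below. The function name and the use of `double` are assumptions of this example, which does not reproduce the library's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// lambda_min = min_i (a_ii - r_i), lambda_max = max_i (a_ii + r_i),
// where r_i is the sum of |a_ij| over all j != i in row i.
void gershgorin_bounds(const std::vector<int>&    row_ptr,
                       const std::vector<int>&    col_ind,
                       const std::vector<double>& val,
                       double&                    lambda_min,
                       double&                    lambda_max)
{
    assert(row_ptr.size() >= 2); // at least one row
    assert(col_ind.size() == val.size());
    const int n     = static_cast<int>(row_ptr.size()) - 1;
    bool      first = true;
    for(int i = 0; i < n; ++i)
    {
        double center = 0.0;
        double radius = 0.0;
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            if(col_ind[k] == i)
            {
                center += val[k];
            }
            else
            {
                radius += std::fabs(val[k]);
            }
        }
        const double lo = center - radius;
        const double hi = center + radius;
        lambda_min      = first ? lo : std::min(lambda_min, lo);
        lambda_max      = first ? hi : std::max(lambda_max, hi);
        first           = false;
    }
}
```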
-
void
rocalution::LocalMatrix::Compress(double drop_off)¶ Delete all entries a_ij in the matrix for which abs(a_ij) <= drop_off; the diagonal elements are never deleted.
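The drop rule can be sketched as an in-place compaction of CSR arrays: an entry is kept if it lies on the diagonal or if its magnitude exceeds drop_off. The helper name `compress_csr` and the `std::vector` storage are assumptions made for this standalone example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Remove all off-diagonal entries with |a_ij| <= drop_off from a CSR matrix.
void compress_csr(std::vector<int>&    row_ptr,
                  std::vector<int>&    col_ind,
                  std::vector<double>& val,
                  double               drop_off)
{
    assert(!row_ptr.empty());
    assert(col_ind.size() == val.size());
    const int        n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> new_row_ptr(n + 1, 0);
    int              out = 0; // next free slot; never ahead of the read index
    for(int i = 0; i < n; ++i)
    {
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            const bool keep = (col_ind[k] == i) || (std::fabs(val[k]) > drop_off);
            if(keep)
            {
                col_ind[out] = col_ind[k];
                val[out]     = val[k];
                ++out;
            }
        }
        new_row_ptr[i + 1] = out;
    }
    col_ind.resize(out);
    val.resize(out);
    row_ptr = new_row_ptr;
}
```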
-
void
rocalution::LocalMatrix::Transpose(void)¶ Transpose the matrix.
-
void
rocalution::LocalMatrix::Sort(void)¶ Sort the matrix indices.
Sorts the matrix by indices.
For CSR matrices, column indices are sorted.
For COO matrices, row indices are sorted.
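For the CSR case, sorting by index can be pictured as ordering the column indices inside each row while moving the values along with them. The sketch below is a standalone illustration with a hypothetical helper name; that the sort is applied per row is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sort the entries of every CSR row by ascending column index.
void sort_csr_rows(const std::vector<int>& row_ptr,
                   std::vector<int>&       col_ind,
                   std::vector<double>&    val)
{
    assert(!row_ptr.empty());
    assert(col_ind.size() == val.size());
    const int n = static_cast<int>(row_ptr.size()) - 1;
    for(int i = 0; i < n; ++i)
    {
        // Gather (column, value) pairs so that values follow their columns
        std::vector<std::pair<int, double>> entries;
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            entries.emplace_back(col_ind[k], val[k]);
        }
        std::sort(entries.begin(),
                  entries.end(),
                  [](const std::pair<int, double>& a, const std::pair<int, double>& b) {
                      return a.first < b.first;
                  });
        for(std::size_t e = 0; e < entries.size(); ++e)
        {
            col_ind[row_ptr[i] + e] = entries[e].first;
            val[row_ptr[i] + e]     = entries[e].second;
        }
    }
}
```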
-
void
rocalution::LocalMatrix::Key(long int &row_key, long int &col_key, long int &val_key) const¶ Compute a unique hash key for the matrix arrays.
It is typically hard to check whether two matrices have the same structure (and values). For this purpose, rocALUTION provides a keying function that generates three keys: one each for the row index, column index and values arrays.
- Parameters
[out] row_key: row index array key
[out] col_key: column index array key
[out] val_key: values array key
-
void
rocalution::LocalMatrix::ReplaceColumnVector(int idx, const LocalVector<ValueType> &vec)¶ Replace a column vector of a matrix.
-
void
rocalution::LocalMatrix::ReplaceRowVector(int idx, const LocalVector<ValueType> &vec)¶ Replace a row vector of a matrix.
-
void
rocalution::LocalMatrix::ExtractColumnVector(int idx, LocalVector<ValueType> *vec) const¶ Extract values from a column of a matrix to a vector.
-
void
rocalution::LocalMatrix::ExtractRowVector(int idx, LocalVector<ValueType> *vec) const¶ Extract values from a row of a matrix to a vector.
-
void
rocalution::LocalMatrix::AMGConnect(ValueType eps, LocalVector<int> *connections) const¶ Strong couplings for aggregation-based AMG.
-
void
rocalution::LocalMatrix::AMGAggregate(const LocalVector<int> &connections, LocalVector<int> *aggregates) const¶ Plain aggregation - Modification of a greedy aggregation scheme from Vanek (1996)
-
void
rocalution::LocalMatrix::AMGSmoothedAggregation(ValueType relax, const LocalVector<int> &aggregates, const LocalVector<int> &connections, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶ Interpolation scheme based on smoothed aggregation from Vanek (1996)
-
void
rocalution::LocalMatrix::AMGAggregation(const LocalVector<int> &aggregates, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶ Aggregation-based interpolation scheme.
-
void
rocalution::LocalMatrix::RugeStueben(ValueType eps, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶ Ruge Stueben coarsening.
-
void
rocalution::LocalMatrix::FSAI(int power, const LocalMatrix<ValueType> *pattern)¶ Factorized Sparse Approximate Inverse assembly for given system matrix power pattern or external sparsity pattern.
-
void
rocalution::LocalMatrix::SPAI(void)¶ SParse Approximate Inverse assembly for given system matrix pattern.
-
void
rocalution::LocalMatrix::InitialPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const Initial Pairwise Aggregation scheme.
-
void
rocalution::LocalMatrix::InitialPairwiseAggregation(const LocalMatrix<ValueType> &mat, ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const Initial Pairwise Aggregation scheme for split matrices.
-
void
rocalution::LocalMatrix::FurtherPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const Further Pairwise Aggregation scheme.
-
void
rocalution::LocalMatrix::FurtherPairwiseAggregation(const LocalMatrix<ValueType> &mat, ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const Further Pairwise Aggregation scheme for split matrices.
-
void
rocalution::LocalMatrix::CoarsenOperator(LocalMatrix<ValueType> *Ac, int nrow, int ncol, const LocalVector<int> &G, int Gsize, const int *rG, int rGsize) const¶ Build coarse operator for pairwise aggregation scheme.
Local Stencil¶
-
template<typename ValueType>
class LocalStencil : public rocalution::Operator<ValueType>¶ LocalStencil class.
A LocalStencil is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.
- Template Parameters
ValueType: - can be int, float, double, std::complex<float> and std::complex<double>
-
rocalution::LocalStencil::LocalStencil(unsigned int type) Initialize a local stencil with a type.
-
int
rocalution::LocalStencil::GetNDim(void) const¶ Return the dimension of the stencil.
-
void
rocalution::LocalStencil::SetGrid(int size)¶ Set the stencil grid size.
Global Matrix¶
-
template<typename ValueType>
class GlobalMatrix : public rocalution::Operator<ValueType>¶ GlobalMatrix class.
A GlobalMatrix is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.
- Template Parameters
ValueType: - can be int, float, double, std::complex<float> and std::complex<double>
-
rocalution::GlobalMatrix::GlobalMatrix(const ParallelManager &pm) Initialize a global matrix with a parallel manager.
-
bool
rocalution::GlobalMatrix::Check(void) const¶ Return true if the matrix is ok (an empty matrix is also ok) and false if there is something wrong with the structure or some of the values are NaN.
-
void
rocalution::GlobalMatrix::AllocateCSR(std::string name, int local_nnz, int ghost_nnz)¶ Allocate CSR Matrix.
-
void
rocalution::GlobalMatrix::AllocateCOO(std::string name, int local_nnz, int ghost_nnz)¶ Allocate COO Matrix.
-
void
rocalution::GlobalMatrix::SetParallelManager(const ParallelManager &pm)¶ Set the parallel manager of a global matrix.
-
void
rocalution::GlobalMatrix::SetDataPtrCSR(int **local_row_offset, int **local_col, ValueType **local_val, int **ghost_row_offset, int **ghost_col, ValueType **ghost_val, std::string name, int local_nnz, int ghost_nnz)¶ Initialize a CSR matrix on the host with externally allocated data.
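The argument names above (row offsets, column indices, values, and a non-zero count) follow the usual CSR layout. As a minimal, library-independent sketch of what those three arrays encode, the following uses a hypothetical `CsrSketch` struct and helper functions that are not part of rocALUTION, and multiplies a small matrix by a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration; not a rocALUTION type.
struct CsrSketch
{
    std::vector<int>    row_offset; // nrow + 1 entries; row i uses [row_offset[i], row_offset[i + 1])
    std::vector<int>    col;        // column index of each stored value (nnz entries)
    std::vector<double> val;        // stored values (nnz entries)
};

// Compute y = A * x by walking each row's slice of col/val.
std::vector<double> csr_multiply(const CsrSketch& A, const std::vector<double>& x)
{
    assert(!A.row_offset.empty() && A.col.size() == A.val.size());
    const std::size_t   nrow = A.row_offset.size() - 1;
    std::vector<double> y(nrow, 0.0);
    for(std::size_t i = 0; i < nrow; ++i)
    {
        for(int k = A.row_offset[i]; k < A.row_offset[i + 1]; ++k)
        {
            y[i] += A.val[k] * x[A.col[k]];
        }
    }
    return y;
}

// The 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] with nnz = 5.
CsrSketch make_example_matrix()
{
    CsrSketch A;
    A.row_offset = {0, 2, 3, 5};
    A.col        = {0, 2, 1, 0, 2};
    A.val        = {2.0, 1.0, 3.0, 4.0, 5.0};
    return A;
}
```

In a distributed setting the local and ghost parts would each carry their own set of such arrays, which is presumably why the functions above take separate local and ghost pointers.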
-
void
rocalution::GlobalMatrix::SetDataPtrCOO(int **local_row, int **local_col, ValueType **local_val, int **ghost_row, int **ghost_col, ValueType **ghost_val, std::string name, int local_nnz, int ghost_nnz)¶ Initialize a COO matrix on the host with externally allocated data.
-
void
rocalution::GlobalMatrix::SetLocalDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz)¶ Initialize a CSR matrix on the host with externally allocated local data.
-
void
rocalution::GlobalMatrix::SetLocalDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz)¶ Initialize a COO matrix on the host with externally allocated local data.
-
void
rocalution::GlobalMatrix::SetGhostDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz)¶ Initialize a CSR matrix on the host with externally allocated ghost data.
-
void
rocalution::GlobalMatrix::SetGhostDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz)¶ Initialize a COO matrix on the host with externally allocated ghost data.
-
void
rocalution::GlobalMatrix::LeaveDataPtrCSR(int **local_row_offset, int **local_col, ValueType **local_val, int **ghost_row_offset, int **ghost_col, ValueType **ghost_val)¶ Leave a CSR matrix to host pointers.
-
void
rocalution::GlobalMatrix::LeaveDataPtrCOO(int **local_row, int **local_col, ValueType **local_val, int **ghost_row, int **ghost_col, ValueType **ghost_val)¶ Leave a COO matrix to host pointers.
-
void
rocalution::GlobalMatrix::LeaveLocalDataPtrCSR(int **row_offset, int **col, ValueType **val)¶ Leave a local CSR matrix to host pointers.
-
void
rocalution::GlobalMatrix::LeaveLocalDataPtrCOO(int **row, int **col, ValueType **val)¶ Leave a local COO matrix to host pointers.
-
void
rocalution::GlobalMatrix::LeaveGhostDataPtrCSR(int **row_offset, int **col, ValueType **val)¶ Leave a CSR ghost matrix to host pointers.
-
void
rocalution::GlobalMatrix::LeaveGhostDataPtrCOO(int **row, int **col, ValueType **val)¶ Leave a COO ghost matrix to host pointers.
-
void
rocalution::GlobalMatrix::CloneFrom(const GlobalMatrix<ValueType> &src)¶ Clone the entire matrix (values,structure+backend descr) from another GlobalMatrix.
-
void
rocalution::GlobalMatrix::CopyFrom(const GlobalMatrix<ValueType> &src)¶ Copy matrix (values and structure) from another GlobalMatrix.
-
void
rocalution::GlobalMatrix::ConvertToCSR(void)¶ Convert the matrix to CSR structure.
-
void
rocalution::GlobalMatrix::ConvertToMCSR(void)¶ Convert the matrix to MCSR structure.
-
void
rocalution::GlobalMatrix::ConvertToBCSR(void)¶ Convert the matrix to BCSR structure.
-
void
rocalution::GlobalMatrix::ConvertToCOO(void)¶ Convert the matrix to COO structure.
-
void
rocalution::GlobalMatrix::ConvertToELL(void)¶ Convert the matrix to ELL structure.
-
void
rocalution::GlobalMatrix::ConvertToDIA(void)¶ Convert the matrix to DIA structure.
-
void
rocalution::GlobalMatrix::ConvertToHYB(void)¶ Convert the matrix to HYB structure.
-
void
rocalution::GlobalMatrix::ConvertToDENSE(void)¶ Convert the matrix to DENSE structure.
-
void
rocalution::GlobalMatrix::ConvertTo(unsigned int matrix_format)¶ Convert the matrix to specified matrix ID format.
-
void
rocalution::GlobalMatrix::ReadFileMTX(const std::string filename)¶ Read matrix from MTX (Matrix Market Format) file.
-
void
rocalution::GlobalMatrix::WriteFileMTX(const std::string filename) const¶ Write matrix to MTX (Matrix Market Format) file.
-
void
rocalution::GlobalMatrix::ReadFileCSR(const std::string filename)¶ Read matrix from CSR (ROCALUTION binary format) file.
-
void
rocalution::GlobalMatrix::WriteFileCSR(const std::string filename) const¶ Write matrix to CSR (ROCALUTION binary format) file.
-
void
rocalution::GlobalMatrix::Sort(void)¶ Sort the matrix indices.
-
void
rocalution::GlobalMatrix::ExtractInverseDiagonal(GlobalVector<ValueType> *vec_inv_diag) const¶ Extract the inverse (reciprocal) diagonal values of the matrix into a GlobalVector.
-
void
rocalution::GlobalMatrix::Scale(ValueType alpha)¶ Scale all the values in the matrix.
-
void
rocalution::GlobalMatrix::InitialPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const¶ Initial Pairwise Aggregation scheme.
-
void
rocalution::GlobalMatrix::FurtherPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const¶ Further Pairwise Aggregation scheme.
-
void
rocalution::GlobalMatrix::CoarsenOperator(GlobalMatrix<ValueType> *Ac, ParallelManager *pm, int nrow, int ncol, const LocalVector<int> &G, int Gsize, const int *rG, int rGsize) const¶ Build coarse operator for pairwise aggregation scheme.
Local Vector¶
-
template<typename ValueType>
class LocalVector: public rocalution::Vector<ValueType>¶ LocalVector class.
A LocalVector is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.
- Template Parameters
ValueType: - can be int, float, double, std::complex<float> and std::complex<double>
-
void
rocalution::LocalVector::Allocate(std::string name, IndexType2 size)¶ Allocate a local vector with name and size.
The local vector allocation function requires a name of the object (this is only for information purposes) and corresponding size description for vector objects.
- Example
LocalVector<ValueType> vec; vec.Allocate("my vector", 100); vec.Clear();
- Parameters
[in] name: object name
[in] size: number of elements in the vector
-
void
rocalution::LocalVector::SetDataPtr(ValueType **ptr, std::string name, int size)¶ Initialize a LocalVector on the host with externally allocated data.
SetDataPtr has direct access to the raw data via pointers. Already allocated data can be set by passing the pointer.
- Note
Setting the data pointer will leave the original pointer empty (set to NULL).
- Example
// Allocate vector
ValueType* ptr_vec = new ValueType[200];
// Fill vector
// ...
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Set the vector data, ptr_vec will become invalid
vec.SetDataPtr(&ptr_vec, "my_vector", 200);
-
void
rocalution::LocalVector::LeaveDataPtr(ValueType **ptr)¶ Leave a LocalVector to host pointers.
LeaveDataPtr has direct access to the raw data via pointers. A LocalVector object can leave its raw data to a host pointer. This will leave the LocalVector empty.
- Example
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Allocate the vector
vec.Allocate("my_vector", 100);
// Fill vector
// ...
ValueType* ptr_vec = NULL;
// Get (steal) the data from the vector, this will leave the local vector object empty
vec.LeaveDataPtr(&ptr_vec);
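The pointer-to-pointer arguments of SetDataPtr and LeaveDataPtr express a transfer of ownership. The following is a minimal sketch of that idea only; `VecSketch` and `round_trip_keeps_buffer` are hypothetical names, and the class is a stand-in written for illustration, not rocALUTION's LocalVector implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a vector class; NOT rocALUTION's LocalVector.
class VecSketch
{
public:
    VecSketch() : data_(nullptr), size_(0) {}
    ~VecSketch() { delete[] data_; }
    VecSketch(const VecSketch&)            = delete;
    VecSketch& operator=(const VecSketch&) = delete;

    // Take ownership of *ptr; the caller's pointer is reset to nullptr.
    void SetDataPtr(double** ptr, int size)
    {
        assert(ptr != nullptr && *ptr != nullptr && size > 0);
        delete[] data_;
        data_ = *ptr;
        size_ = size;
        *ptr  = nullptr;
    }

    // Hand the buffer back to the caller; the object becomes empty.
    void LeaveDataPtr(double** ptr)
    {
        assert(ptr != nullptr && *ptr == nullptr);
        *ptr  = data_;
        data_ = nullptr;
        size_ = 0;
    }

    int GetSize() const { return size_; }

private:
    double* data_;
    int     size_;
};

// Round trip: the same buffer goes in and comes back out without a copy.
bool round_trip_keeps_buffer()
{
    double* raw      = new double[4]{1.0, 2.0, 3.0, 4.0};
    double* original = raw; // remember the address for comparison
    VecSketch v;
    v.SetDataPtr(&raw, 4);
    bool ok = (raw == nullptr) && (v.GetSize() == 4);

    double* back = nullptr;
    v.LeaveDataPtr(&back);
    ok = ok && (back == original) && (v.GetSize() == 0) && (back[2] == 3.0);
    delete[] back; // the caller owns the buffer again
    return ok;
}
```

Passing the address of the pointer lets the callee null out the caller's copy, so only one owner of the buffer exists at any time.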
-
ValueType &
rocalution::LocalVector::operator[](int i) Access operator (only for host data)
The elements in the vector can be accessed via [] operators, when the vector is allocated on the host.
- Return
value at index i
- Example
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Allocate vector
vec.Allocate("my_vector", 100);
// Initialize vector with 1
vec.Ones();
// Set even elements to -1
for(int i = 0; i < vec.GetSize(); i += 2)
{
    vec[i] = -1;
}
- Parameters
[in] i: access data at index i
-
const ValueType &
rocalution::LocalVector::operator[](int i) const Access operator (only for host data)
The elements in the vector can be accessed via [] operators, when the vector is allocated on the host.
- Return
value at index i
- Example
// rocALUTION local vector object
LocalVector<ValueType> vec;
// Allocate vector
vec.Allocate("my_vector", 100);
// Initialize vector with 1
vec.Ones();
// Set even elements to -1
for(int i = 0; i < vec.GetSize(); i += 2)
{
    vec[i] = -1;
}
- Parameters
[in] i: access data at index i
-
void
rocalution::LocalVector::CopyFromPermute(const LocalVector<ValueType> &src, const LocalVector<int> &permutation)¶ Copy a vector under permutation (forward permutation)
-
void
rocalution::LocalVector::CopyFromPermuteBackward(const LocalVector<ValueType> &src, const LocalVector<int> &permutation)¶ Copy a vector under permutation (backward permutation)
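The difference between a forward and a backward permutation can be sketched as follows. This assumes the common convention that the forward variant scatters element `i` of the source to position `permutation[i]`, while the backward variant gathers from position `permutation[i]`; that convention is an assumption here and should be checked against the library. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward (scatter), assumed convention: dst[perm[i]] = src[i].
std::vector<double> permute_forward(const std::vector<double>& src, const std::vector<int>& perm)
{
    assert(src.size() == perm.size());
    std::vector<double> dst(src.size(), 0.0);
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        dst[perm[i]] = src[i];
    }
    return dst;
}

// Backward (gather), assumed convention: dst[i] = src[perm[i]].
std::vector<double> permute_backward(const std::vector<double>& src, const std::vector<int>& perm)
{
    assert(src.size() == perm.size());
    std::vector<double> dst(src.size(), 0.0);
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        dst[i] = src[perm[i]];
    }
    return dst;
}
```

Under this convention the two operations are inverses of each other: applying the backward permutation to a forward-permuted vector restores the original ordering.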
-
void
rocalution::LocalVector::CopyFromData(const ValueType *data)¶ Copy (import) vector.
Copy (import) vector data that is described in one array (values). The object data has to be allocated with Allocate(), using the corresponding size of the data, first.
- Parameters
[in] data: data to be imported.
-
void
rocalution::LocalVector::CopyToData(ValueType *data) const¶ Copy (export) vector.
Copy (export) vector data that is described in one array (values). The output array has to be allocated, using the corresponding size of the data, first. The size can be obtained with GetSize().
- Parameters
[out] data: exported data.
-
void
rocalution::LocalVector::Permute(const LocalVector<int> &permutation)¶ Perform in-place permutation (forward) of the vector.
-
void
rocalution::LocalVector::PermuteBackward(const LocalVector<int> &permutation)¶ Perform in-place permutation (backward) of the vector.
-
void
rocalution::LocalVector::Restriction(const LocalVector<ValueType> &vec_fine, const LocalVector<int> &map)¶ Restriction operator based on restriction mapping vector.
-
void
rocalution::LocalVector::Prolongation(const LocalVector<ValueType> &vec_coarse, const LocalVector<int> &map)¶ Prolongation operator based on restriction mapping vector.
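A restriction mapping vector assigns each fine-grid entry to a coarse-grid entry. The sketch below shows one plausible reading of the two operators: restriction sums all fine values that map to the same coarse index, and prolongation copies the coarse value back to every fine entry that maps to it. The summation rule, the use of `-1` for unmapped entries, and the function names are all assumptions made for illustration, not the documented behavior of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed rule: coarse[map[i]] accumulates fine[i]; map[i] == -1 means "not mapped".
std::vector<double> restrict_sum(const std::vector<double>& fine,
                                 const std::vector<int>&    map,
                                 int                        ncoarse)
{
    assert(fine.size() == map.size() && ncoarse >= 0);
    std::vector<double> coarse(static_cast<std::size_t>(ncoarse), 0.0);
    for(std::size_t i = 0; i < fine.size(); ++i)
    {
        if(map[i] >= 0)
        {
            coarse[map[i]] += fine[i];
        }
    }
    return coarse;
}

// Assumed rule: fine[i] = coarse[map[i]], and 0 for unmapped entries.
std::vector<double> prolong_copy(const std::vector<double>& coarse, const std::vector<int>& map)
{
    std::vector<double> fine(map.size(), 0.0);
    for(std::size_t i = 0; i < map.size(); ++i)
    {
        if(map[i] >= 0)
        {
            fine[i] = coarse[map[i]];
        }
    }
    return fine;
}
```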
-
void
rocalution::LocalVector::SetIndexArray(int size, const int *index)¶ Set index array.
-
void
rocalution::LocalVector::GetIndexValues(ValueType *values) const¶ Get indexed values.
-
void
rocalution::LocalVector::SetIndexValues(const ValueType *values)¶ Set indexed values.
-
void
rocalution::LocalVector::GetContinuousValues(int start, int end, ValueType *values) const¶ Get continuous indexed values.
-
void
rocalution::LocalVector::SetContinuousValues(int start, int end, const ValueType *values)¶ Set continuous indexed values.
-
void
rocalution::LocalVector::ExtractCoarseMapping(int start, int end, const int *index, int nc, int *size, int *map) const¶ Extract coarse boundary mapping.
-
void
rocalution::LocalVector::ExtractCoarseBoundary(int start, int end, const int *index, int nc, int *size, int *boundary) const¶ Extract coarse boundary index.
Global Vector¶
-
template<typename ValueType>
class GlobalVector: public rocalution::Vector<ValueType>¶ GlobalVector class.
A GlobalVector is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.
- Template Parameters
ValueType: - can be int, float, double, std::complex<float> and std::complex<double>
-
rocalution::GlobalVector::GlobalVector(const ParallelManager &pm) Initialize a global vector with a parallel manager.
-
void
rocalution::GlobalVector::Allocate(std::string name, IndexType2 size)¶ Allocate a global vector with name and size.
-
void
rocalution::GlobalVector::SetParallelManager(const ParallelManager &pm)¶ Set the parallel manager of a global vector.
-
ValueType &
rocalution::GlobalVector::operator[](int i) Access operator (only for host data)
-
const ValueType &
rocalution::GlobalVector::operator[](int i) const Access operator (only for host data)
-
void
rocalution::GlobalVector::SetDataPtr(ValueType **ptr, std::string name, IndexType2 size)¶ Initialize the local part of a global vector with externally allocated data.
-
void
rocalution::GlobalVector::LeaveDataPtr(ValueType **ptr)¶ Get a pointer to the data from the local part of a global vector and free the global vector object.
-
void
rocalution::GlobalVector::Restriction(const GlobalVector<ValueType> &vec_fine, const LocalVector<int> &map)¶ Restriction operator based on restriction mapping vector.
-
void
rocalution::GlobalVector::Prolongation(const GlobalVector<ValueType> &vec_coarse, const LocalVector<int> &map)¶ Prolongation operator based on restriction mapping vector.
Parallel Manager¶
-
class ParallelManager: public rocalution::RocalutionObj¶ Parallel Manager class.
The parallel manager class handles the communication and the mapping of the global operators. Each global operator and vector needs to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulations, the underlying operator is already distributed. This information needs to be passed to the parallel manager.
-
void
rocalution::ParallelManager::SetMPICommunicator(const void *comm)¶ Set the MPI communicator.
-
void
rocalution::ParallelManager::Clear(void)¶ Clear all allocated resources.
-
IndexType2
rocalution::ParallelManager::GetGlobalSize(void) const¶ Return the global size.
-
int
rocalution::ParallelManager::GetLocalSize(void) const¶ Return the local size.
-
int
rocalution::ParallelManager::GetNumReceivers(void) const¶ Return the number of receivers.
-
int
rocalution::ParallelManager::GetNumSenders(void) const¶ Return the number of senders.
-
int
rocalution::ParallelManager::GetNumProcs(void) const¶ Return the number of involved processes.
-
void
rocalution::ParallelManager::SetGlobalSize(IndexType2 size)¶ Initialize the global size.
-
void
rocalution::ParallelManager::SetLocalSize(int size)¶ Initialize the local size.
-
void
rocalution::ParallelManager::SetBoundaryIndex(int size, const int *index)¶ Set all boundary indices of this rank's process.
-
void
rocalution::ParallelManager::SetReceivers(int nrecv, const int *recvs, const int *recv_offset)¶ Set the number of processes the current process receives data from, the array of those processes, and the offsets at which the boundary for each ‘receiver’ process starts.
-
void
rocalution::ParallelManager::SetSenders(int nsend, const int *sends, const int *send_offset)¶ Set the number of processes the current process sends data to, the array of those processes, and the offsets at which the ghost part for each ‘sender’ process starts.
-
void
rocalution::ParallelManager::LocalToGlobal(int proc, int local, int &global)¶ Mapping local to global.
-
void
rocalution::ParallelManager::GlobalToLocal(int global, int &proc, int &local)¶ Mapping global to local.
-
bool
rocalution::ParallelManager::Status(void) const¶ Check sanity status of parallel manager.
-
void
rocalution::ParallelManager::ReadFileASCII(const std::string filename)¶ Read file that contains all relevant parallel manager data.
-
void
rocalution::ParallelManager::WriteFileASCII(const std::string filename) const¶ Write file that contains all relevant parallel manager data.
Solvers¶
-
template<class OperatorType, class VectorType, typename ValueType>
class Solver: public rocalution::RocalutionObj¶ Base class for all solvers and preconditioners.
Most of the solvers can be performed on linear operators LocalMatrix, LocalStencil and GlobalMatrix - i.e. the solvers can be performed locally (on a shared memory system) or in a distributed manner (on a cluster) via MPI. The only exception is the AMG (Algebraic Multigrid) solver which has two versions (one for LocalMatrix and one for GlobalMatrix class). The only pure local solvers (which do not support global/MPI operations) are the mixed-precision defect-correction solver and all direct solvers.
All solvers need three template parameters - Operators, Vectors and Scalar type.
The Solver class is purely virtual and provides an interface for
SetOperator() to set the operator \(A\), i.e. the user can pass the matrix here.
Build() to build the solver (including preconditioners, sub-solvers, etc.). The user needs to specify the operator first before calling Build().
Solve() to solve the system \(Ax = b\). The user needs to pass a right-hand side \(b\) and a vector \(x\), in which the solution will be stored.
Print() to show solver information.
ReBuildNumeric() to only re-build the solver numerically (if possible).
MoveToHost() and MoveToAccelerator() to offload the solver (including preconditioners and sub-solvers) to the host/accelerator.
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::DirectLinearSolver< OperatorType, VectorType, ValueType >, rocalution::IterativeLinearSolver< OperatorType, VectorType, ValueType >, rocalution::Preconditioner< OperatorType, VectorType, ValueType >
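The interface described above (set the operator, build, then solve) can be illustrated with a small self-contained sketch. `SolverSketch`, `DiagonalSolverSketch` and `solve_diagonal` are hypothetical names; the code mirrors only the call order and is not the library's class hierarchy. The "operator" here is just the diagonal of a diagonal matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical abstract interface mirroring SetOperator() / Build() / Solve().
template <class OperatorType, class VectorType>
class SolverSketch
{
public:
    virtual ~SolverSketch() = default;
    void SetOperator(const OperatorType& op) { op_ = &op; }
    virtual void Build()                                  = 0;
    virtual void Solve(const VectorType& rhs, VectorType* x) = 0;

protected:
    const OperatorType* op_ = nullptr;
};

using DiagOp = std::vector<double>; // diagonal entries of a diagonal matrix
using Vec    = std::vector<double>;

// Concrete solver: Build() precomputes the reciprocals, Solve() applies them.
class DiagonalSolverSketch : public SolverSketch<DiagOp, Vec>
{
public:
    void Build() override
    {
        assert(op_ != nullptr); // the operator has to be set before Build()
        inv_diag_.resize(op_->size());
        for(std::size_t i = 0; i < op_->size(); ++i)
        {
            inv_diag_[i] = 1.0 / (*op_)[i];
        }
        built_ = true;
    }

    void Solve(const Vec& rhs, Vec* x) override
    {
        assert(built_ && x != nullptr && rhs.size() == inv_diag_.size());
        x->resize(rhs.size());
        for(std::size_t i = 0; i < rhs.size(); ++i)
        {
            (*x)[i] = inv_diag_[i] * rhs[i];
        }
    }

private:
    Vec  inv_diag_;
    bool built_ = false;
};

// Call order used throughout: SetOperator, Build, Solve.
Vec solve_diagonal(const DiagOp& d, const Vec& b)
{
    DiagonalSolverSketch solver;
    solver.SetOperator(d);
    solver.Build();
    Vec x;
    solver.Solve(b, &x);
    return x;
}
```

Separating Build() from Solve() lets expensive setup work be done once and reused for several right-hand sides.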
-
void
rocalution::Solver::ResetOperator(const OperatorType &op)¶ Reset the operator; see ReBuildNumeric()
-
virtual void
rocalution::Solver::Solve(const VectorType &rhs, VectorType *x) = 0¶ Solve Operator x = rhs.
-
void
rocalution::Solver::SolveZeroSol(const VectorType &rhs, VectorType *x)¶ Solve Operator x = rhs, setting initial x = 0.
-
void
rocalution::Solver::Build(void)¶ Build the solver (data allocation, structure and numerical computation)
-
void
rocalution::Solver::BuildMoveToAcceleratorAsync(void)¶ Build the solver and move it to the accelerator asynchronously.
-
void
rocalution::Solver::ReBuildNumeric(void)¶ Rebuild the solver only with numerical computation (no allocation or data structure computation)
-
void
rocalution::Solver::MoveToAccelerator(void)¶ Move all data (i.e. move the solver) to the accelerator.
-
void
rocalution::Solver::Verbose(int verb = 1)¶ Provide verbose output of the solver.
verb = 0 -> no output
verb = 1 -> print info about the solver (start, end);
verb = 2 -> print (iter, residual) via iteration control;
-
template<class OperatorType, class VectorType, typename ValueType>
class IterativeLinearSolver: public rocalution::Solver<OperatorType, VectorType, ValueType>¶ Base class for all linear iterative solvers.
The iterative solvers are controlled by an iteration control object, which monitors the convergence properties of the solver, i.e. maximum number of iterations, relative tolerance, absolute tolerance and divergence tolerance. The iteration control can also record the residual history and store it in an ASCII file.
Init(), InitMinIter(), InitMaxIter() and InitTol() initialize the solver and set the stopping criteria.
RecordResidualHistory() and RecordHistory() start the recording of the residual and write it into a file.
Verbose() sets the level of verbose output of the solver (0 - no output, 2 - detailed output, including residual and iteration information).
SetPreconditioner() sets the preconditioning.
All iterative solvers are controlled based on
Absolute stopping criteria, when \(|r_{k}|_{L_{p}} \lt \epsilon_{abs}\)
Relative stopping criteria, when \(|r_{k}|_{L_{p}} / |r_{1}|_{L_{p}} \leq \epsilon_{rel}\)
Divergence stopping criteria, when \(|r_{k}|_{L_{p}} / |r_{1}|_{L_{p}} \geq \epsilon_{div}\)
Maximum number of iterations \(N\), when \(k = N\)
where \(k\) is the current iteration, \(r_{k}\) the residual for the current iteration \(k\) (i.e. \(r_{k} = b - Ax_{k}\)) and \(r_{1}\) the starting residual (i.e. \(r_{1} = b - Ax_{init}\)). In addition, the minimum number of iterations \(M\) can be specified. In this case, the solver will not stop iterating before \(k \geq M\).
The \(L_{p}\) norm is used for the computation, where \(p\) can be 1, 2 or \(\infty\). The norm computation can be set with SetResidualNorm() with 1 for \(L_{1}\), 2 for \(L_{2}\) and 3 for \(L_{\infty}\). For the computation with \(L_{\infty}\), the index of the maximum value can be obtained with GetAmaxResidualIndex(). If this function is called and \(L_{\infty}\) was not selected, this function will return -1.
The criterion that was reached can be obtained with GetSolverStatus(), which returns
0, if no criterion has been reached yet
1, if absolute tolerance has been reached
2, if relative tolerance has been reached
3, if divergence tolerance has been reached
4, if maximum number of iterations has been reached
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::BaseMultiGrid< OperatorType, VectorType, ValueType >, rocalution::BiCGStab< OperatorType, VectorType, ValueType >, rocalution::BiCGStabl< OperatorType, VectorType, ValueType >, rocalution::CG< OperatorType, VectorType, ValueType >, rocalution::Chebyshev< OperatorType, VectorType, ValueType >, rocalution::CR< OperatorType, VectorType, ValueType >, rocalution::FCG< OperatorType, VectorType, ValueType >, rocalution::FGMRES< OperatorType, VectorType, ValueType >, rocalution::FixedPoint< OperatorType, VectorType, ValueType >, rocalution::GMRES< OperatorType, VectorType, ValueType >, rocalution::IDR< OperatorType, VectorType, ValueType >, rocalution::QMRCGStab< OperatorType, VectorType, ValueType >
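The stopping criteria and status codes listed for the iteration control can be written down as a small decision function. This is a sketch under assumptions: the struct and function names are hypothetical, the order in which the criteria are tested is a guess, and the maximum-iteration test uses `k >= max_iter` rather than strict equality.

```cpp
#include <cassert>

// Hypothetical bundle of the tolerances and iteration limits described above.
struct IterationControlSketch
{
    double abs_tol;
    double rel_tol;
    double div_tol;
    int    min_iter;
    int    max_iter;
};

// Returns 0 (continue), 1 (absolute), 2 (relative), 3 (divergence) or 4 (max iterations).
// res_k is the norm of the current residual, res_init the norm of the starting residual.
int check_status(const IterationControlSketch& c, int k, double res_k, double res_init)
{
    assert(res_init > 0.0);
    if(k < c.min_iter)
    {
        return 0; // never stop before the minimum number of iterations
    }
    if(res_k < c.abs_tol)
    {
        return 1;
    }
    if(res_k / res_init <= c.rel_tol)
    {
        return 2;
    }
    if(res_k / res_init >= c.div_tol)
    {
        return 3;
    }
    if(k >= c.max_iter)
    {
        return 4;
    }
    return 0;
}
```

A solver loop would call such a check once per iteration and leave the loop as soon as the result is non-zero.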
-
void
rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int max_iter) Initialize the solver with absolute/relative/divergence tolerance and maximum number of iterations.
-
void
rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int min_iter, int max_iter) Initialize the solver with absolute/relative/divergence tolerance and minimum/maximum number of iterations.
-
void
rocalution::IterativeLinearSolver::InitMinIter(int min_iter)¶ Set the minimum number of iterations.
-
void
rocalution::IterativeLinearSolver::InitMaxIter(int max_iter)¶ Set the maximum number of iterations.
-
void
rocalution::IterativeLinearSolver::InitTol(double abs, double rel, double div)¶ Set the absolute/relative/divergence tolerance.
-
void
rocalution::IterativeLinearSolver::SetResidualNorm(int resnorm)¶ Set the residual norm to \(L_1\), \(L_2\) or \(L_\infty\) norm.
resnorm = 1 -> \(L_1\) norm
resnorm = 2 -> \(L_2\) norm
resnorm = 3 -> \(L_\infty\) norm
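For reference, the three norms selected by `resnorm` can be computed as in the following sketch. The function name and the out-parameter are hypothetical; the `-1` index for the \(L_1\) and \(L_2\) cases follows the behavior described for GetAmaxResidualIndex().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// resnorm: 1 -> L1 norm, 2 -> L2 norm, 3 -> L-infinity norm.
// *amax_index receives the index of the largest |r_i| for resnorm == 3, otherwise -1.
double residual_norm(const std::vector<double>& r, int resnorm, int* amax_index)
{
    assert(resnorm >= 1 && resnorm <= 3 && amax_index != nullptr);
    *amax_index   = -1;
    double result = 0.0;
    for(std::size_t i = 0; i < r.size(); ++i)
    {
        const double a = std::fabs(r[i]);
        if(resnorm == 1)
        {
            result += a; // sum of absolute values
        }
        else if(resnorm == 2)
        {
            result += a * a; // sum of squares, square root taken below
        }
        else if(a > result || *amax_index < 0)
        {
            result      = a; // running maximum and its position
            *amax_index = static_cast<int>(i);
        }
    }
    return (resnorm == 2) ? std::sqrt(result) : result;
}
```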
-
void
rocalution::IterativeLinearSolver::RecordResidualHistory(void)¶ Record the residual history.
-
void
rocalution::IterativeLinearSolver::RecordHistory(const std::string filename) const¶ Write the history to file.
-
void
rocalution::IterativeLinearSolver::Verbose(int verb = 1)¶ Set the solver verbosity output.
-
void
rocalution::IterativeLinearSolver::Solve(const VectorType &rhs, VectorType *x)¶ Solve Operator x = rhs.
-
void
rocalution::IterativeLinearSolver::SetPreconditioner(Solver<OperatorType, VectorType, ValueType> &precond)¶ Set a preconditioner of the linear solver.
-
int
rocalution::IterativeLinearSolver::GetIterationCount(void)¶ Return the iteration count.
-
double
rocalution::IterativeLinearSolver::GetCurrentResidual(void)¶ Return the current residual.
-
int
rocalution::IterativeLinearSolver::GetSolverStatus(void)¶ Return the current status.
-
int
rocalution::IterativeLinearSolver::GetAmaxResidualIndex(void)¶ Return absolute maximum index of residual vector when using \(L_\infty\) norm.
-
template<class OperatorType, class VectorType, typename ValueType>
class FixedPoint: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Fixed-Point Iteration Scheme.
The Fixed-Point iteration scheme is based on additive splitting of the matrix \(A = M + N\). The scheme reads
\[ x_{k+1} = M^{-1} (b - N x_{k}). \]
It can also be reformulated as a weighted defect correction scheme
\[ x_{k+1} = x_{k} - \omega M^{-1} (Ax_{k} - b). \]
The inversion of \(M\) can be performed by preconditioners (Jacobi, Gauss-Seidel, ILU, etc.) or by any type of solvers.
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::FixedPoint::SetRelaxation(ValueType omega)¶ Set relaxation parameter \(\omega\).
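As a worked instance of the weighted defect correction form \(x_{k+1} = x_{k} - \omega M^{-1}(Ax_{k} - b)\), the sketch below picks \(M = \mathrm{diag}(A)\), which gives a relaxed Jacobi iteration. The dense matrix type and the function name are hypothetical and chosen only to keep the example self-contained; this is not how the library stores or applies operators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // small dense matrix, row by row (illustration only)

// x_{k+1} = x_k - omega * M^{-1} (A x_k - b) with M = diag(A), starting from x_0 = 0.
Vec fixed_point_jacobi(const Mat& A, const Vec& b, double omega, int iterations)
{
    const std::size_t n = b.size();
    assert(A.size() == n);
    Vec x(n, 0.0);
    Vec defect(n, 0.0);
    for(int k = 0; k < iterations; ++k)
    {
        // defect = A x_k - b
        for(std::size_t i = 0; i < n; ++i)
        {
            double Ax_i = 0.0;
            for(std::size_t j = 0; j < n; ++j)
            {
                Ax_i += A[i][j] * x[j];
            }
            defect[i] = Ax_i - b[i];
        }
        // apply M^{-1} (division by the diagonal) and the relaxation weight
        for(std::size_t i = 0; i < n; ++i)
        {
            x[i] -= omega * defect[i] / A[i][i];
        }
    }
    return x;
}
```

With \(\omega = 1\) this reduces to the unweighted scheme \(x_{k+1} = M^{-1}(b - Nx_{k})\), where \(N\) is the off-diagonal part of \(A\).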
-
template<class OperatorTypeH, class VectorTypeH, typename ValueTypeH, class OperatorTypeL, class VectorTypeL, typename ValueTypeL>
class MixedPrecisionDC: public rocalution::IterativeLinearSolver<OperatorTypeH, VectorTypeH, ValueTypeH>¶ Mixed-Precision Defect Correction Scheme.
The Mixed-Precision solver is based on a defect-correction scheme. The current implementation of the library uses host-based correction in double precision and accelerator computation in single precision. The solver implements the scheme
\[ x_{k+1} = x_{k} + A^{-1} r_{k}, \]
where the computation of the residual \(r_{k} = b - Ax_{k}\) and the update \(x_{k+1} = x_{k} + d_{k}\) are performed on the host in double precision. The solution of the residual system \(Ad_{k} = r_{k}\) is performed on the accelerator in single precision. In addition to the setup functions of the iterative solver, the user needs to specify the inner (\(Ad_{k} = r_{k}\)) solver.
- Template Parameters
OperatorTypeH: - can be LocalMatrix
VectorTypeH: - can be LocalVector
ValueTypeH: - can be double
OperatorTypeL: - can be LocalMatrix
VectorTypeL: - can be LocalVector
ValueTypeL: - can be float
-
void
rocalution::MixedPrecisionDC::Set(Solver<OperatorTypeL, VectorTypeL, ValueTypeL> &Solver_L)¶ Set the inner solver for \(Ad_{k} = r_{k}\).
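The split between a double-precision outer loop and a single-precision inner solve can be sketched on the host alone. In this illustration the inner solver is a fixed number of Jacobi sweeps in `float`; that choice, the dense matrix type and all function names are assumptions made for the example, and no accelerator is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using VecH = std::vector<double>; // "high" precision
using MatH = std::vector<VecH>;   // small dense matrix (illustration only)
using VecL = std::vector<float>;  // "low" precision

// Inner solver in single precision: approximate d in A d = r with Jacobi sweeps.
VecL inner_solve_float(const MatH& A, const VecL& r, int sweeps)
{
    const std::size_t n = r.size();
    VecL d(n, 0.0f);
    VecL next(n, 0.0f);
    for(int s = 0; s < sweeps; ++s)
    {
        for(std::size_t i = 0; i < n; ++i)
        {
            float sum = r[i];
            for(std::size_t j = 0; j < n; ++j)
            {
                if(j != i)
                {
                    sum -= static_cast<float>(A[i][j]) * d[j];
                }
            }
            next[i] = sum / static_cast<float>(A[i][i]);
        }
        d = next;
    }
    return d;
}

// Outer loop in double precision: r_k = b - A x_k, then x_{k+1} = x_k + d_k.
VecH mixed_precision_dc(const MatH& A, const VecH& b, int outer_iterations)
{
    const std::size_t n = b.size();
    assert(A.size() == n);
    VecH x(n, 0.0);
    for(int k = 0; k < outer_iterations; ++k)
    {
        VecL r_low(n, 0.0f);
        for(std::size_t i = 0; i < n; ++i)
        {
            double r = b[i];
            for(std::size_t j = 0; j < n; ++j)
            {
                r -= A[i][j] * x[j];
            }
            r_low[i] = static_cast<float>(r); // residual is computed in double, then demoted
        }
        const VecL d = inner_solve_float(A, r_low, 20);
        for(std::size_t i = 0; i < n; ++i)
        {
            x[i] += static_cast<double>(d[i]);
        }
    }
    return x;
}
```

Each outer step can only gain roughly single-precision accuracy, but because the residual is recomputed in double precision the iterates keep improving until double-precision accuracy is reached.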
-
template<class OperatorType, class VectorType, typename ValueType>
class Chebyshev: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Chebyshev Iteration Scheme.
The Chebyshev Iteration scheme (also known as acceleration scheme) is similar to the CG method but requires the minimum and maximum eigenvalues of the operator. [templates]
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::Chebyshev::Set(ValueType lambda_min, ValueType lambda_max)¶ Set the minimum and maximum eigenvalues of the operator.
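To show how the eigenvalue bounds enter the method, here is a minimal unpreconditioned Chebyshev iteration in its textbook form. It is a sketch for a small dense symmetric positive definite matrix with hypothetical helper names, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // small dense matrix (illustration only)

// r = b - A x
Vec residual(const Mat& A, const Vec& b, const Vec& x)
{
    Vec r(b);
    for(std::size_t i = 0; i < b.size(); ++i)
    {
        for(std::size_t j = 0; j < x.size(); ++j)
        {
            r[i] -= A[i][j] * x[j];
        }
    }
    return r;
}

// Textbook Chebyshev iteration without preconditioning, starting from x_0 = 0.
Vec chebyshev_sketch(const Mat& A, const Vec& b, double lambda_min, double lambda_max, int iterations)
{
    assert(0.0 < lambda_min && lambda_min < lambda_max && A.size() == b.size());
    const std::size_t n = b.size();
    const double      d = 0.5 * (lambda_max + lambda_min); // center of the spectrum
    const double      c = 0.5 * (lambda_max - lambda_min); // half-width of the spectrum
    Vec    x(n, 0.0);
    Vec    r = residual(A, b, x);
    Vec    p(n, 0.0);
    double alpha = 0.0;
    for(int k = 1; k <= iterations; ++k)
    {
        if(k == 1)
        {
            p     = r;
            alpha = 1.0 / d;
        }
        else
        {
            const double ca   = c * alpha;
            const double beta = (k == 2) ? 0.5 * ca * ca : 0.25 * ca * ca;
            alpha             = 1.0 / (d - beta / alpha);
            for(std::size_t i = 0; i < n; ++i)
            {
                p[i] = r[i] + beta * p[i];
            }
        }
        for(std::size_t i = 0; i < n; ++i)
        {
            x[i] += alpha * p[i];
        }
        r = residual(A, b, x);
    }
    return x;
}
```

Unlike CG, the step sizes depend only on the two eigenvalue bounds, so no inner products are needed during the iteration.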
-
template<class OperatorType, class VectorType, typename ValueType>
class BiCGStab: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Bi-Conjugate Gradient Stabilized Method.
The Bi-Conjugate Gradient Stabilized method is a variation of CGS and solves sparse (non) symmetric linear systems \(Ax=b\). [SAAD]
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class BiCGStabl: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Bi-Conjugate Gradient Stabilized (l) Method.
The Bi-Conjugate Gradient Stabilized (l) method is a generalization of BiCGStab for solving sparse (non) symmetric linear systems \(Ax=b\). It minimizes residuals over \(l\)-dimensional Krylov subspaces. The degree \(l\) can be set with SetOrder(). [bicgstabl]
- Template Parameters
OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class CG: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Conjugate Gradient Method.
The Conjugate Gradient method is the best known iterative method for solving sparse symmetric positive definite (SPD) linear systems \(Ax=b\). It is based on orthogonal projection onto the Krylov subspace \(\mathcal{K}_{m}(r_{0}, A)\), where \(r_{0}\) is the initial residual. The method can be preconditioned, where the approximation should also be SPD. [SAAD]
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
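For readers who want to see the method itself, the following is a minimal unpreconditioned Conjugate Gradient sketch for a small dense SPD matrix. The helper names and the dense storage are hypothetical and serve only to keep the example self-contained; it is not the library's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // small dense SPD matrix (illustration only)

double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        s += a[i] * b[i];
    }
    return s;
}

Vec multiply(const Mat& A, const Vec& x)
{
    Vec y(A.size(), 0.0);
    for(std::size_t i = 0; i < A.size(); ++i)
    {
        for(std::size_t j = 0; j < x.size(); ++j)
        {
            y[i] += A[i][j] * x[j];
        }
    }
    return y;
}

// Unpreconditioned CG starting from x_0 = 0; stops when ||r||_2 <= tol or after max_iter steps.
Vec cg_sketch(const Mat& A, const Vec& b, double tol, int max_iter)
{
    assert(A.size() == b.size());
    const std::size_t n = b.size();
    Vec    x(n, 0.0);
    Vec    r  = b; // r_0 = b - A x_0 with x_0 = 0
    Vec    p  = r;
    double rr = dot(r, r);
    for(int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k)
    {
        const Vec    Ap    = multiply(A, p);
        const double alpha = rr / dot(p, Ap);
        for(std::size_t i = 0; i < n; ++i)
        {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta   = rr_new / rr;
        for(std::size_t i = 0; i < n; ++i)
        {
            p[i] = r[i] + beta * p[i];
        }
        rr = rr_new;
    }
    return x;
}
```

In exact arithmetic CG reaches the solution of an \(n \times n\) SPD system in at most \(n\) steps, so the 2x2 case needs only two iterations.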
-
template<class OperatorType, class VectorType, typename ValueType>
class CR: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Conjugate Residual Method.
The Conjugate Residual method is an iterative method for solving sparse symmetric positive semi-definite linear systems \(Ax=b\). It is a Krylov subspace method and differs from the much more popular Conjugate Gradient method in that the system matrix is not required to be positive definite. The method can be preconditioned, where the approximation should also be SPD or positive semi-definite. [SAAD]
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class FCG: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Flexible Conjugate Gradient Method.
The Flexible Conjugate Gradient method is an iterative method for solving sparse symmetric positive definite linear systems \(Ax=b\). It is similar to the Conjugate Gradient method, with the only difference being that it does not require the preconditioner \(M^{-1}\) to be a constant operator. This can be especially helpful if the operation \(M^{-1}x\) is the result of another iterative process. [fcg]
- Template Parameters
OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class GMRES: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Generalized Minimum Residual Method.
The Generalized Minimum Residual method (GMRES) is a projection method for solving sparse (non) symmetric linear systems \(Ax=b\), based on a restarting technique. The solution is approximated in a Krylov subspace \(\mathcal{K}=\mathcal{K}_{m}\) and \(\mathcal{L}=A\mathcal{K}_{m}\) with minimal residual, where \(\mathcal{K}_{m}\) is the \(m\)-th Krylov subspace with \(v_{1} = r_{0}/||r_{0}||_{2}\). [SAAD]
The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class FGMRES: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Flexible Generalized Minimum Residual Method.
The Flexible Generalized Minimum Residual method (FGMRES) is a projection method for solving sparse (non) symmetric linear systems \(Ax=b\). It is similar to the GMRES method, with the only difference being that FGMRES is based on a window shifting of the Krylov subspace and thus does not require the preconditioner \(M^{-1}\) to be a constant operator. This can be especially helpful if the operation \(M^{-1}x\) is the result of another iterative process. [SAAD]
The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class IDR: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Induced Dimension Reduction Method.
The Induced Dimension Reduction method is a Krylov subspace method for solving sparse (non) symmetric linear systems \(Ax=b\). IDR(s) generates residuals in a sequence of nested subspaces. [IDR1] [IDR2]
The dimension of the shadow space can be set by SetShadowSpace(). The default size of the shadow space is 4.
- Template Parameters
OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::IDR::SetRandomSeed(unsigned long long seed)¶ Set random seed for ONB creation (seed must be greater than 0)
-
template<class OperatorType, class VectorType, typename ValueType>
class QMRCGStab: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Quasi-Minimal Residual Conjugate Gradient Stabilized Method.
The Quasi-Minimal Residual Conjugate Gradient Stabilized method is a variant of the Krylov subspace BiCGStab method for solving sparse (non-)symmetric linear systems \(Ax=b\). [qmrcgstab]
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class BaseMultiGrid: public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶ Base class for all multigrid solvers. [Trottenberg2003]
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::BaseAMG< OperatorType, VectorType, ValueType >, rocalution::MultiGrid< OperatorType, VectorType, ValueType >
-
void
rocalution::BaseMultiGrid::SetSolver(Solver<OperatorType, VectorType, ValueType> &solver)¶ Set the coarse grid solver.
-
void
rocalution::BaseMultiGrid::SetSmoother(IterativeLinearSolver<OperatorType, VectorType, ValueType> **smoother)¶ Set the smoother for each level.
-
void
rocalution::BaseMultiGrid::SetSmootherPreIter(int iter)¶ Set the number of pre-smoothing steps.
-
void
rocalution::BaseMultiGrid::SetSmootherPostIter(int iter)¶ Set the number of post-smoothing steps.
-
virtual void
rocalution::BaseMultiGrid::SetRestrictOperator(OperatorType **op) = 0¶ Set the restriction operator for each level.
-
virtual void
rocalution::BaseMultiGrid::SetProlongOperator(OperatorType **op) = 0¶ Set the prolongation operator for each level.
-
virtual void
rocalution::BaseMultiGrid::SetOperatorHierarchy(OperatorType **op) = 0¶ Set the operator for each level.
-
void
rocalution::BaseMultiGrid::SetScaling(bool scaling)¶ Enable/disable scaling of intergrid transfers.
-
void
rocalution::BaseMultiGrid::SetHostLevels(int levels)¶ Force computation of coarser levels on the host backend.
-
void
rocalution::BaseMultiGrid::SetCycle(unsigned int cycle)¶ Set the MultiGrid Cycle (default: Vcycle)
-
void
rocalution::BaseMultiGrid::SetKcycleFull(bool kcycle_full)¶ Set the MultiGrid Kcycle on all levels or only on finest level.
-
void
rocalution::BaseMultiGrid::InitLevels(int levels)¶ Set the depth of the multigrid solver.
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiGrid: public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType>¶ MultiGrid Method.
The MultiGrid method can be used with external data, such as externally computed restriction, prolongation and operator hierarchy. The user needs to pass all of this information to the solver for each level and for its construction. This includes the smoothing step, prolongation/restriction, grid traversal and the coarse grid solver. [Trottenberg2003]
Restriction and prolongation operations can be performed in two ways, based on Restriction() and Prolongation() of the LocalVector class, or by matrix-vector multiplication. This is configured by a set function.
Smoothers can be any iterative linear solver. Valid options are Jacobi, Gauss-Seidel, ILU, etc. using a FixedPoint iteration scheme with a pre-defined number of iterations. The smoothers could also be a solver such as CG, BiCGStab, etc.
The coarse grid solver can be any iterative linear solver type. The class also provides mechanisms to specify where the coarse grid solver is executed, on the host or on the accelerator. The coarse grid solver can be preconditioned.
Grid scaling is based on an \(L_2\) norm ratio.
Operator matrices need to be passed on each grid level.
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class BaseAMG: public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType>¶ Base class for all algebraic multigrid solvers.
The Algebraic MultiGrid solver is based on the BaseMultiGrid class. The coarsening is obtained by different aggregation techniques. The smoothers can be constructed inside or outside of the class.
All parameters in the Algebraic MultiGrid class can be set externally, including smoothers and coarse grid solver.
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::GlobalPairwiseAMG< OperatorType, VectorType, ValueType >, rocalution::PairwiseAMG< OperatorType, VectorType, ValueType >, rocalution::RugeStuebenAMG< OperatorType, VectorType, ValueType >, rocalution::SAAMG< OperatorType, VectorType, ValueType >, rocalution::UAAMG< OperatorType, VectorType, ValueType >
-
void
rocalution::BaseAMG::SetCoarsestLevel(int coarse_size)¶ Set coarsest level for hierarchy creation.
-
void
rocalution::BaseAMG::SetManualSmoothers(bool sm_manual)¶ Set flag to pass smoothers manually for each level.
-
void
rocalution::BaseAMG::SetManualSolver(bool s_manual)¶ Set flag to pass coarse grid solver manually.
-
void
rocalution::BaseAMG::SetDefaultSmootherFormat(unsigned int op_format)¶ Set the smoother operator format.
-
template<class OperatorType, class VectorType, typename ValueType>
class UAAMG: public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶ Unsmoothed Aggregation Algebraic MultiGrid Method.
The Unsmoothed Aggregation Algebraic MultiGrid method is based on an unsmoothed-aggregation interpolation scheme. [stuben]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::UAAMG::SetOverInterp(ValueType overInterp)¶ Set over-interpolation parameter for aggregation.
-
template<class OperatorType, class VectorType, typename ValueType>
class SAAMG: public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶ Smoothed Aggregation Algebraic MultiGrid Method.
The Smoothed Aggregation Algebraic MultiGrid method is based on a smoothed-aggregation interpolation scheme. [vanek]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class RugeStuebenAMG: public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶ Ruge-Stueben Algebraic MultiGrid Method.
The Ruge-Stueben Algebraic MultiGrid method is based on the classic Ruge-Stueben coarsening with direct interpolation. The solver is highly efficient in terms of solver complexity (i.e. number of iterations). However, most of the time its building step is more expensive and it requires more memory. [stuben]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::RugeStuebenAMG::SetCouplingStrength(ValueType eps)¶ Set coupling strength.
-
template<class OperatorType, class VectorType, typename ValueType>
class PairwiseAMG: public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶ Pairwise Aggregation Algebraic MultiGrid Method.
The Pairwise Aggregation Algebraic MultiGrid method is based on a pairwise aggregation matching scheme. It delivers a very efficient building phase, which is suitable for Poisson-like equations. Most of the time it requires a K-cycle in the solving phase to achieve a low number of iterations. [pairwiseamg]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::PairwiseAMG::SetBeta(ValueType beta)¶ Set beta for pairwise aggregation.
-
void
rocalution::PairwiseAMG::SetOrdering(unsigned int ordering)¶ Set re-ordering for aggregation.
-
void
rocalution::PairwiseAMG::SetCoarseningFactor(double factor)¶ Set target coarsening factor.
-
template<class OperatorType, class VectorType, typename ValueType>
class GlobalPairwiseAMG: public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶ Pairwise Aggregation Algebraic MultiGrid Method (multi-node)
The Pairwise Aggregation Algebraic MultiGrid method is based on a pairwise aggregation matching scheme. It delivers a very efficient building phase, which is suitable for Poisson-like equations. Most of the time it requires a K-cycle in the solving phase to achieve a low number of iterations. This version has multi-node support. [pairwiseamg]
- Template Parameters
OperatorType: can be GlobalMatrix
VectorType: can be GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::GlobalPairwiseAMG::SetBeta(ValueType beta)¶ Set beta for pairwise aggregation.
-
void
rocalution::GlobalPairwiseAMG::SetOrdering(const _aggregation_ordering ordering)¶ Set re-ordering for aggregation.
-
void
rocalution::GlobalPairwiseAMG::SetCoarseningFactor(double factor)¶ Set target coarsening factor.
-
template<class OperatorType, class VectorType, typename ValueType>
class DirectLinearSolver: public rocalution::Solver<OperatorType, VectorType, ValueType>¶ Base class for all direct linear solvers.
The library provides three direct methods - LU, QR and Inversion (based on QR decomposition). The user can pass a sparse matrix; internally it is converted to dense format and then the selected method is applied. These methods are not optimized and, because the matrix is converted to a dense format, they should be used only for very small matrices.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::Inversion< OperatorType, VectorType, ValueType >, rocalution::LU< OperatorType, VectorType, ValueType >, rocalution::QR< OperatorType, VectorType, ValueType >
-
template<class OperatorType, class VectorType, typename ValueType>
class Inversion: public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶ Matrix Inversion.
Full matrix inversion based on QR decomposition.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class LU: public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶ LU Decomposition.
Lower-Upper Decomposition factors a given square matrix into a lower and an upper triangular matrix, such that \(A = LU\).
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
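To make the factorization \(A = LU\) concrete, the following is a minimal dense sketch of a Doolittle factorization (unit diagonal in \(L\), no pivoting). It is an illustration only and does not claim to reflect how the library implements its LU class; the helper name lu_decompose and the Matrix alias are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of rocALUTION): Doolittle LU factorization
// without pivoting, A = L * U, where L has a unit diagonal.
void lu_decompose(const Matrix& A, Matrix& L, Matrix& U)
{
    const std::size_t n = A.size();
    for(const auto& row : A)
        assert(row.size() == n); // a square matrix is required

    L.assign(n, std::vector<double>(n, 0.0));
    U.assign(n, std::vector<double>(n, 0.0));

    for(std::size_t i = 0; i < n; ++i)
    {
        // Row i of U
        for(std::size_t j = i; j < n; ++j)
        {
            double sum = 0.0;
            for(std::size_t k = 0; k < i; ++k)
                sum += L[i][k] * U[k][j];
            U[i][j] = A[i][j] - sum;
        }
        assert(U[i][i] != 0.0); // zero pivots are not handled in this sketch

        // Column i of L
        L[i][i] = 1.0;
        for(std::size_t j = i + 1; j < n; ++j)
        {
            double sum = 0.0;
            for(std::size_t k = 0; k < i; ++k)
                sum += L[j][k] * U[k][i];
            L[j][i] = (A[j][i] - sum) / U[i][i];
        }
    }
}
```

Calling lu_decompose on a small matrix fills L and U so that their product reproduces A; a production solver would normally add pivoting for numerical stability.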
-
template<class OperatorType, class VectorType, typename ValueType>
class QR: public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶ QR Decomposition.
The QR Decomposition decomposes a given matrix into \(A = QR\), such that \(Q\) is an orthogonal matrix and \(R\) an upper triangular matrix.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
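As an illustration of \(A = QR\), the sketch below factors a small square, full-rank dense matrix with modified Gram-Schmidt. Gram-Schmidt is chosen here only for brevity; the library may use a different algorithm (for example Householder reflections). The helper name qr_decompose and the Matrix alias are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of rocALUTION): modified Gram-Schmidt QR of a
// square, full-rank matrix. On return Q has orthonormal columns and R is
// upper triangular with A = Q * R.
void qr_decompose(const Matrix& A, Matrix& Q, Matrix& R)
{
    const std::size_t n = A.size();
    for(const auto& row : A)
        assert(row.size() == n);

    Q = A; // columns of Q are orthonormalized in place
    R.assign(n, std::vector<double>(n, 0.0));

    for(std::size_t k = 0; k < n; ++k)
    {
        // Normalize column k
        double norm = 0.0;
        for(std::size_t i = 0; i < n; ++i)
            norm += Q[i][k] * Q[i][k];
        norm = std::sqrt(norm);
        assert(norm > 0.0); // full rank is assumed
        R[k][k] = norm;
        for(std::size_t i = 0; i < n; ++i)
            Q[i][k] /= norm;

        // Remove the component along column k from the remaining columns
        for(std::size_t j = k + 1; j < n; ++j)
        {
            double dot = 0.0;
            for(std::size_t i = 0; i < n; ++i)
                dot += Q[i][k] * Q[i][j];
            R[k][j] = dot;
            for(std::size_t i = 0; i < n; ++i)
                Q[i][j] -= dot * Q[i][k];
        }
    }
}
```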
Preconditioners¶
-
template<class OperatorType, class VectorType, typename ValueType>
class Preconditioner: public rocalution::Solver<OperatorType, VectorType, ValueType>¶ Base class for all preconditioners.
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::AIChebyshev< OperatorType, VectorType, ValueType >, rocalution::AS< OperatorType, VectorType, ValueType >, rocalution::BlockJacobi< OperatorType, VectorType, ValueType >, rocalution::BlockPreconditioner< OperatorType, VectorType, ValueType >, rocalution::DiagJacobiSaddlePointPrecond< OperatorType, VectorType, ValueType >, rocalution::FSAI< OperatorType, VectorType, ValueType >, rocalution::GS< OperatorType, VectorType, ValueType >, rocalution::IC< OperatorType, VectorType, ValueType >, rocalution::ILU< OperatorType, VectorType, ValueType >, rocalution::ILUT< OperatorType, VectorType, ValueType >, rocalution::Jacobi< OperatorType, VectorType, ValueType >, rocalution::MultiColored< OperatorType, VectorType, ValueType >, rocalution::MultiElimination< OperatorType, VectorType, ValueType >, rocalution::SGS< OperatorType, VectorType, ValueType >, rocalution::SPAI< OperatorType, VectorType, ValueType >, rocalution::TNS< OperatorType, VectorType, ValueType >, rocalution::VariablePreconditioner< OperatorType, VectorType, ValueType >
-
template<class OperatorType, class VectorType, typename ValueType>
class AIChebyshev: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Approximate Inverse - Chebyshev Preconditioner.
The Approximate Inverse - Chebyshev Preconditioner is an inverse matrix preconditioner with values from a linear combination of matrix-valued Chebyshev polynomials. [chebpoly]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::AIChebyshev::Set(int p, ValueType lambda_min, ValueType lambda_max)¶ Set order, min and max eigenvalues.
-
template<class OperatorType, class VectorType, typename ValueType>
class FSAI: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Factorized Approximate Inverse Preconditioner.
The Factorized Sparse Approximate Inverse preconditioner computes a direct approximation of \(M^{-1}\) by minimizing the Frobenius norm \(||I - GL||_{F}\), where \(L\) denotes the exact lower triangular part of \(A\) and \(G:=M^{-1}\). The FSAI preconditioner is initialized by \(q\), based on the sparsity pattern of \(|A^{q}|\). However, it is also possible to supply external sparsity patterns in the form of the LocalMatrix class. [kolotilina]
- Note
The FSAI preconditioner is only suited for symmetric positive definite matrices.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::FSAI::Set(int power) Set the power of the system matrix sparsity pattern.
-
void
rocalution::FSAI::Set(const OperatorType &pattern) Set an external sparsity pattern.
-
void
rocalution::FSAI::SetPrecondMatrixFormat(unsigned int mat_format)¶ Set the matrix format of the preconditioner.
-
template<class OperatorType, class VectorType, typename ValueType>
class SPAI: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ SParse Approximate Inverse Preconditioner.
The SParse Approximate Inverse algorithm is an explicitly computed preconditioner for general sparse linear systems. In its current implementation, only the sparsity pattern of the system matrix is supported. The SPAI computation is based on the minimization of the Frobenius norm \(||AM - I||_{F}\). [grote]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::SPAI::SetPrecondMatrixFormat(unsigned int mat_format)¶ Set the matrix format of the preconditioner.
-
template<class OperatorType, class VectorType, typename ValueType>
class TNS: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Truncated Neumann Series Preconditioner.
The Truncated Neumann Series (TNS) preconditioner is based on \(M^{-1} = K^{T} D^{-1} K\), where \(K=(I-LD^{-1}+(LD^{-1})^{2})\), with the diagonal \(D\) of \(A\) and the strictly lower triangular part \(L\) of \(A\). The preconditioner can be computed in two forms - explicitly and implicitly. In the implicit form, the full construction of \(M\) is performed via matrix-matrix operations, whereas in the explicit form, the application of the preconditioner is based on matrix-vector operations only. The matrix format for the stored matrices can be specified.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
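The formula \(M^{-1} = K^{T} D^{-1} K\) can be applied to a vector without ever forming \(M\), using only products with \(LD^{-1}\) and its transpose. The following dense sketch shows that variant under the definitions above; the function names and type aliases are hypothetical and the code is not taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// w = (L D^{-1}) v, with L the strictly lower part and D the diagonal of A
Vector apply_LDinv(const Matrix& A, const Vector& v)
{
    const std::size_t n = A.size();
    Vector w(n, 0.0);
    for(std::size_t i = 0; i < n; ++i)
        for(std::size_t j = 0; j < i; ++j)
            w[i] += A[i][j] * v[j] / A[j][j];
    return w;
}

// w = (L D^{-1})^T v
Vector apply_LDinv_T(const Matrix& A, const Vector& v)
{
    const std::size_t n = A.size();
    Vector w(n, 0.0);
    for(std::size_t i = 0; i < n; ++i)
        for(std::size_t j = 0; j < i; ++j)
            w[j] += A[i][j] / A[j][j] * v[i];
    return w;
}

// Hypothetical helper: returns M^{-1} x = K^T D^{-1} K x
// with K = I - L D^{-1} + (L D^{-1})^2.
Vector tns_apply(const Matrix& A, const Vector& x)
{
    const std::size_t n = A.size();
    assert(x.size() == n);

    // z = D^{-1} K x
    const Vector t1 = apply_LDinv(A, x);
    const Vector t2 = apply_LDinv(A, t1);
    Vector z(n);
    for(std::size_t i = 0; i < n; ++i)
        z[i] = (x[i] - t1[i] + t2[i]) / A[i][i];

    // result = K^T z
    const Vector u1 = apply_LDinv_T(A, z);
    const Vector u2 = apply_LDinv_T(A, u1);
    Vector r(n);
    for(std::size_t i = 0; i < n; ++i)
        r[i] = z[i] - u1[i] + u2[i];
    return r;
}
```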
-
void
rocalution::TNS::SetPrecondMatrixFormat(unsigned int mat_format)¶ Set the matrix format of the preconditioner.
-
template<class OperatorType, class VectorType, typename ValueType>
class AS: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Additive Schwarz Preconditioner.
The Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system \(Ax=b\) can be decomposed into small sub-problems based on \(A_{i} = R_{i}^{T}AR_{i}\), where \(R_{i}\) are restriction operators. Those restriction operators produce sub-matrices which overlap. This leads to contributions from two preconditioners on the overlapped area which are scaled by \(1/2\). [RAS]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::RAS< OperatorType, VectorType, ValueType >
-
void
rocalution::AS::Set(int nb, int overlap, Solver<OperatorType, VectorType, ValueType> **preconds)¶ Set number of blocks, overlap and array of preconditioners.
-
template<class OperatorType, class VectorType, typename ValueType>
class RAS: public rocalution::AS<OperatorType, VectorType, ValueType>¶ Restricted Additive Schwarz Preconditioner.
The Restricted Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system \(Ax=b\) can be decomposed into small sub-problems based on \(A_{i} = R_{i}^{T}AR_{i}\), where \(R_{i}\) are restriction operators. The RAS method is a mixture of block Jacobi and the AS scheme. In this case, the sub-matrices contain overlapped areas from other blocks, too. [RAS]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class BlockJacobi: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Block-Jacobi Preconditioner.
The Block-Jacobi preconditioner is designed to wrap any local preconditioner and apply it in a global block fashion locally on each interior matrix.
- Template Parameters
OperatorType: can be GlobalMatrix
VectorType: can be GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::BlockJacobi::Set(Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> &precond)¶ Set local preconditioner.
-
template<class OperatorType, class VectorType, typename ValueType>
class BlockPreconditioner: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Block-Preconditioner.
When handling vector fields, typically one can try to use different preconditioners and/or solvers for the different blocks. For such problems, the library provides a block-type preconditioner. This preconditioner builds the following block-type matrix
\[\begin{split} P = \begin{pmatrix} A_{d} & 0 & . & 0 \\ B_{1} & B_{d} & . & 0 \\ . & . & . & . \\ Z_{1} & Z_{2} & . & Z_{d} \end{pmatrix} \end{split}\]
The solution of \(P\) can be performed in two ways. It can be solved by block-lower-triangular sweeps with inversion of the blocks \(A_{d} \ldots Z_{d}\) and with a multiplication of the corresponding blocks. This is set by SetLSolver() (which is the default solution scheme). Alternatively, it can be used only with an inverse of the diagonal \(A_{d} \ldots Z_{d}\) (Block-Jacobi type) by using SetDiagonalSolver().
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::BlockPreconditioner::Set(int n, const int *size, Solver<OperatorType, VectorType, ValueType> **D_solver)¶ Set number, size and diagonal solver.
-
void
rocalution::BlockPreconditioner::SetDiagonalSolver(void)¶ Set diagonal solver mode.
-
void
rocalution::BlockPreconditioner::SetLSolver(void)¶ Set lower triangular sweep mode.
-
void
rocalution::BlockPreconditioner::SetExternalLastMatrix(const OperatorType &mat)¶ Set external last block matrix.
-
void
rocalution::BlockPreconditioner::SetPermutation(const LocalVector<int> &perm)¶ Set permutation vector.
-
template<class OperatorType, class VectorType, typename ValueType>
class Jacobi: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Jacobi Method.
The Jacobi method is for solving a diagonally dominant system of linear equations \(Ax=b\). It solves for each diagonal element iteratively until convergence, such that
\[ x_{i}^{(k+1)} = (1 - \omega)x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right) \]
- Template Parameters
OperatorType: can be LocalMatrix or GlobalMatrix
VectorType: can be LocalVector or GlobalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
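A single sweep of the weighted Jacobi update can be sketched on a small dense matrix as follows. Every component is computed from the previous iterate only, and the sum runs over all off-diagonal entries \(j \neq i\). The function name jacobi_sweep and the type aliases are hypothetical; this is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Hypothetical helper: one weighted Jacobi sweep. The new iterate is built
// exclusively from the old iterate x, so all rows could be updated in parallel.
Vector jacobi_sweep(const Matrix& A, const Vector& b, const Vector& x, double omega)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);

    Vector x_new(n);
    for(std::size_t i = 0; i < n; ++i)
    {
        // Sum over the off-diagonal entries of row i
        double sigma = 0.0;
        for(std::size_t j = 0; j < n; ++j)
            if(j != i)
                sigma += A[i][j] * x[j];

        x_new[i] = (1.0 - omega) * x[i] + omega / A[i][i] * (b[i] - sigma);
    }
    return x_new;
}
```

Repeated calls, each fed with the previous result, approach the solution for diagonally dominant matrices; with omega set to 1 the update reduces to the classical Jacobi iteration.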
-
template<class OperatorType, class VectorType, typename ValueType>
class GS: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Gauss-Seidel / Successive Over-Relaxation Method.
The Gauss-Seidel / SOR method is for solving a system of linear equations \(Ax=b\). It approximates the solution iteratively with
\[ x_{i}^{(k+1)} = (1 - \omega) x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k+1)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right), \]
with \(\omega \in (0,2)\).
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
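The difference to Jacobi is that each updated component is used immediately within the same sweep. A minimal dense sketch of one Gauss-Seidel / SOR sweep, with a hypothetical function name sor_sweep and hypothetical type aliases, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Hypothetical helper: one in-place SOR sweep. When row i is processed,
// x[j] already holds the new value for j < i and the old value for j > i.
void sor_sweep(const Matrix& A, const Vector& b, Vector& x, double omega)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);

    for(std::size_t i = 0; i < n; ++i)
    {
        double sigma = 0.0;
        for(std::size_t j = 0; j < n; ++j)
            if(j != i)
                sigma += A[i][j] * x[j];

        x[i] = (1.0 - omega) * x[i] + omega / A[i][i] * (b[i] - sigma);
    }
}
```

With omega equal to 1 this is plain Gauss-Seidel. The sequential dependency between rows is what the multi-colored variants described later try to relax.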
-
template<class OperatorType, class VectorType, typename ValueType>
class SGS: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Symmetric Gauss-Seidel / Symmetric Successive Over-Relaxation Method.
The Symmetric Gauss-Seidel / SSOR method is for solving a system of linear equations \(Ax=b\). It approximates the solution iteratively.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class ILU: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Incomplete LU Factorization based on levels.
The Incomplete LU Factorization based on levels computes a sparse lower and sparse upper triangular matrix such that \(A = LU - R\).
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::ILU::Set(int p, bool level = true)¶ Initialize ILU(p) factorization.
Initialize ILU(p) factorization based on power. [SAAD]
level = true build the structure based on levels
level = false build the structure only based on the power(p+1)
-
template<class OperatorType, class VectorType, typename ValueType>
class ILUT: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Incomplete LU Factorization based on threshold.
The Incomplete LU Factorization based on threshold computes a sparse lower and sparse upper triangular matrix such that \(A = LU - R\). Fill-in values are dropped depending on a threshold and number of maximal fill-ins per row. [SAAD]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::ILUT::Set(double t) Set drop-off threshold.
-
void
rocalution::ILUT::Set(double t, int maxrow) Set drop-off threshold and maximum fill-ins per row.
-
template<class OperatorType, class VectorType, typename ValueType>
class IC: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Incomplete Cholesky Factorization without fill-ins.
The Incomplete Cholesky Factorization computes a sparse lower triangular matrix such that \(A=LL^{T} - R\). Additional fill-ins are dropped and the sparsity pattern of the original matrix is preserved.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
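What "without fill-ins" means can be seen in a small dense sketch: the factorization follows the usual Cholesky recurrence, but any entry that is zero in \(A\) is kept zero in \(L\). Dense storage is used purely for readability and the helper name ic0 is hypothetical; the library operates on sparse formats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: incomplete Cholesky with zero fill-in, IC(0), for a
// symmetric positive definite matrix stored densely. Entries of L are only
// computed where A itself is non-zero.
Matrix ic0(const Matrix& A)
{
    const std::size_t n = A.size();
    Matrix L(n, std::vector<double>(n, 0.0));

    for(std::size_t k = 0; k < n; ++k)
    {
        // Diagonal entry of column k
        double diag = A[k][k];
        for(std::size_t j = 0; j < k; ++j)
            diag -= L[k][j] * L[k][j];
        assert(diag > 0.0); // breakdown is not handled in this sketch
        L[k][k] = std::sqrt(diag);

        // Sub-diagonal entries of column k, restricted to the pattern of A
        for(std::size_t i = k + 1; i < n; ++i)
        {
            if(A[i][k] == 0.0)
                continue; // outside the sparsity pattern: drop the fill-in

            double s = A[i][k];
            for(std::size_t j = 0; j < k; ++j)
                s -= L[i][j] * L[k][j];
            L[i][k] = s / L[k][k];
        }
    }
    return L;
}
```

For a matrix whose exact Cholesky factor would contain an entry at a position where \(A\) is zero, that entry stays zero here, and the difference ends up in the remainder \(R\).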
-
template<class OperatorType, class VectorType, typename ValueType>
class VariablePreconditioner: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Variable Preconditioner.
The Variable Preconditioner can hold a selection of preconditioners. Thus, any type of preconditioners can be combined. For example, the variable preconditioner can combine Jacobi, GS and ILU: the first iteration of the iterative solver applies Jacobi, the second iteration applies GS and the third iteration applies ILU. After that, the solver starts again with Jacobi, GS, ILU.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::VariablePreconditioner::SetPreconditioner(int n, Solver<OperatorType, VectorType, ValueType> **precond)¶ Set the preconditioner sequence.
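The cycling behaviour described for the Variable Preconditioner amounts to a round-robin selection over the stored sequence. The sketch below only models that selection rule with strings as stand-ins for preconditioner objects; the function name select_preconditioner is hypothetical and the zero-based iteration counter is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the entry of the sequence that would be
// applied at the given (zero-based) solver iteration, cycling through the
// sequence in order.
const std::string& select_preconditioner(const std::vector<std::string>& sequence,
                                         std::size_t iteration)
{
    assert(!sequence.empty());
    return sequence[iteration % sequence.size()];
}
```

With the sequence Jacobi, GS, ILU, iterations 0, 1 and 2 select Jacobi, GS and ILU, and iteration 3 wraps around to Jacobi again. Because the preconditioner changes between iterations, a flexible outer solver such as FGMRES is the natural match.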
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiColored: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Base class for all multi-colored preconditioners.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::MultiColoredILU< OperatorType, VectorType, ValueType >, rocalution::MultiColoredSGS< OperatorType, VectorType, ValueType >
-
void
rocalution::MultiColored::SetPrecondMatrixFormat(unsigned int mat_format)¶ Set a specific matrix type of the decomposed block matrices.
-
void
rocalution::MultiColored::SetDecomposition(bool decomp)¶ Set if the preconditioner should be decomposed or not.
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredSGS: public rocalution::MultiColored<OperatorType, VectorType, ValueType>¶ Multi-Colored Symmetric Gauss-Seidel / SSOR Preconditioner.
The Multi-Colored Symmetric Gauss-Seidel / SSOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward and backward substitution is obtained by performing a multi-colored decomposition. Details on the Symmetric Gauss-Seidel / SSOR algorithm can be found in the SGS preconditioner.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
Subclassed by rocalution::MultiColoredGS< OperatorType, VectorType, ValueType >
-
void
rocalution::MultiColoredSGS::SetRelaxation(ValueType omega)¶ Set the relaxation parameter for the SOR/SSOR scheme.
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredGS: public rocalution::MultiColoredSGS<OperatorType, VectorType, ValueType>¶ Multi-Colored Gauss-Seidel / SOR Preconditioner.
The Multi-Colored Gauss-Seidel / SOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward substitution is obtained by performing a multi-colored decomposition. Details on the Gauss-Seidel / SOR algorithm can be found in the GS preconditioner.
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiColoredILU: public rocalution::MultiColored<OperatorType, VectorType, ValueType>¶ Multi-Colored Incomplete LU Factorization Preconditioner.
Multi-Colored Incomplete LU Factorization based on the ILU(p) factorization with a power(q)-pattern method. This method provides a higher degree of parallelism of forward and backward substitution compared to the standard ILU(p) preconditioner. [Lukarski2012]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::MultiColoredILU::Set(int p) Initialize a multi-colored ILU(p, p+1) preconditioner.
-
void
rocalution::MultiColoredILU::Set(int p, int q, bool level = true) Initialize a multi-colored ILU(p, q) preconditioner.
level = true will perform the factorization with levels
level = false will perform the factorization only on the power(q)-pattern
-
template<class OperatorType, class VectorType, typename ValueType>
class MultiElimination: public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶ Multi-Elimination Incomplete LU Factorization Preconditioner.
The Multi-Elimination Incomplete LU preconditioner is based on the following decomposition
\[\begin{split} A = \begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} I & 0 \\ ED^{-1} & I \end{pmatrix} \times \begin{pmatrix} D & F \\ 0 & \hat{A} \end{pmatrix}, \end{split}\]
where \(\hat{A} = C - ED^{-1} F\). To make the inversion of \(D\) easier, we permute the preconditioning before the factorization with a permutation \(P\) to obtain only diagonal elements in \(D\). The permutation here is based on a maximal independent set. This procedure can be applied to the block matrix \(\hat{A}\), so that the factorization can be performed recursively. In the last level of the recursion, we need to provide a solution procedure. By the design of the library, this can be any kind of solver. [SAAD]
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
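Because \(D\) is diagonal after the permutation, the reduced block \(\hat{A} = C - ED^{-1}F\) is cheap to form. The dense sketch below computes it for small blocks, with \(D\) passed as a vector of its diagonal entries. The function name schur_complement and the type aliases are hypothetical, and the permutation step is assumed to have been applied already.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Hypothetical helper: A_hat = C - E * D^{-1} * F, where D is diagonal and
// given by its entries d. F is m x p, E is p x m and C is p x p.
Matrix schur_complement(const Vector& d, const Matrix& F, const Matrix& E, const Matrix& C)
{
    const std::size_t m = d.size();
    const std::size_t p = C.size();
    assert(F.size() == m && E.size() == p);

    Matrix A_hat = C;
    for(std::size_t i = 0; i < p; ++i)
        for(std::size_t j = 0; j < p; ++j)
            for(std::size_t k = 0; k < m; ++k)
            {
                assert(d[k] != 0.0); // D must be invertible
                A_hat[i][j] -= E[i][k] / d[k] * F[k][j];
            }
    return A_hat;
}
```

Applying the same step to the returned block again gives the recursive structure described above; the recursion depth corresponds to the level argument of Set().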
-
int
rocalution::MultiElimination::GetSizeDiagBlock(void) const¶ Returns the size of the first (diagonal) block of the preconditioner.
-
int
rocalution::MultiElimination::GetLevel(void) const¶ Return the depth of the current level.
-
void
rocalution::MultiElimination::Set(Solver<OperatorType, VectorType, ValueType> &AA_Solver, int level, double drop_off = 0.0)¶ Initialize (recursively) ME-ILU with level (depth of recursion)
AA_Solver - defines the last-block solver
drop_off - defines drop-off tolerance
-
void
rocalution::MultiElimination::SetPrecondMatrixFormat(unsigned int mat_format)¶ Set a specific matrix type of the decomposed block matrices.
-
template<class OperatorType, class VectorType, typename ValueType>
class DiagJacobiSaddlePointPrecond : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶
Diagonal Preconditioner for Saddle-Point Problems.
Consider the following saddle-point problem
\[\begin{split} A = \begin{pmatrix} K & F \\ E & 0 \end{pmatrix}. \end{split}\]
For such problems we can construct a diagonal Jacobi-type preconditioner of type
\[\begin{split} P = \begin{pmatrix} K & 0 \\ 0 & S \end{pmatrix}, \end{split}\]
with \(S=ED^{-1}F\), where \(D\) are the diagonal elements of \(K\). The matrix \(S\) is fully constructed (via sparse matrix-matrix multiplication). The preconditioner needs to be initialized with two external solvers/preconditioners - one for the matrix \(K\) and one for the matrix \(S\).
- Template Parameters
OperatorType: can be LocalMatrix
VectorType: can be LocalVector
ValueType: can be float, double, std::complex<float> or std::complex<double>
-
void
rocalution::DiagJacobiSaddlePointPrecond::Set(Solver<OperatorType, VectorType, ValueType> &K_Solver, Solver<OperatorType, VectorType, ValueType> &S_Solver)¶ Initialize solver for \(K\) and \(S\).
rocSPARSE¶
Introduction¶
rocSPARSE is a library that contains basic linear algebra subroutines for sparse matrices and vectors written in HIP for GPU devices. It is designed to be used from C and C++ code.
The code is open and hosted here: https://github.com/ROCmSoftwarePlatform/rocSPARSE
Device and Stream Management¶
hipSetDevice() and hipGetDevice() are HIP device management APIs. They are NOT part of the rocSPARSE API.
All rocSPARSE library functions, unless otherwise stated, are non-blocking and executed asynchronously with respect to the host. They may return before the actual computation has finished. To force synchronization, hipDeviceSynchronize() or hipStreamSynchronize() can be used. This ensures that all previously executed rocSPARSE functions on the device or on that particular stream have completed.
Before a HIP kernel invocation, users need to call hipSetDevice() to set a device, e.g. device 1. If users do not explicitly call it, the system by default sets it as device 0. Unless users explicitly call hipSetDevice() to set to another device, their HIP kernels are always launched on device 0.
The above is a HIP (and CUDA) device management approach and has nothing to do with rocSPARSE. rocSPARSE honors the approach above and assumes users have already set the device before a rocSPARSE routine call.
HIP kernels are always launched in a queue (also known as stream).
If users do not explicitly specify a stream, the system provides a default stream, maintained by the system. Users cannot create or destroy the default stream. However, users can freely create new streams (with hipStreamCreate()) and bind them to the rocSPARSE handle. HIP kernels are invoked in rocSPARSE routines. The rocSPARSE handle is always associated with a stream, and rocSPARSE passes its stream to the kernels inside the routine. One rocSPARSE routine only takes one stream in a single invocation. If users create a stream, they are responsible for destroying it.
If the system under test has multiple HIP devices, users can run multiple rocSPARSE handles concurrently, but can NOT run a single rocSPARSE handle on different discrete devices. Each handle is associated with a single device, and a new handle should be created for each additional device.
Building and Installing¶
Installing from AMD ROCm repositories¶
rocSPARSE can be installed from AMD ROCm repositories by
sudo apt install rocsparse
Building rocSPARSE from Open-Source repository¶
The rocSPARSE source code is available at the rocSPARSE github page. Download the master branch using:
git clone -b master https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
cd rocSPARSE
Note that if you want to contribute to rocSPARSE, you will need to check out the develop branch instead of the master branch.
Below are steps to build different packages of the library, including dependencies and clients. It is recommended to install rocSPARSE using the install.sh script.
The following table lists common uses of install.sh to build dependencies + library.
| Command | Description |
|---|---|
| ./install.sh -h | Print help information. |
| ./install.sh -d | Build dependencies and library in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies. |
| ./install.sh | Build library in your local directory. It is assumed dependencies are available. |
| ./install.sh -i | Build library, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users. |
The client contains example code, unit tests and benchmarks. Common uses of install.sh to build them are listed in the table below.
| Command | Description |
|---|---|
| ./install.sh -h | Print help information. |
| ./install.sh -dc | Build dependencies, library and client in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies. |
| ./install.sh -c | Build library and client in your local directory. It is assumed dependencies are available. |
| ./install.sh -idc | Build library, dependencies and client, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users. |
| ./install.sh -ic | Build library and client, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users. |
CMake 3.5 or later is required in order to build rocSPARSE. The rocSPARSE library contains both host and device code; therefore, the HCC compiler must be specified during the cmake configuration process.
rocSPARSE can be built using the following commands:
# Create and change to build directory
mkdir -p build/release ; cd build/release
# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
CXX=/opt/rocm/bin/hcc cmake ../..
# Compile rocSPARSE library
make -j$(nproc)
# Install rocSPARSE to /opt/rocm
sudo make install
Boost and GoogleTest are required in order to build the rocSPARSE client.
rocSPARSE with dependencies and client can be built using the following commands:
# Install boost on Ubuntu
sudo apt install libboost-program-options-dev
# Install boost on Fedora
sudo dnf install boost-program-options
# Install googletest
mkdir -p build/release/deps ; cd build/release/deps
cmake -DBUILD_BOOST=OFF ../../../deps
sudo make -j$(nproc) install
# Change to build directory
cd ..
# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
CXX=/opt/rocm/bin/hcc cmake ../.. -DBUILD_CLIENTS_TESTS=ON \
-DBUILD_CLIENTS_BENCHMARKS=ON \
-DBUILD_CLIENTS_SAMPLES=ON
# Compile rocSPARSE library
make -j$(nproc)
# Install rocSPARSE to /opt/rocm
sudo make install
Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/bin/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc (version mismatch). Please rebuild HIP including cmake or update HCC_HOME variable.
Solution: Download HIP from github and use hcc to build from source and then use the built HIP instead of /opt/rocm/hip.
Issue: For Carrizo - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"
Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: export HCC_AMDGPU_TARGET=gfx900
Issue: Could not find a package configuration file provided by "ROCM" with any of the following names: ROCMConfig.cmake, rocm-config.cmake
Solution: Install ROCm cmake modules
Storage Formats¶
COO storage format¶
The Coordinate (COO) storage format represents a \(m \times n\) matrix by
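| Field | Description |
|---|---|
| m | number of rows (integer). |
| n | number of columns (integer). |
| nnz | number of non-zero elements (integer). |
| coo_val | array of nnz elements containing the data (floating point). |
| coo_row_ind | array of nnz elements containing the row indices (integer). |
| coo_col_ind | array of nnz elements containing the column indices (integer). |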
The COO matrix is expected to be sorted by row indices and column indices per row. Furthermore, each pair of indices should appear only once. Consider the following \(3 \times 5\) matrix and the corresponding COO structures, with \(m = 3, n = 5\) and \(\text{nnz} = 8\) using zero based indexing:
where
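As a concrete illustration of the COO layout, the host-side sketch below converts a dense row-major matrix into the three COO arrays with zero based indexing. The `CooMatrix` struct, the `dense_to_coo` helper and the matrix values in `example_dense` are assumptions made for this example; they are not part of the rocSPARSE API. The matrix was chosen to match the stated dimensions (\(m = 3\), \(n = 5\), \(\text{nnz} = 8\)).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the COO fields listed above.
struct CooMatrix {
    int m = 0, n = 0, nnz = 0;
    std::vector<float> coo_val;     // nnz values
    std::vector<int> coo_row_ind;   // nnz row indices
    std::vector<int> coo_col_ind;   // nnz column indices
};

// Convert a dense row-major m x n matrix to COO (zero based indexing).
// Scanning row by row, left to right, yields entries sorted by row index
// and by column index within each row, with every index pair unique.
CooMatrix dense_to_coo(const std::vector<float>& dense, int m, int n) {
    assert(dense.size() == static_cast<std::size_t>(m) * n);
    CooMatrix A;
    A.m = m;
    A.n = n;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            const float v = dense[static_cast<std::size_t>(i) * n + j];
            if (v != 0.0f) {
                A.coo_val.push_back(v);
                A.coo_row_ind.push_back(i);
                A.coo_col_ind.push_back(j);
            }
        }
    }
    A.nnz = static_cast<int>(A.coo_val.size());
    return A;
}

// Illustrative 3 x 5 matrix with 8 non-zero entries (values are an assumption).
std::vector<float> example_dense() {
    return {1, 2, 0, 3, 0,
            0, 4, 5, 0, 0,
            6, 0, 0, 7, 8};
}
```

For this matrix, `dense_to_coo(example_dense(), 3, 5)` gives `coo_row_ind = {0, 0, 0, 1, 1, 2, 2, 2}` and `coo_col_ind = {0, 1, 3, 1, 2, 0, 3, 4}`.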
CSR storage format¶
The Compressed Sparse Row (CSR) storage format represents a \(m \times n\) matrix by
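| Field | Description |
|---|---|
| m | number of rows (integer). |
| n | number of columns (integer). |
| nnz | number of non-zero elements (integer). |
| csr_val | array of nnz elements containing the data (floating point). |
| csr_row_ptr | array of m+1 elements that point to the start of every row (integer). |
| csr_col_ind | array of nnz elements containing the column indices (integer). |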
The CSR matrix is expected to be sorted by column indices within each row. Furthermore, each pair of indices should appear only once. Consider the following \(3 \times 5\) matrix and the corresponding CSR structures, with \(m = 3, n = 5\) and \(\text{nnz} = 8\) using one based indexing:
where
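The CSR layout can be sketched on the host in the same way. The `CsrMatrix` struct and the `dense_to_csr` helper below are hypothetical names for this example only, and the matrix used in the usage note is an illustrative choice consistent with \(m = 3\), \(n = 5\) and \(\text{nnz} = 8\). The `base` argument selects zero or one based indexing and is added to both the row pointers and the column indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the CSR fields listed above.
struct CsrMatrix {
    int m = 0, n = 0, nnz = 0;
    std::vector<float> csr_val;     // nnz values
    std::vector<int> csr_row_ptr;   // m + 1 row start offsets
    std::vector<int> csr_col_ind;   // nnz column indices
};

// Convert a dense row-major m x n matrix to CSR.
// base is 0 for zero based indexing and 1 for one based indexing.
CsrMatrix dense_to_csr(const std::vector<float>& dense, int m, int n, int base) {
    assert(base == 0 || base == 1);
    assert(dense.size() == static_cast<std::size_t>(m) * n);
    CsrMatrix A;
    A.m = m;
    A.n = n;
    A.csr_row_ptr.reserve(static_cast<std::size_t>(m) + 1);
    for (int i = 0; i < m; ++i) {
        // Row i starts at the current number of stored entries.
        A.csr_row_ptr.push_back(static_cast<int>(A.csr_val.size()) + base);
        for (int j = 0; j < n; ++j) {
            const float v = dense[static_cast<std::size_t>(i) * n + j];
            if (v != 0.0f) {
                A.csr_val.push_back(v);
                A.csr_col_ind.push_back(j + base);
            }
        }
    }
    // Final entry marks one past the end of the last row.
    A.csr_row_ptr.push_back(static_cast<int>(A.csr_val.size()) + base);
    A.nnz = static_cast<int>(A.csr_val.size());
    return A;
}
```

With one based indexing and the dense rows `{1,2,0,3,0}`, `{0,4,5,0,0}`, `{6,0,0,7,8}`, this gives `csr_row_ptr = {1, 4, 6, 9}` and `csr_col_ind = {1, 2, 4, 2, 3, 1, 4, 5}`. The number of non-zeros can always be recovered as `csr_row_ptr[m] - csr_row_ptr[0]`.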
ELL storage format¶
The Ellpack-Itpack (ELL) storage format represents a \(m \times n\) matrix by
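| Field | Description |
|---|---|
| m | number of rows (integer). |
| n | number of columns (integer). |
| ell_width | maximum number of non-zero elements per row (integer). |
| ell_val | array of m times ell_width elements containing the data (floating point). |
| ell_col_ind | array of m times ell_width elements containing the column indices (integer). |

The ELL matrix is assumed to be stored in column-major format. Rows with fewer than ell_width non-zero elements are padded with zeros (ell_val) and \(-1\) (ell_col_ind).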
Consider the following \(3 \times 5\) matrix and the corresponding ELL structures, with \(m = 3, n = 5\) and \(\text{ell_width} = 3\) using zero based indexing:
where
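The column-major layout and the padding rule can be illustrated with a small host-side sketch. The `EllMatrix` struct and the `dense_to_ell` helper are hypothetical names used only here; the sketch takes ell_width to be the largest number of non-zeros in any row and uses zero based indexing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the ELL fields listed above.
struct EllMatrix {
    int m = 0, n = 0, ell_width = 0;
    std::vector<float> ell_val;    // m * ell_width values, column-major
    std::vector<int> ell_col_ind;  // m * ell_width column indices, column-major
};

// Convert a dense row-major m x n matrix to ELL (zero based indexing).
EllMatrix dense_to_ell(const std::vector<float>& dense, int m, int n) {
    assert(dense.size() == static_cast<std::size_t>(m) * n);
    EllMatrix A;
    A.m = m;
    A.n = n;
    // Pass 1: ell_width is the maximum number of non-zeros in any row.
    for (int i = 0; i < m; ++i) {
        int count = 0;
        for (int j = 0; j < n; ++j)
            if (dense[static_cast<std::size_t>(i) * n + j] != 0.0f) ++count;
        A.ell_width = std::max(A.ell_width, count);
    }
    // Pre-fill with the padding values: 0 for data, -1 for column indices.
    const std::size_t size = static_cast<std::size_t>(m) * A.ell_width;
    A.ell_val.assign(size, 0.0f);
    A.ell_col_ind.assign(size, -1);
    // Pass 2: the p-th non-zero of row i is stored at index p * m + i.
    for (int i = 0; i < m; ++i) {
        int p = 0;
        for (int j = 0; j < n; ++j) {
            const float v = dense[static_cast<std::size_t>(i) * n + j];
            if (v != 0.0f) {
                const std::size_t idx = static_cast<std::size_t>(p) * m + i;
                A.ell_val[idx] = v;
                A.ell_col_ind[idx] = j;
                ++p;
            }
        }
    }
    return A;
}
```

For the illustrative rows `{1,2,0,3,0}`, `{0,4,5,0,0}`, `{6,0,0,7,8}` (an assumed example matrix), the result has `ell_width = 3`, `ell_val = {1, 4, 6, 2, 5, 7, 3, 0, 8}` and `ell_col_ind = {0, 1, 0, 1, 2, 3, 3, -1, 4}`. The padded slot belongs to the second row, which has only two non-zeros.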
HYB storage format¶
The Hybrid (HYB) storage format represents a \(m \times n\) matrix by
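| Field | Description |
|---|---|
| m | number of rows (integer). |
| n | number of columns (integer). |
| nnz | number of non-zero elements of the COO part (integer). |
| ell_width | maximum number of non-zero elements per row of the ELL part (integer). |
| ell_val | array of m times ell_width elements containing the ELL part data (floating point). |
| ell_col_ind | array of m times ell_width elements containing the ELL part column indices (integer). |
| coo_val | array of nnz elements containing the COO part data (floating point). |
| coo_row_ind | array of nnz elements containing the COO part row indices (integer). |
| coo_col_ind | array of nnz elements containing the COO part column indices (integer). |

The HYB format is a combination of the ELL and COO sparse matrix formats. Typically, the regular part of the matrix is stored in ELL storage format, and the irregular part of the matrix is stored in COO storage format. Three different partitioning schemes can be applied when converting a CSR matrix to a matrix in HYB storage format. For further details on the partitioning schemes, see rocsparse_hyb_partition.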
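To make the ELL/COO split concrete, the sketch below partitions a dense matrix with a caller-supplied ell_width: the first ell_width non-zeros of each row go to the ELL part and any remaining entries spill into the COO part. This mirrors the idea of a user-given width but is only an assumption about how such a split can be done; it is not the library's conversion routine, and `HybMatrix` and `dense_to_hyb` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the HYB fields listed above.
struct HybMatrix {
    int m = 0, n = 0;
    int ell_width = 0;
    int nnz = 0;                   // non-zeros stored in the COO part
    std::vector<float> ell_val;    // m * ell_width values, column-major
    std::vector<int> ell_col_ind;  // m * ell_width column indices
    std::vector<float> coo_val;
    std::vector<int> coo_row_ind;
    std::vector<int> coo_col_ind;
};

// Split a dense row-major m x n matrix into an ELL part of the given width
// and a COO part holding the overflow entries (zero based indexing).
HybMatrix dense_to_hyb(const std::vector<float>& dense, int m, int n, int ell_width) {
    assert(ell_width >= 0);
    assert(dense.size() == static_cast<std::size_t>(m) * n);
    HybMatrix A;
    A.m = m;
    A.n = n;
    A.ell_width = ell_width;
    A.ell_val.assign(static_cast<std::size_t>(m) * ell_width, 0.0f);
    A.ell_col_ind.assign(static_cast<std::size_t>(m) * ell_width, -1);
    for (int i = 0; i < m; ++i) {
        int p = 0;  // position of the current non-zero within row i
        for (int j = 0; j < n; ++j) {
            const float v = dense[static_cast<std::size_t>(i) * n + j];
            if (v == 0.0f) continue;
            if (p < ell_width) {
                // Regular part: column-major ELL slot.
                const std::size_t idx = static_cast<std::size_t>(p) * m + i;
                A.ell_val[idx] = v;
                A.ell_col_ind[idx] = j;
            } else {
                // Irregular part: overflow goes to COO.
                A.coo_val.push_back(v);
                A.coo_row_ind.push_back(i);
                A.coo_col_ind.push_back(j);
            }
            ++p;
        }
    }
    A.nnz = static_cast<int>(A.coo_val.size());
    return A;
}
```

With the assumed rows `{1,2,0,3,0}`, `{0,4,5,0,0}`, `{6,0,0,7,8}` and `ell_width = 2`, the ELL part holds six entries and the COO part holds the two overflow entries at positions (0, 3) and (2, 4).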
Types¶
rocsparse_handle¶
-
typedef struct _rocsparse_handle *
rocsparse_handle¶ Handle to the rocSPARSE library context queue.
The rocSPARSE handle is a structure holding the rocSPARSE library context. It must be initialized using rocsparse_create_handle() and the returned handle must be passed to all subsequent library function calls. It should be destroyed at the end using rocsparse_destroy_handle().
rocsparse_mat_descr¶
-
typedef struct _rocsparse_mat_descr *
rocsparse_mat_descr¶ Descriptor of the matrix.
The rocSPARSE matrix descriptor is a structure holding all properties of a matrix. It must be initialized using rocsparse_create_mat_descr() and the returned descriptor must be passed to all subsequent library calls that involve the matrix. It should be destroyed at the end using rocsparse_destroy_mat_descr().
rocsparse_mat_info¶
-
typedef struct _rocsparse_mat_info *
rocsparse_mat_info¶ Info structure to hold all matrix meta data.
The rocSPARSE matrix info is a structure holding all matrix information that is gathered during analysis routines. It must be initialized using rocsparse_create_mat_info() and the returned info structure must be passed to all subsequent library calls that require additional matrix information. It should be destroyed at the end using rocsparse_destroy_mat_info().
rocsparse_hyb_mat¶
-
typedef struct _rocsparse_hyb_mat *
rocsparse_hyb_mat¶ HYB matrix storage format.
The rocSPARSE HYB matrix structure holds the HYB matrix. It must be initialized using rocsparse_create_hyb_mat() and the returned HYB matrix must be passed to all subsequent library calls that involve the matrix. It should be destroyed at the end using rocsparse_destroy_hyb_mat().
rocsparse_action¶
-
enum
rocsparse_action¶ Specify what the operation is performed on.
The rocsparse_action indicates whether the operation is performed on the full matrix, or only on the sparsity pattern of the matrix.
Values:
-
rocsparse_action_symbolic= 0¶ Operate only on indices.
-
rocsparse_action_numeric= 1¶ Operate on data and indices.
-
rocsparse_hyb_partition¶
-
enum
rocsparse_hyb_partition¶ HYB matrix partitioning type.
The rocsparse_hyb_partition type indicates how the hybrid format partitioning between COO and ELL storage formats is performed.
Values:
-
rocsparse_hyb_partition_auto= 0¶ automatically decide on ELL nnz per row.
-
rocsparse_hyb_partition_user= 1¶ user given ELL nnz per row.
-
rocsparse_hyb_partition_max= 2¶ max ELL nnz per row, no COO part.
-
rocsparse_index_base¶
-
enum
rocsparse_index_base¶ Specify the matrix index base.
The rocsparse_index_base indicates the index base of the indices. For a given rocsparse_mat_descr, the rocsparse_index_base can be set using rocsparse_set_mat_index_base(). The current rocsparse_index_base of a matrix can be obtained by rocsparse_get_mat_index_base().
Values:
-
rocsparse_index_base_zero= 0¶ zero based indexing.
-
rocsparse_index_base_one= 1¶ one based indexing.
-
rocsparse_matrix_type¶
-
enum
rocsparse_matrix_type¶ Specify the matrix type.
The rocsparse_matrix_type indicates the type of a matrix. For a given rocsparse_mat_descr, the rocsparse_matrix_type can be set using rocsparse_set_mat_type(). The current rocsparse_matrix_type of a matrix can be obtained by rocsparse_get_mat_type().
Values:
-
rocsparse_matrix_type_general= 0¶ general matrix type.
-
rocsparse_matrix_type_symmetric= 1¶ symmetric matrix type.
-
rocsparse_matrix_type_hermitian= 2¶ hermitian matrix type.
-
rocsparse_matrix_type_triangular= 3¶ triangular matrix type.
-
rocsparse_fill_mode¶
-
enum
rocsparse_fill_mode¶ Specify the matrix fill mode.
The rocsparse_fill_mode indicates whether the lower or the upper part is stored in a sparse triangular matrix. For a given rocsparse_mat_descr, the rocsparse_fill_mode can be set using rocsparse_set_mat_fill_mode(). The current rocsparse_fill_mode of a matrix can be obtained by rocsparse_get_mat_fill_mode().
Values:
-
rocsparse_fill_mode_lower= 0¶ lower triangular part is stored.
-
rocsparse_fill_mode_upper= 1¶ upper triangular part is stored.
-
rocsparse_diag_type¶
-
enum
rocsparse_diag_type¶ Indicates if the diagonal entries are unity.
The rocsparse_diag_type indicates whether the diagonal entries of a matrix are unity or not. If rocsparse_diag_type_unit is specified, all present diagonal values will be ignored. For a given rocsparse_mat_descr, the rocsparse_diag_type can be set using rocsparse_set_mat_diag_type(). The current rocsparse_diag_type of a matrix can be obtained by rocsparse_get_mat_diag_type().
Values:
-
rocsparse_diag_type_non_unit= 0¶ diagonal entries are non-unity.
-
rocsparse_diag_type_unit= 1¶ diagonal entries are unity.
-
rocsparse_operation¶
-
enum
rocsparse_operation¶ Specify whether the matrix is to be transposed or not.
The rocsparse_operation indicates the operation performed with the given matrix.
Values:
-
rocsparse_operation_none= 111¶ Operate with matrix.
-
rocsparse_operation_transpose= 112¶ Operate with transpose.
-
rocsparse_operation_conjugate_transpose= 113¶ Operate with conj. transpose.
-
rocsparse_pointer_mode¶
-
enum
rocsparse_pointer_mode¶ Indicates if the pointer is device pointer or host pointer.
The rocsparse_pointer_mode indicates whether scalar values are passed by reference on the host or device. The rocsparse_pointer_mode can be changed by rocsparse_set_pointer_mode(). The currently used pointer mode can be obtained by rocsparse_get_pointer_mode().
Values:
-
rocsparse_pointer_mode_host= 0¶ scalar pointers are in host memory.
-
rocsparse_pointer_mode_device= 1¶ scalar pointers are in device memory.
-
rocsparse_analysis_policy¶
-
enum
rocsparse_analysis_policy¶ Specify policy in analysis functions.
The rocsparse_analysis_policy specifies whether gathered analysis data should be re-used or not. If meta data from a previous e.g. rocsparse_csrilu0_analysis() call is available, it can be re-used for subsequent calls to e.g. rocsparse_csrsv_analysis() and greatly improve performance of the analysis function.
Values:
-
rocsparse_analysis_policy_reuse= 0¶ try to re-use meta data.
-
rocsparse_analysis_policy_force= 1¶ force to re-build meta data.
-
rocsparse_solve_policy¶
rocsparse_layer_mode¶
-
enum
rocsparse_layer_mode¶ Indicates if layer is active with bitmask.
The rocsparse_layer_mode bit mask indicates the logging characteristics.
Values:
-
rocsparse_layer_mode_none= 0x0¶ layer is not active.
-
rocsparse_layer_mode_log_trace= 0x1¶ layer is in logging mode.
-
rocsparse_layer_mode_log_bench= 0x2¶ layer is in benchmarking mode.
-
rocsparse_status¶
-
enum
rocsparse_status¶ List of rocsparse status codes definition.
This is a list of the rocsparse_status types that are used by the rocSPARSE library.
Values:
-
rocsparse_status_success= 0¶ success.
-
rocsparse_status_invalid_handle= 1¶ handle not initialized, invalid or null.
-
rocsparse_status_not_implemented= 2¶ function is not implemented.
-
rocsparse_status_invalid_pointer= 3¶ invalid pointer parameter.
-
rocsparse_status_invalid_size= 4¶ invalid size parameter.
-
rocsparse_status_memory_error= 5¶ failed memory allocation, copy, dealloc.
-
rocsparse_status_internal_error= 6¶ other internal library failure.
-
rocsparse_status_invalid_value= 7¶ invalid value parameter.
-
rocsparse_status_arch_mismatch= 8¶ device arch is not supported.
-
rocsparse_status_zero_pivot= 9¶ encountered zero pivot.
-
Logging¶
Three different environment variables can be set to enable logging in rocSPARSE: ROCSPARSE_LAYER, ROCSPARSE_LOG_TRACE_PATH and ROCSPARSE_LOG_BENCH_PATH.
ROCSPARSE_LAYER is a bit mask, where several logging modes can be combined as follows:
| ROCSPARSE_LAYER | Effect |
|---|---|
| unset | logging is disabled. |
| set to 1 | trace logging is enabled. |
| set to 2 | bench logging is enabled. |
| set to 3 | trace logging and bench logging are enabled. |
When logging is enabled, each rocSPARSE function call will write the function name as well as function arguments to the logging stream. The default logging stream is stderr.
If the user sets the environment variable ROCSPARSE_LOG_TRACE_PATH to the full path name for a file, the file is opened and trace logging is streamed to that file. If the user sets the environment variable ROCSPARSE_LOG_BENCH_PATH to the full path name for a file, the file is opened and bench logging is streamed to that file. If the file cannot be opened, logging output is streamed to stderr.
Note that performance will degrade when logging is enabled. By default, the environment variable ROCSPARSE_LAYER is unset and logging is disabled.
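The bit mask semantics of ROCSPARSE_LAYER can be illustrated with a few lines of host code. This is a sketch of how such a mask decodes, not the library's own parsing logic: the constants are local copies of the rocsparse_layer_mode values listed earlier, and `LayerFlags` and `decode_layer` are hypothetical names.

```cpp
#include <cstdlib>

// Local mirrors of the rocsparse_layer_mode bit values (assumed copies).
constexpr long layer_mode_log_trace = 0x1;
constexpr long layer_mode_log_bench = 0x2;

struct LayerFlags {
    bool trace;  // trace logging requested
    bool bench;  // bench logging requested
};

// Decode the text of ROCSPARSE_LAYER. A null pointer means the variable
// is unset, which is treated as mask 0 (logging disabled).
LayerFlags decode_layer(const char* value) {
    const long mask = value ? std::strtol(value, nullptr, 0) : 0;
    return {(mask & layer_mode_log_trace) != 0,
            (mask & layer_mode_log_bench) != 0};
}
```

A caller could pass `std::getenv("ROCSPARSE_LAYER")` directly. A value of 3 sets both bits, which is why it enables trace and bench logging together.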
Sparse Auxiliary Functions¶
This module holds all sparse auxiliary functions.
The functions that are contained in the auxiliary module describe all available helper functions that are required for subsequent library calls.
rocsparse_create_handle()¶
-
rocsparse_status
rocsparse_create_handle(rocsparse_handle *handle)¶ Create a rocsparse handle.
rocsparse_create_handle creates the rocSPARSE library context. It must be initialized before any other rocSPARSE API function is invoked and must be passed to all subsequent library function calls. The handle should be destroyed at the end using rocsparse_destroy_handle().
- Parameters
[out] handle: the pointer to the handle to the rocSPARSE library context.
- Return Value
rocsparse_status_success: the initialization succeeded.
rocsparse_status_invalid_handle: handle pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_destroy_handle()¶
-
rocsparse_status
rocsparse_destroy_handle(rocsparse_handle handle)¶ Destroy a rocsparse handle.
rocsparse_destroy_handle destroys the rocSPARSE library context and releases all resources used by the rocSPARSE library.
- Parameters
[in] handle: the handle to the rocSPARSE library context.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_set_stream()¶
-
rocsparse_status
rocsparse_set_stream(rocsparse_handle handle, hipStream_t stream)¶ Specify user defined HIP stream.
rocsparse_set_stream specifies the stream to be used by the rocSPARSE library context and all subsequent function calls.
- Example
This example illustrates how a user defined stream can be used in rocSPARSE.
// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create stream
hipStream_t stream;
hipStreamCreate(&stream);

// Set stream to rocSPARSE handle
rocsparse_set_stream(handle, stream);

// Do some work
// ...

// Clean up
rocsparse_destroy_handle(handle);
hipStreamDestroy(stream);
- Parameters
[inout] handle: the handle to the rocSPARSE library context.
[in] stream: the stream to be used by the rocSPARSE library context.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_get_stream()¶
-
rocsparse_status
rocsparse_get_stream(rocsparse_handle handle, hipStream_t *stream)¶ Get current stream from library context.
rocsparse_get_stream gets the rocSPARSE library context stream which is currently used for all subsequent function calls.
- Parameters
[in] handle: the handle to the rocSPARSE library context.
[out] stream: the stream currently used by the rocSPARSE library context.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_set_pointer_mode()¶
-
rocsparse_status
rocsparse_set_pointer_mode(rocsparse_handle handle, rocsparse_pointer_mode pointer_mode)¶ Specify pointer mode.
rocsparse_set_pointer_mode specifies the pointer mode to be used by the rocSPARSE library context and all subsequent function calls. By default, all values are passed by reference on the host. Valid pointer modes are rocsparse_pointer_mode_host or rocsparse_pointer_mode_device.
- Parameters
[in] handle: the handle to the rocSPARSE library context.
[in] pointer_mode: the pointer mode to be used by the rocSPARSE library context.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_get_pointer_mode()¶
-
rocsparse_status
rocsparse_get_pointer_mode(rocsparse_handle handle, rocsparse_pointer_mode *pointer_mode)¶ Get current pointer mode from library context.
rocsparse_get_pointer_mode gets the rocSPARSE library context pointer mode which is currently used for all subsequent function calls.
- Parameters
[in] handle: the handle to the rocSPARSE library context.
[out] pointer_mode: the pointer mode that is currently used by the rocSPARSE library context.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_get_version()¶
-
rocsparse_status
rocsparse_get_version(rocsparse_handle handle, int *version)¶ Get rocSPARSE version.
rocsparse_get_version gets the rocSPARSE library version number.
patch = version % 100
minor = version / 100 % 1000
major = version / 100000
- Parameters
[in] handle: the handle to the rocSPARSE library context.
[out] version: the version number of the rocSPARSE library.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
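The three formulas given for rocsparse_get_version can be wrapped in a small helper that splits the packed integer into its parts. The `Version` struct and `decode_version` function are hypothetical names for this sketch; only the arithmetic comes from the description above.

```cpp
#include <cassert>

// Hypothetical holder for the decoded version parts.
struct Version {
    int major;
    int minor;
    int patch;
};

// Split the packed version number using integer arithmetic:
// version = major * 100000 + minor * 100 + patch.
Version decode_version(int version) {
    assert(version >= 0);
    Version v;
    v.patch = version % 100;
    v.minor = version / 100 % 1000;
    v.major = version / 100000;
    return v;
}
```

For example, a packed value of 100203 decodes to major 1, minor 2, patch 3. In real use, the packed value would be the integer written by rocsparse_get_version.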
rocsparse_get_git_rev()¶
-
rocsparse_status
rocsparse_get_git_rev(rocsparse_handle handle, char *rev)¶ Get rocSPARSE git revision.
rocsparse_get_git_rev gets the rocSPARSE library git commit revision (SHA-1).
- Parameters
[in] handle: the handle to the rocSPARSE library context.
[out] rev: the git commit revision (SHA-1).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_create_mat_descr()¶
-
rocsparse_status
rocsparse_create_mat_descr(rocsparse_mat_descr *descr)¶ Create a matrix descriptor.
rocsparse_create_mat_descr creates a matrix descriptor. It initializes rocsparse_matrix_type to rocsparse_matrix_type_general and rocsparse_index_base to rocsparse_index_base_zero. It should be destroyed at the end using rocsparse_destroy_mat_descr().
- Parameters
[out] descr: the pointer to the matrix descriptor.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_destroy_mat_descr()¶
-
rocsparse_status
rocsparse_destroy_mat_descr(rocsparse_mat_descr descr)¶ Destroy a matrix descriptor.
rocsparse_destroy_mat_descr destroys a matrix descriptor and releases all resources used by the descriptor.
- Parameters
[in] descr: the matrix descriptor.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr is invalid.
rocsparse_copy_mat_descr()¶
-
rocsparse_status
rocsparse_copy_mat_descr(rocsparse_mat_descr dest, const rocsparse_mat_descr src)¶ Copy a matrix descriptor.
rocsparse_copy_mat_descr copies a matrix descriptor. Both source and destination matrix descriptors must be initialized prior to calling rocsparse_copy_mat_descr.
- Parameters
[out] dest: the pointer to the destination matrix descriptor.
[in] src: the pointer to the source matrix descriptor.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: src or dest pointer is invalid.
rocsparse_set_mat_index_base()¶
-
rocsparse_status
rocsparse_set_mat_index_base(rocsparse_mat_descr descr, rocsparse_index_base base)¶ Specify the index base of a matrix descriptor.
rocsparse_set_mat_index_base sets the index base of a matrix descriptor. Valid options are rocsparse_index_base_zero or rocsparse_index_base_one.
- Parameters
[inout] descr: the matrix descriptor.
[in] base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: base is invalid.
rocsparse_get_mat_index_base()¶
-
rocsparse_index_base
rocsparse_get_mat_index_base(const rocsparse_mat_descr descr)¶ Get the index base of a matrix descriptor.
rocsparse_get_mat_index_base returns the index base of a matrix descriptor.
- Return
rocsparse_index_base_zero or rocsparse_index_base_one.
- Parameters
[in] descr: the matrix descriptor.
rocsparse_set_mat_type()¶
-
rocsparse_status
rocsparse_set_mat_type(rocsparse_mat_descr descr, rocsparse_matrix_type type)¶ Specify the matrix type of a matrix descriptor.
rocsparse_set_mat_type sets the matrix type of a matrix descriptor. Valid matrix types are rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.
- Parameters
[inout] descr: the matrix descriptor.
[in] type: rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: type is invalid.
rocsparse_get_mat_type()¶
-
rocsparse_matrix_type
rocsparse_get_mat_type(const rocsparse_mat_descr descr)¶ Get the matrix type of a matrix descriptor.
rocsparse_get_mat_type returns the matrix type of a matrix descriptor.
- Return
rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.
- Parameters
[in] descr: the matrix descriptor.
rocsparse_set_mat_fill_mode()¶
-
rocsparse_status
rocsparse_set_mat_fill_mode(rocsparse_mat_descr descr, rocsparse_fill_mode fill_mode)¶ Specify the matrix fill mode of a matrix descriptor.
rocsparse_set_mat_fill_mode sets the matrix fill mode of a matrix descriptor. Valid fill modes are rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.
- Parameters
[inout] descr: the matrix descriptor.
[in] fill_mode: rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: fill_mode is invalid.
rocsparse_get_mat_fill_mode()¶
-
rocsparse_fill_mode
rocsparse_get_mat_fill_mode(const rocsparse_mat_descr descr)¶ Get the matrix fill mode of a matrix descriptor.
rocsparse_get_mat_fill_mode returns the matrix fill mode of a matrix descriptor.
- Return
rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.
- Parameters
[in] descr: the matrix descriptor.
rocsparse_set_mat_diag_type()¶
-
rocsparse_status
rocsparse_set_mat_diag_type(rocsparse_mat_descr descr, rocsparse_diag_type diag_type)¶ Specify the matrix diagonal type of a matrix descriptor.
rocsparse_set_mat_diag_type sets the matrix diagonal type of a matrix descriptor. Valid diagonal types are rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.
- Parameters
[inout] descr: the matrix descriptor.
[in] diag_type: rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: diag_type is invalid.
rocsparse_get_mat_diag_type()¶
-
rocsparse_diag_type
rocsparse_get_mat_diag_type(const rocsparse_mat_descr descr)¶ Get the matrix diagonal type of a matrix descriptor.
rocsparse_get_mat_diag_type returns the matrix diagonal type of a matrix descriptor.
- Return
rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.
- Parameters
[in] descr: the matrix descriptor.
rocsparse_create_hyb_mat()¶
-
rocsparse_status
rocsparse_create_hyb_mat(rocsparse_hyb_mat *hyb)¶ Create a HYB matrix structure.
rocsparse_create_hyb_mat creates a structure that holds the matrix in HYB storage format. It should be destroyed at the end using rocsparse_destroy_hyb_mat().
- Parameters
[inout] hyb: the pointer to the hybrid matrix.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: hyb pointer is invalid.
rocsparse_destroy_hyb_mat()¶
-
rocsparse_status
rocsparse_destroy_hyb_mat(rocsparse_hyb_mat hyb)¶ Destroy a HYB matrix structure.
rocsparse_destroy_hyb_mat destroys a HYB structure.
- Parameters
[in] hyb: the hybrid matrix structure.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: hyb pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_create_mat_info()¶
-
rocsparse_status
rocsparse_create_mat_info(rocsparse_mat_info *info)¶ Create a matrix info structure.
rocsparse_create_mat_info creates a structure that holds the matrix info data that is gathered during the available analysis routines. It should be destroyed at the end using rocsparse_destroy_mat_info().
- Parameters
[inout] info: the pointer to the info structure.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_destroy_mat_info()¶
-
rocsparse_status
rocsparse_destroy_mat_info(rocsparse_mat_info info)¶ Destroy a matrix info structure.
rocsparse_destroy_mat_info destroys a matrix info structure.
- Parameters
[in] info: the info structure.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
Sparse Level 1 Functions¶
The sparse level 1 routines describe operations between a vector in sparse format and a vector in dense format. This section describes all rocSPARSE level 1 sparse linear algebra functions.
rocsparse_axpyi()¶
-
rocsparse_status
rocsparse_saxpyi(rocsparse_handle handle, rocsparse_int nnz, const float *alpha, const float *x_val, const rocsparse_int *x_ind, float *y, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_daxpyi(rocsparse_handle handle, rocsparse_int nnz, const double *alpha, const double *x_val, const rocsparse_int *x_ind, double *y, rocsparse_index_base idx_base)¶ Scale a sparse vector and add it to a dense vector.
rocsparse_axpyi multiplies the sparse vector \(x\) with scalar \(\alpha\) and adds the result to the dense vector \(y\), such that
\[ y := y + \alpha \cdot x \]
for(i = 0; i < nnz; ++i)
{
    y[x_ind[i]] = y[x_ind[i]] + alpha * x_val[i];
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of vector \(x\).
[in] alpha: scalar \(\alpha\).
[in] x_val: array of nnz elements containing the values of \(x\).
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[inout] y: array of values in dense format.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: alpha, x_val, x_ind or y pointer is invalid.
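The loop above does not show how idx_base enters the update. The following is a minimal host-side sketch, assuming that rocsparse_index_base_one simply means the entries of x_ind are 1-based. The name axpyi_reference and its integer base parameter are illustrative and not part of rocSPARSE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the axpyi update (hypothetical helper, not a rocSPARSE call).
// `base` is 0 or 1 and stands in for rocsparse_index_base_zero / rocsparse_index_base_one.
inline void axpyi_reference(float                     alpha,
                            const std::vector<float>& x_val,
                            const std::vector<int>&   x_ind,
                            std::vector<float>&       y,
                            int                       base)
{
    assert(base == 0 || base == 1);
    assert(x_val.size() == x_ind.size());

    for(std::size_t i = 0; i < x_val.size(); ++i)
    {
        // Shift the stored index into the 0-based range used by std::vector.
        const std::size_t j = static_cast<std::size_t>(x_ind[i] - base);
        assert(j < y.size());
        y[j] += alpha * x_val[i];
    }
}
```

Such a reference can be handy for checking device results on small inputs after synchronizing with the host.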
rocsparse_doti()¶
-
rocsparse_status
rocsparse_sdoti(rocsparse_handle handle, rocsparse_int nnz, const float *x_val, const rocsparse_int *x_ind, const float *y, float *result, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_ddoti(rocsparse_handle handle, rocsparse_int nnz, const double *x_val, const rocsparse_int *x_ind, const double *y, double *result, rocsparse_index_base idx_base)¶ Compute the dot product of a sparse vector with a dense vector.
rocsparse_doti computes the dot product of the sparse vector \(x\) with the dense vector \(y\), such that
\[ \text{result} := y^T x \]
for(i = 0; i < nnz; ++i)
{
    result += x_val[i] * y[x_ind[i]];
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of vector \(x\).
[in] x_val: array of nnz values.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[in] y: array of values in dense format.
[out] result: pointer to the result, can be host or device memory.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: x_val, x_ind, y or result pointer is invalid.
rocsparse_status_memory_error: the buffer for the dot product reduction could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
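The loop above accumulates into result without showing its starting value. A host-side sketch of the same reduction, assuming the accumulator starts at zero and that a base of 1 means 1-based entries in x_ind. The helper name doti_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the sparse-dense dot product (hypothetical helper).
// `base` is 0 or 1 and stands in for the rocsparse_index_base values.
inline float doti_reference(const std::vector<float>& x_val,
                            const std::vector<int>&   x_ind,
                            const std::vector<float>& y,
                            int                       base)
{
    assert(base == 0 || base == 1);
    assert(x_val.size() == x_ind.size());

    float result = 0.0f; // assumption: the reduction starts from zero
    for(std::size_t i = 0; i < x_val.size(); ++i)
    {
        const std::size_t j = static_cast<std::size_t>(x_ind[i] - base);
        assert(j < y.size());
        result += x_val[i] * y[j];
    }
    return result;
}
```

Only the nnz stored entries of \(x\) are visited, so the cost is independent of the length of \(y\).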
rocsparse_gthr()¶
-
rocsparse_status
rocsparse_sgthr(rocsparse_handle handle, rocsparse_int nnz, const float *y, float *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_dgthr(rocsparse_handle handle, rocsparse_int nnz, const double *y, double *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶ Gather elements from a dense vector and store them into a sparse vector.
rocsparse_gthr gathers the elements that are listed in x_ind from the dense vector \(y\) and stores them in the sparse vector \(x\).
for(i = 0; i < nnz; ++i)
{
    x_val[i] = y[x_ind[i]];
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of \(x\).
[in] y: array of values in dense format.
[out] x_val: array of nnz elements containing the values of \(x\).
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: y, x_val or x_ind pointer is invalid.
rocsparse_gthrz()¶
-
rocsparse_status
rocsparse_sgthrz(rocsparse_handle handle, rocsparse_int nnz, float *y, float *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_dgthrz(rocsparse_handle handle, rocsparse_int nnz, double *y, double *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶ Gather and zero out elements from a dense vector and store them into a sparse vector.
rocsparse_gthrz gathers the elements that are listed in x_ind from the dense vector \(y\) and stores them in the sparse vector \(x\). The gathered elements in \(y\) are replaced by zero.
for(i = 0; i < nnz; ++i)
{
    x_val[i] = y[x_ind[i]];
    y[x_ind[i]] = 0;
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of \(x\).
[inout] y: array of values in dense format.
[out] x_val: array of nnz elements containing the non-zero values of \(x\).
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: y, x_val or x_ind pointer is invalid.
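The only difference between the gthr and gthrz loops is the extra write to \(y\). A host-side sketch that covers both with one flag, assuming a base of 1 means 1-based entries in x_ind. The helper gather_reference and its zero_out flag are illustrative names, not library API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for gthr (zero_out == false) and gthrz (zero_out == true).
// Hypothetical helper; `base` is 0 or 1 and stands in for the rocsparse_index_base values.
inline std::vector<float> gather_reference(std::vector<float>&     y,
                                           const std::vector<int>& x_ind,
                                           int                     base,
                                           bool                    zero_out)
{
    assert(base == 0 || base == 1);

    std::vector<float> x_val(x_ind.size());
    for(std::size_t i = 0; i < x_ind.size(); ++i)
    {
        const std::size_t j = static_cast<std::size_t>(x_ind[i] - base);
        assert(j < y.size());
        x_val[i] = y[j];
        if(zero_out)
        {
            y[j] = 0.0f; // gthrz additionally clears the gathered entry
        }
    }
    return x_val;
}
```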
rocsparse_roti()¶
-
rocsparse_status
rocsparse_sroti(rocsparse_handle handle, rocsparse_int nnz, float *x_val, const rocsparse_int *x_ind, float *y, const float *c, const float *s, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_droti(rocsparse_handle handle, rocsparse_int nnz, double *x_val, const rocsparse_int *x_ind, double *y, const double *c, const double *s, rocsparse_index_base idx_base)¶ Apply Givens rotation to a dense and a sparse vector.
rocsparse_roti applies the Givens rotation matrix \(G\) to the sparse vector \(x\) and the dense vector \(y\), where
\[\begin{split} G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \end{split}\]
for(i = 0; i < nnz; ++i)
{
    x_tmp = x_val[i];
    y_tmp = y[x_ind[i]];
    x_val[i] = c * x_tmp + s * y_tmp;
    y[x_ind[i]] = c * y_tmp - s * x_tmp;
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of \(x\).
[inout] x_val: array of nnz elements containing the non-zero values of \(x\).
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[inout] y: array of values in dense format.
[in] c: pointer to the cosine element of \(G\), can be on host or device.
[in] s: pointer to the sine element of \(G\), can be on host or device.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: c, s, x_val, x_ind or y pointer is invalid.
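The rotation touches each pair \((x_i, y_{x\_ind[i]})\) independently, which is why both temporaries are read before either value is overwritten. A host-side sketch of the loop above, assuming a base of 1 means 1-based entries in x_ind. The helper name roti_reference is hypothetical, and c and s are passed by value here for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the Givens rotation of a sparse/dense vector pair.
// Hypothetical helper; `base` is 0 or 1 and stands in for the rocsparse_index_base values.
inline void roti_reference(std::vector<float>&     x_val,
                           const std::vector<int>& x_ind,
                           std::vector<float>&     y,
                           float                   c,
                           float                   s,
                           int                     base)
{
    assert(base == 0 || base == 1);
    assert(x_val.size() == x_ind.size());

    for(std::size_t i = 0; i < x_val.size(); ++i)
    {
        const std::size_t j = static_cast<std::size_t>(x_ind[i] - base);
        assert(j < y.size());

        // Read both inputs first so the second update uses the old x value.
        const float x_tmp = x_val[i];
        const float y_tmp = y[j];
        x_val[i] = c * x_tmp + s * y_tmp;
        y[j]     = c * y_tmp - s * x_tmp;
    }
}
```

With c = 0 and s = 1 the rotation swaps the paired entries and negates the value written to \(y\), which makes a convenient sanity check.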
rocsparse_sctr()¶
-
rocsparse_status
rocsparse_ssctr(rocsparse_handle handle, rocsparse_int nnz, const float *x_val, const rocsparse_int *x_ind, float *y, rocsparse_index_base idx_base)¶
-
rocsparse_status
rocsparse_dsctr(rocsparse_handle handle, rocsparse_int nnz, const double *x_val, const rocsparse_int *x_ind, double *y, rocsparse_index_base idx_base)¶ Scatter elements from a sparse vector into a dense vector.
rocsparse_sctr scatters the elements that are listed in x_ind from the sparse vector \(x\) into the dense vector \(y\). Indices of \(y\) that are not listed in x_ind remain unchanged.
for(i = 0; i < nnz; ++i)
{
    y[x_ind[i]] = x_val[i];
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of \(x\).
[in] x_val: array of nnz elements containing the non-zero values of \(x\).
[in] x_ind: array of nnz elements containing the indices of the non-zero values of \(x\).
[inout] y: array of values in dense format.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: x_val, x_ind or y pointer is invalid.
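A host-side sketch of the scatter loop above, useful for confirming that entries of \(y\) not named in x_ind are left alone. It assumes a base of 1 means 1-based entries in x_ind. The helper name sctr_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for scattering a sparse vector into a dense one.
// Hypothetical helper; `base` is 0 or 1 and stands in for the rocsparse_index_base values.
inline void sctr_reference(const std::vector<float>& x_val,
                           const std::vector<int>&   x_ind,
                           std::vector<float>&       y,
                           int                       base)
{
    assert(base == 0 || base == 1);
    assert(x_val.size() == x_ind.size());

    for(std::size_t i = 0; i < x_val.size(); ++i)
    {
        const std::size_t j = static_cast<std::size_t>(x_ind[i] - base);
        assert(j < y.size());
        y[j] = x_val[i]; // overwrite, do not accumulate
    }
}
```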
Sparse Level 2 Functions¶
This module holds all sparse level 2 routines.
The sparse level 2 routines describe operations between a matrix in sparse format and a vector in dense format.
rocsparse_coomv()¶
-
rocsparse_status
rocsparse_scoomv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *coo_val, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, const float *x, const float *beta, float *y)¶
-
rocsparse_status
rocsparse_dcoomv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *coo_val, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, const double *x, const double *beta, double *y)¶ Sparse matrix vector multiplication using COO storage format.
rocsparse_coomv multiplies the scalar \(\alpha\) with a sparse \(m \times n\) matrix, defined in COO storage format, and the dense vector \(x\) and adds the result to the dense vector \(y\) that is multiplied by the scalar \(\beta\), such that
\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
The COO matrix has to be sorted by row indices. This can be achieved by using rocsparse_coosort_by_row().
for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];
}
for(i = 0; i < nnz; ++i)
{
    y[coo_row_ind[i]] += alpha * coo_val[i] * x[coo_col_ind[i]];
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans == rocsparse_operation_none is supported.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse COO matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] coo_val: array of nnz elements of the sparse COO matrix.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[in] x: array of n elements (\(op(A) = A\)) or m elements (\(op(A) = A^T\) or \(op(A) = A^H\)).
[in] beta: scalar \(\beta\).
[inout] y: array of m elements (\(op(A) = A\)) or n elements (\(op(A) = A^T\) or \(op(A) = A^H\)).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, coo_val, coo_row_ind, coo_col_ind, x, beta or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
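A host-side sketch of the two-pass COO loop above for the supported case \(op(A) = A\). It assumes 0-based row and column indices and takes m and n from the sizes of y and x. The helper name coomv_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha * A * x + beta * y with A in COO format.
// Hypothetical helper; assumes 0-based indices and op(A) = A.
inline void coomv_reference(float                     alpha,
                            const std::vector<float>& coo_val,
                            const std::vector<int>&   coo_row_ind,
                            const std::vector<int>&   coo_col_ind,
                            const std::vector<float>& x,
                            float                     beta,
                            std::vector<float>&       y)
{
    assert(coo_val.size() == coo_row_ind.size());
    assert(coo_val.size() == coo_col_ind.size());

    // First pass: scale every entry of y by beta.
    for(float& yi : y)
    {
        yi *= beta;
    }

    // Second pass: add one product per stored matrix entry.
    for(std::size_t i = 0; i < coo_val.size(); ++i)
    {
        const std::size_t row = static_cast<std::size_t>(coo_row_ind[i]);
        const std::size_t col = static_cast<std::size_t>(coo_col_ind[i]);
        assert(row < y.size() && col < x.size());
        y[row] += alpha * coo_val[i] * x[col];
    }
}
```

The host loop itself does not depend on the entries being sorted by row; the sorting requirement stated above applies to the library routine.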
rocsparse_csrmv_analysis()¶
-
rocsparse_status
rocsparse_scsrmv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info)¶
-
rocsparse_status
rocsparse_dcsrmv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info)¶ Sparse matrix vector multiplication using CSR storage format.
rocsparse_csrmv_analysis performs the analysis step for rocsparse_scsrmv() and rocsparse_dcsrmv(). It is expected that this function will be executed only once for a given matrix and particular operation type. The gathered analysis meta data can be cleared by rocsparse_csrmv_clear().
- Note
If the matrix sparsity pattern changes, the gathered information will become invalid.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind or info pointer is invalid.
rocsparse_status_memory_error: the buffer for the gathered information could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrmv()¶
-
rocsparse_status
rocsparse_scsrmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const float *x, const float *beta, float *y)¶
-
rocsparse_status
rocsparse_dcsrmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const double *x, const double *beta, double *y)¶ Sparse matrix vector multiplication using CSR storage format.
rocsparse_csrmv multiplies the scalar \(\alpha\) with a sparse \(m \times n\) matrix, defined in CSR storage format, and the dense vector \(x\) and adds the result to the dense vector \(y\) that is multiplied by the scalar \(\beta\), such that
\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
The info parameter is optional and contains information collected by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(). If present, the information will be used to speed up the csrmv computation. If info == NULL, the general csrmv routine will be used instead.
for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];
    for(j = csr_row_ptr[i]; j < csr_row_ptr[i + 1]; ++j)
    {
        y[i] = y[i] + alpha * csr_val[j] * x[csr_col_ind[j]];
    }
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans == rocsparse_operation_none is supported.
- Example
This example performs a sparse matrix vector multiplication in CSR format using additional meta data to improve performance.
// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Perform analysis step to obtain meta data
rocsparse_scsrmv_analysis(handle, rocsparse_operation_none, m, n, nnz, descr, csr_val, csr_row_ptr, csr_col_ind, info);

// Compute y = alpha * A * x + beta * y
rocsparse_scsrmv(handle, rocsparse_operation_none, m, n, nnz, &alpha, descr, csr_val, csr_row_ptr, csr_col_ind, info, x, &beta, y);

// Do more work
// ...

// Clean up
rocsparse_destroy_mat_info(info);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: information collected by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(), can be NULL if no information is available.
[in] x: array of n elements (\(op(A) == A\)) or m elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
[in] beta: scalar \(\beta\).
[inout] y: array of m elements (\(op(A) == A\)) or n elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, x, beta or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
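For validating device output on small matrices, the row loop above can be mirrored on the host. This is a minimal sketch for the supported case \(op(A) = A\), assuming 0-based indices and taking m from the size of y. The helper name csrmv_reference is hypothetical and is not the algorithm the library runs on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha * A * x + beta * y with A in CSR format.
// Hypothetical helper; assumes 0-based indices and op(A) = A.
inline void csrmv_reference(float                     alpha,
                            const std::vector<float>& csr_val,
                            const std::vector<int>&   csr_row_ptr,
                            const std::vector<int>&   csr_col_ind,
                            const std::vector<float>& x,
                            float                     beta,
                            std::vector<float>&       y)
{
    // csr_row_ptr has one entry per row plus a final end marker.
    assert(csr_row_ptr.size() == y.size() + 1);
    assert(csr_val.size() == csr_col_ind.size());

    for(std::size_t i = 0; i < y.size(); ++i)
    {
        y[i] = beta * y[i];

        const std::size_t row_begin = static_cast<std::size_t>(csr_row_ptr[i]);
        const std::size_t row_end   = static_cast<std::size_t>(csr_row_ptr[i + 1]);
        for(std::size_t j = row_begin; j < row_end; ++j)
        {
            const std::size_t col = static_cast<std::size_t>(csr_col_ind[j]);
            assert(col < x.size());
            y[i] += alpha * csr_val[j] * x[col];
        }
    }
}
```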
rocsparse_csrmv_clear()¶
-
rocsparse_status
rocsparse_csrmv_clear(rocsparse_handle handle, rocsparse_mat_info info)¶ Sparse matrix vector multiplication using CSR storage format.
rocsparse_csrmv_clear deallocates all memory that was allocated by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(). This is especially useful if memory is an issue and the analysis data is no longer required for further computation, e.g. when switching to another sparse matrix format.
- Note
Calling rocsparse_csrmv_clear is optional. All allocated resources will be cleared when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().
- Parameters
[in] handle: handle to the rocsparse library context queue.
[inout] info: structure that holds the information collected during the analysis step.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer for the gathered information could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_ellmv()¶
-
rocsparse_status
rocsparse_sellmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, const float *alpha, const rocsparse_mat_descr descr, const float *ell_val, const rocsparse_int *ell_col_ind, rocsparse_int ell_width, const float *x, const float *beta, float *y)¶
-
rocsparse_status
rocsparse_dellmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, const double *alpha, const rocsparse_mat_descr descr, const double *ell_val, const rocsparse_int *ell_col_ind, rocsparse_int ell_width, const double *x, const double *beta, double *y)¶ Sparse matrix vector multiplication using ELL storage format.
rocsparse_ellmv multiplies the scalar \(\alpha\) with a sparse \(m \times n\) matrix, defined in ELL storage format, and the dense vector \(x\) and adds the result to the dense vector \(y\) that is multiplied by the scalar \(\beta\), such that
\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];
    for(p = 0; p < ell_width; ++p)
    {
        idx = p * m + i;
        if((ell_col_ind[idx] >= 0) && (ell_col_ind[idx] < n))
        {
            y[i] = y[i] + alpha * ell_val[idx] * x[ell_col_ind[idx]];
        }
    }
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans == rocsparse_operation_none is supported.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse ELL matrix.
[in] n: number of columns of the sparse ELL matrix.
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_val: array that contains the elements of the sparse ELL matrix. Padded elements should be zero.
[in] ell_col_ind: array that contains the column indices of the sparse ELL matrix. Padded column indices should be -1.
[in] ell_width: number of non-zero elements per row of the sparse ELL matrix.
[in] x: array of n elements (\(op(A) == A\)) or m elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
[in] beta: scalar \(\beta\).
[inout] y: array of m elements (\(op(A) == A\)) or n elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or ell_width is invalid.
rocsparse_status_invalid_pointer: descr, alpha, ell_val, ell_col_ind, x, beta or y pointer is invalid.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
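The index expression idx = p * m + i in the loop above implies that the ELL arrays are stored column by column, with every row padded to ell_width entries. A host-side sketch of that loop for \(op(A) = A\), assuming 0-based column indices and -1 as the padding marker. The helper name ellmv_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha * A * x + beta * y with A in ELL format.
// Hypothetical helper; assumes 0-based column indices, padding index -1, op(A) = A.
inline void ellmv_reference(int                       m,
                            int                       n,
                            float                     alpha,
                            const std::vector<float>& ell_val,
                            const std::vector<int>&   ell_col_ind,
                            int                       ell_width,
                            const std::vector<float>& x,
                            float                     beta,
                            std::vector<float>&       y)
{
    assert(m >= 0 && n >= 0 && ell_width >= 0);
    const std::size_t rows  = static_cast<std::size_t>(m);
    const std::size_t width = static_cast<std::size_t>(ell_width);
    assert(ell_val.size() == rows * width);
    assert(ell_col_ind.size() == rows * width);
    assert(y.size() == rows && x.size() == static_cast<std::size_t>(n));

    for(std::size_t i = 0; i < rows; ++i)
    {
        y[i] = beta * y[i];
        for(std::size_t p = 0; p < width; ++p)
        {
            // Entry p of row i lives in "column" p of the m x ell_width layout.
            const std::size_t idx = p * rows + i;
            const int         col = ell_col_ind[idx];
            if(col >= 0 && col < n) // skip padded slots
            {
                y[i] += alpha * ell_val[idx] * x[static_cast<std::size_t>(col)];
            }
        }
    }
}
```

For the 2 x 3 matrix with rows (1, 0, 2) and (0, 3, 0), ell_width is 2, ell_col_ind is {0, 1, 2, -1} and ell_val is {1, 3, 2, 0}.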
rocsparse_hybmv()¶
-
rocsparse_status
rocsparse_shybmv(rocsparse_handle handle, rocsparse_operation trans, const float *alpha, const rocsparse_mat_descr descr, const rocsparse_hyb_mat hyb, const float *x, const float *beta, float *y)¶
-
rocsparse_status
rocsparse_dhybmv(rocsparse_handle handle, rocsparse_operation trans, const double *alpha, const rocsparse_mat_descr descr, const rocsparse_hyb_mat hyb, const double *x, const double *beta, double *y)¶ Sparse matrix vector multiplication using HYB storage format.
rocsparse_hybmv multiplies the scalar \(\alpha\) with a sparse \(m \times n\) matrix, defined in HYB storage format, and the dense vector \(x\) and adds the result to the dense vector \(y\) that is multiplied by the scalar \(\beta\), such that
\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans == rocsparse_operation_none is supported.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse HYB matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] hyb: matrix in HYB storage format.
[in] x: array of n elements (\(op(A) == A\)) or m elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
[in] beta: scalar \(\beta\).
[inout] y: array of m elements (\(op(A) == A\)) or n elements (\(op(A) == A^T\) or \(op(A) == A^H\)).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: hyb structure was not initialized with valid matrix sizes.
rocsparse_status_invalid_pointer: descr, alpha, hyb, x, beta or y pointer is invalid.
rocsparse_status_invalid_value: hyb structure was not initialized with a valid partitioning type.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_memory_error: the buffer could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrsv_zero_pivot()¶
-
rocsparse_status
rocsparse_csrsv_zero_pivot(rocsparse_handle handle, const rocsparse_mat_descr descr, rocsparse_mat_info info, rocsparse_int *position)¶ Sparse triangular solve using CSR storage format.
rocsparse_csrsv_zero_pivot returns rocsparse_status_zero_pivot if either a structural or numerical zero has been found during the rocsparse_scsrsv_solve() or rocsparse_dcsrsv_solve() computation. The first zero pivot \(j\) at \(A_{j,j}\) is stored in position, using the same index base as the CSR matrix. position can be in host or device memory. If no zero pivot has been found, position is set to -1 and rocsparse_status_success is returned instead.
- Note
rocsparse_csrsv_zero_pivot is a blocking function. It might influence performance negatively.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] descr: descriptor of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[inout] position: pointer to zero pivot \(j\), can be in host or device memory.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info or position pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_zero_pivot: zero pivot has been found.
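To illustrate the distinction between a structural zero (no diagonal entry stored) and a numerical zero (a stored diagonal entry equal to 0), the following host-side sketch scans a CSR matrix for the first such row. It is only an illustration under stated assumptions: the library detects zero pivots on the device during the solve, and the helper name first_zero_pivot is hypothetical. The result uses the same index base as the input, with -1 meaning no zero pivot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the first row j whose diagonal entry A(j,j) is either not stored
// (structural zero) or stored as 0 (numerical zero), shifted by `base`.
// Returns -1 if every row has a non-zero diagonal. Hypothetical helper.
inline int first_zero_pivot(const std::vector<int>&   csr_row_ptr,
                            const std::vector<int>&   csr_col_ind,
                            const std::vector<float>& csr_val,
                            int                       base)
{
    assert(base == 0 || base == 1);
    assert(!csr_row_ptr.empty());
    assert(csr_col_ind.size() == csr_val.size());

    const int m = static_cast<int>(csr_row_ptr.size()) - 1;
    for(int j = 0; j < m; ++j)
    {
        bool has_nonzero_diag = false;
        const int row_begin   = csr_row_ptr[static_cast<std::size_t>(j)] - base;
        const int row_end     = csr_row_ptr[static_cast<std::size_t>(j) + 1] - base;
        for(int k = row_begin; k < row_end; ++k)
        {
            const std::size_t ku = static_cast<std::size_t>(k);
            if(csr_col_ind[ku] - base == j && csr_val[ku] != 0.0f)
            {
                has_nonzero_diag = true;
                break;
            }
        }
        if(!has_nonzero_diag)
        {
            return j + base; // report in the matrix's own index base
        }
    }
    return -1;
}
```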
rocsparse_csrsv_buffer_size()¶
-
rocsparse_status
rocsparse_scsrsv_buffer_size(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶
-
rocsparse_status
rocsparse_dcsrsv_buffer_size(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶ Sparse triangular solve using CSR storage format.
rocsparse_csrsv_buffer_size returns the size of the temporary storage buffer that is required by rocsparse_scsrsv_analysis(), rocsparse_dcsrsv_analysis(), rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve(). The temporary storage buffer must be allocated by the user. The size of the temporary storage buffer is identical to the size returned by rocsparse_scsrilu0_buffer_size() and rocsparse_dcsrilu0_buffer_size() if the matrix sparsity pattern is identical. The user allocated buffer can thus be shared between subsequent calls to those functions.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_scsrsv_analysis(), rocsparse_dcsrsv_analysis(), rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrsv_analysis()¶
-
rocsparse_status
rocsparse_scsrsv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶
-
rocsparse_status
rocsparse_dcsrsv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶ Sparse triangular solve using CSR storage format.
rocsparse_csrsv_analysisperforms the analysis step for rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve(). It is expected that this function will be executed only once for a given matrix and particular operation type. The analysis meta data can be cleared by rocsparse_csrsv_clear().rocsparse_csrsv_analysiscan share its meta data with rocsparse_scsrilu0_analysis() and rocsparse_dcsrilu0_analysis(). Selecting rocsparse_analysis_policy_reuse policy can greatly improve computation performance of meta data. However, the user need to make sure that the sparsity pattern remains unchanged. If this cannot be assured, rocsparse_analysis_policy_force has to be used.- Note
If the matrix sparsity pattern changes, the gathered information will become invalid.
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.[in] trans: matrix operation type.[in] m: number of rows of the sparse CSR matrix.[in] nnz: number of non-zero entries of the sparse CSR matrix.[in] descr: descriptor of the sparse CSR matrix.[in] csr_val: array ofnnzelements of the sparse CSR matrix.[in] csr_row_ptr: array ofm+1elements that point to the start of every row of the sparse CSR matrix.[in] csr_col_ind: array ofnnzelements containing the column indices of the sparse CSR matrix.[out] info: structure that holds the information collected during the analysis step.[in] analysis: rocsparse_analysis_policy_reuse or rocsparse_analysis_policy_force.[in] solve: rocsparse_solve_policy_auto.[in] temp_buffer: temporary storage buffer allocated by the user.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_row_ptr, csr_col_ind, info or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrsv_solve()¶
-
rocsparse_status
rocsparse_scsrsv_solve(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const float *x, float *y, rocsparse_solve_policy policy, void *temp_buffer)¶
-
rocsparse_status
rocsparse_dcsrsv_solve(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const double *x, double *y, rocsparse_solve_policy policy, void *temp_buffer)¶ Sparse triangular solve using CSR storage format.
rocsparse_csrsv_solve solves a sparse triangular linear system of a sparse \(m \times m\) matrix, defined in CSR storage format, a dense solution vector \(y\) and the right-hand side \(x\) that is multiplied by \(\alpha\), such that
\[ op(A) \cdot y = \alpha \cdot x, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
rocsparse_csrsv_solve requires a user allocated temporary buffer. Its size is returned by rocsparse_scsrsv_buffer_size() or rocsparse_dcsrsv_buffer_size(). Furthermore, analysis meta data is required. It can be obtained by rocsparse_scsrsv_analysis() or rocsparse_dcsrsv_analysis().
rocsparse_csrsv_solve reports the first zero pivot (either numerical or structural zero). The zero pivot status can be checked by calling rocsparse_csrsv_zero_pivot(). If rocsparse_diag_type == rocsparse_diag_type_unit, no zero pivot will be reported, even if \(A_{j,j} = 0\) for some \(j\).
- Note
The sparse CSR matrix has to be sorted. This can be achieved by calling rocsparse_csrsort().
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans == rocsparse_operation_none is supported.
- Example
Consider the lower triangular \(m \times m\) matrix \(L\), stored in CSR storage format with unit diagonal. The following example solves \(L \cdot y = x\).
// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create matrix descriptor
rocsparse_mat_descr descr;
rocsparse_create_mat_descr(&descr);
rocsparse_set_mat_fill_mode(descr, rocsparse_fill_mode_lower);
rocsparse_set_mat_diag_type(descr, rocsparse_diag_type_unit);

// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Obtain required buffer size
size_t buffer_size;
rocsparse_dcsrsv_buffer_size(handle, rocsparse_operation_none, m, nnz, descr,
                             csr_val, csr_row_ptr, csr_col_ind, info, &buffer_size);

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

// Perform analysis step
rocsparse_dcsrsv_analysis(handle, rocsparse_operation_none, m, nnz, descr,
                          csr_val, csr_row_ptr, csr_col_ind, info,
                          rocsparse_analysis_policy_reuse, rocsparse_solve_policy_auto,
                          temp_buffer);

// Solve Ly = x
rocsparse_dcsrsv_solve(handle, rocsparse_operation_none, m, nnz, &alpha, descr,
                       csr_val, csr_row_ptr, csr_col_ind, info, x, y,
                       rocsparse_solve_policy_auto, temp_buffer);

// No zero pivot should be found, with L having unit diagonal

// Clean up
hipFree(temp_buffer);
rocsparse_destroy_mat_info(info);
rocsparse_destroy_mat_descr(descr);
rocsparse_destroy_handle(handle);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[in] x: array of m elements, holding the right-hand side.
[out] y: array of m elements, holding the solution.
[in] policy: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, x or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
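To make the operation \(L \cdot y = \alpha \cdot x\) concrete, the following is a minimal host-side sketch of forward substitution on a lower triangular CSR matrix. It is not the rocSPARSE implementation and does not run on the device. The function name csr_lower_solve_ref and its std::vector interface are hypothetical, and the sketch assumes zero-based indices and sorted column indices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference (not part of rocSPARSE):
// solves L * y = alpha * x by forward substitution, where L is the lower
// triangular part of a zero-based CSR matrix.
// If unit_diag is true, the diagonal is taken to be 1 and any stored
// diagonal entries are ignored. Zero pivots are not detected in this sketch.
std::vector<double> csr_lower_solve_ref(int m,
                                        const std::vector<int>& csr_row_ptr,
                                        const std::vector<int>& csr_col_ind,
                                        const std::vector<double>& csr_val,
                                        double alpha,
                                        const std::vector<double>& x,
                                        bool unit_diag)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    assert(static_cast<int>(x.size()) == m);

    std::vector<double> y(m);
    for(int i = 0; i < m; ++i)
    {
        double sum  = alpha * x[i];
        double diag = unit_diag ? 1.0 : 0.0;

        for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
        {
            int j = csr_col_ind[p];
            if(j < i)
            {
                // Strictly lower entry: y[j] is already known
                sum -= csr_val[p] * y[j];
            }
            else if(j == i && !unit_diag)
            {
                diag = csr_val[p];
            }
            // Entries with j > i belong to the upper part and are skipped
        }
        y[i] = sum / diag;
    }
    return y;
}
```

Row i depends only on rows j < i, which is why the rows can be processed in order on the host. How the library schedules this work on the device is not covered by this sketch.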
rocsparse_csrsv_clear()¶
-
rocsparse_status
rocsparse_csrsv_clear(rocsparse_handle handle, const rocsparse_mat_descr descr, rocsparse_mat_info info)¶ Sparse triangular solve using CSR storage format.
rocsparse_csrsv_clear deallocates all memory that was allocated by rocsparse_scsrsv_analysis() or rocsparse_dcsrsv_analysis(). This is especially useful if memory is an issue and the analysis data is not required for further computation, e.g. when switching to another sparse matrix format. Calling rocsparse_csrsv_clear is optional. All allocated resources will be cleared when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] descr: descriptor of the sparse CSR matrix.
[inout] info: structure that holds the information collected during the analysis step.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer holding the meta data could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.
Sparse Level 3 Functions¶
This module holds all sparse level 3 routines.
The sparse level 3 routines describe operations between a matrix in sparse format and multiple vectors in dense format that can also be seen as a dense matrix.
rocsparse_csrmm()¶
-
rocsparse_status
rocsparse_scsrmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, rocsparse_int m, rocsparse_int n, rocsparse_int k, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const float *B, rocsparse_int ldb, const float *beta, float *C, rocsparse_int ldc)¶
-
rocsparse_status
rocsparse_dcsrmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, rocsparse_int m, rocsparse_int n, rocsparse_int k, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const double *B, rocsparse_int ldb, const double *beta, double *C, rocsparse_int ldc)¶ Sparse matrix dense matrix multiplication using CSR storage format.
rocsparse_csrmm multiplies the scalar \(\alpha\) with a sparse \(m \times k\) matrix \(A\), defined in CSR storage format, and the dense \(k \times n\) matrix \(B\) and adds the result to the dense \(m \times n\) matrix \(C\) that is multiplied by the scalar \(\beta\), such that
\[ C := \alpha \cdot op(A) \cdot op(B) + \beta \cdot C, \]
with
\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans\_A == rocsparse\_operation\_none \\ A^T, & if\: trans\_A == rocsparse\_operation\_transpose \\ A^H, & if\: trans\_A == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
and
\[\begin{split} op(B) = \left\{ \begin{array}{ll} B, & if\: trans\_B == rocsparse\_operation\_none \\ B^T, & if\: trans\_B == rocsparse\_operation\_transpose \\ B^H, & if\: trans\_B == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
for(i = 0; i < ldc; ++i)
{
    for(j = 0; j < n; ++j)
    {
        C[i][j] = beta * C[i][j];
        for(k = csr_row_ptr[i]; k < csr_row_ptr[i + 1]; ++k)
        {
            C[i][j] += alpha * csr_val[k] * B[csr_col_ind[k]][j];
        }
    }
}
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Note
Currently, only trans_A == rocsparse_operation_none is supported.
- Example
This example multiplies a CSR matrix with a dense matrix.
//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int k   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m+1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Set dimension n of B
rocsparse_int n = 64;

// Allocate and generate dense matrix B
std::vector<float> hB(k * n);
for(rocsparse_int i = 0; i < k * n; ++i)
{
    hB[i] = static_cast<float>(rand()) / RAND_MAX;
}

// Copy B to the device
float* B;
hipMalloc((void**)&B, sizeof(float) * k * n);
hipMemcpy(B, hB.data(), sizeof(float) * k * n, hipMemcpyHostToDevice);

// alpha and beta
float alpha = 1.0f;
float beta  = 0.0f;

// Allocate memory for the resulting matrix C
float* C;
hipMalloc((void**)&C, sizeof(float) * m * n);

// Perform the matrix multiplication
rocsparse_scsrmm(handle, rocsparse_operation_none, rocsparse_operation_none,
                 m, n, k, nnz, &alpha, descr,
                 csr_val, csr_row_ptr, csr_col_ind,
                 B, k, &beta, C, m);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] trans_A: matrix \(A\) operation type.
[in] trans_B: matrix \(B\) operation type.
[in] m: number of rows of the sparse CSR matrix \(A\).
[in] n: number of columns of the dense matrix \(op(B)\) and \(C\).
[in] k: number of columns of the sparse CSR matrix \(A\).
[in] nnz: number of non-zero entries of the sparse CSR matrix \(A\).
[in] alpha: scalar \(\alpha\).
[in] descr: descriptor of the sparse CSR matrix \(A\). Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array of nnz elements of the sparse CSR matrix \(A\).
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix \(A\).
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix \(A\).
[in] B: array of dimension \(ldb \times n\) (\(op(B) == B\)) or \(ldb \times k\) (\(op(B) == B^T\) or \(op(B) == B^H\)).
[in] ldb: leading dimension of \(B\), must be at least \(\max{(1, k)}\) (\(op(A) == A\)) or \(\max{(1, m)}\) (\(op(A) == A^T\) or \(op(A) == A^H\)).
[in] beta: scalar \(\beta\).
[inout] C: array of dimension \(ldc \times n\).
[in] ldc: leading dimension of \(C\), must be at least \(\max{(1, m)}\) (\(op(A) == A\)) or \(\max{(1, k)}\) (\(op(A) == A^T\) or \(op(A) == A^H\)).
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n, k, nnz, ldb or ldc is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, B, beta or C pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans_A != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
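For checking device results, a host-side reference of \(C := \alpha \cdot A \cdot B + \beta \cdot C\) can be handy. The sketch below is not the library implementation. The name csrmm_ref is hypothetical, and it assumes zero-based CSR indices, no transposition of \(A\) or \(B\), and column-major storage of \(B\) and \(C\) with leading dimensions ldb and ldc, which is how the example above passes ldb = k and ldc = m.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference (not part of rocSPARSE):
// C := alpha * A * B + beta * C with A in zero-based CSR format (m x k),
// and B (k x n), C (m x n) stored column-major.
void csrmm_ref(int m, int n, int k,
               float alpha,
               const std::vector<int>&   csr_row_ptr,
               const std::vector<int>&   csr_col_ind,
               const std::vector<float>& csr_val,
               const std::vector<float>& B, int ldb,
               float beta,
               std::vector<float>& C, int ldc)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    assert(ldb >= k && ldc >= m);
    assert(static_cast<int>(B.size()) >= ldb * n);
    assert(static_cast<int>(C.size()) >= ldc * n);

    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            // Dot product of sparse row i of A with dense column j of B
            float sum = 0.0f;
            for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
            {
                sum += csr_val[p] * B[csr_col_ind[p] + j * ldb];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}
```

Only the m valid rows of each column of \(C\) are touched, so padding rows between m and ldc are left unchanged.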
Preconditioner Functions¶
This module holds all sparse preconditioners.
The sparse preconditioners describe manipulations on a matrix in sparse format to obtain a sparse preconditioner matrix.
rocsparse_csrilu0_zero_pivot()¶
-
rocsparse_status
rocsparse_csrilu0_zero_pivot(rocsparse_handle handle, rocsparse_mat_info info, rocsparse_int *position)¶ Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.
rocsparse_csrilu0_zero_pivot returns rocsparse_status_zero_pivot if either a structural or numerical zero has been found during rocsparse_scsrilu0() or rocsparse_dcsrilu0() computation. The first zero pivot \(j\) at \(A_{j,j}\) is stored in position, using the same index base as the CSR matrix.
position can be in host or device memory. If no zero pivot has been found, position is set to -1 and rocsparse_status_success is returned instead.
- Note
rocsparse_csrilu0_zero_pivot is a blocking function. It might influence performance negatively.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] info: structure that holds the information collected during the analysis step.
[inout] position: pointer to zero pivot \(j\), can be in host or device memory.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info or position pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_zero_pivot: zero pivot has been found.
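The difference between a structural zero (no diagonal entry is stored) and a numerical zero (the stored diagonal entry is 0) can be illustrated with a small host-side check. This is only a sketch under stated assumptions: the name first_zero_diagonal_ref is hypothetical, and it inspects the diagonal of a given CSR matrix directly, whereas the library reports the pivot found during the factorization itself. The -1 return value and the use of the matrix index base follow the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side sketch (not part of rocSPARSE):
// returns the first row whose diagonal entry is missing (structural zero)
// or stored as 0 (numerical zero), expressed in the index base of the
// CSR matrix. Returns -1 if every diagonal entry is present and non-zero.
int first_zero_diagonal_ref(int m,
                            const std::vector<int>&    csr_row_ptr,
                            const std::vector<int>&    csr_col_ind,
                            const std::vector<double>& csr_val,
                            int idx_base)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);

    for(int i = 0; i < m; ++i)
    {
        bool nonzero_diag = false;
        int  row_begin    = csr_row_ptr[i] - idx_base;
        int  row_end      = csr_row_ptr[i + 1] - idx_base;

        for(int p = row_begin; p < row_end; ++p)
        {
            if(csr_col_ind[p] - idx_base == i)
            {
                nonzero_diag = (csr_val[p] != 0.0);
                break;
            }
        }
        if(!nonzero_diag)
        {
            return i + idx_base;
        }
    }
    return -1;
}
```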
rocsparse_csrilu0_buffer_size()¶
-
rocsparse_status
rocsparse_scsrilu0_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶
-
rocsparse_status
rocsparse_dcsrilu0_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶ Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.
rocsparse_csrilu0_buffer_size returns the size of the temporary storage buffer that is required by rocsparse_scsrilu0_analysis(), rocsparse_dcsrilu0_analysis(), rocsparse_scsrilu0() and rocsparse_dcsrilu0(). The temporary storage buffer must be allocated by the user. The size of the temporary storage buffer is identical to the size returned by rocsparse_scsrsv_buffer_size() and rocsparse_dcsrsv_buffer_size() if the matrix sparsity pattern is identical. The user allocated buffer can thus be shared between subsequent calls to those functions.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_scsrilu0_analysis(), rocsparse_dcsrilu0_analysis(), rocsparse_scsrilu0() and rocsparse_dcsrilu0().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrilu0_analysis()¶
-
rocsparse_status
rocsparse_scsrilu0_analysis(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶
-
rocsparse_status
rocsparse_dcsrilu0_analysis(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶ Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.
rocsparse_csrilu0_analysis performs the analysis step for rocsparse_scsrilu0() and rocsparse_dcsrilu0(). It is expected that this function will be executed only once for a given matrix and particular operation type. The analysis meta data can be cleared by rocsparse_csrilu0_clear().
rocsparse_csrilu0_analysis can share its meta data with rocsparse_scsrsv_analysis() and rocsparse_dcsrsv_analysis(). Selecting the rocsparse_analysis_policy_reuse policy can greatly improve the performance of computing the meta data. However, the user needs to make sure that the sparsity pattern remains unchanged. If this cannot be assured, rocsparse_analysis_policy_force has to be used.
- Note
If the matrix sparsity pattern changes, the gathered information will become invalid.
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[in] analysis: rocsparse_analysis_policy_reuse or rocsparse_analysis_policy_force.
[in] solve: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
rocsparse_csrilu0()¶
-
rocsparse_status
rocsparse_scsrilu0(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_solve_policy policy, void *temp_buffer)¶
-
rocsparse_status
rocsparse_dcsrilu0(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_solve_policy policy, void *temp_buffer)¶ Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.
rocsparse_csrilu0 computes the incomplete LU factorization with 0 fill-ins and no pivoting of a sparse \(m \times m\) CSR matrix \(A\), such that
\[ A \approx LU \]
rocsparse_csrilu0 requires a user allocated temporary buffer. Its size is returned by rocsparse_scsrilu0_buffer_size() or rocsparse_dcsrilu0_buffer_size(). Furthermore, analysis meta data is required. It can be obtained by rocsparse_scsrilu0_analysis() or rocsparse_dcsrilu0_analysis().
rocsparse_csrilu0 reports the first zero pivot (either numerical or structural zero). The zero pivot status can be obtained by calling rocsparse_csrilu0_zero_pivot().
- Note
The sparse CSR matrix has to be sorted. This can be achieved by calling rocsparse_csrsort().
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
Consider the sparse \(m \times m\) matrix \(A\), stored in CSR storage format. The following example computes the incomplete LU factorization \(M \approx LU\) and solves the preconditioned system \(My = x\).
// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create matrix descriptor for M
rocsparse_mat_descr descr_M;
rocsparse_create_mat_descr(&descr_M);

// Create matrix descriptor for L
rocsparse_mat_descr descr_L;
rocsparse_create_mat_descr(&descr_L);
rocsparse_set_mat_fill_mode(descr_L, rocsparse_fill_mode_lower);
rocsparse_set_mat_diag_type(descr_L, rocsparse_diag_type_unit);

// Create matrix descriptor for U
rocsparse_mat_descr descr_U;
rocsparse_create_mat_descr(&descr_U);
rocsparse_set_mat_fill_mode(descr_U, rocsparse_fill_mode_upper);
rocsparse_set_mat_diag_type(descr_U, rocsparse_diag_type_non_unit);

// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Obtain required buffer size
size_t buffer_size_M;
size_t buffer_size_L;
size_t buffer_size_U;
rocsparse_dcsrilu0_buffer_size(handle, m, nnz, descr_M,
                               csr_val, csr_row_ptr, csr_col_ind, info, &buffer_size_M);
rocsparse_dcsrsv_buffer_size(handle, rocsparse_operation_none, m, nnz, descr_L,
                             csr_val, csr_row_ptr, csr_col_ind, info, &buffer_size_L);
rocsparse_dcsrsv_buffer_size(handle, rocsparse_operation_none, m, nnz, descr_U,
                             csr_val, csr_row_ptr, csr_col_ind, info, &buffer_size_U);

size_t buffer_size = max(buffer_size_M, max(buffer_size_L, buffer_size_U));

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

// Perform analysis steps, using rocsparse_analysis_policy_reuse to improve
// computation performance
rocsparse_dcsrilu0_analysis(handle, m, nnz, descr_M,
                            csr_val, csr_row_ptr, csr_col_ind, info,
                            rocsparse_analysis_policy_reuse, rocsparse_solve_policy_auto,
                            temp_buffer);
rocsparse_dcsrsv_analysis(handle, rocsparse_operation_none, m, nnz, descr_L,
                          csr_val, csr_row_ptr, csr_col_ind, info,
                          rocsparse_analysis_policy_reuse, rocsparse_solve_policy_auto,
                          temp_buffer);
rocsparse_dcsrsv_analysis(handle, rocsparse_operation_none, m, nnz, descr_U,
                          csr_val, csr_row_ptr, csr_col_ind, info,
                          rocsparse_analysis_policy_reuse, rocsparse_solve_policy_auto,
                          temp_buffer);

// Check for zero pivot
rocsparse_int position;
if(rocsparse_status_zero_pivot == rocsparse_csrilu0_zero_pivot(handle, info, &position))
{
    printf("A has structural zero at A(%d,%d)\n", position, position);
}

// Compute incomplete LU factorization
rocsparse_dcsrilu0(handle, m, nnz, descr_M,
                   csr_val, csr_row_ptr, csr_col_ind, info,
                   rocsparse_solve_policy_auto, temp_buffer);

// Check for zero pivot
if(rocsparse_status_zero_pivot == rocsparse_csrilu0_zero_pivot(handle, info, &position))
{
    printf("U has structural and/or numerical zero at U(%d,%d)\n", position, position);
}

// Solve Lz = x
rocsparse_dcsrsv_solve(handle, rocsparse_operation_none, m, nnz, &alpha, descr_L,
                       csr_val, csr_row_ptr, csr_col_ind, info, x, z,
                       rocsparse_solve_policy_auto, temp_buffer);

// Solve Uy = z
rocsparse_dcsrsv_solve(handle, rocsparse_operation_none, m, nnz, &alpha, descr_U,
                       csr_val, csr_row_ptr, csr_col_ind, info, z, y,
                       rocsparse_solve_policy_auto, temp_buffer);

// Clean up
hipFree(temp_buffer);
rocsparse_destroy_mat_info(info);
rocsparse_destroy_mat_descr(descr_M);
rocsparse_destroy_mat_descr(descr_L);
rocsparse_destroy_mat_descr(descr_U);
rocsparse_destroy_handle(handle);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[inout] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[in] policy: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr or csr_col_ind pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.
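What "0 fill-ins and no pivoting" means in practice is easiest to see in a sequential host-side version. The sketch below uses the textbook row-wise (IKJ) ILU(0) scheme, not the rocSPARSE device algorithm. The name csrilu0_ref is hypothetical, and it assumes zero-based indices and sorted column indices. As with the library routine, csr_val is overwritten in place: the strictly lower part holds \(L\) (unit diagonal implied) and the upper part including the diagonal holds \(U\).

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference (not part of rocSPARSE):
// in-place ILU(0) of a zero-based CSR matrix with sorted column indices.
// Updates are applied only to positions that already exist in the sparsity
// pattern, so no fill-in is created.
// Returns -1 on success, or the row index of the first zero pivot.
int csrilu0_ref(int m,
                const std::vector<int>& csr_row_ptr,
                const std::vector<int>& csr_col_ind,
                std::vector<double>&    csr_val)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    std::vector<int> diag(m, -1); // position of the diagonal entry per row

    for(int i = 0; i < m; ++i)
    {
        for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
        {
            int k = csr_col_ind[p];
            if(k >= i)
            {
                break; // columns are sorted: the lower part is done
            }

            // L(i,k) = A(i,k) / U(k,k); row k was validated earlier
            csr_val[p] /= csr_val[diag[k]];

            // Subtract L(i,k) * U(k,j) from A(i,j) where (i,j) is stored
            for(int q = diag[k] + 1; q < csr_row_ptr[k + 1]; ++q)
            {
                int j = csr_col_ind[q];
                for(int r = p + 1; r < csr_row_ptr[i + 1]; ++r)
                {
                    if(csr_col_ind[r] == j)
                    {
                        csr_val[r] -= csr_val[p] * csr_val[q];
                        break;
                    }
                    if(csr_col_ind[r] > j)
                    {
                        break; // (i,j) is not in the pattern: drop fill-in
                    }
                }
            }
        }

        // Locate the pivot of row i and check it
        for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
        {
            if(csr_col_ind[p] == i)
            {
                diag[i] = p;
            }
        }
        if(diag[i] < 0 || csr_val[diag[i]] == 0.0)
        {
            return i;
        }
    }
    return -1;
}
```

For a matrix whose pattern already contains every fill-in position (for example a tridiagonal matrix), the result coincides with the exact LU factorization without pivoting.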
rocsparse_csrilu0_clear()¶
-
rocsparse_status
rocsparse_csrilu0_clear(rocsparse_handle handle, rocsparse_mat_info info)¶ Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.
rocsparse_csrilu0_clear deallocates all memory that was allocated by rocsparse_scsrilu0_analysis() or rocsparse_dcsrilu0_analysis(). This is especially useful if memory is an issue and the analysis data is not required for further computation.
- Note
Calling rocsparse_csrilu0_clear is optional. All allocated resources will be cleared when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().
- Parameters
[in] handle: handle to the rocsparse library context queue.
[inout] info: structure that holds the information collected during the analysis step.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer holding the meta data could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.
Sparse Conversion Functions¶
This module holds all sparse conversion routines.
The sparse conversion routines describe operations on a matrix in sparse format to obtain a matrix in a different sparse format.
rocsparse_csr2coo()¶
-
rocsparse_status
rocsparse_csr2coo(rocsparse_handle handle, const rocsparse_int *csr_row_ptr, rocsparse_int nnz, rocsparse_int m, rocsparse_int *coo_row_ind, rocsparse_index_base idx_base)¶ Convert a sparse CSR matrix into a sparse COO matrix.
rocsparse_csr2coo converts the CSR array containing the row offsets, that point to the start of every row, into a COO array of row indices.
- Note
It can also be used to convert a CSC array containing the column offsets into a COO array of column indices.
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
This example converts a CSR matrix into a COO matrix.
//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m+1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate COO matrix arrays
rocsparse_int* coo_row_ind;
rocsparse_int* coo_col_ind;
float* coo_val;

hipMalloc((void**)&coo_row_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&coo_col_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&coo_val, sizeof(float) * nnz);

// Convert the csr row offsets into coo row indices
rocsparse_csr2coo(handle, csr_row_ptr, nnz, m, coo_row_ind, rocsparse_index_base_zero);

// Copy the column and value arrays
hipMemcpy(coo_col_ind, csr_col_ind, sizeof(rocsparse_int) * nnz, hipMemcpyDeviceToDevice);
hipMemcpy(coo_val, csr_val, sizeof(float) * nnz, hipMemcpyDeviceToDevice);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] m: number of rows of the sparse CSR matrix.
[out] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: csr_row_ptr or coo_row_ind pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
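The conversion itself is simple: every position between csr_row_ptr[i] and csr_row_ptr[i+1] receives the row index i. A minimal host-side sketch follows. The name csr2coo_ref is hypothetical and the code is not the library implementation. idx_base is passed as a plain integer (0 or 1) in place of rocsparse_index_base.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference (not part of rocSPARSE):
// expands CSR row offsets into COO row indices. Both arrays use the same
// index base (0 or 1).
std::vector<int> csr2coo_ref(const std::vector<int>& csr_row_ptr,
                             int nnz,
                             int m,
                             int idx_base)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    assert(csr_row_ptr[m] - idx_base == nnz);

    std::vector<int> coo_row_ind(nnz);
    for(int i = 0; i < m; ++i)
    {
        // All entries stored for row i get the row index i
        for(int p = csr_row_ptr[i] - idx_base; p < csr_row_ptr[i + 1] - idx_base; ++p)
        {
            coo_row_ind[p] = i + idx_base;
        }
    }
    return coo_row_ind;
}
```

The column index and value arrays are identical in both formats, which is why the example above only copies them.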
rocsparse_coo2csr()¶
-
rocsparse_status
rocsparse_coo2csr(rocsparse_handle handle, const rocsparse_int *coo_row_ind, rocsparse_int nnz, rocsparse_int m, rocsparse_int *csr_row_ptr, rocsparse_index_base idx_base)¶ Convert a sparse COO matrix into a sparse CSR matrix.
rocsparse_coo2csr converts the COO array containing the row indices into a CSR array of row offsets, that point to the start of every row. It is assumed that the COO row index array is sorted.
- Note
It can also be used to convert a COO array containing the column indices into a CSC array of column offsets, that point to the start of every column. In that case, it is assumed that the COO column index array is sorted instead.
- Note
This function is non blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
This example converts a COO matrix into a CSR matrix.
//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

coo_row_ind[nnz] = {0, 0, 0, 1, 1, 2, 2, 2}; // device memory
coo_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
coo_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate CSR matrix arrays
rocsparse_int* csr_row_ptr;
rocsparse_int* csr_col_ind;
float* csr_val;

hipMalloc((void**)&csr_row_ptr, sizeof(rocsparse_int) * (m + 1));
hipMalloc((void**)&csr_col_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&csr_val, sizeof(float) * nnz);

// Convert the coo row indices into csr row offsets
rocsparse_coo2csr(handle, coo_row_ind, nnz, m, csr_row_ptr, rocsparse_index_base_zero);

// Copy the column and value arrays
hipMemcpy(csr_col_ind, coo_col_ind, sizeof(rocsparse_int) * nnz, hipMemcpyDeviceToDevice);
hipMemcpy(csr_val, coo_val, sizeof(float) * nnz, hipMemcpyDeviceToDevice);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] m: number of rows of the sparse CSR matrix.
[out] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind or csr_row_ptr pointer is invalid.
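The reverse direction amounts to counting the entries per row and taking a running sum. The host-side sketch below shows one way to do this. The name coo2csr_ref is hypothetical, the code is not the library implementation, and idx_base is again a plain integer (0 or 1).

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference (not part of rocSPARSE):
// compresses COO row indices into CSR row offsets using a per-row count
// followed by a prefix sum. Both arrays use the same index base (0 or 1).
std::vector<int> coo2csr_ref(const std::vector<int>& coo_row_ind,
                             int nnz,
                             int m,
                             int idx_base)
{
    assert(static_cast<int>(coo_row_ind.size()) == nnz);

    std::vector<int> csr_row_ptr(m + 1, 0);

    // Count the entries of row i in slot i + 1
    for(int p = 0; p < nnz; ++p)
    {
        ++csr_row_ptr[coo_row_ind[p] - idx_base + 1];
    }

    // Prefix sum turns the counts into offsets
    for(int i = 0; i < m; ++i)
    {
        csr_row_ptr[i + 1] += csr_row_ptr[i];
    }

    // Shift the offsets to the requested index base
    for(int& offset : csr_row_ptr)
    {
        offset += idx_base;
    }
    return csr_row_ptr;
}
```

The counting step would also accept unsorted row indices, but the resulting offsets only describe the matrix correctly if the column index and value arrays are ordered by row, which is why the sorted input is assumed.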
rocsparse_csr2csc_buffer_size()¶
-
rocsparse_status
rocsparse_csr2csc_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_action copy_values, size_t *buffer_size)¶ Convert a sparse CSR matrix into a sparse CSC matrix.
rocsparse_csr2csc_buffer_size returns the size of the temporary storage buffer required by rocsparse_scsr2csc() and rocsparse_dcsr2csc(). The temporary storage buffer must be allocated by the user.
- Parameters
[in] handle: handle to the rocsparse library context queue.[in] m: number of rows of the sparse CSR matrix.[in] n: number of columns of the sparse CSR matrix.[in] nnz: number of non-zero entries of the sparse CSR matrix.[in] csr_row_ptr: array ofm+1elements that point to the start of every row of the sparse CSR matrix.[in] csr_col_ind: array ofnnzelements containing the column indices of the sparse CSR matrix.[in] copy_values: rocsparse_action_symbolic or rocsparse_action_numeric.[out] buffer_size: number of bytes of the temporary storage buffer required by sparse_csr2csc().
- Return Value
rocsparse_status_success: the operation completed successfully.rocsparse_status_invalid_handle: the library context was not initialized.rocsparse_status_invalid_size:m,nornnzis invalid.rocsparse_status_invalid_pointer:csr_row_ptr,csr_col_indorbuffer_sizepointer is invalid.rocsparse_status_internal_error: an internal error occurred.
rocsparse_csr2csc()
rocsparse_status rocsparse_scsr2csc(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, float *csc_val, rocsparse_int *csc_row_ind, rocsparse_int *csc_col_ptr, rocsparse_action copy_values, rocsparse_index_base idx_base, void *temp_buffer)
rocsparse_status rocsparse_dcsr2csc(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, double *csc_val, rocsparse_int *csc_row_ind, rocsparse_int *csc_col_ptr, rocsparse_action copy_values, rocsparse_index_base idx_base, void *temp_buffer)
Convert a sparse CSR matrix into a sparse CSC matrix.
rocsparse_csr2csc converts a CSR matrix into a CSC matrix. rocsparse_csr2csc can also be used to convert a CSC matrix into a CSR matrix. copy_values decides whether csc_val is filled during conversion (rocsparse_action_numeric) or not (rocsparse_action_symbolic).
rocsparse_csr2csc requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_csr2csc_buffer_size().
- Note
The resulting matrix can also be seen as the transpose of the input matrix.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
This example computes the transpose of a CSR matrix.
//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8
rocsparse_int m_A   = 3;
rocsparse_int n_A   = 5;
rocsparse_int nnz_A = 8;

csr_row_ptr_A[m_A + 1] = {0, 3, 5, 8};             // device memory
csr_col_ind_A[nnz_A]   = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val_A[nnz_A]       = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate memory for transposed CSR matrix
rocsparse_int m_T   = n_A;
rocsparse_int n_T   = m_A;
rocsparse_int nnz_T = nnz_A;

rocsparse_int* csr_row_ptr_T;
rocsparse_int* csr_col_ind_T;
float* csr_val_T;

hipMalloc((void**)&csr_row_ptr_T, sizeof(rocsparse_int) * (m_T + 1));
hipMalloc((void**)&csr_col_ind_T, sizeof(rocsparse_int) * nnz_T);
hipMalloc((void**)&csr_val_T, sizeof(float) * nnz_T);

// Obtain the temporary buffer size
size_t buffer_size;
rocsparse_csr2csc_buffer_size(handle, m_A, n_A, nnz_A, csr_row_ptr_A, csr_col_ind_A, rocsparse_action_numeric, &buffer_size);

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

// The CSC arrays of A are the CSR arrays of its transpose
rocsparse_scsr2csc(handle, m_A, n_A, nnz_A, csr_val_A, csr_row_ptr_A, csr_col_ind_A, csr_val_T, csr_col_ind_T, csr_row_ptr_T, rocsparse_action_numeric, rocsparse_index_base_zero, temp_buffer);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] csc_val: array of nnz elements of the sparse CSC matrix.
[out] csc_row_ind: array of nnz elements containing the row indices of the sparse CSC matrix.
[out] csc_col_ptr: array of n+1 elements that point to the start of every column of the sparse CSC matrix.
[in] copy_values: rocsparse_action_symbolic or rocsparse_action_numeric.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_csr2csc_buffer_size().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: csr_val, csr_row_ptr, csr_col_ind, csc_val, csc_row_ind, csc_col_ptr or temp_buffer pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.
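The relationship between the CSR input and the CSC output can be checked on the host with a small reference routine. The sketch below is an illustration under stated assumptions, not the library's algorithm: the names CscHost and csr2csc_host are hypothetical, indices are zero-based, and values are always copied (the rocsparse_action_numeric case). It counts the entries per column, builds the column offsets with a prefix sum, and then scatters each CSR entry into its column.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side container for the CSC result.
struct CscHost {
    std::vector<float> val;
    std::vector<int>   row_ind;
    std::vector<int>   col_ptr;
};

// Host-side reference sketch of a CSR to CSC conversion (zero-based indices).
CscHost csr2csc_host(int m, int n,
                     const std::vector<float>& csr_val,
                     const std::vector<int>&   csr_row_ptr,
                     const std::vector<int>&   csr_col_ind)
{
    assert(m >= 0 && n >= 0);
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    const int nnz = csr_row_ptr[m];

    CscHost csc{std::vector<float>(nnz), std::vector<int>(nnz), std::vector<int>(n + 1, 0)};

    // Count the entries of each column, then prefix-sum into column offsets.
    for (int k = 0; k < nnz; ++k)
        ++csc.col_ptr[csr_col_ind[k] + 1];
    for (int j = 0; j < n; ++j)
        csc.col_ptr[j + 1] += csc.col_ptr[j];

    // Scatter every CSR entry to the next free slot of its column.
    std::vector<int> next(csc.col_ptr.begin(), csc.col_ptr.end() - 1);
    for (int i = 0; i < m; ++i) {
        for (int k = csr_row_ptr[i]; k < csr_row_ptr[i + 1]; ++k) {
            const int dst    = next[csr_col_ind[k]]++;
            csc.row_ind[dst] = i;
            csc.val[dst]     = csr_val[k];
        }
    }
    return csc;
}
```

Because rows are visited in increasing order, the row indices inside each column come out sorted. Reading col_ptr, row_ind and val as the CSR arrays of a 5 x 3 matrix gives the transpose, as stated in the note above.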
rocsparse_csr2ell_width()
rocsparse_status rocsparse_csr2ell_width(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const rocsparse_int *csr_row_ptr, const rocsparse_mat_descr ell_descr, rocsparse_int *ell_width)
Convert a sparse CSR matrix into a sparse ELL matrix.
rocsparse_csr2ell_width computes the ELL width of a given CSR matrix, that is, the maximum number of non-zero elements per row over all rows.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[out] ell_width: pointer to the number of non-zero elements per row in ELL storage format.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m is invalid.
rocsparse_status_invalid_pointer: csr_descr, csr_row_ptr, or ell_width pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
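The ELL width follows directly from the CSR row offsets. As a minimal host-side sketch (the helper name csr2ell_width_host is an assumption, and this is not the device implementation), the width is the largest difference between two consecutive entries of csr_row_ptr:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference sketch: maximum number of non-zeros in any row.
int csr2ell_width_host(const std::vector<int>& csr_row_ptr)
{
    assert(!csr_row_ptr.empty()); // m + 1 entries, so at least one
    int width = 0;
    for (std::size_t i = 0; i + 1 < csr_row_ptr.size(); ++i)
        width = std::max(width, csr_row_ptr[i + 1] - csr_row_ptr[i]);
    return width;
}
```

For row offsets {0, 3, 5, 8} the rows hold 3, 2 and 3 entries, so the width is 3.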
rocsparse_csr2ell()
rocsparse_status rocsparse_scsr2ell(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, float *ell_val, rocsparse_int *ell_col_ind)
rocsparse_status rocsparse_dcsr2ell(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, double *ell_val, rocsparse_int *ell_col_ind)
Convert a sparse CSR matrix into a sparse ELL matrix.
rocsparse_csr2ell converts a CSR matrix into an ELL matrix. It is assumed that ell_val and ell_col_ind are allocated. The allocation size is the number of rows times the number of ELL non-zero elements per row, such that \( nnz_{ELL} = m \cdot ell\_width\). The number of ELL non-zero elements per row is obtained by rocsparse_csr2ell_width().
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
This example converts a CSR matrix into an ELL matrix.
//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8
rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m + 1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz]   = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]       = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Create CSR matrix descriptor (csr_descr was used but not created in the original snippet)
rocsparse_mat_descr csr_descr;
rocsparse_create_mat_descr(&csr_descr);

// Create ELL matrix descriptor
rocsparse_mat_descr ell_descr;
rocsparse_create_mat_descr(&ell_descr);

// Obtain the ELL width
rocsparse_int ell_width;
rocsparse_csr2ell_width(handle, m, csr_descr, csr_row_ptr, ell_descr, &ell_width);

// Compute ELL non-zero entries
rocsparse_int ell_nnz = m * ell_width;

// Allocate ELL column and value arrays
rocsparse_int* ell_col_ind;
hipMalloc((void**)&ell_col_ind, sizeof(rocsparse_int) * ell_nnz);

float* ell_val;
hipMalloc((void**)&ell_val, sizeof(float) * ell_nnz);

// Format conversion
rocsparse_scsr2ell(handle, m, csr_descr, csr_val, csr_row_ptr, csr_col_ind, ell_descr, ell_width, ell_val, ell_col_ind);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array containing the values of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array containing the column indices of the sparse CSR matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_width: number of non-zero elements per row in ELL storage format.
[out] ell_val: array of m times ell_width elements of the sparse ELL matrix.
[out] ell_col_ind: array of m times ell_width elements containing the column indices of the sparse ELL matrix.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or ell_width is invalid.
rocsparse_status_invalid_pointer: csr_descr, csr_val, csr_row_ptr, csr_col_ind, ell_descr, ell_val or ell_col_ind pointer is invalid.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
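The following host-side sketch illustrates how the m times ell_width arrays could be filled. It rests on assumptions that are not stated in this section: the ELL arrays are taken to be stored column-major (slot p of row i lives at index p * m + i), and unused slots are padded with value 0 and column index -1. The names EllHost and csr2ell_host are hypothetical, and the routine is a reference for checking small cases rather than the library's kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side container for the ELL result.
struct EllHost {
    int                width;
    std::vector<float> val;     // m * width entries
    std::vector<int>   col_ind; // m * width entries
};

// Host-side reference sketch of a CSR to ELL conversion (zero-based indices).
// Assumed layout: column-major, padding marked by column index -1.
EllHost csr2ell_host(int m,
                     const std::vector<float>& csr_val,
                     const std::vector<int>&   csr_row_ptr,
                     const std::vector<int>&   csr_col_ind,
                     int ell_width)
{
    assert(m >= 0 && ell_width >= 0);
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);

    EllHost ell{ell_width,
                std::vector<float>(static_cast<std::size_t>(m) * ell_width, 0.0f),
                std::vector<int>(static_cast<std::size_t>(m) * ell_width, -1)};

    for (int i = 0; i < m; ++i) {
        int p = 0; // slot within row i
        for (int k = csr_row_ptr[i]; k < csr_row_ptr[i + 1]; ++k, ++p) {
            assert(p < ell_width); // the width must cover the longest row
            const std::size_t idx = static_cast<std::size_t>(p) * m + i;
            ell.val[idx]     = csr_val[k];
            ell.col_ind[idx] = csr_col_ind[k];
        }
    }
    return ell;
}
```

For the 3 x 5 matrix in the example above with ell_width = 3, only the second row needs padding, which shows up as a single -1 in the column index array.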
rocsparse_ell2csr_nnz()
rocsparse_status rocsparse_ell2csr_nnz(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, const rocsparse_int *ell_col_ind, const rocsparse_mat_descr csr_descr, rocsparse_int *csr_row_ptr, rocsparse_int *csr_nnz)
Convert a sparse ELL matrix into a sparse CSR matrix.
rocsparse_ell2csr_nnz computes, for a given ELL matrix, the total number of CSR non-zero elements and the CSR row offsets that point to the start of every row of the sparse CSR matrix. It is assumed that csr_row_ptr has been allocated with size m + 1.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse ELL matrix.
[in] n: number of columns of the sparse ELL matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_width: number of non-zero elements per row in ELL storage format.
[in] ell_col_ind: array of m times ell_width elements containing the column indices of the sparse ELL matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[out] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[out] csr_nnz: pointer to the total number of non-zero elements in CSR storage format.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or ell_width is invalid.
rocsparse_status_invalid_pointer: ell_descr, ell_col_ind, csr_descr, csr_row_ptr or csr_nnz pointer is invalid.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
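A host-side sketch of the counting step is shown below. It assumes the same conventions as a column-major ELL layout in which padded slots carry a column index outside the range [0, n), for example -1; these conventions and the helper name ell2csr_nnz_host are assumptions for illustration and are not taken from this section. Each row counts its valid column indices, and a prefix sum produces the CSR row offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference sketch: CSR row offsets and nnz of an ELL matrix.
// Assumed layout: column-major, slot p of row i at index p * m + i,
// padding marked by a column index outside [0, n).
std::vector<int> ell2csr_nnz_host(int m, int n, int ell_width,
                                  const std::vector<int>& ell_col_ind,
                                  int& csr_nnz)
{
    assert(m >= 0 && n >= 0 && ell_width >= 0);
    assert(ell_col_ind.size() == static_cast<std::size_t>(m) * ell_width);

    std::vector<int> csr_row_ptr(m + 1, 0);
    for (int i = 0; i < m; ++i) {
        for (int p = 0; p < ell_width; ++p) {
            const int col = ell_col_ind[static_cast<std::size_t>(p) * m + i];
            if (col >= 0 && col < n)
                ++csr_row_ptr[i + 1]; // a real entry, not padding
        }
    }

    // Prefix sum over the per-row counts.
    for (int i = 0; i < m; ++i)
        csr_row_ptr[i + 1] += csr_row_ptr[i];

    csr_nnz = csr_row_ptr[m];
    return csr_row_ptr;
}
```

The returned offsets and csr_nnz are what a caller needs in order to size the CSR column and value arrays before the actual ELL to CSR conversion.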
rocsparse_ell2csr()
rocsparse_csr2hyb()
rocsparse_status rocsparse_scsr2hyb(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_hyb_mat hyb, rocsparse_int user_ell_width, rocsparse_hyb_partition partition_type)
rocsparse_status rocsparse_dcsr2hyb(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_hyb_mat hyb, rocsparse_int user_ell_width, rocsparse_hyb_partition partition_type)
Convert a sparse CSR matrix into a sparse HYB matrix.
rocsparse_csr2hyb converts a CSR matrix into a HYB matrix. It is assumed that hyb has been initialized with rocsparse_create_hyb_mat().
- Note
This function requires a significant amount of storage for the HYB matrix, depending on the matrix structure.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
This example converts a CSR matrix into a HYB matrix using user defined partitioning.
// Create HYB matrix structure
rocsparse_hyb_mat hyb;
rocsparse_create_hyb_mat(&hyb);

// User defined ell width
rocsparse_int user_ell_width = 5;

// Perform the conversion
rocsparse_scsr2hyb(handle, m, n, descr, csr_val, csr_row_ptr, csr_col_ind, hyb, user_ell_width, rocsparse_hyb_partition_user);

// Do some work

// Clean up
rocsparse_destroy_hyb_mat(hyb);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array containing the values of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array containing the column indices of the sparse CSR matrix.
[out] hyb: sparse matrix in HYB format.
[in] user_ell_width: width of the ELL part of the HYB matrix (only required if partition_type == rocsparse_hyb_partition_user).
[in] partition_type: rocsparse_hyb_partition_auto (recommended), rocsparse_hyb_partition_user or rocsparse_hyb_partition_max.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or user_ell_width is invalid.
rocsparse_status_invalid_value: partition_type is invalid.
rocsparse_status_invalid_pointer: descr, hyb, csr_val, csr_row_ptr or csr_col_ind pointer is invalid.
rocsparse_status_memory_error: the buffer for the HYB matrix could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
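To get a feeling for the storage note above, the effect of a user-defined ELL width can be estimated on the host before converting. The sketch below assumes the usual HYB layout, in which each row stores up to ell_width entries in a padded ELL part and any remaining entries go to a COO part; this layout, the type HybSplitHost and the function hyb_split_host are assumptions for illustration, not a description of the library's internal partitioning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: storage implied by a fixed ELL width.
struct HybSplitHost {
    long long ell_slots; // m * ell_width, including padded slots
    long long coo_nnz;   // entries that do not fit into the ELL part
};

// Host-side estimate of the ELL/COO split for a given ELL width.
HybSplitHost hyb_split_host(const std::vector<int>& csr_row_ptr, int ell_width)
{
    assert(!csr_row_ptr.empty() && ell_width >= 0);
    const long long m = static_cast<long long>(csr_row_ptr.size()) - 1;

    HybSplitHost split{m * ell_width, 0};
    for (std::size_t i = 0; i + 1 < csr_row_ptr.size(); ++i) {
        const int row_nnz = csr_row_ptr[i + 1] - csr_row_ptr[i];
        split.coo_nnz += std::max(0, row_nnz - ell_width);
    }
    return split;
}
```

With row offsets {0, 3, 5, 8}, a width of 2 gives 6 ELL slots and 2 COO entries, while a width of 5 gives 15 ELL slots for only 8 non-zeros. A width that is much larger than the typical row length therefore mostly stores padding.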
rocsparse_create_identity_permutation()
rocsparse_status rocsparse_create_identity_permutation(rocsparse_handle handle, rocsparse_int n, rocsparse_int *p)
Create the identity map.
rocsparse_create_identity_permutation stores the identity map in p, such that \(p = 0:1:(n-1)\).
for(i = 0; i < n; ++i)
{
    p[i] = i;
}
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
The following example creates an identity permutation.
rocsparse_int size = 200;

// Allocate memory to hold the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * size);

// Fill perm with the identity permutation
rocsparse_create_identity_permutation(handle, size, perm);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] n: size of the map p.
[out] p: array of n integers containing the map.
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: n is invalid.
rocsparse_status_invalid_pointer: p pointer is invalid.
rocsparse_csrsort_buffer_size()
rocsparse_status rocsparse_csrsort_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, size_t *buffer_size)
Sort a sparse CSR matrix.
rocsparse_csrsort_buffer_size returns the size of the temporary storage buffer required by rocsparse_csrsort(). The temporary storage buffer must be allocated by the user.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_csrsort().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: csr_row_ptr, csr_col_ind or buffer_size pointer is invalid.
rocsparse_csrsort()
rocsparse_status rocsparse_csrsort(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const rocsparse_int *csr_row_ptr, rocsparse_int *csr_col_ind, rocsparse_int *perm, void *temp_buffer)
Sort a sparse CSR matrix.
rocsparse_csrsort sorts a matrix in CSR format. The sorted permutation vector perm can be used to obtain a sorted csr_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().
rocsparse_csrsort requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_csrsort_buffer_size().
- Note
perm can be NULL if a sorted permutation vector is not required.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
The following example sorts a \(3 \times 3\) CSR matrix.
//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

csr_row_ptr[m + 1] = {0, 3, 6, 9};                // device memory
csr_col_ind[nnz]   = {2, 0, 1, 0, 1, 2, 0, 2, 1}; // device memory
csr_val[nnz]       = {3, 1, 2, 4, 5, 6, 7, 9, 8}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_csrsort_buffer_size(handle, m, n, nnz, csr_row_ptr, csr_col_ind, &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the CSR matrix
rocsparse_csrsort(handle, m, n, nnz, descr, csr_row_ptr, csr_col_ind, perm, temp_buffer);

// Gather sorted csr_val array
float* csr_val_sorted;
hipMalloc((void**)&csr_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, csr_val, csr_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(csr_val);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[inout] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_csrsort_buffer_size().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_row_ptr, csr_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
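The interplay between the sort, the permutation vector and the gather step can be reproduced on the host. The following sketch is a reference under stated assumptions, not the device algorithm: csrsort_host and gather_host are hypothetical names, indices are zero-based, and perm always starts as the identity map. Column indices are sorted within each row while perm records where each sorted entry came from, and the gather applies y[k] = x[perm[k]].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference sketch: sort column indices within each CSR row.
// Returns perm, where perm[k] is the original position of sorted entry k.
std::vector<int> csrsort_host(const std::vector<int>& csr_row_ptr,
                              std::vector<int>&       csr_col_ind)
{
    assert(!csr_row_ptr.empty());
    assert(static_cast<std::size_t>(csr_row_ptr.back()) == csr_col_ind.size());

    // Start from the identity permutation.
    std::vector<int> perm(csr_col_ind.size());
    std::iota(perm.begin(), perm.end(), 0);

    // Sort the positions of each row by their column index.
    for (std::size_t i = 0; i + 1 < csr_row_ptr.size(); ++i) {
        std::stable_sort(perm.begin() + csr_row_ptr[i],
                         perm.begin() + csr_row_ptr[i + 1],
                         [&](int a, int b) { return csr_col_ind[a] < csr_col_ind[b]; });
    }

    // Apply the permutation to the column indices.
    std::vector<int> sorted(csr_col_ind.size());
    for (std::size_t k = 0; k < perm.size(); ++k)
        sorted[k] = csr_col_ind[perm[k]];
    csr_col_ind.swap(sorted);
    return perm;
}

// Gather: y[k] = x[perm[k]], the host analogue of the gather step above.
std::vector<float> gather_host(const std::vector<float>& x, const std::vector<int>& perm)
{
    std::vector<float> y(perm.size());
    for (std::size_t k = 0; k < perm.size(); ++k)
        y[k] = x[perm[k]];
    return y;
}
```

Applied to the 3 x 3 example above, the column indices become {0, 1, 2, 0, 1, 2, 0, 1, 2} and gathering csr_val through perm yields the values 1 through 9 in order.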
rocsparse_coosort_buffer_size()
rocsparse_status rocsparse_coosort_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, size_t *buffer_size)
Sort a sparse COO matrix.
rocsparse_coosort_buffer_size returns the size of the temporary storage buffer required by rocsparse_coosort_by_row() and rocsparse_coosort_by_column(). The temporary storage buffer must be allocated by the user.
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_coosort_by_row() and rocsparse_coosort_by_column().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_coosort_by_row()
rocsparse_status rocsparse_coosort_by_row(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, rocsparse_int *coo_row_ind, rocsparse_int *coo_col_ind, rocsparse_int *perm, void *temp_buffer)
Sort a sparse COO matrix by row.
rocsparse_coosort_by_row sorts a matrix in COO format by row. The sorted permutation vector perm can be used to obtain a sorted coo_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().
rocsparse_coosort_by_row requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_coosort_buffer_size().
- Note
perm can be NULL if a sorted permutation vector is not required.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
The following example sorts a \(3 \times 3\) COO matrix by row indices.
//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

coo_row_ind[nnz] = {0, 1, 2, 0, 1, 2, 0, 1, 2}; // device memory
coo_col_ind[nnz] = {0, 0, 0, 1, 1, 1, 2, 2, 2}; // device memory
coo_val[nnz]     = {1, 4, 7, 2, 5, 8, 3, 6, 9}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_coosort_buffer_size(handle, m, n, nnz, coo_row_ind, coo_col_ind, &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the COO matrix
rocsparse_coosort_by_row(handle, m, n, nnz, coo_row_ind, coo_col_ind, perm, temp_buffer);

// Gather sorted coo_val array
float* coo_val_sorted;
hipMalloc((void**)&coo_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, coo_val, coo_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(coo_val);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[inout] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[inout] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_coosort_buffer_size().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
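A host-side reference for the row sort can help when validating results on small matrices. The sketch below is an illustration, not the device algorithm. The name coosort_by_row_host is hypothetical, and ordering entries with equal row index by their column index is an assumption about the tie-breaking rule. It sorts a permutation of positions by the (row, column) pair and then applies it to both index arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Host-side reference sketch: sort COO entries by row, then by column.
// Returns perm, where perm[k] is the original position of sorted entry k.
std::vector<int> coosort_by_row_host(std::vector<int>& coo_row_ind,
                                     std::vector<int>& coo_col_ind)
{
    assert(coo_row_ind.size() == coo_col_ind.size());

    // Start from the identity permutation.
    std::vector<int> perm(coo_row_ind.size());
    std::iota(perm.begin(), perm.end(), 0);

    // Order positions lexicographically by (row, column).
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return std::make_pair(coo_row_ind[a], coo_col_ind[a])
             < std::make_pair(coo_row_ind[b], coo_col_ind[b]);
    });

    // Apply the permutation to both index arrays.
    std::vector<int> rows(perm.size()), cols(perm.size());
    for (std::size_t k = 0; k < perm.size(); ++k) {
        rows[k] = coo_row_ind[perm[k]];
        cols[k] = coo_col_ind[perm[k]];
    }
    coo_row_ind.swap(rows);
    coo_col_ind.swap(cols);
    return perm;
}
```

For the column-ordered input of the example above, the resulting permutation is {0, 3, 6, 1, 4, 7, 2, 5, 8}, and gathering coo_val through it gives the values in row-major order.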
rocsparse_coosort_by_column()
rocsparse_status rocsparse_coosort_by_column(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, rocsparse_int *coo_row_ind, rocsparse_int *coo_col_ind, rocsparse_int *perm, void *temp_buffer)
Sort a sparse COO matrix by column.
rocsparse_coosort_by_column sorts a matrix in COO format by column. The sorted permutation vector perm can be used to obtain a sorted coo_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().
rocsparse_coosort_by_column requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_coosort_buffer_size().
- Note
perm can be NULL if a sorted permutation vector is not required.
- Note
This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.
- Example
The following example sorts a \(3 \times 3\) COO matrix by column indices.
//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

coo_row_ind[nnz] = {0, 0, 0, 1, 1, 1, 2, 2, 2}; // device memory
coo_col_ind[nnz] = {0, 1, 2, 0, 1, 2, 0, 1, 2}; // device memory
coo_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8, 9}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_coosort_buffer_size(handle, m, n, nnz, coo_row_ind, coo_col_ind, &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the COO matrix
rocsparse_coosort_by_column(handle, m, n, nnz, coo_row_ind, coo_col_ind, perm, temp_buffer);

// Gather sorted coo_val array
float* coo_val_sorted;
hipMalloc((void**)&coo_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, coo_val, coo_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(coo_val);
- Parameters
[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[inout] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[inout] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_coosort_buffer_size().
- Return Value
rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
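The column sort mirrors the row sort with the roles of the two index arrays exchanged. The host-side sketch below is again only a reference for small test cases: coosort_by_column_host is a hypothetical name, and ordering entries with equal column index by their row index is an assumption about the tie-breaking rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Host-side reference sketch: sort COO entries by column, then by row.
// Returns perm, where perm[k] is the original position of sorted entry k.
std::vector<int> coosort_by_column_host(std::vector<int>& coo_row_ind,
                                        std::vector<int>& coo_col_ind)
{
    assert(coo_row_ind.size() == coo_col_ind.size());

    // Start from the identity permutation.
    std::vector<int> perm(coo_col_ind.size());
    std::iota(perm.begin(), perm.end(), 0);

    // Order positions lexicographically by (column, row).
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return std::make_pair(coo_col_ind[a], coo_row_ind[a])
             < std::make_pair(coo_col_ind[b], coo_row_ind[b]);
    });

    // Apply the permutation to both index arrays.
    std::vector<int> rows(perm.size()), cols(perm.size());
    for (std::size_t k = 0; k < perm.size(); ++k) {
        rows[k] = coo_row_ind[perm[k]];
        cols[k] = coo_col_ind[perm[k]];
    }
    coo_row_ind.swap(rows);
    coo_col_ind.swap(cols);
    return perm;
}
```

Starting from the row-ordered input of the example above, the entries end up in column-major order, so gathering coo_val through the returned permutation gives {1, 4, 7, 2, 5, 8, 3, 6, 9}.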