ROCm Libraries¶

rocFFT¶

rocFFT is a software library for computing Fast Fourier Transforms (FFT) written in HIP. It is part of AMD’s software ecosystem based on ROCm. In addition to AMD GPU devices, the library can also be compiled with the CUDA compiler using HIP tools for running on Nvidia GPU devices.

API design¶

Please refer to the rocFFT API design for current documentation. Work in progress.

Installing pre-built packages¶

Download pre-built packages either from ROCm’s package servers or from the GitHub releases tab, where the builds may be newer. Release notes are available for each release on the releases tab.

sudo apt update && sudo apt install rocfft

Quickstart rocFFT build¶

Bash helper build script (Ubuntu only)

The root of this repository has a helper bash script install.sh to build and install rocFFT on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.

./install -h – shows help
./install -id – build library, build dependencies and install globally (-d flag only needs to be specified once on a system)
./install -c --cuda – build library and clients for cuda backend into a local directory

Manual build (all supported platforms)

If you use a distro other than Ubuntu, or would like more control over the build process, the rocfft build wiki has helpful information on how to configure cmake and manually build.

Library and API Documentation

Please refer to the Library documentation for current documentation.

Example¶

The following simple example shows how to use rocFFT to compute a 1D single-precision 16-point complex forward transform.

#include <iostream>
#include <vector>
#include "hip/hip_runtime_api.h"
#include "hip/hip_vector_types.h"
#include "rocfft.h"

int main()
{
        // rocFFT gpu compute
        // ========================================

        size_t N = 16;
        size_t Nbytes = N * sizeof(float2);

        // Create HIP device buffer
        float2 *x;
        hipMalloc(&x, Nbytes);

        // Initialize data
        std::vector<float2> cx(N);
        for (size_t i = 0; i < N; i++)
        {
                cx[i].x = 1;
                cx[i].y = -1;
        }

        //  Copy data to device
        hipMemcpy(x, cx.data(), Nbytes, hipMemcpyHostToDevice);

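        // Initialize rocFFT (once, before any other rocFFT call)
        rocfft_setup();

        // Create rocFFT plan: 1D, in-place, single precision, one transform
        rocfft_plan plan = NULL;
        size_t length = N;
        rocfft_plan_create(&plan, rocfft_placement_inplace,
                           rocfft_transform_type_complex_forward, rocfft_precision_single,
                           1, &length, 1, NULL);

        // Execute plan (in-place, so the output buffer is NULL)
        rocfft_execute(plan, (void**) &x, NULL, NULL);

        // Wait for execution to finish
        hipDeviceSynchronize();

        // Destroy plan
        rocfft_plan_destroy(plan);

        // Release rocFFT resources (once, after the last rocFFT call)
        rocfft_cleanup();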

        // Copy result back to host
        std::vector<float2> y(N);
        hipMemcpy(y.data(), x, Nbytes, hipMemcpyDeviceToHost);

        // Print results
        for (size_t i = 0; i < N; i++)
        {
                std::cout << y[i].x << ", " << y[i].y << std::endl;
        }

        // Free device buffer
        hipFree(x);

        return 0;
}
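To check the printed output on the host, a naive O(N²) DFT can serve as a reference. The sketch below is not part of rocFFT: `reference_dft` is a hypothetical helper, and it assumes the forward transform uses the exp(-2πi·jk/N) sign convention with no scaling. Under that assumption, the constant input used above should give (16, -16) in bin 0 and values close to zero in every other bin.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) forward DFT on the host: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/N).
// Hypothetical helper for checking results; not a rocFFT function.
std::vector<std::complex<float>> reference_dft(const std::vector<std::complex<float>>& x)
{
    const std::size_t N = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<float>> X(N);
    for (std::size_t k = 0; k < N; ++k)
    {
        // Accumulate in double precision to keep rounding error small.
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < N; ++j)
        {
            // Reduce j*k modulo N so the angle stays within one period.
            const double angle
                = -two_pi * static_cast<double>((j * k) % N) / static_cast<double>(N);
            sum += std::complex<double>(x[j])
                   * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = std::complex<float>(static_cast<float>(sum.real()),
                                   static_cast<float>(sum.imag()));
    }
    return X;
}
```

Comparing each entry of `y` from the example against `reference_dft(cx)` (after converting `float2` to `std::complex<float>`) with a small tolerance is usually enough for a quick sanity check.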

API¶

This section provides details of the library API

Types¶

A few data structures are internal to the library. The pointer types to these structures are given below. Use these types to create handles and pass them between library functions.

typedef struct rocfft_plan_t *rocfft_plan¶

Pointer type to plan structure.

This type is used to declare a plan handle that can be initialized with rocfft_plan_create

typedef struct rocfft_plan_description_t *rocfft_plan_description¶

Pointer type to plan description structure.

This type is used to declare a plan description handle that can be initialized with rocfft_plan_description_create

typedef struct rocfft_execution_info_t *rocfft_execution_info¶

Pointer type to execution info structure.

This type is used to declare an execution info handle that can be initialized with rocfft_execution_info_create

Library Setup and Cleanup¶

The following functions deal with initialization and cleanup of the library.

rocfft_status rocfft_setup()¶: Library setup function, called once in program before start of library use.

rocfft_status rocfft_cleanup()¶: Library cleanup function, called once in program after end of library use.

Plan¶

The following functions are used to create and destroy plan objects.

rocfft_status rocfft_plan_create(rocfft_plan *plan, rocfft_result_placement placement, rocfft_transform_type transform_type, rocfft_precision precision, size_t dimensions, const size_t *lengths, size_t number_of_transforms, const rocfft_plan_description description)¶

Create an FFT plan.

This API creates a plan, which the user can execute subsequently. It takes the fundamental parameters needed to specify a transform. The dimensions parameter can take a value of 1, 2 or 3. The ‘lengths’ array specifies the size of the data in each dimension. Note that lengths[0] is the size of the innermost dimension, lengths[1] is the next higher dimension and so on. The ‘number_of_transforms’ parameter specifies how many transforms (of the same kind) need to be computed. By specifying a value greater than 1, a batch of transforms can be computed with a single API call. Additionally, a handle to a plan description can be passed for more detailed transforms. For simple transforms, this parameter can be set to a null pointer.

Parameters

[out] plan: plan handle
[in] placement: placement of result
[in] transform_type: type of transform
[in] precision: precision
[in] dimensions: dimensions
[in] lengths: array of transform lengths, with one entry per dimension
[in] number_of_transforms: number of transforms
[in] description: description handle created by rocfft_plan_description_create; can be null ptr for simple transforms

rocfft_status rocfft_plan_destroy(rocfft_plan plan)¶

Destroy an FFT plan.

This API frees a plan once it is no longer needed.

Parameters

[in] plan: plan handle

The following functions are used to query for information after a plan is created.

rocfft_status rocfft_plan_get_work_buffer_size(const rocfft_plan plan, size_t *size_in_bytes)¶

Get work buffer size.

This is one of the plan query functions for obtaining information about a plan. This API gets the work buffer size.

Parameters

[in] plan: plan handle
[out] size_in_bytes: size of needed work buffer in bytes

rocfft_status rocfft_plan_get_print(const rocfft_plan plan)¶

Print all plan information.

This is one of the plan query functions for obtaining information about a plan. This API prints all plan information to stdout to help the user verify the plan specification.

Parameters

[in] plan: plan handle

Plan description¶

In most cases, rocfft_plan_create() is all that is needed to fully specify a transform, and the description object can be skipped. When a transform specification needs more detail, a description object must be created and set up, and its handle passed to rocfft_plan_create(). The functions below manage the plan description in order to specify more transform details. The plan description object can be safely destroyed after the call to rocfft_plan_create().

rocfft_status rocfft_plan_description_create(rocfft_plan_description *description)¶

Create plan description.

This API creates a plan description with which the user can set more plan properties

Parameters

[out] description: plan description handle

rocfft_status rocfft_plan_description_destroy(rocfft_plan_description description)¶

Destroy a plan description.

This API frees the plan description

Parameters

[in] description: plan description handle

rocfft_status rocfft_plan_description_set_data_layout(rocfft_plan_description description, rocfft_array_type in_array_type, rocfft_array_type out_array_type, const size_t *in_offsets, const size_t *out_offsets, size_t in_strides_size, const size_t *in_strides, size_t in_distance, size_t out_strides_size, const size_t *out_strides, size_t out_distance)¶

Set data layout.

This is one of the plan description functions for specifying optional additional plan properties through the description handle. This API specifies the layout of buffers, including the input and output array types. Not all combinations of array types are supported, and an error code is returned for unsupported cases. Input and output buffer offsets can also be specified. A custom data layout can be described by giving the stride between consecutive elements in each dimension and the distance between the starts of consecutive transforms in a batch. The library chooses appropriate defaults if offsets/strides are set to null pointers and/or distances are set to 0.

Parameters

[in] description: description handle
[in] in_array_type: array type of input buffer
[in] out_array_type: array type of output buffer
[in] in_offsets: offsets, in element units, to start of data in input buffer
[in] out_offsets: offsets, in element units, to start of data in output buffer
[in] in_strides_size: size of in_strides array (must be equal to transform dimensions)
[in] in_strides: array of strides, in each dimension, of input buffer; if set to null ptr library chooses defaults
[in] in_distance: distance between start of each data instance in input buffer
[in] out_strides_size: size of out_strides array (must be equal to transform dimensions)
[in] out_strides: array of strides, in each dimension, of output buffer; if set to null ptr library chooses defaults
[in] out_distance: distance between start of each data instance in output buffer
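The layout parameters can be read as an addressing rule: the element at index (i0, i1, ...) of transform number b sits at offset + b*distance + Σ idx[d]*strides[d], in element units. The sketch below spells that rule out on the host. `packed_strides` and `element_offset` are hypothetical helpers, not rocFFT functions, and treating packed (contiguous) strides as the library's default is an assumption rather than something this page states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous strides for the given lengths, with lengths[0] as the innermost
// (fastest-varying) dimension: stride[0] = 1, stride[d] = stride[d-1] * lengths[d-1].
std::vector<std::size_t> packed_strides(const std::vector<std::size_t>& lengths)
{
    std::vector<std::size_t> strides(lengths.size());
    std::size_t s = 1;
    for (std::size_t d = 0; d < lengths.size(); ++d)
    {
        strides[d] = s;
        s *= lengths[d];
    }
    return strides;
}

// Offset, in elements, of index idx within transform number `batch`.
std::size_t element_offset(std::size_t offset, std::size_t distance, std::size_t batch,
                           const std::vector<std::size_t>& strides,
                           const std::vector<std::size_t>& idx)
{
    assert(strides.size() == idx.size());
    std::size_t pos = offset + batch * distance;
    for (std::size_t d = 0; d < idx.size(); ++d)
        pos += idx[d] * strides[d];
    return pos;
}
```

For a batch of 4 x 3 transforms stored back to back, the packed strides are {1, 4} and a natural distance is 12, so element (1, 2) of the third transform lands at offset 2*12 + 1 + 2*4 = 33.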

Execution¶

The following details the execution function. After a plan has been created, it can be used to compute a transform on specified data. Aspects of the execution can be controlled and any useful information returned to the user.

rocfft_status rocfft_execute(const rocfft_plan plan, void *in_buffer[], void *out_buffer[], rocfft_execution_info info)¶

Execute an FFT plan.

This API executes an FFT plan on buffers given by the user. If the transform is in-place, only the input buffer is needed and the output buffer parameter can be set to NULL. For out-of-place transforms, output buffers have to be specified. Note that both the input and output buffer parameters are arrays of pointers; this facilitates passing planar buffers, where the real and imaginary parts are in 2 separate buffers. For the default interleaved format, a one-element array holding the pointer to the input/output buffer is passed. The final parameter is an execution info handle, which lets the user control execution and lets the library pass execution-related information back to the user.

Parameters

[in] plan: plan handle
[inout] in_buffer: array (of size 1 for interleaved data, of size 2 for planar data) of input buffers
[inout] out_buffer: array (of size 1 for interleaved data, of size 2 for planar data) of output buffers, can be nullptr for inplace result placement
[in] info: execution info handle created by rocfft_execution_info_create
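The difference between interleaved and planar storage is easiest to see on the host. The following sketch splits interleaved complex data into separate real and imaginary arrays. `PlanarBuffers` and `to_planar` are hypothetical names used only for illustration and are not part of rocFFT.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Planar storage: real and imaginary parts live in two separate arrays.
struct PlanarBuffers
{
    std::vector<float> re;
    std::vector<float> im;
};

// Split interleaved complex data (re0, im0, re1, im1, ...) into planar form.
PlanarBuffers to_planar(const std::vector<std::complex<float>>& interleaved)
{
    PlanarBuffers p;
    p.re.reserve(interleaved.size());
    p.im.reserve(interleaved.size());
    for (const std::complex<float>& z : interleaved)
    {
        p.re.push_back(z.real());
        p.im.push_back(z.imag());
    }
    return p;
}
```

With planar data, the buffer array passed to rocfft_execute() has two entries, one device pointer for the real parts and one for the imaginary parts, whereas interleaved data needs a single entry.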

Execution info¶

The execution API rocfft_execute() takes a rocfft_execution_info parameter. This parameter needs to be created and set up by the user and passed to the execution API. The execution info handle encapsulates information such as the execution mode and a pointer to any work buffer. It can also hold information that is a side effect of execution, such as event objects. The following functions manage the execution info object. Note that the set functions below must be called before execution and the get functions after execution.

rocfft_status rocfft_execution_info_create(rocfft_execution_info *info)¶

Create execution info.

This API creates an execution info with which the user can control plan execution & retrieve execution information

Parameters

[out] info: execution info handle

rocfft_status rocfft_execution_info_destroy(rocfft_execution_info info)¶

Destroy an execution info.

This API frees the execution info

Parameters

[in] info: execution info handle

rocfft_status rocfft_execution_info_set_work_buffer(rocfft_execution_info info, void *work_buffer, size_t size_in_bytes)¶

Set work buffer in execution info.

This is one of the execution info functions to specify optional additional information to control execution. This API specifies work buffer needed. It has to be called before the call to rocfft_execute. When a non-zero value is obtained from rocfft_plan_get_work_buffer_size, that means the library needs a work buffer to compute the transform. In this case, the user has to allocate the work buffer and pass it to the library via this api.

Parameters

[in] info: execution info handle
[in] work_buffer: work buffer
[in] size_in_bytes: size of work buffer in bytes

rocfft_status rocfft_execution_info_set_stream(rocfft_execution_info info, void *stream)¶

Set stream in execution info.

This is one of the execution info functions to specify optional additional information to control execution. This API specifies compute stream. It has to be called before the call to rocfft_execute. It is the underlying device queue/stream where the library computations would be inserted. The library assumes user has created such a stream in the program and merely assigns work to the stream.

Parameters

[in] info: execution info handle
[in] stream: underlying compute stream

Enumerations¶

This section provides all the enumerations used.

enum rocfft_status¶

rocfft status/error codes

Values:

rocfft_status_success¶

rocfft_status_failure¶

rocfft_status_invalid_arg_value¶

rocfft_status_invalid_dimensions¶

rocfft_status_invalid_array_type¶

rocfft_status_invalid_strides¶

rocfft_status_invalid_distance¶

rocfft_status_invalid_offset¶

enum rocfft_transform_type¶

Type of transform.

Values:

rocfft_transform_type_complex_forward¶

rocfft_transform_type_complex_inverse¶

rocfft_transform_type_real_forward¶

rocfft_transform_type_real_inverse¶

enum rocfft_precision¶

Precision.

Values:

rocfft_precision_single¶

rocfft_precision_double¶

enum rocfft_result_placement¶

Result placement.

Values:

rocfft_placement_inplace¶

rocfft_placement_notinplace¶

enum rocfft_array_type¶

Array type.

Values:

rocfft_array_type_complex_interleaved¶

rocfft_array_type_complex_planar¶

rocfft_array_type_real¶

rocfft_array_type_hermitian_interleaved¶

rocfft_array_type_hermitian_planar¶

enum rocfft_execution_mode¶

Execution mode.

Values:

rocfft_exec_mode_nonblocking¶

rocfft_exec_mode_nonblocking_with_flush¶

rocfft_exec_mode_blocking¶

rocBLAS¶

rocBLAS Github link

A BLAS implementation on top of AMD’s Radeon Open Compute ROCm runtime and toolchains. rocBLAS is implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs.

Installing pre-built packages¶

Download pre-built packages either from ROCm’s package servers or from the GitHub releases tab, where the builds may be newer. Release notes are available for each release on the releases tab.

sudo apt update && sudo apt install rocblas

Quickstart rocBLAS build¶

Bash helper build script (Ubuntu only)

The root of this repository has a helper bash script install.sh to build and install rocBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.

./install -h -- shows help
./install -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)

Manual build (all supported platforms)¶

If you use a distro other than Ubuntu, or would like more control over the build process, the rocblas build wiki has helpful information on how to configure cmake and manually build.

Functions supported

A list of exported functions from rocblas can be found on the wiki

rocBLAS interface examples¶

In general, the rocBLAS interface is compatible with CPU oriented Netlib BLAS and the cuBLAS-v2 API, with the explicit exception that traditional BLAS interfaces do not accept handles. The cuBLAS’ cublasHandle_t is replaced with rocblas_handle everywhere. Thus, porting a CUDA application which originally calls the cuBLAS API to a HIP application calling rocBLAS API should be relatively straightforward. For example, the rocBLAS SGEMV interface is

GEMV API¶

rocblas_status
rocblas_sgemv(rocblas_handle handle,
              rocblas_operation trans,
              rocblas_int m, rocblas_int n,
              const float* alpha,
              const float* A, rocblas_int lda,
              const float* x, rocblas_int incx,
              const float* beta,
              float* y, rocblas_int incy);

Batched and strided GEMM API¶

rocBLAS GEMM can process matrices in batches with regular strides. There are several permutations of these APIs; the following example takes all of the parameters.

rocblas_status
rocblas_sgemm_strided_batched(
    rocblas_handle handle,
    rocblas_operation transa, rocblas_operation transb,
    rocblas_int m, rocblas_int n, rocblas_int k,
    const float* alpha,
    const float* A, rocblas_int ls_a, rocblas_int ld_a, rocblas_int bs_a,
    const float* B, rocblas_int ls_b, rocblas_int ld_b, rocblas_int bs_b,
    const float* beta,
          float* C, rocblas_int ls_c, rocblas_int ld_c, rocblas_int bs_c,
    rocblas_int batch_count );

rocBLAS assumes that the matrices A and the vectors x and y are already allocated in GPU memory and filled with data. Users are responsible for copying data between host and device memory. HIP provides memcpy-style APIs to facilitate data management.

Asynchronous API¶

Most library routines (like BLAS-1 SCAL, BLAS-2 GEMV, BLAS-3 GEMM) operate asynchronously with respect to the CPU, meaning these library functions return immediately. The exceptions are a few routines (like TRSM) that allocate memory internally, which prevents asynchronicity.

For more information regarding the rocBLAS library and the corresponding API documentation, refer to rocBLAS.

API¶

This section provides details of the library API

Types¶

Definitions¶

rocblas_int¶

typedef int32_t rocblas_int¶: To specify whether int32 or int64 is used.

rocblas_long¶

typedef int64_t rocblas_long¶

rocblas_float_complex¶

typedef float2 rocblas_float_complex¶

rocblas_double_complex¶

typedef double2 rocblas_double_complex¶

rocblas_half¶

typedef uint16_t rocblas_half¶

rocblas_half_complex¶

typedef float2 rocblas_half_complex¶

rocblas_handle¶

typedef struct _rocblas_handle *rocblas_handle¶

Enums¶

Enumeration constants have numbering that is consistent with CBLAS, ACML and most standard C BLAS libraries.

rocblas_operation¶

enum rocblas_operation¶

Used to specify whether the matrix is to be transposed or not.

parameter constants. numbering is consistent with CBLAS, ACML and most standard C BLAS libraries

Values:

rocblas_operation_none = 111¶: Operate with the matrix.

rocblas_operation_transpose = 112¶: Operate with the transpose of the matrix.

rocblas_operation_conjugate_transpose = 113¶: Operate with the conjugate transpose of the matrix.

rocblas_fill¶

enum rocblas_fill¶

Used by the Hermitian, symmetric and triangular matrix routines to specify whether the upper or lower triangle is being referenced.

Values:

rocblas_fill_upper = 121¶: Upper triangle.

rocblas_fill_lower = 122¶: Lower triangle.

rocblas_fill_full = 123¶

rocblas_diagonal¶

enum rocblas_diagonal¶

It is used by the triangular matrix routines to specify whether the matrix is unit triangular.

Values:

rocblas_diagonal_non_unit = 131¶: Non-unit triangular.

rocblas_diagonal_unit = 132¶: Unit triangular.

rocblas_side¶

enum rocblas_side¶

Indicates the side matrix A is located relative to matrix B during multiplication.

Values:

rocblas_side_left = 141¶: Multiply general matrix by symmetric, Hermitian or triangular matrix on the left.

rocblas_side_right = 142¶: Multiply general matrix by symmetric, Hermitian or triangular matrix on the right.

rocblas_side_both = 143¶

rocblas_status¶

enum rocblas_status¶

rocblas status codes definition

Values:

rocblas_status_success = 0¶: success

rocblas_status_invalid_handle = 1¶: handle not initialized, invalid or null

rocblas_status_not_implemented = 2¶: function is not implemented

rocblas_status_invalid_pointer = 3¶: invalid pointer parameter

rocblas_status_invalid_size = 4¶: invalid size parameter

rocblas_status_memory_error = 5¶: failed internal memory allocation, copy or dealloc

rocblas_status_internal_error = 6¶: other internal library failure

rocblas_datatype¶

enum rocblas_datatype¶

Indicates the precision width of data stored in a blas type.

Values:

rocblas_datatype_f16_r = 150¶

rocblas_datatype_f32_r = 151¶

rocblas_datatype_f64_r = 152¶

rocblas_datatype_f16_c = 153¶

rocblas_datatype_f32_c = 154¶

rocblas_datatype_f64_c = 155¶

rocblas_datatype_i8_r = 160¶

rocblas_datatype_u8_r = 161¶

rocblas_datatype_i32_r = 162¶

rocblas_datatype_u32_r = 163¶

rocblas_datatype_i8_c = 164¶

rocblas_datatype_u8_c = 165¶

rocblas_datatype_i32_c = 166¶

rocblas_datatype_u32_c = 167¶

rocblas_pointer_mode¶

enum rocblas_pointer_mode¶

Indicates whether a pointer is a device pointer or a host pointer.

Values:

rocblas_pointer_mode_host = 0¶

rocblas_pointer_mode_device = 1¶

rocblas_layer_mode¶

enum rocblas_layer_mode¶

Indicates if layer is active with bitmask.

Values:

rocblas_layer_mode_none = 0b0000000000¶

rocblas_layer_mode_log_trace = 0b0000000001¶

rocblas_layer_mode_log_bench = 0b0000000010¶

rocblas_layer_mode_log_profile = 0b0000000100¶

rocblas_gemm_algo¶

enum rocblas_gemm_algo¶

Selects the GEMM algorithm.

Values:

rocblas_gemm_algo_standard = 0b0000000000¶

Functions¶

Level 1 BLAS¶

rocblas_<type>scal()¶

rocblas_status rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)¶

rocblas_status rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)¶

BLAS Level 1 API.

scal scales the vector x by the scalar alpha, for i = 1, …, n

x := alpha * x

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
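The role of n and incx is easiest to see in a host-side reference. The sketch below is plain C++ rather than rocBLAS code: `ref_scal` is a hypothetical helper, and returning without doing anything for n <= 0 or incx <= 0 follows the reference BLAS convention, which is assumed here rather than taken from this page.

```cpp
#include <cstddef>

// Host reference for scal: x[i*incx] := alpha * x[i*incx] for i = 0..n-1.
// Like reference BLAS, it does nothing when n <= 0 or incx <= 0.
void ref_scal(int n, float alpha, float* x, int incx)
{
    if (n <= 0 || incx <= 0)
        return;
    for (int i = 0; i < n; ++i)
        x[static_cast<std::size_t>(i) * incx] *= alpha;
}
```

With n = 3 and incx = 2, only elements 0, 2 and 4 of the array are scaled. The elements in between are left untouched.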

rocblas_<type>copy()¶

rocblas_status rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)¶

rocblas_status rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)¶

BLAS Level 1 API.

copy copies the vector x into the vector y, for i = 1, …, n

y := x

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

rocblas_<type>dot()¶

rocblas_status rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)¶

rocblas_status rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)¶

BLAS Level 1 API.

dot(u) perform dot product of vector x and y

result = x * y;

dotc perform dot product of complex vector x and complex y

result = conjugate (x) * y;

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
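[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] result: stores the dot product, either on the host CPU or device GPU. The result is 0.0 if n <= 0.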
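As a cross-check for small inputs, the real dot product can be written as a short host loop. `ref_dot` is a hypothetical reference helper, not a rocBLAS function. Its handling of negative increments (walking the vector backwards) follows the reference BLAS convention and is an assumption here, since this page does not say how rocBLAS treats them.

```cpp
// Host reference for dot: sum of x[i] * y[i] over n strided elements.
// Returns 0 when n <= 0. A negative increment walks the vector backwards,
// following the reference BLAS convention.
float ref_dot(int n, const float* x, int incx, const float* y, int incy)
{
    if (n <= 0)
        return 0.0f;
    // Starting positions: the last used element when the increment is negative.
    long ix = incx < 0 ? static_cast<long>(1 - n) * incx : 0;
    long iy = incy < 0 ? static_cast<long>(1 - n) * incy : 0;
    float result = 0.0f;
    for (int i = 0; i < n; ++i, ix += incx, iy += incy)
        result += x[ix] * y[iy];
    return result;
}
```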

rocblas_<type>swap()¶

rocblas_status rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)¶

rocblas_status rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy)¶

BLAS Level 1 API.

swap interchanges the vectors x and y element by element, for i = 1, …, n

y := x; x := y

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

rocblas_<type>axpy()¶

rocblas_status rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)¶

rocblas_status rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)¶

rocblas_status rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy)¶

BLAS Level 1 API.

axpy compute y := alpha * x + y

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
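Because y is both read and written, axpy is an in-place update of y. A minimal host-side sketch, restricted to positive increments, is shown below. `ref_axpy` is a hypothetical helper for checking results and is not part of rocBLAS.

```cpp
#include <cstddef>

// Host reference for axpy: y[i*incy] := alpha * x[i*incx] + y[i*incy].
// Sketch for positive increments only; other cases return without changes.
void ref_axpy(int n, float alpha, const float* x, int incx, float* y, int incy)
{
    if (n <= 0 || incx <= 0 || incy <= 0)
        return;
    for (int i = 0; i < n; ++i)
        y[static_cast<std::size_t>(i) * incy] += alpha * x[static_cast<std::size_t>(i) * incx];
}
```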

rocblas_<type>asum()¶

rocblas_status rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)¶

rocblas_status rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)¶

BLAS Level 1 API.

asum computes the sum of the magnitudes of elements of a real vector x, or the sum of magnitudes of the real and imaginary parts of elements if x is a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the asum result, either on the host CPU or device GPU. The result is 0.0 if n, incx <= 0.

rocblas_<type>nrm2()¶

rocblas_status rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)¶

rocblas_status rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)¶

BLAS Level 1 API.

nrm2 computes the Euclidean norm of a real or complex vector

result := sqrt( x'*x ) for a real vector
result := sqrt( x**H*x ) for a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the nrm2 result, either on the host CPU or device GPU. The result is 0.0 if n, incx <= 0.

rocblas_i<type>amax()¶

rocblas_status rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)¶

rocblas_status rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)¶

BLAS Level 1 API.

amax finds the first index of the element of maximum magnitude of a real vector x. If x is a complex vector, the magnitude of an element is taken as the sum of the magnitudes of its real and imaginary parts.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amax index, either on the host CPU or device GPU. The result is 0 if n, incx <= 0.
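The "first index" rule matters when several elements share the maximum magnitude. The host sketch below makes it explicit for a real vector. `ref_iamax` is a hypothetical helper, and the 1-based result is the reference BLAS (Fortran) convention, assumed here rather than confirmed by this page.

```cpp
#include <cmath>
#include <cstddef>

// Host reference for amax on a real vector: 1-based index of the first element
// with the largest absolute value; 0 when n <= 0 or incx <= 0.
int ref_iamax(int n, const float* x, int incx)
{
    if (n <= 0 || incx <= 0)
        return 0;
    int best = 1;
    float best_mag = std::fabs(x[0]);
    for (int i = 1; i < n; ++i)
    {
        const float mag = std::fabs(x[static_cast<std::size_t>(i) * incx]);
        if (mag > best_mag) // strict comparison keeps the first maximum
        {
            best_mag = mag;
            best = i + 1;
        }
    }
    return best;
}
```

For x = {1, -7, 3, 7} the elements -7 and 7 tie in magnitude, and the sketch reports the earlier one.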

rocblas_i<type>amin()¶

rocblas_status rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)¶

rocblas_status rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)¶

BLAS Level 1 API.

amin finds the first index of the element of minimum magnitude of a real vector x. If x is a complex vector, the magnitude of an element is taken as the sum of the magnitudes of its real and imaginary parts.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amin index, either on the host CPU or device GPU. The result is 0 if n, incx <= 0.

Level 2 BLAS¶

rocblas_<type>gemv()¶

rocblas_status rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)¶

rocblas_status rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)¶

BLAS Level 2 API.

xGEMV performs one of the matrix-vector operations

y := alpha*A*x    + beta*y,   or
y := alpha*A**T*x + beta*y,   or
y := alpha*A**H*x + beta*y,

where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] trans: rocblas_operation
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
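The interaction of trans, m, n and lda can be checked against a small host reference. The sketch below assumes column-major storage, as in Netlib BLAS, so element (row, col) of A sits at A[row + col * lda]. `ref_gemv` is a hypothetical helper, it covers only the real, positive-increment case, and it uses a bool in place of rocblas_operation.

```cpp
#include <cassert>
#include <cstddef>

// Host reference for real gemv with column-major A (m rows, n columns, leading
// dimension lda): y := alpha*op(A)*x + beta*y, where op(A) is A or A**T.
// Sketch for positive increments only.
void ref_gemv(bool transpose, int m, int n, float alpha, const float* A, int lda,
              const float* x, int incx, float beta, float* y, int incy)
{
    assert(m >= 0 && n >= 0 && lda >= 1 && lda >= m && incx > 0 && incy > 0);
    const int len_y = transpose ? n : m; // number of elements of y
    const int len_x = transpose ? m : n; // number of elements of x
    for (int i = 0; i < len_y; ++i)
    {
        float sum = 0.0f;
        for (int j = 0; j < len_x; ++j)
        {
            // op(A)(i, j) is A(i, j) without transpose and A(j, i) with it.
            const std::size_t a = transpose ? j + static_cast<std::size_t>(i) * lda
                                            : i + static_cast<std::size_t>(j) * lda;
            sum += A[a] * x[static_cast<std::size_t>(j) * incx];
        }
        float& yi = y[static_cast<std::size_t>(i) * incy];
        yi = alpha * sum + beta * yi;
    }
}
```

Note that x has n elements and y has m elements without transpose, and the roles swap when the transpose is applied.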

rocblas_<type>trsv()¶

rocblas_status rocblas_dtrsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)¶

rocblas_status rocblas_strsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx)¶

BLAS Level 2 API.

trsv solves

 A*x = b or A**T*x = b,

where x and b are vectors and A is a triangular matrix.

On entry, x holds the right-hand side b; on exit, it is overwritten with the solution.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int m specifies the number of rows of b. m >= 0.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, m )
[in] lda: rocblas_int specifies the leading dimension of A. lda >= max( 1, m ).
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
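A triangular solve reduces to forward or back substitution. The host sketch below covers the non-transposed case with incx = 1 and column-major A, and it uses plain bool flags in place of rocblas_fill and rocblas_diagonal. `ref_trsv` is a hypothetical reference helper, not the rocBLAS implementation.

```cpp
#include <cstddef>

// Host reference for a non-transposed triangular solve A*x = b with column-major A.
// On entry x holds b; on exit it holds the solution. Sketch with incx = 1.
void ref_trsv(bool upper, bool unit_diag, int m, const float* A, int lda, float* x)
{
    // Element (row, col) of A in column-major storage.
    auto a = [&](int row, int col) { return A[row + static_cast<std::size_t>(col) * lda]; };
    if (upper)
    {
        // Back substitution: solve for the last unknown first.
        for (int i = m - 1; i >= 0; --i)
        {
            float sum = x[i];
            for (int j = i + 1; j < m; ++j)
                sum -= a(i, j) * x[j];
            x[i] = unit_diag ? sum : sum / a(i, i);
        }
    }
    else
    {
        // Forward substitution: solve for the first unknown first.
        for (int i = 0; i < m; ++i)
        {
            float sum = x[i];
            for (int j = 0; j < i; ++j)
                sum -= a(i, j) * x[j];
            x[i] = unit_diag ? sum : sum / a(i, i);
        }
    }
}
```

When unit_diag is true the diagonal entries of A are never read, which matches the meaning of a unit triangular matrix.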

rocblas_<type>ger()¶

rocblas_status rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)¶

rocblas_status rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda)¶

BLAS Level 2 API.

xHE(SY)MV performs the matrix-vector operation:

y := alpha*A*x + beta*y,

where alpha and beta are scalars, x and y are n element vectors and A is an n by n Hermitian(Symmetric) matrix.

BLAS Level 2 API

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper or lower
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

xGER performs the matrix-vector operation

A := A + alpha*x*y**T

where alpha is a scalar, x and y are vectors, and A is an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
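The rank-1 update can be sketched on the CPU as follows. This is only a reference loop under the assumptions of column-major storage and positive increments; ref_sger is a hypothetical name, not part of rocBLAS.

```cpp
#include <cassert>

// Hypothetical CPU reference for ger: A := A + alpha * x * y^T.
// A is m by n, column-major, with leading dimension lda.
void ref_sger(int m, int n, float alpha, const float* x, int incx,
              const float* y, int incy, float* A, int lda)
{
    assert(incx > 0 && incy > 0 && lda >= (m > 1 ? m : 1));
    for(int j = 0; j < n; ++j)
        for(int i = 0; i < m; ++i)
            A[i + j * lda] += alpha * x[i * incx] * y[j * incy];
}
```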

rocblas_<type>syr()¶

rocblas_status rocblas_dsyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *A, rocblas_int lda)¶

rocblas_status rocblas_ssyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *A, rocblas_int lda)¶

BLAS Level 2 API.

xSYR performs the matrix-vector operation

A := A + alpha*x*x**T

where alpha is a scalar, x is a vector, and A is an n by n symmetric matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
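A minimal CPU sketch of the symmetric rank-1 update is shown here. It assumes column-major storage, a positive incx, and the reference BLAS convention that only the triangle selected by uplo is updated; that last point is an assumption, since the text above does not state it. ref_dsyr and its bool flag are hypothetical names.

```cpp
#include <cassert>

// Hypothetical CPU reference for syr: A := A + alpha * x * x^T.
// Only the upper or lower triangle of A is touched (assumed BLAS convention).
void ref_dsyr(bool upper, int n, double alpha, const double* x, int incx,
              double* A, int lda)
{
    assert(incx > 0 && lda >= (n > 1 ? n : 1));
    for(int j = 0; j < n; ++j)
    {
        const int first = upper ? 0 : j;     // first row updated in column j
        const int last  = upper ? j : n - 1; // last row updated in column j
        for(int i = first; i <= last; ++i)
            A[i + j * lda] += alpha * x[i * incx] * x[j * incx];
    }
}
```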

Level 3 BLAS¶

rocblas_<type>trtri_batched()¶

rocblas_status rocblas_dtrtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, rocblas_int stride_a, double *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)¶

rocblas_status rocblas_strtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, rocblas_int stride_a, float *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)¶

BLAS Level 3 API.

trtri computes the inverse of a triangular matrix A

inv(A);

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper ‘rocblas_fill_upper’ or lower ‘rocblas_fill_lower’
[in] diag: rocblas_diagonal. = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;
[in] n: rocblas_int.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_int “batch stride a”: stride from the start of one “A” matrix to the next
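To make the batched layout concrete, the sketch below inverts a batch of lower triangular, non-unit matrices on the CPU. It assumes column-major storage and that bsinvA is the stride between consecutive inverse matrices, by analogy with stride_a; that reading of bsinvA is an assumption taken from the signature only. ref_trtri_lower_batched is a hypothetical name, not the library's implementation.

```cpp
#include <cassert>

// Hypothetical CPU reference: inverts batch_count lower triangular matrices.
// Matrix b starts at A + b * stride_a; its inverse is written at invA + b * bsinvA.
// Only the lower triangle of each inverse is written.
void ref_trtri_lower_batched(int n, const double* A, int lda, int stride_a,
                             double* invA, int ldinvA, int bsinvA, int batch_count)
{
    assert(lda >= n && ldinvA >= n);
    for(int b = 0; b < batch_count; ++b)
    {
        const double* a   = A + b * stride_a;
        double*       inv = invA + b * bsinvA;
        for(int j = 0; j < n; ++j)
        {
            inv[j + j * ldinvA] = 1.0 / a[j + j * lda];
            for(int i = j + 1; i < n; ++i)
            {
                // Row i of A times column j of the inverse must equal zero.
                double sum = 0.0;
                for(int k = j; k < i; ++k)
                    sum += a[i + k * lda] * inv[k + j * ldinvA];
                inv[i + j * ldinvA] = -sum / a[i + i * lda];
            }
        }
    }
}
```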

rocblas_<type>trsm()¶

rocblas_status rocblas_dtrsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)¶

rocblas_status rocblas_strsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)¶

BLAS Level 3 API.

trsm solves

op(A)*X = alpha*B or  X*op(A) = alpha*B,

where alpha is a scalar, X and B are m by n matrices, A is a triangular matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

On exit, B is overwritten with the solution matrix X.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] side: rocblas_side. rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation. rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int. m specifies the number of rows of B. m >= 0.
[in] n: rocblas_int. n specifies the number of columns of B. n >= 0.
[in] alpha: alpha specifies the scalar alpha. When alpha is zero, A is not referenced and B need not be set before entry.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, k ), where k is m when rocblas_side_left and n when rocblas_side_right. Only the upper/lower triangular part is accessed.
[in] lda: rocblas_int. lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).
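One case of trsm, with A on the left, lower triangular, non-unit and not transposed, can be sketched on the CPU as a forward substitution per column of B. Column-major storage is assumed, and ref_dtrsm_left_lower is a hypothetical helper that covers only this one combination of side, uplo, transA and diag.

```cpp
#include <cassert>

// Hypothetical CPU reference: solves A * X = alpha * B for X,
// with A m by m lower triangular. X overwrites B column by column.
void ref_dtrsm_left_lower(int m, int n, double alpha, const double* A, int lda,
                          double* B, int ldb)
{
    assert(lda >= (m > 1 ? m : 1) && ldb >= (m > 1 ? m : 1));
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            double sum = alpha * B[i + j * ldb];
            // Rows above i in this column already hold the solution X.
            for(int k = 0; k < i; ++k)
                sum -= A[i + k * lda] * B[k + j * ldb];
            B[i + j * ldb] = sum / A[i + i * lda];
        }
    }
}
```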

rocblas_<type>gemm()¶

rocblas_status rocblas_dgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)¶

rocblas_status rocblas_sgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)¶

rocblas_status rocblas_hgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, const rocblas_half *B, rocblas_int ldb, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc)¶

BLAS Level 3 API.

xGEMM performs one of the matrix-matrix operations

C = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.

Parameters

[in] handle: rocblas_handle, handle to the rocblas library context queue.
[in] transA: rocblas_operation, specifies the form of op( A )
[in] transB: rocblas_operation, specifies the form of op( B )
[in] m: rocblas_int, number of rows of matrices op( A ) and C
[in] n: rocblas_int, number of columns of matrices op( B ) and C
[in] k: rocblas_int, number of columns of matrix op( A ) and number of rows of matrix op( B )
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int, specifies the leading dimension of A.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int, specifies the leading dimension of B.
[in] beta: specifies the scalar beta.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int, specifies the leading dimension of C.
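The roles of m, n, k and the leading dimensions may be easier to follow in a naive CPU loop. The sketch below assumes column-major storage and real data, so only the transpose form of op is covered. ref_sgemm and op_elem are hypothetical names for illustration and say nothing about how rocBLAS computes the product on the GPU.

```cpp
#include <cassert>

// Element (row, col) of op(M), where M is column-major with leading dimension ld.
static float op_elem(const float* M, int ld, bool trans, int row, int col)
{
    return trans ? M[col + row * ld] : M[row + col * ld];
}

// Hypothetical CPU reference for gemm: C = alpha * op(A) * op(B) + beta * C.
// op(A) is m by k, op(B) is k by n, and C is m by n.
void ref_sgemm(bool trans_a, bool trans_b, int m, int n, int k, float alpha,
               const float* A, int lda, const float* B, int ldb,
               float beta, float* C, int ldc)
{
    assert(ldc >= (m > 1 ? m : 1));
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            float acc = 0.0f;
            for(int p = 0; p < k; ++p)
                acc += op_elem(A, lda, trans_a, i, p) * op_elem(B, ldb, trans_b, p, j);
            // When beta is zero, the old contents of C are not read.
            const float old = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
            C[i + j * ldc] = alpha * acc + old;
        }
    }
}
```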

rocblas_<type>gemm_strided_batched()¶

rocblas_status rocblas_dgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

rocblas_status rocblas_sgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

rocblas_status rocblas_hgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

BLAS Level 3 API.

xGEMM_STRIDED_BATCHED performs one of the strided batched matrix-matrix operations

C[i*stride_c] = alpha*op( A[i*stride_a] )*op( B[i*stride_b] ) + beta*C[i*stride_c], for i in [0,batch_count-1]

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are strided batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C an m by n by batch_count strided_batched matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m.
[in] n: rocblas_int. matrix dimension n.
[in] k: rocblas_int. matrix dimension k.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing strided batched matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of “A”.
[in] stride_a: rocblas_int stride from the start of one “A” matrix to the next
[in] B: pointer storing strided batched matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of “B”.
[in] stride_b: rocblas_int stride from the start of one “B” matrix to the next
[in] beta: specifies the scalar beta.
[inout] C: pointer storing strided batched matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of “C”.
[in] stride_c: rocblas_int stride from the start of one “C” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
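The strided addressing can be illustrated with a CPU loop in which matrix i of each operand starts at a fixed offset of i times its stride from the base pointer. The sketch assumes column-major storage and covers only the non-transposed case; ref_dgemm_strided_batched is a hypothetical name, not the rocBLAS entry point.

```cpp
#include <cassert>

// Hypothetical CPU reference for the non-transposed strided batched gemm.
// For each batch index b, computes C_b = alpha * A_b * B_b + beta * C_b.
void ref_dgemm_strided_batched(int m, int n, int k, double alpha,
                               const double* A, int lda, int stride_a,
                               const double* B, int ldb, int stride_b,
                               double beta, double* C, int ldc, int stride_c,
                               int batch_count)
{
    assert(batch_count >= 0);
    for(int b = 0; b < batch_count; ++b)
    {
        // Each matrix in the batch is found by stepping a whole stride.
        const double* Ab = A + b * stride_a;
        const double* Bb = B + b * stride_b;
        double*       Cb = C + b * stride_c;
        for(int j = 0; j < n; ++j)
        {
            for(int i = 0; i < m; ++i)
            {
                double acc = 0.0;
                for(int p = 0; p < k; ++p)
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
        }
    }
}
```

A stride larger than the matrix footprint simply leaves unused padding between consecutive matrices.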

rocblas_<type>gemm_kernel_name()¶

rocblas_status rocblas_dgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

rocblas_status rocblas_sgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

rocblas_status rocblas_hgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)¶

rocblas_<type>geam()¶

rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)¶

rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)¶

BLAS Level 3 API.

xGEAM performs one of the matrix-matrix operations

C = alpha*op( A ) + beta*op( B ),

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] beta: specifies the scalar beta.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of B.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of C.
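A common use of geam is an out-of-place transpose, obtained with alpha = 1, beta = 0 and op( A ) = A**T. The CPU sketch below assumes column-major storage and real data; ref_dgeam is a hypothetical name used only for illustration.

```cpp
#include <cassert>

// Hypothetical CPU reference for geam: C = alpha * op(A) + beta * op(B).
// op(A), op(B) and C are all m by n.
void ref_dgeam(bool trans_a, bool trans_b, int m, int n, double alpha,
               const double* A, int lda, double beta, const double* B, int ldb,
               double* C, int ldc)
{
    assert(ldc >= (m > 1 ? m : 1));
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            // Element (i, j) of op(X) is X(j, i) when X is transposed.
            const double a = trans_a ? A[j + i * lda] : A[i + j * lda];
            const double b = trans_b ? B[j + i * ldb] : B[i + j * ldb];
            C[i + j * ldc] = alpha * a + beta * b;
        }
    }
}
```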

BLAS Extensions¶

rocblas_gemm_ex()¶

rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)¶

rocblas_gemm_strided_batched_ex()¶

rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_long stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_long stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_long stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_long stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)¶

BLAS EX API.

GEMM_EX performs one of the matrix-matrix operations

D = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
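Two features distinguish GEMM_EX from plain gemm: the result goes to a separate matrix D, and the storage types are decoupled from the type used for the arithmetic. The templated CPU sketch below mimics this for the non-transposed, column-major case. The template parameters In, Out and Compute are hypothetical stand-ins for a_type/b_type, c_type/d_type and compute_type; which type combinations rocBLAS actually supports is not stated here and is not implied by this sketch.

```cpp
#include <cassert>

// Hypothetical CPU reference: D = alpha * A * B + beta * C.
// Inputs are stored as In, C and D as Out, and all arithmetic is done in Compute.
template <typename In, typename Out, typename Compute>
void ref_gemm_ex(int m, int n, int k, Compute alpha, const In* a, int lda,
                 const In* b, int ldb, Compute beta, const Out* c, int ldc,
                 Out* d, int ldd)
{
    assert(ldc >= (m > 1 ? m : 1) && ldd >= (m > 1 ? m : 1));
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            Compute acc = 0;
            for(int p = 0; p < k; ++p)
                acc += static_cast<Compute>(a[i + p * lda])
                       * static_cast<Compute>(b[p + j * ldb]);
            // C is only read; the result is written to D.
            d[i + j * ldd]
                = static_cast<Out>(alpha * acc + beta * static_cast<Compute>(c[i + j * ldc]));
        }
    }
}
```

For example, ref_gemm_ex<float, float, double> accumulates float inputs in double precision before rounding the result back to float.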

Build Information¶

rocblas_get_version_string()¶

rocblas_status rocblas_get_version_string(char *buf, size_t len)¶

BLAS EX API.

GEMM_STRIDED_BATCHED_EX performs one of the strided_batched matrix-matrix operations

D[i*stride_d] = alpha*op(A[i*stride_a])*op(B[i*stride_b]) + beta*C[i*stride_c], for i in [0,batch_count-1]

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_long specifies stride from start of one “A” matrix to the next
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] stride_b: rocblas_long specifies stride from start of one “B” matrix to the next
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[in] stride_c: rocblas_long specifies stride from start of one “C” matrix to the next
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] stride_d: rocblas_long specifies stride from start of one “D” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use

Auxiliary¶

rocblas_pointer_to_mode()¶

rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr)¶: indicates whether the pointer is on the host or the device. Currently the HIP API can only recognize whether the input pointer is on the device; it cannot recognize whether the pointer is on the host.

rocblas_create_handle()¶

rocblas_status rocblas_create_handle(rocblas_handle *handle)¶

rocblas_destroy_handle()¶

rocblas_status rocblas_destroy_handle(rocblas_handle handle)¶

rocblas_add_stream()¶

rocblas_status rocblas_add_stream(rocblas_handle handle, hipStream_t stream)¶

rocblas_set_stream()¶

rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream)¶

rocblas_get_stream()¶

rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)¶

rocblas_set_pointer_mode()¶

rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)¶

rocblas_get_pointer_mode()¶

rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)¶

rocblas_set_vector()¶

rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)¶

rocblas_get_vector()¶

rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)¶

rocblas_set_matrix()¶

rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)¶

rocblas_get_matrix()¶

rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)¶

All API¶

namespace rocblas¶

Functions

void reinit_logs()¶

file rocblas-auxiliary.h

#include <hip/hip_runtime_api.h>
#include "rocblas-types.h"

rocblas-auxiliary.h provides auxiliary functions in rocblas

Defines

_ROCBLAS_AUXILIARY_H_¶

Functions

rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr): indicates whether the pointer is on the host or the device. Currently the HIP API can only recognize whether the input pointer is on the device; it cannot recognize whether the pointer is on the host.

rocblas_status rocblas_create_handle(rocblas_handle *handle)

rocblas_status rocblas_destroy_handle(rocblas_handle handle)

rocblas_status rocblas_add_stream(rocblas_handle handle, hipStream_t stream)

rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream)

rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)

rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)

rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)

rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

file rocblas-functions.h

#include "rocblas-types.h"

rocblas-functions.h provides Basic Linear Algebra Subprograms of Level 1, 2 and 3, using HIP optimized for AMD HCC-based GPU hardware. This library can also run on CUDA-based NVIDIA GPUs. This file exposes a C89 BLAS interface.

Defines

_ROCBLAS_FUNCTIONS_H_¶

Functions

rocblas_status rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)

BLAS Level 1 API.

scal scales the vector x[i] by the scalar alpha, for i = 1 , … , n

x := alpha * x ,

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
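The effect of incx is easiest to see in a small CPU loop: only every incx-th element is scaled, and the elements in between are left alone. The sketch below is a hypothetical reference (ref_sscal is not a rocBLAS function), and the quick return for non-positive n or incx is assumed from the reference BLAS convention.

```cpp
#include <cassert>

// Hypothetical CPU reference for scal: scales n elements of x, spaced incx apart.
void ref_sscal(int n, float alpha, float* x, int incx)
{
    assert(x != nullptr);
    if(n <= 0 || incx <= 0)
        return; // nothing to do (assumed quick-return convention)
    for(int i = 0; i < n; ++i)
        x[i * incx] *= alpha;
}
```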

rocblas_status rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)

rocblas_status rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

copy copies the vector x into the vector y, for i = 1 , … , n

y := x,

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

rocblas_status rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)

rocblas_status rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)

BLAS Level 1 API.

dot(u) performs the dot product of vectors x and y

result = x * y;

dotc performs the dot product of complex vectors x and y, conjugating x

result = conjugate (x) * y;

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
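[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] result: stores the dot product, either on the host CPU or the device GPU. The result is 0.0 if n <= 0.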
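For real vectors the dot product reduces to a single strided accumulation, sketched here on the CPU. ref_ddot is a hypothetical name, and the sketch handles positive increments only.

```cpp
#include <cassert>

// Hypothetical CPU reference for the real dot product.
// Returns 0.0 when n <= 0, matching the documented result.
double ref_ddot(int n, const double* x, int incx, const double* y, int incy)
{
    assert(incx > 0 && incy > 0); // this sketch handles positive increments only
    double result = 0.0;
    for(int i = 0; i < n; ++i)
        result += x[i * incx] * y[i * incy];
    return result;
}
```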

rocblas_status rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)

rocblas_status rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

swap interchanges vectors x[i] and y[i], for i = 1 , … , n

y := x; x := y

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

rocblas_status rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)

rocblas_status rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy)

BLAS Level 1 API.

axpy computes y := alpha * x + y

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
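Because y appears on both sides of the update, it is read and then overwritten. A minimal CPU sketch, assuming positive increments and using the hypothetical name ref_saxpy, looks like this:

```cpp
#include <cassert>

// Hypothetical CPU reference for axpy: y := alpha * x + y.
void ref_saxpy(int n, float alpha, const float* x, int incx, float* y, int incy)
{
    assert(incx > 0 && incy > 0); // this sketch handles positive increments only
    for(int i = 0; i < n; ++i)
        y[i * incy] += alpha * x[i * incx];
}
```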

rocblas_status rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)

rocblas_status rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)

rocblas_status rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)

BLAS Level 1 API.

asum computes the sum of the magnitudes of elements of a real vector x, or the sum of magnitudes of the real and imaginary parts of elements if x is a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the asum result, either on the host CPU or the device GPU. The result is 0.0 if n <= 0 or incx <= 0.
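For a real vector, asum is the sum of absolute values. The CPU sketch below covers the real case only and returns 0.0 for non-positive n or incx, as described for result; ref_sasum is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU reference for asum on a real vector.
float ref_sasum(int n, const float* x, int incx)
{
    assert(x != nullptr || n <= 0);
    if(n <= 0 || incx <= 0)
        return 0.0f;
    float sum = 0.0f;
    for(int i = 0; i < n; ++i)
        sum += std::fabs(x[i * incx]);
    return sum;
}
```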

rocblas_status rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)

rocblas_status rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)

BLAS Level 1 API.

nrm2 computes the Euclidean norm of a real or complex vector

:= sqrt( x'*x ) for a real vector
:= sqrt( x**H*x ) for a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the nrm2 result, either on the host CPU or the device GPU. The result is 0.0 if n <= 0 or incx <= 0.
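The real-vector formula can be written directly as a sum of squares followed by a square root. This naive CPU sketch (ref_dnrm2 is a hypothetical name) ignores the scaling that production implementations typically use to avoid overflow and underflow, so it is only meant to show what is computed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU reference for nrm2 on a real vector: sqrt(x' * x).
double ref_dnrm2(int n, const double* x, int incx)
{
    assert(x != nullptr || n <= 0);
    if(n <= 0 || incx <= 0)
        return 0.0;
    double sum_sq = 0.0;
    for(int i = 0; i < n; ++i)
        sum_sq += x[i * incx] * x[i * incx];
    return std::sqrt(sum_sq);
}
```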

rocblas_status rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)

rocblas_status rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API.

amax finds the first index of the element of maximum magnitude of a real vector x, or of the maximum sum of the magnitudes of the real and imaginary parts if x is a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amax index, either on the host CPU or the device GPU. The result is 0 if n <= 0 or incx <= 0.
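The "first index" rule matters when several elements share the maximum magnitude. The CPU sketch below assumes the 1-based indexing of reference BLAS, which is consistent with 0 being reserved for the n <= 0 or incx <= 0 case but is not stated explicitly above. ref_isamax is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU reference for amax on a real vector.
// Returns a 1-based index (assumed convention), or 0 if n <= 0 or incx <= 0.
int ref_isamax(int n, const float* x, int incx)
{
    assert(x != nullptr || n <= 0);
    if(n <= 0 || incx <= 0)
        return 0;
    int   best     = 1;
    float best_mag = std::fabs(x[0]);
    for(int i = 1; i < n; ++i)
    {
        const float mag = std::fabs(x[i * incx]);
        if(mag > best_mag) // strict comparison keeps the first maximum
        {
            best_mag = mag;
            best     = i + 1;
        }
    }
    return best;
}
```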

rocblas_status rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)

rocblas_status rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API.

amin finds the first index of the element of minimum magnitude of a real vector x, or of the minimum sum of the magnitudes of the real and imaginary parts if x is a complex vector

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] result: stores the amin index, either on the host CPU or the device GPU. The result is 0 if n <= 0 or incx <= 0.

rocblas_status rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)

rocblas_status rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)

BLAS Level 2 API.

xGEMV performs one of the matrix-vector operations

y := alpha*A*x    + beta*y,   or
y := alpha*A**T*x + beta*y,   or
y := alpha*A**H*x + beta*y,

where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] trans: rocblas_operation
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
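Note that the lengths of x and y swap when A is transposed: for y := alpha*A*x + beta*y, x has n elements and y has m, while for the transposed form x has m and y has n. The CPU sketch below makes this explicit. It assumes column-major storage, real data and positive increments; ref_dgemv is a hypothetical name.

```cpp
#include <cassert>

// Hypothetical CPU reference for gemv: y := alpha * op(A) * x + beta * y.
// A is m by n, column-major, with leading dimension lda.
void ref_dgemv(bool trans, int m, int n, double alpha, const double* A, int lda,
               const double* x, int incx, double beta, double* y, int incy)
{
    assert(incx > 0 && incy > 0 && lda >= (m > 1 ? m : 1));
    const int len_y = trans ? n : m; // rows of op(A)
    const int len_x = trans ? m : n; // columns of op(A)
    for(int i = 0; i < len_y; ++i)
    {
        double acc = 0.0;
        for(int j = 0; j < len_x; ++j)
        {
            const double a = trans ? A[j + i * lda] : A[i + j * lda];
            acc += a * x[j * incx];
        }
        y[i * incy] = alpha * acc + beta * y[i * incy];
    }
}
```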

rocblas_status rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)

rocblas_status rocblas_strsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx)

BLAS Level 2 API.

trsv solves

 A*x = b or A**T*x = b,

where x and b are vectors and A is a triangular matrix.

On entry, x holds the right-hand side b; on exit, x is overwritten with the solution.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int. m specifies the number of rows of b. m >= 0.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, m ).
[in] lda: rocblas_int. specifies the leading dimension of A. lda >= max( 1, m ).
[inout] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.

rocblas_status rocblas_dtrsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)

rocblas_status rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda)

BLAS Level 2 API.

xHE(SY)MV performs the matrix-vector operation:

y := alpha*A*x + beta*y,

where alpha and beta are scalars, x and y are n element vectors and A is an n by n Hermitian(Symmetric) matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. specifies whether the upper or lower
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] x: pointer storing vector x on the GPU.
[in] incx: specifies the increment for the elements of x.
[in] beta: specifies the scalar beta.
[out] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.

xGER performs the matrix-vector operation

A := A + alpha*x*y**T

where alpha is a scalar, x and y are vectors, and A is an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] m: rocblas_int
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[in] y: pointer storing vector y on the GPU.
[in] incy: rocblas_int specifies the increment for the elements of y.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.

rocblas_status rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)
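
The rank-1 update A := A + alpha*x*y**T can be sketched on the host as follows. This assumes column-major storage, so element (i, j) of A sits at `A[i + j * lda]`; the name `ref_ger` is hypothetical and the code is only a reference for checking results, not the library's kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference for ger: A := A + alpha * x * y^T.
// A is m by n in column-major order with leading dimension lda.
void ref_ger(int m, int n, double alpha,
             const std::vector<double>& x, int incx,
             const std::vector<double>& y, int incy,
             std::vector<double>& A, int lda)
{
    assert(m >= 0 && n >= 0 && incx > 0 && incy > 0);
    assert(lda >= (m > 0 ? m : 1));
    for (int j = 0; j < n; ++j)
    {
        // alpha * y[j] is constant along column j.
        const double ay = alpha * y[j * incy];
        for (int i = 0; i < m; ++i)
            A[i + j * lda] += x[i * incx] * ay;
    }
}
```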

rocblas_status rocblas_ssyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *A, rocblas_int lda)

BLAS Level 2 API.

xSYR performs the matrix-vector operation

A := A + alpha*x*x**T

where alpha is a scalar, x is a vector, and A is an n by n symmetric matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] n: rocblas_int
[in] alpha: specifies the scalar alpha.
[in] x: pointer storing vector x on the GPU.
[in] incx: rocblas_int specifies the increment for the elements of x.
[inout] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.

rocblas_status rocblas_dsyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *A, rocblas_int lda)
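
A host-side sketch of the symmetric rank-1 update is shown below. It assumes the usual BLAS convention that only the triangle selected by uplo is read and written, and column-major storage; `ref_syr` and its boolean `upper` flag are hypothetical stand-ins for the real arguments.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference for syr: A := A + alpha * x * x^T,
// touching only the upper (upper == true) or lower triangle of A.
void ref_syr(bool upper, int n, double alpha,
             const std::vector<double>& x, int incx,
             std::vector<double>& A, int lda)
{
    assert(n >= 0 && incx > 0 && lda >= (n > 0 ? n : 1));
    for (int j = 0; j < n; ++j)
    {
        const double axj = alpha * x[j * incx];
        // Rows 0..j for the upper triangle, rows j..n-1 for the lower one.
        const int i_begin = upper ? 0 : j;
        const int i_end   = upper ? j + 1 : n;
        for (int i = i_begin; i < i_end; ++i)
            A[i + j * lda] += x[i * incx] * axj;
    }
}
```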

rocblas_status rocblas_strtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, float *invA, rocblas_int ldinvA)

BLAS Level 3 API.

trtri computes the inverse of a triangular matrix A, namely inv(A), and writes the result into invA.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. Specifies whether the upper (rocblas_fill_upper) or lower (rocblas_fill_lower) triangle of A is used. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.
[in] diag: rocblas_diagonal. rocblas_diagonal_non_unit: A is non-unit triangular. rocblas_diagonal_unit: A is unit triangular.
[in] n: rocblas_int. Size of matrices A and invA.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.

rocblas_status rocblas_dtrtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, double *invA, rocblas_int ldinvA)
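
One way to picture the triangular inverse is column by column: column j of invA solves A*col = e_j by forward substitution. The sketch below does this on the host for the lower triangular case only, assuming column-major storage; `ref_trtri_lower` is a hypothetical name and the approach is an illustration, not necessarily how the library computes it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference: invert a lower triangular n by n matrix A
// into invA. Entries of invA above the diagonal are not written.
void ref_trtri_lower(bool unit_diag, int n,
                     const std::vector<double>& A, int lda,
                     std::vector<double>& invA, int ldinvA)
{
    assert(n >= 0 && lda >= (n > 0 ? n : 1) && ldinvA >= (n > 0 ? n : 1));
    for (int j = 0; j < n; ++j)
    {
        for (int i = j; i < n; ++i)
        {
            // Right-hand side is column j of the identity matrix.
            double sum = (i == j) ? 1.0 : 0.0;
            for (int k = j; k < i; ++k)
                sum -= A[i + k * lda] * invA[k + j * ldinvA];
            invA[i + j * ldinvA] = unit_diag ? sum : sum / A[i + i * lda];
        }
    }
}
```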

rocblas_status rocblas_strtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, rocblas_int stride_a, float *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)

BLAS Level 3 API.

trtri_batched computes the inverse, inv(A), of each triangular matrix A in the batch.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] uplo: rocblas_fill. Specifies whether the upper (rocblas_fill_upper) or lower (rocblas_fill_lower) triangle of A is used.
[in] diag: rocblas_diagonal. rocblas_diagonal_non_unit: A is non-unit triangular. rocblas_diagonal_unit: A is unit triangular.
[in] n: rocblas_int.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_int “batch stride a”: stride from the start of one “A” matrix to the next

rocblas_status rocblas_dtrtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, rocblas_int stride_a, double *invA, rocblas_int ldinvA, rocblas_int bsinvA, rocblas_int batch_count)

rocblas_status rocblas_strsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)

BLAS Level 3 API.

trsm solves

op(A)*X = alpha*B or  X*op(A) = alpha*B,

where alpha is a scalar, X and B are m by n matrices, A is a triangular matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

On exit, the solution matrix X overwrites B.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] side: rocblas_side. rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.
[in] uplo: rocblas_fill. rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.
[in] transA: rocblas_operation. rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.
[in] diag: rocblas_diagonal. rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
[in] m: rocblas_int. m specifies the number of rows of B. m >= 0.
[in] n: rocblas_int. n specifies the number of columns of B. n >= 0.
[in] alpha: specifies the scalar alpha. When alpha is zero, A is not referenced and B need not be set before entry.
[in] A: pointer storing matrix A on the GPU, of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper or lower triangular part is accessed.
[in] lda: rocblas_int. lda specifies the first dimension of A. If side = rocblas_side_left, lda >= max( 1, m ); if side = rocblas_side_right, lda >= max( 1, n ).

rocblas_status rocblas_dtrsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)
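
To make the in-place behavior concrete, here is a host-side sketch of a single trsm case: side left, lower triangular, no transpose, non-unit diagonal. It solves A*X = alpha*B one column of B at a time and stores X over B. The name `ref_trsm_left_lower` is hypothetical, column-major storage is assumed, and the other side/uplo/transA/diag combinations are left out for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference for trsm with side = left, uplo = lower,
// transA = none, diag = non-unit. B is m by n; on exit it holds X.
void ref_trsm_left_lower(int m, int n, double alpha,
                         const std::vector<double>& A, int lda,
                         std::vector<double>& B, int ldb)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= (m > 0 ? m : 1) && ldb >= (m > 0 ? m : 1));
    for (int j = 0; j < n; ++j)
    {
        // Forward substitution on column j of alpha * B.
        for (int i = 0; i < m; ++i)
        {
            double sum = alpha * B[i + j * ldb];
            for (int k = 0; k < i; ++k)
                sum -= A[i + k * lda] * B[k + j * ldb];
            B[i + j * ldb] = sum / A[i + i * lda];
        }
    }
}
```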

rocblas_status rocblas_hgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, const rocblas_half *B, rocblas_int ldb, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc)

BLAS Level 3 API.

xGEMM performs one of the matrix-matrix operations

C = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.

Parameters

[in] handle: rocblas_handle, handle to the rocblas library context queue.
[in] transA: rocblas_operation, specifies the form of op( A )
[in] transB: rocblas_operation, specifies the form of op( B )
[in] m: rocblas_int, number of rows of matrices op( A ) and C
[in] n: rocblas_int, number of columns of matrices op( B ) and C
[in] k: rocblas_int, number of columns of matrix op( A ) and number of rows of matrix op( B )
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int, specifies the leading dimension of A.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int, specifies the leading dimension of B.
[in] beta: specifies the scalar beta.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int, specifies the leading dimension of C.

rocblas_status rocblas_sgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)

rocblas_status rocblas_dgemm(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)
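
The role of op( X ) and the leading dimensions may be easier to see in a small host-side reference. The sketch below assumes column-major storage and real-valued data (so X**H coincides with X**T); `ref_gemm`, `op_at` and the boolean transpose flags are hypothetical names standing in for the `rocblas_operation` arguments.

```cpp
#include <cassert>
#include <vector>

// Element (i, j) of op(X) for a column-major matrix X with leading dimension ldx.
// For real-valued data the conjugate transpose equals the transpose.
static double op_at(const std::vector<double>& X, int ldx, bool trans, int i, int j)
{
    return trans ? X[j + i * ldx] : X[i + j * ldx];
}

// Hypothetical host-side reference: C = alpha * op(A) * op(B) + beta * C,
// with op(A) m by k, op(B) k by n and C m by n.
void ref_gemm(bool transA, bool transB, int m, int n, int k, double alpha,
              const std::vector<double>& A, int lda,
              const std::vector<double>& B, int ldb,
              double beta, std::vector<double>& C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0 && ldc >= (m > 0 ? m : 1));
    for (int j = 0; j < n; ++j)
    {
        for (int i = 0; i < m; ++i)
        {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += op_at(A, lda, transA, i, p) * op_at(B, ldb, transB, p, j);
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

A reference like this is mainly useful for validating GPU results on small inputs; it makes no attempt at performance.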

rocblas_status rocblas_hgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)

BLAS Level 3 API.

xGEMM_STRIDED_BATCHED performs one of the strided batched matrix-matrix operations

C[i*stride_c] = alpha*op( A[i*stride_a] )*op( B[i*stride_b] ) + beta*C[i*stride_c], for i in [0, batch_count-1],

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are strided batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C an m by n by batch_count strided_batched matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m.
[in] n: rocblas_int. matrix dimension n.
[in] k: rocblas_int. matrix dimension k.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing strided batched matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of “A”.
[in] stride_a: rocblas_int stride from the start of one “A” matrix to the next
[in] B: pointer storing strided batched matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of “B”.
[in] stride_b: rocblas_int stride from the start of one “B” matrix to the next
[in] beta: specifies the scalar beta.
[inout] C: pointer storing strided batched matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of “C”.
[in] stride_c: rocblas_int stride from the start of one “C” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch

rocblas_status rocblas_sgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)

rocblas_status rocblas_dgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)
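
The strided batched layout amounts to pointer arithmetic: matrix i of each operand starts i * stride elements after the base pointer. A minimal host-side sketch, assuming column-major storage and no transposition (the name `ref_gemm_strided_batched` is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference for a strided batched GEMM without
// transposition: for each batch index, C_i = alpha * A_i * B_i + beta * C_i.
void ref_gemm_strided_batched(int m, int n, int k, double alpha,
                              const double* A, int lda, std::int64_t stride_a,
                              const double* B, int ldb, std::int64_t stride_b,
                              double beta,
                              double* C, int ldc, std::int64_t stride_c,
                              int batch_count)
{
    assert(m >= 0 && n >= 0 && k >= 0 && batch_count >= 0);
    for (int batch = 0; batch < batch_count; ++batch)
    {
        // Locate the matrices that belong to this batch entry.
        const double* Ab = A + batch * stride_a;
        const double* Bb = B + batch * stride_b;
        double*       Cb = C + batch * stride_c;
        for (int j = 0; j < n; ++j)
        {
            for (int i = 0; i < m; ++i)
            {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
        }
    }
}
```

For densely packed batches, a stride of lda times the number of stored columns is the natural choice; larger strides leave gaps between consecutive matrices.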

rocblas_status rocblas_hgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_int stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_int stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)

rocblas_status rocblas_sgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_int stride_a, const float *B, rocblas_int ldb, rocblas_int stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)

rocblas_status rocblas_dgemm_kernel_name(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_int stride_a, const double *B, rocblas_int ldb, rocblas_int stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_int stride_c, rocblas_int batch_count)

rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)

BLAS Level 3 API.

xGEAM performs one of the matrix-matrix operations

C = alpha*op( A ) + beta*op( B ),

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int.
[in] n: rocblas_int.
[in] alpha: specifies the scalar alpha.
[in] A: pointer storing matrix A on the GPU.
[in] lda: rocblas_int specifies the leading dimension of A.
[in] beta: specifies the scalar beta.
[in] B: pointer storing matrix B on the GPU.
[in] ldb: rocblas_int specifies the leading dimension of B.
[inout] C: pointer storing matrix C on the GPU.
[in] ldc: rocblas_int specifies the leading dimension of C.

rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)
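
A common use of geam is an out-of-place transpose (alpha = 1, transA set, beta = 0). The host-side sketch below shows the element-wise definition under the assumption of column-major storage and real-valued data; `ref_geam` and `op_at` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Element (i, j) of op(X) for a column-major matrix X with leading dimension ldx.
static double op_at(const std::vector<double>& X, int ldx, bool trans, int i, int j)
{
    return trans ? X[j + i * ldx] : X[i + j * ldx];
}

// Hypothetical host-side reference: C = alpha * op(A) + beta * op(B),
// where op(A), op(B) and C are all m by n.
void ref_geam(bool transA, bool transB, int m, int n,
              double alpha, const std::vector<double>& A, int lda,
              double beta, const std::vector<double>& B, int ldb,
              std::vector<double>& C, int ldc)
{
    assert(m >= 0 && n >= 0 && ldc >= (m > 0 ? m : 1));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + j * ldc] = alpha * op_at(A, lda, transA, i, j)
                           + beta * op_at(B, ldb, transB, i, j);
}
```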

rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)

rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation trans_a, rocblas_operation trans_b, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_long stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_long stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_long stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_long stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags, size_t *workspace_size, void *workspace)

BLAS EX API.

GEMM_EX performs one of the matrix-matrix operations

D = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use
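
Two things distinguish GEMM_EX from plain GEMM in the description above: the result goes to a separate matrix D, and the storage types can differ from the compute type. The following host-side sketch models this with template parameters, assuming column-major storage and no transposition. The name `ref_gemm_ex` and the mapping of `Tin`, `Tout` and `Tc` to a_type/b_type, c_type/d_type and compute_type are illustrative assumptions, not the library's interface.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side sketch of the GEMM_EX data flow (no transposition):
//   Tin  : storage type of A and B
//   Tout : storage type of C and D
//   Tc   : type of alpha, beta and of the accumulation
// C is only read; the result is written to D.
template <typename Tin, typename Tout, typename Tc>
void ref_gemm_ex(int m, int n, int k, Tc alpha,
                 const std::vector<Tin>& A, int lda,
                 const std::vector<Tin>& B, int ldb,
                 Tc beta,
                 const std::vector<Tout>& C, int ldc,
                 std::vector<Tout>& D, int ldd)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j)
    {
        for (int i = 0; i < m; ++i)
        {
            Tc acc = Tc(0);
            for (int p = 0; p < k; ++p)
                acc += static_cast<Tc>(A[i + p * lda]) * static_cast<Tc>(B[p + j * ldb]);
            // Convert back to the storage type only once, at the end.
            D[i + j * ldd] = static_cast<Tout>(alpha * acc
                                               + beta * static_cast<Tc>(C[i + j * ldc]));
        }
    }
}
```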

rocblas_status rocblas_get_version_string(char *buf, size_t len)

BLAS EX API.

GEMM_STRIDED_BATCHED_EX performs one of the strided_batched matrix-matrix operations

D[i*stride_d] = alpha*op(A[i*stride_a])*op(B[i*stride_b]) + beta*C[i*stride_c], for i in [0, batch_count-1],

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

Parameters

[in] handle: rocblas_handle. handle to the rocblas library context queue.
[in] transA: rocblas_operation specifies the form of op( A )
[in] transB: rocblas_operation specifies the form of op( B )
[in] m: rocblas_int. matrix dimension m
[in] n: rocblas_int. matrix dimension n
[in] k: rocblas_int. matrix dimension k
[in] alpha: const void * specifies the scalar alpha. Same datatype as compute_type.
[in] a: void * pointer storing matrix A on the GPU.
[in] a_type: rocblas_datatype specifies the datatype of matrix A
[in] lda: rocblas_int specifies the leading dimension of A.
[in] stride_a: rocblas_long specifies stride from start of one “A” matrix to the next
[in] b: void * pointer storing matrix B on the GPU.
[in] b_type: rocblas_datatype specifies the datatype of matrix B
[in] ldb: rocblas_int specifies the leading dimension of B.
[in] stride_b: rocblas_long specifies stride from start of one “B” matrix to the next
[in] beta: const void * specifies the scalar beta. Same datatype as compute_type.
[in] c: void * pointer storing matrix C on the GPU.
[in] c_type: rocblas_datatype specifies the datatype of matrix C
[in] ldc: rocblas_int specifies the leading dimension of C.
[in] stride_c: rocblas_long specifies stride from start of one “C” matrix to the next
[out] d: void * pointer storing matrix D on the GPU.
[in] d_type: rocblas_datatype specifies the datatype of matrix D
[in] ldd: rocblas_int specifies the leading dimension of D.
[in] stride_d: rocblas_long specifies stride from start of one “D” matrix to the next
[in] batch_count: rocblas_int number of gemm operations in the batch
[in] compute_type: rocblas_datatype specifies the datatype of computation
[in] algo: rocblas_gemm_algo enumerant specifying the algorithm type.
[in] solution_index: int32_t reserved for future use
[in] flags: uint32_t reserved for future use

file rocblas-types.h

#include <stddef.h>
#include <stdint.h>
#include <hip/hip_vector_types.h>

rocblas-types.h defines data types used by rocblas

Defines

_ROCBLAS_TYPES_H_

Typedefs

typedef int32_t rocblas_int: To specify whether int32 or int64 is used.

typedef int64_t rocblas_long

typedef float2 rocblas_float_complex

typedef double2 rocblas_double_complex

typedef uint16_t rocblas_half

typedef float2 rocblas_half_complex

typedef struct _rocblas_handle *rocblas_handle

Enums

enum rocblas_operation

Used to specify whether the matrix is to be transposed or not.

Parameter constants. The numbering is consistent with CBLAS, ACML and most standard C BLAS libraries.

Values:

rocblas_operation_none = 111: Operate with the matrix.

rocblas_operation_transpose = 112: Operate with the transpose of the matrix.

rocblas_operation_conjugate_transpose = 113: Operate with the conjugate transpose of the matrix.

enum rocblas_fill

Used by the Hermitian, symmetric and triangular matrix routines to specify whether the upper or lower triangle is being referenced.

Values:

rocblas_fill_upper = 121: Upper triangle.

rocblas_fill_lower = 122: Lower triangle.

rocblas_fill_full = 123

enum rocblas_diagonal

It is used by the triangular matrix routines to specify whether the matrix is unit triangular.

Values:

rocblas_diagonal_non_unit = 131: Non-unit triangular.

rocblas_diagonal_unit = 132: Unit triangular.

enum rocblas_side

Indicates on which side of matrix B the matrix A is located during multiplication.

Values:

rocblas_side_left = 141: Multiply general matrix by symmetric, Hermitian or triangular matrix on the left.

rocblas_side_right = 142: Multiply general matrix by symmetric, Hermitian or triangular matrix on the right.

rocblas_side_both = 143

enum rocblas_status

rocblas status codes definition

Values:

rocblas_status_success = 0: success

rocblas_status_invalid_handle = 1: handle not initialized, invalid or null

rocblas_status_not_implemented = 2: function is not implemented

rocblas_status_invalid_pointer = 3: invalid pointer parameter

rocblas_status_invalid_size = 4: invalid size parameter

rocblas_status_memory_error = 5: failed internal memory allocation, copy or dealloc

rocblas_status_internal_error = 6: other internal library failure

enum rocblas_datatype

Indicates the precision width of data stored in a blas type.

Values:

rocblas_datatype_f16_r = 150

rocblas_datatype_f32_r = 151

rocblas_datatype_f64_r = 152

rocblas_datatype_f16_c = 153

rocblas_datatype_f32_c = 154

rocblas_datatype_f64_c = 155

rocblas_datatype_i8_r = 160

rocblas_datatype_u8_r = 161

rocblas_datatype_i32_r = 162

rocblas_datatype_u32_r = 163

rocblas_datatype_i8_c = 164

rocblas_datatype_u8_c = 165

rocblas_datatype_i32_c = 166

rocblas_datatype_u32_c = 167

enum rocblas_pointer_mode

Indicates whether the pointer is a device pointer or a host pointer.

Values:

rocblas_pointer_mode_host = 0

rocblas_pointer_mode_device = 1

enum rocblas_layer_mode

Indicates if layer is active with bitmask.

Values:

rocblas_layer_mode_none = 0b0000000000

rocblas_layer_mode_log_trace = 0b0000000001

rocblas_layer_mode_log_bench = 0b0000000010

rocblas_layer_mode_log_profile = 0b0000000100
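
Because each layer mode occupies its own bit, modes can be combined with bitwise OR and queried with bitwise AND. The sketch below uses a local enum that mirrors the values listed above purely for illustration; `example_layer_mode` and `layer_enabled` are hypothetical names, and real code would use the enum from rocblas-types.h.

```cpp
#include <cassert>

// Local mirror of the rocblas_layer_mode values listed above (illustration only).
enum example_layer_mode : unsigned
{
    example_layer_mode_none        = 0u,
    example_layer_mode_log_trace   = 1u << 0,
    example_layer_mode_log_bench   = 1u << 1,
    example_layer_mode_log_profile = 1u << 2
};

// Returns true when the bit of `flag` is set in `mask`.
// The "none" value has no bit, so it is never reported as enabled.
bool layer_enabled(unsigned mask, example_layer_mode flag)
{
    return flag != example_layer_mode_none && (mask & flag) == flag;
}
```

For instance, a mask built as trace | profile has the value 5, enables trace and profile logging, and leaves bench logging off.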

enum rocblas_gemm_algo

Specifies the GEMM algorithm type.

Values:

rocblas_gemm_algo_standard = 0b0000000000

file rocblas.h

#include <stdbool.h>
#include "rocblas-export.h"
#include "rocblas-version.h"
#include "rocblas-types.h"
#include "rocblas-auxiliary.h"
#include "rocblas-functions.h"

rocblas.h includes other *.h and exposes a common interface

Defines

_ROCBLAS_H_

file buildinfo.cpp

#include <stdio.h>
#include <sstream>
#include <string.h>
#include "definitions.h"
#include "rocblas-types.h"
#include "rocblas-functions.h"
#include "rocblas-version.h"

Defines

TO_STR2(x)

TO_STR(x)

VERSION_STRING

Functions

rocblas_status rocblas_get_version_string(char *buf, size_t len)

file handle.cpp

#include "handle.h"
#include <cstdlib>

Functions

static void open_log_stream(const char *environment_variable_name, std::ostream *&log_os, std::ofstream &log_ofs)

Logging function.

open_log_stream opens the stream log_os for logging. If the environment variable named by environment_variable_name is not set, log_os streams to std::cerr. Otherwise, a file is opened at the full logfile path contained in the environment variable. If opening the file succeeds, log_os streams to the file; otherwise it streams to std::cerr.

Parameters

[in] environment_variable_name: const char*. Name of the environment variable that contains the full logfile path.
[out] log_os: std::ostream*&. Output stream. Streams to std::cerr if environment_variable_name is not set; otherwise it is set to stream to log_ofs.
[out] log_ofs: std::ofstream&. Output file stream. If log_ofs.is_open() == true, log_os streams to log_ofs; otherwise it streams to std::cerr.
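
The fallback logic described above can be sketched in a few lines of standard C++. This is a reconstruction from the description, not the code in handle.cpp; the name `open_log_stream_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <iostream>

// Sketch of the described behavior: log to the file named by the environment
// variable when it is set and can be opened, otherwise log to std::cerr.
void open_log_stream_sketch(const char*    environment_variable_name,
                            std::ostream*& log_os,
                            std::ofstream& log_ofs)
{
    log_os = &std::cerr; // default when the variable is unset or the open fails
    const char* path = std::getenv(environment_variable_name);
    if (path != nullptr)
    {
        log_ofs.open(path);
        if (log_ofs.is_open())
            log_os = &log_ofs;
    }
}
```

A caller would then write through `*log_os` without needing to know which destination was selected.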

file rocblas_auxiliary.cpp

#include <stdio.h>
#include <hip/hip_runtime.h>
#include "definitions.h"
#include "rocblas-types.h"
#include "handle.h"
#include "logging.h"
#include "utility.h"
#include "rocblas_unique_ptr.hpp"
#include "rocblas-auxiliary.h"

Functions

rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr): Indicates whether the pointer is on the host or on the device. Currently the HIP API can only recognize whether the input ptr is on the device; it cannot recognize whether it is on the host.

rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *mode)

rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode mode)

rocblas_status rocblas_create_handle(rocblas_handle *handle)

rocblas_status rocblas_destroy_handle(rocblas_handle handle)

rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream_id)

rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream_id)

__global__ void copy_void_ptr_vector_kernel(rocblas_int n, rocblas_int elem_size, const void * x, rocblas_int incx, void * y, rocblas_int incy)

rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x_h, rocblas_int incx, void *y_d, rocblas_int incy)

rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x_d, rocblas_int incx, void *y_h, rocblas_int incy)

__global__ void copy_void_ptr_matrix_kernel(rocblas_int rows, rocblas_int cols, size_t elem_size, const void * a, rocblas_int lda, void * b, rocblas_int ldb)

rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a_h, rocblas_int lda, void *b_d, rocblas_int ldb)

rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a_d, rocblas_int lda, void *b_h, rocblas_int ldb)
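
The set/get matrix signatures take an element size in bytes and two leading dimensions, which suggests a column-by-column copy between buffers with different padding. The host-only sketch below illustrates that access pattern under the assumption of column-major storage; `copy_matrix_bytes` is a hypothetical name, and the real routines additionally move the data between host and GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical host-only sketch: copy a rows by cols matrix of elem_size-byte
// elements from a (leading dimension lda) to b (leading dimension ldb).
void copy_matrix_bytes(int rows, int cols, std::size_t elem_size,
                       const void* a, int lda, void* b, int ldb)
{
    assert(rows >= 0 && cols >= 0 && lda >= rows && ldb >= rows);
    const unsigned char* src = static_cast<const unsigned char*>(a);
    unsigned char*       dst = static_cast<unsigned char*>(b);
    const std::size_t column_bytes = static_cast<std::size_t>(rows) * elem_size;
    for (int j = 0; j < cols; ++j)
    {
        // Each column is contiguous; only the column starts depend on lda / ldb.
        std::memcpy(dst + static_cast<std::size_t>(j) * ldb * elem_size,
                    src + static_cast<std::size_t>(j) * lda * elem_size,
                    column_bytes);
    }
}
```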

Variables

constexpr size_t VEC_BUFF_MAX_BYTES = 1048576

constexpr rocblas_int NB_X = 256

constexpr size_t MAT_BUFF_MAX_BYTES = 1048576

constexpr rocblas_int MATRIX_DIM_X = 128

constexpr rocblas_int MATRIX_DIM_Y = 8

file status.cpp

#include <hip/hip_runtime_api.h>
#include "rocblas.h"
#include "status.h"

Functions

rocblas_status get_rocblas_status_for_hip_status(hipError_t status)

hipBLAS¶

Introduction¶

The source code is available in the hipBLAS repository on GitHub.

hipBLAS is a BLAS marshalling library, with multiple supported backends. It sits between the application and a ‘worker’ BLAS library, marshalling inputs into the backend library and marshalling results back to the application. hipBLAS exports an interface that does not require the client to change, regardless of the chosen backend. Currently, hipBLAS supports rocBLAS and cuBLAS as backends.

Installing pre-built packages¶

Download pre-built packages either from ROCm’s package servers or by clicking the github releases tab and manually downloading, which could be newer. Release notes are available for each release on the releases tab.

sudo apt update && sudo apt install hipblas

Quickstart hipBLAS build¶

Bash helper build script (Ubuntu only)

The root of this repository has a helper bash script install.sh to build and install hipBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.

./install.sh -h -- shows help
./install.sh -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)

Manual build (all supported platforms)

If you use a distro other than Ubuntu, or would like more control over the build process, the hipblas build has helpful information on how to configure cmake and manually build.

Functions supported

A list of exported functions from hipblas can be found on the wiki

hipBLAS interface examples¶

The hipBLAS interface is compatible with the rocBLAS and cuBLAS-v2 APIs. Porting a CUDA application that calls the cuBLAS API to the hipBLAS API should be relatively straightforward. For example, the hipBLAS SGEMV interface is:

GEMV API¶

hipblasStatus_t
hipblasSgemv( hipblasHandle_t handle,
             hipblasOperation_t trans,
             int m, int n, const float *alpha,
             const float *A, int lda,
             const float *x, int incx, const float *beta,
             float *y, int incy );
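To make the argument list concrete, the following host-only sketch computes what a GEMV call is expected to produce, y = alpha * op(A) * x + beta * y, on ordinary CPU arrays. It assumes column-major storage (the convention of the legacy BLAS and cuBLAS interfaces that hipBLAS mirrors) and positive increments. The name reference_sgemv and the bool transpose flag are hypothetical stand-ins for illustration; they are not part of hipBLAS.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for y = alpha * op(A) * x + beta * y.
// A is m x n, column-major with leading dimension lda:
// element (i, j) is stored at A[i + j * lda].
// transpose == false: op(A) = A; transpose == true: op(A) = A^T.
// Assumes incx > 0 and incy > 0.
void reference_sgemv(bool transpose, int m, int n, float alpha,
                     const std::vector<float>& A, int lda,
                     const std::vector<float>& x, int incx, float beta,
                     std::vector<float>& y, int incy)
{
    const int rows = transpose ? n : m; // number of elements in y
    const int cols = transpose ? m : n; // number of elements in x
    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j) {
            // op(A)(i, j) is A(i, j) or A(j, i) depending on the transpose flag.
            const float a = transpose ? A[j + static_cast<std::size_t>(i) * lda]
                                      : A[i + static_cast<std::size_t>(j) * lda];
            sum += a * x[static_cast<std::size_t>(j) * incx];
        }
        float& out = y[static_cast<std::size_t>(i) * incy];
        out = alpha * sum + beta * out;
    }
}
```

A reference like this can be useful for checking the result copied back from the device against a known-good host computation.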

Batched and strided GEMM API¶

hipBLAS GEMM can process matrices in batches with regular strides. There are several variants of this API; the following example takes the full set of parameters:

hipblasStatus_t
hipblasSgemmStridedBatched( hipblasHandle_t handle,
             hipblasOperation_t transa, hipblasOperation_t transb,
             int m, int n, int k, const float *alpha,
             const float *A, int lda, long long bsa,
             const float *B, int ldb, long long bsb, const float *beta,
             float *C, int ldc, long long bsc,
             int batchCount);

hipBLAS assumes matrices A and vectors x, y are allocated in GPU memory space filled with data. Users are responsible for copying data from/to the host and device memory.
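The stride parameters bsa, bsb and bsc give the distance, in elements, between consecutive matrices of a batch. The host-only sketch below illustrates that addressing scheme under stated assumptions: column-major storage, no transposes, and plain CPU vectors instead of device pointers. The function name reference_sgemm_strided_batched is hypothetical and is not part of hipBLAS.

```cpp
#include <vector>

// Hypothetical CPU reference for a strided-batched GEMM without transposes.
// For each batch index b: C_b = alpha * A_b * B_b + beta * C_b, where
// A_b starts at A + b * bsa, B_b at B + b * bsb, and C_b at C + b * bsc.
// All matrices are column-major: element (i, j) of A_b is A_b[i + j * lda].
void reference_sgemm_strided_batched(int m, int n, int k, float alpha,
                                     const std::vector<float>& A, int lda, long long bsa,
                                     const std::vector<float>& B, int ldb, long long bsb,
                                     float beta,
                                     std::vector<float>& C, int ldc, long long bsc,
                                     int batchCount)
{
    for (int b = 0; b < batchCount; ++b) {
        const float* Ab = A.data() + b * bsa;
        const float* Bb = B.data() + b * bsb;
        float* Cb = C.data() + b * bsc;
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float sum = 0.0f;
                for (int p = 0; p < k; ++p) {
                    sum += Ab[i + static_cast<long long>(p) * lda]
                         * Bb[p + static_cast<long long>(j) * ldb];
                }
                float& out = Cb[i + static_cast<long long>(j) * ldc];
                out = alpha * sum + beta * out;
            }
        }
    }
}
```

For densely packed batches the strides are typically lda * k, ldb * n and ldc * n in this no-transpose, column-major setting; larger strides leave gaps between matrices.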

Build¶

Dependencies For Building Library¶

CMake 3.5 or later

The build infrastructure for hipBLAS is based on CMake v3.5. This is the version of CMake available on ROCm supported platforms. If you are on a headless machine without the X Window System, we recommend using ccmake; if you have access to X, we recommend using cmake-gui.

One-line install commands for cmake:

Ubuntu: sudo apt install cmake-qt-gui
Fedora: sudo dnf install cmake-gui

Build Library Using Script (Ubuntu only)¶

The root of this repository has a helper bash script install.sh to build and install hipBLAS on Ubuntu with a single command. It does not take a lot of options and hard-codes configuration that can be specified through invoking cmake directly, but it’s a great way to get started quickly and can serve as an example of how to build/install. A few commands in the script need sudo access, so it may prompt you for a password.

./install.sh -h -- shows help
./install.sh -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)

Build Library Using Individual Commands¶

mkdir -p [HIPBLAS_BUILD_DIR]/release
cd [HIPBLAS_BUILD_DIR]/release
# Default install location is in /opt/rocm, define -DCMAKE_INSTALL_PREFIX=<path> to specify other
# Default build config is 'Release', define -DCMAKE_BUILD_TYPE=<config> to specify other
CXX=/opt/rocm/bin/hcc ccmake [HIPBLAS_SOURCE]
make -j$(nproc)
sudo make install # sudo required if installing into system directory such as /opt/rocm

Build Library + Tests + Benchmarks + Samples Using Individual Commands¶

The repository contains source for clients that serve as samples, tests and benchmarks. Clients source can be found in the clients subdir.

Dependencies (only necessary for hipBLAS clients)

The hipBLAS samples have no external dependencies, but our unit test and benchmarking applications do. These clients introduce the following dependencies:

boost
lapack
- lapack itself brings a dependency on a fortran compiler
googletest

Linux distros typically have an easy installation mechanism for boost through the native package manager.

Ubuntu: sudo apt install libboost-program-options-dev
Fedora: sudo dnf install boost-program-options

Unfortunately, googletest and lapack are not as easy to install. Many distros do not provide a googletest package with pre-compiled libraries, and the lapack packages do not have the necessary cmake config files for cmake to configure linking the cblas library. hipBLAS provides a cmake script that builds the above dependencies from source. This is an optional step; users can provide their own builds of these dependencies and help cmake find them by setting the CMAKE_PREFIX_PATH definition. The following is a sequence of steps to build dependencies and install them to the cmake default /usr/local.

(optional, one time only)

mkdir -p [HIPBLAS_BUILD_DIR]/release/deps
cd [HIPBLAS_BUILD_DIR]/release/deps
ccmake -DBUILD_BOOST=OFF [HIPBLAS_SOURCE]/deps   # assuming boost is installed through package manager as above
make -j$(nproc) install

Once dependencies are available on the system, it is possible to configure the clients to build. This requires a few extra cmake flags to the library cmake configure script. If the dependencies are not installed into system defaults (like /usr/local ), you should pass the CMAKE_PREFIX_PATH to cmake to help find them.

-DCMAKE_PREFIX_PATH="<semicolon separated paths>"

# Default install location is in /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to specify other
CXX=/opt/rocm/bin/hcc ccmake -DBUILD_CLIENTS_TESTS=ON -DBUILD_CLIENTS_BENCHMARKS=ON [HIPBLAS_SOURCE]
make -j$(nproc)
sudo make install   # sudo required if installing into system directory such as /opt/rocm

Common build problems¶

Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/hcc/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc. (version does not match) . Please rebuild HIP including cmake or update HCC_HOME variable.

Solution: Download HIP from GitHub, build it from source with hcc, and then either use the newly built HIP instead of the one in /opt/rocm/hip or simply overwrite /opt/rocm/hip with the new build.

Issue: For Carrizo - HCC RUNTIME ERROR: Fail to find compatible kernel

Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"

Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Fail to find compatible kernel

Solution: export HCC_AMDGPU_TARGET=gfx900

Running¶

Notice¶

This section assumes that hipBLAS and its client applications have been built successfully, as described in Build hipBLAS libraries and verification code.

Samples

cd [BUILD_DIR]/clients/staging
./example-sscal

For more example code that calls hipBLAS, see also the blog post Example C code calling hipBLAS routine.

Unit tests

Run tests with the following:

cd [BUILD_DIR]/clients/staging
./hipblas-test

To run specific tests, use --gtest_filter=match where match is a ':'-separated list of wildcard patterns (called the positive patterns) optionally followed by a '-' and another ':'-separated pattern list (called the negative patterns). For example, run gemv tests with the following:

cd [BUILD_DIR]/clients/staging
./hipblas-test --gtest_filter=*gemv*
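The filter rule described above (positive patterns, then an optional '-' followed by negative patterns) can be sketched in a few lines of host code. This is a simplified illustration of the matching semantics, not Google Test's own implementation; the function names and the test names used with it are hypothetical.

```cpp
#include <string>

// '*' matches any (possibly empty) substring; '?' matches any single character.
bool wildcard_match(const char* pattern, const char* text)
{
    if (*pattern == '\0') return *text == '\0';
    if (*pattern == '*') {
        // Either the star matches nothing, or it consumes one character of text.
        return wildcard_match(pattern + 1, text)
            || (*text != '\0' && wildcard_match(pattern, text + 1));
    }
    if (*text == '\0') return false;
    return (*pattern == '?' || *pattern == *text)
        && wildcard_match(pattern + 1, text + 1);
}

// Returns true if name matches at least one of the ':'-separated patterns.
bool matches_any(const std::string& patterns, const std::string& name)
{
    std::size_t start = 0;
    while (start <= patterns.size()) {
        std::size_t end = patterns.find(':', start);
        if (end == std::string::npos) end = patterns.size();
        if (wildcard_match(patterns.substr(start, end - start).c_str(), name.c_str()))
            return true;
        start = end + 1;
    }
    return false;
}

// A test is selected if it matches a positive pattern and no negative pattern.
bool filter_selects(const std::string& filter, const std::string& name)
{
    const std::size_t dash = filter.find('-');
    const bool has_negative = dash != std::string::npos;
    std::string positive = filter.substr(0, dash);
    const std::string negative = has_negative ? filter.substr(dash + 1) : "";
    if (positive.empty()) positive = "*"; // a filter that starts with '-' only excludes
    if (!matches_any(positive, name)) return false;
    return !(has_negative && matches_any(negative, name));
}
```

Under these assumptions, a filter such as *gemv*:*gemm*-*double* would select every test whose name contains gemv or gemm, except those whose name also contains double.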

hcRNG¶

Introduction¶

The hcRNG library is an implementation of uniform random number generators targeting AMD heterogeneous hardware via the HCC compiler runtime. The computational resources of the underlying AMD heterogeneous compute are exposed and exploited through the HCC C++ frontend. Refer here for more details on the HCC compiler.

The following generators are currently supported:

MRG31k3p

MRG32k3a

LFSR113

Philox-4x32-10

Examples¶

Random number generator Mrg31k3p example:

file: Randomarray.cpp

//This example is a simple random array generation and it compares host output with device output
//Random number generator Mrg31k3p
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>
#include <iostream>   // std::cout
#include <vector>     // std::vector
#include <hcRNG/mrg31k3p.h>
#include <hcRNG/hcRNG.h>
#include <hc.hpp>
#include <hc_am.hpp>
using namespace hc;

int main()
{
      hcrngStatus status = HCRNG_SUCCESS;
      bool ispassed = 1;
      size_t streamBufferSize;
      // Number of streams
      size_t streamCount = 10;
      //Number of random numbers to be generated
      //numberCount must be a multiple of streamCount
      size_t numberCount = 100;
      //Enumerate the list of accelerators
      std::vector<hc::accelerator>acc = hc::accelerator::get_all();
      accelerator_view accl_view = (acc[1].create_view());
      //Allocate memory for host pointers
      float *Random1 = (float*) malloc(sizeof(float) * numberCount);
      float *Random2 = (float*) malloc(sizeof(float) * numberCount);
      float *outBufferDevice = hc::am_alloc(sizeof(float) * numberCount, acc[1], 0);

      //Create streams
      hcrngMrg31k3pStream *streams = hcrngMrg31k3pCreateStreams(NULL, streamCount, &streamBufferSize, NULL);
      hcrngMrg31k3pStream *streams_buffer = hc::am_alloc(sizeof(hcrngMrg31k3pStream) * streamCount, acc[1], 0);
      accl_view.copy(streams, streams_buffer, streamCount* sizeof(hcrngMrg31k3pStream));

      //Invoke random number generators in device (here stream_length and streams_per_thread arguments are default)
      status = hcrngMrg31k3pDeviceRandomU01Array_single(accl_view, streamCount, streams_buffer, numberCount, outBufferDevice);

      if(status) std::cout << "TEST FAILED" << std::endl;
      accl_view.copy(outBufferDevice, Random1, numberCount * sizeof(float));

      //Invoke random number generators in host
      for (size_t i = 0; i < numberCount; i++)
        Random2[i] = hcrngMrg31k3pRandomU01(&streams[i % streamCount]);
      // Compare host and device outputs
      for (size_t i = 0; i < numberCount; i++) {
          if (Random1[i] != Random2[i]) {
              ispassed = 0;
              std::cout << "RANDDEVICE[" << i << "] " << Random1[i] << " and RANDHOST[" << i << "] " << Random2[i] << " mismatch" << std::endl;
              break;
          }
      }
      if(!ispassed) std::cout << "TEST FAILED" << std::endl;

      //Free host resources
      free(Random1);
      free(Random2);
      //Release device resources
      hc::am_free(outBufferDevice);
      hc::am_free(streams_buffer);
      return 0;
}

Compiling the example code:

/opt/hcc/bin/clang++ $(/opt/hcc/bin/hcc-config --cxxflags --ldflags) -lhc_am -lhcrng Randomarray.cpp

Installation¶

Installation steps

The following are the steps to use the library:

ROCm 2.4 kernel, driver and compiler installation (if not already done)

Library installation

ROCm 2.4 Installation

To learn more about ROCm, refer here.

a. Installing Debian ROCm repositories

Before proceeding, make sure to completely uninstall any pre-release ROCm packages.

Refer here for instructions to remove pre-release ROCm packages.

Follow these steps to install the rocm package:

wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
sudo apt-get update
sudo apt-get install rocm

Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:

sudo vi /etc/default/grub
sudo update-grub

Then reboot the system.

b. Verifying the Installation

After rebooting, you can verify that the ROCm stack installed successfully by running the HSA vector_copy sample application:

cd /opt/rocm/hsa/sample
make
./vector_copy

Library Installation

a. Install using Prebuilt debian

wget https://github.com/ROCmSoftwarePlatform/hcRNG/raw/master/pre-builds/hcrng-master-184472e-Linux.deb
sudo dpkg -i hcrng-master-184472e-Linux.deb

b. Build debian from source

git clone https://github.com/ROCmSoftwarePlatform/hcRNG.git && cd hcRNG
chmod +x build.sh && ./build.sh

Running build.sh builds the library and generates a Debian package under the build directory.

Key Features¶

Support for 4 commonly used uniform random number generators.

Single and Double precision.

Multiple streams, created on the host, that generate random numbers either on the host or on computing devices.

Prerequisites

This section lists the known set of hardware and software requirements to build this library

Hardware

CPU: mainstream brand, preferably an Intel Haswell based CPU with >= 4 cores

System Memory >= 4GB (Better if >10GB for NN application over multiple GPUs)

Hard Drive > 200GB (Better if SSD or NVMe drive for NN application over multiple GPUs)

Minimum GPU Memory (Global) > 2GB

GPU cards supported

dGPU: AMD R9 Fury X, R9 Fury, R9 Nano

APU: AMD Kaveri or Carrizo

AMD Driver and Runtime

Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime

System software

Ubuntu 14.04 trusty and later

GCC 4.6 and later

CPP 4.6 and later (comes with the GCC package)

python 2.7 and later

python-pip

BeautifulSoup4 (installed using python-pip)

HCC 0.9 from here

Tools and Misc

git 1.9 and later

cmake 2.6 and later (2.6 and 2.8 are tested)

firewall off

root privilege or user account in sudo group

Ubuntu Packages

libc6-dev-i386

liblapack-dev

graphicsmagick

libblas-dev

Tested Environments¶

Driver versions

Boltzmann Early Release Driver + dGPU

Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime

Traditional HSA driver + APU (Kaveri)

GPU Cards

Radeon R9 Nano

Radeon R9 FuryX

Radeon R9 Fury

Kaveri and Carrizo APU

Server System

Supermicro SYS 2028GR-THT 6 R9 NANO

Supermicro SYS-1028GQ-TRT 4 R9 NANO

Supermicro SYS-7048GR-TR Tower 4 R9 NANO

Unit testing¶

a) Automated testing:

Follow these steps to start automated testing:

cd ~/hcRNG/
./build.sh --test=on

b) Manual testing:

(i) Google testing (GTEST) with Functionality check

cd ~/hcRNG/build/test/unit/bin/

All functions are tested against google test.

hipeigen¶

Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

For more information go to http://eigen.tuxfamily.org/.

Installation instructions for ROCm¶

The ROCm Platform brings a rich foundation to advanced computing by seamlessly integrating the CPU and GPU with the goal of solving real-world problems.

To install ROCm, follow the steps below:

Installing from AMD ROCm repositories¶

AMD is hosting both debian and rpm repositories for the ROCm 2.4 packages. The packages in both repositories have been signed to ensure package integrity. Directions for each repository are given below:

Debian repository - apt-get
Add the ROCm apt repository

Complete installation steps of ROCm can be found Here

or

For Debian based systems, like Ubuntu, configure the Debian ROCm repository as follows:

wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'

The gpg key might change, so it may need to be updated when installing a new release.

Install or Update

Next, update the apt-get repository list and install/update the rocm package:

Warning

Before proceeding, make sure to completely uninstall any pre-release ROCm packages

sudo apt-get update
sudo apt-get install rocm

Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:

sudo vi /etc/default/grub
sudo update-grub

Once complete, reboot your system.

We recommend you verify your installation to make sure everything completed successfully.

Installation instructions for Eigen¶

Explanation before starting

Eigen consists only of header files, hence there is nothing to compile before you can use it. Moreover, these header files do not depend on your platform, they are the same for everybody.

Method 1. Installing without using CMake

You can use the headers in the Eigen/ subdirectory right away. To install, just copy this Eigen/ subdirectory to your favorite location. If you also want the unsupported features, copy the unsupported/ subdirectory too.

Method 2. Installing using CMake

Let’s call this directory ‘source_dir’ (where this INSTALL file is). Before starting, create another directory which we will call ‘build_dir’.

Do:

cd build_dir
cmake source_dir
make install

The make install step may require administrator privileges.

You can adjust the installation destination (the “prefix”) by passing the -DCMAKE_INSTALL_PREFIX=myprefix option to cmake, as is explained in the message that cmake prints at the end.

Build and Run hipeigen direct tests¶

To build the direct tests for hipeigen:

cd build_dir
make check -j $(nproc)

Note: All direct tests should pass with ROCm 2.4

clFFT¶

For the GitHub repository, see clFFT.

clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and heterogeneous programming.

Pre-built binaries are available here.

Introduction to clFFT¶

The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the FFT definition to reduce the mathematical intensity required from O(N^2) to O(N log2(N)) when the sequence length N is the product of small prime factors. Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems: no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that are optimized for AMD graphics processors, but also are functional across CPU and other compute devices.
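The reduction from O(N^2) to O(N log N) can be illustrated on the host with a direct DFT and a textbook radix-2 Cooley-Tukey FFT. The sketch below is for illustration only: it handles power-of-two lengths, uses double precision std::complex, and is not how clFFT itself is implemented (clFFT generates OpenCL kernels and also supports radices 3, 5, 7, 11 and 13). The function names are hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Direct evaluation of the forward DFT:
// X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N). Cost: O(N^2).
cvec naive_dft(const cvec& x)
{
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    cvec X(N);
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            X[k] += x[n] * std::polar(1.0, angle);
        }
    }
    return X;
}

// Recursive radix-2 Cooley-Tukey FFT. The length must be a power of two.
// Each level does O(N) work and there are log2(N) levels: O(N log N).
cvec fft_radix2(const cvec& x)
{
    const std::size_t N = x.size();
    if (N <= 1) return x;
    cvec even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    const cvec E = fft_radix2(even);
    const cvec O = fft_radix2(odd);
    const double pi = std::acos(-1.0);
    cvec X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        // Twiddle factor exp(-2*pi*i*k/N) applied to the odd half.
        const std::complex<double> t =
            std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(N)) * O[k];
        X[k] = E[k] + t;
        X[k + N / 2] = E[k] - t;
    }
    return X;
}
```

Both functions compute the same transform up to rounding error, so the naive version can serve as a reference when validating FFT output for small N.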

The clFFT library is an open source OpenCL library implementation of discrete Fast Fourier Transforms. The library:

provides a fast and accurate platform for calculating discrete FFTs.

works on CPU or GPU backends.

supports in-place or out-of-place transforms.

supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1.

supports planar (real and complex components in separate arrays) and interleaved (real and complex components as a pair contiguous in memory) formats.

supports dimension lengths that can be any combination of powers of 2, 3, 5, 7, 11 and 13.

supports single and double precision floating point formats.
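The difference between the planar and interleaved formats listed above is easiest to see in code. The following host-only sketch converts between the two layouts for single precision data; the helper names are hypothetical and are not part of the clFFT API, which selects the layout per plan instead.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Interleaved layout: {re0, im0, re1, im1, ...} in one array.
// Planar layout: {re0, re1, ...} and {im0, im1, ...} in two separate arrays.
std::pair<std::vector<float>, std::vector<float>>
interleaved_to_planar(const std::vector<float>& interleaved)
{
    const std::size_t n = interleaved.size() / 2; // number of complex values
    std::vector<float> re(n), im(n);
    for (std::size_t i = 0; i < n; ++i) {
        re[i] = interleaved[2 * i];
        im[i] = interleaved[2 * i + 1];
    }
    return {re, im};
}

// Inverse conversion; re and im are assumed to have the same length.
std::vector<float> planar_to_interleaved(const std::vector<float>& re,
                                         const std::vector<float>& im)
{
    std::vector<float> interleaved(2 * re.size());
    for (std::size_t i = 0; i < re.size(); ++i) {
        interleaved[2 * i] = re[i];
        interleaved[2 * i + 1] = im[i];
    }
    return interleaved;
}
```

An interleaved buffer of N complex values therefore holds 2 * N floats, which is why host buffers for interleaved single precision transforms are sized as N * 2 * sizeof(float).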

clFFT library user documentation¶

Library and API documentation for developers is available online as a GitHub Pages website

API semantic versioning¶

Good software is typically the result of the loop of feedback and iteration; software interfaces no less so. clFFT follows the semantic versioning guidelines. The version number used is of the form MAJOR.MINOR.PATCH.

clFFT Wiki¶

The project wiki contains helpful documentation, including a build primer

Contributing code¶

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests must be made against the /develop branch.

License¶

The source for clFFT is licensed under the Apache License, Version 2.0.

Example¶

The following simple example shows how to use clFFT to compute a simple 1D forward transform

#include <stdlib.h>

/* No need to explicitly include the OpenCL headers */
#include <clFFT.h>

int main( void )
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufX;
    float *X;
    cl_event event = NULL;
    int ret = 0;
    size_t N = 16;

    /* FFT library related declarations */
    clfftPlanHandle planHandle;
    clfftDim dim = CLFFT_1D;
    size_t clLengths[1] = {N};

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clFFT. */
    clfftSetupData fftSetup;
    err = clfftInitSetupData(&fftSetup);
    err = clfftSetup(&fftSetup);

    /* Allocate host & initialize data. */
    /* Only allocation shown for simplicity. */
    X = (float *)malloc(N * 2 * sizeof(*X));

    /* Prepare OpenCL memory objects and place data inside them. */
    bufX = clCreateBuffer( ctx, CL_MEM_READ_WRITE, N * 2 * sizeof(*X), NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufX, CL_TRUE, 0,
        N * 2 * sizeof( *X ), X, 0, NULL, NULL );

    /* Create a default plan for a complex FFT. */
    err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);

    /* Set plan parameters. */
    err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
    err = clfftSetLayout(planHandle, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
    err = clfftSetResultLocation(planHandle, CLFFT_INPLACE);

    /* Bake the plan. */
    err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);

    /* Execute the plan. */
    err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &bufX, NULL, NULL);

    /* Wait for calculations to be finished. */
    err = clFinish(queue);

    /* Fetch results of calculations. */
    err = clEnqueueReadBuffer( queue, bufX, CL_TRUE, 0, N * 2 * sizeof( *X ), X, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufX );

    free(X);

    /* Release the plan. */
    err = clfftDestroyPlan( &planHandle );

    /* Release clFFT library. */
    clfftTeardown( );

    /* Release OpenCL working objects. */
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
}

Build dependencies¶

Library for Windows

To develop the clFFT library code on a Windows operating system, make sure the following packages are installed on your system:

Windows® 7/8.1

Visual Studio 2012 or later

Latest CMake

An OpenCL SDK, such as APP SDK 3.0

Library for Linux

To develop the clFFT library code on a Linux operating system, make sure the following packages are installed on your system:

GCC 4.6 and onwards

Latest CMake

An OpenCL SDK, such as APP SDK 3.0

Library for Mac OSX

To develop the clFFT library code on Mac OS X, it is recommended to generate Unix makefiles with cmake.

Test infrastructure

To test the developed clFFT library code, make sure the following packages are installed on your system:

Googletest v1.6

Latest FFTW

Latest Boost

Performance infrastructure¶

To measure the performance of the clFFT library code, make sure that Python is installed on your system.

clBLAS¶

For the GitHub repository, see clBLAS.

This repository houses the code for the OpenCL™ BLAS portion of clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. APPML 1.12 is the most current generally available pre-packaged binary version of the library available for download for both Linux and Windows platforms.

The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces do not hide or wrap OpenCL interfaces, but rather leave OpenCL state management to the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user from the task of writing, optimizing and maintaining kernel code themselves.

clBLAS update notes 01/2017

v2.12 is a bugfix release that rolls up all fixes in the /develop branch. Thanks to @pavanky, @iotamudelta, @shahsan10, @psyhtest, @haahh, @hughperkins, @tfauck, @abhiShandy, @IvanVergiliev, @zougloub, @mgates3 for contributions to clBLAS v2.12. A summary of the fixes is available on the releases tab.

clBLAS library user documentation¶

Library and API documentation for developers is available online as a GitHub Pages website

clBLAS Wiki

The project wiki contains helpful documentation, including a build primer

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests should be made against the /develop branch.

License¶

The source for clBLAS is licensed under the Apache License, Version 2.0

Example¶

The simple example below shows how to use clBLAS to compute an OpenCL accelerated SGEMM

#include <sys/types.h>
#include <stdio.h>

/* Include the clBLAS header. It includes the appropriate OpenCL headers */
#include <clBLAS.h>

/* This example uses predefined matrices and their characteristics for
 * simplicity purpose.
*/

#define M  4
#define N  3
#define K  5

static const cl_float alpha = 10;

static const cl_float A[M*K] = {
11, 12, 13, 14, 15,
21, 22, 23, 24, 25,
31, 32, 33, 34, 35,
41, 42, 43, 44, 45,
};
static const size_t lda = K;        /* i.e. lda = K */

static const cl_float B[K*N] = {
11, 12, 13,
21, 22, 23,
31, 32, 33,
41, 42, 43,
51, 52, 53,
};
static const size_t ldb = N;        /* i.e. ldb = N */

static const cl_float beta = 20;

static cl_float C[M*N] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
};
static const size_t ldc = N;        /* i.e. ldc = N */

static cl_float result[M*N];

int main( void )
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufA, bufB, bufC;
    cl_event event = NULL;
    int ret = 0;

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clBLAS */
    err = clblasSetup( );

    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                          NULL, &err );
    bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                          NULL, &err );
    bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                          NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
        M * K * sizeof( *A ), A, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
        K * N * sizeof( *B ), B, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
        M * N * sizeof( *C ), C, 0, NULL, NULL );

    /* Call the clBLAS function. Perform gemm on the full matrices (all buffer offsets are 0). */
    err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
                       M, N, K,
                       alpha, bufA, 0, lda,
                       bufB, 0, ldb, beta,
                       bufC, 0, ldc,
                       1, &queue, 0, NULL, &event );

    /* Wait for calculations to be finished. */
    err = clWaitForEvents( 1, &event );

    /* Fetch results of calculations from GPU memory. */
    err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
                               M * N * sizeof(*result),
                               result, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufC );
    clReleaseMemObject( bufB );
    clReleaseMemObject( bufA );

    /* Finalize work with clBLAS */
    clblasTeardown( );

    /* Release OpenCL working objects. */
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
}
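To check the values read back into result, it can help to compute the same product on the host. The sketch below is a plain CPU reference for C = alpha * A * B + beta * C with row-major storage and no transposes, matching the clblasRowMajor / clblasNoTrans arguments used above. The function name is hypothetical and is not part of clBLAS.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for C = alpha * A * B + beta * C.
// Row-major storage, no transposes: element (i, j) of A is A[i * lda + j].
void reference_sgemm_row_major(std::size_t M, std::size_t N, std::size_t K, float alpha,
                               const std::vector<float>& A, std::size_t lda,
                               const std::vector<float>& B, std::size_t ldb, float beta,
                               std::vector<float>& C, std::size_t ldc)
{
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < K; ++p)
                sum += A[i * lda + p] * B[p * ldb + j];
            C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
        }
    }
}
```

With the 4x5 and 5x3 matrices of the example, alpha = 10 and beta = 20, this reference gives 21370 for the first element of the result and 72810 for the last one.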

Build dependencies¶

Library for Windows

Windows® 7/8

Visual Studio 2010 SP1, 2012

An OpenCL SDK, such as APP SDK 2.8

Latest CMake

Library for Linux

GCC 4.6 and onwards

An OpenCL SDK, such as APP SDK 2.9

Latest CMake

Library for Mac OSX

Recommended to generate Unix makefiles with cmake

Test infrastructure

Googletest v1.6

Latest Boost

CPU BLAS

Netlib CBLAS (recommended)
- Ubuntu: install with "apt-get install libblas-dev"
- Windows: download and install lapack-3.6.0, which comes with CBLAS

or ACML on windows/linux; Accelerate on Mac OSX

Performance infrastructure¶

Python

clSPARSE¶

For the GitHub repository, see clSPARSE.

clSPARSE is an OpenCL™ library implementing sparse linear algebra routines. This project is the result of a collaboration between AMD Inc. and Vratis Ltd.

What’s new in clSPARSE v0.10.1¶

bug fix release

Fixes for travis builds

Fix to the matrix market reader in the cuSPARSE benchmark to synchronize with the regular MM reader

Replace cl.hpp with cl2.hpp (thanks to arrayfire)

Fixes for the Nvidia platform; tested 352.79

Fixed buffer overruns in CSR-Adaptive kernels

Fix invalid memory access on Nvidia GPUs in CSR-Adaptive SpMV kernel

clSPARSE features¶

Sparse Matrix - dense Vector multiply (SpM-dV)

Sparse Matrix - dense Matrix multiply (SpM-dM)

Sparse Matrix - Sparse Matrix multiply (SpGEMM) - Single Precision

Iterative conjugate gradient solver (CG)

Iterative biconjugate gradient stabilized solver (BiCGStab)

Dense to CSR conversions (& converse)

COO to CSR conversions (& converse)

Functions to read matrix market files in COO or CSR format
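Two of the features listed above, COO to CSR conversion and sparse matrix - dense vector multiply, can be sketched on the host as follows. This is an illustrative CPU version under simple assumptions (zero-based indices, valid input, single precision); the struct and function names are hypothetical and do not correspond to the clSPARSE API, which operates on OpenCL buffers.

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row storage: row r owns the entries
// [row_offsets[r], row_offsets[r + 1]) of col_indices and values.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_offsets; // rows + 1 entries
    std::vector<std::size_t> col_indices;
    std::vector<float> values;
};

// Convert COO triplets (row, col, value) to CSR with a counting sort over rows.
// The relative order of entries within a row is preserved.
CsrMatrix coo_to_csr(std::size_t rows, std::size_t cols,
                     const std::vector<std::size_t>& coo_rows,
                     const std::vector<std::size_t>& coo_cols,
                     const std::vector<float>& coo_vals)
{
    CsrMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.row_offsets.assign(rows + 1, 0);
    for (std::size_t r : coo_rows) ++m.row_offsets[r + 1];   // entries per row
    for (std::size_t r = 0; r < rows; ++r)                   // prefix sum
        m.row_offsets[r + 1] += m.row_offsets[r];
    m.col_indices.resize(coo_vals.size());
    m.values.resize(coo_vals.size());
    // next[r] is the next free slot in row r.
    std::vector<std::size_t> next(m.row_offsets.begin(), m.row_offsets.end() - 1);
    for (std::size_t e = 0; e < coo_vals.size(); ++e) {
        const std::size_t dst = next[coo_rows[e]]++;
        m.col_indices[dst] = coo_cols[e];
        m.values[dst] = coo_vals[e];
    }
    return m;
}

// y = A * x for a CSR matrix A and a dense vector x.
std::vector<float> csr_spmv(const CsrMatrix& A, const std::vector<float>& x)
{
    std::vector<float> y(A.rows, 0.0f);
    for (std::size_t r = 0; r < A.rows; ++r)
        for (std::size_t e = A.row_offsets[r]; e < A.row_offsets[r + 1]; ++e)
            y[r] += A.values[e] * x[A.col_indices[e]];
    return y;
}
```

CSR makes the row-wise traversal in csr_spmv contiguous in memory, which is one reason it is a common input format for SpMV kernels.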

True in spirit with the other clMath libraries, clSPARSE exports a “C” interface to allow projects to build wrappers around clSPARSE in any language they need. A great deal of thought and effort went into designing the APIs to make them less ‘cluttered’ compared to the older clMath libraries. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0 [3].

API semantic versioning¶

Good software is typically the result of iteration and feedback. clSPARSE follows the semantic versioning guidelines, and while the major version number remains ‘0’, the public API should not be considered stable. We release clSPARSE as beta software (0.y.z) early to the community to elicit feedback and comment. This comes with the expectation that with feedback, we may incorporate breaking changes to the API that might require early users to recompile, or rewrite portions of their code as we iterate on the design.

clSPARSE Wiki

The project wiki contains helpful documentation. A build primer is available, which describes how to use cmake to generate platform-specific build files.

Samples

clSPARSE contains a directory of simple OpenCL samples that demonstrate the use of the API in both C and C++. The superbuild script for clSPARSE also builds the samples as an external project, to demonstrate how an application would find and link to clSPARSE with cmake.

clSPARSE library documentation

API documentation is available at http://clmathlibraries.github.io/clSPARSE/. The samples give an excellent starting point to basic library operations.

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. Code in the /master branch is considered to be stable and new library releases are made when commits are merged into /master. Active development and pull-requests should be made to the develop branch.

License¶

clSPARSE is licensed under the Apache License, Version 2.0

Compiling for Windows

Windows® 7/8

Visual Studio 2013 and above

CMake 2.8.12 (download from Kitware)

Solution (.sln) or

Nmake makefiles

An OpenCL SDK, such as APP SDK 3.0

Compiling for Linux

GCC 4.8 and above

CMake 2.8.12 (install with distro package manager )

Unix makefiles or

KDevelop or

QT Creator

An OpenCL SDK, such as APP SDK 3.0

Compiling for Mac OSX

CMake 2.8.12 (install via brew)

Unix makefiles or

XCode

An OpenCL SDK (installed via xcode-select --install)

Bench & Test infrastructure dependencies

Googletest v1.7

Boost v1.58

Footnotes

[1]: Changed to reflect CppCoreGuidelines: F.21

[2]: Changed to reflect CppCoreGuidelines: NL.8

[3]: OpenCL 2.0 support is not yet fully implemented; only the interfaces have been designed

clRNG¶

For the GitHub repository, see clRNG.

A library for uniform random number generation in OpenCL.

Streams of random numbers act as virtual random number generators. They can be created on the host computer in unlimited numbers, and then used either on the host or on computing devices by work items to generate random numbers. Each stream also has equally-spaced substreams, which are occasionally useful. The API is currently implemented for four different RNGs, namely the MRG31k3p, MRG32k3a, LFSR113 and Philox-4×32-10 generators.

What’s New¶

Libraries related to clRNG, for probability distributions and quasi-Monte Carlo methods, are available:

clProbDist

clQMC

Releases

The first public version of clRNG is v1.0.0 beta. Please go to releases for downloads.

Building¶

Install the runtime dependency:

An OpenCL SDK, such as APP SDK.

Install the build dependencies:

The CMake cross-platform build system. Visual Studio users can use CMake Tools for Visual Studio.

A recent C compiler, such as GCC 4.9 or Visual Studio 2013.

Get the clRNG source code.

Configure the project using CMake (to generate standard makefiles) or CMake Tools for Visual Studio (to generate solution and project files).

Build the project.

Install the project (by default, the library will be installed in the package directory under the build directory).

Point the environment variable CLRNG_ROOT to the installation directory, i.e., the directory under which include/clRNG can be found. This step is optional if the library is installed under /usr, which is the default.

In order to execute the example programs (under the bin subdirectory of the installation directory) or to link clRNG into other software, the dynamic linker must be informed where to find the clRNG shared library. The name and location of the shared library generally depend on the platform.

Optionally run the tests.

Example Instructions for Linux¶

On a 64-bit Linux platform, steps 3 through 9 from above, executed in a Bash-compatible shell, could consist of:

git clone https://github.com/clMathLibraries/clRNG.git
mkdir clRNG.build; cd clRNG.build; cmake ../clRNG/src
make
make install
export CLRNG_ROOT=$PWD/package
export LD_LIBRARY_PATH=$CLRNG_ROOT/lib64:$LD_LIBRARY_PATH
$CLRNG_ROOT/bin/CTest

Examples

Examples can be found in src/client. The compiled client program examples can be found under the bin subdirectory of the installation package ($CLRNG_ROOT/bin under Linux). Note that the examples expect an OpenCL GPU device to be available.

Simple example

The simple example below shows how to use clRNG to generate random numbers by directly using device side headers (.clh) in your OpenCL kernel.

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* getenv, malloc, free */
#include <string.h>  /* strcpy, strcat */

#include "clRNG/clRNG.h"
#include "clRNG/mrg31k3p.h"

int main( void )
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_program program = 0;
    cl_kernel kernel = 0;
    cl_event event = 0;
    cl_mem bufIn, bufOut;
    float *out;
    char *clrng_root;
    char include_str[1024];
    char build_log[4096];
    size_t i = 0;
    size_t numWorkItems = 64;
    clrngMrg31k3pStream *streams = 0;
    size_t streamBufferSize = 0;
    size_t kernelLines = 0;

    /* Sample kernel that calls clRNG device-side interfaces to generate random numbers */
    const char *kernelSrc[] = {
    "    #define CLRNG_SINGLE_PRECISION                                   \n",
    "    #include <clRNG/mrg31k3p.clh>                                    \n",
    "                                                                     \n",
    "    __kernel void example(__global clrngMrg31k3pHostStream *streams, \n",
    "                          __global float *out)                       \n",
    "    {                                                                \n",
    "        int gid = get_global_id(0);                                  \n",
    "                                                                     \n",
    "        clrngMrg31k3pStream workItemStream;                          \n",
    "        clrngMrg31k3pCopyOverStreamsFromGlobal(1, &workItemStream,   \n",
    "                                                     &streams[gid]); \n",
    "                                                                     \n",
    "        out[gid] = clrngMrg31k3pRandomU01(&workItemStream);          \n",
    "    }                                                                \n",
    "                                                                     \n",
    };

  /* Setup OpenCL environment. */
  err = clGetPlatformIDs( 1, &platform, NULL );
  err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

  props[1] = (cl_context_properties)platform;
  ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
  queue = clCreateCommandQueue( ctx, device, 0, &err );

  /* Make sure CLRNG_ROOT is specified to get library path */
  clrng_root = getenv("CLRNG_ROOT");
  if(clrng_root == NULL)
  {
      /* Without CLRNG_ROOT the include path cannot be built, so stop here */
      printf("\nSpecify environment variable CLRNG_ROOT as described\n");
      return 1;
  }
  strcpy(include_str, "-I ");
  strcat(include_str, clrng_root);
  strcat(include_str, "/include");

  /* Create sample kernel */
  kernelLines = sizeof(kernelSrc) / sizeof(kernelSrc[0]);
  program = clCreateProgramWithSource(ctx, kernelLines, kernelSrc, NULL, &err);
  err = clBuildProgram(program, 1, &device, include_str, NULL, NULL);
  if(err != CL_SUCCESS)
  {
      printf("\nclBuildProgram has failed\n");
      clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 4096, build_log, NULL);
      printf("%s", build_log);
  }
  kernel = clCreateKernel(program, "example", &err);

  /* Create streams */
  streams = clrngMrg31k3pCreateStreams(NULL, numWorkItems, &streamBufferSize, (clrngStatus *)&err);

  /* Create buffers for the kernel */
  bufIn = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, streamBufferSize, streams, &err);
  bufOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY, numWorkItems * sizeof(cl_float), NULL, &err);

  /* Setup the kernel */
  err = clSetKernelArg(kernel, 0, sizeof(bufIn),  &bufIn);
  err = clSetKernelArg(kernel, 1, sizeof(bufOut), &bufOut);

  /* Execute the kernel and read back results */
  err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &numWorkItems, NULL, 0, NULL, &event);
  err = clWaitForEvents(1, &event);
  out = (float *)malloc(numWorkItems * sizeof(out[0]));
  err = clEnqueueReadBuffer(queue, bufOut, CL_TRUE, 0, numWorkItems * sizeof(out[0]), out, 0, NULL, NULL);

  /* Release allocated resources */
  clReleaseEvent(event);
  free(out);
  clReleaseMemObject(bufIn);
  clReleaseMemObject(bufOut);

  clReleaseKernel(kernel);
  clReleaseProgram(program);

  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);

  return 0;
}

Building the documentation manually¶

The documentation can be generated by running make from within the doc directory. This requires Doxygen to be installed.

hcFFT¶

Installation¶

Follow these steps to use the library:

ROCm 2.4 kernel, driver and compiler installation (if not already done)

Library installation.

ROCM 2.4 Installation

To learn more about ROCm, refer to https://github.com/RadeonOpenCompute/ROCm/blob/master/README.md

a. Installing Debian ROCM repositories

Before proceeding, make sure to completely uninstall any pre-release ROCm packages.

Refer to https://github.com/RadeonOpenCompute/ROCm#removing-pre-release-packages for instructions on removing pre-release ROCm packages.

The steps to install the rocm package are:

wget -qO - http://packages.amd.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -

sudo sh -c 'echo deb [arch=amd64] http://packages.amd.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'

sudo apt-get update

sudo apt-get install rocm

Then, make the ROCm kernel your default kernel. If using grub2 as your bootloader, you can edit the GRUB_DEFAULT variable in the following file:

sudo vi /etc/default/grub

sudo update-grub

Then reboot the system.

b. Verifying the Installation

After the reboot, you can verify that the ROCm stack installed successfully by running the HSA vector_copy sample application:

cd /opt/rocm/hsa/sample

make

./vector_copy

Library Installation

a. Install using Prebuilt debian

wget https://github.com/ROCmSoftwarePlatform/hcFFT/blob/master/pre-builds/hcfft-master-87a37f5-Linux.deb
sudo dpkg -i hcfft-master-87a37f5-Linux.deb

b. Build debian from source

git clone https://github.com/ROCmSoftwarePlatform/hcFFT.git && cd hcFFT

chmod +x build.sh && ./build.sh

Running build.sh builds the library and generates a Debian package under the build directory.

c. Install CPU based FFTW3 library

sudo apt-get install fftw3 fftw3-dev pkg-config

Introduction¶

This repository hosts the HCC-based FFT library, which targets GPU acceleration of FFT routines on AMD devices. To learn about the features of the HCC compiler, refer here.

The following are the sub-routines that are implemented

R2C : Transforms real-valued input in the time domain to complex-valued output in the frequency domain.

C2R : Transforms complex-valued input in the frequency domain to real-valued output in the time domain.

C2C : Transforms complex-valued input in the frequency domain to complex-valued output in the time domain, or vice versa.
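The relationship between the real and complex buffer sizes used by R2C and C2R can be illustrated with a naive DFT. The sketch below is plain Python and does not use hcFFT; the function names naive_dft and r2c are illustrative only. It shows that the spectrum of a real input of length N is conjugate-symmetric, which is why an R2C transform only needs to store N/2 + 1 complex values.

```python
import cmath

def naive_dft(x):
    """Forward DFT of a sequence in O(N^2); for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def r2c(x):
    """Keep only the non-redundant half of the spectrum of a real signal."""
    return naive_dft(x)[: len(x) // 2 + 1]

signal = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5]
n = len(signal)
full = naive_dft(signal)
half = r2c(signal)

# For real input, X[N-k] is the complex conjugate of X[k].
symmetric = all(abs(full[n - k] - full[k].conjugate()) < 1e-9 for k in range(1, n))
print(len(half), symmetric)
```

For N = 8 the retained half has 5 elements, matching the (N / 2) + 1 sizing used for complex buffers in R2C transforms.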

Key Features¶

Supports 1D, 2D and 3D Fast Fourier Transforms

Supports R2C, C2R, C2C, D2Z, Z2D and Z2Z transforms

Supports out-of-place data storage

Ability to choose the desired target accelerator

Single and Double precision

Prerequisites

This section lists the known set of hardware and software requirements to build this library

Hardware

CPU: mainstream brand; preferably an Intel Haswell-based CPU with 4 or more cores

System memory >= 4GB (preferably > 10GB for NN applications over multiple GPUs)

Hard drive > 200GB (preferably an SSD or NVMe drive for NN applications over multiple GPUs)

Minimum GPU Memory (Global) > 2GB

GPU cards supported

dGPU: AMD R9 Fury X, R9 Fury, R9 Nano

APU: AMD Kaveri or Carrizo

AMD Driver and Runtime

Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime

System software

Ubuntu 14.04 trusty and later

GCC 4.6 and later

CPP 4.6 and later (comes with the GCC package)

python 2.7 and later

python-pip

BeautifulSoup4 (installed using python-pip)

HCC 0.9 from here

Tools and Misc

git 1.9 and later

cmake 2.6 and later (2.6 and 2.8 are tested)

firewall off

root privilege or user account in sudo group

Ubuntu Packages

libc6-dev-i386

liblapack-dev

graphicsmagick

libblas-dev

Examples¶

FFT 1D R2C example:

file: hcfft_1D_R2C.cpp

#include <cassert>   // assert
#include <cstdlib>   // atoi, calloc, free, rand, srand
#include <iostream>
#include <vector>    // std::vector
#include "hcfft.h"
#include "hc_am.hpp"
#include "hcfftlib.h"

int main(int argc, char* argv[]) {
  int N = argc > 1 ? atoi(argv[1]) : 1024;
  // HCFFT work flow
  hcfftHandle plan;
  hcfftResult status  = hcfftPlan1d(&plan, N, HCFFT_R2C);
  assert(status == HCFFT_SUCCESS);
  int Rsize = N;
  int Csize = (N / 2) + 1;
  hcfftReal* input = (hcfftReal*)calloc(Rsize, sizeof(hcfftReal));
  int seed = 123456789;
  srand(seed);

  // Populate the input
  for(int i = 0; i < Rsize ; i++) {
    input[i] = rand();
  }

  hcfftComplex* output = (hcfftComplex*)calloc(Csize, sizeof(hcfftComplex));

  std::vector<hc::accelerator> accs = hc::accelerator::get_all();
  // accs[1] is used below, so at least two accelerators must be present
  assert(accs.size() > 1 && "Fewer than two accelerators found!");
  hc::accelerator_view accl_view = accs[1].get_default_view();

  hcfftReal* idata = hc::am_alloc(Rsize * sizeof(hcfftReal), accs[1], 0);
  accl_view.copy(input, idata, sizeof(hcfftReal) * Rsize);
  hcfftComplex* odata = hc::am_alloc(Csize * sizeof(hcfftComplex), accs[1], 0);
  accl_view.copy(output,  odata, sizeof(hcfftComplex) * Csize);
  status = hcfftExecR2C(plan, idata, odata);
  assert(status == HCFFT_SUCCESS);
  accl_view.copy(odata, output, sizeof(hcfftComplex) * Csize);
  status =  hcfftDestroy(plan);
  assert(status == HCFFT_SUCCESS);
  free(input);
  free(output);
  hc::am_free(idata);
  hc::am_free(odata);
}

Compiling the example code:

Assuming the library and compiler have been installed as described in the Installation section:

/opt/rocm/hcc/bin/clang++ `/opt/rocm/hcc/bin/hcc-config --cxxflags --ldflags` -lhc_am -lhcfft -I../lib/include -L../build/lib/src hcfft_1D_R2C.cpp

Tested Environments¶

This section lists the tested combinations of hardware and system software.

Driver versions

Boltzmann Early Release Driver + dGPU

Radeon Open Compute Kernel (ROCK) driver : https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

HSA runtime API and runtime for Boltzmann: https://github.com/RadeonOpenCompute/ROCR-Runtime

Traditional HSA driver + APU (Kaveri)

GPU Cards

Radeon R9 Nano

Radeon R9 Fury X

Radeon R9 Fury

Kaveri and Carrizo APU

Server System

Supermicro SYS 2028GR-THT 6 R9 NANO

Supermicro SYS-1028GQ-TRT 4 R9 NANO

Supermicro SYS-7048GR-TR Tower 4 R9 NANO

Tensile¶

Introduction¶

Tensile is a tool for creating a benchmark-driven backend library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on an AMD GPU.

Overview for creating a custom TensileLib backend library for your application:

Install the PyYAML and cmake dependencies (mandatory), then git clone Tensile and cd into it.
Create a benchmark config.yaml file in ./Tensile/Configs/
Run the benchmark. When it finishes, Tensile writes 4 directories: 1 and 2 hold the benchmarking data, while 3 and 4 hold the summarized results from the viewpoint of a library such as rocBLAS.

1_BenchmarkProblems: has all the problems descriptions and executables generated during benchmarking, where you can re-launch exe to reproduce results.

2_BenchmarkData: has the raw performance results.

3_LibraryLogic: has optimal kernel configurations yaml file and Winner*.csv. Usually rocBLAS takes the yaml files from this folder.

4_LibraryClient: has a client exe, so you can launch from a library viewpoint.
Add the Tensile library to your application’s CMake target. The Tensile library will be written, compiled and linked to your application at application-compile-time.
- GPU kernels, written in HIP, OpenCL, or AMD GCN assembly.
- Solution classes which enqueue the kernels.
- APIs which call the fastest solution for a problem.

Quick Example (Ubuntu):¶

sudo apt-get install python-yaml
mkdir Tensile
cd Tensile
git clone https://github.com/ROCmSoftwarePlatform/Tensile repo
cd repo
git checkout master
mkdir build
cd build
python ../Tensile/Tensile.py ../Tensile/Configs/test_sgemm.yaml ./

After about 10 minutes of benchmarking, Tensile will print out the path to the client you can run.

./4_LibraryClient/build/client -h
./4_LibraryClient/build/client --sizes 5760 5760 1 5760

Benchmark Config example¶

Tensile uses an incremental and “programmable” benchmarking protocol.

Example Benchmark config.yaml as input file to Tensile¶

GlobalParameters:
  PrintLevel: 1
  ForceRedoBenchmarkProblems: False
  ForceRedoLibraryLogic: True
  ForceRedoLibraryClient: True
  CMakeBuildType: Release
  EnqueuesPerSync: 1
  SyncsPerBenchmark: 1
  LibraryPrintDebug: False
  NumElementsToValidate: 128
  ValidationMaxToPrint: 16
  ValidationPrintValids: False
  ShortNames: False
  MergeFiles: True
  PlatformIdx: 0
  DeviceIdx: 0
  DataInitTypeAB: 0

BenchmarkProblems:
  - # sgemm NN
    - # ProblemType
      OperationType: GEMM
      DataType: s
      TransposeA: False
      TransposeB: False
      UseBeta: True
      Batched: True

    - # BenchmarkProblemSizeGroup
      InitialSolutionParameters:
      BenchmarkCommonParameters:
        - ProblemSizes:
          - Range: [ [5760], 0, [1], 0 ]
        - LoopDoWhile: [False]
        - NumLoadsCoalescedA: [-1]
        - NumLoadsCoalescedB: [1]
        - WorkGroupMapping: [1]
      ForkParameters:
        - ThreadTile:
          - [ 8, 8 ]
          - [ 4, 8 ]
          - [ 4, 4 ]
        - WorkGroup:
          - [  8, 16,  1 ]
          - [ 16, 16,  1 ]
        - LoopTail: [False, True]
        - EdgeType: ["None", "Branch", "ShiftPtr"]
        - DepthU: [ 8, 16]
        - VectorWidth: [1, 2, 4]
      BenchmarkForkParameters:
      JoinParameters:
        - MacroTile
      BenchmarkJoinParameters:
      BenchmarkFinalParameters:
        - ProblemSizes:
          - Range: [ [5760], 0, [1], 0 ]

LibraryLogic:

LibraryClient:

Structure of config.yaml¶

The top-level data structure has the keys GlobalParameters, BenchmarkProblems, LibraryLogic and LibraryClient.

GlobalParameters contains a dictionary storing global parameters used for all parts of the benchmarking.

BenchmarkProblems contains a list of dictionaries representing the benchmarks to conduct; each element, i.e. dictionary, in the list is for benchmarking a single ProblemType. The keys for these dictionaries are ProblemType, InitialSolutionParameters, BenchmarkCommonParameters, ForkParameters, BenchmarkForkParameters, JoinParameters, BenchmarkJoinParameters and BenchmarkFinalParameters. See Benchmark Protocol for more information on these steps.

LibraryLogic contains a dictionary storing parameters for analyzing the benchmark data and designing how the backend library will select which Solution for certain ProblemSizes.

LibraryClient contains a dictionary storing parameters for actually creating the library and creating a client which calls into the library.
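The structure described above can be checked mechanically before launching a long benchmark. The following is a minimal sketch, not part of Tensile: the function unknown_keys is hypothetical, the top-level key names are taken from the example config shown earlier, and the config is written as a Python dictionary in the shape that a YAML loader would produce.

```python
TOP_LEVEL_KEYS = ["GlobalParameters", "BenchmarkProblems", "LibraryLogic", "LibraryClient"]

BENCHMARK_STEPS = [
    "InitialSolutionParameters", "BenchmarkCommonParameters", "ForkParameters",
    "BenchmarkForkParameters", "JoinParameters", "BenchmarkJoinParameters",
    "BenchmarkFinalParameters",
]

def unknown_keys(config):
    """Return keys that are not part of the structure described above."""
    problems = [key for key in config if key not in TOP_LEVEL_KEYS]
    for problem in config.get("BenchmarkProblems") or []:
        # problem[0] is the ProblemType dictionary; the rest are size groups.
        for group in problem[1:]:
            problems.extend(key for key in group if key not in BENCHMARK_STEPS)
    return problems

config = {
    "GlobalParameters": {"PrintLevel": 1},
    "BenchmarkProblems": [
        [
            {"OperationType": "GEMM", "DataType": "s"},
            {"ForkParameters": [{"DepthU": [8, 16]}], "JoinParamters": ["MacroTile"]},
        ]
    ],
    "LibraryLogic": None,
    "LibraryClient": None,
}

print(unknown_keys(config))  # reports the misspelled "JoinParamters"
```

Because Tensile fills in defaults for anything it does not find, a misspelled step name may otherwise go unnoticed, which is the motivation for a check of this kind.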

Global Parameters¶

Name: Prefix to add to API function names; typically name of device.
MinimumRequiredVersion: Which version of Tensile is required to interpret this yaml file
RuntimeLanguage: Use HIP or OpenCL runtime.
KernelLanguage: For OpenCL runtime, kernel language must be set to OpenCL. For HIP runtime, kernel language can be set to HIP or assembly (gfx803, gfx900).
PrintLevel: 0=Tensile prints nothing, 1=prints some, 2=prints a lot.
ForceRedoBenchmarkProblems: False means don’t redo a benchmark phase if results for it already exist.
ForceRedoLibraryLogic: False means don’t re-generate library logic if it already exists.
ForceRedoLibraryClient: False means don’t re-generate library client if it already exists.
CMakeBuildType: Release or Debug
EnqueuesPerSync: Num enqueues before syncing the queue.
SyncsPerBenchmark: Num queue syncs for each problem size.
LibraryPrintDebug: True means Tensile solutions will print kernel enqueue info to stdout
NumElementsToValidate: Number of elements to validate; 0 means no validation.
ValidationMaxToPrint: How many invalid results to print.
ValidationPrintValids: True means print validation comparisons that are valid, not just invalids.
ShortNames: Convert long kernel, solution and files names to short serial ids.
MergeFiles: False means write each solution and kernel to its own file.
PlatformIdx: OpenCL platform id.
DeviceIdx: OpenCL or HIP device id.
DataInitType[AB,C]: Initialize validation data with 0=0’s, 1=1’s, 2=serial, 3=random.
KernelTime: Use kernel time reported from runtime rather than api times from cpu clocks to compare kernel performance.

The exhaustive list of global parameters and their defaults is stored in Common.py.

Problem Type Parameters¶

OperationType: GEMM or TensorContraction.
DataType: s, d, c, z, h
UseBeta: False means library/solutions/kernel won’t accept a beta parameter; thus beta=0.
UseInitialStrides: False means data is contiguous in memory.
HighPrecisionAccumulate: For tmpC += a*b, use twice the precision for tmpC as for DataType. Not yet implemented.
ComplexConjugateA: True or False; ignored for real precision.
ComplexConjugateB: True or False; ignored for real precision.

For OperationType=GEMM only:

TransposeA: True or False.
TransposeB: True or False.
Batched: True (False has been deprecated).

For OperationType=TensorContraction only (showing batched gemm NT: C[ijk] = Sum[l] A[ilk] * B[jlk]):

IndexAssignmentsA: [0, 3, 2]
IndexAssignmentsB: [1, 3, 2]
NumDimensionsC: 3.

Solution / Kernel Parameters¶

See: Kernel Parameters.

Defaults¶

Because of the flexibility and complexity of the benchmarking process and, therefore, of the config.yaml files, Tensile has a default value for every parameter. If you neglect to put LoopUnroll anywhere in your benchmark, rather than crashing or complaining, Tensile will put the default LoopUnroll options into the default phase (common, fork, join…). This guarantees ease of use and, more importantly, backward compatibility; every time we add a new possible solution parameter, you don’t necessarily need to update your configs; we’ll have a default figured out for you.

However, this may cause some confusion. If your config forks 2 parameters, but you see that 3 were forked during benchmarking, that’s because you didn’t specify the 3rd parameter anywhere, so Tensile put it in its default phase, which was forking (for example). Also, specifying ForkParameters: and leaving it empty isn’t the same as leaving ForkParameters out of your config. If you leave ForkParameters out of your config, Tensile will add a ForkParameters step and put the default parameters into it (unless you put all the parameters elsewhere), but if you specify ForkParameters and leave it empty, then you won’t fork anything.

Therefore, it is safest to specify all parameters in your config.yaml files; that way you’ll guarantee the behavior you want. See /Tensile/Common.py for the current list of parameters.

Benchmark Protocol¶

Old Benchmark Architecture was Intractable¶

The benchmarking strategy from version 1 was vanilla flavored brute force: (8 WorkGroups)* (12 ThreadTiles)* (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)* (5 BranchTypes)* …*(1024 ProblemSizes)=23,592,960 is a multiplicative series which grows very quickly. Adding one more boolean parameter doubles the number of kernel enqueues of the benchmark.

Incremental Benchmark is Faster¶

Tensile version 2 allows the user to manually interrupt the multiplicative series with “additions” instead of “multiplies”, i.e., (8 WorkGroups)* (12 ThreadTiles)+ (4 NumLoadsCoalescedAs)* (4 NumLoadsCoalescedBs)* (3 LoopUnrolls)+ (5 BranchTypes)* …+(1024 ProblemSizes)=1,173 is a dramatically smaller number of enqueues. Now, adding one more boolean parameter may only add on 2 more enqueues.
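The difference between the two strategies can be checked with a few lines of arithmetic. The sketch below uses only the option counts quoted in the text and treats the elided factors ("…") as 1, which is an assumption; the variable names are illustrative.

```python
from math import prod

# Option counts quoted in the text; elided factors are taken as 1.
work_groups, thread_tiles = 8, 12
nlca, nlcb, loop_unrolls = 4, 4, 3
branch_types, problem_sizes = 5, 1024

# Version 1: every combination of every parameter is enqueued.
brute_force = prod([work_groups, thread_tiles, nlca, nlcb,
                    loop_unrolls, branch_types, problem_sizes])

# Version 2: groups of parameters are benchmarked one after another.
incremental = (work_groups * thread_tiles
               + nlca * nlcb * loop_unrolls
               + branch_types
               + problem_sizes)

# Effect of one extra boolean parameter in each scheme.
brute_force_plus_bool = brute_force * 2
incremental_plus_bool = incremental + 2

print(brute_force, incremental)
```

Under these assumptions the brute-force product is 23,592,960, while the listed additive terms sum to 1,173.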

Phases of Benchmark¶

To make Tensile’s programmability more manageable for the user and developer, the benchmarking protocol has been split up into several steps encoded in a config.yaml file. The sections below reference the following config.yaml. Note that this config.yaml has been created as a simple illustration and does not represent an actual good benchmark protocol. See the configs included in the repository (/Tensile/Configs) for examples of good benchmarking configs.

BenchmarkProblems:
  - # sgemm
    - # Problem Type
      OperationType: GEMM
      Batched: True
    - # Benchmark Size-Group
      InitialSolutionParameters:
        - WorkGroup: [ [ 16, 16, 1 ] ]
        - NumLoadsCoalescedA: [ 1 ]
        - NumLoadsCoalescedB: [ 1 ]
        - ThreadTile: [ [ 4, 4 ] ]

      BenchmarkCommonParameters:
        - ProblemSizes:
          - Range: [ [512], [512], [1], [512] ]
        - EdgeType: ["Branch", "ShiftPtr"]
          PrefetchGlobalRead: [False, True]

      ForkParameters:
        - WorkGroup: [ [8, 32, 1], [16, 16, 1], [32, 8, 1] ]
          ThreadTile: [ [2, 8], [4, 4], [8, 2] ]

      BenchmarkForkParameters:
        - ProblemSizes:
          - Exact: [ 2880, 2880, 1, 2880 ]
        - NumLoadsCoalescedA: [ 1, 2, 4, 8 ]
        - NumLoadsCoalescedB: [ 1, 2, 4, 8 ]

      JoinParameters:
        - MacroTile

      BenchmarkJoinParameters:
        - LoopUnroll: [8, 16]

      BenchmarkFinalParameters:
        - ProblemSizes:
          - Range: [ [16, 128], [16, 128], [1], [256] ]

Initial Solution Parameters¶

A Solution is comprised of ~20 parameters, and all are needed to create a kernel. Therefore, during the first benchmark which determines which WorkGroupShape is fastest, what are the other 19 solution parameters which are used to describe the kernels that we benchmark? That’s what InitialSolutionParameters are for. The solution used for benchmarking WorkGroupShape will use the parameters from InitialSolutionParameters. The user must choose good default solution parameters in order to correctly identify subsequent optimal parameters.

Problem Sizes¶

Each step of the benchmark can override what problem sizes will be benchmarked. A ProblemSizes entry of type Range is a list whose length is the number of indices in the ProblemType. A GEMM ProblemSizes must have 3 elements while a batched-GEMM ProblemSizes must have 4 elements. So, for a ProblemType of C[ij] = Sum[k] A[ik]*B[jk], the ProblemSizes elements represent [SizeI, SizeJ, SizeK]. For each index, there are 5 ways of specifying the sizes of that index:

[1968]

Benchmark only size 1968; n = 1.

[16, 1968]

Benchmark sizes 16 to 1968 using the default step size (=16); n = 123.

[16, 32, 1968]

Benchmark sizes 16 to 1968 using a step size of 32; n = 62.

[64, 32, 16, 1968]

Benchmark sizes from 64 to 1968 with an initial step size of 32, increasing the step size by 16 each iteration.

This causes fewer sizes to be benchmarked when the sizes are large, and more benchmarks where the sizes are small; this is typically desired behavior.

n = 15 (64, 96, 144, 208, 288, 384, 496, 624, 768, 928, 1104, 1296, 1504, 1728, 1968). The stride at the beginning is 32, but the stride at the end is 240.

0

The size of this index is just whatever size index 0 is. For a 3-dimensional ProblemType, this allows benchmarking only a 2-dimensional or 1-dimensional slice of problem sizes.

Here are a few examples of valid ProblemSizes for 3D GEMMs:

Range: [ [16, 128], [16, 128], [16, 128] ] # n = 512
Range: [ [16, 128], 0, 0] # n = 8
Range: [ [16, 16, 16, 5760], 0, [1024, 1024, 4096] ] # n = 108
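The counting rules for Range entries can be made concrete with a small expansion routine. This is a sketch and not Tensile's actual parser: the function names are hypothetical, the upper bound is assumed to be inclusive, and the default step is taken to be 16 as stated above. An integer entry mirrors another index and therefore adds no new combinations.

```python
from math import prod

def expand_index(spec):
    """Expand one per-index size spec into a list of sizes (upper bound inclusive)."""
    if len(spec) == 1:
        return list(spec)
    if len(spec) == 2:
        start, stop = spec
        step, step_inc = 16, 0          # default step size
    elif len(spec) == 3:
        start, step, stop = spec
        step_inc = 0
    else:
        start, step, step_inc, stop = spec
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size += step
        step += step_inc                # the step itself grows each iteration
    return sizes

def count_problem_sizes(range_spec):
    """Number of problem sizes in a Range; integer entries mirror another index."""
    return prod(len(expand_index(s)) for s in range_spec if isinstance(s, list))

print(count_problem_sizes([[16, 128], [16, 128], [16, 128]]))
print(count_problem_sizes([[16, 128], 0, 0]))
print(count_problem_sizes([[16, 16, 16, 5760], 0, [1024, 1024, 4096]]))
```

Under these assumptions the three example ranges expand to 512, 8 and 108 problem sizes respectively.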

Benchmark Common Parameters¶

During this first phase of benchmarking, we examine parameters which will be the same for all solutions for this ProblemType. During each step of benchmarking, there is only 1 winner. In the above example we are benchmarking the dictionary {EdgeType: [Branch, ShiftPtr], PrefetchGlobalRead: [False, True]}; therefore, this benchmark step generates 4 solution candidates, and the winner will be the fastest EdgeType/PrefetchGlobalRead combination. Assuming the winner is ET=SP and PGR=T, then all solutions for this ProblemType will have ET=SP and PGR=T. Also, once a parameter has been determined, all subsequent benchmarking steps will use this determined parameter rather than pulling values from InitialSolutionParameters. Because the common parameters will apply to all kernels, they are typically the parameters which are compiler-dependent or hardware-dependent rather than being tile-dependent.
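The way one benchmark step turns a dictionary of options into candidates and a single winner can be sketched as follows. The timings are made up for illustration and the code is not taken from Tensile; it only mirrors the example dictionary above.

```python
from itertools import product

common = {
    "EdgeType": ["Branch", "ShiftPtr"],
    "PrefetchGlobalRead": [False, True],
}

# One candidate per combination of the listed values: 2 x 2 = 4.
candidates = [dict(zip(common, values)) for values in product(*common.values())]

# Hypothetical measured times in milliseconds, one per candidate (made up).
measured_ms = [4.1, 3.7, 3.9, 3.2]

# A single winner is kept; its values are fixed for all later steps.
best = min(range(len(candidates)), key=measured_ms.__getitem__)
winner = candidates[best]
print(len(candidates), winner)
```

With these invented timings the winner is EdgeType=ShiftPtr with PrefetchGlobalRead=True, the same outcome the text assumes.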

Fork Parameters¶

If we continued to determine every parameter in the above manner, we’d end up with a single fastest solution for the specified ProblemSizes; we usually desire multiple different solutions with varying parameters which may be fastest for different groups of ProblemSizes. One simple example of this is that small tile sizes are fastest for small problem sizes, and large tiles are fastest for large problem sizes.

Therefore, we allow “forking” parameters; this means keeping multiple winners after each benchmark step. In the above example we fork {WorkGroup: […], ThreadTile: […]}. This means that in subsequent benchmarking steps, rather than having one winning parameter, we’ll have one winning parameter per fork permutation; with 3 WorkGroup values and 3 ThreadTile values, we’ll have 9 winners.

Benchmark Fork Parameters¶

When we benchmark the fork parameters, we retain one winner per permutation. Therefore, we first determine the fastest NumLoadsCoalescedA for each of the WG,TT permutations, then we determine the fastest NumLoadsCoalescedB for each permutation.

Join Parameters¶

After determining the fastest parameters for all the forked solution permutations, we have the option of reducing the number of winning solutions. When a parameter is listed in the JoinParameters section, that means that of the kept winning solutions, each will have a different value for that parameter. Listing more parameters to join results in more winners being kept, while having a JoinParameters section with no parameters listed results in only 1 fastest solution.

In our example we join over the MacroTile (work-group x thread-tile). After forking tiles, there were 9 solutions that we kept. After joining MacroTile, we’ll only keep five: 16x256, 32x128, 64x64, 128x32 and 256x16. The solutions that are kept are based on their performance during the last BenchmarkForkParameters benchmark, or, if there weren’t any, JoinParameters will conduct a benchmark of all solution candidates then choose the fastest.
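The fork and join counts for the example config can be reproduced directly. In this sketch the WorkGroup and ThreadTile values come from the example config, the MacroTile is computed as the element-wise product of the first two WorkGroup dimensions with the ThreadTile, and the performance numbers are invented purely to show how one fork is kept per tile shape.

```python
from itertools import product

work_groups = [(8, 32, 1), (16, 16, 1), (32, 8, 1)]
thread_tiles = [(2, 8), (4, 4), (8, 2)]

# Forking keeps one winner per (WorkGroup, ThreadTile) permutation.
forks = list(product(work_groups, thread_tiles))

def macro_tile(work_group, thread_tile):
    """MacroTile = WorkGroup x ThreadTile in dimensions 0 and 1."""
    return (work_group[0] * thread_tile[0], work_group[1] * thread_tile[1])

# Hypothetical GFLOP/s per fork (made-up numbers); higher is better.
perf = dict(zip(forks, [310, 420, 505, 455, 610, 440, 520, 430, 300]))

# Joining on MacroTile keeps the fastest fork for each distinct tile shape.
joined = {}
for fork in forks:
    tile = macro_tile(*fork)
    if tile not in joined or perf[fork] > perf[joined[tile]]:
        joined[tile] = fork

print(len(forks))
print(sorted(joined))
```

The nine forks collapse onto the five distinct tile shapes listed above, because several WorkGroup and ThreadTile pairs produce the same MacroTile (for example, three different pairs give 64x64).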

Benchmark Join Parameters¶

After narrowing the list of fastest solutions through joining, you can continue to benchmark parameters, keeping one winning parameter per solution permutation.

Benchmark Final Parameters¶

After all the parameter benchmarking has been completed and the final list of fastest solutions has been assembled, we can benchmark all the solutions over a large set of ProblemSizes. This benchmark represents the final output of benchmarking; it outputs a .csv file where the rows are all the problem sizes and the columns are all the solutions. This is the information which gets analysed to produce the library logic.
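A first step in analysing such a table is to find the fastest solution for each problem size. The sketch below assumes a simplified layout with one label column followed by one performance column per solution; the column names and numbers are hypothetical, and the real files produced by Tensile may be laid out differently.

```python
import csv
import io

# Hypothetical excerpt: rows are problem sizes, columns are solutions (GFLOP/s).
csv_text = """problem_size,solution_0,solution_1,solution_2
16x16x1x256,120.5,98.0,101.2
64x64x1x256,850.0,910.4,640.9
128x128x1x256,2100.3,2050.0,2480.7
"""

winners = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    size = row.pop("problem_size")
    # The remaining columns are solutions; keep the one with the highest value.
    winners[size] = max(row, key=lambda name: float(row[name]))

for size, solution in winners.items():
    print(size, solution)
```

In this invented data a different solution wins for each size, which is the situation the library logic has to encode.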

Contributing¶

We’d love your help, but…

Never check in a tab (\t); use 4 spaces.

Follow the coding style of the file you’re editing.

Make pull requests against develop branch.

Rebase your develop branch against ROCmSoftwarePlatform::Tensile::develop branch right before pull-requesting.

In your pull request, state what you tested (which OS, what drivers, what devices, which config.yaml’s) so we can ensure that your changes haven’t broken anything.

Dependencies¶

CMake¶

CMake 2.8

Python¶

(One time only)

Ubuntu: sudo apt install python2.7 python-yaml

CentOS: sudo yum install python PyYAML

Fedora: sudo dnf install python PyYAML

Compilers¶

For Tensile_BACKEND = OpenCL1.2 (untested)

Visual Studio 14 (2015). (VS 2012 may also be supported; c++11 should no longer be required by Tensile. Need to verify.)

GCC 4.8 and above

For Tensile_BACKEND = HIP

Public ROCm

Installation¶

Tensile can be installed via:

Download repo and don’t install; install PyYAML dependency manually and call python scripts manually:

git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
python Tensile/Tensile/Tensile.py your_custom_config.yaml your_benchmark_path

Install develop branch directly from repo using pip:

pip install git+https://github.com/ROCmSoftwarePlatform/Tensile.git@develop
tensile your_custom_config.yaml your_benchmark_path

Download repo and install manually: (deprecated)

git clone https://github.com/ROCmSoftwarePlatform/Tensile.git
cd Tensile
sudo python setup.py install
tensile your_custom_config.yaml your_benchmark_path

Kernel Parameters¶

Solution / Kernel Parameters¶

LoopDoWhile: True=DoWhile loop, False=While or For loop
LoopTail: Additional loop with LoopUnroll=1.
EdgeType: Branch, ShiftPtr or None
WorkGroup: [dim0, dim1, LocalSplitU]
ThreadTile: [dim0, dim1]
GlobalSplitU: Split up summation among work-groups to create more concurrency. This option launches a kernel to handle the beta scaling, then a second kernel where the writes to global memory are atomic.
PrefetchGlobalRead: True means outer loop should prefetch global data one iteration ahead.
PrefetchLocalRead: True means inner loop should prefetch lds data one iteration ahead.
WorkGroupMapping: In what order will work-groups compute C; affects caching.
LoopUnroll: How many iterations to unroll inner loop; helps loading coalesced memory.
MacroTile: Derived from WorkGroup*ThreadTile.
DepthU: Derived from LoopUnroll*SplitU.
NumLoadsCoalescedA,B: Number of loads from A in coalesced dimension.
GlobalReadCoalesceGroupA,B: True means adjacent threads map to adjacent global read elements (but, if transposing data then write to lds is scattered).
GlobalReadCoalesceVectorA,B: True means vector components map to adjacent global read elements (but, if transposing data then write to lds is scattered).
VectorWidth: Thread tile elements are contiguous for faster memory accesses. For example VW=4 means a thread will read a float4 from memory rather than 4 non-contiguous floats.
KernelLanguage: Whether kernels should be written in source code (HIP, OpenCL) or assembly (gfx803, gfx900, …).

The exhaustive list of solution parameters and their defaults is stored in Common.py.
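The two derived parameters in the list above can be illustrated with a short sketch. This is not Tensile code; the `KernelParams` struct and both function names are hypothetical, and the sketch assumes that SplitU in the DepthU formula refers to the LocalSplitU entry of WorkGroup.

```cpp
#include <array>
#include <cassert>

// Hypothetical container for the parameters used below; not a Tensile type.
struct KernelParams {
    std::array<int, 3> work_group;  // [dim0, dim1, LocalSplitU]
    std::array<int, 2> thread_tile; // [dim0, dim1]
    int loop_unroll;                // LoopUnroll
};

// MacroTile = WorkGroup * ThreadTile, taken per tile dimension.
std::array<int, 2> macro_tile(const KernelParams& p) {
    return {{p.work_group[0] * p.thread_tile[0],
             p.work_group[1] * p.thread_tile[1]}};
}

// DepthU = LoopUnroll * SplitU (assumed here to be LocalSplitU).
int depth_u(const KernelParams& p) {
    assert(p.work_group[2] > 0);
    return p.loop_unroll * p.work_group[2];
}
```

With WorkGroup [16, 16, 1], ThreadTile [4, 4] and LoopUnroll 8, this gives a 64 x 64 macro tile and a DepthU of 8.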

Kernel Parameters Affect Performance¶

The kernel parameters affect many aspects of performance. Changing a parameter may help address one performance bottleneck but worsen another. That is why searching through the parameter space is vital to discovering the fastest kernel for a given problem.

How N-Dimensional Tensor Contractions Are Mapped to Finite-Dimensional GPU Kernels¶

For a traditional GEMM, the 2-dimensional output, C[i,j], is mapped to launching a 2-dimensional grid of work groups, each of which has a 2-dimensional grid of work items; one dimension belongs to i and one dimension belongs to j. The 1-dimensional summation is represented by a single loop within the kernel body.

Special Dimensions: D0, D1 and DU¶

To handle arbitrary dimensionality, Tensile begins by determining 3 special dimensions: D0, D1 and DU.

D0 and D1 are the free indices of A and B (one belongs to A and one to B) which have the shortest strides. This allows the inner-most loops to read from A and B the fastest via coalescing. In a traditional GEMM, every matrix has a dimension with a shortest stride of 1, but Tensile doesn’t make that assumption. Of these two dimensions, D0 is the dimension which has the shortest tensor C stride which allows for fast writing.

DU represents the summation index with the shortest combined stride (stride in A + stride in B); it becomes the innermost loop, which gets “U”nrolled. This assignment is also meant to ensure fast reading in the innermost summation loop. There can be multiple summation indices (i.e. embedded loops), and DU will be iterated over in the innermost loop.
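The selection rules for D0, D1 and DU can be sketched as follows. This is a minimal illustration of the rules as stated in this section, not Tensile's implementation; the `Strides` map, the `SpecialDims` struct and `pick_special_dims` are hypothetical names, and each tensor is described only by the stride of each index letter.

```cpp
#include <cassert>
#include <map>

// Stride of each index letter within one tensor, e.g. {{'i', 1}, {'l', 64}}.
typedef std::map<char, long> Strides;

struct SpecialDims { char d0; char d1; char du; };

SpecialDims pick_special_dims(const Strides& a, const Strides& b, const Strides& c) {
    char free_a = 0, free_b = 0, du = 0; // 0 means "not found yet"
    for (const auto& kv : a) {
        const char idx = kv.first;
        const bool in_b = b.count(idx) != 0;
        const bool in_c = c.count(idx) != 0;
        if (in_c && !in_b) {
            // Free index of A: keep the one with the shortest stride in A.
            if (free_a == 0 || kv.second < a.at(free_a)) free_a = idx;
        } else if (in_b && !in_c) {
            // Summation index: keep the one with the shortest combined stride.
            const long combined = kv.second + b.at(idx);
            if (du == 0 || combined < a.at(du) + b.at(du)) du = idx;
        }
    }
    for (const auto& kv : b) {
        const char idx = kv.first;
        if (c.count(idx) != 0 && a.count(idx) == 0) {
            // Free index of B: keep the one with the shortest stride in B.
            if (free_b == 0 || kv.second < b.at(free_b)) free_b = idx;
        }
    }
    assert(free_a != 0 && free_b != 0 && du != 0);
    // Of the two candidates, D0 is the one with the shorter stride in C.
    const bool a_first = c.at(free_a) <= c.at(free_b);
    SpecialDims out;
    out.d0 = a_first ? free_a : free_b;
    out.d1 = a_first ? free_b : free_a;
    out.du = du;
    return out;
}
```

For a column-major NN GEMM with A strides {i: 1, l: 64}, B strides {l: 1, j: 32} and C strides {i: 1, j: 64}, the sketch selects D0 = i, D1 = j and DU = l.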

GPU Kernel Dimension¶

OpenCL allows for a 3-dimensional grid of work-groups, and each work-group can be a 3-dimensional grid of work-items. Tensile assigns D0 to be dimension-0 of the work-group and work-item grids; it assigns D1 to be dimension-1 of the work-group and work-item grids. All other free or batch dimensions are flattened down into the final dimension-2 of the work-group and work-item grids. Within the GPU kernel, dimension-2 is reconstituted back into whatever dimensions it represents.
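Flattening several dimensions into dimension-2 and reconstituting them inside the kernel amounts to ordinary linear-index arithmetic. The sketch below shows one way to do it on the host; the function names are hypothetical and the ordering (first listed dimension varies fastest) is an assumption, since this section does not specify the order Tensile uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flatten a multi-index into one linear index; sizes[0] varies fastest.
std::size_t flatten(const std::vector<std::size_t>& idx,
                    const std::vector<std::size_t>& sizes) {
    assert(idx.size() == sizes.size());
    std::size_t linear = 0;
    for (std::size_t d = sizes.size(); d-- > 0;) {
        assert(idx[d] < sizes[d]);
        linear = linear * sizes[d] + idx[d];
    }
    return linear;
}

// Recover the original multi-index from the linear index.
std::vector<std::size_t> unflatten(std::size_t linear,
                                   const std::vector<std::size_t>& sizes) {
    std::vector<std::size_t> idx(sizes.size());
    for (std::size_t d = 0; d < sizes.size(); ++d) {
        idx[d] = linear % sizes[d];
        linear /= sizes[d];
    }
    return idx;
}
```

For two flattened dimensions of sizes 3 and 5, the multi-index (2, 4) maps to 2 + 3 * 4 = 14, and `unflatten(14, sizes)` returns (2, 4) again.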

Languages¶

Tensile Benchmarking is Python¶

The benchmarking module, Tensile.py, is written in python. The python scripts write solution, kernels, cmake files and all other C/C++ files used for benchmarking.

Tensile Library¶

The Tensile API, Tensile.h, is confined to C89 so that it will be usable by most software. The code behind the API is allowed to be C++11.

Device Languages¶

The device languages Tensile supports for the GPU kernels are:

OpenCL 1.2
HIP
Assembly
- gfx803
- gfx900

Library Logic¶

Running the LibraryLogic phase of benchmarking analyses the benchmark data and encodes a mapping for each problem type. For each problem type, it maps problem sizes to best solution (i.e. kernel).

When you build Tensile.lib, you point the TensileCreateLibrary function to a directory where your library logic yaml files are.

Problem Nomenclature¶

Example Problems¶

Standard GEMM has 4 variants (2 free indices (i, j) and 1 summation index l)

NN (N: non-transpose): C[i,j] = Sum[l] A[i,l] * B[l,j]

NT (T: transpose): C[i,j] = Sum[l] A[i,l] * B[j,l]

TN: C[i,j] = Sum[l] A[l, i] * B[l,j]

TT: C[i,j] = Sum[l] A[l, i] * B[j, l]

C[i,j,k] = Sum[l] A[i,l,k] * B[l,j,k] (batched-GEMM; 2 free indices, 1 batched index k and 1 summation index l)

C[i,j] = Sum[k,l] A[i,k,l] * B[j,l,k] (2D summation)

C[i,j,k,l,m] = Sum[n] A[i,k,m,l,n] * B[j,k,l,n,m] (GEMM with 3 batched indices)

C[i,j,k,l,m] = Sum[n,o] A[i,k,m,o,n] * B[j,m,l,n,o] (4 free indices, 2 summation indices and 1 batched index)

C[i,j,k,l] = Sum[m,n] A[i,j,m,n,l] * B[m,n,k,j,l] (batched image convolution mapped to 7D tensor contraction)

and even crazier

Nomenclature¶

The indices describe the dimensionality of the problem being solved. A GEMM operation takes two 2-dimensional matrices as input (totaling 4 input dimensions) and contracts them along one dimension (which cancels out 2 of the dimensions), resulting in a 2-dimensional result.

Whenever an index shows up in multiple tensors, those tensors must be the same size along that dimension but they may have different strides.

There are 3 categories of indices/dimensions that Tensile deals with: free, batch and bound.

Free Indices

Free indices are the indices of tensor C which come in pairs; one of the pair shows up in tensor A while the other shows up in tensor B. In the really crazy example above, i/j/k/l are the 4 free indices of tensor C. Indices i and k come from tensor A and indices j and l come from tensor B.

Batch Indices

Batch indices are the indices of tensor C which show up in both tensor A and tensor B. For example, the difference between the GEMM example and the batched-GEMM example above is the additional index. In the batched-GEMM example, the index k is the batch index, which batches together multiple independent GEMMs.

Bound/Summation Indices

The final type of indices are called bound indices or summation indices. These indices do not show up in tensor C; they show up in the summation symbol (Sum[k]) and in tensors A and B. It is along these indices that we perform the inner products (pairwise multiply then sum).
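The three categories follow mechanically from which tensors an index appears in, as the sketch below shows. It is an illustration only: `IndexClasses` and `classify` are hypothetical names, and each tensor is represented simply by the string of its index letters.

```cpp
#include <cassert>
#include <string>

struct IndexClasses {
    std::string free;  // in C and in exactly one of A, B
    std::string batch; // in C, A and B
    std::string bound; // in A and B but not in C (summation indices)
};

// c, a and b hold the index letters of tensors C, A and B, e.g. "ij", "il", "lj".
IndexClasses classify(const std::string& c, const std::string& a,
                      const std::string& b) {
    IndexClasses out;
    for (char idx : c) {
        const bool in_a = a.find(idx) != std::string::npos;
        const bool in_b = b.find(idx) != std::string::npos;
        assert(in_a || in_b); // every index of C must come from A or B
        if (in_a && in_b) out.batch += idx;
        else out.free += idx;
    }
    for (char idx : a) {
        if (c.find(idx) == std::string::npos) {
            assert(b.find(idx) != std::string::npos); // bound indices are shared
            out.bound += idx;
        }
    }
    return out;
}
```

Applied to the example C[i,j,k,l,m] = Sum[n,o] A[i,k,m,o,n] * B[j,m,l,n,o], `classify("ijklm", "ikmon", "jmlno")` yields free indices "ijkl", batch index "m" and bound indices "on" (listed in the order they appear in A).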

Limitations¶

Problems supported by Tensile must meet the following conditions:

There must be at least one pair of free indices.

Tensile.lib¶

After running the benchmark and generating library config files, you’re ready to add Tensile.lib to your project. Tensile provides a TensileCreateLibrary function, which can be called:

set(Tensile_BACKEND "HIP")
set( Tensile_LOGIC_PATH "~/LibraryLogic" CACHE STRING "Path to Tensile logic.yaml files")
option( Tensile_MERGE_FILES "Tensile to merge kernels and solutions files?" OFF)
option( Tensile_SHORT_NAMES "Tensile to use short file/function names? Use if compiler complains they're too long." OFF)
option( Tensile_PRINT_DEBUG "Tensile to print runtime debug info?" OFF)

find_package(Tensile) # use if Tensile has been installed

TensileCreateLibrary(
  ${Tensile_LOGIC_PATH}
  ${Tensile_BACKEND}
  ${Tensile_MERGE_FILES}
  ${Tensile_SHORT_NAMES}
  ${Tensile_PRINT_DEBUG}
  Tensile_ROOT ${Tensile_ROOT} # optional; use if tensile not installed
  )
target_link_libraries( TARGET Tensile )

TODO: Where is the Tensile include directory?

Versioning¶

Tensile follows semantic versioning practices, i.e. Major.Minor.Patch, in BenchmarkConfig.yaml files, LibraryConfig.yaml files and in cmake find_package. Tensile is compatible with a “MinimumRequiredVersion” if Tensile.Major==MRV.Major and Tensile.Minor.Patch >= MRV.Minor.Patch.

Major: Tensile increments the major version if the public API changes, or if either the benchmark.yaml or library-config.yaml files change format in a non-backwards-compatible manner.
Minor: Tensile increments the minor version when new kernel, solution or benchmarking features are introduced in a backwards-compatible manner.
Patch: Bug fixes or minor improvements.
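The compatibility rule above can be written as a small predicate. This is a sketch, not Tensile code: the `Version` struct and `is_compatible` are hypothetical names, and it assumes that "Minor.Patch >= MRV.Minor.Patch" means a lexicographic comparison of the (minor, patch) pair.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical version triple; field names avoid the glibc major/minor macros.
struct Version { int major_num; int minor_num; int patch_num; };

// True if `tensile` satisfies the MinimumRequiredVersion `mrv`.
bool is_compatible(const Version& tensile, const Version& mrv) {
    assert(tensile.major_num >= 0 && mrv.major_num >= 0);
    // Major versions must match exactly; (minor, patch) is compared as a pair.
    return tensile.major_num == mrv.major_num &&
           std::make_tuple(tensile.minor_num, tensile.patch_num) >=
               std::make_tuple(mrv.minor_num, mrv.patch_num);
}
```

Under this reading, version 4.2.0 satisfies a minimum required version of 4.1.3, whereas 4.1.2 (older patch) and 5.0.0 (different major) do not.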

rocALUTION¶

Introduction¶

Overview¶

rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism, targeting modern processors and accelerators including multi/many-core CPU and GPU platforms. The main goal of this package is to provide a portable library for iterative sparse methods on state of the art hardware. rocALUTION can be seen as middle-ware between different parallel backends and application specific packages.

The major features and characteristics of the library are

Various backends
- Host - fallback backend, designed for CPUs
- GPU/HIP - accelerator backend, designed for HIP capable AMD GPUs
- OpenMP - designed for multi-core CPUs
- MPI - designed for multi-node and multi-GPU configurations
Easy to use
The syntax and structure of the library provide an easy learning curve. With the help of the examples, anyone can try out the library; no knowledge of HIP, OpenMP or MPI programming is required.
No special hardware requirements
There are no hardware requirements to install and run rocALUTION. If a GPU device and HIP is available, the library will use them.
Variety of iterative solvers
- Fixed-Point iteration - Jacobi, Gauss-Seidel, Symmetric-Gauss Seidel, SOR and SSOR
- Krylov subspace methods - CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGSTAB, Flexible CG/GMRES
- Mixed-precision defect-correction scheme
- Chebyshev iteration
- Multiple MultiGrid schemes, geometric and algebraic
Various preconditioners
- Matrix splitting - Jacobi, (Multi-colored) Gauss-Seidel, Symmetric Gauss-Seidel, SOR, SSOR
- Factorization - ILU(0), ILU(p) (based on levels), ILU(p,q) (power(q)-pattern method), Multi-Elimination ILU (nested/recursive), ILUT (based on threshold) and IC(0)
- Approximate Inverse - Chebyshev matrix-valued polynomial, SPAI, FSAI and TNS
- Diagonal-based preconditioner for Saddle-point problems
- Block-type of sub-preconditioners/solvers
- Additive Schwarz and Restricted Additive Schwarz
- Variable type preconditioners
Generic and robust design
rocALUTION is based on a generic and robust design allowing expansion in the direction of new solvers and preconditioners and support for various hardware types. Furthermore, the design of the library allows the use of all solvers as preconditioners in other solvers. For example you can easily define a CG solver with a Multi-Elimination preconditioner, where the last-block is preconditioned with another Chebyshev iteration method which is preconditioned with a multi-colored Symmetric Gauss-Seidel scheme.
Portable code and results
All code based on rocALUTION is portable and independent of HIP or OpenMP. The code will compile and run everywhere. All solvers and preconditioners are based on a single source code, which delivers portable results across all supported backends (variations are possible due to different rounding modes on the hardware). The only difference which you can see for a hardware change is the performance variation.
Support for several sparse matrix formats
Compressed Sparse Row (CSR), Modified Compressed Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), Hybrid format of ELL and COO (HYB).

The code is open-source under the MIT license and hosted here: https://github.com/ROCmSoftwarePlatform/rocALUTION

Building and Installing¶

Installing from AMD ROCm repositories¶

TODO, not yet available

Building rocALUTION from Open-Source repository¶

Download rocALUTION¶

The rocALUTION source code is available at the rocALUTION github page. Download the master branch using:

git clone -b master https://github.com/ROCmSoftwarePlatform/rocALUTION.git
cd rocALUTION

Note that if you want to contribute to rocALUTION, you will need to checkout the develop branch instead of the master branch. See rocalution_contributing for further details. Below are steps to build different packages of the library, including dependencies and clients. It is recommended to install rocALUTION using the install.sh script.

Using install.sh to build dependencies + library¶

The following table lists common uses of install.sh to build dependencies + library. Accelerator support via HIP and OpenMP will be enabled by default, whereas MPI is disabled.

Command	Description
./install.sh -h	Print help information.
./install.sh -d	Build dependencies and library in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.
./install.sh	Build library in your local directory. It is assumed dependencies are available.
./install.sh -i	Build library, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.
./install.sh --host	Build library in your local directory without HIP support. It is assumed dependencies are available.
./install.sh --mpi	Build library in your local directory with HIP and MPI support. It is assumed dependencies are available.

Using install.sh to build dependencies + library + client¶

The client contains example code, unit tests and benchmarks. Common uses of install.sh to build them are listed in the table below.

Command	Description
./install.sh -h	Print help information.
./install.sh -dc	Build dependencies, library and client in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.
./install.sh -c	Build library and client in your local directory. It is assumed dependencies are available.
./install.sh -idc	Build library, dependencies and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.
./install.sh -ic	Build library and client, then build and install rocALUTION package in /opt/rocm/rocalution. You will be prompted for sudo access. This will install for all users.

Using individual commands to build rocALUTION¶

CMake 3.5 or later is required in order to build rocALUTION.

rocALUTION can be built with cmake using the following commands:

# Create and change to build directory
mkdir -p build/release ; cd build/release

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
cmake ../.. -DSUPPORT_HIP=ON \
            -DSUPPORT_MPI=OFF \
            -DSUPPORT_OMP=ON

# Compile rocALUTION library
make -j$(nproc)

# Install rocALUTION to /opt/rocm
sudo make install

GoogleTest is required in order to build rocALUTION client.

rocALUTION with dependencies and client can be built using the following commands:

# Install googletest
mkdir -p build/release/deps ; cd build/release/deps
cmake ../../../deps
sudo make -j$(nproc) install

# Change to build directory
cd ..

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
cmake ../.. -DBUILD_CLIENTS_TESTS=ON \
            -DBUILD_CLIENTS_SAMPLES=ON

# Compile rocALUTION library
make -j$(nproc)

# Install rocALUTION to /opt/rocm
sudo make install

The compilation process produces the shared library librocalution.so and, if HIP support is enabled, librocalution_hip.so. Ensure that the library objects can be found in your library path. If you do not copy the libraries to a standard location, you can add their path to the LD_LIBRARY_PATH variable under Linux.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_rocalution>

Common build problems¶

Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/bin/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc (version mismatch). Please rebuild HIP including cmake or update HCC_HOME variable.
Solution: Download HIP from github and use hcc to build from source and then use the built HIP instead of /opt/rocm/hip.

Issue: For Carrizo - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"

Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Failed to find compatible kernel
Solution: export HCC_AMDGPU_TARGET=gfx900

Issue: Could not find a package configuration file provided by “ROCM” with any of the following names: ROCMConfig.cmake, rocm-config.cmake
Solution: Install ROCm cmake modules

Issue: Could not find a package configuration file provided by “ROCSPARSE” with any of the following names: ROCSPARSE.cmake, rocsparse-config.cmake
Solution: Install rocSPARSE

Issue: Could not find a package configuration file provided by “ROCBLAS” with any of the following names: ROCBLAS.cmake, rocblas-config.cmake
Solution: Install rocBLAS

Simple Test¶

You can test the installation by running a CG solver on a Laplace matrix. After compiling the library you can perform the CG solver test by executing

cd rocALUTION/build/release/examples

wget ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz
gzip -d gr_30_30.mtx.gz

./cg gr_30_30.mtx

For more information regarding the rocALUTION library and the corresponding API documentation, refer to rocALUTION.

API¶

This section provides details of the library API

Host Utility Functions¶

template<typename DataType>
void rocalution::allocate_host(int size, DataType **ptr)¶

Allocate buffer on the host.

allocate_host allocates a buffer on the host.

Parameters

[in] size: number of elements the buffer needs to be allocated for
[out] ptr: pointer to the position in memory where the buffer should be allocated, it is expected that *ptr == NULL

Template Parameters

DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

template<typename DataType>
void rocalution::free_host(DataType **ptr)¶

Free buffer on the host.

free_host deallocates a buffer on the host. *ptr will be set to NULL after successful deallocation.

Parameters

[inout] ptr: pointer to the position in memory where the buffer should be deallocated, it is expected that *ptr != NULL

Template Parameters

DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

template<typename DataType> void rocalution::set_to_zero_host(int size, DataType *ptr)¶

Set a host buffer to zero.

set_to_zero_host sets a host buffer to zero.

Parameters

[in] size: number of elements
[inout] ptr: pointer to the host buffer

Template Parameters

DataType: can be char, int, unsigned int, float, double, std::complex<float> or std::complex<double>.

double rocalution::rocalution_time(void)¶: Return current time in microseconds.

Backend Manager¶

int rocalution::init_rocalution(int rank = -1, int dev_per_node = 1)¶

Initialize rocALUTION platform.

init_rocalution defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. set different number of threads) and new objects are created, only the new objects will use the new configurations.

For control, the library provides the following functions

set_device_rocalution() is a unified function to select a specific device. If you have compiled the library with a backend and for this backend there are several available devices, you can use this function to select a particular one. This function has to be called before init_rocalution().
set_omp_threads_rocalution() sets the number of OpenMP threads. This function has to be called after init_rocalution().

Example

#include <rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    // ...

    stop_rocalution();

    return 0;
}

Parameters

[in] rank: specifies MPI rank when multi-node environment
[in] dev_per_node: number of accelerator devices per node, when in multi-GPU environment

int rocalution::stop_rocalution(void)¶

Shutdown rocALUTION platform.

stop_rocalution shuts down the rocALUTION platform.

void rocalution::set_device_rocalution(int dev)¶

Set the accelerator device.

set_device_rocalution lets the user select the accelerator device that is supposed to be used for the computation.

Parameters

[in] dev: accelerator device ID for computation

void rocalution::set_omp_threads_rocalution(int nthreads)¶

Set number of OpenMP threads.

The number of threads which rocALUTION will use can be set with set_omp_threads_rocalution or by the global OpenMP environment variable (for Unix-like OS this is OMP_NUM_THREADS). During the initialization phase, the library provides affinity thread-core mapping:

If the number of cores (including SMT cores) is greater than or equal to two times the number of threads, then all the threads can occupy every second core ID (e.g. 0, 2, 4, ...). This is to avoid having two threads working on the same physical core when SMT is enabled.
If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0, 1, 2, 3, ...).
If none of the above criteria is matched, then the default thread-core mapping is used (typically set by the OS).
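The three rules above can be expressed as a small decision function. This sketch only mirrors the rules as described here and is not the library's implementation; `affinity_map` is a hypothetical name, and an empty result stands for "leave the mapping to the OS".

```cpp
#include <cassert>
#include <vector>

// Returns the core ID assigned to each thread, or an empty vector when the
// default (OS-chosen) thread-core mapping applies.
std::vector<int> affinity_map(int num_cores, int num_threads) {
    assert(num_cores > 0 && num_threads > 0);
    int step = 0;
    if (num_cores >= 2 * num_threads) {
        step = 2; // every second core ID: 0, 2, 4, ...
    } else if (num_threads <= num_cores) {
        step = 1; // every core ID: 0, 1, 2, ...
    }
    std::vector<int> cores;
    for (int t = 0; step != 0 && t < num_threads; ++t) {
        cores.push_back(t * step);
    }
    return cores;
}
```

With 8 cores, 4 threads map to core IDs 0, 2, 4, 6 and 6 threads map to 0 through 5; with 4 cores and 8 threads the function returns an empty vector.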

Note

The thread-core mapping is available only for Unix-like OS.

Note

The user can disable the thread affinity by calling set_omp_affinity_rocalution(), before initializing the library (i.e. before init_rocalution()).

Parameters

[in] nthreads: number of OpenMP threads

void rocalution::set_omp_affinity_rocalution(bool affinity)¶

Enable/disable OpenMP host affinity.

set_omp_affinity_rocalution enables / disables OpenMP host affinity.

Parameters

[in] affinity: boolean to turn on/off OpenMP host affinity

void rocalution::set_omp_threshold_rocalution(int threshold)¶

Set OpenMP threshold size.

Whenever you want to work on a small problem, you might observe that the OpenMP host backend is (slightly) slower than using no OpenMP. This is mainly attributed to the small amount of work that every thread has to perform and the large overhead of forking/joining threads. This can be avoided with the OpenMP threshold size parameter in rocALUTION. The default threshold is set to 10000, which means that all matrices of this size or smaller will use only one thread (regardless of the number of OpenMP threads set in the system). The threshold can be modified with set_omp_threshold_rocalution.

Parameters

[in] threshold: OpenMP threshold size
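The effect of the threshold can be summarized by the following sketch. It only restates the rule described above; `effective_threads` is a hypothetical helper, not a rocALUTION function, and how the library measures the "size" of a problem is not specified in this section.

```cpp
#include <cassert>

// Number of threads used for a problem of the given size, following the rule
// that sizes up to and including the threshold run on a single thread.
int effective_threads(int size, int omp_threads, int threshold = 10000) {
    assert(size >= 0 && omp_threads >= 1);
    return (size <= threshold) ? 1 : omp_threads;
}
```

With the default threshold and 8 OpenMP threads, a problem of size 10000 runs on one thread while a problem of size 10001 uses all 8.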

void rocalution::info_rocalution(void)

Print info about rocALUTION.

info_rocalution prints information about the rocALUTION platform.

void rocalution::info_rocalution(const struct Rocalution_Backend_Descriptor backend_descriptor)

Print info about specific rocALUTION backend descriptor.

info_rocalution prints information about the rocALUTION platform of the specific backend descriptor.

Parameters

[in] backend_descriptor: rocALUTION backend descriptor

void rocalution::disable_accelerator_rocalution(bool onoff = true)¶

Disable/Enable the accelerator.

If you want to disable the accelerator (without re-compiling the code), you need to call disable_accelerator_rocalution before init_rocalution().

Parameters

[in] onoff: boolean to turn on/off the accelerator

void rocalution::_rocalution_sync(void)¶

Sync rocALUTION.

_rocalution_sync blocks the host until all active asynchronous transfers are completed.

Base Rocalution¶

template<typename ValueType> class BaseRocalution : public rocalution::RocalutionObj¶

Base class for all operators and vectors.

Template Parameters

ValueType: can be int, float, double, std::complex<float> and std::complex<double>

Subclassed by rocalution::Operator< ValueType >, rocalution::Vector< ValueType >

virtual void rocalution::BaseRocalution::MoveToAccelerator(void) = 0¶: Move the object to the accelerator backend.

virtual void rocalution::BaseRocalution::MoveToHost(void) = 0¶: Move the object to the host backend.

void rocalution::BaseRocalution::MoveToAcceleratorAsync(void)¶: Move the object to the accelerator backend with async move.

void rocalution::BaseRocalution::MoveToHostAsync(void)¶: Move the object to the host backend with async move.

void rocalution::BaseRocalution::Sync(void)¶: Sync (the async move)

void rocalution::BaseRocalution::CloneBackend(const BaseRocalution<ValueType> &src)

Clone the Backend descriptor from another object.

With CloneBackend, the backend can be cloned without copying any data. This is especially useful, if several objects should reside on the same backend, but keep their original data.

Example

LocalVector<ValueType> vec;
LocalMatrix<ValueType> mat;

// Allocate and initialize vec and mat
// ...

LocalVector<ValueType> tmp;
// By cloning backend, tmp and vec will have the same backend as mat
tmp.CloneBackend(mat);
vec.CloneBackend(mat);

// The following matrix vector multiplication will be performed on the backend
// selected in mat
mat.Apply(vec, &tmp);

Parameters

[in] src: Object, where the backend should be cloned from.

virtual void rocalution::BaseRocalution::Info(void) const = 0¶

Print object information.

Info can print object information about any rocALUTION object. This information consists of object properties and backend data.

Example

mat.Info();
vec.Info();

virtual void rocalution::BaseRocalution::Clear(void) = 0¶: Clear (free all data) the object.

Operator¶

template<typename ValueType> class Operator : public rocalution::BaseRocalution<ValueType>¶

Operator class.

The Operator class defines the generic interface for applying an operator (e.g. matrix or stencil) from/to global and local vectors.

Template Parameters

ValueType: can be int, float, double, std::complex<float> and std::complex<double>

Subclassed by rocalution::GlobalMatrix< ValueType >, rocalution::LocalMatrix< ValueType >, rocalution::LocalStencil< ValueType >

virtual IndexType2 rocalution::Operator::GetM(void) const = 0¶: Return the number of rows in the matrix/stencil.

virtual IndexType2 rocalution::Operator::GetN(void) const = 0¶: Return the number of columns in the matrix/stencil.

virtual IndexType2 rocalution::Operator::GetNnz(void) const = 0¶: Return the number of non-zeros in the matrix/stencil.

int rocalution::Operator::GetLocalM(void) const¶: Return the number of rows in the local matrix/stencil.

int rocalution::Operator::GetLocalN(void) const¶: Return the number of columns in the local matrix/stencil.

int rocalution::Operator::GetLocalNnz(void) const¶: Return the number of non-zeros in the local matrix/stencil.

int rocalution::Operator::GetGhostM(void) const¶: Return the number of rows in the ghost matrix/stencil.

int rocalution::Operator::GetGhostN(void) const¶: Return the number of columns in the ghost matrix/stencil.

int rocalution::Operator::GetGhostNnz(void) const¶: Return the number of non-zeros in the ghost matrix/stencil.

void rocalution::Operator::Apply(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const: Apply the operator, out = Operator(in), where in and out are local vectors.

void rocalution::Operator::ApplyAdd(const LocalVector<ValueType> &in, ValueType scalar, LocalVector<ValueType> *out) const: Apply and add the operator, out += scalar * Operator(in), where in and out are local vectors.

void rocalution::Operator::Apply(const GlobalVector<ValueType> &in, GlobalVector<ValueType> *out) const: Apply the operator, out = Operator(in), where in and out are global vectors.

void rocalution::Operator::ApplyAdd(const GlobalVector<ValueType> &in, ValueType scalar, GlobalVector<ValueType> *out) const: Apply and add the operator, out += scalar * Operator(in), where in and out are global vectors.
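The difference between Apply and ApplyAdd is whether the output is overwritten or accumulated into. The sketch below reproduces the two formulas with a small dense matrix standing in for the operator; it uses only the standard library, the names `Dense`, `apply` and `apply_add` are hypothetical, and it assumes `in` and `out` are distinct objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix used as a stand-in for a rocALUTION operator.
typedef std::vector<std::vector<double> > Dense;

// out = Operator(in): any previous content of out is overwritten.
void apply(const Dense& op, const std::vector<double>& in,
           std::vector<double>* out) {
    out->assign(op.size(), 0.0);
    for (std::size_t r = 0; r < op.size(); ++r) {
        assert(op[r].size() == in.size());
        for (std::size_t c = 0; c < in.size(); ++c) {
            (*out)[r] += op[r][c] * in[c];
        }
    }
}

// out += scalar * Operator(in): the previous content of out is kept.
void apply_add(const Dense& op, const std::vector<double>& in, double scalar,
               std::vector<double>* out) {
    assert(out->size() == op.size());
    for (std::size_t r = 0; r < op.size(); ++r) {
        assert(op[r].size() == in.size());
        double acc = 0.0;
        for (std::size_t c = 0; c < in.size(); ++c) {
            acc += op[r][c] * in[c];
        }
        (*out)[r] += scalar * acc;
    }
}
```

For the matrix [[1, 2], [3, 4]] and the input (1, 1), `apply` produces (3, 7); a subsequent `apply_add` with scalar 2 turns that into (9, 21).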

Vector¶

template<typename ValueType> class Vector : public rocalution::BaseRocalution<ValueType>¶

Vector class.

The Vector class defines the generic interface for local and global vectors.

Template Parameters

ValueType: can be int, float, double, std::complex<float> and std::complex<double>

Subclassed by rocalution::LocalVector< int >, rocalution::GlobalVector< ValueType >, rocalution::LocalVector< ValueType >

virtual IndexType2 rocalution::Vector::GetSize(void) const = 0¶: Return the size of the vector.

int rocalution::Vector::GetLocalSize(void) const¶: Return the size of the local vector.

int rocalution::Vector::GetGhostSize(void) const¶: Return the size of the ghost vector.

virtual bool rocalution::Vector::Check(void) const = 0¶

Perform a sanity check of the vector.

Checks whether the vector contains valid data, i.e. whether the values are neither infinity nor NaN (not a number).

Return Value

true: if the vector is ok (empty vector is also ok).
false: if there is something wrong with the values.

virtual void rocalution::Vector::Zeros(void) = 0¶: Set all values of the vector to 0.

virtual void rocalution::Vector::Ones(void) = 0¶: Set all values of the vector to 1.

virtual void rocalution::Vector::SetValues(ValueType val) = 0¶: Set all values of the vector to given argument.

virtual void rocalution::Vector::SetRandomUniform(unsigned long long seed, ValueType a = static_cast<ValueType>(-1), ValueType b = static_cast<ValueType>(1)) = 0¶: Fill the vector with random values from interval [a,b].

virtual void rocalution::Vector::SetRandomNormal(unsigned long long seed, ValueType mean = static_cast<ValueType>(0), ValueType var = static_cast<ValueType>(1)) = 0¶: Fill the vector with random values from normal distribution.

virtual void rocalution::Vector::ReadFileASCII(const std::string filename) = 0¶

Read vector from ASCII file.

Read a vector from ASCII file.

Example

LocalVector<ValueType> vec;
vec.ReadFileASCII("my_vector.dat");

Parameters

[in] filename: name of the file containing the ASCII data.

virtual void rocalution::Vector::WriteFileASCII(const std::string filename) const = 0¶

Write vector to ASCII file.

Write a vector to ASCII file.

Example

LocalVector<ValueType> vec;

// Allocate and fill vec
// ...

vec.WriteFileASCII("my_vector.dat");

Parameters

[in] filename: name of the file to write the ASCII data to.

virtual void rocalution::Vector::ReadFileBinary(const std::string filename) = 0¶

Read vector from binary file.

Read a vector from binary file. For details on the format, see WriteFileBinary().

Example

LocalVector<ValueType> vec;
vec.ReadFileBinary("my_vector.bin");

Parameters

[in] filename: name of the file containing the data.

virtual void rocalution::Vector::WriteFileBinary(const std::string filename) const = 0¶

Write vector to binary file.

Write a vector to binary file.

The binary format contains a header, the rocALUTION version and the vector data as follows

// Header
out << "#rocALUTION binary vector file" << std::endl;

// rocALUTION version
out.write((char*)&version, sizeof(int));

// Vector data
out.write((char*)&size, sizeof(int));
out.write((char*)vec_val, size * sizeof(double));

Note

Vector values array is always stored in double precision (e.g. double or std::complex<double>).

Example

LocalVector<ValueType> vec;

// Allocate and fill vec
// ...

vec.WriteFileBinary("my_vector.bin");

Parameters

[in] filename: name of the file to write the data to.
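To make the layout above concrete, the following standalone sketch writes and reads it with standard C++ streams only. It is not the library's code: `write_vector`, `read_vector` and `round_trip_ok` are hypothetical names, the version value 10000 is an arbitrary placeholder, and the sketch assumes real (not complex) values stored in the host's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write the header line, the version, the size and then the values as doubles.
void write_vector(std::ostream& out, int version,
                  const std::vector<double>& values) {
    assert(values.size() <=
           static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int size = static_cast<int>(values.size());
    out << "#rocALUTION binary vector file" << std::endl;
    out.write(reinterpret_cast<const char*>(&version), sizeof(int));
    out.write(reinterpret_cast<const char*>(&size), sizeof(int));
    out.write(reinterpret_cast<const char*>(values.data()),
              size * sizeof(double));
}

// Read the same layout back; returns false on a bad header or a short read.
bool read_vector(std::istream& in, int* version, std::vector<double>* values) {
    std::string header;
    if (!std::getline(in, header) ||
        header != "#rocALUTION binary vector file") {
        return false;
    }
    int size = 0;
    if (!in.read(reinterpret_cast<char*>(version), sizeof(int))) return false;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof(int)) || size < 0) {
        return false;
    }
    values->resize(size);
    in.read(reinterpret_cast<char*>(values->data()), size * sizeof(double));
    return static_cast<bool>(in);
}

// Round trip through an in-memory stream instead of a file.
bool round_trip_ok() {
    const std::vector<double> original = {1.0, -2.5, 3.25};
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    write_vector(buffer, 10000, original); // 10000: placeholder version value
    int version = 0;
    std::vector<double> restored;
    return read_vector(buffer, &version, &restored) && version == 10000 &&
           restored == original;
}
```

Because the integers and doubles are written as raw bytes, a file produced this way is only portable between machines that share the same `int` size and endianness.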

void rocalution::Vector::CopyFrom(const LocalVector<ValueType> &src)

Copy vector from another vector.

CopyFrom copies values from another vector.

Note

This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.

Example

LocalVector<ValueType> vec1, vec2;

// Allocate and initialize vec1 and vec2
// ...

// Move vec1 to accelerator
vec1.MoveToAccelerator();

// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host

// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);

Parameters

[in] src: Vector, where values should be copied from.

void rocalution::Vector::CopyFrom(const GlobalVector<ValueType> &src)

Copy vector from another vector.

CopyFrom copies values from another vector.

Note

This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.

Example

LocalVector<ValueType> vec1, vec2;

// Allocate and initialize vec1 and vec2
// ...

// Move vec1 to accelerator
vec1.MoveToAccelerator();

// Now, vec1 is on the accelerator (if available)
// and vec2 is on the host

// Copy vec1 to vec2 (or vice versa) will move data between host and
// accelerator backend
vec1.CopyFrom(vec2);

Parameters

[in] src: Vector, where values should be copied from.

void rocalution::Vector::CopyFromAsync(const LocalVector<ValueType> &src)¶: Async copy from another local vector.

void rocalution::Vector::CopyFromFloat(const LocalVector<float> &src)¶: Copy values from another local float vector.

void rocalution::Vector::CopyFromDouble(const LocalVector<double> &src)¶: Copy values from another local double vector.

void rocalution::Vector::CopyFrom(const LocalVector<ValueType> &src, int src_offset, int dst_offset, int size)

Copy vector from another vector with offsets and size.

CopyFrom copies values with specific source and destination offsets and sizes from another vector.

Note

This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.

Parameters

[in] src: Vector, where values should be copied from.
[in] src_offset: source offset.
[in] dst_offset: destination offset.
[in] size: number of entries to be copied.
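The offset semantics can be illustrated without the library. The following is a minimal sketch on plain `std::vector` storage; `copy_with_offsets` is a hypothetical helper, not a rocALUTION function, and it only mirrors the parameter order of the overload above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of rocALUTION): copies `size` entries from
// src[src_offset ...] into dst[dst_offset ...].
template <typename T>
void copy_with_offsets(const std::vector<T>& src, int src_offset,
                       std::vector<T>& dst, int dst_offset, int size)
{
    // Both ranges must lie completely inside their vectors
    assert(src_offset >= 0 && dst_offset >= 0 && size >= 0);
    assert(static_cast<std::size_t>(src_offset + size) <= src.size());
    assert(static_cast<std::size_t>(dst_offset + size) <= dst.size());

    for(int i = 0; i < size; ++i)
    {
        dst[dst_offset + i] = src[src_offset + i];
    }
}
```

For example, copying 3 entries from source offset 1 to destination offset 2 leaves all other destination entries untouched.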

void rocalution::Vector::CloneFrom(const LocalVector<ValueType> &src)

Clone the vector.

CloneFrom clones the entire vector, with data and backend descriptor from another Vector.

Example

LocalVector<ValueType> vec;

// Allocate and initialize vec (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);

Parameters

[in] src: Vector to clone from.

void rocalution::Vector::CloneFrom(const GlobalVector<ValueType> &src)

Clone the vector.

CloneFrom clones the entire vector, with data and backend descriptor from another Vector.

Example

LocalVector<ValueType> vec;

// Allocate and initialize vec (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// By cloning vec, tmp will have identical values and will be on the same
// backend as vec
tmp.CloneFrom(vec);

Parameters

[in] src: Vector to clone from.

void rocalution::Vector::AddScale(const LocalVector<ValueType> &x, ValueType alpha): Perform vector update of type this = this + alpha * x.

void rocalution::Vector::AddScale(const GlobalVector<ValueType> &x, ValueType alpha): Perform vector update of type this = this + alpha * x.

void rocalution::Vector::ScaleAdd(ValueType alpha, const LocalVector<ValueType> &x): Perform vector update of type this = alpha * this + x.

void rocalution::Vector::ScaleAdd(ValueType alpha, const GlobalVector<ValueType> &x): Perform vector update of type this = alpha * this + x.

void rocalution::Vector::ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta): Perform vector update of type this = alpha * this + x * beta.

void rocalution::Vector::ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta): Perform vector update of type this = alpha * this + x * beta.

void rocalution::Vector::ScaleAddScale(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size): Perform vector update of type this = alpha * this + x * beta with offsets.

void rocalution::Vector::ScaleAddScale(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, int src_offset, int dst_offset, int size): Perform vector update of type this = alpha * this + x * beta with offsets.

void rocalution::Vector::ScaleAdd2(ValueType alpha, const LocalVector<ValueType> &x, ValueType beta, const LocalVector<ValueType> &y, ValueType gamma): Perform vector update of type this = alpha * this + x * beta + y * gamma.

void rocalution::Vector::ScaleAdd2(ValueType alpha, const GlobalVector<ValueType> &x, ValueType beta, const GlobalVector<ValueType> &y, ValueType gamma): Perform vector update of type this = alpha * this + x * beta + y * gamma.
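The update formulas above differ only in which operand is scaled. As a rough illustration, the sketch below spells them out on `std::vector<double>`; the free functions and the `Vec` alias are hypothetical names chosen for this example and are not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// self = self + alpha * x
void add_scale(Vec& self, const Vec& x, double alpha)
{
    assert(self.size() == x.size());
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = self[i] + alpha * x[i];
}

// self = alpha * self + x
void scale_add(Vec& self, double alpha, const Vec& x)
{
    assert(self.size() == x.size());
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i];
}

// self = alpha * self + x * beta
void scale_add_scale(Vec& self, double alpha, const Vec& x, double beta)
{
    assert(self.size() == x.size());
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i] * beta;
}

// self = alpha * self + x * beta + y * gamma
void scale_add2(Vec& self, double alpha, const Vec& x, double beta,
                const Vec& y, double gamma)
{
    assert(self.size() == x.size() && self.size() == y.size());
    for(std::size_t i = 0; i < self.size(); ++i)
        self[i] = alpha * self[i] + x[i] * beta + y[i] * gamma;
}
```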

virtual void rocalution::Vector::Scale(ValueType alpha) = 0¶: Perform vector scaling this = alpha * this.

ValueType rocalution::Vector::Dot(const LocalVector<ValueType> &x) const: Compute dot (scalar) product, return this^T x.

ValueType rocalution::Vector::Dot(const GlobalVector<ValueType> &x) const: Compute dot (scalar) product, return this^T x.

ValueType rocalution::Vector::DotNonConj(const LocalVector<ValueType> &x) const: Compute non-conjugate dot (scalar) product, return this^T x.

ValueType rocalution::Vector::DotNonConj(const GlobalVector<ValueType> &x) const: Compute non-conjugate dot (scalar) product, return this^T x.
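For real value types the two dot products coincide; they differ only for complex vectors. The sketch below shows the difference on `std::complex<double>`. It assumes that the conjugate is applied to the first operand (`this`), which the reference above does not state explicitly; `dot_conj` and `dot_non_conj` are hypothetical names.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Conjugate dot product: sum_i conj(a_i) * x_i
// (assumption: the conjugate acts on the first operand)
Complex dot_conj(const std::vector<Complex>& a, const std::vector<Complex>& x)
{
    assert(a.size() == x.size());
    Complex sum(0.0, 0.0);
    for(std::size_t i = 0; i < a.size(); ++i)
        sum += std::conj(a[i]) * x[i];
    return sum;
}

// Non-conjugate dot product: sum_i a_i * x_i
Complex dot_non_conj(const std::vector<Complex>& a, const std::vector<Complex>& x)
{
    assert(a.size() == x.size());
    Complex sum(0.0, 0.0);
    for(std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * x[i];
    return sum;
}
```

With a = (1 + 2i) and x = (3 + 4i), the conjugate product is 11 - 2i while the non-conjugate product is -5 + 10i.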

virtual ValueType rocalution::Vector::Norm(void) const = 0¶: Compute $L_2$ norm of the vector, return = sqrt(this^T this)

virtual ValueType rocalution::Vector::Reduce(void) const = 0¶: Reduce the vector.

virtual ValueType rocalution::Vector::Asum(void) const = 0¶: Compute the sum of absolute values of the vector, return = sum(|this|)

virtual int rocalution::Vector::Amax(ValueType &value) const = 0¶: Compute the absolute max of the vector, return = index(max(|this|))
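The reductions above can be sketched for real-valued data as follows. The function names are hypothetical, and the sketch assumes that `Amax` stores the largest absolute value in its output argument; the reference only specifies the returned index.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L2 norm: sqrt(sum_i v_i^2)
double norm2(const std::vector<double>& v)
{
    double sum = 0.0;
    for(double e : v)
        sum += e * e;
    return std::sqrt(sum);
}

// Sum of absolute values: sum_i |v_i|
double asum(const std::vector<double>& v)
{
    double sum = 0.0;
    for(double e : v)
        sum += std::fabs(e);
    return sum;
}

// Index of the entry with the largest absolute value; `value` receives that
// absolute value (assumption, see the lead-in).
int amax(const std::vector<double>& v, double& value)
{
    assert(!v.empty());
    int index = 0;
    value     = std::fabs(v[0]);
    for(std::size_t i = 1; i < v.size(); ++i)
    {
        if(std::fabs(v[i]) > value)
        {
            value = std::fabs(v[i]);
            index = static_cast<int>(i);
        }
    }
    return index;
}
```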

void rocalution::Vector::PointWiseMult(const LocalVector<ValueType> &x): Perform point-wise multiplication (element-wise) of this = this * x.

void rocalution::Vector::PointWiseMult(const GlobalVector<ValueType> &x): Perform point-wise multiplication (element-wise) of this = this * x.

void rocalution::Vector::PointWiseMult(const LocalVector<ValueType> &x, const LocalVector<ValueType> &y): Perform point-wise multiplication (element-wise) of this = x * y.

void rocalution::Vector::PointWiseMult(const GlobalVector<ValueType> &x, const GlobalVector<ValueType> &y): Perform point-wise multiplication (element-wise) of this = x * y.

virtual void rocalution::Vector::Power(double power) = 0¶: Perform power operation to a vector.

Local Matrix¶

template<typename ValueType> class LocalMatrix : public rocalution::Operator<ValueType>¶

LocalMatrix class.

A LocalMatrix is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters

ValueType: can be int, float, double, std::complex<float> and std::complex<double>

unsigned int rocalution::LocalMatrix::GetFormat(void) const¶: Return the matrix format id (see matrix_formats.hpp)

bool rocalution::LocalMatrix::Check(void) const¶

Perform a sanity check of the matrix.

Checks whether the matrix contains valid data, i.e. whether the values are neither infinity nor NaN (not a number) and whether the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO matrices have to be sorted, etc.).

Return Value

true: if the matrix is ok (empty matrix is also ok).
false: if there is something wrong with the structure or values.
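To make the listed conditions concrete, here is a minimal sketch of such a sanity check for a CSR matrix with real values. `CsrMatrix` and `check_csr` are hypothetical names for this illustration; the checks actually performed by `Check()` may be more extensive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical plain CSR container used only for this illustration
struct CsrMatrix
{
    int                 nrow;
    int                 ncol;
    std::vector<int>    row_ptr; // nrow + 1 entries
    std::vector<int>    col_ind; // nnz entries
    std::vector<double> val;     // nnz entries
};

bool check_csr(const CsrMatrix& m)
{
    // An empty matrix is considered valid
    if(m.nrow == 0 && m.row_ptr.empty() && m.col_ind.empty() && m.val.empty())
        return true;
    if(m.nrow < 0 || m.ncol < 0)
        return false;
    if(m.row_ptr.size() != static_cast<std::size_t>(m.nrow) + 1)
        return false;
    if(m.row_ptr.front() != 0)
        return false;

    const int nnz = m.row_ptr.back();
    if(nnz < 0 || m.col_ind.size() != static_cast<std::size_t>(nnz)
       || m.val.size() != static_cast<std::size_t>(nnz))
        return false;

    for(int i = 0; i < m.nrow; ++i)
    {
        // Row offsets must be non-decreasing and stay within nnz
        if(m.row_ptr[i] > m.row_ptr[i + 1] || m.row_ptr[i + 1] > nnz)
            return false;

        for(int j = m.row_ptr[i]; j < m.row_ptr[i + 1]; ++j)
        {
            // Column indices must be in range
            if(m.col_ind[j] < 0 || m.col_ind[j] >= m.ncol)
                return false;
            // Column indices must be strictly increasing within a row
            if(j > m.row_ptr[i] && m.col_ind[j - 1] >= m.col_ind[j])
                return false;
            // Values must be neither infinity nor NaN
            if(!std::isfinite(m.val[j]))
                return false;
        }
    }
    return true;
}
```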

void rocalution::LocalMatrix::AllocateCSR(const std::string name, int nnz, int nrow, int ncol)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateBCSR(void)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateMCSR(const std::string name, int nnz, int nrow, int ncol)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateCOO(const std::string name, int nnz, int nrow, int ncol)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateDIA(const std::string name, int nnz, int nrow, int ncol, int ndiag)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateELL(const std::string name, int nnz, int nrow, int ncol, int max_row)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateHYB(const std::string name, int ell_nnz, int coo_nnz, int ell_max_row, int nrow, int ncol)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::AllocateDENSE(const std::string name, int nrow, int ncol)¶

Allocate a local matrix with name and sizes.

The local matrix allocation functions require a name of the object (this is only for information purposes) and corresponding number of non-zero elements, number of rows and number of columns. Furthermore, depending on the matrix format, additional parameters are required.

Example

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my CSR matrix", 456, 100, 100);
mat.Clear();

mat.AllocateCOO("my COO matrix", 200, 100, 100);
mat.Clear();

void rocalution::LocalMatrix::SetDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::SetDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::SetDataPtrMCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::SetDataPtrELL(int **col, ValueType **val, std::string name, int nnz, int nrow, int ncol, int max_row)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::SetDataPtrDIA(int **offset, ValueType **val, std::string name, int nnz, int nrow, int ncol, int num_diag)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::SetDataPtrDENSE(ValueType **val, std::string name, int nrow, int ncol)¶

Initialize a LocalMatrix on the host with externally allocated data.

SetDataPtr functions have direct access to the raw data via pointers. Already allocated data can be set by passing their pointers.

Note

Setting data pointers will leave the original pointers empty (set to NULL).

Example

// Allocate a CSR matrix
int* csr_row_ptr   = new int[100 + 1];
int* csr_col_ind   = new int[345];
ValueType* csr_val = new ValueType[345];

// Fill the CSR matrix
// ...

// rocALUTION local matrix object
LocalMatrix<ValueType> mat;

// Set the CSR matrix data; csr_row_ptr, csr_col_ind and csr_val pointers
// become invalid
mat.SetDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val, "my_matrix", 345, 100, 100);

void rocalution::LocalMatrix::LeaveDataPtrCOO(int **row, int **col, ValueType **val)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);

void rocalution::LocalMatrix::LeaveDataPtrCSR(int **row_offset, int **col, ValueType **val)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);
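The ownership hand-over performed by the SetDataPtr and LeaveDataPtr functions can be pictured with a small stand-in class. `OwnedCsr` below is hypothetical and only models the pointer transfer described above: setting data pointers nulls the caller's pointers, and leaving them empties the object. It assumes arrays allocated with `new[]`.

```cpp
#include <cassert>

// Hypothetical stand-in for an object that owns three CSR arrays
class OwnedCsr
{
public:
    OwnedCsr()                           = default;
    OwnedCsr(const OwnedCsr&)            = delete;
    OwnedCsr& operator=(const OwnedCsr&) = delete;
    ~OwnedCsr() { clear(); }

    // Take ownership of caller-allocated arrays; the caller's pointers are
    // set to nullptr afterwards
    void set_data_ptr(int** row_ptr, int** col_ind, double** val)
    {
        assert(row_ptr != nullptr && col_ind != nullptr && val != nullptr);
        clear();
        row_ptr_ = *row_ptr;
        col_ind_ = *col_ind;
        val_     = *val;
        *row_ptr = nullptr;
        *col_ind = nullptr;
        *val     = nullptr;
    }

    // Hand the arrays back to the caller and leave this object empty
    void leave_data_ptr(int** row_ptr, int** col_ind, double** val)
    {
        assert(row_ptr != nullptr && col_ind != nullptr && val != nullptr);
        *row_ptr = row_ptr_;
        *col_ind = col_ind_;
        *val     = val_;
        row_ptr_ = nullptr;
        col_ind_ = nullptr;
        val_     = nullptr;
    }

    bool empty() const { return row_ptr_ == nullptr; }

private:
    void clear()
    {
        delete[] row_ptr_;
        delete[] col_ind_;
        delete[] val_;
        row_ptr_ = nullptr;
        col_ind_ = nullptr;
        val_     = nullptr;
    }

    int*    row_ptr_ = nullptr;
    int*    col_ind_ = nullptr;
    double* val_     = nullptr;
};
```

No data is copied in either direction; after `leave_data_ptr` the caller is responsible for freeing the arrays.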

void rocalution::LocalMatrix::LeaveDataPtrMCSR(int **row_offset, int **col, ValueType **val)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);

void rocalution::LocalMatrix::LeaveDataPtrELL(int **col, ValueType **val, int &max_row)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);

void rocalution::LocalMatrix::LeaveDataPtrDIA(int **offset, ValueType **val, int &num_diag)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);

void rocalution::LocalMatrix::LeaveDataPtrDENSE(ValueType **val)¶

Leave a LocalMatrix to host pointers.

LeaveDataPtr functions have direct access to the raw data via pointers. A LocalMatrix object can leave its raw data to host pointers. This will leave the LocalMatrix empty.

Example

// rocALUTION CSR matrix object
LocalMatrix<ValueType> mat;

// Allocate the CSR matrix
mat.AllocateCSR("my_matrix", 345, 100, 100);

// Fill CSR matrix
// ...

int* csr_row_ptr   = NULL;
int* csr_col_ind   = NULL;
ValueType* csr_val = NULL;

// Get (steal) the data from the matrix, this will leave the local matrix
// object empty
mat.LeaveDataPtrCSR(&csr_row_ptr, &csr_col_ind, &csr_val);

void rocalution::LocalMatrix::Zeros(void)¶: Set all matrix values to zero.

void rocalution::LocalMatrix::Scale(ValueType alpha)¶: Scale all values in the matrix.

void rocalution::LocalMatrix::ScaleDiagonal(ValueType alpha)¶: Scale the diagonal entries of the matrix with alpha, all diagonal elements must exist.

void rocalution::LocalMatrix::ScaleOffDiagonal(ValueType alpha)¶: Scale the off-diagonal entries of the matrix with alpha, all diagonal elements must exist.

void rocalution::LocalMatrix::AddScalar(ValueType alpha)¶: Add a scalar to all matrix values.

void rocalution::LocalMatrix::AddScalarDiagonal(ValueType alpha)¶: Add alpha to the diagonal entries of the matrix, all diagonal elements must exist.

void rocalution::LocalMatrix::AddScalarOffDiagonal(ValueType alpha)¶: Add alpha to the off-diagonal entries of the matrix, all diagonal elements must exist.

void rocalution::LocalMatrix::ExtractSubMatrix(int row_offset, int col_offset, int row_size, int col_size, LocalMatrix<ValueType> *mat) const¶: Extract a sub-matrix with row/col_offset and row/col_size.

void rocalution::LocalMatrix::ExtractSubMatrices(int row_num_blocks, int col_num_blocks, const int *row_offset, const int *col_offset, LocalMatrix<ValueType> ***mat) const¶: Extract array of non-overlapping sub-matrices (row/col_num_blocks define the blocks for rows/columns; row/col_offset have sizes row/col_num_blocks+1, where [i+1]-[i] defines the size of the i-th sub-matrix)
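The offset convention for the block extraction can be sketched as follows. `block_sizes` is a hypothetical helper that turns an offset array with num_blocks + 1 entries into the individual block sizes; it is not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset has num_blocks + 1 entries and block i covers
// the index range [offset[i], offset[i + 1])
std::vector<int> block_sizes(const std::vector<int>& offset)
{
    assert(!offset.empty());
    std::vector<int> sizes;
    for(std::size_t i = 0; i + 1 < offset.size(); ++i)
    {
        // Blocks are non-overlapping, so offsets must not decrease
        assert(offset[i] <= offset[i + 1]);
        sizes.push_back(offset[i + 1] - offset[i]);
    }
    return sizes;
}
```

For a matrix with 10 rows split into three row blocks, the offsets {0, 3, 5, 10} describe blocks of 3, 2 and 5 rows.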

void rocalution::LocalMatrix::ExtractDiagonal(LocalVector<ValueType> *vec_diag) const¶: Extract the diagonal values of the matrix into a LocalVector.

void rocalution::LocalMatrix::ExtractInverseDiagonal(LocalVector<ValueType> *vec_inv_diag) const¶: Extract the inverse (reciprocal) diagonal values of the matrix into a LocalVector.

void rocalution::LocalMatrix::ExtractU(LocalMatrix<ValueType> *U, bool diag) const¶: Extract the upper triangular matrix.

void rocalution::LocalMatrix::ExtractL(LocalMatrix<ValueType> *L, bool diag) const¶: Extract the lower triangular matrix.

void rocalution::LocalMatrix::Permute(const LocalVector<int> &permutation)¶: Perform (forward) permutation of the matrix.

void rocalution::LocalMatrix::PermuteBackward(const LocalVector<int> &permutation)¶: Perform (backward) permutation of the matrix.

void rocalution::LocalMatrix::CMK(LocalVector<int> *permutation) const¶

Create permutation vector for CMK reordering of the matrix.

The Cuthill-McKee ordering reduces the bandwidth of a given sparse matrix.

Example

LocalVector<int> cmk;

mat.CMK(&cmk);
mat.Permute(cmk);

Parameters

[out] permutation: permutation vector for CMK reordering
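For readers unfamiliar with the reordering, the following is a minimal textbook sketch of Cuthill-McKee on an adjacency-list graph. It is not the rocALUTION implementation; the function names are hypothetical, the sparsity pattern is assumed to be symmetric, and the returned vector lists original node indices in their new order, which may differ from the convention used by `Permute()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <queue>
#include <vector>

// adj[i] lists the neighbours of node i (symmetric pattern assumed).
// Returns order, where order[k] is the original index of the node placed at
// position k.
std::vector<int> cuthill_mckee(const std::vector<std::vector<int>>& adj)
{
    const int n         = static_cast<int>(adj.size());
    auto      by_degree = [&](int a, int b) { return adj[a].size() < adj[b].size(); };

    // Candidate start nodes, lowest degree first
    std::vector<int> starts(n);
    for(int i = 0; i < n; ++i)
        starts[i] = i;
    std::stable_sort(starts.begin(), starts.end(), by_degree);

    std::vector<bool> visited(n, false);
    std::vector<int>  order;
    order.reserve(n);

    for(int start : starts) // one breadth-first search per connected component
    {
        if(visited[start])
            continue;
        std::queue<int> q;
        q.push(start);
        visited[start] = true;

        while(!q.empty())
        {
            const int v = q.front();
            q.pop();
            order.push_back(v);

            // Enqueue unvisited neighbours in order of increasing degree
            std::vector<int> next;
            for(int w : adj[v])
            {
                assert(w >= 0 && w < n);
                if(!visited[w])
                {
                    visited[w] = true;
                    next.push_back(w);
                }
            }
            std::stable_sort(next.begin(), next.end(), by_degree);
            for(int w : next)
                q.push(w);
        }
    }
    return order;
}

// Bandwidth of the pattern after relabelling nodes according to `order`
int bandwidth(const std::vector<std::vector<int>>& adj, const std::vector<int>& order)
{
    std::vector<int> position(order.size());
    for(std::size_t k = 0; k < order.size(); ++k)
        position[order[k]] = static_cast<int>(k);

    int width = 0;
    for(std::size_t v = 0; v < adj.size(); ++v)
        for(int w : adj[v])
            width = std::max(width, std::abs(position[v] - position[w]));
    return width;
}
```

On the path graph 0-3-1-2, the original numbering has bandwidth 3, while the computed order brings it down to 1.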

void rocalution::LocalMatrix::RCMK(LocalVector<int> *permutation) const¶

Create permutation vector for reverse CMK reordering of the matrix.

The Reverse Cuthill-McKee ordering reduces the bandwidth of a given sparse matrix.

Example

LocalVector<int> rcmk;

mat.RCMK(&rcmk);
mat.Permute(rcmk);

Parameters

[out] permutation: permutation vector for reverse CMK reordering

void rocalution::LocalMatrix::ConnectivityOrder(LocalVector<int> *permutation) const¶

Create permutation vector for connectivity reordering of the matrix.

Connectivity ordering returns a permutation that sorts the matrix by the number of non-zero entries per row.

Example

LocalVector<int> conn;

mat.ConnectivityOrder(&conn);
mat.Permute(conn);

Parameters

[out] permutation: permutation vector for connectivity reordering

void rocalution::LocalMatrix::MultiColoring(int &num_colors, int **size_colors, LocalVector<int> *permutation) const¶

Perform multi-coloring decomposition of the matrix.

The Multi-Coloring algorithm builds a permutation (coloring of the matrix) in a way such that no two adjacent nodes in the sparse matrix have the same color.

Example

LocalVector<int> mc;
int num_colors;
int* block_colors = NULL;

mat.MultiColoring(num_colors, &block_colors, &mc);
mat.Permute(mc);

Parameters

[out] num_colors: number of colors
[out] size_colors: pointer to array that holds the number of nodes for each color
[out] permutation: permutation vector for multi-coloring reordering
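The coloring property (no two adjacent nodes share a color) can be illustrated with a simple greedy scheme. This is a generic sketch, not the algorithm used by `MultiColoring()`; `greedy_coloring` is a hypothetical name and the graph is given as adjacency lists of a symmetric sparsity pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assigns each node the smallest color not used by an already colored
// neighbour. Returns the color of every node; num_colors receives the number
// of colors used.
std::vector<int> greedy_coloring(const std::vector<std::vector<int>>& adj, int& num_colors)
{
    const int        n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    num_colors = 0;

    for(int v = 0; v < n; ++v)
    {
        // Colors 0 .. num_colors - 1 are in use; one extra slot guarantees
        // that a free color is always found
        std::vector<bool> used(static_cast<std::size_t>(num_colors) + 1, false);
        for(int w : adj[v])
        {
            assert(w >= 0 && w < n);
            if(color[w] >= 0)
                used[color[w]] = true;
        }

        int c = 0;
        while(used[c])
            ++c;

        color[v]   = c;
        num_colors = std::max(num_colors, c + 1);
    }
    return color;
}
```

Grouping the nodes by color then yields a permutation in which each color block has no couplings within itself.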

void rocalution::LocalMatrix::MaximalIndependentSet(int &size, LocalVector<int> *permutation) const¶

Perform maximal independent set decomposition of the matrix.

The Maximal Independent Set algorithm finds a set of maximal size that contains elements that do not depend on other elements in this set.

Example

LocalVector<int> mis;
int size;

mat.MaximalIndependentSet(size, &mis);
mat.Permute(mis);

Parameters

[out] size: number of independent sets
[out] permutation: permutation vector for maximal independent set reordering

void rocalution::LocalMatrix::ZeroBlockPermutation(int &size, LocalVector<int> *permutation) const¶

Return a permutation for saddle-point problems (zero diagonal entries)

For saddle-point problems (i.e. matrices with zero diagonal entries), the Zero Block Permutation maps all zero-diagonal elements to the last block of the matrix.

Example

LocalVector<int> zbp;
int size;

mat.ZeroBlockPermutation(size, &zbp);
mat.Permute(zbp);

Parameters

[out] size:
[out] permutation: permutation vector for zero block permutation

void rocalution::LocalMatrix::ILU0Factorize(void)¶: Perform ILU(0) factorization.

void rocalution::LocalMatrix::LUFactorize(void)¶: Perform LU factorization.

void rocalution::LocalMatrix::ILUTFactorize(double t, int maxrow)¶: Perform ILU(t,m) factorization based on threshold and maximum number of elements per row.

void rocalution::LocalMatrix::ILUpFactorize(int p, bool level = true)¶: Perform ILU(p) factorization based on power.

void rocalution::LocalMatrix::LUAnalyse(void)¶: Analyse the structure (level-scheduling)

void rocalution::LocalMatrix::LUAnalyseClear(void)¶: Delete the analysed data (see LUAnalyse)

void rocalution::LocalMatrix::LUSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶: Solve LU out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.

void rocalution::LocalMatrix::ICFactorize(LocalVector<ValueType> *inv_diag)¶: Perform IC(0) factorization.

void rocalution::LocalMatrix::LLAnalyse(void)¶: Analyse the structure (level-scheduling)

void rocalution::LocalMatrix::LLAnalyseClear(void)¶: Delete the analysed data (see LLAnalyse)

void rocalution::LocalMatrix::LLSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const: Solve LL^T out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.

void rocalution::LocalMatrix::LLSolve(const LocalVector<ValueType> &in, const LocalVector<ValueType> &inv_diag, LocalVector<ValueType> *out) const: Solve LL^T out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.

void rocalution::LocalMatrix::LAnalyse(bool diag_unit = false)¶

Analyse the structure (level-scheduling) L-part.

diag_unit == true the diag is 1;
diag_unit == false the diag is 0;

void rocalution::LocalMatrix::LAnalyseClear(void)¶: Delete the analysed data (see LAnalyse) L-part.

void rocalution::LocalMatrix::LSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶: Solve L out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.
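As an illustration of what a lower triangular solve computes, the sketch below performs sequential forward substitution on a CSR matrix. It is a generic example, not the library's implementation: `lower_solve_csr` is a hypothetical name, the diagonal is assumed to be stored and non-zero, and no level scheduling is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves L * out = in, where L is the lower triangular part (including the
// diagonal) of a square CSR matrix
std::vector<double> lower_solve_csr(const std::vector<int>&    row_ptr,
                                    const std::vector<int>&    col_ind,
                                    const std::vector<double>& val,
                                    const std::vector<double>& in)
{
    assert(row_ptr.size() == in.size() + 1);
    assert(col_ind.size() == val.size());

    const int           n = static_cast<int>(in.size());
    std::vector<double> out(in.size(), 0.0);

    for(int i = 0; i < n; ++i)
    {
        double sum  = in[i];
        double diag = 0.0;

        for(int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        {
            const int c = col_ind[j];
            if(c < i)
                sum -= val[j] * out[c]; // uses entries that are already solved
            else if(c == i)
                diag = val[j];
            // entries with c > i belong to the upper part and are ignored
        }

        assert(diag != 0.0);
        out[i] = sum / diag;
    }
    return out;
}
```

Row i only depends on rows referenced by its off-diagonal entries, which is the dependency structure a level-scheduling analysis exploits to solve independent rows in parallel.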

void rocalution::LocalMatrix::UAnalyse(bool diag_unit = false)¶

Analyse the structure (level-scheduling) U-part.

diag_unit == true the diag is 1;
diag_unit == false the diag is 0;

void rocalution::LocalMatrix::UAnalyseClear(void)¶: Delete the analysed data (see UAnalyse) U-part.

void rocalution::LocalMatrix::USolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶: Solve U out = in; if level-scheduling algorithm is provided then the graph traversing is performed in parallel.

void rocalution::LocalMatrix::Householder(int idx, ValueType &beta, LocalVector<ValueType> *vec) const¶: Compute Householder vector.

void rocalution::LocalMatrix::QRDecompose(void)¶: QR Decomposition.

void rocalution::LocalMatrix::QRSolve(const LocalVector<ValueType> &in, LocalVector<ValueType> *out) const¶: Solve QR out = in.

void rocalution::LocalMatrix::Invert(void)¶: Matrix inversion using QR decomposition.

void rocalution::LocalMatrix::ReadFileMTX(const std::string filename)¶

Read matrix from MTX (Matrix Market Format) file.

Read a matrix from Matrix Market Format file.

Example

LocalMatrix<ValueType> mat;
mat.ReadFileMTX("my_matrix.mtx");

Parameters

[in] filename: name of the file containing the MTX data.

void rocalution::LocalMatrix::WriteFileMTX(const std::string filename) const¶

Write matrix to MTX (Matrix Market Format) file.

Write a matrix to Matrix Market Format file.

Example

LocalMatrix<ValueType> mat;

// Allocate and fill mat
// ...

mat.WriteFileMTX("my_matrix.mtx");

Parameters

[in] filename: name of the file to write the MTX data to.

void rocalution::LocalMatrix::ReadFileCSR(const std::string filename)¶

Read matrix from CSR (rocALUTION binary format) file.

Read a CSR matrix from binary file. For details on the format, see WriteFileCSR().

Example

LocalMatrix<ValueType> mat;
mat.ReadFileCSR("my_matrix.csr");

Parameters

[in] filename: name of the file containing the data.

void rocalution::LocalMatrix::WriteFileCSR(const std::string filename) const¶

Write CSR matrix to binary file.

Write a CSR matrix to binary file.

The binary format contains a header, the rocALUTION version and the matrix data as follows

// Header
out << "#rocALUTION binary csr file" << std::endl;

// rocALUTION version
out.write((char*)&version, sizeof(int));

// CSR matrix data
out.write((char*)&m, sizeof(int));
out.write((char*)&n, sizeof(int));
out.write((char*)&nnz, sizeof(int));
out.write((char*)csr_row_ptr, (m + 1) * sizeof(int));
out.write((char*)csr_col_ind, nnz * sizeof(int));
out.write((char*)csr_val, nnz * sizeof(double));

Note

The matrix values array is always stored in double precision (e.g. double or std::complex<double>).
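The layout listed above can be exercised with a small round trip through a stream. The sketch below is a hedged illustration for real-valued matrices only: `CsrData`, `write_csr` and `read_csr` are hypothetical names, the version number is treated as an opaque int, and the code assumes that writer and reader share the same int size and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct CsrData
{
    int                 m = 0; // number of rows
    int                 n = 0; // number of columns
    std::vector<int>    row_ptr;
    std::vector<int>    col_ind;
    std::vector<double> val;
};

template <typename T>
void write_raw(std::ostream& out, const T* data, std::size_t count)
{
    out.write(reinterpret_cast<const char*>(data),
              static_cast<std::streamsize>(count * sizeof(T)));
}

template <typename T>
void read_raw(std::istream& in, T* data, std::size_t count)
{
    in.read(reinterpret_cast<char*>(data),
            static_cast<std::streamsize>(count * sizeof(T)));
}

void write_csr(std::ostream& out, int version, const CsrData& a)
{
    assert(a.row_ptr.size() == static_cast<std::size_t>(a.m) + 1);
    assert(a.col_ind.size() == a.val.size());
    const int nnz = static_cast<int>(a.val.size());

    out << "#rocALUTION binary csr file" << std::endl; // header line
    write_raw(out, &version, 1);
    write_raw(out, &a.m, 1);
    write_raw(out, &a.n, 1);
    write_raw(out, &nnz, 1);
    write_raw(out, a.row_ptr.data(), a.row_ptr.size());
    write_raw(out, a.col_ind.data(), a.col_ind.size());
    write_raw(out, a.val.data(), a.val.size());
}

bool read_csr(std::istream& in, int& version, CsrData& a)
{
    std::string header;
    std::getline(in, header);
    if(header != "#rocALUTION binary csr file")
        return false;

    int nnz = 0;
    read_raw(in, &version, 1);
    read_raw(in, &a.m, 1);
    read_raw(in, &a.n, 1);
    read_raw(in, &nnz, 1);
    if(!in || a.m < 0 || a.n < 0 || nnz < 0)
        return false;

    a.row_ptr.resize(static_cast<std::size_t>(a.m) + 1);
    a.col_ind.resize(static_cast<std::size_t>(nnz));
    a.val.resize(static_cast<std::size_t>(nnz));
    read_raw(in, a.row_ptr.data(), a.row_ptr.size());
    read_raw(in, a.col_ind.data(), a.col_ind.size());
    read_raw(in, a.val.data(), a.val.size());
    return static_cast<bool>(in);
}
```

When used with files instead of a `std::stringstream`, the streams should be opened with `std::ios::binary` so that the raw bytes are not altered.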

Example

LocalMatrix<ValueType> mat;

// Allocate and fill mat
// ...

mat.WriteFileCSR("my_matrix.csr");

Parameters

[in] filename: name of the file to write the data to.

void rocalution::LocalMatrix::CopyFrom(const LocalMatrix<ValueType> &src)¶

Copy matrix from another LocalMatrix.

CopyFrom copies values and structure from another local matrix. Source and destination matrix should be in the same format.

Note

This function allows cross platform copying. One of the objects could be allocated on the accelerator backend.

Example

LocalMatrix<ValueType> mat1, mat2;

// Allocate and initialize mat1 and mat2
// ...

// Move mat1 to accelerator
mat1.MoveToAccelerator();

// Now, mat1 is on the accelerator (if available)
// and mat2 is on the host

// Copying mat2 into mat1 (or vice versa) moves data between host and
// accelerator backend
mat1.CopyFrom(mat2);

Parameters

[in] src: Local matrix where values and structure should be copied from.

void rocalution::LocalMatrix::CopyFromAsync(const LocalMatrix<ValueType> &src)¶: Async copy matrix (values and structure) from another LocalMatrix.

void rocalution::LocalMatrix::CloneFrom(const LocalMatrix<ValueType> &src)¶

Clone the matrix.

CloneFrom clones the entire matrix, including values, structure and backend descriptor from another LocalMatrix.

Example

LocalMatrix<ValueType> mat;

// Allocate and initialize mat (host or accelerator)
// ...

LocalMatrix<ValueType> tmp;

// By cloning mat, tmp will have identical values and structure and will be on
// the same backend as mat
tmp.CloneFrom(mat);

Parameters

[in] src: LocalMatrix to clone from.

void rocalution::LocalMatrix::UpdateValuesCSR(ValueType *val)¶: Update CSR matrix entries only, structure will remain the same.

void rocalution::LocalMatrix::CopyFromCSR(const int *row_offsets, const int *col, const ValueType *val)¶: Copy (import) CSR matrix described in three arrays (offsets, columns, values). The object data has to be allocated (call AllocateCSR first)

void rocalution::LocalMatrix::CopyToCSR(int *row_offsets, int *col, ValueType *val) const¶: Copy (export) CSR matrix described in three arrays (offsets, columns, values). The output arrays have to be allocated.

void rocalution::LocalMatrix::CopyFromCOO(const int *row, const int *col, const ValueType *val)¶: Copy (import) COO matrix described in three arrays (rows, columns, values). The object data has to be allocated (call AllocateCOO first)

void rocalution::LocalMatrix::CopyToCOO(int *row, int *col, ValueType *val) const¶: Copy (export) COO matrix described in three arrays (rows, columns, values). The output arrays have to be allocated.
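
To make the three-array layouts used by the CSR and COO import/export functions concrete, here is a minimal standalone sketch in plain C++. It does not call rocALUTION; the names `CsrArrays`, `CooArrays`, `csr_to_coo` and `example_csr` are illustrative assumptions, not library API. It expands the compressed CSR row offsets into one explicit COO row index per entry.

```cpp
#include <cassert>
#include <vector>

// Illustrative containers only (hypothetical names, not rocALUTION types).
struct CsrArrays
{
    std::vector<int>    row_offsets; // size nrow + 1
    std::vector<int>    col;         // size nnz
    std::vector<double> val;         // size nnz
};

struct CooArrays
{
    std::vector<int>    row; // size nnz
    std::vector<int>    col; // size nnz
    std::vector<double> val; // size nnz
};

// Expand the compressed row offsets into one explicit row index per entry.
CooArrays csr_to_coo(const CsrArrays& csr)
{
    assert(!csr.row_offsets.empty());
    assert(csr.col.size() == csr.val.size());

    CooArrays coo;
    coo.col = csr.col;
    coo.val = csr.val;
    coo.row.reserve(csr.val.size());

    const int nrow = static_cast<int>(csr.row_offsets.size()) - 1;
    for(int i = 0; i < nrow; ++i)
    {
        for(int j = csr.row_offsets[i]; j < csr.row_offsets[i + 1]; ++j)
        {
            coo.row.push_back(i);
        }
    }
    return coo;
}

// 3x3 example matrix
//   [ 2 -1  0 ]
//   [-1  2 -1 ]
//   [ 0 -1  2 ]
CsrArrays example_csr()
{
    return CsrArrays{{0, 2, 5, 7},
                     {0, 1, 0, 1, 2, 1, 2},
                     {2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0}};
}
```

Arrays laid out like `row_offsets`, `col` and `val` here are what the CSR functions above expect; the COO functions take the explicit `row` array instead of the offsets.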

void rocalution::LocalMatrix::CopyFromHostCSR(const int *row_offset, const int *col, const ValueType *val, const std::string name, int nnz, int nrow, int ncol)¶

Allocates and copies (imports) a host CSR matrix.

If the CSR matrix data pointers are only accessible as constant, the user can create a LocalMatrix object and pass const CSR host pointers. The LocalMatrix is then allocated and the data is copied to the backend on which the LocalMatrix object was located.

Parameters

[in] row_offset: CSR matrix row offset pointers.
[in] col: CSR matrix column indices.
[in] val: CSR matrix values array.
[in] name: Matrix object name.
[in] nnz: Number of non-zero elements.
[in] nrow: Number of rows.
[in] ncol: Number of columns.

void rocalution::LocalMatrix::CreateFromMap(const LocalVector<int> &map, int n, int m): Create a restriction matrix operator based on an int vector map.

void rocalution::LocalMatrix::CreateFromMap(const LocalVector<int> &map, int n, int m, LocalMatrix<ValueType> *pro): Create a restriction and prolongation matrix operator based on an int vector map.

void rocalution::LocalMatrix::ConvertToCSR(void)¶: Convert the matrix to CSR structure.

void rocalution::LocalMatrix::ConvertToMCSR(void)¶: Convert the matrix to MCSR structure.

void rocalution::LocalMatrix::ConvertToBCSR(void)¶: Convert the matrix to BCSR structure.

void rocalution::LocalMatrix::ConvertToCOO(void)¶: Convert the matrix to COO structure.

void rocalution::LocalMatrix::ConvertToELL(void)¶: Convert the matrix to ELL structure.

void rocalution::LocalMatrix::ConvertToDIA(void)¶: Convert the matrix to DIA structure.

void rocalution::LocalMatrix::ConvertToHYB(void)¶: Convert the matrix to HYB structure.

void rocalution::LocalMatrix::ConvertToDENSE(void)¶: Convert the matrix to DENSE structure.

void rocalution::LocalMatrix::ConvertTo(unsigned int matrix_format)¶: Convert the matrix to specified matrix ID format.

void rocalution::LocalMatrix::SymbolicPower(int p)¶: Perform symbolic computation (structure only) of $|this|^p$.

void rocalution::LocalMatrix::MatrixAdd(const LocalMatrix<ValueType> &mat, ValueType alpha = static_cast<ValueType>(1), ValueType beta = static_cast<ValueType>(1), bool structure = false)¶

Perform matrix addition, this = alpha*this + beta*mat.

If structure==false, the sparsity pattern of the matrix is not changed.
If structure==true, a new sparsity pattern is computed.

void rocalution::LocalMatrix::MatrixMult(const LocalMatrix<ValueType> &A, const LocalMatrix<ValueType> &B)¶: Multiply two matrices, this = A * B.

void rocalution::LocalMatrix::DiagonalMatrixMult(const LocalVector<ValueType> &diag)¶: Multiply the matrix with a diagonal matrix (stored in LocalVector); equivalent to DiagonalMatrixMultR(), i.e. this=this*diag.

void rocalution::LocalMatrix::DiagonalMatrixMultL(const LocalVector<ValueType> &diag)¶: Multiply the matrix with diagonal matrix (stored in LocalVector), this=diag*this.

void rocalution::LocalMatrix::DiagonalMatrixMultR(const LocalVector<ValueType> &diag)¶: Multiply the matrix with diagonal matrix (stored in LocalVector), this=this*diag.
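
The difference between left and right multiplication is easiest to see on raw CSR arrays. The following standalone sketch is plain C++, not rocALUTION code, and the function names `diag_mult_left` and `diag_mult_right` are hypothetical. It scales rows for this=diag*this and columns for this=this*diag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// this = diag * this: every entry in row i is multiplied by diag[i].
void diag_mult_left(const std::vector<int>&    row_offsets,
                    const std::vector<double>& diag,
                    std::vector<double>&       val)
{
    assert(row_offsets.size() == diag.size() + 1);
    for(std::size_t i = 0; i < diag.size(); ++i)
    {
        for(int j = row_offsets[i]; j < row_offsets[i + 1]; ++j)
        {
            val[j] *= diag[i];
        }
    }
}

// this = this * diag: every entry in column c is multiplied by diag[c].
void diag_mult_right(const std::vector<int>&    col,
                     const std::vector<double>& diag,
                     std::vector<double>&       val)
{
    assert(col.size() == val.size());
    for(std::size_t j = 0; j < val.size(); ++j)
    {
        val[j] *= diag[col[j]];
    }
}
```

In both cases the sparsity pattern is untouched; only the values array changes.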

void rocalution::LocalMatrix::Gershgorin(ValueType &lambda_min, ValueType &lambda_max) const¶: Compute the spectrum approximation with Gershgorin circles theorem.
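
As a rough illustration of the Gershgorin circle theorem, the sketch below computes spectrum bounds for a real CSR matrix in plain C++. It is not the library implementation and `gershgorin_bounds` is a hypothetical name. Each row i contributes a disc centered at a_ii with radius equal to the sum of |a_ij| over j != i, and every eigenvalue lies in the union of these discs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Lower and upper bounds on the (real parts of the) eigenvalues of a real CSR matrix.
void gershgorin_bounds(const std::vector<int>&    row_offsets,
                       const std::vector<int>&    col,
                       const std::vector<double>& val,
                       double&                    lambda_min,
                       double&                    lambda_max)
{
    assert(row_offsets.size() >= 2);
    assert(col.size() == val.size());

    lambda_min = std::numeric_limits<double>::infinity();
    lambda_max = -std::numeric_limits<double>::infinity();

    const int nrow = static_cast<int>(row_offsets.size()) - 1;
    for(int i = 0; i < nrow; ++i)
    {
        double center = 0.0; // diagonal entry a_ii (zero if not stored)
        double radius = 0.0; // sum of |a_ij| for j != i

        for(int j = row_offsets[i]; j < row_offsets[i + 1]; ++j)
        {
            if(col[j] == i)
            {
                center = val[j];
            }
            else
            {
                radius += std::abs(val[j]);
            }
        }

        lambda_min = std::min(lambda_min, center - radius);
        lambda_max = std::max(lambda_max, center + radius);
    }
}
```

For the tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals, the middle rows give the disc [0, 4], which contains all eigenvalues.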

void rocalution::LocalMatrix::Compress(double drop_off)¶: Delete all entries in the matrix for which abs(a_ij) <= drop_off; the diagonal elements are never deleted.
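
The drop rule can be sketched on raw CSR arrays as follows. This is a standalone illustration in plain C++ under the rule stated above (entries with abs(a_ij) <= drop_off are removed, diagonal entries are kept); `compress_csr` is a hypothetical helper, not the library routine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Remove off-diagonal entries with |a_ij| <= drop_off and rebuild the CSR arrays.
void compress_csr(std::vector<int>&    row_offsets,
                  std::vector<int>&    col,
                  std::vector<double>& val,
                  double               drop_off)
{
    assert(!row_offsets.empty());
    assert(col.size() == val.size());

    std::vector<int>    new_offsets(row_offsets.size(), 0);
    std::vector<int>    new_col;
    std::vector<double> new_val;

    const int nrow = static_cast<int>(row_offsets.size()) - 1;
    for(int i = 0; i < nrow; ++i)
    {
        for(int j = row_offsets[i]; j < row_offsets[i + 1]; ++j)
        {
            const bool is_diagonal = (col[j] == i);
            if(is_diagonal || std::abs(val[j]) > drop_off)
            {
                new_col.push_back(col[j]);
                new_val.push_back(val[j]);
            }
        }
        // Offsets are rebuilt because rows may have shrunk.
        new_offsets[i + 1] = static_cast<int>(new_col.size());
    }

    row_offsets.swap(new_offsets);
    col.swap(new_col);
    val.swap(new_val);
}
```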

void rocalution::LocalMatrix::Transpose(void)¶: Transpose the matrix.

void rocalution::LocalMatrix::Sort(void)¶

Sort the matrix indices.

Sorts the matrix by indices.

For CSR matrices, column indices are sorted.
For COO matrices, row indices are sorted.
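
For the CSR case, sorting can be pictured as reordering the column indices inside each row while carrying the values along. The sketch below is a standalone illustration in plain C++ with a hypothetical function name `sort_csr_rows`. It assumes sorting is done row by row in ascending column order, which the text above does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sort the column indices of every CSR row in ascending order, keeping
// each value attached to its column index.
void sort_csr_rows(const std::vector<int>& row_offsets,
                   std::vector<int>&       col,
                   std::vector<double>&    val)
{
    assert(!row_offsets.empty());
    assert(col.size() == val.size());

    const int nrow = static_cast<int>(row_offsets.size()) - 1;
    for(int i = 0; i < nrow; ++i)
    {
        const int begin = row_offsets[i];
        const int end   = row_offsets[i + 1];

        // Positions of this row's entries, ordered by their column index.
        std::vector<int> order(end - begin);
        std::iota(order.begin(), order.end(), begin);
        std::sort(order.begin(), order.end(), [&col](int a, int b) { return col[a] < col[b]; });

        std::vector<int>    sorted_col;
        std::vector<double> sorted_val;
        for(int p : order)
        {
            sorted_col.push_back(col[p]);
            sorted_val.push_back(val[p]);
        }

        std::copy(sorted_col.begin(), sorted_col.end(), col.begin() + begin);
        std::copy(sorted_val.begin(), sorted_val.end(), val.begin() + begin);
    }
}
```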

void rocalution::LocalMatrix::Key(long int &row_key, long int &col_key, long int &val_key) const¶

Compute a unique hash key for the matrix arrays.

Comparing whether two matrices have the same structure (and values) is typically hard. To do so, rocALUTION provides a keying function that generates three keys: one each for the row index, column index and values arrays.

Parameters

[out] row_key: row index array key
[out] col_key: column index array key
[out] val_key: values array key

void rocalution::LocalMatrix::ReplaceColumnVector(int idx, const LocalVector<ValueType> &vec)¶: Replace a column vector of a matrix.

void rocalution::LocalMatrix::ReplaceRowVector(int idx, const LocalVector<ValueType> &vec)¶: Replace a row vector of a matrix.

void rocalution::LocalMatrix::ExtractColumnVector(int idx, LocalVector<ValueType> *vec) const¶: Extract values from a column of a matrix to a vector.

void rocalution::LocalMatrix::ExtractRowVector(int idx, LocalVector<ValueType> *vec) const¶: Extract values from a row of a matrix to a vector.

void rocalution::LocalMatrix::AMGConnect(ValueType eps, LocalVector<int> *connections) const¶: Strong couplings for aggregation-based AMG.

void rocalution::LocalMatrix::AMGAggregate(const LocalVector<int> &connections, LocalVector<int> *aggregates) const¶: Plain aggregation - Modification of a greedy aggregation scheme from Vanek (1996)

void rocalution::LocalMatrix::AMGSmoothedAggregation(ValueType relax, const LocalVector<int> &aggregates, const LocalVector<int> &connections, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶: Interpolation scheme based on smoothed aggregation from Vanek (1996)

void rocalution::LocalMatrix::AMGAggregation(const LocalVector<int> &aggregates, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶: Aggregation-based interpolation scheme.

void rocalution::LocalMatrix::RugeStueben(ValueType eps, LocalMatrix<ValueType> *prolong, LocalMatrix<ValueType> *restrict) const¶: Ruge Stueben coarsening.

void rocalution::LocalMatrix::FSAI(int power, const LocalMatrix<ValueType> *pattern)¶: Factorized Sparse Approximate Inverse assembly for given system matrix power pattern or external sparsity pattern.

void rocalution::LocalMatrix::SPAI(void)¶: SParse Approximate Inverse assembly for given system matrix pattern.

void rocalution::LocalMatrix::InitialPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const: Initial Pairwise Aggregation scheme.

void rocalution::LocalMatrix::InitialPairwiseAggregation(const LocalMatrix<ValueType> &mat, ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const: Initial Pairwise Aggregation scheme for split matrices.

void rocalution::LocalMatrix::FurtherPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const: Further Pairwise Aggregation scheme.

void rocalution::LocalMatrix::FurtherPairwiseAggregation(const LocalMatrix<ValueType> &mat, ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const: Further Pairwise Aggregation scheme for split matrices.

void rocalution::LocalMatrix::CoarsenOperator(LocalMatrix<ValueType> *Ac, int nrow, int ncol, const LocalVector<int> &G, int Gsize, const int *rG, int rGsize) const¶: Build coarse operator for pairwise aggregation scheme.

Local Stencil¶

template<typename ValueType> class LocalStencil : public rocalution::Operator<ValueType>¶

LocalStencil class.

A LocalStencil is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters

ValueType: - can be int, float, double, std::complex<float> and std::complex<double>

rocalution::LocalStencil::LocalStencil(unsigned int type): Initialize a local stencil with a type.

int rocalution::LocalStencil::GetNDim(void) const¶: Return the dimension of the stencil.

void rocalution::LocalStencil::SetGrid(int size)¶: Set the stencil grid size.

Global Matrix¶

template<typename ValueType> class GlobalMatrix : public rocalution::Operator<ValueType>¶

GlobalMatrix class.

A GlobalMatrix is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.

Template Parameters

ValueType: - can be int, float, double, std::complex<float> and std::complex<double>

rocalution::GlobalMatrix::GlobalMatrix(const ParallelManager &pm): Initialize a global matrix with a parallel manager.

bool rocalution::GlobalMatrix::Check(void) const¶: Return true if the matrix is ok (empty matrix is also ok) and false if there is something wrong with the structure or some of the values are NaN.

void rocalution::GlobalMatrix::AllocateCSR(std::string name, int local_nnz, int ghost_nnz)¶: Allocate CSR Matrix.

void rocalution::GlobalMatrix::AllocateCOO(std::string name, int local_nnz, int ghost_nnz)¶: Allocate COO Matrix.

void rocalution::GlobalMatrix::SetParallelManager(const ParallelManager &pm)¶: Set the parallel manager of a global matrix.

void rocalution::GlobalMatrix::SetDataPtrCSR(int **local_row_offset, int **local_col, ValueType **local_val, int **ghost_row_offset, int **ghost_col, ValueType **ghost_val, std::string name, int local_nnz, int ghost_nnz)¶: Initialize a CSR matrix on the host with externally allocated data.

void rocalution::GlobalMatrix::SetDataPtrCOO(int **local_row, int **local_col, ValueType **local_val, int **ghost_row, int **ghost_col, ValueType **ghost_val, std::string name, int local_nnz, int ghost_nnz)¶: Initialize a COO matrix on the host with externally allocated data.

void rocalution::GlobalMatrix::SetLocalDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz)¶: Initialize a CSR matrix on the host with externally allocated local data.

void rocalution::GlobalMatrix::SetLocalDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz)¶: Initialize a COO matrix on the host with externally allocated local data.

void rocalution::GlobalMatrix::SetGhostDataPtrCSR(int **row_offset, int **col, ValueType **val, std::string name, int nnz)¶: Initialize a CSR matrix on the host with externally allocated ghost data.

void rocalution::GlobalMatrix::SetGhostDataPtrCOO(int **row, int **col, ValueType **val, std::string name, int nnz)¶: Initialize a COO matrix on the host with externally allocated ghost data.

void rocalution::GlobalMatrix::LeaveDataPtrCSR(int **local_row_offset, int **local_col, ValueType **local_val, int **ghost_row_offset, int **ghost_col, ValueType **ghost_val)¶: Leave a CSR matrix to host pointers.

void rocalution::GlobalMatrix::LeaveDataPtrCOO(int **local_row, int **local_col, ValueType **local_val, int **ghost_row, int **ghost_col, ValueType **ghost_val)¶: Leave a COO matrix to host pointers.

void rocalution::GlobalMatrix::LeaveLocalDataPtrCSR(int **row_offset, int **col, ValueType **val)¶: Leave a local CSR matrix to host pointers.

void rocalution::GlobalMatrix::LeaveLocalDataPtrCOO(int **row, int **col, ValueType **val)¶: Leave a local COO matrix to host pointers.

void rocalution::GlobalMatrix::LeaveGhostDataPtrCSR(int **row_offset, int **col, ValueType **val)¶: Leave a CSR ghost matrix to host pointers.

void rocalution::GlobalMatrix::LeaveGhostDataPtrCOO(int **row, int **col, ValueType **val)¶: Leave a COO ghost matrix to host pointers.

void rocalution::GlobalMatrix::CloneFrom(const GlobalMatrix<ValueType> &src)¶: Clone the entire matrix (values, structure and backend descriptor) from another GlobalMatrix.

void rocalution::GlobalMatrix::CopyFrom(const GlobalMatrix<ValueType> &src)¶: Copy matrix (values and structure) from another GlobalMatrix.

void rocalution::GlobalMatrix::ConvertToCSR(void)¶: Convert the matrix to CSR structure.

void rocalution::GlobalMatrix::ConvertToMCSR(void)¶: Convert the matrix to MCSR structure.

void rocalution::GlobalMatrix::ConvertToBCSR(void)¶: Convert the matrix to BCSR structure.

void rocalution::GlobalMatrix::ConvertToCOO(void)¶: Convert the matrix to COO structure.

void rocalution::GlobalMatrix::ConvertToELL(void)¶: Convert the matrix to ELL structure.

void rocalution::GlobalMatrix::ConvertToDIA(void)¶: Convert the matrix to DIA structure.

void rocalution::GlobalMatrix::ConvertToHYB(void)¶: Convert the matrix to HYB structure.

void rocalution::GlobalMatrix::ConvertToDENSE(void)¶: Convert the matrix to DENSE structure.

void rocalution::GlobalMatrix::ConvertTo(unsigned int matrix_format)¶: Convert the matrix to specified matrix ID format.

void rocalution::GlobalMatrix::ReadFileMTX(const std::string filename)¶: Read matrix from MTX (Matrix Market Format) file.

void rocalution::GlobalMatrix::WriteFileMTX(const std::string filename) const¶: Write matrix to MTX (Matrix Market Format) file.

void rocalution::GlobalMatrix::ReadFileCSR(const std::string filename)¶: Read matrix from CSR (ROCALUTION binary format) file.

void rocalution::GlobalMatrix::WriteFileCSR(const std::string filename) const¶: Write matrix to CSR (ROCALUTION binary format) file.

void rocalution::GlobalMatrix::Sort(void)¶: Sort the matrix indices.

void rocalution::GlobalMatrix::ExtractInverseDiagonal(GlobalVector<ValueType> *vec_inv_diag) const¶: Extract the inverse (reciprocal) diagonal values of the matrix into a GlobalVector.

void rocalution::GlobalMatrix::Scale(ValueType alpha)¶: Scale all the values in the matrix.

void rocalution::GlobalMatrix::InitialPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const¶: Initial Pairwise Aggregation scheme.

void rocalution::GlobalMatrix::FurtherPairwiseAggregation(ValueType beta, int &nc, LocalVector<int> *G, int &Gsize, int **rG, int &rGsize, int ordering) const¶: Further Pairwise Aggregation scheme.

void rocalution::GlobalMatrix::CoarsenOperator(GlobalMatrix<ValueType> *Ac, ParallelManager *pm, int nrow, int ncol, const LocalVector<int> &G, int Gsize, const int *rG, int rGsize) const¶: Build coarse operator for pairwise aggregation scheme.

Local Vector¶

template<typename ValueType> class LocalVector : public rocalution::Vector<ValueType>¶

LocalVector class.

A LocalVector is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters

ValueType: - can be int, float, double, std::complex<float> and std::complex<double>

void rocalution::LocalVector::Allocate(std::string name, IndexType2 size)¶

Allocate a local vector with name and size.

The local vector allocation function requires a name of the object (this is only for information purposes) and corresponding size description for vector objects.

Example

LocalVector<ValueType> vec;

vec.Allocate("my vector", 100);
vec.Clear();

Parameters

[in] name: object name
[in] size: number of elements in the vector

void rocalution::LocalVector::SetDataPtr(ValueType **ptr, std::string name, int size)¶

Initialize a LocalVector on the host with externally allocated data.

SetDataPtr has direct access to the raw data via pointers. Already allocated data can be set by passing the pointer.

Note

Setting data pointer will leave the original pointer empty (set to NULL).

Example

// Allocate vector
ValueType* ptr_vec = new ValueType[200];

// Fill vector
// ...

// rocALUTION local vector object
LocalVector<ValueType> vec;

// Set the vector data, ptr_vec will become invalid
vec.SetDataPtr(&ptr_vec, "my_vector", 200);

void rocalution::LocalVector::LeaveDataPtr(ValueType **ptr)¶

Leave a LocalVector to host pointers.

LeaveDataPtr has direct access to the raw data via pointers. A LocalVector object can leave its raw data to a host pointer. This will leave the LocalVector empty.

Example

// rocALUTION local vector object
LocalVector<ValueType> vec;

// Allocate the vector
vec.Allocate("my_vector", 100);

// Fill vector
// ...

ValueType* ptr_vec = NULL;

// Get (steal) the data from the vector, this will leave the local vector object empty
vec.LeaveDataPtr(&ptr_vec);

ValueType &rocalution::LocalVector::operator[](int i)

Access operator (only for host data)

The elements in the vector can be accessed via [] operators, when the vector is allocated on the host.

Return

value at index i

Example

// rocALUTION local vector object
LocalVector<ValueType> vec;

// Allocate vector
vec.Allocate("my_vector", 100);

// Initialize vector with 1
vec.Ones();

// Set even elements to -1
for(int i = 0; i < vec.GetSize(); i += 2)
{
  vec[i] = -1;
}

Parameters

[in] i: access data at index i

const ValueType &rocalution::LocalVector::operator[](int i) const

Access operator (only for host data)

The elements in the vector can be accessed via [] operators, when the vector is allocated on the host.

Return

value at index i

Example

// rocALUTION local vector object
LocalVector<ValueType> vec;

// Allocate vector
vec.Allocate("my_vector", 100);

// Initialize vector with 1
vec.Ones();

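// Read the elements through a const reference, which selects the const overload
const LocalVector<ValueType>& cvec = vec;
ValueType sum = static_cast<ValueType>(0);
for(int i = 0; i < cvec.GetSize(); ++i)
{
  sum += cvec[i];
}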

Parameters

[in] i: access data at index i

void rocalution::LocalVector::CopyFromPermute(const LocalVector<ValueType> &src, const LocalVector<int> &permutation)¶: Copy a vector under permutation (forward permutation)

void rocalution::LocalVector::CopyFromPermuteBackward(const LocalVector<ValueType> &src, const LocalVector<int> &permutation)¶: Copy a vector under permutation (backward permutation)
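
The two permutation directions can be illustrated on plain arrays. The sketch below is standalone C++ with hypothetical function names. It assumes that the forward permutation scatters (dst[perm[i]] = src[i]) and the backward permutation gathers (dst[i] = src[perm[i]]); this convention is an assumption and is not stated in the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed forward convention: dst[perm[i]] = src[i].
std::vector<double> permute_forward(const std::vector<double>& src, const std::vector<int>& perm)
{
    assert(src.size() == perm.size());
    std::vector<double> dst(src.size());
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        dst[perm[i]] = src[i];
    }
    return dst;
}

// Assumed backward convention: dst[i] = src[perm[i]].
std::vector<double> permute_backward(const std::vector<double>& src, const std::vector<int>& perm)
{
    assert(src.size() == perm.size());
    std::vector<double> dst(src.size());
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        dst[i] = src[perm[i]];
    }
    return dst;
}
```

Under these conventions the backward permutation undoes the forward one when both use the same permutation vector.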

void rocalution::LocalVector::CopyFromData(const ValueType *data)¶

Copy (import) vector.

Copy (import) vector data that is described in one array (values). The object data has to be allocated with Allocate(), using the corresponding size of the data, first.

Parameters

[in] data: data to be imported.

void rocalution::LocalVector::CopyToData(ValueType *data) const¶

Copy (export) vector.

Copy (export) vector data that is described in one array (values). The output array has to be allocated, using the corresponding size of the data, first. The size can be obtained with GetSize().

Parameters

[out] data: exported data.

void rocalution::LocalVector::Permute(const LocalVector<int> &permutation)¶: Perform in-place permutation (forward) of the vector.

void rocalution::LocalVector::PermuteBackward(const LocalVector<int> &permutation)¶: Perform in-place permutation (backward) of the vector.

void rocalution::LocalVector::Restriction(const LocalVector<ValueType> &vec_fine, const LocalVector<int> &map)¶: Restriction operator based on restriction mapping vector.

void rocalution::LocalVector::Prolongation(const LocalVector<ValueType> &vec_coarse, const LocalVector<int> &map)¶: Prolongation operator based on restriction mapping vector.
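
A mapping-based restriction and prolongation pair can be sketched on plain arrays as follows. This is a standalone C++ illustration with hypothetical function names. It assumes that map[i] holds the coarse index of fine entry i (or -1 if the entry is unassigned), that restriction sums fine values into their coarse entry, and that prolongation copies the coarse value back; the library's exact semantics may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed restriction: coarse[map[i]] += fine[i] for all mapped fine entries.
std::vector<double> restrict_by_map(const std::vector<double>& fine,
                                    const std::vector<int>&    map,
                                    int                        ncoarse)
{
    assert(fine.size() == map.size());
    std::vector<double> coarse(ncoarse, 0.0);
    for(std::size_t i = 0; i < fine.size(); ++i)
    {
        if(map[i] != -1)
        {
            coarse[map[i]] += fine[i];
        }
    }
    return coarse;
}

// Assumed prolongation: fine[i] = coarse[map[i]], and 0 for unmapped entries.
std::vector<double> prolong_by_map(const std::vector<double>& coarse, const std::vector<int>& map)
{
    std::vector<double> fine(map.size(), 0.0);
    for(std::size_t i = 0; i < map.size(); ++i)
    {
        if(map[i] != -1)
        {
            assert(map[i] < static_cast<int>(coarse.size()));
            fine[i] = coarse[map[i]];
        }
    }
    return fine;
}
```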

void rocalution::LocalVector::SetIndexArray(int size, const int *index)¶: Set index array.

void rocalution::LocalVector::GetIndexValues(ValueType *values) const¶: Get indexed values.

void rocalution::LocalVector::SetIndexValues(const ValueType *values)¶: Set indexed values.

void rocalution::LocalVector::GetContinuousValues(int start, int end, ValueType *values) const¶: Get continuous indexed values.

void rocalution::LocalVector::SetContinuousValues(int start, int end, const ValueType *values)¶: Set continuous indexed values.

void rocalution::LocalVector::ExtractCoarseMapping(int start, int end, const int *index, int nc, int *size, int *map) const¶: Extract coarse boundary mapping.

void rocalution::LocalVector::ExtractCoarseBoundary(int start, int end, const int *index, int nc, int *size, int *boundary) const¶: Extract coarse boundary index.

Global Vector¶

template<typename ValueType> class GlobalVector : public rocalution::Vector<ValueType>¶

GlobalVector class.

A GlobalVector is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.

Template Parameters

ValueType: - can be int, float, double, std::complex<float> and std::complex<double>

rocalution::GlobalVector::GlobalVector(const ParallelManager &pm): Initialize a global vector with a parallel manager.

void rocalution::GlobalVector::Allocate(std::string name, IndexType2 size)¶: Allocate a global vector with name and size.

void rocalution::GlobalVector::SetParallelManager(const ParallelManager &pm)¶: Set the parallel manager of a global vector.

ValueType &rocalution::GlobalVector::operator[](int i): Access operator (only for host data)

const ValueType &rocalution::GlobalVector::operator[](int i) const: Access operator (only for host data)

void rocalution::GlobalVector::SetDataPtr(ValueType **ptr, std::string name, IndexType2 size)¶: Initialize the local part of a global vector with externally allocated data.

void rocalution::GlobalVector::LeaveDataPtr(ValueType **ptr)¶: Get a pointer to the data from the local part of a global vector and free the global vector object.

void rocalution::GlobalVector::Restriction(const GlobalVector<ValueType> &vec_fine, const LocalVector<int> &map)¶: Restriction operator based on restriction mapping vector.

void rocalution::GlobalVector::Prolongation(const GlobalVector<ValueType> &vec_coarse, const LocalVector<int> &map)¶: Prolongation operator based on restriction mapping vector.

Parallel Manager¶

class ParallelManager : public rocalution::RocalutionObj¶

Parallel Manager class.

The parallel manager class handles the communication and the mapping of the global operators. Each global operator and vector needs to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulations, the underlying operator is already distributed. This information needs to be passed to the parallel manager.

void rocalution::ParallelManager::SetMPICommunicator(const void *comm)¶: Set the MPI communicator.

void rocalution::ParallelManager::Clear(void)¶: Clear all allocated resources.

IndexType2 rocalution::ParallelManager::GetGlobalSize(void) const¶: Return the global size.

int rocalution::ParallelManager::GetLocalSize(void) const¶: Return the local size.

int rocalution::ParallelManager::GetNumReceivers(void) const¶: Return the number of receivers.

int rocalution::ParallelManager::GetNumSenders(void) const¶: Return the number of senders.

int rocalution::ParallelManager::GetNumProcs(void) const¶: Return the number of involved processes.

void rocalution::ParallelManager::SetGlobalSize(IndexType2 size)¶: Initialize the global size.

void rocalution::ParallelManager::SetLocalSize(int size)¶: Initialize the local size.

void rocalution::ParallelManager::SetBoundaryIndex(int size, const int *index)¶: Set all boundary indices of this rank's process.

void rocalution::ParallelManager::SetReceivers(int nrecv, const int *recvs, const int *recv_offset)¶: Set the number of processes the current process receives data from, the array of those processes, and the offsets where the boundary for each receiver process starts.

void rocalution::ParallelManager::SetSenders(int nsend, const int *sends, const int *send_offset)¶: Set the number of processes the current process sends data to, the array of those processes, and the offsets where the ghost part for each sender process starts.

void rocalution::ParallelManager::LocalToGlobal(int proc, int local, int &global)¶: Mapping local to global.

void rocalution::ParallelManager::GlobalToLocal(int global, int &proc, int &local)¶: Mapping global to local.

bool rocalution::ParallelManager::Status(void) const¶: Check sanity status of parallel manager.

void rocalution::ParallelManager::ReadFileASCII(const std::string filename)¶: Read file that contains all relevant parallel manager data.

void rocalution::ParallelManager::WriteFileASCII(const std::string filename) const¶: Write file that contains all relevant parallel manager data.

Solvers¶

template<class OperatorType, class VectorType, typename ValueType> class Solver : public rocalution::RocalutionObj¶

Base class for all solvers and preconditioners.

Most of the solvers can be performed on linear operators LocalMatrix, LocalStencil and GlobalMatrix - i.e. the solvers can be performed locally (on a shared memory system) or in a distributed manner (on a cluster) via MPI. The only exception is the AMG (Algebraic Multigrid) solver which has two versions (one for LocalMatrix and one for GlobalMatrix class). The only pure local solvers (which do not support global/MPI operations) are the mixed-precision defect-correction solver and all direct solvers.

All solvers need three template parameters - Operators, Vectors and Scalar type.

The Solver class is purely virtual and provides an interface for

SetOperator() to set the operator $A$, i.e. the user can pass the matrix here.
Build() to build the solver (including preconditioners, sub-solvers, etc.). The user needs to specify the operator first before calling Build().
Solve() to solve the system $Ax = b$. The user needs to pass a right-hand side $b$ and a vector $x$, in which the solution will be stored.
Print() to show solver information.
ReBuildNumeric() to only re-build the solver numerically (if possible).
MoveToHost() and MoveToAccelerator() to offload the solver (including preconditioners and sub-solvers) to the host/accelerator.

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::DirectLinearSolver< OperatorType, VectorType, ValueType >, rocalution::IterativeLinearSolver< OperatorType, VectorType, ValueType >, rocalution::Preconditioner< OperatorType, VectorType, ValueType >

void rocalution::Solver::SetOperator(const OperatorType &op)¶: Set the Operator of the solver.

void rocalution::Solver::ResetOperator(const OperatorType &op)¶: Reset the operator; see ReBuildNumeric()

virtual void rocalution::Solver::Print(void) const = 0¶: Print information about the solver.

virtual void rocalution::Solver::Solve(const VectorType &rhs, VectorType *x) = 0¶: Solve Operator x = rhs.

void rocalution::Solver::SolveZeroSol(const VectorType &rhs, VectorType *x)¶: Solve Operator x = rhs, setting initial x = 0.

void rocalution::Solver::Clear(void)¶: Clear (free all local data) the solver.

void rocalution::Solver::Build(void)¶: Build the solver (data allocation, structure and numerical computation)

void rocalution::Solver::BuildMoveToAcceleratorAsync(void)¶: Build the solver and move it to the accelerator asynchronously.

void rocalution::Solver::Sync(void)¶: Synchronize the solver.

void rocalution::Solver::ReBuildNumeric(void)¶: Rebuild the solver only with numerical computation (no allocation or data structure computation)

void rocalution::Solver::MoveToHost(void)¶: Move all data (i.e. move the solver) to the host.

void rocalution::Solver::MoveToAccelerator(void)¶: Move all data (i.e. move the solver) to the accelerator.

void rocalution::Solver::Verbose(int verb = 1)¶

Provide verbose output of the solver.

verb = 0 -> no output
verb = 1 -> print info about the solver (start, end);
verb = 2 -> print (iter, residual) via iteration control;

template<class OperatorType, class VectorType, typename ValueType> class IterativeLinearSolver : public rocalution::Solver<OperatorType, VectorType, ValueType>¶

Base class for all linear iterative solvers.

The iterative solvers are controlled by an iteration control object, which monitors the convergence properties of the solver, i.e. maximum number of iterations, relative tolerance, absolute tolerance and divergence tolerance. The iteration control can also record the residual history and store it in an ASCII file.

Init(), InitMinIter(), InitMaxIter() and InitTol() initialize the solver and set the stopping criteria.
RecordResidualHistory() and RecordHistory() start the recording of the residual and write it into a file.
Verbose() sets the level of verbose output of the solver (0 - no output, 2 - detailed output, including residual and iteration information).
SetPreconditioner() sets the preconditioning.

All iterative solvers are controlled based on

Absolute stopping criterion, when $|r_{k}|_{L_{p}} < \epsilon_{abs}$
Relative stopping criterion, when $|r_{k}|_{L_{p}} / |r_{1}|_{L_{p}} \leq \epsilon_{rel}$
Divergence stopping criterion, when $|r_{k}|_{L_{p}} / |r_{1}|_{L_{p}} \geq \epsilon_{div}$
Maximum number of iterations $N$, when $k = N$

where $k$ is the current iteration, $r_{k}$ the residual for the current iteration $k$ (i.e. $r_{k} = b - Ax_{k}$) and $r_{1}$ the starting residual (i.e. $r_{1} = b - Ax_{init}$). In addition, the minimum number of iterations $M$ can be specified. In this case, the solver will not stop iterating before $k \geq M$.

The $L_{p}$ norm is used for the computation, where $p$ can be 1, 2 or $\infty$. The norm computation can be set with SetResidualNorm() with 1 for $L_{1}$, 2 for $L_{2}$ and 3 for $L_{\infty}$. For the computation with $L_{\infty}$, the index of the maximum value can be obtained with GetAmaxResidualIndex(). If this function is called and $L_{\infty}$ was not selected, this function will return -1.

The criterion that has been reached can be obtained with GetSolverStatus(), which returns

0, if no criterion has been reached yet
1, if absolute tolerance has been reached
2, if relative tolerance has been reached
3, if divergence tolerance has been reached
4, if maximum number of iterations has been reached
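
The stopping criteria and status codes described above can be sketched as a single check function. This is a standalone C++ illustration, not the library's iteration control: the struct `StoppingCriteria`, the function `check_status` and the order in which the criteria are tested are assumptions. The relative and divergence tests are written without a division so that a zero starting residual does not cause a division by zero.

```cpp
#include <cassert>

// Hypothetical container for the tolerances and iteration limits.
struct StoppingCriteria
{
    double abs_tol;  // epsilon_abs
    double rel_tol;  // epsilon_rel
    double div_tol;  // epsilon_div
    int    min_iter; // M
    int    max_iter; // N
};

// Returns 0 (keep iterating), 1 (absolute), 2 (relative), 3 (divergence)
// or 4 (maximum number of iterations), following the status codes above.
int check_status(double res, double init_res, int iter, const StoppingCriteria& c)
{
    assert(res >= 0.0 && init_res >= 0.0);

    // Never stop before the minimum number of iterations.
    if(iter < c.min_iter)
    {
        return 0;
    }
    if(res < c.abs_tol)
    {
        return 1;
    }
    if(res <= c.rel_tol * init_res)
    {
        return 2;
    }
    if(res >= c.div_tol * init_res)
    {
        return 3;
    }
    if(iter >= c.max_iter)
    {
        return 4;
    }
    return 0;
}
```

Here `res` stands for the norm of $r_{k}$ and `init_res` for the norm of $r_{1}$, both taken in the selected $L_{p}$ norm.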

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int max_iter): Initialize the solver with absolute/relative/divergence tolerance and maximum number of iterations.

void rocalution::IterativeLinearSolver::Init(double abs_tol, double rel_tol, double div_tol, int min_iter, int max_iter): Initialize the solver with absolute/relative/divergence tolerance and minimum/maximum number of iterations.

void rocalution::IterativeLinearSolver::InitMinIter(int min_iter)¶: Set the minimum number of iterations.

void rocalution::IterativeLinearSolver::InitMaxIter(int max_iter)¶: Set the maximum number of iterations.

void rocalution::IterativeLinearSolver::InitTol(double abs, double rel, double div)¶: Set the absolute/relative/divergence tolerance.

void rocalution::IterativeLinearSolver::SetResidualNorm(int resnorm)¶

Set the residual norm to $L_1$, $L_2$ or $L_\infty$ norm.

resnorm = 1 -> $L_1$ norm
resnorm = 2 -> $L_2$ norm
resnorm = 3 -> $L_\infty$ norm
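
The three norm choices can be illustrated with a small standalone function. The sketch is plain C++ with a hypothetical name `residual_norm`. It uses the same selector values as above (1, 2, 3) and reports the index of the largest absolute entry only for the $L_\infty$ case, returning -1 otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Compute the residual norm selected by resnorm (1: L1, 2: L2, 3: L-infinity).
// amax_index receives the index of the largest |r_i| for L-infinity, else -1.
double residual_norm(const std::vector<double>& r, int resnorm, int& amax_index)
{
    assert(!r.empty());
    amax_index = -1;

    double result = 0.0;
    switch(resnorm)
    {
    case 1: // sum of absolute values
        for(double v : r)
        {
            result += std::abs(v);
        }
        return result;
    case 2: // square root of the sum of squares
        for(double v : r)
        {
            result += v * v;
        }
        return std::sqrt(result);
    case 3: // largest absolute value
        for(std::size_t i = 0; i < r.size(); ++i)
        {
            if(amax_index == -1 || std::abs(r[i]) > result)
            {
                result     = std::abs(r[i]);
                amax_index = static_cast<int>(i);
            }
        }
        return result;
    default:
        throw std::invalid_argument("resnorm must be 1, 2 or 3");
    }
}
```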

void rocalution::IterativeLinearSolver::RecordResidualHistory(void)¶: Record the residual history.

void rocalution::IterativeLinearSolver::RecordHistory(const std::string filename) const¶: Write the history to file.

void rocalution::IterativeLinearSolver::Verbose(int verb = 1)¶: Set the solver verbosity output.

void rocalution::IterativeLinearSolver::Solve(const VectorType &rhs, VectorType *x)¶: Solve Operator x = rhs.

void rocalution::IterativeLinearSolver::SetPreconditioner(Solver<OperatorType, VectorType, ValueType> &precond)¶: Set a preconditioner of the linear solver.

int rocalution::IterativeLinearSolver::GetIterationCount(void)¶: Return the iteration count.

double rocalution::IterativeLinearSolver::GetCurrentResidual(void)¶: Return the current residual.

int rocalution::IterativeLinearSolver::GetSolverStatus(void)¶: Return the current status.

int rocalution::IterativeLinearSolver::GetAmaxResidualIndex(void)¶: Return absolute maximum index of residual vector when using $L_\infty$ norm.

template<class OperatorType, class VectorType, typename ValueType> class FixedPoint : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Fixed-Point Iteration Scheme.

The Fixed-Point iteration scheme is based on additive splitting of the matrix $A = M + N$. The scheme reads

\[ x_{k+1} = M^{-1} (b - N x_{k}). \]

It can also be reformulated as a weighted defect correction scheme

\[ x_{k+1} = x_{k} - \omega M^{-1} (Ax_{k} - b). \]

The inversion of $M$ can be performed by preconditioners (Jacobi, Gauss-Seidel, ILU, etc.) or by any type of solvers.
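
The weighted defect correction form can be sketched in a few lines. The example below assumes the simplest choice $M = \operatorname{diag}(A)$ (a Jacobi splitting) and dense row-major storage. Both are illustration choices, and the function name fixed_point_jacobi is hypothetical rather than part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage, for illustration only

// Run x_{k+1} = x_k - omega * M^{-1} (A x_k - b) with M = diag(A).
// Returns the number of updates applied before ||A x - b||_2 <= tol.
int fixed_point_jacobi(const Mat& A, const Vec& b, Vec& x,
                       double omega, double tol, int max_iter)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);
    Vec d(n);  // defect A x_k - b
    for (int k = 0; k < max_iter; ++k) {
        double norm2 = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            d[i] = -b[i];
            for (std::size_t j = 0; j < n; ++j) d[i] += A[i][j] * x[j];
            norm2 += d[i] * d[i];
        }
        if (std::sqrt(norm2) <= tol) return k;
        // Apply M^{-1} (a diagonal scaling here) and the relaxation omega.
        for (std::size_t i = 0; i < n; ++i) x[i] -= omega * d[i] / A[i][i];
    }
    return max_iter;
}
```

Replacing the diagonal scaling with a call to another solver or preconditioner gives the general scheme.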

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::FixedPoint::SetRelaxation(ValueType omega)¶: Set relaxation parameter $\omega$.

template<class OperatorTypeH, class VectorTypeH, typename ValueTypeH, class OperatorTypeL, class VectorTypeL, typename ValueTypeL> class MixedPrecisionDC : public rocalution::IterativeLinearSolver<OperatorTypeH, VectorTypeH, ValueTypeH>¶

Mixed-Precision Defect Correction Scheme.

The Mixed-Precision solver is based on a defect-correction scheme. The current implementation of the library uses host-based correction in double precision and accelerator computation in single precision. The solver implements the scheme

\[ x_{k+1} = x_{k} + A^{-1} r_{k}, \]

where the computation of the residual $r_{k} = b - Ax_{k}$ and the update $x_{k+1} = x_{k} + d_{k}$ are performed on the host in double precision. The computation of the residual system $Ad_{k} = r_{k}$ is performed on the accelerator in single precision. In addition to the setup functions of the iterative solver, the user needs to specify the inner solver (for $Ad_{k} = r_{k}$).
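
The split between precisions can be illustrated without any accelerator. In the sketch below the residual and the update are computed in double, while the inner system $Ad_{k} = r_{k}$ is solved approximately in float. The inner solver (a fixed number of Jacobi sweeps) and all function names are assumptions made for the example, not the library's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using VecD = std::vector<double>;
using MatD = std::vector<VecD>;
using VecF = std::vector<float>;
using MatF = std::vector<VecF>;

// Hypothetical inner solver: a fixed number of Jacobi sweeps on A d = r,
// carried out entirely in single precision, starting from d = 0.
VecF inner_solve_single(const MatF& A, const VecF& r, int sweeps)
{
    const std::size_t n = r.size();
    VecF d(n, 0.0f), next(n);
    for (int s = 0; s < sweeps; ++s) {
        for (std::size_t i = 0; i < n; ++i) {
            float sum = r[i];
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) sum -= A[i][j] * d[j];
            next[i] = sum / A[i][i];
        }
        d.swap(next);
    }
    return d;
}

// Outer defect-correction loop in double precision. Returns the number of
// corrections applied before the residual norm dropped below tol.
int mixed_precision_dc(const MatD& A, const VecD& b, VecD& x,
                       double tol, int max_iter, int inner_sweeps)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);
    MatF Af(n, VecF(n));  // single-precision copy of the operator
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Af[i][j] = static_cast<float>(A[i][j]);
    VecF rf(n);
    for (int k = 0; k < max_iter; ++k) {
        double norm2 = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double ri = b[i];  // r_k = b - A x_k in double precision
            for (std::size_t j = 0; j < n; ++j) ri -= A[i][j] * x[j];
            norm2 += ri * ri;
            rf[i] = static_cast<float>(ri);
        }
        if (std::sqrt(norm2) <= tol) return k;
        const VecF d = inner_solve_single(Af, rf, inner_sweeps);  // A d_k = r_k
        for (std::size_t i = 0; i < n; ++i) x[i] += static_cast<double>(d[i]);
    }
    return max_iter;
}
```

Each outer step can gain at most roughly single-precision accuracy, so a few corrections are typically needed to reach a double-precision residual.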

Template Parameters

OperatorTypeH: - can be LocalMatrix
VectorTypeH: - can be LocalVector
ValueTypeH: - can be double
OperatorTypeL: - can be LocalMatrix
VectorTypeL: - can be LocalVector
ValueTypeL: - can be float

void rocalution::MixedPrecisionDC::Set(Solver<OperatorTypeL, VectorTypeL, ValueTypeL> &Solver_L)¶: Set the inner solver for $Ad_{k} = r_{k}$.

template<class OperatorType, class VectorType, typename ValueType> class Chebyshev : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Chebyshev Iteration Scheme.

The Chebyshev Iteration scheme (also known as acceleration scheme) is similar to the CG method but requires the minimum and maximum eigenvalues of the operator. [templates]
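
To show how the two eigenvalue bounds enter the iteration, here is a minimal sketch of the unpreconditioned Chebyshev recurrence as it appears in standard textbooks. Whether the library uses this exact formulation is an assumption. The function name chebyshev_solve and the dense storage are hypothetical choices for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage, for illustration only

// y = A v
static Vec matvec(const Mat& A, const Vec& v)
{
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j) y[i] += A[i][j] * v[j];
    return y;
}

static double norm2(const Vec& v)
{
    double s = 0.0;
    for (double e : v) s += e * e;
    return std::sqrt(s);
}

// Chebyshev iteration for an SPD matrix whose spectrum lies in
// [lambda_min, lambda_max]. Returns the number of iterations performed.
int chebyshev_solve(const Mat& A, const Vec& b, Vec& x,
                    double lambda_min, double lambda_max,
                    double tol, int max_iter)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);
    assert(0.0 < lambda_min && lambda_min < lambda_max);
    const double theta = 0.5 * (lambda_max + lambda_min);  // center of spectrum
    const double delta = 0.5 * (lambda_max - lambda_min);  // half-width
    const double sigma1 = theta / delta;
    double rho = 1.0 / sigma1;

    Vec r = matvec(A, x);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];
    Vec d(n);  // first search direction d_0 = r_0 / theta
    for (std::size_t i = 0; i < n; ++i) d[i] = r[i] / theta;

    for (int k = 0; k < max_iter; ++k) {
        if (norm2(r) <= tol) return k;
        const Vec Ad = matvec(A, d);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += d[i];
            r[i] -= Ad[i];
        }
        const double rho_next = 1.0 / (2.0 * sigma1 - rho);
        for (std::size_t i = 0; i < n; ++i)
            d[i] = rho_next * rho * d[i] + (2.0 * rho_next / delta) * r[i];
        rho = rho_next;
    }
    return max_iter;
}
```

Unlike CG, the recurrence needs no inner products, only the two bounds supplied up front.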

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::Chebyshev::Set(ValueType lambda_min, ValueType lambda_max)¶: Set the minimum and maximum eigenvalues of the operator.

template<class OperatorType, class VectorType, typename ValueType> class BiCGStab : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Bi-Conjugate Gradient Stabilized Method.

The Bi-Conjugate Gradient Stabilized method is a variation of CGS and solves sparse (non-)symmetric linear systems $Ax=b$. [SAAD]

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class BiCGStabl : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Bi-Conjugate Gradient Stabilized (l) Method.

The Bi-Conjugate Gradient Stabilized (l) method is a generalization of BiCGStab for solving sparse (non-)symmetric linear systems $Ax=b$. It minimizes residuals over $l$-dimensional Krylov subspaces. The degree $l$ can be set with SetOrder(). [bicgstabl]

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::BiCGStabl::SetOrder(int l)¶: Set the order.

template<class OperatorType, class VectorType, typename ValueType> class CG : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Conjugate Gradient Method.

The Conjugate Gradient method is the best-known iterative method for solving sparse symmetric positive definite (SPD) linear systems $Ax=b$. It is based on orthogonal projection onto the Krylov subspace $\mathcal{K}_{m}(r_{0}, A)$, where $r_{0}$ is the initial residual. The method can be preconditioned, in which case the approximation (the preconditioner) should also be SPD. [SAAD]
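
For reference, the textbook form of the unpreconditioned method fits in a short function. The sketch below uses dense storage and a hypothetical function name cg_solve. It illustrates the algorithm itself and makes no claim about how the library implements it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage, for illustration only

static double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG for an SPD matrix A. Returns the number of iterations
// performed before ||r||_2 <= tol.
int cg_solve(const Mat& A, const Vec& b, Vec& x, double tol, int max_iter)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);
    Vec r(n), Ap(n);
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = b[i];
        for (std::size_t j = 0; j < n; ++j) r[i] -= A[i][j] * x[j];
    }
    Vec p = r;               // first search direction is the initial residual
    double rs = dot(r, r);
    for (int k = 0; k < max_iter; ++k) {
        if (std::sqrt(rs) <= tol) return k;
        for (std::size_t i = 0; i < n; ++i) {
            Ap[i] = 0.0;
            for (std::size_t j = 0; j < n; ++j) Ap[i] += A[i][j] * p[j];
        }
        const double alpha = rs / dot(p, Ap);  // step length along p
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rs_new = dot(r, r);
        const double beta = rs_new / rs;       // makes the new p A-conjugate
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs = rs_new;
    }
    return max_iter;
}
```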

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class CR : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Conjugate Residual Method.

The Conjugate Residual method is an iterative method for solving sparse symmetric positive semi-definite linear systems $Ax=b$. It is a Krylov subspace method and differs from the much more popular Conjugate Gradient method in that the system matrix is not required to be positive definite. The method can be preconditioned, in which case the approximation should also be SPD or positive semi-definite. [SAAD]

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class FCG : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Flexible Conjugate Gradient Method.

The Flexible Conjugate Gradient method is an iterative method for solving sparse symmetric positive definite linear systems $Ax=b$. It is similar to the Conjugate Gradient method; the only difference is that it does not require the preconditioner $M^{-1}$ to be a constant operator. This is especially helpful if the operation $M^{-1}x$ is the result of another iterative process. [fcg]

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class GMRES : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Generalized Minimum Residual Method.

The Generalized Minimum Residual method (GMRES) is a projection method for solving sparse (non-)symmetric linear systems $Ax=b$, based on a restarting technique. The solution is approximated in a Krylov subspace $\mathcal{K}=\mathcal{K}_{m}$ and $\mathcal{L}=A\mathcal{K}_{m}$ with minimal residual, where $\mathcal{K}_{m}$ is the $m$-th Krylov subspace with $v_{1} = r_{0}/||r_{0}||_{2}$. [SAAD]

The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::GMRES::SetBasisSize(int size_basis)¶: Set the size of the Krylov subspace basis.

template<class OperatorType, class VectorType, typename ValueType> class FGMRES : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Flexible Generalized Minimum Residual Method.

The Flexible Generalized Minimum Residual method (FGMRES) is a projection method for solving sparse (non-)symmetric linear systems $Ax=b$. It is similar to the GMRES method; the only difference is that FGMRES is based on a window shifting of the Krylov subspace and thus does not require the preconditioner $M^{-1}$ to be a constant operator. This is especially helpful if the operation $M^{-1}x$ is the result of another iterative process. [SAAD]

The Krylov subspace basis size can be set using SetBasisSize(). The default size is 30.

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::FGMRES::SetBasisSize(int size_basis)¶: Set the size of the Krylov subspace basis.

template<class OperatorType, class VectorType, typename ValueType> class IDR : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Induced Dimension Reduction Method.

The Induced Dimension Reduction method is a Krylov subspace method for solving sparse (non-)symmetric linear systems $Ax=b$. IDR(s) generates residuals in a sequence of nested subspaces. [IDR1] [IDR2]

The dimension of the shadow space can be set by SetShadowSpace(). The default size of the shadow space is 4.

Template Parameters

OperatorType: - can be LocalMatrix, GlobalMatrix or LocalStencil
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::IDR::SetShadowSpace(int s)¶: Set the size of the Shadow Space.

void rocalution::IDR::SetRandomSeed(unsigned long long seed)¶: Set the random seed for creating the orthonormal basis (ONB). The seed must be greater than 0.

template<class OperatorType, class VectorType, typename ValueType> class QMRCGStab : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Quasi-Minimal Residual Conjugate Gradient Stabilized Method.

The Quasi-Minimal Residual Conjugate Gradient Stabilized method is a variant of the Krylov subspace BiCGStab method for solving sparse (non-)symmetric linear systems $Ax=b$. [qmrcgstab]

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class BaseMultiGrid : public rocalution::IterativeLinearSolver<OperatorType, VectorType, ValueType>¶

Base class for all multigrid solvers. [Trottenberg2003]

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::BaseAMG< OperatorType, VectorType, ValueType >, rocalution::MultiGrid< OperatorType, VectorType, ValueType >

void rocalution::BaseMultiGrid::SetSolver(Solver<OperatorType, VectorType, ValueType> &solver)¶: Set the coarse grid solver.

void rocalution::BaseMultiGrid::SetSmoother(IterativeLinearSolver<OperatorType, VectorType, ValueType> **smoother)¶: Set the smoother for each level.

void rocalution::BaseMultiGrid::SetSmootherPreIter(int iter)¶: Set the number of pre-smoothing steps.

void rocalution::BaseMultiGrid::SetSmootherPostIter(int iter)¶: Set the number of post-smoothing steps.

virtual void rocalution::BaseMultiGrid::SetRestrictOperator(OperatorType **op) = 0¶: Set the restriction operator for each level.

virtual void rocalution::BaseMultiGrid::SetProlongOperator(OperatorType **op) = 0¶: Set the prolongation operator for each level.

virtual void rocalution::BaseMultiGrid::SetOperatorHierarchy(OperatorType **op) = 0¶: Set the operator for each level.

void rocalution::BaseMultiGrid::SetScaling(bool scaling)¶: Enable/disable scaling of intergrid transfers.

void rocalution::BaseMultiGrid::SetHostLevels(int levels)¶: Force computation of coarser levels on the host backend.

void rocalution::BaseMultiGrid::SetCycle(unsigned int cycle)¶: Set the MultiGrid Cycle (default: Vcycle)

void rocalution::BaseMultiGrid::SetKcycleFull(bool kcycle_full)¶: Set the MultiGrid Kcycle on all levels or only on finest level.

void rocalution::BaseMultiGrid::InitLevels(int levels)¶: Set the depth of the multigrid solver.

template<class OperatorType, class VectorType, typename ValueType> class MultiGrid : public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType>¶

MultiGrid Method.

The MultiGrid method can be used with external data, such as externally computed restriction, prolongation and operator hierarchy. The user needs to pass all of this information to the solver for each level. This includes the smoothing step, prolongation/restriction, grid traversal and the coarse grid solver. [Trottenberg2003]

Restriction and prolongation operations can be performed in two ways: based on Restriction() and Prolongation() of the LocalVector class, or by matrix-vector multiplication. This is configured by a set function.
Smoothers can be any iterative linear solver. Valid options are Jacobi, Gauss-Seidel, ILU, etc. using a FixedPoint iteration scheme with a pre-defined number of iterations. A smoother can also be a solver such as CG, BiCGStab, etc.
The coarse grid solver can be any iterative linear solver. The class also provides mechanisms to specify whether the coarse grid solver runs on the host or on the accelerator. The coarse grid solver can be preconditioned.
Grid scaling is based on an $L_2$ norm ratio.
Operator matrices need to be passed for each grid level.

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class BaseAMG : public rocalution::BaseMultiGrid<OperatorType, VectorType, ValueType>¶

Base class for all algebraic multigrid solvers.

The Algebraic MultiGrid solver is based on the BaseMultiGrid class. The coarsening is obtained by different aggregation techniques. The smoothers can be constructed inside or outside of the class.

All parameters in the Algebraic MultiGrid class can be set externally, including smoothers and coarse grid solver.

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::GlobalPairwiseAMG< OperatorType, VectorType, ValueType >, rocalution::PairwiseAMG< OperatorType, VectorType, ValueType >, rocalution::RugeStuebenAMG< OperatorType, VectorType, ValueType >, rocalution::SAAMG< OperatorType, VectorType, ValueType >, rocalution::UAAMG< OperatorType, VectorType, ValueType >

void rocalution::BaseAMG::ClearLocal(void)¶: Clear all local data.

void rocalution::BaseAMG::BuildHierarchy(void)¶: Create AMG hierarchy.

void rocalution::BaseAMG::BuildSmoothers(void)¶: Create AMG smoothers.

void rocalution::BaseAMG::SetCoarsestLevel(int coarse_size)¶: Set coarsest level for hierarchy creation.

void rocalution::BaseAMG::SetManualSmoothers(bool sm_manual)¶: Set flag to pass smoothers manually for each level.

void rocalution::BaseAMG::SetManualSolver(bool s_manual)¶: Set flag to pass coarse grid solver manually.

void rocalution::BaseAMG::SetDefaultSmootherFormat(unsigned int op_format)¶: Set the smoother operator format.

void rocalution::BaseAMG::SetOperatorFormat(unsigned int op_format)¶: Set the operator format.

int rocalution::BaseAMG::GetNumLevels(void)¶: Returns the number of levels in hierarchy.

template<class OperatorType, class VectorType, typename ValueType> class UAAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶

Unsmoothed Aggregation Algebraic MultiGrid Method.

The Unsmoothed Aggregation Algebraic MultiGrid method is based on an unsmoothed aggregation interpolation scheme. [stuben]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::UAAMG::SetCouplingStrength(ValueType eps)¶: Set coupling strength.

void rocalution::UAAMG::SetOverInterp(ValueType overInterp)¶: Set over-interpolation parameter for aggregation.

template<class OperatorType, class VectorType, typename ValueType> class SAAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶

Smoothed Aggregation Algebraic MultiGrid Method.

The Smoothed Aggregation Algebraic MultiGrid method is based on a smoothed aggregation interpolation scheme. [vanek]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::SAAMG::SetCouplingStrength(ValueType eps)¶: Set coupling strength.

void rocalution::SAAMG::SetInterpRelax(ValueType relax)¶: Set the relaxation parameter.

template<class OperatorType, class VectorType, typename ValueType> class RugeStuebenAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶

Ruge-Stueben Algebraic MultiGrid Method.

The Ruge-Stueben Algebraic MultiGrid method is based on the classic Ruge-Stueben coarsening with direct interpolation. The solver is highly efficient in terms of its complexity (i.e. the number of iterations). However, in most cases its build step is more expensive and it requires more memory. [stuben]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::RugeStuebenAMG::SetCouplingStrength(ValueType eps)¶: Set coupling strength.

template<class OperatorType, class VectorType, typename ValueType> class PairwiseAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶

Pairwise Aggregation Algebraic MultiGrid Method.

The Pairwise Aggregation Algebraic MultiGrid method is based on a pairwise aggregation matching scheme. It delivers a very efficient building phase and is suitable for Poisson-like equations. In most cases it requires a K-cycle in the solving phase to keep the number of iterations low. [pairwiseamg]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::PairwiseAMG::SetBeta(ValueType beta)¶: Set beta for pairwise aggregation.

void rocalution::PairwiseAMG::SetOrdering(unsigned int ordering)¶: Set re-ordering for aggregation.

void rocalution::PairwiseAMG::SetCoarseningFactor(double factor)¶: Set target coarsening factor.

template<class OperatorType, class VectorType, typename ValueType> class GlobalPairwiseAMG : public rocalution::BaseAMG<OperatorType, VectorType, ValueType>¶

Pairwise Aggregation Algebraic MultiGrid Method (multi-node)

The Pairwise Aggregation Algebraic MultiGrid method is based on a pairwise aggregation matching scheme. It delivers a very efficient building phase and is suitable for Poisson-like equations. In most cases it requires a K-cycle in the solving phase to keep the number of iterations low. This version has multi-node support. [pairwiseamg]

Template Parameters

OperatorType: - can be GlobalMatrix
VectorType: - can be GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::GlobalPairwiseAMG::SetBeta(ValueType beta)¶: Set beta for pairwise aggregation.

void rocalution::GlobalPairwiseAMG::SetOrdering(const _aggregation_ordering ordering)¶: Set re-ordering for aggregation.

void rocalution::GlobalPairwiseAMG::SetCoarseningFactor(double factor)¶: Set target coarsening factor.

template<class OperatorType, class VectorType, typename ValueType> class DirectLinearSolver : public rocalution::Solver<OperatorType, VectorType, ValueType>¶

Base class for all direct linear solvers.

The library provides three direct methods - LU, QR and Inversion (based on QR decomposition). The user can pass a sparse matrix; internally it is converted to a dense format and then the selected method is applied. These methods are not optimized, and because the matrix is converted to a dense format, they should be used only for very small matrices.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::Inversion< OperatorType, VectorType, ValueType >, rocalution::LU< OperatorType, VectorType, ValueType >, rocalution::QR< OperatorType, VectorType, ValueType >

template<class OperatorType, class VectorType, typename ValueType> class Inversion : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶

Matrix Inversion.

Full matrix inversion based on QR decomposition.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class LU : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶

LU Decomposition.

Lower-Upper Decomposition factors a given square matrix into a lower and an upper triangular matrix, such that $A = LU$.
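
A small dense example makes the factorization concrete. The sketch below is a Doolittle-style elimination with a unit diagonal in $L$ and no pivoting. Both are simplifying assumptions for illustration, and lu_decompose is a hypothetical helper rather than a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage

// Factor a square matrix as A = L U, with L unit lower triangular and
// U upper triangular. No pivoting: every pivot U[k][k] must be non-zero.
void lu_decompose(const Mat& A, Mat& L, Mat& U)
{
    const std::size_t n = A.size();
    U = A;
    L.assign(n, Vec(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) L[i][i] = 1.0;
    for (std::size_t k = 0; k < n; ++k) {
        assert(U[k][k] != 0.0);
        for (std::size_t i = k + 1; i < n; ++i) {
            L[i][k] = U[i][k] / U[k][k];  // elimination multiplier
            for (std::size_t j = k; j < n; ++j) U[i][j] -= L[i][k] * U[k][j];
        }
    }
}
```

For $A = \begin{pmatrix} 4 & 3 \\ 6 & 3 \end{pmatrix}$ this yields the multiplier $L_{21} = 1.5$ and $U_{22} = -1.5$.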

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class QR : public rocalution::DirectLinearSolver<OperatorType, VectorType, ValueType>¶

QR Decomposition.

The QR Decomposition decomposes a given matrix into $A = QR$, such that $Q$ is an orthogonal matrix and $R$ an upper triangular matrix.
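
One classical way to obtain such a factorization is the modified Gram-Schmidt process, sketched below for a small dense matrix with linearly independent columns. The choice of algorithm is an assumption made for illustration (the text does not say which method the library uses), and qr_mgs is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage

// Modified Gram-Schmidt: A (m x n) = Q (m x n, orthonormal columns) * R (n x n,
// upper triangular).
void qr_mgs(const Mat& A, Mat& Q, Mat& R)
{
    const std::size_t m = A.size();
    assert(m > 0);
    const std::size_t n = A[0].size();
    Q = A;
    R.assign(n, Vec(n, 0.0));
    for (std::size_t j = 0; j < n; ++j) {
        double norm = 0.0;
        for (std::size_t i = 0; i < m; ++i) norm += Q[i][j] * Q[i][j];
        norm = std::sqrt(norm);
        assert(norm > 0.0);  // columns must be linearly independent
        R[j][j] = norm;
        for (std::size_t i = 0; i < m; ++i) Q[i][j] /= norm;
        // Remove the component along column j from all later columns.
        for (std::size_t k = j + 1; k < n; ++k) {
            double proj = 0.0;
            for (std::size_t i = 0; i < m; ++i) proj += Q[i][j] * Q[i][k];
            R[j][k] = proj;
            for (std::size_t i = 0; i < m; ++i) Q[i][k] -= proj * Q[i][j];
        }
    }
}
```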

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Preconditioners¶

template<class OperatorType, class VectorType, typename ValueType> class Preconditioner : public rocalution::Solver<OperatorType, VectorType, ValueType>¶

Base class for all preconditioners.

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class AIChebyshev : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Approximate Inverse - Chebyshev Preconditioner.

The Approximate Inverse - Chebyshev Preconditioner is an inverse matrix preconditioner with values from a linear combination of matrix-valued Chebyshev polynomials. [chebpoly]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::AIChebyshev::Set(int p, ValueType lambda_min, ValueType lambda_max)¶: Set order, min and max eigenvalues.

template<class OperatorType, class VectorType, typename ValueType> class FSAI : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Factorized Approximate Inverse Preconditioner.

The Factorized Sparse Approximate Inverse preconditioner computes a direct approximation of $M^{-1}$ by minimizing the Frobenius norm $||I - GL||_{F}$, where $L$ denotes the exact lower triangular part of $A$ and $G:=M^{-1}$. The FSAI preconditioner is initialized by $q$, based on the sparsity pattern of $|A^{q}|$. However, it is also possible to supply external sparsity patterns in the form of the LocalMatrix class. [kolotilina]

Note

The FSAI preconditioner is only suited for symmetric positive definite matrices.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::FSAI::Set(int power): Set the power of the system matrix sparsity pattern.

void rocalution::FSAI::Set(const OperatorType &pattern): Set an external sparsity pattern.

void rocalution::FSAI::SetPrecondMatrixFormat(unsigned int mat_format)¶: Set the matrix format of the preconditioner.

template<class OperatorType, class VectorType, typename ValueType> class SPAI : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

SParse Approximate Inverse Preconditioner.

The SParse Approximate Inverse algorithm is an explicitly computed preconditioner for general sparse linear systems. In its current implementation, only the sparsity pattern of the system matrix is supported. The SPAI computation is based on the minimization of the Frobenius norm $||AM - I||_{F}$. [grote]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::SPAI::SetPrecondMatrixFormat(unsigned int mat_format)¶: Set the matrix format of the preconditioner.

template<class OperatorType, class VectorType, typename ValueType> class TNS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Truncated Neumann Series Preconditioner.

The Truncated Neumann Series (TNS) preconditioner is based on $M^{-1} = K^{T} D^{-1} K$, where $K=(I-LD^{-1}+(LD^{-1})^{2})$, with the diagonal $D$ of $A$ and the strictly lower triangular part $L$ of $A$. The preconditioner can be computed in two forms - explicitly and implicitly. In the implicit form, the full construction of $M$ is performed via matrix-matrix operations, whereas in the explicit form, the application of the preconditioner is based on matrix-vector operations only. The matrix format for the stored matrices can be specified.
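
The formula can be applied to a vector using matrix-vector products only, without ever forming $M^{-1}$. The sketch below does this for a small dense matrix. The helper names are hypothetical and the dense storage is an illustration choice, not the library's data layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage

// y = L D^{-1} v, with L the strictly lower triangle and D the diagonal of A.
static Vec apply_LDinv(const Mat& A, const Vec& v)
{
    const std::size_t n = A.size();
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j) y[i] += A[i][j] * v[j] / A[j][j];
    return y;
}

// y = (L D^{-1})^T v = D^{-1} L^T v
static Vec apply_DinvLT(const Mat& A, const Vec& v)
{
    const std::size_t n = A.size();
    Vec y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = j + 1; i < n; ++i) y[j] += A[i][j] * v[i];
        y[j] /= A[j][j];
    }
    return y;
}

// z = K^T D^{-1} K r with K = I - L D^{-1} + (L D^{-1})^2.
Vec tns_apply(const Mat& A, const Vec& r)
{
    const std::size_t n = A.size();
    assert(r.size() == n);
    const Vec t1 = apply_LDinv(A, r);    // (L D^{-1}) r
    const Vec t2 = apply_LDinv(A, t1);   // (L D^{-1})^2 r
    Vec w(n);                            // w = D^{-1} K r
    for (std::size_t i = 0; i < n; ++i) w[i] = (r[i] - t1[i] + t2[i]) / A[i][i];
    const Vec s1 = apply_DinvLT(A, w);   // (L D^{-1})^T w
    const Vec s2 = apply_DinvLT(A, s1);  // ((L D^{-1})^T)^2 w
    Vec z(n);                            // z = K^T w
    for (std::size_t i = 0; i < n; ++i) z[i] = w[i] - s1[i] + s2[i];
    return z;
}
```

For $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $r = (1, 0)^{T}$ this gives $z = (0.625, -0.25)^{T}$, compared with the exact $A^{-1} r = (2/3, -1/3)^{T}$.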

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::TNS::Set(bool imp)¶: Set implicit (true) or explicit (false) computation.

void rocalution::TNS::SetPrecondMatrixFormat(unsigned int mat_format)¶: Set the matrix format of the preconditioner.

template<class OperatorType, class VectorType, typename ValueType> class AS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Additive Schwarz Preconditioner.

The Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system $Ax=b$ can be decomposed into small sub-problems based on $A_{i} = R_{i}^{T}AR_{i}$, where $R_{i}$ are restriction operators. Those restriction operators produce sub-matrices which overlap. This leads to contributions from two preconditioners on the overlapped area which are scaled by $1/2$. [RAS]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::RAS< OperatorType, VectorType, ValueType >

void rocalution::AS::Set(int nb, int overlap, Solver<OperatorType, VectorType, ValueType> **preconds)¶: Set number of blocks, overlap and array of preconditioners.

template<class OperatorType, class VectorType, typename ValueType> class RAS : public rocalution::AS<OperatorType, VectorType, ValueType>¶

Restricted Additive Schwarz Preconditioner.

The Restricted Additive Schwarz preconditioner relies on a preconditioning technique, where the linear system $Ax=b$ can be decomposed into small sub-problems based on $A_{i} = R_{i}^{T}AR_{i}$, where $R_{i}$ are restriction operators. The RAS method is a mixture of block Jacobi and the AS scheme. In this case, the sub-matrices contain overlapped areas from other blocks, too. [RAS]

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class BlockJacobi : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Block-Jacobi Preconditioner.

The Block-Jacobi preconditioner is designed to wrap any local preconditioner and apply it in a global block fashion locally on each interior matrix.

Template Parameters

OperatorType: - can be GlobalMatrix
VectorType: - can be GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::BlockJacobi::Set(Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> &precond)¶: Set local preconditioner.

template<class OperatorType, class VectorType, typename ValueType> class BlockPreconditioner : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Block-Preconditioner.

When handling vector fields, typically one can try to use different preconditioners and/or solvers for the different blocks. For such problems, the library provides a block-type preconditioner. This preconditioner builds the following block-type matrix

\[\begin{split} P = \begin{pmatrix} A_{d} & 0 & . & 0 \\ B_{1} & B_{d} & . & 0 \\ . & . & . & . \\ Z_{1} & Z_{2} & . & Z_{d} \end{pmatrix} \end{split}\]

The solution of $P$ can be performed in two ways. It can be solved by block-lower-triangular sweeps with inversion of the blocks $A_{d} \ldots Z_{d}$ and with a multiplication of the corresponding blocks. This is set by SetLSolver() (which is the default solution scheme). Alternatively, it can be used only with an inverse of the diagonal $A_{d} \ldots Z_{d}$ (Block-Jacobi type) by using SetDiagonalSolver().

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::BlockPreconditioner::Set(int n, const int *size, Solver<OperatorType, VectorType, ValueType> **D_solver)¶: Set number, size and diagonal solver.

void rocalution::BlockPreconditioner::SetDiagonalSolver(void)¶: Set diagonal solver mode.

void rocalution::BlockPreconditioner::SetLSolver(void)¶: Set lower triangular sweep mode.

void rocalution::BlockPreconditioner::SetExternalLastMatrix(const OperatorType &mat)¶: Set external last block matrix.

void rocalution::BlockPreconditioner::SetPermutation(const LocalVector<int> &perm)¶: Set permutation vector.

template<class OperatorType, class VectorType, typename ValueType> class Jacobi : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Jacobi Method.

The Jacobi method is for solving a diagonally dominant system of linear equations $Ax=b$. It solves for each diagonal element iteratively until convergence, such that

\[ x_{i}^{(k+1)} = (1 - \omega)x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right) \]

Template Parameters

OperatorType: - can be LocalMatrix or GlobalMatrix
VectorType: - can be LocalVector or GlobalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class GS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Gauss-Seidel / Successive Over-Relaxation Method.

The Gauss-Seidel / SOR method is for solving a system of linear equations $Ax=b$. It approximates the solution iteratively with

\[ x_{i}^{(k+1)} = (1 - \omega) x_{i}^{(k)} + \frac{\omega}{a_{ii}} \left( b_{i} - \sum\limits_{j=1}^{i-1}{a_{ij}x_{j}^{(k+1)}} - \sum\limits_{j=i+1}^{n}{a_{ij}x_{j}^{(k)}} \right), \]

with $\omega \in (0,2)$.
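
A single sweep of this iteration can be sketched as follows. Updating x in place means that entries with index below $i$ already hold their new values while entries above $i$ still hold the old ones. With $\omega = 1$ the sweep reduces to plain Gauss-Seidel. The function name sor_sweep and the dense storage are hypothetical illustration choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense row-major storage

// One in-place SOR sweep over all rows of A x = b.
void sor_sweep(const Mat& A, const Vec& b, Vec& x, double omega)
{
    const std::size_t n = A.size();
    assert(b.size() == n && x.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        double sigma = b[i];
        // Off-diagonal terms: x[j] is already updated for j < i, old for j > i.
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) sigma -= A[i][j] * x[j];
        x[i] = (1.0 - omega) * x[i] + omega * sigma / A[i][i];
    }
}
```

Writing the new values into a separate vector instead of x would turn the same loop into a weighted Jacobi sweep.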

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class SGS : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Symmetric Gauss-Seidel / Symmetric Successive Over-Relaxation Method.

The Symmetric Gauss-Seidel / SSOR method is for solving a system of linear equations $Ax=b$. It approximates the solution iteratively.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class ILU : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Incomplete LU Factorization based on levels.

The Incomplete LU Factorization based on levels computes a sparse lower and sparse upper triangular matrix such that $A = LU - R$.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::ILU::Set(int p, bool level = true)¶

Initialize ILU(p) factorization.

Initialize ILU(p) factorization based on power [SAAD].

level = true: build the structure based on levels.
level = false: build the structure based only on the power(p+1).

template<class OperatorType, class VectorType, typename ValueType> class ILUT : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Incomplete LU Factorization based on threshold.

The Incomplete LU Factorization based on threshold computes a sparse lower and sparse upper triangular matrix such that $A = LU - R$. Fill-in values are dropped depending on a threshold and a maximum number of fill-ins per row [SAAD].

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::ILUT::Set(double t): Set drop-off threshold.

void rocalution::ILUT::Set(double t, int maxrow): Set drop-off threshold and maximum fill-ins per row.

template<class OperatorType, class VectorType, typename ValueType> class IC : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Incomplete Cholesky Factorization without fill-ins.

The Incomplete Cholesky Factorization computes a sparse lower triangular matrix such that $A=LL^{T} - R$. Additional fill-ins are dropped and the sparsity pattern of the original matrix is preserved.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class VariablePreconditioner : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Variable Preconditioner.

The Variable Preconditioner can hold a selection of preconditioners, so any types of preconditioners can be combined. For example, the variable preconditioner can combine Jacobi, GS and ILU: the first iteration of the iterative solver then applies Jacobi, the second iteration applies GS and the third iteration applies ILU. After that, the solver starts again with Jacobi, GS, ILU.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::VariablePreconditioner::SetPreconditioner(int n, Solver<OperatorType, VectorType, ValueType> **precond)¶: Set the preconditioner sequence.
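
The cycling behaviour described above amounts to a round-robin selection over the sequence. The sketch below only illustrates that schedule with preconditioner names as strings; `select_preconditioner` and `schedule` are hypothetical helpers, and the modulo rule is an assumption based on the description rather than the library's source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index of the preconditioner applied at a given (0-based) solver iteration,
// assuming the sequence is cycled through in order.
std::size_t select_preconditioner(std::size_t iteration, std::size_t count)
{
    return iteration % count;
}

// Names of the preconditioners applied over the first `iterations` iterations.
std::vector<std::string> schedule(const std::vector<std::string>& names, std::size_t iterations)
{
    std::vector<std::string> order;
    for(std::size_t k = 0; k < iterations; ++k)
    {
        order.push_back(names[select_preconditioner(k, names.size())]);
    }
    return order;
}
```

For the sequence Jacobi, GS, ILU, five iterations would apply Jacobi, GS, ILU, Jacobi, GS.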

template<class OperatorType, class VectorType, typename ValueType> class MultiColored : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Base class for all multi-colored preconditioners.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::MultiColoredILU< OperatorType, VectorType, ValueType >, rocalution::MultiColoredSGS< OperatorType, VectorType, ValueType >

void rocalution::MultiColored::SetPrecondMatrixFormat(unsigned int mat_format)¶: Set a specific matrix type of the decomposed block matrices.

void rocalution::MultiColored::SetDecomposition(bool decomp)¶: Set if the preconditioner should be decomposed or not.

template<class OperatorType, class VectorType, typename ValueType> class MultiColoredSGS : public rocalution::MultiColored<OperatorType, VectorType, ValueType>¶

Multi-Colored Symmetric Gauss-Seidel / SSOR Preconditioner.

The Multi-Colored Symmetric Gauss-Seidel / SSOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward and backward substitution is obtained by performing a multi-colored decomposition. Details on the Symmetric Gauss-Seidel / SSOR algorithm can be found in the SGS preconditioner.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

Subclassed by rocalution::MultiColoredGS< OperatorType, VectorType, ValueType >

void rocalution::MultiColoredSGS::SetRelaxation(ValueType omega)¶: Set the relaxation parameter for the SOR/SSOR scheme.

template<class OperatorType, class VectorType, typename ValueType> class MultiColoredGS : public rocalution::MultiColoredSGS<OperatorType, VectorType, ValueType>¶

Multi-Colored Gauss-Seidel / SOR Preconditioner.

The Multi-Colored Gauss-Seidel / SOR preconditioner is based on the splitting of the original matrix. Higher parallelism in solving the forward substitution is obtained by performing a multi-colored decomposition. Details on the Gauss-Seidel / SOR algorithm can be found in the GS preconditioner.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

template<class OperatorType, class VectorType, typename ValueType> class MultiColoredILU : public rocalution::MultiColored<OperatorType, VectorType, ValueType>¶

Multi-Colored Incomplete LU Factorization Preconditioner.

Multi-Colored Incomplete LU Factorization based on the ILU(p) factorization with a power(q)-pattern method. This method provides a higher degree of parallelism of forward and backward substitution compared to the standard ILU(p) preconditioner [Lukarski2012].

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::MultiColoredILU::Set(int p): Initialize a multi-colored ILU(p, p+1) preconditioner.

void rocalution::MultiColoredILU::Set(int p, int q, bool level = true)

Initialize a multi-colored ILU(p, q) preconditioner.

level = true will perform the factorization with levels

level = false will perform the factorization only on the power(q)-pattern

template<class OperatorType, class VectorType, typename ValueType> class MultiElimination : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Multi-Elimination Incomplete LU Factorization Preconditioner.

The Multi-Elimination Incomplete LU preconditioner is based on the following decomposition

\[\begin{split} A = \begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} I & 0 \\ ED^{-1} & I \end{pmatrix} \times \begin{pmatrix} D & F \\ 0 & \hat{A} \end{pmatrix}, \end{split}\]

where $\hat{A} = C - ED^{-1} F$. To make the inversion of $D$ easier, we permute the preconditioning before the factorization with a permutation $P$ to obtain only diagonal elements in $D$. The permutation here is based on a maximal independent set. The same procedure can be applied to the block matrix $\hat{A}$, so the factorization can be performed recursively. In the last level of the recursion, we need to provide a solution procedure. By the design of the library, this can be any kind of solver [SAAD].
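
As a quick sanity check of the decomposition (a worked verification added for illustration), multiplying the two factors block by block recovers $A$:

\[\begin{split} \begin{pmatrix} I & 0 \\ ED^{-1} & I \end{pmatrix} \times \begin{pmatrix} D & F \\ 0 & \hat{A} \end{pmatrix} = \begin{pmatrix} D & F \\ ED^{-1}D & ED^{-1}F + \hat{A} \end{pmatrix} = \begin{pmatrix} D & F \\ E & C \end{pmatrix}, \end{split}\]

where the last step uses $ED^{-1}D = E$ and $ED^{-1}F + \hat{A} = ED^{-1}F + C - ED^{-1}F = C$.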

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

int rocalution::MultiElimination::GetSizeDiagBlock(void) const¶: Returns the size of the first (diagonal) block of the preconditioner.

int rocalution::MultiElimination::GetLevel(void) const¶: Return the depth of the current level.

void rocalution::MultiElimination::Set(Solver<OperatorType, VectorType, ValueType> &AA_Solver, int level, double drop_off = 0.0)¶

Initialize (recursively) ME-ILU with level (depth of recursion).

AA_Solver - defines the last-block solver

drop_off - defines drop-off tolerance

void rocalution::MultiElimination::SetPrecondMatrixFormat(unsigned int mat_format)¶: Set a specific matrix type of the decomposed block matrices.

template<class OperatorType, class VectorType, typename ValueType> class DiagJacobiSaddlePointPrecond : public rocalution::Preconditioner<OperatorType, VectorType, ValueType>¶

Diagonal Preconditioner for Saddle-Point Problems.

Consider the following saddle-point problem

\[\begin{split} A = \begin{pmatrix} K & F \\ E & 0 \end{pmatrix}. \end{split}\]

For such problems we can construct a diagonal Jacobi-type preconditioner of type

\[\begin{split} P = \begin{pmatrix} K & 0 \\ 0 & S \end{pmatrix}, \end{split}\]

with $S=ED^{-1}F$, where $D$ are the diagonal elements of $K$. The matrix $S$ is fully constructed (via sparse matrix-matrix multiplication). The preconditioner needs to be initialized with two external solvers/preconditioners - one for the matrix $K$ and one for the matrix $S$.

Template Parameters

OperatorType: - can be LocalMatrix
VectorType: - can be LocalVector
ValueType: - can be float, double, std::complex<float> or std::complex<double>

void rocalution::DiagJacobiSaddlePointPrecond::Set(Solver<OperatorType, VectorType, ValueType> &K_Solver, Solver<OperatorType, VectorType, ValueType> &S_Solver)¶: Initialize solver for $K$ and $S$.

rocSPARSE¶

Introduction¶

rocSPARSE is a library that contains basic linear algebra subroutines for sparse matrices and vectors written in HIP for GPU devices. It is designed to be used from C and C++ code.

The code is open and hosted here: https://github.com/ROCmSoftwarePlatform/rocSPARSE

Device and Stream Management¶

hipSetDevice() and hipGetDevice() are HIP device management APIs. They are NOT part of the rocSPARSE API.

All rocSPARSE library functions, unless otherwise stated, are non-blocking and executed asynchronously with respect to the host. They may return before the actual computation has finished. To force synchronization, hipDeviceSynchronize() or hipStreamSynchronize() can be used. This ensures that all previously executed rocSPARSE functions on the device or on this particular stream have completed.

Before a HIP kernel invocation, users need to call hipSetDevice() to set a device, e.g. device 1. If users do not explicitly call it, the system by default sets it as device 0. Unless users explicitly call hipSetDevice() to set to another device, their HIP kernels are always launched on device 0.

The above is a HIP (and CUDA) device management approach and has nothing to do with rocSPARSE. rocSPARSE honors the approach above and assumes users have already set the device before a rocSPARSE routine call.

HIP kernels are always launched in a queue (also known as stream).

If users do not explicitly specify a stream, the system provides a default stream, maintained by the system. Users cannot create or destroy the default stream. However, users can freely create new streams (with hipStreamCreate()) and bind them to the rocSPARSE handle. HIP kernels are invoked in rocSPARSE routines. The rocSPARSE handle is always associated with a stream, and rocSPARSE passes its stream to the kernels inside the routine. One rocSPARSE routine only takes one stream in a single invocation. If users create a stream, they are responsible for destroying it.

If the system under test has multiple HIP devices, users can run multiple rocSPARSE handles concurrently, but can NOT run a single rocSPARSE handle on different discrete devices. Each handle is associated with a single device, and a new handle should be created for each additional device.

Building and Installing¶

Installing from AMD ROCm repositories¶

rocSPARSE can be installed from AMD ROCm repositories by

sudo apt install rocsparse

Building rocSPARSE from Open-Source repository¶

The rocSPARSE source code is available at the rocSPARSE github page. Download the master branch using:

git clone -b master https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
cd rocSPARSE

Note that if you want to contribute to rocSPARSE, you will need to check out the develop branch instead of the master branch.

Below are steps to build different packages of the library, including dependencies and clients. It is recommended to install rocSPARSE using the install.sh script.

The following table lists common uses of install.sh to build dependencies + library.

Command	Description
./install.sh -h	Print help information.
./install.sh -d	Build dependencies and library in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.
./install.sh	Build library in your local directory. It is assumed dependencies are available.
./install.sh -i	Build library, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users.

The client contains example code, unit tests and benchmarks. Common uses of install.sh to build them are listed in the table below.

Command	Description
./install.sh -h	Print help information.
./install.sh -dc	Build dependencies, library and client in your local directory. The -d flag only needs to be used once. For subsequent invocations of install.sh it is not necessary to rebuild the dependencies.
./install.sh -c	Build library and client in your local directory. It is assumed dependencies are available.
./install.sh -idc	Build library, dependencies and client, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users.
./install.sh -ic	Build library and client, then build and install rocSPARSE package in /opt/rocm/rocsparse. You will be prompted for sudo access. This will install for all users.

CMake 3.5 or later is required in order to build rocSPARSE. The rocSPARSE library contains both host and device code; therefore, the HCC compiler must be specified during the cmake configuration process.

rocSPARSE can be built using the following commands:

# Create and change to build directory
mkdir -p build/release ; cd build/release

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
CXX=/opt/rocm/bin/hcc cmake ../..

# Compile rocSPARSE library
make -j$(nproc)

# Install rocSPARSE to /opt/rocm
sudo make install

Boost and GoogleTest are required in order to build the rocSPARSE client.

rocSPARSE with dependencies and client can be built using the following commands:

# Install boost on Ubuntu
sudo apt install libboost-program-options-dev
# Install boost on Fedora
sudo dnf install boost-program-options

# Install googletest
mkdir -p build/release/deps ; cd build/release/deps
cmake -DBUILD_BOOST=OFF ../../../deps
sudo make -j$(nproc) install

# Change to build directory
cd ..

# Default install path is /opt/rocm, use -DCMAKE_INSTALL_PREFIX=<path> to adjust it
CXX=/opt/rocm/bin/hcc cmake ../.. -DBUILD_CLIENTS_TESTS=ON \
                                  -DBUILD_CLIENTS_BENCHMARKS=ON \
                                  -DBUILD_CLIENTS_SAMPLES=ON

# Compile rocSPARSE library
make -j$(nproc)

# Install rocSPARSE to /opt/rocm
sudo make install

Issue: HIP (/opt/rocm/hip) was built using hcc 1.0.xxx-xxx-xxx-xxx, but you are using /opt/rocm/bin/hcc with version 1.0.yyy-yyy-yyy-yyy from hipcc (version mismatch). Please rebuild HIP including cmake or update HCC_HOME variable.

Solution: Download HIP from github and use hcc to build from source and then use the built HIP instead of /opt/rocm/hip.

Issue: For Carrizo - HCC RUNTIME ERROR: Failed to find compatible kernel

Solution: Add the following to the cmake command when configuring: -DCMAKE_CXX_FLAGS="--amdgpu-target=gfx801"

Issue: For MI25 (Vega10 Server) - HCC RUNTIME ERROR: Failed to find compatible kernel

Solution: export HCC_AMDGPU_TARGET=gfx900

Issue: Could not find a package configuration file provided by "ROCM" with any of the following names: ROCMConfig.cmake, rocm-config.cmake

Solution: Install ROCm cmake modules

Storage Formats¶

COO storage format¶

The Coordinate (COO) storage format represents an $m \times n$ matrix by

m	number of rows (integer).
n	number of columns (integer).
nnz	number of non-zero elements (integer).
coo_val	array of `nnz` elements containing the data (floating point).
coo_row_ind	array of `nnz` elements containing the row indices (integer).
coo_col_ind	array of `nnz` elements containing the column indices (integer).

The COO matrix is expected to be sorted by row indices and column indices per row. Furthermore, each pair of indices should appear only once. Consider the following $3 \times 5$ matrix and the corresponding COO structures, with $m = 3, n = 5$ and $\text{nnz} = 8$ using zero based indexing:

\[\begin{split}A = \begin{pmatrix} 1.0 & 2.0 & 0.0 & 3.0 & 0.0 \\ 0.0 & 4.0 & 5.0 & 0.0 & 0.0 \\ 6.0 & 0.0 & 0.0 & 7.0 & 8.0 \\ \end{pmatrix}\end{split}\]

where

\[\begin{split}\begin{array}{ll} coo\_val[8] & = \{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0\} \\ coo\_row\_ind[8] & = \{0, 0, 0, 1, 1, 2, 2, 2\} \\ coo\_col\_ind[8] & = \{0, 1, 3, 1, 2, 0, 3, 4\} \end{array}\end{split}\]
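
The arrays above can be reproduced on the host with a simple row-by-row scan of the dense matrix, which also yields the required sort order. This is an illustrative sketch only: `CooMatrix` and `dense_to_coo` are hypothetical names, not rocSPARSE types or functions, and zero based indexing is assumed.

```cpp
#include <cassert>
#include <vector>

// Plain host-side container mirroring the COO arrays (hypothetical type).
struct CooMatrix
{
    int                 m   = 0; // number of rows
    int                 n   = 0; // number of columns
    int                 nnz = 0; // number of non-zero elements
    std::vector<double> coo_val;
    std::vector<int>    coo_row_ind;
    std::vector<int>    coo_col_ind;
};

// Scan the dense matrix row by row, so entries come out sorted by row index
// and by column index within each row (zero based).
CooMatrix dense_to_coo(const std::vector<std::vector<double>>& A)
{
    CooMatrix coo;
    coo.m = static_cast<int>(A.size());
    coo.n = A.empty() ? 0 : static_cast<int>(A[0].size());
    for(int i = 0; i < coo.m; ++i)
    {
        for(int j = 0; j < coo.n; ++j)
        {
            if(A[i][j] != 0.0)
            {
                coo.coo_val.push_back(A[i][j]);
                coo.coo_row_ind.push_back(i);
                coo.coo_col_ind.push_back(j);
            }
        }
    }
    coo.nnz = static_cast<int>(coo.coo_val.size());
    return coo;
}
```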

CSR storage format¶

The Compressed Sparse Row (CSR) storage format represents an $m \times n$ matrix by

m	number of rows (integer).
n	number of columns (integer).
nnz	number of non-zero elements (integer).
csr_val	array of `nnz` elements containing the data (floating point).
csr_row_ptr	array of `m+1` elements that point to the start of every row (integer).
csr_col_ind	array of `nnz` elements containing the column indices (integer).

The CSR matrix is expected to be sorted by column indices within each row. Furthermore, each pair of indices should appear only once. Consider the following $3 \times 5$ matrix and the corresponding CSR structures, with $m = 3, n = 5$ and $\text{nnz} = 8$ using one based indexing:

\[\begin{split}A = \begin{pmatrix} 1.0 & 2.0 & 0.0 & 3.0 & 0.0 \\ 0.0 & 4.0 & 5.0 & 0.0 & 0.0 \\ 6.0 & 0.0 & 0.0 & 7.0 & 8.0 \\ \end{pmatrix}\end{split}\]

where

\[\begin{split}\begin{array}{ll} csr\_val[8] & = \{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0\} \\ csr\_row\_ptr[4] & = \{1, 4, 6, 9\} \\ csr\_col\_ind[8] & = \{1, 2, 4, 2, 3, 1, 4, 5\} \end{array}\end{split}\]
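
The row pointer array is a running count of the entries per row, shifted by the index base. The sketch below derives it from sorted COO row indices; `coo_to_csr_row_ptr` is a hypothetical host-side helper written for illustration, not a rocSPARSE routine.

```cpp
#include <cassert>
#include <vector>

// Build csr_row_ptr (m + 1 entries) from sorted COO row indices.
// index_base is 0 or 1 and applies to both the input and the output.
std::vector<int> coo_to_csr_row_ptr(const std::vector<int>& coo_row_ind, int m, int index_base)
{
    std::vector<int> csr_row_ptr(m + 1, 0);

    // Count the entries of row r in slot r + 1.
    for(int row : coo_row_ind)
    {
        ++csr_row_ptr[row - index_base + 1];
    }

    // Prefix sum turns the counts into row start offsets.
    for(int i = 0; i < m; ++i)
    {
        csr_row_ptr[i + 1] += csr_row_ptr[i];
    }

    // Shift the offsets by the index base.
    for(int& p : csr_row_ptr)
    {
        p += index_base;
    }
    return csr_row_ptr;
}
```

Row `i` then holds `csr_row_ptr[i + 1] - csr_row_ptr[i]` entries, and the last element equals `nnz` plus the index base.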

ELL storage format¶

The Ellpack-Itpack (ELL) storage format represents an $m \times n$ matrix by

m	number of rows (integer).
n	number of columns (integer).
ell_width	maximum number of non-zero elements per row (integer).
ell_val	array of `m * ell_width` elements containing the data (floating point).
ell_col_ind	array of `m * ell_width` elements containing the column indices (integer).

The ELL matrix is assumed to be stored in column-major format. Rows with fewer than ell_width non-zero elements are padded with zeros (ell_val) and $-1$ (ell_col_ind). Consider the following $3 \times 5$ matrix and the corresponding ELL structures, with $m = 3, n = 5$ and $\text{ell_width} = 3$ using zero based indexing:

\[\begin{split}A = \begin{pmatrix} 1.0 & 2.0 & 0.0 & 3.0 & 0.0 \\ 0.0 & 4.0 & 5.0 & 0.0 & 0.0 \\ 6.0 & 0.0 & 0.0 & 7.0 & 8.0 \\ \end{pmatrix}\end{split}\]

where

\[\begin{split}\begin{array}{ll} ell\_val[9] & = \{1.0, 4.0, 6.0, 2.0, 5.0, 7.0, 3.0, 0.0, 8.0\} \\ ell\_col\_ind[9] & = \{0, 1, 0, 1, 2, 3, 3, -1, 4\} \end{array}\end{split}\]
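
The column-major layout means that the p-th non-zero of row i is stored at position `p * m + i`. The following host-side sketch builds the padded arrays from a dense matrix; `EllMatrix` and `dense_to_ell` are hypothetical names used only for illustration, and zero based indexing is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain host-side container mirroring the ELL arrays (hypothetical type).
struct EllMatrix
{
    int                 m         = 0;
    int                 n         = 0;
    int                 ell_width = 0;
    std::vector<double> ell_val;
    std::vector<int>    ell_col_ind;
};

EllMatrix dense_to_ell(const std::vector<std::vector<double>>& A)
{
    EllMatrix ell;
    ell.m = static_cast<int>(A.size());
    ell.n = A.empty() ? 0 : static_cast<int>(A[0].size());

    // ell_width is the largest number of non-zeros found in any row.
    for(const auto& row : A)
    {
        int count = 0;
        for(double v : row)
        {
            if(v != 0.0)
            {
                ++count;
            }
        }
        ell.ell_width = std::max(ell.ell_width, count);
    }

    // Pre-fill with the padding values: 0.0 for data, -1 for column indices.
    const std::size_t size = static_cast<std::size_t>(ell.m) * ell.ell_width;
    ell.ell_val.assign(size, 0.0);
    ell.ell_col_ind.assign(size, -1);

    for(int i = 0; i < ell.m; ++i)
    {
        int p = 0; // position of the next non-zero within row i
        for(int j = 0; j < ell.n; ++j)
        {
            if(A[i][j] != 0.0)
            {
                const std::size_t idx = static_cast<std::size_t>(p) * ell.m + i;
                ell.ell_val[idx]      = A[i][j];
                ell.ell_col_ind[idx]  = j;
                ++p;
            }
        }
    }
    return ell;
}
```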

HYB storage format¶

The Hybrid (HYB) storage format represents an $m \times n$ matrix by

m	number of rows (integer).
n	number of columns (integer).
nnz	number of non-zero elements of the COO part (integer).
ell_width	maximum number of non-zero elements per row of the ELL part (integer).
ell_val	array of `m * ell_width` elements containing the ELL part data (floating point).
ell_col_ind	array of `m * ell_width` elements containing the ELL part column indices (integer).
coo_val	array of `nnz` elements containing the COO part data (floating point).
coo_row_ind	array of `nnz` elements containing the COO part row indices (integer).
coo_col_ind	array of `nnz` elements containing the COO part column indices (integer).

The HYB format is a combination of the ELL and COO sparse matrix formats. Typically, the regular part of the matrix is stored in ELL storage format, and the irregular part of the matrix is stored in COO storage format. Three different partitioning schemes can be applied when converting a CSR matrix to a matrix in HYB storage format. For further details on the partitioning schemes, see rocsparse_hyb_partition.
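
To illustrate the split, the sketch below partitions a zero based CSR matrix with a user-given ELL width: the first `ell_width` entries of each row go to the ELL part (column-major, padded), the remainder to the COO part. `HybMatrix` and `csr_to_hyb` are hypothetical host-side names, and this splitting rule is an assumption for illustration rather than a description of the library's conversion routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain host-side container mirroring the HYB arrays (hypothetical type).
struct HybMatrix
{
    int                 m = 0, n = 0;
    int                 nnz       = 0; // non-zeros in the COO part
    int                 ell_width = 0;
    std::vector<double> ell_val;
    std::vector<int>    ell_col_ind;
    std::vector<double> coo_val;
    std::vector<int>    coo_row_ind;
    std::vector<int>    coo_col_ind;
};

HybMatrix csr_to_hyb(int m, int n,
                     const std::vector<int>&    csr_row_ptr,
                     const std::vector<int>&    csr_col_ind,
                     const std::vector<double>& csr_val,
                     int ell_width)
{
    HybMatrix hyb;
    hyb.m         = m;
    hyb.n         = n;
    hyb.ell_width = ell_width;

    // ELL part starts out fully padded (0.0 and -1).
    const std::size_t size = static_cast<std::size_t>(m) * ell_width;
    hyb.ell_val.assign(size, 0.0);
    hyb.ell_col_ind.assign(size, -1);

    for(int i = 0; i < m; ++i)
    {
        int p = 0; // entries of row i already placed in the ELL part
        for(int k = csr_row_ptr[i]; k < csr_row_ptr[i + 1]; ++k)
        {
            if(p < ell_width)
            {
                const std::size_t idx = static_cast<std::size_t>(p) * m + i;
                hyb.ell_val[idx]      = csr_val[k];
                hyb.ell_col_ind[idx]  = csr_col_ind[k];
                ++p;
            }
            else
            {
                // Overflow entries of long rows end up in the COO part.
                hyb.coo_val.push_back(csr_val[k]);
                hyb.coo_row_ind.push_back(i);
                hyb.coo_col_ind.push_back(csr_col_ind[k]);
            }
        }
    }
    hyb.nnz = static_cast<int>(hyb.coo_val.size());
    return hyb;
}
```

Choosing `ell_width` equal to the longest row leaves the COO part empty, at the cost of more padding in the ELL part.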

Types¶

rocsparse_handle¶

typedef struct _rocsparse_handle *rocsparse_handle¶

Handle to the rocSPARSE library context queue.

The rocSPARSE handle is a structure holding the rocSPARSE library context. It must be initialized using rocsparse_create_handle() and the returned handle must be passed to all subsequent library function calls. It should be destroyed at the end using rocsparse_destroy_handle().

rocsparse_mat_descr¶

typedef struct _rocsparse_mat_descr *rocsparse_mat_descr¶

Descriptor of the matrix.

The rocSPARSE matrix descriptor is a structure holding all properties of a matrix. It must be initialized using rocsparse_create_mat_descr() and the returned descriptor must be passed to all subsequent library calls that involve the matrix. It should be destroyed at the end using rocsparse_destroy_mat_descr().

rocsparse_mat_info¶

typedef struct _rocsparse_mat_info *rocsparse_mat_info¶

Info structure to hold all matrix meta data.

The rocSPARSE matrix info is a structure holding all matrix information that is gathered during analysis routines. It must be initialized using rocsparse_create_mat_info() and the returned info structure must be passed to all subsequent library calls that require additional matrix information. It should be destroyed at the end using rocsparse_destroy_mat_info().

rocsparse_hyb_mat¶

typedef struct _rocsparse_hyb_mat *rocsparse_hyb_mat¶

HYB matrix storage format.

The rocSPARSE HYB matrix structure holds the HYB matrix. It must be initialized using rocsparse_create_hyb_mat() and the returned HYB matrix must be passed to all subsequent library calls that involve the matrix. It should be destroyed at the end using rocsparse_destroy_hyb_mat().

rocsparse_action¶

enum rocsparse_action¶

Specify what the operation is performed on.

The rocsparse_action indicates whether the operation is performed on the full matrix, or only on the sparsity pattern of the matrix.

Values:

rocsparse_action_symbolic = 0¶: Operate only on indices.

rocsparse_action_numeric = 1¶: Operate on data and indices.

rocsparse_hyb_partition¶

enum rocsparse_hyb_partition¶

HYB matrix partitioning type.

The rocsparse_hyb_partition type indicates how the hybrid format partitioning between COO and ELL storage formats is performed.

Values:

rocsparse_hyb_partition_auto = 0¶: automatically decide on ELL nnz per row.

rocsparse_hyb_partition_user = 1¶: user given ELL nnz per row.

rocsparse_hyb_partition_max = 2¶: max ELL nnz per row, no COO part.

rocsparse_index_base¶

enum rocsparse_index_base¶

Specify the matrix index base.

The rocsparse_index_base indicates the index base of the indices. For a given rocsparse_mat_descr, the rocsparse_index_base can be set using rocsparse_set_mat_index_base(). The current rocsparse_index_base of a matrix can be obtained by rocsparse_get_mat_index_base().

Values:

rocsparse_index_base_zero = 0¶: zero based indexing.

rocsparse_index_base_one = 1¶: one based indexing.

rocsparse_matrix_type¶

enum rocsparse_matrix_type¶

Specify the matrix type.

The rocsparse_matrix_type indicates the type of a matrix. For a given rocsparse_mat_descr, the rocsparse_matrix_type can be set using rocsparse_set_mat_type(). The current rocsparse_matrix_type of a matrix can be obtained by rocsparse_get_mat_type().

Values:

rocsparse_matrix_type_general = 0¶: general matrix type.

rocsparse_matrix_type_symmetric = 1¶: symmetric matrix type.

rocsparse_matrix_type_hermitian = 2¶: hermitian matrix type.

rocsparse_matrix_type_triangular = 3¶: triangular matrix type.

rocsparse_fill_mode¶

enum rocsparse_fill_mode¶

Specify the matrix fill mode.

The rocsparse_fill_mode indicates whether the lower or the upper part is stored in a sparse triangular matrix. For a given rocsparse_mat_descr, the rocsparse_fill_mode can be set using rocsparse_set_mat_fill_mode(). The current rocsparse_fill_mode of a matrix can be obtained by rocsparse_get_mat_fill_mode().

Values:

rocsparse_fill_mode_lower = 0¶: lower triangular part is stored.

rocsparse_fill_mode_upper = 1¶: upper triangular part is stored.

rocsparse_diag_type¶

enum rocsparse_diag_type¶

Indicates if the diagonal entries are unity.

The rocsparse_diag_type indicates whether the diagonal entries of a matrix are unity or not. If rocsparse_diag_type_unit is specified, all present diagonal values will be ignored. For a given rocsparse_mat_descr, the rocsparse_diag_type can be set using rocsparse_set_mat_diag_type(). The current rocsparse_diag_type of a matrix can be obtained by rocsparse_get_mat_diag_type().

Values:

rocsparse_diag_type_non_unit = 0¶: diagonal entries are non-unity.

rocsparse_diag_type_unit = 1¶: diagonal entries are unity.

rocsparse_operation¶

enum rocsparse_operation¶

Specify whether the matrix is to be transposed or not.

The rocsparse_operation indicates the operation performed with the given matrix.

Values:

rocsparse_operation_none = 111¶: Operate with matrix.

rocsparse_operation_transpose = 112¶: Operate with transpose.

rocsparse_operation_conjugate_transpose = 113¶: Operate with conj. transpose.

rocsparse_pointer_mode¶

enum rocsparse_pointer_mode¶

Indicates if the pointer is a device pointer or a host pointer.

The rocsparse_pointer_mode indicates whether scalar values are passed by reference on the host or device. The rocsparse_pointer_mode can be changed by rocsparse_set_pointer_mode(). The currently used pointer mode can be obtained by rocsparse_get_pointer_mode().

Values:

rocsparse_pointer_mode_host = 0¶: scalar pointers are in host memory.

rocsparse_pointer_mode_device = 1¶: scalar pointers are in device memory.

rocsparse_analysis_policy¶

enum rocsparse_analysis_policy¶

Specify policy in analysis functions.

The rocsparse_analysis_policy specifies whether gathered analysis data should be re-used or not. If meta data from a previous analysis call, e.g. rocsparse_csrilu0_analysis(), is available, it can be re-used for subsequent calls to e.g. rocsparse_csrsv_analysis() and greatly improve performance of the analysis function.

Values:

rocsparse_analysis_policy_reuse = 0¶: try to re-use meta data.

rocsparse_analysis_policy_force = 1¶: force to re-build meta data.

rocsparse_solve_policy¶

enum rocsparse_solve_policy¶

Specify policy in triangular solvers and factorizations.

This is a placeholder.

Values:

rocsparse_solve_policy_auto = 0¶: automatically decide on level information.

rocsparse_layer_mode¶

enum rocsparse_layer_mode¶

Indicates if layer is active with bitmask.

The rocsparse_layer_mode bit mask indicates the logging characteristics.

Values:

rocsparse_layer_mode_none = 0x0¶: layer is not active.

rocsparse_layer_mode_log_trace = 0x1¶: layer is in logging mode.

rocsparse_layer_mode_log_bench = 0x2¶: layer is in benchmarking mode.

rocsparse_status¶

enum rocsparse_status¶

List of rocSPARSE status code definitions.

This is a list of the rocsparse_status types that are used by the rocSPARSE library.

Values:

rocsparse_status_success = 0¶: success.

rocsparse_status_invalid_handle = 1¶: handle not initialized, invalid or null.

rocsparse_status_not_implemented = 2¶: function is not implemented.

rocsparse_status_invalid_pointer = 3¶: invalid pointer parameter.

rocsparse_status_invalid_size = 4¶: invalid size parameter.

rocsparse_status_memory_error = 5¶: failed memory allocation, copy, dealloc.

rocsparse_status_internal_error = 6¶: other internal library failure.

rocsparse_status_invalid_value = 7¶: invalid value parameter.

rocsparse_status_arch_mismatch = 8¶: device arch is not supported.

rocsparse_status_zero_pivot = 9¶: encountered zero pivot.
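
When reporting errors it can help to turn a status value into readable text. The sketch below maps the numeric values listed above to their descriptions; `status_to_string` is a hypothetical helper that takes a plain `int` so that it compiles without the rocSPARSE header, and real code would pass a `rocsparse_status` instead.

```cpp
#include <cassert>
#include <string>

// Map a status value (as listed above) to its description.
// Hypothetical helper; values outside the documented range yield "unknown status".
const char* status_to_string(int status)
{
    switch(status)
    {
    case 0: return "success";
    case 1: return "handle not initialized, invalid or null";
    case 2: return "function is not implemented";
    case 3: return "invalid pointer parameter";
    case 4: return "invalid size parameter";
    case 5: return "failed memory allocation, copy, dealloc";
    case 6: return "other internal library failure";
    case 7: return "invalid value parameter";
    case 8: return "device arch is not supported";
    case 9: return "encountered zero pivot";
    default: return "unknown status";
    }
}
```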

Logging¶

Three different environment variables can be set to enable logging in rocSPARSE: ROCSPARSE_LAYER, ROCSPARSE_LOG_TRACE_PATH and ROCSPARSE_LOG_BENCH_PATH.

ROCSPARSE_LAYER is a bit mask, where several logging modes can be combined as follows:

`ROCSPARSE_LAYER` unset	logging is disabled.
`ROCSPARSE_LAYER` set to `1`	trace logging is enabled.
`ROCSPARSE_LAYER` set to `2`	bench logging is enabled.
`ROCSPARSE_LAYER` set to `3`	trace logging and bench logging are enabled.
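
The table follows directly from treating the value as a bit mask with bit `0x1` for trace logging and bit `0x2` for bench logging. The sketch below decodes such a value; `LayerSettings` and `parse_layer` are hypothetical names, and how rocSPARSE itself parses the variable is not shown here.

```cpp
#include <cassert>
#include <cstdlib>

// Bit values taken from the rocsparse_layer_mode description;
// the constant names are local to this sketch.
constexpr int layer_mode_log_trace = 0x1;
constexpr int layer_mode_log_bench = 0x2;

struct LayerSettings
{
    bool trace = false;
    bool bench = false;
};

// Interpret the textual value of ROCSPARSE_LAYER; nullptr means "unset".
LayerSettings parse_layer(const char* value)
{
    LayerSettings settings;
    if(value == nullptr)
    {
        return settings; // logging disabled
    }
    const int mask = std::atoi(value);
    settings.trace = (mask & layer_mode_log_trace) != 0;
    settings.bench = (mask & layer_mode_log_bench) != 0;
    return settings;
}
```

A caller could pass `std::getenv("ROCSPARSE_LAYER")` to inspect the current environment.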

When logging is enabled, each rocSPARSE function call will write the function name as well as function arguments to the logging stream. The default logging stream is stderr.

If the user sets the environment variable ROCSPARSE_LOG_TRACE_PATH to the full path name for a file, the file is opened and trace logging is streamed to that file. If the user sets the environment variable ROCSPARSE_LOG_BENCH_PATH to the full path name for a file, the file is opened and bench logging is streamed to that file. If the file cannot be opened, logging output is streamed to stderr.

Note that performance will degrade when logging is enabled. By default, the environment variable ROCSPARSE_LAYER is unset and logging is disabled.

Sparse Auxiliary Functions¶

This module holds all sparse auxiliary functions.

The functions that are contained in the auxiliary module describe all available helper functions that are required for subsequent library calls.

rocsparse_create_handle()¶

rocsparse_status rocsparse_create_handle(rocsparse_handle *handle)¶

Create a rocsparse handle.

rocsparse_create_handle creates the rocSPARSE library context. It must be initialized before any other rocSPARSE API function is invoked and must be passed to all subsequent library function calls. The handle should be destroyed at the end using rocsparse_destroy_handle().

Parameters

[out] handle: the pointer to the handle to the rocSPARSE library context.

Return Value

rocsparse_status_success: the initialization succeeded.
rocsparse_status_invalid_handle: handle pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_destroy_handle()¶

rocsparse_status rocsparse_destroy_handle(rocsparse_handle handle)¶

Destroy a rocsparse handle.

rocsparse_destroy_handle destroys the rocSPARSE library context and releases all resources used by the rocSPARSE library.

Parameters

[in] handle: the handle to the rocSPARSE library context.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_set_stream()¶

rocsparse_status rocsparse_set_stream(rocsparse_handle handle, hipStream_t stream)¶

Specify user defined HIP stream.

rocsparse_set_stream specifies the stream to be used by the rocSPARSE library context and all subsequent function calls.

Example

This example illustrates how a user-defined stream can be used in rocSPARSE.

// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create stream
hipStream_t stream;
hipStreamCreate(&stream);

// Set stream to rocSPARSE handle
rocsparse_set_stream(handle, stream);

// Do some work
// ...

// Clean up
rocsparse_destroy_handle(handle);
hipStreamDestroy(stream);

Parameters

[inout] handle: the handle to the rocSPARSE library context.
[in] stream: the stream to be used by the rocSPARSE library context.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_get_stream()¶

rocsparse_status rocsparse_get_stream(rocsparse_handle handle, hipStream_t *stream)¶

Get current stream from library context.

rocsparse_get_stream gets the rocSPARSE library context stream which is currently used for all subsequent function calls.

Parameters

[in] handle: the handle to the rocSPARSE library context.
[out] stream: the stream currently used by the rocSPARSE library context.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_set_pointer_mode()¶

rocsparse_status rocsparse_set_pointer_mode(rocsparse_handle handle, rocsparse_pointer_mode pointer_mode)¶

Specify pointer mode.

rocsparse_set_pointer_mode specifies the pointer mode to be used by the rocSPARSE library context and all subsequent function calls. By default, all values are passed by reference on the host. Valid pointer modes are rocsparse_pointer_mode_host or rocsparse_pointer_mode_device.

Parameters

[in] handle: the handle to the rocSPARSE library context.
[in] pointer_mode: the pointer mode to be used by the rocSPARSE library context.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_get_pointer_mode()¶

rocsparse_status rocsparse_get_pointer_mode(rocsparse_handle handle, rocsparse_pointer_mode *pointer_mode)¶

Get current pointer mode from library context.

rocsparse_get_pointer_mode gets the rocSPARSE library context pointer mode which is currently used for all subsequent function calls.

Parameters

[in] handle: the handle to the rocSPARSE library context.
[out] pointer_mode: the pointer mode that is currently used by the rocSPARSE library context.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_get_version()¶

rocsparse_status rocsparse_get_version(rocsparse_handle handle, int *version)¶

Get rocSPARSE version.

rocsparse_get_version gets the rocSPARSE library version number.

patch = version % 100
minor = version / 100 % 1000
major = version / 100000
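
As a minimal sketch, the decoding formulas above can be wrapped in a small host-side helper. The names `version_parts` and `decode_version` are hypothetical and not part of rocSPARSE; the input is assumed to be the integer written by rocsparse_get_version.

```c
#include <assert.h>

/* Hypothetical container for the three version components. */
typedef struct
{
    int major;
    int minor;
    int patch;
} version_parts;

/* Hypothetical helper: splits the packed version integer using the
   formulas given in the text above. */
version_parts decode_version(int version)
{
    assert(version >= 0);

    version_parts v;
    v.patch = version % 100;
    v.minor = version / 100 % 1000;
    v.major = version / 100000;
    return v;
}
```

With these formulas, a packed value of 100203 decodes to major 1, minor 2, patch 3.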

Parameters

[in] handle: the handle to the rocSPARSE library context.
[out] version: the version number of the rocSPARSE library.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_get_git_rev()¶

rocsparse_status rocsparse_get_git_rev(rocsparse_handle handle, char *rev)¶

Get rocSPARSE git revision.

rocsparse_get_git_rev gets the rocSPARSE library git commit revision (SHA-1).

Parameters

[in] handle: the handle to the rocSPARSE library context.
[out] rev: the git commit revision (SHA-1).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: handle is invalid.

rocsparse_create_mat_descr()¶

rocsparse_status rocsparse_create_mat_descr(rocsparse_mat_descr *descr)¶

Create a matrix descriptor.

rocsparse_create_mat_descr creates a matrix descriptor. It initializes rocsparse_matrix_type to rocsparse_matrix_type_general and rocsparse_index_base to rocsparse_index_base_zero. It should be destroyed at the end using rocsparse_destroy_mat_descr().

Parameters

[out] descr: the pointer to the matrix descriptor.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.

rocsparse_destroy_mat_descr()¶

rocsparse_status rocsparse_destroy_mat_descr(rocsparse_mat_descr descr)¶

Destroy a matrix descriptor.

rocsparse_destroy_mat_descr destroys a matrix descriptor and releases all resources used by the descriptor.

Parameters

[in] descr: the matrix descriptor.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr is invalid.

rocsparse_copy_mat_descr()¶

rocsparse_status rocsparse_copy_mat_descr(rocsparse_mat_descr dest, const rocsparse_mat_descr src)¶

Copy a matrix descriptor.

rocsparse_copy_mat_descr copies a matrix descriptor. Both the source and the destination matrix descriptors must be initialized prior to calling rocsparse_copy_mat_descr.

Parameters

[out] dest: the pointer to the destination matrix descriptor.
[in] src: the pointer to the source matrix descriptor.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: src or dest pointer is invalid.

rocsparse_set_mat_index_base()¶

rocsparse_status rocsparse_set_mat_index_base(rocsparse_mat_descr descr, rocsparse_index_base base)¶

Specify the index base of a matrix descriptor.

rocsparse_set_mat_index_base sets the index base of a matrix descriptor. Valid options are rocsparse_index_base_zero or rocsparse_index_base_one.

Parameters

[inout] descr: the matrix descriptor.
[in] base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: base is invalid.

rocsparse_get_mat_index_base()¶

rocsparse_index_base rocsparse_get_mat_index_base(const rocsparse_mat_descr descr)¶

Get the index base of a matrix descriptor.

rocsparse_get_mat_index_base returns the index base of a matrix descriptor.

Return

rocsparse_index_base_zero or rocsparse_index_base_one.

Parameters

[in] descr: the matrix descriptor.

rocsparse_set_mat_type()¶

rocsparse_status rocsparse_set_mat_type(rocsparse_mat_descr descr, rocsparse_matrix_type type)¶

Specify the matrix type of a matrix descriptor.

rocsparse_set_mat_type sets the matrix type of a matrix descriptor. Valid matrix types are rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.

Parameters

[inout] descr: the matrix descriptor.
[in] type: rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: type is invalid.

rocsparse_get_mat_type()¶

rocsparse_matrix_type rocsparse_get_mat_type(const rocsparse_mat_descr descr)¶

Get the matrix type of a matrix descriptor.

rocsparse_get_mat_type returns the matrix type of a matrix descriptor.

Return

rocsparse_matrix_type_general, rocsparse_matrix_type_symmetric, rocsparse_matrix_type_hermitian or rocsparse_matrix_type_triangular.

Parameters

[in] descr: the matrix descriptor.

rocsparse_set_mat_fill_mode()¶

rocsparse_status rocsparse_set_mat_fill_mode(rocsparse_mat_descr descr, rocsparse_fill_mode fill_mode)¶

Specify the matrix fill mode of a matrix descriptor.

rocsparse_set_mat_fill_mode sets the matrix fill mode of a matrix descriptor. Valid fill modes are rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.

Parameters

[inout] descr: the matrix descriptor.
[in] fill_mode: rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: fill_mode is invalid.

rocsparse_get_mat_fill_mode()¶

rocsparse_fill_mode rocsparse_get_mat_fill_mode(const rocsparse_mat_descr descr)¶

Get the matrix fill mode of a matrix descriptor.

rocsparse_get_mat_fill_mode returns the matrix fill mode of a matrix descriptor.

Return

rocsparse_fill_mode_lower or rocsparse_fill_mode_upper.

Parameters

[in] descr: the matrix descriptor.

rocsparse_set_mat_diag_type()¶

rocsparse_status rocsparse_set_mat_diag_type(rocsparse_mat_descr descr, rocsparse_diag_type diag_type)¶

Specify the matrix diagonal type of a matrix descriptor.

rocsparse_set_mat_diag_type sets the matrix diagonal type of a matrix descriptor. Valid diagonal types are rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.

Parameters

[inout] descr: the matrix descriptor.
[in] diag_type: rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: descr pointer is invalid.
rocsparse_status_invalid_value: diag_type is invalid.

rocsparse_get_mat_diag_type()¶

rocsparse_diag_type rocsparse_get_mat_diag_type(const rocsparse_mat_descr descr)¶

Get the matrix diagonal type of a matrix descriptor.

rocsparse_get_mat_diag_type returns the matrix diagonal type of a matrix descriptor.

Return

rocsparse_diag_type_unit or rocsparse_diag_type_non_unit.

Parameters

[in] descr: the matrix descriptor.

rocsparse_create_hyb_mat()¶

rocsparse_status rocsparse_create_hyb_mat(rocsparse_hyb_mat *hyb)¶

Create a HYB matrix structure.

rocsparse_create_hyb_mat creates a structure that holds the matrix in HYB storage format. It should be destroyed at the end using rocsparse_destroy_hyb_mat().

Parameters

[inout] hyb: the pointer to the hybrid matrix.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: hyb pointer is invalid.

rocsparse_destroy_hyb_mat()¶

rocsparse_status rocsparse_destroy_hyb_mat(rocsparse_hyb_mat hyb)¶

Destroy a HYB matrix structure.

rocsparse_destroy_hyb_mat destroys a HYB structure.

Parameters

[in] hyb: the hybrid matrix structure.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: hyb pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_create_mat_info()¶

rocsparse_status rocsparse_create_mat_info(rocsparse_mat_info *info)¶

Create a matrix info structure.

rocsparse_create_mat_info creates a structure that holds the matrix info data gathered by the available analysis routines. It should be destroyed at the end using rocsparse_destroy_mat_info().

Parameters

[inout] info: the pointer to the info structure.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: info pointer is invalid.

rocsparse_destroy_mat_info()¶

rocsparse_status rocsparse_destroy_mat_info(rocsparse_mat_info info)¶

Destroy a matrix info structure.

rocsparse_destroy_mat_info destroys a matrix info structure.

Parameters

[in] info: the info structure.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.

Sparse Level 1 Functions¶

The sparse level 1 routines describe operations between a vector in sparse format and a vector in dense format. This section describes all rocSPARSE level 1 sparse linear algebra functions.

rocsparse_axpyi()¶

rocsparse_status rocsparse_saxpyi(rocsparse_handle handle, rocsparse_int nnz, const float *alpha, const float *x_val, const rocsparse_int *x_ind, float *y, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_daxpyi(rocsparse_handle handle, rocsparse_int nnz, const double *alpha, const double *x_val, const rocsparse_int *x_ind, double *y, rocsparse_index_base idx_base)¶

Scale a sparse vector and add it to a dense vector.

rocsparse_axpyi multiplies the sparse vector $x$ with scalar $\alpha$ and adds the result to the dense vector $y$, such that

\[ y := y + \alpha \cdot x \]

for(i = 0; i < nnz; ++i)
{
    y[x_ind[i]] = y[x_ind[i]] + alpha * x_val[i];
}
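
The loop above does not show how idx_base enters the computation. The following host-side reference is a sketch under the assumption that the index base is simply subtracted from each stored index. `ref_saxpyi` is a hypothetical name, plain `int` stands in for rocsparse_int, and idx_base is passed as the integer 0 or 1 rather than as the rocsparse_index_base enum.

```c
#include <assert.h>

/* Host-side reference for axpyi (hypothetical helper, not a rocSPARSE call).
   Computes y := y + alpha * x for a sparse x given by (x_val, x_ind). */
void ref_saxpyi(int nnz, float alpha, const float* x_val, const int* x_ind,
                float* y, int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < nnz; ++i)
    {
        /* Shift the stored index to a zero-based position in y. */
        int k = x_ind[i] - idx_base;
        y[k]  = y[k] + alpha * x_val[i];
    }
}
```

A reference like this can be useful for checking device results on small inputs after synchronizing the stream.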

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of vector $x$.
[in] alpha: scalar $\alpha$.
[in] x_val: array of nnz elements containing the values of $x$.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[inout] y: array of values in dense format.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: alpha, x_val, x_ind or y pointer is invalid.

rocsparse_doti()¶

rocsparse_status rocsparse_sdoti(rocsparse_handle handle, rocsparse_int nnz, const float *x_val, const rocsparse_int *x_ind, const float *y, float *result, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_ddoti(rocsparse_handle handle, rocsparse_int nnz, const double *x_val, const rocsparse_int *x_ind, const double *y, double *result, rocsparse_index_base idx_base)¶

Compute the dot product of a sparse vector with a dense vector.

rocsparse_doti computes the dot product of the sparse vector $x$ with the dense vector $y$, such that

\[ \text{result} := y^T x \]

result = 0;
for(i = 0; i < nnz; ++i)
{
    result += x_val[i] * y[x_ind[i]];
}
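
A host-side reference for the dot product might look as follows. This is a sketch, not the library's implementation: `ref_ddoti` is a hypothetical name, the result is returned by value instead of through a pointer, and idx_base is assumed to be the integer 0 or 1 that is subtracted from each stored index.

```c
#include <assert.h>

/* Host-side reference for doti (hypothetical helper, not a rocSPARSE call).
   Returns the dot product of sparse x (x_val, x_ind) with dense y. */
double ref_ddoti(int nnz, const double* x_val, const int* x_ind,
                 const double* y, int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    /* The accumulator starts at zero; only the nnz listed entries contribute. */
    double result = 0.0;

    for(int i = 0; i < nnz; ++i)
    {
        result += x_val[i] * y[x_ind[i] - idx_base];
    }

    return result;
}
```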

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of vector $x$.
[in] x_val: array of nnz values.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[in] y: array of values in dense format.
[out] result: pointer to the result, can be host or device memory.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: x_val, x_ind, y or result pointer is invalid.
rocsparse_status_memory_error: the buffer for the dot product reduction could not be allocated.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_gthr()¶

rocsparse_status rocsparse_sgthr(rocsparse_handle handle, rocsparse_int nnz, const float *y, float *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_dgthr(rocsparse_handle handle, rocsparse_int nnz, const double *y, double *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶

Gather elements from a dense vector and store them into a sparse vector.

rocsparse_gthr gathers the elements that are listed in x_ind from the dense vector $y$ and stores them in the sparse vector $x$.

for(i = 0; i < nnz; ++i)
{
    x_val[i] = y[x_ind[i]];
}
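
The gather operation can be sketched on the host as shown here. `ref_sgthr` is a hypothetical name, and idx_base is assumed to be the integer 0 or 1 that is subtracted from each stored index.

```c
#include <assert.h>

/* Host-side reference for gthr (hypothetical helper, not a rocSPARSE call).
   Copies the entries of dense y listed in x_ind into x_val. */
void ref_sgthr(int nnz, const float* y, float* x_val, const int* x_ind,
               int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < nnz; ++i)
    {
        x_val[i] = y[x_ind[i] - idx_base];
    }
}
```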

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of $x$.
[in] y: array of values in dense format.
[out] x_val: array of nnz elements containing the values of $x$.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: y, x_val or x_ind pointer is invalid.

rocsparse_gthrz()¶

rocsparse_status rocsparse_sgthrz(rocsparse_handle handle, rocsparse_int nnz, float *y, float *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_dgthrz(rocsparse_handle handle, rocsparse_int nnz, double *y, double *x_val, const rocsparse_int *x_ind, rocsparse_index_base idx_base)¶

Gather and zero out elements from a dense vector and store them into a sparse vector.

rocsparse_gthrz gathers the elements that are listed in x_ind from the dense vector $y$ and stores them in the sparse vector $x$. The gathered elements in $y$ are replaced by zero.

for(i = 0; i < nnz; ++i)
{
    x_val[i]    = y[x_ind[i]];
    y[x_ind[i]] = 0;
}
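
The same operation with one-based indices can be sketched as a host-side reference. `ref_sgthrz` is a hypothetical name, and idx_base is assumed to be the integer 0 or 1 that is subtracted from each stored index.

```c
#include <assert.h>

/* Host-side reference for gthrz (hypothetical helper, not a rocSPARSE call).
   Gathers the listed entries of y into x_val and zeroes them in y. */
void ref_sgthrz(int nnz, float* y, float* x_val, const int* x_ind,
                int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < nnz; ++i)
    {
        int k    = x_ind[i] - idx_base;
        x_val[i] = y[k];
        y[k]     = 0.0f; /* the gathered entry is cleared in the dense vector */
    }
}
```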

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of $x$.
[inout] y: array of values in dense format.
[out] x_val: array of nnz elements containing the non-zero values of $x$.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: y, x_val or x_ind pointer is invalid.

rocsparse_roti()¶

rocsparse_status rocsparse_sroti(rocsparse_handle handle, rocsparse_int nnz, float *x_val, const rocsparse_int *x_ind, float *y, const float *c, const float *s, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_droti(rocsparse_handle handle, rocsparse_int nnz, double *x_val, const rocsparse_int *x_ind, double *y, const double *c, const double *s, rocsparse_index_base idx_base)¶

Apply Givens rotation to a dense and a sparse vector.

rocsparse_roti applies the Givens rotation matrix $G$ to the sparse vector $x$ and the dense vector $y$, where

\[\begin{split} G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \end{split}\]

for(i = 0; i < nnz; ++i)
{
    x_tmp = x_val[i];
    y_tmp = y[x_ind[i]];

    x_val[i]    = c * x_tmp + s * y_tmp;
    y[x_ind[i]] = c * y_tmp - s * x_tmp;
}
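
As a sketch, the rotation can be written as a host-side reference. `ref_sroti` is a hypothetical name, c and s are passed by value rather than by pointer, and idx_base is assumed to be the integer 0 or 1 that is subtracted from each stored index. Only the entries of y listed in x_ind are touched.

```c
#include <assert.h>

/* Host-side reference for roti (hypothetical helper, not a rocSPARSE call).
   Applies the Givens rotation G = [c s; -s c] to each pair
   (x_val[i], y[x_ind[i]]). */
void ref_sroti(int nnz, float* x_val, const int* x_ind, float* y,
               float c, float s, int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < nnz; ++i)
    {
        int   k     = x_ind[i] - idx_base;
        float x_tmp = x_val[i];
        float y_tmp = y[k];

        /* Both outputs are computed from the saved, unrotated values. */
        x_val[i] = c * x_tmp + s * y_tmp;
        y[k]     = c * y_tmp - s * x_tmp;
    }
}
```

With c = 0 and s = 1 the rotation swaps each pair and negates the value written to y, which makes a convenient sanity check.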

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of $x$.
[inout] x_val: array of nnz elements containing the non-zero values of $x$.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[inout] y: array of values in dense format.
[in] c: pointer to the cosine element of $G$, can be on host or device.
[in] s: pointer to the sine element of $G$, can be on host or device.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: c, s, x_val, x_ind or y pointer is invalid.

rocsparse_sctr()¶

rocsparse_status rocsparse_ssctr(rocsparse_handle handle, rocsparse_int nnz, const float *x_val, const rocsparse_int *x_ind, float *y, rocsparse_index_base idx_base)¶

rocsparse_status rocsparse_dsctr(rocsparse_handle handle, rocsparse_int nnz, const double *x_val, const rocsparse_int *x_ind, double *y, rocsparse_index_base idx_base)¶

Scatter elements from a sparse vector into a dense vector.

rocsparse_sctr scatters the elements that are listed in x_ind from the sparse vector $x$ into the dense vector $y$. Indices of $y$ that are not listed in x_ind remain unchanged.

for(i = 0; i < nnz; ++i)
{
    y[x_ind[i]] = x_val[i];
}
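
A host-side reference for the scatter might look like this. `ref_ssctr` is a hypothetical name, and idx_base is assumed to be the integer 0 or 1 that is subtracted from each stored index.

```c
#include <assert.h>

/* Host-side reference for sctr (hypothetical helper, not a rocSPARSE call).
   Writes each sparse value into its position in dense y; all other
   entries of y are left untouched. */
void ref_ssctr(int nnz, const float* x_val, const int* x_ind, float* y,
               int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < nnz; ++i)
    {
        y[x_ind[i] - idx_base] = x_val[i];
    }
}
```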

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] nnz: number of non-zero entries of $x$.
[in] x_val: array of nnz elements containing the non-zero values of $x$.
[in] x_ind: array of nnz elements containing the indices of the non-zero values of $x$.
[inout] y: array of values in dense format.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_value: idx_base is invalid.
rocsparse_status_invalid_size: nnz is invalid.
rocsparse_status_invalid_pointer: x_val, x_ind or y pointer is invalid.

Sparse Level 2 Functions¶

This module holds all sparse level 2 routines.

The sparse level 2 routines describe operations between a matrix in sparse format and a vector in dense format.

rocsparse_coomv()¶

rocsparse_status rocsparse_scoomv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *coo_val, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, const float *x, const float *beta, float *y)¶

rocsparse_status rocsparse_dcoomv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *coo_val, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, const double *x, const double *beta, double *y)¶

Sparse matrix vector multiplication using COO storage format.

rocsparse_coomv multiplies the scalar $\alpha$ with a sparse $m \times n$ matrix, defined in COO storage format, and the dense vector $x$ and adds the result to the dense vector $y$ that is multiplied by the scalar $\beta$, such that

\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

The COO matrix has to be sorted by row indices. This can be achieved by using rocsparse_coosort_by_row().

for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];
}

for(i = 0; i < nnz; ++i)
{
    y[coo_row_ind[i]] += alpha * coo_val[i] * x[coo_col_ind[i]];
}
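
The two loops above can be sketched as a host-side reference for the non-transposed case. `ref_scoomv` is a hypothetical name. The index base, which the library takes from the matrix descriptor, is passed here as a plain integer 0 or 1 and is assumed to be subtracted from both the row and the column indices.

```c
#include <assert.h>

/* Host-side reference for coomv with op(A) = A (hypothetical helper, not a
   rocSPARSE call). Computes y := alpha * A * x + beta * y. */
void ref_scoomv(int m, int nnz, float alpha, const float* coo_val,
                const int* coo_row_ind, const int* coo_col_ind,
                const float* x, float beta, float* y, int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    /* Scale the output vector once, before any entry is accumulated. */
    for(int i = 0; i < m; ++i)
    {
        y[i] = beta * y[i];
    }

    /* Each stored entry contributes to exactly one row of y. */
    for(int i = 0; i < nnz; ++i)
    {
        int row = coo_row_ind[i] - idx_base;
        int col = coo_col_ind[i] - idx_base;
        y[row] += alpha * coo_val[i] * x[col];
    }
}
```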

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans == rocsparse_operation_none is supported.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse COO matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] coo_val: array of nnz elements of the sparse COO matrix.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[in] x: array of n elements ( $op(A) = A$) or m elements ( $op(A) = A^T$ or $op(A) = A^H$).
[in] beta: scalar $\beta$.
[inout] y: array of m elements ( $op(A) = A$) or n elements ( $op(A) = A^T$ or $op(A) = A^H$).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, coo_val, coo_row_ind, coo_col_ind, x, beta or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrmv_analysis()¶

rocsparse_status rocsparse_scsrmv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info)¶

rocsparse_status rocsparse_dcsrmv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info)¶

Sparse matrix vector multiplication using CSR storage format.

rocsparse_csrmv_analysis performs the analysis step for rocsparse_scsrmv() and rocsparse_dcsrmv(). It is expected that this function will be executed only once for a given matrix and particular operation type. The gathered analysis meta data can be cleared by rocsparse_csrmv_clear().

Note

If the matrix sparsity pattern changes, the gathered information will become invalid.

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind or info pointer is invalid.
rocsparse_status_memory_error: the buffer for the gathered information could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrmv()¶

rocsparse_status rocsparse_scsrmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const float *x, const float *beta, float *y)¶

rocsparse_status rocsparse_dcsrmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const double *x, const double *beta, double *y)¶

Sparse matrix vector multiplication using CSR storage format.

rocsparse_csrmv multiplies the scalar $\alpha$ with a sparse $m \times n$ matrix, defined in CSR storage format, and the dense vector $x$ and adds the result to the dense vector $y$ that is multiplied by the scalar $\beta$, such that

\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

The info parameter is optional and contains information collected by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(). If present, the information will be used to speed up the csrmv computation. If info == NULL, the general csrmv routine will be used instead.

for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];

    for(j = csr_row_ptr[i]; j < csr_row_ptr[i + 1]; ++j)
    {
        y[i] = y[i] + alpha * csr_val[j] * x[csr_col_ind[j]];
    }
}
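
The loop above can be sketched as a host-side reference for the non-transposed case. `ref_scsrmv` is a hypothetical name. The index base, which the library takes from the matrix descriptor, is passed here as a plain integer 0 or 1 and is assumed to apply to both csr_row_ptr and csr_col_ind.

```c
#include <assert.h>

/* Host-side reference for csrmv with op(A) = A (hypothetical helper, not a
   rocSPARSE call). Computes y := alpha * A * x + beta * y. */
void ref_scsrmv(int m, float alpha, const float* csr_val,
                const int* csr_row_ptr, const int* csr_col_ind,
                const float* x, float beta, float* y, int idx_base)
{
    assert(idx_base == 0 || idx_base == 1);

    for(int i = 0; i < m; ++i)
    {
        /* Entries of row i occupy [row_begin, row_end) in csr_val. */
        int row_begin = csr_row_ptr[i] - idx_base;
        int row_end   = csr_row_ptr[i + 1] - idx_base;

        float sum = beta * y[i];

        for(int j = row_begin; j < row_end; ++j)
        {
            sum += alpha * csr_val[j] * x[csr_col_ind[j] - idx_base];
        }

        y[i] = sum;
    }
}
```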

Note

This function is non-blocking and executes asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans == rocsparse_operation_none is supported.

Example

This example performs a sparse matrix vector multiplication in CSR format using additional meta data to improve performance.

// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Perform analysis step to obtain meta data
rocsparse_scsrmv_analysis(handle,
                          rocsparse_operation_none,
                          m,
                          n,
                          nnz,
                          descr,
                          csr_val,
                          csr_row_ptr,
                          csr_col_ind,
                          info);

// Compute y = Ax
rocsparse_scsrmv(handle,
                 rocsparse_operation_none,
                 m,
                 n,
                 nnz,
                 &alpha,
                 descr,
                 csr_val,
                 csr_row_ptr,
                 csr_col_ind,
                 info,
                 x,
                 &beta,
                 y);

// Do more work
// ...

// Clean up
rocsparse_destroy_mat_info(info);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: information collected by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(), can be NULL if no information is available.
[in] x: array of n elements ( $op(A) == A$) or m elements ( $op(A) == A^T$ or $op(A) == A^H$).
[in] beta: scalar $\beta$.
[inout] y: array of m elements ( $op(A) == A$) or n elements ( $op(A) == A^T$ or $op(A) == A^H$).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, x, beta or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrmv_clear()¶

rocsparse_status rocsparse_csrmv_clear(rocsparse_handle handle, rocsparse_mat_info info)¶

Clear the analysis data of the sparse matrix vector multiplication using CSR storage format.

rocsparse_csrmv_clear deallocates all memory that was allocated by rocsparse_scsrmv_analysis() or rocsparse_dcsrmv_analysis(). This is especially useful if memory is an issue and the analysis data is no longer required for further computation, e.g. when switching to another sparse matrix format.

Note

Calling rocsparse_csrmv_clear is optional. All allocated resources will be cleared when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().

Parameters

[in] handle: handle to the rocsparse library context queue.
[inout] info: structure that holds the information collected during analysis step.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer for the gathered information could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_ellmv()¶

rocsparse_status rocsparse_sellmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, const float *alpha, const rocsparse_mat_descr descr, const float *ell_val, const rocsparse_int *ell_col_ind, rocsparse_int ell_width, const float *x, const float *beta, float *y)¶

rocsparse_status rocsparse_dellmv(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int n, const double *alpha, const rocsparse_mat_descr descr, const double *ell_val, const rocsparse_int *ell_col_ind, rocsparse_int ell_width, const double *x, const double *beta, double *y)¶

Sparse matrix vector multiplication using ELL storage format.

rocsparse_ellmv multiplies the scalar $\alpha$ with a sparse $m \times n$ matrix, defined in ELL storage format, and the dense vector $x$ and adds the result to the dense vector $y$ that is multiplied by the scalar $\beta$, such that

\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

for(i = 0; i < m; ++i)
{
    y[i] = beta * y[i];

    for(p = 0; p < ell_width; ++p)
    {
        idx = p * m + i;

        if((ell_col_ind[idx] >= 0) && (ell_col_ind[idx] < n))
        {
            y[i] = y[i] + alpha * ell_val[idx] * x[ell_col_ind[idx]];
        }
    }
}
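The loop above can be turned into a small host-side reference, which is handy for validating device results. This is a minimal sketch, not the library's implementation: the function name ellmv_reference is hypothetical, zero-based column indices are assumed, and only $op(A) == A$ is covered. As in the loop above, entry p of row i is read from index p * m + i.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y := alpha * A * x + beta * y with A in ELL format.
// ell_val and ell_col_ind hold m * ell_width entries; entry p of row i is
// stored at index p * m + i. Padded slots carry the column index -1.
std::vector<double> ellmv_reference(int m, int n, double alpha,
                                    const std::vector<double>& ell_val,
                                    const std::vector<int>& ell_col_ind,
                                    int ell_width,
                                    const std::vector<double>& x,
                                    double beta,
                                    std::vector<double> y)
{
    assert(ell_val.size() == static_cast<std::size_t>(m) * ell_width);
    assert(ell_col_ind.size() == ell_val.size());
    assert(x.size() == static_cast<std::size_t>(n));
    assert(y.size() == static_cast<std::size_t>(m));

    for(int i = 0; i < m; ++i)
    {
        double sum = beta * y[i];
        for(int p = 0; p < ell_width; ++p)
        {
            int idx = p * m + i;
            int col = ell_col_ind[idx];
            // Skip padded entries (column index -1) and anything out of range
            if(col >= 0 && col < n)
            {
                sum += alpha * ell_val[idx] * x[col];
            }
        }
        y[i] = sum;
    }
    return y;
}
```

For a 3 x 5 matrix with at most three entries per row, ell_val and ell_col_ind each hold nine entries, and a row with only two non-zeros contributes one padded slot with value 0 and column index -1.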

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans == rocsparse_operation_none is supported.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse ELL matrix.
[in] n: number of columns of the sparse ELL matrix.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_val: array that contains the elements of the sparse ELL matrix. Padded elements should be zero.
[in] ell_col_ind: array that contains the column indices of the sparse ELL matrix. Padded column indices should be -1.
[in] ell_width: number of non-zero elements per row of the sparse ELL matrix.
[in] x: array of n elements ( $op(A) == A$) or m elements ( $op(A) == A^T$ or $op(A) == A^H$).
[in] beta: scalar $\beta$.
[inout] y: array of m elements ( $op(A) == A$) or n elements ( $op(A) == A^T$ or $op(A) == A^H$).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or ell_width is invalid.
rocsparse_status_invalid_pointer: descr, alpha, ell_val, ell_col_ind, x, beta or y pointer is invalid.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_hybmv()¶

rocsparse_status rocsparse_shybmv(rocsparse_handle handle, rocsparse_operation trans, const float *alpha, const rocsparse_mat_descr descr, const rocsparse_hyb_mat hyb, const float *x, const float *beta, float *y)¶

rocsparse_status rocsparse_dhybmv(rocsparse_handle handle, rocsparse_operation trans, const double *alpha, const rocsparse_mat_descr descr, const rocsparse_hyb_mat hyb, const double *x, const double *beta, double *y)¶

Sparse matrix vector multiplication using HYB storage format.

rocsparse_hybmv multiplies the scalar $\alpha$ with a sparse $m \times n$ matrix, defined in HYB storage format, and the dense vector $x$ and adds the result to the dense vector $y$ that is multiplied by the scalar $\beta$, such that

\[ y := \alpha \cdot op(A) \cdot x + \beta \cdot y, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]
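Because rocsparse_hyb_mat is opaque, the following host-side sketch rests on an assumption: the conventional HYB layout, in which a regular ELL part holds up to ell_width entries per row and a COO part holds the remaining entries. The struct HybMatrix and the function hybmv_reference are hypothetical names used only for illustration; zero-based indices and $op(A) == A$ are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a HYB matrix: ELL part plus COO part.
struct HybMatrix
{
    int m = 0;                       // number of rows
    int n = 0;                       // number of columns
    int ell_width = 0;               // entries per row stored in the ELL part
    std::vector<double> ell_val;     // m * ell_width values, entry p of row i at p * m + i
    std::vector<int>    ell_col_ind; // m * ell_width column indices, -1 marks padding
    std::vector<int>    coo_row_ind; // row indices of the overflow entries
    std::vector<int>    coo_col_ind; // column indices of the overflow entries
    std::vector<double> coo_val;     // values of the overflow entries
};

// Reference for y := alpha * A * x + beta * y.
std::vector<double> hybmv_reference(double alpha, const HybMatrix& A,
                                    const std::vector<double>& x,
                                    double beta, std::vector<double> y)
{
    assert(x.size() == static_cast<std::size_t>(A.n));
    assert(y.size() == static_cast<std::size_t>(A.m));
    assert(A.ell_val.size() == static_cast<std::size_t>(A.m) * A.ell_width);
    assert(A.ell_col_ind.size() == A.ell_val.size());
    assert(A.coo_row_ind.size() == A.coo_val.size());
    assert(A.coo_col_ind.size() == A.coo_val.size());

    // ELL part: scale y once per row, then accumulate the regular entries
    for(int i = 0; i < A.m; ++i)
    {
        y[i] *= beta;
        for(int p = 0; p < A.ell_width; ++p)
        {
            int idx = p * A.m + i;
            int col = A.ell_col_ind[idx];
            if(col >= 0 && col < A.n)
            {
                y[i] += alpha * A.ell_val[idx] * x[col];
            }
        }
    }

    // COO part: accumulate the entries that did not fit into the ELL part
    for(std::size_t e = 0; e < A.coo_val.size(); ++e)
    {
        y[A.coo_row_ind[e]] += alpha * A.coo_val[e] * x[A.coo_col_ind[e]];
    }
    return y;
}
```

Splitting the product this way keeps the regular part of the matrix in a padding-friendly layout while rows with many entries do not inflate ell_width for every other row.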

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans == rocsparse_operation_none is supported.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse HYB matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] hyb: matrix in HYB storage format.
[in] x: array of n elements ( $op(A) == A$) or m elements ( $op(A) == A^T$ or $op(A) == A^H$).
[in] beta: scalar $\beta$.
[inout] y: array of m elements ( $op(A) == A$) or n elements ( $op(A) == A^T$ or $op(A) == A^H$).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: hyb structure was not initialized with valid matrix sizes.
rocsparse_status_invalid_pointer: descr, alpha, hyb, x, beta or y pointer is invalid.
rocsparse_status_invalid_value: hyb structure was not initialized with a valid partitioning type.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_memory_error: the buffer could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrsv_zero_pivot()¶

rocsparse_status rocsparse_csrsv_zero_pivot(rocsparse_handle handle, const rocsparse_mat_descr descr, rocsparse_mat_info info, rocsparse_int *position)¶

Sparse triangular solve using CSR storage format.

rocsparse_csrsv_zero_pivot returns rocsparse_status_zero_pivot if either a structural or numerical zero has been found during rocsparse_scsrsv_solve() or rocsparse_dcsrsv_solve() computation. The first zero pivot $j$ at $A_{j,j}$ is stored in position, using the same index base as the CSR matrix.

position can be in host or device memory. If no zero pivot has been found, position is set to -1 and rocsparse_status_success is returned instead.
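To make the two kinds of zero pivot concrete, the following host-side sketch scans a CSR matrix for the first row whose diagonal entry is either absent from the sparsity pattern (structural zero) or stored as 0 (numerical zero). It only illustrates the reported value and is not how the library detects pivots; the function name first_zero_pivot and the explicit index_base argument are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the first row j whose diagonal entry A(j,j) is missing from the
// sparsity pattern (structural zero) or stored as 0 (numerical zero).
// The result uses the same index base as the CSR arrays; -1 means that no
// zero pivot exists.
int first_zero_pivot(int m,
                     const std::vector<int>& csr_row_ptr,
                     const std::vector<int>& csr_col_ind,
                     const std::vector<double>& csr_val,
                     int index_base)
{
    assert(csr_row_ptr.size() == static_cast<std::size_t>(m) + 1);
    assert(csr_col_ind.size() == csr_val.size());

    for(int j = 0; j < m; ++j)
    {
        bool nonzero_diag = false;
        int  row_begin    = csr_row_ptr[j] - index_base;
        int  row_end      = csr_row_ptr[j + 1] - index_base;

        for(int p = row_begin; p < row_end; ++p)
        {
            if(csr_col_ind[p] - index_base == j)
            {
                // Diagonal is part of the pattern; check its value
                nonzero_diag = (csr_val[p] != 0.0);
                break;
            }
        }

        if(!nonzero_diag)
        {
            return j + index_base;
        }
    }
    return -1;
}
```

With a one-based matrix, a missing diagonal in the second row is therefore reported as 2, while the "no zero pivot" value stays -1 for either index base.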

Note

rocsparse_csrsv_zero_pivot is a blocking function. It might influence performance negatively.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] descr: descriptor of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[inout] position: pointer to zero pivot $j$, can be in host or device memory.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info or position pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_zero_pivot: zero pivot has been found.

rocsparse_csrsv_buffer_size()¶

rocsparse_status rocsparse_scsrsv_buffer_size(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶

rocsparse_status rocsparse_dcsrsv_buffer_size(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶

Sparse triangular solve using CSR storage format.

rocsparse_csrsv_buffer_size returns the size of the temporary storage buffer that is required by rocsparse_scsrsv_analysis(), rocsparse_dcsrsv_analysis(), rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve(). The temporary storage buffer must be allocated by the user. The size of the temporary storage buffer is identical to the size returned by rocsparse_scsrilu0_buffer_size() and rocsparse_dcsrilu0_buffer_size() if the matrix sparsity pattern is identical. The user allocated buffer can thus be shared between subsequent calls to those functions.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_scsrsv_analysis(), rocsparse_dcsrsv_analysis(), rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrsv_analysis()¶

rocsparse_status rocsparse_scsrsv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶

rocsparse_status rocsparse_dcsrsv_analysis(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶

Sparse triangular solve using CSR storage format.

rocsparse_csrsv_analysis performs the analysis step for rocsparse_scsrsv_solve() and rocsparse_dcsrsv_solve(). It is expected that this function will be executed only once for a given matrix and particular operation type. The analysis meta data can be cleared by rocsparse_csrsv_clear().

rocsparse_csrsv_analysis can share its meta data with rocsparse_scsrilu0_analysis() and rocsparse_dcsrilu0_analysis(). Selecting the rocsparse_analysis_policy_reuse policy can greatly reduce the cost of computing the meta data. However, the user needs to make sure that the sparsity pattern remains unchanged. If this cannot be assured, rocsparse_analysis_policy_force has to be used.

Note

If the matrix sparsity pattern changes, the gathered information will become invalid.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[in] analysis: rocsparse_analysis_policy_reuse or rocsparse_analysis_policy_force.
[in] solve: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_row_ptr, csr_col_ind, info or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrsv_solve()¶

rocsparse_status rocsparse_scsrsv_solve(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const float *x, float *y, rocsparse_solve_policy policy, void *temp_buffer)¶

rocsparse_status rocsparse_dcsrsv_solve(rocsparse_handle handle, rocsparse_operation trans, rocsparse_int m, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, const double *x, double *y, rocsparse_solve_policy policy, void *temp_buffer)¶

Sparse triangular solve using CSR storage format.

rocsparse_csrsv_solve solves a sparse triangular linear system of a sparse $m \times m$ matrix, defined in CSR storage format, a dense solution vector $y$ and the right-hand side $x$ that is multiplied by $\alpha$, such that

\[ op(A) \cdot y = \alpha \cdot x, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans == rocsparse\_operation\_none \\ A^T, & if\: trans == rocsparse\_operation\_transpose \\ A^H, & if\: trans == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

rocsparse_csrsv_solve requires a user allocated temporary buffer. Its size is returned by rocsparse_scsrsv_buffer_size() or rocsparse_dcsrsv_buffer_size(). Furthermore, analysis meta data is required. It can be obtained by rocsparse_scsrsv_analysis() or rocsparse_dcsrsv_analysis(). rocsparse_csrsv_solve reports the first zero pivot (either numerical or structural zero). The zero pivot status can be checked by calling rocsparse_csrsv_zero_pivot(). If rocsparse_diag_type == rocsparse_diag_type_unit, no zero pivot will be reported, even if $A_{j,j} = 0$ for some $j$.
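For checking device results, the lower triangular case can be written as a plain forward substitution on the host. This is a sketch under stated assumptions rather than the library's algorithm: the name csrsv_lower_reference and the unit_diag flag are hypothetical, indices are zero-based, only $op(A) == A$ is handled, and for a non-unit diagonal every row is assumed to store a non-zero diagonal entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves L * y = alpha * x by forward substitution, where L is the lower
// triangular part of an m x m matrix in zero-based CSR format.
// If unit_diag is true, the diagonal is taken to be 1 and stored diagonal
// values are ignored.
std::vector<double> csrsv_lower_reference(int m, double alpha,
                                          const std::vector<double>& csr_val,
                                          const std::vector<int>& csr_row_ptr,
                                          const std::vector<int>& csr_col_ind,
                                          const std::vector<double>& x,
                                          bool unit_diag)
{
    assert(csr_row_ptr.size() == static_cast<std::size_t>(m) + 1);
    assert(csr_col_ind.size() == csr_val.size());
    assert(x.size() == static_cast<std::size_t>(m));

    std::vector<double> y(m);
    for(int i = 0; i < m; ++i)
    {
        double sum  = alpha * x[i];
        double diag = unit_diag ? 1.0 : 0.0;

        for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
        {
            int col = csr_col_ind[p];
            if(col < i)
            {
                // y[col] is already known from an earlier row
                sum -= csr_val[p] * y[col];
            }
            else if(col == i && !unit_diag)
            {
                diag = csr_val[p];
            }
            // Entries with col > i belong to the upper part and are ignored
        }

        assert(diag != 0.0); // a zero here would be a zero pivot
        y[i] = sum / diag;
    }
    return y;
}
```

Row i depends only on rows that contain its off-diagonal columns, which is the dependency information a parallel solver has to extract from the sparsity pattern before it can process independent rows concurrently.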

Note

The sparse CSR matrix has to be sorted. This can be achieved by calling rocsparse_csrsort().

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans == rocsparse_operation_none is supported.

Example

Consider the lower triangular $m \times m$ matrix $L$, stored in CSR storage format with unit diagonal. The following example solves $L \cdot y = x$.

// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create matrix descriptor
rocsparse_mat_descr descr;
rocsparse_create_mat_descr(&descr);
rocsparse_set_mat_fill_mode(descr, rocsparse_fill_mode_lower);
rocsparse_set_mat_diag_type(descr, rocsparse_diag_type_unit);

// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Obtain required buffer size
size_t buffer_size;
rocsparse_dcsrsv_buffer_size(handle,
                             rocsparse_operation_none,
                             m,
                             nnz,
                             descr,
                             csr_val,
                             csr_row_ptr,
                             csr_col_ind,
                             info,
                             &buffer_size);

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

// Perform analysis step
rocsparse_dcsrsv_analysis(handle,
                          rocsparse_operation_none,
                          m,
                          nnz,
                          descr,
                          csr_val,
                          csr_row_ptr,
                          csr_col_ind,
                          info,
                          rocsparse_analysis_policy_reuse,
                          rocsparse_solve_policy_auto,
                          temp_buffer);

// Scalar alpha; with alpha = 1.0 the system solved is Ly = x
double alpha = 1.0;

// Solve Ly = x
rocsparse_dcsrsv_solve(handle,
                       rocsparse_operation_none,
                       m,
                       nnz,
                       &alpha,
                       descr,
                       csr_val,
                       csr_row_ptr,
                       csr_col_ind,
                       info,
                       x,
                       y,
                       rocsparse_solve_policy_auto,
                       temp_buffer);

// No zero pivot should be found, with L having unit diagonal

// Clean up
hipFree(temp_buffer);
rocsparse_destroy_mat_info(info);
rocsparse_destroy_mat_descr(descr);
rocsparse_destroy_handle(handle);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans: matrix operation type.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[in] x: array of m elements, holding the right-hand side.
[out] y: array of m elements, holding the solution.
[in] policy: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, x or y pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: trans != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrsv_clear()¶

rocsparse_status rocsparse_csrsv_clear(rocsparse_handle handle, const rocsparse_mat_descr descr, rocsparse_mat_info info)¶

Sparse triangular solve using CSR storage format.

rocsparse_csrsv_clear deallocates all memory that was allocated by rocsparse_scsrsv_analysis() or rocsparse_dcsrsv_analysis(). This is especially useful if memory is scarce and the analysis data is no longer required for further computation, e.g. when switching to another sparse matrix format. Calling rocsparse_csrsv_clear is optional. All allocated resources are released when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] descr: descriptor of the sparse CSR matrix.
[inout] info: structure that holds the information collected during the analysis step.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer holding the meta data could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.

Sparse Level 3 Functions¶

This module holds all sparse level 3 routines.

The sparse level 3 routines describe operations between a matrix in sparse format and multiple vectors in dense format that can also be seen as a dense matrix.

rocsparse_csrmm()¶

rocsparse_status rocsparse_scsrmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, rocsparse_int m, rocsparse_int n, rocsparse_int k, rocsparse_int nnz, const float *alpha, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const float *B, rocsparse_int ldb, const float *beta, float *C, rocsparse_int ldc)¶

rocsparse_status rocsparse_dcsrmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, rocsparse_int m, rocsparse_int n, rocsparse_int k, rocsparse_int nnz, const double *alpha, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const double *B, rocsparse_int ldb, const double *beta, double *C, rocsparse_int ldc)¶

Sparse matrix dense matrix multiplication using CSR storage format.

rocsparse_csrmm multiplies the scalar $\alpha$ with a sparse $m \times k$ matrix $A$, defined in CSR storage format, and the dense $k \times n$ matrix $B$ and adds the result to the dense $m \times n$ matrix $C$ that is multiplied by the scalar $\beta$, such that

\[ C := \alpha \cdot op(A) \cdot op(B) + \beta \cdot C, \]

with

\[\begin{split} op(A) = \left\{ \begin{array}{ll} A, & if\: trans\_A == rocsparse\_operation\_none \\ A^T, & if\: trans\_A == rocsparse\_operation\_transpose \\ A^H, & if\: trans\_A == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

and

\[\begin{split} op(B) = \left\{ \begin{array}{ll} B, & if\: trans\_B == rocsparse\_operation\_none \\ B^T, & if\: trans\_B == rocsparse\_operation\_transpose \\ B^H, & if\: trans\_B == rocsparse\_operation\_conjugate\_transpose \end{array} \right. \end{split}\]

for(i = 0; i < m; ++i)
{
    for(j = 0; j < n; ++j)
    {
        C[i][j] = beta * C[i][j];

        for(k = csr_row_ptr[i]; k < csr_row_ptr[i + 1]; ++k)
        {
            C[i][j] += alpha * csr_val[k] * B[csr_col_ind[k]][j];
        }
    }
}
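The role of ldb and ldc becomes clearer when the dense matrices are addressed as flat arrays. The host-side sketch below assumes column-major storage for $B$ and $C$, which is inferred from the parameter descriptions (an ldb x n array with ldb at least k) and is not stated explicitly in this section. The name csrmm_reference is hypothetical; zero-based indices, $op(A) == A$ and $op(B) == B$ are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for C := alpha * A * B + beta * C with A an m x k CSR matrix,
// B a dense k x n matrix and C a dense m x n matrix. B and C are assumed to
// be column-major: element (i, j) of B is B[i + j * ldb], and element (i, j)
// of C is C[i + j * ldc].
void csrmm_reference(int m, int n, int k, double alpha,
                     const std::vector<double>& csr_val,
                     const std::vector<int>& csr_row_ptr,
                     const std::vector<int>& csr_col_ind,
                     const std::vector<double>& B, int ldb,
                     double beta,
                     std::vector<double>& C, int ldc)
{
    assert(ldb >= std::max(1, k));
    assert(ldc >= std::max(1, m));
    assert(csr_row_ptr.size() == static_cast<std::size_t>(m) + 1);
    assert(B.size() >= static_cast<std::size_t>(ldb) * n);
    assert(C.size() >= static_cast<std::size_t>(ldc) * n);

    for(int i = 0; i < m; ++i)
    {
        for(int j = 0; j < n; ++j)
        {
            std::size_t c_idx = i + static_cast<std::size_t>(j) * ldc;
            double      sum   = beta * C[c_idx];

            // Walk the non-zeros of row i of A
            for(int p = csr_row_ptr[i]; p < csr_row_ptr[i + 1]; ++p)
            {
                std::size_t b_idx = csr_col_ind[p] + static_cast<std::size_t>(j) * ldb;
                sum += alpha * csr_val[p] * B[b_idx];
            }
            C[c_idx] = sum;
        }
    }
}
```

Only the first m rows of each column of C are written; when ldc is larger than m, the remaining ldc - m entries per column are padding and stay untouched.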

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Note

Currently, only trans_A == rocsparse_operation_none is supported.

Example

This example multiplies a CSR matrix with a dense matrix.

//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int k   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m+1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Set dimension n of B
rocsparse_int n = 64;

// Allocate and generate dense matrix B
std::vector<float> hB(k * n);
for(rocsparse_int i = 0; i < k * n; ++i)
{
    hB[i] = static_cast<float>(rand()) / RAND_MAX;
}

// Copy B to the device
float* B;
hipMalloc((void**)&B, sizeof(float) * k * n);
hipMemcpy(B, hB.data(), sizeof(float) * k * n, hipMemcpyHostToDevice);

// alpha and beta
float alpha = 1.0f;
float beta  = 0.0f;

// Allocate memory for the resulting matrix C
float* C;
hipMalloc((void**)&C, sizeof(float) * m * n);

// Perform the matrix multiplication
rocsparse_scsrmm(handle,
                 rocsparse_operation_none,
                 rocsparse_operation_none,
                 m,
                 n,
                 k,
                 nnz,
                 &alpha,
                 descr,
                 csr_val,
                 csr_row_ptr,
                 csr_col_ind,
                 B,
                 k,
                 &beta,
                 C,
                 m);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] trans_A: matrix $A$ operation type.
[in] trans_B: matrix $B$ operation type.
[in] m: number of rows of the sparse CSR matrix $A$.
[in] n: number of columns of the dense matrix $op(B)$ and $C$.
[in] k: number of columns of the sparse CSR matrix $A$.
[in] nnz: number of non-zero entries of the sparse CSR matrix $A$.
[in] alpha: scalar $\alpha$.
[in] descr: descriptor of the sparse CSR matrix $A$. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array of nnz elements of the sparse CSR matrix $A$.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix $A$.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix $A$.
[in] B: array of dimension $ldb \times n$ ( $op(B) == B$) or $ldb \times k$ ( $op(B) == B^T$ or $op(B) == B^H$).
[in] ldb: leading dimension of $B$, must be at least $\max{(1, k)}$ ( $op(A) == A$) or $\max{(1, m)}$ ( $op(A) == A^T$ or $op(A) == A^H$).
[in] beta: scalar $\beta$.
[inout] C: array of dimension $ldc \times n$.
[in] ldc: leading dimension of $C$, must be at least $\max{(1, m)}$ ( $op(A) == A$) or $\max{(1, k)}$ ( $op(A) == A^T$ or $op(A) == A^H$).

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n, k, nnz, ldb or ldc is invalid.
rocsparse_status_invalid_pointer: descr, alpha, csr_val, csr_row_ptr, csr_col_ind, B, beta or C pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_not_implemented: trans_A != rocsparse_operation_none or rocsparse_matrix_type != rocsparse_matrix_type_general.

Preconditioner Functions¶

This module holds all sparse preconditioners.

The sparse preconditioners describe manipulations on a matrix in sparse format to obtain a sparse preconditioner matrix.

rocsparse_csrilu0_zero_pivot()¶

rocsparse_status rocsparse_csrilu0_zero_pivot(rocsparse_handle handle, rocsparse_mat_info info, rocsparse_int *position)¶

Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.

rocsparse_csrilu0_zero_pivot returns rocsparse_status_zero_pivot if either a structural or numerical zero has been found during rocsparse_scsrilu0() or rocsparse_dcsrilu0() computation. The first zero pivot $j$ at $A_{j,j}$ is stored in position, using the same index base as the CSR matrix.

position can be in host or device memory. If no zero pivot has been found, position is set to -1 and rocsparse_status_success is returned instead.

Note

rocsparse_csrilu0_zero_pivot is a blocking function. It might influence performance negatively.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] info: structure that holds the information collected during the analysis step.
[inout] position: pointer to zero pivot $j$, can be in host or device memory.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info or position pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_zero_pivot: zero pivot has been found.

rocsparse_csrilu0_buffer_size()¶

rocsparse_status rocsparse_scsrilu0_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶

rocsparse_status rocsparse_dcsrilu0_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, size_t *buffer_size)¶

Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.

rocsparse_csrilu0_buffer_size returns the size of the temporary storage buffer that is required by rocsparse_scsrilu0_analysis(), rocsparse_dcsrilu0_analysis(), rocsparse_scsrilu0() and rocsparse_dcsrilu0(). The temporary storage buffer must be allocated by the user. The size of the temporary storage buffer is identical to the size returned by rocsparse_scsrsv_buffer_size() and rocsparse_dcsrsv_buffer_size() if the matrix sparsity pattern is identical. The user allocated buffer can thus be shared between subsequent calls to those functions.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_scsrilu0_analysis(), rocsparse_dcsrilu0_analysis(), rocsparse_scsrilu0() and rocsparse_dcsrilu0().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrilu0_analysis()¶

rocsparse_status rocsparse_scsrilu0_analysis(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶

rocsparse_status rocsparse_dcsrilu0_analysis(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_analysis_policy analysis, rocsparse_solve_policy solve, void *temp_buffer)¶

Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.

rocsparse_csrilu0_analysis performs the analysis step for rocsparse_scsrilu0() and rocsparse_dcsrilu0(). It is expected that this function will be executed only once for a given matrix and particular operation type. The analysis meta data can be cleared by rocsparse_csrilu0_clear().

rocsparse_csrilu0_analysis can share its meta data with rocsparse_scsrsv_analysis() and rocsparse_dcsrsv_analysis(). Selecting the rocsparse_analysis_policy_reuse policy can greatly reduce the cost of computing the meta data. However, the user needs to make sure that the sparsity pattern remains unchanged. If this cannot be assured, rocsparse_analysis_policy_force has to be used.

Note

If the matrix sparsity pattern changes, the gathered information will become invalid.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] info: structure that holds the information collected during the analysis step.
[in] analysis: rocsparse_analysis_policy_reuse or rocsparse_analysis_policy_force.
[in] solve: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr, csr_col_ind, info or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrilu0()¶

rocsparse_status rocsparse_scsrilu0(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_solve_policy policy, void *temp_buffer)¶

rocsparse_status rocsparse_dcsrilu0(rocsparse_handle handle, rocsparse_int m, rocsparse_int nnz, const rocsparse_mat_descr descr, double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_mat_info info, rocsparse_solve_policy policy, void *temp_buffer)¶

Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.

rocsparse_csrilu0 computes the incomplete LU factorization with 0 fill-ins and no pivoting of a sparse $m \times m$ CSR matrix $A$, such that

\[ A \approx LU \]

rocsparse_csrilu0 requires a user allocated temporary buffer. Its size is returned by rocsparse_scsrilu0_buffer_size() or rocsparse_dcsrilu0_buffer_size(). Furthermore, analysis meta data is required. It can be obtained by rocsparse_scsrilu0_analysis() or rocsparse_dcsrilu0_analysis(). rocsparse_csrilu0 reports the first zero pivot (either numerical or structural zero). The zero pivot status can be obtained by calling rocsparse_csrilu0_zero_pivot().
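
To make the meaning of "0 fill-ins" concrete, the following is a minimal host-side sketch of an ILU(0) factorization on a CSR matrix with sorted column indices. It is not the rocSPARSE implementation, which runs on the device; the function name ilu0_host and its return convention (row index of the first zero pivot, or -1) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <vector>

// Host-side ILU(0) sketch (hypothetical helper, not part of rocSPARSE).
// Overwrites val in place: the strictly lower part holds L (unit diagonal implied),
// the diagonal and upper part hold U. Column indices must be sorted within each row.
// Returns the row index of the first zero pivot, or -1 if none was found.
int ilu0_host(int                     m,
              const std::vector<int>& row_ptr,
              const std::vector<int>& col_ind,
              std::vector<double>&    val)
{
    assert(static_cast<int>(row_ptr.size()) == m + 1);

    std::vector<int> diag(m, -1); // position of the diagonal entry of each row
    std::vector<int> pos(m, -1);  // column -> position within the current row

    for(int i = 0; i < m; ++i)
    {
        // Record where each column of row i is stored
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        {
            pos[col_ind[k]] = k;
            if(col_ind[k] == i)
                diag[i] = k;
        }

        // Eliminate the entries left of the diagonal using already factorized rows
        for(int k = row_ptr[i]; k < row_ptr[i + 1] && col_ind[k] < i; ++k)
        {
            const int j = col_ind[k];
            val[k] /= val[diag[j]]; // multiplier L(i,j)
            for(int l = diag[j] + 1; l < row_ptr[j + 1]; ++l)
            {
                const int p = pos[col_ind[l]];
                // Zero fill-in: update only entries that already exist in row i
                if(p >= 0)
                    val[p] -= val[k] * val[l];
            }
        }

        // Reset the column map for the next row
        for(int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            pos[col_ind[k]] = -1;

        // Missing diagonal (structural zero) or zero value (numerical zero)
        if(diag[i] < 0 || val[diag[i]] == 0.0)
            return i;
    }
    return -1;
}
```

Because updates are applied only where row i already stores an entry, the sparsity pattern of the factors equals that of the input matrix, which is why csr_val can be overwritten in place.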

Note

The sparse CSR matrix has to be sorted. This can be achieved by calling rocsparse_csrsort().

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

Consider the sparse $m \times m$ matrix $A$, stored in CSR storage format. The following example computes the incomplete LU factorization $M \approx LU$ and solves the preconditioned system $My = x$.

// Create rocSPARSE handle
rocsparse_handle handle;
rocsparse_create_handle(&handle);

// Create matrix descriptor for M
rocsparse_mat_descr descr_M;
rocsparse_create_mat_descr(&descr_M);

// Create matrix descriptor for L
rocsparse_mat_descr descr_L;
rocsparse_create_mat_descr(&descr_L);
rocsparse_set_mat_fill_mode(descr_L, rocsparse_fill_mode_lower);
rocsparse_set_mat_diag_type(descr_L, rocsparse_diag_type_unit);

// Create matrix descriptor for U
rocsparse_mat_descr descr_U;
rocsparse_create_mat_descr(&descr_U);
rocsparse_set_mat_fill_mode(descr_U, rocsparse_fill_mode_upper);
rocsparse_set_mat_diag_type(descr_U, rocsparse_diag_type_non_unit);

// Create matrix info structure
rocsparse_mat_info info;
rocsparse_create_mat_info(&info);

// Obtain required buffer size
size_t buffer_size_M;
size_t buffer_size_L;
size_t buffer_size_U;
rocsparse_dcsrilu0_buffer_size(handle,
                              m,
                              nnz,
                              descr_M,
                              csr_val,
                              csr_row_ptr,
                              csr_col_ind,
                              info,
                              &buffer_size_M);
rocsparse_dcsrsv_buffer_size(handle,
                             rocsparse_operation_none,
                             m,
                             nnz,
                             descr_L,
                             csr_val,
                             csr_row_ptr,
                             csr_col_ind,
                             info,
                             &buffer_size_L);
rocsparse_dcsrsv_buffer_size(handle,
                             rocsparse_operation_none,
                             m,
                             nnz,
                             descr_U,
                             csr_val,
                             csr_row_ptr,
                             csr_col_ind,
                             info,
                             &buffer_size_U);

// Use the largest of the three sizes (std::max requires <algorithm>)
size_t buffer_size = std::max(buffer_size_M, std::max(buffer_size_L, buffer_size_U));

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

// Perform analysis steps, using rocsparse_analysis_policy_reuse to improve
// computation performance
rocsparse_dcsrilu0_analysis(handle,
                            m,
                            nnz,
                            descr_M,
                            csr_val,
                            csr_row_ptr,
                            csr_col_ind,
                            info,
                            rocsparse_analysis_policy_reuse,
                            rocsparse_solve_policy_auto,
                            temp_buffer);
rocsparse_dcsrsv_analysis(handle,
                          rocsparse_operation_none,
                          m,
                          nnz,
                          descr_L,
                          csr_val,
                          csr_row_ptr,
                          csr_col_ind,
                          info,
                          rocsparse_analysis_policy_reuse,
                          rocsparse_solve_policy_auto,
                          temp_buffer);
rocsparse_dcsrsv_analysis(handle,
                          rocsparse_operation_none,
                          m,
                          nnz,
                          descr_U,
                          csr_val,
                          csr_row_ptr,
                          csr_col_ind,
                          info,
                          rocsparse_analysis_policy_reuse,
                          rocsparse_solve_policy_auto,
                          temp_buffer);

// Check for zero pivot
rocsparse_int position;
if(rocsparse_status_zero_pivot == rocsparse_csrilu0_zero_pivot(handle,
                                                               info,
                                                               &position))
{
    printf("A has structural zero at A(%d,%d)\n", position, position);
}

// Compute incomplete LU factorization
rocsparse_dcsrilu0(handle,
                   m,
                   nnz,
                   descr_M,
                   csr_val,
                   csr_row_ptr,
                   csr_col_ind,
                   info,
                   rocsparse_solve_policy_auto,
                   temp_buffer);

// Check for zero pivot
if(rocsparse_status_zero_pivot == rocsparse_csrilu0_zero_pivot(handle,
                                                               info,
                                                               &position))
{
    printf("U has structural and/or numerical zero at U(%d,%d)\n",
           position,
           position);
}

// Scaling factor for the right-hand side of both triangular solves
double alpha = 1.0;

// Solve Lz = x
rocsparse_dcsrsv_solve(handle,
                       rocsparse_operation_none,
                       m,
                       nnz,
                       &alpha,
                       descr_L,
                       csr_val,
                       csr_row_ptr,
                       csr_col_ind,
                       info,
                       x,
                       z,
                       rocsparse_solve_policy_auto,
                       temp_buffer);

// Solve Uy = z
rocsparse_dcsrsv_solve(handle,
                       rocsparse_operation_none,
                       m,
                       nnz,
                       &alpha,
                       descr_U,
                       csr_val,
                       csr_row_ptr,
                       csr_col_ind,
                       info,
                       z,
                       y,
                       rocsparse_solve_policy_auto,
                       temp_buffer);

// Clean up
hipFree(temp_buffer);
rocsparse_destroy_mat_info(info);
rocsparse_destroy_mat_descr(descr_M);
rocsparse_destroy_mat_descr(descr_L);
rocsparse_destroy_mat_descr(descr_U);
rocsparse_destroy_handle(handle);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix.
[inout] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] info: structure that holds the information collected during the analysis step.
[in] policy: rocsparse_solve_policy_auto.
[in] temp_buffer: temporary storage buffer allocated by the user.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_val, csr_row_ptr or csr_col_ind pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csrilu0_clear()¶

rocsparse_status rocsparse_csrilu0_clear(rocsparse_handle handle, rocsparse_mat_info info)¶

Incomplete LU factorization with 0 fill-ins and no pivoting using CSR storage format.

rocsparse_csrilu0_clear deallocates all memory that was allocated by rocsparse_scsrilu0_analysis() or rocsparse_dcsrilu0_analysis(). This is especially useful if memory is an issue and the analysis data is not required for further computation.

Note

Calling rocsparse_csrilu0_clear is optional. All allocated resources will be cleared when the opaque rocsparse_mat_info struct is destroyed using rocsparse_destroy_mat_info().

Parameters

[in] handle: handle to the rocsparse library context queue.
[inout] info: structure that holds the information collected during the analysis step.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_pointer: info pointer is invalid.
rocsparse_status_memory_error: the buffer holding the meta data could not be deallocated.
rocsparse_status_internal_error: an internal error occurred.

Sparse Conversion Functions¶

This module holds all sparse conversion routines.

The sparse conversion routines describe operations on a matrix in sparse format to obtain a matrix in a different sparse format.

rocsparse_csr2coo()¶

rocsparse_status rocsparse_csr2coo(rocsparse_handle handle, const rocsparse_int *csr_row_ptr, rocsparse_int nnz, rocsparse_int m, rocsparse_int *coo_row_ind, rocsparse_index_base idx_base)¶

Convert a sparse CSR matrix into a sparse COO matrix.

rocsparse_csr2coo converts the CSR array containing the row offsets that point to the start of every row into a COO array of row indices.
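
As a rough illustration of what this conversion computes, here is a host-side reference sketch. The helper name csr2coo_host is hypothetical and plain int stands in for rocsparse_int; the library itself performs the equivalent operation on device memory.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper): expand CSR row offsets into COO row indices.
// idx_base is 0 or 1 and applies to both the offsets and the produced row indices.
std::vector<int> csr2coo_host(const std::vector<int>& csr_row_ptr, int nnz, int m, int idx_base)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);

    std::vector<int> coo_row_ind(nnz);
    for(int row = 0; row < m; ++row)
    {
        // Entries csr_row_ptr[row] .. csr_row_ptr[row + 1] - 1 all belong to this row
        for(int k = csr_row_ptr[row] - idx_base; k < csr_row_ptr[row + 1] - idx_base; ++k)
        {
            coo_row_ind[k] = row + idx_base;
        }
    }
    return coo_row_ind;
}
```

For the row offsets {0, 3, 5, 8} with m = 3 and nnz = 8, the sketch yields the row indices {0, 0, 0, 1, 1, 2, 2, 2}.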

Note

It can also be used to convert a CSC array containing the column offsets into a COO array of column indices.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

This example converts a CSR matrix into a COO matrix.

//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m+1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate COO matrix arrays
rocsparse_int* coo_row_ind;
rocsparse_int* coo_col_ind;
float* coo_val;

hipMalloc((void**)&coo_row_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&coo_col_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&coo_val, sizeof(float) * nnz);

// Convert the csr row offsets into coo row indices
rocsparse_csr2coo(handle,
                  csr_row_ptr,
                  nnz,
                  m,
                  coo_row_ind,
                  rocsparse_index_base_zero);

// Copy the column and value arrays
hipMemcpy(coo_col_ind,
          csr_col_ind,
          sizeof(rocsparse_int) * nnz,
          hipMemcpyDeviceToDevice);

hipMemcpy(coo_val,
          csr_val,
          sizeof(float) * nnz,
          hipMemcpyDeviceToDevice);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] m: number of rows of the sparse CSR matrix.
[out] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: csr_row_ptr or coo_row_ind pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.

rocsparse_coo2csr()¶

rocsparse_status rocsparse_coo2csr(rocsparse_handle handle, const rocsparse_int *coo_row_ind, rocsparse_int nnz, rocsparse_int m, rocsparse_int *csr_row_ptr, rocsparse_index_base idx_base)¶

Convert a sparse COO matrix into a sparse CSR matrix.

rocsparse_coo2csr converts the COO array containing the row indices into a CSR array of row offsets that point to the start of every row. It is assumed that the COO row index array is sorted.
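
The reverse direction can be sketched on the host as a count per row followed by a prefix sum. The helper name coo2csr_host is hypothetical and plain int stands in for rocsparse_int; this is an illustration of the idea, not the library's device implementation.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper): compress sorted COO row indices
// into CSR row offsets. idx_base is 0 or 1.
std::vector<int> coo2csr_host(const std::vector<int>& coo_row_ind, int m, int idx_base)
{
    std::vector<int> csr_row_ptr(m + 1, 0);

    // Count the entries of every row
    for(int r : coo_row_ind)
    {
        const int row = r - idx_base;
        assert(row >= 0 && row < m);
        ++csr_row_ptr[row + 1];
    }

    // Prefix sum turns the per-row counts into row start offsets
    for(int row = 0; row < m; ++row)
    {
        csr_row_ptr[row + 1] += csr_row_ptr[row];
    }

    // Shift the offsets to the requested index base
    for(int& p : csr_row_ptr)
    {
        p += idx_base;
    }
    return csr_row_ptr;
}
```

The sorted-input assumption matters because the column index and value arrays are reused unchanged: the offsets only describe them correctly if all entries of a row are already stored contiguously.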

Note

It can also be used to convert a COO array containing the column indices into a CSC array of column offsets that point to the start of every column. In that case, the COO column index array is assumed to be sorted instead.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

This example converts a COO matrix into a CSR matrix.

//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

coo_row_ind[nnz] = {0, 0, 0, 1, 1, 2, 2, 2}; // device memory
coo_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
coo_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate CSR matrix arrays
rocsparse_int* csr_row_ptr;
rocsparse_int* csr_col_ind;
float* csr_val;

hipMalloc((void**)&csr_row_ptr, sizeof(rocsparse_int) * (m + 1));
hipMalloc((void**)&csr_col_ind, sizeof(rocsparse_int) * nnz);
hipMalloc((void**)&csr_val, sizeof(float) * nnz);

// Convert the coo row indices into csr row offsets
rocsparse_coo2csr(handle,
                  coo_row_ind,
                  nnz,
                  m,
                  csr_row_ptr,
                  rocsparse_index_base_zero);

// Copy the column and value arrays
hipMemcpy(csr_col_ind,
          coo_col_ind,
          sizeof(rocsparse_int) * nnz,
          hipMemcpyDeviceToDevice);

hipMemcpy(csr_val,
          coo_val,
          sizeof(float) * nnz,
          hipMemcpyDeviceToDevice);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] m: number of rows of the sparse CSR matrix.
[out] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind or csr_row_ptr pointer is invalid.

rocsparse_csr2csc_buffer_size()¶

rocsparse_status rocsparse_csr2csc_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_action copy_values, size_t *buffer_size)¶

Convert a sparse CSR matrix into a sparse CSC matrix.

rocsparse_csr2csc_buffer_size returns the size of the temporary storage buffer required by rocsparse_scsr2csc() and rocsparse_dcsr2csc(). The temporary storage buffer must be allocated by the user.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[in] copy_values: rocsparse_action_symbolic or rocsparse_action_numeric.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_csr2csc().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: csr_row_ptr, csr_col_ind or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_csr2csc()¶

rocsparse_status rocsparse_scsr2csc(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, float *csc_val, rocsparse_int *csc_row_ind, rocsparse_int *csc_col_ptr, rocsparse_action copy_values, rocsparse_index_base idx_base, void *temp_buffer)¶

rocsparse_status rocsparse_dcsr2csc(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, double *csc_val, rocsparse_int *csc_row_ind, rocsparse_int *csc_col_ptr, rocsparse_action copy_values, rocsparse_index_base idx_base, void *temp_buffer)¶

Convert a sparse CSR matrix into a sparse CSC matrix.

rocsparse_csr2csc converts a CSR matrix into a CSC matrix. rocsparse_csr2csc can also be used to convert a CSC matrix into a CSR matrix. copy_values decides whether csc_val is being filled during conversion (rocsparse_action_numeric) or not (rocsparse_action_symbolic).

rocsparse_csr2csc requires an extra temporary storage buffer that has to be allocated by the user. The required buffer size can be determined by rocsparse_csr2csc_buffer_size().
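
The conversion itself can be pictured with the following host-side sketch, which assumes zero-based indices and always copies the values (the rocsparse_action_numeric case). The CscMatrix struct and the csr2csc_host helper are hypothetical names introduced for illustration; the library's device algorithm and its use of the temporary buffer are not shown here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the result of the sketch
struct CscMatrix
{
    std::vector<float> val;
    std::vector<int>   row_ind;
    std::vector<int>   col_ptr;
};

// Host-side reference (hypothetical helper): convert an m x n CSR matrix to CSC.
CscMatrix csr2csc_host(int                       m,
                       int                       n,
                       const std::vector<float>& csr_val,
                       const std::vector<int>&   csr_row_ptr,
                       const std::vector<int>&   csr_col_ind)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);
    const int nnz = csr_row_ptr[m];

    CscMatrix csc{std::vector<float>(nnz), std::vector<int>(nnz), std::vector<int>(n + 1, 0)};

    // Count the entries of every column, then prefix sum to get column offsets
    for(int k = 0; k < nnz; ++k)
        ++csc.col_ptr[csr_col_ind[k] + 1];
    for(int col = 0; col < n; ++col)
        csc.col_ptr[col + 1] += csc.col_ptr[col];

    // Scatter each CSR entry to the next free slot of its column
    std::vector<int> next(csc.col_ptr.begin(), csc.col_ptr.end() - 1);
    for(int row = 0; row < m; ++row)
    {
        for(int k = csr_row_ptr[row]; k < csr_row_ptr[row + 1]; ++k)
        {
            const int dst    = next[csr_col_ind[k]]++;
            csc.row_ind[dst] = row;
            csc.val[dst]     = csr_val[k];
        }
    }
    return csc;
}
```

Reading col_ptr as row offsets and row_ind as column indices gives the CSR representation of the transposed matrix, which is how the same routine serves as a transpose.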

Note

The resulting matrix can also be seen as the transpose of the input matrix.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

This example computes the transpose of a CSR matrix.

//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m_A   = 3;
rocsparse_int n_A   = 5;
rocsparse_int nnz_A = 8;

csr_row_ptr_A[m_A+1] = {0, 3, 5, 8};             // device memory
csr_col_ind_A[nnz_A] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val_A[nnz_A]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

// Allocate memory for transposed CSR matrix
rocsparse_int m_T   = n_A;
rocsparse_int n_T   = m_A;
rocsparse_int nnz_T = nnz_A;

rocsparse_int* csr_row_ptr_T;
rocsparse_int* csr_col_ind_T;
float* csr_val_T;

hipMalloc((void**)&csr_row_ptr_T, sizeof(rocsparse_int) * (m_T + 1));
hipMalloc((void**)&csr_col_ind_T, sizeof(rocsparse_int) * nnz_T);
hipMalloc((void**)&csr_val_T, sizeof(float) * nnz_T);

// Obtain the temporary buffer size
size_t buffer_size;
rocsparse_csr2csc_buffer_size(handle,
                              m_A,
                              n_A,
                              nnz_A,
                              csr_row_ptr_A,
                              csr_col_ind_A,
                              rocsparse_action_numeric,
                              &buffer_size);

// Allocate temporary buffer
void* temp_buffer;
hipMalloc(&temp_buffer, buffer_size);

rocsparse_scsr2csc(handle,
                   m_A,
                   n_A,
                   nnz_A,
                   csr_val_A,
                   csr_row_ptr_A,
                   csr_col_ind_A,
                   csr_val_T,
                   csr_col_ind_T,
                   csr_row_ptr_T,
                   rocsparse_action_numeric,
                   rocsparse_index_base_zero,
                   temp_buffer);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] csr_val: array of nnz elements of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] csc_val: array of nnz elements of the sparse CSC matrix.
[out] csc_row_ind: array of nnz elements containing the row indices of the sparse CSC matrix.
[out] csc_col_ptr: array of n+1 elements that point to the start of every column of the sparse CSC matrix.
[in] copy_values: rocsparse_action_symbolic or rocsparse_action_numeric.
[in] idx_base: rocsparse_index_base_zero or rocsparse_index_base_one.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_csr2csc_buffer_size().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: csr_val, csr_row_ptr, csr_col_ind, csc_val, csc_row_ind, csc_col_ptr or temp_buffer pointer is invalid.
rocsparse_status_arch_mismatch: the device is not supported.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_csr2ell_width()¶

rocsparse_status rocsparse_csr2ell_width(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const rocsparse_int *csr_row_ptr, const rocsparse_mat_descr ell_descr, rocsparse_int *ell_width)¶

Convert a sparse CSR matrix into a sparse ELL matrix.

rocsparse_csr2ell_width computes the ELL width of a given CSR matrix, that is, the maximum number of non-zero elements per row over all rows.
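
A minimal host-side sketch of this computation is shown below, assuming the row offsets are available in host memory. The helper name csr2ell_width_host is hypothetical and plain int stands in for rocsparse_int.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper): the ELL width is the length of the longest row.
int csr2ell_width_host(const std::vector<int>& csr_row_ptr, int m)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);

    int ell_width = 0;
    for(int row = 0; row < m; ++row)
    {
        // Number of stored entries in this row
        ell_width = std::max(ell_width, csr_row_ptr[row + 1] - csr_row_ptr[row]);
    }
    return ell_width;
}
```

For the row offsets {0, 3, 5, 8} the rows hold 3, 2 and 3 entries, so the ELL width is 3.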

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[out] ell_width: pointer to the number of non-zero elements per row in ELL storage format.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m is invalid.
rocsparse_status_invalid_pointer: csr_descr, csr_row_ptr, or ell_width pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_csr2ell()¶

rocsparse_status rocsparse_scsr2ell(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, float *ell_val, rocsparse_int *ell_col_ind)¶

rocsparse_status rocsparse_dcsr2ell(rocsparse_handle handle, rocsparse_int m, const rocsparse_mat_descr csr_descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, double *ell_val, rocsparse_int *ell_col_ind)¶

Convert a sparse CSR matrix into a sparse ELL matrix.

rocsparse_csr2ell converts a CSR matrix into an ELL matrix. It is assumed that ell_val and ell_col_ind are allocated. The allocation size is the number of rows times the number of ELL non-zero elements per row, such that $nnz_{ELL} = m \cdot ell\_width$. The number of ELL non-zero elements per row is obtained by rocsparse_csr2ell_width().
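
The following host-side sketch illustrates the conversion for zero-based indices. It assumes a column-major ELL layout (entry p of row i is stored at index p * m + i) and that unused slots are padded with a value of 0 and a column index of -1; both conventions are assumptions of this sketch rather than statements from this section, and the EllMatrix struct and csr2ell_host helper are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the result of the sketch
struct EllMatrix
{
    int                width;
    std::vector<float> val;
    std::vector<int>   col_ind;
};

// Host-side reference (hypothetical helper): convert CSR to ELL with a given width.
EllMatrix csr2ell_host(int                       m,
                       int                       ell_width,
                       const std::vector<float>& csr_val,
                       const std::vector<int>&   csr_row_ptr,
                       const std::vector<int>&   csr_col_ind)
{
    assert(static_cast<int>(csr_row_ptr.size()) == m + 1);

    // nnz_ELL = m * ell_width slots, pre-filled with the assumed padding
    EllMatrix ell{ell_width,
                  std::vector<float>(m * ell_width, 0.0f),
                  std::vector<int>(m * ell_width, -1)};

    for(int row = 0; row < m; ++row)
    {
        int p = 0; // slot within the current row
        for(int k = csr_row_ptr[row]; k < csr_row_ptr[row + 1]; ++k, ++p)
        {
            assert(p < ell_width);
            const int idx    = p * m + row; // assumed column-major layout
            ell.val[idx]     = csr_val[k];
            ell.col_ind[idx] = csr_col_ind[k];
        }
    }
    return ell;
}
```

Every row occupies exactly ell_width slots regardless of how many entries it has, so a single long row inflates the storage of the whole matrix.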

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

This example converts a CSR matrix into an ELL matrix.

//     1 2 0 3 0
// A = 0 4 5 0 0
//     6 0 0 7 8

rocsparse_int m   = 3;
rocsparse_int n   = 5;
rocsparse_int nnz = 8;

csr_row_ptr[m+1] = {0, 3, 5, 8};             // device memory
csr_col_ind[nnz] = {0, 1, 3, 1, 2, 0, 3, 4}; // device memory
csr_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8}; // device memory

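// Create CSR matrix descriptor
rocsparse_mat_descr csr_descr;
rocsparse_create_mat_descr(&csr_descr);

// Create ELL matrix descriptor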
rocsparse_mat_descr ell_descr;
rocsparse_create_mat_descr(&ell_descr);

// Obtain the ELL width
rocsparse_int ell_width;
rocsparse_csr2ell_width(handle,
                        m,
                        csr_descr,
                        csr_row_ptr,
                        ell_descr,
                        &ell_width);

// Compute ELL non-zero entries
rocsparse_int ell_nnz = m * ell_width;

// Allocate ELL column and value arrays
rocsparse_int* ell_col_ind;
hipMalloc((void**)&ell_col_ind, sizeof(rocsparse_int) * ell_nnz);

float* ell_val;
hipMalloc((void**)&ell_val, sizeof(float) * ell_nnz);

// Format conversion
rocsparse_scsr2ell(handle,
                   m,
                   csr_descr,
                   csr_val,
                   csr_row_ptr,
                   csr_col_ind,
                   ell_descr,
                   ell_width,
                   ell_val,
                   ell_col_ind);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array containing the values of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array containing the column indices of the sparse CSR matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_width: number of non-zero elements per row in ELL storage format.
[out] ell_val: array of m times ell_width elements of the sparse ELL matrix.
[out] ell_col_ind: array of m times ell_width elements containing the column indices of the sparse ELL matrix.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m or ell_width is invalid.
rocsparse_status_invalid_pointer: csr_descr, csr_val, csr_row_ptr, csr_col_ind, ell_descr, ell_val or ell_col_ind pointer is invalid.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_ell2csr_nnz()¶

rocsparse_status rocsparse_ell2csr_nnz(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr ell_descr, rocsparse_int ell_width, const rocsparse_int *ell_col_ind, const rocsparse_mat_descr csr_descr, rocsparse_int *csr_row_ptr, rocsparse_int *csr_nnz)¶

Convert a sparse ELL matrix into a sparse CSR matrix.

rocsparse_ell2csr_nnz computes the total number of CSR non-zero elements and the CSR row offsets that point to the start of every row of the sparse CSR matrix, for a given ELL matrix. It is assumed that csr_row_ptr has been allocated with size m + 1.
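
A host-side sketch of this counting step is given below for zero-based indices. It assumes a column-major ELL layout and treats any column index outside [0, n) as padding; these conventions are assumptions of the sketch, and the helper name ell2csr_nnz_host is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper): fill csr_row_ptr (m + 1 entries)
// and return the total number of CSR non-zero elements.
int ell2csr_nnz_host(int                     m,
                     int                     n,
                     int                     ell_width,
                     const std::vector<int>& ell_col_ind,
                     std::vector<int>&       csr_row_ptr)
{
    assert(static_cast<int>(ell_col_ind.size()) == m * ell_width);

    csr_row_ptr.assign(m + 1, 0);
    for(int row = 0; row < m; ++row)
    {
        for(int p = 0; p < ell_width; ++p)
        {
            const int col = ell_col_ind[p * m + row]; // assumed column-major layout
            // Count only slots that hold a valid column index
            if(col >= 0 && col < n)
                ++csr_row_ptr[row + 1];
        }
    }

    // Prefix sum turns the per-row counts into row start offsets
    for(int row = 0; row < m; ++row)
        csr_row_ptr[row + 1] += csr_row_ptr[row];

    return csr_row_ptr[m];
}
```

The returned count is what the caller needs to size the CSR column index and value arrays before the actual ELL to CSR conversion.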

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse ELL matrix.
[in] n: number of columns of the sparse ELL matrix.
[in] ell_descr: descriptor of the sparse ELL matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] ell_width: number of non-zero elements per row in ELL storage format.
[in] ell_col_ind: array of m times ell_width elements containing the column indices of the sparse ELL matrix.
[in] csr_descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[out] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[out] csr_nnz: pointer to the total number of non-zero elements in CSR storage format.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or ell_width is invalid.
rocsparse_status_invalid_pointer: ell_descr, ell_col_ind, csr_descr, csr_row_ptr or csr_nnz pointer is invalid.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.

rocsparse_ell2csr()¶

rocsparse_csr2hyb()¶

rocsparse_status rocsparse_scsr2hyb(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr descr, const float *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_hyb_mat hyb, rocsparse_int user_ell_width, rocsparse_hyb_partition partition_type)¶

rocsparse_status rocsparse_dcsr2hyb(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, const rocsparse_mat_descr descr, const double *csr_val, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, rocsparse_hyb_mat hyb, rocsparse_int user_ell_width, rocsparse_hyb_partition partition_type)¶

Convert a sparse CSR matrix into a sparse HYB matrix.

rocsparse_csr2hyb converts a CSR matrix into a HYB matrix. It is assumed that hyb has been initialized with rocsparse_create_hyb_mat().

Note

This function requires a significant amount of storage for the HYB matrix, depending on the matrix structure.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

This example converts a CSR matrix into a HYB matrix using user-defined partitioning.

// Create HYB matrix structure
rocsparse_hyb_mat hyb;
rocsparse_create_hyb_mat(&hyb);

// User defined ell width
rocsparse_int user_ell_width = 5;

// Perform the conversion
rocsparse_scsr2hyb(handle,
                   m,
                   n,
                   descr,
                   csr_val,
                   csr_row_ptr,
                   csr_col_ind,
                   hyb,
                   user_ell_width,
                   rocsparse_hyb_partition_user);

// Do some work

// Clean up
rocsparse_destroy_hyb_mat(hyb);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_val: array containing the values of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array containing the column indices of the sparse CSR matrix.
[out] hyb: sparse matrix in HYB format.
[in] user_ell_width: width of the ELL part of the HYB matrix (only required if partition_type == rocsparse_hyb_partition_user).
[in] partition_type: rocsparse_hyb_partition_auto (recommended), rocsparse_hyb_partition_user or rocsparse_hyb_partition_max.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or user_ell_width is invalid.
rocsparse_status_invalid_value: partition_type is invalid.
rocsparse_status_invalid_pointer: descr, hyb, csr_val, csr_row_ptr or csr_col_ind pointer is invalid.
rocsparse_status_memory_error: the buffer for the HYB matrix could not be allocated.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
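
The note on HYB storage is easier to judge with numbers. The following host-side sketch counts how the non-zeros of a CSR matrix would be divided between the ELL and COO parts for a given ELL width. It assumes the common HYB layout, in which every row keeps up to ell_width entries in a padded ELL block and the overflow goes to a COO part. The hyb_split struct and the split_for_ell_width function are hypothetical names used for illustration only; they are not part of rocSPARSE, and the library's internal partitioning may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of a HYB partition (not a rocSPARSE type).
struct hyb_split
{
    std::size_t ell_slots; // m * ell_width slots held by the ELL part, padding included
    std::size_t ell_nnz;   // non-zeros that fit into the ELL part
    std::size_t coo_nnz;   // non-zeros that overflow into the COO part
};

// Count ELL and COO entries for a CSR matrix with zero-based row pointers.
hyb_split split_for_ell_width(const std::vector<int>& csr_row_ptr, int ell_width)
{
    assert(!csr_row_ptr.empty() && ell_width >= 0);

    const std::size_t m = csr_row_ptr.size() - 1;
    hyb_split         split{m * static_cast<std::size_t>(ell_width), 0, 0};

    for(std::size_t i = 0; i < m; ++i)
    {
        const int row_nnz = csr_row_ptr[i + 1] - csr_row_ptr[i];
        const int in_ell  = row_nnz < ell_width ? row_nnz : ell_width;

        split.ell_nnz += static_cast<std::size_t>(in_ell);
        split.coo_nnz += static_cast<std::size_t>(row_nnz - in_ell);
    }
    return split;
}
```

For row pointers {0, 1, 7, 9} (rows with 1, 6 and 2 non-zeros) and an ELL width of 5, the sketch reports 15 ELL slots, of which 8 hold non-zeros and 7 are padding, plus 1 entry in the COO part. A single long row can therefore inflate the ELL part considerably when the width is chosen too large.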

rocsparse_create_identity_permutation()¶

rocsparse_status rocsparse_create_identity_permutation(rocsparse_handle handle, rocsparse_int n, rocsparse_int *p)¶

Create the identity map.

rocsparse_create_identity_permutation stores the identity map in p, such that $p = 0:1:(n-1)$.

for(i = 0; i < n; ++i)
{
    p[i] = i;
}

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

The following example creates an identity permutation.

rocsparse_int size = 200;

// Allocate memory to hold the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * size);

// Fill perm with the identity permutation
rocsparse_create_identity_permutation(handle, size, perm);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] n: size of the map p.
[out] p: array of n integers containing the map.

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: n is invalid.
rocsparse_status_invalid_pointer: p pointer is invalid.
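
When validating device results, a host-side equivalent of the loop above can serve as a reference. This is a minimal sketch; host_identity_permutation is a hypothetical helper name, not a rocSPARSE function, and plain int stands in for rocsparse_int.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side reference: returns p with p[i] = i for i in [0, n).
std::vector<int> host_identity_permutation(int n)
{
    assert(n >= 0);

    std::vector<int> p(static_cast<std::size_t>(n));
    std::iota(p.begin(), p.end(), 0);
    return p;
}
```

A device array filled by rocsparse_create_identity_permutation() can be copied back to the host and compared element by element against this vector.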

rocsparse_csrsort_buffer_size()¶

rocsparse_status rocsparse_csrsort_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *csr_row_ptr, const rocsparse_int *csr_col_ind, size_t *buffer_size)¶

Sort a sparse CSR matrix.

rocsparse_csrsort_buffer_size returns the size of the temporary storage buffer required by rocsparse_csrsort(). The temporary storage buffer must be allocated by the user.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[in] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_csrsort().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: csr_row_ptr, csr_col_ind or buffer_size pointer is invalid.

rocsparse_csrsort()¶

rocsparse_status rocsparse_csrsort(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_mat_descr descr, const rocsparse_int *csr_row_ptr, rocsparse_int *csr_col_ind, rocsparse_int *perm, void *temp_buffer)¶

Sort a sparse CSR matrix.

rocsparse_csrsort sorts a matrix in CSR format. The sorted permutation vector perm can be used to obtain the sorted csr_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().

rocsparse_csrsort requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_csrsort_buffer_size().

Note

perm can be NULL if a sorted permutation vector is not required.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

The following example sorts a $3 \times 3$ CSR matrix.

//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

csr_row_ptr[m + 1] = {0, 3, 6, 9};                // device memory
csr_col_ind[nnz]   = {2, 0, 1, 0, 1, 2, 0, 2, 1}; // device memory
csr_val[nnz]       = {3, 1, 2, 4, 5, 6, 7, 9, 8}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_csrsort_buffer_size(handle, m, n, nnz, csr_row_ptr, csr_col_ind, &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the CSR matrix
rocsparse_csrsort(handle, m, n, nnz, descr, csr_row_ptr, csr_col_ind, perm, temp_buffer);

// Gather sorted csr_val array
float* csr_val_sorted;
hipMalloc((void**)&csr_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, csr_val, csr_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(csr_val);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse CSR matrix.
[in] n: number of columns of the sparse CSR matrix.
[in] nnz: number of non-zero entries of the sparse CSR matrix.
[in] descr: descriptor of the sparse CSR matrix. Currently, only rocsparse_matrix_type_general is supported.
[in] csr_row_ptr: array of m+1 elements that point to the start of every row of the sparse CSR matrix.
[inout] csr_col_ind: array of nnz elements containing the column indices of the sparse CSR matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_csrsort_buffer_size().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: descr, csr_row_ptr, csr_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
rocsparse_status_not_implemented: rocsparse_matrix_type != rocsparse_matrix_type_general.
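
The relationship between the sorted column indices, perm and the gathered values can be checked on the host. The sketch below mirrors the semantics described above: it sorts the column indices of each row in ascending order, reorders perm alongside them, and then gathers the values. host_csrsort and host_gather are hypothetical reference helpers, not rocSPARSE functions, and they make no claim about how the library sorts on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not a rocSPARSE function): sorts the
// column indices of every row in ascending order and applies the same
// reordering to perm, so that perm[k] is the original position of entry k.
void host_csrsort(const std::vector<int>& csr_row_ptr,
                  std::vector<int>&       csr_col_ind,
                  std::vector<int>&       perm)
{
    assert(!csr_row_ptr.empty());
    assert(csr_col_ind.size() == perm.size());

    const std::vector<int> col_in  = csr_col_ind;
    const std::vector<int> perm_in = perm;
    const std::size_t      m       = csr_row_ptr.size() - 1;

    for(std::size_t i = 0; i < m; ++i)
    {
        const int begin = csr_row_ptr[i];
        const int end   = csr_row_ptr[i + 1];

        // order holds the positions of row i, sorted by their column index
        std::vector<int> order(static_cast<std::size_t>(end - begin));
        for(int k = begin; k < end; ++k)
        {
            order[k - begin] = k;
        }
        std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
            return col_in[a] < col_in[b];
        });

        for(int k = begin; k < end; ++k)
        {
            csr_col_ind[k] = col_in[order[k - begin]];
            perm[k]        = perm_in[order[k - begin]];
        }
    }
}

// Host-side counterpart of the gather step: y[k] = x[perm[k]].
std::vector<float> host_gather(const std::vector<float>& x, const std::vector<int>& perm)
{
    std::vector<float> y(perm.size());
    for(std::size_t k = 0; k < perm.size(); ++k)
    {
        y[k] = x[perm[k]];
    }
    return y;
}
```

With the data of the example above and perm initialized to the identity, the sketch yields csr_col_ind = {0, 1, 2, 0, 1, 2, 0, 1, 2}, perm = {1, 2, 0, 3, 4, 5, 6, 8, 7} and gathered values {1, 2, 3, 4, 5, 6, 7, 8, 9}.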

rocsparse_coosort_buffer_size()¶

rocsparse_status rocsparse_coosort_buffer_size(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, const rocsparse_int *coo_row_ind, const rocsparse_int *coo_col_ind, size_t *buffer_size)¶

Sort a sparse COO matrix.

rocsparse_coosort_buffer_size returns the size of the temporary storage buffer that is required by rocsparse_coosort_by_row() and rocsparse_coosort_by_column(). The temporary storage buffer has to be allocated by the user.

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[in] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[in] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[out] buffer_size: number of bytes of the temporary storage buffer required by rocsparse_coosort_by_row() and rocsparse_coosort_by_column().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or buffer_size pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.

rocsparse_coosort_by_row()¶

rocsparse_status rocsparse_coosort_by_row(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, rocsparse_int *coo_row_ind, rocsparse_int *coo_col_ind, rocsparse_int *perm, void *temp_buffer)¶

Sort a sparse COO matrix by row.

rocsparse_coosort_by_row sorts a matrix in COO format by row. The sorted permutation vector perm can be used to obtain the sorted coo_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().

rocsparse_coosort_by_row requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_coosort_buffer_size().

Note

perm can be NULL if a sorted permutation vector is not required.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

The following example sorts a $3 \times 3$ COO matrix by row indices.

//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

coo_row_ind[nnz] = {0, 1, 2, 0, 1, 2, 0, 1, 2}; // device memory
coo_col_ind[nnz] = {0, 0, 0, 1, 1, 1, 2, 2, 2}; // device memory
coo_val[nnz]     = {1, 4, 7, 2, 5, 8, 3, 6, 9}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_coosort_buffer_size(handle,
                              m,
                              n,
                              nnz,
                              coo_row_ind,
                              coo_col_ind,
                              &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the COO matrix
rocsparse_coosort_by_row(handle,
                         m,
                         n,
                         nnz,
                         coo_row_ind,
                         coo_col_ind,
                         perm,
                         temp_buffer);

// Gather sorted coo_val array
float* coo_val_sorted;
hipMalloc((void**)&coo_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, coo_val, coo_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(coo_val);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[inout] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[inout] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_coosort_buffer_size().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
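
To see what perm contains after a sort by row, the following host-side sketch computes the permutation directly and applies it to an array. It assumes that entries with equal row indices are ordered by column index; this tie-break is an assumption of the sketch rather than a documented guarantee. host_coosort_by_row_perm and host_apply_perm are hypothetical helper names, not rocSPARSE functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side reference (not a rocSPARSE function): returns the
// permutation that orders COO entries by row index, breaking ties by column
// index. result[k] is the position of the k-th sorted entry in the input.
std::vector<int> host_coosort_by_row_perm(const std::vector<int>& coo_row_ind,
                                          const std::vector<int>& coo_col_ind)
{
    assert(coo_row_ind.size() == coo_col_ind.size());

    std::vector<int> perm(coo_row_ind.size());
    std::iota(perm.begin(), perm.end(), 0);

    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        if(coo_row_ind[a] != coo_row_ind[b])
        {
            return coo_row_ind[a] < coo_row_ind[b];
        }
        return coo_col_ind[a] < coo_col_ind[b];
    });
    return perm;
}

// Apply a permutation: out[k] = in[perm[k]].
template <typename T>
std::vector<T> host_apply_perm(const std::vector<T>& in, const std::vector<int>& perm)
{
    std::vector<T> out(perm.size());
    for(std::size_t k = 0; k < perm.size(); ++k)
    {
        out[k] = in[perm[k]];
    }
    return out;
}
```

For the example data above, the computed permutation is {0, 3, 6, 1, 4, 7, 2, 5, 8}, and applying it to coo_val gives {1, 2, 3, 4, 5, 6, 7, 8, 9}, that is, the matrix in row-major order.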

rocsparse_coosort_by_column()¶

rocsparse_status rocsparse_coosort_by_column(rocsparse_handle handle, rocsparse_int m, rocsparse_int n, rocsparse_int nnz, rocsparse_int *coo_row_ind, rocsparse_int *coo_col_ind, rocsparse_int *perm, void *temp_buffer)¶

Sort a sparse COO matrix by column.

rocsparse_coosort_by_column sorts a matrix in COO format by column. The sorted permutation vector perm can be used to obtain the sorted coo_val array. In this case, perm must be initialized as the identity permutation, see rocsparse_create_identity_permutation().

rocsparse_coosort_by_column requires an extra temporary storage buffer that has to be allocated by the user. The storage buffer size can be determined by rocsparse_coosort_buffer_size().

Note

perm can be NULL if a sorted permutation vector is not required.

Note

This function is non-blocking and executed asynchronously with respect to the host. It may return before the actual computation has finished.

Example

The following example sorts a $3 \times 3$ COO matrix by column indices.

//     1 2 3
// A = 4 5 6
//     7 8 9
rocsparse_int m   = 3;
rocsparse_int n   = 3;
rocsparse_int nnz = 9;

coo_row_ind[nnz] = {0, 0, 0, 1, 1, 1, 2, 2, 2}; // device memory
coo_col_ind[nnz] = {0, 1, 2, 0, 1, 2, 0, 1, 2}; // device memory
coo_val[nnz]     = {1, 2, 3, 4, 5, 6, 7, 8, 9}; // device memory

// Create permutation vector perm as the identity map
rocsparse_int* perm;
hipMalloc((void**)&perm, sizeof(rocsparse_int) * nnz);
rocsparse_create_identity_permutation(handle, nnz, perm);

// Allocate temporary buffer
size_t buffer_size;
void* temp_buffer;
rocsparse_coosort_buffer_size(handle,
                              m,
                              n,
                              nnz,
                              coo_row_ind,
                              coo_col_ind,
                              &buffer_size);
hipMalloc(&temp_buffer, buffer_size);

// Sort the COO matrix
rocsparse_coosort_by_column(handle,
                            m,
                            n,
                            nnz,
                            coo_row_ind,
                            coo_col_ind,
                            perm,
                            temp_buffer);

// Gather sorted coo_val array
float* coo_val_sorted;
hipMalloc((void**)&coo_val_sorted, sizeof(float) * nnz);
rocsparse_sgthr(handle, nnz, coo_val, coo_val_sorted, perm, rocsparse_index_base_zero);

// Clean up
hipFree(temp_buffer);
hipFree(perm);
hipFree(coo_val);

Parameters

[in] handle: handle to the rocsparse library context queue.
[in] m: number of rows of the sparse COO matrix.
[in] n: number of columns of the sparse COO matrix.
[in] nnz: number of non-zero entries of the sparse COO matrix.
[inout] coo_row_ind: array of nnz elements containing the row indices of the sparse COO matrix.
[inout] coo_col_ind: array of nnz elements containing the column indices of the sparse COO matrix.
[inout] perm: array of nnz integers containing the unsorted map indices, can be NULL.
[in] temp_buffer: temporary storage buffer allocated by the user, size is returned by rocsparse_coosort_buffer_size().

Return Value

rocsparse_status_success: the operation completed successfully.
rocsparse_status_invalid_handle: the library context was not initialized.
rocsparse_status_invalid_size: m, n or nnz is invalid.
rocsparse_status_invalid_pointer: coo_row_ind, coo_col_ind or temp_buffer pointer is invalid.
rocsparse_status_internal_error: an internal error occurred.
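
A device result of a sort by column can be verified on the host by checking that the entries are in column-major order. The sketch below sorts whole (row, column, value) entries rather than tracking a permutation, which loosely corresponds to the case where perm is not needed. It assumes that entries with equal column indices are ordered by row index. coo_entry, is_sorted_by_column and host_coosort_by_column are hypothetical names introduced for illustration and are not part of rocSPARSE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical COO entry used only for this host-side sketch.
struct coo_entry
{
    int   row;
    int   col;
    float val;
};

// True if the entries are ordered by column index, then by row index.
bool is_sorted_by_column(const std::vector<coo_entry>& entries)
{
    for(std::size_t k = 1; k < entries.size(); ++k)
    {
        const coo_entry& prev = entries[k - 1];
        const coo_entry& curr = entries[k];
        if(prev.col > curr.col || (prev.col == curr.col && prev.row > curr.row))
        {
            return false;
        }
    }
    return true;
}

// Orders entries by column index, breaking ties by row index (assumed tie-break).
void host_coosort_by_column(std::vector<coo_entry>& entries)
{
    std::stable_sort(entries.begin(), entries.end(), [](const coo_entry& a, const coo_entry& b) {
        if(a.col != b.col)
        {
            return a.col < b.col;
        }
        return a.row < b.row;
    });
    assert(is_sorted_by_column(entries));
}
```

Applied to the nine entries of the example above, the values end up in the order {1, 4, 7, 2, 5, 8, 3, 6, 9}, which is the matrix read column by column.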