The basics of GPU computing with CuPy are most easily understood through a side-by-side comparison with traditional NumPy code in Python.
Once we have explored the basic terminology, we will shift our focus to actual GPU-accelerated computations for solving specific computational problems with CuPy.
If you recall the traditional NumPy program first described in the PyCUDA chapter, we implemented a function to multiply two arrays element by element with NumPy. The syntax we used to import numpy was the following:
import numpy as np
As you can see, numpy is abbreviated to np for convenience throughout the program code.
In the case of CuPy, too, we can use a similar syntax, as shown here:
import cupy as cp
In our NumPy code, we used the following syntax to initialize two arrays of the double data type, each with N elements set to zero:
p = np.zeros(N, dtype=np.double)
In CuPy, there is no difference in the syntax, but the array is created on the device (GPU) instead of the host (CPU), as shown here:
p = cp.zeros(N, dtype=cp.double)
To set all the values of each array to a value of our choice, we used numpy.ndarray.fill, as follows:
p.fill(23.0)
We do not have to change anything in the preceding call when we replace the numpy import with cupy; cupy.ndarray provides exactly the same fill method.
We had also used NumPy to pick a random integer index between 0 and N (exclusive), like so:
random = np.random.randint(0, N)
With CuPy, the call is no different. Just replace np with cp to perform the random number generation on the GPU, as follows:
random = cp.random.randint(0, N)
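Putting the preceding pieces together, a minimal sketch might look like this (the array size and fill values are illustrative):

import cupy as cp

N = 500000
a_gpu = cp.zeros(N, dtype=cp.double)   # both arrays live on the GPU
b_gpu = cp.zeros(N, dtype=cp.double)
a_gpu.fill(23.0)
b_gpu.fill(12.0)
random = int(cp.random.randint(0, N))  # int() copies the scalar back to the host
print(a_gpu[random], b_gpu[random])    # 23.0 12.0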
CuPy can also be used to move an array created with NumPy to the GPU device, like so:
p = np.zeros(N, dtype=np.double)
p_gpu = cp.asarray(p)
After carrying out the mathematical operations of our choice on it, we can move the GPU device array back to the host via the following:
p = cp.asnumpy(p_gpu)
You can also use the following:
p = p_gpu.get()
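Putting these transfer functions together, a minimal round trip might look like this (the array size and the operation performed on the device are illustrative):

import numpy as np
import cupy as cp

N = 1024
p = np.zeros(N, dtype=np.double)   # host (CPU) array
p_gpu = cp.asarray(p)              # copy from host to device
p_gpu += 1.0                       # this computation runs on the GPU
p = cp.asnumpy(p_gpu)              # copy back from device to host
print(type(p))                     # <class 'numpy.ndarray'>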
You might also remember that the ElementwiseKernel was first explored with PyCUDA in Chapter 6, Working with CUDA and PyCUDA, as shown here:
from pycuda.elementwise import ElementwiseKernel

multiply = ElementwiseKernel(
    "double *a_gpu, double *b_gpu",     # C argument declarations
    "b_gpu[i] = a_gpu[i] * b_gpu[i]",   # operation applied at each index i
    "multiply")                         # kernel name
In a similar manner, you can use CuPy's own ElementwiseKernel. Note that its signature differs slightly from PyCUDA's: it takes typed input and output parameter lists instead of raw pointer declarations, and the indexing over elements is implicit, like so:

multiply = cp.ElementwiseKernel(
    "float64 a_gpu, float64 b_gpu",   # input parameters
    "float64 out",                    # output parameter
    "out = a_gpu * b_gpu",            # elementwise operation
    "multiply")                       # kernel name
In NumPy, universal functions (also called ufuncs) operate on ndarrays element by element, supporting array broadcasting, output type determination, and many other features. They are instances of the numpy.ufunc class. As a vectorized wrapper for a function, a ufunc takes a fixed number of inputs and produces a fixed number of outputs.
Built-in ufuncs are mostly implemented in compiled C code. Basic ufuncs operate on scalars, while generalized ufuncs also operate on sub-arrays such as vectors and matrices. Custom ufunc instances can be created with the numpy.frompyfunc factory function.
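For example, a hypothetical Python function can be wrapped into a ufunc as follows; note that frompyfunc always produces arrays of the object data type:

import numpy as np

# Wrap a plain Python function taking 1 input and returning 1 output
double_it = np.frompyfunc(lambda x: x * 2, 1, 1)
print(double_it(np.arange(4)))   # [0 2 4 6], with dtype=object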
Like NumPy, CuPy also allows the use of universal functions to support various elementwise operations. CuPy's ufuncs support broadcasting, type casting, and output type determination in a manner similar to NumPy's ufuncs.
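As a brief sketch (the array contents are illustrative), CuPy's ufuncs can be called just like their NumPy counterparts:

import cupy as cp

a = cp.arange(5, dtype=cp.double)    # [0. 1. 2. 3. 4.], stored on the GPU
b = cp.full(5, 2.0, dtype=cp.double)

c = cp.multiply(a, b)   # binary ufunc, computed elementwise on the device
d = cp.sin(a)           # unary ufunc, also elementwise
e = a + 10.0            # broadcasting a scalar across the array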
CuPy can also be used to implement raw CUDA kernels with cupy.RawKernel, which compiles a kernel written directly in CUDA C/C++ and lets you launch it from Python.
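The following is a minimal sketch of such a raw kernel; the kernel source, the multiply name, and the launch configuration are illustrative rather than taken from the earlier examples:

import cupy as cp

multiply_raw = cp.RawKernel(r'''
extern "C" __global__
void multiply(const double* a, double* b, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        b[i] = a[i] * b[i];   // elementwise multiply, written back into b
    }
}
''', 'multiply')

N = 1024
a_gpu = cp.full(N, 23.0, dtype=cp.double)
b_gpu = cp.full(N, 2.0, dtype=cp.double)

threads = 256                           # threads per block
blocks = (N + threads - 1) // threads   # enough blocks to cover N elements
multiply_raw((blocks,), (threads,), (a_gpu, b_gpu, cp.int32(N)))
print(b_gpu[:4])                        # [46. 46. 46. 46.]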