Allocating objects is not the only thing to take into consideration when thinking about the performance of our custom view. The number of calculations, the type of calculations, the number of primitives we are drawing, the amount of overdraw: the list of things we should check is pretty big. In the end, most of it is common sense: don't recalculate values we already have, maximize the parts of the code that can be skipped when no changes are required, and, basically, reuse as much as possible of what was calculated in previous frames.
Let's compare two methods that convert YUV pixel data to RGB. It is not the most typical thing you'll have to do in a custom view, but it's perfect to show how performance can be impacted by reusing as much as we can and not recalculating what doesn't have to be recalculated.
When getting frames from the camera viewfinder in Android, they are usually in YUV format instead of RGB. More information about YUV can be found at the following URL:
https://en.wikipedia.org/wiki/YUV.
We will start with straightforward code and optimize it step by step, evaluating the impact of each optimization:
private static void yuv2rgb(int width, int height, byte[] yuvData,
        int[] rgbData) {
    int uvOffset = width * height;
    for (int i = 0; i < height; i++) {
        int u = 0;
        int v = 0;
        for (int j = 0; j < width; j++) {
            int y = yuvData[i * width + j];
            if (y < 0) y += 256;
            if (j % 2 == 0) {
                u = yuvData[uvOffset++];
                v = yuvData[uvOffset++];
            }
            if (u < 0) u += 256;
            if (v < 0) v += 256;

            int nY = y - 16;
            int nU = u - 128;
            int nV = v - 128;
            if (nY < 0) nY = 0;

            int nR = (int) (1.164 * nY + 2.018 * nU);
            int nG = (int) (1.164 * nY - 0.813 * nV - 0.391 * nU);
            int nB = (int) (1.164 * nY + 1.596 * nV);

            nR = min(255, max(0, nR));
            nG = min(255, max(0, nG));
            nB = min(255, max(0, nB));

            nR &= 0xff;
            nG &= 0xff;
            nB &= 0xff;

            int color = 0xff000000 | (nR << 16) | (nG << 8) | nB;
            rgbData[i * width + j] = color;
        }
    }
}
This version is based on the YUV-to-RGB converter, found at the following URL:
https://searchcode.com/codesearch/view/2393/ and
http://sourceforge.jp/projects/nyartoolkit-and/.
We've used the floating point version here so that we can see, later on, the differences with the fixed point version.
Now, let's create a small custom view that will, on every frame, transform a YUV image to RGB, set it into a Bitmap, and draw it on the screen:
@Override
protected void onDraw(Canvas canvas) {
    yuv2rgb(imageWidth, imageHeight, yuvData, rgbData);
    bitmap.setPixels(rgbData, 0, imageWidth, 0, 0, imageWidth, imageHeight);
    canvas.drawBitmap(bitmap, 0.f, 0.f, null);
    frames++;
    invalidate();
}
Let's also add some code to check the number of frames per second our small example manages to draw. We will use this measurement to verify the performance improvements of the optimizations we'll be doing:
if (timeStart == -1) {
    timeStart = SystemClock.elapsedRealtime();
} else {
    long tdiff = SystemClock.elapsedRealtime() - timeStart;
    if (tdiff != 0) {
        float fps = ((float) frames * 1000.f) / tdiff;
        Log.d(TAG, "FPS: " + fps);
    }
}
If we run this code as it is, on my device it measures 1.20 frames per second. The demo image used is a 1,000x1,500 image. Let's see what we can do to improve it.
To start, we can remove some unnecessary calculations:
private static void yuv2rgb(int width, int height, byte[] yuvData,
        int[] rgbData) {
    int uvOffset = width * height;
    int offset = 0;
    for (int i = 0; i < height; i++) {
        int u = 0;
        int v = 0;
        for (int j = 0; j < width; j++) {
            int y = yuvData[offset];
            ...
            rgbData[offset] = color;
            offset++;
        }
    }
}
Here, we've replaced the two pixel-position calculations with a single increment per pixel. In the previous version, the code computed i * width + j twice per pixel: once for reading from yuvData and once for writing to rgbData. If we check the frames per second counter after this change, we'll notice it has slightly increased to 1.22. Not a huge improvement, but it's a start.
Now, we can see in the original implementation, the one used in the Android SDK, that the floating point operations are commented out in favor of fixed-point operations. Floating point operations are usually costlier than plain integer operations. Floating point performance has improved quite a lot in recent years with newer hardware, but integer operations are still faster. We will not get the same precision as with floating point operations, but we can get quite a good approximation by using fixed-point arithmetic.
More information about fixed-point arithmetic can be found at the following URL:
https://en.wikipedia.org/wiki/Fixed-point_arithmetic.
When using fixed-point arithmetic, we have to decide how many bits of an integer value will be used as fixed-point precision; the remaining bits store the integer part. Obviously, the more bits we devote to precision, the more accurate the result, but the fewer bits we have left for the integer value. The idea is to multiply all constants by a power of two and, after doing all the operations, divide the result by that same number. As it's a power of two, instead of a costly division we can perform a fast bitwise shift to the right.
For example, if we used a fixed point precision of 10 bits, we have to multiply all values by 1,024 or shift them 10 bits to the left and, at the end of all calculations, perform a right shift of 10 bits.
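As a quick standalone sketch (not part of the view code; the class and method names are just for illustration), this is the mechanism in isolation, showing that a 10-bit fixed-point multiplication closely matches its floating point counterpart:

```java
// Standalone sketch: 10-bit fixed-point arithmetic vs. floating point.
public class FixedPointDemo {
    static final int SHIFT = 10;        // 10 bits of fixed-point precision
    static final int ONE = 1 << SHIFT;  // 1,024

    // Convert a floating-point constant to 10-bit fixed point.
    static int toFixed(double value) {
        return (int) Math.round(value * ONE);
    }

    // Multiply an integer by a fixed-point constant, then scale back down.
    static int mulFixed(int fixedConstant, int value) {
        return (fixedConstant * value) >> SHIFT;
    }

    public static void main(String[] args) {
        int c = toFixed(1.164);                 // 1,192
        System.out.println(mulFixed(c, 100));   // fixed-point result
        System.out.println((int) (1.164 * 100)); // floating point result
    }
}
```

Note that the intermediate product must fit in an int, which is why the converter clips values before shifting them back down.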
Let's apply this to these operations:
int nR = (int) (1.164 * nY + 2.018 * nU);
int nG = (int) (1.164 * nY - 0.813 * nV - 0.391 * nU);
int nB = (int) (1.164 * nY + 1.596 * nV);
We are transforming them into the following:
int nR = (int) (1192 * nY + 2066 * nU);
int nG = (int) (1192 * nY - 833 * nV - 400 * nU);
int nB = (int) (1192 * nY + 1634 * nV);
We can check that 1,192 is 1.164 * 1,024 rounded to the nearest integer, and the same applies to all the other constants; we rounded the numbers to get the closest approximation.
For the same reason, we have to change the following checks:
nR = min(255, max(0, nR));
nG = min(255, max(0, nG));
nB = min(255, max(0, nB));
We have to change the check against 255 to 255 multiplied by 1,024 or, equivalently, 255 shifted 10 positions to the left:
nR = min(255 << 10, max(0, nR));
nG = min(255 << 10, max(0, nG));
nB = min(255 << 10, max(0, nB));
Add the division by 1,024 or right shift by 10 before using the values to output the color:
nR >>= 10;
nG >>= 10;
nB >>= 10;
Implementing these changes, even though we've added a few more operations compared to the floating point version, improves our frames per second counter to 1.55.
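To convince ourselves that the approximation holds, we can compare both versions of, for instance, the red channel over their whole input range. This is a standalone sketch using the constants above; the class and method names are just for illustration:

```java
// Standalone sketch: compare the floating point and the 10-bit fixed-point
// versions of the red channel over every valid (nY, nU) combination.
public class FixedRedDemo {
    static int redFloat(int nY, int nU) {
        int nR = (int) (1.164 * nY + 2.018 * nU);
        return Math.min(255, Math.max(0, nR));
    }

    static int redFixed(int nY, int nU) {
        int nR = 1192 * nY + 2066 * nU;             // constants pre-multiplied by 1,024
        nR = Math.min(255 << 10, Math.max(0, nR));  // clip in fixed-point space
        return nR >> 10;                            // drop the 10 precision bits
    }

    public static void main(String[] args) {
        int maxDiff = 0;
        for (int nY = 0; nY <= 239; nY++) {
            for (int nU = -128; nU <= 127; nU++) {
                maxDiff = Math.max(maxDiff,
                        Math.abs(redFloat(nY, nU) - redFixed(nY, nU)));
            }
        }
        // The fixed-point result is off by at most one intensity level.
        System.out.println("max difference: " + maxDiff);
    }
}
```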
Another small optimization is to avoid recalculating the luminance factor for every color component, as it's the same for all three. So, let's replace this code:
int nR = (int) (1192 * nY + 2066 * nU);
int nG = (int) (1192 * nY - 833 * nV - 400 * nU);
int nB = (int) (1192 * nY + 1634 * nV);
With this one, which only calculates the luminance once:
int luminance = 1192 * nY;
int nR = (int) (luminance + 2066 * nU);
int nG = (int) (luminance - 833 * nV - 400 * nU);
int nB = (int) (luminance + 1634 * nV);
Most compilers should perform this optimization themselves; I'm not sure what the new D8 and R8 compilers will do, but with the current Java/Android tooling it isn't optimized automatically. By making this small change, we increase the frames per second counter to 1.59.
The way this YUV format works, a pair of U and V chroma values is shared between two luminance values, so let's try to use this to our advantage to compute two pixels at a time, avoiding additional checks and code overhead:
for (int j = 0; j < width; j += 2) {
    int y0 = yuvData[offset];
    if (y0 < 0) y0 += 256;
    int y1 = yuvData[offset + 1];
    if (y1 < 0) y1 += 256;

    u = yuvData[uvOffset++];
    v = yuvData[uvOffset++];
    if (u < 0) u += 256;
    if (v < 0) v += 256;

    int nY0 = y0 - 16;
    int nY1 = y1 - 16;
    int nU = u - 128;
    int nV = v - 128;
    if (nY0 < 0) nY0 = 0;
    if (nY1 < 0) nY1 = 0;

    int chromaR = 2066 * nU;
    int chromaG = -833 * nV - 400 * nU;
    int chromaB = 1634 * nV;

    int luminance = 1192 * nY0;
    int nR = (int) (luminance + chromaR);
    int nG = (int) (luminance + chromaG);
    int nB = (int) (luminance + chromaB);
    nR = min(255 << 10, max(0, nR));
    nG = min(255 << 10, max(0, nG));
    nB = min(255 << 10, max(0, nB));
    nR >>= 10;
    nG >>= 10;
    nB >>= 10;
    nR &= 0xff;
    nG &= 0xff;
    nB &= 0xff;
    rgbData[offset] = 0xff000000 | (nR << 16) | (nG << 8) | nB;

    luminance = 1192 * nY1;
    nR = (int) (luminance + chromaR);
    nG = (int) (luminance + chromaG);
    nB = (int) (luminance + chromaB);
    nR = min(255 << 10, max(0, nR));
    nG = min(255 << 10, max(0, nG));
    nB = min(255 << 10, max(0, nB));
    nR >>= 10;
    nG >>= 10;
    nB >>= 10;
    nR &= 0xff;
    nG &= 0xff;
    nB &= 0xff;
    rgbData[offset + 1] = 0xff000000 | (nR << 16) | (nG << 8) | nB;

    offset += 2;
}
We are now calculating the chroma components only once per pair of pixels, and we've removed the check that fetched new U and V components, since they are now read every two pixels. These changes increased our frames per second counter to 1.77.
As Java bytes range from -128 to 127, we've had to add checks for negative numbers, but instead of these checks we can do a quick bitwise AND operation (&) with 0xff:
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j += 2) {
        int y0 = yuvData[offset] & 0xff;
        int y1 = yuvData[offset + 1] & 0xff;
        int u = yuvData[uvOffset++] & 0xff;
        int v = yuvData[uvOffset++] & 0xff;
        ...
    }
}
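The sign-extension behavior that makes this trick work can be seen in a tiny standalone sketch (the class and method names are just for illustration):

```java
// Standalone sketch: Java bytes are signed, so values above 127 wrap negative.
public class ByteMaskDemo {
    // Convert a signed byte to its unsigned 0-255 value.
    static int toUnsigned(byte b) {
        return b & 0xff;  // the mask discards the sign extension
    }

    public static void main(String[] args) {
        byte b = (byte) 200;               // stored as -56 in Java
        System.out.println(b);             // prints -56
        System.out.println(toUnsigned(b)); // prints 200
    }
}
```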
That small change slightly increased our frames per second counter to 1.83, but we can still optimize a bit more. We have used 10 bits of fixed-point precision but, in this particular case, 8 bits of precision might be enough. Going from 10 bits down to 8 will save us one operation:
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j += 2) {
        ...
        int chromaR = 517 * nU;
        int chromaG = -208 * nV - 100 * nU;
        int chromaB = 409 * nV;

        int lum = 298 * nY0;
        int nR = lum + chromaR;
        int nG = lum + chromaG;
        int nB = lum + chromaB;
        nR = min(65280, max(0, nR));
        nG = min(65280, max(0, nG));
        nB = min(65280, max(0, nB));
        nR <<= 8;
        nB >>= 8;
        nR &= 0x00ff0000;
        nG &= 0x0000ff00;
        nB &= 0x000000ff;
        rgbData[offset] = 0xff000000 | nR | nG | nB;
        ...
        offset += 2;
    }
}
We've updated all the constants to be multiplied by 256 instead of 1,024, and we've updated the checks accordingly. The constant 65280 that appears in the code is 255 multiplied by 256. In the part of the code where we shift the values to obtain the actual color components, the red component would need a right shift by 8 (to drop the precision bits) followed by a left shift by 16 (to move it to its ARGB position), so we can do a single left shift of 8 bits instead. It's even better for the green component: we'd have to shift it right by 8 and then left by 8, so we can leave it as it is and not shift it at all. We still have to shift the blue component right by 8 positions.
We also had to update the masks so that every component stays within its 0-255 range, but now the masks are shifted to each component's bit position: 0x00ff0000, 0x0000ff00, and 0x000000ff.
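We can verify that the combined shift-and-mask gives exactly the same result as shifting each component twice. Here is a standalone sketch for the red channel (the class and method names are just for illustration):

```java
// Standalone sketch: verify that one shift plus a mask is equivalent to
// dropping the 8 precision bits and then moving red to its ARGB position.
public class ShiftCombineDemo {
    // Two shifts: drop precision (>> 8), then place in bits 16-23 (<< 16).
    static int redTwoShifts(int value) {
        return (value >> 8) << 16;
    }

    // One shift plus the mask used in the optimized code.
    static int redOneShift(int value) {
        return (value << 8) & 0x00ff0000;
    }

    public static void main(String[] args) {
        // 65280 is the clipped maximum: 255 * 256.
        for (int value = 0; value <= 65280; value++) {
            if (redTwoShifts(value) != redOneShift(value)) {
                throw new AssertionError("mismatch at " + value);
            }
        }
        System.out.println("equivalent for all clipped values");
    }
}
```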
This change marginally improved our frames per second counter to 1.85, but we can still go further. Let's try to get rid of all the shifts, checks, and masks. We could do that by using some pre-calculated tables that we'll calculate once at the creation of our custom view. Let's create this function to pre-calculate everything we need:
private static int[] luminance;
private static int[] chromaR;
private static int[] chromaGU;
private static int[] chromaGV;
private static int[] chromaB;
private static int[] clipValuesR;
private static int[] clipValuesG;
private static int[] clipValuesB;

private static void precalcTables() {
    luminance = new int[256];
    for (int i = 0; i < luminance.length; i++) {
        luminance[i] = ((298 * (i - 16)) >> 8) + 300;
    }

    chromaR = new int[256];
    chromaGU = new int[256];
    chromaGV = new int[256];
    chromaB = new int[256];
    for (int i = 0; i < 256; i++) {
        chromaR[i] = (517 * (i - 128)) >> 8;
        chromaGU[i] = (-100 * (i - 128)) >> 8;
        chromaGV[i] = (-208 * (i - 128)) >> 8;
        chromaB[i] = (409 * (i - 128)) >> 8;
    }

    clipValuesR = new int[1024];
    clipValuesG = new int[1024];
    clipValuesB = new int[1024];
    for (int i = 0; i < 1024; i++) {
        clipValuesR[i] = 0xFF000000 | (min(max(i - 300, 0), 255) << 16);
        clipValuesG[i] = min(max(i - 300, 0), 255) << 8;
        clipValuesB[i] = min(max(i - 300, 0), 255);
    }
}
We are pre-calculating the values for the luminance, all the chroma components and, finally, the clipped, shifted, and masked values for each channel. As the luminance and some chroma values may be negative, we've added a +300 offset to the luminance table, since the luminance is added into every index, and we've adapted the clipValues tables to take that offset into account. Otherwise, we might index an array with a negative value and crash the application. Checking whether the index is negative before accessing the array would kill the performance gains, as we are trying to get rid of as many operations and checks as possible.
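To double-check that the +300 offset really keeps every lookup inside the 1,024-entry clip tables, we can brute-force all possible index combinations. This is a standalone verification sketch reusing the same formulas as precalcTables; the class name is illustrative:

```java
// Standalone sketch: verify that luminance + chroma can never index
// outside the 1,024-entry clip tables, thanks to the +300 offset.
public class TableRangeCheck {
    static int min = Integer.MAX_VALUE;
    static int max = Integer.MIN_VALUE;

    static void track(int index) {
        if (index < min) min = index;
        if (index > max) max = index;
    }

    public static void main(String[] args) {
        for (int y = 0; y < 256; y++) {
            int lum = ((298 * (y - 16)) >> 8) + 300;  // same formula as the table
            for (int u = 0; u < 256; u++) {
                for (int v = 0; v < 256; v++) {
                    track(lum + ((517 * (u - 128)) >> 8));   // red index
                    track(lum + ((-208 * (v - 128)) >> 8)
                              + ((-100 * (u - 128)) >> 8));  // green index
                    track(lum + ((409 * (v - 128)) >> 8));   // blue index
                }
            }
        }
        // Every index stays within 0..1023, so no bounds check is needed.
        System.out.println("index range: " + min + ".." + max);
    }
}
```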
Using these tables, our YUV-to-RGB converter code is reduced to the following:
private static void yuv2rgb(int width, int height, byte[] yuvData,
int[] rgbData) {
int uvOffset = width * height;
int offset = 0;
for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j += 2) {
int y0 = yuvData[offset ] & 0xff;
int y1 = yuvData[offset + 1] & 0xff;
int u = yuvData[uvOffset++] & 0xff;
int v = yuvData[uvOffset++] & 0xff;
int chR = chromaR[u];
int chG = chromaGV[v] + chromaGU[u];
int chB = chromaB[v];
int lum = luminance[y0];
int nR = clipValuesR[lum + chR];
int nG = clipValuesG[lum + chG];
int nB = clipValuesB[lum + chB];
rgbData[offset] = nR | nG | nB;
lum = luminance[y1];
nR = clipValuesR[lum + chR];
nG = clipValuesG[lum + chG];
nB = clipValuesB[lum + chB];
rgbData[offset + 1] = nR | nG | nB;
offset += 2;
}
}
}
With these changes, the frames per second counter reaches 2.04, a 70% performance increase over the original method. In any case, this is only an example of how code can be optimized; if you really need to convert a YUV image to RGB in real time, I suggest you either check out a native C or C++ implementation or go for a GPU or RenderScript approach.
Finally, if we run this application, we will get a screen similar to the following screenshot. We are not scaling or applying any additional transformation to the image, as we just wanted to measure the amount of time that it takes to transform from a YUV image to an RGB image. Your screen image might differ depending on the screen size and your device:
Check the whole example source code in the Example31-Performance folder in the GitHub repository.
There are many other things to take into consideration when talking about performance. If you want to know more details about how Java code is translated into dex bytecode and executed in the Android VM, check out the following presentation:
https://www.slideshare.net/RaimonRls/the-bytecode-mumbojumbo.