Saturday, March 21, 2015

Android's Graphics Buffer Management System (Part I: gralloc)

In this post series I'll do a deep dive into Android's graphics buffer management system.  I'll cover how buffers produced by the camera use the generic BufferQueue abstraction to flow to different parts of the system, how buffers are shared between different hardware modules, and how they traverse process boundaries.
But I will start at buffer allocation, and before I describe what triggers buffer allocation and when, let's look at the low-level graphics buffer allocator, a.k.a. gralloc.

gralloc: Buffer Allocation

gralloc is part of the HAL (Hardware Abstraction Layer), which means that the implementation is platform-specific.  You can find the interface definitions in hardware/libhardware/include/hardware/gralloc.h.  As expected from a HAL component, the interface is divided into a module interface (gralloc_module_t) and a device interface (alloc_device_t).  The gralloc module is loaded like every other HAL module, so I won't go into the details here because they are easy to find online.  But I will mention that the entry point into a newly loaded HAL module is the open method of the structure hw_module_methods_t, which is referenced by the structure hw_module_t.  Structure hw_module_t acts as a mandatory "base class" (not quite, since this is C code) of all HAL modules, including gralloc_module_t.
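To make the "base class" remark concrete, here is a minimal sketch of the C idiom involved.  All of the toy_* names are stand-ins I made up for illustration; they only mimic the shape of hw_module_t, hw_module_methods_t and gralloc_module_t and are not the real HAL definitions.

```c
#include <stddef.h>

struct toy_module;

/* Stand-in for hw_module_methods_t: holds the open() entry point. */
typedef struct toy_module_methods {
    int (*open)(const struct toy_module *module, const char *id, void **device);
} toy_module_methods_t;

/* Stand-in for hw_module_t: the part every module has in common. */
typedef struct toy_module {
    const char *id;
    toy_module_methods_t *methods;
} toy_module_t;

/* Stand-in for gralloc_module_t: the "base" struct must be the FIRST member
 * so that a pointer to the derived struct is also a pointer to the base. */
typedef struct toy_gralloc_module {
    toy_module_t common;
    int open_count;                 /* module-specific state */
} toy_gralloc_module_t;

static int toy_gralloc_open(const toy_module_t *module, const char *id, void **device)
{
    /* Downcast: valid only because 'common' sits at offset 0. */
    toy_gralloc_module_t *gr = (toy_gralloc_module_t *)module;
    (void)id;
    gr->open_count++;
    *device = gr;                   /* a real module would create a device here */
    return 0;
}

static toy_module_methods_t toy_gralloc_methods = { toy_gralloc_open };

static toy_gralloc_module_t toy_gralloc = {
    { "gralloc", &toy_gralloc_methods },
    0
};

/* Generic loader code only ever sees the base type. */
static int toy_open_module(toy_module_t *module, void **device)
{
    return module->methods->open(module, module->id, device);
}
```

The generic code calls through the base struct, and the module-specific code recovers its own type with a cast.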
Both the module and the device interfaces are versioned.  The current module version is 0.3 and the device version is 0.1.  Only Google knows why these interfaces have these sub-1.0 interface versions. :-)

As I said above, gralloc implementations are platform-specific and for reference you can look at the goldfish device's implementation (device/generic/goldfish/opengl/system/gralloc/gralloc.cpp).  Goldfish is the code name for the Android emulation platform device.
The sole responsibility of the device (alloc_device_t) is the allocation (and subsequent release) of buffer memory, so it has a straightforward signature:

typedef struct alloc_device_t {
    struct hw_device_t common;

    /*
     * (*alloc)() Allocates a buffer in graphic memory with the requested
     * parameters and returns a buffer_handle_t and the stride in pixels to
     * allow the implementation to satisfy hardware constraints on the width
     * of a pixmap (eg: it may have to be multiple of 8 pixels).
     * The CALLER TAKES OWNERSHIP of the buffer_handle_t.
     *
     * If format is HAL_PIXEL_FORMAT_YCbCr_420_888, the returned stride must be
     * 0, since the actual strides are available from the android_ycbcr
     * structure.
     *
     * Returns 0 on success or -errno on error.
     */

    int (*alloc)(struct alloc_device_t* dev,
            int w, int h, int format, int usage,
            buffer_handle_t* handle, int* stride);
    /*
     * (*free)() Frees a previously allocated buffer.
     * Behavior is undefined if the buffer is still mapped in any process,
     * but shall not result in termination of the program or security breaches
     * (allowing a process to get access to another process' buffers).
     * THIS FUNCTION TAKES OWNERSHIP of the buffer_handle_t which becomes
     * invalid after the call.
     *
     * Returns 0 on success or -errno on error.
     */

    int (*free)(struct alloc_device_t* dev,
            buffer_handle_t handle);

    /* This hook is OPTIONAL.
     *
     * If non NULL it will be called by SurfaceFlinger on dumpsys
     */
    void (*dump)(struct alloc_device_t *dev, char *buff, int buff_len);
    void* reserved_proc[7];
} alloc_device_t;

Let's examine the parameters for the alloc() function.  The first parameter (dev) is of course the instance handle.

The next two parameters (w, h) provide the requested width and height of the buffer.  When describing the dimensions of a graphics buffer there are two points to watch for.  First, we need to understand the units of the dimensions.  If the dimensions are expressed in pixels, as is the case for gralloc, then we need to understand how to translate pixels to bits.  And for this we need to know the color encoding format.

The requested color format is the fourth parameter.  The color formats that Android supports are defined in /system/core/include/system/graphics.h.  Color format HAL_PIXEL_FORMAT_RGBA_8888 uses 32 bits for each pixel (8 bits for each of the pixel components: red, green, blue and alpha-blending), while HAL_PIXEL_FORMAT_RGB_565 uses 16 bits for each pixel (5 bits for red and blue, and 6 bits for green).
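As a rough illustration of what these two encodings mean at the bit level, here is a small sketch.  The function names are my own, and I am assuming the conventional layouts: red in the most significant bits of the 16-bit RGB_565 word, and R, G, B, A byte order in memory for RGBA_8888.

```c
#include <stdint.h>

/* Pack 8-bit color components into one 16-bit RGB_565 pixel by keeping
 * the top 5 bits of red, 6 bits of green and 5 bits of blue.
 * Assumed layout: RRRRRGGG GGGBBBBB (red in the most significant bits). */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Store one RGBA_8888 pixel: four bytes, one per component.
 * Assumed byte order in memory: R, G, B, A. */
static void store_rgba8888(uint8_t *dst, uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    dst[0] = r;
    dst[1] = g;
    dst[2] = b;
    dst[3] = a;
}
```

Note that the conversion to RGB_565 is lossy: the low bits of each component are simply dropped.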

The second important factor affecting the physical dimensions of the graphics buffer is its stride. Stride is the last parameter to alloc and it is also an out parameter.  To understand stride (a.k.a. pitch), it is easiest to refer to a diagram:




We can think of memory buffers as matrices arranged in rows and columns of pixels.  A row is usually referred to as a line.  Stride is defined as the number of pixels (or bytes, depending on your units!) that need to be counted from the beginning of one buffer line to the beginning of the next buffer line.  As the diagram above shows, the stride is necessarily at least equal to the width of the buffer, but can very well be larger than the width.  The difference between the stride and the width (stride-width) is just wasted memory, and one takeaway from this is that the memory used to store an image or graphics may not be contiguous.
So where does the stride come from?  Due to hardware implementation complexity, memory bandwidth optimizations, and other constraints, the hardware accessing the graphics memory may require each buffer line to start at an address that is a multiple of some number of bytes.  For example, if for a particular hardware module the line addresses need to align to 64 bytes, then the line lengths in memory need to be multiples of 64 bytes.  If this constraint results in longer lines than requested, then the buffer stride is different from the width.  Another motivation for stride is buffer reuse: imagine that you want to refer to a cropped image within another image.  In this case, the cropped (internal) image has a stride that is different from its width.
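The arithmetic behind the 64-byte example can be sketched as follows.  The helper names are hypothetical, and I am assuming that the alignment is a power of two and a multiple of the pixel size; a real gralloc would apply whatever rules its hardware dictates.

```c
#include <stddef.h>

/* Round value up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t value, size_t align)
{
    return (value + align - 1) & ~(align - 1);
}

/* Stride, in pixels, of a line whose length in bytes must be a multiple
 * of align_bytes. */
static size_t stride_in_pixels(size_t width, size_t bytes_per_pixel, size_t align_bytes)
{
    return align_up(width * bytes_per_pixel, align_bytes) / bytes_per_pixel;
}

/* Total memory needed: every line occupies a full stride, not just width. */
static size_t buffer_size_bytes(size_t stride_px, size_t height, size_t bytes_per_pixel)
{
    return stride_px * height * bytes_per_pixel;
}

/* Byte offset of pixel (x, y): lines are stride_px pixels apart. */
static size_t pixel_offset_bytes(size_t x, size_t y, size_t stride_px, size_t bytes_per_pixel)
{
    return (y * stride_px + x) * bytes_per_pixel;
}
```

For a 100-pixel-wide RGBA_8888 buffer with 64-byte line alignment, each line is 400 bytes of pixel data padded to 448 bytes, so the stride is 112 pixels and 12 pixels per line are wasted.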




Allocated buffer memory can be written to, or read from, by user-space code of course, but first and foremost it is written to, or read from, by different hardware modules such as the GPU (graphics processing unit), camera, composition engine, DMA engine, display controller, etc.  On a typical SoC these hardware modules come from different vendors and have different constraints on the buffer memory which all need to be reconciled if they are to share buffers.  For example, a buffer written by the GPU should be readable by the display controller.  The different constraints on the buffers are not only the result of heterogeneous component vendors, but also of different optimization points.  In any case, gralloc needs to ensure that the image format and memory layout are acceptable to both the image producer and the consumer.  This is where the usage parameter comes into play.

The usage flags are defined in file gralloc.h.  The first four least significant bits (bits 0-3) describe how the software reads the buffer (never, rarely, often); and the next four bits (bits 4-7) describe how the software writes the buffer (never, rarely, often).  The next twelve bits describe how the hardware uses the buffer: as an OpenGL ES texture or OpenGL ES render target; by the 2D hardware blitter, HWComposer, framebuffer device, or HW video encoder; written or read by the HW camera pipeline; used as part of zero-shutter-lag camera queue; used as a RenderScript Allocation; displayed full-screen on an external display; or used as a cursor.
Obviously there may be some coupling between the color format and the usage flags.  For example, if the usage parameter indicates that the buffer is written by the camera and read by the video encoder, then the format must be acceptable to both HW modules.
If software needs to access the buffer contents, either for read or write, then gralloc needs to make sure that there is a mapping from the physical address space to the CPU's virtual address space and that the cache is kept coherent.
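Here is a small sketch of how the software-access bits of the usage parameter can be decoded.  The constants are reproduced from my reading of gralloc.h and should be checked against your own tree; the helper functions are hypothetical and not part of the gralloc interface.

```c
/* Values as I recall them from gralloc.h; verify against your source tree. */
enum {
    GRALLOC_USAGE_SW_READ_NEVER   = 0x00000000,
    GRALLOC_USAGE_SW_READ_RARELY  = 0x00000002,
    GRALLOC_USAGE_SW_READ_OFTEN   = 0x00000003,
    GRALLOC_USAGE_SW_READ_MASK    = 0x0000000F,
    GRALLOC_USAGE_SW_WRITE_NEVER  = 0x00000000,
    GRALLOC_USAGE_SW_WRITE_RARELY = 0x00000020,
    GRALLOC_USAGE_SW_WRITE_OFTEN  = 0x00000030,
    GRALLOC_USAGE_SW_WRITE_MASK   = 0x000000F0,
    GRALLOC_USAGE_HW_TEXTURE      = 0x00000100,
    GRALLOC_USAGE_HW_RENDER       = 0x00000200
};

/* Software read level from bits 0-3: 0 = never, 2 = rarely, 3 = often. */
static int sw_read_level(int usage)
{
    return usage & GRALLOC_USAGE_SW_READ_MASK;
}

/* Software write level from bits 4-7, shifted down to the same 0/2/3 scale. */
static int sw_write_level(int usage)
{
    return (usage & GRALLOC_USAGE_SW_WRITE_MASK) >> 4;
}

/* Hypothetical helper: does the CPU ever touch this buffer?  If so, the
 * allocator has to provide a CPU mapping and think about cache coherence. */
static int needs_cpu_mapping(int usage)
{
    return (usage & (GRALLOC_USAGE_SW_READ_MASK | GRALLOC_USAGE_SW_WRITE_MASK)) != 0;
}

/* Human-readable name for a read or write level. */
static const char *access_name(int level)
{
    switch (level) {
    case 0:  return "never";
    case 2:  return "rarely";
    case 3:  return "often";
    default: return "unknown";
    }
}
```

A buffer requested with only GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_HW_RENDER never needs a CPU mapping, which leaves the allocator free to pick a layout that only the GPU understands.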
For a sample implementation, you can examine the goldfish device implementation at /device/generic/goldfish/opengl/system/gralloc/gralloc.cpp.

Other factors affecting buffer memory

There are other factors affecting how graphic and image memory is allocated and how images are stored (memory layout) and accessed which we should briefly review:
Alignment
Once again, different hardware may impose hard or soft memory alignment requirements.  Not complying with a hard requirement will result in the failure of the hardware to perform its function, while not complying with a soft requirement will result in suboptimal use of the hardware (usually expressed in power, thermal and performance).

Color Space, Formats and Memory Layout
There are several color spaces of which the most familiar ones are YCbCr (images) and RGB (graphics).  Within each color space information may be encoded differently.  Some sample RGB encodings include RGB565 (16 bits; 5 bits for red and blue and 6 bits for green), RGB888 (24 bits) or ARGB8888 (32 bits; with the alpha blending channel).  YCbCr encoding formats usually employ chroma subsampling.
Because our eyes are less sensitive to color than to gray levels, the chroma channels can have a lower sampling rate compared to the luma channel with little loss of perceptual quality.  The subsampling scheme used does not necessarily dictate the memory layout.  For example, for 4:2:0 subsampling formats NV12 and YV12 there are two very different memory layouts, as depicted in the diagram below.

YV12 color format memory layout (planar)
NV12 color format memory layout (semi-planar)
YUV formats are commonly grouped into packed, planar and semi-planar formats.  In a packed format, the Y, U, and V components are stored in a single array, and pixels are organized into groups of macropixels whose layout depends on the format.  In a planar format, the Y, U, and V components are stored as three separate planes; YV12 is an example of a planar format.  A semi-planar format sits in between: NV12, for example, stores a full Y plane followed by a single plane in which the U and V samples are interleaved.
In the YV12 diagram above the Y (luma) plane has a size equal to width * height, and each of the chroma planes (U, V) has a size equal to width/2 * height/2.  This means that both width and height must be even integers.  YV12 also stipulates that the line stride must be a multiple of 16 pixels.  Because both NV12 and YV12 are 4:2:0 subsampled, for each 2x2 group of pixels there are four Y samples, one U sample and one V sample.
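The YV12 rules above translate into a short layout calculation.  This is a sketch based on my reading of the HAL_PIXEL_FORMAT_YV12 description in graphics.h (chroma stride aligned to 16 as well, Cr plane stored before the Cb plane); the struct and function names are my own.

```c
#include <stddef.h>

typedef struct yv12_layout {
    size_t y_stride;    /* luma line stride, in bytes (one byte per sample) */
    size_t c_stride;    /* chroma line stride, in bytes */
    size_t y_size;      /* size of the Y plane */
    size_t c_size;      /* size of each chroma plane */
    size_t cr_offset;   /* offset of the V (Cr) plane from the buffer start */
    size_t cb_offset;   /* offset of the U (Cb) plane from the buffer start */
    size_t total_size;
} yv12_layout_t;

static size_t align16(size_t value)
{
    return (value + 15) & ~(size_t)15;
}

/* Returns 0 on success, -1 if width or height is zero or odd. */
static int yv12_compute_layout(size_t width, size_t height, yv12_layout_t *out)
{
    if (width == 0 || height == 0 || (width & 1) || (height & 1))
        return -1;

    out->y_stride   = align16(width);
    out->c_stride   = align16(out->y_stride / 2);
    out->y_size     = out->y_stride * height;
    out->c_size     = out->c_stride * height / 2;
    out->cr_offset  = out->y_size;                 /* V plane follows Y */
    out->cb_offset  = out->y_size + out->c_size;   /* U plane follows V */
    out->total_size = out->y_size + 2 * out->c_size;
    return 0;
}
```

For a 640x480 image no padding is needed and the total comes to 460800 bytes, which is the familiar 1.5 bytes per pixel of 4:2:0 content.  For a 100x50 image the luma stride grows to 112 and the chroma stride to 64, so the buffer is noticeably larger than width * height * 1.5.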

Tiling 
If the SoC hardware uses algorithms which mostly access blocks of neighboring pixels, then it is probably more efficient to arrange the image's memory layout such that the pixels of each block are stored next to each other in memory, instead of in their usual line-by-line positions.  This is called tiling.
Some graphics/imaging hardware uses more elaborate tiling, such as supporting two tile sizes: a group of small tiles might be arranged in some scan order inside a larger tile.

Tiling: on the left is the image with the pixels in their natural order.  The green frame defines the 4x4 tile size and the red arrow shows the scan order.  On the right is the same image, but now with pixels arranged in the tile scan order.
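To show what tiling does to addresses, here is a sketch that maps a pixel coordinate to its position in a tiled buffer.  I am assuming the simplest possible scheme: 4x4 tiles stored one after another in raster order, with raster order inside each tile as well.  Real hardware often uses other scan orders, and the function names are hypothetical.

```c
#include <stddef.h>

#define TILE_W 4
#define TILE_H 4

/* Position (in pixels) of (x, y) in an ordinary line-by-line image. */
static size_t linear_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Position (in pixels) of (x, y) when the image is stored tile by tile.
 * Assumes width is a multiple of TILE_W and height a multiple of TILE_H. */
static size_t tiled_index(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE_W;
    size_t tile_x = x / TILE_W;          /* which tile column */
    size_t tile_y = y / TILE_H;          /* which tile row */
    size_t in_x = x % TILE_W;            /* position inside the tile */
    size_t in_y = y % TILE_H;
    size_t tile_number = tile_y * tiles_per_row + tile_x;

    return tile_number * (TILE_W * TILE_H) + in_y * TILE_W + in_x;
}
```

In an 8-pixel-wide image, pixel (0, 1) sits 8 pixels away from pixel (0, 0) in the linear layout but only 4 pixels away in the tiled layout, so a 4x4 block fits in 16 consecutive pixels.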

Compression
If both producer and consumer are hardware components on the same SoC, then they may write and read a common, proprietary compressed data format and decompress the data on-the-fly (i.e. using on-chip memory, usually SRAM, just before processing the pixel data).

Memory Contiguity
Some older imaging hardware modules (cameras, display, etc) don't have an MMU or don't support scatter-gather DMA.  In this case the device DMA is programmed using physical addresses which point to contiguous memory.  This does not affect the memory layout, but it is certainly the kind of platform-specific constraint that gralloc needs to be aware of when it allocates memory.

gralloc: Buffer Ownership Management

Memory is a shared resource.  It is shared either between a graphics hardware module and the CPU, or between two graphics modules.  If the CPU is rendering to a graphics buffer, we have to make sure that the display controller waits for the CPU to complete writing before it begins reading the buffer memory.  This is done using system-level synchronization which I'll discuss in a later blog entry.  But this synchronization is not sufficient to ensure that the display controller will be accessing a coherent view of the memory.  In the above example, the final updates that the CPU writes to the buffer may not have been flushed from the cache to the system memory.  If this happens, the display might show an incorrect view of the graphics buffer.  Therefore, we need some kind of low-level synchronization mechanism to explicitly manage the transfer of memory buffer ownership, one which ensures that the memory "owner" sees a consistent view of the memory.

Access to buffer memory (both read and write, for both hardware and software) is explicitly managed by gralloc users (this can be done synchronously or asynchronously).  This is done by locking and unlocking a region of the buffer memory.  There can be many threads holding a read lock concurrently, but only one thread can hold a write lock.

    /*
     * The (*lock)() method is called before a buffer is accessed for the
     * specified usage. This call may block, for instance if the h/w needs
     * to finish rendering or if CPU caches need to be synchronized.
     *
     * The caller promises to modify only pixels in the area specified
     * by (l,t,w,h).
     *
     * The content of the buffer outside of the specified area is NOT modified
     * by this call.
     *
     * If usage specifies GRALLOC_USAGE_SW_*, vaddr is filled with the address
     * of the buffer in virtual memory.
     *
     * Note calling (*lock)() on HAL_PIXEL_FORMAT_YCbCr_*_888 buffers will fail
     * and return -EINVAL.  These buffers must be locked with (*lock_ycbcr)()
     * instead.
     *
     * THREADING CONSIDERATIONS:
     *
     * It is legal for several different threads to lock a buffer from
     * read access, none of the threads are blocked.
     *
     * However, locking a buffer simultaneously for write or read/write is
     * undefined, but:
     * - shall not result in termination of the process
     * - shall not block the caller
     * It is acceptable to return an error or to leave the buffer's content
     * into an indeterminate state.
     *
     * If the buffer was created with a usage mask incompatible with the
     * requested usage flags here, -EINVAL is returned.
     *
     */
 
    int (*lock)(struct gralloc_module_t const* module,
            buffer_handle_t handle, int usage,
            int l, int t, int w, int h,
            void** vaddr);
    /*
     * The (*lockAsync)() method is like the (*lock)() method except
     * that the buffer's sync fence object is passed into the lock
     * call instead of requiring the caller to wait for completion.
     *
     * The gralloc implementation takes ownership of the fenceFd and
     * is responsible for closing it when no longer needed.
     */
    int (*lockAsync)(struct gralloc_module_t const* module,
            buffer_handle_t handle, int usage,
            int l, int t, int w, int h,
            void** vaddr, int fenceFd);
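Once lock() has returned a virtual address, the caller is expected to touch only the rectangle it asked for, and it has to step from line to line using the stride rather than the width.  Here is a minimal sketch of such a write, assuming a 32-bit-per-pixel format and a stride expressed in pixels.  The function is my own illustration; in the test-style usage below a plain array stands in for the locked buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the rectangle (l, t, w, h) of a 32-bit-per-pixel buffer with one value.
 * vaddr     - CPU address of the buffer, as lock() would return it
 * stride_px - line stride in pixels, as alloc() would return it
 * Pixels outside the rectangle, including stride padding, are not touched. */
static void fill_rect_32(void *vaddr, int stride_px,
                         int l, int t, int w, int h, uint32_t pixel)
{
    uint32_t *base = (uint32_t *)vaddr;

    for (int y = t; y < t + h; y++) {
        /* Lines are stride_px apart, not w or the buffer width apart. */
        uint32_t *line = base + (size_t)y * (size_t)stride_px;
        for (int x = l; x < l + w; x++)
            line[x] = pixel;
    }
}
```

In real code this call would sit between lock() and unlock() on the same buffer handle, with the same (l, t, w, h) passed to lock().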


Cache Coherence
If software needs to access a graphics buffer, then the correct data needs to be accessible to the CPU for reading and/or writing.  Keeping the cache coherent is one of the responsibilities of gralloc. Needlessly flushing the cache, or enabling bus snooping on some SoCs, to keep the memory view consistent across graphics hardware and CPU wastes power and can add latency.  Therefore, here too, gralloc needs to employ platform-specific mechanisms.

Locking Pages in RAM
Another aspect of sharing memory between CPU and graphics hardware is making sure that memory pages are not flushed to the swap file when they are used by the hardware.  I can't remember seeing Android on a device configured with a swap file, but it is certainly feasible, and lock() should literally lock the memory pages in RAM.
A related issue is page remapping, which happens when a virtual page that is assigned to one physical page is dynamically reassigned to a different physical page (page migration).  One reason the kernel might choose to do this is to prevent fragmentation by rearranging the physical memory allocation.  From the CPU's point of view this is fine as long as the new physical page contains the correct content.  But from the point of view of a graphics hardware module, this is pulling the rug out from under its feet.  Pages shared with hardware should be designated non-movable.