Wednesday, November 4, 2015

DMA Buffer Sharing

The need to share DMA buffers between drivers and applications is common in multimedia platforms.  Android's gralloc and Ion driver provide this along with some other goodies, but Linaro's dmabuf buffer-sharing driver is a lighter-weight alternative which is plenty good for many situations.  Here's a good comparison of Ion and dmabuf.

I'm a visual person and relate to diagrams more than I do to textual descriptions.  I use diagrams to quickly create a mental model of a subject matter I'm learning, and use text to understand the fine details if I decide to dive in.  Towards this end I created the sequence diagrams below to complement the dmabuf documentation and help me follow the interactions between the importer, exporter and application.  The kernel documentation is clear and concise, so I'm not adding further explanations, lest I detract more than I add.  Hopefully these diagrams will help you, too.

I've used MSC Generator to generate the diagrams and I'm providing the source for that as well. Enjoy ;-)

dma-buf operations for device dma only
Kernel cpu access to a dma-buf buffer object



Sunday, October 4, 2015

The great experiment: teaching my son to program (part II)

Last time I described why I'm trying to teach my son how to program, and my first two failed attempts.  The third attempt is going to be a game.

My experience with the basic command-line project I did with my son taught me that he is not as averse to text mode (AKA "no graphics") as I had previously thought.  I also learned that text-mode graphics is an easy and intuitive way to do I/O.  It's also quite powerful - just think back to some of the early games :-).  It is really important for me to show my son that there is no magic in programming.  That he can create a game with bare-bones tools.  That he can understand how everything is done.  No physics engines, no sprites, no magic.

I decided that the next project will be a text-mode tank fighting game: two tanks fighting each other in a 2D terrain.  Something like this, only using text-mode graphics.
There are plenty of features we can develop, so this gives a long runway for this project alone.  I wrote down some of the features we might develop:

  • A 2D map which might be larger than the screen
  • Animation of shooting a tank shell
  • Animation of moving a tank around the screen
  • Different types of walls: some impenetrable, some not
  • Different types of ammunition
  • Game levels with different maps and different challenges
  • Animation when moving to the next level
  • Blast/explosion animation
  • Colors
  • Sound
  • Damage level and replenishment
  • Ammunition level and replenishment, through picking up "ammunition presents"
  • Hidden walls, keys and other tricks that add surprises
  • Multi-player mode
  • Multi-player mode - over the network
  • Control from an Android application
  • Score display
  • Opening screen
  • Console interaction
  • Sprites, threads, collision detection

I can go on and on, but the point is that this is going to be interesting.
After making the list, I shared it with my son.  I wanted to get him excited more than I was. Everything is designed to pique his interest: the theme, the game, and even the project name.
I decided to stick to my bottom-up approach.  That is, I decided to teach through "doing" instead of teaching him in a structured manner.  Instead of teaching about classes, objects, control structures, variables, and the lot, we will just dive in.  This is how I learned to program and it was plenty good :-).

Each sit-down session is going to be self-contained.  That is, I don't want to quit in the middle of a feature.  The application should always work.  For this to work, we need to program in small increments.  Another principle is that we both program: I show him how I do something and then ask him to implement something similar.  And, as before, keep things lean and bare-bones.  For example, if I can refrain from using functions for a few sessions, then I'd rather have one long 'main' function than confuse him with functions for the sake of correctness.

We are using github to manage the source control and you can follow our progress there.
The first session was around the basic input/output:

  • We used Console.WriteLine to create a 2D map.  To add some excitement, we added a title to the screen and played with the colors of the foreground/background.
  • The main loop waits for user key presses and interprets them.  This introduced some basic control statements ('if', 'while') and their syntax.  Again, I introduced these nonchalantly: "we need to loop on the key press and this is how we do this.  Do you understand why we call this a loop?  Can you explain why we loop forever?"
  • My son understands the Cartesian coordinate system from school, so introducing the Console cursor location attributes was pretty straight-forward.
  • I first showed him how to handle a right-arrow key and then asked him to write the code for the other three arrow keys.  Of course, I sat next to him while he was doing this.  Some guidance was still required here and there.  
  • After he finished coding the basic loop, we played with the "tank" a bit and I showed him that nothing stops the tank from going off the map (or even the screen).  We discussed how to fix this (boundary checking); I added the code handling movement to the right, and then asked him to add the code for the other cases.
  • Finally, we added the option to quit the game.  
  • He told me that today's games display "GG" (good game) when exiting, so I showed him how we can add some ASCII art.  
  • Of course, because the exit is so quick, we didn't see the GG displayed so we worked out a 2-second delay.
And that was it.  A good first session.  I ended by committing the code to a new repository on GitHub.
I felt my son enjoyed himself, so I might have done it right this time :-)

The great experiment: teaching my son to program (part I)

There are many reasons why I want to teach my kids to program.  For starters, it's just great fun.  What could be a better present than to teach your children a new way to enjoy themselves?  When I program I sometimes feel like a novelist creating new worlds; sometimes like a carpenter using old techniques with new tools; sometimes like an architect free to implement my vision under the constraints of time, budget, physics and the almighty customer; sometimes like a detective collecting data to uncover who killed the pointer.  And all the time I'm an engineer - a problem solver.  In 2015, I believe, everyone should know something about programming.  Lawyers and artists; doctors and carpenters; pilots and salesmen.  Everywhere around us are systems that someone else programmed, and understanding something about how these "things" work is empowering.  Improving on these "things" can be life-changing.
Joy comes from quenching one's thirst to understand, to imagine and to create.  And so this skill is more than a utilitarian skill - it can be a lifelong hobby through which the kids will express their artistic sides.
Or not :-)
I don't know how this "experiment" will end: will my bubble burst, or will I help them open the gates of a new world?  At least, I hope, we will bond.  If only just a bit.

My daughter wasn't interested in this "project" of mine.
I couldn't entice her with promises of discovering bright new worlds. Bummer.  It took me a couple of weeks to recover from this arrow to the heart.
Alas, my son, who is younger and spends enough hours of the day playing games on the net, was more amenable.  So off we went.
To his room, that is, to crank up the old IDE.

Roll back a few weeks earlier: I was in the fantasizing stage at that time.  First thing I did was choose a language.  I wanted a high-level language to abstract away the complexities of the underlying hardware.  A language that has great development tools, and a large set of libraries pre-integrated in the development environment.  In other words, it should be intuitive and with minimum friction so that we can get off and going with no resistance.  C# was the obvious choice for me - it meets all my criteria and has a free world-class IDE provided by Microsoft.

I searched for online C# learning resources, but quickly gave up on that direction.  None of the sites I found provided an environment that is hands-on, educational, and captivating to a 13-year-old with a short attention span.
So it is up to me.  I have one chance, I know.  Get it wrong from the start, and I will lose his interest for a long time.
My son is a YouTuber and an online gamer, fond of gadgets, and not easily impressed by simple graphics.  It took me a few drives to work and back to come up with an idea (these long drives are good for catching up on podcasts, but also for reflection).  But before I hit on this idea I had a couple of failures.

I first tried interesting him in building a site on Wix.  He thought it was "cool", but it didn't stick.  Mostly it was my fault: I didn't plan ahead and I thought that once I showed him the tools, he would pick it up from there - motivated by his endless curiosity.  But, no, that didn't happen.  He was much more "curious" about watching some YouTube video :-(  And this was OK by me, because really, Wix is not programming.

Next I tried interesting him in replicating the Minecraft console interface.  I often see my son using the Minecraft console, and he commands it better than I know the Linux Bash shell.  We hacked up, in a single session, a simple command-line console application which accepted user input and executed simple commands: a 'log-in' command with a (fake) user and password, a 'quit' command to exit the application, and error messages.  I showed him how to use the Console object for basic I/O and taught him the basics of the "if" statement and string comparison.  I stuck to a bare-bones approach: use only what you need and use it "leanly" - no fluffy stuff to add confusion.  So, for example, when we used Console.WriteLine - well, it was just a way to print characters to the screen.  No explanation about classes, objects, or methods.
I never use C# myself, which is fortunate for this experiment: each time I wasn't sure how to do something, we googled it.  Drawing on the experience and advice of others through Google is a great lesson.

And then came the flop.
My son's attention was wavering and we had achieved enough for a first session.  So I called it quits and gave him some ideas for a few incremental changes he could make.  And then I left him, hoping that I had done enough to pique his curiosity and that he would continue alone later.  That "later" never came.  Over the following couple of weeks I tried a few times to get him to continue alone, but YouTube and Minecraft were more interesting.  Still, I saw the "spark" of curiosity and some excitement, and that was enough to keep me motivated.  He said he now understands how the Minecraft command line works.

I learned two things from that experience.  First, this is going to be "our" project.  There are too many distractions luring him, and I'll have to sit with him through several sessions before trying again to invite him to work by himself.  And this is fine because I enjoyed our session.  Second, I really have to choose an interesting project, which also has a lot of room for new features.

The next project will be a game.


Monday, July 6, 2015

Installing Caffe on Ubuntu 12.04

I have recently installed Caffe, the deep learning framework, on an Ubuntu 12.04 workstation and found a problem with the Ubuntu installation instructions.
The instructions point us to gitorious, to clone the LMDB (Lightning Memory-Mapped Database) repository:

git clone https://gitorious.org/mdb/mdb.git
However, trying to clone this repository results in a connection timeout.  According to the official Gitorious blog, Gitorious was acquired and its repositories are migrating to another location:

As you may know, Gitorious was acquired by GitLab about a month ago, and we announced that Gitorious.org would be shutting down at the end of May, 2015.
After the announcement we talked to the Archive Team about how to preserve Gitorious.org and its data for the future. A member of the Archive Team graciously offered to host gitorious.org as a read-only archive on Gitorious.org and GitLab agreed to allow to use the Gitorious.org domain name for this.
As of today, at least, the above repo link is not working for me.  To bypass this issue, I downloaded and installed the liblmdb0 and liblmdb-dev packages:



$ sudo dpkg -i liblmdb0_0.9.14-1_amd64.deb
$ sudo apt-get install -f
$ sudo dpkg -i liblmdb-dev_0.9.14-1_amd64.deb
$ sudo apt-get install -f
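Alternatively (I haven't tested this path myself), the LMDB source is also mirrored on GitHub, so cloning from there and building from source should work too:

git clone https://github.com/LMDB/lmdb.git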
I hope this information will help any of you trying to install Caffe on Ubuntu 12.04.

Saturday, June 20, 2015

Android BitTube

BitTube is a tiny AOSP (Android Open Source Project) class that I came upon while scouring the SensorService code.  It first piqued my interest because of its name, which I really like for some reason (do geeks really need reasons to like class names?).  But although the class is small, it felt like there was something interesting going on here, and that it would be worthwhile to do some digging.

If I had come upon the BitTube class outside of the Android context, it would have been quite unremarkable and forgettable.  The BitTube implementation is pretty obvious and straight-forward: it is a "parcel-able" wrapper to a pair of sockets.  A socketpair to be exact.  And that's the eyebrow-raising tidbit: a socketpair is a Linux/Unix IPC (inter-process communication) mechanism very similar to a pipe.  What is a Linux IPC doing at the heart of AOSP, when Binder is used almost everywhere else (another outlier is the socket IPC between the RIL and rild, the radio interface daemon)?

A socketpair sets up a two-way communication pipe with a socket attached to each end.  With file descriptor duplication (dup/dup2), you can pass the socket handle to another process, duplicate it and start communicating.  BitTube uses Unix sockets with sequenced packets (SOCK_SEQPACKET) which, like datagrams, only deliver whole records, and, like SOCK_STREAM, are connection-oriented and guarantee in-order delivery.  Although a socketpair is a two-way communication pipe, BitTube uses it as a one-way pipe and assigns one end of the pipe to a writer and the other to a reader (more on that later on).  The send and receive buffers are set to a default limit of 4KB each.  There's an interface for writing and reading a sequence of same-size "objects" (sendObjects, recvObjects).
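To make this concrete, here is a minimal user-space sketch (plain POSIX, not AOSP code) of the mechanism that BitTube wraps - a SOCK_SEQPACKET socketpair used as a one-way record pipe:

#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    // A connected pair of SOCK_SEQPACKET sockets: record-oriented,
    // connection-oriented, in-order delivery.
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds) < 0) {
        perror("socketpair");
        return 1;
    }
    // BitTube caps the kernel buffers at a 4KB default.
    int sz = 4 * 1024;
    setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
    setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));

    // One end belongs to the writer, the other to the reader.
    const char event[] = "sensor-event";
    write(fds[0], event, sizeof(event));

    char buf[64];
    ssize_t n = read(fds[1], buf, sizeof(buf));  // delivers a whole record
    printf("read %zd bytes: %s\n", n, buf);

    close(fds[0]);
    close(fds[1]);
    return 0;
}

In the real class, one of the two descriptors is handed to the peer process (BitTube is parcel-able, so the fd travels over Binder and is dup'ed on the way), which is what turns the pair into an IPC channel.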

A short look around AOSP reveals that BitTube is used by the Display subsystem and by the Sensors subsystem, so let's look at how it is used in the Sensors subsystem.  I'll provide a very brief recap of the Sensors Java API to level-set, in case you are not familiar with it.
An application uses the SensorManager system service to access (virtual and physical) device sensors.   It registers to receive sensor events via two callbacks, which report an accuracy change or the availability of a sensor reading sample (event).

public class SensorActivity extends Activity implements SensorEventListener {
     private final SensorManager mSensorManager;
     private final Sensor mAccelerometer;

     public SensorActivity() {
         mSensorManager = (SensorManager)getSystemService(SENSOR_SERVICE);
         mAccelerometer = mSensorManager.getDefaultSensor(Sensor.TYPE_ACCELEROMETER);
     }

     protected void onResume() {
         super.onResume();
         mSensorManager.registerListener(this, mAccelerometer, SensorManager.SENSOR_DELAY_NORMAL);
     }

     protected void onPause() {
         super.onPause();
         mSensorManager.unregisterListener(this);
     }

     public void onAccuracyChanged(Sensor sensor, int accuracy) {
     }

     public void onSensorChanged(SensorEvent event) {
     }
 }

There's a lot of work performed behind the scenes in order to implement SensorManager.registerListener.  First, SensorManager delegates the request to SystemSensorManager, which is the real workhorse.  I've copy-pasted the Lollipop code after removing some of the less important, yet distracting, code:

    /** @hide */
    @Override
    protected boolean registerListenerImpl(SensorEventListener listener, Sensor sensor,
            int delayUs, Handler handler, int maxBatchReportLatencyUs, int reservedFlags) {

        // Invariants to preserve:
        // - one Looper per SensorEventListener
        // - one Looper per SensorEventQueue
        // We map SensorEventListener to a SensorEventQueue, which holds the looper
        synchronized (mSensorListeners) {
            SensorEventQueue queue = mSensorListeners.get(listener);
            if (queue == null) {
                Looper looper = (handler != null) ? handler.getLooper() : mMainLooper;
                queue = new SensorEventQueue(listener, looper, this);
                if (!queue.addSensor(sensor, delayUs, maxBatchReportLatencyUs, reservedFlags)) {
                    queue.dispose();
                    return false;
                }
                mSensorListeners.put(listener, queue);
                return true;
            } else {
                return queue.addSensor(sensor, delayUs, maxBatchReportLatencyUs, reservedFlags);
            }
        }
    }

As you can see, a SensorEventQueue and Looper are created per registered SensorEventListener.
The SensorEventQueue is the object which eventually delivers sensor events to the application.  This class diagram can give you a high-level grasp of the Java and native class hierarchy.



Because this blog entry is about BitTube and not about the Sensor subsystem, I'll jump over many details: eventually a native SensorEventQueue is created.
The native SensorEventQueue uses a SensorEventConnection to bridge the process address-space gap and communicate with the native SensorService.  The BnSensorEventConnection (this is the server side of the IPC) creates a BitTube, and with it a socketpair.  One of the socket handles is dup'ed (the 'dup' system call) by the BpSensorEventConnection and, voila: we have a communication pipe between the two processes, as depicted below.



As I mentioned above, the BitTube is used as a one-way pipe: events are written on one side and read by the SensorEventQueue on the other side.

During the construction of the socketpair, two 4KB (default size) buffers are allocated by the kernel (one for the send-side buffer and the other for the receive-side buffer) using the SO_SNDBUF and SO_RCVBUF socket options.  Remember that this is done per SensorEventListener.  And there's also a Looper thread per SensorEventListener.  Quite a lot of overhead.
So the question remains: what's gained by using this "new" IPC?  At first I thought that this was some legacy design from the early days of Android, or perhaps from some module that was integrated into the code-base some time ago.  But this wouldn't explain why BitTube is also used by DisplayEventReceiver for what looks like a similar setup.
Maybe it provides extra low latency?  BitTube can deliver several events in one write/read, but that can also be done with Binder without introducing any complications.  They both incur about the same number of context switches, buffer copies, and system calls.
Is simplicity the motivation?  No, BitTube is about as complex as using Binder.
This leaves me with throughput as the only other reason I can think of.  But sensors are defined as low bandwidth components:

Not included in the list of physical devices providing data are camera, fingerprint sensor, microphone, and touch screen. These devices have their own reporting mechanism; the separation is arbitrary, but in general, Android sensors provide lower bandwidth data. For example, “100hz x 3 channels” for an accelerometer versus “25hz x 8 MP x 3 channels” for a camera or “44kHz x 1 channel” for a microphone.

For me, the mystery remains.  If you have some thoughts on this, please comment - I'd love to learn.
In any case, BitTube provides another tool in our AOSP tool chest - although I'm hesitant about using it until I understand what extra powers it gives me :-)

Sunday, June 14, 2015

OpenVX for Android - ovx4android

OpenVX is a new Khronos specification for an API for hardware-accelerated computer vision.
The Khronos OpenVX homepage describes it as follows:

OpenVX is an open, royalty-free standard for cross platform acceleration of computer vision applications. OpenVX enables performance and power-optimized computer vision processing, especially important in embedded and real-time use cases such as face, body and gesture tracking, smart video surveillance, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, robotics and more.

The OpenVX specification and sample code are available for download here.
Unfortunately, Khronos didn't bother testing their release on Android and it doesn't even compile.  I went ahead and made the necessary changes to compile the code with the NDK (I tested with NDK r9d) and I've made it available on github.

In a future post I'll describe how to integrate this with android.hardware.camera2.
I hope you enjoy experimenting with this.
Neta

Saturday, April 18, 2015

Android's Graphics Buffer Management System (Part II: BufferQueue)

In the first post on Android's graphics buffer management, I discussed gralloc, which is Android's graphics buffer allocation HAL.  In this post I'll describe graphics buffer flows in Android, with special attention to class BufferQueue, which plays a central role in graphics buffer management.

Introduction

Before I dive in, I want to discuss buffers in general.  There is a surprising number of details and aspects involved in designing buffer systems, and I think it is best to examine what was done in Android after we've assumed a wide and generic perspective.
Data buffers, and specifically image and graphics data buffers, exist as part of a specific subsystem, such as the camera subsystem, but can also span multiple subsystems, such as buffers shared between the camera and video subsystems.  Buffers provide a means to temporarily store data, allowing us to separate the production of data from the consumption of data - in both time and space.  That is, we can produce (or collect) data at one moment, and use it at a different moment.  This decouples producer and consumer, and also allows producer and consumer to be asynchronous to one another.  Many times in an event-based system the data producer and the data consumer are triggered (clocked) by different time sources.  For example, the camera on your mobile phone produces image frames at some arbitrary frame rate (e.g. 30 frames per second, or FPS) while the display panel (showing the preview) can operate at a different refresh rate (e.g. 60 Hz).  Moreover, even if the devices were guaranteed to operate at the same frequency (or if one frequency were a harmonic of the other), they are unlikely to have the same phase offset, since the display operation starts when we turn on the screen, while the camera operation starts at some other arbitrary time when we start the camera application.  And of course there is drift and jitter that contribute to the asynchronous nature of the two subsystems.

There may also be several consumers, or several producers.  SurfaceFlinger, for example, uses buffers from multiple sources and composes them into a single output buffer.

Buffers also allow us to move data from one part of our system to another.  Inevitably, buffers follow some paths within our system and these are commonly referred to as the "data paths".  A path can start at a buffer provider which allocates new memory or provides a buffer from a pre-allocated pool.  The buffers are considered empty at this stage.  That is, they do not contain consumable data or metadata.  A source entity provides the initial data by attaching it to a buffer (reference-holding buffers) or copying the data to the buffer's memory.  Somehow, a buffer makes its way along a path of buffer handlers until it arrives at the content consumer, which uses the data and discards the buffer.  A buffer handler may be passive (e.g. a monitor or logger), or it may be active: filtering (dropping), altering, augmenting, extracting, or otherwise manipulating the contents.  These paths can be either dynamic or static.  There are many design patterns which define how a data path is defined and controlled (pipes and filters, layering, pipeline, software bus messaging, direct addressing, broadcasting, observing, and so forth) and I will not cover them here, as that would really be diverging from our topic.

Buffer systems are either closed-loop or open-loop.  In closed-loop paths there is a buffer path from the consumer back to the producer.  Sometimes this is made explicit, and sometimes it is implicit.  For example, if the producer and consumer use a shared memory pool, they implicitly form a closed loop.  One can argue that using a shared buffer pool is not really a closed loop, but I contend that as long as the system is designed using explicit knowledge of the shared buffer memory, then it is closed.  That is, if the consumer can starve or delay the producer because it controls the flow of buffers available to the producer, then this is a closed-loop system.

Ah, and there is the question of what we mean by "buffer".  A lot of the time when people say "buffer" they are referring to the actual backing memory storing the content, but in real systems it is quite rare to see raw data moving around the system.  It is much more common to see buffer objects which contain metadata describing the data content.  What is contained in this metadata is implementation-specific and depends on the problem domain and context, but I'm sure we can agree that one piece of information we need to know is the amount of data stored in the buffer.  And there is the question of pointer-to-data (by reference) vs. embedded data (by value).  Obviously zero-copy buffer handling is preferred, but it requires us to be exact about buffer memory lifetime management.  Lifetime management, access management and synchronization are other related aspects which I've discussed in the previous post, so I'll cut things short right here.

BufferQueue

After this generic discussion of data buffers, we can finally dive into the Android details.  I'll start with class BufferQueue because it is at the center of graphics buffer movement in Android.  It abstracts a queue of graphics buffers, uses gralloc to allocate buffers, and has means to connect buffer producers and consumers which reside in different process address spaces.
Code for class BufferQueue and many of the cooperating classes that I'll be discussing can be found in directory /frameworks/native/libs/gui/ with the header files in /frameworks/native/include/gui.


Class BufferQueue has a static factory method, BufferQueue::createBufferQueue, which is used to create BufferQueue instances.

    // BufferQueue manages a pool of gralloc memory slots to be used by
    // producers and consumers. allocator is used to allocate all the
    // needed gralloc buffers.
    static void createBufferQueue(sp<IGraphicBufferProducer>* outProducer,
            sp<IGraphicBufferConsumer>* outConsumer,
            const sp<IGraphicBufferAlloc>& allocator = NULL);

A quick glance at the implementation reveals that class BufferQueue is only a thin facade over class BufferQueueCore, which contains the actual implementation logic.  For simplicity of this discussion, I will not make a distinction between these classes.

Working with BufferQueue is pretty straight-forward.  First, producers and consumers connect to the BufferQueue.
1. The producer takes an “empty” buffer from the BufferQueue (dequeueBuffer)
2. The producer (e.g. camera) copies image or graphics data into the buffer
3. The producer returns the “filled” buffer to the BufferQueue (queueBuffer)
4. The consumer receives an indication (via callback) of the presence of a “filled” buffer
5. The consumer removes this buffer from the BufferQueue (acquireBuffer)
6. When the consumer is done consuming, the buffer is returned to the BufferQueue (releaseBuffer)

The following diagram shows a simplified interaction between the camera (image buffer producer) and the display (image buffer consumer).

Figure 1: Simplified data path between the camera subsystem and the GPU
Producers and Consumers may reside in different processes and this is accomplished using Binder, as always.
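To make the six steps above concrete, here is a toy, single-process sketch of the buffer state machine.  It is not the AOSP code - just the same idea reduced to its essentials, with the Binder plumbing and gralloc allocation left out:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Toy analogue of BufferQueue: each slot cycles through
// FREE -> DEQUEUED (producer) -> QUEUED -> ACQUIRED (consumer) -> FREE.
enum class SlotState { FREE, DEQUEUED, QUEUED, ACQUIRED };

class ToyBufferQueue {
public:
    explicit ToyBufferQueue(size_t count) : states_(count, SlotState::FREE) {}

    int dequeueBuffer() {              // producer: take an "empty" slot
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            for (size_t i = 0; i < states_.size(); ++i) {
                if (states_[i] == SlotState::FREE) {
                    states_[i] = SlotState::DEQUEUED;
                    return static_cast<int>(i);
                }
            }
            cv_.wait(lock);            // producer starves until a release
        }
    }
    void queueBuffer(int slot) {       // producer: publish a "filled" slot
        std::lock_guard<std::mutex> lock(m_);
        states_[slot] = SlotState::QUEUED;
        filled_.push(slot);
        cv_.notify_all();
    }
    int acquireBuffer() {              // consumer: take a "filled" slot
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !filled_.empty(); });
        int slot = filled_.front();
        filled_.pop();
        states_[slot] = SlotState::ACQUIRED;
        return slot;
    }
    void releaseBuffer(int slot) {     // consumer: hand the slot back
        std::lock_guard<std::mutex> lock(m_);
        states_[slot] = SlotState::FREE;
        cv_.notify_all();
    }

private:
    std::vector<SlotState> states_;
    std::queue<int> filled_;
    std::mutex m_;
    std::condition_variable cv_;
};

Note how the consumer, simply by holding on to buffers, throttles the producer - the closed-loop property discussed in the introduction.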

BufferQueueProducer is the workhorse behind IGraphicBufferProducer.  BufferQueueProducer maintains an intimate relationship with BufferQueueCore and directly accesses its member variables, including mutexes, conditions and other significant members (such as its pointer to IGraphicBufferAlloc).  Personally, I don't like this - it is confusing and fragile. 
When a Producer is requested to provide an empty buffer using dequeueBuffer, it tries to fetch one from BufferQueueCore, which maintains an array of buffers and their states (DEQUEUED, QUEUED, ACQUIRED, FREE).  If a free slot is found in the buffer array but it doesn't contain a buffer, or if the Producer was explicitly asked to reallocate the buffer, then BufferQueueProducer uses BufferQueueCore's allocator to allocate a new buffer.

Initially, all invocations of dequeueBuffer result in the allocation of new buffers.  But because this is a closed-loop system, where the buffer Consumer returns buffers once it has consumed their contents (by calling releaseBuffer), we should see the system reaching equilibrium after a very short while.  Be aware that although BufferQueueCore can maintain an array of variable-sized GraphicBuffer objects, it is wise to make all buffers the same size.  Otherwise, each invocation of dequeueBuffer may require the allocation of a new GraphicBuffer instance.

Figure 2: The main classes related to BufferQueue
 The GraphicBuffer allocation is performed using an implementation of IGraphicBufferAlloc which is provided to BufferQueueCore when it is constructed.  The default implementation of IGraphicBufferAlloc is provided by SurfaceFlinger (the system object in charge of composing all surfaces) and uses gralloc to allocate buffers.  In the previous post I discussed why a central graphics buffers allocator is well-advised when dealing with various hardware SoC modules.
Class BufferQueueCore doesn’t directly store GraphicBuffer objects – it uses class BufferItem, which contains a pointer to a GraphicBuffer instance along with various other metadata (see frameworks/native/include/gui/BufferItem.h).
Figure 3: Class diagram showing the main classes related to graphics buffer allocation

Asynchronous notification interfaces IConsumerListener and IProducerListener are used to alert listeners about events such as a buffer being ready for consumption (IConsumerListener::onFrameAvailable) or the availability of an empty buffer (IProducerListener::onBufferReleased).  These callback interfaces also use Binder and can cross process boundaries.  Check out further details in frameworks/native/include/gui/IConsumerListener.h.

The best source of information I found on Android’s graphics system, aside from the code itself of course, is here.

Consumers

Figure: Some consumer classes

BufferQueue Creation

Figure: Top to bottom BufferQueue creation flow



Saturday, March 21, 2015

Android's Graphics Buffer Management System (Part I: gralloc)

In this post series I'll do a deep dive into Android's graphics buffer management system.  I'll cover how buffers produced by the camera use the generic BufferQueue abstraction to flow to different parts of the system, how buffers are shared between different hardware modules, and how they traverse process boundaries.
But I will start at buffer allocation, and before I describe what triggers buffer allocation and when, let's look at the low-level graphics buffer allocator, a.k.a. gralloc.

gralloc: Buffer Allocation

The gralloc is part of the HAL (Hardware Abstraction Layer), which means that the implementation is platform-specific.  You can find the interface definitions in hardware/libhardware/include/hardware/gralloc.h.  As expected from a HAL component, the interface is divided into a module interface (gralloc_module_t) and a device interface (alloc_device_t).  Loading the gralloc module is performed as for all HAL modules, so I won't go into these details because they can easily be googled.  But I will mention that the entry point into a newly loaded HAL module is via the open method of the structure hw_module_methods which is referenced by the structure hw_module_t.  Structure hw_module_t acts as a mandatory "base class" (not quite, since this is "C" code) of all HAL modules, including gralloc_module_t.
Both the module and the device interfaces are versioned.  The current module version is 0.3 and the device version is 0.1.  Only Google knows why these interfaces have these sub-1.0 interface versions. :-)

As I said above, gralloc implementations are platform-specific and for reference you can look at the goldfish device's implementation (device/generic/goldfish/opengl/system/gralloc/gralloc.c).  Goldfish is the code name for the Android emulation platform device.
The sole responsibility of the device (alloc_device_t) is the allocation (and subsequent release) of buffer memory, so it has a straight-forward signature:

typedef struct alloc_device_t {
    struct hw_device_t common;

    /*
     * (*alloc)() Allocates a buffer in graphic memory with the requested
     * parameters and returns a buffer_handle_t and the stride in pixels to
     * allow the implementation to satisfy hardware constraints on the width
     * of a pixmap (eg: it may have to be multiple of 8 pixels).
     * The CALLER TAKES OWNERSHIP of the buffer_handle_t.
     *
     * If format is HAL_PIXEL_FORMAT_YCbCr_420_888, the returned stride must be
     * 0, since the actual strides are available from the android_ycbcr
     * structure.
     *
     * Returns 0 on success or -errno on error.
     */

    int (*alloc)(struct alloc_device_t* dev,
            int w, int h, int format, int usage,
            buffer_handle_t* handle, int* stride);
    /*
     * (*free)() Frees a previously allocated buffer.
     * Behavior is undefined if the buffer is still mapped in any process,
     * but shall not result in termination of the program or security breaches
     * (allowing a process to get access to another process' buffers).
     * THIS FUNCTION TAKES OWNERSHIP of the buffer_handle_t which becomes
     * invalid after the call.
     *
     * Returns 0 on success or -errno on error.
     */

    int (*free)(struct alloc_device_t* dev,
            buffer_handle_t handle);

    /* This hook is OPTIONAL.
     *
     * If non NULL it will be called by SurfaceFlinger on dumpsys
     */
    void (*dump)(struct alloc_device_t *dev, char *buff, int buff_len);
    void* reserved_proc[7];
} alloc_device_t;

Let's examine the parameters of the alloc() function.  The first parameter (dev) is of course the device instance handle.

The next two parameters (w, h) provide the requested width and height of the buffer.  When describing the dimensions of a graphics buffer there are two points to watch for.  First, we need to understand the units of the dimensions.  If the dimensions are expressed in pixels, as is the case for gralloc, then we need to understand how to translate pixels to bits.  And for this we need to know the color encoding format.

The requested color format is the fourth parameter.  The color formats that Android supports are defined in /system/core/include/system/graphics.h.  Color format HAL_PIXEL_FORMAT_RGBA_8888 uses 32 bits for each pixel (8 bits for each of the pixel components: red, green, blue and alpha-blending), while HAL_PIXEL_FORMAT_RGB_565 uses 16 bits for each pixel (5 bits for red and blue, and 6 bits for green).

The second important factor affecting the physical dimensions of the graphics buffer is its stride. Stride is the last parameter to alloc and it is also an out parameter.  To understand stride (a.k.a. pitch), it is easiest to refer to a diagram:




We can think of memory buffers as matrices arranged in rows and columns of pixels.  A row is usually referred to as a line.  Stride is defined as the number of pixels (or bytes, depending on your units!) from the beginning of one buffer line to the beginning of the next.  As the diagram above shows, the stride is necessarily at least equal to the width of the buffer, but it can very well be larger.  The difference between the stride and the width (stride - width) is just wasted memory, and one takeaway from this is that the memory used to store an image may not be contiguous.  So where does the stride come from?  Due to hardware implementation complexity, memory bandwidth optimizations, and other constraints, the hardware accessing the graphics memory may require each line to be a multiple of some number of bytes.  For example, if for a particular hardware module the line addresses need to align to 64 bytes, then line lengths need to be multiples of 64 bytes.  If this constraint results in longer lines than requested, then the buffer stride differs from the width.  Another motivation for stride is buffer reuse: imagine that you want to refer to a cropped image within another image.  In this case, the cropped (inner) image has a stride different from its width.
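As a worked example (the 64-byte alignment here is hypothetical - your hardware's requirements will differ), here is how a stride falls out of an alignment constraint:

#include <cstdio>

// Hypothetical case: a 100-pixel-wide RGBA_8888 buffer on hardware
// that requires each line to start on a 64-byte boundary.
int main() {
    const int width_px = 100, bytes_per_px = 4, align = 64;
    int line_bytes = width_px * bytes_per_px;                      // 400 bytes
    int stride_bytes = ((line_bytes + align - 1) / align) * align; // rounded up to 448
    printf("stride = %d bytes = %d pixels (for a width of %d pixels)\n",
           stride_bytes, stride_bytes / bytes_per_px, width_px);
    return 0;
}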




Allocated buffer memory can be written to, or read from, by user-space code of course, but first and foremost it is written to, or read from, by different hardware modules such as the GPU (graphics processing unit), camera, composition engine, DMA engine, display controller, etc.  On a typical SoC these hardware modules come from different vendors and have different constraints on the buffer memory which all need to be reconciled if they are to share buffers.  For example, a buffer written by the GPU should be readable by the display controller.  The different constraints on the buffers are not necessarily the result of heterogeneous component vendors, but also because of different optimization points.  In any case, gralloc needs to ensure that the image format and memory layout is agreeable to both image producer and consumer.  This is where the usage parameter comes into play.

The usage flags are defined in file gralloc.h.  The first four least significant bits (bits 0-3) describe how the software reads the buffer (never, rarely, often); and the next four bits (bits 4-7) describe how the software writes the buffer (never, rarely, often).  The next twelve bits describe how the hardware uses the buffer: as an OpenGL ES texture or OpenGL ES render target; by the 2D hardware blitter, HWComposer, framebuffer device, or HW video encoder; written or read by the HW camera pipeline; used as part of zero-shutter-lag camera queue; used as a RenderScript Allocation; displayed full-screen on an external display; or used as a cursor.
Obviously there may be some coupling between the color format and the usage flags.  For example, if the usage parameter indicates that the buffer is written by the camera and read by the video encoder, then the format must be agreeable to both HW modules.
If software needs to access the buffer contents, either for read or write, then gralloc needs to make sure that there is a mapping from the physical address space to the CPU's virtual address space and that the cache is kept coherent.
For a sample implementation, you can examine the goldfish device implementation at /device/generic/goldfish/opengl/system/gralloc/gralloc.cpp.

Other factors affecting buffer memory

There are other factors affecting how graphic and image memory is allocated and how images are stored (memory layout) and accessed which we should briefly review:
Alignment
Once again, different hardware may impose hard or soft memory alignment requirements.  Not complying with a hard requirement will result in the failure of the hardware to perform its function, while not complying with a soft requirement will result in sub-optimal use of the hardware (usually costing power, heat, or performance).

Color Space, Formats and Memory Layout
There are several color spaces of which the most familiar ones are YCbCr (images) and RGB (graphics).  Within each color space information may be encoded differently.  Some sample RGB encodings include RGB565 (16 bits; 5 bits for red and blue and 6 bits for green), RGB888 (24 bits) or ARGB8888 (32 bits; with the alpha blending channel).  YCbCr encoding formats usually employ chroma subsampling.
Because our eyes are less sensitive to color than to gray levels, the chroma channels can have a lower sampling rate compared to the luma channel with little loss of perceptual quality.  The subsampling scheme used does not necessarily dictate the memory layout.  For example, for 4:2:0 subsampling formats NV12 and YV12 there are two very different memory layouts, as depicted in the diagram below.

YV12 color format memory layout (planar)
NV12 color format memory layout (packed)
The YUV formats here come in two flavors: packed formats (also known as semi-planar) and planar formats.  NV12 is an example of a packed format, and YV12 is an example of a planar format.  In a packed format, the Y, U, and V components are stored in a single array; pixels are organized into groups of macropixels, whose layout depends on the format.  In a planar format, the Y, U, and V components are stored as three separate planes.
In the YV12 diagram above, the Y (luma) plane has size equal to width * height, and each of the chroma planes (U, V) has size equal to width/2 * height/2.  This means that both width and height must be even integers.  YV12 also stipulates that the line stride must be a multiple of 16 pixels.  Because both NV12 and YV12 are 4:2:0 subsampled, for each 2x2 group of pixels there are four Y samples, one U sample and one V sample.
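For illustration, here is a sketch of how the YV12 plane offsets fall out of these rules (the dimensions are arbitrary; I'm just following the width/height/stride constraints described above):

#include <cstdio>

// YV12: full-resolution Y plane, then quarter-resolution V and U planes,
// with line strides rounded up to a multiple of 16.
int align16(int x) { return (x + 15) & ~15; }

int main() {
    int width = 640, height = 480;            // both must be even
    int y_stride = align16(width);
    int c_stride = align16(y_stride / 2);
    int y_size = y_stride * height;
    int c_size = c_stride * (height / 2);
    printf("V plane at offset %d, U plane at offset %d, total %d bytes\n",
           y_size, y_size + c_size, y_size + 2 * c_size);
    return 0;
}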

Tiling 
If the SoC hardware uses algorithms which mostly access blocks of neighboring pixels, then it is probably more efficient to arrange the image's memory layout such that neighboring pixels are laid out contiguously, instead of in their usual raster positions.  This is called tiling.
Some graphics/imaging hardware uses more elaborate tiling, such as supporting two tile sizes: a group of small tiles might be arranged in some scan order inside a larger tile.

Tiling: on the left is the image with the pixels in their natural order.  The green frame defines the 4x4 tile size and the red arrow shows the scan order.  On the right is the same image, but now with the pixels arranged in the tile scan order.

Compression
If both producer and consumer are hardware components on the same SoC, then they may write and read a common, proprietary compressed data format and decompress the data on-the-fly (i.e. using on-chip memory, usually SRAM, just before processing the pixel data).

Memory Contiguity
Some older imaging hardware modules (cameras, display, etc) don't have an MMU or don't support scatter-gather DMA.  In this case the device DMA is programmed using physical addresses which point to contiguous memory.  This does not affect the memory layout, but it is certainly the kind of platform-specific constraint that gralloc needs to be aware of when it allocates memory.

gralloc: Buffer Ownership Management

Memory is a shared resource.  It is shared either between a graphics hardware module and the CPU, or between two graphics modules.  If the CPU is rendering to a graphics buffer, we have to make sure that the display controller waits for the CPU to complete writing before it begins reading the buffer memory.  This is done using system-level synchronization, which I'll discuss in a later blog entry.  But this synchronization is not sufficient to ensure that the display controller will be accessing a coherent view of the memory.  In the above example, the final updates to the buffer that the CPU writes may not have been flushed from the cache to system memory.  If this happens, the display might show an incorrect view of the graphics buffer.  Therefore, we need some kind of low-level, atomic synchronization mechanism to explicitly manage the transfer of memory buffer ownership, one which ensures that the memory "owner" sees a consistent view of the memory.

Access to buffer memory (both read and write, for both hardware and software) is explicitly managed by gralloc users (this can be done synchronously or asynchronously).  This is done by locking and unlocking a region of buffer memory.  There can be many threads holding a read lock concurrently, but only one thread can hold a write lock.

    /*
     * The (*lock)() method is called before a buffer is accessed for the
     * specified usage. This call may block, for instance if the h/w needs
     * to finish rendering or if CPU caches need to be synchronized.
     *
     * The caller promises to modify only pixels in the area specified
     * by (l,t,w,h).
     *
     * The content of the buffer outside of the specified area is NOT modified
     * by this call.
     *
     * If usage specifies GRALLOC_USAGE_SW_*, vaddr is filled with the address
     * of the buffer in virtual memory.
     *
     * Note calling (*lock)() on HAL_PIXEL_FORMAT_YCbCr_*_888 buffers will fail
     * and return -EINVAL.  These buffers must be locked with (*lock_ycbcr)()
     * instead.
     *
     * THREADING CONSIDERATIONS:
     *
     * It is legal for several different threads to lock a buffer from
     * read access, none of the threads are blocked.
     *
     * However, locking a buffer simultaneously for write or read/write is
     * undefined, but:
     * - shall not result in termination of the process
     * - shall not block the caller
     * It is acceptable to return an error or to leave the buffer's content
     * into an indeterminate state.
     *
     * If the buffer was created with a usage mask incompatible with the
     * requested usage flags here, -EINVAL is returned.
     *
     */
 
    int (*lock)(struct gralloc_module_t const* module,
            buffer_handle_t handle, int usage,
            int l, int t, int w, int h,
            void** vaddr);
    /*
     * The (*lockAsync)() method is like the (*lock)() method except
     * that the buffer's sync fence object is passed into the lock
     * call instead of requiring the caller to wait for completion.
     *
     * The gralloc implementation takes ownership of the fenceFd and
     * is responsible for closing it when no longer needed.
     */
    int (*lockAsync)(struct gralloc_module_t const* module,
            buffer_handle_t handle, int usage,
            int l, int t, int w, int h,
            void** vaddr, int fenceFd);
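Putting allocation and ownership together, here is a minimal sketch of the full cycle (the dimensions, format and usage here are arbitrary, and error handling is mostly omitted):

#include <hardware/gralloc.h>
#include <hardware/hardware.h>
#include <cstring>

// Allocate a buffer, take software write ownership, scribble, release, free.
int gralloc_example() {
    const hw_module_t* module;
    if (hw_get_module(GRALLOC_HARDWARE_MODULE_ID, &module) != 0)
        return -1;

    alloc_device_t* dev;
    gralloc_open(module, &dev);

    buffer_handle_t handle;
    int stride;  // out parameter, in pixels
    dev->alloc(dev, 640, 480, HAL_PIXEL_FORMAT_RGBA_8888,
               GRALLOC_USAGE_SW_WRITE_OFTEN, &handle, &stride);

    const gralloc_module_t* gr =
            reinterpret_cast<const gralloc_module_t*>(module);
    void* vaddr;
    // Take (software) write ownership of the whole buffer; this may block
    // until the hardware is done with it and caches are synchronized.
    gr->lock(gr, handle, GRALLOC_USAGE_SW_WRITE_OFTEN, 0, 0, 640, 480, &vaddr);
    memset(vaddr, 0xff, stride * 480 * 4);  // note: stride, not width
    gr->unlock(gr, handle);                 // relinquish ownership

    dev->free(dev, handle);
    gralloc_close(dev);
    return 0;
}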


Cache Coherence
If software needs to access a graphics buffer, then the correct data needs to be accessible to the CPU for reading and/or writing.  Keeping the cache coherent is one of the responsibilities of gralloc. Needlessly flushing the cache, or enabling bus snooping on some SoCs, to keep the memory view consistent across graphics hardware and CPU wastes power and can add latency.  Therefore, here too, gralloc needs to employ platform-specific mechanisms.

Locking Pages in RAM
Another aspect of sharing memory between CPU and graphics hardware is making sure that memory pages are not flushed to the swap file when they are used by the hardware.  I can't remember seeing Android on a device configured with a swap file, but it is certainly feasible, and lock() should literally lock the memory pages in RAM.
A related issue is page remapping, which happens when a virtual page that is assigned to one physical page is dynamically reassigned to a different physical page (page migration).  One reason the kernel might choose to do this is to prevent fragmentation by rearranging the physical memory allocation.  From the CPU's point of view this is fine, as long as the new physical page contains the correct content.  But from the point of view of a graphics hardware module, this is pulling the rug out from under its feet.  Pages shared with hardware should be designated non-movable.


Friday, January 23, 2015

Revisiting the Active Object Pattern - with C++11 Closures

I have a confession: with the never ending things going on (you know: life ;-) I missed the (fairly) recent changes in the C++ language.  C++ was my first OO language and it probably remains my favorite.  I can't help but love the mix of high level abstractions with metal grinding pointer arithmetic - it's like a cool sports car with manual transmission. Beauty and power.  You have to be more alert, more involved. That's part of the fun - you are taking control.  But for some time now C++ felt old, tired, disconnected from the endless stream of new languages.  Until C++11 came along.  

C++11 (formerly known as C++0x) is a major revision of the standard of the C++ programming language.  It was approved by ISO on 12 August 2011.  C++11 includes several additions to the core language and extends the C++ standard library, incorporating most of the C++ Technical Report 1 (TR1) libraries.
Bjarne Stroustrup wrote that:
Surprisingly, C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
And it does feel like a new language. And this is exciting for geeks like me.  In this blog post I discuss how I implemented Schmidt's Active Object pattern in a novel way using C++11 Closures.

First, another confession: for a long time I've suffered from Node envy.  Node.js envy, to be precise. Look at this "Hello World" Javascript code:
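The snippet was the canonical Node.js server of the day - reproduced here from memory, so treat it as illustrative:

var http = require('http');

// The request handler is an anonymous callback - "the callback pattern".
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(1337, '127.0.0.1');

console.log('Server running at http://127.0.0.1:1337/');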



What I "envy" is not the use of asynchronous I/O operations with callbacks ("the callback pattern"), but the compelling beauty of lambda functions.  Lambda functions simplify asynchronous programming because they allow us to write code that is seemingly synchronous.  The code that is executed by the lambda function is temporally disjoint from the code that precedes it, and yet both parts are spatially co-located.  The outcome is smaller, tighter code that feels more natural and is easier to read and maintain.  And this can be done in C++11.

I won't discuss C++11 lambda functions because others have done this better than I can.  This article is an example of some of the great coverage you can find on the net (Alex Allain has lots of interesting material to read).  But I do want to touch on the difference between Lambda functions and Closures, since my implementation below uses Closures.  Lambda functions are anonymous functions that don't need to be bound to a name and can be specified as lambda expressions.  A Closure is an example of a lambda function which "closes" over the environment in which it was specified (meaning that it can access the variables available in the referencing environment).  Alex Allain's article (which I referenced above) doesn't make a big distinction between lambdas and closures and simply treats closures as lambdas with "variable capture".  Syntactically in C++ lambdas and closures are almost identical, so the distinction is there and it is slight, yet I think it is important to note the semantic difference.

On to Active Object.

Douglas Schmidt describes the Active Object design pattern in Pattern Oriented Software Architecture (Volume 2: Patterns for Concurrent and Networked Objects):
The Active Object design pattern decouples method execution from method invocation to enhance concurrency and simplify synchronized access to objects that reside in their own threads of control. 
Once again, I don't want to paraphrase the work of others, so I assume that you are knowledgeable about the details of the Active Object pattern.  If not, you should probably familiarize yourself with the pattern before reading on.
To illustrate my ideas, I will concentrate on only one variation of the Active Object pattern.  In this variation the Client and Proxy are "folded" into the same object, and the Scheduler and ActivationList implement a simple message queue policy (this is reminiscent of Schmidt's original AO paper, which he later expanded on).  I think this is probably the most prevalent variation of the pattern - one in which we want to serialize access to an object, and use an in-order queue (FIFO) to "bounce" the method invocation from one thread to another.
Let's look at the example code from the Wikipedia entry on Active Object.  The Wikipedia code is implemented in Java, so I went and implemented it using C++11.  I placed comments in the code to explain the logic.
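The heart of the implementation is a FIFO of std::function closures serviced by the object's own thread.  Here is a minimal sketch of that core (the full code adds a few niceties, but nothing essential):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Minimal Active Object core: method invocations are captured as closures
// and executed, in FIFO order, on the object's own scheduler thread.
class ActiveObject {
public:
    ActiveObject() : done_(false), worker_([this] { run(); }) {}

    ~ActiveObject() {
        enqueue([this] { done_ = true; });  // "poison pill" message
        worker_.join();
    }

    // The public interface: capture the request and return immediately.
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        while (!done_) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();  // executed on the scheduler thread, in order
        }
    }

    bool done_;
    std::deque<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::thread worker_;  // declared last: starts after the members above
};

A client calls, say, active.enqueue([=] { doSomething(x); }); - the closure captures the arguments, and the call site reads almost like a synchronous invocation.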

The more "traditional" method of implementing ActiveObject in C++ involves defining two sets of interfaces: a public interface and a private interface.  Every method in the public interface also appears in the private interface.  The public interface is used by clients to invoke methods on the object; these methods create a message indicating the request and its parameters and enqueue the message.  The private interface is used by the dispatcher, which dequeues messages and invokes the corresponding private method.  This works well enough, but it creates big classes with a lot of extraneous code that is there just to make the mechanics work.  Every change to the interface requires a series of changes (public and private interfaces; message definition).

A somewhat more sophisticated implementation uses functors.  We no longer need the code which switches on the message type when we grab a message from the FIFO and dispatch it.  But the sophistication of the code probably only adds a layer of obfuscation if you are not familiar with the underlying idiom.  We gain too little from this for it to be worthwhile.

Now let's come full circle and return to the Closure implementation of ActiveObject and add a few features to it.

Friday, January 2, 2015

Measuring the Performance of Halide Convolutions

In the previous post I detailed five ways to express a 3x3 Gaussian smoothing filter in Halide.  I was curious to understand whether the choice of algorithm expression has any impact on performance.  After all, I interpret Halide's promise to "write the algorithm once and then search for the optimal schedule" (not a direct quote) as telling us that, all things being equal, the algorithm implementation is not very important as long as it is functionally correct.  So off I went to do some experimenting and data collection.

To perform the tests, I used rgb.png (from the Halide distribution) as the input image.  This image has dimensions 768x1280 and a 24-bit RGB format, which means that each of the three color channels (red, green, and blue) is represented by 8 bits.
My workstation uses an Intel i7-3770 CPU which supports SSE, SSE2, SSE3, SSE4.1, SSE4.2 and AVX.  On a Linux machine you can learn about your CPU by invoking:
  $ cat /proc/cpuinfo
Interestingly, the cpuid program did not list all of the vectorization features supported by the CPU, so I double-checked here.
I selected a set of schedules for the separable Gaussian implementations (those expressed using two Halide functions) and a different set of schedules for the non-separable implementations.  I ran each of the schedules 50 times in a loop and calculated the mean value after removing the two smallest and two largest time samples.  It takes Halide a couple of rounds to "warm up", which I found a bit strange since I invoked the JIT compiler before starting each 50-run loop.  Perhaps some code gets mapped to the instruction cache.  I also calculated the standard deviation of each 50-run loop. Naturally, schedules using parallelization show more jitter.

Before I show the results, I want to discuss some interesting results I observed.

Simple update steps impact performance

Update steps are scheduled separately, but I never expected that a simple update, such as the one highlighted below, could impact performance.  I expected that the default inline schedule would be used, and since the data is readily available in the cache, the update would be painless.  I pasted below two filter implementations, the first with an extra update step and the second without it:

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                      input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                      input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1;
    gaussian(x,y,c) /= 16;
    return gaussian;
}

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = (input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                       input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                       input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1) / 16;
    return gaussian;
}

How much does this affect performance?  That really depends on the algorithm and the schedule, in ways that I cannot explain.  I ran the second Gaussian function (gaussian_3x3_2) with and without an update and sometimes I got better results, sometimes worse.  This is shown in the second and third results rows of the table below (2-update and 2-no-update).  In the best-performing schedules of gaussian_3x3_2, the no-update implementation provided the best results.

Casting operations also impact performance

In the previous post I discussed why we need to cast the 8-bit pixel channel inputs before calling the Gaussian filter.  If we want to realize the results of the Gaussian filter to a 24-bit PNG image, then we also need to cast the results back to uint8_t.  I found that I get the best results if I perform the cast as part of the Gaussian filter.  But if I insist on performing the cast on the results of the filter (i.e. after exiting the filter function), then that cast is a legitimate part of the Halide schedule and should be optimized like all other parts.
    Halide::Var x,y,xi,yi,c;
    Halide::Func padded, padded32, test2;

    padded(x,y,c) = input(clamp(x, 0, input.width()-1), clamp(y, 0, input.height()-1), c);
    padded32(x,y,c) = Halide::cast<int32_t>(padded(x,y,c));
    Halide::Func test = gaussian_3x3_5(padded32, Separable2dConvolutionSched(7));
    test.compute_root();
    test2(x,y,c) = Halide::cast<uint8_t>(test(x,y,c));
    test2.vectorize(x, 4).parallel(y);

The choice of schedule varies widely with the implementation of the algorithm

In the table below you can see seven schedules applied to six different algorithm implementations of Gaussian 3x3.  For each pair of {algorithm, schedule} I provide the mean and standard deviation.  The first three implementations (1, 2-update, 2-no-update) are single-function implementations, while the latter three implementations use two functions (separable convolution kernels).  That means that the seven schedules of the first three algorithms are different from the schedules of the latter three algorithms.  Nonetheless, it is clear that the choice of schedule varies widely with the implementation of the algorithm, as expected.

Separable kernels are faster

Gaussian 3x3 is a separable filter, which means that it can be expressed as the outer product of two vectors.  The number of computations in the non-separated Gaussian is roughly MxNx3x3 when applied to an input image of size MxN pixels.  For the separated version it is (MxN)x(3+3).  So fewer computations, and we would expect it to run faster.  And indeed the results show that the separated implementation is the fastest (of course, I could be wrong: it is possible that I have not found the optimal schedules).  This is expected, but disheartening.  It means that we should be thinking about optimizing our algorithm, not just the schedule.
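For the 3x3 Gaussian, the separation looks like this:

\[
\frac{1}{16}
\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}
=
\left(\frac{1}{4}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}\right)
\left(\frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \end{bmatrix}\right)
\]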

Inline reductions perform best

Functions gaussian_3x3_4 and gaussian_3x3_5 are the same except for how they do the accumulation.  Function gaussian_3x3_4 uses the Halide::sum inline reduction, while gaussian_3x3_5 accumulates over the domain using an update step.  The results speak for themselves: using the Halide::sum inline reduction provides almost 14-fold better results than using an update step (look at the best results in the rows for functions 4 and 5 in the table below).


Halide::Func gaussian_3x3_4(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) = sum(input(x+r.x, y, c) * k(r)) / 4;
    gaussian_y(x,y,c) = sum(gaussian_x(x, y+r, c) * k(r)) / 4;
   
    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;
}

Halide::Func gaussian_3x3_5(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) += input(x+r.x, y, c) * k(r) / 4;
    gaussian_y(x,y,c) += gaussian_x(x, y+r, c) * k(r) / 4;

    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;

}


Larger vector sizes perform better

OK, this one was predictable, but it's nice to see the empirical results.  And make sure your machine supports the vectorization you're trying out.


A sample set consisting of 50 measurements is usually too small

The data I present in the table below consists of 50 samples per test, but I noticed that sometimes there was variation in the results (the average of the 50 samples) between two test runs (of 50 samples each) which can't be explained by the standard deviation.  When I increased the sample set to 500 samples I got very stable results (I didn't try anything between 50 and 500 - laziness).

Really bad schedules have a really high sample variance

I need to understand why this happens.  I would expect the high variance to appear when thread parallelization is used at high granularity (frequent context switches), but it seems to work the opposite way.  I am missing something.

Choosing tile size is a delicate act

My workstation has 4x32KB 8-way set-associative L1 data caches and 4x256KB 8-way set-associative L2 caches.  Plenty of memory, no doubt.  The largest tile size with which I managed to achieve good performance is 256x32x4 = 32KB.  Remember that each 8-bit pixel channel value is expanded to 32 bits to prevent overflow, which is why I multiplied the tile size by 4.  This limits the vector sizes and also the tile sizes.  Now, if we also parallelize the tile computation, then Halide allocates several instances of the temporary tile buffer, and I suspect this is why I saw the optimal tile size peak at 256x32.
Finally, pay attention to the relationship between your tile (or split) sizes and the vector dimensions.  The dimension which you choose to vectorize should not be smaller than the vector size that you choose.  Once you set a size for that dimension, you can try several values for the second dimension, while keeping the size of the entire tile within the cache bounds.
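To illustrate (this is a hypothetical schedule of the kind explored above, not one of the seven from the table), a tiled, vectorized and parallelized schedule looks like this:

#include "Halide.h"

// Hypothetical schedule: 256x32 tiles (so the 32-bit intermediate stays
// within L1), 8-wide vectors on the inner x, and rows of tiles in parallel.
void tiled_schedule(Halide::Func gaussian) {
    Halide::Var x = gaussian.args()[0];
    Halide::Var y = gaussian.args()[1];
    Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

    gaussian.tile(x, y, xo, yo, xi, yi, 256, 32)
            .vectorize(xi, 8)
            .parallel(yo);
}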


Results of running different Gaussian 3x3 implementations with different schedules