Sunday, October 27, 2013

The US Constitution and Meyer's Open/Closed Principle

While trying to explain Meyer's Open/Closed Principle to a friend, I scratched my head searching for a real-world example that illustrates the principle: one that would be hard to dispute and easy to grasp.

On my way home from work the news reported on the NSA's latest shenanigans (this time it was spying on German Chancellor Angela Merkel).  My thoughts drifted and I contemplated the US Constitution.

Some facts on the Constitution of the United States (source):
  • It went into effect on March 4, 1789
  • It has been amended twenty-seven times
  • The Bill of Rights (the first 10 amendments) was ratified on December 15, 1791
  • The list of all 27 amendments is worth reviewing and of particular interest are amendments 18 and 21 ('git revert', anyone?)
Imagine that! 224 years: from 13 states to 50; one Civil War, two World Wars, and countless other wars; the invention of the light bulb; radio and television; labor laws; the civil rights movement; the Great Depression; the lunar landing; Roe v. Wade; 9/11.  And on it goes - with only 27 amendments.
Damn!  Tell me that ain't cool.

The US Constitution is perhaps the ultimate, time-tested example of Meyer's Open/Closed Principle: open for extension, but closed for modification.

It is also worthwhile to reflect on the procedures for amending the constitution:

Before an amendment can take effect, it must be proposed to the states by a two-thirds vote of both houses of Congress or by a convention (known as an Article V convention) called by two-thirds of the states, and ratified by three-fourths of the states or by three-fourths of conventions thereof, the method of ratification being determined by Congress at the time of proposal. To date, no convention for proposing amendments has been called by the states, and only once—in 1933 for the ratification of the twenty-first amendment—has the convention method of ratification been employed.


As software architects and designers, perhaps we should build similar protections against perpetual refactoring of production-quality code.  No, I don't mean that in the literal sense, but I do advocate investing the time to excavate an existing architecture to uncover its governing principles, and to understand how it can be extended while preserving those principles.
Maybe we'll end up with software as durable as the US Constitution.

Saturday, October 19, 2013

Android Synchronization Fences – An Introduction

In any system that exchanges buffers between independent buffer Producers and buffer Consumers, there is a need for a policy that controls buffer lifetimes (allocation/deallocation) and a policy that controls access to the buffer memory (read/write).  A third entity, the buffer Allocator, is in charge of providing access to the system memory and implementing buffer lifetime maintenance (a “dead” buffer cannot be accessed by any entity except the Allocator, while a “live” buffer may be used by entities other than the Allocator).  The C language's malloc/free functions are an example of an Allocator.  In a way, the buffer lifetime control policy is really another form of buffer access control.  The buffer access control policy determines whether the Producer or the Consumer may access the buffer memory at any given time, typically in a mutually exclusive manner.



The Android Fence abstraction is a mechanism that implements a particular buffer access control policy; it does not deal with buffer lifetime control (allocation/deallocation).  It supports both a 1:1 Producer-to-Consumer relationship and a 1:many Producer-to-Consumers relationship.  Fences are external to buffers (i.e. they are not part of the buffer structure) and synchronize the hand-off of buffer ownership (access control) from Producer to Consumer(s), or vice versa.
It is particularly important to understand that in situations where Android mandates the use of Fences, it is not sufficient for a Consumer to hold a pointer to the buffer memory - even when the Producer provides that pointer explicitly.  The Fence must also permit the Consumer to access the buffer memory, for either read or write access, depending on the situation.
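
As a minimal illustration (a sketch only: the buffer pointer, its length, and the process_buffer helper are hypothetical, and error handling is reduced to a bail-out), a Consumer using the userspace libsync API would wait on the fence before touching memory it already has a pointer to:

#include <stddef.h>
#include <sync/sync.h>    /* libsync: sync_wait() */

/* Hypothetical helper that reads the buffer contents */
void process_buffer(void *buf, size_t len);

/* The Consumer already holds a pointer to the buffer memory, but it must
 * not read that memory until the acquire fence signals. */
void consume_buffer(void *buf, size_t len, int acquire_fence_fd)
{
    /* Block for up to 1000 ms waiting for the Producer's fence */
    if (sync_wait(acquire_fence_fd, 1000) < 0)
        return;   /* timeout or error: the buffer is not yet safe to read */

    process_buffer(buf, len);   /* read access is now permitted */
}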

Timelines, Synchronization Points and Fences


To fully understand Android fences, beyond their use in the Camera subsystem, you need to become familiar with Timelines and Synchronization Points.  The kernel documentation (linux/kernel/Documentation/sync.txt) is the only source of information on these concepts that I could find, so instead of rephrasing it, I quote it here in full:
Motivation:

In complicated DMA pipelines such as graphics (multimedia, camera, gpu, display)
a consumer of a buffer needs to know when the producer has finished producing
it.  Likewise the producer needs to know when the consumer is finished with the
buffer so it can reuse it.  A particular buffer may be consumed by multiple consumers which will retain the buffer for different amounts of time.  In addition, a consumer may consume multiple buffers atomically.
The sync framework adds an API which allows synchronization between the
producers and consumers in a generic way while also allowing platforms which
have shared hardware synchronization primitives to exploit them.

Goals:
                * provide a generic API for expressing synchronization dependencies
                * allow drivers to exploit hardware synchronization between hardware blocks
                * provide a userspace API that allows a compositor to manage dependencies.
                * provide rich telemetry data to allow debugging slowdowns and stalls of the graphics pipeline.

Objects:
                * sync_timeline
                * sync_pt
                * sync_fence

sync_timeline:

A sync_timeline is an abstract monotonically increasing counter. In general, each driver/hardware block context will have one of these.  They can be backed by the appropriate hardware or rely on the generic sw_sync implementation.
Timelines are only ever created through their specific implementations
(i.e. sw_sync.)

sync_pt:

A sync_pt is an abstract value which marks a point on a sync_timeline. Sync_pts have a single timeline parent.  They have 3 states: active, signaled, and error.
They start in active state and transition, once, to either signaled (when the timeline counter advances beyond the sync_pt’s value) or error state.

sync_fence:

Sync_fences are the primary primitives used by drivers to coordinate synchronization of their buffers.  They are a collection of sync_pts which may or may not have the same timeline parent.  A sync_pt can only exist in one fence and the fence's list of sync_pts is immutable once created.  Fences can be waited on synchronously or asynchronously.  Two fences can also be merged to create a third fence containing a copy of the two fences' sync_pts.  Fences are backed by file descriptors to allow userspace to coordinate the display pipeline dependencies.

Use:

A driver implementing sync support should have a work submission function which:
     * takes a fence argument specifying when to begin work
     * asynchronously queues that work to kick off when the fence is signaled 
     * returns a fence to indicate when its work will be done.
     * signals the returned fence once the work is completed.

Consider an imaginary display driver that has the following API:
/*
 * assumes buf is ready to be displayed.
 * blocks until the buffer is on screen.
 */
    void display_buffer(struct dma_buf *buf);

The new API will become:
/*
 * will display buf when fence is signaled.
 * returns immediately with a fence that will signal when buf
 * is no longer displayed.
 */
struct sync_fence* display_buffer(struct dma_buf *buf,
                                 struct sync_fence *fence);


The relationships between the objects described above are depicted in the diagram below.


Android Fence Implementation Details 

User-space code can choose between a C++ fence implementation (using the Fence class) and a C code library implementation.  The C++ implementation is just a lean wrapper around the sync C library code, and the C library does little more than invoke ioctl system calls on a kernel device implementing the synchronization API.
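
To give a feel for how thin these layers are, here is a rough sketch of what the C library calls boil down to (modeled loosely on system/core/libsync/sync.c; error handling is omitted and the exact structure layout and header paths may differ between trees, so treat this as an approximation):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <sw_sync.h>   /* sw_sync ioctl definitions; the header path varies by tree */

/* A timeline handle is simply a file descriptor on the sw_sync device */
int sw_sync_timeline_create(void)
{
    return open("/dev/sw_sync", O_RDWR);
}

/* Advance the timeline counter; the kernel signals any sync points reached */
int sw_sync_timeline_inc(int timeline_fd, unsigned count)
{
    __u32 arg = count;
    return ioctl(timeline_fd, SW_SYNC_IOC_INC, &arg);
}

/* Create a fence carrying a single sync point at 'value' on the timeline.
 * The fence is returned as another file descriptor. */
int sw_sync_fence_create(int timeline_fd, const char *name, unsigned value)
{
    struct sw_sync_create_fence_data data = { .value = value };

    strncpy(data.name, name, sizeof(data.name) - 1);
    if (ioctl(timeline_fd, SW_SYNC_IOC_CREATE_FENCE, &data) < 0)
        return -1;
    return data.fence;
}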

The Android kernel includes the ‘sync’ module, also known as the synchronization framework, which implements the Timeline, Fence, and Synchronization Point infrastructure.  This module can be leveraged by hardware device drivers which choose to implement the synchronization API. 

The kernel also includes a software timeline device driver (/dev/sw_sync) which implements a software-based timeline that is not tied to any specific hardware module.  The SW timeline device driver builds on the kernel’s Synchronization framework.

Understanding the Synchronization API

The first step in using the Synchronization API in user-space is creating a timeline handle (file descriptor).  The sample call flow below shows how the userspace C library creates a handle to an instance of the generic software timeline (sw_sync) using function sw_sync_timeline_create.


After the timeline is created, the user can arbitrarily increase the timeline counter (sw_sync_timeline_inc) or create fence handles (sw_sync_fence_create).  Each fence initially contains one synchronization point on the timeline. 



If the user needs two or more synchronization points attached to a fence, he creates more fences and then merges them together (sync_merge).

// Create a generic sw_sync timeline
int sw_timeline = sw_sync_timeline_create();

// Create two fences on the sw_sync timeline; at sync points 2 and 5
int sw_fence1 = sw_sync_fence_create(sw_timeline, "fence1", 2);
int sw_fence2 = sw_sync_fence_create(sw_timeline, "fence2", 5);

// Merge sw_fence1 and sw_fence2 to create a single fence containing
// the two sync points
int sw_fence3 = sync_merge("fence3", sw_fence1, sw_fence2);

 

The kernel Synchronization API (for in-kernel modules) is similar, but synchronization points need to be created explicitly:

// Create a generic sw_sync timeline
struct sw_sync_timeline *timeline = sw_sync_timeline_create("some_name");

// Create a sync_pt on the timeline (the value 5 here is an arbitrary example)
struct sync_pt *pt = sw_sync_pt_create(timeline, 5);

// Create a fence attached to a sync_pt
struct sync_fence *fence = sync_fence_create("some_other_name", pt);

// Attach a file descriptor to the fence
int fd = get_unused_fd();
sync_fence_install(fence, fd);

Using Fences for Synchronization

Recall that the timeline abstraction represents a monotonically increasing counter, and synchronization points represent specific future values of this counter (points on the timeline).  How a timeline increases (its clock rate, so to speak) is timeline specific.  A GPU, for example, may use an internal counter interrupt to increase its timeline counter.  The generic sw_sync timeline is increased manually by the Synchronization API client when it invokes sw_sync_timeline_inc.  The meaning of the synchronization point values, and the method by which two synchronization points are compared to one another, are also timeline specific.  The sw_sync device models simple points on a line.  Whenever the Synchronization framework is notified of a timeline counter increase, it tests whether the counter has reached (or passed) the timeline value of existing synchronization points on that timeline and triggers wake-up events on the relevant fences.
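
Here is a minimal sketch of these semantics using the same userspace calls as before (in practice the waiter and the signaller live in different threads or processes; they are shown sequentially here, and error handling is omitted):

// Create a sw_sync timeline (counter starts at 0) and a fence whose single
// sync point sits at value 2 on that timeline
int sw_timeline = sw_sync_timeline_create();
int sw_fence = sw_sync_fence_create(sw_timeline, "fence", 2);

// The counter (0) has not reached the sync point (2), so a zero-timeout
// wait returns an error: the fence is still active
int not_yet_signaled = sync_wait(sw_fence, 0);

// Advance the timeline by 2; the counter reaches the sync point's value,
// the framework signals the fence, and any waiters are woken up
sw_sync_timeline_inc(sw_timeline, 2);

// The fence is now signaled, so this wait returns immediately with success
int signaled = sync_wait(sw_fence, 0);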

Userspace clients of the Synchronization framework that want to be notified (signaled) about fence state change use the sync_wait API.  Kernel clients of the Synchronization framework have a similar API, but also have an API for asynchronous fence state change notification (via callback registration).
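
For kernel clients, here is a hedged sketch of the asynchronous path (based on the sync_fence_waiter API declared in drivers/staging/android/sync.h; consider it an illustration of the pattern rather than a complete driver):

#include "sync.h"   /* drivers/staging/android/sync.h */

/* Callback invoked by the Synchronization framework when the fence signals */
static void work_fence_signaled(struct sync_fence *fence,
                                struct sync_fence_waiter *waiter)
{
    /* Kick off the deferred work here, e.g. schedule a workqueue item */
}

static struct sync_fence_waiter work_waiter;

static int queue_work_after_fence(struct sync_fence *fence)
{
    sync_fence_waiter_init(&work_waiter, work_fence_signaled);

    /* Returns 0 if the waiter was queued, 1 if the fence has already
     * signaled (the callback will then not be invoked), or a negative
     * error code */
    return sync_fence_wait_async(fence, &work_waiter);
}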

When userspace closes a valid sync_timeline handle, the Synchronization framework checks if it needs to signal any active fences which have synchronization points on that timeline.  Closing a fence handle does not signal the fence: it just removes the fence’s synchronization points from their respective timelines.


Userspace C++ Fence Wrapper

  • ./frameworks/native/libs/ui/Fence.cpp
  • ./frameworks/native/include/ui/Fence.h               
Userspace C Library

  • ./system/core/libsync/sync.c 
Kernel Software Timeline

  • ./linux/kernel/drivers/staging/android/sw_sync.h
  • ./linux/kernel/drivers/staging/android/sw_sync.c
  • ./external/kernel-headers/original/linux/sw_sync.h 
Kernel Fence Framework

  • ./external/kernel-headers/original/linux/sync.h
  • ./linux/kernel/drivers/staging/android/sync.h
  • ./linux/kernel/drivers/staging/android/sync.c

Saturday, April 13, 2013

Broken windows will turn your code to spaghetti


Software engineers don't need to read "AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis" to know what "spaghetti code" means.  But what does it have to do with broken windows?

Enough has been written on the topic of the Broken Windows Theory and its relation to software development so there is no reason for me to repeat this yet again.  If this is your first introduction to the topic, just Google "Broken Windows Software" and you'll get plenty of background information and opinions on the topic.
I started thinking about the connection between the Broken Windows Theory and spaghetti code some time after hearing former New York mayor Rudy Giuliani talk about the cleaning up of NYC's streets.  I was a software team leader at the time, and I formed my private theory based on empirical observations of my team and others around the company.  I could see the spaghetti start cooking whenever we loosened our standards of practice, either because of pressure to deliver or just plain sloth.  It takes discipline, time, and energy to fix every broken piece of code, design, or documentation, and this is never easy.

Methodologies are easier to follow if you start from a place of conviction, so I began sharing and discussing my "theory" with the team.  Not everyone bought into the story, but years later I still believe (and practice) that constant refactoring of the code-base is essential to prevent "chaos creep" and eventual spaghetti code.  It is interesting to note, however, that today there are many who believe Giuliani was wrong and that Broken Windows had nothing to do with the reduction in NYC's crime rate during the '90s.

It is the engineer's job to fix broken windows, but whose responsibility is it to make sure that broken windows are fixed?  If you're a manager, then it's yours.  In everything you do - how you prioritize tasks, how you treat documentation, how you treat your own broken windows, and not least, how you reward engineers with a knack for clean, shiny windows - you determine what kind of code base you end up with.  Having guidelines, coding standards, designs, and architectures to follow is essential for telling broken from fixed, but it is not sufficient.

And the higher up you are in the management ladder, the broader the consequences of your values, your attitude, and your actions - on everything your engineers do, including fixing broken windows.  So I try to follow these guidelines:
  • Tell the team that a clean code-base is important to me.  Again.  And again.  And again.  In fact, most messages to the team need to be repeated over and over; this is especially true when you want to change team behavior patterns until new habits form.
  • Show my team that a clean code-base is important to me by rewarding engineers who exhibit extra care for the code base.  "Rewarding" has many manifestations, but if this is done in a "public" manner I increase the impact of my message because the entire team is made aware.  Backing my code-base "gate keepers" against internal and external "sources of entropy" is very important as it solidifies trust and mutual respect.
  • Point out broken windows as I review code, git commit messages, documentation, presentations and designs.
Many times when a broken window is brought to my attention as a manager, I cannot afford to halt development to immediately attend to the fix.  There are customers waiting at the end of the rainbow and they have schedules and priorities.  But ignoring or dismissing a broken window brought to my attention will have consequences on how the development team perceives my values, so I try to:
  • Never ignore an engineer who reports a broken window.  I either add it to the team's task burn-down list or file a bug report, as a minimum first stage of showing intention.  It is important to make the engineers part of the discussion of how and when to do the fix, since ultimately I want them to be accountable for the quality of the code base (shifting accountability from the team leader to the team members deserves its own post which I hope to write some time).
  • Schedule a percentage of the team's time to specifically handle broken windows.  Adding this to the work plan helps fight the temptation to cave under the daily work pressure. This also shows the engineers that broken windows are not buried and ignored in some list or database and gives credibility to the act of deferring fixes to a later stage.
Of course, not all code bases are alike.  If your code base is short-lived because you are in an exploratory phase, a start-up in its seed stage, or producing a one-time demo, then your energy should be focused elsewhere.  This is where the saying "First things first, second things never" is most applicable.  But when building long-lasting code-bases, these "second things" can wreak havoc if ignored for too long.  Fixing every small issue that threatens the integrity of your code is hard work for your engineers, but it starts, and continues, with the hard work that you - the manager - put in.

Sunday, March 24, 2013

The Innovator's Dilemma?


The Innovator's Dilemma - I watch the dilemma play out, and live it, every day as the company I work for tries to recover from a disruption that took us by a surprise reserved only for the disrupted.  A blind spot that plagues the best of companies, as this book describes very well.
As I leafed through the pages of The Innovator's Dilemma today, a book I had read before it all became so personal, I asked myself these questions:
  • Anecdotal evidence shows that change is happening faster than ever, and adaptation to these changes needs to be at least as fast.  If we fail to innovate, we should at least have the capacity to identify innovative trends and quickly realign to them.  It is interesting to study what structural, psychological, and cultural qualities characterize R&D groups that rise to the challenge of responsive innovation while still catering to the needs of existing customers and products.
  • As the excerpt below so honestly describes, when innovation is not in your DNA (or company mission), R&D middle management plays a crucial role in unknowingly supporting or hampering innovation. How do we align the low-level decisions with the strategic decision to keep innovation a priority?
  • As we manage our own professional careers, aren't we exposed to the same forces and circumstances that can "cause great firms' employees to fail" (if I may paraphrase the title)?  Is our private disruption around the corner?  Are we correctly spotting our personal disruptive threats and opportunities?
"As we saw in chapter 4, resource allocation is not simply a matter of top-down decision making  followed by implementation. Typically, senior managers are asked to decide whether to fund a project only after many others at lower levels in the organization have already decided which types of project proposals they want to package and send on to senior management for approval and which they don’t think are worth the effort. Senior managers typically see only a well-screened subset of the innovative ideas generated.
And even after senior management has endorsed funding for a particular project, it is rarely a “done deal.” Many crucial resource allocation decisions are made after project approval—indeed, after product launch—by mid-level managers who set priorities when multiple projects and products compete for the time of the same people, equipment, and vendors. As management scholar Chester Barnard has noted: 
From the point of view of the relative importance of specific decisions, those of executives properly call for first attention. But from the point of view of aggregate importance, it is not decisions of executives, but of non-executive participants in organizations which should enlist major interest. 
So how do non-executive participants make their resource allocation decisions? They decide which projects they will propose to senior management and which they will give priority to, based upon their understanding of what types of customers and products are most profitable to the company. Tightly coupled with this is their view of how their sponsorship of different proposals will affect their own career trajectories within the company, a view that is formed heavily by their understanding of what customers want and what types of products the company needs to sell more of in order to be more profitable. Individuals’ career trajectories can soar when they sponsor highly profitable innovation programs. It is through these mechanisms of seeking corporate profit and personal success, therefore, that customers exert a profound influence on the process of resource allocation, and hence on the patterns of innovation, in most companies."