A Wonderful Race Condition
A wonderful and educational race condition involving three independent threads of control, and three state variables. |
While working on FitNesse (http://www.fitnesse.org) this week I ran into an interesting and educational race condition. A race condition is when the order of two or more asynchronous events is both indeterminate and significant. To put it more simply, the program works fine if event A always happens before event B, but fails if event B happens before A.
We find race conditions most often in multi-threaded code in which the programmer has made certain assumptions about the timing of events. If those assumptions are incorrect, the events may occurr in the wrong order and cause the program to misbehave. Often the programmer's assumption is very close to being correct, and so the failure condition occurrs very infrequently and unrepeatably. This makes race conditions very hard to debug.
The race condition we found in FitNesse this week involved three different threads. We'll call them FitNesse, FitClient[?], and FitServer[?]. FitNesse sends packets of information to FitServer[?]. FitServer[?] processes those packets and sends them back to FitNesse. FitClient[?] is a mediator that monitors the protocol between the two. FitNesse sends N blocks to FitServer[?] by calling send() on FitClient[?]. After the Nth block FitNesse calls done() on FitClient[?]. FitServer[?] sends its response blocks back through a socket controlled by FitClient[?]. FitClient[?] reads the blocks and then forwards them on to FitNesse.
FitClient[?] knows when the session is complete when the number of blocks sent by FitNesse equals the number of blocks received from FitServer[?] and done() has been called, otherwise FitClient[?] client is not done, and it hangs a read on the FitServer[?] socket waiting for the next block. The code for this looks something like this:
while(!finishedReading())
{
readBlock();
processBlock();
}
private boolean finishedReading()
{
return isDoneSending && received == sent;
}
This code had been in the system for months; and though we had seen very occasional misbehavior, we had never traced it back to this code. Then one day some small thing changed, and we saw frequent failures of the communications between FitNesse and FitServer[?]. Can you see the problem?
The problem was that sometimes FitServer[?] would catch up to FitNesse so that sent == received. By itself this is not a problem, but it begs a question that we had not previously asked. "What should FitClient[?] do next?" The way it was written, FitClient[?] would try to read the next block. However, this assumes that there is another block coming. What if the next event is done()? In that case FitServer[?] isn't going to send another block, and FitClient[?]'s read will hang waiting for a block that never comes. FitServer[?] then closes the socket, and FitClient[?] throws an exception because the socket unexpectedly terminated.
A quick fix to this problem is shown below. In short, we consider (sent==received && !done) to be indeterminate, and to wait until either sent > received or done() is called.
private boolean finishedReading()
{
while (stateIndeterminate())
shortSleep();
return isDoneSending && received == sent;
}
private boolean stateIndeterminate()
{
return (received == sent) && !isDoneSending;
}
The shortSleep might not be an ideal solution. It might be better to set up a wait/notify signal. However, I'm not convinced that I understand all the race conditions that might occurr. Might I call wait just after the appropriate notify? Would that cause me to wait forever? Would I have to use synchronized methods, or some kind of semaphore to protect against that?
So I guess I'll leave the solution as you see it until I understand things better. The current structure has the benefit that it doesn't need any synchronized methods. (at least I don't think it does.) The sent and receieved variables increment upwards, and the isDoneSending flag only goes from false to true. So I don't think there's any way for the finishedReading() function to do the wrong thing.
Finding and fixing this bug took many hours. FitNesse is a project that is being developed using Test Driven Development (TDD), and has therefore enjoyed very few such expensive bugs. The fact that we had a large number of unit tests and acceptance tests helped a lot in tracing down this bug. We were eventually able to maneuver an acceptance tests, and later a unit test into reliably failing. From there we simply did experiments in the code to see what changed and what didn't. Some of the changes we confidently made (i.e. inserting various sleeps and yeilds) caused our immediate symptom to go away, but also caused other parts of the system fail in unforseen ways. Without our suite of tests we might have assumed that we had fixed the bug without realizing that we had broken other parts of the system, and that we had not actually found the real problem, but had just masked it.
- Sep 2, 2004, Volker Zink:
The URI to TrackBack[?] this entry is: http://www.butunclebob.com/wp-trackback.php/8
Why not just let the method finishedReading as it is, but instead do something like
while(!finishedReading())
{
if (received == sent)
//done() is missing
{wait}
else //not all blocks are in
{
readBlock();
processBlock()
}
}
Volker, the received == sent condition does not necessarily mean that ‘done() is missing’. It may also mean that another block is about to be sent, and that we are nowhere near being done. –UB
- Sep 2, 2004, Chris Uppal (Email)
On Thu, 2 Sep 2004 10:47:15 +0100, “Chris Uppal” “>.uppal@metagnostic.REMOVE-THIS.org> wrote:
>Robert C. Martin wrote:
>
>> Me too! Yesterday must have been “nasty bug” day. You can see my
>> description on http://butunclebob.com in the “A Wonderful Race
>> Condition” article.
>
>quote from that page:
>
>——————-
>So I guess I’ll leave the solution as you see it until I understand things
>better. The current structure has the benefit that it doesn’t need any
>synchronized methods. (at least I don’t think it does.)
>——————-
>
>Eek! Maybe I’m misunderstanding the context you are working in, but if any
>programmer under my supervision said something like that, then it’d be “come
>into my office for a quick chat, please, now". If the programmer were a
>beginner then the chat would be about how to write concurrent code, and what
>his/her training needs were. If the programmer were senior, then the chat
>would be about the advantages of a career in management, or maybe testing…
>
>In Java there are only two possibilities.
>
>1) /All/ the state shared by more than one thread is declared volatile.
>2) You use synchronisation properly.
I’ve tried the career in management, thank you – several times. I gave it up to return to programming, and then I started my own company which I now run. (Ironic Sigh).
Declaring the three variables to be volatile is probably “correct". (Actually only ‘isDoneSending’ and ’sent’ need to be volatile since ‘recieved’ is read in the thread that changes it.) So I did that, and the 795 unit tests and 101 acceptance tests still pass. So declaring the variables volatile did not appear to break anything. Nor did it seem to improve anything. Indeed, there seemed to be no change in behavior at all.
Given only your two possibilities above, this must mean that the program was already “properly” synchronized. This is probably the case since every iteration of the wait loop is punctuated by a ’sleep’. What little reseach I have done on this topic suggests that non-volatile variables get “written-through” by a wait, notify, or synchronization, and I imagine that a sleep uses at least one of those.
Still, declaring the variables to be volatile is the right thing to do, if only for documentation purposes, so thank you for that bit of advice.
Add Child Page to AwonderFulRaceCondition