Monday, March 13, 2006

So, as many of you probably already know, I’ve been brought onboard as a consultant with the Rosetta@Home project.  A big issue they were experiencing was random crashes when BOINC would notify the application that it was time to quit so another application could begin.

I believe I have found and fixed this style of bug, but alas only time and testing will tell.

To understand this bug I need to explain how things work with a science application.  When a science application starts and notifies BOINC that it supports graphics, three threads are created to manage what is going on.

The worker thread is the heavy lifter of the science application; it handles all the science.  The majority of the memory allocations and de-allocations happen in this thread.

The graphics thread is responsible for displaying the graphics window and for hiding and showing the window at BOINC’s request.

The timer thread is responsible for processing the suspend/resume/quit/abort messages from BOINC as well as notifying BOINC of trickles.

Now when the science application received the quit request it would call the C Runtime Library function called exit, which is supposed to shut down the application.  Part of this shutdown operation calls the Win32 API called ExitProcess.  ExitProcess would let the threads continue to run while cleaning up the heap, which is a holdout for letting DLLs decrement their ref counts and unload themselves if nobody else is using them.  Well, therein lies the problem: the worker thread was still running, trying to allocate and de-allocate memory from a heap that had been freed by ExitProcess.

This in turn would cause an access violation which shows up in the log file as 0xc0000005.

Science applications now have the option of requesting a hard termination which stops all executing threads and then cleans up after the process.  In essence the application calls TerminateProcess on itself.  What this also means is that the application has no chance of writing any more information to a state file or checkpoint file when the BOINC API hasn’t been notified that a checkpoint is in progress.  Use with care.  It also means that BOINC should no longer believe that a task is invalid from a random crash.

I believe this will take care of quite a few ‘crash on close’ style of bugs.  What was really annoying about this kind of bug is that it crashes in a different location each time.  Sometimes it would crash in the timer thread and sometimes in the worker thread.  A good chunk of the time the clients would report an empty call stack which doesn’t give us anything to work off of.

This style of bug would affect slower machines more than the faster machines.  The bug wouldn’t surface if the timer thread could finish all the CPU instructions needed from the time exit was called to the time ExitProcess actually kills the threads in one OS thread scheduling cycle.

I think Rosetta@Home hit this bug more often than most projects because of the amount of memory it allocates while doing its thing: 150 MB per process.  That was just enough to get it to happen on my machine if I left it running for 10 minutes with the graphics running.

It looks like both Einstein@Home and Rosetta@Home are going to be testing this out in the next few days.  I’m excited to see what this change does for the success rates of the tasks being assigned to client machines.

----- Rom
