Thursday, August 31, 2006

What is up? It seems like there is some major disconnect between what David, I, GridRepublic, BAM, WCG, and the CPDN/BBC partnership have been up to and what the community believes is going on.

I don't believe there are all that many people who actually believe that the current interface to BOINC is the best piece of software in the world and can be used by everybody on the planet. As a matter of fact, more often than not, we get feedback along the lines of "I wish it was as easy to use as S@H classic was."

The biggest problem brought up at the First Pangalactic BOINC Workshop from the projects was the attach to project process.

For those who are relatively new out there, BOINC used to require a project URL and an authenticator in order to attach to a project.  Authenticators are a string of 128 randomly generated characters.  These were sent to you via email and had to be typed into the attach to project dialog, which didn't describe where that information came from or where to get it.

The authenticator turned out to be such a big problem that S@H didn't want to complete the transition to BOINC until a better solution was in place.

There are two parts here:

  • S@H classic was a single executable with hard coded IP addresses, so the participants only needed to know their username to start processing data.
  • The user interface didn't allow you to do anything, so you really couldn't get into trouble.

In order to address these usability issues we first had to figure out what our goals ought to be.  Here is a brief outline of our goals:

  • The stock BOINC software should not play favorites among the projects.
  • The stock BOINC software should not depend on any server for normal operation.
  • The BOINC platform needs to be flexible enough for all research and business needs.
  • Find a way to simulate the simplicity of S@H classic's attach process so the participant only needs to provide something easy to remember.
  • Find a way to simplify the interface.

One of the first proposals we talked about for Windows was a control panel applet in which you could just check the projects you wanted to participate in. It had several problems. Where was it going to get the list of projects? What happens if that server no longer exists? How do future projects get on the list?

Matt, of GridRepublic, approached us in late 2004 and wanted to see if he could help further BOINC's adoption. He was especially interested in attracting the millions of users who might not otherwise join a distributed computing project because of ease of use problems. After a few sessions of brainstorming, Matt and David started to put together the concept of a web-based account management system.

At the time I was finishing up the last of the 4.x line of client software and helping E@H launch.

Support for the initial draft of the account management system began in early 2005 after the launch of E@H.  By this time David and Matt had started having regular phone conversations about feature sets and possible delivery times of features within the BOINC client software and the GridRepublic account manager.  GridRepublic's design goal became "provide an attractive, simple, and easy-to-use interface so that anybody could attach to any GridRepublic supported project with a click of a button."  An offshoot of that was the need for a branded client so that participants didn't get confused about what they had installed on their computer.

BOINC? What's BOINC? I wanted to install this GridRepublic software to help fight cancer.  Did I just get a piece of spyware? malware?

So the GridRepublic branded version of the BOINC client software was born to help reduce confusion for new participants. This had the added advantage that the GridRepublic client could be directly tied to the GridRepublic website without hurting the stock application. Usability studies, aren't they grand? 

WCG approached us in early 2005 and asked if we would be willing to collaborate in allowing the BOINC client software to run against the WCG servers.  They were getting quite a bit of feedback from members of the distributed computing community who wanted to run WCG plus a few other BOINC projects.  In turn they would help us come up with a simpler GUI and possibly other features that would be needed for corporate deployments.  By mid-2005 IBM legal gave WCG the go ahead and work began in earnest.

In November 2005, WCG launched the BOINC compatible interface to WCG.  The amount of positive feedback they got was fantastic; the demand for a Windows based application was so high that they had to reschedule some other work items to get the Windows version out the door.

In December 2005, we were approached by CPDN and the BBC to help put together a BBC branded BOINC client to run an experiment that the BBC was going to document.  Working with the BBC was an experience; the visuals changed rather regularly, but they did get better.  We managed to complete the branded client by the deadline for the first part of the TV documentary in February 2006.  The BBC did a few usability studies by pulling people off the street.  A few things came from that, including renaming "credit" to "work done", as some of the people polled were concerned that their credit card might be involved.

May 2006, Willy de Zutter of BOINCStats fame launched BAM.  BAM largely flew under my radar as I don't recall Willy asking for any client-side changes.  I do recall a few bug reports being posted to boinc development though.

June 2006, WCG told us that they were ready to begin the simple GUI work; they had recently hired an internal developer who would do the work.  Later that month we had mock-ups sent to us via email and were giving them feedback on what we thought.  They started writing code at the beginning of this month.

Collaboration continues between BOINC, GridRepublic, BAM, WCG, and the community with tweaks to the UI to make things more easily understood by new participants. Account management systems continue to evolve as well. I believe the ultimate goal of the system is to provide a single remote interface to all of your BOINC based clients: suspending and resuming tasks, projects, and clients; changing preferences for clients; attaching clients to and detaching them from various projects.  In short, anything you can do in the advanced GUI you could do via an account manager.

As far as the whole branding thing goes, I tend to look at the BOINC technology stack in a layered way.  BOINC sits at the very bottom, projects sit on top of BOINC, and account managers sit on top of projects. Account managers need to be able to differentiate themselves from one another.  Right now that consists of website design and client graphics.  That doesn't make them any less BOINC enabled than if they kept the stock graphics or website design.

Another analogy in the software world with branding is Linux.  The Linux kernel sits at the bottom, applications sit in the middle, and distros sit at the top.  Does Fedora get lambasted for changing the "start" button from the default KDE/Gnome button to a red hat?  Would Red Hat get beat up if they said "use our software, x number of people can't be wrong," where x is the total number of Linux users worldwide, when the target audience for the advertisement is Windows users?

I think the hostility towards WCG and GridRepublic is unwarranted and harmful to BOINC overall. From where I'm standing, everybody is being a good citizen. I do not believe anybody is out to do BOINC or the community harm, so if you see a problem just drop them or me an email.  Everything will get sorted out.

----- Rom

Thursday, July 6, 2006

Somebody pointed out a thread to me on E@H:
http://einstein.phys.uwm.edu/forum_thread.php?id=4480

I have to say that I'm a little shocked at some of the attitudes I've seen from some of the participants.

First let me clear up some misunderstandings about what validators and assimilators for a BOINC server cluster are supposed to do.  Validators only check to make sure there is agreement between the machines that have crunched the same workunit.  If all of the machines agree on what the numbers are then the results are considered valid and flagged for assimilation.  Assimilators just copy the result data from the BOINC database/file system to the project's internal database for analysis.  Only after assimilation does a result have meaning in the context of the project's goal; prior to that it is a collection of numbers and BOINC doesn't have a clue if they are correct or not.

Projects are free to add additional logic to their validators and assimilators to try and weed out incorrect results, but to some degree it is still just a guess.  If they already know what the correct answer is then they would not have needed to send out the work to begin with.
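As a rough illustration of the validator's job described above, here is a minimal sketch in Python.  The function name, the quorum parameter, and the use of exact equality are all assumptions made for the example; this is not the BOINC validator API.

```python
from collections import Counter

def validate(results, quorum=2):
    """Return the result that at least `quorum` hosts agree on, else None.

    Hypothetical helper: each result is a tuple of numbers returned by one
    host for the same workunit, and "agreement" here means exact equality.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Agreement only says the hosts produced the same numbers.  It says
    # nothing about whether those numbers are scientifically correct.
    return value if count >= quorum else None
```

Note that a result passing this kind of check is only "valid" in the sense that the hosts agree with each other.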

For projects that are searching for something, their results can be broken down into two camps: something that needs further investigation, and background noise.  What separates something that needs further investigation from something that is background noise?  There is some value or a set of values in the result files that exceeds one or more thresholds.  Some thresholds may also have a cap on them, in which case an interesting value or set of values has to fall between the threshold and its cap.  We can then refer to the lower and upper bound of a threshold as a threshold window.  Those thresholds are typically calibrated against the default client a project sends out.  Tests are run against the default client using special workunits that contain various samples of data that expose what the application is looking for so the scientists can make sure the client is working like it is supposed to.
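To make the idea of a threshold window concrete, here is a small sketch.  The class name, the labels, and the numbers in the usage note are made up for illustration and are not taken from any project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThresholdWindow:
    """Hypothetical threshold window: a lower bound plus an optional cap."""
    lower: float
    upper: float = float("inf")  # no cap unless the project sets one

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper

def classify(value: float, window: ThresholdWindow) -> str:
    """Label a value as worth a closer look or as background noise."""
    return "investigate" if window.contains(value) else "noise"
```

With an invented window of ThresholdWindow(24.0, 30.0), classify(24.000001, window) gives "investigate" while classify(23.999999, window) gives "noise", so a difference in the sixth decimal place is enough to change which camp a result lands in.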

So now the crux of the problem: changing instruction sets for an application can and will change the level of precision of the data returned to the project.

Optimized SSE/SSE2/SSE3/3DNow applications change how the mathematical operations are performed vs. an unoptimized application.  Now whether that adversely affects the project totally depends on how the project handles data types internally.  If a project doesn’t release the source code or test workunits for their application, then somebody optimizing the application with a disassembler or hex editor is making an assumption about how calculations are being performed and what they can do to optimize it.  If they are wrong then something might be flagged as noise when it should have been flagged as needing to be investigated.  What if something is missed because the thresholds are geared for a different range of values than what the optimized application is producing?

SSE/SSE2/SSE3/3DNow instruction sets use 128-bit registers while the original x87 FPU uses 80-bit registers.  Now most programming languages store floating point numbers as either 32-bit single precision floats or 64-bit double precision floats.  Quite a bit of the performance improvement that these new instruction sets provide comes from packing multiple numbers into a register and then performing mathematical operations on them in a matrix style fashion.  So you could fit four single precision floats, or two double precision floats, into a single 128-bit register.  Depending on the instruction the result may be bounded to 32 bits, 64 bits, or 128 bits.  That means in the worst case scenario an optimized application is rounding a computation either higher or lower than the original application.
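Here is a small sketch of how packing can change an answer.  It does not run any SSE instructions; it only simulates 32-bit rounding with Python's struct module and compares adding numbers one at a time against adding them in four independent "lanes" that are combined at the end.  The function names and sample values are made up for the example.

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_sequential(values):
    """Add the values one after another, rounding to 32 bits each time."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def sum_four_lanes(values):
    """Keep four running sums, as a packed register would, then combine."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 4] = f32(lanes[i % 4] + f32(v))
    return f32(f32(lanes[0] + lanes[1]) + f32(lanes[2] + lanes[3]))

values = [1e8, 1.0, -1e8, 1.0]       # the exact sum is 2.0
print(sum_sequential(values))        # 1.0
print(sum_four_lanes(values))        # 0.0
```

Neither order gets the exact answer of 2.0, and the two orders disagree with each other, even though both are "correct" 32-bit arithmetic.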

You might be thinking, why don't projects just enlarge the threshold window so that those small rounding errors can get through?  Some of them have, but others still need to investigate how using different instructions affects the system overall.  A few of the science applications perform calculations on the result of previous calculations over and over again.  How large would the threshold window have to be if the calculations on previous calculations happened 1,000,000 or 10,000,000 times?
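To give a feel for how rounding builds up over repeated calculations, here is a toy experiment: add 0.1 to a running total 1,000,000 times, once rounding every step to 32 bits and once in 64 bits.  It is only a sketch with made-up names, and real science applications do far more than repeated addition.

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    When `single_precision` is true, both the step and every intermediate
    total are rounded to 32 bits, imitating single precision arithmetic.
    """
    if single_precision:
        step = f32(step)
    total = 0.0
    for _ in range(count):
        total = total + step
        if single_precision:
            total = f32(total)
    return total

single = accumulate(0.1, 1_000_000, single_precision=True)
double = accumulate(0.1, 1_000_000, single_precision=False)
print(single - 100000.0, double - 100000.0)
```

The double precision total stays within a tiny fraction of 100,000, while the single precision total drifts away from it by a large margin, which is why a window wide enough to absorb the error of one operation may not be anywhere near wide enough after a million of them.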

Here is an example of two different Intel SSE CPU instructions (one for working on packed data, and the other one using the whole register) on the same processor producing different results:
http://softwareforums.intel.com/ISN/Community/en-US/forums/thread/5484332.aspx

Note, that was using the Intel IPP library.  That is how easy rounding problems can be introduced when optimizing.

For those who are quick to say "by using optimized applications I'm doing more science because I can process workunits faster," my response is:
Only if the project's backend databases and tools are equipped to deal with the differences; otherwise something might be missed.  If you processed a workunit but sent back numbers outside the target threshold windows, have you really helped the project?

Another common thing I've seen is "I've run the standard application and the optimized application across x number of workunits I've been assigned and they produced the same result files, so the optimized application must be good in all scenarios."  My response is:
What that really means is that no rounding issues occurred with the workunits you had access to.  Without the test workunits a project uses internally you really don't know if you covered all your bases.

The good news in all of this is that the projects are listening and are working with the optimizers to incorporate the needed changes into the projects' default applications.  Please be patient during the transition though; it is going to take a bit of time to double check everything and make sure it is all in working order.

In case you are curious, I do not use any optimized clients on any of my machines.  To me the science applications are big black boxes; I don't know enough about what they do under the hood to smartly make changes for the better.  I'll wait for optimization changes to be released by the projects, which means that their backend systems can account for any changes to the data.

At the end of the day most of the projects are probably not concerned with the problems of verifying data that has been flagged as interesting; they are concerned about missing something interesting that was flagged as background noise.

----- Rom

References:
http://en.wikipedia.org/wiki/IA-32
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://en.wikipedia.org/wiki/SSE2
http://en.wikipedia.org/wiki/SSE3
http://en.wikipedia.org/wiki/SSE4
http://en.wikipedia.org/wiki/3dnow
http://docs.sun.com/source/806-3568/ncg_goldberg.html

[08/15/2006] Adding a few more reference articles, the banking industry is still battling with rounding errors in its software.
http://www.regdeveloper.co.uk/2006/08/12/floating_point_approximation/
http://cch.loria.fr/documentation/IEEE754/

Thursday, March 23, 2006

I am at something of an impasse with the 1% bug.

In order to gain ground I need to be able to see where the program is stuck on the destination machine.  There are three ways to do this:

1.      Have the community report which workunit stalled on their machine and attempt to reproduce it.

2.      Hook up a debugger on the target machine and have the person at the keyboard create a dump file of the process.

3.      Introduce a trigger into the executable so that on a certain action it causes it to dump its own backtraces.

Option one proves difficult just in managing the sheer number of workunits to look at.  Roughly 550 workunits a day are being aborted or have exceeded their allotted CPU time.  R@H hasn’t been able to reproduce the problem in the lab with the workunits they have looked at and are continuing to look at.

Option two doesn’t scale very well.  Of all the people who are hitting this problem, only a small fraction know how to create a dump file with a debugger, and only a small fraction of those are willing to spend the time to compress and break the 200MB to 350MB file into smaller pieces to email them to me so I can look at them.  Then of course there is only one of me and I still have all my other BOINC work to do, like fixing bugs in the 5.3.x clients so we can ship 5.4.0!

Option three didn’t hit me till Monday night.  As part of the feature work we did for CPDN we introduced a way for the core client to notify the science application that it was being aborted so it could clean up after itself.  Well, I completely forgot that the 5.2.x clients don’t send the abort command to the science application when I burned the midnight oil to deliver the backtrace functionality for R@H 4.94.  At 4am I had the functionality working for Windows and checked it in.

Fast forward to today.  I went looking through the results on Ralph@Home and discovered that the backtraces were not being logged like I thought they should have been.  After further investigation I realized that the 5.2.x clients were sending the quit command instead of the abort command.  Talk about killing morale.  I have posted in the Ralph@Home forums that people should upgrade and I’ve been seeing results come back with 5.3.28 which is good.  I’m just not sure when I’ll have enough information about the bug.
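For readers wondering what "dump its own backtraces on a trigger" looks like, here is a rough sketch in Python rather than in the real C++ science application.  The message names mirror the ones in this post, but the functions are hypothetical and have nothing to do with the actual BOINC API.

```python
import sys
import threading
import traceback

def dump_backtraces():
    """Return a text dump of the call stack of every running thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append(f"Thread {names.get(ident, ident)}:\n")
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)

def handle_message(message, log):
    """Hypothetical handler: only 'abort' triggers a backtrace dump.

    Returns True when the message asks the application to stop.
    """
    if message == "abort":
        log.append(dump_backtraces())
    return message in ("abort", "quit")
```

In this sketch a "quit" message stops the application without logging anything, which is the same reason a client that sends quit instead of abort never produces a backtrace.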

We are pretty close to having 5.4 ready for public release.  I believe it will be a week or less.  But a big problem remains: typically it takes a few months for a new stable client to reach a high enough level of adoption that patterns emerge that can be tracked.

After some discussions with David Baker we are going to drop the maximum amount of time allotted for a workunit to run on a machine.  That’ll keep a good chunk of the wasted CPU cycles down.  I am also selling the idea of releasing the PDB file with the Rosetta application for the public project.  Now granted, it is a 30MB file.  Without it none of the diagnostic stuff built into the BOINC API for tracking down bugs will work.  Isn’t a 30MB insurance policy for an abort or crash worth it if the project can get something useful out of it which will lead to bug fixes?

----- Rom

Monday, March 13, 2006

So, as many of you probably already know, I’ve been brought onboard as a consultant with the Rosetta@Home project.  A big issue they were experiencing was related to random crashes when BOINC would notify them that it was time to quit and for another application to begin.

I believe I have found and fixed this style of bug, but alas only time and testing will tell.

To understand this bug I need to explain how things work with a science application.  When a science application starts and notifies BOINC that it supports graphics, three threads are created to manage what is going on.

The worker thread is the heavy lifter of the science application, it handles all the science.  The majority of the memory allocations and de-allocations happen in this thread.

The graphics thread is responsible for displaying the graphics window and for hiding and showing the window at BOINC's request.

The timer thread is responsible for processing the suspend/resume/quit/abort messages from BOINC as well as notifying BOINC of trickles.
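The three-thread layout can be sketched roughly as follows.  This is a toy in Python, not the BOINC API: the names are invented, suspend and resume are accepted but do nothing, and the sketch always sends itself a final "quit" so that it cannot hang.

```python
import queue
import threading

def run_science_app(messages):
    """Run a toy three-thread app; return (progress, messages handled)."""
    inbox = queue.Queue()
    stop = threading.Event()
    progress = {"steps": 0, "frames": 0}
    handled = []

    def worker():
        # Heavy lifter: stands in for the science calculations.
        while not stop.is_set():
            progress["steps"] += 1

    def graphics():
        # Stands in for drawing a frame until told to stop.
        while not stop.wait(timeout=0.01):
            progress["frames"] += 1

    def timer():
        # Processes control messages; suspend/resume are ignored here.
        while True:
            message = inbox.get()
            handled.append(message)
            if message in ("quit", "abort"):
                stop.set()
                return

    threads = [threading.Thread(target=f) for f in (worker, graphics, timer)]
    for t in threads:
        t.start()
    for message in messages:
        inbox.put(message)
    inbox.put("quit")  # always end the run so the sketch cannot hang
    for t in threads:
        t.join()
    return progress, handled
```

Calling run_science_app(["suspend", "resume"]) returns once all three threads have finished, with the handled list showing the messages in the order the timer thread saw them.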

Now when the science application received the quit request it would call the C Runtime Library function called exit, which is supposed to shut down the application.  Part of this shutdown operation calls the Win32 API called ExitProcess.  ExitProcess would let the threads continue to run while cleaning up the heap, which is a holdout for letting DLLs decrement their ref counts and unload themselves if nobody else is using them.  Well, therein lies the problem: the worker thread was still running, trying to allocate and de-allocate memory from a heap that had been freed by ExitProcess.

This in turn would cause an access violation which shows up in the log file as 0xc0000005.

Science applications now have the option of requesting a hard termination which stops all executing threads and then cleans up after the process.  In essence the application calls TerminateProcess on itself.  What this also means is that the application has no chance of writing any more information to a state file or checkpoint file when the BOINC API hasn’t been notified that a checkpoint is in progress.  Use with care.  It also means that BOINC should no longer believe that a task is invalid from a random crash.

I believe this will take care of quite a few ‘crash on close’ style of bugs.  What was really annoying about this kind of bug is that it crashes in a different location each time.  Sometimes it would crash in the timer thread and sometimes in the worker thread.  A good chunk of the time the clients would report an empty call stack which doesn’t give us anything to work off of.

This style of bug would affect slower machines more than the faster machines.  The bug wouldn’t surface if the timer thread could finish all the CPU instructions needed from the time exit was called to the time ExitProcess actually kills the threads in one OS thread scheduling cycle.

I think Rosetta@Home hit this bug more often than most projects because of the amount of memory it allocates while doing its thing: 150MB per process.  That was just enough to get it to happen on my machine if I left it running for 10 minutes with the graphics running.

It looks like both Einstein@Home and Rosetta@Home are going to be testing this out in the next few days.  I’m excited to see what this change does for the success rates of the tasks being assigned to client machines.

----- Rom