Thursday, March 23, 2006

I am at something of an impasse with the 1% bug.

In order to gain ground I need to be able to see where the program is stuck on the destination machine.  There are three ways to do this:

1.      Have the community report which workunit stalled on their machine and attempt to reproduce it.

2.      Hook up a debugger on the target machine and have the person at the keyboard create a dump file of the process.

3.      Introduce a trigger into the executable so that on a certain action it causes it to dump its own backtraces.

Option one proves difficult just in managing the sheer number of workunits to look at.  Roughly 550 workunits a day are being aborted or have exceeded their allotted CPU time.  So far R@H hasn’t been able to reproduce the problem in the lab with the workunits they have looked at, and they are continuing to look.

Option two doesn’t scale very well.  Of all the people who are hitting this problem, only a small fraction know how to create a dump file with a debugger, and only a small fraction of them are willing to spend the time to compress the 200MB to 350MB file and break it into smaller pieces to email to me so I can look at it.  Then of course there is only one of me, and I still have all my other BOINC work to do, like fixing bugs in the 5.3.x clients so we can ship 5.4.0!

Option three didn’t hit me till Monday night.  As part of the feature work we did for CPDN we introduced a way for the core client to notify the science application that it was being aborted so it could clean up after itself.  Well, when I burned the midnight oil to deliver the backtrace functionality for R@H 4.94, I completely forgot that the 5.2.x clients don’t send the abort command to the application.  At 4am I had the functionality working for Windows and checked it in.

Fast forward to today.  I went looking through the results on Ralph@Home and discovered that the backtraces were not being logged like I thought they should have been.  After further investigation I realized that the 5.2.x clients were sending the quit command instead of the abort command.  Talk about killing morale.  I have posted in the Ralph@Home forums that people should upgrade and I’ve been seeing results come back with 5.3.28 which is good.  I’m just not sure when I’ll have enough information about the bug.
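
To make the quit-versus-abort mismatch concrete, here is a minimal Python sketch. It is not the actual BOINC or Rosetta code: the message strings "abort" and "quit" and the names dump_backtraces and handle_client_message are hypothetical stand-ins for whatever the core client and the science application really exchange. It also leans on sys._current_frames(), which is specific to CPython.

```python
import io
import sys
import threading
import traceback


def dump_backtraces(out):
    """Write the current call stack of every live thread to `out`."""
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        out.write(f"Thread {names.get(thread_id, '?')} ({thread_id}):\n")
        out.write("".join(traceback.format_stack(frame)))


def handle_client_message(message, out):
    """Hypothetical handler: only 'abort' triggers a dump, 'quit' does not."""
    if message == "abort":
        dump_backtraces(out)
        return "dumped"
    if message == "quit":
        return "exited quietly"
    return "ignored"


# An older client that only ever sends "quit" leaves the log empty.
log_from_old_client = io.StringIO()
old_result = handle_client_message("quit", log_from_old_client)

# A newer client that sends "abort" produces a backtrace in the log.
log_from_new_client = io.StringIO()
new_result = handle_client_message("abort", log_from_new_client)
```

In this toy model the dump hangs entirely off one message type, so a sender that uses the other message silently yields no diagnostics, which is the shape of the problem described above.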

We are pretty close to having 5.4 ready for public release.  I believe in a week or less.  But a big problem remains: it typically takes a few months for a new stable client to reach a high enough level of adoption that patterns emerge that can be tracked.

After some discussions with David Baker we are going to drop the maximum amount of time allotted for a workunit to run on a machine.  That’ll keep a good chunk of the wasted CPU cycles down.  I am also selling the idea of releasing the PDB file with the Rosetta application for the public project.  Now granted, it is a 30MB file.  Without it none of the diagnostic stuff built into the BOINC API for tracking down bugs will work.  Isn’t a 30MB insurance policy for an abort or crash worth it if the project can get something useful out of it which will lead to bug fixes?

----- Rom

Sunday, March 19, 2006

 

Results on RALPH@Home, which is R@H’s alpha project, have been very promising.

To give an idea about how large this problem was for R@H I guess I need to provide some numbers.  So here goes:

R@H receives roughly 115k results a day.

There are roughly 16k failures a day.

Of those 16k failures a day, 5.5k fell under the ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED and 0xc0000005 banner.  Those are the two error codes used when something really really really bad has happened on Windows.  There are another 1.5k errors that have cryptic Windows error codes which may or may not be related.
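
As a quick sanity check on the figures quoted above, the following Python sketch turns them into percentages. The variable names are made up for illustration, and the inputs are the rounded numbers from the text, so the results are only approximate.

```python
# Rounded daily figures quoted in the text for the public project.
results_per_day = 115_000
failures_per_day = 16_000
severe_windows_failures = 5_500   # ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED + 0xc0000005
cryptic_windows_failures = 1_500  # other cryptic Windows error codes

# Share of all results that fail, and share of failures in each group.
failure_rate = failures_per_day / results_per_day
severe_share = severe_windows_failures / failures_per_day
cryptic_share = cryptic_windows_failures / failures_per_day

print(f"overall failure rate:          {failure_rate:.1%}")
print(f"severe share of failures:      {severe_share:.1%}")
print(f"cryptic share of failures:     {cryptic_share:.1%}")
```

With these inputs, roughly 14% of results fail, and about a third of those failures fall into the two severe Windows categories.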

Now how does this translate to RALPH@Home?  Well, if you work under the assumption that RALPH@Home is a mini R@H, then the percentages should be roughly the same.

That said, sure enough RALPH@Home had roughly the same breakdown of errors that the public project had.  Here are some rough stats for RALPH@Home:

RALPH@Home receives roughly 1k results a day.

Before 4.93 was released for Beta the failure rate was 150 or so a day.

Now with 4.93 in the mix it has dropped to 100 or so a day.

Keep in mind that the Mac and Linux clients have not been updated yet, so their error rates remain unchanged.

RALPH@Home went from a 25% failure rate down to a 12% failure rate.  Now if you remove the results from Linux and the Mac the failure rate for the Windows client is floating at 5%.

I’ll include the current error rates in the public project and RALPH@Home below.

Now I’m on to the next biggest problem which has been deemed the ‘1% bug’.

For those who noticed the error code 1 in the charts below, that error code is given when Rosetta could not find something in one of the pre-staged files downloaded to your machine, or when the application decided something really bad had happened and it couldn’t continue.  With 4.82 the actual error data was being written to a different log file than the one BOINC sends back to the server.  Starting with 4.94 the reason for the application quitting will be logged and sent back to the server in a way that can be easily tracked and fixed without anyone having to write the workunit names in the forums.

----- Rom

Public Project Results:

Version  Platform  Error code                                                 Count
482      Darwin    -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      5
482      Darwin    -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      3
482      Darwin    -185 (0xffffff47) ERR_RESULT_START                         83
482      Darwin    1 Unknown error number                                     10
482      Darwin    4 Unknown error number                                     135
482      Darwin    5 Unknown error number                                     9
482      Darwin    6 Unknown error number                                     1
482      Darwin    131 (0x83) Unknown error number                            26
482      Windows   -2147483641 (0x80000007) Unknown error number              18
482      Windows   -1073741819 (0xc0000005) Unknown error number              1797
482      Windows   -1073741811 (0xc000000d) Unknown error number              880
482      Windows   -1073741795 (0xc000001d) Unknown error number              2
482      Windows   -1073741674 (0xc0000096) Unknown error number              4
482      Windows   -1073741571 (0xc00000fd) Unknown error number              63
482      Windows   -1073741515 (0xc0000135) Unknown error number              2
482      Windows   -1073741502 (0xc0000142) Unknown error number              336
482      Windows   -1073740972 (0xc0000354) Unknown error number              2
482      Windows   -529697949 (0xe06d7363) Unknown error number               226
482      Windows   -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      466
482      Windows   -187 (0xffffff45) ERR_RESULT_UPLOAD                        3
482      Windows   -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      316
482      Windows   -185 (0xffffff47) ERR_RESULT_START                         248
482      Windows   -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED                   49
482      Windows   -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED  3761
482      Windows   -1 (0xffffffff) Unknown error number                       4
482      Windows   0                                                          18
482      Windows   1 Unknown error number                                     1004
482      Windows   3 Unknown error number                                     52
482      Windows   128 (0x80) Unknown error number                            7
482      Windows   1073807364 (0x40010004) Unknown error number               23
481      Linux     -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      7
481      Linux     -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      15
481      Linux     -185 (0xffffff47) ERR_RESULT_START                         4
481      Linux     0                                                          1
481      Linux     1 Unknown error number                                     221
481      Linux     11 (0xb) Unknown error number                              25
481      Linux     26 (0x1a) Unknown error number                             2
481      Linux     131 (0x83) Unknown error number                            144
481      Windows   -2147483645 (0x80000003) Unknown error number              1
481      Windows   -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      3

Total                                                                         9976
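
The error codes in these listings appear both as a signed decimal and as a hexadecimal value in parentheses. A small Python sketch, using hypothetical helper names, shows how the two forms relate under two's complement, and why the same status can show up with either an 8-digit or a 16-digit hex form depending on whether it is rendered as 32 or 64 bits wide.

```python
def as_unsigned_hex(status, bits=32):
    """Render a signed exit status as two's-complement hex of the given width."""
    return hex(status & ((1 << bits) - 1))


def as_signed(value, bits=32):
    """Interpret an unsigned value as a signed two's-complement integer."""
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# The same signed status, rendered at two different widths.
access_violation_32 = as_unsigned_hex(-1073741819, bits=32)
access_violation_64 = as_unsigned_hex(-1073741819, bits=64)

# A small negative BOINC-style code and the round trip back to decimal.
aborted_via_gui_32 = as_unsigned_hex(-197, bits=32)
round_trip = as_signed(0xC0000005, bits=32)

print(access_violation_32, access_violation_64, aborted_via_gui_32, round_trip)
```

Searching for a Windows status code is usually easier with the 32-bit hex form, so converting the decimal value first can save some time.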

RALPH@Home Results:

Version  Platform  Error code                                             Count
493      Windows   -1073741819 (0xffffffffc0000005) Unknown error number  4
493      Windows   -1073741811 (0xffffffffc000000d) Unknown error number  19
493      Windows   -1073741678 (0xffffffffc0000092) Unknown error number  1
493      Windows   -529697949 (0xffffffffe06d7363) Unknown error number   5
493      Windows   -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI          5
493      Windows   -186 (0xffffffffffffff46) ERR_RESULT_DOWNLOAD          5
493      Windows   0                                                      2
493      Windows   1 Unknown error number                                 5
493      Windows   3 Unknown error number                                 1
492      Windows   -1073741819 (0xffffffffc0000005) Unknown error number  3
491      Windows   -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI          1
485      Darwin    -185 (0xffffffffffffff47) ERR_RESULT_START             22
485      Darwin    4 Unknown error number                                 6
485      Darwin    131 (0x83) Unknown error number                        1

484