Saturday, July 15, 2006

It is time to tell us what you think.  We are conducting a poll to determine where the hot spots are for what needs to happen with BOINC.  We welcome all kinds of feedback: the more people who respond and the better coverage we get, the more we can improve BOINC and help the projects improve their overall experience.

You can find the poll here:
http://boinc.berkeley.edu/poll.php

The results are published here:
http://boinc.berkeley.edu/poll_results.php

 

I turned on my TV this weekend to catch up on some of my recordings and found this in my recorded queue:
Rosetta Presentation

I have my media center set up to record any Computer Science Colloquium from the University of Washington that comes on UWTV.  This one happened to be David Baker of R@H giving a presentation to the computer science students about how Rosetta works and how they use the results.  He even gave BOINC a plug and discussed how R@H was changing how they do things.

 

We have had some nice press within the last week.  Here are some of the articles:
Use your computer idle time for a great cause
Putting your computer to work to fight against malaria in Africa
Coming down to Earth

Friday, April 7, 2006

Howdy Folks,

 

Well, error rates are still dropping on R@H.  It generally takes a few weeks for the old version of an application to filter out of the system.

 

Here was the pass percentage breakout from yesterday:

 

Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
483      Darwin            6733          90.24           9.76
483      Windows          99095          95.74           4.26
482      Darwin             213          96.71           3.29
482      Linux             9387          96.48           3.52
482      Windows           6000          84.68          15.32

 

As you can see, the failure rate for Windows dropped by roughly 11 percentage points, from 15.32% with version 482 to 4.26% with version 483.
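As a rough cross-check of the table above, the failure rates can be turned into approximate result counts, and the change for Windows between versions can be computed directly.  This is only a sketch: the tuple layout and the names `rows`, `estimated_failures` and `fail_rate` are made up for illustration, and the counts are estimates derived from the published percentages, not numbers pulled from the project database.

```python
# (version, os, total_results, fail_rate_percent), copied from the table above.
rows = [
    (483, "Darwin", 6733, 9.76),
    (483, "Windows", 99095, 4.26),
    (482, "Darwin", 213, 3.29),
    (482, "Linux", 9387, 3.52),
    (482, "Windows", 6000, 15.32),
]


def estimated_failures(total_results, fail_rate_percent):
    """Approximate number of failed results implied by a failure rate."""
    return round(total_results * fail_rate_percent / 100)


def fail_rate(version, os_name):
    """Look up the failure rate (percent) for one version/OS pair."""
    for row_version, row_os, _total, rate in rows:
        if (row_version, row_os) == (version, os_name):
            return rate
    raise KeyError((version, os_name))


# Difference in percentage points between the old and new Windows builds.
windows_drop = fail_rate(482, "Windows") - fail_rate(483, "Windows")

if __name__ == "__main__":
    for version, os_name, total, rate in rows:
        print(version, os_name, estimated_failures(total, rate))
    print(f"Windows drop: {windows_drop:.2f} percentage points")
```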

 

Here is what the current error type breakout looks like:

 

App Version  OS       Exit Status                                                Error Count
483          Darwin   -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      4
483          Darwin   -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      14
483          Darwin   -185 (0xffffff47) ERR_RESULT_START                         290
483          Darwin   1 Unknown error number                                     11
483          Darwin   2 Unknown error number                                     77
483          Darwin   4 Unknown error number                                     223
483          Darwin   5 Unknown error number                                     17
483          Darwin   131 (0x83) Unknown error number                            21
483          Windows  -2147483645 (0x80000003) Unknown error number              29
483          Windows  -2147483641 (0x80000007) Unknown error number              1
483          Windows  -1073741819 (0xc0000005) Unknown error number              672
483          Windows  -1073741818 (0xc0000006) Unknown error number              5
483          Windows  -1073741811 (0xc000000d) Unknown error number              935
483          Windows  -1073741795 (0xc000001d) Unknown error number              5
483          Windows  -1073741794 (0xc000001e) Unknown error number              1
483          Windows  -1073741783 (0xc0000029) Unknown error number              1
483          Windows  -1073741675 (0xc0000095) Unknown error number              1
483          Windows  -1073741674 (0xc0000096) Unknown error number              1
483          Windows  -1073741515 (0xc0000135) Unknown error number              143
483          Windows  -1073741502 (0xc0000142) Unknown error number              285
483          Windows  -1073740972 (0xc0000354) Unknown error number              3
483          Windows  -1073740791 (0xc0000409) Unknown error number              16
483          Windows  -529697949 (0xe06d7363) Unknown error number               102
483          Windows  -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      154
483          Windows  -187 (0xffffff45) ERR_RESULT_UPLOAD                        4
483          Windows  -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      598
483          Windows  -185 (0xffffff47) ERR_RESULT_START                         210
483          Windows  -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED                   1
483          Windows  -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED  39
483          Windows  -1 (0xffffffff) Unknown error number                       74
483          Windows  0                                                          6
483          Windows  1 Unknown error number                                     806
483          Windows  3 Unknown error number                                     15
483          Windows  128 (0x80) Unknown error number                            79
483          Windows  1073741845 (0x40000015) Unknown error number               1
483          Windows  1073807364 (0x40010004) Unknown error number               32
482          Darwin   -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      2
482          Darwin   -185 (0xffffff47) ERR_RESULT_START                         3
482          Darwin   1 Unknown error number                                     1
482          Darwin   131 (0x83) Unknown error number                            1
482          Linux    -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      16
482          Linux    -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      46
482          Linux    -185 (0xffffff47) ERR_RESULT_START                         1
482          Linux    1 Unknown error number                                     22
482          Linux    2 Unknown error number                                     25
482          Linux    4 Unknown error number                                     1
482          Linux    7 Unknown error number                                     1
482          Linux    11 (0xb) Unknown error number                              29
482          Linux    13 (0xd) Unknown error number                              33
482          Linux    26 (0x1a) Unknown error number                             1
482          Linux    131 (0x83) Unknown error number                            154
482          Linux    139 (0x8b) Unknown error number                            1
482          Windows  -1073741819 (0xc0000005) Unknown error number              98
482          Windows  -1073741811 (0xc000000d) Unknown error number              30
482          Windows  -1073741502 (0xc0000142) Unknown error number              7
482          Windows  -529697949 (0xe06d7363) Unknown error number               7
482          Windows  -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      561
482          Windows  -187 (0xffffff45) ERR_RESULT_UPLOAD                        9
482          Windows  -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      26
482          Windows  -185 (0xffffff47) ERR_RESULT_START                         4
482          Windows  -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED                   4
482          Windows  -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED  115
482          Windows  1 Unknown error number                                     57
481          Linux    -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      1
481          Linux    1 Unknown error number                                     2
481          Linux    11 (0xb) Unknown error number                              2
481          Linux    131 (0x83) Unknown error number                            4
479          Linux    1 Unknown error number                                     1
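The exit statuses in the table are signed 32-bit integers, with the hex value in parentheses being the same bits read as unsigned.  Many of the large negative Windows values are standard NTSTATUS codes even though the report labels them "Unknown error number".  Here is a small sketch of how one might decode them.  The function names and the lookup table are my own for illustration, only a handful of well-known codes are listed, and none of this is part of BOINC.

```python
# A few well-known Windows NTSTATUS values (standard Windows names,
# not something the report itself prints).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def to_unsigned32(exit_status):
    """Reinterpret a signed exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


def describe(exit_status):
    """Format an exit status like the report does, with a name if known."""
    code = to_unsigned32(exit_status)
    name = KNOWN_NTSTATUS.get(code, "Unknown error number")
    return f"{exit_status} (0x{code:08x}) {name}"


if __name__ == "__main__":
    for status in (-1073741819, -1073741811, -1073741502, -197):
        print(describe(status))
```

Masking with 0xFFFFFFFF also folds a sign-extended 64-bit value such as 0xffffffffc0000005 back to the same 32-bit code.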

 

This weekend I’m going to try and get a register dump of each thread added to the diagnostic output.  Along with that I would like to get the function pointers and function parameters added to the diagnostic output.

 

I did manage to shrink the PDB file size for R@H down to 7MB which still seems to be a little steep for mass consumption.  So maybe with the function pointers and parameters we can continue to bring down the error rates.

 

----- Rom

 

Sunday, March 19, 2006

 

Results on RALPH@Home, which is R@H’s alpha project, have been very promising.

To give an idea of how large this problem was for R@H, I guess I need to provide some numbers.  So here goes:

R@H receives roughly 115k results a day.

Roughly there are 16k failures a day.

Of those 16k failures a day, 5.5k fell under the ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED and 0xc0000005 banner.  Those are the two error codes used when something really really really bad has happened on Windows.  There are another 1.5k errors that have cryptic Windows error codes which may or may not be related.
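To put those rough daily numbers in proportion, here is the arithmetic spelled out.  The variable names are mine and the inputs are the rounded figures quoted above, so the percentages are only as good as those estimates: roughly 14% of results fail, and the two crash codes account for about a third of the failures.

```python
# Rounded daily figures quoted above for the public R@H project.
results_per_day = 115_000
failures_per_day = 16_000
crash_failures = 5_500    # ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED + 0xc0000005
cryptic_failures = 1_500  # other cryptic Windows error codes


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


overall_failure_rate = percent(failures_per_day, results_per_day)
crash_share = percent(crash_failures, failures_per_day)
cryptic_share = percent(cryptic_failures, failures_per_day)

if __name__ == "__main__":
    print(f"Overall failure rate:       {overall_failure_rate:.1f}%")
    print(f"Crash codes, share of fails: {crash_share:.1f}%")
    print(f"Cryptic codes, share:        {cryptic_share:.1f}%")
```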

Now how does this translate to RALPH@Home?  Well, if you work under the assumption that RALPH@Home is a mini R@H, then the percentages should be roughly the same.

That said, sure enough RALPH@Home had roughly the same breakdown of errors that the public project had.  Here are some rough stats for RALPH@Home:

RALPH@Home receives roughly 1k results a day.

Before 4.93 was released for Beta the failure rate was 150 or so a day.

Now with 4.93 in the mix it has dropped to 100 or so a day.

Keep in mind that the Mac and Linux clients have not been updated yet, so their error rates remain unchanged.

RALPH@Home went from a 25% failure rate down to a 12% failure rate.  Now if you remove the results from Linux and the Mac the failure rate for the Windows client is floating at 5%.

I’ll include the current error rates in the public project and RALPH@Home below.

Now I’m on to the next biggest problem which has been deemed the ‘1% bug’.

For those who noticed the error code 1 in the charts below, that error code is given when Rosetta could not find something in one of the pre-staged files downloaded to your machine, or when the application felt something really bad had happened and it couldn’t continue.  With 4.82 the actual error data was being written to a different log file than the one BOINC sends back to the server.  Starting with 4.94 the reason for the application quitting will be logged and sent back to the server in a way that can be easily tracked and fixed without having to write the workunit names in the forums.

----- Rom

Public Project Results:

App Version  OS       Exit Status                                                Error Count
482          Darwin   -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      5
482          Darwin   -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      3
482          Darwin   -185 (0xffffff47) ERR_RESULT_START                         83
482          Darwin   1 Unknown error number                                     10
482          Darwin   4 Unknown error number                                     135
482          Darwin   5 Unknown error number                                     9
482          Darwin   6 Unknown error number                                     1
482          Darwin   131 (0x83) Unknown error number                            26
482          Windows  -2147483641 (0x80000007) Unknown error number              18
482          Windows  -1073741819 (0xc0000005) Unknown error number              1797
482          Windows  -1073741811 (0xc000000d) Unknown error number              880
482          Windows  -1073741795 (0xc000001d) Unknown error number              2
482          Windows  -1073741674 (0xc0000096) Unknown error number              4
482          Windows  -1073741571 (0xc00000fd) Unknown error number              63
482          Windows  -1073741515 (0xc0000135) Unknown error number              2
482          Windows  -1073741502 (0xc0000142) Unknown error number              336
482          Windows  -1073740972 (0xc0000354) Unknown error number              2
482          Windows  -529697949 (0xe06d7363) Unknown error number               226
482          Windows  -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      466
482          Windows  -187 (0xffffff45) ERR_RESULT_UPLOAD                        3
482          Windows  -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      316
482          Windows  -185 (0xffffff47) ERR_RESULT_START                         248
482          Windows  -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED                   49
482          Windows  -164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED  3761
482          Windows  -1 (0xffffffff) Unknown error number                       4
482          Windows  0                                                          18
482          Windows  1 Unknown error number                                     1004
482          Windows  3 Unknown error number                                     52
482          Windows  128 (0x80) Unknown error number                            7
482          Windows  1073807364 (0x40010004) Unknown error number               23
481          Linux    -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      7
481          Linux    -186 (0xffffff46) ERR_RESULT_DOWNLOAD                      15
481          Linux    -185 (0xffffff47) ERR_RESULT_START                         4
481          Linux    0                                                          1
481          Linux    1 Unknown error number                                     221
481          Linux    11 (0xb) Unknown error number                              25
481          Linux    26 (0x1a) Unknown error number                             2
481          Linux    131 (0x83) Unknown error number                            144
481          Windows  -2147483645 (0x80000003) Unknown error number              1
481          Windows  -197 (0xffffff3b) ERR_ABORTED_VIA_GUI                      3

Total                                                                            9976

RALPH@Home Results:

App Version  OS       Exit Status                                                Error Count
493          Windows  -1073741819 (0xffffffffc0000005) Unknown error number      4
493          Windows  -1073741811 (0xffffffffc000000d) Unknown error number      19
493          Windows  -1073741678 (0xffffffffc0000092) Unknown error number      1
493          Windows  -529697949 (0xffffffffe06d7363) Unknown error number       5
493          Windows  -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI              5
493          Windows  -186 (0xffffffffffffff46) ERR_RESULT_DOWNLOAD              5
493          Windows  0                                                          2
493          Windows  1 Unknown error number                                     5
493          Windows  3 Unknown error number                                     1
492          Windows  -1073741819 (0xffffffffc0000005) Unknown error number      3
491          Windows  -197 (0xffffffffffffff3b) ERR_ABORTED_VIA_GUI              1
485          Darwin   -185 (0xffffffffffffff47) ERR_RESULT_START                 22
485          Darwin   4 Unknown error number                                     6
485          Darwin   131 (0x83) Unknown error number                            1
484          Linux    11 (0xb) Unknown error number                              3
484          Linux    131 (0x83) Unknown error number                            6

Total                                                                            89

 

Monday, March 13, 2006

So, as many of you probably already know, I’ve been brought onboard as a consultant with the Rosetta@Home project.  A big issue they were experiencing was related to random crashes when BOINC would notify the application that it was time to quit and for another application to begin.

I believe I have found and fixed this style of bug, but alas only time and testing will tell.

To understand this bug I need to explain how things work with a science application.  When a science application starts and notifies BOINC that it supports graphics, three threads are created to manage what is going on.

The worker thread is the heavy lifter of the science application; it handles all the science.  The majority of the memory allocations and de-allocations happen in this thread.

The graphics thread is responsible for displaying the graphics window and for hiding and showing the window at BOINC’s request.

The timer thread is responsible for processing the suspend/resume/quit/abort messages from BOINC as well as notifying BOINC of trickles.
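To make the division of labor concrete, here is a toy model of the worker and timer threads (the graphics thread is left out).  It is only a sketch in Python: the class `ScienceApp`, the message strings and the queue are hypothetical stand-ins I made up for illustration, not the BOINC API, and the real application does this in native code.

```python
import queue
import threading
import time


class ScienceApp:
    """Toy model: a worker thread plus a timer thread handling messages."""

    def __init__(self):
        self.messages = queue.Queue()          # stand-in for messages from BOINC
        self.suspended = threading.Event()
        self.quit_requested = threading.Event()
        self.steps_done = 0

    def worker(self):
        # Heavy lifter: does the "science" unless suspended or told to quit.
        while not self.quit_requested.is_set():
            if not self.suspended.is_set():
                self.steps_done += 1
            time.sleep(0.001)

    def timer(self):
        # Processes suspend/resume/quit/abort messages.
        while not self.quit_requested.is_set():
            try:
                message = self.messages.get(timeout=0.01)
            except queue.Empty:
                continue
            if message == "suspend":
                self.suspended.set()
            elif message == "resume":
                self.suspended.clear()
            elif message in ("quit", "abort"):
                self.quit_requested.set()


def run_demo():
    """Start both threads, send a quit message, and wait for them to stop."""
    app = ScienceApp()
    threads = [
        threading.Thread(target=app.worker),
        threading.Thread(target=app.timer),
    ]
    for thread in threads:
        thread.start()
    time.sleep(0.05)
    app.messages.put("quit")
    for thread in threads:
        thread.join(timeout=2)
    return app.steps_done, all(not thread.is_alive() for thread in threads)


if __name__ == "__main__":
    print(run_demo())
```

In this toy version the worker checks the quit flag itself and stops cleanly, so the worker is never still running when the process goes away.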

Now when the science application received the quit request it would call the C Runtime Library function called exit, which is supposed to shut down the application.  Part of this shutdown operation calls the Win32 API called ExitProcess.  ExitProcess would let the threads continue to run while cleaning up the heap, which is a holdout for letting DLLs decrement their ref counts and unload themselves if nobody else is using them.  Well, therein lies the problem: the worker thread was still running, trying to allocate and de-allocate memory from a heap that had been freed by ExitProcess.

This in turn would cause an access violation which shows up in the log file as 0xc0000005.

Science applications now have the option of requesting a hard termination which stops all executing threads and then cleans up after the process.  In essence the application calls TerminateProcess on itself.  What this also means is that the application has no chance of writing any more information to a state file or checkpoint file when the BOINC API hasn’t been notified that a checkpoint is in progress.  Use with care.  It also means that BOINC should no longer believe that a task is invalid from a random crash.
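The trade-off described above can be illustrated with a loose analogy in Python, where os._exit ends the process immediately without running any exit handlers, while sys.exit goes through the normal shutdown path.  This is not the BOINC API and not TerminateProcess itself; the child program, the "cleanup ran" marker and the helper `run_child` are all made up for the sketch.

```python
import subprocess
import sys
import textwrap

# Child program, run in a separate interpreter so that ending it abruptly
# does not take the demo down with it. Everything in it is a stand-in.
CHILD = textwrap.dedent("""
    import atexit, os, sys, threading, time

    def worker():
        # Stand-in for the science loop: keeps allocating and freeing memory.
        while True:
            scratch = bytearray(4096)
            del scratch
            time.sleep(0.001)

    # Stand-in for normal shutdown work such as writing state files.
    atexit.register(lambda: print("cleanup ran", flush=True))

    threading.Thread(target=worker, daemon=True).start()
    time.sleep(0.05)

    if sys.argv[1] == "hard":
        os._exit(0)  # end the process now: no exit handlers, no cleanup
    sys.exit(0)      # normal exit: exit handlers run first
""")


def run_child(mode):
    """Run the child in 'hard' or 'normal' mode; return (exit code, stdout)."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return done.returncode, done.stdout


if __name__ == "__main__":
    print("hard:  ", run_child("hard"))
    print("normal:", run_child("normal"))
```

In the hard case the cleanup marker never appears, which mirrors the warning above: anything not already written before the hard termination is lost.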

I believe this will take care of quite a few ‘crash on close’ style of bugs.  What was really annoying about this kind of bug is that it crashes in a different location each time.  Sometimes it would crash in the timer thread and sometimes in the worker thread.  A good chunk of the time the clients would report an empty call stack which doesn’t give us anything to work off of.

This style of bug would affect slower machines more than the faster machines.  The bug wouldn’t surface if the timer thread could finish all the CPU instructions needed from the time exit was called to the time ExitProcess actually kills the threads in one OS thread scheduling cycle.

I think Rosetta@Home hit this bug more often than most projects because of the amount of memory it allocates while doing its thing: 150 MB per process.  That was just enough to get it to happen on my machine if I left it running for 10 minutes with the graphics running.

It looks like both Einstein@Home and Rosetta@Home are going to be testing this out in the next few days.  I’m excited to see what this change does for the success rates of the tasks being assigned to client machines.

----- Rom

Friday, March 10, 2006

Well, tomorrow I'll be taking a trip to the Rosetta@Home project.

They are going to be explaining how Rosetta works so I can try and help them out with the problems they are having with the BOINC interface code.  I believe it'll be a great learning experience for both Rosetta@Home and BOINC.

It seems every time we learn about a new project, there is another way of doing something that is just slightly different from any other project.

----- Rom