Tuesday, October 24, 2006

Significant amount of time and energy has gone into making BOINC's communication infrastructure efficient, yet there are still many whom believe that it really doesn't cost the projects any more to return the results immediately vs. returning the results when BOINC believes it ought too.

For the purposes of this article I'm going to define the cost of a query at $1.00 per query to cover the cost of electricity, air conditioning, maintenance, and cost of personal to manage the database server. Now in real life that number is greatly exaggerated, but it is easier to describe the relative cost of something based off of something tangible.

Here is a basic rundown of query cost per MySQL documentation:

Insert Select Update
Connecting 3 3 3
Sending query to server 2 2 2
Parsing query 2 2 2
Inserting row 1 x
size of row
   
Inserting indexes 1 x
number of indexes
   
Selecting row   0.01  
Updating row     0.05
Updating indexes     1 x
number of index changes
Sending results to caller   2  
Closing 1 1 1

Notes:

  • Scheduler only does selects and updates
  • Selects happen really fast since that is what databases are optimized around
  • Updates are only a little more expensive than a select because they have to acquire an exclusive lock on the row to make sure nobody else is trying to write to that record and then change the record.

Now with using FastCGI we can throw out the connecting and closing costs since the database connection is always available for the life of a single scheduler process, which their can be 100-150 running at a time.

We'll keep track of the number of queries executed and the number of query parts used so we can calculate the cost per query part.

Well break out the results for the following two scenarios:

  1. Reporting 20 results individually.
  2. Reporting 20 results at once.

 

Scheduler RPC

A scheduler RPC does many things as it has to do authentication, preferences, receive incoming result status, and send out new results to be processed. I'll tackle each section one at a time.

Authentication

Authentication consists of a query for host, user, and team. Each query is independent, although we have talked about batching them into a single query, we just haven't gotten that far yet. Now this part of the RPC may result in a new host record being created if your connecting up for the first time or something is wrong with what you have sent to the scheduler.

Scenario 1: 60 Queries, 360.6 Query parts.

Scenario 2: 3 Queries, 18.03 Query parts.

Platform Check

Checks to see if your platform is supported.

Scenario 1: 20 Queries, 120.2 Query parts.

Scenario 2: 1 Queries, 6.01 Query parts.

Preferences Check

Determines if the preferences on the client need to be updates or the server needs to be updated. If the server needs to be updated then an update query is submitted.

Scenario 1: 20 Queries, 120.2 Query parts.

Scenario 2: 1 Queries, 6.01 Query parts.

Handle Reported Results

Here each result is looked up to see if it was assigned to the person reporting it and to update its values. The workunit record for each result record has to be updated so the transitioner will look at the workunit and decide what to do next. Two indexes have to be updated in the result table and 1 in the workunit table for each result. What is important to point out here is that in scenario 2 we batch all of the selects and updates for results in scenario 1 into a single select and update. The workunit updates are also batched in scenario 2.

Scenario 1: 60 Queries, 342.2 Query parts.

Scenario 2: 3 Queries, 17.11 Query parts.

Assign New Results

Most of the preparation work for this phase is actually done by the feeder. So here we get the latest information about the result, then update the result, then update the workunit.

Scenario 1: 60 Queries, 342.2 Query parts.

Scenario 2: 60 Queries, 342.2 Query parts.

 

Totals

Now that we have broken down the scheduler into each of its parts and isolated the number and types of queries we can calculate what it would cost the project if each query cost $1. The query parts metric is useful in determining how much wasted database time is spent for each operation. All around scenario 2 costs the project less in time and maintenance on equipment.

Scenario 1 costs a project $11 per result, and scenario 2 costs a project $3.40 per result.

Scenario 2 is 70% more efficient than scenario 1 in the amount of time used to process 20 results.

 

So be kind to your project(s), let BOINC report the results in batches. The project admin's will be able to support more people and more machines with the same hardware.

 

----- Rom

[Edit: Since originally writing this I hunted down a few numbers from jocelyn which is the S@H database server

On average jocelyn is processing 314 queries per second.
In the last 5 days jocelyn has processed 144.7 million queries.

]