My colleague Matt (spinningmatt) posted a really useful article, Submitting Jobs with AMQP on a Condor Based Grid. The article includes an example C++ program that uses a low-latency job submission feature of MRG.
The low-latency feature uses AMQP (MRG's Messaging) to deliver job workloads to a grid's execution nodes. This bypasses the job scheduler (Condor's schedd). Instead, a special daemon on the execute node consumes jobs off an AMQP queue. It's a pull model rather than a push model.
Why is this useful? Well, originally the intent was to support sub-second jobs in the financial services industry. Consider a grid of index calculation applications. Each calculation may take less than a second to perform. The applications can't wait for a job scheduler to decide that their execution nodes are available, then schedule a job onto the node, then push the job out there. Instead, jobs are placed on AMQP queues. As soon as an execute node is free to perform work, it pulls the next job from the queue. There is barely any latency between jobs.
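To make the pull model concrete, here is a minimal sketch in Python. It is not MRG's or Condor's actual implementation: an in-process queue.Queue stands in for the AMQP queue, threads stand in for execute nodes, and the names worker and run_grid are made up for illustration. The point is only that each free worker takes the next job itself, with no scheduler deciding where the job should go.

```python
import queue
import threading


def worker(name, jobs, results):
    """Hypothetical execute-node loop: pull the next job as soon as we are free."""
    while True:
        job = jobs.get()      # blocks until a job is available
        if job is None:       # sentinel value: no more work for this worker
            break
        job_id, func, args = job
        results.put((job_id, name, func(*args)))


def run_grid(job_list, num_workers=3):
    """Place jobs on a shared queue and let the workers pull them."""
    jobs = queue.Queue()      # stand-in for the AMQP job queue (assumption)
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"node{i}", jobs, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)        # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():
        job_id, _node, value = results.get()
        collected[job_id] = value
    return collected


if __name__ == "__main__":
    # Ten tiny "index calculation" jobs: square a number.
    print(run_grid([(i, pow, (i, 2)) for i in range(10)]))
```

Which worker handles which job is not decided up front; it depends on who becomes free first, which is exactly what removes the scheduling delay between jobs.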
Of course, as soon as you do this for one specific use, sub-second jobs, others see the advantages too. The feature isn't limited to sub-second jobs, and MRG customers from various industries now see its advantages.
There are some considerations. For example, should the entire grid be utilized this way, or should a specific portion of the grid be carved off for low-latency workloads? If the jobs are sub-second and high volume, should they be reported to a management console (this could cause quite a bit of clutter) or just logged? How should failed jobs be managed under different scenarios? In sub-second transactions, for instance, a failed job may have missed its window of being useful, so there is little point in resubmitting it. The answers to these questions will depend on the type of low-latency workload.
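As an illustration of the failed-job question, one possible policy is to give each job a deadline and only resubmit while the result could still be useful. The sketch below is just that, a sketch: the field names (deadline, attempts, max_attempts) and the function handle_failed_job are hypothetical and not part of MRG or Condor.

```python
import time


def handle_failed_job(job, now=None):
    """Decide what to do with a failed job: "resubmit" or "drop".

    `job` is a plain dict with hypothetical fields:
      deadline     - monotonic time after which the result is useless
      attempts     - how many times the job has already been tried
      max_attempts - retry limit for this workload
    """
    if now is None:
        now = time.monotonic()
    if now >= job["deadline"]:
        return "drop"         # the job missed its window; a retry is pointless
    if job["attempts"] >= job["max_attempts"]:
        return "drop"         # retry budget exhausted
    return "resubmit"


if __name__ == "__main__":
    start = time.monotonic()
    job = {"deadline": start + 0.5, "attempts": 1, "max_attempts": 3}
    print(handle_failed_job(job, now=start + 0.1))
    print(handle_failed_job(job, now=start + 0.9))
```

A longer-running workload might set a generous deadline and a higher retry limit, while a sub-second workload would likely drop failures almost immediately.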
If you are interested in more information on this topic please remember to check out Matt's post.

IP Babble is the personal blog of William Henry.