The Scheduler sends events to its SchedListener objects once at initialization, once at shutdown, and on each pass through its main loop. These listeners are the modules listed in the maui.modules property in the maui.properties file.
Many of these modules are critical to the operation of the scheduler: there's a backfill scheduling module, a reservations handler module, and even a resource manager module which handles communications with the node daemons. You can, however, replace some or all of these modules with your own implementations.
If you wish to write your own module, all you have to do is write an implementation of the SchedListener interface and hook your module into the loading chain (add it to the maui.modules property, and include your module's own properties in the maui.properties file). You have full access to the Scheduler interface from your module.
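As a rough illustration, a custom module might look like the sketch below. Note that the method names on SchedListener shown here (initialize, mainLoop, shutdown) are assumptions made so the example is self-contained and mirror the three event points described above; check the actual interface shipped with Maui Scheduler before writing a real module.

```java
// Hypothetical stand-in for Maui's SchedListener interface; the real
// interface ships with the scheduler and its method names may differ.
interface SchedListener {
    void initialize();   // called once when the Scheduler starts
    void mainLoop();     // called on each pass through the main loop
    void shutdown();     // called once when the Scheduler stops
}

// A trivial custom module that counts main-loop passes.
public class LoopCounterModule implements SchedListener {
    private int passes = 0;

    @Override public void initialize() { passes = 0; }
    @Override public void mainLoop()   { passes++; }
    @Override public void shutdown()   { /* flush state, release resources */ }

    public int getPasses() { return passes; }
}
```

To load a module like this, its class name would be added to the maui.modules property alongside the stock modules.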
The default scheduling module takes the list of jobs already prioritized by the scheduler, attempts to make an immediate reservation for the highest-priority job and start it, and repeats this until no more jobs can be started immediately.
Backfilling then begins with the first job that could not be scheduled immediately. Backfilling is the process of making reservations for jobs to run either immediately or at some later time. Maui backfills up to N jobs on the queue; the number of jobs to consider is configurable through the maui.properties file.
Maui Scheduler provides an extensible Matcher interface for implementing various backfill algorithms. We provide a default implementation of first-fit backfilling.
The Slot object is described here. In Fig 2 we see the system state and the job queue state at time t0, before backfilling. Three jobs are currently running (j0, j1, j2): j0 is running on all slots of node n0, j1 is running on node n3 slot p3, and j2 is running on all slots of node n2 plus slots p0 and p1 on node n3.
The process of backfilling will take jobs from the job queue (which are already sorted in highest-priority order) and attempt to either schedule them immediately or make a reservation for them to execute at a later time.
In Fig 3 we see the system state and the job queue state at time t0, after backfilling. The backfill algorithm made a reservation for the first queued job, q0, to start as soon as possible. q0 requires two nodes of 4 slots each (4,4); these resources will be free at time t18, when job j0 finishes. Reservation r0 has been made for job q0, and q0 remains in the job queue.
Queued job q1 requires 2 nodes of 2 slots each (2,2) and resources will be available when job j1 finishes at time t7. Reservation r1 has been made for job q1, and q1 remains in the job queue.
Queued job q2 requires 1 node of 4 slots and will run for only 5 time intervals. Job q2 can be run immediately since it doesn't interfere with any running jobs (j0, j1, j2), nor does it interfere with reservations made for higher-priority jobs q0 and q1. Queued job q2 is removed from the queue, put on the run list, and is run on all slots of node n1 immediately.
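The walkthrough above can be sketched in code. This is a deliberately simplified model, not Maui's actual Matcher implementation: jobs request (nodes, slotsPerNode, runtime), allocations occupy slots on a node over a time interval, and first fit scans candidate start times (now, plus each allocation's end time) for the earliest point at which enough nodes have enough free slots. All class and method names here are illustrative.

```java
import java.util.List;
import java.util.TreeSet;

// Simplified first-fit backfill sketch (not Maui's actual Matcher code).
public class FirstFitBackfill {
    // An allocation occupies `slots` slots on `node` from `start` to `end`.
    record Alloc(int node, int slots, int start, int end) {}
    // A job requests `nodes` nodes with `slotsPerNode` slots each, for `runtime`.
    record Job(String name, int nodes, int slotsPerNode, int runtime) {}

    static final int SLOTS_PER_NODE = 4;

    // Free slots on `node` throughout [t, t + runtime).
    static int freeSlots(List<Alloc> allocs, int node, int t, int runtime) {
        int used = 0;
        for (Alloc a : allocs)
            if (a.node() == node && a.start() < t + runtime && t < a.end())
                used += a.slots();
        return SLOTS_PER_NODE - used;
    }

    // First fit: earliest time >= now at which `job` fits on the system.
    static int earliestStart(List<Alloc> allocs, int numNodes, Job job, int now) {
        TreeSet<Integer> candidates = new TreeSet<>();
        candidates.add(now);
        for (Alloc a : allocs)
            if (a.end() > now) candidates.add(a.end());  // try each job-end time
        for (int t : candidates) {
            int fitting = 0;
            for (int n = 0; n < numNodes; n++)
                if (freeSlots(allocs, n, t, job.runtime()) >= job.slotsPerNode())
                    fitting++;
            if (fitting >= job.nodes()) return t;
        }
        return -1; // unreachable: the system is empty after the last end time
    }
}
```

In the real module, once a reservation is made for a queued job it is added to the allocation state before the next job is considered, which is why a later job such as q2 can run immediately only if it does not interfere with the reservations already made for q0 and q1.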
NOTE: We've assumed that the backfill procedure takes no time. We show the before- and after-backfill states, both at time t0. Of course the procedure will take some time. Users should make sure to account for this scheduling overhead when considering the runtimes of their jobs.
The default reservations module checks the status of the various reservations, both job reservations and sys reservations, activating and deactivating them as needed.
The default resource manager module (AKA Wiki) handles communications with the Node Daemons: starting and stopping jobs, executing remote tasks, and receiving asynchronous status updates. It maintains a separate thread to receive the updates, and must synchronize access to its data structures since two threads operate concurrently within this module.
All socket communications to and from the node daemons are encrypted and authenticated using the same mechanism as between Maui Scheduler Server and command client programs. See here.
This is the method that starts a job on the remote nodes. It is so named since the remote node daemon(s) do not know anything prior to this component loading the job and requesting that it be started.
All node daemon status updates are sent via UDP to reduce the overhead of TCP; the design tolerates missing a few updates. The Wiki module will mark a node Unknown if it hasn't reported back within a certain amount of time (like other parameters, this interval is configurable).
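The combination of the two previous points (a separate receiver thread plus a staleness timeout) can be sketched as follows. This is illustrative, not the actual Wiki module code; a ConcurrentHashMap keeps the receiver thread's writes and the scheduler thread's reads safe without explicit locking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the node staleness check (not the actual Wiki module code).
public class NodeStalenessCheck {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long staleAfterMillis;   // the configurable interval

    public NodeStalenessCheck(long staleAfterMillis) {
        this.staleAfterMillis = staleAfterMillis;
    }

    // Called by the UDP receiver thread whenever a status packet arrives.
    public void recordUpdate(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // Called periodically by the scheduler thread; a node that has never
    // reported, or hasn't reported within the interval, is Unknown.
    public boolean isUnknown(String node, long nowMillis) {
        Long seen = lastSeen.get(node);
        return seen == null || nowMillis - seen > staleAfterMillis;
    }
}
```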
The Wiki module can be configured to use either multicast (if enabled in your kernel) or TCP communications to send commands to the node daemons. For small clusters, TCP is sufficient; for clusters of more than 100 nodes you may see better performance using multicast. There are many configurable parameters for the multicast communications.
The Logger module checks the size of the log files, rotates the logs when they get too big, and optionally compresses the older logs.
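The size-check-and-rotate step can be sketched as below. This is illustrative, not the actual Logger module code; the rotated-file naming and the rotation threshold are assumptions.

```java
import java.io.File;

// Sketch of size-based log rotation (not the actual Logger module code).
public class LogRotator {
    // Pure decision: rotate once the log exceeds maxBytes.
    static boolean needsRotation(long currentBytes, long maxBytes) {
        return currentBytes > maxBytes;
    }

    // Rename e.g. maui.log to maui.log.1; the module would then optionally
    // compress the rotated file. Returns true if a rotation happened.
    static boolean rotateIfNeeded(File log, long maxBytes) {
        if (!needsRotation(log.length(), maxBytes)) return false;
        File rotated = new File(log.getPath() + ".1");
        rotated.delete();               // drop any previous rotation
        return log.renameTo(rotated);
    }
}
```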
The JobChecker module checks for finished or cancelled jobs, performs job accounting on these jobs, and then purges them from the system. In addition, the JobChecker occasionally attempts to restart deferred jobs.
The SystemProfiler module checks the state of the Java runtime, potentially reporting problems. SystemProfiler also checks the state of the local filesystem where log information is being written, to make sure that enough space exists to continue writing logs.
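The filesystem check amounts to comparing remaining usable space against a threshold, roughly as sketched here (illustrative only; the threshold and method names are assumptions, though File.getUsableSpace is the standard Java API for this):

```java
import java.io.File;

// Sketch of the log-space check (not the actual SystemProfiler code).
public class LogSpaceCheck {
    // True if the filesystem holding logDir still has at least
    // minFreeBytes of usable space for writing logs.
    static boolean hasSpaceForLogs(File logDir, long minFreeBytes) {
        return logDir.getUsableSpace() >= minFreeBytes;
    }
}
```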