This means that the scheduler cannot make a connection to the host/port that you specified for the MySQL database. This could be because the MySQL server is not running on that host/port, or because of a firewall, etc.
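To narrow down which of these causes applies, it can help to attempt a plain TCP connection to the database host and port from the scheduler machine. The following is a minimal Python sketch and is not part of the Maui Scheduler; the function name can_connect is made up for illustration, and the host and port you pass in must be the ones from your own configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect("dbhost.example.org", 3306) uses a placeholder host name and MySQL's default port. A False result points at the server not listening or at a firewall, rather than at a permissions problem inside MySQL.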
This means that the scheduler can connect to a MySQL database at the specified host/port but is not authorized to access the Maui Scheduler database. Fix this by either or both of the following:
maui_create_db command) or regrant the access permissions (as
root, run maui_grant_db | mysql -uroot -p maui_db and enter
the MySQL root password when prompted).
maui_genkey script to generate a new key, and then securely copy the key to all nodes in your cluster.
The Maui Scheduler uses a distributed secret key to secure and
authenticate its socket communications. You need to generate this
key and copy it securely to all nodes in your cluster. The default
location for this key is the /etc/maui/maui.key file. The
permissions on this file should be 600, and it should be owned by the
mauid user.
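The ownership and mode described above can be checked on each node with a short script. This is a sketch only, assuming a POSIX system; the helper name check_key_file is hypothetical, and the default owner and mode simply restate the values given above.

```python
import os
import pwd
import stat


def check_key_file(path, expected_owner="mauid", expected_mode=0o600):
    """Return a list of problems found with the key file (empty if none)."""
    problems = []
    info = os.stat(path)
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"mode is {mode:o}, expected {expected_mode:o}")
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # uid has no entry in the password database
    if owner != expected_owner:
        problems.append(f"owner is {owner}, expected {expected_owner}")
    return problems
```

Calling check_key_file("/etc/maui/maui.key") returns an empty list when the file matches the recommendation. It needs to run as a user that is allowed to stat the file.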
This means that the timestamp of the message is significantly older than the local time on the Maui Scheduler server machine. This is usually not due to a replay attack; just reset or resync the date on the offending machine(s).
In addition, this can sometimes mean that the scheduler itself is running very slowly and not processing these messages within the timeout period. This could be due to a problem in the Java VM, the scheduler itself, or other runaway processes on your scheduler node. Try using ps, top, and w to diagnose the problem.
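Both causes come down to the same comparison: the age of a message, measured against the server's clock, exceeds some tolerance. The sketch below illustrates that comparison only. The function name is_stale and the 300-second tolerance are assumptions made for the example and are not taken from the Maui Scheduler source.

```python
import time


def is_stale(message_time, max_age_seconds=300.0, now=None):
    """Return True if message_time (epoch seconds) is older than the allowed age."""
    if now is None:
        now = time.time()
    return (now - message_time) > max_age_seconds
```

With this assumed tolerance, a node whose clock runs ten minutes behind the server would have every message rejected, and so would a correctly timed message that the server takes ten minutes to process.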
This is because you tried to execute a Maui Scheduler command which
requires Maui Scheduler administrator permissions. Add the
appropriate uids to the sched.admins property of the
maui.properties file and restart the scheduler.
Make sure you aren't using the Kaffe Java Virtual Machine. Kaffe has trouble with the database connection code and certain file IO operations. If, however, you do succeed in getting Maui to work with Kaffe, we would love to hear about it!
If you see a permission denied error, something like:
Maui: java.io.FileNotFoundException: /usr/share/maui/maui.properties: Permission denied
Make sure the file is owned and readable by the mauid user.
For more troubleshooting see
here.
This is similar to this problem. Make sure all your nodes have their clocks synchronized to a common source. We suggest ntpd or ntpdate.
Make sure your IWD (Initial Working Directory) exists across all nodes and you can chdir/read it.
Make sure the directories named in the Output, Input, and Error stanzas of your CMD file are writable by you on all nodes.
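The two directory checks above can be scripted and run on each node as the user who submits the job. This is a minimal sketch; the function name check_job_dirs is hypothetical, and the directories you pass in are whatever your own CMD file names.

```python
import os


def check_job_dirs(iwd, output_dirs):
    """Return a list of problems with the working and output directories."""
    problems = []
    if not os.path.isdir(iwd):
        problems.append(f"IWD does not exist: {iwd}")
    elif not os.access(iwd, os.R_OK | os.X_OK):
        problems.append(f"IWD is not readable/searchable: {iwd}")
    for d in output_dirs:
        if not os.path.isdir(d):
            problems.append(f"directory does not exist: {d}")
        elif not os.access(d, os.W_OK | os.X_OK):
            problems.append(f"directory is not writable: {d}")
    return problems
```

Note that os.access tests the permissions of the user running the script, so running it as root will not reveal problems that an ordinary user would hit.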
If you still have trouble, check the maui.log file on the server machines, and the wiki.log files on the drone/compute nodes for any problem output.
For "default" JobType jobs, the Maui Scheduler executes as many unique tasks across the allocated nodes as you've specified in your command file. Each task's output files must have unique names so that they do not clobber the output from your other tasks. In your CMD file, we suggest something like:
Output == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).out"
Error == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).err"
Log == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).log"
NOTE that this isn't a problem for MPI-based jobs, since we only
execute a single task on the head node, assuming that the
mpirun or mpiexec you invoke fans out the tasks
using its own mechanism (like rsh/ssh).
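To see why the per-task variable matters for "default" jobs, the sketch below expands a file name template for several tasks of one job. The substitution function is a hypothetical stand-in for whatever the scheduler does internally; only the variable names MAUI_JOB_ID and MAUI_TASKID come from the suggestion above, and the job ID and zero-based task numbering are assumptions.

```python
def expand(template, job_id, task_id):
    """Substitute the two variables used in the suggested file names."""
    return (template
            .replace("$(MAUI_JOB_ID)", str(job_id))
            .replace("$(MAUI_TASKID)", str(task_id)))


def output_names(template, job_id, task_count):
    """Return the expanded name for each task of one job."""
    return [expand(template, job_id, t) for t in range(task_count)]
```

Under these assumptions, output_names("myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).out", 42, 3) yields three distinct names, while a template without $(MAUI_TASKID) yields the same name three times, which is the clobbering case.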
This is because you're not using the same maui.key file across all nodes in your cluster. This file needs to be stored in a non-networked directory, readable only by the mauid user. By default, the key file is stored in the /etc/maui directory. You should copy this file securely to all nodes in your cluster. You may need to restart the Maui Scheduler server and node daemons for the changes to take effect.
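One way to confirm a mismatch without passing the key itself around is to compute a digest of the key file on each node and compare the results. The sketch below is illustrative only; the function names and the node names in the usage note are hypothetical, and how you collect the digests from the nodes is left to you.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatched_keys(reference_digest, digests_by_node):
    """Return the sorted node names whose digest differs from the reference."""
    return sorted(node for node, d in digests_by_node.items()
                  if d != reference_digest)
```

For instance, with the server's digest as the reference and a dictionary such as {"node01": ..., "node02": ...} built from each node's file_digest("/etc/maui/maui.key") output, mismatched_keys lists the nodes that need a fresh copy of the key.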
unm.maui.wiki.WikiResponse
java.lang.ClassCastException: unm.maui.wiki.WikiResponse
    at unm.maui.wikid.NodeDaemonImpl.readWikiComm(NodeDaemonImpl.java:XXX)
Similar to above, this is due to the secret keys not being synchronized properly between the Scheduler and the Node Daemons. Generate your key once (with
maui_genkey) and then securely copy it to
all other nodes in your cluster.
This indicates a problem in NIS (or YP) or with your password and group files. Basically, the scheduler cannot find the real user or the mauid user. Please make sure that you have added the mauid user to your UNIX user tables.
mauictl or nodectl!
You need to make sure you are listed as one of the administrators of the scheduler in the sched.admins property of the maui.properties file. If you are not, or if there is some other problem preventing a clean shutdown, you can just kill the scheduler processes running as the mauid user. When restarted, the scheduler will come back to the last checkpointed state.
NOTE: if you are listed as a scheduler administrator and you still are having problems shutting down, it could be because of a mismatch of secret key files, a bug in the scheduler, or even a Java VM problem. Again, you can just go ahead and kill the scheduler processes.
You might see something like the following in the logfile:
03/11 17:38:00 (ReservationsMod.java:207) java.sql.SQLException: Error during query: Unexpected Exception: java.lang.IllegalArgumentException message given: Packet is larger than max_allowed_packet from server configuration of 1048576 bytes
java.sql.SQLException: Error during query: Unexpected Exception: java.lang.IllegalArgumentException message given: Packet is larger than max_allowed_packet from server configuration of 1048576 bytes
    at org.gjt.mm.mysql.Connection.execSQL(Unknown Source)
    at org.gjt.mm.mysql.Connection.execSQL(Unknown Source)
    at org.gjt.mm.mysql.Statement.executeUpdate(Unknown Source)
    at org.gjt.mm.mysql.jdbc2.Statement.executeUpdate(Unknown Source)
    at unm.maui.db.MauiMySQL.checkpoint(MauiMySQL.java:675)
    at unm.maui.sched.ReservationsMod.event(ReservationsMod.java:197)
    at unm.maui.sched.Sched.fireLoop(Sched.java:922)
    at unm.maui.sched.Sched.run(Sched.java:306)
    at java.lang.Thread.run(Thread.java:498)
Please edit your /etc/my.cnf MySQL configuration file and
make it look something like the following. (You may not have to add
the [mysqld] stanza if one already exists; just put the line
right below it.)
[mysqld]
set-variable = max_allowed_packet=16M
Just make sure the max_allowed_packet is large. 16M should be more than enough!
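To put the numbers in perspective, the limit of 1048576 bytes reported in the log is exactly 1M, and 16M is sixteen times that. The sketch below converts MySQL-style size values to bytes so that a configured value can be compared with the limit in an error message. The function name parse_size is made up for this example; the K, M, and G suffixes are treated as powers of 1024, as MySQL does.

```python
import re

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '16M' or '1048576' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

Here parse_size("16M") gives 16777216, which comfortably exceeds the 1048576-byte limit shown in the log.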
This may be a problem with the checkpoint table. You can flush the checkpoint table with the maui_flush script. WARNING: this script will cause the scheduler to lose track of jobs, reservations, and nodes! This script basically just stops the scheduler and issues the SQL command DELETE FROM checkpoint;, removing all saved state.
The next time the scheduler starts it will have a clean slate. It may take some time for it to get fully back online (waiting to receive updates from the node daemons).
This could be due to several configuration problems. Make sure that all your nodes are forward and reverse resolvable in DNS from the head node where the scheduler process is running.
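The forward and reverse lookups can be tested from the head node with a few lines of Python. This is a sketch under assumptions: the function name check_dns is hypothetical, the resolver functions are passed in as parameters only so that they can be swapped out, and the check confirms that both lookups succeed without insisting that the reverse name equals the original.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward lookup followed by a reverse lookup."""
    try:
        address = forward(hostname)
    except OSError as exc:
        return False, f"forward lookup failed: {exc}"
    try:
        name, _, _ = reverse(address)
    except OSError as exc:
        return False, f"reverse lookup of {address} failed: {exc}"
    return True, f"{hostname} -> {address} -> {name}"
```

Running check_dns for each node name in turn, on the machine where the scheduler process runs, shows which nodes fail either direction.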
If you've configured the scheduler to use multicasting, please make sure that all your nodes have kernels configured for multicast. Make sure there isn't any firewall or packet filter that might be blocking the communication ports (both UDP & TCP).