
3. Troubleshooting

3.1 Logfile maui.out prints Cannot connect to MySQL server on HOST:PORT. Is there a MySQL server running on the machine/port you are trying to connect to?

This means that the scheduler cannot make a connection to the host/port that you specified for the MySQL database. This could be because the MySQL server is not running on that host/port, or because a firewall or similar network problem is blocking the connection.
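
To tell "server not running" apart from a scheduler-side problem, it can help to test the TCP connection by hand from the scheduler machine. The following is a minimal sketch, not part of the Maui Scheduler; the host name and port in the example are placeholders (3306 is the usual MySQL default) and should be replaced with the values from your own configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute the host and port you configured.
    host, port = "localhost", 3306
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

A result of "NOT reachable" points at the server or the network (including firewalls) rather than at the scheduler's database credentials.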

3.2 Logfile maui.out prints Invalid authorization specification: Access denied for user: 'mauid@localhost' (Using password: YES)

This means that the scheduler can connect to a MySQL database at the specified host/port but is not authorized to access the Maui Scheduler database. Fix this by either or both of the following:

3.3 Each scheduler command prints a SECURITY WARNING! You're using a well-known default key, all your communications could be easily snooped! Please run the maui_genkey script to generate a new key, and then securely copy the key to all nodes in your cluster.

The Maui Scheduler uses a distributed secret key to secure and authenticate its socket communications. You need to generate this key and copy it securely to all nodes in your cluster. The default location for this key is the /etc/maui/maui.key file. The file should have permissions 600 and be owned by the mauid user.
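
One way to verify the permissions and ownership described above is a small script run on each node. This is only a sketch under the assumption that the key lives at the default path and that a mauid account exists locally; the function name key_file_problems is made up for this example.

```python
import os
import stat


def key_file_problems(path, owner_uid=None):
    """Return a list of human-readable problems with the key file at path."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]
    problems = []
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    if owner_uid is not None and st.st_uid != owner_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {owner_uid}")
    return problems


if __name__ == "__main__":
    import pwd

    try:
        uid = pwd.getpwnam("mauid").pw_uid
    except KeyError:
        uid = None
        print("WARNING: no mauid user found on this machine")
    # Default key location named in the text above.
    for problem in key_file_problems("/etc/maui/maui.key", owner_uid=uid):
        print("WARNING:", problem)
```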

3.4 The logfile prints SECURITY: old message! possible replay attack

This means that the timestamp of the message is significantly older than the local time on the Maui Scheduler server machine. This is usually not due to a replay attack; just reset or resync the date on the offending machine(s).

In addition, this can sometimes mean that the scheduler itself is running very slowly and not processing these messages within the timeout period. This could be due to a problem in the Java VM, the scheduler itself, or other runaway processes on your scheduler node. Try using ps, top, and w to diagnose the problem.
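
Both causes (a slow clock on the sender and a slow scheduler) lead to the same symptom, which a freshness check of the following general shape makes visible. This is an illustration only: the scheduler's actual check and timeout are not documented here, and the 300-second max_age is an invented value.

```python
import time


def is_stale(message_time, now=None, max_age=300.0):
    """Return True if message_time is more than max_age seconds behind now.

    max_age is a hypothetical tolerance, not the scheduler's real timeout.
    """
    if now is None:
        now = time.time()
    return (now - message_time) > max_age


# A sender whose clock runs 10 minutes slow stamps a message "600 s ago".
server_now = 1_000_000.0
print(is_stale(server_now - 600, now=server_now))  # clock skew: stale
# A correctly stamped message that sat unprocessed for 10 minutes looks the same.
print(is_stale(server_now, now=server_now + 600))  # slow scheduler: stale
```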

3.5 A few commands print UNAUTHORIZED auth=UID

This is because you tried to execute a Maui Scheduler command which requires Maui Scheduler administrator permissions. Add the appropriate uids to the sched.admins property of the maui.properties file and restart the scheduler.

3.6 Errors Starting Daemon(s)

Make sure you aren't using the Kaffe Java Virtual Machine. Kaffe has trouble with the database connection code and certain file IO operations. If, however, you do succeed in getting Maui to work with Kaffe, we would love to hear about it!

If you see a permission denied error, something like:

        Maui: java.io.FileNotFoundException: /usr/share/maui/maui.properties: Permission denied

Make sure the file is owned and readable by the mauid user. For more troubleshooting see here.

3.7 Client command prints out Request timestamp is too old!

This is similar to the problem described in section 3.4. Make sure all your nodes have their clocks synchronized to a common source. We suggest ntpd or ntpdate.

3.8 There's no output for my job!

Make sure your IWD (Initial Working Directory) exists across all nodes and you can chdir/read it.

Make sure the directories named in the Output, Input, and Error stanzas of your CMD file are writable by you on all nodes.

If you still have trouble, check the maui.log file on the server machines, and the wiki.log files on the drone/compute nodes for any problem output.
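
The directory checks above can be scripted and run as the submitting user on every node. The sketch below is an assumption-laden helper, not a Maui tool: the function name is made up, and it assumes that relative output paths are resolved against the IWD.

```python
import os


def dir_problems(iwd, output_paths):
    """Return a list of problems with the IWD and the output file locations."""
    problems = []
    if not os.path.isdir(iwd):
        problems.append(f"IWD {iwd} does not exist or is not a directory")
    elif not os.access(iwd, os.R_OK | os.X_OK):
        problems.append(f"IWD {iwd} cannot be read or entered")
    for path in output_paths:
        # Assumption: relative paths are interpreted relative to the IWD.
        parent = os.path.dirname(os.path.join(iwd, path))
        if not os.path.isdir(parent):
            problems.append(f"directory {parent} does not exist")
        elif not os.access(parent, os.W_OK | os.X_OK):
            problems.append(f"directory {parent} is not writable")
    return problems


if __name__ == "__main__":
    # Hypothetical example values; use the ones from your own CMD file.
    for problem in dir_problems(os.getcwd(), ["myjob.out", "myjob.err"]):
        print("PROBLEM:", problem)
```

An empty result on one node says nothing about the others, so the check needs to be repeated on each node the job may run on.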

3.9 Job output for my "default" JobType job is overwritten!

For "default" JobType jobs, the Maui Scheduler executes as many unique tasks across the allocated nodes as you've specified in your command file. Each task's output files must have unique names so that the tasks do not clobber one another's output. In your CMD file, we suggest something like:

        Output == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).out"
        Error  == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).err"
        Log    == "myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).log"

NOTE that this isn't a problem for MPI-based jobs since we only execute a single task on the head node, assuming that the mpirun or mpiexec you invoke fans out the tasks using its own mechanism (like rsh/ssh).
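
To see why the task ID matters, the following sketch mimics the $(NAME) substitution for a three-task job. It is not the scheduler's expansion code; the job ID 42 and the task IDs 0 to 2 are invented for illustration.

```python
import re


def expand(template, values):
    """Replace each $(NAME) in template with values[NAME]."""
    return re.sub(r"\$\((\w+)\)", lambda m: str(values[m.group(1)]), template)


# Hypothetical job with three tasks.
tasks = [{"MAUI_JOB_ID": 42, "MAUI_TASKID": t} for t in range(3)]

unique = [expand("myjob-$(MAUI_JOB_ID)-$(MAUI_TASKID).out", v) for v in tasks]
shared = [expand("myjob-$(MAUI_JOB_ID).out", v) for v in tasks]

print(unique)  # ['myjob-42-0.out', 'myjob-42-1.out', 'myjob-42-2.out']
print(shared)  # ['myjob-42.out', 'myjob-42.out', 'myjob-42.out']
```

Without the task ID in the name, all three tasks resolve to the same file, which is the overwriting described in this section.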

3.10 Command output is gibberish on terminal (or in log file)!

This is because you're not using the same maui.key file across all nodes in your cluster. This file needs to be stored in a non-networked directory, readable only by the mauid user. By default, the key file is stored in the /etc/maui directory. You should copy this file securely to all nodes in your cluster. You may need to restart the Maui Scheduler server and node daemons for the changes to take effect.
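
A quick way to confirm that every node holds the same key is to compare a digest of the file on each machine. The sketch below assumes the default key path and must be run as a user who can read the key (normally mauid or root); it is a helper written for this section, not a tool shipped with the scheduler.

```python
import hashlib


def key_digest(path):
    """Return the SHA-256 hex digest of the file at path."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


if __name__ == "__main__":
    path = "/etc/maui/maui.key"  # default location named in the text
    try:
        print(key_digest(path), path)
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
```

If the digests printed on two nodes differ, those nodes do not share the same key and one of them needs a fresh copy.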

3.11 Scheduler and/or Node Daemon logfile reports ClassCastException in communications

unm.maui.wiki.WikiResponsejava.lang.ClassCastException:
unm.maui.wiki.WikiResponse at unm.maui.wikid.NodeDaemonImpl.readWikiComm(NodeDaemonImpl.java:XXX)

Similar to above, this is due to not having the secret keys synchronized properly between Scheduler and Node Daemons. Generate your key once (with maui_genkey) and then securely copy it to all other nodes in your cluster.

3.12 Scheduler command prints Can't getgrgid for gid=NUM, or Can't getpwuid for uid=NUM, or similar.

This indicates a problem in NIS (or YP) or with your password and group files. Basically it cannot find the real user or the mauid user. Please make sure that you have added the mauid user to your UNIX user tables.
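
The same lookups can be reproduced outside the scheduler. Python's pwd and grp modules go through the system's normal account lookup, so a failure here suggests the problem is in the machine's user tables rather than in Maui. The function names below are made up for this sketch.

```python
import grp
import os
import pwd


def user_exists(name):
    """Return True if the account name resolves on this machine."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def uid_resolves(uid):
    """Return True if the numeric uid maps to an account."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def gid_resolves(gid):
    """Return True if the numeric gid maps to a group."""
    try:
        grp.getgrgid(gid)
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    print("mauid user exists:", user_exists("mauid"))
    print("my uid resolves:  ", uid_resolves(os.getuid()))
    print("my gid resolves:  ", gid_resolves(os.getgid()))
```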

3.13 Help! I cannot shut down the scheduler or nodedaemon using mauictl or nodectl!

You need to make sure you are listed as one of the administrators of the scheduler in the sched.admins property of the maui.properties file. If you are not, or if there is some other problem preventing a clean shutdown, you can just kill the scheduler processes running as the mauid user. When restarted, the scheduler will come back to the last checkpointed state.

NOTE: if you are listed as a scheduler administrator and you still are having problems shutting down, it could be because of a mismatch of secret key files, a bug in the scheduler, or even a Java VM problem. Again, you can just go ahead and kill the scheduler processes.

3.14 Logfile maui.out prints Packet is larger than max_allowed_packet from server, and scheduler cannot checkpoint!

You might see something like the following in the logfile:

03/11 17:38:00 (ReservationsMod.java:207) java.sql.SQLException: Error during query: Unexpected Exception: java.lang.IllegalArgumentException message given: Packet is larger than max_allowed_packet from server configuration of 1048576 bytes
java.sql.SQLException: Error during query: Unexpected Exception: java.lang.IllegalArgumentException message given: Packet is larger than max_allowed_packet from server configuration of 1048576 bytes
   at org.gjt.mm.mysql.Connection.execSQL(Unknown Source)
   at org.gjt.mm.mysql.Connection.execSQL(Unknown Source)
   at org.gjt.mm.mysql.Statement.executeUpdate(Unknown Source)
   at org.gjt.mm.mysql.jdbc2.Statement.executeUpdate(Unknown Source)
   at unm.maui.db.MauiMySQL.checkpoint(MauiMySQL.java:675)
   at unm.maui.sched.ReservationsMod.event(ReservationsMod.java:197)
   at unm.maui.sched.Sched.fireLoop(Sched.java:922)
   at unm.maui.sched.Sched.run(Sched.java:306)
   at java.lang.Thread.run(Thread.java:498)

Edit your /etc/my.cnf MySQL configuration file so that it looks something like the following. (If a [mysqld] stanza already exists, you do not need to add another; just put the line right below it.)

[mysqld]
set-variable    = max_allowed_packet=16M

Just make sure the max_allowed_packet is large. 16M should be more than enough!
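
For a sense of scale, the limit reported in the log (1048576 bytes) is exactly 1M, so 16M raises it sixteenfold. The small sketch below does the arithmetic, assuming the usual MySQL convention that the K, M, and G suffixes are powers of 1024; parse_size is a helper written for this example.

```python
def parse_size(text):
    """Convert a size such as '16M' or '1048576' to a number of bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


old_limit = 1048576            # value reported in the log above
new_limit = parse_size("16M")  # value suggested for my.cnf

print(new_limit)               # 16777216
print(new_limit // old_limit)  # 16
```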

3.15 Scheduler encounters the same problem each time it's restarted!

This may be a problem with the checkpoint table. You can flush the checkpoint table with the maui_flush script. WARNING: this script will cause the scheduler to lose track of jobs, reservations, and nodes! This script basically just stops the scheduler and issues the SQL command DELETE FROM checkpoint;, removing all saved state.

The next time the scheduler starts it will have a clean slate. It may take some time for it to get fully back online (waiting to receive updates from the node daemons).

3.16 I'm not seeing all the nodes in my cluster or some/one of the nodes shows up but its features/values keep changing!

This could be due to several configuration problems. Make sure that all your nodes are forward and reverse resolvable in DNS from the head node where the scheduler process is running.
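
Forward and reverse resolution can be checked from the head node with a short script. This is a sketch only: it covers IPv4, the function name is made up, and it treats a match on the short host name as good enough, which may be more lenient than your site requires.

```python
import socket


def dns_round_trip(hostname):
    """Return (ok, detail) for a forward then reverse lookup of hostname."""
    try:
        address = socket.gethostbyname(hostname)  # forward (IPv4 only)
        reverse_name, aliases, _ = socket.gethostbyaddr(address)  # reverse
    except OSError as exc:
        return False, f"{hostname}: lookup failed ({exc})"
    names = {reverse_name.lower(), *(a.lower() for a in aliases)}
    short_names = {n.split(".")[0] for n in names}
    wanted = hostname.lower()
    # Assumption: matching the short host name counts as success.
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {address} -> {reverse_name}"
    return False, f"{hostname} -> {address} -> {reverse_name} (name mismatch)"


if __name__ == "__main__":
    import sys

    # Pass your node names as arguments, e.g. the hosts in your cluster.
    for node in sys.argv[1:]:
        ok, detail = dns_round_trip(node)
        print("OK  " if ok else "FAIL", detail)
```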

If you've configured the scheduler to use multicasting, please make sure that all your nodes have kernels configured for multicast. Make sure there isn't any firewall or packet filter that might be blocking the communication ports (both UDP & TCP).

