

7.6 Troubleshooting


If I see PERMISSION DENIED in my log files, what does that mean?

Most likely, the Condor installation is misconfigured, and Condor's access control security is preventing daemons and tools from communicating with each other. Other symptoms of this problem include Condor tools (such as condor_status and condor_q) producing no output, and commands (such as condor_off or condor_on) appearing to have no effect.

The solution is to properly configure the HOSTALLOW_* and HOSTDENY_* settings (for host/IP-based authentication), or to configure strong authentication and set ALLOW_* and DENY_* as appropriate. Host-based authentication is described in section 3.6.8. Information about other forms of authentication is provided in section 3.6.1.


What happens if the central manager crashes?

If the central manager crashes, jobs that are already running continue to run unaffected. Queued jobs remain in the queue unharmed, but cannot begin running until the central manager is restarted and begins matchmaking again. Nothing special needs to be done after the central manager is brought back online.


Why did the condor_schedd daemon die and restart?

The condor_schedd daemon receives signal 25, dies, and is restarted when the history file reaches the 2 Gbyte size limit. Until Condor supports a larger history file or rotation of the history file, try one of these workarounds:

  1. When the history file becomes large, remove it. Note that this loses the information in the history file, but the condor_schedd daemon will not die.
  2. When the history file becomes large, move it.
  3. Stop keeping the history. Only condor_history accesses the history file, so only that functionality is lost. To stop keeping the history, place
    HISTORY=
    
    in the configuration, then run condor_reconfig so that the currently executing daemons recognize the change.
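
The second workaround can be automated with a small script run periodically, for example from cron. The following is a minimal sketch and not part of Condor: the function name rotate_history, the default threshold of roughly 1.5 Gbytes, and the timestamp suffix are all assumptions made for illustration. The path of the history file is whatever your configuration uses and must be passed in.

```python
import os
import shutil
import time


def rotate_history(history_path, max_bytes=1500 * 1024 * 1024):
    """Move the history file aside once it reaches max_bytes.

    Hypothetical helper, not a Condor tool. Returns the new path if the
    file was moved, otherwise None.
    """
    try:
        size = os.path.getsize(history_path)
    except OSError:
        return None  # no history file present, nothing to do
    if size < max_bytes:
        return None
    # Timestamped name so repeated rotations do not overwrite each other.
    rotated = "%s.%s" % (history_path, time.strftime("%Y%m%d-%H%M%S"))
    shutil.move(history_path, rotated)
    return rotated
```

Keeping the threshold comfortably below 2 Gbytes leaves room for the file to grow between runs of the script.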

When I ssh/telnet to a machine to check particulars of how Condor is doing something, it is always vacating or unclaimed when I know a job had been running there!

Depending on how your policy is set up, Condor will track any tty on the machine for the purpose of determining if a job is to be vacated or suspended on the machine. It could be the case that after you ssh there, Condor notices activity on the tty allocated to your connection and then vacates the job.

What is wrong? I get no output from condor_status, but the Condor daemons are running.

One likely error message within the collector log of the form

DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 0 (UPDATE_STARTD_AD)
indicates a permissions problem. The condor_startd daemons do not have write permission to the condor_collector daemon. This could be because you used domain names in your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros, but the domain name server (DNS) is not properly configured at your site. Without the proper configuration, Condor cannot resolve the IP addresses of your machines into fully-qualified domain names (an inverse lookup). If this is the problem, then the solution takes one of two forms:
  1. Fix the DNS so that inverse lookups (getting the domain name from an IP address) work for your machines. You can either fix the DNS itself, or use the DEFAULT_DOMAIN_NAME setting in your Condor configuration file.
  2. Use numeric IP addresses in the HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros instead of domain names. As an example, assume your site has machines such as foo.your.domain.com on two subnets, with IP addresses 129.131.133.10 and 129.131.132.10. If the configuration macro is set as

     HOSTALLOW_WRITE = *.your.domain.com
    

    and this does not work, use

     HOSTALLOW_WRITE = 129.131.133.*, 129.131.132.*
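
To decide which of the two forms applies, it can help to check whether inverse lookups work from the central manager. The sketch below uses Python's standard socket.gethostbyaddr call; the helper names reverse_lookup and is_fully_qualified are hypothetical, and the "contains a dot" test is only a rough assumption about what counts as a fully-qualified name.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the host name the resolver reports for ip, or None on failure.

    The resolver argument exists only so the lookup can be replaced in tests.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname


def is_fully_qualified(hostname):
    """Rough check (assumption): a fully-qualified name contains a dot."""
    return hostname is not None and "." in hostname
```

For example, calling reverse_lookup("129.131.133.10") on the central manager and getting None, or a short name without the domain, suggests that domain names in HOSTALLOW_WRITE will not match that machine.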
    

Alternatively, this permissions problem may be caused by being too restrictive in the setting of your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros. If it is, then the solution is to change the macros, for example from

 HOSTALLOW_WRITE = condor.your.domain.com
to
 HOSTALLOW_WRITE = *.your.domain.com
or possibly
 HOSTALLOW_WRITE = condor.your.domain.com, foo.your.domain.com, \
 bar.your.domain.com

Another likely error message within the collector log of the form

DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 5 (QUERY_STARTD_ADS)
indicates a similar problem, but with read permission rather than write permission. Use the solutions given above, applied to the HOSTALLOW_READ and HOSTDENY_READ configuration macros.
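
When the collector log is long, a short script can show which hosts are being denied and for which commands. This is a sketch only: the regular expression is derived from the two message forms shown above, the function name summarize_denials is hypothetical, and real log lines may carry extra prefixes such as timestamps, which the pattern tolerates by searching within each line.

```python
import re
from collections import Counter

# Pattern assumed from the message forms quoted above.
DENIED = re.compile(
    r"PERMISSION DENIED to host <(?P<host>[^>]+)> "
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)


def summarize_denials(lines):
    """Count PERMISSION DENIED entries per (host, command name)."""
    counts = Counter()
    for line in lines:
        match = DENIED.search(line)
        if match:
            counts[(match.group("host"), match.group("name"))] += 1
    return counts
```

Pass it an open log file, for instance summarize_denials(open(path)) where path is the location of your collector log. Entries for UPDATE_STARTD_AD point at the write settings, and entries for QUERY_STARTD_ADS point at the read settings.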

Why does Condor leave mail processes around?

Under the FreeBSD and Mac OS X operating systems, misconfiguration of a system's outgoing mail causes Condor to inadvertently leave paused and zombie mail processes around when it attempts to send notification e-mail. The solution to this problem is to correct the mailer configuration.

Execute the following command as the user under which Condor daemons run to determine whether outgoing e-mail works.

$ uname -a | mail -v your@emailaddress.com

If no e-mail arrives, then outgoing e-mail does not work correctly.

Note that this problem does not manifest itself on non-BSD Unix platforms, such as Linux.


condor-admin@cs.wisc.edu