NOTE: This document refers to the Job Engine included in CloudBolt versions before 7.7. To troubleshoot newer versions of the Job Engine, refer to the documentation here: http://docs.cloudbolt.io/job-engine.html#how-do-i-troubleshoot-my-job-engine
Failures in the CloudBolt Job Engine are rare; the engine is designed to be resilient against most types of provider-specific failures. However, user-written actions can cause problems in the job engine, and you may need to troubleshoot them.
What is the normal progression of job statuses?
- PENDING → QUEUED → RUNNING
- If canceled: TO_CANCEL → CANCELED
- If the job completes: SUCCESS -or- WARNING -or- FAILURE (for actions, the status is determined by what the action returns)
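The status names used throughout this article can be modeled as a small transition table, which is handy when checking whether a job's observed history looks normal. This is an illustrative sketch only: `ALLOWED_TRANSITIONS` and `is_normal_progression` are hypothetical names, not CloudBolt APIs, and the table only models cancelation from RUNNING because that is the case this article describes.

```python
# Illustrative model only; the names below are not CloudBolt APIs.
ALLOWED_TRANSITIONS = {
    "PENDING": {"QUEUED"},
    "QUEUED": {"RUNNING"},
    "RUNNING": {"TO_CANCEL", "SUCCESS", "WARNING", "FAILURE"},
    "TO_CANCEL": {"CANCELED"},
    # Terminal statuses: no further transitions are expected.
    "CANCELED": set(),
    "SUCCESS": set(),
    "WARNING": set(),
    "FAILURE": set(),
}


def is_normal_progression(statuses):
    """Return True if the observed status history follows the normal order so far."""
    if not statuses or statuses[0] != "PENDING":
        return False
    for current, following in zip(statuses, statuses[1:]):
        if following not in ALLOWED_TRANSITIONS.get(current, set()):
            return False
    return True
```

A history that skips a step, such as PENDING followed directly by RUNNING, is reported as abnormal.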
How does the job engine run and determine jobs to run?
- A root cron job that runs once a minute which...
- Note: It runs every minute so that memory leaks from custom actions cannot break the job engine, and so that it restarts frequently enough for the restart to be barely noticeable to the end user.
- Runs runjobs.sh, which checks for locks, then...
- Runs runjobs.py
- That picks jobs up from PENDING status and runs them (see next section for details)
- When there are no more jobs to run and it is within five seconds of the next minute, runjobs.py exits, then runjobs.sh exits
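The exit rule in the last step can be sketched as a small predicate. This is a minimal illustration of the timing logic as described above, not the actual runjobs.py code; `should_exit`, `seconds_until_next_minute`, and the `jobs_remaining` parameter are hypothetical names.

```python
from datetime import datetime

# From the article: exit when "within five seconds of the next minute".
EXIT_WINDOW_SECONDS = 5


def seconds_until_next_minute(now):
    """Seconds remaining before the wall clock rolls over to the next minute."""
    return 60 - (now.second + now.microsecond / 1_000_000)


def should_exit(now, jobs_remaining):
    """Exit only when no jobs remain and the next cron run is imminent."""
    return jobs_remaining == 0 and seconds_until_next_minute(now) <= EXIT_WINDOW_SECONDS
```

For example, `should_exit(datetime(2020, 1, 1, 12, 0, 56), 0)` is true (four seconds remain), while the same call at 12:00:30 is false. A job that never finishes keeps `jobs_remaining` above zero, which is why a hung job prevents the engine from exiting.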
How does each job get processed by the job engine?
- The UI creates a PENDING Job
- runjobs.py finds it and sets it to QUEUED
- runjobs.py spawns a new thread and changes the job to RUNNING
- job.run() executes, writes progress back to CloudBolt
- (If the 'Cancel Job' button is clicked in the UI at this point, the job is set to TO_CANCEL, an Exception is raised inside the job's thread, and then the job is set to CANCELED.)
- run() completes, returns results to runjobs.py
- Job status gets set to SUCCESS, WARNING, or FAILURE depending on the return from run()
- runjobs.py writes the results to CloudBolt
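The sequence above can be illustrated with a toy model. Everything here is a hypothetical stand-in rather than CloudBolt code: the `Job` class, the `process` function, and the assumption that `run()` returns a bare status string are for illustration only. Cancelation is omitted, and treating an unhandled exception as FAILURE is also an assumption.

```python
import threading

FINAL_STATUSES = {"SUCCESS", "WARNING", "FAILURE"}


class Job:
    """Hypothetical stand-in for a job record; not the CloudBolt model."""

    def __init__(self, job_id, action):
        self.id = job_id
        self.action = action
        self.history = ["PENDING"]  # the UI creates the job as PENDING

    def set_status(self, status):
        self.history.append(status)

    def run(self):
        # Assumption: the action returns a status string.
        return self.action()


def process(job):
    """Queue the job, then run it in its own thread and record the result."""
    job.set_status("QUEUED")

    def worker():
        job.set_status("RUNNING")
        try:
            result = job.run()
        except Exception:
            result = "FAILURE"  # assumption: unhandled errors count as failures
        job.set_status(result if result in FINAL_STATUSES else "FAILURE")

    thread = threading.Thread(target=worker)
    thread.start()
    return thread
```

Calling `process(Job(1, lambda: "WARNING")).join()` leaves the job with the history PENDING, QUEUED, RUNNING, WARNING. If `run()` never returns, the thread never finishes and the job stays in RUNNING, which matches the hung-job symptom discussed later in this article.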
What can go wrong with my job engine?
- It could be failing to start up (jobs never go from pending → queued)
- e.g. If you have disabled or modified the cron job on your CloudBolt server
- e.g. If there are problems with the runjobs.sh or runjobs.py scripts (to test this, disable the cron job and run the scripts interactively)
- There could be problems in the underlying database (and running runjobs.sh interactively would also help troubleshoot this)
- A job could run indefinitely, in which case the job engine never exits
- This can be caused by infinite loops or deadlocking issues within your scripts or jobs
- The job engine could get in a bad state where it changes jobs pending → queued, but never runs them
- This can also be a byproduct of deadlock or other hung jobs
- A job could be spawning more threads, and never cleaning them up (causing the job engine to queue, but not run new jobs)
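If a custom action needs to spawn its own threads, one way to avoid the leak described in the last bullet is to join every thread with a timeout and report any that are still alive. This is a general Python pattern, not a CloudBolt requirement; `run_with_cleanup` and its parameters are hypothetical names.

```python
import threading


def run_with_cleanup(tasks, timeout=30):
    """Run each callable in its own thread and wait for all of them.

    Returns the names of threads that were still alive after the timeout,
    so the caller can log them or fail the job instead of leaking them silently.
    """
    # Daemon threads do not keep the parent process alive if they hang.
    threads = [threading.Thread(target=task, daemon=True) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout)
    return [thread.name for thread in threads if thread.is_alive()]
```

An empty return value means every thread finished. Note that the timeout applies to each join in turn, so the worst-case wait is `timeout` multiplied by the number of threads.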
How do I troubleshoot my job engine?
- Answer the following questions to understand the scope of your issues:
- Which job or order caused you to notice the problem? Note its job number, status, and URL.
- Which related jobs are having problems? Note their job numbers, status, and URLs.
- Which job(s) started running most recently? Note their job numbers, status, and URLs.
- Have other jobs started running after the jobs you noted above?
- Are some jobs in a different status from others? Note this and review the above descriptions of the sequence of how the job engine runs to understand which job was the last job to run.
- Are your jobs in a PENDING state? This indicates that runjobs.py has not restarted recently. Wait two minutes for the job engine to restart and pick up the job. If it does not begin running, refresh the page. If the job is still not running, check whether the job engine is still running other jobs (see next section).
- Are your jobs in a QUEUED state? This indicates the job engine has identified it should run a job but has not started running it yet. Check whether the job engine is still running (see next section).
- Are your jobs in a RUNNING state? This indicates the job engine has started running the job, and is either still running the job correctly or the job engine has crashed. Check whether the job engine is still running (see next section). If the job is still running, review the job's log to see what its last step was and whether there are any error messages.
- Are your jobs in a TO_CANCEL state? This indicates the job has been marked for cancelation and the job engine will recognize this soon and cancel the job. If the job never enters the CANCELED state, check whether the job engine is still running (see next section).
- Are your jobs in a CANCELED state? This indicates the job engine has canceled those jobs successfully and should be available to run future jobs. Try running a new job: if it runs correctly, there is no problem and your job engine is behaving as designed.
- Are your jobs in a FAILURE, WARNING, or SUCCESS state? Your jobs completed running and returned this status, so the job engine is working properly. Troubleshoot the underlying job for any failures you are seeing.
- If the above steps do not help, create a support ticket and describe which of the above steps you took as well as the specific job numbers (with your expected and their actual status for each), and attach your application.log, jobengine.log, and each job's individual /var/log/cloudbolt/jobs/12345.log file. Provide screenshots of how the jobs appear in the UI, as well as the output from ps -ef | grep runjobs.
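When preparing a support ticket, the list of log files to attach can be built from the job numbers you noted. The sketch below assumes that application.log lives in /var/log/cloudbolt alongside jobengine.log; this article only gives the paths for jobengine.log and the per-job logs, so verify the location on your server. `support_log_paths` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

LOG_DIR = PurePosixPath("/var/log/cloudbolt")


def support_log_paths(job_ids, log_dir=LOG_DIR):
    """Return the log file paths to attach to a support ticket."""
    paths = [
        # Assumption: application.log sits next to jobengine.log.
        log_dir / "application.log",
        log_dir / "jobengine.log",
    ]
    # One log per job, e.g. /var/log/cloudbolt/jobs/12345.log
    paths.extend(log_dir / "jobs" / f"{job_id}.log" for job_id in job_ids)
    return [str(path) for path in paths]
```

The function only builds path strings; it does not check that the files exist or read them.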
How do I check whether the job engine is running?
- Run ps -ef | grep runjobs and look to see how long runjobs has been running.
- If runjobs is running:
- Look for QUEUED, TO_CANCEL, or RUNNING jobs in CloudBolt. These jobs should be run by the current runjobs instance.
- If runjobs has been running for a long time, look at the log output from the jobs from the previous step to see if one of them has hung.
- Review the /var/log/cloudbolt/jobengine.log to see which jobs the job engine has picked up and whether any error messages are present.
- If runjobs is not running:
- Run tail -f /var/log/cloudbolt/jobengine.log and wait two minutes to see if the cron job runs and outputs messages to this log file.
- No runjobs process means the cron job has not kicked off yet, the cron job failed to run, or the cron job is disabled/missing.
- Check your cron jobs, try running runjobs.sh manually, and check the jobengine.log file for relevant errors.
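The process check above can also be scripted. The sketch below is one possible approach, not a CloudBolt tool: it parses the output of `ps -eo pid,etime,args` (a standard ps format that reports elapsed time directly, used here instead of `ps -ef`) and lists runjobs processes with their run time in seconds. The function names and the sample output are hypothetical.

```python
import subprocess


def etime_to_seconds(etime):
    """Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(part) for part in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_runjobs(ps_text):
    """Return (pid, elapsed_seconds, command) for each runjobs process."""
    found = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header and malformed lines
        pid, etime, args = fields
        if "runjobs" in args and "grep" not in args:
            found.append((int(pid), etime_to_seconds(etime), args))
    return found


def find_runjobs_live():
    """Run ps on this machine and parse its output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=True,
    )
    return find_runjobs(result.stdout)


# Made-up sample output for illustration; real paths and PIDs will differ.
SAMPLE_PS_OUTPUT = """\
  PID     ELAPSED COMMAND
    1 10-02:03:04 /sbin/init
 4242       00:42 /bin/bash runjobs.sh
 4250       00:41 python runjobs.py
 4300       00:00 grep runjobs
"""
```

An empty list from `find_runjobs_live()` corresponds to the "runjobs is not running" case above. A large elapsed time suggests that a job may have hung, since the engine normally exits shortly before each minute boundary once no jobs remain.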