oJob Check for stall
When running an oJob there may be situations where you want to ensure that the entire process won’t stall (e.g. being stuck at a “dead end” waiting for a service, a lock or something similar).
oJob has a feature that lets you ensure that, no matter what, your oJob won’t run past a specific timeout, or that lets you provide a function that is executed periodically to determine whether the oJob has stalled.
Killing after x seconds
The easiest configuration is ensuring that there is a general timeout for the entire oJob:
ojob:
  checkStall:
    # check for a stall every x seconds (default 60)
    everySeconds    : 1
    # kill the entire process after x seconds
    killAfterSeconds: 4
todo:
- Test job
jobs:
#---------------
- name: Test job
  exec: |
    args.wait = _$(args.wait).default(5000);
    log("Waiting for " + args.wait + "ms...");
    sleep(args.wait, true);
    log("Done");
Executing this oJob gives different results depending on how long “Test job” takes. It is configured to kill itself if it runs for longer than 4 seconds, and it checks for that every second (in real situations you should use the default of 60 seconds).
For 1.5 seconds:
$ ojob test.yaml wait=1500
>> [Test job] | STARTED | 2019-11-11T12:13:17.199Z ------------------------
2019-11-11 12:13:17.230 | INFO | Waiting for 1500ms...
2019-11-11 12:13:18.736 | INFO | Done
<< [Test job] | Ended with SUCCESS | 2019-11-11T12:13:18.740Z ============
For 5 seconds:
$ ojob test.yaml wait=5000
>> [Test job] | STARTED | 2019-11-11T12:22:00.058Z -----------------------
2019-11-11 12:22:00.085 | INFO | Waiting for 5000ms...
oJob: Check stall over 4000
2019-11-11 12:22:03.878 | ERROR | oJob: Check stall over 4000
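The interaction between everySeconds and killAfterSeconds can be sketched in plain JavaScript. This is only an illustration, not oJob's actual implementation: it assumes the timeout is evaluated only at each periodic check and that elapsed time is counted from the start of the process. The helper name firstKillCheck is hypothetical.

```javascript
// Hypothetical helper (not part of oJob): returns the elapsed time, in seconds,
// of the first periodic check at which the kill timeout has been reached.
// Assumption: checks fire at everySeconds, 2*everySeconds, ... and the process
// is killed at the first check where elapsed >= killAfterSeconds.
function firstKillCheck(everySeconds, killAfterSeconds) {
  if (!(everySeconds > 0) || !(killAfterSeconds >= 0)) {
    throw new RangeError("everySeconds must be > 0 and killAfterSeconds >= 0");
  }
  // Number of checks needed to reach the timeout (at least one check must run).
  var checks = Math.max(1, Math.ceil(killAfterSeconds / everySeconds));
  return checks * everySeconds;
}

console.log(firstKillCheck(1, 4));   // 4  (values from the example above)
console.log(firstKillCheck(60, 4));  // 60 (default check interval, same timeout)
```

Under these assumptions, a 4-second timeout combined with the default 60-second interval would only be noticed at around the 60-second mark, which is why the example lowers everySeconds to 1.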
Killing depending on a function
If certain conditions can be easily checked to determine whether the oJob has stalled, you can use a function:
ojob:
  checkStall:
    everySeconds: 1
    checkFunc   : |
      print("checking for stall...");
      if (global.canDie) {
        print("should die.");
        return true;
      }
todo:
- Init
- Test job
jobs:
#-----------
- name: Init
  exec: |
    global.canDie = false;
#---------------
- name: Test job
  exec: |
    log("Waiting for 2500ms...");
    sleep(2500, true);
    log("Setting canDie to true...");
    global.canDie = true;
    log("Waiting for another 2500ms...");
    sleep(2500, true);
    log("Done");
In this case the global variable canDie is only set to true after the first 2.5 seconds of execution of the “Test job” job. As soon as checkFunc is executed and confirms the condition by returning a true value, the oJob is immediately stopped.
$ ojob test2.yaml
checking for stall...
>> [Init] | STARTED | 2019-11-11T12:32:02.828Z -----------------------------
<< [Init] | Ended with SUCCESS | 2019-11-11T12:32:02.857Z ==================
>> [Test job] | STARTED | 2019-11-11T12:32:02.893Z -------------------------
2019-11-11 12:32:02.924 | INFO | Waiting for 2500ms...
checking for stall...
checking for stall...
2019-11-11 12:32:05.429 | INFO | Setting canDie to true...
2019-11-11 12:32:05.430 | INFO | Waiting for another 2500ms...
checking for stall...
should die.
You can see the several checkFunc executions in the “checking for stall…” output lines. Once the global variable canDie was true, the entire oJob stopped its execution.
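A flag like canDie is the simplest possible condition. A common variation is a heartbeat: jobs record when they last made progress, and the check reports a stall once that timestamp gets too old. The sketch below shows only that decision logic in plain JavaScript. The names isStalled, lastProgressMs and maxIdleMs are hypothetical, not oJob features, and the logic would still need to be adapted into a checkFunc body that reads the timestamp from a global variable.

```javascript
// Hypothetical helper (not part of oJob): decides whether too much time has
// passed since the last recorded progress.
function isStalled(lastProgressMs, nowMs, maxIdleMs) {
  // No progress recorded yet: do not report a stall.
  if (typeof lastProgressMs !== "number") return false;
  return (nowMs - lastProgressMs) > maxIdleMs;
}

// Usage sketch: a job would refresh the timestamp whenever it makes progress...
var state = { lastProgressMs: Date.now() };
// ...and the periodic check would return true once the job has been idle too long.
console.log(isStalled(state.lastProgressMs, state.lastProgressMs + 1000, 30000));  // false
console.log(isStalled(state.lastProgressMs, state.lastProgressMs + 31000, 30000)); // true
```

Compared with a single boolean flag, a heartbeat does not require the job to know in advance when it is allowed to die; it only has to report that it is still making progress.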
Checking at the job level
All the previous options check for a stall across the entire oJob execution, but you can apply the same checks at the job level using typeArgs.timeout and typeArgs.stopWhen, which are available for all types of jobs in oJob.
Example with typeArgs.timeout
In this example the “Test job” job is set to time out after 1.5 seconds (the timeout is given in milliseconds):
todo:
- Init
- Test job
- Done
jobs:
#-----------
- name: Init
  exec: |
    global.canDie = false;
#-----------
- name: Done
  exec: |
    log("Everything is done.");
#-------------------
- name    : Test job
  typeArgs:
    timeout: 1500
  exec    : |
    log("Waiting for 2500ms...");
    sleep(2500, true);
    log("Setting canDie to true...");
    global.canDie = true;
    log("Waiting for another 2500ms...");
    sleep(2500, true);
    log("Done");
When executed, the job ends in error after the specified timeout, and the oJob then carries on with the next job (“Done”):
>> [Init] | STARTED | 2019-11-11T12:47:25.568Z ----------------------------
<< [Init] | Ended with SUCCESS | 2019-11-11T12:47:25.624Z =================
>> [Test job] | STARTED | 2019-11-11T12:47:25.662Z ------------------------
2019-11-11 12:47:25.684 | INFO | Waiting for 2500ms...
!! [Test job] | Ended in ERROR | 2019-11-11T12:47:27.197Z =================
- id: 8ebbf961-822d-3b95-ca09-8dfb335ab6cb
error: Job exceeded timeout of 1500ms
===========================================================================
>> [Done] | STARTED | 2019-11-11T12:47:27.277Z ----------------------------
2019-11-11 12:47:27.297 | INFO | Everything is done.
<< [Done] | Ended with SUCCESS | 2019-11-11T12:47:27.300Z =================
Example with typeArgs.stopWhen
In this example the “Test job” job is set to stop whenever the stopWhen function returns a true value:
todo:
- Init
- Test job
- Done
jobs:
#-----------
- name: Init
  exec: |
    global.canDie = false;
#-----------
- name: Done
  exec: |
    log("Everything is done.");
#-------------------
- name    : Test job
  typeArgs:
    stopWhen: |
      if (global.canDie) {
        print("should die...");
        return true;
      }
  exec    : |
    log("Waiting for 2500ms...");
    sleep(2500, true);
    log("Setting canDie to true...");
    global.canDie = true;
    log("Waiting for another 2500ms...");
    sleep(2500, true);
    log("Done");
When executed, the job stops without any error as soon as the stopWhen function returns a true value. To end the job with an error instead, simply throw an exception in the stopWhen function.
>> [Init] | STARTED | 2019-11-11T12:38:54.232Z -----------------------------
<< [Init] | Ended with SUCCESS | 2019-11-11T12:38:54.263Z ==================
>> [Test job] | STARTED | 2019-11-11T12:38:54.298Z -------------------------
2019-11-11 12:38:54.330 | INFO | Waiting for 2500ms...
2019-11-11 12:38:56.837 | INFO | Setting canDie to true...
should die...
2019-11-11 12:38:56.838 | INFO | Waiting for another 2500ms...
<< [Test job] | Ended with SUCCESS | 2019-11-11T12:38:56.857Z ==============
>> [Done] | STARTED | 2019-11-11T12:38:56.012Z =============================
2019-11-11 12:38:56.025 | INFO | Everything is done.
<< [Done] | Ended with SUCCESS | 2019-11-11T12:38:56.098Z ==================
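The difference between stopping quietly and stopping with an error can be sketched in plain JavaScript. The wrapper function and its state argument are hypothetical, used only so the snippet runs on its own; in an oJob the body would go directly under typeArgs.stopWhen and read from global, as in the example above. The failed and reason fields are also assumptions made for illustration.

```javascript
// Hypothetical stand-in for a stopWhen body (in oJob it would read from
// `global` instead of the `state` parameter).
function stopWhen(state) {
  // Throwing makes the stop count as an error.
  if (state.failed) {
    throw new Error("Test job failed: " + state.reason);
  }
  // Returning true asks for the job to be stopped without an error;
  // any other value lets it keep running.
  return state.canDie === true;
}

console.log(stopWhen({ canDie: false })); // false
console.log(stopWhen({ canDie: true }));  // true
try {
  stopWhen({ failed: true, reason: "dependency unavailable" });
} catch (e) {
  console.log(e.message); // Test job failed: dependency unavailable
}
```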