Scheduler Fault Tolerance & Load Balancing

Obsidian Scheduler provides enterprise scheduling features while natively supporting pooling and clustering, or in other words, load balancing and fault tolerance. Obsidian does so in a way that is painless and non-invasive; in fact, you don't have to do anything. Load balancing and fault tolerance are built into every instance of Obsidian Scheduler, whether you choose to run it inside the web admin app, embedded in your own application, as a standalone process, or any combination of these. This is critical for a scheduler, since software or hardware faults, unanticipated load, or any number of other problems could cripple or bring down a scheduler instance and prevent critical jobs from firing. This is where pooling and clustering fit so well.

In fact, we are so passionate about fault tolerance and load balancing that we don't offer a single-node version of Obsidian. All licences are a minimum of two nodes, and the fully functional trial lets you run two nodes without any functional restriction. We want you to have, at minimum, a second instance running to ensure your scheduled jobs run on time and that a failure doesn't prevent other scheduled items from completing or subsequent instances from firing.

Many enterprise server solutions support pooling and clustering, but they often rely on complex configuration strategies and/or inter-node communication among pool participants. Obsidian needs none of these. Every Obsidian Scheduler instance of any type automatically joins the existing pool/cluster, or establishes it if it is the first one on the scene. No extra configuration is required. No communication between servers is necessary: no multicast, no replication of data between servers. This means you can easily swap out hardware in case of failure or add a new member for load sharing. In fact, if you have standby hardware, you can keep it running, awaiting the availability of a node licence, and it will automatically take over as soon as one becomes available.
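Coordinating through shared state rather than node-to-node messaging is one common way to achieve this kind of configuration-free clustering. The sketch below is purely illustrative and assumes a shared database; the class and table names are hypothetical, not Obsidian's actual internals. Joining the pool and staying alive become the same operation: writing a heartbeat row that other members can see, with dead nodes simply aging out.

```python
import sqlite3
import time
import uuid


class PoolMember:
    """Hypothetical sketch: a node joins the pool by recording a heartbeat
    in a shared database -- no multicast or inter-node messaging needed."""

    def __init__(self, conn, heartbeat_timeout=30):
        self.conn = conn
        self.node_id = str(uuid.uuid4())
        self.timeout = heartbeat_timeout
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pool_members "
            "(node_id TEXT PRIMARY KEY, last_seen REAL)")

    def heartbeat(self):
        # Upsert our own row: the first heartbeat joins the pool,
        # every later one keeps us in it.
        self.conn.execute(
            "INSERT INTO pool_members (node_id, last_seen) VALUES (?, ?) "
            "ON CONFLICT(node_id) DO UPDATE SET last_seen = excluded.last_seen",
            (self.node_id, time.time()))

    def live_members(self):
        # Anything with a recent heartbeat counts as a member; failed or
        # swapped-out hardware ages out with no reconfiguration.
        cutoff = time.time() - self.timeout
        rows = self.conn.execute(
            "SELECT node_id FROM pool_members WHERE last_seen >= ?", (cutoff,))
        return [r[0] for r in rows]
```

Because membership lives in the shared database, replacing a failed machine is just starting a new process pointed at the same database.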

Obsidian also supports fault tolerance of individual jobs. A job may stall, fail with an exception, fail to complete because its instance went down, miss its fire time because no nodes were running, or be blocked by a conflicting job. Obsidian provides recovery and tolerance mechanisms for all of these failure modes, and they are all configurable and managed via the web interface. You can even configure specialized job chaining based on the source job's state. In an upcoming release, Obsidian will expose fully manageable workflow based on the source job's state and/or its output and results; really, any condition or criteria you may have. You can also use the web interface to subscribe to server and job events at a high level, or target just the events you care about, so that you're kept up to date without having to log in and parse through log files.
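To make the failure modes concrete, here is a minimal sketch of a per-job recovery policy. The enum values and action strings are illustrative only and are not Obsidian's actual API; in Obsidian this kind of per-mode configuration is managed through the web interface rather than in code.

```python
from enum import Enum


class FailureMode(Enum):
    # Illustrative names only -- not Obsidian's actual API.
    STALLED = "job stalled"
    NODE_DIED = "node failed mid-run"
    THREW_EXCEPTION = "job failed with an exception"
    MISSED = "no node was running at fire time"
    CONFLICTED = "blocked by a conflicting job"


# A hypothetical per-job recovery policy, one configurable action per mode.
default_policy = {
    FailureMode.STALLED: "interrupt and retry",
    FailureMode.NODE_DIED: "resubmit on a live node",
    FailureMode.THREW_EXCEPTION: "retry with backoff, then notify",
    FailureMode.MISSED: "run immediately once a node is available",
    FailureMode.CONFLICTED: "queue until the conflicting job completes",
}


def recovery_action(mode: FailureMode, policy=default_policy) -> str:
    """Look up the configured recovery action for a given failure mode."""
    return policy[mode]
```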

We know that running software in production environments can be unpredictable at times and that all too frequently, bad things happen. We want Obsidian Scheduler to keep you safe and to help you feel secure. Share with us your stories or let us know if you can think of any other ways we can make Obsidian better able to adapt to scheduling problems.

Getting Licensing Right

A challenge that many software organizations face is how to license their software, and what controls to use to make sure it is not abused. The vendor wants a solution that protects the product and gives them multiple ways to sell it. Users want something that just works and doesn't interfere with the up-time or reliability of their systems.

Options that we reviewed include:

  • No licences
  • Activation-based/key-based licences
  • Node-locked licences
  • Floating licences

Choosing the Best Option

The approach that we’ve settled on for our scheduler and other products is a floating licence model. We felt this best represented how our product is used while still providing our users flexibility.

Choosing a floating licence model means that organizations aren’t forced to tie their software to a specific node, and they have the flexibility to use their licences as they see fit. It also means we can provide the same type of licensing for trial users by issuing an expiring licence. So we can easily convert a trial licence into the real thing with no effort on the customer’s end to upgrade. In fact, you’re free to start using it as a trial and then upgrade to a full licence without ever bringing the service down! Now that’s the kind of software we love to use: stuff that lets you get things done and doesn’t make you its slave.

Another benefit is that it allows our customers to easily add licences as their needs grow. We know our users’ time is valuable, and ours is too, so we don’t want to force any unnecessary work on either of us.

However, this poses some technical issues. When you implement floating licences, there has to be some kind of licence server responsible for leasing licences to clients and validating them. Clearly, since the goal with our scheduler is high availability and fail-over, we need to provide protections in case there are connectivity issues with the licence server.

Licensing Server Dilemmas

Naturally, we at Carfey Software want to run our own master licence server, both to provide easy-to-use trial licences and to reduce the burden on our clients. But this poses a problem: with an externally hosted server, a drop in Internet connectivity means your scheduler can no longer validate and acquire its licence.

To get around this we built support for a proxy server, and also added automatic fail-over to a secondary licensing server. This means you can run your own proxy server which does its best to communicate with the master server, but can continue to provide floating licences to users within your organization even if connectivity drops.
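The fail-over behaviour boils down to trying servers in priority order and taking the first successful lease. The sketch below is a simplified illustration of that idea, not Obsidian's actual client code; the callables stand in for whatever RPC the real client makes, and the names are hypothetical.

```python
def acquire_licence(servers, request):
    """Hypothetical sketch of fail-over licence acquisition: try each
    server in priority order (e.g. the local proxy first, then the
    master) and return the first successful lease."""
    last_error = None
    for server in servers:
        try:
            return server(request)       # a callable standing in for an RPC
        except ConnectionError as exc:   # connectivity drop: try the next one
            last_error = exc
    raise RuntimeError("no licence server reachable") from last_error
```

With this shape, a dead proxy is indistinguishable from a connectivity drop: the client simply moves on to the next server in its list.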

In the diagram you can see a fairly typical setup of our scheduler showing both the proxy and master licensing servers. You’ll also notice the backup scheduler. The really interesting part here is that you could easily keep that third scheduler in stand-by and purchase only two licences. The fail-over node will pick up a licence as soon as one becomes available (which happens when another node dies).

What if your proxy server dies? Well, if you use a proxy server, by default we fall back to our master licensing server as a backup. And the proxy itself can survive long drops in connectivity with the master server, providing our users the reliability and excellent up-time they demand.

Leasing the Licence

So far we’ve pretty much ignored the semantics of actually handing a licence out to a client. Giving out a licence means leasing it to a given node for a requested length of time, during which it is locked and cannot be obtained by any other node.

So when the scheduler attempts to acquire a licence, it must decide how long it wants the lease to last. Choosing a lease period is a compromise between protecting against connectivity issues and being able to spin up a new node with a licence that may still be locked by a node that died. Clients can also validate their leases multiple times within the lease period to refresh them and help guard against connectivity drops. And to make sure things run smoothly, we’ve made all licences node-locked: if your server goes down and is restarted, it will reacquire the same licence even if the lease hasn’t expired.
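Those three rules (time-limited lock, refresh before expiry, node-locked reacquisition) can be captured in a few lines. The sketch below is a toy model under those stated assumptions, not Obsidian's licence server; the class and method names are hypothetical.

```python
import time


class LicencePool:
    """Toy model of the leasing rules described above: a licence is locked
    for the lease period, can be refreshed before expiry, and is node-locked
    so a restarted node reacquires its own unexpired lease."""

    def __init__(self, total_licences, lease_seconds):
        self.total = total_licences
        self.lease_seconds = lease_seconds
        self.leases = {}  # node_id -> expiry timestamp

    def _expire(self, now):
        # Drop leases whose period has elapsed, freeing them for new nodes.
        self.leases = {n: t for n, t in self.leases.items() if t > now}

    def acquire(self, node_id, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        if node_id in self.leases or len(self.leases) < self.total:
            # Node-locked: a restarted node reuses its own unexpired lease;
            # otherwise grant a fresh lease if one is free.
            self.leases[node_id] = now + self.lease_seconds
            return True
        return False  # all licences are leased to other live nodes

    # Refreshing is just re-leasing before expiry, so it reuses acquire().
    refresh = acquire
```

Note how the compromise in the lease period shows up directly: until a dead node's lease expires, its licence cannot move to a replacement node, which is exactly why refresh calls and node-locking matter.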

Of course, there’s a lot of work in getting all of this to work well and integrating it into the software without sacrificing security, but the power it gives our end users while still protecting our licensing model is worth it.

As an added benefit to our customers, we’ve ensured our highly configurable notifications support includes the option to notify you of any licensing issues. So if there is any issue with contacting your licence server, you will know right away.

If you have any questions about specific technical problems we solved or that you are running into with a similar effort, leave a comment. We’d love to hear from you.