Using Obsidian Maintenance Jobs

When we began development on Obsidian Scheduler, one of its key objectives was transparency. Quoting my own post,

transparent means letting us know what is going on. We want to have information available to us such as when the job was scheduled to run, when it started, when it completed, and which host ran it (in pool scenarios). We want to know if it failed and what the problem was. We want the option to be notified by email, pager and/or message when all or specific jobs fail. We want detailed information available to us should we wish to investigate problems in our executing jobs. And we want all of this without having to write code, create our own interface, parse log files or do extra configuration. The scheduler should just work that way.

We’re very happy that we achieved that goal and Obsidian gives us the transparency we sought.

As you likely know, Obsidian stores all this information in a database. This allows you to retain as much of this history as you desire or as your organization’s policies require. In theory, you could retain all of this history indefinitely in the live database, ensuring this information would be available at any time simply by accessing the Obsidian UI. In practice, this is often not required. Many organizations simply maintain an indefinite archive of database backups and can retrieve and restore these to an environment used specifically to perform any desired historical investigations.

Obsidian is bundled with two maintenance jobs that can be used to keep the Obsidian database trim, holding only the necessary recent history. These maintenance jobs take advantage of Obsidian’s built-in support for parameterization, allowing you to choose the desired retention period. These jobs are not scheduled by default, so let’s take a look at how we can configure them to run.

JobHistoryCleanupJob

This job cleans up the execution history of your jobs, retaining only the most recent history. When scheduling a new job in Obsidian, you will find the Job Class com.carfey.ops.job.maint.JobHistoryCleanupJob in the selection list. Once chosen, you will see that it supports a single required parameter – maxAgeDays. Assuming you want job execution history retained for the most recent 90 days, set this value to 90. Here’s a screenshot of what it would look like.
[Screenshot: JobHistoryCleanupJob configuration]

LogCleanupJob

This job cleans up the audit and information logs that track all activity in Obsidian. As described in our wiki, these logs contain everything from scheduler system activity, such as spawning and execution, to user activity, such as changing a schedule, changing its configuration or adding new jobs. These logs are classified by severity level. The Job Class com.carfey.ops.job.maint.LogCleanupJob takes the same required parameter maxAgeDays, but has an additional required parameter – level – which defaults to ALL and allows multiple values. You can also retain this log data for different retention periods by severity by scheduling and configuring an instance of this job for each classification. Notice these samples:
[Screenshot: LogCleanupJob configuration – sample 1]
[Screenshot: LogCleanupJob configuration – sample 2]

That’s all there is to it! In both cases, the absence of a configured maintenance job – or, in the case of the LogCleanupJob, of one configured for a given severity level – means that data will be retained indefinitely. Of course, you can always decide later on to configure such a job, and at that time Obsidian will take care of it for you.

Is there something else you’d like to see Obsidian do? Drop us a line or leave a comment below. We listen carefully to all our customers’ feature requests and give priority to customers’ needs in our product roadmap. We also appreciate hearing how Obsidian is helping you.

Job Chaining in Quartz and Obsidian Scheduler

In this post I’m going to cover how to do job chaining in Quartz versus Obsidian Scheduler. Both are Java job schedulers, but they take different approaches, so I thought I’d highlight them here and offer some guidance to users of either option.

It’s very common when using a job scheduler to need to chain one job to another. Chaining in this case refers to executing a specific job after a certain job completes (or maybe even fails). Often we want to do this conditionally, or pass on data to the target job so it can receive it as input from the original job.

We’ll start with demonstrating how to do this in Quartz, which will take a fair bit of work. Obsidian will come after since it’s so simple.

Chaining in Quartz

Quartz is the most popular job scheduler out there, but unfortunately it doesn’t provide any way to chain jobs without writing some code. Quartz is a low-level library at heart, and it doesn’t try to solve these types of problems for you, which in my mind is unfortunate since it puts the onus on developers. But despite this, many teams still end up using Quartz, so hopefully this is useful to some of you.

I’m going to outline probably the most basic way to perform chaining. It will allow a job to chain to another, passing on its JobDataMap (for state). This is simpler than using listeners, which would require extra configuration, but if you want to take a look, check out this listener for a starting point.
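For reference, Quartz does ship with a bare-bones listener along these lines: org.quartz.listeners.JobChainingJobListener, which chains one job to another unconditionally and without passing state. A minimal wiring sketch might look like the following (job and group names are placeholders, and the chained job must already be stored with the scheduler):

import static org.quartz.JobKey.jobKey;

import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;
import org.quartz.impl.matchers.EverythingMatcher;
import org.quartz.listeners.JobChainingJobListener;

public class ChainingListenerSetup {
   public static void main(String[] args) throws Exception {
      Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

      // fire "secondJob" whenever "firstJob" completes
      JobChainingJobListener chain = new JobChainingJobListener("myChain");
      chain.addJobChainLink(jobKey("firstJob", "firstJobGroup"),
                            jobKey("secondJob", "secondJobGroup"));
      scheduler.getListenerManager().addJobListener(chain, EverythingMatcher.allJobs());

      scheduler.start();
   }
}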

Sample Code

This approach relies on an abstract class that provides basic flow and chaining functionality to its subclasses. It acts as a very simple Template Method implementation.

First, let’s create the abstract class that gives us chaining behaviour:

import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import org.quartz.*;
import org.quartz.impl.*;

public abstract class ChainableJob implements Job {
   private static final String CHAIN_JOB_CLASS = "chainedJobClass";
   private static final String CHAIN_JOB_NAME = "chainedJobName";
   private static final String CHAIN_JOB_GROUP = "chainedJobGroup";
   
   @Override
   public void execute(JobExecutionContext context) throws JobExecutionException {
      // execute actual job code
      doExecute(context);

      // if chainJob() was called, chain the target job, passing on the JobDataMap
      if (context.getJobDetail().getJobDataMap().get(CHAIN_JOB_CLASS) != null) {
         try {
            chain(context);
         } catch (SchedulerException e) {
            e.printStackTrace();
         }
      }
   }
   
   // actually schedule the chained job to run now
   private void chain(JobExecutionContext context) throws SchedulerException {
      JobDataMap map = context.getJobDetail().getJobDataMap();
      @SuppressWarnings("unchecked")
      Class<? extends Job> jobClass = (Class<? extends Job>) map.remove(CHAIN_JOB_CLASS);
      String jobName = (String) map.remove(CHAIN_JOB_NAME);
      String jobGroup = (String) map.remove(CHAIN_JOB_GROUP);

      JobDetail jobDetail = newJob(jobClass)
            .withIdentity(jobName, jobGroup)
            .usingJobData(map)
            .build();

      Trigger trigger = newTrigger()
            .withIdentity(jobName + "Trigger", jobGroup + "Trigger")
            .startNow()
            .build();

      System.out.println("Chaining " + jobName);
      StdSchedulerFactory.getDefaultScheduler().scheduleJob(jobDetail, trigger);
   }

   protected abstract void doExecute(JobExecutionContext context) 
                                    throws JobExecutionException;
   
   // request chaining to another job (the chain fires only after this job's doExecute() completes)
   protected void chainJob(JobExecutionContext context, 
                           Class<? extends Job> jobClass, 
                           String jobName, 
                           String jobGroup) {
      JobDataMap map = context.getJobDetail().getJobDataMap();
      map.put(CHAIN_JOB_CLASS, jobClass);
      map.put(CHAIN_JOB_NAME, jobName);
      map.put(CHAIN_JOB_GROUP, jobGroup);
   }
}

There’s a fair bit of code here, but it’s nothing too complicated. We create the basic flow for job chaining with an abstract class that calls a doExecute() method in the child class, then chains to the target job if one was requested via chainJob().

So how do we use it? Check out the job below. It actually chains to itself to demonstrate that you can chain any job and that it can be conditional. In this case, we will chain the job to another instance of the same class if it hasn’t already been chained, and we get a true value from new Random().nextBoolean().

import java.util.*;
import org.quartz.*;

public class TestJob extends ChainableJob {

   @Override
   protected void doExecute(JobExecutionContext context) 
                                   throws JobExecutionException {
      JobDataMap map = context.getJobDetail().getJobDataMap();
      System.out.println("Executing " + context.getJobDetail().getKey().getName() 
                         + " with " + new LinkedHashMap(map));
      
      boolean alreadyChained = map.get("jobValue") != null;
      if (!alreadyChained) {
         map.put("jobTime", new Date().toString());
         map.put("jobValue", new Random().nextLong());
      }
      
      if (!alreadyChained && new Random().nextBoolean()) {
         chainJob(context, TestJob.class, "secondJob", "secondJobGroup");
      }
   }
   
}

The call to chainJob() at the end results in the automatic job chaining behaviour in the parent class. Note that the chained job isn’t scheduled immediately – it is only scheduled after the job completes its doExecute() method.

Here’s a simple harness that demonstrates everything together:

import org.quartz.*;
import org.quartz.impl.*;

public class Test {
   
   public static void main(String[] args) throws Exception {

      // start up scheduler
      StdSchedulerFactory.getDefaultScheduler().start();

      JobDetail job = JobBuilder.newJob(TestJob.class)
             .withIdentity("firstJob", "firstJobGroup").build();

      // trigger our source job, which may chain to another
      Trigger trigger = TriggerBuilder.newTrigger()
            .withIdentity("firstJobTrigger", "firstJobTriggerGroup")
            .startNow()
            .withSchedule(
                  SimpleScheduleBuilder.simpleSchedule().withIntervalInSeconds(1)
                  .repeatForever()).build();

      StdSchedulerFactory.getDefaultScheduler().scheduleJob(job, trigger);
      Thread.sleep(5000);   // let job run a few times

      StdSchedulerFactory.getDefaultScheduler().shutdown();
   }
   
}

Sample Output

Executing firstJob with {}
Chaining secondJob
Executing secondJob with {jobValue=5420204983304142728, jobTime=Sat Mar 02 15:19:29 PST 2013}
Executing firstJob with {}
Executing firstJob with {}
Chaining secondJob
Executing secondJob with {jobValue=-2361712834083016932, jobTime=Sat Mar 02 15:19:31 PST 2013}
Executing firstJob with {}
Chaining secondJob
Executing secondJob with {jobValue=7080718769449337795, jobTime=Sat Mar 02 15:19:32 PST 2013}
Executing firstJob with {}
Chaining secondJob
Executing secondJob with {jobValue=7235143258790440677, jobTime=Sat Mar 02 15:19:33 PST 2013}
Executing firstJob with {}

Deficiencies

Well, we’re up and chaining, but there are some problems with this approach:

  • It doesn’t integrate with a container like Spring to use configured jobs. More code would be required.
  • It forces you to know up front which jobs you want to chain, and write code for it.
  • Configuration is fixed, unless, once again, you write more code.
  • No real-time changes (unless you write more code).
  • A fair bit of code to maintain, and a high likelihood you will have to expand it for more functionality.

The theme here is that it’s doable, but it’s up to you to do the work to make it happen. Obsidian avoids these problems by making chaining configurable, instead of it being a feature of the job itself. Read on to find out how.

Chaining in Obsidian

In contrast to Quartz, chaining in Obsidian requires no code and no up-front knowledge of which jobs will chain or how you might want to chain them later. Chaining is a form of configuration, and like all job-related configuration in Obsidian, you can make live changes at any time without a build or any code at all. Job configuration can use a native REST API or the web UI that’s included with Obsidian.

The following chaining features are available for free:

  • No code and no redeploy to add or remove chains.
  • You can chain specific configurations of job classes.
  • You can chain only on certain states, including failure.
  • Chain conditionally based on the source job’s saved state (equivalent to Quartz’s JobDataMap), including multiple conditions (regexp, equals, greater than, etc.).
  • Chain only when matching a schedule.

Check out the feature and UI documentation to find out more.

Now that we know what’s possible, let’s see an example. Once you have your jobs configured, just create a new chain using the UI. REST API support is coming shortly, but as of 1.5.1 chaining isn’t included in the API. If you need to script this right now, we can provide pointers.

In the UI, it looks like the following:

[Screenshot: Chaining UI]

Easy, huh? All configuration is stored in a database, so it’s easy to replicate it in various environments or to automate it via scripting. As a bonus, Obsidian tracks and shows you all chaining state, including which job triggered a chained job. It will even tell you why a job chain didn’t fire, whether because the job status didn’t match or because one of your conditions wasn’t met.

Conclusion

That summarizes how you can go about chaining in Quartz and Obsidian. Quartz definitely has a minimalist approach, but that leaves developers with a lot of work to do.

Meanwhile, Obsidian provides rich functionality out of the box to keep developers working on their own rich functionality, instead of the plumbing that so often seems to consume their time. If you have any suggestions or feature requests for Obsidian, drop us a note by leaving a comment or by contacting us.

Comparing Job Development in Quartz and Obsidian

Getting your program code to the point that it satisfies the functional requirements provided is a milestone for developers, one that hopefully brings satisfaction and a sense of accomplishment. If that code must be executed on a schedule, perhaps for multiple uses with custom schedules and configurable parameters, this can mean a whole new set of problems.

We’re going to compare how we would write a job in Quartz and one in Obsidian that satisfies the above requirements. We’ll use the example scenario of a recurring report. In this scenario, the report has the following dynamic criteria: it is emailed to a specified user, the report format is selectable (PDF or Excel), and of course the execution frequency varies by user.

The following will be the sample class we’ll use to satisfy these requirements.

public class MyReportClass {
    public void emailReport(String emailAddress, String reportFormat) {
        // ... generate report in desired format
        // ... email report to user
    }
}

For the purpose of this exercise, we will leave this class alone and write a wrapper class for scheduling, allowing for its continued use in non-scheduled contexts.

Let’s start with Obsidian. All Obsidian jobs start with implementing a single interface: SchedulableJob. Our Obsidian job class will look something like this:

import com.carfey.ops.job.Context;
import com.carfey.ops.job.SchedulableJob;
import com.carfey.ops.job.param.Configuration;
import com.carfey.ops.job.param.Parameter;
import com.carfey.ops.job.param.Type;

@Configuration(knownParameters={
		@Parameter(name= MyScheduledReportClass.EMAIL, type=Type.STRING, required=true),
		@Parameter(name= MyScheduledReportClass.REPORT_FORMAT, type=Type.STRING, defaultValue="PDF", required=true)
}) 
public class MyScheduledReportClass implements SchedulableJob {
	public static final String EMAIL = "email";
	public static final String REPORT_FORMAT = "reportFormat";

	public void execute(Context context) throws Exception {
		String email = context.getConfig().getString(EMAIL);
		String reportFormat = context.getConfig().getString(REPORT_FORMAT);
		new MyReportClass().emailReport(email, reportFormat);
	}
}

You’ll notice we can annotate the class with the required parameters. This ensures that when this job is scheduled for execution, the email and reportFormat parameters will always be available. Obsidian will not allow the job to be configured without these values and will also ensure their type. But we wouldn’t mind going a step further: we’d like to validate that the reportFormat value is one we support. How can we do so before the job is run?
We can change our class to implement ConfigValidatingJob and implement the necessary method.

Now our class looks like this:

import com.carfey.ops.job.ConfigValidatingJob;
import com.carfey.ops.job.Context;
import com.carfey.ops.job.config.JobConfig;
import com.carfey.ops.job.param.Configuration;
import com.carfey.ops.job.param.Parameter;
import com.carfey.ops.job.param.Type;
import com.carfey.ops.parameter.ParameterException;
import com.carfey.suite.action.ValidationException;

@Configuration(knownParameters={
		@Parameter(name= MyScheduledReportClass.EMAIL, type=Type.STRING, required=true),
		@Parameter(name= MyScheduledReportClass.REPORT_FORMAT, type=Type.STRING, defaultValue="PDF", required=true)
}) 
public class MyScheduledReportClass implements ConfigValidatingJob {
	public static final String EMAIL = "email";
	public static final String REPORT_FORMAT = "reportFormat";

	public void execute(Context context) throws Exception {
		String email = context.getConfig().getString(EMAIL);
		String reportFormat = context.getConfig().getString(REPORT_FORMAT);
		new MyReportClass().emailReport(email, reportFormat);
	}

	public void validateConfig(JobConfig config) throws ValidationException, ParameterException {
		String reportFormat = config.getString(REPORT_FORMAT);
		if (!"PDF".equalsIgnoreCase(reportFormat) && !"EXCEL".equalsIgnoreCase(reportFormat)) {
			throw new ValidationException("Report format must be either PDF or EXCEL");
		}
	}

}

That’s it! Our job will now only accept being scheduled with an email address specified and a valid report format specified. You could easily extend this to other types of custom validation, such as ensuring the email address is valid or perhaps that it is in an allowable domain.
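For instance, a naive email check could be added to validateConfig(). Here’s a sketch (the regular expression is illustrative only, not a complete email validation):

	public void validateConfig(JobConfig config) throws ValidationException, ParameterException {
		String reportFormat = config.getString(REPORT_FORMAT);
		if (!"PDF".equalsIgnoreCase(reportFormat) && !"EXCEL".equalsIgnoreCase(reportFormat)) {
			throw new ValidationException("Report format must be either PDF or EXCEL");
		}
		// naive illustrative check - not a full RFC 5322 validation
		String email = config.getString(EMAIL);
		if (email == null || !email.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+")) {
			throw new ValidationException("A valid email address is required");
		}
	}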

Now for Quartz. Let’s first identify some differences. Quartz doesn’t provide any mechanism for ensuring parameters are specified or valid before runtime. And since Quartz doesn’t provide an execution context, the best you can do when you write your own validation code is to validate the parameters on startup. Our sample below will follow the easiest approach in Quartz: simply fail the job at runtime if the report format is invalid.

import org.quartz.Job;
import org.quartz.JobDataMap;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;


public class MyScheduledReportClass implements Job {
	public static final String EMAIL = "email";
	public static final String REPORT_FORMAT = "reportFormat";

	public void execute(JobExecutionContext context) throws JobExecutionException {
		JobDataMap data = context.getMergedJobDataMap();
		String email = data.getString(EMAIL);
		String reportFormat = data.getString(REPORT_FORMAT);
		if (!"PDF".equalsIgnoreCase(reportFormat) && !"EXCEL".equalsIgnoreCase(reportFormat)) {
			throw new JobExecutionException("Report format must be either PDF or EXCEL");
		}
		new MyReportClass().emailReport(email, reportFormat);
	}

}

You may be thinking that the classes seem fairly comparable and I would agree. But with the Obsidian job, there’s nothing else that needs to be done. Since setting the runtime schedule and specifying parameters tend to be fluid, those are not done in code or even in static configuration. Using Obsidian’s UI or REST API, you specify the schedule and parameters for each instance or version of the job that is needed.

Obsidian always provides an execution context that can be standalone or be embedded as a part of an existing execution context.

Quartz never provides an execution context. Unless you are deploying in a servlet container, you always need to initialize the scheduling environment yourself. Even when using a servlet container, you must help Quartz along. That means that with Quartz, you’ve written only a portion of the code and/or configuration you’ll need.

Initialize the scheduler:

import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;

SchedulerFactory sf = new StdSchedulerFactory();
Scheduler scheduler = sf.getScheduler();
scheduler.start();

No administration console and no REST API means code and/or config to schedule and parameterize your job.

// assumes static imports from JobBuilder, TriggerBuilder and CronScheduleBuilder
JobDetail job = newJob(MyScheduledReportClass.class)
      .withIdentity("joe's report", "group1")
      .usingJobData(MyScheduledReportClass.EMAIL, "joe@****.com")
      .usingJobData(MyScheduledReportClass.REPORT_FORMAT, "PDF")
      .build();
Trigger trigger = newTrigger()
      .withIdentity("trigger1", "group1")
      .startNow()
      .withSchedule(dailyAtHourAndMinute(1, 30))
      .build();
scheduler.scheduleJob(job, trigger);

Now this may not seem too bad, but now imagine that Joe says he wants the report in Excel, not PDF. Are you really going to say that it requires code changes, followed by a build, followed by testing, acceptance, and promoting a new release?

True, some of the above can be moved to configuration files. While that may avoid a build cycle, it presents its own set of issues. You still have to push new configuration files, restart the JVM process, and deal with mistakes in the new configuration files that could derail all scheduling.

This also doesn’t get into the issues surrounding misfires, job concurrency, execution exception handling and recoverability discussed here.

What do you think? Share your experiences using Quartz for scheduling in your java projects by leaving a comment. We’d like to hear from you.

Configuring Clustering in Quartz and Obsidian Schedulers

Job scheduling is used on many software projects to enable both internal jobs and third-party integration. Clustering can provide a huge boost to reliability by providing fail-over and load-sharing. I believe that clustering should be implemented for reliability on just about all software projects, so I’ve decided to outline how to go about doing that in two popular cluster-enabled Java job schedulers. This post is going to cover how to set up clustering for Quartz and Obsidian. It will explain what work is required to configure each, and help you watch for some common pitfalls. This guide will assume you have both schedulers running in the base configuration already.

Both Quartz and Obsidian have their strong points, and this post won’t debate which is better, but it will provide the information you need to cluster either one.

Quartz

Quartz is the most popular open-source job scheduling option, and it allows you to cluster scheduler instances via the JDBCJobStore. Though it’s also possible to do clustering with Terracotta without a persistent job store, I recommend against it for most projects since job execution history is very useful for ongoing operations and troubleshooting, and database-backed clustering performs adequately in all but extreme cases.

Obsidian

Obsidian is a commercial job scheduler which provides free individual instances, and a single clustering licence free for a year. It provides a similar type of clustering as Quartz, and it also provides a full UI, a REST API, downtime recovery, and many other advanced configuration options.

Configuring Obsidian for Clustering

I’ll cover Obsidian first simply because there’s little to do in comparison to Quartz. Since running Obsidian always uses a database, if you have an instance running, there will be no database configuration to update. If you haven’t set it up yet, check out the Getting Started guide – the installation package comes with an interactive Ant tool to build the properties file for you.

So here’s the thing: to cluster Obsidian, simply start up additional instances. Obsidian doesn’t have a non-clustered mode, and all instances automatically handle adjusting the load-sharing algorithm when new members join the cluster (or drop out). Your system clocks should be roughly in sync, but if they differ by a few seconds, it’s not a problem since Obsidian gracefully handles this. Still, you may synchronize your server times if desired.

Note to those using the free version: right after download, you can start two instances and they will automatically join the same cluster. If you do not have adequate licences, new members will not be able to join the cluster.

There’s no need to deploy different properties files or anything like that. As an example, if you use Amazon’s EC2 service, you could use the same image for multiple nodes in the cluster and everything would work correctly. Each cluster member will automatically assign itself a unique instance name based on its local host name and a unique suffix if required.

However, we recommend you assign each cluster member an explicit name, which will help with troubleshooting if there are issues on a specific host. To do so, simply set the Java system property schedulerDesignation to the host name of your choice. For example, if starting the standalone scheduler with java directly, simply add the value -DschedulerDesignation=myHostName to the command.
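A standalone launch might then look like this (the jar name here is a placeholder for your actual Obsidian package):

java -DschedulerDesignation=scheduler-node-1 -jar obsidian-standalone.jar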

That’s it! Clustering Obsidian is literally as easy as copying an installation and starting it on another server or virtual machine. As a bonus, unlike other schedulers, in Obsidian you can set a job to only run on a specific host, even with clustering enabled.

In summary, the steps are:

  1. Grab a copy of your Obsidian installation (WAR, standalone package or embedded application bundle).
  2. Start it up! At this step you can provide the schedulerDesignation you wish to use.

Configuring Quartz for Clustering

For Quartz, we’re only going to cover configuring database clustering since we believe it is the right choice for most projects.

Note before you get started: If you have a non-clustered version of Quartz running, you must shut it down before starting any cluster-enabled instances.

Another note about timing: Since Quartz uses very aggressive timing, you must ensure your different instances have precisely synchronized times. Even a second difference will cause your load-sharing to be effectively disabled and all jobs will run on a single host.

Setting up clustering for Quartz can be a bit overwhelming since it exposes so many properties, and it’s hard to figure out which must be added to your properties file to get up and running, but I will try to simplify the process as much as possible.

Quartz is configured via the quartz.properties file. Here’s a sample file that outlines the bare minimum properties you will have to configure for clustering. If you want to see full details on any portion of the config, see the Quartz configuration reference. Note that the properties file should be identical on all hosts, with the one exception being the org.quartz.scheduler.instanceId which is used to identify different hosts.


# Basic Quartz configuration to provide an adequate pool of threads for execution

org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 25

# Datasource for JDBCJobStore
org.quartz.dataSource.myDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.myDS.URL = jdbc:mysql://localhost:3306/scheduler
org.quartz.dataSource.myDS.user = myUser
org.quartz.dataSource.myDS.password = myPassword

# JDBCJobStore
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS

# Turn on clustering
org.quartz.jobStore.isClustered = true

org.quartz.scheduler.instanceName = ClusteredScheduler
# If instanceId is set to AUTO, an id will be generated automatically.
# I recommend giving explicit names to each clustered host for easy identification.
org.quartz.scheduler.instanceId = Host1

Note about JDBC settings: For whatever reason, you have to configure both the JDBC driver class and the job store’s “driver delegate” class. These will have to be set to the appropriate value for your database platform.
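For example, on PostgreSQL those two settings might look like the following (a sketch – substitute the values for your own platform and driver version):

# PostgreSQL example
org.quartz.dataSource.myDS.driver = org.postgresql.Driver
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate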

As I mentioned, these are the bare minimum properties you will have to configure to get clustering running. The one big pain with this configuration is that the instanceId which identifies hosts resides in the same properties file as all the other properties, which should all remain the same. Keeping different versions in sync can be problematic, and isn’t required for Obsidian. You can use the “AUTO” setting to avoid having to set explicit instance IDs, but as with Obsidian, we recommend you give explicit names, as it can help you locate where issues are happening more quickly if you know up front which host is which.

So the main steps to enabling clustering are:

  1. Prepare properties for each cluster member. Ensure only the org.quartz.scheduler.instanceId varies in the properties file.
  2. Turn off all running instances by shutting down the application which is running it, or disabling just the Quartz process (see here).
  3. Start your application, or start the Quartz process (see here).
  4. Ensure that you never start a non-clustered instance again!

Conclusion

I hope this helps you get off the ground quickly with your new or existing project. Clustering is a great feature in any scheduler, and I feel it provides a lot of value that software and operations teams might be missing out on.

Feature Comparison of Java Job Schedulers

At Carfey Software, we love our flagship product, Obsidian Scheduler. We believe that Obsidian is the best choice for most scheduling needs. Why? Because Obsidian is carefully designed to meet both simple and complex requirements. We think it stacks up well whether you are struggling with an existing scheduler or investigating if you should once again use one of the de facto scheduler solutions on your new project, or perhaps are just curious about alternatives.

So we decided to compare Obsidian to Quartz, cron4j and Spring. And if the technology you’re considering isn’t listed here, why not use these items as a guide to consider what is important for your upcoming project? For a brief overview, check out our feature comparison.

Real-time Schedule Changes / Real-time Job Configuration

Obsidian   Quartz                    cron4j                    Spring
Yes        No native interactivity   No native interactivity   No

Initially, these may not seem very important, but we’ve all likely dealt with situations where we had to temporarily disable a job or change when it runs due to changes in requirements, unexpected technical problems or simply unanticipated behaviour. Obsidian provides both a UI and a REST API to make these changes and they can be effective at the very next minute. Quartz and cron4j are able to make these changes, but they are done via an API or via configuration, so it’s up to you as the developer to find a way to expose this functionality in real-time.

                             Obsidian   Quartz   cron4j                    Spring
Ad-hoc Job Submission        Yes        No       No native interactivity   No
Configurable Job Conflicts   Yes        No       No                        No

As you can see, this means supporting something like ad-hoc job submission is also not easily done with these other technologies, when the library supports it at all.

When it comes to configurable job conflicts, these too can be configured in real time. If two concurrently executing jobs turn out to collide with each other in your production environment, you can adjust to the circumstance with Obsidian, whereas with other schedulers you may have no recourse but to shut down, change code or configuration, and then start up again. With Obsidian’s conflict support, you could even keep the conflict configuration as a medium- or long-term solution.

Code- and XML-Free Job Configuration

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

Obsidian provides you with a rich administration UI exposed via a standard web application. We even support job parameterization that can be validated and enforced via the UI if your job is so designated. Quartz and cron4j are essentially just libraries, so they require code and/or configuration as their means of job configuration.

Since we want to be able to make these types of dynamic changes, Obsidian provides a write access user role which corresponds to scheduler operators who can access the UI and perform the necessary changes. All these changes are audited in Obsidian and these audit logs are searchable from the UI, giving you insight into what changes have been made by your team members.

Job Event Subscription/Notification

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

Quartz and cron4j can handle event notifications via custom listeners. But again, if you want to send out emails on certain events, you have to write that code. If you want to change who receives which notifications, you either expose the mechanism to make those changes, or push new configuration files or possibly even new code. Obsidian chooses not to use custom listeners since it natively provides the means to do the things those listeners would be used for. Custom listeners would otherwise be needed to handle something like job chaining, but Obsidian supports that natively, even allowing for configuration of conditional chaining decisions. All events can be subscribed to generally or by specific entity, e.g. all job failures or just a specific job’s failures.

                   Obsidian   Quartz                                      cron4j                                      Spring
Custom Listeners   No         Yes                                         Yes                                         No
Job Chaining       Yes        Implement yourself using custom listeners   Implement yourself using custom listeners   No

Obsidian goes one step further and even allows you to subscribe and be notified of a broader set of events. For example, you can be notified when an Obsidian node is shut down, when someone changes a job configuration item, when someone changes a system configuration item, and so on. And all notifications are logged in the system for review.

Monitoring & Management UI

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

Obsidian’s monitoring and management UI is powerful, yet very easy to use. You can even play around with it at our live, functional and interactive demo site to see for yourself. Or download Obsidian and have a local version running against an in-memory DB and bundled servlet container within minutes. Quartz does have a paid add-on product that provides some UI, but Obsidian’s UI is free to use even if you use Obsidian’s free single node.

We’ve discussed management already, but monitoring and investigating is another key part of keeping software running smoothly. If a job fails or seems to have run with unexpected criteria, having to gain access to log files and then pore over them to try to find the problem is an inefficient, unproductive and frustrating process for support staff and developers alike. Obsidian’s UI can grant read-only access to support and development staff so they can review the details of job executions (both successes and failures). Filtering and custom search criteria can be used to drill down and find the relevant detail, all without ever having to share or transfer files around.

Zero Configuration Clustering and Load Sharing

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

If Obsidian is running, it natively has the ability to be clustered providing you with load sharing, reliability and failover. Every Obsidian Scheduler instance of any type automatically joins the existing pool/cluster or establishes it if it is the first one on the scene. No extra configuration required. No communication between servers necessary. No multicast, no replication of data between servers. This means that you can easily swap out hardware in case of failure or add a new member for load sharing with ease. Of the comparison technologies, only Quartz supports clustering, but it requires special advanced configuration. Also, to change from non-clustered mode to clustered mode would require taking the existing Quartz instance down.

                              Obsidian   Quartz   cron4j           Spring
Job Execution Host Affinity   Yes        No       Not Applicable   Not Applicable

Obsidian’s pooling also supports host specificity, so that within a cluster, specific nodes can be designated as the allowable execution nodes for a given job.

Scripting Language Support in Jobs

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

Obsidian allows you to use Groovy, JavaScript, Python and BeanShell as scripting languages, in addition to standard Java jobs. It’s implemented such that you can edit the scripts right in Obsidian’s UI console. One of the biggest benefits of this scripting support that we and our customers have found is the ability to quickly write new jobs without redeploying. For example, operators can react quickly to situations and configure a simple Python script to run in certain job failure conditions.

Scheduling Precision

Obsidian   Quartz   cron4j   Spring
Minute     Second   Minute   Millisecond

No Java scheduler can really guarantee with fine precision when a job will fire. Busy hardware could easily lead to pauses or delays in any strategy to fire any activity at an expected time. As such, and due to the performance degradations that would be associated with more aggressive scheduling, we made a decision with Obsidian to support only minute-level precision for job scheduling. If you absolutely require more aggressive and precise scheduling knowing there are no assurances, consider the alternatives above.

Job Scheduling & Management REST API

Obsidian   Quartz   cron4j   Spring
Yes        No       No       No

Obsidian introduced a REST API in version 1.5 to ease integration into other applications and software environments, regardless of the technology used. A complete range of job, scheduling and host management features are exposed via the API. This allows you to integrate Obsidian into external monitoring systems or even write Obsidian jobs that react to specific situations. For example, if a job that runs hourly has been failing continually over a period of many hours, perhaps you would want to automatically disable it. The API can also be used to retrieve the available execution and logging data in Obsidian, and could be used for generating reports or informing interested parties of pertinent activity.

Custom Calendar Support

Obsidian   Quartz   cron4j   Spring
No         Yes      No       No

Quartz does have a feature to support custom calendars. This allows you to reference custom scheduling options in your job’s configured schedule. For example, perhaps you would want to run a job on every weekday, skipping certain business holidays. You can do so with Quartz, but not so with any of these other schedulers unless you were to put custom code in the job itself.

Conclusion

Obsidian has many additional features that haven’t been detailed here, such as configurable recovery options, resubmission of failed jobs, parameterized job support, job configuration validation, job results storage/retrieval and so on. In practice, many developers and even project managers gravitate toward these de facto solutions, but for too long we in the developer community have been fighting with these scheduling technologies and contending with the inferior results. Try our live, functional and interactive demo site to see for yourself. If you like what you see, download Obsidian and be refreshed with this easy-to-use and feature-rich scheduler.

Why Developers Keep Making Bad Technology Choices

Today, software developers are faced with a great abundance of options when choosing how to design and implement systems. We are constantly bombarded with choice and are used to dealing with buzzwords like NoSQL, the cloud, REST, Map-Reduce and so on. However, developers in charge of designing systems can be easily seduced into incorporating technologies that don’t provide a clear benefit over simpler solutions that aren’t as modern or hip. It seems like the KISS principle (Keep it simple, stupid!), while often referenced, is often neglected in favour of more “enterprisey” solutions. Why is this?

There are probably a lot of reasons, but I’ve identified a few that I think cover the majority of cases. As professional developers, I feel strongly that we have a duty to our employers to provide the best long-term solutions and therefore we need to rein in our desires when they conflict with this. Software development is not yet in the same realm as medicine or engineering, but I think we do need to make steps towards the professionalism, duty and responsibility that come with working in those fields.

Reason #1 – Boredom
Developers are often solving the same types of problems over and over. Not all of us have the privilege of working on new types of projects all the time, and even if we are, it’s often not new ground; similar problems have usually been solved thousands of times before by software developers around the globe.

It’s no surprise then that we want to try something new, even if we’ve adequately solved a problem before. We are natural puzzle solvers, and sometimes you just want to try a new puzzle. I’m sure many of you with several years of experience have seen functional systems effectively replaced with a new implementation that uses different technologies for no clear reason other than to suit the fancy of new developers.

So what do we do about this? How do we scratch that itch for something new? A relational DB is just so boring compared to trying out the latest NoSQL platform. Who cares if we don’t really have a good use for it? Well, I’d say you have a few options. For example, take the initiative and find ways to build out the platform that might actually benefit from some new technology. Other than that, why not work on a pet project in your spare time? After all, our job is to deliver high quality software – not entertain ourselves.

Disclaimer: I’m not trying to dissuade anyone from using new technologies. Just identify their benefits and see if they are the best choice for what you are doing, and if what you have doesn’t do the job, go for it!

Reason #2 – Resume Padding
This is perhaps the saddest of the reasons why developers make poor technology choices, and it mainly affects organizations with poor decision-making processes, but it’s still very common.

Contracts and positions in software development are very fluid these days – it’s not uncommon for a developer to be at a new company every year or two. Gone are the days when hopping from job to job was considered a no-no. Since this is the case, a lot of developers leap-frog from position to position to climb the ladder. It’s far easier for an average or lower-skilled developer to get ahead by doing this than by trying to move up within a single company.

Since this is the case, developers will often try to incorporate technologies to get experience in them so as to add a bullet-point to their resume. How useful the technology will be to the platform is of secondary importance. Often it doesn’t matter how much they actually use it – nobody can pretend that people don’t exaggerate their skills when looking for new work. Therefore, platforms from small to large will often end up using untested technologies, or just technologies that nobody in-house actually understands well. Companies are then left with poor systems using too many technologies with nobody to maintain them as developers jump ship to more promising positions.

I don’t believe that most developers do this, but those of us who disagree with such actions should work to push back when presented with developers who are trying to make bad choices.

Reason #3 – Peer Pressure
Peer pressure is perhaps the most difficult cause to resist. We all like to believe that we are independent agents who make our own informed decisions, but all of us are human, and even the most prickly people are social creatures that want to have a happy social group.

When faced with new or hip technologies, a lot of us are somewhat afraid to resist implementing something that doesn’t really seem like a good idea to us. But we should suppress this feeling as much as we can. If you are in an environment where discussion and disagreement are valued (as one would hope), you should feel free to voice your concerns even if you aren’t totally familiar with the latest-and-greatest. Remember that software technologies come and go, but the basic principles pretty much stay the same. So if something doesn’t seem to add up, speak up! If you are a junior developer, you should still feel free to add your input – having experience doesn’t make one right. Plus, you could very well gain some insight into the choices that are being made.

Reason #4 – Lack of Understanding
Finally, technologies are sometimes chosen because developers don’t understand how things are actually working in a platform, or don’t want to find out.

For example, if you don’t have experience with highly performant relational databases, you may be inclined to go the NoSQL route, out of fear that you may implement something that won’t scale. Often though, this fear can be unfounded. If you are using a tool improperly, of course it won’t work well. But don’t let lack of understanding or knowledge force you into an unwise course of action. If, in reality, a solution could be implemented well in a relational database, and your platform already uses one, it would be foolish to introduce a new dependency simply because you aren’t familiar with what you have.

To avoid this, read and learn! If you are making choices, examine your assumptions and see if they hold up. Consult with senior developers who have worked with the tools in question and ask specifics about what they can and can’t do well. It’s never a waste to learn more about the tools that are available to us, and it will very likely pay dividends well into the future if you take the initiative.

Reason #5 – Misunderstanding or Solving Non-Existent Problems
This point ties into my previous point a little bit, but it really deserves its own discussion since it is such a big problem.

A common theme when developers pitch a new technology is that it does X and Y and protects against Z. But a lot of the time, X, Y and Z were never issues in the first place. For example, if we have a read-only data set that needs to be cached on multiple nodes in a cluster, someone may pitch a caching technology that offers distributed data sets where elements are not duplicated on each node. But what if the data set is small and we don’t anticipate any change that would necessitate distributed caching? We’d be introducing new technology that is inherently slower, more brittle and more complicated for a problem that doesn’t exist!

To guard against this, developers need to make sure that they understand the problem domain all the way through, and they also need to cross-check their assumptions to make sure they are correct. Sometimes we assume things that actually aren’t the case, so the latter step is important. Avoid the temptation to cover “what-if” situations. Chances are, you ain’t gonna need it, and if you do, we usually overestimate the cost of making changes at a later date, not realizing we are basically committing to the same effort now to avoid a slim chance of having to do the same amount of work later.

So What Should We Do?
So what are the rights things to do when choosing technologies? To start with, you might want to review the following points, and try to make it a team decision. The more input you have, the less likely you are to miss a piece of information that might alter your decision.

  • Review the requirements – consistency, failover, performance, etc.
  • Evaluate if what you have can meet the need well. If so, this is almost always the right choice.
  • Investigate how other technologies would meet the need, and factor in the costs of extra dependencies and potential failure points (nothing is free, and every new technology can have significant maintenance costs).
  • Find out your team’s expertise – favour things that you know well.
  • Factor in any other concerns like pricing, timelines, etc.
  • Discuss with the team, and make a pros and cons list.

These are just guidelines, and you can approach it any way you like, because the main thing is that you do make the decision carefully and rationally.

I hope that nobody takes this article to mean that new technologies are scary or that they should just be avoided! For instance, I’ve used NoSQL as an example already. I believe it definitely fills a need that exists and I’ve used it before to solve specific problems, but sometimes I think we get caught up in the fun stuff, and forget our ultimate goals. Just keep your objectives in mind and try to make the best long-term choice.

Null and 3-Dimensional Ordering Helpers in Java

When dealing with data sets retrieved from a database, if we want them ordered, we usually will want to order them right in the SQL, rather than order them after retrieval. Our database will typically be more efficient due to available processing power, potential use of available indexes and overall algorithm efficiency in modern RDBMSes. You also have great flexibility to express complex ordering criteria in these ORDER BY clauses. For example, assume you had a query that retrieved employee information including salary and relative steps (position) from the top position. You could easily have a first-level ordering where salaries are grouped into categories (<= $50 000, $50 001 to $100 000, > $100 000), and the next level ordered by relative position. If you could assume that salaries were appropriate for all employees, this might give you a nice idea of where there is too much or too little management in the company – a very rough approach I wouldn’t recommend; this is just a sample usage.

You get some free behaviour from your database when it comes to ordering, whether you realize it or not. When dealing with NULLs, the database has to decide how to order them. Every database I’ve ever worked with, and likely all relational databases, has a default behaviour. In ascending order, MySQL and SQL Server put NULL ahead of real values – they are “less than” a non-NULL value. Oracle and Postgres put NULL after real values – they are “greater than” non-NULL values. Oracle and Postgres nicely give you the NULLS FIRST and NULLS LAST instructions so you can override the defaults. Even in MySQL and SQL Server, you can override the defaults using functions in your ORDER BY clause. In MySQL I use IFNULL; in SQL Server, you could use ISNULL. These both give you the option of replacing NULL with a particular value – just substitute a value appropriate for the type you are sorting.

All sorting supported in these types of queries is two-dimensional. You pick columns and the rows are ordered by those. When you need to sort by additional dimensions of the data, you’re probably getting into areas that are addressed in other related technologies such as data warehousing and OLAP cubes. If that is appropriate and available for your case, by all means use those powerful features.

In many cases though, we either don’t have access to those technologies or we need our operations to work on current data. For example, let’s say you are working on an investment system where investors’ accounts, trades, positions, etc. are all maintained. You need to write a query to help extract trade activity for a given time frame. Our data comes back as two-dimensional datasets even though we have more dimensions. Our query will return data on account(s) and the trade(s) per account. We need our results to be ordered by those accounts whose effected trades have the highest value, but we need to maintain the trades with their accounts. Simply ordering our query by the value of the effected trade would likely break the rows of the same account apart.

We have a choice: we can either order in the database and control our reconstruction of the returned data to maintain the state and order of the reconstructed objects, or we can sort after the fact. In most cases, we probably don’t want to write new code each time we come across this problem that deals specifically with reconstituting the data from that query into our object model’s representation. Hopefully our ORM will help, or we have some preexisting, functional and well-tested code that we can reuse to do so.

Another option is to sort in our code. We actually get lots of flexibility by doing this. Perhaps we have some financial functions that are written in our application that we can now use. We also don’t have to do all the sorting ourselves as we can take advantage of JDK features for Comparator and Collection sorting.

First, let’s deal with our null ordering problem. Let’s say our Trade object exposes some public constant Comparators. These allow us to use a collection of Trades along with java.util.Collections.sort(List<Trade>, Comparator<Trade>). Trade.TRADE_VALUE_NULL_FIRST is the one we want to use. This Comparator is nothing more than a passthrough to a global null Comparator helper.

public static final Comparator<Trade> TRADE_VALUE_NULL_FIRST = new Comparator<Trade>() {
  public int compare(Trade o1, Trade o2) {
    return ComparatorUtil.DECIMAL_NULL_FIRST_COMPARATOR.compare(
        o1.getTradeValue(), 
        o2.getTradeValue());
  }
};

... ComparatorUtil ...

public static NullFirstComparator<BigDecimal> DECIMAL_NULL_FIRST_COMPARATOR = 
  new NullFirstComparator<BigDecimal>();
public static NullLastComparator<BigDecimal> DECIMAL_NULL_LAST_COMPARATOR = 
  new NullLastComparator<BigDecimal>();
...snip...
public static NullLastComparator<String> STRING_NULL_LAST_COMPARATOR = 
  new NullLastComparator<String>();

public static class NullFirstComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
      return 0;
    } else if (o1 == null) {
      return -1;
    } else if (o2 == null) {
      return 1;
    } else {
      return o1.compareTo(o2);
    }
  }
}
public static class NullLastComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
      return 0;
    } else if (o1 == null) {
      return 1;
    } else if (o2 == null) {
      return -1;
    } else {
      return o1.compareTo(o2);
    }
  }
}

Now we have a simple, reusable solution we can use with any class and any nullable value in JDK sorting, and we can expose whatever ordering constants make sense for business usage in our class. Next, let’s deal with the more complex issue of hierarchical value ordering. We don’t want to write new code every time we come across a problem like this, so let’s extend our idea of ordering helpers to hierarchical entities.

public interface Parent<C> {
  public List<C> getChildren();
}
public class ParentChildPropertiesComparator<P extends Parent<C>, C> implements Comparator<P> {
  private List<Comparator<C>> mChildComparators;

  public ParentChildPropertiesComparator(List<Comparator<C>> childComparators) {
    mChildComparators = Collections.unmodifiableList(childComparators);
  }

  public List<Comparator<C>> getChildComparators() {
    return mChildComparators;
  }

  public int compare(P o1, P o2) {
    int compareTo = 0;
    for (int i = 0; i < mChildComparators.size() && compareTo == 0; i++) {
      Comparator<C> cc = mChildComparators.get(i);
      List<C> children1 = o1.getChildren();
      List<C> children2 = o2.getChildren();
      // order each parent's children by the current comparator, then
      // compare the parents by their children, pair by pair
      Collections.sort(children1, cc);
      Collections.sort(children2, cc);
      int max = Math.min(children1.size(), children2.size());
      for (int j = 0; j < max && compareTo == 0; j++) {
        compareTo = cc.compare(children1.get(j), children2.get(j));
      }
    }
    return compareTo;
  }
}

This is a little more complex, but still simple enough to easily grasp and reuse. We have the idea of a parent. This is not an OO relationship; it is a relationship of composition or aggregation. A parent can exist anywhere in the hierarchy, meaning a parent can also be a child. But in our sample, we have a simple parent/child relationship - Account/Trade. This new class, ParentChildPropertiesComparator, supports taking in a List of ordered Comparators on the child entities while sorting on the parents. In our scenario, we are only sorting on one child value, but it could easily be several, just as you could sort more than two levels of data.

We are assuming in our case that Account already implements the Parent interface for trades. If not, you can always use the Adapter Design Pattern (a sketch follows the sorting code below). Our Account/Trade sorting would now look like this.

List<Account> accounts = fetchPreviousMonthsTradeActivityByAccount();
List<Comparator<Trade>> comparators = Arrays.asList(Trade.TRADE_VALUE_NULL_FIRST);
ParentChildPropertiesComparator<Account, Trade> childComparator = 
  new ParentChildPropertiesComparator<Account, Trade>(comparators);
Collections.sort(accounts, childComparator);
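
Here’s what such an adapter might look like. This is a minimal sketch: Account.getTrades() is a hypothetical accessor, and the rest assumes only the Parent interface shown above.

public class AccountTradesAdapter implements Parent<Trade> {
  private final Account mAccount;

  public AccountTradesAdapter(Account account) {
    mAccount = account;
  }

  public Account getAccount() {
    return mAccount;
  }

  public List<Trade> getChildren() {
    // return a mutable copy, since ParentChildPropertiesComparator sorts the
    // children in place; this also leaves the account's own list untouched
    return new ArrayList<Trade>(mAccount.getTrades());
  }
}

You would then sort a List<AccountTradesAdapter> instead of the accounts directly, unwrapping each with getAccount() afterwards.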

Really! That's it. Our annoying problem of sorting accounts by those with highest trade values where some of those trade values could be null is solved in just a few lines of code. Our accounts are now sorted as desired. I came up with this approach and it is used successfully as a part of a query builder for a large-volume financial reconciliation system. Introduction of new sortable types and values requires only minimal additions. Take this approach for a whirl and see how incredibly powerful it is for dealing with sorting requirements across complex hierarchies of data. And drop us a line if you need help in implementation or have any comments.

Computing Common and Unique Elements In Multiple Collections – Java

This week, we’ll take a break from higher level problems and technology posts to deal with just a little code problem that a lot of us have probably faced. It’s nothing fancy or too hard, but it may save one of you 15 minutes someday, and occasionally it’s nice to get back to basics.

So let’s get down to it. On occasion, you’ll find you need to determine which elements in one collection exist in another, which are common, and/or which don’t exist in another collection. Apache Commons Collections has some utility methods in CollectionUtils that are useful, notably intersection(), but this post goes a bit beyond that into calculating unique elements in a collection of collections, and it’s always nice to get down to the details. We’ll also make the solution more generic by supporting any number of collections to operate against, rather than just the two collections CollectionUtils handles. Plus, not all of us choose to or are able to include libraries just to get a couple of useful utility methods.

When dealing with just two collections, it’s not a difficult problem, but not all developers are familiar with all the methods that java.util.Collection defines, so here is some sample code. The key is using the retainAll and removeAll methods together to build up the three sets: common, present in collection A only, and present in collection B only.

Set<String> a = new HashSet<String>();
a.add("a");
a.add("a2");
a.add("common");

Set<String> b = new HashSet<String>();
b.add("b");
b.add("b2");
b.add("common");

Set<String> inAOnly = new HashSet<String>(a);
inAOnly.removeAll(b);
System.out.println("A Only: " + inAOnly);

Set<String> inBOnly = new HashSet<String>(b);
inBOnly.removeAll(a);
System.out.println("B Only: " + inBOnly);

Set<String> common = new HashSet<String>(a);
common.retainAll(b);
System.out.println("Common: " + common);

Output:

A Only: [a, a2]
B Only: [b, b2]
Common: [common]

Handling Three or More Collections

The problem is a bit trickier when dealing with more than two collections, but it can be solved in a generic way fairly simply, as shown below:

Computing Common Elements
Computing common elements is easy, and this code will perform consistently even with a large number of collections.

   
public static void main(String[] args) {
   List<String> a = Arrays.asList("a", "b", "c");
   List<String> b = Arrays.asList("a", "b", "c", "d");   
   List<String> c = Arrays.asList("d", "e", "f", "g");
    
   List<List<String>> lists = new ArrayList<List<String>>();
   lists.add(a);
   System.out.println("Common in A: " + getCommonElements(lists));
   
   lists.add(b);
   System.out.println("Common in A & B: " + getCommonElements(lists));
   
   lists.add(c);
   System.out.println("Common in A & B & C: " + getCommonElements(lists));
   
   lists.remove(a);
   System.out.println("Common in B & C: " + getCommonElements(lists));
}

public static <T> Set<T> getCommonElements(Collection<? extends Collection<T>> collections) {
   // seed with the first collection, then intersect with each of the rest
   Set<T> common = new LinkedHashSet<T>();
   if (!collections.isEmpty()) {
      Iterator<? extends Collection<T>> iterator = collections.iterator();
      common.addAll(iterator.next());
      while (iterator.hasNext()) {
         common.retainAll(iterator.next());
      }
   }
   return common;
}

Output:

Common in A: [a, b, c]
Common in A & B: [a, b, c]
Common in A & B & C: []
Common in B & C: [d]

Computing Unique Elements
Computing unique elements is just about as straightforward as computing common elements. Note that this code’s performance will degrade as you add a large number of collections, though in most practical cases it won’t matter. I presume there are ways this could be optimized, but since I haven’t had the problem, I haven’t bothered trying. As Knuth famously said, “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”.

   
public static void main(String[] args) {
   List<String> a = Arrays.asList("a", "b", "c");
   List<String> b = Arrays.asList("a", "b", "c", "d");   
   List<String> c = Arrays.asList("d", "e", "f", "g");
    
   List<List<String>> lists = new ArrayList<List<String>>();
   lists.add(a);
   System.out.println("Unique in A: " + getUniqueElements(lists));
   
   lists.add(b);
   System.out.println("Unique in A & B: " + getUniqueElements(lists));
   
   lists.add(c);
   System.out.println("Unique in A & B & C: " + getUniqueElements(lists));
   
   lists.remove(a);
   System.out.println("Unique in B & C: " + getUniqueElements(lists));
}

public static <T> List<Set<T>> getUniqueElements(Collection<? extends Collection<T>> collections) {
   List<Set<T>> allUniqueSets = new ArrayList<Set<T>>();
   for (Collection<T> collection : collections) {
      // start from all of this collection's elements, then remove anything
      // that appears in any of the other collections
      Set<T> unique = new LinkedHashSet<T>(collection);
      allUniqueSets.add(unique);
      for (Collection<T> otherCollection : collections) {
         if (collection != otherCollection) {
            unique.removeAll(otherCollection);
         }
      }
   }
   return allUniqueSets;
}

Output:

Unique in A: [[a, b, c]]
Unique in A & B: [[], [d]]
Unique in A & B & C: [[], [], [e, f, g]]
Unique in B & C: [[a, b, c], [e, f, g]]

That’s all there is to it. Feel free to use this code for whatever you like, and if you have any improvements or additions to suggest, leave a comment. Developers all benefit when we share knowledge and experience.

Files and Directories in the JDK

In Java, java.io.File is one of the more frequently used low-level API objects. It also happens to lack some basic functionality we’ve all needed at some point, doesn’t provide separate representations/APIs for files and directories, and doesn’t throw fine-grained exceptions to differentiate between types of error conditions (e.g., file already exists, directory not empty, invalid path).

At Carfey, we’ve been using our own File and Directory classes for years and have been enhancing their utility over time by adding new functionality as the need arises – File.moveToFile(File newFile), File.writeFromStream(InputStream is), Directory.listFilesRecursively(), Directory.emptyDirectory(), etc.

Probably the only real pain you’d encounter using an alternate representation of java.io.File would be in interacting with other libraries that would need access to java.io.File objects. That is easily accommodated by exposing a getJavaFile() method on each class.
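
For example, a minimal sketch (the String-path constructor is an assumption; only getJavaFile() itself comes from the description above):

File report = new File("/tmp/report.csv"); // Carfey File, not java.io.File
// hand the underlying java.io.File to any API that expects one
BufferedReader reader = new BufferedReader(new FileReader(report.getJavaFile()));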

Free for Use
Attached you can find our classes and test code, available under the MIT license. It does depend on our open-source Date library, which we spoke about here. If you’d rather skip the extra library, you can drop the getLastModifiedDate method from the File class.

Here are some of the highlights.

First, we use different classes to represent File and Directory. Rather than having to invoke isFile() and isDirectory(), we have the type to guide us. Our Directory class has some public constants, such as PATH_SEPARATOR, which is equivalent to java.io.File.pathSeparator but uses the standard constant naming convention, and TEMP_DIR, which provides easy access to the temporary directory loaded from the system property java.io.tmpdir. We get typed exceptions such as DirectoryNotCreatedException, InvalidDirectoryException and DirectoryNotDeletedException, with descriptive messages, when problems arise while constructing a Directory reference or physically creating or deleting a Directory. We can listFiles() or even listFilesRecursively() on a Directory. Want to completely empty a directory of files and subdirectories? Use emptyDirectory().
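
Here’s a quick sketch of what working with Directory might look like. The constructor and the iterability of listFilesRecursively() are assumptions; the method and exception names come from the description above.

try {
  Directory logs = new Directory("/var/log/myapp"); // assumed String-path constructor
  for (File f : logs.listFilesRecursively()) {
    System.out.println(f);
  }
  logs.emptyDirectory(); // remove every file and subdirectory beneath it
} catch (InvalidDirectoryException e) {
  // the path didn't resolve to a usable directory
}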

Our File class similarly gives us typed exceptions (InvalidFileException, FileIOException, FileNotDeletedException), with descriptive messages, when problems arise while constructing, creating or deleting physical files. We have very convenient move and write methods: moveToFile(File moveFile), writeFromStream(InputStream is) and write(String contents). There are more convenience methods for getting a FileInputStream or FileOutputStream for the given File, and even a getDirectory() that returns one of our Directory objects representing the containing directory.
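
A similar sketch for File; again, the constructor is an assumption, while the methods are those just described.

File staging = new File("/tmp/report.tmp");        // assumed String-path constructor
staging.write("id,amount\n1,100.00");              // write(String contents)
Directory parent = staging.getDirectory();         // the containing Directory
staging.moveToFile(new File("/data/report.csv"));  // relocate the physical file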

Also in the bundle is our own IOUtil class, which provides some necessary functions for our File and Directory classes. You also get some of its own helpful functionality, including:

public static final byte[] getBytes(InputStream is) 

public static long copyStream(InputStream src, OutputStream dest, int bufferSize, boolean flushEachRead, StreamListener... listeners)

public static InputStream streamFromReader(Reader reader)

public static String streamAsString(InputStream is, String encoding)

public static <T extends Serializable> T cloneThroughSerialize(T t)

public static Object deserialize(InputStream is)

public static void serializeToOutputStream(Serializable ser, OutputStream os)
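
A couple of usage sketches built on the signatures above (the file name is made up, and Invoice stands in for any Serializable type of your own):

String contents = IOUtil.streamAsString(new FileInputStream("notes.txt"), "UTF-8");

Invoice copy = IOUtil.cloneThroughSerialize(original); // deep copy via serialization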

If you’re an Eclipse user, this brings up an interesting issue with resolving imports. If you add the Carfey File and Directory classes to your project, you might want to avoid the annoyance of having to select which type of File you want to use. Eclipse allows you to customize how imports are resolved; if you’ve never done this before, check it out. You can use this with other common names such as List, Util, StringUtil, etc.

Want to avoid this?

Go here (in Eclipse’s preferences, under Java > Appearance > Type Filters)

Add the restrictions

Now when you “Organize Imports” with Ctrl+Shift+O, you won’t be prompted to choose which File you want.

Happy coding, and if you find any bugs or have questions, leave a comment here.