Null and 3-Dimensional Ordering Helpers in Java

When dealing with data sets retrieved from a database, if we want them ordered, we usually want to order them right in the SQL rather than after retrieval. The database will typically do this more efficiently, thanks to the processing power available to it, the potential use of indexes and the overall algorithmic efficiency of modern RDBMSes. ORDER BY clauses also give you great flexibility for complex ordering criteria. For example, assume you had a query that retrieved employee information including salary and the number of steps (positions) below the top position. You could easily have a first-level ordering where salaries are grouped into categories (<= $50 000, $50 001 to $100 000, > $100 000), with the next level ordered by relative position. If you could assume that salaries were appropriate for all employees, this might give you a rough idea of where there is too much or too little management in the company – a very crude approach I wouldn’t actually recommend; it’s just a sample usage.
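
As a rough sketch of that ordering (the table and column names here are assumptions), the SQL we issue from our Java code might look like this:

String sql =
    "SELECT employee_id, salary, steps_from_top FROM employee " +
    "ORDER BY CASE WHEN salary <= 50000 THEN 1 " +
    "              WHEN salary <= 100000 THEN 2 " +
    "              ELSE 3 END, " +
    "         steps_from_top";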

You get some free behaviour from your database when it comes to ordering, whether you realize it or not. When dealing with NULLs, the database has to decide how to order them. Every database I’ve ever worked with, and likely all relational databases, has a default behaviour. In ascending order, MySQL and SQL Server put NULL ahead of real values: they are “less than” any non-NULL value. Oracle and Postgres put NULL after real values: they are “greater than” non-NULL values. Oracle and Postgres nicely give you the NULLS FIRST and NULLS LAST instructions so you can override the defaults. Even in MySQL and SQL Server, you can override the defaults by using functions in your ORDER BY clause. In MySQL I use IFNULL; in SQL Server, you could use ISNULL. Both let you replace NULL with a particular value – just substitute a value appropriate for the type you are sorting.
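
For instance, if we wanted NULL trade values to sort last on MySQL, we could substitute a suitably large sentinel; Oracle and Postgres let us say it directly. The table and column names here are illustrative only:

// MySQL: replace NULL with a large sentinel so NULLs sort after real values
String mysqlSql =
    "SELECT id, trade_value FROM trade ORDER BY IFNULL(trade_value, 999999999)";
// Oracle/Postgres: the same intent, expressed directly
String oracleOrPostgresSql =
    "SELECT id, trade_value FROM trade ORDER BY trade_value NULLS LAST";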

All sorting supported in these types of queries is two-dimensional. You pick columns and the rows are ordered by those. When you need to sort by additional dimensions of the data, you’re probably getting into areas that are addressed in other related technologies such as data warehousing and OLAP cubes. If that is appropriate and available for your case, by all means use those powerful features.

In many cases though, we either don’t have access to those technologies or we need to operate on current data. For example, let’s say you are working on an investment system that maintains investors’ accounts, trades, positions and so on. You need to write a query to extract trade activity for a given time frame. The data comes back as a two-dimensional dataset even though it has more dimensions: our query returns data on account(s) and the trade(s) per account. We need the results ordered by the accounts whose effected trades have the highest value, but we also need to keep each trade with its account. Simply ordering the query by the value of the effected trade would likely break apart rows belonging to the same account.

We have a choice: we can either order in the database and control the reconstruction of the returned data so that the state and order of the reconstructed objects is maintained, or we can sort after the fact. In most cases, we probably don’t want to write new code each time we hit this problem just to reconstitute the data from that query into our object model’s representation. Hopefully our ORM will help, or we have some preexisting, functional and well-tested code that we can reuse to do so.

Another option is to sort in our code, which actually gives us a lot of flexibility. Perhaps we have financial functions written in our application that we can now use. We also don’t have to do all the sorting ourselves, as we can take advantage of the JDK’s Comparator and Collections sorting support.

First, let’s deal with our null ordering problem. Let’s say our Trade object exposes some ready-made public constant Comparators. These allow us to sort a collection of Trades with java.util.Collections.sort(List<Trade>, Comparator<Trade>). Trade.TRADE_VALUE_NULL_FIRST is the one we want to use. This Comparator is nothing more than a passthrough to a shared null-handling Comparator helper.

public static final Comparator<Trade> TRADE_VALUE_NULL_FIRST = new Comparator<Trade>() {
  public int compare(Trade o1, Trade o2) {
    return ComparatorUtil.DECIMAL_NULL_FIRST_COMPARATOR.compare(
        o1.getTradeValue(),
        o2.getTradeValue());
  }
};

... ComparatorUtil ...

public static final NullFirstComparator<BigDecimal> DECIMAL_NULL_FIRST_COMPARATOR =
  new NullFirstComparator<BigDecimal>();
public static final NullLastComparator<BigDecimal> DECIMAL_NULL_LAST_COMPARATOR =
  new NullLastComparator<BigDecimal>();
...snip...
public static final NullLastComparator<String> STRING_NULL_LAST_COMPARATOR =
  new NullLastComparator<String>();

public static class NullFirstComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
      return 0;
    } else if (o1 == null) {
      return -1;
    } else if (o2 == null) {
      return 1;
    } else {
      return o1.compareTo(o2);
    }
  }
}
public static class NullLastComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
      return 0;
    } else if (o1 == null) {
      return 1;
    } else if (o2 == null) {
      return -1;
    } else {
      return o1.compareTo(o2);
    }
  }
}
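
As a quick illustration of the helpers in use (the values are arbitrary):

List<String> names = new ArrayList<String>(Arrays.asList("Smith", null, "Jones"));
Collections.sort(names, ComparatorUtil.STRING_NULL_LAST_COMPARATOR);
// names is now [Jones, Smith, null]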

Now we have a simple, reusable solution we can use with any class and any nullable value in JDK sorting, and we can expose whatever ordering constants make business sense on our classes. Next, let’s deal with the more complex issue of hierarchical value ordering. We don’t want to write new code every time we have to do something like this, so let’s just extend our idea of ordering helpers to hierarchical entities.

public interface Parent<C> {
  public List<C> getChildren();
}
public class ParentChildPropertiesComparator<P extends Parent<C>, C> implements Comparator<P> {
  private List<Comparator<C>> mChildComparators;
  public ParentChildPropertiesComparator(List<Comparator<C>> childComparators) {
    mChildComparators = Collections.unmodifiableList(childComparators);
  }
  public List<Comparator<C>> getChildComparators() {
    return mChildComparators;
  }
  public int compare(P o1, P o2) {
    int compareTo = 0;
    // Work through the child comparators in priority order until one of them
    // breaks the tie between the two parents.
    for (int i = 0; i < mChildComparators.size() && compareTo == 0; i++) {
      Comparator<C> cc = mChildComparators.get(i);
      List<C> children1 = o1.getChildren();
      List<C> children2 = o2.getChildren();
      // Sort each parent's children by the current comparator (note: in place),
      // then compare the children pairwise until a difference is found.
      Collections.sort(children1, cc);
      Collections.sort(children2, cc);
      int max = Math.min(children1.size(), children2.size());
      for (int j = 0; j < max && compareTo == 0; j++) {
        compareTo = cc.compare(children1.get(j), children2.get(j));
      }
    }
    return compareTo;
  }
}

This is a little more complex, but still simple enough to grasp and reuse. We have the idea of a parent. This is not an OO inheritance relationship; it is a relationship of composition or aggregation. A parent can exist anywhere in the hierarchy, meaning a parent can also itself be a child, but in our sample we have a simple parent/child relationship - Account/Trade. This new class, ParentChildPropertiesComparator, supports taking in an ordered List of Comparators on the child entities while sorting the parents. In our scenario we are only sorting on one child value, but it could easily be several, just as you could sort across more than two levels of data.

We are assuming in our case that Account already implements the Parent<Trade> interface. If not, you can always use the Adapter Design Pattern. Our Account/Trade sorting would now look like this.

List<Account> accounts = fetchPreviousMonthsTradeActivityByAccount();
List<Comparator<Trade>> comparators = Arrays.asList(Trade.TRADE_VALUE_NULL_FIRST);
ParentChildPropertiesComparator<Account, Trade> childComparator = 
  new ParentChildPropertiesComparator<Account, Trade>(comparators);
Collections.sort(accounts, childComparator);
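
If we later needed a second-level ordering, say by trade date, adding another Comparator to the list is all it takes. TRADE_DATE_NULL_LAST below is a hypothetical constant defined the same way as TRADE_VALUE_NULL_FIRST:

List<Comparator<Trade>> comparators = Arrays.asList(
    Trade.TRADE_VALUE_NULL_FIRST,
    Trade.TRADE_DATE_NULL_LAST);  // hypothetical second-level comparator
Collections.sort(accounts, new ParentChildPropertiesComparator<Account, Trade>(comparators));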

Really! That's it. Our annoying problem of sorting accounts by their highest-value trades, where some of those trade values could be null, is solved in just a few lines of code. Our accounts are now sorted as desired. I came up with this approach, and it is used successfully as part of a query builder for a large-volume financial reconciliation system. Introducing new sortable types and values requires only minimal additions. Take this approach for a whirl and see how incredibly powerful it is for dealing with sorting requirements across complex hierarchies of data. And drop us a line if you need help with the implementation or have any comments.

When to use Pessimistic Locking

There are cases where we need to use a pessimistic locking strategy. While optimistic updates are an absolute minimum, when we do deploy a pessimistic locking strategy it is as part of a carefully thought-out design. We use pessimistic locking strategies in two primary cases:

  • As a semaphore to ensure only a single process executes a certain block of code at a time
  • As a semaphore for the data itself

Let’s first make sure we agree on the term semaphore. In this context, we’re talking about a binary semaphore, or token. The token is either available or is held by something, and holding it is required to proceed with a certain task. A request is made to obtain the token, and it is granted if available. When it’s not available, the options are to wait for it, blocking until it becomes available, or to fail.

Using a semaphore to ensure that certain blocks of your code are only run by a single process at a given time is a great strategy for safely facilitating load balancing and failover. We can now group related changes under the token, requiring any other process attempting to do the same thing to wait for our lock to be released. When the process has terminated either successfully or abnormally, the token is returned by the corresponding commit or rollback.

When used as a semaphore for the data itself, it allows us to ensure the data stays consistent for the duration of our block of code: changes by others are prevented, so the actions we take are based on a consistent state. For example, we may want to make a collection of changes to foreign-key-related child data based on the parent, and need to ensure that the parent data remains unchanged while we do so. Pessimistically locking the parent row allows us to guarantee just that. And again, by relying on our existing infrastructure for managing the transaction, either successful completion or failure results in a commit/rollback that releases our lock.
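
A minimal JDBC sketch of this second case (assuming java.sql imports; the table and column names are illustrative, not our actual schema):

void adjustTradesForAccount(Connection conn, long accountId) throws SQLException {
  conn.setAutoCommit(false);
  PreparedStatement lock = conn.prepareStatement(
      "SELECT id FROM account WHERE id = ? FOR UPDATE");
  try {
    lock.setLong(1, accountId);
    lock.executeQuery();   // blocks here until we hold the lock on the parent row
    // ... read and modify the child trade rows, with the parent guaranteed stable ...
    conn.commit();         // success returns the token
  } catch (SQLException e) {
    conn.rollback();       // failure returns it too
    throw e;
  } finally {
    lock.close();
  }
}

The first case works the same way; the locked row is simply a dedicated lock row keyed by the task rather than an actual data row.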

As mentioned, pessimistic locking does have the potential to cause performance problems. Most database implementations let you either fail immediately if the lock cannot be acquired (thanks Oracle!) or at the very least cap how long you will wait for it. Knowing your database implementation and its ins and outs is crucial if you’ll be using this approach.
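
On Oracle, for instance, the lock request itself can be told how long it may wait (again with illustrative names):

// fail immediately with an error if another session already holds the lock
String failFast = "SELECT id FROM account WHERE id = ? FOR UPDATE NOWAIT";
// or give up after waiting at most 5 seconds
String boundedWait = "SELECT id FROM account WHERE id = ? FOR UPDATE WAIT 5";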

Our new scheduler has been tested with hundreds of concurrent jobs running across dozens of instances without any collisions, deadlocks or degradation in performance. Drop us a line if you’d like to learn more about how we were able to do so.

Optimistic Updates and Transaction Isolation

While we at Carfey Software prefer to run our transactions at the READ COMMITTED isolation level, we have taken steps to ensure that our optimistic updates using a version column will work even with READ UNCOMMITTED. How does READ UNCOMMITTED impact optimistic updates? It is the dirty reads – reads of data changes that won’t necessarily be committed – that cause the issue.

To ensure a safe, accurate optimistic update strategy, you must ensure that whatever value you use for a version column will never again be used on that row – a GUID is overkill but would do the trick; we use a permanently incrementing value per table. The reason is that a dirty read of a version column could pick up a version from an uncommitted change that is eventually rolled back, while a different transaction then modifies and commits the row, so the same version number ends up representing a different version of the data. Let’s visualize the problem.

We’re running READ UNCOMMITTED. Our starting state is modified by a first transaction and we’re using a common strategy of incrementing the version number on the update. Then a second transaction comes along and gets a dirty read.

Afterwards, the first transaction is rolled back. Before transaction 2 does anything, a third transaction reads the row, writes an increase of 5 to the score and then commits, incrementing the version to 1. When our second transaction comes in for a write and wants to increment the score by 3, it verifies that the version is 1, incorrectly assumes it can safely make its write, and actually only increases the score by 1.
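
To make the sequence concrete, here is one possible timeline. The specific numbers are assumptions chosen for illustration (a starting score of 87 and a first, ultimately rolled-back change to 90), using the common increment-on-update versioning:

  • Committed starting state: score 87, version 0.
  • Transaction 1 updates the row to score 90, version 1, but does not commit yet.
  • Transaction 2 does a dirty read and sees score 90, version 1.
  • Transaction 1 rolls back; the committed row is once again score 87, version 0.
  • Transaction 3 reads score 87, version 0, adds 5 and commits: score 92, version 1.
  • Transaction 2 now applies its +3 to the value it read, writing score 93 with a check that the version is still 1. The check passes, the write succeeds, and the score moves from 92 to 93 – an increase of only 1, computed from a state that was never committed.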

You might think that with dirty reads, safe optimistic updates aren’t possible. But all you need to do is make sure that a version is never used again for a row of data. In our above scenario, if we don’t reuse the version 1, the optimistic write would fail to validate and force a reread. Be aware that if you’re running READ UNCOMMITTED, many ORMs’ optimistic update strategies don’t protect against the above scenario.
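
One way to guarantee a never-reused version is to draw it from a per-table sequence rather than incrementing the previous value, since sequence values are not reissued even when the transaction that drew them rolls back. A minimal JDBC sketch against Postgres (the table, column and sequence names are assumptions):

int updateScore(Connection conn, long id, int newScore, long expectedVersion) throws SQLException {
  PreparedStatement ps = conn.prepareStatement(
      "UPDATE player_score SET score = ?, version = nextval('score_version_seq') " +
      "WHERE id = ? AND version = ?");
  try {
    ps.setInt(1, newScore);
    ps.setLong(2, id);
    ps.setLong(3, expectedVersion);
    return ps.executeUpdate();  // 0 rows updated means our read was stale: re-read and retry
  } finally {
    ps.close();
  }
}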

As we said, we use optimistic updates as a minimum. Our next post will review cases where we do use pessimistic locking.

Database Concurrency

As previously discussed, the database is a crucial component of our new scheduler, as it is for most business systems. With an appropriately designed data model, well-written SQL and well-thought-out transaction lifecycles, most database implementations will give little cause for complaint, even under many concurrent requests.

But you need to think carefully about the potential issues you will face when it comes to concurrent writes, and even writes based on reads that have since become stale. What this boils down to is having a strategy that ensures writes are performed against an expected prior state. These strategies are commonly known as pessimistic and optimistic concurrency control.

Pessimistic control (or pessimistic locking) is so named because it takes the view that concurrent requests will frequently try to overwrite each other’s changes. Its solution is to place a blocking lock on the data and release the lock once the changes are complete or undone. A read lock is sufficient, although a dummy write is frequently used to take the lock. All requests would similarly require a blocking lock before beginning the read/write process. This approach requires state to be maintained from the read all the way through to the write, which is essentially impossible, and entirely impractical, for stateless applications – which is what web applications effectively are. It would also have significant performance implications if requests were continually waiting to acquire locks.

Optimistic control is just as safe as pessimistic control, but it assumes that concurrent writes to the same data will be infrequent and simply verifies during the write that the state is still as it was when the read was first made. Let’s look at a simple example to see what we’re talking about.

We’re tracking scores. I get a request to give William Thomas 5 more points. At the same time, you get a request to give him 3 more points. We both look at his current score, 87, and adjust accordingly. If we don’t verify at write time that the score is still 87, one of us will incorrectly increment the score.

An optimistic update would ensure that the score only updates if, at the time of the write, it is still 87. Optimistic control has clear benefits when it sufficiently meets your requirements: it doesn’t block, giving you a performance benefit over pessimistic control, and it’s just as safe.
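
In SQL terms, issued from our Java code, my +5 write could be guarded by the value itself (player_score and its columns are assumed names; stmt is a plain java.sql.Statement):

int rows = stmt.executeUpdate(
    "UPDATE player_score SET score = 92 " +
    "WHERE player = 'William Thomas' AND score = 87");
if (rows == 0) {
  // the score changed since I read it: re-read, recompute and retry
}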

Let’s briefly review the strategies for optimistic writes, be they updates or deletes.

  • Verify all of the row’s values against the expected (previously read) values
  • Verify only the modified values against expected values
  • Verify that a version column, which changes with every write, is as expected

The first and second choices require that potentially a lot of previous state be maintained along with the primary key. The third option only requires that our key and the row’s version be maintained. At Carfey Software, we use this third option for all writes. We view straight updates based on primary key alone as having no real place in a concurrent application.
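
A common shape for such a version-checked write, sketched with assumed table and column names:

PreparedStatement ps = conn.prepareStatement(
    "UPDATE position SET quantity = ?, version = ? " +
    "WHERE id = ? AND version = ?");
// bind the new quantity, the new version value (however your strategy generates it),
// the primary key and the version originally read
if (ps.executeUpdate() == 0) {
  // the row changed since our read: re-read and retry
}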

Choosing a version column strategy has some interesting interactions with the transaction isolation level in play. Our next post will review those implications.