Null and 3-Dimensional Ordering Helpers in Java

When dealing with data sets retrieved from a database, if we want them ordered, we usually will want to order them right in the SQL, rather than order them after retrieval. Our database will typically be more efficient due to available processing power, potential use of available indexes and overall algorithm efficiency in modern RDBMSes. You also have great flexibility to have complex ordering criteria when you use these ORDER BY clauses. For example, assume you had a query that retrieved employee information including salary and relative steps (position) from the top position. You could easily have a first-level ordering where salaries are grouped into categories (<= $50 000, $50 001 to $100 000, > $100 001), and the next level ordered by relative position. If you could assume that salaries were appropriate for all employees, this might give you a nice idea of where there is too much or too little management in the company – a very rough approach I wouldn’t recommend, this is just a sample usage.

You get some free behaviour from your database when it comes to ordering, whether you realize it or not. When dealing with NULLs, the database has to make a decision how to order them. Every database I’ve ever worked with and likely all relational databases have a default behaviour. In ascending order, MySQL and SQL Server put NULL ahead of real values, they are “less than” a non-NULL value. Oracle and Postgres put NULL after real values, they are “greater than” non-NULL values. Oracle and Postgres nicely give you the NULLS FIRST and NULLS LAST instructions so you can actually override the defaults. Even in MySQL and SQLServer, you can find ways to override the defaults using functions in your order by clause. In MySQL I use IFNULL. In SQL Server, you could use ISNULL. These both give you the option of replacing null with a particular value. Just replace an appropriate value for the type you are sorting.

All sorting supported in these types of queries is two-dimensional. You pick columns and the rows are ordered by those. When you need to sort by additional dimensions of the data, you’re probably getting into areas that are addressed in other related technologies such as data warehousing and OLAP cubes. If that is appropriate and available for your case, by all means use those powerful features.

In many cases though, we either don’t have access to those technologies or we need our operations to be on current data. For example, let’s say you are working on an investment system, investor’s accounts, trades, positions, etc. are all maintained. You need to write a query to help extract trade activity for a given time frame. Our data comes back as a two-dimensional datasets even though we have more dimensions. Our query will return data on account(s) and the trade(s) per account. We need our results to be ordered by those accounts whose effected trades have the highest value. But we need to maintain the trades with their accounts. Simply ordering our query by the value of the effected trade would likely break the rows of the same account apart.

We have a choice, we can either order in the database and control our reconstruction of returning data to maintain the state and order of reconstructed objects or we can sort after the fact. In most cases, we probably don’t want to write new code each time we come across this problem that deals specifically with reconstituting the data from that query into our object model’s representation. Hopefully our ORM will help or we have some preexisting, functional and well-tested code that we can reuse to do so.

Another option is to sort in our code. We actually get lots of flexibility by doing this. Perhaps we have some financial functions that are written in our application that we can now use. We also don’t have to do all the sorting ourselves as we can take advantage of JDK features for Comparator and Collection sorting.

First, let’s deal with our null ordering problem. Let’s say our Trade object has some free public constant Comparators. These allow us to use a collection of Trades along with java.util.Collections.sort(List<Trade>, Comparator<Trade>). Trade.TRADE_VALUE_NULL_FIRST is the one we want to use. This Comparator is nothing more than a passthrough to a global Null Comparator helper.

private static final Comparator<Trade> TRADE_VALUE_NULL_FIRST = new Comparator<Trade>(){
  public int compare(Trade o1, Trade o2) {
    return ComparatorUtil.DECIMAL_NULL_FIRST_COMPARATOR.compare(
        o1.getTradeValue(), 
        o2.getTradeValue());
}};

... ComparatorUtil ...

public static NullFirstComparator<BigDecimal> DECIMAL_NULL_FIRST_COMPARATOR = 
  new NullFirstComparator<BigDecimal>();
public static NullLastComparator<BigDecimal> DECIMAL_NULL_LAST_COMPARATOR = 
  new NullLastComparator<BigDecimal>();
...snip...
public static NullLastComparator<String> STRING_NULL_LAST_COMPARATOR = 
  new NullLastComparator<String>();

public static class NullFirstComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
	return 0;
    } else if (o1 == null) {
	return -1;
    } else if (o2 == null) {
	return 1;
    } else {
	return o1.compareTo(o2);
    }
  }
}
public static class NullLastComparator<T extends Comparable<T>> implements Comparator<T> {
  public int compare(T o1, T o2) {
    if (o1 == null && o2 == null) {
	return 0;
    } else if (o1 == null) {
	return 1;
    } else if (o2 == null) {
	return -1;
    } else {
	return o1.compareTo(o2);
    }
  }
}

Now we have a simple, reusable solution we can use with any class and any nullable value in JDK sorting. Now we expose any ordering constants appropriate for business usage in our class. Now let’s deal with the more complex issue of hierarchical value ordering. We don’t want to write new code everytime we have to do something like this. So let’s just extend our idea of ordering helpers to hiearchical entities.

public interface Parent<C> {
  public List<C> getChildren();
}
public class ParentChildPropertiesComparator<P extends Parent<C>, C> implements Comparator<P> {
  private List<Comparator<C>> mChildComparators;
  public ParentChildPropertiesComparator(List<Comparator<C>> childComparators) {
    mChildComparators = Collections.unmodifiableList(childComparators);
  }
  public List<Comparator<C>> getChildComparators() {
    return mChildComparators;
  }
  public int compare(P o1, P o2) {
    int compareTo = 0;
    for (int i=0; i < mChildComparators.size() && compareTo == 0; i++) {
	Comparator<C> cc = mChildComparators.get(i);
	List<C> children1 = o1.getChildren();
	List<C> children2 = o2.getChildren();
	Collections.sort(children1, cc);
	Collections.sort(children2, cc);
	int max = Math.min(children1.size(), children2.size());
	for (int j=0; j < max && compareTo == 0; j++) {
	  compareTo = cc.compare(children1.get(j), children2.get(j));
	}
    }
    return compareTo;
  }
}

This is a little more complex, but still simple enough to easily grasp and reuse. We have the idea of a parent. This is not an OO relationship. This is a relationship of composition or aggregation. A parent can exist anywhere in the hierarchy, meaning a parent can also be a child. But in our sample, we have a simple parent/child relationship - Account/Trade. This new class, ParentChildPropertiesComparator supports the idea of taking in a List of ordered Comparators on the children entities but sorting on the parents. In our scenario, we are only sorting on one child value, but it could easily be several just as you could sort more than 2 levels of data.

We are assuming in our case that Account already implements the Parent interface for accounts. If not, you can always use the Adapter Design Pattern. Our Account/Trade sorting would now look like this.

List<Account> accounts = fetchPreviousMonthsTradeActivityByAccount();
List<Comparator<Trade>> comparators = Arrays.asList(Trade.TRADE_VALUE_NULL_FIRST);
ParentChildPropertiesComparator<Account, Trade> childComparator = 
  new ParentChildPropertiesComparator<Account, Trade>(comparators);
Collections.sort(accounts, childComparator);

Really! That's it. Our annoying problem of sorting accounts by those with highest trade values where some of those trade values could be null is solved in just a few lines of code. Our accounts are now sorted as desired. I came up with this approach and it is used successfully as a part of a query builder for a large-volume financial reconciliation system. Introduction of new sortable types and values requires only minimal additions. Take this approach for a whirl and see how incredibly powerful it is for dealing with sorting requirements across complex hierarchies of data. And drop us a line if you need help in implementation or have any comments.

Problems with ORMs Part 2 – Queries

In my previous post on Object-Relational Mapping tools (ORMs), I discussed various issues that I’ve faced dealing with the common ORMs out there today, including Hibernate. This included issues related to generating a schema from POJOs, real-world performance and maintenance problems that crop up. Essentially, the conclusion is that ORMs get you most of the way there, but a balanced approach is needed, and sometimes you just want to avoid using your ORM’s toolset, so you should be able to bypass it when desired.

One huge flaw in modern ORMs that I see though is that they really want to help you solve all your SQL problems. What do I mean by this why would I say this is a fault? Well, I believe that Hibernate et al just try too hard and end up providing features that actually hurt developers more than they help. The main thing I have in mind when I say this is query support. Actual support for complex queries that are easily maintained is seriously lacking in ORMs and not because they’ve omitted things — it’s just because the tools they provide don’t use SQL, which was designed from the ground up for exactly this purpose.

Experiences in Hibernate

It’s been my experience that when you use features like HQL, frequently you’re thinking about saving yourself a few minutes up front, and there’s nothing wrong with this in itself, but it can cause serious problems. It’s my experience that frequently you end up wanting or needing to replace HQL with something more flexible, either because of a bug fix or enhancement, and this is where the trouble starts.

I consider myself an experienced developer and I pride myself on (usually) not breaking things — to me, that is one of the hallmarks of good developers. When you’re faced with ripping out a piece of code and replacing it wholesale, such as replacing HQL with SQL, you’re basically replacing code that has had a history that includes bug fixes, enhancements and performance tweaks. You are now responsible for duplicating every change to this code that’s ever been made and it’s quite possible you don’t understand the full scope of the changes or the niggling problems that were corrected in the past.

Note that this also applies to all the other query methods that Hibernate provides, including the Query API, and through extension, query support within the JPA. The issue is that you don’t want a solution that is brittle or limited that it has to be fully replaced later. This means that if you need to revert to SQL to get things done, there’s a good chance you should just do that in the first place. This same concept applies to all areas of software development.

So what do we aim for if the basic querying support in ORMs like Hibernate isn’t good enough?

Criteria for a Solid ORM

Bsaically, my personal requirements for an ORM come down to the following:

  • Schema first – generate your model from a database, not the other way around. If you have a platform-agnostic way of specifying DDL for the database, great, but it’s not a deal-breaker. Generating a database from some other domain-specific language or format helps nobody and results in a poorly designed schema.
  • SQL only – if you want to help me avoid writing code, then generate/expose key-based, etc. lookups for me. Don’t ask me to use your query API or some new query language. SQL was invented for queries, so let me use the right tool.
  • Give me easy ways to populate my domain objects from queries. This gives me 99% of what I’ll ever need, while giving me flexibility.
  • Allow me to populate arbitrary Java beans with query results – don’t tie me into your registry of known types.
  • Don’t force me into using a typical transaction container like the one Hibernate or Spring provides – they are a disaster and I’ve never see a practical use for them that made any sense. Let me handle where connections/transactions are acquired and released in my application – typically this only happens in a few places with clear semantics anyway. This can be some abstracted version of JDBC, but let me control it.
  • No clever/magic behaviour in my domain objects – when working with Hibernate, I spend a good time solving the same old proxy and lazy-loading issues. They never end and can’t be solved once-and-for-all which indicates a serious design issue.

Though these points seem completely reasonable to me, I’ve not encountered any ORMs that really meet my expectations, so at Carfey we’ve rolled our own little ORM, and I have to say that weekend projects and just general development with what we have is far easier and faster than Hibernate or the other ORMs I’ve used. What does it provide?

A Simple Utilitarian ORM

  • Java domain classes are generated from a DB schema. There’s no platform-agnostic DDL yet, but it’s on our TODO list. Beans include support for child collections, FK references, but it’s all lazy and optional – the beans support it, but if you don’t use them, there’s no impact. Use IDs directly if you want, or domain objects themselves. Persistence handles persisting dirty objects only, and saves are only done when requested – no magic flush behaviour.
  • Generated domain classes are for persistence only! Stick your business logic, etc. elsewhere.
  • SQL is used for all lookups, including primary key fetches and foreign key relationships. If you need to enhance a lookup, just steal the generated SQL and build on it. Methods and SQL is generated automatically from any indexed column so they are all provided for you automatically and are typesafe. This also provides a warning to the developer – if a lookup is not available in your domain class, it likely will perform poorly since no index exists.
  • Any domain class can be populated from a custom query in a typesafe manner – it’s flexible but easy to use.
  • Improved classes hide the standard JDBC types such as Connnection and Statement for ease of use, but we don’t force any transaction semantics on you, and you can always fall back to things like direct result set handling.
  • Some basic required features like a connection pool, database metadata, and soon, database slave failover.

We at Carfey don’t believe we’ve created some incredible new ORM that surpasses every other effort out there, and there are many features we’d have to add if this was a public project, but what we have works for us, and I think we have the correct approach. And at the very least, hopefully our experience can help you choose how you use your preferred ORM wisely and not spend too much time serving the tool instead of delivering software.

As a final note, if you have experience with ORMs that meet my list of requirements above and you’ve had good experiences with it, I’ve love to hear about it and would consider it for future Carfey projects.

Problems with ORMs

Object Relational Mapping tools like Hibernate have helped developers make huge productivity gains in dealing with relational databases in the past several years. ORMs free developers to focus on application logic and avoid writing a lot of boilerplate SQL for simple tasks like inserts or queries.

However, the well-documented problems with object-relational impedance mismatch inevitably cause headaches for developers. Relational databases are a specialized technology built on sound concepts, but they don’t necessarily line up to the object-oriented world. There are a few approaches and styles to using ORMs which have various pros and cons.

One of the fundamental choices to using an ORM is deciding whether the you will generate your ORM mappings from the database schema or the other way around, generating your database schema off of some ORM definition (possibly an XML configuration file, annotations or something like XDoclet).

The former approach of generating your ORM layer from your database schema means you have to deal with the database in its own language and terms, whether you deal with DDL specific to the database or have some abstraction layer, but nevertheless you are forced to treat the database as what it is. Unfortunately, it means you need expertise in the technology and it may take more work than allowing your schema to be generated. However, this forces developers to understand and deal with the RDBMS properly – it is dangerous and harmful to treat a DBMS as a simple datastore. Developers need to consider the impact of keys, indexes, etc. when designing applications, and shielding them from the realities of relational databases can be dangerous, and in my experience this always turns out badly. A related issue is the use of POJOs, which end up being manipulated by the ORM framework. While this sounds nice in theory, in practice you can hit various kinds of issues and it can be tempting to mix application logic with what should really amount to data access objects. Developers and architects like to praise the separation of concerns by using Spring and other frameworks, and there’s no real reason why the same concept shouldn’t be applied here. One other minor issue is the need to maintain the POJOs and mapping definition, but this is usually not too much work.

The second approach of generating your ORM mappings and code from your schema is my preferred approach. In my experience of using both approaches, having beans generated off of your schema allows your beans to be intelligently designed and be only as complicated as required, while making available fetch by PK, by index, etc. all for free. It also becomes easier to manage things like lazy collections and referenced objects since it is all managed within the persistent class itself. This approach also avoids the need for writing boilerplate POJOs and forces you to treat your data access objects separately from your domain objects and business logic. In my experience with generating data access beans off your schema, the beans end up being richer, more usable, perform better, and once you have your infrastructure in place, maintenance costs are lower. One may think that you end up needing additional data-wrapper classes, but in practice the need for separate bean classes is independent of what is going on in your data access layer. One issue here is the availability of frameworks to do this generation work for you – in the past, I have worked with custom-built solutions that worked well and paid off, but required initial up-front work. On smaller projects there may not be enough pay-off to be worth investing in such an effort. At the same time, there are ORMs out there that take this approach and generate persistent entity classes, such as jooq, but I have to try them.

Hibernate is the most popular ORM in the Java world and while it is a step up from dealing with writing copious amounts of SQL, it has its problems. Generally the approach is to define your mappings and POJOs and then let Hibernate manage your SQL generation. The problem with this is your defined schema is often less than ideal and things like proper indexing end up overlooked. Hibernate also forces you to end up using their transaction and query mechanisms, though the extent you choose to use their proprietary stuff is up to you. This isn’t necessarily a bad thing in all cases, but personally I have a distaste for the HQL language which is frequently used, since it introduces a familiar-yet-different language to developers which others will later have to maintain and try to figure out. There are also issues with query optimization that can crop up, and having done significant work on performance tuning in the past, access to actual queries for tuning is a must for me. I also believe that trying to implement inheritance in persistence classes is just a bad idea – it’s not worth trying to force a concept onto a technology which naturally does not accommodate it. Hibernate tempts developers to implement inheritance in the database through support for table-per-hierarchy and table-per-class mechanisms, but it is a mistake in my mind since you end up with a poor data model and problems later managing and extending the hierarchy. I also do not like to populate what should be a clean relational model – you can’t pretend relational DBs are object-oriented datastores.

If you take one thing away from this post, it should be to not ignore the actual technologies you are using. Treat an RDBMS for what it is, and learn to use it. Do the same for object-oriented systems. By all means, try to make your life easier by using ORMs to avoid writing boilerplate code and unnecessary SQL, but don’t think you can avoid dealing with some kind of translation or code to deal with the natural mismatch that occurs. Do not expect a framework or tool to solve the problem for you. Developers are paid to think and discerning the best road to take, so we shouldn’t be afraid to deal with problems as they come up and solve them intelligently. As with many things, the 80-20 rule seems to apply here. Use ORMs to take care of 80% of the work for you, and be prepared to write SQL and some persistence code for the other 20%. Don’t expect too much or you will end up with several types of problems – strange bugs, performance issues, a poorly designed object-oriented model, etc.

I’d love to hear your experience and thoughts on ORMs in any language and the issues you’ve faced, along with how you dealt with them. This is one of those topics where experience is incredibly valuable, so please share your thoughts.