Using Lucene To Find A Date
For the next 3 weeks (and for the past few), I'm the DefectController. I get to watch the defects roll in, assess them, and hand them out to the appropriate developer (which may be me). Last week I saw a rather odd defect pass by:
org.apache.lucene.queryParser.ParseException: Too many boolean clauses when performing date range search.
My first reaction was puzzlement, replaced shortly thereafter by shock as I thought through the problem. It occurred to me that the most obvious cause would be the unthinkable: the developer must have enumerated every possible date in the range and included them ALL in one gigantic OR condition.
A bit of grokking later and shock turned to horror. Fortunately, the developer had not done as I suspected. They had done the right thing and generated the correct Lucene criteria in the form:
dateOfBirth:[19700801 TO 20030615]
Unfortunately, that left only one option: It must be Lucene!
Two minutes on Google and the BuildController turned up, among others, this link. Yes indeed, it seems, Lucene does enumerate ALL possible dates. In fact depending on the granularity, it will end up enumerating all possible seconds! Apparently this is not a bug nor even a feature but a "known behaviour".
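To make the failure concrete, here is a minimal sketch in plain Java (no Lucene on the classpath) of what such an expansion might look like. It assumes the range is rewritten into one OR clause per distinct indexed term inside the range, and that the index holds a yyyyMMdd term for every single day. The class and method names are hypothetical, the 1024 limit is the default quoted in the comments below, and the exception only mimics the real message.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for a term index; not Lucene code.
class RangeExpansionSketch {
    // Default clause limit, as quoted in the comments on this post.
    static final int MAX_CLAUSES = 1024;

    // Assumption: the index contains one yyyyMMdd term for every day from first to last.
    static NavigableSet<String> indexDailyTerms(LocalDate first, LocalDate last) {
        NavigableSet<String> terms = new TreeSet<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            terms.add(d.format(DateTimeFormatter.BASIC_ISO_DATE)); // e.g. "19700801"
        }
        return terms;
    }

    // Expand field:[lower TO upper] into one clause per indexed term in the range.
    static List<String> expand(NavigableSet<String> terms, String lower, String upper) {
        List<String> clauses = new ArrayList<>(terms.subSet(lower, true, upper, true));
        if (clauses.size() > MAX_CLAUSES) {
            throw new IllegalStateException("Too many boolean clauses: " + clauses.size());
        }
        return clauses;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
                indexDailyTerms(LocalDate.of(1970, 1, 1), LocalDate.of(2004, 12, 31));

        // A two-week range stays well under the limit.
        System.out.println(expand(terms, "20030601", "20030615").size()); // 15

        // The range from the defect report does not.
        try {
            expand(terms, "19700801", "20030615");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Too many boolean clauses: 12007
        }
    }
}
```

Under these assumptions the query itself is perfectly reasonable; it is the one-clause-per-term rewrite that blows past the limit.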
So now here's the thing that puzzles me. It would appear, from the documentation, that string ranges are also supported, allowing us to find, say, people where:
name:[Albert TO Betty]
This being the case, does Lucene enumerate ALL possible names? I find that hard to fathom. If it does, then I give up now. If not, then couldn't we just encode the dates as unambiguous, comparable strings? Something like:
dateOfBirth:[19700801 TO 20030615]
Look familiar? It should. I just copied and pasted the original example. But if this time around we consider the dates as strings of the form yyyyMMdd instead of attributing any special notion of date, wouldn't that solve the problem? Wouldn't that also easily allow us to perform partial range searches that include, say, only the year, or the year and month?
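The string idea can be checked without Lucene at all. The sketch below (hypothetical class and method names, plain java.time) encodes a date as yyyyMMdd and tests range membership using nothing but String.compareTo. It also probes the partial-range question: a year-only lower bound behaves as hoped, but a bare year as the inclusive upper bound sorts before every full date in that year, so it would need padding (for example to 20031231).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing that yyyyMMdd strings sort chronologically.
class DateStringOrdering {
    // Encode a date as an eight-character yyyyMMdd string.
    static String encode(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Inclusive range test using plain lexicographic string comparison.
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String dob = encode(LocalDate.of(1985, 3, 9));
        System.out.println(dob);                                    // 19850309
        System.out.println(inRange(dob, "19700801", "20030615"));   // true

        // Year-only bounds: fine for a date that falls strictly between the years...
        System.out.println(inRange(dob, "1970", "2003"));           // true

        // ...but "2003" is a prefix of "20030615" and therefore sorts before it.
        System.out.println(inRange("20030615", "1970", "2003"));    // false

        // Padding the upper bound restores the intended inclusive behaviour.
        System.out.println(inRange("20030615", "1970", "20031231")); // true
    }
}
```

Note that this only shows the ordering is sound; it says nothing about how many terms a search engine has to visit to evaluate the range.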
A Lucene expert I am not but all the links we found suggesting various other "work-arounds" (one of which suggested upping the limit on the number of clauses!) seemed little more than hacks. So, please, please, please tell me I've missed something obvious because the solution really does seem that simple to my feeble bwain.
Comments
Damn, and I thought you found a way to pick up a woman with Java software.
Posted by: bob mcwhirter | November 21, 2004 04:15 PM
Perhaps this article helps:
http://www.onegasoft.com/tools/smartranges/
Posted by: Henk van Voorthuijsen | November 21, 2004 09:36 PM
Henk, I'll have a look. Thanks
Posted by: Simon Harris | November 21, 2004 10:56 PM
We have run into this before too. You can set the maximum number of boolean clauses by calling
BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
YMMV with this many allowable terms!
Posted by: Uriah | November 22, 2004 12:22 PM
Lucene does indeed enumerate the terms that match a range query. More than 1024 terms added to a BooleanQuery by default will cause this error.
You can increase the number of maximum clauses, but this can greatly increase the amount of memory your query takes up depending on how many fields and documents you have in your index.
I recently released LIMO (the Lucene Index Monitor) 0.5 which has a couple of tools to help you manage your queries. First of all, it expands your queries so you can see how many terms they're matching. Second, it calculates an estimated memory usage for your queries.
You can check it out here: http://sourceforge.net/projects/limo/
Anyway, your best bet is to index date fields as strings at the coarsest granularity you can get away with: YYYY if that'll work, else YYYYMM, else YYYYMMDD. I don't recommend going beyond day granularity if you want to perform range queries. You can also use a Filter instead of a query term. Check the JavaDoc for information on that.
Posted by: Luke Francl | November 22, 2004 02:09 PM
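To put rough numbers on the granularity advice above, the following sketch counts how many distinct terms the range from the post (19700801 to 20030615) could expand to at each granularity. It assumes the worst case, where every year, month, day, or second in the range actually appears in the index; the class name and methods are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical worst-case term counts for an inclusive date range.
class GranularityTermCount {
    static long years(LocalDate from, LocalDate to) {
        return to.getYear() - from.getYear() + 1;
    }

    static long months(LocalDate from, LocalDate to) {
        // Compare first-of-month dates so partial months still count as one term.
        return ChronoUnit.MONTHS.between(from.withDayOfMonth(1), to.withDayOfMonth(1)) + 1;
    }

    static long days(LocalDate from, LocalDate to) {
        return ChronoUnit.DAYS.between(from, to) + 1;
    }

    static long seconds(LocalDate from, LocalDate to) {
        return days(from, to) * 86_400L;
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(1970, 8, 1);
        LocalDate to = LocalDate.of(2003, 6, 15);
        System.out.println("YYYY:     " + years(from, to));   // 34
        System.out.println("YYYYMM:   " + months(from, to));  // 395
        System.out.println("YYYYMMDD: " + days(from, to));    // 12007
        System.out.println("seconds:  " + seconds(from, to)); // 1037404800
    }
}
```

Only the year and month granularities stay under the default limit of 1024 for this particular range.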
Why use a workaround though? Why not fix the problem?
As Simon says...
surely if Lucene can handle String ranges without having to add every possible value as a comparison clause, then it should then treat dates as Strings. Or is that too obvious?
Posted by: Phil | November 24, 2004 08:53 PM
Phil, it does enumerate strings as well (everything is stored as a string, anyway). You can trip this problem with a query like [a TO z].
Posted by: Luke Francl | May 14, 2005 08:19 AM
You can use the DateFilter class to filter the results instead of using a RangeQuery. I haven't tried it myself yet so I don't know how it compares in terms of performance but it looks like it's the way to go.
Posted by: Paul Illingworth | May 27, 2005 07:58 PM