When coding in Python I'm often doing text processing, and I end up with some form of inverted index or associative array in memory that I want to persist.
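To make that concrete, here is a toy sketch of the kind of structure I mean; the documents and ids are made up:

    # A toy inverted index mapping each term to the set of
    # document ids that contain it.
    docs = {1: 'the quick brown fox', 2: 'the lazy dog', 3: 'quick dog'}

    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    # index['quick'] is now the set containing ids 1 and 3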
On and off I've tried using the Berkeley Database from Oracle. Inevitably I find that it takes forever to write out large data sets. There are some tuning parameters, especially the cache size, but the software just doesn't seem to scale well.
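For reference, my usage looks roughly like this. This is a sketch assuming the bsddb3 bindings; the 64 MB cache size is just an illustrative guess, not a recommendation:

    # Writing an in-memory index into Berkeley DB via the bsddb3
    # bindings. set_cachesize() is the main tuning knob mentioned
    # above, and it must be called before open().
    import bsddb3.db as db

    index = {'quick': '1,3', 'lazy': '2', 'dog': '2,3'}  # toy data

    d = db.DB()
    d.set_cachesize(0, 64 * 1024 * 1024)  # (gbytes, bytes)
    d.open('index.bdb', dbtype=db.DB_HASH, flags=db.DB_CREATE)
    for term, postings in index.items():
        d.put(term.encode('utf-8'), postings.encode('utf-8'))
    d.close()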
I recently rediscovered CDB, Dan Bernstein's constant database, which has Python bindings. It has the basic functionality I need (it handles large data sets, can write a dict out in a reasonable time span, and stores it reasonably compactly) and is amazingly simple. For more details see the internals page.

The only disadvantage is that CDB doesn't support updates or deletes: you have to create your data set in one fell swoop, persist it all at once, and thereafter treat it as read-only. For me this works, since in typical, simple IR tasks you build some data structure, save it, and then later use it. Because of all the performance problems I've had with the Sleepycat code, I plan on reducing my use of it and using CDB more.
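Here is roughly what the build-once, read-many pattern looks like. This sketch assumes the python-cdb bindings, and the file names are illustrative:

    # Build the CDB file in one pass, then treat it as read-only.
    # cdbmake() writes to the temporary file and renames it to the
    # target when finish() is called.
    import cdb

    index = {'quick': '1,3', 'lazy': '2', 'dog': '2,3'}  # toy data

    maker = cdb.cdbmake('index.cdb', 'index.cdb.tmp')
    for term, postings in index.items():
        maker.add(term, postings)
    maker.finish()

    # Thereafter, read-only lookups:
    reader = cdb.init('index.cdb')
    print(reader.get('quick'))  # '1,3'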