Calculating and Compressing OLAP Cubes
Over the weekend, I’ve been hard at work at writing an implementation of the Dwarf algorithm for compressing and calculating (aggregating) OLAP cubes. The initial results look good, and I’ve learned a great deal.
To begin, a little background. OLAP cubes typically perform numerous calculations and aggregations on a dimensional model in order to speed up query performance. Data warehouses are all about storing extremely large sets of data, and presenting it to the user for analysis. Users being users, they don’t want to wait while you perform a SQL GROUP BY and SUM over 10 million rows. So an effective OLAP cube calculates all of the GROUP BY combinations ahead of time, dramatically speeding up users’ queries against the cube.
The trouble is that the number of GROUP BY combinations increases exponentially as the numbers of dimensions (and dimension attributes) increases. Plus, data warehouses are meant to store data over multiple years, and at a very fine grain. Therefore, a typical data warehouse can easily store hundreds of millions of rows.
So you can see where this gets a bit tricky. How do you effectively calculate all of your GROUP BY combinations across such large data sets? How can you do it before the universe ends? How can you update your calculated cube with new rows?
This is where the Dwarf algorithm comes in. It promises to provide a way of calculating and compressing your cube with only one pass through your data. Sounds good to me! Let’s try it.
First off, my implementation, which I’ve named BigDwarf, will become part of ActiveWarehouse. ActiveWarehouse is a Rails plugin which brings proven data warehouse techniques and conventions to Ruby on Rails. ActiveWarehouse follows the Rails way of “convention over configuration”. BigDwarf is one of ActiveWarehouse’s different aggregation strategies.
I’ve named the implementation BigDwarf because I’ve implemented the Dwarf algorithm in spirit only. Dwarf works so well because it does both prefix and suffix coalescing, or compression. I’ve implemented only prefix compression so far. Suffix coalescing, which apparently provides the most dramatic space reductions for a sparse cube, is on the TODO list.
The current implementation does not require ordering of the data set, which traditional Dwarf does. This is good and bad. It’s good because we don’t need to sort everything before loading, which can be a costly step. However, we’d probably just use the UNIX sort command to sort the file, with the assumption that it’s faster than doing in the database (that’s a big assumption). Loading the data in arbitrary order gives us a lot of flexibility.
However, it’s bad because there’s a great potential speed optimization we can use if we order all the dimension attributes in the file. It makes the algorithm a bit more complex, but I think it also reduces the amount of recursion. This is on the TODO list for further research and implementation.
Because BigDwarf is for ActiveWarehouse, this is all written in Ruby. Turns out Ruby is slow. Repeat after me: Ruby is slow. It’s a beautiful language, but it’s just not a data crunching language. I’ve managed to optimize BigDwarf enough so that the bottleneck now is the + operator on Fixnum. Here are some things to avoid if you want to write fast Ruby code:
* ==
* [] - array access
* Hash#[] - calculating hashes
Pretty much everything involving accessing your data in a collection will slow you down.
Once I’ve optimized BigDwarf enough where I can continue working on it, I ran some tests. Here’s what we have so far. My test data set is a 10,037,355 line file extracted from our SQL Server 2005 database. The file includes 5 dimensional attributes and one fact (the number we want to sum across all of the dimensions). The file is a text file, tab delimited, one line per row.
BigDwarf processes this file at 4348 lines per second. It will store the fully calculated cube in 3,301,132 bytes. This is down from the original file’s 337,022,624 bytes. That’s a very dramatic compression, at approximately 99% compression rate. YMMV, of course, as dimension cardinality and size of the values in your data dictate much of that compression. The lower cardinality, the higher compression you’ll see.
BigDwarf also supports querying, with basic filtering support. I’ve yet to do work to optimize the query performance, or to really get a sense of how fast it is. That’s on my TODO as well.
All in all, BigDwarf is working really well so far. There’s work to do for further compression through suffix coalescing and further optimization through smarter cube building.
Hi there.
Why did you choose DWARF to Quotient Cube?
ykud
17 Apr 07 at 11:14 pm
It’s a good question. I’ve never looked at Quotient Cube, but I will now. Thanks for the recommendation.
sethladd
19 Apr 07 at 4:42 pm
That wasn’t a recommendation, just a common question. Those are different styles of algorithms, semantic and syntax ones.
Actually, I’m doing some research in OLAP area in free time, and got some ideas on quotient cube-like algorithm set. I’m going to prototype them this summer (or sooner), and therefore I’m very interested in the way you’re gonna do the “upper part” of OLAP cube, meaning Querying and Visualizing.
I was thinking about gluing it up with Mondrian (pentaho.org), or implementing Palo api (palo.net), so your future steps seem very intriguing to me.
ykud
19 Apr 07 at 11:24 pm
In the dwarf algorithm paper you have mentioned, they always query the cube in a particular order i.e if we have n dimensions order is always let’s say lt;d1, d2, d3..gt; but I think that’s not what a cube means.
A cube means a symmetric access if we query the cube using lt;d2, d1, d3..gt;also. In such a case the algorithm will not work proabably (Correct me If I am wrong).
Can you give me some idea about how are you doing it in your BigDwarf?
Nisheet Jain
23 Apr 07 at 9:49 pm
You know Dwarf is patented right? So any chances of using this without a huge licensing fee are probably gone right out of the gate. Patented by the same guys that wrote the original paper. Very awesome of them eh?
Josh Ferguson
11 Jul 07 at 10:58 pm
Well, no worries because I gave up on dwarf. Numerous reasons… construction time was slow and querying it with arbitrarily complex queries was going to take some time. All in all, we’d need the power of SQL to query whatever OLAP store we’d create.
sethladd
12 Jul 07 at 1:59 am
I actually have a full implementation in ruby (including coalescion) that I’m writing as a C extension. It uses QDBM to build the resulting graph as it goes along so querying is always only a matter of a few QDBM calls wrapped up in a ruby DSL. I think you should give it some more thought and try to convince the guys that own the patent to give you a free license.
Josh Ferguson
12 Jul 07 at 7:43 am
That’s great to hear! I’d be curious to see more on how you dealt with Ruby’s slowness with things like array access and addition of numbers. I assume that is why you are writing it as a C extension?
I’ve written both a Ruby implementation and a Java implementation, but the reason I’ve left it behind is because I still need a very expressive query language on top of it. Not to say that you can’t wire one on top of your dwarf, though.
I’m sure Anthony Eden would be very interested in seeing your Dwarf implementation. He’s also written a Ruby implementation.
Where is your code hosted?
sethladd
12 Jul 07 at 8:31 pm
I’m actually scratching dwarf and writing a quotient cube tree implementation since I’ve talked to the authors and it’s open source.. Same deal though with the C extension. You can’t get away with doing this type of thing in an interpreted type language, it’s just too slow to be viable. I’m going to provide an active record like interface over top of it to do point, range, and iceberg queries as well as combinations of those including some custom query types. If I have to I might write a basic SQL-type interpreter using treetop.
Josh Ferguson
19 Jul 07 at 6:52 pm
I agree, Ruby is just too slow for this sort of thing. From when I was looking at it, it was array navigation and just pure addition that killed Ruby.
Will you be hosting your code somewhere? I’d love to follow along.
Meanwhile, I’ve been working on a pseudo column database implementation to see where that will take me.
sethladd
19 Jul 07 at 9:40 pm
I’ll put it up up more information at my blog when the time comes where it does something useful. I’ll put up my dwarf implementation sooner since I don’t care about it anymore..
Josh Ferguson
20 Jul 07 at 10:18 am