Archive for the ‘olap’ Category

OLAP Cube Construction with Erlang

Monday, September 10th, 2007

I’ve managed to build the parallel OLAP cube constructor in Erlang. This program achieves parallelization through creating a process for every dimension in the OLAP cube. Each process manages the file that holds the dimension data. Messages are passed from the first dimension all the way down to the last dimension which stores the measures themselves. To further parallelize things, you can partition any dimension using a modulus, which creates another file and process. This helps get around the 2 GB limit for dets tables.

I also have basic path based querying working, which is also parallelized through sending the query message through each dimension. While the querying itself isn’t parallel for a particular client, it will theoretically scale to handle many clients.

When I move to move traditional querying to generate a traditional tabular result set, I will be able to parallelize the query for a single client.

I’ll post the working code once I can choose a suitable license. I’m very interested to hear feedback, as I’m very much still an Erlang n00b.

Next up I’ll generate some performance numbers to see if this thing will actually perform in the real world.

I have to say, functional programming is great when your solution is algorithmic. Previous implementations of mine were done in Java or Ruby, which are object oriented. The classes and object obscured the algorithm, which in the case of OLAP cubes is the primary focus.

Parallel OLAP Cube Construction

Sunday, September 9th, 2007

Still tilting at windmills here, with my quest to calculate an OLAP cube still trotting along. Lately I’ve been using Erlang to see if I can use its concurrent programming abilities to scale the OLAP cube generation. The initial progress is promising, with clear concurrency complete.

I hope to post the majority of the code soon, after I run a few more tests. For now, here’s how I add two lists of numbers in Erlang. Please let me know if there’s a better way.

Update:

The best way to do this is (thanks to Daniel Larsson):


lists:zipwith(fun(X,Y) -> X+Y end, [1,2],[3,4])

Old ‘n busted way:


add_measures(OldMeasures, NewMeasures) ->
  add_measures(OldMeasures, NewMeasures, []).

add_measures([Old|OldRest], [New|NewRest], Accum) ->
  add_measures(OldRest, NewRest, [Old+New|Accum]);

add_measures([], [], Accum) -> lists:reverse(Accum).

Calculating and Compressing OLAP Cubes

Monday, April 16th, 2007

Over the weekend, I’ve been hard at work at writing an implementation of the Dwarf algorithm for compressing and calculating (aggregating) OLAP cubes. The initial results look good, and I’ve learned a great deal.

To begin, a little background. OLAP cubes typically perform numerous calculations and aggregations on a dimensional model in order to speed up query performance. Data warehouses are all about storing extremely large sets of data, and presenting it to the user for analysis. Users being users, they don’t want to wait while you perform a SQL GROUP BY and SUM over 10 million rows. So an effective OLAP cube calculates all of the GROUP BY combinations ahead of time, dramatically speeding up users’ queries against the cube.

The trouble is that the number of GROUP BY combinations increases exponentially as the numbers of dimensions (and dimension attributes) increases. Plus, data warehouses are meant to store data over multiple years, and at a very fine grain. Therefore, a typical data warehouse can easily store hundreds of millions of rows.

So you can see where this gets a bit tricky. How do you effectively calculate all of your GROUP BY combinations across such large data sets? How can you do it before the universe ends? How can you update your calculated cube with new rows?

This is where the Dwarf algorithm comes in. It promises to provide a way of calculating and compressing your cube with only one pass through your data. Sounds good to me! Let’s try it.

First off, my implementation, which I’ve named BigDwarf, will become part of ActiveWarehouse. ActiveWarehouse is a Rails plugin which brings proven data warehouse techniques and conventions to Ruby on Rails. ActiveWarehouse follows the Rails way of “convention over configuration”. BigDwarf is one of ActiveWarehouse’s different aggregation strategies.

I’ve named the implementation BigDwarf because I’ve implemented the Dwarf algorithm in spirit only. Dwarf works so well because it does both prefix and suffix coalescing, or compression. I’ve implemented only prefix compression so far. Suffix coalescing, which apparently provides the most dramatic space reductions for a sparse cube, is on the TODO list.

The current implementation does not require ordering of the data set, which traditional Dwarf does. This is good and bad. It’s good because we don’t need to sort everything before loading, which can be a costly step. However, we’d probably just use the UNIX sort command to sort the file, with the assumption that it’s faster than doing in the database (that’s a big assumption). Loading the data in arbitrary order gives us a lot of flexibility.

However, it’s bad because there’s a great potential speed optimization we can use if we order all the dimension attributes in the file. It makes the algorithm a bit more complex, but I think it also reduces the amount of recursion. This is on the TODO list for further research and implementation.

Because BigDwarf is for ActiveWarehouse, this is all written in Ruby. Turns out Ruby is slow. Repeat after me: Ruby is slow. It’s a beautiful language, but it’s just not a data crunching language. I’ve managed to optimize BigDwarf enough so that the bottleneck now is the + operator on Fixnum. Here are some things to avoid if you want to write fast Ruby code:

* ==
* [] - array access
* Hash#[] - calculating hashes

Pretty much everything involving accessing your data in a collection will slow you down.

Once I’ve optimized BigDwarf enough where I can continue working on it, I ran some tests. Here’s what we have so far. My test data set is a 10,037,355 line file extracted from our SQL Server 2005 database. The file includes 5 dimensional attributes and one fact (the number we want to sum across all of the dimensions). The file is a text file, tab delimited, one line per row.

BigDwarf processes this file at 4348 lines per second. It will store the fully calculated cube in 3,301,132 bytes. This is down from the original file’s 337,022,624 bytes. That’s a very dramatic compression, at approximately 99% compression rate. YMMV, of course, as dimension cardinality and size of the values in your data dictate much of that compression. The lower cardinality, the higher compression you’ll see.

BigDwarf also supports querying, with basic filtering support. I’ve yet to do work to optimize the query performance, or to really get a sense of how fast it is. That’s on my TODO as well.

All in all, BigDwarf is working really well so far. There’s work to do for further compression through suffix coalescing and further optimization through smarter cube building.

links for 2007-03-30

Thursday, March 29th, 2007
  • How to choose which views to materialize in an OLAP cube, when it is too expensive to materialize all views. This is the next optimization for our aggregation strategies in ActiveWarehouse.
    (tags: olap database)

ActiveWarehouse Gets Some Love

Wednesday, March 28th, 2007

ActiveWarehouse, the Ruby on Rails plugin for data warehouse development, was written up by InfoQ in their article ActiveWarehouse, a New Step for Enterprise Ruby.

I’ve been writing different aggregation strategies for ActiveWarehouse, trying to find something that’s not too slow or cumbersome. ActiveWarehouse supports pluggable aggregation, or rollup, strategies, so you can use what works best for you. We have some very large data sets and very large dimensions (one dimension we have has 215 million rows). So if ActiveWarehouse can eventually handle that, I think we’re in good shape.

I can say that ActiveWarehouse will work great if you have a smallish data set. I would say up to a million rows in your dimensions would be big enough. Of course, no matter how much work we put into optimizing ActiveWarehouse’s aggregation schemes, smart database tuning will always help tremendously.

links for 2007-03-28

Tuesday, March 27th, 2007