Archive for the ‘database’ Category

Too Much Data

Monday, August 13th, 2007

Bill de hÓra writes that Phat Data is the challenge of the future. Couldn’t agree more. My recent work with data warehouses certainly has shown me that managing and accessing terabytes of data is non trivial.

We’ve learned a few things, most importantly, “Denormalize and aggregate.” Avoiding I/O is the most important step to take. And we’ve achieved some pretty decent performance numbers with a traditional relational database. However, as Bill points out, we’re using it as a big indexed file system.

But having SQL and the numerous tools that support SQL has been critical to our success. I can’t imagine solving these problems with proprietary tools. Sure, it’s possible. Google did it, but they have more PhD’s than you can shake a stick at. Plus some mega clusters.

While multi-core CPUs are a welcomed upgrade, what I really want is multi-spindle hard drives. Call me when I can emulate a google cluster in my desktop. What’s lacking is a cheap and effective way to parallel my disk activity.

What would be really nice is to turn my corporate network into a giant compute farm utilizing both all those CPUs and all those hard drives. Now that is really turning the network into the computer. So, don’t give me EC2, give me EC2 in my office. With everyone using their huge desktops just to read email and write PowerPoint, I know there’s a ton of unused computing power. This is P2P with a purpose.

Yes, database normalization is good

Monday, August 13th, 2007

So InfoQ has collected a few blog posts which ask Data normalization, is it really that good?

Of course it’s good, as long as you have requirements which dictate this optimization. If your application requires extremely fast writes, and this can happen in a heavy loaded OLTP system, then data normalization is your savior. If your application requires extremely fast reads, like OLAP systems, then of course data normalization is a killer.

These competing requirements are exactly why you have database systems optimized for either read or write. This is why large systems will maintain an operational system conforming to OLTP principles, and reporting systems conforming to OLAP principles.

Remember, traditional database systems are row oriented. This architecture is itself an optimization for OLTP and normalized data. Read mostly (or read only) systems can be column oriented, which organize the data on disk to optimize reads. For instance, Google’s BigTable is an implementation of a column oriented database.

Calculating and Compressing OLAP Cubes

Monday, April 16th, 2007

Over the weekend, I’ve been hard at work at writing an implementation of the Dwarf algorithm for compressing and calculating (aggregating) OLAP cubes. The initial results look good, and I’ve learned a great deal.

To begin, a little background. OLAP cubes typically perform numerous calculations and aggregations on a dimensional model in order to speed up query performance. Data warehouses are all about storing extremely large sets of data, and presenting it to the user for analysis. Users being users, they don’t want to wait while you perform a SQL GROUP BY and SUM over 10 million rows. So an effective OLAP cube calculates all of the GROUP BY combinations ahead of time, dramatically speeding up users’ queries against the cube.

The trouble is that the number of GROUP BY combinations increases exponentially as the numbers of dimensions (and dimension attributes) increases. Plus, data warehouses are meant to store data over multiple years, and at a very fine grain. Therefore, a typical data warehouse can easily store hundreds of millions of rows.

So you can see where this gets a bit tricky. How do you effectively calculate all of your GROUP BY combinations across such large data sets? How can you do it before the universe ends? How can you update your calculated cube with new rows?

This is where the Dwarf algorithm comes in. It promises to provide a way of calculating and compressing your cube with only one pass through your data. Sounds good to me! Let’s try it.

First off, my implementation, which I’ve named BigDwarf, will become part of ActiveWarehouse. ActiveWarehouse is a Rails plugin which brings proven data warehouse techniques and conventions to Ruby on Rails. ActiveWarehouse follows the Rails way of “convention over configuration”. BigDwarf is one of ActiveWarehouse’s different aggregation strategies.

I’ve named the implementation BigDwarf because I’ve implemented the Dwarf algorithm in spirit only. Dwarf works so well because it does both prefix and suffix coalescing, or compression. I’ve implemented only prefix compression so far. Suffix coalescing, which apparently provides the most dramatic space reductions for a sparse cube, is on the TODO list.

The current implementation does not require ordering of the data set, which traditional Dwarf does. This is good and bad. It’s good because we don’t need to sort everything before loading, which can be a costly step. However, we’d probably just use the UNIX sort command to sort the file, with the assumption that it’s faster than doing in the database (that’s a big assumption). Loading the data in arbitrary order gives us a lot of flexibility.

However, it’s bad because there’s a great potential speed optimization we can use if we order all the dimension attributes in the file. It makes the algorithm a bit more complex, but I think it also reduces the amount of recursion. This is on the TODO list for further research and implementation.

Because BigDwarf is for ActiveWarehouse, this is all written in Ruby. Turns out Ruby is slow. Repeat after me: Ruby is slow. It’s a beautiful language, but it’s just not a data crunching language. I’ve managed to optimize BigDwarf enough so that the bottleneck now is the + operator on Fixnum. Here are some things to avoid if you want to write fast Ruby code:

* ==
* [] - array access
* Hash#[] - calculating hashes

Pretty much everything involving accessing your data in a collection will slow you down.

Once I’ve optimized BigDwarf enough where I can continue working on it, I ran some tests. Here’s what we have so far. My test data set is a 10,037,355 line file extracted from our SQL Server 2005 database. The file includes 5 dimensional attributes and one fact (the number we want to sum across all of the dimensions). The file is a text file, tab delimited, one line per row.

BigDwarf processes this file at 4348 lines per second. It will store the fully calculated cube in 3,301,132 bytes. This is down from the original file’s 337,022,624 bytes. That’s a very dramatic compression, at approximately 99% compression rate. YMMV, of course, as dimension cardinality and size of the values in your data dictate much of that compression. The lower cardinality, the higher compression you’ll see.

BigDwarf also supports querying, with basic filtering support. I’ve yet to do work to optimize the query performance, or to really get a sense of how fast it is. That’s on my TODO as well.

All in all, BigDwarf is working really well so far. There’s work to do for further compression through suffix coalescing and further optimization through smarter cube building.

Dabble DB Brings the Web of Data to Life

Wednesday, April 11th, 2007

Dabble DB has completely blown me away. Dabble DB is like Club Med for your data. You want your data to get a massage while sipping a Mai Tai on the beach, you got it. Your data will get the five star treatment at Dabble DB.

So everyone is talking about Web 3.0, AKA the Web of Data, AKA the Semantic Web. Those visions are all well and good, and I do believe we’ll see a Data Centric Web soon. But if there’s a Web of Data, that must mean you’ve got Data on the Web.

What? Your data is buried in some SQL Server database on the company LAN? That doesn’t sound very webby to me. And you’re building all these custom, one-off, Visual Basic apps or Excel macros manage your data? Tisk, tisk. So not webby.

This is where Dabble DB comes in. Not only does it provide a very slick, dripping with AJAX interface for you to import and manage your data, it’s a *very* smart interface. Normal muggles (Haven’t read Harry Potter? Whaaaa?) can easily use Dabble DB to classify, link, sort, and visualize their data. Dabble DB is not a snazzy front end to a relational database system. Dabble DB is a snazzy front end to data.

Let’s put it this way: I haven’t seen a desktop application that helps you with your data like Dabble DB.

OK, enough of the uber love fest. Bringing it all back to the semantic web, Dabble DB might be in a class of killer applications for the semantic web. I really love Dave Beckett’s description of the semantic web: “The semantic web is webby data.” So the semantic web will need, as a killer app, something that makes managing a *linking* data so super easy and more importantly: incredibly rewarding.

That last statement is important. The killer application for the semantic web must be *rewarding*. That is, you will get out of it more than you put into it. Dabble DB does this to some extent, as you can graph your data, map your data, export your data, subscribe to your data.

It doesn’t appear that Dabble exports to RDF, nor does it appear that you can link data together via ontologies. But if Dabble DB doesn’t do that, someone else will. For data that is truly webby is data that can be extended by sources outside of your control.

At work, we’ve been building a large data warehouse, and the interface to go with it, so systems like Dabble DB are extremely interesting to me. I want to give my users an experience like Dabble.

links for 2007-03-30

Thursday, March 29th, 2007
  • How to choose which views to materialize in an OLAP cube, when it is too expensive to materialize all views. This is the next optimization for our aggregation strategies in ActiveWarehouse.
    (tags: olap database)

ActiveWarehouse Gets Some Love

Wednesday, March 28th, 2007

ActiveWarehouse, the Ruby on Rails plugin for data warehouse development, was written up by InfoQ in their article ActiveWarehouse, a New Step for Enterprise Ruby.

I’ve been writing different aggregation strategies for ActiveWarehouse, trying to find something that’s not too slow or cumbersome. ActiveWarehouse supports pluggable aggregation, or rollup, strategies, so you can use what works best for you. We have some very large data sets and very large dimensions (one dimension we have has 215 million rows). So if ActiveWarehouse can eventually handle that, I think we’re in good shape.

I can say that ActiveWarehouse will work great if you have a smallish data set. I would say up to a million rows in your dimensions would be big enough. Of course, no matter how much work we put into optimizing ActiveWarehouse’s aggregation schemes, smart database tuning will always help tremendously.

links for 2007-03-28

Tuesday, March 27th, 2007

RDF Queries and Ontologies

Thursday, January 19th, 2006

Danny nicely puts the problem I’m trying to solve in his post titled SPARQL trick #23. He says:

> Running inference on the store’s data as a whole may be expensive, but without it many results for a given query may be missed.

This is exactly why we are attracted to semantic web technologies. I have a lot of data, but I know there are many more pieces of information in there if I can apply some ontologies and rules. My queries against the system must search both the raw triples I have plus any triples that can be inferred by my ontologies. To me, this is one of the main value adds of the system.

The other main value add of a RDF store vs. a traditional relational store is that it’s much easier and cheaper to say arbitrary things. In a relational store, your schema must be defined up front, severly limiting your ability to define data in the future. With RDF, saying anything about anything is cheap.

There are some solutions that work well for data sets that are static. You simply write your ontologies and rules, then run your triplestore through the reasoning engine. Bingo, you’ve got 4-5x more triples. Search away!

Problem with that is, users expect live, real time interaction with the system. They want to add a new triple, and then have immediate access to any newly inferred triples. Therefore, the batch run of your rules won’t work in this scenario.

It might work to apply the rules and ontologies at query time. Treat them like regular business logic that you would write in your object model. This logic is run when requested, not run in batch with results written to the database (in typical web applications, that is). How performant would it be to treat ontologies and rules like business logic, and thus treat the triple store as a traditional relational database (that is, dumb and full of data)?

Looks like I’ll give this a shot, and see where it leads us. Much like the ActiveOntology work being done over in Ruby, wiring in Jess or Drools as a base class for a Java object model might make sense here.

Of course, the downside of this is that it will never be as performant as the reasoning engine living inside the database. Time to integrate Jess directly with PostgreSQL?