Yes, database normalization is good
So InfoQ has collected a few blog posts which ask Data normalization, is it really that good?
Of course it’s good, as long as you have requirements which dictate this optimization. If your application requires extremely fast writes, and this can happen in a heavy loaded OLTP system, then data normalization is your savior. If your application requires extremely fast reads, like OLAP systems, then of course data normalization is a killer.
These competing requirements are exactly why you have database systems optimized for either read or write. This is why large systems will maintain an operational system conforming to OLTP principles, and reporting systems conforming to OLAP principles.
Remember, traditional database systems are row oriented. This architecture is itself an optimization for OLTP and normalized data. Read mostly (or read only) systems can be column oriented, which organize the data on disk to optimize reads. For instance, Google’s BigTable is an implementation of a column oriented database.
Unfortunately, none of this is really about “data normalization”. This is about relational databases, schema structures, referential integrity, and such like that.
“Data normalization” is much more about having standard formats for your data — little things like always using “Street” or always using “St” (or “St.”) in an address field, or storing all phone numbers as pure numerics without spaces or other punctuation vs. storing phone numbers as strings with whatever formatting the person inputting it chooses (this becomes important when you start dealing with international phone records — which is also where you have to start thinking about country codes, in addition to area codes and local exchanges).
Normalization is not only necessary if you’re going to be doing joins of any kind, but also when you’re doing selects based on the content of a given field — because you cannot match “(617) 555-1212″ to “6175551212″ or “+1-617-555-1212″ without doing some major manipulation — and that takes significant time when you start considering all the possible format variants, and whether the leading “+1-” about breaks a match or not, and whether “617-KL5-1212″ also matches….
Both OLAP and OLTP benefit from normalization — in different ways, but the benefits are inarguable. Schema design and referential integrity concerns are different and distinct — and these is where the tradeoffs for OLAP vs OLTP optimzation come in.
Ted Thibodeau Jr
15 Aug 07 at 6:47 am
Ted,
You seem to be talking about data cleansing more than normalization. I’m just curious if you have any references to usage of the term data normalization in the context that you are using it?
Anthony Eden
18 Aug 07 at 2:29 am