links for 2007-09-01
Friday, August 31st, 2007
Note to self: quit job, read all of these.
I’m preparing to move everything I host in-house to save on monthly fees. My bandwidth needs are very low, but I need a lot of storage for home movies. Running a small Linux server off my cable modem makes the most sense now.
Sending email from this server is a bit tricky, but with the help of some excellent tutorials, I’ve managed to configure Postfix to relay all email through Gmail. Google Apps For Your Domain is hosting my email, so this works out perfectly. (Note: I’m only using a regular Gmail account for my relay so far. I will try to use my Google Apps account soon.)
If you want to use Gmail to relay email, check out the Gmail Relay Emails for Postfix on Redhat tutorial or the Gmail Relay Emails for Postfix on Ubuntu tutorial. Note that if you are running Ubuntu, you need to download the Thawte root SSL certificates, as outlined in a comment at that tutorial.
As a side note, I’m using DNS Park for my DNS hosting. DNS Park will host two free domains for you, and supports dynamic DNS updates. Dynamic DNS is crucial if you are on a cable modem or DSL. The excellent ddclient has a sample DNS Park configuration.
So Bill says XMPP matters and is “pushing” a push model for message delivery.
I’m curious, though, whether this debate can be reframed as Stateless versus Stateful. A stateless messaging system would map to a Pull strategy, placing the burden on the client to actively poll or pull its messages. A stateful system would map to Push, where the server maintains a connection for every subscribed client.
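Here is a rough sketch of that mapping in Python. The class names `PullServer` and `PushServer` are mine, made up for illustration, and a dictionary entry stands in for an open connection; this is not how XMPP or any real server is built.

```python
from typing import Callable, Dict, List


class PullServer:
    """Stateless toward clients: it stores messages and nothing else."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def publish(self, message: str) -> None:
        self.messages.append(message)

    def fetch(self, since: int) -> List[str]:
        # The client supplies its own position, so the server keeps
        # no record of who is reading or how far they have read.
        return self.messages[since:]


class PushServer:
    """Stateful: it tracks every subscriber it must deliver to."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, Callable[[str], None]] = {}

    def subscribe(self, client_id: str, deliver: Callable[[str], None]) -> None:
        # One entry per client stands in for one open connection.
        self.subscribers[client_id] = deliver

    def publish(self, message: str) -> None:
        for deliver in self.subscribers.values():
            deliver(message)


# Pull: the client polls and tracks its own cursor.
pull = PullServer()
pull.publish("a")
pull.publish("b")
cursor = 0
batch = pull.fetch(cursor)
cursor += len(batch)

# Push: the server holds per-client state and calls the client.
push = PushServer()
received: List[str] = []
push.subscribe("client-1", received.append)
push.publish("c")
```

The per-client state in `PushServer.subscribers` grows with the number of subscribers, while `PullServer` holds the same state no matter how many clients poll it.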
Stateful systems are hard to scale over the internet. One reason is that there’s a limit to the number of TCP connections I can keep open at any one time. What’s the limit? Not sure; it’s probably OS- and configuration-specific. But is that a limit I’ll easily hit? And if I’m not actively maintaining an open connection, can the system still be called Push?
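One place to look for that limit, at least on Unix-like systems, is the per-process cap on open file descriptors, since every open socket uses one. The snippet below is a small sketch that only reads the current values; it assumes a platform where Python's `resource` module is available (it is not on Windows), and other limits such as memory or kernel settings may bite first.

```python
import resource

# Each open TCP socket consumes one file descriptor, so the
# per-process descriptor limit bounds how many connections a
# single server process can hold open at once.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# The soft limit is what the process is held to right now; the
# hard limit is the ceiling the soft limit may be raised to.
print(f"soft limit: {soft_limit}, hard limit: {hard_limit}")
```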
An aside on Push vs Pull. Push might make for faster reacting systems, but I know that Pull is usually the way I want to process information. The more Push events I have, the less I get done, the less I can focus, and the more transient everything becomes. To get things done, I need to Pull information when I’m good and ready. So even if we’ve figured out how to scale Push and deploy it everywhere, the edge agents of mine will still buffer everything until I’m ready to Pull it.
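That buffering edge agent could be as simple as the sketch below. `EdgeBuffer` and its method names are hypothetical, invented for this illustration.

```python
from collections import deque
from typing import Deque, List


class EdgeBuffer:
    """Accepts pushed events immediately, hands them over only on request."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def on_push(self, event: str) -> None:
        # Called by the network side whenever something arrives.
        # Nothing interrupts the reader; the event just waits.
        self._pending.append(event)

    def pull(self) -> List[str]:
        # Called by the reader when good and ready: returns everything
        # received so far, in arrival order, and empties the buffer.
        events = list(self._pending)
        self._pending.clear()
        return events


buffer = EdgeBuffer()
buffer.on_push("new mail from Bill")
buffer.on_push("build finished")
first_read = buffer.pull()
second_read = buffer.pull()
```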
So take that, Outlook email notification popups!
What if Lincoln used Powerpoint? (file under How Not To Do It)
Of course, if you’re looking for an example of how to do a PowerPoint presentation right, don’t miss Dick Hardt giving the OSCON 2005 Keynote on Identity 2.0. That’s how you do it.
e is a stack for the data web. Not only is it all in Ruby and built on RDF, but it’s some of the most bare code I’ve seen in a while.
You had me at “data web”.
And +10 for using the file system as a data store instead of a database.
Bill de hÓra writes that Phat Data is the challenge of the future. Couldn’t agree more. My recent work with data warehouses has certainly shown me that managing and accessing terabytes of data is nontrivial.
We’ve learned a few things, most importantly, “Denormalize and aggregate.” Avoiding I/O is the most important step to take. And we’ve achieved some pretty decent performance numbers with a traditional relational database. However, as Bill points out, we’re using it as a big indexed file system.
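To make “denormalize and aggregate” concrete, here is a toy sketch using Python's built-in `sqlite3` module. The `sales` and `daily_sales` tables and their columns are invented for the example and have nothing to do with our actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Detail table: one row per sale. In a real warehouse this is the
# table that is too big to scan for every report.
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2007-08-30", "east", 10),
        ("2007-08-30", "east", 15),
        ("2007-08-30", "west", 7),
        ("2007-08-31", "east", 20),
    ],
)

# Aggregate once, at load time, into a small summary table.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT day, region, SUM(amount) AS total, COUNT(*) AS orders
    FROM sales
    GROUP BY day, region
    """
)

# Reports read the summary row instead of rescanning the detail rows.
row = conn.execute(
    "SELECT total, orders FROM daily_sales WHERE day = ? AND region = ?",
    ("2007-08-30", "east"),
).fetchone()
summary_rows = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

The trade is extra storage and load-time work in exchange for far less I/O at query time.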
But having SQL and the numerous tools that support SQL has been critical to our success. I can’t imagine solving these problems with proprietary tools. Sure, it’s possible. Google did it, but they have more PhDs than you can shake a stick at. Plus some mega clusters.
While multi-core CPUs are a welcome upgrade, what I really want is multi-spindle hard drives. Call me when I can emulate a Google cluster on my desktop. What’s lacking is a cheap and effective way to parallelize my disk activity.
What would be really nice is to turn my corporate network into a giant compute farm utilizing both all those CPUs and all those hard drives. Now that is really turning the network into the computer. So, don’t give me EC2, give me EC2 in my office. With everyone using their huge desktops just to read email and write PowerPoint, I know there’s a ton of unused computing power. This is P2P with a purpose.
So InfoQ has collected a few blog posts which ask, “Data normalization, is it really that good?”
Of course it’s good, as long as your requirements call for this optimization. If your application requires extremely fast writes, as can happen in a heavily loaded OLTP system, then data normalization is your savior. If your application requires extremely fast reads, as OLAP systems do, then of course data normalization is a killer.
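A tiny sketch of that trade-off, again with Python's `sqlite3`. The `customers`, `orders`, and `orders_flat` tables are made up for illustration: the same fact (a customer's city) lives in one row in the normalized schema and in every order row in the flat one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: the city is stored once, on the customer.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Boston')")
conn.executemany("INSERT INTO orders VALUES (?, 1)", [(1,), (2,), (3,)])

# Denormalized: the city is copied onto every order row.
conn.execute(
    "CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, customer_id INTEGER, city TEXT)"
)
conn.executemany(
    "INSERT INTO orders_flat VALUES (?, 1, 'Boston')", [(1,), (2,), (3,)]
)

# Write path: the normalized update touches one row, the flat one
# touches every order the customer has ever placed.
normalized_writes = conn.execute(
    "UPDATE customers SET city = 'Denver' WHERE id = 1"
).rowcount
flat_writes = conn.execute(
    "UPDATE orders_flat SET city = 'Denver' WHERE customer_id = 1"
).rowcount

# Read path: the normalized schema needs a join, the flat one does not.
joined = conn.execute(
    "SELECT o.id, c.city FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()
flat = conn.execute("SELECT id, city FROM orders_flat ORDER BY id").fetchall()
```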
These competing requirements are exactly why you have database systems optimized for either read or write. This is why large systems will maintain an operational system conforming to OLTP principles, and reporting systems conforming to OLAP principles.
Remember, traditional database systems are row-oriented. This architecture is itself an optimization for OLTP and normalized data. Read-mostly (or read-only) systems can be column-oriented, organizing the data on disk to optimize reads. For instance, Google’s BigTable is an implementation of a column-oriented database.
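The difference in layout can be sketched with plain Python data structures. This is only an in-memory analogy with invented sample data; it says nothing about how BigTable or any real engine arranges bytes on disk.

```python
# Row-oriented: each record is stored together, which suits
# reading or writing one whole record at a time (OLTP).
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 7},
    {"id": 3, "region": "east", "amount": 20},
]

# Column-oriented: each column is stored together, which suits
# scanning one attribute across many records (OLAP).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing a single column: the row layout visits every full record,
# the column layout reads just the one list it needs.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

# Rebuilding one record from the column layout means touching
# every column, which is why it is a poor fit for OLTP.
record = {name: values[1] for name, values in columns.items()}
```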