Tuesday, April 6, 2010

More on simple databases

Thinking about it some more, the difference between something like MS SQL Server or DB2 and Hibernate + MySQL with MyISAM is that a traditional RDBMS is a single system that handles both relational management AND static data storage. The persistence layer/db strategy separates the RMS (relational management system) from the DB (static storage). By moving the relational management portion into the persistence layer, you can run one processing cluster that serves your application and handles the persistence layer/RMS, as opposed to running two clusters, one to serve your application and one to handle the traditional RDBMS. I would think that combining your servers into one larger cluster would be more efficient than splitting them into two, where one cluster may hit max load while the other sits idle.

By the way, I know there are measurable differences between MyISAM, InnoDB, and other storage engines, such as row-level locking (InnoDB) as opposed to table-level locking (MyISAM) on updates.

How I'm learning to love simple databases

Once upon a time I was a total believer in strict database rules: everything fully normalized, schema and diagrams thought out ahead of time, strict data validation at the database level, and so on. I was a total believer in the primacy of MS SQL Server, DB2, and the like. Over the last 6 months a change of mind has been brewing, and only lately have I fully realized it.

When you're developing a webapp nowadays, chances are you use various frameworks and technologies, each layered on top of the other: MySQL -> Hibernate -> Java application -> JSP/Servlets -> HTML/markup, as one example. The data you accept from your users must pass through each layer to finally reach your storage, whether that's in memory at the application layer or static storage at the db. Each layer has its own data validation. So if a user fills out a date field, it comes in as a string, which the Servlet converts to a valid Java date and passes to your application layer, where you can put it in an object that checks that it's a valid date (if the field is defined as a Date object). Then it goes to your persistence layer, where, in our example, Hibernate checks that it's a valid date. Finally it gets passed to your database layer, where, in our example, MySQL checks that it's a valid date. All that work to verify several times that it's a date. Why? What's the point? If you checked that it's a valid date once, and you don't do any meaningful transformation of the data afterwards, there is no need to check it again.
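To make the "check it once" idea concrete, here is a minimal sketch in plain Java. It is not how Servlets, Hibernate, or MySQL actually do their checks. The class and method names (BoundaryValidation, parseUserDate, toStorage) are made up for illustration, and it assumes the user types dates in ISO format (yyyy-MM-dd) and that java.time is available.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class BoundaryValidation {

    // Hypothetical boundary helper: the ONLY place the raw user string is checked.
    // Assumes ISO-8601 input such as "2010-04-06".
    public static Optional<LocalDate> parseUserDate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(raw.trim()));
        } catch (DateTimeParseException e) {
            // Malformed text or an impossible date (e.g. February 30th).
            return Optional.empty();
        }
    }

    // Hypothetical storage step: trusts the already validated value
    // and hands plain text to whatever sits underneath.
    public static String toStorage(LocalDate date) {
        return date.toString();
    }

    public static void main(String[] args) {
        Optional<LocalDate> good = parseUserDate("2010-04-06");
        Optional<LocalDate> bad = parseUserDate("2010-02-30");

        System.out.println("good accepted: " + good.isPresent());
        System.out.println("bad accepted: " + bad.isPresent());
        good.map(BoundaryValidation::toStorage)
            .ifPresent(text -> System.out.println("stored as: " + text));
    }
}
```

Everything below parseUserDate receives either a real LocalDate or nothing at all, so the lower layers have nothing left to re-check.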

If I think about it, when exactly do I care whether it's a valid date? When I accept data from the user, when I do fancy things at the application layer like date/time arithmetic, and when I display data back to the user. None of that touches the database layer. Do I care whether the static data is stored in a traditional RDBMS, a bunch of XML files, or a straight-up B-tree store like GT.M? Other than the possible performance gains from storing static data one way or another, it doesn't matter.
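One way to picture "it doesn't matter where the static data lives" is a tiny storage interface that only deals in strings. This is just a sketch: RecordStore and InMemoryStore are hypothetical names, and the in-memory map stands in for whatever backend (RDBMS, XML files, B-tree store) would really sit there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class StorageAgnostic {

    // Hypothetical storage contract: the application only sees opaque strings.
    interface RecordStore {
        void put(String key, String value);
        Optional<String> get(String key);
    }

    // One possible backend. A file-, XML-, or SQL-backed version would
    // implement the same interface without the application noticing.
    static final class InMemoryStore implements RecordStore {
        private final Map<String, String> data = new HashMap<>();

        @Override
        public void put(String key, String value) {
            data.put(key, value);
        }

        @Override
        public Optional<String> get(String key) {
            return Optional.ofNullable(data.get(key));
        }
    }

    public static void main(String[] args) {
        RecordStore store = new InMemoryStore();
        // The value was validated higher up; here it is just text.
        store.put("user:1:birthday", "1980-05-17");
        System.out.println(store.get("user:1:birthday").orElse("missing"));
    }
}
```

Swapping backends then becomes a performance decision, since the code that cares about dates never sees the store's internals.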

I currently use Wicket, and Wicket has easy-to-use form validation built into the framework itself, which covers all the cases in which I would care to check whether the data is appropriate. Right now I do datatype management at each layer because that's how I was taught, but I'm seriously considering switching everything underneath Wicket to string, integer, or float, crossing my fingers, and hoping nothing blows up.

Beyond data validation, what about primary/foreign keys? Many databases don't even have pk/fk relationships built in. Originally, I thought those databases were mad (in a bad way). However, if you think about it, the business logic that says one table must relate to another via a field-based relationship has to exist in your RDBMS, your persistence layer, and your application (because you must pass data with the appropriate relationships to the layer below). Why should each layer check the relational integrity of the data? The only case I can think of is protecting myself from a mistake at the application layer. Fine then, I can set those rules in the persistence layer as well as in my application, and the database can simply be a static container, nothing more.
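Here is a rough sketch of what keeping the relationship rule out of the database could look like. It is plain Java rather than Hibernate, and Store, OrderRepository, and the customer/order tables are all invented for the example. The "database" is two unconstrained maps, and the one foreign-key-style rule lives in the repository.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleStoreIntegrity {

    // Stand-in for a "static container" database: two tables, no constraints.
    static final class Store {
        final Map<Integer, String> customers = new HashMap<>();
        final Map<Integer, Integer> orderToCustomer = new HashMap<>();
    }

    // Hypothetical persistence layer: the one place the relationship rule lives.
    static final class OrderRepository {
        private final Store store;

        OrderRepository(Store store) {
            this.store = store;
        }

        void saveCustomer(int customerId, String name) {
            store.customers.put(customerId, name);
        }

        // Plays the role of a foreign key: an order must point at a known customer.
        void saveOrder(int orderId, int customerId) {
            if (!store.customers.containsKey(customerId)) {
                throw new IllegalArgumentException("unknown customer " + customerId);
            }
            store.orderToCustomer.put(orderId, customerId);
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        OrderRepository repo = new OrderRepository(store);

        repo.saveCustomer(1, "Ada");
        repo.saveOrder(100, 1); // fine: customer 1 exists

        try {
            repo.saveOrder(101, 99); // rejected: no customer 99
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off is the one already mentioned: anything that writes to the store without going through the repository skips the check, so this only protects me if all writes go through that one layer.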