I’ve worked in jobs where we’ve been forced to reinvent the wheel in terms of data storage. One company in particular built their own database engine – indices and all. Others have merely needed to iterate on one structure or another.
In just about every case, though, we’ve come across the same issue: tracking history.
In a previous job, I built a completely custom CMS we used to manage both web content and a database of business partners. Once you start adding an editorial interface, the next feature request is undoubtedly for some sort of undo feature.
Editors want to be able to track who changed what when. They want to be able to roll back to an earlier version. They want protection and catastrophic errors being introduced in their data.
For the custom CMS, I built an extra table to track data diffs. In WordPress, we keep track of previous post revisions in the existing post database. 1 Neither of these systems is comprehensive, and they both fail in a very specific way.
Editors are still modifying an object and merely tracking changes in some different object.
I’ve been fascinated to see the way tools like Twitter work. They write only, and merely flag certain specific changes in meta for the objects being changed. Namely, Tweets can only be created. They can’t be edited. Deleting a tweet merely changes a
deleted flag on the object itself.
Some other distributed systems I’ve worked with do incredibly will with rapid, distributed writes of information. Reading or modifying that information too quickly is a bit trickier – as a result, many of these systems are write-only and merely resent an “eventually consistent” 2 database state.
With all this information in mind, I have to wonder how efficient a write-only database system would be. Effectively, data objects are only ever added. When an update is required, a new copy of that data object is inserted into the database (with a higher ID or timestamp or vector clock to track it as the new copy). When a deletion is required, a flag on that object’s meta information is merely toggled.
Thus, no data is ever lost. Also, distributing the data across multiple server nodes, which may or may not have reliable connections, means we can live with an “eventually consistent” database and trust our data will always be present in the end.
The only potential downside I can see in this situation is that the data store will continually grow – no data is ever removed, so the overall file size will never decrease.
This is also likely a system we could implement on top of existing database engines. No reinventing the wheel, but we’d potentially have a more reliable data implementation than any of the engines give us by default.
Do you see value in such a structure? Do you see any potential downsides to such an implementation?