I had the pleasure of speaking to data management visionary Pat Helland earlier in the week about scalable distributed data management. Since then, I’ve been seeing the world through a lens of loosely-coupled distributed workflow and have started wondering if there is a pattern emerging in the Web: the use of multi-user document versioning for enabling the realtime Web.
One of the most familiar use cases of multi-user document versioning, and one that programmers certainly know, is source code control. CVS, SVN, and Git all do the essential job of managing source code repositories, making sure that one person’s changes are not stomped on by another user’s changes. They archive the history of source code documents and detect conflicts in source files. From a database practitioner’s viewpoint, these source code control systems are very much transactional databases. When a user commits a file, the user is apprised of any line-by-line source code conflicts. If there are conflicts, the user must resolve them, potentially rolling back their own changes, and then retry the commit operation. With only a handful of programmers and the typically bulk nature of source code repository commits, the expense of handling “transaction conflicts” is usually manageable.
But what happens when the source files, or let’s imagine more generally, any document, is shared simultaneously by many people who all might make changes very quickly? What happens, for example, in a groupware scenario (a.k.a. Computer Supported Cooperative Work) where many users simultaneously edit and read a shared document? Increased concurrent activity means more opportunity for conflicts. If each document mutation is done under a pessimistic locking model, users feel the latency because their operations block while waiting for locks.
To mask latency, optimistic concurrency control modes, such as multi-version concurrency control, avoid blocking on reads at the cost of potentially incurring more rollback in cases of conflict, i.e., more work must be undone and more apologies must be made. With the increasingly distributed (e.g., client/server or cloud-based) nature of computing, adoption of multi-version concurrency mechanisms is growing, since clients can perform local modifications immediately rather than waiting for high-latency communication with servers.
Being optimistic is great because it masks latency. However, the amount of concurrency control management is not fundamentally altered. There is still opportunity for conflict; it’s just that, compared to pessimistic approaches, the discovery of conflicts occurs at the end (rather than the beginning) of a workflow. It would be nice to scale the number of concurrent participants doing frequent commits to a shared document while masking, i.e., automating, more of the concurrency management. In other words, it would be nice if collaborators on a shared document could enjoy a rich, live, realtime user experience and be spared as much of the drudgery of rollback as possible.
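To make the "conflict discovered at the end" point concrete, here is a minimal sketch of an optimistic commit with a version check. The names `VersionedDocument`, `ConflictError`, and `commit_with_retry` are hypothetical and do not come from any system mentioned here; a real store would also need durable storage and thread safety.

```python
class ConflictError(Exception):
    """Raised when a commit is based on a stale version."""


class VersionedDocument:
    """Toy in-memory document; history[i] holds the text at version i."""

    def __init__(self, text=""):
        self.history = [text]

    @property
    def version(self):
        return len(self.history) - 1

    def read(self):
        # Reads never block: hand back a snapshot plus the version it came from.
        return self.version, self.history[-1]

    def commit(self, base_version, new_text):
        # The conflict check happens here, at the end of the user's workflow.
        if base_version != self.version:
            raise ConflictError(
                f"edit was based on v{base_version}, but head is v{self.version}"
            )
        self.history.append(new_text)
        return self.version


def commit_with_retry(doc, edit, max_attempts=3):
    """Re-read and redo `edit` (a function from old text to new text) on conflict."""
    for _ in range(max_attempts):
        base, text = doc.read()
        try:
            return doc.commit(base, edit(text))
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")
```

The retry loop is the "apology": the work done against the stale snapshot is thrown away and redone, which is exactly the drudgery one would like to automate.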
Patterns, idioms, and topologies for such scalable document management seem to be appearing. In particular, it seems that concurrency semantics are evolving, e.g., from simple reasoning semantics like linearizability to sophisticated domain-specific languages that effectively lay down strict parameters/rules for what, when, and how concurrent operations are managed and resolved for a given domain, e.g., group editing of a shared text file. A benefit of parameterization is that it articulates the rules of the game for concurrency to users, creating a well-understood playing field for multi-user operations. In other words, users become aware of where their workflow boundaries intersect with other concurrent user workflows. But most importantly, parameterization of domain-specific concurrency idioms can define how competing (conflicting) operations will be transformed and composed into operations that, when applied by each participant independently, create a sensible shared outcome. This enables automatic system reconciliation to occur. Two conflicting operations can be “transformed” into operations that can be applied at a client and server independently so that active participants see the same agreed-upon semantic document changes. Participants see their local changes immediately with no latency while server changes are applied back asynchronously. Thus a groupware editing environment can build tools and semantics that define how concurrent operations are combined to make sense for multiple users. As a trivial folksonomy idiom, in a source code file, two concurrent modifications each occurring in a separate method within the same module are merged without conflict. In a groupware text editing scenario, two concurrent modifications (e.g., character insertions) on separate paragraphs do not conflict and can thus be rendered concurrently (without delay) to multiple users.
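As a rough illustration of "transforming" two concurrent operations, here is a minimal sketch for the simplest case: two character insertions made against the same version of a text. The `Insert` type, its `site` tie-breaker field, and the function names are assumptions made for this example and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document where the text goes
    text: str  # characters to insert
    site: int  # hypothetical participant id, used only to break ties


def apply_insert(doc, op):
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op, other):
    """Rewrite `op` so it can be applied after the concurrent `other`."""
    other_goes_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    if other_goes_first:
        # `other` added text to the left of `op`, so shift `op` to the right.
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


doc = "hello world"
a = Insert(0, "Oh, ", site=1)  # one participant edits the start
b = Insert(11, "!", site=2)    # another edits the end, concurrently

# Each side applies its own edit first, then the transformed remote edit.
side_one = apply_insert(apply_insert(doc, a), transform(b, a))
side_two = apply_insert(apply_insert(doc, b), transform(a, b))
assert side_one == side_two == "Oh, hello world!"
```

Neither participant waits on the other, yet both end up with the same text. Deletions, overlapping edits, and more than two participants need considerably more care than this sketch shows.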
A great example (literally yesterday’s news) of more advanced concurrency parameterization and multi-user versioning is Google Wave. Google Wave extends Operational Transformation to create a rich, live collaborative editing experience: “The result is that Wave allows for a very engaging conversation where you can see what the other person is typing, character by character much like how you would converse in a cafe.” Google’s OT defines what mutation operations are legit on a document and defines how two concurrent operations are morphed and reconciled, creating a de facto domain-specific language for collaborative editing. Operational transformation mutations include skip, insert characters, insert elements, delete characters, set attributes, etc. Hence, these mutation components define/parameterize scopes for concurrency. Furthermore, Wave defines a composition operation which enables two operations to be composed into a single composite operation…which enables conflation of operations.
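To give a feel for composition, here is a simplified sketch that models an operation as a list of components that walk the whole document from left to right. Only three components are covered (skip, insert characters, delete characters), deletes carry a count rather than the deleted text, and the tuple encoding and function names are assumptions for this example rather than Wave's actual format.

```python
def apply_op(doc, op):
    """Apply a list of ("skip", n) / ("insert", s) / ("delete", n) components."""
    out, i = [], 0
    for kind, arg in op:
        if kind == "skip":
            out.append(doc[i:i + arg])
            i += arg
        elif kind == "insert":
            out.append(arg)
        elif kind == "delete":
            i += arg
        else:
            raise ValueError(f"unknown component {kind!r}")
    if i != len(doc):
        raise ValueError("operation must span the whole document")
    return "".join(out)


def _size(comp):
    kind, arg = comp
    return len(arg) if kind == "insert" else arg


def _split(comp, n):
    """Split a component into its first n units and the remainder."""
    kind, arg = comp
    if kind == "insert":
        return (kind, arg[:n]), (kind, arg[n:])
    return (kind, n), (kind, arg - n)


def compose(a, b):
    """Return one operation equivalent to applying a, then b."""
    a, b, out = list(a), list(b), []
    while a or b:
        if a and a[0][0] == "delete":
            out.append(a.pop(0))  # a's deletes act on the original text
        elif b and b[0][0] == "insert":
            out.append(b.pop(0))  # b's inserts do not consume a's output
        else:
            if not a or not b:
                raise ValueError("operations do not line up")
            n = min(_size(a[0]), _size(b[0]))
            head_a, rest_a = _split(a.pop(0), n)
            head_b, rest_b = _split(b.pop(0), n)
            if _size(rest_a):
                a.insert(0, rest_a)
            if _size(rest_b):
                b.insert(0, rest_b)
            if head_b[0] == "skip":
                out.append(head_a)  # b keeps what a skipped or inserted
            elif head_a[0] == "skip":
                out.append(head_b)  # b deletes text from the original
            # otherwise b deletes what a inserted, and the two cancel out
    return out


doc = "hello world"
a = [("skip", 5), ("insert", ","), ("skip", 6)]       # -> "hello, world"
b = [("skip", 7), ("delete", 5), ("insert", "Wave")]  # -> "hello, Wave"
assert apply_op(doc, compose(a, b)) == apply_op(apply_op(doc, a), b)
```

Conflating a burst of keystrokes into one composite operation like this means fewer, larger operations need to be transformed and sent.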
As a glimpse perhaps of what Google Wave collaboration might be like, try out EtherPad. EtherPad is made by the guys (former Google employees I think) at Appjet. Appjet is a slick platform-as-a-service that enables server-side JavaScript, streamlining application development from client to server (since all you need to do is use one magical language called JavaScript).
Finally, because document space within a single, centralized physical repository will ultimately overflow, there is a need to delegate, shard, and partition the global document and history space among multiple machine resources, i.e., a distributed system of resource nodes that can scale with the document and user versioning load. Google Wave prescribes such a federation protocol and data model to accommodate scale-up of the “wavespace”.
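One generic way to spread a document space across nodes is to hash each document id to a node. The sketch below is only that generic technique: it is not Wave's federation protocol, and the node names and the `shard_for` helper are made up for illustration.

```python
import hashlib


def shard_for(doc_id, nodes):
    """Pick a node for a document using a stable hash of its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
home = shard_for("wave-123", nodes)
assert home == shard_for("wave-123", nodes)  # the same id always maps to the same node
```

Plain modulo hashing remaps most documents whenever the node list changes, so a system that grows over time would likely want something gentler, such as consistent hashing or an explicit directory.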
Git as a Google Wave server?
So let’s say you’re looking for a federated, home-based document repository that tracks fine-grained mutations for rollback/time travel as needed, has well-defined user idioms for managing and resolving conflicts, does efficient compression of deltas, and is very scalable and fast. To me this sounds like Git, and thus I’m wondering if Git could be a natural fit as a Google Wave server! I’m not sure if this will work, but if you’re interested, I’ve started prototyping a “gitwave” server and welcome help/feedback on this idea.
One final thought
Looking at Google Wave OT document mutation operations:
skip
insert characters
insert element start
insert element end
insert anti-element start
insert anti-element end
delete characters
delete element start
delete element end
delete anti-element start
delete anti-element end
set attributes
update attributes
commence annotation
conclude annotation
If we replace “character” with “object”, it’s interesting to see what sort of “object graph” semantics appear. In particular, it seems that the concept of object sequences, or what have been called arrables, emerges. Also, unlimited “style attribute runs” could then be applied over arbitrary sequences as ad-hoc schema markup tags. I’m wondering if OT + Git = temporal database?
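As a small thought experiment along these lines, here is a sketch in which the same skip/insert/delete components act on a list of arbitrary objects instead of characters, with tagged runs laid over index ranges. The tuple encoding, the `(start, end, tag)` run format, and both function names are assumptions invented for this illustration.

```python
def apply_ops(items, ops):
    """Apply skip/insert/delete components to a sequence of arbitrary objects."""
    out, i = [], 0
    for kind, arg in ops:
        if kind == "skip":
            out.extend(items[i:i + arg])
            i += arg
        elif kind == "insert":
            out.extend(arg)  # arg is a list of objects to insert
        elif kind == "delete":
            i += arg         # arg is the number of objects to drop
        else:
            raise ValueError(f"unknown component {kind!r}")
    if i != len(items):
        raise ValueError("operation must span the whole sequence")
    return out


def tagged(items, runs):
    """Pair each object with the sorted tags of the runs covering its index."""
    return [
        (item, sorted(tag for start, end, tag in runs if start <= k < end))
        for k, item in enumerate(items)
    ]


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
ops = [("skip", 1), ("delete", 1), ("insert", [{"id": 9}, {"id": 10}]), ("skip", 1)]
rows = apply_ops(rows, ops)

# A run marks objects 1 and 2 with an ad-hoc "new" tag, much like a style run.
marked = tagged(rows, [(1, 3, "new")])
assert [tags for _, tags in marked] == [[], ["new"], ["new"], []]
```

Nothing in the component logic cares that the elements were once characters, which is what makes the “object sequence” reading plausible.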