mtbc | Model graph operations

I've mentioned how at work we use a directed graph of model objects to represent all manner of pixel data and metadata about the microscopy images that our software handles. In the heart of the server this information is represented as Java object instances and persisted as rows in a relational database. There is a permissions system controlling access to these objects: for instance, if I have my image in a private group then you may not see it. Or, if I expose it in a read-only group then you may not draw and save regions of interest on it, nor may you put my image in your dataset. Most user data is owned by one user and is in one group. Users may be members of many groups or, more powerfully, owners of groups. There are also root or super-users.

It would probably be clarifying if I gave a few examples of the kinds of operation that require traversal of the model graph and decisions along the way:

I am deleting my dataset. My images and your images are in it. Some of the images are also in another dataset. Which images to also delete?

I am moving my image to a private group. However, you have added some comments to my image. What to do with those comments?

I have an image that shares an instrument with another image. I want to move one of the images to a different group. What to do?

There are also various permissions issues to watch for, including:

If I change the ownership of an image in a read-only group, what to do about that it is in the dataset of somebody who is about to not own it?

If somebody put my image in their dataset and now the group that the data is in is being made a private group, what to do about the dataset?

If I tagged my image with my tag and somebody else tagged their image with that tag, then I move my image to another group, what should happen about the tagging?

The previous implementation of various graph operations had some issues. One was difficulty with diamond structures such as in operating on two datasets that have the same image in them. The worst, though, was performance. This is because it would run in a single transaction, taking a checkpoint before attempting various operations that may or may be possible (if not, often due to permissions) and rolling back where necessary before continuing with the next part of the model graph. I don't know if the problem was in Java or the database engine but for a large operation these many checkpoints were very costly.

I reimplemented the operations to use a common core of graph traversal code that works more like a shortsighted rule engine: it represents the directed graph explicitly, the nodes can take various states (including, don't do anything with this and yes, we want to operate upon it), there is a list of transition rules that may match within one hop against node types and states, matching the linking property names, etc. To take, from the mini-language I had to create, a couple of real examples for moving data between groups,

Image[I].instrument = IN:[E]{i} → IN:{r} — meaning, if we are moving an image, think about moving its instrument, if any.

Image[E]{ia}.instrument = IN:[E]{r} → IN:{a} — meaning, if the instrument is also used by an image we are not moving, mark it to not be moved.

IN:Instrument[E]{o} → IN:[I] — meaning, if nothing else stops the instrument from being moved, then mark it for moving.

I1:Image[I].instrument = IN:[EI], I2:Image[E].instrument = IN triggers an error, may not move {I1} while {I2} remains as they share {IN} — the issue being that the instrument is shared between an image not being moved and an image that is.

I am glossing over a lot of detail including that the {r} node state transitions to {o} only when no other rules match, and that before accepting the above change to {a} as final we try switching it back to {r} to ensure that the rule that previously transitioned it to {a} still matches at all.

Of course, the rules are constructed so as to look out for various problems in advance, rather than checkpointing and checking if some sub-operation fails.

Figuring out an approach to all this that actually works was quite difficult: what we have now wasn't my first attempt and I had to complicate the code in various unanticipated ways. I used plenty of parameterization in writing integration tests for somewhat-combinatorially many situations: for instance, in moving datasets, from what kind of group, to what kind of group, who owns the images in the dataset, are they an owner of the group too, etc.; at least I seem to find bugs through my testing at a rather higher rate than end users report any problems at all. Performance was indeed greatly improved over the previous approach and, as a side effect, various bugs were fixed.

I tried to keep the mini-language and its interpretation simple and restrictive. I find it interesting that I have rarely had to extend or relax the language: if it wasn't obvious how to express the rules to fulfill some requirement then some careful thought usually revealed that actually it could be done.

A worry has been that of painfully reinventing the wheel. While libraries for operating on directed graphs and upon related objects are common, I never managed to find anything that addressed the need to flexibly operate on rich subgraphs of objects in these policy-dependent ways. Surely other applications must have to address similar problems but I don't know how they do it.