mtbc | The Open Microscopy Environment data model

At work we have a data model for the microscopy imaging data we work with. The field advances quickly, so the model needs ongoing adjustment, but a funding agency thought otherwise, so we had to let our main model guy go; I think he now works for a Swiss bank. Given the actual need to make adjustments to the model, I'm in the awkward position of covering for his absence, but thankfully only partially, because others at work have also been working in the same area.

In our source tree the data model starts out as XSD. From that some Python code writes Java classes whose methods relate to the appropriate properties. This view of the data model is used by Bio-Formats, which is a library for (mostly) reading and (somewhat) writing microscopy data in which the pixel data is accompanied by metadata. In microscopy the metadata might describe the experimental setup: the microscope information, light frequencies, filters, mirrors, the actual physical size of the pixels (already transformed to be constant across the image), how far apart z-sections are, at what interval the multiple stills were taken, etc. For high-content screening this metadata would also include information about the plate: they have wells in a grid, so each image will be from a particular acquisition run for a specific well, probably treated with particular reagents.
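As a rough illustration of that XSD-to-Java step, a toy generator in Python might look like the following. The schema fragment, the type mapping and the layout of the emitted class are all invented for the example and are not taken from the real OME schema or build; the actual generator is far more involved.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment, not copied from the real OME XSD.
XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Well">
    <xs:attribute name="Row" type="xs:int"/>
    <xs:attribute name="Column" type="xs:int"/>
    <xs:attribute name="ReagentName" type="xs:string"/>
  </xs:complexType>
</xs:schema>
"""

# Assumed mapping from XSD simple types to Java types.
JAVA_TYPES = {"xs:int": "Integer", "xs:string": "String", "xs:double": "Double"}


def generate_java(xsd_text):
    """Return {class name: Java source} for each top-level complexType."""
    root = ET.fromstring(xsd_text)
    sources = {}
    for ctype in root.findall(XS + "complexType"):
        name = ctype.get("name")
        lines = ["public class %s {" % name]
        for attr in ctype.findall(XS + "attribute"):
            prop = attr.get("name")
            jtype = JAVA_TYPES[attr.get("type")]
            field = prop[0].lower() + prop[1:]
            # One private field plus a getter and a setter per property.
            lines += [
                "    private %s %s;" % (jtype, field),
                "    public %s get%s() { return %s; }" % (jtype, prop, field),
                "    public void set%s(%s value) { this.%s = value; }"
                % (prop, jtype, field),
            ]
        lines.append("}")
        sources[name] = "\n".join(lines)
    return sources


if __name__ == "__main__":
    print(generate_java(XSD)["Well"])
```

Running it prints a small `Well` class with `getRow`, `setRow` and so on, which is the general shape of "methods that relate to the appropriate properties".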

OMERO is our server-client system for managing microscopy data. It stores users' image files in the original format so they can use OMERO as their primary data repository; it can use Bio-Formats under the hood to actually read and write the data. For reading, it caches, for each image, the state of the Java instance of the Bio-Formats reader for that image format once that reader has opened the image, so that it can always return to reading quickly enough for live rendering.
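The reader-state caching idea can be sketched in a few lines of Python. This is only an analogy under stated assumptions: `FakeReader`, `open_cached` and the `.memo` file naming are hypothetical, the freshness check by modification time is a simplification, and none of it reflects how the Java side actually serializes a reader.

```python
import os
import pickle


class FakeReader:
    """Hypothetical stand-in for a format reader; not a Bio-Formats class."""

    def __init__(self):
        self.path = None
        self.plane_offsets = None

    def open(self, path):
        # Imagine slow header parsing here; this toy just invents offsets.
        self.path = path
        self.plane_offsets = [1024 * i for i in range(4)]


def open_cached(path, cache_dir):
    """Return (reader, cache_hit), reusing saved state when it is fresh."""
    memo = os.path.join(cache_dir, os.path.basename(path) + ".memo")
    reader = FakeReader()
    if os.path.exists(memo) and os.path.getmtime(memo) >= os.path.getmtime(path):
        # Restore the previously initialized state instead of re-parsing.
        with open(memo, "rb") as f:
            reader.__dict__.update(pickle.load(f))
        return reader, True
    reader.open(path)
    # Save the reader's fields so the next open can skip the parsing.
    with open(memo, "wb") as f:
        pickle.dump(vars(reader), f)
    return reader, False
```

The first call for a given file pays the cost of `open` and writes the memo; later calls restore the saved fields, which is what makes returning to an image cheap.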

OMERO handles the metadata as well. It has a set of XML files that define classes that correspond to the various Bio-Formats model entities (images, plates, wells, instruments, detectors, lasers) and plenty more. These are processed with the help of Apache Velocity to create Java classes corresponding to these objects; these classes are persisted to a relational database using Hibernate for the ORM. (Hibernate's ubiquitous in enterprise Java but it's also buggy and badly documented.) This OMERO model is further translated for ZeroC's Ice which we use as a sort of CORBA thing for serializing the model objects over a multi-language API. This generates another Java class for each model object: the classes that are worked with by client-side code.

If the build system sounds intimidating, it's worse than that: what I have mentioned so far is a high-level glimpse with much detail omitted. For example, Bio-Formats has further Java code relating to copying metadata from one instance to another, then OMERO has additional Java code that uses the Bio-Formats code to map between the Bio-Formats model (defined in XSD) and the OMERO model (defined in XML), and increased decoupling in our codebase increases the versioning fun.

Recently I have been making required changes to the model, mostly to make the Bio-Formats and OMERO models for regions of interest in images match better. This involves adjusting the XSD, writing the XSLT for upgrades from and downgrades to the previous model, updating the Java code for copying between instances, updating the OMERO model and its code for converting to and from Bio-Formats' model, updating the database schema and writing the SQL upgrade scripts, and adjusting every bit of code, server- and client-side, that uses the adjusted objects, including the automated tests and the instructive examples in our documentation. I do this infrequently enough that each time I have to relearn XSLT, mostly XPath as I've not much touched JavaScript for a while now.
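To give a flavour of what an upgrade and downgrade pair has to do, here is a minimal sketch that uses Python's ElementTree rather than XSLT, purely so that it runs with the standard library alone. The element and attribute names, and the notion that a shape's `Name` becomes `Text` between versions, are made up for the example and are not the actual model change.

```python
import xml.etree.ElementTree as ET

# Hypothetical "old model" document; not real OME-XML.
OLD = '<ROI ID="ROI:0"><Rectangle Name="nucleus" X="1" Y="2"/></ROI>'


def rename_attribute(xml_text, tag, old, new):
    """Rename attribute old to new on every element called tag."""
    root = ET.fromstring(xml_text)
    for element in root.iter(tag):
        if old in element.attrib:
            element.set(new, element.attrib.pop(old))
    return ET.tostring(root, encoding="unicode")


def upgrade(xml_text):
    """Old model to new model: Name becomes Text (assumed change)."""
    return rename_attribute(xml_text, "Rectangle", "Name", "Text")


def downgrade(xml_text):
    """New model back to old model: Text becomes Name."""
    return rename_attribute(xml_text, "Rectangle", "Text", "Name")


if __name__ == "__main__":
    print(upgrade(OLD))
    print(downgrade(upgrade(OLD)))
```

A useful sanity check for any such pair is that downgrading an upgraded document gives back the same attributes it started with, at least where the newer model has not added information that the older one cannot hold.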

I have help. When the model-generation code in Python needs adjustment, David and Roger do much of that. Where there's other technical debt in the infrastructure, like in managing our collection of sample image files, Sébastien has been enthusiastically leading that. I am hoping that for some of the client-side utility code for working with model objects, Dominik will handle the Java side and Will the Python side; also, now we have C++ code in Bio-Formats (with more code generation too) and Roger mostly handles that. And, while I don't want to become the model guy (especially given what happened to the previous one), I do enjoy being the PL/SQL guy. Actually, in this job I have to handle only PostgreSQL; it's our commercial spinoff who handle Oracle. In my previous job I'd ended up writing Java code that generates both PostgreSQL and Oracle versions of the same script.