Pipelines and software architecture
Lately it's becoming increasingly clear that the product I work on was not designed to be easy to code in. I've been thinking about what actually slows us down so regularly, and I think the root cause is that it was never designed for changing assumptions at all. The codebase grew organically over the last five years, and most implementations are more 'this works' than 'this is right'; that's exactly what's tripping us up.
In software engineering it's easy to look at the work of previous implementors and dismiss it as worthless, because you've seen all the grief it has caused. This is why so many software products get a rewrite when a new lead takes over. However, there are two things I've learned from the current codebase. Firstly, it proves that the product can exist: it is possible to create something that does what we need it to do. The current iteration might not be the most efficient, maintainable or simplest solution to the problem, but it can be done! Secondly, we've learned where the current approach breaks down, which gives us a great starting point for making it more resilient, easier to work with and more beautiful.
This past week we've been discussing re-architecting the application fundamentally, even labelling it a 3.0.0 because the change is so invasive. 'Re-architecting' is a bit of a misnomer, since we're going from no architecture to choosing one, as my lovely partner pointed out. And since there's no pleasant way to grow the code within the current non-architecture, should we perhaps just call it a rewrite?
Anyway, the product is a pure backend processing application that takes in several GBs of medical data and constructs several (file) outputs that are sent back to medical storage devices. It performs its function well, and the processing the system needs to do is straightforward:
- Receive data, i.e. medical images.
- Do several transformations and calculations.
- Generate several responses (files, reports, images, etc.).
- Send responses back for storage.
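The four steps above can be sketched as a simple function composition; the names and data shapes here are illustrative, not the product's real API:

```python
# Minimal sketch of the four pipeline stages as plain functions.
# All names and payloads are invented for illustration.

def receive(payload):
    # Step 1: accept incoming medical data (here just a list of image ids).
    return {"images": payload}

def transform(data):
    # Step 2: transformations and calculations on the received data.
    data["calculated"] = len(data["images"])
    return data

def generate(data):
    # Step 3: build the output artifacts (files, reports, images, ...).
    return [f"report-{i}" for i in range(data["calculated"])]

def respond(outputs):
    # Step 4: send the results back for storage.
    return {"sent": outputs}

result = respond(generate(transform(receive(["img-a", "img-b"]))))
```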
Since steps 2 and 3 are not instantaneous, receiving and responding are disconnected from each other: processing can take anywhere from a few minutes to two hours, depending on the data.
Previous implementors saw a pipeline and implemented a queuing system (Celery/RabbitMQ) that gets activated after receiving has completed. Every step in the pipeline is triggered by the previous one, and it works. However, what happens when a step in the middle fails? The pipeline gets disrupted and no responses are ever sent back (not even an error response). So they decided to wrap everything (and I mean everything) in exception handling so no step could ever fail.
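In plain Python (leaving out the Celery machinery) the failure mode looks roughly like this; the step names are invented, but the pattern is the one described above: each step swallows its own exceptions and unconditionally triggers the next, so a mid-pipeline failure silently produces no response at all:

```python
# Hedged sketch of the chained design: exceptions are swallowed,
# the next step is always triggered, and no error response exists.

def step_calculate(ctx):
    try:
        ctx["result"] = 1 / ctx["divisor"]  # fails when divisor == 0
    except Exception:
        pass                                # swallowed: the chain "continues"
    step_generate(ctx)                      # always trigger the next step

def step_generate(ctx):
    try:
        ctx["report"] = f"value={ctx['result']}"  # KeyError if calc failed
    except Exception:
        pass
    # no step ever decides to send an error response

ctx = {"divisor": 0}
step_calculate(ctx)
# ctx now has neither a result nor a report, and nothing was sent back
```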
From an architecture design perspective there are several issues with this approach. Firstly, steps are tightly coupled: inserting new steps, or deactivating steps that are unnecessary for certain users, is impossible. Secondly, because of all the exception handling, every step needs to check whether any previous step has failed and then 'pass' on its own execution while still triggering the next step. Lastly, when a step depends on a previous one (e.g. a calculation result), it has to check whether that result is available, and still execute even when its prerequisites are not met.
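The defensive boilerplate this forces on every step looks something like the following sketch (names invented): the step must check a failure flag, check its prerequisites, and still "run" either way so the chain keeps moving:

```python
# Sketch of the per-step boilerplate: check earlier failures, check
# prerequisites, and pass through without ever stopping the chain.

def step_report(ctx):
    if ctx.get("failed"):
        return ctx             # a previous step failed: 'pass' but keep going
    if "calculation" not in ctx:
        ctx["failed"] = True   # prerequisite missing: mark failure, keep going
        return ctx
    ctx["report"] = f"report for {ctx['calculation']}"
    return ctx
```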
In short, this approach has resulted in tightly coupled, non-idempotent, non-replayable steps that are hard to understand. To squash a bug you have to understand the entire pipeline in detail, lest you introduce a new bug in a step somewhere down the line.
When thinking about alternatives, my first thought is to completely separate all calculation/generation functions from any kind of scheduling logic. These functions should be as pure as possible: completely stateless and idempotent. That way we can focus directly on when, how and in what order to 'schedule' these functions without having to worry about state or side effects.
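What such functions could look like, as a minimal sketch (the names and data shapes are made up, not the product's real domain model):

```python
# Pure, stateless, idempotent calculation/generation functions:
# output depends only on input, no queue or global state involved.

def mean_intensity(pixels: list[float]) -> float:
    # Pure and idempotent: the same input always yields the same
    # output, and calling it twice changes nothing.
    return sum(pixels) / len(pixels)

def build_report(patient_id: str, intensity: float) -> str:
    # Also pure: renders a result string, performs no I/O.
    return f"{patient_id}: mean intensity {intensity:.2f}"

report = build_report("p-001", mean_intensity([0.2, 0.4, 0.6]))
```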
For the scheduling portion a queuing system can still be used, but not per step/function. After receiving has completed we could use the queuing system to enqueue a single task, essentially a controller, that executes all of the pure, stateless, idempotent functions, either synchronously and single-threaded or asynchronously and multi-threaded if necessary. When any of them fails, the controller knows not to trigger the remaining functions and goes straight to an error response.
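A minimal sketch of that controller, assuming plain Python in place of the real queued task (the step and callback names are invented): one function walks an ordered list of pure steps, and the first failure short-circuits to an error response, so every request gets some response.

```python
# Controller sketch: run pure steps in order; on the first failure,
# skip the rest and send an error response instead.

def run_pipeline(data, steps, send_response, send_error):
    for step in steps:
        try:
            data = step(data)
        except Exception as exc:
            # Short-circuit: remaining steps are never triggered.
            return send_error(f"{getattr(step, '__name__', 'step')} failed: {exc}")
    return send_response(data)

# Usage with toy steps and in-memory "responses":
steps = [lambda d: d + 1, lambda d: d * 2]
ok = run_pipeline(3, steps,
                  send_response=lambda d: ("ok", d),
                  send_error=lambda msg: ("error", msg))
# ok == ("ok", 8)
```

Because the steps are pure and the controller owns all ordering, inserting, removing or reordering steps is just editing the `steps` list, and a failed run can be replayed safely.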
I prefer this approach because it produces fewer side effects, is easier to reason about and, most importantly, keeps all the calculation/generation functions completely separated from the scheduling/queuing logic, which means we can change that assumption in the future. Lastly, it ensures, by design, that every request received gets a response.