Identifying, Debugging and Preventing Software Problems that Arise from Implicit Constraints at Multiple Scales
If you look at a piece of software as an arbitrary group of systems that interact to create bigger systems on multiple levels, you can see that understanding the behavior at any given level has to take into account all the levels above it.
So, for example:
A call to a set of consecutive methods involves a number of different classes, which are individual systems instantiated within the bigger program system.
Many of those methods are constrained by various programmatic rules, which could be defined at the method level, the class level, the process level, the module level, the entire system level, or may have additional constraints coming from the environment the entire system itself is running in.
Individual methods may be constrained in multiple ways at the same time by rules at different levels.
In a given test run, after changes to another area of the program, an individual method fails when the test is run as part of the whole system, yet still passes as a unit test and in at least a subset of various “clump” integration tests. The immediate intuition is that a bug in that method has been exposed by changes elsewhere, but analysis of the method doesn’t reveal any issues.
What has likely happened is that changes elsewhere have introduced an implicit constraint that, in the manner of a side effect, prevents the method from functioning if that constraint is in force.
Different “clump” tests may or may not enforce that constraint, since they may involve the systemic level where the constraint originates or not. The more integrated different system components are, the more common this type of interaction becomes, and the more difficult it becomes to predict implicit constraints.
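As a concrete illustration of the pattern, here is a minimal Java sketch (the formatter method and the “unrelated subsystem” are invented for illustration) in which a method passes its unit test in isolation but misbehaves in the whole system, because a change elsewhere altered a process-wide setting the method never declared a dependency on:

```java
// Hypothetical sketch: a formatter that passes its unit test in isolation
// but breaks once another subsystem flips a process-wide setting.
import java.util.Locale;

public class ImplicitConstraintDemo {
    // The method under test: formats a ratio to two decimal places.
    // In isolation its unit test always passes, because the JVM default
    // locale is whatever the test environment started with.
    static String formatRatio(double ratio) {
        return String.format("%.2f", ratio); // implicitly uses the default locale
    }

    // Elsewhere in the system, an unrelated change sets the process-wide
    // locale -- a constraint at the whole-process scale that the method
    // above records no dependency on.
    static void unrelatedSubsystemStartup() {
        Locale.setDefault(Locale.GERMANY); // decimal separator becomes ','
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.US);
        System.out.println(formatRatio(0.5)); // "0.50" -- downstream parsing works

        unrelatedSubsystemStartup();
        System.out.println(formatRatio(0.5)); // "0,50" -- downstream parsing breaks
    }
}
```

The default locale is only one instance of a process-scale constraint; character encodings, time zones, thread pools, and security managers behave the same way, and none of them appear in the failing method’s signature.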
This is one of the most frustrating bug patterns developers run into. One reason it causes so much frustration is that our usual problem-solving methodologies are directly opposed to what these situations require.
Generally, when a problem is seen at a “whole system” level, the methodology of “root cause” analysis involves breaking the system down into more manageable pieces in order to localize the issue. The approach is not unlike taking an animal apart to find the cause of some discomfort: you may eventually find a cause, but no longer within a living animal.
The name “root cause” itself embodies the initial error: the cause is not at the “root” level at all, but at the full-system or environmental level, and in the latter case a further problem arises in reproducing it outside a specific customer’s environment.
Assuming you do manage to reproduce the problem at the whole system level, anyone with a fair amount of development experience can tell you about the problems that simply “go away” as soon as you try to break the problem area down.
Our tendency to look for any problems at the more tangible level of code betrays us completely in uncovering the cause of problems that display this pattern.
So, the problem has to be analyzed at the whole system level. That doesn’t discount localizing it, it just means that the localization has to occur in the context of the complete system functioning as a system.
In many cases problems with this pattern occur not “everywhere” but only at specific installation sites. This gives us an initial task of reproducing the problem in the lab environment. A sensible way to go about this is as follows:
Group the environmental aspects found at the problem site that aren’t normally present in the lab into logical sets, and add them to the lab environment one set at a time. After each set is added, test to see whether the problem is actualized.
At the point where the problem goes from potential to actual in the lab environment, you’ve isolated the likely origin of the problem to the set of aspects most recently added. Rather than refining the test by dropping the whole set and re-adding its aspects one at a time, though, the better idea is to remove the aspects one at a time until the problem goes away.
The reason for this is to get around another inherent tendency of our analyses that interferes with this kind of problem solving: the tendency to seek a single cause. It’s just as likely (in my experience more likely, because the more apparent aspects will usually already have been considered and tested for) that the “cause” is itself something implied in the environment by a complex of apparent aspects. In other words, the noticeable aspects together create an “invisible” aspect that is the origin of the problematic constraint.
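The two phases of the procedure can be sketched as follows; the aspect names and the reproduction predicate are stand-ins for whatever the real environment and test harness provide. Note that in this sketch no single aspect reproduces the problem on its own, only the combination does:

```java
// Hypothetical sketch of the isolation procedure described above.
import java.util.*;

public class EnvironmentBisection {
    // Stand-in for the real test harness. Here the bug reproduces only
    // when a *combination* of aspects is present.
    static boolean problemReproduces(Set<String> env) {
        return env.contains("ipv6-only-network") && env.contains("strict-firewall");
    }

    // Phase 1: add aspect sets, one logical grouping at a time, until the
    // problem goes from potential to actual. Returns the grouping whose
    // addition made it appear (empty if it never does).
    static Set<String> findTriggeringSet(List<Set<String>> groupings) {
        Set<String> env = new HashSet<>();
        for (Set<String> group : groupings) {
            env.addAll(group);
            if (problemReproduces(env)) return group;
        }
        return Collections.emptySet();
    }

    // Phase 2: remove aspects one at a time. Every aspect whose removal makes
    // the problem vanish belongs to the causal complex -- which may contain
    // several aspects, none of which is "the" cause in isolation.
    static Set<String> findCausalComplex(Set<String> env) {
        Set<String> causal = new HashSet<>();
        for (String aspect : new ArrayList<>(env)) {
            env.remove(aspect);
            if (!problemReproduces(env)) causal.add(aspect);
            env.add(aspect); // restore before trying the next aspect
        }
        return causal;
    }
}
```

Phase 2 deliberately reports every aspect that the reproduction depends on, rather than stopping at the first one, which is exactly the guard against the “one cause” tendency described above.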
Once you have found the environmental constraint you immediately have a better idea of what you’re looking for on the code level, because it has to be able to interact with that constraint in a definite manner.
One of the more important questions to answer in terms of the problem pattern is ‘how do we avoid causing this problem in the first place?’
Although there are a number of complementary methodologies that can assist in avoiding this kind of problem, it is likely impossible, at this point, to completely eradicate the potential for it cropping up, simply due to the complexity of the underlying issue and the limitations of our representations.
For now, I’ll look at a potential methodology that can be adopted in the near term that will at least reduce the potential for these problems, along with a couple of methodologies already widely adopted that have limited to no potential to help in a significant way, and may even exacerbate the problem.
“Monolithic” systems are often pointed to as the underlying cause of these kinds of issues. At the same time, many of the most reliable software systems we use are traditionally monolithic designs (MVS, Unix), while a good number of modular systems display severe fragility with numerous problems that tend to follow this pattern.
This indicates that a monolithic design does not inherently lead to multiscale constraint issues, nor does a separated, modular design necessarily solve the problem.
Separation of subsystems can in some cases simplify the interactivity, leading to less emergence in general, but in many cases that interactivity is required to realize the intent of the program in the first place, so the separation is overcome via a workaround that reintroduces the emergence.
For example, separating interacting systems from a shared immediate environment (for instance, running two subsystems in separate virtual machines) can solve some issues, but can introduce new ones. This doesn’t discount it as a potential solution; that solutions inevitably create different problems doesn’t imply the problems are equivalent. Keeping in mind the problems it may introduce, though, can help minimize them.
- Since a new system scale has been introduced, in this case that of the virtual machines functioning together in a systemic way to accomplish a given task, any workarounds to implement the needed interactivity can reintroduce the dependencies that led to the emergent behaviour, while making them less explicit.
- The addition of a new scale can itself introduce new implicit constraints.
- Additional complexity within any workarounds needed to achieve the required interactivity can introduce further additional scales between the subsystem scale and the “VMs together” scale.
A pseudo solution often proposed for these issues is to limit the interactivity between separated subsystems to a minimalist set of functions.
Resource oriented architecture is a good example of this kind of pseudo solution. If all the programmatic interactivity needed were basic CRUD between subsystems using data that could easily be represented as generic resources, the problem wouldn’t have arisen in the first place, because the complexity of the system would initially have been too low to generate emergent behaviour.
More realistically, complex forms of interactivity are required in order to meet feature goals. Implementing these within the general ROA type of system involves workarounds that stretch the notion of what a resource can reasonably represent and reintroduce the complexity ROA tried to snuff out.
Worse, that complexity is no longer as apparent: much of it is hidden beneath the “generic” representations, so what would have been fairly transparent may now be completely opaque when trying to understand the system.
This both adds scales at which implicit constraints can appear and operate and makes them more difficult to identify.
An additional problem that arises from the oversimplification of the shared representation is the addition of otherwise unnecessary complexity in other areas.
For instance, the reconstituted representation may not be identical to the original, particularly when the “calling” system has no access, for various reasons, to the actual representations used in the “answering” system, or worse, when the calling system itself uses a different representation scheme than the answering system (such as procedural scripts making calls to systems implemented with object representations).
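A small, hypothetical Java sketch of this loss of fidelity: an “answering” system stores money amounts as integral cents precisely to avoid floating-point error, but the generic resource representation forces the value through a floating-point string, so the reconstituted object is no longer guaranteed to equal the original:

```java
// Hypothetical sketch: a typed object flattened into a "generic resource"
// (a string map) for a calling system, then reconstituted. The round trip
// silently discards a constraint the original representation carried.
import java.util.*;

public class RepresentationMismatch {
    static class Invoice {
        final long amountCents;  // integral cents: no floating-point error
        final String currency;
        Invoice(long amountCents, String currency) {
            this.amountCents = amountCents; this.currency = currency;
        }
    }

    // Flatten to the generic resource the calling system understands.
    static Map<String, String> toResource(Invoice inv) {
        Map<String, String> r = new HashMap<>();
        r.put("amount", String.valueOf(inv.amountCents / 100.0)); // now a double
        r.put("currency", inv.currency);
        return r;
    }

    // Reconstitute on the calling side: the "cents as long" constraint is
    // gone, and rounding has quietly re-entered the system.
    static Invoice fromResource(Map<String, String> r) {
        double amount = Double.parseDouble(r.get("amount"));
        return new Invoice(Math.round(amount * 100), r.get("currency"));
    }
}
```

For everyday amounts the round trip appears to work, which is what makes the mismatch opaque; it only fails for values beyond double precision, long after anyone is looking at this boundary.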
The notion of ROA conveniently omits to note that its ‘stateless’, scalable architecture is entirely dependent on the complex, reliable, and very stateful infrastructure called the ‘internet’, on which the ‘web’ is built.
In devising a more successful approach, since the problem occurs most often due to contradictory constraints at multiple system scales, looking into the ways in which constraints can appear and propagate should be a good place to start.
Systems of any type have both explicit and implicit constraints. Explicit constraints usually involve ensuring proper functionality under the assumption that everything is working as planned.
As a result, explicit constraints tend to be expressed in a static manner, usually at a specific system scale. By themselves explicit constraints are rarely the issue at hand, although combinations of explicit constraints in a system are a common origin of implicit constraints.
Implicit constraints, on the other hand, generally arise through unplanned interactions between a system at a specific scale and its environment. That environment is always a higher-scale system, but can be arbitrarily higher, and may impose constraints from any scale above the one directly affected.
Implicit constraints are dynamic, appearing when the system is actually functioning, and in many cases are emergent, i.e. not specifically caused by any individual subsystem or single environmental constraint taken in isolation, and thus difficult to localize.
In many cases it is impossible to predict what types of constraints will appear if those constraints are in fact strongly emergent (emergent behaviour, in the case of “strong” emergence, is not just practically but theoretically impossible to predict as it is not deterministic to begin with).
Since prediction fails in this case, we have to adopt a different approach. This involves assuming that implicit constraints at multiple scales will arise, and that simplistic solutions such as ROA in fact create hidden complexities that may trigger further problems without solving the immediate ones.
Another aspect of a system that needs to be considered is the discrepancy between its dynamic implementation and static, semantic understandings of it.
However, it should be possible to upwardly limit the scales over which implicit constraints may emerge, while downwardly limiting the scales over which implicit constraints may act.
Part of this approach is to identify, at each scale, what events a given system needs to respond to. The other side is limiting the propagation of events only within the relevant scope.
Often over-complexity is not a result of requirements at higher scales, but of implementation at lower scales, where simple subsystems respond to more events than they need to in order to actualize their potentials.
The result is unnecessary interaction on the lower scales that leads to increased constraint emergence on multiple higher scales, and simultaneously an over-responsiveness to those constraints on lower scales.
The more implicit constraints there are, particularly those of an unpredictable, emergent nature, and the more systems on various scales respond to those constraints, the greater the likelihood of contradictory responses and resulting system unpredictability at any arbitrary scale.
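A minimal sketch of this over-responsiveness, using an invented in-process event bus: two listeners receive the same stream of change events, but only the scoped one limits its responses to the events it actually depends on:

```java
// Hypothetical sketch: trimming a subsystem's responsiveness to events.
// The bus, event type, and table names are all invented for illustration.
import java.util.*;
import java.util.function.Consumer;

public class EventScopeTrimming {
    static class Bus {
        private final Map<String, List<Consumer<String>>> listeners = new HashMap<>();
        void subscribe(String type, Consumer<String> l) {
            listeners.computeIfAbsent(type, k -> new ArrayList<>()).add(l);
        }
        void publish(String type, String payload) {
            listeners.getOrDefault(type, List.of()).forEach(l -> l.accept(payload));
        }
    }

    static int naiveInvalidations = 0;
    static int scopedInvalidations = 0;

    public static void main(String[] args) {
        Bus bus = new Bus();

        // Over-responsive cache: invalidates on *every* table change,
        // though it only depends on the "users" table.
        bus.subscribe("table-changed", table -> naiveInvalidations++);

        // Scoped cache: responds only to the events it actually needs.
        bus.subscribe("table-changed", table -> {
            if (table.equals("users")) scopedInvalidations++;
        });

        for (String table : List.of("users", "orders", "audit_log", "orders")) {
            bus.publish("table-changed", table);
        }
    }
}
```

Each unnecessary response is a point where a contradictory reaction to some higher-scale constraint can enter; trimming responsiveness at subscription time shrinks that surface (and, incidentally, the amount of needless work).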
Since a large number of system developers today are developing in Java (there is more application code written in Java than in all other programming languages combined), the representations available, such as static objects, some minimal aspect representation via the Spring AOP libraries or AspectJ, and events that can be listened for, often don’t adequately indicate the multiple ways a given subsystem may be affected by potential constraints.
This effect is multiplied by the representation in the syntax of some objects as objects, others as primitives, etc.
It gets further multiplied when an application written and tested for one deployment environment, for instance Spring with Java SE, is then deployed into a Java EE container in a clustered manner.
The addition in newer versions of “syntactic Parmesan”, from annotations to lambdas, while it does help hide bad spaghetti, also makes finding the source of these types of problems more difficult.
As any Java developer knows, despite the language requiring the developer to declare the potential exceptions that can be statically determined, by far the most common exceptions found in a running Java program are runtime exceptions, particularly the dreaded NullPointerException. This exception is not explicitly thrown by the application’s code but by the virtual machine, in response to the system at the application scale attempting an invalid action (in the specific case of NullPointerException, usually an action on an object that is not present).
Debugging runtime exceptions is often the initial task a developer faces in trying to solve multiple scale constraint problems. Tracing back to determine why the object referred to was null often results in the discovery of an implicit constraint that, earlier in the execution of the program, prevented the expected object from being created.
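A stripped-down, hypothetical example of the shape this takes: a deployment-scale setting silently prevents an object from being created during startup, and the NullPointerException surfaces much later, at a call site that says nothing about that setting:

```java
// Hypothetical sketch: the NullPointerException surfaces far from its cause.
// The "report generator" and the flag are invented for illustration.
import java.util.Optional;

public class LatentNull {
    static Object reportGenerator; // created during startup -- sometimes

    static void startup(boolean reportingEnabled) {
        // The implicit constraint: a deployment-scale setting decides
        // whether this object exists at all. Nothing at the later call
        // site records that dependency.
        if (reportingEnabled) {
            reportGenerator = new Object();
        }
    }

    static String runReport() {
        // Much later in the run, the constraint manifests as an NPE from
        // the VM that says nothing about deployment settings.
        return reportGenerator.toString();
    }

    // A defensive variant that turns the implicit constraint into an
    // explicit, local one the caller must handle.
    static Optional<String> runReportSafely() {
        return Optional.ofNullable(reportGenerator).map(Object::toString);
    }
}
```

The Optional variant doesn’t remove the constraint; it only forces the absence to be represented where it is acted upon, which is precisely the information the trace-back exercise tries to recover after the fact.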
If we dumped the notion of objects, as abstract, static representations that don’t actually exist in a functioning system, we could replace it with the notion of a system itself. The notion of scale allows us to keep only the generic system: a system can be viewed as a subsystem by a system at the next scale up, but in itself it remains a system. In this sense a bit is the simplest system involved in software systems, given current hardware designs. So how do we define a system in this sense?
- A system has specific potentials that it can actualize, if a potential is actualized an event occurs.
- A system can only actualize its specific potentials; it has no other behavioural features. When an actualization is possible, the system simply acts, as if it were always ready to act but inhibited until something dis-inhibits the action.
- When a system has multiple potentials dis-inhibited at once, it responds in a set priority sequence.
- A system always has state and an environment. Its state is comprised of every part of its definition that can be variable.
- Its environment is comprised of those things within the system it is a part of that it can potentially act upon or that can act upon it. From this it follows that a system always has a location and a perspective.
- Systems may or may not have aspects. An aspect is a feature of a system that can only be said to be present while the system is functioning, i.e. a feature of the system as a whole in its interactions, not locatable in any subsystem of that system.
- Generally, a certain level of complexity within the system is necessary for the appearance of aspects, however the complexity required is often overestimated (take the simple example of 8 bits with one parity bit).
- Aspects are sometimes determinable, but are often emergent, and are emergent in very unpredictable ways.
- An event is something that appears in a given system’s environment. Something within the environment is only present to the system insofar as the system can potentially act upon it or vice versa, and the environment as a whole comprises only those things a given system can potentially act upon or vice versa.
- An event itself is a small system, with a set of static and variable definitions, and one or more destinations.
- There can be different event types with the same set of systems and destination, but by convention only one type of potential can be actualized by each event type.
- Systems interact by actualizing potentials that create events, which may then dis-inhibit actualization of potentials in other systems, creating further events.
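The definitions above can be sketched in Java; every name here is mine, and this is an illustration of the notions, not a proposed framework. A system node holds potentials keyed by the event type that dis-inhibits them, and actualizing a potential emits further events for its container to route:

```java
// Illustrative sketch of the generic "system" defined above.
import java.util.*;
import java.util.function.Function;

public class GenericSystem {
    static class Event {
        final String type;     // the event type, which dis-inhibits potentials
        final Object payload;  // the event's variable definitions
        Event(String type, Object payload) { this.type = type; this.payload = payload; }
    }

    static class SystemNode {
        // Potentials keyed by dis-inhibiting event type. If several
        // potentials could be dis-inhibited at once, insertion order
        // would serve as the set priority sequence.
        private final LinkedHashMap<String, Function<Event, List<Event>>> potentials
            = new LinkedHashMap<>();

        void addPotential(String eventType, Function<Event, List<Event>> action) {
            potentials.put(eventType, action);
        }

        // An incoming event dis-inhibits at most the potential keyed by its
        // type; actualization produces new events -- the system has no other
        // behavioural features.
        List<Event> receive(Event e) {
            Function<Event, List<Event>> p = potentials.get(e.type);
            return p == null ? List.of() : p.apply(e);
        }
    }
}
```

Note that the node does nothing unless an event of a matching type arrives, which is the "inhibited until dis-inhibited" behaviour of the second rule.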
Aside from systems and events, the only other really functionally necessary artifact is that of the set, or collection. A set is simply an arbitrary grouping of systems that doesn’t imply any functionality.
While a system functions in its specific manner, a set functions only in a generic manner, i.e. has potentials that can affect its contents only in their role as contents of that set.
Using the bit, the simplest system, as an example, the system definition of a concrete bit looks like the following:
1. A bit has one potential, that is the actualization of another potential bit. Any actualization results in an event, which contains the result, i.e. the actualized bit, and a destination.
2. A bit can correspondingly act on only one event, the appearance of a set that contains an actual bit, a potential bit, and a destination.
3. The event type, which is determined by naming convention, determines which bitwise operator is invoked in order to actualize the potential bit from its actual bit and its destination bit.
4. A bit has only two potential states, on and off.
5. A bit is a system without aspects.
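The five rules above translate into Java roughly as follows, with event-type strings standing in for the naming convention of rule 3:

```java
// Illustrative sketch of the bit-as-system definition above.
public class BitSystem {
    // Rule 4: a bit has only two potential states.
    enum Bit { OFF, ON }

    // Rule 2: the one event a bit can act on -- an actual bit, a destination
    // bit, and a type naming the operator by convention (rule 3).
    static class BitEvent {
        final String type;      // "AND", "OR", "XOR"
        final Bit actual;
        final Bit destination;
        BitEvent(String type, Bit actual, Bit destination) {
            this.type = type; this.actual = actual; this.destination = destination;
        }
    }

    // Rule 1: the single potential -- actualizing a new bit, which would in
    // turn be wrapped in a result event addressed to a destination.
    static Bit actualize(BitEvent e) {
        int a = e.actual == Bit.ON ? 1 : 0;
        int d = e.destination == Bit.ON ? 1 : 0;
        int r;
        switch (e.type) {
            case "AND": r = a & d; break;
            case "OR":  r = a | d; break;
            case "XOR": r = a ^ d; break;
            default: throw new IllegalArgumentException("unknown event type: " + e.type);
        }
        return r == 1 ? Bit.ON : Bit.OFF;
    }
    // Rule 5: no aspects -- the bit's whole behaviour is locatable right here,
    // not in any interaction of subsystems.
}
```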
While a language conforming to such a simple system doesn’t exist to my knowledge, and would likely be useless if it did, the notions involved can assist in developing systems in languages such as Java.
Java is a language that encourages event-driven functionality, since inactivity until an event dis-inhibits action is efficient. This is precisely the way biological systems and many physical systems operate as well. Where the designers fell down was in the means of event propagation and consequent lack of significant event scoping, and in using a syntax designed for procedural software that makes event propagation difficult to grasp in unfamiliar code.
These issues can be limited by introducing the more dynamic notion of containment, somewhat corresponding to the static notion of encapsulation in Java (itself a discrepancy between semantics and dynamics, since encapsulation does virtually nothing at runtime), and by making event flows as transparent as possible.
It should be relatively obvious that systems are made up of subsystems that are related in a particular way, so as together to be able to actualize a particular potential or potentials. Aristotle’s original triad of ways in which things can be related, energeia, systema, and entelechia, can be of help in understanding different relations and their implications.
- Energeia is a loose form of relation, things are related in a sense that can “now fall out this way and then another”, such as the relations between items on a messy desk.
- Systema is a more determinate form of relation, but not a self-encompassing unity. An example is a skeleton, which is self-subsistent, but not a self-contained unity. In programming, a framework is probably the most common example.
- Entelechia are self-contained unities.
That definition might help shed some light on the reason I refer to Smalltalk as a topological entelechy language.
Systems in most languages, whether object languages, functional languages, or some sort of mishmash, fall toward the definition of entelechia, but are not full entelechies, in the sense that they are not fully contained.
The system’s signature tells other systems what potential events it can actualize and which event types dis-inhibit those actualizations. A system’s signature should not expose the signatures of any of the systems it contains.
Systems within a system at any particular scale can then only be directly affected by events created within the system that immediately contains them. Beyond its event signature a system is a black box to other systems; contained systems cannot perceive events that occur outside their immediate container.
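Containment can be sketched as follows; the interface and routing policy are invented for illustration. The container exposes only its own signature, and events reach children only through their immediate container:

```java
// Illustrative sketch of containment-based event scoping.
import java.util.*;

public class Containment {
    interface Sys {
        Set<String> signature();     // event types this system responds to
        void receive(String eventType);
    }

    static class Container implements Sys {
        private final List<Sys> children = new ArrayList<>();
        private final Set<String> ownSignature;
        Container(Set<String> ownSignature) { this.ownSignature = ownSignature; }
        void contain(Sys child) { children.add(child); }

        // The container's signature is its own; the signatures of contained
        // systems stay hidden from everything outside.
        public Set<String> signature() { return ownSignature; }

        // Events reach children only via their immediate container, so
        // anything not in the container's signature is scoped out before
        // it can dis-inhibit anything inside.
        public void receive(String eventType) {
            if (!ownSignature.contains(eventType)) return;
            for (Sys child : children) {
                if (child.signature().contains(eventType)) child.receive(eventType);
            }
        }
    }

    static class Leaf implements Sys {
        final Set<String> sig;
        int received = 0;
        Leaf(Set<String> sig) { this.sig = sig; }
        public Set<String> signature() { return sig; }
        public void receive(String eventType) { received++; }
    }
}
```

Only the container inspects its children’s signatures; to any outside system the container is a black box, which is what keeps emergent behaviour inside the boundary where the developer put it deliberately.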
Containment gives a system’s developer the ability to utilize emergent behaviour of systems and their relations directly and explicitly within the system they are developing, while preventing unexpected emergent behaviours from appearing outside that system.
As levels of scale increase and that system in turn becomes part of a higher scale system those emergent behaviours are contained at each system scale, preventing unpredictable emergence at arbitrary scales.
Solutions involving containment include indirection, such as declarative services; using runtime boundaries rather than relying on language constructs which the precompiler/interpreter/VM may not comply with; and proper scoping of event propagation.
The latter is a huge help both in terms of preventing exceptions and improving performance.
Event flows are better represented by message passing than by parameterized method invocation. Although most object-based languages are not inherently message passing, writing classes so that they accept a message name together with an object, and determine internally which method to call based on the combination of object type and message name, places the responsibility for calling methods with arcane signatures on the developer who wrote them, and makes the resulting event flow easier to understand when looking at an unfamiliar code base.
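A minimal sketch of this style in Java, with invented message names: the receiver exposes a single send entry point and decides internally which behaviour a message dis-inhibits:

```java
// Illustrative sketch of message-passing style in an object language.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class MessageReceiver {
    private final Map<String, Function<Object, Object>> handlers = new HashMap<>();
    private double balance = 0;

    public MessageReceiver() {
        // Dispatch on message name and payload, not on fixed method
        // signatures the caller must know by heart.
        handlers.put("deposit", amount -> balance += (Double) amount);
        handlers.put("balance", ignored -> balance);
    }

    // The single entry point: the whole event flow passes through here,
    // which makes it easy to log, trace, or scope.
    public Object send(String message, Object payload) {
        Function<Object, Object> h = handlers.get(message);
        if (h == null) throw new IllegalArgumentException("unknown message: " + message);
        return h.apply(payload);
    }
}
```

Because every interaction funnels through one method, tracing the event flow in unfamiliar code means reading one dispatch table rather than chasing call sites with arcane parameter lists.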
Where possible, using small, efficient messaging systems such as Synapse rather than calling classes directly is a further help in this regard.
It would be ideal to avoid discrepancies between semantics and dynamics in the first place by using an environment where the VM, interpreter, JIT compiler etc. are all written in the language itself.
Such environments do exist, Smalltalk and some LISP systems among them, but they’re niche, partly due to the difficulty of writing them and the time required. Whatever the most ‘popular’ and ‘fashionable’ things are in the development world, by necessity they’re rarely adopted first by Smalltalk or LISP, unless they originate there or are already there under another name.
At some point though, software engineers need to start acting like engineers, i.e. people who define themselves by making things work, rather than acting like popularity contestants or fashion victims, but that’s a different post.
That a huge percentage of the ‘best practices’ used in development overall did originate there, despite neither Smalltalk nor LISP ever having been widely adopted (refactoring, design patterns, unit tests, test driven development, model driven development, etc.), says plenty about the worth of what’s developed in that manner.