Chapter 11. Managing Distributed Workflows

Tuesday, February 15, 14:34

Austen bolted into Logan’s office just after lunch. “I’ve been looking at the new architecture designs, and I want to help out. Do you need me to write up some ADRs or help with some spikes? I’d be happy to write up the ADR that states that we’re only going to use choreography in the new architecture to keep things decoupled.”

“Whoa, there, you maniac,” said Logan. “Where did you hear that? What gives you that impression?”

“Well, I’ve been reading a lot about microservices, and everyone’s advice seems to be to keep things highly decoupled. When I look at the patterns for communication, it seems that choreography is the most decoupled, so we should always use it, right?”

“Always is a tricky term in software architecture. I had a mentor with a memorable perspective on this, who always said, ‘Never use absolutes when talking about architecture, except when talking about absolutes.’ In other words, never say never. I can’t think of many decisions in architecture where always or never applies.”

“OK,” said Austen. “So how do architects decide between the different communication patterns?”


As part of our ongoing analysis of the trade-offs associated with modern distributed architectures, we reach the dynamic part of quantum coupling, realizing many of the patterns we described and named in Chapter 2. In fact, even our named patterns only touch on the many permutations possible with modern architectures. Thus, an architect should understand the forces at work so that they can make the most objective trade-off analysis possible.

In Chapter 2, we identified three coupling forces when considering interaction models in distributed architectures: communication, consistency, and coordination, shown in Figure 11-1.

Dimensions of dynamic quantum coupling
Figure 11-1. The dimensions of dynamic quantum coupling

In this chapter, we discuss coordination: combining two or more services in a distributed architecture to form some domain-specific work, along with the many attendant issues.

Two fundamental coordination patterns exist in distributed architectures: orchestration and choreography. The fundamental topological differences between the two styles are illustrated in Figure 11-2.

orchestration versus choreography
Figure 11-2. Orchestration versus choreography in distributed architectures

In Figure 11-2, orchestration is distinguished by the use of an orchestrator, whereas a choreographed solution does not use one.

Orchestration Communication Style

This pattern uses an orchestrator (sometimes called a mediator) component to manage workflow state, optional behavior, error handling, notification, and a host of other workflow maintenance. It is named for the distinguishing feature of a musical orchestra, which utilizes a conductor to synchronize the incomplete parts of the overall score to create a unified piece of music. Orchestration is illustrated in the most generic representation in Figure 11-3.

orchestration illustration
Figure 11-3. Orchestration amongst distributed microservices

In Figure 11-3, services A-D are domain services, each responsible for its own bounded context, data, and behavior. The Orchestrator component generally doesn’t include any domain behavior outside of the workflow it mediates. Notice that microservices architectures have an orchestrator per workflow, not a global orchestrator such as an Enterprise Service Bus (ESB). One of the primary goals of the microservices architecture style is decoupling, and using a global component such as an ESB creates an undesirable coupling point. Thus, microservices tend to have an orchestrator per workflow.

Orchestration is useful in situations where an architect must model a complex workflow that includes not just the single “happy path” but also alternate paths and error conditions. However, to understand the basic shape of the pattern, we start with the non-error happy path. Consider a very simple example of Penultimate Electronics selling a device to one of their customers online, shown in Figure 11-4.

happy path orchestration example
Figure 11-4. A “happy path” workflow using an orchestrator to purchase electronic equipment (note the asynchronous calls denoted by dotted lines for less time-sensitive calls)

In Figure 11-4, the system passes the Place Order request to the Order Placement Orchestrator, which makes a synchronous call to the Order Placement Service, which records the order and returns a status message. Next, the orchestrator calls the Payment Service, which updates payment information. The orchestrator then makes an asynchronous call to the Fulfillment Service to handle the order. The call is asynchronous because no strict timing dependencies exist for order fulfillment, unlike payment verification. For example, if order fulfillment only happens a few times a day, there is no reason for the overhead of a synchronous call. Similarly, the orchestrator then calls the Email Service to notify the user of a successful electronics order.

If the world consisted only of happy paths, software architecture would be easy. However, one of the primary Hard Parts of software architecture is error conditions and pathways.
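The happy path just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind Figure 11-4: the class name, the status strings, and the use of in-memory queues as stand-ins for asynchronous messaging are all hypothetical.

```python
from collections import deque


class OrderPlacementOrchestrator:
    """Hypothetical orchestrator for the happy path in Figure 11-4."""

    def __init__(self, place_order, apply_payment):
        # Synchronous collaborators: plain callables that return a status.
        self.place_order = place_order
        self.apply_payment = apply_payment
        # Asynchronous collaborators: stand-in queues the services drain later.
        self.fulfillment_queue = deque()
        self.email_queue = deque()

    def handle_place_order(self, order_id):
        statuses = []
        # Synchronous calls: the orchestrator waits for each status.
        statuses.append(self.place_order(order_id))
        statuses.append(self.apply_payment(order_id))
        # Asynchronous calls: enqueue a message and move on without waiting.
        self.fulfillment_queue.append(("FULFILL", order_id))
        self.email_queue.append(("ORDER_CONFIRMED", order_id))
        return statuses


orchestrator = OrderPlacementOrchestrator(
    place_order=lambda order_id: "ORDER_PLACED",
    apply_payment=lambda order_id: "PAYMENT_APPLIED",
)
statuses = orchestrator.handle_place_order("order-42")
```

Note that the domain services never call each other here; every interaction passes through the orchestrator, which is what gives it a complete view of the workflow.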

Consider two potential error scenarios for electronics purchasing. First, what happens if the customer’s payment method is rejected? This error scenario appears in Figure 11-5.

illustration of payment rejected error condition
Figure 11-5. Payment rejected error condition

In Figure 11-5, the Order Placement Orchestrator updates the order via the Order Placement Service as before. However, when it tries to apply payment, the Payment Service rejects it, perhaps because of an expired credit card number. In that case, the Payment Service notifies the orchestrator, which then places a (typically) asynchronous call to the Email Service to notify the customer of the failed order. Additionally, the orchestrator updates the state of the Order Placement Service, which still thinks this is an active order.

Notice that in this example, we’re allowing each service to maintain its own transactional state, modeling our “Fairy Tale Saga(seo) Pattern”. One of the hardest parts of modern architectures is managing transactions, which we cover in Chapter 12.

In the second error scenario, the workflow has progressed further along: what happens when the Fulfillment Service reports a backorder? This error scenario appears in Figure 11-6.

message flow for a failed order fulfillment
Figure 11-6. When an item is backordered, the orchestrator must rectify the state

In Figure 11-6, the workflow proceeds as normal until the Fulfillment Service notifies the orchestrator that the current item is out of stock, necessitating a back order. In that case, the orchestrator must refund the payment (this is why many online services don’t charge until shipment, not at order time) and update the state of the Order Placement Service.

One interesting characteristic to note in Figure 11-6: even in the most elaborate error scenarios, the architect wasn’t required to add communication paths beyond those already in place for the normal workflow, which differs from the “Choreography Communication Style”.
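To make the two error scenarios concrete, here is a minimal sketch of how an orchestrator might handle them. It assumes hypothetical state names and uses in-memory lists and dictionaries as stand-ins for calls to the Order Placement, Payment, and Email services; none of these names come from the figures.

```python
class OrderPlacementOrchestrator:
    """Hypothetical orchestrator covering the error paths in Figures 11-5 and 11-6."""

    def __init__(self, payment_accepted=True):
        self.payment_accepted = payment_accepted  # simulates the Payment Service's answer
        self.order_state = {}  # stand-in for state held by the Order Placement Service
        self.refunds = []      # stand-in for refund calls to the Payment Service
        self.emails = []       # stand-in for messages sent to the Email Service

    def place_order(self, order_id):
        self.order_state[order_id] = "ACTIVE"
        if not self.payment_accepted:
            # Figure 11-5: notify the customer and correct the order state.
            self.emails.append((order_id, "ORDER_FAILED"))
            self.order_state[order_id] = "PAYMENT_REJECTED"
            return False
        self.order_state[order_id] = "PAID"
        return True

    def on_backorder(self, order_id):
        # Figure 11-6: refund the payment and correct the order state.
        self.refunds.append(order_id)
        self.order_state[order_id] = "BACKORDERED"


rejected = OrderPlacementOrchestrator(payment_accepted=False)
rejected.place_order("order-1")

backordered = OrderPlacementOrchestrator()
backordered.place_order("order-2")
backordered.on_backorder("order-2")
```

Both error paths reuse the same three collaborators as the happy path, which mirrors the observation that no new communication links are needed.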

General advantages of the orchestration communication style:

Centralized workflow

As complexity goes up, having a unified component for state and behavior becomes beneficial.

Error handling

Error handling is a major part of many domain workflows, assisted by having a state owner for the workflow.

Recoverability

Because an orchestrator monitors the state of the workflow, an architect may add logic to retry in the case that one or more domain services suffers from a short-term outage.

State management

Having an orchestrator makes the state of the workflow queryable, providing a place for other workflow and other transient state.

General disadvantages of the orchestration communication style:

Responsiveness

All communication must go through the mediator, creating a potential throughput bottleneck that can harm responsiveness.

Fault tolerance

While orchestration enhances recoverability for domain services, it creates a potential single point of failure for the workflow, which can be addressed with redundancy but adds more complexity.

Scalability

This communication style doesn’t scale as well as choreography because it has more coordination points (the orchestrator), which cuts down on potential parallelism. As we discussed in Chapter 2, several dynamic coupling patterns utilize choreography and thus achieve higher scale (notably “Time Travel Saga(sec) Pattern” and “Anthology Saga(aec) Pattern”).

Service coupling

Having a central orchestrator creates higher coupling between it and domain components, which is sometimes necessary.

Utilizing an orchestrator for complex workflows greatly simplifies many architecture concerns, and assists in boundary and error conditions.
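The recoverability advantage listed above can be illustrated with a small retry helper that an orchestrator might wrap around calls to domain services. This is a sketch under assumptions: `call_with_retry` and `FlakyPaymentService` are hypothetical names, and treating `ConnectionError` as the signal for a short-term outage is an illustrative choice.

```python
import time


def call_with_retry(call, attempts=3, delay_seconds=0.0):
    """Retry a service call that fails with a transient ConnectionError."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the workflow
            time.sleep(delay_seconds)


class FlakyPaymentService:
    """Test double that is unavailable for its first `outages` calls."""

    def __init__(self, outages):
        self.outages = outages
        self.calls = 0

    def apply_payment(self):
        self.calls += 1
        if self.calls <= self.outages:
            raise ConnectionError("payment service unavailable")
        return "PAYMENT_APPLIED"


payment_service = FlakyPaymentService(outages=2)
result = call_with_retry(payment_service.apply_payment)
```

Because the orchestrator owns the workflow state, it is a natural home for this kind of logic; in a choreographed solution, each service would need its own equivalent.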

Choreography Communication Style

If the “Orchestration Communication Style” was named for the metaphorical central coordination offered by an orchestrator, the choreography pattern also visually illustrates the intent of the communication style, where there is no central coordination; rather, each service participates with the others similar to dance partners. It isn’t an ad hoc performance: the moves were planned beforehand by the choreographer/architect but executed without a central coordinator.

Figure 11-4 described the orchestrated workflow for when a customer purchases electronics from Penultimate Electronics; the same workflow modeled in the choreography communication style appears in Figure 11-7.

simple workflow using choreography
Figure 11-7. Purchasing electronics using choreography

In Figure 11-7, the initiating request goes to the first service in the chain of responsibility, in this case the Order Placement Service. Once it has updated internal records about the order, it sends an asynchronous request that Payment Service receives. Once payment has been applied, Payment Service generates a message received by Fulfillment Service, which plans for delivery and sends a message to the Email Service.
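The chain just described can be sketched with a tiny in-memory message bus. This is an illustrative sketch only: the `MessageBus` class, the topic names, and the lambda handlers standing in for each service are assumptions, and a real system would use a message broker rather than an in-process queue.

```python
from collections import defaultdict, deque


class MessageBus:
    """Minimal in-memory publish/subscribe bus (hypothetical stand-in for a broker)."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.pending = deque()
        self.log = []

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, order_id):
        self.pending.append((topic, order_id))

    def run(self):
        # Deliver queued messages until the workflow produces no more.
        while self.pending:
            topic, order_id = self.pending.popleft()
            self.log.append(topic)
            for handler in self.subscribers[topic]:
                handler(order_id)


bus = MessageBus()
emails_sent = []

# Each service reacts to one message and emits the next; no one coordinates.
bus.subscribe("place_order", lambda oid: bus.publish("order_placed", oid))        # Order Placement
bus.subscribe("order_placed", lambda oid: bus.publish("payment_applied", oid))    # Payment
bus.subscribe("payment_applied", lambda oid: bus.publish("delivery_planned", oid))  # Fulfillment
bus.subscribe("delivery_planned", emails_sent.append)                             # Email

bus.publish("place_order", "order-42")
bus.run()
```

The workflow exists only as the sum of these subscriptions, which is why no single component can report on its overall progress.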

At first glance, the choreography solution seems simpler: fewer services (no mediator) and a simple chain of events/commands (messages). However, as with many issues in software architecture, the difficulties lie not with the default paths but rather with boundary and error conditions.

As in “Orchestration Communication Style”, we cover two potential error scenarios, the first resulting from failed payment, illustrated in Figure 11-8.

illustration of payment error workflow in choreography
Figure 11-8. Error in payment in choreography

Figure 11-8 shows that, rather than send a message intended for the Fulfillment Service, the Payment Service instead sends messages indicating failure to the Email Service and back to the Order Placement Service to update the order status. This alternate workflow doesn’t appear too complex, with a single new communication link that didn’t exist before.

However, consider the increasing complexity imposed by the other error scenario for a product backorder, shown in Figure 11-9.

illustrating the workflow required for product backorder error
Figure 11-9. Managing the workflow error condition of product backorder

In Figure 11-9, many steps of the workflow have already completed before the event (out of stock) that causes the error. Because each of these services implements its own transactionality (this is an example of the “Anthology Saga(aec) Pattern”), when an error occurs, each service must issue compensating messages to other services. In Figure 11-9, once the Fulfillment Service realizes the error condition, it should generate events suited to its bounded context, perhaps a broadcast message subscribed to by the Email, Payment, and Order Placement services.
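A broadcast of this kind might look like the following sketch. The topic name `item_backordered` and the compensating action each subscriber takes are assumptions chosen to mirror the orchestrated backorder scenario; the text above specifies only which services subscribe.

```python
from collections import defaultdict

subscribers = defaultdict(list)


def subscribe(topic, handler):
    subscribers[topic].append(handler)


def broadcast(topic, order_id):
    # Every subscriber receives the same event and compensates on its own.
    for handler in subscribers[topic]:
        handler(order_id)


order_status = {"order-42": "PAID"}  # owned by the Order Placement Service
refunded = []                        # owned by the Payment Service
emails = []                          # owned by the Email Service


def order_placement_on_backorder(order_id):
    order_status[order_id] = "BACKORDERED"


def payment_on_backorder(order_id):
    refunded.append(order_id)


def email_on_backorder(order_id):
    emails.append((order_id, "ITEM_BACKORDERED"))


for handler in (order_placement_on_backorder, payment_on_backorder, email_on_backorder):
    subscribe("item_backordered", handler)

# The Fulfillment Service detects the out-of-stock condition and broadcasts it.
broadcast("item_backordered", "order-42")
```

Each subscriber now carries a piece of the workflow's error-handling knowledge, which is the extra coupling that choreography trades for the missing orchestrator.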

The example shown in Figure 11-9 illustrates the dependence between complex workflows and mediators. While the initial workflow in choreography illustrated in Figure 11-7 seemed simpler than Figure 11-4, the error case (and others) keeps adding more complexity to the choreographed solution. In Figure 11-10, each error scenario forces domain services to interact with each other, adding communication links that weren’t necessary for the happy path.

illustration of added communication links for choreography error handling
Figure 11-10. Error conditions in choreography typically add communication links

Every workflow that architects need to model in software has a certain amount of semantic coupling: the inherent coupling that exists in the problem domain. For example, the process of assigning a ticket to a Sysops Squad member has a certain workflow: a client must request service, skills must be matched to particular specialists, then cross-referenced to schedules and locations. How an architect models that interaction is the implementation coupling.

The semantic coupling of a workflow is mandated by the domain requirements of the solution, and must be modeled somehow. However clever an architect is, they cannot reduce the amount of semantic coupling, but their implementation choices may increase it. This doesn’t mean that an architect might not push back on impractical or impossible semantics defined by business users—some domain requirements create extraordinarily difficult problems in architecture.

Here is a common example. Consider the standard layered monolithic architecture compared to the more modern style of a modular monolith, shown in Figure 11-11.

Figure 11-11. Technical versus domain partitioning in architecture

In Figure 11-11, the architecture on the left represents the traditional layered architecture, separated by technical capabilities such as persistence, business rules, and so on. On the right, the same solution appears, but separated by domain concerns such as Catalog Checkout and Update Inventory rather than technical capabilities.

Both topologies are logical ways to organize a code base. However, consider where domain concepts such as Catalog Checkout reside within each architecture, illustrated in Figure 11-12.

illustration of the location of the domain workflow of Catalog Checkout
Figure 11-12. Catalog Checkout is smeared across implementation layers in a technically partitioned architecture

In Figure 11-12, Catalog Checkout is “smeared” across the layers of the technical architecture, whereas it appears only in the matching domain component and database in the domain-partitioned example. Of course, aligning a domain with a domain-partitioned architecture isn’t a revelation. One of the insights of Domain-Driven Design was the primacy of the domain workflows. No matter what, if an architect wants to model a workflow, they must make those moving parts work together. If the architect has organized their architecture the same as the domains, the implementation of the workflow should have complexity similar to that of the domain workflow itself. However, if the architect has imposed additional layers (as in technical partitioning shown in Figure 11-12), it increases the overall implementation complexity because now the architect must design for the semantic complexity along with the additional implementation complexity.

Sometimes the extra complexity is warranted. For example, many layered architectures came from a desire by architects to gain cost savings by consolidating on architecture patterns such as database connection pooling. In that case, an architect considered the trade-offs of the cost savings associated with technically partitioning database connectivity versus the imposed complexity, and cost won in many cases.

The major lesson of the last decade of architecture design is to model the semantics of the workflow as closely as possible with the implementation.

Semantic Coupling

An architect can never reduce semantic coupling via implementation, but they can make it worse.

Thus, we can establish a relationship between the semantic coupling and the need for coordination—the more steps required by the workflow, the more potential error and other optional paths appear.

Workflow State Management

Most workflows include transient state about the status of the workflow: what elements have executed, which ones are left, ordering, error conditions, retries, and so on. For orchestrated solutions, the obvious workflow state owner is the orchestrator (although some architectural solutions create stateless orchestrators for higher scale). However, for choreography, no obvious owner for workflow state exists.

Several options exist to manage state in choreography; here are three common ones.

First, the Front Controller Pattern places the responsibility for state on the first called service in the chain of responsibility, which in this case is Order Placement Service. If that service contains information about both orders and the state of the workflow, some of the domain services must have a communication link to query and update the order state, illustrated in Figure 11-13.

illustration of state ownership in choreography
Figure 11-13. In choreography, a Front Controller is a domain service that owns workflow state in addition to domain behavior

In Figure 11-13, some services must communicate back to the Order Placement Service to update the state of the order, as it is the state owner. While this simplifies the workflow, it increases communication overhead and makes the Order Placement Service more complex than one that only handled domain behavior.
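A minimal sketch of the Front Controller approach follows. The class and method names (`record_step`, `current_status`) and the step labels are hypothetical; the point is only that the first service in the chain keeps workflow state alongside its domain behavior, and that other services need a link back to it.

```python
class OrderPlacementService:
    """Front controller: owns order behavior and the workflow state."""

    def __init__(self):
        self.workflow_state = {}

    def place_order(self, order_id):
        self.workflow_state[order_id] = ["ORDER_PLACED"]

    def record_step(self, order_id, step):
        # The back-link that other domain services use to report progress.
        self.workflow_state[order_id].append(step)

    def current_status(self, order_id):
        return self.workflow_state[order_id][-1]


class PaymentService:
    def __init__(self, front_controller):
        self.front_controller = front_controller

    def apply_payment(self, order_id):
        self.front_controller.record_step(order_id, "PAYMENT_APPLIED")


class FulfillmentService:
    def __init__(self, front_controller):
        self.front_controller = front_controller

    def plan_delivery(self, order_id):
        self.front_controller.record_step(order_id, "DELIVERY_PLANNED")


orders = OrderPlacementService()
orders.place_order("order-42")
PaymentService(orders).apply_payment("order-42")
status_after_payment = orders.current_status("order-42")
FulfillmentService(orders).plan_delivery("order-42")
```

The extra `record_step` calls are the communication overhead mentioned above, and `OrderPlacementService` now carries responsibilities beyond its own domain.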

A second way for an architect to manage the transactional state is to keep no transient workflow state at all, relying on querying the individual services to build a real-time snapshot. This is known as stateless choreography. While this simplifies the state of the first service, it greatly increases network overhead in terms of chatter between services to build a stateful snapshot. For example, consider a workflow like the simple choreography happy path in Figure 11-7 with no extra state. If a customer wants to know the state of their order, the architect must build a workflow that queries the state of each domain service to determine the most up-to-date order status. While keeping the services stateless makes the workflow itself highly scalable, rebuilding state on demand can be complex and costly, and frequent status queries can themselves erode operational architecture characteristics like scalability and performance.

A third solution utilizes Stamp Coupling (described in more detail in “Stamp Coupling for Workflow Management”), storing extra workflow state in the message contract sent between services. Each domain service updates its part of the overall state and passes it to the next in the chain of responsibility. Thus, any consumer of that contract can check on the status of the workflow without querying each service.
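Rebuilding an order's status in stateless choreography might look like the sketch below. The service names, the per-service record sets, and the helper functions are all hypothetical; each lambda stands in for a network call to a domain service.

```python
def build_snapshot(order_id, service_queries):
    """Ask every domain service whether it has handled the order (one call each)."""
    return {name: has_handled(order_id) for name, has_handled in service_queries.items()}


def latest_completed_step(snapshot, workflow_order):
    """Return the furthest step reported complete, or None if nothing has run."""
    completed = [step for step in workflow_order if snapshot[step]]
    return completed[-1] if completed else None


# Each service keeps only its own records; no one stores workflow state.
order_placement_records = {"order-42"}
payment_records = {"order-42"}
fulfillment_records = set()
email_records = set()

service_queries = {
    "order_placement": lambda order_id: order_id in order_placement_records,
    "payment": lambda order_id: order_id in payment_records,
    "fulfillment": lambda order_id: order_id in fulfillment_records,
    "email": lambda order_id: order_id in email_records,
}
workflow_order = ["order_placement", "payment", "fulfillment", "email"]

snapshot = build_snapshot("order-42", service_queries)
status = latest_completed_step(snapshot, workflow_order)
```

A single status check costs one query per service, which illustrates the network chatter this option trades for having no state owner.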

This is a partial solution, as it still does not provide a single place for users to query the state of the ongoing workflow. However, it does provide a way to pass the state between services as part of the workflow, providing each service with additional potentially useful context.
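As a rough sketch of stamp coupling for workflow state, the message contract below carries a list of completed steps that each service appends to before passing the message on. The `OrderMessage` type, its field names, and the step labels are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OrderMessage:
    """Message contract carrying workflow state in addition to the order itself."""

    order_id: str
    completed_steps: list = field(default_factory=list)  # the extra workflow "stamp"


def order_placement_service(message):
    message.completed_steps.append("ORDER_PLACED")
    return message


def payment_service(message):
    message.completed_steps.append("PAYMENT_APPLIED")
    return message


def fulfillment_service(message):
    message.completed_steps.append("DELIVERY_PLANNED")
    return message


# Each service updates its part of the state and hands the message along.
message = OrderMessage("order-42")
for service in (order_placement_service, payment_service, fulfillment_service):
    message = service(message)
```

Any service holding the message can read `completed_steps` without querying its peers, but the state lives only in the message in flight, so an outside caller still has nowhere to ask for it.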

In Chapter 13, we discuss how contracts can reduce or increase workflow coupling in choreographed solutions.

Advantages of the choreography communication style:

Responsiveness

This communication style has fewer single choke points, thus offering more opportunities for parallelism.

Scalability

Similar to responsiveness, the lack of coordination points such as orchestrators allows more independent scaling.

Fault tolerance

The lack of a single mediator allows an architect to enhance fault tolerance by the use of multiple instances.

Service decoupling

No mediator means less coupling.

Disadvantages of the choreography communication style:

Distributed workflow

No workflow owner makes error management and other boundary conditions more difficult.

State management

No centralized state holder hinders ongoing state management.

Error handling

Error handling becomes more difficult without an orchestrator because the domain services must have more workflow knowledge.

Recoverability

Similarly, recoverability becomes more difficult without an orchestrator to attempt retries and other remediation efforts.

Like “Orchestration Communication Style”, choreography has a number of good and bad trade-offs, often opposites, summarized in Table 11-5.

Trade-Offs Between Orchestration and Choreography

As with all things in software architecture, neither orchestration nor choreography represents the perfect solution for all possibilities. A number of key trade-offs will lead an architect towards one of these two solutions, including some key ones delineated here.

State Owner and Coupling

As illustrated in Figure 11-13, state ownership typically resides somewhere, either in a formal mediator acting as an orchestrator or a front controller in a choreographed solution. In the choreographed solution, removing the mediator forced higher levels of communication between services. This might be a perfectly suitable trade-off. For example, if an architect has a workflow that needs higher scale and typically has few error conditions, it might be worth accepting more complex error handling in exchange for the higher scale of choreography.

However, as workflow complexity goes up, the need for an orchestrator rises proportionally, as illustrated in Figure 11-14.

relationship between semantic workflow complexity and usefulness of orchestration
Figure 11-14. As the complexity of the workflow rises, orchestration becomes more useful

As illustrated in Figure 11-14, the more semantic complexity contained in a workflow, the more useful an orchestrator is. Remember, implementation coupling can’t make semantic coupling better, only worse.

Ultimately, the sweet spot for choreography lies with workflows that need responsiveness and scalability, and that either don’t have complex error scenarios or encounter them infrequently. This communication style allows for high throughput; it is used by the dynamic coupling patterns “Phone Tag Saga(sac) Pattern”, “Time Travel Saga(sec) Pattern”, and “Anthology Saga(aec) Pattern”. However, it can also lead to extremely difficult implementations when other forces are mixed in, leading to the “Horror Story(aac) Pattern”.

On the other hand, orchestration is best suited for complex workflows that include boundary and error conditions. While this style doesn’t provide as much scale as choreography, it greatly reduces complexity in most cases. This communication style appears in “Epic Saga(sao) Pattern” (covered in “Transactional Saga Patterns”), “Fairy Tale Saga(seo) Pattern”, “Fantasy Fiction Saga(aao) Pattern”, and “Parallel Saga(aeo) Pattern”.

Coordination is one of the primary forces that create complexity for architects when determining how to best communicate between microservices. Next, we investigate how this force intersects with another primary force, consistency.

Sysops Squad Saga: Managing Workflows

Thursday, March 15, 11:00

Addison and Austen arrived at Logan’s office right on time, armed with a presentation and ritual coffee urn from the kitchen. “Are you ready for us?” asked Addison.

“Sure,” said Logan. “Good timing—just got off a conference call. Are y’all ready to talk about workflow options for the primary ticket flow?”

“Yes!” said Austen. “I think we should use choreography, but Addison thinks orchestration, and we can’t decide.”

“Give me an overview of the workflow we’re looking at.”

“It’s the primary ticket workflow,” said Addison. “It involves four services; here are the steps:”

Customer facing operations
  1. The customer submits a trouble ticket through the Ticket Management service and receives a ticket number.

Background operations
  1. The Ticket Assignment service finds the right sysops expert for the trouble ticket.

  2. The Ticket Assignment service routes the trouble ticket to the sysops expert’s mobile device.

  3. The customer is notified via the Notification Service that the sysops expert is on their way to fix the problem.

  4. The expert fixes the problem and marks the ticket as complete, which is sent to the Ticket Management service.

  5. The Ticket Management service communicates with the Survey Service to tell the customer to fill out the survey.

“Have you modeled both solutions?” asked Logan.

“Yes. The drawing for choreography is in Figure 11-15.”

Figure 11-15. Primary ticket flow modeled as choreography

“…and the model for orchestration is in Figure 11-16.”

Figure 11-16. Primary ticket workflow modeled as orchestration

Logan pondered the diagrams for a moment, then pronounced, “Well, there doesn’t seem to be an obvious winner here. You know what that means.”

Austen piped up, “Trade-offs!”

“Of course,” laughed Logan. “Let’s think about the likely scenarios and see how each solution reacts to them. What are the primary issues you are concerned with?”

“The first is lost or mis-routed tickets. The business has been complaining about it, and it has become a priority,” said Addison.

“OK, which handles that problem better—orchestration or choreography?”

“Easier control of the workflow sounds like the orchestrator version is better—we can handle all the workflow issues there,” volunteered Austen.

“OK, let’s build a table of issues and preferred solutions in Table 11-6.”

“What’s the next issue we should model?”

“We need to know the status of a trouble ticket at any given moment—the business has requested this feature, and it makes it easier to track several metrics. That implies we need an orchestrator so that we can query the state of the workflow.”

“But you don’t have to have an orchestrator for that—we can query any given service to see if it has handled a particular part of the workflow, or use stamp coupling,” said Addison.

“That’s right—this isn’t a zero-sum game,” said Logan. “It’s possible that both or neither work just as well. We’ll give both solutions credit in our updated table in Table 11-7.”

“OK, what else?”

“Just one more that I can think of—tickets can get canceled by the customer, and tickets can get reassigned due to expert availability, lost connections to the expert’s mobile device, or expert delays at a customer site. Therefore, proper error handling is important. That means orchestration?”

“Yes, generally. Complex workflows must go somewhere, either in an orchestrator or scattered through services. It’s nice to have a single place to consolidate error handling. And choreography definitely does not score well here, so we’ll update our table in Table 11-8.”

“That looks pretty good. Any more?”

“Nothing that’s not obvious,” said Addison. “We’ll write this up in an ADR; in case we think of any other issues, we can add them there.”

ADR: Use orchestration for primary ticket workflow

Context
For the primary ticket workflow, the architecture must support easy tracking of lost or mis-tracked messages, excellent error handling, and the ability to track ticket status. Either an orchestration solution illustrated in Figure 11-16 or a choreography solution illustrated in Figure 11-15 will work.

Decision
We will use orchestration for the primary ticketing workflow.

We modeled both orchestration and choreography and arrived at the trade-offs in Table 11-8.

Consequences
Ticketing workflow might have scalability issues around a single orchestrator, which should be reconsidered if current scalability requirements change.