Chapter 7. The Workflow Abstraction

Thus far, we have looked at several examples of how to use Cascading. Now let’s step back a bit and take a look at some of the theory at its foundation.

The author of Cascading, Chris Wensel, was working at a large firm well known for its many data products. Wensel was evaluating the Nutch project, which included Lucene and subsequently Hadoop, to determine how to leverage these open source technologies for Big Data within an Enterprise environment. His takeaway was that it would be difficult to find enough Java developers who could write complex Enterprise apps directly in MapReduce.

An obvious response would have been to build some kind of abstraction layer atop Hadoop. Many different variations of this have been developed over the years, and that approach dates back to the many “fourth-generation languages” (4GL) starting in the 1970s. However, another takeaway Wensel had from the early days of Apache Hadoop use was that abstraction layers built by and for the early adopters typically would not pass the “bench test” for Enterprise. The operational complexity of large-scale apps and the need to leverage many existing software engineering practices would be difficult if not impossible to manage through a 4GL-styled abstraction layer.

A key insight into this problem was that MapReduce is based on the functional programming paradigm. In the original MapReduce paper, Jeffrey Dean and Sanjay Ghemawat of Google made clear that the functional model allowed large computations to be parallelized easily, with re-execution serving as the primary mechanism for fault tolerance.

The general pattern of parallelism achieved through a MapReduce framework traces back to what the AI community was doing in LISP in the 1970s and 1980s at MIT, Stanford, CMU, etc. Also, most of the nontrivial Hadoop apps are data pipelines—which are functional in essence.

The innovation of Cascading, in late 2007, was to introduce a Java API for functional programming with large-scale data workflows. As we’ve seen thus far, this approach allowed for a plumbing metaphor based on functional programming, which fit the use cases for Hadoop closely while abstracting at a much higher level than writing MapReduce code directly. At the same time, this approach leveraged the JVM and Java-based tools without any need to create or support new languages. Programmers with Java expertise could leverage the economics of Hadoop clusters while still using their familiar tools.
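For a concrete sense of that plumbing metaphor, here is a minimal sketch of a Cascading app that simply copies tuples from a source tap to a sink tap, assuming the Cascading 2.x API on Hadoop; the paths, field names, and class name are illustrative:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CopyApp {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    // taps are the "faucets and drains": where data enters and leaves
    Tap inTap = new Hfs(new TextDelimited(new Fields("doc_id", "text"), "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(new Fields("doc_id", "text"), "\t"), outPath);

    // a pipe is the plumbing through which tuples flow
    Pipe copyPipe = new Pipe("copy");

    // a flow binds pipes to taps; the planner translates it into Hadoop jobs
    FlowDef flowDef = FlowDef.flowDef()
        .setName("copy")
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    flow.complete();
  }
}
```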

In doing this, Cascading resolved many issues related to Hadoop use in Enterprise environments.

A subtle point about the design of Cascading is that it created a foundation for building other abstraction layers in JVM-based functional programming languages. Codd had suggested the use of DSLs for manipulating the relational model as early as 1969. The proof is in the pudding: Twitter, Etsy, eBay, The Climate Corporation, uSwitch, Nokia, LinkedIn, and others have invested considerable engineering resources in developing and extending open source projects based on functional programming, such as Cascalog, Scalding, PyCascading, and Cascading.JRuby. In turn, these firms and many others have built their revenue apps on those projects.

There are a few important theoretical aspects embodied by these data workflow abstractions based on Cascading. Those elements of theory can best be explained as layers in the process of structuring data: pattern language, literate programming, and functional relational programming. The remainder of this chapter considers each in turn.

Formally speaking, Cascading represents a pattern language. The notion of a pattern language is that the syntax of the language constrains what can be expressed to help ensure best practices. Stated in another way, a pattern language conveys expertise. For example, consider how a child builds a tower out of Lego blocks. The blocks snap together in predictable ways, allowing for complex structures that are reasonably sturdy. When the blocks are not snapped together properly, those structures tend to fall over. Lego blocks therefore provide a way of conveying expertise about building toy structures.

The notion of a pattern language originated in architecture, based on work by Christopher Alexander, including the “Oregon Experiment.” Kent Beck and Ward Cunningham subsequently used it to describe software design patterns, which were popularized for object-oriented programming by the “Gang of Four”: Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Abstract Factory, Model-View-Controller (MVC), and Facade are examples of well-known software design patterns.

Cascading uses pattern language to ensure best practices specifically for robust, parallel data workflows at scale. We see the pattern syntax enforced in several ways. For example, a flow must have at least one source and at least one tail sink defined. For another example, aggregators such as Count must be used within an Every pipe; in other words, that work gets performed in a reduce task.
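A minimal sketch of the second constraint, assuming the Cascading 2.x API (the class and field names are illustrative):

```java
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PatternExample {
  public static Pipe countWords(Pipe wordsPipe) {
    // grouping must precede aggregation; the flow planner rejects
    // an Every that does not follow a GroupBy or CoGroup
    Pipe wcPipe = new GroupBy(wordsPipe, new Fields("word"));

    // Count is an Aggregator, so it can only be handed to an Every;
    // the Each constructors will not accept it, so misuse fails even earlier
    return new Every(wcPipe, new Count(new Fields("count")));
  }
}
```

An assembly that violates the pattern fails at planning time, before any cluster resources are consumed.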

Another benefit of pattern language in Cascading is that it promotes code reuse. More precisely, it reduces the need to write custom operations, because much of a given business process can be defined by combining existing components. In a larger context, this is related to the use of patterns in enterprise application integration (EAI).
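For instance, Cascading provides a SubAssembly class for packaging partial assemblies so they can be reused across apps. The following sketch, assuming the Cascading 2.x API, combines three existing components (a regex tokenizer, a grouping, and a counter) into one reusable unit; the names are illustrative:

```java
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.SubAssembly;
import cascading.tuple.Fields;

public class WordCountAssembly extends SubAssembly {
  public WordCountAssembly(Pipe docPipe) {
    setPrevious(docPipe); // register the incoming pipe

    // reuse existing operations rather than writing custom ones
    Pipe wcPipe = new Each(docPipe, new Fields("text"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"), Fields.RESULTS);
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(new Fields("count")));

    setTails(wcPipe); // register the outgoing pipe
  }
}
```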

The philosophy of literate programming originated with Donald Knuth. A reasonable summary: instead of writing documentation about programs, write documents that embed programs. We see this practice quite directly in the flow diagrams used in Cascading. A flow diagram is a common expression of the business process in a Cascading app, even though different portions of that process may have been specified in Java, Clojure, Scala, Python, Ruby, ANSI SQL, PMML, etc. A flow diagram is the literal representation of the query that will run in parallel on a cluster.

When a Cascading app runs, it creates a flow diagram that can optionally be written to a DOT file. Flow diagrams provide intuitive, visual representations of apps, which are great for cross-team collaboration. Several good examples exist; one is the phenomenon of different developers troubleshooting a Cascading app together over the cascading-users email forum. Expert developers generally ask a novice to provide a flow diagram first, often before asking to see source code. For instance, a flow diagram for such an app is provided in Figure 7-1.
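As a sketch of how that works, assuming the Cascading 2.x API (the output path is illustrative):

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;

public class DotExample {
  public static void runWithDiagram(FlowDef flowDef) {
    Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);

    // write the flow diagram before running; render it with Graphviz,
    // e.g.: dot -Tpng flow.dot -o flow.png
    flow.writeDOT("flow.dot");

    flow.complete(); // then run the flow on the cluster
  }
}
```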

Another good example is found in Scalding apps, which have a nearly 1:1 correspondence between function calls and the elements in their flow diagrams. That is a remarkably concise, literate representation of the workflow. The benefit, in engineering terms, is that this property helps make the complex logic embodied in a Scalding app relatively simple to understand.

Based on the philosophy of literate programming, Cascading workflows emphasize the statement of business processes. This recalls a sense of business process management (BPM) for Enterprise apps. In other words, think of BPM/BPEL for Big Data as a means of workflow orchestration; in this case, Cascading provides a kind of middleware. It creates a separation of concerns between the business process required for an app and its implementation details, such as Hadoop jobs, data serialization protocols, etc.

By virtue of the pattern language, the flow planners in Cascading have guarantees that they will be able to translate business processes into efficient, parallel code at scale. That is a kind of “one-two punch” in Cascading, leveraging computer science theory in different layers.

Cascalog developers describe the separation of concerns between business process and implementation (parallelization, etc.) as a principle: “specify what you require, not how it must be achieved.” That is an important principle because, in practice, developing Enterprise data workflows is an inherently complex matter. Frameworks for distributed systems such as Hadoop, HBase, Cassandra, Memcached, etc., introduce lots of complexity into the engineering process. The kinds of problems typically being solved, often leveraging machine learning algorithms to find a proverbial needle in a haystack within large data sets, also introduce significant complexity into apps.
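To illustrate the principle, a join in Cascading is declared relationally, while the flow planner decides how to execute it in parallel. This sketch assumes the Cascading 2.x API, with hypothetical incoming streams of (person_id, name) and (order_id, person_id, total) tuples:

```java
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoin;
import cascading.tuple.Fields;

public class JoinExample {
  public static Pipe joinPeopleWithOrders(Pipe people, Pipe orders) {
    // declare *what* is required: an inner join on person_id;
    // *how* it runs in parallel is left to the flow planner.
    // the declared fields rename the duplicate person_id on the right side
    return new CoGroup(people, new Fields("person_id"),
                       orders, new Fields("person_id"),
                       new Fields("person_id", "name",
                                  "order_id", "person_id_", "total"),
                       new InnerJoin());
  }
}
```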

The author of Cascalog, Nathan Marz, noted a general problem about Big Data frameworks: that the tools being used to solve a given problem can sometimes introduce more complexity than the problem itself. We call this phenomenon accidental complexity, and it represents an important anti-pattern in computer science.

There are limits to how much complexity people can understand at any given point, limits to how well we can understand the systems on which we rely. Some approaches to software design amplify that problem. For example, reading 50,000 lines of COBOL is not particularly simple. SQL and Java are notorious for encouraging the development of large, complicated apps. So it makes sense to prevent artifacts in our programming languages from making Enterprise data workflows even more complex.

Referring back to the original 1969 paper about the relational model, Edgar Codd focused on the process of structuring data as a mechanism for maintaining data integrity and consistency of state, while providing a separation of concerns regarding data storage and representation underneath. This description is quite apt for the workflow abstraction in Cascading. Codd’s first public paper about the relational model is archived in the ACM Digital Library.

More recently, in the highly influential 2006 paper “Out of the Tar Pit,” Ben Moseley and Peter Marks proposed functional relational programming (FRP) as a combination of three major programming paradigms: functional, relational, and logical. FRP is proposed as an alternative to object-oriented programming (OOP), with the intent to minimize the accidental complexity introduced into apps. Moseley and Marks summarized the goal this way:

When it comes to accidental and essential complexity we firmly believe that the former exists and that the goal of software engineering must be both to eliminate as much of it as possible, and to assist with the latter.

Moseley and Marks, “Out of the Tar Pit” (2006)

Moseley and Marks attempted to categorize the different kinds of complexity encountered in software engineering, analyzing the dimensions of state, code volume, and explicit (imperative) concern with the flow of control through a system. They noted that “complexity breeds complexity” in the absence of language-enforced guarantees; in other words, complexity creates positive feedback loops in software practice. They also pointed out that complexity derived from state is one of the biggest hurdles to making code testable. Their paper considered the origins of the relational model as well, going back to Codd.

In other words, if we can cut out the unnecessary state represented in an app and focus on the essential state (the relations), we can eliminate much of the accidental complexity. Cascading embodies much of that philosophy, putting FRP into the practice of building Enterprise data workflows, and the two have several important aspects of computer science theory in common.

Moseley and Marks also point toward a management approach indicated by FRP. For example, an organization could focus one team on minimizing the accidental aspects of a system, while other teams focus on the essential aspects, providing the infrastructure and the requirements for interfacing with other systems. Roughly speaking, those teams correspond to the roles of developer, data scientist, ops, etc.; the objectives of each team become clarified through FRP. This also fits well with what is shown in Figure 6-3 for cross-team functional integration based on Cascading.

In summary, there are several theoretical aspects of the workflow abstraction. These get leveraged in Cascading and the DSLs to help minimize the complexity of the engineering process, and the complexity of understanding systems.

Generally speaking, in terms of Enterprise data workflows, there are two avenues to the party: scale and complexity. That contrast shows up quite starkly in use case analysis of Cascading deployments.

On one hand, there are Enterprise firms where people must contend with complexity at scale all day, every day. Incumbents in the Enterprise space make very large investments in their back-office infrastructure and practices, generally using Java, ANSI SQL, SAS, etc., and they maintain a large staff trained in those systems and processes. While the incumbents typically face considerable challenges in trying to be innovative, they have multiple priorities for migrating workflows onto Apache Hadoop. One priority is economics: scaling out a machine learning app on a Hadoop cluster implies far less in licensing costs than running the app in SAS. Another priority is risk management: being able to scale efficiently and rapidly when the business requires it. Meanwhile, a big part of the challenge is to leverage existing staff and integrate infrastructure without disrupting established processes. The workflow abstraction in Cascading addresses those issues directly.

On the other hand, start-ups crave complexity and must scale to become viable. Start-ups are generally good at innovation and light on existing process. They tend to leverage sophisticated engineering practices, e.g., Cascalog and Scalding, so that they can keep a relatively lean staff while positioning to compete against the Enterprise incumbents and disrupt their market share. Cascading provides the foundation for DSLs in functional programming languages that help power those ventures.

There is a transition curve plotted along the dimensions of scale, complexity, and innovation. One perspective of this is shown in Figure 7-2.

Both the Enterprise incumbents and the start-ups are connected on that curve for any given Big Data project. Ultimately, when they succeed, they tend to meet in the middle. Dijkstra foresaw this relationship quite clearly: