Appendix A. Troubleshooting Workflows

The following tips are intended to help troubleshoot common issues when people are first working with Cascading. These points are mostly about running the examples in the book, but they apply to Enterprise use cases in general.

Build and Runtime Problems

One of the most frequent and useful tips given to people who are new to Cascading—and to Apache Hadoop in general—is that if your build isn’t working as expected, you may need to delete the local Maven repo.

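On a Linux or Mac OS X laptop, the following command handles that purge. Note that it deletes the whole ~/.m2 directory, including any settings.xml stored there; to keep those settings, remove only ~/.m2/repository instead: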

$ rm -rf ~/.m2

The build systems mentioned in this book—Gradle, Leiningen, SBT—all depend on Maven under the hood. Unfortunately, sometimes Maven gets stuck. Purge its local repository, and then run your build again.

Another common issue with builds is that the Hadoop distribution—or other included JARs—has a dependency conflict with the Cascading artifacts in the Maven repo that you’re using. For example, most of the builds shown in this book require cascading-core and cascading-hadoop for compile-time dependencies. The builds that include unit tests will also depend on cascading-test, junit, etc. Depending on your deployment environment, some artifacts may need to be excluded, e.g., logging.

Other typical problems encountered include the following:

Using Java 7—should use Java 6 instead
Using a Hadoop version higher than 1.x—see the Cascading compatibility matrix
Installing Hadoop but not in “standalone” mode
Running Hadoop atop Cygwin on Windows—which generally does not work
Installing Hadoop using Homebrew on Mac OS X—install from the Apache Hadoop download or one of the other major distributions instead

Anti-Patterns

Some patterns of coding are counterproductive and generally indicate that the design of an app should be reworked. We call these anti-patterns, and some are specific to Cascading.

If you find that you are writing a substantial number of custom operations to make a Cascading app perform the business process you need, that’s a warning sign. We find that most Cascading apps require few custom operations, unless a developer is trying to work around the pattern language.

Another anti-pattern concerns traps. These are intended for exceptional data—rare, unintended edge cases in the tuple stream. If you find that traps are being used in an app to define the business process, that’s a warning sign. Filters and branches are supposed to be used to direct the tuple flows—for those tuples that are not exceptions. Apps will not perform well when traps get used in place of filters.
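The distinction between the two paths can be sketched in plain Java, without the Cascading API. This is only an illustration under assumed names: the class TrapVersusFilterSketch, the "name,amount" record layout, and the cutoff parameter are all hypothetical. Expected tuples are routed by an ordinary predicate (the role of filters and branches), while only records that cannot be processed at all land in the exceptional path (the role of a trap).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not Cascading code. */
final class TrapVersusFilterSketch {
    final List<String> large = new ArrayList<String>();   // branch for amounts >= cutoff
    final List<String> small = new ArrayList<String>();   // branch for amounts < cutoff
    final List<String> trapped = new ArrayList<String>(); // exceptional records only

    /** Routes one hypothetical "name,amount" record. */
    void route(String record, int cutoff) {
        try {
            int amount = Integer.parseInt(record.split(",")[1].trim());
            // Business logic: an ordinary predicate decides the branch.
            if (amount >= cutoff) {
                large.add(record);
            } else {
                small.add(record);
            }
        } catch (RuntimeException e) {
            // Reached only for malformed input (missing field, non-numeric amount).
            trapped.add(record);
        }
    }
}
```

The anti-pattern would be to throw an exception for every record below the cutoff so that the trap collects the "small" branch; the predicate expresses the same rule directly.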

Factory methods represent another kind of anti-pattern. Instead use SubAssembly subclasses. The object constructors in Cascading are “factories,” so there’s not much sense in adding unneeded code that in turn makes the app harder to understand. That would be an example of introducing accidental complexity.

Workflow Bottlenecks

Performing aggregations at scale on Apache Hadoop is a hard problem. Joins in particular can be difficult, and Cascading provides alternatives to improve performance. In Chapter 3 we used HashJoin for a replicated join, which applies when one side of the join is small enough to be held in memory. Otherwise, the join must be based on a CoGroup and the developer may need to adjust the threshold for spilling to disk.

There also are many third-party extensions to Cascading, some of which can improve the performance of large joins. For example, BloomJoin is a drop-in replacement for CoGroup, based on using a Bloom filter built from the right-hand side (RHS) keys. This can improve performance significantly when the RHS is relatively small but its tuples still won’t fit in memory.
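The general idea behind a Bloom-filter join can be sketched in plain Java. This is not BloomJoin's implementation; the class name BloomPrefilterSketch, the two hash functions, and the filterBits parameter are assumptions chosen for illustration. A compact bit set is built from the RHS keys, and left-hand keys that cannot possibly match are discarded before the expensive grouping step of an inner join.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative sketch only: a tiny Bloom filter used to pre-filter join keys. */
final class BloomPrefilterSketch {
    private final BitSet bits;
    private final int size;

    BloomPrefilterSketch(int size) {
        this.size = size;            // number of bits; assumed to be > 0
        this.bits = new BitSet(size);
    }

    // Two simple bit positions derived from hashCode(); real filters use stronger hashes.
    private int h1(String key) {
        return (key.hashCode() & 0x7fffffff) % size;
    }

    private int h2(String key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return (h & 0x7fffffff) % size;
    }

    void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    /** May report true for a key never added (false positive), never false for an added key. */
    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    /** Keeps only the left-hand keys that might have a match among the right-hand keys. */
    static List<String> prefilter(List<String> lhsKeys, List<String> rhsKeys, int filterBits) {
        BloomPrefilterSketch filter = new BloomPrefilterSketch(filterBits);
        for (String key : rhsKeys) {
            filter.add(key);
        }
        List<String> survivors = new ArrayList<String>();
        for (String key : lhsKeys) {
            if (filter.mightContain(key)) {
                survivors.add(key);
            }
        }
        return survivors;
    }
}
```

Because false positives are possible, the survivors still have to go through a real join; the filter only reduces how many tuples reach it.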

Another typical performance problem with Hadoop jobs concerns aggregations in general: key/value skew. Consider the social graph for a social network such as Twitter: most people may have up to a few hundred followers, but then a few outliers such as Lady Gaga may have millions. This can cause a highly skewed distribution of values per key during the reduce tasks. The effect is that many tasks will start during a reduce phase, and most finish relatively quickly. A few “straggler” tasks (e.g., Lady Gaga’s set of followers) continue processing, perhaps for many hours. Overall the cluster utilization metrics drop because only a few tasks are running; however, the app itself cannot progress until all of its reduce tasks complete. A potential workaround is to filter the outlier keys that have huge sets of values and process them in a different branch of the app.
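The workaround of separating outlier keys can be sketched in plain Java, independent of Cascading. The class SkewSplitSketch, its method names, and the threshold value are hypothetical; in a real app the counts would come from a prior aggregation or a sample, and the two lists would correspond to two branches of the flow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only: split (key, value) pairs into normal and skewed branches. */
final class SkewSplitSketch {
    /** Counts how many values each key has; pair[0] is the key. */
    static Map<String, Integer> countPerKey(List<String[]> pairs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String[] pair : pairs) {
            Integer n = counts.get(pair[0]);
            counts.put(pair[0], n == null ? 1 : n + 1);
        }
        return counts;
    }

    /** Returns the keys whose value count exceeds the given threshold. */
    static Set<String> outlierKeys(Map<String, Integer> counts, int threshold) {
        Set<String> outliers = new HashSet<String>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }

    /** Returns two branches: index 0 holds normal pairs, index 1 holds outlier pairs. */
    static List<List<String[]>> split(List<String[]> pairs, Set<String> outliers) {
        List<String[]> normal = new ArrayList<String[]>();
        List<String[]> skewed = new ArrayList<String[]>();
        for (String[] pair : pairs) {
            if (outliers.contains(pair[0])) {
                skewed.add(pair);
            } else {
                normal.add(pair);
            }
        }
        List<List<String[]>> branches = new ArrayList<List<String[]>>();
        branches.add(normal);
        branches.add(skewed);
        return branches;
    }
}
```

The normal branch can then be aggregated as usual, while the skewed branch gets whatever special handling fits the app, so a handful of huge keys no longer hold up every other reduce task.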

Other Resources

This book is intended to be an introduction to Cascading and related open source projects. There are several resources online for learning about Cascading in much more detail:

Also, there is a wealth of Cascading users and active discussions on the cascading-user email forum. If you have a problem with a Cascading app—or Cascalog, Scalding, PyCascading, Cascading.JRuby, etc.—then generate your flow diagram as a DOT file and post a note to the email list.