When it comes to the game of life, I figure I’ve played the whole course.
—Lee Trevino
Lee Trevino used to characterize the difference between himself and Jack Nicklaus by describing typical par 5 birdies for each. Nicklaus would launch a majestic drive into the fairway, hit a towering long iron onto the green, and narrowly miss his eagle putt. Trevino would slice a drive into the trees, punch an iron across the fairway into the left rough, gouge a wedge onto the green, and curl in a 10-footer. The game, Trevino would conclude with a grin, can be played in many different ways.
In chapter 11, I presented an intricate rating system and concluded that Tiger Woods was the best player on the PGA Tour in 2009 (and 2008, and 2007, and …). There are numerous other routes to the same conclusion. Some are majestic and some involve a lot of scrambling. A small sampling follows.
There are some obvious flaws in the rating system described in chapter 11. One flaw is that I have not given it a clever name. For reference, I’ll call it the Total Strokes rating system. The TS ratings are influenced by how I divided a typical round into 10 approach shots from the fairway (50–200 yards), 4 approach shots from the rough, and so on. These numbers were chosen based on Tour averages for the number of shots hit of each type, but other choices could be made. More troublesome is the dependence of the TS ratings on ShotLink data, which does not include the four majors. This problem is addressed in a different rating system, which I’ll present next.
The only data I have available on the major tournaments are the scores posted by the players. To develop a rating system based only on scores, I modified a system that I have used for years for college football.1 The first step is to recognize some inadequacies in the standard measures of golfing success. The Vardon Trophy is given annually to the golfer with the lowest scoring average. It does make some sense to honor the golfer with the lowest scoring average as the best golfer, but this statistic does not take into account the difficulty of the courses played. The top golfers play a high percentage of their rounds in big tournaments on demanding courses, where even par is often in contention for the championships. At some other events, even par may not make the cut.
The public tends to pay less attention to the scoring average and more attention to rankings such as money earned, World Golf Rankings, and FedEx Cup standings. All of these measures give extra weight to important tournaments and place extreme emphasis on winning. Money and points for first place can be double what they are for second place. If someone snakes in a monstrous putt to win a playoff, is that really twice as good as the unlucky second-place performance?
This is an especially interesting question as it relates to consistent and inconsistent golfers. Think about how you would rank the following two golfers. Player A makes the cut every week, usually finishing between 25th and 50th. Player B misses the cut over half the time but gets hot for a month or two and posts several top ten finishes with one win per year. Player B’s wins attract more publicity, money, and points, but which golfer is really better? Both the TS ratings and the ratings that follow reward the strong play that results in a tournament win but award no extra credit for the win itself.
In its most basic form, the rating system I present here is very simple. Suppose that every time golfers A and B play in the same tournament, A beats B by 2 strokes. Based on this evidence, rating player A as 2 strokes better than player B makes sense. Following through on this logic, think of every tournament that Tiger plays in as a competition between Tiger and each of the other golfers. If Tiger shoots 283 and Phil shoots 286, then Tiger beats Phil by 3 points. If Trevor shoots 280, then Tiger loses to Trevor by 3. For every tournament and every golfer, compile all of the points won and points lost in all of these head-to-head comparisons.
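The head-to-head compilation described above can be sketched in a few lines of Python, using the hypothetical Tiger/Phil/Trevor totals from this paragraph:

```python
# Hypothetical four-round totals from the example above.
scores = {"Tiger": 283, "Phil": 286, "Trevor": 280}

def head_to_head_points(scores):
    """Net points for each golfer in one tournament: the sum, over all
    opponents, of (opponent's total - own total), so that lower scores
    earn positive points."""
    return {player: sum(opp_total - total
                        for opp, opp_total in scores.items() if opp != player)
            for player, total in scores.items()}

points = head_to_head_points(scores)
# Tiger beats Phil by 3 but loses to Trevor by 3, netting 0;
# Trevor nets +9 and Phil nets -9.
```

Summing these nets over every tournament in a season gives each golfer's total points won and lost.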
Suppose that on average Tiger wins by 14 points per match-up. Then Tiger’s rating should be 14 points higher than the average of his opponents’ ratings. If, on average, Phil wins by 8 points per match-up, then Phil’s rating should be 8 points higher than the average of his opponents’ ratings. There is a mathematical procedure for determining all of the golfers’ ratings so that each rating matches the player’s actual results. Some details are given at the end of this chapter. I applied this system to tournament scores for 222 golfers in 38 tournaments from 2007, including the four majors.
Table 12.1 gives the top ten strokes ratings for 2007. The units here are strokes per tournament. Tiger is rated 14.2 shots above average for a four-round tournament. More remarkably, he rates more than 4 shots per tournament better than second-place Ernie Els. In the 2007 TS ratings Tiger was about 4.1 strokes better than second-place Steve Stricker for a four-round tournament and 10.8 strokes better than average for a full tournament. The extra dominance shown here may be the result of a strong record in majors (one win, two seconds, and a twelfth). It may also indicate that, on top of his other skills, Tiger manages his rounds so well that his total is greater than the sum of its parts.
A critical difference between these ratings and, for example, the World Golf Ratings is that if two golfers tied for the lead after four rounds, this system would rank them as tied regardless of which one won the playoff. This seems less than ideal, since winning is important and can be a measure of the player’s ability to perform under extreme pressure. To give winning a more prominent role, the same system can be implemented using only wins and losses. For example, instead of giving Tiger 3 points for beating Phil by 3 strokes, give him 1 point for winning. When the points are counted this way, the top ten list changes, as shown in table 12.2.
Table 12.1 Top ten strokes ratings, 2007
Table 12.2 Win/loss ratings, 2007
This version of the ratings is harder to interpret. What meaning can be attached to Tiger being 0.27 wins better than Ernie Els? There is not much to say, except that it looks like Tiger is way better than anyone else. What I find most interesting about this approach is the actual data. For the tournaments involved, Tiger finished 2007 with a won-lost record of 1255-85. That computes to a 94% winning percentage, compiled in the majors and other top tournaments. Second place in terms of winning percentage was Ernie Els, well behind at 80%.
An important aspect of either version of these ratings is that the quality of the opponents (called “strength of schedule” in other sports) is factored into the system. In the strokes version of the ratings, if you beat an opponent by 2 strokes, your rating will be 2 points higher than your opponent’s rating. The better your opponent’s rating is, the higher your rating will be (since you get your rating by adding 2). The same phenomenon holds for the won-lost version. Beating a good player boosts your rating by more than beating a mediocre player.
When you have two rating systems, the inevitable question is: which one is better? The tests that I ran are inconclusive on this question, other than to say that a combination of the two outperformed either one individually. The combination that did well is essentially an average of the two. Looking at the top tens, you can see that multiplying the won-lost ratings by 14 comes close to matching the values in the stroke ratings. The combination rating is therefore the average of the stroke rating and 14 times the won-lost rating, that is, (strokes + 14 × won-lost)/2. Because it averages two quantities measured in strokes, the combination rating can still be interpreted as the number of strokes above average for a four-round tournament.
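As a one-line sketch, the combination described above (with the win/loss rating scaled by 14 before averaging) is:

```python
def combination_rating(strokes_rating, winloss_rating):
    """Average of the strokes rating and 14 times the win/loss rating,
    per the chapter's combination; the result is still in strokes
    above average per four-round tournament."""
    return (strokes_rating + 14 * winloss_rating) / 2
```

For example, a golfer rated 14.0 strokes and 1.0 wins above average would receive a combination rating of 14.0.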
The claim that the combination rating system outperforms the individual ratings needs to be justified. In fact, the more important claim is that it outperformed the FedEx ranking system for predicting the results for the 2007 FedEx playoffs. In particular, I computed the strokes and wins/losses ratings using only the events preceding the FedEx playoffs and used them to predict the outcomes of the playoff events. For example, Tiger was rated above Phil, so the ratings predict that Tiger beats Phil. This prediction turned out to be correct in Atlanta but incorrect in Boston. For comparison, I also used the FedEx standings going into the playoffs to predict the outcomes of the playoff events. The combination ratings system had the best prediction record. To be fair, the combination system got 59% of its predictions right versus 58% for the FedEx system and for both of the individual ratings systems. However, over the entire playoffs, a 1% difference translates into nearly 400 more correct predictions.
Table 12.3 Combination ratings, 2007
The topic of this section is the Official World Golf Rankings, a ranking system that is recognized and endorsed as the official golf ranking system by nearly all of the important golf organizations and tours. OWGR is a points system whereby a player’s performance in a tournament earns points based on that player’s placement in the tournament and on the importance of the tournament. Points earned in a tournament are kept for two years, although after 13 weeks the value of points from a tournament begins slowly decreasing, reaching 0 at the two-year mark. A player’s rating equals his average number of points per tournament played.
For major tournaments, 100 points are earned for first place, 60 points for second place, 40 points for third place, 30 points for fourth place, and so on. Even though only one stroke over four rounds may separate each place, the point differential between places is substantial.2 As with the money awarded, each place earns about 60% of what the next highest place earns. The Tour Championship winner receives 80 points, with points for second place, third place, and so on decreasing by similar percentages as for the major championships. Other tournaments have maximum point values reflecting their relative importance world-wide.
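The two-year point decay can be sketched as a weight function. The 13-week plateau and two-year horizon come from the text; the strictly linear decline is my own assumption (the official schedule may use discrete weekly decrements):

```python
def owgr_weight(weeks_since_event):
    """Fraction of a tournament's OWGR points still counted: full value
    for the first 13 weeks, then (assumed) linear decline to zero at
    the two-year (104-week) mark."""
    if weeks_since_event <= 13:
        return 1.0
    if weeks_since_event >= 104:
        return 0.0
    return (104 - weeks_since_event) / (104 - 13)
```

A win from one year ago (52 weeks) would thus count at a bit more than half value under this sketch.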
Table 12.4 Top ten point totals in Official World Golf Rankings, end of 2007
The most important characteristic of the OWGR is that it includes tournaments from all over the world. The OWGR point system is relatively easy to understand and provides a useful starting point for conversations about golfer rankings. The top ten in OWGR points at the end of the 2007 season are shown in table 12.4. The OWGR are surprisingly similar to the combination system rankings. In particular, the top five are identical except for the Phil and Ernie rankings. More information about the OWGR can be found at the OWGR website.
One question that has not yet been addressed is that of ranking golfers over time. The rating systems that I have presented can be applied over any time frame for which the data are available. The TS ratings are time-limited because ShotLink data is a 21st-century phenomenon, while the combination rating system can be applied over very long time frames. It is interesting to think about how that might work.
Suppose that I want to compare Tiger Woods to Gene Sarazen. Given tournament data going back to the 1930s, I would have head-to-head match-ups between Woods and Jack Nicklaus, Nicklaus and Sam Snead, and Snead and Sarazen. The combination rating system could then rank everybody who competed in any of these tournaments. I did not try this, because there is reason to believe that these ratings would not be fair.
The 2000 PGA Championship had one of the most exciting finishes ever, with Tiger defeating Bob May in a playoff after both made dramatic putts on the 72nd hole. In the first two rounds of that tournament, Tiger and Jack Nicklaus were paired together, as Jack played his last PGA Championship. Their exchange on the 18th green of the second round was out of a Hollywood movie, as Jack acknowledged a huge ovation and with a gesture passed the mantle of greatness to Tiger. The point of this story is that while this gives us two Woods-Nicklaus head-to-head tournament rounds, they occurred when Tiger was in his prime and Jack was near retirement. Similarly, the age difference between Nicklaus and Snead means that most of their shared tournaments caught Jack in or approaching his prime and Snead trying new putting strokes to try to stay competitive. The comparison is not fair.
In their 1999 article “Bridging Different Eras in Sports,” Berry et al. present a statistical model for estimating the effects of aging on ability in golf, baseball, and hockey. Their findings are fascinating in many ways. For golf, they find that the average trend for a golfer is somewhat like the simpler graph shown in figure 12.1.3 For most golfers, the peak scoring years fall between ages 30 and 40; younger golfers around age 20 are still on the learning curve, and older golfers around age 50 are in decline, with both groups averaging about 2 strokes per round higher than at the peak of their careers. If an aging function could be computed for each golfer, then comparisons over time could become meaningful. For example, in 1960 Jack Nicklaus and Ben Hogan played the last rounds of the U.S. Open together. Nicklaus was 20 years old and had not yet mastered the science of course management. Hogan was 47 years old and was struggling with his nerves on the greens. If their aging patterns were normal (they were not), we could take their scores from that tournament and subtract 2 strokes per round, which would provide a fair comparison to golfers in the field who were in their prime.
Of course, not everyone has the same aging pattern. Ben Hogan did not hit his stride until later in life, while Ben Crenshaw peaked at a younger age. Berry et al. allowed the aging pattern for a given golfer to change in several ways. Each player has a maturation age mA and a declining age mD. In between mA and mD, the golfer ages according to a more sophisticated version of figure 12.1. For ages younger than mA, the value of the generic aging function is multiplied by a constant c1, and for ages older than mD, the generic aging function is multiplied by a constant c2. Ben Crenshaw, who matured early, was only about 1 stroke above peak level at age 20; for him, c1 ≈ 0.5. Ben Hogan aged well and was only 0.5 strokes above peak level at age 50; for him, c2 ≈ 0.25.
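A minimal piecewise sketch of such an aging function, using the parameter names mA, mD, c1, and c2 from the text: the linear ramps and the age-20/age-50 endpoints are my simplification of figure 12.1, not Berry et al.'s actual fitted curve.

```python
def aging_adjustment(age, mA=30, mD=40, c1=1.0, c2=1.0):
    """Strokes per round above peak ability (simplified sketch):
    roughly 2 strokes above peak at age 20, at peak between mA and mD,
    and about 2 strokes above peak again by age 50.  The constants c1
    and c2 rescale the young and old ends for individual golfers
    (e.g., c1 = 0.5 for an early maturer like Crenshaw)."""
    if age < mA:
        # linear ramp from 2 strokes above peak at 20 down to 0 at mA
        return c1 * 2.0 * (mA - age) / (mA - 20)
    if age > mD:
        # linear ramp from 0 at mD up to 2 strokes above peak at 50
        return c2 * 2.0 * (age - mD) / (50 - mD)
    return 0.0
```

Under this sketch, a 20-year-old with a typical pattern is 2 strokes above peak per round, while Crenshaw (c1 = 0.5) would be only 1 stroke above peak at the same age.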
Figure 12.1 An aging function for golfers
The assumption is that a player’s score in a given round is given by score = peak ability + round adjustment + aging function + error. The round adjustment accounts for the difficulty of the course on that day, as measured by the average score of the field. Berry et al. took scores from the four majors from 1935 to 1997 and statistically estimated the values of the variables that best fit the data. The phrase “best fit” means that the “error” terms in the formula are as small as possible.
The values found for the peak abilities of each golfer can then be used to give an all-time ranking of golfers. Most discussions of whether or not an athlete belongs in the Hall of Fame include evaluations of longevity as well as peak ability. Since the standard aging model shows golfers maintaining peak ability for 10 years or more, peak ability in this case does include some information about strong performances over time.
The system is designed to eliminate bias from one generation to the next, whether it be from redesigned clubs or courses of varying difficulty. However, to my mind the list of top 20 players given in table 12.5 is heavily skewed to modern golfers. So, let the debate begin. Are modern golfers better than their predecessors? Remember that these ratings were published in 1999 using data through 1997. At this stage, Tiger Woods had won exactly one major. Berry et al. note Tiger’s young age and say that if he were to improve as much as the generic aging function predicts, he would reach a peak score of 68.6 and rate as far and away the greatest ever. This is what has happened.4
Table 12.5 Berry/Larkey ratings for peak ability in majors, 1935–1997
An estimation of peak ability is one way to attempt to answer the question of who is the greatest of all time. John Zumerchik approaches the problem in a different way.5 He attempts to identify the greatest performance in a single tournament using standard deviation as his basic tool. Recall that if the distribution is normal (golf tournament scores are generally close to normal), the chance of a score falling at least 1 standard deviation below the mean is about 16%; at least 2 standard deviations below, about 2.3%; and so on. This gives us a way to measure how good a score is. Compute the mean and standard deviation of scores for a tournament. The more standard deviations below the mean a score is, the more unlikely the score is and the more impressive the result. Zumerchik did this for a number of Masters tournaments and U.S. Opens through 2000.
Zumerchik’s analysis found only five tournaments in which the winning score was 3 or more standard deviations below the mean. Three occurred in the Masters, all by players breaking the scoring record. In 1965 Jack Nicklaus shot 271, which was 3.56 standard deviations below the mean. Raymond Floyd matched the 271 in 1976, finishing 3.27 standard deviations below the mean. Tiger Woods shot 270 in 1997, winning by 12 strokes with a score 3.26 standard deviations below the mean. In the 1953 U.S. Open, Ben Hogan’s 283 was 3.32 standard deviations below the mean.
Hogan’s 1953 season is one of the candidates for best year by a golfer. He won the Masters, U.S. Open, and British Open and did not compete in the PGA Championship. Three other seasons offer candidates for best year ever, including Bob Jones’s Grand Slam year of 1930 and Byron Nelson’s 1945 season with 18 tournament victories, including 11 straight. If you can name the other (fairly recent) remarkable season, you can probably answer the question that is still on the table—identifying the greatest single tournament ever.
In 2000, Tiger Woods won three majors, breaking the scoring record in each. Does this mean that his performances were the best ever? If the courses were just playing especially easy that year, Tiger’s records may not be that noteworthy. Tiger’s score of 272 in the 2000 U.S. Open at Pebble Beach won by 15 strokes. Moreover, it was an unprecedented, unapproached, and nearly unimaginable 4.75 standard deviations below the mean. Given the scores the field posted at Pebble Beach that year, the probability that anyone would shoot 12 under par is about 0.000001. This is almost literally one in a million!
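The normal-tail probabilities quoted in this section are easy to check with the complementary error function; for 4.75 standard deviations, the tail probability is indeed about one in a million:

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variable, via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# normal_tail(1.0) is about 0.16, normal_tail(2.0) about 0.023,
# and normal_tail(4.75) about 1.0e-6 -- one in a million.
```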
You would not expect that a job market paper by a graduate student in the Department of Agricultural and Resource Economics would be of much interest to golf fans. Nevertheless, Jennifer Brown’s 2007 paper “Quitters Never Win: The (Adverse) Incentive Effects of Competing with Superstars” created a national media stir with its conclusion that the PGA’s best players score higher when Tiger Woods is in the field than they do when he takes the week off.6
Brown’s intent was to explore an economic principle, that incentive bonuses can backfire. The naive thinking is that offering a bonus for the most sales will spur all salespeople to perform better than usual as they vie for the bonus. Brown wondered what would happen if the winner of the bonus was not in doubt due to the presence of a “superstar” salesperson. Having no realistic chance to earn the bonus, a good (but not great) salesperson not only would have little motivation to work extra hours but might in fact succumb to negative psychological impulses and actually sell less than usual. Brown had the inspiration to use tournament golf data to test this hypothesis. If Tiger is not playing, many good golfers can realistically expect to have a chance to win. They will prepare for the tournament with enthusiasm and be willing to grind out every stroke. If a dominating Tiger is in the field, perhaps they “give up” and play conservatively for a decent pay check.
At first glance, the “Tiger effect” appears to be dramatic. Limiting the data to the top 141 (exempt) Tour players, in 2006 the stroke average in tournaments with Tiger in the field was 2.73 under par, improving to 4.16 strokes under par when Tiger was not in the tournament. Be careful here! It would be easy to think that Tiger’s presence cost his opponents a stroke-and-a-half, but this would ignore the fact that Tiger does not play on a random assortment of courses. The courses for Tiger events are typically harder than those for non-Tiger events. If the courses are, say, a stroke-and-a-half harder, then Tiger himself is not creating any effect at all.
This is one of the reasons that Brown’s problem is challenging. There is a difference in performance when Tiger plays compared to when Tiger sits, but what is the real cause of it? Is it the difficulty of the course? Out-of-control fans in Tiger’s gallery? To try to remove the course difficulty factor, Brown looked at tournaments that Tiger usually plays. If he does not play a Buick Open one year and the scores are good, then plays it the following year and the scores are bad, maybe the root cause is Tiger’s intimidating presence. However, differences could still be due to how the course was set up or the weather or any number of factors (including random variation).
Brown divided her data into periods where Tiger’s play is dominant (winning almost everything), typical (winning some), and struggling (not winning). The results are provocative. Brown finds that, during periods when Tiger is struggling, there is no significant difference in players’ performance either with or without Tiger. During typical periods, the top players perform almost a stroke worse when Tiger is in the field. During dominant periods, the drops in performance average nearly 2 strokes per tournament. Brown’s analysis supports her hypothesis that the presence of a “sure” winner can serve as a disincentive for other competitors.7
It would be interesting to repeat Brown’s analysis in tournaments with a runaway leader who is not Tiger. Thinking about how to define a “runaway leader” presents some of the difficulties in knowing whether a study of this sort yields useful information. If someone is unchallenged and wins by several strokes, does that mean that the rest played worse than usual? I would say yes. After all, if any of the trailers played well, the runaway leader would not be likely to win by much. But did they play worse because they had little chance of winning, or did they have little chance of winning because (for some other reason) they played poorly? Here is another way of asking the same question. Nick Faldo won three Masters tournaments after dramatic collapses by very good golfers (Scott Hoch, Raymond Floyd, and Greg Norman). Does this mean that there was a “Faldo effect,” or am I asking about a Faldo effect because he got lucky three times?
The win/loss and strokes rating systems described in this chapter turn out to be equivalent to a least squares system for predicting individual tournaments. Some details are given here.
To make the discussion less abstract, let’s imagine a tour consisting of four golfers (A, B, C, and D) and three tournaments (T1, T2, and T3). Player A skipped tournament T3, and the tournament results were:
The strokes system defines an equation for each golfer’s rating. We denote the ratings by a, b, c, and d. In any tournament, the ratings predict player A to beat players B, C, and D by a − b strokes, a − c strokes, and a − d strokes, respectively. In each of tournaments T1 and T2, then, player A should be ahead by (a − b) + (a − c) + (a − d) = 3a − b − c − d strokes. For the two tournaments, player A should be ahead by twice this, or 6a − 2b − 2c − 2d strokes. In actuality, player A was ahead by 1 + 9 + 11 + 4 + 5 + 8 = 38 strokes. Player A’s equation is therefore 6a − 2b − 2c − 2d = 38, and we repeat the process for each player. The other equations are 8b − 2a − 3c − 3d = 35, 8c − 2a − 3b − 3d = −20, and 8d − 2a − 3b − 3c = −53.
The equations can be represented in an augmented matrix:
\[
\left(\begin{array}{cccc|c}
6 & -2 & -2 & -2 & 38 \\
-2 & 8 & -3 & -3 & 35 \\
-2 & -3 & 8 & -3 & -20 \\
-2 & -3 & -3 & 8 & -53
\end{array}\right),
\]
which can be reduced to
\[
\left(\begin{array}{cccc|c}
1 & 0 & 0 & -1 & 10 \\
0 & 1 & 0 & -1 & 8 \\
0 & 0 & 1 & -1 & 3 \\
0 & 0 & 0 & 0 & 0
\end{array}\right).
\]
The reduced matrix translates to the equations a − d = 10, b − d = 8, c − d = 3, and 0 = 0. Interestingly, there are infinitely many solutions to our system of equations. To generate a particular solution, choose a value for d and solve for the other three ratings. A simple choice is d = 0, which leads to c = 3, b = 8, and a = 10. This says that player A is 2 strokes better than player B, who is 5 strokes better than player C, who is 3 strokes better than player D. Fortunately, all solutions give the same relative rankings and the same stroke differences between players. In this analysis, the equations were the result of insisting that each player’s predicted net strokes exactly match the player’s actual net strokes for the season. Individual tournament results can vary from the predictions, but the season totals must match.
A different way of approaching the rating system is to look at each individual tournament. For example, in T1 we predict player A to beat player B by a − b strokes. The actual result is player A winning by 1 stroke. The difference between the predicted result and the actual result is (a − b) − 1. We interpret this as the error in the prediction. The error for A versus C in T1 is (a − c) − 9, the error for A versus D in T1 is (a − d) − 11, and so on. Our goal is to choose a, b, c, and d to make the errors as small as possible. Of the various ways in which we might define “as small as possible,” we choose to minimize the sum of the squares of the errors. That is, we want to minimize [(a − b) − 1]² + [(a − c) − 9]² + [(a − d) − 11]² + … for all of the two-player match-ups for the season.
Calculus gives us a straightforward way of finding the smallest sum. You compute partial derivatives of the function to be minimized with respect to each of the four variables and set all derivatives equal to 0. This produces four equations that turn out to be equivalent to the equations for the strokes system. The strokes rating system (and, by analogy, the win/loss rating system) can therefore be thought of in two ways. In the first, the ratings are produced by requiring the predicted results and actual results for the entire season to match, while in the second, the ratings are produced by making the results in each of the individual tournaments match the predictions as closely as possible. The rating system is therefore sound on the large scale (the entire season) and the small scale (individual tournaments).
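The four-golfer example can be checked numerically. This is a sketch using NumPy's least-squares solver; because the ratings are determined only up to an additive constant, the solution is shifted so that d = 0:

```python
import numpy as np

# Coefficient matrix and right-hand side of the four season-total
# equations from the text (rows correspond to players A, B, C, D).
M = np.array([[ 6, -2, -2, -2],
              [-2,  8, -3, -3],
              [-2, -3,  8, -3],
              [-2, -3, -3,  8]], dtype=float)
rhs = np.array([38.0, 35.0, -20.0, -53.0])

# The matrix is singular (adding a constant to every rating changes
# nothing), so solve by least squares and normalize so that d = 0.
ratings, *_ = np.linalg.lstsq(M, rhs, rcond=None)
a, b, c, d = ratings - ratings[3]
# Recovers a = 10, b = 8, c = 3, d = 0, matching the text.
```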