CHAPTER 35


Assessment, Appraisal, and Selection

Assessments based on tests and judgments as well as appraisals of current performance can be used for leadership and management-development purposes. They can be used to modify current performance and predict future performance. The judgments and their combinations can forecast success and effectiveness as leaders. Such predictive data can be extracted from assessments of personal backgrounds, psychological tests of ability and motivation, and current performance in contrived and real-life settings. Forbes and Piercy (1991) examined the backgrounds and careers of 250 CEOs and emphasized the importance of early career success. In mid-career, alternate routes to the top were movement into general management or functional specialization. Background and experience influenced early promotion into middle and senior management. Credibility and visibility were enhanced by level of education, breadth of experience, entry from a prestigious university and its advanced management training programs, service on the staff of a senior executive, work in a powerful department, and functional understanding of the corporate organization’s critical problems. It also could help to be a member of the family that controlled the corporation.

Business and industry were quick to adapt the intelligence tests developed for selection by the U.S. Army in 1917, in World War I, to assess 1.7 million draftees. (Portions of these tests can be found in the current, commercially available Wonderlic Personnel Test.) Development of personality inventories for assessment soon followed. The first modern standardized attitude surveys were completed by employees in the U.S. Department of Agriculture in the early 1930s (Likert, 1932).

Purposes of Assessment


The uses of assessment are twofold. By whatever means it is done, assessment provides the basis for choosing from among candidates for leadership and management posts. It also provides useful information for the counseling and further development of incumbent leaders. The organization clearly benefits from using valid selection processes for its leaders rather than relying on haphazard processes, ranging from casual and ill-informed choices to those involving favoritism, prejudice, politics, nepotism, and bribery (Immegart, 1987). Organizational effectiveness is enhanced by assigning those best able to fill leadership posts. Valid assessment also makes it more likely that members’ status in the organization will correlate with their ability and esteem, and it reduces potential conflicts caused by incongruities among status, esteem, and ability.1 In the same way, counseling and development based on valid measurements rest on much firmer ground than do counseling and development based exclusively on impressions and feelings. Thus Knight and Weiss (1980) demonstrated in an experiment that leaders perceived to have been chosen by a competent agent were judged by other group members to have greater expertise than those selected by a less competent agent. These leaders were also more influential.

Varieties of Available Assessment Information


Personal traits related to leadership have included tests and judgments of capacity, achievement, and verbal and nonverbal communication styles; interests, attitudes, and values; sociability, initiative, confidence, and popularity; task and relations orientation; status, family, educational background, and work history. For example, Miner (1960a) and Nash (1966) reported patterns for forecasting managerial success that were consistent with the personal factors connected with leadership discussed in earlier chapters. These patterns involved energy, risk taking, verbal fluency, confidence, independence, and the desire to be persuasive. Mahoney, Jerdee, and Nash (1960, 1961) showed that the success of 468 managers in Minnesota-based companies was predicted by work-related, business-related, and higher-level occupational interests. These concerned interests in leadership, independence, moderate risk, and work that was not closely detailed. Also predictive were intelligence, solid education, activities in more organizations, and self-reported dominance. In the same way, Ghiselli (1971) predicted managers’ progress based on a battery of tests of intelligence, supervisory ability, self-assurance, decisiveness, self-actualization, and motivation to achieve.

In addition to information generated from tests and simulations, the predictive data have been supplied by observers, interviewers, outsiders, superiors, peers, subordinates, and the assessees themselves. Assessment centers (discussed in detail below) have become a popular way of systematically gathering and pooling such multiple and diverse sources of information. The sources of information may be either subjective judgments about candidates for selection, development, and promotion or objective work samples, tests, and inventories. Often, both sources will be combined for a prediction. Naive observers may do just as good a job as trained process observers in identifying which individuals in a group are emerging as its leaders (Stein, 1977). The methods of combining the predictors for success as a leader are either statistical, judgmental, or both. Clinical judgment or mechanical pooling of the information in some objective systematic way may be used. To aid judgment, objective or subjective information or both may be displayed in profiles or printed in narrative reports. The mechanical combining may be based on unit or differential weighting of the information. The weighting can be a matter of judgment, or multiple regression analysis may be used to minimize error in the predictions (Sawyer, 1966).
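The contrast between unit and differential (regression-based) weighting can be illustrated with a short sketch. The predictor names, sample size, and weights below are hypothetical, chosen only to show the mechanics of the two combination methods:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical assessment data: three standardized predictor scores
# (e.g., an ability test, a peer rating, an in-basket score) for 200 assessees.
n = 200
predictors = rng.standard_normal((n, 3))

# Simulated criterion: later-rated leadership success, driven unevenly
# by the predictors plus noise (the weights are illustrative, not empirical).
true_weights = np.array([0.5, 0.3, 0.1])
criterion = predictors @ true_weights + rng.standard_normal(n) * 0.8

# Unit weighting: every standardized predictor gets the same weight.
unit_composite = predictors.sum(axis=1)

# Differential weighting: least-squares regression weights minimize
# squared prediction error in this sample (cf. Sawyer, 1966).
beta, *_ = np.linalg.lstsq(predictors, criterion, rcond=None)
regression_composite = predictors @ beta

def validity(composite, criterion):
    """Correlation between a composite and the criterion."""
    return np.corrcoef(composite, criterion)[0, 1]

print("unit-weighted validity:      ", round(validity(unit_composite, criterion), 2))
print("regression-weighted validity:", round(validity(regression_composite, criterion), 2))
```

In-sample, the least-squares composite cannot correlate with the criterion less than the unit-weighted one, although unit weights often hold up better when the composite is applied to new samples.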

Importance of Effective Assessment and Appraisal


“We bet on people, not strategies,” declared Larry Bossidy, CEO of Allied Signal. Many senior managers fail to have a “talent mind-set” that can maintain the quality of an organization. Only 23% of top executives agreed that their firm brought in highly talented people, and even fewer (16%) said they could identify the high and low performers. Only 11% agreed that they could retain almost all high performers; only 5% said they could remove low performers quickly (Michaels, 1998). Yet talent needs to be identified early in high flyers’ careers in order to provide rapid promotion into leadership positions and avoid losing the high flyers to competing firms (Gratton, 1992). Early experiences in team sports and business and early identification are seen as important for the development of business leaders (Benton, 1990).

Importance of Acceptability of Appraisals and Feedback for Development


Increasingly, performance appraisals, especially from multiple sources, are being used for development as well as placement and advancement. Their acceptability is important in determining how much they can change behavior (Taylor, Masterson, & Renard, 1988). This depends on the willingness of the rater to provide unbiased appraisals, and that of the ratee to accept and make use of the feedback (Waldman & Bowen, 1998). (The importance of feedback of appraisals to leaders’ development and training, especially 360-degree feedback, was mentioned in earlier chapters.)

CEO Performance Evaluation


Truskie (1995) presented the results of a survey of the practices of more than 600 large firms in appraising their chief executives. Thirty-eight percent had formalized oral and written processes; 78% had preset performance objectives; 75% of the boards used the evaluations for compensation purposes. All CEOs were held accountable for the company’s financial performance. Boards of directors were supposed to be able to justify CEOs’ compensation. CEOs could use the evaluations for personal improvement and for building relations with the board. However, boards considered short-term, quarterly, and yearly goals as more important than long-term ones. Qualitative goals such as leadership might be considered, but only informally, and they were unlikely to be factored into compensation. In most cases, the evaluations were based on formal performance reviews.

Judgmental Approaches


Pure judgments of the readiness of job incumbents for promotions into positions that require more leadership are commonly used. These judgments may be obtained from observers. More often, they are obtained from the supervisors of the incumbents or from higher management. Such judgments can be a part of standard performance appraisals, and, as was noted in Chapter 30, such early appraisals of performance as a leader are likely to predict subsequent success as a leader. They also can be based on observations of leaders’ performance in work samples and situational tests, as well as in the actual performance of a job. Simulations of management situations, paper-and-pencil simulations in the form of standardized managerial in-basket tests, initially leaderless group discussions (LGDs), and other small-group exercises can be used to generate the observations from which judgments are derived. Other ways of judging the potential for leadership are supervisors’ opinions and performance appraisals, interviews, references, and recommendations. In addition to judgments of an assessee’s overall leadership performance, discrete elements can be singled out for examination and decision. These elements can then be formed into scores for standardized processing.

Judgments Based on Simulations

In-basket Tests. These tests simulate the typical contents of a manager’s in-basket. The items, which concern a recently vacated management position, may include telephone messages, brief memos, detailed reports, letters, directives, complaints, and junk mail. The examinee is instructed to imagine that he or she has just taken over the position and has limited time, such as one hour, to decide how to handle each item (Frederiksen, 1962b, 1966; Frederiksen, Saunders, & Ward, 1957). Realism can be added with telephone calls and visual presentations (Lopez, 1966). An element that is deemed important in the test is ambiguity (Gill, 1979), although the common problems faced by all managers need to be sampled (Stewart & Stewart, 1976). Brass and Oldham (1976) argued that the more representative and appropriate the in-basket sample to the examinee’s particular future management situation, the better the test results predict that performance.

As many as nine independent factors were uncovered in a Sears in-basket test (Bentz, 1967), but Gill (1979) concluded that most factor analyses of in-basket results, including Meyer’s (1970a), emerge with just two factors: a relations dimension of supervision and an intellectual dimension of planning and administration. Highly reliable results can be obtained with standardized scoring procedures by adequately trained judges of the examinee’s responses (Richards & Jaffee, 1972). Interviews with the examinees after they complete the in-basket test can augment the accuracy of the predictions obtained (Clutterbuck, 1974).

The results of the in-basket tests of IBM supervisors were found by Wollowick and McNamara (1969) to correlate .32 with increases in management responsibility during the three years following testing. Judgments of the examinees’ organizing and planning ability and work standards, garnered from an in-basket test, correlated .27 and .44, respectively, with the managerial level they achieved at AT&T eight years later (Bray & Grant, 1966) and .19 and .24 with the level they achieved 20 years later (Howard & Bray, 1988). Similar findings were reported by Meyer (1963) for manufacturing supervisors at General Electric and by Lopez (1966) for managers at the Port Authority of New York and New Jersey. These investigators agreed that judgments and scores from in-basket tests add considerably to the accuracy of forecasting managerial success beyond what is possible with ordinary paper-and-pencil tests of capacity and interests.

Leaderless Group Discussion (LGD). In the simplest form of small-group exercise, the initially leaderless discussion group (LGD) is assigned a problem to discuss to reach a group decision. Observers judge who emerges as a leader, initiatives displayed by each participant, and other aspects of interpersonal performance, such as motivation of others, persuasiveness in expression, and skills in dealing with others to handle the problems to be solved. Bass (1954a) reported correlations of .44, .53, and .38 between observers’ judgments in LGDs and the merit ratings of 348 ROTC cadets as cadet officers obtained six months to a year later. A similar correlation of .47 was found in predicting subsequent nominations for positions of fraternity leadership from earlier performance in an LGD. Similarly, Arbous and Maree (1951) obtained a correlation among 168 South African administrative trainees of .50 between an LGD and the trainees’ rated capacity a year later as administrators. Vernon (1950) found, among 123 personnel, a correlation of .33 between an LGD and their rated suitability for foreign service. The LGD was validated in the same way for shipyard supervisors (Mandell, 1950b), military officers (Weislogel, 1953), U.S. Army personnel (Gleason, 1957), and British supervisors who were followed up for four years (Handyside & Duncan, 1954). The LGD became a routine part of many assessment centers. Vernon (1950) found a correlation of .75 between judgments based on a one-hour LGD and judgments by assessors after three days of activities at an assessment center. Although the quantity of talk in an LGD reflects attempts to lead, its quality affects a person’s success in leading (Bottger, 1984). Participants who are highly rated by observers see themselves as more assertive, competitive, self-confident, and willing to function autonomously. Those rated less effective view themselves as cooperative, disciplined, and tactful (Hills, 1985).

Small-Group Simulations. Complex simulations are specifically constructed exercises for small groups of candidates that can be observed by assessors whose judgments are shaped by what they see. Roskin and Margerison (1983) presented examples of simulations that involved organizing a team effort among managers whose problem required little special information or experience but could bring out differences in participants’ leadership, managerial skills, and attitudes. These simulations were: (1) the production of a set of prototype greeting cards; (2) the building of a “monument” from material provided; (3) the solution of a case problem, at both individual and corporate levels; (4) the construction and flying of paper airplanes; (5) the selection of an employee from a number of candidates. Assessors’ judgments were based on observations of 39 elements in the assessees’ behavior. The assessments were related to the compatibility of the behavior with the situation and how the assessees dealt with cognitive and perceptual complexity. Managers who were most likely to be high achievers could be identified. Scores of leadership attitudes and opinions on paper-and-pencil tests alone failed to relate to the managers’ performance.

Dunnette (1970) described two simulated business situations from which judgments of performance could be gleaned for predictions of leadership. Six participants were told they were managing an investment fund and had to buy and sell stocks to make profits for their investors. Stock quotations were changed every five minutes, and other information was presented at predetermined intervals. Organizing skillfully, working cooperatively with other people, handling a large amount of information, and other group process and cognitive variables were judged from each participant’s behavior. In the second simulated business situation, the six participants were asked to play roles as managerial trainees. Each participant had to plan a project and make a 10-minute presentation justifying the project and its budget. After the six presentations, the participants engaged in an hour’s discussion about the merits of the various projects. Skills in judgment, planning, and organizing, proficiency in oral communication, and effectiveness at working in competitive situations could be judged with some accuracy from the emergent behaviors.

Cognitive Tests. Organizational leadership is characterized as social problem solving in ill-defined, usually novel domains (Mumford, Zaccaro, Harding, et al., 2000). Cognitive and metacognitive abilities are needed to solve the problems of executive leadership. Cognitive abilities include intelligence, general reasoning skills, divergent thinking, and oral and written expression and comprehension (Connelly, Gilbert, Zaccaro, et al., 2000). Cognitive abilities are needed to get things done. Metacognitive abilities guide the problem-solving process and enhance choosing, planning, monitoring, and evaluating what to do. Such abilities include verbal fluency, information-processing skills, and inductive and deductive reasoning (Marshall-Mies, Fleishman, Martin, et al., 2000). Cognitive and metacognitive tests of executive leadership skills have been conducted by both conventional and computer methods. A battery of these complex problem-solving measures was administered at the National University, in San Diego, to senior military officers.

Problem-Solving Scenarios. Zaccaro, Mumford, Connelly, et al. (2000) administered a set of psychometric tests to tap the problem-solving capabilities of 1,807 U.S. Army officers ranging in rank from 597 second lieutenants to 37 colonels, most with experience serving in leadership roles from platoon to brigade commanders; 80% were men, and 20% were women. The first tests administered were either cued or uncued scenarios calling for creative problem-solving and processing of complex problems. (Half the samples underwent cued scenarios; the other half, uncued scenarios.) Next, military scenarios required the construction of solutions to assess attention to the constraints on and understanding of the context. Scenarios calling for social judgments tested the understanding of people. Knowledge of the roles of a leader in dealing with various tasks, writing skills (rewriting headlines), and divergent thinking about the consequences of unlikely events were also examined, using similarly brief 10- to 15-minute tests. Responses were constructed by the examinees rather than chosen by them. Cued and uncued complex problem-solving scenarios correlated with knowledge of leadership, .32, .33; understanding leadership as problem solving, .22, .41; divergent thinking, .49, .51; verbal reasoning, .18, .14; and creative writing, .30, .33. Cued problem solving correlated .55 with solution construction; uncued problem solving correlated .50 with social judgment.

Judgments Based on Speeches and Essays

Essays written by candidates, managers, and leaders and their speeches can provide the basis of judgments about their future performance. Winter (1987) demonstrated that the contents of the inaugural addresses of U.S. presidents could be judged and coded to obtain valid inferences of their leadership motivation patterns. Such patterns, in turn, were predictive of the success of their administrations. Speechwriters provided the specific words, but the leaders reviewed and modified the text. House, Woycke, and Fodor (1988) extended the work to Canadian prime ministers.2

Judgments Based on Appraised Performance

Superiors’ appraisals, evaluations by peers, and self-reports of leaders about their own performance may all be employed as predictors of future leadership success. Generally, superiors’ and peers’ ratings are likely to be consistent with each other, although superiors are prone to emphasize getting work done and peers are more likely to emphasize cooperativeness. Self-reports tend to be inflated (although some managers may have better insight into themselves than others do). Self-reports may be unrelated to judgments obtained from peers or superiors. For example, Lawler (1967a) found a correlation of .52 between superiors’ and peers’ ratings of the job ability of 113 middle and top managers. Agreement on the rated quality of the managers’ job performance was .38. On the other hand, correlations were close to zero between the managers’ self-reports and the peers’ or superiors’ ratings of the managers’ ability and performance. A meta-analysis by Harris and Schaubroeck (1988) of 11 to 36 reported correlations found a high correlation, on average, between peers’ and superiors’ ratings but much lower correlations between self-reports and superiors’ ratings, or between self-reports and peers’ ratings. Agreement between self-reports and others’ ratings was lower for managers and professionals than for blue-collar and service employees. Whether the rating format was dimensional or global had little effect on the agreement.

Judgments by Superiors

Organizations commonly use supervisors’ and superiors’ judgments to decide whom to hire, transfer, and promote. The judgments are attempts to forecast future success as a leader and manager in the positions for which the candidates are to be chosen. Equally common are teachers’ and trainers’ appraisals of prospective future position holders while they are enrolled in training and education programs. When 493 to 631 U.S. Army soldiers were rated by their supervisors and peers, 13% of the variance in the supervisors’ appraisals was accounted for by the soldiers’ scores in ability, job knowledge, and proficiency. But only 7% could be accounted for by peers. When interpersonal factors were included, the percentages rose to 28% and 19% (Borman, White, & Dorsey, 1995). The cumulative military performance grades awarded by superiors at Annapolis correlated .25 with subsequent fitness reports of 186 U.S. Navy officers serving in the fleet as much as a decade later. They also predicted the tendency of subordinates to describe the officers as charismatic (Yammarino & Bass, 1989).

Academic Performance Grades. Cumulative academic performance grades at Annapolis were not predictive (Yammarino & Bass, 1989) of success. This finding was consistent with the failure, more often than not, to find significant correlations between undergraduate college grades and success in business and industry (see, e.g., Schick & Kunnecke, 1982) or better ratings of job performance (Pallett & Hoyt, 1968). The record for postgraduate grades has been somewhat better (Weinstein & Srinivasan, 1974), although in general, postgraduate grades of any consequence emerged mainly in specific elective courses (Marshall, 1964). As the importance of technology increases, so may the importance of academic performance if grade inflation is not too severe. In special circumstances, academic performance may be predictive. In one study, academic performance predicted teachers’ ratings of the leadership of high school students (Schneider, Paul, White, et al., 1999). The school was a magnet school focusing on leadership development, and these students spent most of two years of their class time together, confirming the data gathered about them. Grade point average (GPA) correlated .59 with task-goal teacher ratings of leadership and .48 with peers’ nominations of leadership (Schneider, Ehrhart, & Ehrhart, 2002). Generally, results lagged over the two-year period. They showed more modest correlations among teachers’ and peers’ appraisals and the Myers-Briggs Type Indicator, the Campbell Interest and Skill Survey, and leadership in an initially leaderless group discussion.

From the research on the persistence of leadership in Chapter 30, it can be inferred that superiors’ performance appraisals of managers’ behavior as leaders furnish a greater opportunity to observe the managers in action than when the managers are observed in diverse positions by different superiors, peers, and subordinates. At the same time, the predictive validity of superiors’ judgments suffers to the extent that they overweight the candidates’ technical proficiency and manipulative styles (Farrow, Valenzi, & Bass, 1981); the requirements of the new positions are also different from those of the old ones. Furthermore, successful performance in a lower-level position may not necessarily predict performance in a higher-level position that requires more cognitive complexity (Jacobs & Jaques, 1987).

Brush and Schoenfeldt (1980) proposed a procedure that would direct superiors’ ratings toward a systematic evaluation of their current subordinates on dimensions of consequence to future positions the subordinates might hold. They argued that it is even possible to evaluate blue-collar workers, based on their observed behavior on the job, on dimensions relevant to supervisory positions to which they might aspire. Schippmann and Prien (1986) obtained such judgments of 47 candidates for first-level supervisory jobs in a steel company from 29 supervisors. The dimensions arranged for the supervisors’ judgments included: (1) representational skills, (2) supervisory ability, (3) ability to analyze and evaluate information to define problems, (4) knowledge about and ability to work with mechanical devices, (5) ability to manage/orchestrate current activities effectively, (6) ability to decide and act, (7) ability to follow through to achieve closure on a task, (8) administrative ability. Satisfactory reliabilities were achieved, but self-assessed merits were much more lenient than the supervisors’ ratings.

Superiors’ Intentions in Appraising Subordinates. What stimulates superiors to appraise subordinates? In one study of 73 supervisors, 32 rated all of their subordinates, 25 rated only some, and 16 rated none. Supervisors who rated all the subordinates differed significantly from those who rated only some or none. According to the subordinates, these 32 supervisors were higher in initiating structure; they came from more formal departments; their work groups were more cohesive; their subordinates were better educated; and they had more confidence in the appraisals (Fried, Tiegs, & Bellamy, 1992).

Superiors may intend to try to rate subordinates’ performance accurately or to inflate or deflate subordinates for political, social, or personal reasons. Davis (2000) queried the intentions of law enforcement superiors in appraising the performance of subordinates. A total of 266 superiors (captains, lieutenants, and sergeants in a county sheriff’s department) completed the questionnaire. The superiors indicated their intention to be more accurate in their performance appraisals of their subordinates if they felt that the appraisals: (1) enabled employees to participate in the process; (2) contributed to developmental purposes; (3) served to reward employees. Also, intentions to be accurate were greater if the supervisors had better relations with their subordinates and were more satisfied with the performance appraisal system.

Peer Appraisals. Consistent with other analyses of “buddy ratings,” peer ratings by cadets at West Point and at officer candidate schools (OCS) were found to be the best single predictor of subsequent success as a regular U.S. Army officer (Haggerty, Johnson, & King, 1954). A correlation of .51 was obtained between peers’ ratings at West Point and the rated success of infantry officers 18 months later. A correlation of .42 was obtained between peers’ ratings in OCS and officers’ combat performance in World War II (Baier, 1947). Similar results were reported for the U.S. Marine Corps (Wilkins, 1953; Williams & Leavitt, 1947b) and the U.S. Air Force (USAF 1952a, b). Ricciuti (1955) found fellow midshipmen’s ratings of aptitude for service more predictive of the subsequent performance of naval officers than ratings made at the U.S. Naval Academy by their navy officers. Hollander (1965) and Amir, Kovarsky, and Sharan (1970) applied peer ratings to accurately predict the success of junior officers. Downey, Medland, and Yates (1976) extended the findings to senior officers. They found that ratings by 1,656 colonels of each other forecast who would be promoted to general. The correlation was .47. Peer nominations among 133 women soldiers in the U.S. Army were found to generate two distinct factors, professional and social, but parallel ratings yielded only one general factor, suggesting the ratings were subject to halo bias (Schwartzwald, Koslowsky, & Mager-Bibi, 1999).

Judgments by peers are often in the form of nominations of one or several associates with whom the rater is acquainted (see Chapter 10). Nominations of the most and least valued have been employed. But to be predictive of subsequent success, the nominations must be positive. Kaufman and Johnson (1974) showed, in two studies of ROTC cadets, that being nominated as the most effective cadet officer during the school year correlated between .36 and .43 with independent criteria of performance as a platoon leader at summer camp. But being nominated as the least effective cadet officer did not correlate as negatively as expected with such leadership at summer camp; the correlations were .03 and .16 in the two studies.

In business, Kraut (1975b) obtained peers’ ratings from 156 middle-level IBM managers and 83 higher-level executives attending a monthlong training program. Two factors emerged from 13 ratings: (1) impact; and (2) tactfulness. These factors correlated .35 and .37, respectively, with the performance appraisals of the higher-level executives, but close to zero with the performance appraisals of the middle managers once they were back on the job. However, the peers’ ratings of impact correlated with the subsequent number of promotions received by both the middle managers and the executives. These predictions from peer judgments were much more likely to be correlated with subsequent success than predictions obtained from the training staff of the monthlong program.

Are peer ratings biased by friendship? Mumford (1983) was impressed with the validity of peer ratings but saw a need for a better theoretical understanding of the reasons for it. Peer ratings appear to be valid because they are likely to be influenced by many of the persistent traits that contribute to success as a leader, such as competence and esteem, and not necessarily by friendship.

Subordinates’ Appraisals. With the increase in participative management has come an increase in the acceptance of subordinates’ ratings of their bosses. Anecdotal evidence suggests that some managers learn from their subordinates that they talk too much or are not good listeners. Some receive better ratings than they expected. Reasons for discrepancies between ratings from the different sources can also be examined and action planning undertaken. Bob Myers, a vice president at the Limited, discovered that his colleagues wanted him to be more open to different views. Ellen Walton, at AT&T Network Systems, found out that she did not always listen. CEO Lawrence Bossidy of Allied Signal was not surprised to learn that his subordinates agreed with him that he was opinionated (Maynard, 1994). In 1992, the motorcycle manufacturer Harley-Davidson began an upward feedback program in which members of its executive committee received feedback from their direct reports. Increasingly, in many firms such as General Electric and IBM, feedback programs for developing managers are surveying the managers’ subordinates, as well as their peers and superiors (Gunn, 1992). Information from subordinates is also likely to be predictive of future success as a manager. However, the accuracy of the predictions derived from such information may suffer to the degree that the subordinates overweight sentiment and the superior’s likability, and to the extent that the future position’s requirements differ from the current one’s. Nevertheless, Hater and Bass (1988) showed that among 56 Federal Express managers, those who were rated higher in transformational leadership and lower in laissez-faire leadership by their subordinates were significantly more likely to be judged by senior managers as having a greater potential for leadership. Similar results were found by Yammarino and Bass (1989) for junior naval officers.

Of course, it is one thing to use superiors’, peers’, and subordinates’ ratings in research; it is another thing to use them in operations. The colleagues’ and candidates’ resistance and distortions are likely to increase when the ratings are to be employed by a higher authority to decide on a promotion for which the colleagues are also competing.

Self-Ratings. Objective personality tests and inventories may provide assessments for personal leadership development as well as for the selection of leaders. Most survey feedback asks managers to describe their own behavior. These results are then fed back to the managers relative to the average group results. The self-ratings themselves are not likely to be as useful for predicting future success as are the discrepancies between the managers’ judgments and others’ judgments. But the discrepancies are especially useful for developmental counseling and training. Most answers to questionnaires that ask managers to describe or evaluate their own behavior suffer from a variety of errors. Nevertheless, autobiographical material dealing with a person’s early history and important life decisions can be the basis of valid judgments by trained reviewers (Ezekiel, 1968). Assessees may be asked to write about their high school and college experiences, their peak experiences, their jobs and organizations five years into the future, and their obituaries (what they expect and hope will be said about them and their performance). Anonymity is often important; coded names can be used for feedback and for correlating research data. Nonetheless, when it was optional for U.S. Army officers to sign their names to their answers to a personnel survey, most signed, wanting to indicate that they were cooperating.

Self-ratings tend to be inflated compared to other ratings sources. They may be accurate, especially when they are not being used for administrative purposes or when they are verifiable statements of intention. Again, some attributes that are self-rated, such as commitment, may be accurate rather than biased, as was found for 79 public service administrative staff (Goffin & Gellatly, 2001). Toegel and Conger (2003) argued that two kinds of 360-degree assessments are needed: one that provides more qualitative assessment and feedback for management development; the other, more quantitative assessment feedback for performance appraisal and performance outcomes.

Multiple Rating Sources


The validity, accuracy, and relevance of ratings and other subjective judgments can be improved in a variety of ways. These include changes in rating format, the selection and training of raters, corrections for biases and expectations, and, in particular, the use of multiple sources of ratings. The overall agreement among raters and the reliability and accuracy of the results are likely to increase with the number of independent raters judging the same ratee. To add validity to performance appraisals, ratings may be obtained from more than one source. Multisource ratings may be any combination of ratings of leaders by superiors, peers, subordinates, and self, as well as by other sources such as clients, customers, and members of different departments of the same organization. A common multisource research plan is to obtain ratings of leaders from three or more subordinates and peers. As many as five observers may be employed by assessment centers, and their ratings then compared with self-ratings. A plethora of studies has been completed comparing self-ratings with ratings by others.
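The claim that reliability rises with the number of independent raters can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result. The sketch below is illustrative only; the single-rater reliability of .30 is an assumed value, not a figure from the studies cited here.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Estimated reliability of the average of k parallel raters,
    given the reliability (intercorrelation) of a single rater."""
    return k * r_single / (1 + (k - 1) * r_single)

# With an assumed single-rater reliability of .30, averaging over
# more independent raters steadily raises the pooled reliability.
one_rater = spearman_brown(0.30, 1)
three_raters = spearman_brown(0.30, 3)
five_raters = spearman_brown(0.30, 5)
```

Under this assumption, averaging three raters nearly doubles the pooled reliability relative to one rater, which is one reason a plan of three or more subordinates and peers per ratee is common.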

Self-Ratings versus Others’ Ratings

Oh, wad some power the giftie gie us

To see oursels as others see us,

It wad frae monie a blunder free us

An’ foolish notion

(Burns, R., 1785/1974, pp. 43–44)

Poets, among others, have long been aware of the inaccuracy of self-appraisal. At the same time, many studies have shown that leaders are more accurate in their self-appraisals than are nonleaders; that is, leaders are in higher agreement with others’ ratings of them (e.g., Gallo & McClintock, 1962). Baril, Ayman, and Palmiter (1994) examined the agreement between the self-ratings of 92 supervisors and the ratings by their 853 subordinates. There was only minor agreement, but the agreement correlated with the raters’ situational control in Fiedler’s (1967a) contingency model of leadership. Atwater and Yammarino (1992) looked at the self-other agreement of 92 midshipmen leaders with their subordinates and with the officers they reported to at the U.S. Naval Academy’s summer camp. Using the Multifactor Leadership Questionnaire, subordinates and supervising officers rated the transformational leadership of each self-rating leader. As is usually found, mean self-ratings were higher than mean subordinates’ ratings. Nevertheless, some leaders could be classified as overestimators or underestimators based on their deviations from the means. Overestimators were defined as more than half a standard deviation above the mean; underestimators were defined as more than half a standard deviation below the mean. For the remainder, self-ratings were defined as being in agreement with subordinates’ or superiors’ ratings. Approximately half of the 61 leaders were categorized in the same way by subordinates and superiors. Underestimators were seen more favorably by others. Self-ratings correlated .35 with superiors’ ratings, but only .19 with subordinates’ ratings. Bass and Yammarino (1991) obtained comparable data about 155 U.S. Navy officers in the surface fleet. Again, mean self-ratings were inflated. The officers were classified in the same way as the midshipmen.
Those officers who were less likely to be overestimators were more likely to obtain recommendations for promotion and better performance appraisals from their superiors. Self-ratings were unrelated to the superiors’ performance appraisals and recommendations for promotion. When differences between self-ratings and subordinates’ ratings were used for 2,056 managerial raters, the in-agreement raters and the underestimators were regarded by their superiors as more effective than the overestimators. Yammarino and Atwater (1997) inferred that overestimators, compared to underestimators and individuals in agreement, were more likely to misdiagnose their strengths and weaknesses, make less effective decisions, suffer from career derailment, be resentful and hostile, see no need for training and development, be absent and get into conflicts with colleagues more frequently, and quit. They lowered their self-evaluations and improved their performance when feedback from others was received and accepted. Underestimators are likely to be more successful but are lower in self-esteem, set low aspirations, underachieve, and avoid attempting leadership. Individuals in agreement and rated as good are likely to be the best performers; make effective decisions; develop favorable expectations and achievement; be most promotable; be successful, effective leaders; be low in absenteeism, turnover, and conflicts with others; and accept and use feedback from others constructively. Those who agreed they were poor performers were likely to be poor decision makers; lack competencies, self-esteem, motivation, and a positive outlook; and take few actions to improve their performance. When subordinate-peer agreement was used to predict superiors’ ratings, the overestimation, underestimation, or agreement of self- and peers’ ratings correlated with superiors’ evaluations of the leadership effectiveness of 2,292 focal managers, but only when the self- and peers’ ratings were not intermediate between favorable and unfavorable.
Actually, it was mainly the peers’ ratings that provided the best prediction. Peers’ ratings correlated .40 with superiors’ evaluations; self-ratings correlated only .17 with superiors’ evaluations (Brutus, Fleenor, & Taylor, 1996).
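The half-standard-deviation classification described above can be sketched in code. This is one plausible operationalization, assuming the categories are formed from self-minus-other difference scores relative to their sample mean; the ratings themselves are invented for illustration.

```python
from statistics import mean, stdev

def classify_raters(self_scores, other_scores):
    """Label each leader as an overestimator, an underestimator, or
    in agreement, using half a standard deviation around the mean
    self-minus-other difference (a plausible reading of the rule)."""
    diffs = [s - o for s, o in zip(self_scores, other_scores)]
    m, sd = mean(diffs), stdev(diffs)
    labels = []
    for d in diffs:
        if d > m + 0.5 * sd:
            labels.append("overestimator")
        elif d < m - 0.5 * sd:
            labels.append("underestimator")
        else:
            labels.append("in agreement")
    return labels

# Invented self-ratings and averaged subordinates' ratings for four leaders
labels = classify_raters([4, 3, 2, 3], [2, 3, 3, 3])
```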

According to a meta-analysis by Harris and Schaubroeck (1988), self-ratings were much less in agreement with peers’ (r = .36) and supervisors’ ratings (r = .35) than peers’ and supervisors’ ratings were with each other (r = .62). Shipper and Davy (2002) compared five peers’ and immediate subordinates’ evaluations with self-evaluations. The questionnaire ratings dealt with the interactive and initiating skills of 1,125 middle managers in various facilities of a high-tech company. The peers’ and immediate subordinates’ ratings of interactive skills (r = .69) and initiating skills (r = .33) predicted the attitudes of the professional employees serving in the managers’ units. In turn, the professional employees’ attitudes significantly predicted the managers’ performance. On the other hand, the self-evaluations of interactive skills were actually negatively related (r = −.15) to managerial performance.

Inflated self-evaluations can cause careers to derail. The discrepancy in what leaders think of themselves and what their colleagues think of them may be seen by colleagues and others as arrogance (McCall & Lombardo, 1983). The absence of such a discrepancy benefits manager-subordinate relationships (Wexley, Alexander, Greenwalt, et al., 1980).

Adding Self-Ratings and Others’ Ratings for Predicting Performance. Atwater, Ostroff, Yammarino, et al. (1998) obtained self- and peers’ Benchmarks ratings on 1,460 managers in leadership development programs as well as on-the-job performance appraisals from their direct supervisors. Correlations with the supervisors’ ratings were .25 with self, .33 with subordinates, and .37 with peers; subordinates’ and peers’ ratings correlated .50 with each other. When subordinates’ and peers’ ratings were combined in a linear regression equation, a multiple R of .40 was obtained. When curvilinearity was introduced by adding to the equation the squares and cross-products of self- and peers’ ratings to predict supervisors’ effectiveness ratings, the multiple correlation rose to only .41, but numerous other conclusions could be drawn about extreme underestimators and extreme overestimators from the three-dimensional equation.
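The linear-versus-curvilinear comparison can be sketched as a regression exercise. The data below are simulated, not the Atwater et al. sample; the sketch only shows the mechanics of adding squares and a cross-product to the predictor set and comparing multiple correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
self_r = rng.normal(3.5, 0.5, n)   # simulated self-ratings
peer_r = rng.normal(3.2, 0.5, n)   # simulated peers' ratings
boss_r = 0.4 * peer_r + 0.1 * self_r + rng.normal(0, 0.4, n)

# Linear model: supervisor rating ~ self + peer (plus intercept)
X_lin = np.column_stack([np.ones(n), self_r, peer_r])
# Curvilinear model adds the squares and the cross-product
X_quad = np.column_stack([X_lin, self_r**2, peer_r**2, self_r * peer_r])

def multiple_r(X, y):
    """Multiple correlation R from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(np.sqrt(1 - resid.var() / y.var()))

r_linear = multiple_r(X_lin, boss_r)
r_curvilinear = multiple_r(X_quad, boss_r)
```

Because the quadratic model nests the linear one, its multiple R can only equal or exceed the linear R; whether a small gain (.40 to .41 in the study) is worth interpreting is a substantive question, not a statistical one.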

Moderators of Self-Other Agreement. Becker, Ayman, and Korabik (2002) found complex effects when men and women who were either high or low self-monitors were compared as to their agreement with their subordinates in male-dominated industrial and educational organizations rather than in organizations balanced between the sexes. Atwater and Yammarino (1997) reviewed the literature and noted that students rated their scholastic ability higher than their peers’ ability as they increased in age (Bailey & Bailey, 1974). Older and longer-tenured adults tended to inflate their own ratings (Ferris, Yates, Gilmore, et al., 1985). Intelligence, achievement, and internal locus of control correlate with more accurate self-evaluation (Mabe & West, 1982). Cognitively complex raters demonstrate more accurate self-assessments (Nydegger, 1975). Self-other ratings of MBTI types are higher in agreement for introverts than for extroverts, for feelers than for thinkers (Roush & Atwater, 1992), and for those higher in self-awareness (Van Velsor, Taylor, & Leslie, 1993). Accurate expectations increase the accuracy of self-assessments (Beyer, 1990), as do experiences with feedback of self-ratings (Kooker, 1974). Self-ratings will be inflated if they are to be used for personnel decisions or to manage impressions (Ferris & Judge, 1991). They also will be inflated by both subordinates and peers if they are used as recommendations for promotion rather than for developmental feedback (Greguras & Robie, 2001).

360-Degree Ratings

When ratings for an executive or manager by subordinates, peers, superiors, and self are determined, there exists a 360-degree report that can be used for administrative purposes, promotion from within, succession planning, self-development, coaching, or training and development. Though anonymity usually is initially maintained for subordinates’ ratings, when they are used for development, it is not uncommon for leaders to discuss their 360-degree report with subordinates as well as with peers and boss. Although the increase in 360-degree reporting paralleled the delayering of organizations and the increase in participative management and teamwork, it is not new. The Center for Creative Leadership introduced the 360-degree format in 1974, and Personnel Decisions did the same in 1984 (Maynard, 1994). A 360-degree analysis for managers and leaders can be obtained from sources such as Personnel Decisions and the Center for Creative Leadership. A 360-degree report can be obtained for the Full Range of Leadership Development, from charismatic to laissez-faire, in Web or paper format (Avolio & Bass, 2004). (For a list of 10 commercially available 360-degree programs and their validity, languages, Web feedback availability, and other characteristics, see Morical, 1999.)

Alimo-Metcalfe and Metcalfe (1998) enumerated the advantages of 360-degree appraisals: (1) The data are more valid since they include observations by subordinates, who are in closer contact with the manager than is the manager’s boss. (2) Subordinates see how the manager reacts in day-to-day situations as well as crises. (3) Subordinates are on the receiving end of managers’ practices. Their ratings may provide greater validity than top-down ratings. (4) The subordinates, peers, and bosses observe and rate different aspects of the manager’s behavior. (5) The reliability and fairness and the manager’s acceptance of the data are greater. (6) With peer ratings, the organization shows its interest in horizontal relations.

The Metcalfes collected 360-degree data at a university management development program for British National Health Service middle managers (212 males, 263 females), their bosses, two to three peers each, and two to three subordinates each. The data included ratings of 41 competencies. Up to nine factors emerged in the different subsets of raters, but the factor performance management was usually the first and largest factor to emerge in the sets of women and men from the different sources analyzed. Other factors included presentation/communication skills, entrepreneurial innovation, and analytical ability. The variance in ratings accounted for by the six to nine factors per set ranged from 53% to 61%. There appeared to be more similarities than differences between the sexes and the sources of the appraisals.

Scullen, Mount, and Sytsma (1996) selected 2,297 managers from 22,431 in a Personnel Decisions data set covering a variety of firms and industries. Along with self-ratings, each manager had been rated by two others at organizational levels above them, at the same level, and below them. The ratings were based on the Management Skills Profile (MSP) of 18 skills measured by 122 items. According to a principal-components factor analysis, responses to the items formed three factors: (1) interpersonal skills (listening and human relations); (2) administrative skills (delegating, personal organization, and time management); and (3) planning and general supervisory skills (occupational and technical knowledge, problem analysis, and financial and quantitative skills). As expected, mean self-ratings were higher than those of any other type of rater for interpersonal skills but not for the other two factors. Reliabilities of the three factored scales ranged for superiors from .84 to .87; for peers, .82 to .86; and for subordinates, .76 to .82. Agreement between pairs of superior raters ranged from .41 to .45; for pairs of peers, .31 to .35; and for pairs of subordinates, .31 to .36. Correlations were as follows among the types of raters: superiors-peers, .34; superiors-subordinates, .26; superiors-self, .17; subordinates-peers, .27; peers-self, .19; and subordinates-self, .34. These results indicated that there was modest agreement between and within the different levels of raters. Each rater was providing a somewhat different snapshot of the ratee. No rater level was redundant. A composite based on 360-degree ratings was needed to provide a fuller picture of the ratee’s performance. Considerable training was needed to increase the correlations between raters at the same level. As was usually found across organizational levels, correlations were higher between peers and superiors and lower between self and others (Harris & Schaubroeck, 1988; Erisman & Reilly, 1996).

Conway and Huffcutt (1997) calculated for 319 managers from a Fortune 500 company how much unique variance would be found in the 360-degree ratings from different organizational levels and a customer assessment. They determined how much each source contributed to a multiple correlation when combined with the other sources. The results confirmed that each source was contributing mainly unique or nonredundant variance.

Multisource ratings are essential to provide a complete appraisal for developmental purposes, as each source of ratings comes from a different perspective of the same ratee. At the same time, although rating-point interpretations will differ among different groups of raters, the construct validity of most of the dimensions rated by different groups of raters holds up well on such 360-degree instruments as Benchmarks (Wise, 1998). An average predictive validity of .30 was obtained for the 360-degree ratings of 62 to 65 general managers of district retail stores. The ratings predicted their district discount retail stores’ sales-to-goals ratio for four business quarters (Healey & Rose, 2003).

Judgments Based on Organizational and Personnel Procedures

Personnel procedures upon which judgments are based include interviews, projective testing, board meetings, and references and recommendations.

Judgments Based on Interviews. The standardized interview from which the interviewer forms judgments about the assessee ordinarily covers the candidate’s family, education, and work history. Also often discussed are the candidate’s expressed objectives, strong and weak points, hopes and fears, social values, interests, attitudes toward the organization, and attitudes toward interpersonal relationships. The accuracy of these self-descriptions is higher for the information that the candidate believes can be independently corroborated (Cascio, 1975).

The interviews vary widely in how they are organized in the time devoted to them, in the probing skills of the interviewer, in the extent to which the interviewer is trained to provide a similar stimulating situation for each candidate, and so on. A deliberate effort may be made to place stress on the candidate to observe his or her ability to cope. Admiral Hyman Rickover conducted more than 20,000 such stressful interviews in the process of choosing personnel for the nuclear surface ship and submarine program. However, there appeared to be little standardization in what he did. Furthermore, anecdotal evidence suggests that what he did do contributed little to his accuracy in predicting the subsequent performance of officers in the nuclear fleet (Polmar & Allen, 1981).

Early reviews of the validity of interviews for predicting performance were not supportive (Wagner, 1949). Upward of 80 studies led Mayfield (1964) to conclude that mainly intelligence could be satisfactorily predicted from an interview. However, subsequent improvements in the interview process, especially when combined with other personnel procedures, have revealed that interviews, properly designed, can be valid and useful. Wiesner and Cronshaw (1988) completed a meta-analysis of 150 reported efforts to validate the interview. Structured interviews produced validity coefficients twice as high, on average, as unstructured interviews. In the structured interview, the interview proceeded in an orderly manner and was programmed with specific areas to probe, and specific questions to be answered and recorded.

Huse (1962) compared the ratings made by interviewers of 107 managers in 37 firms with the ratings of the same managers by the managers’ superiors. Consistent with Mayfield’s findings, the interviewers’ and superiors’ ratings of the intellectual capacity of the managers correlated .26. But they also correlated on various other aspects: leadership, .38; creativeness, .27; and overall effectiveness, .16. When the interviewers’ leadership judgments were combined with the results of projective and objective tests, the multiple prediction of the superiors’ ratings of the managers’ leadership reached .44. Likewise, 20 experienced supervisors who were rated as highly effective by their superiors were matched by Glaser, Schwartz, and Flanagan (1958) with 20 less effective supervisors of similar experience, age, and scores on a paper-and-pencil test. Interview judgments combined with the test scores correlated .34 with the discrimination between the more effective and less effective supervisors. The results of the interview added to the accuracy of the discrimination. Similarly, Howard and Bray (1988) reported that interviews with assessment center candidates, assessed 20 years earlier, could detect a number of attributes of consequence to the managers’ success at AT&T. They found a correlation of .30 between the interviewers’ judgments of the interviewees’ range of interests and the managerial level attained by the assessees 20 years later. The interviewers’ judgments of the assessees’ need for advancement correlated .44 with the level attained by the assessees 20 years later.3

Careful attention to the job requirements of the position for which candidates are being considered and the use of multiple trained interviewers appear to make a difference in the validity of the interview. Russell (1978) arranged for five trained interviewers to assess 66 senior manager candidates in a Fortune 500 firm for positions as general managers. The candidates were assessed on nine dimensions generated by a focus group to describe the task and process responsibilities of the desired positions. Accomplishments, disappointments, and developmental activities were discussed in the interviews. The interviewers prepared a report on the candidates for discussion at a meeting to reach consensual decisions about the candidates’ readiness for promotion to the positions and the candidates’ developmental needs. The consensual judgments of understanding, analyzing, and setting direction for a business correlated .32 with the bonuses that were awarded, .26 with the quality of the candidates’ relations with customers and other outsiders, and .23 with staffing performance. The pooled interview judgments of organizational acumen (understanding of the corporate environment to achieve both individual and unit objectives) also correlated .36 with superiors’ appraisals of the candidates’ nonfiscal performance. In another vein, Herriot and Rothwell (1983) found that the judgments of recruiters, based on their interviews with graduating students, added to the accuracy of predictions of the applicants’ suitability above what was obtained from the applicants’ résumés alone.

Other studies pointed to what was important in the interviews. For instance, Rasmussen (1984) observed that credentials listed on résumés and verbal behavior had more influence than nonverbal behavior on judgments of qualifications. Singer and Bruhns (1991) found, in an experiment comparing job selection interview scenarios, that students’ decisions gave more weight to academic qualifications, while managers emphasized work experience in their choices. Still other studies pointed to ways of improving the assessment of judgments from interviews. For example, Dipboye, Fontenelle, and Garner (1984) found that reading the candidate’s application before the interview increased the amount of correct information collected during the interview. Huffcutt, Weekley, Wiesner, et al. (2001) showed that responses to interview questions about the actual past behavior of interviewees were more predictive of subsequent job performance than questions about how the candidate would respond to a hypothetical job situation. Having the interviewee describe his or her past behavior generated more information about the interviewee. Long-term memory, needed to select alternatives to a hypothetical problem, was less informative.

Virtual Interviews. Straus, Miles, and Levesque (2001) arranged for 59 MBA students to serve as simulated job applicants for interviews by video, by telephone, and face-to-face (FTF) with interviewers. Interviewers assessed “applicants” more favorably by telephone than FTF, particularly if the applicants were less physically attractive. Interviewers had more difficulty regulating and understanding discussions by video than FTF, and applicants were less favorable to video interviewing.

Career Path Appreciation Interviews. Zaccaro (1996) summarized research studies about these two-hour sessions to assess an interviewee’s conceptual capacity. The scores predicted insight, self-confidence, creativity, intuitive thinking, innovative-adaptive thinking, breadth of perspective, skill in strategic thinking, ability to handle ambiguity, ability to work simultaneously on several projects, military officer potential, and attained level of management.

Objective Tests. We have described aptitude, attitude, and personality tests and behavioral inventories of individual differences that have been found predictive of current and future leadership performance. With good judgment, these measures can be used for assessment against normative standards. Many objective tests use special keys to forecast leadership potential. Some generate scores that can be combined to optimize predictions of the performance of future leaders and managers. These methods will be discussed later.

Projective Tests. Responses to projective tests can also form the basis of valid judgments by trained clinicians. The relationship of Miner’s sentence completion test to managerial performance was noted in Chapter 6. Huse (1962), for example, found that the results of projective tests correlated with the ratings of 107 managers by their superiors as follows: persuasiveness, .33; leadership, .26; overall effectiveness, .21; social skills, .18; planning, .18; creativeness, .17; intellectual capacity, .13; and motivation and energy, .03.

Judgments by Boards. It is centuries-old military and naval practice to have candidates appear before boards of officers to present their credentials and be interviewed, after which the combined judgments of the board members are used to decide whether to accept or reject the candidates for selection or promotion. Such boards are also used to screen teachers and managers in civil service systems. Selection committees are the rule rather than the exception for selecting faculty and administrators in colleges and universities. However, some of the processes that affect the decisions of such boards and committees are just beginning to be known.4 For purposes of simulation, Day and Sessa (2001) created videotapes of four candidates differing in “hard” and “soft” skills. Hard skills are the concrete task-oriented skills contributing to handling the core technology of the organization. Soft skills are relations-oriented skills, leadership, and teamwork, as judged by inference. The candidates were finalists for president of a division of a medium-sized glass company. The candidates varied in negative attributions around an average of 21.5%, for leadership styles, competencies, decisiveness, vision, and experience. The tapes were viewed individually by 819 senior executives, who the next day were formed into 145 selection committees to achieve a consensus ranking of the four candidates during a 45-minute discussion. The committees significantly underreported candidates’ strengths and overreported weaknesses. They reported more hard strengths and more soft weaknesses. Among key words for explanations of support for candidates were: teamwork (70%), leadership (66%), international (65%), marketing (63%), visionary (56%), technological (44%), risk taking (32%), and interpersonal skills (23%). The more the respondents’ explanations used these prototypical key words, the lower was the correlation with the average ranking of the candidates (r = .31).

Judgments from References, Recommendations, and Search Firms. The judgments of former superiors, colleagues, and acquaintances are routinely and almost universally used. Yet little has been published on their validity as predictors of success as leaders and managers. Illustrative of what is possible, however, was G. W. McLaughlin’s (1971) predictions of the first year of success of cadets at West Point based on ratings about them by their high school teachers and coaches. Ratings of charisma (personal magnetism, bearing, and appearance) and situational behavior (moral and ethical values, cooperation and teamwork, common sense and judgment) were the best predictors of the cadets’ leadership and followership performance during their first year at the academy. Athletic coaches and mathematics teachers provided the most valid ratings. References and recommendations about candidates from outside an organization have become more suspect and invalid due to the threat of lawsuits as sources increasingly avoid making accurate negative appraisals; recommenders have become legally responsible for their information.

Cox (1986) surveyed more than a thousand executives from 13 U.S. service, consumer, and business corporations. They appeared somewhat more favorable to executives nominated by a search firm than by their own corporation’s personnel department.

Online versus Paper-and-Pencil Questionnaires. Traditional paper-and-pencil formats are being replaced by questionnaires sent by e-mail, via the World Wide Web, or via intranets, with responses collected from raters in the same way. Large firms with facilities at different locations find the Web approach more efficient in a number of ways. In comparison to paper and pencil and regular mail, the Web reaches more respondents quickly, and response rates are increased, while costs, missing data, and turnaround time are reduced (Yost & Homer, 1998; Frame & Beatty, 2000). Of 12,111 PepsiCo managers in 40 countries who rated 4,063 managers, 69.4% chose to complete the survey online; 30.6% chose mailing back paper op-scan forms. Of those choosing the Web, 87% actually completed the survey, while 40% of those choosing paper did not. There was little difference in the overall outcomes, although the Web users skipped more items and paper users marked “Don’t know” more frequently. Online users provided more comments on developmental needs than did paper users (Church, 2002). Richman, Kiesler, Weisband, et al. (1999) completed a meta-analysis that showed hardly any difference between Web and paper in the effect of social desirability. After controlling for rater and ratee characteristics, Smither and Walker (2002) obtained no difference in results from paper and intranet arrangements.

Donovan, Drasgow, and Probst (2000) found that computerization did not affect item or test functioning. Stanton (1998) reported similar results for several tests of fairness. The Web is an obvious facilitator of 360-degree ratings, multiple raters, and feedback for development. Yet despite the advantages of using the Web for development, it also has problems. Everyone involved needs to have access to a computer. Addresses may be wrong. Poor page design and format can frustrate raters. Security must be provided so that only the correct raters are able to enter the system and ensure that the correct ratees are included. And results need to be transmitted to those with a need to know, for development or personnel decision purposes (Summers & Fleenor, 1998).

Measurement Equivalence. Measurements are equivalent among multiple raters rating the same ratees if they produce the same confirmatory factor patterns and factor loadings (Cheung, 1999). Web and paper measures have equal scale reliabilities and equal construct validities (King & Miles, 1995). Equivalence was found for Web and paper surveys across four of five countries (Spera & Moye, 2001). Smither and Walker (2003) found at first that the upward ratings of 789 employees who used an intranet rather than paper and pencil were more favorable, but equivalence between raters appeared when the results were controlled for rater and ratee characteristics.

Lack of equivalence between sources of ratings may occur for many reasons: (1) differences in raters’ biases (Harris & Schaubroeck, 1988); (2) self-serving bias and a felt need for self-enhancement (Farh & Dobbins, 1989a); (3) opportunities to observe different behavior (Landy & Farr, 1980; Murphy & Cleveland, 1995); (4) differences in concerns and social comparison information at different organizational levels (Farh & Dobbins, 1989b); (5) differences in understanding the scale items; (6) ratees’ different behavior around different raters; (7) different experiences or training with the scales or format (Diefendorff & Silverman, 2001); (8) different models of effectiveness (Facteau & Smith, 2000); (9) the importance attributed to some roles rather than to others (Tsui & Ohlott, 1988); (10) the inclusion, importance, and weighting of some dimensions rather than others (Campbell & Lee, 1988; Facteau & Smith, 2000); and (11) the manager’s need to meet the expectations of different ratees (Tsui, 1984).

Repertory Grid. This method of appraisal was developed by the American clinical psychologist George Kelly in 1955. Since 1980, in Britain, it has become a general approach to appraising team members by comparing them on elements designated by the rater, each paired with the member judged to be most opposite to the first. A grid can be formed showing the pattern for each ratee on each element. There are many variations of the method that can be used to appraise individuals, groups, and systems (Easterby-Smith, Thorpe, & Holman, 1996).
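As a minimal sketch, a repertory grid can be represented as ratings of team members on rater-elicited bipolar constructs. The member names, the 1-to-5 scale, and the class design below are illustrative assumptions, not Kelly's clinical procedure:

```python
# Minimal sketch of a repertory grid as a data structure. The team-member
# names, the 1-5 rating scale, and the class design are illustrative
# assumptions; Kelly's procedure for eliciting constructs is not modeled.

class RepertoryGrid:
    def __init__(self, members, constructs):
        self.members = members            # ratees, e.g., team members
        self.constructs = constructs      # list of (pole, opposite_pole) pairs
        self.ratings = {}                 # (member, construct index) -> 1..5

    def rate(self, member, construct_index, score):
        """Record where a member falls between a construct's two poles."""
        if not 1 <= score <= 5:
            raise ValueError("scores run from 1 (first pole) to 5 (opposite pole)")
        self.ratings[(member, construct_index)] = score

    def pattern(self, member):
        """The member's profile across all constructs (None where unrated)."""
        return [self.ratings.get((member, i)) for i in range(len(self.constructs))]
```

Comparing two members' patterns construct by construct reproduces the grid display described above.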

Classification and Narrative Description. Here the information obtained mechanically from tests and measures is combined into a display, summarized into a categorical type by its pattern, or reported in a narrative. From these displays, categorizations or narratives—judgments about the potential or performance of leaders—can be made. For instance, the results of the Myers-Briggs test are arranged to sort examinees into various combinations of sensors, feelers, thinkers, and judges; then the different types are related to the managers’ and leaders’ performance (McCaulley, 1989).

The individual examinee’s capacity, interest, and personality scores are typically categorized, profiled, or converted into narratives. The scores are shown in contrast to norms for the individual’s class, organization, occupation, and so on. Then judgments are made based on the profiles. For example, Dicken and Black (1965) administered a test battery to candidates for promotion. The battery included the Otis Quick Scoring Mental Ability Test, the Strong Vocational Interest Blank, the Minnesota Multiphasic Personality Inventory, and other measures of aptitude and knowledge. Narrative reports of about 500 words were written for each candidate based on the results. After reading the reports, four psychologists rated each candidate on effective intelligence, personal soundness, drive and ambition, leadership and dominance, likableness, responsibility and conscientiousness, ability to cooperate, and overall potential. Three to seven years later, officials from the two firms in which the examinees worked rated them on the same variables. The median correlations between the psychologists’ and the officials’ follow-up ratings were .38 and .33 in the two firms.

In similar fashion, Albrecht, Glaser, and Marks (1964) had three psychologists rank the potential of 31 prospective district marketing managers in budgeting effectiveness, sales performance, interpersonal relationships, and overall. Their rankings were clinical judgments based on combining objective and projective test results, as well as results from an interview and a personal history form. Although the test scores alone correlated .20, on average, with subsequent success as a manager, the psychologists’ predictions, based on the combined results of tests, interviews, and personal history, correlated between .43 and .58 with the managers’ subsequent job performance. When tests, interviews, and personal history are combined in this way with individual and small group exercises to make integrated judgments, the total illustrates what can be accomplished in an assessment center.

Mechanical Methods of Combining Assessments

Objective tests provide samples or signs of individual differences in cognition and behavior related to leadership that aim to predict leaders’ subsequent performance. Responses to the test items are unit-weighted rather than differentially weighted when they are totaled to form a score, since the two approaches yield virtually identical results when more than four or five items are summed (Gulliksen, 1950). Keys are developed to score those items that correlate with emergent, successful, or effective leadership.

Special Keys. It has often been possible to develop a special cross-validated key for the Strong Campbell Vocational Interest Blank to discriminate among those who subsequently achieve more success as leaders in their organizations. The inventory is examined item by item (Laurent, 1968). This takes advantage of the extent to which values and interests are important to success as a leader. For example, Harrell and Harrell (1975) developed such a key with the Strong Campbell for predicting high- and low-earning managers among MBAs from Stanford University five years after graduation. They also keyed responses to the Guilford-Zimmerman Personality Inventory and a survey of personal history in the same way. The total high-earners key predicted the earnings of the graduates at 5, 10, 15, and 20 years after graduation; the correlations were .52, .39, .33, and .44, respectively (Harrell & Harrell, 1984). For 443 Exxon managers, a key was developed from answers to the Guilford-Zimmerman Temperament Survey that correlated .31 with their success, as measured by their salaries adjusted for their age, education, seniority, and function. The key correlated .23 with their effectiveness as ranked by their superiors (Laurent, 1968).

Goodstein and Schrader (1963) developed a managerial potential key using the items of the California Psychological Inventory (CPI), a collection of personality items. The items included in the key that were found valid in differentiating managers from nonmanagers were consistent with much of what was said in earlier chapters about the personality requirements for successful leadership. The items were about masculinity, need for achievement, dominance, status, self-acceptance, and tolerance. In support, Rawls and Rawls (1968) compared 30 high-rated executives with 30 low-rated utilities executives using the CPI. High-rated executives scored higher than low-rated executives in dominance, status, sociability, and self-acceptance. Subsequently, the 206 items of the key were reduced to 34 by Gough (1984) in a study evaluating the items against measures of the success of military officers and bank managers. The reduced management potential key correlated .88 and .89 with Goodstein and Schrader’s much longer version. The most positive items in the revised management potential scale were also consistent with the discussions in earlier chapters. They dealt with optimism about the future, ability to create a good impression, interpersonal effectiveness, self-confidence, relative freedom from instability and conflict, and realism in thinking and social behavior. Gough (1984) also reported considerable consistency between the keyed management potential scores and descriptions of the respondents by their fellow college students, by their spouses, and by assessment staffs.

Scored Application and Biodata Blanks. Responses to standardized questions on application blanks and biographical information blanks can be similarly keyed to forecast leadership and management performance. The blanks may focus on personal characteristics, events, and experiences at work, school, and home. Kirkpatrick (1960, 1966) constructed such a key from an 11-page biographical questionnaire. The key was constructed to sort out above-average, average, and below-average chamber of commerce executives. A five-year follow-up study demonstrated the ability of the key to discriminate 80 high-salary from 81 low-salary managers matched in age and the size of their employing organizations. Along with their higher salaries, taken as evidence of greater success, the higher-salaried executives were more likely to place a heavy emphasis on their careers, participate actively in community affairs, and enjoy excellent health. Again, the items that formed the key and discriminated more successful from less successful chamber of commerce executives were consistent with the general findings about the antecedents and concomitants of successful leadership discussed earlier. According to the key, the more successful chamber of commerce executives: (1) came from a middle-class background; (2) had spent a happy childhood in a stable family; (3) had received a good education; (4) had engaged in many extracurricular activities in high school and college; (5) had held leadership positions in many of the organizations to which they belonged; and (6) displayed communication skills in debating, dramatics, and editorial work.

A similar type of key, developed for scoring the biographical information blanks of 443 Exxon managers, correlated above .50 with their success and above .30 with their ranked effectiveness (Laurent, 1968). Drakeley and Herriot (1988) used data from 41 scorable items of a biographical inventory completed by 420 officers of the British Royal Navy and then cross-validated on a further sample of 282 officers. Three separate keys could be constructed to predict ratings for professional performance, leadership, and commitment. When cross-validated, the professional performance key correlated .50 with appraised professional performance; the leadership key correlated .20 with appraised leadership; and the commitment key correlated .24 with remaining in the program. Russell and Domm (2007) found four factors for 748 cases in responses to biodata questions: (1) problems and negative toning (example: “How often have you found yourself making assumptions about other people only to find out you were very wrong?”); (2) dealing with groups or information (example: “How often have you been part of a team or group that developed team spirit?”); (3) focused work efforts (example: “How often have you felt involved and constantly ‘on the go’ in your job?”); and (4) rules and regulations (example: “Have you ever had to participate on a jury that actually deliberated some decision in a court of law?”).

A promising biodata blank variant is one that emphasizes past accomplishments. The Accomplishment Record developed and validated by Hough (1984) correlated with subsequent job success (Hough, Keyes, & Dunnette, 1983). Biographical information—generated from retrospective life history essays primarily about past accomplishments and completed by first-year Annapolis midshipmen—was used to develop a biodata questionnaire of life history events. In constructing this questionnaire, Kuhnert and Russell (1990) were guided by Kegan’s (1982) six stages of constructive development, from the reflexivity and impulsivity of immaturity to interpersonal maturity and interindividuality. They suggested that the answers to the questions could help to explain the meanings that leaders derive from their life experiences. A total of 917 midshipmen who were entering Annapolis completed a scored and keyed biodata questionnaire. The biodata scales predicted, in cross-validation, subsequent military performance, academic performance, and peer ratings of leadership at the U.S. Naval Academy (Russell, Mattson, Devlin, et al., 1986; Stricker, 1987). Russell (1990) assessed previous life experiences of 66 candidates for a position of division general manager in a Fortune 50 multinational firm and suggested that the assessments could contribute to senior managers’ performance.

Data from Small-Group Exercises. Small-group exercises also generate objective leadership results that can be mechanically correlated with the future performance of leaders or managers. For example, in Bass, Burger, et al.’s (1979)5 large-scale study of managers from 12 national clusters, Exercise Life Goals was administered to 3,082 managers who were above or below the median in their own rate of advancement. In this exercise, the managers had to rank the importance to themselves of 11 life goals. The high flyers, managers who had advanced more rapidly in their careers, attached more importance to leadership, expertise, prestige, and duty, whereas the slower-climbing managers favored self-realization, affection, security, and pleasure.

In a small-group budgeting exercise, the high flyers in all 12 nations studied were more cautious about allocating funds for safety, labor relations, management development, research and development, and controlling water pollution. Nevertheless, in another small-group exercise that asked for recommendations about salary increases, the high flyers granted larger increases than the slow-track managers. In Exercise Supervise, the high flyers gave more weight to the importance of generosity, honesty, and fair-mindedness.

In other small-group exercises, fast-track managers, compared with slow-track managers, rated themselves higher in self-understanding and expressed a stronger preference for being task-oriented. They also saw themselves as interpersonally competent. In addition, the high flyers revealed more objective signs of effective intelligence, objectivity, proactivity, and long-term views. They were more accurate but slower in a two-way communication exercise. Many other objective signs from the exercises that differentiated fast-track from slow-track managers were country-specific.

Rank and Salary Information. To assess managers’ success, Farrow, Valenzi, and Bass (1982) used salary information to compare 159 managers’ predicted and actual salary grades, arguing that salaries reflected managerial merit. The multiple regression equation for expected salary grade was: .56 × (starting salary grade level) + 1.24 × (years of service) + .041 × (department size) − .93 × (male) + .472 × (educational level). Overachievement was indicated when a manager’s actual salary grade was higher than predicted; underachievement, when it was lower.
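Applied to a hypothetical manager, the equation above can be sketched as follows. The coefficients come from the text; the 0/1 coding for "male" and the parameter names are assumptions made for illustration:

```python
# Sketch of the Farrow, Valenzi, and Bass (1982) salary-grade equation.
# Coefficients are from the text; the 0/1 coding for "male" and the
# parameter names are illustrative assumptions.

def predicted_salary_grade(start_grade, years_service, dept_size, male, education):
    return (0.56 * start_grade
            + 1.24 * years_service
            + 0.041 * dept_size
            - 0.93 * male
            + 0.472 * education)

def achievement(actual_grade, **predictors):
    """Overachiever if the actual grade exceeds the predicted grade; underachiever if below."""
    expected = predicted_salary_grade(**predictors)
    if actual_grade > expected:
        return "overachiever"
    if actual_grade < expected:
        return "underachiever"
    return "as predicted"
```

A manager whose actual grade sits above the grade the equation predicts from starting grade, service, department size, sex, and education counts as an overachiever.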

Multiple Cutoffs, Unit Weighting, and Optimal Weighting

When multiple cutoffs are employed, only applicants or candidates whose scores on selected tests are above a certain level are accepted; all other applicants are rejected. When scores are unit-weighted, they are merely added together into a combined prediction; the scores need to be standardized, with means of 0 and standard deviations of 1, to give them equal weight when they are summed to form a single prediction. Such standardized scores can be differentially weighted on the basis of a multiple regression equation, or can be accepted or rejected for inclusion, as well as differentially weighted, by a stepwise regression process to optimize the prediction of effectiveness.
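The contrast between multiple cutoffs and unit weighting can be sketched as follows; the measure names, cutoff values, and "pass k of n" acceptance rule are illustrative assumptions:

```python
# Sketch of two of the combination strategies described above. The measure
# names, cutoff values, and "pass k of n" rule are illustrative assumptions.
from statistics import mean, stdev

def standardize(raw_scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(raw_scores), stdev(raw_scores)
    return [(x - m) / s for x in raw_scores]

def unit_weighted(z_scores):
    """Unit weighting: standardized scores are simply summed."""
    return sum(z_scores)

def passes_cutoffs(candidate, cutoffs, required):
    """Multiple cutoffs: accept only if enough scores clear their thresholds."""
    cleared = sum(1 for measure, cut in cutoffs.items() if candidate[measure] >= cut)
    return cleared >= required
```

The Mahoney, Sorenson, Jerdee, and Nash procedure described next amounts to requiring five of six cluster scores to clear their cutoffs.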

Multiple Cutoffs. Mahoney, Sorenson, Jerdee, and Nash (1963) administered the Wonderlic Intelligence Test, an empathy test, the Strong Vocational Interest Blank, the CPI, and a personal history questionnaire to 468 managers from 13 Minnesota-based firms. Panels of executives ranked the examinees on global, or overall, effectiveness in carrying out their functions as managers. For half the sample, the ability of each of the 98 measures from the test battery was analyzed to discriminate between the top and bottom thirds of the managers in ranked effectiveness. These measures were grouped into six discriminating clusters. One cluster that separated effective from ineffective managers rated them on whether or not they had the interests of a sales manager, a purchasing agent, and the president of a manufacturing firm. The second cluster dealt with their lack of interest in dentistry, printing, and farming. A high score on the intelligence test and self-rated dominance on the CPI formed the bases of the third and fourth clusters. One’s educational level and one’s spouse’s educational level and type of work were in the fifth cluster. Engagement in many sports and hobbies, offices held, and high school memberships were the sixth cluster. An optimum cutoff was determined for deciding whether a manager “passed” or “failed” each cluster. To be predicted “acceptable,” a manager had to score above the cutoff point for five of the six clusters. For example, an acceptable manager was one scoring above the cutoffs in the interests of a sales manager, intelligence, dominance, education, and extracurricular activities.

The other half of the sample was held out for cross-validation. The same multiple cutoffs were used; 45.5% of the managers in the cross-validation sample were accepted and 54.5% were rejected. Of those who were accepted, 62% had been ranked high in effectiveness on the job and 29% had been ranked low. But of those who were rejected, only 38% had been ranked high in effectiveness while 71% had been ranked low.

Unit Weighting. Campbell, Dunnette, Lawler, et al. (1970) reviewed studies of the value of various measures that are usually included in test batteries, such as intelligence, interests, personality, and biographical information, for predicting leadership and management potential in designated locations. Salaries of supervisors and executives, adjusted for length of service and age, were predicted at Minneapolis Gas with correlations above .20 from 15 to 18 scale scores (Jurgensen, 1966). The promotability of managers at Jewel Tea (Meyer, 1963) and of sales managers at the American Oil Company (Albright, 1966) could similarly be predicted. But mixed results were reported by Tenopyr and Ruch (1965) for various technical managers at North American Aviation.

Selover (1982) combined into a single advancement potential score five measures obtained from newly hired personnel at Prudential Insurance that had been found to predict the rate of advancement four to five years later. Correlations ranging from .35 to .40 were found between the advancement potential scores, similarly derived, and the rate of advancement in three additional samples of newly hired employees. Ghiselli (1971) identified 13 traits of managerial talent and developed a questionnaire, the Self Description Inventory, to measure them. He then related the scores on the 13 traits to supervisors’ ratings of effectiveness. Ten of the 13 scores related significantly to the criteria. These predictive scores dealt with supervisory ability, the need for occupational status, intelligence, self-actualization, self-assurance, decisiveness, and lack of the need for security. These scores could then be combined into a single score.

Basically negative conclusions about the value of tests and biographical information for predicting general success as a manager were reached by Brenner (1963) at Lockheed Aircraft. But with managers in the same company, Flanagan and Krug (1964) showed that it was necessary to differentiate between whether one was trying to predict success and effectiveness in supervision, creativity, organization, research, engineering, or sales. For each, a differently weighted optimization was required.

Multiple Regression: Optimal Weighting of Predictions

An illustration of the use of optimal weighting was provided by Atwater and Yammarino (1997), who obtained a stepwise multiple regression coefficient of .53 when predicting the perceived transformational leadership of squad leaders of USNA plebes in summer camp from eight personal measures. The highest standardized beta weights were .31 for engagement in varsity sports, −.21 for MBTI thinking/feeling, and .22 for intelligence. The use of stepwise multiple regression for predicting performance has become common practice, as research is directed not only at optimal weighting of the predictors but also at optimizing the number of predictors.

Sears Program. The Sears program for predicting managerial performance extended over a period of 35 years. The test battery that evolved tapped intellectual abilities, personality, values, and interests. Its results predicted the promotability of executives and the satisfaction of subordinates with operating efficiency, relationships, and their own personal satisfactions. Multiple regression was used to weight the test scores optimally to minimize the errors of the predictions. Illustrative are the results for 48 store managers and 42 higher-level managers, shown in Table 35.1.

The Sears battery of psychological tests for predicting executives’ performance included measures to tap flexible intellectual functioning by using the Sears adaptation of the California F Scale of authoritarianism. The battery included a number of tests constructed to measure creative ingenuity and ability to initiate change, tests for preference for complexity, a scored biographical history form, an adjective checklist, a verbal fluency test, and a self-rating of role definition.

To assess administrative skill and decision making, Sears devised a special in-basket test, along with two leaderless group problem-solving situations. In addition, three problem-solving simulations were designed. The Guilford Martin Personality Inventory was specially modified to measure emotional stability and competitive drive. The multiple correlation of the combination of predictor measures with various criteria, such as superiors’ ratings, ranged from .49 to above .80.

The Thurstone Test of Mental Ability, the Guilford Martin Personality Inventory, the Allport Vernon Study of Values, and the Kuder Interest Inventory were used to

Table 35.1 Multiple Correlations between Scores for the Morale of Sears Employees and Scores on the Test Battery for Executives Who Were Responsible for Supervising the Employees

Areas of Employee Morale                       Multiple Correlations with Executive Test Battery Scores^a
                                               48 Store Managers     42 Executives
I. Operating Efficiency
   a. Effectiveness of administration                 .77                  .51
   b. Technical competence of supervision             .74                  .43
   c. Adequacy of communication                       .59                  .61
II. Personal Relations
   a. Supervisor-employee relations                   .68                  .54
   b. Confidence in management                        .59                  .52
III. Individual Satisfactions
   a. Status and recognition                          .69                  .62
   b. Identification with the company                 .51                  .52
IV. Job and Conditions of Work
   a. Working conditions                              .36                  .40
   b. Overall morale score                            .47                  .54

^a All correlations are significant at the 1 or 5 percent level of confidence.

SOURCE: Adapted from Bentz (1968, p. 71).

predict success as a Sears purchasing manager. Multiple correlations ranging from .32 to .57 were obtained against various criteria of job performance and management potential. A similar battery for predicting the long-term performance of managers of Sears stores yielded measures that correlated from .37 to .61 when optimally combined (Bentz, 1983, 1987, 1992).

EIMP. Exxon’s Early Identification of Management Potential (EIMP) is another prominent effort to use multiple regression to optimize the prediction of managers’ success from a battery of predictors. The combination of tests includes the Miller Analogies Test, the Strong Vocational Interest Inventory, the Guilford-Zimmerman Temperament Inventory, and a specially keyed and scored biographical information blank (Laurent, 1968). In the United States, the multiple predictions correlated between .33 and .64 with managers’ success (salary adjusted for age, seniority, and function). In Norway, Denmark, and the Netherlands, the comparable multiple correlations ranged from .27 to .65 (Laurent, 1970). A multiple correlation of .70 was obtained in predicting performance appraisals from the EIMP battery.

Similar correlations were obtained in predicting examinees’ subsequent overall managerial effectiveness, according to their rankings by higher-level executives. For predicting managerial effectiveness, the multiple correlations ranged from .37 for managers in oil exploration and production to .66 for managers in traffic and purchasing. The multiple correlations predicting success as a manager (suitably adjusted for education, seniority, and function) ranged from .67 for marketing managers to .82 for employee relations managers. The predictive validity of the EIMP battery appeared to increase steadily with the years between testing and the measurement of the criteria during 3- to 12-year follow-ups. Sparks (1989, 1992) attributed the increase to improvements in the criteria of managerial effectiveness. He inferred that the EIMP measures of reasoning, judgment, and temperament contribute more to predicting performance at higher levels of management.

A derivative of the EIMP battery is the Manager Profile Record (MPR), based on the biographical data, the keyed Guilford Zimmerman Temperament Inventory, and other tested judgmental elements. The MPR makes use of computer scanning, scoring, and output of results. Its measures include: (1) academic and extracurricular achievements, self-worthiness, and social and work orientation; and (2) judging typical management situations either needing solutions or calling for value judgments about staff communication, employee motivation and development, and decision-making style. Different scoring was provided for younger inexperienced candidates and for older, more experienced candidates. The profile indicated the similarity of the candidate to highly successful managers. It could be used to validate predictions specific to an organization. It excluded traditional intelligence and reasoning-ability tests of concern to civil rights agencies. New item analyses and new validations were completed for the MPR against criteria for the level of responsibility attained by more than 15,000 managers in a variety of industries. Rated potential job level attainable correlated .52 with the MPR total scores. The same correlations ranged from .50 for 86 administrative staff to .70 for 90 manufacturing personnel. In subsequent validation studies, rank and/or salary already achieved were correlated similarly with the MPR total scores of 446 managers and supervisors in a steelworks, 310 in a coal company, 2,205 in a baked goods firm, 361 in an urban utility firm, 592 in a national insurance company, 221 in an office equipment manufacturer, 478 in two chemical-processing firms, 416 in a hotel chain, and 4,856 managers and professionals in 14 southwestern banks. The work was repeated for first-level supervisors across a variety of situations, from utilities plants to data-processing offices. 
Validations for 8,765 first-level supervisors yielded correlations ranging from .25 to .37 between the derived Supervisory Profile Record and supervisors’ appraisal on job performance as meeting or exceeding the requirements of the job6 (Richardson, Bellows, Henry, & Company, undated).

A set of multiple regression predictions was developed by Saville (1984) for the Occupational Personality Questionnaire (OPQ) to measure the potential for a variety of different team roles and leader/subordinate types of roles. The output was standardized against a norm group of 527 British managers and professionals. Meaningful correlations were found between supervisors’ appraisals of the performance of 440 British managers and OPQ predictive scores. For instance, a score on persuasiveness, extracted from optimally weighted scoring of the OPQ results, as predicted, correlated .27 with being appraised as having a commercial flair and .28 with being appraised as a good negotiator.

Benchmarks. A product of the Center for Creative Leadership, Benchmarks is a 360-degree appraisal whose multiple scales, in combination, can predict managers’ performance. Managers receive 360-degree ratings that form 16 positive scores, such as resourcefulness, being a quick study, doing whatever it takes, and leading subordinates. Additionally, six negative scores are included, such as difficulty molding a staff, difficulty making strategic transitions, and lack of follow-through. For 336 managers, an optimally weighted multiple regression accounted for 46% of the variance in bosses’ ratings of the managers (McCauley & Lombardo, 1990).

Specifications Equations. If we have the multiple regressions for predicting a criterion of leadership performance from a set of factored ratings or test scores, and for predicting the same criterion from scores based on situational variables, we can develop a specification equation for predicting the leadership criterion from the beta weights of predictors X, Y, Z. The specification equation is given by the sum of the products of every situational beta weight with the standardized rating or test score (means of zero and standard deviations of 1). For each individual case, the specification provides the best possible prediction of the criterion of leadership performance. Cattell (1980) provided specification equations for the Sixteen Personality Factor Questionnaire (16 PF) of 16 traits and various situations and Saville (1984) for occupational profiles.
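As a minimal sketch, a specification equation amounts to an inner product of beta weights and standardized scores. The weights used in the sketch below are illustrative, not Cattell's or Saville's published values:

```python
# Sketch of a specification equation: the predicted leadership criterion is
# the sum of each beta weight times the corresponding standardized score.
# In practice the beta weights would come from the fitted regressions;
# the values used here are illustrative.

def specification_equation(betas, z_scores):
    """Predict the criterion from beta weights and standardized (z) scores."""
    if len(betas) != len(z_scores):
        raise ValueError("one beta weight is required per standardized score")
    return sum(b * z for b, z in zip(betas, z_scores))
```

For each case, applying the equation to that individual's standardized ratings or test scores yields the best available prediction of the leadership criterion.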

Moderators of Judgments and Their Predictive Validity


Moderators operate much as in Einstein’s account of gravity as the warping of space around stars, planets, and moons: Newton’s laws must be moderated to predict more accurately the force of gravitational attraction between such bodies.

Some of the moderating conditions are primarily associated with the rater, some with the ratee, and some with both. Some methods are themselves moderators. Waldman, Yammarino, and Avolio (1990) completed a WABA analysis of 140 managers in a large mining company paired with their immediate bosses. The managers were embedded in 39 work groups. Individual differences were common in the various types of performance and skill ratings the managers received. Their bosses varied in the average ratings they assigned to the managers. The dyadic matches were also of consequence.

Personal Moderating Conditions

Rater’s Training and Competence. Rater training is well known to improve the reliability and validity of ratings and the agreement among raters. A rater of leaders needs perceptive skills, cognitive ability, and reflective thinking to make good judgments. These can be helped by the quality of the cues provided by the rating method and by the research that went into developing the procedure. Global assessments that are based on broad aspects of performance and can be used to make explicit comparisons between ratees provide the greatest differential accuracy, but more priming of the rater may be needed to provide useful diagnostic feedback (Blake & Goffin, 2001).

Biases in Perceptions and Ratings. Both predictions and criteria of success are often subjective managerial ratings and assessments. Expectations bias ratings: when rating others, people tend to confirm rather than disconfirm their expectations (Copeland, 1993). In addition, ratings are biased by the effects of leniency, response set, halo effects, and implicit theories of leadership (Bass & Avolio, 1989). Starbuck and Mezias (1996) reviewed other distortions in management perceptions. For instance, correlations were close to zero between managers’ perceptions of the uncertainties in their business environments and the objective volatility of the environment as measured by financial reports and industry statistics (Tosi, Aldag, & Storey, 1973; Downey, Hellriegel, & Slocum, 1975). Payne and Pugh (1976) found little agreement among managers describing their own organizations. Subsequent studies led Starbuck and Mezias to conclude that senior managers’ average judgments about their firms’ performance could be accurate, but their individual judgments varied greatly (Sutcliffe, 1994). Salam, Cox, and Sims (1997) found that leaders who were rated as challenging the status quo and encouraging their subordinates to act independently were rated lower by their superiors and higher by their subordinates. Ratees saw the performance appraisal and feedback system as fairer when they felt they could voice their opinions about it (Cawley, Keeping, & Levy, 1998; Folger & Cropanzano, 1988). When 208 employees responded to a survey about the voice they had in an appraisal system (“I felt I could have influenced the review discussion”), those impressions correlated .59 with perceived accuracy, .43 with perceived utility, .52 with procedural justice, and .59 with satisfaction with the system (Keeping, Brown, Scott, et al., 2003).

Feedback to leaders and managers of subordinates’ average judgments about performance can be misleading because the averages mask large differences among the subordinates’ judgments. The subordinates may not, for example, be able to discriminate between a manager’s personal idiosyncrasies and the requirements of the manager’s role (Herold, Fields, & Hyatt, 1993).

Raters’ Age. Benchmarks were completed by superiors, peers, subordinates, and self for more than 2,000 focal managers. While superiors’ ratings and the managers’ self-ratings grew more favorable with age, subordinates’ and peers’ ratings did not. Brutus, McCauley, and Fleenor (1996) offered several explanations, but the most plausible one was that managers behave differently toward their superiors than toward others and have a better rapport with them. Also, managers may engage in more impression management with their superiors than with peers and subordinates.

Ratees Choose Raters. When leaders select those who are to rate them, the results are likely to be inflated. For instance, norms for Multifactor Leadership Questionnaire ratings are likely to be about one point higher on the five-point scale of the transformational and contingent reward factor scores when the raters are selected by the ratees (Avolio & Bass, 2004). At the same time, O’Leary, Harkom, et al. (2003) reported moderate positive correlations of .30 to .53 between ratee-selected and nonselected raters of the performance of 96 professional education students.

Confidence in the Ratings. Harper, Maurer, and Tross (1988) examined the 360-degree ratings of 1,852 aggregations of raters evaluating the teamwork, character, learning, vision, and leadership skills of 463 leaders. Raters also indicated their confidence in the accuracy of their ratings. More confident supervisors and more confident self-raters were slightly more likely to connect teamwork and character with leadership skills. More confident peers added vision to the prediction of leadership. Gabris and Ihrke (2001) showed that the job satisfaction of 134 professional county government employees was related to the perceived fairness and validity of the performance appraisals they received.

Moderating Conditions

Ratees’ Early Career Experience. Berlew and Hall (1966) showed that the degree of first-year challenge sensed by managers added to the prediction of their subsequent success. Within an Exxon U.S. subsidiary, Vicino and Bass (1978) showed that the accuracy of the predictions could be substantially enhanced by adding information about what happened subsequently to 140 newly hired managers. Six years after employment, the managers were more successful and effective than predicted by the EIMP battery if they (1) had initially been assigned to a supervisor with influence “upstairs”; (2) had a personality different from that of their first supervisor; (3) were more satisfied with the challenges of their first task; and (4) were under less stress in the two years before their measured success and effectiveness.

Ratees’ Motivation and Needs. Ghiselli (1968a) reported results demonstrating that motivational factors interacted with personality traits that were predictive of managers’ job success. The need for job security was the strongest and the need for power was the weakest moderator in producing interactions between traits and success. Thus, whether managers had a low or a high need for security affected the extent to which their assertiveness correlated with their success as managers. In the same way, whether a score showing a high need for power forecast success as a leader depended on whether it was coupled with a strong sense of responsibility (Winter, 1989).

Ratees’ Seniority. Although Moore (1981) found that it did not matter whether librarians came to their assignments as a first or second career choice, the librarians’ patterns of personality and leadership were closer to norms for the general population than they had been 20 years earlier. Similarly, when Harrell and Harrell (1984) applied stepwise regression equations to ability and personality predictions of the success of MBAs from Stanford University in achieving high salaries 5, 10, 15, and 20 years after graduation, different predictors emerged. For example, high earnings at the end of five years were predicted by orientation to high earnings, but after 20 years, the best combination of predictors of high earnings from the measures obtained 20 years earlier was youth, interest in sales management, a projective measure of the need for power, and a favorably scored background survey.

Age. Baltes and Staudinger (1993) noted that aging into the older adult years modifies the relations between intelligence and leadership effectiveness. On the one hand, there is a decline in the fluid mechanics of the mind. Leadership tasks depending on speed of reaction, processes of sensory information input, visual and motor memory, and processes of discrimination, categorization, and coordination decline with aging. On the other hand, older age brings gains in wisdom, good judgment, and the capacity to advise on uncertain but important aspects of life, based on expert knowledge. Such expert knowledge includes factual knowledge about the fundamental pragmatics of life, the contexts and uncertainties of life and societal change, and the competing relativism of values, life goals, and moral absolutes. While much of the decline in fluid mechanics is heavily influenced by biology, expert wisdom is more a matter of cultural inheritance, cultural innovation, and facilitative experiences. The wise person is open to new facilitative experiences. Nevertheless, Irwin (1998) reported that when 915 members (30% women) of teams in a variety of U.S. firms rated each other on 31 behaviors, and sex, education, ethnicity, and job level were controlled, older members were rated as less effective by their peers. They were also rated less effective as managers. However, those of high status in their own organizations were rated more positively, as were women, especially black women. Peers also rated others as more effective if they had known them longer. According to Ferris, Yates, Gilmore, et al. (1985), age also influenced performance ratings of subordinates.

Sex and Race. The effects of the sex and race of assessment center examinees may be of limited consequence in moderating predictive accuracy. Women do about as well as men when undergoing assessment, according to an analysis by Ritchie and Moses (1983) of the overall scores obtained by 1,097 female managers at AT&T. The assessment center’s predictions for the progress of their careers at the end of seven years were as valid as those for male managers. Similar results for women and for men were reported by Moses and Boehm (1975) and for black women compared to white women (Huck & Bray, 1975). But, subsequently, a meta-analysis (to be discussed later) found that the predictive validity of assessment programs was higher when the percentage of women assessees was higher. A validity based on multiple correlations of the Exxon EIMP battery and managerial success was found of .38 for 67 black males; of .31, for 80 white females; and of .53, for 685 white males (Sparks, 1989).

Coaching of Ratees. In a meta-analysis of 131 controlled studies of feedback, DeNisi and Kluger (2000) found that coaching that accompanied feedback to those working at tasks led to more favorable appraisals of the ratees’ subsequent performance. In a different circumstance, 213 candidates for promotion in city fire and police departments were coached prior to being interviewed. The coached and better-prepared candidates gave more organized responses to the interviewers, who judged their performance in the interview to be better (Maurer, Solomon, Andrews, et al., 2001).

Personality of Ratees. McCarthy & Garavan (2003) found that among 520 managers in a large Irish firm who received 360-degree feedback, belief in procedural justice and lack of cynicism accounted for 51.3% of the variance in the extent to which they accepted the feedback.

Situational Moderating Conditions

Organization. A modifier of regression predictions is organizational purpose. Szilagyi and Schweiger (1984) pointed to the need to keep an organization’s strategies in mind when trying to forecast the patterns of ability, personality, and experience that will best match candidates with the requirements of a job. Hambrick and Mason (1984) argued that an organization’s character and performance and the particular personal attributes and experience of its senior managers will be correlated. The differential weights in multiple regression predictions of success in a rapidly growing organization are likely to be different from those in a mature or declining organization. The weights of predictors of success as a combat officer are likely to be different from the weights for optimizing the prediction of success as a garrison officer.

Feedback Environment. Feedback ratings may be moderated by an organization’s business and human resource strategy. Firms with prospector strategies may place more of a premium on involved, innovative leadership; firms with defender strategies may focus more attention on inducements (Sivasubramaniam & Kroeck, 1995). Employees who see a favorable feedback environment are more likely to use feedback from 360-degree ratings (Steelman & Levy, 2001).

DeNisi and Kluger (2000) concluded, from a meta-analysis of 131 controlled studies of feedback drawn from some 3,000 reports, that in 38% of cases feedback harmed performance rather than improved it. The investigators assumed that attention is limited and is drawn to gaps between feedback and standards; where no gaps were present, little attention was paid, and subsequent behavior was shaped by where the attention fell. Feedback that focused attention on task performance resulted in improvement only if sufficient details for improvement were provided; otherwise, the focus remained on the details of the feedback, not on improving performance. If multisource messages disagreed, more attention was focused on the self-rating. If the feedback, particularly negative feedback, focused on the self and the self-concept rather than the task, performance suffered. Whether a ratee wanted to work on a task or had to work on it affected whether performance improved or declined with feedback. A goal-setting plan needed to accompany the feedback if subsequent performance was to improve. As much as possible, the information provided with feedback should concern improving performance, not the relative performance of others, as the latter can increase anxiety and produce a decline in performance. Feedback may be more harmful than beneficial to people working on complex tasks.

Culture. Given the cultural specificity of some leadership phenomena, as seen in Chapter 33, as one moves from one culture and country to another, it is not surprising that culture moderates the relationship between performance appraisal procedures and effectiveness. When ratees in work units in countries with individualistic cultural norms are similar in personality to their peers, the peers’ ratings predict the promotion of the ratees in work units, but in work units in countries with collectivistic norms, the supervisor’s ratings predict the promotion (Schaubroeck & Lam, 2002).

Milliman, Nason, Lowe, et al. (1995) asked 237 South Koreans, 241 Taiwanese, 223 Japanese, and 144 Americans what purposes were served by performance appraisals. Most were managers, engineers, and professionals. Questionnaire items were about performance appraisal practices, performance appraisal effectiveness, organizational effectiveness, job satisfaction, and five possible purposes: pay, promotion, development, enabling subordinates to express themselves, and documentation of performance. LISREL analyses revealed, as expected, significant cultural differences in the relationships among the variables. In South Korea, only the purposes of documentation, promotion, and development were seen as contributing to appraisal effectiveness, which, in turn, contributed to organizational effectiveness and job satisfaction. In Japan, all five purposes contributed to appraisal effectiveness, which in turn related to organizational effectiveness and job satisfaction. In Taiwan, only documentation, pay, and development were seen to directly affect appraisal effectiveness and, indirectly, organizational effectiveness and job satisfaction. Finally, among Americans, a considerably different path model emerged. Documentation, promotion, and enabling subordinate expression were the chosen purposes. Appraisal effectiveness was positively related to job satisfaction but negatively related to organizational effectiveness.

Methods as Moderators

Criteria of Success. Obviously, how success is measured affects the differential weightings that optimize a multiple regression prediction of success. A majority of studies have depended on subjective evaluations of success as a leader provided by superiors and a higher authority. These evaluations may suffer from various sources of bias by raters, such as the halo effect and leniency. They may be subject to the quirks of memory and different raters’ values.

Objective approaches to measuring success have also varied and, as a consequence, have led to different prediction results. At one extreme, as was noted in Chapter 4, many investigations before 1948 defined success as occupying a position of leadership, that is, holding an office or some position of responsibility. The level reached by a manager in an organization was often seen as an appropriate measure of success. This definition was improved by adjustments. Blake and Mouton (1964) used a managerial achievement quotient to evaluate the progress of managers’ careers in a single organization; the level of the positions was corrected by the managers’ chronological age: younger managers at higher organizational levels were seen as more successful than older managers at lower levels. Hall and Donnell (1979) modified the formula slightly; to generate a comparable index number, they assumed that all managers’ careers unfold in eight-level organizations, regardless of the actual organization. Bass, Burger, et al. (1979) adjusted the quotient by dividing a manager’s current level by the actual number of levels in the organization, then multiplying by the size of the organization. They argued that managers of larger organizations who are at the same place on their respective organizational ladders as managers of smaller organizations are more successful.
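The Bass, Burger, et al. adjustment, as described here, can be sketched as a small function. This is an illustrative reconstruction, not the authors’ published formula: the direction of level numbering (1 = bottom) and the use of head count for organization size are assumptions.

```python
def career_success_index(current_level, total_levels, org_size):
    """Sketch of the Bass, Burger, et al. (1979) adjustment: divide the
    manager's current level by the actual number of levels in the
    organization, then multiply by the size of the organization.
    Assumes level 1 = bottom of the hierarchy and that org_size is
    head count (both assumptions for illustration)."""
    if not 1 <= current_level <= total_levels:
        raise ValueError("current_level must lie within the hierarchy")
    relative_height = current_level / total_levels
    return relative_height * org_size

# A manager halfway up a 10-level, 5,000-person firm scores higher than
# one at the same relative height in a 4-level, 200-person firm.
big = career_success_index(5, 10, 5000)   # 0.5 * 5000 = 2500.0
small = career_success_index(2, 4, 200)   # 0.5 * 200 = 100.0
```

The design choice follows the argument in the text: equal relative height on a taller, larger ladder counts as greater success.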

Another criterion of managers’ success was the amount of salary or compensation earned (Harrell & Harrell, 1975). But the compensation had to be adjusted for age, function, experience, seniority, inflation, and other moderators when used to forecast managerial success (Farrow, Bass, & Valenzi, 1977; Sparks, 1966). When performance quality was the main criterion of success, Deming (1986) argued, performance appraisal programs were detrimental to the quality of the performance because: (1) they held the ratee responsible for mistakes that could be due to faults in the system; (2) when rewards were contingent on appraisals, employees were directed toward short-term targets and quotas; (3) ratees who fell at the lower end of the distribution of the appraisals were discouraged from trying to excel. Ghorpade and Chen (1995) suggested that a performance appraisal system would contribute to performance quality if: (1) the primary purpose of performance appraisal was the improvement of the ratee; (2) everyone affected by the appraisal system was involved in improving it; (3) opportunities for improvement and reduction of systematic errors were examined; (4) the focus of the performance appraisal was on behavior, using input and output for diagnosis and development of the ratee; (5) for each dimension of performance, both task accomplishment and task improvement were considered; (6) absolute rather than relative standards of performance were used.

Single-Source Errors. Single-source bias arises when data about two or more variables are collected from one source or method, such as self-report survey questionnaires (D. T. Campbell & Fiske, 1959). Suppose the same leader provides a self-report of influence, X, and a rating of job satisfaction, Y. The correlation of X and Y will be inflated by single-source variance. Sometimes this cannot be helped, but a correction is needed. The correlation may reflect a generalized response set that forms a general factor in a factor analysis, whose effect can be subtracted; but new errors are then introduced (Kemery & Dunlap, 1996). Within-and-Between Analysis (WABA) may be used to detect and correct same-source variance and error in self-reports (Avolio, Bass, & Yammarino, 1988). Performance appraisal can be overridden altogether in the case of CEOs rated by biased compensation committees, which may award extremely high annual compensation despite continuously failing performance.

Assessment Centers


In the European tradition, the identification of potential leaders tended to make use of inferences based on observations, personality tests, and interviews. The assessment center fit readily with the European approach. As early as 1923, following experience in World War I, boards of psychologists and officers formed judgments of candidates for leadership positions in the German army from observing the candidates in a roundtable discussion and from other tests and methods (Simoneit, 1944). Comparable developments took place in Great Britain (Garforth, 1945). By the 1940s, the British had organized country house retreats for civil service and industrial candidates to meet with assessors. The candidates were then observed by psychologists and a management assessor in formal exercises and in informal interactions. The results provided valid assessments for choosing candidates (Vernon, 1948, 1950).

During World War II, the Office of Strategic Services (1948), the forerunner of the Central Intelligence Agency (CIA), created a similar assessment program for candidates for assignments as special agents and spies, based on the psychological theories of Henry Murray (1938). In addition to the usual testing, the candidates were systematically subjected to cognitive and physical group exercises, along with stress interviews. Decisions were made on the basis of the assessors’ ratings, which were formed from their observations of the candidates’ performance in the various exercises (Murray & MacKinnon, 1946). The interest in assessment programs generated the establishment, in 1949, of a civilian assessment center at the University of California, Berkeley (MacKinnon, 1960); at Standard Oil of Ohio (SOHIO) in 1953 (Finkle & Jones, 1970); at IBM (Kraut, 1972); at General Electric (Meyer, 1972); and at Sears (Bentz, 1971). The modern assessment center approach spread rapidly around the world (Thornton & Byham, 1982). But most prominent was the assessment center program launched in 1956 at AT&T as a long-term research study of the development of the managers’ careers (Bray, Campbell, & Grant, 1974).

As MacKinnon (1975) noted, “assessment center” may refer to a physical facility, such as Berkeley’s Institute of Personality Assessment Research; a standardized assessment program, such as SOHIO’s Formal Assessment of Corporate Talent; or the bringing together of a group of candidates for assessment. By the end of the 1970s, several thousand centers and programs were in operation.

Methods

Ideally, the procedures and variables to be assessed in a particular assessment center should stem from an analysis of the requirements of the jobs for which the candidates are being considered. Assessment procedures to tap the identified dimensions that match these requirements should then follow (Thornton & Byham, 1982). Worbois (1975) selected 12 assessment variables in this way and predicted supervisory ratings of the subsequent performance of candidates with a correlation of .39. As noted in the preceding meta-analyses, specific dimensions of a candidate’s current performance are harder to predict than the candidate’s overall future performance or potential as a manager.

All the individual and group testing activities discussed so far in this chapter have been used in assessment centers and take from a half day to five days. These activities may include paper-and-pencil and projective tests of values, interests, and personality; tests of cognitive abilities and reading and writing skills; and observers’ judgments of performance on in-basket tests, interviews, role-playing exercises, organizational simulations, work samples, and leaderless group discussions requiring cooperation or competition. What is special about the assessment center is its use of the pooled judgments of staff psychologists and managers from the organization who have been assigned as observers. The judgments are based on their observations of the candidates in action and the inferences they draw from the test results to reach consensual decisions about the leadership potential of each candidate. The observers also use consensus to predict each candidate’s likely performance as a manager and other aspects of the candidate’s future performance. From 10 to 52 variables, assumed to be related to managerial performance, may be judged and pooled by the team of observers in this way (Howard, 1974). According to a survey of assessment centers by Bender (1973), the typical center uses six assessors, including psychologists and managers at several levels above the candidates. Usually, candidates are processed in multiples of six, which is seen as an optimal number for interactions and observations. For example, 12 candidates at a time were processed by Macy’s in interviews, group discussions, and in-basket tests. Six executives from top and middle management formed the assessment team (Anonymous, 1975). Unless the information is withheld to avoid contaminating follow-up research, feedback of the consensual judgments is given to candidates in most centers to contribute to the candidates’ developmental efforts. 
Assessment centers have demonstrated moderate validities for predicting job performance and leadership potential.

Virtual Assessment Centers

The spread of firms nationally as well as globally has created the need to assess geographically dispersed managers for development and promotion, to enhance a firm’s quality of management. The Internet makes it possible to administer simulations and batteries of individual and group tests to managers in different locations. Virtuality eliminates extensive travel time, costs, and absence from work, and it can realistically simulate business challenges (Smith, Rogg, & Collins, 2001). The investigators described a virtual assessment center for 78 senior managers in a global organization, 39% based in the United States and 61% in Britain, France, Germany, and the Netherlands. The AON virtual assessment center simulated a work environment in which the candidate had to analyze problems, make decisions, take appropriate actions, and move issues forward with e-mails, voice mails, timed interruptions, and role plays. Also included in the assessments were a work accomplishment record, structured interviews, a 360-degree survey, and the 16 PF Questionnaire. As expected, convergent validities among similar assessment constructs were higher than divergent validities among dissimilar constructs.

Variables

Because of the prominence of the AT&T Management Progress Study and the availability of materials from Development Dimensions International (DDI) derived from it, the 25 variables deemed important for success at AT&T have been incorporated, as a whole or in part, into many assessment programs elsewhere. These variables are: general mental ability, oral and written communication skills, human relations skills, personal impact, perception of threshold social cues, creativity, self-objectivity, social objectivity, behavioral flexibility, need for the approval of superiors, need for the approval of peers, inner work standards, need for advancement, need for security, goal flexibility, primacy of work, value orientation, realism of expectations, tolerance of uncertainty, ability to delay gratification, resistance to stress, range of interests, energy, organization and planning, and decision making (Bray, Campbell, & Grant, 1974, pp. 19–20).

The 25 AT&T assessment variables were logically grouped into seven categories, as follows: (1) administrative skills—a good manager has the ability to organize work effectively and to make high-quality decisions; (2) interpersonal skills—a good manager has a forceful personality and makes a favorable impression on others, with good oral skills; (3) intellectual ability—a good manager learns quickly and has a wide range of interests; (4) stability of performance—a good manager is consistent in his or her performance, even under stressful conditions or in an uncertain environment; (5) motivation to work—a good manager will find positive areas of satisfaction in life; he or she is concerned with doing a good job for its own sake; (6) career orientation—a good manager wants to advance quickly in the organization but is not as concerned about a secure job; the manager does not want to delay rewards for too long a time; (7) lack of dependence on others—a good manager is not as concerned about getting approval from superiors or peers and is unwilling to change his or her life goals.

Administrative skills were found to be assessed best by a manager’s performance on the in-basket test. An evaluation of interpersonal skills was procured from the observer teams’ judgment of the manager’s performance in group exercises. A measure of intellectual ability was reliably obtained from standardized general ability tests. A measure of stability of performance was acquired from the individual’s performance in the simulations. A work motivation measure was obtained from projective tests and interviews, with some contribution from the simulations. Career orientation was seen in both interviews and projective tests and, to some extent, in the personality questionnaires. A measure of dependence was gained mostly from the projective tests (Bray, Campbell, & Grant, 1974).

Dunnette (1971) concluded, from a review of factor analyses of these variables not only for the AT&T assessment programs but for those at IBM, SOHIO, and elsewhere, that the underlying factors included overall activity and general effectiveness, organizing and planning, interpersonal competence, cognitive competence, motivation to work, personal control of feelings, and resistance to stress.7 Empirical studies by Sackett and Hakel (1979) and Tziner (1984) also led to questioning whether more than just a few dimensions could account for all 25 variables or their factor clusters. Despite this questioning, the surplus of variables may be useful for diagnosis and counseling (Thornton & Byham, 1982).

Moreover, the correlations between the ratings of the dimensions on the same exercise are often much higher than the correlations for ratings of any one of the dimensions between exercises (Sackett & Dreher, 1982), which suggests that just a few global measures are involved. In 10 sessions, 149 government agency managers were assessed to examine the effects of persons, of exercises rated on nine behavioral dimensions, and of assessors on the assessments. Multiple assessors were used for four exercises: a competitive allocation exercise, an in-basket test, a written test, and a noncompetitive management problems exercise. Differences among the individual assessees accounted for 60% of the variance in the assessments, whereas only 11% was accounted for by differences between exercises and between assessors. From a study of 19 managers in three days of assessment and four months of follow-up training, Cunningham and Olshefski (1985) concluded that assessment centers, as ordinarily constituted, were better detectors of socioemotional leadership skills than of task leadership skills, but the two tended to be correlated. Bycio (1988) found that the order (one of 11) in which participants completed the assessment activities had little effect on the ratings they earned.

Prescreening

Because the assessment center is expensive, much prescreening of candidates occurs. For instance, in one year, 1974, Macy’s interviewed 2,000 prospective candidates for participation in an assessment center in New York City; of these, only 500 were processed through the assessment center (Anonymous, 1975). Self-selection also serves as prescreening. Ordinarily, participation from within an organization is voluntary after candidates (such as those at Michigan Bell) have learned that they have been nominated for participation in the assessment center by their supervisors, sometimes with higher-level approval. Self-nominations are also possible.

Psychometric testing can be used advantageously to prescreen candidates before involving them in the expense of the total assessment process. For example, for a sample of 80 AT&T prospects, the Gordon Personal Profile and Gordon Personal Inventory tests were administered before the prospective candidates could participate in an assessment center. Four of the scores on the tests correlated with the subsequent assessment based on the full three-day program, as follows: ascendancy, .46; sociability, .32; original thinking, .37; and vigor, .21. A specially scored biographical questionnaire alone correlated .48 with the results of the three days of assessments. If a composite score based on the biographical questionnaire and the measures of ascendancy and original thinking had been in effect, 84% of those scoring in the lowest, or “D,” category would have been detected as not acceptable to the assessors after three days, while only 16% would have been deemed acceptable. It was reckoned that more than $2,000 per successful candidate was saved by the installation of such a screening program (Moses & Margolis, 1979). In the same way, Dulewicz and Fletcher (1982) found that scores on an intelligence test, but not previous specialized experience, contributed to candidates’ performance and assessment based on the situational exercises of an assessment center. Thus, preassessment intelligence tests could be used to screen prospects for assessment. The possibilities of prescreening with a single LGD have been ignored, although, as mentioned earlier, a correlation of .75 was reported by Vernon (1950) between the results about candidates obtained in a one-hour LGD and the results of a full assessment at an assessment center.

Reliability of Assessments

Agreement among assessors in assessment centers has been subjected to numerous analyses to provide estimates of the reliability of the assessments. Fewer rate-rerate estimates are available.

Agreement among Assessors. Observers’ independent ratings of assessees correlated .74 in their ratings on average and .76 in the corresponding rankings of 12 participants in the various segments of a two-day IBM assessment program (Greenwood & McNamara, 1967). Comparable median correlations of .68 and .72 were reported for AT&T’s program (Grant, 1964). For the SOHIO program, correlations of assessments between psychologists and managers who served as assessors ranged from .74 for the assessment of the drive of assessees to .93 for the judgment of the amount of assessees’ participation.

Correlations among raters were increased most by the consensual discussions they held. Before such discussions, median correlations ranged from .50 (Tziner & Dolan, 1982) to .76 (Borman, 1982). After the discussions, the median correlations rose to .80 or above (Sackett & Dreher, 1982) and as high as .99 (Howard, 1974). Schmitt (1977) reported that when a team of four assessors evaluated 101 middle-management prospects over a four-month period, mean interrater correlations before the raters’ discussion of the ratings ranged from .46 for the rating of assessees’ inner work standards to .88 for the rating of assessees’ oral communication skills. After the assessors discussed their first ratings, the intercorrelations rose to .74 and .91, respectively, but the opinion of no one assessor shifted more than that of any other, which suggested that mutual influence had occurred. The increase may reflect more shared information, or it may be a matter of the raters’ influence on one another. However, in an analysis of the participation of 2,640 candidates in an assessment program to determine their suitability for training as U.S. Navy officers, Herriot, Chalmers, and Wingrove (1985) found that a discussion changed the opinions of assessors if they: (1) deviated from the majority opinion; (2) depended on their rank as officers for their influence; (3) were above or below the majority in the rating they had initially given a candidate. In an overall review, Howard (1984) concluded that correlations among assessors tended to be at least .60 in most cases, rising to as high as .99. This agreement was not affected by whether the assessors were psychologists or managers; but, as will be noted later, meta-analytic results from 47 assessment studies (Gaugler, Rosenthal, Thornton, & Bentson, 1987) showed that psychologists were more valid assessors than managers were.

Agreement among assessors depends on how much training they have received in making ratings in general and in making ratings of performance in specific exercises. For example, Richards and Jaffee (1972) obtained an increase in interrater agreement as a result of the assessors’ training in rating performance on in-basket tests: agreement rose from .46 to .78 on ratings of human relations skills and from .58 to .90 on ratings of administrative/technical skills. On the other hand, in an examination of self-ratings and multiple raters, Clapham (1998) found little correlation between the self-ratings of 167 assessment center participants and the ratings they received from five assessors. Randall, Ferguson, and Patterson (2000) noted that accurate self-assessments might add to the validity of psychometric tests in assessment center decisions.

Rate-Rerate Reliability. A rate-rerate reliability of .77 was reported for the AT&T assessment center’s testing of men one month apart. The rate-rerate agreement was .70 for women candidates, .68 for black candidates of both sexes, and .73 for white candidates of both sexes (Moses & Boehm, 1979; Moses & Margolis, 1979).

Predictive Validity

Jansen and Stoop (2001) found that Dutch assessment center ratings, based on group discussion and presentation of analysis, predicted the average salary growth of 679 academic graduates over seven years. The correlation was .39 when corrected for starting salary and restriction in range. Reviews by Huck (1973) and Klimoski and Strickland (1977) concluded that assessment centers generated valid predictive information. This conclusion was corroborated in a meta-analysis by Hunter and Hunter (1984), which obtained median corrected correlations of .63 for predicting managerial potential from assessments and .43 for predicting on-the-job performance. But these findings appeared to be inflated, according to a more extended meta-analysis by Gaugler, Rosenthal, Thornton, et al. (1987) based on 47 studies of assessment centers involving 12,235 participants. The 107 validity coefficients between the assessment centers’ forecasts for candidates and criterion appraisals of candidates’ managerial performance and advancement (corrected for sampling error, restriction in range, and unreliability of the criteria) averaged .37. In AT&T’s Management Progress Study, assessments were followed up for eight years to see whether they predicted the promotion of assessees into middle management; the assessment results were kept unknown to those who made the promotion decisions. At AT&T and IBM, the assessment center’s results for each candidate were not subsequently seen by the superiors responsible for the candidates’ on-the-job appraisals and promotions, so the validity correlations could not be inflated by such contamination, as they might have been in other studies.8
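
The corrections applied in these meta-analyses (for restriction in range and for unreliability of the criterion) are standard psychometric formulas. A sketch with hypothetical numbers, not figures from the cited studies:

```python
# Standard psychometric corrections of an observed validity coefficient.
# All input values below are hypothetical, chosen only for illustration.

def correct_for_attenuation(r_xy, r_yy):
    """Disattenuate an observed validity for criterion unreliability."""
    return r_xy / r_yy ** 0.5

def correct_for_range_restriction(r, u):
    """Thorndike Case II correction; u = restricted SD / unrestricted SD
    of the predictor in the selected sample."""
    U = 1.0 / u
    return (U * r) / ((U ** 2 - 1) * r ** 2 + 1) ** 0.5

r_obs = 0.30    # observed validity in the (range-restricted) selected sample
r_crit = 0.70   # reliability of the criterion appraisal
u = 0.6         # degree of range restriction on the predictor

r = correct_for_range_restriction(r_obs, u)
r = correct_for_attenuation(r, r_crit)
print(round(r, 2))  # → 0.55
```

The example shows why corrected coefficients such as the .37 average run well above the raw correlations observed in any single restricted sample.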

Predictions of Career Advancement. At AT&T, 42% of the assessees who were predicted to succeed did succeed, while only 7% of those predicted to fail actually attained middle-management positions (Bray, Campbell, & Grant, 1974). For 5,943 men, a correlation of .44 was found between the consensus of the assessors’ judgments, which designated the men as “more than acceptable,” “acceptable,” “less than acceptable,” or “not acceptable,” and whether the men subsequently earned two promotions in the eight years following the assessment. The overall managerial level the candidates achieved at the end of 20 years continued to be accurately forecast by the original assessments. Intelligence scores, motivation and personality measures obtained from psychometric instruments, and the results of interviews and in-basket tests all contributed positively to the accuracy of the predictions (Howard & Bray, 1988). The assessments collected 20 years previously also predicted satisfaction, self-confidence, and emotional adjustment.

At IBM, Wollowick and McNamara (1969) found a correlation of .37 between the assessment center’s evaluation and the increased responsibility assigned to candidates. In a follow-up, Hinrichs (1978a) obtained a correlation of .46 between the overall judgments of the assessment center and the level of management reached by candidates after eight years. At SOHIO, a correlation of .64 was found between the assessments and superiors’ appraisals of the candidates’ potential 6 to 27 months afterward. The correlation remained stable in forecasting appraisals up to 5 years later (Carleton, 1970).

Howard (1974) noted that the components of various assessment centers differed in the contributions they made to the accuracy of forecasts. For example, at AT&T, the overall assessment and situational tests had the highest predictive validities; at SOHIO, projective and psychometric tests were particularly important; at IBM, some elements of each procedure contributed to the prediction of promotion. Arthur, Woehr, and Maldegen (2000) analyzed the convergent and divergent validities of an assessment center for two federal agencies. In 10 sessions with 149 government agency managers, they examined the effects on the assessments of the individual assessees, of the exercises (rated on nine behavioral dimensions), and of the assessors. Multiple assessors were used for four exercises: a competitive allocation exercise, an in-basket test, a written test, and a noncompetitive management problems exercise. Differences among the individual assessees accounted for 60% of the variance in assessments, whereas differences between exercises and between assessors together accounted for only 11%.

Moderating Effects on the Validities

Purposes Served. The purpose of the validation effort makes a difference. According to the extended meta-analysis by Gaugler, Rosenthal, Thornton, and Bentson (1987), the validity of an assessment center’s forecast depends on the criterion of success predicted by the assessment. On average, according to the validities obtained (weighted by sample size and corrected for attenuation and the unreliability of the criterion), candidates’ potential for promotion, rated by supervisors and a higher authority, was better predicted (.53) than was their rated current performance (.36), rated performance in training (.35), or advancement in their careers (.36). The least valid predictions were for appraisals by supervisors of the same on-the-job dimensions rated at the assessment center (.33). The assessment center’s focus on candidates’ current job requirements appeared to be less useful than its concentration on the candidates’ possible future job assignments. An extreme case was illustrated by Turnage and Muchinsky (1984), who found no validity in a one-day program for predicting candidates’ currently appraised job performance, but did find valid predictions of the candidates’ potential after five years and their appraised career potential.

When the effort was for research purposes, the correlation between assessment predictions and subsequent appraisals of success was .48. The correlation was lower when the purpose was selection (.41), early identification of promise (.46), or promotion (.30). Overall judgments of candidates’ participation in an assessment center’s activities appeared to be better predictors of their advancement and salary increases than were the performance appraisals they received from their supervisors. In turn, the assessments appeared to be better predictors of appraised on-the-job performance than were objective measures of such performance (Turnage & Muchinsky, 1984). Nevertheless, the pooled ratings and consensual decisions of assessors may yield less accurate predictions than mechanical pooling that optimally weights the independent assessors’ ratings by means of multiple regression analysis (Tziner, 1987).
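
Tziner’s point about mechanical pooling can be shown in miniature. The sketch below (all data hypothetical) contrasts a consensus-style equal-weight average of two assessors’ ratings with least-squares regression weights; since the equal-weight average is one of the combinations regression can choose, the fitted weights can never match the criterion worse than the fixed average does:

```python
# Mechanical pooling of assessors' ratings by multiple regression,
# contrasted with a simple equal-weight average. Data are hypothetical.

def solve(A, b):
    """Gauss-Jordan solver for a small linear system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def regression_weights(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    cols = list(zip(*X))
    XtX = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    Xty = [sum(p * q for p, q in zip(c, y)) for c in cols]
    return solve(XtX, Xty)

# Intercept column plus two assessors' ratings of five candidates,
# and a later criterion of managerial success (all invented).
X = [[1, 3, 2], [1, 4, 5], [1, 2, 2], [1, 5, 4], [1, 4, 3]]
y = [2.4, 4.6, 2.0, 4.4, 3.6]

b0, b1, b2 = regression_weights(X, y)
pooled_avg = [(r1 + r2) / 2 for _, r1, r2 in X]           # consensus-style
pooled_reg = [b0 + b1 * r1 + b2 * r2 for _, r1, r2 in X]  # mechanical
```

In practice the regression advantage must survive cross-validation on new candidates; with the small samples typical of assessment panels, shrinkage can erase it, which is why the comparison remained a live question.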

Some assessment centers have been aimed at development of leaders rather than selection. They provide feedback to participants for coaching and self-learning. For instance, an MBA program may conduct such a center for new entrants into the program. Engelbrecht and Fischer (1995) compared an experimental group with a control group of randomly selected first-line supervisors (56 white, 20 nonwhite; 38 men, 38 women) in a South African insurance firm. The 41 in the experimental group received assessment and feedback; the 35 in the control group did not. Three months afterward, the experimentals were significantly higher than the controls in action management, human resources management, information management, and problem resolution. Similarly, Fleenor (1988) found that, compared to results observed from a control sample, career development was significantly advanced by attendance at an assessment center for development.

Individual Raters’ Differences in Validity. Borman, Eaton, Bryan, and Rosse (1983) could find no evidence of significant individual differences in the validity of raters’ judgments. However, the aforementioned meta-analyses suggested that validities were higher if the assessors were psychologists rather than managers, if the percentage of female assessees was greater, and if the percentage of minority assessees was smaller. Some exercises generated ratings with higher predictive validities than others did (Sackett & Dreher, 1982). Validities were also higher when a broader array of exercises was employed and when the assessees provided peer ratings of one another. On the other hand, the validity of assessment center predictions was little affected by other expected influences, such as the ratio of assessees to assessors, the number of days of observation, the number of days the assessors were trained, and the number of hours the assessors spent integrating information. Criterion contamination from providing feedback to assessees and their supervisors also had little of the expected effect on the validities.

Transparency of Constructs. Ninety-nine students were rated by assessors in a Dutch assessment center on four types of simulated individual interviews; in one interview, they were assessed on their efforts to get a subordinate to work overtime. The experimental students were told in advance about the constructs being assessed (sensitivity, analytical skills, and persuasiveness); 50 control students were not so informed. Contrary to expectations, there were no differences in the validity of the outcomes. The study was replicated, with group exercises added, using 297 job applicants (mostly for management positions) who were told about the constructs and 393 controls who were not. Predictive validity significantly improved with the transparency of the constructs for the informed applicants compared to the uninformed controls. At the same time, in both studies, no differences appeared in the mean ratings between the transparent and nontransparent methods (Kolk, Born, & van der Flier, 2003).

Validity of Other Applications. Specialized assessment centers have been created and found valid for predicting successful performance in various institutional settings. Centers were found to be valid and useful for assessing military recruits (Borman, 1982), naval officers (Gardner & Williams, 1973), Canadian female military officers (Tziner & Dolan, 1982), managers of law enforcement agencies (McEvoy, 1988), and civil service and foreign service administrators (Anstey, 1966; Vernon, 1950; Wilson, 1948). Schmitt, Noe, Merrit, and Fitzgerald (1984) extended the use of assessment centers to forecast candidates’ success as school administrators. The specially developed assessment center featured activities and exercises relevant to schools rather than to business and industry. The assessments correlated between .20 and .35 with appraisals of the administrators’ job performance by teachers and administrative staff. However, the assessments failed to correlate with subsequent promotions or salary increases.

Utility. Because thousands of dollars may have to be spent per candidate, many essays on the cost-effectiveness of assessment centers have been produced (see, e.g., Byham & Thornton, 1970). Cascio and Sibley (1979) presented a quantitative analysis of the hypothetical utility of assessment centers based on assumptions about the costs, the validities, and the gains to the organization from the improved accuracy of prediction that the assessment center provides. The authors showed the numerous parameters that had to be considered and the optimal arrangements that were possible given the designated costs, validities, and benefits.
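
The kind of trade-off Cascio and Sibley formalized can be sketched with the standard Brogden-Cronbach-Gleser utility model; every figure below is hypothetical, not taken from the cited study:

```python
# Brogden-Cronbach-Gleser utility sketch for a selection procedure.
# All dollar figures and parameter values are hypothetical.

def selection_utility(n_selected, validity, sd_y, mean_z,
                      cost_per_candidate, n_assessed):
    """Expected dollar gain:
    N_selected * r * SD_y * z_mean  -  total assessment cost."""
    return (n_selected * validity * sd_y * mean_z
            - cost_per_candidate * n_assessed)

# Hypothetical program: 50 candidates assessed at $5,000 each, 10 placed.
gain = selection_utility(
    n_selected=10,
    validity=0.37,          # e.g., the meta-analytic validity cited earlier
    sd_y=60_000,            # dollar SD of yearly managerial performance
    mean_z=1.2,             # mean standardized predictor score of those placed
    cost_per_candidate=5_000,
    n_assessed=50,
)
print(round(gain))  # → 16400
```

Under these assumptions the program nets a positive yearly gain, but halving the validity or doubling the per-candidate cost turns it negative, which is precisely the sensitivity behind Dunnette’s question below about cheaper alternatives.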

Dunnette (1970) initially raised the question of whether the additional cost of assessment center procedures, in contrast to less expensive mechanical methods, added sufficiently to the validity of prediction. For instance, Glaser, Schwartz, and Flanagan (1958) had determined that assessments of 227 supervisors from individual interviews, an LGD session, a role-playing situation, and a simulated management-work situation, when combined with two paper-and-pencil tests, yielded multiple correlations of only .30 to .33. Yet the two paper-and-pencil tests alone correlated .23 and .25 with the appraised effectiveness of the supervisors on the job. On the other hand, the AT&T, SOHIO, and particularly the IBM studies showed considerable augmentation of the validity of predictions (up to multiple correlations of .62) when the results of an in-basket test, an LGD, and a biographical information inventory were added to information gained from a personality test (Wollowick & McNamara, 1969). But such augmentation did not occur in an assessment of 37 prospects for supervisory positions at Caterpillar Tractor (Bullard, 1969).

Cost estimates have ranged up to $5,000 per assessment center examinee,9 including space and materials as well as the assessors’ and assessees’ time and travel costs, without even considering staff salaries. Much less expensive alternatives use standardized, machine-scored cognitive and personality psychometric tests. However, the validities of these tests alone are unlikely to reach the levels possible when they are combined with the situational tests and other procedures of the assessment center. Another alternative is to rely on recommendations, individual interviews, committee reviews of credentials, and previous performance appraisals. Again, such approaches are unlikely to yield the same accuracy of prediction. Nor does actuarial prediction based on multiple regression yield more accurate predictions than assessors’ ratings, according to an analysis based on 24 predictors of the salary growth of 254 managers one, three, and five years after they were assessed (Howard, 1984).

Still, the benefits outweigh the costs. One mistake in the selection or promotion of a manager may cost an organization dearly. The validity of decisions about the selection and promotion of managers may be raised from .25 to .50 or higher by using assessment centers instead of less thorough efforts. Hogan and Zenke (1986) showed that for a sample of 115 persons applying for seven school principalships, selected exercises at an assessment center yielded a large gain in performance (valued in dollars) over traditional interview procedures. Nevertheless, Tziner (1987) suggested that it remains an open question whether much less expensive alternatives can achieve sufficiently accurate predictions of future managerial performance to justify using them instead of the procedures of a full-blown assessment center.

MacKinnon (1975) enumerated a number of other benefits that may be obtained from an assessment center. The assessment center is an educational experience for the assessees and even more so for the managers who serve as observers. Both assessees and assessors may perform better after returning to their jobs (Campbell & Bray, 1967). Lorenzo (1984) demonstrated that experience as an assessor resulted in proficiency in interviewing others, as well as obtaining and presenting relevant information about candidates and their managerial qualifications. Assessors also increased their proficiency in preparing concise written reports about this information.

Intangible costs also need to be considered. The centers may promote organizational stagnation by favoring assessees who are most similar in attitudes and values to current managers, hence stifling organizational change. Another possible negative effect is that the assessment center may establish “crown princes” whose future success becomes a self-fulfilling prophecy of the organization. Conversely, a candidate who receives a poor assessment may be denied opportunities to show what he or she can really do. In addition, the stressfulness of the assessment experience may invalidate it for some candidates. Finally, those who are not nominated for participation at an assessment center may feel abandoned in the advancement of their careers in the organization (Howard, 1974; MacKinnon, 1975).

Summary and Conclusions


The traits that predict a person’s performance as a leader, which were discussed in previous chapters, can be assessed and combined in various ways to forecast success and effectiveness. Pure judgment and the expectations of supervisors and higher authority figures on the basis of current on-the-job performance are commonly used, often as part of performance appraisals; the judgments of peers and subordinates can also be considered, as can references from sources outside the organization. Clinical judgments based on psychological instruments and interviews with employees are another alternative. The reliability and validity of forecasting success with such pure judgments can be increased through standardization of judgmental requirements, the pooling of judgments, and the training of the judges.

Simulations make it possible to observe and judge employees’ performance in standardized problem settings. These simulations include managerial in-basket tests, small-group exercises, and LGDs. Computerization and the Internet have made high-fidelity simulations and virtual assessment centers possible and have facilitated complex data processing.

Valid scores from psychometric tests, application blanks, and biographical information blanks can be obtained, often through the use of specially constructed discriminating keys. The scores can be combined mechanically with or without the use of multiple regression. They also can be combined mechanically with judgments. Conversely, judgments can be formed from observations and information obtained from such tests and measurements. The procedures of assessment centers have become a popular means of combining scores on psychometric tests with observations of candidates in a variety of simulations that are relevant to current and future positions. The increased accuracy of predictions from the pooling of data gathered in assessment centers appears to justify the expense entailed. The centers usually provide more accurate predictions of the future potential of leaders and managers than do less expensive alternatives. The search for improvements and alternatives will remain an issue for future research on leadership.