HOW TO IMPLEMENT A MEANINGFUL ASSESSMENT PROGRAMME
Assessment is central to medical education and has come to be understood as co-embedded with learning.
Assessment is not what it used to be. Where such a phrase is normally uttered in a nostalgic tone, here it is meant in a more positive vein. Assessment used to be a reasonably straightforward issue; it was mainly a problem of correctly measuring the students’ traits, such as knowledge, skills, problem-solving ability and attitudes. The goal was to produce the optimal instrument for each of these four traits, on the assumption that the combination of their results would add up to medical competence. The typical discussions in the literature about whether open-ended questions are superior to MCQs, and the various comparisons of standard-setting methods, are all examples of that line of thinking.
Things have become more complex since then. First, the idea took root that every assessment instrument has its advantages and disadvantages and that building a high-quality assessment programme is mainly about optimally combining instruments rather than trying to find a holy grail for each trait (van der Vleuten 1996). Second, the notion of competencies was developed (e.g. Frank 2005). Competencies embody a clear integration of knowledge, skills, abilities, attitudes and reflection rather than a separation of each of them (Albanese et al. 2008). Assessment programmes that rely on a reductionist approach therefore cannot achieve a valid evaluation of students’ competence. Simply adding up a patient’s responses to history taking, findings on physical examination and lab values does not automatically – or by some arithmetic formula – result in a good evaluation of the equally holistic and integrated notion of ‘health’. It is fair to say that this has created a landslide in our thinking about, and literature on, the quality of assessment. Publications on assessment to promote student learning (Shepard 2000) or the acquisition of academic values (Boud 1990), on assessment for learning (Schuwirth and van der Vleuten 2011), on programmatic assessment (van der Vleuten and Schuwirth 2005) and on quality frameworks for programmes of assessment have filled the literature since.
The first case study demonstrates how this line of thinking has led to improvements in assessment programmes, broadening the scope of the competencies they capture.
Case study 17.1 Assessment in family medicine rotation, College of Medicine, King Saud University, Saudi Arabia
The King Saud University (KSU) College of Medicine undergraduate family medicine curriculum consists of a 6-week rotation in the fourth year of medical school (Al-Faris 2000). Assessment in this course has evolved over the last 20 years, steadily moving towards more authentic assessment that targets the upper level of Miller’s pyramid (Miller 1990).
In the beginning, the focus of assessment was knowledge, which was assessed by context-free multiple-choice questions (MCQs), modified essay questions (MEQs), practice profiles and oral examinations. Over the 6-week period there were two summative components: continuous assessment (40 per cent) and a final examination (60 per cent).
As assessment evolved, a ‘blueprint’ was developed to ensure content validity. In addition, critical reading questions were included in the continuous assessment, and the oral examination was discontinued because of its time requirements and concerns regarding its reliability and validity. The next reform was the inclusion of an Objective Structured Clinical Examination (OSCE) to assess psychomotor and communication skills, and the use of MEQs was discontinued. The continuous assessment (40 marks) included single-best-answer, context-rich MCQs, student-led seminars, case-based discussions (CbD) and evidence-based medicine (EBM) presentations with reports. The final assessment (60 marks) included MCQs, the OSCE and evaluation in community centres. In addition, faculty development workshops were conducted to improve the quality of the MCQs and OSCEs. More recently, the quality, diversity and number of OSCE stations have been improved, and a data interpretation paper has been added. The blueprint has been further developed, and a committee of course teachers at departmental and central levels has been formed to review all assessment items.
Challenges and proposed solutions
The upper level of Miller’s pyramid is not well covered
This needs to be covered by workplace-based assessment (WPBA) tools, such as the mini-clinical evaluation exercise (mini-CEX), and/or a simple portfolio that includes self- and peer-assessment and a professionalism questionnaire with reflection. A recorded student consultation with a patient, using one of the consultation models such as the Calgary-Cambridge Consult scale, would be added to the portfolio package.
There is little emphasis on formative assessment
More emphasis should be placed on formative assessment in the form of mock examinations with reflection and the use of self-assessment questions on the blackboard. Bilateral ‘student and teacher’ feedback should be strongly encouraged.
Clinical assessment in community centres is not discerning enough
Currently, most of the students’ scores are 9–10 out of 10. Annual meetings with the community centre supervisors are being planned, and university teachers will accompany students in the community centres to conduct experiential teaching and encourage more appropriate and insightful assessment that covers professionalism using WPBA tools.
Assessment of students’ assignments and presentations varies a lot between teachers (hawk and dove effect)
A calibration exercise will be conducted. Video-taped student presentations will be used for this purpose with the aim of improving inter-rater reliability.
The development of a blueprint that includes the objectives, content and methods of assessment is the first step towards achieving a valid course assessment
Miller’s pyramid of competence is the framework that should be followed to achieve high validity in teaching and assessment. This requires the use of multiple methods, particularly WPBA tools, and the provision of frequent and constructive feedback.
All the teachers who participate in the course should be either involved in or informed about the course details (Al-Faris and Naeem 2012).

What is clear from this example is that assessment is no longer merely a measurement problem – of reliability and construct validity – but has increasingly become an educational design, organisational and staff-development problem. Attention has therefore shifted towards the influence assessment has on the motivation for learning (and not just on study behaviour), the influence its acceptability has on whether the theoretical curriculum, the enacted curriculum and the hidden curriculum align, and the cost efficiency needed to make the programme sustainable.
The next case studies in this chapter serve as illustrations for the broad palette of factors that determine the quality of an assessment programme.
The second case study, from Austria, addresses issues of efficiency and cost-effectiveness in relation to assessment.
Case study 17.2 Implementing a meaningful assessment programme, Medical University of Vienna, Austria
Until 2009, the entire cohort of 720 second-year students at the Medical University of Vienna routinely took a practical clerkship entry exam covering basic clinical skills (two stations), physical examination (two stations) and history taking (one station). With only five stations, the exam’s reliability was only 0.50.
In 2009 we gained additional resources, but still not enough to reach the minimum recommended reliability of 0.7 for summative examinations (Downing 2004). So, instead of adding stations to increase the precision of the test globally, we decided to focus on improving the reliability of the pass/fail decision and piloted a sequential testing approach.
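The case study does not say which calculation showed that the available resources would fall short of 0.7, but the classical Spearman-Brown prophecy formula gives a quick, standard estimate; the sketch below is illustrative only, and its predictions happen to sit close to the reliabilities reported in the following paragraphs.

```latex
% Spearman-Brown prophecy formula: predicted reliability r_k when a test with
% reliability r_1 is lengthened by a factor k (k = new stations / old stations).
r_k = \frac{k \, r_1}{1 + (k - 1)\, r_1}, \qquad r_1 = 0.50 \ \text{(5 stations)}
% Solving r_k >= 0.70 gives k >= 2.33, i.e. about 12 stations or more;
% k = 8/5 predicts r_k = 0.62 and k = 14/5 predicts r_k = 0.74, close to the
% reliabilities of 0.61 and 0.74 reported below.
```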
Since 2010 all students take an initial eight-station practical exam. Despite a still low reliability of 0.61, students can be separated into a clear pass group, a clear fail group and a borderline group performing close to the pass/fail criterion. Only the ‘borderliners’ (about 25 per cent of the cohort) take an additional six stations to increase the reliability and better justify their pass/fail decision (r = 0.74). The size of the borderline group can be estimated by counting the number of candidates whose scores fall within the limits of a confidence interval around the pass/fail score (Harvill 1991; Downing 2004).
This sequential approach requires only 6,840 student–examiner contacts compared to the 10,080 that would have been required in a conventional 14-station OSCE. Despite the low number of screening stations, we regard content validity as acceptable, since the number of assessable procedures is still limited in the second year.
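A minimal sketch of the two calculations above: the standard error of measurement used to draw a confidence band around the pass/fail score (Harvill 1991), and the student–examiner contact arithmetic. Only the cohort size, station counts and the 25 per cent borderline share come from the case study; the example scores, standard deviation, pass mark and one-SEM band width are invented.

```python
import math

# Illustrative sketch only: cohort size, station counts and the 25 per cent
# borderline share come from the case study; all score data are invented.

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement (Harvill 1991): SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def borderline_group(scores, pass_mark, sd, reliability, z=1.0):
    """Candidates whose screening score lies within +/- z SEM of the pass mark."""
    half_width = z * sem(sd, reliability)
    return [s for s in scores if abs(s - pass_mark) <= half_width]

# Example: screening scores (per cent) for a handful of hypothetical candidates
scores = [48, 56, 58, 61, 64, 72, 80]
print(borderline_group(scores, pass_mark=60.0, sd=8.0, reliability=0.61))  # -> [56, 58, 61, 64]

# Resource arithmetic reported in the case study
cohort, screen_stations, extra_stations = 720, 8, 6
sequential = cohort * screen_stations + round(0.25 * cohort) * extra_stations
conventional = cohort * (screen_stations + extra_stations)
print(sequential, conventional)  # 6840 vs 10080 student-examiner contacts
```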
As the resources of the faculty usually have to be allocated prior to the exam, the number of borderline students who will need further testing has to be estimated in advance, either from a pilot study or by analysing data from previous examinations. This is necessary because the size of the borderline group is influenced by many factors, such as the cohort’s ability, the test’s reliability, the distance of the pass/fail criterion from the score mean, the score variance and the skewness of the score distribution. Fortunately, the borderline group sizes have remained very stable in Vienna in recent years.
For logistical reasons there is about a week between screening and further testing in our setting. As sequential testing starts from the assumption of an unchanged object of measurement, one might question whether it is correct to add up the results from screening and further testing into an overall score. However, from an educational point of view, this time between screening and further testing makes a virtue out of necessity: the weak performers receive feedback and are allowed to visit our skills lab for remedial action.
The decision about the pass/fail criterion is also challenging. In the second phase the borderline regression method does not work, as all candidates tested at this stage are already borderline candidates. Stations used for further testing therefore have to be criterion-referenced in advance.
As it is not possible to distinguish justifiably between ‘excellent’, ‘good’ and ‘average’ students within the ‘clear pass’ group – since the screening’s reliability is low – no grades other than ‘pass’ or ‘fail’ should be given in this type of assessment.
Student acceptance of the sequential assessment can be increased when the exam is announced as a 14-station OSCE that is shortened for students with better test results, rather than announcing that students with weaker results have to take a longer test.
It is important to keep in mind that sequential testing is more challenging to implement than a classical assessment. The format requires careful piloting and psychometric evaluation during the implementation phase, and the number of stations for the screening as well as for the further testing has to be decided with reliability and validity issues in mind. Although we acknowledge concerns about the violation of measurement assumptions, we established a very objective and structured ‘double-spot’ assessment that is formative for those students who probably most urgently need remediation; this fills the gap between single-spot summative and continuous assessment. Similar to recently reported results (Muijtjens et al. 2000; Smee et al. 2003; Cookson et al. 2011), the success of introducing sequential testing at the Medical University of Vienna lies in saving resources by cutting down on student–examiner encounters while still reaching justifiable pass/fail decisions with acceptable reliability in the vicinity of the pass/fail criterion (Wagner-Menghin et al. 2011).
Assessing 720 students with an OSCE in a context with limited resources is a challenge. This case study is important in this chapter not because of its focus on optimising a single test of competence, nor because of the measurement issue it addresses; that issue is theoretically simple: if you want higher reliability, you simply include more cases, and if you cannot or will not do this for the whole group of students, you do it for those for whom it matters most. This would be neither new nor exciting. What is important is the approach the authors took.
First, they made a clear analysis of the local problem: huge student numbers, restrictive boundary conditions in terms of staff time and financial resources, and a requirement to optimise the pass/fail decision at a certain point in the curriculum were at the core of the problem. Second, they carefully consulted the literature to find the best solutions that could and would work in their own context. Third, they adapted the findings to their local situation. And finally, they redesigned the OSCE and trialled and evaluated it.
If we want to put this problem in a broader context, it is helpful to look at validity from the perspective of Kane (2006). Kane describes validation as a stepwise process of generating supportive arguments or plausible evidence for the inference from an observation of behaviour to conclusions about the construct of interest. If, for example, we want to draw conclusions about a student’s clinical reasoning skills, we can observe the student during simulated or real patient consultations. In order to draw such conclusions a series of inferences has to be made. First, the observation has to be translated into a score or verbal summary; for example, 85 per cent or ‘good’. Second, an inference has to be made as to how generalisable this score or summary is (the so-called ‘universe generalisation’ inference). Third, an inference has to be made to the so-called target domain (this could be clinical decision making) and, finally, to the construct of interest (clinical reasoning).
The second, universe generalisation, inference was central in the case example, and the authors focused on reliability. But reliability in the classical test theory sense is only one argument we could make for universe generalisation; there are others. Saturation of information is another type of argument. This is a methodological strategy often used in qualitative research, and it differs conceptually from the notion of reliability. Reliability is firmly rooted in the assumption that all observations need to agree on the ability of the candidate: if in a panel viva the judges of the panel disagree about the competence of the candidate, this disagreement is seen as a source of unreliability. Saturation of information, by contrast, is based on the notion that generalisation is optimal when no new disagreement arises. So, if the panel in this viva disagree, this does not automatically mean that the assessment has poor generalisation, but that more judges are needed; when adding new judges no longer leads to new disagreements, the whole kaleidoscope of the candidate’s competence has been judged and universe generalisation has been achieved. Of course, just adding random opinions does not suffice here, but adding expert opinions does. Therefore, teacher training, clear rules and regulations, communities of practice and prolonged engagement are processes to ensure that opinions are based on sufficient exposure and expertise. Furthermore, a structure for evaluating and adding information needs to be in place, so benchmarking, stepwise replication, second opinions and transparency (audit trails) are organisational elements that help to ensure universe generalisation (Driessen et al. 2005).
It is fair to state that numerical scores (for example, on written or computer-based tests) are better supported by reliability approaches, while assessment based on human judgement is better supported by saturation-of-information approaches. In either case, a choice has to be made and followed through – much as in the case study, by analysing the problem, exploring the literature, applying the most promising solution and evaluating it – in order to design meaningful assessment.
Case study 17.3 Implementing a meaningful assessment programme, St George’s University of London, UK
St George’s University of London is a graduate and undergraduate medical school in the UK with 270 students per year. The medical school undergoes regular reviews of its medical course and its assessment structure. Prior to 2008, a plethora of different assessment tools was used to assess skills and knowledge at the end of the second, fourth and final years of the courses.
Following advice from the General Medical Council (GMC) in Tomorrow’s Doctors (2009), the medical school grouped all summative assessments under a professor of assessment, with an integrated strategy designed to test students’ skills in three domains: the doctor as a scholar and scientist; the doctor as a practitioner; and the doctor as a professional. I took responsibility for part of the doctor as scholar and scientist domain, and developed end-of-year examinations in knowledge and its clinical application.
As a consultant, parent, patient and taxpayer, I wanted a high and rising standard for medical school graduates. Medical school is arduous, and some students will avoid learning what is not tested and use compensation between testing areas to reduce their own learning load.
At the outset, the principles we developed were:
• Test it once, test it properly.
• Test at the right level.
• Test only what is general practitioner (GP) or F1/2 relevant.
• All questions are clinically situated.
• Use evidence-based methodologies.
• Tell the students what they are being tested on.
• Make it feasible.
We chose to use single best answer questions (SBAs), testing for 180 minutes each year on the subjects covered that year, and we used an Angoff method for standard setting. These formats were chosen as being evidence-based, effective and simple.
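The case study does not report which Angoff variant was used or any judge data, so the sketch below only illustrates the basic calculation with invented ratings: each judge estimates, for every item, the probability that a borderline (just-passing) candidate would answer it correctly, and the cut score is the mean over judges of their summed estimates.

```python
# Basic Angoff calculation with invented ratings (the case study does not
# publish its judges' data or which Angoff variant it used).
judge_ratings = [
    [0.6, 0.8, 0.5, 0.7],   # judge 1: estimated probability that a borderline
    [0.5, 0.9, 0.4, 0.6],   # judge 2: candidate answers items 1-4 correctly
    [0.7, 0.7, 0.6, 0.8],   # judge 3
]

# Cut score = mean over judges of the sum of their item estimates.
per_judge_cut = [sum(ratings) for ratings in judge_ratings]
cut_score = sum(per_judge_cut) / len(per_judge_cut)
print(f"Angoff cut score: {cut_score:.1f} out of {len(judge_ratings[0])} marks")
```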
The first year of testing was 2009–10. We developed a house style and promoted it with writer workshops for clinical academics. We held blueprinting meetings, which were initially stormy: there was much debate about question allocation across the curriculum. A formula was developed to calculate question numbers for each subject based on its teaching time that year, whether it would be examined again later, and its GP or F1/2 relevance (a possible weighting of this kind is sketched below). As in any spiral curriculum, some subjects are taught in more than one year, and the faculty began to appreciate that asking more questions on a subject in one year produced a better test than spreading questions over two or more years. Once blueprinting had calculated the number of questions for each subject area, it further specified the skill to be tested and the setting of the clinical encounter in the SBA, in order to ensure adequate sampling and spread across the curriculum.
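The formula itself is not published in the case study, so everything in the sketch below (the weights, the subjects and the paper length) is invented; it only illustrates how question numbers might be derived from the three stated inputs of teaching time, later re-examination and GP or F1/2 relevance.

```python
# Illustrative blueprint allocation: all weights, subjects and the paper length
# are invented; only the three inputs (teaching time, later examination,
# GP/F1-F2 relevance) come from the case study's description.

TOTAL_QUESTIONS = 120   # hypothetical size of the year's SBA paper

subjects = {
    # name: (teaching hours this year, examined again in a later year?, relevance 0-1)
    "cardiology": (40, False, 1.0),
    "dermatology": (10, False, 0.8),
    "embryology": (15, True, 0.3),
}

def weight(hours, examined_later, relevance):
    w = hours * relevance
    if examined_later:        # down-weight subjects that will be tested again later
        w *= 0.5
    return w

weights = {name: weight(*data) for name, data in subjects.items()}
total_weight = sum(weights.values())
allocation = {name: round(TOTAL_QUESTIONS * w / total_weight) for name, w in weights.items()}
print(allocation)   # {'cardiology': 96, 'dermatology': 19, 'embryology': 5}
```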
We found that some individuals took quickly to SBA writing, while others, despite much assistance, remained poor at it. Many questions had to be rewritten almost from scratch to ensure quality and consistency of format. Common foibles were negative questions, ‘two-step’ questions, questions on minutiae, heterogeneous options, backwards questions and clinically implausible scenarios. In 2011 the management structure of the examination strand was changed to develop a small team to assist in this process.
Students appreciated the fairness of the examination, although they felt it was taxing. We had intended to produce rapid feedback, but manpower challenges and a lack of pre-planning prevented this initially.
The same format was extended to the fourth year in 2011 and the final year in 2012. For the final year, the medical school asked that multiple SBAs should be nested in 30 patient scenarios, testing aspects of practical patient management and reflecting situations F1s would be likely to encounter. As in real life, there would be multiple pieces of information to assimilate, some of them irrelevant.
This format posed new challenges – how to sample the entire curriculum adequately, and how to avoid cueing answers by developments further along in the case. We found these questions particularly challenging to write well, and are developing further workshops to address this.
A key part of any assessment strategy is its own evaluation. Systems were put in place to record and review the performance of all questions used, discarding or rewriting poorly performing items. A question bank is currently being installed to facilitate this and to identify pre-written questions to meet future blueprints. The most useful evidence, however, is an exam’s predictive power. We are currently examining the outcomes of those students on either side of the pass mark for these exams, both at medical school and beyond.
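The case study does not specify which item statistics are reviewed; item facility (proportion correct) and point-biserial discrimination are the usual starting points, and the sketch below computes both on an invented response matrix.

```python
# Standard item statistics on an invented response matrix (rows = candidates,
# columns = items, 1 = correct); the case study does not specify its indices.
# statistics.correlation requires Python 3.10 or later.
import statistics

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

totals = [sum(row) for row in responses]

for item in range(len(responses[0])):
    item_scores = [row[item] for row in responses]
    facility = statistics.mean(item_scores)                       # proportion correct
    discrimination = statistics.correlation(item_scores, totals)  # point-biserial
    print(f"item {item + 1}: facility = {facility:.2f}, r_pb = {discrimination:.2f}")
```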
Evaluation should lead to change (learning). We are happy with the overall performance of this part of our assessment strategy, but believe more can be done to improve the quality of feedback to the students, and we are considering increasing the testing time to make the test more discriminatory for borderline students.
Medical school summative exams are a vital part of the medical course, encouraging excellence and preventing the public being exposed to poorly trained doctors.
By developing clear principles, and with considerable effort, an effective examination can be developed that is respected by the students and able to distinguish between those students who are able to proceed and those who need further study and work.
The St George’s case study describes the challenges in setting up a new examination for a previously not well-assessed domain. An important aspect of this case study is its focus on the expertise of the faculty staff involved in the process. It starts with a clear statement of the purposes of the assessment and of the way to achieve them. The second step is the development of a house style, or template, to ensure that all those involved in the production of assessment material are on the same page. The third is the involvement of the faculty in the content and the distribution of the questions. Finally, and most importantly, there is the faculty development process that was put in place. The review of items by a team was used not only to improve the quality of the items, and thereby of the assessment, but also, through feedback and assistance, to raise the expertise of staff members in writing items. In this way the case study illustrates two other assertions in the opening section of this chapter, namely that assessment is as much an educational design and a staff development problem as it is a measurement problem.
If my daughters, who are currently almost 6 and 8 years old, were assigned the task of handing out multiple-choice exam papers they could most probably do that without having any impact on the assessment quality of the method. They would not influence the validity or the reliability of the assessment. My eldest would even be able to score the papers (if she sets her mind to it) and produce the results for each candidate. In multiple-choice tests the expertise is needed to produce the questions and not to administer the tests or produce the scores. This being said, I would never dream of asking my daughters to conduct a mini-CEX even if they were given an optimally structured and standardised form. The ‘administration’ of the assessment is where the expertise is needed and not the production of scores. If we go back to the validity concept by Kane (2006), expertise is needed to make the inference from observation to scores and if you don’t have the expertise you cannot make this inference. Actually, the specific form, whether it has nine or 20 criteria, whether it has a nine-point or a three-point scale, is largely irrelevant – an expert could validly conduct a mini-CEX with a blank sheet of paper. It is fair to say that any form of assessment requires expertise and therefore appropriate staff development, but it should be determined very carefully where and how that expertise should be used (or, as is often the case with the use of checklists, stifled) in the process.
The question then remains what this expertise entails. First, it is content expertise: if an assessor does not know what s/he is required to judge, or what should be seen as good performance and what as poor performance, s/he cannot come to a valid inference from observation to score. Second, it is assessment expertise. This is not necessarily knowing much about the theory of assessment, but rather the acquisition of tacit knowledge of acceptable and unacceptable candidate performance (Berendonk et al. 2013), the experience to develop adequate performance scripts (Govaerts et al. 2012) and familiarity with the organisation, rules and regulations governing the assessment. As such, the required assessor expertise bears remarkable similarities to the development of diagnostic expertise in medicine, which has important implications for staff development programmes.
Diagnostic expertise resides in the ability to solve ill-defined problems through the development of meaningful knowledge networks (semantic networks), which aggregate into illness and instance scripts. Through these, experts are not only better at diagnosing but also more efficient, allowing them to manage caseloads that non-experts – even with elaborate analytical reasoning and access to the internet – would not be able to manage. The same applies to expert assessors, who develop meaningful networks and performance and person scripts that enable them to ‘diagnose’ competence in an assessment situation more effectively and more efficiently. It is clear that for such a difficult expertise a single training workshop will never do, just as a single workshop does not turn a student into an expert diagnostician. In the case study, the staff development activities – consensus building on the aims of the assessment, involvement in the content to create engagement, involvement in the review process, and constant feedback and training on the job – all recognise the need for an ongoing staff development process focused on where the expertise is needed.
The three case studies in this chapter illustrate the importance of aspects of assessment quality other than construct validity and reliability. Quality is no longer seen as a feature unique to the assessment instrument but as a constant interaction between the instrument, the user and his/her organisation. To use a simplistic clinical metaphor: a surgeon cannot operate with a dull blade, but even with a sharp blade a non-expert can do serious harm, and over-reliance on the quality of the instrument without paying attention to the user will invariably produce harm.
When we return to the topic of this chapter – how to implement meaningful assessment – the case examples have illustrated the processes we currently see as unavoidable in achieving this. It starts with a clear analysis of the educational need or problem (and not with a ‘new’ instrument). The second step is to explore the literature carefully for possible solutions, evaluate their quality, select the most promising ones and adapt them to the local situation. As previously highlighted, there is nothing magical about any assessment instrument; the value lies in its interaction with the user and the organisation. The next step is the implementation of the instrument, involving users and other stakeholders and engaging them in the process. Finally, an ongoing faculty development or expertise development process needs to be in place to ensure an optimal interaction between method, staff and institution.
It is fair to say that such a full process may not always be possible: there may not be the time, the resources or the expertise within the organisation, and compromises may have to be made. But in order to make the best compromise, a clear picture of the whole gamut of factors to consider is helpful, and it is hoped that the case studies and the accompanying annotations in this chapter have provided this.
Take-home messages
• Meaningful assessment involves more than just the measurement of ability and assurances of the reliability and validity of the method. It requires a clear statement of the educational need or problem to be addressed.
• Expertise is essential to ensure that the design and use of the assessment tool are adequate to address the educational need or problem.
• The design of the assessment instrument or method requires careful examination of the existing evidence on tools and their quality, and their adaptation to the particular context.
• Ongoing faculty development is essential to ensure the engagement of stakeholders in the realisation of meaningful assessment.
Bibliography
Albanese, M.A., Mejicano, G., Mullan, P., Kokotailo, P. and Gruppen, L. (2008) ‘Defining characteristics of educational competencies’, Medical Education, 42(3): 248–55.
Al-Faris, E.A. (2000) ‘Students’ evaluation of a traditional and an innovative family medicine course in Saudi Arabia’, Education for Health, 13(2): 231–5.
Al-Faris, E.A. and Naeem, N. (2012) ‘Effective teaching in medical schools’, Saudi Medical Journal, 33(3): 237–43.
Berendonk, C., Stalmeijer, R.E. and Schuwirth, L.W.T. (2013) ‘Expertise in performance assessment: Assessors’ perspective’, Advances in Health Sciences Education, 18(4): 559–71.
Boud, D. (1990) ‘Assessment and the promotion of academic values’, Studies in Higher Education, 15(1): 101–11.
Cookson, J., Crossley, J., Fagan, G., McKendree, J. and Mohsen, A. (2011) ‘A final clinical examination using a sequential design to improve cost-effectiveness’, Medical Education, 45(7): 741–7.
Downing, S.M. (2004) ‘Reliability: On the reproducibility of assessment data’, Medical Education, 38(10): 1006–12.
Driessen, E., van der Vleuten, C.P.M., Schuwirth, L.W.T., van Tartwijk, J. and Vermunt, J. (2005) ‘The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: A case study’, Medical Education, 39(2): 214–20.
Frank, J.R. (ed.) (2005) The CanMEDS 2005 physician competency framework: Better standards. Better physicians. Better care. Ottawa: The Royal College of Physicians and Surgeons of Canada. Online. Available HTTP: http://www.royalcollege.ca/portal/page/portal/rc/canmeds (accessed 26 July 2013).
General Medical Council (GMC) (2009) Tomorrow’s doctors: Outcomes and standards for undergraduate medical education. London: GMC. Online. Available HTTP: www.gmc-uk.org/New_Doctor09_FINAL.pdf_27493417.pdf_39279971.pdf (accessed 10 June 2014).
Govaerts, M.J.B., Van de Wiel, M.W.J., Schuwirth, L.W.T., Van der Vleuten, C.P.M. and Muijtjens, A.M.M. (2012) ‘Workplace-based assessment: raters’ performance theories and constructs’, Advances in Health Sciences Education, 18(3): 375–96.
Harvill, L.M. (1991) ‘Standard error of measurement’, Educational Measurement: Issues and Practice, 10(2): 33–41.
Kane, M. (2006) ‘Validation’. In R.L. Brennan (ed.) Educational measurement, Westport, CT: ACE/Praeger.
Miller, G.E. (1990) ‘The assessment of clinical skills/competence/performance’, Academic Medicine, 65(9): S63–7.
Muijtjens, A.M.M., van Vollenhoven, F.H., van Luijk, S.J. and van der Vleuten, C.P.M. (2000) ‘Sequential testing in the assessment of clinical skills’, Academic Medicine, 75(4): 369–73.
Schuwirth, L.W.T. and van der Vleuten, C.P.M. (2011) ‘Programmatic assessment: From assessment of learning to assessment for learning’, Medical Teacher, 33(6): 478–85.
Shepard, L. (2000) ‘The role of assessment in a learning culture’, Educational Researcher, 29(7): 4–14.
Smee, S.M., Dauphinee, W.D., Blackmore, D.E., Rothman, A.I., Reznick, R.K. and des Marchais, J. (2003) ‘A sequenced OSCE for licensure: administrative issues, results and myths’, Advances in Health Sciences Education, 8(3): 223–36.
van der Vleuten, C.P.M. (1996) ‘The assessment of professional competence: Developments, research and practical implications’, Advances in Health Sciences Education, 1(1): 41–67.
van der Vleuten, C.P.M. and Schuwirth, L.W.T. (2005) ‘Assessing professional competence: From methods to programmes’, Medical Education, 39(3): 309–17.
Wagner-Menghin, M., Preusche, I. and Schmidts, M. (2011) ‘Clinical performance assessment with five stations: Can it work?’, Medical Teacher, 33(6): 507.