Assessment Methods and Measurement Instruments
Working Review
August 2001
Introduction
Evaluation permits the critical question to be asked and answered: have the goals and objectives of a new curriculum been met? It assesses individual achievement to satisfy external requirements, provides information that can be used to improve the curriculum, and documents accomplishments or failures. Evaluation can provide feedback and motivation for continued improvement for learners, faculty, and innovative curriculum developers. To ensure that important questions are answered and relevant needs met, it is necessary to be methodical in designing the evaluation process.
In the last decade, we have observed the rapid evolution of assessment methods used in medical education from traditional approaches towards more sophisticated evaluation strategies. Single methods have been replaced by multiple methods, and paper-and-pencil tests by computerized tests. Normative pass/fail decisions have moved towards explicit assessment standards, and the assessment of knowledge has been replaced by the assessment of competence. Efforts have also been made to standardize subjective judgments, to develop sets of performance standards, to generate assessment evidence from multiple sources, and to replace the search for knowledge with the search for "reflection in action" in a working environment. Assessment tools such as the objective structured clinical examination (OSCE), the portfolio approach, and high-technology simulations are examples of the new measurement tools. The introduction of these new assessment methods, and the results obtained with them, has had a system-wide effect on medical education and the medical profession in general. The commonly used slogan that "assessment drives learning", although certainly true, presents a rather limiting concept. It has therefore been suggested that it be replaced by an alternative motto: "assessment expands professional horizons" (M. Friedman, 2000). This stresses the important role of assessment in developing multiple dimensions of the medical profession.
Recent developments in so-called "quantified tests", standardized patient examinations, and computer case simulations, together with the present focus on the quality of assessment evidence and the use of relevant research to validate preferred assessment approaches, have been impressive, initiating the birth of Best Evidence-Based Assessment (BEBA). However, such performance-based assessments consume resources and require a high level of technology. They are not readily applied in developing countries, or even in most developed ones, because of their expense and logistical problems.
Therefore, we cannot forget the value and importance of assessment methods that recognize the primacy of evaluations by teachers and supervisors in the real health care environment. This so-called "descriptive evaluation", which uses words to describe and summarize a student's level of competence, stands in contrast to quantitative assessment techniques, whose summary of achievement is a score, typically a number. In this area, summative faculty judgments are necessary, but certainly not sufficient, to pronounce a student competent; they should be supplemented by quantified assessment of professional performance.
"Objective" vs. "Traditional" Methods of Evaluation
Most educators would accept that prolonged periods of observing students working with patients on a regular basis have more validity than most tests of clinical competence. The problem is achieving reliability and precision in these observations, which is a requirement for valid assessment. Ideally, the evaluation should cover a spectrum of skills, including the cognitive ability to know what information is worth remembering, the personal skill of managing one's time successfully, and a commitment to self-directed learning.
On the other hand, there are barriers to accepting the validity of the descriptive evaluations of competence that are widely used around the world. Deficiencies have been clearly identified both in the conventional or traditional clinical examination of students' clinical skills and in the traditional multiple-choice examinations, which measure only one aspect of competence, namely knowledge. First and foremost is the belief that words are subjective and that numbers are objective. Use of the term objective for an assessment tool that yields a number, a percentage, or a score above or below a mean gives it a status in the scientific community that is often denied to observations by teachers. A teacher wishes to be accurate in conveying impressions of a student but does not want to harm the student's career. Even when the teacher's observations are correct, he or she is aware that the number of observations of the student may not provide sufficient reliability, and may therefore be uncomfortable giving a grade. One solution is to increase the number of cases presented to students.
Nevertheless, a ward- or practice-based assessment is the most desirable environment in which to assess the student, and it provides the opportunity to make multiple observations over a period of time in a variety of clinical situations. Medical teachers frequently fail to take advantage of this opportunity, rarely observing students as they perform patient histories and examinations. The small number of observations is likely to make such assessments unreliable and thus unfair for decision-making purposes. Not only does this undermine the quality of such in-training assessments, but it also reduces the chances that students will receive specific feedback (formative assessment) and appropriate remedial teaching. Such assessments could be made over an extended period of time, or combined with a more objective procedure such as an OSCE, to achieve a higher degree of reliability.
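The link between the number of observations and reliability can be made concrete with the Spearman-Brown prophecy formula, a standard psychometric result that is not cited in this review. The sketch below, written in Python with purely illustrative figures, shows how many comparable, independent observations would be needed to raise a modest single-observation reliability to a level more defensible for decision making.

    # Illustrative sketch (not from the review): Spearman-Brown prophecy formula.
    # r_single is an assumed reliability of one observed encounter; k is the
    # number of comparable, independent observations combined into one judgment.

    def projected_reliability(r_single, k):
        """Estimated reliability of the average of k comparable observations."""
        return k * r_single / (1 + (k - 1) * r_single)

    # Hypothetical example: a single observed case with reliability 0.30
    # would need roughly 10 such cases to approach 0.80.
    for k in (1, 4, 10):
        print(k, round(projected_reliability(0.30, k), 2))

Under these assumed numbers, one observation gives 0.30, four give about 0.63, and ten give about 0.81, which is why a single long case is rarely a fair basis for a pass/fail decision.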
In some countries, and in particular in the United States, the tendency has been to move away from examinations at the bedside and towards patient management problems. Recent developments in performance assessment achieve a high level of authenticity and reliability. The computerization of multiple-choice examinations, especially those with sequential and adaptive testing as implemented by the National Board of Medical Examiners, is an impressive feat.
What Should Be Evaluated and When?
The evaluation that attempts to determine different aspects of educational structure, process and outcomes may take several forms. The formative individual evaluation provides feedback to an individual learner, identifying areas for improvement and offering suggestions, whereas the formative program evaluation provides information and suggestions for improving a curriculum and the program's performance.
On the other hand, summative individual evaluation measures whether specific performance objectives were accomplished, certifying competence, or the lack of it, in a particular area, and summative program evaluation measures the success of a curriculum in achieving learner and process objectives.
Formative evaluations generally require the least rigor, and summative individual and program evaluations for internal use need an intermediate level of rigor. Summative individual and program evaluation for external use, e.g., certification of competence or publication of evaluation results, requires the most rigor.
When a high degree of methodological rigor is required, the measurement instrument must be appropriate in terms of validity and reliability. Establishing validity is the first priority in developing any form of assessment. In simple terms, this means ensuring that it measures what it is supposed to measure. The test must contain a representative sample of what the student is expected to have achieved. This aspect of validity, known as content validity, is the one of most concern to the medical teacher. Reliability, on the other hand, expresses the consistency and precision of the test measurements. A variety of factors contribute to reliability. In a clinical examination there are three variables: the students, the examiners and the patients. In a reliable assessment procedure, variability due to the patient and the examiner should be removed. In the clinical examination, wherever possible, a subjective approach to marking should be replaced by a more objective one. Unreliability in clinical examinations also results from the fact that different students usually examine different patients, and one patient may help some students while obstructing others.
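To illustrate how consistency is quantified in practice, the sketch below computes Cronbach's alpha, one widely used index of internal consistency across stations or items. The review does not prescribe any particular statistic, and the station scores in the example are invented.

    # Illustrative sketch (not from the review): Cronbach's alpha for a set of
    # station or item scores; one row per student, one column per station/item.
    import numpy as np

    def cronbach_alpha(scores):
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of stations/items
        item_vars = scores.var(axis=0, ddof=1)       # variance of each station
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical data: 5 students scored (in percent) on 4 stations.
    data = [[60, 55, 70, 65],
            [80, 75, 85, 78],
            [55, 50, 60, 58],
            [90, 88, 92, 85],
            [70, 68, 74, 72]]
    print(round(cronbach_alpha(data), 2))

A low value would signal that stations or examiners are ranking students inconsistently, which is exactly the kind of patient- and examiner-related variability the text says should be removed or controlled.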
Also important is the practicality of the assessment procedures. Factors such as the number of staff available, their status and specialties, the availability of patients and space, and cost have to be taken into account. The ideal examination should take into account the number of students to be assessed, as an assessment procedure appropriate for twenty students may not be practical for hundreds. Unfortunately, the resources available to conduct evaluations are always restricted. However, if medical schools want to achieve minimally acceptable standards of validity and reliability, they have to be prepared to expend more time and resources in this area. This applies particularly to the assessment of clinical skills, where much longer or more frequent observation of student performance than is usually undertaken is required.
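The scaling problem can be made concrete with simple arithmetic. The station count, station length and staffing in the sketch below are assumptions chosen only to show how testing time and examiner time grow with class size; none of these figures come from the review.

    # Illustrative sketch (not from the review): rough resource estimate for an
    # OSCE circuit under assumed station length and staffing.
    def osce_resources(n_students, n_stations=12, minutes_per_station=5,
                       parallel_circuits=1):
        students_per_cycle = n_stations * parallel_circuits
        cycles = -(-n_students // students_per_cycle)        # ceiling division
        exam_minutes = cycles * n_stations * minutes_per_station
        examiner_minutes = n_stations * parallel_circuits * exam_minutes
        return exam_minutes, examiner_minutes

    # Hypothetical comparison: 20 students versus 200 students on one circuit.
    print(osce_resources(20))    # (120, 1440)   -> 2 hours of testing time
    print(osce_resources(200))   # (1020, 12240) -> 17 hours of testing time

Under these assumptions, moving from twenty students to two hundred multiplies the examiner time from about 24 examiner-hours to about 204, which is why a procedure feasible for a small group may be impractical for a whole class without parallel circuits or other compromises.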
The first step in planning the evaluation is to identify its likely users. Different stakeholders who have responsibility for, or who may be affected by, the curriculum will also be interested in the evaluation results. In addition, students are interested in the evaluation of their own performance, and the results may also be of interest to educators from other institutions.
The next step in designing an evaluation strategy for a curriculum is to identify whether the evaluation is used to measure the performance of individuals, the performance of the entire program, or both. The evaluation of an individual usually involves determining whether the person has achieved the objectives of a curriculum. On the other hand, program evaluation usually assesses the aggregate achievements of all individuals, clinical or other outcomes, actual processes of a curriculum implementation and perceptions of learners and faculty. Another use of an evaluation might be for formative purposes (to improve performance), summative purposes (to judge performance), or for both.
The long-term goal underlying revision of the curriculum is to produce better physicians with qualities such as extensive and appropriate knowledge, humanism, compassion, career achievement, the ability and desire to learn throughout life, and receptiveness to patients' care and clinical research. In that situation, the proper time of evaluation is graduation or later.
Whatever the purpose and whenever performed, such assessments have a powerful effect on what students learn and how they go about their studies, and the assessment of clinical competence is one of the most important tasks. Assessment should therefore be incorporated regularly within the coursework to provide ongoing feedback to students and teachers, in addition to the assessment usually undertaken at the end of a clinical course to certify a level of achievement.
Assessment of Medical Competence
Although the evaluation of professional competence is considered one of the most important final goals of medical education and one of the most important tasks of teachers, until very recently the term clinical competence was used rather loosely, without general agreement on its meaning. Presently, competence is defined in terms of what the student or doctor should be able to do at an expected level of achievement, such as at graduation or when commencing an internship. Thus, competence is the synthesis of all attributes necessary to do the task for which one is being trained, and clinical competence may be regarded as the mastery of relevant knowledge and the acquisition of a range of relevant skills, including interpersonal, clinical and technical components. Competence itself, of course, is only of value as a prerequisite for performance in a real clinical setting.
There is a tendency to separate the term clinical competence from the term clinical performance. Performance is defined as what a student or doctor actually does under specific conditions: for instance, during a test, while being watched, or in real clinical practice. What is more, "performing" is ongoing and continuous, and indicates activity rather than a finished product. To know that a student is competent, we need to observe the student performing in vivo, not in an isolated performance under in vitro test conditions. In many ways, it is easier to assess competence than performance. This matter is of less concern in the undergraduate arena, where assessment of competence is particularly appropriate, as students are not expected to practice unsupervised.
Fig. 1 Components of clinical competence (Newble, 1992)
Unfortunately, competence does not always correlate highly with performance in practice. Both competence and performance are influenced by professional attitudes; however, assessing this component poses great difficulties.
The prevailing approach is analytic in nature: educators break competence up into separate parts called skills, knowledge and attitudes. The components of clinical competence include abilities such as obtaining a detailed and relevant patient history, carrying out a physical examination, identifying patient problems, choosing appropriate diagnostic methods, performing a differential diagnosis, interpreting the results obtained, and undertaking an appropriate case management approach, including patient education. Seen this way, the assessment of competence requires a whole series of performances reflecting the interaction of patients and competent physicians, an interaction that varies from patient to patient. This helps avoid situations in which more attention is paid to the detection of abnormal physical findings during examinations than to observing students taking histories from patients and interacting with them.
What should be assessed is not simply whether the student is able to do a specific task when observed by a teacher, but how he or she is assessed by a patient. This is why the clinical examination is broadly regarded as of key importance in assessing a student's competence to practice medicine and as the cornerstone of qualifying examinations. It requires observation of student performance in real practice settings.
In the clinical examination there are three variables: the student, the examiner and the patient. The aim should be to standardize the examiner and the patient so that the student's performance can be taken as a measure of his or her clinical competence. The assessment of clinical competence is usually undertaken in one of two settings: a ward or practice setting, or an examination setting. The ward- or practice-based assessment is the most desirable environment in which to assess a student. It provides the opportunity to make multiple observations in a variety of clinical situations, such as how students perform patient histories and examinations. It may also provide the opportunity for students to receive specific feedback (formative assessment) and appropriate remedial teaching. In some parts of the world, competence is certified by passing so-called examination-based assessments, consisting largely of multiple-choice written tests. In other parts of the world, the traditional clinical examination, consisting of long and short cases based on patients, is seen as a critical component of final examinations. The former approach suffers from a low level of validity, and the latter from a very low level of reliability.
To further improve the quality of assessment procedures, we should be more precise in defining what we aim to assess and should ensure that we introduce methods of assessment that are both valid and reliable. As no single method can adequately measure all aspects of clinical knowledge, skills and problem solving, multi-format assessment conducted in examination settings is essential.
Selection of Evaluation Tools
The first step in choosing measurement instruments is to determine the purpose and desired content of the evaluation, as it is important to choose measurement methods that are congruent with the evaluation questions. The choice of measurement methods and the construction of measurement instruments are a crucial step in the evaluation process because they determine the data that will be collected. If the assessment methods are inappropriate, or if there is an imbalance between the assessment of theoretical knowledge and clinical assessment, unfortunate learning consequences for students and for the curriculum may occur. Equally importantly, if the assessments are of low quality, improper decisions could be made which might be detrimental to the future of a student or to the welfare of the community.
Most evaluations will require the construction of specific measurement instruments such as tests, rating forms, interview schedules, or questionnaires. The methodological rigor with which the instruments are constructed and administered affects the reliability, validity, and cost of the evaluation. It is also necessary to choose measurement methods that are feasible in terms of technical possibilities and available resources.
Planners and users of evaluations should be aware of the various rating biases that can affect both an instrument's reliability and validity. More careful specification of content, a sufficient number of activities performed and observed, and the use of structured and standardized approaches such as checklists and rating forms for marking all improve the quality of clinical assessment.
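A structured checklist can be represented very simply. The items and weights below are hypothetical and not taken from the review; the point is only that explicit, pre-specified criteria turn an observation into a score in the same way for every examiner, which is how such instruments reduce rating bias.

    # Illustrative sketch (not from the review): a hypothetical station checklist
    # with pre-specified weights, scored identically by every examiner.
    CHECKLIST = [
        ("Introduces self and explains purpose", 1),
        ("Elicits presenting complaint systematically", 2),
        ("Asks about relevant risk factors", 2),
        ("Performs focused examination correctly", 3),
        ("Explains findings to the patient clearly", 2),
    ]

    def station_score(observed_items):
        """Percent score from the checklist items the examiner observed."""
        earned = sum(w for item, w in CHECKLIST if item in observed_items)
        total = sum(w for _, w in CHECKLIST)
        return 100.0 * earned / total

    print(station_score({"Introduces self and explains purpose",
                         "Performs focused examination correctly"}))  # 40.0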
With the emergence of complex performance assessment methods, there is a need to re-examine the existing methods of determining standards of performance that separate the competent from the non-competent candidate. Setting standards for performance assessment is a relatively new area of study and, consequently, various standard-setting approaches are currently available for both written and performance tests.
In designing assessment tests, it is necessary to incorporate performance criteria intended to provide evidence that students have successfully completed the task, demonstrated the acquired competencies by responding correctly to the task criteria, and achieved the maximum scoring points. In reality, however, candidates may demonstrate a variety of performance profiles that range from non-competent, to minimally competent, to fully competent. Consequently, the cut-off point on the scoring scale separating the non-competent from the competent, traditionally set at responding correctly to 70% of the items, does not provide robust and valid evidence for pass/fail decisions.
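One widely described alternative to a fixed 70% cut-off is an Angoff-type judgmental procedure, in which each judge estimates, item by item, the probability that a minimally competent (borderline) candidate would respond correctly, and the averaged estimates define the cut score. The review does not mandate this particular method; the sketch below, with invented judge ratings, is meant only to show how a standard can be derived from structured judgments rather than fixed in advance.

    # Illustrative sketch (not from the review): an Angoff-style cut score.
    # Each row is one judge; each column is the judged probability that a
    # borderline candidate answers that item correctly (values are invented).
    judge_estimates = [
        [0.60, 0.55, 0.80, 0.40, 0.70],
        [0.65, 0.50, 0.75, 0.45, 0.65],
        [0.55, 0.60, 0.85, 0.35, 0.75],
    ]

    def angoff_cut_score(estimates, n_items_on_test):
        per_item = [sum(col) / len(col) for col in zip(*estimates)]
        expected_proportion = sum(per_item) / len(per_item)
        return expected_proportion * n_items_on_test

    # Hypothetical 100-item test: the cut score falls near 61, not a fixed 70.
    print(round(angoff_cut_score(judge_estimates, 100)))

The resulting standard reflects the judged difficulty of the actual items, so it can legitimately fall above or below 70% depending on the test.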
To evaluate individuals or educational programs, the measurement methods commonly used are rating forms, questionnaires, essays, written or computer-interactive tests, oral examinations, individual or group interviews and discussions, direct observation, and performance audits. Among the so-called objective methods, the most popular is the OSCE. As students progress from novices to experts, they integrate their learning experiences, and as multiple aspects of their profession are introduced into their training, the complexity of the required tasks increases. Consequently, assessing students' performance requires the proper selection of measurement methods and instruments. Because all measurement instruments are subject to threats to their reliability and validity, the ideal evaluation strategy will employ multiple approaches, including several different measurement methods and several different raters. When all the results are similar, the findings are robust, and one can be reasonably comfortable about their validity.
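One simple way to act on this advice is to place results from different methods on a common scale and check whether they tell the same story. The cohort data and student scores below are invented; the sketch only illustrates standardizing each method's scores and inspecting the agreement between them.

    # Illustrative sketch (not from the review): comparing one student's results
    # across several methods by converting each to a z-score within its cohort.
    import statistics

    def z_score(value, cohort):
        return (value - statistics.mean(cohort)) / statistics.stdev(cohort)

    # Hypothetical cohort results for three methods and one student's scores.
    cohorts = {"MCQ": [55, 60, 65, 70, 75],
               "OSCE": [62, 66, 70, 74, 78],
               "ward rating": [3.0, 3.4, 3.8, 4.2, 4.6]}
    student = {"MCQ": 72, "OSCE": 75, "ward rating": 4.4}

    for method, score in student.items():
        print(method, round(z_score(score, cohorts[method]), 2))
    # Similar z-scores across methods suggest the findings are robust;
    # a large discrepancy flags the result for closer review.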