Use of videotaped consultations in summative assessment of GP trainees

L M Campbell, J G R Howie, T S Murray. BJGP, March 1995


Background. There are many different methods by which trainees may be assessed summatively.
Aim. The objective of the study was to determine if videotaped consultations could be used to identify reliably those general practitioner trainees who have not yet reached acceptable levels of competence.
Method. Videotapes of 10 trainees carrying out normal consultations were assessed by 20 assessors for acceptable competence using a rating scale specifically developed for the purpose.
Results. A principal components analysis showed a strong correlation among the items in the rating scale used, indicating that a single underlying factor accounted for 76% of the overall scores. Agreement between assessors on the scoring of individual consultations, examined for the first consultation of each trainee, was limited. There was much greater consistency with regard to the decision on overall competence. A non-competent trainee would have a 95% probability of being identified by the process as described, using two assessors for each videotape. The assessors had reached firm judgements on each trainee by the time four consultations had been viewed.
Conclusion. The workload involved in producing and analysing the tapes is discussed. Considerations of patient consent are addressed. It is concluded that the use of videotaped consultations appears to offer a feasible and reliable method of summative assessment of general practitioner trainees.

Keywords: consultation process; videotape recordings; vocational training assessment.


Doctors wishing to become principals in general practice are required to obtain a certificate of prescribed or equivalent experience. The requirements for prescribed experience are that the doctor should have completed three years of approved posts after full registration. These posts normally consist of a total of two years in up to four hospital posts and one year in general practice. Where a doctor is deemed to have obtained adequate experience in other situations, for example working abroad, a certificate of equivalent experience is issued. The certificate is awarded by the Joint Committee on Postgraduate Training for General Practice (JCPTGP). The quality and quantity of assessment that takes place during the hospital component of vocational training has been questioned1 and in practice the main responsibility for the issue of the certificate devolves to the general practice trainer. A joint statement by the chairmen of the JCPTGP, the British Medical Association and the General Medical Services Committee has stated that the issue of this certificate should be a statement that the trainee is deemed to be competent.2 The JCPTGP has stated that the issue of this certificate should be determined by a ‘competent system of assessment’ and that a national standard for entry into general practice should be considered.3 Fewer than 1% of trainees are currently refused certificates.

In an attempt to formulate a valid, reliable and externally credible summative assessment programme, the West of Scotland Committee for Postgraduate Medical Education has developed a system based on four components: the trainer’s overall judgment; assessment of videotaped consultations; a multiple choice paper; and an audit project. This framework has been adopted by the summative assessment working group of the JCPTGP.3 In the west of Scotland, failure by a trainee to satisfy the assessors in any one of these components will initiate a referral process in which the evidence is reviewed by additional assessors from outside the region. This referral process exists because the region took the view that an additional review would be appropriate before the refusal of a certificate.

In a previous paper, the system as a whole and the development of the marking schedule and rating scales to be used were described.4 Assessors were asked to look at how well the trainee succeeded in carrying out the tasks of the consultation. The tasks chosen were modified from those of Pendleton and colleagues.5 These tasks were chosen because they looked at outcome rather than process, which the group felt to be appropriate in assessing competence. Assessors’ judgments were shown to be reasonably consistent when a selection of tapes was reviewed by the whole group in a workshop setting. However, no attempt was made in this earlier study to test reliability in a scientific manner. The use of videotaped consultations as an educational resource is well established5 and there is evidence that trainees find this of value.6 Several instruments have been developed to rate consultations.7-9 However, none of these scales has been used in summative assessment of vocational trainees.

The principal objective of this study was to determine how videotaped consultations could be used to help to identify the small number of trainees who may not yet have reached acceptable levels of competence. The specific objectives were:

  • to assess the practicalities and acceptability of videotape production;
  • to identify whether routine consultations were sufficiently challenging to assist assessors in differentiating competent from non-competent trainees;
  • to test out a marking schedule;
  • to measure inter-observer reliability; and
  • to identify the number of consultations and assessors needed to assess each trainee reliably.

Method

A letter was sent to 150 practice-based trainees in the west of Scotland inviting them each to submit a four-hour videotape of routine consultations. The letter contained advice on the techniques of obtaining videotapes of suitable quality and also covered obtaining informed consent from patients. The trainees were provided with a simple log book for the recording of details of each consultation. The function of this log book was partly to enable assessors to find their way through the videotape but it also contained a section in which the trainees could discuss how the consultation had been managed. A total of 80 videotapes were received within six weeks of the invitation. On a subsequent occasion all trainees finishing in July were asked to supply a videotape; 106 out of a possible 107 videotapes were received. One trainer/trainee pair refused to take part. Trainees were advised that there was no need to edit the videotapes so as to submit ‘good’ consultations since on this occasion there were to be no sanctions imposed as a result of the assessment process.

Ten videotapes from the original batch of 80 were selected for the study. In order to test the reliability of the process at the level of questionable competence, one videotape was specifically selected from a trainee who was causing some concern to the trainer. The remainder of the videotapes were chosen randomly. The 25 assessors used were the same group who had taken part in the original pilot project. The majority (14) were trainers, four were associate advisers (the equivalent of course organizers in England) and seven were examiners for the Royal College of General Practitioners. Each of the 25 assessors was sent the 10 videotapes with accompanying log books. Twenty assessors completed the task within the four weeks allocated and the reported results are based on this group. A further four assessors eventually completed the task and one dropped out for unspecified reasons. Assessors also received assessment forms for each trainee. Assessors could record specific details of the consultation under strengths and weaknesses; no score was attached to these evaluations but the assessor could use the comments as an aide-memoire when reaching a final decision. Assessors were asked to rate the trainee’s performance in each consultation in seven areas:

  • Was there any obvious diagnostic or management error?
  • How well did the doctor discover the reasons for the patient’s attendance (attending reason)?
  • How clearly did the doctor define the clinical problem (problem definition)?
  • How well did the doctor tailor the explanation to the needs of the patient (explanation)?
  • How well did the doctor manage the clinical problem (problem management)?
  • How effectively did the doctor use resources of time, investigations and manpower (resources)?
  • How effectively did the doctor relate to the patient (rapport)?

All these attributes apart from the first were then scored on a six-point scale where 1 = definitely refer; 2 = probably refer; 3 = bare pass; 4 = competent; 5 = good; and 6 = excellent. ‘Refer’ was used rather than ‘fail’ since an unsatisfactory performance in the videotape component would lead to a further assessment rather than to a refusal of the certificate of satisfactory completion.

Assessors were instructed to view as many consultations as were necessary to reach a final decision but in any event a minimum of six. Assessors were asked to rate the degree of challenge of each consultation to the trainee and to give an overall judgment of pass or refer to each consultation. It was recognized that an assessor, having noted some aspect of performance, might wish to examine a consultation of a particular type to help clarify judgment. For this reason assessors were not restricted to watching the same series of consultations.

All statistical calculations were carried out using the statistical package SPSS-PL version 4.01. In order to determine the relationships among the six attributes on the rating scale (apart from whether or not there was a clear error in diagnosis or treatment), a principal components analysis was carried out. All consultations were used for this analysis, giving a total of 1176 consultations. This analysis determines how closely the scoring in any one item is related to the scores in the other items. In order to determine inter-rater reliability, the score given by each assessor to the first consultation of each trainee was used to produce a rank order of consultations by total score for each assessor. Correlations between assessors in terms of rank ordering were then assessed.

Results

Assessment scales

Eight assessors recorded that they believed an error in diagnosis or treatment to have occurred in the case of one particular trainee. Errors were also recorded once for each of two other trainees. Other than this, no specific errors were recorded. The principal components analysis of the six other attributes showed that a common underlying factor accounted for 76.3% of the total variance (Table 1).

Table 1. Correlations between the six attributes, determined from all 1176 consultations where data were complete.
Correlation scores between attributes(a)
Attribute Attending reason Problem definition Explanation Problem management Resources Rapport % of variance
Attending reason 1.00 76.3
Problem definition 0.74 1.00 6.2
Explanation 0.69 0.75 1.00 5.6
Problem management 0.70 0.77 0.79 1.00 5.2
Resources 0.66 0.68 0.69 0.76 1.00 3.7
Rapport 0.69 0.69 0.74 0.70 0.67 1.00 3.2

(a) One tailed significance. P<0.001 for all correlations.
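
The share of variance attributed to a single underlying factor can be checked directly from the correlations in Table 1. The sketch below is an illustrative reconstruction in Python with numpy, not the original SPSS analysis: the variance explained by each principal component of a correlation matrix is the corresponding eigenvalue divided by the number of items, and the leading component here accounts for roughly 76%, in line with the figure reported.

```python
import numpy as np

# Correlation matrix reconstructed from Table 1 (attribute order:
# attending reason, problem definition, explanation,
# problem management, resources, rapport).
r = np.array([
    [1.00, 0.74, 0.69, 0.70, 0.66, 0.69],
    [0.74, 1.00, 0.75, 0.77, 0.68, 0.69],
    [0.69, 0.75, 1.00, 0.79, 0.69, 0.74],
    [0.70, 0.77, 0.79, 1.00, 0.76, 0.70],
    [0.66, 0.68, 0.69, 0.76, 1.00, 0.67],
    [0.69, 0.69, 0.74, 0.70, 0.67, 1.00],
])

# Eigenvalues of a correlation matrix sum to the number of items,
# so eigenvalue / n is the proportion of variance per component.
eigenvalues = np.linalg.eigvalsh(r)[::-1]   # sorted descending
explained = eigenvalues / r.shape[0]
print(f"first component explains {explained[0]:.1%} of the variance")
```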

This would indicate that each of the items is in fact measuring a different aspect of the same overall behaviour pattern. A single composite score for each consultation was therefore calculated and used in subsequent analyses. The overall mean score given by all assessors was set at 0. Table 2 shows the number of consultations where complete scores were recorded, and the overall mean score given to all consultations assessed. A positive score indicates relatively high marking and a negative score relatively low marking compared with the other assessors. Assessor R had the highest overall mean score but referred one trainee.

Table 2. Assessors’ overall scores for all the consultations they assessed, the number of trainees referred, and assessors’ perceptions of how challenging the consultations were for the trainees.
No. of consultations perceived by assessors as
Assessor No. of consultations assessed Overall mean score (SD) No. of trainees referred Low challenge Medium challenge High challenge
A 46 -0.17 (1.03) 1 36 8 3
B 60 0.76 (0.64) 0 9 49 8
C 60 0.49 (0.82) 0 10 39 11
D 66 0.06 (0.93) 1 32 28 5
E 53 -0.08 (1.28) 2 20 27 6
F 62 0.37 (1.22) 1 12 36 14
G 62 -0.82 (0.77) 0 6 46 12
H 58 0.14 (0.88) 1 27 30 5
I 56 0.18 (0.57) 1 18 33 7
J 51 0.18 (0.84) 0 13 22 15
K 59 0.36 (0.52) 0 20 37 3
L 58 -0.58 (0.99) 2 19 30 3
M 61 -0.32 (0.87) 1 19 45 12
N 74 -0.30 (0.73) 1 25 28 6
O 61 0.09 (0.89) 2 26 36 7
P 60 -0.21 (0.97) 1 25 45 10
Q 48 -0.09 (0.84) 1 28 15 3
R 60 0.78 (1.36) 1 16 34 11
S 67 0.16 (0.88) 1 15 32 20
T 54 -0.51 (0.76) 2 28 27 4

SD = standard deviation.


Not all assessors viewed the first consultation on each videotape and so the results were based on 18 assessors. Correlations between assessors in terms of rank ordering of consultations were poor with the correlation for each individual assessor with the mean ranking ranging from 0.2 to 0.5. This suggests that using the scoring system to reach overall judgements would be unreliable. Correlations would be likely to improve with increasing numbers of consultations.
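
The rank-order comparison described above can be illustrated with a small Spearman correlation sketch: each assessor's scores and the mean scores are converted to ranks, and the Pearson correlation of the rank vectors is taken. The scores below are hypothetical (the study's raw data are not reproduced here); only the method is shown.

```python
import numpy as np

def ranks(values):
    # Rank positions, 1 = lowest score (assumes no ties, for simplicity).
    order = np.argsort(values)
    out = np.empty(len(values))
    out[order] = np.arange(1, len(values) + 1)
    return out

def spearman(a, b):
    # Spearman rank correlation: Pearson correlation of the rank vectors.
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Hypothetical composite scores for the first consultation of 10 trainees:
one_assessor = np.array([3.2, 4.1, 2.0, 4.8, 3.9, 2.6, 4.4, 3.0, 4.0, 3.5])
group_mean   = np.array([3.0, 4.4, 2.8, 4.1, 3.6, 2.2, 4.6, 3.8, 3.3, 3.7])
print(spearman(one_assessor, group_mean))
```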


Table 2 also shows the number of trainees referred by each assessor. Five assessors did not refer any trainees and no assessor referred more than two trainees. On the other hand, 15 assessors referred the same trainee, two referred another trainee, and two other trainees were each referred by one assessor. Six trainees satisfied all assessors. Using a simple probability calculation, where each of two independent assessors would have 15 chances out of 20 of identifying the first trainee, the pair together would give a combined probability of 95 out of 100. Thus, using two assessors for each videotape, the first trainee would have had a 95% probability of being referred; no other trainee would have had more than a 20% chance of being referred after a minimum of six consultations.
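
The combined probability quoted above follows from assuming the two assessors judge independently: the trainee is missed only if both assessors pass them. A minimal arithmetic check (the exact product is 0.9375, which the text reports as 95 out of 100):

```python
# One assessor refers the borderline trainee with probability 15/20.
p_single = 15 / 20
# The trainee is identified unless both independent assessors pass them.
p_two = 1 - (1 - p_single) ** 2
print(p_two)   # 0.9375
```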


The number of consultations assessed is not identical to the number of consultations rated for challenge, as some assessments contained missing data and were excluded from the consultations used to calculate scores (Table 2). Assessors varied in how challenging they perceived the consultations to be for the trainees. For example, assessor A rated 76.6% of consultations as being of low challenge while assessor G rated 9.4% as low challenge.

Cumulative ratings

In order to determine how many consultations needed to be assessed, the occasions on which assessors made an important change in their overall opinion, that is, from pass to refer or vice versa, were analysed. After four consultations no assessor changed an overall judgement from pass to refer or vice versa.

Discussion

The use of real consultations as part of a process to assess the fitness of trainees to receive a certificate of satisfactory completion has obvious face validity. There is also evidence that paper-based assessments using multiple choice or modified essay papers are not good predictors of actual performance.10 An alternative to real consultations is the use of simulated patients.11-15 Simulated patients, when well trained, have been reported to be reliable for assessment purposes,11-15 but problems of patient consistency have been noted with this method.16 Simulated patients are used extensively in North American medical schools for undergraduate assessment.

Videotape assessment by assessors from outside the practice has considerable potential benefit with regard to objectivity and external credibility. However, in order for this form of assessment to be worthwhile it must produce results of adequate reliability while taking up only a reasonable amount of trainee and assessor time.

Trainees appeared to have no difficulty obtaining patient consent. Disquiet was expressed by some trainers, both to the authors during the study and to medical journals, about consent, confidentiality and the effects of video recording on trainee and patient behaviour,17 and one trainer/trainee pair refused to take part for these reasons. If trainees’ performance had been adversely affected by the presence of the camera, some reference to poor performance would have been expected in the trainee log book; in no consultation judged to be unsatisfactory by the assessors was this referred to in the log book. A small survey carried out in the west of Scotland region, as yet unpublished in detail,18 showed that a number of patients (none of whom had ever been asked to be videotaped) felt unhappy about their possible reactions if they were to be asked to take part in video recording of consultations. The authors of this survey called for a complete ban on the use of videotaped consultations under any circumstances. Other published work does not support their conclusions.19,20 Such a ban would be to the detriment of education and assessment of both trainees and principals. Inevitably some patients will feel pressurized to take part. One paper has shown that most patients were happy to give consent to video recording,21 although it has been suggested that some patients would be unhappy.22 Perhaps the best solution is to make it as easy as possible for the patient to withhold consent or to have the videotape erased after recording. Draft guidelines have been produced which cover these areas.23 Consent forms based on these guidelines are now in general use in this region. This appears to have produced some increase in the refusal rate but all trainees are still finding it possible to record enough patients.
The region intends to follow the General Medical Council guidelines (which are understood to be almost identical to the guidelines by Southgate 23) on tapes leaving the practice. These guidelines will increase administration but should not materially hinder the use of videotapes in summative assessment.

Cameras are available to all trainers’ groups and many practices have their own cameras. No trainee reported any difficulty gaining access to a camera. Of all 186 videotapes produced, 90% were technically usable. The sound quality was usually adequate but was much enhanced by the use of a desk-top microphone. In order to complete the log book it was necessary for the trainee to view the videotape, taking up a maximum of four additional hours. In view of the large number of trainees who produced videotapes, the process is clearly practicable. No assessor found the full tape length of four hours to be necessary to reach a judgment. A reduction to three hours, or 15 consultations, would still give an adequate sample.

The 24 assessors who completed their marking tasks found the workload acceptable. Only one assessor dropped out. The assessors reported spending a mean of 10 hours carrying out the process. This would be the same amount of time required to assess all 104 trainees in the region completing at the peak time at the end of July using two markers per trainee. The workload for the smaller group finishing in January would be much less. All assessors who completed the task expressed a willingness to continue in post.

Assessors were deliberately allowed to select for themselves which consultations to view. This has advantages in that assessors can attempt to seek out particular areas of competence to examine. However, this approach does produce difficulties in the analysis since it introduces an additional variable in that not all assessors looked at the same consultations.

Judging from the correlations among the components of the rating scale, assessors’ judgment of trainee performance in any given consultation was largely consistent across the different parameters. This finding is understandable since minimal competence is less likely to be case specific than higher order skills. For example, it seems reasonable that a trainee who attempts to take an adequate history will also attempt to explain the problem appropriately. The six-component scale was therefore not helpful in increasing reliability and so has been modified. For future use there will be a simple three-parameter scale: listening – did the trainee identify adequately the patient’s problems?; action – did the trainee investigate/manage the patient’s problems appropriately?; and insight – was the trainee aware of the strengths and weaknesses of the consultation? The third item would be completed in the light of the trainee’s comments in the log book. These three new scales will be tested in the future to determine if they are discrete and usable.

The inter-rater reliability based on the first consultation was not impressive, but this observation was based on a small number of consultations for each trainee compared with the total of 1176 consultations. It has been shown elsewhere that large numbers of consultations would be required to produce reliable results using a scoring system.9 Examiners varied considerably in the scores given to each candidate and in their range of scoring; the variation demonstrates the expected difference between hawks and doves (those who mark low and those who mark high). However, no correlation was found between those who scored high or low and the decision to pass or refer. There was a high level of agreement among assessors with regard to the overall decision to pass or refer.

Following discussion among the assessors, the first trainee was considered to be just below an acceptable level of competence. This would indicate that the 95% probability of being referred would apply to all trainees who fell below the level of acceptability. If the assessors had considered that trainees more competent than the first trainee should have been referred, the reliability of the process would not have been confirmed. This seems acceptable, particularly as this is just one component of a four-section assessment. Despite the fact that the assessors looked at a total of 1176 consultations, the number of trainees assessed was still small. Further analysis of a large number of trainees is in progress to see whether the system can identify consistently those trainees not yet competent.

This study did not look at intra-rater variability. It would have been relatively easy to examine this in single consultations, but the main outcome variable was the overall pass or refer judgment for each trainee. There was a strong possibility that assessors would remember which trainees they had failed on the first occasion, thus invalidating any measure of intra-rater reliability. Assessors did not want to come to a decision concerning the competence of a trainee on the strength of viewing any single consultation. It is unlikely that any single consultation would test the range of competence needed in general practice; indeed, a low-challenge consultation may require little in the way of competence. That different competences are required in different consultations, and that one consultation may give a better guide to competence than another, has implications for any rating system which involves giving scores for each consultation and then producing an aggregate mark by some form of arithmetical manipulation.7,9 Since the rating system used here involved recording interim judgments after each consultation and then producing a final decision based on the cumulative impression, this difficulty was avoided. Of course, the fact that the assessors were reasonably consistent does not mean that their decisions were correct. Studies of outcome validity would require the long-term follow up of large numbers of trainees who had been deemed competent by different assessment methods. However, such studies would be difficult to mount and would be of doubtful utility since, after training, doctors change and develop during their careers.

Assessors showed considerable variation in the degree of challenge they ascribed to the consultations viewed. Not all assessors viewed the same consultations which may explain some of the difference. A possible further explanation of this is that the trainees were able to accept or reject potentially challenging situations, either by probing the patients’ problems or dealing with them in a superficial manner. Some of the assessors may have been evaluating the potential challenge of the consultation while others may have based the judgment on the explicit challenge contained in the consultation which actually took place. Of consultations overall 52% were rated to be of at least moderate challenge. This would indicate that routine consultations are of sufficient challenge to enable assessment of performance to take place.

When the number of consultations required to be viewed before a confident judgment can be made was studied, no examiner changed from pass to refer or vice versa in the fifth or subsequent consultation. Most assessors viewed only six consultations and had presumably decided that further viewing would not affect their decision, and some felt sufficiently confident of their decision that they viewed fewer than the stipulated six consultations. Assessors tended to view a larger number of consultations in situations where the eventual decision was to refer, although in no case did this extra viewing change the result.

The question arises as to what would be the results if trainees were to attempt to produce a videotape consisting of ‘good’ consultations. Clearly if trainees were to be refused a certificate of satisfactory completion as a result of the process there would be a possible incentive to edit videotapes in this way. We intend to report later on the first 150 trainees to undertake the full assessment process and will discuss this possibility. However, it should be remembered that competence as opposed to performance is being assessed. True performance could only be assessed if trainees were unaware they were being assessed; there is a difference between what doctors can do as opposed to what they routinely do.

Although the assessors showed limited agreement on the individual components of rating scales and on their rating of individual consultations, they nevertheless showed an acceptable level of agreement on the ultimate issue of whether or not the trainees were competent and their decisions on this became stable after observing four consultations. A continuing monitoring programme to identify those examiners whose results do not correlate well with their peers and an exploration of the reasons for this should help to improve reliability as will further training and practice for the assessors. The use of two assessors per videotape appears to produce adequate reliability. The workload involved is feasible and the system is now in full scale use in the west of Scotland region. More than 200 trainees have now gone through the process and it is hoped that the results will be published in due course.

References

  1. Carney T. A national standard of entry into general practice [editorial]. BMJ 1993; 305: 1449-1450.
  2. Irvine DH, Gray DJP, Bogie IG. Vocational training: the meaning of ‘satisfactory completion’ [letter]. Br J Gen Pract 1990; 40: 434.
  3. Joint Committee on Postgraduate Training for General Practice. Assessment working party. Interim report. London: JCPTGP, 1992.
  4. Campbell LM, Howie JGR, Murray TS. Summative assessment; a pilot project in the west of Scotland. Br J Gen Pract 1993; 43: 430-434.
  5. Pendleton D, Schofield T, Tate P, Havelock P. The consultation: an approach to learning and teaching. Oxford: Oxford University Press, 1984.
  6. Campbell LM, Murray TS. Trainee assessment: a regional survey. Br J Gen Pract 1990; 40: 507-509.
  7. Hays RB. Assessment of general practice consultations: content validity of a rating scale. Med Educ 1990; 24: 110-116.
  8. Fraser RC, McKinley RK, Mulholland H. Assessment of consultation competence in general practice: the Leicester assessment package. In: Harden RM, Hart IR, Mulholland H (eds). Approaches to the assessment of clinical competence. Dundee: Centre for Medical Education, 1992.
  9. Cox J, Mulholland H. An instrument for assessment of videotapes of general practitioners’ performance. BMJ 1993; 306: 1043-1046.
  10. Rabinowitz HK. The modified essay question: an evaluation of its use in a family medicine clerkship. Med Educ 1987; 21: 114-118.
  11. Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination (OSCE). Dundee: Association for the Study of Medical Education, 1979.
  12. Norman GR, Neufeld VR, Walsh A, et al. Measuring physicians’ performance by using simulated patients. J Med Educ 1985; 60: 925-934.
  13. Rethans JJE, van Boven CPA. Simulated patients in general practice: a different look at the consultation. BMJ 1987; 294: 809-812.
  14. Tamblyn RM, Klass DJ, Schnabl GK, Kopelow ML. The accuracy of standardized patient presentation. Med Educ 1991; 25: 100-109.
  15. Vu NV, March MM, Colliver JA, et al. Standardised (simulated) patients’ accuracy in recording clinical performance check-list items. Med Educ 1992; 26: 99-104.
  16. Colliver JA, Vu NV, Markwell SJ, Verhulst SJ. Reliability and efficiency of components of clinical competence assessed with five performance-based examinations using standardised patients. Med Educ 1991; 25: 303-310.
  17. Baird AG, Gillies JCM. Videotape assessment is threatening [letter]. BMJ 1993; 307: 60.
  18. Bain JE, Mackay NSD. Videotaping general practice consultations [letter]. BMJ 1993; 307: 504.
  19. Pringle M, Robins S, Brown G. Assessing the consultation: methods of observing trainees in general practice. BMJ 1984; 288: 1659-1660.
  20. Pringle M, Stewart-Evans C. Does awareness of being videorecorded affect doctors’ consultation behaviour? Br J Gen Pract 1990; 40: 455-458.
  21. Martin E, Martin PML. The reactions of patients to a video camera in the consulting room. J R Coll Gen Pract 1984; 34: 607-610.
  22. Servant TB, Matheson JAB. Video recording in general practice: the patients do mind. J R Coll Gen Pract 1986; 36: 555-556.
  23. Southgate L. Guidelines on the use of videotaped consultations. London: Royal College of General Practitioners, 1993.

Acknowledgements

The authors thank the Scottish Council for Postgraduate Medical and Dental Education for supporting and funding the study. We also thank the assessors for their time and enthusiasm and the trainees and trainers of the West of Scotland region for their cooperation.

Address for correspondence

Dr L M Campbell, Department of Postgraduate Medicine, University of Glasgow, Glasgow G12 8QQ.
