The Testing/Assessment Phase of Music CAI -

Making the Process More Reliable and Efficient

Dennis Bowers

Center for Instructional Technology in the Arts,

Meadows School of the Arts,

Southern Methodist University

 

The purpose of the research was to determine whether a computer-based adaptive testing procedure termed the Sequential Probability Ratio Test (SPRT; Wald, 1947) could be applied to a range of aural skills mastery assessments, and whether the inclusion of multiple difficulty levels in the procedure resulted in changes in test efficiency over a method which acknowledged only a single, general difficulty level. Results indicate that the SPRT appears to be a viable candidate as a procedure to assess mastery across a range of aural skills, and further, that the SPRT is more efficient than conventional, fixed-length tests when assessing mastery of these skills. Test lengths varied from 10-20 items, compared to fixed-length tests of 48 items, indicating efficiency gains in the range of approximately 55-80% with no loss of accuracy. Included in this paper are appendices that provide all information necessary to implement the SPRT.

 

Computers have proliferated in our nation’s elementary, secondary, and post-secondary schools. While we may not yet regard computers as completely ubiquitous or commonplace, a recent survey indicates that their number increased some 50-fold during the decade of the 1980s, and this number is expected to rise at a rate of 200,000-300,000 per year for the foreseeable future. Further, most of the machines currently in place are used to deliver, or support the delivery of, instruction (Becker, 1991). Among the various components of instruction that have benefited from this increase in available technology, a fair degree of recent interest has centered on the use of the computer for adaptive testing.

Computerized Adaptive Testing (CAT) may be considered a limited form of artificial intelligence, wherein the computer not only delivers test items and scores the responses to these items, but also alters the subsequent form of the test on the basis of examinee performance. These alterations typically take the form of changes in subsequent item difficulty or overall test length, depending on the intent of the test. In either case, the test is altered based on a "real-time" assessment of examinee performance.

Until the advent of computer technology, true adaptive testing procedures were impossible because of the requirement that a large number of somewhat complex calculations be performed between items. However, the availability of powerful microprocessors, in combination with the creation of algorithms that somewhat simplify the required number and complexity of calculations, has provided us with the means to quickly deploy adaptive testing procedures. A perusal of the ERIC database, for example, indicates that over 40 publications or presentations have appeared over the past three years on the topic of computerized adaptive testing (CAT).

Investigations into the efficacy of CAT have typically fallen into two categories: (1) studies based on item-response theory (IRT; Lord & Novick, 1968), in which the intent is to maximize the information about a given examinee within the context of a specified test length, or (2) studies based on Bayesian statistics, where the intent is to make a decision about examinee performance relevant to a specified criterion level (e.g., mastery) in as efficient a manner as possible. IRT-based algorithms, while yielding more information about an examinee’s true performance level, are somewhat tedious to employ, since these methods typically require a large number of prior test item administrations in addition to extensive between-item calculations. Many of the IRT-based methods reported in the testing literature, for example, require in excess of 1000 prior administrations of each test item before the overall test yields a reliable assessment. These prior data are subsequently translated into a series of tables that must be available during the test administration -- so in addition to the time requirements involved in administering and collating these data, there are also storage and retrieval demands imposed. While some methods based on Bayesian statistical procedures also involve the real-time calculation of continuous probability distributions (in other words, many calculations and high overhead demands for data storage and retrieval), the method investigated in the research reported here is based on a Bayesian approach that requires no prior administration, and assumes only that the items (a) are broadly representative of the domain being tested and (b) are somewhat similar in level of difficulty. Hence, the requirements for data storage and retrieval are also minimal.

While the majority of extant studies that have investigated applications of CAT are based on more traditional classroom testing paradigms (text-based, multiple-choice items), some evidence exists of the potential value of CAT to music skills. Vispoel & Twing (1989) and Vispoel & Coffman (1992) report studies in which an adaptive test paradigm was found to be equal or superior to conventional, fixed-length tests of aural music skills in terms of both reliability and validity. In these studies, tonal memory tasks were employed as content, and the authors found convincing evidence that adaptive test procedures resulted in reliable and valid assessments of music ability. These studies, however, employed an IRT-based approach, and the intent of testing was not to determine mastery in a given domain, but to provide an estimate of individual student ability along a continuum of difficulty.

The research reported in this paper seeks to extend an earlier study that investigated the application of an alternative adaptive procedure to mastery decisions in the domain of music listening skills. The procedure is termed the sequential probability ratio test (SPRT) (Wald, 1947), and its efficacy in the area of mastery tests for text-based, multiple-choice test items has been established in a number of earlier studies (Frick, 1990; Frick, 1991). Consistent with these earlier studies, an application of the SPRT to an aural skills assessment covering the identification of simple melodic intervals resulted in mastery tests that averaged 66% fewer items than conventional fixed-length tests, with no loss of accuracy in assessment (Bowers, 1991). In other words, the study demonstrated that the power of the computer could be harnessed to make accurate and efficient mastery decisions in the domain of aural skills.

Some limitations were noted in the earlier study, however. First, the study used the domain of simple melodic intervals as test content, and employed a single level of performance as an indicator of mastery. Acknowledging that items in this domain vary to some degree in difficulty (Blombach & Parish, 1988), the question arises as to whether greater efficiencies might occur if the data were to be examined using an appropriate range of values as difficulty indices when compared to a single index of difficulty. An associated question is whether the deployment of a range of difficulty indices has any effect on the accuracy of assessment, i.e., to what degree do mastery decisions derived through the use of the SPRT agree with mastery decisions derived from conventional fixed-length tests?

Another issue left unanswered in the first study concerns the range of skills for which this testing procedure may be appropriate -- are similar efficiencies obtained in other, related domains of aural skills such as harmonic interval identification or triad quality identification? Answers to these questions represent several steps toward the ultimate decision about whether this form of CAT has value that extends beyond the laboratory and into the classroom.

In summary, this study addresses three primary research questions:

1. Does the application of multiple difficulty levels in a mastery test covering a single domain of aural skills result in test efficiency different from that achieved when single indices of mastery and nonmastery are used?

2. Does the application of multiple difficulty levels in a mastery test covering a single domain of aural skills result in test accuracy different from that achieved when single indices of mastery and nonmastery are used?

3. Do similar efficiencies result in a mastery test of related skills such as harmonic interval identification or triad quality identification?

Pertinent issues on the assessment of mastery

As teachers, we are familiar and comfortable with the concept of a "cut-off" point, i.e., the point at or above which we declare "success" and below which we declare "failure." A number of authors and researchers have noted, however, that such a dichotomy, even in a mastery situation or criterion-referenced test procedure, is somewhat artificial. For example, assume that we decide as teachers and pedagogues that a score of 85% on a test represents a passing score. We have little subsequent difficulty in saying that a student who earns a 95 has passed -- or that a student who earns a 55 has not passed. But what about the student who earns an 86? Or the student who earns an 83? Are we as comfortable in classifying these students as we are when faced with extremely high or low scores?

The procedure described in this research addresses the issue by employing two values: one that denotes a lower bound for mastery, and a separate value that denotes the upper bound for nonmastery. Between these values lies a gray area -- a "zone of indifference" -- within which we refrain from making a decision about mastery or nonmastery. At the core of the SPRT is a set of rules that says, basically, "wait until there is sufficient confidence that the performance level falls either (a) at or above the lower limit for mastery or (b) at or below the upper limit for nonmastery." For purposes of brevity, the details of exactly how the SPRT operates are given in two appendices at the end of this report. Appendix A provides an algorithm for the deployment of the SPRT, and Appendix B is an excerpt from an earlier article that simulates a "run-through" of a three-item test under the SPRT. Readers who wish either to examine the mechanism of the SPRT or to actually deploy it in a computer program are referred to these appendices.

Method

Subjects

Subjects for the studies reported here were freshman and sophomore students enrolled in the aural skills component of the music theory curriculum at Southern Methodist University. Studies were carried out over several successive semesters. Subjects took part in these studies in lieu of weekly computer-based aural skills practice assignments.

Computer hardware and software

Stimulus presentation, data collection, and data analysis were completed with programs developed with HyperCard™ system software. These programs or "stacks" run on the Apple Macintosh computer. Two "stacks" were prepared:

1. A stack for stimulus presentation and data collection. This stack was enhanced with HyperMIDI, a set of software extensions (published by Opcode systems) that allow the control of MIDI devices from a HyperCard™ stack.

2. A stack for data analysis. This stack was used to simulate a test administration using data gathered in the stimulus presentation stack, apply the SPRT, and store the results.

For the studies reported here, subjects visited the computer lab in the Hamon Fine Arts Library at Southern Methodist University. This facility houses a variety of Macintosh computers employing the 68030 microprocessor, each with a minimum of 4 MB of RAM. Each computer is equipped with a Passport MIDI Interface, a Kawai K1-II MIDI keyboard, and stereo headphones.

Experimental procedures

Subjects were directed by their respective teachers to complete the data collection portion of these studies as part of their normal ear-training assignments in the computer lab. Each subject reported to the computer lab (at a time of his/her choosing), and informed the lab monitor of participation in the study. At this point, the lab monitor handed each subject a disk accompanied by a sheet of printed instructions on how to insert the disk and start the computer, and what to do at the conclusion of the session. The subject was then directed to a computer, and instructed to start the program. When the program commenced, on-screen instructions were provided on how to use the program. In addition, the program provided a "sound-check" routine to ensure that subjects (a) turned on the attached MIDI keyboard, (b) put headphones on, and (c) could hear the output of the synthesizer. Lab monitors were instructed to watch the subjects begin this procedure to make sure that any problems in understanding or operation were addressed. No such problems occurred in any of the data collection sessions.

After the on-screen instructions were provided, subjects proceeded to complete the test. In all cases, the test consisted of 48 items, randomly chosen based on parameters in the program. Items were performed on a Kawai K1-II synthesizer using a timbre that emulated a Fender Rhodes electronic piano. Items were constructed to fall within the pitch range G2-G5 in order to avoid confounds that might result from using extreme ranges. For each item, the subject heard a stimulus item and was then directed to identify the stimulus by clicking an appropriate, on-screen "button." After a response, the program proceeded to (a) store performance data for that item, and (b) present the next item.

At the conclusion of the session, the program provided an on-screen report card detailing the percentage score for correct items, and further directed the subject to press a "quit" button on screen, turn the computer off, and return the disk to the lab monitor. When the disk was returned to the monitor, the subject was directed to sign a check-out sheet that was also signed by the lab monitor. At the conclusion of each study, these sheets were distributed to aural skills instructors so that proper credit could be given to students for participation in the study. Credit was not assigned based on quality of performance, i.e., percentage score -- it was assigned only on the basis of participation in the study. Pertinent details of each study are presented in the following sections.

For purposes of testing the SPRT and comparing the outcomes with a traditional test format, a test-simulation program was prepared by the author. This program used the data gathered by the original stimulus presentation (test) program detailed above. For each simulation, the program would recreate the performance of a given subject by examining the sequence of correct and incorrect responses, and applying the SPRT after each item. When a point was reached at which the program could make a mastery/nonmastery decision that fell within acceptable parameters, the simulation for that subject ceased and pertinent data (number of items, decision of mastery or nonmastery) were stored.

Studies 1 and 2

Seventy-six subjects took part in studies 1 and 2. The test content for these studies was the domain of simple, melodic intervals, presented in either ascending or descending form. Items were chosen randomly based on an algorithm designed and implemented by the author. The results of study 1 have appeared in an earlier publication (Bowers, 1991), and are reprinted here for purposes of comparison. Study 2 was conducted in order to examine the effects of multiple difficulty levels (and associated probabilities) on test length and accuracy. For study 2, a new simulation was carried out on the original, fixed-length test data from study 1. Pertinent differences between the simulation carried out in study 1 and the simulation in study 2 are as follows:

a. In study 1, a single value of .85 was designated as the probability of mastery given a correct response and a single value of .60 was designated as the probability of nonmastery given a correct response. In other words, the general probability of choosing an item from the universe of related items that a master would answer correctly was set at .85; likewise, the general probability of choosing an item from the universe of related items that a nonmaster would answer correctly was set at .60. For the second study, the real performance data from the fixed-length test were examined to derive a series of values representing varying difficulty levels of the items. On the basis of this examination, items were separated into three classes. Easy items were those which masters had a .95 probability of answering correctly and nonmasters had a .70 probability of answering correctly, and included m2, M2, P4, and P8. Moderate items were those which masters had a .85 probability of answering correctly and nonmasters had a .60 probability of answering correctly, and included m3, M3, and P5. Finally, difficult items were those which masters had a .75 probability of answering correctly and nonmasters had a .50 probability of answering correctly, and included tritone, m6, M6, m7, and M7.

b. The simulation program that actually applied the SPRT to examinee performance data was adjusted to account for the presence of varying difficulty levels in study 2.
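The three difficulty classes described in (a) lend themselves to a small lookup table. The following Python fragment is a minimal sketch of one way such a table might be organized; it is not the HyperTalk code used in the studies. The names DIFFICULTY_CLASSES, INTERVAL_CLASS, and probabilities_for are hypothetical, and "TT" is an assumed label for the tritone.

```python
# Illustrative sketch only; all names are hypothetical.
# Probability values and interval groupings are those reported for study 2.
# Each class maps to (P(correct | master), P(correct | nonmaster)).
DIFFICULTY_CLASSES = {
    "easy": (0.95, 0.70),
    "moderate": (0.85, 0.60),
    "difficult": (0.75, 0.50),
}

# "TT" is an assumed label for the tritone.
INTERVAL_CLASS = {
    "m2": "easy", "M2": "easy", "P4": "easy", "P8": "easy",
    "m3": "moderate", "M3": "moderate", "P5": "moderate",
    "TT": "difficult", "m6": "difficult", "M6": "difficult",
    "m7": "difficult", "M7": "difficult",
}


def probabilities_for(interval):
    """Return (p_master_correct, p_nonmaster_correct) for an interval label."""
    return DIFFICULTY_CLASSES[INTERVAL_CLASS[interval]]


if __name__ == "__main__":
    for name in ("M2", "P5", "TT"):
        print(name, probabilities_for(name))
```

Under this arrangement, a simulation would look up the pair of probabilities for each presented item before updating the mastery and nonmastery probabilities, which is the adjustment described in (b).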

Results of the comparison between studies 1 and 2

Table 1 contains a summary of the results of the computer simulation of the SPRT for studies 1 and 2. A graphical summary of these results appears in figure 1.

Table 1. SPRT test lengths and assessment accuracy for studies 1 and 2.

           Average test length/count
           Master       Nonmaster     Accuracy
Study 1    19.95/37     14.57/35      100% (4 no-decision)
Study 2    18.73/37     14.29/35      100% (4 no-decision)

Figure 1. Graphical comparison of average item counts - single vs. multiple difficulty levels.

As can be seen from table 1 and figure 1, there is virtually no difference in the item counts for studies 1 and 2. In this case, the presence of a series of difficulty levels has no apparent effect on test efficiency. Likewise, test accuracy was unaffected; a comparison of the decision reached by the SPRT with the decision for the same examinee reached by examining scores for the complete 48-item test demonstrates that, in both studies, 100 percent agreement was achieved. Two procedural notes are in order at this point.

1. One might question how a comparison of decisions is carried out, since the results of performance on the fixed length test might place an examinee in the "zone of indifference" discussed earlier. In such a case, is the examinee declared a master or a nonmaster? Frick (1991) has suggested that, for purposes of comparison, one must choose a single point, at or above which a subject is declared a master, and below which the subject is declared a nonmaster. Further, researchers have suggested that the most appropriate point to use when comparing the results of a fixed length test with the results of an adaptive test is the point that falls exactly midway between the chosen (or average, in the case of multiple difficulty levels) lower limit for mastery and the chosen (or average) upper limit for nonmastery. For the studies reported here, this point would be midway between 85% and 60%, or 72.5%. Subjects whose scores equal or exceed 72.5% at the conclusion of the fixed length test were declared as masters for the sake of comparison. Subjects whose scores fall below this point were declared nonmasters.

2. Note that in both studies, the SPRT was unable to reach a decision for 4 subjects by the time the complete item pool (48 items) was exhausted. Such a case occurs when examinee performance is sufficiently random to obscure any tendency toward mastery or nonmastery. If such cases are regarded as errors, this falls within expected error parameters as designated in this deployment of the SPRT (see the explanation of the SPRT algorithm in appendix A).
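The comparison rule in note 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the analysis code used in the studies; the function names comparison_cutoff and fixed_length_decision are hypothetical.

```python
# Illustrative sketch only; function names are hypothetical.
def comparison_cutoff(mastery_levels, nonmastery_levels):
    """Midpoint between the (average) lower limit for mastery and the
    (average) upper limit for nonmastery."""
    avg_mastery = sum(mastery_levels) / len(mastery_levels)
    avg_nonmastery = sum(nonmastery_levels) / len(nonmastery_levels)
    return (avg_mastery + avg_nonmastery) / 2


def fixed_length_decision(num_correct, num_items, cutoff):
    """Classify a fixed-length score: master at or above the cutoff."""
    if num_correct / num_items >= cutoff:
        return "master"
    return "nonmaster"


if __name__ == "__main__":
    # Single difficulty level (study 1) and three levels (study 2).
    single = comparison_cutoff([0.85], [0.60])
    multiple = comparison_cutoff([0.95, 0.85, 0.75], [0.70, 0.60, 0.50])
    print(round(single, 4), round(multiple, 4))
    print(fixed_length_decision(35, 48, single))
    print(fixed_length_decision(34, 48, single))
```

On a 48-item test with a cutoff of 72.5%, a score of 35 correct (about 72.9%) is classified as mastery and a score of 34 correct (about 70.8%) as nonmastery.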

Study 3

Thirty-seven subjects took part in study 3. The test content for this study was the domain of simple, harmonic intervals. As in studies 1 and 2, items were chosen randomly based on an algorithm designed and implemented by the author. The fixed-length test consisted of 48 items. A summary of the item counts for masters and nonmasters under the SPRT, along with a statement of accuracy, appears in table 2.

Table 2. SPRT test lengths and assessment accuracy for study 3.

           Average test length/count
           Master       Nonmaster     Accuracy
Study 3    15.91/11     10.04/26      100% (0 no-decision)

In this study, the SPRT proved, again, to be highly accurate when compared to the results of the 48-item fixed-length test. In addition, even greater efficiencies appear to have been obtained; note that the average test length for masters in study 3 was approximately 16 items, whereas the average test lengths for masters in studies 1 and 2 were 19.95 and 18.73 items, respectively. While it is not the intent of this report to examine specific differences in item counts, one might speculate that the sample of subjects was distributed in a somewhat more bimodal fashion on this task than were subjects for studies 1 and 2 (the reader is reminded that the studies took place at different points in time -- hence the sample for study 3 is, by virtue of selection, somewhat different from the sample employed for studies 1 and 2).

Study 4

Seventeen subjects took part in study 4. The test content for this study was the domain of closed position triads. As in studies 1, 2 and 3, items were chosen randomly based on an algorithm designed and implemented by the author. The fixed-length test consisted of 48 items. A summary of the item counts for masters and nonmasters under the SPRT, along with a statement of accuracy, appears in table 3.

Table 3. SPRT test lengths and assessment accuracy for study 4.

           Average test length/count
           Master       Nonmaster     Accuracy
Study 4    18.00/7      25.20/1       100% (0 no-decision)

Again, the SPRT proves to be highly accurate when compared with the fixed-length test. One surprising element of this study was the average test length under the SPRT when compared to previous studies. In a situation where each item is, for all practical purposes, analogous to a four-item multiple-choice question, one would expect the average test length to be shorter than in a situation where each item has 12 available response choices (as was the case with studies 1, 2 and 3). It seems contrary to intuition that the average test lengths for this task were considerably longer than any previously noted. The only logical explanation seems to lie in an examination of the subject pool. This study had considerably fewer subjects than any of the previous studies, and subjects in this study were, for the most part, first-semester freshmen who had just begun to study triad quality. The results of this study seem to demonstrate less diversity among masters and nonmasters, in combination with less learning -- examinees who are just learning a skill seem likely to perform with less consistency, and longer tests appear to result.

Discussion

Overall, the results of the studies reported here provide us with some tentative answers to the questions posed at the beginning of the report. First, the inclusion of multiple probabilities to characterize varying difficulty levels seems to have little effect on test lengths under the SPRT approach. These results are consistent with the observation by Frick (1991), that the SPRT is somewhat robust with respect to items that vary in difficulty. In other words, the inclusion of multiple difficulty level probabilities (and subsequent additional code in the computer program) appears unnecessary. To temper this finding, however, one must remember that, overall, mastery probabilities differed by only 20% across their entire range -- if a series of more disparate difficulty levels were employed, results might very well differ from those reported here.

Second, the deployment of multiple difficulty level probabilities seems to have no effect on test accuracy. In both studies 1 and 2, mastery decisions reached via the SPRT algorithm were in 100% agreement with those reached on the basis of a conventional, fixed-length test score. Given the findings discussed above, this fact is of limited value. Should one desire to employ the SPRT under more diverse difficulty levels, however, it seems that concerns about subsequent accuracy should be minimal.

Finally, it also appears that CAT may be employed with some confidence across a range of aural skills. Results of studies 3 and 4 indicate that the SPRT is effective not only in the domain of melodic interval identification, but has equivalent value in the related domains of harmonic interval identification and triad quality identification as well. In conjunction with the findings noted above, it would appear that the SPRT is a viable candidate for an overall CAT treatment of aural skills mastery testing.

A number of questions still remain about the efficacy of the SPRT in the domain of aural skills. In particular, two areas appear to be prime candidates for investigation. First, one must recall that the studies reported here applied the SPRT to a mastery testing situation. Although there is much evidence to support a mastery approach to learning and assessment in many domains, the question should still be addressed as to whether the SPRT could be modified to function in a testing paradigm where the outcome is some levelled measure of performance, i.e., the more traditional A-B-C-D-F grading scheme.

Second, the experiments reported here are geared only toward assessment. Another important question is whether the SPRT could be employed in a practice setting, where the outcome is more likely to be suggestions for remediation or enrichment. In other words, can a routine like the SPRT be a partner not only in the assessment phase, but in the learning phase as well? Issues of feedback, reinforcement, and sequencing are likely to affect, to a large degree, the results achieved with this process when used as an "engine" for a practice setting. These questions await future research.

The recent advent of high-power computing environments supports the development and deployment of rich, mediated instructional treatments. Indeed, the current interest and excitement that centers around multimedia presentations indicates we are at the "front door" of a new generation in Computer Assisted Instruction. Along with these engaging, interactive environments, however, we must remember that we are looking at an interface -- and beneath the interface must be a guidance mechanism that is based on sound research and practice. The studies reported here seek to add to that foundation of guidance, and to provide one means whereby we as learning facilitators can take greatest advantage of the rich array of instructional support tools at our disposal.

REFERENCES

Blombach, A., & Parish, R.T. (1988). Acquiring aural interval identification skills: Random vs. ordered grouping. Music Theory and Pedagogy, 2(1), 113-131.

Bowers, D.R. (1991). Computer-based adaptive testing in music research and instruction. Psychomusicology, 10(1), 49-63.

Frick, T.W. (1990). A comparison of three decision models for adapting the length of computer-based mastery tests. Journal of Educational Computing Research, 6(4), 479-513.

Frick, T.W. (1991). A comparison of an expert systems approach to computerized adaptive testing and an item response theory model. Paper presented at the 1991 Association for Educational Communications and Technology National Convention, Orlando, FL.

Lord, F. & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Vispoel, W.P. & Coffman, D.D. (1992). Computerized adaptive testing of music-related skills. Bulletin of the Council for Research in Music Education, 112, 29-49.

Vispoel, W.P. & Twing, J.S. (1989). A comparison of the efficiency, reliability, and validity of adaptive and conventional listening tests. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Wald, A. (1947). Sequential Analysis. New York: Wiley & Sons.

Appendix A: How to deploy the SPRT

I. Values required before the test begins:

1. Alpha - analogous to experimental type I error, refers to the chances of misclassifying a master as a nonmaster. Suggested value is .025.

2. Beta - analogous to experimental type II error, refers to the chances of misclassifying a nonmaster as a master. Suggested value is .025.

NOTE: You can adjust these values, bearing in mind that there is an inverse relationship between test length and chance for misclassification - raise alpha and/or beta and the average test length will decrease; lower alpha and/or beta and the test length will increase.

3. The lower bound for mastery (LBM), derived from the formula (1-beta)/alpha

4. The upper bound for nonmastery (UBN), derived from the formula beta/(1-alpha)

5. For each level of difficulty, determine the following:

p(M|C): The probability of a master answering an item of this difficulty correctly

p(M|IC): The probability of a master answering such an item incorrectly
(1-the above)

p(N|C): The probability of a nonmaster answering such an item correctly

p(N|IC): The probability of a nonmaster answering such an item incorrectly
(1-the above)
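As a worked illustration of items 3 and 4, the two bounds can be computed directly from alpha and beta. The Python fragment below is a sketch only; the function name sprt_bounds is hypothetical.

```python
# Illustrative sketch only; the function name is hypothetical.
def sprt_bounds(alpha=0.025, beta=0.025):
    """Return (LBM, UBN) for the given misclassification rates."""
    lbm = (1 - beta) / alpha   # lower bound for mastery
    ubn = beta / (1 - alpha)   # upper bound for nonmastery
    return lbm, ubn


if __name__ == "__main__":
    # Suggested values: LBM is approximately 39, UBN approximately .026.
    lbm, ubn = sprt_bounds()
    print(round(lbm, 3), round(ubn, 3))
    # Larger error rates narrow the zone between the bounds.
    lbm, ubn = sprt_bounds(alpha=0.05, beta=0.05)
    print(round(lbm, 3), round(ubn, 3))
```

With the suggested values of .025, the likelihood ratio must reach about 39 for a mastery decision or fall to about .026 for a nonmastery decision. Raising alpha and beta to .05 moves the bounds to about 19 and .053, which is consistent with the note above that larger error rates shorten the test.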

 

 


II. How the SPRT works

A. At the beginning of a test, consider that we have no information about the examinee -- so the known probabilities of mastery (p|M) or nonmastery (p|N) would both be set at .5 (equal likelihood that the examinee is either a master or nonmaster).

B. Administer a test item.

C. After an item:

1. Score the response and adjust (p|M) and (p|N).

a. If differing levels of difficulty are employed, adjust p(M|C), p(N|C), p(M|IC), and p(N|IC) based on the difficulty level of the chosen item.

b. If the response is correct:

•Multiply the probability of mastery (p|M) by p(M|C)

•Multiply the probability of nonmastery (p|N) by p(N|C)

c. If the response is incorrect:

•Multiply the probability of mastery (p|M) by p(M|IC)

•Multiply the probability of nonmastery (p|N) by p(N|IC)

2. Normalize the probabilities

a. Calculate the new (p|M) as (p|M) / ((p|M) + (p|N))

b. Calculate the new (p|N) as 1 - (the new p|M)

(NOTE: The sum of (p|M) + (p|N) after this step should always be 1.)

3. Calculate the likelihood ratio (LR):

a. LR = (the new p|M)/(the new p|N)

4. Determine what action to take:

a. If LR >= LBM then declare the examinee a master and exit

b. If LR <= UBN then declare the examinee a nonmaster and exit

c. If UBN < LR < LBM then return to step (B) above and start again.

[Programmer notes: The entire algorithm would be "coded" in a relatively short routine, and requires tracking only a few variables. I have no benchmarks on the amount of time required to complete this routine, but that is a function of the computer language of choice and the degree of "elegance" employed when constructing the routine. In HyperTalk, the routines/handlers are very short and virtually unnoticeable in terms of the required time for execution.]
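For readers working in a language other than HyperTalk, the steps above might be sketched in Python as follows. This is a minimal illustration of the algorithm as described in this appendix, not the routine used in the studies. The function name run_sprt and its parameters are hypothetical, and the sketch assumes that every item probability lies strictly between 0 and 1.

```python
# Illustrative sketch only; names are hypothetical.
def run_sprt(responses, item_probs, alpha=0.025, beta=0.025):
    """Apply the SPRT to a sequence of scored responses.

    responses  -- iterable of booleans (True = correct)
    item_probs -- iterable of (p_master_correct, p_nonmaster_correct),
                  one pair per item, each strictly between 0 and 1
    Returns (decision, number_of_items_used).
    """
    lbm = (1 - beta) / alpha      # lower bound for mastery
    ubn = beta / (1 - alpha)      # upper bound for nonmastery
    p_m, p_n = 0.5, 0.5           # step A: no prior information
    used = 0
    for correct, (pm_c, pn_c) in zip(responses, item_probs):
        used += 1
        # Step C1: update both probabilities for this response.
        if correct:
            p_m *= pm_c
            p_n *= pn_c
        else:
            p_m *= 1 - pm_c
            p_n *= 1 - pn_c
        # Step C2: normalize so the two probabilities sum to 1.
        p_m = p_m / (p_m + p_n)
        p_n = 1 - p_m
        # Step C3: likelihood ratio.
        lr = p_m / p_n
        # Step C4: decide or continue.
        if lr >= lbm:
            return "master", used
        if lr <= ubn:
            return "nonmaster", used
    # Item pool exhausted while still in the zone of indifference.
    return "no decision", used


if __name__ == "__main__":
    single_level = [(0.85, 0.60)] * 48
    print(run_sprt([True] * 48, single_level))
    print(run_sprt([False] * 48, single_level))
    print(run_sprt([True, False, False], single_level))
```

Under the single-level values of .85 and .60 with alpha and beta at .025, an unbroken run of correct responses yields a mastery decision after 11 items, and an unbroken run of incorrect responses yields a nonmastery decision after 4 items. To use multiple difficulty levels, one would supply a different probability pair for each item.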

APPENDIX B

An Example of the SPRT

Per Wald’s (1947) suggestion, initial probabilities of mastery and nonmastery are set at .50, indicating no prior information about the likelihood of the subject being either a master or a nonmaster. The first item is presented, which the subject answers correctly. The probability of mastery is updated by multiplying its current value by p(M|C), the probability of mastery given a correct response, .85. The resulting value is .50 x .85 = .425. Similarly, the probability of nonmastery is updated by multiplying its current value by p(N|C), .60. The resulting value is .50 x .60 = .30. Next, probabilities are normalized by dividing each by the sum of the two values, .425 + .30 = .725. Hence the posterior probability of mastery is .425/.725 = .586, and the posterior probability of nonmastery is .30/.725 = .414. Note, also, that the posterior probability of nonmastery can be derived by subtracting the posterior probability of mastery from 1, since the probabilities of mastery and nonmastery are exhaustive and mutually exclusive.

The final operation to be conducted following item 1 is to combine the normalized probabilities into the likelihood ratio (LR):

LR = p(mastery) / p(nonmastery)

If the likelihood ratio is at or above the LBM as derived in decision rule 1 of the SPRT, then testing is discontinued and the subject is declared a master. If, on the other hand, the ratio is at or below the UBN as derived in decision rule 2, testing is likewise ceased and the subject is declared a nonmaster. Per decision rule 3, if the ratio falls between the points derived in decision rules 1 and 2, testing is continued. In this case, the ratio of mastery/nonmastery probabilities is .586/.414 = 1.415. This value falls between the LBM (39) and the UBN (.026), so no decision is made and testing continues.

A second item is now presented, which the subject answers incorrectly. To update the prior probabilities (.586, .414), the rules relating to an incorrect response (1B and 2B from Table 1) must be employed. The posterior probability of mastery is derived by multiplying .586 by p(M|I), the probability of mastery given an incorrect response, which is .15. The resulting value is .586 x .15 = .088. The posterior probability of nonmastery is derived by multiplying its prior probability (.414) by p(N|I), .40. The resulting value is .414 x .40 = .166. The probabilities are then normalized by dividing each by their sum (.254). This yields a posterior probability of mastery of .088/.254 = .346, and a posterior probability of nonmastery of .166/.254 = .654. The final step following item 2 is to examine the ratio of the combined probabilities in light of the SPRT decision rules. The ratio is .346/.654 = .529. Again, the resulting value falls between the LBM (39) and the UBN (.026), so no decision is made and testing continues.

The third item is presented, and is answered incorrectly. Begin by multiplying the prior probability of mastery by p(M|I) to derive the posterior probability of mastery, .346 x .15 = .052. Next, multiply the prior probability of nonmastery by p(N|I) to obtain the posterior probability of nonmastery, .654 x .40 = .262. Normalize the probability of mastery by dividing it by the sum of the two probabilities, .052/.314 = .166. Normalize the probability of nonmastery in the same manner, .262/.314 = .834. Finally, calculate the ratio of the probabilities, .166/.834 = .199. As one might suspect, the ratio has moved considerably closer to the UBN (.026). It still falls within the "zone of indifference," however, so testing would continue.

The preceding sequence of presenting an item, evaluating the response, updating prior probabilities into posterior probabilities to form the likelihood ratio, and applying the SPRT decision rules will continue until either (a) a reliable assessment of mastery or nonmastery can be made, or (b) the pool of items is exhausted. Note that, in each case, the respective probabilities for mastery and nonmastery were updated in light of the most recent performance on an item. Hence, posterior probabilities of mastery and nonmastery represent the best guess about the learner's condition, based on all available information. This, in fact, is at the heart of Bayesian reasoning.
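The three-item walk-through can be cross-checked with a short calculation. Because normalization divides both probabilities by the same sum, it leaves their ratio unchanged; with equal starting probabilities, the likelihood ratio after any number of items is therefore the product of the per-item ratios. The Python fragment below is a sketch of that check, assuming the single-level values of .85 and .60; the function name likelihood_ratio is hypothetical.

```python
# Illustrative sketch only; the function name is hypothetical.
from math import prod  # available in Python 3.8 and later


def likelihood_ratio(responses, p_master=0.85, p_nonmaster=0.60):
    """Likelihood ratio after a sequence of responses, starting from
    equal prior probabilities of mastery and nonmastery."""
    factors = [
        p_master / p_nonmaster if correct
        else (1 - p_master) / (1 - p_nonmaster)
        for correct in responses
    ]
    return prod(factors)


if __name__ == "__main__":
    sequence = [True, False, False]   # correct, incorrect, incorrect
    for n in range(1, len(sequence) + 1):
        print(n, round(likelihood_ratio(sequence[:n]), 3))
```

The unrounded ratios after items 1, 2, and 3 are approximately 1.417, .531, and .199. Small differences from the figures in the walk-through arise because the text rounds intermediate values to three decimal places.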

The material in this appendix was excerpted (pp. 54-56), with permission, from:

Bowers, D.R. (1991). Computer-based adaptive testing in music research and instruction. Psychomusicology, 10(1), 49-63.