A Program to Teach Sight-Singing

Lloyd A. Smith and Rodger J. McNab, Department of Computer Science, University of Waikato

Music is perhaps the most time-intensive skill to teach. Unfortunately, a teacher can spend relatively little time with each student; consequently, students spend most of their time practicing without informed feedback. Computer-based instruction in music makes it possible for students to receive feedback over much of their practice time.

Early computer-based music education programs (during the sixties and seventies) were limited to visual and low-fidelity sound output, and computer keyboard input. These programs were, for the most part, oriented toward teaching theory and ear training. More recently, the development of the MIDI interface has made it possible for a computer to receive and assess musical performance input from the user, first through musical keyboards, then through other MIDI instruments. With advances in digital signal processing and computer hardware technology, it is becoming feasible for a computer to receive and assess direct acoustic input. This makes it possible to extend computer-based instruction to instruments which are not MIDI capable–in particular, computer-assisted instruction can be applied to singing.

This paper describes a program which forms the basis for teaching the skill of sight-singing–singing a melody without prior study. While there are commercial systems now available to teach sight-singing, they typically assess performance by accumulating the time during which the user is singing the correct pitch according to a strict tempo. Our approach is to transcribe the melody sung by the user, then to assess performance by finding the optimal alignment between the user’s input and the test melody. The design philosophy is that sight-singing tuition is just one application of a general melody transcription program. Our program, then, is organized in stages. First, the acoustic signal is captured and processed to identify its time-varying frequency. Second, the frequencies so identified are converted to musical pitches and rhythms. Finally, the input melody is matched against the test melody in order to assess the user’s performance.

The following sections describe the various stages of the program–first the signal processing functions which convert the acoustic wave to musical notes, then the functions which provide feedback to a sight-singing student.

Melody Transcription

The first major task faced by the sight-singing tutor is to transcribe the melody sung by the user. This can be broken into several subtasks–the sound must be captured and filtered; the frequencies must be identified; notes must be extracted from the signal and labeled with musical pitch and rhythm notation.

Sampling and Filtering

The first step in melody transcription is to capture the analog input and convert it to digital form, filtering it to remove unwanted frequencies. The program runs on an Apple Macintosh PowerPC 8100, and makes use of that machine's built-in sound I/O functions. For sight-singing tuition, we are interested only in the fundamental frequency of the input. Harmonics, which occur at integral multiples of the base frequency, often confuse pitch trackers and make it more difficult to determine the fundamental. Therefore the input is filtered to remove as many harmonics as possible, while preserving the fundamental frequency. Reasonable limits for the singing voice are defined by the musical staff, which ranges from F2 (87.31 Hz) just below the bass staff, to G5 (784 Hz) just above the treble staff. While leger lines are used to extend the staff in either direction, these represent extreme pitches for singers and, because of the demands they make on singing technique, they are best avoided while practicing sight-singing. Our program uses a digital filter with a cutoff frequency of 1000 Hz; the resulting signal is passed to the pitch tracking module, which identifies its fundamental frequency.
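As an illustration of the filtering step, the following is a minimal sketch of a low-pass filter with a 1000 Hz cutoff. The paper states only that a digital filter with this cutoff is used; the filter type (a windowed-sinc FIR design with a Hamming window), the number of taps, and the 22050 Hz sampling rate used in the usage note are assumptions made for the example, and the function names are hypothetical.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass FIR coefficients (Hamming window).

    The design method and num_taps are assumptions for illustration only.
    """
    fc = cutoff_hz / sample_rate            # normalized cutoff (cycles per sample)
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled explicitly
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]        # normalize to unity gain at 0 Hz

def apply_fir(signal, taps):
    """Direct-form convolution; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * signal[i - k]
        out.append(acc)
    return out
```

With an assumed sampling rate of 22050 Hz, `apply_fir(samples, lowpass_fir(1000, 22050))` passes a 200 Hz tone almost unchanged and strongly attenuates a 3000 Hz component. A filter this short has a wide transition band around the cutoff, so a practical design would need to be checked against the upper end of the singing range.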

Pitch Tracking

Pitch determination is a common operation in signal processing. Unfortunately it is difficult, as testified by the hundreds of different pitch tracking algorithms that have been developed (Hess, 1983). These algorithms may be loosely classified into three types, depending on whether they process the signal in the time domain, by examining the structure of the sampled waveform; in the frequency domain, by examining the spectrum generated by a Fourier transform; or in the cepstral domain, by performing a second Fourier transform and examining the resulting cepstrum. Our program uses a time domain algorithm which assigns pitch by finding the repeating pitch periods comprising the waveform (Gold and Rabiner, 1969). Figure 1 shows 20 milliseconds (ms) of a typical waveform for the vowel ah, as in father.

Figure 1. Acoustic waveform of ah
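To make the idea of finding a repeating pitch period concrete, the sketch below estimates the period of a single frame by normalized autocorrelation. This is not the Gold and Rabiner algorithm used in the program; it is a simpler time domain technique substituted here for illustration. The search range of 80 to 800 Hz, the 0.5 voicing threshold, the 0.95 peak-selection factor and the function name are all assumptions.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=80.0, fmax=800.0):
    """Estimate the fundamental of one frame by normalized autocorrelation.

    Returns None when no clear periodicity is found. All thresholds are
    illustrative assumptions, not values from the paper.
    """
    n = len(frame)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 2)
    scores = {}
    for lag in range(min_lag, max_lag + 1):
        a = frame[:n - lag]                 # frame
        b = frame[lag:]                     # frame shifted by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        scores[lag] = num / den if den > 0 else 0.0
    if not scores:
        return None
    best = max(scores.values())
    if best < 0.5:                          # treat weak periodicity as unvoiced
        return None
    # Take the shortest lag that is a strong local peak, so that a multiple
    # of the true period is not chosen by mistake (an octave error).
    for lag in range(min_lag + 1, max_lag):
        s = scores[lag]
        if s >= 0.95 * best and s >= scores[lag - 1] and s >= scores[lag + 1]:
            return sample_rate / lag
    return None
```

Because the lag is an integer number of samples, the resolution of this sketch is coarser than the figure of about 5 cents quoted for the program, particularly for high notes.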

The pitch tracker breaks the sound into 20 ms frames and returns a pitch estimate for each frame. With 20 ms frames, pitch resolution is about 5 cents (a cent is one hundredth of a semitone), or 0.29%–near the limit of human pitch resolution. The pitch of each frame is represented by its number of cents above MIDI note 0 (8.176 Hz, an octave below C0). Notes on the equal tempered scale relative to A-440 occur at multiples of one hundred cents: C4, for example, is 6000 cents (MIDI note 60). This scheme easily incorporates alternative tunings, such as just or Pythagorean, simply by changing the relationship between cents and note name. It can also be adapted to identify notes in the music of non-Western cultures.
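The cents representation described above can be sketched as follows, assuming the usual equal tempered mapping in which A-440 is MIDI note 69 (6900 cents). The function names are hypothetical. The last function checks the arithmetic relating 5 cents to a frequency change of roughly 0.29%.

```python
import math

A4_HZ = 440.0
A4_CENTS = 6900.0   # A4 is MIDI note 69, so 6900 cents above MIDI note 0

def hz_to_cents(freq_hz, reference_hz=A4_HZ):
    """Cents above MIDI note 0, relative to the given reference for A4."""
    return A4_CENTS + 1200.0 * math.log2(freq_hz / reference_hz)

def nearest_midi(cents):
    """Nearest equal tempered note; notes fall on multiples of 100 cents."""
    return int(math.floor(cents / 100.0 + 0.5))

def cents_to_percent(cents):
    """Relative frequency change, in percent, for an interval in cents."""
    return (2.0 ** (cents / 1200.0) - 1.0) * 100.0
```

For example, `hz_to_cents(261.6256)` is very close to 6000 (C4, MIDI note 60), and `cents_to_percent(5)` is approximately 0.29. Passing a different `reference_hz` is one simple way to retune the mapping; supporting just or Pythagorean tuning would require a different table of cents per note, which this sketch does not attempt.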

Note Segmentation

Once pitches have been identified, it is necessary to determine where notes begin and end. We have developed two ways of doing this, one based on amplitude (or loudness) and the other on pitch. Amplitude segmentation is simpler, but depends on the user’s separating each note by singing da or ta–the consonant causes a drop in amplitude of 60 ms duration or more at each note boundary. Adaptive thresholds are then used to determine note onsets and offsets; in order to keep a marginal signal from oscillating on and off, the onset threshold is higher than the offset threshold. Figure 2 illustrates the use of thresholds in note segmentation. The root-mean-square power is calculated over 10 ms time frames, then thresholds are set according to the power over the entire signal. A note starts when the ascending power crosses the upper threshold; the note ends when its descending power crosses the lower threshold. In order to avoid a slamming door or a crackling microphone lead causing spurious notes, each note must be at least 100 ms in duration.

Figure 2. Using thresholds to segment notes from the amplitude signal.
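The amplitude-based segmentation described above can be sketched as a two-threshold (hysteresis) scan over frame powers. The 10 ms frame size and the 100 ms minimum note length come from the text; the way the thresholds are derived from the overall signal (fixed fractions of the mean RMS power, `on_ratio` and `off_ratio`) is an assumption, since the adaptive rule is not specified here, and the function names are hypothetical.

```python
import math

def frame_rms(samples, sample_rate, frame_ms=10):
    """Root-mean-square power over consecutive non-overlapping frames."""
    size = int(sample_rate * frame_ms / 1000)
    return [math.sqrt(sum(s * s for s in samples[i:i + size]) / size)
            for i in range(0, len(samples) - size + 1, size)]

def segment_by_amplitude(rms, frame_ms=10, on_ratio=0.6, off_ratio=0.3,
                         min_note_ms=100):
    """Return (onset_ms, offset_ms) pairs using an upper and a lower threshold.

    on_ratio and off_ratio are assumed values, applied to the mean power of
    the whole signal.
    """
    overall = sum(rms) / len(rms)
    on_thr, off_thr = on_ratio * overall, off_ratio * overall
    notes, start = [], None
    for i, power in enumerate(rms):
        if start is None and power > on_thr:
            start = i                                   # rising through upper threshold
        elif start is not None and power < off_thr:
            if (i - start) * frame_ms >= min_note_ms:   # discard clicks and bumps
                notes.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None and (len(rms) - start) * frame_ms >= min_note_ms:
        notes.append((start * frame_ms, len(rms) * frame_ms))
    return notes
```

Because the onset threshold is above the offset threshold, a frame whose power lies between the two neither starts nor ends a note, which is what keeps a marginal signal from switching on and off.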

The alternative to amplitude segmentation is to segment notes directly from the pitch track by grouping and averaging 20 ms frames. An adjacent frame whose frequency is within 50 cents of a growing note segment is included in that segment. Any segment longer than 100 ms is considered a note. Pitch-based segmentation has the advantage of relaxing constraints on the user, but may not be suitable for all applications–repeated notes at the same pitch may not be segmented, while a glissando is segmented into a sequence of ascending or descending notes. Figure 3 illustrates the effect of pitch-based segmentation on a glissando sung, approximately, from Bb3 (the Bb below middle C) to Bb4. Figure 3a shows the pitch track, while each "step" in figure 3b is identified as a separate note.

Figure 3. Pitch-based segmentation on a sung glissando.
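A minimal sketch of pitch-based segmentation follows. It takes one pitch value in cents per 20 ms frame, with `None` marking frames where no pitch was found. The text does not say exactly how a frame is compared with a growing segment; this sketch assumes the comparison is against the running average of the segment, and the function name is hypothetical.

```python
def segment_by_pitch(frame_cents, frame_ms=20, tolerance=50.0, min_note_ms=100):
    """Group adjacent frames into notes: (onset_ms, offset_ms, mean_cents)."""
    notes = []
    current = []      # pitches of the segment being grown
    start = 0
    # The trailing None is a sentinel that flushes the final segment.
    for i, cents in enumerate(list(frame_cents) + [None]):
        if current and cents is not None:
            mean = sum(current) / len(current)
            if abs(cents - mean) <= tolerance:
                current.append(cents)
                continue
        # The segment ends here; keep it only if it is longer than 100 ms.
        if len(current) * frame_ms > min_note_ms:
            notes.append((start * frame_ms, i * frame_ms,
                          sum(current) / len(current)))
        current = [cents] if cents is not None else []
        start = i
    return notes
```

A steadily rising pitch track is split into a staircase of notes by this procedure, each step ending when the next frame drifts more than 50 cents from the step's average, which is the behaviour shown for the glissando. Two repeated notes at the same pitch with no gap between them would be merged into one.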

After note onsets and offsets are determined, rhythmic values are assigned by quantizing each note to the nearest sixteenth according to the tempo set by the user. Each note is assigned the musical pitch label corresponding to the average frequency over its duration.
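The quantization and labeling step can be sketched as below. The sketch assumes the tempo is given in quarter-note beats per minute, so that a sixteenth note lasts one quarter of a beat, and it assumes that very short notes are rounded up to one sixteenth rather than dropped. Note names use flats, following the spelling in this paper. The function names are hypothetical.

```python
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

def quantize_to_sixteenths(duration_ms, tempo_bpm):
    """Nearest whole number of sixteenth notes (at least one)."""
    sixteenth_ms = 60000.0 / tempo_bpm / 4.0   # assumes quarter-note beats
    return max(1, int(duration_ms / sixteenth_ms + 0.5))

def midi_to_name(midi):
    """Pitch label such as 'C4' for MIDI note 60."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
```

At 120 beats per minute a sixteenth note lasts 125 ms, so a note of 480 ms is quantized to 4 sixteenths (a quarter note) and a note of 260 ms to 2 sixteenths (an eighth note).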

Adapting to the User’s Tuning

Each note is labeled by its MIDI number according to its frequency and a reference frequency. By making the reference frequency variable, the system is able to adapt to the user’s tuning and tie note identification to musical intervals rather than to any standard.

In adaptive tuning mode, the system assumes that the user will sing to A-440, but then adjusts by referencing each note to its predecessor. For example, if a user sings three notes, 5990 cents, 5770 cents and 5540 cents above MIDI note 0, the first is labeled C4 (MIDI 60) and the reference is moved down 10 cents. The second note is labeled Bb3, which is now referenced to 5790 (rather than 5800) cents, and the reference is lowered a further 20 cents. The third note is labeled Ab3, referenced now to 5570 cents–even though, by the A-440 standard, it is closer to G3. Thus the beginning of "Three Blind Mice" is transcribed.

While constantly changing the reference frequency may seem computationally expensive, it is efficiently implemented as an offset in MIDI note calculation. This feature is in keeping with the design philosophy of developing a general melody transcription front end–at this point we have not determined whether it is useful in a sight-singing tutor. If tuning is tied to a particular standard, the offset is fixed. To use a fixed A-440 tuning, for example, the offset is fixed at 0.
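The offset scheme can be sketched in a few lines. The sketch below reproduces the three-note example given above; the function name and the `adaptive` flag are hypothetical, and the offset is expressed in cents relative to A-440.

```python
import math

def label_notes(note_cents, adaptive=True):
    """Label each note (cents above MIDI note 0) with a MIDI number.

    In adaptive mode the tuning offset follows the singer: after each note
    the reference is moved so that the note just sung is exactly in tune.
    """
    offset = 0.0          # 0 corresponds to fixed A-440 tuning
    labels = []
    for cents in note_cents:
        midi = int(math.floor((cents - offset) / 100.0 + 0.5))
        labels.append(midi)
        if adaptive:
            offset = cents - 100.0 * midi
    return labels
```

For the input `[5990, 5770, 5540]`, adaptive mode returns `[60, 58, 56]` (C4, Bb3, Ab3), with the offset moving to -10 and then -30 cents. With `adaptive=False` the third note is labeled 55 (G3), as expected under fixed A-440 tuning.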

A Sight-Singing Tutor

Transcribing the user’s sung input is only the first stage in providing a program to teach sight-singing. The second stage is to provide feedback to the user concerning how well he or she performed the intended melody. To this end, the sight-singing tutor displays a melody on the screen and evaluates the user’s attempt to sing it. While it is possible to generate melodies in a given style using probabilistic methods (Hall and Smith, 1991), it is more desirable to use composed melodies or folk tunes. Early versions of the sight-singing tutor use a database of 100 Bach chorale melodies, while the system currently under development adds another 9600 folk tunes. The folk song database is made up of 1700 songs, most of North American origin, from the Digital Tradition database (Greenhaus, 1994); the remainder are from the Essen ESAC database, which comprises about 6000 German and Eastern European tunes, 2200 Chinese tunes and several hundred Irish melodies (Schaffrath, 1992).

Users are able to set the tempo and hear the starting note using pull-down menus. Then the user sings the melody, using the mouse to start and stop recording. Next the system matches the input against the test melody using a dynamic programming algorithm designed to match discrete musical sequences (Mongeau and Sankoff, 1990). Dynamic programming finds the best match between two strings, allowing for individual elements to be inserted or deleted. The algorithm, as adapted for music, allows for fragmentation and consolidation, in which a whole note, for example, might match a half note and two quarter notes, or vice versa.
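The core of the dynamic programming match can be sketched as a standard edit-distance alignment with insertion, deletion and replacement. This sketch deliberately omits the fragmentation and consolidation operations of the Mongeau and Sankoff algorithm. The insertion and deletion cost (`indel_cost`) and the function name are assumptions, and the replacement cost is supplied by the caller.

```python
def align(sung, target, replace_cost, indel_cost=1.0):
    """Return (distance, pairs) for the best alignment of two note sequences.

    pairs lists (sung_index, target_index) for notes matched to each other.
    """
    rows, cols = len(sung) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,       # extra note sung
                d[i][j - 1] + indel_cost,       # target note omitted
                d[i - 1][j - 1] + replace_cost(sung[i - 1], target[j - 1]),
            )
    # Trace back from the bottom-right corner to recover the matched notes.
    pairs = []
    i, j = len(sung), len(target)
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + replace_cost(sung[i - 1], target[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + indel_cost:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return d[-1][-1], pairs
```

As a usage example, with notes represented by MIDI numbers and `replace_cost = lambda a, b: 0.1 * abs(a - b)`, aligning the sung sequence `[60, 62, 65]` against the target `[60, 62, 64, 65]` gives a distance of 1.0 and the pairs `[(0, 0), (1, 1), (2, 3)]`: the singer omitted the third note of the target.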

The system allows the option of matching the user's input against the test melody by absolute pitches or by musical intervals. If absolute matching is chosen, the notes sung by the user are matched against those of the test melody. Octave differences are ignored, thus allowing the system to function equally well with men and women singers. If interval matching is chosen, the system first converts both the user's sung melody and the test melody to a string of musical intervals before matching. This allows the user to sing in any key without penalty, so long as he or she sings the correct intervals.
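The two matching modes can be illustrated with a pair of small helper functions. How the program ignores octave differences is not described in detail; the sketch assumes notes are compared by pitch class, taking the shorter way around the octave. Both function names are hypothetical.

```python
def pitch_difference_ignoring_octave(a, b):
    """Semitone difference between two MIDI notes, ignoring octaves (0 to 6)."""
    diff = abs(a - b) % 12
    return min(diff, 12 - diff)

def to_intervals(midi_notes):
    """Convert a melody to the sequence of intervals between adjacent notes."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]
```

For example, `pitch_difference_ignoring_octave(60, 48)` is 0, so a melody sung an octave below the test melody matches exactly. In interval mode, `to_intervals([60, 58, 56])` and `to_intervals([65, 63, 61])` both give `[-2, -2]`, so the same tune sung in a different key incurs no penalty.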

Figure 4 shows the result of matching a user’s sung input with the test melody, using absolute pitches. The lines in the figure show which notes of the test melody are matched with which notes of the user’s melody in the optimal (best scoring) alignment. Each matched note is penalized by 0.1 for each semitone difference in pitch and by 0.05 for each sixteenth note difference in duration. The individual pitch and rhythm scores from the best alignment are accumulated to determine a global distance: a perfect score of zero represents no difference between the input and the test melody. Future versions will convert the score to a more intuitive form, with 100 being a perfect match, and lower scores indicating poorer performances.

Figure 4. Display from the sight-singing tutor
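The scoring of matched notes can be sketched directly from the penalties given above. Each note is represented here as a (MIDI pitch, duration in sixteenths) pair, which is an assumed representation, and the function names are hypothetical. Penalties for notes left unmatched by the alignment are not stated in the text and are not modeled.

```python
PITCH_PENALTY = 0.1       # per semitone of pitch difference
DURATION_PENALTY = 0.05   # per sixteenth note of duration difference

def note_distance(sung, target):
    """Distance between two matched notes, each (midi_pitch, sixteenths)."""
    (sung_pitch, sung_len), (target_pitch, target_len) = sung, target
    return (PITCH_PENALTY * abs(sung_pitch - target_pitch)
            + DURATION_PENALTY * abs(sung_len - target_len))

def global_distance(aligned_pairs):
    """Sum of note distances over all matched pairs; 0 is a perfect match."""
    return sum(note_distance(s, t) for s, t in aligned_pairs)
```

For example, a note sung one semitone flat and two sixteenths too long contributes 0.1 + 0.1 = 0.2 to the global distance, while an exactly matching note contributes nothing.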

Conclusion

This paper has described a program designed to teach sight-singing. The system's capability has been demonstrated, but there are several further developments required before the system is ready for use by music educators.

First, there is no easy way to add test melodies to the database. While the system may be distributed with an arbitrarily large set of melodies to choose from, it is still necessary to allow a user to add melodies–if only to add a graded progression of sight-singing exercises. There are three ways to allow addition of melodies to the database. The simplest way is to use a file format which follows that of a generally available music notation editor, such as Lime (Blostein and Haken, 1991). In this scenario, the teacher would use the notation editor to enter a melody, then simply save the resulting file in a folder containing the test melodies. Another option is to include a simple music notation editor in the sight-singing tutor. A third solution is to allow the teacher to sing melodies into the system–but that will still require some rudimentary editing capability in order to make corrections to the transcribed melody.

A second development needed to complete the system is to provide some means for a teacher to find songs which test particular sight-singing skills–syncopation, for example, or specific musical intervals. This requires the capability to search the database for musical patterns specified by the teacher; such patterns may be precisely specified, as in the case of searching for songs containing tritones, or may be more generally specified, as in searching for pentatonic songs, or songs containing syncopation.
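The precisely specified case can be sketched simply. The example below searches a collection of melodies for a given interval, such as the tritone (6 semitones). The representation of the database as a dictionary mapping titles to lists of MIDI note numbers is an assumption for illustration, as are the function names; the more general searches (pentatonic songs, syncopation) are not addressed.

```python
def contains_interval(midi_notes, semitones):
    """True if any two adjacent notes are the given number of semitones apart."""
    return any(abs(b - a) == semitones
               for a, b in zip(midi_notes, midi_notes[1:]))

def find_songs_with_interval(songs, semitones):
    """Titles of all songs containing the interval, in dictionary order."""
    return [title for title, notes in songs.items()
            if contains_interval(notes, semitones)]
```

Given `songs = {"a": [60, 66, 67], "b": [60, 62, 64]}`, the call `find_songs_with_interval(songs, 6)` returns `["a"]`, since only the first melody contains a tritone.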

Finally, the system must have an intuitive user interface. At this point, we have a research system which requires some knowledge to traverse the various menus and choose various options. The options must be reduced and the menus condensed in order to make the system generally usable.

References

Blostein, D., & Haken, L. (1991). Justification of printed music. Commun. ACM, 34(3), 88–99.

Gold, B., & Rabiner, L. (1969). Parallel processing techniques for estimating pitch periods of speech in the time domain. J. Acoust. Soc. Am., 46(2), 442–448.

Greenhaus, D. (1994). About the Digital Tradition. ftp://parc.xerox.com/pub/music/digital_tradition.

Hall, M., & Smith, L. (1991). Computer improvisation of blues melodies (Abs). J. Acoust. Soc. Am., 90, 2353.

Hess, W. (1983). Pitch Determination of Speech Signals. New York: Springer-Verlag.

Mongeau, M., & Sankoff, D. (1990). Comparison of musical sequences. Computers and the Humanities, 24, 161–175.

Schaffrath, H. (1992). The ESAC databases and MAPPET Software. In W. Hewlett & E. Selfridge-Field (Eds.), Computing in Musicology, Vol 8. Menlo Park: Center for Computer Assisted Research in the Humanities.