Automatic Transcription of Audio Signals
Master of Science Thesis
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Measurement
tel: +420 737 111030
Supervisors:
Ing. Radim Špetík
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Circuit Theory
tel: +420 2 2435 2049
web: http://amber.feld.cvut.cz

Hadas Ofir
Technion - Israel Institute of Technology
Department of Electrical Engineering
Signal and Image Processing Laboratory
e-mail: email@example.com
web: http://www-sipl.technion.ac.il

Abstract

This thesis is concerned with the automatic transcription of monophonic audio signals into the MIDI representation. The transcription system incorporates two separate algorithms in order to extract the necessary musical information from the audio signal. The detection of the fundamental frequency is based on a pattern recognition method applied to the constant-Q spectral transform.
The onset detection is achieved by a sequential algorithm based on computing a statistical distance measure between two autoregressive models. The results of both algorithms are combined by heuristic rules that eliminate transcription errors. Finally, new evaluation criteria are proposed and applied to the transcription results of several musical recordings.
Keywords: music transcription, pitch detection, fundamental frequency tracking, onset detection, monophonic audio
Acknowledgements

First of all, I would like to thank my parents and relatives for their immense support during my studies.
I would like to thank Ing. Radim Špetík for supervising this thesis.
Next, I thank Prof. Ing. Pavel Sovka, CSc. and Doc. Ing. Petr Pollák, CSc. very much for their excellent teaching, as well as for their help in the design of the new criteria.
The biggest thanks go to my initial supervisor, Hadas Ofir, who is the co-author of the proposed system. Special thanks also go to Nimrod Peleg for choosing the very best topic for my summer project at the SIPL laboratory. I am also very grateful to Heikki Jokinen, Pekka Kumpulainen and Anssi Klapuri for stimulating my interest in signal processing, and in audio applications in particular.
Thanks also go to my friends Michal Olexa and Václav Vozár for valuable hints concerning DSP, Matlab and TeX. And, as a nice Czech proverb says, "the best at the end": I would like to thank my ♥ Zuzka Lenochová.
This work is dedicated to all musicians I have had the opportunity and the pleasure to play with...
Declaration

I declare on my honour that I have written this diploma thesis independently, solely under the supervision of Ing. Radim Špetík, and that in writing it I did not use any information sources other than those listed here.
Chapter 1 Introduction
1.1 Characterization of the Problem
Automatic transcription of music is the task of converting a particular piece of music into a symbolic representation by means of a computational system.
The symbolic representation is generally depicted using standard music notation, which consists of notes characterized by a specific frequency and duration. From the transcription point of view, music can be classified as polyphonic or monophonic. The former consists of multiple simultaneously sounding notes, whereas the latter contains only a single note at each time instant, such as a saxophone solo or the singing of a single vocalist.
Automatic transcription of music is related to several fields of science, including musicology, psychoacoustics, and Computational Auditory Scene Analysis (CASA). It belongs to the music content analysis discipline, which also comprises other audio research topics, such as rhythm analysis, instrument recognition, and sound separation. It has been studied since the 1970s.
1.2 Literature Review
The state of the art in music transcription is focused on polyphonic transcription, since monophonic transcription is considered practically solved [Klapuri1998], [Martins2001]. However, monophonic transcription represents an important case that should be treated separately, with much stricter demands on transcription quality, which still seems to be relatively limited for polyphonic transcribers. An extensive review of published polyphonic systems can be found in [Klapuri1998].
Since monophonic music shares various properties with speech, many algorithms suitable for music transcription purposes originate in speech processing [Rabiner1976], [Hess1983], [Andre-Obrecht1986], [Medan1991]. Recent works in monophonic music transcription explore the potential of the wavelet transform [Cemgil1995a], [Cemgil1995b], [Jehan1997], time-domain techniques based on autocorrelation [Bello2002], and probabilistic modelling using Hidden Markov Models [Ryynänen2004]. In addition, [Bořil2003] developed a simple and robust algorithm for real-time MIDI conversion, referred to as the DFE algorithm (Direct Time Domain Fundamental Frequency Estimation). This system performs a separate monophonic analysis of the signal from each guitar string, and therefore illustrates that monophonic transcribers can be used in special polyphonic transcription systems.
1.3 Applications
The applications of automatic transcription systems are numerous, though currently limited by insufficient reliability and robustness. The following list presents the potential areas of interest.
• Computer music applications: A music transcription system is a useful tool for composers and musicians, since it provides a means to easily analyze and edit music recordings. It is especially attractive for the real-time transcription of sounds into a musical score.
• Coding of audio signals: Converting signal samples to a symbolic representation significantly reduces the amount of data and can therefore be used for compression purposes. An example method is the structured audio (SA) coding described in the MPEG-4 standard.
• Mobile technology: Reliable transcription systems could be commercially applied in cellular phones to automatically create monophonic or polyphonic ringtones. Such a feature would allow customers to record their own musical compositions with a cellular phone and transmit the MIDI files via the Internet or the GSM network.
• Machine perception: Analogously to computer vision, the ability of computers to hear music would improve the interaction between humans and systems with artificial intelligence.
• Music teaching: Future transcription systems could be used in the training of singers and solo instrument players, as well as assist in the ear training of novice musicians. Such systems would compare the exact musical notation with an artist's performance and objectively evaluate the performance quality.
1.4 Organization of the Thesis
This thesis is oriented more practically than theoretically, and thus briefly explains only the essential background information, referring the reader to other publications, which are often available online. For this reason, it omits a separate theoretical chapter and defines the necessary terms "on the fly" during the description of the transcription system.
This thesis is organized as follows. Chapter 2 gives an overview of the MIDI standard. Chapter 3 presents the implemented solution. In Chapter 4, new criteria for evaluation are proposed and the transcription results are presented. Finally, Chapter 5 summarizes the accomplishments.
Chapter 2 The MIDI Standard
2.1 MIDI Introduction
The Musical Instrument Digital Interface (MIDI) provides a standardized means of conveying musical performance information as electronic data. It has been accepted and utilized by musicians and composers since its conception in 1983, and is nowadays widely used for communication between sound cards, musical keyboards, sequencers, and other electronic instruments. A complete description of the MIDI protocol is given in the MIDI 1.0 Specification, established and updated by the MIDI Manufacturers Association [MMA2004].
The main advantage of MIDI is data storage efficiency: a typical MIDI sequence requires approximately 10 KB of data per minute of sound.
In contrast to WAV files, which contain digitally sampled audio in the PCM format, MIDI files consist of MIDI messages, which can be understood as special instructions for synthesizers to generate the actual sounds. These messages thus provide a very efficient symbolic representation of music. Moreover, MIDI files are also editable, allowing the music to be rearranged or even composed interactively.
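The storage saving can be illustrated with a back-of-the-envelope comparison. The PCM parameters below (CD-quality stereo) are assumptions chosen for illustration; the 10 KB/min figure is the approximate value quoted above:

```python
# Rough storage comparison: one minute of CD-quality PCM audio versus the
# ~10 KB/min MIDI figure quoted above (illustrative arithmetic only).
SAMPLE_RATE = 44_100      # samples per second (CD quality, assumed)
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 2              # stereo

pcm_bytes_per_minute = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * 60
midi_bytes_per_minute = 10 * 1024  # approximate figure for a typical sequence

ratio = pcm_bytes_per_minute / midi_bytes_per_minute
print(f"PCM:  {pcm_bytes_per_minute / 1e6:.1f} MB/min")
print(f"MIDI: {midi_bytes_per_minute / 1e3:.1f} KB/min")
print(f"MIDI is roughly {ratio:.0f}x smaller")
```

Even for mono 16-bit audio, the symbolic representation is still some two to three orders of magnitude smaller than the raw samples.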
2.2 MIDI Basics
The MIDI architecture consists of three main components: a hardware interface (connectors), a communication protocol (the language), and a distribution format (the Standard MIDI File).
2.2.1 MIDI Hardware Interface
The MIDI interface of an instrument is generally provided by three MIDI connectors, labeled IN, OUT, and THRU. The only approved MIDI connector is the 5-pin DIN connector. The physical MIDI link is divided into 16 logical channels, each capable of carrying MIDI messages to and from a single musical instrument.
2.2.2 MIDI Communication Protocol
The MIDI data stream is a unidirectional asynchronous bit stream at 31.25 kbit/s, with 10 bits transmitted per byte (a start bit, 8 data bits, and one stop bit). The MIDI protocol is composed of MIDI messages in binary form; each message is formed by an 8-bit status byte, followed by one or two data bytes.
MIDI messages are processed in real time: when a MIDI synthesizer receives a note-on message, it plays the appropriate sound and stops it when the corresponding note-off message is received. Similarly, when a key is pressed on a musical instrument keyboard, a note-on message is immediately generated, and a note-off message is generated when the key is released. Therefore, no timing information is transmitted with the MIDI messages in real-time applications.
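The three-byte note messages described above can be sketched directly. The status byte carries the message type in its upper nibble (0x9 for note-on, 0x8 for note-off) and the logical channel in its lower nibble; the helper names below are illustrative, not part of the MIDI specification or the thesis system:

```python
def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Build a 3-byte note-on message: status byte 0x9n, then two data bytes."""
    assert 0 <= channel < 16 and 0 <= note < 128 and 0 <= velocity < 128
    return bytes([0x90 | channel, note, velocity])

def note_off(channel: int, note: int, velocity: int = 64) -> bytes:
    """Build a 3-byte note-off message: status byte 0x8n."""
    assert 0 <= channel < 16 and 0 <= note < 128 and 0 <= velocity < 128
    return bytes([0x80 | channel, note, velocity])

# Middle C (note number 60) on logical channel 0 at velocity 100:
msg = note_on(0, 60, 100)
print(msg.hex())  # '903c64'

# At 31.25 kbit/s with 10 bits per byte, the wire carries 3125 bytes/s,
# i.e. at most ~1041 three-byte note messages per second.
max_messages_per_second = (31_250 // 10) // 3
print(max_messages_per_second)  # 1041
```

The throughput figure follows directly from the bit rate and framing stated above and explains why MIDI comfortably handles even dense keyboard performances.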
2.2.3 Standard MIDI Files
In order to store MIDI data in a file, however, the MIDI messages must be time-stamped to guarantee playback in the proper time sequence. In other words, each message is assigned a timing value, expressed either in metrical ticks or in the SMPTE format (hours : minutes : seconds : frames), and the resulting specification is referred to as the Standard MIDI File (SMF) format. In addition, the SMF specification defines three MIDI file formats, because MIDI sequencers can generally manage multiple MIDI data streams, called tracks.
• MIDI Format 0 stores all MIDI data in a single track, although it may represent several musical parts on different MIDI channels.
• MIDI Format 1 stores MIDI data as a collection of tracks (up to 256), with each musical part in its own track.
• MIDI Format 2, which is relatively rare and often not supported, can store several independent songs.
Since this work is concerned with monophonic audio, only MIDI Format 0 is used, and the terms track and MIDI channel are used interchangeably.
It should also be noted that MIDI files can be converted between formats by the MIDI File Format Conversion Utility provided by [Glatt2004].
Finally, a MIDI file can also be understood as a "musical version" of an ASCII text file, except that it contains binary data. Indeed, [Glatt2004] also offers the MIDI File Dis-Assembler Utility, which converts a MIDI file to readable text that can then be edited in a text editor and converted back to a modified MIDI file.
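As a concrete illustration of the time-stamping described above, the SMF specification stores the delta time preceding each message as a variable-length quantity: 7 bits per byte, most significant group first, with the high bit set on every byte except the last. A minimal sketch of this encoding (the function names are illustrative):

```python
def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as an SMF variable-length quantity:
    7 bits per byte, most significant group first, high bit set on all
    bytes except the last."""
    assert value >= 0
    groups = [value & 0x7F]        # last byte: high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # continuation bytes
        value >>= 7
    return bytes(reversed(groups))

def decode_vlq(data: bytes) -> int:
    """Decode an SMF variable-length quantity back to an integer."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:        # high bit clear marks the last byte
            break
    return value

print(encode_vlq(0x00).hex())    # '00'
print(encode_vlq(0x80).hex())    # '8100'
print(encode_vlq(0x3FFF).hex())  # 'ff7f'
```

Small delta times, which dominate typical sequences, thus need only a single byte, which is one reason for the compact file sizes noted earlier.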
2.3 MIDI File Representations
Although there are many different types of MIDI messages, this work is concerned only with the note-on and note-off messages, which carry the musical note data and hence comprise most of the traffic in a typical MIDI data stream. The remaining MIDI messages are used mainly for hardware tasks, such as selecting which instrument to play, mixing and panning sounds, and controlling various aspects of electronic musical instruments.
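The first data byte of a note-on or note-off message is the MIDI note number (0 to 127). Although this mapping is not part of the MIDI transport itself, note numbers are conventionally interpreted as equal-temperament pitches with A4 = note 69 = 440 Hz, which is exactly the conversion an audio-to-MIDI transcriber needs in order to turn a detected fundamental frequency into a note. A sketch of the standard conversion (function names are illustrative):

```python
import math

def midi_to_freq(note: int) -> float:
    """Equal-temperament frequency of a MIDI note number (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)

def freq_to_midi(freq: float) -> int:
    """Nearest MIDI note number for a given fundamental frequency in Hz."""
    return round(69 + 12 * math.log2(freq / 440.0))

print(midi_to_freq(69))      # 440.0
print(midi_to_freq(60))      # ~261.63 Hz (middle C)
print(freq_to_midi(261.63))  # 60
```

Rounding to the nearest note number quantizes the detected pitch to the twelve-tone scale, which is the implicit assumption whenever a continuous fundamental-frequency track is written out as MIDI notes.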