Digital Media
CAP4020 - Spring 1999
J. M. Moshell

Lecture 28 - Electronic Music

by J Bryan Pittard
April 22nd, 1999

This lecture also contained presentations by Daniel Beran and Jeffrey Ference.
However, these notes summarize Bryan Pittard's remarks.

Why electronic music?

Many people in the fields of music and computer science have asked this question.  Isn't the existence of traditional acoustic music enough, and in fact more true to the art form?  What they forget is that technology has driven the creation of music since its inception.  From the creation of the first wind-based instrument to the invention of the piano-forte to the evolution of the analog and digital synthesizer, music has continued to evolve and adapt to the new sounds that both people and their inventions have produced.

For music to keep pace and grow with our technology, the creation and distribution of electronic music becomes a must!  What I hope to show below is a bit of history and a taste of the future of electronic music through the eyes of a musician and a programmer.


I. Audio Encoding Methods

A. PAM - Pulse Amplitude Modulation
This method takes slices of each sound input and interleaves them together.  These slices are transmitted as a series of pulses with the amplitude of each pulse representing the sound strength at that moment.  This method lends itself to hardware and can be implemented with relatively simple equipment (a fast switch to encode from analog to PAM and a low-pass filter to decode from PAM to analog).  This method is frequently used by DACs and ADCs as an intermediate form.
B. PWM - Pulse Width Modulation
Similar to PAM, PWM samples the sound as a series of pulses.  However, instead of using the amplitude as in PAM, PWM uses the width or duration of the pulses to encode the signal.  This method lends itself well to producing relatively high-quality sound from a simple computer speaker.
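
The width-based encoding can be illustrated with a minimal sketch. It assumes each sample is already scaled to the range 0.0 to 1.0 and that one pulse period is divided into a fixed number of on/off ticks; the function names and the tick count are hypothetical, chosen only for illustration. Averaging the ticks plays the role of the low-pass filter (or the speaker's own inertia).

```python
def pwm_encode(sample, period=16):
    """Encode one sample in [0.0, 1.0] as `period` ticks: 1 = pulse high, 0 = pulse low.

    The sample value sets the width of the pulse, not its height.
    """
    if not 0.0 <= sample <= 1.0:
        raise ValueError("sample must be between 0.0 and 1.0")
    high_ticks = round(sample * period)
    return [1] * high_ticks + [0] * (period - high_ticks)


def pwm_decode(ticks):
    """Recover the sample as the average level over one pulse period."""
    return sum(ticks) / len(ticks)


if __name__ == "__main__":
    pulse = pwm_encode(0.25, period=8)
    print(pulse, pwm_decode(pulse))
```

A longer period gives finer resolution but needs a faster switch for the same sample rate.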
C. PCM - Pulse Code Modulation
This method represents an analog signal as a series of binary numbers, each one the amplitude of the signal's waveform at the instant of a sample, quantized at a given bit depth.  The greater the bit depth, the finer the amplitude resolution (8 bits give 256 steps; 16 bits give 65,536 steps); the higher the sampling rate, the higher the frequencies that can be captured.  Both lead to a more accurate discrete representation.  This method is the most frequently used in representing audio digitally.
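
As a rough illustration of the effect of bit depth, the sketch below samples a sine wave and quantizes it to signed integer codes. The tone frequency, sampling rate, and function names are assumptions made for the example, and the rounding scheme is only one of several reasonable choices.

```python
import math


def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to a signed integer code at the given bit depth."""
    max_code = 2 ** (bits - 1) - 1      # 127 for 8 bits, 32767 for 16 bits
    return round(sample * max_code)


def sample_sine(freq_hz, rate_hz, count):
    """Take `count` samples of a unit-amplitude sine wave at `rate_hz`."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


def worst_error(samples, bits):
    """Largest difference between the original samples and their quantized versions."""
    max_code = 2 ** (bits - 1) - 1
    return max(abs(quantize(s, bits) / max_code - s) for s in samples)


if __name__ == "__main__":
    tone = sample_sine(freq_hz=440, rate_hz=8000, count=64)
    for bits in (8, 16):
        print(bits, "bits:", 2 ** bits, "steps, worst error", worst_error(tone, bits))
```

The 16-bit error comes out far smaller than the 8-bit error, which is the "more accurate discrete representation" described above.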
1. IFF-based formats - developed by Electronic Arts on the Amiga, the Interchange File Format spawned both the Windows-based WAVE file and the Mac-based AIFF file.  Essentially, the IFF was designed to hold various kinds of multimedia data such as images, audio, animation, or combinations of each in a series of nested chunks.   This format gave developers flexibility in designing their own data definitions to fit within the IFF format.  Eventually, this format gave rise to Microsoft's Resource Interchange File Format (RIFF), which denotes differing kinds of data with different tags.
a. WAVE - this is the most common audio file format on Windows-based systems.  It is derived from the RIFF format and is a close cousin to the Audio Video Interleave (AVI) format for audio and video.  A complete technical description of this format can be found at http://www-ccrma.stanford.edu/CCRMA/Courses/422/projects/WaveFormat/.

b. AIFF/AIFC - the Audio Interchange File Format, developed by Apple, is another derivative of the IFF.  Developers later added an audio compression feature to the IFF-like structure, creating the AIFF-C or AIFC format.   Go to http://developer.apple.com/techpubs/mac/Sound/Sound-61.html for a complete technical description.
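
The nested-chunk idea can be seen by walking the chunks of a small WAVE file. The sketch below builds a tiny file in memory with Python's standard `wave` module and then lists its chunks; the helper names are hypothetical. It handles only the RIFF flavor, where chunk sizes are little-endian. IFF and AIFF store sizes big-endian, so the `struct` format strings would need `>` instead of `<`.

```python
import io
import struct
import wave


def make_wave_bytes():
    """Build a tiny 8-bit mono WAVE file in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(1)                  # 1 byte per sample = 8-bit PCM
        out.setframerate(8000)
        out.writeframes(bytes([128] * 100))  # 100 samples of 8-bit silence
    return buffer.getvalue()


def list_chunks(data):
    """Return the form type and the (chunk_id, size) pairs inside a RIFF file."""
    form_id, form_size, form_type = struct.unpack("<4sI4s", data[:12])
    if form_id != b"RIFF":
        raise ValueError("not a RIFF file")
    chunks = []
    offset = 12
    end = 8 + form_size
    while offset + 8 <= end:
        chunk_id, size = struct.unpack("<4sI", data[offset:offset + 8])
        chunks.append((chunk_id.decode("ascii"), size))
        offset += 8 + size + (size & 1)      # chunks are padded to an even length
    return form_type.decode("ascii"), chunks


if __name__ == "__main__":
    print(list_chunks(make_wave_bytes()))
```

A reader that does not recognize a chunk tag can skip it using the size field, which is what gives the format its flexibility.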

2. MPEG
This format, developed by the Moving Picture Experts Group under the International Organization for Standardization, seeks to unify both audio and video compression.  There are two versions of this format now in use, MPEG-1 and MPEG-2.  Below is an explanation of how MPEG compresses PCM audio to reduce the size of the data.
MPEG Audio strips information that is not important. Based on the research of human perception the encoder decides what information is elementary and what can be stripped.   Before we hear anything, the incoming data is analyzed by our brain. The brain interprets the sound and filters irrelevant information. MPEG Audio just does this job earlier. This is called "perceptual coding."  In more technical terms: If a strong signal appears, the weaker signal behind is not perceivable. The MPEG Audio codec removes this weaker signal.
Taken from http://www.raum.com/mpeg.
MPEG audio has three layers.  Layer I uses the frequency masking technique mentioned above.  Layer II uses frequency masking more aggressively to further compress the data.  Layer III uses the most aggressive frequency masking and a Huffman coding routine.  Each of these layers allows for a scalable compression ratio, letting the encoder decide how much quality to sacrifice for better compression.

The MPEG Audio Layer III (MP3) format, defined in MPEG-1 and extended in MPEG-2, has spawned numerous legal battles among artists, distributors, and miscellaneous groups fighting over the right to copy and distribute copies of audio that sound almost as good as the original.  Also, several companies including Diamond and Sony have released or are developing portable MP3 players and system components.  While the files are still somewhat large (a 4-minute pop song is still around 5 MB), these systems are smaller, more portable, and allow for more storage than a traditional tape or CD player.
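
To put the file sizes in perspective, here is a small worked calculation. It assumes CD-quality PCM (44.1 kHz, 16-bit, stereo) as the uncompressed baseline and constant bitrates of 128 and 160 kbit/s for the compressed file; those bitrates are illustrative assumptions, not figures from the lecture.

```python
def pcm_bytes(seconds, rate_hz=44100, channels=2, bits=16):
    """Size of uncompressed PCM audio in bytes."""
    return seconds * rate_hz * channels * bits // 8


def compressed_bytes(seconds, bitrate_bps):
    """Size of a constant-bitrate compressed stream in bytes."""
    return seconds * bitrate_bps // 8


if __name__ == "__main__":
    song = 4 * 60                                    # a 4-minute song, in seconds
    print(pcm_bytes(song))                           # 42336000 bytes, about 42 MB
    print(compressed_bytes(song, 128_000))           # 3840000 bytes, about 3.8 MB
    print(compressed_bytes(song, 160_000))           # 4800000 bytes, about 4.8 MB
    print(pcm_bytes(song) / compressed_bytes(song, 128_000))
```

Under these assumptions the compressed file is roughly one tenth the size of the PCM original, which is consistent with the "around 5 MB" figure above.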

3. AU, RAM, etc.
Another format is AU, developed by Sun and based on µ-law encoding, a non-linear form of PCM that fits a wider dynamic range into fewer bits per sample (roughly 14 bits of range in 8 bits), resulting in higher compression.  This format is still found in abundance as a way to represent sound on the Internet.
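
The non-linear idea behind µ-law can be sketched with the continuous companding formula, using µ = 255. This is only an illustration: real µ-law codecs use a piecewise, table-driven approximation of this curve, and the 8-bit rounding and function names below are assumptions made for the example.

```python
import math

MU = 255  # companding constant used by 8-bit mu-law


def mu_compress(x):
    """Compress a sample in [-1.0, 1.0]; quiet sounds get more of the output range."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def encode_8bit(x):
    """Compress, then round to a signed code in -127..127 (simplified quantizer)."""
    return round(mu_compress(x) * 127)


def decode_8bit(code):
    """Turn a signed code back into an approximate sample."""
    return mu_expand(code / 127)


if __name__ == "__main__":
    quiet = 0.01
    linear = round(quiet * 127) / 127            # plain 8-bit linear PCM
    companded = decode_8bit(encode_8bit(quiet))  # 8-bit mu-law style coding
    print("linear error:", abs(linear - quiet))
    print("mu-law error:", abs(companded - quiet))
```

For a quiet sample the companded code is much closer to the original than a linear 8-bit code, at the cost of coarser steps for loud samples.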

RAM is RealAudio's attempt at audio compression.  This proprietary format uses compression techniques and principles of scalability to distribute streaming audio over a network.  A whitepaper on G2, the newest version of the format, can be found at http://www.real.com/devzone/library/whitepapers/music.html.

Other proprietary formats such as Liquid Audio and Vocaltec's Internet Wave support scaled, streaming audio for distribution on networks or the Internet.

D. Query 1 - What are some of the advantages and limitations of audio encoding techniques?

II. Score/Notational Methods

A. MIDI
Established in 1983, the Musical Instrument Digital Interface protocol is the most widely used protocol in electronic music.  The original motivation behind MIDI was to organize and standardize communication among electronic synthesizers.  This would allow one musician to control any number of daisy-chained synthesizers, creating a virtual one-man orchestra.

MIDI really took off when the first music sequencers were released.  The sequencers, along with a MIDI interface between the computer and the synthesizer, used MIDI to coordinate the recording and playback of music much like a word-processor does for words.

Some brief technical details about MIDI:

     
  1. The MIDI data stream is a unidirectional asynchronous bit stream at 31.25 Kbits/sec. with 10 bits transferred per byte.
  2. MIDI is divided into 16 logical channels.
  3. Most MIDI data values occupy one 8-bit byte but use only the low 7 bits (the high bit distinguishes status bytes from data bytes), resulting in a range of 0 to 127.
  4. Most MIDI-compatible units have an IN, OUT, and THRU port with the IN/OUT allowing for the unidirectionality and the THRU simply passing the "in" data "thru" to other daisy-chained units.
An excellent overview of the MIDI specification without too much technical detail can be found at http://www.midi.org.
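
The figures in the list above imply a fixed cost in time for every byte on the cable. A small worked calculation follows; the function names are hypothetical, and it assumes every message is sent in full as three bytes.

```python
BAUD = 31250          # bits per second on a MIDI cable
BITS_PER_BYTE = 10    # 8 data bits plus a start bit and a stop bit


def bytes_per_second():
    """Maximum number of bytes a single MIDI cable can carry each second."""
    return BAUD // BITS_PER_BYTE


def message_time_ms(n_bytes):
    """Time in milliseconds needed to transmit `n_bytes`."""
    return n_bytes * BITS_PER_BYTE * 1000 / BAUD


if __name__ == "__main__":
    print(bytes_per_second())        # 3125
    print(message_time_ms(3))        # one 3-byte message: just under 1 ms
    print(message_time_ms(3 * 10))   # a 10-note chord sent as 10 full messages
    print(2 ** 7 - 1)                # 127, the largest 7-bit data value
```

A single three-byte message takes just under a millisecond, so a dense chord is spread over several milliseconds even though the notes were struck together.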

Generally speaking, MIDI is an event-based protocol.  This means that instead of representing a whole note as just one object in a fixed time frame, a whole note becomes a Note On at time X and a Note Off at time X plus some fixed value Y.  These Note On/Note Off events contain information about the note's pitch and velocity.  This event-based approach complicates the sequencer code, but allows for changes in tempo and other meta-events to occur with more flexibility.
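
A minimal sketch of the event-based view: a whole note turned into a Note On and a Note Off, each tagged with a time. The status values 0x90 (Note On) and 0x80 (Note Off) come from the MIDI specification; the tick resolution of 480 per quarter note and the function names are assumptions made for the example.

```python
NOTE_ON = 0x90    # status nibble for Note On; the low nibble holds the channel
NOTE_OFF = 0x80   # status nibble for Note Off


def channel_message(status, channel, pitch, velocity):
    """Build a 3-byte channel voice message, checking the 4-bit and 7-bit ranges."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0..15")
    if not (0 <= pitch <= 127 and 0 <= velocity <= 127):
        raise ValueError("pitch and velocity must be 0..127")
    return bytes([status | channel, pitch, velocity])


def whole_note_events(start_tick, pitch, velocity, channel=0, ticks_per_quarter=480):
    """Represent one whole note as (time, message) pairs: Note On at X, Note Off at X + Y."""
    duration = 4 * ticks_per_quarter    # a whole note lasts four quarter notes
    return [
        (start_tick, channel_message(NOTE_ON, channel, pitch, velocity)),
        (start_tick + duration, channel_message(NOTE_OFF, channel, pitch, 0)),
    ]


if __name__ == "__main__":
    for tick, message in whole_note_events(0, pitch=60, velocity=100):
        print(tick, message.hex())
```

Because the two events are stored separately, a tempo change between them simply stretches or shrinks the real time the note sounds.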

1. Messages
The bulk of what MIDI contains are "messages."  These messages are either a channel voice message, a channel mode message, or a system message.  A channel voice message contains musical performance data such as Note On/Note Off/Velocity/Aftertouch/Pitch Bend/Program Change/Control Change/Bank Select.  A channel mode message contains data that changes the playback mode of the synthesizer.   A system message contains information pertaining to the time synchronization and system-exclusive settings of a synth.
2. Running Status
Running status considers the fact that it is very common for a string of consecutive messages to be of the same message type. For instance, when a chord is played on a keyboard, 10 successive Note On messages may be generated, followed by 10 Note Off messages. When running status is used, a status byte is sent for a message only when the message is not of the same type as the last message sent on the same Channel. The status byte for subsequent messages of the same type may be omitted (only the data bytes are sent for these subsequent messages).

The effectiveness of running status can be enhanced by sending Note On messages with a velocity of zero in place of Note Off messages. In this case, long strings of Note On messages will often occur. Changes in some of the MIDI controllers or movement of the pitch bend wheel on a musical instrument can produce a staggering number of MIDI Channel voice messages, and running status can also help a great deal in these instances.

Taken from http://www.midi.org/Tutorial/Tutor.htm.
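
The savings from running status can be sketched as follows. The example assumes every input is a complete channel message (status byte followed by its data bytes) and ignores system messages, which interact with running status in ways not modeled here; the function name is hypothetical.

```python
def apply_running_status(messages):
    """Concatenate channel messages, omitting a status byte that repeats the previous one.

    The status byte encodes both the message type and the channel, so comparing
    status bytes covers the "same type on the same Channel" rule.
    """
    out = bytearray()
    last_status = None
    for message in messages:
        status, data = message[0], message[1:]
        if status != last_status:
            out.append(status)
            last_status = status
        out.extend(data)
    return bytes(out)


if __name__ == "__main__":
    chord = [60, 64, 67]                                   # C major triad
    note_ons = [bytes([0x90, p, 100]) for p in chord]
    real_offs = [bytes([0x80, p, 0]) for p in chord]       # true Note Off messages
    zero_vel_offs = [bytes([0x90, p, 0]) for p in chord]   # Note On with velocity 0

    print(len(b"".join(note_ons + real_offs)))                   # 18 bytes with no running status
    print(len(apply_running_status(note_ons + real_offs)))       # 14 bytes
    print(len(apply_running_status(note_ons + zero_vel_offs)))   # 13 bytes
```

Using Note On with velocity zero keeps the whole chord under a single status byte, which is the enhancement described in the quoted passage.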

There are three file formats for MIDI files.  Format 0 represents MIDI data in a single track.  Format 1 stores MIDI data in several tracks that play simultaneously.  Format 2 stores several independent single-track sequences (patterns) in one file.  The most commonly used format is format 1, due to its flexibility, size, and relative ease of use.
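
The format number is stored in the header chunk at the start of a Standard MIDI File. The sketch below builds and parses that 14-byte header: the tag "MThd", a length of 6, then three big-endian 16-bit fields. It assumes the timing field holds ticks per quarter note (the file format also allows an SMPTE-based form, not handled here), and the function names are hypothetical.

```python
import struct


def build_header(file_format, n_tracks, ticks_per_quarter):
    """Build the 14-byte 'MThd' header chunk of a Standard MIDI File."""
    if file_format not in (0, 1, 2):
        raise ValueError("format must be 0, 1, or 2")
    if file_format == 0 and n_tracks != 1:
        raise ValueError("a format 0 file holds exactly one track")
    return b"MThd" + struct.pack(">IHHH", 6, file_format, n_tracks, ticks_per_quarter)


def parse_header(data):
    """Return (format, track count, division) from the start of a MIDI file."""
    tag, length, file_format, n_tracks, division = struct.unpack(">4sIHHH", data[:14])
    if tag != b"MThd" or length != 6:
        raise ValueError("not a Standard MIDI File header")
    return file_format, n_tracks, division


if __name__ == "__main__":
    header = build_header(1, 3, 480)
    print(header.hex(), parse_header(header))
```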
B. Proprietary
While MIDI files are the most frequently used notation-based format, several companies supplement MIDI files with their own proprietary formats.  Mark of the Unicorn and Steinberg, along with several other companies, produce sequencer and notational editing software for MIDI-based systems.  While using MIDI to communicate between the interface and the synthesizers, these products also employ a richer, proprietary format in which to save sequence data.  The proprietary formats allow for the inclusion of information more common to music notation than MIDI allows.
C. Query 2 - What are some of the advantages and limitations of MIDI contrasted with audio encoding techniques?

III. Hybrid Methods

A. MIDI/DLS-1
Realizing the limitations of MIDI, the MIDI Manufacturers Association has implemented a system that allows sampled sounds to be packaged with the MIDI data and used to generate the audio output.  This format, called Downloadable Sounds Level 1 (DLS-1), specifies aspects of each instrument's timbre (sound color), frequency (pitch), and other information pertaining to generating its waveform (any effects such as chorus or reverb).   A brief overview of this format can be found at http://www.midi.org/dlsspec.htm.
B. MOD
The MOD file format was created to fill the gaps left by MIDI without DLS-1.   A MOD file contains both the sampled sounds of the instruments and the notational data.  Another key difference between MOD and MIDI is that MODs are beat-based and not event-based like MIDI files.  MOD files are structured such that each chunk contains information about every sound that should occur on a given beat (beat meaning a division, not necessarily a rhythmic beat).  MOD's biggest limitation is that the standard was hastily thrown together for a single application, which resulted in a myriad of variant sub-types.
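
The beat-based layout can be contrasted with MIDI's event list in a small sketch. The grid below is a simplified, hypothetical stand-in for a MOD pattern (real files pack each cell into a few bytes and refer to samples by number); it only illustrates how every division carries a slot for every channel, whether or not anything happens there.

```python
# Hypothetical, simplified pattern: one row per beat division, one cell per channel.
# A cell is None (nothing starts here) or a (sample_name, note) pair.
PATTERN = [
    [("kick", "C-2"), ("bass", "C-2")],
    [None, None],
    [("snare", "C-2"), None],
    [None, ("bass", "G-2")],
]


def pattern_to_events(pattern):
    """Flatten a beat grid into (row, channel, sample_name, note) tuples."""
    events = []
    for row_index, row in enumerate(pattern):
        for channel, cell in enumerate(row):
            if cell is not None:
                sample_name, note = cell
                events.append((row_index, channel, sample_name, note))
    return events


if __name__ == "__main__":
    for event in pattern_to_events(PATTERN):
        print(event)
```

The grid stores empty cells explicitly, while the flattened list stores only what happens; that is the essential difference between the beat-based and event-based views.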
C. CSound
CSound is an audio modeling language developed by Barry Vercoe of MIT in the early '80s and derived from the series of Music-N languages developed at Bell Telephone Laboratories in the early '60s.  CSound is typically distributed as C source code, so it is platform independent so long as a C compiler exists for that platform.

CSound essentially combines the elements of MIDI and audio modeling techniques to create PCM waveforms in a variety of formats.  The "orchestra" file contains detailed descriptions of each instrument's timbre, articulation, and dynamics.   The "score" file contains a MIDI-like, time-based event specification list (Selfridge-Field 112).  These two files are compiled together with the CSound compiler, which calculates the waveform output.  Also acceptable in place of a "score" file is a MIDI Type 0 file.  Though usually not done in realtime on CISC-based systems, faster RISC-based systems can compile and stream the output in realtime, allowing a MIDI controller system to make changes in code that can be heard instantaneously.

More information on CSound can be found at http://www.leeds.ac.uk/music/Man/c_front.html.
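
The orchestra/score split can be imitated in a few lines of Python. This is not CSound code and does not follow CSound's syntax; it is a hypothetical sketch in which the "orchestra" is a table of instrument functions, the "score" is a list of timed events, and rendering mixes each instrument's output into one PCM buffer. The sample rate and instrument design are assumptions made for the example.

```python
import math

RATE = 8000  # samples per second (assumed for the sketch)


def sine_instrument(freq_hz, amp, n_samples):
    """Hypothetical 'orchestra' entry: a sine tone with a linear fade-out."""
    return [amp * (1 - n / n_samples) * math.sin(2 * math.pi * freq_hz * n / RATE)
            for n in range(n_samples)]


ORCHESTRA = {1: sine_instrument}

# Hypothetical 'score': (instrument number, start in seconds, duration, frequency, amplitude)
SCORE = [
    (1, 0.00, 0.5, 440.0, 0.5),
    (1, 0.25, 0.5, 660.0, 0.3),
]


def render(score, orchestra):
    """Mix every score event into a single list of floating-point PCM samples."""
    total = max(round(start * RATE) + round(dur * RATE) for _, start, dur, _, _ in score)
    out = [0.0] * total
    for instrument, start, dur, freq, amp in score:
        first = round(start * RATE)
        for n, value in enumerate(orchestra[instrument](freq, amp, round(dur * RATE))):
            out[first + n] += value
    return out


if __name__ == "__main__":
    samples = render(SCORE, ORCHESTRA)
    print(len(samples), "samples,", len(samples) / RATE, "seconds")
```

Changing an instrument function changes the sound of every note that uses it without touching the score, which is the separation the two CSound files provide.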

D. SAOL/MPEG4
The MPEG-4 specification is still being defined.  However, many of the components of MPEG-4 are currently out for beta testing and development.  One of these, SAOL, is another attempt at an audio modeling language.  In fact, SAOL is directly derived from CSound.  Below are some paragraphs describing the audio features of MPEG-4 and SAOL taken from their respective web sites.
The Structured Audio tools decode input data and produce output sounds. This decoding is driven by a special synthesis language called SAOL (Structured Audio Orchestra Language) standardized as a part of MPEG-4. This language is used to define an "orchestra" made up of "instruments" (downloaded in the bitstream, not fixed in the terminal) which create and process control data. An instrument is a small network of signal processing primitives that might emulate some specific sounds such as those of a natural acoustic instrument. The signal-processing network may be implemented in hardware or software and include both generation and processing of sounds and manipulation of pre-stored sounds.

MPEG-4 does not standardize "a single method" of synthesis, but rather a way to describe methods of synthesis. Any current or future sound-synthesis method can be described in SAOL, including wavetable, FM, additive, physical-modeling, and granular synthesis, as well as non-parametric hybrids of these methods.

Control of the synthesis is accomplished by downloading "scores" or "scripts" in the bitstream. A score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their output to an overall music performance or generation of sound effects. The score description, downloaded in a language called SASL (Structured Audio Score Language), can be used to create new sounds, and also include additional control information for modifying existing sound. This allows the composer finer control over the final synthesized sound. For synthesis processes that do not require such fine control, the established MIDI protocol may also be used to control the orchestra.
Taken from http://drogo.cselt.stet.it/mpeg/, the MPEG Home Page
SAOL stands for "Structured Audio Orchestra Language" and is pronounced "sail".  It is a powerful, flexible language for describing music synthesis, and integrating synthetic sound with "natural" (recorded) sound in an MPEG-4 bitstream.   MPEG-4 integrates the two common methods of describing audio on the WWW today: streaming low-bitrate coding (like RealAudio) and structured audio description (like MIDI files).  However, the quality and flexibility of the MPEG-4 tools is greater than other audio tools available today.

SAOL lives within the MPEG-4 paradigm of streaming data and decoding processes.   Thus, the Structured Audio toolset is not only a method of synthesis, but a streaming format appropriate for WWW-based (or any other channel) transmission of audio data.  The 'saolc' package contains a program for encoding score and orchestras into the streaming format, and facility for decoding this format (so it can be used as a WWW helper app).

MPEG-4 Structured Audio has its roots in another Media Lab project called NetSound, developed by Michael Casey and other members of the Machine Listening Group at the MIT Media Lab in 1995-1996. NetSound has similar concepts to MPEG-4 Structured Audio, but uses Csound, developed by Barry Vercoe, to do the synthesis.

Taken from http://sound.media.mit.edu/mpeg4-old/#saolc, the SAOL/MPEG-4 Structured Audio Page
E. Query 3 - What does the future hold for the electronic distribution of music?

References
Kientzle, Tim. A Programmer's Guide to Sound. Reading, MA: Addison-Wesley, 1998.

Selfridge-Field, Eleanor. Beyond MIDI: The Handbook of Musical Codes. London: The MIT Press, 1997.

All Previously Mentioned Websites