## Digital Media

#### Moshell - Spring 99

Lecture 5: Video; Miscellaneous Media Tools on PCs

This lecture is based on the latter half of Chapter 2 of the Wu text, beginning at section 2.1.6. It explores the issues involved in storing moving pictures, and a variety of other things about moving signals around a computer.

During the course of the chapter, we pause for tutorials in video technology, run length and Huffman coding. In fact, let's pause right now for the first one.

Movies were invented (like lots of other good stuff) by Thomas Edison. Originally movies were made to show 16 frames per second, but the flicker drove folks crazy so someone (maybe Edison) invented a shutter which was a spinning disk with four slots in it. This caused the image to flicker 64 times per second, showing each frame 4 times. Much less strenuous on the eyes.

When they moved to sound, they moved up to 24 frames per second. If you show a movie shot for 16 FPS at 24 FPS, you get an exaggerated fast motion, which is sometimes seen on TV. But silent movies didn't actually look like that.

Many people tried for many years to figure out how to send images via radio or telephone. FAX machines were actually invented by the 1920's, and were used by newspapers to send "wire photos" around the country. A spinning drum was used and a picture took about 5 minutes to send.

Monochrome (black & white) TV images are based on an analog brightness signal which is spread across a raster, a zigzag pattern of lines. The eye is quite sensitive to variations in brightness, so the bandwidth of this signal must be pretty high - in fact, about 3.5 MHz. Thirty frames per second are transmitted. In order to get the flicker frequency up above 30 Hz, interlacing is used. That is, the even lines and then the odd ones are transmitted. There are 525 lines in the NTSC (American) standard video signal (not all of them visible), or 262.5 lines per field. Two fields = 1 frame.

Black and white requires one degree of freedom (one real number) per pixel. Color TV requires that two degrees of freedom be added, since we already know that you need RGB to express a color.

A "degree of freedom" is an adjustable parameter. You can model a given system with many different assignments of DOF, but in order to have the freedom in the original system, the new model must have the same number of DOF. For instance you can locate a place by X,Y,Z or by (azimuth, elevation, range) where the first two are angles. But you have to have 3 numbers, either way.
There are many color models with 3 DOF. Obviously RGB will do it. Then there's HLS (hue, lightness, saturation), where L corresponds to the brightness in black & white. American TV uses a tricky system called YIQ, in which a color's hue is represented by an angle between 0 and 360 degrees, and the saturation (purity) is represented by an amplitude. To send these two pieces of information at the same time, the chroma signal is used. Its phase is changed to describe hue, and its amplitude represents saturation. Minimal saturation means no color at all (gray); maximum means totally red or green or whatever.

So, you have a brightness signal running along, painting its way across the screen. Every 63.5 microseconds (1/15,750 second, since 525 x 30 = 15,750) it takes a violent plunge down below the voltage representing black. This horizontal sync pulse moves the raster dot back to the left. Every 16.67 milliseconds (1/60 second) a bigger vertical sync pulse moves the raster dot back to the top of the screen. Chroma and sound are blended in as two additional signals. The sound is FM.
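The timing arithmetic above can be checked with a quick sketch:

```python
LINES_PER_FRAME = 525
FRAMES_PER_SEC = 30

line_period_us = 1e6 / (LINES_PER_FRAME * FRAMES_PER_SEC)  # 525 * 30 = 15,750 lines/sec
field_period_ms = 1e3 / 60                                  # 60 fields (half-frames) per second

print(round(line_period_us, 2))   # 63.49 -- microseconds between horizontal sync pulses
print(round(field_period_ms, 2))  # 16.67 -- milliseconds between vertical sync pulses
```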

NTSC - North & South America & Japan. 525 lines at 60 Hz. Total of about 6 MHz stored, 5 transmitted.

PAL - most of the rest of the world: Europe (including England), India, China. 625 lines, 50 Hz. 8 MHz; higher quality signal.

SECAM - the French system, because they have to be different. Eastern Europe, parts of Africa. Same lines & Hz as PAL. The chroma signal is dealt with differently.

So you have this complex analog signal pouring onto your screen. The camera has a raster dot moving just as fast, and in sync with the one on your TV. So no recording is required to transmit TV. That's good, because when TV was invented in the 1940's, nothing could record that fast except film. They would film TV shows if they needed to pre-edit them, but most TV was live.

Aspect Ratio is the ratio of width to height. For TV it's 4:3 = 1.333. For standard cinema (seldom seen anymore except in classic films) it's the "golden section", which is (1+sqrt(5))/2 ≈ 1.618. For High Definition TV it is 16:9 ≈ 1.778, which is conspicuously wider than a standard TV and quite similar to a panoramic movie, which most of them are these days.

Some queries for study purposes.

Query 5.1: NTSC broadcasts (approximately) 30 frames per second, but the screen flickers 60 times per second. Why? How is this magic achieved?

Query 5.2: Explain the concept of "degrees of freedom" of a system. Illustrate by a discussion of color space in terms of RGB and HLS ("hue, lightness, saturation").

Query 5.3: A signal that has a maximum frequency component of 5 MHz can obviously change direction at most 5 million times a second. Put another way, and crudely, it can only "spell out" about 5 million on-off messages per second. Now consider that raster dot that is scanning 525 lines onto your TV screen 30 times a second, forming 15,750 raster lines in a second. What can you deduce about the horizontal resolution of broadcast TV from this information?

### Storing Video

Now, to digitally store information sufficient to reconstruct a signal with 5 MHz information in it, you'd need 10 million samples per second (sampling at twice the highest frequency). A CD holds around 682 million bytes, so at one byte per sample, a CD could hold about 68 seconds of uncompressed video. Clearly some compression would be needed.
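The arithmetic, spelled out as a sketch (assuming one byte per sample):

```python
CD_BYTES = 682_000_000      # ~682 million bytes on a CD
SAMPLE_RATE = 10_000_000    # 10 M samples/sec to capture 5 MHz of signal
BYTES_PER_SAMPLE = 1        # assume one byte per sample

seconds = CD_BYTES / (SAMPLE_RATE * BYTES_PER_SAMPLE)
print(round(seconds))  # 68 -- about a minute of uncompressed video per CD
```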

VCRs don't do any compression, they just store the signal directly in analog form. A spinning head puts diagonal stripes of signals on the videotape, so that in effect a videotape is electronically ten or more times as long as it physically is. So a simple VHS (NTSC) videotape records the whole mess- audio, brightness and color, in one "soup".

To get higher quality, both Super VHS and BetaMax store the chroma signal separately from the brightness, and the audio somewhere else on a third track. This is why a Super VHS cable needs more than just one wire and the ground/return.

## Back to the Book: 2.1.6 - Videoconferencing

Videoconferencing is growing but it is not very satisfactory. Eye contact is VERY important to people and you can't really establish it with folks on video. Consider the problem of a 3-way meeting. To have cameras positioned so that everyone is appropriately placed (so I can look left to see one person, and right to see another) would require 6 cameras and 6 video channels. But with 4 people you'd need 12 cameras and 12 video channels. Whew! Some folks have actually tried projecting video images onto mannequin heads, to deal with this problem. The heads then have to be rotated somehow. It's truly eerie to see this going on, because the moving textures don't perfectly fit the mannequins.

I once taught a class from Orlando to Brown University (Providence, Rhode Island) via a commercial videoconferencing system in a downtown business. It was sort of eerie to be semi-in-communication with the students. You could see them sometimes, and hear them all the time, but you couldn't really match voices with faces unless your camera operator got lucky. The low scan rates mean that lip sync is not a reliable clue as to who is talking.

I have on several occasions also taught a "distance learning" class using the TV studios in the library. In this case you can't see the students at all, and it's more like broadcasting a TV show.

---------- Back to technology -------

Between the 5 MHz (actually more like 4.5 MHz) of a raw NTSC signal and the 128 kbits/sec of an ISDN line, or the 56 kbps of a POTS line, there is a 30:1 or 60:1 ratio. Obviously something has to give. Three things give:

1) Resolution. Typically, desktop teleconference systems ("video telephones") use 128 x 128 instead of the approximately 250 x 250 resolution of TV. That's a factor of 4:1, not 2:1 as it might seem, since both dimensions are halved.

2) Frame rate: If you transmit the image 15 times a second instead of 30, you cut the requirement another factor of 2.

3) Compression: Real-time compression is still very challenging (we'll see why in Chapter 3), but bringing the problem down from 30:1 to 30/8, which is about 4:1, is much easier. But you still get a jerky and fuzzy picture. The ultimate solution will be faster MPEG compression techniques, and more bandwidth.
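Multiplying the first two savings out, in a quick sketch:

```python
# The resolution and frame-rate savings described above.
resolution_factor = (250 * 250) / (128 * 128)   # ~3.8 -- call it 4:1
frame_rate_factor = 30 / 15                     # 2:1

print(round(resolution_factor * frame_rate_factor, 1))  # 7.6, roughly the 8:1 the text works with
print(round(30 / 8, 1))                                 # 3.8 -- the "about 4:1" left for compression
```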

## The Rest of the Pieces of the Media PCs

What else is in the chapter? A grab-bag of concepts and techniques. Here are some of 'em. We'll intersperse technology with science, as needed (e. g. the bit about coding techniques.)

### CD Technology

A brief introduction to some CD technologies is given on pages 34 and 35. The original Philips/Sony audio CD was capable of delivering 150 KB/s. Remembering from last lecture that audio needs 44,100 samples per second, this seems like more than enough - until you realize that audio CDs use "read-ahead". They buffer up a couple of tracks' worth of music, for two reasons. (a) They have to keep playing while the heads move from one track to another, without a jump. And (b) if you joggle the CD player, or something else happens so that the head misses the data, the music comes out of the buffer.

150 KB/s was pretty slow for anything but music, so 2x, 4x ... 24x drives gradually appeared.

Physical Technology. The data on a CD takes the form of tiny 'pits' pressed into the plastic (on recordable CD-R discs, a laser burns equivalent marks into a dye layer). Unpitted track is called 'land'. The track, like a phonograph's, is actually one continuous spiral - though it runs from the center outward rather than the outside in - so the head tracks it smoothly rather than jumping between concentric rings.

There is a WORLD of different variants on CD, listed on pages 35-37. I order you to not learn them.

A major fact to know about CD technology is that it's actually sort of "in between" digital and analog. That is, the original audio CD was intended to be just good enough for consumer grade stuff. An important commercial principle is to slop it when you can, get there firstest with the cheapest (or at least the firstest.) After you get market, you can get good. Or in Microsoft's case, just get big.

SO.... audio CDs were designed to expect lots of errors. This is dealt with in two ways: error-correcting codes and interleaving. (The code actually used on CDs is a cross-interleaved Reed-Solomon code; Huffman coding, by contrast, is a compression technique.) We'll introduce interleaving and Huffman coding here, because we'll need at least Huffman in the next chapter.

Interleaving is the idea of spreading your data out, so that a dust speck or flaw in the plastic would kill maybe one bit out of a string of successive 16-bit values(thus preserving the essential shape of the audio curve) rather than completely trashing one data word.
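A toy sketch of the idea (the function names are mine, and real CDs use a fancier cross-interleave): split the samples into strands and transmit one item from each strand in turn, so a burst of consecutive damage lands on widely separated samples.

```python
def interleave(data, depth):
    """Block interleaver: split data into `depth` strands, emit one item from each in turn."""
    assert len(data) % depth == 0
    step = len(data) // depth
    return [data[(i % depth) * step + (i // depth)] for i in range(len(data))]

def deinterleave(data, depth):
    """The inverse permutation is just interleaving with the transposed depth."""
    return interleave(data, len(data) // depth)

samples = list(range(12))
sent = interleave(samples, 3)
print(sent)       # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(sent[0:3])  # a 3-byte burst here wipes samples 0, 4, 8 -- spread out, not adjacent
```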

## Run Length Coding, Huffman Coding, and other Miracles of Simplicity

We will study several basic techniques, including run length encoding and Huffman coding.

You will need to know how to compute the amount of storage required for raw, uncompressed data. It's simple byte-counting stuff. For instance,

Query 5.4 Compute the storage required for a 540 x 480 image with 8 bits per pixel.
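The byte-counting itself is one multiplication and a division. Here's a minimal sketch (the helper name is mine, and it's worked on a different image so the query stays yours):

```python
def raw_image_bytes(width, height, bits_per_pixel):
    """Uncompressed size: pixel count times bits per pixel, divided by 8 bits per byte."""
    return width * height * bits_per_pixel // 8

print(raw_image_bytes(640, 480, 24))  # 921600 bytes for a 640x480 true-color image
```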

### Run Length Encoding.

Consider the following data stream: 0000 0000 1111 1100 0000 0000 0000

I put in the spaces to make things easier to count. You could briefly describe the sequence as "eight 0, six 1, fourteen 0". This would be called "run length encoding" (RLE). Assume that instead of binary code, we're working in bytes. We could encode a complex sequence of bytes or characters using RLE, by the following technique. In this discussion, if we write [101] we mean the byte whose value is 101 decimal. But if we write a character like A or !, we mean the byte whose value is the ASCII value of that character.

Here's the Run Length Decoding algorithm

1) If you ever encounter the character "!", turn on special software which:

a) reads the next byte as an integer; call it N
b) puts N copies of the following byte into the output stream.

Thus, the sequence ! [32] A

would emit 32 A's into the output stream, at a storage cost of only 3 bytes.
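The decoding rule above can be sketched directly (Python; the function name is mine, [N] count bytes are modeled as ints and other bytes as one-character strings):

```python
def rle_decode(stream):
    """Decode the one-byte-count escape scheme: '!' [N] C  ->  N copies of C."""
    out = []
    i = 0
    while i < len(stream):
        if stream[i] == "!":
            count, char = stream[i + 1], stream[i + 2]  # next byte is the count, then the byte to repeat
            out.append(char * count)
            i += 3
        else:
            out.append(stream[i])
            i += 1
    return "".join(out)

print(rle_decode(["!", 32, "A"]))  # 32 A's, reconstructed from a 3-byte encoding
```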

Query 5.5: What kind of data would achieve maximum compression with the above one-byte RLE technique? What would its compression ratio be? (That is, if I gave you M bytes of data, how much space S would it take, and what would the ratio M/S be?)

Query 5.6: What kind of data would achieve MINIMUM compression with RLE? In fact is there data which you're better off not trying to compress with RLE at all?

Yeah, of course there is. It's data with runs of less than some minimum length. If you're using a one-byte run-length field as we are, the minimum run length that it pays to encode is 4. I hope you see why. (What's the minimum if we're using a two-byte run-length field? In fact, what's the maximum possible compression ratio with two-byte length fields?)

Let's just call data with less-than-minimum runs,  "incoherent data". If you're encountering incoherent data with no runs (or short runs)  in it, you just dump the bytes right into the output stream. If you come to an acceptable run of some character C, you encode it with ! [n] C.
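That whole rule can be sketched like so (Python; the function name is mine, and the sketch assumes the input contains no "!" characters and no runs longer than 255):

```python
def rle_encode(text):
    """Copy incoherent bytes straight through; emit '!' [n] C for runs of 4 or more."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                    # find the end of the current run
        run = j - i
        if run >= 4:                  # shorter runs cost more encoded (3 bytes) than raw
            out.extend(["!", run, text[i]])
        else:
            out.extend(text[i:j])
        i = j
    return out

print(rle_encode("AAAAAABCC"))  # ['!', 6, 'A', 'B', 'C', 'C']
```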

Query 5.7: There's only one problem with this technique.... what's that?

The "escape sequence" technique means that some specific character (or characters) is designated as special, and different from all others. When it is encountered, it's not treated like data, but like a flag. In the present case, the ! character is an escape character. (In fact the ESC character was originally put into ASCII to serve this purpose.)

But what if you want to transmit ! as part of your raw (incoherent) data stream? Won't it trigger escape sequences and generate all kinds of fake RLE?

Query 5.8: How do we solve the "transmit the escape character" problem?

I won't put the answer to that one here in these notes. You have to pay attention in class, after all. Or ask your friends.

Query 5.9: Using ! as the escape character, one-byte character counts and a minimum run-length of 4, run-length encode the following data. The blanks in the data are NOT to be encoded; they are simply inserted to make it easier for you to read and count characters.

0000 0001 1122 2222 2200 5555 5555 000

Why is the minimum run length chosen as 4?

### Huffman Coding.

Huffman coding is very important. It requires that the probability of occurrence of each symbol in an alphabet be known. The principle was originally discovered by Samuel F. B. Morse, who invented the .... Morse Code! I am a member of the last human generation that will know Morse (unless civilization collapses and we have to start over.) Consider these letters:

.        E
-        T
-.-.     C

The most frequent letter in English is E, so it gets the shortest symbol - just a "dot" (short sound.) The next most frequent is T, so it gets a single dash (long sound.) Letters like C are relatively infrequent, so they get long symbols.

Huffman coding follows a similar principle. Here is a simple way to build such a code. (Strictly speaking, this top-down splitting is the Shannon-Fano construction; Huffman's own algorithm merges the two least-probable symbols bottom-up. On probabilities like these, both produce the same code.) You sort your symbols into two piles so that the sum of probabilities in each pile is as close as possible to equal. Then you assign 0 to the left pile and 1 to the right pile.

You repeat this process until a "tree" of code words is built. Here's the example for three symbols:

A (0.5)
B (0.25)
C (0.25)

First we have A on the left, and B and C on the right. So we assign 0 as the symbol for A, and 1 to the pile on the right.

Now we divide the right hand pile into equal probabilities, and so we must assign 10 to B and 11 to C.

Here's the alphabet:

0    A
10  B
11  C

Now let's try transmitting a string, to see if it can be decoded afterward.

CABBA

11010100    (that is, 11 0 10 10 0)

If you think like a finite automaton, you immediately see that at the beginning, 11 must mean C. Then you encounter a 0 so you know it must be A.... etc.
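Here's a sketch of the proper bottom-up Huffman merge in Python (the helper name is mine; a different tie-breaking order could swap some 0s and 1s, but with this one it reproduces exactly the code derived above):

```python
import heapq

def huffman_code(probs):
    """Bottom-up Huffman: repeatedly merge the two least-probable piles."""
    # Each heap entry: (probability, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        p0, _, zeros = heapq.heappop(heap)   # least probable pile gets a leading 0
        p1, _, ones = heapq.heappop(heap)    # next least probable gets a leading 1
        merged = {s: "0" + c for s, c in zeros.items()}
        merged.update({s: "1" + c for s, c in ones.items()})
        heapq.heappush(heap, (p0 + p1, tick, merged))
        tick += 1
    return heap[0][2]

code = huffman_code({"A": 0.5, "B": 0.25, "C": 0.25})
print(code)  # {'A': '0', 'B': '10', 'C': '11'}

encoded = "".join(code[s] for s in "CABBA")
print(encoded)  # 11010100 -- 8 bits for 5 symbols, just as above
```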

For practice, we have the next query.

Query 5.10: Develop a Huffman code for an alphabet with four letters with the following probabilities:

A: 0.5
B: 0.25
C: 0.125
D: 0.125

## And Now, back to the Technology

DVD is the first really new thing to come along. It uses smaller pits and tighter tracks because the laser light has a shorter wavelength, and it can use two layers of media on each side. (How? Focusing the energy through a lens means you can "look through" the top layer.) A DVD with 2 layers x 2 sides can hold 17 gigabytes. This is about 25 times the CD's capacity.
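A quick sanity check of that capacity ratio (sketch, using the ~682 million byte CD figure from earlier):

```python
CD_BYTES = 682_000_000
DVD_BYTES = 17_000_000_000  # 2 layers x 2 sides

print(round(DVD_BYTES / CD_BYTES))  # 25 -- about 25 CDs' worth per DVD
```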

Universal Serial Bus is a relatively slow channel, for keyboards and telephones and (in the high speed option) video cameras and scanners. The media relevance of USB is that it supports a simple form of Quality of Service, so let's talk about that.

Quality of Service means "each user gets what she really needs" - or, in fact, nothing at all. USB will allocate enough bandwidth to a camera to do the job RIGHT for that camera, even if it means some other low priority device like a keyboard gets no time slices. (Unlikely, 'cause keyboards are the lowest demand in the world.)

QOS on a network means you can buy how much bandwidth you need. Internet ain't that way yet; important or not, you just take what you get.

FireWire (IEEE 1394) is a hopped-up cousin of USB, running up to 400 Mbits/sec.

MMX is built into the Pentium II and other processors thereafter, without being mentioned. It's a set of new opcodes specialized for the big, repetitive bit-parallel operations that are done to images, very fast.

The heart of it is a 64-bit register holding 8 parallel bytes, which can be organized as

eight 8-bit values,
four 16-bit values,
two 32-bit values, or
one 64-bit value.
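To see why packing helps, here's a sketch in Python (not actual MMX code; the function name is mine) of the bit-masking trick that MMX's packed-add instruction implements in hardware: one 64-bit add doing eight independent, wrapping 8-bit adds.

```python
H = 0x8080808080808080  # the top bit of each of the 8 byte lanes
L = 0x7F7F7F7F7F7F7F7F  # the low 7 bits of each lane

def paddb(a_bytes, b_bytes):
    """Eight independent wrapping 8-bit adds, done as one 64-bit operation."""
    a = int.from_bytes(a_bytes, "little")
    b = int.from_bytes(b_bytes, "little")
    # Add the low 7 bits of every lane, then patch in the top bits with XOR,
    # so no carry ever crosses from one byte lane into the next.
    total = ((a & L) + (b & L)) ^ ((a ^ b) & H)
    return total.to_bytes(8, "little")

result = paddb(bytes([250, 1, 2, 3, 4, 5, 6, 7]),
               bytes([10, 1, 1, 1, 1, 1, 1, 1]))
print(list(result))  # [4, 2, 3, 4, 5, 6, 7, 8] -- lane 0 wrapped (250+10 = 260 mod 256)
```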

These tools are used to perform nonlinear editing and multimedia performance. More later, I hope.

### QuickTime

What need does QuickTime meet? It plays back movies with sound on PCs and Macs. That's a lot harder than it sounds, but the book doesn't tell you much about how it's done.

QuickTime VR consists of panoramic images cunningly stitched together so you can pan around and think you're in a virtual space. You can't look up or down much, though.

## DirectX

At last we come to something Important. If you want to do media on PCs, eventually you come to DirectX, where X varies. See http://www.microsoft.com/directx/default.asp

Direct3D - a hideously complicated way to do 3D, but the fastest thing there is (to date)
DirectDraw - a 2D system for drawing. Provides device independence and is used by Direct3D
DirectSound - likewise; device-independent sound commands
DirectPlay - multiplayer connectivity for games
DirectInput - connects joysticks, etc.

## Authoring Tools

1. Presentation Tools: PowerPoint

2. Page-Turning Authoring Tools: HyperCard (Apple)

3. Frame Based Tools: Nobody heard of these

4. Icon Based Tools (Visual Programming Tools): Macromedia AuthorWare

5. Time-Based Tools: Macromedia Director

Time to reflect on the difference between an object-level multimedia presentation (e. g. MM Director) and a simple video movie. For video, all you have to do is play back the picture synched with the sound. But in an MM Director piece, you may want to make sounds happen from clicks on the screen, or have animations occur repeatedly on one page. Consider Myst.

All these systems are cowering in the shadow of Java.

--------

This chapter is largely a laundry list of spare parts that can go into a multimedia computer. We've used it as a skeleton upon which to hang two substantive tutorials, about video technology and Huffman coding. Both are necessary for next week's meaty stuff about JPEG and MPEG.

Let's spend the rest of this period

a) Talking about File Access in PERL (so you can attack your project)
and
b) getting started on the readings project. Got books yet?

### File Access in Perl

Here's a trivial example of how to create a file and write to it.

$filename = "mydata.txt";
open (FILE, ">", $filename) or print "Reg-Master Error 103: Cannot open $filename\n";

print FILE "Line 1\n";
print FILE "Line 2\n";
close (FILE);

The symbol ">" means open for writing, and overwrite the contents of the existing file if it is found.
The symbol ">>" means open for writing, and add to the end of the existing file if it is found.

If you execute this from the command prompt, this script will put onto your disk (in the directory the script ran) a new file named mydata.txt. However, if you executed it as part of a web-invoked script, you wouldn't be able to create a new file. This is because you're executing as a 'guest' named nobody in that case, without privileges to create new files.

The way around this problem is just to pre-create the files while you are telnet-logged into the Unix host. I simply use my text editor to make the file in the first place. For instance I would use vi to create diagnos.txt, then save it. Subsequently when I do diagnostic tracing (described below), stuff just gets dumped into the diagnos.txt file, and I read it and delete it using vi. Most sane folk use pico or some other more user friendly editor, but us dinosaurs use vi.

The error message seems obvious, but people often don't include these messages. It's much smarter to do so. I assign numbers to error messages and keep a running file with a sentence or two about what might cause each error message.

(By this point I've realized that I'm REALLY good at forgetting what I was doing last week....)

Here's how we READ a file.

open(RTABLE, "suites.txt")
or print "Reg-master Error 101: Cannot open suites.txt. \n";

#Then, to use what's in the file, proceed as follows:

until (eof(RTABLE))
{
    $line = <RTABLE>;  # Read one line, for captions
    # Do whatever you're going to do with it...
}

The above construct is how one reads to the end of a file.

When finished with a file (reading or writing), it's polite to close it.

close RTABLE;

***** Debugging Strategies *************

As I mentioned before, it's a bit of a bother to debug Perl runtime code. Untyped languages (like Perl) have a tendency to assume that you knew what you meant to do. So no error message will result if your real values are being assigned into a variable where you thought you had a string, for instance.

The best tool for debugging is always a trace of what happened at each important point. To achieve this kind of visibility, I like to open a diagnostic file and write into it, like this:

# I put this line up at the beginning of my program
$testing = 1; # Set this to zero to turn off the tracing function.

# And this also somewhere near the head of the program.
if ($testing)
{
    open (DIAGNOS, ">>diagnos.txt")
        or print "Reg-Master Error 106: Cannot open diagnos.txt\n";

    print DIAGNOS "*********************\n";
}

Then, wherever I want to look at the value of a variable, I put something like this:

if ($testing)
{
    print DIAGNOS " In fixupsuites: key::" . $key . "::\n";
    print DIAGNOS "--->intended contents::$goodstuff::\n";
    print DIAGNOS "----->actual contents::$FILnsuites{$key}::\n";
}

The resulting file is then opened with whatever editor you use (pico, vi, etc.) or even FTP'd back down to your PC for reading.

APPENDIX to Lecture 5: More than you ever wanted to know about HDTV:

This information was borrowed from Lycos' reference page on HDTV.

The basic formats (in pixels) are:

1920H x 1080V 16H x 9V Aspect Ratio, Square-Pixel,
Progressive-Scan at 24-Fr/sec and 30-Fr/sec,
Interlace-Scan at 60-fields/sec;

1280H x 720V 16H x 9V Aspect Ratio, Square-Pixel Alignment,
Progressive-Scan at 24-Fr/sec, 30-Fr/sec, and 60-Fr/sec;

704H x 480V 4H x 3V or 16H x 9V Aspect Ratio, Non-Square Pixel Alignment,
Progressive-Scan at 24-Fr/sec, 30-Fr/sec, and 60-Fr/sec,
Interlace-Scan at 60-fields/sec;

640H x 480V 4H x 3V Aspect Ratio, Square Pixel Alignment,
Progressive-Scan at 24-Fr/sec, 30-Fr/sec, and 60-Fr/sec,
Interlace-Scan at 60-fields/sec.

The 1920H x 1080V format is a superset of the 1920H x 1035V SMPTE-240M standard that has been in place since
the late 1980s, and is itself a modification of the original NHK (Japanese Public Broadcasting) Hi-Vision production
format. (In fact, after SMPTE-240M was approved, the Japanese modified their production standard to conform to the
new SMPTE standard, rendering millions of dollars of existing HDTV equipment obsolete.) All of the testing of the ATSC
system was made using SMPTE-240M equipment since it is the only standard that has a complete, if basic, range of
production equipment available. It is quite likely that all HDTV productions made for the next 18 months will be made
using SMPTE-240M equipment. In fact, in anticipation of NHK's HDTV coverage of the 1998 Nagano Winter
Olympics, a complete line of serial-digital 240M equipment has been developed, including digital camcorders and studio
recorders, up-to-date full-featured digital production switchers and Digital Video Effects (DVE) processors. By the
spring of 1998, HDTV production parity with NTSC will be achieved. Most of the major manufacturers of
SMPTE-240M equipment have announced that they will have equipment available that conforms to the new ATSC
version of the standard (SMPTE-274M and SMPTE-292M) by mid-1998.

The 1280H x 720V Progressive-Scan-only format has the exact same overall data rate as the 1920H x 1080V standard,
but sacrifices some horizontal resolution (about 33%) in exchange for eliminating interlace at the highest frame rate, 60
Fr/sec. Theoretically, post-processing, conversion, encoding, and decoding should be easier and better with a
progressive-only standard. The immediate difficulty with this standard is that no production equipment is available for it.
Polaroid has modified a Philips HDTV camera with its own image sensors and has delivered two units to MIT as part of a
government-funded research project, but two cameras do not a production system make. During testing of the ATSC
system, special digital formatters were made by Tektronix that allow recording of the signal onto SMPTE-240M 1" digital
recorders, but this approach is very expensive and not very portable. Switchers, DVEs and other production equipment
needed for a complete production system are not available at this time, and, as of this writing, no manufacturer has come
forward in support of this standard (SMPTE-296M). Perhaps NAB will reveal that enough manufacturers have decided
to support this standard to make it viable for production.

The 704H x 480V standard is backwards compatible with existing NTSC at its 59.94 field/sec rate. The lower-rate
progressive standards (24 Fr/sec & 30 Fr/sec) are intended for better compatibility with film. (Due to MPEG-II
requirements, 1% of the original NTSC frame is lost.)

704H x 480V progressive at 60 Fr/sec is intended as a less-expensive wide-screen HDTV format that is compatible with
close-view computer graphics. At least one national television network has tentatively shown support for this standard,
and there is some production equipment available for it since it has been adopted by a Japanese network for EDTV
(Extended Definition Television) production. Unless more international support for this standard is shown, equipment for it
may eventually be more expensive than 1920H x 1080V equipment, since quantities will be more limited.

The 640H x 480V standard provides for direct VGA computer compatibility. There is no production equipment available
for this standard and no one in the television industry has proposed it as an acquisition format. It provides