Some Video Details
VHS is a physical tape format. It can be recorded using the NTSC, PAL or SECAM techniques, but they all use the same kinds of actual tape and mechanical parts. You will sometimes see the authors refer to VHS as a way of covering all 3 techniques in one sentence.
Wrapping up JPEG
JPEG comes in six main flavors.
Flat (non-hierarchical):
Lossless JPEG compression replaces the DCT (a source of much roundoff and quantization error) with a kind of differential technique. It is based on the idea of encoding one of eight possible relationships between already-transmitted pixels and the new one. If the new pixel is called X, we have a pattern like this:
C B
A X
Pixels C, B and A can be used to predict a value for X by eight methods. Here are 3 of them. All these methods assume that a sequence of bits which we could call EF is transmitted. Both E and F will consist of multiple bits, and the whole thing will be entropy encoded afterward.
Three possible methods:
0: Ignore A, B, C. Just store X's difference from the previous value of X, in variable F.
1: X = A + F
...
7: X = ((A+B)/2) + F
Since F is often near zero, it can be represented with just a few bits. So we may be able to get a whole new pixel value for an average of 2 bits for the code E, and 2 or 3 bits for F.
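To make the differential idea concrete, here is a minimal sketch in Python of methods 1 and 7 only (the two spelled out above). The function names, the use of integer division for the average, and the choice to send the first pixel of a row unchanged are my assumptions for illustration, not something the text specifies.

```python
def predict(method, a, b, c):
    """Guess X from its neighbours: A (left), B (above), C (above-left)."""
    if method == 1:
        return a
    if method == 7:
        return (a + b) // 2  # integer average: an assumption of this sketch
    raise ValueError("only methods 1 and 7 from the notes are sketched here")


def encode_row(row, above, method):
    """Return the differences F for every pixel of `row` except the first."""
    differences = []
    for i in range(1, len(row)):
        guess = predict(method, row[i - 1], above[i], above[i - 1])
        differences.append(row[i] - guess)
    return differences


def decode_row(first_pixel, differences, above, method):
    """Rebuild the row exactly: X = prediction + F."""
    row = [first_pixel]
    for i, f in enumerate(differences, start=1):
        guess = predict(method, row[i - 1], above[i], above[i - 1])
        row.append(guess + f)
    return row
```

On a smooth image the differences stay close to zero, which is exactly what makes the later entropy coding cheap, and decoding gives back the original row with no loss.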
H.261, the videoconferencing technology, is technically somewhat like MPEG, as it is based on temporal as well as spatial redundancy. The only things I expect you to understand about it are the lines/frame discussion on page 84, and the fact that chrominance is transmitted at half the resolution of luminance. 90 * 72 takes up 1/4 the space of 180 * 144, but there are two components to chrominance. So the net effect is that one frame takes 1.5 times the space that a luminance-only ("black and white") signal would take.
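The 1.5 figure is easy to check with a few lines of Python. This is just the arithmetic from the paragraph above; the function name and its parameters are mine.

```python
def frame_samples(luma_width, luma_height, chroma_divisor=2, chroma_components=2):
    """Return (luminance samples, total samples) for one frame.

    Each chrominance component is stored at 1/chroma_divisor of the
    luminance resolution in both directions.
    """
    luma = luma_width * luma_height
    chroma = (luma_width // chroma_divisor) * (luma_height // chroma_divisor)
    return luma, luma + chroma_components * chroma


luma, total = frame_samples(180, 144)
ratio = total / luma  # color frame size relative to luminance only
```

Each chrominance component adds a quarter of the luminance size, and there are two of them, so the total is 1 + 1/4 + 1/4 of the black-and-white frame.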
ISDN is the next grade of telephone-system capacity above POTS. It can be bought in increments of 64 kilobits per second: if capacity = C then C = p * 64 kbps, where p is a convenient number between 1 and 30 according to the text, or between 2 and 30 according to me. My experience is that the smallest commercially available ISDN has p = 2.
I tried generating the numbers about frame rates on p. 84 but they don't make sense to me, so let's not.
Attacking MPEG
MPEG's goals are to take advantage of temporal redundancy (similarities between successive frames of a video image), and to provide methods which support both symmetrical and asymmetrical coding and decoding. Symmetrical means that both recording and playback time requirements are important, whereas asymmetrical means that you can spend lots of time compressing the image, to speed up the playback. Symmetrical coding requires specialized hardware. No CPU now in existence can encode MPEG video in real time without special hardware help.
The text is fairly clear on the following topic, so I'll just ask you to explain it to me.
Query 9.1: Explain the roles of GOP, I, P and B frames.
The hardest idea in MPEG is the motion detection required to compute P (predictive) frames. They're not really predictive, because at the time you encode them, you know exactly what the original data is. But at playback time you don't have that data, of course.
A key idea is to realize that 8 x 8 macro blocks (in chroma, = 16 x 16 in luminance) are quite small, in a 640 x 480 image. Thus, if a car is moving across a scene, a macro block might represent a portion of the driver's door.
So, to compute a P frame, the encoder will look at two source images: the current frame C of raw data, and the previous I-frame. Considering a particular macro block in C, the system examines the corresponding macro block and a few adjacent ones, in I. The one which is most similar (by subtraction and computing an average across all the pixels) is designated as the "base", and a motion vector like (1,0) is used to represent the mapping from base to P-frame.
Now, this mapping is not exact, and so we still have to store a correction frame. But since the differences are small and likely to affect the whole block, the DCT of this difference will be almost all zeroes. It'll compress very nicely.
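The search described above can be sketched in a few lines of Python. Treat this as an illustration under my own assumptions, not the encoder's real algorithm: frames are plain lists of rows of luminance values, "most similar" is taken to be the smallest mean absolute difference, the search window is only 2 pixels in each direction, and all the function names are hypothetical.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def mean_abs_diff(block_a, block_b):
    """Average absolute pixel difference between two equal-sized blocks."""
    total = sum(abs(p - q)
                for row_a, row_b in zip(block_a, block_b)
                for p, q in zip(row_a, row_b))
    return total / (len(block_a) * len(block_a[0]))


def find_motion_vector(reference, current, top, left, size=8, search=2):
    """Return ((vx, vy), cost) for the best-matching base block.

    The vector points from the base block in `reference` to the block at
    (top, left) in `current`, so the base sits at (top - vy, left - vx).
    """
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = None, None
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            base_top, base_left = top - vy, left - vx
            if (base_top < 0 or base_left < 0 or
                    base_top + size > height or base_left + size > width):
                continue  # candidate falls outside the reference frame
            candidate = get_block(reference, base_top, base_left, size)
            cost = mean_abs_diff(target, candidate)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (vx, vy), cost
    return best_vector, best_cost


def correction_block(reference, current, top, left, vector, size=8):
    """Pixel-by-pixel difference between the current block and its base."""
    vx, vy = vector
    base = get_block(reference, top - vy, left - vx, size)
    target = get_block(current, top, left, size)
    return [[p - q for p, q in zip(row_t, row_b)]
            for row_t, row_b in zip(target, base)]
```

If the whole picture slides one pixel to the right between the two frames, the search returns the vector (1, 0) and the correction block is all zeroes, which is the ideal case for the DCT that follows.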
This scheme relies on the idea that the things moving will cause large blocks of color (e.g. the sky) to be relocated. "Large" means bigger than a macro block. A panning camera is the most common source of large scale motion in a scene, but cars driving by, and other scene action, obviously contribute too. Still scenes are MPEG's "bread and butter" since all that background will have zero motion from one frame to another.
B frames are just "between" frames, computed by a kind of averaging from I and P frames. They use motion interpolation both forward and backward, whichever gives the best result. So obviously when decoding, you have to expand all the P frames from one I to the next, before you can construct the B frames in between. B frames are "dumb and fast."
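Here is a rough Python sketch of the "whichever gives the best result" choice for one block of a B frame. It leaves motion vectors out entirely and works on flattened lists of pixel values; the three candidate modes, the sum-of-absolute-differences score and the names are my assumptions for illustration.

```python
def sum_abs_diff(block_a, block_b):
    """Total absolute difference between two flattened blocks."""
    return sum(abs(p - q) for p, q in zip(block_a, block_b))


def predict_b_block(target, past_block, future_block):
    """Pick the prediction closest to `target`.

    Candidates: copy from the earlier frame, copy from the later frame,
    or average the two. Returns (mode name, predicted block).
    """
    candidates = {
        "forward": list(past_block),
        "backward": list(future_block),
        "average": [(p + f) // 2 for p, f in zip(past_block, future_block)],
    }
    mode = min(candidates, key=lambda name: sum_abs_diff(target, candidates[name]))
    return mode, candidates[mode]
```

A block that is fading smoothly from its earlier value to its later one is matched best by the average; a block that has not changed since the earlier frame is matched best by the forward copy.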
Query 9.2: The "motion" encoded in MPEG only maps macro blocks into other macro blocks. But the world doesn't actually move in convenient 16 pixel jumps. How can such a technique possibly work out in practice?
MPEG-2 is a refinement of MPEG-1. No major new techniques are introduced, but a variety of levels of service are provided. Due to various improvements, MPEG-2 achieves higher compression ratios than MPEG-1, but I believe the improvement is at most 2:1. Anyone with other data should let me know, and I'll check other sources before Thursday's class.
Data Rates
It's tricky to try to decide how many bits of information are really needed to store an analog video image. Let's try, approximately. Let's figure out the number of bytes in a 640 x 480 image with 8 bits per pixel. Answer = 640 x 480 = 307,200 bytes. Of course if it were a 24 bit image, this would be about 0.9 megabyte. If we wanted 60 frames a second, that's about 54 megabytes/second (lots of data!)
On the other hand, we know that NTSC composite video is transmitted in about 4.5 MHz of radio bandwidth, which (if we assume sampling at twice the bandwidth, with 8 bit resolution) could be stored with about 9 MB/second of sampling. There is a big gap between 54 MB/sec and 9 MB/sec. What's up?
Well, the analog image actually isn't nearly as good as 640 x 480. The two successive fields are nearly identical, so you have an effective vertical resolution of about 260 lines. The color is not changing nearly as fast across a horizontal line as the luminance, so the 640 x 480 image's "24 bits of color per pixel" is much richer in detail than the NTSC image.
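The arithmetic behind the two figures can be written out in Python. The function names are mine, and sampling at twice the bandwidth with one byte per sample is the assumption behind the 9 MB/second number.

```python
def raw_video_rate(width, height, bytes_per_pixel, frames_per_second):
    """Bytes per second for uncompressed video."""
    return width * height * bytes_per_pixel * frames_per_second


def sampled_signal_rate(bandwidth_hz, bytes_per_sample=1):
    """Bytes per second when sampling at twice the signal bandwidth."""
    return 2 * bandwidth_hz * bytes_per_sample


one_frame_8bit = raw_video_rate(640, 480, 1, 1)    # one 8 bit frame
one_frame_24bit = raw_video_rate(640, 480, 3, 1)   # one 24 bit frame
digital_rate = raw_video_rate(640, 480, 3, 60)     # 24 bit, 60 frames/second
analog_rate = sampled_signal_rate(4_500_000)       # 4.5 MHz of bandwidth
gap = digital_rate / analog_rate                   # roughly a factor of 6
```

Done exactly, the digital rate comes to a little over 55 million bytes per second; the "about 54" above comes from rounding a frame to 0.9 megabyte first. Either way the gap is about a factor of six.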
Profile, Level, Scalability on Table 3.2, p.100.
Profile: What resolution is supported
Level: What bandwidth is required
Scalability: Multiple "levels of detail" in the same signal
Again, the capacity numbers seem strange to me. Let's not memorize them. Instead we look back at the ratios on page 90:
MPEG-1: 130:1
MPEG-2: 80:1 to 100:1
H.261: 20:1 to 300:1
Motion JPEG: 7:1 to 27:1 - but this uses no temporal redundancy.
Clearly, all this motion technology pays off, as can be seen in the contrast between Motion JPEG and any of the others.
Audio for MPEG.
Map it to the frequency domain, analogous to the DCT for images. Then use a psychoacoustic model (equivalent to the quantization step in video) to throw away parts of the signal that folks can't hear.
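As a toy illustration of those two steps, the Python sketch below transforms a short signal with a plain discrete Fourier transform and then drops every frequency component that is much quieter than the loudest one. Be warned that the fixed "10% of the peak" rule is only my stand-in for a psychoacoustic model, which in reality is far more elaborate, and that the function names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Plain (slow) discrete Fourier transform of a list of samples."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples))
            for k in range(n)]


def idft(bins):
    """Inverse transform, returning the real part of each sample."""
    n = len(bins)
    return [sum(b * cmath.exp(2j * math.pi * k * t / n)
                for k, b in enumerate(bins)).real / n
            for t in range(n)]


def drop_quiet_bins(bins, fraction=0.1):
    """Zero every bin weaker than `fraction` of the strongest bin.

    Toy stand-in for the psychoacoustic model: an assumption, not MPEG's rule.
    """
    peak = max(abs(b) for b in bins)
    return [b if abs(b) >= fraction * peak else 0 for b in bins]
```

For a loud tone with a very quiet tone mixed in, only the two bins of the loud tone survive, and the rebuilt signal differs from the original by no more than the amplitude of the quiet tone.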
MPEG1: two channels
MPEG2: 5 channels.
I wish they'd talked about MP3 sound (often called "MPEG3", though it is really MPEG-1 Audio Layer III), since it's the source of the great music controversy. A half hour of web surfing has not found good info on how MP3 works; but I'll keep trying.
Alert: MP3 sound is one potential domain for a Project 3.
MPEG-4.
Key phrase: "synthetically generated audiovisual information." SNHC="Synthetic and Natural Hybrid Coding" group. Consider a cartoon. If its characters exist in object-space as vectorized outlines, that info could be preserved & transmitted way cheaper than pixels.
SNHC includes human face and body geometry and animation, text, and interactivity. It's almost like VRML in some respects - that is, one can define a "virtual world" and move through it. Here's a comprehensive article about MPEG-4.
The AVO (Audio-Visual Object). An AVO consists of a 2d fixed background, a visual object such as a person walking in front of the background, and the associated audio. AVOs can be organized into a hierarchy.
In addition, symbolic information is embedded so that the resulting data files are searchable and indexable, like web pages. "MPEG-4 can be viewed as a configuration, communication and instantiation of classes of objects."
Object-space descriptions of visual or audio elements are inherently more flexible than image space descriptions. They can be scaled, rotated and transformed in ways which would inevitably produce distortions if applied to pixelated or sampled-audio representations. MPEG-4 music includes MIDI specification, for instance. With MIDI music, tempo can be varied and the music can be transposed to any key.
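The MIDI point is easy to see in code. In the Python sketch below a tune is a list of (pitch, start beat, length in beats) tuples, where pitch is a MIDI note number (60 is middle C). That tuple layout and the function names are my own simplification, not the MIDI file format.

```python
def transpose(notes, semitones):
    """Shift every note by the same number of semitones."""
    return [(pitch + semitones, start, length) for pitch, start, length in notes]


def duration_seconds(notes, beats_per_minute):
    """Playing time of the tune at a given tempo."""
    total_beats = max(start + length for _, start, length in notes)
    return total_beats * 60.0 / beats_per_minute


c_major_arpeggio = [(60, 0, 1), (64, 1, 1), (67, 2, 2)]  # C, E, G
```

Transposing is a single addition per note and a tempo change is a single division, with no loss of quality. Doing either to sampled audio means resampling or more elaborate processing, and that is where the distortions come from.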
Other MPEG-4 features:
- noise resistance, for mobile computing applications
- streaming video so you don't have to wait for it all before you start playing it.
- collaborative scene visualization, as in "shared virtual world"
Alert Again: MPEG-4 would be a potential student lecture, or a Project 3.
MPEG-7 adds an extensive descriptive layer, to improve the searchability of the data. Spec to be complete in year 2000. Query classes include color, texture, sketch, 2d and 3d shapes, etc. for images, and frequency contour, profile, timbre, source of sound, etc. for audio.
Sounds mighty ambitious to me.
MHEG - Multimedia Hypermedia Experts Group.
Supports extensions to the HTML concept. HTML is a markup language whereas MHEG is a script interchange representation. What's that? "Abstract Syntax Notation" (ISO ASN.1). Sounds neat; the book doesn't tell enough to describe it. There's a Java connection. Video-on-demand seems to be one manifestation.
Alert #3: Volunteers?
GIF. CompuServe's spec for generalized color raster images. It uses LZW, a dictionary-based lossless encoding that is sorta like Huffman in spirit, without the frequency domain magic of JPEG. GIF provides 256 colors instead of the 24 bit full spectrum of JPEG, and does a better job than JPEG on artificial documents. JPEG is best for photographic information. CompuServe attempted to charge license fees to GIF users a couple of years ago, which led to widespread protest and the invention of the PNG standard.
Vector Quantization, as described in the text, is indistinguishable from GIF. Both use a codebook to look up the most common patterns in an image.
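The codebook lookup can be sketched in Python as follows. This shows only the lookup half of vector quantization: the codebook is handed in ready-made (building a good one is the hard part), blocks are flattened tuples of pixel values, and the squared-distance measure and function names are my assumptions.

```python
def nearest_codeword(codebook, block):
    """Index of the codebook entry closest to `block` (squared distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - p) ** 2 for c, p in zip(codebook[i], block)))


def vq_encode(blocks, codebook):
    """Replace each block by the index of its nearest codeword."""
    return [nearest_codeword(codebook, block) for block in blocks]


def vq_decode(indices, codebook):
    """Look each index back up. The result is approximate, not exact."""
    return [codebook[i] for i in indices]
```

With a three-entry codebook each 2 x 2 block of pixels shrinks to a 2 bit index, at the price of getting back the codeword rather than the original block.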
Wavelets. A different, fractal-like basis set is sometimes used to represent images with sharp transitions in them. These wavelets are, for instance, being used for fingerprint transmission as the text describes.
Fractals. The concept behind fractal image compression is weird and wonderful. It is based on the idea of "self similarity". For the description of, for instance, a fern, the idea is very natural since a part of a fern looks a lot like an entire fern. So all you have to store is a set of matrix transformations such as rotation, scaling and translation, together with a frequency (probability) distribution. But why does this trick work for a human face? We don't normally think of a face as being composed of lots of little images of a face. Folks are still working on this one. Its chief virtue is that the representation can be zoomed in and out with fewer artifacts than one gets from a JPEG.
Iterated Function Systems (IFS) are the special kinds of fractals used for image compression. I'll show you some examples and discuss this technique a bit more in class.
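The fern is the classic demonstration. The Python sketch below uses the widely published Barnsley fern coefficients, which are not from our text, and plays the "chaos game": starting from the origin, it repeatedly picks one of four affine maps at random, weighted by its probability, and applies it. The function name and the fixed random seed are my choices.

```python
import random

# Each row: (a, b, c, d, e, f, probability), meaning
#   x' = a*x + b*y + e
#   y' = c*x + d*y + f
FERN_MAPS = [
    (0.00, 0.00, 0.00, 0.16, 0.0, 0.00, 0.01),    # stem
    (0.85, 0.04, -0.04, 0.85, 0.0, 1.60, 0.85),   # ever smaller leaflets
    (0.20, -0.26, 0.23, 0.22, 0.0, 1.60, 0.07),   # largest left leaflet
    (-0.15, 0.28, 0.26, 0.24, 0.0, 0.44, 0.07),   # largest right leaflet
]


def fern_points(count, seed=0):
    """Generate `count` points on the fern by random iteration."""
    rng = random.Random(seed)
    weights = [m[6] for m in FERN_MAPS]
    x, y = 0.0, 0.0
    points = []
    for _ in range(count):
        a, b, c, d, e, f, _prob = rng.choices(FERN_MAPS, weights=weights)[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points
```

The whole "image" is stored as those 28 numbers. Plot a few thousand of the generated points and the fern appears, at whatever magnification you like.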
Queries are not very numerous in this domain because the text
doesn't provide us many hard facts to work with.