We begin. We will study several basic techniques, including the Walsh and Discrete Cosine transforms and differential coding. Then we'll put them together into JPEG.
You need to know how to compute the amount of storage required for raw, uncompressed data. The obvious method is to multiply the height times width times bit depth. That gives the size in bits; divide by 8 to get bytes.
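Here is a minimal sketch of that computation. The function names are mine, and the 640 x 480, 24-bit image is a made-up example, not the one in the query.

```python
def raw_storage_bits(width, height, bits_per_pixel):
    """Raw (uncompressed) image size in bits: width x height x bit depth."""
    return width * height * bits_per_pixel


def raw_storage_bytes(width, height, bits_per_pixel):
    """Same size in bytes, rounded up to a whole byte."""
    return (raw_storage_bits(width, height, bits_per_pixel) + 7) // 8


# Hypothetical example: 640 x 480 pixels at 24 bits per pixel.
print(raw_storage_bits(640, 480, 24))   # 7372800
print(raw_storage_bytes(640, 480, 24))  # 921600
```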
Query 7.1: Compute the storage required for a 540 x 480 image with 8 bits per pixel.
Before I can teach you about the DCT (Discrete Cosine Transform), I have to teach you (or remind you) about vectors, and matrix multiplication. Vectors are useful for many things. To make this lesson interesting, we'll consider vectors as used in computer graphics for geometric representation. We will represent points in two dimensions with a vector like (x, y, 1). The purpose of the 1 will perhaps be clear later.
Matrix Multiplication. To multiply the matrix M by the vector V, where V = (x y 1) and

        a b c
    M = d e f
        g h i

we like to arrange V as a column on the right side of M. Then we slide a left-hand finger across the top row of the matrix, while we slide a right-hand finger down the column of V. We multiply each pair of symbols and add the results. Thus

              a b c   x     ax + by + 1*c   <-- the first component of W
    W = M*V = d e f * y  =
              g h i   1

Then we do the same thing to compute the second and third components of W:

              a b c   x     ax + by + 1*c
    W = M*V = d e f * y  =  dx + ey + 1*f   <-- the second component of W
              g h i   1     gx + hy + 1*i   <-- the third component of W

So we wind up with a new vector, the result of multiplying M by V. We can think of the matrix M as a machine which takes in a vector and spits out another vector.
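The finger-sliding procedure can be sketched in a few lines of Python. The helper name mat_vec and the sample numbers are my own, chosen only for illustration.

```python
def mat_vec(M, V):
    """Multiply matrix M (a list of rows) by the column vector V (a list).

    Each output component is one row of M paired up with V,
    multiplied pairwise and summed.
    """
    return [sum(m * v for m, v in zip(row, V)) for row in M]


# Made-up matrix and vector, just to exercise the function.
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
V = [1, 0, 1]

W = mat_vec(M, V)
print(W)  # [4, 10, 16]
```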
Now, for practice, please multiply V = (2,3,1) by the matrix S below.

        2 0 0
    S = 0 2 0
        0 0 1

Query 7.2: What happened to the point (2,3) in XY space, represented by the vector (2,3,1)? S is called a SCALING Transformation.
Now try this one:

        1 0 5
    T = 0 1 0
        0 0 1

T is called a TRANSLATION. What did it do to the point (2,3)?
Finally, consider the matrix R:

         cos A  sin A  0
    R = -sin A  cos A  0
          0      0     1

Query 7.3: Cos 90 degrees = 0; sin 90 degrees = 1. Try this matrix on the point (2,3), with A = 90 degrees. Where does your point go?
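If you want to check your hand work, here is a rough sketch that builds the three kinds of matrix and applies them to a different point, (1, 0), so the queries stay unspoiled. The function names are my own invention.

```python
import math


def mat_vec(M, V):
    """Multiply matrix M (list of rows) by column vector V."""
    return [sum(m * v for m, v in zip(row, V)) for row in M]


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def rotation(degrees):
    # Same sign layout as the matrix R in the text.
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, s, 0], [-s, c, 0], [0, 0, 1]]


P = [1, 0, 1]  # the point (1, 0) with the extra 1 tacked on

print(mat_vec(scaling(2, 2), P))      # [2, 0, 1]
print(mat_vec(translation(5, 0), P))  # [6, 0, 1]
# Floating-point cos(90 degrees) is not exactly 0, so round before printing.
print([round(w, 6) for w in mat_vec(rotation(90), P)])
```

With this sign layout the point (1, 0) lands at approximately (0, -1).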
Now that you can multiply matrices, we will forget about (x,y) and geometry, and go onward.
DCT (Discrete Cosine Transform) is very important. So I'm going to teach you about an easier-to-understand "cousin" of DCT, called the Walsh transform. Then we'll talk about DCT. We begin with the one dimensional Walsh. Consider a stream of data (real numbers); perhaps it's a series of samples of an audio signal.
You're familiar with the idea of building up binary numbers by adding together base components; or with making change by adding together pennies, nickels, dimes and quarters. (Why do coins keep coming up in this course?) For instance, to spell out the value "9" in binary, we need an eight (1 0 0 0) and a one (0 0 0 1). Adding, we get 1 0 0 1.
The Walsh transform represents a series of numbers by using a base set like this one (for series of 8 numbers):

    B0 = 1 1 1 1 1 1 1 1
    B1 = 1 1 1 1 1 1 1 0
    B2 = 1 1 1 1 1 1 0 0
    B3 = 1 1 1 1 1 0 0 0
    B4 = 1 1 1 1 0 0 0 0
    B5 = 1 1 1 0 0 0 1 1
    B6 = 1 1 0 0 1 1 0 0
    B7 = 1 0 1 0 1 0 1 0

We refer to these bases as B0, B1 ... B7. You can see that all of these are "really" repeating series, it's just that B0 through B4 haven't got room to be seen actually repeating. (B0 either repeats with every character, or never repeats, depending on your point of view. Its frequency is zero, anyhow, so its wavelength must be infinite.)
Now, what if we wanted to represent a series like S1 = 3 3 3 1 1 1 3 3? A little experimentation would reveal that S1 = B0 + 2*B5. In fact, you can represent ANY series of 8 numbers, by some linear combination of these bases. The coefficients might be negative. For instance, if I wanted to produce the series -1 1 -1 1 -1 1 -1 1, how would I do it? The answer is revealed below, but try it first!
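The claim S1 = B0 + 2*B5 is easy to check mechanically. Below is a small sketch; the names BASES and combine are mine, and it only verifies the worked example, not the exercise.

```python
# The eight bases from the text, B0 first.
BASES = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
]


def combine(coeffs, bases):
    """Return the series coeffs[0]*bases[0] + coeffs[1]*bases[1] + ..."""
    n = len(bases[0])
    return [sum(c * b[i] for c, b in zip(coeffs, bases)) for i in range(n)]


# One unit of B0 plus two units of B5.
coeffs = [1, 0, 0, 0, 0, 2, 0, 0]
print(combine(coeffs, BASES))  # [3, 3, 3, 1, 1, 1, 3, 3]
```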
B0 is called the "DC Component". DC means "direct current", which is what you get out of a battery. It provides a constant voltage (until the battery runs down). The rest are AC (alternating current) components, with various frequencies. The frequency of a signal is the reciprocal of its wavelength (how long it takes to repeat). What's the frequency of B7? B6? B5? See a pattern here?
Definitions: A signal S is a series of numbers S0, S1, S2... Sn. A transform T is a series of coefficients T0, T1..Tn. T is the Walsh Transform of S, if S = T0*B0 + T1*B1 + T2*B2 ... + Tn*Bn.
It would be extremely tedious to have to hunt around for the Walsh transform. Turns out, though, that there's an elegant way to do it. MATRICES and VECTORS!
To cut down on the typing, we'll use signals with four samples instead of 8. Here's the basis set, as a matrix called M4. Note that the bases B0, B1 etc. form columns in M4.

         1 1 1 1
    M4 = 1 1 1 0
         1 1 0 1
         1 0 0 0

Now, if I multiply a Walsh transform vector like T = (1 0 0 -2) by placing its column form on the right of M4, I should get the corresponding signal S = (-1 1 -1 1). I hope you got that result, too.
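As a sketch of the same step in code: build M4 by standing the four bases up as columns, then multiply by T. The variable and helper names are assumptions of mine.

```python
def mat_vec(M, V):
    """Multiply matrix M (list of rows) by column vector V."""
    return [sum(m * v for m, v in zip(row, V)) for row in M]


# The four-sample bases B0..B3, written as rows here for convenience.
B = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

# Transposing turns each basis into a column of M4.
M4 = [list(row) for row in zip(*B)]
print(M4)  # [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 0, 0]]

T = [1, 0, 0, -2]
S = mat_vec(M4, T)
print(S)  # [-1, 1, -1, 1]
```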
This trick works if you know T and want S. But what if you have S and want to find T? Well, if you remember your linear algebra (which very few computer scientists seem to remember, if they ever had it back in high school), you know that if S = M*T, then T = R*S where R is the "inverse" of M. If you multiply M * R, you get the Identity matrix I (left-up to right-down diagonal = 1, all else = 0). Not all matrices have inverses, but fortunately the Walsh matrices do. Here's R4:

          0  0  0  1
    R4 = -1  1  1 -1
          1  0 -1  0
          1 -1  0  0

Query 7.4: Compute R4*S, where S = (-1 1 -1 1), and see if you get our alleged transform, which was (1 0 0 -2).
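Two things are worth checking here: that M4 * R4 really is the identity, and that going S -> T -> S gets you back where you started. The sketch below does both, using a made-up signal (5 3 4 1) rather than the one in the query. Helper names are mine.

```python
def mat_vec(M, V):
    """Multiply matrix M (list of rows) by column vector V."""
    return [sum(m * v for m, v in zip(row, V)) for row in M]


def mat_mul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


M4 = [[1, 1, 1, 1],
      [1, 1, 1, 0],
      [1, 1, 0, 1],
      [1, 0, 0, 0]]

R4 = [[0, 0, 0, 1],
      [-1, 1, 1, -1],
      [1, 0, -1, 0],
      [1, -1, 0, 0]]

IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
print(mat_mul(M4, R4) == IDENTITY)  # True

S = [5, 3, 4, 1]        # hypothetical signal
T = mat_vec(R4, S)      # forward transform
print(T)                # [1, 1, 1, 2]
print(mat_vec(M4, T))   # [5, 3, 4, 1], back to the signal
```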
SUMMARY: If I gave you a series of numbers (of some length n which is a power of 2), you could compute its Walsh transform, and vice versa. The most important thing to note is that the transform is a frequency analysis of the signal. It can be read as saying that the signal has 1 volt of B0 (=DC) and -2 volts of B3 (which was the component with frequency 1/2.)
What do we mean by "frequency 1/2"? Well, if we knew what units the time axis was in, we could calibrate that. Assume we're taking 1000 samples per second, so each sample lasts one millisecond. B3 completes 1/2 cycle per sample, so its frequency is 1/2 cycle per millisecond, or 500 cycles/second (Hz).
Now what about the DCT? It's the same deal, only with a different matrix. The DCT matrix has somewhat better mathematical properties for image compression, but it's less intuitive to explain than the Walsh is.
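To make "a different matrix" concrete, here is a sketch of the standard orthonormal DCT-II matrix. This formula is the common textbook definition and is my assumption about which DCT variant is meant; the notes themselves don't spell it out. One difference from the Walsh setup above: here the bases sit in the rows, so T = C*S, and the inverse is simply the transpose of C.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k is a cosine basis of frequency k."""
    C = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        C.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return C


def mat_vec(M, V):
    return [sum(m * v for m, v in zip(row, V)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


C = dct_matrix(4)
S = [-1, 1, -1, 1]
T = mat_vec(C, S)                  # forward DCT
S_back = mat_vec(transpose(C), T)  # inverse DCT via the transpose

print([round(t, 4) for t in T])
print([round(s, 4) for s in S_back])
```

The first coefficient comes out (near) zero because this signal averages to zero, so it has no DC component.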
Spatial frequencies.
BUT NOW we stop thinking about cycles per second, and start thinking about cycles per inch. What if the sample S consisted of pixel values along a scan line? We could still talk about its frequency. A pattern like 1 0 1 0 1 0 which repeats itself 50 times in 100 pixels, could be said to have a frequency of 1/2 cycle per pixel.
A pattern can have different frequencies in the X and Y directions. What are the spatial frequencies of this pattern?

    1 1 0 0 1 1 0 0
    1 1 0 0 1 1 0 0
    1 1 0 0 1 1 0 0
    0 0 1 1 0 0 1 1
    0 0 1 1 0 0 1 1
    0 0 1 1 0 0 1 1
    1 1 0 0 1 1 0 0
    1 1 0 0 1 1 0 0

In the horizontal direction, it repeats every 4 pixels, a frequency of 1/4 cycle per pixel. In the vertical direction, it repeats every 6 pixels, a frequency of 1/6 cycle per pixel. SO... we can consider the idea of the Walsh transform of an image, too! The Walsh transform of a one dimensional signal with n samples consisted of n coefficients. The transform of an n x n image will, by analogy, consist of an array of n x n coefficients.
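One crude way to read off those repeat lengths by machine is to look for the smallest shift at which a row (or column) lines up with itself. This is only a sketch with a made-up helper name, and it can only see repeats that fit in the window it is given.

```python
def smallest_period(seq):
    """Smallest shift p such that seq[i] == seq[i + p] wherever both exist."""
    n = len(seq)
    for p in range(1, n):
        if all(seq[i] == seq[i + p] for i in range(n - p)):
            return p
    return n  # no repeat visible within the window


IMAGE = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
]

top_row = IMAGE[0]
left_column = [row[0] for row in IMAGE]

print(smallest_period(top_row))      # 4
print(smallest_period(left_column))  # 6
```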
Let's call the pixels of the image Sxy. Let's call the transform Tuv. The two dimensional matrix which represents Tuv is the spatial-frequency representation of the corresponding image. Its DC component is called T00 and is found in the upper left corner, etc.
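The notes don't say how Tuv is actually computed. A common approach, which I'm assuming here, is to apply the one-dimensional transform along both directions of the block: T = C * S * C-transpose, with C the orthonormal DCT-II matrix. The sketch below uses a made-up flat 4 x 4 block to show that a constant image has only a DC component, in the upper left corner.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix (assumed variant; rows are the bases)."""
    C = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        C.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return C


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def dct_2d(block):
    """Separable 2-D transform of a square block: C * block * C^T."""
    C = dct_matrix(len(block))
    return mat_mul(mat_mul(C, block), transpose(C))


flat = [[10] * 4 for _ in range(4)]  # hypothetical constant block
T = dct_2d(flat)
print(round(T[0][0], 6))             # 40.0, the DC term T00
```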
Now for JPEG.
Pick up the thread on page 73 of the Wu text. JPEG uses 8 x 8 blocks of pixels in an image. An RGB 24 bit image would use 3 x 64 = 192 bytes to store this part of the image. Our goal is to squeeze that down as far as possible.
The steps of the process are:
1) Forward Discrete Cosine Transform
2) Quantization
3) Zigzag Scan
4) Run-length and Huffman coding
5) JPEG Syntax Generation
Four Flavors. Sequential mode is the "basic" JPEG, lossy. You
don't see anything until the picture arrives and is decompressed; though
you may see it build up in 8x8 blocks.
Progressive mode: Send a low-res version of the image first, then a better one. The low-res images can be constructed from the same DCT as the high-res ones, so compression isn't so expensive. You've all seen this kind of image on the Web, methinks. Final picture is no better than sequential.
Hierarchical mode: A low res version is used as the basis for sending a subsequent high res version, recursively.
Lossless mode: Don't do the DCT because it's lossy. This just leaves us with run length and Huffman, which doesn't pack things very tightly. This mode is delivered sequentially.
Color. The image is represented not in RGB, but in a luminance-chroma fashion called YCbCr or sometimes called YUV. This is equivalent to a "coordinate transformation" in color space using the matrix described on page 77, namely

          0.299  0.587  0.114
    CT = -0.169 -0.331  0.500
          0.500 -0.419 -0.081
The advantage of using YUV is that the human eye doesn't have as much acuity for chroma as for luminance, and so we can subsample the chroma (i.e. use every other sample, as on page 78. I wouldn't worry about the 4:2:2 terminology, as I don't understand it anyway.)
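Here is a rough sketch of both ideas: the color-space matrix applied to one pixel, and "use every other sample" subsampling. I use the standard luminance weights 0.299, 0.587, 0.114 (they sum to 1) in the top row. Function names are mine, and real JPEG files also add an offset to the chroma values, which I leave out here.

```python
CT = [
    [0.299, 0.587, 0.114],     # Y  (luminance)
    [-0.169, -0.331, 0.500],   # Cb (blue-difference chroma)
    [0.500, -0.419, -0.081],   # Cr (red-difference chroma)
]


def rgb_to_ycbcr(r, g, b):
    """Apply the color-space matrix to one RGB pixel (no chroma offset)."""
    return tuple(row[0] * r + row[1] * g + row[2] * b for row in CT)


def subsample(samples):
    """Keep every other sample along a scan line."""
    return samples[::2]


y, cb, cr = rgb_to_ycbcr(255, 255, 255)  # pure white
print(round(y, 3), round(cb, 3), round(cr, 3))

chroma_line = [10, 12, 11, 13, 40, 42, 41, 43]  # made-up Cb values
print(subsample(chroma_line))                   # [10, 11, 40, 41]
```

White comes out as full luminance with (essentially) zero chroma, which is what you'd hope for from a gray-scale color.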
Interleaving of components. Should we send all the red in an image, then all the green, then all the blue? That would make it hard to do "on the fly" reconstruction, so we normally send the R G and B of a given small screen area, then repeat our way across the image. This is called component interleaving.
Quantization is confusing but important. In essence, for each of the 64 cells in a transform, the question is asked: "how important is this data?" A default table of weights is provided for JPEG. It is NOT symmetrical, by the way. It reflects the fact that most photographs taken by humans have strong horizontal structure to them, and so assumes more redundancy (i.e. assumes greater energy in the low frequency components) in the horizontal direction. Highly sophisticated JPEG users can also make up their own quantization tables.
So... the cell for T44 might contain the number 30. This means, "divide the value in this cell by 30 before storing it." Or, equivalently: "30 units of energy at frequency T44 is worth one point. Sixty units is worth 2 points, etc."
If you think about it, the net effect is to group similar but not identical features into a single level. A small quantization number means "store lots of detail at this frequency" and a large one says "ignore this component unless it's really large."
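A minimal sketch of divide-and-round quantization follows. The coefficients and table entries are made up for illustration (they are not the default JPEG table), and the function names are mine.

```python
def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [int(round(c / q)) for c, q in zip(coeffs, table)]


def dequantize(levels, table):
    """What the decoder does: multiply the stored levels back up."""
    return [level * q for level, q in zip(levels, table)]


coeffs = [-83, 45, 14, 61]  # hypothetical transform values
table = [16, 11, 30, 30]    # hypothetical quantization numbers

levels = quantize(coeffs, table)
print(levels)                     # [-5, 4, 0, 2]
print(dequantize(levels, table))  # [-80, 44, 0, 60]
```

The values that come back are close to the originals but not identical. That gap is where JPEG's loss comes from, and the 14 that fell below half of its quantization number vanished entirely.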
Since the higher frequencies are quantized into fewer levels, a lot of them in fact become zeroes. The next step is to put them into a list, according to the zig zag sequence chart on page 82. All those zeroes tend to fall toward the end of the list. The zeroes are run length encoded. Specifically, a tricky RLE technique is used which expects mostly zeroes. The data is stored as pairs of numbers like ((1,2),(0,1)), where the first number in each pair means how many zeroes, and the second number is the nonzero item that follows. The above code represents 0 2 1, which in this case wasn't a very useful thing to say.
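The sketch below strings those two steps together on a made-up 3 x 3 block. The zig-zag order is the usual one that starts at the upper left and sweeps along anti-diagonals, which I am assuming matches the chart in the text. Using the pair (0, 0) as an "all zeroes from here on" marker is also my assumption; the notes don't say how trailing zeroes are handled.

```python
def zigzag_order(n):
    """(row, col) visiting order for an n x n block, sweeping anti-diagonals."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(diag)
    return order


def rle_encode(values):
    """Encode as (zero_run, nonzero_value) pairs; (0, 0) marks trailing zeroes."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((0, 0))  # assumed end-of-block marker
    return pairs


def rle_decode(pairs, length):
    """Expand the pairs back into a list of the given length."""
    out = []
    for run, v in pairs:
        if (run, v) == (0, 0):
            break
        out.extend([0] * run + [v])
    return out + [0] * (length - len(out))


block = [[5, 0, 0],
         [2, 0, 0],
         [0, 0, 0]]
scan = [block[r][c] for r, c in zigzag_order(3)]
print(scan)              # [5, 0, 2, 0, 0, 0, 0, 0, 0]
print(rle_encode(scan))  # [(0, 5), (1, 2), (0, 0)]
print(rle_decode([(1, 2), (0, 1)], 3))  # [0, 2, 1], the example from the text
```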
Then Huffman encoding crunches this data (as "code word pairs") very nicely.
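For the curious, here is a compact sketch of building a Huffman code over a list of code word pairs. It is the textbook greedy construction (repeatedly merge the two rarest items), not the table format an actual JPEG file uses, and the sample pairs are invented.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each distinct symbol to a bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)  # rarest
        c2, _, codes2 = heapq.heappop(heap)  # next rarest
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Made-up stream of (zero_run, value) pairs; (0, 0) is by far the most common.
pairs = [(0, 0)] * 5 + [(0, 1)] * 2 + [(1, 2)]
code = huffman_code(pairs)
for sym in sorted(code):
    print(sym, code[sym])

total_bits = sum(len(code[p]) for p in pairs)
print(total_bits)  # 11
```

The most frequent pair gets the shortest bit string, which is the whole point.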
The Spatial Frequencies in a Picture
The charts on page 80 and 81 are the key to understanding JPEG. In the upper corner of page 80 we see a set of 64 coefficients which begin with a DC component (upper left corner) of -80. In the lower right corner we see mostly zeroes. When we divide each cell in this array by its corresponding cell in the Quantization Array below, we get the array on page 81. This one is REALLY full of zeroes.
Query 7.5: Use the zigzag pattern to traverse the array on page
81. Runlength encode the image according to the above zero-based two byte
encoding scheme. Now develop a Huffman code for the resulting set of two-byte
code words. (In an actual JPEG, this Huffman code would be developed
for ALL the two byte code words from the whole image, not just one 8x8
region.)
Our classroom discussion will focus on the bottom of page 84 where we deal with the raw data rate and compression ratios needed. This kind of computation will be the sort of thing I'll put on exams about this section.