Wednesday, May 9, 2007

Video Fundamentals


Video is an illusion that makes use of the properties of eye. Our eye has a peculiar property that image sensed by eye persists for th of a second: ie our eye cannot notice the change of scene in this period. It’s called persistence of vision. We can make the illusion of continuity of a scene by capturing just more than 16 pictures of the visual in a second and displaying the same on a screen in the same time period. It’s the basic principle behind video rendering. Each picture captured is called frames. Therefore a frame rate more than 16 is convenient. Usually we go for 25 fps or more. In the earlier television system we get the image by projecting an electron beam over a phosphor screen. The phosphor screen is divided into 525 lines in case PAL TV system or 625 lines in the case of NTSC system. The electron beam sequentially scans the 525/625 lines from left to right.

Figure 2.1 Progressive scan

When the electron beams, intensity modulated with the video signal, incidents on the phosphor screen, it starts glowing. The intensity of glow depends on the strength of video signal. The way of scanning in the sequential mode is called Progressive scan (Figure 2.1).

In the case of Progressive scanning, there is a finite delay, 0.04s, between successive frames.The point on the phosphor screen corresponding to an image should glow for that much time. But usually the screen has a poor response and results in flicker: ie the image fades and screen tends to become black. An ideal solution to remove the flickering problem is to increase the frame rate. But it will result in the increased bandwidth usage as the amount of data corresponding to those frames increases. Another solution for this problem is the interlaced scanning. A frame is divided into two fields, odd and even. Odd field is obtained by grouping the odd lines in the frame together and even field are obtained by grouping the even lines in the frame. Scanning the fields at different instant avoids the problem of flickering to an extent and doesn’t introduce burden of bandwidth hike. The screen refresh gets doubled. This type of scanning is called interlaced scanning, illustrated in Figure 2.2.

Figure 2.2 Interlaced scan

1.2 Video Coding Concepts

Data compression is achieved by removing redundancy, i.e. components that are not necessary for faithful reproduction of the data. Many types of data contain statistical redundancy and can be effectively compressed using lossless compression, so that the reconstructed data at the output of the decoder is a perfect copy of the original data. Unfortunately, lossless compression of image and video information gives only a moderate amount of compression. The best that can be achieved with current lossless image compression standards is a compression ratio of around 3–4 times. Lossy compression is necessary to achieve higher compression. In a lossy compression system, the decompressed data is not identical to the source data and much higher compression ratios can be achieved at the expense of a loss of visual quality. Lossy video compression systems are based on the principle of removing subjective redundancy, elements of the image or video sequence that can be removed without significantly affecting the viewer’s perception of visual quality. Compression results in higher computational complexities.

Most video coding methods exploit both temporal and spatial redundancy to achieve compression. In the temporal domain, there is usually a high correlation (similarity) between frames of video that are captured at around the same time, especially if the temporal sampling rate (the frame rate) is high. In the spatial domain,

there is usually a high correlation between pixels (samples) that are close to each other, i.e. the values of neighboring samples are often very similar. Spatial and temporal redundancy are illustrated in the Figure 2.3

Figure 2.3 Spatial & temporal correlation in a video sequence

1.3 Spatial Model

Due to the temporal correlation of frames, successive frames can be predicted from the previous one. For this we need reference frames. Reference frames are encoded in a very similar fashion as explained in section 1.5.

1.4 Temporal Model

The goal of the temporal model is to reduce redundancy between transmitted frames by forming a predicted frame and subtracting this from the current frame. The output of this process is a residual (difference) frame. The more accurate the prediction process, the

Figure 2.4 Difference of frame-1 and frame-2

less energy is contained in the residual frame. The residual frame is encoded

and sent to the decoder which re-creates the predicted frame, adds the decoded residual and reconstructs the current frame. The predicted frame is created from one or more past or future frames (‘reference frames’). The accuracy of the prediction can usually be improved by compensating for motion between the reference frame(s) and the current frame.

The simplest method of temporal prediction is to use the previous frame as the predictor for the current frame. Two successive frames from a video sequence are shown in Frame 1 and Frame 2 in Figure 2.4.. Frame 1 is used as a predictor for frame 2 and the residual formed by subtracting the predictor (frame 1) from the current frame (frame 2) is shown in Figure 2.4. The obvious problem with this simple prediction is that a lot of energy remains in the residual frame (indicated by the light and dark areas) and this means that there is still a significant amount of information to compress after temporal prediction. Much of the residual energy is due to object movements between the two frames and a better prediction may be formed by compensating for motion between the two frames.

1.5 Optic Flow

Changes between video frames may be caused by o

bject motion (rigid object motion, for

Figure 2.5 Optical Flow

example a moving car, and deformable object motion, for example a moving arm), camera motion (panning, tilt, zoom, rotation), uncovered regions (for example, a portion of the scene background uncovered by a moving object) and lighting changes. With the exception of uncovered regions and lighting changes, these differences correspond to pixel movements between frames. It is possible to estimate the trajectory of each pixel between successive video frames, producing a field of pixel trajectories known as the optical flow (optic flow). Figure 2.5 shows the optical flowfield for the frames 1 and 2 of Figure 2.4. The complete field contains a flow vector for every pixel position but for clarity, the field is sub-sampled so that only the vector for every 2nd pixel is shown.

If the optical flow field is accurately known, it should be possible to form an accurate prediction of most of the pixels of the current frame by moving each pixel from the reference frame along its optical flow vector. However, this is not a practical method of motion compensation for several reasons. An accurate calculation of optical flow is very computationally intensive (the more accurate methods use an iterative procedure for every pixel) and it would be necessary to send the optical flow vector for every pixel to the decoder.

1.6 Block-based Motion Estimation and Compensation

A practical and widely-used method of motion compensation is to compensate for movement of rectangular sections or ‘blocks’ of the current frame. The following procedure is carried out for each block of M × N samples in the current frame:

1. Search an area in the reference frame (past or future frame, previously coded and transmitted) to find a ‘matching’ M × N-sample region. This is carried out by comparing the M × N block in the current frame with some or all of the possible M × N regions in the search area (usually a region centered on the current block position) and finding the region that gives the ‘best’ match. A popular matching criterion is the energy in the residual formed by subtracting the candidate region from the current M × N block, so that the candidate region that minimizes the residual energy is chosen as the best match. This process of finding the best match is known as motion estimation.

2. The chosen candidate region becomes the predictor for the current M × N block and is subtracted from the current block to form a residual M × N block (motion compensation, MC).

3. The residual block is encoded and transmitted and the offset between the current block and the position of the candidate region (motion vector, MV) is also transmitted. The decoder uses the received motion vector to re-create the predictor region and decodes the residual block, adds it to the predictor and reconstructs a version of the original block.

Block-based motion compensation is popular for a number of reasons. It is relatively straight forward and computationally tractable, it fits well with rectangular video frames and with block-based image transforms (e.g. the Discrete Cosine Transform, see later) and it provides a reasonably effective temporal model for many video sequences. There are however a number of disadvantages, for example ‘real’ objects rarely have neat edges that match rectangular boundaries, objects often move by a fractional number of pixel positions between frames and many types of object motion are hard to compensate for using block-based methods (e.g. deformable objects, rotation and warping, complex motion such as a cloud of smoke). Despite these disadvantages, block-based motion compensation is the basis of the temporal model used by all current video coding standards. Motion Compensated Prediction of Macroblock

Image is divided into macroblocks (MB) of size N × N. By default, N = 16 for luminance images. For chrominance images N = 8 if 4:2:0 chroma subsampling (explained in section 1.6.2) is adopted. H.264 uses a variant of this. Motion compensation is performed at the macroblock level. Macroblock partitioning is shown in Figure 2.6. The general rule of prediction is as follows:

- The current image frame is referred to as target frame.

- A match (only for Y macroblock) is sought between the macroblock in the Target Frame and the most similar macroblock in previous and/or future frame(s) (referred to as reference frame(s)).

- The displacement of the reference macroblock to the target macroblock is called a motion vector, MV.

- Residual is found in case of Y macroblock as well as Cb & Cr macroblocks.

- Figure 2.7 shows the case of forward prediction in which the reference

frame is taken to be a previous frame.

- MV search is usually limited to a small immediate neighborhood - both

horizontal and vertical displacements in the range [−p, p]. This makes a search window of size (2p + 1) × (2p + 1).

Figure 2.6 Size of macroblock for prediction

Figure 2.7 Motion Compensated prediction of Macroblock

General Rule for searching motion vectors

Parameter MAD (Mean Absolute Differnce) is defined as

MAD (i, j) = | C(x + k, y + l) – R (x + i + k, y + j + l)| eqn 2.1


N - size of the macroblock,

k and l - indices for pixels in the macroblock,

i and j - horizontal and vertical displacements,

C(x + k, y + l) - pixels in macroblock in Target frame,

R(x + i + k, y + j + l) - pixels in macroblock in Reference frame

The search is to find a vector (i, j) as the motion vector MV = (u, v), such that MAD(i, j) is minimum:

(u, v) = { (i, j) | MAD(i, j) is minimum, i ε [−p, p]; j ε [−p; p] }

1.6.2 Frame Coding

RAW video can be viewed as a sequence of frames in the display rate of 25 frames per second or above. Encoder converts these frames as Intra-frames (I-frames), Inter-frames (P-frames) or Bidirectional frames (B-frames). Transform coding method similar to JPEG is applied within each I-frame, hence the name “Intra”. P-frames are not independent, coded by a forward predictive coding method (prediction from a previous P-frame is allowed - not just from a previous I-frame). B frame is predicted from more than one frame usually from a previous and a future frame.

Figure 2.8 I-frame Coding

I-frame coding is depicted in figure 2.8. Here Macroblocks are of size 16 × 16 pixels for the Y frame, and 8 × 8 for Cb and Cr frames, since 4:2:0 chroma subsampling is apllied. A macroblock consists of four Y, one Cb, and one Cr 8 × 8 blocks. For each 8 × 8 block a DCT transform is applied, the DCT coefficients then go through quantization zigzag scan and entropy coding.

P-frame coding is based on Motion Compensation. For each macroblock in the Target frame, a motion vector is allocated. After the prediction, a difference macroblock is derived to measure the prediction error. Each of these 8 × 8 blocks go through DCT, quantization, zigzag scan and entropy coding procedures. The P-frame coding encodes the difference macroblock (not the Target macroblock itself). Sometimes, a good match cannot be found, i.e., the prediction error exceeds a certain acceptable level. The MB itself is then encoded (treated as an Intra MB) and in this case it is termed a non-motion compensated MB. For motion vector, the difference MVD is sent for entropy coding. Figure 2.9 gives an overview on P-frame coding.

Figure 2.9 P-frame coding

Motion Compensation based B-frame coding method is illustrated in Figure. 2.10.

Each Macroblock from a B-frame will have up to two motion vectors (MV’s) (one from the forward and one from the backward prediction). If its matching in both directions, then two MVs will be sent and the two corresponding matching MBs are averaged (indicated by `%' in the figure 2.10) before comparing to the Target MB for generating the prediction error. If an acceptable match can be found in only one of the reference frames, then only one MV and its corresponding MB will be used from either the forward or backward prediction.

Figure 2.10 B-frame Coding

Figure 2.11 4:2:0 Subsampling in the case of interlaced video

1.6.3 Field Prediction

Field prediction comes in the case of codecs having an extended support to interlaced video. Chroma subsampling in the case of interlaced video is illustrated in the figure 2.11.

In interlaced video each frame consists of two fields, referred to as the top-field and the bottom-field. In a frame-picture, all scanlines from both fields are interleaved to form a single frame, then divided into 16 x 16 macroblocks and coded using motion compensation. If each field is treated as a separate picture, then it is called field picture.

Figure 2.12 Frame picture Vs Field picture

The common modes of prediction in case of interlaced video:
1) Frame Prediction for Frame-pictures

Fields are combined to form frames and prediction is performed for macro blocks of size 16 × 16, in a similar fashion explained as in 2.4.3.

2) Field Prediction for Field-pictures

A macroblock size of 16 × 16 from field pictures is used, as in figure 2.13

Figure 2.13 Field Prediction for field pictures

3) Field Prediction for Frame-pictures

The top field and bottom field of a frame picture are treated separately. Each

16 x 16 macroblock (MB) from the target frame picture is split into two 16 x 8 parts, each coming from one field. Field prediction is carried out for these 16 x 8 parts in a manner shown in Figure 2.14

Figure 2.14 Field Prediction for frame-pictures

4) 16 x 8 MC for Field-pictures

Each 16 x16 macroblock (MB) from the target field picture is split into top and

bottom 16 x 8 halves. Field prediction is performed on each half. This generates two motion vectors for each 16 x16 MB in the P-field picture, and up to four motion vectors for each MB in the B-field picture. This mode is good for a finer MC when motion is rapid and irregular. Figure 2.15 gives an illustration.

Figure 2.15 16 x 8 motion compensation for field-pictures.

5) Dual-Prime for P-pictures

Field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector MV is then used to derive a calculated motion vector CV in the field with the opposite parity taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. For each MB the pair MV and CV yields two preliminary predictions. Their prediction errors are averaged and used as the final prediction error. This mode mimics B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay). This is the only mode that can be used for both frame pictures and field-pictures.

1.6.4 Transform Coding

In case of I-frames, the 8 × 8 blocks of Y, Cb and Cr are transform coded. And in case of B-frames and P-frames the residual value obtained from block based motion estimation is transform coded. Figure 2.16 shows the 8 × 8 block division of a residual macroblock for the luma signal in the case of interlaced video. These 8 × 8 blocks are transform coded in a similar manner as in section 1.5.4. Similar is the case for chroma signal.

Figure 2.16 Frame and field DCT for a Y macroblock in interlaced video

1.6.5 Quantization

In the case video coding two quantization tables are used, one for intra-coding and the other for inter-coding in the case of MPEG-1 & MPEG-2. Also another parameter scale called quantization parameter (QP) is defined to use the variants of quantization matrix.

Zij = Round () eqn 2.2

Zij = Round () eqn 2.3

Where Z is the quantized matrix and QP, quantization parameter and Q1, Q2 represents the quantization matix for intra and inter coding respectively.

The value of QP varies from 1 – 31 and in effect we have more no number of quantization table. The wide range of quantizer step sizes makes it possible for an encoder to control the tradeoff accurately and flexibly between bit rate and quality.

Q1 =

Table 2.1 The matrix used for intra coding for eqn 1.1

Q2 =

Table 2.2 The matrix used for inter coding for eqn 1.2

1.6.6 Zig-Zag Scan

Two different types of reordering for the Transform coded values are found, zig-zag Figure 2.17 Reordering of quantized transform coefficients.

found, zig-zag scan for progressive video and another alternate scan for interlaced

video shown in figure 2.17. Refer section 1.5.6.

1.6.7 Run Level Coding

The output of the reordering process is an array that typically contains one or more clusters of nonzero coefficients near the start, followed by strings of zero coefficients. The large numbers of zero values are encoded to represent them more compactly, for example by representing the array as a series of (run, level) pairs where run indicates the number of zeros preceding a nonzero coefficient and level indicates the signed magnitude of the nonzero coefficient. Higher-frequency DCT coefficients are very often quantized to zero and so a reordered block will usually end in a run of zeros. A special case is required to indicate the final nonzero coefficient in a block, EOB (End of Block). This is called ‘Two-dimensional’ run-level encoding. If ‘Three-dimensional’ run-level encoding is used, each symbol encodes three quantities, run, level and last. Last indicates whether the level is the last non zero value in the linear array.


Input array (Zig Zag ordered value) : 16,0,0,−3,5,6,0,0,0,0,−7,0,0 . . .

2D Coding:

Output values : (0, 16), (2, −3), (0, 5), (0, 6), (4, −7), (EOB)

Each of these output values (a run-level pair) is encoded as a separate symbol by the entropy encoder.

3D coded values are:

Output values :(0, 16, 0), (2,−3, 0), (0, 5, 0), (0, 6, 0), (4,−7, 1)

The 1 in the final code indicates that this is the last nonzero coefficient in the block

1.6.8 Entropy Encoding

The entropy encoder converts a series of symbols representing elements of the video sequence into a compressed bitstream suitable for transmission and storage. In this section we discuss the widely-used entropy coding techniques. Variable-length Coding

A variable-length encoder maps input symbols to a series of codewords (variable length codes or VLCs). Each codeword may have varying length but each must contain an integral number of bits. Frequently occurring symbols are represented with short VLCs while less common symbols are represented with long VLCs. Over a sufficiently large number of encoded symbols this leads to compression of the data. Huffman Coding

Huffman coding assigns a VLC to each symbol based on its probability of occurrence. According to the original scheme proposed by Huffman, it is necessary to calculate the probability of occurrence of each symbol and to construct a set of variable length codewords. Based on this key word we can encode and decode symbols. Since the VLC formed by Huffman algorithm is a prefix code we can easily decode the same to corresponding symbol.

Consider the probability distribution of MVD (Motion Vector Differences) given in table 2.3.



















Table 2.3 Code word for vector, based on Huffman Algorithm

If vector sequence is (1, -2, 0, 0, 2), then its binary sequence is 010000110001. In order to decode the data, the decoder alsomust have a copy of the Huffman code tree (or look-up table). This may be achieved by transmitting the look-up table itself or by sending the list of data and probabilities prior to sending the coded data. Each uniquely-decodeable code is converted back to the original data, for example:

01 is decoded as (1)

0000 is decoded as (-2)

1 is decoded as (0).

1 is decoded as (0).

0001 is decoded as (2). Pre-calculated Huffman-based Coding

The Huffman coding process has two disadvantages for a practical video CODEC. First, the decoder must use the same codeword set as the encoder. Transmitting the information

Table 2.4 Pre-calculated Huffman table used for MPEG-4 Coding

contained in the probability table to the decoder would add extra overhead and reduce compression efficiency, particularly for shorter video sequences. Second, the probability table for a large video sequence (required to generate the Huffman tree) cannot be calculated until after the video data is encoded which may introduce an unacceptable delay into the encoding process. For these reasons, recent image and video coding standards define sets of codewords based on the probability distributions of ‘generic’ video material. Such a pre-calculated VLC table used by MPEG-4 Visual (Simple Profile) is shown in table 2.4.

Tuesday, May 8, 2007

Image Coding Fundamentals

1.1 Introduction

Since video can be viewed as a sequence of pictures, video coding can be seen as an extension to image compression. Compression is the process of compacting data, reducing the number of bits. With eye specific features and removal of redundant data we can achieve compression. Compression involves a complementary pair of systems, a compressor (enCOder) and a decompressor (DECoder) and hence the name CODEC, the system performs that performs encoding & decoding.

1.2 Human Visual system

Rods and cones senses the light focused by the lens on retina. Rods provide sensation on the intensity of light and cone the sensation on color. There are three types of cones each corresponds to red, green and blue. Figure 1.1 shows the sensitivity of eye towards various wavelengths.

Figure 1.1 Response of eye towards different wavelength

Curve R, G & B shows the sensitivity of cones towards wave length. The curve V shows the response of eye intensity variation. The value of V can be represented as V (λ) α 2R + G + B/20, where R, G and B is the intensity of Red, Green and Blue wavelengths. From the Figure 1.1 it’s clear that Eye is more sensitive towards light intensity variation than color variation. This is the principle behind color sub-sampling.


1.1 Picture Representation in Digital Domain

This section describes how an image is represented, provides an idea about pixels, resolution and RAW format.

1.1.1 Pixel

As you all know we can derive any color by properly combining the primary colors, Red, Green and Blue. Figure 1.2 shows the formation of secondary colors from the primary colors. In the case of analog picture representation we have the values of Red, Green and Blue intensities or a combination of these intensities ie: the effective color, at each and every point on the space. But in Digital domain we divide the space (the image focused by the lens) into a finite number of square patterns. This finest square area is called pixel, derived from the word picture element. A Pixel gives the value of Red, Green and Blue intensities in that area.

The intensity values of Red, Green and Blue components are separated by some filter arrangement called Baye’s Filter. This finite division of space is performed by image sensors: CCD (Charge coupled devices), CID (Charge injection Devices), Active pixel CMOS.

Figure 1.3 illustrates the Spatial sampling of a grey scale image into digital domain. The value at each position is approximated to any of 256 levels. Then we need 8 bits/pixel for representing the intensity value.

In the case of color image we need 8 bits for representing each of the primary color, which results in 24 bits/pixel. And we have 224 = 16,777,216 color levels

for each pixel. Number of bits/pixel is called color depth.

Figure 1.2 Formation of 20

Colors from the 10 colors

Figure 1.3 Spatial sampling of analog scene

1.1.2 Resolution

The numbers of pixels used to represent an image can be said in terms of resolution. Higher the number of pixels higher will be the clarity. Resolution determines the maximum enlargement possible for an image. For getting good quality pictures there should be a minimum of 300pixels/inch. Figure 1.4 shows an image sampled at different resolution.

Figure 1.4

1.1.3 RAW Formats

RAW Format implies that there is no compression done on the image. The major types of RAW format are RGB, YUV, YIQ. As explained in section 1.2, our eye is more sensitive towards light intensity variation than color variation. So loss on color information will not affect the over all quality of the image. RGB is an end stream format. Information from the image sensor is in RGB format and we need the same format for displaying the image on an end

device. YUV & YIQ formats are developed for Analog TV transmission (NTSC & PAL respectively) and the digital version of YUV, YCbCr is the most common format used for image and video compressions.

Conversion from one format to another is described below:

RGB to YCbCr Conversion

Y = 0.299 R + 0.587 G + 0.11B

Cb = 0.564 (B - Y)

Cr = 0.713 (R - Y)

YCbCr to RGB

R = Y + 1.402 Cr

G = Y – 0.344 Cb – 0.714 C r

B = Y + 1.722 Cb

Y – Luminance Signal

Cb, Cr – Chrominance Signal, Color difference signal

R – Red

G – Green

B – Blue

1.2 Need for Compression

Consider an image of resolution 640 × 480. Let us calculate the size of the picture in RAW format. Each of the 10 Color is represented by 8 bits. Then for each pixel it needs 24 bits. Total no of pixels in the image is 640 × 480 = 307200 pixels. Therefore the size of the image turns to 307200 × 3 bytes = 921600 bytes. But an image in compressed format with the same resolution takes only 100 KB.

In the case of RAW video stream of length 1 sec its needs 640 × 480 × 3 × 25 = 23040000 bytes (23 MB) of storage if the frame rate is 25 frames/sec. But it’s known that the VCD format video having a size 700 MB plays for around 80 minutes. In the former case we need 110400 MBs (23 MB × 60 × 80) as storage space for 80 minutes video. Therefore we can achieve a high compression 150: 1 at the cost of computational complexities.

1.3 Generic Method of Image Compression

The various steps involved in image compression using JPEG encoder is explained using the block diagram in Figure 1.5.

Figure 1.5 Block Diagram of JPEG Encoder

1.3.1 Color Space Conversion

The image obtained as the output of image sensor will be in RAW format, RGB. Suppose RAW image has a resolution 640 × 480. First step is to change it into another color space YCbCr. Y corresponds to the intensity variation and Cb, Cr are called color difference signals.

1.3.2 Chroma Subsampling

Color space conversion utilizes the sensitivity of eye in luminous part of image. Retaining the luma (Y signal) and sub-sampling the Chroma ( Cb & Cr ) we represent the image in a reasonably good manner. The commonly used subsampling formats are 4:4:4, 4:2:2 and 4:2:0. The numbers indicate the relative sampling rate of each component in the horizontal direction. In 4:4:4 (Figure 1.8) sampling for every four luminance samples there are four Cb and four Cr samples. In 4:2:2 (Figure 1.7) sampling, the chrominance components have the same vertical resolution as the luma but half the horizontal resolution (the numbers 4:2:2 mean that for every four luminance samples in the horizontal direction there are two Cb and two Cr samples). 4:2:2 video is used for high-quality color reproduction. In the popular 4:2:0 (Figure 1.6) sampling format, each Cb and Cr have half the horizontal and vertical resolution of Y. JPEG image compression is based on 4:2:0 subsampling.

Figure 1.6

Figure 1.7

Figure 1.8

Chroma Subsampling results in the reduced number of bits to represent the image.


Image resolution: 640 × 480 pixels

Y resolution: 640 × 480 samples, each represented with eight bits

4:4:4 :- Cb, Cr resolution: 640 × 480 samples, each eight bits

Total number of bits: 640 × 480 × 8 × 3 = 7372800 bits

4:2:0 :- Cb, Cr resolution: 320 × 240 samples, each eight bits

Total number of bits: (640 × 480× 8) + (320 × 240 × 8 × 2) = 3686400 bits

The 4:2:0 version requires half as many bits as the 4:4:4 version

4:2:2 :- Cb, Cr resolution: 320 × 480 samples, each eight bits

Total number of bits: (640 × 480× 8) + (320 × 480 × 8 × 2) = 4915200 bits.

The 4:2:2 version requires 2/3 as many bits as the 4:4:4 version.

1.3.3 MacroBlock

Statistical analysis reveals that images are spatially redundant (ie the pixels are found highly correlated to the neighboring ones) and removing the spatial redundancy itself results in high compression. Consider the block specified in Figure 1.9 which represents a homogenous region where the pixel values of block (Table 1.1) are found closely related. Studies revealed that the optimal size of a block is 8 × 8 for utilizing the spatial redundancy of an image and the 8 × 8 block is called a macroblock. Another advantage in using the macroblock size of 8 x 8 is low memory requirement. Macroblock can be viewed as square matrix of dimension 8.

Followed by chroma subsampling we divide the entire image into macro blocks. An image of resolution 640 × 480 can be divided into 4800 Y macroBlocks, 1200 Cb & 1200 Cr macroblocks in case of 4:2:0 subsampling.

Figure 1.9 Table 1.1 Pixel Values corresponding to MacroBlock shown in Figure 1.9

Next step is to take Discrete Cosine Transform for each of these macroblocks.

1.3.4 Transform Coding (Discrete Cosine Transform)

Discrete Cosine transform is a technique to represent the data spread over the macroblock into a lesser number of significant values. DCT (Discrete Cosine Transform)

represents any pattern (image) as the weighted sum of the basis patterns. On taking DCT, we will get a matrix of the same dimension as that of macroblock, which shows how much of each pattern is present in the image. Figure 1.10 shows the basis patterns for 8 x 8 block Each of these pattern is called frequency and the value obtained on DCT is the amplitude of that frequency in the macroblock.

Fig 1.10 8 × 8 DCT basis patterns

Figure 1.11 Illustration of DCT

Figure 1.11 illustrates the concept of DCT. On taking DCT for the original pattern shown in Figure 1.11 we will get weights corresponding to positions (0,0), (0,2), (2,0) in the basis pattern. The patterns present in original macroblock are Average DC value, Vertical frequency & Horizontal frequency respectively. Similar is the case for any image. Mathematical Formulation of DCT

The Discrete Cosine Transform (DCT) operates on X, a block of N × N samples (typically image samples) and creates Y, an N × N block of coefficients. The action of the DCT (and its inverse, the IDCT) can be described in terms of a transform matrix A. The forward DCT (FDCT) of an N × N sample block is given by:

Y = AXAT eqn 1.1

And inverse DCT (IDCT)

X = ATYA eqn 1.2

where X is a matrix of samples(macroblock), Y is a matrix of DCT coefficients and A is an N × N transform matrix. The elements of A are

Ai j = Ci cos Where Ci = (i =0), Ci = (i > 0) eqn 1.3

Yxy = Cx Cy Xij cos eqn 1.4

Xij = Cx Cy Yxy cos eqn 1.5

The transform matrix A for 8 × 8 DCT is

Example :Calculating DCT for the macroblock defined in Figure 1.9

X =

Table 1.2 Pixel values corresponding to macroblock shown in Figure 1.9

Y =

Table 1.3 The transform matrix Y is obtained using the eqn 1.1

Value of Y (Table 1.3), shows that most of the energy gets crowded to low coefficients. Even approximating the amplitude of higher coefficient values to zero will not make much effect on the image. That is what we observed from the experiments. It’s achieved by the process called Quantization.

1.3.5 Quantization

Experiments had already proved that our eye has a very poor response towards high intensity variation. ie our eye will not notify the absence of patterns in the high frequency side (those patterns comes in the lower right side of Figure 1.10). Thus we can achieve better compression with loss of data. So it’s called lossy compression. Quantizer helps to attain this. A quantiser maps a signal with a range of values X to a quantised signal with a reduced range of values Y. It should be possible to represent the quantised signal with fewer bits than the original since the range of possible values is smaller. A scalar quantiser maps one sample of the input signal to one quantised output value and a vector quantiser maps a group of input samples (a ‘vector’) to a group of quantised values. Here we are explaining the scalar quantiser, one commonly used in image and video compression.

Zij = Round ( ) eqn 1.6

Where Z is output of quantiser, Y is the transform matrix, Q is the Quantization matrix. The rescaled value on inverse quantization is

Y’ij = Zij × Qij eqn 1.7

Where Y’ij is the value in the matrix obtained on inverse quantization. i & j are the indices of matrix.

Two different quantization matrices are used in the case of color image, One for the luma and the other for chroma.

Table 1.4 Quantization matrix used for luminance signal

Table 1.5 Quantization matrix used for chrominanace signal


Consider the macroblock defined in Figure 1.9. The output of quantizer for the macro block using eqn 1.6 comes as

Z =

Table 1.6 Quantized value corresponding to transform matrix, Y shown in Table 1.3

The rescaled matrix Y’ as per the eqn 1.7 is shown as

Y’ =

Table 1.7 The rescaled Transform matrix from Z using eqn 1.7

Taking inverse transform, IDCT on Y’ we will get the reconstructed pixel value matrix X’ as

X’ =

Table 1.8 Reconstructed pixel value matrix of the macroblock shown in figure 1.9

The difference between the original and reconstructed macroblock pixel values is shown as Δ matrix. Its very clear there is only a slight difference between the original macroblock pixel values and the reconstructed macroblock.

Δ = X – X’ eqn 1.8

Δ =

Table 1.9 The error value from original pixel value

1.3.6 Zig Zag Scan

The quantizer ignores all the insignificant high frequency values as shown in Table1.6. The 8 x 8 matrix gives 64 values. Most of the high frequency values are approximated to zero. By efficiently representing the zeros we can achieve high compression. For this purpose Zig Zag scan and Run length encoding are introduced.

Figure 1.12 Zig Zag Scanning

Zig-zag scanning technique groups the non-zero and runs of zero and represent the data in a linear array . Run length encoding represents the runs zero in an efficient manner. The quantized values are reordered in a zig-zag manner to produce a linear array by taking values in the order of position (0,0), (0,1), (1,0), (2,0), (1,1), (1,2), (0,3)…. It’s explained in the Figure 1.12

By scanning in the Zig-Zag pattern, the values of Z matrix (Table 1.6) can be represented in a linear array as 32, 6, -1, -1, 0, -1, 0, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0…

1.3.7 Run Level Coding.

It’s a method to represent the runs of zeros efficiently. The linear array obtained by zig-zag scanning is processed to produce a series of (zero-run, level) pairs. Zero-run will give the number of zeros preceding a non zero value. Level represents the signed magnitude of non-zero value.

General rule for run level coding:

First identify the non zero values. Count the number of zeros preceeding the non-zero as zero-runs, and the non-zero value as level. For example the Run level coding of the above linear array results in an array of pairs like (0, 32), (0, 6), (0, -1), (0, -1), (1,-1), (3, -1), (2, 1), (EOB). ‘EOB’ (End of Block) indicates that the remaining coefficients are zero.

1.3.8 Entropy encoding

Our aim is to represent the image with a less number of bits. So far we represent the values, like DCT coefficients or Run Length coded sequences with fixed number of bits. But statistical studies of image compression reveal that some values come frequently. Encoding those values, which contribute a major share of image, with less number of bits can leads to an efficient compression ratio Variable length coding (VLC) is used here. The various algorithms used for VLC are :


Huffman coding

Adaptive Huffman coding

Arithmetic coding

Based on these algorithms some tables are prepared and indications to these tables are signaled in the header of JPEG bit stream format.

Search Engine Optimization and SEO Tools