MULTIMEDIA SIGNAL PROCESSING ALGORITHMS
PART II – MINIMIZATION OF THE AMOUNT OF INFORMATION TO BE PROCESSED AND BASIC ALGORITHMS

THE SECOND PRINCIPLE OF BIOLOGICAL PROCESSING SEEMS TO BE: MINIMIZATION OF THE AMOUNT OF INFORMATION TO BE PROCESSED. THAT IS, THE PROCESSING SYSTEM ELIMINATES AS MUCH INFORMATION AS POSSIBLE AND USES ONLY THE ABSOLUTELY NECESSARY MINIMUM TO ACHIEVE ITS TASKS.

Why is this principle reasonable? Minimizing the information to be processed saves energy, increases speed and reduces effort, so it is overall the logical thing to do. This is not limited to biology; it applies to technical systems as well.

IN PREVIOUS LECTURES THIS PRINCIPLE WAS EVIDENT SEVERAL TIMES: WE ARE ABLE TO RECOGNIZE OBJECTS BASED ON VERY MINIMAL INFORMATION. THIS MEANS THE PROCESSING SYSTEM IS ABLE TO REDUCE INFORMATION TO A MINIMUM, OR IN OTHER WORDS TO EXTRACT THE NECESSARY MINIMUM.

SO WE HAVE THE MAIN PRINCIPLE FOR THIS COURSE: FOR EFFECTIVE MULTIMEDIA SIGNAL PROCESSING ONE HAS TO MINIMIZE THE AMOUNT OF INFORMATION PROCESSED, I.E. EXTRACT THE ABSOLUTELY NECESSARY MINIMUM FOR THE PROCESSING TASK. HOW TO DO THIS IS NOT ALWAYS CLEAR OR EASY; WE NEED TO STUDY IT.

The second principle, as indicated before, is statistical processing: producing results matched to the most likely signals occurring in the real world. This principle, too, has to be applied correctly.

NOW LET US TURN TO TECHNOLOGY. ASSUME WE HAVE A COMPUTER WITH A CAMERA AND A DIGITIZER CARD, AND WE WOULD LIKE TO EXTRACT VISUAL INFORMATION ABOUT THE ENVIRONMENT LIKE OUR EYES DO (OR WE HAVE MICROPHONES AND WOULD LIKE TO EXTRACT ACOUSTICAL INFORMATION LIKE OUR EARS DO). HOW SHOULD WE PROGRAM THE COMPUTER?

Let's think about a typical example which is already becoming popular in cameras: we would like to implement algorithms which mark faces in pictures and recognize familiar faces. This may of course be extended to other objects and complete scenes; for example, the camera could recognize whether the picture was taken of a familiar building or landscape. The problem is not easy, since objects can be seen from different viewpoints, under different lighting, and at different times. But the input to the algorithm is a digitized picture.

• WHAT IS THE PICTURE AFTER DIGITIZATION? IT IS A MATRIX OF NUMBERS. THE MATRIX SIZE CAN BE E.G.
256x256
720x576 – TELEVISION PICTURE
1024x768 – COMPUTER MONITOR
1920x1080 – HIGH DEFINITION TELEVISION PICTURE
MATRIX ELEMENTS ARE USUALLY 8-BIT NUMBERS; THIS CORRESPONDS TO 256 LEVELS OF LIGHT, WHICH IS ENOUGH. COLOR PICTURES ARE DESCRIBED BY THREE SUCH MATRICES, ONE FOR EACH BASIC COLOR.
HERE IS A PICTURE FROM A MARS LANDER AND PART OF THE MATRIX NEAR THE OBJECT.
WHAT HAPPENS WHEN THE PICTURE RESOLUTION IS TOO SMALL? RESOLUTION IS IMPAIRED AND FEWER DETAILS ARE VISIBLE. HERE WE SEE WHAT HAPPENS WHEN THE RESOLUTION IS REDUCED FROM 512x512 TO 32x32.
WHAT IS THE SIZE OF ONE TV PICTURE IN BITS? 720 x 576 x 3 x 8 bit ≈ 10 Mbit (the sketch after the color introduction below makes these numbers concrete).

• TOPIC: COLOR PROCESSING
IMAGES ARE REGISTERED IN THREE BASIC COLOR COMPONENTS: RGB = RED, GREEN, BLUE. A MIXTURE OF THESE COLORS PROVIDES THE OTHER COLORS. WE HAVE TO USE THREE IMAGE MATRICES TO REPRESENT ONE COLOR PICTURE. THE RGB REPRESENTATION IS USED FOR DISPLAY, E.G. COMPUTER MONITORS OR TELEVISION PANELS ARE DRIVEN BY R, G, B SIGNALS.
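To make the numbers above concrete, here is a minimal sketch (assuming NumPy and a random stand-in picture; not from the slides) that computes the size of one TV picture in bits and shows crude subsampling from 512x512 to 32x32:

```python
import numpy as np

# One TV picture: 720x576 pixels, 3 color components, 8 bits each.
bits = 720 * 576 * 3 * 8
print(bits, "bits =", round(bits / 1e6, 1), "Mbit")   # ~10 Mbit, as above

# Resolution reduction: keep every 16th row and column, 512x512 -> 32x32.
img = np.random.randint(0, 256, (512, 512), dtype=np.uint8)  # stand-in picture
small = img[::16, ::16]
print(img.shape, "->", small.shape)   # fine detail is irreversibly lost
```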
• COLOR IMAGE AND RGB COMPONENTS

• WE OFTEN PERFORM A CONVERSION TO A MORE SUITABLE COLOR SPACE. TWO SUCH SPACES ARE VERY USEFUL: THE YUV SPACE AND THE HSI (ALSO CALLED HSV) SPACE.
YUV SPACE:
Y – INTENSITY OF (WHITE) LIGHT
U, V – COLOR CHROMINANCES
TO OBTAIN THE YUV REPRESENTATION WE TAKE THE R, G, B COLOR MATRICES OF A PICTURE AND CONVERT THEM BY:

• RGB -> YUV TRANSFORMATION
Y =  0.299 R + 0.587 G + 0.114 B
U = −0.148 R − 0.289 G + 0.437 B
V =  0.615 R − 0.515 G − 0.100 B

NOTE: Y IS THE BLACK-AND-WHITE COMPONENT, THAT IS, THE MIXTURE OF R, G, B WHICH GIVES GRADATIONS OF WHITE, FROM BLACK THROUGH GREY TO WHITE. U AND V ARE COLOR COMPONENTS – THEY DO NOT HAVE A DIRECT PHYSICAL MEANING. THUS THE INTENSITY OF LIGHT IS SEPARATED FROM THE COLOR INFORMATION.

• AFTER THIS TRANSFORMATION, INSTEAD OF THE THREE R, G, B MATRICES WE GET THREE MATRICES Y, U, V. THE TRANSFORMATION IS INVERTIBLE, SO ALL INFORMATION IS PRESERVED. BUT NOW WE CAN PLAY A TRICK: HUMAN VISUAL PROCESSING IS MUCH LESS SENSITIVE TO COLOR INFORMATION THAN TO BLACK-AND-WHITE LIGHT INTENSITY INFORMATION. THUS THE MATRICES U, V CAN BE REDUCED IN SIZE.

• SUBSAMPLING OF THE MATRICES U AND V: FOR EVERY 4 ELEMENTS OF Y (A 2x2 BLOCK Y1 Y2 / Y3 Y4) ONLY ONE ELEMENT OF U AND ONE ELEMENT OF V IS TAKEN. THE KEPT ELEMENTS CAN BE E.G. THE AVERAGE VALUE OF THE ORIGINAL 4 ELEMENTS OF U AND V. THUS THE MATRICES U, V ARE REDUCED IN SIZE BY A FACTOR OF 4. RETURNING BACK TO THE RGB FORM WILL NOT CHANGE THE PICTURE VISUALLY.

• THE RGB -> YUV TRANSFORMATION DIRECTLY USES A PROPERTY OF HUMAN VISION WHICH ALLOWS:
- REDUCING THE SIZE OF COLOR IMAGES (IMPORTANT FOR COMPRESSION)
- USING ONLY LIGHT INTENSITY WITHOUT COLOR INFORMATION (E.G. FOR RECOGNITION OF OBJECTS)

• ANOTHER TRANSFORMATION IS HSI. HSI IS MORE RELATED TO HUMAN PERCEPTION, WHERE WE CAN SEE THE SATURATION OF COLORS, THAT IS, WE CAN TELL THE "REDNESS", "BLUENESS" OF COLORS AND SO ON. TO GET THE HSI REPRESENTATION WE MAP RGB INTO:
H – HUE (THE COLOR ITSELF)
S – SATURATION (AMOUNT OF WHITE MIXED WITH THE COLOR)
I – INTENSITY (AMOUNT OF GREY LEVEL)
EQUATIONS FOR HSI FROM RGB AND VICE VERSA: in one standard form, I = (R + G + B)/3, S = 1 − min(R, G, B)/I, and H is the angle of the color around the hue hexagon.
BASIC ASPECTS OF THE HSI REPRESENTATION: ON THE RGB CUBE THERE ARE SOME OTHER 'BASIC' COLORS APART FROM R, G, B, AND THE MAIN DIAGONAL IS THE AMOUNT OF WHITE. ON THE DIAMOND (THE CUBE VIEWED ALONG ITS DIAGONAL) WE SEE THE COLORS AROUND A HEXAGON; HEIGHT IS THE AMOUNT OF WHITE, AND SATURATION IS THE HORIZONTAL AXIS. LOOK WHERE THE I (V) AXIS, THE S AXIS AND THE HUE ANGLE ARE.

• THE HSI TRANSFORMATION IS USEFUL SINCE WE GET A REPRESENTATION IN A COLOR SPACE WHICH CORRESPONDS TO THE PROPERTIES OF HUMAN VISION: THE INTENSITY LEVEL, THE COLOR SATURATION AND THE COLOR ITSELF CAN BE ESTIMATED.
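A minimal NumPy sketch of the RGB -> YUV conversion and the 4:1 chrominance subsampling described above (the function name and the use of float arrays are my choices, not from the slides):

```python
import numpy as np

# RGB -> YUV matrix from the slide (signs restored); rows produce Y, U, V.
M = np.array([[ 0.299,  0.587,  0.114],
              [-0.148, -0.289,  0.437],
              [ 0.615, -0.515, -0.100]])

def rgb_to_yuv420(rgb):
    """rgb: (H, W, 3) float array with H, W even. Returns full-size Y plus
    U and V averaged over 2x2 blocks (one chrominance sample per 4 Y samples)."""
    yuv = rgb @ M.T                              # apply the 3x3 matrix per pixel
    y, u, v = yuv[..., 0], yuv[..., 1], yuv[..., 2]
    avg = lambda c: 0.25 * (c[0::2, 0::2] + c[0::2, 1::2] +
                            c[1::2, 0::2] + c[1::2, 1::2])
    return y, avg(u), avg(v)                     # U, V reduced by a factor of 4

y, u, v = rgb_to_yuv420(np.random.rand(4, 4, 3))
print(y.shape, u.shape, v.shape)                 # (4, 4) (2, 2) (2, 2)
```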
DIGRESSION ON COLOR SENSORS
ASSUME YOU BUY A DIGITAL CAMERA WITH E.G. 5 MEGAPIXELS. WHAT DOES THIS MEAN? IT TURNS OUT THAT THE PIXEL DEFINITION IS DIFFERENT FOR DIFFERENT APPLICATIONS.
TRADITIONALLY, 1 PIXEL = AN R, G, B COLOR COMBINATION, SO WE NEED 3 COLOR SENSORS FOR A CAMERA OR 3 COLOR ELEMENTS FOR A DISPLAY. FOR EXAMPLE: AN LCD COMPUTER MONITOR WITH A RESOLUTION OF 1280x1024 PIXELS HAS 1280x1024 ELEMENTS FOR EACH R, G, B COLOR, THAT IS, 1280x1024x3 DISPLAY ELEMENTS. THE DISPLAY ELEMENTS ARE CALLED SUBPIXELS; ONE PIXEL IS COMPOSED OF THREE SUBPIXELS R, G, B.
IN DIGITAL CAMERAS THIS IS DIFFERENT: EVERY COLOR SUBPIXEL COUNTS AS A "PIXEL". THE PIXELS ARE ARRANGED IN A MATRIX CALLED A BAYER SENSOR. EACH "CAMERA" PIXEL IS MADE OF 4 COLOR PIXELS: 1 RED, 2 GREEN, 1 BLUE (REMEMBER THAT MOST OF VISIBLE LIGHT IS GREEN). WE CAN NOTICE THAT "FULL" COLOR PIXELS CAN BE MADE FROM OVERLAPPING SQUARES SHIFTED BY HALF A PIXEL. SO THE E.G. 5 MILLION PIXELS OF A DIGITAL CAMERA ARE NOT EXACTLY 5 MILLION IN THE DISPLAY SENSE: THE NUMBER SHOULD BE DIVIDED BY 4, OR BY 2 IF WE TAKE THE INTERPOLATION INTO ACCOUNT.
BUT THERE ARE TWO EXCEPTIONS. THERE ARE VIDEO CAMERAS WHICH HAVE 3 CCD SENSORS, ONE SEPARATELY FOR EACH OF THE R, G, B COLORS. IN 3-CCD VIDEO CAMERAS THE OPTICAL SYSTEM SPLITS THE LIGHT ONTO 3 SENSORS WHICH PICK UP THE R, G, B COLORS. THE TOTAL NUMBER OF PIXELS CORRESPONDS TO THE NUMBER OF PIXELS IN A DISPLAY.
THE OTHER EXCEPTION IS THE FOVEON SENSOR. IN FOVEON THERE IS ONE SENSOR, BUT IT MEASURES ALL 3 RGB COLORS IN ONE AREA. THIS IS BASED ON THE FACT THAT PHOTONS PENETRATE TO DIFFERENT DEPTHS IN THE SEMICONDUCTOR DEPENDING ON THEIR WAVELENGTHS. www.foveon.com
COMPARISON: WE CAN SEE THAT SINGLE-SENSOR DEVICES HAVE LOWER RESOLUTION THAN 3-SENSOR DEVICES OR FOVEON. BUT THEY ARE THE EASIEST TO PRODUCE, SO THE NUMBER OF THEIR COLOR PIXELS KEEPS INCREASING AND THE RESOLUTION PROBLEM IS BEING SOLVED...

• The elimination of information based on color is an example of a much more general principle:
ELIMINATION OF INFORMATION: input signal -> output signal, a representation of the input signal which is "just good enough" for the specific task.
How to produce the "good enough" representation is the essential problem to solve. Next we show an example of representation by edges.

• EDGE DETECTION
LINEAR FILTERING: THE AREA AROUND EVERY POINT x OF THE IMAGE MATRIX, E.G. THE 3x3 NEIGHBORHOOD
z l m
u x v
n p q
IS MULTIPLIED ELEMENT BY ELEMENT BY THE VALUES OF ANOTHER MATRIX (THE FILTER MASK), AND THE RESULT IS SUMMED UP.

• DEPENDING ON THE MATRIX BY WHICH WE MULTIPLY, WE HAVE SEVERAL TYPES OF FILTERS:
LOW PASS – THE SUM OF THE FILTER COEFFICIENTS IS ONE
BANDPASS – THE SUM OF THE FILTER COEFFICIENTS IS ZERO
HIGHPASS – THE SUM IS BETWEEN ZERO AND ONE

• WE SAID THAT IN THE HUMAN VISUAL SYSTEM THE PROCESSING ELEMENTS IN THE RETINA ARE SENSITIVE TO CHANGES IN LIGHT LEVEL. THIS IS EQUIVALENT TO BANDPASS FILTERING. A SPECIAL CLASS OF BANDPASS FILTERS IS CALLED EDGE DETECTORS, SINCE THEY ARE DESIGNED TO DETECT SHARP CHANGES IN IMAGE LIGHT INTENSITY.

• LET US CONSIDER THE FOLLOWING SITUATION – A WHITE BAR ON A BLACK BACKGROUND, OR THE OPPOSITE. OUR VISUAL SYSTEM, AND WE HERE, ARE INTERESTED MOSTLY IN THE AREAS WHERE THE LIGHT CHANGES ITS VALUE; SHARP CHANGES IN LIGHT VALUE ARE CALLED EDGES. HOWEVER, THERE IS A PROBLEM: WHAT EXACTLY IS A SHARP CHANGE IN INTENSITY? THIS IS NOT WELL DEFINED. ON THE RIGHT WE SEE SOME EXAMPLES OF LIGHT CHANGES:
RAMP EDGE – LIGHT INCREASING GRADUALLY
STEP EDGE – SHARP TRANSITION
NARROW LINE
ROOF EDGE
THERE COULD BE MANY MORE SUCH EXAMPLES!

• EDGE DETECTION IS EQUIVALENT TO DIFFERENTIATION. IN THE CONTINUOUS DOMAIN,
\partial F(x,y) / \partial x = 0 \quad \text{if } F(x,y) = \text{const}.
BUT IN IMAGES WE HAVE A LIMITED NUMBER OF PIXELS, SO WE CAN PERFORM ONLY APPROXIMATE DIFFERENCING.

• EDGE DETECTORS
HERE WE HAVE TWO FILTER MATRICES FOR DIFFERENCING. NOTE THAT THE FIRST ONE GIVES ZERO OUTPUT WHERE THE VALUES ARE CONSTANT IN THE VERTICAL DIRECTION, AND THE SECOND WHERE THEY ARE CONSTANT IN THE HORIZONTAL DIRECTION.

• NOW LET'S TAKE THE OUTPUTS OF BOTH FILTERS AND COMBINE THEM, FOR EXAMPLE BY Z = |H| + |V|. THE OUTPUT WILL NOW BE QUITE INDEPENDENT OF THE DIRECTION OF THE EDGES. NOTE THAT THE RATIO GC/GR (VIA ITS ARCTANGENT) GIVES THE DIRECTION OF AN EDGE.

• HERE WE HAVE AN EXAMPLE OF THE RESULTS:
- ORIGINAL PICTURE
- HORIZONTAL DETECTOR
- VERTICAL DETECTOR
- BOTH COMBINED
AS WE CAN SEE, THE COMBINED OUTPUT GIVES THE BORDERS OF OBJECTS, SO WE CAN RECOGNIZE AN OBJECT EVEN THOUGH THERE IS LITTLE INFORMATION. THIS MAY CORRESPOND IN SOME WAY TO HOW THE HUMAN SYSTEM WORKS.

• WHY DID WE USE JUST THIS MATRIX FOR EDGE DETECTION? MANY SUCH MATRICES CAN BE USED; SOME ARE SHOWN HERE AND MANY OTHERS ARE KNOWN. THEY DIFFER IN THEIR PROPERTIES AND IN THEIR OPERATION IN NOISE; E.G. THE PREWITT AND SOBEL DETECTORS ARE GOOD.
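A small sketch of directional differencing with the Sobel masks just mentioned, combining the two outputs as Z = |H| + |V| (it assumes NumPy and SciPy; the mask values are the standard Sobel coefficients, since the slide's own matrices are not reproduced in these notes):

```python
import numpy as np
from scipy.ndimage import convolve

# Standard Sobel differencing masks: H responds to changes in the vertical
# direction (horizontal edges), V to changes in the horizontal direction.
SOBEL_H = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)
SOBEL_V = SOBEL_H.T

def edge_map(img):
    h = convolve(img, SOBEL_H)           # zero output where values are constant
    v = convolve(img, SOBEL_V)
    magnitude = np.abs(h) + np.abs(v)    # Z = |H| + |V|, roughly direction-independent
    direction = np.arctan2(h, v)         # edge direction from the two outputs
    return magnitude, direction

img = np.zeros((16, 16)); img[:, 8:] = 255.0     # vertical step edge
mag, ang = edge_map(img)
print(np.round(mag[8, 5:11]).astype(int))        # large response at the transition
```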
• IF WE TALK ABOUT OPERATION IN NOISY IMAGES, THRESHOLDING IS IMPORTANT. AFTER RUNNING A DETECTOR WE GET AN OUTPUT SIGNAL. UNFORTUNATELY THIS SIGNAL CAN BE CAUSED BY NOISE, NOT BY AN EDGE – EDGE DETECTORS CAN BE SENSITIVE TO NOISE. SO WE THRESHOLD THE OUTPUT SIGNAL: IF IT IS GREATER THAN SOME VALUE T, IT IS CLASSIFIED AS AN EDGE.
HERE THE OPERATION OF AN EDGE DETECTOR WITH THRESHOLDING IN NOISY CONDITIONS IS SHOWN:
- AT A LOW NOISE LEVEL IT WORKS WELL
- AT A HIGHER NOISE LEVEL SOME NOISE POINTS ARE CLASSIFIED AS EDGES AND SOME EDGE POINTS ARE MISSING (WE STILL SEE A GOOD EDGE)
- AT A VERY HIGH NOISE LEVEL THE DETECTOR BREAKS DOWN COMPLETELY AND NO EDGE IS DETECTED, EVEN THOUGH WE CAN STILL SEE SOME EDGE IN THE PICTURE
SO IN NOISY CONDITIONS THERE ARE PROBLEMS WITH EDGE DETECTORS. BUT SOMEHOW IN HUMAN VISION THEY WORK VERY WELL – HOW??? RESEARCHERS MOTIVATED BY HUMAN VISION NOTICED THAT THE FILTERING ELEMENTS IN THE HUMAN RETINA AT THE BACK OF THE EYE ARE MORE COMPLICATED THAN THE SIMPLE DETECTORS SHOWN HERE.

• MOTIVATED BY OBSERVATION OF THE HUMAN SYSTEM AND BY SOME CONSIDERATIONS OF OPTIMAL NOISE ATTENUATION, A ZERO-CROSSING, OR LAPLACIAN-OF-GAUSSIAN, DETECTOR WAS DESIGNED. THIS DETECTOR IS OBTAINED BY TAKING THE SECOND DERIVATIVE OF THE GAUSSIAN CURVE:

\nabla^2 G(x,y) = -\frac{1}{\pi s^4}\left[1 - \frac{x^2 + y^2}{2 s^2}\right] e^{-(x^2 + y^2)/(2 s^2)}

The resulting curve has the characteristic 'Mexican hat' shape. NOW IF WE TAKE THE SECOND DERIVATIVE OF THE OUTPUT, WE NOTICE THAT AN EDGE IS WHERE THE SIGNAL CROSSES ZERO!

• A ZERO-CROSSING EDGE DETECTOR IS BETTER IN NOISY CONDITIONS, BUT IT IS MORE COMPLICATED SINCE IT REQUIRES MANY MORE OPERATIONS TO CALCULATE.
Assuming that we have such a detector, the next problem is how to build a representation based on edges; this is shown next.

• LINKING EDGE POINTS TO FORM CONTOURS OF OBJECTS: WE LINK OUTPUT POINTS FROM THE EDGE DETECTOR WHEN THEIR VALUES ARE SIMILAR. SIMILARITY MEANS:
- THE AMPLITUDE DIFFERENCE IS SMALLER THAN SOME THRESHOLD
- THE ANGULAR DIRECTION IS SIMILAR
LINKED EDGES ARE TAKEN TO BELONG TO THE SAME OBJECT.

• EXAMPLE: ORIGINAL PICTURE, VERTICAL DETECTOR, HORIZONTAL DETECTOR, RESULT OF EDGE LINKING.

• SEGMENTATION
HOW TO EXTRACT OBJECTS FROM PICTURES? THIS CAN BE DONE BASED ON FEATURES SUCH AS INTENSITY OR COLOR.

• WE CAN GROUP AREAS WITH SPECIFIC FEATURES BY LINKING THEM TOGETHER: IF TWO AREAS HAVE THE SAME FEATURE, WE LINK THEM.
SEGMENTATION ALGORITHM: START WITH SOME AREA AND DIVIDE IT INTO FOUR PARTS; CONTINUE THE DIVISION UNTIL ONLY PARTS WITH THE SPECIFIC FEATURE ARE KEPT.

• THRESHOLDING
WE NEED TO DIFFERENTIATE BETWEEN THE 'USEFUL' AND 'NON-USEFUL' DATA. THRESHOLDING WORKS ON THE PRINCIPLE THAT THE USEFUL SIGNAL IS STRONGER: IF SIGNAL < T, WE SET IT TO ZERO. HOW TO SELECT T? THE HISTOGRAM CAN BE USED, SINCE IT OFTEN SHOWS HOW OBJECT AND BACKGROUND CAN BE SEPARATED: IF THE THRESHOLD IS SELECTED BETWEEN THE TWO HISTOGRAM MODES, WE CAN SEPARATE BACKGROUND AND OBJECT (ONE CLASSIC WAY OF CHOOSING T IS SKETCHED AFTER THIS SECTION). HOWEVER, FULLY AUTOMATIC THRESHOLDING IS DIFFICULT, SINCE THE NOISE AND OBJECT LIGHT INTENSITIES MAY NOT BE COMPLETELY SEPARATED.

• FEATURE DETECTION
FEATURES ARE SMALL PARTS OF OBJECTS WHICH ARE CRITICAL FOR RECOGNITION AND REPRESENTATION.

• HOW TO DETECT FEATURES? THIS IS QUITE A DIFFICULT PROBLEM. FEATURES ARE OFTEN COMPOSED OF SHORT LINE SEGMENTS, E.G. CORNERS: A CORNER EDGE IS COMPOSED OF TWO LINES. WE CAN THINK OF APPLYING AN EDGE DETECTOR AND THRESHOLDING FOR FINDING FEATURES.
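The slides only say that the histogram can guide the choice of T. One classic concrete method (not named in the slides) is Otsu's algorithm, which picks the threshold that best separates the two histogram modes. A sketch, assuming NumPy and 8-bit images:

```python
import numpy as np

def otsu_threshold(img):
    """Pick T from the histogram by maximizing the between-class variance
    (Otsu's classic method); pixels below T are treated as background."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()        # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        m0 = (np.arange(t) * p[:t]).sum() / w0   # class means
        m1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Bright object on a dark noisy background: the histogram has two modes.
img = np.full((64, 64), 40.0); img[20:44, 20:44] = 200.0
img = np.clip(img + np.random.normal(0, 10, img.shape), 0, 255).astype(np.uint8)
T = otsu_threshold(img)
mask = img >= T                                  # object pixels
print(T, int(mask.sum()))
```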
• FOR A COMPACT REPRESENTATION WE HAVE TO ELIMINATE ALL NON-RELEVANT SIGNAL ELEMENTS. THIS TASK IS SIMILAR TO MEDIA COMPRESSION, WHOSE GOAL IS TO MINIMIZE THE DESCRIPTION OF MEDIA WHILE PRESERVING PERCEPTUAL QUALITY. THIS IS ALSO IMPORTANT FOR GENERAL MULTIMEDIA SIGNAL PROCESSING, SINCE IT MINIMIZES THE AMOUNT OF INFORMATION TO BE PROCESSED.

A MEDIA SIGNAL IS A STREAM OF BITS. HOW TO REDUCE THE NUMBER OF BITS NEEDED FOR THE DESCRIPTION? THIS CAN BE DONE IN 2 WAYS:
- A MORE EFFICIENT DESCRIPTION OF THE BITSTREAM
- ELIMINATING PERCEPTUALLY INSIGNIFICANT INFORMATION
Technically this is called compression of information.
COMPRESSION CAN BE DONE ON THE
BIT LEVEL -> BIT STREAM
BLOCK LEVEL -> SMALL BLOCKS
OBJECT LEVEL -> OBJECTS IN PICTURES
PICTURE LEVEL -> THE SAME PICTURE IN DIFFERENT SIZES IS VERY SIMILAR
COMPRESSION IS ALSO RELATED TO THE REPRESENTATION OF VISUAL INFORMATION.

LET'S TAKE THE FOLLOWING EXAMPLE:
a b c
d e f
g h i
This is a matrix of 3x3 points taken from a picture. Each point represents a number from 0 to 255, that is, an 8-bit number. How many different signal matrices can be constructed out of these numbers? (2^8)^9 = 2^72 – a huge number. ONLY THE MEANINGFUL INFORMATION MUST BE EXTRACTED FROM THESE MATRICES. BUT WHAT IS THIS INFORMATION? IT IS ABOUT SPECIFIC SIGNAL CHANGES...

What, then, are those changes in small areas of pictures which might be of interest?
1. We were talking until now about edges. We also mentioned that there can be different types of edges in pictures.
2. There can also be other types of information in these small areas (e.g. lines).
3. The question is how to account for this information.
Let's see some examples. What is there? A dark line, plus grey dots, plus black dots? A dark line, or a roof edge? An edge, or an edge with a white dot? We see that the interpretation of small areas of pictures is ambiguous; several interpretations are possible. Sometimes a feature looks non-ideal or contaminated by another feature: dots? a line? a diagonal edge? So how to interpret such real signals? There has to be a very efficient extraction mechanism allowing for
- extraction of multiple features
- dealing with imperfect features

What seems very important is that features are made by grouping pixels which are touching and have similar values. Second, features may sometimes be imperfect. Thus we have to try to assign each pixel to where it might belong – to some feature(s) or not. We take the center pixel and try to find a group of pixels to which it belongs. A pixel belongs to a group if it has the same value, a similar value, or if its value can be INTERPOLATED from the neighbouring pixels. Where does the center pixel belong? It belongs to a vertical grey line when the pixel values are the same; it belongs to a diagonal edge if its value can be interpolated from the neighbouring pixels, that is, if the pixel values change in a linear way (the center pixel value is the average of the other two).

So we can try to assign a pixel to its neighbouring pixels. But there will be a problem if we look into a larger area: pixels may belong to many different areas. It would be good to detect regularity in the areas; when areas are irregular they may be random and thus not interesting. How to find regularity? By transforming an area of the picture using a periodic (orthogonal) basis, e.g. the Fourier Transform.
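As a rough illustration of detecting regularity with a periodic basis, the sketch below compares a 'peakiness' measure of the Fourier spectrum for a periodic block and a random one (the measure itself is my ad-hoc illustration, not from the slides; NumPy assumed):

```python
import numpy as np

def spectral_peakiness(block):
    # Fraction of spectral energy in the single strongest frequency bin;
    # high for a strongly periodic (regular) area, low for a random one.
    mag = np.abs(np.fft.fft2(block - block.mean()))
    return mag.max() / (mag.sum() + 1e-12)

x = np.arange(8)
stripes = np.tile(np.cos(2 * np.pi * 2 * x / 8), (8, 1))  # regular vertical stripes
noise = np.random.rand(8, 8)                              # irregular area
print(round(spectral_peakiness(stripes), 2))  # high (0.5: two symmetric peaks)
print(round(spectral_peakiness(noise), 2))    # low: energy spread over many bins
```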
But the Fourier transform has complex values, which is not the most efficient representation (2 real numbers per coefficient). In practice two other transforms are used: the Discrete Cosine Transform (DCT) and a hierarchical 4x4 transform related to it.

DCT TRANSFORMATION
• DCT: Discrete Cosine Transform
• Reduction of spatial redundancy
• Transform block size: 8 x 8 in our case
The block is reconstructed from the DCT coefficients F(u,v) by

f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} c_u c_v F(u,v) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right],

where u, v, x, y = 0, 1, ..., 7 and c_k = 1/\sqrt{2} for k = 0, c_k = 1 for k \neq 0.

For color pictures we take blocks of 16 pixels by 16 lines: four 8x8 Y (black-and-white) blocks, numbered 1-4, plus the subsampled color blocks Cb and Cr (blocks 5 and 6).

DCT in matrix form.
One dimension:
H(k,n) = c_k \sqrt{\frac{2}{N}} \cos\left[\frac{\pi k}{N}\left(n + \frac{1}{2}\right)\right]
Two dimensions:
H_{kl}(n,m) = c_k c_l \frac{2}{N} \cos\left[\frac{\pi k}{N}\left(n + \frac{1}{2}\right)\right] \cos\left[\frac{\pi l}{N}\left(m + \frac{1}{2}\right)\right]

• DCT BASIS VECTORS FOR N=4 AND N=8: the basis pictures are obtained by multiplying vertical and horizontal cosine functions.

• Example of the DCT calculation: take the input matrix (an enlarged picture with a selected block), calculate the 1-D DCT for the columns of the input matrix, then calculate the 1-D DCT for the rows of the previous DCT values.
THE DCT TRANSFORM IS A MAPPING FROM A PICTURE BLOCK INTO THE FREQUENCY DOMAIN. SINCE THERE WILL NORMALLY BE FEW HIGH FREQUENCIES, THERE WILL BE MANY ZEROS OR SMALL NUMBERS IN THE DCT MATRIX.

• EXAMPLE OF THE DCT CALCULATION

ORIGINAL PICTURE BLOCK:
140 144 147 140 139 155 179 175
144 152 140 147 140 148 167 179
152 155 136 167 163 162 152 172
168 145 156 160 152 155 136 160
162 148 156 148 140 136 147 162
147 167 140 155 155 140 136 162
136 156 123 167 162 144 140 147
148 155 136 155 152 147 147 136

IN PRACTICE, SINCE PICTURE VALUES ARE IN (0, 255), WE SHIFT THEM BY −128 INTO (−128, 127):

SHIFTED BLOCK:
12 16 19 12 11 27 51 47
16 24 12 19 12 20 39 51
24 27  8 39 35 34 24 44
40 17 28 32 24 27  8 32
34 20 28 20 12  8 19 34
19 39 12 27 27 12  8 34
 8 28 −5 39 34 16 12 19
20 27  8 27 24 19 19  8

• BLOCK AFTER DCT:
185 −17  14  −8  23  −9 −13  −8
 20 −34  26  −9 −10  10  13 −10
−23  −1   6 −18   3 −20  −8  −5
 14 −14  −8  −2  −3  −3   9   7
  1 −11  17  18   8   0  −2   3
 −1  −7  −1   0  −7  −2   1   1
  4  −6   6   0   8  15  −1   0
…
MANY SMALL NUMBERS.

The DCT values allow detecting and evaluating periodic structures in small areas, which is sometimes very useful. The DCT has some drawbacks: it requires real numbers (cosine functions) and high precision of calculation. Another transform was introduced recently to improve on the DCT. It is obtained by rounding the scaled coefficients of the DCT matrix:

H = round(\alpha \cdot H_{DCT})

When \alpha = 2.5 the following transform is obtained:

H =
 1  1  1  1
 2  1 −1 −2
 1 −1 −1  1
 1 −2  2 −1

This transform has extremely simple coefficients; no multiplications are involved. We can see that the rows of the matrix correspond to calculations detecting:
- the average value of four signal samples
- a periodic function with period 1
- a periodic function with period 2 (row 4)
Thus we get a signal decomposition into periodic functions.

ENERGY IN THE DCT DOMAIN: (figure) the picture has large entropy, while the DCT coefficients have small entropy, with the energy concentrated from the lowest frequency (DC) toward the highest frequency. With equal bit allocation the block costs an average of 8 bit/pel; allocating bits unequally over the DCT coefficients (e.g. 10, 8, 4, 2 bits) and taking the inverse DCT gives an average of 3.2 bit/pel – this is compression.
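A minimal sketch of the matrix form given above: build the N=8 DCT matrix H(k,n) and transform a block as H f Hᵀ (the 1-D DCT on the columns, then on the rows). A smooth stand-in block yields many near-zero coefficients, illustrating the energy compaction just described (NumPy assumed):

```python
import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)           # frequency index of H
n = np.arange(N).reshape(1, -1)           # sample index
c = np.where(k == 0, 1 / np.sqrt(2), 1.0)
H = c * np.sqrt(2 / N) * np.cos(np.pi * k * (n + 0.5) / N)   # H(k,n) as above

def dct2(block):
    # 1-D DCT on the columns, then on the rows (the two-step calculation above).
    return H @ block @ H.T

block = 100.0 + 10.0 * np.add.outer(np.arange(8.0), np.arange(8.0))  # smooth ramp
F = dct2(block - 128)                     # shift into (-128, 127) first
print(np.round(F).astype(int))
print(int((np.abs(F) < 1).sum()), "of 64 coefficients are almost zero")
```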
QUANTIZATION
Quantization means removing information which is not relevant. Example: rounding of numbers, round(4.076756) = 4. It turns out that high-frequency information is not very relevant for human vision; it can thus be removed. High frequencies in the DCT are removed by quantizing.

Let K be a value; we perform the operation K_q = n \cdot round(K / n). This rounds K to the nearest multiple of n: all values in the interval (K_q − n/2, K_q + n/2] are mapped to the same K_q.

QUANTIZATION INTERVALS: (figure) the characteristics f̂(f) of a uniform symmetric midtread quantizer and of a uniform symmetric midrise quantizer.

QUANTIZATION MATRICES FOR DCT

For luminance Y:
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

For chrominance U, V:
17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99

Each number in the DCT matrix is quantized (divided and rounded) by the corresponding number in the quantization matrix above. Notice that the high frequencies have much higher quantization values.

• EXAMPLE OF QUANTIZATION
TAKE THE BLOCK AFTER DCT FROM THE EXAMPLE ABOVE (DC VALUE 185). THE DCT VALUES ARE DIVIDED BY SPECIAL CONSTANTS AND ROUNDED, WITH THE QUANTIZATION TABLE

 3  5  7  9 11 13 15 17
 5  7  9 11 13 15 17 19
 7  9 11 13 15 17 19 21
 9 11 13 15 17 19 21 23
11 13 15 17 19 21 23 25
13 15 17 19 21 23 25 27
15 17 19 21 23 25 27 29
17 19 21 23 25 27 29 31

AFTER QUANTIZATION OF THE DCT MATRIX:

61 −3  2  0  2  0  0 −1
 4 −4  2  0  0  0  0  0
−1 −2  0  0 −1  0 −1  0
 0  0  1  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0 −1  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0

Another example – reconstruction of a block from quantized DCT coefficients: we see that the approximation gets better when more coefficients are taken.

THE ROLE OF DCT AND QUANTIZATION
Quantized DCT coefficients preserve the content of small picture blocks very effectively: the relevant perceptual information is well preserved and the non-relevant information is eliminated. The DCT is thus very good for representing image features with minimized information. This is confirmed in practice, since the DCT is used in the image and video compression standards JPEG and MPEG. These standards are used in digital cameras, digital television, DVD discs and internet media players.
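A sketch of the divide-and-round quantization with the luminance table above (NumPy assumed; the stand-in coefficient block is random rather than the example block):

```python
import numpy as np

# JPEG luminance quantization table, as listed above.
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]], dtype=float)

def quantize(F, Q):
    return np.round(F / Q)               # divide by the table entry and round

def dequantize(Fq, Q):
    return Fq * Q                        # approximate reconstruction

rng = np.random.default_rng(0)
F = rng.normal(0, 20, (8, 8)); F[0, 0] = 400.0   # stand-in DCT block, large DC
Fq = quantize(F, Q)
print(int((Fq == 0).sum()), "of 64 coefficients became zero")   # high freqs die first
print(float(np.abs(dequantize(Fq, Q) - F).max()))  # error <= Q/2 per coefficient
```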
• Minimization of information in video
Video is composed of picture sequences, 25-30 pictures per second. One can observe that video is composed of 'shots' or 'scenes': short segments which have the same content. Within a single shot, the difference between two subsequent pictures (taken 40 ms apart) is very small.
The information representing a video scene can be minimized as follows:
- Pick and compress the first picture
- Calculate the motion-compensated difference between the second picture and the first one
- Calculate the motion-compensated difference between the restored second picture and the third one
- Continue for all pictures in the scene
So we only need the first (compressed) picture and the differences between the other pictures to preserve the information of all the pictures. This results in a huge saving of information.

• Example: the difference between two pictures is mostly caused by the motion of objects.

• Movement of objects: there is a problem with object borders. To avoid it, we consider the movements of small picture blocks and try to detect whether they have moved.

• The difference between two pictures can be reduced if the motion vectors of objects are found and the motion is compensated, that is, an object which moved in the second picture is moved back by its motion vector. Comparing 16x16, 8x8 and 4x4 blocks, the error is lower when the blocks are smaller.

• It is also possible to detect the movements of blocks with an accuracy greater than 1 pixel by interpolating between pixels (half-pixel or quarter-pixel interpolation). The difference images will then be smaller.

Video information reduction
• Instead of keeping information about all the pictures, it is enough to have:
1. The first picture
2. The motion-compensated differences between subsequent pictures
3. The motion vectors representing the movements of the picture blocks
This is a very significant reduction of information, and it also provides information about the movement of objects, which is very important (a block-matching sketch follows below).
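A minimal full-search block-matching sketch of the motion estimation described above (NumPy assumed; the block size, search range and SAD criterion are common choices, not specified in the slides):

```python
import numpy as np

def best_motion_vector(prev, cur, top, left, bs=8, search=4):
    """Full search: find the displacement (dy, dx) of the block of `cur` at
    (top, left) that minimizes the sum of absolute differences (SAD) against
    the previous picture. The vector points back into the previous picture."""
    block = cur[top:top + bs, left:left + bs]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bs > prev.shape[0] or x + bs > prev.shape[1]:
                continue                     # candidate falls outside the picture
            sad = np.abs(prev[y:y + bs, x:x + bs] - block).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

prev = np.zeros((32, 32)); prev[8:16, 8:16] = 255.0    # bright object
cur = np.roll(prev, (2, 3), axis=(0, 1))               # object moves down 2, right 3
mv, sad = best_motion_vector(prev, cur, 10, 11)
print(mv, sad)   # (-2, -3): the block is found shifted back; residual SAD is 0
```

With the motion compensated, only the (mostly zero) residual difference and the motion vectors need to be stored, exactly as in the reduction scheme above.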