JHU, June 25, 2012
Multimedia Content Analysis via Computational Human Visual Model
Shenghua Zhong
Department of Computing, The Hong Kong Polytechnic University
www.comp.polyu.edu.hk/~csshzhong

Outline
- Introduction to multimedia content analysis
- Introduction to human visual system
- Moment invariants to motion blur for water reflection detection and recognition
- Top-down and bottom-up saliency map for no-reference image quality assessment
- Fuzzy based contextual cueing for region level annotation
- Proposed deep learning for multimedia content analysis

Multimedia Content Analysis
Definition of multimedia content analysis
- Computerized understanding of the semantic meanings of a multimedia document [Wang et al., SPM, 2000]
Difficulty in multimedia content analysis
- The semantic gap is the well-known challenge [Jiang et al., ACM MM, 2009]: low-level features computable by machines versus high-level concepts understandable by humans
Typical multimedia content analysis tasks [Amit, MIT, 2002] [Liu et al., HCI, 2001]
- Quality assessment
- Object detection and recognition
- Indexing and annotation
- Classification and retrieval

Introduction to Human Visual System
Definition of cognitive science
- The science of mind, concisely defined as the study of the nature of intelligence, chiefly the nature of the human mind [Eckardt, MIT, 1995]
Definition of human visual system
- A research focus of cognitive science
- The part of the central nervous system that enables organisms to process visual information
Four processes of the human visual system
- Formation of the image on the retina
- Visual processing in the visual cortex
- Attentional allocation
- Perceptual processing

Moment Invariants to Motion Blur for Water Reflection Detection and Recognition
Fig. Position of the water reflection detection and recognition work in the overall framework (highlighted in purple).

Introduction to Water Reflection
Definition
- The change in direction of a wavefront at an interface between two different media such that the wavefront returns into the original medium
- A special case of imperfect symmetry
Importance
Fig. Example of the influence of the water reflection part. (a) An image with water reflection. (b) The correct segmentation result of (a). (c) The actual segmentation result of (a). (d) The color histogram of the whole image (a). (e) The color histogram of the object part.
Ineffectiveness of Existing Symmetry Techniques
Failure of the scale-invariant feature transform (SIFT) descriptor [Loy et al., ECCV, 2006]
Fig. Examples of the ineffectiveness of local features on images with water reflection. (a) The correct matching; (b) the SIFT descriptor matching result. Red circles denote SIFT detector results; green lines denote matched SIFT descriptor pairs. It is easy to see that the SIFT method is ineffective for water reflection detection and recognition.

Ineffectiveness of Existing Water Reflection Detection Techniques
Failure of the flip invariant shape detector [Zhang et al., ICPR, 2010]
Fig. Examples of shape detection results of the flip invariant shape technique.

Basic Idea
Definition and influence of motion blur
- The relative motion of the sensor and the scene during the exposure time [Flusser et al., CS, 1996]
- A well-known degradation factor: motion changes the image features needed by feature-based recognition techniques

Moment Invariants to Motion Blur
Algorithm of moment invariants (see the NumPy sketch at the end of this section)
Input: image $I(x,y)$. Output: moment invariants to motion blur $IR_{m1}, IR_{m2}, IR_{m3}, IR_{m4}$.
1. Geometric moment calculation: $m_{pq} = \sum_x \sum_y x^p y^q I(x,y)$, giving $m_{00}, m_{01}, m_{10}$.
2. Centroid calculation: $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$.
3. Central moment calculation: $\mu_{pq} = \sum_x \sum_y (x-\bar{x})^p (y-\bar{y})^q I(x,y)$, giving $\mu_{00}, \mu_{12}, \mu_{21}, \mu_{30}, \mu_{03}$.
4. Normalized complex moment calculation: $\eta_{ij} = \mu_{ij} / \mu_{00}^{\,1+(i+j)/2}$, giving $\eta_{12}, \eta_{21}, \eta_{30}, \eta_{03}$.
5. Moment invariants to motion blur:
$IR_{m1} = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$
$IR_{m2} = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$
$IR_{m3} = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]$
$IR_{m4} = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]$

High Frequency Energy Decay in Water Reflection
Fig. Decay of information and energy in the high-frequency band due to motion blur. (a) Original water reflection image; (b) high-frequency information part.

Flowchart
Fig. Flowchart of the proposed method. The original image is decomposed by the Curvelet transform into low- and high-frequency coefficients; after the inverse Curvelet transform, coefficient differences and blur-invariant moment features are computed on image subblocks, and reflection cost minimization locates the candidate reflection axis. If the minimum reflection cost exceeds its threshold, the image is classified as non-symmetric; otherwise it is an imperfect symmetry image, which is further classified as a water reflection image or another kind of imperfect symmetry image by comparing N1 and N2 against a threshold. Finally, the object is located on the N1 or N2 side according to which dominates.

Experiments and Results on Detection of Reflection Axis
Database
- 100 nature images with water reflection from Google
Compared algorithms
- Matching of SIFT descriptors to detect the reflection axis [Loy et al., ECCV, 2006]
- Matching based on the flip invariant shape detector [Zhang et al., ICPR, 2010]
Results
- Accuracy of axis detection: 29% [Loy et al., ECCV, 2006]
- Accuracy of axis detection: 46% [Zhang et al., ICPR, 2010]
- Accuracy of axis detection of our algorithm: 87%

Detection Results of Two Algorithms
Fig. Thumbnails of some comparison examples with reflection symmetry detection results: SIFT algorithm result, flip invariant shape algorithm result, and our algorithm result.

Distinguish Object Part and Reflection Part
Fig. Object part and reflection part determined by Curvelet coefficients. (a) Reversed water reflection image. (b) Positive Curvelet coefficients of the object part (left) and the reflection part (right).
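To make the blur-invariant moments of the preceding slides concrete, here is a minimal NumPy sketch. The invariant forms follow the reconstruction of IR_m1 to IR_m4 given above; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def moment_invariants(img):
    """Third-order moment invariants of a grayscale image (a sketch of the
    IR_m1..IR_m4 forms on the slide; blur invariance follows Flusser et al.)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m = lambda p, q: np.sum((x ** p) * (y ** q) * img)      # geometric moments
    m00 = m(0, 0)
    xc, yc = m(1, 0) / m00, m(0, 1) / m00                   # centroid
    mu = lambda p, q: np.sum(((x - xc) ** p) * ((y - yc) ** q) * img)
    # Normalized central moments: eta_pq = mu_pq / mu_00^(1 + (p+q)/2)
    eta = lambda p, q: mu(p, q) / m00 ** (1 + (p + q) / 2.0)
    e30, e12, e21, e03 = eta(3, 0), eta(1, 2), eta(2, 1), eta(0, 3)
    ir1 = (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2
    ir2 = (e30 + e12) ** 2 + (e21 + e03) ** 2
    ir3 = ((e30 - 3 * e12) * (e30 + e12)
           * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
           + (3 * e21 - e03) * (e21 + e03)
           * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    ir4 = ((3 * e21 - e03) * (e30 + e12)
           * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
           - (e30 - 3 * e12) * (e21 + e03)
           * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    return ir1, ir2, ir3, ir4
```

These four values are the per-subblock features fed into the reflection cost minimization of the flowchart above.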
Top-down and Bottom-up Saliency Map for No-Reference Image Quality Assessment
Fig. Position of the no-reference image quality assessment work in the overall framework (highlighted in purple).

Introduction to No-Reference Image Quality Assessment
Definition of no-reference image quality assessment
- A predefined correct (reference) image is not available
- Mainly aims to measure sharpness/blurriness
Difficulty
- How to assess quality in agreement with human judgment
Limitation of existing work
- Ignores that cognitive understanding influences the perceived quality [Wang et al., TIP, 2004]
Fig. Example of image quality influenced by cognitive understanding. (a) Image without distortion; (b) blurriness mainly on the girl; (c) blurriness mainly on the apple.

Basic Idea
- Combine semantic information from prior knowledge to build the saliency map
- The existing bottom-up saliency map does not match actual eye movements
- Measure sharpness based on top-down and bottom-up saliency map modeling

Target Information Acquisition in the Whole Flowchart
Fig. Flowchart of the proposed image sharpness assessment metric: the input image, tag information, WordNet, and eye-tracking data feed target information acquisition and saliency map model building; saliency regions are calculated, edge block distortion is computed, and the sharpness score is output. The orange part is target information acquisition.

Target Information Acquisition
- Input image with tags: "People", "New York"
- Remove: "New York" (does not belong to a physical entity)
- Remain: "People"
Target information is obtained from the tag information via WordNet.

Saliency Map Model Learning in the Whole Flowchart
Fig. Flowchart of the proposed image sharpness assessment metric. The orange part is saliency map model learning.

Flowchart of Top-Down & Bottom-Up Saliency Map Model Learning
Fig. Flowchart of the proposed top-down & bottom-up model algorithm: eye-tracking data, tag information, and visual information are the inputs; WordNet yields the target information for target detection and center priority (high-level feature detection), while Itti's bottom-up saliency map provides low-level feature detection; the saliency map model is learned by SVM, producing the top-down & bottom-up saliency model.

Top-Down & Bottom-Up Saliency Map Model Learning
Learning the saliency map model by SVM
Ground truth map (see the sketch at the end of this part)
- Created by convolving the contrast sensitivity function [Wang et al., TIP, 2001] over the fixation locations [Judd et al., ICCV, 2009]:
$g(x,y) = \sum_{i=1}^{M} \sum_{j=1}^{N} \varphi\big(x - f_x(i,j),\; y - f_y(i,j)\big)$
$\bar{I}(x,y) = g(x,y) \,/\, \max_{x,y} g(x,y)$
- The contrast sensitivity function depends on the eccentricity $\tan^{-1}\big(d(x-u, y-v)/(Lv)\big)$, with $d(x,y) = \sqrt{x^2 + y^2}$
Fig. An example of a ground truth map. (a) Original image; (b) eye-fixation locations; (c) ground truth map.
Notations
- e: half-resolution eccentricity constant
- L: image width
- v: viewing distance
- N: number of fixation locations
- M: number of users
In the example, N = 15 and M = 6.
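The ground-truth construction above lends itself to a short sketch. Here an isotropic Gaussian stands in for the contrast-sensitivity kernel, so this is an approximation of the map described on the slide rather than the exact kernel; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_map(shape, fixations, sigma=25.0):
    """Build a fixation-density ground-truth map: place unit impulses at the
    fixation locations pooled over all users, smooth them, and normalize to
    [0, 1]. A Gaussian approximates the contrast-sensitivity kernel."""
    g = np.zeros(shape, dtype=float)
    for fx, fy in fixations:                 # (x, y) pixel coordinates
        g[int(fy), int(fx)] += 1.0
    g = gaussian_filter(g, sigma)
    return g / g.max()
```

Training labels for the SVM are then sampled from the most and least salient locations of this map, as described in the experiment setting below.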
Image Quality Assessment Based on the Proposed Saliency Map
Fig. Flowchart of the proposed image sharpness assessment metric. The orange part is image quality assessment based on the proposed saliency map: calculate saliency regions, compute edge block distortion, and output the sharpness score.

Experiment Setting of the Top-Down and Bottom-Up Saliency Map Model
Dataset
- From the eye-tracking database [Judd et al., ICCV, 2009]
- Training: 200 images; test: 64 images
Training samples from the ground truth map
- Positively labelled data: 30 pixels chosen randomly from the 10% most salient locations
- Negatively labelled data: 30 pixels chosen randomly from the 10% least salient locations

Comparison of the Two Saliency Map Models
Fig. An example comparing the coverage of fixation points by different saliency models. (a) Original image; (b) eye-fixation locations; (c) fixation points covered by the bottom-up saliency model; (d) fixation points covered by our saliency model.
Table. Evaluation of the proposed saliency map.
| Model | Saliency points inside the 20% most salient regions | Non-saliency points outside the 80% salient regions |
| Our saliency map model | 84.156% | 76.344% |
| Itti's bottom-up saliency map model | 45.35% | 72.875% |

Experiment Setting and Results of Image Quality Assessment
Database
- Subjective image quality assessment: 160 images downloaded from Flickr, blurred at eight different Gaussian blur levels
- Rated from 1 to 5, corresponding to "very annoying", "annoying", "slightly annoying", "perceptible but not annoying", and "imperceptible", by 14 subjects
Evaluation results
| Metric | Nonlinear Pearson | Spearman | MAE | RMS |
| Proposed metric | 0.914 | 0.86 | 0.173 | 0.25 |
| Classical JNB [Ferzli et al., TIP, 2009] | 0.885 | 0.815 | 0.219 | 0.292 |
| Saliency weighted JNB [Nabil et al., ICIP, 2008] | 0.863 | 0.801 | 0.317 | 0.232 |
| JNB with edge refinement [Varadarajan et al., ICIP, 2008] | 0.618 | 0.466 | 0.387 | 0.494 |
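The four criteria in the table can be computed as below. Note that the reported "nonlinear Pearson" is conventionally obtained after fitting a nonlinear regression between metric scores and subjective ratings; this sketch omits that fitting step, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_metric(predicted, mos):
    """Agreement between a sharpness metric and mean opinion scores (MOS),
    using the four criteria reported in the table above."""
    predicted, mos = np.asarray(predicted, float), np.asarray(mos, float)
    pearson = pearsonr(predicted, mos)[0]        # linear correlation
    spearman = spearmanr(predicted, mos)[0]      # rank-order correlation
    mae = np.mean(np.abs(predicted - mos))       # mean absolute error
    rms = np.sqrt(np.mean((predicted - mos) ** 2))
    return pearson, spearman, mae, rms
```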
Fuzzy Based Contextual Cueing for Region Level Annotation
Fig. Position of the region level annotation work in the overall framework (highlighted in purple).

Introduction to Region Level Annotation
Definition of region level annotation
- Segment the image into semantic regions
- Assign the given image-level annotations to the precise regions
Fig. (a) An image with given image-level annotations (water, cow, grass); (b) the image with automatic region-level annotation.
Motivation of automatic region level annotation
- Helpful for reliable content-based image retrieval [Liu, ACM MM'09]
- Substitutes for tedious manual region-level annotation

Representative Work on Region Level Annotation
Early work on region level annotation
- Known as simultaneous object recognition and image segmentation
- Unsupervised learning: handles images with a single major object or a clean background [Cao, ICCV'07]
- Supervised learning: focuses on special object recognition or special domains [Li, CVPR'09]
Latest work for real-world applications
- Label propagation by bi-layer sparse coding [Liu, ACM MM'09]: common annotations are more likely to have similar visual features in the corresponding regions
- Shows impressive results on nature images

Limitation of Visual Similarity in Region Level Annotation
Fig. Example of the difficulty of distinguishing sky and sea from visual features alone. (a) The original image. (b) The original image with 200 data points. (c)-(f) 128-dimensional local features of four random points selected from the sky and the sea.

Contextual Cueing in Perceptual Processing
Contextual cueing
- Human brains gather information through incidentally learned associations between spatial configurations and target locations [Chun, CP, 1998]
- Spatial invariants [Biederman et al., CoP, 1982]: probability, co-occurrence, size, position, and spatial topological relationship

Contextual Cueing Modeling by Fuzzy Theory
The difficulty of modeling contextual cueing
- Classical bivalent set theory causes serious semantic loss
- Example: imprecise positions and ambiguous topological relationships between the centers of gravity of regions
Fig. (a) Example of an ambiguous topological relationship; (b) example of a topological relationship for object recognition.
Fuzzy theory (see the membership sketch after the experiment below)
- Measures the degree of truth
- Fuzzy membership quantizes the degree of truth
- Fuzzy logic allows decision making with imprecise information

Flowchart
Fig. Flowchart of the proposed fuzzy based contextual cueing label propagation approach.

Illustration of Fuzzy Based Contextual Cueing Label Propagation
Fig. (a) Original image with given image-level annotations (sky, water, beach, boat); (b) over-segmentation result; (c) label propagation across images; (d) label propagation using fuzzy based contextual cueing.

Experiment on the MSRC Dataset
MSRC dataset
- 380 images with 18 categories, including building, grass, tree, cow, boat, sheep, sky, mountain, aeroplane, water, bird, book, road, car, flower, cat, sign, and dog
Comparison methods
- Four baseline methods implemented with binary SVMs using different maximal patch sizes: SVM1 (150 pixels), SVM2 (200 pixels), SVM3 (400 pixels), and SVM4 (600 pixels)
- Two recent techniques [Liu et al., ACM MM'09]: label propagation with one-layer sparse coding, and label propagation with bi-layer sparse coding
Experimental result
Table 1. Label-to-region assignment accuracy comparison.
| Database | SVM1 | SVM2 | SVM3 | SVM4 | One-Layer | Bi-Layer | FCLP |
| MSRC | 0.24 | 0.22 | 0.27 | 0.25 | 0.54 | 0.65 | 0.72 |
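Returning to the fuzzy modeling step: as an illustration of how a fuzzy membership can quantize an imprecise spatial relation between two regions' centers of gravity, consider a logistic membership for "region A is above region B". This is a sketch of the general idea only; the paper's actual membership functions may differ, and the names and the softness parameter are illustrative.

```python
import numpy as np

def centroid(mask):
    """Center of gravity (x, y) of a boolean region mask."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def membership_above(region_a, region_b, softness=20.0):
    """Fuzzy degree to which region A lies above region B. Image rows grow
    downward, so A above B means a_y < b_y; the logistic maps the vertical
    offset to a truth degree in (0, 1) instead of a crisp yes/no."""
    (_, ay), (_, by) = centroid(region_a), centroid(region_b)
    return 1.0 / (1.0 + np.exp(-(by - ay) / softness))
```

A crisp predicate would flip from 0 to 1 as soon as the centroids cross, losing exactly the imprecision the slide argues contextual cueing needs; the fuzzy degree degrades gracefully instead.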
Experiment Analysis on the MSRC Dataset
Fig. (a) An image with annotations (sky, building, tree, road); (b) bi-layer result; (c) FCLP result. (d) An image with annotations (sky, building, tree, road, car); (e) bi-layer result; (f) FCLP result.

Diagram of Deep Learning
Fig. Position of the deep learning work for multimedia content analysis in the overall framework (highlighted in purple).

Outline of the Proposed Deep Learning Model
- Introduction
- Deep learning
- Proposed algorithm
- Experiments and results
- Bilinear deep belief networks: experiments on the handwriting dataset MNIST, the complicated object dataset Caltech 101, the Urban & Natural Scene dataset, and the face dataset CMU PIE
- Field effect bilinear deep belief networks

Introduction
Image classification is a classical problem
- Aims to understand the semantic meaning of visual information
- Determines the category of an image according to predefined criteria
Image classification remains a well-known challenge after more than fifteen years of extensive research
- Humans, however, have no difficulty classifying images
Aim of this work
- Provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception
- Deep architecture is a representative paradigm that has achieved notable success in modeling the human visual system

Research on Deep Learning
Definition of deep learning
- Models a learning task using deep architectures composed of multiple layers of nonlinear modules
Deep belief network (DBN)
- Densely connected between layers, with the RBM as the basic building block
- Two stages: abstract the input information layer by layer, then fine-tune the whole deep network toward the ultimate learning target [Hinton et al., NC, 2006]
Research progress
- Deep architectures are thought to be best exemplified by neural networks [Cottrell, Science, 2006]
- DBNs exhibit notable performance on different tasks, such as dimensionality reduction [Hinton et al., Science, 2006] and classification [Salakhutdinov et al., AISTATS, 2007]

Architecture of the Deep Belief Network
Fig. Structure of the deep belief network (DBN).
1. The initial weighted connections are constructed randomly.
2. The size of every layer is determined by intuition.
3. The parameter space is refined by greedy layer-wise information reconstruction.
4. Steps 1-3 are repeated until the parameter space of all layers is constructed (a sketch of this loop follows).
5. The whole model is fine-tuned to minimize the classification error via backpropagation.
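Steps 1-4 above amount to stacking RBMs trained with one-step contrastive divergence, each layer's activations becoming the next layer's input. A minimal NumPy sketch of this pretraining loop, with biases omitted for brevity and all names illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes, epochs=10, lr=0.1, seed=0):
    """Greedy layer-wise DBN pretraining (steps 1-4 of the slide). Each pair
    of adjacent layers is trained as an RBM with CD-1; fine-tuning by
    backpropagation (step 5) is not shown. data: (n_samples, n_visible)."""
    rng = np.random.RandomState(seed)
    weights, v = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.randn(v.shape[1], n_hidden)   # random initial weights
        for _ in range(epochs):
            ph0 = sigmoid(v @ W)                     # positive phase
            h0 = (rng.rand(*ph0.shape) < ph0).astype(float)
            v1 = sigmoid(h0 @ W.T)                   # one-step reconstruction
            ph1 = sigmoid(v1 @ W)
            W += lr * (v.T @ ph0 - v1.T @ ph1) / len(v)
        weights.append(W)
        v = sigmoid(v @ W)                           # feed activations upward
    return weights
```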
Bilinear Deep Learning
Deep architecture
- Human: visual information is transmitted through the optic tract to the nerve positions as second-order data
- Proposed technique: a novel bilinear deep belief network (BDBN)
Three-stage learning
- Human: two peaks of activation, the "initial guess" and the "post-recognition", in the visual cortex areas
- Proposed technique: a new bilinear discriminant initialization
Semi-supervised framework
- Human: more practical for long-term daily learning of the visual world
- Proposed technique: a more flexible learning framework

Bilinear Deep Belief Network
Fig. Architecture of the bilinear deep belief network: a fully interconnected directed belief network constructed from a set of second-order planes, from the input image $X^k$ through hidden layers $H^1, \ldots, H^N$ to the label layer $La^k = (y_1^k, y_2^k, \ldots, y_C^k)$.
- The number of units in the input layer equals the resolution of the images
- The number of units in the label layer equals the number of image classes
- The number of units in each hidden layer is determined by bilinear discriminant initialization
- The search for the mapping is transformed into the problem of finding the optimal parameter space of the deep architecture

Three-Stage Learning
Bilinear discriminant initialization
- Human: the early peak, related to the activation of the "initial guess"
- Deep: determine the initial parameters and the size of the upper layer
Greedy layer-wise reconstruction
- Human: information propagation between adjacent layers
- Deep: determine the parameters of each pair of layers
Global fine-tuning
- Human: the late peak, related to the activation of "post-recognition"
- Deep: refine the parameter space for better classification performance

Bilinear Discriminant Initialization
Latent representation with projection matrices U and V
Preserve discriminant information in the projected feature space by optimizing the objective function
$\arg\max_{U,V} J(U,V) = \sum_{s,t=1}^{K} \| U^T (X_s - X_t) V \|^2 \big(\beta B_{st} - (1-\beta) W_{st}\big), \quad \text{s.t. } U^T U = I_P,\; V^T V = I_Q$
where $B_{st}$ are between-class weights and $W_{st}$ are within-class weights.
Obtain the discriminant initial connections of the layer pair and use the optimal dimensions to define the structure of the next layer:
$P_2 = \mathrm{row}(U^1), \quad Q_2 = \mathrm{column}(V^1), \quad A^1_{ij,pq}(0) = (U^1_{ip})^T V^1_{jq}$
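The alternating optimization of J(U, V) reduces to a pair of eigenproblems, as spelled out in Algorithm 1 below. The following sketch assumes the training data is stored as a (K, m, n) array of second-order samples and uses β as the balance parameter from the reconstructed objective; all names are illustrative.

```python
import numpy as np

def bilinear_discriminant_init(X, B, W, beta, P, Q, iters=20, seed=0):
    """Alternating eigen-solve for projections U, V maximizing
    J(U,V) = sum_{s,t} ||U^T (X_s - X_t) V||^2 (beta*B_st - (1-beta)*W_st),
    subject to U^T U = I_P and V^T V = I_Q. X has shape (K, m, n)."""
    K, m, n = X.shape
    w = beta * B - (1 - beta) * W                    # pairwise weights E_st
    rng = np.random.RandomState(seed)
    U = np.linalg.qr(rng.randn(m, P))[0]
    V = np.linalg.qr(rng.randn(n, Q))[0]
    for _ in range(iters):
        # Fix V: U spans the top-P eigenvectors of D_V
        DV = sum(w[s, t] * (X[s] - X[t]) @ V @ V.T @ (X[s] - X[t]).T
                 for s in range(K) for t in range(K))
        U = np.linalg.eigh(DV)[1][:, -P:]            # eigh sorts ascending
        # Fix U: V spans the top-Q eigenvectors of D_U
        DU = sum(w[s, t] * (X[s] - X[t]).T @ U @ U.T @ (X[s] - X[t])
                 for s in range(K) for t in range(K))
        V = np.linalg.eigh(DU)[1][:, -Q:]
    return U, V
```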
Greedy Layer-Wise Reconstruction
A joint configuration $(h^1, h^2)$ of the input layer $H^1$ and the first hidden layer $H^2$ has energy
$E(h^1, h^2; \theta^1) = -\big(h^1 A^1 h^2 + b^1 h^1 + c^1 h^2\big), \quad \theta^1 = \{A^1, b^1, c^1\}$
The contrastive divergence algorithm is used to update the parameter space:
$\frac{\partial \log p(h^1(0))}{\partial \theta} \approx \big\langle E(h^2(0), h^1(0)) \big\rangle_{p(h^2(0)\mid h^1(0))} - \big\langle E(h^2(t), h^1(t)) \big\rangle_{p(h^2(t),\, h^1(t))}$
$A^1_{ij,pq} \leftarrow A^1_{ij,pq} + \eta_A \big(\langle h^1_{ij}(0)\, h^2_{pq}(0)\rangle_{\text{data}} - \langle h^1_{ij}(1)\, h^2_{pq}(1)\rangle_{\text{recon}}\big)$
$b^1_{ij} \leftarrow b^1_{ij} + \eta_b \big(h^1_{ij}(0) - h^1_{ij}(1)\big)$
$c^1_{pq} \leftarrow c^1_{pq} + \eta_c \big(h^2_{pq}(0) - h^2_{pq}(1)\big)$
A code sketch of this update follows Algorithm 1 below.

Global Fine-Tuning
Backpropagation adjusts the entire deep network to find good local optimum parameters:
$\theta^* = \arg\min_\theta \Big[ -\sum_l \bar{y}_l \log y_l \Big]$
- Before backpropagation, a good region of the whole parameter space has already been found
- Convergence of backpropagation learning is therefore not slow
- The result generally converges to a good local minimum on the error surface

Proposed Algorithm
Algorithm 1: Bilinear Deep Belief Network
Input: training data set X; labeled samples X_L in X with corresponding label set Y; number of layers N; number of epochs E; number of labeled data L; balance parameter $\beta$; between-class weights $B_{st}$; within-class weights $W_{st}$; initial bias parameters b and c; momentum and learning rates $\eta_A, \eta_b, \eta_c$.
Output: optimal parameter space $\theta^* = [A^*, b^*, c^*]$.
Stage 1: Bilinear discriminant initialization
1. Preserve discriminant information by optimizing $\arg\max_{U,V} J(U,V) = \sum_{s,t=1}^{K} \| U^T (X_s - X_t) V \|^2 (\beta B_{st} - (1-\beta) W_{st})$:
   while not convergent do
     $D_V = \sum_{s,t} E_{st} (T_s^n - T_t^n)\, V V^T (T_s^n - T_t^n)^T$; fix V, compute U by solving $D_V u = \lambda u$
     $D_U = \sum_{s,t} E_{st} (T_s^n - T_t^n)^T U U^T (T_s^n - T_t^n)$; fix U, compute V by solving $D_U v = \lambda v$
   end while
2. Compute the initial weights of the connections: $A^n_{ij,pq}(0) = (U^n_{ip})^T V^n_{jq}$
3. Determine the structure of the next layer: $P_{n+1} = \mathrm{row}(U^n)$, $Q_{n+1} = \mathrm{column}(V^n)$
4. Calculate the state of the next layer: $p(h^{n+1}_{pq} = 1 \mid h^n) = \sigma\big(\sum_{i \le P_n,\, j \le Q_n} h^n_{ij} A^n_{ij,pq} + c^n_{pq}\big)$
Stage 2: Greedy layer-wise reconstruction
5. Update the weights and biases:
   $p(h^n_{ij} = 1 \mid h^{n+1}) = \sigma\big(\sum_{p \le P_{n+1},\, q \le Q_{n+1}} A^n_{ij,pq} h^{n+1}_{pq} + b^n_{ij}\big)$
   $A^n_{ij,pq} \leftarrow A^n_{ij,pq} + \eta_A \big(\langle h^n_{ij}(0)\, h^{n+1}_{pq}(0)\rangle_{\text{data}} - \langle h^n_{ij}(1)\, h^{n+1}_{pq}(1)\rangle_{\text{recon}}\big)$
   $b^n_{ij} \leftarrow b^n_{ij} + \eta_b \big(h^n_{ij}(0) - h^n_{ij}(1)\big)$
   $c^n_{pq} \leftarrow c^n_{pq} + \eta_c \big(h^{n+1}_{pq}(0) - h^{n+1}_{pq}(1)\big)$
Stage 3: Global fine-tuning
6. Calculate the optimal parameter space: $\theta^* = \arg\min_\theta \big[ -\sum_l \bar{y}_l \log y_l \big]$
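The contrastive-divergence update of the preceding slides, written here for a standard vector-shaped RBM for readability (the bilinear model applies the same rule to matrix-shaped layers), with a single learning rate playing the role of η_A, η_b, η_c; names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, A, b, c, lr=0.1, rng=np.random):
    """One CD-1 update for an RBM with energy
    E(v, h) = -(v^T A h + b^T v + c^T h); v0 has shape (n_visible,),
    A has shape (n_visible, n_hidden)."""
    p_h0 = sigmoid(v0 @ A + c)                               # positive phase
    h0 = (rng.random_sample(p_h0.shape) < p_h0).astype(float)
    v1 = sigmoid(h0 @ A.T + b)                               # reconstruction
    p_h1 = sigmoid(v1 @ A + c)                               # negative phase
    A += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))      # <..>_data - <..>_recon
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return A, b, c
```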
Experiment Setting
Databases
- MNIST: the standard handwritten digits dataset, with 60,000 training images and 10,000 test images
- Caltech 101: a standard image classification dataset with images of 100 different objects; the frequently used subset of 2,935 images from the first 5 categories is adopted
- Urban and Natural Scene: 2,688 natural color images in 8 categories
- CMU PIE: 11,560 face images of 68 subjects with varying pose, illumination, and expression
Compared algorithms
- K-nearest neighbor (KNN)
- Support vector machine (SVM)
- Transductive SVM (TSVM) [Collobert et al., JMLR, 2006]
- Neural network (NN)
- EmbedNN [Weston et al., ICML, 2008]
- Semi-DBN [Bengio et al., NIPS, 2006]
- DBN-rNCA [Salakhutdinov et al., AISTATS, 2007]
- DDBN [Liu et al., PR, 2011]
- DCNN [Jarrett et al., ICCV, 2009]

Experiments on Caltech 101
Sample categories: Faces_easy, Faces, Motorbikes, Airplanes, Back_google
Two experiments
- Classification accuracy comparison
- Converging time comparison
Experiment setting
- 50 images per category form the test set; the rest form the training set

Classification Accuracy Comparison on Caltech 101
- Deep techniques achieve much better performance than shallow techniques
- The bilinear deep belief network is the best under different numbers of labeled data

Converging Time Comparison on Caltech 101
- Measured as iterations in the fine-tuning stage: BDBN converges much more quickly because of its better "initial guess"

Experiments on Urban and Natural Scene
Sample categories: forest, highway, street, open country, mountain, tall building, city center, coast & beach
Two experiments
- Classification accuracy and real running time comparison
- Limitation discussion
Experiment setting
- 50 images per category form the test set; the rest form the training set

Performance Comparison on Urban and Natural Scene
- Classical setting of the numbers of neurons in the hidden layers: 500, 500, and 2000
- BDBN setting from bilinear discriminant initialization: 24x24, 21x21, and 19x20
- For the compared models, "_d" denotes the same sizes as the BDBN setting and "_c" the same sizes as the classical setting

Limitation of Image Classification Based Only on Visual Similarity
- Misclassified image: ground truth label "street", misclassified as "highway"
- Calculating visual similarity alone is limited; humans can give the correct judgment by referencing the buildings and cars along the street
- Contextual cueing: knowledge about spatial invariants learned from past experience

Simulating the Primary Visual Cortex on MNIST
Responses of V1 neurons
- Selective spatial information filters, similar to spatially local, complex Fourier transforms or Gabor transforms
Weights of the proposed BDBN
- Roughly represent different "strokes" of digits
- Oriented, Gabor-like, and resembling the receptive fields of V1 simple cells
Fig. Samples of first-layer weights; the examples represent "strokes" of digits.

Experiments on CMU PIE
Sample images from the dataset
Two experiments
- Classification accuracy comparison on noisy data
- Parameter space visualization
Experiment setting
- 50 images per category form the test set; the rest form the training set
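Since the learned first-layer weights on MNIST are described above as oriented and Gabor-like, a reference Gabor bank is easy to generate for visual comparison. This snippet is purely illustrative (it is not how the BDBN weights are produced), and all parameter values are assumptions.

```python
import numpy as np

def gabor_kernel(size=11, theta=0.0, wavelength=4.0, sigma=2.5, gamma=0.5):
    """Oriented Gabor kernel: a Gaussian envelope times a cosine carrier,
    the kind of filter V1 simple-cell receptive fields resemble."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# A bank of 8 orientations, as in V1 simple cells:
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 8, endpoint=False)]
```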
Robustness to Noise on CMU PIE
Fig. The reconstruction produced by BDBN at every layer.

Visualization of the Parameter Space on CMU PIE
- The emphasized regions coincide with the facial feature regions (facial feature points)

Image Recognition with Incomplete Data
Incomplete data
- Data values/features are only partially observed
- Results from measurement noise, corruption, or occlusion
Fig. (a) Incomplete images due to noise and corruption; (b) incomplete face images due to occlusion of the important facial feature regions.

Learning Stages in FBDBN
Fig. Learning stages in the field effect bilinear deep belief network (architecture diagram).
Field effect bilinear discriminant projection
- Map the original data into a discriminant bilinear subspace based on the features with high reliability
Bi-directional inference
- Bottom-up inference: construct the model from the available features and the estimated features, weighted by their reliability
- Top-down inference: estimate the missing features from the higher-layer activations of the reference datum
Post activation by backpropagation
- Fine-tune to minimize the recognition error and re-estimate the values of the missing features

Output Characteristic of FRBM
Fig. The output characteristic curve and the operating mode of the field effect RBM depend on the voltages V_GS, V_th, and V_DS, in analogy with a field-effect transistor.

Algorithm of FBDBN

Experiment on Block-Incomplete Digits with Fixed Missing Ratio
Fig. Samples of images estimated by FBDBN when block features are missing at a fixed ratio. (a) Original images; (b) incomplete images after a fixed ratio of pixels is removed; (c) images estimated via FBDBN.

Unsupervised Auto-Encoding Comparison
Fig. Auto-encoder comparison of DBN and FBDBN. (a) Auto-encoder results of DBN; (b) auto-encoder results of FBDBN.

Experiment on Face Image Estimation
Fig. Reliability curves with estimated images. (a) Reliability curve with the estimated mouth part; (b) reliability curve with the estimated eyes part.

References
[1] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, vol. 8(2), 1962.
[2] R. L. Gregory, "Eye and Brain: The Psychology of Seeing," Oxford: Oxford University Press, 1967.
    S. E. Palmer, "The effects of contextual scenes on the identification of objects," Memory and Cognition, vol. 3, pp. 519-526, 1975.
[3] I. Biederman, R. Mezzanotte, and J. Rabinowitz, "Scene perception: detecting and judging objects undergoing relational violations," Cognitive Psychology, vol. 14(2), pp. 143-177, 1982.
[4] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, vol. 4(4), pp. 219-227, 1985.
[5] W. T. Newsome and E. B. Paré, "A selective impairment of motion perception following lesions of the middle temporal visual area (MT)," J. Neurosci., vol. 8, pp. 2201-2211, 1988.
[6] J. D. Victor, K. Purpura, E. Katz, and B. Mao, "Population encoding of spatial frequency, orientation, and color in macaque V1," J. Neurophysiol., vol. 72, pp. 2151-2166, 1994.
[7] B. Von Eckardt, "What Is Cognitive Science?," Cambridge: MIT Press, 1995.
[8] N. A. Stillings, S. E. Weisler, C. H. Chase, M. H. Feinstein, J. L. Garfield, and E. L. Rissland, "Cognitive Science: An Introduction," 2nd edition, MIT Press, 1995.
[9] M. Edwards and D. Badcock, "Global motion perception: interaction of chromatic and luminance signals," Vision Research, vol. 36, pp. 2423-2431, 1996.
[10] J. Flusser, T. Suk, and S. Saic, "Recognition of images degraded by linear motion blur without restoration," Computing Suppl., vol. 11, pp. 37-51, 1996.
[11] M. M. Chun and Y. Jiang, "Contextual cueing: implicit learning and memory of visual context guides spatial attention," Cognit. Psychol., vol. 36, pp. 28-71, 1998.
[12] D. Shen, H. H. S. Ip, and E. K. Teoh, "Robust detection of skewed symmetries," In ICPR, vol. 3, pp. 1010-1013, 2000.
[13] J. Hegde and D. C. Van Essen, "Selectivity for complex shapes in primate visual area V2," J. Neurosci., vol. 20: RC61, 2000.
[14] L. Itti, C. Koch, and E. Niebur, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol. 40, pp. 1489-1506, Apr. 2000.
[15] J. H. Reynolds, T. Pasternak, and R. Desimone, "Attention increases sensitivity of V4 neurons," Neuron, vol. 26, pp. 703-714, 2000.
[16] Y. Wang, Z. Liu, and J. Huang, "Multimedia content analysis using both audio and visual clues," IEEE Signal Processing Magazine, vol. 17(6), pp. 12-36, 2000.
[17] A. Oliva and A. Torralba, "Modeling the shape of the scene: a holistic representation of the spatial envelope," IJCV, 2001.
[18] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006.
[19] N. Zahid, O. Abouelala, M. Limouri, and A. Essaid, "Fuzzy clustering based on K-nearest-neighbours rule," Fuzzy Sets and Systems, 2001.
[20] W. Liu, S. Dumais, Y. Sun, and H. Zhang, "Semi-automatic image annotation," In Proceedings of the International Conference on Human-Computer Interaction, pp. 326-333, 2001.
[21] Z. Wang, A. C. Bovik, L. Lu, and J. Kouloheris, "Foveated wavelet image quality index," SPIE's 46th Annual Meeting, Proc. SPIE: Applications of Digital Image Processing XXIV, vol. 4472, 2001.
[22] Z. Wang and A. C. Bovik, "Embedded foveation image coding," IEEE Transactions on Image Processing, vol. 10(10), Oct. 2001.
[23] Y. Amit, "2D Object Detection and Recognition: Models, Algorithms and Networks," MIT Press: Cambridge, Mass., 2002.
[24] D. Walther, L. Itti, M. Riesenhuber, T. Poggio, and C. Koch, "Attentional selection for object recognition - a gentle way," Proc. Second Int'l Workshop on Biologically Motivated Computer Vision, 2002.
[25] A. Stern, I. Kruchakov, E. Yoavi, and N. S. Kopeika, "Recognition of motion-blurred images by use of the method of moments," Applied Optics, 2002.
[26] L. Q. Chen, X. Xie, W. Y. Ma, H. J. Zhang, and H. Q. Zhou, "A visual attention model for adapting images on small displays," Multimed. Syst., vol. 9(4), pp. 353-364, 2003.
[27] J. R. Anderson, "Cognitive Psychology and Its Implications," 6th edition, Worth Publishers, 2004.
[28] L. Lucchese, "Frequency domain classification of cyclic and dihedral symmetries of finite 2-D patterns," Pattern Recognition, vol. 37, pp. 2263-2280, 2004.
[29] M. Bar, "Visual objects in context," Nature Reviews Neuroscience, vol. 5, pp. 617-629, Aug. 2004.
[30] T. I. Ören and L. Yilmaz, "Behavioral anticipation in agent simulation," Proceedings of WSC 2004 - Winter Simulation Conference, pp. 801-806, 2004.
[31] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," IJCV, vol. 59(2), pp. 167-181, 2004.
[32] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 37-44, 2004.
[33] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[34] E. A. Styles, "Attention, Perception, and Memory: An Integrated Introduction," first edition, Psychology Press, 2005.
[35] Z. Lu, W. Lin, X. Yang, E. P. Ong, and S. Yao, "Modeling visual attention's modulatory aftereffects," IEEE Transactions on Image Processing, vol. 14(11), pp. 1928-1942, 2005.
[36] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[37] G. W. Cottrell, "New life for neural networks," Science, vol. 313, pp. 454-455, July 2006.
[38] R. Ferzli and L. J. Karam, "A human visual system based no-reference objective image sharpness metric," IEEE International Conference on Image Processing, pp. 2949-2952, Oct. 2006.
[39] G. Loy and J. Eklundh, "Detecting symmetry and symmetric constellations of features," In European Conference on Computer Vision, Part II, LNCS 3952, pp. 508-521, May 2006.
[40] H. Cornelius and G. Loy, "Detecting bilateral symmetry in perspective," In Proceedings of the International Conference on Computer Vision and Pattern Recognition Workshop, 2006.
[41] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson, "Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search," Psychological Review, pp. 766-786, 2006.
[42] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," In NIPS, 2006.
[43] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," In NIPS, 2006.
[44] R. R. Salakhutdinov and G. E. Hinton, "Learning a nonlinear embedding by preserving class neighbourhood structure," In AISTATS, 2007.
[45] R. Ferzli and L. J. Karam, "A no-reference objective image sharpness metric based on just noticeable blur and probability summation," IEEE International Conference on Image Processing, vol. 3, pp. 445-448, Sept. 2007.
[46] R. R. Salakhutdinov and G. E. Hinton, "Learning a nonlinear embedding by preserving class neighbourhood structure," in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[47] R. Ren, P. Punitha, J. M. Jose, and J. Urban, "Attention-based video summarisation in rushes collection," In TVS '07: Proceedings of the International Workshop on TRECVID Video Summarization, New York, NY, USA, pp. 89-93, 2007.
[48] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," Large-Scale Kernel Machines, 2007.
[49] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, "Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric," in Proc. IEEE Int. Conf. Image Process., vol. 2, pp. 169-172, 2007.
[50] J. Yuan, J. Li, and B. Zhang, "Exploiting spatial context constraints for automatic image region annotation," In ACM MM, pp. 595-604, 2007.
[51] C. Galleguillos, A. Rabinovich, and S. Belongie, "Object categorization using co-occurrence, location and appearance," In CVPR, June 2008.
[52] J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," In ICML, 2008.
[53] E. K. Chen, X. K. Yang, H. Y. Zha, R. Zhang, and W. J. Zhang, "Learning object classes from image thumbnails through deep neural networks," International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, 2008.
[54] N. G. Sadaka, L. J. Karam, R. Ferzli, and G. P. Abousleman, "A no-reference perceptual image sharpness metric based on saliency-weighted foveal pooling," IEEE International Conference on Image Processing, pp. 369-372, Oct. 2008.
[55] S. Varadarajan and L. J. Karam, "An improved perception-based no-reference objective image sharpness metric using iterative edge refinement," IEEE International Conference on Image Processing, pp. 401-404, Oct. 2008.
[56] J. You, A. Perkis, M. Hannuksela, and M. Gabbouj, "Perceptual quality assessment based on visual attention analysis," In ACM MM, 2009.
[57] J. Li, R. Socher, and L. Fei-Fei, "Towards total scene understanding: classification, annotation and segmentation in an automatic framework," In CVPR, 2009.
[58] L. Liang, J. Chen, S. Ma, D. Zhao, and W. Gao, "A no-reference perceptual blur metric using histogram of gradient profile sharpness," ICIP, pp. 4369-4372, Apr. 2009.
[59] L. Ballan, A. Bazzica, M. Bertini, A. D. Bimbo, and G. Serra, "Deep networks for audio event classification in soccer videos," IEEE International Conference on Multimedia & Expo, 2009.
[60] A. Rabinovich and S. Belongie, "Scenes vs. objects: a comparative study of two approaches to context based recognition," In ViSU, 2009.
[61] R. Ferzli and L. J. Karam, "A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB)," IEEE Transactions on Image Processing, vol. 18, no. 4, pp. 717-728, Apr. 2009.
[62] S. Lee and Y. Liu, "Curved glide-reflection symmetry detection," In CVPR, pp. 1046-1053, 2009.
[63] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," IEEE International Conference on Computer Vision, Sep. 2009.
[64] X. Liu, B. Cheng, S. Yan, J. Tang, T. S. Chua, and H. Jin, "Label to region by bi-layer sparsity priors," In Proceedings of ACM Multimedia, pp. 115-124, Oct. 2009.
[65] Y.-G. Jiang, C.-W. Ngo, and S.-F. Chang, "Semantic context transfer across heterogeneous sources for domain adaptive video search," in ACM Multimedia, 2009.
[66] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," In ICCV, 2009.
[67] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," In CVPR, 2009.
[68] S. Zhong, Y. Liu, Y. Liu, and F.-L. Chung, "Fuzzy based contextual cueing for region level annotation," In Proceedings of the ACM International Conference on Internet Multimedia Computing and Service (ICIMCS'10), 2010.
[69] S. Zhong, Y. Liu, Y. Liu, and F.-L. Chung, "A semantic no-reference image sharpness metric based on top-down and bottom-up saliency map modeling," In Proceedings of the 17th IEEE International Conference on Image Processing (ICIP'10), 2010.
[70] M. Chertok and Y. Keller, "Spectral symmetry analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, July 2010.
[71] M. Wang, J. Li, T. Huang, Y. Tian, L. Duan, and G. Jia, "Saliency detection based on 2D log-Gabor wavelets and center bias," In ACM MM, 2010.
[72] S. Zhou, Q. Cheng, and X. Wang, "Discriminative deep belief networks for image classification," In ICIP, 2010.
[73] S. Zhong, Y. Liu, Y. Liu, and C. Li, "Water reflection detection and recognition based on moment invariants to motion blur using dynamic programming," In Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR'11), 2011.
[74] McMaster University, "Discover Psychology," Attention and Memory, Toronto, Ontario: Nelson Education Ltd., ISBN-13: 978-0-17-661396-9, 2011.
[75] S. Zhong, Y. Liu, and Y. Liu, "Bilinear deep learning for image classification," submitted to the ACM International Conference on Multimedia, 2011.
[76] S. Zhong, Y. Liu, L. Shao, and G. Wu, "Unsupervised saliency detection based on 2D Gabor and Curvelet transforms," In ACM ICIMCS, 2011.

Q&A
Thank You!

Query-oriented Multiple Document Summarization
Extractive query-oriented multi-document summarization
- Generates the summary by extracting a proper set of sentences from multiple documents based on a pre-given query
- Important in both information retrieval and natural language processing
- Humans have no difficulty with multi-document summarization: how does the neocortex process this lexical-semantic task?
Contribution
- The first paper to utilize deep learning in document summarization
- Provides human-like judgment by referencing the architecture of the human neocortex

Flowchart
Query-oriented concept extraction
- After preprocessing (word list, document topic set, query-oriented initial weight setting, query-oriented penalty process, filtering out unimportant words, and key words discovery), the hidden layers h0, h1, h2, h3 abstract the documents with a greedy layer-wise extraction algorithm using weights A1, A2, A3 on tf-valued word features $f_d = [f_{d1}, f_{d2}, \ldots, f_{dv}, \ldots, f_{dV}]$
Reconstruction validation for global adjustment
- Reconstructs the data distribution through $(A^1)^T, (A^2)^T, (A^3)^T$ by fine-tuning the whole deep architecture globally
Summary generation via dynamic programming
- Candidate sentences are extracted into a pool; dynamic programming maximizes the importance of the summary under the length constraint (see the sketch below)
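The summary-generation step is a length-constrained selection problem that dynamic programming solves exactly when sentence importance is additive. A 0/1-knapsack sketch, with sentence scores and lengths assumed to come from the earlier stages and all names illustrative:

```python
def select_sentences(scores, lengths, budget):
    """Pick the sentence subset maximizing total importance subject to a
    summary-length budget (0/1 knapsack by dynamic programming).
    scores: per-sentence importance; lengths: integer sentence lengths."""
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # consider sentence i-1
        for w in range(budget + 1):
            best[i][w] = best[i - 1][w]       # skip it
            if lengths[i - 1] <= w:           # or take it, if it fits
                cand = best[i - 1][w - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][w]:
                    best[i][w] = cand
    chosen, w = [], budget                    # backtrack the choices
    for i in range(n, 0, -1):
        if best[i][w] != best[i - 1][w]:
            chosen.append(i - 1)
            w -= lengths[i - 1]
    return sorted(chosen)
```

For example, `select_sentences([3.0, 2.0, 4.0], [5, 4, 6], 10)` returns the indices of the sentence pair whose combined importance is highest within a 10-unit budget.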