Massively Parallel Near-Linear Scalability Algorithms with Application to Unstructured Video Analysis

Robert Farber and Harold Trease
Pacific Northwest National Laboratory

Acknowledgments: Adam Wynne (PNNL), Lynn Trease (PNNL), Tim Carlson (PNNL), Ryan Mooney (now at google.com)

Slide 2: Image/Video Analysis and Applications

"Have we seen this person's face before?"

Goals:
- Image/video content analysis with 1 million frames per second of processing capability (~1 TByte/sec)
- Streaming, unstructured video data represents high-volume, low-information-content data
- Huge volumes of archival data

Requirement: scalable algorithms to transform unstructured data into large sparse graphs for analysis.

This talk focuses on the Principal Component Analysis (PCA) of video signatures. The framework is generally applicable to other problems!

Video analysis has many applications:
- Face recognition (and object recognition)
- Social networks
- Many others

Slide 3: First Task: Isolate the Faces

[Figure, four panels: 1. original frame; 2. RGB-to-HSI conversion; 3. Sobel edge detection; 4. skin pixels only. The bottom row contains frames of skin-pixel patches that identify the three faces in this frame.]

Slides 4-6: Workflow

- Archival data: YouTube (huge!); 10k cameras = ~300,000 fps = ~300 GB/sec
- Split the video into frames and calculate entropic measures (signatures)
- The first steps are embarrassingly parallel:
  1. Split the video into separate frames
  2. Calculate the signature of each frame and write it to file
- Algorithms: PCA, NLPCA (nonlinear PCA), MDS (multidimensional scaling), clustering, others
- Outcomes:
  1. Frames/faces are separable
  2. Faces form trajectories
  3. Build a face database
  4. Derive social networks

Slide 7: Working with Large Data Sets (think BIG: 10^8 signatures and greater)

- Formulate PCA (NLPCA, MDS, and others) as an objective function
- Use your favorite solver (Conjugate Gradient, ...)
- Map to massively parallel hardware (SIMD, MIMD, SPMD, etc.): Ranger, NVIDIA GPUs, others
- Massive parallelism is needed to handle large data sets:
  - 10,000 video cameras = ~300,000 fps = ~300 GB/sec
  - Consider all of YouTube as a video archive
  - Our Supercomputing 2005 data set = 2.2M frames
  - Our test YouTube dataset consisted of over 22M frames

Slide 8: Formulate PCA as an Objective Function

energy = func(p1, p2, ..., pn)

Calculate the PCA by passing information through a bottleneck layer in a linear feed-forward neural network:
- Oja, Erkki (1982). "Simplified neuron model as a principal component analyzer". Journal of Mathematical Biology 15(3): 267-273.
- Sanger, Terence D. (1989). "Optimal unsupervised learning in a single-layer linear feedforward neural network". Neural Networks 2(6): 459-473.

Use your favorite solver (Conjugate Gradient, ...):
- William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. "Numerical Recipes in C: The Art of Scientific Computing". Cambridge University Press, 1993.
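As a sketch of how the bottleneck formulation turns PCA into an objective function, the following C fragment (illustrative names, not the authors' code; the bias/offset terms of the network are omitted for brevity) computes the energy as the summed squared reconstruction error of a linear autoencoder with an nInput -> nBottleneck -> nInput topology. A solver such as Conjugate Gradient then minimizes this energy over the weights W1 and W2.

    /* Sketch: the PCA objective as a linear bottleneck autoencoder.
       energy = sum_i || x_i - W2 * (W1 * x_i) ||^2
       Minimizing over W1 and W2 recovers the principal subspace
       (Oja 1982; Sanger 1989).  All names are illustrative. */
    #include <stddef.h>

    double pca_energy(const double *x,   /* nExamples x nInput, row-major */
                      size_t nExamples, size_t nInput, size_t nBottleneck,
                      const double *W1,  /* nBottleneck x nInput */
                      const double *W2)  /* nInput x nBottleneck */
    {
        double energy = 0.0;
        for (size_t e = 0; e < nExamples; ++e) {
            const double *xi = x + e * nInput;
            double h[16];                /* assumes nBottleneck <= 16 */

            /* Forward pass into the bottleneck: h = W1 * xi */
            for (size_t b = 0; b < nBottleneck; ++b) {
                double s = 0.0;
                for (size_t j = 0; j < nInput; ++j)
                    s += W1[b * nInput + j] * xi[j];
                h[b] = s;
            }
            /* Accumulate the squared reconstruction error ||xi - W2*h||^2 */
            for (size_t j = 0; j < nInput; ++j) {
                double r = 0.0;
                for (size_t b = 0; b < nBottleneck; ++b)
                    r += W2[j * nBottleneck + b] * h[b];
                double d = xi[j] - r;
                energy += d * d;
            }
        }
        return energy;
    }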
Slide 9: Pass the Information Through a Bottleneck

[Figure: a linear feed-forward network with nInput input units (I), a narrow bottleneck layer (B), and nInput output units (O); the parameter vector P consists of the layer weights plus an offset term (Poffset).]

Slide 10: Map to Massively Parallel Hardware

- Large data sets require a parallel data load to deliver the necessary bandwidth.
- Use Lustre because it scales: PNNL achieved 136 GB/s sustained read and 86 GB/s sustained write.
- Data is partitioned equally across all processing cores (Core 1: examples 1..N; Core 2: examples N+1..2N; Core 3: examples 2N+1..3N; Core 4: examples 3N+1..4N; and so on).

1. Broadcast the filename plus the data size and file offset to each MPI client.
2. Each client opens the data file, seeks to its location, and reads the appropriate data.

Slide 11: Evaluate the Objective Function in a Massively Parallel Manner

The optimization routine (Powell, Conjugate Gradient, or another method) repeatedly evaluates the objective function Energy = func(P1, P2, ..., PN). Each evaluation proceeds in three steps (see the sketch after slide 16):
- Step 1: Broadcast the parameters P to every core (scales with the size of P).
- Step 2: Each core calculates a partial energy over its own examples (scales with data/Nproc).
- Step 3: Sum the partial energies with an O(log2(Nproc)) reduction.

[Figure: Cores 1-4 each hold a copy of P1, P2, ..., PN and their own slice of the examples.]

Slide 12: Report Effective Rate

EffectiveRate = TotalOpCount / (TbroadcastParam + Tfunc + Treduce)

Every evaluation of the objective function requires:
- Broadcasting a new set of parameters
- Calculating the partial sum of the errors on each node
- Obtaining the global sum of the partial sums of the errors

Treduce is highly network dependent: low bandwidth and/or high latency is bad!

Slide 13: Very Efficient and Near-Linear Scaling on Ranger

[Plot: "Ranger Scaling as a Function of Number of Cores"; effective rate (GF/s, axis to 80,000) versus number of cores (0 to 14,000).]

Note: 32k- and 64k-core runs will occur when possible.

Slide 14: The Reduce Operation Does Affect Scaling

[Plot: "Performance per Core (GF/s) for Various Run Sizes"; GF/s per core (axis to 8.00) versus number of cores (0 to 14,000).]

Slide 15: Objective Function Performance Scaling by Data Size on Ranger
(Synthetic benchmark with no communications)

Without prefetch:
- Achieved 8 GF/s per core using SSE
- Interesting performance segregation
[Plot: "16-way Performance vs. Data Size Without Prefetch"; GF/s versus number of 80-byte examples (0 to 1,200,000).]

With prefetch:
- Achieved nearly 8 GF/s per core
- Bizarre jump at 800k examples
[Plot: "16-way Performance vs. Data Size With Prefetch"; GF/s versus number of 80-byte examples (0 to 1,200,000).]

Slide 16: Most Time (>90%) Is Spent in the Objective Function When Solving the PCA Problem

Cores | Calls to the objective function | % wall-clock time in optimization routine | % wall-clock time in objective function
256   | 1,173,533                       | 91.11%                                    | 90.89%
512   | 1,142,588                       | 90.83%                                    | 90.44%
2048  | 1,141,565                       | 93.79%                                    | 92.55%
4096  | 1,142,557                       | 95.08%                                    | 93.22%

Note: data sizes were kept constant per node, which meant each trial trained on different data.
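A minimal MPI sketch of the three-step evaluation cycle on slide 11 (assumed function and variable names, not the original Ranger code): rank 0's optimizer proposes a parameter vector, every rank evaluates the partial energy on its own slice of the examples, and an allreduce produces the global energy.

    /* Sketch of one parallel objective-function evaluation.
       pca_energy() is the bottleneck objective sketched after slide 8;
       all names are illustrative. */
    #include <stddef.h>
    #include <mpi.h>

    double pca_energy(const double *x, size_t nExamples, size_t nInput,
                      size_t nBottleneck, const double *W1, const double *W2);

    double parallel_energy(double *P, int nParam,
                           const double *localX, size_t localExamples,
                           size_t nInput, size_t nBottleneck)
    {
        /* Step 1: broadcast the current parameters (cost scales with |P|). */
        MPI_Bcast(P, nParam, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Step 2: each core evaluates its own slice of the examples
           (cost scales with data / Nproc). */
        const double *W1 = P;                         /* nBottleneck x nInput */
        const double *W2 = P + nBottleneck * nInput;  /* nInput x nBottleneck */
        double partial = pca_energy(localX, localExamples,
                                    nInput, nBottleneck, W1, W2);

        /* Step 3: O(log2 Nproc) reduction sums the partial energies. */
        double energy = 0.0;
        MPI_Allreduce(&partial, &energy, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        /* Rank 0 hands the energy back to the optimization routine
           (Powell, Conjugate Gradient, ...). */
        return energy;
    }

In this sketch the reduction carries a single scalar, so the Treduce term of the effective-rate formula on slide 12 is sensitive mainly to network latency, which is consistent with the slide's warning that low bandwidth and/or high latency hurts scaling.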
Slide 17: The Mapping Works with Other Problems (and Architectures)

A SIMD version has been used by Farber since the early 1980s on 64k-processor Connection Machines (and other SIMD, MIMD, SPMD, vector, and cluster architectures):
- R. M. Farber, "Efficiently Modeling Neural Networks on Massively Parallel Computers", Los Alamos National Laboratory Technical Report LA-UR-92-3568.
- Kurt Thearling, "Massively Parallel Architectures and Algorithms for Time Series Analysis", in 1993 Lectures in Complex Systems, edited by L. Nadel and D. Stein, Addison-Wesley, 1995.
- Alexander Singer, "Implementations of Artificial Neural Networks on the Connection Machine", Technical Report RL90-2, Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142, January 1990.

Many different applications aside from PCA:
- Independent Component Analysis
- K-means
- Fourier approximation
- Expectation Maximization
- Logistic regression
- Gaussian Discriminant Analysis
- Locally weighted linear regression
- Naïve Bayes
- Support Vector Machines
- Others

[Plot: "Performance of NVIDIA CUDA-enabled GPUs" (courtesy of NVIDIA Corp.); effective rate (GF/s, axis to 1,200) versus number of GPUs (0 to 16).]

Slide 18: PCA Components Form Trajectories in 3-Space

1. The trajectories are separable, so we can build a face database:
   - Different faces form separate tracks
   - The same face is continuous across cameras
2. Multiple faces extracted from individual frames can be used to infer social networks!

Slide 19: Preliminary Results Using PCA

- Public ground-truth datasets are scarce; this is work in progress.
- PCA was the first step (funding was limited); NLPCA, MDS, and other methods promise to increase accuracy.
- Using the Euclidean distance between points as a recognition metric:
  - 99.9% accuracy on one data set
  - 2 false positives in a 2k database of known faces
  - Each face in the database was compared against the entire database as a self-consistency check.
- Social networks have been created and are being evaluated; again, ground-truth data is scarce.

Slide 20: Summary: High-Performance Video Analysis

[Figure: SC05 videos and streaming video feed the face database; social network graphs are built from the face data and face DB; partitioning the face-based graphs discovers relationships.]

Slide 21: Two Video Examples
(In conjunction with Blogosphere text analysis by Michelle Gregory and Andrew Cowell)

- 351 videos, ~3.6 million frames, ~4.4 TBytes (each point is a video frame, each color is a different video; the coordinates are the PCA projection of an N-dimensional feature vector into 3-D)
- 512 YouTube videos, ~22.6 million frames, ~5.2 TBytes

Slide 22: Connecting the Points and Forming the Sparse Graph Connectivity for Analysis

- Delaunay/Voronoi mesh: shows the mesh connections where "points" are connected by "edges", which equals a graph.
- Adjacency matrix: each row represents a frame; the columns represent the connected frames (see the sketch at the end of the deck).
- The clusters and the social network define how one frame (face) relates to another.

Slide 23: Graph Partitioning (using Voronoi/Delaunay mesh-connected graphs)

[Figure: point distribution; Delaunay/Voronoi mesh; adjacency before partitioning; adjacency after partitioning; partitioned mesh.]

Slide 24: Classification, Characterization, and Clustering of High-Dimensional Data
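As a concrete companion to slides 22-23, the sketch below builds a sparse frame graph from 3-D PCA coordinates. For brevity it connects frames that lie within a fixed Euclidean distance eps of each other (an epsilon-ball graph) rather than performing the Delaunay/Voronoi construction the slides describe; eps and all names are illustrative.

    /* Sketch: connect frames whose 3-D PCA coordinates lie within a
       Euclidean distance eps, printing one adjacency-matrix row per
       frame (cf. slide 22).  This epsilon-ball graph is a stand-in
       for the Delaunay/Voronoi construction on the slides. */
    #include <stdio.h>

    #define DIM 3

    static double dist2(const double *a, const double *b)
    {
        double s = 0.0;
        for (int d = 0; d < DIM; ++d) {
            double t = a[d] - b[d];
            s += t * t;
        }
        return s;
    }

    static void build_graph(const double (*pca)[DIM], int nFrames,
                            double eps)
    {
        double eps2 = eps * eps;
        for (int i = 0; i < nFrames; ++i) {
            printf("frame %d:", i);       /* row of the adjacency matrix */
            for (int j = 0; j < nFrames; ++j)
                if (j != i && dist2(pca[i], pca[j]) < eps2)
                    printf(" %d", j);     /* column = connected frame    */
            printf("\n");
        }
    }

    int main(void)
    {
        /* Two tight clusters -> two connected pairs. */
        double pts[4][DIM] = { {0.0, 0.0, 0.0}, {0.1, 0.0, 0.0},
                               {5.0, 5.0, 5.0}, {5.1, 5.0, 5.0} };
        build_graph(pts, 4, 0.5);
        return 0;
    }

The brute-force O(n^2) pass is only for illustration; at the 10^8-signature scale discussed earlier, a mesh- or neighbor-search-based construction is what keeps building the sparse graph tractable.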