Matrix and Graph • Matrix • Binary Matrix • Sparse Matrix • Operations for Vectors/Matrices • Graph and Adjacent Matrix • Adjacent List Matrix and Graph • Matrix is a 2-dimensional structure • Used in wide areas from physical simulations to customer management • Graphs are also used in many areas, to represent the relations and flows between data • Some data structures have been considered to handle matrix and graph; update, preserve, search, and operate 2-Dimensional Structure of Matrix • An n×m matrix has n×m numbers can be stored in an array of size n×m [i,j] element corresponds to the i*m+j th cell of the array • A naïve design is done, but there are something more 2-Diemnsaional Array • There is a way to make 2-dimensional array, instead of usual 1-dimensional array • Prepare an array of pointers of size n • Prepare n arrays of size m, and write the place of the first cell of i-th array to the i-th cell of the pointer array • [i,j] element of matrix a is accessed by a[i][j] (in C) Simple structure O(nm) memory space Allocate a 2-Dimensional Array int *MATRIX_alloc ( int n, int m ){ int i, **a, flag =0; a = malloc ( sizeof(int *)*n ); if ( a=NULL ) return (NULL); for ( i=0 ; i<n ; i++ ){ a[i] = malloc (sizeof(int)*m); if ( a[i] = NULL ) flag = 1; } if ( flag == 1 ) return (NULL); else return (a); } int *MATRIX_free ( int **a ){ int i; for ( i=0 ; i<n ; i++ ) free ( a[i] ); } Binary Matrix • A binary matrix is a matrix all whose cells are either 0 or 1 + each cell is either ○ or × + adjacency matrix of a graph, shown later • Space consuming if use one integer for one 01 value (1 bit) motivated to compress the matrix 01010 10001 11110 11000 Representation by Bits • A row composed of 01 values can be considered as a big integer by chopping into some integers of 32 bits (or 64 bits), the integer becomes tractable └m/32┘ integers are sufficient to store a row (space efficiency also increases, and also cache efficiency) • [i,j] element can be accessed by looking at the (j%32)th bit of the j/32 th integer in the i-th row Handling Bit Access • [i,j] element can be accessed by looking at the (j%32)th bit of the j/32 th integer in the i-th row … writing a code is bothering • Prepare an array BIT_MASK[]= {1,2,4,8,16,…} BIT_MASK_[]= {0xfffffffe, 0xfffffffd, 0xfffffffb, …} + read value: a[i][j/32] & BITMASK[j%32] + set to 1 : a[i][j/32] = a[i][j/32] | BITMASK[j%32] + set to 0 : a[i][j/32] = a[i][j/32] & BITMASK_[j%32] Sparse Matrix • That’s all, for structures for simple matrices • Space efficiency is in some sense optimal • But, in application, it is often not sufficient/efficient for example, if matrix is sparse, many parts are redundant • Sparse matrix has the same value in many cells (usually 0) • Sparse matrix should be stored by memorizing the places with non-zero values Storing Sparse Matrix • Let’s begin from binary matrix, for simplicity almost cells are 0, and few 1’s • A simple idea is to make a list of the places of the cells being 1 • That is, memorize (x1,y1),(x2,y2),(x3,y3),… , store the row ID and column ID of the cells being 1 • The memory requirement is “twice the number of 1’s” this is very efficient if there are few 1’s (sparse) • But, bad accessibility; to read a cell, we have to scan all (binary tree / hash can be used) Store Row-wise • Let’s have a structure to improve the accessibility • Classify the places of 1’s according to their row ID prepare n arrays, and store the column ID of 1’s in i-th row, in the ith array • We need to have n pointers to n arrays, but we don’t have to store the row ID’s, thus memory efficiency increases • The memory requirement is “# 1’s + #rows×2” (can be “# 1’s + #rows”) • Accessibility is good; sorting ID’s in a row array, binary search works (linear scan is enough, if few column ID’s) Structure in Each Row • In sparse cases, the efficiency is increased, However, the update concerned with insertion/deletions is not efficient • They are the same, in the situation of stacks and queues • So, according to the purpose, we use lists bucket/hash/binary tree for structures in a row (having n arrays is equivalent to having buckets) Real World Data • The characteristics of sparse matrices in practice are; + Matrix representing mesh network (structural calculation) few meshes are adjacent to one, in geometrical sense, thus not so many non-zeros per row array is sufficient for structures of rows + Road network data (adjacency of cross points + distance) almost the same, but update comes sometimes (would be sufficient if (re-)allocate bit larger memory) Real World Data (2) Ex) A matrix representing, row text, column word, a cell is one if the word is included in the text, is sparse, usually (POS data, Web links, Web surfing, etc.) + on average, #1’s in a row/column is constant, but some have so many (texts having many words, words included in many texts) + distribution of 1’s is that so called power (zip) law, scale free; #of items of size D is proportional to 1 / ΔD can be often seen in real world data (≠ geometric distribution) + Such data needs algorithms designed so that the dense part will not affect badly; will be the bottle neck of the computation Non-binary Sparse Matrix • Usual matrix are of course non-binary, it is not sufficient to remember the places having non-zero value remember (place, value) • In the case of using array, (place, value), (place, value), (place, value),…, or place1, plcae2,…, value1, value2,… • In the case of lists of binary tree, assign (place, value) to each cell/node or, simple prepare two of them Exercise • Make data representing the following matrix in a sparse way 0,0,1,4,0 0,1,0,0,5 2,0,0,0,0 1,2,5,0,2 0,0,0,0,0 Column: Memory Saving for Matrix • Buckets, or a row of a sparse matrix needs two data (pointer to the first cell, and the size ki) • We decrease these from two to one • First, prepare an array of size equal to # non-zero cells. Then, + 0th row uses the cells of the array ranging from 0 to k0-1 + 1st uses from k0 to k0+k1-1 … + i-th row uses from k0+…+ki-1 to k0+…+ki-1, and we remember only the start positions of the rows • The size of i-th row can be obtained by (start position of i+1) - (start position of i) Matrix Operation • Basic matrix operations are addition and multiplication (inner product of vectors is a special case) Further, AND and OR for binary matrix • Algorithms for the operations are trivial if the matrices are in the form of 2-dimensional array However, not clear if they are in sparse forms • Further, there are several structures that have advances for matrix operations Addition of Matrix • For the addition, it is sufficient to have algorithms for additions of each row (so, operations of vectors are sufficient) • First, we see the case of inner product of sparse vectors Inner Product • For computing inner product of two sparse vectors, the difficulty is that we have to find the cell corresponding to each • Sort the cells in each vector according to their column ID • Scan two vectors simultaneously, from smaller indices “simultaneously” means that iteratively pick up the smallest column ID among the two vectors • When we find a column ID at which both vector have non-zero values, accumulate the product of the cells 1 5 5 1 7 3 1 1 3 3 5 4 A Code for Sparse Inner Product int SVECTOR_innerpro (int *va, int ta, int *vb, int tb){ int ia=0, ib=0, c=0; while ( ia<ta && ib<tb){ if (va[ia*2] < vb[ib*2] ) ia++; else if (va[ia*2] > vb[ib*2] ) ib++; else { c = c + va[ia*2+1]*vb[ib*2+1]; ia++; ib++; } } return ( c ); } 1 5 5 1 7 3 2 1 3 3 5 4 Addition of Two Vectors • The addition can be done in a similar way • Sort the cells in each vector according to their column ID • Scan two vectors simultaneously, from smaller indices • The positions of non-zero values in the resulted vectors are those having non-zero values in one of two vectors, thus can be easily identified by the scan 1 5 5 1 7 3 2 1 3 3 5 4 A Code for Addition int SVECTOR_add (int *vc, int *va, int ta, int *vb, int tb){ int ia=0, ib=0, ic=0, c, cc; while ( ia<ta || ib<tb){ if (ia == ta ){ c = vb[ib*2+1]; cc = vb[ib*2]; ib++; } else if ( ib == tb ){c = va[ia*2+1]; cc = va[ia*2]; ia++; } else if (va[ia*2] > vb[ib*2] ) { c = vb[ib*2+1]; cc = vb[ib*2]; ib++; } else if (va[ia*2] < vb[ib*2] ) { c = va[ia*2+1]; cc = va[ia*2]; ia++; } else { c = va[ia*2+1] + vb[ib*2+1]; cc = vb[ib*2]; ia++; ib++; } vc[ic*2] = cc; vc[ic*2+1] = c; ic++; } return ( ic ); } 1 5 5 1 7 3 2 1 3 3 5 4 Column: Endmarks do a Good Job! • Compared to inner product, code for addition is relatively long we have exceptions at the end of the array • So, we are motivated to simplify the code by using “endmark” (endmark is a symbol that represent the end of the array, or something else representing the end) • 0, -1 or a very large value is used as an endmark • We prepare an additional cell next to the end of each array, and put an endmark at the cell Column: Endmarks do a Good Job! (2) int SVECTOR_innerpro (int *vc, int *va, int ta, int *vb, int tb){ int ia=0, ib=0, ic=0, c, cc; while ( va[ia*2] != ENDMARK && vb[ib*2] != ENDMARK){ if (va[ia*2] > vb[ib*2] ) { c = vb[ib*2+1]; cc = vb[ib*2]; ib++; } else if (va[ia*2] < vb[ib*2] ) { c = va[ia*2+1]; cc = va[ia*2]; ia++; } else { c = va[ia*2+1] + vb[ib*2+1]; cc = vb[ib*2]; ia++; ib++; } vc[ic*2] = cc; vc[ic*2+1] = c; ic++; } vc[ic*2] = ENDMARK; return ( ic ); 1 5 5 1 7 3 ■ } 2 1 3 3 5 4 ■ Matrix Multiplication • For sparse matrix multiplication, compute the inner products of all the pairs of a row and a column • However, a sparse matrix has row representations but not column representations, getting column vectors is hard • A simple solution is to use transposing algorithm that is explained in the section of bucket; we will have column representation • On the other hand, some data structures are designed to be enabled to trace also columns Four-Direction List • Lists are good at storing sparse vectors, for tracing • However, collection of lists isn’t good at tracing column vectors, because the cells are not connected vertically • …so, let’s have a list connected in both row direction and column direction • Each cell has four arms, that point the neighboring cells in directions of (←, →, ↑, ↓) 4 7 2 Pointing the Neighbors • Links to four directions seems to form a mesh network, but not • …since, the links can cross • In the other words, this structure can be seen as a superimpose of two kinds of lists; horizontal direction and vertical direction, and the identical cells are unified into one 4 4 4 7 2 4 Having Lists of 2-Directions • If we have lists of row vectors and column vectors both, we can have the same accessibility, but insertions/deletions are not same • For example, when we want to delete a cell in a row vector, we would take long time to find the corresponding cell in column lists In four-direction lists, they are already unified 4 4 7 Graph Structure • A graph is a structure composed of a set of vertices and a set of edges (an edge is a pair of vertices) • Formed by sets, so the information such as positions, shapes, and crossing edges do not matter, when it is drawn as a picture (a graph with shape/position information is called “graph visualization” or “embedded graph”) • When edges have directions (from one vertex to another), it is called directed very popular structure Examples of Graph Data • Adjacency relation Hierarchy in an organization Similarity relation • Web network, human network, SNS friend network,… Graph Terminology • Edge e is said to be incident to u, v, and vice versa, if e = (u,v) also u and v are said to be adjacent • The #edges incident to v is the degree of v • A graph having edges for any two vertices is a complete graph • When there are two or more edges connecting two vertices, the edges are called multiple edges • If there is a partition of vertices so that any edge connects a vertex in a group and one in the other, the graph is called bipartite graph Storing a Graph • n vertices can be seen as numbers 0,…,n-1 • Then, an edge is a pair of numbers can be stored by writing the pairs in array, lists, etc. • Further, we need something for the accessibility for example, we often visit a vertex, and go to the neighboring vertex, and so we need to scan all edges incident to the vertex Using Matrix • The set of edges can be represented by a matrix as follows ① j-th row/column corresponds to vertex j, and ij-cell is 1 if there is edge (i, j) (called adjacency matrix) + efficient for dense graph having many edges + multiplicity of edges can be represented by the value of a cell ② j-th row corresponds to vertex j, and each column corresponds to an edge; when edge e is incident to vertex i, ij cell is 1 (called incidence matrix) + multiple edges represented easily • Sparse matrix representation has advantage for incidence matrix and sparse graph In Practice • 2-dimensional array is sufficient when the matrix size is small the cost is small, redundancy is small • Sparse matrix such as 100 by 100 with 10 non-zero elements in a row, sparse representation will be efficient (approximately, when density is less than 10%) + When we often want to scan non-zero elements, such as tracing all vertices adjacent to a vertex, sparse representation is useful + If we want to check whether there is an edge between two specified vertices, 2-dimensional array has advantage Incidence Matrix • An incidence matrix represents the incidence relation between vertices and edges • Put indices from 0,…,n-1 to vertices, and 0,…,m-1 to edges + store edges incident to a vertex to the corresponding row = storing vertices incident to an edge in 0: 0,1 the corresponding column 1: 1,5 0 1 5 5 4 0 2 6 1 3 4 2 8 3 7 0: 1,3 1: 0,2,4,5 2: 1,3,4,5 3: 0,2 4: 1,2,5 5: 1,2,4 0: 0,2 1: 0,1,3,4 2: 4,6,7,8 3: 2,7 4: 3,5,8 5: 1,5,6 + 2: 0,3 3: 1,4 4: 1,2 5: 4,5 6: 2,5 7: 2,3 8: 2,4 Advantage of Incidence Matrix • In the case of incidence matrix, each edge has ID so, easy to handle the attached information to each edge just allocate an array of size m, and it is sufficient • In the case of adjacency matrix, edge doesn’t have ID, thus not easy to manage correspondence of edge and its data • Multiple edges are also easy to handle 0 1 5 5 4 0 2 6 1 3 4 2 8 3 7 0: 1,3 1: 0,2,4,5 2: 1,3,4,5 3: 0,2 4: 1,2,5 5: 1,2,4 0: 0,2 1: 0,1,3,4 2: 4,6,7,8 3: 2,7 4: 3,5,8 5: 1,5,6 + 0: 0,1 1: 1,5 2: 0,3 3: 1,4 4: 1,2 5: 4,5 6: 2,5 7: 2,3 8: 2,4 Allocate Memory for Cells • Incidence matrix can be realized by cells of lists having four links like sparse matrix (two for vertices of the edges, and two for the edges in the vertex) disadvantages of arrays are eliminated • Also can be of two array lists • or, prepare an array and edge i corresponds to 0: 0,1 cells 2i and 2i+1, to represent four links 1: 1,5 0 1 5 5 4 0 2 6 1 3 4 2 8 3 7 0: 1,3 1: 0,2,4,5 2: 1,3,4,5 3: 0,2 4: 1,2,5 5: 1,2,4 0: 0,2 1: 0,1,3,4 2: 4,6,7,8 3: 2,7 4: 3,5,8 5: 1,5,6 + 2: 0,3 3: 1,4 4: 1,2 5: 4,5 6: 2,5 7: 2,3 8: 2,4 Exercise • Make an adjacency matrix of the following graph, and that in A sparse incidence matrix 6 0 5 1 4 2 3 Bipartite Graph • A bipartite graph is often seen as a representation of a (binary) (sparse) matrix associate nodes of one group to rows, and the others to columns connect by edges between vertices corresponding a cell with nonzero value • A representation of different style 0 4 1 5 2 6 3 0: 4,6 1: 4,5 2: 5,6 3: 5,6 Column: Store Huge Graph • A graph needs two pointer (or integer) per edge weight, and etc. need more • 64 bits are required in 32 bit CPU • However, Web graphs have billion of vertices, and 20 billions of edges 160GB is necessary in this way • This is too much. Can we reduce the storage size? Column: Store Huge Graph (2) ① Only few edges have large degrees Vertices are mainly adjacent to these few vertices Put indices so that large degree vertices have small indices, and represent small indices by small number of bits, and large indices by many bits Ex.) • If the bit sequence representing a number begins with “0”, the following 7 bits represent [0-127] • If “10”, the following 14 bits represent 128+[0-16383] • If “11”, the following 30 bits represent 16384+128,… Column: Store Huge Graph (3) ② Sort the sites in dictionary order of their URLs links are usually to near, thus difference of ID’s becomes small • They can be recorded in the same way, to reduce the space • Using these, one edge needs just 10 bits Further, we can reduce it to 5 bits The storage will be 20GB, thus can fit recent computers Summary • Data structures for matrix • Structures for sparse matrix, and four directed lists • Structures for graphs: adjacency matrix and incidence matrix adjacency list