A Draft White Paper Comparing SVM and FAUST Classifiers

In this paper we present a rough comparison of Support Vector Machine (SVM) classifiers and Functional Analytic Unsupervised and Supervised Technology (FAUST) classifiers. The first big difference is that SVM uses horizontally structured data and processes it vertically, while FAUST uses vertically structured data (down to the bit-slice level) and processes it horizontally. Big data usually means many rows (trillions) and comparatively few columns (tens, hundreds, or thousands). Since FAUST loops over (bit slices of) columns while SVM loops over rows, FAUST requires many orders of magnitude fewer loop passes than SVM. And even though each FAUST pass involves trillions of bits while each SVM pass involves only tens, hundreds, or thousands of values, this does not offset the difference in the number of passes, because massive bit strings are processed on the metal, and such processing has become extremely fast (e.g., with GPUs instead of CPUs).

The following short description of SVM closely resembles the description in Wikipedia. Given a set of training examples, each belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. SVM represents points in space so that the examples of the separate categories are divided by a clear linear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional spaces. To use SVM for 1-class classification, the kernel trick is required, since a single class is almost never linearly separable from its complement. FAUST does not have this shortcoming.

More formally, SVM constructs a set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the error of the classifier. Whereas the original problem may be stated in a low-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original low-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier there. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products can be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function k(x, y) selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters α_i, of images of feature vectors x_i that occur in the database. With this choice of hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation ∑_i α_i k(x_i, x) = constant. Note that if k(x_i, y) becomes small as y grows further away from x_i, each term in the sum measures the degree of closeness of the test point y to the corresponding database point x_i. So the sum of kernels can be used to measure the relative nearness of a test point to the data points originating in one or the other of the sets to be discriminated.
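As a rough illustration of this kernel-sum view (a sketch for this draft, not drawn from any particular SVM implementation), the following Python snippet scores a test point against two toy classes with an RBF kernel. The helper names rbf_kernel and kernel_sum_score, the gamma value, and the toy points are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: k(x, y) shrinks toward 0 as y moves away from x."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_sum_score(test_point, class_points, gamma=0.5):
    """Sum of kernel values: each term measures the closeness of
    test_point to one training point of the class."""
    return sum(rbf_kernel(x, test_point, gamma) for x in class_points)

# Toy training data: two classes in R^2
class_a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
class_b = np.array([[3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])

y = np.array([0.3, 0.2])                      # test point
score_a = kernel_sum_score(y, class_a)
score_b = kernel_sum_score(y, class_b)
print("nearer to class A" if score_a > score_b else "nearer to class B")
```

Note that each score requires a kernel evaluation against every training point, which is exactly the cost concern raised next.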
For Big Data, if one has to use a kernel, the computational cost (time) may become prohibitive, whether the data is structured horizontally or vertically.

(Figure: H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximum margin.)

Functional Analytic Unsupervised and Supervised Technology (FAUST) classifiers work for 1-class, 2-class and multi-class classification directly. The idea is to "house" or circumscribe each class in a tight hull which is constructed so that it is very easy to determine whether a point lies in the hull or not. For convex classes, the convex hull is mathematically the tightest hull possible. It turns out that the recursively constructed FAUST hulls are often even better than the convex hull when the class is non-convex. FAUST class hulls are piecewise hulls in the sense that they are made up of a series of pairs of "boundary pieces" ((n-1)-dimensional pieces, if the space is n-dimensional) that fit up against the class tightly. These boundary pairs are linear when using the dot-product functional, L; round when using the spherical functional, S; and tubular when using the radial functional, R. These functionals are defined as follows.

Let X = X(X_1, ..., X_n) ⊆ R^n with |X| = N, and let the classes be {C_1, ..., C_K}. Let d = (d_1, ..., d_n) and p = (p_1, ..., p_n) be in R^n with |d| = 1, and let ∘ denote the dot product. Then L_d ≡ X∘d is a single column of numbers (bit sliced), and so are

L_{d,p} ≡ (X − p)∘d = X∘d − p∘d = L_d − p∘d,
S_p ≡ (X − p)∘(X − p) = X∘X + X∘(−2p) + p∘p,
R_{d,p} ≡ S_p − L_{d,p}^2.

The FAUST classifier then assigns y to class C_k iff y is in the hull of class C_k:

y ∈ C_k iff y ∈ H_k ≡ { z | Fmin_{d,p,k} ≤ F_{d,p}(z) ≤ Fmax_{d,p,k} for every (d,p) in dpSet },

where F ranges over L, S and R (for F = L, F_{d,p}(z) = (z − p)∘d). Fmin_{d,p,k} and Fmax_{d,p,k} are the minimum and maximum values in the respective functional column restricted to class C_k. dpSet is a set of unit vectors and points, used to define projection lines in the direction d through p for the functionals. Typically dpSet would include all the standard basis unit vectors, so that L is just a column of X and requires no computation. In general, the bigger dpSet is, the tighter the hull. Once the Fmin and Fmax values have been computed (using bit-string processing on FAUST's massive vertical bit-slice structures), determining whether y is in a class C_k involves only simple numeric comparisons.

(Figure: some typical FAUST boundary pieces, linear only.)
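To make the hull construction and membership test concrete, here is a small Python sketch. It is an illustration only, not the pTree-based FAUST implementation: it uses only the linear functional L with the standard basis vectors as dpSet, stores the training data as ordinary horizontal NumPy arrays rather than vertical bit slices, and the helper names build_hulls and classify are assumptions of this sketch.

```python
import numpy as np

def build_hulls(X, labels, dp_set):
    """For each class k and each (d, p) in dp_set, record (Fmin, Fmax) of the
    linear functional L_{d,p}(x) = (x - p).d over the training points of class k.
    (Real FAUST computes these min/max values from vertical bit slices/pTrees.)"""
    hulls = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        intervals = []
        for d, p in dp_set:
            vals = (Xk - p) @ d                     # column of L_{d,p} values
            intervals.append((d, p, vals.min(), vals.max()))
        hulls[k] = intervals
    return hulls

def classify(y, hulls):
    """Assign y to every class whose hull contains it: just a series of
    numeric comparisons, one (Fmin, Fmax) pair per (d, p)."""
    hits = []
    for k, intervals in hulls.items():
        if all(fmin <= (y - p) @ d <= fmax for d, p, fmin, fmax in intervals):
            hits.append(k)
    return hits          # may be empty (1-class "other") or contain several classes

# Toy example in R^2: dpSet = standard basis unit vectors through the origin
dp_set = [(np.array([1.0, 0.0]), np.zeros(2)),
          (np.array([0.0, 1.0]), np.zeros(2))]
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
hulls = build_hulls(X, labels, dp_set)
print(classify(np.array([1.5, 1.5]), hulls))   # -> [0]
print(classify(np.array([5.0, 5.0]), hulls))   # -> [] (outside both hulls)
```

The S and R functionals would add spherical and tubular boundary pairs in the same way, each contributing one more (Fmin, Fmax) comparison per class.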
Some advantages of FAUST (over SVM) include:

1. FAUST scales well to very large datasets, both in cardinality and in dimensionality.
2. No translation or kernelization is ever required.
3. Building the hull models of each class is fast using pTree technology.
4. Applying the model is very fast (requiring only a series of numeric comparisons).
5. Elimination of false positives can always be done by adding more boundary pieces (using more (d,p) pairs).
6. 1-class classification and multi-class (k-class) classification are done the same way.
7. The model-building phase is highly parallelizable.
8. The model can be incrementally improved even while it is in use (by adding more boundary pairs).
9. If the training set is suspected to be somewhat uncharacteristic or inadequate (in terms of faithfully representing the classes), each boundary pair can be moved to the first and last Precipitous Count Change in the functional instead of to the min and max values (see the sketch after this list). This will eliminate outlier values and may provide a more accurate model (this corresponds to Lagrange noise elimination in SVM).
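To illustrate item 9, the following sketch is an assumption-laden illustration rather than FAUST's published procedure: it replaces the raw min and max of one functional column with cuts at the first and last "precipitous" count change in a histogram of that column. The bin count and jump_ratio threshold are arbitrary choices made for the example.

```python
import numpy as np

def trimmed_interval(vals, bins=32, jump_ratio=0.25):
    """Sketch of the Precipitous Count Change idea: histogram the functional
    values of one class and place the lower/upper cut at the first/last bin
    whose count jumps to at least jump_ratio of the peak count, instead of at
    the raw min and max.  bins and jump_ratio are illustrative assumptions."""
    counts, edges = np.histogram(vals, bins=bins)
    threshold = jump_ratio * counts.max()
    big = np.nonzero(counts >= threshold)[0]
    lo_bin, hi_bin = big[0], big[-1]
    return edges[lo_bin], edges[hi_bin + 1]

# One class's functional column with a few outliers at each end
vals = np.concatenate([np.random.normal(10, 1, 1000), [-50, 80, 95]])
print("raw min/max:", vals.min(), vals.max())
print("trimmed    :", trimmed_interval(vals))
```

The trimmed (Fmin, Fmax) pair then defines the boundary pair exactly as before; only the cut points change.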