Numbers serve to discipline rhetoric. Without them it is too easy to follow flights of fancy, to ignore the world as it is and to remold it nearer the heart's desire.
Ralph Waldo Emerson

Imagine:
• If every number came with its units, uncertainty, and exact meaning all attached in a “metanumber”:
• A universal standard for all numerical data,
• Instantly readable by both humans and computers,
• And there was a simple unique name for every number,
• Which name could be used as a variable name in expressions,
• With all dimensional and error analysis automatically performed,
• And all computations tracked, allowing reuse of results,
• Supporting AI, Big Data, and automated data exchange.

And Also Imagine:
• An algorithm continuously scraping all numerical data from the web and converting it to standardized MetaNumber tables
• And with a novel process each table is converted to a network
• Among entities (rows) and among properties (columns)
• From which the dominant clusters are extracted and ordered
• Using a novel agnostic cluster identification algorithm
• Then with still another algorithm the dominant clusters are linked into a single supernet spanning all numeric clusters.
• A single network spanning our entire numerical universe

MetaNumber
A Standardization of Numerical Information
Joseph E. Johnson, PhD
jjohnson@sc.edu
September 24, 2015

The Problem
• Numerical data is incomplete without:
  • Value (the number itself)
  • Units (its dimensional basis)
  • Accuracy (its numerical uncertainty)
  • Defining Metadata (its exact meaning)
• But published data cannot be read by computers because:
  • Units, uncertainty, and defining metadata indices are
  • Non-standard, scattered, buried in text, and in multiple locations
• Consequently:
  • Data must be preprocessed by humans prior to every transfer
  • A costly, time-consuming, error-prone process
  • Data exchange is tedious and AI is thwarted, esp.
for Big Data

Proposed Solution
• Design a standard for all numerical data that tightly links
  • Value, Units, Accuracy, Meaning (metadata)
• We call this structure a MetaNumber (MN)
• Design algorithms to automatically evaluate MN expressions:
  • Numerical results computed, and
  • Full dimensional analysis of units with error trapping, and
  • Automatic computation of numerical accuracy (uncertainty), and
  • Tracking of all associated metadata
• MNs must be instantly readable by humans and computers

Other Requirements
• Use the SI metric system as the foundation
• Allow unlimited attached metadata without encumbrance
• Be sufficiently flexible to encompass existing standards
• Support computational collaboration
• Be compatible with current standards in computer coding
• Track the complete computational evolution of each number
• Allow extensions to the SI base units and user-defined units
• Design as a function that can be included in other systems
• Results must be expressible in any desired units

Numerical Accuracy
• If there is no decimal then the value is treated as exact
  • Change 5.43e4 to 543e2 to make the value exact.
• If there is a decimal then it is converted to a “ufloat”
  • 5.43e4 becomes 5.43+/-0.01e4 automatically
• Accuracy is based upon the number of significant digits
• Unit conversion values and internal constants are kept as exact
• Keyed MNs and table values are treated as stated
  • > 4+5 >> 9
  • > 4.0 + 5 >> 9.00+/-0.10
  • > 345e-2 * 88 >> 303.6
  • > 3.45*88 >> 303.6+/-0.9
• If the uncertainty value is not needed, ignore it and use the value
• But the value as listed represents the result.

Units - 1
• Are variable names in mathematical expressions:
  • > 3*kg + 2.4*lb >> 4.09+/-0.05*kg
• Are single names: ouncetroy not troy ounce.
  • > 4*ouncetroy >> 0.1244139072*kg
• Are lower case with only a…z, 0-9 characters, “_” (interior):
  • mile, mach, c, m3, s_2, gallon
• No information in fonts, or superscripts or subscripts
  • Not m² but m2. Not s⁻² but s_2.

Units - 2
• No Greek, other alphabetic or special symbols:
  • ohm not Ω, hb not ħ, pi not π.
• Only lower case for all characters:
  • kelvin or k not Kelvin, ampere or a not Ampere
• Use singular spelling and not plural:
  • foot not feet, mile not miles, gallon not gallons
• Are all defined in terms of the basic SI units:
  • meter, kilogram, second, ampere, kelvin, candela
  • Or abbreviated as m, kg, s, a, k, cd

Units - 3
• All prefixes are separate words, compounded any way:
  • > micro*thousand*nano*dozen*ft >> 3.6576e-12*m
• Allows metric, English, and common dimensionless prefixes
• All results are in SI (metric) units by default
  • > 4*m * 8*m >> 32.0*m2
• Results in other units using: (expression) ! (units desired):
  • > 4*m * 8*m ! acre >> 0.00790734057749*(acre)
  • > c ! (mile/week) >> 1.12663593737e+11*((mile/week))
• Dimensional errors are trapped: 3*kg + 4*m gives error

Units - 4 - Past Results (history)
• Results are sequentially numbered for each user, 1,2,…
• Past results cannot be erased
• A user can use a past result as hist(18) or [my_18] as:
  • > 2* [my_18]
• Any past input expression can be retrieved as [my_18]
• The history file for a user can be retrieved as
  • history 153 20, which begins with value 20 and provides 153 values
  • In “history number offset”, if the offset is missing, then 10 is the default
• Past results of each user are archived in a table as:
  • username, seq#, datetime, input string, result, uid, subject
• The optional subject is set with {!subject = any name}

Units 5 - Optional Base Units
• Six new units are optionally available:
  • bit, b: a binary bit of information (1/0 or T/F)
  • person, p: a unit of a living human
  • dollar, usd, d: a unit of the US dollar
  • flop, op: a unit of one floating point operation
  • bn: baryon number (conserved
like charge)
  • ln: lepton number (conserved like charge)
• These allow a vast extension of the SI system in the economic and social sciences:
  • Population density: 45.2*p/hectare
  • Income: 34.7e3*d/(p*year)
  • Information flow rate: 4.2*giga*bit*s_1
  • Processing speed: 3.8*tera*flops
  • Where flops = flop/s

Metadata - 1
• Any MN can be multiplied by {var1=val1 | var2=val2|…}:
  • 6.4533e-4*kg*m_3*{lon=44*deg|lat=81*deg|depth=34.3*m}
  • The expression {…anything at all….} always evaluates to “1”
  • Thus {…} is a container for information attached to a MN
• {.} or {.|reason = div by zero} indicates a missing value
  • Input expressions containing missing values result in a missing value
• {?} or {?|possible read error for x1} indicates a questionable value
  • Computation proceeds normally but {?} is attached to the result
• Commands can be given in the form:
  • {!subject = value} to set a subject for a user
  • {!subject = } to remove a subject
  • {!subject = ?} to determine the current subject

MetaNumber Archived Values
• MetaNumbers are stored in tables on the default server:
  • The table dimension can be 0, 1, 2, …
  • The table, row, and column have unique names for MN values
• A value is accessed as:
  • [table name_ row name _ column name…]
  • This gives a path to indicate the exact location of the metanumber
  • Thus each archived MN has a unique name
  • For example: [e_gold_thermal conductivity] will be replaced by the thermal conductivity of gold from the elements table (e)
• MetaNumbers stored at any other location are indicated by:
  • [internet path_dir__table name_ row name _ column name…]
  • This allows the unique path name to uniquely name every MN
  • This expression […] can be used as a variable in any expression
  • Thus [e_lead_density]/[e_gold_density] gives the density ratio
  • 0.58822+/-0.00006

Documentation
• An input of:
  • # AnyString, is not processed as an expression
  • It is saved as a comment string as a result for that seq#
  • This allows one to document one's work
and explain the process
• expression ! units desired gives the result in units desired
  • > 16*ft + 38*inch - 1.2*m ! ft >> 15.23+/-0.33*(ft)
• Metric prefixes can be used in any order
  • > 456* billion*kilo* dozen*giga* million >> 5.472e+30
• A farmer has 382.6 acres and needed 1.8 additional inches of rain. How many gallons would this be?
  • > 382.6*acre*1.8*inch ! gallon >> (1.87+/-0.10)e+07*(gallon)
  • Here one simply multiplies the area times the height to get the volume. Units are automatically managed.
• What is the gravitational attraction between a 222.3 lb man and an 85.1 kg golf cart 6 ft away?
  • Use $F = G m_1 m_2 / d^2$ to get the force in newtons.
  • Note that the constant G is given by “g”
  • > g*222.3*lb*85.1*kg/((6*ft)**2)
  • >> (1.7124+/-0.0022)e-07*m*kg*s_2
• A sports car accelerates from 0 to 60 mph in 2.1 seconds. How many “g's” will this give (ag is acceleration due to gravity)?
  • > ((60*mile/hour)/(2.1*s)) ! ag
  • >> 1.30+/-0.06*(ag)

Working with MN Tables Suggests Something:
1. Think of MN tables as entities vs. properties (elements)
   1. Some entities are more like others because their properties are similar.
2. Can we create an entity network from the table?
   1. Create an “entity $C_{ij}$” that reflects this closeness
3. Then use our network clustering algorithm
   1. Seek clustering among the entities, even among properties
4. Let me describe this procedure
   1. First a few preliminaries…

Math Preliminaries
Vector Spaces, Lie Groups and Algebras, Markov Transformations
• Linear Vector Space (LVS): $|A\rangle + |B\rangle = |C\rangle$ and $a|A\rangle = |B\rangle$
• A basis $|i\rangle$ for the space gives all elements: $|A\rangle = a_i |i\rangle$
• LVS becomes more powerful with the following two products:
• Metric Space (MS): $A \cdot B = \sum_i A_i B_i = \text{real number} = |A|\,|B| \cos\theta$
• So we get the metrics of length and angle.
• Examples: regular space and the unitary scalar product

Lie Algebras & Lie Groups
• Lie Algebra (LA): $[L_i, L_j] = c_{ij}^{k} L_k$ and the Jacobi identity
• Lie Group: $G(t) = e^{tL} = 1 + tL + t^2 L^2/2! + \dots$
• LA Examples:
  • Rotation (invariant $\sum_i x_i^2$),
  • Translation $x_i' = x_i + a_i$,
  • Lorentz (invariant $c^2 t^2 - r^2$),
  • Poincare (Lorentz with translations),
  • Unitary $\sum_a \Psi_a^* \Psi_a = 1$,
  • General Linear $x_i' = G_{ij} x_j$,
  • Markov $\sum_i x_i' = \sum_i x_i$,
  • Scaling $x_i' = e^{\lambda_i} x_i$,
  • Heisenberg X, P, I.
• LA Representations: LA represented by matrices acting upon vectors in a metric space.

Information & Entropy
• Information is defined by probability:
  • If the probability is large at a place then we know where something is and thus have more information.
• Information (order) is the inverse of Entropy (disorder)
• Information is additive but probabilities are multiplicative.
  • Thus I is a log of P
• Shannon: $I = \sum_i P_i \log_2 P_i$
• Renyi: $I = \log_2 \left(2 \sum_i P_i^2\right)$
  • 1 / 0 state: $I = \log_2 2(1^2 + 0^2) = \log_2 2 = 1$
  • ½ / ½ state: $I = \log_2 2\left((1/2)^2 + (1/2)^2\right) = \log_2 1 = 0$
• Or generally: $S(\alpha) = \frac{1}{1-\alpha} \log_2 \left(\sum_i P_i^{\alpha}\right)$, which gives the Renyi entropy of order $\alpha$
• Information is measured in 1 and 0 “bits”

Markov Transformations
• Markov transformations have, for 100 years, described diffusion, e.g. a random walk.
• Markov transformations preserve the sum of the components of a vector
• The vector normally gives the probabilities for the components.
• Diffusion and entropy are described by Markov transformations
• $x' = Mx$ with $\sum_i x_i' = \sum_i x_i$
• These are motions over a plane perpendicular to the vector (1, 1, 1, …, 1) and in the positive quadrant.
• This follows from $\langle 1 | x \rangle = 1$ (is invariant)

Abelian Scaling Algebra A(n)

Decomposition of the General Linear Group
• GL(n,R) contains all the previous Lie algebras (LA) and groups:
• We discovered a new decomposition into:
  • a Markov Type ($MT(n^2 - n)$) LA
• A general element in the Lie Algebra is $L = a_{12} L_{12} + a_{21} L_{21}$

More Specifically:
• Markov Type LA constrained to Markov Transformations gives the Markov Monoid MM
• This connects Lie Algebras and Lie Groups with Diffusion, Random Walks, Entropy and Information
• Here we see the identity for $\lambda = 0$ and equilibrium with total diffusion at $\lambda = \infty$

The Markov Monoid
• In order to preserve the positive definiteness of the new components, it is necessary and sufficient that the $\lambda$ parameters are all non-negative.
• This removes the inverse of the Markov type transformation and leaves us with a Lie Monoid.
• Note that the diagonal elements are necessarily the exact negative of the sum of the off-diagonal elements in each column.
• The L basis matrices are also called Lagrange matrices.
• Thus we have linked Lie algebras and Lie groups with continuous Markov transformations
• This allows the power of each domain of mathematics to inform the other.

Networks

Networks are Isomorphic to Markov Monoids
• A network is a set of points called nodes numbered 1, 2, …, n
• which are connected with some ‘strength’ $C_{ij}$
• which is an n x n matrix of non-negative numbers.
• However the diagonal is missing, not zero: missing!
• There is no meaning for the connection of a thing to itself.
• Thus a network C will exactly define the Lie monoid generator
• with the off-diagonal elements, and conversely,
• if one defines the C diagonals as the negative sum of that column
• Thus every network C is isomorphic to a Markov monoid generator L
• and to a family of Markov transformations: $M(a) = \exp(aL)$.
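
The construction described above (network C, generator L, transformation M(a) = exp(a L)) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the author's implementation: the function names `markov_generator`, `matmul`, and `markov_transformation`, the three-node example matrix, and the choice of a truncated Taylor series with repeated squaring to approximate the exponential are all hypothetical choices made here. The missing diagonal of C is represented by `None`.

```python
def markov_generator(c):
    """Build the generator L from a connection matrix C.

    Off-diagonal entries are copied from C (the diagonal of C is never read).
    Each diagonal entry of L is the negative sum of the off-diagonal entries
    in its column, so every column of L sums to zero.
    """
    n = len(c)
    gen = [[float(c[i][j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        gen[j][j] = -sum(gen[i][j] for i in range(n) if i != j)
    return gen


def matmul(a, b):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_transformation(gen, a, squarings=10, terms=12):
    """Approximate M(a) = exp(a L) (illustrative numerical scheme).

    The exponent is divided by 2**squarings, a short Taylor series is summed
    for that small step, and the result is squared `squarings` times.
    """
    n = len(gen)
    scale = a / (2 ** squarings)
    step = [[scale * gen[i][j] for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds step**k / k! after this update
        term = [[x / k for x in row] for row in matmul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


# Hypothetical 3-node network; None marks the missing diagonal.
C = [[None, 2, 0],
     [1, None, 3],
     [0, 1, None]]
L = markov_generator(C)
M = markov_transformation(L, 0.5)
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

For this example each column of M sums to 1 up to rounding and no entry is negative, which is the Markov property the slides rely on. A library routine for the matrix exponential could replace the hand-written series; it is written out here only to keep the sketch dependency-free.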
A Static Network Generates Markov Flows and Provides a Model
• Now given any C we can construct the Markov Monoid L and generate the continuous Markov transformation $M = \exp(tL)$
• This generates a series of flows among the nodes in proportion to their network connectivity.
• The Markov transformation has columns which are nonnegative and sum to unity.
• Thus they support a definition of Renyi entropy for each node.
• The entropy values can be sorted to give order to the nodes
• Representing the topology uniquely by an entropy spectral curve.

Our Previous Results
• The isomorphism of every network to a Lie Markov monoid now integrates three branches of mathematics:
  • Lie algebras and groups, Markov theory, and networks
• We were able to show that the Renyi entropy spectral curve revealed aberrant changes in a network topology, thus showing changes over time (attacks, failures) and allowing a comparison of two different networks with a metric for the distance between them (scalar product…)

Expansion of the Topology as a Series
1. We seek to expand a network as a series of terms of decreasing importance.
2. We realized that each of the Renyi entropies was a higher power of the Markov column components
   1. These are all functionally independent.
3. Furthermore each entropy spectral curve was smaller than the preceding entropy curve.

Differences of Successive Orders of Renyi Entropies
• Each spectral curve is less than the previous,
• Thus differences between two successive orders become smaller very rapidly.
• Thus this sequence of curves constitutes a rapidly decreasing series of curves that contain all of the topological information of the network.
• It serves as the expansion that we sought: complete and rapidly decreasing.
• This expansion into differences between successive Renyi entropy sorted spectral curves provides a complete series expansion for the topology.

• The same process can be executed on the rows.
• There are $n(n-1)$ independent values in the network topology
• Since the columns and rows collectively provide $2n$ values per order, this requires $(n-1)/2$ orders of Renyi entropies.
• These curves can be used to monitor anomalous changes in a network:
  • Attacks, failures, and the differences between any two networks.
• It can also track successive differences over time in a single network and thus track its time evolution

Eigenvectors Contain the Cluster Information
• Consider the eigenvector / eigenvalue solutions for the Markov transformation
• The Markov transformation gives a model of the conserved flows.
• We realized that flows were faster within clusters in approach to equilibrium but slower to the remaining nodes which are weakly joined.
• We also realized that the eigenvectors described those combinations of nodes that approached equilibrium at the exact rate of the associated eigenvalue.
• Like normal modes of oscillation in physics.

• Thus the entire cluster analysis is displayed in the eigenvectors
• Each eigenvector is a linear combination of the component nodes
• Furthermore, the strength of the clustering is collectively measured by the magnitude of the eigenvalue.
• The eigenvalue gives an order and name to the clusters
• The exp(tL) expansion is Markovian for any number of terms
• The number of terms gives the degree of separation.
• The eigenvector structure (and thus the cluster information) is slightly different for the different numbers of terms used for the exp(tL) expansion.

Network Classification: A Beginning
• Thus the entire network topology is captured in both methods of network analysis:
1. The $(n-1)/2$ successively smaller nested Renyi entropy spectral curves contain all of the topology in increasingly less important terms.
   1. These measure the nth order entropy of incoming and outgoing connections for each node.
2.
The n eigenvectors of the Markov matrix, each with n components, sorted by the order of the eigenvalues as a metric of the degree of clustering, likewise provide a complete description of the topology.
   1. These measure the clustering structure in the network.

Applications to Simulated Networks
• We created a network with clusters within clusters by using larger values of the connection matrix in successive layers and scrambled the nodes.
• We computed the eigenvalues and eigenvectors, which instantly revealed the anticipated cluster structure.
• As there is no formal definition of clustering, there is no way to prove this hypothesis, but our observations confirm what one can intuitively see from the solution as well as from the underlying Markov model.

A network that has a cluster embedded within a cluster gives these eigenvalues:
• Eigenvalue: 0.882491186839: Included nodes: [0, 1, 2]
• Eigenvalue: 0.882491186839: Included nodes: [1, 2]
• Eigenvalue: 0.926438963526: Included nodes: [0, 1, 2, 3, 4]
• Eigenvalue: 0.926556991774: Included nodes: [0, 1, 2, 4]
• Eigenvalue: 0.955737589687: Included nodes: [5, 6, 7]
• Eigenvalue: 0.955934195065: Included nodes: [5, 7]
• Eigenvalue: 0.970549538208: Included nodes: [8, 9]
• Eigenvalue: 0.999846933025: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
• Eigenvalue: 0.999953415037: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
• Eigenvalue: 1.0: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Show C and its Eigenvectors: 10-Node Cluster Network: 3 clusters

Testing the procedure
• A threshold value is set to make it easier to see the strongest connections
• Every element below a certain percentage of the max value is set to 0
• This example uses a table of 103 elements with 21 physical properties
• The strongest 3% of the connections are shown (from strongest to weakest: blue, green, and red)

Generate Networks from MetaNumber Tables and Seek Clusters

Clustering of Tabular Data for Entities versus their Properties
•
Consider a standardized MetaNumber table
• Rows of entities, like elements; columns that numerically measure their properties
• Some entities are “closer to other entities” in the sense that their spectra of properties are more similar.

Calculating the elements of $C_{ij}$
• The mean and standard deviation of each property, such as density, are computed (all values of a property have the same units).
• Each property value is rewritten in terms of the number of standard deviations from the mean, thus giving a table $T_{ij}$
• One then defines a network C as
  • $C_{jk} = \exp\left(-\sum_i (T_{ij} - T_{ik})^2\right)$
• One then computes the eigenvalues and eigenvectors of
  • $M = \exp(tC)$ to identify clusters

Current Work
Create and Analyze the “Numerical Universe”
1. Build algorithm to standardize a CSV table automatically
2. Automatically scan web sites for numerical tables and normalize them as metanumber standardized tables.
   1. Identify the unique defining indices for rows and columns.
   2. Identify the units for each value and attach them to the value
   3. Execute the table normalization with means and std dev.
3. Compute the Renyi entropy spectral decompositions.
4. Perform Eigenvector/Eigenvalue analysis to identify clusters
5. Build a dashboard to monitor results
6. Archive the resulting tables for use with MN computations
7. Link clusters to form the ‘supernet’

Development Plan
1. Use MN with basic tables to do calculations
2. Explore Web Grid as a user tool with other APIs
3. Applications to instruction (AAPT presentation Oct 31, 2015)
4. Algorithm to take well formed tables into MN tables
   1. Begin with well formed tables such as financial data tab
   2. Use with E-books and Reference Materials
   3. Explore with Companies with their private data sources
   4. Explore with Government agencies for limited internal work
5. Algorithm to Scrape Web to Create MN Tables
   1. Push as a means to create MN on a large scale
6. Explore Supernet as a research tool
   1.
Along with cluster structures from tables

Our Appreciation to:
• Office of USC VP of Research
• for the ASPIRE I & II supporting grants

Faculty CoPIs - Phase I:
• Dr. Don Jordan, Mathematics
• Dr. William Hogue, USC CIO & VP
• Dr. Phil Moore, Director RCI
• Dr. Bob Mullen, Chair, Civil Engineering
• Dr. Paul Huray, Electrical Engineering
• Dr. Camelia Knapp, Geological Sciences and ESRI Director
• Dr. Tammi Richardson, Biology
• Dr. Dwayne Porter, Chair Public Health
• Dr. Geoff Scott, New Chair, Public Health
• Dr. Kendra Albright, Library and Information Science
• Dr. Susan Rathbun-Grubb, Library and Information Science
• Dr. Bert Ely, Biology & Director Center for Science Education

Faculty CoPIs - Phase II:
• Dr. William Hogue, USC VP and CIO
• Dr. Phil Moore, Director Research CyberInfrastructure (RCI)
• Dr. John Rose, Computer Science & Engineering
• Dr. Gabriel Terejanu, Computer Science & Engineering
• Dr. Francisco Blanco-Silva, Mathematics
• Dr. Kendra Albright, Library and Information Science

Student Associates
• Phase I
  • Christian Merchant
  • Jordan Bradshaw
  • Brian Flick
  • Ben Torkian
  • Jun Zhou
  • William Hannah
  • Ethan Anderson
  • Ryan Giannelli
  • Dan Ramage
  • Lisa Wicklifte
  • Ideen Ghorbani
• Phase II
  • Sara Chizari
  • Roy Myers
  • Samuel Watson
  • Nick Puig
  • Dylan Shields
  • Sourav Das

Thank you for your interest.
• www.metanumber.com
• www.asg.sc.edu
• Joseph E. Johnson, PhD
  • Distinguished Professor Emeritus
  • Department of Physics and Astronomy
  • Office: Room 405, Physical Science Center
  • Phone: 803-777-6431
  • University of South Carolina
  • Columbia, SC, 29208, USC

Web application powered by University of South Carolina's RCI Group
Directed by Dr. Phil Moore